# Project: Analyzing Iris Species Data

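## Analysis Goal

This report uses attribute data for iris flowers to test whether the mean sepal and petal lengths and widths differ significantly between two iris species, so that we can draw inferences about the characteristic attributes of each species.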

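## Introduction

The raw data file `Iris.csv` covers two iris species with 50 samples each. For every sample it records the sepal length and width and the petal length and width.

The columns of `Iris.csv` are:
- Id: the sample ID.
- SepalLengthCm: sepal length, in centimeters.
- SepalWidthCm: sepal width, in centimeters.
- PetalLengthCm: petal length, in centimeters.
- PetalWidthCm: petal width, in centimeters.
- Species: the iris species.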

## Importing the Data

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from statsmodels.stats.weightstats import ztest
from scipy.stats import ttest_ind
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

In [2]:
Iris_original_data = pd.read_csv("./Iris.csv")
Iris_original_data

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
95,96,5.7,3.0,4.2,1.2,Iris-versicolor
96,97,5.7,2.9,4.2,1.3,Iris-versicolor
97,98,6.2,2.9,4.3,1.3,Iris-versicolor
98,99,5.1,2.5,3.0,1.1,Iris-versicolor


## Data Cleaning

### Checking for Structural Issues

In [3]:
Iris_clean_data = Iris_original_data.copy()
Iris_clean_data

,Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
0,1,5.1,3.5,1.4,0.2,Iris-setosa
1,2,4.9,3.0,1.4,0.2,Iris-setosa
2,3,4.7,3.2,1.3,0.2,Iris-setosa
3,4,4.6,3.1,1.5,0.2,Iris-setosa
4,5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...,...
95,96,5.7,3.0,4.2,1.2,Iris-versicolor
96,97,5.7,2.9,4.2,1.3,Iris-versicolor
97,98,6.2,2.9,4.3,1.3,Iris-versicolor
98,99,5.1,2.5,3.0,1.1,Iris-versicolor


Set the `Id` column as the index.

In [4]:
Iris_clean_data = Iris_clean_data.set_index("Id")
Iris_clean_data

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
1,5.1,3.5,1.4,0.2,Iris-setosa
2,4.9,3.0,1.4,0.2,Iris-setosa
3,4.7,3.2,1.3,0.2,Iris-setosa
4,4.6,3.1,1.5,0.2,Iris-setosa
5,5.0,3.6,1.4,0.2,Iris-setosa
...,...,...,...,...,...
96,5.7,3.0,4.2,1.2,Iris-versicolor
97,5.7,2.9,4.2,1.3,Iris-versicolor
98,6.2,2.9,4.3,1.3,Iris-versicolor
99,5.1,2.5,3.0,1.1,Iris-versicolor


### Checking the Data Content

In [5]:
Iris_clean_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 1 to 100
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   SepalLengthCm  100 non-null    float64
 1   SepalWidthCm   100 non-null    float64
 2   PetalLengthCm  100 non-null    float64
 3   PetalWidthCm   100 non-null    float64
 4   Species        100 non-null    object 
dtypes: float64(4), object(1)
memory usage: 4.7+ KB


Convert the `Species` column to the `category` dtype.

In [7]:
Iris_clean_data["Species"] = Iris_clean_data["Species"].astype("category")

In [8]:
Iris_clean_data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 100 entries, 1 to 100
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype   
---  ------         --------------  -----   
 0   SepalLengthCm  100 non-null    float64 
 1   SepalWidthCm   100 non-null    float64 
 2   PetalLengthCm  100 non-null    float64 
 3   PetalWidthCm   100 non-null    float64 
 4   Species        100 non-null    category
dtypes: category(1), float64(4)
memory usage: 4.1 KB
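
Beyond `info()`, a content check often also counts missing values and duplicated rows explicitly. The following is a minimal sketch of that idea, not a step taken in this notebook. The `demo` DataFrame is a hypothetical stand-in built from a few rows shown earlier, so the snippet runs without `Iris.csv`.

```python
import pandas as pd

# Hypothetical stand-in for Iris_clean_data: four rows copied from the tables above.
demo = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 5.7, 6.2],
        "SepalWidthCm": [3.5, 3.0, 3.0, 2.9],
        "PetalLengthCm": [1.4, 1.4, 4.2, 4.3],
        "PetalWidthCm": [0.2, 0.2, 1.2, 1.3],
        "Species": ["Iris-setosa", "Iris-setosa", "Iris-versicolor", "Iris-versicolor"],
    },
    index=pd.Index([1, 2, 96, 98], name="Id"),
)

# Number of NaN values in each column.
missing_per_column = demo.isna().sum()
# Number of rows that repeat an earlier row in every column.
duplicate_rows = int(demo.duplicated().sum())

print(missing_per_column)
print("duplicate rows:", duplicate_rows)
```

On the real data, the same two calls could be applied to `Iris_clean_data` directly.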


## Data Analysis

In [22]:
Iris_clean_data.sample(10)

Id,SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
93,5.8,2.6,4.0,1.2,Iris-versicolor
68,5.8,2.7,4.1,1.0,Iris-versicolor
53,6.9,3.1,4.9,1.5,Iris-versicolor
14,4.3,3.0,1.1,0.1,Iris-setosa
89,5.6,3.0,4.1,1.3,Iris-versicolor
27,5.0,3.4,1.6,0.4,Iris-setosa
12,4.8,3.4,1.6,0.2,Iris-setosa
8,5.0,3.4,1.5,0.2,Iris-setosa
52,6.4,3.2,4.5,1.5,Iris-versicolor
56,5.7,2.8,4.5,1.3,Iris-versicolor


In [11]:
Iris_clean_data["Species"].value_counts()

Species
Iris-setosa        50
Iris-versicolor    50
Name: count, dtype: int64

Extract the numeric columns for `Iris-setosa` and `Iris-versicolor` separately, to serve as the two samples being compared.

In [16]:
# Split the cleaned data by species
setosa_data = Iris_clean_data[Iris_clean_data["Species"] == "Iris-setosa"]
versicolor_data = Iris_clean_data[Iris_clean_data["Species"] == "Iris-versicolor"]
# Samples for Iris-setosa
setosa_data_SepalLength = setosa_data["SepalLengthCm"]
setosa_data_SepalWidth = setosa_data["SepalWidthCm"]
setosa_data_PetalLength = setosa_data["PetalLengthCm"]
setosa_data_PetalWidth = setosa_data["PetalWidthCm"]
# Samples for Iris-versicolor
versicolor_data_SepalLength = versicolor_data["SepalLengthCm"]
versicolor_data_SepalWidth = versicolor_data["SepalWidthCm"]
versicolor_data_PetalLength = versicolor_data["PetalLengthCm"]
versicolor_data_PetalWidth = versicolor_data["PetalWidthCm"]
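
Before running any test, it can help to look at the per-species means side by side. A minimal sketch using `groupby` follows. The `demo` DataFrame is a hypothetical miniature built from six rows shown earlier, not the full dataset.

```python
import pandas as pd

# Hypothetical miniature of the cleaned data: Id 1-3 (setosa) and Id 96-98 (versicolor).
demo = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 4.7, 5.7, 5.7, 6.2],
        "SepalWidthCm": [3.5, 3.0, 3.2, 3.0, 2.9, 2.9],
        "PetalLengthCm": [1.4, 1.4, 1.3, 4.2, 4.2, 4.3],
        "PetalWidthCm": [0.2, 0.2, 0.2, 1.2, 1.3, 1.3],
        "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3,
    },
    index=pd.Index([1, 2, 3, 96, 97, 98], name="Id"),
)

numeric_columns = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]

# One row per species, one column per attribute, each cell holding the group mean.
species_means = demo.groupby("Species")[numeric_columns].mean()
print(species_means)
```

The resulting table only describes the samples. Whether the differences are statistically significant is what the hypothesis tests address.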

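H0: for a given attribute, the two species have the same mean.  
H1: for a given attribute, the two species have different means.  
Test type: two-tailed.  
Significance level: 0.05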
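
To make the test concrete, the sketch below computes a two-sample t statistic and two-tailed p-value by hand, assuming equal variances (the default behavior of `scipy.stats.ttest_ind`), and checks the result against SciPy. The input arrays are only the sepal lengths from the handful of rows displayed earlier, so the numbers are illustrative and do not match the full-data results.

```python
import numpy as np
from scipy import stats

# Illustrative values only: sepal lengths copied from rows shown earlier.
group_a = np.array([5.1, 4.9, 4.7, 4.6, 5.0])  # Iris-setosa, Id 1-5
group_b = np.array([5.7, 5.7, 6.2, 5.1])       # Iris-versicolor, Id 96-99

n_a, n_b = len(group_a), len(group_b)
df = n_a + n_b - 2  # degrees of freedom for the pooled-variance test

# Pooled variance: weighted average of the two sample variances.
pooled_var = ((n_a - 1) * group_a.var(ddof=1) + (n_b - 1) * group_b.var(ddof=1)) / df

# t statistic: difference of means divided by its standard error.
t_manual = (group_a.mean() - group_b.mean()) / np.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Two-tailed p-value: probability of a |t| at least this large under H0.
p_manual = 2 * stats.t.sf(abs(t_manual), df)

t_scipy, p_scipy = stats.ttest_ind(group_a, group_b)
print(t_manual, p_manual)
print(t_scipy, p_scipy)
```

H0 is rejected at the 0.05 level when the p-value falls below 0.05.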

In [20]:
alpha = 0.05
# Reject H0 (the means differ significantly) when the p-value is below alpha.
# Sepal length
SepalLength_t_stat, SepalLength_p_value = ttest_ind(setosa_data_SepalLength, versicolor_data_SepalLength)
print(SepalLength_t_stat, SepalLength_p_value)
if SepalLength_p_value < alpha:
    print("Sepal length differs significantly")
else:
    print("Sepal length does not differ significantly")
# Sepal width
SepalWidth_t_stat, SepalWidth_p_value = ttest_ind(setosa_data_SepalWidth, versicolor_data_SepalWidth)
print(SepalWidth_t_stat, SepalWidth_p_value)
if SepalWidth_p_value < alpha:
    print("Sepal width differs significantly")
else:
    print("Sepal width does not differ significantly")
# Petal length
PetalLength_t_stat, PetalLength_p_value = ttest_ind(setosa_data_PetalLength, versicolor_data_PetalLength)
print(PetalLength_t_stat, PetalLength_p_value)
if PetalLength_p_value < alpha:
    print("Petal length differs significantly")
else:
    print("Petal length does not differ significantly")
# Petal width
PetalWidth_t_stat, PetalWidth_p_value = ttest_ind(setosa_data_PetalWidth, versicolor_data_PetalWidth)
print(PetalWidth_t_stat, PetalWidth_p_value)
if PetalWidth_p_value < alpha:
    print("Petal width differs significantly")
else:
    print("Petal width does not differ significantly")

-10.52098626754911 8.985235037487079e-18
Sepal length differs significantly
9.282772555558111 4.362239016010214e-15
Sepal width differs significantly
-39.46866259397272 5.717463758170621e-62
Petal length differs significantly
-34.01237858829048 4.589080615710866e-56
Petal width differs significantly
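
The four per-attribute comparisons repeat the same steps, so they could be folded into a loop. The sketch below is one way to do that, rejecting H0 when the p-value is below alpha. The helper `compare_species` and the `demo` DataFrame are hypothetical: `demo` holds only the nine rows displayed earlier, so its statistics differ from the full-data output.

```python
import pandas as pd
from scipy.stats import ttest_ind

ALPHA = 0.05

def compare_species(df, column, first, second, alpha=ALPHA):
    """Hypothetical helper: two-tailed t-test on one column between two species."""
    a = df.loc[df["Species"] == first, column]
    b = df.loc[df["Species"] == second, column]
    t_stat, p_value = ttest_ind(a, b)
    # The difference is significant when the p-value is below alpha.
    return float(t_stat), float(p_value), bool(p_value < alpha)

# Hypothetical miniature of the cleaned data: Id 1-5 (setosa) and Id 96-99 (versicolor).
demo = pd.DataFrame(
    {
        "SepalLengthCm": [5.1, 4.9, 4.7, 4.6, 5.0, 5.7, 5.7, 6.2, 5.1],
        "SepalWidthCm": [3.5, 3.0, 3.2, 3.1, 3.6, 3.0, 2.9, 2.9, 2.5],
        "PetalLengthCm": [1.4, 1.4, 1.3, 1.5, 1.4, 4.2, 4.2, 4.3, 3.0],
        "PetalWidthCm": [0.2, 0.2, 0.2, 0.2, 0.2, 1.2, 1.3, 1.3, 1.1],
        "Species": ["Iris-setosa"] * 5 + ["Iris-versicolor"] * 4,
    },
    index=pd.Index([1, 2, 3, 4, 5, 96, 97, 98, 99], name="Id"),
)

results = {}
for column in ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]:
    results[column] = compare_species(demo, column, "Iris-setosa", "Iris-versicolor")
    t_stat, p_value, significant = results[column]
    verdict = "differs significantly" if significant else "does not differ significantly"
    print(f"{column}: t={t_stat:.3f}, p={p_value:.3g}, {verdict}")
```

Keeping the decision rule in a single place makes it harder for one branch or label to drift out of sync with the others.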
