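[Review & Introduction] The previous chapter mainly reviewed the basics: it introduced common data-analysis operations and examined the data from several angles. This part walks through the data-analysis workflow: data cleaning, feature processing, data restructuring, and data visualization. These steps lay the groundwork for the modeling and model evaluation that come at the end of the analysis.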

In [1]:
import numpy as np
import pandas as pd

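# Chapter 2: Data Cleaning and Feature Processing

The data we receive is usually not clean: it contains **missing values**, **outliers**, and similar problems, and it has to be processed before any further analysis or modeling. The first step after obtaining data is therefore data cleaning. In this chapter we cover missing values, duplicate values, strings, and data conversion, cleaning the data into a shape that can be analyzed or modeled.

## 2.1 Observing and Handling Missing Values

Real data often has many missing values. For example, the Cabin column contains NaN. Do other columns have missing values too, and how should they be handled?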



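### Task 1: Observe missing values

(1) Count the missing values in each feature

(2) Look at the data in the Age, Cabin, and Embarked columns

There is more than one way to do each of these, so try as many as you can.

For numeric data, pandas uses the floating-point value **NaN** (Not a Number) to represent missing values. **NaN** serves as a sentinel value that is easy to detect.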

In [64]:
path = r'data\train.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [7]:
# Method 1: info() reports the non-null count of each column
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [11]:
# Method 2: number of missing (null) values in each column
data.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
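
The raw counts above are often easier to compare as fractions of the column length. A minimal sketch on a small made-up frame (the `toy` DataFrame and its values are hypothetical and are not taken from train.csv):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", "Q"],
})

missing_count = toy.isna().sum()    # number of missing values per column
missing_ratio = toy.isna().mean()   # fraction of missing values per column

# Put the columns with the most missing values first
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
summary = summary.sort_values("count", ascending=False)
print(summary)
```

`isna()` and `isnull()` are aliases in pandas, so either spelling gives the same result.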

In [14]:
data[['Age', 'Cabin', 'Embarked']].head(8)

Unnamed: 0,Age,Cabin,Embarked
0,22.0,,S
1,38.0,C85,C
2,26.0,,S
3,35.0,C123,S
4,35.0,,S
5,,,Q
6,54.0,E46,S
7,2.0,,S


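### Task 2: Handle missing values

(1) What are the general approaches to handling missing values?

(2) Try handling the missing values in the Age column

(3) Try different methods that handle missing values across the whole table at once

Some examples follow: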

In [73]:
path = r'data\train.csv'
data = pd.read_csv(path)
print(data['Age'].isnull().sum())
# `== None` is False for every element, NaN included, so this mask selects
# no rows and the assignment changes nothing; the count stays at 177.
data[data['Age'] == None] = 0
print(data['Age'].isnull().sum())
data.head(6)

177
177


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q


In [66]:
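# Caution: a boolean row mask without a column label overwrites every column
# of the matching rows (see row 5 below). To change only Age, use
# data.loc[data['Age'].isnull(), 'Age'] = 0
data[data['Age'].isnull()] = 0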
data.head(6)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,0,0,0,0,0,0.0,0,0,0,0.0,0,0


In [102]:
path = r'data\train.csv'
data = pd.read_csv(path)
data[data['Age'] == np.nan] = 0  # No effect: == cannot detect np.nan
data.loc[5, 'Age'] == np.nan

False

In [98]:
type(np.nan)

float

In [67]:
data['Age'].isnull().sum()

0

In [74]:
(data['Age']==None).head(6)

0    False
1    False
2    False
3    False
4    False
5    False
Name: Age, dtype: bool

In [75]:
(data['Age'].isnull()).head(6)

0    False
1    False
2    False
3    False
4    False
5     True
Name: Age, dtype: bool

In [76]:
from numpy import nan

n = nan
n == None

False

In [77]:
type(np.nan)

float

In [78]:
type(None)

NoneType

Note: np.nan and Python's None are different objects of different types, so np.nan == None evaluates to False.

In [104]:
# Example:
string_data = pd.Series(['aardvark', 'artichoke', np.nan, 'avocado'])
string_data

0     aardvark
1    artichoke
2          NaN
3      avocado
dtype: object

In [83]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

In [84]:
string_data[0] = None  # Python's built-in None is also treated as NA in an object array
string_data.isnull()

0     True
1    False
2     True
3    False
dtype: bool

In [111]:
type(string_data.iloc[2])

float

In [115]:
string_data.iloc[2] == np.nan   # Wrong: == cannot detect NaN

False

Note: do not use == to test whether a value is np.nan. np.nan is a float that, under IEEE 754, compares unequal to every value, including itself. Use np.isnan() instead, as follows:

In [113]:
np.isnan(string_data.iloc[2])

True

In [114]:
np.isnan(data.loc[5, 'Age'])

True

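Summary: to check whether a DataFrame column contains NaN (missing values), use the pandas isnull() method, for example data['Age'].isnull() or pd.isnull(data['Age']). Do not use == None or == np.nan. For numeric columns, np.isnan(obj) also works. For example: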

In [120]:
np.isnan(data['Age'])

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

In [121]:
data['Age'].isnull()

0      False
1      False
2      False
3      False
4      False
       ...  
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool
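
The comparison rules discussed above can be checked in isolation. The sketch below uses a hypothetical object Series named `mixed`. It also illustrates that `np.isnan` is defined only for numeric input, which is one reason `pd.isna` / `.isnull()` is the more general check.

```python
import numpy as np
import pandas as pd

# NaN compares unequal to everything, including itself
nan_equals_itself = (np.nan == np.nan)
nan_differs_from_itself = (np.nan != np.nan)

# pd.isna treats both np.nan and None as missing
scalar_checks = [pd.isna(np.nan), pd.isna(None), pd.isna(0.0)]

# Hypothetical object Series mixing a string with both missing markers
mixed = pd.Series(["aardvark", np.nan, None], dtype=object)
mask = mixed.isna()

# np.isnan is a numeric ufunc and rejects object arrays
try:
    np.isnan(mixed.to_numpy())
    isnan_failed = False
except TypeError:
    isnan_failed = True

print(nan_equals_itself, nan_differs_from_itself, scalar_checks, isnan_failed)
```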

dropna(): by default, drops every row that contains a missing value. For example:

In [124]:
data = pd.DataFrame([[1., 6.5, 3.],
                    [1, np.nan, np.nan],
                    [np.nan, np.nan, np.nan],
                    [np.nan, 6.5, 3.]])
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [128]:
cleaned = data.dropna()       # Any row containing a missing value is dropped; by default the original object is unchanged and a new object is returned
data

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [127]:
cleaned 

Unnamed: 0,0,1,2
0,1.0,6.5,3.0
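
Dropping every row that has any missing value is only the default behavior. A sketch of a few other `dropna` options on the same 4×3 example frame, rebuilt here under the hypothetical name `frame` so that the block runs on its own:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame([[1., 6.5, 3.],
                      [1., np.nan, np.nan],
                      [np.nan, np.nan, np.nan],
                      [np.nan, 6.5, 3.]])

# Drop only the rows in which every value is missing
only_empty_rows_dropped = frame.dropna(how="all")

# Keep rows that have at least 2 non-missing values
at_least_two_values = frame.dropna(thresh=2)

# Only column 0 is required to be present
first_column_present = frame.dropna(subset=[0])

# Drop columns (instead of rows) that contain any missing value
no_missing_columns = frame.dropna(axis=1)
```

Because every column of this frame contains at least one NaN, the last call leaves no columns at all, which shows how aggressive column-wise dropping can be.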


In [132]:
path = r'data\train.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [134]:
data.dropna().head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
10,11,1,3,"Sandstrom, Miss. Marguerite Rut",female,4.0,1,1,PP 9549,16.7,G6,S
11,12,1,1,"Bonnell, Miss. Elizabeth",female,58.0,0,0,113783,26.55,C103,S


In [136]:
data.fillna(0).head(3) # Fill missing values with 0

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,0,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,0,S
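
Filling everything with 0 is rarely a good fit for a column such as Age. As one possible alternative for task (2), the sketch below fills a numeric column with its median and a text column with a placeholder. The toy frame and the placeholder string `"Unknown"` are illustrative assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [None, "C85", None, "C123", None],
})

age_median = toy["Age"].median()   # median of the non-missing ages

filled = toy.copy()
filled["Age"] = filled["Age"].fillna(age_median)
filled["Cabin"] = filled["Cabin"].fillna("Unknown")

# Per-column fill values can also be passed as a dict in a single call
filled_in_one_call = toy.fillna({"Age": age_median, "Cabin": "Unknown"})
```

Whether the mean, the median, or a model-based estimate is the better choice depends on the distribution of the column and on the downstream model.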


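## 2.2 Observing and Handling Duplicate Values

For one reason or another, could the data contain duplicate values? If so, how should they be handled?


### Task 1: Look for duplicate values in the data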

In [137]:
path = r'data\train.csv'
data = pd.read_csv(path)
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [142]:
data[data.duplicated()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked


In [143]:
# Example:
df = pd.DataFrame({
    'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
    'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
    'rating': [4, 4, 3.5, 15, 5]
})
df

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
1,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0


In [144]:
df.duplicated()    # True for a row whose values in all columns repeat an earlier row

0    False
1     True
2    False
3    False
4    False
dtype: bool

In [146]:
df[df.duplicated()] # Select the duplicated rows

Unnamed: 0,brand,style,rating
1,Yum Yum,cup,4.0


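### Task 2: Handle duplicate values

(1) What are the ways to handle duplicate values?

(2) Handle the duplicate values in our data

The more methods, the better.

The following is an example of removing rows that are duplicated in their entirety: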

In [148]:
df = df.drop_duplicates() # Drop the duplicate row whose original index is 1
df.head() 

Unnamed: 0,brand,style,rating
0,Yum Yum,cup,4.0
2,Indomie,cup,3.5
3,Indomie,pack,15.0
4,Indomie,pack,5.0
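
Whole-row duplicates are only one case. The sketch below rebuilds the noodle example under the hypothetical name `noodles` and shows a few other options of `duplicated` and `drop_duplicates`: marking every copy, comparing only some columns, and renumbering the index.

```python
import pandas as pd

noodles = pd.DataFrame({
    "brand": ["Yum Yum", "Yum Yum", "Indomie", "Indomie", "Indomie"],
    "style": ["cup", "cup", "cup", "pack", "pack"],
    "rating": [4, 4, 3.5, 15, 5],
})

# Mark every member of a duplicated group, not only the later copies
all_copies = noodles[noodles.duplicated(keep=False)]

# Treat rows as duplicates when brand and style match, keeping the last one
by_brand_style = noodles.drop_duplicates(subset=["brand", "style"], keep="last")

# Drop whole-row duplicates and renumber the index from 0
renumbered = noodles.drop_duplicates(ignore_index=True)
```

Choosing `subset` and `keep` is a modeling decision: here the two Indomie pack rows differ in rating, so dropping one of them discards information.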


In [152]:
data.drop_duplicates()    # There are no duplicate rows, so the result is unchanged
data.shape 

(891, 12)

### Save the cleaned data in CSV format

In [154]:
data.to_csv(r'data\test_clear.csv', index=None) # Do not write the index as a column in the file

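## 2.3 Observing and Processing Features

Looking at the features, we can roughly divide them into two groups:
Numeric features: Survived, Pclass, Age, SibSp, Parch, Fare. Of these, Survived and Pclass are discrete numeric features, while Age, SibSp, Parch, and Fare are continuous numeric features.
Text features: Name, Sex, Cabin, Embarked, Ticket. Of these, Sex, Cabin, Embarked, and Ticket are categorical text features.

Numeric features can generally be used for model training directly, although continuous variables are sometimes discretized to make a model more stable and robust. Text features usually have to be converted to numeric features before they can be used for modeling.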
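
The numeric/text split described above can also be derived from the column dtypes. A minimal sketch with `select_dtypes`, on a hypothetical four-column frame that mimics part of the Titanic schema:

```python
import pandas as pd

toy = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "S"],
})

# Columns with an integer or floating-point dtype
numeric_cols = toy.select_dtypes(include="number").columns.tolist()

# Everything else (here: the text columns)
text_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols, text_cols)
```

The dtype only says how a column is stored. Whether a numeric column such as Pclass should be treated as a category is still a judgment call.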

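### Task 1: Bin (discretize) the ages

(1) What is binning?

(2) Split the continuous variable Age into 5 equal-width age bands, labeled with the categorical values 1 to 5

(3) Split the continuous variable Age into the five bands (0,5] (5,15] (15,30] (30,50] (50,80], labeled with the categorical values 1 to 5

(4) Split the continuous variable Age into five bands at the 10%, 30%, 50%, 70%, and 90% quantiles, labeled with the categorical values 1 to 5

(5) Save each of the resulting datasets in CSV format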

In [155]:
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [157]:
# Split the continuous variable Age into 5 equal-width age bands, labeled 1 to 5

data['AgeBand'] = pd.cut(data['Age'], 5, labels = [1,2,3,4,5])
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBand
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,2
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3
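
To see which intervals an equal-width `pd.cut` actually uses, the bin edges can be returned along with the labels. A sketch on a hypothetical series of seven ages, using the `retbins` parameter:

```python
import pandas as pd

ages = pd.Series([1, 15, 30, 45, 60, 75, 99], dtype=float)

# retbins=True returns the computed bin edges as a second value
bands, edges = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5], retbins=True)

print(edges)           # 6 edges delimit 5 equal-width bins
print(bands.tolist())  # band label assigned to each age
```

With an integer bin count, pandas spaces the edges evenly between the minimum and maximum and moves the lowest edge slightly below the minimum so that the smallest value falls inside the first bin.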


In [162]:
# Example:
df['addcolumn'] = [5, 6, 7, 8] # Add a column to a DataFrame
df

Unnamed: 0,brand,style,rating,addcolumn
0,Yum Yum,cup,4.0,5
2,Indomie,cup,3.5,6
3,Indomie,pack,15.0,7
4,Indomie,pack,5.0,8


In [170]:
data.to_csv(r'data\test_ave.csv', index=None)

In [176]:
# Split the continuous variable Age into the five bands (0,5] (5,15] (15,30] (30,50] (50,80], labeled 1 to 5

data['AgeBand'] = pd.cut(data['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])
data.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBand
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,4
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3


In [179]:
data.to_csv(r'data\test_cut.csv', index=None)

In [180]:
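# Split the continuous variable Age into five bands at the 10%, 30%, 50%, 70%, and 90% quantiles, labeled 1 to 5
# Note: the last edge is the 90% quantile, so ages above it receive NaN in AgeBand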

data['AgeBand'] = pd.qcut(data['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,AgeBand
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,5
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,4
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,4


In [182]:
data.to_csv(r'data\test_pr.csv', index=None)
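
A quantile list that stops at 0.9 leaves the top 10% of values outside every bin. The effect is easy to see on a hypothetical series of the ages 1 to 100. The six-band variant is shown only for comparison and is not part of the original task.

```python
import pandas as pd

ages = pd.Series(range(1, 101), dtype=float)

# Edges stop at the 90% quantile: values above it fall outside every bin
truncated = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9], labels=[1, 2, 3, 4, 5])
n_unbinned = int(truncated.isna().sum())

# Adding 1.0 as the final edge covers the whole range, at the cost of a sixth band
full = pd.qcut(ages, [0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0], labels=[1, 2, 3, 4, 5, 6])

print(n_unbinned, int(full.isna().sum()))
```

If exactly five bands are required, another option is to pick five-interval quantile edges that end at 1.0, for example `[0, 0.2, 0.4, 0.6, 0.8, 1.0]`.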

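### Task 2: Convert text variables

(1) List the text variables and their categories

(2) Represent the text variables Sex, Cabin, and Embarked with numeric values such as 1, 2, 3, 4, 5

(3) Represent the text variables Sex, Cabin, and Embarked with one-hot encoding

The more methods, the better.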

In [184]:
# List the categorical text variables and their categories

# Method 1: value_counts
data['Sex'].value_counts()

male      577
female    314
Name: Sex, dtype: int64
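
Besides `value_counts`, the categories can be listed with `unique` and counted with `nunique`. The sketch below also shows one possible approach to tasks (2) and (3): an explicit mapping for integer codes and `pd.get_dummies` for one-hot encoding. The toy frame and the particular 1/2 code assignment are assumptions made for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", None, "Q"],
})

sex_values = toy["Sex"].unique().tolist()   # distinct values in order of appearance
n_embarked = toy["Embarked"].nunique()      # number of distinct non-missing values

# Integer codes via an explicit mapping (the 1/2 assignment is arbitrary)
toy["Sex_num"] = toy["Sex"].map({"male": 1, "female": 2})

# One-hot encoding; a missing Embarked value gets all-zero indicator columns
one_hot = pd.get_dummies(toy["Embarked"], prefix="Embarked")

print(sex_values, n_embarked)
print(one_hot)
```

An explicit mapping keeps the codes stable across datasets, whereas one-hot encoding avoids implying an order between categories.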