# Data Preprocessing Methods

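Common preprocessing methods:
- Missing-value handling: real-world data often has missing values for various reasons; deletion or imputation is used to obtain a complete data subset.
- Outlier detection and handling: identify samples that deviate markedly from the rest of the dataset, so that the analysis works with high-quality data.
- Standardization: many machine learning algorithms expect standardized input features; if the features differ greatly in scale, similarity measures between samples will be biased.
- Feature encoding: model inputs usually need to be numeric, so non-numeric features must be converted to numeric ones.
- Discretization: reduce the number of distinct values as far as possible while losing as little information as possible.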

![Relevant classes in sklearn](./img/img.png)


In [74]:
import numpy as np
import pandas as pd

# Imputer has been removed from sklearn.preprocessing; use SimpleImputer instead
# from sklearn.preprocessing import Imputer

from sklearn.impute import SimpleImputer
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import Binarizer
from sklearn.cluster import KMeans 

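"teenager_sns" is a social-network dataset of 30,000 U.S. high school students. Each sample has 40 variables: gradyear, gender, age, and friends give the student's graduation year, gender, age, and number of friends; the remaining 36 variables correspond to 36 words that reflect the students' five main areas of interest.
"accord_sedan_testing" is a used-car dataset containing each car's price, mileage, model year, trim level, number of engine cylinders, transmission type, and so on.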

In [75]:
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
print(teenager_sns.shape)
teenager_sns.head(100)

(30000, 40)


Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,swimming,...,blonde,mall,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs
0,2006,M,18.980,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,2006,F,18.801,0,0,1,0,0,0,0,...,0,1,0,0,0,0,0,0,0,0
2,2006,M,18.335,69,0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
3,2006,F,18.875,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,2006,,18.995,10,0,0,0,0,0,0,...,0,0,2,0,0,0,0,0,1,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2006,F,18.396,69,0,0,1,0,0,0,...,0,0,0,0,0,0,0,0,0,0
96,2006,F,18.261,20,12,3,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
97,2006,F,,13,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
98,2006,M,18.730,52,0,0,0,0,4,0,...,0,0,0,0,1,1,0,0,0,0


In [76]:
print(teenager_sns.info()) 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 30000 entries, 0 to 29999
Data columns (total 40 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   gradyear      30000 non-null  int64  
 1   gender        27276 non-null  object 
 2   age           24914 non-null  float64
 3   friends       30000 non-null  int64  
 4   basketball    30000 non-null  int64  
 5   football      30000 non-null  int64  
 6   soccer        30000 non-null  int64  
 7   softball      30000 non-null  int64  
 8   volleyball    30000 non-null  int64  
 9   swimming      30000 non-null  int64  
 10  cheerleading  30000 non-null  int64  
 11  baseball      30000 non-null  int64  
 12  tennis        30000 non-null  int64  
 13  sports        30000 non-null  int64  
 14  cute          30000 non-null  int64  
 15  sex           30000 non-null  int64  
 16  sexy          30000 non-null  int64  
 17  hot           30000 non-null  int64  
 18  kissed        30000 non-nu

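## 1. Missing-value handling
Inspect the dataset's basic information (the `info()` output above): the `gender` and `age` columns contain missing values.
<br>
Fill strategies:
- mean
- median
- most_frequent (the mode)
- constant (a fixed value)<br>
<br>
#### Use sklearn's SimpleImputer to fill the "age" column of the "teenager_sns" dataset with the mean ("mean" strategy)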

In [77]:
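# Basic example: fill with the mean
# fit() learns each column's mean from the training data; transform() then fills new data with those means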
import numpy as np
from sklearn.impute import SimpleImputer
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
imp_mean.fit([[7, 2, 3], [4, np.nan, 6], [10, 5, 9]])
X = [[np.nan, 2, 3], [4, np.nan, 6], [10, np.nan, 9]]
print(imp_mean.transform(X))

[[ 7.   2.   3. ]
 [ 4.   3.5  6. ]
 [10.   3.5  9. ]]
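The four fill strategies listed earlier can be compared side by side on a single toy column. This is only a minimal sketch: the data and the `fill_value` of 0.0 are made up for illustration and do not come from `teenager_sns`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy column with one missing value (illustrative data, not from teenager_sns).
col = np.array([[1.0], [2.0], [2.0], [np.nan], [7.0]])

filled = {}
for strategy in ("mean", "median", "most_frequent", "constant"):
    # fill_value is only used by the "constant" strategy (0.0 is an arbitrary choice).
    kwargs = {"fill_value": 0.0} if strategy == "constant" else {}
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy, **kwargs)
    # Row 3 held the missing value, so this is the value that was filled in.
    filled[strategy] = imputer.fit_transform(col)[3, 0]
    print(strategy, filled[strategy])
```

With the observed values 1, 2, 2, and 7, the mean fill is 3.0, the median and mode fills are both 2.0, and the constant fill is whatever `fill_value` specifies.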


In [78]:
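imp_mean=SimpleImputer(missing_values=np.nan, strategy='mean',copy=False) # use np.nan here; the string 'NaN' does not work (the np.NaN alias was removed in NumPy 2.0)
imp_mean.fit(teenager_sns[['age']])
teenager_sns['age_imputed']=imp_mean.transform(teenager_sns[['age']]).ravel()
# Show the rows where age was missing, alongside the filled values
teenager_sns[teenager_sns['age'].isnull()].head(100)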

Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,swimming,...,mall,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs,age_imputed
5,2006,F,,142,0,0,0,0,0,0,...,0,1,0,0,0,0,0,1,0,17.993949
13,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
15,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
16,2006,,,135,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
26,2006,F,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
757,2006,F,,44,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
776,2006,,,7,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
781,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949
800,2006,M,,13,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,17.993949


In [79]:
# Missing-value handling for gender: fill with the most frequent value
imp_most_frequent=SimpleImputer(missing_values=np.nan, strategy='most_frequent',copy=False)
imp_most_frequent.fit(teenager_sns[['gender']])
teenager_sns['gender_imputed']=imp_most_frequent.transform(teenager_sns[['gender']]).ravel()

# print(type(teenager_sns['gender'].isnull())) #series
# print(teenager_sns['gender'].isnull())

teenager_sns[teenager_sns['gender'].isnull()].head(100)

Unnamed: 0,gradyear,gender,age,friends,basketball,football,soccer,softball,volleyball,swimming,...,shopping,clothes,hollister,abercrombie,die,death,drunk,drugs,age_imputed,gender_imputed
4,2006,,18.995,10,0,0,0,0,0,0,...,2,0,0,0,0,0,1,1,18.995000,F
13,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,17.993949,F
15,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,17.993949,F
16,2006,,,135,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,17.993949,F
41,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,17.993949,F
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1109,2006,,18.932,28,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,18.932000,F
1121,2006,,14.333,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,14.333000,F
1142,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,17.993949,F
1145,2006,,,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,17.993949,F
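Besides filling, the deletion method mentioned at the start of the notebook simply discards incomplete rows. A minimal sketch with pandas follows; the small DataFrame is a hypothetical stand-in for `teenager_sns`, reusing only its column names.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of teenager_sns with the same two incomplete columns.
df = pd.DataFrame({
    "gender": ["M", "F", np.nan, "F", np.nan],
    "age": [18.98, np.nan, 18.995, 18.875, np.nan],
    "friends": [7, 0, 10, 0, 135],
})

# Count missing values per column before deciding what to drop.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Deletion method: keep only the rows where both gender and age are present.
complete = df.dropna(subset=["gender", "age"])
print(complete)
```

Deletion is simple but shrinks the dataset. In this toy frame only two of five rows survive, which is why imputation is often preferred when many rows are incomplete.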


In [80]:
import os 
print(os.getcwd())

/Users/donga5/aCode/MachineLearning/main/DataProcessing


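## 2. Outlier detection and handling
(See the slides for details.)

There are two families of methods. The first is statistics-based, for example the box plot.

The second is neighbor-based, mainly the LOF (Local Outlier Factor) algorithm.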
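The statistics-based approach can be sketched with the usual box-plot rule, which flags values more than 1.5 times the interquartile range beyond the quartiles. The prices below are invented for illustration, and the 1.5 multiplier is the conventional box-plot whisker length rather than something this notebook specifies.

```python
import numpy as np

# Illustrative prices; the last value is deliberately extreme.
prices = np.array([9000, 9500, 10000, 10500, 11000, 11500, 12000, 30000])

# Quartiles and interquartile range (IQR).
q1, q3 = np.percentile(prices, [25, 75])
iqr = q3 - q1

# Box-plot fences: anything outside [lower, upper] is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = prices[(prices < lower) | (prices > upper)]

print(lower, upper, outliers)
```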

Example: use scikit-learn's LocalOutlierFactor class to detect outliers in the "accord_sedan_testing" dataset.


In [81]:
auto_test=pd.read_csv('../../dataset/accord_sedan_testing.csv')
auto_test.head()

Unnamed: 0,price,mileage,year,trim,engine,transmission
0,12995,68265,2006,ex,4 Cyl,Automatic
1,9690,92778,2006,ex,4 Cyl,Automatic
2,8995,136000,2006,ex,4 Cyl,Automatic
3,11995,72765,2006,lx,6 Cyl,Automatic
4,17999,36448,2006,ex,6 Cyl,Automatic


In [82]:
from sklearn.neighbors import LocalOutlierFactor
lof=LocalOutlierFactor()
lof.fit(auto_test[['price','mileage']]) # detect outliers using price and mileage
auto_test['LOF']=-lof.negative_outlier_factor_ # negate so that larger values mean more anomalous
auto_test[auto_test['LOF']>1.5]

Unnamed: 0,price,mileage,year,trim,engine,transmission,LOF
4,17999,36448,2006,ex,6 Cyl,Automatic,1.534739
52,14399,22110,2006,lx,4 Cyl,Automatic,2.235552


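## 3. Data standardization
- Standardization rescales data proportionally so that it falls within a small, specific range. sklearn provides the StandardScaler, MinMaxScaler, and RobustScaler classes for this.

#### Common methods:
#### z-score: removes units of measurement so that features on different scales become comparable. z-score standardization suits cases where a feature's maximum or minimum is unknown, or where the samples are widely dispersed.
   - It shifts and rescales the data; being a linear transform, it does not change the shape of the distribution
   - <font color="red">Note that after standardization the mean is 0 and the variance is 1; the values are not confined to [0,1]</font>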
    
![z-score](./img/img_1.png)

---

#### min-max: rescales the data proportionally into a specific interval, usually [0, 1] or [-1, 1]. It does not change the shape of the distribution.

![min-max](./img/img_2.png)

----

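#### robust scaling: subtracts the median and scales the data by a quantile range (the IQR, interquartile range, by default). It is insensitive to outliers and suits data that contains them.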

![robust scaling](./img/img_3.png)
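The three scaling formulas can be checked by hand against sklearn. The sketch below uses one made-up column with an extreme value and assumes the default settings of each scaler (population standard deviation for `StandardScaler`, the 25th to 75th percentile range for `RobustScaler`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# One toy feature column with an extreme value (illustrative data).
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

# z-score: subtract the mean, divide by the population standard deviation.
z_manual = (x - x.mean(axis=0)) / x.std(axis=0)

# min-max: map the minimum to 0 and the maximum to 1.
mm_manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))

# robust: subtract the median, divide by the interquartile range.
q1, q3 = np.percentile(x, [25, 75], axis=0)
rb_manual = (x - np.median(x, axis=0)) / (q3 - q1)

# Each comparison checks the manual formula against the sklearn scaler.
z_ok = np.allclose(z_manual, StandardScaler().fit_transform(x))
mm_ok = np.allclose(mm_manual, MinMaxScaler().fit_transform(x))
rb_ok = np.allclose(rb_manual, RobustScaler().fit_transform(x))
print(z_ok, mm_ok, rb_ok)
```

On this column the min-max result squeezes the four ordinary values into roughly the bottom 3% of [0, 1], whereas the robust result keeps them spread around 0. This is the outlier sensitivity described above.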



In [83]:
from sklearn import preprocessing
import numpy as np
X_train = np.array([[ 1., -1.,  2.],
                    [ 2.,  0.,  0.],
                    [ 0.,  1., -1.]])
scaler = preprocessing.StandardScaler().fit(X_train)
print(type(scaler))
print(scaler.mean_) # computed per column
print(scaler.scale_)

<class 'sklearn.preprocessing._data.StandardScaler'>
[1.         0.         0.33333333]
[0.81649658 0.81649658 1.24721913]


In [84]:
# Transform
X_scaled=scaler.transform(X_train)
print(type(X_scaled))
print(X_scaled)
print(X_scaled.mean(axis=0)) # axis=0: computed per column
print(X_scaled.std(axis=0))

<class 'numpy.ndarray'>
[[ 0.         -1.22474487  1.33630621]
 [ 1.22474487  0.         -0.26726124]
 [-1.22474487  1.22474487 -1.06904497]]
[0. 0. 0.]
[1. 1. 1.]


In [85]:
# z-score
from sklearn.preprocessing import StandardScaler
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
print('origin:',(teenager_sns['friends'].mean(),teenager_sns['friends'].std()))

scaler=StandardScaler()
teenager_sns_zscore=pd.DataFrame(scaler.fit_transform(teenager_sns[['friends']]),columns=['friends_StandardScaled'])
teenager_sns_zscore['friends']=teenager_sns['friends']

# Print the mean and standard deviation (pandas .std() uses ddof=1, so the result is slightly above 1)
print((teenager_sns_zscore['friends_StandardScaled'].mean(),teenager_sns_zscore['friends_StandardScaled'].std()))
teenager_sns_zscore.head()

origin: (30.179466666666666, 36.530877467552315)
(-9.473903143468002e-18, 1.00001666708328)


Unnamed: 0,friends_StandardScaled,friends
0,-0.634528,7
1,-0.82615,0
2,1.062695,69
3,-0.82615,0
4,-0.552404,10
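The second value printed above is about 1.0000167 rather than exactly 1. This is because `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `.std()` defaults to the sample standard deviation (ddof=1). The ratio between the two is sqrt(n/(n-1)), which for n = 30000 is roughly 1.0000167. A small sketch on a five-row stand-in column (illustrative values only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small illustrative column; the notebook itself uses teenager_sns['friends'].
friends = pd.DataFrame({"friends": [7, 0, 69, 0, 10]})
scaled = pd.Series(StandardScaler().fit_transform(friends).ravel())

n = len(scaled)
std_population = scaled.std(ddof=0)  # what StandardScaler normalizes to 1
std_sample = scaled.std()            # pandas default, ddof=1

print(std_population, std_sample, np.sqrt(n / (n - 1)))
```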


In [86]:
# min-max
from sklearn.preprocessing import MinMaxScaler
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
print('origin',(teenager_sns['friends'].min(),teenager_sns['friends'].max()))
print('origin mean,std',teenager_sns['friends'].mean(),teenager_sns['friends'].std())
print('*'*30)

scaler=MinMaxScaler()
teenager_sns_minmax=pd.DataFrame(scaler.fit_transform(teenager_sns[['friends']]),columns=['friends_MinMaxScaled'])
print('scaled:',(teenager_sns_minmax['friends_MinMaxScaled'].min(),teenager_sns_minmax['friends_MinMaxScaled'].max()))
print('scaled mean,std',(teenager_sns_minmax['friends_MinMaxScaled'].mean(),teenager_sns_minmax['friends_MinMaxScaled'].std()))



origin (0, 830)
origin mean,std 30.179466666666666 36.530877467552315
******************************
scaled: (0.0, 1.0)
scaled mean,std (0.036360803212851414, 0.04401310538259201)


In [87]:
# robust-scaler
from sklearn.preprocessing import RobustScaler
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
print('origin:',(teenager_sns['friends'].max(),teenager_sns['friends'].min()))

scaler=RobustScaler()
teenager_sns_robust=pd.DataFrame(scaler.fit_transform(teenager_sns[['friends']]),columns=['friends_RobustScaled'])
teenager_sns_robust['friends']=teenager_sns['friends']
print('scaled:',(teenager_sns_robust['friends_RobustScaled'].max(),teenager_sns_robust['friends_RobustScaled'].min()))
teenager_sns_robust.head()

origin: (830, 0)
scaled: (19.75609756097561, -0.4878048780487805)


Unnamed: 0,friends_RobustScaled,friends
0,-0.317073,7
1,-0.487805,0
2,1.195122,69
3,-0.487805,0
4,-0.243902,10


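## 4. Discretization and feature encoding

#### Feature encoding: converts non-numeric features into numeric ones. Common methods are integer (label) encoding, One-Hot encoding, and dummy-variable encoding
- Integer encoding: maps each category to an integer. This introduces an artificial ordering, which can mislead linear and distance-based models; tree models are usually less affected
- One-Hot encoding: maps each category to a binary vector with one column per category, so no ordering is implied
- Dummy-variable encoding: maps each category to a binary vector with one column fewer than the number of categories; suits linear models

Difference between One-Hot and dummy-variable encoding:

Both are common ways of converting categorical data into numeric data. The main difference is how many indicator columns they create for a feature with k categories: k for One-Hot, k-1 for dummy variables.

1. One-Hot encoding: creates one indicator variable per category. For example, a feature with three possible categories A, B, and C becomes three new features, one each for A, B, and C. If the original value is A, new feature A is 1 and features B and C are 0. This avoids introducing a spurious ordering but increases the dimensionality of the data.

2. Dummy-variable encoding: similar to One-Hot encoding, but it creates one fewer indicator variable than there are categories. For example, a feature with three possible categories A, B, and C becomes two new features, say for A and B. If the original value is A, new feature A is 1 and feature B is 0; if the original value is C, both A and B are 0. This avoids perfect multicollinearity among the indicator columns. No information is lost, since the dropped category is represented by all zeros, but it becomes the baseline against which the other categories are interpreted.

In short, which encoding to choose depends on the application and on what the model requires.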
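The A/B/C example can be reproduced with `pandas.get_dummies`, which is used here as a convenient stand-in for the sklearn encoders in the cells that follow. One caveat: with `drop_first=True`, pandas drops the first category (A) rather than C, so A becomes the all-zeros baseline. Which category is dropped is arbitrary.

```python
import pandas as pd

# Toy feature with three categories, mirroring the A/B/C example above.
s = pd.Series(["A", "B", "C", "A"], name="cat")

# One-Hot: one indicator column per category (A, B, C).
one_hot = pd.get_dummies(s, dtype=int)

# Dummy variables: one column fewer; the first category (A) is the baseline.
dummy = pd.get_dummies(s, drop_first=True, dtype=int)

print(one_hot)
print(dummy)
```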

----
#### **Feature discretization**:
+ Binarization
+ Equal-width discretization
+ Equal-frequency discretization
+ Cluster-based discretization


In [88]:
# Use sklearn's LabelEncoder to encode the "gender" feature of the "teenager_sns" dataset
from sklearn.preprocessing import LabelEncoder
le=LabelEncoder()
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
print(teenager_sns['gender'][:4])
print(le.fit_transform(teenager_sns['gender'][:4]))
print(le.classes_) # classes_[i] is the original category for code i


0    M
1    F
2    M
3    F
Name: gender, dtype: object
[1 0 1 0]
['F' 'M']
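The relationship between the integer codes and `classes_` can be made explicit, and the encoding can be reversed with `inverse_transform`. A minimal sketch on the same four labels shown in the output above:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
codes = le.fit_transform(["M", "F", "M", "F"])

# classes_ is sorted, so the code of a label is its index in classes_.
mapping = {label: int(code) for code, label in enumerate(le.classes_)}

# inverse_transform maps the integer codes back to the original labels.
decoded = le.inverse_transform(codes)

print(mapping)
print(list(decoded))
```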


In [89]:
# One-Hot encoding
from sklearn.preprocessing import OneHotEncoder
# First encode the categories as integers, then One-Hot encode the integer codes
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
enc=OneHotEncoder()

teenager_sns['gender']=teenager_sns['gender'].map({'F':0,'M':1,np.nan:3}) # missing values get their own code, 3
# Fit the OneHotEncoder on gender
enc.fit(teenager_sns[['gender']])
enc.categories_

[array([0, 1, 3])]
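The cell above only fits the encoder. To see the encoded vectors themselves, the fitted encoder can be applied with `transform` (or `fit_transform`). This is a minimal sketch on four hand-written codes that follow the same 0 = F, 1 = M, 3 = missing convention; the codes are illustrative rather than read from the CSV.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Integer codes as in the cell above: 0 = F, 1 = M, 3 = missing.
codes = np.array([[1], [0], [3], [0]])

enc = OneHotEncoder()
encoded = enc.fit_transform(codes)  # returns a SciPy sparse matrix by default
dense = encoded.toarray()           # convert to a dense array for display

print(enc.categories_)
print(dense)
```

Each row has a single 1 in the column that matches its category, in the order given by `categories_`.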

In [90]:
# Discretization by binarization: values above the threshold map to 1, values at or below it map to 0
from sklearn.preprocessing import Binarizer
scaler=Binarizer(threshold=3)
teenager_sns=pd.read_csv('../../dataset/teenager_sns.csv')
teenager_sns_scaled=pd.DataFrame(scaler.fit_transform(teenager_sns[['friends']]),columns=['friends_Binarizer'])
teenager_sns_scaled['friends']=teenager_sns['friends']

# Friend counts of 3 or fewer map to 0
res=teenager_sns_scaled[['friends_Binarizer','friends']]
print(res.shape)
print(res.head(10))

(30000, 2)
   friends_Binarizer  friends
0                  1        7
1                  0        0
2                  1       69
3                  0        0
4                  1       10
5                  1      142
6                  1       72
7                  1       17
8                  1       52
9                  1       39
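The other three discretization methods listed earlier (equal-width, equal-frequency, and cluster-based) are not demonstrated in this notebook. One possible way to sketch them is sklearn's `KBinsDiscretizer`, whose `uniform`, `quantile`, and `kmeans` strategies correspond to those three ideas. The friend counts and `n_bins=3` below are assumptions for illustration, and depending on the installed sklearn version the `quantile` strategy may emit a warning about changing defaults.

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer

# Illustrative friend counts with a long right tail (sorted, made-up values).
friends = np.array([[0], [1], [2], [3], [5], [8], [13], [21], [100]], dtype=float)

binned = {}
for strategy in ("uniform", "quantile", "kmeans"):
    # encode="ordinal" returns the bin index (0, 1, 2) instead of one-hot columns.
    disc = KBinsDiscretizer(n_bins=3, encode="ordinal", strategy=strategy)
    binned[strategy] = disc.fit_transform(friends).ravel().astype(int)
    print(strategy, binned[strategy])
```

With equal-width bins the single large value stretches the range, so every other value falls into the first bin. The quantile and k-means strategies adapt their bin edges to where the data actually lies.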
