![](../../img/chinahadoop.png)

# Supplement: Data Processing, Feature Engineering, and Feature Selection


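* Data processing
    * Analyze the distribution of the data to find outliers and noise (boxplot, quantile)
    * pandas: data types (datetimes read in as strings), numeric-looking categorical columns (e.g. userid, which has no ordering), dtype optimization
    * Missing values (numeric vs. categorical, missing ratio)
    * Time series: trend analysis
    * Single dimension (distplot for continuous values, countplot/value_counts for categorical ones) and related dimensions (corr, heatmap)
    * Modeling on business data: the most effective features are usually statistical features (decide how to aggregate: which categorical columns can serve as groupby keys, and which numeric columns can be aggregated). Pay particular attention to confidence: when the group count is small the statistic is unstable, and ratio features are generally more stable than absolute values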

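* Feature engineering
    * Numeric
        * Scaling (min-max scaling, normalization...)
        * Discretization/binning (equal-width pd.cut, equal-frequency pd.qcut) (adds non-linearity, speeds up training, enables feature crosses, improves robustness)
        * Statistics (max, min, quantile)
        * Arithmetic (add, subtract, multiply, divide)
        * Distribution transforms (some models make distributional assumptions, e.g. linear regression assumes normally distributed errors, so heavily skewed values are often transformed with log1p/exp)
        * Supervised binning (fit a decision tree to learn the split points of a continuous feature, then take the tree's nodes as combined features); see `apply` on sklearn decision trees
    * Categorical
        * One-hot encoding
        * Label encoding
        * Binary encoding
        * Category encoding
    * Datetime
        * Point in time / time of day (day of week, hour)
        * Time grouping (workday, weekend, public holiday...)
        * Time intervals (time elapsed up to now...)
        * When building statistical features together with numeric columns, different time windows are used...
        * Combinations...
    * Text
        * Bag of words
        * TF-IDF
        * LDA
        * word2vec/word embedding
        * ...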

* Feature selection
    * Filter
    * Wrapper
    * Embedded
    * Use tree-model feature importances and run experiments to screen features
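
The statistical-feature idea from the data-processing list can be sketched as follows. This is a minimal illustration on made-up rows that only borrow Titanic column names; the `MIN_COUNT` threshold and the `reliable` column are assumptions introduced here, not part of the original notebook. In a real project, target-based statistics such as `survival_rate` would normally be computed on training folds only.

```python
import pandas as pd

# Toy rows standing in for the Titanic columns (values are made up).
df = pd.DataFrame({
    "Pclass": [1, 1, 1, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 1, 0],
    "Fare": [71.3, 53.1, 51.9, 30.1, 7.25, 7.9, 8.05, 21.1],
})

# Categorical column as the groupby key, numeric columns as aggregation targets.
stats = df.groupby("Pclass").agg(
    n=("Survived", "size"),
    survival_rate=("Survived", "mean"),
    fare_mean=("Fare", "mean"),
).reset_index()

# Flag groups whose statistics rest on very few rows (hypothetical threshold).
MIN_COUNT = 3
stats["reliable"] = stats["n"] >= MIN_COUNT

# Attach the group statistics back to every row as new features.
df_feat = df.merge(stats, on="Pclass", how="left")
print(stats)
```

Keeping the group size `n` next to each ratio lets a model, or a later filtering step, discount statistics computed from tiny groups.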

In [51]:
# import libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

### Load the data

In [52]:
# Titanic data
df_train = pd.read_csv('./data/train.csv')

### Get to know the data
* head()
* info()
* describe()

In [53]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [54]:
df_train.shape

(891, 12)

In [55]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB


In [56]:
df_train.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Basic data processing

### 0. Handling missing values
* pandas fillna
* sklearn SimpleImputer (named `Imputer` before scikit-learn 0.22)

#### Using the pandas fillna function

In [57]:
# look up the fillna function
help(pd.DataFrame.fillna)

Help on function fillna in module pandas.core.frame:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
    Fill NA/NaN values using the specified method.
    
    Parameters
    ----------
    value : scalar, dict, Series, or DataFrame
        Value to use to fill holes (e.g. 0), alternately a
        dict/Series/DataFrame of values specifying which value to use for
        each index (for a Series) or column (for a DataFrame). (values not
        in the dict/Series/DataFrame will not be filled). This value cannot
        be a list.
    method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
        Method to use for filling holes in reindexed Series
        pad / ffill: propagate last valid observation forward to next valid
        backfill / bfill: use NEXT valid observation to fill gap
    axis : {0 or 'index', 1 or 'columns'}
    inplace : boolean, default False
        If True, fill in place. Note: this will modify any

In [58]:
df_train['Age'].head(10)

0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5     NaN
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64

In [59]:
df_train['Age'].fillna(value=df_train['Age'].mean()).head(10)

0    22.000000
1    38.000000
2    26.000000
3    35.000000
4    35.000000
5    29.699118
6    54.000000
7     2.000000
8    27.000000
9    14.000000
Name: Age, dtype: float64

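#### Using sklearn's SimpleImputer

In [60]:
# Imputer was removed from sklearn.preprocessing in scikit-learn 0.22; SimpleImputer replaces it
import numpy as np
from sklearn.impute import SimpleImputer

In [61]:
# SimpleImputer always works column by column, so there is no axis argument
imp = SimpleImputer(missing_values=np.nan, strategy='mean')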



In [62]:
age = imp.fit_transform(df_train[['Age']].values)

In [63]:
df_train.loc[:,'Age'] = df_train['Age'].fillna(value=df_train['Age'].mean())

In [64]:
df_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [65]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
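
The cells above fill a numeric column with its mean. A minimal sketch of the other cases named in the outline (missing ratio, categorical columns) might look like the following; the tiny frame is made up, and choosing the median and the mode as fill values is an assumption for illustration rather than the notebook's own choice.

```python
import numpy as np
import pandas as pd

# Made-up rows with one numeric and one categorical column.
df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Embarked": ["S", "C", None, "S"],
})

# Share of missing values per column.
missing_ratio = df.isnull().mean()

# Numeric column: fill with the median; categorical column: fill with the mode.
filled = df.copy()
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])

print(missing_ratio)
print(filled)
```

When the missing ratio is very high (as for `Cabin` in the `info()` output), a "was missing" indicator column is often more useful than any filled value.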


## Common feature engineering operations

### Numeric

#### Distribution transforms
* apply+numpy
* preprocessing scaler

In [66]:
# log and similar transforms (np.log is vectorized, so no apply/lambda is needed)
import numpy as np
log_age = np.log(df_train['Age'])

In [67]:
df_train.loc[:,'log_age'] = log_age

In [68]:
df_train.head(10)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348
5,6,0,3,"Moran, Mr. James",male,29.699118,0,0,330877,8.4583,,Q,3.391117
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S,3.988984
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S,0.693147
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S,3.295837
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C,2.639057


In [69]:
# scaling: min-max scaling
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()
fare_trans = mm_scaler.fit_transform(df_train[['Fare']])

In [70]:
# scaling: standardization
from sklearn.preprocessing import StandardScaler
std_scaler = StandardScaler()
fare_std_trans = std_scaler.fit_transform(df_train[['Fare']])
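
The scaler cells call `fit_transform` on the whole column. When there is a separate test set, a common pattern is to fit on the training data and only `transform` the test data. The sketch below uses made-up fare values and also shows `log1p`, which the outline mentions; none of these arrays come from the notebook.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up fares; each column vector has shape (n_samples, 1).
train_fare = np.array([[0.0], [10.0], [50.0], [100.0]])
test_fare = np.array([[25.0], [200.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train_fare)  # learns min=0 and max=100
test_scaled = scaler.transform(test_fare)        # reuses them, so values may leave [0, 1]

# log1p(x) = log(1 + x): compresses a long right tail and is defined at x = 0.
train_log = np.log1p(train_fare)

print(test_scaled.ravel())
```

A test value of 200 maps to 2.0 here, which is expected: the scaler only knows the range it saw during `fit`.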

#### Statistics
* max,min
* quantile

In [71]:
# max and min
max_age = df_train['Age'].max()
min_age = df_train["Age"].min()

In [72]:
max_age

80.0

In [73]:
min_age

0.42

In [74]:
# quantiles
age_quarter_1 = df_train['Age'].quantile(0.25)
age_quarter_3 = df_train['Age'].quantile(0.75)

In [75]:
age_quarter_1

22.0

In [76]:
age_quarter_3

35.0
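
Quartiles like the two above are also what a boxplot uses to flag outliers. A minimal sketch of that rule, assuming the conventional 1.5 × IQR fences and a made-up fare series (the 1.5 factor is a convention, not something the notebook specifies):

```python
import pandas as pd

# Made-up fares with one extreme value.
fare = pd.Series([7.25, 7.9, 8.05, 13.0, 26.0, 31.0, 512.33])

q1, q3 = fare.quantile(0.25), fare.quantile(0.75)
iqr = q3 - q1

# Boxplot-style fences: anything outside them is treated as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
is_outlier = (fare < lower) | (fare > upper)

# One way to handle outliers: clip them to the fences instead of dropping rows.
fare_clipped = fare.clip(lower=lower, upper=upper)

print(fare[is_outlier])
```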

#### Arithmetic

In [77]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348


In [78]:
df_train.loc[:,'family_size'] = df_train['SibSp']+df_train['Parch']+1

In [79]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1


In [80]:
df_train.loc[:,'tmp'] = df_train['Age']*df_train['Pclass'] + 4*df_train['family_size']

In [81]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size,tmp
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2,74.0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2,46.0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1,82.0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2,43.0
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1,109.0


#### Higher-order and interaction features
* preprocessing.PolynomialFeatures

With degree=2: [x1 x2 x3] => [1 x1 x2 x3 x1^2 x1\*x2 x1\*x3 x2^2 x2\*x3 x3^2]

In [82]:
from sklearn.preprocessing import PolynomialFeatures

In [83]:
poly = PolynomialFeatures(degree=2)

In [84]:
df_train[['SibSp','Parch']].head()

Unnamed: 0,SibSp,Parch
0,1,0
1,1,0
2,0,0
3,1,0
4,0,0


In [85]:
poly_fea = poly.fit_transform(df_train[['SibSp','Parch']])

In [86]:
poly_fea

array([[1., 1., 0., 1., 0., 0.],
       [1., 1., 0., 1., 0., 0.],
       [1., 0., 0., 0., 0., 0.],
       ...,
       [1., 1., 2., 1., 2., 4.],
       [1., 0., 0., 0., 0., 0.],
       [1., 0., 0., 0., 0., 0.]])
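
To see which output column corresponds to which term, the fitted transformer's `powers_` attribute can be inspected. A small sketch on a single made-up row (x1 = 2, x2 = 3):

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X = np.array([[2.0, 3.0]])  # one row: x1 = 2, x2 = 3
poly = PolynomialFeatures(degree=2)
out = poly.fit_transform(X)

# Each row of powers_ holds the exponents of (x1, x2) for one output column.
print(poly.powers_.tolist())
# [[0, 0], [1, 0], [0, 1], [2, 0], [1, 1], [0, 2]]
print(out.tolist())
# [[1.0, 2.0, 3.0, 4.0, 6.0, 9.0]]
```

So the six columns are 1, x1, x2, x1^2, x1\*x2, x2^2, which matches the six-column array produced from `SibSp` and `Parch`.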

#### Discretization/binning
* pandas cut
* pandas qcut

In [87]:
# equal-width binning
df_train.loc[:, 'fare_cut'] = pd.cut(df_train['Fare'], 5)

In [88]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size,tmp,fare_cut
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2,74.0,"(-0.512, 102.466]"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2,46.0,"(-0.512, 102.466]"
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1,82.0,"(-0.512, 102.466]"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2,43.0,"(-0.512, 102.466]"
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1,109.0,"(-0.512, 102.466]"


In [89]:
df_train['fare_cut'].unique()

[(-0.512, 102.466], (204.932, 307.398], (102.466, 204.932], (409.863, 512.329]]
Categories (4, interval[float64]): [(-0.512, 102.466] < (102.466, 204.932] < (204.932, 307.398] < (409.863, 512.329]]

In [90]:
# equal-frequency binning
df_train.loc[:,'fare_qcut'] = pd.qcut(df_train['Fare'], 5)

In [91]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size,tmp,fare_cut,fare_qcut
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2,74.0,"(-0.512, 102.466]","(-0.001, 7.854]"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2,46.0,"(-0.512, 102.466]","(39.688, 512.329]"
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1,82.0,"(-0.512, 102.466]","(7.854, 10.5]"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2,43.0,"(-0.512, 102.466]","(39.688, 512.329]"
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1,109.0,"(-0.512, 102.466]","(7.854, 10.5]"
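
`pd.cut` and `pd.qcut` choose bin edges without looking at the label. The supervised binning mentioned in the outline can be sketched with a shallow decision tree whose split thresholds become the bin edges. Everything below is an illustrative assumption: the data is synthetic (a label with cut points at 30 and 70), and `max_leaf_nodes=4` is an arbitrary cap on the number of bins.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic feature and a label that changes at 30 and at 70.
rng = np.random.RandomState(0)
fare = rng.uniform(0, 100, size=200)
label = np.digitize(fare, [30, 70])

tree = DecisionTreeClassifier(max_leaf_nodes=4, random_state=0)
tree.fit(fare.reshape(-1, 1), label)

# Internal nodes have a feature index >= 0; leaves are marked with a negative index.
thresholds = np.unique(tree.tree_.threshold[tree.tree_.feature >= 0])

# Use the learned thresholds as bin edges.
edges = np.concatenate(([-np.inf], thresholds, [np.inf]))
fare_bin = pd.cut(fare, bins=edges, labels=False)

# apply() returns the index of the leaf each sample falls into,
# which can itself be used as a categorical feature.
leaf_id = tree.apply(fare.reshape(-1, 1))

print(thresholds)
```

On real data the thresholds will not line up this cleanly, and the tree should be fitted on training data only to avoid leaking the label.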


### Categorical (discrete)
#### One-hot encoding
* pandas get_dummies
* OneHotEncoder()

In [92]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 17 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
log_age        891 non-null float64
family_size    891 non-null int64
tmp            891 non-null float64
fare_cut       891 non-null category
fare_qcut      891 non-null category
dtypes: category(2), float64(4), int64(6), object(5)
memory usage: 106.4+ KB


In [93]:
embarked_oht = pd.get_dummies(df_train[['Embarked']])

In [94]:
embarked_oht.head()

Unnamed: 0,Embarked_C,Embarked_Q,Embarked_S
0,0,0,1
1,1,0,0
2,0,0,1
3,0,0,1
4,0,0,1


In [95]:
fare_qcut_oht = pd.get_dummies(df_train[['fare_qcut']])

In [96]:
fare_qcut_oht.head()

Unnamed: 0,"fare_qcut_(-0.001, 7.854]","fare_qcut_(7.854, 10.5]","fare_qcut_(10.5, 21.679]","fare_qcut_(21.679, 39.688]","fare_qcut_(39.688, 512.329]"
0,1,0,0,0,0
1,0,0,0,0,1
2,0,1,0,0,0
3,0,0,0,0,1
4,0,1,0,0,0
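
The bullets list `OneHotEncoder()`, but the cells only use `get_dummies`. A minimal sketch of the encoder, plus label encoding from the outline, follows. The two small frames are made up, and `"X"` is a hypothetical category that never appears in training, included to show `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"Embarked": ["S", "C", "S", "Q"]})
test = pd.DataFrame({"Embarked": ["C", "X"]})  # "X" is unseen during fit

# Fit on training data; unseen categories become an all-zero row at transform time.
enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train[["Embarked"]])
test_oht = enc.transform(test[["Embarked"]]).toarray()

# Label encoding with pandas: integer codes follow the sorted category order (C, Q, S).
codes = train["Embarked"].astype("category").cat.codes

print(enc.categories_)
print(test_oht)
print(codes.tolist())
```

Unlike `get_dummies`, a fitted encoder remembers its categories, so training and test data end up with the same columns.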


### Combined features

In [97]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size,tmp,fare_cut,fare_qcut
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2,74.0,"(-0.512, 102.466]","(-0.001, 7.854]"
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2,46.0,"(-0.512, 102.466]","(39.688, 512.329]"
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1,82.0,"(-0.512, 102.466]","(7.854, 10.5]"
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2,43.0,"(-0.512, 102.466]","(39.688, 512.329]"
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1,109.0,"(-0.512, 102.466]","(7.854, 10.5]"


In [98]:
# derive a combined feature from a condition
df_train.loc[:,'alone'] = (df_train['SibSp']==0)&(df_train['Parch']==0)

In [99]:
df_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,log_age,family_size,tmp,fare_cut,fare_qcut,alone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S,3.091042,2,74.0,"(-0.512, 102.466]","(-0.001, 7.854]",False
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3.637586,2,46.0,"(-0.512, 102.466]","(39.688, 512.329]",False
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S,3.258097,1,82.0,"(-0.512, 102.466]","(7.854, 10.5]",True
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S,3.555348,2,43.0,"(-0.512, 102.466]","(39.688, 512.329]",False
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S,3.555348,1,109.0,"(-0.512, 102.466]","(7.854, 10.5]",True


### Datetime

#### Date handling
* pandas to_datetime

In [100]:
car_sales = pd.read_csv('./data/car_data.csv')

In [101]:
car_sales.head()

Unnamed: 0,date_t,cnt
0,2012-12-31,
1,2013-01-01,
2,2013-01-02,68.0
3,2013-01-03,36.0
4,2013-01-04,5565.0


In [102]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 2 columns):
date_t    1512 non-null object
cnt       1032 non-null float64
dtypes: float64(1), object(1)
memory usage: 23.7+ KB


In [103]:
car_sales.describe()

Unnamed: 0,cnt
count,1032.0
mean,1760.124031
std,1153.164214
min,12.0
25%,1178.75
50%,1774.0
75%,2277.75
max,7226.0


In [104]:
car_sales['date_t'].dtype

dtype('O')

In [105]:
car_sales.loc[:,'date'] = pd.to_datetime(car_sales['date_t'])

In [106]:
car_sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1512 entries, 0 to 1511
Data columns (total 3 columns):
date_t    1512 non-null object
cnt       1032 non-null float64
date      1512 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(1), object(1)
memory usage: 35.5+ KB


In [107]:
car_sales.head()

Unnamed: 0,date_t,cnt,date
0,2012-12-31,,2012-12-31
1,2013-01-01,,2013-01-01
2,2013-01-02,68.0,2013-01-02
3,2013-01-03,36.0,2013-01-03
4,2013-01-04,5565.0,2013-01-04


#### Extracting key time information
* .dt.month
* .dt.dayofweek
* .dt.dayofyear
* ...

In [108]:
# month
car_sales.loc[:,'month'] = car_sales['date'].dt.month

In [109]:
car_sales.head()

Unnamed: 0,date_t,cnt,date,month
0,2012-12-31,,2012-12-31,12
1,2013-01-01,,2013-01-01,1
2,2013-01-02,68.0,2013-01-02,1
3,2013-01-03,36.0,2013-01-03,1
4,2013-01-04,5565.0,2013-01-04,1


In [110]:
tmp_date = car_sales['date'].dt

In [111]:
# day of the month
car_sales.loc[:,'dom'] = car_sales['date'].dt.day

In [112]:
# day of the year
car_sales.loc[:,'doy'] = car_sales['date'].dt.dayofyear

In [113]:
# day of the week
car_sales.loc[:,'dow'] = car_sales['date'].dt.dayofweek

In [114]:
car_sales.head()

Unnamed: 0,date_t,cnt,date,month,dom,doy,dow
0,2012-12-31,,2012-12-31,12,31,366,0
1,2013-01-01,,2013-01-01,1,1,1,1
2,2013-01-02,68.0,2013-01-02,1,2,2,2
3,2013-01-03,36.0,2013-01-03,1,3,3,3
4,2013-01-04,5565.0,2013-01-04,1,4,4,4


In [115]:
# dayofweek: Monday=0 ... Sunday=6, so the weekend is 5 (Saturday) and 6 (Sunday)
car_sales.loc[:,'is_weekend'] = car_sales['dow'].apply(lambda x: 1 if (x==5 or x==6) else 0)

In [116]:
car_sales.head()

Unnamed: 0,date_t,cnt,date,month,dom,doy,dow,is_weekend
0,2012-12-31,,2012-12-31,12,31,366,0,0
1,2013-01-01,,2013-01-01,1,1,1,1,0
2,2013-01-02,68.0,2013-01-02,1,2,2,2,0
3,2013-01-03,36.0,2013-01-03,1,3,3,3,0
4,2013-01-04,5565.0,2013-01-04,1,4,4,4,0
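
The outline also mentions time intervals and time-window statistics, which the cells above do not cover. A minimal sketch with pandas follows; the first three `cnt` values echo the `head()` output, the rest are made up, and the 3-day window is an arbitrary choice.

```python
import pandas as pd

sales = pd.DataFrame({
    "date": pd.date_range("2013-01-02", periods=6, freq="D"),
    "cnt": [68.0, 36.0, 5565.0, 100.0, 120.0, 90.0],  # last three values are made up
})

# Interval feature: days elapsed since the first record.
sales["days_since_start"] = (sales["date"] - sales["date"].min()).dt.days

# Window feature: mean of the previous 3 days. shift(1) moves the series down
# one row so that a day's own value does not leak into its feature.
sales["cnt_mean_prev3"] = sales["cnt"].shift(1).rolling(window=3).mean()

print(sales)
```

The first three rows of `cnt_mean_prev3` are NaN because fewer than three earlier days are available.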


### Text

#### Bag of words
* CountVectorizer
![](./image/bag_of_words.png)

In [117]:
from sklearn.feature_extraction.text import CountVectorizer

In [118]:
vectorizer = CountVectorizer()

In [119]:
corpus = [
    '欢迎 大家 学习 这 个 课程',
    '机器学习 需要 一定 的 数学 基础',
    '课程 里 的 内容 会 涉及 机器学习 和 数学 原理',
    '希望 大家 都 能 消化 课程 内容'
]

In [120]:
X = vectorizer.fit_transform(corpus)

In [121]:
X

<4x14 sparse matrix of type '<class 'numpy.int64'>'
	with 20 stored elements in Compressed Sparse Row format>

In [122]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vectorizer.get_feature_names_out().tolist()

['一定',
 '内容',
 '原理',
 '基础',
 '大家',
 '学习',
 '希望',
 '数学',
 '机器学习',
 '欢迎',
 '消化',
 '涉及',
 '课程',
 '需要']

In [123]:
X.toarray()

array([[0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0],
       [1, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1],
       [0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0],
       [0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0]], dtype=int64)
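
Note that the vocabulary has 14 entries even though the corpus contains more distinct words: the default `token_pattern` of `CountVectorizer` only keeps tokens of two or more characters, so one-character words such as `这`, `个`, and `的` are dropped. A small sketch of how one might keep them, assuming the text is already segmented by spaces as in `corpus`:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["欢迎 大家 学习 这 个 课程"]

# Default pattern: tokens need at least two word characters.
default_vec = CountVectorizer()
default_vec.fit(docs)

# Accept any run of non-space characters, so one-character words survive.
keep_all_vec = CountVectorizer(token_pattern=r"(?u)\S+")
keep_all_vec.fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(keep_all_vec.vocabulary_))
```

Whether to keep such short words is a modelling choice; many of them are function words that a stop-word list would remove anyway.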

In [124]:
vec = CountVectorizer(ngram_range=(1,3))

In [125]:
X_ngram = vec.fit_transform(corpus)

In [126]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
vec.get_feature_names_out().tolist()

['一定',
 '一定 数学',
 '一定 数学 基础',
 '内容',
 '内容 涉及',
 '内容 涉及 机器学习',
 '原理',
 '基础',
 '大家',
 '大家 学习',
 '大家 学习 课程',
 '大家 消化',
 '大家 消化 课程',
 '学习',
 '学习 课程',
 '希望',
 '希望 大家',
 '希望 大家 消化',
 '数学',
 '数学 原理',
 '数学 基础',
 '机器学习',
 '机器学习 数学',
 '机器学习 数学 原理',
 '机器学习 需要',
 '机器学习 需要 一定',
 '欢迎',
 '欢迎 大家',
 '欢迎 大家 学习',
 '消化',
 '消化 课程',
 '消化 课程 内容',
 '涉及',
 '涉及 机器学习',
 '涉及 机器学习 数学',
 '课程',
 '课程 内容',
 '课程 内容 涉及',
 '需要',
 '需要 一定',
 '需要 一定 数学']

In [127]:
X_ngram.toarray()

array([[0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
       [1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
        0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1],
       [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1,
        1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0],
       [0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0]],
      dtype=int64)

#### TF-IDF
* TfidfVectorizer
![](./image/TF-IDF.png)

In [128]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [129]:
tfidf_vec = TfidfVectorizer()

In [130]:
tfidf_X = tfidf_vec.fit_transform(corpus)

In [131]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
tfidf_vec.get_feature_names_out().tolist()

['一定',
 '内容',
 '原理',
 '基础',
 '大家',
 '学习',
 '希望',
 '数学',
 '机器学习',
 '欢迎',
 '消化',
 '涉及',
 '课程',
 '需要']

In [132]:
tfidf_X.toarray()

array([[0.        , 0.        , 0.        , 0.        , 0.4530051 ,
        0.57457953, 0.        , 0.        , 0.        , 0.57457953,
        0.        , 0.        , 0.36674667, 0.        ],
       [0.48546061, 0.        , 0.        , 0.48546061, 0.        ,
        0.        , 0.        , 0.38274272, 0.38274272, 0.        ,
        0.        , 0.        , 0.        , 0.48546061],
       [0.        , 0.38144133, 0.48380996, 0.        , 0.        ,
        0.        , 0.        , 0.38144133, 0.38144133, 0.        ,
        0.        , 0.48380996, 0.30880963, 0.        ],
       [0.        , 0.41263976, 0.        , 0.        , 0.41263976,
        0.        , 0.52338122, 0.        , 0.        , 0.        ,
        0.52338122, 0.        , 0.33406745, 0.        ]])
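
To make the numbers above less opaque, the sketch below recomputes TF-IDF by hand and compares it with `TfidfVectorizer` under its default settings (smoothed idf, L2 row normalization). The three English documents are made up for brevity.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["apple banana apple", "banana cherry", "apple cherry cherry"]

# Raw term counts (term frequency).
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
n_docs = counts.shape[0]

# Document frequency: in how many documents each term occurs.
doc_freq = (counts > 0).sum(axis=0)

# Smoothed idf as used by scikit-learn when smooth_idf=True.
idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

# Weight the counts, then scale each row to unit L2 norm.
tfidf = counts * idf
tfidf /= np.linalg.norm(tfidf, axis=1, keepdims=True)

reference = TfidfVectorizer().fit_transform(docs).toarray()
print(np.allclose(tfidf, reference))
```

Terms that occur in fewer documents get a larger idf, which is why `欢迎` and `学习` score higher than `课程` in the first row of the array above.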

## Feature selection

### Filter
* SelectKBest
* Typically used with linear models

In [133]:
from sklearn.feature_selection import SelectKBest
from sklearn.datasets import load_iris

In [134]:
iris = load_iris()

In [135]:
X, y = iris.data, iris.target

In [136]:
X.shape

(150, 4)

In [137]:
X[:5,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

In [138]:
y

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

In [139]:
X_new = SelectKBest(k=2).fit_transform(X,y)

In [140]:
X_new.shape

(150, 2)

In [141]:
X_new[:5,:]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])
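
`SelectKBest(k=2)` relies on its default scoring function, `f_classif` (an ANOVA F-test). A short sketch that makes the scoring function explicit and shows which columns were kept:

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
X, y = iris.data, iris.target

selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# scores_ holds one F statistic per feature; get_support() marks the kept ones.
for name, score, kept in zip(iris.feature_names, selector.scores_, selector.get_support()):
    print(f"{name:20s} F={score:8.1f} kept={kept}")
```

The two petal measurements are kept, which matches the two columns shown in `X_new`.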

### Wrapper
* RFE

In [142]:
from sklearn.feature_selection import RFE

In [143]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()

In [144]:
rfe = RFE(estimator=rf, n_features_to_select=2)

In [145]:
X_rfe = rfe.fit_transform(X,y)



In [146]:
X_rfe.shape

(150, 2)

In [147]:
X_rfe[:5,:]

array([[1.4, 0.2],
       [1.4, 0.2],
       [1.3, 0.2],
       [1.5, 0.2],
       [1.4, 0.2]])

### Embedded
* SelectFromModel

In [148]:
from sklearn.feature_selection import SelectFromModel

In [149]:
from sklearn.svm import LinearSVC

In [150]:
lsvc = LinearSVC(C=0.01, penalty="l1", dual=False).fit(X,y)

In [151]:
model = SelectFromModel(lsvc, prefit=True)

In [152]:
X_embed = model.transform(X)

In [153]:
X_embed.shape

(150, 3)

In [154]:
X_embed[:5,:]

array([[5.1, 3.5, 1.4],
       [4.9, 3. , 1.4],
       [4.7, 3.2, 1.3],
       [4.6, 3.1, 1.5],
       [5. , 3.6, 1.4]])
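
The last bullet of the outline, screening features by tree-model importance, can be sketched with the same `SelectFromModel` wrapper. The forest size, the random seed, and the `"mean"` threshold below are illustrative assumptions, not settings from the notebook.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

iris = load_iris()
X, y = iris.data, iris.target

rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances; they sum to 1 across features.
importances = dict(zip(iris.feature_names, rf.feature_importances_))

# Keep the features whose importance is at least the mean importance.
selector = SelectFromModel(rf, prefit=True, threshold="mean")
X_sel = selector.transform(X)

print(importances)
print(X_sel.shape)
```

Importances vary with the random seed and can be diluted across correlated features, so it is worth confirming any selection with a cross-validated experiment, as the outline suggests.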