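**Example: (Kaggle) House Price Prediction** <br />
**[Learning Objectives]**<br />
Using the house price prediction data, observe how filling missing values and standardization / min-max scaling affect the numerical features<br />
**[Key Points]**<br />
Know how to look up the number of missing values in each column<br />
Observe how different missing-value imputation methods affect the features<br />
Observe how different feature scaling methods affect the features<br />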
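
For reference, the two scaling methods compared in this example are usually defined per feature as below. This is a sketch of the standard definitions, which match the default behavior of scikit-learn's `MinMaxScaler` and `StandardScaler`; the symbols $x$, $\mu$ (feature mean) and $\sigma$ (feature standard deviation) are notation introduced here, not taken from the notebook.

$$
x_{\text{minmax}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}},
\qquad
x_{\text{std}} = \frac{x - \mu}{\sigma}
$$

Min-max scaling maps each feature to the range $[0, 1]$, while standardization gives each feature zero mean and unit variance.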

In [57]:
#import the modules we need
import pandas as pd
import numpy as np
import warnings 
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression

warnings.filterwarnings('ignore')
data_dir = './data/'
house_train = pd.read_csv(data_dir + 'train.csv')
house_test = pd.read_csv(data_dir + 'test.csv')

print(house_train.shape)
print(house_test.shape)

(1460, 81)
(1459, 80)


In [58]:
house_label = np.log1p(house_train.SalePrice)
house_test_ids = house_test.Id
house_train = house_train.drop(['Id', 'SalePrice'] , axis=1)
house_test = house_test.drop(['Id'], axis = 1)

house_df = pd.concat([house_train,house_test])

house_df.head(5)

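| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 1 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 2 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 3 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 4 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |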


In [59]:
#sum the count of missing values in every feature
house_df.isnull().sum().sort_values(ascending = False).head(10)

PoolQC          2909
MiscFeature     2814
Alley           2721
Fence           2348
FireplaceQu     1420
LotFrontage      486
GarageCond       159
GarageQual       159
GarageYrBlt      159
GarageFinish     159
dtype: int64
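
Raw counts are easier to judge next to the fraction of rows they represent. The following is a minimal sketch of that idea on a made-up toy frame (`toy_df` and its values are hypothetical stand-ins for `house_df`, not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for house_df; the values are invented for illustration
toy_df = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'PoolQC': [np.nan, np.nan, np.nan, 'Gd'],
    'LotArea': [8450, 9600, 11250, 9550],
})

# Number of missing values per column
missing_count = toy_df.isnull().sum()
# The mean of a boolean mask is the fraction of missing rows per column
missing_ratio = toy_df.isnull().mean()

missing_summary = pd.DataFrame({'count': missing_count, 'ratio': missing_ratio})
# Keep only columns that actually have missing values, most missing first
missing_summary = (
    missing_summary[missing_summary['count'] > 0]
    .sort_values('count', ascending=False)
)
print(missing_summary)
```

Columns that are missing in most rows (such as `PoolQC` in the real data) may be better treated as "feature absent" than imputed with a statistic, which is one reason the ratio can be more informative than the count alone.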

In [60]:
# Keep only the numerical columns (every dtype other than 'object')
numerical_features = np.array(house_df.columns[house_df.dtypes != 'object'], dtype = str)

print(f' {len(numerical_features)} Numerical Features : {numerical_features} \n')

 36 Numerical Features : ['MSSubClass' 'LotFrontage' 'LotArea' 'OverallQual' 'OverallCond'
 'YearBuilt' 'YearRemodAdd' 'MasVnrArea' 'BsmtFinSF1' 'BsmtFinSF2'
 'BsmtUnfSF' 'TotalBsmtSF' '1stFlrSF' '2ndFlrSF' 'LowQualFinSF'
 'GrLivArea' 'BsmtFullBath' 'BsmtHalfBath' 'FullBath' 'HalfBath'
 'BedroomAbvGr' 'KitchenAbvGr' 'TotRmsAbvGrd' 'Fireplaces' 'GarageYrBlt'
 'GarageCars' 'GarageArea' 'WoodDeckSF' 'OpenPorchSF' 'EnclosedPorch'
 '3SsnPorch' 'ScreenPorch' 'PoolArea' 'MiscVal' 'MoSold' 'YrSold'] 



In [61]:
house_df_numerical = house_df[numerical_features]
house_train_num = len(house_label)
house_df_numerical.head(5)

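| | MSSubClass | LotFrontage | LotArea | OverallQual | OverallCond | YearBuilt | YearRemodAdd | MasVnrArea | BsmtFinSF1 | BsmtFinSF2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 60 | 65.0 | 8450 | 7 | 5 | 2003 | 2003 | 196.0 | 706.0 | 0.0 | ... | 548.0 | 0 | 61 | 0 | 0 | 0 | 0 | 0 | 2 | 2008 |
| 1 | 20 | 80.0 | 9600 | 6 | 8 | 1976 | 1976 | 0.0 | 978.0 | 0.0 | ... | 460.0 | 298 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 2007 |
| 2 | 60 | 68.0 | 11250 | 7 | 5 | 2001 | 2002 | 162.0 | 486.0 | 0.0 | ... | 608.0 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 9 | 2008 |
| 3 | 70 | 60.0 | 9550 | 7 | 5 | 1915 | 1970 | 0.0 | 216.0 | 0.0 | ... | 642.0 | 0 | 35 | 272 | 0 | 0 | 0 | 0 | 2 | 2006 |
| 4 | 60 | 84.0 | 14260 | 8 | 5 | 2000 | 2000 | 350.0 | 655.0 | 0.0 | ... | 836.0 | 192 | 84 | 0 | 0 | 0 | 0 | 0 | 12 | 2008 |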


In [62]:
# Fill the NAs with -1, and do the linear regression to see the effect

house_df_m1 = house_df_numerical.fillna(-1)
house_train_x = house_df_m1[0:house_train_num]
LR = LinearRegression()
cross_val_score(LR,house_train_x,house_label, cv = 5).mean()

0.8466400643386484

In [63]:
#Fill the NAs with 0, and do the linear regression to see the effect
house_df_0 = house_df_numerical.fillna(0)
house_train_x = house_df_0[0:house_train_num]
LR = LinearRegression()
cross_val_score(LR,house_train_x,house_label, cv = 5).mean()

0.8466118155868816

In [64]:
#Fill the NAs with mean value, and do the linear regression to see the effect
house_df_mean = house_df_numerical.fillna(house_df_numerical.mean())
house_train_x = house_df_mean[0:house_train_num]
LR = LinearRegression()
cross_val_score(LR,house_train_x,house_label, cv = 5).mean()

0.8442642432201322

In [65]:
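#Fill the NAs with -1, and do the MinMax Scale on data
house_df_temp1 = MinMaxScaler().fit_transform(house_df_m1)
# NOTE: the output recorded below equals the mean-fill score from the previous cell,
# because the original run assigned the scaled data to a misspelled name (hosue_train_x).
# Rerun this cell to get the actual MinMax result.
house_train_x = house_df_temp1[0:house_train_num]
LR = LinearRegression()
cross_val_score(LR,house_train_x,house_label,cv = 5).mean()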

0.8442642432201322

In [66]:
#Fill the NAs with -1, and do the Standard Scale on data
house_df_temp2 = StandardScaler().fit_transform(house_df_m1)
house_train_x = house_df_temp2[0:house_train_num]
LR = LinearRegression()
cross_val_score(LR,house_train_x,house_label, cv = 5).mean()

0.846769588054143
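
The cells above fill and scale the concatenated train and test rows before cross-validation, so the statistics used in each fold also include held-out rows. One common way to avoid that is to put the imputer and scaler inside a scikit-learn pipeline, so they are refit on the training part of every fold. The sketch below illustrates this on synthetic data (`X`, `y`, the missing-value rate and the `scores` dictionary are all assumptions made for the example, not the Kaggle data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic regression data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
# Knock out roughly 10% of the entries to simulate missing values
X[rng.random(X.shape) < 0.1] = np.nan

imputers = {
    'constant -1': SimpleImputer(strategy='constant', fill_value=-1),
    'mean': SimpleImputer(strategy='mean'),
    'median': SimpleImputer(strategy='median'),
}
scalers = {'none': None, 'minmax': MinMaxScaler(), 'standard': StandardScaler()}

scores = {}
for fill_name, imputer in imputers.items():
    for scale_name, scaler in scalers.items():
        # Imputer and scaler are fit only on each fold's training rows
        steps = [imputer] if scaler is None else [imputer, scaler]
        model = make_pipeline(*steps, LinearRegression())
        scores[(fill_name, scale_name)] = cross_val_score(model, X, y, cv=5).mean()

for key, value in scores.items():
    print(key, round(value, 4))
```

With an ordinary least-squares model and non-collinear features, rescaling is only an affine change of the inputs, so the scaled and unscaled scores in this sketch should agree up to numerical precision. The fill strategy is what changes the result here.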

**Homework: (Kaggle) Titanic Survival Prediction** <br />
https://www.kaggle.com/c/titanic <br />
<br />
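**[Homework Objectives]**<br />
Following the example above, observe how filling missing values and standardization / min-max scaling affect the numerical features in the Titanic survival prediction task<br />
**[Homework Key Points]**<br />
Observe how different missing-value imputation methods affect the features<br />
Observe how different feature scaling methods affect the features <br />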

In [67]:
#import the LogisticRegression
from sklearn.linear_model import LogisticRegression

titanic_train = pd.read_csv(data_dir + 'titanic_train.csv')
titanic_test = pd.read_csv(data_dir + 'titanic_test.csv')

titanic_label = titanic_train.Survived
titanic_ids = titanic_test.PassengerId
titanic_train = titanic_train.drop(['Survived','PassengerId'] , axis = 1)
titanic_test = titanic_test.drop(['PassengerId'], axis = 1)

titanic_df = pd.concat([titanic_train,titanic_test])
titanic_df.head(5)

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [69]:
titanic_numerical_features = np.array( titanic_df.columns[titanic_df.dtypes != 'object'], dtype = str)
print(f' {len(titanic_numerical_features)} Numerical Features : {titanic_numerical_features} \n')

 5 Numerical Features : ['Pclass' 'Age' 'SibSp' 'Parch' 'Fare'] 



In [71]:
titanic_df_numerical = titanic_df[titanic_numerical_features]
titanic_train_num = len(titanic_label)
titanic_df_numerical.head(5)

| | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|
| 0 | 3 | 22.0 | 1 | 0 | 7.25 |
| 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 26.0 | 0 | 0 | 7.925 |
| 3 | 1 | 35.0 | 1 | 0 | 53.1 |
| 4 | 3 | 35.0 | 0 | 0 | 8.05 |


**Homework 1** <br />
In the missing-value block, try swapping in and running two or more imputation methods. Which one works better?

In [72]:
#Fill the NAs with -1, and do the logistic regression
titanic_df_numerical_m1 = titanic_df_numerical.fillna(-1)
train_x = titanic_df_numerical_m1[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6960299128976762

In [73]:
#Fill the NAs with mean value, and do the logistic regression
titanic_df_numerical_mean = titanic_df_numerical.fillna(titanic_df_numerical.mean())
train_x = titanic_df_numerical_mean[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6981761033723469

In [74]:
#Fill the NAs with median value, and do the logistic regression
titanic_df_numerical_median = titanic_df_numerical.fillna(titanic_df_numerical.median())
train_x = titanic_df_numerical_median[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6992934218081011

**Answer of HW1**<br />
<br />
**From the results above, filling the NAs with the median gives the best score (0.6993, versus 0.6982 for the mean and 0.6960 for -1), although the differences are small. This might mean that the original data is skewed.**
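
The skewness guess can be checked directly by comparing the mean with the median and looking at the sample skewness. The sketch below does this on a small made-up series: the first five values are the `Fare` entries shown in the `head()` output above, while the `NaN` and the `500.0` are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

# First five values come from the Fare column shown earlier;
# the NaN and the large 500.0 entry are invented for this illustration
fare_like = pd.Series([7.25, 71.2833, 7.925, 53.1, 8.05, np.nan, 500.0])

mean_value = fare_like.mean()      # NaN is skipped by default
median_value = fare_like.median()  # NaN is skipped by default
skewness = fare_like.skew()        # positive for a long right tail

print(f'mean={mean_value:.3f}, median={median_value:.3f}, skew={skewness:.3f}')

# In a right-skewed column the mean is pulled toward the tail,
# so mean-filling inserts a larger value than median-filling does
filled_mean = fare_like.fillna(mean_value)
filled_median = fare_like.fillna(median_value)
print(filled_mean.iloc[5], filled_median.iloc[5])
```

When the mean sits well above the median like this, the median is often the more representative fill value, which is consistent with the small edge seen for the median in Homework 1.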

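**Homework 2**<br />
Using different scaling methods (original values / min-max scaling / standardization) together with a logistic regression model, which one works best?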

In [76]:
#original data with -1 fillna
titanic_df_numerical_m1 = titanic_df_numerical.fillna(-1)
train_x = titanic_df_numerical_m1[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6960299128976762

In [78]:
#MinMax feature scaling
titanic_df_numerical_MinMax = MinMaxScaler().fit_transform(titanic_df_numerical_m1)
train_x = titanic_df_numerical_MinMax[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6971346062663598

In [79]:
#Standard feature scaling
titanic_df_numerical_Standard = StandardScaler().fit_transform(titanic_df_numerical_m1)
train_x = titanic_df_numerical_Standard[0:titanic_train_num]
LogReg = LogisticRegression()
cross_val_score(LogReg,train_x,titanic_label,cv=5).mean()

0.6982582017719778

**Answer of HW2**<br />
<br />
**From the results above, feature scaling appears to help slightly: standard scaling (0.6983) scores better than min-max scaling (0.6971), and both beat the unscaled data (0.6960). This might be because we have not yet handled outliers.**
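
To see why unhandled outliers could matter more for min-max scaling, it helps to scale a tiny column that contains one extreme value. This is a minimal sketch on invented numbers (`values` is hypothetical, not Titanic data):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Four ordinary values plus one outlier (made-up data)
values = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

minmax_scaled = MinMaxScaler().fit_transform(values).ravel()
standard_scaled = StandardScaler().fit_transform(values).ravel()

# Min-max scaling pins the outlier at 1 and squeezes the rest close to 0
print('minmax  :', np.round(minmax_scaled, 4))
# Standardization is also affected by the outlier, but the output is not
# forced into a fixed range defined by the extreme value
print('standard:', np.round(standard_scaled, 4))
```

Here the four ordinary values all land within about 0.03 of each other after min-max scaling, because the range is set entirely by the single outlier. Clipping or removing outliers before scaling would be one way to test the explanation given in the answer above.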