**Example: (Kaggle) House price prediction, minimal version** <br />
https://www.kaggle.com/c/house-prices-advanced-regression-techniques<br />
<br />
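This is a minimal version of the house price prediction example.<br />
It uses the least possible feature engineering plus a linear regression model to make predictions, and ends by writing a prediction file that can be submitted to Kaggle.<br />
**[Learning objective]**<br />
The code below is similar to Day16, but the focus here is on how feature engineering is used; later lessons cover how to tune this part.<br />
**[Key points of the example]**<br />
Minimal feature engineering: how filling missing values (fillna), label encoding (LabelEncoder), and<br />
min-max scaling (MinMaxScaler) are used together in a single code block.<br />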


In [4]:
#import the modules we need
import numpy as np
import pandas as pd
import os
import warnings

warnings.filterwarnings('ignore')

#Modules for label encoding and data scaling
from sklearn.preprocessing import LabelEncoder, MinMaxScaler

In [5]:
#Path for train & test data
data_dir = './data/'
train_dir = os.path.join(data_dir,'train.csv')
test_dir = os.path.join(data_dir,'test.csv')

train_df = pd.read_csv(train_dir)
test_df = pd.read_csv(test_dir)

In [6]:
#Check the shape first
print(train_df.shape)
print(test_df.shape)

(1460, 81)
(1459, 80)


In [7]:
#Record the label and ID columns
train_label = np.log1p(train_df.SalePrice)
ids = test_df.Id

#Discard the columns we don't need in model training
train_df = train_df.drop(['Id','SalePrice'], axis = 1)
test_df = test_df.drop(['Id'], axis = 1)

#Concatenate the two dataframes so feature engineering is applied to both at once
df = pd.concat([train_df,test_df])
print('Shape : ', df.shape)
print(df.head(5))

Shape :  (2919, 79)
   MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0          60       RL         65.0     8450   Pave   NaN      Reg   
1          20       RL         80.0     9600   Pave   NaN      Reg   
2          60       RL         68.0    11250   Pave   NaN      IR1   
3          70       RL         60.0     9550   Pave   NaN      IR1   
4          60       RL         84.0    14260   Pave   NaN      IR1   

  LandContour Utilities LotConfig  ... ScreenPorch PoolArea PoolQC Fence  \
0         Lvl    AllPub    Inside  ...           0        0    NaN   NaN   
1         Lvl    AllPub       FR2  ...           0        0    NaN   NaN   
2         Lvl    AllPub    Inside  ...           0        0    NaN   NaN   
3         Lvl    AllPub    Corner  ...           0        0    NaN   NaN   
4         Lvl    AllPub       FR2  ...           0        0    NaN   NaN   

  MiscFeature MiscVal  MoSold  YrSold  SaleType  SaleCondition  
0         NaN       0       2    2008

In [8]:
#We do the simple feature engineering here
# 1.Fill NAs: 'None' for object columns, -1 for numerical columns
# 2.Do the label encoding for object columns
# 3.Do the data scaling for numerical columns

le = LabelEncoder()
scaler = MinMaxScaler()
for c in df.columns:
    if df[c].dtype == 'object':
        df[c] = df[c].fillna('None')
        df[c] = le.fit_transform(df[c])
    else:
        df[c] = df[c].fillna(-1)
        df[c] = scaler.fit_transform(df[c].values.reshape(-1,1))
df.head(5)

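| | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.235294 | 4 | 0.210191 | 0.03342 | 1 | 1 | 3 | 3 | 0 | 4 | ... | 0.0 | 0.0 | 3 | 4 | 1 | 0.0 | 0.090909 | 0.5 | 9 | 4 |
| 1 | 0.0 | 4 | 0.257962 | 0.038795 | 1 | 1 | 3 | 3 | 0 | 2 | ... | 0.0 | 0.0 | 3 | 4 | 1 | 0.0 | 0.363636 | 0.25 | 9 | 4 |
| 2 | 0.235294 | 4 | 0.219745 | 0.046507 | 1 | 1 | 0 | 3 | 0 | 4 | ... | 0.0 | 0.0 | 3 | 4 | 1 | 0.0 | 0.727273 | 0.5 | 9 | 4 |
| 3 | 0.294118 | 4 | 0.194268 | 0.038561 | 1 | 1 | 0 | 3 | 0 | 0 | ... | 0.0 | 0.0 | 3 | 4 | 1 | 0.0 | 0.090909 | 0.0 | 9 | 0 |
| 4 | 0.235294 | 4 | 0.270701 | 0.060576 | 1 | 1 | 0 | 3 | 0 | 2 | ... | 0.0 | 0.0 | 3 | 4 | 1 | 0.0 | 1.0 | 0.5 | 9 | 4 |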
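
A minimal sketch of what the label-encoding branch does to a single column. The toy series below is illustrative rather than taken from the dataset, and it assumes the column holds only `Grvl`, `Pave`, and missing values. `LabelEncoder` sorts the distinct values and numbers them in that order, so the filled-in `'None'` becomes a class of its own:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Illustrative column with missing entries (hypothetical values)
alley = pd.Series(["Grvl", np.nan, "Pave", np.nan])

# Same two steps as the object-column branch: fill, then encode
filled = alley.fillna("None")
encoder = LabelEncoder()
codes = encoder.fit_transform(filled)

print(list(encoder.classes_))  # ['Grvl', 'None', 'Pave']
print(list(codes))             # [0, 1, 2, 1]
```

The codes follow alphabetical order only, yet a linear model will still read them as a numeric scale, which is one reason this encoding counts as a minimal baseline.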


In [9]:
#After the feature engineering is done, split df back into train and test sets by the original train length
train_len = len(train_label)
train_x = df[0:train_len]
test_x = df[train_len:]

#Using Linear regression to train the model, and predict the saleprice of test data
from sklearn.linear_model import LinearRegression
LR = LinearRegression()
LR.fit(train_x,train_label)

prediction = LR.predict(test_x)
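
The split above relies on `pd.concat` keeping rows in order, so the first `train_len` rows are the training rows. A small sketch with two hypothetical toy frames shows that the split is positional, even though the concatenated index contains duplicate labels:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test frames
train_part = pd.DataFrame({"x": [1, 2, 3]})
test_part = pd.DataFrame({"x": [10, 20]})

combined = pd.concat([train_part, test_part])
print(combined.index.tolist())  # [0, 1, 2, 0, 1]: labels repeat after concat

# Split by position, not by index label
n_train = len(train_part)
train_back = combined.iloc[:n_train]
test_back = combined.iloc[n_train:]

print(train_back["x"].tolist())  # [1, 2, 3]
print(test_back["x"].tolist())   # [10, 20]
```

Using `.iloc` makes the positional intent explicit; the slice `df[0:train_len]` in the cell above is also positional for integer slices.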

In [10]:
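#The label was trained as log1p(SalePrice), so map the predictions back to prices
prediction = np.expm1(prediction)
#Build the submission file in the Id / SalePrice format
submission = pd.DataFrame( {'Id':ids, 'SalePrice' : prediction})
submission.to_csv('./data/house_submission.csv',index = False)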
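
The label was stored as `np.log1p(SalePrice)` before training, so the model predicts on the log scale and `np.expm1` converts the predictions back to prices. A small sketch of that round trip, using made-up prices rather than real dataset rows:

```python
import numpy as np

# Made-up sale prices, for illustration only
prices = np.array([100000.0, 250000.0, 755000.0])

# Forward transform used on the training label: log(1 + x)
log_prices = np.log1p(prices)

# Inverse transform used on the predictions: exp(y) - 1
recovered = np.expm1(log_prices)

print(np.allclose(recovered, prices))  # True
```

Training on the log scale compresses the long right tail of prices, so a few very expensive houses pull less on the least-squares fit.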

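**Homework: (Kaggle) Titanic survival prediction, minimal version**<br />
https://www.kaggle.com/c/titanic<br />
<br />
**[Homework objective]**<br />
Without relying on explanations, use only the code below to answer the following questions and get a first sense of which block is the "feature engineering" block.<br />
**[Homework key points]**<br />
Without relying on comments, answer the following questions using what you have learned so far.<br />
Question 1<br />
Of the five code blocks A to E below, which one is the feature engineering?<br />
A: block C<br />
<br />
Question 2<br />
Comparing the outputs of code blocks B and C, which columns are "categorical columns"? (The English column names are enough.)<br />
A: Name, Sex, Ticket, Cabin, Embarked<br />
<br />
Question 3<br />
Continuing from the previous question, which column is the "target value"?<br />
A: Survived<br />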
<br />
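
The categorical-column answer above can also be checked programmatically. The sketch below rebuilds the first three rows shown in the output of code block B by hand (an assumption standing in for the real CSV) and lists every non-numeric column with `select_dtypes`:

```python
import pandas as pd

# First three rows as displayed in block B, typed in by hand for illustration
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Name": ["Braund, Mr. Owen Harris",
             "Cumings, Mrs. John Bradley (Florence Briggs Th...",
             "Heikkinen, Miss. Laina"],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 0],
    "Ticket": ["A/5 21171", "PC 17599", "STON/O2. 3101282"],
    "Fare": [7.25, 71.2833, 7.925],
    "Cabin": [None, "C85", None],
    "Embarked": ["S", "C", "S"],
})

# Every column that is not numeric is treated as categorical here
categorical = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical)  # ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked']
```

These are the same columns that take the label-encoding branch in code block C, which is why they show up as integers in its output.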

In [11]:
# Code block A
df_train = pd.read_csv(os.path.join(data_dir, 'titanic_train.csv'))
df_test = pd.read_csv(os.path.join(data_dir, 'titanic_test.csv'))

print(df_train.shape)
print(df_test.shape)

(891, 12)
(418, 11)


In [28]:
# Code block B
train_y = df_train['Survived']
ids = df_test['PassengerId']
df_train = df_train.drop(['Survived','PassengerId'], axis = 1)
df_test = df_test.drop(['PassengerId'], axis = 1)

df_titanic = pd.concat([df_train,df_test])
df_titanic.head(5)

| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [29]:
# Code block C -> this is the feature engineering block
le = LabelEncoder()
scaler = MinMaxScaler()

for col in df_titanic.columns:
    if df_titanic[col].dtype == 'object':
        df_titanic[col] = df_titanic[col].fillna('None')
        df_titanic[col] = le.fit_transform(df_titanic[col])
    else:
        df_titanic[col] = df_titanic[col].fillna(-1)
        df_titanic[col] = scaler.fit_transform(df_titanic[col].values.reshape(-1,1))
df_titanic.head(5)

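| | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 155 | 1 | 0.283951 | 0.125 | 0.0 | 720 | 0.016072 | 185 | 3 |
| 1 | 0.0 | 286 | 0 | 0.481481 | 0.125 | 0.0 | 816 | 0.140813 | 106 | 0 |
| 2 | 1.0 | 523 | 0 | 0.333333 | 0.0 | 0.0 | 914 | 0.017387 | 185 | 3 |
| 3 | 0.0 | 422 | 0 | 0.444444 | 0.125 | 0.0 | 65 | 0.10539 | 70 | 3 |
| 4 | 1.0 | 22 | 1 | 0.444444 | 0.0 | 0.0 | 649 | 0.01763 | 185 | 3 |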
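
The scaled `Age` values above depend on the `-1` fill, because the placeholder becomes the column minimum that `MinMaxScaler` maps to 0. A sketch with a toy column, assuming the largest age is 80 (an assumption about the data, not stated in the notebook):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy ages with one missing value; 80 is the assumed column maximum
age = pd.Series([22.0, 38.0, np.nan, 80.0])

# Same two steps as the numeric branch: fill with -1, then scale to [0, 1]
filled = age.fillna(-1)
scaled = MinMaxScaler().fit_transform(filled.values.reshape(-1, 1)).ravel()

# (x - min) / (max - min) with min = -1 and max = 80
print(np.round(scaled, 6))
```

Under that assumption, age 22 maps to 23/81 and age 38 to 39/81, which agrees with the 0.283951 and 0.481481 shown for the first two rows.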


In [30]:
# Code block D
df_train_len = len(train_y)
df_train_x = df_titanic[0:df_train_len]
df_test_x = df_titanic[df_train_len:]

from sklearn.linear_model import LogisticRegression
LR = LogisticRegression()
LR.fit(df_train_x,train_y)
prediction = LR.predict(df_test_x)

In [31]:
# Code block E
titanic_submission = pd.DataFrame({'PassengerId':ids,'Survived':prediction})
titanic_submission.to_csv('./data/titanic_submission.csv', index = False)