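**Example: (Kaggle) House Price Prediction** <br />
Below we use the house price prediction data to look at several types of features.<br />
This dataset has three column dtypes, 'int64', 'float64', and 'object'; the column names of each dtype are recorded in a Python list.<br />
**[Learning Objectives]**<br />
The code below demonstrates how to separate column names by dtype and how to display part of the data for a chosen dtype.<br />
**[Key Points]**<br />
How to see which column dtypes the current DataFrame contains, and how many columns there are of each <br />
How to separate column names by column dtype <br />
How to display only the columns of a particular dtype <br />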

In [17]:
#Import modules we need
import numpy as np
import pandas as pd
import warnings

warnings.filterwarnings('ignore')

#read the train & test data of house prediction task
data_dir = './data/'
train_df = pd.read_csv(data_dir + 'train.csv')
test_df = pd.read_csv(data_dir + 'test.csv')

print(train_df.shape)
print(test_df.shape)

(1460, 81)
(1459, 80)


In [18]:
#record the training label and discard the ID and training label columns in train & test dataset
train_label = np.log1p(train_df.SalePrice)
ids = test_df.Id
train_df = train_df.drop(['Id','SalePrice'], axis = 1)
test_df = test_df.drop(['Id'], axis = 1)

#Merge train & test dataset for feature engineering
df = pd.concat([train_df,test_df])
df.head(5)

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,0,,,,0,2,2008,WD,Normal
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,0,,,,0,5,2007,WD,Normal
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,9,2008,WD,Normal
3,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,...,0,0,,,,0,2,2006,WD,Abnorml
4,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,...,0,0,,,,0,12,2008,WD,Normal
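
The cell above stores the label as `np.log1p(SalePrice)`. The notebook does not say why; a common reason is that prices are right-skewed and the log compresses the long tail. The following is a minimal sketch with made-up prices (not the Kaggle data) showing that `np.expm1` undoes the transform, so predictions made on the log scale can be mapped back to prices.

```python
import numpy as np

# Hypothetical sale prices, used only for illustration
prices = np.array([100000.0, 200000.0, 755000.0])

# log1p(x) computes log(1 + x), which compresses large values
log_prices = np.log1p(prices)

# expm1(y) computes exp(y) - 1, the inverse of log1p
recovered = np.expm1(log_prices)

print(log_prices)
print(np.allclose(recovered, prices))
```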


In [19]:
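# df.dtypes is a Series indexed by column name; reset_index turns it into two columns
df_dtype = df.dtypes.reset_index()
# The first column holds the column names (these get counted), the second holds each dtype
df_dtype.columns = ['Count','Column Type']
df_dtype = df_dtype.groupby('Column Type').aggregate('count').reset_index()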
df_dtype

Unnamed: 0,Column Type,Count
0,int64,25
1,float64,11
2,object,43
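
The same count per dtype can probably be obtained more directly with `value_counts()` on `df.dtypes`. The sketch below uses a small hypothetical DataFrame (`toy_df`, not the house price data) so that it runs on its own; the string column is created with an explicit `object` dtype so the result does not depend on the pandas version's default string type.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame
toy_df = pd.DataFrame({
    'rooms': [3, 4, 2],
    'area': [70.5, 92.0, 55.3],
    'zone': pd.Series(['RL', 'RM', 'RL'], dtype=object),
    'floors': [1, 2, 1],
})

# Each entry of toy_df.dtypes is a dtype; value_counts tallies how often each occurs
dtype_counts = toy_df.dtypes.value_counts()
print(dtype_counts)
```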


In [22]:
int_features = []
float_features = []
object_features = []

for dtype, feature in zip(df.dtypes,df.columns):
    if dtype == 'float64':
        float_features.append(feature)
    elif dtype == 'int64':
        int_features.append(feature)
    else:
        object_features.append(feature)
        
print(f' {len(int_features)} Integer Features : {int_features} \n' )
print(f' {len(float_features)} Float Features : {float_features} \n')
print(f' {len(object_features)} Object Features : {object_features} \n')

 25 Integer Features : ['MSSubClass', 'LotArea', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF', 'GrLivArea', 'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold'] 

 11 Float Features : ['LotFrontage', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath', 'GarageYrBlt', 'GarageCars', 'GarageArea'] 

 43 Object Features : ['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional'
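
As an alternative to the explicit loop, pandas offers `DataFrame.select_dtypes`, which filters columns by dtype. The sketch below applies it to a small hypothetical DataFrame (`toy_df`); on the real data the same calls would be made on `df`. Note that, unlike the loop's `else` branch, `include='object'` selects only object columns and not every remaining dtype.

```python
import pandas as pd

# Hypothetical toy data standing in for the real DataFrame
toy_df = pd.DataFrame({
    'rooms': [3, 4, 2],
    'area': [70.5, 92.0, 55.3],
    'zone': pd.Series(['RL', 'RM', 'RL'], dtype=object),
    'floors': [1, 2, 1],
})

# select_dtypes returns a sub-DataFrame; .columns.tolist() gives the names
int_cols = toy_df.select_dtypes(include='int64').columns.tolist()
float_cols = toy_df.select_dtypes(include='float64').columns.tolist()
object_cols = toy_df.select_dtypes(include='object').columns.tolist()

print(int_cols, float_cols, object_cols)
```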

In [23]:
df[float_features].head(5)

Unnamed: 0,LotFrontage,MasVnrArea,BsmtFinSF1,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,BsmtFullBath,BsmtHalfBath,GarageYrBlt,GarageCars,GarageArea
0,65.0,196.0,706.0,0.0,150.0,856.0,1.0,0.0,2003.0,2.0,548.0
1,80.0,0.0,978.0,0.0,284.0,1262.0,0.0,1.0,1976.0,2.0,460.0
2,68.0,162.0,486.0,0.0,434.0,920.0,1.0,0.0,2001.0,2.0,608.0
3,60.0,0.0,216.0,0.0,540.0,756.0,1.0,0.0,1998.0,3.0,642.0
4,84.0,350.0,655.0,0.0,490.0,1145.0,1.0,0.0,2000.0,3.0,836.0


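**Homework: (Kaggle) Titanic Survival Prediction**<br />
**[Homework Objectives]**<br />
Try to complete three data operations on each of the three feature types, and observe the results.<br />
Think about which of the three feature types should be the most complex / hardest to handle.<br />
**[Homework Key Points]**<br />
Complete the remaining eight type x operation combinations <br />
Think about which feature type should be the most complex<br />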

In [24]:
#loading the train and test dataset of titanic task
df_train = pd.read_csv(data_dir + 'titanic_train.csv')
df_test = pd.read_csv(data_dir + 'titanic_test.csv')

print(df_train.shape)
print(df_test.shape)

(891, 12)
(418, 11)


In [25]:
train_y = df_train.Survived
ids = df_test.PassengerId
df_train = df_train.drop(['PassengerId','Survived'], axis = 1)
df_test = df_test.drop(['PassengerId'], axis = 1)

titanic_df = pd.concat([df_train,df_test])
titanic_df.head(5)

Unnamed: 0,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [29]:
titanic_dtypes = titanic_df.dtypes.reset_index()
titanic_dtypes.columns = ['Count','Column Type']
titanic_dtypes = titanic_dtypes.groupby('Column Type').aggregate('count').reset_index()
titanic_dtypes

Unnamed: 0,Column Type,Count
0,int64,3
1,float64,2
2,object,5


In [30]:
t_int_features = []
t_float_features = []
t_object_features = []

for dtype, feature in zip(titanic_df.dtypes,titanic_df.columns):
    if dtype == 'int64':
        t_int_features.append(feature)
    elif dtype == 'float64':
        t_float_features.append(feature)
    else :
        t_object_features.append(feature)
        
print(f' {len(t_int_features)} Integer Features : {t_int_features} \n')
print(f' {len(t_float_features)} Float Features : {t_float_features} \n')
print(f' {len(t_object_features)} Object Features : {t_object_features} \n')

 3 Integer Features : ['Pclass', 'SibSp', 'Parch'] 

 2 Float Features : ['Age', 'Fare'] 

 5 Object Features : ['Name', 'Sex', 'Ticket', 'Cabin', 'Embarked'] 



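**Homework 1** <br />
Run the homework code and observe what problems arise in the nine operations obtained by applying (mean / max / nunique) to columns of the three types (int / float / object). <br />
Then try to explain why the code blocks that raise errors do so. <br />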

In [42]:
#Mean values of int type data in titanic df
print(titanic_df[t_int_features].mean())
#Max values of int type data in titanic df
print(titanic_df[t_int_features].max())
#number of unique values of int type data in titanic df
print(titanic_df[t_int_features].nunique())

Pclass    2.294882
SibSp     0.498854
Parch     0.385027
dtype: float64
Pclass    3
SibSp     8
Parch     9
dtype: int64
Pclass    3
SibSp     7
Parch     8
dtype: int64


In [43]:
#Mean values of float type data in titanic df
print(titanic_df[t_float_features].mean())
#Max values of float type data in titanic df
print(titanic_df[t_float_features].max())
#number of unique values of float type data in titanic df
print(titanic_df[t_float_features].nunique())

Age     29.881138
Fare    33.295479
dtype: float64
Age      80.0000
Fare    512.3292
dtype: float64
Age      98
Fare    281
dtype: int64


In [44]:
#Mean values of object type data in titanic df
print(titanic_df[t_object_features].mean())
#Max values of object type data in titanic df
print(titanic_df[t_object_features].max())
#number of unique values of object type data in titanic df
print(titanic_df[t_object_features].nunique())

Series([], dtype: float64)
Name      van Melkebeke, Mr. Philemon
Sex                              male
Ticket                      WE/P 5735
dtype: object
Name        1307
Sex            2
Ticket       929
Cabin        186
Embarked       3
dtype: int64


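**Answer of HW 1** <br />
<br />
All three operations work on the int and float columns. <br />
The mean and max operations are not well suited to object columns, because a mean is undefined for strings and a max is only a lexicographic ordering. In the output above, mean returns an empty Series, and max reports only Name, Sex, and Ticket; Cabin and Embarked contain missing values, which is the likely reason they are dropped. Newer pandas versions may raise a TypeError here instead of skipping columns silently.<br />
The nunique operation works on all three types and is informative: it shows how many distinct values each column holds, which hints at the distribution of the data. <br />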
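
The behaviour described in this answer can be reproduced on a small hypothetical DataFrame (`toy`, not the Titanic data). This is a sketch under the assumption that, depending on the pandas version, `mean()` on object columns either raises an exception or returns an empty result; the code treats both as "mean is not usable here".

```python
import numpy as np
import pandas as pd

# Hypothetical object-typed columns; 'Cabin' contains a missing value
toy = pd.DataFrame({
    'Sex': pd.Series(['male', 'female', 'male'], dtype=object),
    'Cabin': pd.Series(['C85', np.nan, 'C123'], dtype=object),
})

# nunique is defined for any dtype; NaN is not counted by default
unique_counts = toy.nunique()

# mean has no meaning for strings: older pandas returns an empty Series,
# newer pandas raises an exception
try:
    mean_result = toy.mean()
    mean_failed = len(mean_result) == 0
except (TypeError, ValueError):
    mean_failed = True

# max on a string column without NaN is just the lexicographically largest value
sex_max = toy['Sex'].max()

print(unique_counts)
print(mean_failed, sex_max)
```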

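**Homework 2** <br />
Think about it: try to name one or more data types beyond the five types covered today. Can the new type(s) you named be placed into one of the three major categories? <br />
So, among the three major categories of features, which should be the most complex to handle? <br />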

**Answer of HW 2** <br />
1. Image data: it might be convertible to a binary data type (e.g., MNIST).<br />
2. I think the object type is probably the hardest to process. We need to understand the information held in object data very well before doing any feature engineering on it.<br />
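
One way to see why object columns need this extra understanding is that the same dtype can hide very different kinds of content. The sketch below uses a hypothetical four-row sample (`toy`) and an illustrative measure, the share of distinct values per column (`unique_ratio`, a name introduced here). A ratio near 1 suggests free text or identifiers such as Name, while a low ratio suggests a categorical label such as Sex.

```python
import pandas as pd

# Hypothetical sample with two object columns of very different character
toy = pd.DataFrame({
    'Name': pd.Series([
        'Braund, Mr. Owen Harris',
        'Heikkinen, Miss. Laina',
        'Allen, Mr. William Henry',
        'Futrelle, Mrs. Jacques Heath',
    ], dtype=object),
    'Sex': pd.Series(['male', 'female', 'male', 'female'], dtype=object),
})

# Share of distinct values per column: number of unique values / number of rows
unique_ratio = toy.nunique() / len(toy)
print(unique_ratio)
```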