# House Prices Competition: Term Project

#### Description:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

#### To-do list:

* Write functions for each data preparation and processing method
* Read about feature engineering and feature selection
* Apply PCA
* Work out how to select the most important non-numerical features


### Importing Libraries:

In [770]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA


import matplotlib.pyplot as plt
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

In [771]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [772]:
train.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [773]:
test.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal


In [774]:
print ("Train data shape:", train.shape)

Train data shape: (1460, 81)


In [775]:
print ("Test data shape:", test.shape)

Test data shape: (1459, 80)


### Feature engineering:

The same transformations must be applied to both the training and the test data during feature engineering.
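
One way to keep the two datasets in step is to wrap each transformation in a function and call it on both frames. The sketch below is illustrative only: `add_mapped_label` and the toy frames are hypothetical names and data, not part of the notebook.

```python
import pandas as pd

def add_mapped_label(df, column, mapping, missing_value=0):
    """Return a copy of df with a new '<column>Label' column built from mapping."""
    out = df.copy()
    # Values absent from the mapping (including nulls) become missing_value.
    out[column + 'Label'] = out[column].map(mapping).fillna(missing_value).astype(int)
    return out

# Toy stand-ins for the train and test frames (hypothetical data).
toy_train = pd.DataFrame({'Alley': ['Grvl', None, 'Pave']})
toy_test = pd.DataFrame({'Alley': [None, 'Pave']})

# The same mapping and the same function are used for both frames.
alley_mapping = {'Grvl': 1, 'Pave': 2}
toy_train = add_mapped_label(toy_train, 'Alley', alley_mapping)
toy_test = add_mapped_label(toy_test, 'Alley', alley_mapping)

print(toy_train)
print(toy_test)
```

Because the mapping is defined once and passed to both calls, the training and test encodings cannot drift apart.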

### Handling non-numerical features:

In [776]:
categoricals = train.select_dtypes(exclude=[np.number])
categoricals_test = test.select_dtypes(exclude=[np.number])
#numericals = train.select_dtypes(include=[np.number])
categoricals.describe()
#numericals.head()

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,...,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
count,1460,1460,91,1460,1460,1460,1460,1460,1460,1460,...,1379,1379,1379,1379,1460,7,281,54,1460,1460
unique,5,2,2,4,4,2,5,3,25,9,...,6,3,5,5,3,3,4,4,9,6
top,RL,Pave,Grvl,Reg,Lvl,AllPub,Inside,Gtl,NAmes,Norm,...,Attchd,Unf,TA,TA,Y,Gd,MnPrv,Shed,WD,Normal
freq,1151,1454,50,925,1311,1459,1052,1382,225,1260,...,870,605,1311,1326,1340,3,157,49,1267,1198


In [777]:
#to check how many categories we have per feature
#for feature in categoricals:
    #print ("Unique values of ",feature," : " , train[feature].unique())

In [778]:
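# Note: nominal_features_test is only defined in a later cell (In [792]);
# the loop uses its column names to look up the unique values in train.
for feature in nominal_features_test:
    print("Unique values of ", feature, " : ", train[feature].unique())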

Unique values of  MSZoning  :  ['RL' 'RM' 'C (all)' 'FV' 'RH']
Unique values of  Street  :  ['Pave' 'Grvl']
Unique values of  LandContour  :  ['Lvl' 'Bnk' 'Low' 'HLS']
Unique values of  LotConfig  :  ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
Unique values of  Neighborhood  :  ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
Unique values of  Condition1  :  ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
Unique values of  Condition2  :  ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
Unique values of  BldgType  :  ['1Fam' '2fmCon' 'Duplex' 'TwnhsE' 'Twnhs']
Unique values of  HouseStyle  :  ['2Story' '1Story' '1.5Fin' '1.5Unf' 'SFoyer' 'SLvl' '2.5Unf' '2.5Fin']
Unique values of  RoofStyle  :  ['Gable' 'Hip' 'Gambrel' 'Mansard' 'Flat' 'Shed']
Unique values of  Ro

First, we deal with nominal features whose null values may themselves be informative, for example:
- Alley
- MasVnrType
- GarageType
- MiscFeature
- Electrical
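
To find candidates like these, it can help to count the missing values per column first. The following is a small sketch on a hypothetical toy frame; in the notebook the same calls would presumably be applied to `categoricals`.

```python
import pandas as pd

# Hypothetical toy data standing in for the categorical columns.
toy = pd.DataFrame({
    'Alley': [None, 'Grvl', None, None],
    'Electrical': ['SBrkr', 'SBrkr', None, 'FuseA'],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# Count nulls per column, then keep only the columns that contain any.
null_counts = toy.isna().sum()
columns_with_null = null_counts[null_counts > 0].sort_values(ascending=False)
print(columns_with_null)
```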


In [779]:
categoricals_with_null = categoricals[['Alley','MasVnrType','GarageType','MiscFeature','Electrical']]
categoricals_with_null.head()

Unnamed: 0,Alley,MasVnrType,GarageType,MiscFeature,Electrical
0,,BrkFace,Attchd,,SBrkr
1,,,Attchd,,SBrkr
2,,BrkFace,Attchd,,SBrkr
3,,,Detchd,,SBrkr
4,,BrkFace,Attchd,,SBrkr


In [780]:
categoricals_with_null_test = categoricals_test[['Alley','MasVnrType','GarageType','MiscFeature','Electrical']]

In [781]:
categoricals_with_null['Alley'].unique()

alleyLabel = {'Grvl': 1, 'Pave': 2, None: 0}

categoricals_with_null['AlleyLabel'] = categoricals_with_null['Alley'].map(alleyLabel)
categoricals_with_null[['Alley', 'AlleyLabel']].iloc[20:25]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  """


Unnamed: 0,Alley,AlleyLabel
20,,0
21,Grvl,1
22,,0
23,,0
24,,0
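
The warning above (pandas' `SettingWithCopyWarning`) appears because `categoricals_with_null` is a column selection taken from `categoricals`, so pandas cannot tell whether the assignment is meant to modify the original frame. A minimal sketch of one common way to avoid it, assuming an explicit `.copy()` is acceptable here (the toy frame is a hypothetical stand-in for `categoricals`):

```python
import pandas as pd

# Hypothetical stand-in for the categoricals frame.
toy_categoricals = pd.DataFrame({
    'Alley': [None, 'Grvl', None, 'Pave'],
    'Street': ['Pave', 'Pave', 'Grvl', 'Pave'],
})

# An explicit copy makes the subset an independent DataFrame,
# so adding a column to it does not trigger the warning.
subset = toy_categoricals[['Alley']].copy()

# Map the known categories, then fill the missing ones with 0 explicitly.
alley_label = {'Grvl': 1, 'Pave': 2}
subset['AlleyLabel'] = subset['Alley'].map(alley_label).fillna(0).astype(int)
print(subset)
```

The original frame is left untouched; only `subset` gains the new column.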


In [782]:
categoricals_with_null_test['Alley'].unique()

alleyLabel = {'Grvl': 1, 'Pave': 2, None: 0}

categoricals_with_null_test['AlleyLabel'] = categoricals_with_null_test['Alley'].map(alleyLabel)
categoricals_with_null_test[['Alley', 'AlleyLabel']].iloc[20:25]


Unnamed: 0,Alley,AlleyLabel
20,,0
21,,0
22,,0
23,,0
24,,0


In [783]:
categoricals_with_null['MasVnrType'].unique()

MasVnrTypeLabel = {'BrkFace': 1, 'Stone': 2,'BrkCmn': 3, 'None': 0}

categoricals_with_null['MasVnrTypeLabel'] = categoricals_with_null['MasVnrType'].map(MasVnrTypeLabel)
categoricals_with_null[['MasVnrType', 'MasVnrTypeLabel']].iloc[10:18]


Unnamed: 0,MasVnrType,MasVnrTypeLabel
10,,0.0
11,Stone,2.0
12,,0.0
13,Stone,2.0
14,BrkFace,1.0
15,,0.0
16,BrkFace,1.0
17,,0.0


In [784]:
categoricals_with_null_test['MasVnrType'].unique()

MasVnrTypeLabel = {'BrkFace': 1, 'Stone': 2,'BrkCmn': 3, 'None': 0}

categoricals_with_null_test['MasVnrTypeLabel'] = categoricals_with_null_test['MasVnrType'].map(MasVnrTypeLabel)
categoricals_with_null_test[['MasVnrType', 'MasVnrTypeLabel']].iloc[10:18]


Unnamed: 0,MasVnrType,MasVnrTypeLabel
10,,0.0
11,BrkFace,1.0
12,BrkFace,1.0
13,,0.0
14,,0.0
15,Stone,2.0
16,Stone,2.0
17,BrkFace,1.0


In [785]:
categoricals_with_null['GarageType'].unique()

GarageTypeLabel = {'Attchd' : 1, 'Detchd' : 2, 'BuiltIn' : 3, 'CarPort' : 3,'Basment': 3, '2Types': 4, None: 0}

categoricals_with_null['GarageTypeLabel'] = categoricals_with_null['GarageType'].map(GarageTypeLabel)
categoricals_with_null[['GarageType', 'GarageTypeLabel']].iloc[15:20]


Unnamed: 0,GarageType,GarageTypeLabel
15,Detchd,2
16,Attchd,1
17,CarPort,3
18,Detchd,2
19,Attchd,1


In [786]:
categoricals_with_null_test['GarageType'].unique()

GarageTypeLabel = {'Attchd' : 1, 'Detchd' : 2, 'BuiltIn' : 3, 'CarPort' : 3,'Basment': 3, '2Types': 4, None: 0}

categoricals_with_null_test['GarageTypeLabel'] = categoricals_with_null_test['GarageType'].map(GarageTypeLabel)
categoricals_with_null_test[['GarageType', 'GarageTypeLabel']].iloc[15:20]


Unnamed: 0,GarageType,GarageTypeLabel
15,Attchd,1
16,Attchd,1
17,Attchd,1
18,Attchd,1
19,Attchd,1


In [787]:
categoricals_with_null['MiscFeature'].unique()

MiscFeatureLabel = {'Shed' : 1, 'Gar2' : 2, 'Othr' : 3, 'TenC' : 4, None: 0}

categoricals_with_null['MiscFeatureLabel'] = categoricals_with_null['MiscFeature'].map(MiscFeatureLabel)
categoricals_with_null[['MiscFeature', 'MiscFeatureLabel']].iloc[5:10]


Unnamed: 0,MiscFeature,MiscFeatureLabel
5,Shed,1
6,,0
7,Shed,1
8,,0
9,,0


In [788]:
categoricals_with_null_test['MiscFeature'].unique()

MiscFeatureLabel = {'Shed' : 1, 'Gar2' : 2, 'Othr' : 3, 'TenC' : 4, None: 0}

categoricals_with_null_test['MiscFeatureLabel'] = categoricals_with_null_test['MiscFeature'].map(MiscFeatureLabel)
categoricals_with_null_test[['MiscFeature', 'MiscFeatureLabel']].iloc[5:10]


Unnamed: 0,MiscFeature,MiscFeatureLabel
5,,0
6,Shed,1
7,,0
8,,0
9,,0


In [789]:
#categoricals_with_null.Electrical = categoricals_with_null.Electrical.fillna('None')
#print(categoricals_with_null.Electrical.value_counts())

categoricals_with_null['Electrical'].unique()

electLabel = {'SBrkr' : 5, 'FuseF' : 4, 'FuseA' : 3, 'FuseP': 2, 'Mix' : 1, None: 0}

categoricals_with_null['ElectricalLabel'] = categoricals_with_null['Electrical'].map(electLabel)
categoricals_with_null[['Electrical', 'ElectricalLabel']].iloc[5:10]


Unnamed: 0,Electrical,ElectricalLabel
5,SBrkr,5
6,SBrkr,5
7,SBrkr,5
8,FuseF,4
9,SBrkr,5


In [790]:
categoricals_with_null_test['Electrical'].unique()

electLabel = {'SBrkr' : 5, 'FuseF' : 4, 'FuseA' : 3, 'FuseP': 2, 'Mix' : 1, None: 0}

categoricals_with_null_test['ElectricalLabel'] = categoricals_with_null_test['Electrical'].map(electLabel)
categoricals_with_null_test[['Electrical', 'ElectricalLabel']].iloc[5:10]


Unnamed: 0,Electrical,ElectricalLabel
5,SBrkr,5
6,SBrkr,5
7,SBrkr,5
8,SBrkr,5
9,SBrkr,5


### Handling Nominal values:

We have several nominal features that we will map to numerical values. The following features, among others, are treated as ordinal and are therefore excluded from this step:
 * LotShape
 * Utilities
 * LandSlope
 * ExterQual
 * ExterCond
 * BsmtQual

In [791]:
from sklearn.preprocessing import LabelEncoder

nominal_features = categoricals.drop(['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC','Fence',
                                      'GarageFinish','Alley','MasVnrType','GarageType','MiscFeature','Electrical'], axis = 1)
nominal_features.describe()
#list(nominal_features)


Unnamed: 0,MSZoning,Street,LandContour,LotConfig,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,Foundation,Heating,CentralAir,Functional,PavedDrive,SaleType,SaleCondition
count,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460,1460
unique,5,2,4,5,25,9,8,5,8,6,8,15,16,6,6,2,7,3,9,6
top,RL,Pave,Lvl,Inside,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,PConc,GasA,Y,Typ,Y,WD,Normal
freq,1151,1454,1311,1052,225,1260,1445,1220,726,1141,1434,515,504,647,1428,1365,1360,1340,1267,1198


In [792]:
from sklearn.preprocessing import LabelEncoder

nominal_features_test = categoricals_test.drop(['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC','Fence',
                                      'GarageFinish','Alley','MasVnrType','GarageType','MiscFeature','Electrical'], axis = 1)
nominal_features_test.describe()
#list(nominal_features)


Unnamed: 0,MSZoning,Street,LandContour,LotConfig,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,Foundation,Heating,CentralAir,Functional,PavedDrive,SaleType,SaleCondition
count,1455,1459,1459,1459,1459,1459,1459,1459,1459,1459,1459,1458,1458,1459,1459,1459,1457,1459,1458,1459
unique,5,2,4,5,25,9,5,5,7,6,4,13,15,6,4,2,7,3,9,6
top,RL,Pave,Lvl,Inside,NAmes,Norm,Norm,1Fam,1Story,Gable,CompShg,VinylSd,VinylSd,PConc,GasA,Y,Typ,Y,WD,Normal
freq,1114,1453,1311,1081,218,1251,1444,1205,745,1169,1442,510,510,661,1446,1358,1357,1301,1258,1204


We label each of the remaining nominal features using scikit-learn's `LabelEncoder`:

In [793]:
from sklearn.preprocessing import LabelEncoder

#def labelFeatures(dataframe, feature):
    #gle = LabelEncoder()
    #genre_labels = gle.fit_transform(dataframe[feature])
    #genre_mappings = {index: label for index, label in 
     #                         enumerate(gle.classes_)}
    #genre_mappings

    #dataframe[feature+'Label'] = genre_labels
    #dataframe[[feature, feature + 'Label']]
    #return dataframe
    
    
#labelFeatures(nominal_features, "Street")
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col+'Label'] = LabelEncoder().fit_transform(output[col])
        else:
            for colname, col in output.items():
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
    

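#'CentralAir','Electrical','Functional','PavedDrive','SaleType','SaleCondition'
# Note: the encoded frame is only displayed here, not assigned back,
# so nominal_features itself keeps its original columns.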
MultiColumnLabelEncoder(columns = ['MSZoning','Street','LandContour','LotConfig','Neighborhood','Condition1','Condition2',
                                   'BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation',
                                   'Heating','CentralAir','Functional','PavedDrive','SaleType','SaleCondition'
                                  ]).fit_transform(nominal_features)



#columns =  nominal_features.columns.values.tolist()   
#columns



Unnamed: 0,MSZoning,Street,LandContour,LotConfig,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,...,RoofMatlLabel,Exterior1stLabel,Exterior2ndLabel,FoundationLabel,HeatingLabel,CentralAirLabel,FunctionalLabel,PavedDriveLabel,SaleTypeLabel,SaleConditionLabel
0,RL,Pave,Lvl,Inside,CollgCr,Norm,Norm,1Fam,2Story,Gable,...,1,12,13,2,1,1,6,2,8,4
1,RL,Pave,Lvl,FR2,Veenker,Feedr,Norm,1Fam,1Story,Gable,...,1,8,8,1,1,1,6,2,8,4
2,RL,Pave,Lvl,Inside,CollgCr,Norm,Norm,1Fam,2Story,Gable,...,1,12,13,2,1,1,6,2,8,4
3,RL,Pave,Lvl,Corner,Crawfor,Norm,Norm,1Fam,2Story,Gable,...,1,13,15,0,1,1,6,2,8,0
4,RL,Pave,Lvl,FR2,NoRidge,Norm,Norm,1Fam,2Story,Gable,...,1,12,13,2,1,1,6,2,8,4
5,RL,Pave,Lvl,Inside,Mitchel,Norm,Norm,1Fam,1.5Fin,Gable,...,1,12,13,5,1,1,6,2,8,4
6,RL,Pave,Lvl,Inside,Somerst,Norm,Norm,1Fam,1Story,Gable,...,1,12,13,2,1,1,6,2,8,4
7,RL,Pave,Lvl,Corner,NWAmes,PosN,Norm,1Fam,2Story,Gable,...,1,6,6,1,1,1,6,2,8,4
8,RM,Pave,Lvl,Inside,OldTown,Artery,Norm,1Fam,1.5Fin,Gable,...,1,3,15,0,1,1,2,2,8,0
9,RL,Pave,Lvl,Corner,BrkSide,Artery,Artery,2fmCon,1.5Unf,Gable,...,1,8,8,0,1,1,6,2,8,4


In [794]:
nominal_features_test = nominal_features_test.fillna('none')


for feature in nominal_features_test:
    print ("Unique values of ",feature," : " , nominal_features_test[feature].unique())

Unique values of  MSZoning  :  ['RH' 'RL' 'RM' 'FV' 'C (all)' 'none']
Unique values of  Street  :  ['Pave' 'Grvl']
Unique values of  LandContour  :  ['Lvl' 'HLS' 'Bnk' 'Low']
Unique values of  LotConfig  :  ['Inside' 'Corner' 'FR2' 'CulDSac' 'FR3']
Unique values of  Neighborhood  :  ['NAmes' 'Gilbert' 'StoneBr' 'BrDale' 'NPkVill' 'NridgHt' 'Blmngtn'
 'NoRidge' 'Somerst' 'SawyerW' 'Sawyer' 'NWAmes' 'OldTown' 'BrkSide'
 'ClearCr' 'SWISU' 'Edwards' 'CollgCr' 'Crawfor' 'Blueste' 'IDOTRR'
 'Mitchel' 'Timber' 'MeadowV' 'Veenker']
Unique values of  Condition1  :  ['Feedr' 'Norm' 'PosN' 'RRNe' 'Artery' 'RRNn' 'PosA' 'RRAn' 'RRAe']
Unique values of  Condition2  :  ['Norm' 'Feedr' 'PosA' 'PosN' 'Artery']
Unique values of  BldgType  :  ['1Fam' 'TwnhsE' 'Twnhs' 'Duplex' '2fmCon']
Unique values of  HouseStyle  :  ['1Story' '2Story' 'SLvl' '1.5Fin' 'SFoyer' '2.5Unf' '1.5Unf']
Unique values of  RoofStyle  :  ['Gable' 'Hip' 'Gambrel' 'Flat' 'Mansard' 'Shed']
Unique values of  RoofMatl  :  ['CompShg' '

In [795]:
MultiColumnLabelEncoder(columns = ['MSZoning','Street','LandContour','LotConfig','Neighborhood','Condition1','Condition2',
                                   'BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation',
                                   'Heating','CentralAir','Functional','PavedDrive','SaleType','SaleCondition'
                                  ]).fit_transform(nominal_features_test)

Unnamed: 0,MSZoning,Street,LandContour,LotConfig,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,...,RoofMatlLabel,Exterior1stLabel,Exterior2ndLabel,FoundationLabel,HeatingLabel,CentralAirLabel,FunctionalLabel,PavedDriveLabel,SaleTypeLabel,SaleConditionLabel
0,RH,Pave,Lvl,Inside,NAmes,Feedr,Norm,1Fam,1Story,Gable,...,0,10,12,1,0,1,6,2,8,4
1,RL,Pave,Lvl,Corner,NAmes,Norm,Norm,1Fam,1Story,Hip,...,0,11,13,1,0,1,6,2,8,4
2,RL,Pave,Lvl,Inside,Gilbert,Norm,Norm,1Fam,2Story,Gable,...,0,10,12,2,0,1,6,2,8,4
3,RL,Pave,Lvl,Inside,Gilbert,Norm,Norm,1Fam,2Story,Gable,...,0,10,12,2,0,1,6,2,8,4
4,RL,Pave,HLS,Inside,StoneBr,Norm,Norm,TwnhsE,1Story,Gable,...,0,6,6,2,0,1,6,2,8,4
5,RL,Pave,Lvl,Corner,Gilbert,Norm,Norm,1Fam,2Story,Gable,...,0,6,6,2,0,1,6,2,8,4
6,RL,Pave,Lvl,Inside,Gilbert,Norm,Norm,1Fam,1Story,Gable,...,0,6,6,2,0,1,6,2,8,4
7,RL,Pave,Lvl,Inside,Gilbert,Norm,Norm,1Fam,2Story,Gable,...,0,10,12,2,0,1,6,2,8,4
8,RL,Pave,Lvl,Inside,Gilbert,Norm,Norm,1Fam,1Story,Gable,...,0,6,6,2,0,1,6,2,8,4
9,RL,Pave,Lvl,Corner,NAmes,Norm,Norm,1Fam,1Story,Gable,...,0,8,9,1,0,1,6,2,8,4
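
Calling `LabelEncoder().fit_transform` separately on the training and test frames can assign different integers to the same category whenever the two sets contain different values (the test set, for instance, has an extra `'none'` value in `MSZoning`). The sketch below shows one possible alternative, not the method used in this notebook: learn the category list from the training data and reuse it for the test data. The `encode` helper and the toy frames are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; the test frame contains values unseen in training.
toy_train = pd.DataFrame({'MSZoning': ['RL', 'RM', 'FV', 'RL']})
toy_test = pd.DataFrame({'MSZoning': ['RH', 'RL', 'none']})

# Learn the category list from the training data only.
categories = sorted(toy_train['MSZoning'].dropna().unique())

def encode(series, categories):
    """Integer-encode a series against a fixed category list; unseen values become -1."""
    codes = pd.Categorical(series, categories=categories).codes
    return pd.Series(codes, index=series.index)

toy_train['MSZoningLabel'] = encode(toy_train['MSZoning'], categories)
toy_test['MSZoningLabel'] = encode(toy_test['MSZoning'], categories)

print(toy_train)
print(toy_test)
```

With a shared category list, `'RL'` receives the same code in both frames, and values that never occurred in training are flagged with -1 instead of silently shifting the other codes.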


### Handling ordinal features

In [796]:
ordinal_features = categoricals[['LotShape','Utilities','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','LandSlope','GarageCond','PoolQC','Fence']]
ordinal_features.head()
for feature in ordinal_features:
    print ("Unique values of ",feature," : " , train[feature].unique())

Unique values of  LotShape  :  ['Reg' 'IR1' 'IR2' 'IR3']
Unique values of  Utilities  :  ['AllPub' 'NoSeWa']
Unique values of  ExterQual  :  ['Gd' 'TA' 'Ex' 'Fa']
Unique values of  ExterCond  :  ['TA' 'Gd' 'Fa' 'Po' 'Ex']
Unique values of  BsmtQual  :  ['Gd' 'TA' 'Ex' nan 'Fa']
Unique values of  BsmtCond  :  ['TA' 'Gd' nan 'Fa' 'Po']
Unique values of  BsmtExposure  :  ['No' 'Gd' 'Mn' 'Av' nan]
Unique values of  BsmtFinType1  :  ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ']
Unique values of  BsmtFinType2  :  ['Unf' 'BLQ' nan 'ALQ' 'Rec' 'LwQ' 'GLQ']
Unique values of  HeatingQC  :  ['Ex' 'Gd' 'TA' 'Fa' 'Po']
Unique values of  KitchenQual  :  ['Gd' 'TA' 'Ex' 'Fa']
Unique values of  FireplaceQu  :  [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']
Unique values of  GarageQual  :  ['TA' 'Fa' 'Gd' nan 'Ex' 'Po']
Unique values of  LandSlope  :  ['Gtl' 'Mod' 'Sev']
Unique values of  GarageCond  :  ['TA' 'Fa' nan 'Gd' 'Po' 'Ex']
Unique values of  PoolQC  :  [nan 'Ex' 'Fa' 'Gd']
Unique values of  Fence  :  [nan 'MnPrv

In [797]:
ordinal_features_test = categoricals_test[['LotShape','Utilities','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','LandSlope','GarageCond','PoolQC','Fence']]
ordinal_features_test.head()
for feature in ordinal_features_test:
    print ("Unique values of ",feature," : " , ordinal_features_test[feature].unique())

Unique values of  LotShape  :  ['Reg' 'IR1' 'IR2' 'IR3']
Unique values of  Utilities  :  ['AllPub' nan]
Unique values of  ExterQual  :  ['TA' 'Gd' 'Ex' 'Fa']
Unique values of  ExterCond  :  ['TA' 'Gd' 'Fa' 'Po' 'Ex']
Unique values of  BsmtQual  :  ['TA' 'Gd' 'Ex' 'Fa' nan]
Unique values of  BsmtCond  :  ['TA' 'Po' 'Fa' 'Gd' nan]
Unique values of  BsmtExposure  :  ['No' 'Gd' 'Mn' 'Av' nan]
Unique values of  BsmtFinType1  :  ['Rec' 'ALQ' 'GLQ' 'Unf' 'BLQ' 'LwQ' nan]
Unique values of  BsmtFinType2  :  ['LwQ' 'Unf' 'Rec' 'BLQ' 'GLQ' 'ALQ' nan]
Unique values of  HeatingQC  :  ['TA' 'Gd' 'Ex' 'Fa' 'Po']
Unique values of  KitchenQual  :  ['TA' 'Gd' 'Ex' 'Fa' nan]
Unique values of  FireplaceQu  :  [nan 'TA' 'Gd' 'Po' 'Fa' 'Ex']
Unique values of  GarageQual  :  ['TA' nan 'Fa' 'Gd' 'Po']
Unique values of  LandSlope  :  ['Gtl' 'Mod' 'Sev']
Unique values of  GarageCond  :  ['TA' nan 'Fa' 'Gd' 'Po' 'Ex']
Unique values of  PoolQC  :  [nan 'Ex' 'Gd']
Unique values of  Fence  :  ['MnPrv' nan 'GdPrv' '

In [798]:
#ordinal_features['GarageType'].unique()



labels = {'TA': 3, 'Fa':2 , 'Gd': 4, 'Ex': 5, 'Po': 1, None: 0}
labels_1 = {'Reg':3, 'IR1':2, 'IR2':1, 'IR3':0}
labels_2 = {'AllPub': 3, 'NoSeWa': 1}
labels_3 = {'Gtl': 3, 'Mod': 2, 'Sev': 1}
labels_4 = {'No': 1, 'Gd': 4, 'Mn': 2, 'Av': 3, None: 0}
labels_5 = {'GLQ': 6 ,'ALQ': 5 ,'Unf': 1 ,'Rec': 3 ,'BLQ': 4 , None: 0, 'LwQ': 2 }
labels_6 = {None: 0, 'MnPrv': 1 , 'GdWo': 3 , 'GdPrv': 4 , 'MnWw': 2 ,}

features = ['ExterQual','ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC']

features_1 = ['LotShape','Utilities','LandSlope','BsmtExposure','BsmtFinType1','BsmtFinType2','Fence']

ordinal_features['LotShapeLabel'] = ordinal_features['LotShape'].map(labels_1)
ordinal_features['UtilitiesLabel'] = ordinal_features['Utilities'].map(labels_2)
ordinal_features['LandSlopeLabel'] = ordinal_features['LandSlope'].map(labels_3)
ordinal_features['BsmtExposureLabel'] = ordinal_features['BsmtExposure'].map(labels_4)
ordinal_features['BsmtFinType1Label'] = ordinal_features['BsmtFinType1'].map(labels_5)
ordinal_features['BsmtFinType2Label'] = ordinal_features['BsmtFinType2'].map(labels_5)
ordinal_features['FenceLabel'] = ordinal_features['Fence'].map(labels_6)


for f in features:
    ordinal_features[f + 'Label'] = ordinal_features[f].map(labels)
    #ordinal_features[[f, f + 'Label']]

ordinal_features.head()




Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,ExterQualLabel,ExterCondLabel,BsmtQualLabel,BsmtCondLabel,HeatingQCLabel,KitchenQualLabel,FireplaceQuLabel,GarageQualLabel,GarageCondLabel,PoolQCLabel
0,Reg,AllPub,Gd,TA,Gd,TA,No,GLQ,Unf,Ex,...,4,3,4,3,5,4,0,3,3,0
1,Reg,AllPub,TA,TA,Gd,TA,Gd,ALQ,Unf,Ex,...,3,3,4,3,5,3,3,3,3,0
2,IR1,AllPub,Gd,TA,Gd,TA,Mn,GLQ,Unf,Ex,...,4,3,4,3,5,4,3,3,3,0
3,IR1,AllPub,TA,TA,TA,Gd,No,ALQ,Unf,Gd,...,3,3,3,4,4,4,4,3,3,0
4,IR1,AllPub,Gd,TA,Gd,TA,Av,GLQ,Unf,Ex,...,4,3,4,3,5,4,3,3,3,0


In [799]:
labels = {'TA': 3, 'Fa':2 , 'Gd': 4, 'Ex': 5, 'Po': 1, None: 0}
labels_1 = {'Reg':3, 'IR1':2, 'IR2':1, 'IR3':0}
labels_2 = {'AllPub': 3, 'NoSeWa': 1}
labels_3 = {'Gtl': 3, 'Mod': 2, 'Sev': 1}
labels_4 = {'No': 1, 'Gd': 4, 'Mn': 2, 'Av': 3, None: 0}
labels_5 = {'GLQ': 6 ,'ALQ': 5 ,'Unf': 1 ,'Rec': 3 ,'BLQ': 4 , None: 0, 'LwQ': 2 }
labels_6 = {None: 0, 'MnPrv': 1 , 'GdWo': 3 , 'GdPrv': 4 , 'MnWw': 2 ,}

features = ['ExterQual','ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC']

features_1 = ['LotShape','Utilities','LandSlope','BsmtExposure','BsmtFinType1','BsmtFinType2','Fence']

ordinal_features_test['LotShapeLabel'] = ordinal_features_test['LotShape'].map(labels_1)
ordinal_features_test['UtilitiesLabel'] = ordinal_features_test['Utilities'].map(labels_2)
ordinal_features_test['LandSlopeLabel'] = ordinal_features_test['LandSlope'].map(labels_3)
ordinal_features_test['BsmtExposureLabel'] = ordinal_features_test['BsmtExposure'].map(labels_4)
ordinal_features_test['BsmtFinType1Label'] = ordinal_features_test['BsmtFinType1'].map(labels_5)
ordinal_features_test['BsmtFinType2Label'] = ordinal_features_test['BsmtFinType2'].map(labels_5)
ordinal_features_test['FenceLabel'] = ordinal_features_test['Fence'].map(labels_6)


for f in features:
    ordinal_features_test[f + 'Label'] = ordinal_features_test[f].map(labels)
    #ordinal_features[[f, f + 'Label']]

ordinal_features_test.head()


Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,ExterQualLabel,ExterCondLabel,BsmtQualLabel,BsmtCondLabel,HeatingQCLabel,KitchenQualLabel,FireplaceQuLabel,GarageQualLabel,GarageCondLabel,PoolQCLabel
0,Reg,AllPub,TA,TA,TA,TA,No,Rec,LwQ,TA,...,3,3,3,3,3,3,0,3,3,0
1,IR1,AllPub,TA,TA,TA,TA,No,ALQ,Unf,TA,...,3,3,3,3,3,4,0,3,3,0
2,IR1,AllPub,TA,TA,Gd,TA,No,GLQ,Unf,Gd,...,3,3,4,3,4,3,3,3,3,0
3,IR1,AllPub,TA,TA,TA,TA,No,GLQ,Unf,Ex,...,3,3,3,3,5,4,4,3,3,0
4,IR1,AllPub,Gd,TA,Gd,TA,No,ALQ,Unf,Ex,...,4,3,4,3,5,4,0,3,3,0


### Use of One-Hot encoding:

One-hot encoding transforms categorical features into a format that works better with classification and regression algorithms.
We generate one boolean column for each category; for each sample, exactly one of these columns takes the value 1, hence the term one-hot encoding. (With `drop_first=True`, as used below, the first category of each feature is dropped and is represented by all zeros.)
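
A small sketch of the idea on hypothetical toy data. It also shows one possible way to give the test frame the same dummy columns as the training frame when the two sets do not contain the same categories; the `reindex` step is an assumption about how one might handle this, not something done in this notebook. `drop_first` is left at its default so that the "exactly one 1 per feature" property is visible.

```python
import pandas as pd

# Hypothetical toy data; the test frame lacks some training categories.
toy_train = pd.DataFrame({'Street': ['Pave', 'Grvl', 'Pave'],
                          'RoofStyle': ['Gable', 'Hip', 'Shed']})
toy_test = pd.DataFrame({'Street': ['Pave', 'Pave'],
                         'RoofStyle': ['Hip', 'Gable']})

# One 0/1 column per (feature, category) pair.
train_dummies = pd.get_dummies(toy_train, dtype=int)
test_dummies = pd.get_dummies(toy_test, dtype=int)

# Give the test frame exactly the training columns: categories missing from
# the test set become all-zero columns, and categories unseen in training
# would be dropped.
test_dummies = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(train_dummies)
print(test_dummies)
```

Each training row contains exactly one 1 per original feature, and both frames end up with identical columns, which a model fitted on the training frame requires.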

In [800]:
# first we concatenate our results
categoricals_transformed = pd.concat([ordinal_features, nominal_features, categoricals_with_null], axis=1)
categoricals_transformed.shape
#categoricals_transformed.columns.values


(1460, 64)

In [801]:
# first we concatenate our results
categoricals_transformed_test = pd.concat([ordinal_features_test, nominal_features_test, categoricals_with_null_test], axis=1)
categoricals_transformed_test.shape
#categoricals_transformed.columns.values

(1459, 64)

In [802]:
one_hot_encoding_data = pd.get_dummies(categoricals,drop_first=True)
one_hot_encoding_data.head()




Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,Alley_Pave,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_HLS,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0,0,1,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,1,0
1,0,0,1,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,1,0
2,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
4,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0


In [803]:
one_hot_encoding_data_test = pd.get_dummies(categoricals_test,drop_first=True)
one_hot_encoding_data_test.head()

Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,Alley_Pave,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_HLS,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,0,1,0,0,1,0,0,0,1,0,...,0,0,0,0,1,0,0,0,1,0
1,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
2,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
3,0,0,1,0,1,0,0,0,0,0,...,0,0,0,0,1,0,0,0,1,0
4,0,0,1,0,1,0,0,0,0,1,...,0,0,0,0,1,0,0,0,1,0


In [804]:
# concatenate everything

train_transformed = pd.concat([ordinal_features, nominal_features, categoricals_with_null, one_hot_encoding_data], axis=1)
train_transformed.shape
train_transformed.head()




Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,Reg,AllPub,Gd,TA,Gd,TA,No,GLQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0
1,Reg,AllPub,TA,TA,Gd,TA,Gd,ALQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0
2,IR1,AllPub,Gd,TA,Gd,TA,Mn,GLQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0
3,IR1,AllPub,TA,TA,TA,Gd,No,ALQ,Unf,Gd,...,0,0,0,0,1,0,0,0,0,0
4,IR1,AllPub,Gd,TA,Gd,TA,Av,GLQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0


In [805]:
# concatenate everything

train_transformed_test = pd.concat([ordinal_features_test, nominal_features_test, categoricals_with_null_test, one_hot_encoding_data_test], axis=1)
train_transformed_test.shape
train_transformed_test.head()


Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,Reg,AllPub,TA,TA,TA,TA,No,Rec,LwQ,TA,...,0,0,0,0,1,0,0,0,1,0
1,IR1,AllPub,TA,TA,TA,TA,No,ALQ,Unf,TA,...,0,0,0,0,1,0,0,0,1,0
2,IR1,AllPub,TA,TA,Gd,TA,No,GLQ,Unf,Gd,...,0,0,0,0,1,0,0,0,1,0
3,IR1,AllPub,TA,TA,TA,TA,No,GLQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0
4,IR1,AllPub,Gd,TA,Gd,TA,No,ALQ,Unf,Ex,...,0,0,0,0,1,0,0,0,1,0


In [806]:
# concatenate with the normalized numerical features from the other notebook: data cleaning and preparation

%store -r train_numericals
%store -r test_numericals

#num_data_train = train_numericals
#num_data_test = test_numericals

#num_data_train = num_data_train.drop(['SalePrice'], axis = 1)

num_data_train = train.select_dtypes(include=[np.number])
num_data_test = test.select_dtypes(include=[np.number])


train_transformed = pd.concat([train_transformed, num_data_train], axis=1)

train_transformed.head()

Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SalePrice
0,Reg,AllPub,Gd,TA,Gd,TA,No,GLQ,Unf,Ex,...,0,61,0,0,0,0,0,2,2008,208500
1,Reg,AllPub,TA,TA,Gd,TA,Gd,ALQ,Unf,Ex,...,298,0,0,0,0,0,0,5,2007,181500
2,IR1,AllPub,Gd,TA,Gd,TA,Mn,GLQ,Unf,Ex,...,0,42,0,0,0,0,0,9,2008,223500
3,IR1,AllPub,TA,TA,TA,Gd,No,ALQ,Unf,Gd,...,0,35,272,0,0,0,0,2,2006,140000
4,IR1,AllPub,Gd,TA,Gd,TA,Av,GLQ,Unf,Ex,...,192,84,0,0,0,0,0,12,2008,250000


In [807]:
test_transformed = pd.concat([train_transformed_test, num_data_test], axis=1)

test_transformed.head()

Unnamed: 0,LotShape,Utilities,ExterQual,ExterCond,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,HeatingQC,...,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold
0,Reg,AllPub,TA,TA,TA,TA,No,Rec,LwQ,TA,...,730.0,140,0,0,0,120,0,0,6,2010
1,IR1,AllPub,TA,TA,TA,TA,No,ALQ,Unf,TA,...,312.0,393,36,0,0,0,0,12500,6,2010
2,IR1,AllPub,TA,TA,Gd,TA,No,GLQ,Unf,Gd,...,482.0,212,34,0,0,0,0,0,3,2010
3,IR1,AllPub,TA,TA,TA,TA,No,GLQ,Unf,Ex,...,470.0,360,36,0,0,0,0,0,6,2010
4,IR1,AllPub,Gd,TA,Gd,TA,No,ALQ,Unf,Ex,...,506.0,0,82,0,0,144,0,0,1,2010


In [808]:
# save to csv

train_transformed.to_csv('train_transformed.csv')
test_transformed.to_csv('test_transformed.csv')