# House Prices Competition : Term Project 

#### Description:

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.

With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

#### To-do list:

* Write functions for each data preparation and processing method
* Read about feature engineering and feature selection
* Apply PCA
* Work out how to select the most important non-numerical features


### Importing Libraries:

In [31]:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import linear_model
from sklearn.metrics import mean_squared_error
from sklearn.decomposition import PCA


import matplotlib.pyplot as plt
plt.style.use(style='ggplot')
plt.rcParams['figure.figsize'] = (10, 6)

In [32]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

In [33]:
train.head()

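| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |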


In [34]:
print ("Train data shape:", train.shape)

Train data shape: (1460, 81)


### Feature engineering:

The same transformations must be applied to both the training and the test data during feature engineering, so that the model sees identically encoded features at prediction time.
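
One way to keep the two datasets in sync is to put every transformation in a function and call it on both frames. The sketch below is only an illustration: the toy frames, the `prepare` helper and the choice of median imputation for LotFrontage are assumptions, not steps taken elsewhere in this notebook. Any statistic (here the median) is computed on the training rows only and then reused for the test rows.

```python
import pandas as pd

# Hypothetical toy frames standing in for train.csv and test.csv.
toy_train = pd.DataFrame({"LotFrontage": [65.0, None, 80.0],
                          "LotShape": ["Reg", "IR1", "Reg"]})
toy_test = pd.DataFrame({"LotFrontage": [None, 70.0],
                         "LotShape": ["IR1", "Reg"]})

LOT_SHAPE_LABELS = {"Reg": 3, "IR1": 2, "IR2": 1, "IR3": 0}

def prepare(df, lot_frontage_fill):
    """Return a transformed copy of df; the input frame is left untouched."""
    out = df.copy()
    # Assumption: missing LotFrontage is imputed with a value chosen by the caller.
    out["LotFrontage"] = out["LotFrontage"].fillna(lot_frontage_fill)
    out["LotShapeLabel"] = out["LotShape"].map(LOT_SHAPE_LABELS)
    return out

# The fill value is learned from the training rows only, then reused for test.
fill_value = toy_train["LotFrontage"].median()
prepared_train = prepare(toy_train, fill_value)
prepared_test = prepare(toy_test, fill_value)
```

Because both frames go through the same function with the same fill value, a change to the preprocessing only has to be made in one place.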

### Handling non-numerical features :

In [35]:
categoricals = train.select_dtypes(exclude=[np.number])
#numericals = train.select_dtypes(include=[np.number])
categoricals.describe()
#numericals.head()

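| | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460 | 1460 | 91 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1379 | 1379 | 1379 | 1379 | 1460 | 7 | 281 | 54 | 1460 | 1460 |
| unique | 5 | 2 | 2 | 4 | 4 | 2 | 5 | 3 | 25 | 9 | ... | 6 | 3 | 5 | 5 | 3 | 3 | 4 | 4 | 9 | 6 |
| top | RL | Pave | Grvl | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | ... | Attchd | Unf | TA | TA | Y | Gd | MnPrv | Shed | WD | Normal |
| freq | 1151 | 1454 | 50 | 925 | 1311 | 1459 | 1052 | 1382 | 225 | 1260 | ... | 870 | 605 | 1311 | 1326 | 1340 | 3 | 157 | 49 | 1267 | 1198 |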


In [36]:
#to check how many categories we have per feature
#for feature in categoricals:
    #print ("Unique values of ",feature," : " , train[feature].unique())

First we will deal with nominal features with null values that could be significant, for example :
- Alley
- MasVnrType
- GarageType
- MiscFeature


In [37]:
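# Take an explicit copy so that adding label columns below does not trigger SettingWithCopyWarning
categoricals_with_null = categoricals[['Alley','MasVnrType','GarageType','MiscFeature']].copy()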
categoricals_with_null.head()

| | Alley | MasVnrType | GarageType | MiscFeature |
|---|---|---|---|---|
| 0 | | BrkFace | Attchd | |
| 1 | | | Attchd | |
| 2 | | BrkFace | Attchd | |
| 3 | | | Detchd | |
| 4 | | BrkFace | Attchd | |


In [38]:
categoricals_with_null['Alley'].unique()

alleyLabel = {'Grvl': 1, 'Pave': 2, None: 0}

categoricals_with_null['AlleyLabel'] = categoricals_with_null['Alley'].map(alleyLabel)
categoricals_with_null[['Alley', 'AlleyLabel']].iloc[20:25]


| | Alley | AlleyLabel |
|---|---|---|
| 20 | | 0 |
| 21 | Grvl | 1 |
| 22 | | 0 |
| 23 | | 0 |
| 24 | | 0 |


In [39]:
categoricals_with_null['MasVnrType'].unique()

MasVnrTypeLabel = {'BrkFace': 1, 'Stone': 2,'BrkCmn': 3, 'None': 0}

categoricals_with_null['MasVnrTypeLabel'] = categoricals_with_null['MasVnrType'].map(MasVnrTypeLabel)
categoricals_with_null[['MasVnrType', 'MasVnrTypeLabel']].iloc[10:18]


| | MasVnrType | MasVnrTypeLabel |
|---|---|---|
| 10 | | 0.0 |
| 11 | Stone | 2.0 |
| 12 | | 0.0 |
| 13 | Stone | 2.0 |
| 14 | BrkFace | 1.0 |
| 15 | | 0.0 |
| 16 | BrkFace | 1.0 |
| 17 | | 0.0 |


In [40]:
categoricals_with_null['GarageType'].unique()

# Note: 'BuiltIn', 'CarPort' and 'Basment' all share the code 3 in this mapping
GarageTypeLabel = {'Attchd' : 1, 'Detchd' : 2, 'BuiltIn' : 3, 'CarPort' : 3,'Basment': 3, '2Types': 4, None: 0}

categoricals_with_null['GarageTypeLabel'] = categoricals_with_null['GarageType'].map(GarageTypeLabel)
categoricals_with_null[['GarageType', 'GarageTypeLabel']].iloc[15:20]


| | GarageType | GarageTypeLabel |
|---|---|---|
| 15 | Detchd | 2 |
| 16 | Attchd | 1 |
| 17 | CarPort | 3 |
| 18 | Detchd | 2 |
| 19 | Attchd | 1 |


In [41]:
categoricals_with_null['MiscFeature'].unique()

MiscFeatureLabel = {'Shed' : 1, 'Gar2' : 2, 'Othr' : 3, 'TenC' : 4, None: 0}

categoricals_with_null['MiscFeatureLabel'] = categoricals_with_null['MiscFeature'].map(MiscFeatureLabel)
categoricals_with_null[['MiscFeature', 'MiscFeatureLabel']].iloc[5:10]


| | MiscFeature | MiscFeatureLabel |
|---|---|---|
| 5 | Shed | 1 |
| 6 | | 0 |
| 7 | Shed | 1 |
| 8 | | 0 |
| 9 | | 0 |
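
The mappings above rely on a None key in the dictionary to catch missing values. A more explicit alternative, sketched here on a toy series, is to fill missing entries with a placeholder category before mapping and then check that nothing was left unmapped. The placeholder name "NoAlley" is made up for this example. The float dtype of MasVnrTypeLabel in the output above may indicate that a few values in that column were not matched by its mapping, which is exactly what such a check would reveal.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train['Alley']; NaN means the house has no alley access.
alley = pd.Series(["Grvl", np.nan, "Pave", np.nan], name="Alley")

# "NoAlley" is a hypothetical placeholder label, not a value from the dataset.
alley_label = {"Grvl": 1, "Pave": 2, "NoAlley": 0}

alley_encoded = alley.fillna("NoAlley").map(alley_label)

# Any category missing from the dictionary would show up as NaN here.
unmapped = int(alley_encoded.isna().sum())
```

If `unmapped` is greater than zero, the dictionary is missing at least one category that occurs in the data.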


### Handling Nominal values:

We have several nominal features that we will map to numerical values. Features such as the following are treated as ordinal instead, so they are excluded from the nominal set and handled separately:
 * LotShape
 * Utilities
 * LandSlope
 * ExterQual
 * ExterCond
 * BsmtQual

In [59]:
from sklearn.preprocessing import LabelEncoder

nominal_features = categoricals.drop(['LotShape','Utilities','LandSlope','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC','Fence',
                                      'GarageFinish','Alley','MasVnrType','GarageType','MiscFeature'], axis = 1)
nominal_features.describe()
#list(nominal_features)


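| | MSZoning | Street | LandContour | LotConfig | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | ... | Exterior1st | Exterior2nd | Foundation | Heating | CentralAir | Electrical | Functional | PavedDrive | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1460 | 1460 | 1460 | 1460 | 1460 | 1459 | 1460 | 1460 | 1460 | 1460 |
| unique | 5 | 2 | 4 | 5 | 25 | 9 | 8 | 5 | 8 | 6 | ... | 15 | 16 | 6 | 6 | 2 | 5 | 7 | 3 | 9 | 6 |
| top | RL | Pave | Lvl | Inside | NAmes | Norm | Norm | 1Fam | 1Story | Gable | ... | VinylSd | VinylSd | PConc | GasA | Y | SBrkr | Typ | Y | WD | Normal |
| freq | 1151 | 1454 | 1311 | 1052 | 225 | 1260 | 1445 | 1220 | 726 | 1141 | ... | 515 | 504 | 647 | 1428 | 1365 | 1334 | 1360 | 1340 | 1267 | 1198 |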


In [60]:
nominal_features.Electrical = nominal_features.Electrical.fillna('None')
print(nominal_features.Electrical.value_counts())

SBrkr    1334
FuseA      94
FuseF      27
FuseP       3
Mix         1
None        1
Name: Electrical, dtype: int64


We will label the remaining features with scikit-learn's LabelEncoder, applied to each nominal feature:

In [61]:
from sklearn.preprocessing import LabelEncoder

#def labelFeatures(dataframe, feature):
    #gle = LabelEncoder()
    #genre_labels = gle.fit_transform(dataframe[feature])
    #genre_mappings = {index: label for index, label in 
     #                         enumerate(gle.classes_)}
    #genre_mappings

    #dataframe[feature+'Label'] = genre_labels
    #dataframe[[feature, feature + 'Label']]
    #return dataframe
    
    
#labelFeatures(nominal_features, "Street")
class MultiColumnLabelEncoder:
    def __init__(self,columns = None):
        self.columns = columns # array of column names to encode

    def fit(self,X,y=None):
        return self # not relevant here

    def transform(self,X):
        '''
        Transforms columns of X specified in self.columns using
        LabelEncoder(). If no columns specified, transforms all
        columns in X.
        '''
        output = X.copy()
        if self.columns is not None:
            for col in self.columns:
                output[col] = LabelEncoder().fit_transform(output[col])
        else:
            for colname,col in output.items():  # iteritems() was removed in pandas 2.0
                output[colname] = LabelEncoder().fit_transform(col)
        return output

    def fit_transform(self,X,y=None):
        return self.fit(X,y).transform(X)
    

#'CentralAir','Electrical','Functional','PavedDrive','SaleType','SaleCondition'
MultiColumnLabelEncoder(columns = ['MSZoning','Street','LandContour','LotConfig','Neighborhood','Condition1','Condition2',
                                   'BldgType','HouseStyle','RoofStyle','RoofMatl','Exterior1st','Exterior2nd','Foundation',
                                   'Heating','CentralAir','Functional','PavedDrive','SaleType','SaleCondition','Electrical'
                                  ]).fit_transform(nominal_features)

#columns =  nominal_features.columns.values.tolist()   
#columns



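| | MSZoning | Street | LandContour | LotConfig | Neighborhood | Condition1 | Condition2 | BldgType | HouseStyle | RoofStyle | ... | Exterior1st | Exterior2nd | Foundation | Heating | CentralAir | Electrical | Functional | PavedDrive | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 3 | 4 | 5 | 2 | 2 | 0 | 5 | 1 | ... | 12 | 13 | 2 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 1 | 3 | 1 | 3 | 2 | 24 | 1 | 2 | 0 | 2 | 1 | ... | 8 | 8 | 1 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 2 | 3 | 1 | 3 | 4 | 5 | 2 | 2 | 0 | 5 | 1 | ... | 12 | 13 | 2 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 3 | 3 | 1 | 3 | 0 | 6 | 2 | 2 | 0 | 5 | 1 | ... | 13 | 15 | 0 | 1 | 1 | 5 | 6 | 2 | 8 | 0 |
| 4 | 3 | 1 | 3 | 2 | 15 | 2 | 2 | 0 | 5 | 1 | ... | 12 | 13 | 2 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 5 | 3 | 1 | 3 | 4 | 11 | 2 | 2 | 0 | 0 | 1 | ... | 12 | 13 | 5 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 6 | 3 | 1 | 3 | 4 | 21 | 2 | 2 | 0 | 2 | 1 | ... | 12 | 13 | 2 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 7 | 3 | 1 | 3 | 0 | 14 | 4 | 2 | 0 | 5 | 1 | ... | 6 | 6 | 1 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |
| 8 | 4 | 1 | 3 | 4 | 17 | 0 | 2 | 0 | 0 | 1 | ... | 3 | 15 | 0 | 1 | 1 | 1 | 2 | 2 | 8 | 0 |
| 9 | 3 | 1 | 3 | 0 | 3 | 0 | 0 | 1 | 1 | 1 | ... | 8 | 8 | 0 | 1 | 1 | 5 | 6 | 2 | 8 | 4 |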
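
The MultiColumnLabelEncoder above fits a fresh LabelEncoder on every call, so the integer codes depend on whichever categories happen to be present in the frame passed in. If the test data is encoded separately, the same category may end up with a different code. A minimal sketch of the alternative, assuming toy frames and a single Street column, is to fit one encoder on the training column and reuse it for the test column:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frames standing in for the nominal columns of train and test.
toy_train = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})
toy_test = pd.DataFrame({"Street": ["Grvl", "Pave"]})

encoder = LabelEncoder()
encoder.fit(toy_train["Street"])            # learn the categories once, from train

train_codes = encoder.transform(toy_train["Street"])
test_codes = encoder.transform(toy_test["Street"])   # same codes as for train

# classes_ is sorted, and each label's position in it is its integer code.
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
```

Note that `transform` raises a ValueError when the test column contains a category that was not seen during `fit`, so unseen categories need to be handled before encoding.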


### Handling ordinal features

In [62]:
# Take an explicit copy so that adding label columns below does not trigger SettingWithCopyWarning
ordinal_features = categoricals[['LotShape','Utilities','ExterQual','ExterCond','BsmtQual',
                                        'BsmtCond','BsmtExposure','BsmtFinType1','BsmtFinType2','HeatingQC',
                                       'KitchenQual','FireplaceQu','GarageQual','LandSlope','GarageCond','PoolQC','Fence']].copy()
ordinal_features.head()
for feature in ordinal_features:
    print ("Unique values of ",feature," : " , train[feature].unique())

Unique values of  LotShape  :  ['Reg' 'IR1' 'IR2' 'IR3']
Unique values of  Utilities  :  ['AllPub' 'NoSeWa']
Unique values of  ExterQual  :  ['Gd' 'TA' 'Ex' 'Fa']
Unique values of  ExterCond  :  ['TA' 'Gd' 'Fa' 'Po' 'Ex']
Unique values of  BsmtQual  :  ['Gd' 'TA' 'Ex' nan 'Fa']
Unique values of  BsmtCond  :  ['TA' 'Gd' nan 'Fa' 'Po']
Unique values of  BsmtExposure  :  ['No' 'Gd' 'Mn' 'Av' nan]
Unique values of  BsmtFinType1  :  ['GLQ' 'ALQ' 'Unf' 'Rec' 'BLQ' nan 'LwQ']
Unique values of  BsmtFinType2  :  ['Unf' 'BLQ' nan 'ALQ' 'Rec' 'LwQ' 'GLQ']
Unique values of  HeatingQC  :  ['Ex' 'Gd' 'TA' 'Fa' 'Po']
Unique values of  KitchenQual  :  ['Gd' 'TA' 'Ex' 'Fa']
Unique values of  FireplaceQu  :  [nan 'TA' 'Gd' 'Fa' 'Ex' 'Po']
Unique values of  GarageQual  :  ['TA' 'Fa' 'Gd' nan 'Ex' 'Po']
Unique values of  LandSlope  :  ['Gtl' 'Mod' 'Sev']
Unique values of  GarageCond  :  ['TA' 'Fa' nan 'Gd' 'Po' 'Ex']
Unique values of  PoolQC  :  [nan 'Ex' 'Fa' 'Gd']
Unique values of  Fence  :  [nan 'MnPrv

In [74]:
#ordinal_features['GarageType'].unique()



labels = {'TA': 3, 'Fa':2 , 'Gd': 4, 'Ex': 5, 'Po': 1, None: 0}
labels_1 = {'Reg':3, 'IR1':2, 'IR2':1, 'IR3':0}
labels_2 = {'AllPub': 3, 'NoSeWa': 1}
labels_3 = {'Gtl': 3, 'Mod': 2, 'Sev': 1}
labels_4 = {'No': 1, 'Gd': 4, 'Mn': 2, 'Av': 3, None: 0}
labels_5 = {'GLQ': 6 ,'ALQ': 5 ,'Unf': 1 ,'Rec': 3 ,'BLQ': 4 , None: 0, 'LwQ': 2 }
labels_6 = {None: 0, 'MnPrv': 1 , 'GdWo': 3 , 'GdPrv': 4 , 'MnWw': 2 ,}

features = ['ExterQual','ExterCond','BsmtQual','BsmtCond','HeatingQC','KitchenQual','FireplaceQu','GarageQual','GarageCond','PoolQC']

features_1 = ['LotShape','Utilities','LandSlope','BsmtExposure','BsmtFinType1','BsmtFinType2','Fence']

ordinal_features['LotShapeLabel'] = ordinal_features['LotShape'].map(labels_1)
ordinal_features['UtilitiesLabel'] = ordinal_features['Utilities'].map(labels_2)
ordinal_features['LandSlopeLabel'] = ordinal_features['LandSlope'].map(labels_3)
ordinal_features['BsmtExposureLabel'] = ordinal_features['BsmtExposure'].map(labels_4)
ordinal_features['BsmtFinType1Label'] = ordinal_features['BsmtFinType1'].map(labels_5)
ordinal_features['BsmtFinType2Label'] = ordinal_features['BsmtFinType2'].map(labels_5)
ordinal_features['FenceLabel'] = ordinal_features['Fence'].map(labels_6)


for f in features:
    ordinal_features[f + 'Label'] = ordinal_features[f].map(labels)
    #ordinal_features[[f, f + 'Label']]

ordinal_features.head()




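| | LotShape | Utilities | ExterQual | ExterCond | BsmtQual | BsmtCond | BsmtExposure | BsmtFinType1 | BsmtFinType2 | HeatingQC | ... | ExterCondLabel | BsmtQualLabel | BsmtCondLabel | HeatingQCLabel | KitchenQualLabel | FireplaceQuLabel | GarageQualLabel | GarageCondLabel | PoolQCLabel | LandSlopeLabel |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Reg | AllPub | Gd | TA | Gd | TA | No | GLQ | Unf | Ex | ... | 3 | 4 | 3 | 5 | 4 | 0 | 3 | 3 | 0 | 3 |
| 1 | Reg | AllPub | TA | TA | Gd | TA | Gd | ALQ | Unf | Ex | ... | 3 | 4 | 3 | 5 | 3 | 3 | 3 | 3 | 0 | 3 |
| 2 | IR1 | AllPub | Gd | TA | Gd | TA | Mn | GLQ | Unf | Ex | ... | 3 | 4 | 3 | 5 | 4 | 3 | 3 | 3 | 0 | 3 |
| 3 | IR1 | AllPub | TA | TA | TA | Gd | No | ALQ | Unf | Gd | ... | 3 | 3 | 4 | 4 | 4 | 4 | 3 | 3 | 0 | 3 |
| 4 | IR1 | AllPub | Gd | TA | Gd | TA | Av | GLQ | Unf | Ex | ... | 3 | 4 | 3 | 5 | 4 | 3 | 3 | 3 | 0 | 3 |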
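
The quality-scale mapping used above can be wrapped in a small reusable function, in line with the to-do item about writing a function for each processing step. This is a sketch under assumptions: the helper name `encode_ordinal`, the toy frame and the `missing_code` parameter are illustrative and not part of the notebook.

```python
import pandas as pd

# Same Po < Fa < TA < Gd < Ex ordering as the labels dictionary above.
QUALITY_SCALE = {"Po": 1, "Fa": 2, "TA": 3, "Gd": 4, "Ex": 5}

def encode_ordinal(df, columns, scale, missing_code=0):
    """Return a copy of df with a '<col>Label' integer column per input column."""
    out = df.copy()
    for col in columns:
        # Missing values (and any category absent from scale) get missing_code.
        out[col + "Label"] = out[col].map(scale).fillna(missing_code).astype(int)
    return out

# Hypothetical toy frame with one complete and one mostly missing column.
toy = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex"],
                    "PoolQC": [None, "Fa", None]})
encoded = encode_ordinal(toy, ["ExterQual", "PoolQC"], QUALITY_SCALE)
```

Calling the same function on the training and test frames keeps the ordinal codes identical across both.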


### Use of One-Hot encoding:

One-hot encoding transforms categorical features into a format that works better with classification and regression algorithms.
We generate one boolean column for each category; for each sample, exactly one of these columns takes the value 1, hence the term one-hot encoding.
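
As a minimal illustration, assuming a toy frame with only the Street column, pandas can produce these indicator columns with get_dummies:

```python
import pandas as pd

# Hypothetical toy frame; the real data has many more categorical columns.
toy = pd.DataFrame({"Street": ["Pave", "Grvl", "Pave"]})

# One indicator column per category, named '<column>_<category>'.
one_hot = pd.get_dummies(toy, columns=["Street"], dtype=int)

# Each row has exactly one indicator set to 1.
row_sums = one_hot.sum(axis=1)
```

Unlike the integer labels used for the nominal features earlier, this representation does not impose an artificial order on the categories, at the cost of one extra column per category.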