# House price regression

The main goal of this project is to estimate the sale price of real estate.

## Problem
We were hired by a real estate investor to propose a software solution that quickly estimates the value of houses. The goal of this application is to scan the whole real estate market in order to identify undervalued houses.

## Solution
To address this problem, we propose a regression algorithm that estimates the value of a house from a set of its properties. By comparing the estimated value against the market value, we can then spot investment opportunities.
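
The comparison step can be sketched as follows. This is only an illustration of the idea, not the project's implementation: the helper `flag_undervalued`, the column names `estimated_price` and `market_price`, the 10% threshold and the toy prices are all assumptions made for the example.

```python
import pandas as pd

def flag_undervalued(listings, threshold=0.10):
    """Return the listings whose market price is at least `threshold` below the estimate.

    Hypothetical helper: `estimated_price` and `market_price` are assumed column names.
    """
    result = listings.copy()
    # Relative discount of the market price with respect to the model estimate
    result["discount"] = (
        result["estimated_price"] - result["market_price"]
    ) / result["estimated_price"]
    # Keep the listings that are cheap enough, largest discount first
    return result[result["discount"] >= threshold].sort_values("discount", ascending=False)

# Toy listings, invented for the example
listings = pd.DataFrame({
    "Id": [1, 2, 3],
    "estimated_price": [200000.0, 150000.0, 300000.0],
    "market_price": [170000.0, 149000.0, 320000.0],
})

opportunities = flag_undervalued(listings)
print(opportunities)
```

A relative threshold is used here rather than an absolute one so that cheap and expensive houses are treated alike; the right value would depend on the investor's margin.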

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

### Data import, cleaning and analysis
The first step is to import the data and prepare it for the algorithm.
We start by loading the data and listing every feature with its type, an example value and its number of missing values.

In [2]:
data = pd.read_csv('train.csv')
for name,dtype in zip(data.columns, data.dtypes):
    print(name,": ", dtype, "         Ex: ", data[name].iloc[0], "    Number of NaN: ", data[name].isnull().sum())

Id :  int64          Ex:  1     Number of NaN:  0
MSSubClass :  int64          Ex:  60     Number of NaN:  0
MSZoning :  object          Ex:  RL     Number of NaN:  0
LotFrontage :  float64          Ex:  65.0     Number of NaN:  259
LotArea :  int64          Ex:  8450     Number of NaN:  0
Street :  object          Ex:  Pave     Number of NaN:  0
Alley :  object          Ex:  nan     Number of NaN:  1369
LotShape :  object          Ex:  Reg     Number of NaN:  0
LandContour :  object          Ex:  Lvl     Number of NaN:  0
Utilities :  object          Ex:  AllPub     Number of NaN:  0
LotConfig :  object          Ex:  Inside     Number of NaN:  0
LandSlope :  object          Ex:  Gtl     Number of NaN:  0
Neighborhood :  object          Ex:  CollgCr     Number of NaN:  0
Condition1 :  object          Ex:  Norm     Number of NaN:  0
Condition2 :  object          Ex:  Norm     Number of NaN:  0
BldgType :  object          Ex:  1Fam     Number of NaN:  0
HouseStyle :  object          Ex: 

From this listing, we can see that some features do not contain any valuable information.
Those features are:
- Id
- ...
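
Removing such columns could look like the sketch below. It uses a toy frame and drops only `Id`, the one feature named in the list; `uninformative_cols` is a hypothetical name, and the rest of the list is left open as in the text.

```python
import pandas as pd

# Toy frame, invented for the example
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "MSSubClass": [60, 20, 60],
    "LotArea": [8450, 9600, 11250],
})

# `Id` takes a different value on every row, so it only identifies the row
# and says nothing about the house itself.
id_is_unique = toy["Id"].is_unique

uninformative_cols = ["Id"]  # hypothetical list, to be completed
features = toy.drop(columns=uninformative_cols)
print(features.columns.tolist())
```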

In addition, there are many missing values. The task here is to differentiate genuinely missing values (errors in the dataset) from the absence of the corresponding feature in the house.

In the first case, we'll remove the row.

In the second case, we'll consider any object that can be absent as added value, and thus replace the NaNs with zero.
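
A minimal sketch of this two-way treatment on a toy frame. The split of the columns into `absence_cols` (NaN means the house lacks the feature) and `error_cols` (NaN is a genuine gap in the data) is an assumption made for illustration; the real split has to be decided feature by feature.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for the example
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0, 70.0],
    "Alley": ["Grvl", np.nan, np.nan, "Pave"],
    "PoolQC": [np.nan, np.nan, "Gd", np.nan],
})

# Assumed classification of the columns (hypothetical, for illustration only)
absence_cols = ["Alley", "PoolQC"]  # NaN = the house has no such feature
error_cols = ["LotFrontage"]        # NaN = the value is missing from the dataset

# Case 1: drop the rows where a genuinely missing value occurs
cleaned = toy.dropna(subset=error_cols).copy()

# Case 2: encode the absence of a feature as zero
# (cast to object first so strings and the integer 0 can share a column)
cleaned[absence_cols] = cleaned[absence_cols].astype(object).fillna(0)

print(cleaned)
```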

In [3]:
data.describe(include=['O'])

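| | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageType | GarageFinish | GarageQual | GarageCond | PavedDrive | PoolQC | Fence | MiscFeature | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1460 | 1460 | 91 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | 1460 | ... | 1379 | 1379 | 1379 | 1379 | 1460 | 7 | 281 | 54 | 1460 | 1460 |
| unique | 5 | 2 | 2 | 4 | 4 | 2 | 5 | 3 | 25 | 9 | ... | 6 | 3 | 5 | 5 | 3 | 3 | 4 | 4 | 9 | 6 |
| top | RL | Pave | Grvl | Reg | Lvl | AllPub | Inside | Gtl | NAmes | Norm | ... | Attchd | Unf | TA | TA | Y | Gd | MnPrv | Shed | WD | Normal |
| freq | 1151 | 1454 | 50 | 925 | 1311 | 1459 | 1052 | 1382 | 225 | 1260 | ... | 870 | 605 | 1311 | 1326 | 1340 | 3 | 157 | 49 | 1267 | 1198 |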


In [8]:
# object_cols = [
#     'MSZoning', 'Street', 'Alley', 'LotShape',
#     'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
#     'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
#     'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st',
#     'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond',
#     'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
#     'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC',
#     'CentralAir', 'Electrical', 'KitchenQual', 'Functional',
#     'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual',
#     'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature',
#     'SaleType', 'SaleCondition'
# ]


# Shared ordinal scales: a higher number means better quality or finish.
quality = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}
bsmt_finish = {"GLQ": 6, "ALQ": 5, "BLQ": 4, "Rec": 3, "LwQ": 2, "Unf": 1}

# Per-column encodings. Levels that are not listed are left unchanged.
encodings = {
    'MSZoning': {"A": 0, "RL": 1, "RP": 2, "RM": 3, "RH": 4, "C": 5, "FV": 6, "I": 7},
    'Street': {"Grvl": 1, "Pave": 2},
    'Alley': {"Grvl": 1, "Pave": 2},
    'LotShape': {"Reg": 3, "IR1": 2, "IR2": 1, "IR3": 0},
    'LandContour': {"Lvl": 0, "Bnk": 1, "HLS": 2, "Low": 3},
    'Utilities': {"AllPub": 3, "NoSewr": 2, "NoSeWa": 1, "ELO": 0},
    'LotConfig': {"CulDSac": 0, "Inside": 1, "Corner": 2, "FR2": 3, "FR3": 4},
    'LandSlope': {"Gtl": 0, "Mod": 1, "Sev": 2},
    'BldgType': {"1Fam": 0, "2FmCon": 1, "Duplx": 2, "TwnhsE": 3, "TwnhsI": 4},
    'HouseStyle': {"1Story": 0, "1.5Fin": 1, "1.5Unf": 2, "2Story": 3,
                   "2.5Fin": 4, "2.5Unf": 5, "SFoyer": 6, "SLvl": 7},
    'RoofStyle': {"Flat": 0, "Gable": 1, "Gambrel": 2, "Hip": 3, "Mansard": 4, "Shed": 5},
    'RoofMatl': {"ClyTile": 0, "CompShg": 1, "Membran": 2, "Metal": 3,
                 "Roll": 4, "Tar&Grv": 5, "WdShake": 6, "WdShngl": 7},
    'Foundation': {"BrkCmn": 1, "BrkFace": 2, "CBlock": 3, "Stone": 4, "None": 0},
    # Every level was previously mapped to 0, which erased the information;
    # the graded scale below is assumed by analogy with the other ordinal columns.
    'BsmtExposure': {"Gd": 4, "Av": 3, "Mn": 2, "No": 1},
    'BsmtFinType1': bsmt_finish,
    'BsmtFinType2': bsmt_finish,
    'Heating': {"Floor": 1, "GasA": 2, "GasW": 3, "Grav": 4, "OthW": 5, "Wall": 6},
    'CentralAir': {"Y": 1, "N": 0},
    'GarageFinish': {"Fin": 3, "RFn": 2, "Unf": 1},
}
for col in ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'HeatingQC',
            'KitchenQual', 'FireplaceQu', 'GarageQual', 'GarageCond', 'PoolQC']:
    encodings[col] = quality

for col, mapping in encodings.items():
    data[col] = data[col].replace(mapping)

# read_csv parses the "NA" level as NaN, so replace("NA", 0) never matches.
# A feature that is absent from the house is encoded as 0 with fillna instead.
absent_as_zero = ['Alley', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                  'BsmtFinType2', 'FireplaceQu', 'GarageFinish', 'GarageQual',
                  'GarageCond', 'PoolQC']
data[absent_as_zero] = data[absent_as_zero].fillna(0)

# Nominal columns without a numeric mapping yet: only empty strings become 0.
unmapped = ['Neighborhood', 'Condition1', 'Condition2', 'Exterior1st', 'Exterior2nd',
            'MasVnrType', 'Electrical', 'Functional', 'GarageType', 'PavedDrive',
            'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']
data[unmapped] = data[unmapped].replace("", 0)

In [9]:
print(data['MSZoning'])

0       1
1       1
2       1
3       1
4       1
       ..
1455    1
1456    1
1457    1
1458    1
1459    1
Name: MSZoning, Length: 1460, dtype: object
