<a href="https://colab.research.google.com/github/MTahaRF/House-Prices---Advanced-Regression-Techniques-Competition/blob/main/House_Prices.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#House Prices Competition


---


File descriptions

train.csv - the training set

test.csv - the test set

data_description.txt - full description of each column, originally prepared by Dean De Cock but lightly edited to match the column names used here

sample_submission.csv - a benchmark submission from a linear regression on year and month of sale, lot square footage, and number of bedrooms


---


#Data fields


---


Here's a brief version of what you'll find in the data description file.

SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.

MSSubClass: The building class

MSZoning: The general zoning classification

LotFrontage: Linear feet of street connected to property

LotArea: Lot size in square feet

Street: Type of road access

Alley: Type of alley access

LotShape: General shape of property

LandContour: Flatness of the property

Utilities: Type of utilities available

LotConfig: Lot configuration

LandSlope: Slope of property

Neighborhood: Physical locations within Ames city limits

Condition1: Proximity to main road or railroad

Condition2: Proximity to main road or railroad (if a second is present)

BldgType: Type of dwelling

HouseStyle: Style of dwelling

OverallQual: Overall material and finish quality

OverallCond: Overall condition rating

YearBuilt: Original construction date

YearRemodAdd: Remodel date

RoofStyle: Type of roof

RoofMatl: Roof material

Exterior1st: Exterior covering on house

Exterior2nd: Exterior covering on house (if more than one material)

MasVnrType: Masonry veneer type

MasVnrArea: Masonry veneer area in square feet

ExterQual: Exterior material quality

ExterCond: Present condition of the material on the exterior

Foundation: Type of foundation

BsmtQual: Height of the basement

BsmtCond: General condition of the basement

BsmtExposure: Walkout or garden level basement walls

BsmtFinType1: Quality of basement finished area

BsmtFinSF1: Type 1 finished square feet

BsmtFinType2: Quality of second finished area (if present)

BsmtFinSF2: Type 2 finished square feet

BsmtUnfSF: Unfinished square feet of basement area

TotalBsmtSF: Total square feet of basement area

Heating: Type of heating

HeatingQC: Heating quality and condition

CentralAir: Central air conditioning

Electrical: Electrical system

1stFlrSF: First Floor square feet

2ndFlrSF: Second floor square feet

LowQualFinSF: Low quality finished square feet (all floors)

GrLivArea: Above grade (ground) living area square feet

BsmtFullBath: Basement full bathrooms

BsmtHalfBath: Basement half bathrooms

FullBath: Full bathrooms above grade

HalfBath: Half baths above grade

BedroomAbvGr: Number of bedrooms above basement level

KitchenAbvGr: Number of kitchens

KitchenQual: Kitchen quality

TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)

Functional: Home functionality rating

Fireplaces: Number of fireplaces

FireplaceQu: Fireplace quality

GarageType: Garage location

GarageYrBlt: Year garage was built

GarageFinish: Interior finish of the garage

GarageCars: Size of garage in car capacity

GarageArea: Size of garage in square feet

GarageQual: Garage quality

GarageCond: Garage condition

PavedDrive: Paved driveway

WoodDeckSF: Wood deck area in square feet

OpenPorchSF: Open porch area in square feet

EnclosedPorch: Enclosed porch area in square feet

3SsnPorch: Three season porch area in square feet

ScreenPorch: Screen porch area in square feet

PoolArea: Pool area in square feet

PoolQC: Pool quality

Fence: Fence quality

MiscFeature: Miscellaneous feature not covered in other categories

MiscVal: Dollar value of miscellaneous feature

MoSold: Month Sold

YrSold: Year Sold

SaleType: Type of sale

SaleCondition: Condition of sale


---



#Agenda
---
Loading Libraries

Loading Data

Getting Basic Idea About Data

Missing Values and Dealing with Missing Values

One Hot Encoding (Creating dummies for categorical columns)

Standardization / Normalization

Splitting the dataset into train and test data

Dealing with Imbalanced Data

Generate Synthetic Samples


---




#Loading Libraries

In [2]:
 ! pip install kaggle
 ! mkdir -p ~/.kaggle
 ! cp kaggle.json ~/.kaggle/
 ! chmod 600 ~/.kaggle/kaggle.json
 ! kaggle competitions download -c House-Prices-Advanced-Regression-Techniques
 ! unzip House-Prices-Advanced-Regression-Techniques.zip

Downloading House-Prices-Advanced-Regression-Techniques.zip to /content
100% 199k/199k [00:00<00:00, 544kB/s]
Archive:  House-Prices-Advanced-Regression-Techniques.zip
  inflating: data_description.txt    
  inflating: sample_submission.csv   
  inflating: test.csv                
  inflating: train.csv               


In [132]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

#Loading Data

In [133]:
train=pd.read_csv("/content/train.csv")
test=pd.read_csv("/content/test.csv")
test.shape

(1459, 80)

In [134]:
X = train.drop(['SalePrice'],axis=1)
y = train.SalePrice

In [135]:
from sklearn.model_selection import train_test_split
X_train, X_valid, y_train, y_valid = train_test_split(X, y)
X_train.shape,X_valid.shape

((1095, 80), (365, 80))
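
The shapes above follow from `train_test_split` holding out 25% of the rows by default. Because no seed is passed, each run produces a different split, so the outputs further down change from run to run. The sketch below reproduces the arithmetic and shows what a seeded shuffle buys. It uses only the standard library; `split_indices` is a hypothetical helper for illustration, not scikit-learn's implementation.

```python
import math
import random

def split_indices(n_rows, test_fraction=0.25, seed=0):
    """Shuffle row indices with a fixed seed and split them into train/valid."""
    n_valid = math.ceil(n_rows * test_fraction)
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # same seed -> same order every run
    return indices[n_valid:], indices[:n_valid]

train_idx, valid_idx = split_indices(1460)
print(len(train_idx), len(valid_idx))
```

In the notebook itself, passing an integer `random_state` to `train_test_split` plays the role of `seed` here.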

#Understanding the Data

In [136]:
train.head()

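| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |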


In [137]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1460 non-null   int64  
 1   MSSubClass     1460 non-null   int64  
 2   MSZoning       1460 non-null   object 
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64  
 5   Street         1460 non-null   object 
 6   Alley          91 non-null     object 
 7   LotShape       1460 non-null   object 
 8   LandContour    1460 non-null   object 
 9   Utilities      1460 non-null   object 
 10  LotConfig      1460 non-null   object 
 11  LandSlope      1460 non-null   object 
 12  Neighborhood   1460 non-null   object 
 13  Condition1     1460 non-null   object 
 14  Condition2     1460 non-null   object 
 15  BldgType       1460 non-null   object 
 16  HouseStyle     1460 non-null   object 
 17  OverallQual    1460 non-null   int64  
 18  OverallC

#Data Cleaning

###Filling Null Values

In [138]:
X_train_no = X_train.select_dtypes(exclude=['object'])
X_train_no.info()
X_valid_no = X_valid.select_dtypes(exclude=['object'])
X_valid_no.info()
test_no = test.select_dtypes(exclude=['object'])

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 78 to 179
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1095 non-null   int64  
 1   MSSubClass     1095 non-null   int64  
 2   LotFrontage    895 non-null    float64
 3   LotArea        1095 non-null   int64  
 4   OverallQual    1095 non-null   int64  
 5   OverallCond    1095 non-null   int64  
 6   YearBuilt      1095 non-null   int64  
 7   YearRemodAdd   1095 non-null   int64  
 8   MasVnrArea     1089 non-null   float64
 9   BsmtFinSF1     1095 non-null   int64  
 10  BsmtFinSF2     1095 non-null   int64  
 11  BsmtUnfSF      1095 non-null   int64  
 12  TotalBsmtSF    1095 non-null   int64  
 13  1stFlrSF       1095 non-null   int64  
 14  2ndFlrSF       1095 non-null   int64  
 15  LowQualFinSF   1095 non-null   int64  
 16  GrLivArea      1095 non-null   int64  
 17  BsmtFullBath   1095 non-null   int64  
 18  BsmtHalfBath 

In [139]:
from sklearn.impute import SimpleImputer

# Imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train_no))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid_no))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train_no.columns
imputed_X_valid.columns = X_valid_no.columns

imputed_X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Data columns (total 37 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             1095 non-null   float64
 1   MSSubClass     1095 non-null   float64
 2   LotFrontage    1095 non-null   float64
 3   LotArea        1095 non-null   float64
 4   OverallQual    1095 non-null   float64
 5   OverallCond    1095 non-null   float64
 6   YearBuilt      1095 non-null   float64
 7   YearRemodAdd   1095 non-null   float64
 8   MasVnrArea     1095 non-null   float64
 9   BsmtFinSF1     1095 non-null   float64
 10  BsmtFinSF2     1095 non-null   float64
 11  BsmtUnfSF      1095 non-null   float64
 12  TotalBsmtSF    1095 non-null   float64
 13  1stFlrSF       1095 non-null   float64
 14  2ndFlrSF       1095 non-null   float64
 15  LowQualFinSF   1095 non-null   float64
 16  GrLivArea      1095 non-null   float64
 17  BsmtFullBath   1095 non-null   float64
 18  BsmtHalf
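
`SimpleImputer()` defaults to mean imputation, and the cell above fits it on the training split only, then reuses those means for the validation split. The sketch below shows that fit-then-transform pattern on a toy two-column table. It is stdlib-only; `fit_column_means` and `fill_missing` are hypothetical helpers, and the numbers are made up for illustration.

```python
import math

def fit_column_means(rows):
    """Learn the mean of each column, skipping missing (NaN) entries."""
    means = []
    for j in range(len(rows[0])):
        observed = [row[j] for row in rows if not math.isnan(row[j])]
        means.append(sum(observed) / len(observed))
    return means

def fill_missing(rows, means):
    """Replace NaN entries with the previously learned column means."""
    return [[means[j] if math.isnan(v) else v for j, v in enumerate(row)]
            for row in rows]

nan = float("nan")
train_rows = [[65.0, 8450.0], [nan, 9600.0], [85.0, 11250.0]]
valid_rows = [[nan, 9550.0]]

means = fit_column_means(train_rows)            # learned from training rows only
train_filled = fill_missing(train_rows, means)
valid_filled = fill_missing(valid_rows, means)  # validation reuses training means
print(train_filled[1][0], valid_filled[0][0])
```

Learning the means from the training rows alone keeps information from the validation rows out of the preprocessing step.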

In [140]:
X_train_cat = X_train.select_dtypes(include=['object'])
X_train_cat.info()
X_valid_cat = X_valid.select_dtypes(include=['object'])
X_valid_cat.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1095 entries, 78 to 179
Data columns (total 43 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   MSZoning       1095 non-null   object
 1   Street         1095 non-null   object
 2   Alley          68 non-null     object
 3   LotShape       1095 non-null   object
 4   LandContour    1095 non-null   object
 5   Utilities      1095 non-null   object
 6   LotConfig      1095 non-null   object
 7   LandSlope      1095 non-null   object
 8   Neighborhood   1095 non-null   object
 9   Condition1     1095 non-null   object
 10  Condition2     1095 non-null   object
 11  BldgType       1095 non-null   object
 12  HouseStyle     1095 non-null   object
 13  RoofStyle      1095 non-null   object
 14  RoofMatl       1095 non-null   object
 15  Exterior1st    1095 non-null   object
 16  Exterior2nd    1095 non-null   object
 17  MasVnrType     436 non-null    object
 18  ExterQual      1095 non-null   ob

In [141]:
X_train_cat=X_train_cat.fillna("MISSING")
X_valid_cat=X_valid_cat.fillna("MISSING")
imputed_X_train.shape

(1095, 37)

In [142]:
X_train_cat.isnull().sum()

MSZoning         0
Street           0
Alley            0
LotShape         0
LandContour      0
Utilities        0
LotConfig        0
LandSlope        0
Neighborhood     0
Condition1       0
Condition2       0
BldgType         0
HouseStyle       0
RoofStyle        0
RoofMatl         0
Exterior1st      0
Exterior2nd      0
MasVnrType       0
ExterQual        0
ExterCond        0
Foundation       0
BsmtQual         0
BsmtCond         0
BsmtExposure     0
BsmtFinType1     0
BsmtFinType2     0
Heating          0
HeatingQC        0
CentralAir       0
Electrical       0
KitchenQual      0
Functional       0
FireplaceQu      0
GarageType       0
GarageFinish     0
GarageQual       0
GarageCond       0
PavedDrive       0
PoolQC           0
Fence            0
MiscFeature      0
SaleType         0
SaleCondition    0
dtype: int64

###One Hot Encoding

In [143]:
from sklearn.preprocessing import OneHotEncoder

# Apply one-hot encoder to each column with categorical data
# sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train_cat))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid_cat))

# One-hot encoding removed index; put it back
OH_cols_train.index = X_train_cat.index
OH_cols_valid.index = X_valid_cat.index

# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([imputed_X_train.reset_index(drop=True), OH_cols_train.reset_index(drop=True)], axis=1)
OH_X_valid = pd.concat([imputed_X_valid.reset_index(drop=True), OH_cols_valid.reset_index(drop=True)], axis=1)

# Ensure all columns have string type
OH_X_train.columns = OH_X_train.columns.astype(str)
OH_X_valid.columns = OH_X_valid.columns.astype(str)
OH_X_train.shape,OH_X_valid.shape

((1095, 299), (365, 299))
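
With `handle_unknown='ignore'`, a category that appears in the validation or test data but not in the training data is encoded as a row of zeros instead of raising an error. The sketch below illustrates that behaviour on a toy column. It is stdlib-only; `fit_categories` and `one_hot` are hypothetical helpers, not scikit-learn's implementation.

```python
def fit_categories(values):
    """Collect the sorted set of categories seen during fitting."""
    return sorted(set(values))

def one_hot(values, categories):
    """Encode each value as a 0/1 row; unseen categories become all zeros."""
    return [[1.0 if value == cat else 0.0 for cat in categories]
            for value in values]

train_zoning = ["RL", "RM", "RL", "MISSING"]
valid_zoning = ["RM", "FV"]  # "FV" never appears in the training column

categories = fit_categories(train_zoning)
encoded_valid = one_hot(valid_zoning, categories)
print(categories)
print(encoded_valid)
```

Because the category list is fixed at fit time, every split ends up with the same number of encoded columns, which is why the train and validation shapes above agree.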

In [144]:
X_train = OH_X_train.copy()
X_valid = OH_X_valid.copy()
X_train.shape,X_valid.shape

((1095, 299), (365, 299))

In [145]:
X_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1095 entries, 0 to 1094
Columns: 299 entries, Id to 261
dtypes: float64(299)
memory usage: 2.5 MB


# Building The Model And Checking the Accuracy

In [243]:
from xgboost import XGBRegressor
baseline_model = XGBRegressor(n_estimators=1000, learning_rate=0.05)
baseline_model.fit(X_train, y_train,
             eval_set=[(X_valid, y_valid)],
             verbose=False)

In [244]:
y_pred=baseline_model.predict(X_valid)

In [246]:
from sklearn.metrics import mean_absolute_error
print("Baseline Mean Absolute Error: ",mean_absolute_error(y_valid, y_pred))

Baseline Mean Absolute Error:  17621.68949058219


In [247]:
baseline_model.score(X_valid,y_valid)

0.8628488678430443
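
The two numbers above measure different things: the mean absolute error is in dollars, while `score` on a regressor returns the coefficient of determination R², which is unitless. A minimal sketch of both formulas on made-up prices follows. It is stdlib-only, and the helper names are hypothetical.

```python
def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [100000.0, 200000.0, 300000.0]
y_pred = [110000.0, 190000.0, 300000.0]

mae = mean_absolute_error_manual(y_true, y_pred)
r2 = r2_manual(y_true, y_pred)
print(mae, r2)
```

A model that always predicts the mean of the targets scores an R² of 0, so the 0.86 above means the baseline explains most of the variance in the validation prices.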

# Improving The Model

In [255]:
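def score_model(n_est,X_t=X_train, X_v=X_valid, y_t=y_train, y_v=y_valid):
    """Fit XGBoost with early stopping; return validation MAE and R^2 * 10000."""
    # Newer XGBoost releases expect early_stopping_rounds on the estimator
    # rather than as a fit() argument.
    my_model = XGBRegressor(n_estimators=n_est, learning_rate=0.05,
                            early_stopping_rounds=5)
    my_model.fit(X_t, y_t,
             eval_set=[(X_v, y_v)],
             verbose=False)
    preds = my_model.predict(X_v)
    mae = mean_absolute_error(y_v, preds)
    # score() returns R^2; scaled by 10000 so it prints as an integer below
    score = (my_model.score(X_v, y_v))*10000
    return mae,score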

for i in range(0, 21):
    mae,score = score_model(i*50)
    print("Model %d MAE: %d (n_est = %d) (Model Score = %d)" % (i+1, mae,i*50,score))

Model 1 MAE: 187488 (n_est = 0) (Model Score = -52506)
Model 2 MAE: 19809 (n_est = 50) (Model Score = 8312)
Model 3 MAE: 18185 (n_est = 100) (Model Score = 8561)
Model 4 MAE: 17889 (n_est = 150) (Model Score = 8597)
Model 5 MAE: 17860 (n_est = 200) (Model Score = 8601)
Model 6 MAE: 17860 (n_est = 250) (Model Score = 8601)
Model 7 MAE: 17860 (n_est = 300) (Model Score = 8601)
Model 8 MAE: 17860 (n_est = 350) (Model Score = 8601)
Model 9 MAE: 17860 (n_est = 400) (Model Score = 8601)
Model 10 MAE: 17860 (n_est = 450) (Model Score = 8601)
Model 11 MAE: 17860 (n_est = 500) (Model Score = 8601)
Model 12 MAE: 17860 (n_est = 550) (Model Score = 8601)
Model 13 MAE: 17860 (n_est = 600) (Model Score = 8601)
Model 14 MAE: 17860 (n_est = 650) (Model Score = 8601)
Model 15 MAE: 17860 (n_est = 700) (Model Score = 8601)
Model 16 MAE: 17860 (n_est = 750) (Model Score = 8601)
Model 17 MAE: 17860 (n_est = 800) (Model Score = 8601)
Model 18 MAE: 17860 (n_est = 850) (Model Score = 8601)
Model 19 MAE: 17860
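
The plateau in the output above is consistent with early stopping: once `n_estimators` is larger than the round at which the validation error stops improving, every run halts at about the same point and reports the same MAE. The sketch below shows the stopping rule in simplified form. It is stdlib-only, `best_round_with_early_stopping` is a hypothetical helper rather than XGBoost's implementation, and the error values are made up.

```python
def best_round_with_early_stopping(valid_errors, patience):
    """Return (best_round, rounds_run) for a sequence of validation errors.

    Training stops once `patience` consecutive rounds fail to improve
    on the best validation error seen so far.
    """
    best_error = float("inf")
    best_round = -1
    for round_idx, error in enumerate(valid_errors):
        if error < best_error:
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= patience:
            return best_round, round_idx + 1
    return best_round, len(valid_errors)

errors = [10.0, 8.0, 7.0, 7.5, 7.2, 7.1, 7.3, 6.0]
print(best_round_with_early_stopping(errors, patience=3))
print(best_round_with_early_stopping(errors, patience=10))
```

With a small patience the loop stops before reaching the later improvement at the last round, which is one reason a short window such as 5 can end training earlier than a model run for the full number of rounds.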

##Feature Selection

In [150]:
from sklearn.feature_selection import RFE # recursive feature elimination

rfe = RFE(estimator= baseline_model , step = 100)
# estimator: the XGBoost baseline model built above; RFE refits a copy of it on every round
# step = 100: removes up to 100 of the least important features per round (step = 1 would remove one at a time)
# RFE ranks features by the estimator's feature importances, not by validation accuracy.
# n_features_to_select is not set, so RFE keeps half of the features by default

# Fit the function for ranking the features
fit = rfe.fit(X_train, y_train)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
selected_rfe_features = pd.DataFrame({'Feature':list(X_train.columns),'Ranking':rfe.ranking_})
selected_rfe_features.sort_values(by='Ranking')

Num Features: 149
Selected Features: [False  True  True  True  True  True  True  True  True  True  True  True
  True  True  True False  True  True False  True  True  True  True  True
  True  True  True  True  True  True  True False  True  True False  True
  True  True  True  True  True  True False False  True  True  True False
  True False  True False  True  True  True False False  True  True  True
  True  True  True False False False False False  True  True  True  True
  True  True  True  True  True  True False False  True False  True False
 False False  True  True  True False  True  True  True False  True  True
 False False False False False False False False False False False False
  True False False  True False  True False False False False  True  True
  True False  True False False False  True False False False False False
 False False False  True False  True  True False  True  True False  True
  True False False False False False  True False False  True False  True
 False  True F

| | Feature | Ranking |
|---|---|---|
| 298 | 261 | 1 |
| 237 | 200 | 1 |
| 236 | 199 | 1 |
| 231 | 194 | 1 |
| 230 | 193 | 1 |
| ... | ... | ... |
| 123 | 86 | 3 |
| 215 | 178 | 3 |
| 214 | 177 | 3 |
| 98 | 61 | 3 |
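
The count of 149 selected features and the maximum ranking of 3 both follow from the elimination schedule: RFE keeps half of the 299 features by default and drops at most `step` features per round. The sketch below reproduces only that schedule. It is stdlib-only, `rfe_schedule` is a hypothetical helper, and it does not fit any model or rank any feature.

```python
def rfe_schedule(n_features, step, n_features_to_select=None):
    """Return the number of features remaining after each elimination round."""
    if n_features_to_select is None:
        n_features_to_select = n_features // 2  # default: keep half
    remaining = n_features
    history = [remaining]
    while remaining > n_features_to_select:
        # Never remove more features than needed to reach the target
        remaining -= min(step, remaining - n_features_to_select)
        history.append(remaining)
    return history

print(rfe_schedule(299, step=100))
```

Two rounds are enough here (299 to 199 to 149). Features dropped in the first round end with ranking 3, those dropped in the second with ranking 2, and the survivors with ranking 1.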


In [151]:
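# Transforming the data
X_train_rfe = rfe.transform(X_train)
X_test_rfe = rfe.transform(X_valid)

# Fit a fresh copy of the baseline model on the selected features,
# so that baseline_model itself stays trained on all features
from sklearn.base import clone
lr_rfe_model = clone(baseline_model).fit(X_train_rfe, y_train)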

In [152]:
y_pred_rfe = lr_rfe_model.predict(X_test_rfe)

In [277]:
print("rfe Mean Absolute Error: ",mean_absolute_error(y_valid, y_pred_rfe))

rfe Mean Absolute Error:  17471.053692208905


In [154]:
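# Score against the true validation targets; scoring the predictions
# against themselves would trivially return 1.0
lr_rfe_model.score(X_test_rfe, y_valid)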

# Predicting test values with final model

###Preparing the Test Dataset for Prediction

In [113]:
imputed_test = pd.DataFrame(my_imputer.transform(test_no))
imputed_test.columns = test_no.columns

In [114]:
test_cat = test.select_dtypes(include=['object'])
test_cat=test_cat.fillna("MISSING")

In [115]:
OH_cols_test = pd.DataFrame(OH_encoder.transform(test_cat))
OH_cols_test.index = test_cat.index
OH_test = pd.concat([imputed_test.reset_index(drop=True), OH_cols_test.reset_index(drop=True)], axis=1)
OH_test.columns = OH_test.columns.astype(str)
test = OH_test.copy()
test.shape

(1459, 293)
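
The 293 columns printed above do not match the 299 columns of `X_train` shown earlier, which suggests this cell ran against an encoder fitted on a different random split (the cell numbers are also out of order). A model fitted on one column layout will generally reject the other, so it can help to compare layouts before predicting. The sketch below is one way to do that. It is stdlib-only, `column_mismatch` is a hypothetical helper, and the column names are made up.

```python
def column_mismatch(train_columns, test_columns):
    """Return (missing_in_test, extra_in_test), preserving the input order."""
    train_set, test_set = set(train_columns), set(test_columns)
    missing = [col for col in train_columns if col not in test_set]
    extra = [col for col in test_columns if col not in train_set]
    return missing, extra

train_columns = ["Id", "LotArea", "0", "1", "2"]
test_columns = ["Id", "LotArea", "0", "1"]

missing, extra = column_mismatch(train_columns, test_columns)
print(missing, extra)
```

In the notebook, the same check could be run on `list(X_train.columns)` and `list(test.columns)` after re-running the cells top to bottom.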

In [116]:
y_pred=baseline_model.predict(test)

In [155]:
# Predict on the test set (not the validation split), applying the same
# RFE column selection that was used for training
y_pred_rfe = lr_rfe_model.predict(rfe.transform(test))

# Creating Submission.csv

In [129]:
# Submission from the baseline model (all features)
sub_data = {'Id':list(test.Id),
        'SalePrice':list(y_pred)}
sub = pd.DataFrame(sub_data)
convert_dict = {'Id': int,
                'SalePrice': float
                }
sub = sub.astype(convert_dict)
sub.to_csv('sub.csv',index=False,header = True)

In [156]:
# Submission from the RFE model (selected features only)
sub_data_rfe = {'Id':list(test.Id),
        'SalePrice':list(y_pred_rfe)}
sub = pd.DataFrame(sub_data_rfe)
convert_dict = {'Id': int,
                'SalePrice': float
                }
sub = sub.astype(convert_dict)
sub.to_csv('btr_sub.csv',index=False,header = True)
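
A submission file needs exactly one prediction per test `Id`, so a length check before writing catches the case where predictions were made on the wrong split. The sketch below writes the two-column format with that guard. It is stdlib-only, `build_submission_csv` is a hypothetical helper, and the prices are made up.

```python
import csv
import io

def build_submission_csv(ids, prices):
    """Return submission CSV text, refusing mismatched id/prediction counts."""
    if len(ids) != len(prices):
        raise ValueError(
            f"{len(ids)} ids but {len(prices)} predictions; "
            "were the predictions made on the test set?"
        )
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Id", "SalePrice"])
    for row_id, price in zip(ids, prices):
        writer.writerow([int(row_id), float(price)])  # Id as int, price as float
    return buffer.getvalue()

csv_text = build_submission_csv([1461.0, 1462.0], [120000.5, 150000.0])
print(csv_text)
```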

##Rank on the House Prices - Advanced Regression Techniques Competition : 2256