# Project description


Ask a buyer to describe their dream home and they probably won't start with basement ceiling height or proximity to an east-west rail line. But the data set from this playground contest proves that price negotiations are influenced by much more than the number of bedrooms or a white picket fence.

![](https://storage.googleapis.com/kaggle-competitions/kaggle/5407/media/housesbanner.png)

This dataset contains 79 explanatory variables describing almost every aspect of residential homes in Ames, Iowa. 

**Goal:** It is your job to predict the sales price for each house using everything you have learned so far. If **you use a model not presented in class, you must justify it, explain how it works and describe precisely the role of each of the hyper-parameters**. For each Id in the test set, you must predict the value of the SalePrice variable. 

**Metric:** Predictions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
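
To make the metric concrete, here is a minimal sketch of the computation. The helper name `rmsle` and the prices are illustrative assumptions, not part of the competition material.

```python
import numpy as np

def rmsle(y_true, y_pred):
    """RMSE between the logarithms of the predicted and observed prices."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((np.log(y_pred) - np.log(y_true)) ** 2)))

# A 10% overestimate is penalised identically on a cheap and on an
# expensive house (made-up prices).
cheap = rmsle([100_000], [110_000])
expensive = rmsle([1_000_000], [1_100_000])
print(cheap, expensive)
```

Because only the ratio between prediction and observation matters, both calls return the same value.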

**Homework submission**: You must upload a zip archive containing 3 files to `lms.univ-cotedazur.fr`:

* A `pdf` report describing for each of the selected features the treatment performed
* A `jupyter notebook` performing the preprocessing, each step of which is inserted into a sklearn or imblearn pipeline (you must leave traces of notebook executions. The first cell should have the number 1, the second the number 2, etc.)
* A `result.csv` file containing your prediction for each property in the test set, in the following format:
<pre>
        Id,SalePrice
        1461,169000.9876
        1462,187724.1233
        1463,175221.1928
        etc.
</pre>

The scale will be as follows:
* 8 points on the quality of the preprocessing and its description from the report 
* 8 points on the quality and correctness of the code contained in the notebook
* 4 points on the quality of the model produced

Here's a brief version of what you'll find in the data description file.

* SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
* MSSubClass: The building class
* MSZoning: The general zoning classification
* LotFrontage: Linear feet of street connected to property
* LotArea: Lot size in square feet
* Street: Type of road access
* Alley: Type of alley access
* LotShape: General shape of property
* LandContour: Flatness of the property
* Utilities: Type of utilities available
* LotConfig: Lot configuration
* LandSlope: Slope of property
* Neighborhood: Physical locations within Ames city limits
* Condition1: Proximity to main road or railroad
* Condition2: Proximity to main road or railroad (if a second is present)
* BldgType: Type of dwelling
* HouseStyle: Style of dwelling
* OverallQual: Overall material and finish quality
* OverallCond: Overall condition rating
* YearBuilt: Original construction date
* YearRemodAdd: Remodel date
* RoofStyle: Type of roof
* RoofMatl: Roof material
* Exterior1st: Exterior covering on house
* Exterior2nd: Exterior covering on house (if more than one material)
* MasVnrType: Masonry veneer type
* MasVnrArea: Masonry veneer area in square feet
* ExterQual: Exterior material quality
* ExterCond: Present condition of the material on the exterior
* Foundation: Type of foundation
* BsmtQual: Height of the basement
* BsmtCond: General condition of the basement
* BsmtExposure: Walkout or garden level basement walls
* BsmtFinType1: Quality of basement finished area
* BsmtFinSF1: Type 1 finished square feet
* BsmtFinType2: Quality of second finished area (if present)
* BsmtFinSF2: Type 2 finished square feet
* BsmtUnfSF: Unfinished square feet of basement area
* TotalBsmtSF: Total square feet of basement area
* Heating: Type of heating
* HeatingQC: Heating quality and condition
* CentralAir: Central air conditioning
* Electrical: Electrical system
* 1stFlrSF: First Floor square feet
* 2ndFlrSF: Second floor square feet
* LowQualFinSF: Low quality finished square feet (all floors)
* GrLivArea: Above grade (ground) living area square feet
* BsmtFullBath: Basement full bathrooms
* BsmtHalfBath: Basement half bathrooms
* FullBath: Full bathrooms above grade
* HalfBath: Half baths above grade
* BedroomAbvGr: Number of bedrooms above basement level
* KitchenAbvGr: Number of kitchens
* KitchenQual: Kitchen quality
* TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
* Functional: Home functionality rating
* Fireplaces: Number of fireplaces
* FireplaceQu: Fireplace quality
* GarageType: Garage location
* GarageYrBlt: Year garage was built
* GarageFinish: Interior finish of the garage
* GarageCars: Size of garage in car capacity
* GarageArea: Size of garage in square feet
* GarageQual: Garage quality
* GarageCond: Garage condition
* PavedDrive: Paved driveway
* WoodDeckSF: Wood deck area in square feet
* OpenPorchSF: Open porch area in square feet
* EnclosedPorch: Enclosed porch area in square feet
* 3SsnPorch: Three season porch area in square feet
* ScreenPorch: Screen porch area in square feet
* PoolArea: Pool area in square feet
* PoolQC: Pool quality
* Fence: Fence quality
* MiscFeature: Miscellaneous feature not covered in other categories
* MiscVal: Dollar value of miscellaneous feature
* MoSold: Month Sold
* YrSold: Year Sold
* SaleType: Type of sale
* SaleCondition: Condition of sale

More detail about the features is available in the `data_description.txt` file.

In [1]:
# Read train files
import pandas as pd

df_train = pd.read_csv("train.csv", index_col=0)
print(len(df_train))
df_train.head(3)

1000


Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,...,0,,,,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,...,0,,,,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,...,0,,,,0,9,2008,WD,Normal,223500


In [2]:
y = df_train["SalePrice"]  # y holds the target variable (the labels) only
y.shape

(1000,)

In [3]:
X = df_train.drop(["SalePrice"], axis=1)  # X holds the features: every column except the target SalePrice
X.shape

(1000, 79)

In [4]:
# Find the numerical columns in the dataset
num_cols = X.select_dtypes(include="number").columns  # columns with a numeric dtype (public equivalent of _get_numeric_data)
print("The number of numerical columns is:", len(num_cols))
print("\n")
print("The List of Numerical Columns:")
print('\n')
print(num_cols)


The number of numerical columns is: 36


The List of Numerical Columns:


Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces',
       'GarageYrBlt', 'GarageCars', 'GarageArea', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'MiscVal',
       'MoSold', 'YrSold'],
      dtype='object')
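
A numeric dtype does not always mean a numeric quantity: `MSSubClass`, for instance, stores the building class as integer codes. The sketch below, on a made-up three-row frame (only the column names come from the dataset), illustrates that `select_dtypes` splits columns by dtype alone.

```python
import pandas as pd

# Made-up rows; only the column names come from the dataset.
toy = pd.DataFrame({
    "MSSubClass": [60, 20, 60],       # building class stored as integer codes
    "LotArea": [8450, 9600, 11250],   # a genuine numeric quantity
    "MSZoning": ["RL", "RL", "RM"],   # text categories
})

numeric_cols = toy.select_dtypes(include="number").columns.tolist()
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, non_numeric_cols)
```

Whether such a coded column should be treated as categorical is a modelling decision that the dtype cannot make.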


In [5]:
cat_cols = X.select_dtypes(include=['object']).columns.tolist()  # columns with the object dtype are treated as categorical
print("The number of categorical columns is:", len(cat_cols))
print("\n")
print("The List of Categorical Columns:")
print('\n')
print(cat_cols)

The number of categorical columns is: 43


The List of Categorical Columns:


['MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'Functional', 'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'MiscFeature', 'SaleType', 'SaleCondition']


In [61]:
# Count the unique values in each categorical column to get an idea of its cardinality.
cat_dict = X[cat_cols].nunique()  # Series: column name -> number of distinct non-null values
cat_dict

MSZoning          5
Street            2
Alley             2
LotShape          4
LandContour       4
Utilities         2
LotConfig         5
LandSlope         3
Neighborhood     25
Condition1        9
Condition2        6
BldgType          5
HouseStyle        8
RoofStyle         5
RoofMatl          6
Exterior1st      11
Exterior2nd      15
MasVnrType        4
ExterQual         4
ExterCond         5
Foundation        6
BsmtQual          4
BsmtCond          4
BsmtExposure      4
BsmtFinType1      6
BsmtFinType2      6
Heating           4
HeatingQC         5
CentralAir        2
Electrical        5
KitchenQual       4
Functional        7
FireplaceQu       5
GarageType        6
GarageFinish      3
GarageQual        5
GarageCond        5
PavedDrive        3
PoolQC            2
Fence             4
MiscFeature       3
SaleType          9
SaleCondition     6
dtype: int64

In [6]:
# Ordinal categorical columns: their categories have a natural order (for example quality or condition ratings).
cat_ord = ["Utilities", "ExterQual", "ExterCond",
                       "BsmtQual", "BsmtCond", "BsmtExposure",
                       "BsmtFinType1", "BsmtFinType2", "HeatingQC",
                       "KitchenQual", "FireplaceQu","GarageFinish",
                       "GarageQual", "GarageCond", "PoolQC", "Fence"]

# Nominal categorical columns: their categories have no inherent order.
# Note: Neighborhood, Exterior1st, Exterior2nd and MiscFeature appear in neither list, so the ColumnTransformer defined later drops them.
cat_nom = ["MSZoning", "Street", "Alley", "LotShape",
                       "LandContour", "LotConfig","LandSlope", "Condition1",
                       "Condition2", "BldgType", "HouseStyle","RoofStyle",
                       "RoofMatl", "MasVnrType", "Foundation", "Heating",
                       "CentralAir","Electrical", "Functional", "GarageType",
                       "PavedDrive", "SaleType", "SaleCondition"]

In [7]:
# Next step: find the categorical columns that contain null values
nan_cat = {}
for col in cat_cols:
    n_missing = X[col].isnull().sum()
    if n_missing > 0:  # skip columns without null values to keep the output short
        nan_cat[col] = n_missing

print(nan_cat)


{'Alley': 935, 'MasVnrType': 6, 'BsmtQual': 24, 'BsmtCond': 24, 'BsmtExposure': 25, 'BsmtFinType1': 24, 'BsmtFinType2': 25, 'FireplaceQu': 478, 'GarageType': 56, 'GarageFinish': 56, 'GarageQual': 56, 'GarageCond': 56, 'PoolQC': 998, 'Fence': 806, 'MiscFeature': 957}


In [8]:
# Find the number of null values in the numerical columns

nan_num = {}
for col in num_cols:
    n_missing = X[col].isnull().sum()
    if n_missing > 0:  # skip columns without null values to keep the output short
        nan_num[col] = n_missing

print(nan_num)

{'LotFrontage': 173, 'MasVnrArea': 6, 'GarageYrBlt': 56}
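
The two loops above can also be written without an explicit loop. This is a minimal sketch on a made-up frame (the values are illustrative; only the column names come from the dataset).

```python
import numpy as np
import pandas as pd

# Made-up values with a few missing entries.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "LotArea": [8450, 9600, 11250, 9550],
    "MasVnrArea": [196.0, 0.0, np.nan, 0.0],
})

null_counts = toy.isnull().sum()                  # one count per column
nan_toy = null_counts[null_counts > 0].to_dict()  # keep only columns with missing values
print(nan_toy)
```

Dividing `null_counts` by `len(toy)` gives the missing fraction per column, which can help decide between imputing a column and dropping it.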


In [9]:
from sklearn.preprocessing import LabelEncoder

def encode_cat(X):
    # Replace every ordinal categorical column with integer codes (0, 1, 2, ...).
    ord_cols = ['ExterQual', 'ExterCond', 'BsmtQual', 'BsmtCond', 'BsmtExposure',
                'BsmtFinType1', 'GarageCond', 'KitchenQual', 'FireplaceQu',
                'GarageFinish', 'BsmtFinType2', 'HeatingQC', 'GarageQual',
                'PoolQC', 'Fence', 'Utilities']
    X = X.copy()  # avoid modifying the caller's DataFrame in place
    X[ord_cols] = X[ord_cols].apply(LabelEncoder().fit_transform)
    return X

# LabelEncoder gives each category an integer code (0, 1, 2, 3, ...) because the model at the end of the pipeline only accepts numerical values.
# Caveats: the codes follow the sorted order of the category names, not the quality ranking, and the encoder is refitted
# on every call, so the same category can receive different codes in the training and test sets.
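
One possible alternative, sketched here under stated assumptions, is scikit-learn's `OrdinalEncoder` with an explicit category order. The ranking `Po < Fa < TA < Gd < Ex` is assumed from `data_description.txt`, `"NA"` is an assumed placeholder for missing values, and the rows are made up. This is not the encoding used in the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Assumed ranking, worst to best; "NA" stands for a missing value.
quality_order = ["NA", "Po", "Fa", "TA", "Gd", "Ex"]

ordinal_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="NA")),
    ("encoder", OrdinalEncoder(
        categories=[quality_order, quality_order],  # one ordered list per column
        handle_unknown="use_encoded_value",
        unknown_value=-1,                           # code for categories not in the list
    )),
])

# Made-up rows for two of the quality columns.
train = pd.DataFrame({"ExterQual": ["Gd", "TA", "Ex"],
                      "BsmtQual": ["TA", np.nan, "Gd"]}, dtype=object)
test = pd.DataFrame({"ExterQual": ["Fa", "Gd"],
                     "BsmtQual": ["Ex", np.nan]}, dtype=object)

ordinal_tf.fit(train)
train_codes = ordinal_tf.transform(train)
test_codes = ordinal_tf.transform(test)
print(train_codes)
print(test_codes)
```

With this approach the codes respect the ranking and `"Gd"` maps to the same integer in both sets, because the mapping is fixed at fit time.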

In [10]:
# The Cat_Encoder class wraps encode_cat as a scikit-learn transformer so that it can be used as a pipeline step.
# The fit method is called on the training data,
# while the transform method applies the transformation to the data.

from sklearn.base import BaseEstimator, TransformerMixin

class Cat_Encoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, X, y=None):
        return self  # nothing is learned from the training data
    def transform(self, X):
        return encode_cat(X)
    
# The fit method takes two arguments: X, a DataFrame containing the training features,
# and an optional y containing the target variable (unused here).
# It returns the transformer object itself, so that it can be used to transform the data later.
# The transform method takes a single argument X, the data that needs to be transformed,
# and returns the transformed data.
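
For comparison, here is a sketch of a transformer whose `fit` actually learns something. The class name `FittedCodeEncoder` is hypothetical and the rows are made up. The category-to-code mapping is learned once in `fit` and reused unchanged in `transform`.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class FittedCodeEncoder(BaseEstimator, TransformerMixin):
    """Hypothetical example: learn category codes in fit, reuse them in transform."""

    def fit(self, X, y=None):
        X = pd.DataFrame(X)
        # Learned state is stored in an attribute ending with an underscore.
        self.mappings_ = {
            col: {cat: code for code, cat in enumerate(sorted(X[col].dropna().unique()))}
            for col in X.columns
        }
        return self

    def transform(self, X):
        X = pd.DataFrame(X).copy()
        for col, mapping in self.mappings_.items():
            # Missing or unseen categories become -1.
            X[col] = X[col].map(mapping).fillna(-1).astype(int)
        return X

train = pd.DataFrame({"KitchenQual": ["TA", "Gd", "Ex"]})
test = pd.DataFrame({"KitchenQual": ["Gd", "Fa", None]})

encoder = FittedCodeEncoder().fit(train)
test_codes = encoder.transform(test)["KitchenQual"].tolist()
print(test_codes)
```

Here `"Gd"` keeps the code it was given during `fit`, while the unseen `"Fa"` and the missing value both fall back to -1.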

In [11]:
from sklearn.preprocessing import OneHotEncoder

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# cat_ord_tf is a pipeline with a single step: the Cat_Encoder transformer.
# It encodes the ordinal categorical features,
# i.e. categorical features whose categories have a natural order or ranking.

cat_ord_tf = Pipeline(steps=[("ordinal_encoder", Cat_Encoder())]) 


cat_nom_tf = Pipeline(steps=[("imputer", SimpleImputer(strategy="constant")),("oh_encoder", OneHotEncoder(handle_unknown="ignore"))])

# cat_nom_tf is a pipeline with two steps: a SimpleImputer and a OneHotEncoder.
# SimpleImputer(strategy="constant") replaces missing values with the value given by fill_value.
# No fill_value is specified here, so the default applies: 0 for numerical data and
# the string "missing_value" for string or object data such as these columns.
# OneHotEncoder(handle_unknown="ignore") then creates one binary column per category
# and encodes a category that was not seen during fit as all zeros.
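
The behaviour of this two-step pipeline can be checked on a small made-up column. In this sketch `"Dirt"` is a hypothetical category, chosen only because it is absent from the training rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

nominal_tf = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant")),
    ("oh_encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up rows; "Dirt" is a hypothetical category unseen during fit.
train = pd.DataFrame({"Alley": ["Grvl", np.nan, "Pave"]}, dtype=object)
test = pd.DataFrame({"Alley": ["Pave", "Dirt"]}, dtype=object)

train_encoded = nominal_tf.fit_transform(train).toarray()
categories = nominal_tf.named_steps["oh_encoder"].categories_[0].tolist()
test_encoded = nominal_tf.transform(test).toarray()

print(categories)
print(train_encoded)
print(test_encoded)
```

The missing value becomes its own category, `"missing_value"`, and the unseen category is encoded as a row of zeros instead of raising an error.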



In [12]:
# SimpleImputer fills in the missing values of the numerical columns
# with a statistic of each column, such as the median or the mean.

num_tf = SimpleImputer(strategy="median")

# Here strategy="median", so the imputer computes the median of each column during fit
# and uses that value to fill in the column's missing values.
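
A short sketch on made-up numbers shows that the median is learned during `fit` and then reused on new data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up values for a single numerical column.
train = np.array([[60.0], [80.0], [np.nan], [70.0]])
test = np.array([[np.nan], [50.0]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train)                        # median of the observed values 60, 80, 70
learned_median = imputer.statistics_[0]
test_filled = imputer.transform(test)     # the training median fills the test gap

print(learned_median)
print(test_filled)
```

The test set is filled with the training median rather than its own, which keeps information from the test data out of the preprocessing.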

In [13]:
# p_processor is a ColumnTransformer: it applies a different transformer to each subset of columns.
# It takes a list of (name, transformer, columns) tuples.
# Columns that appear in none of the tuples are dropped, because remainder="drop" is the default.

p_processor = ColumnTransformer(
    transformers=[
        ("numerical", num_tf, num_cols),
        ("categorical_ordinal", cat_ord_tf,cat_ord),
        ("categorical_nominal", cat_nom_tf,cat_nom)
    ]
)
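
The effect of the default `remainder="drop"` can be seen on a made-up frame where one column is left out of the transformer list. This is a sketch, not the notebook's preprocessor.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Made-up rows; YrSold is deliberately not assigned to any transformer.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "LotArea": [8450, 9600, 11250],
    "YrSold": [2008, 2007, 2008],
})
listed = ["LotFrontage", "LotArea"]

dropping = ColumnTransformer(
    transformers=[("numerical", SimpleImputer(strategy="median"), listed)]
)
keeping = ColumnTransformer(
    transformers=[("numerical", SimpleImputer(strategy="median"), listed)],
    remainder="passthrough",  # keep unlisted columns unchanged
)

shape_drop = dropping.fit_transform(toy).shape
shape_keep = keeping.fit_transform(toy).shape
not_listed = [col for col in toy.columns if col not in listed]

print(shape_drop, shape_keep, not_listed)
```

Comparing the listed columns with the full column list, as in the last line, is a quick way to spot features that are silently discarded.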

In [17]:
#importing the model
from sklearn.ensemble import RandomForestRegressor

random_forest = RandomForestRegressor(n_estimators=100, random_state=0)

In [18]:
# GridSearchCV from scikit-learn performs hyperparameter tuning
# by systematically training and evaluating models with different
# combinations of hyperparameters.

from sklearn.model_selection import GridSearchCV

g_para = [{"n_estimators": [50, 100, 500, 750, 1000]}]  # candidate numbers of trees

g_search = GridSearchCV(random_forest, g_para, cv=3, scoring="neg_mean_squared_log_error")

# scikit-learn scorers follow a "greater is better" convention, so error metrics are negated:
# "mean_squared_error" is not a valid scoring name (it raised an error), whereas "neg_mean_squared_log_error" is.
# The square root of the mean squared logarithmic error is close to the competition metric (scikit-learn uses log(1 + y) rather than log(y)).
# The cv parameter is the number of cross-validation folds: with cv=3, every candidate is evaluated with 3-fold cross-validation.
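
The sign convention can be illustrated with a small sketch on synthetic data. The data-generating formula, the sizes and the tiny grid are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic, strictly positive "prices" (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(60, 1))
y_toy = 100 * X_toy[:, 0] + rng.uniform(0, 20_000, size=60)

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [5, 20]},
    cv=3,
    scoring="neg_mean_squared_log_error",
)
search.fit(X_toy, y_toy)

best_msle = -search.best_score_   # undo the negation to get the error
best_rmsle = np.sqrt(best_msle)   # square root, comparable to an RMSE on logs
print(search.best_params_, best_msle, best_rmsle)
```

`best_score_` is never positive with this scorer, so the minus sign recovers the usual error value.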

In [19]:
# Final pipeline: the preprocessor followed by the grid search

final_pipeline = Pipeline(steps=[("preprocessor", p_processor),("grid_search", g_search)])

final_pipeline.fit(X, y)

# The fit method first fits the preprocessor on the training data and transforms it,
# then trains and tunes the model on the preprocessed data through the grid_search object.

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numerical',
                                                  SimpleImputer(strategy='median'),
                                                  Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBat...
                                                   'Condition2', 'BldgType',
                                                   'HouseStyle', 'RoofStyle',
                                                   'RoofMatl', 'MasVnrType',
                                                   'Foundation', 'Heating',
                                                   'CentralAir', 'Electrical',
                                                   'Functional', 'GarageType',
                      

In [20]:
print(g_search.best_params_, g_search.best_estimator_, -g_search.best_score_) 

# best_params_, best_estimator_ and best_score_ are attributes of the fitted grid search.
# best_score_ is the negated error, hence the minus sign.

{'n_estimators': 1000} RandomForestRegressor(n_estimators=1000, random_state=0) 0.023457341367474464
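
In the pipeline above, the grid search is the last step, so the preprocessor is fitted once on all training rows before the cross-validation splits are made. A possible alternative, sketched here on synthetic data with a simplified preprocessor (both are assumptions), wraps the whole pipeline in `GridSearchCV` so that preprocessing is refitted inside each fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with a few missing values (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(60, 2))
y_toy = 100 * X_toy[:, 0] + 50 * X_toy[:, 1] + 10_000
X_toy[::10, 1] = np.nan

pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(random_state=0)),
])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
search = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [5, 20]},
    cv=3,
    scoring="neg_mean_squared_log_error",
)
search.fit(X_toy, y_toy)
predictions = search.predict(X_toy[:3])  # raw rows: preprocessing is applied automatically
print(search.best_params_, predictions)
```

With this layout the fitted search object accepts raw data directly, and each validation fold is preprocessed with statistics learned from its own training folds only.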


In [21]:
# Score with the mean squared logarithmic error, the metric used by the grid search. The plain root mean squared error is expressed in dollars, so its values did not look realistic next to the log-based scores.
from sklearn.metrics import mean_squared_log_error

def score(X, y, model): 
    y_pred = model.predict(X)
    return mean_squared_log_error(y, y_pred)

In [22]:
score(X, y, final_pipeline)  # MSLE on the training data, so it is optimistic compared with the cross-validated score

0.003761648896756794
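
A score computed on the rows used for training is usually lower than the error on unseen data. One way to estimate the latter, sketched here on synthetic data (the formula and sizes are assumptions), is to hold out a validation split.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive "prices" (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(500, 3000, size=(200, 1))
y_toy = 100 * X_toy[:, 0] + rng.uniform(0, 50_000, size=200)

# Keep 25% of the rows aside; the model never sees them during fit.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0
)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_tr, y_tr)

train_msle = mean_squared_log_error(y_tr, model.predict(X_tr))
val_msle = mean_squared_log_error(y_val, model.predict(X_val))
print(train_msle, val_msle)
```

The validation value is the one to compare with the cross-validated score reported by the grid search.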

In [23]:
from xgboost import XGBRegressor
# XGBRegressor is a gradient-boosted tree ensemble: trees are added one after another, each one correcting the errors of the previous ones.

xgb_model = XGBRegressor(n_jobs=6, random_state=0)  # n_jobs: number of parallel threads

# n_estimators is the number of boosting rounds (trees); learning_rate shrinks the contribution of each tree.
g_para = [
    {"n_estimators": [100, 300, 500, 700, 1000], "learning_rate": [0.005, 0.01, 0.05, 0.1]}]

g_search = GridSearchCV(xgb_model, g_para, cv=3, scoring="neg_mean_squared_log_error")

final_pipeline = Pipeline(steps=[
    ("preprocessor", p_processor),
    ("grid_search", g_search)
])

final_pipeline.fit(X, y)

Pipeline(steps=[('preprocessor',
                 ColumnTransformer(transformers=[('numerical',
                                                  SimpleImputer(strategy='median'),
                                                  Index(['MSSubClass', 'LotFrontage', 'LotArea', 'OverallQual', 'OverallCond',
       'YearBuilt', 'YearRemodAdd', 'MasVnrArea', 'BsmtFinSF1', 'BsmtFinSF2',
       'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBat...
                                                     max_cat_threshold=None,
                                                     max_cat_to_onehot=None,
                                                     max_delta_step=None,
                                                     max_depth=None,
                                                     max_leaves=None,
                                                     min_child_weight=None,
                                             

In [24]:
print(g_search.best_params_, g_search.best_estimator_, -g_search.best_score_)

# best_params_, best_estimator_ and best_score_ are attributes of the fitted grid search.
# best_score_ is the negated error, hence the minus sign.

{'learning_rate': 0.01, 'n_estimators': 1000} XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, feature_types=None, gamma=0, gpu_id=-1,
             grow_policy='depthwise', importance_type=None,
             interaction_constraints='', learning_rate=0.01, max_bin=256,
             max_cat_threshold=64, max_cat_to_onehot=4, max_delta_step=0,
             max_depth=6, max_leaves=0, min_child_weight=1, missing=nan,
             monotone_constraints='()', n_estimators=1000, n_jobs=6,
             num_parallel_tree=1, predictor='auto', random_state=0, ...) 0.021856328158122242
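
To illustrate the roles of `n_estimators` and `learning_rate`, here is a simplified sketch of gradient boosting for squared error built from scikit-learn decision trees. It is not XGBoost's implementation, which adds regularisation and other refinements. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_boosted_trees(X, y, n_estimators=50, learning_rate=0.1, max_depth=2):
    """Simplified boosting: each tree is fitted to the current residuals."""
    base = float(np.mean(y))                 # initial constant prediction
    prediction = np.full(len(y), base)
    trees = []
    for _ in range(n_estimators):            # one boosting round per tree
        residuals = y - prediction
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        # learning_rate shrinks the correction contributed by each tree.
        prediction = prediction + learning_rate * tree.predict(X)
        trees.append(tree)
    return base, trees

def predict_boosted_trees(X, base, trees, learning_rate=0.1):
    return base + learning_rate * sum(tree.predict(X) for tree in trees)

# Synthetic data (illustrative only).
rng = np.random.default_rng(0)
X_toy = rng.uniform(0, 10, size=(100, 1))
y_toy = np.sin(X_toy[:, 0]) + 0.1 * rng.normal(size=100)

base, trees = fit_boosted_trees(X_toy, y_toy, n_estimators=50, learning_rate=0.1)
fitted = predict_boosted_trees(X_toy, base, trees, learning_rate=0.1)
baseline_mse = float(np.mean((y_toy - base) ** 2))
boosted_mse = float(np.mean((y_toy - fitted) ** 2))
print(baseline_mse, boosted_mse)
```

A smaller `learning_rate` makes each step more cautious and generally needs more rounds, which is consistent with the grid search selecting a low rate together with the largest `n_estimators`.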


In [25]:

df_test = pd.read_csv("test.csv", index_col=0)  # read the test file, using its first column as the index
print(len(df_test))  # number of rows in the test set
df_test.head(3)

460


id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
1000,20,RL,74.0,10206,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,,,,0,7,2009,WD,Normal
1001,30,RL,60.0,5400,Pave,,Reg,Lvl,AllPub,Corner,...,0,0,,,,0,1,2007,WD,Abnorml
1002,20,RL,75.0,11957,Pave,,IR1,Lvl,AllPub,Inside,...,0,0,,,,0,7,2008,WD,Normal


In [26]:
X_test = p_processor.transform(df_test)  # apply the preprocessor fitted on the training data to the test set

In [27]:
y_pred = g_search.predict(X_test)  # predict sale prices with the best model found by the grid search
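
The two steps above (transform, then predict) are what a fitted `Pipeline` does in a single `predict` call. The following sketch checks this on made-up data with a simplified pipeline; the step names and the linear model are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Made-up data with one missing value in each set.
X_train = np.array([[1.0], [2.0], [np.nan], [4.0]])
y_train = np.array([10.0, 20.0, 25.0, 40.0])
X_new = np.array([[np.nan], [3.0]])

pipe = Pipeline(steps=[
    ("preprocessor", SimpleImputer(strategy="median")),
    ("model", LinearRegression()),
])
pipe.fit(X_train, y_train)

# Manual route: transform with the fitted preprocessor, then predict.
two_steps = pipe.named_steps["model"].predict(
    pipe.named_steps["preprocessor"].transform(X_new)
)
# Single call: the pipeline chains both steps itself.
one_call = pipe.predict(X_new)
print(two_steps, one_call)
```

Using the single call reduces the risk of forgetting a preprocessing step or applying it twice.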

In [28]:
# Save the predictions to a CSV file with the two columns Id and SalePrice
df_results = pd.DataFrame({"Id": df_test.index, "SalePrice": y_pred})
df_results.to_csv("results.csv", index=False)  # index=False: do not write the row index as an extra column
df_results.head()

Unnamed: 0,Id,SalePrice
0,1000,84239.703125
1,1001,89087.335938
2,1002,243305.171875
3,1003,144553.78125
4,1004,192827.203125
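
Before submitting, the layout of the written file can be checked against the required `Id,SalePrice` format. This sketch writes to an in-memory buffer instead of a file and reuses two rows from the example in the project description.

```python
import io
import pandas as pd

# Two rows taken from the format example in the project description.
toy_results = pd.DataFrame({
    "Id": [1461, 1462],
    "SalePrice": [169000.9876, 187724.1233],
})

buffer = io.StringIO()
toy_results.to_csv(buffer, index=False)  # no extra index column
buffer.seek(0)

check = pd.read_csv(buffer)
header_ok = list(check.columns) == ["Id", "SalePrice"]
no_missing = bool(check["SalePrice"].notna().all())
print(header_ok, no_missing, len(check))
```

The same checks can be run on the real file by reading it back with `pd.read_csv`, and the row count should equal the number of rows in the test set.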


In [109]:
# Save your predictions
#df_results = pd.DataFrame(index=[1461,1462,1463],data={'SalePrice':[169000.9876,187724.1233,187724.1233]})
#df_results.to_csv("results.csv", index_label='id')
#df_results.head()

Unnamed: 0,SalePrice
1461,169000.9876
1462,187724.1233
1463,187724.1233
