## Project Overview: House Prices - Advanced Regression Techniques

***Assignment task and dataset taken from Kaggle as a competition submission.***

#### **Project Relevance**

In this project, I am working on predicting house prices using advanced regression techniques. The dataset consists of **79 features** describing various aspects of residential homes in Ames, Iowa. This project is a great opportunity for me to apply feature engineering, data preprocessing, and machine learning models to a real-world problem. By participating in this challenge, I aim to strengthen my skills in **predictive analytics, regression modeling, and working with structured data.**

#### **Project Aim**

My goal is to build a predictive model that accurately estimates house prices based on key housing attributes. 

To achieve this, I will:

- Perform **feature engineering** to extract valuable insights from the dataset.
- Implement **regression algorithms** such as **Random Forest and Gradient Boosting** to improve prediction accuracy.
- Evaluate model performance using **Root Mean Squared Error (RMSE)** to ensure reliable predictions.
  
Through this project, I aim to enhance my ability to handle missing data, optimize machine learning models, and improve generalization performance. By the end, I hope to develop a well-tuned regression model that can effectively predict housing prices based on various property characteristics. 
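
Since RMSE is the metric named above, here is a minimal sketch of how it could be computed with NumPy. The helper names `rmse` and `rmse_log` and the sample prices are my own illustrative assumptions, not part of the notebook. The log variant is included because, as far as I understand it, the competition scores predictions on the logarithm of the sale price.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def rmse_log(y_true, y_pred):
    """RMSE on log prices, so relative errors weigh the same for cheap and expensive homes."""
    return rmse(np.log(y_true), np.log(y_pred))

# Illustrative values only (hypothetical, not results from this notebook)
actual = [200000, 150000, 300000]
predicted = [210000, 140000, 330000]

print(rmse(actual, predicted))
print(rmse_log(actual, predicted))
```

A lower value is better for both functions; the log version penalises a 10% miss on a cheap house about as much as a 10% miss on an expensive one.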

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

In [2]:
import pandas as pd
import datetime

In [3]:
df = pd.read_csv(r'train.csv', low_memory=False)

In [4]:
df


Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,RL,62.0,7917,Pave,,Reg,Lvl,AllPub,...,0,,,,0,8,2007,WD,Normal,175000
1456,1457,20,RL,85.0,13175,Pave,,Reg,Lvl,AllPub,...,0,,MnPrv,,0,2,2010,WD,Normal,210000
1457,1458,70,RL,66.0,9042,Pave,,Reg,Lvl,AllPub,...,0,,GdPrv,Shed,2500,5,2010,WD,Normal,266500
1458,1459,20,RL,68.0,9717,Pave,,Reg,Lvl,AllPub,...,0,,,,0,4,2010,WD,Normal,142125


In [5]:
# Printing the shape
df.shape

(1460, 81)

In [6]:
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


In [7]:
df.isnull().sum()

Id                 0
MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 81, dtype: int64

In [8]:
print(df.columns)


Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive

In [9]:
# Replace missing LotFrontage values with the column's mode.
# Assign the result back to the column: a chained inplace fillna triggers a pandas FutureWarning.
df['LotFrontage'] = df['LotFrontage'].fillna(df['LotFrontage'].mode()[0])


In [10]:
df['LotFrontage'].isnull().sum()

np.int64(0)

In [11]:
# mode() must be called and returns a Series, so take its first value
df['Alley'] = df['Alley'].fillna(df['Alley'].mode()[0])

In [12]:
df['Alley'].isnull().sum()

np.int64(0)

In [13]:
df['PoolQC'].isnull().sum()

np.int64(1453)

In [14]:
# PoolQC is categorical, so fill with the mode rather than the mean
df['PoolQC'] = df['PoolQC'].fillna(df['PoolQC'].mode()[0])

In [15]:
df['Fence'].isnull().sum()

np.int64(1179)

In [16]:
# Fence is categorical, so fill with the mode rather than the mean
df['Fence'] = df['Fence'].fillna(df['Fence'].mode()[0])

*The null values in `LotFrontage`, `Alley`, `PoolQC`, and `Fence` have now been filled using the mode, the most frequent value in each column. Every other column that still contains missing values is removed in the next step.*
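
As a rough sketch of choosing the fill value by column type, the snippet below fills numeric columns with the median and text columns with the mode. The toy DataFrame and the helper name `fill_missing` are assumptions made up for illustration; they are not part of this notebook's pipeline.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the housing data (values are made up)
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, np.nan],
    "Fence": ["MnPrv", None, "MnPrv", "GdPrv", None],
})

def fill_missing(frame):
    """Return a copy with numeric NaNs set to the median and text NaNs set to the mode."""
    filled = frame.copy()
    for col in filled.columns:
        if pd.api.types.is_numeric_dtype(filled[col]):
            filled[col] = filled[col].fillna(filled[col].median())
        else:
            # mode() returns a Series, so take its first entry
            filled[col] = filled[col].fillna(filled[col].mode()[0])
    return filled

clean = fill_missing(toy)
print(clean)
```

Working on a copy keeps the original frame untouched, which makes it easy to compare the data before and after filling.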

*Next, let's drop all string columns, since scikit-learn models cannot process categorical text data without encoding it first. A few numeric columns that will not be used as features are dropped as well.*
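
For reference, pandas can also remove every text column in one step with `select_dtypes`. This is only a sketch on a made-up toy frame. It is not identical to the manual drops in this notebook, because it would keep all numeric columns, including the few that are dropped by hand below.

```python
import pandas as pd

# Toy frame with two numeric and two text columns (values are made up)
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "MSZoning": ["RL", "RL"],
    "Street": ["Pave", "Pave"],
    "SalePrice": [208500, 181500],
})

# Keep only numeric columns; every text column is left out
numeric_only = toy.select_dtypes(include="number")
print(list(numeric_only.columns))
```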

In [17]:
# lets check all data types
df.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
SalePrice          int64
Length: 81, dtype: object

In [18]:
df.drop(['MSZoning'], axis=1, inplace=True)

In [19]:
df.drop(['Street'], axis=1, inplace=True)

In [20]:
df.drop(['Alley'], axis=1, inplace=True)

In [21]:
df.drop(['LotConfig'], axis=1, inplace=True)

In [22]:
df.drop(['LandSlope'], axis=1, inplace=True)

In [23]:
df.drop(['Neighborhood'], axis=1, inplace=True)

In [24]:
df.drop(['Condition1'], axis=1, inplace=True)

In [25]:
df.drop(['Condition2'], axis=1, inplace=True)

In [26]:
df.drop(['BldgType'], axis=1, inplace=True)

In [27]:
df.drop(['PoolQC'], axis=1, inplace=True)

In [28]:
df.drop(['Fence'], axis=1, inplace=True)

In [29]:
df.drop(['MiscFeature'], axis=1, inplace=True)

In [30]:
df.drop(['SaleType'], axis=1, inplace=True)

In [31]:
df.drop(['SaleCondition'], axis=1, inplace=True)

In [32]:
df.drop(['HouseStyle'], axis=1, inplace=True)

In [33]:
df.drop(['RoofStyle'], axis=1, inplace=True)

In [34]:
df.drop(['RoofMatl'], axis=1, inplace=True)

In [35]:
df.drop(['Exterior1st'], axis=1, inplace=True)

In [36]:
df.drop(['Exterior2nd'], axis=1, inplace=True)

In [37]:
df.drop(['MasVnrType'], axis=1, inplace=True)

In [38]:
# Drop the remaining text columns, plus a few numeric columns that are not used as features
df.drop(columns=[
    'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
    'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2', 'Heating',
    'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual', 'FireplaceQu',
    'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive',
    'MoSold', 'YrSold', 'LotFrontage', 'MasVnrArea', 'Functional',
    'GarageYrBlt', 'LotShape', 'LandContour', 'Utilities',
], inplace=True)

In [39]:
df

Unnamed: 0,Id,MSSubClass,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,...,GarageCars,GarageArea,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,SalePrice
0,1,60,8450,7,5,2003,2003,0,150,856,...,2,548,0,61,0,0,0,0,0,208500
1,2,20,9600,6,8,1976,1976,0,284,1262,...,2,460,298,0,0,0,0,0,0,181500
2,3,60,11250,7,5,2001,2002,0,434,920,...,2,608,0,42,0,0,0,0,0,223500
3,4,70,9550,7,5,1915,1970,0,540,756,...,3,642,0,35,272,0,0,0,0,140000
4,5,60,14260,8,5,2000,2000,0,490,1145,...,3,836,192,84,0,0,0,0,0,250000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1455,1456,60,7917,6,5,1999,2000,0,953,953,...,2,460,0,40,0,0,0,0,0,175000
1456,1457,20,13175,6,6,1978,1988,163,589,1542,...,2,500,349,0,0,0,0,0,0,210000
1457,1458,70,9042,7,9,1941,2006,0,877,1152,...,1,252,0,60,0,0,0,0,2500,266500
1458,1459,20,9717,5,6,1950,1996,1029,0,1078,...,1,240,366,0,112,0,0,0,0,142125


In [40]:
df.dtypes

Id               int64
MSSubClass       int64
LotArea          int64
OverallQual      int64
OverallCond      int64
YearBuilt        int64
YearRemodAdd     int64
BsmtFinSF2       int64
BsmtUnfSF        int64
TotalBsmtSF      int64
1stFlrSF         int64
2ndFlrSF         int64
LowQualFinSF     int64
GrLivArea        int64
BsmtFullBath     int64
BsmtHalfBath     int64
FullBath         int64
HalfBath         int64
BedroomAbvGr     int64
KitchenAbvGr     int64
TotRmsAbvGrd     int64
Fireplaces       int64
GarageCars       int64
GarageArea       int64
WoodDeckSF       int64
OpenPorchSF      int64
EnclosedPorch    int64
3SsnPorch        int64
ScreenPorch      int64
PoolArea         int64
MiscVal          int64
SalePrice        int64
dtype: object


*As observed above, the dataset now contains only integer columns, so it can be passed to a scikit-learn model without further type conversion.*

First, we will separate the **target variable (y)**, `SalePrice`, from the **features (X)**, keeping only seven numeric columns as model inputs.

Next, we will split the data into training and evaluation sets. Because `train_test_split` is called without a `test_size` argument, it uses its default **75:25 ratio**: **75%** of the rows train our **RandomForest model** and **25%** are held out for evaluation.
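
To make the split ratio concrete, the sketch below compares the default split with an explicit 70:30 split on placeholder arrays that have the same number of rows as the training data (1,460). The arrays `X_demo` and `y_demo` are made-up stand-ins, not the real features.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1,460 rows, the same row count as the training data
X_demo = np.arange(1460 * 2).reshape(1460, 2)
y_demo = np.arange(1460)

# Default behaviour: test_size is 0.25 when neither size argument is given
a_train, a_eval, _, _ = train_test_split(X_demo, y_demo, random_state=1)

# Explicit 70:30 split
b_train, b_eval, _, _ = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

print(len(a_train), len(a_eval))
print(len(b_train), len(b_eval))
```

Passing `test_size` explicitly documents the intended ratio in the code itself instead of relying on the library default.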

In [41]:
from sklearn.model_selection import train_test_split
# Create target object and call it y
y = df.SalePrice
# Create X
features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = df[features]

# Split into validation and training data
train_X, eval_X, train_y, eval_y = train_test_split(X, y, random_state=1)

In [42]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier

In [43]:
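# Note: RandomForestClassifier treats every distinct SalePrice as a separate class.
# The outputs recorded below were produced with this classifier.
model_RF = RandomForestClassifier()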

In [44]:
model_RF.fit(train_X, train_y)

In [45]:
predict_RF = model_RF.predict(eval_X)
predict_RF

array([192500, 184000, 130000,  92000, 155000, 142500, 255900, 144152,
       215000, 262000, 180000,  82500, 192500, 143000, 189000,  84900,
       100000, 140000, 143000, 143750, 150500, 117500, 193000, 255500,
       109500, 187000, 145000, 185000, 582933,  88000, 107500,  91500,
       120500, 109000, 145000, 320000, 115000, 119000, 305000, 105000,
       140000, 140000, 110000, 124000, 172400, 180000,  79000, 192000,
       227680, 235000, 132000, 325000, 100000, 240000, 185000, 110000,
       128500, 184100, 115000, 187000, 159000, 260000, 124000, 133000,
       174000, 119900, 230000, 215000, 176000, 157900, 197000, 129900,
       281213, 135000, 139500, 222000, 160000, 128500, 386250, 192000,
       226000, 127500, 129500, 163000, 194500, 140000, 149500, 142500,
       190000, 187000, 153337, 157000, 116050,  90000, 110500, 128000,
       120000, 130000, 118500, 154000, 205000, 129500, 127000, 124500,
       106500, 105000, 176432, 146500, 160000, 260000,  95000, 145000,
      

In [46]:
from sklearn.metrics import mean_absolute_error
mean_absolute_error(predict_RF, eval_y)

26790.54520547945

In [47]:
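from sklearn.metrics import r2_score
# Note: r2_score expects (y_true, y_pred); the value below was computed with the arguments in the order shown here
r2_score(predict_RF, eval_y)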

0.7363960047736771

In [48]:
from sklearn.model_selection import KFold, cross_val_score, StratifiedKFold

In [49]:
cv1 = KFold(n_splits=10, random_state=12,shuffle= True)
# evaluate the model with cross validation
scores = cross_val_score(model_RF, train_X, train_y, scoring='accuracy', cv=cv1, n_jobs=-1)
scores

array([0.01818182, 0.01818182, 0.03636364, 0.01818182, 0.        ,
       0.00917431, 0.        , 0.01834862, 0.01834862, 0.03669725])

In [50]:
from statistics import mean, stdev
# report performance
print('Accuracy: %.3f(%.3f)'% (mean(scores), stdev(scores)))

Accuracy: 0.017(0.013)
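
Accuracy only counts exact matches, so for a continuous target a regression scorer is usually more informative in cross-validation. The following is a minimal sketch on synthetic data, assuming a `RandomForestRegressor` and scikit-learn's `neg_root_mean_squared_error` scorer; the data and variable names are made up and the numbers it prints say nothing about the housing dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (not the housing data)
rng = np.random.default_rng(12)
X_syn = rng.uniform(0, 10, size=(200, 2))
y_syn = 3.0 * X_syn[:, 0] - 2.0 * X_syn[:, 1] + rng.normal(0, 0.5, size=200)

cv = KFold(n_splits=5, shuffle=True, random_state=12)

# The scorer returns negated RMSE so that "higher is better" holds for every scorer
neg_rmse = cross_val_score(
    RandomForestRegressor(n_estimators=50, random_state=12),
    X_syn, y_syn,
    scoring="neg_root_mean_squared_error",
    cv=cv,
)
rmse_per_fold = -neg_rmse

print("RMSE: %.3f (%.3f)" % (rmse_per_fold.mean(), rmse_per_fold.std()))
```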


In [51]:
accuracy_score(predict_RF, eval_y)

0.00821917808219178

In [52]:
# Let's tune the Random Forest hyperparameters with a randomized search
import numpy as np
from sklearn.model_selection import RandomizedSearchCV
random_search = {'criterion': ['entropy', 'gini'],
 'max_depth': list(np.linspace(5, 1200, 10, dtype = int)) + [None],
 # note: recent scikit-learn releases reject 'auto' for max_features, which likely explains the failed fits reported below
 'max_features': ['auto', 'sqrt','log2', None],
 'min_samples_leaf': [4, 6, 8, 12],
 'min_samples_split': [3, 7, 10, 14],
 'n_estimators': list(np.linspace(5, 1200, 3, dtype = int))}
clf = RandomForestClassifier()
model_R = RandomizedSearchCV(estimator = clf, param_distributions = random_search, 
 cv = 4, verbose= 5, random_state= 101, n_jobs = -1)
model_R.fit(train_X,train_y)
model_R.best_params_

Fitting 4 folds for each of 10 candidates, totalling 40 fits


20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
6 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\model_selection\_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 1382, in wrapper
    estimator._validate_params()
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python310\lib\site-packages\sklearn\base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "C:\Users\jorda\AppData\Local\Programs\Python\Python310\lib\s

{'n_estimators': np.int64(1200),
 'min_samples_split': 10,
 'min_samples_leaf': 12,
 'max_features': 'log2',
 'max_depth': np.int64(536),
 'criterion': 'entropy'}

In [53]:
predict_R = model_R.predict(eval_X)
predict_R

array([215000, 140000, 110000,  80000, 148000, 290000, 290000, 144152,
       215000, 175000, 180000, 100000, 175900, 143000, 192000, 105000,
       129000, 140000, 157000, 140000, 127500, 200000, 180000, 318000,
       100000, 185000, 128000, 192000, 582933,  88000, 140000,  88000,
       120500, 110000, 145000, 315000, 115000, 119000, 250000, 105000,
       140000, 140000, 110000, 110000, 130000, 180000, 105000, 202500,
       192000, 190000, 110000, 239000, 100000, 240000, 185000, 110000,
       128500, 168500, 140000, 187000, 140000, 290000, 110000, 129500,
       165000, 135000, 108000, 215000, 152000, 145000, 240000, 128000,
       250000, 145000, 160000, 202500, 178000, 108000, 315000, 202500,
       190000, 127000, 124500, 140000, 194500, 140000, 155000, 155000,
       190000, 190000, 190000, 157000, 128500, 140000, 140000, 128000,
       110000, 130000, 129000, 165000, 190000, 129500, 155000, 110000,
       115000, 105000, 203000, 108000, 160000, 250000, 135000, 145000,
      

In [54]:
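# Note: r2_score expects (y_true, y_pred); the value below was computed with the arguments in the order shown here
r2_score(predict_R, eval_y)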

0.4833811981134588

In [55]:
accuracy_score(predict_R, eval_y)

0.0027397260273972603

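From the results, we can see that the Random Forest model is not yet performing well. The accuracy score is below 1% on the hold-out set (0.008 before tuning, 0.003 after) and about 0.017 in cross-validation. This is expected, because a classifier treats every distinct sale price as its own class and accuracy only counts exact matches. The mean absolute error of about 26,790 and the R² score of 0.74 are more informative, but the R² score fell to 0.48 after the random search, so the tuned model is worse on this split. A regression model evaluated with regression metrics such as RMSE, together with further feature engineering and ETL processing, is required before testing the model in different environments.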
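
As a possible next step, the sketch below fits a `RandomForestRegressor` and scores it with regression metrics. It runs on synthetic data with made-up variable names, so it shows the shape of the workflow only. It is an assumption about how the model could be swapped, not the method used for the results above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for housing features and prices (not the real data)
rng = np.random.default_rng(0)
X_syn = rng.uniform(500, 3000, size=(400, 3))
y_syn = 50 * X_syn[:, 0] + 30 * X_syn[:, 1] + rng.normal(0, 5000, size=400)

X_tr, X_ev, y_tr, y_ev = train_test_split(X_syn, y_syn, random_state=1)

# A regressor predicts a continuous value instead of picking a class
reg = RandomForestRegressor(n_estimators=100, random_state=1)
reg.fit(X_tr, y_tr)
pred = reg.predict(X_ev)

# scikit-learn metrics take the true values first, then the predictions
mae = mean_absolute_error(y_ev, pred)
r2 = r2_score(y_ev, pred)
rmse = float(np.sqrt(np.mean((y_ev - pred) ** 2)))

print(mae, r2, rmse)
```

With the real data, `X_syn` and `y_syn` would be replaced by the feature frame and the `SalePrice` column.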

Now, let's proceed with making predictions using the test data.

In [56]:
# path to file you will use for predictions
test_data_path = 'test.csv'

# read test data file using pandas
test_data = pd.read_csv(test_data_path)
test_data

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition
0,1461,20,RH,80.0,11622,Pave,,Reg,Lvl,AllPub,...,120,0,,MnPrv,,0,6,2010,WD,Normal
1,1462,20,RL,81.0,14267,Pave,,IR1,Lvl,AllPub,...,0,0,,,Gar2,12500,6,2010,WD,Normal
2,1463,60,RL,74.0,13830,Pave,,IR1,Lvl,AllPub,...,0,0,,MnPrv,,0,3,2010,WD,Normal
3,1464,60,RL,78.0,9978,Pave,,IR1,Lvl,AllPub,...,0,0,,,,0,6,2010,WD,Normal
4,1465,120,RL,43.0,5005,Pave,,IR1,HLS,AllPub,...,144,0,,,,0,1,2010,WD,Normal
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1454,2915,160,RM,21.0,1936,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,6,2006,WD,Normal
1455,2916,160,RM,21.0,1894,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,4,2006,WD,Abnorml
1456,2917,20,RL,160.0,20000,Pave,,Reg,Lvl,AllPub,...,0,0,,,,0,9,2006,WD,Abnorml
1457,2918,85,RL,62.0,10441,Pave,,Reg,Lvl,AllPub,...,0,0,,MnPrv,Shed,700,7,2006,WD,Normal


In [57]:
test_data.dtypes

Id                 int64
MSSubClass         int64
MSZoning          object
LotFrontage      float64
LotArea            int64
                  ...   
MiscVal            int64
MoSold             int64
YrSold             int64
SaleType          object
SaleCondition     object
Length: 80, dtype: object

In [58]:
# create test_X which comes from test_data but includes only the columns you used for prediction.
# The list of columns is stored in a variable called features
test_X = test_data[features]

# make predictions which we will submit. 
test_preds = model_R.predict(test_X)

# The lines below show how to save predictions in the format used for competition scoring
output = pd.DataFrame({'Id': test_data.Id,
                      'SalePrice': test_preds})
output.to_csv('submission.csv', index=False)