The housing dataset project begins by establishing a starting point for predicting house prices with machine learning. Data preprocessing, including cleaning and handling missing values, is carried out before scaling numerical features. Models are developed step by step, starting from simpler ones and gradually becoming more complex, with each iteration evaluated against the baseline to measure improvement. This approach is crucial for ensuring accurate and meaningful insights in subsequent analyses.

Variables to be used:
* "YearBuilt": Original construction date
* "LotFrontage": Linear feet of street connected to property
* "MasVnrType": Masonry veneer type
* "OverallQual": Rates the overall material and finish of the house
* "GrLivArea": Above grade (ground) living area square feet
* "GarageCars": Size of garage in car capacity

"SalePrice" will be used to create a target variable to predict house prices based on the selected features.

* Let's import the dataframe:

In [1]:
import pandas as pd
houses = pd.read_csv("data/house_prices_final_project.csv")

In [2]:
houses.head(2)

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500


Filter by the columns mentioned:

In [3]:
columns_list = ['YearBuilt',
                'LotFrontage',
                'MasVnrType',
                'OverallQual',
                'GrLivArea',
                'GarageCars',
                'SalePrice']

In [4]:
houses_df = houses[columns_list].copy()

In [5]:
houses_df.head(2)

Unnamed: 0,YearBuilt,LotFrontage,MasVnrType,OverallQual,GrLivArea,GarageCars,SalePrice
0,2003,65.0,BrkFace,7,1710,2,208500
1,1976,80.0,,6,1262,2,181500


* Convert any categorical data:

In [6]:
houses_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   YearBuilt    1460 non-null   int64  
 1   LotFrontage  1201 non-null   float64
 2   MasVnrType   588 non-null    object 
 3   OverallQual  1460 non-null   int64  
 4   GrLivArea    1460 non-null   int64  
 5   GarageCars   1460 non-null   int64  
 6   SalePrice    1460 non-null   int64  
dtypes: float64(1), int64(5), object(1)
memory usage: 80.0+ KB


* We can see that MasVnrType is the only object type feature that will be converted to categorical:

In [7]:
houses_df["MasVnrType"].unique()

array(['BrkFace', nan, 'Stone', 'BrkCmn'], dtype=object)

In [8]:
houses_df["MasVnrType"].isnull().sum()

872

In [9]:
houses_df["MasVnrType"].value_counts()

MasVnrType
BrkFace    445
Stone      128
BrkCmn      15
Name: count, dtype: int64

We can see that one of the values is 'nan' (missing value), because there's a possibility of having no masonry in the house.
We can't use null as one of the categories, so, I will fill the nan missing value with 'None', which aligns with the description of feature.

Note: The 'None' value has also the highest frequency among others, so it can count as mode (most of the houses don't have a masonry).


In [10]:
houses_df["MasVnrType"].fillna("None", inplace = True)

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  houses_df["MasVnrType"].fillna("None", inplace = True)


In [11]:
houses_df["MasVnrType"].unique()

array(['BrkFace', 'None', 'Stone', 'BrkCmn'], dtype=object)

In [12]:
categories_masonry = ['BrkFace', 'None', 'Stone', 'BrkCmn']
houses_df["MasVnrType"] = pd.Categorical(houses_df["MasVnrType"], categories = categories_masonry)

In [13]:
houses_df["MasVnrType"].dtype

CategoricalDtype(categories=['BrkFace', 'None', 'Stone', 'BrkCmn'], ordered=False, categories_dtype=object)

* Create the target column based on SalePrice. Min->Median bucket will have assigned the value 0 while the other bucket (Median->Max) will be 1.

In [14]:
houses_df["is_expensive"] = (houses_df["SalePrice"] > houses_df["SalePrice"].median()).astype('int')

In [15]:
#test
houses_df[houses_df["SalePrice"] > houses_df["SalePrice"].median()]

Unnamed: 0,YearBuilt,LotFrontage,MasVnrType,OverallQual,GrLivArea,GarageCars,SalePrice,is_expensive
0,2003,65.0,BrkFace,7,1710,2,208500,1
1,1976,80.0,,6,1262,2,181500,1
2,2001,68.0,BrkFace,7,1786,2,223500,1
4,2000,84.0,BrkFace,8,2198,3,250000,1
6,2004,75.0,Stone,8,1694,2,307000,1
...,...,...,...,...,...,...,...,...
1451,2008,78.0,Stone,8,1578,3,287090,1
1454,2004,62.0,,7,1221,2,185000,1
1455,1999,62.0,,6,1647,2,175000,1
1456,1978,85.0,Stone,6,2073,2,210000,1


### 1. HANDLING MISSING DATA

In [16]:
import numpy as np
from pgds_mpp_utils import split_dataset, score_approach
from sklearn.impute import SimpleImputer

In [17]:
houses_df.isnull().sum()

YearBuilt         0
LotFrontage     259
MasVnrType        0
OverallQual       0
GrLivArea         0
GarageCars        0
SalePrice         0
is_expensive      0
dtype: int64

I already worked above on MasVnrType missing values, which is explained by the feature description, meaning that some houses don't have masonry.
This was replaced by 'None' which is also the mode.

* Let's work now on LotFrontage missing values:

Start by splitting dataset to be used in future iterations:

In [18]:
train_df, test_df = split_dataset(houses_df, "is_expensive")

In [19]:
train_df.head(2)

Unnamed: 0,YearBuilt,LotFrontage,MasVnrType,OverallQual,GrLivArea,GarageCars,SalePrice,is_expensive
163,1956,55.0,,4,882,0,103200,0
1414,1923,64.0,,6,1848,2,207000,1


In [20]:
train_df.LotFrontage.describe()

count    804.000000
mean      69.568408
std       23.607871
min       21.000000
25%       59.000000
50%       69.000000
75%       80.000000
max      313.000000
Name: LotFrontage, dtype: float64

In [21]:
train_df.shape, test_df.shape

((978, 8), (482, 8))

In [22]:
train_df.isnull().mean()*100

YearBuilt        0.000000
LotFrontage     17.791411
MasVnrType       0.000000
OverallQual      0.000000
GrLivArea        0.000000
GarageCars       0.000000
SalePrice        0.000000
is_expensive     0.000000
dtype: float64

**1.1 ENCODING MISSING VALUE WITH VALUE TOKEN: SCORE 62%**

In [23]:
train_df_iteration0_1 = train_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()
test_df_iteration0_1 = test_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()

In [24]:
train_df_iteration0_1.isnull().mean()

LotFrontage     0.177914
YearBuilt       0.000000
is_expensive    0.000000
dtype: float64

In [25]:
missing_value_token = -1

In [26]:
train_df_iteration0_1["LotFrontage"] = train_df_iteration0_1["LotFrontage"].fillna(value=missing_value_token)
test_df_iteration0_1["LotFrontage"] = test_df_iteration0_1["LotFrontage"].fillna(value=missing_value_token)

In [27]:
score_approach(train_df_iteration0_1, test_df_iteration0_1, "is_expensive")

0.6203319502074689

In [28]:
train_df_iteration0_1.isnull().sum()

LotFrontage     0
YearBuilt       0
is_expensive    0
dtype: int64

**1.2 SIMPLE IMPUTER USING MEDIAN: SCORE 63.5%**

In [29]:
train_df_iteration1_1 = train_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()
test_df_iteration1_1 = test_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()

In [30]:
from sklearn.impute import SimpleImputer
imr_median = SimpleImputer(strategy='median')

In [31]:
train_df_iteration1_imputer1_1 = imr_median.fit_transform(train_df_iteration1_1)
test_df_iteration1_imputer1_1 = imr_median.transform(test_df_iteration1_1)

train_df_iteration1_1 = pd.DataFrame(train_df_iteration1_imputer1_1, columns = train_df_iteration1_1.columns)
test_df_iteration1_1 = pd.DataFrame(test_df_iteration1_imputer1_1, columns = test_df_iteration1_1.columns)

score_approach(train_df_iteration1_1, test_df_iteration1_1, "is_expensive")

0.6348547717842323

**1.3 MANUAL MEDIAN IMPUT: SCORE 63.5%**

In [32]:
train_df_iteration1 = train_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()
test_df_iteration1 = test_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()

In [33]:
median_LotFrontage = train_df_iteration1.LotFrontage.median()

In [34]:
train_df_iteration1["LotFrontage"] = train_df_iteration1["LotFrontage"].fillna(value=median_LotFrontage)
test_df_iteration1["LotFrontage"] = test_df_iteration1["LotFrontage"].fillna(value=median_LotFrontage)

score_approach(train_df_iteration1, test_df_iteration1, "is_expensive")

0.6348547717842323

**1.3 KNN IMPUT: SCORE 63.1%**

In [35]:
train_df_iterationknn = train_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()
test_df_iterationknn = test_df[["LotFrontage", "YearBuilt", "is_expensive"]].copy()

In [36]:
from sklearn.impute import KNNImputer
knn_imputer1 = KNNImputer(n_neighbors=2)

In [37]:
train_df_iterationknn_values = knn_imputer1.fit_transform(train_df_iterationknn)
test_df_iterationknn_values = knn_imputer1.transform(test_df_iterationknn)

In [38]:
train_df_knn = pd.DataFrame(train_df_iterationknn_values, columns = train_df_iterationknn.columns)
test_df_knn = pd.DataFrame(test_df_iterationknn_values, columns = test_df_iterationknn.columns)

In [39]:
score_approach(train_df_knn, test_df_knn, "is_expensive")

0.6307053941908713

**--> NEW BASELINE SCORE 63.5% (train_df_iteration1, test_df_iteration1)**

LotFrontage represents the linear feet of street connected to the property. Imputing missing values with the median ensures that the overall distribution of LotFrontage values remains consistent with the original data. This is crucial because the street footage is likely to vary across different neighborhoods and property types. LotFrontage also seems to have an outlier of 313 (max value) due to irregularly shaped lots or unusual neighborhood layouts. Imputing with the median is less affected by outliers compared to using a specific token like -1, which might distort the distribution and impact model performance.

### 2. ENCODING MasVnrType FEATURE WITH OHE VS. BINARY

**2.1 ENCODING WITH OHE: SCORE 67.6%**

In [40]:
train_df_iteration2 = train_df_iteration1.copy()
test_df_iteration2 = test_df_iteration1.copy()

In [41]:
train_df_iteration2["MasVnrType"] = train_df["MasVnrType"]
test_df_iteration2["MasVnrType"] = test_df["MasVnrType"]

In [42]:
from category_encoders import OneHotEncoder as OHE

In [43]:
ohe_enc = OHE(use_cat_names = True)

In [44]:
train_df_iteration2["MasVnrType"].unique()

['None', 'BrkFace', 'Stone', 'BrkCmn']
Categories (4, object): ['BrkFace', 'None', 'Stone', 'BrkCmn']

In [45]:
train_df_iteration2["MasVnrType"].isnull().sum()

0

In [46]:
train_df_iteration2_ohe = ohe_enc.fit_transform(train_df_iteration2)
test_df_iteration2_ohe = ohe_enc.transform(test_df_iteration2)

In [47]:
train_df_iteration2_ohe.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn
163,55.0,1956,0,1,0,0,0
1414,64.0,1923,1,1,0,0,0


In [48]:
score_approach(train_df_iteration2_ohe, test_df_iteration2_ohe, "is_expensive")

0.6763485477178424

**2.1 ENCODING WITH BINARY: SCORE 67.6%**

In [49]:
from category_encoders import BinaryEncoder

In [50]:
binary_encoder = BinaryEncoder()

In [51]:
train_df_iteration2_binary = binary_encoder.fit_transform(train_df_iteration2)
test_df_iteration2_binary = binary_encoder.transform(test_df_iteration2)

In [52]:
test_df_iteration2_binary.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_0,MasVnrType_1,MasVnrType_2
353,60.0,1928,0,0,0,1
92,80.0,1921,1,0,0,1


In [53]:
score_approach(train_df_iteration2_binary, test_df_iteration2_binary, "is_expensive")

0.6763485477178424

**--> NEW BASELINE SCORE: 67.6% (train_df_iteration2_ohe, test_df_iteration2_ohe)**

MasVnrType has a relatively small number of unique categories. Hence, both OHE and Binary Encoding are capable of capturing the same information without significant loss of information. The resulting binary columns may not significantly differ between the two encoding methods.

### 3. ORDINAL ENCODING OF GarageCars: SCORE 74%

In [54]:
train_df_iteration3 = train_df_iteration2_ohe.copy()
test_df_iteration3 = test_df_iteration2_ohe.copy()

In [55]:
train_df_iteration3["GarageCars"] = train_df["GarageCars"]
test_df_iteration3["GarageCars"] = test_df["GarageCars"]

In [56]:
train_df_iteration3["GarageCars"]

163     0
1414    2
227     1
694     2
1264    2
       ..
917     1
777     2
226     3
614     0
1274    2
Name: GarageCars, Length: 978, dtype: int64

In [57]:
train_df_iteration3["GarageCars"].unique()

array([0, 2, 1, 3, 4], dtype=int64)

In [58]:
from sklearn.preprocessing import OrdinalEncoder

In [59]:
ord_enc_garage = OrdinalEncoder(categories=[["0", "1", "2", "3", "4"]])

In [60]:
train_df_iteration3["GarageCars"] = ord_enc_garage.fit_transform(train_df_iteration3[["GarageCars"]])
test_df_iteration3["GarageCars"] = ord_enc_garage.transform(test_df_iteration3[["GarageCars"]])

In [61]:
test_df_iteration3.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars
353,60.0,1928,0,1,0,0,0,2.0
92,80.0,1921,1,1,0,0,0,2.0


In [62]:
score_approach(train_df_iteration3, test_df_iteration3, "is_expensive")

0.7406639004149378

**--> NEW BASELINE SCORE: 74% (train_df_iteration3, test_df_iteration3)**

### 4. ORDINAL ENCODING OF OverallQual: SCORE 84.2%

In [63]:
train_df_iteration4 = train_df_iteration3.copy()
test_df_iteration4 = test_df_iteration3.copy()

In [64]:
train_df_iteration4["OverallQual"] = train_df["OverallQual"]
test_df_iteration4["OverallQual"] = test_df["OverallQual"]

In [65]:
train_df_iteration4["OverallQual"].unique()

array([ 4,  6,  5,  7,  8, 10,  9,  3,  2,  1], dtype=int64)

In [66]:
ord_enc = OrdinalEncoder(categories=[["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]])

In [67]:
train_df_iteration4["OverallQual"] = ord_enc.fit_transform(train_df_iteration4[["OverallQual"]])
test_df_iteration4["OverallQual"] = ord_enc.transform(test_df_iteration4[["OverallQual"]])

In [68]:
score_approach(train_df_iteration4, test_df_iteration4, "is_expensive")

0.8423236514522822

In [69]:
test_df_iteration4.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars,OverallQual
353,60.0,1928,0,1,0,0,0,2.0,5.0
92,80.0,1921,1,1,0,0,0,2.0,4.0


**--> NEW BASELINE SCORE: 84.2% (train_df_iteration4, test_df_iteration4)**

### 5. BINNING OF YearBuilt WITH EQUAL WIDTH VS. EQUAL DEPTH (12 BINS)

**5.1 EQUAL WIDTH BINNING (UNIFORM): SCORE 84%**

In [70]:
from sklearn.preprocessing import KBinsDiscretizer

In [71]:
disc_uniform_estimator = KBinsDiscretizer(n_bins=[12], encode='ordinal', strategy='uniform')

In [72]:
train_df_iteration5 = train_df_iteration4.copy()
train_df_iteration5["YearBuilt"] = disc_uniform_estimator.fit_transform(train_df_iteration5[["YearBuilt"]])

test_df_iteration5 = test_df_iteration4.copy()
test_df_iteration5["YearBuilt"] = disc_uniform_estimator.transform(test_df_iteration5[["YearBuilt"]])



In [73]:
score_approach(train_df_iteration5, test_df_iteration5, "is_expensive")

0.8402489626556017

In [74]:
train_df_iteration5.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars,OverallQual
163,55.0,7.0,0,1,0,0,0,0.0,3.0
1414,64.0,4.0,1,1,0,0,0,2.0,5.0


**5.1 EQUAL DEPTH BINNING (QUANTILE): SCORE 84.2%**

In [75]:
disc_quantile_estimator = KBinsDiscretizer(n_bins=[12], encode='ordinal', strategy='quantile')

In [76]:
train_df_iteration6 = train_df_iteration4.copy()
train_df_iteration6["YearBuilt"] = disc_quantile_estimator.fit_transform(train_df_iteration6[["YearBuilt"]])

test_df_iteration6 = test_df_iteration4.copy()
test_df_iteration6["YearBuilt"] = disc_quantile_estimator.transform(test_df_iteration6[["YearBuilt"]])

In [77]:
score_approach(train_df_iteration6, test_df_iteration6, "is_expensive")

0.8423236514522822

In [78]:
test_df_iteration6.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars,OverallQual
353,60.0,1.0,0,1,0,0,0,2.0,5.0
92,80.0,0.0,1,1,0,0,0,2.0,4.0


In [79]:
train_df.YearBuilt.value_counts().head(50)

YearBuilt
2006    47
2005    46
2004    35
2007    33
2003    24
2000    20
1976    20
1977    20
1958    19
1999    19
1959    17
1971    17
1970    17
1965    16
1972    16
2008    16
2002    16
1954    15
2001    15
1920    15
1968    15
1998    14
2009    14
1950    14
1994    13
1957    13
1925    13
1966    13
1969    12
1953    11
1956    11
1960    11
1996    11
1941    11
1963    11
1948    11
1993    11
1955    10
1967    10
1964    10
1995    10
1990     9
1910     9
1962     9
1997     9
1940     9
1980     8
1915     8
1988     8
1978     8
Name: count, dtype: int64

**--> NEW BASELINE SCORE: 84.2%, EQUAL DEPTH BINNING (train_df_iteration6, test_df_iteration6)**

In the case of YearBuilt, there seems to be a higher concentration of frequency after the year 1990. Using quantile binning (79%) can effectively capture this distribution by creating bins that contain approximately the same number of instances, regardless of the range of years covered by each bin. This allows the model to focus more on the areas of the feature space where there is higher density of data points, potentially leading to better model performance.

Uniform binning (78.6%), on the other hand, divides the data into bins of equal width regardless of the distribution of the data. Therefore, uniform binning may not capture the underlying patterns as effectively.

### 6. FEATURE SCALING FOR GrLivArea WITH MINMAXSCALER VS ZSCORE

**6.1 MINMAX SCALER: SCORE 87.3%**

In [80]:
train_df.GrLivArea.describe()

count     978.000000
mean     1512.160532
std       519.167340
min       334.000000
25%      1130.250000
50%      1466.000000
75%      1768.000000
max      5642.000000
Name: GrLivArea, dtype: float64

In [81]:
train_df_iteration7 = train_df_iteration6.copy()
test_df_iteration7 = test_df_iteration6.copy()

train_df_iteration7["GrLivArea"] = train_df["GrLivArea"]
test_df_iteration7["GrLivArea"] = test_df["GrLivArea"]


In [82]:
from sklearn.preprocessing import MinMaxScaler
mmscaler = MinMaxScaler(feature_range=(-1, 1))

In [83]:
# Reshape the input data to a 2D array
train_X_reshaped = train_df_iteration7[["GrLivArea"]].values.reshape(-1, 1)
test_X_reshaped = test_df_iteration7[["GrLivArea"]].values.reshape(-1, 1)

# Apply MinMaxScaler
train_X_min_max_values = mmscaler.fit_transform(train_X_reshaped)
test_X_min_max_values = mmscaler.transform(test_X_reshaped)


In [84]:
# Add other columns from iteration7 to the new DataFrames
train_X_min_max_df = train_df_iteration7.drop(columns=["GrLivArea"]).copy()
train_X_min_max_df["GrLivArea"] = train_X_min_max_values

test_X_min_max_df = test_df_iteration7.drop(columns=["GrLivArea"]).copy()
test_X_min_max_df["GrLivArea"] = test_X_min_max_values


In [85]:
score_approach(train_X_min_max_df, test_X_min_max_df, "is_expensive")

0.8734439834024896

In [86]:
train_X_min_max_df.head(2)

Unnamed: 0,LotFrontage,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars,OverallQual,GrLivArea
163,55.0,3.0,0,1,0,0,0,0.0,3.0,-0.793519
1414,64.0,1.0,1,1,0,0,0,2.0,5.0,-0.42954


**6.2 ZSCORE: SCORE 85.7%**

In [87]:
train_df_iteration8 = train_df_iteration6.copy()
test_df_iteration8 = test_df_iteration6.copy()

train_df_iteration8["GrLivArea"] = train_df["GrLivArea"]
test_df_iteration8["GrLivArea"] = test_df["GrLivArea"]

In [88]:
from sklearn.preprocessing import StandardScaler
zscore = StandardScaler()

In [89]:
# Reshape the input data to a 2D array
train_zscore_reshaped = train_df_iteration8[["GrLivArea"]].values.reshape(-1, 1)
test_zscore_reshaped = test_df_iteration8[["GrLivArea"]].values.reshape(-1, 1)

# Apply MinMaxScaler
train_zscore_values = zscore.fit_transform(train_zscore_reshaped)
test_zscore_values = zscore.transform(test_zscore_reshaped)

In [90]:
# Add other columns from iteration7 to the new DataFrames
train_zscore_df = train_df_iteration8.drop(columns=["GrLivArea"]).copy()
train_zscore_df["GrLivArea"] = train_zscore_values

test_zscore_df = test_df_iteration8.drop(columns=["GrLivArea"]).copy()
test_zscore_df["GrLivArea"] = test_zscore_values


In [91]:
score_approach(train_zscore_df, test_zscore_df, "is_expensive")

0.8568464730290456

**--> NEW BASELINE SCORE: 87.3%, MINMAXSCALER (train_X_min_max_df, test_X_min_max_df)**

Given the wide range of values in GrLivArea (from 334 to 5642), using MinMaxScaler can compress this range into a more manageable scale. This compression can prevent the model from being overwhelmed by large values and can improve its ability to learn from the data.
Also, if there are outliers present in GrLivArea, MinMaxScaler might handle them better by bringing their magnitudes closer to the rest of the data range.

### 7. FEATURE SCALING FOR LotFrontage WITH MINMAXSCALER VS ZSCORE

**7.1 MINMAXSCALER: SCORE 87.3%**

In [92]:
train_df_iteration9 = train_X_min_max_df.copy()
test_df_iteration9 = test_X_min_max_df.copy()

In [93]:
train_df_iteration9.LotFrontage.describe()

count    978.000000
mean      69.467280
std       21.403743
min       21.000000
25%       60.000000
50%       69.000000
75%       79.000000
max      313.000000
Name: LotFrontage, dtype: float64

In [94]:
from sklearn.preprocessing import MinMaxScaler
mmscaler1 = MinMaxScaler(feature_range=(-1, 1))

In [95]:
# Reshape the input data to a 2D array
train_X_reshaped1 = train_df_iteration9[["LotFrontage"]].values.reshape(-1, 1)
test_X_reshaped1 = test_df_iteration9[["LotFrontage"]].values.reshape(-1, 1)

# Apply MinMaxScaler
train_X_min_max_values1 = mmscaler1.fit_transform(train_X_reshaped1)
test_X_min_max_values1 = mmscaler1.transform(test_X_reshaped1)


In [96]:
# Add other columns from iteration7 to the new DataFrames
train_X_min_max_df1 = train_df_iteration9.drop(columns=["LotFrontage"]).copy()
train_X_min_max_df1["LotFrontage"] = train_X_min_max_values1

test_X_min_max_df1 = test_df_iteration9.drop(columns=["LotFrontage"]).copy()
test_X_min_max_df1["LotFrontage"] = test_X_min_max_values1



In [97]:
score_approach(train_X_min_max_df1, test_X_min_max_df1, "is_expensive")

0.8734439834024896

In [98]:
test_X_min_max_df1.head(2)

Unnamed: 0,YearBuilt,is_expensive,MasVnrType_None,MasVnrType_BrkFace,MasVnrType_Stone,MasVnrType_BrkCmn,GarageCars,OverallQual,GrLivArea,LotFrontage
353,1.0,0,1,0,0,0,2.0,5.0,-0.854559,-0.732877
92,0.0,1,1,0,0,0,2.0,4.0,-0.762622,-0.59589


**7.2 ZSCORE: SCORE 83.4%**

In [99]:
train_df_iteration10 = train_X_min_max_df.copy()
test_df_iteration10 = test_X_min_max_df.copy()

In [100]:
from sklearn.preprocessing import StandardScaler
zscore1 = StandardScaler()

In [101]:
# Reshape the input data to a 2D array
train_zscore_reshaped1 = train_df_iteration10[["LotFrontage"]].values.reshape(-1, 1)
test_zscore_reshaped1 = test_df_iteration10[["LotFrontage"]].values.reshape(-1, 1)

# Apply MinMaxScaler
train_zscore_values1 = zscore1.fit_transform(train_zscore_reshaped1)
test_zscore_values1 = zscore1.transform(test_zscore_reshaped1)

In [102]:
# Add other columns from iteration7 to the new DataFrames
train_zscore_df1 = train_df_iteration10.drop(columns=["LotFrontage"]).copy()
train_zscore_df1["LotFrontage"] = train_zscore_values1

test_zscore_df1 = test_df_iteration10.drop(columns=["LotFrontage"]).copy()
test_zscore_df["LotFrontage"] = test_zscore_values1


In [103]:
score_approach(train_zscore_df1, test_zscore_df, "is_expensive")

0.8340248962655602

**--> NEW AND LAST BASELINE: 87.3%, MINMAXSCALER (train_X_min_max_df1, test_X_min_max_df1)**

Similar to GrLivArea, LotFrontage also has a wide range of values, with a minimum of 21 and a maximum of 313. By using MinMaxScaler, this wide range can be compressed into a normalized range, typically between 0 and 1. This compression makes the data more manageable for the model to learn from.
The preservation of relative differences between data points ensures that the model can still learn meaningful patterns and relationships from the data.

### CONCLUSION

The iterative process began with a baseline accuracy of 63.5% after addressing missing values, which significantly improved to 67.6% upon encoding the MasVnrType feature. Experimenting with feature encoding techniques showed that ordinal encoding for features like GarageCars and OverallQual led to substantial performance gains, with scores of 74% and 84.2%, respectively. Binning the YearBuilt feature using Equal Depth binning outperformed Equal Width binning, achieving a score of 84.2%. Moreover, employing MinMaxScaler for features like GrLivArea and LotFrontage yielded better results compared to ZScore scaling, achieving a noteworthy final score of 87.3%