# __HOUSE PRICES PREDICTION__

Objective: Develop a machine learning model to predict house prices using a dataset containing various house-related features.

Data Collection: You will use the "House Prices - Advanced Regression Techniques" dataset from Kaggle (or any other relevant house price dataset)

Dataset Link: https://www.kaggle.com/datasets/rohit265/housing-sales-factors-influencing-sale-prices

### __IMPORT NECESSARY LIBRARIES__

In [1]:
# Importing standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Importing libraries for data preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.impute import SimpleImputer

# Importing libraries for machine learning models (regressors, since Sale_Price is continuous)
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.neural_network import MLPRegressor

# Importing libraries for model evaluation
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Importing libraries for model interpretability
import shap
import lime
import lime.lime_tabular

# Importing libraries for API development and deployment
from flask import Flask, request, jsonify
import joblib

# Miscellaneous libraries
import warnings
warnings.filterwarnings('ignore')

# Display all rows and columns
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)



### __IMPORT DATASET__

In [2]:
# Import the data
df = pd.read_csv(r"C:\Users\antho\OneDrive\Documents\GitHub\Project--Third-Semester-\housing.csv")

In [3]:
df.head()

Unnamed: 0,Lot_Frontage,Lot_Area,Bldg_Type,House_Style,Overall_Cond,Year_Built,Exter_Cond,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Fireplaces,Longitude,Latitude,Sale_Price
0,141,31770,OneFam,One_Story,Average,1960,Typical,1080,1656,0,1,0,3,1,2,-93.619754,42.054035,215000
1,80,11622,OneFam,One_Story,Above_Average,1961,Typical,882,896,0,1,0,2,1,0,-93.619756,42.053014,105000
2,81,14267,OneFam,One_Story,Above_Average,1958,Typical,1329,1329,0,1,1,3,1,0,-93.619387,42.052659,172000
3,93,11160,OneFam,One_Story,Average,1968,Typical,2110,2110,0,2,1,3,1,2,-93.61732,42.051245,244000
4,74,13830,OneFam,Two_Story,Average,1997,Typical,928,928,701,2,1,3,1,1,-93.638933,42.060899,189900


In [4]:
df.shape

(2413, 18)

### __SPLIT THE DATASET INTO TRAINING AND TEST SETS__

In [6]:
# Set the random seed for reproducibility
random_seed = 42

# Define the fraction of data to be used for the training set
train_fraction = 0.8

# Sample the training data
train_df = df.sample(frac=train_fraction, random_state=random_seed)

# Get the test data by dropping the training indices
test_df = df.drop(train_df.index)
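The `sample`/`drop` split above relies on the dataframe index being unique. Below is a minimal sketch of how the result could be sanity-checked, alongside the equivalent `train_test_split` helper that is already imported. `toy_df` is a hypothetical stand-in for `df`, not the real data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for `df`: 10 rows with a default (unique) integer index
toy_df = pd.DataFrame({"Lot_Area": range(10), "Sale_Price": range(100, 110)})

# Same approach as above: sample 80% for training, keep the rest for testing
train_part = toy_df.sample(frac=0.8, random_state=42)
test_part = toy_df.drop(train_part.index)

# The two parts should be disjoint and together cover every row
no_overlap = train_part.index.intersection(test_part.index).empty
full_cover = len(train_part) + len(test_part) == len(toy_df)
print(no_overlap, full_cover)

# Alternative: scikit-learn's helper also returns two disjoint parts
sk_train, sk_test = train_test_split(toy_df, test_size=0.2, random_state=42)
print(len(sk_train), len(sk_test))
```

Either approach should work here. The explicit checks are mainly useful if the index might contain duplicates, for example after concatenating files.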

In [None]:
train_df.head()

Unnamed: 0,Lot_Frontage,Lot_Area,Bldg_Type,House_Style,Overall_Cond,Year_Built,Exter_Cond,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Fireplaces,Longitude,Latitude,Sale_Price
765,85,10200,OneFam,One_Story,Average,2007,Typical,1578,1602,0,2,0,3,1,1,-93.684115,42.016468,293200
2387,54,13811,OneFam,One_Story,Above_Average,1987,Typical,1112,1137,0,2,0,2,1,1,-93.646099,41.999553,176000
2162,60,10800,OneFam,One_and_Half_Fin,Very_Good,1936,Typical,796,1096,370,2,0,3,1,1,-93.613899,42.034761,170000
1833,79,9245,OneFam,Two_Story,Average,2006,Typical,939,939,858,2,1,3,1,0,-93.684137,42.014823,213500
1814,120,10356,OneFam,One_Story,Above_Average,1975,Typical,969,969,0,1,1,3,1,0,-93.684354,42.021025,122000


In [None]:
test_df.head()

Unnamed: 0,Lot_Frontage,Lot_Area,Bldg_Type,House_Style,Overall_Cond,Year_Built,Exter_Cond,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Fireplaces,Longitude,Latitude,Sale_Price
1,80,11622,OneFam,One_Story,Above_Average,1961,Typical,882,896,0,1,0,2,1,0,-93.619756,42.053014,105000
4,74,13830,OneFam,Two_Story,Average,1997,Typical,928,928,701,2,1,3,1,1,-93.638933,42.060899,189900
11,0,7980,OneFam,One_Story,Good,1992,Good,1168,1187,0,2,0,3,1,0,-93.635951,42.057419,185000
16,152,12134,OneFam,One_and_Half_Fin,Good,1988,Typical,559,1080,672,2,0,4,1,0,-93.623595,42.060351,164000
19,105,11751,OneFam,One_Story,Above_Average,1977,Typical,1844,1844,0,2,0,3,1,1,-93.633962,42.050346,190000


### __EXPLORATORY DATA ANALYSIS__

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine if the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be a widely used method in the data discovery process today.

The dataset folder has training and test datasets but we will focus on the training dataset for the following reasons:

1. **Understanding Patterns**: The training dataset is what you'll use to train your model, so understanding its patterns, distributions, and relationships is crucial.

2. **Avoiding Data Leakage**: Performing EDA on the test set can lead to data leakage, where information from the test set influences the model training process, resulting in overly optimistic performance estimates.

3. **Model Validation**: The test set should be kept unseen until final model evaluation to provide an unbiased estimate of the model's performance.

However, I will briefly review the test set to ensure it has similar features and distributions to the training set, which helps in understanding if the datasets are consistent and comparable.
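One way to do this brief review without inspecting the test target is to compare only the schema and the categorical levels. The sketch below uses two hypothetical toy frames (`toy_train`, `toy_test`) in place of `train_df` and `test_df`. The category values are illustrative assumptions.

```python
import pandas as pd

# Hypothetical stand-ins for `train_df` and `test_df`
toy_train = pd.DataFrame({
    "Bldg_Type": ["OneFam", "OneFam", "Duplex"],
    "Lot_Area": [10200, 13811, 10800],
})
toy_test = pd.DataFrame({
    "Bldg_Type": ["OneFam", "Twnhs"],
    "Lot_Area": [11622, 13830],
})

# 1. Same columns in the same order?
same_columns = list(toy_train.columns) == list(toy_test.columns)

# 2. Same dtype for every column?
same_dtypes = toy_train.dtypes.to_dict() == toy_test.dtypes.to_dict()

# 3. Categorical levels that appear in the test set but never in training
unseen_levels = {
    col: sorted(set(toy_test[col]) - set(toy_train[col]))
    for col in toy_train.select_dtypes(exclude="number").columns
}

print(same_columns, same_dtypes, unseen_levels)
```

Levels that appear only in the test set matter later because an encoder fitted on the training data would not know them.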

For exploratory data analysis, I will take the following steps:

1. Data cleaning

2. Data transformation

3. Feature engineering

### **GENERAL OVERVIEW OF THE DATASET**

In [None]:
# Check the first five rows
train_df.head()

Unnamed: 0,Lot_Frontage,Lot_Area,Bldg_Type,House_Style,Overall_Cond,Year_Built,Exter_Cond,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Fireplaces,Longitude,Latitude,Sale_Price
765,85,10200,OneFam,One_Story,Average,2007,Typical,1578,1602,0,2,0,3,1,1,-93.684115,42.016468,293200
2387,54,13811,OneFam,One_Story,Above_Average,1987,Typical,1112,1137,0,2,0,2,1,1,-93.646099,41.999553,176000
2162,60,10800,OneFam,One_and_Half_Fin,Very_Good,1936,Typical,796,1096,370,2,0,3,1,1,-93.613899,42.034761,170000
1833,79,9245,OneFam,Two_Story,Average,2006,Typical,939,939,858,2,1,3,1,0,-93.684137,42.014823,213500
1814,120,10356,OneFam,One_Story,Above_Average,1975,Typical,969,969,0,1,1,3,1,0,-93.684354,42.021025,122000


### **How large is the dataset that we are working with?**

In [None]:
# Check the size of the dataset
data_size = train_df.shape

print(f'The training set has {data_size[0]} rows (observations) and {data_size[-1]} columns (features)')

The training set has 1930 rows (observations) and 18 columns (features)


### **What are the different features contained in our dataset?**

In [None]:
# List the feature (column) names
columns = list(train_df.columns)
print(columns)

['Lot_Frontage', 'Lot_Area', 'Bldg_Type', 'House_Style', 'Overall_Cond', 'Year_Built', 'Exter_Cond', 'Total_Bsmt_SF', 'First_Flr_SF', 'Second_Flr_SF', 'Full_Bath', 'Half_Bath', 'Bedroom_AbvGr', 'Kitchen_AbvGr', 'Fireplaces', 'Longitude', 'Latitude', 'Sale_Price']


This is a brief description of all the features contained in our dataset:

Lot_Frontage: Linear feet of street connected to the property.

Lot_Area: Lot size in square feet.

Bldg_Type: Type of building (e.g., single-family, multi-family).

House_Style: Style of the house (e.g., ranch, two-story).

Overall_Cond: Overall condition rating of the house.

Year_Built: Year the house was built.

Exter_Cond: Exterior condition rating of the house.

Total_Bsmt_SF: Total square feet of basement area.

First_Flr_SF: First-floor square feet.

Second_Flr_SF: Second-floor square feet.

Full_Bath: Number of full bathrooms.

Half_Bath: Number of half bathrooms.

Bedroom_AbvGr: Number of bedrooms above ground.

Kitchen_AbvGr: Number of kitchens above ground.

Fireplaces: Number of fireplaces.

Longitude: Longitude coordinates of the property location.

Latitude: Latitude coordinates of the property location.

Sale_Price: Sale price of the property.

### **Statistical summary of our data**

In [None]:
# Check the general overview of our data
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 1930 entries, 765 to 1213
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Lot_Frontage   1930 non-null   int64  
 1   Lot_Area       1930 non-null   int64  
 2   Bldg_Type      1930 non-null   object 
 3   House_Style    1930 non-null   object 
 4   Overall_Cond   1930 non-null   object 
 5   Year_Built     1930 non-null   int64  
 6   Exter_Cond     1930 non-null   object 
 7   Total_Bsmt_SF  1930 non-null   int64  
 8   First_Flr_SF   1930 non-null   int64  
 9   Second_Flr_SF  1930 non-null   int64  
 10  Full_Bath      1930 non-null   int64  
 11  Half_Bath      1930 non-null   int64  
 12  Bedroom_AbvGr  1930 non-null   int64  
 13  Kitchen_AbvGr  1930 non-null   int64  
 14  Fireplaces     1930 non-null   int64  
 15  Longitude      1930 non-null   float64
 16  Latitude       1930 non-null   float64
 17  Sale_Price     1930 non-null   int64  
dtypes: float64(

In [None]:
# Check the statistical summary of our data (numerical features)
train_df.describe(include=[np.number]).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
Lot_Frontage,1930.0,54.965285,33.821358,0.0,35.0,60.0,76.0,313.0
Lot_Area,1930.0,10050.706218,8519.529871,1470.0,7250.5,9362.0,11424.5,215245.0
Year_Built,1930.0,1970.115026,29.108567,1872.0,1954.0,1972.0,1998.0,2010.0
Total_Bsmt_SF,1930.0,1027.575648,414.946084,0.0,784.0,972.0,1251.75,3206.0
First_Flr_SF,1930.0,1137.199482,374.324302,334.0,867.25,1057.0,1352.25,3820.0
Second_Flr_SF,1930.0,335.33057,423.726725,0.0,0.0,0.0,696.5,1872.0
Full_Bath,1930.0,1.545596,0.543822,0.0,1.0,2.0,2.0,4.0
Half_Bath,1930.0,0.376684,0.498391,0.0,0.0,0.0,1.0,2.0
Bedroom_AbvGr,1930.0,2.855959,0.813939,0.0,2.0,3.0,3.0,6.0
Kitchen_AbvGr,1930.0,1.039896,0.198397,0.0,1.0,1.0,1.0,2.0


In [None]:
# Check the statistical summary of our data (categorical features)
train_df.describe(include="object").T

Unnamed: 0,count,unique,top,freq
Bldg_Type,1930,5,OneFam,1585
House_Style,1930,8,One_Story,958
Overall_Cond,1930,9,Average,1040
Exter_Cond,1930,5,Typical,1675


### **DATA CLEANING**

Peace

Review data source and collection methods, use that to determine the features necessary for our machine learning project

Identify and handle missing data.

Identify and remove duplicate records.

Detect and handle outliers



Sunday

Correct typographical errors and inconsistencies

Ensure consistent formats (dates, time, units, labels)

Convert data types to appropriate formats

Addressing Data Quality Issues

Handling Categorical Variables: Clean text (remove special characters, punctuation, stop words)
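The first few checklist items (missing data, duplicates, outliers) can be sketched as below. This is a minimal illustration on a hypothetical toy frame, not the cleaning actually applied to `train_df`. The duplicated row and the missing value are made up, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import pandas as pd

# Hypothetical toy data; the duplicate row and the missing value are made up
toy = pd.DataFrame({
    "Lot_Area": [10200, 13811, 10800, 9245, 9245, 215245],
    "Total_Bsmt_SF": [1578, 1112, None, 939, 939, 969],
})

# Missing values per column
missing_counts = toy.isna().sum()

# Exact duplicate rows (every column equal), then drop them
duplicate_rows = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()

# Flag outliers in one column with the 1.5 * IQR rule
q1, q3 = deduped["Lot_Area"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (deduped["Lot_Area"] < q1 - 1.5 * iqr) | (deduped["Lot_Area"] > q3 + 1.5 * iqr)

print(missing_counts)
print(duplicate_rows)
print(deduped[is_outlier])
```

Flagging is kept separate from removal on purpose: whether an extreme lot size is an error or a genuine large property is a judgment that depends on the data source.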

### **Correcting typographical errors and inconsistencies**

In [None]:
# View the first 5 rows of our data
train_df.head()

Unnamed: 0,Lot_Frontage,Lot_Area,Bldg_Type,House_Style,Overall_Cond,Year_Built,Exter_Cond,Total_Bsmt_SF,First_Flr_SF,Second_Flr_SF,Full_Bath,Half_Bath,Bedroom_AbvGr,Kitchen_AbvGr,Fireplaces,Longitude,Latitude,Sale_Price
765,85,10200,OneFam,One_Story,Average,2007,Typical,1578,1602,0,2,0,3,1,1,-93.684115,42.016468,293200
2387,54,13811,OneFam,One_Story,Above_Average,1987,Typical,1112,1137,0,2,0,2,1,1,-93.646099,41.999553,176000
2162,60,10800,OneFam,One_and_Half_Fin,Very_Good,1936,Typical,796,1096,370,2,0,3,1,1,-93.613899,42.034761,170000
1833,79,9245,OneFam,Two_Story,Average,2006,Typical,939,939,858,2,1,3,1,0,-93.684137,42.014823,213500
1814,120,10356,OneFam,One_Story,Above_Average,1975,Typical,969,969,0,1,1,3,1,0,-93.684354,42.021025,122000


In [None]:
# Let us create a list of all object (categorical) data types
object_features = [
    "MSZoning", "Street", "Alley", "LotShape", "LandContour", "Utilities", 
    "LotConfig", "LandSlope", "Neighborhood", "Condition1", "Condition2", 
    "BldgType", "HouseStyle", "RoofStyle", "RoofMatl", "Exterior1st", 
    "Exterior2nd", "MasVnrType", "ExterQual", "ExterCond", "Foundation", 
    "BsmtQual", "BsmtCond", "BsmtExposure", "BsmtFinType1", "BsmtFinType2", 
    "Heating", "HeatingQC", "CentralAir", "Electrical", "KitchenQual", 
    "Functional", "FireplaceQu", "GarageType", "GarageFinish", "GarageQual", 
    "GarageCond", "PavedDrive", "PoolQC", "Fence", "MiscFeature", 
    "SaleType", "SaleCondition"
]

# Let us create a list of all integer features
integer_features = [
    "Id", "MSSubClass", "LotArea", "OverallQual", "OverallCond", 
    "YearBuilt", "YearRemodAdd", "BsmtFinSF1", "BsmtFinSF2", "BsmtUnfSF", 
    "TotalBsmtSF", "1stFlrSF", "2ndFlrSF", "LowQualFinSF", "GrLivArea", 
    "BsmtFullBath", "BsmtHalfBath", "FullBath", "HalfBath", "BedroomAbvGr", 
    "KitchenAbvGr", "TotRmsAbvGrd", "Fireplaces", "GarageCars", "GarageArea", 
    "WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "3SsnPorch", "ScreenPorch", 
    "PoolArea", "MiscVal", "MoSold", "YrSold", "SalePrice"
]
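The hard-coded names above follow the Kaggle competition naming (for example `BldgType`), while the columns printed earlier use underscores (for example `Bldg_Type`). One way to avoid such mismatches is to derive the lists from the dataframe's dtypes. The sketch below assumes that approach and uses a toy frame built from a few values shown in `df.head()`.

```python
import pandas as pd

# Toy frame built from a few columns of the rows shown in df.head()
toy = pd.DataFrame({
    "Bldg_Type": ["OneFam", "OneFam"],
    "House_Style": ["One_Story", "Two_Story"],
    "Lot_Area": [31770, 13830],
    "Longitude": [-93.619754, -93.638933],
})

# Non-numeric (categorical) columns
object_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Integer columns, checked one by one with the pandas type helper
integer_cols = [col for col in toy.columns if pd.api.types.is_integer_dtype(toy[col])]

print(object_cols)
print(integer_cols)
```

Lists built this way always match the columns that are actually present, so subsetting with them cannot raise a `KeyError`.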

The 43 object-type features will be inspected one at a time to confirm that each entry is valid and free of stray characters or digits. The entries themselves will not be modified, so that the training data stays consistent with the test set, which remains unopened until the model is evaluated.

In [None]:
# Subset the object type features for inspection
object_data_df = train_df[object_features]
object_data_df.head()

Unnamed: 0,MSZoning,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinType2,Heating,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,FireplaceQu,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,PoolQC,Fence,MiscFeature,SaleType,SaleCondition
0,RL,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,No,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,,Attchd,RFn,TA,TA,Y,,,,WD,Normal
1,RL,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,Gable,CompShg,MetalSd,MetalSd,,TA,TA,CBlock,Gd,TA,Gd,ALQ,Unf,GasA,Ex,Y,SBrkr,TA,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
2,RL,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Mn,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal
3,RL,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,Gable,CompShg,Wd Sdng,Wd Shng,,TA,TA,BrkTil,TA,Gd,No,ALQ,Unf,GasA,Gd,Y,SBrkr,Gd,Typ,Gd,Detchd,Unf,TA,TA,Y,,,,WD,Abnorml
4,RL,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,Gable,CompShg,VinylSd,VinylSd,BrkFace,Gd,TA,PConc,Gd,TA,Av,GLQ,Unf,GasA,Ex,Y,SBrkr,Gd,Typ,TA,Attchd,RFn,TA,TA,Y,,,,WD,Normal


In [None]:
# View the unique entries in each feature
for feature in object_data_df.columns:
    unique_entries = object_data_df[feature].unique()
    print(f'The unique entries in {feature} are {unique_entries}')

The unique entries in MSZoning are ['RL' 'RM' 'C (all)' 'FV' 'RH']
The unique entries in Street are ['Pave' 'Grvl']
The unique entries in Alley are [nan 'Grvl' 'Pave']
The unique entries in LotShape are ['Reg' 'IR1' 'IR2' 'IR3']
The unique entries in LandContour are ['Lvl' 'Bnk' 'Low' 'HLS']
The unique entries in Utilities are ['AllPub' 'NoSeWa']
The unique entries in LotConfig are ['Inside' 'FR2' 'Corner' 'CulDSac' 'FR3']
The unique entries in LandSlope are ['Gtl' 'Mod' 'Sev']
The unique entries in Neighborhood are ['CollgCr' 'Veenker' 'Crawfor' 'NoRidge' 'Mitchel' 'Somerst' 'NWAmes'
 'OldTown' 'BrkSide' 'Sawyer' 'NridgHt' 'NAmes' 'SawyerW' 'IDOTRR'
 'MeadowV' 'Edwards' 'Timber' 'Gilbert' 'StoneBr' 'ClearCr' 'NPkVill'
 'Blmngtn' 'BrDale' 'SWISU' 'Blueste']
The unique entries in Condition1 are ['Norm' 'Feedr' 'PosN' 'Artery' 'RRAe' 'RRNn' 'RRAn' 'PosA' 'RRNe']
The unique entries in Condition2 are ['Norm' 'Artery' 'RRNn' 'Feedr' 'PosN' 'PosA' 'RRAn' 'RRAe']
The unique entries in BldgTyp

After comparing the unique entries above against the data description text file that accompanies the dataset on Kaggle, everything appears to be in order. There is no need to strip, split, or otherwise adjust any of the entries.

### **Ensure consistent formats (dates, time, units, labels)**

This section checks that the entries for each feature use the right metrics and units. Date and time features need particular care: they often arrive as object or integer data types, so they may need to be converted and checked for a consistent format. For units and labels, I will use the data description file as the reference.

In [None]:
# Create a list of features that has to do with date and time
timedate_features = [
    "YearBuilt",
    "YearRemodAdd",
    "GarageYrBlt",
    "MoSold",
    "YrSold"
]

# Subset the date and time features in our dataset
date_time_df = train_df[timedate_features]

# Preview the dataframe
date_time_df.head()

Unnamed: 0,YearBuilt,YearRemodAdd,GarageYrBlt,MoSold,YrSold
0,2003,2003,2003.0,2,2008
1,1976,1976,1976.0,5,2007
2,2001,2002,2001.0,9,2008
3,1915,1970,1998.0,2,2006
4,2000,2000,2000.0,12,2008


In [None]:
# Let us inspect the data types
date_time_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   YearBuilt     1460 non-null   int64  
 1   YearRemodAdd  1460 non-null   int64  
 2   GarageYrBlt   1379 non-null   float64
 3   MoSold        1460 non-null   int64  
 4   YrSold        1460 non-null   int64  
dtypes: float64(1), int64(4)
memory usage: 57.2 KB


As shown above, the date and time features are stored as integer and float data types. Converting them is not strictly necessary for every kind of analysis, but I will convert them for the following reason:

I will be analysing the age of the houses and doing some time series analysis, both of which are easier when the features are in datetime format.

The changes are made on the original dataframe (train_df). The subset above exists only to give a clear view of the columns being worked on.

In [None]:
# Convert YearBuilt and YearRemodAdd to datetime
train_df['YearBuilt'] = pd.to_datetime(train_df['YearBuilt'], format='%Y', errors='coerce')
train_df['YearRemodAdd'] = pd.to_datetime(train_df['YearRemodAdd'], format='%Y', errors='coerce')

# Convert GarageYrBlt to datetime
train_df['GarageYrBlt'] = pd.to_datetime(train_df['GarageYrBlt'], format='%Y', errors='coerce')

# Convert MoSold and YrSold
train_df['MoSold'] = pd.to_datetime(train_df['MoSold'], format='%m', errors='coerce')
train_df['YrSold'] = pd.to_datetime(train_df['YrSold'], format='%Y', errors='coerce')
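Two details of the conversions above are worth checking. A month parsed on its own with `%m` has no year, so the resulting dates fall in the parser's default year (1900). Float years such as `2003.0` may also fail to match `%Y`, in which case `errors='coerce'` would silently turn them into `NaT`. The sketch below shows one possible alternative, not the method used in this notebook. It uses toy values loosely based on the preview above, with one missing garage year made up for illustration.

```python
import pandas as pd

# Toy values loosely based on the preview; the missing garage year is made up
toy = pd.DataFrame({
    "YearBuilt": [2003, 1976, 1915],
    "GarageYrBlt": [2003.0, None, 1998.0],
    "MoSold": [2, 5, 2],
    "YrSold": [2008, 2007, 2006],
})

# Combine sale year and month into one date (day fixed to 1 as a convention)
sale_date = pd.to_datetime(
    pd.DataFrame({"year": toy["YrSold"], "month": toy["MoSold"], "day": 1})
)

# Float years: drop missing values, cast to int, parse, then restore the full index
garage_built = pd.to_datetime(
    toy["GarageYrBlt"].dropna().astype(int).astype(str), format="%Y"
).reindex(toy.index)

# House age at sale in whole years
age_at_sale = sale_date.dt.year - toy["YearBuilt"]

print(sale_date)
print(garage_built)
print(age_at_sale)
```

If only the age of a house is needed, plain integer subtraction of the original year columns gives the same numbers without any datetime conversion.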

Lastly, the data description file was checked for each feature, and the labels, units of measurement, and standards are all consistent.

### **Addressing Data Quality Issues**

### **PROPER EDA - PHASE 2**

Task 2.1: Conduct exploratory data analysis to understand the distribution of features and the target variable (house prices).

Task 2.2: Visualize the relationships between features and the target variable using scatter plots, histograms, and box plots.
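Before plotting, the same questions can be answered numerically. The sketch below is an assumed starting point for Tasks 2.1 and 2.2, using only the five rows shown in `df.head()`, so the numbers it prints say nothing about the full dataset. It computes the quantities that a histogram, a scatter plot, and a box plot would each visualize.

```python
import pandas as pd

# The five rows shown in df.head(), restricted to a few columns
sample = pd.DataFrame({
    "Lot_Area": [31770, 11622, 14267, 11160, 13830],
    "First_Flr_SF": [1656, 896, 1329, 2110, 928],
    "House_Style": ["One_Story", "One_Story", "One_Story", "One_Story", "Two_Story"],
    "Sale_Price": [215000, 105000, 172000, 244000, 189900],
})

# Task 2.1: distribution of the target (what a histogram would show)
target_summary = sample["Sale_Price"].describe()
target_skew = sample["Sale_Price"].skew()

# Task 2.2: linear association of each numeric feature with the target
# (what a scatter plot would show)
correlations = (
    sample.corr(numeric_only=True)["Sale_Price"]
    .drop("Sale_Price")
    .sort_values(ascending=False)
)

# Task 2.2: median price per category (the centre line of a box plot)
median_by_style = sample.groupby("House_Style")["Sale_Price"].median()

print(target_summary)
print(target_skew)
print(correlations)
print(median_by_style)
```

On the real data, `sample` would be replaced by `train_df`, and the plots themselves could then be drawn with the seaborn and matplotlib imports at the top of the notebook.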

<!-- Task 2.3: Identify and handle outliers in the dataset. --> Task 2.3 (outlier handling) is omitted here because Peace has already completed it.