## 1. Problem Definition and Dataset Identification

### Problem Definition:

The task is to predict house sale prices from features such as location, size, and number of bedrooms. This is a supervised regression problem: a model learns the relationship between the features and the sale price from historical sales, then estimates the price of houses it has not seen.

### Why Machine Learning?

Machine learning suits house-price prediction because it can handle a large number of features, capture non-linear relationships, and be retrained as more data becomes available. Models such as tree ensembles also pick up interactions between features without those interactions being specified by hand, which a simple linear model with hand-picked terms can miss.

### Dataset Identification:

For this project, we'll use the well-known "House Prices: Advanced Regression Techniques" dataset from Kaggle. You can download it from the following location:
https://www.kaggle.com/datasets/lespin/house-prices-dataset/data

## 2. Data Exploration, Cleaning, Feature Engineering, and Selection

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

### Load the dataset

In [4]:
# Load the dataset
df1 = pd.read_csv('train.csv')
df2 = pd.read_csv('test.csv')

data = pd.concat([df1, df2], ignore_index=True)
print(data.shape)
print(data.head(10))

(2919, 81)
   Id  MSSubClass MSZoning  LotFrontage  LotArea Street Alley LotShape  \
0   1          60       RL         65.0     8450   Pave   NaN      Reg   
1   2          20       RL         80.0     9600   Pave   NaN      Reg   
2   3          60       RL         68.0    11250   Pave   NaN      IR1   
3   4          70       RL         60.0     9550   Pave   NaN      IR1   
4   5          60       RL         84.0    14260   Pave   NaN      IR1   
5   6          50       RL         85.0    14115   Pave   NaN      IR1   
6   7          20       RL         75.0    10084   Pave   NaN      Reg   
7   8          60       RL          NaN    10382   Pave   NaN      IR1   
8   9          50       RM         51.0     6120   Pave   NaN      Reg   
9  10         190       RL         50.0     7420   Pave   NaN      Reg   

  LandContour Utilities  ... PoolArea PoolQC  Fence MiscFeature MiscVal  \
0         Lvl    AllPub  ...        0    NaN    NaN         NaN       0   
1         Lvl    AllPub 

### Data exploration

In [5]:
print(data.info())
print(data.describe())

# Check for missing values
missing_values = data.isnull().sum()
print(missing_values[missing_values > 0])

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2919 entries, 0 to 2918
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Id             2919 non-null   int64  
 1   MSSubClass     2919 non-null   int64  
 2   MSZoning       2915 non-null   object 
 3   LotFrontage    2433 non-null   float64
 4   LotArea        2919 non-null   int64  
 5   Street         2919 non-null   object 
 6   Alley          198 non-null    object 
 7   LotShape       2919 non-null   object 
 8   LandContour    2919 non-null   object 
 9   Utilities      2917 non-null   object 
 10  LotConfig      2919 non-null   object 
 11  LandSlope      2919 non-null   object 
 12  Neighborhood   2919 non-null   object 
 13  Condition1     2919 non-null   object 
 14  Condition2     2919 non-null   object 
 15  BldgType       2919 non-null   object 
 16  HouseStyle     2919 non-null   object 
 17  OverallQual    2919 non-null   int64  
 18  OverallC
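
The raw non-null counts above are easier to act on when expressed as a share of rows. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the small `demo` frame is synthetic data standing in for `data`.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column (hypothetical helper)."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": 100 * counts / len(frame),
    })
    # Keep only columns with at least one missing value, worst first
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Synthetic stand-in for `data`; the values are illustrative only
demo = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],
    "LotArea": [8450, 9600, 11250, 9550],
})
summary = missing_summary(demo)
print(summary)
```

In the `info()` output above, `Alley` has only 198 non-null values out of 2919 rows, so about 93% of it is missing. A column like that may be better handled with an explicit "missing" category than with the most frequent value.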

### Data Cleaning

In [6]:
# Fill numerical columns with the median value.
# This covers every numeric column, including SalePrice wherever it is missing.
num_cols = data.select_dtypes(include=[np.number]).columns
data[num_cols] = data[num_cols].fillna(data[num_cols].median())

# Fill categorical columns with the most frequent value
cat_cols = data.select_dtypes(include=[object]).columns
data[cat_cols] = data[cat_cols].apply(lambda x: x.fillna(x.value_counts().index[0]), axis=0)

# Encode categorical variables
data = pd.get_dummies(data)

print(data.head(10))

   Id  MSSubClass  LotFrontage  LotArea  OverallQual  OverallCond  YearBuilt  \
0   1          60         65.0     8450            7            5       2003   
1   2          20         80.0     9600            6            8       1976   
2   3          60         68.0    11250            7            5       2001   
3   4          70         60.0     9550            7            5       1915   
4   5          60         84.0    14260            8            5       2000   
5   6          50         85.0    14115            5            5       1993   
6   7          20         75.0    10084            8            5       2004   
7   8          60         68.0    10382            7            6       1973   
8   9          50         51.0     6120            7            5       1931   
9  10         190         50.0     7420            5            6       1939   

   YearRemodAdd  MasVnrArea  BsmtFinSF1  ...  SaleType_ConLw  SaleType_New  \
0          2003       196.0       706.0  
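
One caveat with the cell above: because `train.csv` and `test.csv` were concatenated before cleaning, the median fill also applies to `SalePrice` wherever it is missing. If `test.csv` carries no `SalePrice` column, as in the Kaggle competition files, those rows end up with the median price as their label. Below is a minimal sketch of one way to avoid that, using synthetic frames; the names `has_label`, `feature_cols`, and `labeled` are assumptions for illustration and do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for train.csv / test.csv; values are illustrative only
train = pd.DataFrame({
    "LotArea": [8450, 9600, np.nan],
    "SalePrice": [208500, 181500, 223500],
})
test = pd.DataFrame({"LotArea": [11622, np.nan]})  # no SalePrice column

combined = pd.concat([train, test], ignore_index=True)

# Remember which rows have a real label before any imputation
has_label = combined["SalePrice"].notna()

# Impute feature columns only, leaving the target untouched
feature_cols = combined.columns.drop("SalePrice")
num_features = combined[feature_cols].select_dtypes(include=[np.number]).columns
combined[num_features] = combined[num_features].fillna(combined[num_features].median())

# Only labeled rows are used for training and evaluation
labeled = combined[has_label]
print(labeled.shape, int(combined["SalePrice"].isna().sum()))
```

With this approach the unlabeled rows still contribute to the feature medians, but they never enter `y`, so the model is trained and scored only on prices that were actually observed.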

### Feature Engineering

In [7]:
data['TotalSF'] = data['TotalBsmtSF'] + data['1stFlrSF'] + data['2ndFlrSF']
print("Total Columns", len(data.columns))

Total Columns 290


### Feature selection

In [8]:
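# Absolute Pearson correlation between every pair of columns
correlation = abs(data.corr())

# Rank columns by correlation with SalePrice; position 0 is SalePrice itself, so skip it
columns = correlation['SalePrice'].sort_values(ascending=False)
top_cols = columns.iloc[1:121]  # the 120 most correlated features
print(top_cols)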

X = data.loc[:, top_cols.index]
print(X.head())

y = data['SalePrice']

# Convert the row at index 100 to JSON for Flask API testing
row_json = X.iloc[100:101].to_json(orient="records")

import json
row_json = json.loads(row_json)[0]

# Print the JSON-formatted row and its target value
print(json.dumps(row_json, indent=4))
print(y[100])

TotalSF                 0.561596
OverallQual             0.542911
GrLivArea               0.518393
GarageCars              0.438936
GarageArea              0.432263
                          ...   
BldgType_Duplex         0.079707
Condition1_Norm         0.079499
Neighborhood_MeadowV    0.076826
Heating_Grav            0.075484
ExterQual_Fa            0.075482
Name: SalePrice, Length: 120, dtype: float64
   TotalSF  OverallQual  GrLivArea  GarageCars  GarageArea  TotalBsmtSF  \
0   2566.0            7       1710         2.0       548.0        856.0   
1   2524.0            6       1262         2.0       460.0       1262.0   
2   2706.0            7       1786         2.0       608.0        920.0   
3   2473.0            7       1717         3.0       642.0        756.0   
4   3343.0            8       2198         3.0       836.0       1145.0   

   1stFlrSF  ExterQual_TA  TotRmsAbvGrd  FullBath  ...  RoofMatl_CompShg  \
0       856         False             8         2  ...           
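
The ranking above is computed on every row, including rows that later land in the validation and test splits, so the held-out scores can look slightly optimistic. The sketch below shows the same top-k correlation ranking restricted to training rows. It is an illustration under stated assumptions: `top_k_by_correlation` is a hypothetical helper, and `demo_X` / `demo_y` are synthetic data rather than the house-price features.

```python
import numpy as np
import pandas as pd


def top_k_by_correlation(features: pd.DataFrame, target: pd.Series, k: int) -> list:
    """Rank feature columns by absolute Pearson correlation with the target (hypothetical helper)."""
    scores = features.corrwith(target).abs().sort_values(ascending=False)
    return list(scores.index[:k])


rng = np.random.default_rng(42)
n = 200
demo_X = pd.DataFrame({
    "strong": rng.normal(size=n),
    "weak": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
# Target depends mostly on "strong", less on "weak", and not on "noise"
demo_y = 5.0 * demo_X["strong"] + 2.0 * demo_X["weak"] + rng.normal(scale=0.5, size=n)

# Rank on the training portion only, then reuse the same columns for every split
train_idx = demo_X.index[:160]
selected = top_k_by_correlation(demo_X.loc[train_idx], demo_y.loc[train_idx], k=2)
print(selected)
```

In the notebook this would mean calling `train_test_split` first and computing the correlations on `X_train` and `y_train` only.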

In [9]:
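# Split the data: hold out 20% for testing, then 20% of the remainder for validation
# (roughly 64% train / 16% validation / 20% test overall)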
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

print("Total Data", X.shape, y.shape)
print("Train Data", X_train.shape, y_train.shape)
print("Validation Data", X_val.shape, y_val.shape)
print("Test Data", X_test.shape, y_test.shape)

Total Data (2919, 120) (2919,)
Train Data (1868, 120) (1868,)
Validation Data (467, 120) (467,)
Test Data (584, 120) (584,)
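
The row counts above follow from applying `test_size=0.2` twice. A small sketch of the arithmetic, assuming scikit-learn's behaviour of rounding the held-out part up and giving the rest to training; `split_sizes` is a hypothetical helper, not something the notebook defines.

```python
import math


def split_sizes(n_rows: int, test_size: float = 0.2, val_size: float = 0.2):
    """Reproduce the row counts of two chained train_test_split calls."""
    # The held-out part is rounded up; the remainder goes to training
    n_test = math.ceil(test_size * n_rows)
    remaining = n_rows - n_test
    n_val = math.ceil(val_size * remaining)
    n_train = remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(2919)
print(n_train, n_val, n_test)
```

For 2919 rows this gives 1868, 467, and 584, matching the shapes printed above.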


## 3. Model Training, Evaluation, and Saving

Training Algorithm and Metrics:
<ul>
    <li>We'll use a RandomForestRegressor for this task.</li>
    <li>Metrics: We'll use Mean Squared Error (MSE) and R-squared (R2) to evaluate the model's performance.</li>
</ul>
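
For reference, the standard definitions of the two metrics, where $y_i$ is the actual sale price of house $i$, $\hat{y}_i$ is the model's prediction, $\bar{y}$ is the mean of the actual prices, and $n$ is the number of houses in the split being scored:

$$
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2,
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
$$

MSE is in squared price units, so lower is better. $R^2$ is unitless, equals 1 for perfect predictions, and equals 0 for a model that always predicts $\bar{y}$.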

In [10]:
# Define the model
model = RandomForestRegressor(n_estimators=100, random_state=42)

# Train the model
model.fit(X_train, y_train)

In [11]:
# Make predictions
y_pred_train = model.predict(X_train)
y_pred_val = model.predict(X_val)
y_pred_test = model.predict(X_test)

# Evaluate the model
train_mse = mean_squared_error(y_train, y_pred_train)
val_mse = mean_squared_error(y_val, y_pred_val)
test_mse = mean_squared_error(y_test, y_pred_test)

# For regressors, score() returns the coefficient of determination (R2)
train_r2 = model.score(X_train, y_train)
val_r2 = model.score(X_val, y_val)
test_r2 = model.score(X_test, y_test)


print(f'Train MSE: {train_mse}, Train R2: {train_r2}')
print(f'Validation MSE: {val_mse}, Validation R2: {val_r2}')
print(f'Test MSE: {test_mse}, Test R2: {test_r2}')

Train MSE: 76973462.52525853, Train R2: 0.9751978824812902
Validation MSE: 353680644.37414235, Validation R2: 0.8887441252914191
Test MSE: 422900622.10432327, Test R2: 0.8851817966365961
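
MSE values of this size are hard to read because they are in squared price units. Taking the square root gives RMSE, which is in the same units as `SalePrice`. The sketch below only converts the numbers printed above; the dictionary names are illustrative.

```python
import math

# MSE values copied from the output above
mse = {
    "train": 76973462.52525853,
    "validation": 353680644.37414235,
    "test": 422900622.10432327,
}

# RMSE is the square root of MSE, expressed in the same units as SalePrice
rmse = {split: math.sqrt(value) for split, value in mse.items()}
for split, value in rmse.items():
    print(f"{split} RMSE: {value:,.0f}")
```

The training error is well below the validation and test errors (R2 of about 0.975 against 0.889 and 0.885), which suggests the forest fits the training rows more closely than it generalizes.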


### Save Model

In [12]:
import pickle

with open('Realestate_price_model.pkl', 'wb') as file:
    pickle.dump(model, file)
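
To use the saved model elsewhere, for example behind the Flask API mentioned earlier, it has to be loaded back with `pickle.load`. A minimal round-trip sketch follows. It trains a tiny model on synthetic data rather than reusing `Realestate_price_model.pkl`, and the file name `demo_model.pkl` is hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Tiny synthetic model standing in for the trained one
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 0.5])
model_demo = RandomForestRegressor(n_estimators=10, random_state=42).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(model_demo, file)
    # Only unpickle files from a trusted source: loading can run arbitrary code
    with open(path, "rb") as file:
        restored = pickle.load(file)

# The restored model should reproduce the original predictions
same = np.allclose(model_demo.predict(X_demo), restored.predict(X_demo))
print(same)
```

The loaded model expects the same feature columns, in the same order, as the `X` it was trained on. A pickle written with one scikit-learn version is also not guaranteed to load under another, so it helps to record the library version alongside the file.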