## Comparing Multiple Machine Learning Models

Comparing multiple machine learning models means training, evaluating, and analyzing different algorithms on the same dataset to identify which one performs best for a specific predictive task.
The goal is to select the algorithm that offers the best balance of accuracy, complexity, and performance for the problem at hand.

In [1]:
import pandas as pd
data = pd.read_csv('Real_Estate.csv')

In [2]:
data_head = data.head()
data_head

,Transaction date,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area
0,2012-09-02 16:42:30.519336,13.3,4082.015,8,25.007059,121.561694,6.488673
1,2012-09-04 22:52:29.919544,35.5,274.0144,2,25.012148,121.54699,24.970725
2,2012-09-05 01:10:52.349449,1.1,1978.671,10,25.00385,121.528336,26.694267
3,2012-09-05 13:26:01.189083,22.2,1055.067,5,24.962887,121.482178,38.091638
4,2012-09-06 08:29:47.910523,8.5,967.4,6,25.011037,121.479946,21.65471


In [3]:
data.info()  # to understand the data types

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Transaction date                     414 non-null    object 
 1   House age                            414 non-null    float64
 2   Distance to the nearest MRT station  414 non-null    float64
 3   Number of convenience stores         414 non-null    int64  
 4   Latitude                             414 non-null    float64
 5   Longitude                            414 non-null    float64
 6   House price of unit area             414 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 22.8+ KB


In [4]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

Convert "Transaction date" to datetime and extract year and month

In [5]:
data['Transaction date'] = pd.to_datetime(data['Transaction date'])
data['Transaction year'] = data['Transaction date'].dt.year
data['Transaction month'] = data['Transaction date'].dt.month

Drop the original 'Transaction date' column, as the relevant features have been extracted

In [6]:
data = data.drop(columns=['Transaction date'])

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 8 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   House age                            414 non-null    float64
 1   Distance to the nearest MRT station  414 non-null    float64
 2   Number of convenience stores         414 non-null    int64  
 3   Latitude                             414 non-null    float64
 4   Longitude                            414 non-null    float64
 5   House price of unit area             414 non-null    float64
 6   Transaction year                     414 non-null    int32  
 7   Transaction month                    414 non-null    int32  
dtypes: float64(5), int32(2), int64(1)
memory usage: 22.8 KB


In [8]:
data.head()

,House age,Distance to the nearest MRT station,Number of convenience stores,Latitude,Longitude,House price of unit area,Transaction year,Transaction month
0,13.3,4082.015,8,25.007059,121.561694,6.488673,2012,9
1,35.5,274.0144,2,25.012148,121.54699,24.970725,2012,9
2,1.1,1978.671,10,25.00385,121.528336,26.694267,2012,9
3,22.2,1055.067,5,24.962887,121.482178,38.091638,2012,9
4,8.5,967.4,6,25.011037,121.479946,21.65471,2012,9


Define the feature and target variables

In [9]:
x = data.drop('House price of unit area', axis=1)
y = data['House price of unit area']

Split the data into training and testing sets

In [10]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)

Scale the features

In [11]:
scaler = StandardScaler()

In [12]:
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)
x_train_scaled.shape


(331, 7)

In [13]:
x_test_scaled.shape

(83, 7)
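
The scaling step above fits the scaler on the training split only and then reuses those statistics for the test split. A minimal sketch of that behaviour follows, assuming synthetic data in place of `Real_Estate.csv`; the names `x_demo`, `demo_scaler`, and the random values are hypothetical and only mirror the shapes shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the feature matrix (414 rows, 7 columns); not the real data.
rng = np.random.default_rng(42)
x_demo = rng.normal(loc=50.0, scale=10.0, size=(414, 7))

x_demo_train, x_demo_test = train_test_split(x_demo, test_size=0.2, random_state=42)

demo_scaler = StandardScaler()
# fit_transform learns the per-column mean and standard deviation from the training rows
x_demo_train_scaled = demo_scaler.fit_transform(x_demo_train)
# transform reuses the training statistics, so no test-set information leaks into the scaler
x_demo_test_scaled = demo_scaler.transform(x_demo_test)

# Training columns end up with mean 0 and standard deviation 1;
# test columns are only approximately standardized.
print(x_demo_train_scaled.mean(axis=0).round(3))
print(x_demo_train_scaled.std(axis=0).round(3))
print(x_demo_test_scaled.mean(axis=0).round(3))
```

The test-set means are close to zero but not exactly zero, which is expected: the test rows are scaled with statistics they did not contribute to.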

Training models and comparison

In [14]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

Initialize the models to compare

In [15]:
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42)
}

Dictionary to hold the evaluation metrics for each model

In [16]:
results = {}

Train and evaluate each model

In [17]:
for name, model in models.items():
    # train the model on the scaled training features
    model.fit(x_train_scaled, y_train)

    # make predictions on the test set
    prediction = model.predict(x_test_scaled)

    # calculate the evaluation metrics
    mae = mean_absolute_error(y_test, prediction)
    r2 = r2_score(y_test, prediction)

    # store the metrics for this model
    results[name] = {"MAE": mae, "R2": r2}


In [18]:
results_df = pd.DataFrame(results).T

The performance of each model on the test set, measured by Mean Absolute Error (MAE) and R-squared (R2), is as follows:


In [19]:
results_df

Model,MAE,R2
Linear Regression,9.748246,0.529615
Decision Tree,11.760342,0.204962
Random Forest,9.887601,0.509547
Gradient Boosting,10.000117,0.476071
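
For reference, the two metrics in the table can be computed by hand: MAE is the mean of the absolute errors, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below checks this against scikit-learn on four made-up values; the numbers are illustrative assumptions and are not taken from the real estate data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up values for illustration only.
y_true = np.array([30.0, 25.0, 40.0, 35.0])
y_pred = np.array([28.0, 27.0, 37.0, 36.0])

# MAE: average absolute difference between true and predicted values
mae_manual = np.mean(np.abs(y_true - y_pred))

# R2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual)           # 2.0
print(round(r2_manual, 3))  # 0.856

# Both values agree with the scikit-learn functions used in the notebook.
print(np.isclose(mae_manual, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2_manual, r2_score(y_true, y_pred)))
```

A lower MAE is better and is expressed in the units of the target, while R2 is unitless, with 1.0 meaning a perfect fit and 0.0 meaning no better than always predicting the mean.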


Comparison of the models

Linear Regression has the lowest MAE (9.748) and the highest R2 (0.530), making it the best-performing model among those evaluated. Despite its simplicity, it is reasonably effective on this dataset, explaining about 53% of the variance in the test-set prices.
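
With only 83 test rows, a gap of roughly 0.14 in MAE between the top two models may fall within sampling noise. One way to check a ranking like this is k-fold cross-validation, sketched below under stated assumptions: the data come from `make_regression` rather than `Real_Estate.csv`, the names `x_cv`, `y_cv`, `candidates`, and `cv_mae` are hypothetical, and the scaler is placed inside a pipeline so it is refit on each training fold.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with the same shape as the notebook's features; not the real data.
x_cv, y_cv = make_regression(n_samples=414, n_features=7, noise=20.0, random_state=42)

# Shuffled 5-fold split with a fixed seed so the folds are reproducible
cv = KFold(n_splits=5, shuffle=True, random_state=42)

candidates = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "Random Forest": make_pipeline(
        StandardScaler(), RandomForestRegressor(n_estimators=50, random_state=42)
    ),
}

cv_mae = {}
for name, estimator in candidates.items():
    # scikit-learn reports negated MAE so that higher is always better
    scores = cross_val_score(
        estimator, x_cv, y_cv, cv=cv, scoring="neg_mean_absolute_error"
    )
    cv_mae[name] = (-scores.mean(), scores.std())
    print(f"{name}: MAE = {-scores.mean():.3f} (std across folds {scores.std():.3f})")
```

If the spread across folds is larger than the difference between two models' mean MAE, the single-split ranking should be read with caution.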

The Decision Tree Regressor has the highest MAE (11.760) and the lowest R2 (0.205), which suggests it may be overfitting the training data and generalizing poorly to the test data. The Random Forest and Gradient Boosting regressors have similar MAEs (9.89 and 10.00, respectively) and R2 scores (0.51 and 0.48, respectively): slightly worse than Linear Regression but better than the Decision Tree.
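
The overfitting hypothesis can be checked by scoring the tree on the data it was trained on and comparing that with its test score. The following is a minimal sketch on synthetic data from `make_regression`, not on the real estate dataset; the names `x_syn`, `y_syn`, `tree`, `r2_train`, and `r2_test` are hypothetical.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy regression problem; a stand-in for the notebook's data.
x_syn, y_syn = make_regression(n_samples=414, n_features=7, noise=25.0, random_state=42)
x_tr, x_te, y_tr, y_te = train_test_split(x_syn, y_syn, test_size=0.2, random_state=42)

# A tree with no depth limit, like the one in the notebook, can memorize the training rows
tree = DecisionTreeRegressor(random_state=42).fit(x_tr, y_tr)

r2_train = r2_score(y_tr, tree.predict(x_tr))
r2_test = r2_score(y_te, tree.predict(x_te))

print(f"train R2 = {r2_train:.3f}")
print(f"test  R2 = {r2_test:.3f}")
```

A training R2 near 1.0 combined with a much lower test R2 is the usual signature of overfitting. Running the same comparison with `x_train_scaled` and `x_test_scaled` from the notebook would show whether that pattern holds for the real data; limiting `max_depth` or `min_samples_leaf` is a common way to narrow the gap.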

 