### **Process for Comparing Multiple Machine Learning Models**  

When comparing multiple machine learning models, the goal is to identify the most effective algorithm that balances **accuracy, complexity, and performance** for a given problem. The following structured approach can be used:  

1. **Data Preprocessing** – Handle missing values, eliminate duplicates, and correct errors to ensure the dataset is clean and reliable.  
2. **Dataset Splitting** – Divide the data into **training and testing sets** (commonly a **70/30** or **80/20** split) so that each model is evaluated on data it has not seen during training.  
3. **Model Selection** – Choose a diverse range of models, including **linear models, tree-based algorithms, ensemble methods,** and **advanced techniques**, based on data characteristics and problem complexity.  
4. **Model Training** – Train each selected model on the **training dataset**, allowing it to learn from the input features and target variable.  
5. **Performance Evaluation** – Assess each model using **relevant metrics** on the **test dataset** to determine effectiveness.  
6. **Comparison & Selection** – Compare models based on **accuracy, computational efficiency, and overall performance** to select the best-suited approach.  

This systematic approach helps in making informed decisions when selecting a machine learning model tailored to specific use cases.

Now, let’s get started with the task of training and comparing multiple Machine Learning models by importing pandas and loading the dataset.

In [7]:
import pandas as pd
# Load the dataset and preview the first five rows
data = pd.read_csv('Real_Estate.csv')
data_head = data.head()
print(data_head)

             Transaction date  House age  Distance to the nearest MRT station  \
0  2012-09-02 16:42:30.519336       13.3                            4082.0150   
1  2012-09-04 22:52:29.919544       35.5                             274.0144   
2  2012-09-05 01:10:52.349449        1.1                            1978.6710   
3  2012-09-05 13:26:01.189083       22.2                            1055.0670   
4  2012-09-06 08:29:47.910523        8.5                             967.4000   

   Number of convenience stores   Latitude   Longitude  \
0                             8  25.007059  121.561694   
1                             2  25.012148  121.546990   
2                            10  25.003850  121.528336   
3                             5  24.962887  121.482178   
4                             6  25.011037  121.479946   

   House price of unit area  
0                  6.488673  
1                 24.970725  
2                 26.694267  
3                 38.091638  
4             

In [9]:
data.info()  # info() prints its summary directly and returns None, so no print() is needed

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 414 entries, 0 to 413
Data columns (total 7 columns):
 #   Column                               Non-Null Count  Dtype  
---  ------                               --------------  -----  
 0   Transaction date                     414 non-null    object 
 1   House age                            414 non-null    float64
 2   Distance to the nearest MRT station  414 non-null    float64
 3   Number of convenience stores         414 non-null    int64  
 4   Latitude                             414 non-null    float64
 5   Longitude                            414 non-null    float64
 6   House price of unit area             414 non-null    float64
dtypes: float64(5), int64(1), object(1)
memory usage: 22.8+ KB


The dataset consists of 414 entries and 7 columns, with no missing values. Here’s a brief overview of the columns:

**Transaction date**: The date of the house sale (object type, which suggests it might need conversion or extraction of useful features like year, month, etc.).

**House age**: The age of the house in years (float).

**Distance to the nearest MRT station**: The distance to the nearest mass rapid transit station in meters (float).

**Number of convenience stores**: The number of convenience stores within walking distance of the house (integer).

**Latitude**: The geographic coordinate that specifies the north-south position (float).

**Longitude**: The geographic coordinate that specifies the east-west position (float).

**House price of unit area**: Price of the house per unit area (float), which we use as the target variable for prediction.
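
The `info()` summary confirms there are no missing values, but it does not report duplicate rows. A minimal sketch of both checks is shown below. It uses a tiny made-up frame called `sample` (a hypothetical stand-in, not the real dataset) so that it runs on its own; the same two calls could be applied to `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`; the values are invented for illustration.
sample = pd.DataFrame({
    "House age": [13.3, 35.5, 35.5, None],
    "Number of convenience stores": [8, 2, 2, 5],
})

# Count missing values in each column
missing_per_column = sample.isna().sum()

# Count rows that exactly repeat an earlier row
duplicate_rows = int(sample.duplicated().sum())

print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

If either count is non-zero, `dropna()` or `drop_duplicates()` are the usual pandas starting points, though the right handling depends on the data.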

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Convert "Transaction date" to datetime and extract year and month
data['Transaction date'] = pd.to_datetime(data['Transaction date'])
data['Transaction Year'] = data['Transaction date'].dt.year
data['Transaction Month'] = data['Transaction date'].dt.month

# Drop the original "Transaction date" now that the relevant features are extracted
data = data.drop(columns=['Transaction date'])

# Define features and target variable
X = data.drop('House price of unit area', axis=1)
y = data['House price of unit area']

# Split the data into training (80%) and testing (20%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Scale the features: fit on the training set only, then apply the same transform to the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)




In [12]:
X_train_scaled.shape

(331, 7)

In [13]:
X_test_scaled.shape

(83, 7)
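
The two shapes add up to the 414 rows of the dataset (331 + 83), and the 7 columns are the five original features plus the extracted year and month. The scaler is fit on the training split only and then reused on the test split. The sketch below illustrates what that means, using randomly generated arrays (`X_train_demo` and `X_test_demo` are hypothetical stand-ins with the same shapes, not the real features).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for X_train / X_test (invented values, same shapes as above)
X_train_demo = rng.normal(loc=50.0, scale=10.0, size=(331, 7))
X_test_demo = rng.normal(loc=50.0, scale=10.0, size=(83, 7))

scaler_demo = StandardScaler()
X_train_demo_scaled = scaler_demo.fit_transform(X_train_demo)  # learns mean/std from the training split
X_test_demo_scaled = scaler_demo.transform(X_test_demo)        # reuses the training statistics

train_means = X_train_demo_scaled.mean(axis=0)
train_stds = X_train_demo_scaled.std(axis=0)
test_means = X_test_demo_scaled.mean(axis=0)

# Training columns are centered at 0 with unit standard deviation
print(np.allclose(train_means, 0.0), np.allclose(train_stds, 1.0))
# Test columns are only approximately centered, because they use the training statistics
print(test_means.round(2))
```

Fitting the scaler on the training split alone keeps information about the test set out of the preprocessing step.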

### **Commonly Used Models for Regression Tasks**  

To compare regression models effectively, we begin with a selection of commonly used approaches:  

- **Linear Regression** – A fundamental baseline model for regression problems.  
- **Decision Tree Regressor** – A simple tree-based model that captures non-linear relationships.  
- **Random Forest Regressor** – An ensemble method that enhances the performance of decision trees by averaging multiple trees.  
- **Gradient Boosting Regressor** – A powerful boosting-based ensemble technique that improves predictive accuracy.  

Each model will be trained on the **training dataset** and evaluated on the **test dataset** using **Mean Absolute Error (MAE)** and **R-squared (R²)** as performance metrics. These metrics provide insights into the **average prediction error** and the **model’s ability to explain variance in the target variable**.
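
To make the two metrics concrete, here is a small worked example in plain Python. The four target values and predictions are invented for illustration; the formulas are the standard definitions, MAE as the mean of the absolute errors and R² as one minus the ratio of the residual sum of squares to the total sum of squares.

```python
# Invented values for illustration only
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [12.0, 18.0, 33.0, 37.0]

n = len(y_true)

# MAE: average absolute difference between actual and predicted values
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n

# R2 = 1 - SS_res / SS_tot
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.2f}, R2 = {r2:.3f}")  # prints "MAE = 2.50, R2 = 0.948"
```

MAE is expressed in the units of the target, so lower is better, while R² is unitless and approaches 1 as the predictions explain more of the variance.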

In [15]:
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score

# Initialize the models
models = {
    'Linear Regression': LinearRegression(),
    'Decision Tree Regressor': DecisionTreeRegressor(random_state=42),
    'Random Forest Regressor': RandomForestRegressor(random_state=42),
    'Gradient Boosting Regressor': GradientBoostingRegressor(random_state=42)
}

# Dictionary to hold the evaluation metrics of each model
results = {}

# Train and evaluate each model
for name, model in models.items():
    # Train the model on the scaled training data
    model.fit(X_train_scaled, y_train)

    # Make predictions on the test set
    predictions = model.predict(X_test_scaled)

    # Evaluate the predictions
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    # Store the results in the dictionary
    results[name] = {'MAE': mae, 'R2': r2}

# One row per model, one column per metric
results_df = pd.DataFrame(results).T
print(results_df)

                                   MAE        R2
Linear Regression             9.748246  0.529615
Decision Tree Regressor      11.760342  0.204962
Random Forest Regressor       9.887601  0.509547
Gradient Boosting Regressor  10.000117  0.476071
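
The scores above come from a single test set of 83 rows, so small differences between models may partly reflect how the split happened to fall. One common way to get a steadier comparison is k-fold cross-validation. The sketch below is an illustration under stated assumptions rather than part of the original analysis: it uses synthetic data from `make_regression` (`X_demo` and `y_demo` are hypothetical stand-ins for `X` and `y`), only two of the four models, and 5 folds.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for X and y (not the real estate data)
X_demo, y_demo = make_regression(n_samples=200, n_features=7, noise=10.0, random_state=42)

candidates = {
    "Linear Regression": LinearRegression(),
    "Random Forest Regressor": RandomForestRegressor(n_estimators=50, random_state=42),
}

cv_mae = {}
for name, model in candidates.items():
    # Putting the scaler in a pipeline refits it on each training fold only
    pipeline = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(
        pipeline, X_demo, y_demo, cv=5, scoring="neg_mean_absolute_error"
    )
    # scikit-learn reports negated MAE so that higher is better; flip the sign back
    cv_mae[name] = -scores.mean()

for name, value in cv_mae.items():
    print(f"{name}: mean CV MAE = {value:.2f}")
```

With the notebook's data, the same loop could be run over `models` using `X` and `y` in place of the synthetic arrays.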


### **Model Performance Comparison**  

Based on the evaluation metrics, **Linear Regression** emerges as the **best-performing model**, achieving:  
- **Lowest Mean Absolute Error (MAE):** **9.75**  
- **Highest R-squared (R²):** **0.53**  

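This suggests that, despite its simplicity, **Linear Regression captures the main relationships in the dataset at least as well as the more complex models.** An R² of 0.53 still leaves nearly half of the variance in the target unexplained.  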

#### **Performance of Other Models:**  
- **Decision Tree Regressor**  
  - **Highest MAE:** **11.76**  
  - **Lowest R²:** **0.20**  
  - An unconstrained tree (the default has no depth limit) likely overfits the training data, which would explain its poor generalization on the test set.  

- **Random Forest Regressor**  
  - **MAE:** **9.89**  
  - **R²:** **0.51**  
  - Performs slightly worse than Linear Regression but substantially better than the single Decision Tree.  

- **Gradient Boosting Regressor**  
  - **MAE:** **10.00**  
  - **R²:** **0.48**  
  - Similar performance to Random Forest, though not as effective as Linear Regression.  
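
The overfitting suspected for the Decision Tree can be checked by comparing its score on the training data with its score on the test data: a large gap points to memorization rather than generalization. The sketch below shows the idea on synthetic data (`X_demo`, `y_demo`, and the split names are hypothetical stand-ins, not the real estate data).

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, noisy regression data used only for illustration
X_demo, y_demo = make_regression(n_samples=400, n_features=7, noise=25.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Default settings: the tree keeps splitting until its leaves are pure
tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)

# score() returns R2 for regressors
train_r2 = tree.score(X_tr, y_tr)
test_r2 = tree.score(X_te, y_te)

print(f"Train R2: {train_r2:.3f}")
print(f"Test R2:  {test_r2:.3f}")
```

In this notebook the equivalent check would call `models['Decision Tree Regressor'].score(...)` on `X_train_scaled, y_train` and on `X_test_scaled, y_test`. If the gap is large, limiting `max_depth` or raising `min_samples_leaf` are typical ways to constrain the tree.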

### **Conclusion:**  
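**Linear Regression** provides the best balance of accuracy and generalization for this dataset. The **Decision Tree Regressor** appears to struggle with overfitting, while **Random Forest and Gradient Boosting** improve on the single tree but do not outperform Linear Regression. These results come from a single 80/20 split with default hyperparameters, and the gap between Linear Regression and Random Forest is small (MAE 9.75 vs. 9.89).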