# Regressor Algorithms Predictions

By comparing multiple models, we aim to select the algorithm that offers the best balance of accuracy, complexity, and performance for our specific problem. Below is the process we can follow to compare multiple Machine Learning models:

**1) Address missing values, remove duplicates, and correct errors in the dataset to ensure the quality of data fed into the models.**

**2) Divide the dataset into training and testing sets, typically using a 70-30% or 80-20% split.**

**3) Select a diverse set of models for comparison. It can include simple linear models, tree-based models, ensemble methods, and more advanced algorithms, depending on the problem’s complexity and data characteristics.**

**4) Fit each selected model to the training data. It involves adjusting the model to learn from the features and the target variable in the training set.**

**5) Use a set of metrics to evaluate each model’s performance on the test set.**

**6) Compare the models based on the evaluation metrics, considering both their performance and computational efficiency.**
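Step 1 is not shown as code later in this notebook, so here is a minimal sketch of what such a cleaning pass could look like. The toy DataFrame `raw` and the choice of median imputation are assumptions made for illustration; they are not taken from the real estate dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value and one exact duplicate row
raw = pd.DataFrame({
    "House age": [13.3, 35.5, np.nan, 35.5, 8.5],
    "Number of convenience stores": [8, 2, 10, 2, 6],
    "House price of unit area": [6.5, 25.0, 26.7, 25.0, 55.3],
})

# Count the problems before changing anything
n_missing = int(raw.isna().sum().sum())
n_duplicates = int(raw.duplicated().sum())

# Drop exact duplicate rows, then fill remaining numeric gaps with the column median
clean = raw.drop_duplicates().reset_index(drop=True)
clean = clean.fillna(clean.median(numeric_only=True))

print(n_missing, n_duplicates, clean.shape)
```

Median imputation is only one option; dropping incomplete rows or using a model-based imputer may suit other datasets better.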

## Importing Necessary Libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, r2_score


## Data Collection

In [4]:
data = pd.read_csv(r"D:\Real_Estate.csv")

In [5]:
data.info()  # call the method; print(data.info) only prints the bound method's repr

In [6]:
print(data.head())

             Transaction date  House age  Distance to the nearest MRT station  \
0  2012-09-02 16:42:30.519336       13.3                            4082.0150   
1  2012-09-04 22:52:29.919544       35.5                             274.0144   
2  2012-09-05 01:10:52.349449        1.1                            1978.6710   
3  2012-09-05 13:26:01.189083       22.2                            1055.0670   
4  2012-09-06 08:29:47.910523        8.5                             967.4000   

   Number of convenience stores   Latitude   Longitude  \
0                             8  25.007059  121.561694   
1                             2  25.012148  121.546990   
2                            10  25.003850  121.528336   
3                             5  24.962887  121.482178   
4                             6  25.011037  121.479946   

   House price of unit area  
0                  6.488673  
1                 24.970725  
2                 26.694267  
3                 38.091638  
4             

**The dataset consists of 414 entries and 7 columns, with no missing values. Here’s a brief overview of the columns:**

**Transaction date:** The date of the house sale (object type, which suggests it might need conversion or extraction of useful features like year, month, etc.).

**House age:** The age of the house in years (float).

**Distance to the nearest MRT station:** The distance to the nearest mass rapid transit station in meters (float).

**Number of convenience stores:** The number of convenience stores within walking distance of the house (integer).

**Latitude:** The geographic coordinate that specifies the north-south position (float).

**Longitude:** The geographic coordinate that specifies the east-west position (float).

**House price of unit area:** Price of the house per unit area (float), which is likely our target variable for prediction.
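The overview above can be checked directly with a few pandas calls. This sketch uses a two-row stand-in named `sample` (a hypothetical name), built from the first two rows printed earlier, so that it runs without the CSV file; on the real `data` the same calls apply.

```python
import pandas as pd

# Two-row stand-in for the real CSV, using the same column names
sample = pd.DataFrame({
    "Transaction date": ["2012-09-02 16:42:30.519336", "2012-09-04 22:52:29.919544"],
    "House age": [13.3, 35.5],
    "Distance to the nearest MRT station": [4082.015, 274.0144],
    "Number of convenience stores": [8, 2],
    "Latitude": [25.007059, 25.012148],
    "Longitude": [121.561694, 121.546990],
    "House price of unit area": [6.488673, 24.970725],
})

print(sample.shape)    # (rows, columns)
print(sample.dtypes)   # the date column is still text at this point

# Number of missing values in each column
missing_per_column = sample.isna().sum()
print(missing_per_column)
```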

## Data Preprocessing

**We start with the preprocessing steps. Below are the steps we will follow to preprocess our data:**

**1) Since the transaction date is in a string format, we will convert it into a datetime object. We can then extract features such as the transaction year and month, which might be useful for the model.**

**2) We’ll scale the continuous features so they’re on a similar scale. This matters most for models like Support Vector Machines or K-nearest neighbours, which are sensitive to the scale of input features; the linear and tree-based models compared here are largely unaffected by it, so scaling is harmless but not essential.**

**3) We’ll split the dataset into a training set and a testing set. A common practice is to use 80% of the data for training and 20% for testing.**

### Convert "Transaction date" to datetime and extract year and month

In [7]:
data['Transaction date'] = pd.to_datetime(data['Transaction date'])
data['Transaction year'] = data['Transaction date'].dt.year
data['Transaction month'] = data['Transaction date'].dt.month

### Drop the original "Transaction date" as we've extracted relevant features

In [8]:
data = data.drop(columns=['Transaction date'])

### Define features and target variable

In [10]:
X = data.drop('House price of unit area', axis=1)
y = data['House price of unit area']

print("X Labels:\n {}\n\n y Labels: \n{}".format(X,y)) 

X Labels:
      House age  Distance to the nearest MRT station  \
0         13.3                           4082.01500   
1         35.5                            274.01440   
2          1.1                           1978.67100   
3         22.2                           1055.06700   
4          8.5                            967.40000   
..         ...                                  ...   
409       18.3                            170.12890   
410       11.9                            323.69120   
411        0.0                            451.64190   
412       35.9                            292.99780   
413       12.0                             90.45606   

     Number of convenience stores   Latitude   Longitude  Transaction year  \
0                               8  25.007059  121.561694              2012   
1                               2  25.012148  121.546990              2012   
2                              10  25.003850  121.528336              2012   
3               

### Split the data into training and testing sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train :\n{} \n\n X_test: \n {} \n\n y_train: \n {} \n\n y_test: \n{}".format(X_train,X_test,y_train,y_test))

X_train :
     House age  Distance to the nearest MRT station  \
192       13.3                           2147.37600   
234       19.2                             90.45606   
5         13.3                            279.17260   
45         8.0                            405.21340   
245       37.1                           1559.82700   
..         ...                                  ...   
71        12.0                           2408.99300   
106        4.5                            579.20830   
270       30.4                            444.13340   
348       20.0                            552.43710   
102       34.9                           2185.12800   

     Number of convenience stores   Latitude   Longitude  Transaction year  \
192                             3  24.933732  121.564450              2013   
234                             5  24.986418  121.478117              2013   
5                               2  24.994994  121.543823              2012   
45               

### Scale the features

In [14]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled.shape

(331, 7)

In [15]:
X_test_scaled.shape

(83, 7)
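The scaler above is fitted on the training set only and then reused on the test set, which keeps test-set statistics out of training. One way to make that ordering hard to get wrong is scikit-learn's `make_pipeline`. The sketch below runs on synthetic stand-in data (an assumption, as are the names `X_demo`, `y_demo`, and `pipe`) and checks that the pipeline gives the same predictions as the manual two-step version.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption): 200 rows, 3 features on very different scales
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y_demo = X_demo @ np.array([2.0, 0.05, -0.001]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# The pipeline fits the scaler on the training data only, then fits the model
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# Equivalent manual version: fit_transform on train, transform on test
scaler_demo = StandardScaler()
manual = LinearRegression().fit(scaler_demo.fit_transform(X_tr), y_tr)

same = np.allclose(pipe.predict(X_te), manual.predict(scaler_demo.transform(X_te)))
print(same)
```

A pipeline also plugs straight into cross-validation helpers, which refit the scaler inside each fold.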

## Model Training, Prediction and Comparison

### Initialize the models

In [16]:
models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42)
}

In [17]:
# dictionary to hold the evaluation metrics for each model
results = {}

# train and evaluate each model
for name, model in models.items():
    # training the model
    model.fit(X_train_scaled, y_train)

    # making predictions on the test set
    predictions = model.predict(X_test_scaled)

    # calculating evaluation metrics
    mae = mean_absolute_error(y_test, predictions)
    r2 = r2_score(y_test, predictions)

    # storing the metrics
    results[name] = {"MAE": mae, "R²": r2}

results_df = pd.DataFrame(results).T  # convert the results to a DataFrame for better readability
print(results_df)

                         MAE        R²
Linear Regression   9.748246  0.529615
Decision Tree      11.760342  0.204962
Random Forest       9.887601  0.509547
Gradient Boosting  10.000117  0.476071
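These scores come from a single 80/20 split, so they can shift with a different `random_state`. The sketch below shows how the same four models could be compared with 5-fold cross-validation, which also reports fit times (the computational-efficiency side of step 6). It runs on synthetic data from `make_regression`, which is an assumption for illustration only; because that data is linear by construction, the ranking it prints says nothing about the real estate dataset.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real estate data (an assumption, not the notebook's dataset)
X_cv, y_cv = make_regression(n_samples=200, n_features=7, noise=20.0, random_state=42)

cv_models = {
    "Linear Regression": LinearRegression(),
    "Decision Tree": DecisionTreeRegressor(random_state=42),
    "Random Forest": RandomForestRegressor(random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

cv_results = {}
for name, model in cv_models.items():
    # 5-fold CV; scikit-learn reports MAE negated so that higher is always better
    scores = cross_validate(
        model, X_cv, y_cv, cv=5, scoring=("neg_mean_absolute_error", "r2")
    )
    cv_results[name] = {
        "MAE": -scores["test_neg_mean_absolute_error"].mean(),
        "R²": scores["test_r2"].mean(),
        "fit time (s)": scores["fit_time"].mean(),
    }

cv_results_df = pd.DataFrame(cv_results).T
print(cv_results_df)
```

To apply this to the notebook's data, `X_cv` and `y_cv` would be replaced by `X` and `y`, ideally with each model wrapped in a pipeline so scaling is refitted per fold.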


## Conclusion

**The performance of each model on the test set, measured by Mean Absolute Error (MAE) and R-squared (R²), is as follows:**

Linear Regression has the **lowest MAE (9.75)** and the **highest R² (0.53)**, making it the best-performing model among those evaluated. This suggests that, despite its simplicity, **Linear Regression is effective for this dataset**, although an R² of 0.53 means that roughly half of the variance in price remains unexplained.

**Decision Tree Regressor shows the highest MAE (11.76) and the lowest R² (0.20)**, indicating it may be **overfitting to the training data and performing poorly on the test data**. 
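Overfitting can be checked by comparing training and test scores: a large gap between the two is the usual symptom. The sketch below does this on synthetic data (an assumption; `deep_tree` and `shallow_tree` are hypothetical names), contrasting an unrestricted tree, as used in this notebook, with one limited by `max_depth`.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy data (an assumption), split 80/20 like the notebook
X_syn, y_syn = make_regression(n_samples=400, n_features=7, noise=30.0, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=42)

# An unrestricted tree can memorize the training set; a depth limit constrains it
deep_tree = DecisionTreeRegressor(random_state=42).fit(X_tr, y_tr)
shallow_tree = DecisionTreeRegressor(max_depth=4, random_state=42).fit(X_tr, y_tr)

# Print train R² and test R² for each tree
for label, tree in [("unrestricted", deep_tree), ("max_depth=4", shallow_tree)]:
    print(label, round(tree.score(X_tr, y_tr), 3), round(tree.score(X_te, y_te), 3))
```

Running the same comparison with `X_train_scaled` and `X_test_scaled` would show whether the notebook's Decision Tree has a similar train/test gap.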

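**Random Forest Regressor and Gradient Boosting Regressor have similar MAEs (9.89 and 10.00, respectively)** and **R² scores (0.51 and 0.48, respectively)**, performing slightly worse than the Linear Regression model but better than the Decision Tree. With only 83 test samples and a single train/test split, the gap between these two models and Linear Regression is small enough that it may not hold on a different split.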

