# Importing the required Libraries

In [1]:
!pip install catboost



In [2]:
# Importing libraries used for Data manipulation and analysis
import pandas as pd
import numpy as np

# Import modelling tools from scikit-learn, XGBoost and CatBoost for machine learning tasks.
from sklearn import model_selection
from sklearn import preprocessing
from sklearn.metrics import r2_score, log_loss, mean_absolute_error
from xgboost import XGBRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from catboost import CatBoostRegressor

# Import necessary module for saving model
import joblib

# Importing the dataset into a dataframe

In [3]:
HP = pd.read_csv("./data/Cleaned_House_Prices.csv")

# Making a copy of the dataset
House_Prices = HP.copy()

# Data Preprocessing
Now that we have imported our cleaned dataset, let's take a brief look at it.

In [4]:
House_Prices.head()

| Unnamed: 0 | area_type | location | size | total_sqft | bath | balcony | price | Home_availability | Home_availabilty_month |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Super built-up Area | Electronic City Phase II | 2 | 1056 | 2.0 | 1.0 | 39.07 | Date Given | Dec |
| 1 | Plot Area | Chikka Tirupathi | 4 | 2600 | 5.0 | 3.0 | 120.0 | Ready To Move | Tbd |
| 2 | Built-up Area | Uttarahalli | 3 | 1440 | 2.0 | 3.0 | 62.0 | Ready To Move | Tbd |
| 3 | Super built-up Area | Lingadheeranahalli | 3 | 1521 | 3.0 | 1.0 | 95.0 | Ready To Move | Tbd |
| 4 | Super built-up Area | Kothanur | 2 | 1200 | 2.0 | 1.0 | 51.0 | Ready To Move | Tbd |


The data loaded as expected, so let's proceed to the data preprocessing phase. During this stage, we will transform the data into a format that the model can use directly.

**Step 1: Handling Categorical Columns**<br>
We'll begin by addressing our categorical columns and converting them into dummy variables.

In [5]:
categorical_columns = House_Prices.select_dtypes("object").columns
for column in categorical_columns:
  House_Prices = House_Prices.join(pd.get_dummies(House_Prices[column])).drop(columns=column)
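
To make the effect of the loop above concrete, here is a minimal sketch on a toy frame. The frame `toy` and its values are hypothetical (only the column names mirror the dataset), and `dtype=int` is passed so the dummy columns show as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical toy frame; column names mirror the dataset, values are made up.
toy = pd.DataFrame({
    "area_type": pd.Series(["Plot Area", "Built-up Area", "Plot Area"], dtype=object),
    "size": [4, 3, 2],
})

# Same pattern as the notebook loop: encode each object column, join, drop the original.
for column in toy.select_dtypes("object").columns:
    toy = toy.join(pd.get_dummies(toy[column], dtype=int)).drop(columns=column)

print(toy)
```

Each distinct category becomes its own 0/1 column, which is why the real dataset grows to well over a thousand columns once `location` is encoded.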

Let's now examine our dataset to verify whether all the changes have been implemented correctly.

In [6]:
House_Prices.head()

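| Unnamed: 0 | size | total_sqft | bath | balcony | price | Built-up Area | Carpet Area | Plot Area | Super built-up Area | Anekal | ... | Feb | Jan | Jul | Jun | Mar | May | Nov | Oct | Sep | Tbd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 1056 | 2.0 | 1.0 | 39.07 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4 | 2600 | 5.0 | 3.0 | 120.0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3 | 1440 | 2.0 | 3.0 | 62.0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3 | 1521 | 3.0 | 1.0 | 95.0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2 | 1200 | 2.0 | 1.0 | 51.0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |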


Now that all our changes have been implemented, let's move on to the next section.

**Step 2: Scaling of Numerical Columns**<br>
Scaling is a data transformation technique that brings numerical features onto comparable scales, so that features with large ranges do not dominate others during machine learning model training.<br>
We will use log scaling here: a base-10 logarithm is applied to `total_sqft` and `price`, which compresses large values more than small ones.

In [7]:
House_Prices['total_sqft'] = House_Prices['total_sqft'].apply(np.log10)
House_Prices['price'] = House_Prices['price'].apply(np.log10)
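
Because the target is now `log10(price)`, anything the model predicts later is also on the log scale. The sketch below shows the forward transform and its inverse on three prices taken from the rows displayed earlier; the variable names are illustrative only.

```python
import numpy as np

# Prices from the first rows shown earlier (same units as the `price` column).
prices = np.array([39.07, 120.0, 62.0])

log_prices = np.log10(prices)   # forward transform, as applied to `price`
restored = 10 ** log_prices     # inverse transform back to the original scale

print(np.round(log_prices, 6))
print(restored)
```

Raising 10 to the power of a prediction is therefore the step needed to report a price in the original units.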

Let's check if the changes have been implemented.

In [8]:
House_Prices.head()

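| Unnamed: 0 | size | total_sqft | bath | balcony | price | Built-up Area | Carpet Area | Plot Area | Super built-up Area | Anekal | ... | Feb | Jan | Jul | Jun | Mar | May | Nov | Oct | Sep | Tbd |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 3.023664 | 2.0 | 1.0 | 1.591843 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 4 | 3.414973 | 5.0 | 3.0 | 2.079181 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 3 | 3.158362 | 2.0 | 3.0 | 1.792392 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 3 | 3.182129 | 3.0 | 1.0 | 1.977724 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 2 | 3.079181 | 2.0 | 1.0 | 1.70757 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |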


**Step 3: Data Splitting**<br>
Great! The changes have been successfully implemented, and now it's time to split the dataset into two distinct sets: the training dataset and the testing dataset. Data splitting is a crucial step in machine learning as it allows us to assess the model's performance on unseen data.<br>
In the next steps, we'll split the dataset into these two subsets, ensuring that both the training and testing data are representative and properly randomized to avoid bias in our model evaluation.


In [9]:
X = House_Prices.drop(columns=["price"]).astype(float)
y = list(House_Prices.price.values)
X_train, X_test, y_train, y_test = model_selection.train_test_split(X,
                                                                    y,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )
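
As a quick illustration of what `test_size=0.2` and `random_state=42` do, the following sketch splits a hypothetical 10-row array. `X_demo` and `y_demo` are stand-ins and not part of the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# 20% of the rows go to the test set; the fixed seed makes the shuffle repeatable.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

print(X_tr.shape, X_te.shape)
```

Every row lands in exactly one of the two sets, and rerunning with the same `random_state` reproduces the same split.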

**Step 4: Model Selection and Training**<br>
CatBoost was selected because it is a powerful and versatile algorithm that often delivers strong predictive performance in regression tasks, while offering features that help combat overfitting and handle missing data.

Let's first create a helper function to keep the model-building code tidy.

In [10]:
def model_builder(model, X_train, X_test, y_train, y_test):
    mod = model
    mod.fit(X_train, y_train)
    pred = mod.predict(X_test)
    print("r2_score", r2_score(y_test, pred), "Mae", mean_absolute_error(y_test, pred))
    return mod

With that out of the way, let's go straight to model building.

**Note:**<br>
The code below will take a considerable amount of time to execute.

In [11]:
model_1 = model_builder(CatBoostRegressor(silent=True), X_train, X_test, y_train, y_test)


r2_score 0.796526336730376 Mae 0.10290479607642279


### Model Interpretation
1. **R-squared (R2) Score: 0.7965**

   - **Interpretation and Implications:** The R2 score of approximately `0.7965` indicates that the regression model explains around `79.65%` of the variance in the log-transformed target variable, leaving approximately `20.35%` of the variance unexplained.

2. **Mean Absolute Error (MAE): 0.1029**

   - **Interpretation and Implications:** The `MAE` value of approximately `0.1029` represents the average absolute difference between the model's predictions and the log-transformed actual values of the target variable. Because the target is `log10(price)`, an error of `0.1029` log units corresponds to a multiplicative factor of about `10^0.1029 ≈ 1.27` on the original price scale, so a typical prediction differs from the actual price by a factor of roughly 1.27.
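
For reference, both metrics can be computed by hand. The sketch below uses hypothetical log10 values (not the notebook's test set) to show the definitions, and then converts the reported MAE into a multiplicative factor on the original price scale.

```python
import numpy as np

# Hypothetical log10 prices; not taken from the notebook's test set.
y_true = np.array([1.60, 2.08, 1.79, 1.98])
y_pred = np.array([1.70, 2.00, 1.85, 1.90])

# MAE: mean of the absolute differences.
mae = np.mean(np.abs(y_true - y_pred))

# R2: one minus the ratio of residual to total sum of squares.
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# A log10 error of e corresponds to a factor of 10**e on the price itself.
factor = 10 ** 0.1029

print(mae, r2, factor)
```

These are the same quantities returned by `mean_absolute_error` and `r2_score`, written out so the units (log10 of price) stay visible.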

The feature set can still be improved. Let's begin by identifying columns of little or no significance to the model, so that they can be removed and no longer add noise to training.

In [12]:
feature_importance_df = pd.DataFrame({'Feature': House_Prices.drop(columns=["price"]).columns,
                                      'Importance': model_1.feature_importances_})

# Sort the DataFrame by importance scores in descending order
feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

# Display the feature importance table
print(feature_importance_df)

               Feature  Importance
1           total_sqft   53.562135
2                 bath    9.795317
6           Plot  Area    9.424733
0                 size    7.035608
984       Rajaji Nagar    1.513359
..                 ...         ...
536   Hommadevanahalli    0.000000
537        Hongasandra    0.000000
538              Hoodi    0.000000
540       Hoodi Layout    0.000000
664  Kanakadasa Layout    0.000000

[1329 rows x 2 columns]


The table above shows that many columns have an importance of zero, meaning the model never used them. We will drop these columns and keep only the features with a non-zero importance.

In [13]:
# Removing zero-importance columns and re-splitting the dataset
X_transformed = House_Prices[feature_importance_df.query("Importance>0").Feature]
X_train, X_test, y_train, y_test = model_selection.train_test_split(X_transformed,
                                                                    y,
                                                                    test_size=0.2,
                                                                    random_state=42
                                                                   )
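
The selection step can be seen in isolation on a small example. The importance scores and the `features` frame below are hypothetical; only the feature names are borrowed from the dataset.

```python
import pandas as pd

# Hypothetical importance table; scores are made up for illustration.
importance = pd.DataFrame({
    "Feature": ["total_sqft", "bath", "Hoodi"],
    "Importance": [60.0, 40.0, 0.0],
})
features = pd.DataFrame({
    "total_sqft": [3.02, 3.41],
    "bath": [2.0, 5.0],
    "Hoodi": [0, 1],
})

# Keep only the features whose importance is strictly positive.
kept = importance.query("Importance > 0")["Feature"].tolist()
reduced = features[kept]

print(list(reduced.columns))
```

Columns with zero importance are dropped, and the remaining ones keep the order of the importance table.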

In [14]:
model_1 = model_builder(CatBoostRegressor(silent=True), X_train, X_test, y_train, y_test)

r2_score 0.7985814503017346 Mae 0.10244130973524479


### Model Interpretation

1. **R-squared (R2) Score: 0.7986**

   - **Interpretation and Implications:** The R2 score of approximately `0.7986` means the regression model explains about `79.86%` of the variance observed in the log-transformed target variable, leaving approximately `20.14%` of the variance unexplained. This is a marginal gain over the previous `0.7965`, obtained with the zero-importance columns removed.

2. **Mean Absolute Error (MAE): 0.1024**

   - **Interpretation and Implications:** With an `MAE` value of approximately `0.1024`, the model's predictions differ from the log-transformed actual values of the target variable by about `0.1024` units on average, slightly better than the previous `0.1029`.

Now that everything looks good, let's move on to the next section.

**Step 5: HyperParameter Tuning**<br>
Hyperparameters are settings chosen before training, such as the learning rate, tree depth, number of iterations, and loss function. By fine-tuning them, we aim to improve the model's performance and predictive accuracy.
This process enables us to find the model settings that work best for a specific dataset and problem, leading to more accurate and efficient machine learning models.
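
The notebook does not say how its settings were chosen. One common approach, sketched here as an assumption, is a cross-validated grid search. To keep the example quick and self-contained it swaps in scikit-learn's `GradientBoostingRegressor` for CatBoost, uses synthetic data, and searches a deliberately tiny, hypothetical grid.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; not the house price dataset.
X_demo, y_demo = make_regression(
    n_samples=120, n_features=5, noise=10.0, random_state=42
)

# Hypothetical, deliberately small grid so the search finishes quickly.
param_grid = {"learning_rate": [0.05, 0.1], "max_depth": [2, 3]}

search = GridSearchCV(
    GradientBoostingRegressor(n_estimators=50, random_state=42),
    param_grid,
    scoring="neg_mean_absolute_error",  # MAE, negated so that higher is better
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
```

Scoring candidates by cross-validation on the training data keeps the test set untouched for the final evaluation.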

In [15]:
model_1 = model_builder(CatBoostRegressor(learning_rate=0.09, depth=10, iterations=2000,
                                          loss_function='MAE', silent=True), X_train, X_test, y_train,
                         y_test)

r2_score 0.8217272936814837 Mae 0.09275911396920962


This is a clear improvement over the previous model: the R2 score rose from `0.7986` to `0.8217` and the MAE fell from `0.1024` to `0.0928`. Now, let's interpret these results.

### Model Interpretation

1. **R-squared (R2) Score: 0.8217**

   - **Interpretation and Implications:** The R2 score of approximately `0.8217` means the regression model explains about `82.17%` of the variance observed in the log-transformed target variable, leaving approximately `17.83%` of the variance unexplained.

2. **Mean Absolute Error (MAE): 0.09276**

   - **Interpretation and Implications:** With an `MAE` value of approximately `0.09276`, the model's predictions differ from the log-transformed actual values of the target variable by about `0.09276` units on average.

# Conclusion
In conclusion, our tuned regression model achieved an `R2 score of 0.8217` and a `Mean Absolute Error of 0.09276` on the test set, both measured on the log-transformed price. The model captures most of the variation in the data, leaving approximately `17.83%` of the variance unexplained.

However, while the model has achieved impressive results, there is always room for improvement. To further enhance the model's accuracy and robustness, one key avenue is to collect more data. Additional data can provide the model with a broader and more representative sample of instances, especially if the dataset is limited. This larger and more diverse dataset can help the model generalize better to unseen examples and further increase its predictive power.

In summary, collecting more data is one step toward a more accurate and reliable regression model. Continually refining our approach and incorporating best practices is the other.

# Saving the Model

In [16]:
joblib.dump(model_1, './model/House_Price_Prediction_model.joblib')
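
A saved model is only useful if it can be reloaded and its predictions converted back to prices. The sketch below shows that round trip with a hypothetical stand-in model (`LinearRegression` in place of CatBoost) and a temporary folder, so every name and path in it is illustrative rather than the notebook's own.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical stand-in model trained on log10 values, in place of the CatBoost model.
X_demo = np.array([[3.0], [3.2], [3.4]])
y_demo = np.array([1.6, 1.8, 2.0])
demo_model = LinearRegression().fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_dir = os.path.join(tmp_dir, "model")
    os.makedirs(model_dir, exist_ok=True)  # the target folder must exist before dumping
    path = os.path.join(model_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    loaded = joblib.load(path)

log_pred = loaded.predict(np.array([[3.2]]))
price_pred = 10 ** log_pred  # undo the log10 applied to `price`

print(log_pred, price_pred)
```

The same pattern applies to the real model: inputs must be prepared exactly as in training (dummy columns, `log10` of `total_sqft`), and the output must be raised to the power of 10 to obtain a price.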