# House Rent Prediction Dataset

## Description:
Housing in India varies widely, from the palaces of erstwhile maharajas to modern apartment buildings in metropolitan cities, and even to small huts in remote villages. With the rise in incomes, the housing sector in India has seen tremendous growth.

## Objective:
The goal is to create a predictive model that can estimate house rent based on various features such as location, size, and amenities. This model will help in understanding the factors influencing rent and assist in making informed decisions in the housing sector.


## Loading and Exploring the Dataset


In [1]:
# Importing necessary libraries
import pandas as pd
import numpy as np

# Load the dataset
data = pd.read_csv('houseRent.csv')

# Display the first few rows of the dataset to get an overview
print("First 5 rows of the dataset:")
print(data.head())

# Check the information of the dataset (data types, non-null counts)
print("\nDataset Info:")
print(data.info())

# Get the statistical summary of the dataset for numerical columns
print("\nStatistical Summary:")
print(data.describe())

# Check for missing values in the dataset
print("\nMissing Values:")
print(data.isnull().sum())

First 5 rows of the dataset:
   BHK   Rent  Size    Area Type             Area Locality     City  \
0    2  10000  1100   Super Area                    Bandel  Kolkata   
1    2  20000   800   Super Area  Phool Bagan, Kankurgachi  Kolkata   
2    2  17000  1000   Super Area   Salt Lake City Sector 2  Kolkata   
3    2  10000   800   Super Area               Dumdum Park  Kolkata   
4    2   7500   850  Carpet Area             South Dum Dum  Kolkata   

  Furnishing Status  Bathroom Point of Contact  
0       Unfurnished         2    Contact Owner  
1    Semi-Furnished         1    Contact Owner  
2    Semi-Furnished         1    Contact Owner  
3       Unfurnished         1    Contact Owner  
4       Unfurnished         1    Contact Owner  

Dataset Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4746 entries, 0 to 4745
Data columns (total 9 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   BHK                4746 non-nu

### Encoding Categorical Attributes

Since there are no missing values, we can proceed to encode the categorical attributes (i.e., those with `object` data types) so that they can be used in machine learning models. The categorical attributes in our dataset are:

1. **Area Type**: Categorical with values like `"Super Area"`, `"Carpet Area"`, etc.
2. **Area Locality**: Categorical, each representing a specific locality.
3. **City**: Categorical with names of cities.
4. **Furnishing Status**: Categorical with values like `"Furnished"`, `"Unfurnished"`, etc.
5. **Point of Contact**: Categorical, usually indicating who to contact. This column might not be important for prediction and can be dropped if irrelevant.

To encode the categorical features, we will use:

- **Label Encoding** for attributes with a small number of categories (like `Area Type` and `Furnishing Status`). This maps each category to an integer label. The integers follow the alphabetical order of the category names, so they imply a ranking that is not meaningful in itself; tree-based models usually tolerate this, but it is worth keeping in mind.

- **One-Hot Encoding** for nominal attributes such as `Area Locality` and `City`. This creates one binary (0 or 1) column per category, which avoids imposing any ordinal relationship between categories. The trade-off is width: a column with many distinct values, such as `Area Locality`, adds one new column for each of them.
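
To make the difference between the two encodings concrete, here is a minimal sketch on a toy frame. The rows are made up for illustration; only the column names and category values mirror the dataset.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame with hypothetical rows (not taken from houseRent.csv)
toy = pd.DataFrame({
    "Furnishing Status": ["Unfurnished", "Semi-Furnished", "Furnished", "Unfurnished"],
    "City": ["Kolkata", "Mumbai", "Delhi", "Kolkata"],
})

# Label encoding: classes are sorted alphabetically, then numbered from 0
encoder = LabelEncoder()
toy["Furnishing Status"] = encoder.fit_transform(toy["Furnishing Status"])
label_mapping = {str(cls): int(code) for code, cls in enumerate(encoder.classes_)}

# One-hot encoding: one 0/1 indicator column per city
toy_encoded = pd.get_dummies(toy, columns=["City"], dtype=int)

print(label_mapping)
print(toy_encoded)
```

The mapping shows why `Unfurnished` ends up as 2 and `Semi-Furnished` as 1: the labels come from alphabetical order, not from any notion of "more" or "less" furnished.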

#### Steps:
1. Use `Label Encoding` for:
   - `Area Type`
   - `Furnishing Status`
   
2. Use `One-Hot Encoding` for:
   - `Area Locality`
   - `City`

3. The column `Point of Contact` will be dropped, assuming it does not provide useful information for predicting the rent.


In [2]:
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Label encoding for 'Area Type' and 'Furnishing Status'
# One encoder per column, so each column's class-to-label mapping stays available
label_encoders = {}
for column in ['Area Type', 'Furnishing Status']:
    label_encoders[column] = LabelEncoder()
    data[column] = label_encoders[column].fit_transform(data[column])

# One-hot encoding for 'Area Locality' and 'City'
data = pd.get_dummies(data, columns=['Area Locality', 'City'], drop_first=True)

# Drop 'Point of Contact' as it may not be useful for the model
data = data.drop('Point of Contact', axis=1)

# Display the first few rows after encoding
print("Data after encoding:")
print(data.head())

Data after encoding:
   BHK   Rent  Size  Area Type  Furnishing Status  Bathroom  \
0    2  10000  1100          2                  2         2   
1    2  20000   800          2                  1         1   
2    2  17000  1000          2                  1         1   
3    2  10000   800          2                  2         1   
4    2   7500   850          1                  2         1   

   Area Locality_ in Boduppal, NH 2 2  Area Locality_ in Erragadda, NH 9  \
0                               False                              False   
1                               False                              False   
2                               False                              False   
3                               False                              False   
4                               False                              False   

   Area Locality_ in Miyapur, NH 9  Area Locality_117 Residency, Chembur East  \
0                            False                            

In [3]:
# Convert all boolean values to 0 and 1
data = data.astype(int)

# Display the first few rows to confirm the change
print("Data after converting boolean values to 0s and 1s:")
print(data.head())


Data after converting boolean values to 0s and 1s:
   BHK   Rent  Size  Area Type  Furnishing Status  Bathroom  \
0    2  10000  1100          2                  2         2   
1    2  20000   800          2                  1         1   
2    2  17000  1000          2                  1         1   
3    2  10000   800          2                  2         1   
4    2   7500   850          1                  2         1   

   Area Locality_ in Boduppal, NH 2 2  Area Locality_ in Erragadda, NH 9  \
0                                   0                                  0   
1                                   0                                  0   
2                                   0                                  0   
3                                   0                                  0   
4                                   0                                  0   

   Area Locality_ in Miyapur, NH 9  Area Locality_117 Residency, Chembur East  \
0                               

## **Ensemble Methods**

### 1. Bagging
Bagging (Bootstrap Aggregating) trains multiple copies of a model on different bootstrap samples of the training data (rows drawn with replacement) and averages their predictions. Averaging many high-variance models reduces variance and helps limit overfitting. We will use **Random Forest**, which bags decision trees, for Bagging.
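
The idea can be sketched by hand in a few lines. This is a simplified illustration on synthetic data (an assumption, not the rent dataset), and it is not how `RandomForestRegressor` is implemented internally; names such as `X_demo` and `n_models` are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)

# Synthetic regression data, used only for illustration
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

n_models = 25
trees = []
for _ in range(n_models):
    # Bootstrap sample: draw row indices with replacement
    idx = rng.integers(0, len(X_demo), size=len(X_demo))
    tree = DecisionTreeRegressor(random_state=0).fit(X_demo[idx], y_demo[idx])
    trees.append(tree)

# Bagged prediction: average the predictions of all trees
X_new = np.array([[2.0], [5.0], [8.0]])
all_preds = np.stack([t.predict(X_new) for t in trees])
bagged_pred = all_preds.mean(axis=0)
print(bagged_pred)
```

Each tree sees a slightly different sample, so their individual errors partly cancel when averaged.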

### 2. Boosting
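Boosting is an iterative method that trains a sequence of models, with each new model focusing on correcting the errors made by the previous ones. AdaBoost does this by increasing the weights of observations that were predicted poorly, while gradient boosting methods such as XGBoost fit each new model to the remaining errors (gradients of the loss) of the current ensemble. We will use **AdaBoost** and **XGBoost** for Boosting.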
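
As a rough sketch of the "correct the previous errors" idea, the loop below fits small trees to the residuals of the running prediction. This is a simplified gradient-boosting-style illustration for squared error on synthetic data; it is neither AdaBoost's reweighting scheme nor XGBoost's actual algorithm, and all names and settings are hypothetical.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data, used only for illustration
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(0, 0.3, size=200)

learning_rate = 0.1
n_rounds = 50

# Start from a constant prediction (the mean of the target)
prediction = np.full_like(y_demo, y_demo.mean())
initial_mse = np.mean((y_demo - prediction) ** 2)

weak_learners = []
for _ in range(n_rounds):
    # Each round fits a shallow tree to what the ensemble still gets wrong
    residuals = y_demo - prediction
    learner = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X_demo, residuals)
    # Take a small step towards the correction
    prediction += learning_rate * learner.predict(X_demo)
    weak_learners.append(learner)

final_mse = np.mean((y_demo - prediction) ** 2)
print(f"Training MSE before boosting: {initial_mse:.3f}")
print(f"Training MSE after boosting:  {final_mse:.3f}")
```

A smaller `learning_rate` makes each correction more cautious, which is the same trade-off behind choosing a small learning rate together with more estimators.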

## Model Implementations:

- **RandomForestRegressor**: A bagging technique that combines multiple decision trees to reduce variance.
- **AdaBoostRegressor**: A boosting technique that combines weak learners (e.g., decision trees) by giving more weight to hard-to-predict instances.
- **XGBRegressor**: A boosting algorithm that builds strong predictive models by sequentially correcting errors from weak models, with additional regularization to reduce overfitting.

## Steps Involved:

1. **Import Libraries**:
    - We import the necessary libraries for implementing the models and evaluation metrics.

2. **Initialize Models**:
    - **RandomForestRegressor**: A bagging model to reduce variance.
    - **AdaBoostRegressor**: A boosting model to iteratively improve predictions.
    - **XGBRegressor**: A scalable and efficient gradient boosting model.

3. **Evaluate Models**:
    - We define a function `evaluate_model` to train and test each model, compute predictions, and evaluate the model's performance using metrics like **MSE**, **RMSE**, and **R² Score**.

4. **Training and Testing**:
    - For each model, we train the model on the training set (`X_train`, `y_train`) and make predictions on the test set (`X_test`).
    - We then evaluate the model using **Mean Squared Error (MSE)**, **Root Mean Squared Error (RMSE)**, and **R² Score**.
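
For reference, the three metrics can be computed by hand and checked against scikit-learn. The values below are made up for illustration and are not model output.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true rents and predictions
y_true = np.array([10000, 20000, 17000, 7500], dtype=float)
y_hat = np.array([12000, 18000, 17000, 9500], dtype=float)

errors = y_true - y_hat

# MSE: mean of squared errors
mse = np.mean(errors ** 2)

# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)

# R²: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.2f}")
print(f"RMSE: {rmse:.2f}")
print(f"R²:   {r2:.4f}")

# The manual values agree with scikit-learn's implementations
print(np.isclose(mse, mean_squared_error(y_true, y_hat)))
print(np.isclose(r2, r2_score(y_true, y_hat)))
```

Because MSE squares the errors, a few very large misses can dominate it; RMSE is easier to read because it is back on the scale of the rent itself.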


In [5]:
# Import necessary libraries
import numpy as np  # used for np.sqrt in evaluate_model
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Split the data into features (X) and target (y)
X = data.drop(columns=['Rent'])  # Assuming 'Rent' is the target variable
y = data['Rent']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Initialize models (AdaBoost uses more estimators and a smaller learning rate than its defaults)
random_forest_model = RandomForestRegressor(n_estimators=100, random_state=42)  # Bagging
adaboost_model = AdaBoostRegressor(
    n_estimators=300,  # Increased number of estimators
    learning_rate=0.05,  # Smaller learning rate for gradual learning
    random_state=42)  # Boosting with AdaBoost
xgboost_model = XGBRegressor(n_estimators=100, random_state=42)  # Boosting with XGBoost

# Dictionary to store models
models = {
    "Random Forest": random_forest_model,
    "AdaBoost": adaboost_model,
    "XGBoost": xgboost_model
}

# Function to train, predict, and evaluate a model
def evaluate_model(model, X_train, y_train, X_test, y_test):
    # Train the model
    model.fit(X_train, y_train)
    
    # Predict on the test set
    y_pred = model.predict(X_test)
    
    # Evaluation Metrics
    mse = mean_squared_error(y_test, y_pred)
    rmse = np.sqrt(mse)
    r2 = r2_score(y_test, y_pred)
    
    # Display the results
    print(f"\nModel: {model.__class__.__name__}")
    print(f"Mean Squared Error (MSE): {mse:.2f}")
    print(f"Root Mean Squared Error (RMSE): {rmse:.2f}")
    print(f"R-squared (R² Score): {r2:.2f}")
    
    return mse, rmse, r2

# Train and evaluate each model
for model_name, model in models.items():
    print(f"\nTraining and Evaluating {model_name} Model:")
    evaluate_model(model, X_train, y_train, X_test, y_test)



Training and Evaluating Random Forest Model:

Model: RandomForestRegressor
Mean Squared Error (MSE): 1048197798.21
Root Mean Squared Error (RMSE): 32375.88
R-squared (R² Score): 0.70

Training and Evaluating AdaBoost Model:

Model: AdaBoostRegressor
Mean Squared Error (MSE): 1632308156.71
Root Mean Squared Error (RMSE): 40401.83
R-squared (R² Score): 0.54

Training and Evaluating XGBoost Model:

Model: XGBRegressor
Mean Squared Error (MSE): 978227620.26
Root Mean Squared Error (RMSE): 31276.63
R-squared (R² Score): 0.72


### Conclusion of the House Rent Prediction Models

The aim of this project was to predict house rent prices using three different regression models: **Random Forest**, **AdaBoost**, and **XGBoost**. Below is a detailed evaluation of each model based on their performance metrics.

1. **Random Forest Regressor**:
   - **Mean Squared Error (MSE)**: 1,04,81,97,798.21
   - **Root Mean Squared Error (RMSE)**: 32,375.88
   - **R-squared (R² Score)**: 0.70
   - The Random Forest model performed reasonably well, explaining about 70% of the variance in test-set rents (R² of 0.70). It still leaves room for improvement: an RMSE of about 32,000 is large compared with the rents in the first rows of the dataset (7,500 to 20,000), so individual predictions can be far off.

2. **AdaBoost Regressor**:
   - **Mean Squared Error (MSE)**: 1,63,23,08,156.71
   - **Root Mean Squared Error (RMSE)**: 40,401.83
   - **R-squared (R² Score)**: 0.54
   - The AdaBoost model performed worse than Random Forest, with a higher MSE and RMSE. An R² score of 0.54 indicates that it explained only about half of the variance in the target variable. The model has some predictive power, but it may not be the best choice for this dataset without further tuning.

3. **XGBoost Regressor**:
   - **Mean Squared Error (MSE)**: 97,82,27,620.26
   - **Root Mean Squared Error (RMSE)**: 31,276.63
   - **R-squared (R² Score)**: 0.72
   - XGBoost was the top performer among the three models, with the lowest MSE and RMSE and the highest R² score of 0.72. It explained about 72% of the variance in test-set rents, a modest improvement over Random Forest.
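
As a quick cross-check of the figures above, the sketch below rebuilds the comparison as a small table, recomputes RMSE from the reported MSE values, and ranks the models. The MSE and R² numbers are copied from the evaluation output; the table layout is only one possible way to summarise them.

```python
import numpy as np
import pandas as pd

# MSE and R² values copied from the printed evaluation results
results = pd.DataFrame(
    {
        "MSE": [1048197798.21, 1632308156.71, 978227620.26],
        "R2": [0.70, 0.54, 0.72],
    },
    index=["Random Forest", "AdaBoost", "XGBoost"],
)

# RMSE is the square root of MSE
results["RMSE"] = np.sqrt(results["MSE"]).round(2)

# Rank models from lowest to highest RMSE
ranked = results.sort_values("RMSE")
print(ranked)
```

The recomputed RMSE values match the reported ones, and the ranking by RMSE agrees with the ranking by R².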

### Key Insights:
- **XGBoost** is the best of the three models for predicting house rent on this dataset, with the highest R² and the lowest RMSE on the held-out 30% test set.
- **Random Forest** is a strong alternative with only slightly weaker results, but it may require additional tuning or feature engineering to match XGBoost.
- **AdaBoost** did not perform as well as the other models, possibly because of its sensitivity to noisy or extreme target values and its need for carefully tuned hyperparameters.

