## Analysis Summary

1. **Data Import and Cleaning:**
   - Imported the dataset and converted the 'number of bathrooms' column to an integer type.
   - Dropped the 'id' column for simplicity.

2. **PCA on Numeric Columns:**
   - Applied Principal Component Analysis (PCA) to the standardized numeric columns, excluding 'Date' and the target column ('Price').

3. **Two Experimental Paths:**
   - **Path 1:** Ran LazyRegressor on all columns. (PyForest is used only for its lazy auto-imports, not for feature selection.)
   - **Path 2:** Removed four columns, chosen with reference to the PCA explained-variance table, to reduce complexity. Model performance stayed similar while fit times for the top models dropped.


4. **Model Comparison**


### Path 1: All Columns
| Model                               | Adjusted R-Squared | R-Squared | RMSE      | Time Taken (s) |
|-------------------------------------|--------------------|-----------|-----------|-----------------|
| HistGradientBoostingRegressor       | 0.88               | 0.88      | 128699.89 | 0.77            |
| XGBRegressor                        | 0.88               | 0.88      | 131244.14 | 0.78            |
| LGBMRegressor                       | 0.88               | 0.88      | 131667.67 | 0.68            |
| ... (Top models)                    | ...                | ...       | ...       | ...             |

### Path 2: Reduced Columns
| Model                               | Adjusted R-Squared | R-Squared | RMSE      | Time Taken (s) |
|-------------------------------------|--------------------|-----------|-----------|-----------------|
| HistGradientBoostingRegressor       | 0.88               | 0.88      | 129600.41 | 0.71            |
| LGBMRegressor                       | 0.87               | 0.87      | 134150.78 | 0.39            |
| XGBRegressor                        | 0.87               | 0.87      | 136435.61 | 0.55            |
| ... (Top models)                    | ...                | ...       | ...       | ...             |


5. **Conclusion**

- The models were evaluated using LazyRegressor to compare their performance on two different paths.
- Path 1: All columns included.
- Path 2: Removed four columns with low PCA explained variance to reduce complexity.

- **Top 5 Models (Path 1):**
    1. HistGradientBoostingRegressor
    2. XGBRegressor
    3. LGBMRegressor
    4. GradientBoostingRegressor
    5. RandomForestRegressor

- **Top 5 Models (Path 2):**
    1. HistGradientBoostingRegressor
    2. LGBMRegressor
    3. XGBRegressor
    4. RandomForestRegressor
    5. BaggingRegressor

- **Key Findings:**
    - Model performance is comparable between Path 1 and Path 2.
    - Path 2, with fewer columns, shows shorter fit times for some models.


# Importing Libraries


In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [34]:
!pip install pyforest

Collecting pyforest
  Downloading pyforest-1.1.0.tar.gz (15 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyforest
  Building wheel for pyforest (setup.py) ... done
  Created wheel for pyforest: filename=pyforest-1.1.0-py2.py3-none-any.whl size=14606 sha256=277190782efa915cab403ef6fe68fd1796a5e80a40707807c082bb524759f29e
  Stored in directory: /root/.cache/pip/wheels/9e/7d/2c/5d2f5e62de376c386fd3bf5a8e5bd119ace6a9f48f49df6017
Successfully built pyforest
Installing collected packages: pyforest
Successfully installed pyforest-1.1.0


In [35]:
!pip install lazypredict

Collecting lazypredict
  Downloading lazypredict-0.2.12-py2.py3-none-any.whl.metadata (12 kB)
Downloading lazypredict-0.2.12-py2.py3-none-any.whl (12 kB)
Installing collected packages: lazypredict
Successfully installed lazypredict-0.2.12


In [36]:
# Libraries for PCA
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Importing important libraries
import pyforest
from lazypredict.Supervised import LazyRegressor
from pandas.plotting import scatter_matrix

# Scikit-learn packages
from sklearn.model_selection import train_test_split  # explicit import; otherwise resolved lazily by pyforest
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import ExtraTreesRegressor
from sklearn import metrics
from sklearn.metrics import mean_squared_error

# Hide warnings
import warnings

In [37]:
# Show all columns (no limit) and let pandas auto-detect the display width
pd.options.display.max_columns = None
pd.set_option('display.width', None)

In [38]:
warnings.filterwarnings("ignore")

# Data Import and Cleaning:

In [59]:
df = pd.read_csv("/kaggle/input/house-price-dataset-of-india/House Price India.csv")

In [60]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14620 entries, 0 to 14619
Data columns (total 23 columns):
 #   Column                                 Non-Null Count  Dtype  
---  ------                                 --------------  -----  
 0   id                                     14620 non-null  int64  
 1   Date                                   14620 non-null  int64  
 2   number of bedrooms                     14620 non-null  int64  
 3   number of bathrooms                    14620 non-null  float64
 4   living area                            14620 non-null  int64  
 5   lot area                               14620 non-null  int64  
 6   number of floors                       14620 non-null  float64
 7   waterfront present                     14620 non-null  int64  
 8   number of views                        14620 non-null  int64  
 9   condition of the house                 14620 non-null  int64  
 10  grade of the house                     14620 non-null  int64  
 11  Ar

In [61]:
df['number of bathrooms'] = df['number of bathrooms'].astype(int)

In [62]:
df = df.drop(columns = ['id'])
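
One caveat on the integer cast above: `astype(int)` truncates toward zero rather than rounding, so fractional bathroom counts lose their half-bath and quarter-bath information. A minimal sketch with made-up values (not taken from the dataset) shows the difference between truncating and rounding first:

```python
import pandas as pd

# Hypothetical bathroom counts, for illustration only
bathrooms = pd.Series([1.0, 1.5, 2.25, 2.75])

truncated = bathrooms.astype(int)        # drops the fractional part
rounded = bathrooms.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())
```

Whether truncation is acceptable depends on how much signal the fractional part carries for price; keeping the column as a float is also a reasonable choice for tree-based models.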

# PCA on Numeric Columns:

In [63]:
# Use every remaining column except the target ('Price') and 'Date' as PCA input
features_for_pca = df.drop(['Price','Date'], axis=1)

# Standardize the features
scaler = StandardScaler()
features_standardized = scaler.fit_transform(features_for_pca)

# Perform PCA
pca = PCA()
pca.fit(features_standardized)

# Explained variance ratio
explained_variance_ratio = pca.explained_variance_ratio_

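# Create a DataFrame pairing column names with explained-variance ratios.
# Note: explained_variance_ratio_ is reported per principal component, not per
# original column, so this pairing is positional only (row i is component i).
df_explained_variance = pd.DataFrame({'Feature': features_for_pca.columns, 'Explained Variance': explained_variance_ratio})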

# Display the DataFrame sorted by explained variance
df_explained_variance_sorted = df_explained_variance.sort_values(by='Explained Variance', ascending=False)
print(df_explained_variance_sorted)

                                  Feature  Explained Variance
0                      number of bedrooms                0.25
1                     number of bathrooms                0.10
2                             living area                0.09
3                                lot area                0.07
4                        number of floors                0.06
5                      waterfront present                0.05
6                         number of views                0.05
7                  condition of the house                0.05
8                      grade of the house                0.04
9   Area of the house(excluding basement)                0.04
10                   Area of the basement                0.03
11                             Built Year                0.03
12                        Renovation Year                0.03
13                            Postal Code                0.02
14                              Lattitude                0.02
15      
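
Because `explained_variance_ratio_` describes principal components rather than original columns, the loadings in `pca.components_` are one way to see which columns drive each component. A minimal sketch on synthetic data follows; the column names `a`, `b`, `c` and the variable names are placeholders, not part of this dataset or notebook:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: "a" and "b" are strongly correlated, "c" is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
demo = pd.DataFrame({
    "a": base + rng.normal(scale=0.1, size=200),
    "b": base + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})

pca_demo = PCA().fit(StandardScaler().fit_transform(demo))

# Label rows by component, not by original column
component_names = [f"PC{i + 1}" for i in range(pca_demo.n_components_)]
variance_by_component = pd.Series(
    pca_demo.explained_variance_ratio_, index=component_names
)

# Loadings: rows are components, columns are the original features
loadings = pd.DataFrame(
    pca_demo.components_, index=component_names, columns=demo.columns
)

print(variance_by_component)
print(loadings.abs().round(2))
```

In this toy case the first component loads mainly on `a` and `b`. A column with small loadings on the leading components is a more defensible candidate for removal than one picked by its row position in the variance table.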

# Two Experimental Paths:

In [64]:
df.shape

(14620, 22)

In [65]:
cols_to_remove = ['living_area_renov', 'lot_area_renov', 'Number of schools nearby', 'Distance from the airport']

cols_to_remove = [col for col in cols_to_remove if col in df.columns]

df2 = df.drop(cols_to_remove, axis=1)

In [66]:
df2.shape


(14620, 18)
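
The list comprehension above silently skips any requested column that is not present, so a typo in a column name would go unnoticed. A small sketch of one way to surface that, using a made-up frame (the names `demo`, `requested`, and `z` are hypothetical):

```python
import pandas as pd

# Toy frame standing in for the real dataset
demo = pd.DataFrame({"x": [1, 2], "y": [3, 4], "Price": [10, 20]})
requested = ["y", "z"]  # "z" is deliberately absent

# Report requested names that do not exist before dropping
missing = [col for col in requested if col not in demo.columns]
if missing:
    print(f"Not found, skipped: {missing}")

# errors="ignore" drops only the names that are present
reduced = demo.drop(columns=requested, errors="ignore")
print(reduced.shape)
```

Here the shape check in the notebook (22 columns down to 18) already confirms that all four requested columns were found and removed.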

## Experiment 1:

In [67]:
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)

In [68]:
X = df.drop(columns=['Price'])
y = df.Price

In [69]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=5, test_size=0.25)

# Check that the data was split correctly
print("Training set - Features: ", X_train.shape, "Target: ", y_train.shape)
print("Test set - Features: ", X_test.shape, "Target: ",y_test.shape)



Training set - Features:  (10965, 21) Target:  (10965,)
Test set - Features:  (3655, 21) Target:  (3655,)


In [70]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models1, predictions1 = reg.fit(X_train, X_test, y_train, y_test)

print(models1)

 74%|███████▍  | 31/42 [02:04<00:30,  2.81s/it]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


 98%|█████████▊| 41/42 [02:29<00:01,  1.98s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.004252 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 2526
[LightGBM] [Info] Number of data points in the train set: 10965, number of used features: 21
[LightGBM] [Info] Start training from score 538441.563429


100%|██████████| 42/42 [02:29<00:00,  3.56s/it]

                               Adjusted R-Squared  R-Squared      RMSE  \
Model                                                                    
HistGradientBoostingRegressor                0.88       0.88 128699.89   
XGBRegressor                                 0.88       0.88 131244.14   
LGBMRegressor                                0.88       0.88 131667.67   
GradientBoostingRegressor                    0.87       0.87 137953.07   
RandomForestRegressor                        0.86       0.86 139228.24   
ExtraTreesRegressor                          0.86       0.86 139336.93   
BaggingRegressor                             0.85       0.85 144861.12   
KNeighborsRegressor                          0.75       0.75 189220.77   
TransformedTargetRegressor                   0.70       0.70 206956.71   
LinearRegression                             0.70       0.70 206956.71   
Lars                                         0.70       0.70 206956.71   
Lasso                                 




- **Top 5 Models (Path 1):**
    1. HistGradientBoostingRegressor
    2. XGBRegressor
    3. LGBMRegressor
    4. GradientBoostingRegressor
    5. RandomForestRegressor
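
Adjusted R-squared and R-squared coincide at two decimals in this table because the test set is large relative to the number of features. A quick check, assuming the standard adjustment formula and using the test-set shape printed above (3655 rows, 21 features) together with the rounded R-squared of 0.88:

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Standard adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Inputs taken from the notebook output: test set of 3655 rows, 21 features
adj = adjusted_r2(0.88, n_samples=3655, n_features=21)
print(round(adj, 4))
```

The penalty factor (n - 1) / (n - p - 1) is about 1.006 here, so the adjustment moves the score by less than 0.001 and disappears after rounding.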

## Experiment 2:

In [71]:
df2.replace([np.inf, -np.inf], np.nan, inplace=True)
df2.dropna(inplace=True)

In [72]:
A = df2.drop(columns=['Price'])
b = df2.Price

In [73]:
A_train, A_test, b_train, b_test = train_test_split(A, b, random_state=5, test_size=0.25)

# Check that the data was split correctly
print("Training set - Features: ", A_train.shape, "Target: ", b_train.shape)
print("Test set - Features: ", A_test.shape, "Target: ",b_test.shape)


Training set - Features:  (10965, 17) Target:  (10965,)
Test set - Features:  (3655, 17) Target:  (3655,)


In [74]:
reg = LazyRegressor(verbose=0, ignore_warnings=False, custom_metric=None)
models2 , predictions2 = reg.fit(A_train, A_test, b_train, b_test)

print(models2)

 74%|███████▍  | 31/42 [02:01<00:30,  2.79s/it]

QuantileRegressor model failed to execute
Solver interior-point is not anymore available in SciPy >= 1.11.0.


 98%|█████████▊| 41/42 [02:21<00:01,  1.72s/it]

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.003452 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 1980
[LightGBM] [Info] Number of data points in the train set: 10965, number of used features: 17
[LightGBM] [Info] Start training from score 538441.563429


100%|██████████| 42/42 [02:22<00:00,  3.39s/it]

                               Adjusted R-Squared  R-Squared      RMSE  \
Model                                                                    
HistGradientBoostingRegressor                0.88       0.88 129600.41   
LGBMRegressor                                0.87       0.87 134150.78   
XGBRegressor                                 0.87       0.87 136435.61   
RandomForestRegressor                        0.87       0.87 137128.54   
BaggingRegressor                             0.86       0.86 140333.88   
GradientBoostingRegressor                    0.86       0.86 142018.26   
ExtraTreesRegressor                          0.86       0.86 143215.99   
KNeighborsRegressor                          0.75       0.75 188210.73   
TransformedTargetRegressor                   0.70       0.70 206952.68   
LinearRegression                             0.70       0.70 206952.68   
LassoLarsIC                                  0.70       0.70 206952.68   
Lars                                  




- **Top 5 Models (Path 2):**
    1. HistGradientBoostingRegressor
    2. LGBMRegressor
    3. XGBRegressor
    4. RandomForestRegressor
    5. BaggingRegressor

### Experiment Significance:

Comparing the two experiments (with and without column reduction), the choice of columns had little effect on the `HistGradientBoostingRegressor` for this dataset: RMSE rose from 128699.89 to 129600.41 (about 0.7%) and R-squared stayed at 0.88. This suggests the model is robust to removing these four columns.
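
To put the comparison on one table, the RMSE values for the top three models can be lined up side by side. The sketch below copies the numbers from the two result tables above into small Series; the names `path1`, `path2`, and `comparison` are illustrative only:

```python
import pandas as pd

# RMSE values copied from the Experiment 1 and Experiment 2 outputs
path1 = pd.Series(
    {
        "HistGradientBoostingRegressor": 128699.89,
        "XGBRegressor": 131244.14,
        "LGBMRegressor": 131667.67,
    },
    name="rmse_path1",
)
path2 = pd.Series(
    {
        "HistGradientBoostingRegressor": 129600.41,
        "XGBRegressor": 136435.61,
        "LGBMRegressor": 134150.78,
    },
    name="rmse_path2",
)

# Align on model name and compute the relative change in RMSE
comparison = pd.concat([path1, path2], axis=1)
comparison["pct_change"] = (
    comparison["rmse_path2"] / comparison["rmse_path1"] - 1
) * 100

print(comparison.round(2))
```

In the notebook itself, joining the `RMSE` columns of `models1` and `models2` on their shared model index should give the same layout for all models. Note that the increase is larger for `XGBRegressor` and `LGBMRegressor` than for `HistGradientBoostingRegressor`, so the robustness observation applies most clearly to the latter.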
