## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the housing dataset.

In [None]:
# Suppress warnings

import warnings
warnings.filterwarnings('ignore')

In [None]:
import numpy as np
import pandas as pd

In [None]:
housing = pd.read_csv(r'/kaggle/input/housing-simple-regression/Housing.csv')

In [None]:
# Check the head of the dataset
housing.head()

Inspect the shape, column types, and summary statistics of the housing dataframe.

In [None]:
housing.shape

In [None]:
housing.info()

In [None]:
housing.describe()

## Step 2: Visualising the Data

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable
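Both points above can also be checked numerically with a correlation matrix. The following is a minimal sketch on synthetic data; the `demo` frame and its columns (`area`, `rooms`, `parking`, `price`) are made-up stand-ins, not the real housing file.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the housing data (hypothetical values, not the real file)
rng = np.random.default_rng(0)
area = rng.uniform(2000, 10000, size=200)
demo = pd.DataFrame({
    "area": area,
    "rooms": area / 1000 + rng.normal(0, 0.5, size=200),  # built to track area closely
    "parking": rng.integers(0, 4, size=200),               # generated independently of area
})
demo["price"] = 500 * demo["area"] + rng.normal(0, 500_000, size=200)

corr = demo.corr()

# High absolute correlation between two predictors hints at multicollinearity
predictors = corr.drop(index="price", columns="price")
print(predictors.round(2))

# Correlation of each predictor with the outcome variable
print(corr["price"].drop("price").sort_values(ascending=False).round(2))
```

On the real data, the same `housing.corr(numeric_only=True)` style of check would only cover the numeric columns, so the categorical variables still need the plots that follow.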

We'll visualise our data using `matplotlib` and `seaborn`.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables

In [None]:
sns.pairplot(housing)
plt.show()

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a violin plot for some of these variables.

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.violinplot(x = 'mainroad', y = 'price', data = housing)
plt.subplot(2,3,2)
sns.violinplot(x = 'guestroom', y = 'price', data = housing)
plt.subplot(2,3,3)
sns.violinplot(x = 'basement', y = 'price', data = housing)
plt.subplot(2,3,4)
sns.violinplot(x = 'hotwaterheating', y = 'price', data = housing)
plt.subplot(2,3,5)
sns.violinplot(x = 'airconditioning', y = 'price', data = housing)
plt.subplot(2,3,6)
sns.violinplot(x = 'furnishingstatus', y = 'price', data = housing)
plt.show()

We can also visualise two of these categorical features side by side by using the `hue` argument. Below is the plot for `furnishingstatus` with `airconditioning` as the hue.

In [None]:
plt.figure(figsize = (10, 5))
sns.violinplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()
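The grouping that `hue` shows visually can also be summarised as a table. This is a small sketch with invented prices; the `demo` frame is hypothetical and only borrows the column names used above.

```python
import pandas as pd

# Hypothetical rows; the prices are invented for illustration
demo = pd.DataFrame({
    "furnishingstatus": ["furnished", "furnished", "semi-furnished",
                         "semi-furnished", "unfurnished", "unfurnished"],
    "airconditioning": ["yes", "no", "yes", "no", "yes", "no"],
    "price": [9_000_000, 6_000_000, 7_000_000, 5_000_000, 5_500_000, 4_000_000],
})

# Median price for every (furnishingstatus, airconditioning) combination
summary = (demo.groupby(["furnishingstatus", "airconditioning"])["price"]
               .median()
               .unstack())
print(summary)
```

A table like this gives one number per group, while the violin plot also shows how the prices are spread within each group.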

## Step 3: Data Preparation

- You can see that your dataset has many columns with the values 'yes' or 'no'.

- But in order to fit a regression model, we need numerical values and not strings. Hence, we need to convert them to 1s and 0s, where 1 is a 'yes' and 0 is a 'no'.

In [None]:
# List of variables to map

varlist =  ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

# Defining the map function
def binary_map(x):
    return x.map({'yes': 1, "no": 0})

# Applying the function to the columns listed in varlist
housing[varlist] = housing[varlist].apply(binary_map)
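One thing to keep in mind with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes `NaN`. The sketch below uses a hypothetical `demo` column with inconsistent casing and stray whitespace to show how such values could be detected and normalised before mapping.

```python
import pandas as pd

# Hypothetical column with inconsistent casing and a leading space
demo = pd.DataFrame({"mainroad": ["yes", "no", "Yes", " no"]})
mapping = {"yes": 1, "no": 0}

# Values that are not exact dictionary keys are mapped to NaN
raw = demo["mainroad"].map(mapping)
n_unmapped = int(raw.isna().sum())
print("Unmapped values:", n_unmapped)

# Normalise the strings first, then map
cleaned = demo["mainroad"].str.strip().str.lower().map(mapping)
print(cleaned.tolist())
```

For the same reason, running the mapping cell a second time on already converted columns would turn every value into `NaN`, since 1 and 0 are not keys of the dictionary.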

In [None]:
# Check the housing dataframe now

housing.head()

### Dummy Variables

The variable `furnishingstatus` has three levels. We need to convert these levels into numeric values as well.

For this, we will use something called `dummy variables`.

In [None]:
# Get the dummy variables for the feature 'furnishingstatus' and store it in a new variable - 'status'
status = pd.get_dummies(housing['furnishingstatus'])

In [None]:
# Check what the dataset 'status' looks like
status.head()

Now, you don't need three columns. You can drop the `furnished` column, as the type of furnishing can be identified with just the last two columns, where:
- `00` will correspond to `furnished`
- `01` will correspond to `unfurnished`
- `10` will correspond to `semi-furnished`
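The encoding listed above can be reproduced on a tiny example. This is a sketch on a hypothetical four-row series; `dtype=int` is passed only so that the output prints as 0/1 rather than booleans.

```python
import pandas as pd

# Hypothetical series with the three furnishing levels
s = pd.Series(["furnished", "semi-furnished", "unfurnished", "furnished"],
              name="furnishingstatus")

# All three indicator columns, in alphabetical order of the levels
full = pd.get_dummies(s, dtype=int)
print(full)

# drop_first=True removes the first level ('furnished'), which becomes the 0/0 baseline
reduced = pd.get_dummies(s, drop_first=True, dtype=int)
print(reduced)
```

Reading the two remaining columns in the order `semi-furnished`, `unfurnished` gives `00`, `10`, and `01` for the three levels.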

In [None]:
# Let's drop the first column from status df using 'drop_first = True'

status = pd.get_dummies(housing['furnishingstatus'], drop_first = True)

In [None]:
# Add the results to the original housing dataframe

housing = pd.concat([housing, status], axis = 1)

In [None]:
# Now let's see the head of our dataframe.

housing.head()

In [None]:
# Drop 'furnishingstatus' as we have created the dummies for it

housing.drop(['furnishingstatus'], axis = 1, inplace = True)

In [None]:
housing.head()

## Step 4: Splitting the Data into Training and Testing Sets

As you know, the first basic step for regression is performing a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

# Passing random_state fixes the split, so the train and test sets contain the same rows on every run
np.random.seed(0)  # not needed for the split itself, since random_state is passed explicitly
df_train, df_test = train_test_split(housing, train_size = 0.75, random_state = 100)
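To see what `random_state` does, the split can be repeated on a small frame. The sketch below uses a hypothetical 100-row `demo` frame; the column values are arbitrary.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical 100-row frame
demo = pd.DataFrame({"area": np.arange(100), "price": np.arange(100) * 10})

# Two splits with the same random_state select exactly the same rows
train_a, test_a = train_test_split(demo, train_size=0.75, random_state=100)
train_b, test_b = train_test_split(demo, train_size=0.75, random_state=100)

print(train_a.shape, test_a.shape)
print("Same training rows:", train_a.index.equals(train_b.index))
```

The training and test sets never share rows, so any score computed on the test set reflects data the model has not seen.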

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
rf = RandomForestRegressor(random_state=42)

In [None]:
df_train.shape

In [None]:
df_test.shape

In [None]:
df_test.head()

In [None]:
y_train = df_train.pop("price")
X_train = df_train
X_train.shape

In [None]:
y_test = df_test.pop("price")
X_test = df_test
X_test.shape

In [None]:
rf.fit(X_train, y_train)

The first five decision trees in the fitted forest:

In [None]:
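from sklearn import tree

# Plot the first five trees of the fitted forest
for i in range(5):
    sample_tree = rf.estimators_[i]
    fig = plt.figure(figsize=(25, 20))
    # Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index
    _ = tree.plot_tree(sample_tree,
                       feature_names=list(X_train.columns),
                       filled=True)
    plt.show()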

In [None]:
from sklearn.metrics import r2_score

In [None]:
r2_score_rf_train=round(r2_score(y_train, rf.predict(X_train)),2)
print("R-squared Train:",r2_score_rf_train)

In [None]:
r2_score_rf_test=round(r2_score(y_test, rf.predict(X_test)),2)
print("R-squared Test:",r2_score_rf_test)
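As a reminder of what `r2_score` reports, R-squared is 1 minus the ratio of the residual sum of squares to the total sum of squares. Here is a minimal worked example with made-up numbers, checked against scikit-learn.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares: 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares around the mean: 20.0
r2_manual = 1 - ss_res / ss_tot

print("Manual R-squared: ", r2_manual)
print("sklearn R-squared:", r2_score(y_true, y_pred))
```

When the training R-squared is much higher than the test R-squared, the model is likely overfitting the training data, which is one motivation for tuning the forest's hyperparameters.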

In [None]:
rf.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf.feature_importances_})

In [None]:
imp_df.sort_values(by="Imp", ascending=False)
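To build some intuition for `feature_importances_`, here is a sketch on synthetic data where only one column drives the target. The names `signal`, `noise_a`, and `noise_b` are hypothetical; the importances are impurity-based and sum to 1 across features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: only 'signal' influences the target
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise_a": rng.normal(size=300),
    "noise_b": rng.normal(size=300),
})
y = 5 * X["signal"] + rng.normal(scale=0.1, size=300)

model = RandomForestRegressor(n_estimators=50, random_state=42).fit(X, y)

# Label the importances with the column names and sort them
importances = (pd.Series(model.feature_importances_, index=X.columns)
                 .sort_values(ascending=False))
print(importances.round(3))
print("Sum of importances:", importances.sum())
```

Because the values are relative shares, they are best read as a ranking within one model rather than compared across different models.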

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
params = {
    'max_depth': [2,3,5,10,20],
    'min_samples_leaf': [5,10,20,50,100,200],
    'n_estimators': [10, 25, 50, 100]
}

In [None]:
grid_search = GridSearchCV(estimator=rf,
                           param_grid=params,
                           cv=4,
                           n_jobs=-1, verbose=1)
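Before running the search, it can help to count how much work the grid implies. The sketch below repeats the `params` dictionary from above and uses `ParameterGrid` to count the combinations; the variable names `n_candidates` and `n_fits` are only illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as defined above
params = {
    'max_depth': [2, 3, 5, 10, 20],
    'min_samples_leaf': [5, 10, 20, 50, 100, 200],
    'n_estimators': [10, 25, 50, 100]
}

n_candidates = len(ParameterGrid(params))  # 5 * 6 * 4 combinations
n_fits = n_candidates * 4                  # one fit per candidate per fold with cv=4

print("Candidates:", n_candidates)
print("Cross-validation fits:", n_fits)
```

Since no `scoring` argument is passed, `GridSearchCV` falls back on the estimator's own `score` method, which for a regressor is R-squared, so `best_score_` is a mean cross-validated R-squared.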

In [None]:
%%time
grid_search.fit(X_train, y_train)

In [None]:
grid_search.best_score_

In [None]:
rf_best = grid_search.best_estimator_
rf_best

In [None]:
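from sklearn import tree

# Plot the first five trees of the tuned forest
for i in range(5):
    sample_tree = rf_best.estimators_[i]
    fig = plt.figure(figsize=(25, 20))
    # Pass the feature names as a plain list; some scikit-learn versions reject a pandas Index
    _ = tree.plot_tree(sample_tree,
                       feature_names=list(X_train.columns),
                       filled=True)
    plt.show()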

In [None]:
r2_score_rf_train=round(r2_score(y_train,rf_best.predict(X_train)),2)
print("R-squared Train:",r2_score_rf_train)

In [None]:
r2_score_rf_test=round(r2_score(y_test, rf_best.predict(X_test)),2)
print("R-squared Test:",r2_score_rf_test)

In [None]:
rf_best.feature_importances_

In [None]:
imp_df = pd.DataFrame({
    "Varname": X_train.columns,
    "Imp": rf_best.feature_importances_})

In [None]:
imp_df.sort_values(by="Imp", ascending=False)

The following features can be the focus when predicting housing price while keeping the best R-squared values for the train and test sets:
- area
- bathrooms
- airconditioning
- prefarea
- parking
