
**Multiple Linear Regression**
## **Housing Case Study**

#### **Problem Statement:**

Consider a real estate company that has a dataset containing the prices of properties in the Delhi region. It wishes to use the data to optimise the sale prices of the properties based on important factors such as area, bedrooms, parking, etc.

Essentially, the company wants —


- To identify the variables affecting house prices, e.g. area, number of rooms, bathrooms, etc.

- To create a linear model that quantitatively relates house prices with variables such as number of rooms, area, number of bathrooms, etc.

- To know the accuracy of the model, i.e. how well these variables can predict house prices.

**So interpretation is important!**

## Step 1: Reading and Understanding the Data

Let us first import NumPy and pandas and read the housing dataset.

In [41]:
# Importing the essential libraries
import numpy as np
import pandas as pd

In [None]:
from google.colab import files
uploaded = files.upload()

In [None]:
housing = pd.read_csv("Housing.csv")

In [None]:
# Check the first few rows and the columns of the dataset
housing.head()

Inspect the various aspects of the housing dataframe


In [None]:
housing.shape

In [None]:
# info() gives a concise summary of the dataframe: column dtypes and non-null counts
housing.info()

In [None]:
# describe() gives descriptive statistics, by default only for the numeric columns
housing.describe()

## Step 2: Visualising the Data

Let's now spend some time doing what is arguably the most important step - **understanding the data**.
- If there is some obvious multicollinearity going on, this is the first place to catch it
- Here's where you'll also identify if some predictors directly have a strong association with the outcome variable

We'll visualise our data using `matplotlib` and `seaborn`.

In [None]:
#essential libraries for the data visualization
import matplotlib.pyplot as plt
import seaborn as sns

#### Visualising Numeric Variables

Let's make a pairplot of all the numeric variables. `sns.pairplot` skips the non-numeric columns automatically.

In [None]:
sns.pairplot(housing)
plt.show()

#### Visualising Categorical Variables

As you might have noticed, there are a few categorical variables as well. Let's make a boxplot for some of these variables.

In [None]:
plt.figure(figsize=(20, 12))
plt.subplot(2,3,1)
sns.boxplot(x = 'mainroad', y = 'price', data = housing)
plt.subplot(2,3,2)
sns.boxplot(x = 'guestroom', y = 'price', data = housing)
plt.subplot(2,3,3)
sns.boxplot(x = 'basement', y = 'price', data = housing)
plt.subplot(2,3,4)
sns.boxplot(x = 'hotwaterheating', y = 'price', data = housing)
plt.subplot(2,3,5)
sns.boxplot(x = 'airconditioning', y = 'price', data = housing)
plt.subplot(2,3,6)
sns.boxplot(x = 'furnishingstatus', y = 'price', data = housing)
plt.show()

In [None]:
plt.figure(figsize = (10, 5))
sns.boxplot(x = 'furnishingstatus', y = 'price', hue = 'airconditioning', data = housing)
plt.show()

## Step 3: Data Preparation
- You can see that your dataset has many columns with values as 'Yes' or 'No'.

- But in order to fit a regression line, we need numerical values and not strings. Hence, we need to convert them to 1s and 0s, where 1 is a 'Yes' and 0 is a 'No'.

In [None]:
# List of variables to map

varlist =  ['mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'prefarea']

# Defining the map function
def binary_map(x):
    return x.map({'yes': 1, "no": 0})

# Applying the function to the selected columns of the housing dataframe
housing[varlist] = housing[varlist].apply(binary_map)
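
One caveat with `Series.map` and a dictionary: any value that is not a key in the dictionary (for example `'Yes '` with a capital letter or trailing space) silently becomes `NaN`. The sketch below uses a small made-up dataframe, not the housing data, to show one way of normalising the strings first and then checking that nothing was left unmapped.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the housing dataset.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "Yes "],
    "basement": ["no", "no", "yes"],
})

def binary_map(col):
    # Strip whitespace and lower-case so 'Yes ' is treated like 'yes'.
    return col.str.strip().str.lower().map({"yes": 1, "no": 0})

encoded = toy[["mainroad", "basement"]].apply(binary_map)

# Any value outside {'yes', 'no'} would be counted here as NaN.
unmapped = int(encoded.isna().sum().sum())
print(encoded)
print("unmapped values:", unmapped)
```

If `unmapped` is greater than zero, inspect the original column with `value_counts()` before fitting anything.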

In [None]:
# Check the housing dataframe now

housing.head()

### Dummy Variables
The variable `furnishingstatus` has three levels. We need to convert these levels into integers as well.

For this, we will use something called `dummy variables`.

In [None]:
# Get the dummy variables for the feature 'furnishingstatus' and store it in a new variable - 'status'
status = pd.get_dummies(housing['furnishingstatus'])

In [None]:
# Check what the dataset 'status' looks like
status.head()

Now, you don't need three columns. You can drop the `furnished` column, as the type of furnishing can be identified with just the last two columns where —
- `00` will correspond to `furnished`
- `01` will correspond to `unfurnished`
- `10` will correspond to `semi-furnished`
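
The encoding described above can be checked on a tiny example. This is a sketch on a made-up series that has the same three levels as `furnishingstatus`. The `dtype=int` argument is an addition that is not in the notebook; it forces 0/1 output, since recent pandas versions return `True`/`False` by default.

```python
import pandas as pd

# Hypothetical toy column with the three levels found in `furnishingstatus`.
s = pd.Series(
    ["furnished", "semi-furnished", "unfurnished", "furnished"],
    name="furnishingstatus",
)

# drop_first=True removes the alphabetically first level ('furnished'),
# which then becomes the baseline encoded as all zeros.
dummies = pd.get_dummies(s, drop_first=True, dtype=int)

# Show each original label next to its two-column encoding.
print(pd.concat([s, dummies], axis=1))
```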

In [None]:
# Let's drop the first column from status df using 'drop_first = True'

status = pd.get_dummies(housing['furnishingstatus'], drop_first = True)

In [None]:
# Add the results to the original housing dataframe

housing = pd.concat([housing, status], axis = 1)

In [None]:
# Now let's see the head of our dataframe.

housing.head()

In [None]:
# Drop 'furnishingstatus' as we have created the dummies for it

housing.drop(['furnishingstatus'], axis = 1, inplace = True)

In [None]:
housing.info()

In [None]:
# Let's check the correlation coefficients to see which variables are highly correlated

plt.figure(figsize = (16, 10))
sns.heatmap(housing.corr(), annot = True, cmap="YlGnBu")
plt.show()
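
A heatmap with many columns can be hard to scan, so it may help to list the most strongly correlated pairs as well. The sketch below runs on made-up data in which `rooms` is deliberately built to track `area`. The 0.8 threshold is an arbitrary choice for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'rooms' is constructed to follow 'area' closely.
rng = np.random.default_rng(0)
area = rng.uniform(1000, 5000, size=200)
toy = pd.DataFrame({
    "area": area,
    "rooms": area / 1000 + rng.normal(0, 0.2, size=200),
    "parking": rng.integers(0, 3, size=200),
})

corr = toy.corr().abs()

# Keep only the upper triangle so each pair appears once and the diagonal is dropped.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().sort_values(ascending=False)

threshold = 0.8  # arbitrary cut-off chosen for illustration
strong = pairs[pairs > threshold]
print(strong)
```

Pairs of predictors that appear in `strong` are candidates for multicollinearity. Pairs that involve the target column simply indicate useful predictors.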

In [None]:
# As you might have noticed, `area` seems to be the variable most correlated with `price`. Let's see a scatter plot of `area` vs `price`.
plt.figure(figsize=[6,6])
plt.scatter(housing.area, housing.price)
plt.show()

### Rescaling the Features

As you saw in the demonstration for Simple Linear Regression, rescaling the features does not change the fit of an ordinary least-squares model: the predictions and R-squared stay the same. It does change the coefficients, though. Here, except for `area`, all the columns have small integer values, so on the raw scale the coefficient for `area` would be very small compared with the others purely because of its units. That makes the coefficients hard to compare at the time of model evaluation. It is therefore advisable to rescale the variables so that the coefficients are all on a comparable scale. As you know, there are two common ways of rescaling:

1. Min-Max scaling
2. Standardisation (mean-0, sigma-1)

This time, we will use standardisation, via scikit-learn's `StandardScaler`.
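
The two rescaling options listed above can be written out directly. This is a minimal sketch using NumPy on a few made-up `area` values, intended only to show what each transformation does to the numbers.

```python
import numpy as np

# Hypothetical area values, for illustration only.
area = np.array([1500.0, 3000.0, 4500.0, 6000.0])

# Min-max scaling: (x - min) / (max - min), so the result lies in [0, 1].
min_max = (area - area.min()) / (area.max() - area.min())

# Standardisation: (x - mean) / std, so the result has mean 0 and std 1.
# np.std defaults to the population standard deviation (ddof=0),
# which is also what sklearn's StandardScaler uses.
standardised = (area - area.mean()) / area.std()

print(min_max)
print(standardised)
```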

In [None]:
# Standardise the features so they are on a comparable scale (mean = 0, std = 1).
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()

In [None]:
X_scaled = scaler.fit_transform(housing.drop("price", axis=1))

In [None]:
X=X_scaled
y=housing["price"]

In [None]:
X

## Step 4: Splitting the Data into Training and Testing Sets

As you know, the first basic step for regression is performing a train-test split.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
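
Note that the notebook scales the full dataset before splitting, so the test rows contribute to the mean and standard deviation used for scaling. A commonly recommended alternative is to split first and fit the scaler on the training rows only. The sketch below shows that ordering on made-up data; `X_raw` and `y_raw` are hypothetical stand-ins for the unscaled features and the target.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for the unscaled feature matrix and the target.
rng = np.random.default_rng(42)
X_raw = rng.uniform(0, 100, size=(100, 3))
y_raw = X_raw @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, size=100)

# Split first ...
X_train, X_test, y_train, y_test = train_test_split(
    X_raw, y_raw, test_size=0.3, random_state=42
)

# ... then learn the scaling parameters from the training rows only
# and reuse them unchanged on the test rows.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.mean(axis=0).round(6))  # approximately 0 by construction
print(X_test_scaled.mean(axis=0).round(3))   # usually close to 0, but not exactly
```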


## Step 5: Train a Simple Model (Linear Regression)

In [None]:
# The model learns the relationship between the features (X_train) and the price (y_train).
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(X_train, y_train)

In [None]:
#Get results (coefficients and intercept)
print(f"Intercept: {model.intercept_}")
print(f"Coefficients: {model.coef_}")
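
Because `StandardScaler.fit_transform` returns a NumPy array, the printed coefficients carry no feature names, which makes them hard to interpret. One possible approach, sketched here on made-up data with column names borrowed from the housing dataset, is to pair each coefficient with its column name in a pandas `Series`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; 'parking' has no real effect on the toy price.
rng = np.random.default_rng(0)
features = pd.DataFrame({
    "area": rng.uniform(1000, 10000, size=200),
    "bedrooms": rng.integers(1, 6, size=200),
    "parking": rng.integers(0, 4, size=200),
})
price = (500 * features["area"]
         + 100_000 * features["bedrooms"]
         + rng.normal(0, 50_000, size=200))

# fit_transform returns a plain array, so the column names are kept separately.
X_scaled = StandardScaler().fit_transform(features)
model = LinearRegression().fit(X_scaled, price)

# Pair each coefficient with its feature name and sort by absolute size.
coefs = pd.Series(model.coef_, index=features.columns)
coefs = coefs.sort_values(key=abs, ascending=False)
print(coefs)
```

With standardised features, each coefficient is the change in predicted price per one standard deviation of that feature, so the magnitudes can be compared directly.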

In [None]:
# Make predictions on the test set
y_pred = model.predict(X_test)

In [None]:
y_pred

In [None]:
sns.displot(y_pred-y_test, kind='kde')

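A residual distribution that is narrow and centred on zero indicates a good fit. Here the residuals fall roughly between -4 and +4 on the plot's x-axis; because `price` was not rescaled, read that range together with any scale factor shown on the axis.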
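
A density plot gives a visual impression of the errors; a few summary numbers make it easier to compare models. The sketch below uses made-up true and predicted prices, and follows the notebook's sign convention of prediction minus actual.

```python
import numpy as np

# Hypothetical true and predicted prices, for illustration only.
y_true = np.array([4_200_000, 3_500_000, 6_100_000, 2_800_000], dtype=float)
y_hat = np.array([4_000_000, 3_900_000, 5_800_000, 3_000_000], dtype=float)

errors = y_hat - y_true

mean_error = errors.mean()              # near 0 means little systematic bias
rmse = np.sqrt(np.mean(errors ** 2))    # typical error size, in price units
mae = np.mean(np.abs(errors))           # less sensitive to a few large errors

print(f"mean error: {mean_error:,.0f}")
print(f"RMSE: {rmse:,.0f}")
print(f"MAE: {mae:,.0f}")
```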

In [None]:
from sklearn.metrics import r2_score

In [None]:
# Calculate R-squared; r2_score expects the true values first, then the predictions
r2 = r2_score(y_test, y_pred)

In [None]:
r2

In [None]:
# Get n (number of observations) and p (number of features).
# r2 was computed on the test set, so n is the number of test samples.
n = X_test.shape[0]  # Number of test samples
p = X_test.shape[1]  # Number of features

# Calculate Adjusted R-squared
adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(f"R-squared: {r2:.4f}")
print(f"Adjusted R-squared: {adjusted_r2:.4f}")
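
Two details of this calculation are easy to get wrong: R-squared is not symmetric in its arguments (scikit-learn's `r2_score` takes the true values first), and `n` in the adjusted formula should be the number of observations that R-squared was computed on. The sketch below writes both formulas out in NumPy on made-up numbers; the helper names `r2` and `adjusted_r2` are illustrative and not part of any library.

```python
import numpy as np

def r2(y_true, y_pred):
    # R^2 = 1 - SS_res / SS_tot, with SS_tot measured around the mean of y_true.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1 - ss_res / ss_tot

def adjusted_r2(r2_value, n, p):
    # n: observations used to compute r2_value; p: number of predictors.
    return 1 - (1 - r2_value) * (n - 1) / (n - p - 1)

# Hypothetical values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
y_hat = [4.0, 5.0, 7.0, 8.0, 10.0, 12.0]

correct = r2(y_true, y_hat)
swapped = r2(y_hat, y_true)  # arguments in the wrong order give a different value

print(f"R-squared (correct order): {correct:.4f}")
print(f"R-squared (swapped order): {swapped:.4f}")
print(f"Adjusted R-squared (n=6, p=2): {adjusted_r2(correct, n=6, p=2):.4f}")
```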