
### Instructions

---

#### Goal of the Project

This project lets you practise the concepts covered in the following lesson by working through a set of activities:

 1. Multiple linear regression - Introduction
 
 

---

---

### Problem Statement

A real estate company wishes to analyse the prices of properties based on various factors such as area, number of rooms, bathrooms, bedrooms, etc. Build a multiple linear regression model that predicts the sale price of a house from these factors, and evaluate the accuracy of the model.








---

### List of Activities

**Activity 1:** Analysing the Dataset

**Activity 2:** Data Preparation
  
**Activity 3:** Train-Test Split

**Activity 4:**  Model Training

**Activity 5:** Model Prediction and Evaluation







---


#### Activity 1:  Analysing the Dataset

- Create a Pandas DataFrame for the **Housing** dataset using the link below. The dataset consists of the following columns:


|Field|Description|
|---:|:---|
|price|Sale price of a house in INR|
|area|Total size of a property in square feet|
|bedrooms|Number of bedrooms|
|bathrooms|Number of bathrooms|
|stories|Number of storeys excluding basement|
|mainroad|yes, if the house faces a main road|
|guestroom|yes, if the house has a separate living room or a drawing room for guests|
|basement|yes, if the house has a basement|
|hotwaterheating|yes, if the house uses gas for hot water heating|
|airconditioning|yes, if there is central air conditioning|
|parking|Number of cars that can be parked|
|prefarea|yes, if the house is located in the preferred neighbourhood of the city|
|furnishingstatus|Furnishing status of the house: furnished, semi-furnished or unfurnished|


  **Dataset Link:** https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/house-prices.csv

- Print the first five rows of the dataset. Check for null values and treat them accordingly.






In [None]:
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns 
# Load the dataset
# Dataset Link: 'https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/house-prices.csv'
df = pd.read_csv('https://s3-student-datasets-bucket.whjr.online/whitehat-ds-datasets/house-prices.csv')
# Print first five rows using head() function
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,yes,no,no,no,yes,2,yes,furnished
1,12250000,8960,4,4,4,yes,no,no,no,yes,3,no,furnished
2,12250000,9960,3,2,2,yes,no,yes,no,no,2,yes,semi-furnished
3,12215000,7500,4,2,2,yes,no,yes,no,yes,3,yes,furnished
4,11410000,7420,4,1,2,yes,yes,yes,no,yes,2,no,furnished


In [None]:
# Check if there are any null values. If any column has null values, treat them accordingly
df.isnull().sum()

price               0
area                0
bedrooms            0
bathrooms           0
stories             0
mainroad            0
guestroom           0
basement            0
hotwaterheating     0
airconditioning     0
parking             0
prefarea            0
furnishingstatus    0
dtype: int64
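
The null counts above are all zero, so this dataset needs no treatment. If a column did contain gaps, one common approach is to fill numeric columns with the median and categorical columns with the most frequent value. The sketch below illustrates this on a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not part of the housing data); dropping incomplete rows with `dropna()` is an equally valid choice.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with gaps; NOT the housing dataset.
toy = pd.DataFrame({
    "area": [7420.0, np.nan, 9960.0, 7500.0],
    "mainroad": ["yes", "no", None, "yes"],
})

# Count the missing values per column, as in the cell above.
null_counts = toy.isnull().sum()

# Numeric column: fill the gap with the column median.
toy["area"] = toy["area"].fillna(toy["area"].median())

# Categorical column: fill the gap with the most frequent value (the mode).
toy["mainroad"] = toy["mainroad"].fillna(toy["mainroad"].mode()[0])

print(null_counts)
print(toy)
```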

---

#### Activity 2: Data Preparation

Several columns in this dataset hold categorical data, i.e. the values `yes` or `no`. Linear regression needs numerical data, so you need to convert all `yes` and `no` values to 1s and 0s, where
- 1 means `yes`
- 0 means `no`

Similarly, replace

- `unfurnished` with 0
- `semi-furnished` with 1
- `furnished` with 2

**Hint:** To replace all `yes` values with 1 and all `no` values with 0, use the `replace()` function of the DataFrame object.

For example, `df.replace(to_replace="yes", value=1, inplace=True)` replaces the "yes" values in all columns with 1. Pass `inplace=True` to modify the DataFrame in place instead of returning a new one.



In [None]:
# Replace all the non-numeric values with numeric values.
# replace() matches whole cell values, so replacing 'furnished'
# does not affect cells that contain 'semi-furnished'.
df.replace(to_replace='yes', value=1, inplace=True)
df.replace(to_replace='no', value=0, inplace=True)
df.replace(to_replace='unfurnished', value=0, inplace=True)
df.replace(to_replace='semi-furnished', value=1, inplace=True)
df.replace(to_replace='furnished', value=2, inplace=True)
df.head()

Unnamed: 0,price,area,bedrooms,bathrooms,stories,mainroad,guestroom,basement,hotwaterheating,airconditioning,parking,prefarea,furnishingstatus
0,13300000,7420,4,2,3,1,0,0,0,1,2,1,2
1,12250000,8960,4,4,4,1,0,0,0,1,3,0,2
2,12250000,9960,3,2,2,1,0,1,0,0,2,1,1
3,12215000,7500,4,2,2,1,0,1,0,1,3,1,2
4,11410000,7420,4,1,2,1,1,1,0,1,2,0,2
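
`DataFrame.replace()` with a plain string matches whole cell values across every column. An alternative that makes the encoding explicit per column is `Series.map()` with a dictionary. The following is a minimal sketch on a hypothetical two-column DataFrame (`toy`, `yes_no` and `furnishing` are illustrative names, not part of the assignment):

```python
import pandas as pd

# Hypothetical toy rows mimicking two of the housing columns.
toy = pd.DataFrame({
    "mainroad": ["yes", "no", "yes"],
    "furnishingstatus": ["furnished", "semi-furnished", "unfurnished"],
})

# One dictionary per kind of categorical column.
yes_no = {"yes": 1, "no": 0}
furnishing = {"unfurnished": 0, "semi-furnished": 1, "furnished": 2}

# Work on a copy so the original strings stay available for comparison.
encoded = toy.copy()
encoded["mainroad"] = encoded["mainroad"].map(yes_no)
encoded["furnishingstatus"] = encoded["furnishingstatus"].map(furnishing)

print(encoded)
```

One design difference worth knowing: a value missing from the dictionary becomes `NaN` with `map()`, whereas `replace()` leaves unmatched values unchanged.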


---

#### Activity 3: Train-Test Split

You need to predict the house prices based on several factors, so `price` is the target variable and all the other columns are feature variables.

Split the dataset into a training set and a test set such that the training set contains 67% of the instances and the remaining 33% form the test set.

In [None]:
# Split the DataFrame into the training and test sets.
from sklearn.model_selection import train_test_split

# Every column except the target 'price' is a feature.
features = [col for col in df.columns if col != 'price']
x = df[features]
y = df['price']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42)
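
To see what the split produces, the sketch below runs `train_test_split` with the same `test_size` and `random_state` on a hypothetical 100-row DataFrame (the synthetic columns and the price formula are assumptions made purely for illustration). It checks that the two sets are disjoint and that features and target stay aligned by index.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical synthetic data; NOT the housing dataset.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "area": rng.integers(1000, 10000, size=100),
    "bedrooms": rng.integers(1, 6, size=100),
})
toy["price"] = 250 * toy["area"] + 90000 * toy["bedrooms"]

x = toy.drop(columns="price")
y = toy["price"]

# Same proportions as in the activity: roughly 67% train, 33% test.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.33, random_state=42
)

print("Train shape:", x_train.shape, "Test shape:", x_test.shape)

# No row appears in both sets, and x/y rows remain paired.
print("Disjoint:", set(x_train.index).isdisjoint(x_test.index))
print("Aligned:", x_train.index.equals(y_train.index))
```

Fixing `random_state` makes the shuffle reproducible, so rerunning the cell yields the same split.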

---

#### Activity 4: Model Training

Implement multiple linear regression using the `sklearn` module in the following way:

1. Reshape the target variable arrays into two-dimensional arrays by using the `reshape(-1, 1)` function of NumPy arrays.
2. Build the model by importing the `LinearRegression` class and creating an object of this class.
3. Call the `fit()` function on the `LinearRegression` object.
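
Whether the target is passed to `fit()` as a one-dimensional or a two-dimensional array affects the shape of the fitted attributes. The sketch below, on hypothetical synthetic data (`X`, `y` and the coefficient values are assumptions for illustration), fits the same model both ways and compares `coef_` and `intercept_`. The training cell in this activity fits on the one-dimensional `y_train`, which is why `coef_` can be indexed once per feature there.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical noiseless linear data: 50 samples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -2.0, 0.5])

# Fit once with a 1-D target and once with a 2-D (n_samples, 1) target.
model_1d = LinearRegression().fit(X, y)
model_2d = LinearRegression().fit(X, y.reshape(-1, 1))

# 1-D target: coef_ has one entry per feature.
print(model_1d.coef_.shape, np.shape(model_1d.intercept_))

# 2-D target: coef_ gains a leading axis, one row per target column.
print(model_2d.coef_.shape, np.shape(model_2d.intercept_))
```

Both fits recover the same coefficients; only the array shapes differ, so code that loops over `coef_` has to match the shape used at fit time.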

In [None]:
# Create two-dimensional NumPy arrays for the target variable
# (they are used later when computing the evaluation metrics).
from sklearn.linear_model import LinearRegression

y_train_reshaped = y_train.values.reshape(-1, 1)
y_test_reshaped = y_test.values.reshape(-1, 1)

# Build the linear regression model. It is fitted on the one-dimensional
# y_train, so coef_ is one-dimensional with one entry per feature.
lin_reg = LinearRegression()
lin_reg.fit(x_train, y_train)

# Print the value of the intercept
print('Intercept: ', lin_reg.intercept_)

# Print the names of the features along with the values of their corresponding coefficients.
print("\nFeature : Coefficient\n")
for feature, coef in zip(features, lin_reg.coef_):
    print(f'{feature} : {coef}')

Intercept:  -276654.39716309495

Feature : Coefficient

area : 251.3401999267822
bedrooms : 92716.60526930448
bathrooms : 1126479.3774358043
stories : 396248.42774732393
mainroad : 410635.15569710976
guestroom : 320496.71121046523
basement : 484622.2788531308
hotwaterheating : 623047.39290368
airconditioning : 678375.3422621787
parking : 292410.46314066974
prefarea : 524417.2428236585
furnishingstatus : 200615.3570355712
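
Each coefficient is the change in predicted price for a one-unit increase in that feature while the other features are held fixed. For example, the `area` coefficient of about 251.34 means each additional square foot adds roughly INR 251 to the predicted price. A prediction is therefore the intercept plus the sum of each feature value multiplied by its coefficient. The sketch below checks this identity on hypothetical synthetic data (`X`, `y` and the noise level are assumptions for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: 40 samples, 2 features, small Gaussian noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(40, 2))
y = 5.0 + 3.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.1, size=40)

model = LinearRegression().fit(X, y)

# Manual prediction: intercept + (features dot coefficients).
manual = model.intercept_ + X @ model.coef_

# Compare with the library's own predict().
max_diff = np.max(np.abs(manual - model.predict(X)))
print("Largest difference from predict():", max_diff)
```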


---

#### Activity 5: Model Prediction and Evaluation

Predict the values for both training and test sets by calling the `predict()` function on the LinearRegression object. Also, calculate the $R^2$, MSE, RMSE and MAE values to evaluate the accuracy of your model.

In [None]:
# Predict the target variable values for training and test set

y_train_predict = lin_reg.predict(x_train)
y_test_predict = lin_reg.predict(x_test)


In [None]:
# Evaluate the linear regression model using the 'r2_score', 'mean_squared_error' & 'mean_absolute_error' functions of the 'sklearn.metrics' module.
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

print('Train Dataset')
print(f'R square: {r2_score(y_train_reshaped, y_train_predict):.6f}')
print(f'MSE: {mean_squared_error(y_train_reshaped, y_train_predict):.6f}')
print(f'MAE: {mean_absolute_error(y_train_reshaped, y_train_predict):.6f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_train_reshaped, y_train_predict)):.6f}')

print('\nTest Dataset')
print(f'R square: {r2_score(y_test_reshaped, y_test_predict):.6f}')
print(f'MSE: {mean_squared_error(y_test_reshaped, y_test_predict):.6f}')
print(f'MAE: {mean_absolute_error(y_test_reshaped, y_test_predict):.6f}')
print(f'RMSE: {np.sqrt(mean_squared_error(y_test_reshaped, y_test_predict)):.6f}')

Train Dataset
R square: 0.686036
MSE: 971946527815.663696
MAE: 720751.212948
RMSE: 985873.484690

Test Dataset
R square: 0.655707
MSE: 1475542475754.550781
MAE: 906953.790830
RMSE: 1214719.093352
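
For reference, the four metrics can be computed directly from the prediction errors. The sketch below does so with NumPy on a hypothetical four-value example (`y_true` and `y_pred` are made-up numbers, not model output) and compares the results with the `sklearn.metrics` functions. MAE and RMSE are expressed in the unit of the target (INR for this dataset), whereas MSE is in squared units, which explains its much larger magnitude in the output above.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean of squared errors
rmse = np.sqrt(mse)               # square root of MSE
mae = np.mean(np.abs(errors))     # mean of absolute errors

# R^2 = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum(errors ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MSE:  {mse:.3f}  (sklearn: {mean_squared_error(y_true, y_pred):.3f})")
print(f"RMSE: {rmse:.3f}")
print(f"MAE:  {mae:.3f}  (sklearn: {mean_absolute_error(y_true, y_pred):.3f})")
print(f"R^2:  {r2:.3f}  (sklearn: {r2_score(y_true, y_pred):.3f})")
```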


---

---