### **1. Import the required packages**


session link: https://colab.research.google.com/drive/1tHVLaFwti2-CQ9YthCx_dkrMjXH9vCkD?usp=drive_link

In [2]:
#data reading and manipulation packages
import numpy as np
import pandas as pd

#data visualization packages
import matplotlib.pyplot as plt
import seaborn as sns

#machine learning related packages
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

### **2. Reading and Exploring the data (EDA)**

1. Import the dataset.
2. Check the shape, info, and datatypes of the columns.
3. Check for missing values and handle them.
4. Check for duplicates and handle them.
5. Encode the categorical columns.
6. Check for outliers.
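
Every column in the Boston data used here is numeric, so step 5 never comes up in the cells below. As a minimal sketch of what encoding could look like, here is one-hot encoding on a hypothetical toy frame (the `toy` dataframe and its `city` and `price` columns are made up for illustration and are not part of the dataset):

```python
import pandas as pd

# Hypothetical toy data; not part of the Boston dataset
toy = pd.DataFrame({
    "city": ["A", "B", "A", "C"],
    "price": [10.0, 12.5, 9.0, 15.0],
})

# One-hot encode the categorical column.
# drop_first=True drops one dummy per column to avoid a redundant (perfectly collinear) column.
toy_encoded = pd.get_dummies(toy, columns=["city"], drop_first=True)

print(toy_encoded.columns.tolist())
```

Dropping the first dummy is a common choice for linear models, because keeping all dummies together with an intercept makes the columns perfectly collinear.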

In [3]:
# Reading the dataset
data=pd.read_csv("Boston.csv")

In [4]:
data.head() #print the top 5 rows of the data for a quick inspection

Unnamed: 0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
1,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
2,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
3,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
4,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2


- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
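- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- BLACK (B in the original documentation) 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
- LSTAT % lower status of the population
- MEDV median value of owner-occupied homes in $1000s (the target column)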


In [5]:
data.shape  #print the total number of rows and columns present in the data

(506, 14)

In [6]:
data.dtypes  #print the datatype of values in each column

crim       float64
zn         float64
indus      float64
chas         int64
nox        float64
rm         float64
age        float64
dis        float64
rad          int64
tax          int64
ptratio    float64
black      float64
lstat      float64
medv       float64
dtype: object

In [7]:
data.isnull().sum() #print the total number of missing values in each column


crim       0
zn         0
indus      0
chas       0
nox        0
rm         0
age        0
dis        0
rad        0
tax        0
ptratio    0
black      0
lstat      0
medv       0
dtype: int64

In [8]:
#check for duplicates
data.duplicated().sum() #print the number of duplicate rows in the data

0

In [9]:
data.drop_duplicates(inplace = True)  #drop the duplicate rows (if any)

#### **Check and remove Outliers**

In [10]:
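#function to check for outliers and remove them from the given columns
def remove_outliers(data, columns):
    for column in columns:
        if column in data.columns:
            #quartiles are computed on the rows that survived the earlier columns
            Q1 = data[column].quantile(0.25)
            Q3 = data[column].quantile(0.75)
            IQR = Q3 - Q1
            #Tukey fences: values beyond 1.5*IQR from the quartiles count as outliers
            lower_bound = Q1 - 1.5 * IQR
            upper_bound = Q3 + 1.5 * IQR
            #keep only the rows inside the fences (this creates a new dataframe)
            data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return data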

In [11]:
data_without_outlier = remove_outliers(data, data.columns)

In [12]:
data.shape  #original data

(506, 14)

In [13]:
data_without_outlier.shape  #dataframe after outlier removal: we lost a lot of rows (506 -> 214)

(214, 14)
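
To see where the rows went, it can help to count how many values fall outside the IQR fences in each column before filtering anything. The sketch below does this on synthetic data (the `demo` frame and the helper `count_iqr_outliers` are illustrative names, not part of the notebook). It also shows a pitfall that likely applies to `chas`: a 0/1 dummy column that is mostly 0 has an IQR of 0, so every row with a 1 is flagged as an outlier.

```python
import numpy as np
import pandas as pd

def count_iqr_outliers(df):
    """Return, per column, how many values lie outside the 1.5*IQR fences."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Comparisons align the per-column fences with the dataframe columns
    outside = (df < lower) | (df > upper)
    return outside.sum()

# Synthetic demo data: one roughly normal column and one mostly-zero binary column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "normal_col": rng.normal(size=200),
    "binary_col": np.r_[np.zeros(190), np.ones(10)],
})

counts = count_iqr_outliers(demo)
print(counts)
```

Unlike `remove_outliers`, this helper computes the fences for every column on the full data, so the counts do not depend on column order. Running something like it on the input columns only (leaving out binary columns and the target `medv`) is one way to decide which columns are worth filtering.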

There are two ways in which we can deal with outliers:

1. **Outlier Removal (we did this above)**: Remove the rows that contain outliers. The problem with this approach is that we can lose a lot of data, as seen above where 506 rows shrank to 214.

2. **Data Transformation**: Keep every row, but transform the columns that contain outliers so that the extreme values are pulled closer to the rest of the distribution.
  - 2.1. Square root transformation.
  - 2.2. Log Transformation.
  - 2.3. Box-Cox Transformation.
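
The three transformations listed above can be sketched as follows. This is a minimal illustration on synthetic right-skewed data, not on the Boston columns; the `skewed` array is an assumption, and `scipy` is an extra dependency the notebook does not import. Note that Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed, strictly positive data (illustrative only)
rng = np.random.default_rng(42)
skewed = rng.lognormal(mean=0.0, sigma=1.0, size=500)

sqrt_t = np.sqrt(skewed)              # 2.1 square root transformation
log_t = np.log1p(skewed)              # 2.2 log transformation (log(1 + x), safe at x = 0)
boxcox_t, lam = stats.boxcox(skewed)  # 2.3 Box-Cox; lam is the fitted lambda

# Compare skewness before and after each transformation
for name, values in [("raw", skewed), ("sqrt", sqrt_t),
                     ("log1p", log_t), ("box-cox", boxcox_t)]:
    print(f"{name:8s} skewness = {stats.skew(values):.2f}")
```

A skewness closer to 0 means the long tail has been compressed, so fewer points sit far outside the IQR fences. No rows are dropped in the process.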

### **Machine Learning Process**

1. Create X and y variables. Store input columns in X and output column in y.
2. Split the data into training and testing sets.
3. Standardize/Scale the data.
4. Apply the ML Algorithm on the data.
5. Check the performance of the model.
6. Make the necessary improvements to the model to increase its performance/accuracy.
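
Step 3 is not carried out in the cells that follow. As a hedged sketch of how it could look, the example below standardizes synthetic data with scikit-learn's `StandardScaler` (the arrays `X_demo` and `y_demo` are made up for illustration). The scaler is fitted on the training split only, so no information from the test split leaks into it.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for X and y (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(100, 3))
y_demo = rng.normal(size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=100
)

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_tr)  # learn mean/std on the training data only
X_te_scaled = scaler.transform(X_te)      # reuse the training mean/std on the test data

print(X_tr_scaled.mean(axis=0).round(3), X_tr_scaled.std(axis=0).round(3))
```

For plain least-squares linear regression, scaling changes the coefficients but not the predictions. It matters more for regularized or distance-based models.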

In [14]:
#1. Create X and y variables. Store input columns in X and output column in y.
X=data.drop(columns="medv")
y=data["medv"]

In [15]:
#2. Split the data into training and testing sets.
X_train,X_test,y_train,y_test=train_test_split(X,y,random_state=100,test_size=.2)

In [16]:
# apply the linear regression
lin_reg=LinearRegression()
lin_reg.fit(X_train,y_train)

In [17]:
lin_reg.coef_  # slope coefficients m1 to m13, one per input column

array([-8.14896492e-02,  4.80407782e-02, -5.47150249e-03,  3.06260576e+00,
       -1.61368815e+01,  3.67245067e+00, -8.51525259e-03, -1.51740854e+00,
        2.87271007e-01, -1.21207598e-02, -9.24160757e-01,  9.53460812e-03,
       -4.85895548e-01])

In [18]:
lin_reg.intercept_   # intercept (c value)

36.33377028550789
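
The coefficients and the intercept are all the model needs to predict: each prediction is m1*x1 + ... + m13*x13 + c. The sketch below checks this on synthetic data with three features (the names `X_demo`, `y_demo`, `model`, and `manual` are illustrative and not from the notebook):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data generated from a known, noise-free linear rule
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(50, 3))
y_demo = 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1] + 0.5 * X_demo[:, 2] + 4.0

model = LinearRegression().fit(X_demo, y_demo)

# Reproduce predict() by hand: dot product with the coefficients plus the intercept
manual = X_demo @ model.coef_ + model.intercept_

print(np.allclose(manual, model.predict(X_demo)))
print(model.coef_.round(3), round(float(model.intercept_), 3))
```

Because the synthetic target has no noise, the fitted coefficients recover the rule used to generate it, which makes the roles of `coef_` and `intercept_` easy to see.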

In [19]:
y_predit=lin_reg.predict(X_test)

In [20]:
pd.DataFrame({"actual":y_test,"predicted":y_predit})

Unnamed: 0,actual,predicted
198,34.6,34.408110
229,31.5,31.185246
502,20.6,22.312861
31,14.5,17.886139
315,16.2,20.435721
...,...,...
166,50.0,36.185086
401,7.2,18.010970
368,50.0,23.182265
140,14.0,13.772710


#### **Performance metrics used in Regression**

1. **R2 Score** (also called R_squared).
2. **Adjusted R2 Score**.
3. **Mean Squared Error** (MSE).
4. **Root Mean Squared Error** (RMSE).
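
To make the definitions concrete, the four metrics can be computed by hand with NumPy. This is a minimal sketch on a tiny made-up example; the helper `regression_metrics` and the demo values are illustrative, not part of the notebook.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute R2, adjusted R2, MSE and RMSE from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1 - ss_res / ss_tot
    # Adjusted R2 penalizes R2 for the number of input features
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    mse = ss_res / n
    rmse = np.sqrt(mse)
    return {"r2": r2, "adj_r2": adj_r2, "mse": mse, "rmse": rmse}

# Tiny made-up example: every prediction is off by exactly 0.5
y_true_demo = [3.0, 5.0, 7.0, 9.0]
y_pred_demo = [2.5, 5.5, 6.5, 9.5]
metrics = regression_metrics(y_true_demo, y_pred_demo, n_features=1)
print(metrics)
```

Here the residual sum of squares is 1.0 and the total sum of squares is 20.0, so R2 = 0.95, adjusted R2 = 1 - 0.05 * 3 / 2 = 0.925, MSE = 0.25, and RMSE = 0.5.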

In [22]:
r2_score(y_test, y_predit) #r2_score of the model

0.755503308687131

In [24]:
1 - (1-r2_score(y_test, y_predit))*(len(y_test)-1)/(len(y_test)-X_test.shape[1]-1)  #Adjusted r2_score

0.719384479288639

In [26]:
mse = mean_squared_error(y_test, y_predit)  #mean squared error

In [27]:
mse

23.61699410056358

**MSE is the average of the squared errors our model makes while predicting, so it is expressed in squared units of the target and is not directly interpretable. Instead we take its square root, called RMSE, which is in the same units as the target.**

In [28]:
rmse = np.sqrt(mse) #root mean squared error

In [29]:
rmse

4.8597318959551234

**RMSE** of 4.86 means that predictions from our ML model typically differ from the actual output by about 4.86 units. Since medv is recorded in thousands of dollars, that is an error of roughly $4,860.

### **Checking VIF score for multicollinearity**


In [31]:
from statsmodels.stats.outliers_influence import variance_inflation_factor

In [32]:
#creating a dataframe that will contain all the input column names and their respective VIF scores
vif = pd.DataFrame()
vif['Features'] = X_train.columns
vif['VIF'] = [variance_inflation_factor(X_train.values, i) for i in range(X_train.shape[1])]
vif['VIF'] = vif['VIF'].round(2)
vif = vif.sort_values(by = "VIF", ascending = False)
vif

Unnamed: 0,Features,VIF
10,ptratio,89.27
5,rm,79.8
4,nox,74.55
9,tax,60.42
11,black,22.4
6,age,21.6
7,dis,15.34
2,indus,14.99
8,rad,14.69
12,lstat,11.14


As you can see, many columns have very high VIF scores (a common rule of thumb treats a VIF above 5 or 10 as a sign of multicollinearity). It means our data has a lot of correlated features, which we will have to deal with.
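
One caveat before acting on these numbers: to my knowledge, `variance_inflation_factor` regresses each column on the others exactly as given and does not add an intercept itself. Passing raw, uncentered features can therefore inflate the scores, so it is worth checking this against the statsmodels documentation. As a hedged cross-check, the sketch below computes VIF from its definition, VIF_i = 1 / (1 - R_i^2), using scikit-learn regressions that do include an intercept. The helper `vif_with_intercept` and the synthetic `demo` frame are illustrative assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

def vif_with_intercept(df):
    """VIF per column: regress it on the other columns (with intercept), then 1 / (1 - R^2)."""
    scores = {}
    for col in df.columns:
        others = df.drop(columns=col)
        r2 = LinearRegression().fit(others, df[col]).score(others, df[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores).sort_values(ascending=False)

# Synthetic data: "a_noisy_copy" is nearly a copy of "a", while "b" is independent
rng = np.random.default_rng(0)
a = rng.normal(loc=100, scale=5, size=300)
b = rng.normal(loc=50, scale=5, size=300)
demo = pd.DataFrame({
    "a": a,
    "b": b,
    "a_noisy_copy": a + rng.normal(scale=1.0, size=300),
})

vif_demo = vif_with_intercept(demo)
print(vif_demo.round(2))
```

On the real data, this would be called as `vif_with_intercept(X_train)`. If the resulting scores are much lower than the table above, the missing intercept explains part of the inflation; columns that stay high are the ones to drop or combine.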