# Multiple Linear Regression

In this Jupyter Notebook, we're going to work on a data set that I retrieved from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/student+performance). It contains many features related to students' performance. We'll build a machine learning model that predicts students' final grades based on absences, health, study time, free time and their midterm exam grades.

### How does Multiple Linear Regression work?

Multiple Linear Regression works in exactly the *same* way as simple Linear Regression. You can find the related notebook [here](../Linear%20Regression). Briefly, the primary aim of linear regression is to find the line that best fits the data points. The best line is the one with the smallest overall distance to the data points, usually measured as the sum of squared vertical distances (least squares). With several independent variables, the line becomes a plane or hyperplane. Our equation looks like the following:

<img src="img.png" width="400px" height="400px" align="left"/>


**y:** dependent variable<br>
**B0:** intercept<br>
**B1, B2, B3, ..., Bn:** coefficients<br>
**x1, x2, x3, ..., xn:** independent variables<br>
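
As a minimal sketch of what the equation means in code, the block below builds a tiny noise-free data set and recovers the intercept and coefficients with ordinary least squares. The toy values and the names `B0_true` and `B_true` are assumptions made up for illustration; they are not taken from the student data set.

```python
import numpy as np

# Hypothetical toy data: 6 observations of 2 independent variables (x1, x2)
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0],
              [6.0, 5.0]])

# Assumed "true" intercept and coefficients used to generate y without noise
B0_true = 1.0
B_true = np.array([2.0, -1.0])
y = B0_true + X @ B_true  # y = B0 + B1*x1 + B2*x2

# Prepend a column of ones so the intercept is estimated with the coefficients
A = np.column_stack([np.ones(len(X)), X])

# Least squares: find the parameters that minimize the sum of squared residuals
params, *_ = np.linalg.lstsq(A, y, rcond=None)
B0, B = params[0], params[1:]

# Predictions follow the same equation with the estimated parameters
y_hat = B0 + X @ B

print("Intercept:", B0)
print("Coefficients:", B)
```

Because the toy data has no noise, the estimated parameters should match the assumed ones up to floating-point error. `LinearRegression` in scikit-learn solves the same least-squares problem.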


Our Linear Regression model will find the coefficients and the intercept. Let's import the necessary libraries/packages first.

In [1]:
# Import fundamental libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# We'll use Linear Regression class
from sklearn.linear_model import LinearRegression

# To evaluate our model
from sklearn.model_selection import cross_val_score
from sklearn.metrics import mean_squared_error

# To split our data into training and test sets
from sklearn.model_selection import train_test_split 

# Render the plots inline in the notebook
%matplotlib inline

# Enable Jupyter Notebook's greedy tab completion
%config IPCompleter.greedy=True

# Silence pandas' chained-assignment (SettingWithCopy) warning
pd.options.mode.chained_assignment = None

## 1. First steps

Load the dataset into a DataFrame named `data`.

In [2]:
data = pd.read_csv('../datasets/student-mat.csv', delimiter=';')

## 2. Exploration and Preprocessing

Before building the model, we have to do some exploratory data analysis on the data we're working with.

In [3]:
# Show all columns when displaying a DataFrame
pd.set_option('display.max_columns', None)

# Print first five rows
display(data.head())

# Print the summary statistics
print("\nSummary Statistics\n")
print(data.describe())

# Print the information (info() prints directly and returns None)
print("\nInfo\n")
data.info()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,GP,F,18,U,GT3,A,4,4,at_home,teacher,course,mother,2,2,0,yes,no,no,no,yes,yes,no,no,4,3,4,1,1,3,6,5,6,6
1,GP,F,17,U,GT3,T,1,1,at_home,other,course,father,1,2,0,no,yes,no,no,no,yes,yes,no,5,3,3,1,1,3,4,5,5,6
2,GP,F,15,U,LE3,T,1,1,at_home,other,other,mother,1,2,3,yes,no,yes,no,yes,yes,yes,no,4,3,2,2,3,3,10,7,8,10
3,GP,F,15,U,GT3,T,4,2,health,services,home,mother,1,3,0,no,yes,yes,yes,yes,yes,yes,yes,3,2,2,1,1,5,2,15,14,15
4,GP,F,16,U,GT3,T,3,3,other,other,home,father,1,2,0,no,yes,yes,no,yes,yes,no,no,4,3,2,1,2,5,4,6,10,10



Summary Statistics

              age        Medu        Fedu  traveltime   studytime    failures  \
count  395.000000  395.000000  395.000000  395.000000  395.000000  395.000000   
mean    16.696203    2.749367    2.521519    1.448101    2.035443    0.334177   
std      1.276043    1.094735    1.088201    0.697505    0.839240    0.743651   
min     15.000000    0.000000    0.000000    1.000000    1.000000    0.000000   
25%     16.000000    2.000000    2.000000    1.000000    1.000000    0.000000   
50%     17.000000    3.000000    2.000000    1.000000    2.000000    0.000000   
75%     18.000000    4.000000    3.000000    2.000000    2.000000    0.000000   
max     22.000000    4.000000    4.000000    4.000000    4.000000    3.000000   

           famrel    freetime       goout        Dalc        Walc      health  \
count  395.000000  395.000000  395.000000  395.000000  395.000000  395.000000   
mean     3.944304    3.235443    3.108861    1.481013    2.291139    3.554430   
std   

According to info, there are no missing values to deal with. As we can see, there are lots of features. The G1, G2 and G3 columns are exam grades (G3, the final grade, will be our target). We'll use the preceding exam grades G1 and G2 together with absences, free time, health and study time to predict G3.

In [4]:
# Select the features we need (copy so later edits don't touch the original DataFrame)
X = data[["G1", "G2", "absences", "health", "studytime", "freetime"]].copy()

# Select the dependent variable (target)
y = data[["G3"]]

# Print them out
display(X)
display(y)

Unnamed: 0,G1,G2,absences,health,studytime,freetime
0,5,6,6,3,2,3
1,5,5,4,3,2,3
2,7,8,10,3,2,3
3,15,14,2,5,3,2
4,6,10,4,5,2,3
...,...,...,...,...,...,...
390,9,9,11,4,2,5
391,14,16,3,2,1,4
392,10,8,3,3,1,5
393,11,12,0,5,1,4


Unnamed: 0,G3
0,6
1,6
2,10
3,15
4,10
...,...
390,9
391,16
392,7
393,10


Let's check the correlation matrix by using the pandas corr() method. By default, corr() computes Pearson correlation coefficients, so the values range between -1 and 1. You can find further information [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.corr.html).

In [5]:
# Calculate the correlation matrix and display
display(X.corr())

Unnamed: 0,G1,G2,absences,health,studytime,freetime
G1,1.0,0.852118,-0.031003,-0.073172,0.160612,0.012613
G2,0.852118,1.0,-0.031777,-0.09772,0.13588,-0.013777
absences,-0.031003,-0.031777,1.0,-0.029937,-0.0627,-0.058078
health,-0.073172,-0.09772,-0.029937,1.0,-0.075616,0.075733
studytime,0.160612,0.13588,-0.0627,-0.075616,1.0,-0.143198
freetime,0.012613,-0.013777,-0.058078,0.075733,-0.143198,1.0
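
To make the Pearson coefficient less of a black box, here is a minimal sketch that computes it by hand and checks it against NumPy. It uses only the G1 and G2 values of the first five students shown earlier, so the result will differ from the 0.852118 reported for the full data set; the helper name `pearson` is an assumption for illustration.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation: covariance divided by the product of the spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = np.sum(a_centered * b_centered)
    denominator = np.sqrt(np.sum(a_centered ** 2) * np.sum(b_centered ** 2))
    return numerator / denominator

# G1 and G2 of the first five rows displayed above (not the whole data set)
g1 = [5, 5, 7, 15, 6]
g2 = [6, 5, 8, 14, 10]

r = pearson(g1, g2)
print("Pearson r (first five rows):", r)

# Cross-check against NumPy's built-in correlation matrix
print("np.corrcoef:", np.corrcoef(g1, g2)[0, 1])
```

The formula is symmetric in its two arguments, which is why the correlation matrix above is symmetric, and a column correlated with itself always gives 1, which explains the diagonal.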


According to the correlation matrix, the only strongly correlated pair of features is G1 and G2 (r ≈ 0.85). Highly correlated predictors (multicollinearity) make the individual coefficients unstable and hard to interpret, so we'll replace the two grades with their mean and use only that.

In [6]:
# Take mean of G1 and G2 into a new column
X['pre_mean'] = X[['G1','G2']].mean(axis=1)

# We no longer need the G1 and G2 columns
X.drop(['G1','G2'], axis=1, inplace=True)

# Display the final DataFrame again
display(X.head())

Unnamed: 0,absences,health,studytime,freetime,pre_mean
0,6,3,2,3,5.5
1,4,3,2,3,5.0
2,10,3,2,3,7.5
3,2,5,3,2,14.5
4,4,5,2,3,8.0
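
One common way to quantify the problem that motivated merging G1 and G2 is the variance inflation factor (VIF): regress each feature on the others and compute 1 / (1 - R^2). This check is not part of the original notebook. The sketch below uses synthetic data, and the names `vif`, `g1`, `g2` and `absences` as well as the generating parameters are assumptions chosen only to mimic two correlated grades and one unrelated feature.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape
    factors = []
    for j in range(n_features):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns (with an intercept)
        A = np.column_stack([np.ones(n_samples), others])
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        residuals = target - A @ beta
        r2 = 1.0 - np.sum(residuals ** 2) / np.sum((target - target.mean()) ** 2)
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic stand-ins: g2 is g1 plus noise, absences is independent of both
rng = np.random.default_rng(0)
g1 = rng.normal(10, 3, size=300)
g2 = g1 + rng.normal(0, 1.5, size=300)
absences = rng.poisson(5, size=300).astype(float)

vif_before = vif(np.column_stack([g1, g2, absences]))
vif_after = vif(np.column_stack([(g1 + g2) / 2, absences]))

print("VIF with g1 and g2 kept separate:", vif_before)
print("VIF after replacing them by their mean:", vif_after)
```

A VIF close to 1 means a feature is nearly uncorrelated with the rest. In this synthetic setup the two grade columns inflate each other's VIF, and averaging them removes the inflation.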


## 3. Build the model

Now we can start to build our regression model. First, split the data into training and test sets.

In [7]:
# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X.values, y.values, test_size=0.2, random_state=34)

Let's build the Linear Regression model.

In [8]:
# Initialize the regressor
lr = LinearRegression()

# Fit the model
lr.fit(X_train, y_train)

# Print the coefficients and the intercept
print("Coefficients:", lr.coef_)
print("Intercept:", lr.intercept_)

Coefficients: [[ 0.02351363  0.03688534 -0.3311641  -0.04250661  1.20053843]]
Intercept: [-1.9469428]
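
To read these numbers, it helps to pair each coefficient with its feature and apply the regression equation by hand. The sketch below copies the printed coefficients and intercept, assumes they follow the column order of X shown earlier (absences, health, studytime, freetime, pre_mean), and predicts the final grade of the first student in the table. Whether that student ended up in the training or the test set is not known here; the row is used only as an illustration.

```python
import numpy as np

# Values copied from the output above; the order is assumed to match X's columns
feature_names = ["absences", "health", "studytime", "freetime", "pre_mean"]
coef = np.array([0.02351363, 0.03688534, -0.3311641, -0.04250661, 1.20053843])
intercept = -1.9469428

# Each coefficient is the expected change in G3 per unit change of that feature,
# holding the other features fixed
for name, b in zip(feature_names, coef):
    print(f"{name:>10}: {b:+.4f}")

# First row of X displayed earlier: absences=6, health=3, studytime=2,
# freetime=3, pre_mean=5.5 (this student's actual G3 is 6)
first_student = np.array([6, 3, 2, 3, 5.5])

# y = B0 + B1*x1 + ... + Bn*xn
prediction = intercept + first_student @ coef
print("Predicted G3:", prediction)
```

The coefficient of pre_mean is by far the largest, so the prediction is driven mostly by the earlier grades, while the remaining features shift it only slightly.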


## 4. Evaluate the model

In [9]:
# Make predictions
y_pred_test = lr.predict(X_test)
y_pred_train = lr.predict(X_train)

# TRAINING ERRORS
print("\nTraining Errors:\n ")

# Mean Squared Error
print("MSE:",mean_squared_error(y_train, y_pred_train))

# Root Mean Squared Error
print("RMSE:",np.sqrt(mean_squared_error(y_train, y_pred_train)))
      
# R^2
print("R^2:",lr.score(X_train, y_train))


# TEST ERRORS
print("\nTest Errors:\n ")

# Mean Squared Error
print("MSE:",mean_squared_error(y_test, y_pred_test))

# Root Mean Squared Error
print("RMSE:",np.sqrt(mean_squared_error(y_test, y_pred_test)))
      
# R^2
print("R^2:",lr.score(X_test, y_test))



Training Errors:
 
MSE: 3.927009553200873
RMSE: 1.9816683761923621
R^2: 0.805196541473623

Test Errors:
 
MSE: 5.671540264453504
RMSE: 2.3814995831310792
R^2: 0.7604942025708831
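
The numbers above come from a single 80/20 split. As a sketch of what the metrics mean and how much they can vary, the block below recomputes MSE, RMSE and R^2 by hand and then runs 5-fold cross-validation with `cross_val_score`, which is imported at the top of the notebook but never called. The data is synthetic: `X_demo`, `y_demo` and the generating coefficients are assumptions for illustration. In the notebook you would pass `X.values` and `y.values.ravel()` instead.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic regression problem (assumed coefficients, Gaussian noise)
rng = np.random.default_rng(34)
X_demo = rng.normal(size=(200, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.5, 0.25]) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=34
)
model = LinearRegression().fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# MSE: mean of the squared residuals
mse = np.mean((y_te - y_hat) ** 2)
# RMSE: square root of MSE, in the same units as the target
rmse = np.sqrt(mse)
# R^2: 1 - (residual sum of squares / total sum of squares)
r2 = 1.0 - np.sum((y_te - y_hat) ** 2) / np.sum((y_te - y_te.mean()) ** 2)

print("MSE:", mse, "RMSE:", rmse, "R^2:", r2)

# 5-fold cross-validation: every sample is used for testing exactly once
cv_r2 = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring="r2")
print("R^2 per fold:", cv_r2)
print("Mean R^2:", cv_r2.mean(), "Std:", cv_r2.std())
```

If the per-fold scores are close to each other, the single-split estimate is probably representative; a large spread would suggest that the reported test score depends heavily on which rows landed in the test set.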


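The model works reasonably well: on the test set it explains about 76% of the variance in final grades (R^2 ≈ 0.76), with an RMSE of about 2.4 grade points on a 0 to 20 scale. The training scores are only slightly better (R^2 ≈ 0.81), so there is no sign of severe overfitting. This is the end of the notebook. I hope it is helpful for understanding the implementation of multiple linear regression.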