# Multi Variable Linear Regression

Extending the Simple Linear Regression example, we're now going to take the grades from a range of subjects sat at GCSE and try to use them to predict A2 Maths results. This time the data comes in one Excel file, but split across two sheets.

We're going to merge them into one data frame linked by Student_ID. Then we can deal with the missing values, because not all students who take GCSEs go on to take A-level Maths. Once the data is in a usable form, we will run a multi-variable regression analysis to see if we can come up with a predictive model.

We will do this in two ways, using scikit-learn and the statsmodels API, which provides some nice regression analysis output. We will then discuss that output and what it says about the predictive power of our model. From there we can go on to correct any model misspecification.

(There is a first version of this file, but it is a little messy, so this is the cleaned-up one. The earlier version has some useful code in it that is worth a look at some point.)

## Combining, cleaning and manipulating 

Let's get started by loading the data and linking it by Student_ID. We're also going to need all our usual imports.

In [1]:
#this will keep all our graphs in the page
%matplotlib inline
# a few libraries that we will need

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options


# We need this library to read excel workbooks
import openpyxl 
#this allows you to open older versions of excel
import xlrd
from pandas import DataFrame, read_excel, merge

file_name = r"C:\Users\Mrs Farrelly\Data-Manipulation-and-Regression\Multi Variable GCSE and A2 values.xlsx"
table1 = "GCSE Values"
table2 = "Maths A2"
ID = "Student_ID"

# Read each sheet of the workbook into its own data frame
df_GCSE = pd.read_excel(file_name, sheet_name = table1, header = 0)

df_Maths_A2 = pd.read_excel(file_name, sheet_name = table2, header = 0)

# Let's have a look at the first 5 rows and see if it worked
df_GCSE.head()


Unnamed: 0,Student _ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8,6.0,6.0,,,,,,7.0
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8,,,,,,7.0,7.0,
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7,6.0,,,,,7.0,7.0,
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,


Let's get a look at the header names...especially for Student_ID which is going to be our key

In [2]:
list(df_GCSE.columns.values)

['Student _ID',
 'Gender',
 'Arabic',
 'Art',
 'Astronomy',
 'Biology',
 'Chemistry',
 'Chinese',
 'Classical Civilisation',
 'Design & Technology',
 'Design & Technology Textiles',
 'Design Graphics',
 'Design With Resistant Materials',
 'Drama',
 'Dutch',
 'Electronics',
 'English',
 'English A Tier H',
 'English Language',
 'English Literature',
 'French',
 'Further Mathematics',
 'Geography',
 'German',
 'Greek',
 'History',
 'I.C.T.',
 'Italian',
 'Japanese',
 'Latin',
 'Mathematics',
 'Music',
 'Physics',
 'Portuguese',
 'Religious Studies',
 'Russian',
 'Science',
 'Science.',
 'Spanish']

In [3]:
list(df_Maths_A2.columns.values)

['Student_ID', 1, 'C1', 'C2', 'C3', 'C4', 'M1', 'S1', 'Year', 'Total']

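We can see here that the key column in the GCSE sheet is called 'Student _ID', with a stray space, while the A2 sheet calls it 'Student_ID'. Because the names don't match, a merge on "Student_ID" would not find the column in df_GCSE. Let's do a rename to fix the problem with df_GCSE.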

In [4]:
#notice the space this time in the first name
df_GCSE.rename(columns = {'Student _ID':'Student_ID'}, inplace = True)
list(df_GCSE.columns.values)

['Student_ID',
 'Gender',
 'Arabic',
 'Art',
 'Astronomy',
 'Biology',
 'Chemistry',
 'Chinese',
 'Classical Civilisation',
 'Design & Technology',
 'Design & Technology Textiles',
 'Design Graphics',
 'Design With Resistant Materials',
 'Drama',
 'Dutch',
 'Electronics',
 'English',
 'English A Tier H',
 'English Language',
 'English Literature',
 'French',
 'Further Mathematics',
 'Geography',
 'German',
 'Greek',
 'History',
 'I.C.T.',
 'Italian',
 'Japanese',
 'Latin',
 'Mathematics',
 'Music',
 'Physics',
 'Portuguese',
 'Religious Studies',
 'Russian',
 'Science',
 'Science.',
 'Spanish']
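A stray space like this is easy to miss when reading a long list by eye. One way to catch it is to compare the column names of the two frames after stripping out whitespace. This is only a minimal sketch on two toy frames: the frames and the helper name `find_near_matches` are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Toy stand-ins for the two sheets; only the column names matter here.
gcse = pd.DataFrame({"Student _ID": [366, 375], "Mathematics": [8, 8]})
a2 = pd.DataFrame({"Student_ID": [366, 375], "Total": [343, 450]})

def find_near_matches(left, right):
    """Return {left_name: right_name} for names that differ only by whitespace."""
    def squash(name):
        # str() copes with non-string headers such as the integer column 1
        return "".join(str(name).split())

    right_lookup = {squash(c): c for c in right.columns}
    return {
        c: right_lookup[squash(c)]
        for c in left.columns
        if squash(c) in right_lookup and c != right_lookup[squash(c)]
    }

renames = find_near_matches(gcse, a2)
print(renames)  # {'Student _ID': 'Student_ID'}

# Apply the suggested renames so both frames share the same key name
gcse = gcse.rename(columns=renames)
```

The dictionary it returns has the same shape as the one passed to `rename` in the cell above, so it can be checked by eye before being applied.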

Now that the space is gone, we should be able to merge the dataframes successfully.

In [9]:
df_combined = pd.merge(df_GCSE,df_Maths_A2, on = "Student_ID", how = "right")

#lets have a look
print (df_combined.shape)
df_combined.head()

(121, 48)


Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576
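The shape of (121, 48) comes from the choice of `how = "right"`: every row of the right-hand frame (the A2 results) is kept, and the GCSE columns are left as NaN wherever no matching Student_ID is found. A small sketch with made-up frames shows the effect. The `indicator=True` argument is a standard `pd.merge` option that was not used in the original cell; it adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# Made-up data: student 1 has no A2 result, student 4 has no GCSE record.
gcse = pd.DataFrame({"Student_ID": [1, 2, 3], "Mathematics": [8, 7, 8]})
a2 = pd.DataFrame({"Student_ID": [2, 3, 4], "Total": [450, 490, 281]})

# how="right" keeps every row of a2, matched or not
merged = pd.merge(gcse, a2, on="Student_ID", how="right", indicator=True)

print(merged)
print(merged["_merge"].value_counts())
```

Student 4 survives the merge with a missing Mathematics grade and is flagged as `right_only`, which is the same pattern that later shows up as missing values in the real data.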


In [10]:
#lets take a look at our data
df_combined.describe()

Unnamed: 0,Student_ID,Mathematics,1,C1,C2,C3,C4,M1,S1,Year,Total
count,121.0,79.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0
mean,4900.917355,7.772152,4900.917355,83.198347,83.396694,72.330579,63.677686,73.487603,74.760331,2012.619835,450.85124
std,3706.402082,0.451475,3706.402082,21.97067,18.652292,23.270292,25.911778,20.752155,19.465793,1.45061,111.15827
min,366.0,6.0,366.0,0.0,0.0,8.0,0.0,13.0,8.0,2011.0,121.0
25%,1511.0,8.0,1511.0,80.0,79.0,62.0,45.0,60.0,67.0,2011.0,397.0
50%,7356.0,8.0,7356.0,91.0,89.0,78.0,70.0,77.0,77.0,2013.0,483.0
75%,8386.0,8.0,8386.0,96.0,97.0,92.0,83.0,90.0,90.0,2014.0,539.0
max,9622.0,8.0,9622.0,100.0,100.0,100.0,100.0,100.0,100.0,2015.0,599.0


We should expect describe to return a summary for far more columns than this, since it only covers the numeric ones. Let's take a look at the data types.

In [11]:
print (df_combined.dtypes)

Student_ID                           int64
Gender                              object
Arabic                              object
Art                                 object
Astronomy                           object
Biology                             object
Chemistry                           object
Chinese                             object
Classical Civilisation              object
Design & Technology                 object
Design & Technology Textiles        object
Design Graphics                     object
Design With Resistant Materials     object
Drama                               object
Dutch                               object
Electronics                         object
English                             object
English A Tier H                    object
English Language                    object
English Literature                  object
French                              object
Further Mathematics                 object
Geography                           object
German     

So loads of them are simply objects. Forcing the whole frame with df_combined.astype("float64") isn't going to work, because columns such as Gender hold text, so let's try convert_objects.

## Hint:

Don't forget to say df_combined = ... if you want it to actually change the dataframe.

In [12]:
df_combined = df_combined.convert_objects(convert_numeric=True)

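For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.


Well, pandas isn't happy, but it still worked. The line above is the tail of a deprecation warning, and it comes from pandas rather than from Python 3. convert_objects was already deprecated when this was run and was removed in pandas 1.0, so on a current install the conversion has to be done with pd.to_numeric instead.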

In [13]:
print (df_combined.dtypes)

Student_ID                           int64
Gender                              object
Arabic                              object
Art                                float64
Astronomy                          float64
Biology                            float64
Chemistry                          float64
Chinese                            float64
Classical Civilisation             float64
Design & Technology                 object
Design & Technology Textiles        object
Design Graphics                    float64
Design With Resistant Materials    float64
Drama                              float64
Dutch                               object
Electronics                         object
English                            float64
English A Tier H                    object
English Language                   float64
English Literature                 float64
French                             float64
Further Mathematics                float64
Geography                          float64
German     
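On a current pandas install, where `convert_objects` no longer exists, a similar result can be had with `pd.to_numeric`. The sketch below is an assumption about how one might do it, not the code that produced the output above. The helper name `to_numeric_where_possible` and the toy frame are made up, and the rule it applies (only keep a conversion that loses no existing values) is my own choice.

```python
import pandas as pd

# Toy frame where every column starts out as object, as in df_combined
raw = pd.DataFrame(
    {
        "Gender": ["F", "M", "F"],
        "Biology": ["8", None, "7"],
        "Art": [None, "7", None],
    },
    dtype=object,
)

def to_numeric_where_possible(frame):
    """Convert each column to numbers unless that would wipe out real values."""
    out = frame.copy()
    for col in out.columns:
        converted = pd.to_numeric(out[col], errors="coerce")
        # errors="coerce" turns anything unparseable into NaN, so only keep
        # the result if no existing value was lost in the process
        if converted.notna().sum() == out[col].notna().sum():
            out[col] = converted
    return out

converted = to_numeric_where_possible(raw)
print(converted.dtypes)
```

Text columns such as Gender are left alone, while the grade columns become floats with NaN for the subjects a student did not take.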

Okay, so we have 121 rows and 48 columns. What we notice is that many columns are only filled in for the pupils who took that subject. For example, student 375 was the only one of the first five students who took Art.

What we'll do is combine several subjects to create a more usable data set, and then create a new dataframe that holds these combinations.

Doing this properly needs a certain amount of knowledge about the sector, but we will make the following combinations by taking the mean of the columns where they exist:

- the English columns (English, English A Tier H, English Language and English Literature) will just be ENG
- Spanish, Russian, Italian and the other languages will just be MFL
- Biology, Chemistry, Physics and the single-award Science columns will just be SCI
- History, Greek, Geography, Religious Studies, Drama and Design Graphics will just be HGGRD

We will also create a dummy variable for gender by coding Female as 1 and Male as 0.

## Hint:

When you are using .replace you need to put the value you want to replace first and what you want to put in it second.  Seems obvious right...
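To make the argument order concrete, here is a tiny sketch on a made-up Series. It also shows `Series.map` with a dictionary as an alternative way to build a 0/1 dummy. That is a different technique from the `.replace` call used in the notebook and is offered only as an option.

```python
import pandas as pd

gender = pd.Series(["F", "M", "F", None])

# replace(to_replace, value): what to look for first, what to put in second
relabelled = gender.replace(["F", "M"], ["Female", "Male"])

# map with a dict builds the dummy directly; missing entries stay missing
dummy = gender.map({"F": 1, "M": 0})

print(relabelled.tolist())
print(dummy.tolist())
```

Because the last entry is missing, the dummy comes back as floats (1.0 and 0.0) rather than integers, which matches the 1.0 and 0.0 values seen in the Gender_Value column of the output.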

In [15]:
#Lets create our new variables
ENG = df_combined[['English', 'English A Tier H', 'English Language',
 'English Literature']].mean(axis = 1)
df_combined['ENG'] = ENG

MFL = df_combined[['Russian','Spanish','Portuguese','Japanese', 'Italian',
                   'German','French','Dutch',
                   'Chinese','Arabic',]].mean(axis = 1)
df_combined['MFL'] = MFL

SCI = df_combined[['Science', 'Science.','Physics','Chemistry','Biology', ]].mean (axis = 1)
df_combined['SCI'] = SCI

HGGRD = df_combined[['History','Geography','Greek','Drama','Religious Studies','Design Graphics',]].mean (axis = 1)
df_combined['HGGRD'] = HGGRD

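# Code gender as a dummy variable: Female = 1, Male = 0.
# Assign the result back instead of using inplace = True on a selected column,
# which newer pandas versions may not write through to df_combined.
df_combined['Gender_Value'] = df_combined['Gender'].replace(["F","M"], [1,0]).astype(float)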

#let take a look
df_combined.head(10)

Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total,ENG,MFL,SCI,HGGRD,Gender_Value
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7.0,7.0,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343,7.0,7.0,7.0,6.0,1.0
1,375,F,,7.0,,,,,,,,,,,,,,,7.0,6.0,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450,6.5,7.0,7.0,7.0,1.0
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7.0,8.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490,7.5,8.0,8.0,8.0,1.0
3,391,M,,,,,,,,,,,,,,,,,6.0,7.0,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281,6.5,7.0,7.0,6.0,0.0
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8.0,7.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576,7.5,8.0,8.0,8.0,0.0
5,427,M,,,,7.0,8.0,,,,,,,,,,,,6.0,7.0,8.0,,7.0,,6.0,8.0,,,,7.0,8.0,,8.0,,,,,,,427,83,86,85,82,72,94,2014,502,6.5,8.0,7.666667,7.0,0.0
6,429,M,,,,8.0,8.0,,6.0,,,,5.0,,,,,,6.0,6.0,,,7.0,,,,,,,,8.0,,8.0,,,,,,8.0,429,96,96,75,81,100,93,2013,541,6.0,8.0,8.0,7.0,0.0
7,431,M,,5.0,,7.0,8.0,8.0,,,,6.0,,,,,,,5.0,6.0,,,,7.0,,,,,,,8.0,7.0,8.0,,,,,,,431,91,96,72,79,87,77,2013,502,5.5,7.5,7.666667,6.0,0.0
8,435,M,,,,7.0,7.0,,,,,,,,,,,,8.0,7.0,8.0,,,8.0,,7.0,,,,6.0,7.0,6.0,7.0,,,,,,,435,60,37,47,22,53,52,2014,271,7.5,8.0,7.0,7.0,0.0
9,444,M,,4.0,,,,,,,,,,,,,,,5.0,5.0,6.0,,,,,8.0,,,,,7.0,,,,6.0,,7.0,6.0,,444,33,28,30,5,20,15,2014,131,5.0,6.0,6.5,7.0,0.0
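The combined columns rely on how `mean(axis = 1)` treats missing values: it averages whatever grades a student has and skips the NaNs, and it only returns NaN when every column in the group is empty. A minimal sketch with made-up grades (the first row uses the same Biology, Chemistry and Physics grades as student 366 above):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "Biology": [8.0, np.nan, np.nan],
        "Chemistry": [7.0, 8.0, np.nan],
        "Physics": [6.0, np.nan, np.nan],
    }
)

# Row-wise mean over the science columns; NaNs are skipped by default
toy["SCI"] = toy[["Biology", "Chemistry", "Physics"]].mean(axis=1)
print(toy)
```

The first student gets the average of three grades, the second gets their single Chemistry grade, and the third, who took none of the subjects, gets NaN. That last case is what produces the missing values we have to deal with next.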


Let's create a new data frame from the existing one, based on the newly created columns, and then work with that.

In [17]:
df_new = df_combined[['Student_ID',"Mathematics","ENG", "MFL", "SCI","HGGRD","Gender_Value","Total"]]
print (df_new.shape)
df_new.describe()

(121, 8)


Unnamed: 0,Student_ID,Mathematics,ENG,MFL,SCI,HGGRD,Gender_Value,Total
count,121.0,79.0,79.0,77.0,79.0,79.0,79.0,121.0
mean,4900.917355,7.772152,6.767932,7.380952,7.42616,7.116034,0.253165,450.85124
std,3706.402082,0.451475,0.948615,0.940813,0.666531,0.888897,0.437603,111.15827
min,366.0,6.0,4.0,3.0,5.5,5.0,0.0,121.0
25%,1511.0,8.0,6.0,7.0,7.0,6.5,0.0,397.0
50%,7356.0,8.0,7.0,8.0,7.666667,7.0,0.0,483.0
75%,8386.0,8.0,7.5,8.0,8.0,8.0,0.5,539.0
max,9622.0,8.0,8.0,8.0,8.0,8.0,1.0,599.0


So, is there anything concerning here? Well, the counts are a problem. Total has 121 values, most of the other columns have only 79, and MFL has just 77. Let's take a look for missing values.

In [18]:
print (df_new.isnull().any())

Student_ID      False
Mathematics      True
ENG              True
MFL              True
SCI              True
HGGRD            True
Gender_Value     True
Total           False
dtype: bool


In [19]:
print (df_new.isnull().sum())

Student_ID       0
Mathematics     42
ENG             42
MFL             44
SCI             42
HGGRD           42
Gender_Value    42
Total            0
dtype: int64


Every other column is missing 42 values and MFL is missing 44, so the simplest fix is to create a new dataframe called df_dropped, drop the NaN values based on the subset MFL, and see if that cures all our problems.

In [23]:
df_dropped = df_new.dropna(subset = ["MFL"])
print (df_dropped.shape)
print (df_dropped.isnull().any())

(77, 8)
Student_ID      False
Mathematics     False
ENG             False
MFL             False
SCI             False
HGGRD           False
Gender_Value    False
Total           False
dtype: bool
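Dropping on MFL alone only works because every row that is missing something else is also missing MFL. That can be checked before dropping rather than after. The sketch below uses a made-up frame and my own variable names, so treat it as one possible check rather than part of the original workflow.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame(
    {
        "ENG": [7.0, 6.5, np.nan, 7.5],
        "MFL": [7.0, np.nan, np.nan, 8.0],
        "Total": [343, 450, 490, 281],
    }
)

# Rows that are missing a value in any column other than MFL
missing_elsewhere = toy.drop(columns="MFL").isnull().any(axis=1)

# True only if every one of those rows is missing MFL as well
covered = bool(toy.loc[missing_elsewhere, "MFL"].isnull().all())
print(covered)

# If covered is True, dropping on MFL alone leaves no NaNs behind
dropped = toy.dropna(subset=["MFL"])
print(dropped.shape)
print(dropped.isnull().any())
```

If `covered` came back False, a plain `dropna()` with no subset would be the safer call, at the cost of possibly dropping more rows.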


Okay, looks like we're good to go.

## Regression analysis

So what we're going to do here is check to see if any of the variation in the total for A2 maths can be explained by the variation in the other variables (excluding student id).

We don't have to sort by Student_ID here because the data frame is already ordered by it, but if it weren't, that would be worth doing. What we want to avoid is data ordered by the result itself: if we later take a block of rows out for training, a frame sorted by Total would give us a sample that is either all high or all low, and that would skew our results.
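If the data were split for training and testing, shuffling the rows first is a common way to avoid that problem. This post does not split the data, so the following is only a sketch on a made-up frame, with an 80/20 split and a fixed `random_state` chosen purely for illustration.

```python
import pandas as pd

# Made-up frame that is sorted by Total, the worst case for a naive split
toy = pd.DataFrame(
    {"Student_ID": list(range(10)), "Total": list(range(100, 600, 50))}
)

# sample(frac=1) returns all rows in random order; random_state makes it repeatable
shuffled = toy.sample(frac=1, random_state=0)

n_train = int(0.8 * len(shuffled))
train = shuffled.iloc[:n_train]
test = shuffled.iloc[n_train:]

print(train["Total"].tolist())
print(test["Total"].tolist())
```

Each student ends up in exactly one of the two sets, and the shuffle means neither set is simply the top or bottom of the sorted frame.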

We'll then have a look at the p-values to see if any of our explanatory variables aren't significant, and drop them accordingly.

There is probably a more systematic way of choosing the best model. For now we'll just use the regression tools in scikit-learn and statsmodels.api.



In [24]:
list(df_dropped.columns.values)

['Student_ID',
 'Mathematics',
 'ENG',
 'MFL',
 'SCI',
 'HGGRD',
 'Gender_Value',
 'Total']

Using statsmodels.api is really easy. This isn't the machine learning way of doing it, as I haven't split the data at all. It's just a nice regression model: all you have to do is define the variables and hit go.

In [25]:
## Without a constant
from sklearn import linear_model
import statsmodels.api as sm

# Explanatory variables (X) and the target (y); no constant is added in this first run
X = df_dropped[['Mathematics', 'ENG', 'MFL', 'SCI', 'HGGRD', 'Gender_Value']]
y = df_dropped['Total']

# Note the argument order: sm.OLS takes (y, X), whereas scikit-learn's fit takes (X, y)
model = sm.OLS(y, X).fit()
#predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,Total,R-squared:,0.969
Model:,OLS,Adj. R-squared:,0.966
Method:,Least Squares,F-statistic:,367.5
Date:,"Fri, 26 Oct 2018",Prob (F-statistic):,2.22e-51
Time:,11:12:22,Log-Likelihood:,-448.44
No. Observations:,77,AIC:,908.9
Df Residuals:,71,BIC:,922.9
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Mathematics,-15.0661,18.869,-0.798,0.427,-52.690,22.558
ENG,34.1935,14.810,2.309,0.024,4.663,63.724
MFL,-5.1012,13.605,-0.375,0.709,-32.228,22.026
SCI,32.2770,25.469,1.267,0.209,-18.507,83.061
HGGRD,20.4771,15.150,1.352,0.181,-9.731,50.685
Gender_Value,-44.0900,24.041,-1.834,0.071,-92.027,3.847

0,1,2,3
Omnibus:,16.554,Durbin-Watson:,2.039
Prob(Omnibus):,0.0,Jarque-Bera (JB):,19.144
Skew:,-1.153,Prob(JB):,6.96e-05
Kurtosis:,3.804,Cond. No.,49.8


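The adjusted R-squared here looks really good at 0.966, but it is not what it seems. Because this model has no constant, statsmodels reports the uncentered R-squared, which measures variation around 0 rather than around the mean of Total. With totals in the hundreds, almost any set of predictors will score highly on that measure, so the 97% should not be read as the share of the variation in A2 results that is explained by the GCSE grades.

It is worth checking whether any of the coefficients on the independent variables are significantly different from 0.

Only ENG is at the 5% level (p = 0.024). Strangely, Mathematics at GCSE doesn't seem to be different from 0 for Maths at A2 (p = 0.427), which seems counterintuitive. Let's try adding an intercept and see if that changes things.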

In [26]:
## With a constant
from sklearn import linear_model
import statsmodels.api as sm

# Same variables as before
X = df_dropped[['Mathematics', 'ENG', 'MFL', 'SCI', 'HGGRD', 'Gender_Value']]
y = df_dropped['Total']

# add_constant adds a column of ones so that OLS estimates an intercept
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
#predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,Total,R-squared:,0.621
Model:,OLS,Adj. R-squared:,0.588
Method:,Least Squares,F-statistic:,19.11
Date:,"Fri, 26 Oct 2018",Prob (F-statistic):,4.76e-13
Time:,11:38:17,Log-Likelihood:,-433.27
No. Observations:,77,AIC:,880.5
Df Residuals:,70,BIC:,896.9
Df Model:,6,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-825.6679,142.021,-5.814,0.000,-1108.921,-542.415
Mathematics,86.4172,23.415,3.691,0.000,39.718,133.116
ENG,28.3702,12.290,2.308,0.024,3.859,52.881
MFL,-0.9432,11.274,-0.084,0.934,-23.429,21.543
SCI,30.9402,21.065,1.469,0.146,-11.074,72.954
HGGRD,27.9782,12.596,2.221,0.030,2.856,53.100
Gender_Value,-49.3247,19.904,-2.478,0.016,-89.022,-9.628

0,1,2,3
Omnibus:,7.363,Durbin-Watson:,2.337
Prob(Omnibus):,0.025,Jarque-Bera (JB):,6.726
Skew:,-0.634,Prob(JB):,0.0346
Kurtosis:,3.699,Cond. No.,293.0


This feels much more like the output that we were expecting.

First of all, the adjusted R^2 is 58.8%, which is more realistic given the amount of time between GCSE results and A2 results. MFL is now the only coefficient that is nowhere near significantly different from 0 (p = 0.934), which feels correct. Interestingly, SCI also misses the 5% level with p = 0.146: if the true SCI coefficient were 0, we would still see a t-statistic at least this large about 14.6% of the time.

Prob (F-statistic), which tests whether all the slope coefficients are jointly 0, is still tiny (4.76e-13).
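It is worth pausing on why R-squared dropped from 0.969 to 0.621 once the constant went in. When a model has no constant, statsmodels reports an uncentered R-squared, which compares the residuals with the variation of y around 0 rather than around its mean. The sketch below uses made-up numbers (nothing from the GCSE data) in which y has nothing to do with x, fits a line through the origin with plain NumPy, and computes both versions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(5, 8, size=100)          # grade-like predictor
y = 400 + rng.normal(0, 50, size=100)    # outcome that does not depend on x

# Least-squares fit through the origin: y is approximated by b * x
b = (x @ y) / (x @ x)
resid = y - b * x
ssr = resid @ resid

# Uncentered: total variation measured around 0
r2_uncentered = 1 - ssr / (y @ y)

# Centered: total variation measured around the mean of y
y_dev = y - y.mean()
r2_centered = 1 - ssr / (y_dev @ y_dev)

print(round(r2_uncentered, 3), round(r2_centered, 3))
```

On data like this the uncentered figure comes out high even though x carries no information about y, while the centered one does not. That is why the 97% from the first model was too good to be true, and why the 62% from the model with a constant is the number to trust.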

Let's drop MFL and see what happens.


In [27]:
## With a constant, MFL dropped
from sklearn import linear_model
import statsmodels.api as sm

# Same variables as before, minus MFL
X = df_dropped[['Mathematics', 'ENG','SCI', 'HGGRD', 'Gender_Value']]
y = df_dropped['Total']

# add_constant adds a column of ones so that OLS estimates an intercept
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
#predictions = model.predict(X) # make the predictions by the model

# Print out the statistics
model.summary()

0,1,2,3
Dep. Variable:,Total,R-squared:,0.621
Model:,OLS,Adj. R-squared:,0.594
Method:,Least Squares,F-statistic:,23.26
Date:,"Fri, 26 Oct 2018",Prob (F-statistic):,9.3e-14
Time:,11:45:15,Log-Likelihood:,-433.28
No. Observations:,77,AIC:,878.6
Df Residuals:,71,BIC:,892.6
Df Model:,5,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
const,-826.4215,140.741,-5.872,0.000,-1107.051,-545.792
Mathematics,86.3824,23.247,3.716,0.000,40.030,132.735
ENG,28.2968,12.172,2.325,0.023,4.026,52.568
SCI,30.1348,18.605,1.620,0.110,-6.963,67.233
HGGRD,28.0625,12.468,2.251,0.027,3.203,52.922
Gender_Value,-49.5126,19.638,-2.521,0.014,-88.669,-10.356

0,1,2,3
Omnibus:,7.21,Durbin-Watson:,2.336
Prob(Omnibus):,0.027,Jarque-Bera (JB):,6.567
Skew:,-0.62,Prob(JB):,0.0375
Kurtosis:,3.715,Cond. No.,261.0


R-squared is unchanged at 0.621, and the adjusted R^2 edges up to 59.4% because we are no longer penalised for carrying a variable that explained nothing. The model is therefore

$$ \text{Total} = -826.4215 + 86.3824\,\text{Mathematics} + 28.2968\,\text{ENG} + 30.1348\,\text{SCI} + 28.0625\,\text{HGGRD} - 49.5126\,\text{Gender\_Value} $$
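To see the equation in action, we can plug in a student by hand. The coefficients below are copied from the summary table above. The student's grades and the helper name `predict_total` are made up for illustration.

```python
# Coefficients copied from the final statsmodels summary
coef = {
    "const": -826.4215,
    "Mathematics": 86.3824,
    "ENG": 28.2968,
    "SCI": 30.1348,
    "HGGRD": 28.0625,
    "Gender_Value": -49.5126,
}

def predict_total(student):
    """Predicted A2 Maths total for a dict of explanatory values."""
    return coef["const"] + sum(coef[name] * value for name, value in student.items())

# Hypothetical male student: 8 in Maths and Science, 7 in English and the humanities
student = {"Mathematics": 8, "ENG": 7, "SCI": 8, "HGGRD": 7, "Gender_Value": 0}
print(round(predict_total(student), 1))  # 500.2

# The same student with one grade lower in Mathematics
lower = dict(student, Mathematics=7)
print(round(predict_total(student) - predict_total(lower), 4))  # 86.3824
```

The difference between the two predictions is exactly the Mathematics coefficient, because everything else is held fixed.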

So basically, holding the other variables fixed, each extra grade in Mathematics at GCSE is associated with a predicted increase of about 86.4 UMS in A2 Maths.

The next post in this series will look at the same process of combining, changing and cleaning our dataframe, but using SQL instead of pandas.