# Multi Variable Linear Regression

Building on simple linear regression, we now consider a set of GCSE subject results and try to use them to predict A2 Maths results. This time the data comes in one Excel file, split across two sheets.

We're going to merge them into one data frame linked by Student_ID. Then we can accommodate missing values, because not all students who take GCSEs go on to take A-level Maths. Once the data is in a usable form, we will conduct a multi variable regression analysis to see if we can come up with a predictive model.

We will do this in two ways, using scikit-learn and the statsmodels API, which provides some nice regression analysis outputs. We will then discuss these outputs and what they tell us about the predictive power of our model. From there we can go on to correct for any model misspecification.

Let's get started by loading the data and linking it by Student_ID. We're also going to need all our usual imports.

In [1]:
#this will keep all our graphs in the page
%matplotlib inline
# a few libraries that we will need

import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options


# We need this library to read excel workbooks
import openpyxl 
#this allows you to open older versions of excel
import xlrd
from pandas import DataFrame, read_excel, merge

file_name = r"C:\Users\Mrs Farrelly\Data-Manipulation-and-Regression\Multi Variable GCSE and A2 values.xlsx"
table1 = "GCSE Values"
table2 = "Maths A2"
ID = "Student_ID"

#lets change the excel files to data frames
df_GCSE = pd.read_excel(file_name, sheet_name = table1, header = 0)

df_Maths_A2 = pd.read_excel(file_name, sheet_name = table2, header = 0)

#Lets have a look at the first 5 rows and see if it worked
df_GCSE.head()


Unnamed: 0,Student _ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8,6.0,6.0,,,,,,7.0
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8,,,,,,7.0,7.0,
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7,6.0,,,,,7.0,7.0,
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,


In [2]:
#Let's just make sure the header names are the same
#(as written, both renames map a name to itself, so nothing changes yet)
df_GCSE.rename(columns = {'Student_ID':'Student_ID'}, inplace = True)
df_Maths_A2.rename(columns = {'Student_ID':'Student_ID'}, inplace = True)
df_GCSE.head()

    

Unnamed: 0,Student _ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8,6.0,6.0,,,,,,7.0
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8,,,,,,7.0,7.0,
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7,6.0,,,,,7.0,7.0,
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8,,8.0,,8.0,,,,


In [3]:
df_Maths_A2.head()

Unnamed: 0,Student_ID,1,C1,C2,C3,C4,M1,S1,Year,Total
0,7515,7515,97,93,80,76,88,67,2011,501
1,8305,8305,96,100,98,67,92,83,2011,536
2,7519,7519,92,84,74,58,76,75,2011,459
3,7311,7311,77,66,28,35,48,26,2011,280
4,7521,7521,91,88,70,45,51,78,2011,423


In [4]:
#Lets merge the data frames
#First we're going to set the index names to use join to combine our dataframes
#df_GCSE.set_index('Student_ID', inplace = True)
#df_Maths_A2.set_index('Student_ID', inplace = True)

#now lets use the join method 
#df_combined = df_GCSE.join(df_Maths_A2)

#lets take a look
#df_combined.head()

#how = "left" keeps every row of df_GCSE, how = "right" keeps every row of df_Maths_A2
#how = "outer" keeps all rows from both and fills the gaps with NaN, how = "inner" keeps only the intersection
df_combined = pd.merge(df_GCSE,df_Maths_A2, left_on = "Student_ID", right_on = "Student_ID", how = "right")



KeyError: 'Student_ID'

It's worth leaving this in to see how we deal with a KeyError. Let's try listing the column names to see if we can spot a difference in the column headers.

In [5]:
list(df_GCSE.columns.values)

['Student _ID',
 'Gender',
 'Arabic',
 'Art',
 'Astronomy',
 'Biology',
 'Chemistry',
 'Chinese',
 'Classical Civilisation',
 'Design & Technology',
 'Design & Technology Textiles',
 'Design Graphics',
 'Design With Resistant Materials',
 'Drama',
 'Dutch',
 'Electronics',
 'English',
 'English A Tier H',
 'English Language',
 'English Literature',
 'French',
 'Further Mathematics',
 'Geography',
 'German',
 'Greek',
 'History',
 'I.C.T.',
 'Italian',
 'Japanese',
 'Latin',
 'Mathematics',
 'Music',
 'Physics',
 'Portuguese',
 'Religious Studies',
 'Russian',
 'Science',
 'Science.',
 'Spanish']

In [6]:
list(df_Maths_A2.columns.values)

['Student_ID', 1, 'C1', 'C2', 'C3', 'C4', 'M1', 'S1', 'Year', 'Total']

We can see here that the first list has 'Student _ID', with a space before the underscore, so the key 'Student_ID' is not found in df_GCSE. Let's do a rename to fix the problem with df_GCSE.

In [7]:
#notice the space this time in the first name
df_GCSE.rename(columns = {'Student _ID':'Student_ID'}, inplace = True)
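
Near-miss column names like this can also be found programmatically. The following is a minimal sketch, assuming a toy frame with only a few of the real column names. The helper name `suggest_column` is hypothetical, and it relies on the standard library's `difflib.get_close_matches` to suggest the closest existing header.

```python
import difflib
import pandas as pd

# Toy stand-in for df_GCSE with the same stray space in the ID header
toy_gcse = pd.DataFrame(
    {"Student _ID": [366, 375], "Gender": ["F", "F"], "Art": [None, 7.0]}
)

def suggest_column(df, wanted):
    """Return the wanted name if present, else the closest existing column name (or None)."""
    columns = [str(c) for c in df.columns]
    if wanted in columns:
        return wanted
    matches = difflib.get_close_matches(wanted, columns, n=1)
    return matches[0] if matches else None

closest = suggest_column(toy_gcse, "Student_ID")
print(closest)

# Rename the near miss so that both sheets share the same key
toy_gcse = toy_gcse.rename(columns={closest: "Student_ID"})
print(list(toy_gcse.columns))
```

Checking the suggestion by eye before renaming is still sensible, since a close match is not always the intended column.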

In [8]:
list(df_GCSE.columns.values)


['Student_ID',
 'Gender',
 'Arabic',
 'Art',
 'Astronomy',
 'Biology',
 'Chemistry',
 'Chinese',
 'Classical Civilisation',
 'Design & Technology',
 'Design & Technology Textiles',
 'Design Graphics',
 'Design With Resistant Materials',
 'Drama',
 'Dutch',
 'Electronics',
 'English',
 'English A Tier H',
 'English Language',
 'English Literature',
 'French',
 'Further Mathematics',
 'Geography',
 'German',
 'Greek',
 'History',
 'I.C.T.',
 'Italian',
 'Japanese',
 'Latin',
 'Mathematics',
 'Music',
 'Physics',
 'Portuguese',
 'Religious Studies',
 'Russian',
 'Science',
 'Science.',
 'Spanish']

Now that the space is gone, we should be able to merge the dataframes successfully.

In [9]:
df_combined = pd.merge(df_GCSE,df_Maths_A2, on = "Student_ID", how = "right")

#lets have a look
df_combined.head()

Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576
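
To make the `how` options concrete, here is a small sketch on two made-up frames (the IDs and scores are invented for illustration). It counts the rows each join type produces and uses the `indicator=True` option of `pd.merge` to show where each row came from.

```python
import pandas as pd

# Invented data: students 2 and 3 appear in both frames
toy_gcse = pd.DataFrame({"Student_ID": [1, 2, 3], "Mathematics": [8, 7, 6]})
toy_a2 = pd.DataFrame({"Student_ID": [2, 3, 4], "Total": [450, 300, 520]})

# Number of rows produced by each join type
row_counts = {}
for how in ["left", "right", "inner", "outer"]:
    merged = pd.merge(toy_gcse, toy_a2, on="Student_ID", how=how)
    row_counts[how] = len(merged)
print(row_counts)

# A right merge keeps every A2 row; students without GCSE data get NaN
right = pd.merge(toy_gcse, toy_a2, on="Student_ID", how="right", indicator=True)
print(right)
```

In the right merge, student 4 has an A2 total but no GCSE row, so the GCSE columns are NaN for that student. The same effect would explain a lower count for a GCSE column than for the A2 columns in a merged frame.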


In [10]:
df_combined.shape


(121, 48)

In [11]:
#lets take a look at our data
df_combined.describe()

Unnamed: 0,Student_ID,Mathematics,1,C1,C2,C3,C4,M1,S1,Year,Total
count,121.0,79.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0,121.0
mean,4900.917355,7.772152,4900.917355,83.198347,83.396694,72.330579,63.677686,73.487603,74.760331,2012.619835,450.85124
std,3706.402082,0.451475,3706.402082,21.97067,18.652292,23.270292,25.911778,20.752155,19.465793,1.45061,111.15827
min,366.0,6.0,366.0,0.0,0.0,8.0,0.0,13.0,8.0,2011.0,121.0
25%,1511.0,8.0,1511.0,80.0,79.0,62.0,45.0,60.0,67.0,2011.0,397.0
50%,7356.0,8.0,7356.0,91.0,89.0,78.0,70.0,77.0,77.0,2013.0,483.0
75%,8386.0,8.0,8386.0,96.0,97.0,92.0,83.0,90.0,90.0,2014.0,539.0
max,9622.0,8.0,9622.0,100.0,100.0,100.0,100.0,100.0,100.0,2015.0,599.0


Okay, so we have 121 rows and 48 columns. What we notice is that many subject columns are only partly filled, because each pupil took some subjects and not others. For example, student 375 was the only one of the first five students who took Art.

What we'll do is combine several subjects to create a more usable data set, and then create a new dataframe which holds these combinations.

To do this we need to know a certain amount about the specific sector, but what we will do is make the following combinations by taking the mean of the columns where they exist:

English, English A Tier H, English Language and English Literature will just be ENG.
Spanish, Russian, Italian etc. will just be MFL.
Biology, Chemistry, Physics and the two Science columns will just be SCI.
History, Geography, Greek, Drama, Religious Studies and Design Graphics will just be HGGRD.
We will also create a dummy variable for gender by coding Female as 1 and Male as 0.
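
"Taking the mean of the columns where they exist" works because a row-wise `mean(axis=1)` skips missing values by default. A minimal sketch on invented grades, assuming the columns are already numeric:

```python
import numpy as np
import pandas as pd

# Invented grades: the second pupil took one English subject, the third took none
toy = pd.DataFrame({
    "English Language": [7.0, 6.0, np.nan],
    "English Literature": [8.0, np.nan, np.nan],
})

# Row-wise mean; NaN entries are skipped, and a row with no grades stays NaN
toy["ENG"] = toy[["English Language", "English Literature"]].mean(axis=1)
print(toy)
```

A pupil with a single grade in the group simply keeps that grade, while a pupil with no grades in the group gets NaN for the combined column.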

In [12]:
#df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
#Lets create our new variables
ENG = df_combined[['English', 'English A Tier H', 'English Language',
 'English Literature']].mean(axis = 1)
df_combined['ENG'] = ENG

MFL = df_combined[['Russian','Spanish','Portuguese','Japanese', 'Italian',
                   'German','French','Dutch',
                   'Chinese','Arabic',]].mean(axis = 1)
df_combined['MFL'] = MFL

SCI = df_combined[['Science', 'Science.','Physics','Chemistry','Biology', ]].mean (axis = 1)
df_combined['SCI'] = SCI

HGGRD = df_combined[['History','Geography','Greek','Drama','Religious Studies','Design Graphics',]].mean (axis = 1)
df_combined['HGGRD'] = HGGRD

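#copy Gender into a new column
#note: replace([0,1], ["M","F"]) maps 0 to "M" and 1 to "F", so it leaves the existing "M"/"F" strings untouched
Gender_Value = df_combined[["Gender"]]
df_combined['Gender_Value'] = Gender_Value
df_combined["Gender_Value"].replace([0,1], ["M","F"], inplace = True)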


df_combined.head(10)

Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total,ENG,MFL,SCI,HGGRD,Gender_Value
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7,7,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343,,,,,F
1,375,F,,7.0,,,,,,,,,,,,,,,7,6,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450,,,,,F
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7,8,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490,,,,,F
3,391,M,,,,,,,,,,,,,,,,,6,7,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281,,,,,M
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8,7,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576,,,,,M
5,427,M,,,,7.0,8.0,,,,,,,,,,,,6,7,8.0,,7.0,,6.0,8.0,,,,7.0,8.0,,8.0,,,,,,,427,83,86,85,82,72,94,2014,502,,,,,M
6,429,M,,,,8.0,8.0,,6.0,,,,5.0,,,,,,6,6,,,7.0,,,,,,,,8.0,,8.0,,,,,,8.0,429,96,96,75,81,100,93,2013,541,,,,,M
7,431,M,,5.0,,7.0,8.0,8.0,,,,6.0,,,,,,,5,6,,,,7.0,,,,,,,8.0,7.0,8.0,,,,,,,431,91,96,72,79,87,77,2013,502,,,,,M
8,435,M,,,,7.0,7.0,,,,,,,,,,,,8,7,8.0,,,8.0,,7.0,,,,6.0,7.0,6.0,7.0,,,,,,,435,60,37,47,22,53,52,2014,271,,,,,M
9,444,M,,4.0,,,,,,,,,,,,,,,5,5,6.0,,,,,8.0,,,,,7.0,,,,6.0,,7.0,6.0,,444,33,28,30,5,20,15,2014,131,,,,,M


Okay, so why are we getting NaN values for our new columns? The clue came earlier when we checked df_combined.describe(): most of the subject columns were missing from its output, which only summarises numeric columns. So let's check the data types.

In [13]:
print (df_combined.dtypes)

Student_ID                           int64
Gender                              object
Arabic                              object
Art                                 object
Astronomy                           object
Biology                             object
Chemistry                           object
Chinese                             object
Classical Civilisation              object
Design & Technology                 object
Design & Technology Textiles        object
Design Graphics                     object
Design With Resistant Materials     object
Drama                               object
Dutch                               object
Electronics                         object
English                             object
English A Tier H                    object
English Language                    object
English Literature                  object
French                              object
Further Mathematics                 object
Geography                           object
German     

So loads of them are simply objects. We can try to force them to a numeric type, which might work; otherwise we're going to have to do something a bit more cunning.

In [14]:
df_combined.astype("float64")

ValueError: could not convert string to float: 'M'

Okay, so that hasn't worked, because the Gender column holds strings such as 'M'. Let's try this.

In [15]:
df_combined = df_combined.convert_objects(convert_numeric=True)

For all other conversions use the data-type specific converters pd.to_datetime, pd.to_timedelta and pd.to_numeric.
  """Entry point for launching an IPython kernel.


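So I'm getting this message, which is a deprecation warning rather than an error. It means my code isn't super clean (convert_objects is deprecated and has been removed from later pandas releases, where pd.to_numeric is the suggested replacement), but it still might have worked. Let's find out.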
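
On a pandas version without `convert_objects`, the same conversion can be done with `pd.to_numeric`, as the warning text suggests. This is a minimal sketch on a toy frame, assuming only the subject columns should become numeric; the column list `subject_cols` is chosen for illustration and `errors="coerce"` turns anything unparseable into NaN.

```python
import pandas as pd

# Toy frame whose columns are all stored as generic objects, like df_combined
toy = pd.DataFrame(
    {"Gender": ["F", "M"], "English": ["7", 6], "French": [None, "8"]},
    dtype=object,
)

# Convert only the subject columns; Gender keeps its 'F'/'M' strings
subject_cols = ["English", "French"]
for col in subject_cols:
    toy[col] = pd.to_numeric(toy[col], errors="coerce")

print(toy.dtypes)

# With numeric columns the row-wise mean now works and skips missing grades
toy["AVG"] = toy[subject_cols].mean(axis=1)
print(toy)
```

Converting column by column leaves text columns such as Gender alone, which avoids the `could not convert string to float: 'M'` error that a blanket `astype("float64")` raises.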

In [16]:
#If I now re-run this code, most of it should work
#df['avg'] = df[['Monday', 'Tuesday']].mean(axis=1)
#Lets create our new variables
ENG = df_combined[['English', 'English A Tier H', 'English Language',
 'English Literature']].mean(axis = 1)
df_combined['ENG'] = ENG

MFL = df_combined[['Russian','Spanish','Portuguese','Japanese', 'Italian',
                   'German','French','Dutch',
                   'Chinese','Arabic',]].mean(axis = 1)
df_combined['MFL'] = MFL

SCI = df_combined[['Science', 'Science.','Physics','Chemistry','Biology', ]].mean (axis = 1)
df_combined['SCI'] = SCI

HGGRD = df_combined[['History','Geography','Greek','Drama','Religious Studies','Design Graphics',]].mean (axis = 1)
df_combined['HGGRD'] = HGGRD

Gender_Value = df_combined[["Gender"]]
df_combined['Gender_Value'] = Gender_Value
df_combined["Gender_Value"].replace([0,1], ["M","F"], inplace = True)


df_combined.head(10)

Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total,ENG,MFL,SCI,HGGRD,Gender_Value
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7.0,7.0,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343,7.0,7.0,7.0,6.0,F
1,375,F,,7.0,,,,,,,,,,,,,,,7.0,6.0,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450,6.5,7.0,7.0,7.0,F
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7.0,8.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490,7.5,8.0,8.0,8.0,F
3,391,M,,,,,,,,,,,,,,,,,6.0,7.0,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281,6.5,7.0,7.0,6.0,M
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8.0,7.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576,7.5,8.0,8.0,8.0,M
5,427,M,,,,7.0,8.0,,,,,,,,,,,,6.0,7.0,8.0,,7.0,,6.0,8.0,,,,7.0,8.0,,8.0,,,,,,,427,83,86,85,82,72,94,2014,502,6.5,8.0,7.666667,7.0,M
6,429,M,,,,8.0,8.0,,6.0,,,,5.0,,,,,,6.0,6.0,,,7.0,,,,,,,,8.0,,8.0,,,,,,8.0,429,96,96,75,81,100,93,2013,541,6.0,8.0,8.0,7.0,M
7,431,M,,5.0,,7.0,8.0,8.0,,,,6.0,,,,,,,5.0,6.0,,,,7.0,,,,,,,8.0,7.0,8.0,,,,,,,431,91,96,72,79,87,77,2013,502,5.5,7.5,7.666667,6.0,M
8,435,M,,,,7.0,7.0,,,,,,,,,,,,8.0,7.0,8.0,,,8.0,,7.0,,,,6.0,7.0,6.0,7.0,,,,,,,435,60,37,47,22,53,52,2014,271,7.5,8.0,7.0,7.0,M
9,444,M,,4.0,,,,,,,,,,,,,,,5.0,5.0,6.0,,,,,8.0,,,,,7.0,,,,6.0,,7.0,6.0,,444,33,28,30,5,20,15,2014,131,5.0,6.0,6.5,7.0,M


Okay, well, my new columns are now floats, which is good. I just need to fix Gender_Value, then I'm ready to make my new dataframe.

In [17]:
#I tried this apply method but it just returned None for all my values,
#most likely because the dictionary keys (1 and 0) never match the 'F'/'M' strings in the column
#data['sex'] = data['sex'].apply({1:'Male', 0:'Female'}.get)
#Let's try something else...you can see the results below

In [18]:
#let's try this instead; the direction of the mapping is super important: keys are the old values, values are the new ones
replacements = {"F": 1, "M":0}

df_combined['Gender_Value'].replace(replacements, inplace=True)

df_combined.head()


Unnamed: 0,Student_ID,Gender,Arabic,Art,Astronomy,Biology,Chemistry,Chinese,Classical Civilisation,Design & Technology,Design & Technology Textiles,Design Graphics,Design With Resistant Materials,Drama,Dutch,Electronics,English,English A Tier H,English Language,English Literature,French,Further Mathematics,Geography,German,Greek,History,I.C.T.,Italian,Japanese,Latin,Mathematics,Music,Physics,Portuguese,Religious Studies,Russian,Science,Science.,Spanish,1,C1,C2,C3,C4,M1,S1,Year,Total,ENG,MFL,SCI,HGGRD,Gender_Value
0,366,F,,,,8.0,7.0,,6.0,,,,,,,,,,7.0,7.0,,,6.0,,,,,,,,8.0,6.0,6.0,,,,,,7.0,366,67,76,42,52,48,58,2014,343,7.0,7.0,7.0,6.0,1.0
1,375,F,,7.0,,,,,,,,,,,,,,,7.0,6.0,7.0,,,,,7.0,,,,7.0,8.0,,,,,,7.0,7.0,,375,83,100,58,63,72,74,2014,450,6.5,7.0,7.0,7.0,1.0
2,381,F,,,,8.0,8.0,,,,,,,,,,,,7.0,8.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,381,90,96,80,63,82,79,2014,490,7.5,8.0,8.0,8.0,1.0
3,391,M,,,,,,,,,,,,,,,,,6.0,7.0,,,,7.0,,6.0,,,,5.0,7.0,6.0,,,,,7.0,7.0,,391,73,58,28,23,47,52,2014,281,6.5,7.0,7.0,6.0,0.0
4,399,M,,,8.0,8.0,8.0,,,,,,,,,,,,8.0,7.0,8.0,,,,8.0,8.0,,,,8.0,8.0,,8.0,,8.0,,,,,399,100,99,100,97,83,97,2013,576,7.5,8.0,8.0,8.0,0.0
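
An alternative to `replace` for this kind of recoding is `Series.map` with a dictionary. The sketch below uses a toy column, not the real data. Any value that is not a key in the dictionary, including a missing gender, comes out as NaN, which is why the result is a float column.

```python
import pandas as pd

# Toy Gender column with one missing entry
toy = pd.DataFrame({"Gender": ["F", "M", None, "F"]})

# Keys are the existing labels, values are the dummy codes
toy["Gender_Value"] = toy["Gender"].map({"F": 1, "M": 0})
print(toy)
```

Because unmatched labels become NaN rather than passing through unchanged, `map` also makes typos in the labels easy to spot.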


Now that this has worked, we can drop the rows where Gender_Value is NaN. These are presumably the A2 students that the right merge kept but who have no matching GCSE record.

In [19]:
df_combined = df_combined.dropna(subset = ["Gender_Value"])

In [20]:
df_combined.shape

(79, 53)
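
The `subset` argument matters here. A short sketch on invented rows shows the difference between dropping rows with a missing Gender_Value and dropping rows with any missing value at all:

```python
import numpy as np
import pandas as pd

# Invented rows: the last student has A2 results but no GCSE record
toy = pd.DataFrame({
    "Student_ID": [366, 375, 7515],
    "Art": [np.nan, 7.0, np.nan],
    "Gender_Value": [1.0, 1.0, np.nan],
    "Total": [343, 450, 501],
})

# Only rows with a missing Gender_Value are removed
only_gender = toy.dropna(subset=["Gender_Value"])
print(only_gender.shape)

# Without subset, any NaN in any column removes the row
any_missing = toy.dropna()
print(any_missing.shape)
```

With so many sparsely filled subject columns, a plain `dropna()` on the real frame would discard nearly every student, so restricting the check to one column is the safer choice.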

In [21]:
#lets try this again
#replacements = {1: "F", 0:"M"}

#df_combined.Gender_Value.replace(replacements, inplace=True)

#df_combined.head()

In [22]:
#alright that didn't work...the apply method is not very economical but i'm out of options
#df_combined['Gender_Value'] = df_combined['Gender_Value'].apply({2:'M', 1:'F'}.get)
#df_combined.head()


Okay, so that also didn't work, and I'm not really sure why. It could be the way I created the column, so let's go back and have a look. Below I've left a bunch of code that I used while trying to find a way of changing the last column to a set of 1s and 0s; there are some interesting methods in there, so I've kept it in.

In [23]:
#lets try recreating the column like this
#Gender_Value = df_combined.Gender
#df_combined.head()

In [24]:
#print (df_combined.dtypes)

In [25]:
#df_combined = df_combined.convert_objects(convert_numeric=True)
#df_combined = df_combined.infer_objects()

In [26]:
#print (df_combined.dtypes.Gender_Value)

In [27]:
#df_combined = df_combined.drop(["Gender_Value"], axis=1)
#df_combined.head()

In [28]:
#I think i'm being stupid and this might work
#Gender_Value = df_combined[["Gender"]]
#df_combined['Gender_Value'] = Gender_Value
#df_combined = df_combined["Gender_Value"].replace([0,1], ["M","F"], inplace = True)

Let's see if we can create a new data frame from the existing one, based on the newly created columns, and then try to manipulate that. We may just have to write a for loop with an if statement if we can't find a replacement method that works for the Gender problem.


In [29]:
df_new = df_combined[['Student_ID',"Mathematics","ENG", "MFL", "SCI","HGGRD","Total","Gender_Value"]]
df_new.describe()

Unnamed: 0,Student_ID,Mathematics,ENG,MFL,SCI,HGGRD,Total,Gender_Value
count,79.0,79.0,79.0,77.0,79.0,79.0,79.0,79.0
mean,4634.240506,7.772152,6.767932,7.380952,7.42616,7.116034,446.316456,0.253165
std,3678.089238,0.451475,0.948615,0.940813,0.666531,0.888897,111.693668,0.437603
min,366.0,6.0,4.0,3.0,5.5,5.0,121.0,0.0
25%,1511.5,8.0,6.0,7.0,7.0,6.5,391.5,0.0
50%,2550.0,8.0,7.0,8.0,7.666667,7.0,484.0,0.0
75%,8362.0,8.0,7.5,8.0,8.0,8.0,532.5,0.5
max,9546.0,8.0,8.0,8.0,8.0,8.0,589.0,1.0
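
With `df_new` in place, the regression step promised in the introduction amounts to fitting Total against the predictor columns. The following is only a rough sketch of that idea on synthetic data: the coefficients (100, 40, 10, -5) are made up, the toy frame is not the real data set, and NumPy's `np.linalg.lstsq` stands in for scikit-learn and statsmodels, so treat it as an illustration of ordinary least squares rather than the analysis itself.

```python
import numpy as np
import pandas as pd

# Synthetic predictors with the same names as some df_new columns (values are invented)
rng = np.random.default_rng(0)
n = 50
toy = pd.DataFrame({
    "Mathematics": rng.integers(6, 9, n),
    "SCI": rng.integers(5, 9, n),
    "Gender_Value": rng.integers(0, 2, n),
})

# Made-up linear relationship with no noise, so the fit should recover it
toy["Total"] = 100 + 40 * toy["Mathematics"] + 10 * toy["SCI"] - 5 * toy["Gender_Value"]

# Design matrix: a column of ones for the intercept plus the predictors
predictors = ["Mathematics", "SCI", "Gender_Value"]
X = np.column_stack([np.ones(n), toy[predictors].to_numpy(dtype=float)])
y = toy["Total"].to_numpy(dtype=float)

# Ordinary least squares: intercept first, then one coefficient per predictor
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(dict(zip(["intercept"] + predictors, coef.round(3))))
```

On real grades the fit will not be exact, and rows with missing predictors (for example the two students without an MFL value) would need to be dropped or filled first.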


Again, below is some useful code for solving various problems.

In [30]:
#lets see if this changes the output of Gender

#df_title = df_new
#df_column = df_new.Gender

#def number_Gender(df_title, df_column):

#for G in df_new["Gender"]:
   # Gen = G.replace("F", 1)
   # df_new["Gender"] = Gen
   
                
#df_new.head()    

Let's try to create a dictionary from a column, apply a function to that dictionary, and then move the new dictionary back into the dataframe.

In [31]:
#I think i'm being stupid and this might work
#Gender_Value = df_new[["Gender"]]
#df_new['Gender_Value'] = Gender_Value
#df_new = df_new["Gender_Value"].replace([0,1], ["M","F"], inplace = True)
#df_new


Here is how to create a dictionary from a dataframe column.

In [None]:
#area_dict = dict(zip(lakes.area, lakes.count))
#Gender_dict = df_new.set_index("Student_ID")["Gender"].to_dict()
#print (Gender_dict)

Okay, let's move through our dictionary and start replacing values. This doesn't work, but I've posted a Stack Overflow question, and if I get the answer I'll put it here.

In [32]:
#Here is some example code
#a1 = {"Green": "Tree", "Red": "Rose", "Yellow": "Sunflower"}
#for color, flower in a1.items():
    #if flower == "Rose":
        #a1[color] = "Tulip"

#for Student_ID, Gender in Gender_dict.items():
    #if Gender == "F":
        #Gender_dict[Gender] = "1"
    #elif Gender == "M":
        #Gender_dict[Gender] = "0"
        
#print (Gender_dict)
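
One likely reason the commented loop above fails is that it assigns to `Gender_dict[Gender]`, which uses the value ("F" or "M") as the key instead of the student's ID. Here is a sketch of a corrected version on a toy frame, assuming Student_ID is unique. It updates the entry under each ID and then maps the dictionary back onto the frame.

```python
import pandas as pd

# Toy frame with two students
toy = pd.DataFrame({"Student_ID": [366, 391], "Gender": ["F", "M"]})

# Dictionary from a column: Student_ID -> Gender
gender_dict = toy.set_index("Student_ID")["Gender"].to_dict()

# Replace the value stored under each ID (the key set itself is unchanged)
for student_id, gender in gender_dict.items():
    if gender == "F":
        gender_dict[student_id] = 1
    elif gender == "M":
        gender_dict[student_id] = 0
print(gender_dict)

# Move the recoded values back into the dataframe
toy["Gender_Value"] = toy["Student_ID"].map(gender_dict)
print(toy)
```

Updating values for existing keys while iterating is fine; it is adding or removing keys during the loop that Python objects to.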


In [33]:
#Here is some example code
#a1 = {"Green": "Tree", "Red": "Rose", "Yellow": "Sunflower"}
#for color, flower in a1.items():
    #if flower == "Rose":
        #a1[color] = "Tulip"
        
#def Values(Gender,GV):
    #if Gender == "F":
       # GV = 1
    #elif Gender == "M":
        #GV = 0
   # return GV


#df_new["Gender_Values"] = df_new["Gender"].apply(Values())

#df_new.head()
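
The commented attempt above passes `Values()` to `apply`, which calls the function immediately with no arguments instead of handing the function itself over. A minimal working sketch on a toy column, with the hypothetical helper `gender_to_value` taking a single argument:

```python
import pandas as pd

# Toy Gender column
toy = pd.DataFrame({"Gender": ["F", "M", "F"]})

def gender_to_value(gender):
    """Code Female as 1 and Male as 0; anything else becomes None."""
    if gender == "F":
        return 1
    elif gender == "M":
        return 0
    return None

# Pass the function object itself (no parentheses); apply calls it once per value
toy["Gender_Value"] = toy["Gender"].apply(gender_to_value)
print(toy)
```

For a simple two-way recoding a dictionary lookup is shorter, but a function like this is handy when the rule needs more than a lookup.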
    

