#### Reading Test Scores 
The Programme for International Student Assessment (PISA) is a test given every three years to 15-year-old students from around the world to evaluate their performance in mathematics, reading, and science. 
The datasets pisa2009train.csv and pisa2009test.csv contain information about the demographics and schools for American students taking the exam, derived from 2009 PISA Public-Use Data Files distributed by the United States National Center for Education Statistics (NCES).
Each row in the datasets pisa2009train.csv and pisa2009test.csv represents one student taking the exam. The datasets have the following variables:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
import math

In [2]:
pd.__version__

u'0.18.1'

In [3]:
##Read in data
train=pd.read_csv('pisa2009train.csv')
test=pd.read_csv('pisa2009test.csv')
train.shape

(3663, 24)

In [4]:
test.shape

(1570, 24)

In [5]:
train.head()

Unnamed: 0,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,11,1,,,0.0,,,1.0,,,...,0.0,1.0,0.0,225.0,,1.0,1,1,673.0,476.0
1,11,1,White,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
2,9,1,White,1.0,1.0,1.0,1.0,1.0,1.0,,...,1.0,1.0,0.0,250.0,28.0,1.0,1,0,1233.0,554.81
3,10,0,Black,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
4,10,1,Hispanic,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89


#### 1.2 What is the average reading test score of males and females in the training set?

In [6]:
# average reading score of males 
train.readingScore[train.male==1].mean()

483.53247863247805

In [7]:
#Average female reading score
train.readingScore[train.male==0].mean()

512.94063093244
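
The two filtered means above can also be obtained in a single call with groupby. A minimal sketch on made-up data (the toy frame and its values are assumptions for illustration, not the PISA data):

```python
import pandas as pd

# Toy data standing in for the training set (values are invented).
toy = pd.DataFrame({
    "male": [1, 1, 0, 0],
    "readingScore": [480.0, 500.0, 510.0, 530.0],
})

# One pass over the data: mean reading score for each value of `male`.
mean_by_sex = toy.groupby("male")["readingScore"].mean()
print(mean_by_sex)
```

On the real data, the same pattern would be train.groupby('male').readingScore.mean().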

#### Locate missing Values
Which variables are missing data in at least one observation in the training set? 

In [8]:
train.isnull().sum()

grade                      0
male                       0
raceeth                   35
preschool                 56
expectBachelors           62
motherHS                  97
motherBachelors          397
motherWork                93
fatherHS                 245
fatherBachelors          569
fatherWork               233
selfBornUS                69
motherBornUS              71
fatherBornUS             113
englishAtHome             71
computerForSchoolwork     65
read30MinsADay            34
minutesPerWeekEnglish    186
studentsInEnglish        249
schoolHasLibrary         143
publicSchool               0
urban                      0
schoolSize               162
readingScore               0
dtype: int64

Linear regression cannot be fit on observations with missing data (R's lm drops such rows automatically, whereas statsmodels' OLS does not by default), so we will remove all such observations from the training and testing sets. Later in the course, we will learn about imputation, which deals with missing data by filling in missing values with plausible information.
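
To make the effect of dropping incomplete rows concrete, here is a small sketch on a made-up frame (all values are invented). It also shows that the surviving rows keep their original index labels, which is why the index is reset further down.

```python
import numpy as np
import pandas as pd

# Toy frame with two incomplete rows (values are invented).
toy = pd.DataFrame({
    "grade": [10, 11, 9, 10],
    "motherHS": [1.0, np.nan, 1.0, 0.0],
    "readingScore": [500.0, 520.0, np.nan, 480.0],
})

missing_per_column = toy.isnull().sum()   # count of NaN values per column
complete = toy.dropna()                   # keep only rows with no NaN at all
rows_dropped = len(toy) - len(complete)

print(missing_per_column)
print(rows_dropped)
print(list(complete.index))               # original labels survive the drop
```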

In [9]:
##Drop missing values
train.dropna(inplace=True)
test.dropna(inplace=True)

In [10]:
test.shape

(990, 24)

In [11]:
train.shape

(2414, 24)

In [12]:
train.head(5)

Unnamed: 0,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,fatherBachelors,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
1,11,1,White,0.0,0.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
3,10,0,Black,1.0,1.0,0.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
4,10,1,Hispanic,1.0,0.0,1.0,0.0,1.0,1.0,0.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89
7,10,0,White,1.0,1.0,1.0,0.0,0.0,1.0,0.0,...,1.0,1.0,1.0,300.0,30.0,1.0,1,0,1913.0,439.36
9,10,1,More than one race,1.0,1.0,1.0,1.0,1.0,0.0,0.0,...,1.0,1.0,0.0,294.0,24.0,1.0,1,0,899.0,465.9


In [13]:
## Reset indexing of the df
train.reset_index(inplace=True)
test.reset_index(inplace=True) ## Note: reset_index adds a column called 'index' that holds the old index values
train.head(5)

Unnamed: 0,index,grade,male,raceeth,preschool,expectBachelors,motherHS,motherBachelors,motherWork,fatherHS,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,1,11,1,White,0.0,0.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
1,3,10,0,Black,1.0,1.0,0.0,0.0,1.0,1.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
2,4,10,1,Hispanic,1.0,0.0,1.0,0.0,1.0,1.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89
3,7,10,0,White,1.0,1.0,1.0,0.0,0.0,1.0,...,1.0,1.0,1.0,300.0,30.0,1.0,1,0,1913.0,439.36
4,9,10,1,More than one race,1.0,1.0,1.0,1.0,1.0,0.0,...,1.0,1.0,0.0,294.0,24.0,1.0,1,0,899.0,465.9


Factor variables are variables that take on a discrete set of values. An ordered factor has a natural ordering between the levels (an example would be the classifications "large," "medium," and "small"). 

Which of the following variables is an unordered factor with at least 3 levels?
Which of the following variables is an ordered factor with at least 3 levels? (Select all that apply.)

In [14]:
train.grade.value_counts()

10    1730
11     491
9      188
12       3
8        2
Name: grade, dtype: int64

In [15]:
train.male.value_counts()

1    1210
0    1204
Name: male, dtype: int64

In [16]:
train.raceeth.value_counts()

White                                     1470
Hispanic                                   500
Black                                      228
Asian                                       95
More than one race                          81
American Indian/Alaska Native               20
Native Hawaiian/Other Pacific Islander      20
Name: raceeth, dtype: int64
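
pandas can represent both kinds of factor with Categorical. A short sketch with invented values; the size labels come from the example in the text, everything else is an assumption for illustration:

```python
import pandas as pd

# Ordered factor: the levels have a natural ranking.
size = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Unordered factor: the levels are just labels.
race = pd.Categorical(["White", "Black", "Asian"])

print(size.ordered, race.ordered)
print(size.min(), size.max())     # only meaningful because size is ordered
print(size.codes.tolist())        # integer position of each value in `categories`
```

Judging by the value_counts output above, grade behaves like an ordered factor with five levels and raceeth like an unordered factor with seven levels.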

To include categorical features in a linear regression model, we need to convert the categories into numeric indicator (dummy) columns. In Python, this functionality is available through DictVectorizer and OneHotEncoder from scikit-learn, or the pandas "get_dummies()" function.
The options differ as follows:
* OneHotEncoder, in scikit-learn versions before 0.20, takes as input categorical values encoded as integers (you can get them from LabelEncoder); newer versions also accept string categories directly.
* DictVectorizer expects data as a list of dictionaries, where each dictionary is a data row with column names as keys.
* Patsy, another Python library, can build the indicator columns from a model formula.
* get_dummies() works directly on a pandas Series or DataFrame and is the option used below.

In [17]:
#Using get_dummies()
pd.get_dummies(train.raceeth).head(5)

Unnamed: 0,American Indian/Alaska Native,Asian,Black,Hispanic,More than one race,Native Hawaiian/Other Pacific Islander,White
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0
4,0.0,0.0,0.0,0.0,1.0,0.0,0.0


In [18]:
## Use get_dummies and drop the first level to get n-1 indicator columns
pd.get_dummies(train.raceeth,drop_first=True).head(3)
## Note: drop_first removes the first level in sorted order, which is not necessarily the most common category
## get_dummies can also take a whole DataFrame, in which case it encodes
## all of its categorical (object or category dtype) columns.

Unnamed: 0,Asian,Black,Hispanic,More than one race,Native Hawaiian/Other Pacific Islander,White
0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,1.0,0.0,0.0,0.0,0.0
2,0.0,0.0,1.0,0.0,0.0,0.0


In [19]:
## We would like to use White as the reference level, so let's do it one step at a time:
# First create a df with indicator columns for raceeth: columns for all 7 levels are created
race_train=pd.get_dummies(train.raceeth)
race_test=pd.get_dummies(test.raceeth)
race_train.head(3)
race_test.head(3)


Unnamed: 0,American Indian/Alaska Native,Asian,Black,Hispanic,More than one race,Native Hawaiian/Other Pacific Islander,White
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
1,0.0,0.0,0.0,0.0,0.0,0.0,1.0
2,1.0,0.0,0.0,0.0,0.0,0.0,0.0


In [20]:
train.shape

(2414, 25)

In [21]:
race_train.shape
test.shape
race_test.shape

(990, 7)

In [22]:
##Concat the new cols to the original dfs
train=pd.concat([race_train,train],axis=1)
test=pd.concat([race_test,test],axis=1)
train.head(4)

Unnamed: 0,American Indian/Alaska Native,Asian,Black,Hispanic,More than one race,Native Hawaiian/Other Pacific Islander,White,index,grade,male,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1,11,1,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
1,0.0,0.0,1.0,0.0,0.0,0.0,0.0,3,10,0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,4,10,1,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89
3,0.0,0.0,0.0,0.0,0.0,0.0,1.0,7,10,0,...,1.0,1.0,1.0,300.0,30.0,1.0,1,0,1913.0,439.36


In [23]:
# Drop the 'White' column as well as the original 'raceeth' column. Note: rows where all remaining race columns are 0 correspond to White
train.drop(['White','raceeth'],axis=1,inplace=True)
test.drop(['White','raceeth'],axis=1,inplace=True)
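
An alternative to creating all the indicator columns and then dropping 'White' is to declare the level order up front and let drop_first remove the reference level. A sketch on a made-up frame (the three-level list is an assumption for brevity; the real column has seven levels):

```python
import pandas as pd

# Toy column standing in for train.raceeth (values are invented).
toy = pd.DataFrame({"raceeth": ["White", "Black", "Hispanic", "White"]})

# Put the desired reference level first so that drop_first removes it.
levels = ["White", "Black", "Hispanic"]
toy["raceeth"] = pd.Categorical(toy["raceeth"], categories=levels)

dummies = pd.get_dummies(toy["raceeth"], drop_first=True)
print(dummies.astype(int))
```

Passing the same explicit level list for both train and test should also keep their dummy columns aligned, even if some level happens to be absent from one of the sets.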

In [24]:
list(train.columns)

['American Indian/Alaska Native',
 'Asian',
 'Black',
 'Hispanic',
 'More than one race',
 'Native Hawaiian/Other Pacific Islander',
 'index',
 'grade',
 'male',
 'preschool',
 'expectBachelors',
 'motherHS',
 'motherBachelors',
 'motherWork',
 'fatherHS',
 'fatherBachelors',
 'fatherWork',
 'selfBornUS',
 'motherBornUS',
 'fatherBornUS',
 'englishAtHome',
 'computerForSchoolwork',
 'read30MinsADay',
 'minutesPerWeekEnglish',
 'studentsInEnglish',
 'schoolHasLibrary',
 'publicSchool',
 'urban',
 'schoolSize',
 'readingScore']

In [25]:
train.head(3)

Unnamed: 0,American Indian/Alaska Native,Asian,Black,Hispanic,More than one race,Native Hawaiian/Other Pacific Islander,index,grade,male,preschool,...,englishAtHome,computerForSchoolwork,read30MinsADay,minutesPerWeekEnglish,studentsInEnglish,schoolHasLibrary,publicSchool,urban,schoolSize,readingScore
0,0.0,0.0,0.0,0.0,0.0,0.0,1,11,1,0.0,...,1.0,1.0,1.0,450.0,25.0,1.0,1,0,1173.0,575.01
1,0.0,0.0,1.0,0.0,0.0,0.0,3,10,0,1.0,...,1.0,1.0,1.0,200.0,23.0,1.0,1,1,2640.0,458.11
2,0.0,0.0,0.0,1.0,0.0,0.0,4,10,1,1.0,...,1.0,1.0,1.0,250.0,35.0,1.0,1,1,1095.0,613.89


Building a model:
Build a linear regression model (call it lmScore) using the training set to predict readingScore from all the remaining variables.
This time I want to use sklearn, not statsmodels. But first let's get all the answers using statsmodels.
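
For intuition about what either library computes, ordinary least squares can be sketched with plain NumPy. This is not the statsmodels or sklearn API, just the underlying least-squares fit on made-up data where the true coefficients are known:

```python
import numpy as np

# Made-up data: y = 3 + 2*x1 - 1*x2 exactly, so the fit is easy to check.
X = np.array([[1.0, 0.0], [2.0, 1.0], [3.0, 0.0], [4.0, 1.0], [5.0, 3.0]])
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1]

# Prepend a column of ones for the intercept (the role sm.add_constant plays
# in the statsmodels version of the model).
X_design = np.column_stack([np.ones(len(X)), X])

# Solve min ||X_design @ coef - y||^2 for coef.
coef, _, _, _ = np.linalg.lstsq(X_design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]
print(intercept, slopes)
```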

In [26]:
features_to_use=[cols for cols in train.columns if cols not in ['readingScore','index']]
features_to_use

['American Indian/Alaska Native',
 'Asian',
 'Black',
 'Hispanic',
 'More than one race',
 'Native Hawaiian/Other Pacific Islander',
 'grade',
 'male',
 'preschool',
 'expectBachelors',
 'motherHS',
 'motherBachelors',
 'motherWork',
 'fatherHS',
 'fatherBachelors',
 'fatherWork',
 'selfBornUS',
 'motherBornUS',
 'fatherBornUS',
 'englishAtHome',
 'computerForSchoolwork',
 'read30MinsADay',
 'minutesPerWeekEnglish',
 'studentsInEnglish',
 'schoolHasLibrary',
 'publicSchool',
 'urban',
 'schoolSize']

In [27]:
y=train.readingScore
X=train[features_to_use]
X=sm.add_constant(X)
lmscore=sm.OLS(y,X)
lmscore=lmscore.fit()

In [28]:
lmscore.summary2()

0,1,2,3
Model:,OLS,Adj. R-squared:,0.317
Dependent Variable:,readingScore,AIC:,27647.0896
Date:,2016-09-04 17:47,BIC:,27814.9717
No. Observations:,2414,Log-Likelihood:,-13795.0
Df Model:,28,F-statistic:,41.04
Df Residuals:,2385,Prob (F-statistic):,1.7199999999999998e-180
R-squared:,0.325,Scale:,5448.0

0,1,2,3,4,5,6
,Coef.,Std.Err.,t,P>|t|,[0.025,0.975]
const,143.7663,33.8412,4.2483,0.0000,77.4051,210.1276
American Indian/Alaska Native,-67.2773,16.7869,-4.0077,0.0001,-100.1958,-34.3588
Asian,-4.1103,9.2201,-0.4458,0.6558,-22.1905,13.9699
Black,-67.0123,5.4609,-12.2713,0.0000,-77.7209,-56.3038
Hispanic,-38.9755,5.1777,-7.5275,0.0000,-49.1288,-28.8221
More than one race,-16.9225,8.4963,-1.9918,0.0465,-33.5834,-0.2617
Native Hawaiian/Other Pacific Islander,-5.1016,17.0057,-0.3000,0.7642,-38.4491,28.2459
grade,29.5427,2.9374,10.0574,0.0000,23.7826,35.3028
male,-14.5217,3.1559,-4.6014,0.0000,-20.7103,-8.3330

0,1,2,3
Omnibus:,8.273,Durbin-Watson:,1.998
Prob(Omnibus):,0.016,Jarque-Bera (JB):,8.362
Skew:,-0.141,Prob(JB):,0.015
Kurtosis:,2.943,Condition No.:,37016.0


In [29]:
lmscore.summary() ## Either of the two summary methods can be used; they report the same fit

0,1,2,3
Dep. Variable:,readingScore,R-squared:,0.325
Model:,OLS,Adj. R-squared:,0.317
Method:,Least Squares,F-statistic:,41.04
Date:,"Sun, 04 Sep 2016",Prob (F-statistic):,1.7199999999999998e-180
Time:,17:47:22,Log-Likelihood:,-13795.0
No. Observations:,2414,AIC:,27650.0
Df Residuals:,2385,BIC:,27810.0
Df Model:,28,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5
,coef,std err,t,P>|t|,[95.0% Conf. Int.]
const,143.7663,33.841,4.248,0.000,77.405 210.128
American Indian/Alaska Native,-67.2773,16.787,-4.008,0.000,-100.196 -34.359
Asian,-4.1103,9.220,-0.446,0.656,-22.191 13.970
Black,-67.0123,5.461,-12.271,0.000,-77.721 -56.304
Hispanic,-38.9755,5.178,-7.528,0.000,-49.129 -28.822
More than one race,-16.9225,8.496,-1.992,0.047,-33.583 -0.262
Native Hawaiian/Other Pacific Islander,-5.1016,17.006,-0.300,0.764,-38.449 28.246
grade,29.5427,2.937,10.057,0.000,23.783 35.303
male,-14.5217,3.156,-4.601,0.000,-20.710 -8.333

0,1,2,3
Omnibus:,8.273,Durbin-Watson:,1.998
Prob(Omnibus):,0.016,Jarque-Bera (JB):,8.362
Skew:,-0.141,Prob(JB):,0.0153
Kurtosis:,2.943,Cond. No.,37000.0


In [30]:
#residuals of the model;
lmscore.resid.head(4)

0     36.317872
1    -37.516051
2    177.011875
3   -126.773978
dtype: float64

In [31]:
rmse=math.sqrt((lmscore.resid**2).mean())
print(rmse)

73.365551433


#### What is the Multiple R-squared value of lmScore on the training set?
* 0.3254

#### What is the training-set root-mean squared error (RMSE) of lmScore?
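* 73.3656, the value computed in In [31]. (73.81 is the residual standard error, which divides the sum of squared residuals by the 2385 residual degrees of freedom instead of the 2414 observations.)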
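
The RMSE computed in In [31] divides the sum of squared residuals by the number of observations, while the residual standard error reported by regression summaries divides by the residual degrees of freedom. A small sketch of the relationship, using figures copied from the outputs above (the variable names are mine):

```python
import math

n_obs = 2414          # "No. Observations" in the summary
df_resid = 2385       # "Df Residuals" in the summary
rmse = 73.365551433   # sqrt(mean(resid**2)) from In [31]

# Recover the sum of squared residuals, then divide by the degrees of freedom.
sse_train = n_obs * rmse ** 2
resid_std_error = math.sqrt(sse_train / df_resid)

print(resid_std_error)   # roughly 73.81, slightly larger than the RMSE
```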

#### Consider two students A and B. They have all variable values the same, except that student A is in grade 11 and student B is in grade 9. What is the predicted reading score of student A minus the predicted reading score of student B?
* Ans: 59.09. The coefficient of grade is +29.5427 and the students are 2 grade levels apart, so the difference is 2 × 29.5427 = 59.0854.

#### What is the meaning of the coefficient associated with variable raceethAsian (the 'Asian' column in this notebook)?
*  Predicted difference in the reading score between an Asian student and a white student who is otherwise identical 

#### Based on the significance values, which variables are candidates for removal from the model? 
(Assume that the factor variable raceeth should only be removed if none of its levels are significant.)

In [34]:
lmscore.pvalues[lmscore.pvalues >0.05]

Asian                                     0.655781
Native Hawaiian/Other Pacific Islander    0.764208
preschool                                 0.200516
motherHS                                  0.320012
motherWork                                0.425167
fatherHS                                  0.471470
fatherWork                                0.183934
selfBornUS                                0.603307
motherBornUS                              0.181821
fatherBornUS                              0.491776
englishAtHome                             0.241527
minutesPerWeekEnglish                     0.232644
studentsInEnglish                         0.208460
schoolHasLibrary                          0.187487
urban                                     0.977830
dtype: float64

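#### So all of the above variables are candidates for removal except Asian and Native Hawaiian/Other Pacific Islander: these are levels of raceeth, which stays in the model because other levels (for example Black and Hispanic) are significant.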

#### Predicting on unseen data:
Using the "predict" function and supplying the "newdata" argument, use the lmScore model to predict the reading scores of students in pisaTest. Call this vector of predictions "predTest". Do not change the variables in the model (for example, do not remove variables that we found were not significant in the previous part of this problem). Use the summary function to describe the test set predictions.

What is the range between the maximum and minimum predicted reading score on the test set?

In [35]:
lmscore.predict(X)

array([ 538.69212789,  495.62605148,  436.8781249 , ...,  447.98990547,
        468.91289474,  557.36336867])

In [44]:
X_test=test[features_to_use]
X_test=sm.add_constant(X_test)
X_test_pred=lmscore.predict(X_test)
y_test=test.readingScore

In [45]:
min(X_test_pred)

353.2231231139566

#### What is the range between the maximum and minimum predicted reading score on the test set

In [46]:
max(X_test_pred)-min(X_test_pred)

284.46831179513288

#### Test set SSE and RMSE:
What is the sum of squared errors (SSE) of lmScore on the testing set?

In [47]:
SSE=((lmscore.predict(X_test)-y_test)**2).sum()
SSE

5762082.371144367

What is the RMSE of lmScore on the testing set?

In [48]:
math.sqrt(((lmscore.predict(X_test)-y_test)**2).mean())

76.29079383109176
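
The test-set RMSE is tied to the SSE above by RMSE = sqrt(SSE / n). A quick check with the numbers already printed in this notebook (n is the 990 test rows left after dropping missing values; variable names are mine):

```python
import math

sse_test = 5762082.371144367   # SSE from In [47]
n_test = 990                   # rows in the test set after dropna

rmse_test = math.sqrt(sse_test / n_test)
print(rmse_test)
```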

#### Problem 4.3 - Baseline prediction and test-set SSE

What is the predicted test score used in the baseline model? Remember to compute this value using the training set and not the test set.

In [49]:
## The baseline is the average of the training scores.
## Note: this is the value used to calculate the total sum of squares (SST).
y.mean()

517.9628873239429

What is the sum of squared errors of the baseline model on the testing set? HINT: We call the sum of squared errors for the baseline model the total sum of squares (SST).

In [50]:
#This can be computed as:
SST=((y.mean() - test.readingScore)**2).sum()
print(SST)

7802354.07761


What is the test-set R-squared value of lmScore?

In [51]:
#The test-set R^2 is defined as 1-SSE/SST, where SSE is the sum of squared errors of the model on the test set and SST is the sum of squared errors of the baseline model. For this model, the R^2 is then computed to be 1-5762082/7802354.

R_sq=1.0-(SSE/SST)
R_sq

0.26149437543770626
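
The three steps above (baseline from the training set, SSE and SST on the test set, then 1 - SSE/SST) can be wrapped in one helper. This is a sketch, and out_of_sample_r2 is a hypothetical name of mine, not a library function:

```python
import numpy as np

def out_of_sample_r2(y_train, y_test, y_pred):
    """Test-set R^2 using the training-set mean as the baseline prediction."""
    y_test = np.asarray(y_test, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    baseline = np.mean(y_train)                  # computed on the training set only
    sse = np.sum((y_test - y_pred) ** 2)         # model error on the test set
    sst = np.sum((y_test - baseline) ** 2)       # baseline error on the test set
    return 1.0 - sse / sst

# Tiny made-up example: baseline = 2, SSE = 1, SST = 4.
r2_toy = out_of_sample_r2([1.0, 2.0, 3.0], [2.0, 4.0], [2.0, 3.0])
print(r2_toy)   # 0.75

# Same formula with the SSE and SST printed earlier in this notebook.
r2_notebook = 1.0 - 5762082.371144367 / 7802354.07761
print(r2_notebook)
```

Because the baseline comes from the training set, this quantity can be negative on a test set when the model predicts worse than the training mean.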