## David Abramowitz - Big Data Final
Tuesday, December 18th 

## Abstract
In this lab I used several classification models to try to estimate what students would have scored on the third test. Every model that I ran produced an accuracy score so low that estimating individual scores seemed pointless.

## Dataset Prep
This project looked at grading data from 12th Grade Physics taught by Moses Rifkin. The data covered academic years 2016-17 to 2018-19. For the first two academic years, students took four tests and two finals, with the fourth test falling between the first and second final exams. For each test and final exam, students had the opportunity to estimate their score and rate their feelings regarding the test. The feelings score is out of ten. This data was collected alongside each test/final. The notable exception is the 2018-19 data, which only exists for the first two tests (the third test was canceled). Throughout all three years, blank cells exist in every category. Sex (M/F) was collected for all of the data.

In [656]:
# Data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

#Viz
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

#machine learning models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier


In [657]:
train_df = pd.read_csv('Physics_train3.csv')
test_df = pd.read_csv ('Physics_test3.csv')

In [658]:
from pandas import read_csv
import numpy
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

## Data cleaning
In Google Sheets I did the following:
- Deleted the test headers and renamed the columns
- Combined 2016/17 and 2017/18 data to create an effective amount of training data for 2018/19
- Removed score estimations and confidence ratings


Below, I used code from Jason Brownlee, which he posted at https://machinelearningmastery.com/handle-missing-data-python/, to remove rows containing one or more NaNs.

In [659]:
train_df.dropna(inplace=True)
test_df.dropna(inplace=True)
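
As a minimal sketch of what this step does, the example below runs `dropna` on a small made-up frame (the `toy` data is hypothetical, not the class data). It also shows that the surviving rows keep their original index labels, which is why `info()` later reports 82 entries with labels running from 0 to 84.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the physics grade sheet.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M"],
    "T1_Score": [69.0, np.nan, 46.0, 61.0],
    "T2_Score": [89.0, 90.0, np.nan, 94.0],
})

rows_before = len(toy)
cleaned = toy.dropna()  # drops every row that has at least one NaN
rows_after = len(cleaned)

print(f"dropped {rows_before - rows_after} of {rows_before} rows")
print(cleaned.index.tolist())  # original index labels are kept, with gaps
```

Counting the rows before and after is a cheap way to see how much data the cleaning step costs.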

## Data exploration

Because the data had already been cleaned, there was not much of note in the exploration phase. Even after the rows with nulls were removed, simple division indicates that each class has roughly the same number of students in the data.

In [660]:
#Printing first 10 rows to ensure accuracy
train_df[:10]

Unnamed: 0,Gender,T1_Score,T2_Score,T3_Score
0,F,69.0,89.0,77.0
1,M,94.0,90.0,93.0
2,F,46.0,57.5,62.0
3,M,61.0,94.0,79.0
4,M,96.0,97.0,62.0
5,F,77.0,73.0,79.0
6,M,79.0,95.0,86.0
7,M,85.0,88.0,75.0
8,F,71.0,83.0,73.0
9,M,84.0,99.0,90.5


In [661]:
#printing the columns
print(train_df.columns.values)
train_df.info()

['Gender' 'T1_Score' 'T2_Score' 'T3_Score']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 82 entries, 0 to 84
Data columns (total 4 columns):
Gender      82 non-null object
T1_Score    82 non-null float64
T2_Score    82 non-null float64
T3_Score    82 non-null float64
dtypes: float64(3), object(1)
memory usage: 3.2+ KB


In [662]:
print(test_df.columns.values)
test_df.info()

['Gender' 'T1_Score' 'T2_Score']
<class 'pandas.core.frame.DataFrame'>
Int64Index: 23 entries, 0 to 22
Data columns (total 3 columns):
Gender      23 non-null object
T1_Score    23 non-null int64
T2_Score    23 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 736.0+ bytes


## Feature manipulation

My feature manipulation consisted of replacing the sex values with numerical values. I also dropped every row that contained a NaN; however, this may qualify as data cleaning rather than feature manipulation.

In [663]:
#Replacing sex values with numerical ones
train_df = train_df.replace({'M': '2', 'F': '1'})
test_df = test_df.replace({'M': '2', 'F': '1'})

This code to replace 'M' and 'F' came from Andy Hayden on StackOverflow (https://stackoverflow.com/questions/18548662/rename-elements-in-a-column-of-a-data-frame-using-pandas).
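
The cell above writes the codes as the strings '1' and '2' and applies the replacement to every column. A minimal sketch of one alternative, assuming the same 1/2 coding, limits the mapping to the Gender column and yields integer codes (the `toy` frame and the `gender_codes` name are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame with the same column names as the report.
toy = pd.DataFrame({"Gender": ["F", "M", "M"], "T1_Score": [69.0, 94.0, 61.0]})

# Same coding as the report (F -> 1, M -> 2), stored as integers.
gender_codes = {"F": 1, "M": 2}
toy["Gender"] = toy["Gender"].map(gender_codes)

print(toy.dtypes)
```

Note that `map` turns any value missing from the dictionary into NaN, so it is best applied after the rows with missing values have been dropped.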

## No feature elimination

Because I had previously removed the estimated and confidence scores in Google Sheets (I thought that they might be altering the accuracy scores of the models), there were no features to eliminate here.

In [666]:
#Separating the label (T3_Score) from the features in the training data
train_labels = train_df['T3_Score']
train_features = train_df.drop(['T3_Score'], axis=1)

#The 2018-19 data has no T3_Score column, so every column is a feature
test_features = test_df

#features - x
#labels - y

In [667]:
print("Shape of train_features: ", train_features.shape, " shape of train_labels: ", train_labels.shape, "Shape of test_features: ", test_features.shape)
print("train_features feature list: ", train_features.columns.values)
print("test_features feature list: ", test_features.columns.values)

Shape of train_features:  (82, 3)  shape of train_labels:  (82,) Shape of test_features:  (23, 3)
train_features feature list:  ['Gender' 'T1_Score' 'T2_Score']
test_features feature list:  ['Gender' 'T1_Score' 'T2_Score']


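Here, the shapes of the feature and label sets were examined. Then the training data was split into a fitting subset and a held-out validation subset so that each model could be checked on rows it had not seen. Note that train_size=0.2 places only 16 of the 82 rows in the fitting subset and leaves the other 66 for validation.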

In [668]:
#Split training data into a fitting subset and a held-out validation subset
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

valid_train_features, valid_test_features, valid_train_labels, valid_test_labels  = train_test_split(train_features, train_labels, train_size=0.2, test_size=0.8)

valid_train_features.shape, valid_test_features.shape, valid_train_labels.shape, valid_test_labels.shape

((16, 3), (66, 3), (16,), (66,))
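
To make the effect of the split proportions concrete, here is a rough sketch on a stand-in array with the same number of rows as the training data (82). The arrays `X` and `y` and the `random_state` values are assumptions for illustration only. It compares the 20/80 split used above with the reverse proportions, and then shows how `KFold` (imported earlier but not used) would rotate the held-out rows.

```python
import numpy as np
from sklearn.model_selection import KFold, train_test_split

# Hypothetical stand-in with the same number of rows as train_df (82).
X = np.arange(82 * 3).reshape(82, 3)
y = np.arange(82)

# train_size=0.2 keeps only 20% of the rows for fitting, as in the cell above.
X_small, X_rest, y_small, y_rest = train_test_split(
    X, y, train_size=0.2, test_size=0.8, random_state=0
)
print(X_small.shape, X_rest.shape)

# Reversing the proportions leaves most of the rows for fitting.
X_big, X_hold, y_big, y_hold = train_test_split(
    X, y, train_size=0.8, test_size=0.2, random_state=0
)
print(X_big.shape, X_hold.shape)

# K-fold cross-validation rotates the held-out part so that
# every row is used for validation exactly once.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
fold_sizes = [(len(fit_idx), len(val_idx)) for fit_idx, val_idx in kfold.split(X)]
print(fold_sizes)
```

With so few rows, how many are left for fitting matters a great deal, so the reverse proportions or a k-fold scheme may be worth trying.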

In [669]:
print(valid_train_features)

   Gender  T1_Score  T2_Score
84      1      75.0      73.0
2       1      46.0      57.5
79      1      71.0      87.0
59      2      53.0      64.0
15      2      90.0     100.0
61      1      50.0      59.0
6       2      79.0      95.0
81      2      46.0      55.0
82      1      95.0      91.0
56      2      36.0      80.0
51      1      58.0      89.0
45      1      56.0      69.5
14      1      76.0      76.5
21      2      90.0     100.0
5       1      77.0      73.0
3       2      61.0      94.0


In [670]:
import numpy as np
from sklearn import metrics, svm
from sklearn.linear_model import LogisticRegression
from sklearn import preprocessing
from sklearn import utils

In [671]:
lab_enc = preprocessing.LabelEncoder()
valid_train_labels = lab_enc.fit_transform(valid_train_labels)
valid_test_labels = lab_enc.fit_transform(valid_test_labels)
train_labels = lab_enc.fit_transform(train_labels)

#print(training_scores_encoded)
#print(utils.multiclass.type_of_target(training_scores_Y))
#print(utils.multiclass.type_of_target(training_scores_Y.astype('int')))
#print(utils.multiclass.type_of_target(training_scores_encoded))
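
One thing worth checking in this step: calling `fit_transform` separately on each set of labels builds a separate mapping each time, so the same class number can stand for different scores in different sets. This is likely to contribute to the low accuracies reported later, although it may not be the only cause. The sketch below uses made-up score lists (`fit_scores` and `eval_scores` are hypothetical) to show the mismatch and one possible way around it.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical score lists; the real splits hold many more distinct values.
fit_scores = [62.0, 77.0, 93.0]
eval_scores = [77.0, 79.0, 93.0]

# A fresh encoder per split numbers that split's own sorted values from 0.
separate_fit = LabelEncoder().fit_transform(fit_scores)
separate_eval = LabelEncoder().fit_transform(eval_scores)
# 77.0 is class 1 in the first split but class 0 in the second.
print(separate_fit, separate_eval)

# Fitting one encoder on every score gives a single shared mapping.
shared = LabelEncoder().fit(fit_scores + eval_scores)
print(shared.transform(fit_scores), shared.transform(eval_scores))
```

With a shared encoder, a predicted class number refers to the same score in every set, so comparing predictions against labels becomes meaningful.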

In [672]:
print(valid_train_features)

   Gender  T1_Score  T2_Score
84      1      75.0      73.0
2       1      46.0      57.5
79      1      71.0      87.0
59      2      53.0      64.0
15      2      90.0     100.0
61      1      50.0      59.0
6       2      79.0      95.0
81      2      46.0      55.0
82      1      95.0      91.0
56      2      36.0      80.0
51      1      58.0      89.0
45      1      56.0      69.5
14      1      76.0      76.5
21      2      90.0     100.0
5       1      77.0      73.0
3       2      61.0      94.0


## Data models
I chose the following models primarily because I was comfortable with them. I had attempted to use other models; however, I encountered problems with producing an accuracy score. Because of how low these scores are, I suspect that no model would be able to achieve a significant accuracy score.

In [674]:
#log reg
logreg = LogisticRegression()
logreg.fit(valid_train_features, valid_train_labels)
log_pred = logreg.predict(valid_test_features)
print(accuracy_score(valid_test_labels, log_pred))

#Compare accuracy on the full training data vs the 16-row fitting subset to check for overfitting
print(logreg.score(train_features, train_labels), logreg.score(valid_train_features, valid_train_labels))



0.030303030303030304
0.012195121951219513 0.4375


In [675]:
#RandomForest
RFC = RandomForestClassifier()
RFC.fit(valid_train_features, valid_train_labels)
RFC_pred = RFC.predict(valid_test_features)
accuracy_score_RFC = (accuracy_score(valid_test_labels, RFC_pred))
print(accuracy_score_RFC)

#Compare accuracy on the full training data vs the 16-row fitting subset to check for overfitting
print(RFC.score(train_features, train_labels), RFC.score(valid_train_features, valid_train_labels))




0.045454545454545456
0.024390243902439025 0.9375


In [676]:
#KNN
knn_clf = KNeighborsClassifier()
knn_clf.fit(valid_train_features, valid_train_labels)
KNN_Pred = knn_clf.predict(valid_test_features)
print(accuracy_score(valid_test_labels, KNN_Pred))

#Compare accuracy on the full training data vs the 16-row fitting subset to check for overfitting
print(knn_clf.score(train_features, train_labels), knn_clf.score(valid_train_features, valid_train_labels))




0.015151515151515152
0.012195121951219513 0.25


In [677]:
#GaussianNB
from sklearn.naive_bayes import GaussianNB
GNB = GaussianNB()

GNB.fit(valid_train_features, valid_train_labels)
GNB_pred = GNB.predict(valid_test_features)
print(accuracy_score(valid_test_labels, GNB_pred))
print(GNB.score(train_features, train_labels), GNB.score(valid_train_features, valid_train_labels))



0.0
0.024390243902439025 0.875


Given that these accuracy scores are so low (surprisingly low), it seems pointless to produce an estimate of individual scores. An area for future exploration would be why these scores are so low. I am somewhat confident in my work, but it is odd that no model was able to produce even a moderately high accuracy score.
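
One possible explanation, offered tentatively, is the framing of the problem. The models above are classifiers, and accuracy only counts a prediction as correct when it matches the score exactly, so a prediction that is one point off counts the same as one that is thirty points off. The sketch below is not the method used in this report. It treats the third test score as a continuous value, fits a linear regression, and reports the mean absolute error in points. All of the data in it is synthetic and hypothetical, generated only so that the example runs.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, hypothetical scores; NOT the class data from this report.
rng = np.random.default_rng(0)
t1 = rng.uniform(40, 100, size=82)
t2 = np.clip(t1 + rng.normal(0, 10, size=82), 0, 100)
t3 = np.clip(0.5 * t1 + 0.4 * t2 + 8 + rng.normal(0, 8, size=82), 0, 100)

X = np.column_stack([t1, t2])
X_fit, X_val, y_fit, y_val = train_test_split(X, t3, test_size=0.2, random_state=0)

# Regression predicts a number of points rather than a class label.
reg = LinearRegression().fit(X_fit, y_fit)
pred = reg.predict(X_val)

# Mean absolute error is measured in test points, so it is easy to interpret.
mae = mean_absolute_error(y_val, pred)

# Exact-match rate after rounding, for comparison with an accuracy score.
exact_match = np.mean(np.round(pred) == np.round(y_val))
print(f"MAE: {mae:.1f} points, exact-match rate: {exact_match:.2f}")
```

An error measured in points gives credit for near misses, which an exact-match accuracy score cannot do.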


## Data analysis and conclusion
One major area for future exploration would be how sex factors into grade trends. It should be noted that Moses Rifkin grades tests and finals blindly; that is, tests are turned in with a number instead of a name. As such, Mr. Rifkin is unaware of students' sex when grading their tests. It would also be interesting to examine why the accuracy scores are so low. It is not uncommon for there to be great volatility between the first and second tests, and the data does not include or account for any changes in teaching style or course load.

## Acknowledgements
All code came from Ms. Sconyers, with two exceptions: the code to remove rows with NaNs, which came from Jason Brownlee (https://machinelearningmastery.com/handle-missing-data-python/), and the code to replace M and F, which came from Andy Hayden on StackOverflow (https://stackoverflow.com/questions/18548662/rename-elements-in-a-column-of-a-data-frame-using-pandas). A big thank you to Moses Rifkin, who volunteered his grading data.