Final Project: Predicting Student Performance
===
Authors: John Sabo and Michael Shimer

Goal: Make a machine learning model that can predict how well a student will do in school.

Frame the Problem and Look at the Big Picture 
===
1. Define the objective in business terms. 
    * The school is looking for a system that will predict how well a student will do to see if they can give them help earlier in the school year as opposed to later.
2. How will your solution be used? 
    * The model is going to be used to predict how well a student is doing or how well a student will do on their final grades.
    * Our solution will be used to find students who need additional help sooner rather than later.
3. What are the current solutions/workarounds (if any)? 
    * Predicting student performance is not a new idea, however, current methods used involve paper and pencil analysis.
4. How should you frame this problem (supervised/unsupervised, online/offline, ...)? 
    * Supervised - It will be a supervised learning model because we have a target feature or label.
        - The label is the students' final grade.
    * Regression - It will be a regression problem because grades are measured on a linear scale.
        - i.e. grades are '0 to 100' or 'F to A'
    * Offline - It will be offline because it will not recieve new data on a daily basis.
5. How should performance be measured? Is the performance measure aligned with the business objective? 
    * The root mean squared error will denote how close the model is to predicting the actual grade.
6. What would be the minimum performance needed to reach the business objective? 
    * To reach the business objective we will have a minimum performance requirement for the model of 10% error. In other words, the predicted final grade mnust always be within 10 points of the acutal grade.
7. What are comparable problems? Can you reuse experience or tools? 
    * Predicting an athlete performance based on factors in and off the field is a similar problem. There models that exist to predict performance of both students and athletes. Some models may exist that use some athletic data to predict the students performance. There are no tools or experience we can use to specifically help solve this problem, however, as our experience grows in Machine learning, we feel more ready for the task.
8. Is human expertise available? 
    * This project has no human expertise available.
9. How would you solve the problem manually? 
    * To solve this manually, all the features would be analyzed and given a weight based on how important each feature is determined to be. An alogrithm would be written to consider all the weights of the attributes and compute a numerical prediction for the final grade of the student.
10.  List the assumptions you (or others) have made so far. Verify assumptions if possible. 
    * So assumptions are being made against our data.

In [36]:
import ast

import numpy as np
import scipy.sparse
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import pickle

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import silhouette_score, adjusted_rand_score

from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler, FunctionTransformer, OneHotEncoder, MinMaxScaler, MultiLabelBinarizer

from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.ensemble import VotingClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingRegressor, StackingClassifier

pd.set_option('display.max_columns', 1000)
pd.set_option('display.max_rows', 100)


import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

In [37]:
def read_student_data(filename):
    """
    Reads the CSV file from the current directory and changes all semi-colons to comas
    as well as removes all unnecessary quotes.
    """
    fixed_data = []
    file = open(filename)
    for line in file:
        line_data = line.strip().split(';')
        temp = []
        for item in line_data:
            item.strip('"')
            temp.append(item)
        fixed_data.append(temp)
    column_names = fixed_data[0]
    fixed_data.remove(fixed_data[0])

    return pd.DataFrame(fixed_data, columns=column_names) 
    

def read_data():
    """
    Combines the two CSV files and returns one DataFrame.
    """
    return pd.concat([read_student_data('student-por.csv'), read_student_data('student-mat.csv')], axis=0)
    



In [41]:
data = read_data()
data.head(20)

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
0,"""GP""","""F""",18,"""U""","""GT3""","""A""",4,4,"""at_home""","""teacher""","""course""","""mother""",2,2,0,"""yes""","""no""","""no""","""no""","""yes""","""yes""","""no""","""no""",4,3,4,1,1,3,4,"""0""","""11""",11
1,"""GP""","""F""",17,"""U""","""GT3""","""T""",1,1,"""at_home""","""other""","""course""","""father""",1,2,0,"""no""","""yes""","""no""","""no""","""no""","""yes""","""yes""","""no""",5,3,3,1,1,3,2,"""9""","""11""",11
2,"""GP""","""F""",15,"""U""","""LE3""","""T""",1,1,"""at_home""","""other""","""other""","""mother""",1,2,0,"""yes""","""no""","""no""","""no""","""yes""","""yes""","""yes""","""no""",4,3,2,2,3,3,6,"""12""","""13""",12
3,"""GP""","""F""",15,"""U""","""GT3""","""T""",4,2,"""health""","""services""","""home""","""mother""",1,3,0,"""no""","""yes""","""no""","""yes""","""yes""","""yes""","""yes""","""yes""",3,2,2,1,1,5,0,"""14""","""14""",14
4,"""GP""","""F""",16,"""U""","""GT3""","""T""",3,3,"""other""","""other""","""home""","""father""",1,2,0,"""no""","""yes""","""no""","""no""","""yes""","""yes""","""no""","""no""",4,3,2,1,2,5,0,"""11""","""13""",13
5,"""GP""","""M""",16,"""U""","""LE3""","""T""",4,3,"""services""","""other""","""reputation""","""mother""",1,2,0,"""no""","""yes""","""no""","""yes""","""yes""","""yes""","""yes""","""no""",5,4,2,1,2,5,6,"""12""","""12""",13
6,"""GP""","""M""",16,"""U""","""LE3""","""T""",2,2,"""other""","""other""","""home""","""mother""",1,2,0,"""no""","""no""","""no""","""no""","""yes""","""yes""","""yes""","""no""",4,4,4,1,1,3,0,"""13""","""12""",13
7,"""GP""","""F""",17,"""U""","""GT3""","""A""",4,4,"""other""","""teacher""","""home""","""mother""",2,2,0,"""yes""","""yes""","""no""","""no""","""yes""","""yes""","""no""","""no""",4,1,4,1,1,1,2,"""10""","""13""",13
8,"""GP""","""M""",15,"""U""","""LE3""","""A""",3,2,"""services""","""other""","""home""","""mother""",1,2,0,"""no""","""yes""","""no""","""no""","""yes""","""yes""","""yes""","""no""",4,2,2,1,1,1,0,"""15""","""16""",17
9,"""GP""","""M""",15,"""U""","""GT3""","""T""",3,4,"""other""","""other""","""home""","""mother""",1,2,0,"""no""","""yes""","""no""","""yes""","""yes""","""yes""","""yes""","""no""",5,5,1,1,1,5,0,"""12""","""12""",13


In [39]:
data.describe()

Unnamed: 0,school,sex,age,address,famsize,Pstatus,Medu,Fedu,Mjob,Fjob,reason,guardian,traveltime,studytime,failures,schoolsup,famsup,paid,activities,nursery,higher,internet,romantic,famrel,freetime,goout,Dalc,Walc,health,absences,G1,G2,G3
count,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044,1044
unique,2,2,8,2,2,2,5,5,5,5,4,3,4,4,4,2,2,2,2,2,2,2,2,5,5,5,5,5,5,35,18,17,19
top,"""GP""","""F""",16,"""U""","""GT3""","""T""",4,2,"""other""","""other""","""course""","""mother""",1,2,0,"""no""","""yes""","""no""","""no""","""yes""","""yes""","""yes""","""no""",4,3,3,1,1,5,0,"""10""","""11""",10
freq,772,591,281,759,738,923,306,324,399,584,430,728,623,503,861,925,640,824,528,835,955,827,673,512,408,335,727,398,395,359,146,138,153


In [42]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1044 entries, 0 to 394
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   school      1044 non-null   object
 1   sex         1044 non-null   object
 2   age         1044 non-null   object
 3   address     1044 non-null   object
 4   famsize     1044 non-null   object
 5   Pstatus     1044 non-null   object
 6   Medu        1044 non-null   object
 7   Fedu        1044 non-null   object
 8   Mjob        1044 non-null   object
 9   Fjob        1044 non-null   object
 10  reason      1044 non-null   object
 11  guardian    1044 non-null   object
 12  traveltime  1044 non-null   object
 13  studytime   1044 non-null   object
 14  failures    1044 non-null   object
 15  schoolsup   1044 non-null   object
 16  famsup      1044 non-null   object
 17  paid        1044 non-null   object
 18  activities  1044 non-null   object
 19  nursery     1044 non-null   object
 20  higher   