# Machine Learning Homework

All subquestions (i.e., a, b, c, etc.) are worth 5 points unless noted otherwise.

__Data Source__:

Altered version of the Mathematics (mat) dataset downloaded from this link:
https://archive.ics.uci.edu/ml/datasets/Student+Performance#

Paulo Cortez, University of Minho, Guimarães, Portugal, http://www3.dsi.uminho.pt/pcortez


__Data Description__:

This dataset covers student achievement in secondary education at two Portuguese schools. The attributes include student grades and demographic, social, and school-related features, and they were collected using school reports and questionnaires. Two datasets are provided regarding performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such a prediction is much more useful (see paper source for more details).


__Attribute Information__:

Input Variables:

1 school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira) 

2 sex - student's sex (binary: 'F' - female or 'M' - male) 

3 age - student's age (numeric: from 15 to 22) 

4 address - student's home address type (binary: 'U' - urban or 'R' - rural) 

5 famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3) 

6 Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart) 

7 Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 

8 Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education) 

9 Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 

10 Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other') 

11 reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other') 

12 guardian - student's guardian (nominal: 'mother', 'father' or 'other') 

13 traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour) 

14 studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours) 

15 failures - number of past class failures (numeric: n if 1<=n<3, else 4) 

16 schoolsup - extra educational support (binary: yes or no)

17 famsup - family educational support (binary: yes or no) 

18 paid - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no) 

19 activities - extra-curricular activities (binary: yes or no) 

20 nursery - attended nursery school (binary: yes or no) 

21 higher - wants to take higher education (binary: yes or no) 

22 internet - Internet access at home (binary: yes or no) 

23 romantic - with a romantic relationship (binary: yes or no) 

24 famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent) 

25 freetime - free time after school (numeric: from 1 - very low to 5 - very high) 

26 goout - going out with friends (numeric: from 1 - very low to 5 - very high)

27 Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high) 

28 Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high) 

29 health - current health status (numeric: from 1 - very bad to 5 - very good) 

30 absences - number of school absences (numeric: from 0 to 93) 

Output variables:

These grades are related to the course subject, Math or Portuguese...

31 G1 - first period grade (numeric: from 0 to 20) 

32 G2 - second period grade (numeric: from 0 to 20) 

33 G3 - final grade (numeric: from 0 to 20, output target)

__Additional Citation__:

P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. 

----------------------------------------------------------------------------------------------------------------------

<b>1. (5 Points) Packages and Prebuilt Functions </b>

    a) (2 points) Run the code block below to load in needed packages

In [None]:
import numpy as np
import pandas as pd
import sklearn
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
import os
import warnings

warnings.filterwarnings('ignore')

    b) (3 points) Copy over the get_download_path() function we have used elsewhere in the course and run this cell to load in its functionality.

<b>2. (15 Points) Data </b>

    a) (3 points) Place the "student-mat.xlsx" file in your Downloads folder. Then read the data file in Python and display the first 5 rows.

In [None]:
# Your code below
grades = ...

    b) (6 points) Check whether there are any NaN values using the ".info" and ".describe" commands, and explain in a markdown cell why you think there are or are not null values (a sentence or two). Comment on the distribution (mean, min, max, standard deviation) of the age and studytime columns.

In [None]:
grades.info()

In [None]:
grades.describe()

    c) (6 points) Drop the 'G1' and 'G2' columns. In a markdown cell, explain why we dropped these two columns (hint: check the data description).

<b>3. (15 points) Working with Missing Data</b>

    a) (4 points) Print out the column names of any variables that contain null values along with the number of nulls that column has in our dataset. 

    b) (2 points) Print out all rows in which the column "studytime" contains a null value.

    c) (4 points) Replace (impute) all null values of the column "studytime" with the mean value for the existing values of "studytime".

In [None]:
mean_value = ...
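
The missing-data steps in parts a) through c) can be sketched on a small made-up DataFrame. This is only an illustration under stated assumptions: the `toy` frame, its values, and the name `toy_mean` are hypothetical and are not part of the homework dataset or its expected solution.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the homework dataset
toy = pd.DataFrame({
    "age": [15, 16, 17, 18],
    "studytime": [2.0, np.nan, 4.0, np.nan],
})

# Count nulls per column and keep only the columns that have any
null_counts = toy.isnull().sum()
print(null_counts[null_counts > 0])

# Select the rows where "studytime" is null
print(toy[toy["studytime"].isnull()])

# Mean of the existing (non-null) values; Series.mean() skips NaN by default
toy_mean = toy["studytime"].mean()

# Replace the nulls with that mean
toy["studytime"] = toy["studytime"].fillna(toy_mean)
print(toy)
```

Because the mean is computed before filling, the imputed values do not shift the mean of the column.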

    d) (3 points) What are some other methods to fill in missing values for both qualitative and quantitative data?  Does it ever make sense to fill in missing data with 0s?

    e) (2 points) Drop all remaining rows that contain a null value.  Print out how many rows are now in our dataset.

<b>4. (20 points) Creating Bins with Our Dependent Variable </b>

    a) (2 points) Print all unique values of 'G3'.

    b) (2 points) Plot 'G3' as a histogram.

    c) (6 points) Use "qcut" to create 4 bins for 'G3' in a variable called "g3_band".  What is the difference between "cut" and "qcut"?
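
As a rough illustration of how the two binning functions behave, the sketch below applies both to a small, deliberately skewed series. The `values` series is made up for this example and is not the 'G3' column.

```python
import pandas as pd

# Hypothetical skewed toy values, not the homework data
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 100])

# cut: 4 intervals of equal width between the min and max
equal_width = pd.cut(values, bins=4)

# qcut: 4 intervals whose edges are the sample quartiles
equal_count = pd.qcut(values, q=4)

print(equal_width.value_counts().sort_index())
print(equal_count.value_counts().sort_index())
```

On this toy series, the equal-width bins are dominated by one interval, while the quantile-based bins each hold the same number of values.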

    d) (4 points) Print out the value counts for g3_band.  Logically, why doesn't the qcut work as expected?  (hint: is the distribution even across the value counts?)

    e) (3 points) Create a new variable called "grades", which is a 4-bin qcut with labels 0, 1, 2, and 3.

In [None]:
grades['grades'] = ...

    f) (3 points) Drop the columns "G3" and "g3_band".  Print out the value counts of "grades".

<b>5. (25 Points) First Machine Learning Models </b>

    a) (3 points) Create Y using the variable "grades" and all other variables as the X.

In [None]:
Y = ...
X = ...

    b) (5 points) Run the code block below.  What is it doing?  Explain what a "dummy variable" is.  Why are we converting these variables?  (hint: do all machine learning algorithms in scikit-learn work with categorical variables?)

In [None]:
X = pd.get_dummies(X)
print(X.columns)
X.head()
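
To see what `pd.get_dummies` produces without the full dataset, here is a minimal sketch on a made-up frame. The `toy` frame and its three rows are hypothetical; only the column names are borrowed from the attribute list.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one nominal column
toy = pd.DataFrame({
    "age": [15, 16, 17],
    "guardian": ["mother", "father", "other"],
})

# Numeric columns pass through; each category of a string column
# becomes its own indicator column named "<column>_<category>"
encoded = pd.get_dummies(toy)
print(encoded.columns.tolist())
print(encoded)
```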

    c) (2 points) Split the data into a training set and a test set, for both your X variables and your Y variable, with a 21% test size.

In [None]:
X_train, X_test, Y_train, Y_test = ...
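
A minimal sketch of a 21% split on synthetic arrays is shown below. The arrays `X_toy` and `y_toy` and the choice of `random_state=0` are assumptions made for this illustration; the assignment does not specify a seed.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy arrays: 100 samples, 2 features, 4 class labels
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.arange(100) % 4

# test_size=0.21 holds out 21% of the rows;
# random_state (an assumption here) makes the shuffle repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.21, random_state=0
)
print(X_tr.shape, X_te.shape)
```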

    d) (15 points) Interpret the results of the machine learning models below.  Namely, describe what the training accuracy, testing accuracy, and the difference between the two indicate.  Note how you tuned the hyperparameters and why you ended up with the ones you chose (i.e., n_estimators, learning_rate, n_neighbors).  Choose one of the models and briefly explain at a high level how it works (2-4 sentences).

In [None]:
# Random Forest
random_forest = RandomForestClassifier(n_estimators=...)
random_forest.fit(X_train, Y_train)
random_forest_train_acc = random_forest.score(X_train, Y_train)
random_forest_test_acc = random_forest.score(X_test, Y_test)
print('random_forest training accuracy= ',random_forest_train_acc)
print('random_forest test accuracy= ',random_forest_test_acc)

In [None]:
# AdaBoost
adaboost = AdaBoostClassifier(n_estimators=..., learning_rate=...)
adaboost.fit(X_train, Y_train)
adaboost_train_acc = adaboost.score(X_train, Y_train)
adaboost_test_acc = adaboost.score(X_test, Y_test)
print('AdaBoost training accuracy= ',adaboost_train_acc)
print('AdaBoost test accuracy= ',adaboost_test_acc)

In [None]:
# K Nearest Neighbors
knn = KNeighborsClassifier(n_neighbors=...)
knn.fit(X_train, Y_train)
knn_train_acc = knn.score(X_train, Y_train)
knn_test_acc = knn.score(X_test, Y_test)
print('kNN training accuracy= ',knn_train_acc)
print('kNN test accuracy= ',knn_test_acc)
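
One possible way to explore a hyperparameter is to loop over a few candidate values and compare training and test accuracy for each. The sketch below does this for kNN on synthetic data from `make_classification`. The data, the candidate values `(1, 5, 25)`, and the seeds are all assumptions for illustration, not recommended settings for the homework dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical synthetic 4-class problem, not the student data
X_syn, y_syn = make_classification(
    n_samples=400, n_features=10, n_informative=5,
    n_classes=4, random_state=0,
)
X_syn_train, X_syn_test, y_syn_train, y_syn_test = train_test_split(
    X_syn, y_syn, test_size=0.21, random_state=0
)

# Map each candidate k to its (training accuracy, test accuracy)
results = {}
for k in (1, 5, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_syn_train, y_syn_train)
    results[k] = (
        model.score(X_syn_train, y_syn_train),
        model.score(X_syn_test, y_syn_test),
    )
    print(k, results[k])
```

With `n_neighbors=1`, each training point is its own nearest neighbor, so the training accuracy says little about generalization. The test accuracy is the more informative number when comparing settings.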

<b>6. (20 Points) Your Own Machine Learning Model </b>

    Come up with your own machine learning model different from the ones above (you can still use the same type of method if you wish, just tune the parameters differently), interpret the results, and document your steps and reasoning. Use as many code blocks as you like and be creative: there are infinitely many ways to do this!  Additionally, use graph(s) to explain your results and/or why you chose to use a certain method.