# TODO
1. Create and pass tests for fit and predict (including verifying that our random seed actually makes results reproducible)
2. Change the technical report calling code to use stratified random subsampling (instead of k-fold) to accommodate the large dataset
3. Make visualizations to identify attributes we could eliminate
4. Make a visualization of the classifications to decide whether to treat "Distinction" as a subset of Pass and "Withdrawn" as a subset of Fail
5. Build the full API
6. Prepare the final presentation (five minutes)

# Introduction

From our dataset, we use the attributes gender, region, highest_education, age_band, num_of_prev_attempts, studied_credits, imd_band$^1$, and disability to predict the class label final_result.

## Cleaning our data

### Data Analysis
Todo:
* Frequency Bar Charts 
* Relevant summary statistics about the dataset.


Because studied_credits and num_of_prev_attempts are numerical attributes, we convert them to categorical ones. The table below shows how studied_credits is binned. num_of_prev_attempts becomes a boolean: True means the course has been attempted before, and False means it has not.

|category|studied_credits range|
|------|-------------|
|1|$\leq$ 59|
|2|60-119|
|3|120-179|
|4|180-239|
|5|$\geq$ 240 |
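
The binning in the table above can be sketched in a few lines. This is only an illustration: the helper names `bin_studied_credits` and `previously_attempted` are hypothetical and are not the `myutils.convert_vals_into_cutoffs` function used in this notebook, whose cutoff semantics may differ.

```python
from bisect import bisect_left

# Inclusive upper bounds of categories 1-4 from the table; anything above 239 is category 5.
UPPER_BOUNDS = [59, 119, 179, 239]


def bin_studied_credits(credits):
    """Map a numeric credit count to its category number (1-5)."""
    return bisect_left(UPPER_BOUNDS, credits) + 1


def previously_attempted(num_of_prev_attempts):
    """True if the course was attempted at least once before (hypothetical helper)."""
    return num_of_prev_attempts > 0


credits_examples = [30, 59, 60, 119, 120, 239, 240, 600]
binned = [bin_studied_credits(c) for c in credits_examples]
print(binned)  # [1, 1, 2, 2, 3, 4, 5, 5]
print(previously_attempted(0), previously_attempted(2))  # False True
```

Checking the values right at each boundary (59, 60, 239, 240) is a quick way to confirm that the bins match the table.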

We further cleaned our data by dropping rows with missing values. We understand that this could affect our results, since we lose some entries. However, on inspection only about 1,000 of the more than 30,000 rows contained missing values (roughly 3%). Given this small ratio, we expect the loss to have little effect on our classifier.
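
To make the trade-off concrete, here is a minimal sketch of dropping incomplete rows and reporting how many were lost. The function name `drop_rows_with_missing` and the missing-value markers `""` and `"NA"` are assumptions for illustration. `MyPyTable.remove_rows_with_missing_values` may detect missing values differently.

```python
def drop_rows_with_missing(rows, missing_markers=("", "NA")):
    """Return (kept_rows, n_dropped), dropping any row that contains a missing marker."""
    kept = [row for row in rows if not any(value in missing_markers for value in row)]
    return kept, len(rows) - len(kept)


# Toy rows (not taken from the real dataset): gender, region, imd_band
toy_rows = [
    ["M", "Scotland", "10-20%"],
    ["F", "Wales", ""],
    ["F", "London Region", "NA"],
    ["M", "Wales", "50-60%"],
]

kept, n_dropped = drop_rows_with_missing(toy_rows)
print(f"dropped {n_dropped} of {len(toy_rows)} rows ({n_dropped / len(toy_rows):.0%})")
```

Reporting the dropped fraction alongside the cleaning step makes it easy to check that the loss stays small when the input file changes.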

1: imd_band is the Index of Multiple Deprivation band, a measure of poverty based on area in the UK.



In [3]:
import importlib

from tabulate import tabulate
import mysklearn.myutils
importlib.reload(mysklearn.myutils)
import mysklearn.myutils as myutils

import mysklearn.mypytable
importlib.reload(mysklearn.mypytable)
from mysklearn.mypytable import MyPyTable 

import mysklearn.myclassifiers
importlib.reload(mysklearn.myclassifiers)
from mysklearn.myclassifiers import MyKNeighborsClassifier, MySimpleLinearRegressor, MyNaiveBayesClassifier, MyDecisionTreeClassifier

import mysklearn.myevaluation
importlib.reload(mysklearn.myevaluation)
import mysklearn.myevaluation as myevaluation

student_data = MyPyTable().load_from_file("input_data/studentInfo.csv")
# remove missing values
student_data.remove_rows_with_missing_values()

gender = student_data.get_column("gender")
region = student_data.get_column("region")
highest_education = student_data.get_column("highest_education") 
age_band = student_data.get_column("age_band")
num_of_prev_attempts = student_data.get_column("num_of_prev_attempts") 
studied_credits = student_data.get_column("studied_credits")

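# Discretize the two numeric attributes: attempts -> bool, credits -> category 1-5
myutils.convert_vals_into_cutoffs(num_of_prev_attempts, [0, 1], [False, True])
myutils.convert_vals_into_cutoffs(studied_credits, [59, 60, 120, 180, 240], [1, 2, 3, 4, 5])

# Write the converted values back into the table
# (column 8 is num_of_prev_attempts, column 9 is studied_credits)
for i in range(len(student_data.data)):
    student_data.data[i][8] = num_of_prev_attempts[i]
    student_data.data[i][9] = studied_credits[i]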

imd_band = student_data.get_column("imd_band")
disability = student_data.get_column("disability")
final_result = student_data.get_column("final_result")

## Classifications

### Accuracy Rates
For the purposes of this demo, we ran our cleaned data through our decision tree and Naive Bayes classifiers using stratified 10-fold cross-validation. Naive Bayes was slightly more accurate than the decision tree (about 43% versus 40%), but both accuracies were disappointing. We hope that our RandomForestClassifier will reduce some of the overfitting and improve our accuracy for the final project.

In [5]:
student_train_folds, student_test_folds = myevaluation.stratified_kfold_cross_validation(student_data.data, final_result, 10) 

student_test = []
student_train = []

final_results_test = []
final_results_train = []

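# Turn each training fold's row indices into attribute rows and class labels.
# The slice [3:10] keeps columns 3 through 9 (seven attributes); the last column is the label.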
for row in student_train_folds:
    student_set = []
    final_results_set = []
    for item in row:
        student_set.append(student_data.data[item][3:10])
        final_results_set.append(student_data.data[item][-1])
    student_train.append(student_set)
    final_results_train.append(final_results_set)

# Turn each test fold's row indices into attribute rows and class labels
for row in student_test_folds:
    student_set = []
    final_results_set = []
    for item in row:
        student_set.append(student_data.data[item][3:10])
        final_results_set.append(student_data.data[item][-1])
    student_test.append(student_set)
    final_results_test.append(final_results_set)

# Naive Bayes model: fit and predict on each fold, pooling predictions across folds
total_Naive = []
total_expected = []
for i in range(10):
    student_Naive = MyNaiveBayesClassifier()    
    student_Naive.fit(student_train[i],final_results_train[i])
    Naive_predictions = student_Naive.predict(student_test[i])
    total_Naive.extend(Naive_predictions)
    total_expected.extend(final_results_test[i])
print("===========================================")
print("Predictive Accuracy")
print("===========================================")
print("Stratified 10-Fold Cross Validation")
accuracy, errorrate = myutils.accuracy_errorrate(total_Naive, total_expected)
print("Naive Bayes: accuracy = ", accuracy, "error rate = ", errorrate)

# Decision tree model: fit and predict on each fold, pooling predictions across folds
total_tree = []
total_expected = []
for i in range(10):
    student_tree = MyDecisionTreeClassifier()    
    student_tree.fit(student_train[i],final_results_train[i])
    tree_predictions = student_tree.predict(student_test[i])
    total_tree.extend(tree_predictions)
    total_expected.extend(final_results_test[i])



accuracy, errorrate = myutils.accuracy_errorrate(total_tree, total_expected)
print("Tree: accuracy = ", accuracy, "error rate = ", errorrate)

Predictive Accuracy
Stratified 10-Fold Cross Validation
Naive Bayes: accuracy =  0.4282447112635792 error rate =  0.5717552887364208
Tree: accuracy =  0.3956864239883108 error rate =  0.6043135760116892
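
The accuracies above pool predictions from ten stratified folds. As a rough sketch of what "stratified" means here, the function below deals the rows of each class round-robin into folds so that every fold has about the same class mix. It is an assumed, simplified implementation, not the code behind `myevaluation.stratified_kfold_cross_validation`. The name `stratified_kfold_indices` and its `seed` parameter are hypothetical, and only the return shape (lists of train and test row indices per fold) mirrors how the call above is used.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits, seed=0):
    """Return (train_folds, test_folds) as lists of row indices (illustrative sketch)."""
    rng = random.Random(seed)  # fixed seed so the folds are reproducible

    # Group row indices by class label
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    # Shuffle within each class, then deal the indices round-robin into the test folds
    test_folds = [[] for _ in range(n_splits)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for index in indices:
            test_folds[position % n_splits].append(index)
            position += 1

    # Each training fold is every row that is not in the matching test fold
    all_indices = set(range(len(labels)))
    train_folds = [sorted(all_indices - set(fold)) for fold in test_folds]
    return train_folds, test_folds


toy_labels = ["Pass"] * 6 + ["Fail"] * 3 + ["Withdrawn"] * 3
train_folds, test_folds = stratified_kfold_indices(toy_labels, n_splits=3, seed=1)
for fold in test_folds:
    # Each test fold holds 1 Fail, 2 Pass and 1 Withdrawn
    print(sorted(toy_labels[i] for i in fold))
```

Passing an explicit seed keeps the folds identical between runs, which is what a reproducibility test for fit and predict would rely on.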


### Confusion Matrix

We then made a confusion matrix for our decision tree classifier to see how our predictions were distributed. The predictions are not clustered along the diagonal, which reflects the classifier's low accuracy: for every actual class, Pass is the most common prediction.
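
For readers unfamiliar with the layout, here is a minimal sketch of how such a matrix can be built, with rows as actual classes and columns as predicted classes. The function `build_confusion_matrix` is a hypothetical stand-in and is not the implementation of `myevaluation.confusion_matrix`. The toy labels are made up for illustration.

```python
def build_confusion_matrix(y_true, y_pred, labels):
    """Rows are actual classes, columns are predicted classes, both in `labels` order."""
    position = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[position[actual]][position[predicted]] += 1
    return matrix


labels = ["Pass", "Withdrawn", "Fail", "Distinction"]
y_true = ["Pass", "Pass", "Fail", "Withdrawn", "Distinction", "Pass"]
y_pred = ["Pass", "Withdrawn", "Pass", "Withdrawn", "Pass", "Pass"]

toy_matrix = build_confusion_matrix(y_true, y_pred, labels)
for label, row in zip(labels, toy_matrix):
    print(f"{label:<12}{row}")
```

Correct predictions land on the diagonal, so a good classifier produces large diagonal counts and small counts everywhere else.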

In [6]:
tree_matrix = myevaluation.confusion_matrix(total_expected, total_tree, ["Pass","Withdrawn", "Fail", "Distinction"])

# Append each class's row total and recognition rate (correct / total, as a percentage)
for i, row in enumerate(tree_matrix):
    total = sum(row)
    true_pos = row[i]
    rec = (true_pos / total) * 100 if total != 0 else 0
    row.append(total)
    row.append(rec)

tree_matrix[0].insert(0, "Pass")
tree_matrix[1].insert(0, "Withdrawn")
tree_matrix[2].insert(0, "Fail")
tree_matrix[3].insert(0, "Distinction")

print()
print("Decision Tree (Stratified 10 Fold Cross Validation Results)")
print(tabulate(tree_matrix, ["Final Result","Pass","Withdrawn","Fail", "Distinction","total", "Recognition %"]))



Decision Tree (Stratified 10 Fold Cross Validation Results)
Final Result      Pass    Withdrawn    Fail    Distinction    total    Recognition %
--------------  ------  -----------  ------  -------------  -------  ---------------
Pass              7496         3066    1088            180    11830         63.3643
Withdrawn         4861         4014     936            109     9920         40.4637
Fail              3721         2253     872             61     6907         12.6249
Distinction       2017          544     189             75     2825          2.65487
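
The counts in the table above are enough to recompute the overall accuracy and compare it with a majority-class baseline (always predicting the most frequent class). The sketch below copies the counts from the table. The baseline and per-class precision are additional derived figures, not outputs of the notebook's own code.

```python
labels = ["Pass", "Withdrawn", "Fail", "Distinction"]

# Counts copied from the table above (rows = actual class, columns = predicted class)
matrix = [
    [7496, 3066, 1088, 180],
    [4861, 4014, 936, 109],
    [3721, 2253, 872, 61],
    [2017, 544, 189, 75],
]

n_instances = sum(sum(row) for row in matrix)
accuracy = sum(matrix[i][i] for i in range(len(labels))) / n_instances
# Baseline: always predict the class with the most actual instances
majority_baseline = max(sum(row) for row in matrix) / n_instances

print(f"instances: {n_instances}")                            # instances: 31482
print(f"accuracy: {accuracy:.4f}")                            # accuracy: 0.3957
print(f"majority-class baseline: {majority_baseline:.4f}")    # majority-class baseline: 0.3758

for i, label in enumerate(labels):
    actual_total = sum(matrix[i])
    predicted_total = sum(row[i] for row in matrix)
    recall = matrix[i][i] / actual_total if actual_total else 0.0
    precision = matrix[i][i] / predicted_total if predicted_total else 0.0
    print(f"{label:<12} precision={precision:.3f} recall={recall:.3f}")
```

The recomputed accuracy matches the decision tree accuracy reported earlier, and it is only about two percentage points above the baseline of always predicting Pass.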
