# CIS4020 Example - Predicting whether a student passes or fails a course

### Imports

The following cell contains the imports needed throughout the example. The documentation for each of these libraries is publicly available on its website.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
# Configure the default look of our graphs
plt.rc("font", size=14)
sns.set(style="whitegrid", color_codes=True)

### Reading the data

pd.read_csv(filename, header=0) -> reads the CSV file into a **DataFrame (refer to pandas documentation)**, using the first row as the column names

data.dropna() -> drops all rows that contain missing data in the form of **NaN (refer to python documentation)**

data.shape -> returns (row count, column count)

data.columns -> returns the column labels (wrapped in list() below so they print as a plain list)

In [None]:
data = pd.read_csv('student-mat.csv', header=0)
data = data.dropna()
print(data.shape)
print(list(data.columns))
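To see what these calls do without the real file, here is a minimal sketch on a hypothetical three-row DataFrame. The toy values are made up for illustration and are not taken from student-mat.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three students, one with a missing final grade.
toy = pd.DataFrame({
    "school": ["GP", "MS", "GP"],
    "G3": [12, np.nan, 8],
})

cleaned = toy.dropna()        # drops the row that contains NaN
print(toy.shape)              # (3, 2)
print(cleaned.shape)          # (2, 2)
print(list(cleaned.columns))  # ['school', 'G3']
```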

### Analyzing the data

data['index'] -> returns all the rows in the column named index

data['index'].value_counts() -> returns a count for each distinct value appearing in that column.

e.g. the value 10 appears 56 times in the dataset.

In [None]:
data['G3'].value_counts()
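As a small illustration of how value_counts() tallies values, the sketch below uses a made-up Series rather than the real G3 column.

```python
import pandas as pd

# Made-up grades; the real column has many more distinct values.
grades = pd.Series([10, 10, 0, 15, 10, 0], name="G3")

counts = grades.value_counts()  # sorted by count, most frequent first
print(counts)
```

Here 10 appears three times, 0 twice and 15 once, so 10 is listed first.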

### Creating our classifiers

As we saw in the data, the final grade of each student is out of 20, so we need to decide which grades count as a pass and which count as a fail.

In our example, we will use the standard 50% passing threshold: a grade of 10 or higher is a pass, and anything below 10 is a fail.

Using this, we can transform our data by labeling all grades below 10 as False (0) and all remaining grades as True (1).

In [None]:
# Build the label in one step instead of writing through .values,
# which may not update the DataFrame in newer pandas versions
data['G3'] = (data['G3'] >= 10).astype(int)
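The thresholding step can be checked on a handful of made-up grades. This is only a sketch of the idea, using a boolean comparison converted to 0/1.

```python
import pandas as pd

# Made-up grades on the 0-20 scale, including the boundary value 10.
toy_grades = pd.Series([0, 9, 10, 15, 20])

# True where the grade meets the 50% threshold, stored as 1 (pass) / 0 (fail).
passed = (toy_grades >= 10).astype(int)
print(passed.tolist())  # [0, 0, 1, 1, 1]
```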

### Visualizing our classifiers using graphs

We can visualize the number of passing and failing students using graphs.

From the count plot we can infer that roughly twice as many students passed the course as failed it.

In [None]:
sns.countplot(x='G3', data=data, palette='hls')
# Save before showing: plt.show() clears the figure, so saving afterwards writes a blank image
plt.savefig('count_plot')
plt.show()

Study time is defined as:

1 = < 2 hours per week

2 = 2 to 5 hours per week

3 = 5 to 10 hours per week

4 = > 10 hours per week

In [None]:
%matplotlib inline
pd.crosstab(data.studytime,data.G3).plot(kind='bar')
plt.title('Study Time vs Pass or Fail')
plt.xlabel('Study Time')
plt.ylabel('Frequency of pass/fail')
plt.savefig('studytime_vs_pass_fail')

### Visualizing our classifiers using numbers

Using numbers, we can see that almost twice as many students passed the course as failed it.

In [None]:
count_fail = len(data[data['G3'] == 0])
count_pass = len(data[data['G3'] == 1])
# Each percentage is that group's share of all students
pct_of_fail = count_fail/(count_pass+count_fail)
print("percentage of failing students is", pct_of_fail*100)
pct_of_pass = count_pass/(count_pass+count_fail)
print("percentage of passing students is", pct_of_pass*100)
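The same shares can also be read off in one call. The sketch below uses value_counts(normalize=True) on made-up labels; it is an alternative to the manual counting above, not the method this example relies on.

```python
import pandas as pd

# Made-up labels: three passes (1) and one fail (0).
labels = pd.Series([1, 1, 0, 1], name="G3")

# normalize=True returns fractions instead of raw counts.
shares = labels.value_counts(normalize=True) * 100
print(shares.loc[1])  # 75.0
print(shares.loc[0])  # 25.0
```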

### Visualizing our classifiers using the mean of all features

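We can compare the two classes using the mean of every numeric feature. (Categorical data will not be shown.)

In [None]:
# numeric_only=True skips the text columns, which newer pandas versions refuse to average
data.groupby('G3').mean(numeric_only=True)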

### Transforming Categorical Columns into Boolean Columns

Many functions in sklearn are not compatible with categorical data, so it can be helpful to expand a categorical column into several boolean columns.

For example, if school is a categorical variable with two possible categories (GP or MS), we can create two columns called school_GP and school_MS and use boolean values to indicate whether a student attends GP or MS.

If a student attends the school MS, the columns would look like:

**school_MS school_GP**

**TRUE FALSE**

The next few cells do this: they add the new columns to the existing dataset and build a list of the columns to keep.

In [None]:
cat_vars=['school','sex','address','famsize','Pstatus','schoolsup','famsup','paid','activities','nursery', 'higher', 'internet', 'romantic', 'Mjob', 'Fjob', 'reason', 'guardian']
# Append one indicator column per category (e.g. school_GP, school_MS) for each categorical variable
for var in cat_vars:
    # dtype=int stores the indicators as 0/1 rather than booleans, keeping the frame purely numeric
    cat_list = pd.get_dummies(data[var], prefix=var, dtype=int)
    data = data.join(cat_list)
# Keep every column except the original categorical ones
data_vars=data.columns.values.tolist()
to_keep=[i for i in data_vars if i not in cat_vars]

In [None]:
data_final=data[to_keep]
(data_final.columns.values)
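The expansion of the school column described above can be reproduced on a tiny hypothetical frame. The age column and the three rows are made up, and the indicators are stored as 0/1 integers here.

```python
import pandas as pd

# Hypothetical frame with one categorical and one numeric column.
toy = pd.DataFrame({"school": ["MS", "GP", "GP"], "age": [17, 16, 18]})

# One indicator column per category, named <prefix>_<category>.
dummies = pd.get_dummies(toy["school"], prefix="school", dtype=int)

# Attach the indicators and drop the original categorical column.
encoded = toy.join(dummies).drop(columns=["school"])
print(encoded)
```

The first student attends MS, so that row has school_MS = 1 and school_GP = 0.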

You may uncomment the code below to balance the training data using SMOTE, an oversampling technique provided by imblearn.

SMOTE synthesizes additional samples of the less frequent class so that the training dataset contains an even distribution of passing and failing students.

In [None]:
X = data_final.loc[:, data_final.columns != 'G3']
y = data_final.loc[:, data_final.columns == 'G3']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

os_data_X = X_train
os_data_y = y_train

# from imblearn.over_sampling import SMOTE

# os = SMOTE(random_state=0)
# columns = X_train.columns

# fit_resample replaces fit_sample, which newer imblearn releases no longer provide
# os_data_X,os_data_y=os.fit_resample(X_train, y_train.values.ravel())
# os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
# os_data_y= pd.DataFrame(data=os_data_y,columns=['G3'])

# # we can Check the numbers of our data
# print("length of oversampled data is ",len(os_data_X))
# print("Number of students failed in oversampled data",len(os_data_y[os_data_y['G3']==0]))
# print("Number of students passed",len(os_data_y[os_data_y['G3']==1]))
# print("Proportion of students failed data in oversampled data is ",len(os_data_y[os_data_y['G3']==0])/len(os_data_X))
# print("Proportion of students passed data in oversampled data is ",len(os_data_y[os_data_y['G3']==1])/len(os_data_X))
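To illustrate what balancing the classes means without installing imblearn, the sketch below uses plain random oversampling via sklearn.utils.resample. This is a simpler, swapped-in technique and not SMOTE: it duplicates existing minority rows rather than synthesizing new ones. The toy training frame is made up.

```python
import pandas as pd
from sklearn.utils import resample

# Made-up training data: four passing students (1) and two failing (0).
train = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 2, 1],
    "G3":        [1, 1, 1, 1, 0, 0],
})

majority = train[train["G3"] == 1]
minority = train[train["G3"] == 0]

# Sample the minority class with replacement until it matches the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=0
)

balanced = pd.concat([majority, minority_upsampled])
print(balanced["G3"].value_counts())
```

As with SMOTE, only the training split should be resampled, so the test set keeps the original class distribution.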


### Recursive Feature Elimination

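Recursive Feature Elimination (RFE) repeatedly fits the model, ranks the features by the importance the model assigns them (for logistic regression, the size of their coefficients), and drops the weakest one, until only the requested number of features remains.
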
In [None]:
data_final_vars=data_final.columns.values.tolist()
y=['G3']
X=[i for i in data_final_vars if i not in y]
from sklearn.feature_selection import RFE
logreg = LogisticRegression(solver='liblinear')
# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(logreg, n_features_to_select=20)
rfe = rfe.fit(os_data_X, os_data_y.values.ravel())
print(rfe.support_)
print(rfe.ranking_)
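To make the support_ and ranking_ outputs easier to read, here is a minimal sketch of RFE on synthetic data. The dataset from make_classification and the choice of keeping 3 of 8 features are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic data: 8 features, of which 3 carry signal.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

selector = RFE(LogisticRegression(solver="liblinear"), n_features_to_select=3)
selector.fit(X_demo, y_demo)

print(selector.support_)  # boolean mask: True for each kept feature
print(selector.ranking_)  # 1 for kept features, larger numbers were dropped earlier
```

support_ holds exactly as many True entries as n_features_to_select, and those positions have rank 1 in ranking_.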

We then filter our data and only pick out the features selected by the RFE function.

In [None]:
from itertools import compress
cols = list(compress(X, rfe.support_))
X=os_data_X[cols]
y=os_data_y['G3']

### Running the model

We can then fit a logistic regression model on the training set, review the summary of the fit, and remove features accordingly.

In [None]:
import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary2())

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn import metrics
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression(solver='liblinear')
logreg.fit(X_train, y_train)
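The effect of test_size=0.3 can be seen on a toy array. This sketch uses ten made-up samples; the variable names are hypothetical and chosen so they do not clash with the notebook's own.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each, and alternating labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=0
)
print(len(X_tr), len(X_te))  # 7 3
```

Fixing random_state makes the split reproducible between runs.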

Predict whether a student passed or failed in the test dataset using our model.

In [None]:
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

We can analyze the confusion matrix and see how well our model performs and how often a false positive or a false negative occurs.

In [None]:
from sklearn.metrics import confusion_matrix
# Store the result under a different name so the imported function is not overwritten
cm = confusion_matrix(y_test, y_pred)
print(cm)
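For a binary problem, scikit-learn lays the matrix out with true labels as rows and predicted labels as columns, so the four cells can be unpacked with ravel(). The labels below are made up to show which cell is which.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels: 0 = fail, 1 = pass.
y_true = [0, 0, 1, 1, 1, 1]
y_hat = [0, 1, 1, 1, 0, 1]

# Row-major order: [[tn, fp], [fn, tp]]
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print(tn, fp, fn, tp)  # 1 1 1 3
```

Here one failing student was wrongly predicted to pass (a false positive) and one passing student was wrongly predicted to fail (a false negative).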

Lastly, we can evaluate our model using its precision, recall and accuracy. (In this case our model seems to do well, but that could also be due to overfitting and heavy bias caused by a small dataset.)

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
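The figures in the report come from simple ratios of the confusion matrix cells. The sketch below computes them for the positive class on made-up labels, where 2 of the 3 predicted passes are correct and 2 of the 4 actual passes are found.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up labels: 0 = fail, 1 = pass.
y_true = [1, 1, 1, 1, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0]

precision = precision_score(y_true, y_hat)  # tp / (tp + fp) = 2 / 3
recall = recall_score(y_true, y_hat)        # tp / (tp + fn) = 2 / 4
accuracy = accuracy_score(y_true, y_hat)    # correct / total = 3 / 6
print(precision, recall, accuracy)
```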

As we can see, the accuracy is very high. This is likely due to the very small dataset used, which causes our model to overfit.

A few things we can do are gather more data, increase our oversampling and analyze every feature used in the model.