<a href="https://colab.research.google.com/github/DLPY/Classification_Session_2/blob/main/Classification_Session2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Classification Session 2 - Notebook

## **1. Import Pandas, Pyplot and Read Data**

In [None]:
import matplotlib.pyplot as plt  # visualization library
import numpy as np  # mathematical functions
import pandas as pd  # data manipulation library
import seaborn as sns  # visualization library
from sklearn import metrics, preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
import statsmodels.api as sm

In [None]:
# The CSV is first downloaded from a GitHub raw file.
# Alternatively, upload it to your session storage by clicking the file icon on the left toolbar and importing the CSV.
! wget https://raw.githubusercontent.com/DLPY/Classification_Session_2/main/Student2020.csv

In [None]:
# Once we have the CSV file, pd.read_csv() loads it into a pandas dataframe
df = pd.read_csv('Student2020.csv')
# Encode the target column: 'Fail' becomes 0, any other value becomes 1
df['Pass'] = df['Pass'].apply(lambda x: 0 if x == 'Fail' else 1)

## **2. Exploratory Data Analysis (EDA)**

In [None]:
# Display the count of rows and columns.
df.shape

In [None]:
# Review a small sample of the data.
df.head()

### **Student Data Set**

The attributes are:

1) **Age**: continuous.

2) **Auditory**: Numeric, valid range [0,10]. Whether the student learns best when listening or talking.

3) **Kinaesthetic**: Numeric, valid range [0,10]. Whether the student learns best by doing.

4) **Visual**: Numeric, valid range [0,10]. Whether the student learns best from reading text or diagrams.

5) **Extrinsic Motivation**: Numeric, valid range [0,10]. Whether the student is motivated by external rewards such as good grades.

6) **Intrinsic Motivation**: Numeric, valid range [0,10]. Whether the student is motivated by an interest in learning itself.

7) **Self-Efficacy**: Numeric, valid range [0,10]. The student's belief that they can do well.

8) **Study Time**: Numeric, valid range [0,10]. Representative of weekly hours spent studying.

9) **Conscientiousness**: Numeric, valid range [0,10]. A personality trait.

10) **CAO Points**: Range [0,625]. Leaving Certificate points from the end-of-school state exam in Ireland.

11) **Maths**: Range [0,100]. Leaving Certificate score in Mathematics.

12) **English**: Range [0,100]. Leaving Certificate score in English.
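
The valid ranges listed above can be checked programmatically before modelling. The following is a minimal sketch on a small made-up dataframe. The column names (`Auditory`, `CAOpoints`, `Maths`), the `sample` values and the `valid_ranges` mapping are assumptions chosen for illustration and may not match the headers in `Student2020.csv`.

```python
import pandas as pd

# Hypothetical sample; the real column names in Student2020.csv may differ.
sample = pd.DataFrame({
    'Auditory': [3, 7, 11],         # 11 is outside [0, 10]
    'CAOpoints': [350, 700, 480],   # 700 is outside [0, 625]
    'Maths': [55, 80, 100],
})

# Assumed mapping of column name -> (min, max), taken from the attribute list.
valid_ranges = {'Auditory': (0, 10), 'CAOpoints': (0, 625), 'Maths': (0, 100)}

# Count the values that fall outside the documented range for each column.
out_of_range = {
    col: int((~sample[col].between(low, high)).sum())
    for col, (low, high) in valid_ranges.items()
}
print(out_of_range)
```

A non-zero count for a column would suggest a data-entry problem worth investigating before training a model.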

In [None]:
#Checking for null values
print(df.isnull().sum())
sns.heatmap(df.isnull(), cbar=False)

In [None]:
# Detailed overview of the dataframe itself.
df.info()

In [None]:
# Remove duplicate rows, if any
df = df.drop_duplicates()
df.shape
# No duplicate values were found

### i) Investigate correlation in the new dataframe.

In [None]:
corr = df.corr()
corr.style.background_gradient(cmap='coolwarm')

### ii) Variables correlated to Pass.

In [None]:
df.corr()['Pass'].sort_values().drop('Pass').plot(kind='barh')

### iii) Summary of Pass and Fail

In [None]:
sns.countplot(x='Pass', data=df, palette='hls')

The graph above shows that far more students passed than failed. This is referred to as 'class imbalance'.
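
To put a number on class imbalance, it can help to look at class proportions and at the accuracy a trivial majority-class predictor would reach. The sketch below uses made-up labels; the `toy` dataframe is hypothetical and is not the student data.

```python
import pandas as pd

# Toy labels for illustration only: 1 = Pass, 0 = Fail.
toy = pd.DataFrame({'Pass': [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]})

# Proportion of each class; a large gap indicates class imbalance.
class_share = toy['Pass'].value_counts(normalize=True)
print(class_share)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = class_share.max()
print(f"Majority-class baseline accuracy: {baseline_accuracy:.2f}")
```

Any classifier trained on imbalanced data should be judged against this baseline, and per-class precision and recall are usually more informative than accuracy alone.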

## **3. Scaling the Data Using MinMaxScaler**

In [None]:
# Independent variables (features)
X = df.drop(['Pass'], axis=1)

# Dependent variable (target)
y = df.Pass.values

In [None]:
X.head()

In [None]:
# Rescale every feature to the range [-1, 1]
trans = preprocessing.MinMaxScaler(feature_range=(-1,1))
column_names = ['age', 'Auditory', 'Kinaesthetic', 'Visual', 'ExtrinsicMotivation', 'IntrinsicMotivation', 'SelfEfficacy', 'StudyTime', 'Conscientiousness', 'CAOpoints', 'Maths', 'English']
scaled_X = pd.DataFrame(trans.fit_transform(X), columns=column_names)

In [None]:
scaled_X.head()
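
For reference, min-max scaling maps each feature linearly so that its minimum lands on the lower bound and its maximum on the upper bound of the target range. The sketch below reproduces this by hand on a toy column and compares it with `MinMaxScaler`. The `values` array is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature column (hypothetical values, not the student data).
values = np.array([[200.0], [400.0], [600.0]])

# Manual rescaling to the target range [low, high] = [-1, 1]:
# x_scaled = (x - min) / (max - min) * (high - low) + low
low, high = -1.0, 1.0
col_min, col_max = values.min(axis=0), values.max(axis=0)
manual = (values - col_min) / (col_max - col_min) * (high - low) + low

# The same transformation with scikit-learn.
scaled = MinMaxScaler(feature_range=(low, high)).fit_transform(values)

print(manual.ravel())
print(np.allclose(manual, scaled))
```

Because the minimum and maximum are learned from the data passed to `fit`, a common practice is to fit the scaler on the training split only and reuse it to transform the test split.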

## **4. Classification Using KNN**

### i) KNN Model without scaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

In [None]:
neigh = KNeighborsClassifier(n_neighbors=5)
knn = neigh.fit(X_train, y_train)
y_pred = knn.predict(X_test)

In [None]:
target_names = ['Fail', 'Pass']
print('Confusion Matrix\n')
print(confusion_matrix(y_test, y_pred))
print('\nClassification report\n')
print(classification_report(y_test, y_pred, target_names=target_names))
classification_report_knn = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).T
knn_without_scaler = classification_report_knn['f1-score']['accuracy']

### ii) KNN Model with scaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.33, random_state=42)
neigh = KNeighborsClassifier(n_neighbors=5)
knn = neigh.fit(X_train, y_train)
y_pred = knn.predict(X_test)
print('Confusion Matrix\n')
print(confusion_matrix(y_test, y_pred))
print('\nClassification report\n')
print(classification_report(y_test, y_pred, target_names=target_names))

### iii) Parameter Search using For Loop

In [None]:
for x in [5,10,15,25,30,35]:
  neigh = KNeighborsClassifier(n_neighbors=x)
  knn = neigh.fit(X_train, y_train)
  y_pred = knn.predict(X_test)
  print(f"The value of K = {x}")
  print('\nClassification report\n')
  print(classification_report(y_test, y_pred, target_names=target_names))

### iv) Parameter Selection using GridSearchCV

In [None]:
grid_params = {
    'n_neighbors' : [1,3,5,7,9,11,13,15,17,19,21,23,25,27,29,31,33,35],
    'weights' : ['uniform','distance'],
    'metric' : ['euclidean','manhattan']
}

gs = GridSearchCV(
  KNeighborsClassifier(),
  grid_params,
  cv = 3, # number of cross-validation folds used for each parameter combination
  n_jobs = -1, # number of processors to use; -1 uses all available
  verbose = 1 # print progress messages
)

gs_results = gs.fit(X_train, y_train)

In [None]:
print('Best Parameters\n')
print(gs_results.best_params_)

grid_predictions = gs.predict(X_test)

print('Confusion Matrix\n')
print(confusion_matrix(y_test, grid_predictions))
print('\nClassification report\n')
print(classification_report(y_test, grid_predictions, target_names=target_names))
classification_report_knn = pd.DataFrame(classification_report(y_test,grid_predictions,output_dict=True)).T
knn_with_scaler = classification_report_knn['f1-score']['accuracy']
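
Beyond the single best combination, the fitted search object keeps the cross-validated score of every combination it tried in `cv_results_`. The sketch below shows one way to inspect it. It runs on synthetic data from `make_classification` with a reduced grid, so the names `X_demo`, `y_demo`, `demo_grid` and `demo_search` are illustrative and the scores say nothing about the student dataset.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; not the student dataset.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# A reduced grid to keep the example fast.
demo_grid = {'n_neighbors': [3, 5, 7], 'weights': ['uniform', 'distance']}
demo_search = GridSearchCV(KNeighborsClassifier(), demo_grid, cv=3)
demo_search.fit(X_demo, y_demo)

# One row per parameter combination, ranked by mean cross-validated accuracy.
results = pd.DataFrame(demo_search.cv_results_)
summary = results[['param_n_neighbors', 'param_weights',
                   'mean_test_score', 'rank_test_score']]
print(summary.sort_values('rank_test_score'))
print(demo_search.best_params_)
```

Looking at the full table shows whether the best setting wins by a clear margin or whether several combinations perform almost identically.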

## **5. Classification Using Logistic Regression**

### i) Logistic Regression Model without scaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# instantiate the model (lbfgs solver, with max_iter raised from the default to 300)

logreg = LogisticRegression(solver='lbfgs', max_iter=300)

# fit the model with data
logreg.fit(X_train,y_train)

In [None]:
# Coefficient and Intercept
print(logreg.coef_)
print(logreg.intercept_)

In [None]:
# Create a dataframe from the model coefficients, one column per feature
column_names = ['age', 'Auditory', 'Kinaesthetic', 'Visual', 'ExtrinsicMotivation', 'IntrinsicMotivation', 'SelfEfficacy', 'StudyTime', 'Conscientiousness', 'CAOpoints', 'Maths', 'English']
coefficient_df = pd.DataFrame(logreg.coef_, columns=column_names)
coefficient_df

In [None]:
# Predict test set from model built during training 
y_pred = logreg.predict(X_test)

#### i) Confusion Matrix

In [None]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
ax.xaxis.set_ticklabels(['Fail', 'Pass']); ax.yaxis.set_ticklabels(['Fail', 'Pass']);

#### ii) Classification Report - Accuracy, Precision, Recall, F1-Score

In [None]:
# Build the report as a dataframe and keep the overall accuracy for the final comparison
classificationReport = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).T
log_without_scaler = classificationReport['f1-score']['accuracy']
classificationReport
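
The four metrics in the report can all be derived from the confusion matrix. The sketch below works through the arithmetic for the positive class (Pass) on a made-up matrix. The counts in `cm` are invented for illustration and are not results from this notebook.

```python
import numpy as np

# Made-up confusion matrix in scikit-learn's layout:
# rows = actual (Fail, Pass), columns = predicted (Fail, Pass).
cm = np.array([[20, 10],
               [5, 65]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all predictions that are correct
precision = tp / (tp + fp)        # of predicted Pass, how many actually passed
recall = tp / (tp + fn)           # of actual Pass, how many were identified
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```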

#### iii) ROC Curve

In [None]:
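# Use the predicted probability of the positive class (Pass) rather than hard labels,
# so the ROC curve is evaluated across all decision thresholds
y_prob = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = metrics.roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)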
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()
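
A ROC curve is traced by sweeping the decision threshold over a continuous score, so what is passed to `roc_curve` matters. The sketch below, on synthetic data from `make_classification`, contrasts class probabilities from `predict_proba` with hard 0/1 predictions. All names ending in `_demo` or `_hard` are illustrative and the numbers do not relate to the student dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the student dataset.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.33, random_state=42)
demo_model = LogisticRegression(max_iter=300).fit(X_tr, y_tr)

# Probability of the positive class for each test row.
pos_scores = demo_model.predict_proba(X_te)[:, 1]
fpr_demo, tpr_demo, thr_demo = roc_curve(y_te, pos_scores)
auc_demo = roc_auc_score(y_te, pos_scores)

# Hard labels carry only one threshold, so the curve has at most three points.
fpr_hard, tpr_hard, thr_hard = roc_curve(y_te, demo_model.predict(X_te))

print(f"AUC from probabilities = {auc_demo:.3f}")
print(f"curve points: probabilities = {len(fpr_demo)}, hard labels = {len(fpr_hard)}")
```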

### ii) Logistic Regression Model with scaler

In [None]:
X_train, X_test, y_train, y_test = train_test_split(scaled_X, y, test_size=0.33, random_state=42)

In [None]:
# instantiate the model (lbfgs solver, with max_iter raised from the default to 300)
logreg = LogisticRegression(solver='lbfgs', max_iter=300)

# fit the model with data
logreg.fit(X_train,y_train)

In [None]:
# Coefficient and Intercept
print(logreg.coef_)
print(logreg.intercept_)

In [None]:
# Create a dataframe from the model coefficients, one column per feature
column_names = ['age', 'Auditory', 'Kinaesthetic', 'Visual', 'ExtrinsicMotivation', 'IntrinsicMotivation', 'SelfEfficacy', 'StudyTime', 'Conscientiousness', 'CAOpoints', 'Maths', 'English']
coefficient_df = pd.DataFrame(logreg.coef_, columns=column_names)
coefficient_df

In [None]:
# Predict test set from model built during training 
y_pred = logreg.predict(X_test)

In [None]:
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
fig, ax = plt.subplots()
# create heatmap
sns.heatmap(pd.DataFrame(cnf_matrix), annot=True, cmap="YlGnBu" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
ax.xaxis.set_ticklabels(['Fail', 'Pass']); ax.yaxis.set_ticklabels(['Fail', 'Pass']);

In [None]:

classificationReport = pd.DataFrame(classification_report(y_test,y_pred,output_dict=True)).T
log_with_scaler = classificationReport['f1-score']['accuracy']
classificationReport

In [None]:
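# Use the predicted probability of the positive class (Pass) rather than hard labels,
# so the ROC curve is evaluated across all decision thresholds
y_prob = logreg.predict_proba(X_test)[:, 1]
logit_roc_auc = metrics.roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_prob)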
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.show()

## **6. Final Evaluation on Accuracy: KNN vs Logistic Regression**

In [None]:
# Collect the four accuracy scores in a single-row dataframe
knn_log_compare = pd.DataFrame([{
    'knn_without_scaler': knn_without_scaler,
    'knn_with_scaler': knn_with_scaler,
    'log_without_scaler': log_without_scaler,
    'log_with_scaler': log_with_scaler,
}])

In [None]:
knn_log_compare