
For the analysis, the data was split into features and grades so that the grades could serve as the target value. The first- and second-period grades (G1 and G2) were removed so that the models focus specifically on predicting the final grade (G3).

Some features are also removed later on because their p-values showed no statistically significant relationship with the final grade.

During data processing, the data is split into training and testing sets.

In [1]:
pip install ucimlrepo

Collecting ucimlrepo
  Downloading ucimlrepo-0.0.3-py3-none-any.whl (7.0 kB)
Installing collected packages: ucimlrepo
Successfully installed ucimlrepo-0.0.3


In [2]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from mlxtend.preprocessing import TransactionEncoder
import seaborn as sns
from scipy.stats import pearsonr
from sklearn import tree
from sklearn.metrics import confusion_matrix

In [3]:
from ucimlrepo import fetch_ucirepo

# fetch dataset
student_performance = fetch_ucirepo(id=320)

# data (as pandas dataframes)
feats = student_performance.data.features
grades = student_performance.data.targets
size = len(feats)

# metadata
print(student_performance.metadata)

# variable information
print(student_performance.variables)


{'uci_id': 320, 'name': 'Student Performance', 'repository_url': 'https://archive.ics.uci.edu/dataset/320/student+performance', 'data_url': 'https://archive.ics.uci.edu/static/public/320/data.csv', 'abstract': 'Predict student performance in secondary education (high school). ', 'area': 'Social Science', 'tasks': ['Classification', 'Regression'], 'characteristics': ['Multivariate'], 'num_instances': 649, 'num_features': 30, 'feature_types': ['Integer'], 'demographics': ['Sex', 'Age', 'Other', 'Education Level', 'Occupation'], 'target_col': ['G1', 'G2', 'G3'], 'index_col': None, 'has_missing_values': 'no', 'missing_values_symbol': None, 'year_of_dataset_creation': 2008, 'last_updated': 'Fri Jan 05 2024', 'dataset_doi': '10.24432/C5TG7T', 'creators': ['Paulo Cortez'], 'intro_paper': {'title': 'Using data mining to predict secondary school student performance', 'authors': 'P. Cortez, A. M. G. Silva', 'published_in': 'Proceedings of 5th Annual Future Business Technology Conference', 'year'

In [4]:
print(feats)

    school sex  age address famsize Pstatus  Medu  Fedu      Mjob      Fjob  \
0       GP   F   18       U     GT3       A     4     4   at_home   teacher   
1       GP   F   17       U     GT3       T     1     1   at_home     other   
2       GP   F   15       U     LE3       T     1     1   at_home     other   
3       GP   F   15       U     GT3       T     4     2    health  services   
4       GP   F   16       U     GT3       T     3     3     other     other   
..     ...  ..  ...     ...     ...     ...   ...   ...       ...       ...   
644     MS   F   19       R     GT3       T     2     3  services     other   
645     MS   F   18       U     LE3       T     3     1   teacher  services   
646     MS   F   18       U     GT3       T     1     1     other     other   
647     MS   M   17       U     LE3       T     3     1  services  services   
648     MS   M   18       R     LE3       T     3     2  services     other   

     ... higher internet  romantic  famrel  freetim

In [5]:
print(grades)

     G1  G2  G3
0     0  11  11
1     9  11  11
2    12  13  12
3    14  14  14
4    11  13  13
..   ..  ..  ..
644  10  11  10
645  15  15  16
646  11  12   9
647  10  10  10
648  10  11  11

[649 rows x 3 columns]


Data Statistics

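Drop Insignificant Data.
Any feature with a p-value > .05 is dropped. The list below also includes goout, health, and absences, whose recorded p-values are below .05; the next cell drops them as well.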

In [6]:
'''
P-Value failures:
  famsize: .25
  Pstatus: .9847
  Fjob: .1778
  Schoolsup: .09
  Famsup: .13
  Paid: .16
  Activities: .12
  Nursery: .465
  Romantic: .21
  Family Relationships: .1
  Go out: .0255
  Health: .012
  Absences: .0199
'''


In [7]:
feats = feats.drop(['famsize','Pstatus','Fjob','schoolsup','famsup','paid','activities','nursery','romantic','famrel','goout','health','absences'],axis=1)
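
The notebook does not show how the p-values listed above were computed. As a rough sketch, assuming they came from `scipy.stats.pearsonr` (imported earlier) applied to each numerically encoded feature against G3, the screening could look like the following. The toy DataFrame, the column names `useful` and `noise`, and the helper `significant_features` are hypothetical stand-ins, not the notebook's data or code.

```python
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

def significant_features(features, target, alpha=0.05):
    """Return {column: p-value} for columns whose Pearson p-value is <= alpha."""
    kept = {}
    for col in features.columns:
        _, p_value = pearsonr(features[col], target)
        if p_value <= alpha:
            kept[col] = p_value
    return kept

# Hypothetical toy data: one column tied to the target, one pure noise
rng = np.random.default_rng(0)
target = pd.Series(rng.normal(size=200), name="G3")
toy = pd.DataFrame({
    "useful": target + rng.normal(scale=0.5, size=200),
    "noise": rng.normal(size=200),
})

kept = significant_features(toy, target)
print(sorted(kept))
```

Columns missing from `kept` would be the candidates to drop. A Pearson test only makes sense for numeric or ordinal encodings, so nominal columns such as `Mjob` may call for a different test.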

Convert categorical data into quantitative data

In [8]:
#Convert all binary categories to binary values
#LabelEncoder assigns codes in sorted order of the class labels
le = LabelEncoder()
feats['sex'] = le.fit_transform(feats['sex']) #F = 0, M = 1
print("Female:",le.transform(['F']))
feats['address'] = le.fit_transform(feats['address']) #R (rural) = 0, U (urban) = 1
print("Urban:",le.transform(['U']))
feats['school'] = le.fit_transform(feats['school']) #GP = 0, MS = 1
feats['higher'] = le.fit_transform(feats['higher']) #no = 0, yes = 1
print("Plan for higher education:",le.transform(['yes']))
feats['internet'] = le.fit_transform(feats['internet']) #no = 0, yes = 1
print("Internet:",le.transform(['yes']))


Female: [0]
Urban: [1]
Plan for higher education: [1]
Internet: [1]
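
The printed codes follow from how `LabelEncoder` works: it sorts the distinct labels and uses each label's position as its code. A minimal sketch on a hypothetical list of address values shows how to read the mapping directly from `classes_` instead of relying on comments:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["U", "R", "U", "R"])  # hypothetical sample of address values

# classes_ holds the labels in sorted order; the index is the assigned code
mapping = {str(label): int(code) for code, label in enumerate(le.classes_)}
print(mapping)
```

Printing the mapping after each `fit_transform` call is one way to keep the inline comments and the actual encoding from drifting apart.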


In [9]:
feats['Mjob'] = le.fit_transform(feats['Mjob']) #home = 0, health = 1, other = 2, services = 3, teacher = 4
print("Home,Health,Other,Services,Teacher:",le.transform(['at_home','health','other','services','teacher']))
feats['reason'] = le.fit_transform(feats['reason']) #course = 0, home = 1, other = 2, reputation = 3
print("Course,Other,Home:",le.transform(['course','other','home']))
feats['guardian'] = le.fit_transform(feats['guardian']) #mother = 1, father = 0, other = 2
print("Mother,Father,Other:",le.transform(['mother','father','other']))

Home,Health,Other,Services,Teacher: [0 1 2 3 4]
Course,Other,Home: [0 2 1]
Mother,Father,Other: [1 0 2]


In [10]:
print(feats)

     school  sex  age  address  Medu  Fedu  Mjob  reason  guardian  \
0         0    0   18        1     4     4     0       0         1   
1         0    0   17        1     1     1     0       0         0   
2         0    0   15        1     1     1     0       2         1   
3         0    0   15        1     4     2     1       1         1   
4         0    0   16        1     3     3     2       1         0   
..      ...  ...  ...      ...   ...   ...   ...     ...       ...   
644       1    0   19        0     2     3     3       0         1   
645       1    0   18        1     3     1     4       0         1   
646       1    0   18        1     1     1     2       0         1   
647       1    1   17        1     3     1     3       0         1   
648       1    1   18        0     3     2     3       0         1   

     traveltime  studytime  failures  higher  internet  freetime  Dalc  Walc  
0             2          2         0       1         0         3     1     1  
1

In [11]:
#Merge data to do a training/testing split
merged = pd.merge(feats,grades['G3'],left_index=True,right_index=True)
#Shuffle
merged = merged.sample(frac = 1)
part = int(size*9/10)
test = merged.iloc[part:]
train = merged.iloc[:part]

In [12]:
#Separate into features and grades
train_grade = train["G3"]
train_feat = train.drop(columns=["G3"])
test_grade = test["G3"]
test_feat = test.drop(columns=["G3"])
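
Because `sample(frac = 1)` is unseeded, the split above, and every accuracy figure that follows, changes between runs. Below is a minimal sketch of a reproducible version of the same 90/10 split, assuming a fixed seed is acceptable. The toy DataFrame, the helper `shuffled_split`, and the seed value 0 are illustrative choices rather than part of the notebook.

```python
import pandas as pd

# Hypothetical stand-in for the merged feature/grade table
toy = pd.DataFrame({"age": range(15, 25), "G3": range(10, 20)})

def shuffled_split(df, train_frac=0.9, seed=0):
    """Shuffle rows with a fixed seed, then cut into train and test parts."""
    shuffled = df.sample(frac=1, random_state=seed)
    part = int(len(shuffled) * train_frac)
    return shuffled.iloc[:part], shuffled.iloc[part:]

train_a, test_a = shuffled_split(toy)
train_b, test_b = shuffled_split(toy)

# The same seed gives the same rows in the same order on every call
print(train_a.equals(train_b), len(train_a), len(test_a))
```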

In [13]:
#Binarize the target: grades > 12 become 1 (passed), all others become 0
#First keep grades > 12 and zero out the rest, then set the kept grades to 1
passed = train_grade.where(train_grade>12,0)
passed = passed.where(passed<=12,1)

In [14]:
#How many passed vs failed in the training data
passed.value_counts()

0    337
1    247
Name: G3, dtype: int64

Data Visualization

In [15]:
#Correlation coefficients
merged.corr()

Unnamed: 0,school,sex,age,address,Medu,Fedu,Mjob,reason,guardian,traveltime,studytime,failures,higher,internet,freetime,Dalc,Walc,G3
school,1.0,-0.08305,0.08717,-0.35452,-0.254787,-0.209806,-0.206829,-0.109754,-0.062333,0.252936,-0.137857,0.113788,-0.136112,-0.240486,0.034666,0.047169,0.014169,-0.284294
sex,-0.08305,1.0,-0.043662,0.025503,0.119127,0.083913,0.149635,0.010732,-0.036811,0.04088,-0.206214,0.073888,-0.058134,0.065911,0.146305,0.282696,0.320785,-0.129077
age,0.08717,-0.043662,1.0,-0.025848,-0.107832,-0.12105,-0.07177,-0.025855,0.26683,0.03449,-0.008415,0.319968,-0.265497,0.013115,-0.00491,0.134768,0.086357,-0.106505
address,-0.35452,0.025503,-0.025848,1.0,0.19032,0.141493,0.159761,-0.002367,-0.019359,-0.344902,0.062023,-0.063824,0.076706,0.175794,-0.036647,-0.047304,-0.012416,0.167637
Medu,-0.254787,0.119127,-0.107832,0.19032,1.0,0.647477,0.459337,0.132855,-0.014044,-0.265079,0.097006,-0.17221,0.213896,0.266052,-0.019686,-0.007018,-0.019766,0.240151
Fedu,-0.209806,0.083913,-0.12105,0.141493,0.647477,1.0,0.290703,0.08076,-0.101764,-0.208288,0.0504,-0.165915,0.191735,0.183483,0.006841,6.1e-05,0.038445,0.2118
Mjob,-0.206829,0.149635,-0.07177,0.159761,0.459337,0.290703,1.0,0.059397,0.008196,-0.164126,0.057176,-0.117882,0.148116,0.260658,0.053927,0.049576,0.025657,0.148252
reason,-0.109754,0.010732,-0.025855,-0.002367,0.132855,0.08076,0.059397,1.0,-0.065834,-0.092522,0.135874,-0.144459,0.091324,0.110168,-0.047001,-0.010735,0.010612,0.124969
guardian,-0.062333,-0.036811,0.26683,-0.019359,-0.014044,-0.101764,0.008196,-0.065834,1.0,0.026519,-0.009911,0.169605,-0.114735,-0.000412,0.051442,0.02333,-0.008312,-0.079609
traveltime,0.252936,0.04088,0.03449,-0.344902,-0.265079,-0.208288,-0.164126,-0.092522,0.026519,1.0,-0.063154,0.09773,-0.071958,-0.190826,0.000937,0.092824,0.057007,-0.127173


In [16]:
#Hyperparameters
learning_rate = 1e-2
num_epochs = 10

In [17]:
def sigmoid(a):
  return 1/(1+np.e**(-a))

In [18]:
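def update(w,X,Y,b):
  #Forward pass: predicted probability of passing for every training row
  A = sigmoid(np.dot(X,np.transpose(w)) + b)

  n = len(X)
  #Binary cross-entropy cost (computed for reference; not returned or used)
  cost=-1/n * np.sum(Y * np.log(A) + (1-Y) * (np.log(1-A)))

  #Gradients of the cost with respect to the weights and the bias
  dw = np.dot((A-Y).T,X)/n
  db= np.sum(A-Y)/n

  #One gradient descent step
  w = w - learning_rate*dw
  b = b - learning_rate*db
  return (w,b)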
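
The update above is one step of batch gradient descent on the binary cross-entropy loss. The following is a self-contained sketch of the same step on synthetic data, with two changes that are assumptions of this sketch rather than the notebook's method: the features are standardized first, and more epochs are run with a larger learning rate. With raw, unscaled columns and only 10 epochs, a model like this may not move far from its all-zero starting weights. The name `gd_step` and the data are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gd_step(w, b, X, y, lr):
    """One batch gradient descent step on binary cross-entropy."""
    p = sigmoid(X @ w + b)            # predicted probabilities
    grad_w = X.T @ (p - y) / len(y)   # gradient with respect to the weights
    grad_b = np.mean(p - y)           # gradient with respect to the bias
    return w - lr * grad_w, b - lr * grad_b

# Synthetic data (hypothetical, not the student data): label depends on column 0
rng = np.random.default_rng(0)
X = rng.normal(loc=17.0, scale=2.0, size=(500, 3))
y = (X[:, 0] > 17.0).astype(float)

# Standardize each column to zero mean and unit variance
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    w, b = gd_step(w, b, X_std, y, lr=0.1)

accuracy = np.mean((sigmoid(X_std @ w + b) > 0.5) == y)
print(f"training accuracy: {accuracy:.2f}")
```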

In [19]:
def test(w,b):
  ind = 0
  correct = 0
  for _, row in test_feat.iterrows():
    pred = 1 if sigmoid(np.dot(np.transpose(w),row) + b) > 0.5 else 0
    act = 1 if test_grade.iloc[ind] > 12 else 0
    print("Predicted:",pred)
    print("Actual:",act)
    if(pred == act):
      correct+=1
    ind+=1
  print('Test Accuracy: {0:0.2f}%'.format(100*(correct/len(test_grade))))

Batch Gradient Descent

In [20]:
b = 0
w = [0] * train_feat.shape[1]

for i in range(num_epochs):
  w,b=update(w,train_feat,passed,b)

In [21]:
test(w,b)

Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 0
Predicted: 0
Actual: 1
Predicted: 0
Actual: 1
Predicted: 

Task 1

In [22]:
def classifier_test(classifier,feat=test_feat,grade=test_grade):
  ind = 0
  correct = 0
  for _, row in feat.iterrows():
    #row = np.reshape(row.to_numpy(),(1,-1))
    row = row.values.reshape((1,-1))
    pred = classifier.predict(row)
    act = 1 if grade.iloc[ind] > 12 else 0
    print("Predicted:",pred)
    print("Actual:",act)
    if(pred == act):
      correct+=1
    ind+=1
  print('Test Accuracy: {0:0.2f}%'.format(100*(correct/len(grade))))
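
`classifier_test` predicts one row at a time and prints every prediction. As a sketch of an equivalent vectorized check, scikit-learn's `accuracy_score` can compare all predictions in one call. The toy data and the `passed_threshold` name are hypothetical; the `> 12` cutoff matches the one used above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

passed_threshold = 12  # same cutoff as the notebook: G3 > 12 counts as a pass

# Hypothetical toy features and grades
feat = pd.DataFrame({
    "studytime": [1, 2, 3, 4, 1, 4],
    "failures": [2, 1, 0, 0, 3, 0],
})
grade = pd.Series([8, 10, 14, 16, 7, 15])
labels = (grade > passed_threshold).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(feat, labels)

# Predict every row in one call, then score against the binarized grades
pred = clf.predict(feat)
acc = accuracy_score(labels, pred)
print("Accuracy: {0:0.2f}%".format(100 * acc))
```

Passing a DataFrame to `predict` also keeps the feature names the classifier was fitted with, which the row-by-row `reshape` discards.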

In [23]:
classification_tree = tree.DecisionTreeClassifier()
classification_tree = classification_tree.fit(train_feat, passed)

In [24]:
classifier_test(classification_tree)

Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0




Modified parameters

In [25]:
#Merge data to modify uniformly
modified = pd.concat([train_feat,test_feat])

In [26]:
modified['age'] = modified['age'].apply(np.sqrt)

In [27]:
#Split back into test/train
test_mod = modified.iloc[part:]
train_mod = modified.iloc[:part]

In [28]:
mod_tree = tree.DecisionTreeClassifier()
mod_tree = mod_tree.fit(train_mod, passed)

In [29]:
classifier_test(mod_tree,feat=test_mod)



Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0




In [30]:
#Merge data to modify uniformly
modified = pd.concat([train_feat,test_feat])

In [31]:
def square(X):
  return np.power(X,2)

In [32]:
modified['age'] = modified['age'].apply(square)

In [33]:
#Split back into test/train
test_mod = modified.iloc[part:]
train_mod = modified.iloc[:part]

In [34]:
mod_tree = tree.DecisionTreeClassifier()
mod_tree = mod_tree.fit(train_mod, passed)

In [35]:
classifier_test(mod_tree,feat=test_mod)



Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1




Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Test Accuracy: 55.38%




In [36]:
#Merge data to modify uniformly
modified = pd.concat([train_feat,test_feat])

In [37]:
def minus1(X):
  return X-1

In [38]:
modified['Dalc'] = modified['Dalc'].apply(minus1)

In [39]:
modified['Walc'] = modified['Walc'].apply(minus1)

In [40]:
#Split back into test/train
test_mod = modified.iloc[part:]
train_mod = modified.iloc[:part]

In [41]:
mod_tree = tree.DecisionTreeClassifier()
mod_tree = mod_tree.fit(train_mod, passed)

In [42]:
classifier_test(mod_tree,feat=test_mod)



Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0





Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Test Accuracy: 55.38%




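Modifications tried:
1.   Replacing age with sqrt(age). Resulted in an overall decrease in accuracy in the runs tried. A decision tree splits on thresholds, so a monotonic transformation such as a square root leaves the ordering of ages, and therefore the available splits, unchanged. The change in accuracy more likely reflects the run-to-run randomness of the unseeded tree than the compressed differences in age.
2.   Replacing age with age^2. Also decreased accuracy overall in the runs tried. Squaring stretches the differences in age but is likewise monotonic for positive ages, so the same caveat applies.
3.   Subtracting 1 from both weekday and weekend alcohol consumption. The results fluctuate around those of the unmodified data. Sometimes the accuracy is higher, but overall it tends to be lower. The model does not care about the actual values on the 1-5 scale, only about their order, so shifting the scale to 0-4 leaves the tree with the same splits and the fluctuation comes from randomness between runs.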
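
To separate the effect of a feature transformation from the tree's own randomness, both trees can be given the same `random_state`. The sketch below uses hypothetical synthetic features (integer ages and a 1-5 scale) and compares predictions on the training rows before and after squaring the age column. The depth limit and the 10% label noise are arbitrary choices for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical integer-valued features, loosely mimicking age and a 1-5 scale
age = rng.integers(15, 23, size=300)
scale = rng.integers(1, 6, size=300)
X = np.column_stack([age, scale]).astype(float)

# Labels depend on both features, with 10% of them flipped as noise
noise = rng.random(300) < 0.1
y = (((age < 18) & (scale > 2)) ^ noise).astype(int)

X_sq = X.copy()
X_sq[:, 0] = X_sq[:, 0] ** 2  # age -> age^2 (monotonic for positive ages)

# The same seed removes the tree's own randomness from the comparison
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_sq = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_sq, y)

same = np.array_equal(tree_raw.predict(X), tree_sq.predict(X_sq))
print("identical predictions on the training rows:", same)
```

Repeating the notebook's experiments with a fixed `random_state` on every tree would show whether any accuracy difference survives once this source of variation is removed.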



Task 2

Bagging

In [43]:
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import AdaBoostClassifier

In [44]:
bag = BaggingClassifier(n_estimators=50)

#Cross Validation
bag_kfold = cross_val_score(bag, train_feat, passed, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')

#Train model
bag.fit(train_feat, passed)

In [45]:
classifier_test(bag)



Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1




Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0




Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 1
Predicted: [0]
Actual: 0
Test Accuracy: 66.15%




Boosting

In [46]:
ada = AdaBoostClassifier(n_estimators=4, random_state=0, algorithm='SAMME')

#Cross Validation
ada_kfold = cross_val_score(ada, train_feat, passed, scoring='accuracy', cv=10, n_jobs=-1, error_score='raise')

ada.fit(train_feat,passed)

In [47]:
classifier_test(ada)



Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 1
Predicted: [1]
Actual: 0
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0
Predicted: [1]
Actual: 1
Predicted: [0]
Actual: 0
Predicted: [1]
Actual: 0




In [48]:
print(bag_kfold)
print(ada_kfold)
print(bag_kfold-ada_kfold)

[0.62711864 0.6440678  0.72881356 0.6440678  0.62068966 0.70689655
 0.68965517 0.68965517 0.65517241 0.70689655]
[0.66101695 0.55932203 0.6779661  0.62711864 0.51724138 0.60344828
 0.60344828 0.75862069 0.53448276 0.60344828]
[-0.03389831  0.08474576  0.05084746  0.01694915  0.10344828  0.10344828
  0.0862069  -0.06896552  0.12068966  0.10344828]


In [49]:
print(bag_kfold.mean(),bag_kfold.std())
print(ada_kfold.mean(),ada_kfold.std())

0.6713033313851549 0.03573870051953867
0.6146113383985972 0.06811364034024564


Across the 10-fold cross-validation, the bagging model had a higher mean accuracy than the boosting model (0.671 versus 0.615). In the run shown, the bagging model also had the lower standard deviation (0.036 versus 0.068), making it the more consistent of the two. Boosting achieved a higher accuracy than bagging in only 2 of the 10 folds. Because the data shuffle and the bagging model are unseeded, these figures vary from run to run. Overall the two tend to get similar results in accuracy and standard deviation, but bagging may sometimes achieve higher values while boosting stays relatively the same.
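
One way to make this comparison repeatable is to score both models on the same seeded folds. The following is a sketch under stated assumptions: the data comes from `make_classification` as a hypothetical stand-in for `train_feat` and `passed`, and the seeds are arbitrary. The estimator counts mirror the ones used above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical synthetic data standing in for train_feat / passed
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# One seeded splitter so both models are scored on identical folds
folds = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

models = {
    "bagging": BaggingClassifier(n_estimators=50, random_state=0),
    "boosting": AdaBoostClassifier(n_estimators=4, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, scoring="accuracy", cv=folds)
    for name, model in models.items()
}
for name, s in scores.items():
    print(f"{name}: mean={s.mean():.3f} std={s.std():.3f}")
```

With shared folds, the per-fold difference `scores["bagging"] - scores["boosting"]` is a paired comparison, which is easier to interpret than two independent sets of scores.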

Task 3

In [50]:
test_passed = test_grade.where(test_grade>12,0)
test_passed = test_passed.where(test_passed<=12,1)

In [51]:
#How many passed vs failed in the test data
test_passed.value_counts()

0    36
1    29
Name: G3, dtype: int64

In [52]:
test_passed = test_passed.values.reshape((-1,1))

In [53]:
def confusion(classifier):
  pred = classifier.predict(test_feat)
  cm = confusion_matrix(test_passed, pred)
  return cm

In [54]:
#Generate confusion matrices
#t for tree, b for bag, a for ada
tcm = confusion(classification_tree)
bcm = confusion(bag)
acm = confusion(ada)
print(tcm)
print(bcm)
print(acm)

[[24 12]
 [13 16]]
[[23 13]
 [ 9 20]]
[[11 25]
 [ 1 28]]


In [55]:
ttn, tfp, tfn, ttp = tcm.ravel()
btn, bfp, bfn, btp = bcm.ravel()
atn, afp, afn, atp = acm.ravel()

The metric I will use for this is the false positive rate. Since the model predicts whether a student will pass their class, an incorrect prediction can be harmful to the student. If a student is falsely believed to be on track to pass the class, they may not be provided the resources they need to succeed. A student who is predicted to fail can be provided supplementary resources by the school to help boost their grade. Note that the cell below divides the false positives by the total number of test samples, which gives the share of all test students who were false positives rather than the conventional false positive rate FP / (FP + TN).

In [56]:
print(tfp/len(test_passed))
print(bfp/len(test_passed))
print(afp/len(test_passed))

0.18461538461538463
0.2
0.38461538461538464


Across the three models, the single decision tree has the lowest false positive rate in the run shown (0.185), with the bagging model close behind (0.200). The boosting model has by far the highest (0.385).

Choosing a different evaluation metric could invert the results entirely. Choosing the false negative rate instead would likely do just that.

In [57]:
print(tfn/len(test_passed))
print(bfn/len(test_passed))
print(afn/len(test_passed))

0.2
0.13846153846153847
0.015384615384615385


With false negative rate as the evaluation metric, the boosting model outperforms the other two by a wide margin, oftentimes getting 0% on the test data. The default and bagging methods get roughly the same values; however, bagging still tends to perform better than the default tree classifier on this metric.
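
The ratios printed in the cells above divide by the total number of test samples. As a sketch of the conventional definitions, the false positive rate FP / (FP + TN) and the false negative rate FN / (FN + TP) can be computed from the confusion matrices printed earlier. The helper name `rates` is hypothetical; the matrices are copied from the output shown.

```python
import numpy as np

# Confusion matrices copied from the output above (rows: actual, columns: predicted)
matrices = {
    "tree": np.array([[24, 12], [13, 16]]),
    "bagging": np.array([[23, 13], [9, 20]]),
    "boosting": np.array([[11, 25], [1, 28]]),
}

def rates(cm):
    """Return (false positive rate, false negative rate) for a 2x2 matrix."""
    tn, fp, fn, tp = cm.ravel()
    fpr = fp / (fp + tn)  # share of actual failures predicted to pass
    fnr = fn / (fn + tp)  # share of actual passes predicted to fail
    return fpr, fnr

results = {name: rates(cm) for name, cm in matrices.items()}
for name, (fpr, fnr) in results.items():
    print(f"{name}: FPR={fpr:.3f} FNR={fnr:.3f}")
```

Since all three models are scored on the same test set, each rate has the same denominator for every model, so the ranking of the models is the same under either normalization. Only the magnitudes change.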

Note: Since the test data is randomly assigned, the pass and fail classes are sometimes imbalanced. This seems to affect the boosting model far more than the other two. In one instance, the boosting model did not predict a positive result a single time. The bagging model seemed mostly unaffected by the data imbalance.
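
One common way to limit this kind of imbalance is a stratified split, which keeps the pass/fail proportion nearly equal in the training and test parts. The sketch below uses scikit-learn's `train_test_split` with `stratify`. The random features and labels are hypothetical, generated with roughly the 58/42 fail/pass mix seen in the training data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical features and labels, about 42% of them marked as passed
rng = np.random.default_rng(0)
X = rng.normal(size=(649, 5))
y = (rng.random(649) < 0.42).astype(int)

# stratify=y keeps the pass/fail proportion nearly equal in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, stratify=y, random_state=0
)
print("train pass share:", round(y_train.mean(), 3))
print("test pass share:", round(y_test.mean(), 3))
```

A stratified split does not remove the class imbalance itself. It only prevents the test set from drifting away from the training distribution by chance.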