**Homework 5: Mini-project**

**CS 412 Introduction to Machine Learning, Spring 2021, University of Illinois at Chicago**

**Due: April 29, 2021, 11:59pm**

*According to the Academic Integrity Policy of this course, all work submitted for grading must be done individually. While we encourage you to talk to your peers and learn from them, this interaction must be superficial with regards to all work submitted for grading. This means you cannot work in teams, you cannot work side-by-side, you cannot submit someone else’s work (partial or complete) as your own. In particular, note that you are guilty of academic dishonesty if you extend or receive any kind of unauthorized assistance. Absolutely no transfer of program code between students is permitted (paper or electronic), and you may not solicit code from family, friends, or online forums. Other examples of academic dishonesty include emailing your program to another student, copying-pasting code from the internet, working in a group on a homework assignment, and allowing a tutor, TA, or another individual to write an answer for you. Academic dishonesty is unacceptable, and penalties range from failure to expulsion from the university; cases are handled via the official student conduct process described at https://dos.uic.edu/community-standards/.
This homework is an individual assignment for all graduate students. Undergraduate students are allowed to work in pairs and submit one homework assignment per pair. There will be no extra credit given to undergraduate students who choose to work alone. The pairs of students who choose to work together and submit one homework assignment together still need to abide by the Academic Integrity Policy and not share or receive help from others (except each other).*


**Goal**

Gain experience utilizing machine learning methods on a real-world dataset by utilizing concepts and algorithms you have learned in class.


**Dataset**

The dataset consists of 13,930 news articles from Vox (www.vox.com). Your goal is to classify whether an article is about “politics”. You are provided a binary label vector, where ‘y=1’ means the article belongs to the politics category and ‘y=0’ means it does not. Each article is represented by a 300-dimensional word2vec vector using the GoogleNews embedding. To get the embedding, a word2vec vector is retrieved for every word in the article, and the mean vector is computed over all words.
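As a rough illustration of the averaging step, the sketch below builds one article vector from a tiny made-up embedding table. The 3-dimensional vectors, the `toy_embeddings` dictionary, and the `mean_embedding` helper are assumptions for illustration only. The real features use 300-dimensional GoogleNews word2vec vectors, and how out-of-vocabulary words were handled is not stated in this handout (the sketch simply skips them).

```python
from typing import Dict, List, Sequence

# Hypothetical 3-dimensional embedding table (the real one is 300-dimensional).
toy_embeddings: Dict[str, List[float]] = {
    "senate": [1.0, 0.0, 2.0],
    "passes": [0.0, 2.0, 0.0],
    "bill": [2.0, 1.0, 1.0],
}


def mean_embedding(words: Sequence[str], table: Dict[str, List[float]]) -> List[float]:
    """Average the vectors of all words that appear in the table."""
    vectors = [table[w] for w in words if w in table]  # assumption: skip unknown words
    if not vectors:
        raise ValueError("no word in the article has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


article = "senate passes bill today".split()
article_vector = mean_embedding(article, toy_embeddings)
print(article_vector)  # one fixed-length vector, regardless of article length
```

Averaging gives every article a vector of the same length, which is why all 13,930 articles fit in a single feature matrix.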
The download link can be found here:
https://uofi.box.com/s/w5hdeyorrvrvht1c9o42whkslv3pi808
   
The file you download should be a pickle file which you can load using the pickle module.
There are four objects in the pickle file: (1) a 13,930x300 feature matrix, (2) a 13,930-dimensional label vector, (3) a 13,930-dimensional article id vector, (4) a 13,930-dimensional article link vector. You only need to use the features (1) and labels (2) for this task.
To load the data, you can use this code stub with the .pkl file in the same directory:

```python
import pickle
with open("vox_data.pkl", "rb") as file:
    x, y, article_ids, article_links = pickle.load(file)
```
    
where x and y are the features and labels, respectively.
This dataset is a subset of the Vox dataset available on data.world. If you are interested in the original dataset with the actual article titles, text and topics, you can find it here: https://data.world/elenadata/vox-articles. The original dataset is not necessary for this task.


**Instructions**

For this homework, your task is to implement, using existing Python packages (e.g. sklearn, TensorFlow, etc.), two classifiers for binary classification that you have not already used in previous homework assignments. Some examples include Random Forest, XGBoost, and Neural Networks. In addition, you will implement one classifier you have used in your previous homework assignments (i.e., Decision Trees, SVM, Nearest Neighbors, Perceptron) and compare the performance of your three selected classifiers. Use GridSearchCV to choose the best hyperparameters, and report the final hyperparameters chosen for each classifier.
Report the accuracy of your selected classifiers with these hyperparameters on the test data.


**What to submit**

Write-up & code: You will create a Jupyter notebook for implementing your chosen classifiers. In addition to the code, make sure that you describe each step in detail. Upload a PDF of your Jupyter notebook where all code is visible.
Be sure to answer the following questions:
(a) Which three classifiers (two new, one old) did you choose?
(b) What software did you use and why did you choose it?
(c) What are the results?
Your assignment will be graded for completeness and correctness.
 

**Piazza Note 162**
**clarification about HW5**

Based on some of the questions the TA and I have received, I would like to clarify that you are expected to do the following in HW5:

1. Split the data into training and testing.

2. For each of your three algorithms:

     - tune the hyperparameters on the training data only (using GridSearchCV is one way to do it, but you are allowed to use other methods, as long as you explicitly specify which)

     - learn a model by fitting the training data with the best hyperparameters

     - report the accuracy of that model on both the training and test data

3. Compare the test accuracy of the three algorithms.
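The three steps can be sketched end to end on toy data. This is only an illustration under stated assumptions: the 1-D synthetic data, the hand-written k-nearest-neighbors classifier, and the helpers `knn_predict`, `accuracy`, and `cv_accuracy` are all hypothetical stand-ins, not the Vox data or any library API. The cross-validation loop shows by hand roughly what GridSearchCV automates.

```python
import random
from typing import List, Sequence, Tuple

Point = Tuple[float, int]  # (feature, label)


def knn_predict(train: Sequence[Point], x: float, k: int) -> int:
    """Majority vote among the k training points closest to x."""
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > len(nearest) else 0


def accuracy(train: Sequence[Point], data: Sequence[Point], k: int) -> float:
    """Fraction of points in `data` that k-NN fit on `train` labels correctly."""
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)


def cv_accuracy(train: Sequence[Point], k: int, n_folds: int = 5) -> float:
    """Mean validation accuracy over n_folds folds of the training data."""
    scores = []
    for f in range(n_folds):
        held_out = [p for i, p in enumerate(train) if i % n_folds == f]
        kept = [p for i, p in enumerate(train) if i % n_folds != f]
        scores.append(accuracy(kept, held_out, k))
    return sum(scores) / n_folds


# Toy data: label is 1 when the feature is positive, with about 10% of labels flipped.
rng = random.Random(0)
data: List[Point] = []
for _ in range(200):
    x = rng.uniform(-1.0, 1.0)
    y = int(x > 0)
    if rng.random() < 0.1:
        y = 1 - y
    data.append((x, y))

# Step 1: split once into training (80%) and test (20%) data.
rng.shuffle(data)
train, test = data[:160], data[160:]

# Step 2a: tune the hyperparameter k using the training data only.
grid = [1, 5, 15]
cv_scores = {k: cv_accuracy(train, k) for k in grid}
best_k = max(cv_scores, key=cv_scores.get)

# Step 2b and 2c: use all training data with the best k, then report both accuracies.
train_acc = accuracy(train, train, best_k)
test_acc = accuracy(train, test, best_k)
print(f"best k={best_k}, train accuracy={train_acc:.3f}, test accuracy={test_acc:.3f}")
```

The test split is touched only in the final line of step 2, which is the property the clarification asks for.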

In [51]:
import pickle
import util
import numpy as np

# features = x 
# labels = y

cv = 2

with open("vox_data.pkl", "rb") as file:
    X0, Y0, article_ids, article_links = pickle.load(file)
print("X: " + str(X0))
print("Y: " + str(Y0))
X, Y, Xte, Yte = util.splitTrainTest(X0, Y0, 5)

print("Using " + str((len(Xte) / len(X0))*100) + " Percent Test Data")

X: [[ 0.02149611  0.01773633  0.03006189 ... -0.02084504  0.02229265
  -0.02980007]
 [ 0.03902273  0.03588958  0.01906366 ... -0.0136148   0.05033489
  -0.04458755]
 [ 0.03498428  0.01380107  0.02972877 ... -0.04971654  0.06716254
  -0.04707798]
 ...
 [ 0.02555814  0.02185948  0.04247121 ... -0.01317295  0.03737029
  -0.02939126]
 [-0.09677465  0.08302101 -0.09659703 ... -0.0781685  -0.00586668
   0.01836307]
 [ 0.04420679  0.04521789  0.03093789 ... -0.02244521  0.04476547
  -0.04448839]]
Y: [0. 0. 0. ... 1. 0. 1.]
Using 20.0 Percent Test Data


In [68]:
# Multi-layer Perceptron: A Neural Network

from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

#param_grid = {'solver': ['lbfgs'],
#              'max_iter': [1000,1100,1200,1300,1400,1500,1600,1700,1800,1900,2000],
#              'alpha': 10.0 ** -np.arange(1, 10),
#              'hidden_layer_sizes':np.arange(1, 10),
#              'random_state':[0,1,2,3,4,5,6,7,8,9]}
#mlp_clf = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(5, 2), random_state=1)

param_grid = {
    'solver': ['lbfgs'],
    'alpha': 10.0 ** -np.arange(1,3),
    'hidden_layer_sizes': np.arange(1,3),
    'random_state':[0,1,2]
}

mlp_clf = GridSearchCV(MLPClassifier(), param_grid, n_jobs=-1)
mlp_clf.fit(X, Y)

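# Accuracy of the refit best model on the training data
print("Score: " + str(mlp_clf.score(X, Y)))
# Mean cross-validated accuracy of the best hyperparameter setting
print("Score: " + str(mlp_clf.best_score_))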
print("Best Params: " + str(mlp_clf.best_params_))

mlp_clf.predict(Xte)

Score: 0.8664752333094041
Score: 0.8252807169730169
Best Params: {'alpha': 0.1, 'hidden_layer_sizes': 1, 'random_state': 0, 'solver': 'lbfgs'}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


array([0., 1., 1., ..., 1., 1., 1.])

In [49]:
# Random Forest: ensemble learning that trains many decision trees and averages their predicted class probabilities

from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators' : [40, 70, 100],
    'max_depth' : [5, 15, 20],
    'max_leaf_nodes' : [5, 10, 20],
    'random_state':[0,1,2,3,4,5,6,7,8,9]
}

#rnd_clf = RandomForestClassifier(max_depth=2, max_leaf_nodes=16, random_state=0, n_jobs=-1)

rnd_clf = GridSearchCV(RandomForestClassifier(), param_grid, n_jobs=-1)

rnd_clf.fit(X, Y)
print("Score: " + str(rnd_clf.score(X, Y)))
print("Best Score: " + str(rnd_clf.best_score_))
print("Best Params: " + str(rnd_clf.best_params_))
rnd_clf.predict(Xte)

Score: 0.8380294328786791
Best Score: 0.8150498609402901
Best Params: {'max_depth': 5, 'max_leaf_nodes': 20, 'n_estimators': 100}


array([0., 1., 0., ..., 1., 1., 1.])
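To get a feel for how much work the Random Forest grid requests, the sketch below enumerates every combination in the grid from the cell above and counts the model fits. It assumes 5-fold cross-validation, which is GridSearchCV's default when `cv` is not passed; the names `rf_grid`, `candidates`, and `n_fits` are illustrative only.

```python
from itertools import product

rf_grid = {
    "n_estimators": [40, 70, 100],
    "max_depth": [5, 15, 20],
    "max_leaf_nodes": [5, 10, 20],
    "random_state": list(range(10)),
}

# Every combination of values is one candidate hyperparameter setting.
candidates = [dict(zip(rf_grid, values)) for values in product(*rf_grid.values())]

n_folds = 5  # assumption: the default number of folds when cv is not given
n_fits = len(candidates) * n_folds

print(len(candidates))  # number of hyperparameter combinations
print(n_fits)           # cross-validation fits, before the final refit

# Leaving random_state out of the grid would shrink the search tenfold.
without_seed = len(candidates) // len(rf_grid["random_state"])
print(without_seed)
```

Treating the seed as a hyperparameter multiplies the search cost without changing the model family, so fixing a single `random_state` is a common way to keep the grid small.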

In [69]:
# XGBoost: Regularized Gradient Boosting

import xgboost.sklearn as xgb

# from sklearn.model_selection import cross_val_score

# scores = cross_val_score(XGBRegressor(objective='reg:squarederror'), x, y, scoring='neg_mean_squared_error')

# root_mean_squared_error = (-scores)**0.5
# print(str(scores.mean()))


param_grid = {"subsample" : [0.5, 0.8]}

# Early-stopping settings, passed through GridSearchCV.fit to XGBClassifier.fit.
# Note: eval_set is the held-out test split, so early stopping is guided by test data;
# a validation split taken from the training data would keep the test set untouched.
fit_params = {"early_stopping_rounds" : 42,
              "eval_metric" : "error",
              "eval_set" : [[Xte, Yte]]}

# cv is an integer (2), so GridSearchCV runs stratified 2-fold cross-validation.
xgb_clf = GridSearchCV(xgb.XGBClassifier(use_label_encoder=False),
                           param_grid, cv=cv)

xgb_clf.fit(X, Y, **fit_params)

print("Score: " + str(xgb_clf.score(X, Y)))
print("Best Params: " + str(xgb_clf.best_params_))

xgb_clf.predict(Xte)

[0]	validation_0-error:0.23798
[1]	validation_0-error:0.21788
[2]	validation_0-error:0.19526
[3]	validation_0-error:0.18988
[4]	validation_0-error:0.18665
[5]	validation_0-error:0.17911
[6]	validation_0-error:0.17732
[7]	validation_0-error:0.17732
[8]	validation_0-error:0.17624
[9]	validation_0-error:0.17444
[10]	validation_0-error:0.17301
[11]	validation_0-error:0.17085
[12]	validation_0-error:0.17014
[13]	validation_0-error:0.17157
[14]	validation_0-error:0.17121
[15]	validation_0-error:0.17193
[16]	validation_0-error:0.17050
[17]	validation_0-error:0.17014
[18]	validation_0-error:0.17014
[19]	validation_0-error:0.16906
[20]	validation_0-error:0.16942
[21]	validation_0-error:0.16942
[22]	validation_0-error:0.16870
[23]	validation_0-error:0.16691
[24]	validation_0-error:0.16834
[25]	validation_0-error:0.16691
[26]	validation_0-error:0.16439
[27]	validation_0-error:0.16978
[28]	validation_0-error:0.16547
[29]	validation_0-error:0.16691
[30]	validation_0-error:0.16798
[31]	validation_0-

[57]	validation_0-error:0.15614
[58]	validation_0-error:0.15614
[59]	validation_0-error:0.15757
[60]	validation_0-error:0.15542
[61]	validation_0-error:0.15506
[62]	validation_0-error:0.15470
[63]	validation_0-error:0.15578
[64]	validation_0-error:0.15578
[65]	validation_0-error:0.15614
[66]	validation_0-error:0.15578
[67]	validation_0-error:0.15506
[68]	validation_0-error:0.15578
[69]	validation_0-error:0.15398
[70]	validation_0-error:0.15614
[71]	validation_0-error:0.15650
[72]	validation_0-error:0.15470
[73]	validation_0-error:0.15434
[74]	validation_0-error:0.15470
[75]	validation_0-error:0.15363
[76]	validation_0-error:0.15363
[77]	validation_0-error:0.15363
[78]	validation_0-error:0.15578
[79]	validation_0-error:0.15434
[80]	validation_0-error:0.15434
[81]	validation_0-error:0.15363
[82]	validation_0-error:0.15470
[83]	validation_0-error:0.15327
[84]	validation_0-error:0.15398
[85]	validation_0-error:0.15434
[86]	validation_0-error:0.15363
[87]	validation_0-error:0.15363
[88]	val

array([0, 1, 1, ..., 1, 1, 1])

In [89]:
# Support Vector Machines

from sklearn import svm

param_grid = {'C': [1, 5, 10], 'kernel': ['linear']}

svm_clf = GridSearchCV(svm.SVC(), param_grid, n_jobs=-1)

svm_clf.fit(X, Y)

print("Score: " + str(svm_clf.score(X, Y)))
print("Best Params: " + str(svm_clf.best_params_))

svm_clf.predict(Xte)

Score: 0.8554379038047379
Best Params: {'C': 1, 'kernel': 'linear'}


array([0., 1., 1., ..., 1., 1., 1.])

In [90]:
from sklearn.metrics import accuracy_score

mlp_y_pred = mlp_clf.predict(Xte)
rnd_y_pred = rnd_clf.predict(Xte)
xgb_y_pred = xgb_clf.predict(Xte)
svm_y_pred = svm_clf.predict(Xte)

# accuracy_score returns the fraction of correct predictions by default
mlp_accuracy = accuracy_score(Yte, mlp_y_pred)
rnd_accuracy = accuracy_score(Yte, rnd_y_pred)
xgb_accuracy = accuracy_score(Yte, xgb_y_pred)
svm_accuracy = accuracy_score(Yte, svm_y_pred)

#mlp_error_rate = 1 - mlp_accuracy
#rnd_error_rate = 1 - rnd_accuracy
#xgb_error_rate = 1 - xgb_accuracy
#svm_error_rate = 1 - svm_accuracy

#print("MLP Error Rate: " + str(mlp_error_rate))
#print("RND Error Rate: " + str(rnd_error_rate))
#print("XGB Error Rate: " + str(xgb_error_rate))
#print("SVM Error Rate: " + str(svm_error_rate))

def namestr(obj, namespace):
    return [name for name in namespace if namespace[name] is obj]

print("MLP Accuracy: " + str(mlp_accuracy))
print("RND Accuracy: " + str(rnd_accuracy))
print("XGB Accuracy: " + str(xgb_accuracy))
print("SVM Accuracy: " + str(svm_accuracy))

ascending = sorted([mlp_accuracy, rnd_accuracy, xgb_accuracy, svm_accuracy])
print("Best accuracy value: " + str(ascending[-1]))
print("Best accuracy name: " + str(namestr(ascending[-1], globals())))

MLP Accuracy: 0.8517587939698492
RND Accuracy: 0.8248384781048098
XGB Accuracy: 0.8575017946877244
SVM Accuracy: 0.8449389806173726
Best accuracy value: 0.8575017946877244
Best accuracy name: ['xgb_accuracy']
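Looking a variable name up through `globals()` works here but is fragile. A dictionary keyed by classifier name is one sturdier alternative; the sketch below uses the test accuracies printed above, and the names `test_accuracy` and `ranking` are illustrative.

```python
# Test accuracies copied from the printed output above.
test_accuracy = {
    "MLP": 0.8517587939698492,
    "RND": 0.8248384781048098,
    "XGB": 0.8575017946877244,
    "SVM": 0.8449389806173726,
}

# Rank classifiers from best to worst test accuracy.
ranking = sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True)
best_name, best_value = ranking[0]

for name, value in ranking:
    print(f"{name}: {value:.4f}")
print(f"Best: {best_name} ({best_value:.4f})")
```

Keeping the name and the score together avoids any dependence on which variables happen to exist in the notebook's namespace.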


Be sure to answer the following questions: 
(a) Which three classifiers (two new, one old) did you choose? 
    _I used all three of the suggested new classifiers (XGBoost, Random Forest, and a Multi-layer Perceptron neural network), plus Support Vector Machines from earlier assignments. I was able to get GridSearchCV to work on all four algorithms._
    
(b) What software did you use and why did you choose it? 
    _I used scikit-learn because I had heard good things about how easy it is to work with, and I found it straightforward to get down to working with the data._
    
(c) What are the results?
    _XGBoost achieved the best test accuracy (0.8575), followed by the Multi-layer Perceptron (0.8518), the SVM (0.8449), and Random Forest (0.8248)._