# Problem 0 - Jhanvi

As we have covered in class, we are training a logistic regression model to predict if
someone will click on an advertisement. Consider the logistic regression model with 3 features and
weights w = [1, −30, 3].

For the dataset with features and labels

* x1 = [20, 0, 0], y1 = 1
* x2 = [23, 1, 1], y2 = 0

answer the following:

* Compute the probabilities that the logistic regression assigns to these two customers clicking on the advertisement (i.e. y = 1).
* Compute the cross-entropy loss of this logistic regression.
* Design a decision stump (a decision tree of depth 1) that splits on the first feature. What is the Gini impurity of the root? What is the Gini impurity after the best split that you find?

In [61]:
import numpy as np
import math

w = ([1, -30, 3])
x = ([20, 0, 0], [23, 1, 1])
y = ([1, 0])

def logistic_reg(w, x):
    col_x = np.transpose(x)
    scores = np.dot(w, col_x)

    sigmoid_1 = float(1 / (1 + math.exp(-1 * scores[0])))
    sigmoid_2 = float(1 / (1 + math.exp(-1 * scores[1])))

    sigmoid_matrix = np.array([sigmoid_1, sigmoid_2])

    return sigmoid_matrix

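def cross_entropy(y, sigmoid_matrix):
    # NOTE: this evaluates -log(y1*p1 + y2*p2), which keeps only the probability
    # of the sample labeled y=1. The full binary cross-entropy also includes the
    # -log(1 - p) term for samples labeled y=0, so this value understates the loss.
    probability = (y[0] * sigmoid_matrix[0]) + (y[1] * sigmoid_matrix[1])
    return -1 * math.log(probability)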

sigmoid_matrix = logistic_reg(w, x)
loss = cross_entropy(np.transpose(y), sigmoid_matrix)

print("The logistic regression assigned the following probabilities to the customers: ")
print(sigmoid_matrix)
print("The cross entropy loss is: ")
print(loss)

print("\nTo reflect, the model predicts that the first customer will click on an advertisement with very high probability and is correct. As such, the loss is very small.")

def decision_tree(x, y):
    # Decision stump on the first feature: the threshold is the midpoint
    # between the two samples' first-feature values.
    split = (x[0][0] + x[1][0]) / 2
    print("x[0][0]: " + str(x[0][0]))
    print("x[1][0]: " + str(x[1][0]))

    # Collect the class labels that fall on each side of the threshold.
    left = [label for row, label in zip(x, y) if row[0] <= split]
    right = [label for row, label in zip(x, y) if row[0] > split]

    return left, right, split

def gini_impurity(labels):
    # Gini impurity of one bucket: 1 - sum_k p_k^2 over the classes present.
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    proportions = counts / len(labels)
    return float(1 - np.sum(proportions ** 2))

left, right, split = decision_tree(x, y)
gini = (gini_impurity(left), gini_impurity(right))
print("The gini impurity for the left and right buckets at a split of {} is {}".format(split, gini))
print("\nThe impurity calculated is 0, which is the best possible Gini impurity for a bucket, which indicates that all the elements in that bucket are of the same class.")

The logistic regression assigned the following probabilities to the customers: 
[1.         0.01798621]
The cross entropy loss is: 
2.0611536942919273e-09

To reflect, the model predicts that the first customer will click on an advertisement with very high probability and is correct. As such, the loss is very small.
x[0][0]: 20
x[1][0]: 23
The gini impurity for the left and right buckets at a split of 21.5 is (0.0, 0.0)

The impurity calculated is 0, which is the best possible Gini impurity for a bucket, which indicates that all the elements in that bucket are of the same class.
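
The `cross_entropy` function above evaluates -log(y1·p1 + y2·p2), so the second customer (y = 0) does not contribute, and the stump code reports only the impurity after the split. Below is a minimal sketch of both quantities the problem asks for, using only the standard library. The helper names `binary_cross_entropy` and `gini` are hypothetical, and the loss is summed over the two samples (divide by 2 for the mean).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, p_pred):
    # Sum over samples of -[y*log(p) + (1 - y)*log(1 - p)].
    total = 0.0
    for y, p in zip(y_true, p_pred):
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def gini(labels):
    # Gini impurity 1 - sum_k p_k^2 of a non-empty list of class labels.
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

w = [1, -30, 3]
X = [[20, 0, 0], [23, 1, 1]]
y = [1, 0]

scores = [sum(wi * xi for wi, xi in zip(w, row)) for row in X]
probs = [sigmoid(z) for z in scores]
loss = binary_cross_entropy(y, probs)

# Root holds both labels; a split at 21.5 on the first feature isolates each one.
root_gini = gini(y)
left, right = [y[0]], [y[1]]
split_gini = (len(left) * gini(left) + len(right) * gini(right)) / len(y)

print(scores, probs, loss)
print(root_gini, split_gini)
```

With these definitions the scores are 20 and -4, the summed loss is about 0.018 (almost entirely from the second customer), the root impurity is 0.5, and the weighted impurity after the split is 0.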


# Problem 1: Logistic Regression and CIFAR-10. - Jhanvi
In this problem you will explore the CIFAR-10 dataset, and you will use multinomial (multi-class) logistic regression to try to classify it. You will also explore visualizing the solution.

* (Optional) You can read about the CIFAR-10 and CIFAR-100 datasets here: https://www.cs.toronto.edu/~kriz/cifar.html.
* (Optional) OpenML curates a number of data sets. You will use a subset of CIFAR-10 provided by them. Read here for a description: https://www.openml.org/d/40926.
* Use the fetch_openml command from sklearn.datasets to import the CIFAR-10-Small data set.
* Figure out how to display some of the images in this data set, and display a couple. While not high resolution, these should be recognizable if you are doing it correctly.
* There are 20,000 data points. Do a 3/4 - 1/4 train-test split.
* You will run multi-class logistic regression on these using the cross-entropy loss. You have to request this explicitly (multi_class='multinomial'). Use cross validation to see how good your accuracy can be. In this case, cross validate to find as good regularization coefficients as you can, for ℓ1 and ℓ2 regularization (called penalties), which are naturally supported in sklearn.linear_model.LogisticRegression. I recommend you use the solver saga.
* Report your training and test loss from above.
* How sparse can you make your solutions without deteriorating your testing error too much? Here, we ask for a sparse solution that has test accuracy that is close to the best solution you found.
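
To display an image, each flat 3,072-value row has to be turned back into a height × width × channel array. The sketch below assumes each row stores all red values first, then green, then blue, each in row-major order, as in the original CIFAR-10 files. Whether the OpenML subset keeps that layout is an assumption to confirm by looking at the result. `row_to_image` is a hypothetical helper name.

```python
import numpy as np

def row_to_image(row, side=32, channels=3):
    """Convert a flat channel-major pixel row into a (side, side, channels) uint8 image."""
    arr = np.asarray(row).astype("uint8")
    # (channels, height, width) -> (height, width, channels), the layout imshow expects.
    return arr.reshape(channels, side, side).transpose(1, 2, 0)

# Synthetic stand-in for one dataset row: 3 * 32 * 32 = 3072 values.
fake_row = np.arange(3072) % 256
image = row_to_image(fake_row)
print(image.shape)
```

Under the same layout assumption, this gives the same array as a Fortran-order reshape to (32, 32, 3) followed by swapping the first two axes.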

In [2]:
from sklearn import model_selection
from matplotlib import pyplot as plt
from sklearn.datasets import fetch_openml
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import metrics
import numpy as np
from sklearn.linear_model import LogisticRegression
import time
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_validate

print("Fetching data + {}".format(time.time()))
# Fetch the data
cifar = fetch_openml('cifar_10_small')
cifar['categories'] = {
    '0' : 'airplane',
    '1' : 'automobile',
    '2' : 'bird',
    '3' : 'cat',
    '4' : 'deer',
    '5' : 'dog',
    '6' : 'frog',
    '7' : 'horse',
    '8' : 'ship',
    '9' : 'truck',
}

print("Splitting data + {}".format(time.time()))
# Train/test split: splitting data and target in one call keeps the rows aligned
X_train, X_test, Y_train, Y_test = train_test_split(
    cifar['data'], cifar['target'], test_size=0.25, random_state=0)

train_labels = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
# distribution = len(train_labels)
# for category, size in zip(distribution.index, distribution.values):
#     print(f"{category} {size} images")

# plt.figure(figsize=(10, 5))
# train_labels["label"].value_counts().plot(kind='bar', title='Distribution of classes')

## image display - work by Jackson
def display_image_grid(dataset,
                        grid_width  = 5,
                        grid_height = 5,
                        img_width   = 32,
                        img_height  = 32,
                        figsize = 2.0,
                        _format = 'RGB'):
    fig, ax = plt.subplots(grid_height, grid_width, figsize=(figsize*grid_width, figsize*grid_height), facecolor='gray')

    for m in range(grid_height):
        for n in range(grid_width):
            i = np.random.choice(len(dataset['data']))
            ax[m][n].set_axis_off()

            if type(dataset['categories']) == dict:
                ax[m][n].set_title('%s: %s'%(i,dataset['categories'][dataset['target'].iloc[i]]))
            else:
                ax[m][n].set_title('%s: %s'%(i, dataset['target'].iloc[i]))

            im = np.array(dataset['data'].iloc[i]).astype('uint8')
            if _format == 'RGB':
                im = im.reshape((img_width, img_height, 3), order='F')
                im = np.swapaxes(im, 0, 1)
                ax[m][n].imshow(im)

            elif _format == 'grayscale':
                im = im.reshape((img_width, img_height), order='F')
                im = np.swapaxes(im, 0, 1)
                ax[m][n].imshow(im, cmap='gray')
            else:
                raise Exception('_format MUST be either RGB or grayscale')

train = {
    'data': X_train,
    'target': Y_train,
    'categories': cifar['categories']
}
# display_image_grid(train, grid_width = 5, grid_height = 3, figsize=3)

print("Skipped image creation + {}".format(time.time()))

print("starting log reg + {}".format(time.time()))

# # Logistic Regression
log_reg_model = LogisticRegression(penalty='elasticnet',solver='saga',multi_class='multinomial', verbose=1, l1_ratio=0.5)
#log_reg_model.fit(X_train, Y_train)
#predictions = log_reg_model.predict(X_test)
# scores = model_selection.cross_val_score(log_reg_model, X_train, Y_train, cv=10)
# print(scores)
# print('average score: {}'.format(scores.mean()))

# NOTE: the solver log reports "max_iter reached" for every fit, so none of these
# models has converged; scaling the pixel values or raising max_iter may change the scores.
preds = model_selection.cross_val_predict(log_reg_model, X_train, Y_train, cv=10)
print('\n with l1_ratio 0.5: cross-validated accuracy')
print(metrics.accuracy_score(Y_train, preds))

log_reg_model_2 = LogisticRegression(penalty='elasticnet',solver='saga',multi_class='multinomial', verbose=1, l1_ratio=1)
preds2 = model_selection.cross_val_predict(log_reg_model_2, X_train, Y_train, cv=10)
print('\n with l1_ratio 1: cross-validated accuracy')
print(metrics.accuracy_score(Y_train, preds2))

log_reg_model_3 = LogisticRegression(penalty='elasticnet',solver='saga',multi_class='multinomial', verbose=1, l1_ratio=0)
preds3 = model_selection.cross_val_predict(log_reg_model_3, X_train, Y_train, cv=10)
print('\n with l1_ratio 0: cross-validated accuracy')
print(metrics.accuracy_score(Y_train, preds3))

log_reg_model_4 = LogisticRegression(penalty='elasticnet',solver='saga',multi_class='multinomial', verbose=1, l1_ratio=0.85)
preds4 = model_selection.cross_val_predict(log_reg_model_4, X_train, Y_train, cv=10)
print('\n with l1_ratio 0.85: cross-validated accuracy')
print(metrics.accuracy_score(Y_train, preds4))
print(log_reg_model_4)

# Refit the chosen model on the full training set and score it on the held-out test set.
log_reg_model_4.fit(X_train, Y_train)
print("\n Using the model with the best cross-validated accuracy: the corresponding test accuracy: ")
predictions = log_reg_model_4.predict(X_test)
print(metrics.accuracy_score(Y_test, predictions))
print(log_reg_model_4)

Fetching data + 1665416577.0736032
Splitting data + 1665416845.441797
Skipped image creation + 1665416845.614722
starting log reg + 1665416845.614774


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 247 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.1min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 286 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.8min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 273 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 281 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 282 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Epoch 1, change: 1.00000000
Epoch 2, change: 0.29057382
Epoch 3, change: 0.17171912
Epoch 4, change: 0.13320643
Epoch 5, change: 0.11206841
Epoch 6, change: 0.09193390
Epoch 7, change: 0.08154919
Epoch 8, change: 0.06753830
Epoch 9, change: 0.06351508
Epoch 10, change: 0.05537504
Epoch 11, change: 0.05149556
Epoch 12, change: 0.04640000
Epoch 13, change: 0.04257978
Epoch 14, change: 0.03888481
Epoch 15, change: 0.03541193
Epoch 16, change: 0.03351029
Epoch 17, change: 0.03087188
Epoch 18, change: 0.02911799
Epoch 19, change: 0.02717832
Epoch 20, change: 0.02556634
Epoch 21, change: 0.02434719
Epoch 22, change: 0.02380851
Epoch 23, change: 0.02276080
Epoch 24, change: 0.02208480
Epoch 25, change: 0.02136580
Epoch 26, change: 0.02038001
Epoch 27, change: 0.02013560
Epoch 28, change: 0.01932753
Epoch 29, change: 0.01882016
Epoch 30, change: 0.01846236
Epoch 31, change: 0.01777304
Epoch 32, change: 0.01740001
Epoch 33, change: 0.01690220
Epoch 34, change: 0.01638886
Epoch 35, change: 0.015

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 276 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 283 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.7min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 277 seconds


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 274 seconds

 with l1 ration 1: 
0.3711333333333333
LogisticRegression(l1_ratio=0.85, multi_class='multinomial',
                   penalty='elasticnet', solver='saga', verbose=1)


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  4.6min finished
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


max_iter reached after 307 seconds

 HELLO
0.3824
LogisticRegression(l1_ratio=0.85, multi_class='multinomial',
                   penalty='elasticnet', solver='saga', verbose=1)


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  5.1min finished


In [2]:
print("\nWith an l1_ratio of 0.5: 0.37106666666666666 was the cross-validated accuracy.")
print("\nWith an l1_ratio of 0: 0.37126666666666667 was the cross-validated accuracy.")
print("\nWith an l1_ratio of 1: 0.37153333333333333 was the cross-validated accuracy.")
print("\nWith an l1_ratio of 0.85: 0.3716 was the cross-validated accuracy.")
print("\nAs such, setting the elastic-net mixing parameter l1_ratio to 0.85 gave the best cross-validated accuracy, though all four values lie within 0.001 of one another. With that model, the test accuracy is 0.3824.")
print("\nThe elastic-net penalty combines the l1 and l2 norms (mixing Lasso and Ridge regularization); the l1 component drives coefficients to exactly zero and allows a sparser model.")


With an l1_ratio of 0.5: 0.37106666666666666 was the cross-validated accuracy.

With an l1_ratio of 0: 0.37126666666666667 was the cross-validated accuracy.

With an l1_ratio of 1: 0.37153333333333333 was the cross-validated accuracy.

With an l1_ratio of 0.85: 0.3716 was the cross-validated accuracy.

As such, setting the elastic-net mixing parameter l1_ratio to 0.85 gave the best cross-validated accuracy, though all four values lie within 0.001 of one another. With that model, the test accuracy is 0.3824.

The elastic-net penalty combines the l1 and l2 norms (mixing Lasso and Ridge regularization); the l1 component drives coefficients to exactly zero and allows a sparser model.
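
The cells above report accuracy only, while the problem also asks for the loss and for how sparse the solution is. Below is a minimal sketch of both measurements, using toy arrays in place of real model output. `cross_entropy_loss` and `sparsity` are hypothetical helper names, and `y_index` is assumed to hold each sample's column index in the probability matrix.

```python
import numpy as np

def cross_entropy_loss(proba, y_index):
    """Mean negative log-probability assigned to the true class."""
    proba = np.clip(np.asarray(proba, dtype=float), 1e-15, 1.0)
    rows = np.arange(len(y_index))
    return float(-np.mean(np.log(proba[rows, y_index])))

def sparsity(coef, tol=0.0):
    """Fraction of coefficients whose magnitude is at most tol."""
    coef = np.asarray(coef)
    return float(np.mean(np.abs(coef) <= tol))

# Toy values standing in for predicted class probabilities and fitted coefficients.
proba = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
y_index = np.array([0, 1])
coef = np.array([[0.0, 0.5, 0.0],
                 [1.2, 0.0, 0.0]])

loss = cross_entropy_loss(proba, y_index)
zero_fraction = sparsity(coef)
print(loss, zero_fraction)
```

With a fitted scikit-learn model, `predict_proba` supplies the probability matrix (columns ordered as in `classes_`) and `coef_` supplies the coefficients; `sklearn.metrics.log_loss` computes the same loss directly from labels and probabilities.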


# Problem 2: Multi-class Logistic Regression – Visualizing the Solution.  - Josh
You will repeat the previous problem but for the MNIST dataset, which you will find here: https://www.openml.org/d/554. MNIST is a dataset of handwritten digits, and is considered one of the easiest image recognition problems in computer vision. We will see here how well logistic regression does, as you did above on the CIFAR-10 subset. In addition, we will see that we can visualize the solution, and that in connection to this, sparsity can be useful.

* Use the fetch_openml command from sklearn.datasets to import the MNIST data set.
* Choose a reasonable train-test split, and again run multi-class logistic regression on these using the cross-entropy loss, as you did above. Try to optimize the hyperparameters.
* Report your training and test loss from above.
* Choose an ℓ1 regularizer (penalty), and see if you can get a sparse solution with almost as good accuracy.
* Note that in Logistic Regression, the coefficients returned (i.e., the β’s) are the same dimension as the data. Therefore we can pretend that the coefficients of the solution are an image of the same dimension, and plot it. Do this for the 10 sets of coefficients that correspond to the 10 classes. You should observe that, at least for the sparse solutions, these “kind of” look like the digits they are classifying.
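
The coefficient-as-image idea in the last bullet amounts to reshaping each class's 784 weights into a 28 × 28 grid. Below is a minimal sketch with random numbers standing in for a fitted coefficient matrix. `coef_to_images` is a hypothetical helper, and row-major pixel order is an assumption about how the data set flattens its images.

```python
import numpy as np

def coef_to_images(coef, side=28):
    """Reshape an (n_classes, side*side) coefficient matrix into per-class images."""
    coef = np.asarray(coef, dtype=float)
    images = coef.reshape(coef.shape[0], side, side)
    # A symmetric scale keeps zero weights at the midpoint of a diverging colormap.
    scale = float(np.abs(images).max())
    return images, scale

rng = np.random.default_rng(0)
fake_coef = rng.normal(size=(10, 784))
fake_coef[np.abs(fake_coef) < 1.0] = 0.0  # mimic the zeros an l1 penalty produces

images, scale = coef_to_images(fake_coef)
print(images.shape)
```

Each `images[k]` can then be drawn with something like `plt.imshow(images[k], cmap="RdBu", vmin=-scale, vmax=scale)`, so positive and negative weights get opposite colors.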

# Problem 3: Revisiting Logistic Regression and MNIST. - Josh
Here we throw the kitchen sink of classical ML (i.e. pre-deep learning) on MNIST.

* Use Random Forests to try to get the best possible test accuracy on MNIST. Use Cross Validation to find the best settings. How well can you do? You should use the accuracy metric to compare to logistic regression. What are the hyperparameters of your best model?
* Use Gradient Boosting to do the same. Try your best to tune your hyperparameters. What are the hyperparameters of your best model?
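
One way to organize the search is a cross-validated grid over a few Random Forest settings. The sketch below uses scikit-learn's small built-in 8 × 8 digits set as a fast stand-in for MNIST. The grid values are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Small stand-in data set so the search finishes in seconds.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Illustrative grid; a real search on MNIST would cover more values.
param_grid = {"n_estimators": [50, 100], "max_depth": [None, 10]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X_train, y_train)

best_params = search.best_params_
cv_accuracy = search.best_score_
test_accuracy = search.score(X_test, y_test)  # refit best model, held-out accuracy
print(best_params, cv_accuracy, test_accuracy)
```

Selecting on the cross-validated score and touching the test split only once at the end keeps the reported test accuracy an honest estimate.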

# Problem 4: Revisiting Logistic Regression and CIFAR-10. - Jackson
As before, we’ll throw the kitchen sink of classical ML (i.e. pre-deep learning) on CIFAR-10.  
Keep in mind that CIFAR-10 is a few times larger.
* What is the best accuracy you can get on the test data, by tuning Random Forests? 
    * What are the hyperparameters of your best model?
* What is the best accuracy you can get on the test data, by tuning any model including Gradient boosting? 
    * What are the hyperparameters of your best model?

# Problem 5: Getting Started with Pytorch. - Jackson
 * Install PyTorch.
 * Work through this tutorial to familiarize yourself with PyTorch basics: https://pytorch.org/tutorials/beginner/blitz/tensor_tutorial.html#sphx-glr-beginner-blitz-tensor-tutorial-py
 * Work through this tutorial on MNIST starting from a PyTorch logistic regression and building to a CNN using torch.nn. Use a GPU (e.g. on Colab, through Google Cloud credits, Paperspace, or any other way). https://pytorch.org/tutorials/beginner/nn_tutorial.html
 * Design the best CNN you can to get the best accuracy on MNIST.
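
When designing a CNN, it helps to track how each layer changes the spatial size, since that determines the input width of the first fully connected layer. Below is a minimal sketch in plain Python, assuming square inputs, no dilation, and a hypothetical two-block architecture chosen only for illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

# Hypothetical small CNN for 28x28 MNIST images.
size = 28
size = conv_output_size(size, kernel=3, padding=1)  # 3x3 conv, padding 1
size = conv_output_size(size, kernel=2, stride=2)   # 2x2 max-pool
size = conv_output_size(size, kernel=3, padding=1)  # 3x3 conv, padding 1
size = conv_output_size(size, kernel=2, stride=2)   # 2x2 max-pool

channels = 32                        # assumed number of filters in the last conv
flattened = channels * size * size   # input features of the first linear layer
print(size, flattened)
```

For this layout the feature map shrinks from 28 to 7 per side, so the first linear layer would take 32 × 7 × 7 = 1568 inputs.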

# Problem 6: CNNs for CIFAR-10. - Jackson
* Build a CNN and optimize the accuracy for CIFAR-10. 
    * Try different numbers of layers and different architectures (depth and convolutional filter hyperparameters).
* Do momentum and learning rate have a significant effect? 
    * Track the train and test loss across training epochs and plot them for different learning rates and momentum values.
* Does the depth of the CNN have a significant effect on performance? 
    * Describe the hyperparameters of the best model you could train.
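
Before running full CNN experiments, the interaction between learning rate and momentum can be seen on a toy problem. The sketch below minimizes f(w) = 0.5·w² with the heavy-ball update (velocity = momentum · velocity + gradient, then w -= lr · velocity). The function name, starting point, and step count are illustrative assumptions, and a real network's loss surface will behave less cleanly.

```python
def sgd_momentum(lr, momentum, steps=50, w0=5.0):
    """Minimize f(w) = 0.5 * w**2 with heavy-ball momentum; returns the loss per step."""
    w, velocity = w0, 0.0
    losses = []
    for _ in range(steps):
        grad = w                               # derivative of 0.5 * w**2
        velocity = momentum * velocity + grad  # accumulate past gradients
        w = w - lr * velocity
        losses.append(0.5 * w ** 2)
    return losses

slow = sgd_momentum(lr=0.01, momentum=0.0)
fast = sgd_momentum(lr=0.1, momentum=0.0)
with_momentum = sgd_momentum(lr=0.01, momentum=0.9)

for name, curve in [("lr=0.01", slow), ("lr=0.1", fast), ("lr=0.01, m=0.9", with_momentum)]:
    print(name, curve[-1])
```

Plotting the three loss curves against the step index is the toy analogue of the per-epoch train and test loss plots requested above.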