# CMU 95-865 - Unstructured Data Analytics Fall 2021 Quiz 2

This is an 80 minute exam. We will only grade what is submitted via Canvas.

You must fill in your name and your Andrew ID for this quiz to be graded. Moreover, filling out your name and Andrew ID below will serve as your agreement with us, the course staff, that you did not collaborate with anyone on this exam and that what you submit is truly your own individual work and not that of anyone else. Violations found will result in severe penalties.

Your name:

Your Andrew ID:

**Warning: If you leave the above blank, your quiz will not be graded.**

**Important:** There are 3 problems that can be done in any order.

## Problem 1 (no coding) [30 points]

Consider the following convolutional neural network:

1. Conv2D layer with 64 filters (each 5-by-5), expecting input images to have a single channel
2. ReLU
3. 2-by-2 max pool 2D layer
4. Conv2D layer with 128 filters (each 4-by-4)
5. ReLU
6. 4-by-4 max pool 2D layer
7. Flatten
8. Linear layer with 1000 output nodes
9. Softmax

**(a) [3 points]** For the second Conv2D layer, what is the number of channels that it should expect for every input image it gets?

- 64, since the first layer has 64 filters, so it outputs a 64-channel image. Activation and pooling do not change the number of channels.


**(b) [6 points]** How many parameters does the second Conv2D layer have?

- $128\cdot(64\cdot4\cdot4+1) = 131200$
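
The count above follows from each filter holding one weight per input channel per kernel position, plus one bias. A minimal sketch of that arithmetic, where `conv2d_param_count` is a hypothetical helper name (not a library function) and one bias per filter is assumed:

```python
def conv2d_param_count(in_channels, out_channels, kernel_h, kernel_w):
    """Learnable parameters in a Conv2D layer, assuming one bias per filter."""
    weights_per_filter = in_channels * kernel_h * kernel_w
    return out_channels * (weights_per_filter + 1)  # +1 for the filter's bias


# Second Conv2D layer: 64 input channels, 128 filters, each 4-by-4
second_conv_params = conv2d_param_count(in_channels=64, out_channels=128,
                                        kernel_h=4, kernel_w=4)
print(second_conv_params)  # 131200
```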

**(c) [21 points across subparts]** For the subparts next, assume that the linear layer expects each of its inputs to be a 1D table with 2048 entries. Moreover, each input data point to the overall network is a square image (i.e., its height and width are the same).

**Subpart i [3 points].** What is the output shape of the 4-by-4 max pooling layer for a single data point?

- (128, 4, 4). Pooling keeps all 128 channels, and the linear layer expects 2048 entries, so each channel stores a square image with 2048 / 128 = 16 = 4 × 4 entries.

**Subpart ii [6 points].** What is the *minimum* size of each input image to the 4-by-4 max pooling layer (specify the number of channels, the height, and the width)? Very briefly also explain why the input images to the 4-by-4 max pooling layer could actually be slightly larger yet still output the same answer.

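- (128, 16, 16). The input could be slightly larger (a height/width of 17, 18, or 19) and still give a 4-by-4 output, assuming the pooling layer discards any leftover rows/columns that do not fill a complete 4-by-4 window.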
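
To see which input sizes give the same pooled size, here is a small sketch. It assumes the pooling windows do not overlap and that any incomplete window at the edge is dropped (floor division); `pooled_size` is a hypothetical helper name.

```python
def pooled_size(size, window):
    """Output height/width of non-overlapping max pooling that drops partial windows."""
    return size // window


for size in range(15, 21):
    print(size, '->', pooled_size(size, 4))

# All input heights/widths that a 4-by-4 max pool maps to an output of 4
sizes_giving_4 = [s for s in range(1, 40) if pooled_size(s, 4) == 4]
print(sizes_giving_4)
```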

**Subpart iii [12 points].** What is the *minimum* height/width of each input image to the entire convnet?

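- 42. Working backward, assuming the Conv2D layers use no padding and a stride of 1: the 4-by-4 max pool needs at least 16, so the second Conv2D (4-by-4 filters) needs 16 + 3 = 19, the 2-by-2 max pool needs 2 · 19 = 38, and the first Conv2D (5-by-5 filters) needs 38 + 4 = 42.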
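
The minimum size can also be checked by pushing candidate sizes forward through the layers. This is a sketch under assumptions not stated in the problem: Conv2D uses no padding and stride 1, and max pooling drops partial windows. The helper names are hypothetical.

```python
def conv_out(size, kernel):
    """Height/width after a convolution with no padding and stride 1."""
    return size - kernel + 1


def pool_out(size, window):
    """Height/width after non-overlapping max pooling that drops partial windows."""
    return size // window


def convnet_pooled_size(size):
    """Height/width reaching the flatten layer for a square input of the given size."""
    size = conv_out(size, 5)   # Conv2D, 5-by-5 filters
    size = pool_out(size, 2)   # 2-by-2 max pool
    size = conv_out(size, 4)   # Conv2D, 4-by-4 filters
    size = pool_out(size, 4)   # 4-by-4 max pool
    return size


# Smallest input height/width for which the final pooled image is 4-by-4
min_size = next(s for s in range(1, 200) if convnet_pooled_size(s) == 4)
print(min_size)  # 42
```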

## Problem 2 (no coding) [15 points]

Suppose that we are training a multilayer perceptron for a binary classification task. We set the final linear layer to have 2 nodes with softmax activation. However, we are not sure how to set the number of linear layers before the final linear layer (these linear layers before the final linear layer are commonly called "hidden" layers).

Specifically, consider the following architecture, where every input data point is assumed to be represented as a 1D table with a total of `D` numbers, and we have a total of `L` hidden layers:

- Repeat `L` times: linear layer with `K` nodes and ReLU activation
- Finally, we have the last linear layer with 2 nodes and softmax activation

Throughout this problem, assume that `K` is at least 2 and at most `D`.

**(a) [9 points across subparts]** Let's do some parameter counting.

**Subpart i. [6 points]** Suppose that `L` is equal to 3. What is the number of parameters of this multilayer perceptron in terms of `D` and `K`? Do *not* plug in numerical values for `D` or `K` in your answer; they should be left as variables.

- `D * K + K + 2 * (K * K + K) + K * 2 + 2`

**Subpart ii. [3 points]** Let's generalize your answer from **subpart i**: determine the number of parameters of the multilayer perceptron as a function of `L`, `D`, and `K` (so for this subpart, do *not* plug in numerical values for any of these three variables).

- `D * K + K + (L - 1) * (K * K + K) + K * 2 + 2`
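
As a sanity check on the formula, the sketch below counts parameters layer by layer (weights plus biases for each linear layer) and compares the total against the closed form. `mlp_param_count` is a hypothetical helper, not part of the course code.

```python
def mlp_param_count(num_hidden_layers, input_dim, hidden_dim, num_classes=2):
    """Sum weights and biases over every linear layer of the multilayer perceptron."""
    layer_sizes = [input_dim] + [hidden_dim] * num_hidden_layers + [num_classes]
    return sum(fan_in * fan_out + fan_out
               for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]))


def closed_form(L, D, K):
    """The formula from subpart ii."""
    return D * K + K + (L - 1) * (K * K + K) + K * 2 + 2


example_count = mlp_param_count(num_hidden_layers=3, input_dim=100, hidden_dim=10)
print(example_count, closed_form(3, 100, 10))
```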

**(b) [6 points across subparts]** For this part only, assume that `D` is equal to 100, and we want to automatically choose `L` and `K`. We will try every combination of values `L` = 2, 4, 6, 8, 10, 12, `K` = 10, 20, 30, 40, 50, and also learning rates 0.01, 0.001, 0.0001 (using the Adam optimizer, which is the same optimizer we had used in the lecture demos).

**Subpart i. [3 points]** Using simple data splitting with a 50/50 train/validation split to select the hyperparameters, how many times will we do model fitting?

- 90
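
The count comes from one model fit per hyperparameter combination, since a single train/validation split is used. A short sketch that enumerates the grid (the variable names are illustrative only):

```python
from itertools import product

hidden_layer_counts = [2, 4, 6, 8, 10, 12]   # candidate values of L
hidden_layer_widths = [10, 20, 30, 40, 50]   # candidate values of K
learning_rates = [0.01, 0.001, 0.0001]

# One model fit per (L, K, learning rate) combination
settings = list(product(hidden_layer_counts, hidden_layer_widths, learning_rates))
print(len(settings))  # 90
```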

**Subpart ii. [3 points]** Will using simple data splitting with a 50/50 train/validation split always give us the same answer for what values of `L`, `K` and the learning rate are best? If not, clearly explain why not, including the different sources of randomness that could make the results different.

- No. Different random choices of which data points end up in the training set versus the validation set change the validation scores we get, which can lead to a different best hyperparameter setting. Also, neural net training typically uses random parameter initialization; this too leads to different validation scores that could change which hyperparameter setting is found to be best.

## Problem 3 (coding) [55 points]

**Throughout this problem, do not import any packages/functions that are not already imported in the next cell.**

In this problem, we examine customer complaints for different cars. Let's first get some Python imports and data loading sorted out. Please run the two cells below.

In [2]:
# DO NOT MODIFY THIS CELL
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
from collections import Counter
from operator import itemgetter
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# load in data
with open('mystery_mentos.txt', 'r', encoding='utf-8') as f:
    cars = np.array([_.strip() for _ in f.readlines()])
with open('mystery_onions.txt', 'r', encoding='utf-8') as f:
    complaints = np.array([_.strip() for _ in f.readlines()])
with open('mystery_eggplant.txt', 'r', encoding='utf-8') as f:
    labels = np.array([_.strip() for _ in f.readlines()])

In [3]:
cars[:5]

array(['chevrolet camaro', 'chevrolet captiva', 'chevrolet city express',
       'chevrolet colorado', 'chevrolet corvette'], dtype='<U30')

In [4]:
labels[:5]

array(['ford c-max hybrid', 'ford c-max hybrid', 'ford c-max hybrid',
       'ford c-max hybrid', 'ford c-max hybrid'], dtype='<U30')

In [5]:
print(cars.shape,complaints.shape,labels.shape)

(165,) (21511,) (21511,)


In [6]:
'ford c-max hybrid' in cars

True

The variable `cars` lists out all the different cars in the dataset. The variable `complaints` contains a collection of customer complaints. The i-th entry in `complaints` (namely, `complaints[i]`) is about the specific car given by `labels[i]`. Every value in `labels` is one of the cars in `cars`.

**(a) [5 pts]** Let's first try to understand the labels in this dataset.

Print out a sorted list of cars, where we sort the cars based on the number of complaints there are about the car. Your printed list should be of the format:

```
car 1 : number of complaints
car 2 : number of complaints
...
last car : number of complaints
```

Again, `car 1` should correspond to whichever car has the most complaints, `car 2` should correspond to whichever car has the second most complaints, etc. (so the number of complaints should go from biggest to smallest as you go down the list).

In [13]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#
car_complaint_count = Counter()
for car in cars:
    l= [complaint for complaint, label in zip(complaints, labels) if label==car]
    car_complaint_count[car]=len(l)

for key,value in car_complaint_count.most_common():
    print("car ", key, ": number of complaints: ",value)
#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

car  chrysler 200 : number of complaints:  1792
car  jeep cherokee : number of complaints:  1373
car  ford explorer : number of complaints:  975
car  jeep grand cherokee : number of complaints:  905
car  chevrolet silverado 1500 : number of complaints:  787
car  nissan altima : number of complaints:  653
car  ford focus : number of complaints:  618
car  ford fusion : number of complaints:  595
car  ford fusion energi : number of complaints:  595
car  hyundai sonata : number of complaints:  584
car  honda cr-v : number of complaints:  553
car  chevrolet tahoe : number of complaints:  476
car  honda accord hybrid : number of complaints:  475
car  honda accord : number of complaints:  453
car  chevrolet equinox : number of complaints:  439
car  jeep renegade : number of complaints:  438
car  ford escape : number of complaints:  429
car  ford edge : number of complaints:  392
car  chevrolet colorado : number of complaints:  350
car  nissan rogue : number of complaints:  319
car  ford musta

In [16]:
for count, car in sorted([((labels==car).sum(), car) for car in cars],reverse =True):
    print(car, " : ",count)

chrysler 200  :  1792
jeep cherokee  :  1373
ford explorer  :  975
jeep grand cherokee  :  905
chevrolet silverado 1500  :  787
nissan altima  :  653
ford focus  :  618
ford fusion energi  :  595
ford fusion  :  595
hyundai sonata  :  584
honda cr-v  :  553
chevrolet tahoe  :  476
honda accord hybrid  :  475
honda accord  :  453
chevrolet equinox  :  439
jeep renegade  :  438
ford escape  :  429
ford edge  :  392
chevrolet colorado  :  350
nissan rogue  :  319
ford mustang  :  272
jeep wrangler  :  263
nissan pathfinder  :  247
jeep patriot  :  245
chrysler town and country  :  245
toyota camry hybrid  :  242
nissan sentra  :  230
toyota camry  :  229
nissan versa note  :  220
honda civic hybrid  :  220
chevrolet impala  :  219
honda civic  :  218
honda fit  :  199
chevrolet malibu  :  199
chevrolet cruze  :  198
chevrolet silverado 2500  :  185
toyota sienna  :  184
toyota corolla  :  177
toyota rav4  :  172
ford fiesta  :  168
hyundai genesis  :  165
chevrolet camaro  :  162
chevrole

**(b) [8 points]** We're not going to be able to say much about cars that had too few complaints (not enough data!). Write code that computes filtered versions of `cars`, `complaints`, and `labels` respectively called `filtered_cars`, `filtered_complaints` and `filtered_labels`; the filtered versions should be in the same order as the unfiltered versions except that they only contain data for cars that appear in at least 100 complaints.

In [18]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE (DO NOT CHANGE THE VARIABLE NAMES FOR WHAT WE ARE ASKING YOU TO
# COMPUTE)
#
filtered_cars = np.array([car for car in cars if (labels==car).sum()>=100])
filtered_complaints, filtered_labels = zip(*[(complaint, label) for complaint, label in zip(complaints,labels) if label in filtered_cars])

#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

# DO NOT MODIFY THE CODE BELOW
filtered_complaints = np.array(filtered_complaints)
filtered_labels = np.array(filtered_labels)
print('Number of cars after filtering:', len(filtered_cars))
print('Filtered complaints and filtered labels have the same length:',
      len(filtered_complaints) == len(filtered_labels))
print('Number of complaints after filtering:', len(filtered_complaints))

Number of cars after filtering: 53
Filtered complaints and filtered labels have the same length: True
Number of complaints after filtering: 19359


**(c) [6 points across subparts]** We have written some code below for you that obtains a train/test split using the filtered data you computed in part **(b)**. Note that here, the test data are treated as actual test data and not validation data that we tune hyperparameters on. Please begin by running the next cell and then proceed to answer some questions.

In [None]:
# DO NOT MODIFY THIS CELL
train_indices, test_indices = \
    train_test_split(range(len(filtered_complaints)), test_size=0.25,
                     shuffle=True, random_state=42, stratify=filtered_labels)

def row_normalize(X):
    # given a 2D table, where rows index data points and columns index features, divide each row by its sum
    normalized_X = []
    for row in X:
        row_sum = row.sum()
        if row_sum > 0:
            normalized_X.append(row / row_sum)
        else:
            normalized_X.append(row)  # all 0's
    return np.array(normalized_X)

vectorizer = CountVectorizer(min_df=5, max_df=.7, stop_words='english')
X_train = row_normalize(vectorizer.fit_transform(filtered_complaints[train_indices]).toarray())
y_train = filtered_labels[train_indices]

**Subpart i. [2 points]** In the above code, the vectorizer builds a vocabulary. How many words are in this vocabulary? To answer this question, write code that prints out the answer.

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#


#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

**Subpart ii. [2 points]** What is the word index in the vocabulary for the word `transmission`? To answer this question, write code that prints out the answer.

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#


#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

**Subpart iii. [2 points]** Suppose that we train a model using `X_train` and `y_train` above. Now we want to use the model to predict on the test data. We first have to get the test data into the format expected by the model. Consider the following two lines of code:

```
X_test = row_normalize(vectorizer.fit_transform(filtered_complaints[test_indices]).toarray())
y_test = filtered_labels[test_indices]
```

Does the above code correctly encode the test data for prediction using the model trained using `X_train` and `y_train`? If you think that the answer is "yes", briefly justify why. If you instead think that the answer is "no", please explain what the bug in the code is.

**Your answer here (indicate either "yes" or "no", and provide an explanation; a correct answer without an explanation will not receive credit):** REPLACE THIS TEXT WITH YOUR ANSWER
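
For the subparts above, it may help to see how a fitted `CountVectorizer` exposes its vocabulary and how `fit_transform` and `transform` behave on new text. The sketch below uses a made-up corpus (`train_docs` and `test_docs` are hypothetical strings, not from the dataset) and default vectorizer settings rather than the ones used in the quiz.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration only
train_docs = ['transmission slips when shifting',
              'engine stalls and transmission jerks']
test_docs = ['brakes squeal and transmission slips']

toy_vectorizer = CountVectorizer()
X_toy_train = toy_vectorizer.fit_transform(train_docs)  # learns the vocabulary

# `vocabulary_` maps each word to its column index
vocab_size = len(toy_vectorizer.vocabulary_)
transmission_index = toy_vectorizer.vocabulary_['transmission']
print(vocab_size, transmission_index)

# `transform` reuses the vocabulary learned from the training documents,
# so the columns line up with X_toy_train
X_toy_test = toy_vectorizer.transform(test_docs)
print(X_toy_train.shape, X_toy_test.shape)

# `fit_transform` on the test documents learns a *new* vocabulary,
# so the columns no longer correspond to the training columns
refit_vectorizer = CountVectorizer()
X_refit = refit_vectorizer.fit_transform(test_docs)
print(X_refit.shape, refit_vectorizer.vocabulary_['transmission'])
```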

**(d) [18 points across subparts]** We've actually already fit a model for you using `X_train` and `y_train`. Let's load in the resulting predictions on the test data first, and then please proceed to answering some questions.

In [None]:
# DO NOT MODIFY THIS CELL
y_test_predicted = np.load('mystery_mug.npy')

**Subpart i. [3 points]** What is the raw accuracy rate on the test data (fraction of predicted labels that are correct)? To answer this question, write code that prints out the answer. *As a reminder, do not import any additional packages/functions.*

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#


#
# END OF YOUR CODE
# --------------------------------------------------------------------------------
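
As an illustration of the quantity being asked for, raw accuracy is the mean of an elementwise equality check between true and predicted labels. A sketch on made-up toy arrays (not the quiz data):

```python
import numpy as np

# Made-up labels for illustration only
toy_true = np.array(['honda cr-v', 'ford focus', 'ford focus', 'jeep cherokee'])
toy_pred = np.array(['honda cr-v', 'ford focus', 'jeep cherokee', 'jeep cherokee'])

# Fraction of predictions that match the true label
accuracy = np.mean(toy_true == toy_pred)
print(accuracy)  # 0.75
```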

**Subpart ii. [15 points]** For each car (in `filtered_cars`), compute the recall, precision, and F1 score for predicting that specific car. *As a reminder, do not import any additional packages/functions.*

Note on numerical issues: in computing precision, if the denominator is 0, then set precision to be 0 (i.e., do not actually divide by 0). Similarly, for computing F1 score, if precision and recall are both 0, then set the F1 score to be 0.

Afterward, print out three lists of cars:
- the first list has cars sorted in *descending* (largest to smallest) recall
- the second list has cars sorted in *descending* precision
- the third list has cars sorted in *descending* F1 score

Each list should be of the format:

```
car 1 : score
car 2 : score
...
last car : score
```

In [None]:
recall_car_pairs = []  # DO NOT MODIFY
precision_car_pairs = []  # DO NOT MODIFY
f1_car_pairs = []  # DO NOT MODIFY

for car in filtered_cars:  # DO NOT MODIFY
    # --------------------------------------------------------------------------------
    # YOUR CODE HERE (basically we are treating the current `car` as the "positive"
    # class and all the other cars collectively as the "negative" class to compute
    # recall/precision/F1 score here)
    #
    recall = None
    precision = None
    f1 = None
    #
    # END OF YOUR CODE
    # --------------------------------------------------------------------------------
    
    recall_car_pairs.append((recall, car))  # DO NOT MODIFY
    precision_car_pairs.append((precision, car))  # DO NOT MODIFY
    f1_car_pairs.append((f1, car))  # DO NOT MODIFY


print('[Recall]')
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#

# print out cars sorted by descending recall scores (using `recall_car_pairs`)

#
# END OF YOUR CODE
# --------------------------------------------------------------------------------
print()

print('[Precision]')
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#

# print out cars sorted by descending precision scores (using `precision_car_pairs`)

#
# END OF YOUR CODE
# --------------------------------------------------------------------------------
print()

print('[F1]')
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#

# print out cars sorted by descending F1 scores (using `f1_car_pairs`)

#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

Note: the recall/precision/F1 scores can quickly give us a sense of which cars the model we've loaded has an easier time predicting vs which ones appear difficult for the model to predict.
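
A minimal sketch of the one-vs-rest computation described above, on made-up toy labels rather than the quiz data. `one_vs_rest_scores` is a hypothetical helper; it applies the same zero-denominator conventions as the note on numerical issues, plus an analogous guard for recall.

```python
import numpy as np


def one_vs_rest_scores(true_labels, predicted_labels, positive_label):
    """Return (recall, precision, F1), treating `positive_label` as the positive class."""
    is_actual = (true_labels == positive_label)
    is_predicted = (predicted_labels == positive_label)
    true_positives = int(np.sum(is_actual & is_predicted))
    num_actual = int(is_actual.sum())
    num_predicted = int(is_predicted.sum())

    recall = true_positives / num_actual if num_actual > 0 else 0.0
    precision = true_positives / num_predicted if num_predicted > 0 else 0.0
    if precision + recall > 0:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return recall, precision, f1


# Made-up labels for illustration only
toy_cars = ['ford focus', 'honda cr-v', 'jeep cherokee']
toy_true = np.array(['honda cr-v', 'honda cr-v', 'ford focus', 'ford focus', 'jeep cherokee'])
toy_pred = np.array(['honda cr-v', 'ford focus', 'ford focus', 'ford focus', 'honda cr-v'])

scores = {car: one_vs_rest_scores(toy_true, toy_pred, car) for car in toy_cars}

# Print cars in descending order of recall
for recall, car in sorted(((scores[car][0], car) for car in toy_cars), reverse=True):
    print(car, ':', recall)
```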

---

**(e) [18 points across subparts]** Let's look at a specific car, `'honda cr-v'`. The classifier that we already trained for you can actually compute estimated probabilities of different cars. We load this information next.

In [None]:
# DO NOT MODIFY THIS CELL
y_test_predicted_prob = np.load('mystery_mouse.npy')

The columns of `y_test_predicted_prob` are in the same order as `filtered_cars` (assuming you computed it correctly).

**Subpart i. [6 points]** Let's look at an example row of `y_test_predicted_prob`.

In [None]:
# DO NOT MODIFY THIS CELL
print(y_test_predicted_prob[42])

Briefly explain what these numbers mean (your explanation should help explain the sum of these numbers), print out the complaint that corresponds to these numbers, and also print out the car name corresponding to the highest number.

**Your answer here for what the numbers mean (your explanation should help explain the sum of these numbers):** REPLACE THIS TEXT WITH YOUR ANSWER

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#
complaint = ''  # replace this with the complaint corresponding to `y_test_predicted_prob[42]`
car = ''  # replace this with the car corresponding to the highest number in `y_test_predicted_prob[42]`
#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

# DO NOT MODIFY THE CODE BELOW
print('Complaint:', complaint)
print()
print('Car:', car)

**Subpart ii. [5 points]** Compute the top 20 test set complaints with the highest probability for `'honda cr-v'`.

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#

top20_complaints = []  # list of strings

#
# END OF YOUR CODE
# --------------------------------------------------------------------------------

print("\n\n".join(top20_complaints))
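
One way to rank test points by the probability assigned to a particular car is to sort that car's column in descending order. A sketch on made-up toy data (top 2 instead of top 20; all `toy_*` names are hypothetical):

```python
import numpy as np

# Made-up data for illustration only
toy_cars = np.array(['ford focus', 'honda cr-v'])
toy_test_complaints = np.array(['made-up complaint 0', 'made-up complaint 1',
                                'made-up complaint 2', 'made-up complaint 3'])
toy_probs = np.array([[0.9, 0.1],
                      [0.2, 0.8],
                      [0.4, 0.6],
                      [0.7, 0.3]])

# Column of the probability table that belongs to the car of interest
car_column = int(np.where(toy_cars == 'honda cr-v')[0][0])

# Test rows sorted from highest to lowest probability for that car
ranking = np.argsort(-toy_probs[:, car_column], kind='stable')
top2_complaints = list(toy_test_complaints[ranking[:2]])
print(top2_complaints)
```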

**Subpart iii. [5 points]** Using the vocabulary computed by `vectorizer` on the training data, compute the raw counts for words that appear in the top 20 test complaints (the ones from **subpart ii**), and print out the top 10 words in the vocabulary that appear the most in the top 20 test complaints.

*Hint:* You may find it helpful to use `vectorizer` to transform the top 20 complaints you found in the previous subpart.

In [None]:
# --------------------------------------------------------------------------------
# YOUR CODE HERE
#


#
# END OF YOUR CODE
# --------------------------------------------------------------------------------
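
Once the complaints are turned into a document-by-word count table, the most frequent words come from summing each column and sorting. The sketch below uses a hand-made count table and placeholder vocabulary (`toy_vocab` and `toy_counts` are hypothetical); in the notebook, the table would come from transforming the complaints with `vectorizer`.

```python
import numpy as np

# Placeholder vocabulary and counts for illustration only
toy_vocab = np.array(['alpha', 'beta', 'gamma', 'delta'])
toy_counts = np.array([[0, 2, 1, 0],    # rows: documents
                       [1, 0, 3, 0],    # columns: words in toy_vocab
                       [0, 1, 1, 1]])

# Total count of each word across all documents
total_counts = toy_counts.sum(axis=0)

# Word indices from most to least frequent (stable sort keeps ties in vocabulary order)
order = np.argsort(-total_counts, kind='stable')
top2_words = [(str(toy_vocab[i]), int(total_counts[i])) for i in order[:2]]
for word, count in top2_words:
    print(word, ':', count)
```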

**Subpart iv. [2 points]** At least one of the words found is indeed a recurring problem that people complain about for the Honda CR-V car. Which word is it?

**Your answer here:** REPLACE THIS TEXT WITH YOUR ANSWER