# CPSC 330 practice questions

The University of British Columbia

DO NOT DISTRIBUTE WITHOUT PERMISSION

**These sample questions are designed to give you an understanding of the exam format, not to serve as comprehensive coverage of all the topics we've discussed. They also do not reflect the total number of questions that will be on the actual exam. As you study, please do not depend only on these questions or memorize their answers. There are no shortcuts to effectively prepare for the exam. It's essential to review all the material and ensure you have a solid understanding of the key concepts.**

## Q1 
---
The training accuracy of `DummyClassifier`  with `strategy="most_frequent"` might change depending upon how you split the data.

<b>BEGIN SOLUTION</b>

    [X] True
    [ ] False

<b>END SOLUTION</b>
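
To see why the answer to Q1 is True, here is a minimal sketch. The toy data (60 examples of class 0, 40 of class 1) and the choice of five seeds are assumptions made purely for illustration. With `strategy="most_frequent"`, the training accuracy is the fraction of the training split that belongs to its most common class, and that fraction depends on which examples land in the split.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import train_test_split

# Toy data, assumed for illustration: 60 examples of class 0 and 40 of class 1.
X = np.zeros((100, 1))  # DummyClassifier ignores the features
y = np.array([0] * 60 + [1] * 40)

train_scores = []
majority_fractions = []
for seed in range(5):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, random_state=seed
    )
    dummy = DummyClassifier(strategy="most_frequent")
    dummy.fit(X_train, y_train)
    train_scores.append(dummy.score(X_train, y_train))
    # Fraction of the training split that belongs to its most common class
    majority_fractions.append(np.bincount(y_train).max() / len(y_train))

print(train_scores)
```

Because the split here is not stratified, the class proportions in the training portion can differ from seed to seed, and so can the training accuracy.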

## Q2
---
Is it a good practice to treat the `random_state` of `train_test_split` as a hyperparameter and pick the `random_state` which gives you the best cross-validation scores? Briefly explain.

<b>BEGIN SOLUTION</b>

> No, it is not good practice. If different random states give noticeably different results, the model is very sensitive to the particular data it is trained on. If we pick the random state that gives the best cross-validation score, the results are likely to be too optimistic and the model is less likely to generalize to unseen data.

<b>END SOLUTION</b>

## Q3
---
In Lecture 3 we talked about different splits: train, validation, test, deployment. What is the difference between test data and deployment data?

<b>BEGIN SOLUTION</b>

> Test data has targets whereas deployment data doesn't have targets.

<b>END SOLUTION</b>

## Q4
---
![](img/trainvalid.png)

Explain how the plot above illustrates the fundamental tradeoff by filling in the missing words.

As you increase the model complexity, the training accuracy tends to go `[trainacc]` and the gap between training and validation accuracy tends to go `[gap]`.

<b>BEGIN SOLUTION</b>

> `up` for `[trainacc]` and `up` for `[gap]`. As we increase complexity (C), the training score goes up, but the gap between the training and validation scores also goes up. This is the fundamental tradeoff.

<b>END SOLUTION</b>

## Q5
---
Give one similarity and one difference between KNNs and RBF SVMs.

<b>BEGIN SOLUTION</b>

>- Similarity: Both are analogy-based models.
>- Differences: (Many possible answers)
>    * During prediction, KNNs compute the distance from the query point to all training examples, whereas RBF SVMs find key examples called support vectors and calculate a weighted distance from the query point to each support vector.
>    * KNN has an integer hyperparameter `k` (`n_neighbors` in sklearn) whereas RBF SVMs have two main continuous hyperparameters: `C` and `gamma`.

<b>END SOLUTION</b>

## Q6
---
One of the imputation approaches we talked about was

```SimpleImputer(strategy='constant', fill_value='?')```

This means that every missing value gets replaced with a '?'. Let's say we use this approach on a categorical column, which is then passed into one-hot encoding. Thus, the '?' category would get its own column in the encoded dataset, and thus its own feature importance. Let's say we observe a high feature importance for the '?' column on our final model, and yet we happen to know that in deployment we will never have missing values for this feature. What are the implications of this information on our model's performance in deployment? Briefly explain.

If it's helpful, you can assume the model is not overfit; that is, your training, cross-validation, and test scores are all about the same.

<b>BEGIN SOLUTION</b>

In deployment the '?' column will always take value 0, so this feature will basically be ignored. Since it was important, we can expect worse performance in deployment. (If the model was overfitting on this feature then maybe "removing" it could have actually helped though - hard to say.)

<b>END SOLUTION</b>
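
The situation in Q6 can be sketched as follows. The colour values and the `X_deploy` array are hypothetical and only stand in for a categorical column; the pipeline step names are the ones `make_pipeline` generates automatically.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (assumed for illustration); np.nan marks a missing value.
X_train = np.array([["red"], [np.nan], ["blue"], [np.nan]], dtype=object)
X_deploy = np.array([["blue"], ["red"], ["red"]], dtype=object)  # never missing

preprocessor = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="?"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocessor.fit(X_train)

# Locate the one-hot column that corresponds to the '?' category
categories = list(preprocessor.named_steps["onehotencoder"].categories_[0])
question_mark_col = categories.index("?")

train_encoded = preprocessor.transform(X_train).toarray()
deploy_encoded = preprocessor.transform(X_deploy).toarray()

print(train_encoded[:, question_mark_col])   # some ones during training
print(deploy_encoded[:, question_mark_col])  # all zeros in deployment
```

In deployment the `'?'` column is constant, so whatever signal the model drew from it during training is no longer available.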

## Q7
---
Suppose you are working on the problem of comment moderation, where the target is whether to accept or reject a given comment. Below are some features given to you.

| comment_text                                                                         | toxicity_severity | comment_type | is_threatening | target |
|--------------------------------------------------------------------------------------|-------------------|--------------|----------------|--------|
| Wars after wars! What a peaceful politician he turned out to be!                     | mild              | sarcastic    | 0              | accept |
| That's stupid. Stop behaving like a 5-year old.                                      | moderate          | insulting    | 0              | reject |
| You are my hero!                                                                     | none              | supportive   | 0              | accept |
| You hit the right notes in this piece.                                               | none              | supportive   | 0              | accept |
| This person shouldn't be allowed to talk or write.  Otherwise we will attack them.   | severe            | hate speech  | 1              | reject |

What encoding would you apply for each of the features above: `comment_text`, `toxicity_severity`, `comment_type`, and `is_threatening`?

<b>BEGIN SOLUTION</b>

> |                   |                       |
> |-------------------|-----------------------|
> | `comment_text`      | Bag of words encoding |
> | `toxicity_severity` | Ordinal encoding      |
> | `comment_type`      | One-hot encoding      |
> | `is_threatening`    | None                  |
> 
> Other Incorrect Match Options:
> - Scaling

<b>END SOLUTION</b>

## Q8
---
In Lecture 6 we discussed encoding categorical variables. Imagine we had a feature in our census dataset called `bank_account_size_category` representing the size of the person's bank account, and imagine that running
```python
df_train["bank_account_size_category"].value_counts()
```
returns the following:
```python
tiny     5
small    100
medium   15000
large    10000
enormous 2
```
Suppose you decide to use one-hot encoding on this column. What would be the problem when it comes to making predictions for people with tiny or enormous bank accounts? What encoding would be more appropriate for this column?

<b>BEGIN SOLUTION</b>

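> There is very little data for the tiny and enormous bank account sizes, so with one-hot encoding the model has almost nothing to learn from for those two columns. Ordinal encoding would be more appropriate for this column, because the categories have a natural order (tiny < small < medium < large < enormous).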

<b>END SOLUTION</b>

## Q9
---
Suppose you are tuning 8 hyperparameters, each taking 10 values, using `RandomizedSearchCV` with `cv=10` and `n_iter=80`. How many 10-fold cross-validations would be carried out in this case?

<b>BEGIN SOLUTION</b>

    [ ] 10^8
    [ ] 8^10
    [X] 80
    [ ] 10

<b>END SOLUTION</b>
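
The counting behind Q9 can be written out as a short calculation. The variable names are illustrative, and the count of model fits ignores any final refit on the full training set.

```python
n_hyperparameters = 8
values_per_hyperparameter = 10
n_iter = 80  # candidates sampled by the randomized search
cv = 10      # folds per cross-validation

# An exhaustive grid would have to cross-validate every combination
grid_candidates = values_per_hyperparameter ** n_hyperparameters

# Randomized search cross-validates only the sampled candidates
random_candidates = n_iter
random_model_fits = n_iter * cv  # one fit per candidate per fold

print(grid_candidates, random_candidates, random_model_fits)
```

So the randomized search runs 80 cross-validations (800 individual fits), compared with 10^8 cross-validations for the full grid.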

## Q10
---

In Lecture 7 we predicted the sentiment of movie reviews. What is the main problem with the following statement:

_"The most positive movie review in the dataset is the review corresponding to the largest coefficient of our logistic regression classifier."_

?

<b>BEGIN SOLUTION</b>

> We have one coefficient per feature, not per review. It should say _"The most positive movie review in the dataset is the review corresponding to the largest `predict_proba` score for the positive class from our logistic regression classifier."_

<b>END SOLUTION</b>

## Q11
---
In Lecture 8 we covered the idea of overfitting on the validation set. We showed that the validation score might not lead you to the hyperparameters that actually give the best test score. In that case, if cross-validation scores are not a faithful representation of test scores, why not just use the test set directly to tune your hyperparameters?

<b>BEGIN SOLUTION</b>

> That would violate the Golden Rule and would introduce the exact problem that we previously had with cross-validation (well, probably worse because it's only one set and not cross-validation). And then we'd have no unseen test set left.

<b>END SOLUTION</b>

## Q12
---
In hw2 you looked at histograms of different Spotify song features, separated by the target class (whether the user liked the song or not). Here is the histogram for danceability: 

![](img/danceability.png)

In hw2 we used a decision tree classifier. If we were instead to try a logistic regression classifier on this dataset, and if we used danceability as our only feature, would you expect the logistic regression coefficient to be positive or negative? Briefly explain your reasoning.

<b>BEGIN SOLUTION</b>
> I would expect the coefficient to be positive because the positive class tends to have larger values of danceability than the negative class.

<b>END SOLUTION</b>

## Q13
---
Suppose we are working on the problem of classifying toxic comments. You are interested in reducing the number of false negatives. Which of the following metrics should you primarily be trying to improve?

<b>BEGIN SOLUTION</b>

    [ ] precision
    [X] recall
    [ ] accuracy
    [ ] MAPE

<b>END SOLUTION</b>

## Q14
---
In Lecture 9 we talked about different scoring metrics for binary classification. Let's say you train two classifiers on a given dataset. If the two classifiers have the same precision as each other, and the same recall as each other, does that mean they must have the same accuracy? Briefly explain.

<b>BEGIN SOLUTION</b>
> Almost, but not quite: the accuracies can differ when the number of true positives is zero. Here is a counterexample.
>
> <u>Classifier 1</u>
>
> | negative | positive |
> |:--------:|:--------:|
> |     2    |     3    |
> |     2    |     0    |
>
> <u>Classifier 2</u>
>
> | negative | positive |
> |:--------:|:--------:|
> |     1    |     4    |
> |     2    |     0    |
>
> Both classifiers have zero precision and zero recall, but Classifier 1 has accuracy 2/7 and Classifier 2 has accuracy 1/7.

<b>END SOLUTION</b>
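
The counterexample in Q14 can be checked numerically. This sketch assumes the tables follow the scikit-learn layout (rows are true classes, columns are predicted classes), so the top row is TN, FP and the bottom row is FN, TP; the helper `metrics` is hypothetical.

```python
def metrics(tn, fp, fn, tp):
    """Return (precision, recall, accuracy) from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    return precision, recall, accuracy

# Counts read off the two tables under the assumed layout
clf1 = metrics(tn=2, fp=3, fn=2, tp=0)
clf2 = metrics(tn=1, fp=4, fn=2, tp=0)

print(clf1)
print(clf2)
```

Precision and recall both have TP in the numerator, so when TP is zero they carry no information about how the remaining counts are split between TN and FP.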

## Q15
---
In Lecture 10 we talked about scoring metrics for regression. I argued that RMSE (root mean squared error) and MAPE (mean absolute percent error) are more relatable or "human-readable" metrics than MSE (mean squared error). Why? 

<b>BEGIN SOLUTION</b>
> In regression the target has units. MSE has these units squared, which is hard to interpret. On the other hand RMSE has the original units and MAPE is unitless.

<b>END SOLUTION</b>
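
As a rough illustration of the units argument in Q15, the sketch below computes the three metrics by hand on made-up numbers. The values are hypothetical; think of them as prices in dollars.

```python
import math

# Hypothetical targets and predictions, e.g. prices in dollars
y_true = [100.0, 200.0, 400.0]
y_pred = [110.0, 180.0, 400.0]
n = len(y_true)

mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n      # dollars squared
rmse = math.sqrt(mse)                                            # dollars
mape = sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / n  # unitless fraction

print(f"MSE={mse:.2f}, RMSE={rmse:.2f}, MAPE={mape:.2%}")
```

An RMSE of about 13 dollars or a MAPE of about 7% is easy to relate to the data, whereas an MSE of about 167 "dollars squared" is not.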

## Q16
---
In Lecture 11 we discussed two types of ensemble methods, averaging and stacking. What is one advantage and one disadvantage of averaging over stacking? Max one advantage and one disadvantage (please do not list more).

<b>BEGIN SOLUTION</b>
> - Possible advantages of averaging over stacking: code is faster to run, simpler.
> - Possible disadvantages of averaging over stacking: you have to trust all models equally even if some are better than others.

<b>END SOLUTION</b>

## Q17
---
Suppose you are predicting song popularity (between 0 and 100) based on a bunch of numeric and ordinal features using a `Ridge` model, which learned a coefficient of 10.0 for the ordinal feature `speed`, which has three categories in the order given below:

```['slow', 'medium', 'fast']```

Suppose for a test example with "medium" speed, the model predicts a popularity of 50 likes. If we change the value of the feature from "medium" to "fast", what would be the popularity prediction of the model? Briefly explain. 

<b>BEGIN SOLUTION</b>
> The new predicted popularity would be 60.
>
> This is because the learned coefficient of 10 can be interpreted as follows: increasing `speed` by one category (e.g. medium -> fast) increases the predicted popularity by 10.

<b>END SOLUTION</b>
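
The arithmetic in Q17 can be spelled out as below. The sketch assumes the ordinal encoder maps the categories to 0, 1, 2 in the given order and that no scaling is applied to `speed` afterwards; `rest_of_prediction` is a hypothetical name for the intercept plus the contribution of all other features.

```python
speed_order = ["slow", "medium", "fast"]
# Assumed ordinal encoding: slow -> 0, medium -> 1, fast -> 2
speed_code = {category: i for i, category in enumerate(speed_order)}

speed_coef = 10.0
prediction_medium = 50.0

# Everything in the linear prediction except the speed term stays fixed
rest_of_prediction = prediction_medium - speed_coef * speed_code["medium"]
prediction_fast = rest_of_prediction + speed_coef * speed_code["fast"]

print(prediction_fast)
```

Only the `speed` term changes, and it changes by one encoded unit times the coefficient.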

## Q18
---
Here is a SHAP dependence plot for the "energy" feature for SHAP values for class 1 (class 0: the user dislikes the song, class 1: the user likes the song) for the Spotify data set, which you used in homework 3. Comment on the relationship between the "energy" feature and the target class 1 based on this plot. 

![](img/shap_dependence_plot_class1.png)

<b>BEGIN SOLUTION</b>
> The plot captures the non-linear relationship between energy and the target to some extent. Based on this model, the user likes songs with middle range of energy, really dislikes songs with lower energy, and kind of dislikes songs with very high energy. 

<b>END SOLUTION</b>

## Q19
---

Assuming that all the appropriate libraries are imported, the code below tries to carry out hyperparameter optimization using random search. But the code below has some problems. Point out at least two major problems in this code, clearly mention the problems, and fix the code.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
vec = CountVectorizer()
X_train_enc = vec.fit_transform(X_train)
X_test_enc = vec.transform(X_test)
lr = LogisticRegression(max_iter=2000)
vocab = vec.get_feature_names_out()

param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "countvectorizer__max_features": randint(100, len(vocab))
}
random_search = RandomizedSearchCV(
    lr,
    param_distributions=param_dist,
    n_iter=50,
    n_jobs=-1,
    return_train_score=True
)

random_search.fit(X_train_enc, y_train)
```

<b>BEGIN SOLUTION</b>

Pointing out any two of the problems below should be fine.

1. We are passing already-transformed data to random search, which has built-in cross-validation. `CountVectorizer` was fit on the whole training set, so information from the validation folds leaks into training and we break the golden rule during cross-validation.

2. Since we are passing logistic regression directly to random search and are not using a pipeline, the `__` syntax in the parameter names is not valid.

3. We are transforming the data outside random search, so the `max_features` hyperparameter is invalid inside the random search.

Here is the fixed code.

```
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=123)
pipe_lr = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=2000))
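# Fit a throwaway vectorizer on the training split only to get an upper bound for max_features
vocab = CountVectorizer().fit(X_train).get_feature_names_out()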
param_dist = {
    "logisticregression__C": loguniform(1e-3, 1e3),
    "countvectorizer__max_features": randint(100, len(vocab))
}
random_search = RandomizedSearchCV(
    pipe_lr,
    param_distributions=param_dist,
    n_iter=50,
    n_jobs=-1,
    return_train_score=True
)

random_search.fit(X_train, y_train)
```
<b>END SOLUTION</b>


## Q20
---
In Lecture 8 we talked about optimization bias or overfitting of the validation set. Why are small datasets more prone to the phenomenon of overfitting the validation set? Briefly explain. 

<b>BEGIN SOLUTION</b>

The validation error is noisier with less data. A smaller dataset means the validation splits are going to be small during the cross-validation used for hyperparameter optimization. If we try out many possibilities on these small validation sets, we might get good scores by luck for a certain hyperparameter combination that does not generalize well to the test set or the deployment data.

<b>END SOLUTION</b>

## Q21
---
In hw8 you worked with text features generated from CountVectorizer vs. text features generated from pre-trained word embeddings. Let's say you had the following two texts:

"Why do u study all the time, can't we hang out @ the pub instead instead?"

and

"You're always researching and working instead of meeting up with me at the bar!"

Which of the two approaches, CountVectorizer or pre-trained word embeddings, seems more promising for detecting that these two messages are similar? Briefly justify your answer.

<b>BEGIN SOLUTION</b>

I would say the pre-trained embedding, because it can tell that there are similar words, like study vs. researching/working, or pub vs. bar. On the other hand, CountVectorizer only looks for the exact same words being present in both messages, and there's not a lot of overlap here.

<b>END SOLUTION</b>
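
The lack of overlap mentioned in the Q21 solution can be checked with a small sketch. It does not call `CountVectorizer`; instead the hypothetical `tokenize` helper mimics its default behaviour (lowercasing and keeping tokens of two or more word characters).

```python
import re

text_a = "Why do u study all the time, can't we hang out @ the pub instead instead?"
text_b = "You're always researching and working instead of meeting up with me at the bar!"

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters, which mirrors
    # CountVectorizer's default token pattern.
    return set(re.findall(r"\b\w\w+\b", text.lower()))

tokens_a = tokenize(text_a)
tokens_b = tokenize(text_b)

shared = tokens_a & tokens_b
jaccard = len(shared) / len(tokens_a | tokens_b)

print(sorted(shared), round(jaccard, 2))
```

The only shared tokens are "instead" and "the", neither of which says anything about studying or going to a pub, so a bag-of-words representation sees the two messages as almost unrelated.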

Suppose you run K-Means twice with different initializations on a toy dataset and you get the following cluster assignments with the two runs: K-Means run A and K-Means run B, as shown below. 

|           | K-Means run A  | K-Means run B |
| --------- | --------- | --------- |
| example 1 |	0 	         | 2 |
| example 2 |	1 	         | 1 |
| example 3 |	1 	         | 1 |
| example 4 |	0 	         | 2 |
| example 5 |	0 	         | 2 |
| example 6 |	2 	         | 0 |

Do the two runs result in the same set of cluster centers? Briefly explain. (Max 60 words.)

<b>BEGIN SOLUTION</b>

Although the labels differ, both runs group the examples identically ({1, 4, 5}, {2, 3}, and {6}); only the cluster labels are permuted. The clusters are therefore equivalent, and the two runs result in the same set of cluster centers.

<b>END SOLUTION</b>
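
One way to verify that the two runs are equivalent is to compare the groupings rather than the labels. The `partition` helper below is a hypothetical name introduced for this sketch.

```python
# Cluster labels for examples 1..6, copied from the table above
run_a = [0, 1, 1, 0, 0, 2]
run_b = [2, 1, 1, 2, 2, 0]

def partition(labels):
    """Group example numbers by label and drop the label names."""
    groups = {}
    for example, label in enumerate(labels, start=1):
        groups.setdefault(label, set()).add(example)
    return {frozenset(members) for members in groups.values()}

same_clusters = partition(run_a) == partition(run_b)
print(same_clusters)
```

Since each cluster contains the same examples in both runs, the mean of each cluster, and hence the set of centers, is the same.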

## Q22
---

Continuing the previous question, suppose K-Means run A returns the following cluster centers.

| | | |
| ---- | ---- | --- |
| center 0	| 5.5	| 1.7 |
| center 1	| -2.6	| 9.1 |
| center 2	| 1.0	| -1.0| 

What might be d (the number of features) in this data?  

<b>BEGIN SOLUTION</b>

- [X] 2
- [ ] 3
- [ ] 4
- [ ] Cannot determine from the given information

<b>END SOLUTION</b>

## Q23
---

Continuing the previous question: To which cluster would K-Means run A assign the new point [0, 0]? 

<b>BEGIN SOLUTION</b>

- [ ] Center 0
- [ ] Center 1
- [X] Center 2
- [ ] Cannot determine from the given information

<b>END SOLUTION</b>
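
The assignment in Q23 follows from comparing Euclidean distances to the three centers, which is how K-Means assigns a new point. A minimal sketch using only the standard library:

```python
import math

# Cluster centers from K-Means run A
centers = {0: (5.5, 1.7), 1: (-2.6, 9.1), 2: (1.0, -1.0)}
new_point = (0.0, 0.0)

# Euclidean distance from the new point to each center
distances = {label: math.dist(new_point, center) for label, center in centers.items()}
assigned = min(distances, key=distances.get)

print(distances, assigned)
```

Center 2 is about 1.41 away from the origin, far closer than the other two centers.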

## Q24
---

Continuing the previous question: suppose you fit a DBSCAN model on this toy dataset. Would you be able to call predict on the new point [0, 0]? Why or why not?

<b>BEGIN SOLUTION</b>

No. Unlike K-Means, DBSCAN doesn't have a clear notion of predict for new points. To get a cluster assignment for a new point, you have to run the algorithm again by including this point.

<b>END SOLUTION</b>
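
In scikit-learn this difference shows up directly in the estimator interfaces. The sketch below uses a made-up four-point dataset and illustrative `eps` and `min_samples` values; none of these come from the question.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# KMeans exposes predict for new points; DBSCAN only offers fit and fit_predict
kmeans_has_predict = hasattr(KMeans, "predict")
dbscan_has_predict = hasattr(DBSCAN, "predict")
dbscan_has_fit_predict = hasattr(DBSCAN, "fit_predict")

# Toy data (assumed for illustration): two tight pairs of points
X = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

print(kmeans_has_predict, dbscan_has_predict, dbscan_has_fit_predict)
print(labels)
```

To label a new point with DBSCAN you would append it to `X` and call `fit_predict` again on the enlarged dataset.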

## Q25
---

Let's imagine a company is building a supervised machine learning system to predict the success of future employees, to be used in the hiring process. The features for each employee include some CountVectorizer features from their resume & cover letter, pre-trained word embedding features based on the person's full name, some educational features (level of education, GPA), some professional experience features (years of work experience, categorical features based on past job titles), etc. They train the model using data from existing employees, with the target variable set as the current salary of those current employees, which is taken as a proxy for success. Describe a major issue with this approach from a bias/fairness perspective.

<b>BEGIN SOLUTION</b>

The algorithm will learn to emulate a system that is already biased. Salaries are probably not a good metric because it is known that there are wage gaps, for example between men and women. Furthermore, some of the features used are problematic as well, such as using their name, which can be tied to gender, race and culture. There are likely other issues as well.

<b>END SOLUTION</b>