In [None]:
# Initialize Otter
import otter
grader = otter.Notebook("lab4.ipynb")

![](img/571_lab_banner.png)

# Lab 4: Naive Bayes and Logistic Regression

## Imports

In [None]:
import sys
from hashlib import sha1

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import (
    GridSearchCV,
    RandomizedSearchCV,
    cross_validate,
    train_test_split,
)
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert-warning">
    
## Instructions  
rubric={mechanics}

You will earn points for following these instructions and successfully submitting your work on Gradescope.  

### Before you start  

- Read the **[Use of Generative AI Policy](https://ubc-mds.github.io/policies/)**.
  
- Review the **[General Lab Instructions](https://ubc-mds.github.io/resources_pages/general_lab_instructions/)**.
    
- Check the **[MDS Rubrics](https://github.com/UBC-MDS/public/tree/master/rubric)** for grading criteria.

### Before submitting  

- **Run all cells** (▶▶ button) to ensure the notebook executes cleanly from top to bottom.

  - Execution counts must start at **1** and be sequential.
    
  - Notebooks with missing outputs or errors may lose marks.
    
- **Include a clickable link to your GitHub repository** below this cell.

- Make at least 3 commits to your GitHub repository and ensure it's up to date. If Gradescope becomes inaccessible, we'll grade the most recent GitHub version submitted before the deadline.

- **Do not upload or push data files** used in this lab to GitHub or Gradescope. (A `.gitignore` is provided to prevent this.)  



### Submitting on Gradescope  

- Upload **only** your `.ipynb` file (with outputs shown) and any required output files. Do **not** submit extra files.
  
- If needed, refer to the [Gradescope Student Guide](https://lthub.ubc.ca/guides/gradescope-student-guide/).  
- If your notebook is too large to render, also upload a **Web PDF** or **HTML** version.  
  - You can create one using **File $\rightarrow$ Save and Export Notebook As**.  
  - If you get an error when creating a PDF, try running the following commands in your lab directory:  

    ```bash
    conda install -c conda-forge nbconvert-playwright
    jupyter nbconvert --to webpdf lab4.ipynb
    ```  

  - Ensure all outputs are visible in your PDF or HTML file; TAs cannot grade your work if outputs are missing.

</div>


_Points:_ 2

YOUR REPO LINK GOES HERE

<!-- END QUESTION -->

<br><br>

## Exercise 1: Naive Bayes by hand
<hr>

Naive Bayes is commonly used for text classification. In the lecture, we explored its use for spam detection. Now we'll apply it to another text classification task called [sentiment analysis](https://en.wikipedia.org/wiki/Sentiment_analysis). 

Consider the toy data provided, which contains 10 training examples. Each example has 4 binary features indicating the presence or absence of a word, as well as the sentiment associated with each example.

In [None]:
sentiment_toy_data = {
    "predictable": [1, 1, 1, 0, 0, 1, 1, 0, 0, 1],
    "fun": [0, 1, 0, 1, 1, 0, 0, 0, 0, 1],
    "pathetic": [0, 0, 1, 0, 0, 1, 0, 1, 1, 1],
    "satire": [0, 0, 0, 0, 1, 1, 0, 1, 0, 1],
    "target": [
        "negative",
        "positive",
        "negative",
        "positive",
        "positive",
        "negative",
        "positive",
        "negative",
        "negative",
        "negative",
    ],
}

toy_df = pd.DataFrame(sentiment_toy_data)
toy_df

Given this information, you want to predict the target (positive or negative sentiment) for the following unseen test example: 
    
$$x_{new} = \begin{bmatrix}1 & 1 & 0 & 1\end{bmatrix}$$

In the following sub-exercises, we'll do this step by step. 
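
As a reminder of what these steps build toward, the naive Bayes prediction rule can be sketched as below. This is the standard textbook form, not something specific to this lab; the symbols $y$ (a candidate class), $x_j$ (the value of the $j$-th feature in $x_{new}$), and $d$ (the number of features, here 4) are introduced only for this illustration.

$$\hat{y} = \arg\max_{y \in \{\text{positive},\, \text{negative}\}} \; p(y) \prod_{j=1}^{d} p(x_j \mid y)$$

The product on the right is proportional to the posterior $p(y \mid x_{new})$, so the shared normalizing constant is not needed when you only want the most probable label.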

<br><br>

<div class="alert alert-info">
    
### 1.1 Class prior probabilities
rubric={autograde}

**Your tasks:**

Compute the estimates of the class prior probabilities by hand.
1. Calculate $p(\text{positive})$ and store the result in a variable named `pos_prior`.
1. Calculate $p(\text{negative})$ and store the result in a variable named `neg_prior`.

Simply compute the raw frequencies/proportions and assign them to the appropriate variables as fractions, such as 1/2. There's no need to show your calculation steps or provide explanations.

</div>

<div class="alert alert-warning">
    
Solution_1.1
    
</div>

_Points:_ 2

In [None]:
pos_prior = None  # type: float
neg_prior = None  # type: float

...

In [None]:
grader.check("q1.1")

<br><br>

<div class="alert alert-info">
    
### 1.2 Conditional probabilities
rubric={autograde}

**Your tasks:**

1. Manually calculate the conditional probabilities needed by naive Bayes for the test example $x_{new}$. 
    $$x_{new} = \begin{bmatrix}1 & 1 & 0 & 1\end{bmatrix}$$
2. Store the conditional probability values in the following variables. Each variable corresponds to the likelihood of a word's presence or absence given its class. For example, `fun1_pos` stands for $p(\text{fun} = 1  \mid \text{positive})$ and `pathetic0_pos` represents $p(\text{pathetic} = 0  \mid \text{positive})$. 

You do not need to show any work. Also, do not consider Laplace smoothing here, just compute the raw frequencies/proportions by hand and store them into the appropriate variables as fractions (e.g., 1/2).

</div>

<div class="alert alert-warning">
    
Solution_1.2
    
</div>

_Points:_ 4

In [None]:
predictable1_pos = None  # p(predictable = 1 | positive), type: float
predictable1_neg = None  # p(predictable = 1 | negative), type: float
fun1_pos = None  # p(fun = 1 | positive), type: float
fun1_neg = None  # p(fun = 1 | negative), type: float
pathetic0_pos = None  # p(pathetic = 0 | positive), type: float
pathetic0_neg = None  # p(pathetic = 0 | negative), type: float
satire1_pos = None  # p(satire = 1 | positive), type: float
satire1_neg = None  # p(satire = 1 | negative), type: float

...

In [None]:
grader.check("q1.2")

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 1.3 Prediction
rubric={reasoning}


**Your tasks:**

1. Based on the naive Bayes model and the probabilities you've estimated, what is the most probable label for the test example, $x_{new}$, given below: "positive" or "negative"? Please show your calculations. 

$$x_{new} = \begin{bmatrix}1 & 1 & 0 & 1\end{bmatrix}$$

> We are not expecting any code here. Compute the probabilities by hand and show your steps. You may typeset in LaTeX (preferred) or upload a clear photo/scan of handwritten work.

</div>

<div class="alert alert-warning">
    
Solution_1.3
    
</div>

_Points:_ 5

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 1.4 Smoothing
rubric={reasoning}

**Your tasks:**

1. Suppose you are asked to predict the sentiment of another test example $\hat{x}_{new}$. What issues might arise when calculating $p(\text{positive} \mid \hat{x}_{new})$? How would you address these challenges? 

$$\hat{x}_{new} = \begin{bmatrix}0 & 1 & 1 & 1\end{bmatrix}$$

> You don't have to write any code here. A brief explanation in a few sentences is enough.

</div>

<div class="alert alert-warning">
    
Solution_1.4
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
...

In [None]:
...

In [None]:
...

<!-- END QUESTION -->

<div class="alert alert-warning">
⚠️ Don't forget to <code>git commit</code>. Regular commits will help you track your progress!  
</div>

<br><br>

<div class="alert alert-info">
    
## Exercise 2: Implementing `DummyClassifier`
<hr>
rubric={autograde}

In this course, you will generally **not** be asked to implement machine learning algorithms (like logistic regression) from scratch; you will do that in DSCI 572. However, this exercise is an exception: you will implement the simplest possible classifier, `DummyClassifier`.

As a reminder, `DummyClassifier` is meant as a baseline and is generally the worst possible "model" you could "fit" to a dataset. All it does is predict the most popular class in the training set. So if there are more 0s than 1s it predicts 0 every time, and if there are more 1s than 0s it predicts 1 every time. For `predict_proba` it looks at the frequencies in the training set, so if you have 30% 0s and 70% 1s, it predicts `[0.3 0.7]` every time. Thus, `fit` only looks at `y` (not `X`).

Below you will find starter code for a class called `MyDummyClassifier`, which has methods `fit()`, `predict()`, `predict_proba()` and `score()`. Your task is to fill in those four functions. To get you started, I have given you a `return` statement in each case that returns the correct data type: `fit` returns nothing, `predict` returns an array whose size is the number of examples, `predict_proba` returns an array whose size is the number of examples $\times$ 2, and `score` returns a number.

The next code block has some tests you can use to assess whether your code is working. 

I suggest starting with `fit` and `predict`, and making sure those are working before moving on to `predict_proba`. For `predict_proba`, you should return the proportion of each class in the training data. Your `score` function should call your `predict` function. Again, you can compare with `DummyClassifier` using the code below.

To simplify this question, you can assume **binary classification**, and furthermore that these classes are **encoded as 0 and 1**. In other words, you can assume that `y` contains only 0s and 1s. Scikit-learn's `DummyClassifier` works when you have more than two classes, and also works if the target values are encoded differently, for example as "cat", "dog", "elephant", etc.

> _Hint: In Python, if you define a variable within a class method using `self.variable_name`, it becomes an instance variable, allowing you to access it from other methods within the same class._

</div>
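
To make the behaviour described above concrete, here is a minimal sketch of what scikit-learn's own `DummyClassifier` returns on a tiny made-up dataset. The arrays `X_toy`, `y_toy`, and `X_new` are hypothetical and exist only for this illustration; the sketch does not implement `MyDummyClassifier`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy labels: three 0s and seven 1s (30% / 70%).
y_toy = np.array([0, 1, 1, 1, 0, 1, 1, 1, 1, 0])
# DummyClassifier ignores the feature values; only the number of rows matters.
X_toy = np.zeros((len(y_toy), 2))

dc = DummyClassifier(strategy="prior")
dc.fit(X_toy, y_toy)

X_new = np.zeros((4, 2))         # four unseen examples
preds = dc.predict(X_new)        # most frequent training class for every row
probs = dc.predict_proba(X_new)  # training class proportions for every row
acc = dc.score(X_toy, y_toy)     # fraction of labels equal to the majority class

print(preds)
print(probs)
print(acc)
```

Every row of `probs` is identical because nothing about `X_new` is used, which is the property your implementation should reproduce.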


<div class="alert alert-warning">
    
Solution_2
    
</div>

_Points:_ 8

In [None]:
class MyDummyClassifier:
    """
    A baseline classifier that predicts the most common class.
    The predicted probabilities come from the relative frequencies
    of the classes in the training data.

    This implementation only works when y only contains 0's and 1's.
    """

    def fit(self, X, y):
        """
        Fit the Dummy Classifier to the training data.

        Parameters:
        - X (array-like, shape (n_samples, n_features)): Training data.
        - y (array-like, shape (n_samples,)): Target labels (0's and 1's).

        Returns:
        - None
        """
        # Replace with your code
        ...
        return None  # Replace with your code

    def predict(self, X):
        """
        Predict the target labels for the input data.

        Parameters:
        - X (array-like, shape (n_samples, n_features)): Input data.

        Returns:
        - y_pred (array-like, shape (n_samples,)): Predicted target labels.
        """
        predictions = np.zeros(X.shape[0])  # initializing with all predictions set to 0
        # Replace with your code
        ...
        return predictions

    def predict_proba(self, X):
        """
        Predict class probabilities for the input data.

        Parameters:
        - X (array-like, shape (n_samples, n_features)): Input data.

        Returns:
        - probs (array-like, shape (n_samples, 2)): Predicted class probabilities.
          Column 0 corresponds to class 0, and column 1 corresponds to class 1.
        """
        probs = np.zeros((X.shape[0], 2))  # initializing all probabilities set to 0.
        # Replace with your code
        ...
        return probs  # Replace with your code

    def score(self, X, y):
        """
        Calculate the accuracy of the model on the input data.

        Parameters:
        - X (array-like, shape (n_samples, n_features)): Input data.
        - y (array-like, shape (n_samples,)): True target labels.

        Returns:
        - accuracy (float): Accuracy of the model on the input data.
        """
        accuracy = None
        # Replace with your code
        ...
        return accuracy

In [None]:
# For testing, generate random data
n_train = 101
n_valid = 21
d = 5
X_train_dummy = np.random.randn(n_train, d)
X_valid_dummy = np.random.randn(n_valid, d)
y_train_dummy = np.random.randint(2, size=n_train)
y_valid_dummy = np.random.randint(2, size=n_valid)

my_dc = MyDummyClassifier()
sk_dc = DummyClassifier(strategy="prior")

my_dc.fit(X_train_dummy, y_train_dummy)
sk_dc.fit(X_train_dummy, y_train_dummy)

assert np.array_equal(my_dc.predict(X_train_dummy), sk_dc.predict(X_train_dummy))
assert np.array_equal(my_dc.predict(X_valid_dummy), sk_dc.predict(X_valid_dummy))

In [None]:
y_train_dummy

In [None]:
assert np.allclose(
    my_dc.predict_proba(X_train_dummy), sk_dc.predict_proba(X_train_dummy)
)
assert np.allclose(
    my_dc.predict_proba(X_valid_dummy), sk_dc.predict_proba(X_valid_dummy)
)

In [None]:
assert np.isclose(
    my_dc.score(X_train_dummy, y_train_dummy), sk_dc.score(X_train_dummy, y_train_dummy)
)
assert np.isclose(
    my_dc.score(X_valid_dummy, y_valid_dummy), sk_dc.score(X_valid_dummy, y_valid_dummy)
)

In [None]:
grader.check("q2")

<div class="alert alert-warning">
⚠️ Don't forget to <code>git commit</code>. Regular commits will help you track your progress!  
</div>

<br><br><br><br>

<div class="alert alert-info">
    
## Exercise 3: Classifying happy moments
<hr>

Let's end this course on a happy note! We will use the [HappyDB](https://www.kaggle.com/ritresearch/happydb) corpus, which contains about 100,000 happy moments classified into 7 categories: *affection, exercise, bonding, nature, leisure, achievement, enjoy_the_moment*. The data was crowd-sourced via [Amazon Mechanical Turk (MTurk)](https://www.mturk.com/). The ground truth label is not available for all examples, and in this lab, we'll only use the examples where ground truth is available (~15,000 examples). 

- Download the data from [here](https://www.kaggle.com/ritresearch/happydb).
- Unzip the file and copy it into a folder called `data` under the lab directory.

The code below reads the data CSV (assuming that it's present in the current directory as *`data/cleaned_hm.csv`*),  cleans it up a bit, and splits it into train and test splits. 

</div>

In [None]:
df = pd.read_csv("data/cleaned_hm.csv", index_col=0)
sample_df = df.dropna()
sample_df.head()

In [None]:
sample_df = sample_df.rename(
    columns={"cleaned_hm": "moment", "ground_truth_category": "target"}
)
sample_df

In [None]:
train_df, test_df = train_test_split(sample_df, test_size=0.3, random_state=123)
X_train, y_train = train_df["moment"], train_df["target"]
X_test, y_test = test_df["moment"], test_df["target"]

It's helpful to understand the distribution of our target values, and identify potential label imbalances in our dataset. If substantial imbalance exists, the model tends to overfit to the majority class while learning very little from the minority class. In more extreme cases, it might even default to predicting only the majority class. 

By checking the frequency of each target label, we can also estimate the baseline performance, that is, the accuracy a `DummyClassifier` would achieve by always predicting the most frequent class.

In [None]:
train_df["target"].value_counts(normalize=True)

The results indicate a clear class imbalance. The "nature" and "exercise" categories are underrepresented, while "affection" appears most frequently, accounting for roughly 34% of the examples. Consequently, a `DummyClassifier` that always predicts "affection" would achieve a baseline accuracy of about 34%.
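
The link between the largest class proportion and the baseline accuracy can be checked on a small example. The sketch below uses hypothetical labels (`y_toy`) and a placeholder feature column, not HappyDB data, and assumes `strategy="most_frequent"` for the dummy model.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Hypothetical labels, not taken from HappyDB.
y_toy = pd.Series(
    ["affection"] * 5 + ["achievement"] * 3 + ["nature"] * 2, name="target"
)
X_toy = pd.DataFrame({"unused": range(len(y_toy))})  # features are ignored

proportions = y_toy.value_counts(normalize=True)
majority_share = proportions.max()  # share of the most frequent label

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
baseline_accuracy = dummy.score(X_toy, y_toy)

print(proportions)
print(majority_share, baseline_accuracy)
```

The two printed numbers agree because always predicting the majority label is correct exactly as often as that label occurs.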

<br><br>

<div class="alert alert-info">
    
### 3.1 Different classifiers 
rubric={autograde}

**Your tasks:**
1. For each model in the `models` dictionary below, perform 5-fold cross-validation. Show the mean and standard deviation of the following metrics:
    - `fit_time`
    - `score_time`
    - `test_score` (cross-validation score)
    - `train_score` (training score)
2. Store results in a pandas dataframe named `results_df`, where
    - Each row corresponds to a model
    - Each column shows mean and standard deviation of the metrics listed above.
    - Column names should be: `fit_time`, `score_time`, `test_score`, and `train_score`.
      
Example table format: 

  | Model          | fit_time        | score_time      | test_score   | train_score  |
  |----------------|-----------------|-----------------|--------------|--------------|
  | dummy          |  0.085 ± 0.005  |  0.021 ± 0.003  |  0.343 ± 0.0 |  0.343 ± 0.0 | 
  | decision tree  |                 |                 |              |              |
  | kNN            |                 |                 |              |              |
  | RBF SVM        |                 |                 |              |              |

> Use the `build_pipeline` function below to set up pipelines for different models, which uses `CountVectorizer(stop_words="english")`. 

> You may reuse `mean_std_cross_val_scores` function from the lecture notes (with attribution). 

> ⏳ The code might take some time to run. Be patient.

</div>
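
As a reference for where the four metrics come from, here is a minimal sketch of `cross_validate` on a hypothetical toy corpus. The data, the choice of `MultinomialNB`, and `cv=2` are assumptions made to keep the example tiny; the exercise itself asks for 5 folds, and the summary format shown is only one possible way to combine mean and standard deviation.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only.
X_toy = pd.Series(
    [
        "great fun movie",
        "fun and great acting",
        "loved the fun plot",
        "great plot",
        "pathetic boring movie",
        "boring and pathetic acting",
        "hated the boring plot",
        "pathetic plot",
    ]
)
y_toy = pd.Series(["positive"] * 4 + ["negative"] * 4)

pipe = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
# cv=2 only because the toy corpus is tiny.
scores = cross_validate(pipe, X_toy, y_toy, cv=2, return_train_score=True)

# One "mean ± std" string per metric returned by cross_validate.
summary = {
    name: f"{values.mean():.3f} ± {values.std():.3f}"
    for name, values in scores.items()
}
print(summary)
```

Note that `train_score` is only present because `return_train_score=True` was passed.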

In [None]:
models = {
    "dummy": DummyClassifier(random_state=123),
    "Decision Tree": DecisionTreeClassifier(random_state=123),
    "KNN": KNeighborsClassifier(),
    "RBF SVM": SVC(random_state=123),
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=2000, random_state=123),
}

In [None]:
def build_pipeline(model):
    return make_pipeline(CountVectorizer(stop_words="english"), model)

<div class="alert alert-warning">
    
Solution_3.1
    
</div>

_Points:_ 8

In [None]:
...

In [None]:
grader.check("q3.1")

<br><br>

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 3.2 Discussion
rubric={reasoning}

**Your tasks:**

1. Reflect on the results from Exercise 3.1. You may consider the following questions for discussion:
    - Which models excel in this task? 
    - Among the best performing models, which one is the fastest one? 
    - Which model or models seem to suffer from overfitting?
</div>

<div class="alert alert-warning">
    
Solution_3.2
    
</div>

_Points:_ 3

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 3.3 Hyperparameter optimization 
rubric={accuracy,quality}

**Your tasks:**

1. Define a pipeline with 
    - `CountVectorizer(stop_words="english")`
    - `LogisticRegression(max_iter=2000)`
2. Use `RandomizedSearchCV` to jointly optimize `C` of logistic regression and `max_features` of `CountVectorizer`. You can choose a suitable range for hyperparameter value and number of iterations. Make sure your `RandomizedSearchCV` is called `random_search` for the autograder to work properly in Exercises 4 and 5.
3. Show the top 3 models based on your random search, highlighting their training scores, CV scores, hyperparameter configurations, and fit times.

</div>
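
When tuning hyperparameters inside a pipeline, the parameter names must include the step name. The sketch below only shows that naming convention for a pipeline built with `make_pipeline`; the values passed to `set_params` are arbitrary placeholders, not suggested settings, and no search is run.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    CountVectorizer(stop_words="english"), LogisticRegression(max_iter=2000)
)

# make_pipeline names each step after its lowercased class name.
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested hyperparameters are addressed as "<step name>__<parameter name>".
param_names = [
    name
    for name in pipe.get_params()
    if name.endswith("__C") or name.endswith("__max_features")
]
print(param_names)

# The same names work with set_params; these values are arbitrary placeholders.
pipe.set_params(countvectorizer__max_features=1000, logisticregression__C=0.1)
```

These are the same keys a `param_distributions` dictionary would use when the pipeline is handed to `RandomizedSearchCV`.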

<div class="alert alert-warning">
    
Solution_3.3
    
</div>

_Points:_ 6

In [None]:
...

In [None]:
...

In [None]:
random_search = None

...

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 3.4 Discussion
rubric={reasoning}

**Your tasks:**

1. Did the hyperparameter optimization make a difference? Are there significant disparities in the cross-validation scores among the top three models produced by your random search? Provide a brief analysis.

</div>

<div class="alert alert-warning">
    
Solution_3.4
    
</div>

_Points:_ 2

_Type your answer here, replacing this text._

<!-- END QUESTION -->

<div class="alert alert-warning">
⚠️ Don't forget to <code>git commit</code>. Regular commits will help you track your progress!  
</div>

<br><br><br><br>

<div class="alert alert-info">
    
## Exercise 4: Interpreting features, test score, and final evaluation
<hr>

</div>


<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 4.1 Most informative words
rubric={accuracy,quality}

One of the major advantages of linear models is their interpretable coefficients, which allow us to understand predictions based on feature importances. It is useful to look at what the model has learned and which features it has prioritized. Professional data scientists often use their domain knowledge or past experience to judge whether the coefficients of each feature are reasonable. In this exercise, we'll explore the coefficients learned by our logistic regression model. 

**Your tasks:**

1. Using the best estimator from 3.3, identify
    - the top 5 words that are positively associated and top 5 words which are negatively associated with the class "affection". 
    - the top 5 words that are positively associated and top 5 words which are negatively associated with the class "exercise". 

> The information you need is exposed by the `coef_` attribute of [LogisticRegression](http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) object. Note that for multiclass classification, for each class, the model learns coefficients per feature for that class. 

> The vocabulary (i.e., the mapping from feature indices to actual words) can be obtained by calling `get_feature_names_out()` on the `CountVectorizer` object. 

> You can adapt the code from lecture notes for this task. Please include a brief attribution like "Code adapted from Lecture 8."

</div>

<div class="alert alert-warning">
    
Solution_4.1
    
</div>

_Points:_ 6

_Type your answer here, replacing this text._

In [None]:
...

In [None]:
positive5_affection = None  # list
negative5_affection = None  # list

...

In [None]:
positive5_exercise = None  # list
negative5_exercise = None  # list

...

<!-- END QUESTION -->

<br><br>

<div class="alert alert-info">
    
### 4.2 Evaluation on test data
rubric={autograde}

Hopefully, the most informative words identified in the previous exercise made sense. Now, let's evaluate how well our best model performs on both the test set and some unseen examples.

**Your tasks:**

1. Evaluate the best model found by `random_search` from Exercise 3.3 on the entire train set and test set. Store the results in the corresponding variables below.

> Note: When `refit=True` (the default), [`RandomizedSearchCV`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) automatically retrains the best model on the full training set after finding the optimal parameters. Therefore, you can directly call `.score()` on the `random_search` object with your train and test data to get the respective scores. 

</div>
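
The refit behaviour described in the note can be illustrated on synthetic data. The sketch below is an assumption-laden toy: the data comes from `make_classification`, only `C` is searched, and `n_iter`, `cv`, and the `loguniform` range are arbitrary choices unrelated to the lab's expected answer.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Hypothetical synthetic data, unrelated to HappyDB.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=2000),
    param_distributions={"C": loguniform(1e-2, 1e2)},
    n_iter=5,
    cv=3,
    random_state=0,
)
search.fit(X_tr, y_tr)  # refit=True by default

cv_score = search.best_score_         # mean CV score of the best candidate
train_acc = search.score(X_tr, y_tr)  # delegates to the refitted best_estimator_
test_acc = search.score(X_te, y_te)

# Scoring the search object and scoring best_estimator_ give the same number.
same = np.isclose(train_acc, search.best_estimator_.score(X_tr, y_tr))
print(cv_score, train_acc, test_acc, same)
```

Keep in mind that `best_score_` is a cross-validation average, so it generally differs from the score on the full training set.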

<div class="alert alert-warning">
    
Solution_4.2
    
</div>

_Points:_ 2

In [None]:
random_search_best_score = None
train_score = None
test_score = None

...

print("Random Search best model score: %0.3f" % random_search_best_score)
print("Train score on the full train set: %0.3f" % train_score)
print("Test score on the full test set: %0.3f" % test_score)

In [None]:
grader.check("q4.2")

<br><br>

<div class="alert alert-info">
    
### 4.3 Evaluation using probability scores
rubric={autograde}

**Your tasks:**

Using this model, find the following moments in the test set 

1. where the model is most confident that the moment belongs to class "achievement" (i.e., an example with highest predicted probability for class "achievement")
2. where the model is most confident that the moment belongs to class "nature" (i.e., an example with the highest predicted probability of being "nature")

In each case, print out the moment and the associated probability score. 

</div>
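
The columns of `predict_proba` follow the order of the fitted model's `classes_`, which is worth checking before picking a column. The sketch below shows that lookup on a hypothetical toy corpus; the data and the variable names (`X_toy`, `X_unseen`, `col`, `best_row`) are made up for this illustration and are not the lab's data.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, for illustration only.
X_toy = [
    "ran five miles", "morning yoga class",
    "hiked in the forest", "watched the sunset",
    "passed the exam", "got a promotion",
]
y_toy = ["exercise", "exercise", "nature", "nature", "achievement", "achievement"]
X_unseen = ["yoga at sunset", "exam results", "forest walk"]

pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=2000))
pipe.fit(X_toy, y_toy)

probs = pipe.predict_proba(X_unseen)      # shape (n_examples, n_classes)
classes = list(pipe.classes_)             # column order of probs
col = classes.index("nature")             # column holding p("nature")
best_row = int(np.argmax(probs[:, col]))  # example with the highest p("nature")

print(X_unseen[best_row], probs[best_row, col])
```

Looking up the column by class name avoids hard-coding an index that would silently change if the set of classes changed.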

<div class="alert alert-warning">
    
Solution_4.3
    
</div>

_Points:_ 4

In [None]:
...

In [None]:
achievement_prob = None  # numpy.float64
achievement_msg = None  # str

...

In [None]:
nature_prob = None  # numpy.float64
nature_msg = None  # str

...

In [None]:
grader.check("q4.3")

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### 4.4 Fake moments 
rubric={reasoning}

**Your tasks:**

1. Test the best model on some fake moments. Some examples are given below. Feel free to add moments to this list. 
2. Briefly note your observations.

</div>

<div class="alert alert-warning">
    
Solution_4.4
    
</div>

_Points:_ 3

In [None]:
test_moments = [
    "I just finished my last assignment!",
    "On the weekend, I spent some quality time with my best friend.",
    "Collaborating with peers and teaching team members is what makes MDS enjoyable!!",
    "I went for a hike in the forest.",
    "I did yoga this morning.",
    "I am still breathing and I am alive!",
]

_Type your answer here, replacing this text._

In [None]:
...

<!-- END QUESTION -->

<div class="alert alert-warning">
⚠️ Don't forget to <code>git commit</code>. Regular commits will help you track your progress!  
</div>

<br><br><br><br>

<div class="alert alert-info">
    
## Exercise 5: Food for thought
<hr>

Each lab will have a few challenging questions. In some labs, I will be including challenging questions which lead to the material in the upcoming week. These are usually low-risk questions and will contribute at most 5% of the lab grade. The main purpose here is to challenge yourself or dig deeper in a particular area. When you start working on labs, attempt all other questions before moving to these questions. If you are running out of time, please skip these questions. 

We will be more strict with the marking of these questions. There might not be model answers. If you want to get full points in these questions, your answers need to
- be thorough, thoughtful, and well-written
- provide convincing justification and appropriate evidence for the claims you make 
- impress the reader of your lab with your understanding of the material, your analytical and critical reasoning skills, and your ability to think on your own

</div>

![](img/eva-game-on.png)

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### (Challenging) Exercise 5.1
rubric={reasoning}

**Your tasks:**

- Reflect on your journey through this course. Please identify and elaborate on at least three key concepts or experiences where you had an "aha" moment. How would you use the concepts learned in this course in your personal projects or how would you approach your past projects differently based on the insights gained in this course? We encourage you to dig deep and share your genuine reflections.

</div>

<div class="alert alert-warning">
    
Solution_5.1
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### (Challenging) Exercise 5.2
rubric={reasoning}

Machine learning has its own workflows and good habits, such as when to split the data, writing your code as clear pipelines, and tuning your models in a reproducible manner, to ensure the validity of your results. In this course, we not only learned how to use a number of machine learning methods but also some good habits as a machine learning practitioner. In this exercise, I would like you to review a couple of Kaggle notebooks of your choice for some of the most popular datasets we have explored in this class and assess their methodology. 

To get you started, here are a couple of example notebooks

- [Example notebook 1](https://www.kaggle.com/code/gudisesaichand/spotify-song-attributes-eda-and-prediction/notebook)
- [Example notebook 2](https://www.kaggle.com/code/anuragnayak03/adult-income-classification-using-all-classifiers)

and some aspects you might want to examine:  

- Are they splitting the data before EDA?
- Are they carrying out cross-validation? 
- Is their code well-written, compact, and reproducible?
- Do you trust their results? Why or why not?

</div>


<div class="alert alert-warning">
    
Solution_5.2
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<br><br>

<!-- BEGIN QUESTION -->

<div class="alert alert-info">
    
### (Challenging) Exercise 5.3
rubric={reasoning}

We have been saying that one of the main benefits of naive Bayes is that it's fast and scalable. 

**Your tasks:**
1. Try naive Bayes on a large dataset of your choice. Report the scores, `fit` and `score` times. Below are a couple of suggestions for the datasets.  
    - [Amazon customer reviews](https://www.kaggle.com/datasets/bhavikardeshna/amazon-customerreviews-polarity?select=train.csv)
    - [Yelp tip](https://www.kaggle.com/datasets/yelp-dataset/yelp-dataset/versions/6?select=yelp_tip.csv)

> **For this question, please do not include the code in this notebook because we do not want this question to affect the autograder. Please create a separate Jupyter notebook in your GitHub repository and add a link to that notebook here.**

> You are welcome to explore the [`partial_fit` method supported by naive Bayes](https://scikit-learn.org/0.15/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB.partial_fit), which is useful when the whole dataset is too big to fit in memory at once.

</div>
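
For orientation, incremental training with `partial_fit` might look like the sketch below. Everything in it is an assumption for illustration: the in-memory `chunks` list stands in for batches read from a large file, and pairing `MultinomialNB` with `HashingVectorizer` is one possible design, not a required one.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical mini-batches standing in for chunks read from a large file.
chunks = [
    (["great product", "loved it"], ["pos", "pos"]),
    (["terrible quality", "hated it"], ["neg", "neg"]),
    (["great quality", "terrible product"], ["pos", "neg"]),
]

# HashingVectorizer is stateless, so it needs no pass over the full dataset.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vec = HashingVectorizer(n_features=2**10, alternate_sign=False)
nb = MultinomialNB()
all_classes = np.array(["neg", "pos"])  # must be known at the first call

for texts, labels in chunks:
    nb.partial_fit(vec.transform(texts), labels, classes=all_classes)

pred = nb.predict(vec.transform(["great"]))
print(pred)
```

A stateless vectorizer is used here because a `CountVectorizer` would need to see the whole corpus to build its vocabulary, which defeats the purpose of reading the data in chunks.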

<div class="alert alert-warning">
    
Solution_5.3
    
</div>

_Points:_ 1

<!-- END QUESTION -->

<div class="alert alert-warning">
⚠️ Don't forget to <code>git commit</code>. Regular commits will help you track your progress!  
</div>

<br><br><br><br>

Before submitting your assignment, please ensure you have followed all the steps in the **Instructions** section at the top.  

### Submission checklist  

- [ ] Restart the kernel and run all cells (▶▶ button)
- [ ] Make at least three commits to your GitHub repository. 
- [ ] The `.ipynb` file runs without errors and shows all outputs.  
- [ ] Only the `.ipynb` file and required output files are uploaded (no extra files).  
- [ ] If the `.ipynb` file is too large to render on Gradescope, upload a Web PDF and/or HTML version as well.
- [ ] Include the link to your lab GitHub repository below the instructions.  


### Congratulations on finishing your last lab of the course! 👏👏👏

![](img/eva-congrats.png)