# BAIT 509 Assignment 3: Logistic Regression and Evaluation Metrics  

__Evaluates__: Lectures 6 - 9. 

__Rubrics__: Your solutions will be assessed primarily on the accuracy of your coding, as well as the clarity and correctness of your written responses. The MDS rubrics provide a good guide as to what is expected of you in your responses to the assignment questions and how the TAs will grade your answers. See the following links for more details:

- [mechanics rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_mech.md): submit an assignment correctly.
- [accuracy rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_accuracy.md): evaluating your code.
- [reasoning rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_reasoning.md): evaluating your written responses.
- [autograde rubric](https://github.com/UBC-MDS/public/blob/master/rubric/rubric_autograde.md): evaluating questions that are either right or wrong (can be done either manually or automatically).

## Tidy Submission 
rubric={mechanics:2}

- Complete this assignment by filling out this jupyter notebook.
- Any place you see `...` or `____`, you must fill in the function, variable, or data to complete the code.
- Use proper English, spelling, and grammar.
- You will submit two files on Canvas:
    1. This jupyter notebook file containing your responses (an `.ipynb` file); and,
    2. An `.html` file of your completed notebook that will render directly on Canvas without having to be downloaded.
        - To generate this HTML file, you can click `File` -> `Export Notebook As` -> `HTML` in JupyterLab or type the following into a terminal: `jupyter nbconvert --to html_embed assignment.ipynb`.
    
Submit your assignment through UBC Canvas by the deadline listed there.

## Introduction and learning goals <a name="in"></a>
<hr>

Welcome to the assignment! In this assignment, you will practice how to:

- Explain components of a confusion matrix.
- Define precision, recall, and f1-score and use them to evaluate different classifiers.
- Identify whether there is class imbalance and whether you need to deal with it.
- Explain `class_weight` and use it to deal with data imbalance.
- Apply different scoring functions with `cross_validate` and `GridSearchCV` and `RandomizedSearchCV`.
- Explain the general intuition behind linear models.
- Explain the `fit` and `predict` paradigm of linear models.
- Use `scikit-learn`'s `LogisticRegression` classifier.
    - Use `fit`, `predict` and `predict_proba`.   
    - Use `coef_` to interpret the model weights.
- Explain the advantages and limitations of linear classifiers. 


## Exercise 1:  Precision, recall, and f1 score "by hand" (without `sklearn`) <a name="1"></a>
<hr>


Consider the problem of predicting whether a new product will be successful or not and is worth investing in. Below are confusion matrices of two machine learning models: Model A and Model B. 

##### Model A
|    Actual/Predicted         | Predicted successful| Predicted not successful |
| :-------------------------- | ------------------: | -----------------------: |
| **Actually successful**     | 3                   | 5                        |
| **Actually not successful** | 6                   | 96                       |
 

##### Model B
|    Actual/Predicted         | Predicted successful| Predicted not successful |
| :-------------------------- | ------------------: | -----------------------: |
| **Actually successful**     | 6                   |                        14 |
| **Actually not successful** | 0                  |                       90 |  

### 1.1 Positive vs. negative class
rubric={autograde:1, reasoning:1}

<div class="alert alert-info" style="color:black">
    
Precision, recall, and f1 score depend crucially upon which class is considered "positive", that is, the thing you wish to find. In the example above, which class (`Actually successful` or `Actually not successful`) is likely to be the "positive" class and why?

Save the label name in a string object named `answer_1_1`.

</div>

`Actually successful` is likely to be the *positive* class, because successful products are the thing we want to find: they are the ones worth investing in.

In [1]:
answer_1_1 = 'Actually successful'

### 1.2 Accuracy
rubric={autograde:2}

<div class="alert alert-info" style="color:black">

Calculate accuracies for Model A and Model B. 

</div>

In [3]:
model_a_acc = (3+ 96) / (3+5+6+96)
model_b_acc = (6+90) / (6+14+90)
display(model_a_acc)
display(model_b_acc)

0.9

0.8727272727272727
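
To put these accuracies in context, it can help to compare them against a trivial classifier that always predicts the majority class (`not successful`). The sketch below is an added illustration that uses only the counts from the two confusion matrices; the variable names are made up for this example.

```python
# Counts of actual negatives ("not successful") and totals, read off the confusion matrices
a_negatives, a_total = 6 + 96, 3 + 5 + 6 + 96
b_negatives, b_total = 0 + 90, 6 + 14 + 0 + 90

# Accuracy of a classifier that always predicts "not successful"
a_majority_acc = a_negatives / a_total
b_majority_acc = b_negatives / b_total

print(round(a_majority_acc, 4))  # 0.9273
print(round(b_majority_acc, 4))  # 0.8182
```

On Model A's data, the always-negative baseline (about 0.93) scores slightly above Model A's accuracy of 0.9, which is one sign that accuracy alone says little when the classes are imbalanced.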

### 1.3 Which model would you pick? 
rubric={reasoning:1}

<div class="alert alert-info" style="color:black">

Which model would you pick simply based on the accuracy metric? 
   
</div>

***A***, since its accuracy (0.9) is higher than Model B's (about 0.87).

### 1.4 Model A - Precision, recall, f1-score
rubric={accuracy:1.5}

<div class="alert alert-info" style="color:black">

Calculate precision, recall, f1-score for **Model A** by designating the appropriate fraction to objects named `a_precision`, `a_recall` and `a_f1`. 

You can use the objects `a_precision` and `a_recall` to use in your `a_f1` calculation.
    
</div>

In [6]:
a_precision = 3 / 9
a_recall = 3 / 8
a_f1 = 2 /(1/a_precision + 1/a_recall)
display(a_precision)
display(a_recall)
display(a_f1)

0.3333333333333333

0.375

0.35294117647058826

### 1.5 Model B - Precision, recall, f1-score
rubric={accuracy:1.5}

<div class="alert alert-info" style="color:black">

Calculate precision, recall, f1-score for **Model B** by designating the appropriate fraction to objects named `b_precision`, `b_recall` and `b_f1`. 

You can use the objects `b_precision` and `b_recall` to use in your `b_f1` calculation.
    
</div>

In [8]:
b_precision = 6 / 6
b_recall = 6 / (6 + 14)
b_f1 = 2 / (1/b_precision + 1/b_recall)
display(b_precision)
display(b_recall)
display(b_f1)

1.0

0.3

0.46153846153846145
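
The hand calculations in 1.4 and 1.5 can be cross-checked with a small helper that works directly from the confusion-matrix counts. This is a minimal sketch: the function name `precision_recall_f1` is hypothetical, and it assumes there is at least one predicted positive and one actual positive, so that no denominator is zero.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the positive class from raw counts.

    Assumes tp + fp > 0 and tp + fn > 0 (no zero denominators).
    """
    precision = tp / (tp + fp)   # of the predicted positives, how many are correct
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1


# Model A: TP=3, FP=6, FN=5
model_a_metrics = precision_recall_f1(tp=3, fp=6, fn=5)
# Model B: TP=6, FP=0, FN=14
model_b_metrics = precision_recall_f1(tp=6, fp=0, fn=14)

print(model_a_metrics)
print(model_b_metrics)
```

The formula `2 * precision * recall / (precision + recall)` is algebraically the same as the `2 / (1/precision + 1/recall)` form used in the cells above.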

### 1.6 Metric choice
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">
    
Which metric(s) is more informative in this case? Why? 

</div>

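Precision, recall, and f1 score are more informative than accuracy in this case. The classes are imbalanced (only 8 of 110 products are actually successful in Model A's matrix and 20 of 110 in Model B's), so accuracy is dominated by the `not successful` class. For example, Model B's zero false positives give it a precision of 1.0, which its accuracy of about 0.87 does not reveal.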

### 1.7 Model choice
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">

Which model would you pick based on this information and why? 
    
</div>

***B***

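Because of its high precision (1.0 versus about 0.33 for Model A) and its higher f1 score (about 0.46 versus 0.35): every product that Model B predicts as successful actually was successful, which matters when deciding what to invest in.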

## Exercise 2: Sentiment analysis on the IMDB dataset: model building <a name="3"></a>
<hr>

<img src="https://ia.media-imdb.com/images/M/MV5BMTk3ODA4Mjc0NF5BMl5BcG5nXkFtZTgwNDc1MzQ2OTE@._V1_.png"  width = "40%" alt="404 image" />

In this exercise, you will carry out sentiment analysis on a real corpus, [the IMDB movie review dataset](https://www.kaggle.com/utathya/imdb-review-dataset).
The starter code below loads the data CSV file (assuming that it's in the data directory) as a pandas DataFrame called `imdb_df`.

The supervised learning task is, given the text of a movie review, to predict whether the review sentiment is positive (reviewer liked the movie) or negative (reviewer disliked the movie). We have done a bit of preprocessing on the dataset already, so that the positive reviews are labelled `1` and the negative reviews are labelled `0`.

In [9]:
import pandas as pd
from sklearn.model_selection import train_test_split

pd.set_option('display.max_colwidth', 300)  # Set how wide columns to show (to be able to see the reviews)

imdb_df = pd.read_csv("imdb_speed.csv")
train_df, test_df = train_test_split(imdb_df, test_size=0.2, random_state=77)
train_df.head()

Unnamed: 0,review,label
2139,"This movie did attempt to capture the naive idealism that many young teenaged girls have for fun, friendship, escape, danger, sex, maturity, etc. The problem was that it failed to establish these things on every single level; which is why it failed to build a decent story around them. I couldn't...",0
7454,The Earth is destined to be no more thanks to Father Pergado and a bunch of Nuns. Christopher Lee (who has since said that he was duped in to appearing in this by his producers who told him loads of great actors were involved) is Father Pergado and gets to do his usual serious and scary routine....,0
8157,"I didn't see this movie when it originally came out, but there has been a couple songs sharing the title and the term still gets used from time to time and I figured there must be something to the flick, so I dug it up and gave a view. Now I would like the approximate hour and forty five minutes...",0
4435,"I saw this film purely based on the fact that it was on the DPP Video Nasty list, and while I'm glad I saw it because it's now 'another Video Nasty down' - on its own merits, Andy Milligan's film really isn't worth bothering with. There are, of course, far worse films on the infamous list; but t...",0
10168,"Having seen only once and in the dawn hours, I can't seem to forget this haunting film. A mix of mystery, suspense, and heartbreaking romance it reminds me of Vertigo.The actors, though not that well known are good especially Joan Hackett in one of her best performances.You believe in her, in he...",1


### 2.1 Feature and target objects 
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Separate our feature vectors from the target.

You will need to do this for both `train_df` and `test_df`.

Save the results in objects named `X_train`, `y_train`, `X_test` and `y_test`. 

(Make sure that all 4 of these objects are of type Pandas Series. We will be using `CountVectorizer` for future questions, and this transformation requires an input of Pandas Series.)
    
</div>

In [10]:
X_train, y_train, X_test, y_test = train_df['review'], train_df['label'], test_df['review'], test_df['label']

### 2.2 Dummy classifier
rubric={accuracy:3}

<div class="alert alert-info" style="color:black">
    
Make a baseline model using `DummyClassifier`.

Carry out cross-validation using the `stratified` strategy. Pass the following `scoring` metrics to `cross_validate`. 
- accuracy
- f1
- recall
- precision

(We are using cross-validation here since we can obtain multiple scores at once) 

Make sure you use  `return_train_score=True` and 5-fold cross-validation.

Save your results in a dataframe named `dummy_scores_df`. 
    
</div>

In [None]:
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

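# Baseline that guesses labels at random in proportion to the training class frequencies
model = DummyClassifier(strategy="stratified")
scores = cross_validate(
    model, X_train, y_train, cv=5,
    scoring=["accuracy", "f1", "recall", "precision"],
    return_train_score=True,
)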


dummy_scores_df = ...

dummy_scores_df

### 2.3 Dummy classifier mean
rubric={accuracy:1}

<div class="alert alert-info" style="color:black">
    
What is the mean of each column in `dummy_scores_df`?

Save your result in an object named `dummy_mean`. 
    
</div>

In [None]:
dummy_mean = ...

dummy_mean
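
As a rough illustration of what 2.2 and 2.3 produce, the sketch below runs the same kind of cross-validation on made-up toy data (not the IMDB reviews). All names starting with `toy_` are assumptions introduced for this example.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate

# Toy stand-in for the real data: DummyClassifier ignores the feature values
X_toy = np.zeros((100, 1))
y_toy = np.array([0, 1] * 50)

toy_dummy = DummyClassifier(strategy="stratified", random_state=0)
toy_scores = cross_validate(
    toy_dummy, X_toy, y_toy, cv=5,
    scoring=["accuracy", "f1", "recall", "precision"],
    return_train_score=True,
)

toy_scores_df = pd.DataFrame(toy_scores)  # one row per fold
toy_mean = toy_scores_df.mean()           # one value per column
print(toy_mean)
```

With four metrics and `return_train_score=True`, the dataframe has ten columns: `fit_time`, `score_time`, and a `test_` and `train_` column for each metric.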

### 2.4 Pipeline
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Let's make a pipeline now. 

Since we only have 1 column to preprocess, we only need 1 main pipeline for this question. 

Create a pipeline with 2 steps, one for `CountVectorizer` and one with a `LogisticRegression` model. For the LogisticRegression model, it's a good idea to set the argument `max_iter=5000` to avoid any warnings and convergence issues. Also let's balance the classes in our splits by setting the appropriate argument in `LogisticRegression` (read its docstring to find out how to achieve this). 
    
</div>

In [None]:
pipe = ...
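
A minimal sketch of such a two-step pipeline on a tiny made-up corpus is shown below. It assumes that `class_weight="balanced"` is the argument the question is pointing to, and the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus, for illustration only
toy_reviews = [
    "great movie loved it",
    "terrible boring film",
    "wonderful acting great story",
    "awful plot terrible acting",
]
toy_labels = [1, 0, 1, 0]

# make_pipeline names each step after its lowercased class name
toy_pipe = make_pipeline(
    CountVectorizer(),
    LogisticRegression(max_iter=5000, class_weight="balanced"),
)
toy_pipe.fit(toy_reviews, toy_labels)

print(list(toy_pipe.named_steps))
print(toy_pipe.predict(["great story"]))
```

The automatically generated step names (`countvectorizer`, `logisticregression`) are what the `countvectorizer__...` and `logisticregression__...` keys in a parameter grid refer to.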

### 2.5 Hyperparameter optimization
rubric={accuracy:4}

<div class="alert alert-info" style="color:black">
    
Perform hyperparameter tuning using a random search of 10 hyperparameter combinations.
We have provided the `param_grid`,
which contains a distribution of each parameter using scipy distribution functions.
In the interest of time,
you can limit your cross-validation of the hyperparameter combinations to 3-fold
and set the computation to run in parallel using `n_jobs`.
Also, use verbose output and return the training score.
Instead of the default scoring metric,
use the f1 score.

Finally,
make sure to fit your model on the entire training dataset. 

This can take a few minutes so please be patient.
    
</div>

In [None]:
from scipy.stats import loguniform, randint

param_grid = {
    "countvectorizer__max_features": randint(10, 10000),
    "logisticregression__C": loguniform(0.01, 100),
}

random_search = ...
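
The sketch below shows one way the pieces described above could fit together, using a made-up corpus small enough to run in a second. The `toy_` names, the narrower `max_features` range, and `n_jobs=1` are choices made for this toy example only.

```python
from scipy.stats import loguniform, randint
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

# Made-up corpus: 6 positive and 6 negative "reviews", enough for 3-fold CV
toy_reviews = [
    "great movie", "loved this film", "wonderful story", "great acting",
    "really loved it", "wonderful and great",
    "terrible movie", "hated this film", "awful story", "terrible acting",
    "really hated it", "awful and boring",
]
toy_labels = [1] * 6 + [0] * 6

toy_pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=5000))

toy_param_dist = {
    "countvectorizer__max_features": randint(5, 20),
    "logisticregression__C": loguniform(0.01, 100),
}

toy_search = RandomizedSearchCV(
    toy_pipe, toy_param_dist,
    n_iter=10,                 # number of sampled hyperparameter combinations
    cv=3, scoring="f1",
    return_train_score=True, verbose=1,
    n_jobs=1,                  # n_jobs=-1 would use all available cores
    random_state=0,
)
# With the default refit=True, the best pipeline is refit on all data passed to fit
toy_search.fit(toy_reviews, toy_labels)

print(toy_search.best_params_)
print(toy_search.best_score_)
```

After fitting, `best_params_` holds the winning combination and `best_score_` holds its mean cross-validated score for the chosen metric.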

### 2.6  Best hyperparameters
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
What are the best hyperparameter values found by `RandomizedSearchCV` for `C` and `max_features`?
What was the corresponding validation score? Output these values either in separate cells or by printing all of them from a single cell (but we need to see all to be able to grade).
    
</div>

In [None]:
optimal_parameters = ...
optimal_score = ...

### 2.7 Hyperparameters and the fundamental tradeoff
rubric={reasoning:3}

<div class="alert alert-info" style="color:black">
    
Write 2-3 sentences each on the two questions below.
You can add code if needed or use information from previously executed cells
if you deem that to be sufficient.

1. From the set of possible models in the search, did your search return a relatively simple CountVectorizer or a relatively complex one?
2. Did it return a relatively simple LogisticRegression or a relatively complex one? Here, by 'simple' and 'complex' we mean with respect to the fundamental tradeoff.
    
</div>

In [None]:
# Any potential code here
...

YOUR ANSWER HERE
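
For the second question, it may help to recall that in scikit-learn's `LogisticRegression`, `C` is the inverse of the regularization strength, so a smaller `C` constrains the weights more strongly. The sketch below illustrates this on synthetic data from `make_classification` (not the IMDB reviews); the variable names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

coef_norms = {}
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, max_iter=5000).fit(X_toy, y_toy)
    # Size of the learned weight vector: a rough proxy for model complexity
    coef_norms[C] = float(np.linalg.norm(lr.coef_))
    print(C, round(coef_norms[C], 3), round(lr.score(X_toy, y_toy), 3))
```

The weight norm grows as `C` increases, which is the sense in which a larger `C` gives a more complex model with respect to the fundamental tradeoff.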

### 2.8 Train and test scores of best scoring model
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
What are the train and test `f1` scores of the best scoring model?
    
</div>

In [None]:
...

### 2.9 Confusion matrix
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Plot a confusion matrix on the test set using your random search object as your estimator.
Use the `display_labels` parameter so that it's easier to recognize which class is a positive review and which is negative. 

</div>

In [None]:
...

### 2.10 Classification report
rubric={accuracy:3}

<div class="alert alert-info" style="color:black">
    
Print a classification report on the `X_test` predictions of your random search object's best model with measurements to 4 decimal places. Use this information to answer the questions:

1. What is the recall if we classify `1` as our "positive" class? 
2. What is the precision weighted average? Save the result to 4 decimal places. 
3. What is the `f1` score using `1` as your positive class?
4. Comment on the overall model performance in the context of what we used as input features.
    
</div>

In [None]:
...

YOUR ANSWER HERE
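
The sketch below shows how `classification_report` can be printed to 4 decimal places and how individual numbers can be read out programmatically. The labels and predictions are made up for illustration, and the `toy_` names are hypothetical.

```python
from sklearn.metrics import classification_report

# Made-up labels and predictions, for illustration only
toy_true = [1, 1, 1, 0, 0, 0, 0, 1]
toy_pred = [1, 0, 1, 0, 0, 1, 0, 1]

# digits=4 controls the number of decimal places in the printed report
print(classification_report(
    toy_true, toy_pred,
    target_names=["negative", "positive"], digits=4,
))

# output_dict=True returns the same numbers as a nested dict
toy_report = classification_report(
    toy_true, toy_pred,
    target_names=["negative", "positive"], output_dict=True,
)
toy_weighted_precision = round(toy_report["weighted avg"]["precision"], 4)
toy_positive_recall = round(toy_report["positive"]["recall"], 4)
print(toy_weighted_precision, toy_positive_recall)  # 0.75 0.75
```

In the printed report, the row for class `1` (here named `positive`) gives the recall and f1 score when `1` is treated as the positive class, and the `weighted avg` row gives the support-weighted averages.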

## Exercise 3: Model Interpretation <a name="4"></a>
<hr>

One of the primary advantages of linear models is that they can be interpreted in terms of which features are important. In this exercise, we'll explore the weights learned by the logistic regression classifier.

Below we've created a dataframe that contains the words used in our optimal model along with their coefficients (if you named your `RandomizedSearchCV` object in question 2.5 something other than `random_search`, change the code below accordingly).

In [None]:
best_estimator = random_search.best_estimator_

coef_df = pd.DataFrame({
    'words': best_estimator["countvectorizer"].get_feature_names_out(),  # get_feature_names() was removed in scikit-learn 1.2
    'coefficient': best_estimator["logisticregression"].coef_[0]
})

coef_df

### 3.1 Get the most informative positive words
rubric={accuracy:1, reasoning:1}

<div class="alert alert-info" style="color:black">
    
Using the dataframe `coef_df` above, find the 10 words that are most indicative of a positive review.

Elaborate on the positive words here - Do they make sense with their target value?
    
</div>

In [None]:
...

YOUR ANSWER HERE

### 3.2 Get the most informative negative words
rubric={accuracy:1, reasoning:1}

<div class="alert alert-info" style="color:black">
    
Using the dataframe `coef_df` above, find the 10 words that are most indicative of a negative review.

Elaborate on the negative words here - Do they make sense with their target value?
    
</div>

In [None]:
...

YOUR ANSWER HERE
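
For 3.1 and 3.2, one possible approach is to rank the rows of the coefficient dataframe with pandas' `nlargest` and `nsmallest`. The sketch below uses a made-up `toy_coef_df` with invented words and values, and picks the top 2 rather than the top 10 to keep the example short.

```python
import pandas as pd

# Invented words and coefficients, for illustration only
toy_coef_df = pd.DataFrame({
    "words": ["excellent", "boring", "the", "awful", "wonderful", "plot"],
    "coefficient": [1.8, -1.5, 0.01, -2.1, 1.2, -0.1],
})

# Largest positive coefficients push predictions toward class 1 (positive review)
top_positive = toy_coef_df.nlargest(2, "coefficient")
# Most negative coefficients push predictions toward class 0 (negative review)
top_negative = toy_coef_df.nsmallest(2, "coefficient")

print(top_positive)
print(top_negative)
```

Words with coefficients near zero (such as `the` in this toy table) contribute little to the prediction in either direction.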

### 3.3 Explaining the coefficients?
rubric={reasoning:2}

<div class="alert alert-info" style="color:black">
    
Do the words associated with positive and negative reviews make sense? Why is it useful to get access to this information?
    
</div>

YOUR ANSWER HERE

### 3.4 Using `predict` vs `predict_proba`
rubric={accuracy:3}

<div class="alert alert-info" style="color:black">
    
Make a dataframe named `results_df` that contains these 5 columns: 

- `review` - this should contain the reviews from `X_test`.
- `true_label` - This should contain the true `y_test` values. 
- `predicted_y` - The predicted labels generated from `best_model` for the `X_test` reviews using `.predict()`. 
- `neg_label_prob` - The probabilities of class `0` generated from `best_model` for the `X_test` reviews. These can be found at index 0 of the `predict_proba` output (you can get that using `[:,0]`). 
- `pos_label_prob` - The probabilities of class `1` generated from `best_model` for the `X_test` reviews. These can be found at index 1 of the `predict_proba` output (you can get that using `[:,1]`). 
    
</div>

In [None]:
...
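
To illustrate how `predict` and `predict_proba` relate, the sketch below builds a results dataframe of the same shape from a small model fit on made-up reviews. All `toy_` names are assumptions for this example, and the toy model stands in for `best_model`.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up reviews, for illustration only
toy_train = pd.Series(["great movie", "loved it", "terrible film", "hated it"])
toy_train_labels = pd.Series([1, 1, 0, 0])
toy_test = pd.Series(["loved this great movie", "hated this terrible film"])
toy_test_labels = pd.Series([1, 0])

toy_model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=5000))
toy_model.fit(toy_train, toy_train_labels)

# Shape (n_samples, 2); the columns follow the order of toy_model.classes_
toy_proba = toy_model.predict_proba(toy_test)

toy_results_df = pd.DataFrame({
    "review": toy_test,
    "true_label": toy_test_labels,
    "predicted_y": toy_model.predict(toy_test),
    "neg_label_prob": toy_proba[:, 0],   # column 0 -> class 0
    "pos_label_prob": toy_proba[:, 1],   # column 1 -> class 1
})
print(toy_results_df)
```

The two probability columns sum to 1 in every row, and `predict` returns the class whose probability is larger.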

### 3.5 Looking into the probability scores with positive reviews 
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Find the top 5 movie reviews in `results_df` with the highest predicted probability of being positive (i.e., where the model is most confident that the review is positive). If you are curious to read these reviews, you can set the pandas column width or use `IPython.display.HTML`, [following the tips in this Stack Overflow thread](https://stackoverflow.com/questions/25351968/how-can-i-display-full-non-truncated-dataframe-information-in-html-when-conver/).
    
</div>

In [None]:
...

### 3.6 Looking into the probability scores with negative reviews 
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Find the top 5 movie reviews in `results_df` with the highest predicted probability of being negative (i.e., where the model is most confident that the review is negative).
    
</div>

In [None]:
...

### 3.7 Looking at uncertain reviews
rubric={accuracy:2}

<div class="alert alert-info" style="color:black">
    
Find the 5 movie reviews in the test set with the most divided probability of being negative or positive (i.e., where the model is least confident in either review sentiment).

What do you think could contribute to the model being confused about how to score a review?
    
</div>

In [None]:
...

YOUR ANSWER HERE
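
One way to rank reviews by model confidence, as asked in 3.5 to 3.7, is sketched below on a made-up results table. The review ids and probabilities are invented, and the example selects 2 rows instead of 5 to stay short.

```python
import pandas as pd

# Invented probabilities, for illustration only
toy_results_df = pd.DataFrame({
    "review": ["r1", "r2", "r3", "r4", "r5", "r6"],
    "true_label": [1, 0, 1, 0, 1, 0],
    "predicted_y": [1, 0, 1, 1, 0, 0],
    "pos_label_prob": [0.98, 0.03, 0.55, 0.52, 0.40, 0.20],
})
toy_results_df["neg_label_prob"] = 1 - toy_results_df["pos_label_prob"]

# Most confident predictions in each direction
most_positive = toy_results_df.nlargest(2, "pos_label_prob")
most_negative = toy_results_df.nlargest(2, "neg_label_prob")

# Least confident: predicted probability closest to 0.5
distance_from_half = (toy_results_df["pos_label_prob"] - 0.5).abs()
most_uncertain = toy_results_df.loc[distance_from_half.nsmallest(2).index]

print(most_uncertain)
```

Because the two class probabilities sum to 1, measuring the distance of `pos_label_prob` from 0.5 is enough to find the most divided predictions.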

### 3.8 Looking at wrongly predicted reviews
rubric={accuracy:1,reasoning:1}

<div class="alert alert-info" style="color:black">
    
Examine a review from the test set where our `best_model` is making mistakes, i.e., where the true labels do not match the predicted labels. 

What do you think could contribute to the model making an incorrect classification for a review?
    
</div>

In [None]:
...

YOUR ANSWER HERE
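
Misclassified reviews can be selected with a boolean mask that compares the true and predicted labels. The sketch below uses invented reviews and predictions (not output from the real model) purely to show the filtering step.

```python
import pandas as pd

# Invented reviews and predictions, for illustration only
toy_results_df = pd.DataFrame({
    "review": [
        "a wonderful film",
        "not bad at all",
        "boring and far too long",
        "I expected it to be great but it was not",
    ],
    "true_label": [1, 1, 0, 0],
    "predicted_y": [1, 0, 0, 1],
})

# Keep only the rows where the prediction disagrees with the true label
toy_mistakes = toy_results_df[
    toy_results_df["true_label"] != toy_results_df["predicted_y"]
]
print(toy_mistakes)
```

The invented mistakes here hint at one possible cause worth checking in the real data: with its default settings, `CountVectorizer` counts single words and ignores word order, so a negation such as "not bad" is seen as the separate words "not" and "bad".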

### Submission to Canvas

**PLEASE READ: When you are ready to submit your assignment do the following:**

- Read through your solutions
- **Restart your kernel and clear output and rerun your cells from top to bottom** 
- Make sure that none of your code is broken 
- Convert your notebook to .html format by going to File -> Export Notebook As... -> Export Notebook to HTML
- Upload your `.ipynb` file and the `.html` file to Canvas under Assignment 3. 
- **DO NOT** upload any `.csv` files. 

### Congratulations on finishing Assignment 3!