
# Assignment 4 for Course 1MS041
Make sure you pass the `# ... Test` cells and
 submit your solution notebook in the corresponding assignment on the course website. You can submit multiple times before the deadline and your highest score will be used.

---
## Assignment 4, PROBLEM 1
Maximum Points = 24


    This time the assignment only consists of one problem, but we will do a more comprehensive analysis instead.

Consider the dataset `Corona_NLP_train.csv` that you can get from the course website [git](https://github.com/datascience-intro/1MS041-2024/blob/main/notebooks/data/Corona_NLP_train.csv). The data is "Coronavirus tweets NLP - Text Classification" that can be found on [kaggle](https://www.kaggle.com/datasets/datatattle/covid-19-nlp-text-classification). The data has several columns, but we will only be working with `OriginalTweet`and `Sentiment`.

1. [3p] Load the data and filter out those tweets that have `Sentiment`=`Neutral`. Let $X$ represent the `OriginalTweet` and let 
    $$
        Y = 
        \begin{cases}
        1 & \text{if sentiment is towards positive}
        \\
        0 & \text{if sentiment is towards negative}.
        \end{cases}
    $$
    Put the resulting arrays into the variables $X$ and $Y$. Split the data into three parts, train/test/validation where train is 60% of the data, test is 15% and validation is 25% of the data. Do not do this randomly, this is to make sure that we all did the same splits (we are in this case assuming the data is IID as presented in the dataset). That is [train,test,validation] is the splitting layout.

2. [4p] There are many ways to solve this classification problem. The first main issue to resolve is to convert the $X$ variable to something that you can feed into a machine learning model. For instance, you can first use [`CountVectorizer`](https://scikit-learn.org/1.5/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) as the first step. The step that comes after should be a `LogisticRegression` model, but for this to work you need to put together the `CountVectorizer` and the `LogisticRegression` model into a [`Pipeline`](https://scikit-learn.org/1.5/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline). Fill in the variable `model` such that it accepts the raw text as input and outputs a number $0$ or $1$, make sure that `model.predict_proba` works for this. **Hint: You might need to play with the parameters of LogisticRegression to get convergence, make sure that it doesn't take too long or the autograder might kill your code**
3. [3p] Use your trained model and calculate the precision and recall on both classes. Fill in the corresponding variables with the answer.
4. [3p] Let us now define a cost function
    * A positive tweet that is classified as negative will have a cost of 1
    * A negative tweet that is classified as positive will have a cost of 5
    * Correct classifications cost 0
    
    complete filling the function `cost` to compute the cost of a prediction model under a certain prediction threshold (recall our precision recall lecture and the `predict_proba` function from trained models). 

5. [4p] Now, we wish to select the threshold of our classifier that minimizes the cost, fill in the selected threshold value in value `optimal_threshold`.
6. [4p] With your newly computed threshold value, compute the cost of putting this model in production by computing the cost using the validation data. Also provide a confidence interval of the cost using Hoeffdings inequality with a 99% confidence.
7. [3p] Let $t$ be the threshold you found and $f$ the model you fitted (one of the outputs of `predict_proba`), if we define the random variable
    $$
        C = (1-1_{f(X)\geq t})Y+5(1-Y)1_{f(X) \geq t}
    $$
    then $C$ denotes the cost of a randomly chosen tweet. In the previous step we estimated $\mathbb{E}[C]$ using the empirical mean. However, since the threshold is chosen to minimize cost it is likely that $C=0$ or $C=1$ than $C=5$ as such it will have a low variance. Compute the empirical variance of $C$ on the validation set. What would be the confidence interval if we used Bennett's inequality instead of Hoeffding in point 6 but with the computed empirical variance as our guess for the variance?

In [5]:

# Part 1

# Load the data from the file specified in the problem definition and make sure that it is loaded using
# the search path `data/Corona_NLP_train.csv`. This is to make sure the autograder and your computer have the same
# file path and can load the data correctly.

# Contrary to how many other problems are structured, this problem actually requires you to
# have X on the shape (n_samples, ) that is a 1-dimensional array. Otherwise it will cause a bunch
# of errors in the autograder or also in for instance CountVectorizer.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
# Make sure that all your data is numpy arrays and not pandas dataframes or series.
data = pd.read_csv('data/Corona_NLP_train.csv', encoding = "iso-8859-1")

# Filter out neutral sentiment
data = data[data['Sentiment'] != 'Neutral']

# Map sentiments to binary values
data['Sentiment'] = data['Sentiment'].apply(lambda x: 1 if x == 'Positive' else 0)

# Define X and Y
X = data['OriginalTweet'].to_numpy()  # Features (tweets)
Y = data['Sentiment'].to_numpy()  # Target (sentiment)

# Split the data into train, test, and validation sets (60%, 15%, 25% respectively)
X_train, X_temp, Y_train, Y_temp = train_test_split(X, Y, test_size=0.4, shuffle=False)
X_valid, X_test, Y_valid, Y_test = train_test_split(X_temp, Y_temp, test_size=0.375, shuffle=False)

# Verify the splits
print(f"Train size: {len(X_train)}, Test size: {len(X_test)}, Validation size: {len(X_valid)}")

Train size: 20066, Test size: 5017, Validation size: 8361


In [6]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Define the pipeline: CountVectorizer -> LogisticRegression
model = Pipeline([
    ('vectorizer', CountVectorizer()),  # Converts text into a bag-of-words representation
    ('classifier', LogisticRegression(solver='liblinear'))  # Logistic Regression model
])

# Train the model using the training data
model.fit(X_train, Y_train)

# Check if the model was trained correctly
print("Model training completed.")

Model training completed.


In [7]:
from sklearn.metrics import precision_score, recall_score

# Make predictions on the test set
Y_pred = model.predict(X_test)

# Calculate precision and recall for both classes
precision_0 = precision_score(Y_test, Y_pred, pos_label=0)
precision_1 = precision_score(Y_test, Y_pred, pos_label=1)
recall_0 = recall_score(Y_test, Y_pred, pos_label=0)
recall_1 = recall_score(Y_test, Y_pred, pos_label=1)

# Print the results
print(f"Precision for class 0 (Negative): {precision_0}")
print(f"Precision for class 1 (Positive): {precision_1}")
print(f"Recall for class 0 (Negative): {recall_0}")
print(f"Recall for class 1 (Positive): {recall_1}")

Precision for class 0 (Negative): 0.7181954887218045
Precision for class 1 (Positive): 0.5041371158392435
Recall for class 0 (Negative): 0.7400061977068485
Recall for class 1 (Positive): 0.47653631284916204


In [8]:
def cost(model, threshold, X, Y):
    # Get the probabilities of the predictions (for class 1)
    probs = model.predict_proba(X)[:, 1]  # Get the probability for the positive class (1)

    # Apply the threshold to determine predicted labels
    Y_pred = (probs >= threshold).astype(int)  # If prob >= threshold, classify as 1 (positive), otherwise 0 (negative)

    # Calculate the cost
    # Cost 1 for a positive tweet classified as negative (Y=1, Y_pred=0)
    # Cost 5 for a negative tweet classified as positive (Y=0, Y_pred=1)
    cost = np.mean((Y == 1) & (Y_pred == 0)) * 1 + np.mean((Y == 0) & (Y_pred == 1)) * 5

    return cost

In [9]:
import numpy as np

# Define a range of thresholds to test (from 0 to 1)
thresholds = np.linspace(0, 1, 100)

# Initialize variables to store the optimal threshold and its associated cost
optimal_threshold = 0
cost_at_optimal_threshold = float('inf')  # Start with a very high cost

# Iterate over all possible thresholds
for threshold in thresholds:
    current_cost = cost(model, threshold, X_test, Y_test)
    
    # If the current cost is lower than the previous cost, update the optimal threshold
    if current_cost < cost_at_optimal_threshold:
        optimal_threshold = threshold
        cost_at_optimal_threshold = current_cost

# Print the results
print(f"Optimal Threshold: {optimal_threshold}")
print(f"Cost at Optimal Threshold: {cost_at_optimal_threshold}")

Optimal Threshold: 1.0
Cost at Optimal Threshold: 0.35678692445684673


In [10]:
import numpy as np

# Calculate the cost on the validation set
cost_at_optimal_threshold_valid = cost(model, optimal_threshold, X_valid, Y_valid)

# Calculate Hoeffding's inequality for the confidence interval
n_valid = len(X_valid)
epsilon = 0.01  # You can adjust this epsilon for more precision if needed
confidence_level = 0.99

# Hoeffding's inequality bounds
a, b = 0, 5  # Cost can only be between 0 and 5
delta = 1 - confidence_level
interval_radius = np.sqrt((b - a) ** 2 * np.log(2 / delta) / (2 * n_valid))

# Confidence interval for the cost
cost_interval_valid = (cost_at_optimal_threshold_valid - interval_radius, cost_at_optimal_threshold_valid + interval_radius)

# Assert the cost interval is a tuple with 2 elements
assert(type(cost_interval_valid) == tuple)
assert(len(cost_interval_valid) == 2)

# Print the results
print(f"Cost at Optimal Threshold on Validation Set: {cost_at_optimal_threshold_valid}")
print(f"Confidence Interval for Cost (99% Confidence): {cost_interval_valid}")

Cost at Optimal Threshold on Validation Set: 0.3484033010405454
Confidence Interval for Cost (99% Confidence): (0.2594023025083595, 0.4374042995727313)


In [11]:
# Step 1: Calculate C for each sample in the validation set
C_values = []
for i in range(len(X_valid)):
    prob = model.predict_proba([X_valid[i]])[0, 1]  # Get the probability of class 1
    predicted_class = 1 if prob >= optimal_threshold else 0
    actual_class = Y_valid[i]
    
    if predicted_class == 1 and actual_class == 0:
        C_values.append(5)  # False positive
    elif predicted_class == 0 and actual_class == 1:
        C_values.append(1)  # False negative
    else:
        C_values.append(0)  # Correct classification (either true positive or true negative)

# Step 2: Calculate the variance of C
variance_of_C = np.var(C_values)

# Step 3: Compute the confidence interval using Bennett's inequality
n_valid = len(X_valid)
empirical_mean = np.mean(C_values)

# Bennett's inequality parameters
delta = 0.01  # The margin of error (adjust if necessary)
M = 5  # The range of C (0 to 5)
L = empirical_mean  # The empirical mean of C
V = variance_of_C  # The empirical variance of C

# Bennett's inequality confidence interval
interval_radius_bennett = np.sqrt((2 * np.log(2 / delta) * V) / n_valid)
interval_of_C = (empirical_mean - interval_radius_bennett, empirical_mean + interval_radius_bennett)

# Assert the interval is a tuple with 2 elements
assert(type(interval_of_C) == tuple)
assert(len(interval_of_C) == 2)

# Print the results
print(f"Variance of C: {variance_of_C}")
print(f"Confidence Interval for C (using Bennett's inequality): {interval_of_C}")

Variance of C: 0.22701844086459655
Confidence Interval for C (using Bennett's inequality): (0.3314409737460543, 0.3653656283350365)
