# Natural Language Processing 1
## Exercise sheet 2: Text classification with Logistic Regression and Naive Bayes

### Instructions

Fill in the numerical answers in the provided code cells. Submit the completed notebook when done. Reasoning is required only where specified.

- Numerical answers should be stored in variables in the provided code cells.
- Reasoning and detailed calculations should be added in the designated text cells.
- Do **not** modify the variable names or the structure of the notebook.
- Save and download your notebook with the following naming convention: `STUDENT_ID.ipynb`, where `STUDENT_ID` is your university-assigned ID.
- Upload the notebook to Moodle in the **Exercises** section. Make sure the file is correctly named before submitting.



### Exercise 1: Predicting book categories from titles [MG]

**Setup:**  We have a dataset of book titles labeled with either AI or Psychology. We first apply a Naive Bayes classifier, then move on to logistic regression with Bag-of-Words representations.


#### Part 1: Naive Bayes Classifier

Fill in the variables in the following code cells with:
- The class priors: $P(\texttt{ai})$ and $P(\texttt{psychology})$.
- The values $P(\texttt{ai}) \prod_{w} P(w \mid \texttt{ai})$ and $P(\texttt{psychology}) \prod_{w} P(w \mid \texttt{psychology})$.
- The predicted class **$c_{NB}$** based on $\arg\max$.



In [4]:
# Part 1.1: Compute class priors
P_ai = 1/2
P_psychology = 1/2

# Part 1.2: Compute the values for each class (before applying argmax)
value_ai = 1/81
value_psychology = 1/50

# Part 1.3: Determine the predicted class (argmax decision)
# write "ai" or "psychology"
c_NB = "psychology"

##### Reasoning for part 1
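Part 1.1: the prior of class $c$ is the fraction of training documents labeled $c$, where $N_c$ is the number of documents in class $c$ and $m$ is the total number of documents.
$$
P(c) = \frac{N_c}{m}
$$

Part 1.2: the unnormalized score of class $c$ for a document with words $w_1, \dots, w_n$.
$$
P(c) \prod_{k=1}^n P(w_k \mid c)
$$

$$
P(w_k \mid c) = \frac{\text{count}(w_k, c)}{\sum_{w \in V} \text{count}(w, c)}
$$

Part 1.3: the predicted class is the one with the highest score.
$$
c_{NB} = \arg\max_{c \in \mathcal{C}} P(c) \prod_{k=1}^n P(w_k \mid c)
$$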
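
The formulas for Parts 1.1 to 1.3 can be sketched in a few lines of Python. This is only an illustration: the corpus `train`, the query title, and the helper names `train_nb` and `nb_score` are hypothetical and are not the exercise data. It uses unsmoothed relative frequencies, matching the estimate of $P(w_k \mid c)$ written above.

```python
from collections import Counter
from fractions import Fraction

# Hypothetical toy corpus (NOT the exercise data): tokenized titles with labels.
train = [
    (["neural", "networks"], "ai"),
    (["machine", "learning"], "ai"),
    (["learning", "habits"], "psychology"),
    (["social", "habits"], "psychology"),
]

def train_nb(docs):
    """Return class priors N_c / m and per-class word counts (no smoothing)."""
    docs_per_class = Counter(label for _, label in docs)
    m = len(docs)
    priors = {c: Fraction(n, m) for c, n in docs_per_class.items()}
    counts = {c: Counter() for c in docs_per_class}
    for tokens, label in docs:
        counts[label].update(tokens)
    return priors, counts

def nb_score(tokens, c, priors, counts):
    """Compute P(c) * prod_k P(w_k | c) with relative-frequency estimates."""
    total = sum(counts[c].values())
    score = priors[c]
    for w in tokens:
        score *= Fraction(counts[c][w], total)
    return score

priors, counts = train_nb(train)
query = ["learning", "habits"]  # hypothetical test title
scores = {c: nb_score(query, c, priors, counts) for c in priors}
c_hat = max(scores, key=scores.get)  # argmax over classes
print(scores, c_hat)
```

Exact fractions make the hand calculation easy to compare. Note that without smoothing a single unseen word drives a class score to zero, as happens for `"habits"` under `"ai"` in this toy corpus.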



#### Part 2: Logistic Regression

##### Part 2.1: Compute the BoW vectors

In [5]:
# Part 2.1: BoW vectors for the 8 training docs.
# Represent the vectors as lists.
# e.g. [1,1,0,3,1,0,0,3,0,0,0,0,0]
"""
deep, learning, artificial, intelligence, the, alignment, problem, reinforcement, to, breath, work, survival, brain
"""
bow_doc1 = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bow_doc2 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bow_doc3 = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0]
bow_doc4 = [0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
bow_doc5 = [0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
bow_doc6 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
bow_doc7 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0]
bow_doc8 = [0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

##### Part 2.2: Choose a weight vector $\theta$


In [6]:
# Part 2.2: Your logistic regression weight vector
# This must be a list of length = size of your vocabulary.
# e.g.: [0, 10, 23, ...]
theta_logreg = [0, 0, 1, 1, 0, 1, 1, 1, 0, -1, -1, -1, 1]
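
One way to sanity-check a candidate weight vector is to compute $\sigma(\theta \cdot x)$ for individual BoW vectors. The sketch below reuses `theta_logreg`, `bow_doc2`, and `bow_doc6` from the cells above; the helper `predict_proba` is hypothetical, and the absence of a bias term is an assumption. Which class counts as the positive one is not stated in this notebook, so the outputs are only scores, not verified predictions.

```python
import math

def predict_proba(theta, x):
    """Sigmoid of the dot product theta . x (no bias term: an assumption)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1.0 / (1.0 + math.exp(-z))

# Values copied from the cells above.
theta_logreg = [0, 0, 1, 1, 0, 1, 1, 1, 0, -1, -1, -1, 1]
bow_doc2 = [0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
bow_doc6 = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

p_doc2 = predict_proba(theta_logreg, bow_doc2)  # z = 1 + 1 = 2
p_doc6 = predict_proba(theta_logreg, bow_doc6)  # z = 0 - 1 = -1
print(round(p_doc2, 4), round(p_doc6, 4))
```

A document whose score is exactly $z = 0$ gets probability 0.5, so words with zero weight contribute nothing to separating the classes.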

### Exercise 2: Improving a sentiment analysis classifier [MG]

**Setup**:
We have two labeled documents, each in a 10-word vocabulary. We build a logistic regression classifier to assign **positive** vs. **negative** sentiment.

We want:
- **BoW vectors** for doc1 (`"the storyline was intriguing and captivating"`) and doc2 (`"dull, horribly dull storyline, not intriguing"`).
- A weight vector that yields about 0.75 probability for doc1 (positive) and 0.25 for doc2 (negative).
- Cross-entropy losses for each doc.
- One gradient-descent step.
- New probabilities after the update.


#### Exercise 2, Part 1: Compute the BoW vectors
Store each doc's Bag-of-Words as a list of length 10, corresponding to:
```
1. the
2. storyline
3. was
4. horribly
5. intriguing
6. and
7. dull
8. captivating
9. , (comma)
10. not
```


In [7]:
# Part 1: doc1_bow, doc2_bow
doc1_bow = [1, 1, 1, 0, 1, 1, 0, 1, 0, 0]
doc2_bow = [0, 1, 0, 1, 1, 0, 2, 0, 2, 1]
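
The two vectors can be reproduced mechanically. The following is a minimal sketch, assuming a simple tokenizer that lowercases the text and treats each comma as its own token; the helper name `bow` and the regular expression are illustrative choices, not part of the exercise.

```python
import re

# Vocabulary in the order given in the exercise (word 1 to word 10).
VOCAB = ["the", "storyline", "was", "horribly", "intriguing",
         "and", "dull", "captivating", ",", "not"]

def bow(text, vocab=VOCAB):
    """Count how often each vocabulary entry occurs in the text."""
    # Words are runs of word characters; each comma is a separate token.
    tokens = re.findall(r"\w+|,", text.lower())
    return [tokens.count(w) for w in vocab]

doc1_check = bow("the storyline was intriguing and captivating")
doc2_check = bow("dull, horribly dull storyline, not intriguing")
print(doc1_check)
print(doc2_check)
```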

#### Exercise 2, Part 2: Propose a weight vector
We want logistic regression probabilities:
$P(y=1 \mid \text{doc1}) \approx 0.75,\quad P(y=1 \mid \text{doc2}) \approx 0.25.$

Use nonzero weights **only** for these four words:
- word 4 ("horribly")
- word 5 ("intriguing")
- word 7 ("dull")
- word 8 ("captivating")

Set the other 6 weights to zero. Store your final weight vector in the code cell below.

In [8]:
# list of length 10.
import numpy as np
theta_ex2 = np.array([0, 0, 0, -1, 0.0986, 0, -0.0986, 1, 0, 0])
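
The weights above are consistent with the following sketch, which assumes the weights for "horribly" and "captivating" were fixed by hand at $-1$ and $+1$ and the remaining two were solved from the target logits $\pm\ln 3$. The variable names are hypothetical.

```python
import math

def logit(p):
    """Inverse of the sigmoid: the score z with sigmoid(z) = p."""
    return math.log(p / (1 - p))

z1_target = logit(0.75)  # +ln 3 for doc1
z2_target = logit(0.25)  # -ln 3 for doc2

# Assumption: two of the four free weights are fixed by hand.
w_horribly, w_captivating = -1.0, 1.0

# doc1 contains "intriguing" and "captivating" once each:
#   w_intriguing + w_captivating = z1_target
w_intriguing = z1_target - w_captivating
# doc2 contains "horribly" once, "intriguing" once, and "dull" twice:
#   w_horribly + w_intriguing + 2 * w_dull = z2_target
w_dull = (z2_target - w_horribly - w_intriguing) / 2

theta_check = [0, 0, 0, w_horribly, w_intriguing, 0, w_dull, w_captivating, 0, 0]
print([round(t, 4) for t in theta_check])
```

The system has two equations and four unknowns, so other choices for the fixed weights would give equally valid solutions.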

#### Exercise 2, Part 3: Cross-entropy loss
For doc1 (label=1) and doc2 (label=0), compute:

$\mathcal{L}_{CE}(\hat{y},y)=-\bigl[y\log\hat{y}+(1-y)\log(1-\hat{y})\bigr].$

Store your two scalar values in the code cell below.


In [9]:
# Cross-entropy for doc1 (label=1) and doc2 (label=0)
ce_doc1 = 0.2877  # -ln(0.75)
ce_doc2 = 0.2877  # -ln(1 - 0.25)
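
As a quick check, the loss formula can be evaluated directly. This sketch assumes the natural logarithm and the target probabilities 0.75 and 0.25 from Part 2; the function name `cross_entropy` is illustrative.

```python
import math

def cross_entropy(y_hat, y):
    """Binary cross-entropy for one example, using the natural logarithm."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

ce_doc1_check = cross_entropy(0.75, 1)  # doc1: label 1, predicted 0.75
ce_doc2_check = cross_entropy(0.25, 0)  # doc2: label 0, predicted 0.25
print(ce_doc1_check, ce_doc2_check)
```

Both losses equal $-\ln 0.75$ because each document's true label receives probability 0.75.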

#### Exercise 2, Part 4: One step of gradient descent
Use learning rate $\alpha=0.1$. Provide:
1. The gradient w.r.t. your weight vector $\theta$.
2. The updated $\theta'$ after the gradient step.
3. Briefly explain which entries changed and why.
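
For reference, a sketch of the standard gradient of the cross-entropy loss for logistic regression, assuming $\hat{y} = \sigma(\theta \cdot x)$ with no bias term and assuming the loss is summed (not averaged) over the two documents:

$$
\frac{\partial \mathcal{L}_{CE}}{\partial \theta_j} = (\hat{y} - y)\, x_j
$$

$$
\nabla_\theta = (0.75 - 1)\, x^{(1)} + (0.25 - 0)\, x^{(2)}, \qquad \theta' = \theta - \alpha \nabla_\theta
$$

Here $x^{(1)}$ and $x^{(2)}$ denote the BoW vectors of doc1 and doc2. If the loss were averaged instead, the gradient would simply be halved.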


In [10]:
# 4.1 The gradient: list of length 10.
# Per-example gradient is (y_hat - y) * x, summed over both documents:
# doc1: (0.75 - 1) * x1, doc2: (0.25 - 0) * x2
grad_theta_ex2 = -0.25 * np.array(doc1_bow) + 0.25 * np.array(doc2_bow)

# 4.2 The new theta after applying the gradient update.
updated_theta_ex2 = theta_ex2 - 0.1 * grad_theta_ex2

print(grad_theta_ex2)
print(updated_theta_ex2)
print("If the gradient is negative, that parameter increases, so it contributes positively to the sigmoid.")
print("If the gradient is positive, that parameter decreases, so it contributes negatively to the sigmoid.")

[-0.25  0.   -0.25  0.25  0.   -0.25  0.5  -0.25  0.5   0.25]
[ 0.025   0.      0.025  -1.025   0.0986  0.025  -0.1486  1.025  -0.05
 -0.025 ]
If the gradient is negative, that parameter increases, so it contributes positively to the sigmoid.
If the gradient is positive, that parameter decreases, so it contributes negatively to the sigmoid.


#### Exercise 2, Part 5: New probabilities
With your updated $\theta'$, recompute the logistic regression output for doc1 and doc2. Are the probabilities closer to the correct labels (doc1=1, doc2=0)?


In [11]:
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

In [12]:
# Recompute doc1 & doc2 probabilities with the updated theta.

p_doc1_new = sigmoid(updated_theta_ex2 @ doc1_bow)  # P(y = 1 | x(1))
p_doc2_new = 1 - sigmoid(updated_theta_ex2 @ doc2_bow)  #  P(y = 0 | x(2))

print(p_doc1_new)
print(p_doc2_new)

0.7682756376602405
0.7939006510813742


#### Reasoning for Exercise 2
Use this cell to provide your step-by-step calculations:

- How you chose the weights $\theta$.
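By solving a system of two equations. Setting $\theta_4 = -1$ and $\theta_8 = 1$, we require $\theta_5 + \theta_8 = \ln 3$ (so that $\sigma(z_1) = 0.75$) and $\theta_4 + \theta_5 + 2\theta_7 = -\ln 3$ (so that $\sigma(z_2) = 0.25$), which gives $\theta_5 \approx 0.0986$ and $\theta_7 \approx -0.0986$.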

- How you compute the loss.
By substituting the true label and the predicted probability into the loss formula.

- The gradient derivation and weight updates. Do the weight changes make sense to you? Which weights changed, and how do these changes help the classifier improve its predictions?
Yes, the weight updates make sense and the classifier's predictions improve. Words that appear only in doc1 have a negative gradient, so their weights increase; words that appear only in doc2 have a positive gradient, so their weights decrease. For words that appear once in both documents ("storyline" and "intriguing"), the two contributions cancel, so the gradient is 0 and the weight does not change.

- The final updated probabilities. Did the classification predictions improve? That is, are the predicted probabilities for the true labels now closer to $1$, for all examples?
Yes, the classification improves. Before the update, both probabilities of the true labels were about 0.75; now they are about 0.77 (doc1) and 0.79 (doc2).
