# Tutorial 7: Text classification with Naive Bayes

### Naive Bayes inference

Suppose you are building a system that identifies the author of a book from a short snippet. To simplify the problem, each "document" is represented only by its keywords (stop words are removed and other preprocessing steps are applied). You have annotated some training data as follows:

| doc_id | Document | label |
| --- | --- | --- |
| 1 | dialectic dasein dialectic | Hegel |
| 2 | dialectic analysis ideology | Hegel |
| 3 | utopia justice analysis | Plato |
| 4 | Lacanian analysis ideology | Zizek |
| 5 | justice Marx dialectic ideology | Zizek |

Now you are faced with the following sentence:

q = ```a dialectic analysis of ideology```

What prediction do you expect from your system? Ignore out-of-vocabulary terms during prediction, and use Add-1 (Laplace) smoothing when computing the likelihoods.

#### Your answer goes here:
##### Vocabulary:

analysis, dasein, dialectic, 
ideology, justice, Lacanian,
Marx, utopia

The vocabulary has 8 distinct terms, so $|V| = 8$.

##### Priors:

P(Hegel) = 2/5

P(Plato) = 1/5

P(Zizek) = 2/5

##### Calculating Likelihoods: $P(w|c)$:


The formula for Multinomial Naive Bayes with Add-1 (Laplace) smoothing is:

$$P(w|c) = \frac{\text{count}(w, c) + 1}{\text{count}(\text{all words in } c) + |V|}$$

P(dialectic|Hegel) = (3 + 1) / (6 + 8) = 2/7

P(analysis|Hegel) = P(ideology|Hegel) = (1 + 1) / (6 + 8) = 1/7

P(dialectic|Plato) = P(ideology|Plato) = (0 + 1) / (3 + 8) = 1/11

P(analysis|Plato) = (1 + 1) / (3 + 8) = 2/11

P(dialectic|Zizek) = P(analysis|Zizek) = (1 + 1) / (7 + 8) = 2/15

P(ideology|Zizek) = (2 + 1) / (7 + 8) = 1/5


Example calculation for P(dialectic|Hegel):

Words in Hegel's docs: dialectic (3), dasein (1), analysis (1), ideology (1).

Total raw word count: $N_{Hegel} = 6$. Denominator: $6 + 8 = \mathbf{14}$.

"dialectic" appears 3 times in Hegel's docs, so

$P = (3 + 1) / 14 = \mathbf{2/7}$
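The smoothed likelihoods above can be double-checked with a few lines of Python. This is only a minimal sketch: the names `TRAIN`, `class_counts`, and `likelihood` are made up for illustration, and exact fractions are used so the results can be compared directly with the hand calculation.

```python
from collections import Counter
from fractions import Fraction

# Training data copied from the table above: (document, label).
TRAIN = [
    ("dialectic dasein dialectic", "Hegel"),
    ("dialectic analysis ideology", "Hegel"),
    ("utopia justice analysis", "Plato"),
    ("Lacanian analysis ideology", "Zizek"),
    ("justice Marx dialectic ideology", "Zizek"),
]


def class_counts(train):
    """Return {label: Counter of word counts} over the training set."""
    counts = {}
    for text, label in train:
        counts.setdefault(label, Counter()).update(text.split())
    return counts


def likelihood(word, label, counts, vocab):
    """Add-1 smoothed P(word | label), returned as an exact fraction."""
    total = sum(counts[label].values())  # raw word count of the class
    return Fraction(counts[label][word] + 1, total + len(vocab))


counts = class_counts(TRAIN)
vocab = {w for c in counts.values() for w in c}  # the 8 distinct terms

for label in counts:
    for word in ("dialectic", "analysis", "ideology"):
        print(f"P({word}|{label}) = {likelihood(word, label, counts, vocab)}")
```

Because `Fraction` reduces automatically, the printed values should match the reduced fractions listed above (for example 2/7 for "dialectic" given Hegel).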



##### Calculating Inference: $P(c|q)$:

Filter: We remove "a" and "of" because they are not in our vocabulary $V$. We strictly ignore terms we've never seen before.

Formula: $P(c|q) \propto P(c) \times \prod P(w_i|c)$




$\hat{P}(Hegel|q) \propto 2/5 \times 2/7 \times 1/7 \times 1/7 \approx 0.00233$

$\hat{P}(Plato|q) \propto 1/5 \times 1/11 \times 1/11 \times 2/11 \approx 0.00030$

$\hat{P}(Zizek|q) \propto 2/5 \times 2/15 \times 2/15 \times 1/5 \approx 0.00142$

$\hat{P}(Hegel|q)$ is the highest of the three scores, so the system should predict "Hegel".
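The whole inference step can also be sketched end to end in Python. The helper names `train_nb` and `score` are hypothetical, and the sketch multiplies exact fractions to mirror the hand calculation; a practical implementation would more likely sum log-probabilities to avoid underflow on long documents.

```python
from collections import Counter
from fractions import Fraction

# Training data copied from the table above: (document, label).
TRAIN = [
    ("dialectic dasein dialectic", "Hegel"),
    ("dialectic analysis ideology", "Hegel"),
    ("utopia justice analysis", "Plato"),
    ("Lacanian analysis ideology", "Zizek"),
    ("justice Marx dialectic ideology", "Zizek"),
]


def train_nb(train):
    """Estimate priors, per-class word counts, and the vocabulary."""
    docs_per_class = Counter(label for _, label in train)
    word_counts = {}
    for text, label in train:
        word_counts.setdefault(label, Counter()).update(text.split())
    vocab = {w for c in word_counts.values() for w in c}
    priors = {c: Fraction(n, len(train)) for c, n in docs_per_class.items()}
    return priors, word_counts, vocab


def score(query, priors, word_counts, vocab):
    """Unnormalized P(c) * prod P(w|c), ignoring out-of-vocabulary tokens."""
    tokens = [t for t in query.split() if t in vocab]
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        s = prior
        for t in tokens:
            s *= Fraction(word_counts[c][t] + 1, total + len(vocab))
        scores[c] = s
    return scores


priors, word_counts, vocab = train_nb(TRAIN)
scores = score("a dialectic analysis of ideology", priors, word_counts, vocab)
prediction = max(scores, key=scores.get)

for c, s in scores.items():
    print(f"{c}: {s} ~ {float(s):.5f}")
print("prediction:", prediction)
```

The scores are proportional to the posteriors, not the posteriors themselves; dividing each by their sum would normalize them without changing the argmax.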

### Macro/Micro measures

Suppose one of your models uses a 1-vs-rest strategy* for multi-class classification. The results are as follows:

class name = Hegel

| | in the class | not in the class |
| --- | --- | --- |
| predicted to be in the class | 20 | 10 |
| predicted to not be in the class |  10 | 60 |

class name = Plato

| | in the class | not in the class |
| --- | --- | --- |
| predicted to be in the class | 15 | 5 |
| predicted to not be in the class | 15 | 65 |

class name = Zizek

| | in the class | not in the class |
| --- | --- | --- |
| predicted to be in the class | 40 | 10 |
| predicted to not be in the class | 0 | 50 |

1. Calculate the counts of instances in these three classes respectively.
2. Calculate the precision/recall for the three classes
3. Calculate the micro and macro F1 of this model

*The 1-vs-rest (also called 1-vs-all) strategy turns a multi-class classification problem into one binary classification problem per class. The two classes of each binary problem are "in this class" and "in any of the other classes".

#### Your answer goes here:

1. Under the 1-vs-rest strategy, the "in the class" column of each table (TP + FN) sums to the number of instances of that class. Therefore:

\#Hegel = 20 + 10 = 30

\#Plato = 15 + 15 = 30

\#Zizek = 40 + 0 = 40

2. The precision/recall for class Hegel:

TP (Top-Left): 20 (We predicted Hegel, and it was Hegel).

FP (Top-Right): 10 (We predicted Hegel, but it wasn't).

FN (Bottom-Left): 10 (We predicted "Not Hegel", but it actually was Hegel).

TN (Bottom-Right): 60 (We predicted "Not Hegel", and it wasn't).

Total Hegel Instances: This is the sum of the first column (Actual: Yes).

$TP + FN = 20 + 10 = \mathbf{30}$.

Precision (P): "When we predict Hegel, how often are we right?"

Formula: $TP / (TP + FP)$

Math: $20 / (20 + 10) = 20/30 = \mathbf{2/3}$.

Recall (R): "Out of all real Hegel texts, how many did we find?"

Formula: $TP / (TP + FN)$

Math: $20 / (20 + 10) = 20/30 = \mathbf{2/3}$.

For class Plato:

P = 15 / (15 + 5) = 3/4

R = 15 / (15 + 15) = 1/2

For class Zizek:

P = 40 / (40 + 10) = 4/5

R = 40 / (40 + 0) = 1
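The per-class counts, precision, and recall can be checked mechanically. The following is a small sketch, assuming each table is stored as a `(TP, FP, FN, TN)` tuple; the names `TABLES`, `precision`, and `recall` are illustrative only.

```python
from fractions import Fraction

# (TP, FP, FN, TN) read off each 1-vs-rest table above.
TABLES = {
    "Hegel": (20, 10, 10, 60),
    "Plato": (15, 5, 15, 65),
    "Zizek": (40, 10, 0, 50),
}


def precision(tp, fp, fn, tn):
    """Share of positive predictions that are correct: TP / (TP + FP)."""
    return Fraction(tp, tp + fp)


def recall(tp, fp, fn, tn):
    """Share of true class members that are found: TP / (TP + FN)."""
    return Fraction(tp, tp + fn)


for name, cells in TABLES.items():
    tp, fp, fn, tn = cells
    print(
        f"{name}: instances={tp + fn}, "
        f"P={precision(*cells)}, R={recall(*cells)}"
    )
```

Note that the four cells of every table add up to 100, the total number of documents, which is a quick sanity check when copying the numbers.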

3. For the micro F1, we first build the pooled table by summing the three tables cell by cell:

| | in the class | not in the class |
| --- | --- | --- |
| predicted to be in the class | 75 | 25 |
| predicted to not be in the class | 25 | 175 |


Recall that F1 = 2 * (precision * recall) / (precision + recall).

micro precision = 75 / (75 + 25) = 0.75

micro recall = 75 / (75 + 25) = 0.75

micro f1 = 2 / (1/0.75 + 1/0.75) = 0.75

The Meaning: "Pooled over all documents, 75% of the system's positive predictions are correct, and 75% of the true class members are found."

The macro F1 is the average of the individual F1 scores for each class.

f1 Hegel = 2/3

f1 Plato =  2/(4/3 + 2) = 0.6

f1 Zizek = 2/(5/4 + 1) = 8/9

macro f1 = (2/3 + 0.6 + 8/9)/3 = 0.719

The Meaning: "The system is reasonably robust, but its quality is uneven across classes."

In our specific case, the system is strong on Zizek (F1 ≈ 0.89) but weaker on Hegel (F1 ≈ 0.67) and weakest on Plato (F1 = 0.6). If it handled the weaker classes better, their F1 scores would rise and so would the macro F1.


Macro averaging treats every Class (Author) as equally important, regardless of how many documents they have. You calculate the F1 score for each author individually, then take the simple average.

Micro averaging treats every Instance (Document) as equally important. You don't calculate separate F1 scores first. Instead, you throw all the predictions into one big pile (pool them) and calculate a single global F1. 
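The two averaging schemes can be compared side by side in a short Python sketch. It uses the identity F1 = 2TP / (2TP + FP + FN), which is algebraically the same as the harmonic mean of precision and recall; the names `TABLES` and `f1` are assumptions made for this illustration.

```python
from fractions import Fraction

# (TP, FP, FN) for each class, taken from the 1-vs-rest tables.
TABLES = {
    "Hegel": (20, 10, 10),
    "Plato": (15, 5, 15),
    "Zizek": (40, 10, 0),
}


def f1(tp, fp, fn):
    """F1 as 2TP / (2TP + FP + FN), equal to 2PR / (P + R)."""
    return Fraction(2 * tp, 2 * tp + fp + fn)


# Macro: compute F1 per class, then take the unweighted mean.
per_class = {name: f1(*cells) for name, cells in TABLES.items()}
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro: pool the counts over all classes first, then compute one F1.
pooled = tuple(sum(column) for column in zip(*TABLES.values()))
micro_f1 = f1(*pooled)

print("per-class F1:", {k: str(v) for k, v in per_class.items()})
print(f"macro F1 = {macro_f1} ~ {float(macro_f1):.3f}")
print(f"micro F1 = {micro_f1} ~ {float(micro_f1):.3f}")
```

The pooled counts come out as (75, 25, 25), so the micro F1 is exactly 3/4, while the macro F1 is 97/135, about 0.719.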


In text classification tasks (like predicting authors or sentiment), here is a rough guide for F1 scores:

0.90+ (Excellent): This is near-human performance. Usually required for medical or legal automation.

0.80 - 0.90 (Good): This is a solid "production-ready" model for most business apps.

0.70 - 0.80 (Acceptable): This is typical for a first prototype or a difficult task (like discerning 3 distinct philosophers). It makes mistakes, but is generally useful.

< 0.60 (Poor): The model is struggling. It might be slightly better than random, but not reliable enough to use.

Because micro F1 is dominated by the most frequent classes, it is often "easier" to push high, so we typically hold it to a higher standard than macro F1 (for example, by raising each threshold by 0.05).

Rather than just looking at the raw numbers, data scientists also look at the gap between the two:

$$\text{Gap} = \text{Micro F1} - \text{Macro F1}$$

Small Gap (< 0.05): Excellent. Your model is fair. It treats rare classes almost as well as common ones.

Large Gap (> 0.10): Your model likely favors the majority class at the expense of rare classes.

As we can see, the micro F1 (0.75) is pulled up by the largest class (Zizek) and ends up higher than the macro F1 (0.719); the gap of about 0.03 is small.

### Vector space Classification

1. What are two premises of vector space classification?
2. Why do we need those?
3. Do they hold in practice?



1. Documents in the same class form a contiguous region and regions of different classes do not overlap.

Contiguity: We assume that documents belonging to the same class (e.g., "Hegel") are statistically similar to each other. In vector space, this means they cluster together to form a single, solid "cloud" or region. They aren't scattered randomly across the universe; they hang out in the same neighborhood.

Separability: We assume that the cloud for "Hegel" does not merge or mix with the cloud for "Plato." There is a clear boundary or gap between them.

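2. They make it possible to find a boundary that separates the classes. Basic vector space classifiers depend on these assumptions: Rocchio and linear SVMs draw linear decision boundaries between class regions, while kNN assumes that a document's nearest neighbors share its class, which only works if each class is contiguous.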

3. Only if you choose your representation well.

Strictly speaking, no. In the real world, classes are messy: the word "bank" appears in both "Finance" documents and "Nature" (river bank) documents, which creates overlap. In practice, however, the answer is often "yes" if we engineer our features well. While classes might overlap in 1 or 2 dimensions, they often separate cleanly when you look at 10,000 dimensions (words).
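To make the contiguity idea concrete, here is a rough Rocchio-style sketch that reuses the five training documents from the Naive Bayes exercise. It is an assumption-laden illustration, not a reference implementation: documents are raw term-count vectors (no tf-idf weighting or length normalization), each class is summarized by the mean of its vectors, and the query is assigned to the centroid with the highest cosine similarity. The helper names `vectorize`, `centroid`, and `cosine` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Training data from the Naive Bayes exercise: (document, label).
TRAIN = [
    ("dialectic dasein dialectic", "Hegel"),
    ("dialectic analysis ideology", "Hegel"),
    ("utopia justice analysis", "Plato"),
    ("Lacanian analysis ideology", "Zizek"),
    ("justice Marx dialectic ideology", "Zizek"),
]


def vectorize(text):
    """Sparse term-count vector: {word: count}."""
    return Counter(text.split())


def centroid(vectors):
    """Component-wise mean of a list of sparse vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: n / len(vectors) for w, n in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(x * b.get(w, 0.0) for w, x in a.items())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


by_class = defaultdict(list)
for text, label in TRAIN:
    by_class[label].append(vectorize(text))
centroids = {label: centroid(vs) for label, vs in by_class.items()}

# In-vocabulary part of the query from the first exercise.
query = vectorize("dialectic analysis ideology")
sims = {label: cosine(query, c) for label, c in centroids.items()}
prediction = max(sims, key=sims.get)

for label, s in sims.items():
    print(f"{label}: {s:.3f}")
print("prediction:", prediction)
```

On this toy data the query is closest to the Hegel centroid, in agreement with the Naive Bayes result. A centroid only summarizes a class well when the class really forms one contiguous region, which is the first premise above.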