# Text Classification
Text classification is a mapping $h$ from input data $x$ to a label $y$.
$x$ is drawn from some instance space $\bf x$ and $y$ from some output space $\bf y$.

We define $h(x)$ as the true mapping.

We wish to find the $\hat h(x)$ that best approximates $h(x)$.

The following are possible methods to do so:
* Rule Based (Decision Rules)
* Supervised Learning

**Applications**
* Subject Classification
* Spam Detection
* Authorship Attribution
* Sentiment Analysis

## Sentiment Analysis
A typical approach is to use a bag-of-words classifier.

If the classifier finds `good` or `great` then the sentiment is likely to be good.
If the classifier finds `bad` or `terrible` then the sentiment is likely to be bad.
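
The cue-word idea above can be sketched as a tiny lexicon-based scorer. This is only an illustration: the word lists `POSITIVE` and `NEGATIVE` and the function `lexicon_sentiment` are hypothetical names, and a real bag-of-words classifier would learn its weights from data rather than use fixed lists.

```python
from collections import Counter

# Hypothetical cue-word lists, chosen only for illustration.
POSITIVE = {"good", "great"}
NEGATIVE = {"bad", "terrible"}


def lexicon_sentiment(text):
    """Classify text by counting positive and negative cue words."""
    # Bag of words: only the counts matter, word order is discarded.
    counts = Counter(text.lower().split())
    score = sum(counts[w] for w in POSITIVE) - sum(counts[w] for w in NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("a great film with a good cast"))
print(lexicon_sentiment("a terrible plot"))
```

Because the scorer ignores word order, it has no way to handle the issues discussed next.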

### Issues
#### Negation
Negation can reverse the sentiment expressed by a word, as in `not good`.

Solution: prepend a special prefix to each word after a negation until the end of the sentence.

`I do not like this film` $\rightarrow$ `I do NOT_like NOT_this NOT_film`
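
A minimal sketch of this negation marking is shown below. The negation list `NEGATIONS`, the tokenizer regex, and the function name `mark_negation` are assumptions made for illustration. Following the example above, the negation word itself is dropped and every later word up to the end of the sentence gets the `NOT_` prefix.

```python
import re

# Assumed, deliberately small list of negation words.
NEGATIONS = {"not", "no", "never"}
SENTENCE_END = {".", "!", "?"}


def mark_negation(text):
    """Prefix words after a negation with NOT_ until the sentence ends."""
    # Words and sentence-final punctuation; other punctuation is ignored.
    tokens = re.findall(r"\w+|[.!?]", text)
    marked = []
    negated = False
    for tok in tokens:
        if tok in SENTENCE_END:
            negated = False          # the negation scope ends with the sentence
            marked.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True           # drop the negation word, start marking
        elif negated:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
    return marked


print(" ".join(mark_negation("I do not like this film")))
```

After this step, `like` and `NOT_like` are distinct bag-of-words features, so a classifier can assign them different sentiment.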

#### Ambiguity
Some words can fall into either category based on context.

* "This movie is good for people with poor taste"

where `good` is used in a negative sense.

#### Lack of Data
Some words appear only infrequently, so there is too little data to learn their sentiment reliably.

Solution: use an offline dictionary to prepare the model for analysis.



### tf-idf
Suppose that we have a vocabulary of words $(w_1, \dots, w_V)$.
Then, for each collection, we obtain a count vector $(c(w_1), \dots, c(w_V))$, which holds the count of each word in the collection.
These counts can be used as an indicator of which class each document belongs to.

#### Term frequency
Definition: $tf_{w,d}$ of a **word** $w$ in **document** $d$ is the number of times $w$ occurs in $d$.

It makes sense that if a word appears many times in a document, the word is likely to characterise the document.
However, the relationship is not linear: a document with 100 occurrences of the word is not 100 times more relevant than a document with 1 occurrence of the word.
Thus, we take the log frequency instead, to taper off the effect of a high number of occurrences of a word.

Hence, we derive that the weight of a word $w$ in document $d$ is

$$
w_{w,d} = 
    \begin{cases}
      1 + \log tf_{w, d} & tf_{w, d} > 0 \\
      0 & \text{otherwise} \\
    \end{cases}
$$
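
As a quick numerical check of the dampening effect, here is a small sketch of the log-frequency weight. The function name `tf_weight` is hypothetical, and base 10 is an assumption, since the text does not fix the base of the logarithm.

```python
from math import log10


def tf_weight(tf):
    """Log-frequency weight: 1 + log10(tf) for tf > 0, else 0."""
    return 1 + log10(tf) if tf > 0 else 0.0


for tf in (0, 1, 10, 100):
    print(tf, tf_weight(tf))
```

With base 10, a word that occurs 100 times gets a weight of about 3, only three times the weight of a word that occurs once.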

#### Document frequency
Definition: $df_{w}$ of a **word** $w$ is the number of documents that contain $w$.

Notice that words that appear in many documents are less likely to be descriptive than words that appear in few documents.

For example, `the` is less likely to differentiate 2 documents compared to `astigmatism`.

Thus, we obtain the relationship that the higher the document frequency, the lower the word's significance.

The inverse document frequency ($idf$) is therefore defined as follows, where $N$ is the total number of documents in the collection:

$$
idf_w = \log(N/df_w)
$$

Again, the log scale is used to dampen the effects of high document frequency.
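
To make the scale concrete, here is a small sketch of the $idf$ computation. The function name `idf`, the collection size of one million documents, and the use of base 10 are all assumptions for illustration.

```python
from math import log10


def idf(n_docs, df):
    """Inverse document frequency log10(N / df) for a word seen in df documents."""
    if df <= 0 or df > n_docs:
        raise ValueError("df must be between 1 and n_docs")
    return log10(n_docs / df)


N = 1_000_000  # hypothetical collection size
for df in (1, 1_000, 1_000_000):
    print(df, idf(N, df))
```

A word that appears in every document gets an $idf$ of 0, so it contributes nothing to the weight, while a word that appears in a single document gets the largest value.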

#### Collection Frequency
Collection frequency is the number of occurrences of a word in the collection, counting multiple occurrences within the same document.

Contrast this with document frequency, which is the number of documents in which a word appears.

Notice that document frequency is generally more indicative of the significance of a word, since the same word appearing multiple times in the same document should not reduce the importance of the word.

For example, if you have many English documents and a single math document about geometry, `triangle` may appear many times in the collection, yet it is still a good differentiator between English and math documents.

#### tf-idf weighting
This brings us to combining the two metrics.

A typical way to do this is:
$$
weight_{w,d} = (1+\log tf_{w,d}) \times \log(N / df_w)
$$

Notice that this follows our intuition: both the term frequency in the document and the rarity of the term across the collection have a positive effect on the weight.
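
The weighting can be sketched end to end on a toy collection. The function name `tfidf_vectors`, the three example documents, and the choice of base 10 are assumptions for illustration, not a reference implementation.

```python
from collections import Counter
from math import log10


def tfidf_vectors(docs):
    """Return (vocab, vectors) where vectors[i][j] is the tf-idf weight
    of word vocab[j] in document docs[i]. docs is a list of token lists."""
    n_docs = len(docs)
    vocab = sorted({tok for doc in docs for tok in doc})
    # Document frequency: count each word once per document.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([
            (1 + log10(tf[w])) * log10(n_docs / df[w]) if tf[w] > 0 else 0.0
            for w in vocab
        ])
    return vocab, vectors


docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat", "sat"],
    ["the", "triangle"],
]
vocab, vectors = tfidf_vectors(docs)
for row in vectors:
    print([round(x, 3) for x in row])
```

In this toy collection, `the` occurs in every document, so its weight is 0 everywhere, while `triangle` occurs in only one document and receives the largest $idf$.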

### Weight Matrix
Thus, we can now replace our count matrix with a weight matrix, in which each entry is the tf-idf weight of a word in a document.

### Vector Space Model
We now have a $|V|$-dimensional vector space, where each axis corresponds to a word.

Words are the axes of the space.

Documents are points or vectors in the space.

We will now devise methods to compare two documents.

#### Geometry
From linear algebra, we know that the dot product gives us an indication of how close 2 vectors are.

$$
\vec a \cdot \vec b = |\vec a | |\vec b| \cos \theta
$$

If the 2 vectors are similar, then the dot product will be large.

However, a long vector (caused by a frequent word) can also lead to a large dot product.

Thus, it is better to use the cosine of the angle between the vectors, which normalizes away the magnitudes of the 2 vectors.


$$
\cos \theta = \frac{\vec a \cdot \vec b}{|\vec a | |\vec b|} = \frac {\sum_i a_i b_i} {\sqrt{\sum_i a_i^2} \sqrt{\sum_i b_i^2}}
$$

And since every entry in our vectors is non-negative, the cosine has the range $0 \leq \cos \theta \leq 1$.
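
A minimal sketch of this cosine comparison for two weight vectors stored as plain Python lists follows. The function name `cosine_similarity` and the convention of returning 0 for an all-zero vector are assumptions.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention: an empty document matches nothing
    return dot / (norm_a * norm_b)


short_doc = [1.0, 2.0, 0.0]
long_doc = [10.0, 20.0, 0.0]   # same direction, ten times the length
other_doc = [0.0, 0.0, 3.0]    # shares no words with the others

print(cosine_similarity(short_doc, long_doc))
print(cosine_similarity(short_doc, other_doc))
```

The first pair points in the same direction, so the similarity is (up to rounding) 1 regardless of length, while the second pair shares no words and scores 0.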


### Naive Bayes Model <a id=naive-bayes></a>
The Naive Bayes model depends on a simple representation of the document, the bag of words.

It is based on the premise that a document of a certain class is more likely to produce words from a certain vocabulary.

Thus, for a given document $W$, we wish to find the class $y$ which gives the largest $Pr(y | W)$.

By Bayes' rule, we can expand the probability as follows:

$$
Pr(y | W) = \frac{Pr(W | y) Pr(y)}{Pr(W)}
$$

We can further simplify the expression since we only want the best class $y$, and the denominator $Pr(W)$ is the same for every class.


$$
\text{argmax}_y Pr(y | W) = \text{argmax}_y \frac{Pr(W | y) Pr(y)}{Pr(W)} = \text{argmax}_y Pr(W | y) Pr(y)
$$

Now, we make the assumptions that
* Each word's probability of appearing does not depend on its position in the document
* Given the class, the events of the words appearing are independent of each other

Thus, the probability of all the words $w_1, \dots, w_n$ appearing in the document given the class is simply the product of the probabilities of each word.
$$
\text{argmax}_y Pr(y | W) = \text{argmax}_y Pr(W | y) Pr(y) =  \text{argmax}_y Pr(y) \prod _{i=1} ^ n Pr(w_i | y)
$$

Note that this is similar to treating this as a unigram language model.

#### MLE Estimations
We can estimate each of the probabilities as follows:

$$
\begin{align}
Pr(w_i | y) &= \frac{c(w_i, y)} {\sum _{w \in V} c(w, y)} \\
Pr(y) &= \frac{N_y}{N_{doc}}
\end{align}
$$

where $c(w, y)$ is the number of times word $w$ occurs in documents of class $y$, $N_y$ is the number of documents with class $y$, and $N_{doc}$ is the total number of documents.

As with the language model, note that we might run into issues of [underflow](./language_model.ipynb#underflow) or [unknown words](./language_model.ipynb#unknown_words), both of which can be solved in the same way as discussed in the language model chapter.
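
The underflow issue can be illustrated with a short sketch. The helper names `naive_product` and `log_score`, and the choice of 200 words each with probability 0.01, are hypothetical. The product of many small probabilities collapses to zero in floating point, while the sum of their logarithms stays finite and can still be compared across classes.

```python
from math import log


def naive_product(probs):
    """Multiply probabilities directly; prone to floating-point underflow."""
    result = 1.0
    for p in probs:
        result *= p
    return result


def log_score(probs):
    """Sum log-probabilities instead; same ranking, no underflow."""
    return sum(log(p) for p in probs)


probs = [0.01] * 200  # hypothetical per-word probabilities of a long document
print(naive_product(probs))
print(log_score(probs))
```

Since the logarithm is monotonic, picking the class with the largest log score gives the same answer as picking the class with the largest product.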

```python
from collections import Counter
from math import exp, log


class NaiveBayes:
    """Multinomial Naive Bayes over bag-of-words token counts."""

    def __init__(self, classes, add_one=True, smooth_prior=True):
        self.add_one = add_one            # add-one smoothing for word likelihoods
        self.smooth_prior = smooth_prior  # add-one smoothing for class priors
        self.vocab = set()
        self.table = {cls: Counter() for cls in classes}  # word counts c(w, y)
        self.documents = Counter()        # number of training documents per class

    def add_document(self, tokens, label):
        self.vocab.update(tokens)
        self.table[label].update(tokens)
        self.documents[label] += 1

    def get_prob(self, tokens, label):
        """Return Pr(W | y) * Pr(y), accumulated in log space."""
        if self.smooth_prior:
            prior = (self.documents[label] + 1) / (sum(self.documents.values()) + len(self.table))
        else:
            prior = self.documents[label] / sum(self.documents.values())
        if prior == 0:
            # A class with no training documents and no prior smoothing.
            return 0.0

        total = sum(self.table[label].values())
        log_prob = log(prior)
        for token in tokens:
            if self.add_one:
                num = self.table[label][token] + 1
                dem = total + len(self.vocab)
            else:
                num = self.table[label][token]
                dem = total
            if num == 0:
                # Unseen word without smoothing: the whole product is zero.
                return 0.0
            log_prob += log(num) - log(dem)

        return exp(log_prob)

    def predict(self, tokens):
        # Pick the class with the largest Pr(W | y) * Pr(y).
        return max(self.table, key=lambda label: self.get_prob(tokens, label))
```

## Evaluation
After producing a classification model, we need to evaluate its performance.

A typical way is to use a confusion matrix.

|                    | Actual Positive | Actual Negative |
| ------------------ | --------------- | --------------- |
| Predicted Positive | True Positive   | False Positive  |
| Predicted Negative | False Negative  | True Negative   |

### Accuracy
A natural metric is accuracy, which is simply $\frac{TP + TN}{TP+TN+FP+FN}$.

Accuracy indicates the fraction of the samples that the model gets correct.

However, we do not usually use this to evaluate a classification model, because it does not work well for unbalanced samples.

Suppose our samples consist of 90% positive documents and 10% negative documents.

If our model simply classifies everything as positive, it will get an accuracy of 90%, but this figure hides its inability to differentiate the two classes.
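
The example above can be checked with a few lines. The function name `accuracy` and the sample size of 100 documents are assumptions for illustration.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# 90 positive and 10 negative documents, everything predicted positive:
# all 90 positives are true positives, all 10 negatives are false positives.
print(accuracy(tp=90, tn=0, fp=10, fn=0))
```

The accuracy is 0.9 even though the true negative count is 0, that is, the model never recognises a negative document.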

### Precision
Precision is defined as $\frac{TP}{TP+FP}$.

It is indicative of the correctness of positive predictions.

Precision is concerned about the following row of the table


|                    | Actual Positive | Actual Negative |
| ------------------ |:---------------:|:---------------:|
| Predicted Positive | **True Positive**   | **False Positive**  |
| Predicted Negative | -  | -   |

### Recall
Recall is defined as $\frac{TP}{TP+FN}$.

It is indicative of how many of the actual positive cases the model is able to identify.

Recall is concerned with the following column of the table:


|                    | Actual Positive | Actual Negative |
| ------------------ |:---------------:|:---------------:|
| Predicted Positive | **True Positive**   | - |
| Predicted Negative | **False Negative**  | -   |
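
Both metrics can be computed from paired lists of actual and predicted labels, as in the sketch below. The function names and the small label lists are hypothetical, and returning 0 when a denominator is empty is an assumed convention.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for one class treated as positive."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision(tp, fp):
    """Correct positive predictions over all positive predictions."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Correct positive predictions over all actual positives."""
    return tp / (tp + fn) if tp + fn else 0.0


actual = [1, 1, 1, 0, 0, 1]
predicted = [1, 0, 1, 1, 0, 0]
tp, fp, fn, tn = confusion_counts(actual, predicted)
print(tp, fp, fn, tn)
print(precision(tp, fp), recall(tp, fn))
```

Here precision uses only the predicted-positive row (`tp` and `fp`), while recall uses only the actual-positive column (`tp` and `fn`).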




---

Consider a classifier whose strictness in classifying we can tune.

If we increase the strictness, it will predict fewer of the samples as positive, until it only predicts positive for those samples that it is very confident about.

Thus, precision would tend to increase and recall would fall.

We can see the reverse happening if we were to reduce strictness.

Thus, there is a general trend of an inverse relationship between recall and precision.
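
The trade-off can be seen by sweeping a decision threshold over classifier confidence scores. The scores, labels, and the function name `precision_recall_at` below are made up for illustration. Treating precision as 1 when nothing is predicted positive is an assumed convention.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold; return (precision, recall)."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    precision = tp / (tp + fp) if tp + fp else 1.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical confidence scores and true labels for eight documents.
scores = [0.95, 0.9, 0.8, 0.6, 0.55, 0.4, 0.3, 0.1]
labels = [True, True, False, True, False, True, False, False]

for threshold in (0.2, 0.5, 0.85):
    print(threshold, precision_recall_at(scores, labels, threshold))
```

On this made-up data, raising the threshold from 0.2 to 0.85 raises precision and lowers recall, which matches the trend described above. On real data the curve is usually not perfectly monotonic.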

### Combined Measure
Because we wish to account for both precision and recall in the evaluation of our classifier, we can define a measure $F$ as follows:

$$
F = \frac{2pr}{p+r}
$$

This is also known as the harmonic mean of $p$ and $r$.

Note that compared to the arithmetic mean, defined as $\frac{p+r}{2}$, the harmonic mean penalizes the classifier more heavily when either $p$ or $r$ is small.
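
A short numerical comparison makes the penalty visible. The function names `f_measure` and `arithmetic_mean` and the example values of $p$ and $r$ are chosen only for illustration.

```python
def f_measure(p, r):
    """F = 2pr / (p + r), defined as 0 when both p and r are 0."""
    return 2 * p * r / (p + r) if p + r else 0.0


def arithmetic_mean(p, r):
    return (p + r) / 2


for p, r in [(0.5, 0.5), (0.9, 0.5), (1.0, 0.1)]:
    print(p, r, round(f_measure(p, r), 3), round(arithmetic_mean(p, r), 3))
```

When $p$ and $r$ are equal, the two means agree. With perfect precision but a recall of only 0.1, the arithmetic mean is still 0.55, whereas $F$ falls to about 0.18.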