# Text Categorization
### Goal of lesson
- What is Text Categorization
- Learn about the Bag-of-Words Model
- Understand Naive Bayes' Rule
- How to use Naive Bayes' Rule for sentiment classification (text categorization)
- What problem smoothing solves

### What is Text Categorization
- Example:
    - Inbox vs Spam
    - Product review: Positive vs Negtive review

### Bag-of-Words Model
- Model that represents text as an unordered collection of words
- The structure is not important
- Works well to classify

- Example
    - I **love** this product.
    - This product feels **cheap**.
    - This is the **best** product ever.

### Naive Bayes Classifier
- Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features ([wiki](https://en.wikipedia.org/wiki/Naive_Bayes_classifier))

### Bayes' Rule Theorem 
- Describes the probability of an event, based on prior knowledge of conditions that might be related to the event ([wiki](https://en.wikipedia.org/wiki/Bayes%27_theorem))
- $P(b|a) = \frac{P(a|b)P(b)}{P(a)}$

### Explained
$P(\text{positive})$

$P(\text{positive}| \text{"I love this product"}) = P(\text{positive} | \text{"I", "love", "this", "product"})$

Bayes's Rule implies it is equal to

$\frac{P(\text{"I", "love", "this", "product"} | \text{positive}) P(\text{positive})}{P(\text{"I", "love", "this", "product"})}$ 

Or proportional to

$P(\text{"I", "love", "this", "product"} | \text{positive}) P(\text{positive})$

The 'Naive' part we use this to simplify

$P(\text{positive})P(\text{"I"} | \text{positive})P(\text{"love"} | \text{positive})P(\text{"this"} | \text{positive})P(\text{"product"} | \text{positive})$

$P(\text{positive})=\frac{\text{number of positive samples}}{\text{number of samples}}$

$P(\text{"love"}|\text{positive})=\frac{\text{number of positive samples with "love"}}{\text{number of positive samples}}$



### Example

"I love this product"

| positive | negative |
| ------ | ------ |
| 0.47 | 0.53 |

| word | positive | negative |
| ------ | ------ | ------ |
| "I" | 0.30 | 0.20 |
| "love" | 0.40 | 0.05 |
| "this" | 0.28 | 0.42 |
| "product" | 0.25 | 0.28 |




$P(\text{positive})P(\text{"I"} | \text{positive})P(\text{"love"} | \text{positive})P(\text{"this"} | \text{positive})P(\text{"product"} | \text{positive}) = 0.47 * 0.30 * 0.40 * 0.28 * 0.25 = 0.003948$

$P(\text{negative})P(\text{"I"} | \text{negative})P(\text{"love"} | \text{negative})P(\text{"this"} | \text{negative})P(\text{"product"} | \text{negative}) = 0.53 * 0.20 * 0.05 * 0.42 * 0.28 = 0.00062328$

Calculate the likelyhood

"I love this product" is positive: 0.00394 / (0.00394 + 0.00062328) = 86.3%

"I love this product" is negative: 0.00062328 / (0.00394 + 0.00062328) = 13.7%

### Problem
- If a word never showed up in a sentence

### Additive Smoothing
- Adding a value to each value in the distribution to smooth the data

### Laplace smoothing
- Adding 1 to each value in the distribution

> #### Programming Notes:
> - Libraries used
>     - [**pandas**](https://pandas.pydata.org) - a data analysis and manipulation tool
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
> - Functionality and concepts used
>     - [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) file ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
>     - [**read_csv()**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) read a comma-separated values (csv) file into **pandas** DataFrame.
>     - **List/Set/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**word_tokenize**](https://www.nltk.org/api/nltk.tokenize.html) Tokenize a string to split off punctuation other than periods
>     - [**NaiveBayesClassifier**](https://www.nltk.org/_modules/nltk/classify/naivebayes.html) A classifier based on the Naive Bayes algorithm

In [1]:
import nltk
import pandas as pd

In [11]:
data = pd.read_csv('files/sentiment.csv')
data.head()

Unnamed: 0,Text,Label
0,Not worth it,Negative
1,Kind of cheap,Negative
2,Really bad,Negative
3,Didn't work the way we expected,Negative
4,Overpriced for what you get,Negative


In [4]:
def extract_words(document):
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )

In [5]:
words = set()

for line in data['Text'].to_list():
    words.update(extract_words(line))

In [9]:
features = []
for _, row in data.iterrows():
    features.append(({word: (word in row['Text']) for word in words}, row['Label']))

In [13]:
classifier = nltk.NaiveBayesClassifier.train(features)

In [15]:
s = input()

feature = {word: (word in extract_words(s)) for word in words}

result = classifier.prob_classify(feature)

for key in result.samples():
    print(key, result.prob(key))

this was great
 Negative 0.10747100603951745
 Positive 0.8925289939604821
