# Text Categorization
### Goal of lesson
- What is Text Categorization
- Learn about the Bag-of-Words Model
- Understand Bayes' rule and the Naive Bayes classifier
- How to use Naive Bayes for sentiment classification (text categorization)
- What problem smoothing solves

### What is Text Categorization
- The task of assigning a piece of text to one of a set of predefined categories
- Examples:
    - Inbox vs Spam
    - Product review: Positive vs Negative review

### Bag-of-Words Model
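- Model that represents text as an unordered collection of words
- Word order and sentence structure are ignored; only which words occur matters
- Often works well for classification, because individual words (bold below) carry much of the sentiment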

- Example
    - I **love** this product.
    - This product feels **cheap**.
    - This is the **best** product ever.
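
The idea can be sketched in a few lines of Python. This is a minimal illustration, assuming a simple regular-expression tokenizer; the helper name `bag_of_words` is hypothetical and not part of the lesson's code, which uses `nltk.word_tokenize` instead.

```python
import re

def bag_of_words(text):
    """Return the set of lowercase words in `text`, ignoring order and punctuation."""
    return set(re.findall(r"[a-z']+", text.lower()))

a = bag_of_words("I love this product.")
b = bag_of_words("This product, I love!")

print(sorted(a))  # ['i', 'love', 'product', 'this']
print(a == b)     # True: same words, different order
```

Because only the words themselves are kept, two sentences with the same words in a different order get the same representation.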

### Naive Bayes Classifier
- Naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features ([wiki](https://en.wikipedia.org/wiki/Naive_Bayes_classifier))

### Bayes' Theorem (Bayes' Rule)
- Describes the probability of an event, based on prior knowledge of conditions that might be related to the event ([wiki](https://en.wikipedia.org/wiki/Bayes%27_theorem))
- $P(b|a) = \frac{P(a|b)P(b)}{P(a)}$

### Explained
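The prior probability that a review is positive, before looking at its text:

$P(\text{positive})$

The probability we actually want is conditioned on the words of the review: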

$P(\text{positive}| \text{"I love this product"}) = P(\text{positive} | \text{"I", "love", "this", "product"})$

By Bayes' rule, this is equal to

$\frac{P(\text{"I", "love", "this", "product"} | \text{positive}) P(\text{positive})}{P(\text{"I", "love", "this", "product"})}$ 

The denominator is the same for every class, so this is proportional to

$P(\text{"I", "love", "this", "product"} | \text{positive}) P(\text{positive})$

The "naive" part is the assumption that the words are independent of each other given the class, which lets us simplify this to

$P(\text{positive})P(\text{"I"} | \text{positive})P(\text{"love"} | \text{positive})P(\text{"this"} | \text{positive})P(\text{"product"} | \text{positive})$

Each factor is estimated by counting training samples:

$P(\text{positive})=\frac{\text{number of positive samples}}{\text{number of samples}}$

$P(\text{"love"}|\text{positive})=\frac{\text{number of positive samples with "love"}}{\text{number of positive samples}}$
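
The two counting formulas above can be sketched directly in Python. The samples below are hypothetical and only serve to illustrate the counting; `prior` and `conditional` are illustrative helper names, not functions from any library.

```python
# Hypothetical labeled samples: (set of words in the review, label)
samples = [
    ({"i", "love", "this", "product"}, "positive"),
    ({"love", "it"}, "positive"),
    ({"great", "product"}, "positive"),
    ({"this", "product", "feels", "cheap"}, "negative"),
    ({"not", "worth", "it"}, "negative"),
]

def prior(label):
    """P(label) = number of samples with this label / number of samples."""
    return sum(1 for _, lab in samples if lab == label) / len(samples)

def conditional(word, label):
    """P(word | label) = samples of this label containing word / samples of this label."""
    in_class = [words for words, lab in samples if lab == label]
    return sum(1 for words in in_class if word in words) / len(in_class)

p_positive = prior("positive")                       # 3 of 5 samples
p_love_given_pos = conditional("love", "positive")   # 2 of 3 positive samples
p_love_given_neg = conditional("love", "negative")   # 0 of 2 negative samples

print(p_positive, p_love_given_pos, p_love_given_neg)
```

Note that "love" never occurs in a negative sample here, so its estimated probability for that class is exactly 0.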



### Example

Classify the sentence "I love this product", given the following probabilities.

Class priors $P(\text{class})$:

| positive | negative |
| ------ | ------ |
| 0.47 | 0.53 |

Word probabilities $P(\text{word} | \text{class})$:

| word | positive | negative |
| ------ | ------ | ------ |
| "I" | 0.30 | 0.20 |
| "love" | 0.40 | 0.05 |
| "this" | 0.28 | 0.42 |
| "product" | 0.25 | 0.28 |




$P(\text{positive})P(\text{"I"} | \text{positive})P(\text{"love"} | \text{positive})P(\text{"this"} | \text{positive})P(\text{"product"} | \text{positive}) = 0.47 * 0.30 * 0.40 * 0.28 * 0.25 = 0.003948$

$P(\text{negative})P(\text{"I"} | \text{negative})P(\text{"love"} | \text{negative})P(\text{"this"} | \text{negative})P(\text{"product"} | \text{negative}) = 0.53 * 0.20 * 0.05 * 0.42 * 0.28 = 0.00062328$

Normalize the two scores so that they sum to 1, which gives the probability of each class

"I love this product" is positive: 0.00394 / (0.00394 + 0.00062328) ≈ 86.3%

"I love this product" is negative: 0.00062328 / (0.00394 + 0.00062328) ≈ 13.7%

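The same calculation can be reproduced in Python as a quick check. This is a sketch that uses the probabilities from the tables above; the variable names are illustrative. With unrounded scores the positive share comes out at about 86.4%, slightly above the 86.3% of the hand calculation, which truncates 0.003948 to 0.00394.

```python
import math

priors = {"positive": 0.47, "negative": 0.53}
word_probs = {
    "I":       {"positive": 0.30, "negative": 0.20},
    "love":    {"positive": 0.40, "negative": 0.05},
    "this":    {"positive": 0.28, "negative": 0.42},
    "product": {"positive": 0.25, "negative": 0.28},
}
sentence = ["I", "love", "this", "product"]

# Unnormalized score per class: prior times the product of the word probabilities
scores = {
    label: priors[label] * math.prod(word_probs[w][label] for w in sentence)
    for label in priors
}

# Normalize so the two scores sum to 1
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

for label in posteriors:
    print(f"{label}: score={scores[label]:.8f}, probability={posteriors[label]:.1%}")
```
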
### Problem
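- If a word never shows up in the training samples of a class (e.g. "product" in no positive review), its conditional probability for that class is 0
- The whole product for that class then becomes 0, no matter how strongly the other words point to it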

### Additive Smoothing
- Adding a small value $\alpha$ to each count in the distribution, so that no probability is exactly 0

### Laplace smoothing
- Additive smoothing with $\alpha = 1$: adding 1 to each count in the distribution
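
A minimal sketch of the effect of smoothing on a single word probability. It assumes each word is a binary feature (present or absent), so the denominator grows by `alpha` times 2; the counts are hypothetical and `smoothed_probability` is an illustrative helper, not a library function.

```python
def smoothed_probability(count, total, alpha=1.0, num_outcomes=2):
    """Additive smoothing: (count + alpha) / (total + alpha * num_outcomes).

    num_outcomes=2 assumes a binary feature (word present / word absent).
    """
    return (count + alpha) / (total + alpha * num_outcomes)

positive_samples = 47   # hypothetical number of positive reviews
with_product = 0        # "product" never appears in a positive review

unsmoothed = with_product / positive_samples
laplace = smoothed_probability(with_product, positive_samples)  # alpha = 1

print(unsmoothed)         # 0.0
print(round(laplace, 4))  # 0.0204
```

With smoothing, the unseen word contributes a small but nonzero factor, so the other words in the sentence can still decide the outcome.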

> #### Programming Notes:
> - Libraries used
>     - [**pandas**](https://pandas.pydata.org) - a data analysis and manipulation tool
>     - [**nltk**](https://www.nltk.org) - Natural Language Toolkit
> - Functionality and concepts used
>     - [**CSV**](https://en.wikipedia.org/wiki/Comma-separated_values) file ([Lecture on CSV](https://youtu.be/LEyojSOg4EI))
>     - [**read_csv()**](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) read a comma-separated values (csv) file into **pandas** DataFrame.
>     - **List/Set/Dict Comprehension** to convert data ([Lecture on **List Comprehension**](https://youtu.be/vCYEvtfXdig))
>     - [**word_tokenize**](https://www.nltk.org/api/nltk.tokenize.html) Tokenize a string to split off punctuation other than periods
>     - [**NaiveBayesClassifier**](https://www.nltk.org/_modules/nltk/classify/naivebayes.html) A classifier based on the Naive Bayes algorithm

In [None]:
# How can you figure something out about a text?

# Text sentiment: classify a text. Is it a positive text or a negative text?

In [None]:
# P(a|b): the probability of a, given that b has happened

# "I love this product" is intended to be a positive review.
# What we actually observe are the words "I", "love", "this", "product",
# and we ask whether a review containing them is positive or not.


# P(positive | "I love this product"): the probability that "I love this product" is positive.


# P("I" | positive): the fraction of positive reviews in which "I" shows up.

# Samples are reviews, i.e. the number of samples is the number of reviews.

In [1]:
# Example

# Assume that 47 of 100 samples are positive reviews and 53 are negative reviews.

# ["I", "positive"]: "I" shows up in 30% of the positive reviews,
# i.e. 30% of all positive reviews contain "I".

# Probability that "I love this product" is positive:
# the positive score divided by the sum of the two scores.


# Likelihood that "I love this product" is a positive review: 86.3%
# Likelihood that "I love this product" is a negative review: 13.7%
# So it's far more likely to be a positive review.

# Problem: what if "product" never showed up in the positive reviews?
#  Then ["product", "positive"] would be zero (in the table above).
#  Anything multiplied by zero is zero,
#  so the score for the positive class would be zero.
#  Even though ["love", "positive"] is very high at 40%, the likelihood that
#  "I love this product" is a positive review becomes zero.
#  In that case we would conclude that it is a negative review.
#  Additive smoothing was introduced to solve this problem.

In [2]:
import pandas as pd
import nltk

In [5]:
data = pd.read_csv("files/sentiment.csv")
#data.head()
data.tail()

Unnamed: 0,Text,Label
8,So much fun,Positive
9,"Great product, would recommend",Positive
10,My grandson loved it,Positive
11,My mother really enjoyed the gift,Positive
12,Great purchase!,Positive


In [6]:
# Extract the words of a document as a set of lowercase words,
# skipping tokens that contain no letters (e.g. punctuation).
def extract_words(document):
    return set(
        word.lower() for word in nltk.word_tokenize(document)
        if any(c.isalpha() for c in word)
    )

In [7]:
# Build the vocabulary: the set of all words that occur in any review
words = set()

for line in data["Text"].to_list():
    words.update(extract_words(line))

In [9]:
#words # set of words

In [12]:
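# Build the training features: one (feature dict, label) pair per review.
# Each feature dict maps every vocabulary word to True/False, which the
# classifier uses to figure out whether a review is positive or negative.
features = []
for _, row in data.iterrows():
    #print(row) # row: text and label
    #print(row["Text"], row["Label"])
    features.append(({word: (word in row["Text"]) for word in words}, row["Label"]))

    # for word in words: loop over every word in the vocabulary "words".
    # word in row["Text"]: True if the word occurs in the review text.
    # Note: row["Text"] is a raw string, so this is a case-sensitive substring
    # test; `word in extract_words(row["Text"])` would match whole lowercase
    # tokens, consistent with how the sentence is tokenized for prediction.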
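
The difference between a substring test on the raw text and a membership test on a set of tokens, which matters for the feature construction above, can be seen in a small sketch. The review text here is hypothetical, and `str.split` stands in for a real tokenizer.

```python
text = "Not worth it"  # hypothetical review text
tokens = {word.lower() for word in text.split()}

# Substring test on the raw string
print("not" in text)    # False: the test is case-sensitive ("Not" != "not")
print("or" in text)     # True: "or" occurs inside "worth"

# Membership test on the set of lowercase tokens
print("not" in tokens)  # True
print("or" in tokens)   # False
```
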

In [13]:
# Each feature dict maps every vocabulary word to a truth value:
# True if the word is in the review, False otherwise.
features[0]

({'work': False,
  'so': False,
  'bad': False,
  'much': False,
  'better': False,
  'enjoyed': False,
  'purchase': False,
  'way': False,
  'gift': False,
  'great': False,
  "n't": False,
  'experience': False,
  'what': False,
  'buy': False,
  'was': False,
  'product': False,
  'worth': True,
  'of': False,
  'kind': False,
  'did': False,
  'we': False,
  'cheap': False,
  'it': True,
  'mother': False,
  'recommend': False,
  'overpriced': False,
  'not': False,
  'my': False,
  'get': False,
  'you': False,
  'expected': False,
  'been': False,
  'with': False,
  'grandson': False,
  'loved': False,
  'could': False,
  'the': False,
  'have': False,
  'really': False,
  'for': False,
  'would': False,
  'fun': False,
  'this': False},
 ' Negative')

In [14]:
# Creating and using classifier
classifier = nltk.NaiveBayesClassifier.train(features)

In [18]:
# s = input() # sentence
s = "this was great"

# Same feature structure as above, but built for the sentence 's':
# extract the words of 's' once, then mark each vocabulary word as present or absent.
s_words = extract_words(s)
feature = {word: (word in s_words) for word in words}
result = classifier.prob_classify(feature)

# Print the probability of each label
for key in result.samples():
    print(key, result.prob(key))

 Negative 0.10747100603951773
 Positive 0.8925289939604821


In [None]:
# "this was great"
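# Negative 0.10747100603951773: the probability of being negative is about 10.7%
# Positive 0.8925289939604821: the probability of being positive is about 89.3%

# Only words that are in our vocabulary ("words") are used as features;
# words in the sentence that never appeared in the training data are ignored.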