<h4>Unit 3</h4> <h1 style="text-align:center"> Chapter 5</h1>
 
 ---

I hope the working of Naive Bayes classifiers is clear by now.

Computing the frequency table and the likelihood table is called **training** the classifier.


The idea we used for a single word extends to multiple words (tokens) in a document: multiply the likelihood P(token | class) of every token, together with the class prior P(class), and pick the class with the largest product.
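
As a rough illustration of this multiplication step, the sketch below scores a two-token document against two classes. The likelihood values, the priors, and the names `likelihood`, `prior` and `score` are made up for this example and are not taken from the earlier notebooks.

```python
from math import prod

# Hypothetical likelihood table P(token | class), assumed to come from training.
likelihood = {
    "1": {"awesome": 0.30, "movie": 0.20, "boring": 0.05},
    "0": {"awesome": 0.05, "movie": 0.20, "boring": 0.30},
}
# Hypothetical class priors P(class).
prior = {"1": 0.5, "0": 0.5}

def score(tokens, label):
    """Return P(class) multiplied by P(token | class) for every token."""
    return prior[label] * prod(likelihood[label][t] for t in tokens)

doc = ["awesome", "movie"]
scores = {label: score(doc, label) for label in prior}
predicted = max(scores, key=scores.get)  # class with the largest product
print(scores, predicted)
```

With many tokens the product becomes very small, so implementations usually add log-probabilities instead of multiplying; the ranking of the classes stays the same.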

### There's a problem

Suppose a certain word (token) never occurs in the training documents of a specific class.

For example, 

suppose the token "awesome" never appears in a sentence labelled with class "0".

Then P("awesome" | "0") is zero, and because the likelihoods are multiplied, the score of class "0" becomes zero for every document that contains "awesome", no matter what the other tokens say.


## The solution is to use add-one smoothing.

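We use the same smoothing technique that we read about in the previous notebooks. One simple option is add-one (Laplace) smoothing: add 1 to every token count when calculating the likelihoods, and add the vocabulary size to the denominator so that the probabilities for a class still sum to 1.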
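
A minimal sketch of the effect, using made-up training counts: the token "awesome" never occurs in class "0", so its unsmoothed likelihood is zero, while add-one smoothing gives it a small non-zero value. The names `counts`, `vocab` and `likelihood` are illustrative only.

```python
from collections import Counter

# Toy training counts per class (made-up data for illustration only).
counts = {
    "1": Counter({"awesome": 3, "movie": 2}),
    "0": Counter({"boring": 4, "movie": 1}),
}
vocab = set().union(*counts.values())  # all distinct tokens seen in training

def likelihood(token, label, alpha=0):
    """P(token | class) with add-alpha smoothing; alpha=0 means no smoothing."""
    total = sum(counts[label].values())
    return (counts[label][token] + alpha) / (total + alpha * len(vocab))

unsmoothed = likelihood("awesome", "0")         # 0 / 5
smoothed = likelihood("awesome", "0", alpha=1)  # (0 + 1) / (5 + 3)
print(unsmoothed, smoothed)
```

Here the vocabulary has three tokens and class "0" has five token occurrences, so the smoothed value is 1/8.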

---

### Another problem

What should we do with words that occur in the test set but not in the training set? These words are called **unknown words**.

``` The solution for such unknown words is to ignore them: remove them from the test document and do not include any probability for them at all. ```
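
One way this filtering could look in code, assuming the training vocabulary is available as a set. `train_vocab`, `drop_unknown` and the example tokens are hypothetical names and data chosen for this sketch.

```python
# Hypothetical vocabulary collected from the training set.
train_vocab = {"awesome", "movie", "boring", "plot"}

def drop_unknown(tokens, vocab):
    """Keep only the tokens that were seen during training."""
    return [t for t in tokens if t in vocab]

test_doc = ["awesome", "cinematography", "boring", "plot"]
known = drop_unknown(test_doc, train_vocab)  # "cinematography" is unknown
print(known)
```

Only the remaining tokens then contribute a likelihood to the class score.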

---

## Stopwords

Sometimes it helps to remove words that occur very frequently in the corpus. Such words appear in almost every document regardless of class, so they tell the classifier little, yet they dominate the counts and add to the size of the vocabulary.


---

### Removing stopwords

**Method 1 -**

> Sort the vocabulary by frequency and remove the most frequent tokens.

Consider this tiny document.

In [1]:
dataset = "Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event and 1 indicates certainty. "

In [10]:
dataset

'Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, 0 indicates impossibility of the event and 1 indicates certainty. '

> Build vocabulary

In [11]:
tokens = dataset.split(" ")

In [12]:
len(tokens) # Number of tokens (includes one empty string, because the text ends with a space)

53

In [7]:
vocab = set(tokens)

In [13]:
len(vocab) #Distinct words 

36
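
Splitting on a single space leaves punctuation attached to tokens such as 'true.' and yields one empty string because the text ends with a space. A possible alternative, sketched here on a shortened made-up sentence, is to lowercase the text and extract runs of word characters with `re.findall`; whether lowercasing is appropriate depends on the task.

```python
import re

text = "The probability of an event is a number between 0 and 1, where 0 indicates impossibility. "
# Lowercase first, then keep runs of letters, digits and underscores as tokens.
tokens = re.findall(r"\w+", text.lower())
vocab = set(tokens)
print(len(tokens), len(vocab))
```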

> Find frequency

In [41]:
from collections import Counter
import re

In [20]:
freq = Counter(dataset.split(" "))

In [32]:
sorted_Freq = sorted(freq.items(), key=lambda item: item[1], reverse=True)  # (token, count) pairs, most frequent first

The words that occur most often are 'is' (5 times), 'of' (4 times), and 'event' (3 times); every other token occurs at most twice.

Hence we treat these three as the stopwords for this specific corpus and remove them.
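
Instead of reading the top entries off by eye, the cut-off can also be applied in code. The sketch below uses `Counter.most_common`; the cut-off `top_k = 3` and the name `candidate_stopwords` are assumptions made for this example.

```python
from collections import Counter

# Same text as the `dataset` cell above.
text = (
    "Probability is the branch of mathematics concerning numerical "
    "descriptions of how likely an event is to occur or how likely it is "
    "that a proposition is true. The probability of an event is a number "
    "between 0 and 1, where, roughly speaking, 0 indicates impossibility "
    "of the event and 1 indicates certainty. "
)
freq = Counter(text.split())  # split() with no argument drops empty tokens
top_k = 3                     # assumed cut-off; tune it per corpus
candidate_stopwords = [token for token, _ in freq.most_common(top_k)]
print(candidate_stopwords)
```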

In [33]:
stopwords = ['is','of','event']

In [48]:
clean_data = re.sub(r'\b(?:is|of|event)\b', "", dataset)  # \b matches whole words only, so the "is" inside e.g. "this" is left alone

In [49]:
clean_data

'Probability  the branch  mathematics concerning numerical descriptions  how likely an   to occur or how likely it  that a proposition  true. The probability  an   a number between 0 and 1, where, roughly speaking, 0 indicates impossibility  the  and 1 indicates certainty. '
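
The removal leaves runs of spaces behind, and a hard-coded pattern has to be edited whenever the stopword list changes. A possible variant, sketched on a made-up sentence, builds the pattern from a list and collapses the leftover whitespace. `stopword_list`, `pattern` and `clean` are names chosen for this example.

```python
import re

stopword_list = ["is", "of", "event"]  # the three tokens identified above
text = "The probability of an event is a number; this is an eventful proposition."

# \b anchors each alternative to whole words, so "this" and "eventful" survive.
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, stopword_list)) + r")\b")
without_stopwords = pattern.sub("", text)
# Collapse the runs of spaces left behind by the removals.
clean = re.sub(r"\s+", " ", without_stopwords).strip()
print(clean)
```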

---

**Method 2 -**

> Use a predefined stopwords list available.

For this we will be using NLTK's stopwords list. (The list has to be downloaded once with `nltk.download('stopwords')`.)

In [51]:
from nltk.corpus import stopwords

In [61]:
stopwords.words('english') # List of stopwords of English language

['i',
 'me',
 'my',
 'myself',
 'we',
 'our',
 'ours',
 'ourselves',
 'you',
 "you're",
 "you've",
 "you'll",
 "you'd",
 'your',
 'yours',
 'yourself',
 'yourselves',
 'he',
 'him',
 'his',
 'himself',
 'she',
 "she's",
 'her',
 'hers',
 'herself',
 'it',
 "it's",
 'its',
 'itself',
 'they',
 'them',
 'their',
 'theirs',
 'themselves',
 'what',
 'which',
 'who',
 'whom',
 'this',
 'that',
 "that'll",
 'these',
 'those',
 'am',
 'is',
 'are',
 'was',
 'were',
 'be',
 'been',
 'being',
 'have',
 'has',
 'had',
 'having',
 'do',
 'does',
 'did',
 'doing',
 'a',
 'an',
 'the',
 'and',
 'but',
 'if',
 'or',
 'because',
 'as',
 'until',
 'while',
 'of',
 'at',
 'by',
 'for',
 'with',
 'about',
 'against',
 'between',
 'into',
 'through',
 'during',
 'before',
 'after',
 'above',
 'below',
 'to',
 'from',
 'up',
 'down',
 'in',
 'out',
 'on',
 'off',
 'over',
 'under',
 'again',
 'further',
 'then',
 'once',
 'here',
 'there',
 'when',
 'where',
 'why',
 'how',
 'all',
 'any',
 'both',
 'each

In [78]:
tokens = dataset.split(" ")
stopwords = stopwords.words('english')  # rebinds the name to a plain list; re-run the import above before running this cell again

In [79]:
tokens

['Probability',
 'is',
 'the',
 'branch',
 'of',
 'mathematics',
 'concerning',
 'numerical',
 'descriptions',
 'of',
 'how',
 'likely',
 'an',
 'event',
 'is',
 'to',
 'occur',
 'or',
 'how',
 'likely',
 'it',
 'is',
 'that',
 'a',
 'proposition',
 'is',
 'true.',
 'The',
 'probability',
 'of',
 'an',
 'event',
 'is',
 'a',
 'number',
 'between',
 '0',
 'and',
 '1,',
 'where,',
 'roughly',
 'speaking,',
 '0',
 'indicates',
 'impossibility',
 'of',
 'the',
 'event',
 'and',
 '1',
 'indicates',
 'certainty.',
 '']

In [82]:
filtered_data = [word for word in tokens if word not in stopwords]

In [85]:
" ".join(filtered_data)

'Probability branch mathematics concerning numerical descriptions likely event occur likely proposition true. The probability event number 0 1, where, roughly speaking, 0 indicates impossibility event 1 indicates certainty. '
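
The result still contains 'The' and 'where,' because the comparison is exact: the list is lowercase and the tokens carry their punctuation. The sketch below normalizes each token before the lookup. To stay runnable without the NLTK data it uses a small stand-in set, `stop_set`, which is an assumption for this example; in the notebook the full English list would take its place.

```python
import string

# Small stand-in stop list so the sketch runs without NLTK data;
# in the notebook this would be the list from stopwords.words('english').
stop_set = {"is", "the", "of", "an", "a", "and", "where", "between"}

def remove_stopwords(text, stop_set):
    """Drop tokens whose lowercased, punctuation-stripped form is a stopword."""
    kept = []
    for token in text.split():
        normalized = token.strip(string.punctuation).lower()
        if normalized and normalized not in stop_set:
            kept.append(token)  # keep the original spelling of surviving tokens
    return " ".join(kept)

sample = ("The probability of an event is a number between 0 and 1, where, "
          "roughly speaking, 0 indicates impossibility.")
result = remove_stopwords(sample, stop_set)
print(result)
```

Using a set rather than a list also makes each membership test a constant-time lookup, which matters once the corpus is larger than this toy example.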