## Naive Bayes Classifier:
It combines a "naive" independence assumption with Bayes' theorem:

- The **naive assumption** is that features are conditionally independent given the class, i.e., among instances of the same class, the features have little or no correlation with each other.

- **Bayes' theorem** is based on conditional probability: it gives the probability that an event occurs given that another event has already occurred.
<img src='./Image/14.1 Image a.png' width='20%' height='20%'/>
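
As a small worked illustration of Bayes' theorem, the sketch below computes P(spam | word) from P(word | spam). All probabilities are hypothetical values chosen only to make the arithmetic easy to follow; they do not come from any dataset.

```python
# Hypothetical numbers for a spam filter, chosen only to illustrate the formula.
p_spam = 0.2               # P(spam): prior probability that a message is spam
p_word_given_spam = 0.5    # P("free" | spam)
p_word_given_ham = 0.05    # P("free" | not spam)

# Law of total probability gives the evidence term P("free").
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | "free") = P("free" | spam) * P(spam) / P("free")
p_spam_given_word = p_word_given_spam * p_spam / p_word

print(round(p_spam_given_word, 3))
```

Even though only 20% of messages are spam under these assumed numbers, seeing the word raises the probability of spam to roughly 71%.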

**Naive Bayes** is a conditional probability model: given a problem instance to be classified, represented by a vector X = (x1, x2, ..., xn) of n features (the independent variables), it assigns a probability to each possible class.
<img src='./Image/14.1 Image b.PNG' width='40%' height='40%'/>
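
A minimal sketch of how such a model could score classes, assuming the usual form "posterior is proportional to prior times the product of per-feature likelihoods". The class names, priors, and likelihood values are hypothetical.

```python
import math

# Hypothetical priors P(C) and per-feature likelihoods P(x_i | C) for two classes.
priors = {"spam": 0.4, "ham": 0.6}
likelihoods = {
    "spam": [0.8, 0.3],  # P(x1 | spam), P(x2 | spam)
    "ham": [0.1, 0.5],   # P(x1 | ham),  P(x2 | ham)
}

# Naive assumption: P(x1, ..., xn | C) is the product of the individual P(x_i | C).
unnormalized = {c: priors[c] * math.prod(likelihoods[c]) for c in priors}

# Divide by the evidence P(X) so that the posteriors sum to 1.
evidence = sum(unnormalized.values())
posteriors = {c: score / evidence for c, score in unnormalized.items()}

# The predicted class is the one with the highest posterior probability.
predicted = max(posteriors, key=posteriors.get)
print(posteriors, predicted)
```

Because the evidence term is the same for every class, the prediction can also be made from the unnormalized scores alone.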

### Multinomial Naive Bayes:

- The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification).

- The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work.
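
The following is a small sketch of multinomial Naive Bayes on word counts using scikit-learn's `MultinomialNB`. The training texts and labels are a hypothetical toy set, far too small for a real model, and serve only to show the workflow.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set (labels: 1 = spam, 0 = not spam).
train_texts = [
    "win free money now",
    "free prize claim now",
    "lunch meeting at noon",
    "see you at the meeting",
]
train_labels = [1, 1, 0, 0]

# Turn each text into a vector of integer word counts.
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_texts)

# alpha=1.0 applies Laplace smoothing, so a word unseen in a class
# does not force that class's probability to zero.
clf = MultinomialNB(alpha=1.0)
clf.fit(X_train, train_labels)

# New messages must be transformed with the same fitted vocabulary.
X_new = vectorizer.transform(["free money", "meeting at lunch"])
predictions = clf.predict(X_new)
print(predictions)
```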

### What is a Bag-of-Words?

- A way of extracting features from text in order to train a model.

- The text is represented as a matrix in which each row is an observation (a document) and each column is a unique word.

- Each element of the matrix is either a binary value that indicates whether the word is present or an integer that indicates how many times the word appears.

- Put another way, a bag-of-words is a representation of text that describes the occurrence of words within a document.
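
The points above can be sketched by hand with the standard library. The two-document corpus is hypothetical, and splitting on whitespace is a simplification of real tokenization.

```python
from collections import Counter

docs = ["the cat sat", "the cat saw the dog"]  # hypothetical two-document corpus

# Vocabulary: every unique word in the corpus, sorted (one matrix column each).
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Count matrix: one row per document, one count per vocabulary word.
count_matrix = []
for doc in docs:
    counts = Counter(doc.split())
    count_matrix.append([counts[word] for word in vocabulary])

# Binary variant: 1 if the word is present in the document, 0 otherwise.
binary_matrix = [[int(c > 0) for c in row] for row in count_matrix]

print(vocabulary)     # ['cat', 'dog', 'sat', 'saw', 'the']
print(count_matrix)   # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 2]]
print(binary_matrix)  # [[1, 0, 1, 0, 1], [1, 1, 0, 1, 1]]
```

Note that word order is lost: only which words occur, and how often, is kept.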

### CountVectorizer:

- A scikit-learn class that produces a bag-of-words representation from a collection of text documents.

- A collection of documents is called a **corpus**.

- **Tokenization** is the process of splitting documents or a string into tokens (words).

- By default, the CountVectorizer class lowercases the text and tokenizes it with a regular expression that extracts sequences of two or more alphanumeric characters; whitespace and punctuation act only as separators.
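
To see what this tokenization does, the sketch below applies the same regular expression that scikit-learn uses as the default `token_pattern` directly with Python's `re` module. The lowercasing step mimics the vectorizer's default `lowercase=True`.

```python
import re

# Default CountVectorizer token pattern: runs of two or more word characters.
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

text = "Hey!!! I need a favor"

# Lowercase first, as CountVectorizer does by default, then extract tokens.
tokens = re.findall(TOKEN_PATTERN, text.lower())
print(tokens)  # ['hey', 'need', 'favor']
```

The single-character words "I" and "a" are dropped, and the exclamation marks never become part of a token.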

In [1]:
messages = ["Hey hey hey lets go get lunch today..!",
            "Did you go home?..",
            "Hey!!! I need a favor"]

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
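# Drop terms found in more than 50% of documents; keep terms found in at least 1 document (the default min_df).
vect = CountVectorizer(max_df=.5, min_df=1, stop_words='english')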

In [3]:
vect.fit(messages)
dtm = vect.transform(messages)
dtm.toarray()

array([[0, 0, 0, 1, 1, 0, 1],
       [1, 0, 1, 0, 0, 0, 0],
       [0, 1, 0, 0, 0, 1, 0]], dtype=int64)

In [9]:
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
for i in vect.get_feature_names_out():
    print(i, end=' ')

did favor home lets lunch need today 

In [4]:
import pandas as pd
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df = pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names_out())
df

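|   | did | favor | home | lets | lunch | need | today |
|---|-----|-------|------|------|-------|------|-------|
| 0 | 0   | 0     | 0    | 1    | 1     | 0    | 1     |
| 1 | 1   | 0     | 1    | 0    | 0     | 0    | 0     |
| 2 | 0   | 1     | 0    | 0    | 0     | 1    | 0     |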


### Stop words:
- Text may contain stop words such as ‘the’, ‘is’, and ‘are’, which carry little or no meaning for the overall sentence.
- Stop words can be filtered out of the text before it is processed.
- "Stop words" usually refers to the most common words in a language.
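
A minimal sketch of stop-word filtering, assuming scikit-learn's built-in English list (`ENGLISH_STOP_WORDS`, the list behind `stop_words='english'`). The token list is taken from the first example message, already lowercased and tokenized.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Tokens from "lets get lunch today", already lowercased.
tokens = ["lets", "get", "lunch", "today"]

# Keep only the tokens that are not in the stop-word list.
filtered = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
print(filtered)  # ['lets', 'lunch', 'today']
```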

In [10]:
list(vect.get_stop_words())

['would',
 'anything',
 'beside',
 'put',
 'all',
 'others',
 'front',
 'an',
 'ie',
 'sixty',
 'himself',
 'by',
 'the',
 'a',
 'same',
 'thru',
 'co',
 'perhaps',
 'via',
 'everyone',
 'becoming',
 'which',
 'while',
 'formerly',
 'against',
 'interest',
 'he',
 'ltd',
 'during',
 'whenever',
 'five',
 'is',
 'another',
 'nowhere',
 'between',
 'your',
 'rather',
 'nobody',
 'themselves',
 'anyhow',
 'through',
 'something',
 'how',
 'have',
 'above',
 'of',
 'moreover',
 'should',
 'thereupon',
 'has',
 'detail',
 'amoungst',
 'part',
 'therefore',
 'everything',
 'out',
 'among',
 'now',
 'eleven',
 'hereafter',
 'whose',
 'on',
 'after',
 'other',
 'since',
 'although',
 'my',
 'across',
 'with',
 'nor',
 'whereupon',
 'alone',
 'please',
 'done',
 'hasnt',
 'nine',
 'so',
 'off',
 'next',
 'such',
 'name',
 'none',
 'three',
 'ours',
 'they',
 'anyway',
 'yourself',
 'down',
 'mostly',
 'around',
 'who',
 'herein',
 'forty',
 'that',
 'well',
 'otherwise',
 'fifty',
 'mine',
 'wh

## TF-IDF (Term Frequency-Inverse Document Frequency):

- It's a way to score the importance of words (or "terms") in a document based on how frequently they appear in that document and across multiple documents.

- Term frequency-inverse document frequency is a statistic that reflects how important a word is to a specific document relative to all of the words in a collection of documents (the corpus).

<img src = './Image/14.1 Image c.png' width='45%' height='45%' />
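
The idea can be sketched by hand, assuming the common textbook definition tf(t, d) × log(N / df(t)). The corpus is hypothetical, and the exact formula is an assumption: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these.

```python
import math

# Hypothetical corpus, already tokenized.
corpus = [
    ["cat", "sat", "mat"],
    ["cat", "ate", "fish"],
    ["dog", "ate", "bone"],
]

def tf(term, doc):
    # Term frequency: share of the document's tokens that equal `term`.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse document frequency: log(N / number of documents containing `term`).
    # Assumes `term` occurs in at least one document (otherwise df would be 0).
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

score_cat = tfidf("cat", corpus[0], corpus)  # "cat" is in 2 of 3 documents
score_mat = tfidf("mat", corpus[0], corpus)  # "mat" is in only 1 document
print(score_cat, score_mat)
```

Both words occur once in the first document, but "mat" scores higher because it is rarer across the corpus.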

### Extracting features from the training data using TfidfVectorizer:
- **max_df** removes terms that appear too frequently, also known as "corpus-specific stop words".
    - E.g., max_df=0.50 means "ignore terms that appear in more than 50% of the documents".
    - max_df=25 means "ignore terms that appear in more than 25 documents".
    - The default, max_df=1.0, means "ignore terms that appear in more than 100% of the documents", so by default no terms are ignored.


- **min_df** removes terms that appear too infrequently.
    - E.g., min_df=0.01 means "ignore terms that appear in less than 1% of the documents".
    - min_df=5 means "ignore terms that appear in fewer than 5 documents".
    - The default, min_df=1, means "ignore terms that appear in fewer than 1 document", so by default no terms are ignored.
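
The effect of both thresholds can be checked on a small example. The sketch below uses `CountVectorizer`, which accepts the same `max_df` and `min_df` parameters as `TfidfVectorizer`; the four-document corpus is hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus: "apple" is in all 4 documents, "kiwi" in only 1.
docs = [
    "apple banana",
    "apple banana cherry",
    "apple cherry",
    "apple kiwi",
]

# Defaults (max_df=1.0, min_df=1) keep every term.
keep_all = CountVectorizer().fit(docs)

# max_df=0.75 drops terms found in more than 75% of documents ("apple").
# min_df=2 drops terms found in fewer than 2 documents ("kiwi").
pruned = CountVectorizer(max_df=0.75, min_df=2).fit(docs)

print(sorted(keep_all.vocabulary_))  # ['apple', 'banana', 'cherry', 'kiwi']
print(sorted(pruned.vocabulary_))    # ['banana', 'cherry']
```

A float is read as a proportion of documents and an integer as an absolute document count, which is why `0.75` and `2` can be mixed here.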

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
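# Drop terms found in more than 50% of documents; keep terms found in at least 1 document (the default min_df).
tfidf = TfidfVectorizer(max_df=.5, min_df=1, stop_words='english')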

In [None]:
tfidf.fit(messages)
dtm = tfidf.transform(messages)
dtm.toarray()

In [None]:
# get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.
df = pd.DataFrame(dtm.toarray(), columns=tfidf.get_feature_names_out())
df