## Multinomial Event Model

In [1]:
x = ["This was awesome an awesome movie",
     "Great movie! I liked it a lot",
     "Happy Ending! awesome acting by the hero",
     "loved it! truly great",
     "bad not upto the mark",
     "could have been better",
     "Surely a Disappointing movie"]

y = [1,1,1,1,0,0,0] # 1 - Positive, 0 - Negative Class

In [89]:
x_test = ["I was happy & happy and I loved the acting in the movie",
          "The movie I saw was bad"]

## 1. Cleaning

In [90]:
import clean_text as ct #clean_text

In [91]:
x_clean = [ct.getCleanedReview(i) for i in x]  # list comprehension over the training reviews
xt_clean = [ct.getCleanedReview(i) for i in x_test]  # same cleaning applied to the test reviews

In [92]:
print(x_clean)

['awesom awesom movi', 'great movi like lot', 'happi end awesom act hero', 'love truli great', 'bad upto mark', 'could better', 'sure disappoint movi']


## 2. Vectorisation

In [93]:
from sklearn.feature_extraction.text import CountVectorizer

In [94]:
cv = CountVectorizer()

# Learn the vocabulary from the training reviews and build the count matrix
x_vec = cv.fit_transform(x_clean).toarray()
print(x_vec)
print(x_vec.shape)

[[0 2 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 1 1 0 0 1 0 0 0]
 [1 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0]]
(7, 18)


In [95]:
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() there
print(cv.get_feature_names())

['act', 'awesom', 'bad', 'better', 'could', 'disappoint', 'end', 'great', 'happi', 'hero', 'like', 'lot', 'love', 'mark', 'movi', 'sure', 'truli', 'upto']


In [96]:
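## Vectorisation on the test set

# Avoid calling fit_transform on the test data: it would learn a new vocabulary
# from the test sentences, so the columns would no longer line up with the
# features the classifier is trained on.
#xt_vec=cv.fit_transform(xt_clean).toarray()
#print(xt_vec)
#print(cv.get_feature_names())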

In [97]:

xt_vec=cv.transform(xt_clean).toarray()
print(xt_vec)
print(cv.get_feature_names())
print(xt_vec.shape)

[[1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 0 0 0]
 [0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]]
['act', 'awesom', 'bad', 'better', 'could', 'disappoint', 'end', 'great', 'happi', 'hero', 'like', 'lot', 'love', 'mark', 'movi', 'sure', 'truli', 'upto']
(2, 18)
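
The reason for calling transform (and not fit_transform) on the test set can be illustrated with a small dependency-free sketch. The helpers `fit_vocabulary` and `transform` below are hypothetical stand-ins that only mimic the counting behaviour of CountVectorizer, and the cleaned test sentences are assumed values chosen to be consistent with the output above (the notebook does not print `xt_clean`).

```python
from collections import Counter

def fit_vocabulary(docs):
    """Learn a sorted word -> column index mapping from the training documents."""
    words = sorted({w for d in docs for w in d.split()})
    return {w: i for i, w in enumerate(words)}

def transform(docs, vocab):
    """Count words using a fixed vocabulary; words not seen in training are ignored."""
    rows = []
    for d in docs:
        counts = Counter(w for w in d.split() if w in vocab)
        rows.append([counts.get(w, 0) for w in vocab])
    return rows

x_clean = ['awesom awesom movi', 'great movi like lot', 'happi end awesom act hero',
           'love truli great', 'bad upto mark', 'could better', 'sure disappoint movi']
# Assumed cleaned test sentences (not shown in the notebook output)
xt_clean = ['happi happi love act movi', 'movi saw bad']

vocab = fit_vocabulary(x_clean)       # vocabulary comes from the training data only
xt_rows = transform(xt_clean, vocab)  # test data is mapped onto the same 18 columns
print(len(vocab))
print(xt_rows[1])
```

The word 'saw' never occurs in the training reviews, so it has no column and is dropped. The test matrix keeps the same 18 columns as the training matrix.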


## 3.Multinomial Naive bayes

In [98]:
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB

In [99]:
mnb=MultinomialNB()
print(mnb)

# alpha -> Laplace smoothing factor

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)


In [100]:
#Training

mnb.fit(x_vec,y)


MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [101]:
#Predictions

mnb.predict(xt_vec)

array([1, 0])

#### A phrase such as "not bad" would still be classified as a negative review by our classifier because:
  * 1. The bag-of-words model ignores the order of words.
  * 2. Stopword removal may drop 'not' (it is often on the stopword list).

##### Adding n-grams (e.g. bigrams) as features helps to avoid this!
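
As a rough illustration of the n-gram idea, CountVectorizer accepts an `ngram_range` parameter. The two toy sentences below are made up for this sketch and are not part of the dataset above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not bad", "bad movie"]  # toy sentences, for illustration only

unigram_cv = CountVectorizer()                   # default: single words only
bigram_cv = CountVectorizer(ngram_range=(1, 2))  # single words plus adjacent word pairs

unigram_cv.fit(docs)
bigram_cv.fit(docs)

unigram_vocab = sorted(unigram_cv.vocabulary_)
bigram_vocab = sorted(bigram_cv.vocabulary_)
print(unigram_vocab)
print(bigram_vocab)
```

With bigrams enabled, "not bad" becomes a feature of its own, so the classifier can learn a weight for it that differs from the weight of "bad" alone. This only helps if 'not' survives the cleaning step.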

In [102]:
# Calculating the posterior probability using the model

In [103]:
mnb.predict_proba(xt_vec)

array([[0.09332629, 0.90667371],
       [0.61699717, 0.38300283]])
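
To see where these posteriors come from, the multinomial model can be reproduced by hand with NumPy. This is a minimal sketch, assuming the standard formulation with Laplace smoothing (alpha = 1) and class priors estimated from the training labels; the count matrices are copied from the outputs above.

```python
import numpy as np

# Count matrices copied from the CountVectorizer outputs above
x_vec = np.array([
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
])
y = np.array([1, 1, 1, 1, 0, 0, 0])
xt_vec = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
])

alpha = 1.0
classes = np.array([0, 1])

# log P(class): fraction of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class) with Laplace smoothing:
# (count of word in class + alpha) / (total words in class + alpha * vocabulary size)
word_counts = np.array([x_vec[y == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Joint log-probability of every test document under every class
joint = xt_vec @ log_likelihood.T + log_prior

# Normalise to posteriors (subtracting the row maximum for numerical stability)
joint -= joint.max(axis=1, keepdims=True)
posterior = np.exp(joint) / np.exp(joint).sum(axis=1, keepdims=True)
print(posterior.round(4))
```

Each word contributes its smoothed log-probability once per occurrence, which is why 'happi' appearing twice in the first test sentence counts twice.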

In [111]:
mnb.score(x_vec,y)

1.0

#### Interpretation of Probability
* sent 1 -> belongs to class 0 with probability 0.09332629 and to class 1 with probability 0.90667371
* sent 2 -> belongs to class 1 with probability 0.38300283 and to class 0 with probability 0.61699717


#### Therefore, sent 1 is predicted to belong to class 1
#### Therefore, sent 2 is predicted to belong to class 0

* -> The predicted class is the one with the highest probability (argmax())
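
The argmax step can be written out explicitly. This sketch uses the probabilities printed by predict_proba above and assumes the columns follow the order of the model's `classes_` attribute (class 0, then class 1).

```python
import numpy as np

# Posterior probabilities copied from predict_proba above (columns: class 0, class 1)
proba = np.array([[0.09332629, 0.90667371],
                  [0.61699717, 0.38300283]])
classes = np.array([0, 1])  # assumed to match the order of mnb.classes_

# For each row, pick the class whose column holds the largest probability
predicted = classes[np.argmax(proba, axis=1)]
print(predicted)
```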

## 4. Multivariate Bernoulli Event Model Naive Bayes

* Like MultinomialNB, this classifier is suitable for discrete data.
* The difference is that MultinomialNB works with occurrence counts, whereas BernoulliNB is designed for binary/boolean features.

### CountVectorizer gives vectors with integer counts (0, 1, 2, 6, etc.)
* Each entry is the frequency of a word in the document.
* These are not always boolean values.
* However, the scikit-learn model handles this automatically through its binarize parameter.

#### binarize: float or None, default=0.0
* Threshold for binarizing (mapping to booleans) the sample features.
* If None, the input is presumed to already consist of binary vectors.
* With the default threshold of 0.0, any value strictly greater than 0 (i.e. any count of 1 or more) becomes 1.
* Everything else becomes 0.

In [104]:
## feature_vec = [1 0 0 2 5] ==> [1 0 0 1 1] when threshold = 0
## feature_vec = [1 0 0 2 5] ==> [0 0 0 1 1] when threshold = 1.5
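
The two thresholding examples can be checked with a few lines of NumPy. The `binarize` helper here is a hypothetical illustration of the rule (values strictly greater than the threshold map to 1), not the library's own function.

```python
import numpy as np

def binarize(features, threshold):
    """Map values strictly greater than the threshold to 1, everything else to 0."""
    return (np.asarray(features) > threshold).astype(int)

feature_vec = [1, 0, 0, 2, 5]
at_zero = binarize(feature_vec, 0.0)          # every non-zero count becomes 1
at_one_and_half = binarize(feature_vec, 1.5)  # only counts of 2 or more become 1
print(at_zero)
print(at_one_and_half)
```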


In [105]:
bnb= BernoulliNB(binarize=0.0)
print(bnb)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)


In [106]:
bnb.fit(x_vec,y)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [107]:
bnb.predict_proba(xt_vec)

array([[0.07647628, 0.92352372],
       [0.68830318, 0.31169682]])
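
The Bernoulli posteriors can also be reproduced by hand. This is a minimal sketch, assuming the standard multivariate Bernoulli formulation with alpha = 1, in which a word's absence contributes to the likelihood as well; the matrices are copied from the outputs above.

```python
import numpy as np

# Count matrices copied from the CountVectorizer outputs above
x_vec = np.array([
    [0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0],
])
y = np.array([1, 1, 1, 1, 0, 0, 0])
xt_vec = np.array([
    [1, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0],
])

alpha = 1.0
classes = np.array([0, 1])

# Binarize with threshold 0.0: any count above 0 becomes 1
x_bin = (x_vec > 0).astype(float)
xt_bin = (xt_vec > 0).astype(float)

log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# P(word present | class) =
#   (documents in class containing the word + alpha) / (documents in class + 2 * alpha)
doc_counts = np.array([x_bin[y == c].sum(axis=0) for c in classes])
class_sizes = np.array([(y == c).sum() for c in classes])
p_present = (doc_counts + alpha) / (class_sizes[:, None] + 2 * alpha)

# Present words contribute log(p); absent words contribute log(1 - p)
joint = (xt_bin @ np.log(p_present).T
         + (1 - xt_bin) @ np.log(1 - p_present).T
         + log_prior)

# Normalise to posteriors (subtracting the row maximum for numerical stability)
joint -= joint.max(axis=1, keepdims=True)
bnb_posterior = np.exp(joint) / np.exp(joint).sum(axis=1, keepdims=True)
print(bnb_posterior.round(4))
```

Because the features are binarized, the repeated 'happi' in the first test sentence counts only once here, and every vocabulary word that is missing from a sentence still affects its score.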

In [108]:
bnb.predict(xt_vec)

array([1, 0])

In [110]:
bnb.score(x_vec,y)

1.0

In [112]:
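# Accuracy = 100%, but this is the score on the training data of a very small dataset,
# so it mostly indicates overfitting rather than how well the model generalises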
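
One way to get a less optimistic estimate on such a small dataset is leave-one-out cross-validation. The sketch below is an addition to the notebook's workflow, not part of it: it wraps the vectorizer and the classifier in a pipeline and scores each review with a model trained on the other six.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

x_clean = ['awesom awesom movi', 'great movi like lot', 'happi end awesom act hero',
           'love truli great', 'bad upto mark', 'could better', 'sure disappoint movi']
y = [1, 1, 1, 1, 0, 0, 0]

# The vectorizer sits inside the pipeline so its vocabulary is re-learned on the
# training part of every split and never sees the held-out review
model = make_pipeline(CountVectorizer(), MultinomialNB())

# Leave-one-out: train on 6 reviews, test on the remaining one, repeated 7 times
scores = cross_val_score(model, x_clean, y, cv=LeaveOneOut())
print(scores)
print(scores.mean())
```

Each entry of `scores` is 1.0 or 0.0 depending on whether the held-out review was classified correctly, and their mean is the cross-validated accuracy. With only seven reviews the estimate is still very noisy.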