### Create an NLP Pipeline to 'clean' Reviews Data:
- Load Input File and Read Reviews
- Tokenize
- Remove Stopwords
- Perform Stemming
- Write Cleaned Data to Output File

In [219]:
sample_text = """ I <br /><br />loved this movie since I was 7 and I saw it on the opening day. It was so touching and beautiful. I strongly recommend seeing for all. It's a movie to watch with your family by far.<br /><br />My MPAA rating: PG-13 for thematic elements, prolonged scenes of disastor, nudity/sexuality and some language."""

### NLTK
Step 1 : Clean the data

In [220]:
from nltk.tokenize import RegexpTokenizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords

In [221]:
## Initialize objects
tokenizer = RegexpTokenizer(r'\w+')  # r'\w+' matches runs of word characters, so punctuation is dropped
en_stopwords = stopwords.words('english')  # English stopword list
ps = PorterStemmer()
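
To see what the `\w+` pattern keeps and what it drops, here is a small sketch that uses the standard-library `re` module as a stand-in for the tokenizer. This is an assumption made for illustration; the notebook itself uses NLTK's `RegexpTokenizer`, and the sample string is pieced together from fragments of the review above.

```python
import re

# Stand-in for RegexpTokenizer(r'\w+'): keep every run of letters, digits, or underscores.
def tokenize(text):
    return re.findall(r"\w+", text)

tokens = tokenize("it's a movie. pg-13, nudity/sexuality")
print(tokens)
# ['it', 's', 'a', 'movie', 'pg', '13', 'nudity', 'sexuality']
```

Note how the apostrophe, hyphen, and slash all act as split points, which is why "PG-13" ends up as two separate tokens in the cleaned review.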

In [222]:
def getStemmedReview(reviews):
    reviews = reviews.lower()  # convert everything to lower case first
    reviews = reviews.replace("<br /><br />", " ")  # drop the HTML line breaks

    # Tokenize
    tokens = tokenizer.tokenize(reviews)
    # Filter out stopwords
    filtered_tokens = [t for t in tokens if t not in en_stopwords]
    # Stem each remaining token
    stemmed_tokens = [ps.stem(t) for t in filtered_tokens]
    cleaned_review = ' '.join(stemmed_tokens)

    return cleaned_review
    

In [223]:
getStemmedReview(sample_text)

'love movi sinc 7 saw open day touch beauti strongli recommend see movi watch famili far mpaa rate pg 13 themat element prolong scene disastor nuditi sexual languag'

In [224]:
# Write a function that accepts an input file and writes a clean output file
# of movie reviews.

In [225]:
import sys

In [226]:
def getStemmedDocument(inputFile, outputFile):
    with open(inputFile, encoding="utf8") as f:
        reviews = f.readlines()
    with open(outputFile, 'w', encoding="utf8") as out:
        for r in reviews:
            # Clean one review (one line of the input file) at a time
            cleaned_review = getStemmedReview(r)
            print(cleaned_review, file=out)


# Read commandline arguments:
inputFile  = sys.argv[1]
outputFile = sys.argv[2]
getStemmedDocument(inputFile, outputFile)

FileNotFoundError: [Errno 2] No such file or directory: '-f'
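
The `FileNotFoundError` above is what typically happens when this cell runs inside a notebook: `sys.argv` then holds the kernel's own launch arguments (here `-f` followed by a connection-file path) rather than file names. A minimal sketch of one way to guard against that, assuming the cleaning code is saved as a standalone script. `parse_paths` and the script name `clean_reviews.py` are hypothetical and not part of the original notebook.

```python
import sys

def parse_paths(argv):
    """Return (input_path, output_path), or None if argv does not look like script usage."""
    # Expect exactly: script name, input file, output file.
    if len(argv) != 3 or argv[1].startswith("-"):
        return None
    return argv[1], argv[2]

paths = parse_paths(sys.argv)
if paths is None:
    # Hypothetical script name, used only in this usage message.
    print("usage: python clean_reviews.py <inputFile> <outputFile>")
else:
    inputFile, outputFile = paths
    print("would clean", inputFile, "->", outputFile)
```

Inside a notebook it is usually simpler to call `getStemmedDocument` directly with explicit file paths.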

### Text Classification - Naive Bayes

#### Multinomial Event Model

In [227]:
x = [" I <br /><br />loved this movie since I was 7 and I saw it on the opening day. It was so touching and beautiful. I strongly recommend seeing for all. It's a movie to watch with your family by far.<br /><br />My MPAA rating: PG-13 for thematic elements, prolonged scenes of disastor, nudity/sexuality and some language.",
     "A wonderful little production. <br /><br />The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. <br /><br />The actors are extremely well chosen- Michael Sheen not only ""has got all the polari"" but he has all the voices down pat too! You can truly see the seamless editing guided by the references to Williams' diary entries, not only is it well worth the watching but it is a terrificly written and performed piece. A masterful production about one of the great master's of comedy and his life. <br /><br />The realism really comes home with the little things: the fantasy of the guard which, rather than use the traditional 'dream' techniques remains solid then disappears. It plays on our knowledge and our senses, particularly with the scenes concerning Orton and Halliwell and the sets (particularly of their flat with Halliwell's murals decorating every surface) are terribly well done.",
     "I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air conditioned theater and watching a light-hearted comedy. The plot is simplistic, but the dialogue is witty and the characters are likable (even the well bread suspected serial killer). While some may be disappointed when they realize this is not Match Point 2: Risk Addiction, I thought it was proof that Woody Allen is still fully in control of the style many of us have grown to love.<br /><br />This was the most I'd laughed at one of Woody's comedies in years (dare I say a decade?). While I've never been impressed with Scarlet Johanson, in this she managed to tone down her ""sexy"" image and jumped right into a average, but spirited young woman.<br /><br />This may not be the crown jewel of his career, but it was wittier than ""Devil Wears Prada"" and more interesting than ""Superman"" a great comedy to go see with friend",
     "Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his parents are fighting all the time.<br /><br />This movie is slower than a soap opera... and suddenly, Jake decides to become Rambo and kill the zombie.<br /><br />OK, first of all when you're going to make a film you must Decide if its a thriller or a drama! As a drama the movie is watchable. Parents are divorcing & arguing like in real life. And then we have Jake with his closet which totally ruins all the film! I expected to see a BOOGEYMAN similar movie, and instead i watched a drama with some meaningless thriller spots.<br /><br />3 out of 10 just for the well playing parents & descent dialogs. As for the shots with Jake: just ignore them",
     "Encouraged by the positive comments about this film on here I was looking forward to watching this film. Bad mistake. I've seen 950+ films and this is truly one of the worst of them - it's awful in almost every way: editing, pacing, storyline, 'acting,' soundtrack (the film's only song - a lame country tune - is played no less than four times). The film looks cheap and nasty and is boring in the extreme. Rarely have I been so happy to see the end credits of a film. <br /><br />The only thing that prevents me giving this a 1-score is Harvey Keitel - while this is far from his best performance he at least seems to be making a bit of an effort. One for Keitel obsessives only"
    ]

y = [1, 1, 1, 0, 0 ] # 1 - positive , 0 - negative class 


In [262]:
test_x = ["I loved the acting in the movie",
          "The movie I saw was not bad"]

### 1. Cleaning of the data 


In [230]:
x_clean = [getStemmedReview(i) for i in x]

In [231]:
type(x_clean)

list

In [232]:
x_clean

['love movi sinc 7 saw open day touch beauti strongli recommend see movi watch famili far mpaa rate pg 13 themat element prolong scene disastor nuditi sexual languag',
 'wonder littl product film techniqu unassum old time bbc fashion give comfort sometim discomfort sens realism entir piec actor extrem well chosen michael sheen got polari voic pat truli see seamless edit guid refer william diari entri well worth watch terrificli written perform piec master product one great master comedi life realism realli come home littl thing fantasi guard rather use tradit dream techniqu remain solid disappear play knowledg sens particularli scene concern orton halliwel set particularli flat halliwel mural decor everi surfac terribl well done',
 'thought wonder way spend time hot summer weekend sit air condit theater watch light heart comedi plot simplist dialogu witti charact likabl even well bread suspect serial killer may disappoint realiz match point 2 risk addict thought proof woodi allen still

In [233]:
len(x_clean)

5

In [263]:
test_x

['I loved the acting in the movie', 'The movie I saw was not bad']

In [264]:
x_test_clean = [getStemmedReview(i) for i in test_x]

In [265]:
x_test_clean

['love act movi', 'movi saw bad']

### Step 2 : Use scikit-learn
- Multinomial Naive Bayes
- Vectorization

In [242]:
from sklearn.feature_extraction.text import CountVectorizer

In [243]:
cv = CountVectorizer()
x_vector = cv.fit_transform(x_clean)

In [244]:
print(x_vector)

  (0, 116)	1
  (0, 126)	2
  (0, 189)	1
  (0, 171)	1
  (0, 137)	1
  (0, 39)	1
  (0, 221)	1
  (0, 15)	1
  (0, 202)	1
  (0, 165)	1
  (0, 177)	1
  (0, 229)	1
  (0, 67)	1
  (0, 69)	1
  (0, 127)	1
  (0, 159)	1
  (0, 145)	1
  (0, 1)	1
  (0, 213)	1
  (0, 58)	1
  (0, 155)	1
  (0, 174)	1
  (0, 50)	1
  (0, 132)	1
  (0, 184)	1
  :	:
  (4, 3)	1
  (4, 196)	1
  (4, 195)	1
  (4, 105)	1
  (4, 35)	1
  (4, 224)	1
  (4, 109)	1
  (4, 76)	1
  (4, 25)	1
  (4, 130)	1
  (4, 20)	1
  (4, 158)	1
  (4, 87)	1
  (4, 60)	1
  (4, 36)	1
  (4, 153)	1
  (4, 175)	1
  (4, 88)	1
  (4, 101)	2
  (4, 17)	1
  (4, 108)	1
  (4, 178)	1
  (4, 18)	1
  (4, 57)	1
  (4, 133)	1


In [245]:
x_vector = x_vector.toarray()

In [246]:
x_vector

array([[0, 1, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 1, 1, 0],
       [1, 0, 0, ..., 0, 0, 2],
       [0, 0, 1, ..., 0, 0, 0]])

In [247]:
x_vector.shape

(5, 247)

In [248]:
print(cv.get_feature_names_out(), len(cv.get_feature_names_out()))

['10' '13' '950' 'act' 'actor' 'addict' 'air' 'allen' 'almost' 'argu'
 'averag' 'aw' 'bad' 'basic' 'bbc' 'beauti' 'becom' 'best' 'bit'
 'boogeyman' 'bore' 'boy' 'bread' 'career' 'charact' 'cheap' 'chosen'
 'closet' 'come' 'comedi' 'comfort' 'comment' 'concern' 'condit' 'control'
 'countri' 'credit' 'crown' 'dare' 'day' 'decad' 'decid' 'decor' 'descent'
 'devil' 'dialog' 'dialogu' 'diari' 'disappear' 'disappoint' 'disastor'
 'discomfort' 'divorc' 'done' 'drama' 'dream' 'edit' 'effort' 'element'
 'encourag' 'end' 'entir' 'entri' 'even' 'everi' 'expect' 'extrem'
 'famili' 'fantasi' 'far' 'fashion' 'fight' 'film' 'first' 'flat'
 'forward' 'four' 'friend' 'fulli' 'give' 'go' 'got' 'great' 'grown'
 'guard' 'guid' 'halliwel' 'happi' 'harvey' 'heart' 'home' 'hot' 'ignor'
 'imag' 'impress' 'instead' 'interest' 'jake' 'jewel' 'johanson' 'jump'
 'keitel' 'kill' 'killer' 'knowledg' 'lame' 'languag' 'laugh' 'least'
 'less' 'life' 'light' 'likabl' 'like' 'littl' 'look' 'love' 'make'
 'manag' 'mani' 

In [249]:
test_x

['I loved the acting in the movie', 'The movie I saw was bad']

In [121]:
# Now we apply the vectorizer to the test set (note: fit_transform re-learns the vocabulary here):
test_x_vector = cv.fit_transform(x_test_clean).toarray()

In [122]:
test_x_vector

array([[1, 0, 2, 1, 1, 0],
       [0, 1, 0, 0, 1, 1]])

 - The test array is much narrower than the training matrix: 6 columns instead of 247.
 - It should have the same number of columns as the matrix built from x_clean.
 - The mismatch comes from calling fit_transform, which re-learns the vocabulary from the test data.
 - fit should see only the training data; the test data should go through transform alone.

In [123]:
cv.get_feature_names_out()


array(['act', 'bad', 'happi', 'love', 'movi', 'saw'], dtype=object)

- Here we can see that the fit step made the vectorizer learn its vocabulary from the test data.
- To avoid that, we call only transform on the test data.
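
The difference between fitting and transforming can be sketched with a toy vectorizer written from scratch. `TinyCountVectorizer` is a hypothetical class for illustration only; it is not scikit-learn's implementation, and the training and test strings are made up.

```python
# Toy stand-in for CountVectorizer, written from scratch for illustration only.
class TinyCountVectorizer:
    def fit(self, docs):
        # Learn a sorted vocabulary from the documents passed in.
        words = sorted({w for d in docs for w in d.split()})
        self.vocabulary_ = {w: i for i, w in enumerate(words)}
        return self

    def transform(self, docs):
        # Count only words already in the learned vocabulary; unseen words are ignored.
        rows = []
        for d in docs:
            row = [0] * len(self.vocabulary_)
            for w in d.split():
                if w in self.vocabulary_:
                    row[self.vocabulary_[w]] += 1
            rows.append(row)
        return rows

train = ["love movi", "bad film bore"]
test = ["love act movi"]

vec = TinyCountVectorizer().fit(train)
print(sorted(vec.vocabulary_))   # ['bad', 'bore', 'film', 'love', 'movi']
print(vec.transform(test))       # [[0, 0, 0, 1, 1]]  ('act' was never seen, so it is dropped)

refit = TinyCountVectorizer().fit(test)
print(refit.transform(test))     # [[1, 1, 1]]  (columns no longer line up with the training matrix)
```

With the vocabulary fixed at training time, every test row has the same columns as the training matrix, which is what the classifier expects.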

In [250]:
test_x_vector = cv.transform(x_test_clean).toarray()

In [252]:
print(test_x_vector)  # dense array after toarray(); almost all entries are zero

[[0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

In [253]:
print(len(test_x_vector), test_x_vector.shape)

2 (2, 247)


In [164]:
cv.get_feature_names_out() , len(cv.get_feature_names_out())

(array(['10', '13', '950', 'act', 'actor', 'addict', 'air', 'allen',
        'almost', 'argu', 'averag', 'aw', 'bad', 'basic', 'bbc', 'beauti',
        'becom', 'best', 'bit', 'boogeyman', 'bore', 'boy', 'bread',
        'career', 'charact', 'cheap', 'chosen', 'closet', 'come', 'comedi',
        'comfort', 'comment', 'concern', 'condit', 'control', 'countri',
        'credit', 'crown', 'dare', 'day', 'decad', 'decid', 'decor',
        'descent', 'devil', 'dialog', 'dialogu', 'diari', 'disappear',
        'disappoint', 'disastor', 'discomfort', 'divorc', 'done', 'drama',
        'dream', 'edit', 'effort', 'element', 'encourag', 'end', 'entir',
        'entri', 'even', 'everi', 'expect', 'extrem', 'famili', 'fantasi',
        'far', 'fashion', 'fight', 'film', 'first', 'flat', 'forward',
        'four', 'friend', 'fulli', 'give', 'go', 'got', 'great', 'grown',
        'guard', 'guid', 'halliwel', 'happi', 'harvey', 'heart', 'home',
        'hot', 'ignor', 'imag', 'impress', 'instead', 'i

In [94]:
### Create Our Model and train 

### Step 3 : Multinomial Naive Bayes

In [254]:
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

In [255]:
mnb = MultinomialNB()
print(mnb)

MultinomialNB()


In [256]:
mnb?

Type:        MultinomialNB
String form: MultinomialNB()
File:        /opt/anaconda3/lib/python3.11/site-packages/sklearn/naive_bayes.py
Docstring:
Naive Bayes classifier for multinomial models.

The multinomial Naive Bayes classifier is suitable for classification with
discrete features (e.g., word counts for text classification). The
multinomial distribution normally requires integer feature counts. However,
in practice, fractional counts such as tf-idf may also work.

Read more in the :ref:`User Guide <multinomial_naive_bayes>`.

Parameters
----------
alpha : float or array-like of shape (n_features,), default=1.0
    Additive (Laplace/Lidstone) smoothing parameter
    (set alpha=0 and force_alpha=True, for no smoothing).

force_alpha : bool, default=False
    If False and alpha is less than 1e-10, it will set alpha to
    1e-10. If True, alpha will remain unchanged. This may cause
    numerical errors if alpha is too close to 0.

    .. 
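
To make the role of `alpha` concrete, here is a minimal sketch of additive smoothing for the multinomial model, where each word probability is estimated as (count + alpha) / (total count + alpha * vocabulary size). The function name and the word counts are made up for illustration.

```python
def smoothed_word_probs(word_counts, alpha=1.0):
    """Multinomial NB estimate: (count + alpha) / (total + alpha * vocab_size)."""
    total = sum(word_counts.values())
    vocab_size = len(word_counts)
    return {w: (c + alpha) / (total + alpha * vocab_size)
            for w, c in word_counts.items()}

# Made-up counts for one class; "bad" never appeared in that class.
counts = {"love": 2, "movi": 3, "bad": 0}
print(smoothed_word_probs(counts))
# {'love': 0.375, 'movi': 0.5, 'bad': 0.125}
```

Without smoothing, the unseen word "bad" would get probability 0 and wipe out the whole product for that class.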

In [257]:
### Training
mnb.fit(x_vector,y)

In [258]:
#### predictions
mnb.predict(test_x_vector)

array([1, 0])

In [259]:
test_x

['I loved the acting in the movie', 'The movie I saw was bad']

- Sometimes the results go wrong because we are using a bag-of-words representation, which ignores word order.
- Stopword removal can also eradicate important words.
- For example, if the text is changed to 'The movie I saw was not bad', it is still classified as a negative review, because 'not' is a stopword and was removed.
- One remedy is to use n-grams in CountVectorizer (the ngram_range parameter), with 'not' kept out of the stopword list.
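
A rough sketch of what n-gram features look like, written in plain Python rather than with `CountVectorizer`. The `ngrams` helper is hypothetical, and the token list assumes 'not' survived stopword removal.

```python
def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (as space-joined strings) for n in [n_min, n_max]."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

# Tokens assume "not" was kept, i.e. removed from the stopword list beforehand.
print(ngrams(["movi", "saw", "not", "bad"]))
# ['movi', 'saw', 'not', 'bad', 'movi saw', 'saw not', 'not bad']
```

The bigram 'not bad' becomes its own feature, so a classifier can learn a weight for it that differs from the weight of 'bad' alone.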

In [269]:
mnb.predict_proba(test_x_vector)  # columns: P(class 0), P(class 1)
# predict() returns the argmax of each row

array([[0.49586196, 0.50413804],
       [0.5960209 , 0.4039791 ]])
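
The step from these probabilities to the predicted labels can be sketched in plain Python, using the two rows printed above. This assumes the column order is class 0 and then class 1, as the cell comment states.

```python
proba = [[0.49586196, 0.50413804],
         [0.5960209, 0.4039791]]

# For each row, pick the column (class index) with the largest probability.
predicted = [max(range(len(row)), key=row.__getitem__) for row in proba]
print(predicted)  # [1, 0]
```

The first review is a close call (about 0.504 against 0.496), so even a small change in the training data could flip it.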

In [286]:
mnb.predict(test_x_vector)

array([1, 0])

### 4. Multivariate Bernoulli Event Model Naive Bayes

- Counts are binarized to presence/absence: feature_vector = [1 0 0 2 5] ==> [1 0 0 1 1]
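
The mapping above can be sketched as a simple threshold. The `binarize` helper here is a hypothetical plain-Python function, with a default threshold of 0.0 chosen to match the `binarize = 0.0` setting used later in the notebook.

```python
def binarize(counts, threshold=0.0):
    # Values strictly above the threshold become 1, everything else becomes 0.
    return [1 if c > threshold else 0 for c in counts]

print(binarize([1, 0, 0, 2, 5]))  # [1, 0, 0, 1, 1]
```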


In [271]:
Bnb = BernoulliNB()

In [278]:
Bnb?

Type:        BernoulliNB
String form: BernoulliNB()
File:        /opt/anaconda3/lib/python3.11/site-packages/sklearn/naive_bayes.py
Docstring:
Naive Bayes classifier for multivariate Bernoulli models.

Like MultinomialNB, this classifier is suitable for discrete data. The
difference is that while MultinomialNB works with occurrence counts,
BernoulliNB is designed for binary/boolean features.

Read more in the :ref:`User Guide <bernoulli_naive_bayes>`.

Parameters
----------
alpha : float or array-like of shape (n_features,), default=1.0
    Additive (Laplace/Lidstone) smoothing parameter
    (set alpha=0 and force_alpha=True, for no smoothing).

force_alpha : bool, default=False
    If False and alpha is less than 1e-10, it will set alpha to
    1e-10. If True, alpha will remain unchanged. This may cause
    numerical errors if alpha is too close to 0.

    .. versionadded:: 1.2
    .. deprecated:: 1.2
       The default value of `force_alp

In [279]:
# alpha is the smoothing factor
# binarize is the threshold for mapping features to booleans: values above it become 1, the rest 0

In [280]:
bnb = BernoulliNB(binarize = 0.0)

In [281]:
bnb.fit(x_vector, y)

In [284]:
bnb.predict_proba(test_x_vector)
# The answer differs from multinomial naive Bayes,
# because this model ignores how often a word appears in the sentence.
# It only considers whether the feature occurs at all.

array([[9.17758552e-04, 9.99082241e-01],
       [2.06259054e-03, 9.97937409e-01]])

In [283]:
bnb.predict(test_x_vector)

array([1, 1])

In [287]:
mnb.predict(test_x_vector)

array([1, 0])

In [290]:
bnb.score(x_vector, y), mnb.score(x_vector, y)
# accuracy is 100% on our small dataset
# a likely case of overfitting,
# since the score is checked on the same data the models were trained on

(1.0, 1.0)

In [295]:
bnb.score(test_x_vector, [1,0])

0.5
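
The 0.5 above is plain accuracy: the fraction of test reviews whose predicted label matches the true label. A minimal sketch, using the labels `[1, 0]` passed to `score` and the `[1, 1]` that `bnb.predict` returned earlier; the `accuracy` helper is hypothetical.

```python
def accuracy(y_true, y_pred):
    # Fraction of positions where prediction and label agree.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

print(accuracy([1, 0], [1, 1]))  # 0.5
```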

### Confusion Matrix 