## Module 5

#### 21 June, 2023

#movie #reviews #stopwords #corpus #nltk

Topic 1: **Text Classification**

In [1]:
import nltk
from nltk.corpus import movie_reviews

In [2]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Shanover\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [3]:
# get categories of the reviews

movie_reviews.categories()

['neg', 'pos']

In [4]:
# number of reviews

review_ids = movie_reviews.fileids()
num_reviews = len(review_ids)

print("Number of reviews:", num_reviews)

Number of reviews: 2000


In [5]:
# number of reviews in each category

neg_r = movie_reviews.fileids('neg')
pos_r = movie_reviews.fileids('pos')


print("Number of positive reviews:", len(pos_r))
print("Number of negative reviews:", len(neg_r))

Number of positive reviews: 1000
Number of negative reviews: 1000


In [6]:
# Count overall words, based on categories 'negative(neg)', 'positive(pos)'

review_ids = movie_reviews.fileids()
totalpos = 0
totalneg = 0

for review_id in review_ids:
    category = movie_reviews.categories(review_id)[0]
    words = movie_reviews.words(review_id)
    num_words = len(words)
    
    if category == 'pos':
        totalpos += num_words
    else:
        totalneg += num_words
    
print('Total Pos:', totalpos)
print('Total Neg:', totalneg)
print('Total Words:', totalneg + totalpos)

Total Pos: 832564
Total Neg: 751256
Total Words: 1583820


## Remove punctuation and stopwords

In [7]:
text = " ".join(movie_reviews.words())
print(len(text))

7810519


In [8]:
import string

In [9]:
text_filtered = text.translate(str.maketrans('','',string.punctuation))

`string.punctuation` is a string constant provided by Python's `string` module. It contains the 32 ASCII punctuation characters: ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``.

`str.maketrans()` is a static method that creates a translation table. In the three-argument form used here it takes `x`, `y`, and `z`:
- `x` is a string containing the characters that need to be replaced.
- `y` is a string of the same length as `x`; each character in `x` is mapped to the character at the same position in `y`. Passing strings of different lengths raises a `ValueError`.
- `z` is a string containing the characters that need to be deleted (mapped to `None`).

Because `x` and `y` are both empty in the call above, nothing is replaced and every punctuation character is deleted.

`text.translate()` is a string method that applies the translation table to the given text, replacing or removing characters accordingly.
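A minimal sketch of both behaviours on a short, made-up sample string (the sample text and variable names are only for illustration):

```python
import string

sample = "Hello, world! It's a test-case."

# Empty x and y: nothing is replaced; every character in z is deleted.
table = str.maketrans('', '', string.punctuation)
cleaned = sample.translate(table)
print(cleaned)

# x and y of equal length: 'a' -> '4' and 'e' -> '3'; '!' is deleted.
leet = sample.translate(str.maketrans('ae', '43', '!'))
print(leet)
```

Note that deleting the apostrophe joins `It's` into `Its`, so the character count drops but the word boundaries do not always survive.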


In [10]:
print('After removing punctuations:', len(text_filtered))
print('Total punctuations removed:', len(text) - len(text_filtered))

After removing punctuations: 7559896
Total punctuations removed: 250623


## Remove stopwords

In [11]:
from nltk.corpus import stopwords

In [12]:
stop_words = stopwords.words("english")
print('Total stopwords:',len(stop_words))

Total stopwords: 179


In [13]:
tokens = nltk.word_tokenize(text_filtered)
print('Tokenized length:', len(tokens))

Tokenized length: 1337085


In [14]:
# make it all lowercase
tokens = [token.lower() for token in tokens]

# remove stopwords
tokens_filtered = [token for token in tokens if token not in stop_words]
print('After removing stopwords:', len(tokens_filtered))
print('Stopwords removed:', len(tokens) - len(tokens_filtered))

After removing stopwords: 708475
Stopwords removed: 628610
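The same lowercase-then-filter step can be sketched on a toy token list. The tiny stopword set below is hypothetical and stands in for `stopwords.words("english")`; a `set` is used here because membership tests on a set are typically faster than on a list, which may matter with more than a million tokens.

```python
# Hypothetical mini stopword set; the notebook uses stopwords.words("english").
toy_stop_words = {"the", "is", "a", "of", "and", "it"}

toy_tokens = ["The", "plot", "is", "a", "mess", "and", "the", "acting", "is", "flat"]

# Lowercase first so "The" matches the lowercase stopword "the".
toy_tokens = [t.lower() for t in toy_tokens]
toy_filtered = [t for t in toy_tokens if t not in toy_stop_words]

print(toy_filtered)
print("Stopwords removed:", len(toy_tokens) - len(toy_filtered))
```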


### Frequency distribution

In [15]:
count_dict = nltk.FreqDist(tokens_filtered)
print(count_dict)

<FreqDist with 39295 samples and 708475 outcomes>


In [16]:
# get the most frequent words

count_dict.most_common(20)

[('film', 9519),
 ('one', 5853),
 ('movie', 5774),
 ('like', 3690),
 ('even', 2565),
 ('good', 2411),
 ('time', 2411),
 ('story', 2170),
 ('would', 2110),
 ('much', 2050),
 ('character', 2020),
 ('also', 1967),
 ('get', 1949),
 ('two', 1912),
 ('well', 1906),
 ('characters', 1859),
 ('first', 1836),
 ('see', 1749),
 ('way', 1693),
 ('make', 1642)]
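`nltk.FreqDist` is a subclass of Python's `collections.Counter`, so the counting above can be sketched with the standard library alone. The token list here is made up for illustration:

```python
from collections import Counter

toy_tokens = ["film", "good", "film", "story", "good", "film"]
toy_counts = Counter(toy_tokens)

# "samples" are the distinct words, "outcomes" the total number of tokens.
print(len(toy_counts), "samples and", sum(toy_counts.values()), "outcomes")
print(toy_counts.most_common(2))
```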

In [17]:
docs = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        words = list(movie_reviews.words(fileid))
        doc = (words, category)
        docs.append(doc)

In [18]:
print('Total docs:', len(docs))

Total docs: 2000


In [19]:
# return type

print(type(docs[0]))

<class 'tuple'>


In [20]:
# first document at index 0

print('Category:', docs[0][1],', Content length:',len(docs[0][0]))

Category: neg , Content length: 879


In [21]:
# each doc category and length
for doc in docs:
    print('Category:', doc[1],', Content length:',len(doc[0]))

Category: neg , Content length: 879
Category: neg , Content length: 304
Category: neg , Content length: 581
Category: neg , Content length: 629
Category: neg , Content length: 901
Category: neg , Content length: 759
Category: neg , Content length: 687
Category: neg , Content length: 748
Category: neg , Content length: 854
Category: neg , Content length: 1025
Category: neg , Content length: 898
Category: neg , Content length: 629
Category: neg , Content length: 568
Category: neg , Content length: 1144
Category: neg , Content length: 670
Category: neg , Content length: 849
Category: neg , Content length: 780
Category: neg , Content length: 862
Category: neg , Content length: 542
Category: neg , Content length: 885
Category: neg , Content length: 872
Category: neg , Content length: 653
Category: neg , Content length: 801
Category: neg , Content length: 1315
Category: neg , Content length: 804
Category: neg , Content length: 664
Category: neg , Content length: 600
Category: neg , Content l

## Feature Extraction

In [264]:
word_features = [w[0] for w in count_dict.most_common(3000)]
# word_features = [w[0] for w in count_dict.most_common(60)]
# word_features = newWords

In [265]:
def search_features(doc):
    words = set(doc)
    features = {}
    for w in word_features:
        features[w] = (w in words)
    return features
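To see what `search_features` produces, here is a small sketch with a hypothetical four-word vocabulary in place of the 3000 `word_features` (the `toy_` names are not part of the notebook):

```python
toy_features = ["film", "good", "bad", "plot"]  # hypothetical vocabulary

def toy_search_features(doc):
    # A set makes each membership test cheap and ignores repeated words.
    words = set(doc)
    return {w: (w in words) for w in toy_features}

toy_result = toy_search_features(["a", "good", "film", "with", "a", "good", "cast"])
print(toy_result)
```

Each document becomes a fixed-length dictionary of booleans, one entry per vocabulary word, regardless of how long the review is or how often a word occurs.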

### Searching and checking for matches

In [266]:
match = search_features(docs[0][0])

totalTrue = 0
totalFalse = 0

for key, value in match.items():
    if value:
        totalTrue += 1
    else:
        totalFalse += 1
        
print('True:',totalTrue)
print('False:',totalFalse)

True: 204
False: 2796


In [267]:
docs[100][::-1] # reversed so the category label, normally the last element of each (words, category) tuple, is shown first

('neg',
  ':',
  'spoilers',
  'are',
  'included',
  'in',
  'this',
  'review',
  '.',
  '.',
  '.',
  'but',
  'it',
  'doesn',
  "'",
  't',
  'really',
  'make',
  'much',
  'of',
  'a',
  'difference',
  '.',
  'deep',
  'impact',
  'begins',
  'the',
  'official',
  'summer',
  'movie',
  'season',
  ',',
  'and',
  'it',
  'also',
  'brings',
  'back',
  'memories',
  'of',
  '1997',
  '.',
  'remember',
  'when',
  'dante',
  "'",
  's',
  'peak',
  'came',
  'out',
  'in',
  'february',
  '?',
  'a',
  'few',
  'months',
  'later',
  ',',
  'volcano',
  'was',
  'released',
  '.',
  'the',
  'first',
  'film',
  'was',
  'smart',
  ',',
  'exhilirating',
  ',',
  'and',
  'one',
  'of',
  'the',
  'best',
  'disaster',
  'films',
  'i',
  'had',
  'ever',
  'seen',
  '.',
  'the',
  'latter',
  'film',
  'was',
  'an',
  'incohesive',
  'mess',
  'that',
  'defied',
  'logic',
  'and',
  'wasted',
  'talent',
  '.',
  'well',
  ',',
  'it',
  "'",
  's',
  'deja',
  'vu',
  '

### Applying the function to all reviews

In [268]:
featureset = [(search_features(doc), category) for (doc, category) in docs]

In [269]:
len(featureset[100][0])

3000

## Train/Test

In [270]:
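# NOTE: docs is ordered 'neg' first, then 'pos', and is not shuffled,
# so this slice leaves only 'pos' reviews in the test set (see the counts below).
training_set = featureset[:1600]
testing_set = featureset[1600:]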

In [271]:
train_neg_total = 0
train_pos_total = 0
test_neg_total = 0
test_pos_total = 0

for item in training_set:
    category = item[1]
    if category == 'neg':
        train_neg_total = train_neg_total + 1
    else:
        train_pos_total = train_pos_total + 1
        
for item in testing_set:
    category = item[1]
    if category == 'neg':
        test_neg_total = test_neg_total + 1
    else:
        test_pos_total = test_pos_total + 1
        
print('Train total pos:', train_pos_total)
print('Train total neg:', train_neg_total)
print('Train Ratio:', (train_neg_total/train_pos_total),':',train_pos_total/train_pos_total)

print('\n')
print('Test total pos:', test_pos_total)
print('Test total neg:', test_neg_total)
# print('Test Ratio:', (test_pos_total/test_neg_total),':',test_neg_total/test_neg_total)

Train total pos: 600
Train total neg: 1000
Train Ratio: 1.6666666666666667 : 1.0


Test total pos: 400
Test total neg: 0
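The counts above show that the unshuffled slice puts all 1000 'neg' reviews in training and leaves a test set with only 'pos' reviews. One possible remedy, sketched here on a stand-in list rather than the real `featureset`, is to shuffle before slicing. The seed value and the `toy_` names are arbitrary assumptions, and the accuracy figures recorded in this notebook were not produced with this split.

```python
import random

# Stand-in for featureset: 1000 'neg' items followed by 1000 'pos' items.
toy_featureset = [({"id": i}, "neg") for i in range(1000)]
toy_featureset += [({"id": i}, "pos") for i in range(1000, 2000)]

random.seed(42)  # arbitrary fixed seed so the split is reproducible
toy_shuffled = toy_featureset[:]  # copy, so the original order is kept
random.shuffle(toy_shuffled)

toy_train = toy_shuffled[:1600]
toy_test = toy_shuffled[1600:]

test_labels = [label for _, label in toy_test]
print("Test pos:", test_labels.count("pos"), "Test neg:", test_labels.count("neg"))
```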


In [272]:
classifier = nltk.NaiveBayesClassifier.train(training_set)

In [273]:
print('Classifier accuracy is - testing: {}'.format(nltk.classify.accuracy(classifier, testing_set)*100))
print('Classifier accuracy is - training: {}'.format(nltk.classify.accuracy(classifier, training_set)*100))

Classifier accuracy is - testing: 73.75
Classifier accuracy is - training: 90.25
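Accuracy here is the fraction of labelled items whose predicted label matches the true label. A rough sketch of that computation, using a hypothetical rule-based predictor instead of the trained classifier:

```python
def toy_accuracy(predict, gold):
    """Fraction of (features, label) pairs where predict(features) == label."""
    correct = sum(1 for features, label in gold if predict(features) == label)
    return correct / len(gold)

# Hypothetical rule used only to exercise the function above.
def toy_predict(features):
    return "neg" if features.get("awful") else "pos"

toy_gold = [
    ({"awful": True}, "neg"),
    ({"awful": False}, "pos"),
    ({"awful": False}, "neg"),  # the rule gets this one wrong
    ({"awful": True}, "neg"),
]
toy_score = toy_accuracy(toy_predict, toy_gold) * 100
print(toy_score)
```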


## Feeding the top 60 most informative features back into the model

In [227]:
top60 = classifier.most_informative_features(60)

newWords = []
for v in top60:
    newWords.append(v[0])
    
newWords

['ludicrous',
 'outstanding',
 'mulan',
 'inept',
 'whatsoever',
 'seagal',
 'idiotic',
 'damon',
 'finest',
 'freddie',
 'wonderfully',
 'prinze',
 'anger',
 'breathtaking',
 'flynt',
 'lame',
 'henstridge',
 'waste',
 'wasted',
 'garbage',
 'inane',
 'awful',
 'beautifully',
 'poorly',
 'thompson',
 'ripley',
 'tucker',
 'mess',
 'refreshing',
 'bother',
 'era',
 'pointless',
 'worst',
 'laughable',
 'tedious',
 'schumacher',
 'allows',
 'fantastic',
 'stupid',
 'zero',
 'uninteresting',
 'nomination',
 'alicia',
 'ridiculous',
 'bland',
 'fits',
 'unfunny',
 'sat',
 'portrayal',
 'patch',
 'lebowski',
 'jedi',
 'hank',
 'terrific',
 'mature',
 'religion',
 'italian',
 'lifeless',
 'gon',
 'terrible']
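`most_informative_features` returns `(feature_name, feature_value)` pairs, so the same word could in principle appear twice (once with `True`, once with `False`). A small sketch of collecting unique names in order, using a made-up list in the same shape:

```python
# Hypothetical pairs in the shape (feature_name, feature_value).
toy_pairs = [
    ("ludicrous", True),
    ("outstanding", True),
    ("ludicrous", False),
    ("inept", True),
]

# dict.fromkeys keeps first-seen order and drops repeated names.
toy_new_words = list(dict.fromkeys(name for name, _ in toy_pairs))
print(toy_new_words)
```

The resulting list could then replace `word_features`, as the commented-out line in the Feature Extraction cell suggests, before rebuilding the feature sets and retraining.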

## Save a model

In [274]:
import pickle

In [275]:
# the with-block closes the file even if pickling fails
with open('naive_bayes_model.pkl', 'wb') as save_classifier:
    pickle.dump(classifier, save_classifier)

### Load the model back from the file

In [276]:
with open('naive_bayes_model.pkl', 'rb') as classifier_f:
    classifier = pickle.load(classifier_f)
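A self-contained round trip, using a plain dictionary as a stand-in for the classifier and a temporary directory so nothing is written next to the notebook (both are assumptions for this sketch):

```python
import os
import pickle
import tempfile

toy_model = {"labels": ["neg", "pos"], "n_features": 3000}  # stand-in object

toy_path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")

with open(toy_path, "wb") as f:
    pickle.dump(toy_model, f)

with open(toy_path, "rb") as f:
    toy_restored = pickle.load(f)

print(toy_restored == toy_model)
```

The pickle file holds only the classifier. The `word_features` list and `search_features` function are not stored in it, so they would need to be saved or recreated separately to use the model in a new session. Only unpickle files from a trusted source, since loading a pickle can execute arbitrary code.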

## Trying the model

In [277]:
from nltk import word_tokenize

In [278]:
custom_review = "I bad the restaurant. It was a disaster eating there. Poor service."
custom_tokens = word_tokenize(custom_review)
custom_review_set = search_features(custom_tokens)

In [279]:
print(classifier.classify(custom_review_set))

neg


In [280]:
prob_res = classifier.prob_classify(custom_review_set)
print(prob_res.max())
print(prob_res.prob('pos'))
print(prob_res.prob('neg'))

neg
1.2733017053383622e-06
0.9999987266982958
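One thing to watch: `word_features` was built from lowercased tokens, while the custom review is tokenized without lowercasing, so a capitalised word such as `Poor` would not match the feature `poor`. A sketch of the difference, with a hypothetical vocabulary and `str.split` as a simple stand-in for `word_tokenize`:

```python
toy_features = ["poor", "disaster", "bad", "great"]  # hypothetical vocabulary

def toy_search_features(doc):
    words = set(doc)
    return {w: (w in words) for w in toy_features}

toy_review = "Poor service. It was a disaster."
# str.split is a simple stand-in for word_tokenize in this sketch.
raw_tokens = toy_review.replace(".", " ").split()
lower_tokens = [t.lower() for t in raw_tokens]

raw_match = toy_search_features(raw_tokens)["poor"]      # "Poor" != "poor"
lower_match = toy_search_features(lower_tokens)["poor"]
print(raw_match, lower_match)
```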


________________________________________________________