## Module 5

#### 21 June, 2023

#movie #reviews #stopwords #corpus #nltk

Topic 1: **Text Classification**

In [2]:
import nltk
from nltk.corpus import movie_reviews

In [3]:
nltk.download('movie_reviews')

[nltk_data] Downloading package movie_reviews to
[nltk_data]     C:\Users\Shanover\AppData\Roaming\nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


True

In [10]:
# get categories of the reviews

movie_reviews.categories()

['neg', 'pos']

In [11]:
# number of reviews

review_ids = movie_reviews.fileids()
num_reviews = len(review_ids)

print("Number of reviews:", num_reviews)

Number of reviews: 2000


In [12]:
# reviews in each category

neg_r = movie_reviews.fileids('neg')
pos_r = movie_reviews.fileids('pos')


print("Number of positive reviews:", len(pos_r))
print("Number of negative reviews:", len(neg_r))

Number of positive reviews: 1000
Number of negative reviews: 1000


In [38]:
# Count total words per category: negative ('neg') and positive ('pos')

review_ids = movie_reviews.fileids()
totalpos = 0
totalneg = 0

for review_id in review_ids:
    category = movie_reviews.categories(review_id)[0]
    words = movie_reviews.words(review_id)
    num_words = len(words)
    
    if category == 'pos':
        totalpos += num_words
    else:
        totalneg += num_words
    
print('Total Pos:', totalpos)
print('Total Neg:', totalneg)
print('Total Words:', totalneg + totalpos)

Total Pos: 832564
Total Neg: 751256
Total Words: 1583820


## Remove punctuation and stopwords

In [18]:
text = " ".join(movie_reviews.words())
print(len(text))

7810519


In [19]:
import string

In [24]:
text_filtered = text.translate(str.maketrans('','',string.punctuation))

`string.punctuation` is a string constant provided by Python's `string` module. It contains all the ASCII punctuation characters: ``!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~``.

`str.maketrans()` is a static method that builds a translation table. Called with strings, it takes up to three arguments: `x`, `y`, and `z`.
- `x` is a string containing the characters to be replaced.
- `y` is a string of the same length as `x`; each character in `x` is mapped to the character at the same position in `y`. If the lengths differ, Python raises a `ValueError`.
- `z` is an optional string containing the characters to delete (mapped to `None`).

In the call above, `x` and `y` are empty, so nothing is replaced and every character in `string.punctuation` is deleted.

`text.translate()` is a string method that applies the translation table to the given text, replacing or removing characters based on the translation table.
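
The behaviour described above can be checked on a short string. This is a minimal sketch; the sample sentence and the variable names `swap`, `strip_punct`, `sample`, and `cleaned` are made up for illustration and are not part of the notebook.

```python
import string

# Two-argument form: each character in x is replaced by the character
# at the same position in y (x and y must have equal length).
swap = str.maketrans("abc", "xyz")
swapped = "aabbcc".translate(swap)
print(swapped)  # xxyyzz

# Three-argument form with empty x and y: nothing is replaced,
# and every character listed in z is deleted.
strip_punct = str.maketrans("", "", string.punctuation)
sample = "Hello, world! It's a (mostly) good film."  # made-up sample text
cleaned = sample.translate(strip_punct)
print(cleaned)  # Hello world Its a mostly good film

# Unequal lengths for x and y are rejected.
try:
    str.maketrans("abc", "x")
    length_error = False
except ValueError:
    length_error = True
print(length_error)  # True
```

Note that deleting the apostrophe turns `It's` into `Its`, so contractions are merged rather than split.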


In [39]:
print('After removing punctuations:', len(text_filtered))
print('Total punctuations removed:', len(text) - len(text_filtered))

After removing punctuations: 7559896
Total punctuations removed: 250623


## Remove stopwords

In [25]:
from nltk.corpus import stopwords

In [36]:
stop_words = stopwords.words("english")
print('Total stopwords:',len(stop_words))

Total stopwords: 179


In [37]:
tokens = nltk.word_tokenize(text_filtered)
print('Tokenized length:', len(tokens))

Tokenized length: 1337085


In [40]:
# make it all lowercase
tokens = [token.lower() for token in tokens]

# remove stopwords
tokens_filtered = [token for token in tokens if token not in stop_words]
print('After removing stopwords:', len(tokens_filtered))
print('Stopwords removed:', len(tokens) - len(tokens_filtered))

After removing stopwords: 708475
Stopwords removed: 628610
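
The lowercase-then-filter step can be sketched on a toy example. The stopword set and tokens below are hypothetical stand-ins (the notebook uses `stopwords.words("english")` and the full corpus). The sketch also stores the stopwords in a `set`, which is an optional change from the list used above: membership tests on a set are constant time on average, which matters when filtering over a million tokens.

```python
# Hypothetical, hand-picked stopword subset for illustration only.
stop_set = {"the", "is", "a", "of", "and", "this"}

# Made-up tokens standing in for the tokenized corpus.
sample_tokens = ["This", "film", "is", "a", "triumph", "of", "style", "and", "wit"]

# Lowercase first so that "This" matches the lowercase stopword "this".
lowered = [token.lower() for token in sample_tokens]

# Keep only tokens that are not stopwords.
kept = [token for token in lowered if token not in stop_set]

print(kept)                      # ['film', 'triumph', 'style', 'wit']
print(len(lowered) - len(kept))  # 5
```

With the real data, the same change would be `stop_words = set(stopwords.words("english"))`; the filtered result is identical, only the lookup is faster.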


### Frequency distribution

In [41]:
count_dict = nltk.FreqDist(tokens_filtered)
print(count_dict)

<FreqDist with 39295 samples and 708475 outcomes>


In [44]:
# get the most frequent words

count_dict.most_common(20)

[('film', 9519),
 ('one', 5853),
 ('movie', 5774),
 ('like', 3690),
 ('even', 2565),
 ('good', 2411),
 ('time', 2411),
 ('story', 2170),
 ('would', 2110),
 ('much', 2050),
 ('character', 2020),
 ('also', 1967),
 ('get', 1949),
 ('two', 1912),
 ('well', 1906),
 ('characters', 1859),
 ('first', 1836),
 ('see', 1749),
 ('way', 1693),
 ('make', 1642)]
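
`nltk.FreqDist` is a subclass of `collections.Counter`, so the "samples" and "outcomes" figures in its summary, and the `most_common()` call, can be illustrated with the standard library alone. This is a small sketch on made-up tokens, not the corpus data.

```python
from collections import Counter

# Made-up tokens for illustration.
sample_tokens = ["film", "movie", "film", "story", "film", "movie"]

counts = Counter(sample_tokens)

# "Samples" are the distinct tokens; "outcomes" are all token occurrences.
num_samples = len(counts)
num_outcomes = sum(counts.values())
print(num_samples, num_outcomes)  # 3 6

# most_common(n) returns the n highest-count (token, count) pairs.
top_two = counts.most_common(2)
print(top_two)  # [('film', 3), ('movie', 2)]
```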

In [72]:
docs = []
for category in movie_reviews.categories():
    for fileid in movie_reviews.fileids(category):
        words = list(movie_reviews.words(fileid))
        doc = (words, category)
        docs.append(doc)

In [73]:
print('Total docs:', len(docs))

Total docs: 2000


In [74]:
print(type(docs[0]))

<class 'tuple'>


In [79]:
print('Category:', docs[0][1],', Content length:',len(docs[0][0]))

Category: neg , Content length: 879
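
Since each entry in `docs` is a `(words, category)` tuple, tuple unpacking in the loop header avoids index mix-ups such as reading `doc[0][1]` (the second word) when `doc[1]` (the label) is meant. The sketch below uses a hypothetical two-item `toy_docs` list in place of the 2000 real documents.

```python
# Toy stand-in for docs; the words are made up for illustration.
toy_docs = [
    (["dull", "plot", ",", "flat", "acting"], "neg"),
    (["a", "wonderful", "film"], "pos"),
]

# Unpack each tuple into its word list and its label.
for words, category in toy_docs:
    print("Category:", category, ", Content length:", len(words))

# The same information collected as (label, word count) pairs.
summary = [(category, len(words)) for words, category in toy_docs]
print(summary)  # [('neg', 5), ('pos', 3)]
```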


In [78]:

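# Note: doc[0] is the word list, so doc[0][1] is the review's second token and len(doc[0][0])
# is the length of its first token. Use doc[1] and len(doc[0]) for the label and word count.
for doc in docs:
    print('Category:', doc[0][1],', Content length:',len(doc[0][0]))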

Category: : , Content length: 4
Category: happy , Content length: 3
Category: is , Content length: 2
Category: quest , Content length: 1
Category: : , Content length: 8
Category: : , Content length: 7
Category: ask , Content length: 2
Category: ' , Content length: 4
Category: it , Content length: 4
Category: : , Content length: 4
Category: remembered , Content length: 4
Category: garofalo , Content length: 7
Category: now , Content length: 3
Category: movie , Content length: 1
Category: was , Content length: 3
Category: carpenter , Content length: 4
Category: ' , Content length: 1
Category: what , Content length: 2
Category: law , Content length: 3
Category: joe , Content length: 6
Category: spawn , Content length: 1
Category: in , Content length: 1
Category: knock , Content length: 1
Category: snake , Content length: 1
Category: the , Content length: 7
Category: might , Content length: 3
Category: loves , Content length: 7
Category: games , Content length: 8
Category: follow , Content