<a href="https://colab.research.google.com/github/Natural-Language-Processing-YU/Exercises/blob/main/M3_Exercise_Naive_Bayes_Classifier.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Exercise: Naive Bayes Classification in NLTK

In this exercise, we train a Naive Bayes sentiment classifier on the movie reviews corpus that ships with NLTK. Each review in the corpus belongs to one of two categories, `pos` or `neg`. We start by installing NLTK, downloading the corpus, and importing the modules we need.

In [15]:
!pip install nltk
import nltk

nltk.download('movie_reviews')

import random
from nltk.corpus import movie_reviews
from nltk.classify import NaiveBayesClassifier
from nltk.classify.util import accuracy

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Package movie_reviews is already up-to-date!


In [16]:
# Example movie reviews
# Slicing with [10:] would print the remaining 990 positive reviews and
# exceed the notebook's output rate limit, so we print only two.
positive_reviews = movie_reviews.fileids('pos')[10:12]
print("Positive Reviews:")
for review_id in positive_reviews:
    review = movie_reviews.raw(review_id)
    print(review)

Positive Reviews:
after watching " rat race " last week , i noticed my cheeks were sore and realized that , when not laughing aloud , i had held a grin for virtually all of the film's 112 minutes . 
saturday night , i attended another sneak preview for the movie and damned if i didn't enjoy it as much the second time as the first . 
 " rat race " is a great goofy delight , a dandy mix of energetic performances , inspired sight gags and flat-out silliness . 
hands down , this is the most fun film of the summer . 
the movie begins with zippy retro-style opening credits that look like they were torn straight out of a '60s slapstick comedy , featuring animated photos of the cast attached to herky-jerky bodies bounding around the screen . 
then comes the setup . 
donald sinclair ( john cleese ) , the extremely rich owner of the venetian hotel and casino in las vegas , enjoys concocting unusual bets for his high rolling clients . 
to that end , he places a half dozen very special tokens in h


## Prepare dataset

We pair each review's word list with its category label (`pos` or `neg`), then shuffle the resulting documents so that the training and test splits each contain a mix of both classes.
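To illustrate why the shuffle matters, here is a minimal sketch on toy data rather than the real corpus. It assumes the documents are listed one category after the other, as happens when the list is built category by category. The toy `documents` list, the seed value 42, and the 15/5 split are illustrative assumptions, not part of the exercise.

```python
import random

# Toy stand-in for the (words, category) pairs: every 'neg' item comes
# before any 'pos' item, mimicking a list built category by category.
documents = [(["word"], "neg")] * 10 + [(["word"], "pos")] * 10

# Without shuffling, the tail of the list holds a single class.
unshuffled_test = documents[15:]
unshuffled_labels = {label for _, label in unshuffled_test}

# A dedicated, seeded generator makes the shuffle repeatable across runs.
rng = random.Random(42)  # the seed value is an arbitrary choice
shuffled = documents[:]  # copy, so the original order stays available
rng.shuffle(shuffled)

train_docs, test_docs = shuffled[:15], shuffled[15:]
print("labels in unshuffled test split:", unshuffled_labels)
print("train size:", len(train_docs), "test size:", len(test_docs))
```

Seeding is optional; it only makes the split, and therefore the reported accuracy, reproducible from run to run.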

Next, we create feature sets. A feature set transforms raw data, such as a text document, into a format that a machine learning algorithm can take as input. For text classification with Naive Bayes, a feature set typically records the presence or absence of particular words or other linguistic features in a document.
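As a small illustration of a presence/absence feature set, the sketch below builds one for a toy review. The three-word `vocabulary`, the helper name `presence_features`, and the toy review are hypothetical; the exercise itself uses a 2000-word vocabulary.

```python
# Hypothetical three-word vocabulary; the exercise uses 2000 words.
vocabulary = ["great", "boring", "plot"]


def presence_features(tokens, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs."""
    token_set = set(tokens)  # a set makes each membership test cheap
    return {"contains({})".format(word): (word in token_set) for word in vocabulary}


toy_review = ["a", "great", "film", "with", "a", "great", "plot"]
toy_features = presence_features(toy_review, vocabulary)
print(toy_features)
```

Note that the features are binary: "great" occurs twice in the toy review, but the feature only records that it is present.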

In [24]:

# Prepare the dataset: one (word list, category) pair per review.
# The fileids argument of words() selects the files whose words are returned.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# Shuffle the documents
random.shuffle(documents)

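# Extract features from the documents (creating the vocabulary).
# Note: keys() follows first-seen order, so this slice takes the first 2000
# distinct words encountered, not the 2000 most frequent ones;
# all_words.most_common(2000) would select by frequency instead.
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = list(all_words.keys())[:2000]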


def document_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains({})'.format(word)] = (word in document_words)
    return features

# Create feature sets
featuresets = [(document_features(d), c) for (d, c) in documents]
# Inspect one (features, category) pair
print(featuresets[1])

# Split the dataset into training (first 1500) and testing (remaining 500) sets
train_set = featuresets[:1500]
test_set = featuresets[1500:]



({'contains(plot)': False, 'contains(:)': True, 'contains(two)': False, 'contains(teen)': False, 'contains(couples)': False, 'contains(go)': True, 'contains(to)': True, 'contains(a)': True, 'contains(church)': False, 'contains(party)': False, 'contains(,)': True, 'contains(drink)': False, 'contains(and)': True, 'contains(then)': True, 'contains(drive)': True, 'contains(.)': True, 'contains(they)': True, 'contains(get)': True, 'contains(into)': False, 'contains(an)': True, 'contains(accident)': False, 'contains(one)': True, 'contains(of)': True, 'contains(the)': True, 'contains(guys)': False, 'contains(dies)': False, 'contains(but)': True, 'contains(his)': False, 'contains(girlfriend)': False, 'contains(continues)': False, 'contains(see)': True, 'contains(him)': False, 'contains(in)': True, 'contains(her)': True, 'contains(life)': False, 'contains(has)': False, 'contains(nightmares)': False, 'contains(what)': True, "contains(')": True, 'contains(s)': True, 'contains(deal)': False, '
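The feature names printed above (`plot`, `:`, `two`, `teen`, ...) look like the opening words of a single review rather than the most common words in the corpus. That is consistent with `list(all_words.keys())[:2000]` selecting words in first-seen order. The sketch below shows the difference on a toy token stream, using `collections.Counter` as a stand-in on the assumption that `nltk.FreqDist` behaves like a `Counter` here.

```python
from collections import Counter

# Toy token stream: 'rare' appears first but only once.
tokens = ["rare", "the", "film", "the", "film", "the"]
counts = Counter(tokens)

# keys() preserves insertion order, so slicing it keeps the earliest words.
first_seen = list(counts.keys())[:2]

# most_common(n) returns the n highest-count (word, count) pairs.
most_frequent = [word for word, _ in counts.most_common(2)]

print("first seen:   ", first_seen)
print("most frequent:", most_frequent)
```

If a frequency-based vocabulary is wanted, the analogous change in the exercise would be to build `word_features` from `all_words.most_common(2000)`; this would change the printed features and possibly the accuracy reported later.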

## Train and Test

In [19]:

# Train the classifier
classifier = NaiveBayesClassifier.train(train_set)

# Test the classifier
# Use a separate name so the imported accuracy() function is not overwritten.
test_accuracy = accuracy(classifier, test_set)
print("Accuracy:", test_accuracy)

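# Example usage
# Note: split() keeps "The" capitalised and "fantastic!" joined to its
# punctuation, whereas the vocabulary holds lowercase corpus tokens.
new_review = "The movie was fantastic!"
features = document_features(new_review.split())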
predicted_class = classifier.classify(features)
print("Predicted class:", predicted_class)

Accuracy: 0.78
Predicted class: neg
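To make the training and accuracy steps less opaque, here is a simplified sketch of a Naive Bayes classifier over boolean feature dictionaries. It is not NLTK's implementation: it uses add-one smoothing as an assumption (NLTK's default estimator differs), and the function names and toy data are hypothetical.

```python
import math
from collections import defaultdict


def train_naive_bayes(labeled_featuresets):
    """Count labels and (label, feature name, feature value) occurrences."""
    label_counts = defaultdict(int)
    value_counts = defaultdict(int)
    for features, label in labeled_featuresets:
        label_counts[label] += 1
        for name, value in features.items():
            value_counts[(label, name, value)] += 1
    return label_counts, value_counts


def classify(features, label_counts, value_counts):
    """Return the label with the highest log posterior score."""
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = math.log(count / total)  # log prior P(label)
        for name, value in features.items():
            # Add-one smoothing over the two possible values (True/False).
            seen = value_counts.get((label, name, value), 0)
            score += math.log((seen + 1) / (count + 2))
        scores[label] = score
    return max(scores, key=scores.get)


toy_train = [
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
]
toy_test = [
    ({"contains(great)": True, "contains(dull)": False}, "pos"),
    ({"contains(great)": False, "contains(dull)": True}, "neg"),
]

label_counts, value_counts = train_naive_bayes(toy_train)
# Accuracy is the fraction of test items whose predicted label matches the gold label.
correct = sum(classify(f, label_counts, value_counts) == gold for f, gold in toy_test)
toy_accuracy = correct / len(toy_test)
print("Toy accuracy:", toy_accuracy)
```

Working in log space avoids numerical underflow when many per-feature probabilities are multiplied together, which matters once there are thousands of features.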


## Classify with an example

In [20]:
# Classify a second example review
new_review = "The acting was great"
features = document_features(new_review.split())
predicted_class = classifier.classify(features)
print("Predicted class:", predicted_class)


Predicted class: neg
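Both example reviews sound positive yet are classified as `neg`. One possible contributor is a tokenization mismatch: the corpus tokens are lowercase with punctuation split off (hence features such as `contains(,)` and `contains(s)`), while `new_review.split()` keeps capital letters and attached punctuation. The sketch below shows one way to normalize input text; the helper `normalize_tokens` is hypothetical, and whether it changes the predictions above has not been verified here.

```python
import re


def normalize_tokens(text):
    """Lowercase the text and split words from punctuation marks."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", text.lower())


naive_tokens = "The movie was fantastic!".split()
normalized_tokens = normalize_tokens("The movie was fantastic!")

print(naive_tokens)
print(normalized_tokens)
```

In the exercise, this would be used as `document_features(normalize_tokens(new_review))` in place of `document_features(new_review.split())`.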
