<a href="https://colab.research.google.com/gist/Melvinchen0404/161b4997d388409653a52e4b1dc74dfe/sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## NLP Technique 4: Sentiment Analysis
**STEP 1:** We may practise **sentiment analysis** by importing the `movie_reviews` corpus \
**STEP 2:** The `movie_reviews.categories()` function will return two categories: `neg` (for **negative**) and `pos` (for **positive**)

Sources: \
https://medium.com/@joel_34096/sentiment-analysis-of-movie-reviews-in-nltk-python-4af4b76a6f3

In [2]:
import nltk
nltk.download('punkt')

class color:
   BOLD = '\033[1m'
   END = '\033[0m'

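nltk.download('movie_reviews')
# Import the corpus reader; it is used throughout the rest of this notebook
from nltk.corpus import movie_reviews
movie_reviews.categories()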

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package movie_reviews to /root/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.


['neg', 'pos']

**STEP 3:** We can get the file IDs of the reviews labelled **positive** (`pos`). Each file ID identifies one review, so counting the file IDs gives the number of `pos` reviews. Here we count them with the `FreqDist` class: because every file ID is unique, the number of samples (distinct IDs) equals the number of outcomes (total IDs)

In [4]:
from nltk.probability import FreqDist
# File IDs of all reviews labelled 'pos'
positive_ids = movie_reviews.fileids('pos')
number_of_good_films = FreqDist(positive_ids)
print(color.BOLD + 'Number of distinct positive reviews (samples) and total number of positive reviews (outcomes): \n' + color.END, number_of_good_films)

Number of distinct positive reviews (samples) and total number of positive reviews (outcomes): 
 <FreqDist with 1000 samples and 1000 outcomes>


**STEP 4:** We can do the same for the reviews labelled **negative** (`neg`), again relying on the `FreqDist` class

In [5]:
# File IDs of all reviews labelled 'neg'
negative_ids = movie_reviews.fileids('neg')
number_of_bad_films = FreqDist(negative_ids)
print(color.BOLD + 'Number of distinct negative reviews (samples) and total number of negative reviews (outcomes): \n' + color.END, number_of_bad_films)

Number of distinct negative reviews (samples) and total number of negative reviews (outcomes): 
 <FreqDist with 1000 samples and 1000 outcomes>

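The per-category counts in STEP 3 and STEP 4 can also be obtained without `FreqDist`, by counting the category prefix of each file ID. The sketch below is illustrative only: `toy_file_ids` is a made-up stand-in for `movie_reviews.fileids()`, and `count_by_category` is a hypothetical helper, not part of NLTK.

```python
from collections import Counter

# Made-up file IDs in the same 'category/name.txt' shape as the corpus (assumption).
toy_file_ids = [
    "neg/cv000_example.txt",
    "neg/cv001_example.txt",
    "pos/cv000_example.txt",
]

def count_by_category(file_ids):
    """Count file IDs per category, taking the category from the path prefix."""
    return Counter(file_id.split("/", 1)[0] for file_id in file_ids)

category_counts = count_by_category(toy_file_ids)
print(category_counts["neg"], category_counts["pos"])
```

With the real corpus, the same idea should report 1,000 file IDs for each of `neg` and `pos`, in line with the outputs above.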

**STEP 5:** We can extract the text of a `pos` or `neg` review by generating a random index between 0 and 999 (each category holds 1,000 reviews) and making a random choice between the sentiments (`pos` or `neg`)

In [6]:
import random
# randint includes both endpoints, so the upper bound must be 999 (the last valid index)
value = random.randint(0, 999)
print(color.BOLD + 'Randomized index of review (between 0-999): \n' + color.END, value)

sentiment = random.choice(['pos', 'neg'])
print(color.BOLD + 'Randomly chosen sentiment (pos or neg): \n' + color.END, sentiment)

Randomized index of review (between 0-999): 
 553
Randomly chosen sentiment (pos or neg): 
 pos


**STEP 6:** We can now look up the file ID of the review at this random index within the randomly chosen sentiment (`pos` or `neg`)

In [7]:
print(color.BOLD + 'File ID for review with the randomly chosen sentiment (pos or neg):' + color.END)
movie_reviews.fileids(sentiment)[value]

File ID for review with the randomly chosen sentiment (pos or neg):


'pos/cv553_26915.txt'
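
Picking a random item from a list is easy to get wrong at the upper bound. The sketch below shows two ways that stay inside the list by construction. It uses made-up file IDs (`toy_file_ids`) in place of `movie_reviews.fileids(sentiment)`, and the seed is there only to make the sketch repeatable.

```python
import random

# Stand-in for movie_reviews.fileids(sentiment): 1,000 made-up file IDs (assumption).
toy_file_ids = [f"pos/cv{i:03d}_example.txt" for i in range(1000)]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# randrange(n) returns an index in 0..n-1, so it can never run past the end
# of the list, unlike randint(0, n), which may return n itself.
index = rng.randrange(len(toy_file_ids))
file_id_by_index = toy_file_ids[index]

# random.choice picks an element directly when the index itself is not needed.
file_id_by_choice = rng.choice(toy_file_ids)
print(index, file_id_by_index, file_id_by_choice)
```

Deriving the bound from `len(...)` rather than hard-coding it keeps the lookup valid if the number of reviews per category ever differs.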

**STEP 7:** We should be able to extract the sentences from this review using the relevant file ID. We can verify manually whether the review is `pos` or `neg`

In [8]:
print(color.BOLD + 'Text from this file:' + color.END)
movie_reviews.sents(movie_reviews.fileids(sentiment)[value])

Text from this file:


[['bruce', 'willis', 'and', 'sixth', 'sense', 'director', 'm', '.', 'night', 'shyamalan', 're', '-', 'team', 'to', 'tell', 'the', 'story', 'of', 'david', 'dunne', '(', 'willis', ')', ',', 'a', 'stadium', 'security', 'guard', 'who', 'has', 'been', 'having', 'some', 'problems', 'at', 'home', 'that', 'are', 'affecting', 'his', 'relationship', 'with', 'his', 'wife', 'and', 'child', '.'], ['on', 'a', 'return', 'trip', 'from', 'new', 'york', 'where', 'he', 'was', 'trying', 'to', 'get', 'a', 'job', ',', 'dunne', 'is', 'in', 'a', 'horrible', 'train', 'accident', 'that', 'he', 'is', 'the', 'only', 'survivor', 'of', '.'], ...]

**STEP 8:** `all_words` is a `FreqDist`, a dictionary-like object that maps each word in the `movie_reviews` corpus to its **frequency**. `len(all_words)` yields the total number of distinct words in this corpus

In [9]:
all_words = nltk.FreqDist(movie_reviews.words())
print(color.BOLD + 'Number of distinct words in the movie_reviews corpus:' + color.END)
len(all_words)

Number of distinct words in the movie_reviews corpus:


39768

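**STEP 9:** Let us define a `feature_vector` that contains the first 4,000 words of `all_words`. Iterating over a `FreqDist` yields words in order of decreasing frequency (consistent with the outputs below, which start with `,`, `the` and `.`), so these are the 4,000 most frequent words. This is c. 10% of the distinct words in `all_words` (39,768), but we will accept this constraint to reduce computational costs.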

In [10]:
feature_vector = list(all_words)[:4000]
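
The selection of the most frequent words can also be written explicitly with `most_common`, which does not rely on iteration order. The sketch below uses the standard-library `Counter` on a made-up token list (`toy_words`) instead of the corpus. Since `FreqDist` subclasses `Counter`, the same call should be available on `all_words`, but that is left as an assumption here.

```python
from collections import Counter

# Toy token stream standing in for movie_reviews.words() (assumption).
toy_words = ["the", "film", "is", "good", ".", "the", "plot", "is", "thin", ".", "the", "end"]

toy_counts = Counter(toy_words)

# most_common(n) returns (word, count) pairs sorted by decreasing count,
# so the result does not depend on how the container iterates.
vocabulary_size = 3
toy_feature_vector = [word for word, _ in toy_counts.most_common(vocabulary_size)]
print(toy_feature_vector)
```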

**STEP 10:** We can manually analyze the sentiment of the review selected in **STEP 6** (random index, randomly chosen sentiment). The code below lists the words in `feature_vector` that also occur in this review; some of them may suggest or convey the chosen sentiment (`pos` or `neg`)

In [11]:
print(color.BOLD + 'Recall randomly chosen sentiment (pos or neg) for review: \n' + color.END, sentiment)
# Initialization
feature = {}
# One movie review is chosen
review = movie_reviews.words(movie_reviews.fileids(sentiment)[value])
# True is assigned if a word in feature_vector also occurs in the review, otherwise False
for word in feature_vector:
    feature[word] = word in review
# List the words that are assigned True
[word for word in feature_vector if feature[word]]

Recall randomly chosen sentiment (pos or neg) for review: 
 pos


[',',
 'the',
 '.',
 'a',
 'and',
 'of',
 'to',
 "'",
 'is',
 'in',
 's',
 '"',
 'it',
 'that',
 '-',
 ')',
 '(',
 'as',
 'with',
 'for',
 'his',
 'this',
 'film',
 'i',
 'he',
 'but',
 'on',
 'are',
 't',
 'by',
 'be',
 'one',
 'movie',
 'an',
 'who',
 'not',
 'you',
 'from',
 'at',
 'was',
 'have',
 'they',
 'has',
 'her',
 'all',
 '?',
 'there',
 'like',
 'so',
 'out',
 'up',
 'more',
 'what',
 'when',
 'which',
 'she',
 'some',
 'just',
 'can',
 'if',
 'we',
 'him',
 'into',
 'even',
 'only',
 'good',
 'time',
 'most',
 'its',
 'will',
 'story',
 'would',
 'been',
 'much',
 'character',
 'also',
 'get',
 'other',
 'do',
 'two',
 'well',
 'very',
 'characters',
 'first',
 'see',
 '!',
 'way',
 'make',
 'life',
 'any',
 'does',
 'really',
 'had',
 'how',
 'where',
 'could',
 'scene',
 'bad',
 'never',
 'best',
 'new',
 'doesn',
 'scenes',
 'many',
 'director',
 'such',
 'were',
 'here',
 'great',
 're',
 'another',
 'love',
 'go',
 'made',
 'something',
 'back',
 'still',
 'world',
 

**STEP 11:** We can now use **machine learning** techniques for **sentiment analysis**. Let us define `document` as a **list** of tuples, each pairing the **words** of one movie review with the **category** of that review (`neg` or `pos`) \
**STEP 12:** We can then define a function `find_feature(word_list)` that builds the features for one review. For each word in `feature_vector`, `True` is assigned if the word also occurs in the review; otherwise, `False` is assigned.

In [12]:
document = [(movie_reviews.words(file_id),category) for file_id in movie_reviews.fileids() for category in movie_reviews.categories(file_id)]

In [13]:
# Define a function that finds the features
def find_feature(word_list):
    # Initialization
    feature = {}
    # True is assigned if a word in feature_vector also occurs in the review, otherwise False
    for x in feature_vector:
        feature[x] = x in word_list
    return feature
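
The membership test `x in word_list` rescans the review for every word in `feature_vector`. A possible variant, sketched below on made-up data, converts the review to a `set` once so that each lookup becomes a hash lookup. `find_feature_fast`, `toy_feature_vector` and `toy_review_words` are hypothetical names introduced for this illustration; they are not part of the original notebook.

```python
# Hypothetical stand-ins for feature_vector and one review's word list (assumption).
toy_feature_vector = ["good", "bad", "plot", "."]
toy_review_words = ["the", "plot", "is", "good", "."]

def find_feature_fast(word_list, vocabulary):
    """Map each vocabulary word to True/False depending on whether it occurs in word_list."""
    # Building the set scans the review once; each later lookup is a hash
    # lookup instead of another scan of the review.
    words_in_review = set(word_list)
    return {word: word in words_in_review for word in vocabulary}

toy_features = find_feature_fast(toy_review_words, toy_feature_vector)
print(toy_features)
```

The returned dictionary has the same shape as the one produced by `find_feature`, so it could be used in the same way when building the features of every review.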

**STEP 13:** Check that the `find_feature` function works as expected. Applied to the word list of the first movie review, it should return a dictionary that maps each word in `feature_vector` to `True` or `False` \
**STEP 14:** Create `feature_sets`, a list that stores the `feature` dictionary of every review together with its category. More generally, a **feature set** is a logical group in which **features** can be stored: it takes data from offline or online sources, builds a **list of features** through a set of transformations, and stores the resulting **features** along with the associated metadata (e.g., labels) \
Creating `feature_sets` may take an hour: each of the 2,000 reviews is checked against 4,000 words, and every `x in word_list` test scans the review

In [14]:
# Check that the function ‘find_feature’ works adequately
find_feature(document[0][0])

{',': True,
 'the': True,
 '.': True,
 'a': True,
 'and': True,
 'of': True,
 'to': True,
 "'": True,
 'is': True,
 'in': True,
 's': True,
 '"': True,
 'it': True,
 'that': True,
 '-': True,
 ')': True,
 '(': True,
 'as': True,
 'with': True,
 'for': True,
 'his': True,
 'this': True,
 'film': True,
 'i': True,
 'he': True,
 'but': True,
 'on': True,
 'are': True,
 't': True,
 'by': True,
 'be': True,
 'one': True,
 'movie': True,
 'an': True,
 'who': True,
 'not': True,
 'you': True,
 'from': True,
 'at': False,
 'was': False,
 'have': True,
 'they': True,
 'has': True,
 'her': True,
 'all': True,
 '?': True,
 'there': True,
 'like': True,
 'so': True,
 'out': True,
 'about': True,
 'up': True,
 'more': True,
 'what': True,
 'when': True,
 'which': True,
 'or': True,
 'she': False,
 'their': False,
 ':': True,
 'some': False,
 'just': True,
 'can': False,
 'if': False,
 'we': True,
 'him': True,
 'into': True,
 'even': True,
 'only': True,
 'than': False,
 'no': True,
 'good': True,


In [None]:
# feature_sets stores the feature dictionary and the category of every review
feature_sets = [(find_feature(word_list),category) for (word_list,category) in document]

**STEP 15:** Create a **machine learning model**. The necessary wrapper (`SklearnClassifier`, from NLTK's `nltk.classify.scikitlearn` module), the classifier (`SVC`, or **Support Vector Classifier**, from scikit-learn) and scikit-learn's `model_selection` module are imported

In [None]:
from nltk.classify.scikitlearn import SklearnClassifier
from sklearn.svm import SVC
from sklearn import model_selection

**STEP 16:** We can split the dataset (`feature_sets`) into a **training set** (70%) and a **test set** (30%). We can also check the size of the **training set** and **test set** by printing the `len()` of each. Because the split is random, the resulting accuracy may vary between runs

In [None]:
# Split into training and testing sets
train_set,test_set = model_selection.train_test_split(feature_sets,test_size = 0.3)

print(color.BOLD + 'Size of training set:' + color.END, len(train_set))
print(color.BOLD + 'Size of test set:' + color.END, len(test_set))

Size of training set: 1400
Size of test set: 600

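A split is only repeatable if the shuffle is seeded. The sketch below illustrates the idea with the standard library on made-up data. `split_dataset` and `toy_feature_sets` are hypothetical names, and this is not how scikit-learn implements its split; `train_test_split` accepts a `random_state` argument for the same purpose.

```python
import random

# Toy labelled data standing in for feature_sets: (features, category) pairs (assumption).
toy_feature_sets = [({"good": i % 2 == 0}, "pos" if i % 2 == 0 else "neg") for i in range(10)]

def split_dataset(items, test_fraction, seed):
    """Shuffle a copy of items with a fixed seed and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    test_size = round(len(shuffled) * test_fraction)
    return shuffled[test_size:], shuffled[:test_size]

toy_train, toy_test = split_dataset(toy_feature_sets, test_fraction=0.3, seed=42)
print(len(toy_train), len(toy_test))
```

Calling `split_dataset` again with the same seed returns the same partition, which makes accuracy figures comparable across runs.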

**STEP 17:** Train the **machine learning** model on the **training set**

In [None]:
# Train the model on training data
model = SklearnClassifier(SVC(kernel = 'linear'))
model.train(train_set)

<SklearnClassifier(SVC(kernel='linear'))>

**STEP 18:** Test the trained **machine learning** model on the **test set** \
The **test set accuracy** can be further improved by: 
*   Choosing a more appropriate `feature_vector`;
*   Increasing the size of the `feature_vector` (we have accepted the constraint of a mere 4,000 words to reduce computational costs);
*   Tuning the SVM hyperparameters;
*   Combining multiple classification algorithms.

`nltk.classify.accuracy` returns the accuracy as a **floating-point number** between 0 and 1, so multiplying it by 100 yields `percentage_accuracy` directly; no conversion through a **string** is needed

In [None]:
# Test the trained machine learning model on testing data and calculate the accuracy of classification
accuracy = nltk.classify.accuracy(model, test_set)
percentage_accuracy = accuracy * 100
print(color.BOLD + 'SVC Accuracy:' + color.END, percentage_accuracy, '%')

SVC Accuracy: 83.5 %

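Accuracy here is the fraction of test items whose predicted label matches the true label. The sketch below computes it by hand on made-up labels, with no classifier involved. `classification_accuracy`, `gold_labels` and `predicted_labels` are hypothetical names used only for illustration.

```python
# Made-up gold labels and predictions (assumption); no real classifier is involved.
gold_labels = ["pos", "neg", "pos", "pos", "neg"]
predicted_labels = ["pos", "neg", "neg", "pos", "neg"]

def classification_accuracy(gold, predicted):
    """Return the fraction of positions where the prediction equals the gold label."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)

toy_accuracy = classification_accuracy(gold_labels, predicted_labels)
# Format the percentage when printing rather than converting through a string.
print(f"Accuracy: {toy_accuracy * 100:.1f} %")
```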