# Natural Language Processing - Assignment 2
# Sentiment analysis for movie reviews

This notebook was created for you to answer questions 2, 3 and 4 of assignment 2. Please read the steps and the provided code carefully and make sure you understand them. 

The (red) comments at the beginning of each function explain what the function should do, which parameters it takes as input and which variables it should return. Write your own code after the (green) comments "### student code here###".

**Please modify the next cell specifying your group number**

 *This is the Notebook of* ***Group 35*** 




### Prerequisite - Libraries
Make sure the required libraries are installed on your computer: scikit-learn, pandas, NumPy and NLTK.

### Prerequisite - Load Data

In the first step, we are going to load the data in a Pandas DataFrame. Pandas DataFrames are a useful way of storing data. DataFrames are tables in which data can be accessed as columns, as rows or as individual cells. You can find more info on DataFrames here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
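As a quick illustration of the three access patterns mentioned above, here is a minimal sketch on a hypothetical toy DataFrame (the rows are made up and are not taken from the movie review dataset):

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Review": ["great film", "terrible plot", "loved it"],
    "Label": ["Pos", "Neg", "Pos"],
})

labels = toy["Label"]               # one column, returned as a Series
second_row = toy.iloc[1]            # one row, selected by position
first_review = toy.at[0, "Review"]  # one cell, by row label and column name

print(labels.tolist())
print(second_row["Label"])
print(first_review)
```

The same column-style access (`data['Review']`, `data['Label']`) is what the rest of this notebook relies on.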

Read the code below and make sure you understand what is happening. Run the code to load your data.

In [2]:
import os
import re
import pandas as pd
import numpy as np
import glob
import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

[nltk_data] Downloading package punkt to /home/narrietal/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In [3]:
def get_path(filename):
    """
    Makes a list of all the paths that match the search pattern
    
    :param filename: A glob pattern (shell-style wildcards, not a full regular expression) for the filenames
    :return  A list of all the matching pathnames
    """
    # place the movies folder in the same directory as this notebook
    current_directory = os.getcwd()
    # if you are using Google Colab, you will have to change the above line
    # to load the dataset from your Google Drive

    # glob.glob() is a pattern-matching path finder: it searches the movies folder for reviews whose names match the glob pattern
    paths = glob.glob(current_directory + '/movies/' + filename)
    
    if len(paths) == 0:
        print('Your file list is empty. The code looks for the folder '+current_directory+'/movies, but could not find it.')
    else: 
        print("Found ", len(paths), "files")
    return paths

In [4]:
def load_data(pathset):
    """
    Loads the training data into a dataframe
    
    :param pathset:  A list of paths
    :return  A dataframe with three columns: Path, Review (Text) and Label
    """
    # Files are named by sentiment (P for positive, N for negative)
    # The dot is escaped so that it only matches a literal '.'
    pattern = re.compile(r'P-train[0-9]*\.txt')
    reviews = []
    labels = []
    for path in pathset:
        # Read the review text; the with-block closes the file afterwards
        with open(path, "r") as f:
            reviews.append(f.read())
        # The filename decides the label
        labels.append('Pos' if pattern.search(path) else 'Neg')
    df = pd.DataFrame({'Path': pathset, 'Review': reviews, 'Label': labels})
    return df

In [5]:
def load_test_data(pathset):
    """
    Loads the test data into a dataframe
    
    :param pathset:  A list of paths
    :return  A dataframe with three columns: Path, Review (Text) and Label
    """
    # Files are named by sentiment (P for positive, N for negative)
    # The dot is escaped so that it only matches a literal '.'
    pattern = re.compile(r'P-test[0-9]*\.txt')
    reviews = []
    labels = []
    for path in pathset:
        # Read the review text; the with-block closes the file afterwards
        with open(path, "r") as f:
            reviews.append(f.read())
        # The filename decides the label
        labels.append('Pos' if pattern.search(path) else 'Neg')
    df = pd.DataFrame({'Path': pathset, 'Review': reviews, 'Label': labels})
    return df

In [6]:
# Load the files into the DataFrame. This may take a while...
paths = get_path('train/[NP]-train[0-9]*.txt')
data = load_data(paths)
data.head()

Found  600 files


| | Path | Review | Label |
|---|---|---|---|
| 0 | /home/narrietal/Documents/UT/NLP/HW/HW_2/movie... | with that, carry the same dark weaknesses we a... | Pos |
| 1 | /home/narrietal/Documents/UT/NLP/HW/HW_2/movie... | This film, which I rented under the title "Bla... | Neg |
| 2 | /home/narrietal/Documents/UT/NLP/HW/HW_2/movie... | K Murli Mohan Rao made the much better BANDHAN... | Neg |
| 3 | /home/narrietal/Documents/UT/NLP/HW/HW_2/movie... | That snarl...\n\nThat scowl...\n\nThe acts of ... | Neg |
| 4 | /home/narrietal/Documents/UT/NLP/HW/HW_2/movie... | This movie was astonishing how good it was! Th... | Pos |


### Part 2 - Tokenization

In this step, you should write a tokenizer and compare it with an off-the-shelf one.

#### Question 2.1 Making your own tokenizer

In [37]:
def my_tokenizer(text):
    """
    The implementation of your own tokenizer
    
    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """    
    
    tokenized_text = re.findall(r"[\w]+|[^\s\w]", text)
    
    return tokenized_text

sample_string0 = "If you have the chance, watch it. Although, a warning, you'll cry your eyes out."
sample_string1 = "I hope this email finds you well Anna. I know it's been tough times."
sample_string2 = "Hey Thomas! How is it going? I thought you were studying abroad." 
sample_string3 = "Please, mind the gap between the train and the platform."
print(my_tokenizer(sample_string0))
print(my_tokenizer(sample_string1))
print(my_tokenizer(sample_string2))
print(my_tokenizer(sample_string3))

['If', 'you', 'have', 'the', 'chance', ',', 'watch', 'it', '.', 'Although', ',', 'a', 'warning', ',', 'you', "'", 'll', 'cry', 'your', 'eyes', 'out', '.']
['I', 'hope', 'this', 'email', 'finds', 'you', 'well', 'Anna', '.', 'I', 'know', 'it', "'", 's', 'been', 'tough', 'times', '.']
['Hey', 'Thomas', '!', 'How', 'is', 'it', 'going', '?', 'I', 'thought', 'you', 'were', 'studying', 'abroad', '.']
['Please', ',', 'mind', 'the', 'gap', 'between', 'the', 'train', 'and', 'the', 'platform', '.']
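The regular expression in `my_tokenizer` has two alternatives: `[\w]+` matches a run of word characters, and `[^\s\w]` matches a single character that is neither whitespace nor a word character. Whitespace is therefore dropped, and every punctuation mark becomes its own token, which is why contractions such as "you'll" are split into three pieces. One possible refinement is sketched below. It is only an illustration under the assumption that contractions and three-dot ellipses should stay together; `contraction_tokenizer` is a hypothetical name, and the function does not reproduce NLTK's behaviour.

```python
import re

def contraction_tokenizer(text):
    """Hypothetical variant of my_tokenizer, shown only for illustration."""
    # 1) a word, optionally followed by apostrophe + word parts ("you'll")
    # 2) a three-dot ellipsis kept as one token
    # 3) any other single non-space, non-word character
    pattern = r"\w+(?:'\w+)*|\.{3}|[^\s\w]"
    return re.findall(pattern, text)

print(contraction_tokenizer("You'll cry... a lot."))
```

Note that this sketch only handles the straight apostrophe `'`; typographic apostrophes would still be split off as separate tokens.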


#### Question 2.2 Using an off-the-shelf tokenizer

In [8]:
# Now we are going to compare the tokenizer you just wrote with the one from NLTK.
# The 'punkt' models it needs were downloaded by nltk.download('punkt') in the import cell above.
def nltk_tokenizer(text):
    """
    This function should apply the word_tokenize (punkt) tokenizer of nltk to the input text
    
    :param text:  A string with a sentence (or paragraph, or document...)
    :return  A list of tokens
    """     
      
    tokenized_text = word_tokenize(text)
    
    
    return tokenized_text

test_sentences = ["I like this assignment because:\n-\tit is fun;\n-\tit helps me practice my Python skills.",
        "I won a prize, but I won't be able to attend the ceremony.",
        "“The strange case of Dr. Jekyll and Mr. Hyde” is a famous book... but I haven't read it.",
        "I work for the C.I.A.. And you?",
        "OMG #Twitter is sooooo coooool <3 :-) <-- lol...why do i write like this idk right? :) 🤷😂 🤖"]

for test_string in test_sentences:
    print(my_tokenizer(test_string))
    print(nltk_tokenizer(test_string))
    print("\n")
    

['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']
['I', 'like', 'this', 'assignment', 'because', ':', '-', 'it', 'is', 'fun', ';', '-', 'it', 'helps', 'me', 'practice', 'my', 'Python', 'skills', '.']


['I', 'won', 'a', 'prize', ',', 'but', 'I', 'won', "'", 't', 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']
['I', 'won', 'a', 'prize', ',', 'but', 'I', 'wo', "n't", 'be', 'able', 'to', 'attend', 'the', 'ceremony', '.']


['“', 'The', 'strange', 'case', 'of', 'Dr', '.', 'Jekyll', 'and', 'Mr', '.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '.', '.', '.', 'but', 'I', 'haven', "'", 't', 'read', 'it', '.']
['“', 'The', 'strange', 'case', 'of', 'Dr.', 'Jekyll', 'and', 'Mr.', 'Hyde', '”', 'is', 'a', 'famous', 'book', '...', 'but', 'I', 'have', "n't", 'read', 'it', '.']


['I', 'work', 'for', 'the', 'C', '.', 'I', '.', 'A', '.', '.', 'And', 'you', '?']
['I', 'work', 'for', 'the', 'C.I.A', '

### Part 3 - Text classification with a unigram language model

#### Training phase
You now need to create the model and train it on the documents in the dataframe. Look at the scikit-learn documentation to learn how to use the CountVectorizer and MultinomialNB classes.
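Before applying this to the reviews, it may help to see the two classes working together on a very small example. The sketch below uses a hypothetical toy corpus (not the movie review data): `CountVectorizer` turns the texts into a document-term count matrix, and `MultinomialNB` is fitted on those counts.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus, only for illustration.
train_texts = ["good good film", "great acting", "bad film", "awful boring plot"]
train_labels = ["Pos", "Pos", "Neg", "Neg"]

vectorizer = CountVectorizer()
# fit_transform learns the vocabulary and returns a sparse count matrix
X_train = vectorizer.fit_transform(train_texts)

model = MultinomialNB()  # default alpha=1.0, i.e. Laplace smoothing
model.fit(X_train, train_labels)

# New text must be mapped onto the SAME vocabulary, so use transform, not fit_transform
X_new = vectorizer.transform(["good acting"])
print(model.predict(X_new))
```

The important design point is the split between `fit_transform` (training data) and `transform` (any later data), so that both matrices share the same columns.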

In [9]:
count_vector_standard = CountVectorizer()
count_vector_stop_words = CountVectorizer(stop_words="english")
count_vector_bigram = CountVectorizer(ngram_range=(2,2))
count_vector_trigram = CountVectorizer(ngram_range=(3,3))


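def vectorize(reviews, stop_words=False, test=False, ngram_type='unigram'):
    """
    Turns a column of reviews into a dataframe of n-gram counts

    :param reviews:  A pandas Series with one review per row
    :param stop_words:  If True, use the unigram vectorizer that removes English stop words (ngram_type is then ignored)
    :param test:  If True, reuse the vocabulary fitted on the training data instead of fitting a new one
    :param ngram_type:  'unigram', 'bigram' or 'trigram'
    :return  A dataframe with one row per review and one column per feature
    """
    reviews_array = reviews.to_numpy().flatten()

    # Pick the vectorizer that matches the requested configuration
    if stop_words:
        count_vector = count_vector_stop_words
    elif ngram_type == 'bigram':
        count_vector = count_vector_bigram
    elif ngram_type == 'trigram':
        count_vector = count_vector_trigram
    else:
        count_vector = count_vector_standard

    # Fit the vocabulary on training data only; test data is mapped onto it
    if test:
        count_matrix = count_vector.transform(reviews_array)
    else:
        count_matrix = count_vector.fit_transform(reviews_array)

    count_array = count_matrix.toarray()
    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2
    df = pd.DataFrame(data=count_array, columns=count_vector.get_feature_names_out())

    return df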

In [10]:
# Standard model
train_data_vectorized = vectorize(data['Review'])
NB_model = MultinomialNB().fit(train_data_vectorized, data['Label'])

#### Testing phase
Now that you have a trained model, you need to test its performance.

1. Load your test data.
2. Classify your test data using the classifier you trained before.
3. Compute the accuracy of your classifier on the test data.
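For step 3, accuracy is simply the fraction of test documents whose predicted label equals the true label, which is what scikit-learn's `accuracy_score` returns by default. A minimal pure-Python sketch, using hypothetical labels and a hypothetical helper named `accuracy`:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the prediction matches the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Hypothetical labels, only for illustration.
y_true_toy = ["Pos", "Neg", "Pos", "Neg", "Pos"]
y_pred_toy = ["Pos", "Neg", "Neg", "Neg", "Pos"]

print(accuracy(y_true_toy, y_pred_toy))  # 4 of 5 predictions match
```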

In [11]:
paths = get_path('test/[NP]-test[0-9]*.txt')
test_data = load_test_data(paths)

Found  50 files


In [12]:
test_data_vectorized = vectorize(test_data['Review'],test=True)
y_predicted = NB_model.predict(test_data_vectorized)
y_true = test_data['Label']
accuracy_score(y_true, y_predicted)

0.74

Now train two more models: one without Laplace smoothing, and one where stopwords are removed. Then test them on the same test data, and compare the performance with the results you previously obtained.
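To see what smoothing changes, recall that with add-alpha (Laplace) smoothing the likelihood of a word w in class c is estimated as (count(w, c) + alpha) / (total count in c + alpha * |V|), where |V| is the vocabulary size. The sketch below works this out by hand on hypothetical counts; `smoothed_likelihood` is an illustrative helper, not part of scikit-learn.

```python
def smoothed_likelihood(word, class_counts, vocabulary, alpha=1.0):
    """Add-alpha estimate of P(word | class) from raw word counts."""
    total = sum(class_counts.get(w, 0) for w in vocabulary)
    return (class_counts.get(word, 0) + alpha) / (total + alpha * len(vocabulary))

# Hypothetical counts, only for illustration.
vocabulary = ["good", "bad", "film", "boring"]
pos_counts = {"good": 3, "film": 2}  # "bad" and "boring" never occur in Pos

# With alpha=1 the unseen word gets a small non-zero probability: (0 + 1) / (5 + 4)
print(smoothed_likelihood("boring", pos_counts, vocabulary, alpha=1.0))
# With alpha=0 the unseen word gets probability 0, which zeroes out the whole class score
print(smoothed_likelihood("boring", pos_counts, vocabulary, alpha=0.0))
```

Without smoothing, a single test word that was never seen with a class is enough to rule that class out entirely, whatever the other words say.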

In [13]:
#Model without smoothing:
# Note: depending on the scikit-learn version, alpha=0 may be replaced by a tiny positive value (with a warning)
NB_model_no_smoothing = MultinomialNB(alpha=0).fit(train_data_vectorized, data['Label'])
y_predicted_no_smoothing = NB_model_no_smoothing.predict(test_data_vectorized)
acc_no_smooth = accuracy_score(y_true, y_predicted_no_smoothing)
print("Accuracy of model without smoothing", acc_no_smooth)

Accuracy of model without smoothing 0.58




In [14]:
#Model with stop words removed:
train_data_vectorized_stop_words = vectorize(data['Review'], stop_words=True)
test_data_vectorized_stop_words = vectorize(test_data['Review'],stop_words=True, test=True)

NB_model_stop_words = MultinomialNB().fit(train_data_vectorized_stop_words, data['Label'])
y_predicted_stop_words = NB_model_stop_words.predict(test_data_vectorized_stop_words)
acc_stop_words = accuracy_score(y_true, y_predicted_stop_words)
print("Accuracy of model with stop words removed", acc_stop_words)

Accuracy of model with stop words removed 0.74


### Part 4 - Text classification with a bigram language model

Now we will classify the same dataset again, but this time with a bigram language model. 

#### Training phase
Build a Naïve Bayes classifier that uses bigrams instead of single words.
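To make the difference between unigram and bigram features concrete, the sketch below fits two `CountVectorizer` instances on a single hypothetical sentence and lists the features each one extracts. It reads the learned `vocabulary_` attribute rather than a feature-name method, since that attribute is available across scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentence, only for illustration.
toy_review = ["this film is not good"]

unigram_vectorizer = CountVectorizer(ngram_range=(1, 1))
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
unigram_vectorizer.fit(toy_review)
bigram_vectorizer.fit(toy_review)

# vocabulary_ maps each feature string to its column index
print(sorted(unigram_vectorizer.vocabulary_))
print(sorted(bigram_vectorizer.vocabulary_))
```

The bigram feature "not good" captures a negation that the separate unigrams "not" and "good" cannot express.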


In [15]:
#Model with bigrams:
train_data_vectorized_bigrams = vectorize(data['Review'], ngram_type='bigram')
test_data_vectorized_bigrams = vectorize(test_data['Review'],test=True, ngram_type='bigram')

NB_model_bigram = MultinomialNB().fit(train_data_vectorized_bigrams, data['Label'])

#### Testing phase
As before, calculate the performance on your test data, and note the difference from the previous results.

In [16]:
y_predicted_bigram = NB_model_bigram.predict(test_data_vectorized_bigrams)
acc_bigrams = accuracy_score(y_true, y_predicted_bigram)
print("Accuracy of model with bigrams", acc_bigrams)

Accuracy of model with bigrams 0.74


### Trigrams
When I asked students how to improve the classification performance on this dataset, the first suggestion was always "use trigrams" (or even higher-order n-grams). Let's see how much of an improvement that would be by training a trigram model and testing it.
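One thing to keep in mind when moving to higher-order n-grams is sparsity: the longer the n-gram, the less likely it is that an n-gram from a test document also occurred in the training data. The sketch below illustrates this on two hypothetical sentences; `ngrams` and `coverage` are illustrative helpers, and the numbers say nothing about the movie review dataset itself.

```python
def ngrams(tokens, n):
    """All consecutive n-token windows of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def coverage(train_tokens, test_tokens, n):
    """Fraction of test n-grams that also occur in the training tokens."""
    seen = set(ngrams(train_tokens, n))
    test_grams = ngrams(test_tokens, n)
    if not test_grams:
        return 0.0
    return sum(gram in seen for gram in test_grams) / len(test_grams)

# Hypothetical sentences, only for illustration.
train_tokens = "the film was good and the acting was great".split()
test_tokens = "the acting was good but long".split()

for n in (1, 2, 3):
    print(n, coverage(train_tokens, test_tokens, n))
```

In this toy case the share of test n-grams that were seen in training drops as n grows, so a trigram model has fewer informative features to base its decision on.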

In [17]:
#Model with trigrams:
train_data_vectorized_trigrams = vectorize(data['Review'], ngram_type='trigram')
test_data_vectorized_trigrams = vectorize(test_data['Review'],test=True, ngram_type='trigram')

NB_model_trigram = MultinomialNB().fit(train_data_vectorized_trigrams, data['Label'])
y_predicted_trigram = NB_model_trigram.predict(test_data_vectorized_trigrams)
acc_trigrams = accuracy_score(y_true, y_predicted_trigram)
print("Accuracy of model with trigrams", acc_trigrams)

Accuracy of model with trigrams 0.6
