# Week 4, Lesson 3, Activity 5: End-to-end sentiment analysis

&copy;2021, Ekaterina Kochmar \
(edited: Nadejda Roubtsova, February 2022)

Your task in this activity is to:

- Implement a sentiment analysis algorithm and train it on the set of reviews provided with the notebook.

## Step 1: Data loading

We will be using the popular `polarity dataset 2.0` collected by [Bo Pang and colleagues from Cornell University](http://www.cs.cornell.edu/people/pabo/movie-review-data/). Let's first load the data.

In [None]:
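import os, codecs

def read_in(folder):
    files = os.listdir(folder)
    a_dict = {}
    for a_file in sorted(files):
        if not a_file.startswith("."):  # skip hidden files such as .DS_Store
            # os.path.join works whether or not `folder` ends with a slash
            path = os.path.join(folder, a_file)
            # the `with` block closes the file automatically
            with codecs.open(path, encoding='ISO-8859-1', errors='ignore') as f:
                file_id = a_file.split(".")[0].strip()
                a_dict[file_id] = f.read()
    return a_dict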

The downloaded dataset contains two subfolders, `pos/` for all positive reviews and `neg/` for all negative ones, inside a folder called `review_polarity/txt_sentoken/`. If you don't change the folder names, you can read in the contents of all positive and negative reviews and store them in separate Python dictionaries that map review IDs (file names without the extension) to the review text, using the function `read_in` from above.

Let's also print out the number of reviews in positive and negative dictionaries, as well as the very first positive and very first negative reviews in the dictionaries.

In [None]:
folder = "review_polarity/txt_sentoken/"
pos_dict = read_in(# provide the relative path to the positive reviews folder
                   )
print(f"Number of positive sentiment reviews: {len(pos_dict)}") # check that this is 1000
print(pos_dict.get(next(iter(pos_dict))))

neg_dict = read_in(# provide the relative path to the negative reviews folder
                   )
print(f"Number of negative sentiment reviews: {len(neg_dict)}") # check that this is 1000
print(neg_dict.get(next(iter(neg_dict))))

## Step 2: Preprocess texts with spaCy

Import `spacy`; since processing with `spacy` might take time, let's run it once and store the results in dedicated data structures:

In [None]:
import spacy
nlp = spacy.load("en_core_web_md")

def spacy_preprocess_reviews(source):
    source_docs = {}
    # start=1 so that the progress message reports the number of reviews actually processed
    for index, (review_id, text) in enumerate(source.items(), start=1):
        # to speed processing up, you can disable "ner", the Named Entity Recognition module of spaCy
        source_docs[review_id] = nlp(text.replace("\n", ""), disable=["ner"])
        if index % 200 == 0:
            print(f"{index} reviews processed")
    print("Dataset processed")
    return source_docs

pos_docs = # preprocess positive reviews with spacy_preprocess_reviews
neg_docs = # preprocess negative reviews with spacy_preprocess_reviews

## Step 3: Apply a machine learning classifier to the data

First, let's filter out punctuation marks (you can experiment with other filters if you'd like, e.g. stopwords) and prepare the data for the machine learning pipeline:

In [None]:
import random
import string
#from spacy.lang.en.stop_words import STOP_WORDS as stopwords_list # stopwords list
punctuation_list = list(string.punctuation)

def text_filter(a_dict, label, exclude_lists):
    data = []
    for rev_id in a_dict.keys():
        tokens = []
        for token in a_dict.get(rev_id):
            if token.text not in exclude_lists:
                # append token's text to the list of tokens
                # Alternatively, use tokens.append(token.lemma_) for lemmas instead of word tokens
        data.append((' '.join(tokens), label))
    return data

def prepare_data(pos_docs, neg_docs, exclude_lists):
    data = text_filter(pos_docs, 1, exclude_lists)
    data += text_filter(neg_docs, -1, exclude_lists)
    random.seed(42)
    random.shuffle(data)
    texts = []
    labels = []
    for item in data:
        # append the first entry from the tuple to texts (this is the tokens from the review)
        # append the second entry from the tuple to labels (this is the labels: 1 for pos or -1 for neg)
    return texts, labels

# for the use of both lists in filtering:
# texts, labels = prepare_data(pos_docs, neg_docs, list(stopwords_list) + punctuation_list)

texts, labels = prepare_data(# insert the relevant data structures here
                             ) 

print(f"Total number of reviews = {len(texts)} and labels = {len(labels)}") # there should be 2000 texts and 2000 labels
print(texts[0])
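
To make the effect of the punctuation filter concrete, here is a minimal sketch on plain strings. It assumes a hypothetical `toy_tokens` list standing in for the tokens of a spaCy-processed review, so it illustrates the idea rather than the notebook's own `text_filter`:

```python
import string

# Hypothetical stand-in for a spaCy-processed review: a plain list of token strings
toy_tokens = ["this", "film", "is", "great", ",", "really", "great", "!"]

# A set gives fast membership tests; the notebook's list works the same way
punctuation = set(string.punctuation)

# Keep every token that is not a single punctuation character
kept = [tok for tok in toy_tokens if tok not in punctuation]
toy_text = " ".join(kept)
print(toy_text)
```

Note that this check only matches single punctuation characters: a token such as `--` is not an element of `string.punctuation` and would pass through. With real spaCy tokens, the `token.is_punct` attribute is one alternative you could experiment with.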

Let's set aside $80\%$ of the data for training and the rest for testing in this randomly shuffled set:

In [None]:
def split(texts, labels, proportion):
    train_data = []
    train_targets = []
    test_data = []
    test_targets = []
    for i in range(0, len(texts)):
        if i < proportion*len(texts):
            train_data.append(texts[i])
            train_targets.append(labels[i])
        else:
            # apply the same steps to the test set data structures
    return train_data, train_targets, test_data, test_targets

train_data, train_targets, test_data, test_targets = split(texts, labels, 0.8)
        
print(len(train_data)) # is this 1600?
print(len(train_targets)) # is this 1600?      
print(len(test_data)) # is this 400?       
print(len(test_targets)) # is this 400? 
print(train_targets[:10]) # print out the targets for the first 10 training reviews 
print(test_targets[:10]) # print out the targets for the first 10 test reviews 
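
Because the data is already shuffled, the same kind of split can also be expressed with list slicing. The sketch below uses a hypothetical helper `split_by_slicing` and made-up toy data; it is an alternative formulation, not the notebook's `split` function, and it can differ from the loop above by one item when `proportion * len(texts)` is not a whole number:

```python
def split_by_slicing(texts, labels, proportion):
    # Index of the first item that goes to the test set
    cutoff = int(proportion * len(texts))
    return texts[:cutoff], labels[:cutoff], texts[cutoff:], labels[cutoff:]

# Made-up toy data: 10 "reviews" with alternating labels
toy_texts = [f"review {i}" for i in range(10)]
toy_labels = [1, -1] * 5

toy_train_x, toy_train_y, toy_test_x, toy_test_y = split_by_slicing(toy_texts, toy_labels, 0.8)
print(len(toy_train_x), len(toy_test_x))
print(toy_test_x, toy_test_y)
```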

Now, let's estimate the distribution of words across texts using `sklearn`'s `CountVectorizer`:

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_data)
# Check the dimensionality 
print(train_counts.shape)

This shows that our training set contains over $35,000$ distinct words (the exact number may change depending on your split). This is our training set vocabulary, and the same vocabulary will later be applied to the test reviews. Note that it is learned on the training data only, so words that occur only in test reviews are ignored. Let's look 'under the hood' and print out the counts for some words in the first $10$ reviews from the training set:

In [None]:
print(train_counts[:10])

What do results like `(0, 5285)  5` and `(0, 30800)  1` mean? \
The first review (index $0$, a positive review since it has label `1` in `train_targets`) contains $5$ occurrences of the word with index $5285$ and $1$ occurrence of the word with index $30800$ in the vocabulary. Let's see what those indices correspond to:

In [None]:
count_vect.get_feature_names_out()[5285]

In [None]:
count_vect.get_feature_names_out()[30800]

For example, you might find that index $5285$ corresponds to the word *characters* and index $30800$ to the word *stuck*. \
(Please note that you will get different words if you have experimented with alternative preprocessing.) \
Here is how you can check the whole list of words (features) mapped to indices:

In [None]:
count_vect.vocabulary_

Alternatively, to print the vocabulary of features in alphabetical order, run:

In [None]:
count_vect.get_feature_names_out()

Now let's convert word occurrences into binary values: use $1$ if the word occurs in a review, and $0$ otherwise:

In [None]:
from sklearn.preprocessing import Binarizer

transformer = Binarizer()
train_bin = transformer.fit_transform(train_counts)
print(train_bin.shape)
print(train_bin[0])
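
As a small illustration of what the binarization does, the sketch below applies `Binarizer` with its default threshold of $0$ to a made-up count matrix (the numbers are illustrative, not taken from the reviews):

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Made-up counts: 2 texts, 4 vocabulary words
toy_counts = np.array([[0, 1, 1, 2],
                       [3, 0, 1, 0]])

# With the default threshold of 0, every count above 0 becomes 1
toy_bin = Binarizer().fit_transform(toy_counts)
print(toy_bin)
```

Every non-zero count collapses to $1$, so a word that occurs three times in a text carries the same weight as a word that occurs once.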

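Finally, let's train the classifier and run it on the designated test set. Note that the code below fits the model on the raw counts in `train_counts`; to use presence/absence features instead, fit it on `train_bin` and binarize the test counts with `transformer.transform` in the same way: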

In [None]:
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB().fit(train_counts, train_targets)
test_counts = count_vect.transform(test_data)
predicted = clf.predict(test_counts)

for text, label in list(zip(test_data, predicted))[:10]:
    if label==1:
        print('%r => %s' % (text[:100], "pos"))
    else:
        # print out the negative label

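Alternatively, you can chain the same steps in an `sklearn` `Pipeline`. Note that this version is not identical to the code above: it binarizes the counts and restricts the vocabulary with `min_df=10` and `max_df=0.5`: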

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer

text_clf = Pipeline([('vect', CountVectorizer(min_df=10, max_df=0.5)), 
                     ('binarizer', Binarizer()), # include this for detecting presence-absence of features
                     ('clf', MultinomialNB())
                    ])

text_clf.fit(train_data, train_targets) 
print(text_clf)
predicted = text_clf.predict(test_data)

Evaluate the results:

In [None]:
from sklearn import metrics

print("\nConfusion matrix:")
print(metrics.confusion_matrix(test_targets, predicted))
print(metrics.classification_report(test_targets, predicted))
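
To help read the confusion matrix, the sketch below builds one by hand for made-up toy labels. It assumes the layout used by `metrics.confusion_matrix`, where rows are true labels, columns are predicted labels, and labels are sorted in ascending order (so `-1` comes before `1`):

```python
from collections import Counter

# Made-up toy labels, not results from the review data
toy_true = [1, 1, -1, -1, 1]
toy_pred = [1, -1, -1, 1, 1]

# Count how often each (true label, predicted label) pair occurs
pair_counts = Counter(zip(toy_true, toy_pred))
label_order = sorted(set(toy_true) | set(toy_pred))  # [-1, 1]

# Rows: true labels; columns: predicted labels
toy_matrix = [[pair_counts[(t, p)] for p in label_order] for t in label_order]
toy_accuracy = sum(t == p for t, p in zip(toy_true, toy_pred)) / len(toy_true)

print(toy_matrix)
print(toy_accuracy)
```

The diagonal cells hold the correctly classified items (one negative and two positive texts here), and the off-diagonal cells hold the errors, which is why accuracy equals the sum of the diagonal divided by the total number of items.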