# AnTeDe Lab 3: Sentiment Analysis - Part C

## Session goal
The goal of this session is to run document-level sentiment analysis on the IMDB movie review corpus by applying supervised text classification techniques. We begin by wrangling the IMDB corpus into lists, exactly as in Lab 3b.

In [54]:
import random, nltk
from nltk.corpus import movie_reviews

corpus = [' '.join(movie_reviews.words(fileid)) \
          for category in movie_reviews.categories() \
          for fileid in movie_reviews.fileids(category)]

labels = [category \
          for category in movie_reviews.categories() \
          for fileid in movie_reviews.fileids(category)]

**scikit-learn** enables us to split the corpus into a training corpus and a test corpus. The parameter *test_size* is the fraction of the whole corpus to set aside as the test corpus (here 0.2, i.e. 20% of the documents). The parameter *random_state* fixes the seed used for shuffling, which ensures reproducibility.

In [55]:
from sklearn.model_selection import train_test_split
training_corpus, test_corpus, training_labels, test_labels = train_test_split(
        corpus, labels, test_size=0.2, random_state=21)
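
To make the meaning of *test_size* concrete, here is a minimal sketch on a toy corpus. The names `toy_corpus` and `toy_labels` and their contents are made up for illustration and are not part of the lab data.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins for the real corpus and labels (hypothetical data).
toy_corpus = [f"document {i}" for i in range(10)]
toy_labels = ["pos", "neg"] * 5

toy_train, toy_test, toy_train_labels, toy_test_labels = train_test_split(
    toy_corpus, toy_labels, test_size=0.2, random_state=21)

# test_size=0.2 sets aside 20% of the 10 documents for testing.
print(len(toy_train), len(toy_test))
```

If the class balance of the two parts matters, `train_test_split` also accepts a `stratify` argument (for example `stratify=toy_labels`). The lab itself does not use it.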

We reuse our helper function **get_metrics** from Lab 2.

In [56]:
def get_metrics(true_labels, predicted_labels):
    from sklearn import metrics
    import numpy as np
    print('Accuracy:', np.round(
        metrics.accuracy_score(true_labels, predicted_labels), 3))

    # Use the true_labels argument, not the global test_labels
    from sklearn.metrics import classification_report
    print(classification_report(true_labels, predicted_labels))

Now we are going to use a Multinomial Naive Bayes (MNB) classifier for sentiment analysis. **CountVectorizer**'s parameter **binary** sets every non-zero token count to 1 if set to True. In sentiment analysis, the number of occurrences of a token may not be as important as its presence or absence, so setting it to True may be a good idea, but let's perform 10-fold cross-validation to find out whether it really is. We want the mean accuracy to be as high as possible and the standard deviation to be as low as possible.
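
The effect of **binary** is easy to see on two toy documents. This is only an illustrative sketch: `toy_docs` and the variable names are made up and are not part of the lab corpus.

```python
from sklearn.feature_extraction.text import CountVectorizer

toy_docs = ["good good movie", "bad movie"]  # hypothetical examples

count_vectorizer = CountVectorizer(binary=False)
binary_vectorizer = CountVectorizer(binary=True)

counts = count_vectorizer.fit_transform(toy_docs).toarray()
flags = binary_vectorizer.fit_transform(toy_docs).toarray()

# Columns follow the alphabetically sorted vocabulary: bad, good, movie.
vocabulary = sorted(count_vectorizer.vocabulary_,
                    key=count_vectorizer.vocabulary_.get)
print(vocabulary)
print(counts)  # "good" is counted twice in the first document
print(flags)   # with binary=True it is only recorded as present
```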

In [57]:
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

mnb_pipeline_1 = Pipeline([ ('vectorizer', CountVectorizer(binary = False, stop_words='english')),\
                ('classifier', MultinomialNB())])  
    
scores = cross_val_score(mnb_pipeline_1, training_corpus, training_labels, cv=10)
import numpy as np
print (round(np.mean(scores), 3))
print (round(np.std(scores), 3))

0.802
0.023


In [58]:
from sklearn.model_selection import cross_val_score

mnb_pipeline_2 = Pipeline([ ('vectorizer', CountVectorizer(binary = True, stop_words='english')),\
                ('classifier', MultinomialNB())])  
    
scores = cross_val_score(mnb_pipeline_2, training_corpus, training_labels, cv=10)
import numpy as np
print (round(np.mean(scores), 3))
print (round(np.std(scores), 3))

0.833
0.028
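
To see roughly what `cross_val_score` does with these pipelines, the sketch below repeats the procedure by hand on a toy corpus. It assumes that, for a classifier and an integer `cv`, scikit-learn uses stratified folds without shuffling. The toy data and all `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy data: 10 positive and 10 negative mini-reviews.
toy_corpus = ([f"great film number {i}" for i in range(10)] +
              [f"awful film number {i}" for i in range(10)])
toy_labels = np.array(["pos"] * 10 + ["neg"] * 10)

toy_pipeline = Pipeline([("vectorizer", CountVectorizer(binary=True)),
                         ("classifier", MultinomialNB())])

manual_scores = []
folds = StratifiedKFold(n_splits=5).split(toy_corpus, toy_labels)
for train_idx, val_idx in folds:
    fold_model = clone(toy_pipeline)  # fresh, unfitted copy for each fold
    fold_model.fit([toy_corpus[i] for i in train_idx], toy_labels[train_idx])
    manual_scores.append(
        fold_model.score([toy_corpus[i] for i in val_idx], toy_labels[val_idx]))

library_scores = cross_val_score(toy_pipeline, toy_corpus, toy_labels, cv=5)
print(np.round(manual_scores, 3), np.round(library_scores, 3))
```

Each fold refits the whole pipeline, vectorizer included, so the vocabulary is learned from the training part of the fold only.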


Based on your 10-fold cross-validation results, choose one of the two pipelines, train it, run it, and analyze its performance.

In [59]:
# BEGIN_REMOVE
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

mnb_pipeline_2.fit(training_corpus, training_labels) 
predicted_labels = mnb_pipeline_2.predict(test_corpus)
get_metrics(true_labels=test_labels,
        predicted_labels=predicted_labels)
# END_REMOVE

Accuracy: 0.825
              precision    recall  f1-score   support

         neg       0.83      0.81      0.82       198
         pos       0.82      0.84      0.83       202

    accuracy                           0.82       400
   macro avg       0.83      0.82      0.82       400
weighted avg       0.83      0.82      0.82       400



Now let's train a Maximum Entropy classifier using the **LogisticRegression** class from **scikit-learn**. Please complete the code to train and run the classifier.

In [60]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

vectorizer = CountVectorizer(binary = True, stop_words='english')
classifier = LogisticRegression()
maxent_pipeline = make_pipeline(vectorizer, classifier)


# BEGIN_REMOVE
maxent_pipeline.fit(training_corpus, training_labels) 
predicted_labels = maxent_pipeline.predict(test_corpus)
get_metrics(true_labels=test_labels,
        predicted_labels=predicted_labels)
# END_REMOVE

Accuracy: 0.83
              precision    recall  f1-score   support

         neg       0.85      0.80      0.82       198
         pos       0.82      0.86      0.84       202

    accuracy                           0.83       400
   macro avg       0.83      0.83      0.83       400
weighted avg       0.83      0.83      0.83       400
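
A side note on `make_pipeline`: it names the steps automatically and, with default settings, keeps the very objects it was given. Fitting the pipeline therefore also fits the `vectorizer` and `classifier` variables themselves. The sketch below illustrates this on made-up toy data; the `toy_` names are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

toy_vectorizer = CountVectorizer(binary=True)
toy_classifier = LogisticRegression()
toy_pipeline = make_pipeline(toy_vectorizer, toy_classifier)

# make_pipeline derives the step names from the lowercased class names.
step_names = list(toy_pipeline.named_steps)
print(step_names)

# Hypothetical toy data, just to have something to fit on.
toy_pipeline.fit(["great film", "awful film"], ["pos", "neg"])

# The pipeline holds the same objects, so they are now fitted as well.
same_object = toy_pipeline.named_steps["logisticregression"] is toy_classifier
print(same_object, hasattr(toy_classifier, "coef_"))
```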



With **eli5**, you can get some insights into the most informative words in your corpus.

In [61]:
import eli5
eli5.show_weights(classifier, vec=vectorizer, top=10)


| Weight? | Feature |
|---|---|
| +0.628 | memorable |
| | … 18553 more positive … |
| | … 17414 more negative … |
| -0.583 | reason |
| -0.588 | script |
| -0.618 | awful |
| -0.622 | unfortunately |
| -0.665 | boring |
| -0.716 | waste |
| -0.721 | worst |
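
The weights in such a table are the coefficients of the logistic regression model. As a rough sketch of where they come from, the snippet below reads them directly from `coef_` on a made-up toy corpus. It does not reproduce eli5's exact output, and all `toy_` names are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus in which only "great" and "awful" carry sentiment.
toy_docs = ["great film", "great plot", "great acting",
            "awful film", "awful plot", "awful acting"]
toy_labels = ["pos"] * 3 + ["neg"] * 3

toy_vectorizer = CountVectorizer(binary=True)
toy_classifier = LogisticRegression().fit(
    toy_vectorizer.fit_transform(toy_docs), toy_labels)

# Feature names ordered by column index.
features = sorted(toy_vectorizer.vocabulary_,
                  key=toy_vectorizer.vocabulary_.get)

# For a binary problem coef_ has a single row; positive weights push the
# decision towards classes_[1], negative weights towards classes_[0].
weights = toy_classifier.coef_[0]
for index in np.argsort(weights)[::-1]:
    print(f"{weights[index]:+.3f}  {features[index]}")
```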


You can also explore your corpus and find out which specific tokens drove the classifier's decisions. Change the value of *item* to visualize what happened in each test sample.

In [62]:
item = 16

print('Prediction: ' + predicted_labels[item] + '; ground truth: ' + test_labels[item])
eli5.show_prediction(classifier, test_corpus[item], vec=vectorizer)

Prediction: neg; ground truth: pos


| Contribution? | Feature |
|---|---|
| 1.061 | Highlighted in text (sum) |
| 0.757 | `<BIAS>` |
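
For a linear model, the score behind a prediction is the sum of the weights of the features present in the document plus the intercept, which presumably corresponds to the two rows above. The sketch below checks this decomposition against `decision_function` on made-up toy data; the `toy_` names and the sample text are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy corpus.
toy_docs = ["great film", "great plot", "great acting",
            "awful film", "awful plot", "awful acting"]
toy_labels = ["pos"] * 3 + ["neg"] * 3

toy_vectorizer = CountVectorizer(binary=True)
toy_classifier = LogisticRegression().fit(
    toy_vectorizer.fit_transform(toy_docs), toy_labels)

sample = "great plot"
sample_vector = toy_vectorizer.transform([sample])

# Per-feature contributions: feature value (0 or 1) times its weight.
contributions = sample_vector.toarray()[0] * toy_classifier.coef_[0]
manual_score = contributions.sum() + toy_classifier.intercept_[0]

library_score = toy_classifier.decision_function(sample_vector)[0]
predicted = toy_classifier.predict(sample_vector)[0]
print(round(manual_score, 3), round(library_score, 3), predicted)
```

A positive score favours `classes_[1]` (here `pos`), a negative one favours `classes_[0]`.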


If you've made it this far, you will have seen that the decisions are affected by words that are supposed to have a neutral semantic orientation. Open question: can you provide a few examples of such words?

It seems reasonable to get the classifier to focus on sentiment words. Let's begin by extracting a list of positive words and a list of negative words as we did in Lab 3b. 

In [63]:
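# Skip everything before the first entry of the word list ('a+'),
# then split the remainder into one word per line.
with open('Words/positive-words.txt', errors='ignore') as opened:
    contents = opened.read()
contents_lines = ['a+'] + contents.split('a+')[1].split('\n')

# Discard empty lines
positive_words = [x for x in contents_lines if len(x) > 0]


# Same procedure for the negative words, whose first entry is '2-faced'
with open('Words/negative-words.txt', errors='ignore') as opened:
    contents = opened.read()
contents_lines = ['2-faced'] + contents.split('2-faced')[1].split('\n')

negative_words = [x for x in contents_lines if len(x) > 0]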
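
Splitting on the first entry works, but it depends on knowing that entry in advance. A possible alternative is sketched below. It assumes that the header lines of the lexicon files start with `;`, which you should verify against your copy of the files. The function name `parse_lexicon` and the sample text are made up for illustration.

```python
def parse_lexicon(text):
    """Return the non-empty lines of a lexicon that are not ';' comments."""
    words = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith(';'):
            words.append(line)
    return words

# Hypothetical excerpt imitating the assumed file layout.
sample_text = "; header comment\n;\n\na+\nabound\nabounds\n"
parsed_words = parse_lexicon(sample_text)
print(parsed_words)
```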

Now, let's write a function to remove non-sentiment tokens from our corpus. Once we have it, we'll run it on the training corpus and the test corpus to transform them into sentiment-only corpora. 

In [64]:
# BEGIN_REMOVE
from nltk.tokenize import word_tokenize

def remove_non_sentiment_tokens(corpus):
    sentiment_only_corpus = []
    for document in corpus:
        # Unique tokens of the document; repetitions are dropped, which is
        # harmless here because the vectorizer is binary
        document_words = set(word_tokenize(document))
        # Keep only the tokens that appear in one of the two lexicons
        sentiment_words = list(document_words.intersection(positive_words)) + \
                          list(document_words.intersection(negative_words))
        sentiment_only_corpus.append(' '.join(sentiment_words))
    return sentiment_only_corpus
    
sentiment_only_training_corpus = remove_non_sentiment_tokens(training_corpus)
sentiment_only_test_corpus = remove_non_sentiment_tokens(test_corpus)
# END_REMOVE
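
If you later want to try a non-binary vectorizer, the filtering step needs to keep repeated tokens. The variant below is a sketch of one way to do that, not the lab's reference solution. It relies on the fact that our corpus was built by joining tokens with single spaces, so a plain `split()` is enough. The function name `keep_sentiment_tokens` and the mini lexicon are hypothetical.

```python
def keep_sentiment_tokens(corpus, lexicon):
    """Keep only lexicon tokens, preserving their order and repetitions."""
    lexicon = set(lexicon)  # set membership tests are fast
    filtered_corpus = []
    for document in corpus:
        # Assumption: tokens are already separated by single spaces.
        kept = [token for token in document.split() if token in lexicon]
        filtered_corpus.append(' '.join(kept))
    return filtered_corpus

toy_lexicon = ['good', 'boring', 'awful']  # hypothetical mini lexicon
filtered = keep_sentiment_tokens(['a good but boring , boring film'], toy_lexicon)
print(filtered)
```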

Now, please train **maxent_pipeline** on the **sentiment_only_training_corpus** and test it on the **sentiment_only_test_corpus**.

In [65]:
# BEGIN_REMOVE
maxent_pipeline.fit(sentiment_only_training_corpus, training_labels) 

predicted_labels = maxent_pipeline.predict(sentiment_only_test_corpus)
get_metrics(true_labels=test_labels,
        predicted_labels=predicted_labels)
# END_REMOVE

Accuracy: 0.835
              precision    recall  f1-score   support

         neg       0.85      0.81      0.83       198
         pos       0.82      0.86      0.84       202

    accuracy                           0.83       400
   macro avg       0.84      0.83      0.83       400
weighted avg       0.84      0.83      0.83       400



Open question: what do you think about the performance improvement? Is it greater than you expected, smaller than you expected, or just about what you expected? Why? 