# Sentiment Analysis
In this exercise, we will explore a movie review dataset.


**Task 1:** Load the data from `/dsa/data/all_datasets/movie_reviews` into the `mvr` variable. While loading, use `encoding='utf-8'`. (Solved for you)


In [1]:
from sklearn.datasets import load_files

data_dir = '/dsa/data/all_datasets/movie_reviews'

mvr = load_files(data_dir)

In [2]:
mvr.keys()

dict_keys(['data', 'filenames', 'target_names', 'target', 'DESCR'])

In [3]:
print('Number of Reviews: {0}'.format(len(mvr.filenames)))

Number of Reviews: 2000


**Task 2:** Apply `SentimentIntensityAnalyzer` on the entire dataset to estimate polarity scores. Print the top 3 `positive`, `negative`, and `neutral` reviews based on the following rule:


* positive sentiment: compound score >= 0.05
* neutral sentiment: (compound score > -0.05) and (compound score < 0.05)
* negative sentiment: compound score <= -0.05
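
The three thresholds above can be captured in a small helper. This is only a minimal sketch: the function name `label_from_compound` and its `threshold` parameter are assumptions introduced here for illustration and are not part of NLTK.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a compound score to 'positive', 'negative', or 'neutral'.

    Hypothetical helper: the name and the threshold parameter are
    illustrative assumptions, not an NLTK API.
    """
    if compound >= threshold:
        return 'positive'
    if compound <= -threshold:
        return 'negative'
    # Strictly between -threshold and +threshold
    return 'neutral'


# Try the boundary values and a few scores like those in this notebook
for score in (0.9999, 0.05, -0.0488, -0.05, -0.9997):
    print(score, label_from_compound(score))
```

Both boundary values (0.05 and -0.05) fall outside the neutral band, which matches the strict inequalities in the neutral rule.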

In [4]:
# Add your code below
# -------------------
from nltk.sentiment.vader import SentimentIntensityAnalyzer
import numpy as np


In [8]:
sia = SentimentIntensityAnalyzer()
sentiments = []

In [9]:
for review in mvr.data:
    review_text = review.decode('utf-8')
    sentiment = sia.polarity_scores(review_text)
    sentiments.append((sentiment['compound'],review_text))

sentiments = sorted(sentiments, key=lambda x: x[0], reverse=True)


In [12]:
print("Top 3 Positive Reviews:")
for sentiment in sentiments[:3]:
    print(f"Score: {sentiment[0]}, Review:{sentiment[1][:200]}...")

Top 3 Positive Reviews:
Score: 0.9999, Review:as i write the review for the new hanks/ryan romantic comedy you've got mail , i am acutely aware that i am typing it on a computer and sending it a billion miles away on the internet . 
i am also awa...
Score: 0.9999, Review:note : some may consider portions of the following text to be spoilers . 
be forewarned . 
the teaser trailers for my best friend's wedding scarsely gave reason for hope - it looked like the sort of g...
Score: 0.9998, Review:most people fit into two different categories : you either love woody allen , or you hate his guts . 
my family , for the most part , hates him and his movies . 
i think he's very funny , but his shti...


In [13]:
print("\nTop 3 Negative Reviews:")
for sentiment in sentiments[-3:]:
    print(f"Score: {sentiment[0]}, Review: {sentiment[1][:200]}...")


Top 3 Negative Reviews:
Score: -0.9996, Review: natural born killers is really a very simple story that , in essence , has already been told in bonnie & clyde with some major variations in emphasis , mood and degree . 
both films glamorize " outlaw...
Score: -0.9996, Review: weighed down by tired plot lines and spielberg's reliance on formulas , _saving private ryan_ is a mediocre film which nods in the direction of realism before descending into an abyss of cliches . 
th...
Score: -0.9997, Review: the above is dialogue from this film , taken almost completely in context , and not jazzed up a bit to make it more inept than it is . 
it is spoken between two of the protagonists somewhere in the fi...


In [14]:
neutral_reviews = [s for s in sentiments if -0.05 < s[0] < 0.05]
print("\nTop 3 Neutral Reviews:")
for sentiment in neutral_reviews[:3]:
    print(f"Score: {sentiment[0]}, Review: {sentiment[1][:200]}...")
    


Top 3 Neutral Reviews:
Score: -0.0488, Review: pulp fiction , quentin tarantino's anxiously awaited and superb follow-up to reservoir dogs , is absolutely and without a doubt progressing as one of the most talked about , loved , and hated films of...
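
The cells above slice one sorted list, so the negative slice prints its three entries with the most negative one last. As an alternative, the selection can be sketched by splitting the `(compound, text)` pairs into the three classes first and ranking each class on its own. The function name `top_reviews_by_class`, its parameters, and the sample scores are hypothetical and are not taken from the dataset.

```python
def top_reviews_by_class(scored, k=3, threshold=0.05):
    """Pick the k most representative (compound, text) pairs per class.

    Hypothetical helper for illustration; `scored` is a list of
    (compound_score, review_text) tuples.
    """
    positive = [s for s in scored if s[0] >= threshold]
    negative = [s for s in scored if s[0] <= -threshold]
    neutral = [s for s in scored if -threshold < s[0] < threshold]
    return {
        # Highest compound score first
        'positive': sorted(positive, key=lambda s: s[0], reverse=True)[:k],
        # Lowest (most negative) compound score first
        'negative': sorted(negative, key=lambda s: s[0])[:k],
        # Closest to zero first
        'neutral': sorted(neutral, key=lambda s: abs(s[0]))[:k],
    }


# Made-up scores and placeholder texts, used only to exercise the helper
sample = [(0.98, 'a'), (-0.97, 'b'), (0.01, 'c'),
          (-0.04, 'd'), (0.6, 'e'), (-0.2, 'f')]
top = top_reviews_by_class(sample, k=2)
for label, reviews in top.items():
    print(label, reviews)
```

A class with fewer than `k` members simply returns what it has, which is why only one neutral review is printed for this dataset.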


**Task 3:** Apply `SentimentIntensityAnalyzer` on the entire dataset to estimate polarity scores. Print a classification report based on the following rule: 


* positive sentiment: compound score >= 0
* negative sentiment: compound score < 0
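
The binary rule above, and the per-class precision and recall that a classification report summarizes, can be sketched in plain Python. The helper names `predict_label` and `precision_recall`, the encoding 0 = negative and 1 = positive, and the toy scores and labels are assumptions made for illustration. For the real dataset, the ground-truth labels would come from `mvr.target`, whose integers index into `mvr.target_names`.

```python
def predict_label(compound):
    """Task 3 rule: compound >= 0 is positive (1), otherwise negative (0)."""
    return 1 if compound >= 0 else 0


def precision_recall(true_labels, pred_labels, label):
    """Compute precision and recall for one class (hypothetical helper)."""
    pairs = list(zip(true_labels, pred_labels))
    true_positives = sum(1 for t, p in pairs if t == label and p == label)
    predicted = sum(1 for p in pred_labels if p == label)
    actual = sum(1 for t in true_labels if t == label)
    # Guard against division by zero when a class never occurs
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Toy compound scores and ground-truth labels, not taken from the dataset
compounds = [0.8, -0.3, 0.0, 0.4, -0.9, 0.2]
true_labels = [1, 0, 1, 0, 0, 1]
pred_labels = [predict_label(c) for c in compounds]

for label, name in enumerate(['Negative', 'Positive']):
    precision, recall = precision_recall(true_labels, pred_labels, label)
    print(f"{name}: precision={precision:.2f}, recall={recall:.2f}")
```

If the support for one class comes out as 0 in a report, the true labels were most likely not built from the dataset's targets.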

In [15]:
# Add your code below
# -------------------
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import classification_report


In [16]:
sia = SentimentIntensityAnalyzer()

In [17]:
true_labels = []
pred_labels = []

# mvr.target holds the ground-truth class index of each review; the index
# refers to an entry in mvr.target_names (index 1 is treated as positive).
for review, target in zip(mvr.data, mvr.target):
    review_text = review.decode('utf-8')
    sentiment_score = sia.polarity_scores(review_text)['compound']

    true_labels.append(target)

    # Task 3 rule: compound >= 0 is positive (1), otherwise negative (0)
    if sentiment_score >= 0:
        pred_labels.append(1)
    else:
        pred_labels.append(0)

print(classification_report(true_labels, pred_labels, target_names=['Negative', 'Positive']))


# Save your notebook, then `File > Close and Halt`