# Sentiment Analysis and Recommender Systems Part 1 - Exercises with Results

In the following exercises, we will run sentiment analysis on movie reviews that were scraped from a website. Each review is a document and, collectively, the reviews form the corpus. We will be using the `movie_reviews.csv` file today!

## Exercise 1

#### Task 1
##### Load the following packages/libraries that are used in this module:
##### pandas, numpy, pickle (Helper packages); nltk (natural language toolkit for text processing); scikit-learn; matplotlib (for visualizing)

#### Result:

In [None]:
# Helper packages.
import os
import pandas as pd
import numpy as np
import pickle

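# Packages with tools for text processing.
# If NLTK raises a LookupError for a missing resource, download it once with
# nltk.download(...), using the resource name given in the error message.
import nltk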

# Packages to clean text data
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer

# Packages for working with text data and analyzing sentiment.
from nltk.sentiment.vader import SentimentIntensityAnalyzer



#### Task 2
##### Set the working directory to the folder where the dataset is located.

#### Result:

In [None]:
from pathlib import Path
home_dir = Path(".").resolve()
main_dir = home_dir.parent

data_dir = str(main_dir) + "/data"
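
As a minimal sketch of the same directory layout, the paths can also be joined with the `/` operator from `pathlib` instead of string concatenation. The helper name `data_path` is hypothetical and only used for illustration; the folder layout (a `data` folder next to the parent of the current folder) is the one assumed by the cell above.

```python
from pathlib import Path

# Same layout as in the exercise: the data folder sits in the parent
# of the current folder.
home_dir = Path(".").resolve()
main_dir = home_dir.parent
data_dir = main_dir / "data"

# Hypothetical helper: build the full path to a file in the data folder.
def data_path(filename):
    return data_dir / filename

csv_path = data_path("movie_reviews.csv")
print(csv_path.name)         # movie_reviews.csv
print(csv_path.parent.name)  # data
```

A `Path` object can be passed directly to `pd.read_csv` and `open`, so no conversion to a string is needed.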


#### Task 3

##### Load the `movie_reviews.csv` file and preview the data.
##### Run the code chunk below, which performs the following steps to clean the movie reviews data.

    1. Convert all characters to lower case 
    2. Remove stop words 
    3. Remove punctuation, numbers, and all other symbols that are not letters 
    4. Stem words
    5. Keep only the reviews that have at least three words after cleaning.
    6. Save the cleaned reviews in the list `reviews_clean_list`.
    7. Create a Document-Term Matrix and save it as `reviews_DTM_array`.
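
To see what steps 1, 2, 3, and 5 do to individual reviews, here is a minimal sketch that uses only the standard library. It is an illustration under stated assumptions, not the code used in the exercise: the stop-word set `toy_stop_words`, the helper `clean_tokens`, and the three toy reviews are hypothetical, tokens are split on whitespace rather than with `word_tokenize`, and stemming (step 4) is left out.

```python
# Hypothetical, tiny stop-word set for illustration only;
# the exercise uses NLTK's full English list.
toy_stop_words = {"the", "was", "a", "and", "i", "it"}

def clean_tokens(text):
    # Naive whitespace split; the exercise uses nltk's word_tokenize.
    tokens = text.split()
    # 1. Convert to lower case.
    tokens = [t.lower() for t in tokens]
    # 2. Remove stop words.
    tokens = [t for t in tokens if t not in toy_stop_words]
    # 3. Keep only tokens made up entirely of letters.
    tokens = [t for t in tokens if t.isalpha()]
    return tokens

toy_reviews = [
    "The plot was a mess and I hated it",
    "It was great !",
    "Loved the acting , 10 out of 10 stars",
]

cleaned = [clean_tokens(review) for review in toy_reviews]

# 5. Keep only the reviews with at least three words left.
kept = [tokens for tokens in cleaned if len(tokens) >= 3]

print(cleaned[0])  # ['plot', 'mess', 'hated']
print(cleaned[1])  # ['great']
print(len(kept))   # 2
```

The second toy review is dropped because only one word survives cleaning.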


#### Result:

In [None]:
movie_reviews = pd.read_csv(data_dir + '/movie_reviews.csv')
movie_reviews.head()

In [None]:
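# Select the column with the review text.
reviews = movie_reviews["reviews"]

# Tokenize each review into a list of words.
reviews_tokenized = [word_tokenize(review) for review in reviews]

# Load the English stop words as a set for fast membership tests.
stop_words = set(stopwords.words('english'))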

# Create a list for the clean reviews.
reviews_clean = [None] * len(reviews_tokenized)

# Create a list of word counts for each clean review.
word_counts_per_review = [None] * len(reviews_tokenized)

# Create the stemmer once and reuse it for every word.
stemmer = PorterStemmer()

# Process words in all documents.
for i in range(len(reviews_tokenized)):
    # 1. Convert to lower case.
    reviews_clean[i] = [word.lower() for word in reviews_tokenized[i]]

    # 2. Remove stop words.
    reviews_clean[i] = [word for word in reviews_clean[i] if word not in stop_words]

    # 3. Remove punctuation and any non-alphabetical characters.
    reviews_clean[i] = [word for word in reviews_clean[i] if word.isalpha()]

    # 4. Stem words.
    reviews_clean[i] = [stemmer.stem(word) for word in reviews_clean[i]]

    # Record the word count per review.
    word_counts_per_review[i] = len(reviews_clean[i])
    
# Array with the word count of each review.
ex_word_counts_array = np.array(word_counts_per_review)
reviews_array = np.array(reviews_clean, dtype=object)
print(len(reviews_array))

# Find indices of all reviews where there are at least 3 words.
valid_reviews = np.where(ex_word_counts_array >= 3)[0]

# Subset the reviews array to keep only those where there are at least 3 words.
reviews_array = reviews_array[valid_reviews]
print(len(reviews_array))

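# Convert the array back to a list.
reviews_clean = reviews_array.tolist()

# Join the tokens of each review back into a single string.
reviews_clean_list = [' '.join(review) for review in reviews_clean]

# Build the Document-Term Matrix from the cleaned reviews.
ex_vec = CountVectorizer()
ex_X = ex_vec.fit_transform(reviews_clean_list)

# Convert the sparse matrix to a dense array.
reviews_DTM_array = ex_X.toarray()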
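
To make the shape of a Document-Term Matrix concrete, the sketch below builds one by hand for two toy documents using only the standard library. It is an illustration of the idea, not scikit-learn's implementation: `toy_docs`, `vocab`, and `toy_dtm` are hypothetical names, and the sketch splits on whitespace, whereas `CountVectorizer` with default settings also lowercases the text and ignores single-character tokens.

```python
from collections import Counter

# Two toy documents that are already cleaned and stemmed.
toy_docs = ["good movi", "bad movi bad"]

# Vocabulary: sorted unique terms, one matrix column per term.
vocab = sorted({term for doc in toy_docs for term in doc.split()})

# One row per document, one count per vocabulary term.
toy_dtm = []
for doc in toy_docs:
    counts = Counter(doc.split())
    toy_dtm.append([counts[term] for term in vocab])

print(vocab)    # ['bad', 'good', 'movi']
print(toy_dtm)  # [[0, 1, 1], [2, 0, 1]]
```

Each row corresponds to one review and each column to one term, so `reviews_DTM_array` has as many rows as `reviews_clean_list` has entries.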

#### Task 4

##### Print the first 10 reviews from `reviews_clean_list`.

#### Result:

In [None]:
print(reviews_clean_list[:10])

## Exercise 2

#### Task 1
##### We want to analyze the sentiment of the movie reviews.
##### Let us first add the sentiment labels to our cleaned reviews.
##### Load the sentiment analysis function we used in our module.

##### This function returns a list with one label (`positive` or `negative`) per review, based on the VADER compound score:

#### Result:

In [None]:
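def sentiment_analysis(texts):
    # Create the analyzer once and reuse it for every text.
    sid = SentimentIntensityAnalyzer()
    list_of_scores = []
    for text in texts:
        # The compound score ranges from -1 (most negative) to +1 (most positive).
        compound = sid.polarity_scores(text)["compound"]
        # A score of exactly 0 (neutral) is labeled "positive" here.
        if compound >= 0:
            list_of_scores.append("positive")
        else:
            list_of_scores.append("negative")
    return list_of_scores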
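
The labeling rule inside the function can be looked at in isolation. The sketch below applies the same threshold to a few made-up compound scores; the helper name `label_from_compound` and the values in `toy_scores` are hypothetical and are not output from VADER.

```python
def label_from_compound(compound):
    # Same rule as in sentiment_analysis: zero and above is "positive".
    return "positive" if compound >= 0 else "negative"

# Made-up compound scores for illustration.
toy_scores = [0.64, 0.0, -0.3]
toy_labels = [label_from_compound(score) for score in toy_scores]

print(toy_labels)  # ['positive', 'positive', 'negative']
```

Note that a review with a compound score of exactly 0 ends up in the `positive` group under this rule.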

#### Task 2
##### Assign labels to the `reviews_clean_list` using the `sentiment_analysis` function and save them to the `score_labels` variable.
##### Print the first 5 labels.

#### Result:

In [None]:
score_labels = sentiment_analysis(reviews_clean_list)

print(score_labels[:5])

#### Task 3
##### Save `score_labels` as `score_labels_ex.sav`, `reviews_clean_list` as `reviews_clean_list_ex.sav`, and `reviews_DTM_array` as `reviews_DTM_array.sav` in `data_dir` for use in the next session.

#### Result:

In [None]:
with open(data_dir + '/score_labels_ex.sav', 'wb') as f:
    pickle.dump(score_labels, f)
with open(data_dir + '/reviews_clean_list_ex.sav', 'wb') as f:
    pickle.dump(reviews_clean_list, f)
with open(data_dir + '/reviews_DTM_array.sav', 'wb') as f:
    pickle.dump(reviews_DTM_array, f)
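
Reading the saved objects back in the next session is the mirror image of saving them. The following sketch shows a full round trip with `pickle.dump` and `pickle.load` on a toy list written to a temporary folder; the names `toy_labels` and `toy_labels.sav` are hypothetical, and in the next session the path would point to the files in `data_dir` instead.

```python
import pickle
import tempfile
from pathlib import Path

# Toy stand-in for score_labels.
toy_labels = ["positive", "negative", "positive"]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy_labels.sav"

    # Save the object in binary mode.
    with open(toy_path, "wb") as f:
        pickle.dump(toy_labels, f)

    # Load it back, also in binary mode.
    with open(toy_path, "rb") as f:
        loaded_labels = pickle.load(f)

print(loaded_labels == toy_labels)  # True
```

Only load pickle files from sources you trust, because unpickling can execute arbitrary code.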