
# Lab Assignment 06 - Sentiment Analysis of Product Reviews Using Python!

In this lab assignment, we will be using a variety of tools and techniques to build two different kinds of sentiment analyzers. One of these will rely on a sentiment lexicon, while the other will rely on a machine learning model that will be trained using a complex feature vector representation of the input text objects.

By the time you have completed this lab, you will have achieved all of the following learning objectives:

## Learning Objectives

* Load data from multiple spreadsheets in an Excel file into different Pandas dataframes.
* Compute sentiment polarity scores for text objects using a polarity lexicon.
* Assign predicted polarity labels using a quantile split of lexicon-based sentiment polarity scores.
* Compute TF-IDF vectors for a testing set using a vocabulary that was computed from a training set.
* Get part-of-speech (POS) tags for text objects.
* Calculate probability distributions for unigrams, bigrams, POS unigrams, and POS bigrams.
* Construct complex feature vector representations of text objects with varying degrees of complexity.
* Evaluate and compare machine learning-based sentiment analysis models that have been trained using complex feature vectors.
* Continue to develop skills working with and analyzing text in Python.

### Import Libraries
As usual, we will begin by importing all of the libraries that we'll need.

Run the code cell below to import the libraries that we'll be using in this lab assignment.

In [1]:
#import libraries
import nltk #the natural language toolkit
import numpy as np #used for vector / matrix operations
import pandas as pd
from collections import Counter #used to count occurrences of n-grams
from nltk import pos_tag #used to generate part-of-speech (POS) tags
from nltk.tokenize import word_tokenize #used to split text into tokens
from sklearn.feature_extraction.text import TfidfVectorizer #used to generate TF-IDF vectors and build the vocabulary
from sklearn.linear_model import LogisticRegression #used to train logistic regression-based classifiers
from sklearn.model_selection import train_test_split #used to split the data into training and testing sets
nltk.download('averaged_perceptron_tagger') #needed to generate POS tags
nltk.download('punkt') #needed to tokenize text

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Load Data
The data for this lab assignment are contained in two different spreadsheets inside an Excel file. The first of these spreadsheets contains data for a large number of product reviews, with each record involving a textual review and a sentiment polarity label. The meanings of these polarity labels are:
* 1 = positive
* 0 = neutral
* -1 = negative

The second spreadsheet contains a sentiment polarity lexicon that was constructed on the basis of Twitter hashtags. Each record in the lexicon consists of an n-gram (either a unigram or a bigram) along with a corresponding sentiment polarity score for that n-gram. The polarity scores in the lexicon range from -1.0 (very negative) to +1.0 (very positive).

Run the code cell below to load the two spreadsheets into your Python program. Note that each spreadsheet is loaded into its own Pandas dataframe.

In [2]:
#load data into dataframes
df = pd.read_excel('/content/Lab Assignment 06 - Data (2).xlsx', sheet_name=0)
df_lexicon = pd.read_excel('/content/Lab Assignment 06 - Data (2).xlsx', sheet_name=1)

#display the first few rows of product review data
df.head(10)

| | review_text | polarity |
| --- | --- | --- |
| 0 | Really pathetic: I bought this at the airport ... | -1 |
| 1 | Am I missing stomething?: The characters are n... | -1 |
| 2 | Great--as a Comedy! Where's the Zero Stars but... | -1 |
| 3 | WRONG PART SENT...: HI! I THINK THAT I GOT WRO... | -1 |
| 4 | Try to find a replacement gasket: Spend the mo... | -1 |
| 5 | valeo ball: When i received this item not only... | -1 |
| 6 | Cuisipro Rotary Cheese Grater: This tool is ve... | -1 |
| 7 | on behalf of all christians out there...i'm so... | -1 |
| 8 | BAD CUSTOMER SERVICE: My clock ..was defective... | -1 |
| 9 | Very disappointed: After purchasing and spendi... | -1 |


Run the code cell below to display the last 10 rows of data in the sentiment polarity lexicon.

In [3]:
#display the last 10 rows of data in the polarity lexicon
df_lexicon.tail(10)

| | ngram | polarity |
| --- | --- | --- |
| 354915 | zumba today | 0.040338 |
| 354916 | zurich | -0.005991 |
| 354917 | zzz | -0.009528 |
| 354918 | zzzquil | 0.165744 |
| 354919 | zzzz | -0.06179 |
| 354920 | zzzz ! | 0.040338 |
| 354921 | zzzzz | 0.096251 |
| 354922 | zzzzz ! | 0.040338 |
| 354923 | zzzzzz | -0.084955 |
| 354924 | zzzzzzz | -0.038626 |


### Build an N-Gram Polarity Dictionary
The polarity lexicon contains a large number of n-grams and polarity scores. Since we will need to perform a lot of searches for n-grams and their polarities in this lab assignment, it will be convenient (and much faster!) to work with these data in the form of a dictionary.

Run the code cell below to build an n-gram polarity dictionary whose keys are n-grams and whose values are the n-grams' corresponding polarity scores.

In [4]:
#build the n-gram polarity dictionary
ngram_polarity = {}
for row in df_lexicon.itertuples():
  ngram = str(row.ngram)
  ngram_polarity[ngram] = row.polarity
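
The same mapping can also be built in a single expression. The sketch below uses a few hypothetical rows (the n-grams and scores are invented for illustration and are not taken from the lab's spreadsheet) to show why the keys are converted with `str()` and how `.get()` handles a missing n-gram.

```python
# Hypothetical rows standing in for df_lexicon (values invented for illustration).
toy_rows = [("good", 0.8), ("not bad", 0.35), (2020, 0.1)]

# str() mirrors the lab's str(row.ngram): a spreadsheet cell that looks like a
# number can be read as a number, so converting keeps every key a string.
toy_polarity = {str(ngram): score for ngram, score in toy_rows}

# Dictionary lookups do not need to scan the whole lexicon.
print(toy_polarity.get("not bad", "N/A"))  # a bigram key
print(toy_polarity.get("2020", "N/A"))     # the numeric cell, now a string key
print(toy_polarity.get("missing", "N/A"))  # falls back to the default
```

Using `.get()` with a default avoids a `KeyError` when an n-gram is not in the dictionary.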

**TASK 01:**
>Write a line of code in the cell below that will display the total number of n-grams in the n-gram polarity dictionary.

**QUESTION 01:**
>How many n-grams are in the n-gram polarity dictionary?

In [5]:
#display the total number of n-grams in the n-gram polarity dictionary
print(f"Total number of n-grams: {len(ngram_polarity)}")


Total number of n-grams: 354925


### Examining N-Gram Polarities
As noted previously, our n-gram polarity lexicon was constructed on the basis of Twitter hashtags. Let's examine the polarity scores for a few n-grams to see if this approach to constructing a polarity lexicon is viable.

Run the code cell below to display the polarity scores for a list of n-grams. Remember that the polarity scores in the dictionary range from -1.0 (very negative) to +1.0 (very positive).

In [6]:
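# Define a small hand-written example lexicon (this is separate from the
# ngram_polarity dictionary that was built from the spreadsheet above)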
ngram_polarity_lexicon = {
    'good': 0.8,
    'bad': -0.7,
    'happy': 0.9,
    'sad': -0.8,
    'excellent': 1.0,
    'terrible': -1.0,
    'love': 0.9,
    'hate': -0.9,
    'best': 1.0,
    'worst': -1.0,
    # Add more n-grams and their polarity scores as needed
}

# List of n-grams to check
ngrams_to_check = ['good', 'bad', 'excellent', 'worst', 'happy', 'sad']

# Display the polarity scores for the listed n-grams
for ngram in ngrams_to_check:
    score = ngram_polarity_lexicon.get(ngram, 'Not found in lexicon')
    print(f"Polarity score for '{ngram}': {score}")


Polarity score for 'good': 0.8
Polarity score for 'bad': -0.7
Polarity score for 'excellent': 1.0
Polarity score for 'worst': -1.0
Polarity score for 'happy': 0.9
Polarity score for 'sad': -0.8


**TASK 02:**
>Write some code in the cell below that will display sentiment polarity scores for the n-grams 'not bad', 'pretty bad', 'bad', and 'very bad'.

**QUESTION 02:**
>Which of the following statements about the n-gram polarities are true?
* 'not bad' has a negative polarity score
* 'pretty bad' is more negative than 'bad'
* 'bad' is more negative than 'pretty bad'
* both 'not bad' and 'pretty bad' have positive polarity scores

In [7]:
#display sentiment polarity scores for the n-grams 'not bad', 'pretty bad', 'bad', and 'very bad'

ngrams = ['not bad', 'pretty bad', 'bad', 'very bad']
for ngram in ngrams:
  print(ngram, ngram_polarity.get(ngram, 'N/A'))


not bad 0.3499999999999999
pretty bad -0.2249999999999999
bad -0.6999999999999998
very bad -0.9099999999999998


In [8]:
ngrams = ['Hiking', 'in', 'the', 'mountains', 'is', 'fun', 'and', 'very', 'relaxing']
for ngram in ngrams:
  print(ngram, ngram_polarity.get(ngram, 'N/A'))

Hiking N/A
in -0.1041707080504365
the 0.01679857199911969
mountains 0.4115364865635877
is 0.03294916414674498
fun 0.3
and 0.00822730644149024
very 0.2
relaxing 0.3421578136589262
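
In the output above, 'Hiking' comes back as N/A. One possible reason, assuming the lexicon stores its n-grams in lowercase (as the comments in the lab's `get_lexicon_polarity()` function state), is that dictionary lookups are case-sensitive. The sketch below uses a hypothetical two-entry lexicon and a hypothetical `lookup()` helper to illustrate the effect; whether 'hiking' actually appears in the real lexicon is not known from this notebook.

```python
# Hypothetical lexicon: the entries and scores are invented for illustration.
toy_lexicon = {"hiking": 0.25, "fun": 0.3}


def lookup(ngram, lexicon):
    """Look up an n-gram after converting it to lowercase."""
    return lexicon.get(ngram.lower(), "N/A")


print(toy_lexicon.get("Hiking", "N/A"))  # case-sensitive lookup misses the key
print(lookup("Hiking", toy_lexicon))     # lowercasing first finds it
```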


### Computing Sentiment Polarity Scores Using a Polarity Lexicon
The general idea behind using a polarity lexicon to compute sentiment polarity scores is quite simple: The overall polarity score for the input text is simply the average of the polarity scores for each n-gram that appears in the text. To assign a polarity score to any input text, we simply need to identify the n-grams that appear in the text, look up their corresponding polarity scores in the polarity lexicon, and then compute the average of those scores.

Run the code cell below to add a `get_lexicon_polarity()` function to your Python program. This function uses the lexicon-based n-gram polarities to compute an average polarity score for the input text. Since the polarities in the lexicon are all in the range -1.0 to +1.0, the average polarity score will also always be between -1.0 and +1.0.

In [9]:
#define a function that will use the lexicon-based n-gram polarities to compute a
#polarity score between -1.0 and +1.0 for the input text.
def get_lexicon_polarity(raw_text):
  #define a variable to hold a running total of polarity scores
  polarity = 0.0
  #define a variable to hold the total number of times n-grams in the polarity
  #dictionary appeared in the input text
  total_ngram_matches = 0
  #convert the input text to lowercase (since all of the n-grams in the polarity
  #dictionary are in lowercase)
  text = raw_text.lower()
  #tokenize the input text
  tokens = word_tokenize(text)
  #construct a list containing all of the bigrams in the input text
  bigrams = []
  for i in range(1, len(tokens)):
    bigrams.append(tokens[i - 1] + ' ' + tokens[i]) #build all bigrams
  #compute running polarity sum and number of n-gram matches
  for bigram in bigrams:
    #if this bigram appears in the polarity dictionary
    if bigram in ngram_polarity:
      polarity += ngram_polarity[bigram] #update running total
      total_ngram_matches += 1 #update number of n-gram matches
    else: #if this bigram does not appear in the polarity dictionary
      left_unigram = bigram.split()[0] #get the unigram on the left side of the bigram
      #if this unigram appears in the polarity dictionary
      if left_unigram in ngram_polarity:
        polarity += ngram_polarity[left_unigram] #update running total
        total_ngram_matches += 1 #update number of n-gram matches
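  #if no n-grams were matched, return a neutral score to avoid dividing by zero
  if total_ngram_matches == 0:
    return 0.0
  #compute the overall average polarity score for the input text
  polarity /= total_ngram_matches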
  #return the polarity score
  return polarity
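
To see how the bigram-first lookup behaves, here is a minimal trace on a toy lexicon. It is a sketch under stated assumptions: the lexicon scores are invented, `toy_lexicon_polarity()` is a hypothetical helper, and whitespace splitting stands in for `word_tokenize()`.

```python
# Hypothetical lexicon: these scores are invented for illustration only.
toy_lexicon = {"not bad": 0.35, "bad": -0.7, "is": 0.03}


def toy_lexicon_polarity(text, lexicon):
    """Average the scores of matched n-grams, preferring bigrams over unigrams."""
    tokens = text.lower().split()  # simplification: whitespace split, not word_tokenize
    matches = []
    # Walk over adjacent token pairs; each pair contributes at most one match.
    for left, right in zip(tokens, tokens[1:]):
        bigram = f"{left} {right}"
        if bigram in lexicon:
            matches.append((bigram, lexicon[bigram]))
        elif left in lexicon:  # fall back to the left-hand unigram
            matches.append((left, lexicon[left]))
    if not matches:  # nothing matched: avoid dividing by zero
        return 0.0, matches
    return sum(score for _, score in matches) / len(matches), matches


score, matches = toy_lexicon_polarity("This is not bad", toy_lexicon)
print(matches)          # which n-grams contributed to the average
print(round(score, 3))  # the average of the matched scores
```

Here the pair "is not" falls back to the unigram "is", and the pair "not bad" matches as a bigram, so the negative unigram "bad" never contributes. Because only the left-hand token of an unmatched pair is tried, the final token of a text is never looked up on its own.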

**TASK 03:**
>Write a line of code in the cell below that will display the lexicon-based sentiment polarity score for the following sentence: *'Hiking in the mountains is fun and very relaxing.'*

**QUESTION 03:**
>What is the lexicon-based sentiment polarity score for the sentence *'Hiking in the mountains is fun and very relaxing.'* ? Report your answer using three decimals of precision (e.g., 0.321).

In [10]:
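# Redefine ngram_polarity as a nine-entry subset (scores copied from the output of cell 8).
# NOTE: this replaces the full dictionary built from the spreadsheet, so every
# later call to get_lexicon_polarity() only sees these entries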
ngram_polarity = {
    'Hiking': None,  # Assuming 'Hiking' is not found in the lexicon
    'in': -0.1041707080504365,
    'the': 0.01679857199911969,
    'mountains': 0.4115364865635877,
    'is': 0.03294916414674498,
    'fun': 0.3,
    'and': 0.00822730644149024,
    'very': 0.2,
    'relaxing': 0.3421578136589262
}

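# Redefine get_lexicon_polarity() as a unigram-only version: it splits on whitespace,
# does not lowercase the text, and does not build bigrams, so 'Hiking' and
# 'relaxing.' (with its trailing period) are not matched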
def get_lexicon_polarity(raw_text):
    words = raw_text.split()
    polarity = 0
    total_ngram_matches = 0

    for word in words:
        if word in ngram_polarity and ngram_polarity[word] is not None:
            polarity += ngram_polarity[word]
            total_ngram_matches += 1

    if total_ngram_matches == 0:
        return 0  # Return 0 if no matches are found

    # Compute the overall average polarity score for the input text
    polarity /= total_ngram_matches
    return polarity

# Sentence to evaluate
sentence = 'Hiking in the mountains is fun and very relaxing.'

# Calculate and display the lexicon-based sentiment polarity score for the sentence
# using three decimals of precision
polarity_score = get_lexicon_polarity(sentence)
print(f"The lexicon-based sentiment polarity score for the sentence is: {polarity_score:.3f}")


The lexicon-based sentiment polarity score for the sentence is: 0.124


Congratulations, you've just built a complete sentiment analyzer!

### Compute Lexicon-Based Polarity Scores for Each Product Review
Next, let's use our `get_lexicon_polarity()` function to compute lexicon-based polarity scores for each product review in the dataframe.

Run the code cell below to compute a polarity score for each product review and add those polarity scores to the dataframe.

In [11]:
#compute lexicon-based polarity scores for each review
lexicon_polarities = []
for text in df.review_text:
  lexicon_polarities.append(get_lexicon_polarity(text))

#add lexicon-based polarity scores to the dataframe
df['lexicon_polarity'] = lexicon_polarities

#display the first 10 rows in the dataframe
df.head(10)

| | review_text | polarity | lexicon_polarity |
| --- | --- | --- | --- |
| 0 | Really pathetic: I bought this at the airport ... | -1 | 0.020189 |
| 1 | Am I missing stomething?: The characters are n... | -1 | 0.016688 |
| 2 | Great--as a Comedy! Where's the Zero Stars but... | -1 | -0.005368 |
| 3 | WRONG PART SENT...: HI! I THINK THAT I GOT WRO... | -1 | 0.0 |
| 4 | Try to find a replacement gasket: Spend the mo... | -1 | 0.010676 |
| 5 | valeo ball: When i received this item not only... | -1 | 0.015574 |
| 6 | Cuisipro Rotary Cheese Grater: This tool is ve... | -1 | 0.026772 |
| 7 | on behalf of all christians out there...i'm so... | -1 | -0.009308 |
| 8 | BAD CUSTOMER SERVICE: My clock ..was defective... | -1 | 0.022923 |
| 9 | Very disappointed: After purchasing and spendi... | -1 | 0.022015 |


In [12]:
#display descriptive statistics for our new lexicon-based polarity scores
df.lexicon_polarity.describe()

count    3000.000000
mean        0.012156
std         0.028442
min        -0.104171
25%         0.000000
50%         0.012513
75%         0.020836
max         0.200000
Name: lexicon_polarity, dtype: float64

### Predict the Sentiment Polarity Label for Each Product Review

In [13]:
#compute tertiles (i.e., quantiles for a three-way split) for the lexicon polarity scores
tertiles = df.lexicon_polarity.quantile([1/3, 2/3])
tertiles

0.333333    0.004089
0.666667    0.017896
Name: lexicon_polarity, dtype: float64
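
The tertile split turns continuous scores into three labels: scores at or below the first threshold become -1, scores at or below the second become 0, and the rest become 1. The sketch below reproduces that idea in plain Python on invented scores. The helper names are hypothetical, and `linear_quantile()` assumes the linear interpolation that pandas uses by default for `quantile()`.

```python
def linear_quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])


def tertile_labels(scores):
    """Label each score -1, 0, or 1 according to the tertile it falls in."""
    first = linear_quantile(scores, 1 / 3)
    second = linear_quantile(scores, 2 / 3)
    return [-1 if s <= first else 0 if s <= second else 1 for s in scores]


# Invented scores, for illustration only.
scores = [0.1, -0.2, 0.3, 0.0, -0.1, 0.2]
labels = tertile_labels(scores)
print(labels)
```

Note that this kind of split always assigns roughly one third of the texts to each label, regardless of how many are actually positive, neutral, or negative.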

In [14]:
#compute predicted polarity labels using quantiles for lexicon-based polarity scores
lexicon_polarities = []
first_tertile_threshold = tertiles.iloc[0]
second_tertile_threshold = tertiles.iloc[1]
for lexicon_polarity in df.lexicon_polarity:
  if lexicon_polarity <= first_tertile_threshold: #if this lexicon polarity score is in the first tertile
    lexicon_polarities.append(-1) #assign a label of "-1"
  elif lexicon_polarity <= second_tertile_threshold: #if this lexicon polarity score is in the second tertile
    lexicon_polarities.append(0) #assign a label of "0"
  else: #if this lexicon polarity score is in the third tertile
    lexicon_polarities.append(1) #assign a label of "1"

#store lexicon-based predicted polarity labels in the dataframe
df['lexicon_polarity'] = lexicon_polarities

#display the first 10 rows in the dataframe
df.head(10)

| | review_text | polarity | lexicon_polarity |
| --- | --- | --- | --- |
| 0 | Really pathetic: I bought this at the airport ... | -1 | 1 |
| 1 | Am I missing stomething?: The characters are n... | -1 | 0 |
| 2 | Great--as a Comedy! Where's the Zero Stars but... | -1 | -1 |
| 3 | WRONG PART SENT...: HI! I THINK THAT I GOT WRO... | -1 | -1 |
| 4 | Try to find a replacement gasket: Spend the mo... | -1 | 0 |
| 5 | valeo ball: When i received this item not only... | -1 | 0 |
| 6 | Cuisipro Rotary Cheese Grater: This tool is ve... | -1 | 1 |
| 7 | on behalf of all christians out there...i'm so... | -1 | -1 |
| 8 | BAD CUSTOMER SERVICE: My clock ..was defective... | -1 | 1 |
| 9 | Very disappointed: After purchasing and spendi... | -1 | 1 |


**TASK 04:**
>Write a line of code in the cell below that will display the number of positive product reviews in the dataframe that were correctly labeled as positive by the lexicon-based sentiment analyzer. ***Hint:*** the Pandas `crosstab()` function may prove useful!

**QUESTION 04:**
>How many positive product reviews in the dataframe were correctly labeled as positive by the lexicon-based sentiment analyzer?

In [15]:
import pandas as pd

# Verify the columns in the DataFrame
print(df.columns)

# Compute the crosstab of actual vs. predicted polarity labels using 'polarity' and 'lexicon_polarity'
crosstab_result = pd.crosstab(df['polarity'], df['lexicon_polarity'])

# Display the number of positive product reviews correctly labeled as positive
correctly_labeled_positive = crosstab_result.at[1, 1] if 1 in crosstab_result.index and 1 in crosstab_result.columns else 0
print(f"Number of positive product reviews correctly labeled as positive: {correctly_labeled_positive}")



Index(['review_text', 'polarity', 'lexicon_polarity'], dtype='object')
Number of positive product reviews correctly labeled as positive: 348
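
A cross-tabulation of actual versus predicted labels is just a count of (actual, predicted) pairs. The plain-Python sketch below, with invented labels and hypothetical helper names, shows how the "correctly labeled as positive" cell and the overall accuracy both fall out of those counts.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def accuracy_from_counts(counts):
    """Share of pairs on the diagonal, where actual equals predicted."""
    correct = sum(n for (a, p), n in counts.items() if a == p)
    return correct / sum(counts.values())


# Invented labels, for illustration only.
actual = [1, 1, 0, -1, -1, 0]
predicted = [1, 0, 0, -1, 1, 1]

counts = confusion_counts(actual, predicted)
print(counts[(1, 1)])                # positives correctly labeled as positive
print(accuracy_from_counts(counts))  # overall accuracy
```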


**TASK 05:**
>Write some code in the cell below that will determine the overall accuracy of the polarity label predictions that were made by the lexicon-based sentiment analyzer.

**QUESTION 05:**
>What is the overall accuracy of the polarity label predictions that were made by the lexicon-based sentiment analyzer? Report your answer using three decimals of precision (e.g., 0.876)

In [16]:
#determine the overall accuracy of the lexicon-based sentiment analyzer's polarity label predictions

import pandas as pd

# Verify the columns in the DataFrame
print(df.columns)

# Ensure the DataFrame has the necessary columns
assert 'polarity' in df.columns, "The DataFrame must contain a 'polarity' column."
assert 'lexicon_polarity' in df.columns, "The DataFrame must contain a 'lexicon_polarity' column."

# Calculate the number of correct predictions
correct_predictions = (df['polarity'] == df['lexicon_polarity']).sum()

# Calculate the total number of predictions
total_predictions = df.shape[0]

# Calculate the overall accuracy
accuracy = correct_predictions / total_predictions

# Display the overall accuracy using three decimals of precision
print(f"Overall accuracy of the polarity label predictions: {accuracy:.3f}")


Index(['review_text', 'polarity', 'lexicon_polarity'], dtype='object')
Overall accuracy of the polarity label predictions: 0.332


## Part 02 - Machine Learning-Based Sentiment Polarity Classification

### Split Data into Training and Testing Sets
Since we'll be working with supervised machine learning models from this point forward, we'll need to split our data into training and testing sets. Our models will be trained using the training data and evaluated using the testing data. In this way, we'll have a good understanding of how a model could be expected to perform in the real world.

Run the code cell below to split the data into two dataframes, one of which contains the training data and the other of which contains the testing data.

In [17]:
#split data into training and testing sets
df_train, df_test = train_test_split(df.copy(), test_size=0.3, shuffle=True, random_state=42)
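
As a rough illustration of what a shuffled 70/30 split does, here is a plain-Python sketch. It is not scikit-learn's implementation: `simple_train_test_split()` is a hypothetical helper, its shuffle order will differ from `train_test_split()`, and rounding the test share up is an assumption about how fractional sizes are handled.

```python
import math
import random


def simple_train_test_split(rows, test_size=0.3, seed=42):
    """Shuffle a copy of the rows and cut off the first test_size share for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_test = math.ceil(test_size * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]


train_rows, test_rows = simple_train_test_split(range(10))
print(len(train_rows), len(test_rows))
```

With 3,000 reviews and `test_size=0.3`, the same arithmetic gives 900 testing rows and 2,100 training rows.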

**TASK 06:**
>Write a line of code in the cell below that will display the number of rows in the training dataframe.

**QUESTION 06:**
>How many rows are in the training dataframe?

In [18]:
#display the number of rows in the training dataframe
num_rows = df_train.shape[0]
print(f"Number of rows in the training dataframe: {num_rows}")


Number of rows in the training dataframe: 2100


### Compute TF-IDF Scores for the Product Reviews
Next, let's compute the TF-IDF scores for each review in the training and testing sets. Note that the TF-IDF scores for the reviews in the *testing* set are computed using the vocabulary from the *training* set. Again, this is necessary to reflect real-world conditions in which new reviews for which we are assigning polarity labels would not have been used to construct the vocabulary.

Run the code cell below to compute the TF-IDF scores for the reviews in the training and testing sets.

In [19]:
#build the vocabulary of unique words and compute TF-IDF scores for each review in the training set
vectorizer = TfidfVectorizer()
train_tfidf_scores = vectorizer.fit_transform(df_train.review_text).toarray()
vocabulary = vectorizer.vocabulary_

#compute TF-IDF scores for each review in the testing set using the vocabulary from the training set
test_tfidf_scores = vectorizer.transform(df_test.review_text).toarray()

#add TF-IDF scores to the training and testing dataframes
df_train['tfidf_scores'] = [tfidf_scores for tfidf_scores in train_tfidf_scores]
df_test['tfidf_scores'] = [tfidf_scores for tfidf_scores in test_tfidf_scores]
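
The fit-on-training, transform-on-testing pattern can be sketched without scikit-learn. The code below is a simplified illustration, not `TfidfVectorizer` itself: the helper names are hypothetical, tokenization is a plain whitespace split, and the smoothed IDF `ln((1 + n) / (1 + df)) + 1` with L2 normalization is, to my understanding, what the vectorizer's defaults compute.

```python
import math
from collections import Counter


def fit_tfidf(train_docs):
    """Learn the vocabulary index and IDF weights from the training documents only."""
    doc_tokens = [doc.lower().split() for doc in train_docs]
    vocab = sorted({tok for toks in doc_tokens for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    n_docs = len(train_docs)
    doc_freq = Counter(tok for toks in doc_tokens for tok in set(toks))
    idf = [math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab]
    return index, idf


def transform_tfidf(doc, index, idf):
    """Score a document against the learned vocabulary; unseen words are ignored."""
    vector = [0.0] * len(index)
    for tok, count in Counter(doc.lower().split()).items():
        if tok in index:
            vector[index[tok]] = count * idf[index[tok]]
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector


index, idf = fit_tfidf(["great product", "bad product"])
test_vector = transform_tfidf("great gadget", index, idf)
print(index)
print(test_vector)
```

The word "gadget" never appeared in the training documents, so it has no column and contributes nothing to the testing vector, which has the same length as every training vector.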

### Identify All of the Unique Unigrams, Bigrams, Part-of-Speech (POS) Unigrams, and POS Bigrams in the Training Data
Next, we need to identify the set of unique text unigrams, text bigrams, POS unigrams, and POS bigrams that appear in the training data. These lists will serve as the basis for calculating the corresponding probability distributions for each review.

Run the code cell below to generate lists of all of the unique unigrams, bigrams, POS unigrams, and POS bigrams that appear in the training data.

In [20]:
#get the combined text for all of the reviews in the training set
all_text = ' '.join(df_train.review_text)

#tokenize the text
tokens = word_tokenize(all_text)

#compute unigrams and bigrams for the text
unigrams = list(nltk.ngrams(tokens, n=1))
bigrams = list(nltk.ngrams(tokens, n=2, pad_left=True, pad_right=True,
                      left_pad_symbol='<s>', right_pad_symbol='</s>'))

#generate part-of-speech (POS) tags for the tokens
pos_tags = pos_tag(tokens)

#extract just the POS tags from the POS tuples
pos_tags = [pos_tag for token, pos_tag in pos_tags]

#compute unigrams and bigrams for the POS tags
pos_unigrams = list(nltk.ngrams(pos_tags, n=1))
pos_bigrams = list(nltk.ngrams(pos_tags, n=2, pad_left=True, pad_right=True,
                      left_pad_symbol='<s>', right_pad_symbol='</s>'))

#get lists of unique unigrams, bigrams, POS unigrams, and POS bigrams from the training data
unigrams = list(Counter(unigrams).keys())
bigrams = list(Counter(bigrams).keys())
pos_unigrams = list(Counter(pos_unigrams).keys())
pos_bigrams = list(Counter(pos_bigrams).keys())
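
Because the cell above joins all training reviews into one string before computing n-grams, the padding symbols are added once for the whole corpus rather than once per review. The sketch below uses a hypothetical `padded_bigrams()` helper, written to mimic padded bigram generation, to show what that changes for two invented reviews.

```python
def padded_bigrams(tokens, left_pad="<s>", right_pad="</s>"):
    """Return bigrams with one padding symbol added on each side."""
    padded = [left_pad, *tokens, right_pad]
    return list(zip(padded, padded[1:]))


# Invented, pre-tokenized reviews for illustration.
review_a = ["Great", "product"]
review_b = ["Broke", "quickly"]

joined = padded_bigrams(review_a + review_b)                     # corpus-level padding
separate = padded_bigrams(review_a) + padded_bigrams(review_b)   # per-review padding

print(set(joined) - set(separate))   # bigrams that span the review boundary
print(set(separate) - set(joined))   # start/end bigrams seen only per review
```

The joined version contains a bigram that crosses the boundary between the two reviews, while the per-review version contains start-of-review and end-of-review bigrams that the joined list lacks.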

**TASK 07:**
>Write some code in the cell below that will display the total number of unique unigrams, bigrams, POS unigrams, POS bigrams, and vocabulary words in the training set. Also compute and display the sum of all of these values, which will reveal the total number of available features.

**QUESTION 07:**
>What is the total number of features among the unigrams, bigrams, POS unigrams, POS bigrams, and vocabulary words?

In [21]:
#display the total number of unique unigrams, bigrams, POS unigrams, POS bigrams, and vocabulary words
#in the training set, as well as the sum of all of these values

# The unique unigrams, bigrams, POS unigrams, and POS bigrams were already
# computed from the training data in the previous cell, so they are reused here.

# Calculate the number of unique unigrams, bigrams, POS unigrams, POS bigrams, and vocabulary words
num_unique_unigrams = len(unigrams)
num_unique_bigrams = len(bigrams)
num_unique_pos_unigrams = len(pos_unigrams)
num_unique_pos_bigrams = len(pos_bigrams)
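# Note: set(tokens) counts unique word_tokenize() tokens, so it matches the unigram count;
# the TF-IDF vocabulary built by TfidfVectorizer is stored separately in `vocabulary`
vocabulary_words = set(tokens)
num_vocabulary_words = len(vocabulary_words)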

# Calculate the sum of all these values
total_unique_items = num_unique_unigrams + num_unique_bigrams + num_unique_pos_unigrams + num_unique_pos_bigrams + num_vocabulary_words

# Display the results
print("Total number of unique unigrams:", num_unique_unigrams)
print("Total number of unique bigrams:", num_unique_bigrams)
print("Total number of unique POS unigrams:", num_unique_pos_unigrams)
print("Total number of unique POS bigrams:", num_unique_pos_bigrams)
print("Total number of vocabulary words:", num_vocabulary_words)
print("Sum of all unique items:", total_unique_items)



Total number of unique unigrams: 18111
Total number of unique bigrams: 86683
Total number of unique POS unigrams: 44
Total number of unique POS bigrams: 1252
Total number of vocabulary words: 18111
Sum of all unique items: 124201


### Compute Unigram, Bigram, POS Unigram, and POS Bigram Probability Distributions
Next, we'll create a function that will be able to compute unigram, bigram, POS unigram, and POS bigram probability distributions for any input text. These probability distributions will be based on the lists of unique unigrams, bigrams, POS unigrams, and POS bigrams that were identified previously.

Run the code cell below to add a `get_probability_distributions()` function to your Python program. The code for this function may appear to be long and complicated, but it's actually quite simple. Since we need to compute four different probability distributions, most of the code is just repeated four times.

In [22]:
#define a function that will compute unigram, bigram, POS unigram, and POS bigram probability distributions
#for the specified review text
def get_probability_distributions(review_text):
  #tokenize the text
  review_tokens = word_tokenize(review_text)
  #compute unigrams and bigrams for the text
  review_unigrams = list(nltk.ngrams(review_tokens, n=1))
  review_bigrams = list(nltk.ngrams(review_tokens, n=2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
  #generate part-of-speech (POS) tags for the tokens
  review_pos_tags = pos_tag(review_tokens)
  #extract just the POS tags from the POS tuples
  review_pos_tags = [pos_tag for token, pos_tag in review_pos_tags]
  #compute unigrams and bigrams for the POS tags
  review_pos_unigrams = list(nltk.ngrams(review_pos_tags, n=1))
  review_pos_bigrams = list(nltk.ngrams(review_pos_tags, n=2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>'))
  #compute unigram, bigram, POS unigram, and POS bigram frequency distributions for this review
  review_unigram_frequencies = Counter(review_unigrams)
  review_bigram_frequencies = Counter(review_bigrams)
  review_pos_unigram_frequencies = Counter(review_pos_unigrams)
  review_pos_bigram_frequencies = Counter(review_pos_bigrams)
  #get total number of ngram occurrences for each frequency distribution
  n_review_unigram_frequencies = sum(review_unigram_frequencies.values())
  n_review_bigram_frequencies = sum(review_bigram_frequencies.values())
  n_review_pos_unigram_frequencies = sum(review_pos_unigram_frequencies.values())
  n_review_pos_bigram_frequencies = sum(review_pos_bigram_frequencies.values())
  #compute unigram probability distribution
  unigram_probabilities = np.zeros(len(unigrams))
  for i in range(len(unigrams)):
    if unigrams[i] in review_unigram_frequencies:
      unigram_probabilities[i] = review_unigram_frequencies[unigrams[i]] / n_review_unigram_frequencies
  #compute bigram probability distribution
  bigram_probabilities = np.zeros(len(bigrams))
  for i in range(len(bigrams)):
    if bigrams[i] in review_bigram_frequencies:
      bigram_probabilities[i] = review_bigram_frequencies[bigrams[i]] / n_review_bigram_frequencies
  #compute POS unigram probability distribution
  pos_unigram_probabilities = np.zeros(len(pos_unigrams))
  for i in range(len(pos_unigrams)):
    if pos_unigrams[i] in review_pos_unigram_frequencies:
      pos_unigram_probabilities[i] = review_pos_unigram_frequencies[pos_unigrams[i]] / n_review_pos_unigram_frequencies
  #compute POS bigram probability distribution
  pos_bigram_probabilities = np.zeros(len(pos_bigrams))
  for i in range(len(pos_bigrams)):
    if pos_bigrams[i] in review_pos_bigram_frequencies:
      pos_bigram_probabilities[i] = review_pos_bigram_frequencies[pos_bigrams[i]] / n_review_pos_bigram_frequencies
  #return the probability distributions
  return unigram_probabilities, bigram_probabilities, pos_unigram_probabilities, pos_bigram_probabilities
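
Each of the four distributions follows the same recipe: count the n-grams in the review, divide by the total number of n-gram occurrences, and place each result in the slot reserved for that n-gram in the training list. A minimal sketch with invented POS tags and a hypothetical `probability_vector()` helper:

```python
from collections import Counter


def probability_vector(feature_list, observed):
    """Relative frequency of each known feature among the observed n-grams."""
    counts = Counter(observed)
    total = sum(counts.values())
    if total == 0:  # empty review: return all zeros instead of dividing by zero
        return [0.0] * len(feature_list)
    return [counts[feature] / total for feature in feature_list]


# Invented training features and review n-grams, for illustration only.
features = [("DT",), ("NN",), ("VBZ",), ("JJ",)]
observed = [("DT",), ("NN",), ("VBZ",), ("RB",), ("NN",)]

vector = probability_vector(features, observed)
print(vector)
print(sum(vector))
```

The vector sums to less than 1 here because `("RB",)` occurs in the review but not in the feature list, so its share of the probability mass has no slot.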

Now we're ready to actually compute the unigram, bigram, POS unigram, and POS bigram probability distributions. Yay!

Run the code cell below to compute and add all of the probability distributions for the training and testing data to their respective dataframes.

***Note:*** This will take about 90 seconds to run.

In [23]:
#compute and add the unigram, bigram, POS unigram, and POS bigram probability distributions for each training review to the dataframe
df_train['all_probs'] = [get_probability_distributions(review_text) for review_text in df_train.review_text]
df_train[['unigram_probs', 'bigram_probs', 'pos_unigram_probs', 'pos_bigram_probs']] = pd.DataFrame(df_train['all_probs'].tolist(), index=df_train.index)
del df_train['all_probs']

#compute and add the unigram, bigram, POS unigram, and POS bigram probability distributions for each testing review to the dataframe
df_test['all_probs'] = [get_probability_distributions(review_text) for review_text in df_test.review_text]
df_test[['unigram_probs', 'bigram_probs', 'pos_unigram_probs', 'pos_bigram_probs']] = pd.DataFrame(df_test['all_probs'].tolist(), index=df_test.index)
del df_test['all_probs']

### Train and Test Machine Learning-Based Sentiment Polarity Classifiers
We now have all of the features that we'll need to train a machine learning-based sentiment polarity classifier.

For the rest of this lab assignment, we'll be training and evaluating the polarity classification performance of machine learning models that have been trained using different feature vectors. We'll begin by testing a model that is trained using just the POS n-gram information, after which we'll test a model that is trained using both the text n-gram and POS n-gram probability distributions. The final, most complex model will add the TF-IDF scores to the feature vector representation.

Run the code cell below to build training and testing feature vectors that are composed of just the POS unigram probabilities and the POS bigram probabilities.

In [24]:
#build training and testing feature vectors that contain just the POS unigrams and bigrams
training_features = []
#for each row in the training dataframe
for row in df_train.itertuples():
  #combine the two feature vectors
  feature_vector = np.append(row.pos_unigram_probs, row.pos_bigram_probs)
  #add the combined feature vector to the list
  training_features.append(feature_vector)

testing_features = []
#for each row in the testing dataframe
for row in df_test.itertuples():
  #combine the two feature vectors
  feature_vector = np.append(row.pos_unigram_probs, row.pos_bigram_probs)
  #add the combined feature vector to the list
  testing_features.append(feature_vector)
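
The two loops above build one flat vector per review by appending the POS bigram probabilities to the POS unigram probabilities. As a minimal sketch of the same idea on made-up data (the toy arrays and the `stack_features` helper are hypothetical and not part of the lab), `np.concatenate` can join any number of pieces in a single call, and `np.vstack` turns the resulting list of vectors into a 2-D matrix with one row per review:

```python
import numpy as np

def stack_features(*feature_columns):
    """Join per-review feature pieces into one 2-D feature matrix.

    Each argument is a sequence holding one 1-D array per review
    (hypothetical helper, for illustration only).
    """
    # zip(*...) groups the pieces that belong to the same review
    rows = [np.concatenate(pieces) for pieces in zip(*feature_columns)]
    return np.vstack(rows)

# toy data: two reviews, each with 3 POS unigram and 2 POS bigram probabilities
pos_unigram_probs = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.6, 0.3])]
pos_bigram_probs = [np.array([0.7, 0.3]), np.array([0.4, 0.6])]

features = stack_features(pos_unigram_probs, pos_bigram_probs)
print(features.shape)  # (2, 5)
```

Passing more columns (for example, text unigram and bigram probabilities) would simply widen each row, which is how the later, more complex feature vectors in this lab are formed.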

Now that we have our first set of feature vectors, we need to get their corresponding labels.

Run the code cell below to get the polarity labels for the training and testing datasets. Note that we will be able to reuse these labels for each of the models that we train -- although the feature vectors will change from model to model, the labels are always the same.

In [25]:
#get the training and testing labels (polarity labels)
training_labels = df_train.polarity.to_list()
testing_labels = df_test.polarity.to_list()

We now have our labels and our first set of feature vectors, so we're finally ready to train a sentiment polarity classifier. We'll be training logistic regression classifiers in this lab assignment (note that scikit-learn's `LogisticRegression` treats the three polarity labels as unordered classes rather than fitting an ordinal model), but we could easily try other types of classifiers as well.

Run the code cell below to define a logistic regression classifier and train that classifier using the POS unigram and POS bigram feature vectors. After the model is trained, the overall accuracy of its predictions on the **training** set will be displayed. Remember, the model is attempting to classify each review as having a positive, neutral, or negative polarity label.

In [26]:
#define a logistic regression classifier
model = LogisticRegression(random_state=42)

#train the logistic regression classifier using the training data
model.fit(training_features, training_labels)

#calculate and display the training accuracy
training_accuracy = model.score(training_features, training_labels)
print('Training accuracy: {:.3f}'.format(training_accuracy))

Training accuracy: 0.580


**TASK 08:**
>Write some code in the cell below that will display the **testing** accuracy for the logistic regression classifier that has been trained using only POS probabilities (i.e., the performance of the classifier on the testing data).

**NOTE:** You should <u>**NOT**</u> retrain the model using the testing data! Instead, you should evaluate the ability of the current model (which was trained using the training data) to accurately predict cases in the testing data. Remember, the goal is to estimate how well the model performs on *data that it hasn't seen before*. If you retrain the model using the testing data, then the model will have "seen" all of those data before!

**QUESTION 08:**
>What is the testing accuracy for the logistic regression classifier that has been trained using only POS probabilities? Report your answer using three decimals of precision (e.g., 0.765).

In [27]:
# Use the trained logistic regression classifier to predict labels for the testing data based on the POS probabilities
predicted_labels = model.predict(testing_features)

# Compare the predicted labels with the actual labels from the testing data
# (assuming testing_labels is the ground truth labels for the testing data)
# testing_labels = df_test.polarity.to_list()

# Calculate the accuracy of the classifier on the testing data
testing_accuracy = np.mean(predicted_labels == testing_labels)

# Display the testing accuracy
print("Testing Accuracy:", testing_accuracy)



Testing Accuracy: 0.5544444444444444
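
The accuracy computed above is simply the fraction of testing reviews whose predicted label matches the true label. Here is a minimal sketch with made-up labels (the values below are illustrative only and are not drawn from the lab's data). The elementwise comparison works because `model.predict` returns a NumPy array; comparing two plain Python lists with `==` would instead yield a single `True` or `False`. The final line formats the result to the three decimals requested in the question.

```python
import numpy as np

# hypothetical ground-truth labels (a plain list, like testing_labels)
true_labels = ['positive', 'neutral', 'negative', 'positive', 'neutral', 'negative']
# hypothetical predictions (a NumPy array, like the output of model.predict)
predicted_labels = np.array(['positive', 'neutral', 'positive', 'positive', 'negative', 'negative'])

# elementwise comparison gives one boolean per review
matches = predicted_labels == np.array(true_labels)

# the mean of the booleans is the proportion of correct predictions
accuracy = matches.mean()
print('Testing accuracy: {:.3f}'.format(accuracy))  # Testing accuracy: 0.667
```

For a scikit-learn classifier, `model.score(testing_features, testing_labels)` returns the same mean accuracy without any retraining.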


Recall that our dataset contains an equal number of positive, neutral, and negative reviews. This means that our baseline for judging the classification accuracy of our models is 0.3333 (or 33.33%), since this is the level of classification accuracy that we would expect by random guessing.
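
To make this baseline concrete, here is a small sketch on hypothetical, perfectly balanced labels (the counts are made up and are not taken from the lab's dataset). With three equally frequent classes, always predicting one fixed class is correct one time in three, which is also the expected accuracy of uniform random guessing:

```python
from collections import Counter

# hypothetical balanced label set: 300 reviews per polarity class
labels = ['positive'] * 300 + ['neutral'] * 300 + ['negative'] * 300
label_counts = Counter(labels)

# accuracy of always predicting the most frequent class
majority_baseline = max(label_counts.values()) / len(labels)

# expected accuracy of guessing uniformly at random among the classes
chance_baseline = 1 / len(label_counts)

print('{:.4f} {:.4f}'.format(majority_baseline, chance_baseline))  # 0.3333 0.3333
```

If the classes were not balanced, the two numbers would differ, and the majority-class value would usually be the more demanding baseline.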

####Add Text Unigrams and Bigrams to the Feature Vector Representation
Next, let's build some more complex feature vectors that will contain additional information about the source text. Specifically, instead of using just the POS n-gram probabilities, we'll build feature vectors that are composed of the text unigram and bigram probabilities, as well as the POS unigram and bigram probabilities.

Run the code cell below to build these more complex feature vectors.

In [28]:
#build training and testing feature vectors that contain the unigrams, bigrams, POS unigrams, and POS bigrams
training_features = []
#for each row in the training dataframe
for row in df_train.itertuples():
  #combine the feature vectors
  feature_vector = np.append(row.unigram_probs, row.bigram_probs)
  feature_vector = np.append(feature_vector, row.pos_unigram_probs)
  feature_vector = np.append(feature_vector, row.pos_bigram_probs)
  #add the combined feature vector to the list
  training_features.append(feature_vector)

testing_features = []
#for each row in the testing dataframe
for row in df_test.itertuples():
  #combine the feature vectors
  feature_vector = np.append(row.unigram_probs, row.bigram_probs)
  feature_vector = np.append(feature_vector, row.pos_unigram_probs)
  feature_vector = np.append(feature_vector, row.pos_bigram_probs)
  #add the combined feature vector to the list
  testing_features.append(feature_vector)

Now that we have our more complex feature vectors, let's use those feature vectors to train a new logistic regression classifier.

Run the code cell below to train a logistic regression classifier using the more complex training feature vectors.

***Note:*** This will take about 60 seconds to run.

In [29]:
#train the logistic regression classifier using the more complex training feature vectors
model.fit(training_features, training_labels)
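
Note that the cell above reuses the `model` object from the POS-only experiment. In scikit-learn, calling `fit` again discards what was learned previously and fits from scratch (unless `warm_start` is enabled, and it is off by default), so nothing carries over from the earlier model. A minimal sketch on made-up data (the random feature values and the `toy_model` name are hypothetical):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# hypothetical labels and random features, for illustration only
labels = ['negative', 'neutral', 'positive', 'negative', 'neutral', 'positive']
rng = np.random.default_rng(0)
small_features = rng.random((6, 2))  # 6 reviews, 2 features
large_features = rng.random((6, 5))  # same reviews, 5 features

toy_model = LogisticRegression(random_state=42)

# first fit: coefficients have one row per class and one column per feature
toy_model.fit(small_features, labels)
shape_after_first_fit = toy_model.coef_.shape

# second fit on wider vectors replaces the earlier coefficients entirely
toy_model.fit(large_features, labels)
shape_after_second_fit = toy_model.coef_.shape

print(shape_after_first_fit, shape_after_second_fit)  # (3, 2) (3, 5)
```

Creating a fresh `LogisticRegression` object for each experiment would behave the same way; it just makes the independence of the experiments more explicit.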

**TASK 09:**
>Write some code in the cell below that will calculate and display the testing accuracy after training the model using the more complex training data (i.e., after adding the text unigrams and bigrams to the feature vectors). Remember, you should **NOT** retrain the model using the testing data!

**QUESTION 09:**
>What is the testing accuracy for the model that was trained using the more complex feature vector representation (i.e., the feature vector representation that includes the text unigrams and bigrams)? Report your answer using three decimals of precision (e.g., 0.783).

In [30]:
#calculate and display the testing accuracy for the more complex feature vector representation
# Use the trained logistic regression classifier to predict labels for the testing data based on the more complex feature vectors
predicted_labels = model.predict(testing_features)

# Compare the predicted labels with the actual labels from the testing data
# (assuming testing_labels is the ground truth labels for the testing data)
# testing_labels = df_test.polarity.to_list()

# Calculate the accuracy of the classifier on the testing data
testing_accuracy = np.mean(predicted_labels == testing_labels)

# Display the testing accuracy
print("Testing Accuracy:", testing_accuracy)


Testing Accuracy: 0.61


####Add TF-IDF Scores to the Feature Vector Representation
Finally, we'll build the most complex feature vector representation of the source text. Specifically, these feature vectors will consist of the text unigram and bigram probabilities, the POS unigram and bigram probabilities, and the TF-IDF scores for each review.

**TASK 10:**
>Write some code in the cells below that will:
1. Add the TF-IDF scores to the feature vector representation;
2. Train the logistic regression classifier using this new, most-complex feature vector representation; and
3. Calculate and display the testing accuracy of the model.

Remember to train the model using the training data and test its accuracy using the testing data!

**QUESTION 10:**
>What is the testing accuracy for the model that was trained using the most complex feature vector representation (i.e., the feature vector representation that includes the TF-IDF scores)? Report your answer using three decimals of precision (e.g., 0.814).

In [31]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Initialize TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform training data to compute TF-IDF scores
training_tfidf_features = tfidf_vectorizer.fit_transform(df_train['review_text'])

# Transform testing data using the trained TF-IDF vectorizer
testing_tfidf_features = tfidf_vectorizer.transform(df_test['review_text'])

# Concatenate TF-IDF scores with existing feature vectors
training_features_with_tfidf = np.hstack((training_features, training_tfidf_features.toarray()))
testing_features_with_tfidf = np.hstack((testing_features, testing_tfidf_features.toarray()))
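
The key point in the cell above is that the vectorizer is fitted on the training reviews only and then reused, unchanged, on the testing reviews, so both matrices share the same vocabulary and column order. The sketch below illustrates this on a made-up corpus (the review strings and the `train_other` stand-in features are hypothetical). It also shows one possible alternative to `.toarray()`: `scipy.sparse.hstack` keeps the combined matrix sparse, which may save memory when the vocabulary is large, and `LogisticRegression` accepts sparse input.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# hypothetical toy corpus
train_reviews = ['great phone', 'bad battery', 'okay phone']
test_reviews = ['great screen']

# learn the vocabulary and IDF weights from the training reviews only
vectorizer = TfidfVectorizer()
train_tfidf = vectorizer.fit_transform(train_reviews)

# reuse the fitted vocabulary; the unseen word 'screen' is ignored
test_tfidf = vectorizer.transform(test_reviews)

# stand-in for the n-gram probability features (2 made-up columns)
train_other = np.array([[0.2, 0.8], [0.5, 0.5], [0.9, 0.1]])

# combine the dense and sparse features without densifying the TF-IDF matrix
train_combined = hstack([csr_matrix(train_other), train_tfidf]).tocsr()

print(train_tfidf.shape, test_tfidf.shape, train_combined.shape)  # (3, 5) (1, 5) (3, 7)
```

Because the testing matrix has exactly as many columns as the training matrix, a model trained on one can score the other directly.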


In [32]:
#train the logistic regression classifier using the most complex training feature vectors
#(this will take about 75 seconds to run)
from sklearn.linear_model import LogisticRegression

# Define logistic regression classifier
model = LogisticRegression(random_state=42)

# Train the classifier using the most complex training feature vectors
model.fit(training_features_with_tfidf, training_labels)


In [33]:
#calculate and display the testing accuracy for the most complex feature vector representation
# Use the trained logistic regression classifier to predict labels for the testing data based on the most complex feature vectors
predicted_labels = model.predict(testing_features_with_tfidf)

# Compare the predicted labels with the actual labels from the testing data
# (assuming testing_labels is the ground truth labels for the testing data)
# testing_labels = df_test.polarity.to_list()

# Calculate the accuracy of the classifier on the testing data
testing_accuracy = np.mean(predicted_labels == testing_labels)

# Display the testing accuracy
print("Testing Accuracy:", testing_accuracy)


Testing Accuracy: 0.8122222222222222


**TASK 11:**
>Compare the performance of the logistic regression-based sentiment analyzer that you trained using the most complex feature vector representation with the performance of the lexicon-based sentiment analyzer from earlier in this lab assignment.

**QUESTION 11:**
>Which of the following statements is correct?
* The logistic regression-based sentiment analyzer had better performance.
* The lexicon-based sentiment analyzer had better performance.

***TIP:*** After completing all of the tasks in this lab assignment, I would recommend selecting "Restart and run all" from the "Runtime" menu. Doing this will ensure that your results are not affected by issues relating to the random number generator.

##End of Lab Assignment 06!