# 1. [20 pts] Considering two groups of reviews {sentiment 0, sentiment 1}, tokenize and lemmatize (here you can process in any way you want, such as removing contractions by regular expression, etc.) all the words in each group of reviews by combining all sentences, but not all reviews. As a result, we will have bags of words for each review and sentiment group. There must be 50,000 bags total. We neglected repetitions of the words in reviews. Compute the following: 
# Hint: Consider the event sample as the word being in the review or not.
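
One way to read the hint: each review is a single trial, and the event is "the word is present in that review's bag". A probability is then just the fraction of bags that contain the word. The sketch below illustrates this on a made-up set of bags; `toy_bags`, `p_word`, and `p_both` are hypothetical names used only for this illustration and are not part of the pipeline below.

```python
# Hypothetical toy bags (sets of stemmed words), one per review, grouped by sentiment.
toy_bags = {
    0: [{"bad", "plot"}, {"good", "bad", "act"}, {"bore"}],
    1: [{"good", "fun"}, {"great"}, {"good", "bad"}],
}

def p_word(word, sentiment=None):
    """Fraction of reviews whose bag contains `word` (optionally within one sentiment)."""
    groups = [toy_bags[sentiment]] if sentiment is not None else list(toy_bags.values())
    bags = [bag for group in groups for bag in group]
    return sum(word in bag for bag in bags) / len(bags)

def p_both(word_a, word_b):
    """Fraction of all reviews whose bag contains both words."""
    bags = [bag for group in toy_bags.values() for bag in group]
    return sum(word_a in bag and word_b in bag for bag in bags) / len(bags)

print(p_word("good"))         # 3 of the 6 toy reviews contain "good"
print(p_word("good", 0))      # 1 of the 3 toy reviews with sentiment 0
print(p_both("good", "bad"))  # 2 of the 6 toy reviews contain both words
```

Because each bag is a set, a word that is repeated inside one review still counts once, which matches the "word is in the review or not" event.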

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import nltk
import csv
from urllib.request import urlopen
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
import re
from collections import Counter
from nltk import FreqDist
print(f'NLTK library version= {nltk.__version__}')

NLTK library version= 3.8.1


In [2]:
%%time
# Read the reviews and split them by sentiment label - note the encoding
reviews_sentiment_0 = []
reviews_sentiment_1 = []
with open('movie_data.csv', 'r', encoding="utf-8") as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    next(reader)  # skip header
    for line in reader:
        # line[0] is the review text, line[1] is the sentiment label
        if line[1] == "0":
            reviews_sentiment_0.append(line[0])
        elif line[1] == "1":
            reviews_sentiment_1.append(line[0])

CPU times: total: 234 ms
Wall time: 1.05 s


In [3]:
%%time
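# Replace HTML line breaks with spaces
sub_br_0 = [re.sub('<br />', ' ', review) for review in reviews_sentiment_0]
sub_br_1 = [re.sub('<br />', ' ', review) for review in reviews_sentiment_1]
# Keep runs of letters that are followed by whitespace or the end of the review.
# Note: a word directly followed by punctuation (e.g. "good.") does not match.
_pat1 = r'(?i)[a-zA-Z]+(?=\s|$)'

# One list of word tokens per review
sentences_sentiment_0 = [re.findall(_pat1, review) for review in sub_br_0]
sentences_sentiment_1 = [re.findall(_pat1, review) for review in sub_br_1]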

CPU times: total: 9.47 s
Wall time: 11.3 s
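
The lookahead in `_pat1` only accepts a run of letters when the next character is whitespace or the end of the string. A minimal sketch of what that means in practice, using a made-up sentence (the `sample` text and the alternative pattern are assumptions for illustration, not part of the pipeline):

```python
import re

# Same pattern as _pat1 in the cell above.
pattern = r'(?i)[a-zA-Z]+(?=\s|$)'

# Hypothetical review text, used only to show the pattern's behaviour.
sample = "A good film, not bad. I don't regret it"

tokens = re.findall(pattern, sample)
print(tokens)

# For comparison only: a plain letter-run pattern without the lookahead.
alt_tokens = re.findall(r'[a-zA-Z]+', sample)
print(alt_tokens)
```

With the lookahead, "film" and "bad" are skipped because they are followed by punctuation, and "don't" contributes only the trailing "t". The reported probabilities should be read with this in mind, since they count only occurrences that survive this tokenization.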


In [4]:
%%time
def clean_sent(tokens_per_review):
    """Remove English stop words, then stem the remaining tokens of each review."""
    stop_words = set(stopwords.words('english'))
    stemmer = SnowballStemmer('english')

    # Apply stop word removal first (case-insensitive), then stem.
    # The Snowball stemmer lowercases tokens, so the resulting stems are lowercase.
    kept_tokens = [[token for token in review if token.lower() not in stop_words] for review in tokens_per_review]
    return [[stemmer.stem(token) for token in review] for review in kept_tokens]

CPU times: total: 0 ns
Wall time: 0 ns


In [5]:
%%time
cleaned_word_0=clean_sent(sentences_sentiment_0)
cleaned_word_1=clean_sent(sentences_sentiment_1)

CPU times: total: 1min 48s
Wall time: 2min 44s


In [6]:
%%time
# Convert each review's stems to a set so that repeated words count only once
list_of_bags_0 = [set(review) for review in cleaned_word_0]
list_of_bags_1 = [set(review) for review in cleaned_word_1]


CPU times: total: 1.83 s
Wall time: 2.27 s


# Probability('good')

In [7]:
%%time
list_of_bags = [list_of_bags_0, list_of_bags_1]

def probability(word, sentiment=None):
    """Fraction of reviews whose bag contains `word`, overall or within one sentiment group."""
    # Note: the bags hold stems, so `word` should already be in stemmed form
    if sentiment is not None:
        hits = [(word.lower() in bag) for bag in list_of_bags[sentiment]]
        average_frequency = sum(hits) / len(hits)
    else:
        hits = [[(word.lower() in bag) for bag in sentiment_group] for sentiment_group in list_of_bags]
        average_frequency = (sum(hits[0]) + sum(hits[1])) / (len(hits[0]) + len(hits[1]))

    return f"Probability of '{word}': {average_frequency:.2%}"

CPU times: total: 0 ns
Wall time: 0 ns


In [8]:
%%time
probability('good')

CPU times: total: 0 ns
Wall time: 41.3 ms


"Probability of 'good': 33.20%"

# Probability('good' | sentiment=0)

In [9]:
probability('good',0)

"Probability of 'good': 33.52%"

# Probability('good' | sentiment=1)

In [10]:
probability('good',1)

"Probability of 'good': 32.88%"

# Probability('good' and 'bad')

In [11]:
def probability_both(word, word2):
    # True for each review whose bag contains both words
    hits = [[(word.lower() in bag) and (word2.lower() in bag) for bag in sentiment_group] for sentiment_group in list_of_bags]
    average_frequency = (sum(hits[0]) + sum(hits[1])) / (len(hits[0]) + len(hits[1]))
    return f"Probability of '{word}' and '{word2}': {average_frequency:.2%}"

In [12]:
probability_both('good','bad')

"Probability of 'good' and 'bad': 8.87%"

# 2. [20 pts] According to this dataset and your NLP pipeline, is the word 'good' a good discriminator for sentiments?

According to this dataset and pipeline, 'good' is not a good discriminator for sentiment: it appears in 33.52% of sentiment 0 reviews and 32.88% of sentiment 1 reviews, a difference of less than one percentage point.

# How about the word 'bad'?

In [13]:
probability('bad',1)

"Probability of 'bad': 10.36%"

In [14]:
probability('bad',0)

"Probability of 'bad': 30.69%"

The word "bad" is likely a good discriminator for sentiment: it appears in 30.69% of negative (sentiment 0) reviews but only 10.36% of positive (sentiment 1) reviews, roughly three times as often.
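
One way to make "good discriminator" concrete is to ask how much seeing the word shifts the belief about the sentiment. The sketch below applies Bayes' rule to the rounded percentages reported above. It assumes both sentiment groups are equally likely, and `posterior_negative` is a hypothetical helper name used only here.

```python
# Conditional probabilities reported above: P(word present | sentiment).
p_word_given_sentiment = {
    "good": {0: 0.3352, 1: 0.3288},
    "bad": {0: 0.3069, 1: 0.1036},
}

# Assumption: the two sentiment groups are the same size (balanced dataset).
prior = {0: 0.5, 1: 0.5}

def posterior_negative(word):
    """P(sentiment=0 | word present), by Bayes' rule."""
    likelihood = p_word_given_sentiment[word]
    evidence = sum(likelihood[s] * prior[s] for s in prior)
    return likelihood[0] * prior[0] / evidence

for word in ("good", "bad"):
    print(f"P(sentiment=0 | '{word}') = {posterior_negative(word):.3f}")
```

Under these assumptions, seeing "good" leaves the posterior close to the 50% prior, while seeing "bad" moves it to roughly three in four, which is consistent with the answers above.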

# 3. [20 pts] Compute the mutual information I('good', 'bad') in the IMDB dataset using your pipeline. 
# Hint: You must ignore the sentiments in this case and pool all reviews. Also see the hint in (Q1.)

In [15]:
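import math
def conditional_probability(A,B):
    """Return p(A,B) * log2(p(A,B) / (p(A) * p(B))) over the pooled reviews, formatted as a percentage.

    Despite the name, this is the mutual-information term for the case where both words are present.
    """
    combined_list=list_of_bags[0]+list_of_bags[1]
    prob_list_of_bags=[(A.lower() in review) and (B.lower() in review) for review in combined_list]
    prob_of_A=[(A.lower() in review) for review in combined_list]
    prob_of_B=[(B.lower() in review) for review in combined_list]
    average_frequency=sum(prob_list_of_bags)/len(prob_list_of_bags)  # p(A, B)
    average_A=sum(prob_of_A)/len(prob_of_A)  # p(A)
    average_B=sum(prob_of_B)/len(prob_of_B)  # p(B)
    # Guard against log(0) when the words never co-occur, and against division by zero
    if(average_frequency == 0 or average_A==0 or average_B == 0):
        return f"{0:.2%}"
    final_freq=average_frequency*math.log(average_frequency/(average_A*average_B),2)
    return f"{final_freq:.2%}"
conditional_probability('good','bad')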

'3.37%'

# Comment on results.

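According to the results, the computed value is small: about 0.034 bits (displayed above as 3.37%). It is positive because 'good' and 'bad' appear together in 8.87% of reviews, slightly more often than independence would predict, but the dependence is weak. Note that the function evaluates only the term of the mutual-information sum in which both words are present. On this evidence, knowing that a review contains 'good' says little about whether it contains 'bad', so I would say the mutual information is minimal.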
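
The function above evaluates one term, the case where both words are present. The full mutual information of two present/absent events sums four such terms, one per combination. The sketch below does that from the rounded percentages printed earlier rather than from the bags themselves, so the result is only approximate. `mutual_information` is a hypothetical helper, and P('bad') is taken as the average of the two conditional values on the assumption that the sentiment groups are the same size.

```python
import math

def mutual_information(p_a, p_b, p_ab):
    """Mutual information (bits) of two binary presence/absence events.

    p_a, p_b: marginal probabilities that each word is present.
    p_ab: probability that both words are present.
    """
    # Joint probabilities of the four present/absent combinations.
    joint = {
        (1, 1): p_ab,
        (1, 0): p_a - p_ab,
        (0, 1): p_b - p_ab,
        (0, 0): 1 - p_a - p_b + p_ab,
    }
    marg_a = {1: p_a, 0: 1 - p_a}
    marg_b = {1: p_b, 0: 1 - p_b}
    total = 0.0
    for (a, b), p in joint.items():
        if p > 0:  # a zero-probability cell contributes nothing
            total += p * math.log2(p / (marg_a[a] * marg_b[b]))
    return total

# Rounded values from the cells above (assumption: equal-sized sentiment groups).
p_good = 0.3320
p_bad = (0.3069 + 0.1036) / 2
p_good_and_bad = 0.0887

mi_good_bad = mutual_information(p_good, p_bad, p_good_and_bad)
print(f"I('good'; 'bad') is approximately {mi_good_bad:.4f} bits")
```

With these inputs the four-term sum comes out well below the single-term value of 0.034 bits, because the terms where exactly one of the words is present are negative and partly cancel the positive ones. This does not change the conclusion that the two words share very little information.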

# 4. [20 pts] Compute the following mutual information:

# I('good', sentiment=0)

In [17]:
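import math
def conditional_probability_sentiment(A, B):
    """Return p(A, sentiment=B) * log2(p(A, sentiment=B) / (p(A) * p(sentiment=B))) as a percentage.

    This is a single term of the mutual-information sum, so it can be negative.
    """
    combined_list = list_of_bags[0] + list_of_bags[1]
    small_list = list_of_bags[B]
    A_in_sentiment = [(A.lower() in review) for review in small_list]
    prob_of_A = [(A.lower() in review) for review in combined_list]
    # Use the actual list lengths (25,000 per group and 50,000 in total for this dataset)
    average_A = sum(prob_of_A) / len(combined_list)               # p(A)
    average_frequency = sum(A_in_sentiment) / len(combined_list)  # p(A, sentiment=B)
    average_B = len(small_list) / len(combined_list)              # p(sentiment=B)
    if average_frequency == 0:
        return f"{0:.2%}"  # avoid log(0)
    final_freq = average_frequency * math.log(average_frequency / (average_A * average_B), 2)
    return f"{final_freq:.2%}"
conditional_probability_sentiment('good',0)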

'0.23%'

# I('good', sentiment=1)

In [18]:
conditional_probability_sentiment('good',1)

'-0.23%'

# I('bad', sentiment=0)

In [19]:
conditional_probability_sentiment('bad',0)

'8.91%'

# I('bad', sentiment=1)

In [20]:
conditional_probability_sentiment('bad',1)

'-5.11%'

# Comment on your findings regarding question (Q2.)

My findings in question 4 support the conclusions reached in question 2. The prevalence of 'good' is almost the same in sentiment 0 and sentiment 1, so its values are tiny (0.23% and -0.23%). The values for 'good' with sentiment 1 and 'bad' with sentiment 1 are negative because each word is less prevalent in sentiment 1 than in the pooled reviews, which makes the ratio inside the logarithm smaller than 1. A negative value is possible here because each figure is a single term of the mutual-information sum rather than the full sum, which is never negative. It also makes sense that the largest value belongs to 'bad' with sentiment 0 (8.91%), since we already knew that 'bad' is about three times as prevalent in sentiment 0 as in sentiment 1.
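
Each value above is one term of the sum for a word and a sentiment. As a rough cross-check, the sketch below adds up all four terms (word present or absent, sentiment 0 or 1) using the rounded conditional probabilities from question 1 and 2. `word_sentiment_mi` is a hypothetical helper, and the default of equal-sized sentiment groups is an assumption.

```python
import math

def word_sentiment_mi(p_word_given_neg, p_word_given_pos, p_neg=0.5):
    """I(word presence; sentiment) in bits for a binary word event and two sentiments."""
    p_sent = {0: p_neg, 1: 1 - p_neg}
    p_given = {0: p_word_given_neg, 1: p_word_given_pos}
    # Marginal probability that the word is present.
    p_word = sum(p_given[s] * p_sent[s] for s in p_sent)
    total = 0.0
    for s in p_sent:
        for present, p_cond in ((1, p_given[s]), (0, 1 - p_given[s])):
            joint = p_cond * p_sent[s]
            marginal = p_word if present else 1 - p_word
            if joint > 0:  # skip empty cells to avoid log(0)
                total += joint * math.log2(joint / (marginal * p_sent[s]))
    return total

# Rounded conditional probabilities reported in the earlier cells.
mi_good = word_sentiment_mi(0.3352, 0.3288)
mi_bad = word_sentiment_mi(0.3069, 0.1036)
print(f"I('good'; sentiment) = {mi_good:.5f} bits")
print(f"I('bad'; sentiment)  = {mi_bad:.5f} bits")
```

Both sums are non-negative, and the one for 'bad' is far larger than the one for 'good', which matches the ranking suggested by the single-term values.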


# 5. [20 pts] What is the marginal entropy H(X) of a variable X, and what is the mutual information of X with itself?

The marginal entropy of a variable X is $H(X) = -\sum_x p(x) \log_2 p(x)$, where the sum runs over the possible outcomes x and p(x) is the probability of outcome x. It measures the average uncertainty in X, in bits.
Mutual information is the amount of information one variable carries about another. For a variable with itself it can be written as $I(X;X) = H(X) - H(X|X)$, where $H(X|X)$ is the conditional entropy: the uncertainty that remains about X once X is known. Since knowing X leaves no uncertainty about X, $H(X|X) = 0$, and therefore $I(X;X) = H(X)$. In other words, the mutual information of a variable with itself is its own entropy, which is why entropy is sometimes called self-information.
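
Using the identity I(X;X) = H(X) - H(X|X) quoted above, a small numerical sketch can compare H(X) with I(X;X) computed directly from the joint distribution of X with itself, which puts all probability on the diagonal. The distribution `dist` is a made-up example, and both function names are hypothetical.

```python
import math

def entropy(probs):
    """Marginal entropy H(X) in bits: -sum of p(x) * log2 p(x)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def self_mutual_information(probs):
    """I(X;X) in bits, using the joint p(x, x') = p(x) if x == x' else 0."""
    total = 0.0
    for i, p_i in enumerate(probs):
        for j, p_j in enumerate(probs):
            joint = p_i if i == j else 0.0
            if joint > 0:  # off-diagonal cells have zero probability
                total += joint * math.log2(joint / (p_i * p_j))
    return total

# Hypothetical distribution: a word present in 25% of reviews, absent in 75%.
dist = [0.25, 0.75]
print(f"H(X)   = {entropy(dist):.4f} bits")
print(f"I(X;X) = {self_mutual_information(dist):.4f} bits")
```

The two printed values agree, since each diagonal term reduces to p(x) * log2(1 / p(x)).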
