# PROBLEM 1 – Reading the data (5 pts)
• Read in file "train.tsv" from the Stanford Sentiment Treebank (SST) as shared in the GLUE task.
(See section "DATA" above.)
• Next, split your dataset into train, test, and validation datasets with these sizes (Note that 100
is a small size for test and validation datasets; it was selected to simplify this homework):
o Validation: 100 rows
o Test: 100 rows
o Training: All remaining rows.
• Review the column "label" which indicates positive=1 or negative=0 sentiment. What is the prior
probability of each class on your training set? Show results in your notebook.

In [1]:
import nltk
from nltk.tokenize import word_tokenize
import pandas as pd
from sklearn.model_selection import train_test_split

# Download NLTK data
nltk.download('punkt')

# Read in the SST dataset
sst_df = pd.read_csv('/Users/pradaapss/Desktop/Semester 3/CS 585 NLP/Assignment 3/SST-2/train.tsv', sep='\t', usecols=['sentence', 'label'])

sst_df.head(3)

[nltk_data] Downloading package punkt to /Users/pradaapss/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Unnamed: 0,sentence,label
0,hide new secretions from the parental units,0
1,"contains no wit , only labored gags",0
2,that loves its characters and communicates som...,1


In [2]:
from sklearn.model_selection import train_test_split

"""train_df, test_df = train_test_split(sst_df, test_size=100, random_state=42)
train_df, val_df = train_test_split(train_df, test_size=100, random_state=42)"""

val_df = sst_df.sample(n=100, random_state=1)
test_df = sst_df.sample(n=100, random_state=2)
train_df = sst_df.drop(val_df.index).drop(test_df.index)

# Calculate the prior probabilities of each class in the training set
positive_count = (train_df['label'] == 1).sum()
negative_count = (train_df['label'] == 0).sum()

total_count = len(train_df)

# Calculate the prior probabilities of the positive and negative classes
prior_prob_positive = positive_count / total_count
prior_prob_negative = negative_count / total_count

print("Prior Probability of Positive Class:", prior_prob_positive)
print("Prior Probability of Negative Class:", prior_prob_negative)



Prior Probability of Positive Class: 0.5578936395180867
Prior Probability of Negative Class: 0.4421063604819134


# PROBLEM 2 – Tokenizing data (10 pts)
• Write a function that takes a sentence as input, represented as a string, and converts it to a
tokenized sequence padded by start and end symbols. For example, "hello class" would be
converted to:
    o ['<s>', 'hello', 'class', '</s>']
• Apply your function to all sentences in your training set. Show the tokenization of the first
sentence of your training set in your notebook output.
• What is the vocabulary size of your training set? Include your start and end symbol in your
vocabulary. Show your result in your notebook.

In [3]:
def tokenize(sentence):
    tokens = sentence.split()
    
    tokens = ['<s>'] + tokens + ['</s>']
    
    return tokens

# Apply the tokenize function to each sentence in the 'sentence' column of the DataFrame
train_df['tokenized_sentences'] = train_df['sentence'].apply(tokenize)

print("Tokenization of the first sentence: ", train_df['tokenized_sentences'].iloc[0])

vocabulary = set()

# Iterate through each tokenized sentence in the 'tokenized_sentences' column
for sentence in train_df['tokenized_sentences']:
    vocabulary.update(sentence)

# Calculate the size of the vocabulary
vocabulary_size = len(vocabulary)
print(f"Vocabulary Size: {vocabulary_size}")


Tokenization of the first sentence:  ['<s>', 'hide', 'new', 'secretions', 'from', 'the', 'parental', 'units', '</s>']
Vocabulary Size: 14815


# PROBLEM 3 – Bigram counts (10 pts)
• Write a function that takes an array of tokenized sequences as input (i.e., a list of lists) and
counts bigram frequencies in that dataset. Your function should return a two-level dictionary
(dictionary of dictionaries) or similar data structure, where the value at index [wi][wj] gives the
frequency count of bigram (wi, wj). For example, this expression would give the counts of the
bigram "academy award":
bigram_counts["academy"]["award"]
• Apply your function to the output of problem 2. You should build one counter that represents all
sentences in the training dataset.
• Use this result to show how many times a sentence starts with "the". That is, how often do you
see the bigram ("<s>","the") in your training set? Show results in your notebook.
    
PROGRAMMING HINTS:
• You can use the function nltk.bigrams to convert a sequence to bigrams, but you are not
required to do so.
• In python, you can use function "dict.get(key, default)" to return the value "default" if "key" is
not present in a dictionary.

In [4]:
from collections import defaultdict

def count_bigrams(tokenized_sequences):
    bigram_counts = defaultdict(lambda: defaultdict(int))

    for sequence in tokenized_sequences:
        for i in range(len(sequence) - 1):
            wi, wj = sequence[i], sequence[i + 1]
            bigram_counts[wi][wj] += 1

    return bigram_counts

# Apply the function to the tokenized sequences in the training set
bigram_counts = count_bigrams(train_df['tokenized_sentences'])

# Show how many times a sentence starts with "the" ("<s>", "the")
the_start_count = bigram_counts["<s>"]["the"]
print("Number of times a sentence starts with 'the' in the training set:", the_start_count)


Number of times a sentence starts with 'the' in the training set: 4456


# PROBLEM 4 – Smoothing (20 pts)
• Write a function that implements formula [6.13] in that E-NLP textbook (page 129, 6.2
Smoothing and discounting). That is, write a function that applies smoothing and returns a
(negative) log-probability of a word given the previous word in the sequence. It is suggested
that you use these parameters:
    o The current word, wm
    o The previous word, wm-1
    o bigram counts (output of Problem 3)
    o alpha, a smoothing parameter
    o vocabulary size
• Using this function to show the log probability that the word "academy" will be followed by the
word "award". Try this with alpha=0.001 and alpha=0.5 (you should see very different results!).
Show your results in your notebook.
PROGRAMMING ALTERNATIVE: If you are familiar with python classes, you may build a LanguageModel
class that is initialized with the above parameters and implements formula [6.13] as a member function.

In [5]:
import math

def calculate_smoothed_log_probability(wm, wm1, bigram_counts, alpha, vocabulary_size):
    bigram_count = bigram_counts.get(wm1, {}).get(wm, 0)
    unigram_count_wm1 = sum(bigram_counts.get(wm1, {}).values())
    smoothed_probability = (bigram_count + alpha) / (unigram_count_wm1 + alpha * vocabulary_size)
    log_probability = -math.log(smoothed_probability)
    
    return -log_probability

# Use the function to calculate the log probability of "academy" followed by "award"
word_wm1 = "academy"
word_wm = "award"
alpha_0_001 = 0.001
alpha_0_5 = 0.5
vocabulary_size = len(vocabulary)  # Vocabulary size from Problem 2

log_prob_alpha_0_001 = calculate_smoothed_log_probability(word_wm, word_wm1, bigram_counts, alpha_0_001, vocabulary_size)
log_prob_alpha_0_5 = calculate_smoothed_log_probability(word_wm, word_wm1, bigram_counts, alpha_0_5, vocabulary_size)

print("Log probability of 'academy' followed by 'award' with alpha=0.001:", log_prob_alpha_0_001)
print("Log probability of 'academy' followed by 'award' with alpha=0.5:", log_prob_alpha_0_5)


Log probability of 'academy' followed by 'award' with alpha=0.001: -1.025138261286736
Log probability of 'academy' followed by 'award' with alpha=0.5: -6.173046583212077


# PROBLEM 5 – Sentence log-probability (10 pts)
• Write a function that returns the log-probability of a sentence which is expected to be a
negative number. To do this, assume that the probability of a word in a sequence only depends
on the previous word. It is suggested that you use these parameters:
    o A sentence represented as a single python string
    o bigram counts (output of Problem 3)
    o alpha, a smoothing parameter
    o vocabulary size
• Use your function to compute the log probability of these two sentences (Note that the 2nd is
not natural English, so it should have a lower (more negative) result that the first):
o "this was a really great movie but it was a little too long."
o "long too little a was it but movie great really a was this."

In [6]:
import math

def calculate_sentence_log_probability(sentence, bigram_counts, alpha, vocabulary_size):
    sentence_tokens = sentence.split()
    log_probability = 0.0
    
    for i in range(1, len(sentence_tokens)):
        wm1, wm = sentence_tokens[i - 1], sentence_tokens[i]
        log_probability += calculate_smoothed_log_probability(wm, wm1, bigram_counts, alpha, vocabulary_size)
    
    return log_probability

# Use the function to calculate the log probability of the two sentences
sentence1 = "this was a really great movie but it was a little too long."
sentence2 = "long too little a was it but movie great really a was this."

alpha = 0.001  # You can adjust the smoothing parameter as needed
vocabulary_size = len(vocabulary)  # Vocabulary size from Problem 2

log_prob_sentence1 = calculate_sentence_log_probability(sentence1, bigram_counts, alpha, vocabulary_size)
log_prob_sentence2 = calculate_sentence_log_probability(sentence2, bigram_counts, alpha, vocabulary_size)

print("Log probability of sentence 1:", log_prob_sentence1)
print("Log probability of sentence 2:", log_prob_sentence2)


Log probability of sentence 1: -71.25235479495367
Log probability of sentence 2: -145.59681149835444


# PROBLEM 6 – Tuning Alpha (10pts)
Next, use your validation set to select a good value for "alpha".
• Apply the function you wrote in Problem 5 to your validation dataset using 3 different values of
"alpha", such as (0.001, 0.01, 0.1). For each value, show the log-likelihood estimate of the
validation set. That is, in your notebook show the sum of the log probabilities of all sentences.
• Which alpha gives you the best result? To indicate your selection to the grader, save your
selected value to a variable named "selected_alpha"
. 

In [7]:
validation_sentences = [
    "this was a really great movie but it was a little too long.",
    "long too little a was it but movie great really a was this."
]

# List of different alpha values to test
alphas = [0.001, 0.01, 0.1]

# Function to calculate the log-likelihood estimate
def calculate_log_likelihood(alpha, sentences, bigram_counts, vocabulary_size):
    log_likelihood = 0.0
    for sentence in sentences:
        log_likelihood += calculate_sentence_log_probability(sentence, bigram_counts, alpha, vocabulary_size)
    return log_likelihood

# Calculate and display the log-likelihood estimate for each alpha
for alpha in alphas:
    log_likelihood = calculate_log_likelihood(alpha, validation_sentences, bigram_counts, vocabulary_size)
    print(f"Log-Likelihood Estimate (alpha={alpha}): {round(log_likelihood, 4)}")

# Select an alpha value from the list of alphas (e.g., the third element)
selected_alpha = alphas[2]

# Print the selected alpha
print(f"Selected Alpha: {selected_alpha}")

Log-Likelihood Estimate (alpha=0.001): -216.8492
Log-Likelihood Estimate (alpha=0.01): -194.0312
Log-Likelihood Estimate (alpha=0.1): -183.0741
Selected Alpha: 0.1


# PROBLEM 7 – Applying Language Models (20 pts)
In this problem, you will classify your test set of 100 sentences by sentiment, by applying your work
from previous problems and modeling the language of both positive and negative sentiment.
To do this, you can follow these steps:
• Separate your training dataset into positive and negative sentences, and compute vocabulary
size and bigram counts for both datasets.
• For each of the 100 sentences in your test set:
o Compute both a "positive sentiment score" and a "negative sentiment score" using (1)
the function you wrote in Problem 5, (2) Bayes rule, and (3) class priors as computed in
Problem 1.
o Compare these scores to assign a predicted sentiment label to the sentence.
• What is the class distribution of your predicted label? That is, how often did your method
predict positive sentiment, correctly or incorrectly? How often did it predict negative
sentiment? Show results in your notebook.
• Compare your predicted label to the true sentiment label. What is the accuracy of this
experiment? That is, how often did the true and predicted label match on the test set? Show
results in your notebook.
For this problem, you do not need to re-tune alpha for your positive and negative datasets (although it
may be a good idea to do so), you can re-use the value selected in Problem 6.

In [8]:
import math

# Split the training set into positive and negative sentences
positive_sentences = train_df[train_df['label'] == 1]['sentence']
negative_sentences = train_df[train_df['label'] == 0]['sentence']

# Count bigrams in positive and negative sentences
positive_bigram_counts = count_bigrams(positive_sentences.apply(tokenize))
negative_bigram_counts = count_bigrams(negative_sentences.apply(tokenize))

# Calculate the prior probabilities for positive and negative classes
positive_prior = len(positive_sentences) / len(train_df)
negative_prior = 1 - positive_prior

def classify_sentence(sentence, positive_bigram_counts, negative_bigram_counts, positive_prior, negative_prior, alpha, vocabulary_size):
    
    positive_log_prob = calculate_sentence_log_probability(sentence, positive_bigram_counts, alpha, vocabulary_size) + math.log(positive_prior)
    negative_log_prob = calculate_sentence_log_probability(sentence, negative_bigram_counts, alpha, vocabulary_size) + math.log(negative_prior)
    return 1 if positive_log_prob > negative_log_prob else 0

# Apply the classification function to the test set and add predicted labels
test_df['predicted_label'] = test_df['sentence'].apply(
    lambda sentence: classify_sentence(sentence, positive_bigram_counts, negative_bigram_counts, positive_prior, negative_prior, selected_alpha, vocabulary_size)
)

# Calculate class distribution and accuracy
class_distribution = test_df['predicted_label'].value_counts(normalize=True)
accuracy = (test_df['label'] == test_df['predicted_label']).mean()

# Print results
print("Class Distribution of Predicted Labels:")
print(class_distribution)
print(f"Accuracy: {accuracy * 100:.2f}%")


Class Distribution of Predicted Labels:
1    0.52
0    0.48
Name: predicted_label, dtype: float64
Accuracy: 87.00%


# PROBLEM 8 – Markov Assumption (10 pts – Answer in Blackboard)
• Where in this homework did you apply the Markov assumption?
• Imagine you applied the 2nd
-order Markov assumption, using trigrams. Do you think your
accuracy results would increase or decrease? Why? Or, if you are not sure, give a benefit or
drawback of using trigrams for this experiment. (Note: You do not need to rerun this experiment
with trigrams to answer this question.)
