<a href="https://colab.research.google.com/github/SameekshaNalla/Sameeksha_INFO5731_Fall2024/blob/main/Nalla_Sameeksha_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>



# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from assignment two. You need to write the code from **scratch instead of using any pre-existing libraries**:

(1) Count the frequency of all the N-grams (N=3) and (N=2).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w1 w2) / count(w1), where w1 is the first word of the bigram and w2 is the second. For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency(noun phrase) / max frequency(noun phrase) on the whole dataset.
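
As a minimal sketch of parts (1) and (2), the snippet below counts unigrams, bigrams, and trigrams on a tiny made-up corpus and derives bigram probabilities from them. The corpus, the `ngrams` helper, and whitespace tokenization are assumptions for illustration, not the assignment dataset; only the standard library is used.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token tuples from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy corpus (hypothetical, not the assignment dataset).
corpus = [
    "i really like this movie",
    "i really enjoyed the plot",
    "she really hated the ending",
]

unigram_counts = Counter()
bigram_counts = Counter()
trigram_counts = Counter()
for doc in corpus:
    tokens = doc.split()  # naive whitespace tokenization
    unigram_counts.update(tokens)
    bigram_counts.update(ngrams(tokens, 2))
    trigram_counts.update(ngrams(tokens, 3))

# P(second word | first word) = count(first second) / count(first)
bigram_probs = {
    (w1, w2): count / unigram_counts[w1]
    for (w1, w2), count in bigram_counts.items()
}

print(round(bigram_probs[("really", "like")], 2))  # 0.33
```

Here "really" occurs three times and "really like" once, which reproduces the 1 / 3 example from part (2).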

Print the result as a table whose columns are the noun phrases and whose rows are the 100 reviews (abstracts, or tweets).
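
One possible shape for this table is sketched below. It assumes the noun phrases have already been extracted (the extraction step is not shown), and it reads the formula as "count of the phrase in this review divided by the largest dataset-wide count of any phrase", which is only one interpretation. The phrases and review IDs are made up.

```python
from collections import Counter

# Hypothetical phrases, assumed to be already extracted for each review.
review_phrases = {
    "Review_1": ["great acting", "plot twist", "great acting"],
    "Review_2": ["plot twist"],
    "Review_3": ["great acting", "sound design"],
}

# Dataset-wide frequency of every phrase, and the largest such frequency.
dataset_counts = Counter(p for phrases in review_phrases.values() for p in phrases)
max_freq = max(dataset_counts.values())
columns = sorted(dataset_counts)

# Rows are reviews, columns are phrases, cells are count-in-review / max_freq.
table = {}
for review_id, phrases in review_phrases.items():
    local_counts = Counter(phrases)
    table[review_id] = {p: local_counts[p] / max_freq for p in columns}

# Print a plain-text table.
width = max(len(p) for p in columns) + 2
print("review".ljust(10) + "".join(p.rjust(width) for p in columns))
for review_id, row in table.items():
    print(review_id.ljust(10) + "".join(f"{row[p]:.3f}".rjust(width) for p in columns))
```

If the intended denominator is instead the maximum count of the same phrase across reviews, only the `max_freq` line needs to change to a per-column maximum.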

In [1]:
# Write your code here

# -------------------------------
# N-GRAM ANALYSIS (Assignment 3)
# Dataset: densho_narrators_clean.csv
# Column: display_name
# Author: Sameeksha Nalla
# -------------------------------

import csv
from collections import defaultdict
import re
import pandas as pd


# 1. LOAD DATA

filename = "densho_narrators_clean.csv"

texts = []
with open(filename, 'r', encoding='utf-8') as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row['display_name'].strip():
            texts.append(row['display_name'].lower())

print(f"Loaded {len(texts)} reviews/entries.")


# 2. TOKENIZATION HELPER

def tokenize(text):
    # remove punctuation and split
    text = re.sub(r'[^a-z\s]', '', text.lower())
    return [w for w in text.split() if w]


# 3. COMPUTE N-GRAM FREQUENCIES (BIGRAMS + TRIGRAMS)

bigram_freq = defaultdict(int)
trigram_freq = defaultdict(int)
unigram_freq = defaultdict(int)

for text in texts:
    tokens = tokenize(text)
    for i in range(len(tokens)):
        unigram_freq[tokens[i]] += 1
        if i < len(tokens) - 1:
            bigram = (tokens[i], tokens[i + 1])
            bigram_freq[bigram] += 1
        if i < len(tokens) - 2:
            trigram = (tokens[i], tokens[i + 1], tokens[i + 2])
            trigram_freq[trigram] += 1

print("\nTop 5 Bigrams:")
for bg, cnt in sorted(bigram_freq.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(bg, ":", cnt)

print("\nTop 5 Trigrams:")
for tg, cnt in sorted(trigram_freq.items(), key=lambda x: x[1], reverse=True)[:5]:
    print(tg, ":", cnt)


# 4. BIGRAM PROBABILITIES
# Formula: P(w2 | w1) = count(w1 w2) / count(w1)


bigram_prob = {}
for (w1, w2), cnt in bigram_freq.items():
    bigram_prob[(w1, w2)] = round(cnt / unigram_freq[w1], 4)

print("\nSample Bigram Probabilities:")
for k, v in list(bigram_prob.items())[:5]:
    print(f"P({k[1]}|{k[0]}) = {v}")


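# 5. EXTRACT NOUN PHRASES (SIMPLE HEURISTIC)
# The texts were lowercased on load, so .title() capitalizes every word. The regex
# therefore matches, roughly, runs of space-separated words with two or more letters,
# not true noun phrases (no part-of-speech tagging is done here).

noun_phrases = []
for text in texts:
    phrases = re.findall(r'([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)', text.title())
    noun_phrases.extend(phrases)

# Compute global frequencies of each noun phrase
noun_freq = defaultdict(int)
for phrase in noun_phrases:
    noun_freq[phrase] += 1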

max_freq = max(noun_freq.values()) if noun_freq else 1


# 6. RELATIVE PROBABILITY TABLE

rows = []
for i, text in enumerate(texts):
    row_counts = defaultdict(int)
    local_phrases = re.findall(r'([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)', text.title())
    for phrase in local_phrases:
        row_counts[phrase] += 1
    # relative probability = freq(phrase in this review) / max dataset-wide freq of any phrase
    rel_probs = {phrase: round(row_counts[phrase] / max_freq, 3) for phrase in noun_freq.keys()}
    rel_probs['review_id'] = f"Review_{i+1}"
    rows.append(rel_probs)

# Create pandas DataFrame (rows=reviews, cols=noun phrases)
df = pd.DataFrame(rows).fillna(0).set_index('review_id')

print("\nRelative Probability Table (sample):")
print(df.head())


# 7. SAVE RESULTS

df.to_csv("noun_phrase_relative_probabilities.csv", index=True)
print("\nSaved table as 'noun_phrase_relative_probabilities.csv'")

FileNotFoundError: [Errno 2] No such file or directory: 'densho_narrators_clean.csv'
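
The error above most likely means the CSV was not present in the notebook's working directory when the cell ran. A minimal sketch of a loader that fails with a more explicit message is shown below; `load_column` is a hypothetical helper, and the demo writes a throwaway file instead of using the real dataset.

```python
import csv
import tempfile
from pathlib import Path

def load_column(path, column):
    """Return the non-empty values of `column` from a CSV file, lowercased."""
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found (working directory: {Path.cwd()}); "
            "upload the file or correct the path before running this cell"
        )
    with path.open("r", encoding="utf-8", newline="") as f:
        reader = csv.DictReader(f)
        return [
            row[column].strip().lower()
            for row in reader
            if (row.get(column) or "").strip()
        ]

# Demo on a temporary file with made-up names (not the real dataset).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo_path.write_text("display_name\nAlice Example\n\nBob Sample\n", encoding="utf-8")
    names = load_column(demo_path, "display_name")
    try:
        load_column(Path(tmp) / "missing.csv", "display_name")
        missing_raised = False
    except FileNotFoundError:
        missing_raised = True

print(names)
```

In Colab, files uploaded to a session disappear when the runtime resets, so the upload has to be repeated before "Run all".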

## Question 2 (25 points)

**Understand TF-IDF and Document representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the documents-terms weights (tf * idf) matrix.

(2) To rank the documents with respect to a query (design the query yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.
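
A minimal sketch of part (1) on three made-up documents follows. It uses a smoothed IDF, log((N + 1) / (df + 1)) + 1, which is one common variant; other definitions are equally valid. The documents, the `tfidf_vector` helper, and whitespace tokenization are assumptions for illustration.

```python
import math
from collections import Counter

# Toy documents (hypothetical, not the assignment dataset).
docs = [
    "a haunting performance",
    "an outstanding movie",
    "a dull movie with a weak plot",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
doc_freq = Counter(w for doc in tokenized for w in set(doc))

# Smoothed IDF (one common variant).
idf = {w: math.log((n_docs + 1) / (doc_freq[w] + 1)) + 1 for w in vocab}

def tfidf_vector(tokens):
    """Map every vocabulary term to tf * idf for one tokenized document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return {w: (counts[w] / total) * idf[w] for w in vocab}

# Document-term matrix: one row (dict) per document.
matrix = [tfidf_vector(doc) for doc in tokenized]

for w in ("haunting", "movie"):
    print(w, round(matrix[0][w], 4), round(matrix[1][w], 4))
```

Terms that appear in fewer documents, such as "haunting", receive a larger IDF than terms shared across documents, such as "movie".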

Note: You need to write the code from scratch instead of using any **pre-existing libraries**.
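
The ranking step can be sketched with only the `math` module by storing each vector as a sparse dictionary. The weights below are made-up numbers standing in for real TF-IDF values, and `cosine_similarity` is a hypothetical helper name.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written weights (not computed from real data).
doc_vectors = {
    "doc_1": {"haunting": 0.8, "performance": 0.6},
    "doc_2": {"outstanding": 0.7, "movie": 0.7},
    "doc_3": {"dull": 0.9, "movie": 0.4},
}
query_vector = {"outstanding": 0.5, "movie": 0.5, "haunting": 0.5}

# Sort documents from most to least similar to the query.
ranking = sorted(
    ((doc_id, cosine_similarity(vec, query_vector)) for doc_id, vec in doc_vectors.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for doc_id, score in ranking:
    print(doc_id, round(score, 4))
```

Because cosine similarity normalizes by vector length, a short document that shares two query terms can outrank a longer one that shares only one.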

In [None]:
# Write your code here

# ----------------------------------------------
# TF-IDF & Cosine Similarity (Assignment 3)
# Dataset: densho_narrators_clean.csv
# Column: display_name
# Author: Sameeksha Nalla
# ----------------------------------------------

import csv, re, math
from collections import defaultdict


# 1. LOAD DATA

filename = "densho_narrators_clean.csv"
docs = []
with open(filename, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        if row["display_name"].strip():
            docs.append(row["display_name"].lower())

print(f"Loaded {len(docs)} documents.")


# 2. TOKENIZATION HELPER

def tokenize(text):
    text = re.sub(r"[^a-z\s]", "", text.lower())
    return [w for w in text.split() if w]

tokenized_docs = [tokenize(d) for d in docs]


# 3. BUILD VOCABULARY

vocab = sorted(set(w for doc in tokenized_docs for w in doc))
N = len(tokenized_docs)


# 4. COMPUTE TERM FREQUENCY (TF)

tf = []  # list of dictionaries, one per document
for doc in tokenized_docs:
    counts = defaultdict(int)
    for w in doc:
        counts[w] += 1
    total = len(doc)
    tf.append({w: counts[w] / total for w in counts})


# 5. COMPUTE INVERSE DOCUMENT FREQUENCY (IDF)

df = defaultdict(int)
for doc in tokenized_docs:
    for w in set(doc):  # count each term once per document
        df[w] += 1

idf = {w: math.log((N + 1) / (df[w] + 1)) + 1 for w in vocab}

# 6. COMPUTE TF-IDF MATRIX

tfidf = []
for i, doc_tf in enumerate(tf):
    weights = {w: doc_tf.get(w, 0) * idf[w] for w in vocab}
    tfidf.append(weights)

print("\nSample TF-IDF (first doc, top 10 terms):")
for w, val in list(sorted(tfidf[0].items(), key=lambda x: x[1], reverse=True))[:10]:
    print(w, ":", round(val, 4))


# 7. COSINE SIMILARITY FUNCTION

def cosine_sim(vec1, vec2):
    # dot product
    dot = sum(vec1[w] * vec2.get(w, 0) for w in vec1)
    mag1 = math.sqrt(sum(v * v for v in vec1.values()))
    mag2 = math.sqrt(sum(v * v for v in vec2.values()))
    return 0 if mag1 == 0 or mag2 == 0 else dot / (mag1 * mag2)


# 8. USER QUERY (DESIGN YOUR OWN)

query = "an outstanding movie with a haunting performance and best character development"
q_tokens = tokenize(query)
q_counts = defaultdict(int)
for w in q_tokens:
    q_counts[w] += 1
q_total = len(q_tokens)
q_tf = {w: q_counts[w] / q_total for w in q_counts}
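# Query terms absent from the corpus have df = 0; use the same smoothed IDF formula as above.
oov_idf = math.log((N + 1) / 1) + 1
q_tfidf = {w: q_tf[w] * idf.get(w, oov_idf) for w in q_tf}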


# 9. RANK DOCUMENTS BY COSINE SIMILARITY

scores = []
for i, doc_vec in enumerate(tfidf):
    sim = cosine_sim(doc_vec, q_tfidf)
    scores.append((i, sim))

ranked = sorted(scores, key=lambda x: x[1], reverse=True)
print("\nTop 5 documents for query:")
for idx, score in ranked[:5]:
    print(f"Doc {idx+1} | Score = {round(score,4)} | Text = {docs[idx][:60]}...")


# 10. OPTIONAL: SAVE MATRIX & SCORES

import pandas as pd
pd.DataFrame(tfidf).to_csv("tfidf_matrix.csv", index=False)
pd.DataFrame(ranked, columns=["doc_id", "similarity"]).to_csv("cosine_similarity_scores.csv", index=False)
print("\nSaved results to 'tfidf_matrix.csv' and 'cosine_similarity_scores.csv'")


## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, GloVe, ULMFiT, or a fine-tuned BERT model).

(2) Visualize the embeddings using PCA or t-SNE in 2D. Create a scatter plot of at least 20 words and show how similar words cluster together.

(3) Calculate the cosine similarity between a few pairs of words to see if the model captures semantic similarity accurately.
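
If word2vec in its skip-gram form is chosen for part (1), the training data consists of (center, context) word pairs, and negative examples are commonly drawn from unigram counts raised to the 0.75 power. The sketch below illustrates both ideas on made-up inputs; the helper names and the window size are assumptions for illustration.

```python
from collections import Counter

def skipgram_pairs(tokens, window):
    """Return (center, context) pairs for tokens within `window` positions of each other."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def negative_sampling_distribution(counts, power=0.75):
    """Turn raw word counts into sampling probabilities proportional to count**power."""
    total = sum(c ** power for c in counts.values())
    return {w: (c ** power) / total for w, c in counts.items()}

# Made-up sentence and counts.
sentence = ["the", "family", "moved", "home"]
pairs = skipgram_pairs(sentence, window=1)
print(pairs)

counts = Counter({"the": 16, "home": 1})
dist = negative_sampling_distribution(counts)
print({w: round(p, 3) for w, p in dist.items()})
```

The exponent flattens the distribution: "the" makes up 16/17 of the raw counts but only 8/9 of the sampling mass, so rarer words are drawn as negatives somewhat more often.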

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
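
For intuition about what skip-gram training with negative sampling does to the vectors, the sketch below applies a single gradient step to a true (center, context) pair and to a (center, negative) pair, using plain Python lists. The vectors, learning rate, and function names are made up, and each update is computed from the pre-update values of both vectors.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def positive_update(v_center, v_context, lr):
    """One gradient-ascent step on log sigmoid(v_center . v_context)."""
    g = 1.0 - sigmoid(dot(v_center, v_context))
    new_center = [c + lr * g * o for c, o in zip(v_center, v_context)]
    new_context = [o + lr * g * c for c, o in zip(v_center, v_context)]
    return new_center, new_context

def negative_update(v_center, v_negative, lr):
    """One gradient-ascent step on log sigmoid(-v_center . v_negative)."""
    g = sigmoid(dot(v_center, v_negative))
    new_center = [c - lr * g * n for c, n in zip(v_center, v_negative)]
    new_negative = [n - lr * g * c for c, n in zip(v_center, v_negative)]
    return new_center, new_negative

# Made-up 3-dimensional vectors (a real model would use 300 dimensions).
v_c, v_o, v_n = [0.1, -0.2, 0.3], [0.2, 0.1, -0.1], [0.3, 0.2, 0.1]
before_pos = dot(v_c, v_o)
before_neg = dot(v_c, v_n)

c1, o1 = positive_update(v_c, v_o, lr=0.1)
c2, n2 = negative_update(v_c, v_n, lr=0.1)

print("true pair score:", before_pos, "->", dot(c1, o1))
print("negative pair score:", before_neg, "->", dot(c2, n2))
```

The true pair's dot product rises and the negative pair's falls, which is how words that share contexts end up with similar vectors.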

In [None]:
!pip install matplotlib


In [None]:
# Write your code here


# ----------------------------------------------
# Question 3 – Word Embedding Model from Scratch
# Dataset: densho_narrators_clean.csv
# Column: display_name
# ----------------------------------------------

import csv, re, math, random, time
from collections import Counter, defaultdict
import numpy as np
import matplotlib.pyplot as plt

# LOAD DATA
file = "densho_narrators_clean.csv"
text_col = "display_name"
docs = []
with open(file, "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        txt = (row.get(text_col) or "").strip()
        if txt:
            docs.append(txt)

def tokenize(t):
    t = re.sub(r"[^a-z\s]", " ", t.lower())
    return [w for w in t.split() if w]

sentences = [tokenize(d) for d in docs if tokenize(d)]
print("Loaded", len(sentences), "sentences")

#  BUILD VOCAB
word_counts = Counter(w for s in sentences for w in s)
vocab = sorted(word_counts)
word2idx = {w:i for i,w in enumerate(vocab)}
idx2word = {i:w for w,i in word2idx.items()}
V = len(vocab)
print("Vocab size:", V)

#  NEGATIVE SAMPLING TABLE
pow_freq = np.array([word_counts[w]**0.75 for w in vocab])
neg_dist = pow_freq / pow_freq.sum()
neg_table = np.random.choice(np.arange(V), size=int(1e6), p=neg_dist)
def sample_neg(k): return neg_table[np.random.randint(0, len(neg_table), size=k)]

# INIT PARAMETERS
D = 300
W_in = (np.random.rand(V,D)-0.5)/D
W_out = (np.random.rand(V,D)-0.5)/D
window, neg_k, epochs = 2, 5, 2
lr = 0.025

def sigmoid(x): return 1/(1+np.exp(-x))

# TRAINING (Skip-gram + Negative Sampling)
for ep in range(epochs):
    start = time.time()
    random.shuffle(sentences)
    for s in sentences:
        for i,c in enumerate(s):
            center = word2idx[c]
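            for j in range(max(0,i-window), min(len(s),i+window+1)):
                if i==j: continue
                context = word2idx[s[j]]
                # Snapshot the center vector: W_in[center] is a view, so updating it
                # in place would change v_c before the output vectors are updated.
                v_c = W_in[center].copy()
                grad_c = np.zeros(D)  # accumulated gradient for the center vector
                # positive
                v_o = W_out[context]
                grad = 1 - sigmoid(np.dot(v_c,v_o))
                grad_c += grad*v_o
                W_out[context] += lr*grad*v_c
                # negatives
                for n in sample_neg(neg_k):
                    if n == context: continue  # do not treat the true context as a negative
                    v_n = W_out[n]
                    grad_n = sigmoid(np.dot(v_c,v_n))  # equals 1 - sigmoid(-v_c.v_n)
                    grad_c -= grad_n*v_n
                    W_out[n] -= lr*grad_n*v_c
                W_in[center] += lr*grad_c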
    print(f"Epoch {ep+1} done in {time.time()-start:.2f}s")

emb = W_in.copy()

# PCA (2-D) USING SVD
words_show = [w for w,_ in word_counts.most_common(30)]
X = emb[[word2idx[w] for w in words_show]]
Xc = X - X.mean(0)
U,S,VT = np.linalg.svd(Xc, full_matrices=False)
proj = Xc @ VT[:2].T

plt.figure(figsize=(9,7))
plt.scatter(proj[:,0], proj[:,1])
for i,w in enumerate(words_show):
    plt.annotate(w,(proj[i,0],proj[i,1]),fontsize=9)
plt.title("Word Embeddings (300-D → 2-D PCA)")
plt.show()

# COSINE SIMILARITY EXAMPLES
def cos(a,b):
    a,b = emb[word2idx[a]], emb[word2idx[b]]
    return np.dot(a,b)/(np.linalg.norm(a)*np.linalg.norm(b)+1e-9)

pairs=[("george","mary"),("family","home"),("man","woman")]
for a,b in pairs:
    if a in word2idx and b in word2idx:
        print(f"cos({a},{b})={cos(a,b):.4f}")
    else:
        print(f"Pair missing: {a},{b}")





## Question 4 (20 Points)

**Create your own training and evaluation dataset for an NLP task.**

 **You don't need to write a program for this question!**

 For example, if you collected movie review or product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub and submit the file link below.

*   This dataset will be used for assignment four: sentiment analysis and text classification.
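
No program is required for this question, but a quick sanity check of the annotated file can catch mislabeled rows before upload. The sketch below is optional and assumes the three column names listed above and lowercase labels; the sample rows and the `validate_annotations` helper are made up.

```python
import csv
import io

ALLOWED_LABELS = {"positive", "negative", "neutral"}
EXPECTED_COLUMNS = ["document_id", "clean_text", "sentiment"]

def validate_annotations(csv_text):
    """Return a list of problems found in an annotated CSV given as text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        return [f"unexpected columns: {reader.fieldnames}"]
    problems = []
    seen_ids = set()
    # The header is line 1, so data rows start at line 2 (assuming no multi-line fields).
    for line_no, row in enumerate(reader, start=2):
        doc_id = (row["document_id"] or "").strip()
        label = (row["sentiment"] or "").strip().lower()
        if doc_id in seen_ids:
            problems.append(f"line {line_no}: duplicate document_id {doc_id!r}")
        seen_ids.add(doc_id)
        if not (row["clean_text"] or "").strip():
            problems.append(f"line {line_no}: empty clean_text")
        if label not in ALLOWED_LABELS:
            problems.append(f"line {line_no}: invalid sentiment {label!r}")
    return problems

# Made-up rows for illustration only.
sample = (
    "document_id,clean_text,sentiment\n"
    "1,great acting and a moving story,positive\n"
    "2,the plot was slow and predictable,negative\n"
    "3,released in 2019 and runs two hours,neutral\n"
)
print(validate_annotations(sample))
print(validate_annotations(sample + "4,okay i guess,mixed\n"))
```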




1.   Which NLP task would you like to perform on your selected dataset (NER, Summarization, Sentiment Analysis, Text Classification)?
2.  Explain the labeling schema you used and list its labels.

3.  You may use AI assistance for labeling the data only.



In [None]:
# The GitHub link of final csv file


# Link: https://github.com/SameekshaNalla/nlp-sentiment-dataset/blob/main/MovieReviews_Dataset_Labeled.csv

# 1. The dataset is a movie reviews dataset. I would like to perform Sentiment Analysis on it.

# 2. The dataset uses a three-class labeling schema to identify the sentiment of each movie review:

# Positive: Reviews that express favorable opinions or satisfaction with the movie.
# Negative: Reviews that show dissatisfaction or criticism of the movie.
# Neutral: Reviews that provide balanced or factual feedback without strong emotion.


# Mandatory Question

Share your thoughts on the assignment by filling out the survey at this link. What did you find challenging, and what aspects did you enjoy? What is your opinion of the time provided to complete the assignment?

In [None]:
# Type your answer