<a href="https://colab.research.google.com/github/17251A0404/Abhigna_INFO5731_Spring2024/blob/main/INFO5731_Assignment_3_1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment 3**

In this assignment, we will delve into various aspects of natural language processing (NLP) and text analysis. The tasks are designed to deepen your understanding of key NLP concepts and techniques, as well as to provide hands-on experience with practical applications.

Through these tasks, you'll gain practical experience in NLP techniques such as N-gram analysis, TF-IDF, word embedding model creation, and sentiment analysis dataset creation.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).


**Total points**: 100

**Deadline**: See Canvas

**Late submissions incur a 10% penalty for each day after the deadline.**


## Question 1 (30 points)

**Understand N-gram**

Write a Python program to conduct N-gram analysis on the dataset from Assignment 2. Write the code from scratch rather than using pre-existing libraries:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the noun phrases and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print the result as a table with the noun phrases as column names and the 100 reviews (abstracts, or tweets) as row names.
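
As a rough illustration of parts (1) and (2), the sketch below counts trigrams and computes bigram probabilities on a single toy sentence. The sentence and the helper name `ngrams` are hypothetical, chosen only so that the "really like" example works out to 1 / 3; the real input would be the Assignment 2 dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical toy sentence (an assumption, not taken from the dataset).
tokens = "i really like the film and really really enjoyed it".split()

trigram_counts = Counter(ngrams(tokens, 3))
bigram_counts = Counter(ngrams(tokens, 2))
unigram_counts = Counter(tokens)

# Probability of a bigram = count(bigram) / count(first word of the bigram).
bigram_probs = {
    (first, second): count / unigram_counts[first]
    for (first, second), count in bigram_counts.items()
}

print(trigram_counts[("i", "really", "like")])     # 1
print(round(bigram_probs[("really", "like")], 2))  # 0.33
```

Counting the unigrams once with `Counter` avoids rescanning the token list for every bigram, which matters once the input grows to a full set of reviews.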

In [None]:
import re
from collections import Counter

import csv

def read_csv_file():
    """
    Reads a CSV file and returns the data.
    """
    data = []
    with open("sample_data/movie_reviews.csv", newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
          data.append(row)
    return data

def iterate_first_row(data):
    """
    Returns the first column of every row after the header row.
    """
    final_data = []
    if data:
        for item in data[1:]:
            if item:  # Skip blank lines in the CSV
                final_data.append(item[0])
    return final_data


def preprocess_text(text):
    """
    Preprocesses the text by removing special characters and converting to lowercase.
    """
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)  # Remove special characters
    text = text.lower()  # Convert to lowercase
    return text

def generate_ngrams(words, n):
    """
    Generates n-grams from the given text.
    """
    ngrams = [tuple(words[i:i+n]) for i in range(len(words)-n+1)]
    return ngrams

def count_ngrams(ngrams):
    """
    Counts the frequency of n-grams.
    """
    return Counter(ngrams)

def calculate_bigram_probabilities(text):
    """
    Calculates count(w2 w1) / count(w2) for every bigram in the text,
    where w2 is the first word of the bigram and w1 is the second.
    """
    words = text.split()
    bigrams = generate_ngrams(words, 2)
    bigram_counts = count_ngrams(bigrams)
    word_counts = Counter(words)  # Count each word once instead of rescanning the list

    bigram_probabilities = {}
    for bigram, count in bigram_counts.items():
        w2, w1 = bigram
        bigram_probabilities[bigram] = count / word_counts[w2]

    return bigram_probabilities

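def extract_noun_phrases(text):
    """
    Extracts noun phrases from the text.
    """
    # Placeholder: this pattern matches only the literal token "NOUN", so it
    # returns an empty list for ordinary text. Real noun phrase extraction
    # needs part-of-speech information, which this function does not have.
    noun_phrases = re.findall(r'\b(?:NOUN)\b', text)
    return noun_phrases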

def calculate_relative_probabilities(reviews):
    """
    For each review, divides the frequency of every noun phrase in that review
    by the maximum noun phrase frequency over the whole dataset.
    """
    per_review_counts = [Counter(extract_noun_phrases(review)) for review in reviews]

    noun_phrase_frequencies = Counter()
    for counts in per_review_counts:
        noun_phrase_frequencies.update(counts)

    # Guard against an empty dataset or an extractor that finds nothing
    max_frequency = max(noun_phrase_frequencies.values(), default=0)

    relative_probabilities = {}
    for i, counts in enumerate(per_review_counts):
        # Use this review's own counts, not the dataset-wide totals
        relative_probabilities[f'Review {i+1}'] = {
            noun_phrase: (counts[noun_phrase] / max_frequency if max_frequency else 0.0)
            for noun_phrase in noun_phrase_frequencies
        }

    return relative_probabilities

def print_table(relative_probabilities):
    """
    Prints the relative probabilities in a table format.
    """
    if not relative_probabilities:
        print("No data to display.")
        return

    header = ['Noun Phrase'] + list(relative_probabilities.keys())
    rows = []
    for noun_phrase in relative_probabilities[next(iter(relative_probabilities))]:
        row = [noun_phrase]
        for review, probabilities in relative_probabilities.items():
            # Format numbers as text so they can be padded with ljust
            row.append(f"{probabilities.get(noun_phrase, 0):.2f}")
        rows.append(row)

    col_width = max(len(word) for row in [header] + rows for word in row) + 2  # padding
    for row in [header] + rows:
        print("".join(word.ljust(col_width) for word in row))

data = read_csv_file()
final_data = iterate_first_row(data)
reviews = final_data
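# Preprocess text
preprocessed_reviews = [preprocess_text(review) for review in reviews if review]

# Count trigram (N=3) frequencies across all reviews
trigram_counts = Counter()
for review in preprocessed_reviews:
    trigram_counts.update(generate_ngrams(review.split(), 3))
print("Most frequent trigrams:")
for trigram, count in trigram_counts.most_common(10):
    print(f"{trigram}: {count}")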

# Calculate bigram probabilities for each review
for i, review in enumerate(preprocessed_reviews):
    print(f"\nReview {i+1} Bigram Probabilities:")
    bigram_probabilities = calculate_bigram_probabilities(review)
    for bigram, probability in bigram_probabilities.items():
        print(f"{bigram}: {probability:.2f}")

# Calculate relative probabilities of each review in terms of noun phrase frequency
relative_probabilities = calculate_relative_probabilities(preprocessed_reviews)

# Print relative probabilities in a table
print("\nRelative Probabilities:")
print_table(relative_probabilities)



Streaming output truncated to the last 5000 lines.
('think', 'is'): 0.20
('the', 'core'): 0.01
('core', 'of'): 1.00
('my', 'issues'): 0.33
('issues', 'with'): 1.00
('film', 'maximus'): 0.07
('has', 'no'): 0.20
('no', 'strong'): 0.50
('strong', 'reason'): 0.50
('reason', 'to'): 0.50
('to', 'want'): 0.05
('want', 'what'): 1.00
('the', 'plot'): 0.04
('plot', 'is'): 0.25
('is', 'actually'): 0.06
('actually', 'about'): 0.50
('about', 'he'): 0.20
('he', 'wants'): 1.00
('wants', 'revenge'): 1.00
('revenge', 'against'): 0.33
('against', 'commodus'): 1.00
('commodus', 'which'): 0.14
('which', 'should'): 0.50
('should', 'be'): 0.33
('be', 'enough'): 0.50
('enough', 'but'): 1.00
('but', 'then'): 0.12
('then', 'theres'): 0.50
('theres', 'this'): 0.25
('this', 'effort'): 0.18
('effort', 'to'): 0.50
('to', 'wrap'): 0.05
('wrap', 'him'): 1.00
('effort', 'by'): 0.50
('by', 'the'): 0.14
('the', 'senate'): 0.03
('senate', 'led'): 0.50
('led', 'by'): 1.00
('by', 'gracchus'): 0.14
('gracchus

## Question 2 (25 points)

**Understand TF-IDF and Document Representation**

Starting from the documents (all the reviews, or abstracts, or tweets) collected for Assignment 2, write a Python program:

(1) To build the documents-terms weights (tf * idf) matrix.

(2) To rank the documents with respect to a query (design the query yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using cosine similarity.

Note: Write the code from scratch rather than using pre-existing libraries.
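
The two steps above can be sketched on a toy corpus as follows. This is a minimal illustration under stated assumptions, not a reference solution: the three documents, the query, and the helper names (`tf`, `idf`, `tfidf`, `cosine`) are hypothetical, TF is taken as count divided by document length, and IDF as the natural log of N / document frequency.

```python
import math
from collections import Counter


def tf(tokens):
    """Term frequency: occurrences of each term divided by document length."""
    total = len(tokens)
    return {term: count / total for term, count in Counter(tokens).items()}


def idf(tokenized_docs):
    """Inverse document frequency: log(N / number of documents containing the term)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


def tfidf(tokens, idf_weights):
    """TF-IDF vector as a dict; terms unseen in the corpus are dropped."""
    return {t: w * idf_weights[t] for t, w in tf(tokens).items() if t in idf_weights}


def cosine(a, b):
    """Cosine similarity; each norm covers all terms of its own vector."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical toy documents and query (assumptions, not from the dataset).
docs = [
    "an outstanding movie with a haunting performance",
    "a dull movie with weak characters",
    "the soundtrack was loud",
]
tokenized = [doc.split() for doc in docs]
idf_weights = idf(tokenized)
doc_vectors = [tfidf(tokens, idf_weights) for tokens in tokenized]
query_vector = tfidf("outstanding haunting performance".split(), idf_weights)

scores = [cosine(query_vector, vector) for vector in doc_vectors]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
for rank, i in enumerate(ranking, start=1):
    print(f"{rank}. {docs[i]} - Similarity: {scores[i]:.2f}")
```

Note that the dot product only needs the shared terms, while each norm must run over every term in its own vector; restricting the norms to shared terms inflates the similarity.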

In [None]:
# Write your code here
import re
from collections import Counter

import csv

def read_csv_file():
    """
    Reads a CSV file and returns the data.
    """
    data = []
    with open("sample_data/movie_reviews.csv", newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
          data.append(row)
    return data

def iterate_first_row(data):
    """
    Returns the first column of every row after the header row.
    """
    final_data = []
    if data:
        for item in data[1:]:
            if item:  # Skip blank lines in the CSV
                final_data.append(item[0])
    return final_data

import math

data = read_csv_file()
documents = iterate_first_row(data)

print(documents[0][:80])  # Preview the start of the first review
# Preprocess function to lowercase the text and remove punctuation
def preprocess_text(text):
    text = text.lower()
    text = ''.join([c for c in text if c.isalnum() or c.isspace()])
    return text

# Preprocess documents (each document is already a string, so pass it whole)
preprocessed_documents = [preprocess_text(doc) for doc in documents]

# Function to calculate term frequency (TF)
def calculate_tf(document):
    words = document.split()
    word_count = len(words)
    tf_dict = {}
    for word in words:
        tf_dict[word] = tf_dict.get(word, 0) + 1 / word_count
    return tf_dict

# Function to calculate inverse document frequency (IDF)
def calculate_idf(documents):
    word_doc_count = {}
    for document in documents:
        words = set(document.split())
        for word in words:
            word_doc_count[word] = word_doc_count.get(word, 0) + 1

    num_documents = len(documents)
    idf_dict = {}
    for word, count in word_doc_count.items():
        idf_dict[word] = math.log(num_documents / count)
    return idf_dict

# Calculate TF for each document
tf_documents = [calculate_tf(doc) for doc in preprocessed_documents]

# Calculate IDF for all terms
idf_dict = calculate_idf(preprocessed_documents)

# Function to calculate TF-IDF weights
def calculate_tfidf(tf_document, idf_dict):
    tfidf_dict = {}
    for word, tf in tf_document.items():
        if word in idf_dict:
            tfidf_dict[word] = tf * idf_dict[word]
    return tfidf_dict


# Calculate TF-IDF for each document
tfidf_documents = [calculate_tfidf(tf_doc, idf_dict) for tf_doc in tf_documents]

# Sample query
query = "An Outstanding movie with a haunting performance and best character development"
preprocessed_query = preprocess_text(query)
tf_query = calculate_tf(preprocessed_query)
tfidf_query = calculate_tfidf(tf_query, idf_dict)

# Function to calculate cosine similarity
def calculate_cosine_similarity(tfidf1, tfidf2):
    # The dot product only needs the terms the two vectors share
    dot_product = sum(tfidf1[term] * tfidf2[term] for term in set(tfidf1) & set(tfidf2))
    # Each magnitude must cover every term of its vector, not just the shared ones
    magnitude1 = math.sqrt(sum(weight ** 2 for weight in tfidf1.values()))
    magnitude2 = math.sqrt(sum(weight ** 2 for weight in tfidf2.values()))
    if magnitude1 == 0 or magnitude2 == 0:
        return 0
    return dot_product / (magnitude1 * magnitude2)

# Calculate cosine similarity between query and documents
cosine_similarities = []
for tfidf_doc in tfidf_documents:
    cosine_similarities.append(calculate_cosine_similarity(tfidf_query, tfidf_doc))

# Rank documents based on cosine similarity
ranked_documents = sorted(zip(documents, cosine_similarities), key=lambda x: x[1], reverse=True)

# Print ranked documents
print("Ranked Documents:")
for i, (document, similarity) in enumerate(ranked_documents, start=1):
    print(f"{i}. {document} - Similarity: {similarity:.2f}")







[['Review', 'Cleaned_Review'], ['Gladiator" is an epic masterpiece that captivates from start to finish. Set against the backdrop of ancient Rome, Ridley Scott\'s magnum opus immerses viewers in a world of political intrigue, betrayal, and the quest for justice.I\'ve seen this movie for 5 times and don\'t mind to watch it again. Russell Crowe delivers a tour de force performance as Maximus, a general turned gladiator seeking vengeance against the corrupt emperor who murdered his family and betrayed him.With its compelling characters, powerful themes of honor and redemption, and iconic moments that linger long after the credits roll, "Gladiator" stands as a timeless classic and a testament to the enduring power of cinema. 10/10.', 'gladiat epic masterpiec captiv start finish set backdrop ancient rome ridley scott magnum opu immers viewer world polit intrigu betray quest justic seen movi time dont mind watch russel crow deliv tour de forc perform maximu gener turn gladiat seek vengeanc c




## Question 3 (25 points)

**Create your own word embedding model**

Use the data you collected for assignment 2 to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, GloVe, ULMFiT, BERT, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
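
For the visualization step, one common approach is to project the 300-dimensional vectors down to two dimensions and plot the result. The sketch below does that projection with PCA via power iteration in plain Python. Everything in it is an assumption for illustration: the helper names (`center`, `top_component`, `project_to_2d`) and the 4-dimensional toy vectors are hypothetical stand-ins for real embeddings.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def center(rows):
    """Subtract the column means so the data is centered at the origin."""
    dim = len(rows[0])
    means = [sum(row[j] for row in rows) / len(rows) for j in range(dim)]
    return [[row[j] - means[j] for j in range(dim)] for row in rows]


def top_component(rows, iterations=200, seed=0):
    """Approximate the leading principal direction by power iteration."""
    dim = len(rows[0])
    rng = random.Random(seed)
    v = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    for _ in range(iterations):
        scores = [dot(row, v) for row in rows]  # X v
        w = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(dim)]  # X^T (X v)
        norm = math.sqrt(dot(w, w))
        if norm < 1e-12:
            return [0.0] * dim  # No variance left in the data
        v = [x / norm for x in w]
    return v


def project_to_2d(vectors):
    """Project vectors onto their first two principal components."""
    centered = center(vectors)
    pc1 = top_component(centered, seed=0)
    # Remove the pc1 direction, then find the next strongest direction
    deflated = [[x - dot(row, pc1) * p for x, p in zip(row, pc1)] for row in centered]
    pc2 = top_component(deflated, seed=1)
    return [(dot(row, pc1), dot(row, pc2)) for row in centered]


# Hypothetical 4-dimensional toy vectors standing in for 300-dimensional embeddings.
toy_vectors = {
    "rome":      [0.9, 0.1, 0.0, 0.2],
    "emperor":   [0.8, 0.2, 0.1, 0.1],
    "gladiator": [0.7, 0.3, 0.0, 0.3],
    "boring":    [0.1, 0.9, 0.8, 0.0],
    "dull":      [0.0, 0.8, 0.9, 0.1],
}
words = list(toy_vectors)
coords = project_to_2d([toy_vectors[word] for word in words])
for word, (x, y) in zip(words, coords):
    print(f"{word:10s} {x:+.3f} {y:+.3f}")
```

With a trained gensim model, the input vectors would come from `model.wv[word]` for a chosen set of words, and the resulting coordinates could be passed to a plotting library such as matplotlib (`scatter` for the points, `annotate` for the word labels). The sign of each axis is arbitrary, so only relative positions are meaningful.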

In [None]:
# Write your code here
import re
from collections import Counter

import csv

def read_csv_file():
    """
    Reads a CSV file and returns the data.
    """
    data = []
    with open("sample_data/movie_reviews.csv", newline='') as csvfile:
        reader = csv.reader(csvfile)
        for row in reader:
          data.append(row)
    return data

def iterate_first_row(data):
    """
    Returns the first column of every row after the header row.
    """
    final_data = []
    if data:
        for item in data[1:]:
            if item:  # Skip blank lines in the CSV
                final_data.append(item[0])
    return final_data


data = read_csv_file()

from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')  # Download the punkt tokenizer

# Build the corpus from the first column of the CSV (one review per entry)
corpus = iterate_first_row(data)

# Tokenize the corpus
tokenized_corpus = [word_tokenize(sentence.lower()) for sentence in corpus]

# Train Word2Vec model
model = Word2Vec(sentences=tokenized_corpus, vector_size=300, window=5, min_count=1, workers=4)

# Save the model
model.save("word2vec_model")

# Load the model
# model = Word2Vec.load("word2vec_model")

# Get the vocabulary
vocab = model.wv.index_to_key

# Check whether "gladiator" is in the vocabulary before querying it
if "gladiator" in vocab:
    word_vector = model.wv["gladiator"]
    print("Word vector for 'gladiator':", word_vector)

    # Find similar words
    similar_words = model.wv.most_similar("gladiator")
    print("Similar words to 'gladiator':", similar_words)
else:
    print("'gladiator' is not in the vocabulary.")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


'gladiator' is not in the vocabulary.
Similar words to 'gladiator': [("''", 0.8432629704475403), ('``', 0.832253634929657), ('grade', 0.8049382567405701), ('conclusion', 0.7706557512283325), ('debt', 0.7616943120956421), ('stuff', 0.7605570554733276), ('must-watch.10/10', 0.7582780122756958), ('absolute', 0.758093535900116), ('widespread', 0.7564403414726257), ('highly', 0.7512801289558411)]


## Question 4 (20 Points)

**Create your own training and evaluation data for sentiment analysis.**

 **You don't need to write a program for this question!**

 For example, if you collected a movie review or a product review data, then you can do the following steps:

*   Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral).

*   Save the annotated dataset to a CSV file with three columns (document_id, clean_text, sentiment), upload the CSV file to GitHub, and submit the file link below.

*   This dataset will be used for Assignment 4: sentiment analysis and text classification.


In [None]:

# The GitHub link of your final csv file


# Link:https://github.com/17251A0404/Abhigna_INFO5731_Spring2024/blob/main/sentiment_dataset.csv




# Mandatory Question

Provide your thoughts on the assignment. What did you find challenging, and what aspects did you enjoy? What is your opinion on the time provided to complete the assignment?

In [None]:
# Type your answer