<a href="https://colab.research.google.com/github/MoulikaGudipally/Moulika_INFO5731_Fall2023/blob/main/Gudipally_Moulika_Assignment_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Three**

In this assignment, you are required to conduct information extraction, semantic analysis based on **the dataset you collected from assignment two**. You may use scipy and numpy package in this assignment.

# **Question 1: Understand N-gram**

(45 points). Write a python program to conduct N-gram analysis based on the dataset in your assignment two:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the fomular count(w2 w1) / count(w2). For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the **noun phrases** and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the fomular frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with column name the all the noun phrases and row name as all the 100 reviews (abstracts, or tweets).


In [15]:
import nltk
import pandas as pd
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.probability import FreqDist
from nltk.util import ngrams
from nltk.tag import pos_tag

# Download NLTK resources if needed
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from google.colab import files
import pandas as pd

# This will prompt you to select a file.
# Once you select the file, it will be uploaded to Colab.
uploaded = files.upload()

# Read the uploaded file
for fn in uploaded.keys():
    print('Uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Saving imdb_review.csv to imdb_review (2).csv
Uploaded file "imdb_review (2).csv" with length 47956 bytes


In [5]:

# Use pandas to read the CSV file
import io
df = pd.read_csv(io.StringIO(uploaded['imdb_review.csv'].decode('utf-8')))


# Get the text data from the CSV file as it has a column named 'text'
dataset = df['text'].tolist()

# Function to preprocess text
def preprocess_text(text):
    sentences = sent_tokenize(text)
    tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in sentences]
    return tokenized_sentences

# Function to get N-grams
def get_ngrams(tokenized_text, n):
    n_grams = []
    for sentence_tokens in tokenized_text:
        n_grams.extend(list(ngrams(sentence_tokens, n)))
    return n_grams

# Function to extract noun phrases
def extract_noun_phrases(text):
    tagged_sentences = pos_tag(word_tokenize(text))
    grammar = "NP: {<DT>?<JJ>*<NN>}"
    cp = nltk.RegexpParser(grammar)
    tree = cp.parse(tagged_sentences)
    noun_phrases = []
    for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP'):
        noun_phrases.append(' '.join(word for word, tag in subtree.leaves()))
    return noun_phrases

# Calculate N-gram frequencies (N=3)
tokenized_dataset = [preprocess_text(text) for text in dataset]
three_grams = get_ngrams([token for sublist in tokenized_dataset for token in sublist], 3)
three_gram_freq = FreqDist(three_grams)

# Calculate probabilities for bigrams
two_grams = get_ngrams([token for sublist in tokenized_dataset for token in sublist], 2)
two_gram_freq = FreqDist(two_grams)

probabilities = {}
for bigram in two_gram_freq:
    first_word_freq = two_gram_freq[bigram[0]]
    if first_word_freq != 0:
        probabilities[bigram] = two_gram_freq[bigram] / first_word_freq
    else:
        probabilities[bigram] = 0

# Extract and calculate relative probabilities of noun phrases
noun_phrases_all_reviews = []
for text in dataset:
    noun_phrases = extract_noun_phrases(text)
    noun_phrases_all_reviews.append(noun_phrases)

noun_phrases_flattened = [phrase for sublist in noun_phrases_all_reviews for phrase in sublist]
noun_phrase_freq = FreqDist(noun_phrases_flattened)

relative_probabilities = {}
for i, review_phrases in enumerate(noun_phrases_all_reviews):
    relative_probabilities[f"Review {i + 1}"] = {
        phrase: noun_phrase_freq[phrase] / max(noun_phrase_freq.values()) for phrase in review_phrases
    }

# Display results
print("Frequency of N-grams (N=3):")
print(three_gram_freq.most_common())

print("\nProbabilities for bigrams:")
print(probabilities)

print("\nRelative probabilities of noun phrases for each review:")
df = pd.DataFrame.from_dict(relative_probabilities, orient='index')
print(df)


Frequency of N-grams (N=3):
[(('the', 'shawshank', 'redemption'), 19), (('one', 'of', 'the'), 17), (('this', 'movie', 'is'), 11), (('*', '*', '*'), 10), (('--', '--', '--'), 10), (('of', 'all', 'time'), 9), (('!', '!', '!'), 9), ((',', 'and', 'the'), 8), (('of', 'the', 'best'), 8), (('of', 'the', 'film'), 7), ((',', 'it', 'is'), 7), (('shawshank', 'redemption', '.'), 6), ((',', 'the', 'film'), 6), (('the', 'story', 'of'), 6), (('``', 'red', "''"), 6), (('if', 'you', 'have'), 6), (('is', 'one', 'of'), 6), (('of', 'the', 'most'), 6), (('some', 'of', 'the'), 6), (('this', 'movie', ','), 6), (('red', "''", 'redding'), 5), ((',', 'it', "'s"), 5), (('the', 'film', "'s"), 5), (('of', 'the', 'greatest'), 5), (('(', 'morgan', 'freeman'), 5), (('morgan', 'freeman', ')'), 5), (('shawshank', 'redemption', 'is'), 4), (('and', 'morgan', 'freeman'), 4), (('the', 'box', 'office'), 4), (('in', 'the', 'film'), 4), (('in', 'the', 'world'), 4), (('of', 'the', 'movie'), 4), (('it', "'s", 'a'), 4), (('you',

# **Question 2: Undersand TF-IDF and Document representation**

(20 points). Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a python program:

(1) To build the **documents-terms weights (tf*idf) matrix bold text**.

(2) To rank the documents with respect to query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using **cosine similarity**.

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

from google.colab import files
import pandas as pd



In [11]:

# Use pandas to read the CSV file
import io
df = pd.read_csv(io.StringIO(uploaded['imdb_review.csv'].decode('utf-8')))
# The CSV column containing the documents is named 'text'
documents = df['text'].tolist()

# Design your query
query = "An outstanding movie with a haunting performance and best character development"

# Preprocess the documents and query
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(documents + [query])

# Calculate cosine similarity between documents and the query
cosine_similarities = cosine_similarity(tfidf_matrix[:-1], tfidf_matrix[-1])

# Rank documents based on cosine similarity
document_scores = list(zip(range(len(documents)), cosine_similarities.flatten()))
sorted_documents = sorted(document_scores, key=lambda x: x[1], reverse=True)

# Display ranked documents
print("Ranked Documents based on Cosine Similarity to the Query:")
for idx, score in sorted_documents:
    print(f"Document {idx + 1}: Similarity Score = {score}")


Ranked Documents based on Cosine Similarity to the Query:
Document 24: Similarity Score = 0.15203814910130675
Document 6: Similarity Score = 0.11774092862605477
Document 10: Similarity Score = 0.10421876291445477
Document 11: Similarity Score = 0.10336487197161069
Document 7: Similarity Score = 0.10029752714961938
Document 4: Similarity Score = 0.09907432638750387
Document 12: Similarity Score = 0.07999024864863176
Document 15: Similarity Score = 0.07756137973005124
Document 5: Similarity Score = 0.07706270786484506
Document 16: Similarity Score = 0.0733332587752542
Document 21: Similarity Score = 0.07068484558032075
Document 9: Similarity Score = 0.06921151857160743
Document 14: Similarity Score = 0.06818330477164308
Document 17: Similarity Score = 0.06767997348166811
Document 8: Similarity Score = 0.06557025084294885
Document 20: Similarity Score = 0.06471004366476066
Document 1: Similarity Score = 0.06402422923466083
Document 25: Similarity Score = 0.05838709078907912
Document 23: S

# **Question 3: Create your own word embedding model**

(20 points). Use the data you collected for assignment two to build a word embedding model:

(1) Train a 300-dimension word embedding (it can be word2vec, glove, ulmfit, bert, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/

In [14]:
# Write your code hereimport pandas as pd
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
from nltk.tokenize import word_tokenize
import nltk
nltk.download('punkt')

# Load the data
df = pd.read_csv('imdb_review copy.csv')

# Tokenize the reviews
tokenized_reviews = [word_tokenize(text.lower()) for review in df['text']]

# Train the Word2Vec model
model = Word2Vec(sentences=tokenized_reviews, vector_size=300, window=5, min_count=1, workers=4)

# Save the model
model.save('word2vec_model')

# Get vocabulary
vocab = list(model.wv.key_to_index.keys())

# Visualize word embeddings
def plot_embeddings(model, words):
    for word in words:
        if word in model.wv.key_to_index:
            # Your visualization code here for the existing word
            pass
        else:
            print(f"Word '{word}' not found in the model's vocabulary.")


# Visualize some embeddings
plot_words = ['great', 'amazing', 'terrible', 'acting', 'plot', 'director']
plot_embeddings(model, plot_words)



Word 'great' not found in the model's vocabulary.
Word 'amazing' not found in the model's vocabulary.
Word 'terrible' not found in the model's vocabulary.
Word 'acting' not found in the model's vocabulary.
Word 'plot' not found in the model's vocabulary.
Word 'director' not found in the model's vocabulary.


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


# **Question 4: Create your own training and evaluation data for sentiment analysis**

(15 points). **You dodn't need to write program for this question!** Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral). Save the annotated dataset into a csv file with three columns (first column: document_id, clean_text, sentiment), upload the csv file to GitHub and submit the file link blew. This datset will be used for assignment four: sentiment analysis and text classification.


In [None]:
# The GitHub link of your final csv file



# Link:

https://github.com/MoulikaGudipally/Moulika_INFO5731_Fall2023/blob/main/Assignment_3_Question_4.ipynb



