<a href="https://colab.research.google.com/github/unt-iialab/INFO5731_Spring2020/blob/master/Assignments/INFO5731_Assignment_Three.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 Assignment Three**

In this assignment, you are required to conduct information extraction and semantic analysis based on **the dataset you collected from assignment two**. You may use the scipy and numpy packages in this assignment.

# **Question 1: Understand N-gram**

(45 points). Write a Python program to conduct N-gram analysis based on the dataset from your assignment two:

(1) Count the frequency of all the N-grams (N=3).

(2) Calculate the probabilities for all the bigrams in the dataset by using the formula count(w2 w1) / count(w2), where w2 is the first word of the bigram. For example, count(really like) / count(really) = 1 / 3 = 0.33.

(3) Extract all the **noun phrases** and calculate the relative probabilities of each review in terms of other reviews (abstracts, or tweets) by using the formula frequency (noun phrase) / max frequency (noun phrase) on the whole dataset. Print out the result in a table with the noun phrases as column names and the 100 reviews (abstracts, or tweets) as row names.
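
As a small illustration of the formulas in (1) to (3), the following sketch works on a made-up sentence and on hand-written noun-phrase lists rather than on the assignment-two dataset. It splits on whitespace instead of using a real tokenizer, and the helper names `count_ngrams`, `bigram_probabilities`, and `relative_frequencies` are hypothetical, chosen only for this example.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bigram_probabilities(tokens):
    """count(first second) / count(first) for every bigram."""
    unigram_counts = Counter(tokens)
    return {(first, second): count / unigram_counts[first]
            for (first, second), count in count_ngrams(tokens, 2).items()}

def relative_frequencies(phrases_per_doc):
    """Per-document phrase frequency divided by the highest dataset-wide frequency."""
    totals = Counter()
    for phrases in phrases_per_doc:
        totals.update(phrases)
    max_frequency = max(totals.values())
    return [{phrase: count / max_frequency
             for phrase, count in Counter(phrases).items()}
            for phrases in phrases_per_doc]

# Toy sentence (an assumption for illustration), split on whitespace
tokens = "I really like this movie and I really really enjoyed it".split()
trigram_counts = count_ngrams(tokens, 3)
probabilities = bigram_probabilities(tokens)
print(probabilities[("really", "like")])  # 1 / 3, matching the example in (2)

# Hand-written noun phrases per review (assumed, not extracted by a parser)
phrases_per_doc = [["the movie", "the movie", "a story"],
                   ["the movie", "the ending"]]
table = relative_frequencies(phrases_per_doc)
print(table)
```

Each dictionary in `table` corresponds to one row of the requested table; phrases that do not occur in a review would be filled with 0 when the rows are laid out as columns.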


In [12]:
# Write your code here

import pandas as pd
import nltk
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from collections import Counter

# Load the IMDb movie reviews dataset
df = pd.read_csv('/content/imdb_movie_reviews.csv')

# Select the first 100 reviews (you can adjust the number as needed)
reviews = df['Review'][:100]

# Function to tokenize and generate N-grams
def generate_ngrams(text, n):
    tokens = nltk.word_tokenize(text)
    n_grams = ngrams(tokens, n)
    return [' '.join(gram) for gram in n_grams]

# Function to count N-grams frequency
def count_ngrams(reviews, n):
    all_ngrams = []
    for review in reviews:
        ngrams_review = generate_ngrams(review, n)
        all_ngrams.extend(ngrams_review)
    return Counter(all_ngrams)

# Function to calculate bigram probabilities
def calculate_bigram_probabilities(reviews):
    bigrams_count = count_ngrams(reviews, 2)
    unigrams_count = count_ngrams(reviews, 1)

    bigram_probabilities = {}
    for bigram, count in bigrams_count.items():
        unigram = bigram.split()[0]
        probability = count / unigrams_count[unigram]
        bigram_probabilities[bigram] = probability

    return bigram_probabilities

# Function to extract noun phrases and calculate their relative probabilities per review
def calculate_noun_phrase_probabilities(reviews):
    # Chunk grammar: optional determiner, any number of adjectives, then a singular noun
    cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")

    # Count the noun phrases found in each review separately
    per_review_counts = []
    for review in reviews:
        tagged_tokens = pos_tag(nltk.word_tokenize(review))
        tree = cp.parse(tagged_tokens)
        phrases = [' '.join(word for word, tag in subtree.leaves())
                   for subtree in tree.subtrees(filter=lambda t: t.label() == 'NP')]
        per_review_counts.append(Counter(phrases))

    # Frequency of each noun phrase over the whole dataset, and the largest of them
    total_counts = Counter()
    for counts in per_review_counts:
        total_counts.update(counts)
    max_count = max(total_counts.values())

    # Relative probability = frequency in the review / max frequency in the dataset.
    # Rows are keyed by review number so the table stays readable.
    noun_phrase_probabilities = {}
    for i, counts in enumerate(per_review_counts, start=1):
        noun_phrase_probabilities['Review {}'.format(i)] = {
            phrase: count / max_count for phrase, count in counts.items()}

    return noun_phrase_probabilities

# Count N-grams (N=3)
ngram_counts = count_ngrams(reviews, 3)
print("N-gram Counts:")
print(ngram_counts)
print("\n")

# Calculate bigram probabilities
bigram_probabilities = calculate_bigram_probabilities(reviews)
print("Bigram Probabilities:")
print(bigram_probabilities)
print("\n")

# Extract and calculate noun phrase probabilities
noun_phrase_probabilities = calculate_noun_phrase_probabilities(reviews)
df_noun_phrases = pd.DataFrame.from_dict(noun_phrase_probabilities, orient='index')
df_noun_phrases.fillna(0, inplace=True)  # Fill NaN with 0 for better representation
print("Noun Phrase Probabilities:")
print(df_noun_phrases)


N-gram Counts:
Counter({'The Shawshank Redemption': 18, 'one of the': 15, '! ! !': 11, '* * *': 10, '-- -- --': 10, '. It is': 8, 'this movie is': 8, 'of all time': 8, ', and the': 8, 'of the best': 8, 'of the film': 7, ', it is': 7, 'Shawshank Redemption .': 6, ', the film': 6, 'the story of': 6, "`` Red ''": 6, 'is one of': 6, 'of the most': 6, 'some of the': 6, 'this movie ,': 6, "Red '' Redding": 5, ", it 's": 5, 'If you have': 5, '. This is': 5, "the film 's": 5, 'of the greatest': 5, '( Morgan Freeman': 5, 'Morgan Freeman )': 5, 'Shawshank Redemption is': 4, 'and Morgan Freeman': 4, '. However ,': 4, 'in the film': 4, 'in the world': 4, 'of the movie': 4, "you do n't": 4, 'film of all': 4, '. If you': 4, "you have n't": 4, "have n't seen": 4, '`` The Shawshank': 4, '. But the': 4, '. It has': 4, 'the movie is': 4, ". It 's": 4, ', and I': 4, 'movie , but': 4, '( Tim Robbins': 4, ', `` The': 4, "'' , ``": 4, 'directed by Frank': 3, 'by Frank Darabont': 3, 'Rita Hayworth and': 3, '

# **Question 2: Understand TF-IDF and Document representation**

(20 points). Starting from the documents (all the reviews, or abstracts, or tweets) collected for assignment two, write a Python program:

(1) To build the **document-term weight (tf*idf) matrix**.

(2) To rank the documents with respect to a query (design a query by yourself, for example, "An Outstanding movie with a haunting performance and best character development") by using **cosine similarity**.
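
To make the two steps concrete, here is a minimal sketch that builds tf*idf weights by hand and ranks three made-up documents against a query. It assumes raw term counts for tf and idf(t) = log(N / df(t)), which is one common textbook variant; scikit-learn's `TfidfVectorizer` uses a smoothed idf and L2-normalizes each row by default, so its numbers will differ. The documents, the query, and the helper names `tfidf_vector` and `cosine` are illustrative assumptions.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight each known term by raw term frequency times inverse document frequency."""
    counts = Counter(tokens)
    return {term: tf * idf[term] for term, tf in counts.items() if term in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy documents (assumptions for illustration), tokenized on whitespace
documents = [
    "an outstanding movie with a haunting performance",
    "a boring movie with weak acting",
    "the weather is nice today",
]
tokenized = [doc.split() for doc in documents]

# idf(t) = log(N / df(t)), where df(t) is the number of documents containing t
n_docs = len(tokenized)
document_frequency = Counter(term for tokens in tokenized for term in set(tokens))
idf = {term: math.log(n_docs / df) for term, df in document_frequency.items()}

# Rows of the document-term matrix, kept sparse as one dict per document
doc_vectors = [tfidf_vector(tokens, idf) for tokens in tokenized]

# The query is weighted with the idf learned from the documents only
query_vector = tfidf_vector("outstanding haunting performance".split(), idf)
scores = [cosine(query_vector, vec) for vec in doc_vectors]
ranking = sorted(range(n_docs), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

Only the first document shares terms with the query, so it is ranked first and the other two score zero.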

In [13]:
# Write your code here

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Load the IMDb movie reviews dataset
df = pd.read_csv('/content/imdb_movie_reviews.csv')

# Select the first 100 reviews (you can adjust the number as needed)
reviews = df['Review'][:100]

# Design a query
query = "An Outstanding movie with a haunting performance and best character development"

# Function to calculate cosine similarity between query and documents
def rank_documents(query, documents):
    # Combine the query with the existing documents
    all_texts = [query] + documents.tolist()

    # Create the TF-IDF matrix
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform(all_texts)

    # Calculate cosine similarity
    cosine_similarities = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1:]).flatten()

    # Create a DataFrame with document indices and corresponding cosine similarities
    result_df = pd.DataFrame({'Document': range(1, len(cosine_similarities) + 1),
                              'Cosine Similarity': cosine_similarities})

    # Sort documents by cosine similarity in descending order
    result_df = result_df.sort_values(by='Cosine Similarity', ascending=False).reset_index(drop=True)

    return result_df

# Rank documents based on cosine similarity to the query
result_df = rank_documents(query, reviews)

# Print the ranked documents
print("Ranked Documents:")
print(result_df)


Ranked Documents:
    Document  Cosine Similarity
0         24           0.152038
1          6           0.117741
2         10           0.104219
3         11           0.103365
4          7           0.100298
5          4           0.099074
6         12           0.079990
7         15           0.077561
8          5           0.077063
9         16           0.073333
10        21           0.070685
11         9           0.069212
12        14           0.068183
13        17           0.067680
14         8           0.065570
15        20           0.064710
16         1           0.064024
17        25           0.058387
18        23           0.054052
19        18           0.043918
20         3           0.043520
21         2           0.039526
22        22           0.038186
23        13           0.037972
24        19           0.024270


# **Question 3: Create your own word embedding model**

(20 points). Use the data you collected for assignment two to build a word embedding model:

(1) Train a 300-dimensional word embedding (it can be word2vec, GloVe, ULMFiT, BERT, or others).

(2) Visualize the word embedding model you created.

Reference: https://machinelearningmastery.com/develop-word-embeddings-python-gensim/

Reference: https://jaketae.github.io/study/word2vec/
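
One common way to approach step (2) is to project the 300-dimensional vectors down to two dimensions and plot the result. The sketch below does the projection with principal component analysis computed through NumPy's SVD. The word list and the random vectors are stand-ins assumed for illustration, not output from a trained model, and `project_to_2d` is a hypothetical helper name.

```python
import numpy as np

def project_to_2d(vectors):
    """Project row vectors onto their first two principal components."""
    centered = vectors - vectors.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T

# Stand-in data: random 300-dimensional vectors for a few words
# (assumed values, not the output of a trained model)
words = ["movie", "film", "prison", "hope", "story"]
rng = np.random.default_rng(0)
vectors = rng.normal(size=(len(words), 300))

coords = project_to_2d(vectors)
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.3f}, {y:.3f})")
```

With a trained gensim model, the rows could instead be taken from `model.wv[word]`, and the resulting coordinates passed to a scatter plot (for example matplotlib's `plt.scatter`, labelling each point with `plt.annotate`).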

In [21]:
# Write your code here


import pandas as pd
from gensim.models import Word2Vec
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
import string

# Load the dataset collected in assignment two
df = pd.read_csv('/content/imdb_movie_reviews.csv')

# The review text is stored in the 'Review' column
corpus = df['Review'].tolist()

# Build the stopword set once instead of reloading the list for every token
# (requires nltk.download('punkt') and nltk.download('stopwords'))
stop_words = set(stopwords.words('english'))

# Tokenize, lowercase, and drop non-alphabetic tokens and stopwords
def preprocess_text(text):
    tokens = word_tokenize(text)
    tokens = [word.lower() for word in tokens if word.isalpha()]
    tokens = [word for word in tokens if word not in stop_words]
    return tokens

# Preprocess the entire corpus
preprocessed_corpus = [preprocess_text(text) for text in corpus]

# Train a Word2Vec model
model = Word2Vec(sentences=preprocessed_corpus, vector_size=300, window=5, min_count=1, workers=4)

# Save the model
model.save("word2vec_model.model")

# To load the model later:
# model = Word2Vec.load("word2vec_model.model")

example_word = 'movie'

# Example: Retrieve word vectors
vector = model.wv[example_word]

# Example: Most similar words
similar_words = model.wv.most_similar(example_word, topn=5)

print("Vector for {}:".format(example_word), vector)
print("Most similar words to '{}':".format(example_word), similar_words)





Vector for movie: [ 2.67876964e-03 -7.08241889e-04 -2.32462306e-04  7.16871407e-04
 -1.06130596e-04 -5.43645700e-04  2.56273453e-03  1.56903977e-03
 -8.88410257e-04 -6.65750704e-04  1.85976410e-03 -2.84594920e-04
 -2.71135039e-04  3.05756717e-03 -2.30056536e-03 -7.73325097e-04
  3.51689127e-03  2.10750173e-03  7.11746747e-04 -3.12499399e-03
 -8.12411890e-05 -9.18560894e-04  3.85384355e-03  4.27828229e-04
  1.46904273e-03  7.47912563e-04 -1.52392744e-03 -1.54234411e-03
 -3.42926971e-04 -1.48554018e-03  2.46367953e-03  2.84231105e-03
 -5.44715382e-04  9.97478492e-04 -2.02301145e-03  7.50728883e-04
 -1.93469564e-03 -3.90726374e-03 -2.12439196e-03 -3.08508868e-03
  2.10896600e-03 -1.73417141e-03  2.88549298e-03 -3.14773270e-03
  1.34515716e-03  3.84120224e-03 -2.33297539e-03 -2.94257654e-03
 -1.72518962e-03 -3.02660424e-04 -1.11933638e-04 -3.03579750e-03
 -3.12517164e-03  1.11245946e-03 -2.97484710e-03 -2.23706919e-03
 -3.43421998e-04 -2.76992167e-03 -2.17660330e-03 -2.98457593e-03
 -4.432

# **Question 4: Create your own training and evaluation data for sentiment analysis**

(15 points). **You don't need to write a program for this question!** Read each review (abstract or tweet) you collected in detail, and annotate each review with a sentiment (positive, negative, or neutral). Save the annotated dataset into a csv file with three columns (document_id, clean_text, sentiment), upload the csv file to GitHub, and submit the file link below. This dataset will be used for assignment four: sentiment analysis and text classification.
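
For reference, the expected file layout can be checked with a few lines of standard-library Python. The sketch below reads a tiny in-memory sample instead of a real file; the two rows are made up for illustration, and the constant names `EXPECTED_COLUMNS` and `ALLOWED_SENTIMENTS` are assumptions of this example.

```python
import csv
import io

ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}
EXPECTED_COLUMNS = ["document_id", "clean_text", "sentiment"]

# Two made-up rows standing in for manually annotated reviews
sample_csv = io.StringIO(
    "document_id,clean_text,sentiment\n"
    "1,a moving story of hope and friendship,positive\n"
    "2,the plot drags and the ending falls flat,negative\n"
)

reader = csv.DictReader(sample_csv)
rows = list(reader)

# Check the header and that every label is one of the three allowed values
header_ok = reader.fieldnames == EXPECTED_COLUMNS
bad_rows = [row["document_id"] for row in rows
            if row["sentiment"] not in ALLOWED_SENTIMENTS]
print(header_ok, bad_rows)
```

Replacing `sample_csv` with an opened file handle would run the same check on the annotated dataset before it is uploaded.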


In [25]:
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

# Load the dataset collected in assignment two
df = pd.read_csv('/content/imdb_movie_reviews.csv')

# Initialize the VADER sentiment analyzer (requires nltk.download('vader_lexicon'))
sid = SentimentIntensityAnalyzer()

# Function to get sentiment category
def get_sentiment(text):
    compound_score = sid.polarity_scores(text)['compound']
    if compound_score >= 0.05:
        return 'positive'
    elif compound_score <= -0.05:
        return 'negative'
    else:
        return 'neutral'

# Apply sentiment analysis to each document
df['sentiment'] = df['Review'].apply(get_sentiment)

# Save the annotated dataset to a new CSV file
df.to_csv('annotated_dataset.csv', index=False)


# The GitHub link of your final csv file

https://github.com/LasyaBobbili/lasyabobbili_info5731_fall2023/blob/main/Sentimental_analysed_dataset.csv


# Link:



