<a href="https://colab.research.google.com/github/143211/TARUN_INFO5731/blob/main/INFO5731_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Task: sentiment classification of movie reviews (positive vs. negative).

Bag of Words (BoW): Represents each text by the frequency of every word. Useful for identifying key words that
indicate sentiment.
TF-IDF: Weights each word by its frequency in a document, discounted by how many documents contain it, which
highlights distinctive words that are more likely to be significant for sentiment.
Sentiment Scores: Use pre-trained sentiment analyzers to assign a numerical sentiment score to text segments,
directly indicating positive or negative sentiment.
N-grams: Capture sequences of words (such as phrases) whose sentiment is not clear when the words are considered
individually.
Part-of-Speech Tags: Identify adjectives and adverbs, which are often closely tied to expressing sentiment.
'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [13]:
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


True

In [25]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Sample movie reviews
documents = [
    "A visually stunning movie that fails to deliver a compelling story.",
    "The soundtrack and visual effects were amazing, but the characters lacked depth.",
    "An underrated gem with a gripping plot and a heartfelt performance by the lead actor.",
    "A predictable romantic comedy that doesn’t bring anything new to the genre.",
    "The film’s attempt at humor falls flat, overshadowed by its overly complex plot."
]

# Tokenize and preprocess the text
nltk.download('punkt')
tokenized_documents = [nltk.word_tokenize(doc.lower()) for doc in documents]

# Rejoin the tokens once so that every vectorizer sees the same preprocessed text
preprocessed_documents = [" ".join(doc) for doc in tokenized_documents]

# Bag-of-Words (BoW) features
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(preprocessed_documents)

# TF-IDF features
vectorizer_tfidf = TfidfVectorizer()
tfidf_features = vectorizer_tfidf.fit_transform(preprocessed_documents)

# N-gram features (bigrams and trigrams)
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 3))
ngrams_features = vectorizer_ngrams.fit_transform(preprocessed_documents)

# Topic modeling features using LDA (fixed seed so the topic mixture is reproducible)
lda = LatentDirichletAllocation(n_components=2, random_state=42)
lda_features = lda.fit_transform(bow_features)

# Word-embedding features are not extracted here; they would typically come from
# pre-trained embeddings that transform the text into dense vectors.

# Display the extracted features
print("BoW Features:")
print(bow_features.toarray())

print("\nTF-IDF Features:")
print(tfidf_features.toarray())

print("\nN-grams Features:")
print(ngrams_features.toarray())

print("\nTopic Modeling Features (LDA):")
print(lda_features)


BoW Features:
[[0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
  0 0 0 1 1 1 0 1 0 0 1 0 0]
 [0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
  0 0 1 0 0 0 2 0 0 1 0 1 0]
 [1 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 1 0 0 0 0 1 1
  0 0 0 0 0 0 1 0 1 0 0 0 1]
 [0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
  1 1 0 0 0 1 1 1 0 0 0 0 0]
 [0 0 0 0 0 1 1 0 0 1 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 1 1 0 0 0 0 1 1 0 1
  0 0 0 0 0 0 1 0 0 0 0 0 0]]

TF-IDF Features:
[[0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.34706676 0.         0.34706676 0.         0.         0.
  0.34706676 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.34706676 0.         0.         0.         0.         0.
  0.         0.         0.         0.34706676 0.34706676 0.28001128
  0.         0.28001128 

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
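
The TF-IDF values printed above can be checked by hand. The sketch below is a minimal pure-Python reconstruction, assuming scikit-learn's default `TfidfVectorizer` settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 row normalization, and tokens of two or more word characters). The helper names `tokenize` and `tfidf_row` are illustrative and not part of any library.

```python
import math
import re
from collections import Counter

documents = [
    "A visually stunning movie that fails to deliver a compelling story.",
    "The soundtrack and visual effects were amazing, but the characters lacked depth.",
    "An underrated gem with a gripping plot and a heartfelt performance by the lead actor.",
    "A predictable romantic comedy that doesn’t bring anything new to the genre.",
    "The film’s attempt at humor falls flat, overshadowed by its overly complex plot.",
]

def tokenize(text):
    # Assumed to mirror the default token pattern: lowercase, tokens of 2+ word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each term.
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_row(tokens):
    counts = Counter(tokens)
    # Term count times smoothed IDF, followed by L2 normalization of the row.
    raw = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in counts.items()
    }
    norm = math.sqrt(sum(weight * weight for weight in raw.values()))
    return {term: weight / norm for term, weight in raw.items()}

row0 = tfidf_row(tokenized[0])
print(f"stunning: {row0['stunning']:.8f}")  # occurs only in the first review
print(f"that:     {row0['that']:.8f}")      # also occurs in the fourth review
```

Under these assumptions, the two values agree with the 0.34706676 and 0.28001128 entries in the first row of the TF-IDF output: a term that occurs in a single review receives a higher weight than one shared with another review.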


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [24]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
import numpy as np

# Provided movie reviews
reviews = [
    "A visually stunning movie that fails to deliver a compelling story.",
    "The soundtrack and visual effects were amazing, but the characters lacked depth.",
    "An underrated gem with a gripping plot and a heartfelt performance by the lead actor.",
    "A predictable romantic comedy that doesn’t bring anything new to the genre.",
    "The film’s attempt at humor falls flat, overshadowed by its overly complex plot."
]

# Hypothetical sentiment labels for the reviews: 0 for negative, 1 for positive
# Assuming the sentiment based on the content of the reviews
targets = np.array([0, 1, 1, 0, 0])  # Assign labels according to your criteria

# Generating Bag of Words features
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(reviews)

# Feature selection using chi-squared statistic
k_best = SelectKBest(score_func=chi2, k='all')
k_best.fit(bow_features, targets)

# Get the indices of selected features in descending order of importance
sorted_feature_indices = (-k_best.scores_).argsort()

# List the features in descending order of importance
feature_names = vectorizer_bow.get_feature_names_out()
sorted_features = [feature_names[i] for i in sorted_feature_indices]

print("Top Features in Descending Order of Importance:")
for feature in sorted_features:
    print(feature)



Top Features in Descending Order of Importance:
and
actor
visual
underrated
soundtrack
performance
lead
lacked
heartfelt
were
gem
effects
depth
gripping
with
but
an
characters
amazing
to
that
the
stunning
overshadowed
story
anything
visually
romantic
predictable
at
overly
deliver
movie
compelling
doesn
comedy
fails
falls
film
complex
flat
humor
its
bring
attempt
genre
new
by
plot
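
The ranking above can be sanity-checked without scikit-learn. The following is a minimal sketch of the chi-squared score for a single term, assuming the same computation that `sklearn.feature_selection.chi2` applies to count features: the observed per-class term counts are compared with the counts expected if the term were independent of the class. The helper names `term_count` and `chi2_score` are illustrative only.

```python
import re

reviews = [
    "A visually stunning movie that fails to deliver a compelling story.",
    "The soundtrack and visual effects were amazing, but the characters lacked depth.",
    "An underrated gem with a gripping plot and a heartfelt performance by the lead actor.",
    "A predictable romantic comedy that doesn’t bring anything new to the genre.",
    "The film’s attempt at humor falls flat, overshadowed by its overly complex plot.",
]
targets = [0, 1, 1, 0, 0]  # same hypothetical sentiment labels as in the cell above

def term_count(text, term):
    # Lowercase and keep tokens of 2+ word characters, then count the term.
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return tokens.count(term)

def chi2_score(term):
    # Only meaningful for terms that occur at least once in the reviews.
    counts = [term_count(review, term) for review in reviews]
    total = sum(counts)
    score = 0.0
    for label in sorted(set(targets)):
        observed = sum(c for c, y in zip(counts, targets) if y == label)
        class_prob = targets.count(label) / len(targets)
        expected = class_prob * total  # count expected under independence
        score += (observed - expected) ** 2 / expected
    return score

for term in ["and", "actor", "to", "the", "stunning", "plot"]:
    print(f"{term:10s} {chi2_score(term):.4f}")
```

Under these assumptions, `and` scores 3.0 (it occurs twice, only in reviews labeled positive), while `plot` scores about 0.083 (one occurrence in each class), consistent with their first and last positions in the ranking. With only five labeled reviews, a stop word such as `and` can top the list, so the ranking is illustrative rather than meaningful.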


In [27]:
pip install transformers



In [28]:
pip install torch torchvision torchaudio



## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [29]:
import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel


# my query
query = "A movie with great visual effects and deep storyline."

# Loading pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"  # You can choose other BERT variants as well
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode the query and documents
query_encoding = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
document_encodings = tokenizer(documents, return_tensors="pt", padding=True, truncation=True)

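# Get BERT embeddings for the query and documents by mean-pooling the token vectors
# (mean(dim=1) averages over every position, including [PAD] tokens in the batched documents)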
with torch.no_grad():
    query_embedding = model(**query_encoding).last_hidden_state.mean(dim=1)
    document_embeddings = model(**document_encodings).last_hidden_state.mean(dim=1)

# Calculate cosine similarity between the query and documents
similarities = cosine_similarity(query_embedding, document_embeddings).flatten()

# Rank documents by similarity in descending order
ranked_documents = [(document, similarity) for document, similarity in zip(documents, similarities)]
ranked_documents.sort(key=lambda x: x[1], reverse=True)

# Print ranked documents
for i, (document, similarity) in enumerate(ranked_documents, start=1):
    print(f"Rank {i}: Similarity = {similarity:.4f}")
    print(document)
    print()






Rank 1: Similarity = 0.8086
A visually stunning movie that fails to deliver a compelling story.

Rank 2: Similarity = 0.7873
A predictable romantic comedy that doesn’t bring anything new to the genre.

Rank 3: Similarity = 0.7865
An underrated gem with a gripping plot and a heartfelt performance by the lead actor.

Rank 4: Similarity = 0.7434
The soundtrack and visual effects were amazing, but the characters lacked depth.

Rank 5: Similarity = 0.6555
The film’s attempt at humor falls flat, overshadowed by its overly complex plot.
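
One caveat with the pooling used above: when documents are padded to a common length, a plain mean over the sequence dimension also averages the `[PAD]` positions. The sketch below illustrates mask-aware mean pooling followed by cosine-similarity ranking in pure Python. The toy token vectors, the masks, and the helper names `masked_mean` and `cosine` are hypothetical stand-ins for BERT outputs, not real embeddings.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "token embeddings" (hypothetical values); the last position of doc_b is padding.
query_tokens = [[1.0, 0.0], [1.0, 1.0]]
docs = {
    "doc_a": ([[0.0, 1.0], [0.0, 2.0]], [1, 1]),
    "doc_b": ([[2.0, 0.0], [0.0, 9.0]], [1, 0]),
}

query_vec = masked_mean(query_tokens, [1, 1])

# Score every document against the query and sort by similarity, highest first.
ranked = sorted(
    ((name, cosine(query_vec, masked_mean(vecs, mask))) for name, (vecs, mask) in docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, score) in enumerate(ranked, start=1):
    print(f"Rank {rank}: {name} similarity = {score:.4f}")
```

In PyTorch, the same idea would likely amount to multiplying `last_hidden_state` by `attention_mask.unsqueeze(-1)`, summing over the sequence dimension, and dividing by the mask sum. Whether this changes the ranking reported above has not been checked here.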



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Working on this exercise was really insightful. I got to dive into how we can pull useful signals out of text data to
help computers understand it better. Learning different ways to break down text and highlight its important parts,
like finding key words or figuring out the overall sentiment (positive or negative), was very helpful. It was like
learning a new way to read text, not just for the story, but for the hidden clues that tell us more about what is
being said.

I did run into a few bumps, especially when deciding which features of the text were the most important and how
to use Python tools to make sense of it all. It's one thing to read about these methods, but applying them was a
whole other story.

This exercise ties directly into NLP, the field concerned with getting computers to process and understand human
language. Everything I did here is part of that bigger picture, helping computers get the gist of what we say or
write. It's pretty cool to think about, especially since NLP is becoming a big deal in lots of areas, like making
smarter chatbots or sorting through tons of reviews to find out what people really think.

All in all, it was a great way to get my hands dirty with some real NLP techniques and see how they're used in the
real world.
'''