<a href="https://colab.research.google.com/github/Saketh-11653883/UNT-SAKETH_INFO5731/blob/main/Kaveti_Saketh_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

Sentiment analysis of movie reviews is an interesting text classification task. The objective is to classify each review as positive, negative, or neutral based on the tone of its text.

Features for sentiment analysis of movie reviews:

Word Frequency:
Counting how often each word occurs in a review gives a first indication of its overall tone, because some words carry clearly positive or negative sentiment.
Why it's beneficial? The frequency of positive adjectives (like "amazing," "brilliant") or negative ones (like "disappointing," "boring") strongly influences the sentiment of a review.

N-grams:
Examining sequences of adjacent words (n-grams) captures context and nuances of sentiment that single words miss.
Why it's beneficial? Certain phrases or word combinations express sentiment more precisely than single words. For instance, although the word "good" is positive by itself, the bigram "not good" conveys a negative opinion.

Sentiment Lexicons:
A sentiment lexicon is a dictionary that maps words to their sentiment polarity.
Why it's beneficial? Lexicons help identify sentiment-bearing words and their polarities, which supports a finer-grained analysis. For example, "joyful" is a positive entry, while "horrifying" is a negative one.

Part-of-Speech (POS) Tags:
Each word is labeled with its part of speech (noun, verb, adjective, and so on) to capture the grammatical structure of the text.
Why it's beneficial? The grammatical role of a word affects how much sentiment it carries. Adjectives and adverbs frequently play a significant role in expressing sentiment, so identifying them can improve the model's accuracy.


Negation Handling:
Negation words are identified, and their effect on the sentiment of the words that follow is taken into account.
Why it's beneficial? Phrases such as "not good" convey a different meaning than "good" alone. By helping the model recognize when a word is negated, negation handling improves the accuracy of the sentiment analysis.


Combined, these features give a fuller picture of the sentiment expressed in movie reviews.
Feeding them into a machine learning model lets it learn relationships and patterns in the text, which should improve the accuracy of the sentiment classification.
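
As a rough illustration of how several of these features could be merged into one input for a classifier, the sketch below builds a single feature dictionary per review using only the standard library. The tokenizer, the tiny `POSITIVE` and `NEGATIVE` word sets, and the function names are hypothetical stand-ins chosen for this example. POS tags are left out because they need a trained tagger, and flipping the polarity of a negated lexicon word is a deliberate simplification.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon for illustration only; a real system would use
# a full resource such as the VADER lexicon.
POSITIVE = {"amazing", "brilliant", "fantastic", "great", "good"}
NEGATIVE = {"disappointing", "boring", "predictable", "bad"}
NEGATIONS = {"not", "no", "never"}
PUNCT = set(".,!?;")

def tokenize(text):
    # Simplified tokenizer (assumption): lowercase words and single punctuation marks
    return re.findall(r"[a-z']+|[.,!?;]", text.lower())

def extract_features(text):
    tokens = tokenize(text)
    words = [t for t in tokens if t not in PUNCT]
    features = Counter()

    # Word frequency (unigrams)
    for word in words:
        features["uni=" + word] += 1

    # N-grams (bigrams over adjacent words)
    for first, second in zip(words, words[1:]):
        features["bi=" + first + "_" + second] += 1

    # Lexicon hits, with polarity flipped inside a negation scope
    negate = False
    for tok in tokens:
        if tok in NEGATIONS:
            negate = True
        elif tok in PUNCT:
            negate = False  # punctuation ends the negation scope
        elif tok in POSITIVE:
            features["lex_neg" if negate else "lex_pos"] += 1
        elif tok in NEGATIVE:
            features["lex_pos" if negate else "lex_neg"] += 1
    return features

example = extract_features("It was not good at all.")
print(example["bi=not_good"], example["lex_pos"], example["lex_neg"])
```

A dictionary of this shape can be turned into a numeric vector (for example with scikit-learn's `DictVectorizer`) before training a model.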




## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [1]:
# Your code here (Please add comments in the code):
import nltk
from nltk.tokenize import word_tokenize
from nltk import ngrams
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk.tag import pos_tag
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

# Sample movie reviews
reviews = [
    "This movie is absolutely amazing! The acting and plot are fantastic.",
    "I was really disappointed with the film. It was boring and predictable.",
    "The movie didn't live up to my expectations. It was not good at all.",
    "An intriguing plot and great characters make this movie a must-watch.",
]

# Feature 1: Word Frequency
def get_word_frequency(text):
    words = word_tokenize(text.lower())
    word_freq = nltk.FreqDist(words)
    return word_freq

# Feature 2: N-grams
def get_ngrams(text, n=2):
    words = word_tokenize(text.lower())
    n_grams = list(ngrams(words, n))
    return n_grams

# Feature 3: Sentiment Lexicons
def get_sentiment_score(text):
    sia = SentimentIntensityAnalyzer()
    sentiment_score = sia.polarity_scores(text)
    return sentiment_score

# Feature 4: Part-of-Speech Tags
def get_pos_tags(text):
    words = word_tokenize(text.lower())
    pos_tags = pos_tag(words)
    return pos_tags

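# Feature 5: Negation Handling
def handle_negation(text):
    words = word_tokenize(text.lower())
    # word_tokenize splits contractions such as "didn't" into "did" + "n't"
    negation_words = {"not", "no", "never", "n't"}
    clause_end = {".", ",", ";", ":", "!", "?"}

    negated_text = []
    negate = False
    for word in words:
        if word in negation_words:
            # Open a negation scope; the negation word itself is dropped
            negate = True
        elif word in clause_end:
            # Punctuation closes the negation scope
            negate = False
            negated_text.append(word)
        else:
            negated_text.append("not_" + word if negate else word)

    return " ".join(negated_text)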

# Apply the functions to each review
for review in reviews:
    print("\nReview:", review)

    # Feature 1: Word Frequency
    word_frequency = get_word_frequency(review)
    print("Word Frequency:", word_frequency)

    # Feature 2: N-grams
    n_grams = get_ngrams(review)
    print("N-grams:", n_grams)

    # Feature 3: Sentiment Lexicons
    sentiment_score = get_sentiment_score(review)
    print("Sentiment Score:", sentiment_score)

    # Feature 4: Part-of-Speech Tags
    pos_tags = get_pos_tags(review)
    print("POS Tags:", pos_tags)

    # Feature 5: Negation Handling
    negated_review = handle_negation(review)
    print("Negation Handling:", negated_review)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!



Review: This movie is absolutely amazing! The acting and plot are fantastic.
Word Frequency: <FreqDist with 13 samples and 13 outcomes>
N-grams: [('this', 'movie'), ('movie', 'is'), ('is', 'absolutely'), ('absolutely', 'amazing'), ('amazing', '!'), ('!', 'the'), ('the', 'acting'), ('acting', 'and'), ('and', 'plot'), ('plot', 'are'), ('are', 'fantastic'), ('fantastic', '.')]
Sentiment Score: {'neg': 0.0, 'neu': 0.53, 'pos': 0.47, 'compound': 0.8395}
POS Tags: [('this', 'DT'), ('movie', 'NN'), ('is', 'VBZ'), ('absolutely', 'RB'), ('amazing', 'JJ'), ('!', '.'), ('the', 'DT'), ('acting', 'NN'), ('and', 'CC'), ('plot', 'NN'), ('are', 'VBP'), ('fantastic', 'JJ'), ('.', '.')]
Negation Handling: this movie is absolutely amazing ! the acting and plot are fantastic .

Review: I was really disappointed with the film. It was boring and predictable.
Word Frequency: <FreqDist with 12 samples and 14 outcomes>
N-grams: [('i', 'was'), ('was', 'really'), ('really', 'disappointed'), ('disappointed', 'wi
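
The `Word Frequency` lines in the output above show only the summary representation of the `FreqDist` object, not the counts themselves. `nltk.FreqDist` is a subclass of `collections.Counter`, so the individual counts can be listed with `most_common()`. A minimal standard-library sketch, assuming a simplified regex tokenizer in place of `word_tokenize`:

```python
import re
from collections import Counter

review = "I was really disappointed with the film. It was boring and predictable."

# Simplified tokenizer (assumption): words and single punctuation marks
tokens = re.findall(r"\w+|[^\w\s]", review.lower())
word_freq = Counter(tokens)

# Number of distinct tokens ("samples") and total tokens ("outcomes")
print(len(word_freq), "samples,", sum(word_freq.values()), "outcomes")

# List the most frequent tokens with their counts
for word, count in word_freq.most_common(3):
    print(word, count)
```

With the NLTK version from the cell above, the equivalent call would be `get_word_frequency(review).most_common(3)`.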

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [2]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
import numpy as np

# Sample data
reviews = [
    "Feature Selection (FS) methods alleviate key problems in classification procedures as they are used to improve classification accuracy, reduce data dimensionality, and remove irrelevant data.",
    "FS methods have received a great deal of attention from the text classification community. However, only a few literature surveys include them focusing on text classification, and the ones available are either a superficial analysis or present a very small set of work in the subject.",
    "For this reason, we conducted a Systematic Literature Review (SLR) that assesses 1376 unique papers from journals and conferences published in the past eight years (2013–2020). After abstract screening and full-text eligibility analysis, 175 studies were included in our SLR.",
    "Our contribution is twofold. We have considered several aspects of each proposed method and mapped them into a new categorization schema. Additionally, we mapped the main characteristics of the experiments, identifying which datasets, languages, machine learning algorithms, and validation methods have been used to evaluate new and existing techniques.",
    "By following the SLR protocol, we allow the replication of our revision process and minimize the chances of bias while classifying the included studies. By mapping issues and experiment settings, our SLR helps researchers to develop and position new studies with respect to the existing literature.",
    "Keywords Feature selection · Dimensionality reduction · Text classification · Systematic literature review"
]

# Placeholder binary labels for illustration (assigned arbitrarily, not real classes)
labels = [0, 1, 0, 1, 0, 1]

# Convert text data to numerical features using CountVectorizer
vectorizer = CountVectorizer()
X_transformed = vectorizer.fit_transform(reviews)

# Use Chi-square for feature selection
chi2_selector = SelectKBest(chi2, k=5)  # Select top 5 features
X_chi2_selected = chi2_selector.fit_transform(X_transformed, labels)

# Get feature names from the CountVectorizer
feature_names = np.array(vectorizer.get_feature_names_out())

# Create a dictionary mapping feature names to their importance scores
feature_chi2_scores = {feature: score for feature, score in zip(feature_names, chi2_selector.scores_)}

# Rank features by importance in descending order
sorted_features = sorted(feature_chi2_scores.items(), key=lambda x: x[1], reverse=True)

# Display all features with their chi-square scores, highest first
print("\nRanked Features based on Chi-square Importance:")
for feature, score in sorted_features:
    print(f"Feature: {feature}, Chi-square Score: {score:.4f}")



Ranked Features based on Chi-square Importance:
Feature: slr, Chi-square Score: 4.0000
Feature: have, Chi-square Score: 3.0000
Feature: studies, Chi-square Score: 3.0000
Feature: by, Chi-square Score: 2.0000
Feature: data, Chi-square Score: 2.0000
Feature: included, Chi-square Score: 2.0000
Feature: mapped, Chi-square Score: 2.0000
Feature: them, Chi-square Score: 2.0000
Feature: 1376, Chi-square Score: 1.0000
Feature: 175, Chi-square Score: 1.0000
Feature: 2013, Chi-square Score: 1.0000
Feature: 2020, Chi-square Score: 1.0000
Feature: abstract, Chi-square Score: 1.0000
Feature: accuracy, Chi-square Score: 1.0000
Feature: additionally, Chi-square Score: 1.0000
Feature: after, Chi-square Score: 1.0000
Feature: algorithms, Chi-square Score: 1.0000
Feature: alleviate, Chi-square Score: 1.0000
Feature: allow, Chi-square Score: 1.0000
Feature: as, Chi-square Score: 1.0000
Feature: aspects, Chi-square Score: 1.0000
Feature: assesses, Chi-square Score: 1.0000
Feature: attention, Chi-square S
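
To make the scores above easier to interpret, the sketch below recomputes the chi-square statistic for a single term by hand. It follows the observed-versus-expected formulation that, to my understanding, scikit-learn's `chi2` applies to count features: the observed count of a term in each class is compared with the count expected if the term were spread across classes in proportion to class size. The function name `chi2_score` is hypothetical, and the per-document counts were read off the six sample texts.

```python
def chi2_score(term_counts, labels):
    """Chi-square statistic of one term's counts against the class labels.

    Assumes the term occurs at least once, so no expected count is zero.
    """
    total = sum(term_counts)
    n_docs = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how often the term occurs in documents of this class
        observed = sum(c for c, y in zip(term_counts, labels) if y == cls)
        # Expected: the class's share of documents times the term's total count
        expected = (labels.count(cls) / n_docs) * total
        score += (observed - expected) ** 2 / expected
    return score

labels = [0, 1, 0, 1, 0, 1]
slr_counts = [0, 0, 2, 0, 2, 0]   # "slr" occurs twice each in documents 3 and 5
have_counts = [0, 1, 0, 2, 0, 0]  # "have" occurs in documents 2 and 4

print(chi2_score(slr_counts, labels))   # 4.0
print(chi2_score(have_counts, labels))  # 3.0
```

Both values agree with the first two rows of the ranking. Because the labels here are placeholders, the ranking mainly reflects which words happen to be concentrated in the odd- or even-numbered documents.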

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [3]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample movie reviews
reviews = [
    "This movie is absolutely amazing! The acting and plot are fantastic.",
    "I was really disappointed with the film. It was boring and predictable.",
    "The movie didn't live up to my expectations. It was not good at all.",
    "An intriguing plot and great characters make this movie a must-watch.",
]

# Sample queries for testing
queries = [
    "I'm looking for a movie with a captivating plot and strong characters.",
    "Any recommendations for a comedy with a lot of humor?",
    "Looking for a mystery thriller that will keep me guessing until the end.",
    "What's the best animated movie for kids?",
]

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

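# Function to get BERT embeddings for a given text
def get_bert_embeddings(text):
    tokens = tokenizer(text, return_tensors='pt')
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**tokens)
    # pooler_output is the [CLS] hidden state passed through a dense + tanh layer
    return outputs.pooler_output.numpy()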

# Calculate BERT embeddings for each review
reviews_embeddings = [get_bert_embeddings(review) for review in reviews]

# Compare each query with movie reviews
for query in queries:
    print(f"\nQuery: {query}\n")

    # Calculate BERT embeddings for the query
    query_embedding = get_bert_embeddings(query)

    # Calculate cosine similarity between the query and each review
    similarities = [cosine_similarity(query_embedding, review_embedding.reshape(1, -1))[0][0] for review_embedding in reviews_embeddings]

    # Rank the reviews based on similarity in descending order
    ranked_reviews = sorted(zip(reviews, similarities), key=lambda x: x[1], reverse=True)

    # Display the ranked reviews
    print("Ranked Reviews based on Cosine Similarity:")
    for idx, (review, similarity) in enumerate(ranked_reviews, start=1):
        print(f"{idx}. Similarity: {similarity:.4f}\n   Review: {review}\n")





Query: I'm looking for a movie with a captivating plot and strong characters.

Ranked Reviews based on Cosine Similarity:
1. Similarity: 0.9712
   Review: I was really disappointed with the film. It was boring and predictable.

2. Similarity: 0.9489
   Review: An intriguing plot and great characters make this movie a must-watch.

3. Similarity: 0.9323
   Review: The movie didn't live up to my expectations. It was not good at all.

4. Similarity: 0.8864
   Review: This movie is absolutely amazing! The acting and plot are fantastic.


Query: Any recommendations for a comedy with a lot of humor?

Ranked Reviews based on Cosine Similarity:
1. Similarity: 0.9846
   Review: The movie didn't live up to my expectations. It was not good at all.

2. Similarity: 0.9772
   Review: This movie is absolutely amazing! The acting and plot are fantastic.

3. Similarity: 0.9639
   Review: An intriguing plot and great characters make this movie a must-watch.

4. Similarity: 0.9294
   Review: I was really
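
The ranking step itself does not depend on BERT. Once every text is a vector, it reduces to computing the cosine of the angle between the query vector and each document vector, then sorting. The sketch below shows that step in plain Python on toy three-dimensional vectors. The values are hypothetical stand-ins for real embeddings, and `cosine` and `rank_by_similarity` are names chosen for this example.

```python
import math

def cosine(u, v):
    # Dot product divided by the product of the vector lengths
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    # Pair each document index with its similarity, highest first
    scored = [(idx, cosine(query_vec, vec)) for idx, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for BERT embeddings (hypothetical values)
query = [1.0, 0.0, 1.0]
docs = [
    [1.0, 0.0, 0.9],  # points almost the same way as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [0.5, 0.5, 0.5],  # partially aligned
]

ranking = rank_by_similarity(query, docs)
for idx, score in ranking:
    print(f"doc {idx}: {score:.4f}")
```

The scores in the output above all fall between roughly 0.89 and 0.98, so the ranking rests on small differences. One commonly used alternative, offered here only as a suggestion, is to average the token vectors in `last_hidden_state` instead of using `pooler_output`, or to use a model trained specifically for sentence similarity.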

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



After completing this exercise, I have a practical understanding of how feature extraction and text classification are carried out with machine learning and natural language processing (NLP) techniques. The key ideas were the various feature extraction methods, feature selection strategies, and text similarity ranking with BERT embeddings. Using BERT embeddings for text similarity requires a solid understanding of tokenization and the Hugging Face Transformers library. Selecting the most relevant features for a text classification task requires experimentation and domain knowledge. Overall, these exercises improved my understanding of NLP principles and my ability to apply them in practical situations.




