<a href="https://colab.research.google.com/github/Madhu-3499/DataScienceEssentials/blob/main/Surisetti_Madhu_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

Sentiment analysis of customer reviews for a product or service. The goal is to predict whether a review is positive, negative, or neutral. The following six feature types could be useful in building a machine learning model for this task:

Bag of Words (BoW):

Every review is represented as a bag of words that records word frequency but ignores word order and syntax.
Why? BoW captures the vocabulary used in a review, so the model can identify terms that are commonly associated with positive or negative sentiment.

TF-IDF (term frequency-inverse document frequency):

Assigning each word a weight that grows with how often it occurs in a document and shrinks with how many documents in the collection contain it.
Why? TF-IDF helps locate the terms that are distinctive for a document. Words with higher TF-IDF scores are often more informative for sentiment analysis.
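
To make the weighting concrete, here is a minimal sketch that computes TF-IDF by hand for four short sample reviews. It assumes the smoothed formulation idf(t) = ln((1 + N) / (1 + df(t))) + 1 followed by L2 normalization, which is the default in scikit-learn's `TfidfVectorizer`. The `tokenize` helper is a simplified stand-in for the library's tokenizer, not its actual implementation.

```python
import math
import re
from collections import Counter

reviews = [
    "This product is amazing! I love it.",
    "Not satisfied with the quality. Disappointing experience.",
    "Fast delivery and excellent customer service.",
    "The packaging was damaged, but the product is good.",
]

def tokenize(text):
    # Simplified tokenizer (an assumption): lowercase words of 2+ letters
    return re.findall(r"[a-z]{2,}", text.lower())

docs = [tokenize(review) for review in reviews]
n_docs = len(docs)

# Document frequency: number of reviews that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return L2-normalized TF-IDF weights for one tokenized review."""
    tf = Counter(doc)
    # Smoothed IDF: ln((1 + N) / (1 + df)) + 1
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
for term, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {w:.4f}")
```

Terms that occur in only one review, such as "amazing" (about 0.437), end up with a higher weight than "is" and "product" (about 0.344), which each appear in two reviews.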

N-grams:

Sequences of n consecutive words taken from a document.
Why? Because they preserve local word order and context, n-grams help the model identify sentiment-bearing phrases. Bigrams, for instance, can capture two-word expressions like "not good."
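
As a small illustration of the idea, the sketch below extracts n-grams from a tokenized sentence in plain Python. The `ngrams` helper and the example sentence are hypothetical and exist only to show how a bigram keeps "not" attached to "good".

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, in order, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, split on whitespace for simplicity
tokens = "the product is not good".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# The unigram "good" alone looks positive; the bigram keeps the negation.
print(unigrams)
print(bigrams)
```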

Sentiment Lexicons:

Using precompiled word lists in which each word carries a positive or negative sentiment value.
Why? Sentiment lexicons supply prior knowledge about word polarity. Explicitly sentiment-bearing terms like "excellent," "happy," and "disappointing" give the model a direct signal.

Part-of-Speech (POS) Tags:

Assigning a grammatical category, such as noun, verb, or adjective, to every word in a document.
Why? A word's grammatical role helps the model interpret it in context. Positive adjectives or verbs, for instance, can signal a favorable opinion.

Emotion Analysis:

Recognizing the text's emotional tone, such as joy, anger, or sadness.
Why? Knowing the emotional content adds nuance to sentiment analysis. A review might, for instance, be both enthusiastic and slightly disappointed.
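
One simple way to turn emotional tone into a feature is to count hits against an emotion lexicon. The sketch below assumes a tiny hand-made lexicon (`EMOTION_LEXICON`) and an invented review, both for illustration only. A real system would rely on a curated emotion lexicon or a trained classifier.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon mapping words to emotions (illustration only)
EMOTION_LEXICON = {
    "love": "joy", "amazing": "joy", "excellent": "joy",
    "disappointing": "sadness", "disappointed": "sadness",
    "damaged": "anger", "terrible": "anger",
}

def emotion_counts(text):
    """Count lexicon hits per emotion in a single text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(EMOTION_LEXICON[t] for t in tokens if t in EMOTION_LEXICON)

# Invented review that mixes enthusiasm with disappointment
mixed = "I love the design, but the damaged box was disappointing."
counts = emotion_counts(mixed)
print(dict(counts))
```

The per-emotion counts can be appended to the other feature vectors, which lets a model distinguish a mixed review from a uniformly positive or negative one.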

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

# Sample text data
reviews = [
    "This product is amazing! I love it.",
    "Not satisfied with the quality. Disappointing experience.",
    "Fast delivery and excellent customer service.",
    "The packaging was damaged, but the product is good.",
]

# Tokenization using NLTK
# (NLTK 3.9+ looks for the 'punkt_tab' resource instead of 'punkt')
nltk.download('punkt')
tokenized_reviews = [nltk.word_tokenize(review) for review in reviews]

# 1. Bag of Words BoW
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(reviews)
bow_feature_names = vectorizer.get_feature_names_out()

# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(reviews)
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# 3. N-grams
n_gram_vectorizer = CountVectorizer(ngram_range=(1, 2))
n_gram_matrix = n_gram_vectorizer.fit_transform(reviews)
n_gram_feature_names = n_gram_vectorizer.get_feature_names_out()

# 4. Sentiment lexicon score: TextBlob's lexicon-based polarity in [-1, 1]
sentiment_scores = [TextBlob(review).sentiment.polarity for review in reviews]

# 5. Part-of-Speech (POS) tags
# (NLTK 3.9+ looks for 'averaged_perceptron_tagger_eng' instead)
nltk.download('averaged_perceptron_tagger')
pos_tags = [nltk.pos_tag(tokens) for tokens in tokenized_reviews]

# Displaying the features
print("1. Bag of Words (BoW):")
print(bow_matrix.toarray())
print("Feature names:", bow_feature_names)
print("\n2. TF-IDF:")
print(tfidf_matrix.toarray())
print("Feature names:", tfidf_feature_names)
print("\n3. N-grams:")
print(n_gram_matrix.toarray())
print("Feature names:", n_gram_feature_names)
print("\n4. Sentiment Lexicons:")
print(sentiment_scores)
print("\n5. Part-of-Speech (POS) Tags:")
print(pos_tags)







1. Bag of Words (BoW):
[[1 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 1]
 [0 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0]
 [0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 1 1 0 0 0 2 0 1 0]]
Feature names: ['amazing' 'and' 'but' 'customer' 'damaged' 'delivery' 'disappointing'
 'excellent' 'experience' 'fast' 'good' 'is' 'it' 'love' 'not' 'packaging'
 'product' 'quality' 'satisfied' 'service' 'the' 'this' 'was' 'with']

2. TF-IDF:
[[0.43671931 0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.34431452
  0.43671931 0.43671931 0.         0.         0.34431452 0.
  0.         0.         0.         0.43671931 0.         0.        ]
 [0.         0.         0.         0.         0.         0.
  0.38861429 0.         0.38861429 0.         0.         0.
  0.         0.         0.38861429 0.         0.         0.38861429
  0.38861429 0.         0.30638797 0.         0.         0.38861429]
 [0.         0.

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [10]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Sample text data
reviews = [
    "The movie was fantastic! The acting was superb and the plot was engaging.",
    "I was really disappointed with the film. The storyline was weak, and the performances were mediocre.",
    "This book is a masterpiece. The writing is brilliant, and the characters are unforgettable.",
]

# All sentiment classes, used to build the label-to-index mapping
labels = ['positive', 'negative', 'neutral']

# Label assigned to each review, in the same order as `reviews`
review_labels = ['positive', 'negative', 'neutral']

# Convert labels to numerical values
label_map = {label: idx for idx, label in enumerate(labels)}
numerical_labels = [label_map[label] for label in review_labels]

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(reviews)

# Chi-squared test for feature selection
chi2_scores, _ = chi2(tfidf_features, numerical_labels)

# Create a dictionary to associate feature names with chi-squared scores
feature_scores = dict(zip(tfidf_vectorizer.get_feature_names_out(), chi2_scores))

# Sort features by their chi-squared scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top N features and their scores
top_n = 20  # number of top features
for feature, score in sorted_features[:top_n]:
    print(f"Feature: {feature}, Chi-squared Score: {score:.2f}")


Feature: is, Chi-squared Score: 1.08
Feature: was, Chi-squared Score: 0.56
Feature: are, Chi-squared Score: 0.54
Feature: book, Chi-squared Score: 0.54
Feature: brilliant, Chi-squared Score: 0.54
Feature: characters, Chi-squared Score: 0.54
Feature: masterpiece, Chi-squared Score: 0.54
Feature: this, Chi-squared Score: 0.54
Feature: unforgettable, Chi-squared Score: 0.54
Feature: writing, Chi-squared Score: 0.54
Feature: acting, Chi-squared Score: 0.52
Feature: engaging, Chi-squared Score: 0.52
Feature: fantastic, Chi-squared Score: 0.52
Feature: movie, Chi-squared Score: 0.52
Feature: plot, Chi-squared Score: 0.52
Feature: superb, Chi-squared Score: 0.52
Feature: disappointed, Chi-squared Score: 0.52
Feature: film, Chi-squared Score: 0.52
Feature: mediocre, Chi-squared Score: 0.52
Feature: performances, Chi-squared Score: 0.52
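
To show what the chi-squared score measures, the sketch below recomputes it by hand on a toy term-count matrix. It follows the usual observed-versus-expected formulation (per feature, sum over classes of (observed - expected)^2 / expected), which as far as I know is also how `sklearn.feature_selection.chi2` scores non-negative features. The data and the `chi2_scores` helper are hypothetical.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature for non-negative feature values."""
    classes = sorted(set(y))
    n_docs, n_feats = len(X), len(X[0])
    scores = []
    for j in range(n_feats):
        total = sum(row[j] for row in X)  # feature mass over all documents
        score = 0.0
        for c in classes:
            # Observed: feature mass inside class c
            observed = sum(row[j] for row, label in zip(X, y) if label == c)
            # Expected: mass class c would get if feature and class were independent
            expected = total * y.count(c) / n_docs
            if expected > 0:
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# Toy term counts (hypothetical): columns are "great", "the", "bad"
features = ["great", "the", "bad"]
X = [[2, 1, 0], [1, 1, 0], [0, 1, 2], [0, 1, 1]]
y = ["pos", "pos", "neg", "neg"]

scores = dict(zip(features, chi2_scores(X, y)))
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

In the toy data, "the" is spread evenly across both classes and scores 0, while "great" and "bad" occur in one class only and score 3. One hedged observation about the output above: with three classes of one review each, a term that appears in a single review with weight w gets a score of 2w under this formula, so the ranking mostly mirrors TF-IDF weight rather than class association. A larger labeled sample would likely give a more meaningful ranking.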


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [2]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
reviews = [
    "This product is amazing! I love it.",
    "Not satisfied with the quality. Disappointing experience.",
    "Fast delivery and excellent customer service.",
    "The packaging was damaged, but the product is good.",
]

# Design a query
query = "Fast and efficient product delivery"

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode the query
query_tokens = tokenizer(query, return_tensors='pt')
with torch.no_grad():
    query_outputs = model(**query_tokens)

# Extract BERT embeddings for the query
query_embedding = query_outputs.last_hidden_state.mean(dim=1).squeeze().numpy()

# Tokenize, encode, and calculate BERT embeddings for each document
document_embeddings = []
for review in reviews:
    document_tokens = tokenizer(review, return_tensors='pt')
    with torch.no_grad():
        document_outputs = model(**document_tokens)
    document_embedding = document_outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    document_embeddings.append(document_embedding)

# Calculate cosine similarity between the query and each document
similarities = cosine_similarity([query_embedding], document_embeddings)[0]

# Rank documents based on similarity in descending order
ranked_documents = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Display the ranked documents
print("\nRanked Documents based on BERT Cosine Similarity:")
for rank, (index, similarity) in enumerate(ranked_documents):
    print(f"Rank {rank + 1}: Document {index + 1} - Similarity: {similarity:.4f}")
    print(f"   \"{reviews[index]}\"\n")








Ranked Documents based on BERT Cosine Similarity:
Rank 1: Document 3 - Similarity: 0.8370
   "Fast delivery and excellent customer service."

Rank 2: Document 2 - Similarity: 0.6347
   "Not satisfied with the quality. Disappointing experience."

Rank 3: Document 1 - Similarity: 0.6089
   "This product is amazing! I love it."

Rank 4: Document 4 - Similarity: 0.5674
   "The packaging was damaged, but the product is good."

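
The ranking step itself does not depend on BERT: once the query and each document are vectors, the score is the dot product divided by the product of the vector norms. The sketch below works through that computation in plain Python on toy 3-dimensional vectors that stand in for the 768-dimensional `bert-base-uncased` embeddings. The vectors and helper names are made up for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    sims = [(i, cosine(query_vec, d)) for i, d in enumerate(doc_vecs)]
    return sorted(sims, key=lambda pair: pair[1], reverse=True)

# Toy vectors (hypothetical) in place of real sentence embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different length
    [1.0, 1.0, 0.0],  # partial overlap with the query
]

ranking = rank_by_similarity(query_vec, doc_vecs)
for rank, (index, sim) in enumerate(ranking, start=1):
    print(f"Rank {rank}: Document {index + 1} - Similarity: {sim:.4f}")
```

Because cosine similarity ignores vector length, the second toy document scores 1.0 even though it is twice as long as the query vector.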


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



Learning Experience: Working on feature extraction from text data was a comprehensive and enlightening learning process. The exercise covered the fundamental concepts of feature extraction, from simple techniques like Bag of Words and TF-IDF to more complex ones like BERT embeddings, and gave a thorough understanding of the procedure. It emphasized the significance of several feature types, such as n-grams, sentiment lexicons, and part-of-speech tags. Examining both basic and advanced techniques illustrated the evolution of feature extraction in natural language processing.

Challenges Encountered: Selecting the best features for a particular task can be challenging, even though the exercises explored a wide range of feature extraction techniques. The exercise showed how important it is to test and evaluate different strategies while also accounting for computational efficiency. Because of resource limitations, it can be difficult to incorporate complex models like BERT, so practitioners must strike a balance between model complexity and the resources available.

Relevance to My Field of Study: Natural language processing models depend on the capacity to extract relevant information from textual input. Building models that can understand, evaluate, and produce text requires efficient feature extraction. The exercises bear directly on core problems in natural language processing, where accurate text representation is necessary for applications such as information retrieval, sentiment analysis, and text classification.