
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, grant sharing access from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Sentiment analysis of movie reviews: This task involves classifying the sentiment of movie reviews as positive, negative, or neutral.

Features:
Word Frequency (Bag of Words): Counts of individual words can signal positive or negative sentiment. Frequent positive words indicate a positive review and vice versa.

TF-IDF Scores: Highlights important words that are frequent in a document but rare across all documents, helping to identify key sentiment drivers.

N-grams: Sequences of n consecutive words provide context beyond single words, capturing phrases that have specific sentiment implications (e.g., "not good").

Sentiment Scores: Precomputed sentiment scores for words/phrases can aggregate to determine overall sentiment of the text.

Part-of-Speech Tags: Adjectives and adverbs are often tied to sentiment (e.g., "amazing" or "terribly"), making them useful for sentiment analysis.




'''
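
The "not good" case mentioned under N-grams can be illustrated with a minimal sketch in plain Python. The sentence and the helper name `word_ngrams` are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
def word_ngrams(text, n):
    """Return the word n-grams of a text, using simple whitespace tokens."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical example sentence, not taken from the sample reviews
sentence = "the plot was not good"

unigrams = word_ngrams(sentence, 1)
bigrams = word_ngrams(sentence, 2)

print(unigrams)  # "good" appears on its own, without its negation
print(bigrams)   # "not good" is kept together as one feature
```

With unigrams alone, "good" looks like a positive signal; the bigram "not good" preserves the negation.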

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob
import numpy as np

# Sample text data (movie reviews)
reviews = [
    "An outstanding narrative that keeps you engaged from start to finish.",
    "The plot was predictable and lacked depth, making it a boring watch.",
    "Exceptional cinematography and brilliant performances by the cast.",
    "The movie's pacing was off, dragging in the middle but with an exciting climax.",
    "A masterpiece in storytelling with a captivating soundtrack.",
    "Failed to live up to the hype, with underdeveloped characters and a weak storyline.",
    "A visual treat, but the plot twists felt forced and unnatural.",
    "Engaging and thought-provoking, a must-watch for any film enthusiast.",
    "The director's unique vision comes to life in this compelling drama.",
    "More style than substance, it looks good but ultimately feels empty."
]

# Adjusted movie-related terms for focused feature extraction
movie_related_terms = ['outstanding', 'predictable', 'depth', 'cinematography', 'performances', 'pacing', 'masterpiece', 'storytelling', 'soundtrack', 'hype', 'characters', 'storyline', 'visual', 'twists', 'engaging', 'director', 'style', 'substance']

# Extracting Bag-of-Words (BoW) features
vectorizer_bow = CountVectorizer(vocabulary=movie_related_terms)
bow_features = vectorizer_bow.fit_transform(reviews).toarray()

# Extracting TF-IDF features
vectorizer_tfidf = TfidfVectorizer(vocabulary=movie_related_terms)
tfidf_features = vectorizer_tfidf.fit_transform(reviews).toarray()

# Extracting N-grams features
vectorizer_ngram = CountVectorizer(ngram_range=(1, 3))
ngram_features = vectorizer_ngram.fit_transform(reviews).toarray()

# Performing sentiment analysis
sentiments = [TextBlob(review).sentiment.polarity for review in reviews]

# Calculating document lengths in words (a structural feature, used here in place of the POS tags listed in Question 1)
lengths = [len(review.split()) for review in reviews]

# Printing structured output
print("1. Bag-of-Words (BoW) specific to Movie Terminology:")
print(np.array(bow_features))

print("\n2. TF-IDF with Movie-Related Vocabulary:")
print(np.array(tfidf_features))

print("\n3. N-grams:")
# For brevity, only displaying the shape as the full matrix would be large
print(f"Shape: {ngram_features.shape}")

print("\n4. Sentiment Analysis:")
print(sentiments)

print("\n5. Document Structure and Length:")
print(lengths)


1. Bag-of-Words (BoW) specific to Movie Terminology:
[[1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 1 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1]]

2. TF-IDF with Movie-Related Vocabulary:
[[1.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]
 [0.         0.70710678 0.70710678 0.         0.         0.
  0.         0.         0.         0.         0.         0.
  0.         0.         0.         0.         0.         0.        ]
 [0.         0.         0.         0.70710678 0.70710678 0.
  0.         0.         0.         0.         0.         0.
  0.         0. 
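
The 0.70710678 entries in the TF-IDF output can be reproduced by hand. The sketch below redoes the arithmetic for the second review in plain Python, assuming scikit-learn's defaults of smoothed idf (`smooth_idf=True`) and L2 row normalisation (`norm='l2'`); the helper name `tfidf_row` is hypothetical.

```python
import math

def tfidf_row(term_counts, doc_freqs, n_docs):
    """Return L2-normalised TF-IDF weights for one document.

    Assumes the smoothed idf form idf(t) = ln((1 + n) / (1 + df(t))) + 1.
    """
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freqs[term])) + 1)
        for term, count in term_counts.items()
    }
    # Guard against an all-zero row
    norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
    return {term: w / norm for term, w in weights.items()}

# Second review: "predictable" and "depth" each occur once, and each
# appears in only one of the 10 reviews.
row = tfidf_row(
    term_counts={"predictable": 1, "depth": 1},
    doc_freqs={"predictable": 1, "depth": 1},
    n_docs=10,
)
print(row)
```

Because both terms have the same count and document frequency, normalisation gives each a weight of 1/√2 ≈ 0.7071 whatever the idf value is. A review containing a single vocabulary term gets weight 1.0, as in the first output row.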

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Sample text data (movie reviews)
reviews = [
    "An outstanding narrative that keeps you engaged from start to finish.",
    "The plot was predictable and lacked depth, making it a boring watch.",
    "Exceptional cinematography and brilliant performances by the cast.",
    "The movie's pacing was off, dragging in the middle but with an exciting climax.",
    "A masterpiece in storytelling with a captivating soundtrack.",
    "Failed to live up to the hype, with underdeveloped characters and a weak storyline.",
    "A visual treat, but the plot twists felt forced and unnatural.",
    "Engaging and thought-provoking, a must-watch for any film enthusiast.",
    "The director's unique vision comes to life in this compelling drama.",
    "More style than substance, it looks good but ultimately feels empty."
]

# Using TF-IDF to extract features
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(reviews)
feature_names = vectorizer_tfidf.get_feature_names_out()

# Summing up the TF-IDF scores for each term across all documents
sums = tfidf_matrix.sum(axis=0)
data = [(term, sums[0, col]) for col, term in enumerate(feature_names)]

# Sorting the terms by their summed TF-IDF score in descending order
ranking = sorted(data, key=lambda x: x[1], reverse=True)

# Displaying the ranked features based on their TF-IDF scores for all words
print("Ranked features based on TF-IDF scores for all words:")
for rank, (term, score) in enumerate(ranking, start=1):
    print(f"{rank}. {term}: {score:.4f}")


Ranked features based on TF-IDF scores for all words:
1. the: 1.2370
2. and: 1.0177
3. to: 0.9151
4. in: 0.7934
5. with: 0.7710
6. but: 0.7180
7. plot: 0.5932
8. watch: 0.5732
9. it: 0.5562
10. was: 0.5444
11. an: 0.5186
12. captivating: 0.4425
13. masterpiece: 0.4425
14. soundtrack: 0.4425
15. storytelling: 0.4425
16. brilliant: 0.3881
17. by: 0.3881
18. cast: 0.3881
19. cinematography: 0.3881
20. exceptional: 0.3881
21. performances: 0.3881
22. felt: 0.3554
23. forced: 0.3554
24. treat: 0.3554
25. twists: 0.3554
26. unnatural: 0.3554
27. visual: 0.3554
28. boring: 0.3424
29. depth: 0.3424
30. lacked: 0.3424
31. making: 0.3424
32. predictable: 0.3424
33. any: 0.3319
34. engaging: 0.3319
35. enthusiast: 0.3319
36. film: 0.3319
37. for: 0.3319
38. must: 0.3319
39. provoking: 0.3319
40. thought: 0.3319
41. comes: 0.3263
42. compelling: 0.3263
43. director: 0.3263
44. drama: 0.3263
45. life: 0.3263
46. this: 0.3263
47. unique: 0.3263
48. vision: 0.3263
49. empty: 0.3120
50. engaged: 0.312
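
The ranking above is led by function words such as "the" and "and", because summed TF-IDF does not use class labels. The chi-square statistic is a widely used filter method that scores each term by how strongly its presence is associated with a class. The sketch below computes it in plain Python from a 2x2 presence/class table. The sentiment labels are hypothetical (they are not part of the original data), and the helper names are illustrative.

```python
import re

reviews = [
    "An outstanding narrative that keeps you engaged from start to finish.",
    "The plot was predictable and lacked depth, making it a boring watch.",
    "Exceptional cinematography and brilliant performances by the cast.",
    "The movie's pacing was off, dragging in the middle but with an exciting climax.",
    "A masterpiece in storytelling with a captivating soundtrack.",
    "Failed to live up to the hype, with underdeveloped characters and a weak storyline.",
    "A visual treat, but the plot twists felt forced and unnatural.",
    "Engaging and thought-provoking, a must-watch for any film enthusiast.",
    "The director's unique vision comes to life in this compelling drama.",
    "More style than substance, it looks good but ultimately feels empty.",
]
# ASSUMPTION: hand-assigned labels, 1 = positive, 0 = negative or mixed
labels = [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]

def tokenize(text):
    """Lowercase and keep tokens of two or more word characters."""
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def chi_square(term, token_sets, labels):
    """Chi-square of term presence vs. class from a 2x2 contingency table."""
    a = b = c = d = 0
    for tokens, label in zip(token_sets, labels):
        present = term in tokens
        if present and label == 1:
            a += 1          # term present, positive class
        elif present:
            b += 1          # term present, negative class
        elif label == 1:
            c += 1          # term absent, positive class
        else:
            d += 1          # term absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

token_sets = [tokenize(r) for r in reviews]
vocabulary = sorted(set().union(*token_sets))
# Sort by score (descending), breaking ties alphabetically
ranked = sorted(
    ((term, chi_square(term, token_sets, labels)) for term in vocabulary),
    key=lambda pair: (-pair[1], pair[0]),
)
for term, score in ranked[:5]:
    print(f"{term}: {score:.4f}")
```

Under these assumed labels, "but" ranks first because it appears only in the negative or mixed reviews. scikit-learn provides `SelectKBest` with `chi2` for the same purpose, though its statistic is computed from term counts and may differ numerically from this presence-based version.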

## Question 4 (10 points):
Write Python code to rank the texts based on text similarity. Using the text data from Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
from transformers import BertTokenizer, BertModel
import torch
from torch.nn.functional import cosine_similarity

# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode texts using BERT to get embeddings
def get_bert_embeddings(texts):
    # Tokenize and encode the texts with padding and truncation
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt', max_length=512)
    # Get the model output (last hidden states)
    with torch.no_grad():
        output = model(**encoded_input)
    # Get the embeddings from the last hidden state by averaging across token embeddings
    embeddings = output.last_hidden_state.mean(dim=1)
    return embeddings

# Sample text data (movie reviews) from question 2
texts = [
    "An outstanding narrative that keeps you engaged from start to finish.",
    "The plot was predictable and lacked depth, making it a boring watch.",
    "Exceptional cinematography and brilliant performances by the cast.",
    "The movie's pacing was off, dragging in the middle but with an exciting climax.",
    "A masterpiece in storytelling with a captivating soundtrack.",
    "Failed to live up to the hype, with underdeveloped characters and a weak storyline.",
    "A visual treat, but the plot twists felt forced and unnatural.",
    "Engaging and thought-provoking, a must-watch for any film enthusiast.",
    "The director's unique vision comes to life in this compelling drama.",
    "More style than substance, it looks good but ultimately feels empty."
]

# Your query
query = "A compelling story that captures your attention."

# Encode the query and texts to get their embeddings
query_embedding = get_bert_embeddings([query])
text_embeddings = get_bert_embeddings(texts)

# Calculate cosine similarities between the query and every text in one call
# (the single query row broadcasts against all text rows)
similarities = cosine_similarity(query_embedding, text_embeddings)

# Rank the texts based on similarity scores in descending order
ranked_indices = sorted(range(len(texts)), key=lambda i: similarities[i].item(), reverse=True)

# Display the ranked texts with their similarity scores
print("Ranked texts based on similarity to the query:")
for index in ranked_indices:
    print(f"Text: {texts[index]}, Similarity Score: {similarities[index].item()}")


Ranked texts based on similarity to the query:
Text: An outstanding narrative that keeps you engaged from start to finish., Similarity Score: 0.8701763153076172
Text: A masterpiece in storytelling with a captivating soundtrack., Similarity Score: 0.7749614715576172
Text: Engaging and thought-provoking, a must-watch for any film enthusiast., Similarity Score: 0.7745383381843567
Text: The director's unique vision comes to life in this compelling drama., Similarity Score: 0.7662754058837891
Text: The movie's pacing was off, dragging in the middle but with an exciting climax., Similarity Score: 0.6782228946685791
Text: A visual treat, but the plot twists felt forced and unnatural., Similarity Score: 0.6712093949317932
Text: More style than substance, it looks good but ultimately feels empty., Similarity Score: 0.6626472473144531
Text: The plot was predictable and lacked depth, making it a boring watch., Similarity Score: 0.6510769724845886
Text: Exceptional cinematography and brilliant per
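
One caveat about the pooling step: averaging `last_hidden_state` over all positions also includes padding tokens when texts of different lengths are batched together. A common alternative is to average only the positions where the attention mask is 1. The sketch below shows that idea, plus the cosine ranking step, on toy 2-dimensional vectors in plain Python. The vectors, masks, and helper names are illustrative assumptions, not BERT outputs.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "token embeddings": the last position of doc_a is padding
doc_a_tokens = [[1.0, 0.0], [1.0, 0.2], [0.0, 5.0]]
doc_a_mask = [1, 1, 0]
doc_b_tokens = [[0.0, 1.0], [0.2, 1.0], [0.1, 0.9]]
doc_b_mask = [1, 1, 1]
query = [1.0, 0.1]

docs = {
    "doc_a": masked_mean(doc_a_tokens, doc_a_mask),
    "doc_b": masked_mean(doc_b_tokens, doc_b_mask),
}

# Rank document names by cosine similarity to the query, highest first
ranked = sorted(docs, key=lambda name: cosine(query, docs[name]), reverse=True)
for name in ranked:
    print(name, round(cosine(query, docs[name]), 4))

# For comparison: the same document with padding included in the average
unmasked_a = masked_mean(doc_a_tokens, [1, 1, 1])
print("doc_a with padding:", round(cosine(query, unmasked_a), 4))
```

In this toy case the padded position pulls the averaged vector away from the query and lowers the score. In PyTorch the same masking is typically done by multiplying `last_hidden_state` by `attention_mask.unsqueeze(-1)` and dividing by the mask sum.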

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
In exploring feature extraction from text data, I've found a blend of techniques invaluable for navigating the complexities of natural language processing. Bag of Words (BoW) and TF-IDF provide a solid foundation, translating text into numerical vectors that capture word frequency and significance. The evolution towards word and contextual embeddings, notably through models like Word2Vec, GloVe, and BERT, marked a significant leap, offering nuanced, semantic representations of text. Cosine similarity emerged as a crucial tool for assessing text similarities, enhancing the relevance of document comparisons. Lastly, feature selection techniques proved essential in optimizing model performance, highlighting the importance of identifying and focusing on the most informative features.
Completing feature extraction exercises from text data often involves challenges like managing high-dimensional data, capturing contextual nuances, and executing effective data preprocessing. Advanced models like BERT improve accuracy but demand more computational power.
This exercise is central to the field of Natural Language Processing (NLP), focusing on transforming raw text into structured, analyzable data. It underscores the importance of feature extraction and representation, pivotal for tasks like sentiment analysis, text classification, and semantic similarity, thereby enhancing machine understanding and processing of human language.


'''