<a href="https://colab.research.google.com/github/TharunSaiVT/INFO-5731/blob/main/V_T_Tharun_Sai_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

One interesting text classification task could be sentiment analysis of movie reviews.
In this task, the goal is to classify movie reviews as either positive, negative, or neutral based on the sentiment expressed in the text.
Here are some features that could be useful for building a machine learning model:

Bag of Words (BoW):

This feature represents the presence or absence of words in the text.
It creates a sparse matrix where each column represents a unique word in the corpus, and each row represents a document.
BoW helps capture the vocabulary and basic word frequencies in the text, which can be indicative of sentiment.

Term Frequency-Inverse Document Frequency (TF-IDF):

TF-IDF is a numerical statistic that reflects the importance of a word in a document relative to a corpus.
Words with higher TF-IDF scores are considered more important for classification.

N-grams:

N-grams are sequences of contiguous words in a text. Unigrams (single words), bigrams (pairs of consecutive words), and trigrams (triplets of consecutive words) are commonly used.
For sentiment analysis, capturing phrases like "not good" or "very happy" can significantly impact classification accuracy.

Sentiment Lexicons:

Sentiment lexicons are dictionaries containing words annotated with sentiment polarities (e.g., positive, negative, neutral).
By matching words in the text with entries in sentiment lexicons, we can quantify the sentiment expressed in the text.

Part-of-Speech (POS) Tags:

POS tagging assigns grammatical categories (e.g., noun, verb, adjective) to each word in a text.
Incorporating POS tags as features can provide information about the syntactic structure and grammatical patterns of the text.
Certain POS tags, such as adjectives and adverbs, are often indicative of sentiment.
Analyzing the distribution of POS tags in the text can help capture nuances in sentiment expression.




'''

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [6]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
from sklearn.feature_extraction.text import TfidfTransformer
from nltk import pos_tag

# Sample movie reviews
reviews = [
    "The movie was fantastic, I absolutely loved it!",
    "I found the film to be very disappointing and boring.",
    "This movie exceeded my expectations, it was amazing!",
    "The acting in this film was subpar and the plot was weak.",
    "I couldn't stop laughing, this movie is hilarious!"
]

# Initialize NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Tokenization, stop words removal, and lemmatization
stop_words = set(stopwords.words('english')).union(set(ENGLISH_STOP_WORDS))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token.isalpha()]  # Remove non-alphabetic tokens
    tokens = [token for token in tokens if token not in stop_words]  # Remove stop words
    tokens = [lemmatizer.lemmatize(token) for token in tokens]  # Lemmatization
    return ' '.join(tokens)

preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Feature extraction: Bag of Words (BoW)
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(preprocessed_reviews)
print("Bag of Words features:")
print(bow_vectorizer.get_feature_names_out())
print("\n")
print(bow_features.toarray())

# Feature extraction: TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_reviews)
print("\nTF-IDF features:")
print(tfidf_vectorizer.get_feature_names_out())
print("\n")
print(tfidf_features.toarray())

# Feature extraction: POS Tags
def pos_tag_features(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    pos_tag_list = [tag[1] for tag in pos_tags]
    return pos_tag_list

pos_tagged_reviews = [pos_tag_features(review) for review in reviews]
print("\nPOS Tag features:")
print(pos_tagged_reviews)
print("\n")

Bag of Words features:
['absolutely' 'acting' 'amazing' 'boring' 'disappointing' 'exceeded'
 'expectation' 'fantastic' 'film' 'hilarious' 'laughing' 'loved' 'movie'
 'plot' 'stop' 'subpar' 'weak']


[[1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0]
 [0 0 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0]
 [0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 0]
 [0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1]
 [0 0 0 0 0 0 0 0 0 1 1 0 1 0 1 0 0]]

TF-IDF features:
['absolutely' 'acting' 'amazing' 'boring' 'disappointing' 'exceeded'
 'expectation' 'fantastic' 'film' 'hilarious' 'laughing' 'loved' 'movie'
 'plot' 'stop' 'subpar' 'weak']


[[0.53849791 0.         0.         0.         0.         0.
  0.         0.53849791 0.         0.         0.         0.53849791
  0.36063833 0.         0.         0.         0.        ]
 [0.         0.         0.         0.61418897 0.61418897 0.
  0.         0.         0.49552379 0.         0.         0.
  0.         0.         0.         0.         0.        ]
 [0.         0.         0.53849791 0.         0.         

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [8]:
from sklearn.feature_selection import mutual_info_classif

# Extracting features
X = tfidf_features  # Using TF-IDF features for demonstration
y = [1, 0, 1, 0, 1]  # Assuming binary labels for demonstration

# Perform Mutual Information feature selection
mi_scores = mutual_info_classif(X, y)

# Map feature names to their corresponding MI scores
feature_names = vectorizer_tfidf.get_feature_names_out()
feature_mi_scores = dict(zip(feature_names, mi_scores))

# Rank features based on their importance (MI scores) in descending order
ranked_features = sorted(feature_mi_scores.items(), key=lambda x: x[1], reverse=True)

# Print ranked features
print("Ranked features based on Mutual Information (MI) scores:")
for feature, score in ranked_features:
    print(f"Feature: {feature}, MI Score: {score}")


Ranked features based on Mutual Information (MI) scores:
Feature: film, MI Score: 0.6730116670092563
Feature: movie, MI Score: 0.6730116670092563
Feature: acting, MI Score: 0.22314355131420974
Feature: boring, MI Score: 0.22314355131420974
Feature: disappointing, MI Score: 0.22314355131420974
Feature: found, MI Score: 0.22314355131420974
Feature: plot, MI Score: 0.22314355131420974
Feature: subpar, MI Score: 0.22314355131420974
Feature: weak, MI Score: 0.22314355131420974
Feature: absolutely, MI Score: 0.11849392256130009
Feature: amazing, MI Score: 0.11849392256130009
Feature: could, MI Score: 0.11849392256130009
Feature: exceeded, MI Score: 0.11849392256130009
Feature: expectations, MI Score: 0.11849392256130009
Feature: fantastic, MI Score: 0.11849392256130009
Feature: hilarious, MI Score: 0.11849392256130009
Feature: laughing, MI Score: 0.11849392256130009
Feature: loved, MI Score: 0.11849392256130009
Feature: stop, MI Score: 0.11849392256130009




## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [9]:
# You code here (Please add comments in the code):


import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
sample_text_data = [
    "The movie was amazing! I loved every minute of it.",
    "I couldn't stand the terrible acting and poor plot.",
    "The film had its flaws, but overall it was enjoyable.",
    "Not a fan of the cinematography, but the story was compelling.",
    "This movie is neither good nor bad, just average."
]

# Query
query = "I enjoyed the movie very much."

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text using BERT
def encode_text(text):
    input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)
    outputs = model(input_ids)
    last_hidden_state = outputs.last_hidden_state
    return last_hidden_state.mean(dim=1).squeeze().detach().numpy()

# Encode query
query_embedding = encode_text(query)

# Encode text data
text_embeddings = [encode_text(text) for text in sample_text_data]

# Calculate cosine similarity between query and text data
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

# Rank the similarities in descending order
ranked_similarities = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Print ranked documents based on similarity scores
print("Ranked documents based on similarity with the query:")
for idx, sim in ranked_similarities:
    print(f"Similarity with document {idx + 1}: {sim:.4f}")
    print(f"Text: {sample_text_data[idx]}\n")



The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Ranked documents based on similarity with the query:
Similarity with document 1: 0.7817
Text: The movie was amazing! I loved every minute of it.

Similarity with document 2: 0.7322
Text: I couldn't stand the terrible acting and poor plot.

Similarity with document 4: 0.7315
Text: Not a fan of the cinematography, but the story was compelling.

Similarity with document 3: 0.7163
Text: The film had its flaws, but overall it was enjoyable.

Similarity with document 5: 0.6789
Text: This movie is neither good nor bad, just average.



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
Please write you answer here:

I had good experience when i was working on extracting features from text data.

The key concepts you found most beneficial in understanding the process.
Parts of Speech Tagging
Bag-of-Words (BoW) representation
TF-IDF

The most difficulites in completeing this exercise is when it took much time to understand the features for completing the given tasks.




'''