<a href="https://colab.research.google.com/github/HarshaSolingaram/INFO_5731/blob/main/Solingaram_Harshavardhan_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Here are five types of features that could be useful for building a machine learning model for sentiment analysis:

Bag of Words (BoW) Features: BoW features represent each text by the words it contains, either as presence/absence flags or as word counts, ignoring word order.
These features capture how often specific words occur, which helps in determining sentiment.
Words like "excellent," "terrible," "satisfied," and "disappointed" are likely to be indicative of sentiment.

TF-IDF Features: Term Frequency-Inverse Document Frequency (TF-IDF) features weigh the importance of a word in a document relative to a corpus of documents.
This feature type helps in identifying words that are frequent in a document but rare in the overall corpus, which can be strong indicators of sentiment.

N-grams Features: N-grams features capture sequences of adjacent words in the text.
Unigrams (single words), bigrams (pairs of adjacent words), and trigrams (triplets of adjacent words) can provide context and capture nuances in sentiment that may be missed by individual words alone.

Part-of-Speech (POS) Features: POS features categorize words in a text into their grammatical categories (e.g., nouns, verbs, adjectives).
Adjectives and adverbs often carry sentiment information, so including features based on POS tagging can help capture the sentiment expressed in the text more accurately.

Sentiment Lexicon Features: Sentiment lexicons contain lists of words annotated with their associated sentiment polarity (positive, negative, or neutral).
Incorporating features based on sentiment lexicons allows the model to leverage pre-defined sentiment knowledge, enhancing its ability to recognize sentiment in text.

'''
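
The TF-IDF weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration using the textbook form `tf * log(N / df)` on a hypothetical toy corpus (the `docs` list and the `tf_idf` helper are assumptions, not part of the exercise). scikit-learn's `TfidfVectorizer` smooths the IDF term and L2-normalizes each row by default, so its numbers will differ.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized (not the notebook's data).
docs = [
    ["excellent", "product", "excellent", "service"],
    ["terrible", "product"],
    ["satisfied", "service"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In the first document, "excellent" is frequent locally and appears in no other document, so it receives a higher weight than "product", which is shared with the second document.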

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [44]:
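!pip install nltk
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
import pandas as pd

# Download the NLTK resources used below: tokenizer, stop words, lemmatizer, POS tagger
# (newer NLTK releases may also need 'punkt_tab' and 'averaged_perceptron_tagger_eng')
for resource in ['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger']:
    nltk.download(resource)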

# Sample text data
reviews = [
    "The product is excellent and highly recommended.",
    "I'm really disappointed with the service provided.",
    "Overall, I am satisfied with my purchase.",
    "The quality of the item is terrible, avoid it!",
    "This company offers outstanding customer support."
]
labels = ['positive', 'negative', 'positive', 'negative', 'positive']

# Tokenization, stop words removal, and lemmatization
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    tokens = word_tokenize(text.lower())
    filtered_tokens = [lemmatizer.lemmatize(token) for token in tokens if token.isalnum() and token not in stop_words]
    return ' '.join(filtered_tokens)

preprocessed_reviews = [preprocess_text(review) for review in reviews]

# Bag of Words (BoW) Features
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(preprocessed_reviews)

# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(preprocessed_reviews)

# N-grams Features (unigrams and bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))
ngram_features = ngram_vectorizer.fit_transform(preprocessed_reviews)

# Part-of-Speech (POS) Features
pos_features = []
for review in preprocessed_reviews:
    tokens = nltk.pos_tag(word_tokenize(review))
    pos_tags = [tag for word, tag in tokens]
    pos_features.append(' '.join(pos_tags))

# Sentiment Lexicon Features
positive_lexicon = set(['excellent', 'recommended', 'satisfied', 'outstanding'])
negative_lexicon = set(['disappointed', 'terrible', 'avoid'])
lexicon_features = []
for review in preprocessed_reviews:
    pos_words = set(word_tokenize(review)).intersection(positive_lexicon)
    neg_words = set(word_tokenize(review)).intersection(negative_lexicon)
    if len(pos_words) > len(neg_words):
        lexicon_features.append('positive')
    elif len(pos_words) < len(neg_words):
        lexicon_features.append('negative')
    else:
        lexicon_features.append('neutral')

# Combine features into a DataFrame
features_df = pd.DataFrame({
    'Review': reviews,
    'BoW Features': [f.toarray()[0] for f in bow_features],
    'TF-IDF Features': [f.toarray()[0] for f in tfidf_features],
    'N-grams Features': [f.toarray()[0] for f in ngram_features],
    'POS Features': pos_features,
    'Lexicon Features': lexicon_features,
    'Sentiment': labels
})

print(features_df)




[nltk_data] Downloading package wordnet to /root/nltk_data...


                                              Review  \
0   The product is excellent and highly recommended.   
1  I'm really disappointed with the service provi...   
2          Overall, I am satisfied with my purchase.   
3     The quality of the item is terrible, avoid it!   
4  This company offers outstanding customer support.   

                                        BoW Features  \
0  [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, ...   
1  [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, ...   
2  [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, ...   
3  [1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, ...   
4  [0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, ...   

                                     TF-IDF Features  \
0  [0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 0.0, 0.0, 0.0, ...   
1  [0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
2  [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...   
3  [0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, ...   
4  [0.0, 0.4472135954999579, 0.447213595499957
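
What `CountVectorizer(ngram_range=(1, 2))` counts can be illustrated without scikit-learn. The sketch below is a simplified stand-in, not the library's implementation: the `ngrams` helper is hypothetical, and the token list is what the preprocessing step should produce for the first review.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n adjacent tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Assumed preprocessed form of the first review (stop words and punctuation removed).
tokens = "product excellent highly recommended".split()

# Unigrams plus bigrams, mirroring ngram_range=(1, 2).
unigrams_and_bigrams = ngrams(tokens, 1) + ngrams(tokens, 2)
print(unigrams_and_bigrams)
# ['product', 'excellent', 'highly', 'recommended',
#  'product excellent', 'excellent highly', 'highly recommended']

# Each distinct n-gram becomes one column; the count is the cell value.
ngram_counts = Counter(unigrams_and_bigrams)
print(ngram_counts["highly recommended"])  # 1
```

Bigrams such as "highly recommended" keep a small amount of word order that the unigram columns alone discard.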

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [46]:
from sklearn.feature_selection import mutual_info_classif

# Convert labels to numerical values
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(labels)

# Calculate Mutual Information scores for each feature
mi_scores = mutual_info_classif(bow_features, encoded_labels)

# Pair each vocabulary term with its Mutual Information score
# (get_feature_names_out() returns terms in column order, matching mi_scores)
feature_names = vectorizer.get_feature_names_out()
feature_mi_scores = dict(zip(feature_names, mi_scores))

# Sort features based on their Mutual Information scores in descending order
sorted_features = sorted(feature_mi_scores.items(), key=lambda x: x[1], reverse=True)

# Display sorted features and their Mutual Information scores
print("Ranked features based on Mutual Information scores:")
for feature_name, score in sorted_features:
    print(f"Feature: {feature_name}, MI Score: {score}")


Ranked features based on Mutual Information scores:
Feature: avoid, MI Score: 0.22314355131420974
Feature: disappointed, MI Score: 0.22314355131420974
Feature: item, MI Score: 0.22314355131420974
Feature: provided, MI Score: 0.22314355131420974
Feature: quality, MI Score: 0.22314355131420974
Feature: really, MI Score: 0.22314355131420974
Feature: service, MI Score: 0.22314355131420974
Feature: terrible, MI Score: 0.22314355131420974
Feature: company, MI Score: 0.11849392256130009
Feature: customer, MI Score: 0.11849392256130009
Feature: excellent, MI Score: 0.11849392256130009
Feature: highly, MI Score: 0.11849392256130009
Feature: offer, MI Score: 0.11849392256130009
Feature: outstanding, MI Score: 0.11849392256130009
Feature: overall, MI Score: 0.11849392256130009
Feature: product, MI Score: 0.11849392256130009
Feature: purchase, MI Score: 0.11849392256130009
Feature: recommended, MI Score: 0.11849392256130009
Feature: satisfied, MI Score: 0.11849392256130009
Feature: support, MI Sco
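
The two distinct scores in the output above can be reproduced by hand. The sketch below computes discrete mutual information (in nats) directly from co-occurrence counts; the `mutual_information` helper is written for illustration and is not scikit-learn's code. The indicator vectors assume each word occurs in exactly one review, as in the sample data.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    count_x = Counter(feature)
    count_y = Counter(labels)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        mi += p_xy * math.log(p_xy / ((count_x[x] / n) * (count_y[y] / n)))
    return mi

labels = ["positive", "negative", "positive", "negative", "positive"]
avoid = [0, 0, 0, 1, 0]      # "avoid" occurs only in review 4 (negative)
excellent = [1, 0, 0, 0, 0]  # "excellent" occurs only in review 1 (positive)

print(round(mutual_information(avoid, labels), 4))      # 0.2231
print(round(mutual_information(excellent, labels), 4))  # 0.1185
```

Because every word appears in a single review, the score depends only on that review's class. The negative class is the smaller one (two of five reviews), so a word found in a negative review is more informative about the label than a word found in a positive one. With only five documents, the ranking reflects class balance more than word meaning.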

## Question 4 (10 points):
Write Python code to rank the texts by similarity. Using the text data from Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text. Rank the results by similarity in descending order.

In [50]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
reviews = [
    "The product is excellent and highly recommended.",
    "I'm really disappointed with the service provided.",
    "Overall, I am satisfied with my purchase.",
    "The quality of the item is terrible, avoid it!",
    "This company offers outstanding customer support."
]

# Query
query = "I want to buy a product and I'm looking for excellent recommendations."

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize query and texts, then convert them to tensors
query_tokens = tokenizer.encode_plus(query, add_special_tokens=True, max_length=512, truncation=True, padding='max_length', return_tensors='pt')
text_tokens = [tokenizer.encode_plus(text, add_special_tokens=True, max_length=512, truncation=True, padding='max_length', return_tensors='pt') for text in reviews]

# Calculate BERT embeddings for query and texts
with torch.no_grad():
    query_outputs = model(**query_tokens)
    text_outputs = [model(**text_token) for text_token in text_tokens]

# Use the final hidden state of the [CLS] token (position 0) as each text's embedding
query_embeddings = query_outputs.last_hidden_state[:, 0, :].numpy().reshape(1, -1)  # Reshape to 2D array
text_embeddings = [output.last_hidden_state[:, 0, :].numpy().reshape(1, -1) for output in text_outputs]

# Calculate cosine similarity between query and each text
similarities = [cosine_similarity(query_embeddings, text_embedding)[0][0] for text_embedding in text_embeddings]

# Rank texts based on similarity in descending order
ranked_results = sorted(zip(reviews, similarities), key=lambda x: x[1], reverse=True)

# Print ranked results
print("Ranked results based on text similarity:")
for i, (text, similarity) in enumerate(ranked_results):
    print(f"Rank {i+1}: Similarity: {similarity:.4f} - Text: {text}")


Ranked results based on text similarity:
Rank 1: Similarity: 0.9329 - Text: I'm really disappointed with the service provided.
Rank 2: Similarity: 0.9194 - Text: Overall, I am satisfied with my purchase.
Rank 3: Similarity: 0.8883 - Text: The product is excellent and highly recommended.
Rank 4: Similarity: 0.8809 - Text: This company offers outstanding customer support.
Rank 5: Similarity: 0.8237 - Text: The quality of the item is terrible, avoid it!
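
The ranking step itself is independent of BERT: it only needs one vector per text. As a minimal sketch, the same logic is shown below on hypothetical 3-dimensional vectors (the names `query_vec` and `doc_vecs` and their values are made up for illustration; real BERT-base embeddings have 768 dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy "embeddings".
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially aligned
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Sort document names by similarity to the query, highest first.
ranked = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranked)  # ['doc_a', 'doc_b', 'doc_c']
```

In the BERT results above, all five scores fall in a narrow band (roughly 0.82 to 0.93) and a negative review ranks first. One possible reason is that the raw `[CLS]` vector of a BERT model that has not been fine-tuned is not trained to encode sentence similarity. Averaging the token vectors, or using a model trained for sentence embeddings, may separate the texts better, though that is untested here.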


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [48]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:


I would like clarification about the exercises, quizzes, and assignments.
Because we are learning from scratch, we need help and some hands-on practice to complete these assignments.
I am a little disappointed with the time given for this exercise.
The exercise is clearly helpful; please allow more time next time.


'''
