<a href="https://colab.research.google.com/github/ManojKumarKolli/ManojKumar_INFO5731_Spring2024/blob/main/In_class_exercise/Kolli_ManojKumar_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing rights from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

An interesting text classification task could involve identifying the sentiment of tweets related to customer service experiences.
This task is vital for businesses seeking to understand and improve customer satisfaction through social media engagement.
The goal would be to classify tweets into categories such as positive, neutral, or negative sentiments regarding the customer service they received.

Here are five different types of features that might be useful for building a machine learning model for this task:

1. TF-IDF (Term Frequency-Inverse Document Frequency):
TF-IDF measures the importance of a word within a document in a collection of documents (corpus).
It helps differentiate words that are frequent in a particular document but not across multiple documents, which can be critical for identifying key terms that signal sentiment in tweets.
This approach balances the word's occurrence in a single tweet against its commonness across all tweets, highlighting words that are uniquely significant in expressing sentiment.

2. N-grams:
N-grams (combinations of n words) capture phrases and expressions specific to customer service interactions (e.g., "quick response", "poor service").
These features help the model to recognize patterns and phrases that are predictive of sentiment, going beyond individual words to understand context and nuance in the text.


3. Sentiment Lexicons:
Sentiment lexicons provide scores for words based on their sentiment polarity (positive, neutral, negative) and can be used to compute aggregate sentiment scores for tweets.
These scores offer direct insight into sentiment, aiding the model in distinguishing between different sentiment classes based on the lexical content of the tweets.

4. Part-of-Speech Tags:
The grammatical role of words (e.g., adjectives, verbs) can be indicative of sentiment. For example, adjectives are often used to express opinions (e.g., "great", "terrible").
Including part-of-speech as features helps the model to focus on the types of words most relevant to expressing sentiment.

5. Tweet Metadata:
Metadata provides context that is not always apparent from the tweet's text alone.
Hashtags can indicate the topic or tone (e.g., #badservice), mentions may point to specific companies or products, and emojis can be powerful indicators of sentiment.
Incorporating these features allows the model to leverage additional signals that are specific to the social media context.



'''
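To make the TF-IDF description above concrete, here is a minimal sketch that computes the weights by hand. The toy corpus is hypothetical (it is not the tweet data used later), and the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is assumed here because it mirrors the default smoothing documented for scikit-learn's `TfidfVectorizer`; the L2 normalisation that the vectorizer also applies is left out for readability.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not the notebook's tweets), already tokenised.
docs = [
    ["great", "service", "today"],
    ["terrible", "service", "today"],
    ["service", "was", "slow"],
]

def tf_idf(docs):
    """Return one {term: weight} dict per document using smoothed IDF."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return weights

weights = tf_idf(docs)
for doc_weights in weights:
    print({term: round(w, 3) for term, w in doc_weights.items()})
```

In this sketch "service" occurs in every document, so it gets the lowest weight, while "great" occurs in only one document and is weighted highest in the first document.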


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [5]:
pip install emoji



In [6]:
# Your code here (Please add comments in the code):

import numpy as np
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
import emoji

# Sample tweets
tweets = [
    "Really impressed with the quick response from the support team! 😊 #CustomerService",
    "Waited on hold for 30 minutes and still no help. Terrible service! 😠 #badservice",
    "The new update is fantastic! Great work.",
    "Why is my delivery delayed again? Unacceptable! #lateDelivery",
    "A big thank you to the team for resolving my issue so quickly! 👍"
]

# Ensure you have the necessary NLTK data
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')
nltk.download('punkt')

# 1. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(tweets)

# 2. N-grams
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2))  # For unigrams and bigrams
ngram_features = ngram_vectorizer.fit_transform(tweets)

# 3. Sentiment Lexicons (VADER)
sia = SentimentIntensityAnalyzer()
sentiment_scores = [sia.polarity_scores(tweet) for tweet in tweets]

# 4. Part-of-Speech Tags
pos_tags = [pos_tag(word_tokenize(tweet)) for tweet in tweets]

# 5. Tweet Metadata (Extracting hashtags, mentions, and emojis)
def extract_metadata(tweet):
    hashtags = [word for word in tweet.split() if word.startswith("#")]
    mentions = [word for word in tweet.split() if word.startswith("@")]
    emojis_list = [char for char in tweet if emoji.is_emoji(char)]
    return {"hashtags": hashtags, "mentions": mentions, "emojis": emojis_list}

# Extract metadata for every tweet
metadata_features = [extract_metadata(tweet) for tweet in tweets]

# Example output for the first tweet
print("Sample Metadata Features:", metadata_features[0])

# Example outputs for all tweets
print("TF-IDF Features shape:", tfidf_features.shape)
print("N-gram Features shape:", ngram_features.shape)
print("Sample Sentiment Scores:", sentiment_scores)
print("Sample POS Tags:", pos_tags)
print("Sample Metadata Features:", metadata_features)



[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Sample Metadata Features: {'hashtags': ['#CustomerService'], 'mentions': [], 'emojis': ['😊']}
TF-IDF Features shape: (5, 44)
N-gram Features shape: (5, 90)
Sample Sentiment Scores: [{'neg': 0.0, 'neu': 0.585, 'pos': 0.415, 'compound': 0.7495}, {'neg': 0.306, 'neu': 0.547, 'pos': 0.148, 'compound': -0.4389}, {'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'compound': 0.8398}, {'neg': 0.464, 'neu': 0.536, 'pos': 0.0, 'compound': -0.636}, {'neg': 0.0, 'neu': 0.65, 'pos': 0.35, 'compound': 0.6588}]
Sample POS Tags: [[('Really', 'RB'), ('impressed', 'JJ'), ('with', 'IN'), ('the', 'DT'), ('quick', 'JJ'), ('response', 'NN'), ('from', 'IN'), ('the', 'DT'), ('support', 'NN'), ('team', 'NN'), ('!', '.'), ('😊', 'JJ'), ('#', '#'), ('CustomerService', 'NNP')], [('Waited', 'VBN'), ('on', 'IN'), ('hold', 'NN'), ('for', 'IN'), ('30', 'CD'), ('minutes', 'NNS'), ('and', 'CC'), ('still', 'RB'), ('no', 'DT'), ('help', 'NN'), ('.', '.'), ('Terrible', 'JJ'), ('service', 'NN'), ('!', '.'), ('😠', 'JJ'), ('#', '#'), 
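One caveat with the whitespace-split metadata extraction above: a hashtag or mention followed directly by punctuation keeps that punctuation. The sketch below shows a regex-based alternative as one possible refinement. The helper name `extract_tags`, the sample text, and the handle `@AcmeSupport` are hypothetical, and the patterns are deliberately simple (for instance, `@\w+` would also match inside an e-mail address).

```python
import re

# Simple patterns: a marker character followed by word characters.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@\w+")

def extract_tags(text):
    """Hypothetical helper: collect hashtags and mentions without trailing punctuation."""
    return {
        "hashtags": HASHTAG_RE.findall(text),
        "mentions": MENTION_RE.findall(text),
    }

# Hypothetical sample text with punctuation attached to the tags.
sample = "Thanks @AcmeSupport! Fast fix. #CustomerService, #happy."

tags = extract_tags(sample)
split_hashtags = [word for word in sample.split() if word.startswith("#")]

print("regex:", tags)
print("split:", split_hashtags)
```

The split-based list keeps the trailing comma and full stop, while the regex version returns the bare tags.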

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [7]:
# Your code here (Please add comments in the code):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
import numpy as np

# Reuse the sample `tweets` list defined in Question 2

# Assume labels for each tweet
tweet_labels = ['positive', 'negative', 'positive', 'negative', 'positive']

# Convert labels to numerical values
labels = ['positive', 'negative']
label_map = {label: idx for idx, label in enumerate(labels)}
numerical_labels = [label_map[label] for label in tweet_labels]

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(tweets)

# Chi-squared test for feature selection
chi2_scores, _ = chi2(tfidf_features, numerical_labels)

# Create a dictionary to associate feature names with chi-squared scores
feature_scores = {feature_name: score for feature_name, score in zip(tfidf_vectorizer.get_feature_names_out(), chi2_scores)}

# Sort features by their chi-squared scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top N features and their scores
top_n = 10  # Adjust as needed to display the desired number of top features
print("Top features based on Chi-squared scores:")
for feature, score in sorted_features[:top_n]:
    print(f"Feature: {feature}, Chi-squared Score: {score:.2f}")




Top features based on Chi-squared scores:
Feature: the, Chi-squared Score: 0.60
Feature: again, Chi-squared Score: 0.56
Feature: delayed, Chi-squared Score: 0.56
Feature: delivery, Chi-squared Score: 0.56
Feature: latedelivery, Chi-squared Score: 0.56
Feature: unacceptable, Chi-squared Score: 0.56
Feature: why, Chi-squared Score: 0.56
Feature: 30, Chi-squared Score: 0.42
Feature: and, Chi-squared Score: 0.42
Feature: badservice, Chi-squared Score: 0.42
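The stop word "the" ranks first above because, in this five-tweet sample, it happens to occur only in the three positive tweets. The sketch below illustrates the effect with the textbook 2x2 contingency form of the chi-squared statistic on term presence. The token sets are a hypothetical simplification of the tweets, and the scores are not expected to match the numbers above, since scikit-learn's `chi2` works on the summed TF-IDF values rather than presence counts.

```python
def chi_square(term, docs, labels, target_label):
    """Chi-squared statistic for term presence vs. class membership (2x2 table)."""
    n = len(docs)
    pairs = list(zip(docs, labels))
    a = sum(1 for doc, y in pairs if term in doc and y == target_label)      # present, in class
    b = sum(1 for doc, y in pairs if term in doc and y != target_label)      # present, other class
    c = sum(1 for doc, y in pairs if term not in doc and y == target_label)  # absent, in class
    d = sum(1 for doc, y in pairs if term not in doc and y != target_label)  # absent, other class
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # term occurs in every document or in none: no signal
    return n * (a * d - b * c) ** 2 / denominator

# Hypothetical, simplified token sets loosely modelled on the sample tweets.
docs = [
    {"the", "quick", "team"},
    {"terrible", "service"},
    {"the", "great", "update"},
    {"delayed", "delivery"},
    {"the", "team", "thank"},
]
labels = ["positive", "negative", "positive", "negative", "positive"]

vocabulary = sorted(set().union(*docs))
scores = {term: chi_square(term, docs, labels, "positive") for term in vocabulary}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for term, score in ranked[:5]:
    print(f"{term}: {score:.3f}")
```

With so few documents, any term that perfectly separates the classes gets the maximum score, whether or not it carries sentiment. Removing stop words before scoring or using a larger labelled sample are two common ways to reduce this artifact.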


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Using the text data from Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [9]:
pip install sentence_transformers



In [10]:
# Your code here (Please add comments in the code):

from sentence_transformers import SentenceTransformer, util
import numpy as np

# Initialize a Sentence-Transformers model (all-MiniLM-L6-v2, a compact BERT-style encoder)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Define a query
query = "Great customer service experience"

# Encode the query and the texts to get their embeddings
query_embedding = model.encode(query)
text_embeddings = model.encode(tweets)

# Calculate cosine similarities between the query and each of the text embeddings
similarities = util.pytorch_cos_sim(query_embedding, text_embeddings)[0]

# Convert similarities to numpy array for easier handling
similarities = similarities.numpy()

# Rank the texts based on similarities in descending order
sorted_indices = np.argsort(similarities)[::-1]

# Print the ranked texts with their similarity scores
print("Texts ranked by similarity to the query:")
for index in sorted_indices:
    print(f"Score: {similarities[index]:.4f}, Text: {tweets[index]}")





Texts ranked by similarity to the query:
Score: 0.6301, Text: Really impressed with the quick response from the support team! 😊 #CustomerService
Score: 0.4361, Text: Waited on hold for 30 minutes and still no help. Terrible service! 😠 #badservice
Score: 0.2414, Text: A big thank you to the team for resolving my issue so quickly! 👍
Score: 0.2178, Text: Why is my delivery delayed again? Unacceptable! #lateDelivery
Score: 0.2024, Text: The new update is fantastic! Great work.
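For reference, the cosine similarity used in the ranking above can be written out in a few lines of plain Python. This is a minimal sketch on hypothetical three-dimensional vectors, not the 384-dimensional embeddings produced by the model; the names `query_vec` and `doc_vecs` are illustrative only.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned
}

# Rank documents by similarity to the query, highest first.
ranked = sorted(
    ((cosine_similarity(query_vec, vec), name) for name, vec in doc_vecs.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: {score:.4f}")
```

Because the measure depends only on direction, `doc_a` scores as a perfect match even though it is twice as long as the query vector.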


In [11]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained model tokenizer (vocabulary) and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a text
def get_bert_embedding(text):
    # Encode text
    encoded_input = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    # Compute token embeddings
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Pool the outputs into a single mean vector
    embeddings = model_output.last_hidden_state.mean(dim=1)
    return embeddings


# Your query
query = "Great customer service experience"

# Get embeddings for the query and the tweets
query_embedding = get_bert_embedding(query)
tweet_embeddings = torch.stack([get_bert_embedding(tweet) for tweet in tweets])

# Since we're using sklearn's cosine_similarity, we need to convert the embeddings from tensors to numpy arrays
query_embedding_np = query_embedding.detach().numpy()
tweet_embeddings_np = tweet_embeddings.detach().numpy().squeeze()

# Calculate cosine similarities between query and tweets
cos_similarities = cosine_similarity(query_embedding_np, tweet_embeddings_np)[0]

# Rank tweets based on similarity
sorted_tweet_indices = cos_similarities.argsort()[::-1]

print("Ranking of tweets based on similarity to the query:")
for rank, index in enumerate(sorted_tweet_indices):
    print(f"{rank+1}. Tweet {index+1} (Similarity: {cos_similarities[index]:.4f})")
    print(f"   Text: {tweets[index]}")



Ranking of tweets based on similarity to the query:
1. Tweet 1 (Similarity: 0.6551)
   Text: Really impressed with the quick response from the support team! 😊 #CustomerService
2. Tweet 3 (Similarity: 0.6438)
   Text: The new update is fantastic! Great work.
3. Tweet 2 (Similarity: 0.6282)
   Text: Waited on hold for 30 minutes and still no help. Terrible service! 😠 #badservice
4. Tweet 5 (Similarity: 0.5505)
   Text: A big thank you to the team for resolving my issue so quickly! 👍
5. Tweet 4 (Similarity: 0.5275)
   Text: Why is my delivery delayed again? Unacceptable! #lateDelivery
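The cell above encodes one text at a time, so the plain mean over token positions involves no padding. If the texts were instead encoded as a padded batch, a plain mean would also average over the padding positions. A common remedy is to weight the mean by the attention mask; the sketch below shows the idea in plain Python on hypothetical two-dimensional token vectors (the function name `masked_mean_pool` is illustrative, not part of any library).

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

# Hypothetical sequence: two real tokens followed by two padding positions.
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
mask = [1, 1, 0, 0]

pooled = masked_mean_pool(tokens, mask)
plain = [sum(vec[i] for vec in tokens) / len(tokens) for i in range(2)]

print("masked mean:", pooled)
print("plain mean: ", plain)
```

The plain mean is pulled toward zero by the padding rows, while the masked mean reflects only the real tokens.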


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Working on extracting features from text data was highly educational, especially learning about tokenization, vectorization (TF-IDF, word embeddings), and NLP preprocessing techniques, which are crucial for analyzing text effectively.

The main challenges included dealing with the ambiguity of natural language, selecting the optimal features for model performance, and mastering advanced vectorization techniques for NLP applications.

This exercise is directly aligned with the field of NLP, emphasizing the importance of feature extraction in building models for text analysis and showcasing the practical application of NLP techniques in processing and interpreting textual data.


'''