## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write down your answer in as much detail as possible):

'''
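Please write your answer here:

Task: classify social media comments about a new product launch as positive, negative, or neutral. The features below are candidates for that sentiment classifier.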

Bag of Words (BoW): This feature entails quantifying the frequency of individual words or tokens present in the text. It aids in capturing the most frequently used terms associated with either positive or negative sentiments. For instance, favorable comments may encompass words such as "amazing," "love," or "great," while unfavorable comments may incorporate terms like "disappointed," "hate," or "problem."

TF-IDF (Term Frequency-Inverse Document Frequency): TF-IDF weights a word by how often it occurs in a comment and down-weights words that occur in many comments across the dataset. This surfaces distinctive terms rather than common filler words. For example, a high TF-IDF score for "defective" marks it as characteristic of that comment, and a classifier can then learn that such terms tend to accompany negative sentiment.

N-grams: N-grams capture sequences of words as opposed to individual words. Bi-grams (pairs of words) and tri-grams (groups of three words) assist the model in grasping the context in which words are utilized. For instance, the bi-gram "customer service" may signal a negative sentiment if frequently associated with complaints.

Emotion Lexicons: Emotion lexicons comprise lists of words linked to specific emotions. By cross-referencing words in the text with these lexicons, the model can discern emotional nuances. For example, the presence of words like "happy," "joy," or "excitement" may indicate a positive sentiment.

Part-of-Speech (POS) Tags: Tagging each word with its grammatical category shows which word classes dominate a comment. Adjectives and adverbs (e.g., "amazing," "terrible," "really") tend to carry most of the opinion, while nouns such as "product" mostly name what is being discussed, so counts or ratios of these tags can serve as features. Verbs like "break" can also hint at complaints.

Sentiment Lexicons: Sentiment lexicons consist of words assigned known sentiment values (e.g., SentiWordNet). Assigning sentiment scores to words and aggregating them can help gauge the overall sentiment of a comment. For example, calculating the sum of positive scores minus the sum of negative scores can reveal sentiment polarity.

Punctuation and Emoji Analysis: Examination of punctuation marks (e.g., exclamation points, question marks) and emojis can yield additional clues about sentiment. An excessive use of exclamation points may signify enthusiasm, while negative emojis like 😡 can indicate dissatisfaction.

Sentence Length: The length of a sentence can at times reflect sentiment. Short, succinct sentences often convey strong sentiments, while longer sentences may exhibit greater neutrality or analytical content.

Named Entity Recognition (NER): Identifying named entities (e.g., product names, company names) in the text allows for attributing sentiments to specific entities within a comment, offering more nuanced insights.

Contextual Word Embeddings: Pre-trained embeddings capture semantic relationships between words and phrases. Word2Vec and GloVe assign each word a single static vector, while BERT produces contextual embeddings that change with the surrounding words, which helps the model handle context more effectively.

Combining these features lets a machine learning model classify social media comments as positive, negative, or neutral, which gives insight into how the public perceives a new product launch.



'''

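The sentiment-lexicon feature described in the answer above (sum of positive scores minus sum of negative scores) can be sketched in a few lines. The `LEXICON` dictionary and its scores are made up for illustration, and `lexicon_polarity` is a hypothetical helper name; a real run would load word scores from a resource such as SentiWordNet.

```python
import re

# Hypothetical toy lexicon: word -> (positive score, negative score).
# These values are illustrative only and do not come from SentiWordNet.
LEXICON = {
    "love": (0.75, 0.0),
    "amazing": (0.75, 0.0),
    "great": (0.5, 0.0),
    "terrible": (0.0, 0.75),
    "disappointed": (0.0, 0.5),
}

def lexicon_polarity(text, lexicon=LEXICON):
    """Return sum(positive scores) - sum(negative scores) over known words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    pos = sum(lexicon[t][0] for t in tokens if t in lexicon)
    neg = sum(lexicon[t][1] for t in tokens if t in lexicon)
    return pos - neg

print(lexicon_polarity("I love this product, it works great."))  # 1.25
print(lexicon_polarity("The service was terrible."))             # -0.75
```

A score above zero suggests positive polarity and a score below zero suggests negative polarity; words missing from the lexicon contribute nothing.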

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [2]:
# Your code here (Please add comments in the code):

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.util import ngrams
from nltk.sentiment.util import mark_negation
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords
from nltk import pos_tag, ne_chunk
import string
import emoji
from collections import Counter
import spacy

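nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('stopwords')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# pos_tag() also needs the tagger model; download it if it is not already installed
nltk.download('averaged_perceptron_tagger')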

# Sample text data
sample_text = "I absolutely love this product! It's amazing and works great. However, the customer service was terrible. I'm really disappointed. 😡"

# Feature 1: Bag of Words (BoW)
def bow_features(text):
    words = word_tokenize(text.lower())
    return Counter(words)

# Feature 2: TF-IDF (Term Frequency-Inverse Document Frequency) - Requires a document corpus
# Not implemented here as it requires a larger dataset and a TF-IDF vectorizer

# Feature 3: N-grams
def ngram_features(text, n=2):
    words = word_tokenize(text.lower())
    ngrams_list = list(ngrams(words, n))
    return [' '.join(ngram) for ngram in ngrams_list]

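# Feature 4: Emotion Lexicons
# Note: VADER is a sentiment (polarity) lexicon rather than an emotion lexicon,
# so this returns neg/neu/pos/compound scores, not per-emotion counts.
def emotion_lexicon_features(text):
    sid = SentimentIntensityAnalyzer()
    sentiment_scores = sid.polarity_scores(text)
    return sentiment_scores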

# Feature 5: Part-of-Speech (POS) Tags
def pos_tag_features(text):
    words = word_tokenize(text)
    pos_tags = pos_tag(words)
    return [tag for _, tag in pos_tags]

# Feature 6: Sentiment Lexicons - Requires a lexicon (e.g., SentiWordNet)
# Not implemented separately here; the VADER scores computed for Feature 4
# already come from a sentiment lexicon.

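# Feature 7: Punctuation and Emoji Analysis
# (The call further down is left disabled, so the recorded output is unchanged.)
def punctuation_emoji_features(text):
    # Count each punctuation character separately
    punctuation_count = Counter(char for char in text if char in string.punctuation)
    # emoji.emoji_count() already returns an integer, so no len() is needed
    emoji_count = emoji.emoji_count(text)

    features = {
        'punctuation_count': punctuation_count,
        'emoji_count': emoji_count
    }

    return features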

# Feature 8: Sentence Length
def sentence_length_feature(text):
    sentences = sent_tokenize(text)
    return [len(sentence.split()) for sentence in sentences]

# Feature 9: Named Entity Recognition (NER)
def ner_features(text):
    words = word_tokenize(text)
    tagged = pos_tag(words)
    named_entities = ne_chunk(tagged)
    return [chunk for chunk in named_entities if hasattr(chunk, 'label')]

# Feature 10: Contextual Word Embeddings - Requires spaCy or other word embedding model
# Not implemented here as it requires a pre-trained model

# Sample feature extraction
bow = bow_features(sample_text)
ngrams_2 = ngram_features(sample_text, n=2)
emotion_lexicon = emotion_lexicon_features(sample_text)
pos_tags = pos_tag_features(sample_text)
# Disabled: punctuation_emoji_features returns a dict, so read its keys instead of unpacking
# punctuation_emoji = punctuation_emoji_features(sample_text)
sentence_lengths = sentence_length_feature(sample_text)
named_entities = ner_features(sample_text)

print("Bag of Words (BoW):", bow)
print("Bi-grams:", ngrams_2)
print("Emotion Lexicons:", emotion_lexicon)
print("POS Tags:", pos_tags)
# print("Punctuation Count:", punctuation_emoji['punctuation_count'])
# print("Emoji Count:", punctuation_emoji['emoji_count'])
print("Sentence Lengths:", sentence_lengths)
print("Named Entities:", named_entities)



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Package words is already up-to-date!


Bag of Words (BoW): Counter({'.': 3, 'i': 2, 'absolutely': 1, 'love': 1, 'this': 1, 'product': 1, '!': 1, 'it': 1, "'s": 1, 'amazing': 1, 'and': 1, 'works': 1, 'great': 1, 'however': 1, ',': 1, 'the': 1, 'customer': 1, 'service': 1, 'was': 1, 'terrible': 1, "'m": 1, 'really': 1, 'disappointed': 1, '😡': 1})
Bi-grams: ['i absolutely', 'absolutely love', 'love this', 'this product', 'product !', '! it', "it 's", "'s amazing", 'amazing and', 'and works', 'works great', 'great .', '. however', 'however ,', ', the', 'the customer', 'customer service', 'service was', 'was terrible', 'terrible .', '. i', "i 'm", "'m really", 'really disappointed', 'disappointed .', '. 😡']
Emotion Lexicons: {'neg': 0.202, 'neu': 0.404, 'pos': 0.394, 'compound': 0.8016}
POS Tags: ['PRP', 'RB', 'VBP', 'DT', 'NN', '.', 'PRP', 'VBZ', 'JJ', 'CC', 'VBZ', 'JJ', '.', 'RB', ',', 'DT', 'NN', 'NN', 'VBD', 'JJ', '.', 'PRP', 'VBP', 'RB', 'JJ', '.', 'NN']
Sentence Lengths: [5, 5, 6, 3, 1]
Named Entities: []
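Feature 2 (TF-IDF) is skipped in the cell above because it needs more than one document. As a minimal sketch of the idea, the block below computes the textbook weight `tf * log(N / df)` in plain Python over a three-comment corpus. The corpus and the helper names `tokenize` and `tfidf` are made up for illustration, and the numbers will differ from scikit-learn's `TfidfVectorizer`, which by default uses a smoothed IDF and L2 normalization.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf(corpus):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical mini-corpus, loosely modeled on the sample comment
corpus = [
    "I absolutely love this product",
    "This product is defective",
    "The product works great",
]
for doc_weights in tfidf(corpus):
    top = sorted(doc_weights.items(), key=lambda kv: kv[1], reverse=True)[:3]
    print(top)
```

Because "product" occurs in every comment, its weight is zero, while a word such as "defective" that occurs in only one comment gets the highest weight in that comment. With a single document every term has `df == N`, so this formula gives all terms a weight of zero.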


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [3]:
# Your code here (Please add comments in the code):

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text data
sample_text = "I absolutely love this product! It's amazing and works great. However, the customer service was terrible. I'm really disappointed. 😡"

# Create a list of texts (only one text in this case)
texts = [sample_text]

# Initialize the TF-IDF vectorizer
tfidf_vectorizer = TfidfVectorizer()

# Fit and transform the text data
tfidf_matrix = tfidf_vectorizer.fit_transform(texts)

# Get the feature names (words or n-grams)
feature_names = tfidf_vectorizer.get_feature_names_out()

# Get the TF-IDF scores for each feature
tfidf_scores = tfidf_matrix.toarray()[0]

# Create a dictionary to store feature names and their TF-IDF scores
feature_tfidf_dict = dict(zip(feature_names, tfidf_scores))

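# Rank features based on their TF-IDF scores in descending order.
# Note: with only one document, every word occurs once and shares the same IDF,
# so all 17 scores tie at 1/sqrt(17) (about 0.2425) and the order stays alphabetical.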
sorted_features = sorted(feature_tfidf_dict.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
for feature, tfidf_score in sorted_features:
    print(f"{feature}: {tfidf_score}")




absolutely: 0.24253562503633297
amazing: 0.24253562503633297
and: 0.24253562503633297
customer: 0.24253562503633297
disappointed: 0.24253562503633297
great: 0.24253562503633297
however: 0.24253562503633297
it: 0.24253562503633297
love: 0.24253562503633297
product: 0.24253562503633297
really: 0.24253562503633297
service: 0.24253562503633297
terrible: 0.24253562503633297
the: 0.24253562503633297
this: 0.24253562503633297
was: 0.24253562503633297
works: 0.24253562503633297
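Since every score above is identical, the ranking does not separate the features. A filter method that uses class labels can do so. The block below is a minimal sketch of the chi-square statistic computed from a 2x2 term/class contingency table, assuming chi-square is one of the methods accepted for this question. The four labeled comments and the helper names are made up for illustration, and the statistic is computed on term presence per document, which is not the same calculation as scikit-learn's `chi2`.

```python
import re

def tokenize(text):
    """Return the set of distinct lowercase terms in a document."""
    return set(re.findall(r"[a-z']+", text.lower()))

def chi_square_scores(docs, labels, target="pos"):
    """Chi-square of each term against the target class (term presence only)."""
    n = len(docs)
    doc_terms = [tokenize(d) for d in docs]
    n_target = sum(1 for y in labels if y == target)
    vocab = set().union(*doc_terms)
    scores = {}
    for term in vocab:
        # a: target docs with the term, b: other docs with the term
        a = sum(1 for terms, y in zip(doc_terms, labels) if term in terms and y == target)
        b = sum(1 for terms, y in zip(doc_terms, labels) if term in terms and y != target)
        # c: target docs without the term, d: other docs without the term
        c = n_target - a
        d = (n - n_target) - b
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Hypothetical labeled comments (labels assigned by hand for this sketch)
docs = [
    "I love this product, it works great",
    "Great product, I love it",
    "Terrible service, I am disappointed",
    "This product is terrible",
]
labels = ["pos", "pos", "neg", "neg"]

scores = chi_square_scores(docs, labels)
# Rank terms by score, descending; ties are broken alphabetically
for term, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
    print(f"{term}: {score:.3f}")
```

Terms that appear only in one class, such as "love" or "terrible", get the highest score, while "this", which appears once in each class, scores zero. With a corpus this small, function words like "it" can also score highly by chance, so the ranking is only meaningful on a larger labeled dataset.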


Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [4]:
# Your code here (Please add comments in the code):

import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

# Sample text data
text_data = [
    "I absolutely love this product! It's amazing and works great.",
    "The customer service was terrible. I'm really disappointed.",
    "This product is just average, nothing special.",
    "I have never seen such a terrible product before.",
]

# Query to match against the text data
query = "I'm looking for a great product. Can you recommend one?"

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode the query and text data
query_tokens = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
text_tokens = tokenizer(text_data, padding=True, truncation=True, return_tensors="pt")

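# Compute BERT embeddings for the query and text data by mean-pooling token vectors.
# Note: .mean(dim=1) also averages over the [PAD] positions of shorter texts in the batch.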
with torch.no_grad():
    query_embedding = model(**query_tokens).last_hidden_state.mean(dim=1)
    text_embeddings = model(**text_tokens).last_hidden_state.mean(dim=1)

# Calculate cosine similarity between the query and each text
cosine_similarities = cosine_similarity(query_embedding, text_embeddings).flatten()

# Rank documents based on similarity in descending order
sorted_indices = np.argsort(cosine_similarities)[::-1]
sorted_text_data = [text_data[i] for i in sorted_indices]
sorted_similarity_scores = [cosine_similarities[i] for i in sorted_indices]

# Print ranked results
print("Query:", query)
for i, (text, score) in enumerate(zip(sorted_text_data, sorted_similarity_scores), start=1):
    print(f"Rank {i}: Similarity Score: {score:.4f}\n{text}\n")





Query: I'm looking for a great product. Can you recommend one?
Rank 1: Similarity Score: 0.7520
I absolutely love this product! It's amazing and works great.

Rank 2: Similarity Score: 0.7400
The customer service was terrible. I'm really disappointed.

Rank 3: Similarity Score: 0.6768
This product is just average, nothing special.

Rank 4: Similarity Score: 0.6190
I have never seen such a terrible product before.
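One detail worth checking in the ranking above: with `padding=True`, shorter texts in the batch are padded, and `last_hidden_state.mean(dim=1)` averages over those padding positions as well. A common alternative is to average only the positions where the attention mask is 1. The sketch below illustrates the idea on toy nested lists rather than real BERT outputs, so it runs without `torch`; the helper names and the 2-dimensional vectors are made up for illustration.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions whose mask value is 0."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, cosine similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy "token embeddings"; the last position of the first document is padding
doc_a = masked_mean([[1.0, 0.0], [1.0, 0.0], [0.0, 5.0]], [1, 1, 0])
doc_b = masked_mean([[0.0, 1.0], [0.0, 1.0]], [1, 1])
query = [1.0, 0.0]

print(rank_by_similarity(query, [doc_b, doc_a]))
```

Without the mask, the padding vector `[0.0, 5.0]` would pull `doc_a` away from the query. In the notebook's own code, the mask is available as `text_tokens["attention_mask"]`, so the same weighting could be applied to `last_hidden_state` before averaging; the scores would then likely differ from the ones recorded above.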

