# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [96]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Task: sentiment classification of customer product reviews (positive, negative, or neutral). Potential features for building the model:

1. Bag of Words (BoW) Model:
Explanation: Representing each review as a vector of word frequencies provides insight into the most important words used by customers.
Why it's helpful: High-frequency words in positive reviews might include terms like "love," "excellent," or "amazing," while negative reviews may contain words like "disappointing," "problems," or "poor."

2. TF-IDF (Term Frequency-Inverse Document Frequency) Features:
Explanation: Similar to BoW, but with TF-IDF weights, emphasizing the importance of words rare in the entire dataset but frequent in a specific review.
Why it's helpful: It helps identify words that are distinctive to certain reviews, providing context on specific aspects of products that customers find noteworthy.

3. Emoticons and Emoji Features:
Explanation: Extracting emoticons and emojis can capture the emotional tone of the review.
Why it's helpful: Emoticons like 😊 or emojis like 🎉 often indicate positive sentiments, while 😡 or 😞 may signify negative sentiments.

4. N-grams (2-grams) Features:
Explanation: Considering pairs of consecutive words (2-grams) helps capture phrases and expressions.
Why it's helpful: Some sentiments may be conveyed through specific phrases, and capturing 2-grams can provide context that single words may miss.

5. Product-specific Lexicon Features:
Explanation: Counting occurrences of product-related words in sentiment lexicons (e.g., "quality," "service," "price").
Why it's helpful: Understanding sentiments related to specific product aspects allows businesses to pinpoint areas for improvement or highlight strengths.

6. Part-of-Speech (POS) Tags Features:
Explanation: Tagging words with their parts of speech provides insights into the grammatical structure of reviews.
Why it's helpful: Identifying adjectives in positive reviews (e.g., "great," "awesome") or negative reviews (e.g., "bad," "poor") helps understand the linguistic patterns associated with sentiments.

'''


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [97]:
# Importing required libraries
import nltk
import pandas as pd
import emoji
import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text data for sentiment analysis
sample_text = [
    "Absolutely enamored with this product! It's a game-changer.",
    "I despise this, it's a complete disappointment.",
    "Sharing a neutral opinion with no strong sentiment attached.",
    "😊 Exceptional experience with their customer service! 😃",
    "I'm appalled at how dreadful this is. 😡",
    "Just received my order and it's perfect! 🎉",
    "This new feature is amazing! Can't wait to explore it. 🔥",
    "Feeling frustrated with the app's performance today. 😤",
    "Spent the day outdoors, feeling blissful and content. 🌞😌",
]

# Downloading the required NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')

# setting up the stopwords
stop_words = set(stopwords.words('english'))

# Function to preprocess the input text: lowercase conversion, tokenization, and stopword removal
def custom_preprocess(input_text):
    processed_text = input_text.lower()
    tokenized_text = word_tokenize(processed_text)
    filtered_tokens = [word for word in tokenized_text if word not in stop_words]
    return ' '.join(filtered_tokens)

# Apply the custom preprocessing to the sample data
custom_processed_text = [custom_preprocess(text) for text in sample_text]

# 1. Bag of Words (BoW) Model
word_count_vectorizer = CountVectorizer()
word_count_features = word_count_vectorizer.fit_transform(custom_processed_text)
word_count_df = pd.DataFrame(word_count_features.toarray(), columns=word_count_vectorizer.get_feature_names_out())
print("\nWord Count Features:")
print(word_count_df)


Word Count Features:
   absolutely  amazing  app  appalled  attached  blissful  ca  changer  \
0           1        0    0         0         0         0   0        1   
1           0        0    0         0         0         0   0        0   
2           0        0    0         0         1         0   0        0   
3           0        0    0         0         0         0   0        0   
4           0        0    0         1         0         0   0        0   
5           0        0    0         0         0         0   0        0   
6           0        1    0         0         0         0   1        0   
7           0        0    1         0         0         0   0        0   
8           0        0    0         0         0         1   0        0   

   complete  content  ...  performance  product  received  sentiment  service  \
0         0        0  ...            0        1         0          0        0   
1         1        0  ...            0        0         0          0       

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [98]:
# 2. TF-IDF Model
new_tfidf_model = TfidfVectorizer()
new_tfidf_data = new_tfidf_model.fit_transform(custom_processed_text)
new_tfidf_dataframe = pd.DataFrame(new_tfidf_data.toarray(), columns=new_tfidf_model.get_feature_names_out())
print("\nNew TF-IDF Features:")
print(new_tfidf_dataframe)


New TF-IDF Features:
   absolutely   amazing       app  appalled  attached  blissful        ca  \
0    0.447214  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
1    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
2    0.000000  0.000000  0.000000  0.000000  0.408248  0.000000  0.000000   
3    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
4    0.000000  0.000000  0.000000  0.707107  0.000000  0.000000  0.000000   
5    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000   
6    0.000000  0.408248  0.000000  0.000000  0.000000  0.000000  0.408248   
7    0.000000  0.000000  0.460611  0.000000  0.000000  0.000000  0.000000   
8    0.000000  0.000000  0.000000  0.000000  0.000000  0.418363  0.000000   

    changer  complete   content  ...  performance   product  received  \
0  0.447214   0.00000  0.000000  ...     0.000000  0.447214   0.00000   
1  0.000000   0.57735  0.000000  ...     0.000000  0.000000  
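
To make the weights in the table above easier to interpret, here is a minimal pure-Python sketch of the computation. It assumes scikit-learn's default settings as documented (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row). The function name `tfidf_rows` and the two toy documents are illustrative and are not part of the notebook.

```python
import math
from collections import Counter

def tfidf_rows(tokenized_docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    rows = []
    for doc in tokenized_docs:
        term_freq = Counter(doc)
        # Raw weight = term frequency * smoothed inverse document frequency
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        }
        # L2-normalize the row so that its squared weights sum to 1
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Toy documents (hypothetical, already tokenized and stop-word filtered)
toy_docs = [["appalled", "dreadful"], ["complete", "disappointment", "despise"]]
toy_weights = tfidf_rows(toy_docs)
print(toy_weights[0])
```

A document with two equally rare terms gives each a weight of 1/sqrt(2), about 0.707, which agrees with the value shown for `appalled` in row 4 of the table.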

In [None]:
# 3. Emoticons and Emoji Analysis
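def emojis(text):
    # Cover the main emoji blocks, not only emoticons (U+1F600-U+1F64F), so 🎉, 🔥 and 🌞 also match
    emoji_pattern = re.compile("[\U0001F300-\U0001F6FF\U0001F900-\U0001F9FF\u2600-\u27BF]+")
    return emoji_pattern.findall(text)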
emoji_features = [emojis(text) for text in sample_text]
emoji_df = pd.DataFrame({'Emoji Features': emoji_features})
print("\nEmoticon and Emoji Features:")
print(emoji_df)
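
Lists of extracted emojis are not yet numeric features. One possible way to turn them into model inputs is to count positive and negative emojis per text. The sketch below uses only the standard library; the polarity sets and the name `emoji_polarity_counts` are assumptions for illustration, seeded from the emojis mentioned in the Question 1 answer and the sample texts.

```python
from collections import Counter

# Hypothetical polarity sets (an assumption, not a published lexicon)
POSITIVE_EMOJIS = {"😊", "😃", "🎉"}
NEGATIVE_EMOJIS = {"😡", "😞", "😤"}

def emoji_polarity_counts(text):
    """Count positive and negative emojis in a text by scanning its characters."""
    counts = Counter()
    for char in text:
        if char in POSITIVE_EMOJIS:
            counts["positive"] += 1
        elif char in NEGATIVE_EMOJIS:
            counts["negative"] += 1
    return {"positive": counts["positive"], "negative": counts["negative"]}

example_counts = emoji_polarity_counts("😊 Exceptional experience with their customer service! 😃")
print(example_counts)
```

Each text then contributes two integer columns that can be concatenated with the other feature matrices.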

In [45]:
# 4. N-grams (2-grams) Model
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))
ngram_features = ngram_vectorizer.fit_transform(custom_processed_text)
ngram_df = pd.DataFrame(ngram_features.toarray(), columns=ngram_vectorizer.get_feature_names_out())
print("\n2-gram Features:")
print(ngram_df)


2-gram Features:
   absolutely enamored  amazing ca  app performance  appalled dreadful  \
0                    1           0                0                  0   
1                    0           0                0                  0   
2                    0           0                0                  0   
3                    0           0                0                  0   
4                    0           0                0                  1   
5                    0           0                0                  0   
6                    0           1                0                  0   
7                    0           0                1                  0   
8                    0           0                0                  0   

   blissful content  ca wait  complete disappointment  customer service  \
0                 0        0                        0                 0   
1                 0        0                        1                 0   
2               
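
The column names above are pairs of adjacent tokens. A minimal standard-library sketch of the same idea follows, assuming plain whitespace tokenization (`CountVectorizer` additionally drops punctuation and single-character tokens, so its output can differ). The helper name `bigram_counts` and the example string are illustrative.

```python
from collections import Counter

def bigram_counts(text):
    """Count adjacent word pairs in a whitespace-tokenized text."""
    tokens = text.split()
    # zip(tokens, tokens[1:]) pairs every token with its right-hand neighbour
    return Counter(" ".join(pair) for pair in zip(tokens, tokens[1:]))

example_bigrams = bigram_counts("complete disappointment complete disappointment today")
print(example_bigrams)
```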

In [99]:
# 5. Product-specific Lexicon Features

pos_lexicon = ["enamored", "game-changer", "exceptional"]
neg_lexicon = ["despise", "dreadful", "disappointment"]
# Function to count occurrences of words
def count_words(text, lexicon):
    words = text.split()
    count = sum(1 for word in words if word in lexicon)
    return count
pos_lexicon_features = [count_words(text, pos_lexicon) for text in custom_processed_text]
neg_lexicon_features = [count_words(text, neg_lexicon) for text in custom_processed_text]
lexicon_df = pd.DataFrame({'Positive Lexicon Features': pos_lexicon_features,
                           'Negative Lexicon Features': neg_lexicon_features})
print("\nSentiment Lexicon Features:")
print(lexicon_df)


Sentiment Lexicon Features:
   Positive Lexicon Features  Negative Lexicon Features
0                          2                          0
1                          0                          2
2                          0                          0
3                          1                          0
4                          0                          1
5                          0                          0
6                          0                          0
7                          0                          0
8                          0                          0
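
The Question 1 answer also describes counting product-aspect terms such as "quality", "service", and "price". A small sketch of that variant is shown below; the lexicon contents and the name `aspect_counts` are assumptions for illustration only.

```python
import re

# Hypothetical aspect lexicon; the terms are the examples named in the Question 1 answer
ASPECT_LEXICON = {"quality", "service", "price"}

def aspect_counts(text):
    """Count how often each aspect term appears in a text (case-insensitive)."""
    # Keep alphabetic runs only, so trailing punctuation does not block a match
    tokens = re.findall(r"[a-z]+", text.lower())
    return {aspect: tokens.count(aspect) for aspect in sorted(ASPECT_LEXICON)}

example_aspects = aspect_counts("😊 Exceptional experience with their customer service! 😃")
print(example_aspects)
```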


In [100]:
# 6. Part-of-Speech (POS) Tags Features
pos_tags = [nltk.pos_tag(word_tokenize(text)) for text in custom_processed_text]
print("\nPart-of-Speech (POS) Tags:")
for tags in pos_tags:
    print(tags)


Part-of-Speech (POS) Tags:
[('absolutely', 'RB'), ('enamored', 'JJ'), ('product', 'NN'), ('!', '.'), ("'s", 'POS'), ('game-changer', 'NN'), ('.', '.')]
[('despise', 'NN'), (',', ','), ("'s", 'POS'), ('complete', 'JJ'), ('disappointment', 'NN'), ('.', '.')]
[('sharing', 'VBG'), ('neutral', 'JJ'), ('opinion', 'NN'), ('strong', 'JJ'), ('sentiment', 'NN'), ('attached', 'VBN'), ('.', '.')]
[('😊', 'JJ'), ('exceptional', 'JJ'), ('experience', 'NN'), ('customer', 'NN'), ('service', 'NN'), ('!', '.'), ('😃', 'NN')]
[("'m", 'VBP'), ('appalled', 'JJ'), ('dreadful', 'NN'), ('.', '.'), ('😡', 'NN')]
[('received', 'VBN'), ('order', 'NN'), ("'s", 'POS'), ('perfect', 'NN'), ('!', '.'), ('🎉', 'NN')]
[('new', 'JJ'), ('feature', 'NN'), ('amazing', 'NN'), ('!', '.'), ('ca', 'MD'), ("n't", 'RB'), ('wait', 'VB'), ('explore', 'RB'), ('.', '.'), ('🔥', 'VB')]
[('feeling', 'VBG'), ('frustrated', 'VBD'), ('app', 'NN'), ("'s", 'POS'), ('performance', 'NN'), ('today', 'NN'), ('.', '.'), ('😤', 'NN')]
[('spent', 'JJ'
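
Tagged tokens still need to be converted into numbers before a classifier can use them. One common option, sketched here under the assumption that coarse tag counts are sufficient, is to count how many tokens fall into each Penn Treebank tag family. The helper `pos_tag_counts` and its default tag prefixes are illustrative choices.

```python
from collections import Counter

def pos_tag_counts(tagged_tokens, tags_of_interest=("JJ", "NN", "RB", "VB")):
    """Count tokens whose tag starts with each prefix of interest."""
    counts = Counter()
    for _, tag in tagged_tokens:
        for prefix in tags_of_interest:
            if tag.startswith(prefix):
                counts[prefix] += 1
                break
    return {prefix: counts[prefix] for prefix in tags_of_interest}

# Tagged tokens copied from the first row of the output above
tagged_example = [('absolutely', 'RB'), ('enamored', 'JJ'), ('product', 'NN'), ('!', '.'),
                  ("'s", 'POS'), ('game-changer', 'NN'), ('.', '.')]
pos_features = pos_tag_counts(tagged_example)
print(pos_features)
```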

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [109]:
# Imports
from sklearn.feature_selection import SelectKBest, mutual_info_classif
import numpy as np

# Sample texts
texts = [
    "I love this product! It's amazing.",
    "This is terrible, I hate it!",
    "Neutral comment with no strong feelings.",
    "😊 Great experience with their customer service! 😃",
    "I can't believe how bad this is. 😡",
    "Just received my order and it's perfect! 🎉",
    "This new feature is amazing! Can't wait to explore it. 🔥",
    "Feeling frustrated with the app's performance today. 😤",
    "Spent the day outdoors, feeling blissful and content. 🌞😌",
]

# Labels
labels = [1, 0, 2, 1, 0, 1, 1, 0, 2]

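# Preprocess texts (custom_preprocess is defined in the Question 2 cell)
preprocessed_texts = [custom_preprocess(text) for text in texts]

# Define the emoji vocabulary (named emoji_vocab to avoid shadowing the emojis() function from Question 2)
emoji_vocab = ['😊','😃','😡','🎉','🔥','😤','🌞','😌']

# Binary emoji features: 1 if the emoji occurs in the text, else 0
emoji_features = [[1 if symbol in text else 0 for symbol in emoji_vocab] for text in texts]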

# Combine features (word_count_features was fitted on sample_text in Question 2; the emoji columns come from texts)
feature_matrix = np.concatenate((word_count_features.toarray(), emoji_features), axis=1)

# Select features
selector = SelectKBest(score_func=mutual_info_classif, k=5)
selector.fit(feature_matrix, labels)

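# Sort all feature indices by mutual-information score, highest first
selected_indices = np.argsort(selector.scores_)[::-1]

# Label each feature by its 1-based column position in feature_matrix
ranked_features = [f"Feature {i+1}" for i in selected_indices]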

# Print ranked features
print("Ranked Features:")
for i, feature in enumerate(ranked_features):
  print(f"{i+1}: {feature}")

Ranked Features:
1: Feature 22
2: Feature 12
3: Feature 6
4: Feature 17
5: Feature 35
6: Feature 29
7: Feature 25
8: Feature 36
9: Feature 46
10: Feature 20
11: Feature 5
12: Feature 21
13: Feature 30
14: Feature 31
15: Feature 11
16: Feature 33
17: Feature 23
18: Feature 14
19: Feature 15
20: Feature 47
21: Feature 13
22: Feature 10
23: Feature 9
24: Feature 8
25: Feature 7
26: Feature 4
27: Feature 3
28: Feature 2
29: Feature 16
30: Feature 24
31: Feature 18
32: Feature 39
33: Feature 45
34: Feature 44
35: Feature 43
36: Feature 42
37: Feature 41
38: Feature 40
39: Feature 38
40: Feature 19
41: Feature 37
42: Feature 34
43: Feature 32
44: Feature 28
45: Feature 27
46: Feature 26
47: Feature 1
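
Labels such as "Feature 22" are hard to read, so it can help to rank named columns instead. The sketch below computes mutual information with the exact discrete formula, I(X;Y) = sum over (x, y) of p(x,y) * ln(p(x,y) / (p(x) p(y))), using only the standard library. As far as the scikit-learn documentation describes, `mutual_info_classif` treats dense input as continuous by default and uses a nearest-neighbour estimator, so its scores will not match these values. The three feature columns are illustrative indicators derived by hand from `texts`, and `mutual_information` is a hypothetical helper.

```python
import math
from collections import Counter

def mutual_information(feature, labels):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(labels)
    joint = Counter(zip(feature, labels))
    feature_counts = Counter(feature)
    label_counts = Counter(labels)
    # p(x,y) / (p(x) * p(y)) simplifies to c * n / (count_x * count_y)
    return sum(
        (c / n) * math.log(c * n / (feature_counts[x] * label_counts[y]))
        for (x, y), c in joint.items()
    )

labels = [1, 0, 2, 1, 0, 1, 1, 0, 2]
# Illustrative binary columns: does the word or emoji occur in each text?
named_features = {
    "amazing": [1, 0, 0, 0, 0, 0, 1, 0, 0],
    "feeling": [0, 0, 0, 0, 0, 0, 0, 1, 1],
    "😡": [0, 0, 0, 0, 1, 0, 0, 0, 0],
}
ranking = sorted(
    ((name, mutual_information(col, labels)) for name, col in named_features.items()),
    key=lambda item: item[1],
    reverse=True,
)
for rank, (name, score) in enumerate(ranking, start=1):
    print(f"{rank}: {name} ({score:.3f})")
```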


## Question 4 (10 points):
Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [110]:
from sentence_transformers import SentenceTransformer, util
import pandas as pd

# Sample texts
sample_text = [
    "Absolutely enamored with this product! It's a game-changer.",
    "I despise this, it's a complete disappointment.",
    "Sharing a neutral opinion with no strong sentiment attached.",
    "😊 Exceptional experience with their customer service! 😃",
    "I'm appalled at how dreadful this is. 😡",
    "Just received my order and it's perfect! 🎉",
    "This new feature is amazing! Can't wait to explore it. 🔥",
    "Feeling frustrated with the app's performance today. 😤",
    "Spent the day outdoors, feeling blissful and content. 🌞😌",
]
# Query
query_text = "Searching for a high-quality product with great service."

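# Load a BERT-based sentence-embedding model (MiniLM)
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')

# Get embeddings
# Note: `texts` is the list defined in the Question 3 cell, so the ranking below reflects those texts rather than `sample_text`
query_embed = model.encode(query_text, convert_to_tensor=True)
text_embed = model.encode(texts, convert_to_tensor=True)

# Calculate cosine similarity between the query and every text
similarities = util.pytorch_cos_sim(query_embed, text_embed)[0]

# Create dataframe and rank (convert the tensor to a NumPy array so pandas stores plain floats)
df = pd.DataFrame({'Text': texts, 'Similarity': similarities.cpu().numpy()})
df = df.sort_values(by='Similarity', ascending=False)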
print("\nRanked Texts by Similarity:")
print(df)


Ranked Texts by Similarity:
                                                Text  Similarity
3  😊 Great experience with their customer service! 😃    0.584360
0                 I love this product! It's amazing.    0.468316
5         Just received my order and it's perfect! 🎉    0.358053
6  This new feature is amazing! Can't wait to exp...    0.257360
7  Feeling frustrated with the app's performance ...    0.208661
8  Spent the day outdoors, feeling blissful and c...    0.151887
4                 I can't believe how bad this is. 😡    0.094290
2           Neutral comment with no strong feelings.    0.081141
1                       This is terrible, I hate it!   -0.030733
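
The similarity column above is a cosine similarity: the dot product of two embedding vectors divided by the product of their lengths. The following standard-library sketch shows the computation and the descending ranking on toy two-dimensional vectors; the vectors and the helper names `cosine_similarity` and `rank_by_similarity` are illustrative, not real model embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for embeddings (hypothetical values)
toy_query = [1.0, 0.0]
toy_docs = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
toy_ranking = rank_by_similarity(toy_query, toy_docs)
print(toy_ranking)
```

Because cosine similarity ignores vector length, the document `[2.0, 0.0]` scores the same as an exact copy of the query would.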


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [112]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Learning Experience: Extracting features from text provided good exposure to techniques like bag-of-words, TF-IDF, n-grams and BERT embeddings. Understanding how to generate numerical representations from text was very beneficial.
Challenges: The time provided was not sufficient to fully explore implementing the techniques on real datasets.
Relevance to NLP: Feature extraction is a crucial first step in many NLP tasks. These exercises provided a solid intro to commonly used techniques for generating useful text features.
Overall, the exercises served as a good introduction to feature engineering for NLP. Hands-on work with real data would further improve learning. Feature extraction is an integral part of the NLP workflow.
'''
