# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [1]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''
An interesting text classification task is spam email classification:
deciding whether a given email is spam or not.
Useful features include:
1.  Bag of Words (BoW) or term frequency: captures words that occur frequently in spam but rarely in regular emails.
2.  Email metadata features: spam emails often come from unfamiliar or suspicious domains, and their subject lines may contain odd patterns.
    The sender’s email address and the subject line provide additional signals that differentiate legitimate emails from spam.
3.  Lexical features: patterns such as odd capitalization or unusual characters help detect manipulation.
4.  Punctuation and special character usage: captures symbols typical of the promotional or deceptive content often found in spam.
5.  Sentiment analysis: spam emails are often overly positive or aggressive, promoting products, offers, or schemes with exaggerated enthusiasm.
    A sentiment analysis model can help flag the overly positive or negative tone typical of spam content.
Combined, these features can support a robust classification model for spam detection.
Textual patterns (BoW, lexical features), metadata (sender and subject), sentiment,
and the structure of the email each contribute distinct signals that, when aggregated,
can improve the model’s ability to detect spam.


'''


## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
!pip install nltk scikit-learn spacy textstat vaderSentiment
!python -m spacy download en_core_web_sm



In [1]:
!pip install scikit-learn nltk pandas




In [2]:
import pandas as pd

# Sample emails
data = {
    'email': [
        "Hello, I hope you are doing well. Don't forget our meeting tomorrow.",
        "WIN BIG NOW!!! Click here to claim your $1,000 prize. Offer ends soon!!!",
        "Hi John, I wanted to follow up on our last conversation about the project proposal.",
        "Congratulations! You've been selected to receive a FREE gift. Click to redeem.",
        "This is a friendly reminder for your scheduled appointment at 2:30 PM.",
        "LIMITED OFFER: Buy 1 get 1 FREE! Visit our website now."
    ],
    'subject': [
        "Meeting Reminder",
        "WIN BIG NOW!!!",
        "Project Follow-up",
        "Congratulations! FREE Gift",
        "Appointment Reminder",
        "LIMITED OFFER"
    ],
    'sender': [
        "boss@company.com",
        "promo@spamdomain.com",
        "colleague@company.com",
        "offers@spamsite.com",
        "service@company.com",
        "marketing@spamworld.com"
    ]
}

# Load data into DataFrame
df = pd.DataFrame(data)
df


| | email | subject | sender |
|---|---|---|---|
| 0 | Hello, I hope you are doing well. Don't forget... | Meeting Reminder | boss@company.com |
| 1 | WIN BIG NOW!!! Click here to claim your $1,000... | WIN BIG NOW!!! | promo@spamdomain.com |
| 2 | Hi John, I wanted to follow up on our last con... | Project Follow-up | colleague@company.com |
| 3 | Congratulations! You've been selected to recei... | Congratulations! FREE Gift | offers@spamsite.com |
| 4 | This is a friendly reminder for your scheduled... | Appointment Reminder | service@company.com |
| 5 | LIMITED OFFER: Buy 1 get 1 FREE! Visit our web... | LIMITED OFFER | marketing@spamworld.com |


In [3]:
import re
import string
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# Download nltk resources
nltk.download('vader_lexicon')

# Initialize Sentiment Analyzer
sia = SentimentIntensityAnalyzer()

# Function to extract features
def extract_features(df):
    features = pd.DataFrame()

    # 1. Bag of Words (BoW)
    vectorizer = CountVectorizer(stop_words='english')
    bow_matrix = vectorizer.fit_transform(df['email'])
    bow_df = pd.DataFrame(bow_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    features = pd.concat([features, bow_df], axis=1)

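    # 2. TF-IDF
    # NOTE: `tfidf_features` is never returned, so the TF-IDF weights are not part
    # of the feature set this function outputs. The TF-IDF columns also share their
    # names with the BoW columns; merging them into `features` would need a
    # distinguishing prefix.
    tfidf_vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = tfidf_vectorizer.fit_transform(df['email'])
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
    tfidf_features = pd.concat([features, tfidf_df], axis=1)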

    # 3. Metadata (Sender Domain & Subject Length)
    # Written straight into `features` so the caller's DataFrame is not modified
    features['domain'] = df['sender'].apply(lambda x: x.split('@')[-1])
    features['subject_length'] = df['subject'].apply(len)

    # 4. Lexical Features (Avg Word Length & Capital Letters)
    def avg_word_length(text):
        words = text.split()
        return sum(len(word) for word in words) / len(words) if len(words) > 0 else 0

    features['avg_word_length'] = df['email'].apply(avg_word_length)
    features['capital_letters'] = df['email'].apply(lambda x: sum(1 for c in x if c.isupper()))

    # 5. Punctuation Features
    features['exclamation_marks'] = df['email'].apply(lambda x: x.count('!'))
    features['dollar_signs'] = df['email'].apply(lambda x: x.count('$'))

    # 6. Sentiment Analysis
    def get_sentiment(text):
        return sia.polarity_scores(text)['compound']

    features['sentiment'] = df['email'].apply(get_sentiment)

    # 7. Email Structure (Presence of Links)
    def contains_link(text):
        # 1 if the text contains an http:// or https:// URL prefix, else 0
        return 1 if re.search(r'https?://', text) else 0

    features['contains_link'] = df['email'].apply(contains_link)

    return features

# Extract features from sample data
extracted_features = extract_features(df)

# Display extracted features
extracted_features


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...


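| | 000 | 30 | appointment | big | buy | claim | click | congratulations | conversation | doing | ... | website | win | domain | subject_length | avg_word_length | capital_letters | exclamation_marks | dollar_signs | sentiment | contains_link |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | company.com | 16 | 4.75 | 3 | 0 | 0 | 0.6874 | 0 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | ... | 0 | 1 | spamdomain.com | 14 | 4.615385 | 11 | 6 | 1 | 0.875 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | company.com | 17 | 4.6 | 3 | 0 | 0 | 0.0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | spamsite.com | 26 | 5.583333 | 7 | 1 | 0 | 0.9027 | 0 |
| 4 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | company.com | 20 | 4.916667 | 3 | 0 | 0 | 0.4939 | 0 |
| 5 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 1 | 0 | spamworld.com | 13 | 4.090909 | 18 | 1 | 0 | 0.4003 | 0 |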


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank them by importance in descending order.

In [4]:
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import LabelEncoder

# Example target labels (spam = 1, not spam = 0) for demonstration
df['label'] = [0, 1, 0, 1, 0, 1]

# Extract features using the previously defined function
extracted_features = extract_features(df)

# Encode the target labels (0 = not spam, 1 = spam)
label_encoder = LabelEncoder()
y = label_encoder.fit_transform(df['label'])

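# Chi-Square requires non-negative numeric input: keep only the numeric columns
# (this drops the categorical 'domain' column) and take absolute values so that
# a negative sentiment score cannot break the test
extracted_features = extracted_features.select_dtypes(include=['float64', 'int64']).abs()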

# Apply Chi-Square feature selection
chi2_selector = SelectKBest(chi2, k='all')
chi2_selector.fit(extracted_features, y)

# Get Chi-Square scores and feature names
chi2_scores = chi2_selector.scores_
features = extracted_features.columns

# Create a DataFrame of features and their corresponding Chi-Square scores
feature_ranking = pd.DataFrame({'Feature': features, 'Chi-Square Score': chi2_scores})

# Sort features by Chi-Square scores in descending order
feature_ranking = feature_ranking.sort_values(by='Chi-Square Score', ascending=False).reset_index(drop=True)

# Display the ranked features
print(feature_ranking)


              Feature  Chi-Square Score
0     capital_letters         16.200000
1   exclamation_marks          8.000000
2               offer          2.000000
3               click          2.000000
4                free          2.000000
5            selected          1.000000
6               prize          1.000000
7             project          1.000000
8            proposal          1.000000
9             receive          1.000000
10             redeem          1.000000
11           reminder          1.000000
12          scheduled          1.000000
13           tomorrow          1.000000
14               soon          1.000000
15                 30          1.000000
16                 ve          1.000000
17              visit          1.000000
18             wanted          1.000000
19            website          1.000000
20                win          1.000000
21       dollar_signs          1.000000
22                 pm          1.000000
23                000          1.000000
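
To make the top scores easier to interpret, the sketch below recomputes the chi-square statistic by hand for two features. It assumes the statistic sums each feature per class and compares those sums with the totals expected if feature and class were independent; the helper name `chi2_score` is hypothetical. The feature values and labels are copied from the extracted-feature table and the label list above.

```python
def chi2_score(values, labels):
    """Chi-square statistic for one non-negative feature against class labels."""
    total = sum(values)            # feature mass over all documents
    n = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: feature mass that falls in this class.
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected: the share of the total mass this class would receive
        # if the feature were independent of the label.
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

labels = [0, 1, 0, 1, 0, 1]                 # 1 = spam, 0 = not spam
exclamation_marks = [0, 6, 0, 1, 0, 1]
capital_letters = [3, 11, 3, 7, 3, 18]

print(chi2_score(exclamation_marks, labels))   # 8.0
print(chi2_score(capital_letters, labels))     # approximately 16.2
```

Both values agree with the scores in the ranking above. With only six documents, a single email can dominate a feature's total, so the ordering is best read as illustrative.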


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [5]:
!pip install transformers torch scikit-learn




In [6]:
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings for a given text
def get_bert_embedding(text):
    # Tokenize the input text and convert to input IDs
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)

    # Pass the inputs through BERT to get the hidden states (last hidden layer)
    with torch.no_grad():
        outputs = model(**inputs)
        # Get the embeddings from the [CLS] token (first token in BERT)
        cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()

    return cls_embedding

# Text data (emails) from question 2
texts = [
    "Hello, I hope you are doing well. Don't forget our meeting tomorrow.",
    "WIN BIG NOW!!! Click here to claim your $1,000 prize. Offer ends soon!!!",
    "Hi John, I wanted to follow up on our last conversation about the project proposal.",
    "Congratulations! You've been selected to receive a FREE gift. Click to redeem.",
    "This is a friendly reminder for your scheduled appointment at 2:30 PM.",
    "LIMITED OFFER: Buy 1 get 1 FREE! Visit our website now."
]

# Query text (the input query we want to match)
query = "Are you interested in winning a big prize? Click here for your chance."

# Get BERT embedding for the query
query_embedding = get_bert_embedding(query)

# Get BERT embeddings for all text documents
text_embeddings = [get_bert_embedding(text) for text in texts]

# Flatten the query embedding and text embeddings for cosine similarity computation
query_embedding_flat = np.squeeze(query_embedding)
text_embeddings_flat = [np.squeeze(embedding) for embedding in text_embeddings]

# Compute cosine similarity between the query and each text
similarity_scores = [cosine_similarity([query_embedding_flat], [text_embedding])[0][0] for text_embedding in text_embeddings_flat]

# Rank texts based on similarity scores (in descending order)
ranked_texts = sorted(zip(similarity_scores, texts), key=lambda x: x[0], reverse=True)

# Display ranked results
print("Ranked Texts Based on Similarity:\n")
for rank, (score, text) in enumerate(ranked_texts, 1):
    print(f"Rank {rank}: (Similarity Score: {score:.4f})\n{text}\n")



Ranked Texts Based on Similarity:

Rank 1: (Similarity Score: 0.9264)
Congratulations! You've been selected to receive a FREE gift. Click to redeem.

Rank 2: (Similarity Score: 0.9241)
This is a friendly reminder for your scheduled appointment at 2:30 PM.

Rank 3: (Similarity Score: 0.9167)
LIMITED OFFER: Buy 1 get 1 FREE! Visit our website now.

Rank 4: (Similarity Score: 0.9127)
WIN BIG NOW!!! Click here to claim your $1,000 prize. Offer ends soon!!!

Rank 5: (Similarity Score: 0.9081)
Hello, I hope you are doing well. Don't forget our meeting tomorrow.

Rank 6: (Similarity Score: 0.8775)
Hi John, I wanted to follow up on our last conversation about the project proposal.
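
The scores above span only 0.8775 to 0.9264, so the ordering rests on small differences. As a reference for what each score measures, here is a minimal sketch of cosine similarity and the ranking step written without any library. The helper names `cosine` and `rank_by_similarity` are hypothetical, and the 3-dimensional vectors are made-up stand-ins for the 768-dimensional BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (score, doc) pairs sorted from most to least similar."""
    scored = [(cosine(query_vec, vec), doc) for vec, doc in zip(doc_vecs, docs)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical toy vectors; real embeddings would come from the model.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
docs = ["doc A", "doc B", "doc C"]

ranked = rank_by_similarity(query_vec, doc_vecs, docs)
for score, doc in ranked:
    print(f"{doc}: {score:.4f}")
```

Cosine similarity depends only on direction, not vector length, which is why identical directions score 1 and orthogonal ones score 0 in this toy example.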



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer in as much detail as possible for the above questions):

'''

This exercise was engaging and informative, and working with text data was interesting.
The key concepts I found most beneficial were feature extraction and feature selection methods.
The main challenges were managing the complexity of representing textual data in machine-readable form, preprocessing, and extracting features.
This exercise is relevant to NLP: it covers text feature extraction, feature selection, and similarity ranking.




'''