<a href="https://colab.research.google.com/github/ManasaCherukupally1/Manasa_INFO5731_Spring2023/blob/main/Cherukupally_Exercise_03_ipynb.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The third In-class-exercise (due on 11:59 PM 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
The interesting text classification task is spam detection in SMS or emails. In this task, the text messages from SMS are emails are analyzed and classified whether it is a spam or
not. It is one of the important problem statement as it is very common to get trapped by the spam emails or messages. To keep track of such emails and trying not to become bait to
such emails is extremely important. Hence this problem statement is extremely useful Text classification problem statement.

The features that can be extracted from this text that will be useful in building the model are
1. Word frequency - That count the occurance of each word in the message which will help us in finding the unrelated or rare words like "congratulations", "you have won" etc
2. Existence of hyperlinks(lexical feature extraction) - Usually spam messages contain hyperlinks to to click bait the receivers . Hence it is important to check for hyperlink
3. TF/IDF - Term frequency to inverse document frequency is important to know the relevance of word among the total messages in dataset.
4. POS tagging - POS tagging is important as the spam messages may contain high number of verbs
5. Sentiment score - It is important to know the sentiment score of the message to detect whether it is spam or not


'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [3]:
import re
import nltk
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer

# Download the NLTK stopwords dataset
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('vader_lexicon')
nltk.download('averaged_perceptron_tagger')

# Function to clean message
def cleaning_text(text):
    # Removing punctuation and other special characters
    cleaned_text = re.sub(r'[^\w\s]', ' ', text)
    # To lowercase
    cleaned_text = cleaned_text.lower()
    return cleaned_text

# Check for hyperlinks
def having_hyperlink(text):
    return 1 if re.search(r'http[s]?://', text) else 0

# Word Frequency features
def word_freq_cal(messages):
    # Creating a CountVectorizer
    vectorizer = CountVectorizer()

    # Fit and transform to obtain the bag of words
    bow_matrix = vectorizer.fit_transform(messages)

    # Get the feature names
    f_names = vectorizer.get_feature_names_out()

    # Convert bow matrix to an array
    bow_array = bow_matrix.toarray()

    # Calculate word frequency
    word_frequency = bow_array.sum(axis=1)

    return word_frequency

# TF-IDF Features
def tfidf_features_extractor(messages):
    tfidf_vectorizer = TfidfVectorizer()
    tfidf_matrix = tfidf_vectorizer.fit_transform(messages)
    tfidf_f_names = tfidf_vectorizer.get_feature_names_out()
    tfidf_array = tfidf_matrix.toarray()
    return tfidf_f_names, tfidf_array

# POS Tagging and Sentiment Scores
def pos_tagging_and_sentiment_score(messages):
    sia = SentimentIntensityAnalyzer()
    pos_tags_and_sent = []

    for message in messages:
        # POS tagging
        tokens = nltk.word_tokenize(message)
        pos_tags = nltk.pos_tag(tokens)

        # Sentiment analysis
        sentiments = sia.polarity_scores(message)

        pos_tags_and_sent.append({
            "POS Tags": pos_tags,
            "Sentiment": sentiments
        })

    return pos_tags_and_sent

# Sample messages and labels (not spam and spam)
messages = [
    "Hi Sara, I wanted to confirm our meeting scheduled for tomorrow at 10 AM. Please make sure to bring the project report with you. Looking forward to our discussion. Best regards, James",
    "Congratulations Sara! You've been selected as the lucky winner of our grand prize. Click the link below to claim your free prize worth $1000! Claim Now: https://www.xyz.com Don't miss this once-in-a-lifetime opportunity. Act now! Regards, Tony"
]

labels = ["not spam", "spam"]

# Create a DataFrame
features_df = pd.DataFrame()

# Adding target labels
features_df['Label'] = labels

# Clean the messages
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))

cleaned_msg = []
for message in messages:
    cleaned_text = cleaning_text(message)
    cleaned_words = [word for word in cleaned_text.split() if word not in stop_words]
    cleaned_msg.append(" ".join(cleaned_words))

# 1. Word Frequency
word_frequency = word_freq_cal(cleaned_msg)
features_df['Word Frequency'] = word_frequency

# 2. Existence of hyperlinks(lexical feature extraction)
features_df['Has Hyperlink'] = [having_hyperlink(message) for message in cleaned_msg]

# 3. TF-IDF Features
tfidf_f_names, tfidf_array = tfidf_features_extractor(cleaned_msg)
for i, feature_name in enumerate(tfidf_f_names):
    features_df[feature_name] = tfidf_array[:, i]

# 4. POS Tagging
pos_tags_and_sentiments = pos_tagging_and_sentiment_score(cleaned_msg)
features_df['POS Tags'] = [str(tags['POS Tags']) for tags in pos_tags_and_sentiments]

# 5. Sentiment Scores
sentiments = [sentiment['Sentiment'] for sentiment in pos_tags_and_sentiments]
features_df['Sentiment Score'] = sentiments

pd.set_option("display.max_colwidth", None)
# Print features
print("Features:")
print(features_df)

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


Features:
      Label  Word Frequency  Has Hyperlink       10     1000      act  \
0  not spam              20              0  0.22934  0.00000  0.00000   
1      spam              25              0  0.00000  0.18894  0.18894   

      best    bring    claim    click  ...     sure  tomorrow     tony  \
0  0.22934  0.22934  0.00000  0.00000  ...  0.22934   0.22934  0.00000   
1  0.00000  0.00000  0.37788  0.18894  ...  0.00000   0.00000  0.18894   

    wanted   winner    worth      www      xyz  \
0  0.22934  0.00000  0.00000  0.00000  0.00000   
1  0.00000  0.18894  0.18894  0.18894  0.18894   

                                                                                                                                                                                                                                                                                                                                                                                                            

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [22]:
#The feature selection method I used in the below code is the filter based model for selecting features.
#It uses the chi-squared score for selecting the top features.

import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.feature_extraction.text import TfidfVectorizer
import re

# Sample messages
data = {
    'text': [
        "Hi Sara,I wanted to confirm our meeting scheduled for tomorrow at 10 AM. Please make sure to bring the project report with you.Looking forward to our discussion.Best regards,James",
        "Congratulations Sara! You've been selected as the lucky winner of our grand prize. Click the link below to claim your free prize worth $1000! Claim Now: https://www.xyz.com Don't miss this once-in-a-lifetime opportunity. Act now! Regards,Tony"
    ],
    'label': [0, 1]  # 0 for not spam (ham), 1 for spam
}

# function to clean text data
def clean_text(text):
    # Remove special characters, punctuation
    text = re.sub(r'[^\w\s]', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Create a DataFrame
df = pd.DataFrame(data)

# Clean the text data
df['cleaned_text'] = df['text'].apply(clean_text)

# Split data
X = df['cleaned_text']
y = df['label']

# Vectorize text data

tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(X)

# Apply feature selection (chi-squared test as an example)
num_top_features = 10  # Updated number of top features to select
selector = SelectKBest(score_func=chi2, k=num_top_features)
X_new = selector.fit_transform(X_tfidf, y)

# Get the chi-squared scores for all features
chi2_scores = selector.scores_

# Create a DataFrame to store feature names and their chi-squared scores
feature_scores_df = pd.DataFrame({'Feature': tfidf_vectorizer.get_feature_names_out(),
                                   'Chi2_Score': chi2_scores})

# Sort the features by chi-squared score in descending order to rank them
ranked_features_df = feature_scores_df.sort_values(by='Chi2_Score', ascending=False)

print("top 10 selected features")
print(ranked_features_df['Feature'][:10].to_list())

# Print the ranked features
print("Ranked top 10 Features")
print(ranked_features_df.head(10))

# Print the ranked features
print("Ranked Features (Descending Order):")
print(ranked_features_df)






top 10 selected features
['now', 'claim', 'prize', 'meeting', 'please', 'project', 'regardsjames', 'report', 'sarai', 'hi']
Ranked top 10 Features
         Feature  Chi2_Score
26           now    0.316080
9          claim    0.316080
32         prize    0.316080
24       meeting    0.185416
31        please    0.185416
33       project    0.185416
34  regardsjames    0.185416
36        report    0.185416
38         sarai    0.185416
19            hi    0.185416
Ranked Features (Descending Order):
            Feature  Chi2_Score
26              now    0.316080
9             claim    0.316080
32            prize    0.316080
24          meeting    0.185416
31           please    0.185416
33          project    0.185416
34     regardsjames    0.185416
36           report    0.185416
38            sarai    0.185416
19               hi    0.185416
39        scheduled    0.185416
41             sure    0.185416
45         tomorrow    0.185416
46           wanted    0.185416
48             wit

Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [26]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

#query
query = "Please confirm our scheduled meeting for tomorrow at 10 AM."

#documents
docs = [
    "Hi Sara, I wanted to confirm our meeting scheduled for tomorrow at 10 AM. Please make sure to bring the project report with you. Looking forward to our discussion. Best regards, James",
    "Congratulations Sara! You've been selected as the lucky winner of our grand prize. Click the link below to claim your free prize worth $1000! Claim Now: https://www.xyz.com Don't miss this once-in-a-lifetime opportunity. Act now! Regards, Tony"
]

# Load the BERT model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# Encoding the query and docs
encoded_query = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
encoded_docs = tokenizer(docs, padding=True, truncation=True, return_tensors="pt", many=True)

# Getting embeddings by BERT model
with torch.no_grad():
    query_embedding = model(**encoded_query).last_hidden_state.mean(dim=1)
    document_embeddings = model(**encoded_docs).last_hidden_state.mean(dim=1)

# Calculate cosine similarity
cosine_sim = cosine_similarity(query_embedding, document_embeddings)

# Rank docs
document_ranking = sorted(enumerate(cosine_sim[0]), key=lambda x: x[1], reverse=True)

# Print the docs
for idx, score in document_ranking:
    print(f"Document {idx + 1}: Similarity Score = {score:.4f}")
    print(docs[idx])
    print("=" * 50)

Keyword arguments {'many': True} not recognized.
Keyword arguments {'many': True} not recognized.


Document 1: Similarity Score = 0.7355
Hi Sara, I wanted to confirm our meeting scheduled for tomorrow at 10 AM. Please make sure to bring the project report with you. Looking forward to our discussion. Best regards, James
Document 2: Similarity Score = 0.6099
Congratulations Sara! You've been selected as the lucky winner of our grand prize. Click the link below to claim your free prize worth $1000! Claim Now: https://www.xyz.com Don't miss this once-in-a-lifetime opportunity. Act now! Regards, Tony


In [24]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m7.7/7.7 MB[0m [31m21.5 MB/s[0m eta [36m0:00:00[0m
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m295.0/295.0 kB[0m [31m35.1 MB/s[0m eta [36m0:00:00[0m
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m3.8/3.8 MB[0m [31m48.9 MB/s[0m eta [36m0:00:00[0m
[?25hCollecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m1.3/1.3 MB[0m [31m51.3 MB/s[0m eta [36m0:00:00[0m
Insta