<a href="https://colab.research.google.com/github/Lavanya-INFO5731-Fall2024/Lavanya_INFO5731_Fall2024/blob/main/Nidamanuri_Lavanya_Exercise_3_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:
->Spam detection is an interesting text classification task in which the aim is to determine whether an email is spam or not (ham).
  This is a binary classification problem in which text patterns help detect unwanted or harmful content.
->Here are five types of features that might be useful in developing a spam detection model:
  *Bag of Words / Term Frequency-Inverse Document Frequency
  *N-grams (Bigrams, Trigrams)
  *Email Metadata Features
  *Presence of URLs and Hyperlinks
  *Special Characters, Punctuation, and Capitalization

->Bag of Words / Term Frequency-Inverse Document Frequency:
  -Spam emails frequently contain specific keywords such as "free," "win," "urgent," "money," "buy now," and so on.
  These terms are common in spam, but less so in legitimate (ham) emails.
  -TF-IDF reduces the effect of common, less-informative terms while emphasizing spam-related phrases.
->N-grams (Bigrams, Trigrams):
  -Spam often includes phrases like "limited time offer" or "act now".
   These multi-word expressions carry more meaning than single words and frequently expose spam patterns.
  -Spam messages tend to follow formulaic patterns that n-grams can capture.
->Email Metadata Features:
  -Spam subject lines often include eye-catching or sensational statements (for example, "Congratulations! You've Won!").
  -The sender's email address may contain patterns such as strange domains, odd strings, or unfamiliar contacts.
  -The timestamp might be important since spam emails are typically sent at unusual hours or in bulk.
->Presence of URLs and Hyperlinks:
  -Many spam emails attempt to steer recipients to phishing sites or fraudulent offers.
   A large number of URLs or shortened URLs (e.g., bit.ly) may signal a spam email.
  -The presence of strange or unusual domain names in the links may indicate spam.
->Special Characters, Punctuation, and Capitalization:
  -Spam messages frequently include exclamation marks, dollar signs, or other eye-catching symbols to convey urgency or a call to action (e.g., "BUY NOW!!!").
  -Excessive use of capitalized phrases such as "FREE," "WIN," or "ACT NOW" is typical in spam emails but uncommon in legitimate correspondence.

'''
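
The URL and special-character/capitalization features listed above rely on surface cues that disappear once the text is lowercased and stripped of punctuation, so they would need to be computed on the raw email. The following is a minimal sketch of that idea, not part of the original answer: the helper name `handcrafted_features`, the URL regex, and the shortener list are all assumptions chosen for illustration.

```python
import re

# Illustrative assumptions: a simple URL pattern and a short list of URL shorteners
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
SHORTENER_HOSTS = {"bit.ly", "tinyurl.com"}


def _host(url: str) -> str:
    """Drop the scheme and keep everything before the first slash."""
    return re.sub(r"^https?://", "", url.lower()).split("/")[0]


def handcrafted_features(raw_email: str) -> dict:
    """Count surface cues in the raw (not lowercased, not stripped) email text."""
    urls = URL_PATTERN.findall(raw_email)
    words = re.findall(r"[A-Za-z]+", raw_email)
    # A word counts as "shouted" if it is fully upper-case and longer than one letter
    caps_words = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "num_urls": len(urls),
        "num_short_urls": sum(_host(u) in SHORTENER_HOSTS for u in urls),
        "num_exclamations": raw_email.count("!"),
        "num_dollar_signs": raw_email.count("$"),
        "caps_word_ratio": len(caps_words) / len(words) if words else 0.0,
    }


sample = "FREE entry!!! WIN $1000 now at http://bit.ly/abc or www.example.com"
features = handcrafted_features(sample)
print(features)
```

Numeric columns like these could be placed next to a TF-IDF matrix as extra features; whether they help on a given dataset would have to be checked empirically.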

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):
import pandas as pd
import nltk
nltk.download('stopwords')
nltk.download('wordnet')
# Load the dataset (spam.csv or another dataset)
df = pd.read_csv('spam_assaain.csv', encoding='latin-1')  # Adjust path as needed

# Check the structure of the dataset
df.head()

# Assuming 'target' is the label (0 = ham, 1 = spam) and 'text' is the email content
df = df[['text', 'target']].copy()  # Keep only the relevant columns; .copy() avoids SettingWithCopyWarning below

# Labels are assumed to be binary already, so this mapping acts as a check: any value other than 0 or 1 becomes NaN
df['target'] = df['target'].map({1: 1, 0: 0})  # 1 for spam, 0 for ham
df.columns = ['email', 'label']  # Rename the columns for consistency

# Now the DataFrame is ready with 'email' and 'label' columns.

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import re

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    # Remove special characters and numbers
    text = re.sub(r'\W', ' ', text)
    text = re.sub(r'\d+', '', text)

    # Convert to lowercase
    text = text.lower()

    # Tokenize and lemmatize
    tokens = text.split()
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]

    return ' '.join(tokens)

# Apply preprocessing to the 'email' column
df['email'] = df['email'].apply(preprocess_text)

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(df['email'], df['label'], test_size=0.2, random_state=42)

# Initialize TF-IDF vectorizer (you can use CountVectorizer for Bag of Words if you prefer)
tfidf = TfidfVectorizer(max_features=3000, ngram_range=(1, 2))  # Extract unigrams and bigrams

# Fit the vectorizer on the training data and transform both training and testing sets
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

# Make predictions on the test data
y_pred = model.predict(X_test_tfidf)

# Evaluate the model's performance
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy * 100:.2f}%')  # Print accuracy in percentage

# Print detailed classification report
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


Accuracy: 98.36%
              precision    recall  f1-score   support

         ham       0.98      1.00      0.99       779
        spam       0.99      0.96      0.97       381

    accuracy                           0.98      1160
   macro avg       0.99      0.98      0.98      1160
weighted avg       0.98      0.98      0.98      1160



## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, and rank the features by importance in descending order.

In [None]:
# Your code here (Please add comments in the code):
# Required Libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif

# Sample Data (Text and Labels)
texts = [
    "This is a positive review",
    "I love this product, it's amazing",
    "Worst experience ever, totally disappointed",
    "This is a negative review",
    "Not satisfied with the product, bad quality",
    "The quality is outstanding, will buy again",
]

labels = [1, 1, 0, 0, 0, 1]  # 1 = Positive, 0 = Negative

# Step 1: Convert text data into TF-IDF features
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)
feature_names = vectorizer.get_feature_names_out()

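# Step 2: Calculate Mutual Information
# Note: X is sparse, so scikit-learn requires discrete_features=True; each distinct TF-IDF weight is then treated as a category
mi = mutual_info_classif(X, labels, discrete_features=True)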

# Step 3: Create a DataFrame to Display Feature Importance
mi_scores = pd.DataFrame({'Feature': feature_names, 'MI Score': mi})

# Sort the features by their MI score in descending order
mi_scores = mi_scores.sort_values(by='MI Score', ascending=False)

# Display the ranked features based on MI Score
print(mi_scores)

         Feature  MI Score
18           the  0.231049
15       quality  0.231049
14       product  0.231049
19          this  0.143841
7             is  0.143841
0          again  0.132304
1        amazing  0.132304
22          with  0.132304
21          will  0.132304
20       totally  0.132304
17     satisfied  0.132304
13      positive  0.132304
12   outstanding  0.132304
11           not  0.132304
10      negative  0.132304
9           love  0.132304
8             it  0.132304
6     experience  0.132304
5           ever  0.132304
4   disappointed  0.132304
3            buy  0.132304
2            bad  0.132304
23         worst  0.132304
16        review  0.000000




## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load a limited subset of the data (20 samples) to keep BERT encoding fast
df_limited = df.sample(n=20, random_state=42)

# Load the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text using BERT
def encode_text_bert(text, max_len=512):
    # Tokenize the input text and convert to input IDs, attention masks
    inputs = tokenizer(text, return_tensors='pt', max_length=max_len, truncation=True, padding='max_length')
    with torch.no_grad():  # Disable gradient calculations
        outputs = model(**inputs)  # Pass the inputs to the BERT model
        embeddings = outputs.last_hidden_state[:, 0, :]  # Extract the CLS token representation
    return embeddings

# Encode each email in the limited dataset as a BERT [CLS] embedding
df_limited['bert_embeddings'] = df_limited['email'].apply(lambda x: encode_text_bert(x).numpy().flatten())

# Function to compute cosine similarity between query and documents
def rank_documents_by_similarity(query, df_limited):
    # Encode the query using BERT
    query_embedding = encode_text_bert(query).numpy().flatten()

    # Calculate cosine similarity between the query and each document
    df_limited['similarity'] = df_limited['bert_embeddings'].apply(lambda x: cosine_similarity([query_embedding], [x])[0][0])

    # Sort the dataframe by similarity in descending order
    ranked_df = df_limited.sort_values(by='similarity', ascending=False)

    return ranked_df[['email', 'similarity']]

# Example query
query = "Limited time offer for free lottery entry"

# Rank documents by their similarity to the query
ranked_df_limited = rank_documents_by_similarity(query, df_limited)

# Display the top-ranked documents based on similarity
print(ranked_df_limited.head(10))

                                                  email  similarity
4937  gintare netzero net mon jun return path dmeizy...    0.830505
3321  ia rogers com mon aug return path ia rogers co...    0.811449
2954  fork admin xent com wed jul return path fork a...    0.740152
3337  claudia_robinson eudoramail com mon aug return...    0.739608
2418  fork admin xent com wed aug return path fork a...    0.736885
4675  rssfeeds jmason org thu oct return path rssfee...    0.733521
101   fork admin xent com mon dec return path fork a...    0.732922
3296  exmh user admin redhat com fri sep return path...    0.732734
5110  iiu owner taint org mon aug return path iiu ow...    0.730542
5079  yipxyvihobzh ibm com tue aug return path yipxy...    0.728989
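
The ranking step itself does not depend on BERT and can be checked in isolation. The sketch below is an illustration under assumptions: the helper name `rank_by_cosine` is hypothetical, and tiny 3-dimensional vectors stand in for the 768-dimensional embeddings so the arithmetic can be verified by hand. It scores all documents with one matrix product instead of one `cosine_similarity` call per row.

```python
import numpy as np


def rank_by_cosine(query_vec: np.ndarray, doc_matrix: np.ndarray):
    """Return document indices sorted by descending cosine similarity, with their scores."""
    # Normalize to unit length so the dot product equals the cosine similarity
    query_unit = query_vec / np.linalg.norm(query_vec)
    doc_units = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = doc_units @ query_unit
    order = np.argsort(-scores)  # negate to sort from highest to lowest
    return order, scores[order]


# Toy vectors standing in for document embeddings (one row per document)
toy_docs = np.array([
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 1.0, 0.0])

order, ranked_scores = rank_by_cosine(toy_query, toy_docs)
print(order, ranked_scores.round(3))
```

With real embeddings, `doc_matrix` would be the stacked per-email vectors (for example via `np.vstack`) and `query_vec` the encoded query.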


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here:
Learning Experience
-While extracting features from text data, I learned how to convert raw text into meaningful numerical representations
that can be used as features in machine learning models. The concepts I found most useful were:
*Text Vectorization: Bag of Words (BoW), Term Frequency-Inverse Document Frequency (TF-IDF), Word2Vec, BERT, etc.
 Each has its own advantages, ranging from simple frequency counts to capturing how words relate to each other.
*Feature Selection: I had to understand the Chi-Square, Mutual Information, and Information Gain methods well in
 order to select the most important features from the text data, which reduces dimensionality and can improve model performance
 by keeping only the informative features.
*Text Similarity: Computing similarity between texts with BERT embeddings and cosine similarity helped me understand
 how deep learning models such as BERT represent text in a way that captures context and semantics, which is critical
 for applications like document ranking and semantic search.

Challenges
-Text data generally has a high-dimensional feature space, especially with traditional methods like Bag of Words or TF-IDF, which is
computationally expensive and leads to what is commonly called the curse of dimensionality.
-Each feature selection method (e.g., Mutual Information, Chi-Square) has its own pros and cons,
so it is difficult to decide which one is best for a specific task without experimenting.
-BERT provides rich contextual embeddings, but it is important to understand how these embeddings represent the meaning of text and
how to fine-tune or use them for a similarity ranking task.
-BERT models are resource intensive, so using them on large datasets or to compare many sentences
is very slow on low-end machines.

Relevance to NLP
-Feature extraction is essential in NLP problems such as text classification, sentiment analysis, and topic modeling.
Good feature representation and feature selection help in building good models.
-Text similarity is important in many natural language processing (NLP) applications, for instance information retrieval
(e.g., search engines), question answering, and document clustering. Among deep learning models, BERT is a widely used choice
for semantic similarity.
-BERT and contextual embeddings are recent advances in NLP, so learning how BERT can be applied to
real-world problems relates directly to current NLP practice.

'''