<a href="https://colab.research.google.com/github/ManoharRavula/Manohar_INFO5731_-Spring2024/blob/main/Ravulapalli_Manohar_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
My text classification task is identifying spam vs. non-spam emails. An effective feature extraction process is required for the machine learning model. The
features below can be used to build the model and improve its accuracy:
1) Bag of Words (BoW): Counts the presence and frequency of each word in the email text, since certain words and phrases are more prevalent in spam emails.
2) TF-IDF (Term Frequency-Inverse Document Frequency): Measures how significant a word is within one document relative to the whole corpus.
 It helps identify spam by down-weighting words that are common across all emails and emphasizing words that are distinctive to individual emails.
3) Length of the email: The size of an email, in words or characters, can be an indicator. Spam emails may be
 unusually brief or unusually long compared to other emails. Including length lets the model pick up these trends.
4) Presence of URLs: URLs in an email often indicate spam. Spam messages frequently contain links that
  direct recipients to external sites for phishing attacks, advertisements, or other unsolicited content.
  Tracking the number of URLs in each email improves the model's ability to recognize potential spam.
5) Use of capital letters: Spam emails often capitalize words in headers, subject lines, or the message body to grab the reader's attention and signal urgency or significance.
 Measuring the percentage of capitalized text therefore helps detect spam.



'''
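
To make the TF-IDF idea in feature 2 concrete, here is a minimal sketch that computes a smoothed inverse document frequency by hand. The three-message corpus and the helper name `smoothed_idf` are hypothetical. The formula, ln((1 + n) / (1 + df)) + 1, is the smoothed variant that scikit-learn documents as the default for `TfidfVectorizer`, before any normalization is applied.

```python
import math

# Hypothetical toy corpus (not the assignment data)
corpus = [
    "click here to win a prize",
    "please review the report",
    "click to claim the prize",
]

def smoothed_idf(term, documents):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, where df counts documents containing the term."""
    n = len(documents)
    df = sum(1 for doc in documents if term in doc.split())
    return math.log((1 + n) / (1 + df)) + 1

idf_the = smoothed_idf("the", corpus)  # "the" appears in 2 of 3 documents
idf_win = smoothed_idf("win", corpus)  # "win" appears in 1 of 3 documents

# The rarer word receives the larger weight
print(f"idf('the') = {idf_the:.3f}, idf('win') = {idf_win:.3f}")
```

A word that occurs in fewer documents gets a higher IDF, which is why TF-IDF emphasizes terms that are distinctive to a particular email.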

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [31]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import re

# Sample email texts
emails = [
    "WIN BIG! Click here to claim your prize NOW! http://example.com",
    "Dear user, your account has been updated. Please log in to confirm.",
    "Congratulations! You've won a $500 Amazon gift card. Click to redeem: http://phishing.com",
    "Hi, could you please review the attached file? Thanks, John.",
    "LIMITED TIME OFFER! Get your free trial of our product today!",
     "Urgent: Your account will be locked unless you update your information immediately at http://scam-link.com",
    "Hello friend, I am the son of a deposed prince. I need your help to secure my inheritance. I offer you a reward.",
    "Are you looking for cheap flights? Exclusive deals just for you: http://deals-on-flights.com",
    "This is a reminder that your next appointment is scheduled for tomorrow at 3 PM. Please call to confirm.",
    "Team meeting at 10 AM in the main conference room. Please prepare your monthly reports.",
    "Your subscription has been cancelled successfully. If this was a mistake, please contact support.",
    "Final notice: Your payment is overdue. Log in through the link to avoid service interruption: http://fake-payment.com",
    "Discover the new features of our app with the latest update. Click here to learn more.",
]

# Labels for the sample data (1 for spam, 0 for non-spam)
labels = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]

# Feature 1: Bag of Words (BoW)
count_vectorizer = CountVectorizer()
bow_features = count_vectorizer.fit_transform(emails)

# Feature 2: TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(emails)

# Feature 3: Length of the email (in terms of word count)
length_features = np.array([len(email.split()) for email in emails]).reshape(-1, 1)

# Feature 4: Presence of URLs
url_pattern = r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+'
url_counts = np.array([len(re.findall(url_pattern, email)) for email in emails]).reshape(-1, 1)

# Feature 5: Use of Capital Letters
capital_letter_counts = np.array([sum(1 for char in email if char.isupper()) for email in emails])
# Fraction of all characters in each email that are uppercase
capital_letter_percentage = (capital_letter_counts / np.array([len(email) for email in emails])).reshape(-1, 1)

# Combining all features into a single feature matrix
from scipy.sparse import hstack
features = hstack([bow_features, tfidf_features, length_features, url_counts, capital_letter_percentage])

print("Features extracted successfully!")


Features extracted successfully!
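
The URL regular expression used for Feature 4 is dense and hard to read. As an alternative sketch, assuming that any run of non-whitespace characters after `http://` or `https://` counts as a URL, the count can be taken with a shorter pattern. The helper `count_urls`, the constant `SIMPLE_URL_PATTERN`, and the third sample string are hypothetical and not part of the original notebook.

```python
import re

# Assumption: a URL is "http://" or "https://" followed by non-whitespace characters
SIMPLE_URL_PATTERN = re.compile(r"https?://\S+")

def count_urls(text):
    """Return the number of URL-like substrings in text."""
    return len(SIMPLE_URL_PATTERN.findall(text))

samples = [
    "WIN BIG! Click here to claim your prize NOW! http://example.com",
    "Hi, could you please review the attached file? Thanks, John.",
    "See https://a.example and http://b.example for details",
]
url_counts_demo = [count_urls(s) for s in samples]
print(url_counts_demo)
```

This pattern also swallows trailing punctuation attached to a link, which does not matter for counting but would matter if the matched URLs were used directly.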


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [32]:
from sklearn.feature_selection import chi2
import numpy as np

# features is my combined feature matrix and labels are my target labels

# Applying the Chi-squared test
chi_scores, p_values = chi2(features, labels)

# Ranking features based on the chi-squared scores
feature_names = (count_vectorizer.get_feature_names_out().tolist() +
                 tfidf_vectorizer.get_feature_names_out().tolist() +
                 ["Length of email", "Presence of URLs", "Use of Capital Letters Percentage"])

# Combining feature names and their chi-squared scores
feature_chi_scores = zip(feature_names, chi_scores)

# Sorting the features by their chi-squared scores in descending order
sorted_features = sorted(feature_chi_scores, key=lambda x: x[1], reverse=True)

# Displaying the top features based on chi-squared scores
sorted_features[:10]

[('please', 5.833333333333332),
 ('com', 4.2857142857142865),
 ('http', 4.2857142857142865),
 ('Presence of URLs', 4.2857142857142865),
 ('been', 2.333333333333333),
 ('confirm', 2.333333333333333),
 ('has', 2.333333333333333),
 ('this', 2.333333333333333),
 ('you', 2.099206349206349),
 ('deals', 1.7142857142857144)]
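
As a sanity check on the ranking above, the score of about 4.2857 reported for "Presence of URLs" can be reproduced by hand. The sketch assumes the formulation scikit-learn's `chi2` uses for non-negative features: sum the feature's values per class (observed), compare them with the values expected if the total were split in proportion to class frequency, and add up (observed - expected)^2 / expected. The helper `chi2_by_hand` is hypothetical; the URL counts and labels are those of the 13 sample emails.

```python
# URL count per sample email (the five emails containing a link are all labeled spam)
url_counts_per_email = [1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0]
labels = [1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0]

def chi2_by_hand(values, labels):
    """Chi-squared statistic between one non-negative feature and the class labels."""
    total = sum(values)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: feature mass that actually falls in this class
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        # Expected: feature mass this class would get if the feature ignored the label
        class_prob = labels.count(cls) / len(labels)
        expected = total * class_prob
        score += (observed - expected) ** 2 / expected
    return score

url_score = chi2_by_hand(url_counts_per_email, labels)
print(round(url_score, 4))
```

Because all five links sit in the spam class, the observed counts (5 and 0) are far from the expected split (about 2.69 and 2.31), which is what pushes this feature near the top of the ranking.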

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [34]:
from transformers import BertModel, BertTokenizer
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Initializing the BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function for encoding the text into BERT embeddings
def encode_text_to_bert_embeddings(text):
    # Tokenizing and encoding text with padding, truncation, and return PyTorch tensors
    encoded_input = tokenizer(text, padding=True, truncation=True, max_length=128, return_tensors='pt')
    # Getting the model output without gradient calculation
    with torch.no_grad():
        output = model(**encoded_input)
    # Use mean pooling for the token embeddings to get a single sentence embedding
    embeddings = output.last_hidden_state
    attention_mask = encoded_input['attention_mask']
    mask_expanded = attention_mask.unsqueeze(-1).expand(embeddings.size()).float()
    sum_embeddings = torch.sum(embeddings * mask_expanded, 1)
    sum_mask = torch.clamp(mask_expanded.sum(1), min=1e-9)
    mean_pooled = sum_embeddings / sum_mask
    return mean_pooled.cpu().numpy()

# Encoding all the texts to BERT embeddings
encoded_texts = [encode_text_to_bert_embeddings(text) for text in emails]

# Encoding the query
query = "Want to know the latest promotional offers? Click here!"
encoded_query = encode_text_to_bert_embeddings(query)

# Calculating the cosine similarities
similarities = [cosine_similarity(encoded_query, text_emb)[0][0] for text_emb in encoded_texts]

# Ranking texts by similarity
ranked_texts_indices = sorted(range(len(emails)), key=lambda i: similarities[i], reverse=True)

print("Ranked texts based on similarity to the query: 'Want to know the latest promotional offers? Click here!'")
for index in ranked_texts_indices:
    print(f"Text: \"{emails[index]}\" - Similarity Score: {similarities[index]}")


Ranked texts based on similarity to the query: 'Want to know the latest promotional offers? Click here!'
Text: "Discover the new features of our app with the latest update. Click here to learn more." - Similarity Score: 0.8666878938674927
Text: "WIN BIG! Click here to claim your prize NOW! http://example.com" - Similarity Score: 0.8127886652946472
Text: "Congratulations! You've won a $500 Amazon gift card. Click to redeem: http://phishing.com" - Similarity Score: 0.7976548671722412
Text: "Urgent: Your account will be locked unless you update your information immediately at http://scam-link.com" - Similarity Score: 0.7758417129516602
Text: "Dear user, your account has been updated. Please log in to confirm." - Similarity Score: 0.753309428691864
Text: "Are you looking for cheap flights? Exclusive deals just for you: http://deals-on-flights.com" - Similarity Score: 0.747409462928772
Text: "Hi, could you please review the attached file? Thanks, John." - Similarity Score: 0.745686411857605
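
The two numeric steps in the cell above, masked mean pooling and cosine-similarity ranking, can be illustrated without downloading BERT. The following is a minimal sketch on made-up 2-dimensional "token embeddings"; the vectors, the helper names `masked_mean` and `cosine`, and the document labels are all hypothetical.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask is 1, so padding is ignored."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / max(len(kept), 1) for d in range(dim)]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# The third token is padding (mask 0), so it must not affect the sentence vector
sentence_vec = masked_mean([[1.0, 0.0], [3.0, 2.0], [9.0, 9.0]], [1, 1, 0])

query_vec = [1.0, 0.0]
doc_vecs = {"doc_a": [0.0, 1.0], "doc_b": [2.0, 0.1], "doc_c": [1.0, 1.0]}

# Rank documents by similarity to the query, highest first
ranking = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(sentence_vec, ranking)
```

Cosine similarity depends only on direction, not length, so `doc_b` ranks first even though its vector is longer than the query's.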

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Coming up with a problem and seeing how natural language processing can help solve it was a great experience. This assignment required a lot of effort to understand the
feature extraction process, and ranking the extracted features with the chi-squared test was especially rewarding. Building on my answer to the second question, I could
train a Logistic Regression model on these features and release a spam detector, for example:
X_train, X_test, y_train, y_test = train_test_split(features_scaled, labels, test_size=0.2, random_state=42)
model = LogisticRegression(random_state=42)
model.fit(X_train, y_train)
predictions = model.predict(features_scaled)
The challenging part was understanding the questions and the expected output; modifying and analyzing the code took a lot of time.
The exercise demonstrated several core NLP concepts and techniques, such as text representation using the BERT model, information retrieval using cosine similarity, and
combining features into a single matrix for the chi-squared test. Grateful for this assignment.



'''