## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [1]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above question):

'''
Please write your answer here:
Classifying the sentiment of customer reviews, such as restaurant reviews, might be an interesting text classification job.
This entails categorizing each incoming review as "positive" or "negative."

The attributes listed below are helpful in creating a machine learning model:

Features of Bag of Words (BoW):
Word Frequency: Determine how often each word appears in the review. Words that convey a positive attitude include "delicious," "amazing," and "friendly," whereas words that convey a negative emotion include "disappointing," "awful," and "rude."

N-gram Features: Bi-grams and Tri-grams: In addition to individual words, take into account word pairs (bi-grams) and word triplets (tri-grams). Sentiment-related terms like "good food" and "horrible service" can be captured by this.

Temporal Elements:
Review Date and Time: Take into account the review's timestamp because opinions might change over time. Reviews that are written soon after a meal may be more passionate than ones that are written later.

TF-IDF (Term Frequency-Inverse Document Frequency) Features:
TF-IDF Scores: Compute the TF-IDF score of every word in the review. This downweights terms that are common across all reviews and emphasizes terms that are distinctive for a particular review.

Capitalization and Punctuation Qualities:
Use of Punctuation: Determine how many ellipses, question marks, or exclamation points are used throughout the review. Overuse of exclamation marks can convey enthusiasm in reviews that are good or annoyance in reviews that are negative.

Features of Part-of-Speech (POS):
POS Tag Frequencies: Determine how frequently the various parts of speech (nouns, adjectives, verbs, and adverbs) occur in the review. Positive reviews may contain more positive verbs and adjectives, while negative reviews may contain more negative verbs and adjectives.





'''

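
To make the TF-IDF idea from the answer above concrete, here is a minimal sketch in plain Python. It assumes the textbook weighting tf * log(N / df); the function name `tf_idf` and the three toy reviews are made up for illustration. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents in which each term occurs.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Toy, already-tokenized reviews (illustrative only).
docs = [
    ["the", "food", "was", "delicious"],
    ["the", "service", "was", "rude"],
    ["the", "food", "was", "awful"],
]
scores = tf_idf(docs)

# Terms that occur in every review get weight 0; rarer terms score higher.
for term, weight in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term}: {weight:.3f}")
```

In the first toy review, "the" and "was" appear in all three documents and receive a weight of zero, while "delicious" (unique to that review) outranks "food" (shared with one other review).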

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [2]:
# Your code here (Please add comments in the code):
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')

import string
import pandas as pd
import datetime

# Sample text data
reviews = [
    "The movie was captivating, and the acting was outstanding!",
    "I had a disappointing time. The plot was confusing, and the acting was subpar.",
    "This book is just average. Nothing remarkable.",
]

# Tokenize the text and remove stopwords
nltk.download('punkt')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def preprocess_text(text):
    words = word_tokenize(text.lower())
    filtered_words = [word for word in words if word not in stop_words and word not in string.punctuation]
    return filtered_words

tokenized_reviews = [preprocess_text(review) for review in reviews]

# Bag of Words (BoW) Features
def bag_of_words(review, vocabulary):
    bow_vector = [0] * len(vocabulary)
    for word in review:
        if word in vocabulary:
            bow_vector[vocabulary.index(word)] += 1
    return bow_vector

# Create a vocabulary
all_words = [word for review in tokenized_reviews for word in review]
vocabulary = list(set(all_words))

# Create BoW features
bow_features = [bag_of_words(review, vocabulary) for review in tokenized_reviews]

# TF-IDF Features
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(reviews)

# N-grams Features
n = 2  # Change n to the desired n-gram size
ngram_features = [list(ngrams(review, n)) for review in tokenized_reviews]

# Part-of-Speech (POS) Features
pos_features = [pos_tag(review) for review in tokenized_reviews]

# Punctuation and Capitalization Features
def punctuation_and_capitalization_features(review):
    punctuation_count = sum(1 for char in review if char in string.punctuation)
    capitalization_count = sum(1 for char in review if char.isupper())
    return [punctuation_count, capitalization_count]

punctuation_capitalization_features = [punctuation_and_capitalization_features(review) for review in reviews]

# Temporal Features
timestamp = [datetime.datetime.now() - datetime.timedelta(days=i) for i in range(len(reviews))]

# Create a DataFrame to display the features
feature_df = pd.DataFrame({
    'Review': reviews,
    'BoW Features': bow_features,
    'TF-IDF Features': [tfidf.toarray().tolist() for tfidf in tfidf_features],
    'N-grams Features': ngram_features,
    'POS Features': pos_features,
    'Punctuation & Capitalization Features': punctuation_capitalization_features,
    'Timestamp': timestamp
})

print(feature_df)




[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


                                              Review  \
0  The movie was captivating, and the acting was ...   
1  I had a disappointing time. The plot was confu...   
2     This book is just average. Nothing remarkable.   

                              BoW Features  \
0  [0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1]   
1  [0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0]   
2  [1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0]   

                                     TF-IDF Features  \
0  [[0.25660665102527236, 0.25660665102527236, 0....   
1  [[0.22154791696626897, 0.22154791696626897, 0....   
2  [[0.0, 0.0, 0.37796447300922725, 0.37796447300...   

                                    N-grams Features  \
0  [(movie, captivating), (captivating, acting), ...   
1  [(disappointing, time), (time, plot), (plot, c...   
2  [(book, average), (average, nothing), (nothing...   

                                        POS Features  \
0  [(movie, NN), (captivating, VBG), (acting, VBG...   
1  [(disappointing, JJ), (ti
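
The answer to Question 1 describes counting exclamation marks, question marks, and ellipses separately, and measuring POS tag frequencies, whereas the code above records a single punctuation total and the raw tagged tokens. The following is a minimal sketch of those two finer-grained features. The helper names `punctuation_counts` and `pos_tag_frequencies` are hypothetical, and the tagged tokens are hand-written in the `(word, tag)` shape that `nltk.pos_tag` returns rather than produced by the tagger.

```python
from collections import Counter

def punctuation_counts(text):
    """Count exclamation marks, question marks, and ellipses separately."""
    # Count both the three-dot form and the single ellipsis character.
    ellipses = text.count("...") + text.count("\u2026")
    return {
        "exclamation": text.count("!"),
        "question": text.count("?"),
        "ellipsis": ellipses,
    }

def pos_tag_frequencies(tagged_tokens):
    """Turn [(word, tag), ...] into relative frequencies per POS tag."""
    tag_counts = Counter(tag for _, tag in tagged_tokens)
    total = sum(tag_counts.values())
    if total == 0:
        return {}
    return {tag: count / total for tag, count in tag_counts.items()}

# Hand-written example input (an assumption, not tagger output).
tagged = [("movie", "NN"), ("captivating", "VBG"),
          ("acting", "VBG"), ("outstanding", "JJ")]

print(punctuation_counts("Great food!!! Would I go back? Maybe..."))
print(pos_tag_frequencies(tagged))
```

Relative frequencies, rather than raw counts, keep the POS feature comparable between short and long reviews.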

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [3]:
# Your code here (Please add comments in the code):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
import numpy as np

# Sample text data
reviews = [
    "The movie was captivating, and the acting was outstanding!",
    "I had a disappointing time. The plot was confusing, and the acting was subpar.",
    "This book is just average. Nothing remarkable.",
]

# Sample labels (for sentiment analysis, for example)
labels = ['positive', 'negative', 'neutral']

# Assume labels for each review (modify accordingly)
review_labels = ['positive', 'negative', 'neutral']

# Convert labels to numerical values
label_map = {label: idx for idx, label in enumerate(labels)}
numerical_labels = [label_map[label] for label in review_labels]

# TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(reviews)

# Chi-squared test for feature selection
chi2_scores, _ = chi2(tfidf_features, numerical_labels)

# Create a dictionary to associate feature names with chi-squared scores
feature_scores = {feature_name: score for feature_name, score in zip(tfidf_vectorizer.get_feature_names_out(), chi2_scores)}

# Sort features by their chi-squared scores in descending order
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Print the top N features and their scores
top_n = 20  # Change to the desired number of top features
for feature, score in sorted_features[:top_n]:
    print(f"Feature: {feature}, Chi-squared Score: {score:.2f}")





Feature: average, Chi-squared Score: 0.76
Feature: book, Chi-squared Score: 0.76
Feature: is, Chi-squared Score: 0.76
Feature: just, Chi-squared Score: 0.76
Feature: nothing, Chi-squared Score: 0.76
Feature: remarkable, Chi-squared Score: 0.76
Feature: this, Chi-squared Score: 0.76
Feature: captivating, Chi-squared Score: 0.67
Feature: movie, Chi-squared Score: 0.67
Feature: outstanding, Chi-squared Score: 0.67
Feature: confusing, Chi-squared Score: 0.58
Feature: disappointing, Chi-squared Score: 0.58
Feature: had, Chi-squared Score: 0.58
Feature: plot, Chi-squared Score: 0.58
Feature: subpar, Chi-squared Score: 0.58
Feature: time, Chi-squared Score: 0.58
Feature: the, Chi-squared Score: 0.49
Feature: was, Chi-squared Score: 0.49
Feature: acting, Chi-squared Score: 0.24
Feature: and, Chi-squared Score: 0.24
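
For intuition about what a chi-squared ranking measures, here is a minimal sketch of the common two-by-two contingency-table form of the statistic for a term and a category, chi2 = N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)), based on term presence or absence. The function name `chi_square` and the four toy documents are assumptions for illustration. `sklearn.feature_selection.chi2` works on summed feature values per class instead, so its scores on TF-IDF weights will not match these counts.

```python
def chi_square(term, category, docs, labels):
    """Chi-square score of `term` for `category` from presence/absence counts."""
    n = len(docs)
    pairs = list(zip(docs, labels))
    a = sum(1 for d, y in pairs if term in d and y == category)      # term, in category
    b = sum(1 for d, y in pairs if term in d and y != category)      # term, other category
    c = sum(1 for d, y in pairs if term not in d and y == category)  # no term, in category
    d_ = sum(1 for d, y in pairs if term not in d and y != category) # no term, other category
    denom = (a + c) * (b + d_) * (a + b) * (c + d_)
    # A zero denominator means the term (or category) never varies: no signal.
    return n * (a * d_ - c * b) ** 2 / denom if denom else 0.0

# Toy documents as sets of tokens (illustrative only).
docs = [
    {"captivating", "outstanding", "acting"},
    {"outstanding", "acting"},
    {"confusing", "acting"},
    {"subpar", "acting"},
]
labels = ["positive", "positive", "negative", "negative"]

vocabulary = sorted(set().union(*docs))
ranked = sorted(
    ((term, chi_square(term, "positive", docs, labels)) for term in vocabulary),
    key=lambda item: (-item[1], item[0]),
)
for term, score in ranked:
    print(f"{term}: {score:.2f}")
```

With only a handful of documents, as in the three-review sample above, many terms tie and the ranking mostly reflects which document a word happens to appear in, so a larger labeled sample is needed before the scores are informative.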


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [5]:
pip install transformers

Collecting transformers
  Downloading transformers-4.35.0-py3-none-any.whl (7.9 MB)
Collecting huggingface-hub<1.0,>=0.16.4 (from transformers)
  Downloading huggingface_hub-0.19.0-py3-none-any.whl (311 kB)
Collecting tokenizers<0.15,>=0.14 (from transformers)
  Downloading tokenizers-0.14.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (3.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.4.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Col

In [6]:
# Your code here (Please add comments in the code):
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Sample text data
reviews = [
    "The movie was captivating, and the acting was outstanding!",
    "I had a disappointing time. The plot was confusing, and the acting was subpar.",
    "This book is just average. Nothing remarkable.",
]

# Query text
query = "I'm on the lookout for an exceptional dining spot with mouthwatering dishes and top-notch service."

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Tokenize and encode the query
query_tokens = tokenizer.encode(query, add_special_tokens=True, max_length=128, truncation=True, padding='max_length', return_tensors='pt')

# Encode the query using the BERT model
with torch.no_grad():
    query_embeddings = model(query_tokens)[0][0]  # last_hidden_state of the query: one vector per (padded) token position, shape (128, 768), not only [CLS]

# Calculate cosine similarity between query and text data
similarities = []
for review in reviews:
    review_tokens = tokenizer.encode(review, add_special_tokens=True, max_length=128, truncation=True, padding='max_length', return_tensors='pt')
    with torch.no_grad():
        review_embedding = model(review_tokens)[0][0]  # last_hidden_state of the review: one vector per (padded) token position, shape (128, 768), not only [CLS]
    similarity = cosine_similarity(query_embeddings.reshape(1, -1), review_embedding.reshape(1, -1))
    similarities.append(similarity[0][0])

# Rank the text data by similarity in descending order
ranking = np.argsort(similarities)[::-1]

# Print the ranked text data
print("Ranked Text Data:")
for i, idx in enumerate(ranking):
    print(f"Rank {i + 1}: Similarity {similarities[idx]:.4f} - Text: {reviews[idx]}")



Downloading (…)okenizer_config.json:   0%|          | 0.00/28.0 [00:00<?, ?B/s]

Downloading (…)solve/main/vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

We strongly recommend passing in an `attention_mask` since your input_ids may be padded. See https://huggingface.co/docs/transformers/troubleshooting#incorrect-output-when-padding-tokens-arent-masked.


Ranked Text Data:
Rank 1: Similarity 0.8913 - Text: I had a disappointing time. The plot was confusing, and the acting was subpar.
Rank 2: Similarity 0.8741 - Text: The movie was captivating, and the acting was outstanding!
Rank 3: Similarity 0.8717 - Text: This book is just average. Nothing remarkable.
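
The warning in the output above notes that the padded inputs were passed without an `attention_mask`, and the similarities are computed over all 128 token positions, most of which are padding. That may explain why the three scores are so close together. One common alternative is to pool only the real tokens into a single sentence vector before taking the cosine similarity. The sketch below shows that pooling and ranking step in plain Python on tiny made-up vectors; the names `masked_mean_pool`, `cosine`, and `rank_by_similarity` are hypothetical, and the 2-dimensional vectors are stand-ins for BERT's 768-dimensional token outputs.

```python
import math

def masked_mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose attention-mask value is 1."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scores = [cosine(query_vec, vec) for vec in doc_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in order]

# Toy token vectors; the last position of the query is padding (mask 0).
query_vec = masked_mean_pool([[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]], [1, 1, 0])
doc_vecs = [
    masked_mean_pool([[0.0, 1.0], [0.0, 1.0]], [1, 1]),
    masked_mean_pool([[1.0, 0.2], [5.0, 5.0]], [1, 0]),
]
ranked = rank_by_similarity(query_vec, doc_vecs)
for idx, score in ranked:
    print(f"doc {idx}: {score:.4f}")
```

With the `transformers` library, calling the tokenizer directly (for example `tokenizer(text, return_tensors='pt', padding=True, truncation=True)`) returns an `attention_mask` alongside `input_ids`, which can be passed to the model and reused for this kind of pooling.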
