<a href="https://colab.research.google.com/github/EshwithaNaini/EshwithaNaini_INFO5731_Spring2023/blob/main/In_class_exercise_03_02282023.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
An interesting text classification task is classifying customer reviews of a product as positive or negative. This helps businesses understand how customers feel about their products and services so they can improve their offerings.

A machine learning model for this task can draw on several types of features:

Bag of Words (BoW): Counts how often each word occurs in the text, regardless of word order. BoW is useful because it captures the frequency of words that signal positive or negative sentiment.

N-grams: Contiguous sequences of n words. N-grams capture local context that single words miss, such as the negation in "not good".

POS tagging: Assigns each word a part of speech, such as noun, verb, or adjective. POS tags help identify sentiment-bearing words such as adjectives and adverbs.

Sentiment lexicons: Predefined lists of positive and negative words or phrases. Lexicon matches give the model a direct signal of a text's sentiment.

Punctuation and capitalization: These can convey the intensity of the emotion in a text. A message written in all caps, for example, may express stronger emotion.

In summary, combining BoW, n-grams, POS tagging, sentiment lexicons, and punctuation and capitalization features can make the model more accurate and robust.
'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction. 

In [None]:
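import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer

# Download the NLTK resources used below (skipped if they are already present).
# Newer NLTK releases may need "punkt_tab" and "averaged_perceptron_tagger_eng" instead.
for resource in ["punkt", "stopwords", "averaged_perceptron_tagger", "vader_lexicon"]:
    nltk.download(resource, quiet=True)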

# Sample text data
text_data = [
    "I really enjoyed the movie, it was great!",
    "The food was terrible, I would not recommend it to anyone.",
    "The customer service was excellent, very friendly staff.",
    "The product was not what I expected, I was disappointed.",
]

# Define functions for feature extraction
# Build the stop-word set once instead of re-reading the list for every token
STOP_WORDS = set(stopwords.words("english"))

def clean_tokens(text):
    """Return lowercase alphabetic tokens with English stop words removed."""
    # Lowercase before filtering so capitalized stop words ("I", "The") are removed too
    tokens = [word.lower() for word in word_tokenize(text) if word.isalpha()]
    return [token for token in tokens if token not in STOP_WORDS]

def bag_of_words(text):
    return dict(nltk.FreqDist(clean_tokens(text)))

def ngrams(text, n=2):
    return list(nltk.ngrams(clean_tokens(text), n))

def pos_tagging(text):
    words = word_tokenize(text)
    return nltk.pos_tag(words)

def sentiment(text):
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(text)

def punctuation(text):
    num_exclamations = text.count('!')
    num_question_marks = text.count('?')
    num_punctuations = num_exclamations + num_question_marks
    return {'exclamations': num_exclamations, 'question_marks': num_question_marks, 'total_punctuations': num_punctuations}

# Example usage
for text in text_data:
    print("Text:", text)
    print("Bag of Words:", bag_of_words(text))
    print("Bigrams:", ngrams(text, n=2))
    print("POS Tagging:", pos_tagging(text))
    print("Sentiment:", sentiment(text))
    print("Punctuation:", punctuation(text))
    print("")


Text: I really enjoyed the movie, it was great!
Bag of Words: {'really': 1, 'enjoyed': 1, 'movie': 1, 'great': 1}
Bigrams: [('really', 'enjoyed'), ('enjoyed', 'movie'), ('movie', 'great')]
POS Tagging: [('I', 'PRP'), ('really', 'RB'), ('enjoyed', 'VBD'), ('the', 'DT'), ('movie', 'NN'), (',', ','), ('it', 'PRP'), ('was', 'VBD'), ('great', 'JJ'), ('!', '.')]
Sentiment: {'neg': 0.0, 'neu': 0.385, 'pos': 0.615, 'compound': 0.8395}
Punctuation: {'exclamations': 1, 'question_marks': 0, 'total_punctuations': 1}

Text: The food was terrible, I would not recommend it to anyone.
Bag of Words: {'food': 1, 'terrible': 1, 'would': 1, 'recommend': 1, 'anyone': 1}
Bigrams: [('food', 'terrible'), ('terrible', 'would'), ('would', 'recommend'), ('recommend', 'anyone')]
POS Tagging: [('The', 'DT'), ('food', 'NN'), ('was', 'VBD'), ('terrible', 'JJ'), (',', ','), ('I', 'PRP'), ('would', 'MD'), ('not', 'RB'), ('recommend', 'VB'), ('it',
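
The answer to Question 1 lists capitalization as a feature, but the `punctuation` function above only counts exclamation and question marks. A minimal sketch of how capitalization cues could be extracted is shown here; the function name `capitalization_features` and the two feature names are illustrative assumptions, not part of the original notebook.

```python
import re

def capitalization_features(text):
    """Count simple capitalization cues that may signal emphasis."""
    words = re.findall(r"[A-Za-z]+", text)
    # Only count all-caps words longer than one letter, so the pronoun "I" is ignored
    all_caps = [w for w in words if len(w) > 1 and w.isupper()]
    letters = [ch for ch in text if ch.isalpha()]
    # Share of alphabetic characters that are uppercase (0.0 for text without letters)
    upper_ratio = sum(ch.isupper() for ch in letters) / len(letters) if letters else 0.0
    return {
        "all_caps_words": len(all_caps),
        "uppercase_ratio": round(upper_ratio, 3),
    }

# Hypothetical review with shouted words, used only for illustration
print(capitalization_features("The food was TERRIBLE, I would NOT recommend it!"))
```

These counts could be merged with the dictionary returned by `punctuation` to form one emphasis feature group.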

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above, rank the features based on their importance in the descending order. 

In [None]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import CountVectorizer

# Convert the text data into feature vectors using bag of words
corpus = []
for text in text_data:
    corpus.append(' '.join(word_tokenize(text)))
    
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus).toarray()
y = np.array([1, 0, 1, 0])  # Positive reviews are labeled as 1, negative reviews are labeled as 0

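# Compute the MI scores for each feature.
# Note: for a dense array, mutual_info_classif treats features as continuous by default
# and adds small random noise, so scores can vary between runs. For word counts,
# discrete_features=True (with a fixed random_state) is usually more appropriate.
mi_scores = mutual_info_classif(X, y)

# Create a dictionary of feature name and MI score pairs
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
feature_scores = dict(zip(vectorizer.get_feature_names_out(), mi_scores))

# Rank the features by importance in descending order
ranked_features = sorted(feature_scores, key=feature_scores.get, reverse=True)

# Print the top 5 features with the highest MI scores
print("Top 5 features with the highest MI scores:")
for feature in ranked_features[:5]:
    print(feature, feature_scores[feature])
print()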

Top 5 features with the highest MI scores:
not 0.8333333333333331
the 0.8333333333333331
service 0.45833333333333315
disappointed 0.20833333333333315
enjoyed 0.20833333333333315
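
To make the ranking above easier to interpret, the following sketch computes mutual information by hand for binary term presence, using only the standard library. It is an illustration under stated assumptions, not the estimator scikit-learn uses: tokens are taken with a simple regular expression, each feature is "term present or absent", and the helper name `mutual_information` is hypothetical.

```python
import math
import re
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        mi += p_xy * math.log(p_xy / ((px[x] / n) * (py[y] / n)))
    return mi

# Same four reviews and labels as in the cell above (1 = positive, 0 = negative)
docs = [
    "I really enjoyed the movie, it was great!",
    "The food was terrible, I would not recommend it to anyone.",
    "The customer service was excellent, very friendly staff.",
    "The product was not what I expected, I was disappointed.",
]
labels = [1, 0, 1, 0]

# Assumption: binary presence features over lowercase alphabetic tokens
tokenized = [set(re.findall(r"[a-z]+", doc.lower())) for doc in docs]
vocab = sorted(set().union(*tokenized))
scores = {
    term: mutual_information([int(term in doc) for doc in tokenized], labels)
    for term in vocab
}

# Rank terms by MI in descending order and show the top five
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for term, score in ranked[:5]:
    print(f"{term}: {score:.4f}")
```

Under this exact computation, "not" gets the maximum score of ln 2 (about 0.693) because it appears in exactly the negative reviews, while "the" scores 0 because it appears in every review and so carries no class information.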




Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [2]:
pip install transformers


Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting transformers
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1
  Downloading tokenizers-0.13.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.11.0
  Downloading huggingface_hub-0.13.1-py3-none-any.whl (199 kB)
Installing collected packages: tokenizers, huggingface-hub, transformers
Successfully installed huggingface-hub-0.13.1 tokenizers-0.13.2 transformers-4.26.1


In [3]:
pip list



In [7]:
from transformers import BertTokenizer, BertModel
import torch
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example text data
text_data = ["This product is amazing and works really well.",
             "I was disappointed with this product, it didn't work as expected.",
             "I'm very happy with my purchase, the product exceeded my expectations.",
             "This product is terrible, it doesn't work at all."]

# BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Query
query = "I'm very happy with my purchase, the product works perfectly."

def embed(text):
    """Return the mean-pooled last-layer BERT embedding of `text` as a NumPy vector."""
    # Tokenize and encode the text (adds [CLS] and [SEP]), then add a batch dimension
    input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)
    with torch.no_grad():
        last_hidden_states = model(input_ids)[0]
    # Average the token vectors to get a single vector for the whole text
    return torch.mean(last_hidden_states, dim=1).squeeze().numpy()

# Generate the query vector
query_vector = embed(query)

# Generate one vector per text
text_vectors = [embed(text) for text in text_data]

# Compute the cosine similarity between the query vector and each text vector
similarity_scores = cosine_similarity([query_vector], text_vectors)

# Rank the text data based on their similarity scores
ranked_text_data = [text_data[i] for i in similarity_scores.argsort()[0][::-1]]

# Print the ranked text data
print("Ranked text data based on similarity to the query:")
for text in ranked_text_data:
    print(text)




Ranked text data based on similarity to the query:
I'm very happy with my purchase, the product exceeded my expectations.
I was disappointed with this product, it didn't work as expected.
This product is amazing and works really well.
This product is terrible, it doesn't work at all.
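
The output above lists the texts in ranked order but not the similarity scores themselves. The sketch below shows the cosine-similarity ranking step on its own, returning each document index together with its score. The three-dimensional vectors are made-up stand-ins for BERT embeddings, and the helper names `cosine` and `rank_by_similarity` are assumptions for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, score) pairs sorted by score, highest first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors standing in for real embeddings (hypothetical values)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 0.9],   # nearly parallel to the query
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [0.5, 0.5, 0.5],   # partially aligned with the query
]

for idx, score in rank_by_similarity(query_vec, doc_vecs):
    print(f"doc {idx}: {score:.4f}")
```

With real embeddings, `query_vector` and `text_vectors` from the cell above could be passed in place of the toy vectors, and the returned indices used to look up the matching entries in `text_data`.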
