
## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above question):

'''
Please write your answer here:
An interesting text classification or text mining task could be sentiment analysis of customer reviews for a restaurant. The goal is to determine whether a customer review is positive, negative, or neutral based on the text content. Here are five different types of features that could be useful for building a machine learning model for this task:

1. Bag of Words (BoW) Features:
   - Features: This approach involves creating a vocabulary of all unique words in the corpus and representing each review as a vector where each dimension corresponds to a word in the vocabulary. The value in each dimension represents the frequency or presence of the word in the review.
   - Why it's helpful: BoW features capture the overall word usage in a review, allowing the model to identify specific words or phrases associated with positive or negative sentiment.

2. TF-IDF (Term Frequency-Inverse Document Frequency) Features:
   - Features: TF-IDF represents the importance of each word in a review relative to the entire corpus. It takes into account the frequency of the word in the review (term frequency) and inversely scales it by the frequency of the word across all reviews (inverse document frequency).
   - Why it's helpful: TF-IDF features highlight words that are unique to a review and have a strong impact on sentiment. Words that are common across all reviews are downweighted.

3. Word Embeddings:
   - Features: Word embeddings, such as Word2Vec or GloVe, represent words as dense vectors in a continuous vector space. These pre-trained embeddings can be used to convert individual words in a review into fixed-length vectors.
   - Why it's helpful: Word embeddings capture semantic relationships between words, allowing the model to understand context and identify synonyms or related terms for sentiment analysis.

4. N-grams:
   - Features: N-grams represent sequences of 'n' consecutive words in a review. For example, bigrams (n=2) would capture pairs of adjacent words.
   - Why it's helpful: N-grams help the model capture local context and dependencies between words, which can be crucial for sentiment analysis. For instance, "not good" is different from "very good."

5. Sentiment Lexicons:
   - Features: Sentiment lexicons are lists of words with associated sentiment scores (e.g., positive, negative, or neutral). You can calculate the sentiment score of a review by summing the scores of the words it contains from the lexicon.
   - Why it's helpful: Sentiment lexicons provide explicit sentiment labels for individual words, which can be used to assign an overall sentiment score to a review. This can be especially useful when dealing with domain-specific sentiment terms.

These features collectively provide a rich representation of the text data, allowing the machine learning model to learn patterns and relationships that are indicative of sentiment in customer reviews. Depending on the specific problem and dataset, a combination of these features or more advanced techniques like deep learning with recurrent neural networks (RNNs) or transformers may be employed to improve performance.


'''
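The "not good" versus "very good" point from the N-grams item can be made concrete with a minimal sketch. It assumes a plain whitespace tokenizer, two made-up reviews, and a hypothetical helper named `ngrams`; none of these are part of the exercise code.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) over a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Two toy reviews with opposite sentiment (illustrative only)
negative = "the pasta was not good".split()
positive = "the pasta was very good".split()

# Unigrams: both reviews contain "good", so that word alone cannot separate them
shared_unigrams = set(negative) & set(positive)

# Bigrams: "not good" and "very good" become distinct features
neg_only = set(ngrams(negative, 2)) - set(ngrams(positive, 2))
pos_only = set(ngrams(positive, 2)) - set(ngrams(negative, 2))

print(sorted(shared_unigrams))
print(sorted(neg_only))
print(sorted(pos_only))
```

With unigrams alone the two reviews share four of their five words, while the bigram sets isolate the pairs that carry the polarity.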

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [9]:
# Your code here (Please add comments in the code):

import nltk
import re
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import gensim.downloader as api

# Download NLTK data (tokenizers)
nltk.download('punkt')

# Sample text data
sample_reviews = [
    "The food at this restaurant is amazing. I love their pasta!",
    "Service was slow, and the food was terrible. I won't come back.",
    "The ambiance was nice, and the staff was friendly.",
]

# Tokenize the text into words
def tokenize_text(text):
    # Use NLTK's word_tokenize function and filter out non-alphanumeric tokens
    tokens = nltk.word_tokenize(text)
    tokens = [token for token in tokens if re.match(r'^\w+$', token)]
    return tokens

# Bag of Words (BoW) features
vectorizer = CountVectorizer(tokenizer=tokenize_text)
bow_features = vectorizer.fit_transform(sample_reviews)

# TF-IDF features
tfidf_vectorizer = TfidfVectorizer(tokenizer=tokenize_text)
tfidf_features = tfidf_vectorizer.fit_transform(sample_reviews)

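# Word Embeddings (Word2Vec)
# Note: this downloads a large pre-trained model on first use
word2vec_model = api.load("word2vec-google-news-300")
# Look up a 300-dimensional vector for each token of the first review,
# skipping tokens missing from the model's vocabulary to avoid a KeyError
word2vec_features = [word2vec_model[token] for token in tokenize_text(sample_reviews[0]) if token in word2vec_model]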


# N-grams (Bigrams)
ngram_vectorizer = CountVectorizer(tokenizer=tokenize_text, ngram_range=(2, 2))
ngram_features = ngram_vectorizer.fit_transform(sample_reviews)

# Sentiment Lexicons (Sample Lexicon)
sentiment_lexicon = {
    "amazing": 1.0,
    "terrible": -1.0,
    "friendly": 0.5,
}
sentiment_features = [sum(sentiment_lexicon.get(word, 0) for word in tokenize_text(review)) for review in sample_reviews]

# Print extracted features
print("Bag of Words (BoW) Features:")
print(bow_features.toarray())

print("\nTF-IDF Features:")
print(tfidf_features.toarray())

print("\nWord Embeddings (Word2Vec) Features:")
print(word2vec_features)

print("\nN-grams (Bigrams) Features:")
print(ngram_features.toarray())

print("\nSentiment Lexicon Features:")
print(sentiment_features)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Bag of Words (BoW) Features:
[[1 0 0 1 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 1 0 0]
 [0 0 1 0 1 1 1 0 1 0 0 0 0 0 1 1 0 1 1 0 0 2 1]
 [0 1 1 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 2 0 0 2 0]]

TF-IDF Features:
[[0.32434681 0.         0.         0.32434681 0.         0.
  0.24667411 0.         0.24667411 0.32434681 0.32434681 0.
  0.32434681 0.32434681 0.         0.         0.         0.
  0.19156445 0.32434681 0.32434681 0.         0.        ]
 [0.         0.         0.23585598 0.         0.31012227 0.31012227
  0.23585598 0.         0.23585598 0.         0.         0.
  0.         0.         0.31012227 0.31012227 0.         0.31012227
  0.18316321 0.         0.         0.47171196 0.31012227]
 [0.         0.34737079 0.26418444 0.         0.         0.
  0.         0.34737079 0.         0.         0.         0.34737079
  0.         0.         0.         0.         0.34737079 0.
  0.41032556 0.         0.         0.52836887 0.        ]]

Word Embeddings (Word2Vec) Features:
[array([-0.17285156,  0.2792
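The TF-IDF values printed above can be reproduced by hand, which may help in reading the matrix. The sketch below is a from-scratch calculation, not the code used in the cell: it assumes scikit-learn's default settings (smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) and swaps in a simple regex tokenizer for `nltk.word_tokenize`. The regex splits "won't" differently, which does not affect the first review.

```python
import math
import re
from collections import Counter

reviews = [
    "The food at this restaurant is amazing. I love their pasta!",
    "Service was slow, and the food was terrible. I won't come back.",
    "The ambiance was nice, and the staff was friendly.",
]

# Assumption: a lowercase regex tokenizer stands in for nltk.word_tokenize
docs = [re.findall(r"\w+", text.lower()) for text in reviews]
n_docs = len(docs)

# Document frequency: number of reviews that contain each term
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """TF-IDF with smoothed IDF and L2 normalization for one tokenized review."""
    counts = Counter(doc)
    raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: w / norm for t, w in raw.items()}

weights = tfidf(docs[0])
for term in ("amazing", "food", "the"):
    print(f"{term}: {weights[term]:.4f}")
# amazing: 0.3243
# food: 0.2467
# the: 0.1916
```

These three numbers match the first row of the printed matrix: "amazing" occurs in one review and gets the largest weight, "food" occurs in two, and "the" occurs in all three and is weighted lowest.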

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [11]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif

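# Sample labels, one per review in sample_reviews (for demonstration only)
labels = ["positive", "negative", "neutral"]

# Map each label to an integer class index (here: 0, 1, 2)
label_indices = [labels.index(label) for label in labels]

# Convert the label indices to a NumPy array
label_indices = np.array(label_indices)

# Calculate Mutual Information between features and labels.
# mutual_info_classif treats the columns of a sparse matrix as discrete by default.
# With only three reviews, each in its own class, a score mainly reflects how many
# distinct values a feature takes, so the ranking below is illustrative only.
mi_scores = mutual_info_classif(tfidf_features, label_indices)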

# Get feature names from the TfidfVectorizer
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())

# Create a dictionary mapping feature names to their importance scores
feature_mi_scores = {feature: score for feature, score in zip(feature_names, mi_scores)}

# Rank features by importance in descending order
sorted_features = sorted(feature_mi_scores.items(), key=lambda x: x[1], reverse=True)

# Print the ranked features and their importance scores
for feature, score in sorted_features:
    print(f"Feature: {feature}, MI Score: {score:.4f}")


Feature: and, MI Score: 1.0986
Feature: food, MI Score: 1.0986
Feature: i, MI Score: 1.0986
Feature: the, MI Score: 1.0986
Feature: was, MI Score: 1.0986
Feature: amazing, MI Score: 0.6365
Feature: ambiance, MI Score: 0.6365
Feature: at, MI Score: 0.6365
Feature: back, MI Score: 0.6365
Feature: come, MI Score: 0.6365
Feature: friendly, MI Score: 0.6365
Feature: is, MI Score: 0.6365
Feature: love, MI Score: 0.6365
Feature: nice, MI Score: 0.6365
Feature: pasta, MI Score: 0.6365
Feature: restaurant, MI Score: 0.6365
Feature: service, MI Score: 0.6365
Feature: slow, MI Score: 0.6365
Feature: staff, MI Score: 0.6365
Feature: terrible, MI Score: 0.6365
Feature: their, MI Score: 0.6365
Feature: this, MI Score: 0.6365
Feature: wo, MI Score: 0.6365
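The two distinct scores in this ranking (1.0986 and 0.6365) can be traced back to the definition of mutual information. The sketch below is a hand calculation under stated assumptions: each TF-IDF column is treated as a discrete variable (which appears consistent with how scikit-learn handles sparse input by default), the column values are rounded from the TF-IDF output printed for Question 2, and `mutual_information` is a hypothetical helper, not a library function.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

# One class index per review: positive, negative, neutral
class_ids = [0, 1, 2]

# TF-IDF columns for two terms, rounded from the printed matrix
and_column = [0.0, 0.2359, 0.2642]      # three distinct values
amazing_column = [0.3243, 0.0, 0.0]     # two distinct values

mi_and = mutual_information(and_column, class_ids)
mi_amazing = mutual_information(amazing_column, class_ids)
print(f"and: {mi_and:.4f}")
print(f"amazing: {mi_amazing:.4f}")
# and: 1.0986
# amazing: 0.6365
```

Because every review has its own class, a column with three distinct values determines the class completely and scores ln 3 ≈ 1.0986, which is why stop words such as "and" and "the" land at the top. A term that appears in a single review only separates that review from the other two, giving 0.6365. A larger labeled sample would be needed for the ranking to say anything about sentiment.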




Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [26]:
%pip install transformers scipy



In [28]:
import numpy as np
from scipy.spatial.distance import cosine
from transformers import BertTokenizer, BertModel
import torch

# Sample text data
sample_reviews = [
    "The food at this restaurant is amazing. I love their pasta!",
    "Service was slow, and the food was terrible. I won't come back.",
    "The ambiance was nice, and the staff was friendly.",
]

# Query
query = "I'm looking for a restaurant with great food and friendly staff."

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def embed(text):
    """Return a mean-pooled BERT embedding for a piece of text as a 1-D NumPy array."""
    # Tokenize and encode the text, adding [CLS] and [SEP], with a batch dimension
    token_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)
    # Average the last hidden states over all tokens, without tracking gradients
    with torch.no_grad():
        return model(token_ids)[0].mean(dim=1).numpy().flatten()

# Generate the BERT embedding for the query
query_embeddings = embed(query)

# Calculate cosine similarity between the query and each document
similarities = []
for review in sample_reviews:
    # Generate the BERT embedding for the document
    doc_embeddings = embed(review)

    # SciPy's cosine() returns a distance, so similarity = 1 - distance
    similarity = 1 - cosine(query_embeddings, doc_embeddings)
    similarities.append(similarity)

# Rank documents based on similarity in descending order
ranked_indices = np.argsort(similarities)[::-1]
ranked_documents = [sample_reviews[i] for i in ranked_indices]

# Print the ranked documents and their similarity scores
for i, document in enumerate(ranked_documents):
    print(f"Rank {i+1}: Similarity Score: {similarities[ranked_indices[i]]:.4f}")
    print(document)
    print()


Rank 1: Similarity Score: 0.7755
The food at this restaurant is amazing. I love their pasta!

Rank 2: Similarity Score: 0.7386
Service was slow, and the food was terrible. I won't come back.

Rank 3: Similarity Score: 0.7386
The ambiance was nice, and the staff was friendly.
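The ranking step depends only on cosine similarity, so it can be illustrated without loading BERT. The following is a minimal sketch with hypothetical 3-dimensional vectors standing in for the 768-dimensional `bert-base-uncased` embeddings; the names `cosine_similarity`, `query_vec`, and `doc_vecs` are made up for this example.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings" (illustrative values, not BERT output)
query_vec = [1.0, 2.0, 0.0]
doc_vecs = {
    "doc_a": [2.0, 4.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 0.0, 0.0],  # partially aligned with the query
    "doc_c": [0.0, 0.0, 3.0],  # orthogonal to the query
}

# Score every document against the query, then sort from most to least similar
scores = {name: cosine_similarity(query_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
# ['doc_a', 'doc_b', 'doc_c']
```

Cosine similarity ignores vector length, so `doc_a` scores highest even though it is twice as long as the query. Mean-pooled BERT embeddings of short texts tend to point in similar directions, which is one possible reason the scores above sit in a narrow range.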

