
## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the questions above):

'''
Task: classify movie reviews as either "positive" or "negative" based on the sentiment expressed in the text.

Bag of Words (BoW) Features:

BoW represents each review as a vector of word counts.
Positive reviews may contain words like "excellent," "amazing," and "enjoyable," while negative reviews may contain words like "disappointing," "awful," and "bad."
Why: BoW features help identify the presence of sentiment-related keywords in the reviews.

Sentence Length Feature:

The number of words in each sentence in the review.
Short sentences may indicate a more straightforward sentiment expression.
Why: Sentence length can be indicative of the review's complexity and potential sentiment.

Punctuation Usage Feature:

The frequency of exclamation marks (!) or question marks (?) in the review.
Positive reviews may contain more exclamation marks, while negative reviews may contain more question marks or ellipses (...).
Why: Punctuation usage can convey emotional intensity or uncertainty.

Capitalization Feature:

The fraction of words in the review that are written entirely in uppercase.
Words in all caps may indicate strong sentiment, whether positive or negative.
Why: Capitalization can emphasize emotional intensity.

Positive and Negative Word Counts:

Count the number of positive and negative sentiment words (e.g., "good," "bad") in the review.
Why: Counting specific sentiment words can directly capture sentiment orientation in the text.

'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [3]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
import string

# Sample movie reviews
reviews = [
    "The movie was excellent! I loved it.",
    "It was a terrible film... so disappointing.",
    "An amazing movie with great actors!",
    "This movie is bad and boring.",
]

# Create a DataFrame to store the reviews
df = pd.DataFrame({'text': reviews})

# Bag of Words (BoW) feature extraction
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['text'])

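# Function to calculate sentence length (words per '.'-delimited fragment).
# Note: splitting on '.' alone yields empty fragments (counted as 0) after "..."
# or a trailing period, and does not treat '!' or '?' as sentence boundaries.
def sentence_length(text):
    sentences = text.split('.')
    return [len(sentence.split()) for sentence in sentences]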

# Function to calculate punctuation usage
def punctuation_usage(text):
    exclamation_count = text.count('!')
    question_count = text.count('?')
    return exclamation_count, question_count

# Function to calculate capitalization (fraction of words in all caps)
def capitalization(text):
    words = text.split()
    if not words:  # avoid division by zero on empty text
        return 0.0
    uppercase_words = sum(1 for word in words if word.isupper())
    return uppercase_words / len(words)

# Function to count positive and negative words
def count_sentiment_words(text):
    positive_words = ['excellent', 'amazing', 'loved', 'great']
    negative_words = ['terrible', 'disappointing', 'bad', 'boring']

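    # Strip surrounding punctuation so that "excellent!" matches "excellent"
    words = [word.strip(string.punctuation) for word in text.lower().split()]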
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    return positive_count, negative_count

# Add BoW features to the DataFrame
bow_feature_names = vectorizer.get_feature_names_out()
for i, feature_name in enumerate(bow_feature_names):
    # Flatten the (n_docs, 1) column to 1-D before assigning it
    df[f'bow_{feature_name}'] = X_bow[:, i].toarray().ravel()

# Add other features to the DataFrame
df['sentence_length'] = df['text'].apply(sentence_length)
df['exclamation_marks'], df['question_marks'] = zip(*df['text'].apply(punctuation_usage))
df['capitalization'] = df['text'].apply(capitalization)
df['positive_words'], df['negative_words'] = zip(*df['text'].apply(count_sentiment_words))

# Display the organized DataFrame with features
print(df)

                                          text  bow_actors  bow_amazing  \
0         The movie was excellent! I loved it.           0            0   
1  It was a terrible film... so disappointing.           0            0   
2          An amazing movie with great actors!           1            1   
3                This movie is bad and boring.           0            0   

   bow_an  bow_and  bow_bad  bow_boring  bow_disappointing  bow_excellent  \
0       0        0        0           0                  0              1   
1       0        0        0           0                  1              0   
2       1        0        0           0                  0              0   
3       0        1        1           1                  0              0   

   bow_film  ...  bow_the  bow_this  bow_was  bow_with  sentence_length  \
0         0  ...        1         0        1         0           [7, 0]   
1         1  ...        0         0        1         0  [5, 0, 0, 2, 0]   
2         0  
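
The `sentence_length` column above contains zeros because splitting on `'.'` alone leaves empty fragments (after "..." or a trailing period) and does not treat "!" or "?" as sentence boundaries. Below is a minimal sketch of one alternative, assuming that any run of `.`, `!` or `?` ends a sentence. The function name `sentence_lengths` is hypothetical and not part of the original notebook.

```python
import re

def sentence_lengths(text):
    """Return the number of words in each non-empty sentence of `text`."""
    # Assumption: a run of '.', '!' or '?' marks a sentence boundary.
    fragments = re.split(r"[.!?]+", text)
    # Skip empty fragments so trailing punctuation does not produce zeros.
    return [len(fragment.split()) for fragment in fragments if fragment.strip()]

print(sentence_lengths("The movie was excellent! I loved it."))         # [4, 3]
print(sentence_lengths("It was a terrible film... so disappointing."))  # [5, 2]
```

This simple rule would still mis-split abbreviations such as "Dr." or decimals such as "3.5", so a tokenizer from an NLP library may be a better choice for real review data.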

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [4]:
# Your code here (please add comments in the code):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2

# Sample movie reviews
reviews = [
    "The movie was excellent! I loved it.",
    "It was a terrible film... so disappointing.",
    "An amazing movie with great actors!",
    "This movie is bad and boring.",
    "The film was great. I thoroughly enjoyed it.",
]

# Create a DataFrame to store the reviews
df = pd.DataFrame({'text': reviews, 'label': ['positive', 'negative', 'positive', 'negative', 'positive']})

# Bag of Words (BoW) feature extraction
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(df['text'])

# Chi-squared feature selection
chi2_scores, _ = chi2(X_bow, df['label'])

# Create a DataFrame to store feature names and their importance scores
feature_scores = pd.DataFrame({'Feature': vectorizer.get_feature_names_out(), 'Chi-squared Score': chi2_scores})

# Sort features by importance in descending order
sorted_features = feature_scores.sort_values(by='Chi-squared Score', ascending=False)

# Display the sorted features
print(sorted_features)

          Feature  Chi-squared Score
11             is           1.500000
3             and           1.500000
4             bad           1.500000
5          boring           1.500000
6   disappointing           1.500000
18           this           1.500000
16       terrible           1.500000
15             so           1.500000
17            the           1.333333
10          great           1.333333
13          loved           0.666667
19     thoroughly           0.666667
0          actors           0.666667
1         amazing           0.666667
8       excellent           0.666667
7         enjoyed           0.666667
2              an           0.666667
21           with           0.666667
9            film           0.083333
12             it           0.055556
14          movie           0.055556
20            was           0.055556
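
To see where these scores come from, the chi-squared statistic for a single term can be worked out by hand. The sketch below is not scikit-learn's implementation. It assumes the usual observed-versus-expected formulation, where the expected count of a term in a class is the class's share of documents times the term's total count. The function name `chi2_score` and the hand-written count vectors are illustrative.

```python
def chi2_score(term_counts, labels):
    """Chi-squared statistic between one term's per-document counts and the labels.

    Assumes the term occurs at least once, so every expected value is non-zero.
    """
    total_docs = len(labels)
    total_count = sum(term_counts)
    score = 0.0
    for cls in set(labels):
        # Observed: how often the term occurs in documents of this class.
        observed = sum(c for c, y in zip(term_counts, labels) if y == cls)
        # Expected: the class's share of documents times the term's total count.
        expected = labels.count(cls) / total_docs * total_count
        score += (observed - expected) ** 2 / expected
    return score

labels = ["positive", "negative", "positive", "negative", "positive"]
bad = [0, 0, 0, 1, 0]    # "bad" occurs only in the fourth review
great = [0, 0, 1, 0, 1]  # "great" occurs in the third and fifth reviews
film = [0, 1, 0, 0, 1]   # "film" occurs in the second and fifth reviews

print(round(chi2_score(bad, labels), 6))    # 1.5
print(round(chi2_score(great, labels), 6))  # 1.333333
print(round(chi2_score(film, labels), 6))   # 0.083333
```

With only five reviews, any word that appears once in a negative review gets the same score of 1.5, which is why function words such as "is" and "this" tie with "bad" and "terrible" at the top of the ranking.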


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [8]:
!pip install transformers


In [10]:
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertModel
import torch

# Sample movie reviews
reviews = [
    "The movie was excellent! I loved it.",
    "It was a terrible film... so disappointing.",
    "An amazing movie with great actors!",
    "This movie is bad and boring.",
]

# Query text
query = "I want to find reviews for a great movie."

# Load pre-trained BERT model and tokenizer
model_name = 'bert-base-uncased'
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Tokenize and embed the query using BERT
qry_tkn = tokenizer(query, padding=True, truncation=True, return_tensors="pt")
qry_embedding = model(**qry_tkn).last_hidden_state.mean(dim=1).detach().numpy()

# Tokenize and embed movie reviews using BERT
# Note: mean(dim=1) averages over every position, including [PAD] tokens added by padding=True
review_tokens = tokenizer(reviews, padding=True, truncation=True, return_tensors="pt")
review_embeddings = model(**review_tokens).last_hidden_state.mean(dim=1).detach().numpy()

# Calculate cosine similarity between the query and each movie review
similarities = np.dot(qry_embedding, review_embeddings.T) / (np.linalg.norm(qry_embedding) * np.linalg.norm(review_embeddings, axis=1))

# Create a DataFrame to store movie reviews and similarities
result_data = pd.DataFrame({'Review': reviews, 'Similarity': similarities[0]})

# Rank movie reviews by similarity in descending order
result_data = result_data.sort_values(by='Similarity', ascending=False)

# Display the ranked movie reviews
print("Ranked Movie Reviews based on Text Similarity:")
print(result_data)

Ranked Movie Reviews based on Text Similarity:
                                        Review  Similarity
0         The movie was excellent! I loved it.    0.741413
3                This movie is bad and boring.    0.725099
2          An amazing movie with great actors!    0.705845
1  It was a terrible film... so disappointing.    0.613262
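
Because the reviews are padded to a common length, `last_hidden_state.mean(dim=1)` also averages over `[PAD]` positions, which may shift the embeddings of shorter reviews. The sketch below shows the idea of masked mean pooling and cosine similarity in plain Python on toy vectors. It does not load BERT. The helper names `masked_mean` and `cosine_similarity` and the toy numbers are assumptions made for illustration.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    # zip(*kept) iterates over embedding dimensions instead of tokens.
    return [sum(column) / len(kept) for column in zip(*kept)]

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy "sentence": two real tokens followed by one padding position.
tokens = [[1.0, 3.0], [3.0, 5.0], [9.0, -9.0]]
mask = [1, 1, 0]

pooled = masked_mean(tokens, mask)
print(pooled)  # [2.0, 4.0]: the padding row is ignored
print(cosine_similarity(pooled, [1.0, 2.0]))
```

In the notebook's setting, the mask would come from `review_tokens["attention_mask"]`, and the same weighting could be applied to the tensors before averaging.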
