<a href="https://colab.research.google.com/github/MMR1318/Maheshreddy_INFO5731_Fall2024/blob/main/Mottakatla_Mahesh_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will be penalized 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above questions):

'''
Text Classification Task: Sentiment Analysis of Product Reviews
Task Description: An interesting text classification task is sentiment analysis of product reviews on e-commerce websites. The aim is to classify the text of each review as positive, negative, or neutral. This information helps businesses understand their customers’ opinions and improve their goods or services.
Features That Can Be Used to Build the Machine Learning Model
1.	Word Frequencies (Bag of Words):
o	Description: This feature counts how often every word occurs in each review.
o	Why It's Helpful: It captures how many terms that typically express an attitude (positive or negative) appear, such as love, hate, excellent, or poor. Frequently used opinion words are often good predictors of the reviewer's overall opinion.
2.	TF-IDF (Term Frequency-Inverse Document Frequency):
o	Description: This feature measures how important a word is to one document relative to the whole collection. It compares how often a word occurs in a given review with how many of the reviews contain it.
o	Why It's Helpful: TF-IDF highlights words that are distinctive for one review and down-weights words that recur across most reviews and therefore carry little information. This can improve sentiment analysis because the model focuses on the terms that actually distinguish one review from another.
3.	Sentiment Lexicon Scores:
o	Description: Each word is assigned a value indicating whether it is positive, negative, or neutral, taken from a predefined lexicon (such as AFINN or SentiWordNet).
o	Why It's Helpful: Adding up the scores of the words in a review gives a quick overall sentiment score that can be fed into the classifier. Because polarity is expressed as a number, it is easy to use as a feature.
4.	N-grams:
o	Description: N-grams are contiguous sequences of n items (characters or words) taken from the text. Common choices are unigrams (one word), bigrams (two consecutive words), and trigrams (three consecutive words).
o	Why It's Helpful: N-grams capture context and short phrases that signal sentiment, such as “not good” or “very satisfied.” This helps detect sentiment that single words miss, for example when a negation reverses the meaning of the following word.
5.	Text Length:
o	Description: This feature is the total number of words or characters in the review.
o	Why It's Helpful: The length of a review can be related to its sentiment; for example, a long review contains more detail, which may make the sentiment easier to classify. A very short review, on the other hand, can still express a strongly negative tone (e.g. “Bad!”) or a brief positive comment (e.g. ‘Great!’).
Summary
Sentiment analysis of product reviews is an interesting text classification task. Word frequencies, TF-IDF scores, sentiment lexicon scores, n-grams, and text length each provide information about the sentiment a review expresses. Combined, these features can be used to train a machine learning model that helps businesses determine how their customers feel and adjust their strategies accordingly.
'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample text data for the feature extraction.

In [1]:
# Your code here (Please add comments in the code):

import pandas as pd
import numpy as np
import nltk
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.util import ngrams

# Download the VADER lexicon for sentiment analysis
nltk.download('vader_lexicon')
# Download the Punkt tokenizer models used by nltk.word_tokenize
nltk.download('punkt')

# Sample product reviews
reviews = [
    "I love this product! It works great and I use it every day.",
    "This is the worst purchase I have ever made. Totally dissatisfied.",
    "It's okay, nothing special but it gets the job done.",
    "Fantastic quality! Highly recommend it to everyone.",
    "Not worth the price. I expected much better quality.",
]

# Create a DataFrame
df = pd.DataFrame(reviews, columns=['review'])


[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [2]:
# 1. Word Frequencies (Bag of Words)
count_vectorizer = CountVectorizer()
word_freq = count_vectorizer.fit_transform(df['review'])
word_freq_df = pd.DataFrame(word_freq.toarray(), columns=count_vectorizer.get_feature_names_out())


In [3]:
# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['review'])
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
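
To make the TF-IDF weights less of a black box, here is a minimal pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts, a smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, then L2 normalization of each document vector. The helper names `tokenize` and `tfidf` and the two sample documents are illustrative assumptions, not part of the exercise.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep tokens of two or more word characters,
    # mirroring the default token pattern of scikit-learn's vectorizers.
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs):
    """Return (vocabulary, rows) where each row maps term -> TF-IDF weight."""
    counts_per_doc = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    vocab = sorted(set().union(*counts_per_doc))
    # Document frequency: number of documents that contain the term.
    df = {t: sum(1 for c in counts_per_doc if t in c) for t in vocab}
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    rows = []
    for counts in counts_per_doc:
        weights = {t: counts[t] * idf[t] for t in counts}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        # L2-normalize so every non-empty document vector has length 1.
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return vocab, rows


# Hypothetical two-document corpus for illustration.
sample_docs = ["good good product", "bad product"]
vocab, rows = tfidf(sample_docs)
for doc, row in zip(sample_docs, rows):
    print(doc, {t: round(w, 4) for t, w in row.items()})
```

In this toy corpus "product" occurs in both documents, so it receives the lowest IDF and ends up with a smaller weight than "good" or "bad", which is the down-weighting effect described in Question 1.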


In [4]:
# 3. Sentiment Lexicon Scores using VADER
sia = SentimentIntensityAnalyzer()
df['sentiment_scores'] = df['review'].apply(lambda x: sia.polarity_scores(x)['compound'])
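
The idea of an additive lexicon score can be illustrated with a small pure-Python sketch. The dictionary `TOY_LEXICON` and its valence values are made up for illustration and are not the real VADER lexicon. The squashing step `score / sqrt(score**2 + alpha)` with `alpha = 15` is modeled on the normalization VADER uses to keep its compound score between -1 and 1. The real analyzer additionally applies heuristics (negation, intensifiers, punctuation, capitalization) that this sketch ignores.

```python
import math
import re

# Hypothetical valence values, for illustration only.
TOY_LEXICON = {
    "love": 3.0,
    "great": 3.0,
    "fantastic": 2.5,
    "okay": 1.0,
    "worst": -3.0,
    "dissatisfied": -2.0,
}


def toy_compound(text, lexicon=TOY_LEXICON, alpha=15.0):
    """Sum word valences, then squash the total into the open interval (-1, 1)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    raw = sum(lexicon.get(tok, 0.0) for tok in tokens)
    return raw / math.sqrt(raw * raw + alpha)


for sentence in [
    "I love this product! It works great.",
    "This is the worst purchase I have ever made.",
    "It arrived on Tuesday.",
]:
    print(f"{toy_compound(sentence):+.4f}  {sentence}")
```

A purely additive score like this cannot handle "not worth the price", because no single word in that phrase is negative on its own; that limitation is one reason to combine lexicon scores with n-gram features.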


In [5]:
# 4. N-grams (Bigrams and Trigrams)
def extract_ngrams(text, n):
    tokens = nltk.word_tokenize(text)
    return list(ngrams(tokens, n))

df['bigrams'] = df['review'].apply(lambda x: extract_ngrams(x, 2))
df['trigrams'] = df['review'].apply(lambda x: extract_ngrams(x, 3))
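
Because `nltk.word_tokenize` keeps punctuation marks as separate tokens, the bigrams and trigrams produced above include pairs such as a word followed by "!" or ".". If word-only n-grams are preferred, a minimal alternative sketch without NLTK is shown below. The function name `word_ngrams` and the regex tokenizer are assumptions made for this example, not part of the exercise code.

```python
import re


def word_ngrams(text, n):
    """Return word n-grams as tuples, ignoring punctuation and case."""
    tokens = re.findall(r"\w+", text.lower())
    # Zip n shifted copies of the token list to form sliding windows.
    return list(zip(*(tokens[i:] for i in range(n))))


print(word_ngrams("Not worth the price.", 2))
# [('not', 'worth'), ('worth', 'the'), ('the', 'price')]
```

The bigram `('not', 'worth')` keeps the negation next to the word it modifies, which a unigram model would lose.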


In [6]:
# 5. Text Length (number of whitespace-separated words)
df['text_length'] = df['review'].apply(lambda x: len(x.split()))

# Display the extracted features
print("Word Frequencies (Bag of Words):")
print(word_freq_df)
print("\nTF-IDF Features:")
print(tfidf_df)
print("\nSentiment Scores:")
print(df[['review', 'sentiment_scores']])
print("\nBigrams:")
print(df[['review', 'bigrams']])
print("\nTrigrams:")
print(df[['review', 'trigrams']])
print("\nText Length:")
print(df[['review', 'text_length']])


Word Frequencies (Bag of Words):
   and  better  but  day  dissatisfied  done  ever  every  everyone  expected  \
0    1       0    0    1             0     0     0      1         0         0   
1    0       0    0    0             1     0     1      0         0         0   
2    0       0    1    0             0     1     0      0         0         0   
3    0       0    0    0             0     0     0      0         1         0   
4    0       1    0    0             0     0     0      0         0         1   

   ...  recommend  special  the  this  to  totally  use  works  worst  worth  
0  ...          0        0    0     1   0        0    1      1      0      0  
1  ...          0        0    1     1   0        1    0      0      1      0  
2  ...          0        1    1     0   0        0    0      0      0      0  
3  ...          1        0    0     0   1        0    0      0      0      0  
4  ...          0        0    1     0   0        0    0      0      0      1  

[5 ro

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [7]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.model_selection import train_test_split
from nltk.sentiment import SentimentIntensityAnalyzer
import nltk

# Sample product reviews and their sentiment labels
reviews = [
    "I love this product! It works great and I use it every day.",
    "This is the worst purchase I have ever made. Totally dissatisfied.",
    "It's okay, nothing special but it gets the job done.",
    "Fantastic quality! Highly recommend it to everyone.",
    "Not worth the price. I expected much better quality.",
]

# Sentiment labels (1 for positive, 0 for negative, 2 for neutral)
sentiment_labels = [1, 0, 2, 1, 0]

# Create a DataFrame
df = pd.DataFrame({'review': reviews, 'sentiment': sentiment_labels})

# 1. Word Frequencies (Bag of Words)
count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(df['review'])

# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['review'])

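# Combine features for Chi-Squared Test
# (columns are the BoW counts followed by the TF-IDF weights, so every word contributes two columns)
X_combined = np.hstack((X_bow.toarray(), X_tfidf.toarray()))
y = df['sentiment'].values

# Train-test split (with 5 reviews and test_size=0.2, one review is held out)
X_train, X_test, y_train, y_test = train_test_split(X_combined, y, test_size=0.2, random_state=42)

# Feature Selection using Chi-Squared Test, computed on the training rows only
chi2_values, p_values = chi2(X_train, y_train)

# Rank features based on Chi-Squared scores
# Each word is listed twice in the ranking: once for its BoW count and once for its TF-IDF weight
feature_names = count_vectorizer.get_feature_names_out().tolist() + tfidf_vectorizer.get_feature_names_out().tolist()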
feature_importance = pd.DataFrame({'Feature': feature_names, 'Chi2 Score': chi2_values})
feature_importance = feature_importance.sort_values(by='Chi2 Score', ascending=False)

# Display ranked features in the desired format
print("Ranked Features Based on Chi-Squared Scores:")
for index, row in feature_importance.iterrows():
    print(f"Feature: {row['Feature']}, Importance: {row['Chi2 Score']:.4f}")


Ranked Features Based on Chi-Squared Scores:
Feature: worth, Importance: 3.0000
Feature: but, Importance: 3.0000
Feature: special, Importance: 3.0000
Feature: done, Importance: 3.0000
Feature: price, Importance: 3.0000
Feature: okay, Importance: 3.0000
Feature: expected, Importance: 3.0000
Feature: nothing, Importance: 3.0000
Feature: gets, Importance: 3.0000
Feature: not, Importance: 3.0000
Feature: better, Importance: 3.0000
Feature: much, Importance: 3.0000
Feature: job, Importance: 3.0000
Feature: the, Importance: 2.0000
Feature: it, Importance: 1.8000
Feature: better, Importance: 1.1259
Feature: expected, Importance: 1.1259
Feature: much, Importance: 1.1259
Feature: not, Importance: 1.1259
Feature: price, Importance: 1.1259
Feature: worth, Importance: 1.1259
Feature: this, Importance: 1.0000
Feature: works, Importance: 1.0000
Feature: use, Importance: 1.0000
Feature: to, Importance: 1.0000
Feature: and, Importance: 1.0000
Feature: recommend, Importance: 1.0000
Feature: quality, Im
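
To see where scores such as 3.0000 and 1.8000 come from, the chi-squared statistic for a single feature can be computed by hand: sum the feature values per class (observed), compare them with what each class would receive if the feature were spread in proportion to class size (expected), and add up `(observed - expected)**2 / expected`. The sketch below is a pure-Python illustration of that formula. The function name `chi2_score` is hypothetical, and the label layout (two positive, one neutral, one negative training review) is an assumption chosen to be consistent with the printed scores.

```python
from collections import defaultdict


def chi2_score(feature_values, labels):
    """Chi-squared statistic of one non-negative feature against class labels."""
    n = len(labels)
    total = sum(feature_values)
    class_size = defaultdict(int)
    observed = defaultdict(float)
    for value, label in zip(feature_values, labels):
        class_size[label] += 1
        observed[label] += value
    score = 0.0
    for label, size in class_size.items():
        # Expected mass if the feature were independent of the class.
        expected = total * size / n
        if expected > 0:
            score += (observed[label] - expected) ** 2 / expected
    return score


# Assumed training labels: 1 = positive, 2 = neutral, 0 = negative.
labels = [1, 2, 1, 0]

# A word that occurs once, only in the single negative review.
print(f"{chi2_score([0, 0, 0, 1], labels):.4f}")  # 3.0000

# A word with counts 2, 2, 1, 0 across the four reviews.
print(f"{chi2_score([2, 2, 1, 0], labels):.4f}")  # 1.8000
```

With so few training reviews, any word that occurs once in a class with a single example reaches the same score, which explains the long run of ties at the top of the ranking.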

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity scores in descending order.

In [8]:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch

# Sample product reviews
reviews = [
    "I love this product! It works great and I use it every day.",
    "This is the worst purchase I have ever made. Totally dissatisfied.",
    "It's okay, nothing special but it gets the job done.",
    "Fantastic quality! Highly recommend it to everyone.",
    "Not worth the price. I expected much better quality.",
]

# Convert reviews to DataFrame
df = pd.DataFrame({'review': reviews})

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text using BERT
def encode_text(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Return the embeddings for the [CLS] token as a 2D array
    return outputs.last_hidden_state[:, 0, :].numpy().reshape(1, -1)

# Encode all reviews
embeddings = np.vstack([encode_text(review) for review in df['review']])

# Define a query
query = "What is the best product to buy?"
query_embedding = encode_text(query)

# Calculate cosine similarity between the query and all reviews
similarity_scores = cosine_similarity(query_embedding, embeddings).flatten()

# Add similarity scores to the DataFrame
df['similarity_score'] = similarity_scores

# Rank the DataFrame based on similarity scores
ranked_df = df.sort_values(by='similarity_score', ascending=False)

# Display the ranked documents based on similarity in the desired format
print("Ranked Documents Based on Similarity to Query:")
for rank, (index, row) in enumerate(ranked_df.iterrows(), start=1):
    print(f"Rank {rank}: Similarity = {row['similarity_score']:.4f}")
    print(f"Text: {row['review']}\n")



Ranked Documents Based on Similarity to Query:
Rank 1: Similarity = 0.9123
Text: Not worth the price. I expected much better quality.

Rank 2: Similarity = 0.8982
Text: Fantastic quality! Highly recommend it to everyone.

Rank 3: Similarity = 0.8799
Text: This is the worst purchase I have ever made. Totally dissatisfied.

Rank 4: Similarity = 0.8543
Text: I love this product! It works great and I use it every day.

Rank 5: Similarity = 0.8434
Text: It's okay, nothing special but it gets the job done.
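
The ranking step itself does not depend on BERT: once the query and the documents are vectors, cosine similarity is the dot product divided by the product of the vector lengths. The following pure-Python sketch shows that computation and the descending sort on small hypothetical vectors. The names `cosine` and `rank_by_similarity` and the example vectors are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs sorted from most to least similar."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical 2-dimensional embeddings.
query = [1.0, 0.0]
documents = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
for index, score in rank_by_similarity(query, documents):
    print(f"doc {index}: {score:.4f}")
```

In the BERT results above, all five scores fall between 0.84 and 0.91, so the ordering rests on small differences. Raw `[CLS]` vectors from a model that has not been fine-tuned for similarity often behave this way, and the ranking should be read with that caveat in mind.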



# Mandatory Question

**Important: Reflective Feedback on this exercise**


Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [9]:
# Your answer here (no code for this question; write down your answer in as much detail as possible for the above questions):

'''
Reflective Feedback on Text Feature Extraction and Similarity Ranking

Learning Experience: Extracting features from text data was a valuable way to learn NLP in practice. Using BERT (Bidirectional Encoder Representations from Transformers) was particularly useful because its embeddings capture contextual relationships in the text, which are vital for many NLP tasks. Computing cosine similarity on those embeddings showed why raw text has to be converted into numerical features before it can be compared. I also learned how text data should be preprocessed and encoded before the final analysis in order to obtain meaningful similarity scores, which will help in future projects.

Challenges Encountered: One of the major issues I faced was handling the dimensions of the output from the BERT model. At first, calculating cosine similarity caused problems because the embeddings did not have the expected dimensions, so I had to make sure each embedding was in the proper 2D array format. This showed how much attention to detail is needed when working with large, complex models and the data structures they produce. By comparison, adjusting the general parameters of the model, such as the maximum sequence length, was not very difficult.

Relevance to Your Field of Study: This exercise is directly related to NLP, which is one of the growing areas of artificial intelligence. Techniques such as feature extraction and similarity measurement are fundamental to recommendation systems, sentiment analysis, document clustering, and information retrieval. These techniques are important for effective text preprocessing and analysis, so this exercise gave me practical skills that will benefit my further academic path and, more generally, my work in the data analytics field.

'''
