
# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
The task is sentiment classification of customer product reviews: deciding whether a reviewer feels positive, negative, or neutral about a purchase. This analysis is valuable to businesses because it shows how customers feel and where improvements are needed.

Features That Would Help Us Build the Model

Term Frequency-Inverse Document Frequency (TF-IDF)
This measures how important a word is in a particular review relative to all reviews. It gives high weight to words that are frequent in one review but rare across the collection, and low weight to words that are common everywhere.

Use: By focusing on unique words that capture customer feelings, we can better understand what really matters in a review.

Sentiment Lexicons
These are lists of words that are known to convey specific feelings like “amazing” for positive sentiments or “horrible” for negative ones.

Use: These lists give a quick way to gauge the sentiment of a review. A review dominated by positive words is likely to be positive.

N-grams
These are sequences of n consecutive words in a review. A unigram is a single word, a bigram is a pair of adjacent words, and a trigram is a sequence of three.

Use: N-grams capture local context. For example, the bigram “not great” conveys a different meaning than the words “not” and “great” considered individually.

Part-of-Speech (POS) Tags
This involves identifying whether a word is a noun, verb, adjective, etc.

Use: Adjectives and adverbs often carry emotional weight. Knowing how many descriptive words are used can give us clues about the overall sentiment.

Length of Review
This is simply the number of words or characters in a review.

Use: Sometimes, longer reviews can indicate more nuanced feelings, while shorter ones may express strong opinions. Understanding the length can help us gauge sentiment intensity.




'''
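
To make the TF-IDF description above concrete, here is a minimal sketch that computes the weights by hand using only the standard library. It assumes the smoothed variant idf(t) = ln((1 + N) / (1 + df(t))) + 1 and a simple regex tokenizer; the helper names `tokenize` and `tfidf` and the three toy reviews are hypothetical, and library implementations such as scikit-learn additionally normalize each document vector.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (hypothetical helper)."""
    return re.findall(r"[a-z']+", text.lower())

def tfidf(docs):
    """Return one {term: weight} dict per document using smoothed IDF."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return weights

# Toy reviews (assumed for illustration only)
toy_reviews = [
    "The product is amazing",
    "The product is terrible",
    "The delivery is slow",
]
scores = tfidf(toy_reviews)
# Rank the terms of the first review from most to least distinctive
top_terms = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)
print(top_terms)
```

In this toy collection "amazing" occurs in only one review, so it outranks "the" and "is", which occur in every review.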

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
# Your code here (please add comments in the code):
import pandas as pd
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk import ngrams
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from collections import Counter

# Sample product reviews
data = {
    'reviews': [
        "This product is amazing! I love it.",
        "Terrible experience. It broke after one use.",
        "It's okay, nothing special but does the job.",
        "Fantastic quality! Highly recommend.",
        "Worst purchase ever. I'm very disappointed."
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# 1. TF-IDF Feature Extraction
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['reviews'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# 2. Sentiment Lexicon (for demonstration, we'll create a simple one)
positive_words = set(["amazing", "love", "fantastic", "highly", "recommend"])
negative_words = set(["terrible", "broke", "worst", "disappointed", "bad"])

# Function to calculate sentiment lexicon features
def sentiment_lexicon_features(review):
    words = word_tokenize(review.lower())
    pos_count = len([word for word in words if word in positive_words])
    neg_count = len([word for word in words if word in negative_words])
    return pos_count, neg_count

# 3. N-grams Extraction
def extract_ngrams(review, n=2):
    words = word_tokenize(review.lower())
    return list(ngrams(words, n))

# 4. Part-of-Speech Tagging
def pos_features(review):
    words = word_tokenize(review)
    pos_tags = pos_tag(words)
    return Counter(tag for word, tag in pos_tags)

# 5. Length of Review
df['review_length'] = df['reviews'].apply(lambda x: len(x.split()))

# Extract features and store them in the DataFrame
df['pos_count'], df['neg_count'] = zip(*df['reviews'].apply(sentiment_lexicon_features))
df['bigrams'] = df['reviews'].apply(lambda x: extract_ngrams(x, 2))
df['pos_tags'] = df['reviews'].apply(pos_features)

# Display the DataFrame with features
print(df)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


                                        reviews  review_length  pos_count  \
0           This product is amazing! I love it.              7          2   
1  Terrible experience. It broke after one use.              7          0   
2  It's okay, nothing special but does the job.              8          0   
3          Fantastic quality! Highly recommend.              4          3   
4   Worst purchase ever. I'm very disappointed.              6          0   

   neg_count                                            bigrams  \
0          0  [(this, product), (product, is), (is, amazing)...   
1          2  [(terrible, experience), (experience, .), (., ...   
2          0  [(it, 's), ('s, okay), (okay, ,), (,, nothing)...   
3          0  [(fantastic, quality), (quality, !), (!, highl...   
4          2  [(worst, purchase), (purchase, ever), (ever, ....   

                                            pos_tags  
0  {'DT': 1, 'NN': 1, 'VBZ': 1, 'JJ': 1, '.': 2, ...  
1  {'JJ': 1, 'NN': 2, '.
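
The lexicon counts in the code above treat every word independently, so a phrase such as "not great" would still count as positive. As a hedged sketch of how bigram context could address this, the following flips the polarity of a lexicon word when the preceding token is a negator. The word lists, the `NEGATORS` set, and the function name `negation_aware_counts` are illustrative assumptions rather than part of the exercise code, and the regex tokenizer stands in for `word_tokenize`.

```python
import re

# Small illustrative lexicons (assumed, not exhaustive)
POSITIVE = {"amazing", "love", "fantastic", "great", "recommend"}
NEGATIVE = {"terrible", "broke", "worst", "disappointed", "bad"}
NEGATORS = {"not", "never", "no"}

def negation_aware_counts(review):
    """Count lexicon hits, flipping polarity when the previous token is a negator."""
    tokens = re.findall(r"[a-z]+", review.lower())
    pos_count = neg_count = 0
    # Pair each token with its predecessor (None for the first token)
    for prev, word in zip([None] + tokens, tokens):
        negated = prev in NEGATORS
        if word in POSITIVE:
            if negated:
                neg_count += 1
            else:
                pos_count += 1
        elif word in NEGATIVE:
            if negated:
                pos_count += 1
            else:
                neg_count += 1
    return pos_count, neg_count

print(negation_aware_counts("Not great, but not bad either."))
print(negation_aware_counts("I love it, it is amazing."))
```

This only looks one token back, so contractions such as "isn't" and longer-range negation are not handled.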

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [2]:
# Your code here (please add comments in the code):
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelEncoder

# Sample product reviews and labels
data = {
    'reviews': [
        "This product is amazing! I love it.",
        "Terrible experience. It broke after one use.",
        "It's okay, nothing special but does the job.",
        "Fantastic quality! Highly recommend.",
        "Worst purchase ever. I'm very disappointed."
    ],
    'sentiment': ['positive', 'negative', 'neutral', 'positive', 'negative']
}

# Create a DataFrame
df = pd.DataFrame(data)

# Convert sentiments to numerical values
le = LabelEncoder()
df['sentiment_encoded'] = le.fit_transform(df['sentiment'])

# 1. TF-IDF Feature Extraction
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(df['reviews'])
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()

# 2. Feature Selection using Chi-Squared
chi2_values, p_values = chi2(tfidf_matrix, df['sentiment_encoded'])

# Create a DataFrame to hold feature names and their Chi-Squared scores
chi2_df = pd.DataFrame({
    'feature': tfidf_feature_names,
    'chi2_score': chi2_values
})

# Rank the features by their Chi-Squared scores
chi2_df = chi2_df.sort_values(by='chi2_score', ascending=False)

# Display the ranked features
print(chi2_df)





         feature  chi2_score
14       nothing    1.465633
15          okay    1.465633
3            but    1.465633
23           the    1.465633
5           does    1.465633
21       special    1.465633
12           job    1.465633
20     recommend    0.750000
8      fantastic    0.750000
9         highly    0.750000
19       quality    0.750000
26          very    0.670820
18      purchase    0.670820
27         worst    0.670820
6           ever    0.670820
4   disappointed    0.670820
1        amazing    0.642617
13          love    0.642617
17       product    0.642617
10            is    0.642617
24          this    0.642617
16           one    0.590692
7     experience    0.590692
22      terrible    0.590692
25           use    0.590692
2          broke    0.590692
0          after    0.590692
11            it    0.059160
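
For intuition about what the scores above measure, here is a standard-library sketch of a chi-squared score per feature: the observed per-class sum of a feature's values is compared with the sum expected if the feature were independent of the class. This should correspond to what scikit-learn's `chi2` computes, but treat that as an assumption. The function name `chi2_scores` and the toy count matrix are hypothetical, and the sketch assumes non-negative values and no all-zero feature columns.

```python
def chi2_scores(X, y):
    """Chi-squared score per feature column of X (rows = documents) given labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Total value of each feature over all documents
    feature_totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for cls in classes:
        rows = [row for row, label in zip(X, y) if label == cls]
        class_prob = len(rows) / n_samples
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            # Expected sum if the feature were independent of the class
            expected = class_prob * feature_totals[j]
            scores[j] += (observed - expected) ** 2 / expected
    return scores

# Toy term counts (assumed): column 0 occurs only in positive reviews,
# column 1 is spread evenly across both classes.
X = [[2, 1], [1, 1], [0, 1], [0, 1]]
y = ["positive", "positive", "negative", "negative"]
scores = chi2_scores(X, y)
print(scores)
```

A feature concentrated in one class gets a high score, while an evenly spread feature scores zero. With only five reviews, as in the ranking above, the words from the single neutral review dominate the top of the list, so the ranking should be read with caution.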


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [4]:
# Your code here (please add comments in the code):
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch

# Sample product reviews
data = {
    'reviews': [
        "This product is amazing! I love it.",
        "Terrible experience. It broke after one use.",
        "It's okay, nothing special but does the job.",
        "Fantastic quality! Highly recommend.",
        "Worst purchase ever. I'm very disappointed."
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Define a query
query = "I really love this product, it's fantastic!"

# Load BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to get BERT embeddings
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # We take the mean of the last hidden state as the embedding
    return outputs.last_hidden_state.mean(dim=1).numpy().flatten()  # Flatten to 1D

# Get embeddings for the reviews and the query
review_embeddings = np.array([get_bert_embedding(review) for review in df['reviews']])
query_embedding = get_bert_embedding(query)

# Calculate cosine similarity
similarity_scores = cosine_similarity(query_embedding.reshape(1, -1), review_embeddings).flatten()

# Add similarity scores to the DataFrame
df['similarity_score'] = similarity_scores

# Rank the reviews by similarity score in descending order
df_ranked = df.sort_values(by='similarity_score', ascending=False)

# Display the ranked reviews along with their similarity scores
print(df_ranked[['reviews', 'similarity_score']])






                                        reviews  similarity_score
0           This product is amazing! I love it.          0.932473
3          Fantastic quality! Highly recommend.          0.662408
4   Worst purchase ever. I'm very disappointed.          0.634917
2  It's okay, nothing special but does the job.          0.585165
1  Terrible experience. It broke after one use.          0.509220
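
The ranking step above relies on cosine similarity, cos(a, b) = a·b / (‖a‖‖b‖). A minimal standard-library sketch of that computation and the descending sort follows. The three-dimensional toy vectors stand in for the 768-dimensional BERT embeddings, and the names `cosine` and `rank_by_similarity` are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (doc_index, score) pairs sorted by descending similarity."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors (assumed for illustration only)
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],   # orthogonal to the query
    [2.0, 0.0, 2.0],   # same direction as the query
    [1.0, 1.0, 0.0],   # partially aligned with the query
]
ranking = rank_by_similarity(query_vec, doc_vecs)
print(ranking)
```

Because cosine similarity ignores vector length, the second toy document ranks first even though it is twice as long as the query vector.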


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Working on extracting features from text data was a valuable experience. Here are some key takeaways:

Feature Representation: I learned how to represent text using techniques like TF-IDF and BERT embeddings. TF-IDF weights words by how distinctive they are for a document, while BERT embeddings capture the meaning of words in context.

Feature Selection: Understanding feature selection methods, especially using Chi-Squared tests, helped me see how important it is to choose relevant features for improving model performance.

Similarity Measures: Calculating cosine similarity for ranking texts based on a query was particularly helpful. It showed how embeddings can be used for tasks beyond just classification.

Challenges Encountered
Technical Issues: I faced challenges with the shape of embeddings when calculating cosine similarity. This highlighted the importance of ensuring data shapes match for calculations.

Model Complexity: Using BERT required a better understanding of how transformer models work, which was initially tricky.

Library Management: Managing different library versions and ensuring compatibility added some complexity, especially for someone new to Python.

Relevance to Your Field of Study
This exercise is very relevant to natural language processing (NLP):

Feature Extraction: Learning feature extraction techniques is essential, as they greatly affect how well models perform in tasks like sentiment analysis.

Deep Learning: Using BERT shows the shift toward deep learning in NLP, capturing language context better than traditional methods.

Real-World Applications: These skills are applicable in many real-world scenarios, such as chatbots and recommendation systems, which rely heavily on text processing.




'''
# Colab link: https://colab.research.google.com/drive/1IoW79cs9yBviW9Q_OlywwCwlbpgA_Tz-?hl=en#scrollTo=CAq0DZWAhU9m