<a href="https://colab.research.google.com/github/AiswaryaGoriparthi/Aiswarya_INFO5731_Fall2024/blob/main/Goriparthi_Aiswarya_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [1]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the question above):

'''
Please write your answer here:

**Task: Detecting Fake Product Reviews** (text classification)

The task is to classify product reviews into two categories: genuine and fraudulent.
On many online marketplaces, companies encounter reviews that are written to mislead readers;
these reviews are often composed to falsely promote a product or to harm a rival's reputation.
Identifying fraudulent reviews is essential for reviews to remain credible and trustworthy.

*Features to Build the Model:*
1. Term Frequency-Inverse Document Frequency (TF-IDF):

Description: Converts the text into numerical vectors based on word frequency while down-weighting words that are common across all reviews.
Why it is useful: Fraudulent reviews may repeat very generic phrases such as "excellent product", "strongly suggested", "not worthy", or "bad product".
Since authentic reviews typically include more thorough and distinctive descriptions,
TF-IDF helps identify these overused terms.

2. N-grams (Bigrams and Trigrams):

Description: Sequences of two or three consecutive words that capture word patterns and local context.
Why it is useful: Fake reviews frequently reuse the same phrase patterns, such as "best product ever" or "will buy again."
N-grams help identify these recurring patterns that point to fraudulent reviews.

3. Review Word Count:

Description: The total number of words in the review.
Why it is useful: Deceptive reviews are often superficial and therefore short,
whereas sincere reviews usually contain more in-depth information. It is a simple yet useful feature for distinguishing between the two classes.

4. Readability Scores:

Description: Metrics that gauge the text's complexity, such as the Gunning Fog Index and Flesch Reading Ease.
Why it is useful: Since deceptive reviews are often written quickly and with little effort, they are frequently shorter and simpler.
Genuine reviews may contain more in-depth and intricate explanations.
Readability levels can help differentiate between reviews that are written naturally and those that are simplified.

5. Sentiment Polarity:

Description: A numerical score ranging from -1 (negative) to +1 (positive) that indicates the review's positive or negative tone.
Why it is useful: Fraudulent reviews often express extreme opinions, either overly favorable or overly critical,
whereas sincere reviews tend to be more balanced. Sentiment analysis can flag text with abnormally high or low polarity.

6. Part-of-Speech (POS) Tag Counts:

Description: Identifies and counts grammatical categories (such as nouns, verbs, adjectives, and adverbs) in the text.
Why it is useful: Dishonest reviewers may overemphasize their sentiment by using more adjectives and adverbs and fewer nouns or action verbs
(e.g., "incredibly amazing," "absolutely perfect"). POS tagging can help detect unusual language patterns that indicate exaggeration or omission of details.

7. Lexical Diversity (Unique Word Ratio):

Description: The ratio of unique words to total words in a review.
Why it is useful: Sincere reviews are more likely to use a varied vocabulary,
whereas fraudulent reviews frequently reuse the same words. A lower ratio of unique words may be a sign of dishonesty.


By combining these features, we can build a machine learning model that distinguishes between genuine and deceptive reviews.


'''

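The answer above names Flesch Reading Ease as one readability metric. The following is a minimal sketch of that formula, assuming a crude vowel-group heuristic for syllable counting (an illustration only, not a dictionary-based count). The helper names `count_syllables` and `flesch_reading_ease` are hypothetical and do not appear elsewhere in this notebook.

```python
import re


def count_syllables(word):
    """Rough syllable estimate: number of consecutive-vowel groups (assumed heuristic)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text):
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    # Split on sentence-ending punctuation and drop empty fragments
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Keep alphabetic tokens, allowing one internal apostrophe (e.g. "don't")
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        return None  # Score is undefined for empty input
    syllables = sum(count_syllables(w) for w in words)
    return (
        206.835
        - 1.015 * (len(words) / len(sentences))
        - 84.6 * (syllables / len(words))
    )


# Example usage on one of the sample reviews
print(flesch_reading_ease("Not worth the money. Very bad experience."))
```

Higher scores indicate easier text. Because the syllable count is approximate, the values are best used to compare reviews with each other rather than as exact scores.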

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [2]:
# Your code here (please add comments in the code):

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from textblob import TextBlob
import nltk
import pandas as pd

# Download NLTK's required data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample Data (Product Reviews)
reviews = [
    "This product is amazing! Highly recommended.",
    "Not worth the money. Very bad experience.",
    "Best purchase ever. Will buy again!",
    "Terrible quality. I hated it.",
    "Good product, but not what I expected."
]

# 1. Word Count
def word_count(text):
    return len(text.split())

# 2. Sentiment Polarity
def sentiment_polarity(text):
    return TextBlob(text).sentiment.polarity

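# 3. Readability proxy: average number of tokens per sentence.
# This is a rough stand-in for a readability score, not Flesch Reading Ease or Gunning Fog;
# word_tokenize also counts punctuation marks as tokens.
def readability_score(text):
    sentences = nltk.sent_tokenize(text)
    words = nltk.word_tokenize(text)
    if not sentences:  # Guard against empty input
        return 0.0
    return len(words) / len(sentences)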

# 4. Part-of-Speech (POS) Tagging Distribution
def pos_tagging(text):
    tokens = nltk.word_tokenize(text)
    tags = nltk.pos_tag(tokens)
    pos_counts = nltk.FreqDist(tag for (word, tag) in tags)
    return dict(pos_counts)

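# 5. Unique Word Ratio (Lexical Diversity)
# Note: split() keeps punctuation attached, so "product," and "product" count as different words
def unique_word_ratio(text):
    words = text.lower().split()
    if not words:  # Guard against empty input
        return 0.0
    return len(set(words)) / len(words)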

# 6. TF-IDF Extraction
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(reviews)

# 7. N-grams (Bigrams)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vectorizer.fit_transform(reviews)

# Create a DataFrame to display the extracted features
feature_data = []
for review in reviews:
    features = {
        'Review': review,
        'Word Count': word_count(review),
        'Sentiment Polarity': sentiment_polarity(review),
        'Readability Score': readability_score(review),
        'Unique Word Ratio': unique_word_ratio(review),
        'POS Tagging': pos_tagging(review)
    }
    feature_data.append(features)

# Convert the data to DataFrame
df = pd.DataFrame(feature_data)

# Display Extracted Features
print(df)

# Display TF-IDF Matrix (Optional)
tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print("\nTF-IDF Matrix:")
print(tfidf_df)

# Display Bigrams (Optional)
bigram_df = pd.DataFrame(bigram_matrix.toarray(), columns=bigram_vectorizer.get_feature_names_out())
print("\nBigram Matrix:")
print(bigram_df)





                                         Review  Word Count  \
0  This product is amazing! Highly recommended.           6   
1     Not worth the money. Very bad experience.           7   
2           Best purchase ever. Will buy again!           6   
3                 Terrible quality. I hated it.           5   
4        Good product, but not what I expected.           7   

   Sentiment Polarity  Readability Score  Unique Word Ratio  \
0               0.455                4.0                1.0   
1              -0.530                4.5                1.0   
2               1.000                4.0                1.0   
3              -0.950                3.5                1.0   
4               0.300                9.0                1.0   

                                         POS Tagging  
0  {'DT': 1, 'NN': 1, 'VBZ': 1, 'JJ': 1, '.': 2, ...  
1  {'RB': 2, 'IN': 1, 'DT': 1, 'NN': 2, '.': 2, '...  
2  {'JJS': 1, 'NN': 1, 'RB': 2, '.': 2, 'NNP': 1,...  
3     {'JJ': 1, 'NN': 
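Every review in the output above has a Unique Word Ratio of 1.0, partly because the reviews are short and partly because `split()` leaves punctuation attached to words. The sketch below shows one possible refinement, assuming a simple regular-expression tokenizer, together with a way to turn a POS tag count dictionary into the adjective and adverb share discussed in Question 1. The names `type_token_ratio` and `modifier_ratio` are hypothetical, and the tag dictionary in the usage lines is made up for illustration.

```python
import re


def type_token_ratio(text):
    """Unique words divided by total words, after lowercasing and dropping punctuation."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    return len(set(words)) / len(words)


def modifier_ratio(pos_counts):
    """Share of adjective (JJ*) and adverb (RB*) tags among word-level Penn Treebank tags."""
    # Punctuation tags such as '.' or ',' do not start with a letter, so they are skipped
    word_tags = {tag: n for tag, n in pos_counts.items() if tag[0].isalpha()}
    total = sum(word_tags.values())
    if total == 0:
        return 0.0
    modifiers = sum(n for tag, n in word_tags.items() if tag.startswith(("JJ", "RB")))
    return modifiers / total


# Repeated words are detected once punctuation is stripped
print(type_token_ratio("good product, good product!"))

# Hypothetical tag counts in the same shape as the pos_tagging() output
example_counts = {"DT": 1, "NN": 1, "VBZ": 1, "JJ": 1, ".": 2, "RB": 1, "VBN": 1}
print(modifier_ratio(example_counts))
```

A single number such as `modifier_ratio` is easier to feed into a classifier or a feature selection step than a dictionary of raw tag counts.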

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [3]:
# Your code here (please add comments in the code):

from sklearn.feature_selection import SelectKBest, chi2
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample labels (0 = fake review, 1 = genuine review)
labels = [1, 0, 1, 0, 1]  # Simplified example

# Illustrative feature values (hard-coded for this example; they differ from the Question 2 output)
data = {
    'Word Count': [6, 7, 6, 6, 7],
    'Sentiment Polarity': [0.6, -0.875, 0.7, -1.0, 0.3],
    'Readability Score': [6.0, 7.0, 6.0, 6.0, 7.0],
    'Unique Word Ratio': [0.833, 0.857, 1.0, 1.0, 0.857],
    'TFIDF_amazing': [0.5, 0.0, 0.0, 0.0, 0.0],
    'TFIDF_bad': [0.0, 0.44, 0.0, 0.0, 0.0]
}

# Convert the data into a pandas DataFrame
df = pd.DataFrame(data)

# Step 1: Scale the data to [0, 1] (scikit-learn's chi2 requires non-negative feature values)
scaler = MinMaxScaler()
scaled_data = scaler.fit_transform(df)

# Step 2: Use Chi-Square for feature selection
chi2_selector = SelectKBest(chi2, k='all')
chi2_selector.fit(scaled_data, labels)

# Step 3: Get the scores for each feature and rank them
feature_scores = chi2_selector.scores_
features = df.columns

# Step 4: Sort the features based on their scores (descending order)
ranked_features = sorted(zip(features, feature_scores), key=lambda x: x[1], reverse=True)

# Print the ranked features
print("Ranked Features Based on Chi-Square Test:")
for feature, score in ranked_features:
    print(f"{feature}: {score}")




Ranked Features Based on Chi-Square Test:
Sentiment Polarity: 1.6159482311443103
TFIDF_bad: 1.5
TFIDF_amazing: 0.6666666666666667
Unique Word Ratio: 0.09530938123752497
Word Count: 0.08333333333333329
Readability Score: 0.08333333333333329
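To make the chi-square score more concrete, here is a sketch of the textbook contingency-table form of the statistic for a single term, computed on the Question 2 reviews with the labels used above. It is an illustration under stated assumptions: the tokenizer is a simple regular expression, and the function name `chi_square_term` is hypothetical. scikit-learn's `chi2` computes its statistic from summed feature values per class, so these numbers are not expected to match the output above.

```python
import re


def chi_square_term(docs, labels, term, positive_label=1):
    """Chi-square of term presence vs. class from a 2x2 contingency table.

    a: positive-class docs containing the term
    b: other-class docs containing the term
    c: positive-class docs without the term
    d: other-class docs without the term
    """
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        tokens = set(re.findall(r"[a-z']+", doc.lower()))
        has_term = term in tokens
        is_positive = label == positive_label
        if has_term and is_positive:
            a += 1
        elif has_term:
            b += 1
        elif is_positive:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # Term or class is constant, so there is no association to measure
    return n * (a * d - b * c) ** 2 / denominator


reviews = [
    "This product is amazing! Highly recommended.",
    "Not worth the money. Very bad experience.",
    "Best purchase ever. Will buy again!",
    "Terrible quality. I hated it.",
    "Good product, but not what I expected.",
]
labels = [1, 0, 1, 0, 1]  # Same simplified labels as in the cell above

# Score every vocabulary term and rank in descending order
vocabulary = sorted({t for r in reviews for t in re.findall(r"[a-z']+", r.lower())})
ranked_terms = sorted(
    ((term, chi_square_term(reviews, labels, term)) for term in vocabulary),
    key=lambda pair: pair[1],
    reverse=True,
)
for term, score in ranked_terms[:5]:
    print(f"{term}: {score:.4f}")
```

With only five documents the scores are not statistically meaningful; the example only shows how the ranking is produced.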


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [4]:
# Your code here (please add comments in the code):
import numpy as np  # For numerical operations
from sklearn.metrics.pairwise import cosine_similarity  # For calculating similarity
from transformers import BertTokenizer, BertModel  # For BERT model and tokenizer
import torch  # For tensor operations

# Sample documents to compare against
documents = [
    "This product is amazing!",
    "I absolutely hate this item.",
    "It's okay, not the best but not terrible.",
    "Worst purchase I've ever made.",
    "I love it, highly recommend!"
]

# The query we want to match against the documents
query = "I love this product!"

# Load the BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')  # Load the tokenizer for BERT
model = BertModel.from_pretrained('bert-base-uncased')  # Load the BERT model

# Function to get BERT embeddings for a list of texts
def get_embeddings(texts):
    # Tokenize the input texts and convert them to tensors
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # Disable gradient calculation for efficiency
        outputs = model(**inputs)  # Get model outputs
    # Return the final-layer [CLS] token embedding, used here as a summary of the whole input
    return outputs.last_hidden_state[:, 0, :].numpy()

# Get embeddings for all documents and the query
document_embeddings = get_embeddings(documents)  # Get embeddings for documents
query_embedding = get_embeddings([query])  # Get embedding for the query (wrapped in a list)

# Calculate cosine similarity between the query and each document
similarity_scores = cosine_similarity(query_embedding, document_embeddings)

# Pair each document with its corresponding similarity score
results = list(zip(documents, similarity_scores.flatten()))  # Flatten the scores for easier pairing

# Sort the documents based on similarity scores in descending order
sorted_results = sorted(results, key=lambda x: x[1], reverse=True)

# Print the ranked results
print("Ranked Documents based on Similarity:")
for doc, score in sorted_results:
    print(f"Document: {doc}, Similarity Score: {score:.4f}")  # Print document with its similarity score



Ranked Documents based on Similarity:
Document: This product is amazing!, Similarity Score: 0.9861
Document: I love it, highly recommend!, Similarity Score: 0.9496
Document: I absolutely hate this item., Similarity Score: 0.9150
Document: It's okay, not the best but not terrible., Similarity Score: 0.9107
Document: Worst purchase I've ever made., Similarity Score: 0.8865
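The ranking above relies on two steps: pooling token vectors into one vector per text, and comparing vectors with cosine similarity. The sketch below spells out both steps in plain Python on toy vectors. It assumes masked mean pooling as an alternative to the `[CLS]` vector used in the cell above (a common choice, not what the cell does), and the function names `masked_mean`, `cosine`, and `rank_by_similarity` are hypothetical. Whether mean pooling changes the ranking for these documents has not been checked here.

```python
import math


def masked_mean(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs, docs):
    """Pair each document with its similarity to the query, highest first."""
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-dimensional vectors standing in for BERT embeddings
toy_docs = ["doc A", "doc B", "doc C"]
toy_vecs = [[2.0, 0.0], [1.0, 1.0], [0.0, 3.0]]
toy_query = [1.0, 0.0]
for doc, score in rank_by_similarity(toy_query, toy_vecs, toy_docs):
    print(f"{doc}: {score:.4f}")

# Padding positions (mask 0) are ignored when averaging
pooled = masked_mean([[1.0, 1.0], [3.0, 3.0], [100.0, 100.0]], [1, 1, 0])
print(pooled)
```

With the Hugging Face model, the token vectors would come from `outputs.last_hidden_state` and the mask from `inputs["attention_mask"]`.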


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [6]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
Please write your answer here:

Learning Experience:
It was instructive to work on feature extraction from text data.
Important lessons learned included:
Feature Extraction Techniques: By learning to apply techniques like sentiment analysis, TF-IDF, and n-grams, I was able to improve my text analysis skills.
Using BERT to Represent Text: The use of BERT made clear how effective contextual embeddings are in grasping linguistic subtleties.
Cosine Similarity: Using cosine similarity to rank documents improved my understanding of NLP's vector space models.

Challenges Encountered:
Comprehending BERT: At first, the intricacy of BERT was overwhelming, and it took me some time to understand how to use it efficiently.
Feature Selection: Careful thought was needed to determine which features would have the greatest influence.
Programming Syntax: Although difficult, debugging code and fixing syntax issues helped me become a better programmer.


Relevance to Your Field of Study:
This exercise is closely related to natural language processing (NLP) because it highlights the role of feature extraction in a variety of tasks,
including sentiment analysis and classification. Learning these methods improves my understanding of data science and
artificial intelligence applications, and the choice of features has a direct impact on model performance. In summary, the exercise emphasized the need for
ongoing learning in the ever-evolving field of natural language processing.
'''