<a href="https://colab.research.google.com/github/Manaswini1912/INFO-5731/blob/main/Kodela_Manaswini_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible):

'''
Task: sentiment analysis of customer reviews for e-commerce products. The goal is to classify each review as positive, neutral, or negative based on its content. The following features might be useful for building the machine learning model:

Word Frequency: Count how often each word occurs in a review. Words that appear frequently in positive or in negative reviews can indicate sentiment polarity.

Sentiment Lexicons: Use sentiment lexicons (dictionaries) to identify words with positive or negative polarity, such as "good," "excellent," "bad," or "poor."

N-grams: Extract sequences of n adjacent words from the review. N-grams capture local context and word co-occurrence, so a phrase such as "not good" is treated differently from "good" alone.

Part of Speech (POS) Tags: Tag each word with its part of speech (e.g., noun, verb, adjective). POS tags help locate sentiment-bearing words and their roles in the sentence structure.

Review Length: Measure the length of the review in words or characters. Longer reviews may contain more detailed sentiments, while shorter reviews tend to express sentiment more concisely.

'''
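
The feature types listed above can be sketched per review as follows. This is a minimal illustration using only the standard library, not the notebook's implementation: the `POSITIVE` and `NEGATIVE` word sets, the regex tokenizer, and the `review_features` helper are all assumptions made for the example. POS tags are left out because they require a trained tagger such as the one NLTK provides.

```python
import re
from collections import Counter

# Hypothetical mini-lexicons, chosen only for illustration.
POSITIVE = {"good", "excellent", "great", "love"}
NEGATIVE = {"bad", "poor", "terrible", "disappointed"}

def review_features(text, n=2):
    """Return a dict of simple features for a single review."""
    # Lowercase and keep alphabetic tokens (a simplification of real tokenization).
    tokens = re.findall(r"[a-z']+", text.lower())
    word_freq = Counter(tokens)
    # Build n-grams by zipping the token list with shifted copies of itself.
    ngram_counts = Counter(zip(*(tokens[i:] for i in range(n))))
    return {
        "word_freq": word_freq,
        "ngrams": ngram_counts,
        "positive_hits": sum(word_freq[w] for w in POSITIVE),
        "negative_hits": sum(word_freq[w] for w in NEGATIVE),
        "length_tokens": len(tokens),
        "length_chars": len(text),
    }

sample = "Great phone, great battery. Not bad at all!"
features = review_features(sample)
print(features["positive_hits"], features["negative_hits"], features["length_tokens"])
```

In this sample, "Not bad" still registers as a negative lexicon hit; the bigram `("not", "bad")` is the feature that preserves the negation, which is one reason to combine lexicon counts with n-grams.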

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [5]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk import pos_tag
from nltk.util import ngrams
from collections import Counter

# Function to scrape reviews from an Amazon product page
def scrape_reviews_amazon(url):
    # Send a GET request to the product URL (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=30)

    # Check if the request was successful (status code 200)
    if response.status_code != 200:
        print("Failed to retrieve the product page. Status code:", response.status_code)
        # Return an empty list so the feature functions below do not fail on None
        return []

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the review container elements
    review_containers = soup.find_all('div', {'data-hook': 'review'})

    # Extract review texts, skipping containers that have no review body
    reviews = []
    for container in review_containers:
        body = container.find('span', {'data-hook': 'review-body'})
        if body is not None:
            reviews.append(body.text.strip())
    return reviews

# URL of the Amazon product page for Apple iPhone 12
product_url = 'https://www.amazon.com/Apple-iPhone-12-64GB-Blue/dp/B08PNM1LNZ/ref=sr_1_2?crid=1NNLIQ4SDTMW7&dib=eyJ2IjoiMSJ9.xt3_O1X35oYubFd_5Wn7DA.srRrzes-KEisDbnzAguJ3g42jif0GegTAIgSqRgN0do&dib_tag=se&keywords=B08PNM1LNZ&qid=1709238900&sprefix=b08pnm1lnz%2Caps%2C435&sr=8-2&th=1'

# Scrape reviews from the Amazon product page
reviews = scrape_reviews_amazon(product_url)

# Download necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Function to extract word frequency feature
def extract_word_frequency(reviews):
    words = [word for review in reviews for word in word_tokenize(review.lower()) if word.isalpha()]
    word_freq = Counter(words)
    return word_freq

# Function to extract sentiment lexicon feature
def extract_sentiment_lexicons(reviews):
    positive_words = {"excellent", "love", "satisfied"}
    negative_words = {"late", "damaged", "terrible", "disappointed"}
    # Tokenize with word_tokenize so attached punctuation ("excellent!") does not hide a match
    tokens = [word for review in reviews for word in word_tokenize(review.lower())]
    pos_score = sum(1 for word in tokens if word in positive_words)
    neg_score = sum(1 for word in tokens if word in negative_words)
    return pos_score, neg_score

# Function to extract n-grams feature
def extract_ngrams(reviews, n=2):
    words = [word for review in reviews for word in word_tokenize(review.lower()) if word.isalpha()]
    ngrams_list = list(ngrams(words, n))
    return ngrams_list

# Function to extract part of speech (POS) tags feature
def extract_pos_tags(reviews):
    # One token list per review, then one list of (token, tag) pairs per review
    tokenized_reviews = [word_tokenize(review) for review in reviews]
    pos_tags = [pos_tag(tokens) for tokens in tokenized_reviews]
    return pos_tags

# Function to extract review length feature (number of tokens per review)
def extract_review_length(reviews):
    tokenized_reviews = [word_tokenize(review) for review in reviews]
    return [len(tokens) for tokens in tokenized_reviews]

# Extract features from scraped reviews
word_freq = extract_word_frequency(reviews)
pos_score, neg_score = extract_sentiment_lexicons(reviews)
bigrams = extract_ngrams(reviews, n=2)
pos_tags = extract_pos_tags(reviews)
review_length = extract_review_length(reviews)

# Print the extracted features
print("Word Frequency:", word_freq)
print("Positive Score:", pos_score)
print("Negative Score:", neg_score)
print("Bigrams:", bigrams)
print("POS Tags:", pos_tags)
print("Review Length:", review_length)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


Word Frequency: Counter({'i': 84, 'the': 74, 'a': 65, 'and': 47, 'it': 45, 'was': 42, 'to': 40, 'my': 33, 'in': 23, 'for': 22, 'of': 22, 'phone': 21, 'this': 20, 'is': 19, 'with': 19, 'but': 17, 'have': 17, 'as': 17, 'you': 17, 'more': 16, 'on': 14, 'that': 14, 'me': 14, 'had': 13, 'they': 13, 'new': 12, 'so': 12, 'from': 12, 'read': 12, 'great': 11, 'iphone': 11, 'if': 10, 'used': 10, 'amazon': 9, 'one': 9, 'or': 9, 'case': 9, 'no': 9, 'an': 8, 'en': 8, 'has': 7, 'could': 7, 'be': 7, 'la': 7, 'having': 6, 'buying': 6, 'good': 6, 'want': 6, 'would': 6, 'green': 6, 'what': 6, 'work': 6, 'mint': 6, 'day': 6, 'battery': 6, 'face': 6, 'el': 6, 'muy': 6, 'device': 5, 'after': 5, 'only': 5, 'by': 5, 'get': 5, 'life': 5, 'take': 5, 'thing': 5, 'when': 5, 'than': 5, 'all': 5, 'colour': 5, 'screen': 5, 'up': 5, 'id': 5, 'are': 5, 'at': 5, 'works': 5, 'will': 5, 'con': 5, 'phones': 4, 'needed': 4, 'first': 4, 'not': 4, 'going': 4, 'old': 4, 'fun': 4, 'purple': 4, 'less': 4, 'also': 4, 'refurbish

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper: "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [12]:
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Function to pad feature vectors with zeros to ensure equal length
def pad_features(features):
    max_length = max(len(feature) for feature in features)
    padded_features = [feature + [0] * (max_length - len(feature)) for feature in features]
    return padded_features

# 'labels' must already hold one sentiment label per review; it is not defined in this cell
# Ensure 'labels' has the same number of samples as 'reviews' and adjust it if necessary

# Print the number of reviews and labels to verify consistency
print("Number of reviews:", len(reviews))
print("Number of labels:", len(labels))

# Combine all features into a single matrix
all_features = [
    list(word_freq.values()),
    [pos_score],
    [neg_score],
    list(Counter(bigrams).values()),
    # Convert POS tags to a binary value per review: 1 if the review contains any noun tag (NN, NNS, NNP, NNPS)
    [1 if any(tag.startswith('NN') for _, tag in tagged) else 0 for tagged in pos_tags],
    review_length
]

# Pad features to ensure equal length
padded_features = pad_features(all_features)

# Convert padded features to a NumPy array
all_features_array = np.array(padded_features).T

# Ensure the number of samples in features matches the number of samples in labels
# Adjust features and labels accordingly if necessary
if len(all_features_array) != len(labels):
    min_samples = min(len(all_features_array), len(labels))
    all_features_array = all_features_array[:min_samples]
    labels = labels[:min_samples]

# Estimate mutual information (used here as the Information Gain score) between each feature column and the labels
ig_scores = mutual_info_classif(all_features_array, labels)

# Create a dictionary to store feature names and their corresponding Information Gain scores
feature_ig_scores = {}
feature_names = ["Word Frequency", "Positive Score", "Negative Score", "Bigrams", "POS Tags", "Review Length"]
for i, feature_name in enumerate(feature_names):
    feature_ig_scores[feature_name] = ig_scores[i]

# Rank features based on their Information Gain scores in descending order
ranked_features = sorted(feature_ig_scores.items(), key=lambda x: x[1], reverse=True)

# Print the ranked features
print("Ranked Features based on Information Gain:")
for feature, score in ranked_features:
    print(f"{feature}: {score}")


Number of reviews: 11
Number of labels: 6
Ranked Features based on Information Gain:
Bigrams: 0.7833333333333334
Negative Score: 0.436111111111111
Word Frequency: 0.03333333333333344
Review Length: 0.03333333333333344
Positive Score: 0.0
POS Tags: 0.0
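
Information gain can also be computed exactly for a discrete feature, with one row per review, as IG(Y; X) = H(Y) - H(Y | X). The following is a minimal sketch of that calculation for binary term-presence features. The four documents and their labels are invented for illustration and are not the scraped reviews; `entropy` and `information_gain` are hypothetical helper names. It shows the discrete form of the quantity that `mutual_info_classif` estimates, not a replacement for the cell above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X) for a discrete feature X."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy within each feature value
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data invented for illustration: one entry per review.
docs = [
    "great phone love it",
    "terrible battery very disappointed",
    "great screen great price",
    "terrible seller phone arrived damaged",
]
labels = ["pos", "neg", "pos", "neg"]

# Score every vocabulary term by the information gain of its presence/absence.
vocabulary = sorted({w for d in docs for w in d.split()})
scores = {
    term: information_gain([term in d.split() for d in docs], labels)
    for term in vocabulary
}

# Rank terms in descending order of information gain.
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked[:5]:
    print(f"{term}: {score:.3f}")
```

In this toy set, `great` and `terrible` separate the two classes perfectly and score 1 bit, while `phone` occurs once in each class and scores 0.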


## Question 4 (10 points):
Write Python code to rank the texts based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.
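
The ranking step itself is independent of how the embeddings are produced. As a minimal sketch, assuming the query and each text have already been encoded as equal-length vectors, cosine similarity and the descending ranking can be written with the standard library alone. The 3-dimensional vectors below are hypothetical stand-ins for BERT embeddings, and `rank_by_similarity` is an illustrative helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (index, score) pairs sorted by descending similarity to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional embeddings standing in for BERT vectors.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different magnitude
    [1.0, 1.0, 0.0],  # partially aligned with the query
]

ranking = rank_by_similarity(query_vec, doc_vecs)
for rank, (index, score) in enumerate(ranking, start=1):
    print(f"Rank {rank}: document {index}, similarity = {score:.4f}")
```

Because cosine similarity depends only on direction, the second vector ranks first even though its magnitude differs from the query's.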

In [14]:
!pip install requests beautifulsoup4 nltk sentence-transformers scikit-learn

Collecting sentence-transformers
  Downloading sentence_transformers-2.5.0-py3-none-any.whl (156 kB)
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.5.0


In [16]:
import requests
from bs4 import BeautifulSoup
import nltk
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.util import ngrams
from collections import Counter
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Function to scrape reviews from an Amazon product page
def scrape_reviews_amazon(url):
    # Send a GET request to the product URL (the timeout avoids hanging indefinitely)
    response = requests.get(url, timeout=30)

    # Check if the request was successful (status code 200)
    if response.status_code != 200:
        print("Failed to retrieve the product page. Status code:", response.status_code)
        # Return an empty list so later steps do not fail on None
        return []

    # Parse the HTML content of the page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Find the review container elements
    review_containers = soup.find_all('div', {'data-hook': 'review'})

    # Extract review texts, skipping containers that have no review body
    reviews = []
    for container in review_containers:
        body = container.find('span', {'data-hook': 'review-body'})
        if body is not None:
            reviews.append(body.text.strip())
    return reviews

# URL of the Amazon product page for Apple iPhone 12
product_url = 'https://www.amazon.com/Apple-iPhone-12-64GB-Blue/dp/B08PNM1LNZ/ref=sr_1_2?crid=1NNLIQ4SDTMW7&dib=eyJ2IjoiMSJ9.xt3_O1X35oYubFd_5Wn7DA.srRrzes-KEisDbnzAguJ3g42jif0GegTAIgSqRgN0do&dib_tag=se&keywords=B08PNM1LNZ&qid=1709238900&sprefix=b08pnm1lnz%2Caps%2C435&sr=8-2&th=1'

# Scrape reviews from the Amazon product page
reviews = scrape_reviews_amazon(product_url)

# Download necessary NLTK resources
nltk.download('punkt')

# Function to extract word frequency feature
def extract_word_frequency(reviews):
    words = [word for review in reviews for word in word_tokenize(review.lower()) if word.isalpha()]
    word_freq = Counter(words)
    return word_freq

# Function to extract n-grams feature
def extract_ngrams(reviews, n=2):
    words = [word for review in reviews for word in word_tokenize(review.lower()) if word.isalpha()]
    ngrams_list = list(ngrams(words, n))
    return ngrams_list

# Function to extract part of speech (POS) tags feature
def extract_pos_tags(reviews):
    words = [word_tokenize(review) for review in reviews]
    pos_tags = [pos_tag(word) for word in words]
    return pos_tags

# Load the pre-trained BERT model for sentence embeddings
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Sample query (replace this with your actual query)
query = "This is a query to find relevant documents."

# Generate the embedding for the query
query_embedding = model.encode([query])[0]

# Generate embeddings for the review texts (no additional preprocessing is applied)
text_embeddings = model.encode(reviews)

# Calculate cosine similarity between the query and each text
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

# Rank the texts based on their cosine similarity with the query
ranked_indices = np.argsort(similarities)[::-1]
ranked_texts = [(reviews[i], similarities[i]) for i in ranked_indices]

# Print the ranked texts
print("Ranked Texts based on Similarity with Query:")
for i, (text, similarity) in enumerate(ranked_texts, start=1):
    print(f"Rank {i}: Similarity = {similarity:.4f}, Text = {text}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Ranked Texts based on Similarity with Query:
Rank 1: Similarity = 0.5585, Text = Excelente compra
Read more
Rank 2: Similarity = 0.5337, Text = Funciona bien. Estética del producto muy buena
Read more
Rank 3: Similarity = 0.2799, Text = Me gusto cumplió con mis expectativas, hasta con mica me llegó, entrega al otro día, todo funcionando correctamente, solo pequeños detalles, pero con la funda lo cubre, recomendando 100%
Read more
Rank 4: Similarity = 0.2779, Text = Compré con el vendedor buyspry en una condición aceptable a muy buen precioEstoy muy sorprendido en cómo llegó el equipo, llegó con 100% de batería y sin golpes en los marcos, prácticamente parece nuevo, todo le funciona a la perfecciónEsto depende mucho de la suerte si te llega un equipo como nuevoMe llegó en 4 días
Read more
Rank 5: Similarity = 0.2298, Text = El teléfono llegó en tiempo y forma. Llevo aproximadamente 4 - 5 meses usándolo y todo ha estado muy bien. Como se menciona en la descripción, la 

# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# I learned how to break text down into smaller parts and count them. I also learned about tools like BERT, which turn text into numeric vectors that computers can work with, and how to compare pieces of text to measure how similar they are. These advanced tools were a bit tricky to use at first, but practice helped me improve. Overall, learning to understand and work with text data is important for NLP tasks such as determining whether reviews are positive or negative.