# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer to the question above in as much detail as possible):

'''
Please write your answer here:
Sentiment analysis of customer reviews for a product or service is an interesting text classification task. The goal is to classify each review, based on its text, as positive, negative, or neutral. The following six types of features can help when building a machine learning model:
1.   Bag of Words
2.   Term Frequency-Inverse Document Frequency (TF-IDF)
3.   N-grams
4.   Sentiment Lexicons
5.   Part-of-Speech (POS) Tags
6.   Sentence Length and Punctuation Statistics
Combining these feature types lets a machine learning model capture several facets of the text and so predict the sentiment of customer reviews more accurately. In text classification, which features are chosen and how they are combined has a considerable impact on the results.





'''

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [3]:
import nltk
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.corpus import stopwords

# Download necessary NLTK resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

# Sample text data
my_sample_data = [
    "The new album is incredibly captivating; I can't stop listening to it!",
    "I found the latest book to be quite disappointing; it didn't meet my expectations.",
    "The customer service was excellent, but the product quality was subpar.",
]

# Initialize the VADER sentiment analyzer
sia = SentimentIntensityAnalyzer()

# 1. Bag of Words (BoW)
my_count_vectorizer = CountVectorizer(stop_words=stopwords.words("english"))
my_bow_features = my_count_vectorizer.fit_transform(my_sample_data)
my_bow_df = pd.DataFrame(my_bow_features.toarray(), columns=my_count_vectorizer.get_feature_names_out())
print("Bag of Words (BoW):\n", my_bow_df)

# 2. Term Frequency-Inverse Document Frequency (TF-IDF)
my_tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords.words("english"))
my_tfidf_features = my_tfidf_vectorizer.fit_transform(my_sample_data)
my_tfidf_df = pd.DataFrame(my_tfidf_features.toarray(), columns=my_tfidf_vectorizer.get_feature_names_out())
print("\nTF-IDF:\n", my_tfidf_df)

# 3. N-grams
my_ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words=stopwords.words("english"))
my_ngram_features = my_ngram_vectorizer.fit_transform(my_sample_data)
my_ngram_df = pd.DataFrame(my_ngram_features.toarray(), columns=my_ngram_vectorizer.get_feature_names_out())
print("\nN-grams:\n", my_ngram_df)

# 4. Sentiment Lexicons (VADER sentiment scores)
my_sentiment_scores = [sia.polarity_scores(text) for text in my_sample_data]
my_sentiment_df = pd.DataFrame(my_sentiment_scores)
print("\nSentiment Lexicons (VADER):\n", my_sentiment_df)

# 5. Part-of-Speech (POS) Tags
my_pos_tags = [nltk.pos_tag(nltk.word_tokenize(text)) for text in my_sample_data]
print("\nPart-of-Speech (POS) Tags:\n", my_pos_tags)

# 6. Sentence Length and Punctuation Statistics
my_sentence_lengths = [len(text.split()) for text in my_sample_data]
my_punctuation_counts = [text.count("!") for text in my_sample_data]
my_expressiveness_df = pd.DataFrame({"Sentence Length": my_sentence_lengths, "Exclamation Count": my_punctuation_counts})
print("\nSentence Length and Punctuation Statistics:\n", my_expressiveness_df)



Bag of Words (BoW):
    album  book  captivating  customer  disappointing  excellent  expectations  \
0      1     0            1         0              0          0             0   
1      0     1            0         0              1          0             1   
2      0     0            0         1              0          1             0   

   found  incredibly  latest  listening  meet  new  product  quality  quite  \
0      0           1       0          1     0    1        0        0      0   
1      1           0       1          0     1    0        0        0      1   
2      0           0       0          0     0    0        1        1      0   

   service  stop  subpar  
0        0     1       0  
1        0     0       0  
2        1     0       1  

TF-IDF:
       album      book  captivating  customer  disappointing  excellent  \
0  0.408248  0.000000     0.408248  0.000000       0.000000   0.000000   
1  0.000000  0.377964     0.000000  0.000000       0.377964   0.000000 
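
The answer to Question 1 notes that these feature types can be combined. A minimal sketch of one way to do that is shown here: a sparse TF-IDF block and a small dense block of surface statistics are stacked column-wise into a single matrix. The variable names are illustrative, and scikit-learn's built-in `"english"` stop-word list is used as an assumption in place of the NLTK list used in the cell above.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The new album is incredibly captivating; I can't stop listening to it!",
    "I found the latest book to be quite disappointing; it didn't meet my expectations.",
    "The customer service was excellent, but the product quality was subpar.",
]

# Sparse lexical block: unigram and bigram TF-IDF weights.
# Assumption: scikit-learn's built-in stop-word list instead of NLTK's.
tfidf = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
lexical_block = tfidf.fit_transform(docs)

# Dense surface block: whitespace token count and exclamation-mark count.
surface_block = np.array(
    [[len(d.split()), d.count("!")] for d in docs], dtype=float
)

# Stack the blocks column-wise so each row describes one document.
combined = hstack([lexical_block, csr_matrix(surface_block)]).tocsr()
print(combined.shape)
```

Because the surface counts are on a different scale from the TF-IDF weights, it may be worth scaling them before training a scale-sensitive model.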

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [8]:
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer

# Download necessary NLTK resources
nltk.download('stopwords')

# Sample text data
my_sample_data = [
    "The new album is incredibly captivating; I can't stop listening to it!",
    "I found the latest book to be quite disappointing; it didn't meet my expectations.",
    "The customer service was excellent, but the product quality was subpar.",
]

# One sentiment label per document, in the same order as my_sample_data
my_custom_labels = ['positive', 'negative', 'neutral']

# Map each label to an integer class index
label_dict_custom = {label: idx for idx, label in enumerate(my_custom_labels)}
label_indices_custom = [label_dict_custom[label] for label in my_custom_labels]

# Initialize CountVectorizer for Bag of Words
my_count_vectorizer = CountVectorizer(stop_words=nltk.corpus.stopwords.words("english"))
my_bow_features = my_count_vectorizer.fit_transform(my_sample_data)

# Training a Random Forest classifier
my_rf_classifier = RandomForestClassifier(n_estimators=100, random_state=42)
my_rf_classifier.fit(my_bow_features, label_indices_custom)

# Extracting feature importances
my_feature_importances = my_rf_classifier.feature_importances_

# Create a DataFrame to store feature names and importances
my_feature_importances_df = pd.DataFrame({'Feature': my_count_vectorizer.get_feature_names_out(), 'Importance': my_feature_importances})

# Sort features by importance in descending order
my_sorted_features = my_feature_importances_df.sort_values(by='Importance', ascending=False)

# Print top N important features
top_n_custom_features = 5
for idx, row in my_sorted_features.head(top_n_custom_features).iterrows():
    print(f"Feature: {row['Feature']}, Importance: {row['Importance']:.4f}")




Feature: album, Importance: 0.0706
Feature: customer, Importance: 0.0706
Feature: expectations, Importance: 0.0706
Feature: book, Importance: 0.0647
Feature: service, Importance: 0.0647


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [10]:
!pip install transformers
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
my_sample_data = [
    "The new album is incredibly captivating; I can't stop listening to it!",
    "I found the latest book to be quite disappointing; it didn't meet my expectations.",
    "The customer service was excellent, but the product quality was subpar.",
]

# Query
my_query = "I want to read a captivating book tonight."

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Encode the query and sample data into BERT embeddings
def get_bert_embeddings(text):
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze().numpy()
    return embeddings

query_embedding = get_bert_embeddings(my_query)
sample_embeddings = [get_bert_embeddings(text) for text in my_sample_data]

# Calculate cosine similarity between the query and each sample
similarities = cosine_similarity([query_embedding], sample_embeddings)[0]

# Rank the similarities in descending order
ranking = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Print ranked results
print("Ranked Documents based on Similarity to Query:")
for rank, (index, similarity) in enumerate(ranking):
    print(f"Rank {rank + 1}: Similarity = {similarity:.4f}")
    print(f"Text: {my_sample_data[index]}\n")








Ranked Documents based on Similarity to Query:
Rank 1: Similarity = 0.7252
Text: The new album is incredibly captivating; I can't stop listening to it!

Rank 2: Similarity = 0.7088
Text: I found the latest book to be quite disappointing; it didn't meet my expectations.

Rank 3: Similarity = 0.5241
Text: The customer service was excellent, but the product quality was subpar.
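
The cell above encodes one text at a time, so no padding tokens enter the mean over `last_hidden_state`. If the texts were instead encoded as a padded batch, a plain mean would also average the padding positions. The sketch below shows mask-aware mean pooling followed by cosine ranking, using small NumPy arrays as stand-ins for model output. The function names `masked_mean_pool` and `rank_by_cosine` and the toy values are hypothetical and exist only for illustration.

```python
import numpy as np

def masked_mean_pool(token_embeddings, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask[:, :, None].astype(float)    # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)      # (batch, hidden)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)      # (batch, 1)
    return summed / counts

def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs, most similar first."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sims = d @ q
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

# Toy stand-ins for model output: 2 sequences, 3 token slots, hidden size 2.
tokens = np.array([
    [[1.0, 0.0], [1.0, 0.0], [9.0, 9.0]],   # last slot is padding
    [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
])
mask = np.array([
    [1, 1, 0],
    [1, 1, 1],
])

doc_vecs = masked_mean_pool(tokens, mask)
ranking = rank_by_cosine(np.array([1.0, 0.0]), doc_vecs)
print(ranking)
```

With a real model, `tokens` would correspond to `outputs.last_hidden_state` and `mask` to the tokenizer's `attention_mask`, both converted to NumPy arrays.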



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [11]:
# Your answer here (no code for this question; write your answer to the questions above in as much detail as possible):

'''
Please write your answer here: I felt I needed more time for these tasks, even though we had materials and videos.
It would be helpful to try different approaches and to work through the errors.
It was challenging to get the expected output; there was a lot of trial and error in the process.





'''
