<a href="https://colab.research.google.com/github/KrinalM/Krinalben_INFO5731_Spring2020/blob/main/Monpara_Krinalben_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Sentiment analysis of customer reviews for a product or service is one interesting text classification task.
Sentiment analysis identifies the sentiment (positive, negative, or neutral) expressed in a text.
It lets businesses gauge customer satisfaction and pinpoint areas for improvement.

Several kinds of features can be used to build a machine learning model for sentiment analysis:

Bag-of-Words (BoW) Features: BoW represents a text as a vector of word frequencies.
                             Every word in the vocabulary becomes a feature, and its value is the number of times the word appears in the document.
                             BoW records which words appear in the text, and those words often signal sentiment directly.
                             For instance, positive terms like "good" and "excellent" or negative terms like "bad" and "poor" are strong indicators of polarity.

Part-of-Speech (POS) Features: POS tagging labels each word in a text with its part of speech (noun, verb, adjective, etc.).
                               POS features capture grammatical structures and syntactic patterns that influence sentiment.
                               For example, adjectives and adverbs are frequently effective markers of sentiment in a sentence.
                               The distribution of POS tags in a text can therefore help indicate the sentiment it conveys.

TF-IDF Features: Term Frequency-Inverse Document Frequency (TF-IDF) is a statistical measure of how important a word is to a document relative to a collection of documents.
                 It combines a term's frequency within a document (TF) with its rarity across documents (IDF).
                 By giving greater weight to terms that occur frequently in one text but rarely across the corpus, this representation can help capture sentiment-bearing terms more effectively.

N-gram Features: N-grams are contiguous sequences of n words from a text.
                 Because they consider word sequences rather than individual words, n-grams record more context.
                 They can capture sentiment that depends on word combinations.
                 Bigrams such as "not good" or "very helpful" are one example: they express sentiment that the single words alone do not.

Sentiment Lexicon Features: Sentiment lexicons are dictionaries in which words are annotated with a sentiment polarity (positive, negative, or neutral).
                            Lexicon features are built by matching words in the text against lexicon entries and using the matched sentiment labels as features.
                            Because this brings in explicit prior knowledge about sentiment-bearing words, it can improve the model's performance on sentiment classification.

'''


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Importing necessary libraries
from typing import List

import nltk
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.util import ngrams
from sklearn.feature_extraction.text import TfidfVectorizer
from keras.preprocessing.text import Tokenizer

# word_tokenize and pos_tag need these NLTK resources
# (assumption: newer NLTK releases may name them "punkt_tab" and "averaged_perceptron_tagger_eng")
nltk.download("punkt", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

In [None]:
# Sample text data
texts = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

In [None]:
# Bag-of-Words (BoW) Features
def print_bow(texts: List[str]) -> None:
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(texts)
    sequences = tokenizer.texts_to_sequences(texts)
    word_index = tokenizer.word_index
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of word sentence 1:\n{bow}")
    print(f"We found {len(word_index)} unique tokens.")

print_bow(texts)

Bag of word sentence 1:
{'the': 1, 'is': 1, 'it': 1, "i'm": 0, 'of': 0, 'product': 1, 'really': 1, 'good': 1, 'i': 1, 'love': 1, 'this': 0, 'service': 0, 'terrible': 0, 'never': 0, 'using': 0, 'again': 0, 'quality': 0, 'item': 0, 'poor': 0, 'disappointed': 0, 'customer': 0, 'support': 0, 'was': 0, 'excellent': 0, 'they': 0, 'were': 0, 'very': 0, 'helpful': 0, 'not': 0, 'recommended': 0, 'waste': 0, 'money': 0}
We found 32 unique tokens.
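
The cell above counts words for the first sentence only. As a rough sketch of a full bag-of-words matrix (one count vector per text), the version below uses only the standard library. The `tokenize` helper and its regular expression are assumptions made for illustration and are not the Keras `Tokenizer` rules, although on this sample they also yield 32 unique tokens.

```python
import re
from collections import Counter

texts = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

def tokenize(text):
    # Hypothetical tokenizer: lowercase, keep alphabetic tokens,
    # and allow one internal apostrophe (e.g. "i'm")
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

tokenized = [tokenize(t) for t in texts]

# The vocabulary is the sorted set of all tokens seen in any text
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per text, one column per vocabulary term
bow_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    bow_matrix.append([counts[term] for term in vocabulary])

# Show only the non-zero entries of each row
for text, row in zip(texts, bow_matrix):
    nonzero = {term: c for term, c in zip(vocabulary, row) if c}
    print(f"{text!r}: {nonzero}")
```

Each row has the same length as the vocabulary, so the rows can be stacked directly into a feature matrix for a classifier.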


In [None]:
# Tokenization
tokenized_texts = [word_tokenize(text.lower()) for text in texts]
# Part-of-Speech (POS) Features
pos_features = []
for text in tokenized_texts:
    pos_tags = [tag for word, tag in pos_tag(text)]
    pos_features.append(pos_tags)
print("POS Features:")
print(pos_features)  # Print the POS tag sequences for each text
print()

POS Features:
[['DT', 'NN', 'VBZ', 'RB', 'JJ', '.', 'VB', 'VBP', 'PRP', '.'], ['DT', 'NN', 'VBZ', 'JJ', '.', 'NN', 'VBP', 'RB', 'VBG', 'PRP', 'RB', '.'], ['DT', 'NN', 'IN', 'DT', 'NN', 'VBZ', 'JJ', '.', 'JJ', 'VBP', 'JJ', '.'], ['DT', 'NN', 'NN', 'VBD', 'JJ', '.', 'PRP', 'VBD', 'RB', 'JJ', '.'], ['RB', 'VBN', '.', 'NN', 'IN', 'NN', '.']]
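
Raw tag sequences vary in length, so they are awkward to feed to a classifier directly. One possible way to turn them into fixed-length features is the relative frequency of a few tag groups per text. The sketch below assumes the tag sequences printed above (only the first two are copied in); `TRACKED_TAGS` and `pos_distribution` are hypothetical names chosen for illustration.

```python
from collections import Counter

# Tag sequences copied from the output above (first two texts only)
pos_sequences = [
    ['DT', 'NN', 'VBZ', 'RB', 'JJ', '.', 'VB', 'VBP', 'PRP', '.'],
    ['DT', 'NN', 'VBZ', 'JJ', '.', 'NN', 'VBP', 'RB', 'VBG', 'PRP', 'RB', '.'],
]

# Hypothetical choice of tag groups: adjectives, adverbs, nouns, verbs
TRACKED_TAGS = ("JJ", "RB", "NN", "VB")

def pos_distribution(tags, tracked=TRACKED_TAGS):
    """Return the share of non-punctuation tags that fall in each tracked group."""
    # Penn Treebank punctuation tags (".", ",", ...) do not start with a letter
    non_punct = [t for t in tags if t[0].isalpha()]
    counts = Counter()
    for tag in non_punct:
        for prefix in tracked:
            if tag.startswith(prefix):  # e.g. VBZ, VBP and VBG all count as VB
                counts[prefix] += 1
                break
    total = len(non_punct)
    return [counts[p] / total if total else 0.0 for p in tracked]

for tags in pos_sequences:
    print(dict(zip(TRACKED_TAGS, pos_distribution(tags))))
```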



In [None]:
# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(texts)
print("TF-IDF Features:")
print(tfidf_vectorizer.get_feature_names_out())  # Print the feature names
print(tfidf_features.toarray())  # Print the TF-IDF feature vectors
print()

TF-IDF Features:
['again' 'customer' 'disappointed' 'excellent' 'good' 'helpful' 'is' 'it'
 'item' 'love' 'money' 'never' 'not' 'of' 'poor' 'product' 'quality'
 'really' 'recommended' 'service' 'support' 'terrible' 'the' 'they' 'this'
 'using' 'very' 'was' 'waste' 'were']
[[0.         0.         0.         0.         0.42455503 0.
  0.28432945 0.34252832 0.         0.42455503 0.         0.
  0.         0.         0.         0.42455503 0.         0.42455503
  0.         0.         0.         0.         0.28432945 0.
  0.         0.         0.         0.         0.         0.        ]
 [0.37530838 0.         0.         0.         0.         0.
  0.2513484  0.30279644 0.         0.         0.         0.37530838
  0.         0.         0.         0.         0.         0.
  0.         0.37530838 0.         0.37530838 0.         0.
  0.37530838 0.37530838 0.         0.         0.         0.        ]
 [0.         0.         0.38087336 0.         0.         0.
  0.25507533 0.         0.3808733
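
To make the numbers above less opaque, here is a minimal standard-library sketch of how such weights can be computed by hand. It assumes the default `TfidfVectorizer` settings: tokens of two or more word characters, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each document vector. The helper names `tokenize` and `tfidf_vector` are made up for this example.

```python
import math
import re
from collections import Counter

texts = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

def tokenize(text):
    # Lowercase and keep tokens of 2+ word characters, so "I" is dropped
    return re.findall(r"\b\w\w+\b", text.lower())

docs = [tokenize(t) for t in texts]
n_docs = len(docs)

# Document frequency: number of texts that contain each term
df = Counter(term for doc in docs for term in set(doc))

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(doc):
    """Return {term: weight} for one tokenized text, L2-normalized."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first = tfidf_vector(docs[0])
for term in ("good", "is", "it"):
    print(f"{term}: {first[term]:.8f}")
```

Under these assumptions the weights for "good", "is", and "it" in the first text agree with the first row of the matrix printed above.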

In [None]:
# N-grams Features
n = 2  # for bigrams
ngram_features = []
for text in tokenized_texts:
    bigrams = list(ngrams(text, n))
    ngram_features.append([" ".join(bigram) for bigram in bigrams])
print("N-grams Features:")
print(ngram_features)  # Print the generated n-grams for each text
print()

N-grams Features:
[['the product', 'product is', 'is really', 'really good', 'good .', '. i', 'i love', 'love it', 'it !'], ['this service', 'service is', 'is terrible', 'terrible .', '. i', "i 'm", "'m never", 'never using', 'using it', 'it again', 'again .'], ['the quality', 'quality of', 'of the', 'the item', 'item is', 'is poor', 'poor .', '. i', "i 'm", "'m disappointed", 'disappointed .'], ['the customer', 'customer support', 'support was', 'was excellent', 'excellent .', '. they', 'they were', 'were very', 'very helpful', 'helpful .'], ['not recommended', 'recommended .', '. waste', 'waste of', 'of money', 'money .']]
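
The bigrams above include punctuation pairs such as 'good .' and '. i', which carry little sentiment. A possible refinement, sketched below with the standard library only, builds n-grams within punctuation-delimited segments so that no n-gram contains or crosses a punctuation token. The function name `word_ngrams` is hypothetical.

```python
def word_ngrams(tokens, n=2):
    """Build n-grams that neither include nor span punctuation tokens."""
    result = []
    segment = []
    # A trailing sentinel "." flushes the last segment
    for tok in list(tokens) + ["."]:
        if any(ch.isalnum() for ch in tok):
            # Word-like token (this includes contractions such as "'m")
            segment.append(tok)
        else:
            # Punctuation closes the current segment
            result.extend(
                " ".join(segment[i:i + n]) for i in range(len(segment) - n + 1)
            )
            segment = []
    return result

# Tokens of the first text, as produced by word_tokenize on the lowercased text
tokens = ['the', 'product', 'is', 'really', 'good', '.', 'i', 'love', 'it', '!']
print(word_ngrams(tokens, 2))
print(word_ngrams(tokens, 3))
```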



In [None]:
# Sentiment Lexicon Features
sentiment_lexicon = {
    "good": "positive",
    "love": "positive",
    "terrible": "negative",
    "poor": "negative",
    "excellent": "positive",
    "helpful": "positive",
    "recommended": "positive",
    "waste": "negative"
}

sentiment_lexicon_features = []
for text in tokenized_texts:
    sentiment_words = [word for word in text if word in sentiment_lexicon]
    sentiment_labels = [sentiment_lexicon[word] for word in sentiment_words]
    sentiment_lexicon_features.append(sentiment_labels)
print("Sentiment Lexicon Features:")
print(sentiment_lexicon_features)  # Print the sentiment labels for sentiment words in each text

Sentiment Lexicon Features:
[['positive', 'positive'], ['negative'], ['negative'], ['positive', 'positive'], ['positive', 'negative']]
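
Note that the last text ("Not recommended. Waste of money.") yields a 'positive' label because "recommended" is matched without regard to the preceding "not". One simple way to handle this, shown as a sketch only, is to flip the polarity of a lexicon word that directly follows a negator and to sum the polarities into one score per text. The `NEGATORS` set, the numeric polarities, and the one-token negation window are all assumptions made for illustration.

```python
# Same words as the lexicon above, with polarity encoded as +1 / -1
LEXICON = {
    "good": 1,
    "love": 1,
    "terrible": -1,
    "poor": -1,
    "excellent": 1,
    "helpful": 1,
    "recommended": 1,
    "waste": -1,
}

# Hypothetical negation words
NEGATORS = {"not", "never", "no"}

def lexicon_score(tokens):
    """Sum lexicon polarities, flipping a word that directly follows a negator."""
    score = 0
    negate = False
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        polarity = LEXICON.get(tok)
        if polarity is not None:
            score += -polarity if negate else polarity
        # Negation only reaches the very next token in this sketch
        negate = False
    return score

print(lexicon_score(['the', 'product', 'is', 'really', 'good', '.', 'i', 'love', 'it', '!']))
print(lexicon_score(['not', 'recommended', '.', 'waste', 'of', 'money', '.']))
```

A one-token window misses phrases like "not very good"; a wider window or a dependency parse would be needed for those cases.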


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import numpy as np

# Sum each feature's TF-IDF weights across all documents to get one score per feature
tfidf_scores = np.sum(tfidf_features, axis=0).tolist()[0]

# Create a dictionary mapping feature names to their TF-IDF scores
feature_tfidf_scores = dict(zip(tfidf_vectorizer.get_feature_names_out(), tfidf_scores))

# Sort the features based on their TF-IDF scores in descending order
sorted_features = sorted(feature_tfidf_scores.items(), key=lambda x: x[1], reverse=True)

# Print the sorted features
print("Features ranked by TF-IDF scores:")
for feature, score in sorted_features:
    print(f"{feature}: {score}")


Features ranked by TF-IDF scores:
the: 1.0248881805932226
is: 0.7907531744687686
of: 0.681391000080969
it: 0.6453247622448499
money: 0.4636932227319092
not: 0.4636932227319092
recommended: 0.4636932227319092
waste: 0.4636932227319092
good: 0.4245550254370497
love: 0.4245550254370497
product: 0.4245550254370497
really: 0.4245550254370497
disappointed: 0.38087335870660655
item: 0.38087335870660655
poor: 0.38087335870660655
quality: 0.38087335870660655
again: 0.3753083838272226
never: 0.3753083838272226
service: 0.3753083838272226
terrible: 0.3753083838272226
this: 0.3753083838272226
using: 0.3753083838272226
customer: 0.34404071666393615
excellent: 0.34404071666393615
helpful: 0.34404071666393615
support: 0.34404071666393615
they: 0.34404071666393615
very: 0.34404071666393615
was: 0.34404071666393615
were: 0.34404071666393615
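
Summed TF-IDF weights ignore class labels, which is why function words such as "the" and "is" rank highest above. A supervised alternative is the chi-square statistic, which scores each term by how strongly its presence is associated with a class. The sketch below uses the 2x2 contingency form `N(AD - BC)^2 / ((A+B)(C+D)(A+C)(B+D))` and only the standard library. The `labels` list is a hypothetical annotation of the five sample texts (1 = positive, 0 = negative), and the helper names are made up for this example.

```python
import re

texts = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]
# Hypothetical labels for illustration: 1 = positive, 0 = negative
labels = [1, 0, 0, 1, 0]

def term_set(text):
    # Set of distinct terms (2+ word characters) in one text
    return set(re.findall(r"\b\w\w+\b", text.lower()))

def chi_square(term, docs, labels):
    """Chi-square statistic of term presence versus the positive class."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = term in doc
        if present and label == 1:
            a += 1  # term present, positive class
        elif present:
            b += 1  # term present, negative class
        elif label == 1:
            c += 1  # term absent, positive class
        else:
            d += 1  # term absent, negative class
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0

docs = [term_set(t) for t in texts]
vocab = sorted(set().union(*docs))
scores = {term: chi_square(term, docs, labels) for term in vocab}

# Rank features by chi-square score in descending order
ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
for term, score in ranked[:10]:
    print(f"{term}: {score:.3f}")
```

With only five texts the scores are not statistically meaningful; the sketch is meant to show the mechanics, which carry over unchanged to a larger labeled corpus.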


## Question 4 (10 points):
Write Python code to rank the texts by similarity. Using the text data from Question 2, design a query to match the most relevant documents. Use a BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text. Rank the texts by similarity in descending order.

In [None]:
!pip install sentence-transformers



In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
texts = [
    "The product is really good. I love it!",
    "This service is terrible. I'm never using it again.",
    "The quality of the item is poor. I'm disappointed.",
    "The customer support was excellent. They were very helpful.",
    "Not recommended. Waste of money.",
]

# Query
query = "I'm looking for a product with excellent customer support."

# Load pre-trained BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Generate embeddings for text data and query
text_embeddings = model.encode(texts)
query_embedding = model.encode([query])[0]

# Calculate cosine similarity between query and text data
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

# Rank texts based on similarity scores
ranked_texts = sorted(
    list(zip(texts, similarities)),
    key=lambda x: x[1],
    reverse=True
)

# Print ranked texts
print("Ranked Texts Based on Similarity:")
for text, similarity in ranked_texts:
    print(f"Similarity: {similarity:.4f} - Text: {text}")

Ranked Texts Based on Similarity:
Similarity: 0.7898 - Text: The customer support was excellent. They were very helpful.
Similarity: 0.7636 - Text: The product is really good. I love it!
Similarity: 0.3404 - Text: The quality of the item is poor. I'm disappointed.
Similarity: 0.1495 - Text: Not recommended. Waste of money.
Similarity: 0.1000 - Text: This service is terrible. I'm never using it again.
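
For reference, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below spells this out with the standard library and ranks a few documents against a query. The three-dimensional vectors and the names `cosine` and `rank_by_similarity` are invented for illustration; they are not real BERT embeddings, which have hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (doc, similarity) pairs sorted by similarity, highest first."""
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy vectors, invented for illustration only
docs = ["doc A", "doc B", "doc C"]
doc_vecs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 1.0, 0.0]

for doc, sim in rank_by_similarity(query_vec, doc_vecs, docs):
    print(f"Similarity: {sim:.4f} - Text: {doc}")
```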


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Working on extracting features from text data was a valuable learning experience.
It reinforced the importance of feature engineering in natural language processing (NLP) tasks and
highlighted various techniques for representing text data in a machine-readable format.

The key concepts that I found most beneficial in understanding the feature extraction process were:
Bag-of-Words (BoW) representation
TF-IDF
Part-of-Speech (POS) tagging
Sentiment Lexicon features

One challenge I encountered was understanding and implementing feature selection techniques mentioned in the paper by Deng et al. (2019).
While the exercise focused on using TF-IDF scores for feature ranking, exploring other statistical measures like
Information Gain or Chi-square could provide a more comprehensive understanding of feature selection in text classification tasks.

This exercise is highly relevant to the field of Natural Language Processing (NLP).
Feature extraction plays a crucial role in NLP tasks such as sentiment analysis, text classification, and information retrieval.
By learning how to extract and select relevant features from text data,
NLP practitioners can improve the performance and interpretability of their machine learning models.
'''
