
## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):
'''
Sorting customer reviews of a product into positive, negative, or neutral sentiment
would be an interesting text classification task.
It matters because businesses need to understand customer feedback in order to improve their goods and services.
To build a machine learning model for this task, five types of features might be useful:
bag of words, N-grams, part-of-speech tags, a sentiment lexicon, and readability.

The Bag of Words (BoW) approach treats each word in the review independently.
The frequency of each word within the review is counted and used as a feature.
This kind of feature is helpful because it captures the vocabulary customers use and how often
particular words appear in positive, negative, and neutral reviews.

N-grams are similar to BoW, but they count contiguous sequences of N words
rather than individual words.
Bigrams (N=2), for instance, would capture phrases like "great product"
or "poor quality". This kind of feature is beneficial because it records how words are used in context.

Part-of-speech (POS) features specify each word's grammatical function,
such as whether it is a noun, verb, adjective, etc. 
The syntactic structure of the review and how it links to sentiment can be captured by POS features. 

Sentiment lexicons are collections of words labeled as expressing positive or negative sentiment.
Lexicon features indicate the presence of sentiment-bearing words in the review and their polarity.
This kind of feature is advantageous because it brings in prior knowledge about the meaning of words.

Readability features, such as average sentence length (words per sentence) and average word length,
reflect how complex the language used in the review is.
These features may help relate the complexity of a customer's writing to the sentiment expressed.
'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction. 

In [None]:
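# Your code here (please add comments in the code):
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer

# Download the NLTK resources used below (no-op if already present).
# Newer NLTK releases may also ask for 'punkt_tab' and 'averaged_perceptron_tagger_eng'.
for resource in ['punkt', 'stopwords', 'averaged_perceptron_tagger', 'vader_lexicon']:
    nltk.download(resource, quiet=True)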

# Sample text data
texts = [ "This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.",  
         "I was really disappointed with this product. It didn't work as advertised.", 
         "I love this product so much! It has made my life so much easier.",  
         "This product is just okay. It wasn't great, but it wasn't terrible either.", 
         "I would never buy this product again. It was a complete waste of money."]

# Tokenize and preprocess text data: lowercase, then drop English stop words
stop_words = set(stopwords.words('english'))  # build the set once, outside the loop
processed_texts = []
for text in texts:
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_words]
    processed_texts.append(' '.join(tokens))

# Bag of Words features
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(processed_texts).toarray()

# N-grams features
ngram_vectorizer = CountVectorizer(ngram_range=(2,2))
ngram_features = ngram_vectorizer.fit_transform(processed_texts).toarray()

# Part-of-speech (POS) features
pos_features = []
for text in texts:
    tokens = word_tokenize(text.lower())
    pos_features.append([pos for token, pos in nltk.pos_tag(tokens)])

# Sentiment lexicon features
sid = SentimentIntensityAnalyzer()
lexicon_features = []
for text in texts:
    polarity_scores = sid.polarity_scores(text)
    lexicon_features.append([polarity_scores['pos'], polarity_scores['neg'], polarity_scores['neu']])

# Readability features
readability_features = []
for text in texts:
    words = text.split()
    num_words = len(words)
    num_sentences = len(nltk.sent_tokenize(text))
    avg_sentence_length = float(num_words/num_sentences)
    avg_word_length = float(sum(len(word) for word in words)/num_words)
    readability_features.append([avg_sentence_length, avg_word_length])

# Print features
print("Bag of Words features: ", bow_features)
print("N-grams features: ", ngram_features)
print("Part-of-speech (POS) features: ", pos_features)
print("Sentiment lexicon features: ", lexicon_features)
print("Readability features: ", readability_features)


Bag of Words features:  [[0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 2 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1]]
N-grams features:  [[1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1]]
Part-of-speech (POS) features:  [['DT', 'NN', 'VBZ', 'JJ', '.', 'PRP', 'VBD', 'PRP$', 'NNS', '.', 'NN', 'MD', 'RB', 'VB', 'PRP', 'TO', 'NN', '.'], ['NN', 'VBD', 'RB', 'JJ', 'IN', 'DT', 'NN', '.', 'PRP', 'VBD', 'RB', 'VB', 'RB', 'JJ', '.'], ['RB', 'VBP', 'DT', 'NN', 'RB', 'JJ', '.', 'PRP', 'VBZ', 'VBN', 'PRP$', 'NN', 'RB', 'RB', 'JJR', '.'], ['DT', 'NN', 'VBZ', 'RB', 'RB', '.', 'PRP', 'VBD', '
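
The POS features printed above are tag sequences of different lengths, so they cannot be stacked directly into a feature matrix the way the BoW and N-gram counts can. One possible way to handle this, sketched below, is to count how often each tag occurs in a review. The helper name `pos_count_matrix` is hypothetical, and the two tag sequences are copied from the first two reviews in the output above so that the sketch runs without NLTK.

```python
from collections import Counter

# POS tag sequences for the first two sample reviews, copied from the output above.
pos_sequences = [
    ['DT', 'NN', 'VBZ', 'JJ', '.', 'PRP', 'VBD', 'PRP$', 'NNS', '.',
     'NN', 'MD', 'RB', 'VB', 'PRP', 'TO', 'NN', '.'],
    ['NN', 'VBD', 'RB', 'JJ', 'IN', 'DT', 'NN', '.', 'PRP', 'VBD',
     'RB', 'VB', 'RB', 'JJ', '.'],
]

def pos_count_matrix(sequences):
    """Return (sorted tag vocabulary, one row of tag counts per sequence)."""
    vocab = sorted({tag for seq in sequences for tag in seq})
    rows = []
    for seq in sequences:
        counts = Counter(seq)  # missing tags count as 0
        rows.append([counts[tag] for tag in vocab])
    return vocab, rows

pos_vocab, pos_matrix = pos_count_matrix(pos_sequences)

# Print one line per tag: the tag followed by its count in each review.
for tag, *column in zip(pos_vocab, *pos_matrix):
    print(tag, column)
```

Every row now has the same length (one column per tag), so the result could be combined with the other feature blocks.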

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above, rank the features based on their importance in the descending order. 

In [None]:
# Feature selection using the chi-square test
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from scipy.sparse import hstack
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Sample data
documents= ["This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.", 
            "I was really disappointed with this product. It didn't work as advertised.", 
            "I love this product so much! It has made my life so much easier.",    
            "This product is just okay. It wasn't great, but it wasn't terrible either.",  
            "I would never buy this product again. It was a complete waste of money."]

# Target class labels
labels = np.array([1, 0, 1, 2, 0])  # 1 = positive, 0 = negative, 2 = neutral

# Bag of Words feature extraction
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(documents)

# N-Grams feature extraction
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 2))
ngram_features = vectorizer_ngrams.fit_transform(documents)

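# "Part-of-speech" slot: note that this does not compute POS tags. It counts
# stop-word-filtered tokens plus a few punctuation marks (!, ?, ", '),
# so it largely overlaps with the bag-of-words features above.
pos_vectorizer = CountVectorizer(token_pattern=r'\b\w\w+\b|!|\?|\"|\'', ngram_range=(1,1), analyzer='word', 
                                 stop_words='english')
pos_features = pos_vectorizer.fit_transform(documents)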

# Sentiment Lexicon feature extraction
analyzer = SentimentIntensityAnalyzer()
lexicon_features = []
for doc in documents:
    vs = analyzer.polarity_scores(doc)
    # chi2 requires non-negative inputs; note that abs() discards the sign of 'compound'
    lexicon_features.append([abs(vs['neg']), abs(vs['neu']), abs(vs['pos']), abs(vs['compound'])])
lexicon_features = np.array(lexicon_features)

# Readability feature extraction
readability_features = []
for doc in documents:
    flesch_score = textstat.flesch_reading_ease(doc)
    smog_score = textstat.smog_index(doc)
    # apply non-negative transformation
    readability_features.append([abs(flesch_score), abs(smog_score)])
readability_features = np.array(readability_features)

# Concatenate all features horizontally
features = hstack((bow_features, ngram_features, pos_features, lexicon_features, readability_features))

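# Get feature names (get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2)
feature_names = (list(vectorizer_bow.get_feature_names_out())
                 + list(vectorizer_ngrams.get_feature_names_out())
                 + list(pos_vectorizer.get_feature_names_out())
                 + ['neg', 'neu', 'pos', 'compound']
                 + ['flesch_score', 'smog_score'])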

# Chi-Square feature selection
chi2_scores, _ = chi2(features, labels)
feature_scores = list(zip(feature_names, chi2_scores))
feature_scores.sort(key=lambda x: x[1], reverse=True)
print("Chi-Square scores for all features:")
for feature, score in feature_scores:
    print(feature, score)


Chi-Square scores for all features:
smog_score 14.549999999999999
wasn 8.000000000000002
it wasn 8.000000000000002
wasn 8.000000000000002
' 4.499999999999999
but 4.000000000000001
either 4.000000000000001
great 4.000000000000001
just 4.000000000000001
okay 4.000000000000001
terrible 4.000000000000001
but it 4.000000000000001
great but 4.000000000000001
is just 4.000000000000001
just okay 4.000000000000001
okay it 4.000000000000001
terrible either 4.000000000000001
wasn great 4.000000000000001
wasn terrible 4.000000000000001
great 4.000000000000001
just 4.000000000000001
okay 4.000000000000001
terrible 4.000000000000001
much 3.0
my 3.0
so 3.0
was 3.0
so much 3.0
! 3.0
is 1.75
product is 1.75
advertised 1.5
again 1.5
amazing 1.5
anyone 1.5
as 1.5
buy 1.5
complete 1.5
definitely 1.5
didn 1.5
disappointed 1.5
easier 1.5
exceeded 1.5
expectations 1.5
has 1.5
life 1.5
love 1.5
made 1.5
money 1.5
never 1.5
of 1.5
really 1.5
recommend 1.5
to 1.5
waste 1.5
with 1.5
work 1.5
again it 1.5
amazing
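
To see where scores such as `wasn 8.0` or `amazing 1.5` come from, the chi-square statistic for a single feature can be reproduced by hand. The sketch below is a plain-Python approximation of what `chi2` appears to compute (observed per-class feature totals compared against totals expected from the class proportions). The function name `chi2_score` and the three count vectors are illustrative, built from the five sample documents and labels above.

```python
def chi2_score(feature_counts, labels):
    """Chi-square statistic of one non-negative count feature against class labels."""
    total = sum(feature_counts)
    if total == 0:
        return float("nan")  # a feature that never occurs has no defined score
    n_docs = len(labels)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how much of the feature's total falls in this class.
        observed = sum(c for c, y in zip(feature_counts, labels) if y == cls)
        # Expected: the class's share of documents times the feature's total.
        class_prob = sum(1 for y in labels if y == cls) / n_docs
        expected = class_prob * total
        score += (observed - expected) ** 2 / expected
    return score

labels = [1, 0, 1, 2, 0]      # same labels as above: 1 positive, 0 negative, 2 neutral
wasn = [0, 0, 0, 2, 0]        # "wasn" occurs twice, only in the neutral review
much = [0, 0, 2, 0, 0]        # "much" occurs twice, only in the second positive review
amazing = [1, 0, 0, 0, 0]     # "amazing" occurs once, in the first positive review

for name, counts in [("wasn", wasn), ("much", much), ("amazing", amazing)]:
    print(name, chi2_score(counts, labels))
```

With only five documents, a term that happens to occur in the single neutral review gets a high score, so this ranking mostly reflects the tiny sample rather than real importance.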



Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order. 

In [None]:
# Your code here (please add comments in the code):
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Define the query and texts
# Use a roughly neutral statement as the query
query = "I recently tried out the new restaurant in town and found the food to be decent.The service was good and the ambiance was pleasant. However, the prices were a bit on the higher side. Overall, it was a decent experience but I am not sure if I would go back again given the price point."
data = ["This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.", 
            "I was really disappointed with this product. It didn't work as advertised.", 
            "I love this product so much! It has made my life so much easier.",    
            "This product is just okay. It wasn't great, but it wasn't terrible either.",  
            "I would never buy this product again. It was a complete waste of money."]


# Compute the embeddings for the query and texts
query_embedding = model.encode([query])[0]
text_embeddings = model.encode(data)

# Compute the cosine similarities between the query and texts
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

# Rank the texts based on their similarity to the query
ranked_texts = sorted(zip(data, similarities), key=lambda x: x[1], reverse=True)

print(ranked_texts)





[("This product is just okay. It wasn't great, but it wasn't terrible either.", 0.60150015), ('I love this product so much! It has made my life so much easier.', 0.56847817), ('This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.', 0.45820385), ("I was really disappointed with this product. It didn't work as advertised.", 0.44249266), ('I would never buy this product again. It was a complete waste of money.', 0.29375678)]
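
For reference, the ranking step can be illustrated without the model. The sketch below computes cosine similarity by hand (dot product divided by the product of the vector norms) and sorts in descending order. The 3-dimensional vectors and the names `toy_query`, `toy_docs`, and `cosine` are made up for illustration and are not BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors standing in for sentence embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_docs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially overlapping
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Score every document against the query, then sort from most to least similar.
ranking = sorted(
    ((name, cosine(toy_query, vec)) for name, vec in toy_docs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranking:
    print(name, round(score, 3))
```

Because cosine similarity depends only on direction, scaling a vector does not change its score.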
