<a href="https://colab.research.google.com/github/Sammii0207/sami/blob/main/In_class_exercise_03_02282023_Final.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The third In-class-exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write down your answer in as much detail as possible):

'''
Please write your answer here:
I find sentiment analysis to be an interesting text classification task. It involves determining the emotional tone of a given piece of text, such as a tweet, product review, or news article. 
Sentiment analysis has many practical applications, such as monitoring public opinion, predicting customer behavior, or analyzing political discourse.

To build a machine learning model for sentiment analysis, we need to extract relevant features from the text that can help us distinguish between positive, 
negative, and neutral sentiments. Here are five types of features that might be useful:


Firstly, bag-of-words features capture the frequency of each word in the text, ignoring the order and context of the words. 
This type of feature can help the model identify the most frequent words associated with positive or negative sentiments, such as "great," "terrible," "love," or "hate."

Secondly, part-of-speech (POS) features indicate the grammatical category of each word in the text, such as noun, verb, adjective, or adverb. 
This type of feature can help the model identify the syntactic patterns associated with different sentiments, such as the use of more adjectives in positive reviews.

Thirdly, sentiment lexicons are lists of words annotated with their polarity, such as positive, negative, or neutral. 
This type of feature can help the model leverage external knowledge about the emotional connotations of words, even if they are rare or context-dependent.

Fourthly, emotion features capture the presence and intensity of specific emotions in the text, such as joy, sadness, anger, or fear. 
This type of feature can help the model capture the nuances of different emotional states and distinguish between similar sentiments, such as frustration and anger.

Lastly, stylistic features capture various aspects of the writing style, such as sentence length, punctuation, capitalization, or emoticons. 
This type of feature can help the model capture the subjective or ironic tone of the text and detect sarcasm or humor that might affect the sentiment.




'''
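
The bag-of-words description above says the representation ignores word order. A minimal sketch of what that means in practice, using only the standard library, is shown here. The `bag_of_words` helper, its regex tokenizer, and the two sample reviews are illustrative assumptions and are not part of the exercise data.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts for `text`, ignoring case, punctuation, and word order."""
    # Hypothetical helper: a simple regex stands in for a real tokenizer.
    return Counter(re.findall(r"[a-z']+", text.lower()))

# Two reviews with opposite meaning but exactly the same words
a = bag_of_words("The food was great, not terrible.")
b = bag_of_words("The food was terrible, not great.")

print(a == b)      # True: both reviews map to the same bag of words
print(a["great"])  # 1
```

Because the two opposite reviews collapse to the same counts, bag-of-words alone cannot separate them, which is one reason the answer pairs it with POS, lexicon, emotion, and stylistic features.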

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
# Your code here (please add comments in the code):
import re
from collections import Counter  # standard-library Counter instead of relying on nltk.Counter

import nltk
from nltk.corpus import opinion_lexicon
from nltk.sentiment import SentimentIntensityAnalyzer
from nltk.tokenize import word_tokenize

# Download the NLTK resources used below
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('opinion_lexicon')
nltk.download('vader_lexicon')

# Sample texts, one per sentiment class
positive_text = "I absolutely loved this product! It exceeded my expectations in every way."
negative_text = "I was very disappointed with this product. It did not live up to my expectations at all."
neutral_text = "This product is okay. It's not great, but it's not terrible either."

# Bag-of-words feature extraction (token counts for the positive sample)
tokens = word_tokenize(positive_text)
bag_of_words = nltk.FreqDist(tokens)

# POS feature extraction: count how often each part-of-speech tag occurs
pos_tags = nltk.pos_tag(tokens)
pos_counts = Counter(tag for word, tag in pos_tags)

# Sentiment lexicon feature extraction
positive_words = set(opinion_lexicon.positive())
negative_words = set(opinion_lexicon.negative())
sentiment_words = [word for word in tokens if word in positive_words or word in negative_words]
sentiment_counts = nltk.FreqDist(sentiment_words)

# Emotion feature extraction
analyzer = SentimentIntensityAnalyzer()

positive_scores = analyzer.polarity_scores(positive_text)
positive_emotion_counts = {emotion: score for emotion, score in positive_scores.items() if emotion != "compound"}

negative_scores = analyzer.polarity_scores(negative_text)
negative_emotion_counts = {emotion: score for emotion, score in negative_scores.items() if emotion != "compound"}

neutral_scores = analyzer.polarity_scores(neutral_text)
neutral_emotion_counts = {emotion: score for emotion, score in neutral_scores.items() if emotion != "compound"}

# Stylistic feature extraction
num_chars = len(positive_text)
num_words = len(tokens)
num_sentences = len(nltk.sent_tokenize(positive_text))
num_exclamation_marks = positive_text.count("!")
num_question_marks = positive_text.count("?")
# Note: isupper() matches fully uppercase tokens (e.g. "I"), not title-cased words such as "It"
num_capitalized_words = len([word for word in tokens if word.isupper()])
num_emoticons = len(re.findall(r'[:;][-^]?[\)DpP\(\[{\|@}]', positive_text))

# Print the extracted features
print("\n===== Feature Extraction Results =====")

print("\nBag-of-words:")
for word, freq in bag_of_words.items():
    print(f"{word}: {freq}")

print("\nPOS tags:")
for tag, count in pos_counts.items():
    # Words in the sample that carry this tag
    words = ', '.join(word for word, t in pos_tags if t == tag)
    if tag.startswith('N'):
        print(f"noun: {count} - {words}")
    elif tag.startswith('V'):
        print(f"verb: {count} - {words}")
    elif tag.startswith('J'):
        print(f"adjective: {count} - {words}")
    elif tag.startswith('R'):
        print(f"adverb: {count} - {words}")
    else:
        # Any other tag (pronoun, determiner, punctuation, ...) is printed as-is
        print(f"{tag}: {count}")

print("\nSentiment words:")
for word, freq in sentiment_counts.items():
    print(f"{word}: {freq}")

# VADER polarity scores (neg / neu / pos) for each sample text
print("\nPositive emotions:")
for emotion, score in positive_emotion_counts.items():
    print(f"{emotion}: {score}")

print("\nNegative emotions:")
for emotion, score in negative_emotion_counts.items():
    print(f"{emotion}: {score}")

print("\nNeutral emotions:")
for emotion, score in neutral_emotion_counts.items():
    print(f"{emotion}: {score}")

print("\nStylistic features:")
print(f"Number of characters: {num_chars}")
print(f"Number of words: {num_words}")
print(f"Number of sentences: {num_sentences}")
print(f"Number of exclamation marks: {num_exclamation_marks}")
print(f"Number of question marks: {num_question_marks}")
print(f"Number of capitalized words: {num_capitalized_words}")
print(f"Number of emoticons: {num_emoticons}")





===== Feature Extraction Results =====

Bag-of-words:
I: 1
absolutely: 1
loved: 1
this: 1
product: 1
!: 1
It: 1
exceeded: 1
my: 1
expectations: 1
in: 1
every: 1
way: 1
.: 1

POS tags:
adverb: 1 - absolutely
verb: 2 - loved, exceeded
noun: 2 - product, way
noun: 1 - expectations
IN: 1

Sentiment words:
loved: 1
exceeded: 1

Positive emotions:
neg: 0.0
neu: 0.69
pos: 0.31

Negative emotions:
neg: 0.184
neu: 0.816
pos: 0.0

Neutral emotions:
neg: 0.135
neu: 0.565
pos: 0.3

Stylistic features:
Number of characters: 74
Number of words: 14
Number of sentences: 2
Number of exclamation marks: 1
Number of question marks: 0
Number of capitalized words: 1
Number of emoticons: 0

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package opinion_lexicon to /root/nltk_data...
[nltk_data]   Package opinion_lexicon is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
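
The cell above computes the bag-of-words, POS, lexicon, and stylistic features for `positive_text` only. One way to apply the same extraction to every sample is to wrap it in a function. The sketch below does this for the stylistic features, using only the standard library. The `stylistic_features` helper and its regex word splitter are assumptions introduced for illustration; because the splitter skips punctuation, its word counts are lower than those from `word_tokenize`.

```python
import re

def stylistic_features(text):
    """Return simple style counts for one text (hypothetical helper)."""
    # Words only; punctuation is not counted as a token here.
    words = re.findall(r"\w+(?:'\w+)?", text)
    return {
        "num_chars": len(text),
        "num_words": len(words),
        "num_exclamation_marks": text.count("!"),
        "num_question_marks": text.count("?"),
        # Fully uppercase tokens such as "I"
        "num_uppercase_words": sum(1 for w in words if w.isupper()),
        # Same emoticon pattern as in the cell above
        "num_emoticons": len(re.findall(r"[:;][-^]?[\)DpP\(\[{\|@}]", text)),
    }

samples = {
    "positive": "I absolutely loved this product! It exceeded my expectations in every way.",
    "negative": "I was very disappointed with this product. It did not live up to my expectations at all.",
    "neutral": "This product is okay. It's not great, but it's not terrible either.",
}

# One feature dictionary per sample text
features = {label: stylistic_features(text) for label, text in samples.items()}
for label, feats in features.items():
    print(label, feats)
```

Returning one dictionary per text also gives a shape that `DictVectorizer` can consume directly, with every sample sharing the same feature names.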


Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above, and rank them by importance in descending order.

In [None]:
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_selection import SelectKBest, f_classif

# Define features and labels
features_list = [
    {'bag_of_words': 'I absolutely loved'},
    {'bag_of_words': 'this was great excellent'},
    {'bag_of_words': 'nice liked exceeded'},
    {'bag_of_words': 'not disappointed worst worst wasted'},
    {'pos_tags': 'NN NN NN VB JJ JJ'},
    {'pos_tags': 'NN NN NN NN VB JJ'},
    {'pos_tags': 'NN VB JJ'},
    {'pos_tags': 'RB VBN JJ NN NN'},
    {'sentiment_lexicon': 2.1, 'emotion': 'happy', 'readability_score': 7.8},
    {'sentiment_lexicon': -1.5, 'emotion': 'sad', 'readability_score': 6.2},
    {'sentiment_lexicon': 0.8, 'emotion': 'neutral', 'readability_score': 8.3},
    {'sentiment_lexicon': -2.3, 'emotion': 'angry', 'readability_score': 5.9},
]

labels = ['positive', 'positive', 'positive', 'negative', 'positive', 'positive', 'positive', 'negative',
         'positive', 'negative', 'positive', 'negative']

# Transform the list of dictionaries into a numeric array
vectorizer = DictVectorizer(sparse=False)
features_matrix = vectorizer.fit_transform(features_list)

# Use SelectKBest with f_classif (ANOVA F-test) to score the features;
# k='all' keeps every feature, so only the scores are needed and fit() is enough
selector = SelectKBest(f_classif, k='all')
selector.fit(features_matrix, labels)

# Get the feature scores and rank them in descending order
scores = selector.scores_
sorted_scores = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
feature_names = vectorizer.get_feature_names_out()
ranked_features = [feature_names[i] for i in sorted_scores]

# Print the ranked feature names
print("Ranked feature names:")
for feature in ranked_features:
    print(feature)


Ranked feature names:
sentiment_lexicon
bag_of_words=not disappointed worst worst wasted
emotion=angry
emotion=sad
pos_tags=RB VBN JJ NN NN
bag_of_words=I absolutely loved
bag_of_words=nice liked exceeded
bag_of_words=this was great excellent
emotion=happy
emotion=neutral
pos_tags=NN NN NN NN VB JJ
pos_tags=NN NN NN VB JJ JJ
pos_tags=NN VB JJ
readability_score
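
For comparison, the ranking above can be contrasted with a term-level filter method. The sketch below scores individual words with the chi-square statistic, a filter method widely used for text; whether it matches the exact variant described in the cited review is an assumption. The toy documents, their labels, and the `chi_square_scores` helper are hypothetical and written in plain Python so the arithmetic is visible.

```python
from collections import defaultdict

def chi_square_scores(docs, labels, target):
    """Chi-square score of each term against class `target`, using term presence."""
    n = len(docs)
    in_class = sum(1 for y in labels if y == target)
    with_term_in = defaultdict(int)   # A: docs in the class that contain the term
    with_term_out = defaultdict(int)  # B: docs outside the class that contain the term
    for text, y in zip(docs, labels):
        for term in set(text.split()):
            if y == target:
                with_term_in[term] += 1
            else:
                with_term_out[term] += 1
    scores = {}
    for term in set(with_term_in) | set(with_term_out):
        a = with_term_in[term]
        b = with_term_out[term]
        c = in_class - a            # docs in the class without the term
        d = (n - in_class) - b      # docs outside the class without the term
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - c * b) ** 2 / denom if denom else 0.0
    return scores

# Hypothetical toy corpus (not the exercise data)
docs = [
    "loved great excellent",
    "great nice loved",
    "nice product great",
    "worst product disappointed",
    "disappointed worst wasted",
    "wasted product worst",
]
labels = ["positive"] * 3 + ["negative"] * 3

scores = chi_square_scores(docs, labels, target="positive")
# Sort by score (descending), breaking ties alphabetically for a stable order
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

Words that appear in only one class ("great", "worst") receive the highest scores, while "product", which occurs in both classes, ranks last.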


Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
#Run this before executing the below code
!pip install sentence-transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence-transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Building wheels for collected packages: sentence-transformers
  Building wheel for sentence-transformers (setup.py) ... done
  Created wheel for sentence-transformers: filename=sentence_transformers-2.2.2-py3-none-any.whl size=125938 sha256=8b7889308539d9f50c5605dfa7b2ebf624afda67044b48282dde23e7480e55fd
  Stored in directory: /root/.cache/pip/wheels/71/67/06/162a3760c40d7

In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Define the text data
positive_text = "I absolutely loved this product! It exceeded my expectations in every way."
negative_text = "I was very disappointed with this product. It did not live up to my expectations at all."
neutral_text = "This product is okay. It's not great, but it's not terrible either."

# Define the query
query = "product did not exceeded expectations"

# Load the BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# No separate tokenization step is needed: encode() tokenizes its input internally.
# Get the BERT embeddings for the text data and query.
# Encoding a one-element list returns an array of shape (1, embedding_dim).
positive_embeddings = model.encode([positive_text])
negative_embeddings = model.encode([negative_text])
neutral_embeddings = model.encode([neutral_text])
query_embedding = model.encode([query])

# Calculate the cosine similarity between the query and text embeddings
positive_similarity = cosine_similarity(query_embedding, positive_embeddings)[0][0]
negative_similarity = cosine_similarity(query_embedding, negative_embeddings)[0][0]
neutral_similarity = cosine_similarity(query_embedding, neutral_embeddings)[0][0]

# Print the similarity scores in descending order
similarity_scores = {'Positive': positive_similarity, 'Negative': negative_similarity, 'Neutral': neutral_similarity}
sorted_scores = sorted(similarity_scores.items(), key=lambda x: x[1], reverse=True)
for label, score in sorted_scores:
    print(f'{label} text: {score}')


Negative text: 0.7740368843078613
Neutral text: 0.6637380123138428
Positive text: 0.21653777360916138
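
The scores above come from `cosine_similarity`, which computes the dot product of two vectors divided by the product of their lengths. A minimal sketch of that calculation and the ranking step follows, written in plain Python. The 3-dimensional vectors are made-up stand-ins for real BERT embeddings, and `cosine` and `rank_by_similarity` are hypothetical helper names.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (name, score) pairs sorted from most to least similar to the query."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Made-up toy vectors standing in for sentence embeddings
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # 45 degrees away from the query
    "doc_c": [0.0, 0.0, 1.0],  # orthogonal to the query
}
query_vec = [1.0, 0.0, 0.0]

ranking = rank_by_similarity(query_vec, doc_vecs)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

Cosine similarity depends only on the angle between vectors, not their magnitude, which is why it is a common choice for comparing sentence embeddings.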
