## Third in-class exercise (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Sentiment analysis of movie reviews is an interesting text classification problem. Given a set of movie reviews, the task is to categorise each review as positive or negative.

The following categories of features may be relevant in developing a machine learning model for this task:

1. Bag-of-words features: This entails building a vocabulary of words from the reviews and representing each review as a vector of word frequencies. This feature captures how often positive and negative terms occur in each review.

2. Parts of speech: This entails tagging each word in the reviews with its POS tag. This feature captures the grammatical structure of the reviews and may help in recognising phrases and clauses that indicate positive or negative sentiment.

3. N-gram features: This entails constructing an n-gram vocabulary (sequences of n words) from the reviews and encoding each review as a vector of n-gram frequencies. Unlike single words, n-grams preserve local word order, so they can capture phrases such as "not good" whose sentiment differs from that of the individual words.

4. Sentiment lexicon features: This entails assigning a sentiment score to each word in the reviews using a sentiment lexicon (a dictionary of terms with associated positive or negative sentiment values). This feature gives the sentiment of each word, and the word scores can be combined into an overall sentiment score for each review.

5. Emoji features: This entails extracting the emojis in each review to capture the additional nonverbal sentiment they express.



'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction. 

In [3]:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
!pip install emoji

# Sample reviews
Sample_reviews = ["TI ❤️👍 NLP and text analysis! 😃 This is a great course. 👌",
           "The service was terrible. I would not recommend it to anyone.",
           "The food was okay. Nothing special.",
           # No comma follows the next string, so Python joins it with the last one:
           # the list holds four reviews, which is what the outputs below reflect.
           "I love this brand. Their products are always top quality."
           "I just tried the new 🍕 from that new place in town and it was 😋👌! The crust was perfectly crispy and the toppings were fresh and delicious. 🙌🏼 Highly recommend!"]

# Feature 1: Bag of words
Stop_words = set(stopwords.words('english'))
for review in Sample_reviews:
    words = word_tokenize(review.lower())
    words = [w for w in words if w not in Stop_words and w.isalpha()]
    Bag_of_Words = dict(nltk.FreqDist(words))
    print(Bag_of_Words)

{'ti': 1, 'nlp': 1, 'text': 1, 'analysis': 1, 'great': 1, 'course': 1}
{'service': 1, 'terrible': 1, 'would': 1, 'recommend': 1, 'anyone': 1}
{'food': 1, 'okay': 1, 'nothing': 1, 'special': 1}
{'love': 1, 'brand': 1, 'products': 1, 'always': 1, 'top': 1, 'tried': 1, 'new': 2, 'place': 1, 'town': 1, 'crust': 1, 'perfectly': 1, 'crispy': 1, 'toppings': 1, 'fresh': 1, 'delicious': 1, 'highly': 1, 'recommend': 1}
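
The loop above prints one frequency dictionary per review. To obtain the "vector of word frequencies" described in Question 1, every review has to be encoded over one shared vocabulary. The following is a minimal plain-Python sketch of that step; the helper names `build_vocabulary` and `to_count_vector` are illustrative and not part of the notebook, and the tokens are copied from the second and third outputs above.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Return a sorted list of every distinct token across all documents."""
    return sorted({token for doc in tokenized_docs for token in doc})


def to_count_vector(tokens, vocabulary):
    """Encode one document as term counts, in vocabulary order."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Tokens taken from the bag-of-words outputs for two of the sample reviews.
docs = [
    ["service", "terrible", "would", "recommend", "anyone"],
    ["food", "okay", "nothing", "special"],
]

vocabulary = build_vocabulary(docs)
vectors = [to_count_vector(doc, vocabulary) for doc in docs]

print(vocabulary)
for vector in vectors:
    print(vector)
```

Each vector has one position per vocabulary term, so the reviews become rows of a document-term matrix that a classifier can consume directly.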


In [18]:
# Feature 2: N-grams
for review in Sample_reviews:
    words = word_tokenize(review.lower())
    words = [w for w in words if w not in Stop_words and w.isalpha()]
    bigrams = list(nltk.bigrams(words))
    print(bigrams)

[('ti', 'nlp'), ('nlp', 'text'), ('text', 'analysis'), ('analysis', 'great'), ('great', 'course')]
[('service', 'terrible'), ('terrible', 'would'), ('would', 'recommend'), ('recommend', 'anyone')]
[('food', 'okay'), ('okay', 'nothing'), ('nothing', 'special')]
[('love', 'brand'), ('brand', 'products'), ('products', 'always'), ('always', 'top'), ('top', 'tried'), ('tried', 'new'), ('new', 'new'), ('new', 'place'), ('place', 'town'), ('town', 'crust'), ('crust', 'perfectly'), ('perfectly', 'crispy'), ('crispy', 'toppings'), ('toppings', 'fresh'), ('fresh', 'delicious'), ('delicious', 'highly'), ('highly', 'recommend')]
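
The cell above extracts bigrams only. As a hedged sketch, the same idea generalises to any n and to frequency counts, which is the encoding Question 1 describes. The helper `ngrams` below is a hypothetical plain-Python stand-in, not the NLTK function, and the tokens come from the third output above.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token tuples, in order of appearance."""
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["food", "okay", "nothing", "special"]

# Count how often each n-gram occurs in the review.
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

print(bigram_counts)
print(trigram_counts)
```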


In [19]:
# Feature 3: Parts of speech
for review in Sample_reviews:
    words = word_tokenize(review.lower())
    words = [w for w in words if w not in Stop_words and w.isalpha()]
    pos_tags = nltk.pos_tag(words)
    print(pos_tags)


[('ti', 'NN'), ('nlp', 'CC'), ('text', 'JJ'), ('analysis', 'NN'), ('great', 'JJ'), ('course', 'NN')]
[('service', 'NN'), ('terrible', 'JJ'), ('would', 'MD'), ('recommend', 'VB'), ('anyone', 'NN')]
[('food', 'NN'), ('okay', 'NN'), ('nothing', 'NN'), ('special', 'JJ')]
[('love', 'VB'), ('brand', 'NN'), ('products', 'NNS'), ('always', 'RB'), ('top', 'VBP'), ('tried', 'JJ'), ('new', 'JJ'), ('new', 'JJ'), ('place', 'NN'), ('town', 'NN'), ('crust', 'NN'), ('perfectly', 'RB'), ('crispy', 'JJ'), ('toppings', 'NNS'), ('fresh', 'JJ'), ('delicious', 'JJ'), ('highly', 'RB'), ('recommend', 'VBP')]
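
A list of (word, tag) pairs is not yet a feature vector. One simple option, sketched here under the assumption that tag frequencies are informative for sentiment, is to count how often each tag occurs and derive ratios such as the share of adjectives. The tagged pairs are copied from the second output above; the variable names are illustrative.

```python
from collections import Counter

# Tagged tokens copied from the POS output for the second sample review.
tagged = [
    ("service", "NN"),
    ("terrible", "JJ"),
    ("would", "MD"),
    ("recommend", "VB"),
    ("anyone", "NN"),
]

# Count each tag and compute the proportion of adjectives (tag "JJ").
tag_counts = Counter(tag for _, tag in tagged)
adjective_ratio = tag_counts["JJ"] / len(tagged)

print(dict(tag_counts))
print(adjective_ratio)
```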


In [20]:
# Feature 4: Sentiment lexicons
sentiment = SentimentIntensityAnalyzer()
for review in Sample_reviews:
    Sentiment_Lexicons = sentiment.polarity_scores(review)
    print(Sentiment_Lexicons)

{'neg': 0.0, 'neu': 0.672, 'pos': 0.328, 'compound': 0.6588}
{'neg': 0.394, 'neu': 0.606, 'pos': 0.0, 'compound': -0.6381}
{'neg': 0.277, 'neu': 0.49, 'pos': 0.233, 'compound': -0.092}
{'neg': 0.0, 'neu': 0.62, 'pos': 0.38, 'compound': 0.9616}
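
The cell above scores whole reviews with VADER, whereas Question 1 describes assigning a score to each word. The sketch below illustrates that word-level idea with a tiny hand-made lexicon. The lexicon values are invented for illustration and are not VADER's; `lexicon_scores` and `review_score` are hypothetical helper names.

```python
# Toy lexicon with made-up scores (NOT taken from VADER).
TOY_LEXICON = {"great": 3.0, "love": 3.0, "delicious": 3.0, "terrible": -3.0}


def lexicon_scores(tokens, lexicon):
    """Pair every token with its lexicon score; unknown words score 0.0."""
    return [(token, lexicon.get(token, 0.0)) for token in tokens]


def review_score(tokens, lexicon):
    """Sum the word-level scores into a single review-level score."""
    return sum(score for _, score in lexicon_scores(tokens, lexicon))


tokens = ["service", "terrible", "would", "recommend", "anyone"]
print(lexicon_scores(tokens, TOY_LEXICON))
print(review_score(tokens, TOY_LEXICON))
```

A plain sum ignores negation and intensifiers, which rule-based analysers such as VADER are designed to handle.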


In [21]:
# Feature 5: Emojis
import re
import emoji
for review in Sample_reviews:
    # Convert emojis to their text representation
    text = emoji.demojize(review)
    # Use regex to extract emoji tokens
    emojis = re.findall(r':[a-z_]+:', text)
    print(emojis)


[':red_heart:', ':thumbs_up:', ':grinning_face_with_big_eyes:']
[]
[]
[':pizza:', ':face_savoring_food:']
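
The pattern `:[a-z_]+:` used above matches only lowercase letters and underscores, which is why the 👌 and 🙌🏼 emojis in the sample reviews do not appear in the output. A broader pattern could be sketched as follows. The demojized string is written by hand here so the example runs without the `emoji` package; the exact names it contains are an assumption and may vary between package versions.

```python
import re

# Also allow uppercase letters, digits, and hyphens inside the emoji name.
EMOJI_NAME = re.compile(r":[A-Za-z0-9_\-]+:")

# Hand-written stand-in for the output of emoji.demojize (names are assumed).
demojized = (
    "it was :face_savoring_food::OK_hand:! "
    ":raising_hands_medium-light_skin_tone: Highly recommend!"
)

names = EMOJI_NAME.findall(demojized)
print(names)
```

The broader pattern can also match ordinary text between two colons, such as the minutes in a timestamp, so it is a trade-off rather than a strict improvement.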


Question 3 (10 points): Use any of the feature selection methods mentioned in the paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above and rank them by importance in descending order.

In [8]:
import pandas as pd
from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_extraction.text import TfidfVectorizer

In [12]:
# Define the sample data
Sample_reviews = ["TI ❤️👍 NLP and text analysis! 😃 This is a great course. 👌",
           "The service was terrible. I would not recommend it to anyone.",
           "The food was okay. Nothing special.",
           # No comma follows the next string, so the last two strings form one
           # review: four reviews in total, matching the four labels in Y_train.
           "I love this brand. Their products are always top quality."
           "I just tried the new 🍕 from that new place in town and it was 😋👌! The crust was perfectly crispy and the toppings were fresh and delicious. 🙌🏼 Highly recommend!"]

Y_train = pd.Series([1, 0, 1, 0])  # 1 for positive sentiment, 0 for negative sentiment


vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(Sample_reviews)
Y_train = Y_train.values.ravel()

# Compute mutual information scores for each feature
mi_scores = mutual_info_classif(X_train, Y_train)

# Create a dataframe to store the feature names and their importance scores
feature_scores = pd.DataFrame(list(zip(vectorizer.get_feature_names_out(), mi_scores)), columns=['Feature', 'Score'])

# Sort the features based on their importance scores in descending order
feature_scores = feature_scores.sort_values(by='Score', ascending=False)

# Print the top 10 features and their importance scores
print(feature_scores.head(10))


      Feature     Score
44        was  0.693147
29  recommend  0.693147
17         it  0.693147
35        the  0.693147
2         and  0.346574
37       this  0.346574
0      always  0.215762
34       that  0.215762
27   products  0.215762
28    quality  0.215762
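
The three distinct score values above can be reproduced by hand. The sketch below assumes that, for sparse input, `mutual_info_classif` treats each distinct TF-IDF value as its own discrete category (the documented default, to the best of my knowledge). The function `mutual_information` and the feature columns are illustrative; the TF-IDF values are made up, and only their pattern of zeros and distinct values matters.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    count_x = Counter(xs)
    count_y = Counter(ys)
    total = 0.0
    for (x, y), count in joint.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) )
        total += (count / n) * log(count * n / (count_x[x] * count_y[y]))
    return total


labels = [1, 0, 1, 0]  # same labels as Y_train

# Hypothetical TF-IDF columns for three kinds of term.
distinct_everywhere = [0.0, 0.31, 0.27, 0.45]  # different value in every review
first_and_last = [0.30, 0.0, 0.0, 0.10]        # present in reviews 0 and 3
only_last = [0.0, 0.0, 0.0, 0.20]              # present in review 3 only

for column in (distinct_everywhere, first_and_last, only_last):
    print(round(mutual_information(column, labels), 6))
```

Under that assumption, any term whose value differs in every review reaches the maximum score of ln 2 ≈ 0.693 regardless of its meaning, which may explain why stop words such as "the" and "was" rank first on a four-review sample.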




Question 4 (10 points): Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [13]:
# Your code here (please add comments in the code):
!pip install sentence_transformers
import numpy as np
import pandas as pd
import torch
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load the pre-trained BERT model
model = SentenceTransformer('bert-base-nli-mean-tokens')


In [15]:
# Defining the documents 
Documents = ["TI ❤️👍 NLP and text analysis! 😃 This is a great course. 👌",
           "The service was terrible. I would not recommend it to anyone.",
           "The food was okay. Nothing special.",
           # No comma follows the next string, so the last two strings form one
           # document: four documents in total, matching the four result rows.
           "I love this brand. Their products are always top quality."
           "I just tried the new 🍕 from that new place in town and it was 😋👌! The crust was perfectly crispy and the toppings were fresh and delicious. 🙌🏼 Highly recommend!"]

# Define the query to match the most relevant documents
Query = "NLP course"

# Compute the embeddings for the documents and the query
Document_embeddings = model.encode(Documents)
Query_embedding = model.encode(Query)

# Compute the cosine similarity between the query and each document
Similarities = cosine_similarity(Query_embedding.reshape(1,-1), Document_embeddings)

# Combine the similarities with the documents into a dataframe
Results = pd.DataFrame({
    'Document': Documents,
    'Similarity': Similarities.flatten()
})

# Sort the results by similarity in descending order
Results = Results.sort_values(by=['Similarity'], ascending=False)

# Printing the sorted descending results
Results

   Document                                            Similarity
0  TI ❤️👍 NLP and text analysis! 😃 This is a grea...  0.620869
2  The food was okay. Nothing special.                 0.182036
1  The service was terrible. I would not recommen...  0.150309
3  I love this brand. Their products are always t...  0.110012
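
For reference, the ranking step can be written without any library. The sketch below computes cosine similarity as the dot product divided by the product of the vector norms, then sorts the documents by score. The three-dimensional vectors are toy stand-ins for the BERT embeddings, and `cosine` and `rank_by_similarity` are hypothetical helper names.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings.
toy_query = [1.0, 0.0, 1.0]
toy_docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]

ranking = rank_by_similarity(toy_query, toy_docs)
for index, score in ranking:
    print(index, round(score, 3))
```

Cosine similarity depends only on direction, so the second toy document scores highest even though its vector is twice as long as the query's.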
