
## In-class exercise 3 (2/28/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
An interesting text classification task is sentiment analysis, which categorizes text documents into sentiment classes such as positive, negative, or neutral.

To build a sentiment analysis model, the following features are useful:

1. Bag-of-words: This feature represents a document by the counts of the words it contains, ignoring the order in which they occur. It is useful because it records which sentiment-bearing words appear and how often, which is often enough to separate positive from negative documents.
2. N-grams: An n-gram is a sequence of N consecutive words in a text document. For sentiment analysis, unigrams (single words) and bigrams (two-word sequences) are the most commonly used. N-grams capture local context, such as the negation in "not good", that single words miss.
3. Part-of-speech (POS) tags: This feature labels each word in a document with its part of speech, such as noun, verb, adjective, or adverb. POS tags capture the grammatical structure of a sentence; adjectives and adverbs in particular tend to carry sentiment.
4. Named entities: Named entities are words or phrases that refer to specific people, places, or organizations. Identifying them provides context for sentiment analysis, because the sentiment expressed toward these entities can influence the overall sentiment of the document.
5. Sentiment lexicons: Sentiment lexicons are lists of words and phrases associated with positive or negative sentiment. They can be used to assign sentiment scores to the words in a document, and those scores can be aggregated into an overall document score.

By combining these features, a machine learning model can be trained to classify text documents into sentiment categories. Bag-of-words and n-grams represent the document content, POS tags and named entities add contextual information, and sentiment lexicons supply word-level sentiment scores that help predict the overall sentiment of the document.
'''
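
To make the difference between the first two features concrete, here is a minimal sketch in plain Python. The helper names `bag_of_words` and `bigrams` and the two sample phrases are assumptions made up for illustration; they are not part of the exercise code. It shows two phrases that share the same bag-of-words but differ once word order is kept through bigrams.

```python
from collections import Counter

def bag_of_words(text):
    """Count each lowercase whitespace-separated token, ignoring order."""
    return Counter(text.lower().split())

def bigrams(text):
    """Count adjacent token pairs, which preserve local word order."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical phrases: same words, different order
a = "not good but great"
b = "not great but good"

same_bow = bag_of_words(a) == bag_of_words(b)
same_bigrams = bigrams(a) == bigrams(b)
print(same_bow)      # True: word counts are identical
print(same_bigrams)  # False: the word pairs differ
```

This is why bigrams are often added on top of unigram counts when negation matters.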

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [22]:
# Your code here (Please add comments in the code):
import nltk
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')
from nltk.corpus import stopwords  # needed for stopwords.words() below
from nltk.tokenize import word_tokenize
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample text
texts = [ "This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.",  
         "I was really disappointed with this product. It didn't work as advertised.", 
         "I love this product so much! It has made my life so much easier.",  
         "This product is just okay. It wasn't great, but it wasn't terrible either.", 
         "I would never buy this product again. It was a complete waste of money."]

# Tokenize the text and remove stop words
stop_words = set(stopwords.words('english'))  # build the stop-word set once
processed_texts = []
for text in texts:
    tokens = word_tokenize(text.lower())
    tokens = [token for token in tokens if token not in stop_words]
    processed_texts.append(' '.join(tokens))

# Bag of Words
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(processed_texts).toarray()

# N-grams (bigrams only, since ngram_range=(2, 2))
ngram_vectorizer = CountVectorizer(ngram_range=(2,2))
ngram_features = ngram_vectorizer.fit_transform(processed_texts).toarray()

# Part-of-speech (POS)
pos_features = []
for text in texts:
    tokens = word_tokenize(text.lower())
    pos_features.append([pos for token, pos in nltk.pos_tag(tokens)])

# Sentiment lexicon
sid = SentimentIntensityAnalyzer()
lexicon_features = []
for text in texts:
    polarity_scores = sid.polarity_scores(text)
    lexicon_features.append([polarity_scores['pos'], polarity_scores['neg'], polarity_scores['neu']])

# Readability
readability_features = []
for text in texts:
    words = text.split()
    num_words = len(words)
    num_sentences = len(nltk.sent_tokenize(text))
    avg_sentence_length = float(num_words/num_sentences)
    avg_word_length = float(sum(len(word) for word in words)/num_words)
    readability_features.append([avg_sentence_length, avg_word_length])

# Print features
print("Bag of Words features: ", bow_features)
print("N-grams features: ", ngram_features)
print("Part-of-speech (POS) features: ", pos_features)
print("Sentiment lexicon features: ", lexicon_features)
print("Readability features: ", readability_features)

Bag of Words features:  [[0 1 1 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1]
 [1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 2 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0]
 [0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 0 0 0 1 0 1]]
N-grams features:  [[1 0 0 1 0 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0]
 [0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0]
 [0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 1]]
Part-of-speech (POS) features:  [['DT', 'NN', 'VBZ', 'JJ', '.', 'PRP', 'VBD', 'PRP$', 'NNS', '.', 'NN', 'MD', 'RB', 'VB', 'PRP', 'TO', 'NN', '.'], ['NN', 'VBD', 'RB', 'JJ', 'IN', 'DT', 'NN', '.', 'PRP', 'VBD', 'RB', 'VB', 'RB', 'JJ', '.'], ['RB', 'VBP', 'DT', 'NN', 'RB', 'JJ', '.', 'PRP', 'VBZ', 'VBN', 'PRP$', 'NN', 'RB', 'RB', 'JJR', '.'], ['DT', 'NN', 'VBZ', 'RB', 'RB', '.', 'PRP', 'VBD', '

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!
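
The POS output above is a list of tag sequences whose lengths differ from document to document, so it cannot be stacked with the other feature matrices as it stands. One possible way to turn it into fixed-length vectors is to count tags per document, sketched below. The function name `pos_count_vectors` and the two short tag lists are hypothetical and only mimic the shape of `pos_features`.

```python
from collections import Counter

def pos_count_vectors(tag_sequences):
    """Turn variable-length POS tag lists into equal-length count vectors."""
    # Sorted tag vocabulary shared by every document
    vocab = sorted({tag for tags in tag_sequences for tag in tags})
    vectors = []
    for tags in tag_sequences:
        counts = Counter(tags)
        # Counter returns 0 for tags that do not occur in this document
        vectors.append([counts[tag] for tag in vocab])
    return vocab, vectors

# Hypothetical tag lists, shaped like the pos_features output
sample_tags = [
    ["DT", "NN", "VBZ", "JJ", "."],
    ["PRP", "VBD", "RB", "JJ", "JJ", "."],
]
vocab, vectors = pos_count_vectors(sample_tags)
print(vocab)
print(vectors)
```

Each row then has one column per tag, which makes the POS information usable by a feature selection method or classifier.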


Question 3 (10 points): Use any of the feature selection methods mentioned in the paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)." Select the most important features you extracted above and rank them by importance in descending order.

In [14]:
pip install textstat

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting textstat
  Downloading textstat-0.7.3-py3-none-any.whl (105 kB)
Collecting pyphen
  Downloading pyphen-0.13.2-py3-none-any.whl (2.0 MB)
Installing collected packages: pyphen, textstat
Successfully installed pyphen-0.13.2 textstat-0.7.3


In [16]:
pip install vadersentiment

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting vadersentiment
  Downloading vaderSentiment-3.3.2-py2.py3-none-any.whl (125 kB)
Installing collected packages: vadersentiment
Successfully installed vadersentiment-3.3.2


In [23]:
# Your code here (Please add comments in the code):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2  # chi-square scoring function
from scipy.sparse import hstack
import textstat
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

# Sample data
documents = ["This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.",
             "I was really disappointed with this product. It didn't work as advertised.",
             "I love this product so much! It has made my life so much easier.",
             "This product is just okay. It wasn't great, but it wasn't terrible either.",
             "I would never buy this product again. It was a complete waste of money."]

# Target class labels
labels = np.array([1, 0, 1, 2, 0])  # 1 for positive, 0 for negative, 2 for neutral

# Bag of Words feature extraction
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform(documents)

# N-Grams feature extraction (bigrams)
vectorizer_ngrams = CountVectorizer(ngram_range=(2, 2))
ngram_features = vectorizer_ngrams.fit_transform(documents)

# Token-count features used in place of POS tags: non-stop-word tokens plus
# selected punctuation marks (this vectorizer does not produce real POS tags)
pos_vectorizer = CountVectorizer(token_pattern=r'\b\w\w+\b|!|\?|\"|\'', ngram_range=(1, 1), analyzer='word',
                                 stop_words='english')
pos_features = pos_vectorizer.fit_transform(documents)

# Sentiment Lexicon feature extraction
analyzer = SentimentIntensityAnalyzer()
lexicon_features = []
for doc in documents:
    vs = analyzer.polarity_scores(doc)
    # chi2 requires non-negative inputs, so take absolute values
    # (only the compound score can be negative)
    lexicon_features.append([abs(vs['neg']), abs(vs['neu']), abs(vs['pos']), abs(vs['compound'])])
lexicon_features = np.array(lexicon_features)

# Readability feature extraction
readability_features = []
for doc in documents:
    flesch_score = textstat.flesch_reading_ease(doc)
    smog_score = textstat.smog_index(doc)
    # apply non-negative transformation for chi2
    readability_features.append([abs(flesch_score), abs(smog_score)])
readability_features = np.array(readability_features)

# Stack all feature blocks column-wise into one matrix
features = hstack((bow_features, ngram_features, pos_features, lexicon_features, readability_features)).tocsr()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out(),
# which returns arrays, so convert to lists before concatenating
feature_names = (
    list(vectorizer_bow.get_feature_names_out())
    + list(vectorizer_ngrams.get_feature_names_out())
    + list(pos_vectorizer.get_feature_names_out())
    + ['neg', 'neu', 'pos', 'compound']
    + ['flesch_score', 'smog_score']
)

# Chi-Square feature selection: score every feature and rank in descending order
chi2_scores, _ = chi2(features, labels)
feature_scores = list(zip(feature_names, chi2_scores))
feature_scores.sort(key=lambda x: x[1], reverse=True)
print("Chi-Square scores for all features:")
for feature, score in feature_scores:
    print(feature, score)

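
To show what a chi-square score measures, the sketch below computes the textbook 2x2 form of the statistic for a single term against a single class, using term presence rather than counts. It is an illustration under stated assumptions, not a reimplementation of scikit-learn's `chi2`, and its values will generally differ from that function's output. The function name `chi_square` and the toy documents and labels are hypothetical.

```python
def chi_square(docs, labels, term, target):
    """Chi-square statistic for one term against one class (2x2 table)."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        has_term = term in doc.lower().split()
        in_class = label == target
        if has_term and in_class:
            a += 1  # term present, document in class
        elif has_term:
            b += 1  # term present, document not in class
        elif in_class:
            c += 1  # term absent, document in class
        else:
            d += 1  # term absent, document not in class
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # Guard against an empty row or column in the contingency table
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Hypothetical toy data: 1 = positive, 0 = negative
toy_docs = ["great product", "great value", "terrible product", "terrible waste"]
toy_labels = [1, 1, 0, 0]

score_great = chi_square(toy_docs, toy_labels, "great", 1)
score_product = chi_square(toy_docs, toy_labels, "product", 1)
print(score_great)    # 4.0: "great" appears only in positive documents
print(score_product)  # 0.0: "product" is spread evenly across classes
```

A term that occurs evenly across classes scores zero, while a term concentrated in one class scores high, which is the basis for ranking features in descending order.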

Question 4 (10 points): Write Python code to rank texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the texts by similarity in descending order.

In [20]:
pip install sentence_transformers

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting sentence_transformers
  Downloading sentence-transformers-2.2.2.tar.gz (85 kB)
  Preparing metadata (setup.py) ... done
Collecting transformers<5.0.0,>=4.6.0
  Downloading transformers-4.26.1-py3-none-any.whl (6.3 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.97-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)
Collecting huggingface-hub>=0.4.0
  Downloading hu

In [24]:
# Your code here (Please add comments in the code):
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a BERT-based sentence embedding model
model = SentenceTransformer('bert-base-nli-mean-tokens')

# Define the query and the texts to rank
query = "I recently tried out the new restaurant in town and found the food to be decent. The service was good and the ambiance was pleasant. However, the prices were a bit on the higher side. Overall, it was a decent experience but I am not sure if I would go back again given the price point."
data = ["This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.",
        "I was really disappointed with this product. It didn't work as advertised.",
        "I love this product so much! It has made my life so much easier.",
        "This product is just okay. It wasn't great, but it wasn't terrible either.",
        "I would never buy this product again. It was a complete waste of money."]

# Embed the query and every text with the same model
query_embedding = model.encode([query])[0]
text_embeddings = model.encode(data)

# Cosine similarity between the query and each text
similarities = cosine_similarity([query_embedding], text_embeddings)[0]

# Rank the texts by similarity in descending order
ranked_texts = sorted(zip(data, similarities), key=lambda x: x[1], reverse=True)

print(ranked_texts)

[("This product is just okay. It wasn't great, but it wasn't terrible either.", 0.6015003), ('I love this product so much! It has made my life so much easier.', 0.5684782), ('This product is amazing! It exceeded my expectations. I would definitely recommend it to anyone.', 0.4582039), ("I was really disappointed with this product. It didn't work as advertised.", 0.44249266), ('I would never buy this product again. It was a complete waste of money.', 0.29375678)]
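
The ranking step can also be illustrated without a model. The sketch below computes cosine similarity by hand on small made-up vectors; the names `cosine`, `query_vec`, and `doc_vecs` and all the numbers are assumptions for illustration and have nothing to do with the real BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of vector norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

# Hypothetical 2-dimensional "embeddings"
query_vec = [1.0, 0.0]
doc_vecs = {
    "doc_a": [2.0, 0.0],  # same direction as the query
    "doc_b": [1.0, 1.0],  # 45 degrees away from the query
    "doc_c": [0.0, 3.0],  # orthogonal to the query
}

# Sort document names by similarity to the query, highest first
ranking = sorted(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]), reverse=True)
print(ranking)  # ['doc_a', 'doc_b', 'doc_c']
```

Because cosine similarity depends only on direction, `doc_a` ranks first even though it is twice as long as the query vector.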
