## The third In-class-exercise (due on 11:59 PM 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):

'''
"Sentiment analysis on Apartments.com reviews"
To build a machine learning model for sentiment analysis on Apartments.com reviews
These feature might be useful for building machine learning model

Bag of Words (BoW):
Explanation: BoW is a simple and commonly used text representation technique. It converts a piece of text into a numerical vector by counting the frequency of each word in the text and ignoring the order and structure of words.
How it's Helpful: BoW helps capture the presence and frequency of specific words in reviews. It's useful for identifying important keywords or phrases in the text that may be indicative of sentiment. For example, words like "excellent" or "disappointing" can strongly influence sentiment classification.

TF-IDF (Term Frequency-Inverse Document Frequency):
Explanation: TF-IDF is a statistical measure that evaluates the importance of a word within a document relative to a collection of documents (corpus). It considers both the frequency of a word in a review and its rarity across all reviews.
How it's Helpful: TF-IDF is useful for identifying words that are discriminative for sentiment. It gives higher weight to words that are unique to certain reviews, helping to identify sentiment-bearing terms that are not common across all reviews.

N-grams:
Explanation: N-grams are contiguous sequences of N words from the text. For example, bigrams (N=2) capture pairs of adjacent words.
How it's Helpful: N-grams capture more complex relationships between words and phrases. This can be valuable for understanding sentiment nuances. For example, the bigram "not good" can convey negative sentiment even if the individual words "not" and "good" are neutral.

Sentiment Lexicons:
Explanation: Sentiment lexicons are lists of words or phrases pre-labeled with sentiment (e.g., positive, negative, or neutral). Examples include the Harvard General Inquirer or the SentiWordNet lexicon.
How it's Helpful: Sentiment lexicons provide a direct mapping of words to sentiment, allowing the model to quickly identify and weigh sentiment-bearing terms in the text. They can be valuable for fine-grained sentiment analysis by assigning sentiment scores to words.

Part-of-Speech (POS) Tags:
Explanation: POS tagging involves labeling words in a text with their corresponding grammatical categories (e.g., nouns, verbs, adjectives). For sentiment analysis, you might focus on adjectives and adverbs.
How it's Helpful: Analyzing POS tags can help identify words that play a crucial role in expressing sentiment. Adjectives and adverbs often carry sentiment-related information. For instance, "great" and "horrible" are adjectives that strongly indicate sentiment.

'''

Question 2 (10 points): Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [1]:
# You code here (Please add comments in the code):
import nltk
from nltk import word_tokenize, pos_tag
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

# Sample reviews on Apartments.com
reviews = [
    "the amenities' such as walk in closets, space up to date appliances. upgrade property, Courtesy officers, and beautiful bathrooms, and pantry in kitchen, with accessible covered parking, clubhouse, gas/electric fireplace, landscaping, pool, package items secure at leasing office.",
    "I was looking for apartments in Spencerport NY and Rochester NY keeps coming up",
    "App was good, but could be more user friendly by only recommendation places that fit search filters, not giving more expensive and other location options. Also, couldn’t figure out how to go back and select a place to add to my list if I accidentally “noped” it.",
    "I don't know if this is a part of you or the people who list the apartments but I'd rather see on line first what I'd pay for the place before I go tour the apartment. I know to some degree what I want and what I can pay for so don't want to get myself into a place I'll have to leave the next year when the rent is raised....",
]

# Tokenize the text
tokenized_reviews = [word_tokenize(review.lower()) for review in reviews]

# Bag of Words (BoW) feature extraction
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_reviews])
print("BoW Features:")
print(bow_features.toarray())

# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_reviews])
print("\nTF-IDF Features:")
print(tfidf_features.toarray())

# N-grams feature extraction (bigrams)
ngram_vectorizer = CountVectorizer(ngram_range=(2, 2))
ngram_features = ngram_vectorizer.fit_transform([' '.join(tokens) for tokens in tokenized_reviews])
print("\nBigram Features:")
print(ngram_features.toarray())

# Sentiment Lexicons
positive_words = ["amazing", "fantastic", "great"]
negative_words = ["terrible", "boring", "awful"]
positive_counts = [sum(review.count(word) for word in positive_words) for review in reviews]
negative_counts = [sum(review.count(word) for word in negative_words) for review in reviews]
print("\nSentiment Lexicon Features:")
print("Positive Counts:", positive_counts)
print("Negative Counts:", negative_counts)

# Part-of-Speech (POS) Tagging
pos_tags = [pos_tag(tokens) for tokens in tokenized_reviews]
adjective_counts = [len([tag for word, tag in tags if tag.startswith('JJ')]) for tags in pos_tags]
adverb_counts = [len([tag for word, tag in tags if tag.startswith('RB')]) for tags in pos_tags]
print("\nPOS Tag Features:")
print("Adjective Counts:", adjective_counts)
print("Adverb Counts:", adverb_counts)




[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


BoW Features:
[[1 0 0 0 1 2 0 0 0 1 1 1 0 1 0 1 0 0 0 0 1 1 0 0 0 1 1 1 0 0 1 0 0 0 1 0
  0 0 0 1 0 0 0 0 0 0 0 2 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
  1 0 0 0 0 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0
  1 1 0 1 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0
  1 0 0 0 0 1 0 0 0 0 0 0]
 [0 1 1 1 0 2 0 0 1 0 0 0 1 0 1 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 0 0
  1 0 1 0 0 1 1 1 0 1 1 0 0 0 1 0 0 0 0 0 0 0 0 1 0 1 0 2 1 0 0 1 1 0 0 0
  0 0 1 1 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 0 0 1 0 0 0 0 0 1 0 0 3 0
  0 0 1 0 0 1 0 0 0 0 0 0]
 [0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 2 0 0 0 0 0 1
  0 2 0 0 1 0 1 0 1 0 1 0 1 2 0 0 0 0 2 0 0 1 1 1 1 0 0 0 0 1 1 0 0 0 1 0
  0 1 0 0 1 0 0 0 0 0 1 2 1 2 0 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 0 0 6 1 3 1
  0 0 0 0 2 0 3 1

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [2]:
# You code here (Please add comments in the code):
from sklearn.feature_selection import chi2
import numpy as np

# Sample class labels for sentiment classes (positive, negative, neutral)
labels = ["positive", "negative", "positive", "neutral"]

# Combine all the features into one feature matrix
all_features = np.hstack([bow_features.toarray(), tfidf_features.toarray(),
                          ngram_features.toarray(),
                          np.array(positive_counts)[:, np.newaxis],
                          np.array(negative_counts)[:, np.newaxis],
                          np.array(adjective_counts)[:, np.newaxis],
                          np.array(adverb_counts)[:, np.newaxis]])

# Calculate Chi-squared statistics and p-values for each feature
chi2_stat, p_values = chi2(all_features, labels)

# Create a list of labels corresponding to each feature
feature_labels = ["BoW"] * len(vectorizer.get_feature_names_out()) + \
                 ["TF-IDF"] * len(tfidf_vectorizer.get_feature_names_out()) + \
                 ["Positive Count", "Negative Count", "Adjective Count", "Adverb Count"]

# Combine feature labels and Chi-squared scores
feature_info = list(zip(feature_labels, chi2_stat))

# Create a dictionary to accumulate the Chi-squared scores for each label
label_scores = {}
for label, score in feature_info:
    if label in label_scores:
        label_scores[label] += score
    else:
        label_scores[label] = score

# Print the aggregated Chi-squared scores by label
print("Chi-squared Scores by Label:")
for label, score in label_scores.items():
    print(f"{label}: {score}")




Chi-squared Scores by Label:
BoW: 248.90476190476193
TF-IDF: 33.084105729558644
Positive Count: 1.0
Negative Count: 1.0
Adjective Count: 1.0
Adverb Count: 1.0


Question 4 (10 points): Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [5]:
# You code here (Please add comments in the code):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
text_data = [
    "This website is the best to search for apartments",
    "Very bad site, no proper information available",
    "Site is good, but loading times are slow",
    "It's neither good or bad",
]


query = "Want to find a good apartment with all amenities"
all_text = text_data + [query]

# Convert the text data into TF-IDF vectors
count_vectorizer = CountVectorizer()
tfidf_transformer = TfidfTransformer()
tfidf_matrix = tfidf_transformer.fit_transform(count_vectorizer.fit_transform(all_text))

# Calculate cosine similarity between the query and each text
cosine_similarities = cosine_similarity(tfidf_matrix[-1], tfidf_matrix[:-1])

# Rank texts based on cosine similarities in descending order
sorted_indices = cosine_similarities.argsort()[0][::-1]

# Print the ranked text data based on cosine similarity
print("Ranked Text Data (Descending Order of Cosine_Similarity):")
for idx in sorted_indices:
    print(f"Similarity: {cosine_similarities[0][idx]:.4f}")
    print(text_data[idx])
    print()





Ranked Text Data (Descending Order of Cosine_Similarity):
Similarity: 0.0848
This website is the best to search for apartments

Similarity: 0.0831
It's neither good or bad

Similarity: 0.0648
Site is good, but loading times are slow

Similarity: 0.0000
Very bad site, no proper information available

