<a href="https://colab.research.google.com/github/HarshaSolingaram/INFO_5731/blob/main/Solingaram_Harshavardhan_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from the top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:

Here are five types of features that could be useful for building a machine learning model for sentiment analysis:

Bag-of-Words (BoW) features:

BoW features represent the frequency of occurrence of each word in the document.
These features capture the presence or absence of specific words in the text, which can indicate sentiment.
For example, the presence of words like "excellent," "great," or "satisfied" might indicate positive sentiment, while words like "disappointed," "poor," or "terrible" might indicate negative sentiment.


N-gram Features:

N-grams represent sequences of N consecutive words in the text.
By capturing sequences of words, N-gram features can provide context and capture nuances in sentiment that individual words may not convey.
For instance, phrases like "not good" or "very happy" may carry different sentiments compared to individual words.


Word Embedding Features:

Word embedding techniques like Word2Vec, GloVe, or FastText represent words in a continuous vector space where words with similar meanings are closer to each other.
These embeddings capture semantic relationships between words, which can help in understanding the sentiment of the text.
By leveraging pre-trained word embeddings or learning embeddings specific to the task, the model can understand the contextual meaning of words better.


Part-of-Speech (POS) Tag Features:

POS tags represent the grammatical category of each word in the text (e.g., noun, verb, adjective).
Different parts of speech may convey different sentiment orientations. For example, adjectives and adverbs often carry sentiment information.
POS tag features can help the model focus on specific word categories that are more indicative of sentiment.


Sentiment Lexicon Features:

Sentiment lexicons contain lists of words along with their associated sentiment polarity (positive, negative, or neutral).
Using sentiment lexicon features, the model can directly incorporate knowledge about the sentiment of words.
By matching words in the text to entries in the sentiment lexicon, the model can assign sentiment scores to the text based on the presence of positive or negative words.



'''
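The N-gram and sentiment lexicon features described above can also work together: a phrase such as "not good" only matches a lexicon entry if bigrams are looked up alongside single words. The following is a minimal sketch of that idea, not a definitive implementation. The names `tokenize`, `extract_ngrams`, `lexicon_score`, and the toy `LEXICON` with its weights are all hypothetical and chosen for illustration only.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters (apostrophes allowed)."""
    return re.findall(r"[a-z']+", text.lower())

def extract_ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy lexicon. The bigram "not good" carries a weight of -2 so
# that it outweighs the +1 contributed by the unigram "good" it contains.
LEXICON = {
    "excellent": 1,
    "happy": 1,
    "good": 1,
    "terrible": -1,
    "disappointed": -1,
    "not good": -2,
}

def lexicon_score(text):
    """Sum the lexicon weights of all unigrams and bigrams found in the text."""
    tokens = tokenize(text)
    candidates = tokens + extract_ngrams(tokens, 2)
    return sum(LEXICON.get(candidate, 0) for candidate in candidates)

for sentence in ["This book is not good. I am disappointed.",
                 "I feel very happy today."]:
    print(sentence, "->", lexicon_score(sentence))
```

A real lexicon would be far larger, and negation is usually handled with a more general rule than listing phrases one by one; the sketch only shows why unigram lookups alone miss such cases.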

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [20]:
!pip install nltk gensim
import nltk  # needed for the nltk.download calls below
import numpy as np
import pandas as pd
from nltk import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from gensim.models import Word2Vec
from sklearn.feature_extraction.text import CountVectorizer

# Download the tokenizer, POS tagger, and stop-word list used below
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')

# Sample text data
sample_data = ["The warm sun peeked through the leaves, dappling the forest floor in light.",
               " A gentle breeze rustled the ferns, carrying the sweet scent of wildflowers.",
               " A curious deer emerged from the trees, its large eyes gazing around before it nibbled on tender leaves.",
               " As the sun began to set, painting the sky in hues of orange and pink, the forest settled into a peaceful stillness.",
               " broken only by the hooting of an owl in the distance."]

# Sample sentiment labels
sentiment_labels = ["positive", "negative", "negative", "positive", "positive"]

# Bag-of-Words (BoW) features
vectorizer_bow = CountVectorizer()
X_bow = vectorizer_bow.fit_transform(sample_data)
print("Bag-of-Words (BoW) features:")
print(vectorizer_bow.get_feature_names_out())
print(X_bow.toarray())
print()

# Part-of-Speech (POS) Tag features
stop_words = set(stopwords.words('english'))
X_pos = []
for text in sample_data:
    words = word_tokenize(text)
    # Compare in lowercase so capitalized stop words such as "The" are removed too
    words = [word for word in words if word.lower() not in stop_words]
    pos_tags = pos_tag(words)
    X_pos.append(" ".join([tag for word, tag in pos_tags]))
vectorizer_pos = CountVectorizer()
X_pos = vectorizer_pos.fit_transform(X_pos)
print("Part-of-Speech (POS) Tag features:")
print(vectorizer_pos.get_feature_names_out())
print(X_pos.toarray())
print()

# Word Embedding features
word2vec_model = Word2Vec(sentences=[word_tokenize(text) for text in sample_data], vector_size=100, window=5, min_count=1, workers=4)
X_word2vec = np.array([np.mean([word2vec_model.wv[word] for word in word_tokenize(text) if word in word2vec_model.wv] or [np.zeros(100)], axis=0) for text in sample_data])
print("Word Embedding features:")
print(X_word2vec)
print()

# Sentiment Lexicon features (assuming you have a sentiment lexicon loaded)
# Here, I'll calculate the sentiment score for each sentence based on the words in the sentiment lexicon
print("Sentiment Lexicon features:")
sentiment_lexicon = {
    "excellent": 1,
    "satisfied": 1,
    "disappointed": -1,
    "terrible": -1,
    "quick": 1,
    "helpful": 1
}
X_lexicon = []
for text in sample_data:
    # Tokenize so trailing punctuation ("excellent!") does not block a lexicon match
    words = word_tokenize(text.lower())
    score = sum(sentiment_lexicon.get(word, 0) for word in words)
    X_lexicon.append([score])

X_lexicon = np.array(X_lexicon)
print(X_lexicon)


Bag-of-Words (BoW) features:
['an' 'and' 'around' 'as' 'before' 'began' 'breeze' 'broken' 'by'
 'carrying' 'curious' 'dappling' 'deer' 'distance' 'emerged' 'eyes'
 'ferns' 'floor' 'forest' 'from' 'gazing' 'gentle' 'hooting' 'hues' 'in'
 'into' 'it' 'its' 'large' 'leaves' 'light' 'nibbled' 'of' 'on' 'only'
 'orange' 'owl' 'painting' 'peaceful' 'peeked' 'pink' 'rustled' 'scent'
 'set' 'settled' 'sky' 'stillness' 'sun' 'sweet' 'tender' 'the' 'through'
 'to' 'trees' 'warm' 'wildflowers']
[[0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 0
  0 0 0 1 0 0 0 0 0 0 0 1 0 0 3 1 0 0 1 0]
 [0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0
  0 0 0 0 0 1 1 0 0 0 0 0 1 0 2 0 0 0 0 1]
 [0 0 1 0 1 0 0 0 0 0 1 0 1 0 1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 1 0 1 0 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 0]
 [0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 0 0 0 0 0 0 1 0 0 1
  0 1 1 0 1 0 0 1 1 1 1 1 0 0 3 0 1 0 0 0]
 [1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 0 0 0

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [27]:
pip install nltk gensim




In [28]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.util import ngrams
from nltk.tag import pos_tag
from gensim.models import Word2Vec

# Sample text data
text_data = [
    "The movie was excellent! I loved every moment of it.",
    "The food was terrible. I wouldn't recommend it to anyone.",
    "I feel very happy today.",
    "This book is not good. I am disappointed.",
    "She sings beautifully."
]

# Bag-of-Words (BoW) features
def bow_features(text):
    tokens = word_tokenize(text.lower())
    bow = {}
    for token in tokens:
        if token not in bow:
            bow[token] = 1
        else:
            bow[token] += 1
    return bow

# N-gram Features
def ngram_features(text, n=2):
    tokens = word_tokenize(text.lower())
    ngrams_list = list(ngrams(tokens, n))
    return [' '.join(gram) for gram in ngrams_list]

# Word Embedding Features (Word2Vec)
def word_embedding_features(text, model):
    tokens = word_tokenize(text.lower())
    embedding_vector = []
    for token in tokens:
        if token in model.wv:
            embedding_vector.append(model.wv[token])
    return embedding_vector

# Part-of-Speech (POS) Tag Features
def pos_tag_features(text):
    tokens = word_tokenize(text)
    pos_tags = pos_tag(tokens)
    return [tag for word, tag in pos_tags]

# Sentiment Lexicon Features
def sentiment_lexicon_features(text, positive_lexicon, negative_lexicon):
    tokens = word_tokenize(text.lower())
    positive_count = sum(1 for word in tokens if word in positive_lexicon)
    negative_count = sum(1 for word in tokens if word in negative_lexicon)
    return {'positive_count': positive_count, 'negative_count': negative_count}

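# Sample sentiment lexicons (single-word entries only: tokens are matched one at a time,
# so a multi-word entry such as "not good" could never match)
positive_lexicon = {"excellent", "great", "happy", "beautiful"}
negative_lexicon = {"terrible", "disappointed"}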

# Preprocess sample text data
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

preprocessed_data = [' '.join([word for word in word_tokenize(sentence.lower()) if word.isalpha() and word not in stop_words]) for sentence in text_data]

# Train Word2Vec model
tokenized_sentences = [word_tokenize(sentence.lower()) for sentence in preprocessed_data]
word2vec_model = Word2Vec(tokenized_sentences, vector_size=100, window=5, min_count=1, workers=4)

# Extract features
for text in preprocessed_data:
    print("Text:", text)
    print("Bag-of-Words (BoW) features:", bow_features(text))
    print("N-gram Features (2-gram):", ngram_features(text, n=2))
    print("Word Embedding Features (Word2Vec):", word_embedding_features(text, word2vec_model))
    print("Part-of-Speech (POS) Tag Features:", pos_tag_features(text))
    print("Sentiment Lexicon Features:", sentiment_lexicon_features(text, positive_lexicon, negative_lexicon))
    print()


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Text: movie excellent loved every moment
Bag-of-Words (BoW) features: {'movie': 1, 'excellent': 1, 'loved': 1, 'every': 1, 'moment': 1}
N-gram Features (2-gram): ['movie excellent', 'excellent loved', 'loved every', 'every moment']
Word Embedding Features (Word2Vec): [array([ 0.00180023,  0.00704609,  0.0029447 , -0.00698085,  0.00771268,
       -0.00598893,  0.00899771,  0.0029592 , -0.00401529, -0.00468899,
       -0.00441672, -0.00614646,  0.00937874, -0.0026496 ,  0.00777244,
       -0.00968034,  0.00210879, -0.00123361,  0.00754423, -0.0090546 ,
        0.00743756, -0.0051058 , -0.00601377, -0.00564916, -0.00337917,
       -0.0034111 , -0.00319566, -0.0074922 ,  0.00070878, -0.00057607,
       -0.001684  ,  0.00375713, -0.00762019, -0.00322142,  0.00515534,
        0.00854386, -0.00980994,  0.00719534,  0.00530949, -0.0038797 ,
        0.00857616, -0.00922199,  0.00724868,  0.00536383,  0.00129359,
       -0.00519975, -0.00417865, -0.00335678,  0.00160829,  0.0015867 ,
        0.0

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [36]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import mutual_info_classif

# Build the TF-IDF features that the selection step ranks
# (text_data is the list of sample sentences defined in Question 2)
vectorizer_tfidf = TfidfVectorizer()
tfidf_features = vectorizer_tfidf.fit_transform(text_data)

# Extracting features
X = tfidf_features  # Using TF-IDF features for demonstration
y = [1, 0, 1, 0, 1]  # Assuming binary labels for demonstration

# Perform Mutual Information feature selection
mi_scores = mutual_info_classif(X, y)

# Map feature names to their corresponding MI scores
feature_names = vectorizer_tfidf.get_feature_names_out()
feature_mi_scores = dict(zip(feature_names, mi_scores))

# Rank features based on their importance (MI scores) in descending order
ranked_features = sorted(feature_mi_scores.items(), key=lambda x: x[1], reverse=True)

# Print ranked features
print("Ranked features based on Mutual Information (MI) scores:")
for feature, score in ranked_features:
    print(f"Feature: {feature}, MI Score: {score}")

## Question 4 (10 points):
Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.
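One possible shape for this task is sketched below, under stated assumptions. It assumes the `torch` and `transformers` packages are installed and that the `bert-base-uncased` checkpoint can be downloaded. It also assumes mean pooling over token vectors as the sentence representation, which is one common choice and not the only one. The function names `embed_with_bert`, `cosine_rank`, and `rank_documents` are hypothetical.

```python
import numpy as np

def embed_with_bert(texts, model_name="bert-base-uncased"):
    """Return one mean-pooled BERT vector per text.

    Assumes torch and transformers are installed; they are imported here
    so the ranking helpers below can be used without them.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    encoded = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**encoded)

    # Mean-pool the token vectors, ignoring padding positions
    mask = encoded["attention_mask"].unsqueeze(-1).float()
    summed = (output.last_hidden_state * mask).sum(dim=1)
    counts = mask.sum(dim=1)
    return (summed / counts).numpy()

def cosine_rank(query_vec, doc_vecs):
    """Return (index, similarity) pairs sorted by cosine similarity, highest first."""
    query_vec = np.asarray(query_vec, dtype=float)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = doc_vecs @ query_vec / norms
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order]

def rank_documents(query, docs, embed=embed_with_bert):
    """Embed the query and documents together, then rank documents by similarity."""
    vectors = embed([query] + list(docs))
    return [(docs[i], score) for i, score in cosine_rank(vectors[0], vectors[1:])]
```

Calling `rank_documents(query, text_data)` with a query string of your choice returns the Question 2 sentences paired with their cosine similarity to the query, most similar first. The `embed` parameter exists so that a different encoder can be substituted without touching the ranking logic.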

In [None]:
# Your code here (Please add comments in the code):





# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the above questions):

'''
Please write your answer here:





'''