<a href="https://colab.research.google.com/github/Likhi2001/Likhitha_Jarugula_INFO5731_SPRING2024/blob/main/Jarugula__Likhitha_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:
Sentiment analysis of movie reviews is an interesting text classification task. The goal is to sort reviews into positive, negative, or neutral categories based on their text. This task matters for improving recommendation systems, understanding customer feedback, and raising the quality of products and services.

The following features are useful when building a machine learning model for this task:

1. Word Frequencies (Bag of Words): The count or frequency of each word in the text. This simple feature can identify words that are strongly associated with positive or negative sentiment. For example, "excellent" and "terrible" each convey an obvious sentiment.

2. Term Frequency-Inverse Document Frequency (TF-IDF): In addition to measuring how often a word occurs in a review, TF-IDF accounts for how common the word is across all documents. This highlights words that characterize a particular review rather than words that are frequent everywhere.

3. N-grams: Sequences of N consecutive words can capture meanings that a single word (a unigram) cannot. Bi-grams (like "not good") and tri-grams (like "not very good") capture negations and intensifiers that strongly affect sentiment.

4. Part-of-Speech Tags: The grammatical roles of words (such as nouns, adjectives, and verbs) provide useful information. Adjectives and adverbs (such as "amazing" and "sadly") often convey sentiment more explicitly than verbs.

5. Sentiment Scores: Precomputed scores from sentiment lexicons (such as SentiWordNet, AFINN, or VADER) can serve as a strong feature by providing a baseline sentiment estimate for words or phrases.

Let's now write Python code that extracts these features from several example texts. For basic NLP tasks (such as tokenization and part-of-speech tagging), we will use libraries like `nltk`, and for the count vectorizer and TF-IDF features, we will use `sklearn`.

It appears that the code failed because the NLTK data needed for tokenization and part-of-speech tagging could not be located. Normally, `nltk.download()` would be used to download these resources, but we are unable to download external resources directly in this environment.

Even so, we can describe what the feature extraction procedure should produce:

1. Bag of Words (Word Frequencies): This feature extraction produces a matrix in which each row represents a review and each column represents a distinct word used across all reviews. The values show how often each word appears in each review.

2. TF-IDF: Like the Bag of Words model, TF-IDF produces a matrix with rows representing reviews and columns representing unique terms. The values, however, are weighted by how important a term is in a review relative to how often it appears across all reviews. Words with higher TF-IDF scores appear frequently in a particular review but not in most other reviews.

3. Bi-grams: This creates a feature set similar to the Bag of Words, except that it counts word pairs. It can pick up on phrases that express a sentiment a single word cannot.

4. Part-of-Speech Tags: The goal is to examine the grammatical structure of sentences, even though we were unable to extract these because of the missing NLTK data. For example, tags can distinguish between senses of a word that depend on its grammatical context (such as "like" as a verb and "like" as a preposition).

5. Sentiment Scores: We planned to compute a sentiment score for every review to give a numerical representation of the sentiment expressed in the text. These scores could then be used as features in a machine learning model, providing a simple sentiment estimate before more complex textual features are considered.

In practice, once the required NLTK resources are available, the code extracts these features so that a sentiment analysis model can be trained on them. By using both the syntactic and semantic properties of the reviews, this approach gives a fuller picture of the text data for classifying sentiment.

'''
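
To make the TF-IDF description above concrete, here is a minimal sketch that computes the weights by hand. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is the default behaviour of scikit-learn's `TfidfVectorizer`. The toy corpus, the whitespace tokenization, and the helper names `smoothed_idf` and `tfidf` are assumptions made for illustration only.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not the notebook's data); tokens are pre-split for simplicity
docs = [
    "the movie was excellent".split(),
    "the movie was terrible".split(),
    "the acting was excellent".split(),
]

def smoothed_idf(term, docs):
    """IDF with add-one smoothing: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for doc in docs if term in doc)
    return math.log((1 + n) / (1 + df)) + 1

def tfidf(doc, docs):
    """Raw term counts weighted by smoothed IDF, then L2-normalised."""
    counts = Counter(doc)
    weights = {t: c * smoothed_idf(t, docs) for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

scores = tfidf(docs[1], docs)
# "terrible" occurs in one document only, so it outweighs "the", which occurs in all three
for term, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {score:.3f}")
```

In this toy corpus the rare word "terrible" receives the largest weight in the second document, while "the" and "was" receive the smallest, which is the behaviour described in the answer.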

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.sentiment import SentimentIntensityAnalyzer
import pandas as pd
import nltk

# Download the NLTK resources needed for tokenization, POS tagging, and VADER
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('vader_lexicon')

# Sample movie reviews
reviews = [
    "The movie was excellent! The performances were outstanding.",
    "I did not like the movie, it was too long and boring.",
    "The direction was mediocre but the acting was pretty good.",
    "What a terrible movie. It was a waste of time.",
    "The movie was okay, not great but not bad either."
]

# Initialize VADER for sentiment analysis
sia = SentimentIntensityAnalyzer()

# 1. Word Frequencies (Bag of Words)
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(reviews)

# 2. TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(reviews)

# 3. Bi-grams
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
bigram_features = bigram_vectorizer.fit_transform(reviews)

# 4. Part-of-Speech Tags
tokens = [word_tokenize(review) for review in reviews]  # one token list per review
pos_tags = [pos_tag(review_tokens) for review_tokens in tokens]  # (token, tag) pairs per review

# 5. Sentiment Scores
sentiment_scores = [sia.polarity_scores(review) for review in reviews]

# Convert features to DataFrames for easier viewing (except for POS tags and sentiment scores, which are more complex structures)
bow_df = pd.DataFrame(bow_features.toarray(), columns=vectorizer.get_feature_names_out())
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
bigram_df = pd.DataFrame(bigram_features.toarray(), columns=bigram_vectorizer.get_feature_names_out())

# Printing examples of features
print("Bag of Words Features (Example):")
print(bow_df.head())
print("\nTF-IDF Features (Example):")
print(tfidf_df.head())
print("\nBi-gram Features (Example):")
print(bigram_df.head())
print("\nPart-of-Speech Tags (Example):")
print(pos_tags[0])  # Showing POS tags for the first review
print("\nSentiment Scores (Example):")
print(sentiment_scores[0])  # Showing sentiment scores for the first review


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Bag of Words Features (Example):
   acting  and  bad  boring  but  did  direction  either  excellent  good  \
0       0    0    0       0    0    0          0       0          1     0   
1       0    1    0       1    0    1          0       0          0     0   
2       1    0    0       0    1    0          1       0          0     1   
3       0    0    0       0    0    0          0       0          0     0   
4       0    0    1       0    1    0          0       1          0     0   

   ...  performances  pretty  terrible  the  time  too  was  waste  were  what  
0  ...             1       0         0    2     0    0    1      0     1     0  
1  ...             0       0         0    1     0    1    1      0     0     0  
2  ...             0       1         0    2     0    0    2      0     0     0  
3  ...             0       0         1    0     1    0    1      1     0     1  
4  ...             0       0         0    1     0    0    1      0     0     0  

[5 rows x 30 colu
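
The POS tags produced above are lists of `(token, tag)` pairs rather than a numeric matrix, so they need one more step before a model can use them. One simple option, sketched below, is to count how many tokens fall into each coarse part-of-speech group. The tagged review is hand-written in Penn Treebank style as an assumption for illustration; it is not output copied from `nltk.pos_tag`, and `TAG_GROUPS` and `pos_count_features` are hypothetical names.

```python
from collections import Counter

# Hand-written (token, tag) pairs in Penn Treebank style; an assumption for
# illustration, not output copied from nltk.pos_tag
tagged_review = [
    ("The", "DT"), ("movie", "NN"), ("was", "VBD"), ("excellent", "JJ"),
    ("!", "."), ("The", "DT"), ("performances", "NNS"), ("were", "VBD"),
    ("outstanding", "JJ"), (".", "."),
]

# Coarse groups keyed by the first two characters of the tag (hypothetical grouping)
TAG_GROUPS = {"NN": "noun", "VB": "verb", "JJ": "adjective", "RB": "adverb"}

def pos_count_features(tagged_tokens):
    """Return the share of tokens in each coarse part-of-speech group."""
    counts = Counter()
    for _, tag in tagged_tokens:
        group = TAG_GROUPS.get(tag[:2])
        if group:
            counts[group] += 1
    total = len(tagged_tokens)
    return {group: counts[group] / total for group in TAG_GROUPS.values()}

features = pos_count_features(tagged_review)
print(features)
```

Each review then yields a fixed-length dictionary that can sit alongside the Bag of Words or TF-IDF columns, for example as the share of adjectives in the review.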

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [None]:
# Your code here (Please add comments in the code):
from sklearn.feature_selection import chi2
import numpy as np

# Hypothetical sentiment labels for the reviews
sentiments = [1, 0, 1, 0, 1]  # 1 for positive, 0 for negative

# Applying chi-square test
chi_scores, p_values = chi2(bow_features, sentiments)

# Ranking features based on their chi-scores
feature_names = vectorizer.get_feature_names_out()  # look the names up once
indices = np.argsort(chi_scores)[::-1]  # Descending order

# Print the ranked features
print("Feature ranking (based on chi-square scores):")
for i in indices:
    print(f"{feature_names[i]}: {chi_scores[i]}")

Feature ranking (based on chi-square scores):
it: 3.0
terrible: 1.5
like: 1.5
long: 1.5
of: 1.5
did: 1.5
what: 1.5
time: 1.5
too: 1.5
and: 1.5
waste: 1.5
boring: 1.5
the: 1.3611111111111118
but: 1.3333333333333335
bad: 0.6666666666666667
good: 0.6666666666666667
direction: 0.6666666666666667
either: 0.6666666666666667
excellent: 0.6666666666666667
mediocre: 0.6666666666666667
great: 0.6666666666666667
were: 0.6666666666666667
okay: 0.6666666666666667
outstanding: 0.6666666666666667
performances: 0.6666666666666667
pretty: 0.6666666666666667
acting: 0.6666666666666667
movie: 0.16666666666666657
was: 0.1111111111111113
not: 0.05555555555555565
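
To see where scores such as 3.0 for "it" come from, the statistic can be reproduced by hand. The sketch below assumes the usual form of the test for count features: for each class, compare the observed total count of a term with the count expected if the term were independent of the class, and sum `(observed - expected)^2 / expected`. The helper name `chi2_score` is hypothetical, and the per-review counts are read off the five sample reviews.

```python
def chi2_score(term_counts, labels):
    """Chi-square statistic for one term: sum over classes of (observed - expected)^2 / expected."""
    total = sum(term_counts)
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how often the term occurs in reviews of this class
        observed = sum(c for c, y in zip(term_counts, labels) if y == cls)
        # Expected: the term's total count split according to the class frequency
        class_prob = labels.count(cls) / len(labels)
        expected = class_prob * total
        score += (observed - expected) ** 2 / expected
    return score

sentiments = [1, 0, 1, 0, 1]  # same hypothetical labels as in the cell above

# Per-review counts read off the five sample reviews
it_counts = [0, 1, 0, 1, 0]         # "it" appears once in each negative review
excellent_counts = [1, 0, 0, 0, 0]  # "excellent" appears in one positive review

print(chi2_score(it_counts, sentiments))         # about 3.0
print(chi2_score(excellent_counts, sentiments))  # about 0.667
```

The high score for "it" reflects that the word occurs only in the two negative reviews of this small sample, not that it carries sentiment on its own.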


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
# Your code here (Please add comments in the code):

from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Sample data and query
documents = [
    "The movie was excellent! The performances were outstanding.",
    "I did not like the movie, it was too long and boring.",
    "The direction was mediocre but the acting was pretty good.",
    "What a terrible movie. It was a waste of time.",
    "The movie was okay, not great but not bad either."
]
query = "Looking for an outstanding movie performance."

# Load pre-trained BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Function to encode text to get BERT embeddings
def get_bert_embedding(text):
    # Tokenize, truncating or padding every input to 512 tokens
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding='max_length')
    # Inference only, so skip gradient tracking
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the mean of the last hidden state (over all 512 positions) as the sentence embedding
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings

# Encode the query
query_embedding = get_bert_embedding(query)

# Encode all documents and calculate cosine similarity with the query
similarities = []
for doc in documents:
    doc_embedding = get_bert_embedding(doc)
    cosine_sim = cosine_similarity(query_embedding.detach().numpy(), doc_embedding.detach().numpy())[0][0]
    similarities.append(cosine_sim)

# Rank documents based on similarity
ranked_docs_indices = sorted(range(len(similarities)), key=lambda i: similarities[i], reverse=True)

# Print ranked documents with their similarity scores
print("Documents ranked by similarity to query:")
for index in ranked_docs_indices:
    print(f"Document: {documents[index]} \nSimilarity Score: {similarities[index]}\n")




Documents ranked by similarity to query:
Document: The movie was excellent! The performances were outstanding. 
Similarity Score: 0.814476728439331

Document: The direction was mediocre but the acting was pretty good. 
Similarity Score: 0.8015655279159546

Document: What a terrible movie. It was a waste of time. 
Similarity Score: 0.7750781774520874

Document: I did not like the movie, it was too long and boring. 
Similarity Score: 0.7743532657623291

Document: The movie was okay, not great but not bad either. 
Similarity Score: 0.742277979850769
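
Because `get_bert_embedding` pads every input to 512 tokens and then averages over all positions, the vectors at padding positions also enter the mean. A common alternative is to average only the positions where the attention mask is 1. The sketch below illustrates that arithmetic on tiny hand-made vectors (hypothetical values, not BERT output), using plain Python lists instead of tensors; `masked_mean` and `cosine` are illustrative helper names.

```python
import math

def masked_mean(token_vectors, attention_mask):
    """Average only the vectors whose mask entry is 1 (real tokens, not padding)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical 2-dimensional "token embeddings"; the last two positions are padding
token_vectors = [[1.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 0.0]]
attention_mask = [1, 1, 0, 0]

pooled = masked_mean(token_vectors, attention_mask)  # padding rows are ignored
unmasked = masked_mean(token_vectors, [1, 1, 1, 1])  # plain mean over every position

query = [0.0, 1.0]
print(pooled, cosine(query, pooled))
print(unmasked, cosine(query, unmasked))
```

In this toy case the padding vectors pull the plain mean toward their own direction and lower the similarity to the query. In PyTorch the same idea is usually written by multiplying `last_hidden_state` by `inputs['attention_mask'].unsqueeze(-1)`, summing over the token dimension, and dividing by the mask sum.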



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

### Learning Experience

Working on feature extraction from text data and on document ranking using text similarity has been very instructive. Text preprocessing and feature extraction are critical steps that lay the groundwork for any text analysis or natural language processing (NLP) project. The key concepts I found most beneficial were Bag of Words (BoW), TF-IDF, N-grams, Part-of-Speech (POS) tags, and sentiment scores. This exercise gave practical experience in converting unstructured text into a format that machine learning models can use, which is a fundamental skill. Furthermore, generating text embeddings with BERT and computing document similarity showed how a state-of-the-art NLP model is applied in practice. Together, these techniques highlight the shift in NLP from conventional vectorization techniques to large pretrained language models.

### Challenges Encountered

The environment's limitations were the main challenge in this exercise, especially when running code that needed third-party packages or external resources, such as NLTK's POS tagger and sentiment analyzer. Because of this limitation, parts of the exercise could not be run directly and had to be explained hypothetically. BERT models also have a learning curve because of their complexity, which includes handling their inputs and outputs for tasks like similarity computation. Preprocessing text for BERT and interpreting its embeddings requires a solid understanding of transformer concepts and the model's architecture.


### Relevance to Your Field of Study

This exercise is highly relevant to Natural Language Processing (NLP), a subfield of artificial intelligence that focuses on interaction between computers and humans through natural language. Many NLP applications, ranging from topic modeling and sentiment analysis to information retrieval and chatbots, rely on extracting meaningful features from text and computing similarities between documents with models like BERT. The skills acquired in this exercise apply to a variety of real-world systems, such as recommendation engines, automated customer support, and search engines. This underlines the importance of NLP techniques for understanding and extracting insights from text.


Overall, this exercise showed the importance of feature extraction and document similarity in NLP by covering both basic and advanced text analysis approaches. Despite the difficulties, the learning experience was valuable, providing practical insight into handling text data and applying modern NLP models to solve problems.

'''