<a href="https://colab.research.google.com/github/17251A0404/Abhigna_INFO5731_Spring2024/blob/main/DARA_ABHIGNA_Inclass_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will have a penalty of 10% of the marks for each day of late submission. , and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:

"""
One interesting text classification task could be sentiment analysis on social media comments regarding a particular product or service. This task involves determining the sentiment (positive, negative, or neutral) expressed in each comment. Here are five types of features that could be useful for building a machine learning model for sentiment analysis:

1. **Bag-of-Words (BoW)**:
   - This feature represents the frequency of each word in the text.
   - BoW features capture the presence or absence of specific words, which can indicate sentiment polarity.
   - Words like "great," "awesome," and "love" are likely indicators of positive sentiment, while words like "awful," "disappointed," and "hate" may indicate negative sentiment.

2. **N-grams**:
   - N-grams are sequences of adjacent words in the text.
   - Bi-grams (sequences of two words) and tri-grams (sequences of three words) can capture contextual information better than individual words.
   - For instance, the phrase "not good" would be represented as a bi-gram, which is crucial for capturing negation in sentiment analysis.

3. **Word Embeddings**:
   - Word embeddings represent words as dense vectors in a continuous vector space.
   - Pre-trained word embeddings like Word2Vec, GloVe, or fastText can capture semantic relationships between words.
   - These embeddings encode contextual and semantic information, which can enhance the model's understanding of sentiment nuances.

4. **Part-of-Speech (POS) Tags**:
   - POS tags represent the grammatical category of each word in the text (e.g., noun, verb, adjective).
   - Adjectives and adverbs often carry sentiment information. For example, positive sentiments are often expressed through adjectives like "amazing" or "beautiful."
   - Incorporating POS tags as features can help the model focus on the most informative words for sentiment analysis.

5. **Sentiment Lexicons**:
   - Sentiment lexicons contain lists of words labeled with their associated sentiment polarity (positive, negative, or neutral).
   - Features based on sentiment lexicons can capture sentiment signals directly without needing extensive training data.
   - Lexicons like SentiWordNet or AFINN provide a pre-defined sentiment score for each word, aiding in sentiment classification.

These features collectively provide a rich representation of the text, capturing both syntactic and semantic information relevant to sentiment analysis. By incorporating these diverse features, the machine learning model can effectively learn the underlying patterns in the text data and make accurate predictions about sentiment polarity."""


"Identifying Fake News in Online Articles is an intriguing text classification topic that we may discuss. In the current digital era, when the propagation of fake news may have serious repercussions, identifying disinformation is essential. The goal of this work is to develop a machine learning model that can discriminate between authentic and fraudulent news sources.\nThe features:\n\nSource Credibility Score:  An indicator of the news source's reliability.\nexplanation: Reputable sources are typically the source of high-quality journalism. Evaluating a source's historical credibility may help a model by taking into account things like public trust, journalistic honors, and historical fact-checking. These factors can be used to generate a weighted score.\nEmotive Tone Analysis: The text's emotional tone is identified.\nIn order to get attention, fake news frequently preys on emotions. Features that analyze the article's overall emotional tone, identify emotional triggers, and perform 

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
# You code here (Please add comments in the code):

import nltk
# Download the necessary resource
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import numpy as np


# Sample text data
text_data = [
    "This restaurant has the best food in town, I can't get enough!",
    "The delivery was fast and the packaging was excellent.",
    "I had a terrible experience with their customer service, very unprofessional.",
    "The hotel room was clean and spacious, I had a comfortable stay.",
    "The price of the product is too high for its quality, not worth it."
]


# Tokenization
tokenized_texts = [word_tokenize(text.lower()) for text in text_data]

# Stopword removal
stop_words = set(stopwords.words('english'))
filtered_texts = [[word for word in tokens if word not in stop_words] for tokens in tokenized_texts]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_texts = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in filtered_texts]

# Bag-of-Words (BoW) feature extraction
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform([' '.join(tokens) for tokens in lemmatized_texts])

# N-grams feature extraction
vectorizer_ngrams = CountVectorizer(ngram_range=(1, 2))
ngram_features = vectorizer_ngrams.fit_transform([' '.join(tokens) for tokens in lemmatized_texts])

# Word embeddings - This part requires pre-trained word embeddings like Word2Vec or GloVe

# Part-of-Speech (POS) tagging
pos_tags = [nltk.pos_tag(tokens) for tokens in tokenized_texts]
pos_features = []
for pos_tag in pos_tags:
    pos_features.append([tag for word, tag in pos_tag])

# Sentiment Lexicons - We'll use a simple lexicon for demonstration
sentiment_lexicon = {
    'love': 'positive',
    'amazing': 'positive',
    'terrible': 'negative',
    'disappointed': 'negative',
    'fantastic': 'positive',
    'awful': 'negative'
}
lexicon_features = []
for tokens in lemmatized_texts:
    score = sum([1 if token in sentiment_lexicon else 0 for token in tokens])
    lexicon_features.append(score)

# Print extracted features
print("Bag-of-Words Features:")
print(bow_features.toarray())
print("\nN-grams Features:")
print(ngram_features.toarray())
print("\nPOS Tags:")
print(pos_features)
print("\nSentiment Lexicon Features:")
print(lexicon_features)

Bag-of-Words Features:
[[1 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0]
 [0 0 0 0 0 1 0 1 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 1 0]
 [0 0 1 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 1 1 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 1 1 0 0 0 0 0 0 0 0 1]]

N-grams Features:
[[1 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1
  0 0 0 0 0 0 0 0 0 1 1 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 1 1 0 0 0 1 1 0 0 1 0]
 [0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0
  1 1 0 0 1 1 1 0 0 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 1 1 1 1 1 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 1]]

POS Tags:
[['DT', 'NN', 'VBZ', 'DT', 'JJS', 'NN', 'IN', 'NN', ',', 'NN', 'MD', 'RB', 'VB', 'RB', '.'], ['DT', 'NN', 'VBD', 'RB', 'CC', 'DT', 'NN', 'VBD', '

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above, rank the features based on their importance in the descending order.

In [26]:
# You code here (Please add comments in the code):
import nltk
# Download the necessary resource
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

from sklearn.feature_selection import chi2
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
import numpy as np

text_data = [
    "This restaurant has the best food in town, I can't get enough!",
    "The delivery was fast and the packaging was excellent.",
    "I had a terrible experience with their customer service, very unprofessional.",
    "The hotel room was clean and spacious, I had a comfortable stay.",
    "The price of the product is too high for its quality, not worth it."

]


# Tokenization
tokenized_texts = [word_tokenize(text.lower()) for text in text_data]

# Stopword removal
stop_words = set(stopwords.words('english'))
filtered_texts = [[word for word in tokens if word not in stop_words] for tokens in tokenized_texts]

# Lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_texts = [[lemmatizer.lemmatize(word) for word in tokens] for tokens in filtered_texts]

# Bag-of-Words (BoW) feature extraction
vectorizer_bow = CountVectorizer()
bow_features = vectorizer_bow.fit_transform([' '.join(tokens) for tokens in lemmatized_texts])


# Target labels
labels = ['Binary', 'Multiclass', 'Multi-label', 'Hierarchical', 'ordinal']

# Perform Chi-square test
chi2_scores, _ = chi2(bow_features, labels)

## Creating dictionary to map features to their Chi-square scores
feature_scores = {feature: score for feature, score in zip(vectorizer_bow.vocabulary_.keys(), chi2_scores)}

# Sorting features based on their scores in DO
sorted_features = sorted(feature_scores.items(), key=lambda x: x[1], reverse=True)

# Printing the sorted features & their Chi-square scores
for feature, score in sorted_features:
    print(f"Feature: {feature}, Chi-square Score: {score}")


Feature: restaurant, Chi-square Score: 4.000000000000001
Feature: best, Chi-square Score: 4.000000000000001
Feature: food, Chi-square Score: 4.000000000000001
Feature: town, Chi-square Score: 4.000000000000001
Feature: ca, Chi-square Score: 4.000000000000001
Feature: get, Chi-square Score: 4.000000000000001
Feature: enough, Chi-square Score: 4.000000000000001
Feature: delivery, Chi-square Score: 4.000000000000001
Feature: fast, Chi-square Score: 4.000000000000001
Feature: packaging, Chi-square Score: 4.000000000000001
Feature: excellent, Chi-square Score: 4.000000000000001
Feature: terrible, Chi-square Score: 4.000000000000001
Feature: experience, Chi-square Score: 4.000000000000001
Feature: customer, Chi-square Score: 4.000000000000001
Feature: service, Chi-square Score: 4.000000000000001
Feature: unprofessional, Chi-square Score: 4.000000000000001
Feature: hotel, Chi-square Score: 4.000000000000001
Feature: room, Chi-square Score: 4.000000000000001
Feature: clean, Chi-square Score: 4

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant docments. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similary with descending order.

In [None]:
# You code here (Please add comments in the code):
from transformers import BertTokenizer, BertModel
import torch
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
text_data = [
    "This restaurant has the best food in town, I can't get enough!",
    "The delivery was fast and the packaging was excellent.",
    "I had a terrible experience with their customer service, very unprofessional.",
    "The hotel room was clean and spacious, I had a comfortable stay.",
    "The price of the product is too high for its quality, not worth it."
]

# Sample query
query = "I'm looking for a great product recommendation."

# Load BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Tokenize and encode the query
query_tokens = tokenizer.encode(query, add_special_tokens=True, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# Tokenize and encode the text data
text_tokens = tokenizer(text_data, add_special_tokens=True, max_length=512, truncation=True, padding='max_length', return_tensors='pt')

# Get BERT embeddings for the query
with torch.no_grad():
    query_outputs = model(input_ids=query_tokens)[0][:, 0, :].squeeze().numpy()  # Use the [CLS] token embedding

# Get BERT embeddings for the text data
with torch.no_grad():
    text_outputs = model(input_ids=text_tokens.input_ids)[0][:, 0, :].squeeze().numpy()  # Use the [CLS] token embedding

# Calculate cosine similarity between the query and each text document
similarity_scores = cosine_similarity([query_outputs], text_outputs)

# Zip text data with similarity scores
text_similarity_scores = list(zip(text_data, similarity_scores[0]))

# Sort text documents based on similarity scores in descending order
sorted_text_similarity_scores = sorted(text_similarity_scores, key=lambda x: x[1], reverse=True)

# Print the sorted text documents and their similarity scores
print("Ranked Text Documents Based on Similarity to the Query:")
for text, score in sorted_text_similarity_scores:
    print(f"Text: {text}")
    print(f"Similarity Score: {score}")
    print()


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question, write down your answer as detail as possible for the above questions):


Learning experience:
I learnt a lot about text analysis, taking into account not only the words but also how people use language and share information.

Challenges Encountered:
It was difficult to strike the correct balance between employing complicated features and making things easy to grasp. Also, gathering trustworthy data for false news identification proved difficult.

Importance to Your Field of Study:
This practice is appropriate for learning language and computers (NLP). It demonstrates how we may utilize linguistic subtleties and machine learning to identify and address issues such as detecting false news. It's like combining linguistic knowledge with computer abilities.






'''