<a href="https://colab.research.google.com/github/Grishma5278/Info-5731/blob/main/Tallapareddy_grishma_exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during the lecture period to meet the active participation criteria of the course.
*   Use the provided *.ipynb* document to write your code and respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, enable sharing from the top-right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

The task is sentiment analysis of product reviews. The following five feature types are helpful for building the machine learning model:

1. Word frequency features
2. Part-of-speech features
3. Sentiment lexicon features
4. Emotional tone features
5. Syntactic dependency features

Features and their potential benefits:

1. Word frequency features:
   - Features based on word frequency, such as unigrams and bigrams. A weighting scheme such as Term Frequency-Inverse Document Frequency (TF-IDF) can be used to measure how often particular terms and phrases occur in each review relative to the whole collection.
   - Certain words and phrases correlate strongly with sentiment. Words such as "excellent," "amazing," and "great" typically signal a positive rating.
2. Part-of-speech features:
   - Features derived from the grammatical roles of the words in the text (such as nouns, verbs, and adjectives).
   - Sentiment can be inferred from word choice (e.g., the use of positive adjectives or negative verbs).
3. Sentiment lexicon features:
   - Features taken from lexicons or dictionaries that group words by their emotional meaning.
   - The average of the ratings assigned to individual terms can be used to estimate the overall tone of a review.
   - Sentiment lexicons are helpful for systematically capturing the underlying feelings that different words express.
4. Emotional tone features:
   - Features that capture the emotional tone of a text and whether it is positive or negative. Two techniques that can be applied here are sentiment intensity analysis and emotion analysis.
   - Emotional tone analysis can reveal the emotions the author expresses in the review.
5. Syntactic dependency features:
   - Syntactic features, such as the presence or absence of syntactic patterns (for example, subject-verb-object) in the text.
   - Sentence structure and grammar can affect the sentiment a sentence conveys. Negative emotions, for instance, may be expressed using the passive voice.
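
To make the word frequency feature concrete, here is a minimal pure-Python sketch of TF-IDF over unigrams and bigrams. It assumes naive whitespace tokenization and one common TF-IDF variant (tf = count / document length, idf = log(N / df)); library implementations usually add smoothing and normalization, so their values will differ. The `reviews` list and the helper names `ngrams` and `tfidf` are hypothetical and exist only for illustration.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tfidf(documents):
    """Compute TF-IDF weights for the unigrams and bigrams of each document.

    tf = term count / number of terms in the document
    idf = log(number of documents / number of documents containing the term)
    """
    term_lists = []
    for doc in documents:
        tokens = doc.lower().split()  # naive whitespace tokenization (assumption)
        term_lists.append(ngrams(tokens, 1) + ngrams(tokens, 2))

    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for terms in term_lists for term in set(terms))

    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy reviews, for illustration only
reviews = [
    "great battery great screen",
    "terrible battery",
    "great price",
]
tfidf_weights = tfidf(reviews)
print(tfidf_weights[1])
```

In the second toy review, "terrible" receives a higher weight than "battery" because "battery" also appears in another review, which is the behavior that makes TF-IDF useful for picking out sentiment-bearing terms.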




## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag

# Download necessary NLTK data
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

# Sample review
review = "The battery life of this laptop is terrible. The screen quality is also poor."

# Tokenize the review
tokens = word_tokenize(review)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

# Lemmatize the tokens
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in filtered_tokens]

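# POS tagging
# Note: the tagger here sees the filtered, lemmatized tokens rather than the full
# sentence, so it loses context (e.g. 'laptop' is tagged JJ in the output below).
# Tagging the original tokens before stopword removal is generally more reliable.
pos_tags = pos_tag(lemmatized_tokens)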

print("Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)
print("Lemmatized Tokens:", lemmatized_tokens)
print("POS Tags:", pos_tags)


Tokens: ['The', 'battery', 'life', 'of', 'this', 'laptop', 'is', 'terrible', '.', 'The', 'screen', 'quality', 'is', 'also', 'poor', '.']
Filtered Tokens: ['battery', 'life', 'laptop', 'terrible', '.', 'screen', 'quality', 'also', 'poor', '.']
Lemmatized Tokens: ['battery', 'life', 'laptop', 'terrible', '.', 'screen', 'quality', 'also', 'poor', '.']
POS Tags: [('battery', 'NN'), ('life', 'NN'), ('laptop', 'JJ'), ('terrible', 'NN'), ('.', '.'), ('screen', 'JJ'), ('quality', 'NN'), ('also', 'RB'), ('poor', 'JJ'), ('.', '.')]


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
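
The cell above covers tokenization, stopword removal, lemmatization, and POS tagging, but not the sentiment lexicon feature described in Question 1. A minimal sketch of that feature follows, assuming a tiny hand-made lexicon: the `SENTIMENT_LEXICON` entries and their scores are hypothetical, not taken from any published resource, and `lexicon_features` is an illustrative helper. The token list is copied from the "Filtered Tokens" output above.

```python
# Hypothetical mini-lexicon: word -> polarity score in [-1, 1] (illustrative values)
SENTIMENT_LEXICON = {
    "excellent": 1.0, "amazing": 0.9, "great": 0.8,
    "poor": -0.6, "bad": -0.7, "terrible": -1.0,
}

def lexicon_features(tokens, lexicon=SENTIMENT_LEXICON):
    """Summarize the lexicon scores of the tokens that appear in the lexicon."""
    scores = [lexicon[t.lower()] for t in tokens if t.lower() in lexicon]
    return {
        "lexicon_hits": len(scores),
        # Average polarity of matched words; 0.0 when nothing matches
        "lexicon_mean": sum(scores) / len(scores) if scores else 0.0,
        "positive_count": sum(s > 0 for s in scores),
        "negative_count": sum(s < 0 for s in scores),
    }

# Filtered tokens of the sample review, as printed above
filtered_tokens = ['battery', 'life', 'laptop', 'terrible', '.',
                   'screen', 'quality', 'also', 'poor', '.']
features = lexicon_features(filtered_tokens)
print(features)
```

For the sample review, only "terrible" and "poor" match the lexicon, so the mean polarity is negative, in line with the tone of the review.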


## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import numpy as np
from sklearn.feature_selection import chi2

# Sample TF-IDF features and part-of-speech features
tfidf_features = np.array([[0.2, 0.3, 0.0, 0.4], [0.1, 0.0, 0.5, 0.6], [0.7, 0.8, 0.9, 0.0]])
pos_features = np.array([[2, 3, 1], [1, 2, 3], [3, 1, 2]])

# Combine features for ranking
combined_features = np.concatenate((tfidf_features, pos_features), axis=1)

# Sample class labels for the three samples (two classes: 1 and 0)
labels = np.array([1, 0, 1])

# Calculate chi-squared statistics and p-values
chi2_stat, p_val = chi2(combined_features, labels)

# Map each feature name to its chi-squared score
feature_names = ['TF-IDF 1', 'TF-IDF 2', 'TF-IDF 3', 'TF-IDF 4', 'POS 1', 'POS 2', 'POS 3']  # Sample feature names
feature_scores = dict(zip(feature_names, chi2_stat))

# Sort features by importance (chi-squared scores) in descending order
sorted_features = dict(sorted(feature_scores.items(), key=lambda item: item[1], reverse=True))

# Display sorted features and their chi-squared scores
print("Ranked Features based on Chi-squared scores (descending order):")
for feature, score in sorted_features.items():
    print(f"{feature}: {score}")





Ranked Features based on Chi-squared scores (descending order):
POS 1: 0.75
POS 3: 0.75
TF-IDF 2: 0.5500000000000002
TF-IDF 4: 0.31999999999999995
TF-IDF 1: 0.24499999999999994
TF-IDF 3: 0.00357142857142857
POS 2: 0.0
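
To show where these scores come from, here is a pure-Python sketch of the chi-squared statistic for non-negative features. It assumes the usual formulation for this setting: the observed value is the per-class sum of a feature, and the expected value is the class prior times the feature's total. The function name `chi2_scores` is hypothetical, and skipping classes with an expected value of zero is a simplification of this sketch.

```python
def chi2_scores(X, y):
    """Chi-squared statistic per feature column for non-negative features.

    X: list of rows (one per sample), y: list of class labels.
    observed = sum of the feature over the samples of a class
    expected = feature total * fraction of samples in that class
    """
    classes = sorted(set(y))
    n_samples = len(y)
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        column = [row[j] for row in X]
        feature_total = sum(column)
        score = 0.0
        for c in classes:
            observed = sum(v for v, label in zip(column, y) if label == c)
            expected = feature_total * y.count(c) / n_samples
            if expected > 0:  # skip degenerate cases (simplification)
                score += (observed - expected) ** 2 / expected
        scores.append(score)
    return scores

# The three POS columns and the labels from the cell above
pos_features = [[2, 3, 1], [1, 2, 3], [3, 1, 2]]
labels = [1, 0, 1]
pos_scores = chi2_scores(pos_features, labels)
print(pos_scores)  # [0.75, 0.0, 0.75]
```

These values agree with the POS 1, POS 2, and POS 3 scores printed above. With only three samples the scores are illustrative and say little about real feature importance.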


## Question 4 (10 points):
Write Python code to rank texts by similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
!pip install transformers scikit-learn




In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch

# Sample text data
texts = [
    "This product is excellent! I am really satisfied with it.",
    "The customer service was terrible. I had a bad experience.",
    "The delivery was fast and efficient. Highly recommended!",
]

# Sample query
query = "I had a great experience with the customer service."

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

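# Function to calculate a BERT embedding for a given text.
# pooler_output is the [CLS] hidden state passed through a dense layer and tanh;
# mean-pooling outputs.last_hidden_state is a common alternative for sentence similarity.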
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    return outputs.pooler_output

# Get BERT embeddings for the query and each text
query_embedding = get_bert_embedding(query)
text_embeddings = [get_bert_embedding(text) for text in texts]

# Calculate cosine similarity between the query and each text
similarity_scores = [cosine_similarity(query_embedding, text_embedding)[0][0] for text_embedding in text_embeddings]

# Create a list of tuples (text, similarity score)
text_similarity_tuples = list(zip(texts, similarity_scores))

# Sort the text-similarity tuples by similarity score in descending order
sorted_text_similarity = sorted(text_similarity_tuples, key=lambda x: x[1], reverse=True)

# Display the sorted text similarity
print("Ranked Texts based on Similarity (descending order):")
for text, similarity_score in sorted_text_similarity:
    print(f"Similarity Score: {similarity_score:.4f}")
    print(f"Text: {text}\n")

Ranked Texts based on Similarity (descending order):
Similarity Score: 0.9497
Text: The delivery was fast and efficient. Highly recommended!

Similarity Score: 0.9110
Text: The customer service was terrible. I had a bad experience.

Similarity Score: 0.8886
Text: This product is excellent! I am really satisfied with it.
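
The ranking step itself does not depend on BERT: it works on any fixed-length vectors. A minimal pure-Python sketch of cosine similarity, cos(u, v) = (u · v) / (‖u‖ ‖v‖), and of the descending sort is shown here. The three-dimensional `toy_vecs` and the names `cosine` and `rank_by_similarity` are hypothetical stand-ins for the real 768-dimensional embeddings and are used only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for zero vectors in this sketch
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs, docs):
    """Return (document, score) pairs sorted by similarity, highest first."""
    scored = [(doc, cosine(query_vec, vec)) for doc, vec in zip(docs, doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy vectors standing in for sentence embeddings
toy_docs = ["doc A", "doc B", "doc C"]
toy_vecs = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 1.0, 0.0]

ranking = rank_by_similarity(toy_query, toy_vecs, toy_docs)
for doc, score in ranking:
    print(f"{doc}: {score:.4f}")
```

Because cosine similarity depends only on the angle between vectors, "doc B", which points in the same direction as the query, ranks first regardless of vector length.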



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



Extracting features from text data involves several key concepts and techniques, such as tokenization, stop-word removal, lemmatization, and part-of-speech tagging. The most beneficial aspect of this process is understanding how these techniques can be used to preprocess and extract meaningful features from raw text data. The challenge lies in deciding which features to use and how to represent them in a format suitable for machine learning models. This exercise is relevant to the field of NLP as it forms the foundation for many text-based tasks, such as sentiment analysis, document classification, and information retrieval.

