<a href="https://colab.research.google.com/github/SatyaA-dev/SatyaAditya_INFO5731_Fall2024/blob/main/Masimukku_SatyaAditya_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided *.ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
One interesting text classification task is "classifying news articles by topic". The goal is to build a model that automatically assigns a topic label to a news article based on its content. Here are five types of features that would be useful for building a machine learning model for this task:
1. Bag of words (BoW) feature:
   BoW represents a document as an unordered collection of unigrams with their frequency counts: the text is tokenized and the number of occurrences of each word in the document is recorded.

   Specific words are more frequent in certain topics. Words like 'election', 'candidate' and 'vote' tend to appear more often in political articles, while 'game', 'team' and 'score' show up often in sports articles.

2. TF-IDF (Term Frequency-Inverse Document Frequency) Feature:
   TF-IDF is a refined version of BoW that accounts for the frequency of words in individual documents (term frequency) and penalizes common words that appear across many documents (inverse document frequency).

   TF-IDF helps emphasize words that are important for a specific document while reducing the influence of frequently occurring but non-informative words like "the," "and," etc. It’s especially helpful when dealing with large datasets where common words might skew results.

3. N-gram Feature:
   N-grams capture sequences of words, such as pairs (bigrams) or triples (trigrams) of consecutive words in the text.

   While BoW captures individual words, N-grams preserve word order and context. For instance, the bigram "climate change" is more informative for environmental news than the individual words "climate" or "change." N-grams help capture phrases and multi-word expressions that are indicative of specific topics.

4. Named Entity Recognition Feature:
   Named entities such as people, organizations, locations, dates, and product names are extracted from the text.

   Different topics frequently mention different types of entities. Political articles may mention politicians and countries (e.g., "Biden," "United States"), while technology articles may mention companies and products (e.g., "Google," "iPhone"). NER allows the model to use these named entities as features to help disambiguate the topics.

5. Sentiment or Emotion Feature:
   Sentiment analysis detects the emotional tone of a document, labeling it as positive, negative, or neutral, while emotion detection may classify text based on emotions like happiness, anger, or sadness.

   Certain topics may have characteristic sentiment profiles. For example, sports articles might show fluctuating emotions (excitement, disappointment), while scientific or technical articles may have a more neutral or formal tone. Including sentiment features can enhance classification accuracy, especially when sentiment is a distinguishing factor.




'''
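To make the TF-IDF weighting described in the answer concrete, here is a minimal pure-Python sketch. It assumes the smoothed IDF variant `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which corresponds to scikit-learn's documented defaults (`smooth_idf=True`, `norm='l2'`). The toy corpus and the helper names `smoothed_idf` and `tfidf_vector` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already lowercased and tokenized for brevity
docs = [
    ["the", "election", "results", "the", "election"],
    ["the", "google", "smartphone", "release"],
    ["the", "soccer", "game", "score"],
]

n_docs = len(docs)
# Document frequency: the number of documents each term occurs in
doc_freq = Counter(term for doc in docs for term in set(doc))


def smoothed_idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1


def tfidf_vector(doc):
    """Return an L2-normalised {term: weight} mapping for one document."""
    counts = Counter(doc)
    raw = {term: tf * smoothed_idf(term) for term, tf in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}


weights = tfidf_vector(docs[0])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the first document "the" and "election" both occur twice, but "the" appears in every document and so receives the minimum IDF of 1, which leaves "election" with the larger weight.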

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [None]:
# Your code here (Please add comments in the code):

import nltk
import spacy
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from textblob import TextBlob

# Download necessary NLTK data
nltk.download('punkt')

# Load spaCy's English NLP model
nlp = spacy.load("en_core_web_sm")

# Sample dataset
texts = [
    "The election results were announced yesterday. Biden won the election by a significant margin.",
    "Google has announced a new version of its popular smartphone, the Google Pixel 6, set to release next month.",
    "The soccer game ended with a score of 3-2, with Liverpool defeating Manchester United in a thrilling match."
]

### 1. Bag of Words (BoW) Feature Extraction ###
vectorizer_bow = CountVectorizer()
bow_matrix = vectorizer_bow.fit_transform(texts)
print("Bag of Words (BoW) Feature Names:", vectorizer_bow.get_feature_names_out())
print("BoW Matrix (Document-Term Matrix):\n", bow_matrix.toarray())

### 2. TF-IDF Feature Extraction ###
vectorizer_tfidf = TfidfVectorizer()
tfidf_matrix = vectorizer_tfidf.fit_transform(texts)
print("\nTF-IDF Feature Names:", vectorizer_tfidf.get_feature_names_out())
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

### 3. N-gram Features (Bigrams) ###
vectorizer_ngram = CountVectorizer(ngram_range=(2, 2))  # bigrams
ngram_matrix = vectorizer_ngram.fit_transform(texts)
print("\nN-gram (Bigram) Feature Names:", vectorizer_ngram.get_feature_names_out())
print("N-gram Matrix:\n", ngram_matrix.toarray())

### 4. Named Entity Recognition (NER) ###
print("\nNamed Entities:")
for text in texts:
    doc = nlp(text)
    for ent in doc.ents:
        print(f"Text: {ent.text}, Label: {ent.label_}")

### 5. Sentiment and Polarity Features ###
print("\nSentiment Analysis:")
for text in texts:
    blob = TextBlob(text)
    sentiment = blob.sentiment.polarity  # Ranges from -1 (negative) to +1 (positive)
    print(f"Text: {text}\nSentiment Polarity: {sentiment}\n")



[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Bag of Words (BoW) Feature Names: ['announced' 'biden' 'by' 'defeating' 'election' 'ended' 'game' 'google'
 'has' 'in' 'its' 'liverpool' 'manchester' 'margin' 'match' 'month' 'new'
 'next' 'of' 'pixel' 'popular' 'release' 'results' 'score' 'set'
 'significant' 'smartphone' 'soccer' 'the' 'thrilling' 'to' 'united'
 'version' 'were' 'with' 'won' 'yesterday']
BoW Matrix (Document-Term Matrix):
 [[1 1 1 0 2 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 0 2 0 0 0 0 1 0 1
  1]
 [1 0 0 0 0 0 0 2 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 1 0 1 0 1 0 1 0 0 0
  0]
 [0 0 0 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 1 1 1 0 1 0 0 2 0
  0]]

TF-IDF Feature Names: ['announced' 'biden' 'by' 'defeating' 'election' 'ended' 'game' 'google'
 'has' 'in' 'its' 'liverpool' 'manchester' 'margin' 'match' 'month' 'new'
 'next' 'of' 'pixel' 'popular' 'release' 'results' 'score' 'set'
 'significant' 'smartphone' 'soccer' 'the' 'thrilling' 'to' 'united'
 'version' 'were' 'with' 'won' 'yesterday']
TF-IDF Matrix:
 [[0
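The NER step above prints the entities but does not yet turn them into model inputs. One possible way to do that, sketched below, is to count entity labels per document and lay the counts out in a fixed order. The `(text, label)` pairs here are hypothetical stand-ins in the shape you get from iterating over `doc.ents`; they are not actual spaCy output, and the `LABELS` list and `entity_count_vector` helper are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical (text, label) pairs, shaped like [(ent.text, ent.label_) for ent in doc.ents]
entities_per_doc = [
    [("yesterday", "DATE"), ("Biden", "PERSON")],
    [("Google", "ORG"), ("next month", "DATE")],
    [("Liverpool", "ORG"), ("Manchester United", "ORG")],
]

# Fixed label order so every document maps to a vector of the same length
LABELS = ["PERSON", "ORG", "GPE", "DATE"]


def entity_count_vector(entities, labels=LABELS):
    """Count how many entities of each label a document contains."""
    counts = Counter(label for _, label in entities)
    return [counts.get(label, 0) for label in labels]


ner_features = [entity_count_vector(ents) for ents in entities_per_doc]
for row in ner_features:
    print(dict(zip(LABELS, row)))
```

The resulting rows can be stacked next to the BoW or TF-IDF columns. The same pattern would work for the TextBlob polarity score, which is already a single number per document.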

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif, VarianceThreshold
from scipy.sparse import hstack

# Sample dataset (same as before)
texts = [
    "The election results were announced yesterday. Biden won the election by a significant margin.",
    "Google has announced a new version of its popular smartphone, the Google Pixel 6, set to release next month.",
    "The soccer game ended with a score of 3-2, with Liverpool defeating Manchester United in a thrilling match."
]

# Labels (e.g., 0 = Politics, 1 = Technology, 2 = Sports)
labels = [0, 1, 2]

# Step 1: Extract features (BoW and TF-IDF)

# Bag of Words (BoW) Features
bow_vectorizer = CountVectorizer()
bow_features = bow_vectorizer.fit_transform(texts)

# TF-IDF Features
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(texts)

# Combine the features (BoW + TF-IDF) using hstack to handle sparse matrices
combined_features = hstack([bow_features, tfidf_features])

# Step 2: Filter out constant features using VarianceThreshold
# VarianceThreshold removes features with zero variance
variance_filter = VarianceThreshold(threshold=0.0)
combined_features = variance_filter.fit_transform(combined_features)

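# Step 3: Use SelectKBest to rank features
# Note: f_classif is an ANOVA F-test and needs more than one sample per class;
# with a single document per label, every score comes out as nan (see the output below).
feature_selector = SelectKBest(score_func=f_classif, k='all')  # 'all' to rank all features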

# Step 4: Fit the feature selector on the entire dataset
feature_selector.fit(combined_features, labels)

# Step 5: Extract feature scores (importance)
feature_scores = feature_selector.scores_

# Step 6: Rank features by importance (descending order)
ranked_indices = np.argsort(feature_scores)[::-1]
ranked_scores = feature_scores[ranked_indices]

# Get feature names (BoW + TF-IDF)
bow_feature_names = bow_vectorizer.get_feature_names_out()
tfidf_feature_names = tfidf_vectorizer.get_feature_names_out()
combined_feature_names = np.concatenate([bow_feature_names, tfidf_feature_names])

# Since we used VarianceThreshold, we need to adjust feature names accordingly
# VarianceThreshold removes some features, so we need to select the names of the features that remain
# Get the indices of features that remain after VarianceThreshold
remaining_feature_indices = variance_filter.get_support(indices=True)
remaining_feature_names = combined_feature_names[remaining_feature_indices]

# Now, get the names of the features in the order of ranked_indices
ranked_feature_names = remaining_feature_names[ranked_indices]

# Print ranked features and their importance scores
print("Ranked Features (Feature Importance):\n")
for name, score in zip(ranked_feature_names, ranked_scores):
    print(f"Feature: {name}, Score: {score:.4f}")


Ranked Features (Feature Importance):

Feature: yesterday, Score: nan
Feature: of, Score: nan
Feature: popular, Score: nan
Feature: release, Score: nan
Feature: results, Score: nan
Feature: score, Score: nan
Feature: set, Score: nan
Feature: significant, Score: nan
Feature: smartphone, Score: nan
Feature: soccer, Score: nan
Feature: the, Score: nan
Feature: thrilling, Score: nan
Feature: to, Score: nan
Feature: united, Score: nan
Feature: version, Score: nan
Feature: were, Score: nan
Feature: with, Score: nan
Feature: pixel, Score: nan
Feature: next, Score: nan
Feature: won, Score: nan
Feature: new, Score: nan
Feature: biden, Score: nan
Feature: by, Score: nan
Feature: defeating, Score: nan
Feature: election, Score: nan
Feature: ended, Score: nan
Feature: game, Score: nan
Feature: google, Score: nan
Feature: has, Score: nan
Feature: in, Score: nan
Feature: its, Score: nan
Feature: liverpool, Score: nan
Feature: manchester, Score: nan
Feature: margin, Score: nan
Feature: match, Score: n

  msw = sswn / float(dfwn)


## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
# Your code here (Please add comments in the code):

import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample dataset (same as the one used previously)
texts = [
    "The election results were announced yesterday. Biden won the election by a significant margin.",
    "Google has announced a new version of its popular smartphone, the Google Pixel 6, set to release next month.",
    "The soccer game ended with a score of 3-2, with Liverpool defeating Manchester United in a thrilling match."
]

# Function to get the BERT [CLS] embedding for a given text
def get_bert_embedding(text):
    inputs = tokenizer(text, return_tensors='pt', max_length=512, truncation=True, padding=True)
    # Inference only: disable gradient tracking to save memory and time
    with torch.no_grad():
        outputs = model(**inputs)
    # The embedding is taken from the last hidden state of the [CLS] token (index 0)
    cls_embedding = outputs.last_hidden_state[:, 0, :].numpy()
    return cls_embedding

# Function to compute cosine similarity between two vectors
def cosine_sim(a, b):
    return cosine_similarity(a, b)

# Embed the documents using BERT
document_embeddings = np.vstack([get_bert_embedding(text) for text in texts])

# Define the query and get its BERT embedding
query = "New smartphone release by a tech company"
query_embedding = get_bert_embedding(query)

# Calculate cosine similarity between the query and each document
similarities = cosine_sim(query_embedding, document_embeddings).flatten()

# Rank the documents by similarity (descending order)
ranked_indices = np.argsort(similarities)[::-1]
ranked_similarities = similarities[ranked_indices]

# Print the ranked documents with their similarity scores
print("Ranked Documents based on similarity to query:\n")
for idx, similarity in zip(ranked_indices, ranked_similarities):
    print(f"Document {idx+1}: {texts[idx]}")
    print(f"Similarity Score: {similarity:.4f}\n")






Ranked Documents based on similarity to query:

Document 2: Google has announced a new version of its popular smartphone, the Google Pixel 6, set to release next month.
Similarity Score: 0.8351

Document 3: The soccer game ended with a score of 3-2, with Liverpool defeating Manchester United in a thrilling match.
Similarity Score: 0.7812

Document 1: The election results were announced yesterday. Biden won the election by a significant margin.
Similarity Score: 0.7129
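The ranking step above reduces to computing the cosine of the angle between the query vector and each document vector, then sorting. The sketch below shows that calculation in plain Python on hypothetical 3-dimensional vectors; real `bert-base-uncased` embeddings have 768 dimensions, and the names `doc_a`, `doc_b`, and `doc_c` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical toy "embeddings" standing in for BERT [CLS] vectors
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partially aligned with the query
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

similarities = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
ranking = sorted(similarities, key=similarities.get, reverse=True)

for name in ranking:
    print(f"{name}: {similarities[name]:.4f}")
```

Because cosine similarity ignores vector length, only the direction of each embedding matters. The toy vectors are deliberately spread apart, which makes the ordering easier to see than in the BERT scores above, where all three similarities fall between 0.71 and 0.84.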



# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
### Learning Experience

Working on the exercises for extracting features from text data provided a valuable learning experience. The key concepts I found most beneficial were:

1. Feature Extraction Techniques: Understanding how to extract different features from text using methods like Bag of Words (BoW) and TF-IDF was crucial. These techniques allowed me to represent textual data in numerical form, which is essential for machine learning models to interpret text.

2. Feature Selection: Applying feature selection using methods like `SelectKBest` and `chi2` scoring helped me grasp how to prioritize the most relevant features for classification tasks. Learning how feature selection helps reduce the dimensionality of data, while maintaining important predictive features, was particularly insightful.

3. Combining Features: Combining multiple feature extraction techniques (BoW and TF-IDF) demonstrated the importance of a comprehensive approach in building models for NLP tasks. It was interesting to see how each method captures different aspects of the text (e.g., word frequencies vs. term importance).

### Challenges Encountered

One of the main challenges I encountered was the issue with low variance in features, leading to errors during feature selection with `f_classif`. This required me to adjust the approach by switching to the `chi2` scoring function, which is better suited for sparse, non-negative data like term frequencies in text classification tasks.
Another challenge was dealing with a small dataset. Having more data would have improved the variability among features and given me more flexibility to apply complex feature selection techniques.

### Relevance to Your Field of Study

This exercise is highly relevant to the field of Natural Language Processing (NLP). Feature extraction is fundamental in NLP because text data needs to be transformed into a format that machine learning algorithms can process. Techniques like BoW and TF-IDF are commonly used in many NLP applications, including text classification, sentiment analysis, and information retrieval.
The ability to rank and select important features based on their relevance is also critical for improving model performance and interpretability in NLP tasks. Understanding these concepts helps in building more efficient models that generalize well, which is a key requirement in real-world NLP applications.





'''