<a href="https://colab.research.google.com/github/NagillaUdayasree/Udayasree_INFO5731_Spring2024/blob/main/Nagilla_Udayasree_Exercise_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Students are expected to complete the exercise during lecture period to meet the active participation criteria of the course.
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of the day tomorrow, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**


## Question 1 (10 Points)
Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the question above):

'''
Sorting news articles into topics such as politics, sports, technology, entertainment, and health is an interesting text classification task. The task is to identify the category a news story belongs to by examining its content. The following features could be helpful in building a machine learning model for this task:

Bag-of-Words (BoW) or TF-IDF Features:
BoW features record how often each word appears in an article, whereas TF-IDF features additionally weight each word by how rare it is across all articles in the dataset, which indicates how important the word is.
Why it's helpful: BoW and TF-IDF features capture the vocabulary of the articles and reveal which terms are characteristic of each category. For instance, terms such as "election," "government," and "policy" point to the politics category.

Word Embeddings:
Word embeddings capture the semantic associations between words by representing them as dense, low-dimensional vectors in a continuous space.
Why it's helpful: Word embeddings help the model grasp the context and meaning of the words in the articles, so it can capture subtle semantic information and similarities between words with comparable meanings. This may improve the model's ability to categorize articles correctly according to their content.

Named Entity Recognition (NER) Features:
NER features identify and categorize the named entities in the articles, including people, organizations, places, dates, and numerical expressions.
Why it's helpful: NER features describe the key entities discussed in the articles and their relevance to particular categories. Mentions of government agencies or political figures, for instance, may point to the politics category.

Part-of-Speech (POS) Features:
POS features label words with their grammatical categories (e.g., nouns, verbs, adjectives) in order to capture the syntactic structure of the articles.
Why it's helpful: POS features help the model capture syntactic patterns and recognize linguistic cues associated with particular categories. For example, a high frequency of adjectives describing technical developments could indicate the technology category.

Sentiment Analysis Features:
Sentiment features capture the sentiment (positive, negative, or neutral) expressed in the articles.
Why it's helpful: Although classifying news by topic is the main goal, sentiment features offer additional context about the tone or stance of the articles. Positive sentiment in a technology article, for instance, could indicate an upbeat attitude toward technical progress.

By combining these features, the machine learning model can categorize news articles according to their content, the entities they mention, their syntactic structure, and the sentiment they express. This multifaceted approach helps the model understand and categorize news stories, which in turn supports tasks like trend analysis, topic modeling, and content recommendation.

'''
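
The TF-IDF weighting described in the answer above can be illustrated with a small, dependency-free sketch. The mini-corpus, the `tfidf` helper, and the whitespace tokenization are assumptions made for illustration only. The smoothed idf used here, `ln((1 + n) / (1 + df)) + 1`, is the default in scikit-learn's `TfidfVectorizer`, which additionally L2-normalizes each document vector (omitted here).

```python
import math
from collections import Counter

# Hypothetical mini-corpus, tokenized by splitting on whitespace.
docs = [
    "the government announced a new election policy",
    "the team won the football match",
    "the new phone has a faster chip",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf(tokens):
    """Return {term: tf * idf} using the smoothed idf ln((1 + n) / (1 + df)) + 1."""
    tf = Counter(tokens)
    return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1) for t, c in tf.items()}

weights = [tfidf(tokens) for tokens in tokenized]

# "the" occurs in every document, so it receives the minimum idf;
# "election" occurs in a single document, so it is weighted more heavily.
print(weights[0]["the"], weights[0]["election"])
```

Terms that appear in every article contribute little to distinguishing categories, which is why TF-IDF is often preferred over raw counts for topic classification.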

## Question 2 (10 Points)
Write python code to extract these features you discussed above. You can collect a few sample text data for the feature extraction.

In [136]:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
import nltk
from nltk.tokenize import word_tokenize
import spacy
from textblob import TextBlob
from sklearn.linear_model import LogisticRegression


# Sample news articles
news_articles = [
   "The new healthcare policy aims to improve access to medical services for all citizens.",
   "The football team celebrated their victory with a parade through the city streets.",
   "Apple Inc. unveiled its latest iPhone model at the annual product launch event.",
   "The latest movie blockbuster broke box office records on its opening weekend.",
   "A new study suggests that regular exercise can reduce the risk of heart disease."
]

# Initializing spaCy model for Named Entity Recognition (NER) and Part-of-Speech (POS) tagging
nlp = spacy.load("en_core_web_sm")

# Function to perform Bag-of-Words (BoW) feature extraction
def bow_features(articles):
    # Initialize CountVectorizer
    vectorizer = CountVectorizer()
    # Fit and transform the articles to BoW matrix
    bow_matrix = vectorizer.fit_transform(articles)
    # Get feature names
    feature_names = vectorizer.get_feature_names_out()
    return bow_matrix, feature_names

# Function to compute article embeddings by averaging spaCy token vectors (not Word2Vec).
# Note: en_core_web_sm ships without static word vectors, so token.vector falls back to
# context-sensitive tensors; en_core_web_md or en_core_web_lg provide true word vectors.
def word_embeddings(articles):
    embeddings = []
    # Generating an embedding for each article
    for article in articles:
        # Tokenizing the article and obtaining token vectors using spaCy
        doc = nlp(article)
        # Computing the mean of the token vectors for the article
        article_embedding = np.mean([token.vector for token in doc], axis=0)
        embeddings.append(article_embedding)
    return np.array(embeddings)

# Function to perform Named Entity Recognition (NER) feature extraction
def ner_features(articles):
    ner_tags = []
    # Extracting NER tags for each article
    for article in articles:
        # Processing the article using spaCy
        doc = nlp(article)
        # Extracting NER tags and append to the list
        ner_tags.append([ent.label_ for ent in doc.ents])
    return ner_tags

# Function to perform Part-of-Speech (POS) feature extraction
def pos_features(articles):
    pos_tags = []
    # Extracting POS tags for each article
    for article in articles:
        # Processing the article using spaCy
        doc = nlp(article)
        # Extracting POS tags and append to the list
        pos_tags.append([token.pos_ for token in doc])
    return pos_tags

# Function to perform sentiment analysis feature extraction
def sentiment_features(articles):
    sentiment_scores = []
    # Computing sentiment polarity score for each article
    for article in articles:
        # Using TextBlob to compute sentiment polarity
        blob = TextBlob(article)
        sentiment_scores.append(blob.sentiment.polarity)
    return sentiment_scores

# Performing feature extraction
bow_matrix, bow_feature_names = bow_features(news_articles)
word_embeddings_matrix = word_embeddings(news_articles)
ner_tags = ner_features(news_articles)
pos_tags = pos_features(news_articles)
sentiment_scores = sentiment_features(news_articles)

# Printing the extracted features
print("Bag-of-Words Features:")
print(bow_matrix.toarray())
print("Feature Names:",bow_feature_names)
print("\nWord Embeddings:")
print(word_embeddings_matrix)
print("\nNamed Entity Recognition (NER) Features:")
print(ner_tags)
print("\nPart-of-Speech (POS) Features:")
print(pos_tags)
print("\nSentiment Analysis Features:")
print(sentiment_scores)






Bag-of-Words Features:
[[1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1
  0 0 0 0 0 1 0 0 0 0 0 1 0 0 2 0 0 0 0]
 [0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
  0 0 0 0 0 0 1 0 0 1 0 2 1 1 0 0 1 0 1]
 [0 0 0 1 1 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 0 0 0
  1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0]
 [0 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 0 1 1 1 0 0
  0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0]
 [0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0
  0 0 1 1 1 0 0 1 1 0 1 1 0 0 0 0 0 0 0]]
Feature Names: ['access' 'aims' 'all' 'annual' 'apple' 'at' 'blockbuster' 'box' 'broke'
 'can' 'celebrated' 'citizens' 'city' 'disease' 'event' 'exercise'
 'football' 'for' 'healthcare' 'heart' 'improve' 'inc' 'iphone' 'its'
 'latest' 'launch' 'medical' 'model' 'movie' 'new' 'of' 'office' 'on'
 'opening' 'parade' 'policy' 'product' 'records' 'reduce' 'regular' 'risk'
 'services' 'streets' 'study' '
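
The NER and POS functions above return variable-length lists of tags, which most classifiers cannot consume directly. One possible way to turn them into fixed-width numeric features is to count each tag per document. This is only a sketch: the `tag_count_matrix` helper and the sample tag lists below are hypothetical stand-ins for the output of `pos_features` or `ner_features`.

```python
from collections import Counter

# Hypothetical tag lists, shaped like the output of pos_features / ner_features.
pos_tags = [
    ["DET", "ADJ", "NOUN", "VERB", "NOUN"],
    ["PROPN", "PROPN", "VERB", "DET", "NOUN"],
]

def tag_count_matrix(tag_lists):
    """Turn per-document tag lists into a count matrix with one column per tag."""
    # Sorted column order keeps the layout stable across runs.
    columns = sorted({tag for tags in tag_lists for tag in tags})
    rows = []
    for tags in tag_lists:
        counts = Counter(tags)  # missing tags count as 0
        rows.append([counts[col] for col in columns])
    return columns, rows

columns, matrix = tag_count_matrix(pos_tags)
print(columns)
for row in matrix:
    print(row)
```

The resulting rows could be concatenated with the BoW matrix, the embeddings, and the sentiment score to form a single feature vector per article.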

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [139]:
from sklearn.feature_selection import SelectKBest, chi2

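# Class-dependent feature selection

# Assumed setup (these variables are not defined elsewhere in this notebook):
# lowercased, whitespace-tokenized sentences and the set of distinct tokens.
sentences = news_articles
words = [sentence.lower().split() for sentence in sentences]
vocabulary = set(word for sentence_words in words for word in sentence_words)

# Labels for each sentence, representing the category to which the sentence belongs
labels = ['medical', 'sports', 'technology', 'entertainment', 'medical']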

# A feature matrix is created where each row represents a sentence and each column represents a word in the vocabulary.
feature_matrix = np.zeros((len(sentences), len(vocabulary)), dtype=int)

for i, sentence_words in enumerate(words):
    for j, word in enumerate(vocabulary):
        if word in sentence_words:
            feature_matrix[i, j] = 1

# SelectKBest performs class-dependent feature selection based on the chi-squared test; chi2 is passed as the scoring function.
# fit_transform fits the feature matrix to the labels and keeps the top k features (in this case, the top 10).
k_best = SelectKBest(score_func=chi2, k=10)
selected_features = k_best.fit_transform(feature_matrix, labels)

# The get_support(indices=True) method is used to get the indices of the selected features.
sf_indices = k_best.get_support(indices=True)

# Get the names of selected features
sf_names = [list(vocabulary)[i] for i in sf_indices]
print(sf_names)

['team', 'records', 'through', 'celebrated', 'product', 'annual', 'blockbuster', 'event.', 'iphone', 'city']


In [138]:
#Ranking
# The scores_ attribute of the SelectKBest object (k_best) holds the chi-squared score of every feature in the vocabulary,
# so it is indexed with sf_indices to pair each selected feature name with its own score.
chi2Scores = k_best.scores_[sf_indices]

# Creating a dictionary to store the importance score of each selected feature
feature_importance_dict = {feature_name: chi2_score for feature_name, chi2_score in zip(sf_names, chi2Scores)}

# Sorting features based on importance in descending order
desc_features = sorted(feature_importance_dict.items(), key=lambda x: x[1], reverse=True)

# Printing the ranked features
print("Ranked selected features based on importance:")
for i, (feature_name, importance_score) in enumerate(desc_features):
    print(f"{i+1}. {feature_name}: {importance_score:.4f}")

Ranked selected features based on importance:
1. team: 4.0000
2. records: 4.0000
3. through: 4.0000
4. celebrated: 4.0000
5. product: 4.0000
6. annual: 4.0000
7. blockbuster: 4.0000
8. event.: 4.0000
9. iphone: 4.0000
10. city: 4.0000
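
To make the chi-squared scores less of a black box, the statistic can be computed by hand for a single binary feature. The sketch below follows the usual observed-versus-expected formulation, which is also how scikit-learn's `chi2` scores non-negative features. The `chi2_score` helper and the example feature columns are hypothetical and for illustration only.

```python
def chi2_score(feature_column, labels):
    """Chi-squared statistic between one non-negative feature and the class labels."""
    n = len(labels)
    total = sum(feature_column)
    if total == 0:
        return float("nan")  # the statistic is undefined for a feature that never occurs
    score = 0.0
    for cls in sorted(set(labels)):
        # Observed: how much of the feature falls in this class.
        observed = sum(x for x, y in zip(feature_column, labels) if y == cls)
        # Expected: the feature total split in proportion to class frequency.
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score

labels = ['medical', 'sports', 'technology', 'entertainment', 'medical']

only_in_sports = [0, 1, 0, 0, 0]       # term seen only in the single sports sentence
only_in_one_medical = [1, 0, 0, 0, 0]  # term seen in one of the two medical sentences
in_every_sentence = [1, 1, 1, 1, 1]    # term seen everywhere, e.g. a stop word

for column in (only_in_sports, only_in_one_medical, in_every_sentence):
    print(round(chi2_score(column, labels), 4))
```

A term that occurs only in the one sentence of a single-sentence class reaches a score of 4.0 under these labels. A term confined to one of the two medical sentences scores lower, and a term present in every sentence scores 0 because it carries no class information.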


## Question 4 (10 points):
Write python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarity in descending order.

In [141]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel

# Sample news articles
news_articles = [
   "The new healthcare policy aims to improve access to medical services for all citizens.",
   "The football team celebrated their victory with a parade through the city streets.",
   "Apple Inc. unveiled its latest iPhone model at the annual product launch event.",
   "The latest movie blockbuster broke box office records on its opening weekend.",
   "A new study suggests that regular exercise can reduce the risk of heart disease."
]

# The BERT tokenizer and model are initialized using the bert-base-uncased pre-trained model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# The encode_text function tokenizes the input text with the BERT tokenizer and passes it through the BERT model to obtain contextual embeddings.
# Mean pooling over the token embeddings gives a fixed-size representation of the text, which is returned as a NumPy array.
def encode_text(text):
    tokens = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    outputs = model(**tokens)
    embeddings = outputs.last_hidden_state.mean(dim=1)
    return embeddings.detach().numpy()

# Encode the query and each news article with the same function
query = "healthcare policy"
query_encoding = encode_text(query)  # shape (1, hidden_size)

# Each article encoding has shape (1, hidden_size); stacking them gives a 2D array
# of shape (n_articles, hidden_size), which is what cosine_similarity expects.
txt_encodings = np.vstack([encode_text(text) for text in news_articles])


# Calculating cosine similarity between query and each text
similarities = cosine_similarity(query_encoding, txt_encodings)

# Ranking the similarities in descending order
sorted_indices = similarities.argsort()[0][::-1]
sorted_similarities = similarities[0, sorted_indices]

# Printing the ranked similarities and corresponding articles
for idx, sim in zip(sorted_indices, sorted_similarities):
    print(f"Similarity: {sim:.4f} | Article: {news_articles[idx]}")


Similarity: 0.5618 | Article: The new healthcare policy aims to improve access to medical services for all citizens.
Similarity: 0.4503 | Article: A new study suggests that regular exercise can reduce the risk of heart disease.
Similarity: 0.4468 | Article: The football team celebrated their victory with a parade through the city streets.
Similarity: 0.4413 | Article: Apple Inc. unveiled its latest iPhone model at the annual product launch event.
Similarity: 0.3614 | Article: The latest movie blockbuster broke box office records on its opening weekend.
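
For reference, the cosine similarity used in the ranking above is the dot product of two vectors divided by the product of their lengths. The following dependency-free sketch shows the computation and the descending sort on toy data. The three-dimensional vectors and document names are hypothetical stand-ins for the 768-dimensional BERT embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional embeddings, for illustration only.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "doc_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Score every document against the query, then sort from most to least similar.
scores = {name: cosine(query_vec, vec) for name, vec in doc_vecs.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)

for name, sim in ranked:
    print(f"Similarity: {sim:.4f} | Document: {name}")
```

Because cosine similarity depends only on direction, `doc_a` ranks first even though it is twice as long as the query vector.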


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

'''
It was a valuable learning experience overall. Learning the various feature extraction techniques and relating them to real-world problems was exciting. One of the main challenges was ensuring data consistency and compatibility across the different preprocessing and feature extraction steps.
Handling unseen labels or unexpected data values during preprocessing and encoding also required careful attention and troubleshooting. The time given was not sufficient to dive deep into the concepts and write the code on my own. Understanding the input format for Question 3 was challenging, and the expected output was not clear.
This exercise in feature extraction from text data is highly relevant to Information Science. It enables the extraction of meaningful features from textual information, facilitating tasks such as document classification, information retrieval, and semantic analysis.

'''