<a href="https://colab.research.google.com/github/Chandru-018/Chandrasekhar_INFO5731_FALL2024/blob/main/Karumanchi_Chandrasekhar_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions will incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the question above):

Task: News Article Topic Classification

Classifying news articles into predefined topics (politics, sports, technology, health, and entertainment) is an interesting text classification task. It helps readers find relevant news faster and helps media organizations keep their content organized.

Informative Features of the Model

1. Bag of Words (BoW):
   - Description: A representation that counts the occurrences of each word in an article.
   - Why useful: It captures the presence of key terms associated with different topics. For instance, words like "election" and "vote" are closely related to politics, while "goal" and "team" relate to sports.

2. TF-IDF (Term Frequency-Inverse Document Frequency):
   - Description: Similar to BoW, but each word's count in an article is down-weighted by how many articles in the dataset contain that word.
   - Why useful: TF-IDF emphasizes words that are distinctive for certain topics, so the model focuses on terms that differentiate the categories rather than on common, repetitive words.

3. N-grams:
   - Description: Features that capture sequences of n consecutive words (bigrams, trigrams, etc.) to preserve some context.
   - Why useful: N-grams capture specific phrases tied to certain topics. For example, "climate change" directly signals that the news is about environmental issues. Including such features might improve classification performance.

4. Part-of-Speech (POS) Tags:
   - Description: Grammatical labels such as noun, verb, or adjective assigned to each word.
   - Why useful: Some POS tags are indicative of topic. For example, a higher proportion of nouns could point to technical articles, and more adjectives to entertainment articles.

5. Named Entity Recognition (NER):
   - Description: Detects proper nouns in the text that name people, organizations, and places.
   - Why useful: Entities can provide critical context for classification. For example, articles mentioning "NASA" are likely related to technology or science, while "Biden" is relevant to politics.

6. Text Length:
   - Description: Total number of words or characters in the article.
   - Why useful: Average length may vary across topics. For example, technology articles can be more in-depth, whereas entertainment news might be shorter and crisper.

7. Sentiment Scores:
   - Description: A score indicating the sentiment of the article's content: positive, negative, or neutral.
   - Why useful: Some topics have characteristic sentiment profiles. For example, political articles may tend towards polarized sentiment, while sports articles generally tend to be more positive.

Conclusion

Taken together, these features give the machine learning model the information it needs to assign each news article an appropriate topic. They represent both the content and the context of the articles, which should allow more accurate classification and make it easier for readers to find relevant news.

## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import nltk
import spacy
data = {
    "article": [
        "NASA's Perseverance rover begins its search for ancient life on Mars.",
        "The Premier League is set to kick off this weekend with exciting matches.",
        "The COVID-19 vaccine rollout is speeding up as cases decline across the country.",
        "Tech companies are facing scrutiny over data privacy issues.",
        "The Oscars celebrated the best films of the year in a grand ceremony."
    ],
    "topic": [
        "Technology",
        "Sports",
        "Health",
        "Technology",
        "Entertainment"
    ]
}

df = pd.DataFrame(data)

# Bag of Words: raw term counts per article
count_vectorizer = CountVectorizer()
X_bow = count_vectorizer.fit_transform(df['article'])
bow_features = pd.DataFrame(X_bow.toarray(), columns=count_vectorizer.get_feature_names_out())

# TF-IDF: term counts re-weighted by inverse document frequency
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['article'])
tfidf_features = pd.DataFrame(X_tfidf.toarray(), columns=tfidf_vectorizer.get_feature_names_out())

# Bigrams: counts of two-word sequences
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_bigrams = bigram_vectorizer.fit_transform(df['article'])
bigram_features = pd.DataFrame(X_bigrams.toarray(), columns=bigram_vectorizer.get_feature_names_out())

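# Load spaCy's small English pipeline for POS tagging and NER
nlp = spacy.load("en_core_web_sm")

def extract_pos_and_ner(article):
    """Return (POS tag pairs, named-entity pairs) for one article."""
    doc = nlp(article)
    pos_tags = [(token.text, token.pos_) for token in doc]
    entities = [(ent.text, ent.label_) for ent in doc.ents]
    return pos_tags, entities

# Run spaCy once per article, then split the result into two columns
pos_ner = df['article'].apply(extract_pos_and_ner)
df['pos_tags'] = pos_ner.apply(lambda result: result[0])
df['named_entities'] = pos_ner.apply(lambda result: result[1])

# Text length measured in characters
df['text_length'] = df['article'].apply(len)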

from textblob import TextBlob

def get_sentiment(article):
    """Return TextBlob's (polarity, subjectivity) pair for one article."""
    analysis = TextBlob(article)
    return analysis.sentiment.polarity, analysis.sentiment.subjectivity

# Polarity ranges from -1 (negative) to 1 (positive); subjectivity from 0 to 1
df['sentiment'] = df['article'].apply(get_sentiment)
df['sentiment_polarity'] = df['sentiment'].apply(lambda x: x[0])
df['sentiment_subjectivity'] = df['sentiment'].apply(lambda x: x[1])

print("Bag of Words Features:\n", bow_features)
print("\nTF-IDF Features:\n", tfidf_features)
print("\nBigrams Features:\n", bigram_features)
print("\nPOS Tags and Named Entities:\n", df[['article', 'pos_tags', 'named_entities']])
print("\nText Length:\n", df[['article', 'text_length']])
print("\nSentiment Scores:\n", df[['article', 'sentiment_polarity', 'sentiment_subjectivity']])



Bag of Words Features:
    19  across  ancient  are  as  begins  best  cases  celebrated  ceremony  \
0   0       0        1    0   0       1     0      0           0         0   
1   0       0        0    0   0       0     0      0           0         0   
2   1       1        0    0   1       0     0      1           0         0   
3   0       0        0    1   0       0     0      0           0         0   
4   0       0        0    0   0       0     1      0           1         1   

   ...  speeding  tech  the  this  to  up  vaccine  weekend  with  year  
0  ...         0     0    0     0   0   0        0        0     0     0  
1  ...         0     0    1     1   1   0        0        1     1     0  
2  ...         1     0    2     0   0   1        1        0     0     0  
3  ...         0     1    0     0   0   0        0        0     0     0  
4  ...         0     0    3     0   0   0        0        0     0     1  

[5 rows x 53 columns]

TF-IDF Features:
          19    across
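
To make the TF-IDF weighting concrete, here is a minimal standard-library sketch of the calculation. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which to my understanding matches the `TfidfVectorizer` defaults. The helper names `tokenize` and `tfidf_rows` and the two shortened sentences are illustrative only and are not part of the notebook above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and keep tokens of two or more word characters
    # (assumed to mirror the vectorizer's default tokenisation).
    return re.findall(r"\b\w\w+\b", text.lower())

def tfidf_rows(docs):
    """Return (vocabulary, rows); each row is an L2-normalised TF-IDF vector."""
    tokenized = [tokenize(d) for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Shortened, hypothetical versions of two sample articles
docs = [
    "The rover begins its search on Mars.",
    "The league is set to kick off this weekend.",
]
vocab, rows = tfidf_rows(docs)
weights = dict(zip(vocab, rows[0]))
# "the" occurs in both documents, so it is weighted lower than "rover".
print(round(weights["the"], 3), round(weights["rover"], 3))
```

Words shared by every document keep an IDF of exactly 1 under this formula, while words unique to one document receive a larger weight, which is why topic-specific terms stand out.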

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [2]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import chi2

# Sample news articles
data = {
    "article": [
        "NASA's Perseverance rover begins its search for ancient life on Mars.",
        "The Premier League is set to kick off this weekend with exciting matches.",
        "The COVID-19 vaccine rollout is speeding up as cases decline across the country.",
        "Tech companies are facing scrutiny over data privacy issues.",
        "The Oscars celebrated the best films of the year in a grand ceremony."
    ],
    "topic": [
        "Technology",
        "Sports",
        "Health",
        "Technology",
        "Entertainment"
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Vectorize the articles using TF-IDF for feature selection
tfidf_vectorizer = TfidfVectorizer()
X_tfidf = tfidf_vectorizer.fit_transform(df['article'])
y = df['topic']

# Perform Chi-Squared Test
chi2_values, p_values = chi2(X_tfidf, y)

# Create a DataFrame to hold feature names and their Chi-Squared values
feature_names = tfidf_vectorizer.get_feature_names_out()
chi2_df = pd.DataFrame({'feature': feature_names, 'chi2': chi2_values})

# Rank the features based on Chi-Squared values
chi2_df = chi2_df.sort_values(by='chi2', ascending=False)

# Display the ranked features
print("Ranked Features based on Chi-Squared Test:\n", chi2_df)

Ranked Features based on Chi-Squared Test:
          feature      chi2
42           set  1.149946
46          this  1.149946
28       matches  1.149946
31           off  1.149946
15      exciting  1.149946
36       premier  1.149946
25        league  1.149946
24          kick  1.149946
47            to  1.149946
51          with  1.149946
50       weekend  1.149946
8     celebrated  1.107841
17         films  1.107841
30            of  1.107841
20            in  1.107841
19         grand  1.107841
33        oscars  1.107841
9       ceremony  1.107841
6           best  1.107841
52          year  1.107841
49       vaccine  1.090888
48            up  1.090888
38       rollout  1.090888
43      speeding  1.090888
1         across  1.090888
0             19  1.090888
12         covid  1.090888
11       country  1.090888
4             as  1.090888
14       decline  1.090888
7          cases  1.090888
45           the  1.040237
21            is  0.678744
44          tech  0.500000
22        i
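
The chi-squared score compares, for each feature, the per-class totals actually observed with the totals expected if the feature were independent of the class. The following standard-library sketch shows that calculation on a tiny hypothetical count matrix. The function name `chi2_scores` and the toy data are assumptions for illustration; the observed/expected construction is, to my understanding, the same one `sklearn.feature_selection.chi2` uses.

```python
def chi2_scores(X, y):
    """Chi-squared score for each feature column of X (non-negative values) against labels y."""
    n_samples = len(X)
    n_features = len(X[0])
    classes = sorted(set(y))
    # Observed: sum of each feature within each class.
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value
    # Expected: class probability times the feature's overall total.
    feature_total = [sum(row[j] for row in X) for j in range(n_features)]
    class_prob = {c: sum(1 for label in y if label == c) / n_samples for c in classes}
    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            expected = class_prob[c] * feature_total[j]
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores

# Hypothetical counts: "rover" appears only in Technology, "the" appears everywhere
features = ["rover", "the"]
X = [[2, 1], [2, 1], [0, 1], [0, 1]]
y = ["Technology", "Technology", "Sports", "Sports"]
scores = chi2_scores(X, y)
ranked = sorted(zip(features, scores), key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('rover', 4.0), ('the', 0.0)]
```

With only one article for most topics, as in the sample data above, every word unique to a single article ends up with the same score, which would explain the large groups of ties in the ranked output. A larger dataset would be needed for the ranking to be informative.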

## Question 4 (10 points):
Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [3]:
import pandas as pd
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Sample news articles
data = {
    "article": [
        "NASA's Perseverance rover begins its search for ancient life on Mars.",
        "The Premier League is set to kick off this weekend with exciting matches.",
        "The COVID-19 vaccine rollout is speeding up as cases decline across the country.",
        "Tech companies are facing scrutiny over data privacy issues.",
        "The Oscars celebrated the best films of the year in a grand ceremony."
    ]
}

# Create a DataFrame
df = pd.DataFrame(data)

# Initialize BERT tokenizer and model
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

def get_embedding(text):
    # Encode the text
    inputs = tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # Use the [CLS] token representation as the sentence embedding
    return outputs.last_hidden_state[0][0].numpy()

# Get embeddings for the articles
embeddings = [get_embedding(article) for article in df['article']]

# Define a query
query = "What are the latest developments in space exploration?"
query_embedding = get_embedding(query)

# Calculate cosine similarity between the query and the article embeddings
similarities = cosine_similarity([query_embedding], embeddings)[0]

# Add similarities to the DataFrame
df['similarity'] = similarities

# Rank articles based on similarity in descending order
ranked_articles = df.sort_values(by='similarity', ascending=False)

# Display the ranked articles
print("Ranked Articles Based on Similarity to Query:\n", ranked_articles[['article', 'similarity']])


Ranked Articles Based on Similarity to Query:
                                              article  similarity
1  The Premier League is set to kick off this wee...    0.859125
2  The COVID-19 vaccine rollout is speeding up as...    0.836800
4  The Oscars celebrated the best films of the ye...    0.802346
0  NASA's Perseverance rover begins its search fo...    0.785491
3  Tech companies are facing scrutiny over data p...    0.783125
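
The ranking step itself is independent of BERT: it only needs one vector per text and a cosine similarity between the query vector and each document vector. Here is a minimal standard-library sketch of that step. The function names and the 3-dimensional toy vectors are hypothetical stand-ins for the 768-dimensional `bert-base-uncased` embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Treat a zero vector as having no similarity
    return dot / (norm_u * norm_v)

def rank_by_similarity(query_vec, doc_vecs):
    """Return (document index, similarity) pairs, most similar first."""
    scored = [(i, cosine(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical toy embeddings, not real BERT outputs
toy_docs = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
toy_query = [1.0, 0.1, 0.0]
ranking = rank_by_similarity(toy_query, toy_docs)
print([index for index, _ in ranking])
```

Cosine similarity compares direction only, so the narrow range of scores in the output above (about 0.78 to 0.86) suggests that the raw [CLS] vectors for these texts point in similar directions. Averaging the token vectors instead of taking [CLS] is one commonly tried alternative, though it is not tested here.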


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible for the questions above):

The concepts explained by Dr. Fengjiao Tu were easy to understand. One significant challenge was ensuring that the text data was preprocessed appropriately before feature extraction. Other than that, it is a good concept to dive into.