<a href="https://colab.research.google.com/github/NahidFathima/NahidF_INFO5731_Fall2023/blob/main/Syed_NF_In_class_Exercise_03.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

<b>Answer</b>:
One interesting task is to classify news articles into categories such as business, sports, politics, and entertainment. The task is challenging because news articles vary widely in content and writing style.

Here are five different types of features that might be useful for building a machine learning model for this task:

<u><b>Bag-of-words (BOW)</b></u>: This feature represents a text document as a vector of word counts, ignoring word order. BOW features are helpful because the vocabulary a document uses is a strong signal of its overall topic and sentiment.

<u><b>Part-of-speech (POS) tags</b></u>: These tags indicate the grammatical role of each word in a sentence. POS features can be helpful for understanding the meaning of a sentence and identifying important keywords.

<u><b>Named entity recognition (NER)</b></u>: This feature identifies and classifies named entities in a text document, such as people, places, organizations, and events. NER features can be helpful for understanding the context of a document and identifying specific topics that are being discussed.

<u><b>Topic modeling</b></u>: This technique can be used to identify latent topics in a collection of text documents. Topic modeling features can be helpful for capturing the overall themes of a document and distinguishing between different categories of news articles.

<u><b>Word embeddings</b></u>: These dense vectors represent the meaning of words. Word embeddings are learned from a large corpus of text data and can capture the semantic and syntactic relationships between words. Word embeddings are often used as input features for deep learning text classification models.

In addition to these features, it may also be helpful to include features that are specific to the news domain, such as the publication date, the source of the article, and the headline.

By combining these different types of features, it is possible to build a machine learning model that can accurately classify news articles into different categories.
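
To make the "vector of word counts" idea concrete, here is a minimal, dependency-free sketch of a bag-of-words representation. The `bag_of_words` helper, its tokenization rule (lowercased alphanumeric runs of two or more characters), and the two sample sentences are illustrative assumptions, not part of the exercise data.

```python
from collections import Counter
import re


def bag_of_words(docs):
    """Return (vocabulary, count_vectors) for a list of strings.

    Hypothetical helper for illustration only.
    """
    # Lowercase and keep alphanumeric runs of 2+ characters as tokens.
    tokenized = [re.findall(r"[a-z0-9]{2,}", d.lower()) for d in docs]
    # The vocabulary is the sorted set of all tokens seen in any document.
    vocab = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        # One count per vocabulary entry; absent words count as 0.
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors


# Hypothetical sample sentences.
docs = [
    "Apple announced record earnings.",
    "Record sales drove Apple earnings higher.",
]
vocab, vectors = bag_of_words(docs)
print(vocab)
print(vectors)
```

Each row has one entry per vocabulary word, so documents that share many words end up with similar vectors regardless of word order.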

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [None]:
!pip install spacy
!python -m spacy download en_core_web_sm


Collecting en-core-web-sm==3.7.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.0/en_core_web_sm-3.7.0-py3-none-any.whl (12.8 MB)
Installing collected packages: en-core-web-sm
Successfully installed en-core-web-sm-3.7.0
[+] Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [None]:
import nltk
import spacy
import gensim
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample containing news articles
news_articles = [
    "Apple Inc. announced record quarterly earnings, driven by strong sales of the iPhone 13.",
    "The United Nations held a special session to discuss climate change and its impact on global politics.",
    "Manchester United defeated Liverpool 2-1 in a thrilling football match.",
    "New Hollywood blockbuster 'Inception 2' is set to hit theaters next week.",
]

# Download all the necessary NLTK resources
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')

# Initializing spaCy
nlp = spacy.load("en_core_web_sm")

# Bag-of-Words (BoW) features
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(news_articles)

# Part-of-Speech (POS) tags
pos_tags = []
for article in news_articles:
    doc = nltk.word_tokenize(article)
    pos_tags.append(nltk.pos_tag(doc))

# Named Entity Recognition (NER) features
ner_features = []
for article in news_articles:
    doc = nlp(article)
    ner_tags = [ent.label_ for ent in doc.ents]
    ner_features.append(ner_tags)

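# Topic Modeling features (using Latent Dirichlet Allocation)
def tokenize(text):
    # Lowercase, strip accents, and drop tokens shorter than 2 characters
    return gensim.utils.simple_preprocess(text, deacc=True, min_len=2)

# Tokenize each article and build the token-to-id dictionary that LDA needs
tokenized_articles = [tokenize(article) for article in news_articles]
dictionary = gensim.corpora.Dictionary(tokenized_articles)

# Convert each tokenized article to its bag-of-words (token_id, count) form
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized_articles]

# Initialize and train the LDA model
lda_model = gensim.models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=15)

# Request every topic's probability so each article gets a fixed-length vector
topic_features = []
for bow in corpus:
    topics = lda_model.get_document_topics(bow, minimum_probability=0.0)
    topic_features.append([prob for _, prob in topics])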

# Word Embeddings (using spaCy word vectors)
word_embedding_features = []
for article in news_articles:
    doc = nlp(article)
    word_vectors = [token.vector for token in doc]
    word_embedding_features.append(word_vectors)

# Now you have extracted the mentioned features for your sample news articles.
print("Bag-of-Words (BoW) Features:\n", bow_features.toarray())
print("Part-of-Speech (POS) Tags:\n", pos_tags)
print("Named Entity Recognition (NER) Features:\n", ner_features)
print("Topic Modeling Features:\n", topic_features)
print("Word Embedding Features:\n", word_embedding_features)





[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package maxent_ne_chunker is already up-to-date!
[nltk_data] Downloading package words to
[nltk_data]     C:\Users\Nahid\AppData\Roaming\nltk_data...
[nltk_data]   Package words is already up-to-date!


Bag-of-Words (BoW) Features:
 [[1 0 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 1
  0 0 0 1 1 0 0 0 0 0]
 [0 1 0 0 0 0 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 1 1 0 0 0
  1 0 1 0 1 0 0 1 1 0]
 [0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 1 0 1 0]
 [0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0
  0 1 0 0 0 1 0 1 0 1]]
Part-of-Speech (POS) Tags:
 [[('Apple', 'NNP'), ('Inc.', 'NNP'), ('announced', 'VBD'), ('record', 'JJ'), ('quarterly', 'JJ'), ('earnings', 'NNS'), (',', ','), ('driven', 'VBN'), ('by', 'IN'), ('strong', 'JJ'), ('sales', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('iPhone', 'NN'), ('13', 'CD'), ('.', '.')], [('The', 'DT'), ('United', 'NNP'), ('Nations', 'NNPS'), ('held', 'VBD'), ('a', 'DT'), ('special', 'JJ'), ('session', 'NN'), ('to', 'TO'), ('discuss', 'VB'), ('climate', 'NN'), ('change', 'NN'), ('and', 'CC'), ('its', 'PRP$'), ('impact', 'NN'), ('on', 'IN'), ('global', 'JJ'), ('p

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Sample news articles
news_articles = [
    "Apple Inc. announced record quarterly earnings, driven by strong sales of the iPhone 13.",
    "The United Nations held a special session to discuss climate change and its impact on global politics.",
    "Manchester United defeated Liverpool 2-1 in a thrilling football match.",
    "New Hollywood blockbuster 'Inception 2' is set to hit theaters next week.",
]

# Query
query = "Effects of climate change on the environment"

# Fit TF-IDF on the articles, then project the query into the same vocabulary
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(news_articles)
query_vector = vectorizer.transform([query])

# Calculate cosine similarity between the query and news articles
similarities = cosine_similarity(query_vector, X)

# Rank articles by similarity in descending order
ranking = similarities.argsort()[0][::-1]

# Display the ranked articles
for rank, article_index in enumerate(ranking):
    similarity_score = similarities[0][article_index]
    article = news_articles[article_index]
    print(f"Rank {rank + 1}: Similarity = {similarity_score:.4f}\n{article}\n{'=' * 50}\n")


Rank 1: Similarity = 0.4369
The United Nations held a special session to discuss climate change and its impact on global politics.

Rank 2: Similarity = 0.2044
Apple Inc. announced record quarterly earnings, driven by strong sales of the iPhone 13.

Rank 3: Similarity = 0.0000
New Hollywood blockbuster 'Inception 2' is set to hit theaters next week.

Rank 4: Similarity = 0.0000
Manchester United defeated Liverpool 2-1 in a thrilling football match.
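
The cell above ranks articles against a query; ranking the features themselves needs a per-term score. Below is a minimal sketch of one filter-style approach, the chi-square statistic, which is widely used for text feature selection (that it matches the method intended from the cited review is an assumption). For a term and a class, with `a`, `b`, `c`, `d` the cells of the 2x2 presence/class contingency table over `n` documents, the score is `n * (a*d - b*c)**2 / ((a+c) * (b+d) * (a+b) * (c+d))`. The `chi_square_scores` helper, the four short documents, and their labels are hypothetical and chosen only so the result is easy to check by hand.

```python
import re


def chi_square_scores(docs, labels):
    """Rank terms by their maximum chi-square statistic over all classes.

    Hypothetical helper for illustration only.
    """
    # Use term presence (not counts): each document becomes a set of tokens.
    tokenized = [set(re.findall(r"[a-z0-9]{2,}", d.lower())) for d in docs]
    vocab = sorted(set().union(*tokenized))
    n = len(docs)
    scores = {}
    for term in vocab:
        best = 0.0
        for cls in set(labels):
            pairs = list(zip(tokenized, labels))
            a = sum(1 for toks, y in pairs if term in toks and y == cls)
            b = sum(1 for toks, y in pairs if term in toks and y != cls)
            c = sum(1 for toks, y in pairs if term not in toks and y == cls)
            d = n - a - b - c
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:  # skip degenerate tables to avoid dividing by zero
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical labelled documents (not the exercise data).
docs = [
    "Apple reported strong quarterly earnings",
    "The bank raised its earnings forecast",
    "Liverpool won the football match",
    "The football team lost the final match",
]
labels = ["business", "business", "sports", "sports"]

ranked = chi_square_scores(docs, labels)
for term, score in ranked[:5]:
    print(f"{term}: {score:.3f}")
```

Terms that appear in every document of one class and in none of the other, such as "earnings" or "football" here, receive the highest score, while a term like "the" that is spread across classes scores low.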



Question 4 (10 points): Write Python code to rank the text based on text similarity. Based on the text data you used for question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [None]:
pip install transformers

Collecting transformers
  Downloading transformers-4.34.0-py3-none-any.whl (7.7 MB)
Collecting huggingface-hub<1.0,>=0.16.4
  Downloading huggingface_hub-0.17.3-py3-none-any.whl (295 kB)
Collecting tokenizers<0.15,>=0.14
  Downloading tokenizers-0.14.1-cp39-none-win_amd64.whl (2.2 MB)
Collecting safetensors>=0.3.1
  Downloading safetensors-0.4.0-cp39-none-win_amd64.whl (277 kB)
Installing collected packages: huggingface-hub, tokenizers, safetensors, transformers
Successfully installed huggingface-hub-0.17.3 safetensors-0.4.0 tokenizers-0.14.1 transformers-4.34.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
pip install torch

Collecting torch
  Downloading torch-2.1.0-cp39-cp39-win_amd64.whl (192.2 MB)
Installing collected packages: torch
Successfully installed torch-2.1.0
Note: you may need to restart the kernel to use updated packages.


In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from transformers import BertTokenizer, BertModel
import torch

# Sample news articles
news_articles = [
    "Apple Inc. announced record quarterly earnings, driven by strong sales of the iPhone 13.",
    "The United Nations held a special session to discuss climate change and its impact on global politics.",
    "Manchester United defeated Liverpool 2-1 in a thrilling football match.",
    "New Hollywood blockbuster 'Inception 2' is set to hit theaters next week.",
]

# Sample query
#query = "Apple's quarterly earnings and iPhone sales"
query = "Effects of climate change on the environment"


# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertModel.from_pretrained(model_name)

# Tokenize and encode the query
query_tokens = tokenizer.encode(query, add_special_tokens=True)
query_tokens = torch.tensor(query_tokens).unsqueeze(0)  # Add batch dimension

# Compute BERT embeddings for the query
with torch.no_grad():
    query_outputs = model(query_tokens)
    query_embeddings = query_outputs.last_hidden_state.mean(dim=1).numpy()

# Compute BERT embeddings for each news article
article_embeddings = []
for article in news_articles:
    article_tokens = tokenizer.encode(article, add_special_tokens=True)
    article_tokens = torch.tensor(article_tokens).unsqueeze(0)  # Add batch dimension
    with torch.no_grad():
        article_outputs = model(article_tokens)
        article_embedding = article_outputs.last_hidden_state.mean(dim=1).numpy()
    article_embeddings.append(article_embedding)

# Convert embeddings to NumPy arrays
query_embeddings = query_embeddings.reshape(1, -1)  # Reshape to 2D
article_embeddings = np.array(article_embeddings).squeeze()  # Remove batch dimension

# Calculate cosine similarities between the query and each article
similarities = cosine_similarity(query_embeddings, article_embeddings)

# Rank articles by similarity in descending order
ranking = np.argsort(similarities[0])[::-1]

# Print ranked articles and their similarities
for i, idx in enumerate(ranking):
    print(f"Rank {i+1}: Similarity = {similarities[0][idx]:.4f}")
    print(news_articles[idx])
    print("=" * 50)


Rank 1: Similarity = 0.7538
The United Nations held a special session to discuss climate change and its impact on global politics.
Rank 2: Similarity = 0.5430
New Hollywood blockbuster 'Inception 2' is set to hit theaters next week.
Rank 3: Similarity = 0.5168
Manchester United defeated Liverpool 2-1 in a thrilling football match.
Rank 4: Similarity = 0.4685
Apple Inc. announced record quarterly earnings, driven by strong sales of the iPhone 13.
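
Both ranking cells depend on cosine similarity, which for two vectors `u` and `v` is their dot product divided by the product of their lengths. The sketch below spells out that calculation and the descending sort without any libraries. The `cosine` and `rank_by_cosine` helpers and the three-dimensional toy vectors are hypothetical stand-ins for the TF-IDF or BERT vectors used in the cells.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_cosine(query_vec, doc_vecs):
    """Return (doc_index, similarity) pairs sorted by similarity, highest first."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy vectors standing in for document embeddings.
query_vec = [1.0, 0.0, 1.0]
doc_vecs = [
    [1.0, 0.0, 1.0],  # same direction as the query
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 1.0, 0.0],  # partially overlapping
]

ranking = rank_by_cosine(query_vec, doc_vecs)
for rank, (idx, sim) in enumerate(ranking, start=1):
    print(f"Rank {rank}: doc {idx}, similarity = {sim:.4f}")
```

Because cosine similarity depends only on direction, a long article and a short query can still score highly as long as their vectors point the same way.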
