## The third in-class exercise (due at 11:59 PM on 10/08/2023, 40 points in total)

The purpose of this exercise is to understand text representation.

Question 1 (10 points): Describe an interesting text classification or text mining task and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features.

In [None]:
# Your answer here (no code for this question; write your answer in as much detail as possible):

'''
Please write your answer here:

Distinguishing fake news from real news is an intriguing text classification task. In an age of information overload and
misinformation campaigns, spotting fake news is essential. Six kinds of features that can be used to build a machine
learning model for this task are listed below:

Text Content Features:
  •	TF-IDF (Term Frequency-Inverse Document Frequency)
  •	Word Embeddings
Lexical Features:
  •	Punctuation Usage
  •	Capitalization
Syntactic Features:
  •	Part-of-Speech (POS) Tags
  •	Sentence Length
Semantic Features:
  •	Named Entity Recognition (NER)
  •	Topic Modeling
Sentiment Analysis Features:
  •	Sentiment Polarity
  •	Emotional Tone
Source and Context Features:
  •	Source Reliability
  •	Publication Date

By using these features in a machine learning model, we can build a classifier that distinguishes fake news articles from
real ones. Training and testing such a model properly requires a large dataset with labeled examples of both fake and
real news articles.






'''

Question 2 (10 points): Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [1]:
# Your code here (please add comments in the code):

import pandas as pd  # tabular display of the extracted features
import spacy  # part-of-speech tagging
import nltk  # frequency counts and the VADER sentiment lexicon
from nltk.sentiment import SentimentIntensityAnalyzer
from sklearn.feature_extraction.text import TfidfVectorizer

# Sample fake and real news headlines
sample_data = [
    "Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",
]

# Initialize spaCy and NLTK
nlp = spacy.load("en_core_web_sm")
nltk.download("vader_lexicon")
sia = SentimentIntensityAnalyzer()

# Text Content Features - TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(sample_data)
tfidf_df = pd.DataFrame(tfidf_features.toarray(), columns=tfidf_vectorizer.get_feature_names_out())
print("TF-IDF Features:\n", tfidf_df)

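# Lexical Features
import string  # standard library; provides string.punctuation

lexical_features = []

for text in sample_data:
    # Count every ASCII punctuation mark, not only '!'
    punctuation_count = sum(1 for char in text if char in string.punctuation)
    # Share of characters in the headline that are uppercase
    capitalization_ratio = sum(1 for char in text if char.isupper()) / len(text)
    lexical_features.append([punctuation_count, capitalization_ratio])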

lexical_df = pd.DataFrame(lexical_features, columns=["Punctuation Count", "Capitalization Ratio"])
print("\nLexical Features:\n", lexical_df)

# Syntactic Features - POS Tags
pos_tags = []

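for text in sample_data:
    doc = nlp(text)
    # Count how often each coarse-grained POS tag occurs in the headline
    pos_counts = nltk.FreqDist(token.pos_ for token in doc)
    pos_tags.append(pos_counts)

# Tags that do not occur in a headline come out as NaN; treat them as zero counts
pos_df = pd.DataFrame(pos_tags).fillna(0).astype(int)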
print("\nPOS Tag Features:\n", pos_df)

# Sentiment Analysis Features - Sentiment Polarity (VADER)
sentiment_polarity = []

for text in sample_data:
    sentiment_scores = sia.polarity_scores(text)
    sentiment_polarity.append(sentiment_scores)

sentiment_df = pd.DataFrame(sentiment_polarity)
print("\nSentiment Polarity Features:\n", sentiment_df)



TF-IDF Features:
       alien  billionaire  breaking   disease  effective      elon  exclusive  \
0  0.421765     0.000000  0.421765  0.000000   0.000000  0.000000   0.000000   
1  0.000000     0.000000  0.000000  0.377964   0.377964  0.000000   0.000000   
2  0.000000     0.316228  0.000000  0.000000   0.000000  0.316228   0.000000   
3  0.000000     0.000000  0.000000  0.000000   0.000000  0.000000   0.421765   

      files     finds      fund  ...      mars   mission      musk  \
0  0.000000  0.000000  0.000000  ...  0.000000  0.000000  0.000000   
1  0.000000  0.377964  0.000000  ...  0.000000  0.000000  0.000000   
2  0.000000  0.000000  0.316228  ...  0.316228  0.316228  0.316228   
3  0.421765  0.000000  0.000000  ...  0.000000  0.000000  0.000000   

   preventing    secret     study        to       top  vaccines     warns  
0    0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.421765  
1    0.377964  0.000000  0.377964  0.000000  0.000000  0.377964  0.000000  
2 
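The TF-IDF values printed above can be reproduced by hand, which helps show what the vectorizer is doing. The sketch below assumes the default `TfidfVectorizer` settings (lowercasing, tokens of two or more word characters, smoothed IDF of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row). The helper names `tokenize` and `tfidf` are illustrative and not part of the original notebook.

```python
import math
import re
from collections import Counter

# Same headlines as in the feature-extraction cell
sample_data = [
    "Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",
]


def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())


def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2-normalized)."""
    term_counts = [Counter(tokenize(doc)) for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs
    doc_freq = Counter(term for counts in term_counts for term in counts)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for counts in term_counts:
        weights = {t: tf * idf[t] for t, tf in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()})
    return rows


rows = tfidf(sample_data)
print(f"{rows[0]['alien']:.6f}")  # 0.421765
print(f"{rows[1]['study']:.6f}")  # 0.377964
print(f"{rows[2]['mars']:.6f}")   # 0.316228
```

"government" is the only term shared by two headlines, so it gets a lower IDF than the other terms in rows 0 and 3. "to" occurs twice in row 2, so its weight there is twice that of "mars".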

Question 3 (10 points): Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [None]:
# Your code here (please add comments in the code):
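One possible approach, offered as a minimal sketch rather than a definitive answer, is the chi-square (CHI) statistic, a filter method commonly covered in surveys of text feature selection. That it matches the exact formulation in the cited review is an assumption. The labels below are hypothetical and invented only so that the statistic can be computed. The helper names `tokenize` and `chi_square_scores` are likewise illustrative.

```python
import re

# Headlines from the feature-extraction cell
sample_data = [
    "Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",
]

# HYPOTHETICAL labels, invented for illustration only (1 = fake, 0 = real)
labels = [1, 0, 0, 1]


def tokenize(text):
    """Return the set of lowercase tokens with two or more word characters."""
    return set(re.findall(r"\b\w\w+\b", text.lower()))


def chi_square_scores(docs, labels):
    """Score each term with the 2x2 chi-square statistic against the class label."""
    doc_terms = [tokenize(doc) for doc in docs]
    vocabulary = sorted(set().union(*doc_terms))
    n = len(docs)
    scores = {}
    for term in vocabulary:
        pairs = list(zip(doc_terms, labels))
        a = sum(1 for terms, y in pairs if term in terms and y == 1)      # term present, fake
        b = sum(1 for terms, y in pairs if term in terms and y == 0)      # term present, real
        c = sum(1 for terms, y in pairs if term not in terms and y == 1)  # term absent, fake
        d = n - a - b - c                                                 # term absent, real
        denominator = (a + c) * (b + d) * (a + b) * (c + d)
        scores[term] = n * (a * d - b * c) ** 2 / denominator if denominator else 0.0
    return scores


scores = chi_square_scores(sample_data, labels)
# Rank by score in descending order; ties are broken alphabetically
ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
for term, score in ranked[:5]:
    print(f"{term:12s} {score:.3f}")
```

With these hypothetical labels, "government" ranks first because it appears in both headlines labeled fake and in neither headline labeled real. With only four documents the scores are purely illustrative.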





Question 4 (10 points): Write Python code to rank the texts by similarity. Based on the text data you used for Question 2, design a query to match the most relevant documents. Please use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [3]:
# Your code here (please add comments in the code):





In [4]:
%pip install transformers torch numpy


In [5]:
import torch
from transformers import AutoTokenizer, AutoModel
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Sample text data
sample_data = [
    "Breaking: Alien Invasion Imminent, Government Warns",
    "Study Finds Vaccines Effective in Preventing Disease",
    "Billionaire Elon Musk to Fund Mission to Mars",
    "Exclusive: Top-Secret Government Files Leaked",
]

# Query
query = "Alien Invasion Warning"

# Load pre-trained BERT model and tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

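# Encode a single text into one BERT embedding by mean-pooling its token vectors
def get_bert_embeddings(text):
    # Tokenize one string; truncation caps the input at the model's maximum length
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():  # inference only, so no gradients are needed
        outputs = model(**inputs)
    # last_hidden_state has shape (1, seq_len, hidden_size); average over the tokens
    embeddings = outputs.last_hidden_state.mean(dim=1).squeeze(0).numpy()
    return embeddings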

query_embedding = get_bert_embeddings(query)
sample_embeddings = [get_bert_embeddings(text) for text in sample_data]

# Calculate cosine similarity between the query and each sample
similarities = cosine_similarity([query_embedding], sample_embeddings)[0]

# Rank the similarities in descending order
ranking = sorted(enumerate(similarities), key=lambda x: x[1], reverse=True)

# Print ranked results
print("Ranked Documents based on Similarity to Query:")
for rank, (index, similarity) in enumerate(ranking):
    print(f"Rank {rank + 1}: Similarity = {similarity:.4f}")
    print(f"Text: {sample_data[index]}\n")


Ranked Documents based on Similarity to Query:
Rank 1: Similarity = 0.7753
Text: Breaking: Alien Invasion Imminent, Government Warns

Rank 2: Similarity = 0.6014
Text: Exclusive: Top-Secret Government Files Leaked

Rank 3: Similarity = 0.5849
Text: Study Finds Vaccines Effective in Preventing Disease

Rank 4: Similarity = 0.5583
Text: Billionaire Elon Musk to Fund Mission to Mars
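The ranking above relies on cosine similarity, which compares only the direction of two vectors and ignores their length. As a minimal sketch, the same computation can be written with the standard library. The three-dimensional vectors and document names here are hypothetical stand-ins for the 768-dimensional BERT embeddings, chosen so the arithmetic is easy to follow.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# HYPOTHETICAL toy vectors, not real BERT embeddings
query_vec = [1.0, 0.0, 1.0]
doc_vecs = {
    "doc_a": [1.0, 0.0, 1.0],  # same direction as the query
    "doc_b": [1.0, 1.0, 0.0],  # partly overlapping
    "doc_c": [0.0, 1.0, 0.0],  # orthogonal to the query
}

# Rank the documents by similarity to the query in descending order
ranking = sorted(
    ((name, cosine(query_vec, vec)) for name, vec in doc_vecs.items()),
    key=lambda pair: pair[1],
    reverse=True,
)
for rank, (name, similarity) in enumerate(ranking, start=1):
    print(f"Rank {rank}: {name} similarity = {similarity:.4f}")
```

In the toy example the scores work out to 1.0, 0.5, and 0.0.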

