# **INFO5731 In-class Exercise 3**

The purpose of this exercise is to explore various aspects of text analysis, including feature extraction, feature selection, and text similarity ranking.

**Expectations**:
*   Use the provided .*ipynb* document to write your code & respond to the questions. Avoid generating a new file.
*   Write complete answers and run all the cells before submission.
*   Make sure the submission is "clean"; *i.e.*, no unnecessary code cells.
*   Once finished, allow shared rights from top right corner (*see Canvas for details*).

**Total points**: 40

**Deadline**: This in-class exercise is due at the end of Friday, at 11:59 PM.

**Late submissions incur a penalty of 10% of the marks for each day late, and no requests will be answered. Manage your time accordingly.**

**Please check that the link you submitted can be opened and points to the correct assignment.**

## Question 1 (10 Points)
Describe an interesting **text classification or text mining task** and explain what kind of features might be useful for you to build the machine learning model. List your features and explain why these features might be helpful. You need to list at least five different types of features. **Your dataset must be text.**

In [6]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
Identifying Satirical News Articles:
Satirical news pieces frequently imitate official news while adding ludicrous or inflated details to mock or comment on current affairs. The task is to build a model that automatically distinguishes satirical news stories from non-satirical ones.
a) Sentiment Analysis: This feature captures the article's emotional tone, whether neutral, positive, or negative. Satirical articles frequently show an extreme positive or negative attitude, often with a sardonic or exaggerated tone, that is inconsistent with the serious tone of regular news.
b) Lexical Diversity: Measures the number of unique words used in the article relative to its total word count. While non-satirical news typically sticks to official, standardized terminology, satirical pieces may use more imaginative or exaggerated language.
c) Word/Term Frequency: This feature tracks how often specific keywords or phrases, both common and uncommon, appear. Terms that are humorous, ironic, or ridiculous (such as "aliens" or "unicorns") may appear in satirical pieces and would be less common in non-satirical news.
d) N-grams (bi-grams and tri-grams): N-grams are contiguous sequences of n words in a text. Satirical news frequently uses clever, ironic, or hilarious word combinations (e.g., "world's greatest," "according to sources"). These patterns are not apparent from single words, but can be identified by analyzing bi-grams and tri-grams.
e) Named Entity Recognition (NER): This feature identifies proper nouns, such as names of individuals, locations, and businesses. Non-satirical stories usually report on actual events, whereas satirical articles frequently exaggerate real-life characters or events (e.g., turning a political figure into a superhero). Satire can be identified by variations in the usage of named entities.
f) Subjectivity Score: This quantifies the extent to which a text is objective or subjective. Non-satirical news should be more objective and accurate, whereas satirical news frequently includes opinions, feelings, and fictitious components, making it highly subjective.

Each of these characteristics draws attention to the language or topical distinctions between authentic news and satire. Sentiment analysis and subjectivity capture the tone and perspective, lexical diversity and term frequency highlight the language style, and NER and n-grams reveal the structure of the content. These characteristics work together to highlight the minute distinctions in the ways satire manipulates reality for humorous effect.

'''


## Question 2 (10 Points)
Write Python code to extract the features you discussed above. You can collect a few sample texts for the feature extraction.

In [5]:
# Your code here (Please add comments in the code):
import nltk
import spacy
from textblob import TextBlob
from sklearn.feature_extraction.text import CountVectorizer
from collections import Counter
from nltk.util import ngrams
import en_core_web_sm

# Downloading necessary resources
nltk.download('punkt')
nltk.download('vader_lexicon')

# Load spaCy model for NER
nlp = en_core_web_sm.load()

# Sample text data: Satirical and Non-satirical
texts = [
    "Scientists Announce Breakthrough Discovery: The Internet Was Actually a Social Experiment Gone Wrong!",
    "The government announces a new economic plan for improving healthcare infrastructure.",
    "World Leaders Agree to Settle All Future Disputes with Rock-Paper-Scissors Championship",
    "Local authorities warn residents about rising flood risks in coastal cities."
]

# Function to extract features
def extract_features(text):
    # Initialize a dictionary to hold the features
    features = {}

    # 1. Sentiment Analysis using TextBlob
    sentiment = TextBlob(text).sentiment
    features['polarity'] = sentiment.polarity  # -1.0 (negative) to 1.0 (positive)
    features['subjectivity'] = sentiment.subjectivity  # 0.0 (objective) to 1.0 (subjective)

    # 2. Lexical Diversity
    tokens = nltk.word_tokenize(text.lower())
    unique_tokens = set(tokens)
    lexical_diversity = len(unique_tokens) / len(tokens) if len(tokens) > 0 else 0
    features['lexical_diversity'] = lexical_diversity

    # 3. Word/Term Frequency using CountVectorizer
    vectorizer = CountVectorizer()
    word_counts = vectorizer.fit_transform([text])
    word_freq = dict(zip(vectorizer.get_feature_names_out(), word_counts.toarray()[0]))
    features['word_frequency'] = word_freq

    # 4. N-grams (bi-grams and tri-grams)
    bi_grams = list(ngrams(tokens, 2))
    tri_grams = list(ngrams(tokens, 3))
    features['bi_grams'] = bi_grams
    features['tri_grams'] = tri_grams

    # 5. Named Entity Recognition using spaCy
    doc = nlp(text)
    named_entities = [(ent.text, ent.label_) for ent in doc.ents]
    features['named_entities'] = named_entities

    return features

# Extract features for each text
for i, text in enumerate(texts):
    print(f"Features for text {i+1}:")
    features = extract_features(text)
    for key, value in features.items():
        print(f"{key}: {value}")
    print("\n")

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package vader_lexicon to /root/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


Features for text 1:
polarity: -0.29583333333333334
subjectivity: 0.48333333333333334
lexical_diversity: 1.0
word_frequency: {'actually': 1, 'announce': 1, 'breakthrough': 1, 'discovery': 1, 'experiment': 1, 'gone': 1, 'internet': 1, 'scientists': 1, 'social': 1, 'the': 1, 'was': 1, 'wrong': 1}
bi_grams: [('scientists', 'announce'), ('announce', 'breakthrough'), ('breakthrough', 'discovery'), ('discovery', ':'), (':', 'the'), ('the', 'internet'), ('internet', 'was'), ('was', 'actually'), ('actually', 'a'), ('a', 'social'), ('social', 'experiment'), ('experiment', 'gone'), ('gone', 'wrong'), ('wrong', '!')]
tri_grams: [('scientists', 'announce', 'breakthrough'), ('announce', 'breakthrough', 'discovery'), ('breakthrough', 'discovery', ':'), ('discovery', ':', 'the'), (':', 'the', 'internet'), ('the', 'internet', 'was'), ('internet', 'was', 'actually'), ('was', 'actually', 'a'), ('actually', 'a', 'social'), ('a', 'social', 'experiment'), ('social', 'experiment', 'gone'), ('experiment', 'g

## Question 3 (10 points):
Use any of the feature selection methods mentioned in this paper "Deng, X., Li, Y., Weng, J., & Zhang, J. (2019). Feature selection for text classification: A review. Multimedia Tools & Applications, 78(3)."

Select the most important features you extracted above and rank them by importance in descending order.

In [7]:
# Your code here (Please add comments in the code):
'''
According to Deng, X., Li, Y., Weng, J., & Zhang, J. (2019), feature selection techniques for text classification include methods such as:
The Chi-Square Test (χ²) measures the deviation between the expected and observed frequencies of terms, which indicates which terms are more likely to occur in a given class.
Information Gain (IG) measures the amount of information a feature adds to the target class prediction.
Mutual Information (MI) measures the extent to which knowing a feature's value reduces uncertainty about the class.
Term Frequency-Inverse Document Frequency (TF-IDF) is a weighting method that assigns greater weight to terms that are frequent in a document but rare across the corpus.

I plan to use Information Gain (IG) as my feature selection strategy in order to rank the extracted features by importance. IG gauges how much a feature contributes to reducing uncertainty when identifying satirical or non-satirical articles. The features can be ranked conceptually according to the degree to which they are likely to help differentiate satirical texts from non-satirical ones.

Subjectivity Score: While non-satirical news is more objective, satirical articles are typically very subjective.
Rank 1: Offers the most distinct line between factual reporting and a fictional or satirical tone.

Sentiment Analysis: While regular news is typically neutral, satirical news frequently conveys strong sentiment.
Rank 2: Able to recognize sarcastic or exaggerated tones that are frequently found in satire.

Named Entity Recognition (NER): Non-satirical news concentrates on real-world entities, whereas satirical pieces may feature made-up characters or exaggerated names.
Rank 3: Assists in determining if the individuals, places, or events in the article are real or have been exaggerated or faked.

Lexical Diversity: Compared to official news pieces, satirical articles may employ a more inventive and varied vocabulary.
Rank 4: Assists in setting satirical creative writing apart from news language.

N-grams (Bi- and Tri-grams): Due to irony or comedy, some word combinations or phrases may only be found in satire.
Rank 5: Captures the stylistic distinctions between satire and non-satire.

Word/Term Frequency: The terms used in satirical writing may differ from those used in non-satirical writing.
Rank 6: Significant but sometimes distracting, particularly for widely used terms.

By focusing on high-ranking features like subjectivity and Sentiment analysis, we can likely improve the performance of a model in classifying satirical vs. non-satirical texts.

'''
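The ranking above is conceptual rather than computed. As a minimal sketch of how Information Gain could be calculated in practice, the block below scores binary term-presence features on the four sample headlines from Question 2. The class labels (1 = satirical, 0 = non-satirical) are assigned by hand here and are an assumption, as is the simple regex tokenizer; with only four documents the scores are illustrative, not evidence for any particular ranking.

```python
import math
import re
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(class) - sum over values v of P(v) * H(class | feature = v)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [lab for fv, lab in zip(feature_values, labels) if fv == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


# Sample headlines from Question 2.
texts = [
    "Scientists Announce Breakthrough Discovery: The Internet Was Actually a Social Experiment Gone Wrong!",
    "The government announces a new economic plan for improving healthcare infrastructure.",
    "World Leaders Agree to Settle All Future Disputes with Rock-Paper-Scissors Championship",
    "Local authorities warn residents about rising flood risks in coastal cities.",
]
# Assumption: hand-assigned labels, 1 = satirical, 0 = non-satirical.
labels = [1, 0, 1, 0]

# Binary feature per term: is the term present in the headline?
token_sets = [set(re.findall(r"[a-z]+", t.lower())) for t in texts]
vocabulary = sorted(set().union(*token_sets))

ig_scores = {
    term: information_gain([term in tokens for tokens in token_sets], labels)
    for term in vocabulary
}

# Rank terms by IG in descending order (ties broken alphabetically).
ranked_terms = sorted(ig_scores.items(), key=lambda item: (-item[1], item[0]))

for term, score in ranked_terms[:5]:
    print(f"{term}: {score:.4f}")
```

The same `information_gain` function could be applied to the document-level features (for example, a subjectivity score binned into "high" and "low") once a larger labeled dataset is available; a term that occurs equally often in both classes, such as "the" here, receives an IG of zero.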



## Question 4 (10 points):
Write Python code to rank the texts by similarity. Using the text data from Question 2, design a query to match the most relevant documents. Use the BERT model to represent both your query and the text data, then calculate the cosine similarity between the query and each text in your data. Rank the similarities in descending order.

In [9]:
import torch
from transformers import BertTokenizer, BertModel
from sklearn.metrics.pairwise import cosine_similarity

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample texts (from Question 2)
texts = [
    "Scientists Announce Breakthrough Discovery: The Internet Was Actually a Social Experiment Gone Wrong!",
    "The government announces a new economic plan for improving healthcare infrastructure.",
    "World Leaders Agree to Settle All Future Disputes with Rock-Paper-Scissors Championship",
    "Local authorities warn residents about rising flood risks in coastal cities."
]

# Query (the document to match with)
query = "Internet and social experiments gone wrong"

# Function to encode the text using BERT and get the mean of token embeddings
def encode_text(text):
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    embeddings = outputs.last_hidden_state
    # Get the mean of the token embeddings to represent the sentence
    sentence_embedding = torch.mean(embeddings, dim=1)
    return sentence_embedding

# Encode the query and the text data
query_embedding = encode_text(query)
text_embeddings = [encode_text(text) for text in texts]

# Calculate cosine similarity between the query and each text
# (tensors are converted to NumPy arrays, the input type scikit-learn expects)
similarity_scores = [
    cosine_similarity(query_embedding.numpy(), text_embedding.numpy())[0][0]
    for text_embedding in text_embeddings
]

# Rank the texts by similarity score in descending order
ranked_texts = sorted(zip(similarity_scores, texts), key=lambda x: x[0], reverse=True)

# Print the ranked texts and their similarity scores
for score, text in ranked_texts:
    print(f"Similarity Score: {score:.4f} | Text: {text}")



Similarity Score: 0.7883 | Text: Scientists Announce Breakthrough Discovery: The Internet Was Actually a Social Experiment Gone Wrong!
Similarity Score: 0.6096 | Text: The government announces a new economic plan for improving healthcare infrastructure.
Similarity Score: 0.5990 | Text: World Leaders Agree to Settle All Future Disputes with Rock-Paper-Scissors Championship
Similarity Score: 0.5655 | Text: Local authorities warn residents about rising flood risks in coastal cities.
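The code above encodes one text at a time, so no padding tokens enter the mean. If the texts were encoded as a batch instead, padded positions would be averaged in unless they are masked out. The following is a dependency-free sketch of mask-aware mean pooling followed by cosine ranking. The 2-dimensional vectors and the names `doc_a`, `doc_b`, and `doc_c` are hypothetical toy values standing in for real 768-dimensional BERT embeddings.

```python
import math


def masked_mean(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1 (real tokens)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, doc_vecs):
    """Return (score, name) pairs sorted by similarity, highest first."""
    scored = [(cosine(query_vec, vec), name) for name, vec in doc_vecs.items()]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Hypothetical toy "token embeddings"; the last row stands in for a padding position.
tokens = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
pooled = masked_mean(tokens, mask)  # the padded row is ignored

# Hypothetical document vectors to rank against the pooled query vector.
docs = {"doc_a": [1.0, 1.0], "doc_b": [1.0, 0.0], "doc_c": [-1.0, -1.0]}
ranking = rank_by_similarity(pooled, docs)

for score, name in ranking:
    print(f"Similarity Score: {score:.4f} | Text: {name}")
```

With real model outputs, the same idea would use the `attention_mask` returned by the tokenizer to weight `last_hidden_state` before averaging.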


# Mandatory Question

**Important: Reflective Feedback on this exercise**

Please provide your thoughts and feedback on the exercises you completed in this assignment. Consider the following points in your response:

Learning Experience: Describe your overall learning experience in working on extracting features from text data. What were the key concepts or techniques you found most beneficial in understanding the process?

Challenges Encountered: Were there specific difficulties in completing this exercise?

Relevance to Your Field of Study: How does this exercise relate to the field of NLP?

**(Your submission will not be graded if this question is left unanswered)**



In [10]:
# Your answer here (no code for this question; write your answer to the above questions in as much detail as possible):

'''
Learning Experience: I found working on feature extraction from text data to be incredibly enlightening. I discovered how crucial it is to recognize the differences between various texts. Through the use of tools such as TextBlob for sentiment analysis and spaCy for named entity identification, I was able to get practical expertise in spotting patterns that I otherwise would not have observed. The ability of characteristics like subjectivity and sentiment analysis to disclose concealed tones in a text, particularly for differentiating between humor and serious content, was what most impressed me. I also discovered that lexical diversity reveals the formality or creativity of a piece of writing.

Challenges Encountered: Determining which elements would be most crucial for text classification was one of the most difficult tasks. Selecting the ideal characteristics isn't always easy because there are so many options. For example, it was difficult to determine which word combinations were most important when extracting bi- and tri-grams. Additionally, at first, utilizing a pre-trained model for text similarity—such as BERT—was a little complicated because I had to learn how to apply cosine similarity and comprehend how embeddings function.

Relevance to my Field of Study: Natural Language Processing (NLP), which is a significant component of cybersecurity and data science, is closely related to this exercise. In data science, text feature extraction is essential to improving predictions, while in cybersecurity, pattern recognition in text can aid in identifying phishing emails or other fraudulent communications. Because it increases the accuracy of text-based machine learning models, which is extremely relevant in both domains, learning how to extract and rank features is crucial.

'''
