Student Details:
  - **Name:** A V Divakara


# News Article Text Analysis

Let's get started by installing the necessary modules for this analysis.

In [21]:
!pip3 install nltk spacy pandas tqdm plotly huggingface_hub transformers bertopic flair python-dotenv --upgrade
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
!pip install -q -U google-generativeai
!pip install cuml-cu11 --extra-index-url=https://pypi.nvidia.com

Collecting spacy
  Downloading spacy-3.8.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (27 kB)
Collecting pandas
  Downloading pandas-2.2.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (89 kB)
Collecting bertopic
  Downloading bertopic-0.16.4-py3-none-any.whl.metadata (23 kB)
Collecting flair
  Downloading flair-0.15.0-py3-none-any.whl.metadata (12 kB)
Collecting python-dotenv
  Downloading python_dotenv-1.0.1-py3-none-any.whl.metadata (23 kB)
Collecting thinc<8.4.0,>=8.3.0 (from spacy)
  Downloading thinc-8.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting hdbscan>=0.8.29 (from bertopic)
  Downloading hdbscan-0.8.40-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (15 kB)
Collecting umap-learn>=0.5.0 (from bertopic)
  Downloading umap_learn-0.5.7-py3-none-any.wh

In [1]:
# Import the libraries
import pandas as pd
import numpy as np

In [None]:
# Import the csv file with the raw data
df = pd.read_csv('./articles_raw.csv')

### Inspection of data

In [None]:
# Columns of the raw dataset
df.columns

Index(['title', 'authors', 'published_date', 'source', 'url', 'content',
       'tags', 'images'],
      dtype='object')

In [None]:
# Data types of the raw dataset
df.dtypes

Unnamed: 0,0
title,object
authors,object
published_date,object
source,object
url,object
content,object
tags,object
images,object


In [None]:
# First 5 rows of the dataset
df.head()

Unnamed: 0,title,authors,published_date,source,url,content,tags,images
0,Datadog challenger Dash0 aims to dash observab...,"Anna Heim ,Devin Coldewey ,Marina Temkin ,Maxw...",2024-11-04T00:00:00.000000,TechCrunch,https://techcrunch.com/2024/11/04/datadog-chal...,The end of zero-interest rates has driven comp...,"Fundraising ,Startups ,observability ,cloud co...",https://techcrunch.com/wp-content/uploads/2024...
1,Karen tries to claim ownership of the place sh...,,,FAIL Blog,https://cheezburger.com/37613061/karen-tries-t...,Scroll down for the next article,"entitled parents ,housing ,Random Memes ,Geek ...",https://i.chzbgr.com/full/10424000000/hAC1368F...
2,Elon Musk's ex Grimes takes swipe at Tesla bil...,"Eve Buckland ,Eve Buckland For Dailymail.Com ,...",2024-11-04T21:53:57.000000+0000,Daily Mail,https://www.dailymail.co.uk/tvshowbiz/article-...,Elon Musk's ex Grimes took a savage swipe at t...,,https://i.dailymail.co.uk/1s/2024/11/03/05/916...
3,Kansas City Chiefs find success in bringing ba...,"Adam Teicher ,Jenna Laine ,Todd Archer ,Kather...",,ESPN,https://www.espn.com/nfl/story/_/id/42171653/k...,Get ready for an electric Week 9 Monday Night ...,,https://a2.espncdn.com/combiner/i?img=%2Fphoto...
4,CMS’s Medical Debt Relief Will Worsen Medical ...,"Ge Bai ,Ge Baicontributoropinions Expressed Fo...",2024-11-04T00:00:00.000000,Forbes,https://www.forbes.com/sites/gebai/2024/11/04/...,Man collects money with magnet from human crow...,,https://specials-images.forbesimg.com/imageser...


In [None]:
# Total number of articles within the dataset
print('Total number of articles:', len(df))

Total number of articles: 2986


In [None]:
# Number of articles sources
print(f'Number of news publishers: {df["source"].unique().size}')
print("Examples of news publishers:")
print(df["source"].unique()[:5].tolist() + ['...'])

Number of news publishers: 57
Examples of news publishers:
['TechCrunch', 'FAIL Blog', 'Daily Mail', 'ESPN', 'Forbes', '...']


First, let's convert the `published_date` column to a datetime type.

In [None]:
# Date dtype conversion
df.published_date = pd.to_datetime(df.published_date, errors="coerce")
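To make the effect of `errors="coerce"` concrete, here is a minimal sketch on made-up values (the sample strings are hypothetical, not rows from the dataset). Anything that cannot be parsed becomes `NaT` instead of raising an error, so it is worth counting the missing values afterwards.

```python
import pandas as pd

# Hypothetical sample values: one valid ISO timestamp, one invalid string, one missing value
sample = pd.Series(["2024-11-04T00:00:00", "not a date", None])

# errors="coerce" turns unparseable or missing entries into NaT instead of raising
parsed = pd.to_datetime(sample, errors="coerce")

print(parsed)
print("Unparseable or missing dates:", parsed.isna().sum())
```

Note that some timestamps in this dataset carry a UTC offset (`+0000`) while others do not. Depending on the pandas version, such mixed formats may also be coerced to `NaT`, so the date range reported below may cover fewer articles than expected.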

In [None]:
# The date range of the articles
non_empty_dates_df = df[df["published_date"].notnull()]
# non_empty_dates_df["published_date"] = non_empty_dates_df["published_date"].dt.tz_localize('UTC')
print(f'Date range of articles: {non_empty_dates_df["published_date"].min()} to {non_empty_dates_df["published_date"].max()}')

Date range of articles: 2020-09-10 00:00:00 to 2024-11-20 00:00:00


We can see that we have the following columns in this dataset:
  - title (Article title)
  - authors (Names of the article's authors)
  - published_date (Date on which the article was published)
  - source (Name of the website that published the article)
  - url (Article URL)
  - content (Content of the article)
  - tags (SEO keywords of the article)
  - images (Image links of the article)

Let's add a new column named `id` to the dataset to uniquely identify each article, and use it as the index.

In [None]:
# Add UUID to articles in the dataset
import uuid
df['id'] = [uuid.uuid4() for _ in range(len(df))]
df.set_index('id', inplace=True)
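One detail worth keeping in mind: `uuid.uuid4()` returns a `uuid.UUID` object, while an ID read back from a CSV file is plain text. The sketch below (with a hypothetical `lookup` dictionary) illustrates why the later cells convert IDs with `str(...)` before using them as dictionary keys.

```python
import uuid

# A freshly generated ID is a uuid.UUID object, not a string
article_id = uuid.uuid4()

# After a CSV round trip the ID comes back as plain text
id_from_csv = str(article_id)

# Hypothetical lookup table keyed by the string form of the ID
lookup = {str(article_id): "processed text"}

print(article_id == id_from_csv)             # UUID object vs. string comparison
print(id_from_csv in lookup)                 # lookup with the string key
print(uuid.UUID(id_from_csv) == article_id)  # converting the string back to a UUID
```

Using the string form consistently avoids lookups that silently miss because one side is a `UUID` object and the other a string.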

## Pre-processing

In this step, pre-processing methods like tokenization & lemmatization will be applied to prepare the text data for analysis tasks.

In [5]:
# Importing necessary libraries for creating new sets of processed data.
import os
import string
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk import pos_tag

import pickle

from tqdm import tqdm

In [None]:
# Download the necessary nltk resources
nltk.download('words')
nltk.download('punkt')
nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('wordnet')

nltk.download('averaged_perceptron_tagger')
nltk.download('averaged_perceptron_tagger_eng')

nltk.download('maxent_ne_chunker')
nltk.download('maxent_ne_chunker_tab')

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.
[nltk_data] Downloading package maxent_ne_chunker to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping chunkers/maxent_ne_chunker.zip.
[nltk_data] Downloading packag

True

Let's check for articles with no content and remove them if any exist.

In [None]:
# Count the number of articles without a body content
print(f"Articles with missing body content: {df['content'].isna().sum()}")

Articles with missing body content: 4


In [None]:
# Remove the articles without a body
df.dropna(subset=['content'], inplace=True)

Let's tokenize & lemmatize the articles first.

In [None]:
# Initialize the lemmatizer and the stop words
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
punctuation_table = str.maketrans('', '', string.punctuation)

In [None]:
# This is the folder where the pre-processed data will be placed in.
processed_data_folder = "processed_text_data"

if not os.path.exists(processed_data_folder):
    os.mkdir(processed_data_folder)

In [None]:
# Remove unwanted escape characters of the text
df['content'] = df['content'].str.replace('\n', '. ').str.replace('\t', '. ')

Let's create sets of pre-processed data and serialize them so they can be reused later without repeating the processing. The following sets will be created:

|Pre-processed set|  Lower-cased  |  Sentence tokenized  |  Word tokenized  |  Punctuation removed  |  Stopwords removed  |  Lemmatized |
|:-------------------|----:|---:|--:|--:|---:|---:|
|  Lowered text set  |  ✓  |    |   |   |    |    |
|  Sentence token set      |  ✓  |  ✓  |   |  ✓  |    |    |
|  Word token set    |  ✓  |    |  ✓  |   |    |    |
|  Filtered token set|  ✓  |    |  ✓  |  ✓  |  ✓  |    |
|  Lemma set         |  ✓  |   |  ✓  |  ✓  |  ✓  |  ✓  |

These sets will be used for different purposes throughout this analysis.
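To show how the sets differ, here is a dependency-free sketch on a single made-up text. The naive splitting, the tiny `STOP_WORDS` set and the `TOY_LEMMAS` mapping are assumptions that stand in for NLTK's `sent_tokenize`, `word_tokenize`, stopword list and `WordNetLemmatizer`; the real pipeline below uses the NLTK versions.

```python
import string

# Stand-ins for the NLTK resources used in this notebook (assumptions for illustration only)
STOP_WORDS = {"the", "are", "on", "a", "is"}
TOY_LEMMAS = {"cats": "cat", "mats": "mat", "dogs": "dog"}

text = "The cats are on the mats. Dogs bark!"

# Lowered text set
lowered = text.lower()

# Sentence token set: naive split on "." and "!" as a stand-in for sent_tokenize
sentences = [s.strip() for s in lowered.replace("!", ".").split(".") if s.strip()]

# Word token set: pad punctuation with spaces as a stand-in for word_tokenize
padded = "".join(f" {ch} " if ch in string.punctuation else ch for ch in lowered)
word_tokens = padded.split()

# Filtered token set: drop stopwords and punctuation tokens
filtered = [t for t in word_tokens if t not in STOP_WORDS and t not in string.punctuation]

# Lemma set: map each remaining token to its base form where one is known
lemmas = [TOY_LEMMAS.get(t, t) for t in filtered]

print(sentences)
print(word_tokens)
print(filtered)
print(lemmas)
```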

In [None]:
# Initialize dictionaries for each type of processed data (content and title)
data_lower_content = {}
data_lower_title = {}
data_content_tokens = {}
data_title_tokens = {}
data_content_sent_tokens = {}
data_title_sent_tokens = {}
data_content_filtered_tokens = {}
data_title_filtered_tokens = {}
data_content_lemmas = {}
data_title_lemmas = {}

# Iterate over rows in the dataframe
for idx, item in tqdm(df.iterrows(), desc='Processing rows', total=df.shape[0]):
    article_id = str(idx)  # The index holds the UUID; store IDs as strings for dictionary keys

    # Process the article content
    lowered_content = item['content'].lower()
    data_lower_content[article_id] = lowered_content

    # Sentence tokenization (content)
    content_sent_tokens = sent_tokenize(lowered_content)
    content_sent_tokens = [sent.translate(punctuation_table) for sent in content_sent_tokens]  # Remove punctuation from sentences
    data_content_sent_tokens[article_id] = content_sent_tokens

    # Word tokenization (content)
    content_tokens = word_tokenize(lowered_content)
    data_content_tokens[article_id] = content_tokens

    # Stopword removal (content)
    content_filtered_tokens = [token for token in content_tokens if token not in stop_words and token not in string.punctuation]
    data_content_filtered_tokens[article_id] = content_filtered_tokens

    # Lemmatization (content)
    content_lemmas = [lemmatizer.lemmatize(token) for token in content_filtered_tokens]
    data_content_lemmas[article_id] = content_lemmas

    # Process the article title
    lowered_title = item['title'].lower()
    data_lower_title[article_id] = lowered_title

    # Sentence tokenization (title)
    title_sent_tokens = sent_tokenize(lowered_title)
    title_sent_tokens = [sent.translate(punctuation_table) for sent in title_sent_tokens]  # Remove punctuation from sentences
    data_title_sent_tokens[article_id] = title_sent_tokens

    # Word tokenization (title)
    title_tokens = word_tokenize(lowered_title)
    data_title_tokens[article_id] = title_tokens

    # Stopword removal (title)
    title_filtered_tokens = [token for token in title_tokens if token not in stop_words and token not in string.punctuation]
    data_title_filtered_tokens[article_id] = title_filtered_tokens

    # Lemmatization (title)
    title_lemmas = [lemmatizer.lemmatize(token) for token in title_filtered_tokens]
    data_title_lemmas[article_id] = title_lemmas

# Save each type of processed data into its own Pickle file
pickle_files = {
    'lower_content.pkl': data_lower_content,
    'lower_title.pkl': data_lower_title,
    'content_tokens.pkl': data_content_tokens,
    'title_tokens.pkl': data_title_tokens,
    'content_sent_tokens.pkl': data_content_sent_tokens,
    'title_sent_tokens.pkl': data_title_sent_tokens,
    'content_filtered_tokens.pkl': data_content_filtered_tokens,
    'title_filtered_tokens.pkl': data_title_filtered_tokens,
    'content_lemmas.pkl': data_content_lemmas,
    'title_lemmas.pkl': data_title_lemmas,
}

for file_name, data in pickle_files.items():
    file_path = os.path.join(processed_data_folder, file_name)
    with open(file_path, 'wb') as f:
        pickle.dump(data, f)

print(f"Processed data for content and title saved in directory: {processed_data_folder}")


Processing rows: 100%|██████████| 2982/2982 [00:37<00:00, 79.84it/s]


Processed data for content and title saved in directory: processed_text_data
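The serialization step above can be sketched in isolation as a simple pickle round trip. The dictionary contents and the temporary folder are made up for illustration.

```python
import os
import pickle
import tempfile

# Hypothetical processed data: article ID (string) -> list of tokens
tokens_by_article = {"id-1": ["cat", "mat"], "id-2": ["dog", "bark"]}

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "content_tokens.pkl")

    # Serialize once...
    with open(path, "wb") as f:
        pickle.dump(tokens_by_article, f)

    # ...and load it back later without re-running the pre-processing
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored == tokens_by_article)
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.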


Let's export a copy of this cleaned dataset as a csv.


In [None]:
# Let's also make a copy of our updated articles df
df.to_csv('articles.csv')

## Analysis

Let's do some POS tagging & dependency parsing to identify sentence types in articles to get a better idea of them.

In [2]:
import plotly.express as px

import plotly.graph_objects as go
import plotly.subplots as sp

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

from collections import Counter

In [5]:
processed_data_folder = "processed_text_data"

In [8]:
# Load the dataset & pre-processed sets to memory
df = pd.read_csv('articles.csv')

data_files = {
    'lower_content': 'lower_content.pkl',
    'lower_title': 'lower_title.pkl',
    'content_tokens': 'content_tokens.pkl',
    'title_tokens': 'title_tokens.pkl',
    'content_sent_tokens': 'content_sent_tokens.pkl',
    'title_sent_tokens': 'title_sent_tokens.pkl',
    'content_filtered_tokens': 'content_filtered_tokens.pkl',
    'title_filtered_tokens': 'title_filtered_tokens.pkl',
    'content_lemmas': 'content_lemmas.pkl',
    'title_lemmas': 'title_lemmas.pkl'
}

for key, value in data_files.items():
    with open(f'{processed_data_folder}/{value}', 'rb') as f:
        data_files[key] = pickle.load(f)
    print(f'{value} Loaded.')

lower_content.pkl Loaded.
lower_title.pkl Loaded.
content_tokens.pkl Loaded.
title_tokens.pkl Loaded.
content_sent_tokens.pkl Loaded.
title_sent_tokens.pkl Loaded.
content_filtered_tokens.pkl Loaded.
title_filtered_tokens.pkl Loaded.
content_lemmas.pkl Loaded.
title_lemmas.pkl Loaded.
content_pos_tags.pkl Loaded.
title_pos_tags.pkl Loaded.


### Filtering out articles with irrelevant content

#### POS Tagging & NER

In this section, let's start by running POS tagging & NER on each article, and then look at the distribution of POS tags & entities across the corpus.

In [None]:
# Processing the POS Tags for all article titles & contents
pos_tags = {'content': {}, 'title': {}}

for i, row in tqdm(df.iterrows(), desc="Processing POS tags for content and title...", total=df.shape[0]):
    row_id = str(row['id'])  # Key by article ID (not row position) so later lookups via article['id'] work

    # Processing content
    sent_tokens_content = sent_tokenize(row['content'])
    sent_tokens_content = [sent_tokens_content] if isinstance(sent_tokens_content, str) else sent_tokens_content
    sent_pos_content = []
    for sent in sent_tokens_content:
        tokens = word_tokenize(sent)
        sent_pos_content.append(nltk.pos_tag(tokens))
    pos_tags['content'][row_id] = sent_pos_content

    # Processing title
    title = row['title']
    title_tokens = [title]
    sent_pos_title = []
    for sent in title_tokens:
        tokens = word_tokenize(sent)
        sent_pos_title.append(nltk.pos_tag(tokens))
    pos_tags['title'][row_id] = sent_pos_title

# Save content POS tags
file_path_content = os.path.join(processed_data_folder, 'content_pos_tags.pkl')
with open(file_path_content, 'wb') as f:
    pickle.dump(pos_tags['content'], f)

# Save title POS tags
file_path_title = os.path.join(processed_data_folder, 'title_pos_tags.pkl')
with open(file_path_title, 'wb') as f:
    pickle.dump(pos_tags['title'], f)

Processing POS tags for content and title...: 100%|██████████| 2982/2982 [02:04<00:00, 23.92it/s]


In [None]:
# Load the serialized POS tag data to Memory
with open(f'{processed_data_folder}/title_pos_tags.pkl', 'rb') as f:
    data_files['title_pos_tags'] = pickle.load(f)

with open(f'{processed_data_folder}/content_pos_tags.pkl', 'rb') as f:
    data_files['content_pos_tags'] = pickle.load(f)

In [None]:
# Counts the overall POS tag distribution within the whole corpus
tag_counter = Counter()

for i, article in tqdm(df.iterrows(), desc="Counting POS Tags", total=len(df)):
    article_id = article['id']
    for sent_pos in data_files['content_pos_tags'][article_id]:
        # Update the counter in place instead of building a new Counter for every sentence
        tag_counter.update(tag for _, tag in sent_pos)

Counting POS Tags: 100%|██████████| 2982/2982 [00:03<00:00, 906.03it/s]


In [None]:
# Plot the POS Tag Frequency Distribution

labels = list(tag_counter.keys())
values = np.array(list(tag_counter.values()))

# Sort items by frequency (descending) and take the top 15
sorted_indices = np.argsort(values)[::-1]  # Indices of sorted values in descending order
top_indices = sorted_indices[:15]  # Top 15 indices
other_indices = sorted_indices[15:]  # All other indices

# Extract top 15 labels and values
top_labels = [labels[i] for i in top_indices]
top_values = values[top_indices]

# Calculate "Other" category
other_value = np.sum(values[other_indices])
if other_value > 0:
    top_labels.append("Other")
    top_values = np.append(top_values, other_value)

# Create pie chart
fig = go.Figure(data=[go.Pie(labels=top_labels, values=top_values)])
fig.update_layout(title_text="POS Tag Frequency (Top 15 + Other)")
fig.show()
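The "top N + Other" grouping used for the pie chart can also be sketched with `Counter.most_common`. The helper name `top_n_with_other` and the tag counts are hypothetical.

```python
from collections import Counter

def top_n_with_other(counter, n):
    """Return (labels, values) for the n most frequent items plus an 'Other' bucket."""
    most_common = counter.most_common(n)
    labels = [label for label, _ in most_common]
    values = [count for _, count in most_common]
    # Everything outside the top n is summed into a single bucket
    other = sum(counter.values()) - sum(values)
    if other > 0:
        labels.append("Other")
        values.append(other)
    return labels, values

# Hypothetical tag counts
toy_counts = Counter({"NN": 50, "IN": 30, "DT": 20, "JJ": 5, "RB": 3})
labels, values = top_n_with_other(toy_counts, n=3)
print(labels, values)
```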

In [None]:
# Process NER for the article titles & content
# Initialize variables
content_ner_data = {}
title_ner_data = {}

# Process NER for each article
for i, article in tqdm(df.iterrows(), desc="Processing NER", total=len(df)):
    article_id = article['id']
    title_text_pos = nltk.pos_tag(word_tokenize(article['title']))
    content_text_pos = nltk.pos_tag(word_tokenize(article['content']))
    title_ner_data[article_id] = nltk.chunk.ne_chunk(title_text_pos)
    content_ner_data[article_id] = nltk.chunk.ne_chunk(content_text_pos)

title_file_path = os.path.join(processed_data_folder, 'title_ner.pkl')
content_file_path = os.path.join(processed_data_folder, 'content_ner.pkl')

# Serialize the NER for both title & content
with open(title_file_path, 'wb') as f:
    pickle.dump(title_ner_data, f)
with open(content_file_path, 'wb') as f:
    pickle.dump(content_ner_data, f)

Processing NER:   0%|          | 0/2982 [00:00<?, ?it/s]


LookupError: 
**********************************************************************
  Resource punkt_tab not found.
  Please use the NLTK Downloader to obtain the resource:

  >>> import nltk
  >>> nltk.download('punkt_tab')

  For more information see: https://www.nltk.org/data.html

  Attempted to load tokenizers/punkt_tab/english/

  Searched in:
    - '/root/nltk_data'
    - '/usr/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/share/nltk_data'
    - '/usr/local/share/nltk_data'
    - '/usr/lib/nltk_data'
    - '/usr/local/lib/nltk_data'
    - '/kaggle/working/nltk_data'
**********************************************************************


In [9]:
# Load the serialized NER data to Memory
with open(f'{processed_data_folder}/title_ner.pkl', 'rb') as f:
    data_files['title_ner'] = pickle.load(f)

with open(f'{processed_data_folder}/content_ner.pkl', 'rb') as f:
    data_files['content_ner'] = pickle.load(f)

In [None]:
# Count Entities in the dataset & Plot the frequency distribution within the dataset.
ner_counter = Counter()

for i, article in tqdm(df.iterrows(), desc="Counting Entities", total=len(df)):
    article_id = article['id']
    entities = data_files['content_ner'][article_id]
    entity_list = [entity.label() if hasattr(entity, 'label') else None for entity in entities]
    ner_counter = ner_counter + Counter(entity_list)

ner_counter.pop(None)
labels = list(ner_counter.keys())
values = np.array(list(ner_counter.values()))

# Create pie chart
fig = go.Figure(data=[go.Pie(labels=labels, values=values)])
fig.update_layout(title_text="Entity Frequency")
fig.show()

Counting Entities: 100%|██████████| 2982/2982 [00:01<00:00, 2210.72it/s]


In [None]:
ner_counter

Counter({'ORGANIZATION': 44959,
         'PERSON': 66504,
         'GPE': 25750,
         'FACILITY': 1047,
         'LOCATION': 687,
         'GSP': 833})

In [None]:
# POS tag distribution for each entity
pos_counter_template = {}

tag_list_full = list(tag_counter.keys())
for tag in tag_list_full:
    pos_counter_template[tag] = 0

pos_distributions = {}
for entity, count in ner_counter.items():
    pos_distributions[entity] = pos_counter_template.copy()

for idx, article in tqdm(df.iterrows(), desc="Counting POS Tags for each entity...", total=len(df)):
    article_id = article['id']
    entities = data_files['content_ner'][article_id]
    for entity in entities:
        if hasattr(entity, 'label'):
            entity = entity.pos()
            pos_tag = entity[0][0][1]
            pos_distributions[entity[0][1]][pos_tag] += 1

# Keep only the top 10 POS tags for each entity
filtered_pos_distributions = {}
for entity_name, entity_dist in pos_distributions.items():
    sorted_tags = sorted(entity_dist.items(), key=lambda x: x[1], reverse=True)[:10]
    filtered_pos_distributions[entity_name] = dict(sorted_tags)

fig = sp.make_subplots(rows=1, cols=len(filtered_pos_distributions), specs=[[{"type": "domain"}] * len(filtered_pos_distributions)],
                       subplot_titles=list(filtered_pos_distributions.keys()))

for i, (entity_name, entity_dist) in enumerate(filtered_pos_distributions.items(), start=1):
    fig.add_trace(
        go.Pie(labels=list(entity_dist.keys()), values=list(entity_dist.values()), name=entity_name),
        row=1, col=i,
    )

# Update layout
fig.update_layout(title_text="POS Tag Distribution for Each Entity", width=2000)

# Show the plot
fig.show()

Counting POS Tags for each entity...: 100%|██████████| 2982/2982 [00:00<00:00, 3955.30it/s]
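The indexing `entity.pos()[0][0][1]` used above is easier to follow on a small example. The sketch below builds a chunk tree by hand (a made-up stand-in for the output of `nltk.chunk.ne_chunk`, so no model download is needed) and reads the entity label and the POS tag of the first token of each entity.

```python
from nltk import Tree

# Hand-built stand-in for the output of nltk.chunk.ne_chunk
chunked = Tree("S", [
    Tree("PERSON", [("Elon", "NNP"), ("Musk", "NNP")]),
    ("visited", "VBD"),
    Tree("GPE", [("Texas", "NNP")]),
    (".", "."),
])

entity_first_tags = []
for node in chunked:
    if hasattr(node, "label"):
        # Named entities are sub-trees; pos() yields ((word, pos_tag), entity_label) pairs
        first_leaf, label = node.pos()[0]
        entity_first_tags.append((label, first_leaf[1]))
    else:
        # Everything else is a plain (word, pos_tag) tuple
        print("non-entity token:", node)

print(entity_first_tags)
```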


#### POS-NER Vectorization

For vectorizing, we will use the top 10 POS tags by frequency in the corpus, plus the entity-POS tag combinations that occur at least once in the corpus. For each feature, the count is taken first and then normalized by the token count of the text, so that the vector does not depend on the length of the text.
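The count-then-normalize idea can be illustrated with a toy feature list. The features, the helper `count_vector` and the two texts below are hypothetical, and each observation contributes to a single feature for simplicity.

```python
from collections import Counter
import numpy as np

# Hypothetical feature list: two plain POS tags and one entity-POS feature
features = ["NN", "VBD", "PERSON-NNP"]

def count_vector(observed, features):
    """Count how often each feature occurs, in the order given by `features`."""
    counts = Counter(observed)
    return np.array([counts[f] for f in features], dtype=float)

short_text = ["NN", "VBD", "PERSON-NNP", "DT"]  # 4 tokens
long_text = short_text * 10                     # same style, 40 tokens

raw_short = count_vector(short_text, features)
raw_long = count_vector(long_text, features)

# Dividing by the token count makes the two vectors comparable
norm_short = raw_short / len(short_text)
norm_long = raw_long / len(long_text)

print(raw_short, raw_long)
print(norm_short, norm_long)
```

The raw counts differ by a factor of ten, while the normalized vectors are identical, which is the property the clustering step relies on.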

In [None]:
# Filter Top 10 POS Tags (based on frequency) & Entities per POS
individual_tag_list = []
mixed_tag_list = []

entity_tag_list = {entity: [] for entity, count in ner_counter.items()}

for entity, dists in pos_distributions.items():
    tag_labels = list(dists.keys())
    tag_counts = list(dists.values())
    # Take the 10 most frequent POS tags, then drop tags that never occur with this entity
    indices = np.argsort(tag_counts)[::-1][:10]
    indices = [idx for idx in indices if tag_counts[idx] > 0]
    entity_tag_list[entity] = [tag_labels[idx] for idx in indices]

for entity, tag_list in entity_tag_list.items():
    mixed_tag_list.extend([entity+'-'+tag for tag in tag_list])

tag_labels = list(tag_counter.keys())
tag_counts = list(tag_counter.values())
indices = np.argsort(tag_counts)[::-1][:10]

individual_tag_list = [tag_labels[idx] for idx in indices]

In [None]:
"""The function for creating the POS-NER vector for a text

This function returns a dictionary containing the POS-NER vector data for a given text as follows:
  - Number of sentences.
  - Sentence-wise vectors for the above features.
  - Article-wise vector for the above features.

"""
def generate_pos_ner_vector(pos_tags, ner_tags):
    """
    Generate a POS-NER vector for a given text.

    Args:
        pos_tags (list): List of POS tags for the text.
        ner_tags (list): List of NER tags for the text.

    Returns:
        dict: A dictionary containing the POS-NER vector.
    """
    vector_size = len(individual_tag_list) + len(mixed_tag_list)
    data = {
        'no_of_sentences': len(pos_tags),
        'sentence_vectors': np.zeros(shape=(len(pos_tags), vector_size)),
        'article_vector': np.zeros(shape=vector_size)
    }
    article_token_count = 0
    ner_pointer = 0
    entity_was_added = False
    for i, sent_pos_tag in enumerate(pos_tags):
        sent_token_count = len(sent_pos_tag)
        article_token_count += sent_token_count
        for tagged_token in sent_pos_tag:
            tag = tagged_token[1]
            if tag in individual_tag_list:
                data['sentence_vectors'][i][individual_tag_list.index(tag)] += 1
            entity = ner_tags[ner_pointer]
            if hasattr(entity, 'label') and not entity_was_added:
                entity_label = entity.label()
                if entity_label+'-'+tag in mixed_tag_list:
                    data['sentence_vectors'][i][len(individual_tag_list) + mixed_tag_list.index(entity_label+'-'+tag)] += 1
                    entity_was_added = True
            else:
                entity_was_added = False

            if tagged_token == entity:
                ner_pointer += 1
        data['article_vector'] += data['sentence_vectors'][i]
        data['sentence_vectors'][i] /= sent_token_count
    data['article_vector'] /= article_token_count
    return data

In [None]:
# Generate the POS-NER vectors for the articles

# Initialize the dictionary for storing the POS-NER vectors
pos_ner_vectors = {
  'content': {},
  'title': {}
}

# Iterate over the articles and generate the POS-NER vectors
for i, article in tqdm(df.iterrows(), desc="Generating POS-NER Vectors", total=len(df)):
    article_id = article['id']
    pos_ner_vectors['content'][article_id] = generate_pos_ner_vector(data_files['content_pos_tags'][article_id], data_files['content_ner'][article_id])
    pos_ner_vectors['title'][article_id] = generate_pos_ner_vector(data_files['title_pos_tags'][article_id], data_files['title_ner'][article_id])

Generating POS-NER Vectors: 100%|██████████| 2982/2982 [00:20<00:00, 148.56it/s]


In [None]:
# Flatten all article vectors into 1 list
article_vectors_full = np.concatenate([pos_ner_vectors['content'][article['id']]['article_vector'].reshape(1, -1) for idx, article in df.iterrows()])

Now, let's visualize these POS-NER vectors using t-SNE to see if there are any patterns in the data. Prior to t-SNE, we will use PCA to reduce the dimensionality of the data in order to save computation time, as t-SNE is slow on high-dimensional vectors.
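As a minimal sketch of this two-step reduction, the snippet below runs PCA followed by t-SNE on random data. The array `toy_vectors` and its shape are assumptions standing in for the real article vectors.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
# Hypothetical stand-in for the article vectors: 60 samples with 20 features
toy_vectors = rng.normal(size=(60, 20))

# Step 1: PCA compresses the 20 features into 5 components
reduced = PCA(n_components=5, random_state=42).fit_transform(toy_vectors)

# Step 2: t-SNE projects the reduced vectors to 2D
# (perplexity must be smaller than the number of samples)
embedded = TSNE(n_components=2, random_state=42, perplexity=10).fit_transform(reduced)

print(reduced.shape, embedded.shape)
```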

In [None]:
%%time
# t-SNE for article vectors
# Apply PCA prior to t-SNE to speed up the process
pca = PCA(n_components=5)
pca_result = pca.fit_transform(article_vectors_full)

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(pca_result)

fig = px.scatter(
    x=tsne_result[:, 0],
    y=tsne_result[:, 1],
    title="t-SNE of Article Vectors",
    labels={'x': 't-SNE 1', 'y': 't-SNE 2'},
)

fig.update_traces(marker=dict(size=5))

fig.show()

In the scatter plot, we can see that the vectors are well separated into 6 clusters. So, let's use clustering algorithms to capture these clusters.

In [None]:
# Article cluster details
no_of_article_clusters = 6

##### KMeans

Let's start with the K-Means clustering algorithm, as it is one of the most commonly used clustering algorithms.

In [None]:
# Perform K-means clustering for article vectors
kmeans = KMeans(n_clusters=no_of_article_clusters, random_state=42)
article_clusters = kmeans.fit_predict(article_vectors_full)

In [None]:
%%time
# t-SNE for article vectors for Kmeans clusters
# Apply PCA prior to t-SNE to speed up the process
pca = PCA(n_components=5)
pca_result = pca.fit_transform(article_vectors_full)

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(pca_result)

tsne_df = pd.DataFrame({
    't-SNE 1': tsne_result[:, 0],
    't-SNE 2': tsne_result[:, 1],
    'Cluster': pd.Categorical([f'Cluster {clust_id}' for clust_id in article_clusters])
})

# Create the scatter plot
fig = px.scatter(
    tsne_df,
    x='t-SNE 1',
    y='t-SNE 2',
    color='Cluster',
    title="t-SNE of Article Vectors (K-Means Clusters)",
    labels={'x': 't-SNE 1', 'y': 't-SNE 2'},
)

# Update marker style
fig.update_traces(marker=dict(size=5))

# Show plot
fig.show()

CPU times: user 19.8 s, sys: 124 ms, total: 19.9 s
Wall time: 19.9 s


As the plot shows, the KMeans clustering algorithm is not able to capture the clusters well in this situation, because it tends to find roughly circular clusters around its centroids. So, let's try another clustering algorithm.

##### GMM (Gaussian Mixture Models)

Let's try the GMM clustering algorithm to see if it can capture the clusters better than KMeans.
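To illustrate the difference between the two algorithms, here is a small sketch on synthetic, elongated clusters (the data and the parameters are made up). A GMM with full covariance matrices can model stretched clusters, whereas K-Means assigns each point to its nearest centroid. The actual scores depend on the data and on the initialization, so this is an illustration rather than a guarantee.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Two hypothetical clusters stretched along the x-axis and stacked vertically
stretch = np.array([10.0, 0.3])
cluster_a = rng.normal(size=(200, 2)) * stretch
cluster_b = rng.normal(size=(200, 2)) * stretch + np.array([0.0, 3.0])
points = np.vstack([cluster_a, cluster_b])
true_labels = np.array([0] * 200 + [1] * 200)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(points)
gmm_labels = GaussianMixture(n_components=2, random_state=42).fit_predict(points)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
print("K-Means ARI:", adjusted_rand_score(true_labels, kmeans_labels))
print("GMM ARI:    ", adjusted_rand_score(true_labels, gmm_labels))
```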

In [None]:
# Perform Gaussian Mixture Modeling for article vectors
gmm = GaussianMixture(n_components=no_of_article_clusters, random_state=42).fit(article_vectors_full)
article_clusters_gmm = gmm.predict(article_vectors_full)

In [None]:
%%time
# t-SNE for article vectors for GMM clusters
# Apply PCA prior to t-SNE to speed up the process
pca = PCA(n_components=5)
pca_result = pca.fit_transform(article_vectors_full)

tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(pca_result)

tsne_df = pd.DataFrame({
    't-SNE 1': tsne_result[:, 0],
    't-SNE 2': tsne_result[:, 1],
    'Cluster': pd.Categorical([f'Cluster {clust_id}' for clust_id in article_clusters_gmm])
})

# Create the scatter plot
fig = px.scatter(
    tsne_df,
    x='t-SNE 1',
    y='t-SNE 2',
    color='Cluster',
    title="t-SNE of Article Vectors (GMM Clusters)",
    labels={'x': 't-SNE 1', 'y': 't-SNE 2'},
)

# Update marker style
fig.update_traces(marker=dict(size=5))

# Show plot
fig.show()

CPU times: user 20.7 s, sys: 126 ms, total: 20.8 s
Wall time: 20.8 s


In [None]:
%%time
# 3D t-SNE for article vectors for GMM clusters
# Apply PCA prior to t-SNE to speed up the process
pca = PCA(n_components=5)
pca_result = pca.fit_transform(article_vectors_full)

# Change n_components to 3 for 3D t-SNE
tsne = TSNE(n_components=3, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(pca_result)

# Prepare data for the scatter plot
tsne_df = pd.DataFrame({
    't-SNE 1': tsne_result[:, 0],
    't-SNE 2': tsne_result[:, 1],
    't-SNE 3': tsne_result[:, 2],
    'Cluster': pd.Categorical([f'Cluster {clust_id}' for clust_id in article_clusters_gmm])
})

# Create the 3D scatter plot
fig = px.scatter_3d(
    tsne_df,
    x='t-SNE 1',
    y='t-SNE 2',
    z='t-SNE 3',
    color='Cluster',
    title="t-SNE of Article Vectors (GMM Clusters)",
    labels={'x': 't-SNE 1', 'y': 't-SNE 2', 'z': 't-SNE 3'}
)

# Update marker style
fig.update_traces(marker=dict(size=2))

# Show plot
fig.show()

CPU times: user 1min 23s, sys: 0 ns, total: 1min 23s
Wall time: 1min 38s


The GMM clustering algorithm captures the clusters much better than KMeans. We can see that the clusters are well separated.

Now, let's get some samples from each cluster to see if they contain irrelevant information.

In [None]:
# Cluster samples
article_texts = np.array([text for text in df['content']])

for i in range(no_of_article_clusters):
    print(f'Cluster {i}:')
    mask = (article_clusters_gmm == i)
    print_samples = article_texts[mask][:8]
    print_samples = [text[:30] + "..." for text in print_samples]
    print(print_samples)

Cluster 0:
["Elon Musk's ex Grimes took a s...", 'Decrypt’s Art, Fashion, and En...', 'Donald Trump told supporters a...', "Who's Playing. . Philadelphia ...", 'Daniel Shirey/MLB Photos via G...', 'Members of the dolphin family ...', 'Maye Musk, the mother of billi...', 'We may earn a commission from ...']
Cluster 1:
['What is the best internet prov...', 'The next leader of the free wo...', 'A Columbia, Tennessee man alle...', 'Renowned Danish architect Bjar...', 'New York conservatives are war...', 'In brief Leading AI models wer...', "A Russia's Sukhoi Su-57 stealt...", 'Attention, points and miles en...']
Cluster 2:
['The end of zero-interest rates...', 'Get ready for an electric Week...', 'The Chicago Bears were blown o...', 'The Kansas City Chiefs (7-0) w...', 'While each product featured is...', 'covers breaking and general as...', 'FIRST ON FOX – A Global Peace ...', 'The Heritage Foundation—the sa...']
Cluster 3:
['Scroll down for the next artic...', 'Scroll down for the next a

We can see that cluster 3 contains irrelevant content, such as "Scroll down for the next article" placeholders instead of article text. Let's remove the articles in cluster 3 and proceed with the analysis on the cleaner dataset.

In [None]:
# Remove all articles from the cluster 3
mask = (article_clusters_gmm == 3)

cluster_article_ids = [article['id'] for idx, article in df.iloc[mask].iterrows()]

df = df[~df['id'].isin(cluster_article_ids)]

### Topic Modelling

In this stage, we use topic modelling to identify the topics in the articles and to group articles that share the same context, which makes them easier to analyze.

For topic modelling, we will use BERTopic (a transformer-based topic modelling library). To make the analysis easier still, we will also assign these topics to categories with BART, used here as a zero-shot text classification model.

In [22]:
import warnings
import logging

import torch
from bertopic import BERTopic
from transformers import pipeline

from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument

import plotly.subplots as sp
import plotly.graph_objects as go

BERTopic is applied to the lemma tokens of each article. Only the words that carry contextual meaning are kept in this representation, which makes it much easier for the model to separate topics.

#### BERTopic modelling for the whole corpus

In [None]:
# Apply BERTopic on the articles
text = [" ".join(data_files['content_lemmas'][article['id']]) for idx, article in df.iterrows()]

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(text)

In [None]:
# Plot the intertopic distance map
topic_model.visualize_topics()

In [None]:
# Plot the top 12 topics with the words associated with each topic
topic_model.visualize_barchart(top_n_topics=12)

Through the topic model, 57 distinct topics were identified.

Having so many topics makes the articles difficult to analyze. So, let's group them into 8 categories using the BART text classification model and then look at the topics within each category.

#### Approach 1

In this approach, we use the BART text classification model to classify the articles into 8 categories, and then apply BERTopic to the corpus within each category.

In [None]:
# The list of categories
candidate_labels = ['Business', 'Sports', 'Technology', 'Health', 'Entertainment', 'Finance', 'Economy', 'Uncategorized']

In [None]:
# Apply BART to categorize the articles
# Suppress warnings
warnings.filterwarnings("ignore", message=".*overflowing tokens.*")

# Suppress logging warnings from the tokenizer
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

mnli_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)

categorization_results = []


for text in tqdm(df['content'].tolist(), desc="Categorizing articles use BART...", total=len(df['content'])):
    results = mnli_pipeline(text, candidate_labels=candidate_labels)

    best_label = results['labels'][0]

    # Save results
    categorization_results.append({
        "text": results['sequence'],
        "predicted_label": best_label,
        "entailment_probs": {label: prob for label, prob in zip(results['labels'], results['scores'])}
    })

# Save the categorization results
with open(f'{processed_data_folder}/article_categories.pkl', 'wb') as f:
    pickle.dump(categorization_results, f)

Device set to use cuda:0
Categorizing articles use BART...:   0%|          | 10/2918 [00:11<53:23,  1.10s/it]You seem to be using the pipelines sequentially on GPU. In order to maximize efficiency please use a dataset
Categorizing articles use BART...: 100%|██████████| 2918/2918 [41:32<00:00,  1.17it/s]
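
The run above emits a warning about calling the pipeline sequentially on GPU. One possible way to address it is to pass the articles in chunks rather than one at a time. The sketch below is only an illustration: `chunked` and `classify_in_batches` are hypothetical helpers, and `classifier` is assumed to be any callable that takes a list of texts and returns one result dict per text (with `sequence`, `labels` and `scores` keys, as in the loop above). A stand-in classifier is used so the example runs without downloading a model.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def classify_in_batches(texts: Sequence[str],
                        classifier: Callable[[List[str]], List[dict]],
                        batch_size: int = 8) -> List[dict]:
    """Run `classifier` on chunks of `texts` and keep the top label per text."""
    summaries = []
    for batch in chunked(texts, batch_size):
        for result in classifier(list(batch)):
            summaries.append({
                "text": result["sequence"],
                "predicted_label": result["labels"][0],
                "entailment_probs": dict(zip(result["labels"], result["scores"])),
            })
    return summaries


# Stand-in classifier (assumption) that mimics the result format.
def fake_classifier(batch: List[str]) -> List[dict]:
    return [
        {"sequence": text, "labels": ["Sports", "Business"], "scores": [0.7, 0.3]}
        for text in batch
    ]


demo = classify_in_batches(["a", "b", "c"], fake_classifier, batch_size=2)
print([item["predicted_label"] for item in demo])
```

With the real pipeline, `classifier` could be something like `lambda batch: mnli_pipeline(batch, candidate_labels=candidate_labels, batch_size=8)`; whether batching actually speeds things up depends on the GPU and on the article lengths, so it is worth timing on a small sample first.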


In [None]:
# Load the serialized categorization results
with open(f'{processed_data_folder}/article_categories.pkl', 'rb') as f:
    categorization_results = pickle.load(f)

In [None]:
# Count the number of articles in each category
label_counts = Counter(item['predicted_label'] for item in categorization_results)

# Convert to lists for plotting
labels = list(label_counts.keys())
counts = list(label_counts.values())

# Create the bar plot with Plotly
fig = px.bar(x=labels, y=counts, labels={'x': 'Category', 'y': 'Count'},
             title='Distribution of Categories')

# Show the plot
fig.show()

In [None]:
# Apply BERTopic on the articles within each category
# Prepare data for each category
data = {
    category: [
        " ".join(data_files['content_lemmas'][df.iloc[i]['id']])
        for i, item in enumerate(categorization_results)
        if item['predicted_label'] == category
    ]
    for category in candidate_labels
}

# Initialize BERTopic model
model = BERTopic()

# Subplots configuration
fig = sp.make_subplots(rows=4, cols=2, subplot_titles=candidate_labels)

# Apply BERTopic and visualize topics for each category
row, col = 1, 1
for category in candidate_labels:
    if data[category]:  # Only process if there are articles for the category
        topics, _ = model.fit_transform(data[category])
        freq = model.get_topic_info()

        # Prepare data for bar chart
        freq = freq[freq["Topic"] != -1]  # Exclude outliers
        bar = go.Bar(x=freq["Count"], y=freq["Name"], orientation='h', name=category)
        fig.add_trace(bar, row=row, col=col)

    # Update subplot position
    if col == 2:
        row += 1
        col = 1
    else:
        col += 1

# Layout updates
fig.update_layout(height=800, width=2000, title_text="Topics per Category", showlegend=False)

fig.show()

As the plot shows, the topics are not well separated in this approach. So, let's try another approach.

#### Approach 2

In this approach, we will be following these steps:
  
  1. Create topic labels (with the help of the Gemini API) from the word list of each topic found by BERTopic.
  2. Assign the topics to the categories using the BART text classification model.

In [None]:
import os
import dotenv

import google.generativeai as genai
from tqdm import tqdm
import time

dotenv.load_dotenv()
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY")

In [None]:
# Configure Gemini API
api_key = GEMINI_API_KEY
genai.configure(api_key=api_key)
model = genai.GenerativeModel('gemini-1.5-flash')

In [None]:
# Get topic words from BERTopic model
topic_words = topic_model.get_topics()
labels = {}

In [None]:
# Process each topic
for topic_id, words in tqdm(topic_words.items(), desc="Labeling topics"):
    if topic_id == -1:  # Skip outlier topic
        continue

    # Create prompt for this topic
    words_str = ", ".join([f"{word} ({score:.3f})" for word, score in words])
    prompt = f"""Given the following list of words and their importance scores from a topic model:

{words_str}

What would be an appropriate 2-10 word topic for this list of words? Provide ONLY the label, nothing else.
The label should be specific and descriptive, avoiding generic terms.
"""

    # Get label from Gemini with retry logic
    max_retries = 3
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            labels[topic_id] = response.text.strip()
            break
        except Exception as e:
            if attempt == max_retries - 1:
                print(f"Failed to get label for topic {topic_id} after {max_retries} attempts: {e}")
                labels[topic_id] = "Failed to generate label"
            time.sleep(1.0)

    # Add delay to respect 15 RPM rate limit
    # 60 seconds / 15 requests = 4 seconds per request
    time.sleep(4.0)

# Update topic labels in BERTopic model
topic_model.set_topic_labels(labels)

with open(f'{processed_data_folder}/topic_labels.pkl', 'wb') as f:
    pickle.dump(labels, f)

with open(f'{processed_data_folder}/topic_model.pkl', 'wb') as f:
    pickle.dump(topic_model, f)


Labeling topics:  21%|██        | 12/58 [01:12<04:22,  5.70s/it]ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1420.73ms
ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 864.34ms


Failed to get label for topic 11 after 3 attempts: 500 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: Unable to submit request because the service is temporarily unavailable.


Labeling topics:  55%|█████▌    | 32/58 [03:28<02:17,  5.28s/it]ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1200.58ms
ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 891.23ms


Failed to get label for topic 31 after 3 attempts: 500 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: Unable to submit request because the service is temporarily unavailable.


Labeling topics:  69%|██████▉   | 40/58 [04:15<01:36,  5.34s/it]ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1193.21ms
ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 887.22ms


Failed to get label for topic 39 after 3 attempts: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response'))


Labeling topics:  81%|████████  | 47/58 [05:01<01:01,  5.63s/it]ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 1343.82ms
ERROR:tornado.access:500 POST /v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint (127.0.0.1) 713.51ms


Failed to get label for topic 46 after 3 attempts: 500 POST https://generativelanguage.googleapis.com/v1beta/models/gemini-1.5-flash:generateContent?%24alt=json%3Benum-encoding%3Dint: Unable to submit request because the service is temporarily unavailable.


Labeling topics: 100%|██████████| 58/58 [06:13<00:00,  6.45s/it]
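
The log above shows four topics (11, 31, 39 and 46) ending up with the placeholder "Failed to generate label" after three attempts spaced one second apart. A common remedy for temporarily unavailable services is to wait longer after each failure. The following is a minimal sketch of exponential backoff, not the notebook's actual code: `call_with_backoff` is a hypothetical helper, and `flaky` is a stand-in for the API call so the example runs offline.

```python
import time


def call_with_backoff(func, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `func` until it succeeds, doubling the wait after each failure.

    Returns the result of `func`, or re-raises the last exception once
    `max_retries` attempts have been used.
    """
    for attempt in range(max_retries):
        try:
            return func()
        except Exception:
            if attempt == max_retries - 1:
                raise
            sleep(base_delay * (2 ** attempt))  # waits 1s, 2s, 4s, ...


# Stand-in for the remote call: fails twice, then succeeds.
attempts = {"count": 0}
waits = []  # records the requested delays instead of actually sleeping


def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("service temporarily unavailable")
    return "Example Topic Label"


label = call_with_backoff(flaky, sleep=waits.append)
print(label, waits)
```

In the labelling loop, `func` could be a small lambda wrapping `model.generate_content(prompt)`. It may also be worth re-running only the topics whose label equals the failure placeholder instead of repeating the whole loop.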


In [24]:
import pickle

# Load the serialized topic labels
with open(f'{processed_data_folder}/topic_labels.pkl', 'rb') as f:
    labels = pickle.load(f)

with open(f'{processed_data_folder}/topic_model.pkl', 'rb') as f:
    topic_model = pickle.load(f)

In [None]:
# The list of categories for the articles
candidate_labels = ['Business', 'Sports', 'Technology', 'Health', 'Entertainment', 'Space', 'Economy', 'Uncategorized']

In [None]:
# Categorize the topics using BART
# Initialize dictionary for storing the categorization results
label_categorization = {}

# Get topic info into variable
topic_df = topic_model.get_topic_info()

# Suppress warnings
warnings.filterwarnings("ignore", message=".*overflowing tokens.*")

# Suppress logging warnings from the tokenizer
logging.getLogger("transformers.tokenization_utils_base").setLevel(logging.ERROR)

mnli_pipeline = pipeline("zero-shot-classification", model="facebook/bart-large-mnli", device=0)


for idx, text in tqdm(labels.items(), desc="Categorizing topics using BART...", total=len(labels)):
    results = mnli_pipeline(text, candidate_labels=candidate_labels)

    best_label = results['labels'][0]

    # Save results
    label_categorization[idx] = ({
        "topic": results['sequence'],
        "predicted_label": best_label,
        # Store the article count as a plain integer rather than a one-element Series
        "count": int(topic_df.loc[topic_df['Topic'] == idx, 'Count'].iloc[0]),
        "entailment_probs": {label: prob for label, prob in zip(results['labels'], results['scores'])}
    })

with open(f'{processed_data_folder}/article_categories_2.pkl', 'wb') as f:
    pickle.dump(label_categorization, f)

Categorizing topics using BART...: 100%|██████████| 57/57 [05:42<00:00,  6.01s/it]


In [25]:
# Load the serialized categorization results
with open(f'{processed_data_folder}/article_categories_2.pkl', 'rb') as f:
    label_categorization = pickle.load(f)

In [None]:
# Count the number of topics in each category
label_counts = Counter([item['predicted_label'] for idx, item in label_categorization.items()])

# Convert to lists for plotting (separate names, so the topic `labels` dict is not overwritten)
category_names = list(label_counts.keys())
category_counts = list(label_counts.values())

# Create the bar plot with Plotly
fig = px.bar(x=category_names, y=category_counts, labels={'x': 'Category', 'y': 'Count'},
             title='Distribution of Categories (For Topics)')

# Show the plot
fig.show()

In [None]:
# Subplots configuration
fig = sp.make_subplots(rows=4, cols=2, subplot_titles=candidate_labels)

# Plot the topics assigned to each category
row, col = 1, 1
for category in candidate_labels:
  topic_ids = []
  for idx, item in label_categorization.items():
    if item['predicted_label'] == category:
      topic_ids.append(idx)
  selected_topic_df = topic_df[topic_df['Topic'].isin(topic_ids)]

  if not selected_topic_df.empty:
    bar = go.Bar(x=selected_topic_df["Count"], y=selected_topic_df["CustomName"], orientation='h', name=category)
    fig.add_trace(bar, row=row, col=col)

  # Always advance the subplot position, even for empty categories,
  # so that each trace stays under its own subplot title
  if col == 2:
    row += 1
    col = 1
  else:
    col += 1

# Layout updates
fig.update_layout(height=800, width=2000, title_text="Topics per Category", showlegend=False)

fig.show()
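
The inner loop that collects `topic_ids` for one category at a time can also be expressed as a single grouping pass. This is just a sketch of that idea: `group_topics_by_category` is a hypothetical helper, and the toy dictionary only mimics the shape of `label_categorization`.

```python
from collections import defaultdict


def group_topics_by_category(label_categorization):
    """Map each predicted category to the list of topic ids assigned to it."""
    grouped = defaultdict(list)
    for topic_id, item in label_categorization.items():
        grouped[item["predicted_label"]].append(topic_id)
    return dict(grouped)


# Toy input with the same shape as `label_categorization` (made-up values).
toy = {
    0: {"predicted_label": "Sports"},
    1: {"predicted_label": "Business"},
    2: {"predicted_label": "Sports"},
}
grouped_demo = group_topics_by_category(toy)
print(grouped_demo)
```

Categories that received no topic are simply absent from the result, so a lookup such as `grouped.get(category, [])` keeps the plotting loop safe.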

In [None]:
# Assign topics for each article in df
df['topic'] = topic_model.topics_

#### Document Clustering on Topics

Let's embed the documents with Doc2Vec and project them with t-SNE to see whether documents from the same category group together.

In [None]:
# Step 1: Preprocess and tag the documents
texts = [" ".join(data_files['content_lemmas'][article['id']]) for idx, article in df.iterrows()]
tagged_data = [TaggedDocument(words=text.split(), tags=[str(i)]) for i, text in enumerate(texts)]

In [None]:
%%time
# Step 2: Train the Doc2Vec model
model = Doc2Vec(vector_size=20, window=2, min_count=1, workers=4, epochs=100)
model.build_vocab(tagged_data)
model.train(tagged_data, total_examples=model.corpus_count, epochs=model.epochs)

with open(f'{processed_data_folder}/doc2vec_model.pkl', 'wb') as f:
    pickle.dump(model, f)

CPU times: user 12min 36s, sys: 7.88 s, total: 12min 44s
Wall time: 8min 23s


In [75]:
# Load the serialized Doc2Vec model
with open(f'{processed_data_folder}/doc2vec_model.pkl', 'rb') as f:
    model = pickle.load(f)

In [None]:
%%time
# Step 3: Obtain document vectors
doc_vectors = [model.infer_vector(text.split()) for text in texts]

# Step 4: Apply PCA for dimensionality reduction to 8 components (before t-SNE)
pca = PCA(n_components=8)
pca_result = pca.fit_transform(doc_vectors)

# Step 5: Apply t-SNE for further dimensionality reduction to 3 components for visualization
tsne = TSNE(n_components=3, random_state=42, perplexity=30)
tsne_result = tsne.fit_transform(pca_result)

CPU times: user 10min 2s, sys: 993 ms, total: 10min 3s
Wall time: 10min 8s


In [None]:
# Mark the anomaly topic as 'Uncategorized'
label_categorization[-1] = {
    'predicted_label': 'Uncategorized',
}

In [None]:
# Step 6: Create a DataFrame for visualization
temp_df = pd.DataFrame(tsne_result, columns=["TSNE-1", "TSNE-2", "TSNE-3"])
temp_df["Text"] = texts
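# Assign by position: df's index may no longer be 0..n-1 after the earlier row removals,
# and assigning a Series would align on index labels instead
temp_df["Topic"] = df["topic"].to_numpy()
temp_df["Category"] = [label_categorization[topic]['predicted_label'] for topic in df["topic"]]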

In [None]:
# Step 7: Plot the result using Plotly
fig = px.scatter_3d(temp_df, x="TSNE-1", y="TSNE-2", z="TSNE-3", color="Category", title="t-SNE visualization of Doc2Vec Embeddings (Based on Category)", labels={'TSNE-1': 't-SNE 1', 'TSNE-2': 't-SNE 2', 'TSNE-3': 't-SNE 3'})
fig.update_traces(marker=dict(size=2))
fig.show(width=500)

### Sentiment Analysis

Let's do sentiment analysis on the articles to see how their sentiment is distributed.

In [None]:
import plotly.figure_factory as ff
import plotly.graph_objects as go

from sklearn.metrics import confusion_matrix, balanced_accuracy_score, mean_squared_error

import math
import spacy
from flair.data import Sentence
from flair.nn import Classifier

In [None]:
# Load the Flair sentiment model
sentiment_model = Classifier.load('sentiment')

In [None]:
# Analyze sentiments for each article
# Initialize lists for storing sentiment labels and scores
sentiment_labels = []
sentiment_scores = []

headline_weight = 0.6
content_weight = 1 - headline_weight

for idx, article in tqdm(df.iterrows(), desc='Analyzing Sentiments for articles...', total=len(df)):
    headline = article['title']
    content = article['content']

    headline_sentence = Sentence(headline)
    content_sentence = Sentence(content)

    sentiment_model.predict(headline_sentence)
    sentiment_model.predict(content_sentence)

    headline_score = headline_sentence.score * (-1 if headline_sentence.tag == "NEGATIVE" else 1)
    content_score = content_sentence.score * (-1 if content_sentence.tag == "NEGATIVE" else 1)

    final_polarity_score = (headline_score*headline_weight)+(content_score*content_weight)
    if final_polarity_score > 0:
        sentiment_label = "POSITIVE"
    elif final_polarity_score == 0:
        sentiment_label = "NEUTRAL"
    else:
        sentiment_label = "NEGATIVE"

    sentiment_labels.append(sentiment_label)
    sentiment_scores.append(final_polarity_score)

df['sentiment_label'] = sentiment_labels
df['sentiment_polarity'] = sentiment_scores


Analyzing Sentiments for articles...: 100%|██████████| 2918/2918 [48:29<00:00,  1.00it/s]


In [None]:
# Plot the sentiment polarity distribution
fig = px.histogram(df, x='sentiment_polarity', title="Polarity Score Distribution",
                   nbins=20)
fig.update_layout(xaxis_title="Polarity Score", yaxis_title="Frequency")
fig.show()

In [None]:
# Add neutral class
df.loc[(df['sentiment_polarity'] < 0.2) & (df['sentiment_polarity'] > -0.2), 'sentiment_label'] = "NEUTRAL"
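
The scoring rule spread over the two cells above can be condensed into one small function, which makes the arithmetic easier to check by hand. This is a sketch under the same settings (headline weight 0.6, neutral band of ±0.2); `signed_score` and `combine_polarity` are hypothetical names, and the inputs are made-up `(tag, confidence)` pairs rather than real model output.

```python
def signed_score(tag, confidence):
    """Turn a (tag, confidence) pair into a polarity in [-1, 1]."""
    return -confidence if tag == "NEGATIVE" else confidence


def combine_polarity(headline, content, headline_weight=0.6, neutral_band=0.2):
    """Blend headline and content polarity and assign a three-way label.

    `headline` and `content` are (tag, confidence) tuples.
    """
    polarity = (signed_score(*headline) * headline_weight
                + signed_score(*content) * (1 - headline_weight))
    if -neutral_band < polarity < neutral_band:
        label = "NEUTRAL"
    elif polarity > 0:
        label = "POSITIVE"
    else:
        label = "NEGATIVE"
    return polarity, label


# A confident positive headline over a confident negative body:
# 0.6 * 0.9 - 0.4 * 0.8 is about 0.22, just outside the neutral band.
strong = combine_polarity(("POSITIVE", 0.9), ("NEGATIVE", 0.8))
# A weaker headline with the same body: 0.6 * 0.6 - 0.4 * 0.8 is about 0.04.
weak = combine_polarity(("POSITIVE", 0.6), ("NEGATIVE", 0.8))
print(strong, weak)
```

Because the headline carries 60% of the weight, a headline and body that disagree tend to land near zero, which is exactly the region the neutral band relabels.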

In [None]:
# Plot the proportion of each sentiment label
fig = px.pie(df, names='sentiment_label', title="Sentiment Label Proportions")
fig.show()

In [None]:
# Save the sentiment analysis results
with open(f'{processed_data_folder}/sentiment_analysis.pkl', 'wb') as f:
    pickle.dump(df, f)

In [23]:
# Load the serialized sentiment analysis results
import pickle

with open(f'{processed_data_folder}/sentiment_analysis.pkl', 'rb') as f:
    df = pickle.load(f)

##### Sentiment Distribution for the top 10 topics

Let's see how the sentiment is distributed for the top 10 topics.

In [None]:
# Calculate the sentiment proportions per topic
# fill_value=0 keeps topics that lack a sentiment class from producing NaN proportions
topic_sentiment = df.groupby('topic')['sentiment_label'].value_counts(normalize=True).unstack(fill_value=0)
topic_sentiment = topic_sentiment.rename(columns={
    "NEGATIVE": "Negative",
    "POSITIVE": "Positive",
    "NEUTRAL": "Neutral"
})

# Get the top 10 topics by number of articles
top_10_topics = df['topic'].value_counts().nlargest(10).index
topic_sentiment = topic_sentiment.loc[top_10_topics]

# Filter out the outlier topic (-1) if present
if -1 in topic_sentiment.index:
    topic_sentiment = topic_sentiment.drop(-1)

# Replace topic IDs with custom names from topic_model
topic_names = {topic_id: name for topic_id, name in topic_model.get_topic_info().set_index('Topic')['CustomName'].items()}
topic_sentiment = topic_sentiment.rename(index=topic_names)

# Adjust the order of columns to ensure the correct stacking order
topic_sentiment = topic_sentiment[["Negative", "Neutral", "Positive"]]  # Bottom to top

# Plot the bar chart without text labels first
fig = px.bar(
    topic_sentiment,
    x=topic_sentiment.index,
    y=["Negative", "Neutral", "Positive"],  # Correct stacking order
    title="Top 10 Topic Sentiment Proportions",
    labels={"value": "Proportion", "variable": "Sentiment", "index": "Topic"},
    color_discrete_map={
        "Positive": "#00CC96",
        "Neutral": "#636EFA",
        "Negative": "#EF553B"
    }
)

# Update the layout to stack bars
fig.update_layout(barmode='stack')

# Add annotations with the percentages (manual calculation of positions)
for topic in topic_sentiment.index:
    negative = topic_sentiment.loc[topic, 'Negative']
    neutral = topic_sentiment.loc[topic, 'Neutral']
    positive = topic_sentiment.loc[topic, 'Positive']

    # Calculate the vertical midpoint of each stacked segment (used to place its label)
    negative_cumsum = negative / 2
    neutral_cumsum = negative + neutral / 2
    positive_cumsum = negative + neutral + positive / 2

    # Add text annotations for each sentiment type (in percentage)
    fig.add_annotation(
        x=topic,
        y=negative_cumsum,
        text=f"{negative * 100:.1f}%",
        showarrow=False,
        font=dict(size=12, color="white"),
        align="center"
    )
    fig.add_annotation(
        x=topic,
        y=neutral_cumsum,
        text=f"{neutral * 100:.1f}%",
        showarrow=False,
        font=dict(size=12, color="white"),
        align="center"
    )
    fig.add_annotation(
        x=topic,
        y=positive_cumsum,
        text=f"{positive * 100:.1f}%",
        showarrow=False,
        font=dict(size=12, color="white"),
        align="center"
    )

# Show the plot
fig.show()
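
The annotation positions in the loop above are the vertical midpoints of the stacked segments: each label sits halfway between the bottom and the top of its segment. A short sketch of that arithmetic for any number of segments follows; `segment_midpoints` is a hypothetical helper and the proportions are made-up numbers.

```python
from itertools import accumulate


def segment_midpoints(proportions):
    """Return the vertical centre of each segment in a stacked bar."""
    tops = list(accumulate(proportions))   # upper edge of each segment
    bottoms = [0.0] + tops[:-1]            # lower edge of each segment
    return [(low + high) / 2 for low, high in zip(bottoms, tops)]


# Negative, neutral and positive shares for one topic (illustrative values).
mids = segment_midpoints([0.2, 0.3, 0.5])
print(mids)
```

For these shares the labels land at roughly 0.1, 0.35 and 0.75 on the proportion axis, matching the `negative / 2`, `negative + neutral / 2` and `negative + neutral + positive / 2` expressions used in the cell.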


### Dependency Parsing

In [14]:
import spacy
from spacy import displacy

import networkx as nx

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

In [None]:
!python -m spacy download en_core_web_md

In [None]:
# Load spaCy model
spacy_model = spacy.load('en_core_web_md')

In [None]:
# Creating a list of spaCy docs for each article, for further processing
spacy_docs = []

for idx, article in tqdm(df.iterrows(), desc="Creating spacy docs...", total=len(df)):
    spacy_docs.append(spacy_model(article['content']))

import pickle

with open(f'{processed_data_folder}/spacy_docs.pkl', 'wb') as f:
    pickle.dump(spacy_docs, f)

Creating spacy docs...: 100%|██████████| 2918/2918 [05:39<00:00,  8.60it/s]


In [None]:
# Load the serialized spacy docs
import pickle

with open(f'{processed_data_folder}/spacy_docs.pkl', 'rb') as f:
    spacy_docs = pickle.load(f)

In [None]:
# A sample dependency parsing visualization
doc = list(spacy_docs[0].sents)[1].as_doc()
displacy.serve(doc, style="dep")


Using the 'dep' visualizer
Serving on http://0.0.0.0:5000 ...

Shutting down server on port 5000.


In [None]:
# A sample dependency parsing visualization (In a network graph)
import plotly.graph_objects as go
import networkx as nx

def visualize_dependency_graph(doc):
    graph = nx.DiGraph()
    for token in doc:
        if not token.is_punct and not token.is_space:
            graph.add_edge(token.head.text, token.text)

    pos = nx.spring_layout(graph, k=0.5)  # Adjust k for spacing

    # Create edges
    edge_x = []
    edge_y = []
    for edge in graph.edges():
        x0, y0 = pos[edge[0]]
        x1, y1 = pos[edge[1]]
        edge_x.extend([x0, x1, None])
        edge_y.extend([y0, y1, None])

    edge_trace = go.Scatter(
        x=edge_x, y=edge_y,
        line=dict(width=0.5, color='#888'),
        hoverinfo='none',
        mode='lines')

    # Create nodes
    node_x = []
    node_y = []
    node_degrees = []
    for node in graph.nodes():
        x, y = pos[node]
        node_x.append(x)
        node_y.append(y)
        node_degrees.append(graph.degree[node])

    node_trace = go.Scatter(
        x=node_x, y=node_y,
        mode='markers+text',
        hoverinfo='text',
        text=list(graph.nodes()),
        textposition="top center",
        marker=dict(
            showscale=True,
            colorscale='blues',
            size=10,
            color=node_degrees,
            colorbar=dict(
                thickness=15,
                title='Node Connections',
                xanchor='left',
                titleside='right'
            ),
            line_width=2))

    # Create network graph
    fig = go.Figure(data=[edge_trace, node_trace],
                    layout=go.Layout(
                        title='Dependency Parsing as Graph',
                        titlefont_size=16,
                        showlegend=False,
                        hovermode='closest',
                        margin=dict(b=20, l=5, r=5, t=40),
                        annotations=[dict(
                            text="",
                            showarrow=False,
                            xref="paper", yref="paper")],
                        xaxis=dict(showgrid=False, zeroline=False, showticklabels=False),
                        yaxis=dict(showgrid=False, zeroline=False, showticklabels=False))
                    )

    fig.show()

# Assuming spacy_docs is defined and contains the parsed documents
visualize_dependency_graph(spacy_docs[20])
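
Two details of the graph construction above are worth keeping in mind. In spaCy, the root of a sentence is its own head, so `add_edge(token.head.text, token.text)` adds a self-loop for every root; and because nodes are keyed by text, repeated words collapse into a single node. The sketch below shows one possible alternative that keys nodes by token position. `dependency_edges` is a hypothetical helper, and the tuples are hand-written stand-ins for `(token.i, token.text, token.head.i)` so the example runs without a spaCy model.

```python
def dependency_edges(tokens):
    """Build (head, child) edges keyed by token position.

    `tokens` is a sequence of (index, text, head_index) tuples. The root,
    whose head index equals its own index, contributes no edge.
    """
    names = {index: f"{text}-{index}" for index, text, _ in tokens}
    return [
        (names[head], names[index])
        for index, _, head in tokens
        if head != index
    ]


# "the dog saw the cat": both occurrences of "the" stay distinct.
sentence = [
    (0, "the", 1),
    (1, "dog", 2),
    (2, "saw", 2),  # root: its head is itself
    (3, "the", 4),
    (4, "cat", 2),
]
edges = dependency_edges(sentence)
print(edges)
```

The resulting pairs could be passed to `graph.add_edges_from(...)` in place of the per-token `add_edge` call.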


### Stylometric Analysis

Let's do stylometric analysis on the articles to see whether any patterns emerge from their writing style.

In [15]:
import re
from statistics import mean, stdev
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.preprocessing import StandardScaler

import plotly.figure_factory as ff

In [None]:
def rate_grammar(doc):
    """
    Rates the grammatical correctness of sentences in the input text based on dependency parsing.

    Args:
        doc (spacy.tokens.doc.Doc): The input text (as a spaCy document) to be rated.

    Returns:
        tuple: A tuple containing a list of scores for each sentence and an overall score for the text.
    """

    # Define rules for grammatical correctness
    rules = {
        "ROOT": lambda token: token.dep_ == "ROOT",
        "SUBJECT": lambda token: any(child.dep_ in {"nsubj", "nsubjpass"} for child in token.children),
        "VERB": lambda token: token.pos_ in {"VERB", "AUX"},
    }

    def evaluate_sentence(sent):
        """Evaluate grammatical correctness of a sentence."""
        sent_score = 0
        root_found = False
        total_tokens = len(sent)

        for token in sent:
            # Check for a ROOT token
            if rules["ROOT"](token):
                root_found = True

            # Check for subject dependency
            if rules["SUBJECT"](token):
                sent_score += 1

            # Check for verb presence
            if rules["VERB"](token):
                sent_score += 1

        # Penalize if no ROOT found
        if not root_found:
            sent_score -= 1

        # Normalize score by the number of tokens to make it independent of sentence length
        normalized_score = sent_score / total_tokens if total_tokens > 0 else 0
        return max(0, normalized_score)  # Ensure score is non-negative

    # Rate each sentence
    sentence_scores = [evaluate_sentence(sent) for sent in doc.sents]
    overall_score = sum(sentence_scores) / len(sentence_scores) if sentence_scores else 0

    return sentence_scores, overall_score

In [None]:
def rate_informativeness(doc):
    """
    Rates the informativeness of sentences in the input text based on dependency parsing.

    Args:
        doc (spacy.tokens.doc.Doc): The input text (as a spaCy document) to be rated.

    Returns:
        tuple: A tuple containing a list of scores for each sentence and an overall score for the text.
    """

    def evaluate_informativeness(sent):
        """Evaluate informativeness of a sentence."""
        informative_tokens = {"nsubj", "dobj", "pobj", "ROOT", "acl", "advcl"}
        total_tokens = len(sent)
        informative_count = sum(1 for token in sent if token.dep_ in informative_tokens)

        # Normalize score by the number of tokens to make it independent of sentence length
        normalized_score = informative_count / total_tokens if total_tokens > 0 else 0
        return normalized_score

    # Rate each sentence
    sentence_scores = [evaluate_informativeness(sent) for sent in doc.sents]
    overall_score = sum(sentence_scores) / len(sentence_scores) if sentence_scores else 0

    return sentence_scores, overall_score
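
To make the per-sentence score concrete, here is a small worked example of the same ratio on a hand-annotated sentence. It is only a sketch: `informative_ratio` is a hypothetical helper that takes plain dependency labels instead of spaCy tokens, and the labels for the sample sentence are assigned by hand rather than produced by a parser.

```python
INFORMATIVE_DEPS = {"nsubj", "dobj", "pobj", "ROOT", "acl", "advcl"}


def informative_ratio(dep_labels):
    """Share of tokens whose dependency label is in INFORMATIVE_DEPS."""
    if not dep_labels:
        return 0
    hits = sum(1 for dep in dep_labels if dep in INFORMATIVE_DEPS)
    return hits / len(dep_labels)


# Hand-assigned labels for "The cat chased the mouse ."
deps = ["det", "nsubj", "ROOT", "det", "dobj", "punct"]
ratio = informative_ratio(deps)
print(ratio)
```

Three of the six tokens (subject, root verb, direct object) count as informative, giving a score of 0.5. Note that punctuation is included in the denominator, as it is in `len(sent)` above.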

In [None]:
def lexical_features(text):
    """
    Compute lexical features for the given text.

    Parameters:
        text (str): Input text.

    Returns:
        list: A vector of lexical features.
    """
    # Tokenize the text into words
    words = re.findall(r'\b\w+\b', text.lower())  # Lowercase and tokenize
    word_count = len(words)
    unique_words = set(words)
    unique_word_count = len(unique_words)

    # Character count (excluding spaces)
    char_count = sum(len(word) for word in words)

    # Type-Token Ratio
    ttr = unique_word_count / word_count if word_count > 0 else 0

    # Hapax Legomena and Dislegomena
    word_freq = Counter(words)
    hapax_legomena = sum(1 for freq in word_freq.values() if freq == 1)
    hapax_dislegomena = sum(1 for freq in word_freq.values() if freq == 2)

    hapax_legomena_ratio = hapax_legomena / word_count if word_count > 0 else 0
    hapax_dislegomena_ratio = hapax_dislegomena / word_count if word_count > 0 else 0

    # Average Word Length
    avg_word_length = char_count / word_count if word_count > 0 else 0

    # Construct feature vector
    feature_vector = [
        word_count,                # Total words
        unique_word_count,         # Unique words
        ttr,                       # Type-Token Ratio
        avg_word_length,           # Average word length
        hapax_legomena_ratio,      # Hapax Legomena Ratio
        hapax_dislegomena_ratio,   # Hapax Dislegomena Ratio
        char_count                 # Total character count
    ]

    return feature_vector
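
A quick hand-checkable example helps to read the lexical features. The sketch below recomputes only the type-token ratio and the hapax counts on a nine-word sentence, using the same tokenization pattern as `lexical_features`; `ttr_and_hapax` is a hypothetical helper introduced for illustration.

```python
import re
from collections import Counter


def ttr_and_hapax(text):
    """Return (type-token ratio, hapax legomena count, hapax dislegomena count)."""
    words = re.findall(r"\b\w+\b", text.lower())
    freq = Counter(words)
    ttr = len(freq) / len(words) if words else 0
    hapax = sum(1 for n in freq.values() if n == 1)  # words seen exactly once
    dis = sum(1 for n in freq.values() if n == 2)    # words seen exactly twice
    return ttr, hapax, dis


ttr, hapax, dis = ttr_and_hapax("The cat saw the dog and the dog ran")
print(ttr, hapax, dis)
```

The sentence has 9 tokens and 6 distinct words, so the type-token ratio is 6/9. Four words ("cat", "saw", "and", "ran") occur once and one word ("dog") occurs twice. In `lexical_features` these two counts are further divided by the total word count to give the ratios stored in the feature vector.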

In [None]:
def stylistic_features(text):
    """
    Compute stylistic features for the given text.

    Parameters:
        text (str): Input text.

    Returns:
        list: A vector of stylistic features.
    """
    # Sentence segmentation
    sentences = re.split(r'[.!?]', text)  # Split text into sentences
    sentences = [s.strip() for s in sentences if s.strip()]  # Remove empty strings

    # Tokenization (words)
    words = re.findall(r'\b\w+\b', text)
    total_words = len(words)

    # Punctuation counts
    periods = text.count('.')
    commas = text.count(',')
    question_marks = text.count('?')
    exclamation_marks = text.count('!')
    ellipses = text.count('...')

    # Capitalization
    capitalized_words = sum(1 for word in words if word[0].isupper())
    all_caps_words = sum(1 for word in words if word.isupper() and len(word) > 1)
    proportion_capitalized = capitalized_words / total_words if total_words > 0 else 0
    proportion_all_caps = all_caps_words / total_words if total_words > 0 else 0

    # Sentence lengths
    sentence_lengths = [len(re.findall(r'\b\w+\b', s)) for s in sentences]  # Words per sentence
    avg_sentence_length = mean(sentence_lengths) if sentence_lengths else 0
    std_sentence_length = stdev(sentence_lengths) if len(sentence_lengths) > 1 else 0

    # Function words
    function_words_set = {
        "the", "is", "in", "at", "of", "on", "and", "a", "to", "it", "this", "that", "for",
        "with", "as", "was", "by", "an", "be", "from", "or", "are", "but", "if", "then",
        "so", "than", "about", "into", "can", "not", "do", "which", "how", "when", "where",
        "what", "who", "whom", "why", "whose", "while", "has", "have", "had", "will", "shall",
        "may", "might", "would", "should", "could"
    }
    function_words_count = sum(1 for word in words if word.lower() in function_words_set)
    proportion_function_words = function_words_count / total_words if total_words > 0 else 0

    # Feature vector
    feature_vector = [
        periods,                          # Number of periods
        commas,                           # Number of commas
        question_marks,                   # Number of question marks
        exclamation_marks,                # Number of exclamation marks
        proportion_capitalized,           # Proportion of words starting with a capital letter
        proportion_all_caps,              # Proportion of all-uppercase words
        proportion_function_words,        # Proportion of function words
        avg_sentence_length,              # Average sentence length
        std_sentence_length,              # Sentence length variability
        ellipses                          # Number of ellipses
    ]

    return feature_vector

In [None]:
# Calculate stylometric features for each article
# Create a DataFrame to store the stylometric features
stylometry_df = pd.DataFrame(
    index = df['id'],
    columns = [
        'grammar_rating',
        'informativeness_rating',
        'word_count',
        'unique_word_count',
        'type_token_ratio',
        'average_word_length',
        'hapax_legomena_ratio',
        'hapax_dislegomena_ratio',
        'total_character_count',
        'periods',
        'commas',
        'question_marks',
        'exclamation_marks',
        'proportion_capitalized',
        'proportion_all_caps',
        'proportion_function_words',
        'avg_sentence_length',
        'std_sentence_length',
        'ellipses'
    ]
)

for idx, article in tqdm(df.reset_index().iterrows(), desc="Computing stylometric features...", total=len(df)):
    doc = spacy_docs[idx]
    # Features involving dependency parsing
    grammar_scores, grammar_rating = rate_grammar(doc)
    informativeness_scores, informativeness_rating = rate_informativeness(doc)
    # Lexical Features
    lexical_features_vector = lexical_features(article['content'])
    # Stylistic Features
    stylistic_features_vector = stylistic_features(article['content'])
    article_stylometric_vector = [grammar_rating, informativeness_rating] + lexical_features_vector + stylistic_features_vector
    stylometry_df.iloc[idx] = article_stylometric_vector

Computing stylometric features...: 100%|██████████| 2918/2918 [00:17<00:00, 165.40it/s]


In [None]:
stylometry_df.head()

id,grammar_rating,informativeness_rating,word_count,unique_word_count,type_token_ratio,average_word_length,hapax_legomena_ratio,hapax_dislegomena_ratio,total_character_count,periods,commas,question_marks,exclamation_marks,proportion_capitalized,proportion_all_caps,proportion_function_words,avg_sentence_length,std_sentence_length,ellipses
1fbdb59e-eb89-4ee7-8c93-4d604068110a,0.252475,0.249462,596,324,0.543624,4.864094,0.397651,0.073826,2899,46,39,1,0,0.110738,0.003356,0.34396,21.285714,11.897267,0
235e7b38-5484-4363-9ebd-81c58b96186f,0.25078,0.28439,837,382,0.456392,4.381123,0.285544,0.091995,3667,94,33,0,0,0.193548,0.02509,0.320191,19.465116,11.253251,0
d427da6e-e1e5-45b7-8ea1-d06fc7236d9b,0.255543,0.321244,1867,611,0.327263,4.239422,0.18211,0.059989,7915,197,74,1,0,0.191216,0.012855,0.304767,15.056452,10.314803,0
7da6b484-8894-4093-aa1e-7adef6f15eb9,0.185127,0.29518,661,344,0.520424,5.588502,0.370651,0.068079,3694,58,37,2,0,0.122542,0.01059,0.25416,17.394737,11.764673,1
dbc5c6ab-ed16-4d3e-8fd9-e1ffa0cc4d94,0.195086,0.293726,1138,509,0.447276,5.114236,0.318102,0.057996,5820,153,40,0,0,0.220562,0.016696,0.288225,13.547619,10.730437,0


In [None]:
# Save the stylometric features
stylometry_df.to_csv(f'{processed_data_folder}/stylometry_df.csv')

In [16]:
# Load the stylometric features
stylometry_df = pd.read_csv(f'{processed_data_folder}/stylometry_df.csv', index_col=0)

Let's plot a correlation matrix of the stylometric features. Highly correlated pairs carry largely redundant information, so one feature from each such pair can be dropped to reduce the dimensionality of the data.

In [12]:
# Plot correlation heatmap for the stylometric features
# Calculate the correlation matrix
correlation_matrix = stylometry_df.corr()

# Create the heatmap using plotly.figure_factory
fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    annotation_text=np.around(correlation_matrix.values, decimals=2),  # Round values to 2 decimals
    colorscale='balance',  # You can change the colorscale
    showscale=True
)

# Customize the layout (optional)
fig.update_layout(
    title='Correlation Matrix Heatmap',
    xaxis_title='Stylometric Features',
    yaxis_title='Stylometric Features'
)

fig.show()

The heatmap shows that several pairs of stylometric features are highly correlated. So, let's remove one feature from each highly correlated pair.

Selected threshold: |correlation| >= 0.8 (a coefficient of 0.8 or above, or -0.8 or below)

In [13]:
# Trim the features to remove highly correlated ones
# Calculate the correlation matrix
correlation_matrix = stylometry_df.corr()

# Create a copy of the DataFrame
stylometry_df_copy = stylometry_df.copy()

# Iterate through the correlation matrix to identify and remove highly correlated features
features_to_remove = set()
for i in range(len(correlation_matrix.columns)):
    for j in range(i + 1, len(correlation_matrix.columns)):
        col1 = correlation_matrix.columns[i]
        col2 = correlation_matrix.columns[j]
        correlation = correlation_matrix.iloc[i, j]

        if abs(correlation) >= 0.8:  # Consider correlation above 0.8 (or below -0.8)
            if col1 not in features_to_remove:
                features_to_remove.add(col2)

# Remove the identified features from the copy of the dataframe
stylometry_df_copy = stylometry_df_copy.drop(columns=list(features_to_remove))

print(f"Removed features: {features_to_remove}")
print("Shape of the original df:", stylometry_df.shape)
print("Shape of the copied and modified df:", stylometry_df_copy.shape)

Removed features: {'periods', 'hapax_legomena_ratio', 'total_character_count', 'commas', 'unique_word_count'}
Shape of the original df: (2918, 19)
Shape of the copied and modified df: (2918, 14)
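
The same kind of pruning can also be written without the explicit double loop. The sketch below is a variant, not the notebook's exact rule: it masks the correlation matrix to its upper triangle and drops every column that is highly correlated with any earlier column. Unlike the loop above, it does not skip pairs whose first feature has already been removed, so it may drop slightly more features. The function name `drop_correlated` and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def drop_correlated(frame, threshold=0.8):
    """Drop each column whose |correlation| with an earlier column is >= threshold."""
    corr = frame.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    to_drop = [col for col in upper.columns if (upper[col] >= threshold).any()]
    return frame.drop(columns=to_drop), to_drop

toy = pd.DataFrame({
    "a": np.arange(10, dtype=float),
    "b": 2.0 * np.arange(10, dtype=float) + 1.0,   # perfectly correlated with "a"
    "c": [3, 1, 4, 1, 5, 9, 2, 6, 5, 3],           # only weakly correlated with "a"
})
reduced, dropped = drop_correlated(toy)
```

In both versions the result depends on column order, because the earlier feature of each pair is the one that is kept.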


In [14]:
# Plot correlation heatmap for the stylometric features
# Calculate the correlation matrix
correlation_matrix = stylometry_df_copy.corr()

# Create the heatmap using plotly.figure_factory
fig = ff.create_annotated_heatmap(
    z=correlation_matrix.values,
    x=list(correlation_matrix.columns),
    y=list(correlation_matrix.index),
    annotation_text=np.around(correlation_matrix.values, decimals=2),  # Round values to 2 decimals
    colorscale='balance',  # You can change the colorscale
    showscale=True
)

# Customize the layout (optional)
fig.update_layout(
    title='Correlation Matrix Heatmap',
    xaxis_title='Stylometric Features',
    yaxis_title='Stylometric Features'
)

fig.show()

In [15]:
# Normalize the stylometric features using StandardScaler so that all features are on the same scale
# Create a copy of the DataFrame
normalized_stylometry_df = stylometry_df_copy.copy()

# Standard normalization
scaler = StandardScaler()
normalized_stylometry_df = pd.DataFrame(scaler.fit_transform(normalized_stylometry_df), columns = normalized_stylometry_df.columns)
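
For reference, standard scaling replaces each value with its z-score, (x - mean) / std, computed per column with the population standard deviation. The sketch below reproduces that arithmetic with NumPy on a toy table; the helper name `zscore_columns` and the toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

def zscore_columns(frame):
    """Standardize each column to zero mean and unit variance (population std, ddof=0)."""
    values = frame.to_numpy(dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)     # ddof=0, the same convention StandardScaler uses
    std[std == 0] = 1.0          # constant columns become all zeros instead of NaN
    return pd.DataFrame((values - mean) / std, index=frame.index, columns=frame.columns)

toy = pd.DataFrame(
    {"word_count": [100, 200, 300], "type_token_ratio": [0.2, 0.4, 0.6]},
    index=["a", "b", "c"],
)
scaled = zscore_columns(toy)
```

This helper keeps the original index, whereas the cell above produces a fresh positional index. The later LDA cells rely on that positional index when they align the features with `df.reset_index()`.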

In [20]:
%%time
# Perform PCA
pca = PCA(n_components=5)  # Reduce to 5 principal components
pca_result = pca.fit_transform(normalized_stylometry_df)

# Perform t-SNE
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
tsne_result_2d = tsne.fit_transform(pca_result)

# Create a DataFrame for plotting
df_tsne = pd.DataFrame(tsne_result_2d, columns=['tsne_x', 'tsne_y'])

# Create the scatter plot using Plotly Express
fig = px.scatter(
    df_tsne,
    x='tsne_x',
    y='tsne_y',
    title='t-SNE Visualization of Stylometric Features',
    labels={'tsne_x': 't-SNE 1', 'tsne_y': 't-SNE 2'},
    color=df['source'],
)
fig.update_traces(marker=dict(size=5))
fig.show()

CPU times: user 32.9 s, sys: 215 ms, total: 33.1 s
Wall time: 36.7 s
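
Before settling on 5 components, it can help to check how much of the variance they retain. With a fitted scikit-learn `PCA`, `pca.explained_variance_ratio_` reports this directly. The sketch below shows the underlying computation with a plain SVD on synthetic data; the helper name and the toy matrix are assumptions, and the numbers say nothing about the stylometric features themselves.

```python
import numpy as np

def explained_variance_ratio(data):
    """Fraction of the total variance captured by each principal component."""
    centered = data - data.mean(axis=0)
    # Singular values of the centered data come back in descending order
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

rng = np.random.default_rng(42)
toy = rng.normal(size=(200, 6))
toy[:, 1] = 2.0 * toy[:, 0] + 0.1 * toy[:, 1]   # make one column nearly redundant

ratios = explained_variance_ratio(toy)
cumulative = np.cumsum(ratios)   # cumulative[k - 1] = share kept by the first k components
```

A common rule of thumb is to keep enough components to reach a chosen cumulative share, but the right cut-off depends on how the reduced data is used downstream.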


In [21]:
%%time
# Perform PCA
pca = PCA(n_components=5)  # Reduce to 5 principal components
pca_result = pca.fit_transform(normalized_stylometry_df)

# Perform t-SNE
tsne = TSNE(n_components=3, random_state=42, perplexity=30)
tsne_result_3d = tsne.fit_transform(pca_result)

# Create a DataFrame for plotting
df_tsne = pd.DataFrame(tsne_result_3d, columns=['tsne_x', 'tsne_y', 'tsne_z'])

# Create the scatter plot using Plotly Express
fig = px.scatter_3d(
    df_tsne,
    x='tsne_x',
    y='tsne_y',
    z='tsne_z',
    title='t-SNE Visualization of Stylometric Features',
    labels={'tsne_x': 't-SNE 1', 'tsne_y': 't-SNE 2', 'tsne_z': 't-SNE 3'},
    color=df['source'],
)
fig.update_traces(marker=dict(size=2))
fig.show()

CPU times: user 1min 44s, sys: 9.63 ms, total: 1min 44s
Wall time: 1min 45s


In [45]:
X = normalized_stylometry_df.copy()
y = (df.reset_index()['source'] == 'FOX News').astype(int)

# Perform LDA
lda = LinearDiscriminantAnalysis()
lda_result = lda.fit(X, y)

# Get feature coefficients
feature_importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': lda.coef_[0],
    'Abs_Coefficient': abs(lda.coef_[0])
}).sort_values('Abs_Coefficient', ascending=False)

feature_importance.reset_index(inplace=True)

In [67]:
# Create feature importance plot
fig_importance = go.Figure()
fig_importance.add_trace(go.Bar(
    y=feature_importance['Feature'],
    x=feature_importance['Coefficient'],
    orientation='h'
))

fig_importance.update_layout(
    title='LDA Feature Coefficients - Impact on Distinguishing FOX News',
    xaxis_title='Coefficient Value',
    yaxis_title='Stylometric Feature',
    height=800,
    showlegend=False,
    yaxis={'categoryorder': 'total ascending'}
)

# Display results
print("Top Discriminating Features:")
print(feature_importance[['Feature', 'Coefficient']].sort_values(['Coefficient'], ascending=False).to_string(index=False))

# Show explained variance ratio
print("\nExplained Variance Ratio:", lda.explained_variance_ratio_)

# Calculate classification accuracy
y_pred = lda.predict(X)
accuracy = (y == y_pred).mean()
print(f"\nClassification Accuracy: {accuracy:.2%}")

# Display figures
fig_importance.show()

Top Discriminating Features:
                  Feature  Coefficient
      proportion_all_caps     6.903941
           grammar_rating     1.088070
      std_sentence_length     0.506702
proportion_function_words     0.496566
   informativeness_rating     0.464248
      average_word_length     0.184110
        exclamation_marks     0.053497
           question_marks     0.013465
   proportion_capitalized    -0.083366
                 ellipses    -0.319506
      avg_sentence_length    -0.374590
         type_token_ratio    -0.505408
  hapax_dislegomena_ratio    -0.640311
               word_count    -0.848803

Explained Variance Ratio: [1.]

Classification Accuracy: 97.46%
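
Note that this accuracy is measured on the same articles the LDA model was fitted on, so it is likely optimistic. When one class dominates, it is also worth comparing against the accuracy of always predicting the majority class. The sketch below computes that baseline; the function name and the toy label split are assumptions, not the notebook's actual class counts.

```python
import numpy as np

def majority_baseline_accuracy(labels):
    """Accuracy of a classifier that always predicts the most frequent label."""
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    return counts.max() / labels.size

# Hypothetical split: 93 "other" articles (0) and 7 target-source articles (1)
toy_labels = [0] * 93 + [1] * 7
baseline = majority_baseline_accuracy(toy_labels)
```

Calling `majority_baseline_accuracy(y)` on the real labels gives the number to beat. A held-out estimate, for example via scikit-learn's `cross_val_score`, would be a more reliable measure than the training accuracy.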


In [68]:
# Create distribution plots for top 6 features
fig_dist = sp.make_subplots(
    rows=3, cols=2,
    subplot_titles=feature_importance['Feature'][:6],
    vertical_spacing=0.15
)

# Plot distributions for top features
row = 1
col = 1
for i, feature in enumerate(feature_importance['Feature'][:6]):
    feature_idx = list(X.columns).index(feature)

    # Get data for Fox News and others
    fox_news_data = X[y == 1].iloc[:, feature_idx]
    others_data = X[y == 0].iloc[:, feature_idx]

    # Add violin plots
    fig_dist.add_trace(
        go.Violin(x=['FOX News']*len(fox_news_data),
                 y=fox_news_data,
                 name='FOX News',
                 side='positive',
                 line_color='green'),
        row=row, col=col
    )

    fig_dist.add_trace(
        go.Violin(x=['Others']*len(others_data),
                 y=others_data,
                 name='Others',
                 side='negative',
                 line_color='gray'),
        row=row, col=col
    )

    col += 1
    if col > 2:
        col = 1
        row += 1

fig_dist.update_layout(
    height=800,
    title='Distribution of Top 6 Discriminating Features',
    showlegend=False
)

# Additional statistics for top features
top_features = feature_importance['Feature'][:6]
stats_df = pd.DataFrame()

for feature in top_features:
    feature_idx = list(X.columns).index(feature)
    fox_news_values = X[y == 1].iloc[:, feature_idx]
    others_values = X[y == 0].iloc[:, feature_idx]

    stats_df = pd.concat([stats_df, pd.DataFrame({
        'Feature': [feature],
        'Fox_News_Mean': [fox_news_values.mean()],
        'Others_Mean': [others_values.mean()],
        'Mean_Difference': [fox_news_values.mean() - others_values.mean()],
        'Effect_Size': [(fox_news_values.mean() - others_values.mean()) /
                       np.sqrt((fox_news_values.var() + others_values.var()) / 2)]
    })])

print("\nDetailed Statistics for Top Features:")
print(stats_df.round(3).to_string(index=False))

# Display the plot
fig_dist.show()


Detailed Statistics for Top Features:
                Feature  Fox_News_Mean  Others_Mean  Mean_Difference  Effect_Size
    proportion_all_caps          2.657       -0.202            2.859        2.833
         grammar_rating          0.129       -0.010            0.139        0.156
             word_count         -0.177        0.013           -0.190       -0.251
hapax_dislegomena_ratio         -0.067        0.005           -0.072       -0.094
    std_sentence_length          0.124       -0.009            0.133        0.148
       type_token_ratio         -0.096        0.007           -0.104       -0.128
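
The `Effect_Size` column is a standardized mean difference in the style of Cohen's d. Written out from the code above, for one feature with FOX News values F and values O from the other sources:

```latex
d = \frac{\bar{x}_{F} - \bar{x}_{O}}{\sqrt{\tfrac{1}{2}\left(s_{F}^{2} + s_{O}^{2}\right)}}
```

Here $\bar{x}$ is the group mean and $s^{2}$ the sample variance (pandas `var()` uses `ddof=1` by default). The two variances are averaged with equal weight regardless of group size, so this differs from the pooled-variance form that weights each group by its number of articles.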


## Text Summarization on news topics

In [17]:
!pip install transformers torch



In [26]:
from transformers import pipeline
from tqdm import tqdm

In [27]:
# Initialize the summarization pipeline
summarizer = pipeline("summarization", model="google/pegasus-large")

Some weights of PegasusForConditionalGeneration were not initialized from the model checkpoint at google/pegasus-large and are newly initialized: ['model.decoder.embed_positions.weight', 'model.encoder.embed_positions.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


Device set to use cpu


In [28]:
def chunk_text(text, max_tokens=500):
    """
    Splits text into chunks of at most max_tokens whitespace-separated words.

    Word counts are only a rough proxy for the model's subword tokens.

    Args:
        text (str): The input text to chunk.
        max_tokens (int): Maximum number of words per chunk.

    Returns:
        list of str: List of text chunks.
    """
    sentences = text.split(". ")
    chunks = []
    current_chunk = []
    current_length = 0

    for sentence in sentences:
        sentence_length = len(sentence.split())
        # Close the current chunk only if it already holds sentences; otherwise a
        # first sentence longer than max_tokens would emit an empty "." chunk.
        # A single over-long sentence still becomes one oversized chunk.
        if current_chunk and current_length + sentence_length > max_tokens:
            chunks.append(". ".join(current_chunk) + ".")
            current_chunk = []
            current_length = 0
        current_chunk.append(sentence)
        current_length += sentence_length

    # Add the last chunk
    if current_chunk:
        chunks.append(". ".join(current_chunk) + ".")

    return chunks

def create_custom_timeline(texts):
    """
    Summarizes and organizes text into a timeline-like structure.

    Args:
        texts (list of str): List of text documents.

    Returns:
        timeline (list of dict): List of ordered events with summaries and inferred time sequence.
    """
    # Combine all text into one
    combined_text = " ".join(texts)

    # Split text into manageable chunks
    text_chunks = chunk_text(combined_text)

    # Summarize each chunk and show progress with tqdm
    chunk_summaries = []
    for chunk in tqdm(text_chunks, desc="Summarizing chunks", unit="chunk"):
        try:
            summary = summarizer(chunk, max_length=150, min_length=30, do_sample=False)[0]["summary_text"]
            chunk_summaries.append(summary)
        except Exception as e:
            print(f"Error summarizing chunk: {e}")
            chunk_summaries.append("Error summarizing this chunk.")

    # Combine the summaries into a single summary
    final_summary = " ".join(chunk_summaries)

    # Split the summary into individual sentences
    sentences = final_summary.split(". ")

    # Create a timeline structure (ordered list of events)
    timeline = []
    for idx, sentence in enumerate(sentences, start=1):
        timeline.append({
            "event_number": idx,  # Use event numbering for lack of explicit dates
            "event_summary": sentence.strip()
        })

    return timeline
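
Splitting on ". " drops "!" and "?" boundaries and needs the period re-attached afterwards. A possible alternative is to split on the whitespace that follows sentence-ending punctuation, so each sentence keeps its own terminator. The sketch below illustrates that idea; `chunk_by_sentences` and its `max_words` parameter are hypothetical names, and word counts remain only an approximation of model tokens.

```python
import re

def chunk_by_sentences(text, max_words=500):
    """Group whole sentences into chunks of at most max_words words each."""
    # Split on whitespace that follows ., ! or ?, keeping the punctuation attached
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        n_words = len(sentence.split())
        # Flush only a non-empty chunk; an over-long sentence forms its own chunk
        if current and current_len + n_words > max_words:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_words
    if current:
        chunks.append(" ".join(current))
    return chunks

small_chunks = chunk_by_sentences("A b c. D e f! G h?", max_words=3)
larger_chunks = chunk_by_sentences("A b c. D e f! G h?", max_words=6)
```

The regular expression is still a heuristic: abbreviations such as "U.S. officials" will be split in the middle, so a proper sentence segmenter may be preferable for the final pipeline.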


In [29]:
topic_key = 18

label_categorization[topic_key]

{'topic': 'Israel-Syria Conflict',
 'predicted_label': 'Uncategorized',
 'entailment_probs': {'Uncategorized': 0.47758203744888306,
  'Health': 0.09224038571119308,
  'Economy': 0.08775100857019424,
  'Space': 0.07608936727046967,
  'Business': 0.07261011749505997,
  'Sports': 0.07100424915552139,
  'Technology': 0.061911072582006454,
  'Entertainment': 0.06081176921725273}}

In [30]:
topic_filtered_df = df[df['topic'] == topic_key]
topic_filtered_df.head(3)

Unnamed: 0.2,Unnamed: 0.1,Unnamed: 0,id,title,authors,published_date,source,url,content,tags,images,sentiment_label,sentiment_polarity,topic
28,28,28,dc837007-517d-436d-8d8d-14dba595ec77,UN peace exhibit features slogan calling for I...,"Bradford Betz ,Fox News ,Bradford Betz Is A Fo...",,FOX News,https://www.foxnews.com/world/un-peace-exhibit...,FIRST ON FOX – A Global Peace Flag exhibit at ...,,https://static.foxnews.com/foxnews.com/content...,NEGATIVE,-0.999842,18
29,29,29,71bc5a58-88e7-47ee-97bd-3902818ae845,The Lie Trump Is Offering Jewish Voters,"Emily Tamkin ,Jeremy Stahl",,Slate,https://slate.com/news-and-politics/2024/11/tr...,The Heritage Foundation—the same group that dr...,"Donald Trump ,Judaism ,Antisemitism ,2024 Camp...",https://pixel.quantserve.com/pixel/p-fw53_-Tq3...,NEGATIVE,-0.979737,18
172,172,172,ba8dffe3-2994-43c6-a3ce-3be49f53d43d,Head of US Central Command being investigated ...,"Morgan Phillips ,Fox News",,FOX News,https://www.foxnews.com/politics/head-us-centr...,One of the Pentagon’s top generals is under in...,,https://static.foxnews.com/foxnews.com/content...,NEGATIVE,-0.970864,18


In [31]:
texts = []

for idx, article in topic_filtered_df.iterrows():
    text = article['content']
    # Prefix the publication date when it is available (missing dates are NaN)
    if pd.notna(article['published_date']):
        text = 'On ' + str(article['published_date']) + ', ' + text
    texts.append(text)

texts[0]

'FIRST ON FOX – A Global Peace Flag exhibit at the United Nations New York City headquarters features a slogan that many Israelis regard as an explicit call to wipe Israel off the map.. . The picture shows a map of Israel, resembling a watermelon, without any West Bank or Gaza partition. In the top right-hand corner is the Palestinian flag.. . The left side of the map contains the phrase "From the River to the Sea" and the right side contains the phrase, "Will be Free." It is an obvious nod to the phrase, "From the River to the Sea, Palestine Will be Free.". . Supporters of Palestinians maintain that the phrase is merely a slogan to represent the Palestinian struggle against the State of Israel, which they see as an occupying force.. . IRANIAN WOMAN STRIPS DOWN IN ANTI-HIJAB PROTEST FOLLOWING VICIOUS ASSAULT BY REGIME MILITIA. . Israelis, meanwhile, regard the phrase as an explicit call to genocide, a call for Israel to be wiped off the map completely.. . The phrase has gained a resurg

In [32]:
%%time
summary_timeline = create_custom_timeline(texts)

Summarizing chunks:   0%|          | 0/78 [00:20<?, ?chunk/s]


KeyboardInterrupt: 

In [39]:
summary_timeline

[{'event_number': 1,
  'event_summary': 'FIRST ON FOX – A Global Peace Flag exhibit at the United Nations New York City headquarters features a slogan that many Israelis regard as an explicit call to wipe Israel off the map.'},
 {'event_number': 2,
  'event_summary': 'The left side of the map contains the phrase "From the River to the Sea" and the right side contains the phrase, "Will be Free." It is an obvious nod to the phrase, "From the River to the Sea, Palestine Will be Free.".'},
 {'event_number': 3,
  'event_summary': 'Israelis, meanwhile, regard the phrase as an explicit call to genocide, a call for Israel to be wiped off the map completely.'},
 {'event_number': 4,
  'event_summary': '"This appalling display is front and center at the UN, and included art that unambiguously calls for the destruction of the Jewish people and the State of Israel," "We have alerted UN Security to the continued unauthorized interference in the exhibit and to review security footage to find out who 