# **Comprehensive NLP Analysis of Global News Headlines (2019-Present)**

This document presents a **thorough Natural Language Processing (NLP) analysis** of global news headlines spanning from **2019 to the present**. The analysis demonstrates proficiency in fundamental NLP techniques, preprocessing steps, and insightful visualizations that extract meaningful patterns from textual data.

---

## **Table of Contents**

1. [**Introduction**](#introduction)
2. [**Data Exploration**](#data-exploration)
3. [**Data Preprocessing**](#data-preprocessing)
   - [Text Normalization](#text-normalization)
   - [Tokenization](#tokenization)
   - [Stop Words Removal](#stop-words-removal)
   - [Token Unification and Renormalizing Entities](#token-unification)
   - [Part-of-Speech Tagging](#part-of-speech-tagging)
   - [Lemmatization](#lemmatization)
   - [Named Entity Recognition](#named-entity-recognition)
4. [**Feature Extraction**](#feature-extraction)
   - [Bag of Words](#bag-of-words)
   - [TF-IDF Vectorization](#tf-idf-vectorization)
   - [Word Embeddings](#word-embeddings)
5. [**Sentiment Analysis**](#sentiment-analysis)
   - [Monthly Sentiment Trends](#monthly-sentiment-trends)
   - [Year-over-Year Sentiment Comparison](#year-over-year-sentiment-comparison)
6. [**Topic Modeling**](#topic-modeling)
   - [Latent Dirichlet Allocation](#latent-dirichlet-allocation)
   - [Topic Evolution Over Time](#topic-evolution-over-time)
7. [**Entity Analysis**](#entity-analysis)
   - [Most Mentioned Entities](#most-mentioned-entities)
   - [Entity Co-occurrence Networks](#entity-co-occurrence-networks)
8. [**Time Series Analysis**](#time-series-analysis)
   - [Headline Complexity Over Time](#headline-complexity-over-time)
   - [Topic Seasonality](#topic-seasonality)
9. [**Conclusion**](#conclusion)

---

## **Introduction**

This analysis explores a rich dataset of **news headlines** from **2019 to 2023**, covering **25 of the most influential world news headlines for each day**. The dataset is structured with **dates** in the first column, followed by 25 headline columns (`Top1` through `Top25`). By applying various **NLP techniques**, we aim to uncover patterns, trends, and insights that reveal how global news discourse has evolved over this significant period. The work ranges from sentiment analysis of each day and year to recurring patterns of entities and how their sentiment in the media has changed over the years, or as I like to call it, the sentimental derivative.

---

## **Data Exploration**

Let's begin by loading the dataset and exploring its basic structure:

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import re
import string
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# Set visualization styles
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("viridis")
plt.rcParams['figure.figsize'] = (12, 8)

# Load the dataset
news_df = pd.read_csv('datasets/WorldNewsData.csv')

# Display basic information
print(f"Dataset shape: {news_df.shape}")



# Convert the Date column to datetime (note the capital "D").
# Dates look like "May 01, 2018", so the format string is spelled out explicitly.
news_df['Date'] = pd.to_datetime(news_df['Date'], format="%b %d, %Y")

# Display the date range
print(f"\nTime period: {news_df['Date'].min()} to {news_df['Date'].max()}")

# Create a year-month column for temporal analysis
news_df['year_month'] = news_df['Date'].dt.to_period('M')

# Create a column for the year
news_df['year'] = news_df['Date'].dt.year

# Create a column for the month
news_df['month'] = news_df['Date'].dt.month

# Sample a few rows
print("\nSample data:")
news_df.head()


## **Data Preprocessing**

Some dates appear more than once in the original data (for example in October 2020), so let's start by removing the duplicates and cleaning up the dataset. Next, we create a function that combines all headlines for each day into a single text corpus, plus a list that holds the headlines separately, so we can analyze the data both per day and per headline. This distinction will be useful later on.
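As a minimal sketch of these two steps, the toy frame below drops a repeated date and builds both the joined string and the per-headline list. The rows are made up for illustration, and only two headline columns (`Top1`, `Top2`) are used instead of 25.

```python
import pandas as pd

# Toy data (hypothetical): the second row repeats the first date,
# and one headline is missing.
toy = pd.DataFrame({
    "Date": ["Oct 01, 2020", "Oct 01, 2020", "Oct 02, 2020"],
    "Top1": ["Markets rally", "Markets rally again", "Storm hits coast"],
    "Top2": ["Talks resume", "Talks stall", None],
})
toy["Date"] = pd.to_datetime(toy["Date"], format="%b %d, %Y")

# Keep only the first row for each date.
toy = toy.drop_duplicates(subset="Date", keep="first").reset_index(drop=True)

headline_cols = ["Top1", "Top2"]

def separate(row):
    """Return the non-null headlines of one row as a list."""
    return [str(row[c]) for c in headline_cols if pd.notna(row[c])]

# Per-headline view (list) and per-day view (single string).
toy["separate"] = toy.apply(separate, axis=1)
toy["combined"] = toy["separate"].apply(" ".join)

print(toy[["Date", "combined", "separate"]])
```

Keeping both views costs little memory and avoids re-splitting the joined string later, when headline boundaries are no longer recoverable.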

In [None]:
# ensuring 'Date' is in datetime format
news_df['Date'] = pd.to_datetime(news_df['Date'])

# drop duplicates
news_df = news_df.drop_duplicates(subset='Date', keep='first')

def combine_headlines(row):
    # combine all headlines in a row into a single string for easy day based tokenization later
    headlines = []
    # start from column 'Top1' through 'Top25'
    for col in news_df.columns[1:26]:  # Skip the Date column, include only Top1-Top25
        if pd.notna(row[col]):
            headlines.append(str(row[col]))
    return ' '.join(headlines)

# Apply the function to create a new column with combined headlines
news_df['combined_headlines'] = news_df.apply(combine_headlines, axis=1)

headlines_per_month = news_df.groupby(news_df['Date'].dt.to_period('M')).size() * 25

plt.figure(figsize=(14, 6))
headlines_per_month.plot(kind='bar')
plt.title('Total Number of Headlines per Month')
plt.xlabel('Month')
plt.ylabel('Headline Count')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()

# Create a new column that stores the list of non-null individual headlines (Top1–Top25)
def get_separate_headlines(row):
    """Return a list of non-empty individual headlines"""
    return [str(row[col]) for col in news_df.columns[1:26] if pd.notna(row[col])]

news_df['separate_headlines'] = news_df.apply(get_separate_headlines, axis=1)

# Count number of headlines (not assuming all 25 filled)
news_df['headline_count'] = news_df['separate_headlines'].apply(len)

# Group by month and sum all actual headlines
separate_headlines_per_month = news_df.groupby(news_df['Date'].dt.to_period('M'))['headline_count'].sum()


### **Text Normalization**

Now the data looks clean, with no unusual spikes that would indicate duplicate data. Before going deeper into processing, we need to normalize the dataset. Normalization is like tidying up: we lowercase everything and remove symbols, numbers, and non-ASCII characters. Depending on the task, we may also remove words like "in", "and", "out", and "of" (called stop words), because they add noise without carrying much meaning. We actually took our first step already by removing duplicates! We will do a general normalization here and may use more in-depth techniques like lemmatization or stemming later, based on our needs.
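To make the idea concrete, here is a small standard-library sketch of a few typical normalization steps applied to one made-up headline. The function name `mini_normalize` and the sample text are assumptions for illustration; it is a simplified stand-in, not the full function used in this notebook.

```python
import html
import re
import string

def mini_normalize(text):
    """Toy normalizer: unescape HTML, lowercase, drop possessives, keep letters only."""
    text = html.unescape(text)                 # "&amp;" -> "&"
    text = text.lower()
    text = re.sub(r"['’]s\b", "", text)        # "china's" -> "china"
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    text = re.sub(r"[^a-z\s]", "", text)       # drop digits and non-ASCII characters
    return re.sub(r"\s+", " ", text).strip()   # collapse whitespace

sample = "China's GDP grew 6.1% in 2019 &amp; markets rallied!"
print(mini_normalize(sample))  # -> china gdp grew in markets rallied
```

Note the order: HTML entities are decoded before punctuation is stripped, otherwise `&amp;` would leave the fragment `amp` behind.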

Next, let's define a function, `normalize_text()`, that applies every normalization step we want, and store its output in new DataFrame columns.

In [None]:
# !pip install spacy contractions

# python -m spacy download en_core_web_sm
import re
import string
import html
import contractions
import spacy
import pandas as pd

# Load spaCy model (blank English model to avoid over-processing)
nlp = spacy.blank("en")

def safe_expand_contractions(text):
    """Safely expand contractions, handling non-English text"""
    if not isinstance(text, str) or not text.strip():
        return text
    
    # Skip processing if text contains non-ASCII characters that might cause issues
    if any(ord(c) > 127 for c in text):
        return text  # Return original text without expansion
    
    # Process contractions for English text
    try:
        return contractions.fix(text)
    except Exception as e:
        print(f"Contraction expansion error: {e} in text: {text[:50]}...")
        return text  # Return original text on error

def normalize_text(text):
    if not isinstance(text, str) or not text.strip():
        return ""

    # Step 1: Decode HTML entities first ("&amp;" -> "&"). Doing this after
    # punctuation removal would leave stray fragments such as "amp" behind.
    text = html.unescape(text)
    text = re.sub(r'&[a-zA-Z#0-9]+;', '', text)  # Remove any remaining entities

    # Step 2: Pre-replace dotted acronyms before punctuation is stripped
    acronyms = {"U.N.": "UN", "U.S.": "US", "E.U.": "EU", "u.s.": "US",
                "U.K.": "UK", "N.Y.": "NY", "L.A.": "LA"}
    for k, v in acronyms.items():
        text = text.replace(k, v)

    # Step 3: Expand contractions with safe handling
    text = safe_expand_contractions(text)

    # Step 4: Lowercase
    text = text.lower()

    # Step 5: Remove possessive 's or ’s (straight and curly apostrophes)
    text = re.sub(r"\b(\w+)['’]\s*s\b", r"\1", text)    # Standard possessives
    text = re.sub(r"\b(\w+s)['’](?!\w)", r"\1", text)   # Plural possessives (workers')

    # Step 6: Remove remaining apostrophes (shouldn't be needed, for edge cases)
    text = re.sub(r"(\w+)['’](\w+)", r"\1\2", text)
    text = re.sub(r"(\w+)['’]", r"\1", text)

    # Step 7: Remove punctuation and smart quotes/dashes
    punct_chars = string.punctuation + "–—‘’“”•"
    text = re.sub(f"[{re.escape(punct_chars)}]", "", text)

    # Step 8: Keep only ASCII letters and whitespace
    # Note: this also removes digits, accented letters, and emojis
    text = re.sub(r'[^a-zA-Z\s]', '', text)

    # Step 9: Remove any remaining single s (from cases like "60's")
    text = re.sub(r'\b[sS]\b', '', text)

    # Step 10: Collapse whitespace
    return re.sub(r'\s+', ' ', text).strip()

# Make sure the dataframe is properly filtered before processing
news_df = news_df[
    news_df['Date'].notna() & 
    news_df['combined_headlines'].notna() & 
    (news_df['combined_headlines'].str.strip() != '')
]

# Apply normalization
news_df['normalized_text'] = news_df['combined_headlines'].apply(normalize_text)

# This one is the headlines divided individually instead of a per-day basis of concatenation.
# As stated, this might be useful later on to track the sentimental derivative
# of specific entities. This line applies our normalize function to every h in lst,
# or specifically every headline in every row and column.
news_df['normalized_separate'] = news_df['separate_headlines'].apply(
    lambda lst: [normalize_text(h) for h in lst if h]  # Added check for empty headlines
)

# Display a sample of normalized text
print("Original headline:")
print(news_df['combined_headlines'].iloc[0][:400])
print("\nNormalized headline:")
print(news_df['normalized_text'].iloc[0][:400])
print("\nOriginal Separated headline:")
print(news_df['separate_headlines'].iloc[0][:10])
print("\nNormalized Separated headline:")
print(news_df['normalized_separate'].iloc[0][:10])

We are getting there. Now it's time for tokenization!

### **Tokenization**

Tokenization is the process of breaking text into individual words, or tokens. The cell below uses the rule-based tokenizer from a blank spaCy English pipeline (`spacy.blank("en")`), which does not require downloading a trained model.
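As a dependency-free illustration of what tokenization produces, the sketch below splits an already-normalized string on whitespace. This is a deliberate simplification: a full tokenizer also handles punctuation and special cases, but on text that contains only lowercase letters and spaces a whitespace split is a reasonable approximation. The helper name `simple_tokenize` is hypothetical.

```python
def simple_tokenize(text):
    """Split normalized text into tokens on whitespace; non-strings give []."""
    if not isinstance(text, str):
        return []
    return text.split()

headline = "china gdp grew in markets rallied"
tokens = simple_tokenize(headline)

print(tokens)
# Two simple statistics of the kind computed later: token count and mean token length.
print(len(tokens), sum(len(t) for t in tokens) / len(tokens))
```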

We stored the normalized text in the `normalized_text` column. Let's play around with it and tokenize!

In [None]:
import spacy
from spacy.tokenizer import Tokenizer
from spacy.lang.en.stop_words import STOP_WORDS
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Initialize spaCy with custom tokenizer settings
nlp = spacy.blank("en")
tokenizer = nlp.tokenizer

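# Add special cases to the tokenizer.
# Note: normalize_text() strips digits and hyphens, so these forms only occur
# if the tokenizer is applied to text that has not been normalized.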
special_cases = [{"ORTH": "covid19"}, {"ORTH": "covid-19"}]
for case in special_cases:
    nlp.tokenizer.add_special_case(case["ORTH"], [case])

def tokenize_text(text):
    """Splits text into individual words using spaCy's tokenizer with error handling"""
    if not text or not isinstance(text, str):
        return []
    try:
        tokens = tokenizer(text)
        # Return only non-empty tokens (optional: filter stop words)
        # Uncomment the stop words filter if needed
        # return [token.text for token in tokens if token.text.strip() and token.text.lower() not in STOP_WORDS]
        return [token.text for token in tokens if token.text.strip()]
    except Exception as e:
        print(f"Tokenization error: {e} in text: {text[:50]}...")
        return []

# Function for batch processing
def batch_tokenize(texts, batch_size=1000):
    """Process texts in batches for better performance"""
    # For simple tokenization, we can use the tokenizer directly
    return [tokenize_text(text) for text in texts]
    
    # For more complex NLP tasks, use spaCy's pipe:
    # results = []
    # for doc in nlp.pipe(texts, batch_size=batch_size):
    #     results.append([token.text for token in doc])
    # return results

# Apply tokenization to all texts
news_df['tokens'] = batch_tokenize(news_df['normalized_text'])

# For separate headlines - with error handling
def tokenize_headline_list(headline_list):
    if not headline_list:
        return []
    return [tokenize_text(h) for h in headline_list]

news_df['separate_tokens'] = news_df['normalized_separate'].apply(tokenize_headline_list)

# Display samples of both token types (combined and separate)
print("1. COMBINED TEXT TOKENS (first 40):")
print(news_df['tokens'].iloc[0][:40])

print("\n2. SEPARATE HEADLINE TOKENS (first 5 headlines):")
sample_headlines = news_df['separate_tokens'].iloc[0][:5] # Get first 5 headlines from first row
for i, headline_tokens in enumerate(sample_headlines):
    print(f" Headline {i+1}: {headline_tokens}")

# Token statistics with error handling
news_df['token_count'] = news_df['tokens'].apply(len)
news_df['avg_token_length'] = news_df['tokens'].apply(
    lambda x: np.mean([len(token) for token in x]) if x else 0
)

# separate_headlines total token count per day
news_df['separate_token_count'] = news_df['separate_tokens'].apply(
    lambda lst: sum(len(tokens) for tokens in lst) if lst else 0
)

# compute average word length with error handling
news_df['separate_avg_token_length'] = news_df['separate_tokens'].apply(
    lambda lst: np.mean([len(token) for tokens in lst for token in tokens]) if lst and any(tokens for tokens in lst) else 0
)

# each headline is treated like a sentence here
news_df['separate_sentence_count'] = news_df['normalized_separate'].apply(len)

# graph month column
news_df['Month'] = pd.to_datetime(news_df['Date']).dt.to_period('M').astype(str)

# Display statistics about both token types (combined and separate)
print("\nSTATISTICS SUMMARY:")
print(f"Combined text tokens: {news_df['token_count'].sum():,} total tokens")
print(f"Separate headlines: {news_df['separate_sentence_count'].sum():,} total headlines")
print(f"Separate headline tokens: {news_df['separate_token_count'].sum():,} total tokens")

# Visualization code
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 6))

# Boxplot of token counts by month
sns.boxplot(x='Month', y='token_count', data=news_df, ax=axes[0])
axes[0].set_title('Distribution of Token Counts by Month')
axes[0].set_xlabel('Month')
axes[0].set_ylabel('Token Count')
axes[0].tick_params(axis='x', rotation=45)

# Prepare data for grouped bar chart
news_df['separate_avg_tokens'] = news_df['separate_token_count'] / news_df['separate_sentence_count'].clip(lower=1)

monthly_stats = news_df.groupby('Month').agg({
    'token_count': 'mean',
    'separate_token_count': 'mean',
    'separate_avg_tokens': 'mean',
}).reset_index()

# Barplot of average token count by month
sns.barplot(x='Month', y='token_count', data=monthly_stats, ax=axes[1], color='blue', alpha=0.7)
axes[1].set_title('Average Tokens per Day')
axes[1].set_xlabel('Month')
axes[1].set_ylabel('Token Count')
axes[1].tick_params(axis='x', rotation=45)

plt.tight_layout()
plt.show()

### **Stop Words Removal**

Next, we tidy up further until only the meaningful part of the data remains. That part includes people, places, and everything else we call entities. It excludes function words like "of", "on", "and", "in", and "the", as well as words that are not worth tracking: we care about who or what is being talked about, not the verb they are doing. For example, in the sentence "Donald Trump stated that China should revert all tariffs back or...", we are interested in "Donald Trump", "China", and "tariffs", but not in "that", "stated", "all", or "or". "Revert" and "should" can be useful for sentiment analysis, so they stay. For the same reason, we keep a few words that appear on the default stop word list, so that removing them does not skew the sentiment analysis. Apostrophes were already cleaned up during normalization, so words like "doesn't" or "Trump's" do not produce multiple tokens.
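The example sentence above can be run through a toy filter to show the mechanics. The stop list here is a hypothetical miniature chosen for illustration; the actual cell starts from spaCy's much larger `STOP_WORDS` set.

```python
# Hypothetical miniature stop list (illustration only).
base_stops = {"that", "all", "or", "the", "should", "back"}
custom_stops = {"stated"}          # domain-specific additions
keep_for_sentiment = {"should"}    # words deliberately kept despite being stop words

# Final list = (base + custom) minus the words we want to keep.
stop_words = (base_stops | custom_stops) - keep_for_sentiment

tokens = "donald trump stated that china should revert all tariffs back".split()
kept = [t for t in tokens if t not in stop_words]

print(kept)  # -> ['donald', 'trump', 'china', 'should', 'revert', 'tariffs']
```

Building the final set from three smaller sets keeps each decision (default, added, protected) easy to audit.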

In [None]:
import spacy
from spacy.lang.en.stop_words import STOP_WORDS

# Load spaCy English model - only if needed for other purposes
# Use blank model if only stopwords are needed to improve performance
nlp = spacy.blank("en")  # Changed from spacy.load to save memory/time

# Base stop words from spaCy
stop_words = STOP_WORDS.copy()

# Add your custom stop words
custom_stops = {
    'says', 'said', 'reuters', 'ap', 'afp', 'report', 'reports', 
    'or', 'stated', 'for', 'new', 'u', 'amp', 'is', 'in', 'news'
}
stop_words.update(custom_stops)

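# Keep words that matter for sentiment analysis. Also keep 'us': after
# normalization "U.S." becomes "us", which spaCy treats as a stop word, and
# dropping it would remove the token that the synonym map later unifies to 'usa'
# (trade-off: the pronoun "us" is kept as well).
stop_words.difference_update({'not', 'very', 'so', 'should', 'if', 'us'})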

# Function to remove stopwords with error handling
def remove_stopwords(tokens):
    """Remove stopwords from a token list with error handling"""
    if not tokens:
        return []
    try:
        return [token for token in tokens if token.lower() not in stop_words]
    except Exception as e:
        print(f"Error removing stopwords: {e}")
        return tokens

# Function to process nested lists (one list per headline)
# and DROP any headline that becomes empty after stop-word removal
def remove_stopwords_from_separate_headlines(headline_tokens_list):
    """Process nested token lists and filter empty results"""
    if not headline_tokens_list:
        return []
        
    cleaned_headlines = []
    for tokens in headline_tokens_list:
        filtered = remove_stopwords(tokens)
        if filtered:  # keep only non-empty results
            cleaned_headlines.append(filtered)
    return cleaned_headlines

# Apply the functions - with duplication check
if 'tokens_nostop' not in news_df.columns:
    news_df['tokens_nostop'] = news_df['tokens'].apply(remove_stopwords)

if 'separate_tokens_nostop' not in news_df.columns:
    news_df['separate_tokens_nostop'] = news_df['separate_tokens'].apply(
        remove_stopwords_from_separate_headlines
    )

# Print some statistics
print(f"Original token count: {sum(news_df['tokens'].apply(len)):,}")
print(f"Tokens after stopword removal: {sum(news_df['tokens_nostop'].apply(len)):,}")
percent_reduction = (1 - sum(news_df['tokens_nostop'].apply(len))/sum(news_df['tokens'].apply(len))) * 100
print(f"Reduction: {percent_reduction:.2f}%")

# Check for any empty token lists after stopword removal
empty_token_rows = news_df[news_df['tokens_nostop'].apply(len) == 0]
if not empty_token_rows.empty:
    print(f"\nWarning: {len(empty_token_rows)} rows have empty token lists after stopword removal")

### **Token Unification**

If we listed the tokens and how many times each is used, we would see near-duplicate tokens like covid (2000 uses) and coronavirus (1500 uses). To counter this, we create an entity renormalization table of synonyms that unifies such tokens according to the rules we provide.
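A minimal sketch of the idea, assuming a tiny hypothetical synonym table: phrases are matched greedily, longest first, so that a two-word key such as "united states" is replaced as a unit before single words are considered.

```python
from collections import Counter

# Hypothetical mini synonym table (illustration only).
mapping = {"coronavirus": "covid19", "covid": "covid19", "united states": "usa"}

def unify(tokens, mapping, max_len=3):
    """Greedy longest-match replacement of tokens/phrases with canonical names."""
    out, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in mapping:
                out.append(mapping[phrase])
                i += n
                break
        else:
            # No phrase starting at position i matched: keep the token as is.
            out.append(tokens[i])
            i += 1
    return out

tokens = ["covid", "cases", "rise", "united", "states", "coronavirus", "deaths"]
unified = unify(tokens, mapping)

print(unified)
print(Counter(unified)["covid19"])  # both spellings now count toward one token
```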

In [None]:
# Synonym map for token normalization
synonym_map = {
    # COVID-related terms
    'covid': 'covid19',
    'covid-19': 'covid19',
    'coronavirus': 'covid19',
    'covid19': 'covid19',
    'corona': 'covid19',  
    'corona-virus': 'covid19',
    'sars-cov-2': 'covid19',
    'sarscov2': 'covid19',

    # US related terms
    'us': 'usa',
    'u.s.': 'usa',
    'u.s.a.': 'usa',
    'united states': 'usa',
    'america': 'usa',
    'american': 'usa',
    'americans': 'usa',

    # UK related terms
    'uk': 'united_kingdom',
    'u.k.': 'united_kingdom',
    'britain': 'united_kingdom',
    'british': 'united_kingdom',

    # Political terms
    'democrat': 'democrats',
    'democratic': 'democrats',
    'republican': 'republicans',
    'gop': 'republicans',

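    # Politicians (full names are listed as phrases so that "joe biden"
    # becomes one token instead of two identical ones)
    'joe biden': 'joe_biden',
    'president biden': 'joe_biden',
    'biden': 'joe_biden',
    'joe': 'joe_biden',

    'donald trump': 'donald_trump',
    'president trump': 'donald_trump',
    'trump': 'donald_trump',
    'donald': 'donald_trump',

    'vladimir putin': 'vladimir_putin',
    'president putin': 'vladimir_putin',
    'putin': 'vladimir_putin',
    'vladimir': 'vladimir_putin',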
    
    # Common variations
    'govt': 'government',
    'gov': 'government',
    'admin': 'administration',
    'intl': 'international',
    'corp': 'corporation',
    'co': 'company',
    'cos': 'companies',
    'ceo': 'chief_executive_officer',
}

# Create case-insensitive mapping for better matching
case_insensitive_map = {k.lower(): v for k, v in synonym_map.items()}

# Enhanced token unification with multi-word phrase handling and error checking
def unify_tokens_with_phrases(tokens, mapping):
    """Normalize tokens using a synonym mapping with multi-word phrase support"""
    if not tokens:
        return []
        
    try:
        normalized_tokens = []
        i = 0
        while i < len(tokens):
            # Check for multi-word phrases (up to 3 words)
            found_match = False
            for n in range(min(3, len(tokens) - i), 0, -1):  # Safe range checking
                if i + n <= len(tokens):
                    phrase = ' '.join(tokens[i:i+n]).lower()
                    if phrase in mapping:
                        normalized_tokens.append(mapping[phrase])
                        i += n
                        found_match = True
                        break
            
            # If no phrase match, just normalize the single token
            if not found_match:
                token_lower = tokens[i].lower()
                normalized_tokens.append(mapping.get(token_lower, tokens[i]))
                i += 1
        
        return normalized_tokens
    except Exception as e:
        print(f"Error in token unification: {e} for tokens: {tokens[:5]}...")
        return tokens

# Function to safely process nested headline tokens
def unify_headline_tokens(headline_list, mapping):
    """Process a list of headline token lists"""
    if not headline_list:
        return []
    
    try:
        return [unify_tokens_with_phrases(tokens, mapping) for tokens in headline_list]
    except Exception as e:
        print(f"Error processing headline tokens: {e}")
        return headline_list

# Apply the enhanced normalization with phrase detection
if 'tokens_nostop' in news_df.columns:
    news_df['tokens_normalized'] = news_df['tokens_nostop'].apply(
        lambda tokens: unify_tokens_with_phrases(tokens, case_insensitive_map)
    )
else:
    print("Warning: 'tokens_nostop' column not found. Please run stopword removal first.")

# For separate headline tokens
if 'separate_tokens_nostop' in news_df.columns:
    news_df['separate_tokens_normalized'] = news_df['separate_tokens_nostop'].apply(
        lambda headline_list: unify_headline_tokens(headline_list, case_insensitive_map)
    )
else:
    print("Warning: 'separate_tokens_nostop' column not found. Please run stopword removal first.")

# Calculate token frequencies after normalization to check if duplicates were consolidated
from collections import Counter

# Only continue if required columns exist
if 'tokens_normalized' in news_df.columns:
    # Gather statistics
    all_normalized_tokens = [token for tokens in news_df['tokens_normalized'].tolist() for token in tokens]
    normalized_token_freq = Counter(all_normalized_tokens)
    print("\n--- TOP 20 NORMALIZED TOKENS ---")
    print(normalized_token_freq.most_common(20))

    # Compare frequencies before and after normalization
    if 'tokens_nostop' in news_df.columns:
        all_tokens_nostop = [token.lower() for tokens in news_df['tokens_nostop'].tolist() for token in tokens]
        token_freq_before = Counter(all_tokens_nostop)
        
        # Calculate the number of unique tokens before and after normalization
        unique_before = len(token_freq_before)
        unique_after = len(normalized_token_freq)
        
        print(f"\nUnique tokens before normalization: {unique_before}")
        print(f"Unique tokens after normalization: {unique_after}")
        print(f"Reduction in unique tokens: {unique_before - unique_after} ({(1 - unique_after/unique_before)*100:.2f}%)")
        
        # Show examples of normalized terms
        print("\nSample normalizations:")
        for original, normalized in [('covid', 'covid19'), ('us', 'usa'), ('trump', 'donald_trump')]:
            orig_count = token_freq_before.get(original, 0)
            norm_count = normalized_token_freq.get(case_insensitive_map.get(original, original), 0)
            if orig_count > 0 or norm_count > 0:
                print(f"  '{original}' → '{case_insensitive_map.get(original)}': {orig_count} → {norm_count}")

Inspect the results!

In [None]:
from collections import Counter
import matplotlib.pyplot as plt # Import for plotting
import pandas as pd # Assuming news_df is a pandas DataFrame

# --- TOKEN COUNTS & FREQUENCIES ---
print("\n--- TOKEN ANALYSIS ---")

# Calculate counts from the combined list of tokens
# Assuming 'tokens' represents the concatenated tokens for a row
token_count_before = sum(news_df['tokens'].apply(len))
token_count_after = sum(news_df['tokens_nostop'].apply(len))

print("Original total token count:", token_count_before)
print("Total token count after stopword removal:", token_count_after)
print("Reduction: {:.2f}%".format((1 - token_count_after / token_count_before) * 100))

# Display sample for tokens (from the combined list)
print("\nSample before stopword removal (first 20 tokens in the first row):")
# Handle potential empty lists gracefully
print(news_df['tokens'].iloc[0][:20] if news_df['tokens'].iloc[0] else "[]")
print("\nSample after stopword removal (first 20 tokens in the first row):")
# Handle potential empty lists gracefully
print(news_df['tokens_nostop'].iloc[0][:20] if news_df['tokens_nostop'].iloc[0] else "[]")


# Flatten the list of lists of tokens after stopword removal across the entire DataFrame
# This collects *all* non-stopword tokens into a single list for frequency counting
all_tokens_nostop = [token for tokens_list in news_df['tokens_nostop'].tolist() for token in tokens_list]

# Calculate frequency distribution
token_freq = Counter(all_tokens_nostop)

# Print Top N tokens
N = 20
print(f"\n--- TOP {N} TOKENS WITHOUT STOPWORDS ---")
top_tokens = token_freq.most_common(N)
print(top_tokens)

# --- VISUALIZATION ---
print(f"\n--- PLOTTING TOP {N} TOKENS ---")

if top_tokens:
    # Prepare data for plotting
    tokens, counts = zip(*top_tokens)

    plt.figure(figsize=(12, 7)) # Adjust figure size for better readability
    plt.bar(tokens, counts, color='teal') # Using a different color
    plt.xlabel('Tokens', fontsize=12) # Increase font size for labels
    plt.ylabel('Frequency', fontsize=12)
    plt.title(f'Top {N} Token Frequencies (Without Stopwords)', fontsize=14) # Increase font size for title
    plt.xticks(rotation=45, ha='right', fontsize=10) # Rotate and adjust alignment of labels
    plt.yticks(fontsize=10) # Increase font size for y-axis ticks
    plt.grid(axis='y', linestyle='--', alpha=0.7) # Add a grid for better readability
    plt.tight_layout() # Adjust layout to prevent labels overlapping
    plt.show()
else:
    print("No tokens found to plot.")

### **Part-of-Speech Tagging**

Now that we have clean, unified tokens, we can perform Part-of-Speech (POS) tagging. POS tagging assigns a grammatical category (such as noun, verb, or adjective) to each token, which is crucial for understanding sentence structure and meaning.

We will do POS tagging and lemmatization in the same cell, for the reasons explained in the next section.


### **Lemmatization**

Lemmatization reduces words to their base form (lemma). We do this to track trends and sentiment more robustly. You might wonder why we take the normalized tokens as input instead of the POS-tagged column we are about to create. The reason is that the POS-tagged column is a list of lists of tuples, while lemmatization needs lists of plain token strings. The workaround is to run POS tagging and lemmatization in a single pass over the normalized token lists and create four new columns in total: POS-tagged combined and separate, and lemmatized combined and separate.

Lemmatization uses the POS tag as context when reducing a word to its lemma, so POS information is crucial for accurate lemmatization. For this and the reasons above, both steps live in a single cell.
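To illustrate the data shapes involved, the sketch below starts from hand-written `(text, POS, lemma)` triples, which stand in for what a tagger would return per token. The values are illustrative and not produced by a model here. It shows how one pass over the triples can yield both the POS view and the plain lemma lists.

```python
# Hand-written (text, POS, lemma) triples standing in for tagger output;
# one inner list per headline (hypothetical values).
tagged_headlines = [
    [("leaders", "NOUN", "leader"), ("met", "VERB", "meet")],
    [("prices", "NOUN", "price"), ("rising", "VERB", "rise")],
]

# One pass over the triples yields both views for every headline.
pos_separate = [[(text, pos) for text, pos, _ in h] for h in tagged_headlines]
lemmas_separate = [[lemma for _, _, lemma in h] for h in tagged_headlines]

# For brevity, the per-day view is obtained here by flattening the
# per-headline lemma lists.
lemmas_combined = [lemma for h in lemmas_separate for lemma in h]

print(pos_separate[0])
print(lemmas_combined)
```

Storing the triples once and deriving the other columns from them avoids running the tagger twice over the same text.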

In [None]:
# ================================================================
# LEMMATIZATION WITH POS TAGS FOR HEADLINES
# ================================================================
import spacy
from itertools import compress
from collections import defaultdict
from tqdm import tqdm
import warnings


# Load the medium English model (requires: python -m spacy download en_core_web_md).
# The tagger and lemmatizer stay enabled; parser, NER, and senter are disabled for speed.
nlp = spacy.load("en_core_web_md", disable=["parser", "ner", "senter"])

# Define a function to safely join tokens
def safe_join(tokens):
    """Safely join tokens into a string with error handling"""
    if not tokens:
        return ""
    try:
        return " ".join(str(tok) for tok in tokens)
    except Exception as e:
        print(f"Error joining tokens: {e}")
        return ""

# Function to safely process a batch with retry for errors
def process_batch_with_retry(texts, batch_size=512, n_process=None):
    """Process a batch of texts with spaCy and retry on failures"""
    results = []
    errors = []
    
    # Use the recommended way to determine processes
    if n_process is None:
        # Let spaCy decide based on available cores
        n_process = -1
    
    # Process texts with tqdm for progress tracking
    with tqdm(total=len(texts), desc="Processing texts") as pbar:
        # Safely process with retries on errors
        try:
            # Process texts in batches
            for i, doc in enumerate(nlp.pipe(texts, batch_size=batch_size, n_process=n_process)):
                try:
                    # Extract text, POS tag, and lemma for each token
                    results.append([(tok.text, tok.pos_, tok.lemma_) for tok in doc])
                    pbar.update(1)
                except Exception as e:
                    print(f"Error processing text at index {i}: {e}")
                    errors.append(i)
                    results.append([])  # Add empty result to maintain index alignment
                    pbar.update(1)
        except Exception as e:
            print(f"Batch processing error: {e}")
            # nlp.pipe stopped early: process the remaining texts one by one
            for j in range(len(results), len(texts)):
                try:
                    doc = nlp(texts[j])
                    results.append([(tok.text, tok.pos_, tok.lemma_) for tok in doc])
                except Exception as e2:
                    print(f"Sequential retry failed for text at index {j}: {e2}")
                    results.append([])  # Add empty result to maintain index alignment
                pbar.update(1)
    
    return results

# ------------------------------------------------
# 1)  COMBINED-HEADLINE LEMMATIZATION
# ------------------------------------------------
print("Lemmatizing and POS tagging – combined headlines...")

# Check if required column exists
if 'tokens_normalized' not in news_df.columns:
    print("Warning: 'tokens_normalized' column not found. Please run token normalization first.")
else:
    # Build list of texts and remember their DataFrame indices
    texts_combined = [safe_join(toks) for toks in news_df['tokens_normalized']]
    non_empty_mask = [bool(txt.strip()) for txt in texts_combined]
    texts_non_empty = list(compress(texts_combined, non_empty_mask))
    idx_mapping = [i for i, keep in enumerate(non_empty_mask) if keep]
    
    # Show stats
    print(f"Processing {len(texts_non_empty):,} non-empty texts out of {len(texts_combined):,} total")
    
    # Handle edge case of no texts
    if not texts_non_empty:
        warnings.warn("No non-empty texts found for lemmatization!")
        lemma_pos_combined = [[] for _ in range(len(news_df))]
    else:
        # Prepare result container (same length as DataFrame)
        lemma_pos_combined = [[] for _ in range(len(news_df))]
        
        # Determine optimal batch size based on text length
        avg_len = sum(len(t) for t in texts_non_empty) / max(1, len(texts_non_empty))
        batch_size = max(32, min(512, int(10000 / max(1, avg_len))))
        print(f"Using batch size of {batch_size} based on average text length of {avg_len:.1f} chars")
        
        # Determine optimal number of processes
        import os
        suggested_processes = min(4, os.cpu_count() or 1)
        print(f"Using {suggested_processes} processes for parallel processing")
        
        # Run spaCy in batches / multi-process
        lemma_results = process_batch_with_retry(
            texts_non_empty, 
            batch_size=batch_size, 
            n_process=suggested_processes
        )
        
        # Map results back to DataFrame positions
        for i, df_idx in enumerate(idx_mapping):
            if i < len(lemma_results):
                lemma_pos_combined[df_idx] = lemma_results[i]
    
    # Add to DataFrame
    news_df['lemma_pos_combined'] = lemma_pos_combined
    
    # Create a column with just the lemmatized tokens for convenience
    news_df['lemmas_combined'] = [
        [item[2] for item in token_list] if token_list else [] 
        for token_list in news_df['lemma_pos_combined']
    ]
    
    # Quick stats
    total_lemmas = sum(len(lemmas) for lemmas in news_df['lemmas_combined'])
    print(f"Total lemmas generated: {total_lemmas:,}")

# ------------------------------------------------
# 2)  SEPARATE-HEADLINE LEMMATIZATION
# ------------------------------------------------
print("\nLemmatizing and POS tagging – separate headlines...")

# Check if required column exists
if 'separate_tokens_normalized' not in news_df.columns:
    print("Warning: 'separate_tokens_normalized' column not found. Please run token normalization first.")
else:
    texts_sep = []   # flattened texts to send to spaCy
    row_map = []     # DataFrame row index for each text
    sub_map = []     # sub-list position inside that row
    
    # Handle potential iteritems vs items method difference (pandas version compatibility)
    iter_method = getattr(news_df['separate_tokens_normalized'], 'items', None)
    if iter_method is None:
        iter_method = news_df['separate_tokens_normalized'].iteritems
    
    # Collect texts with their mapping information
    for row_idx, headline_lists in iter_method():
        if not headline_lists:
            continue
            
        for sub_idx, toks in enumerate(headline_lists):
            if toks:  # already filtered empties
                text = safe_join(toks)
                if text.strip():  # Make sure we have actual content
                    texts_sep.append(text)
                    row_map.append(row_idx)
                    sub_map.append(sub_idx)
    
    # Show stats
    print(f"Processing {len(texts_sep):,} separate headlines")
    
    # Handle edge case of no texts
    if not texts_sep:
        warnings.warn("No non-empty separate headlines found for lemmatization!")
        news_df['lemma_pos_separate'] = [[] for _ in range(len(news_df))]
        news_df['lemmas_separate'] = [[] for _ in range(len(news_df))]
    else:
        # Determine optimal batch size based on text length
        avg_len = sum(len(t) for t in texts_sep) / max(1, len(texts_sep))
        batch_size = max(32, min(512, int(10000 / max(1, avg_len))))
        print(f"Using batch size of {batch_size} based on average text length of {avg_len:.1f} chars")
        
        # Determine optimal number of processes
        import os
        suggested_processes = min(4, os.cpu_count() or 1)
        print(f"Using {suggested_processes} processes for parallel processing")
        
        # Lemmatize and POS-tag all in one go with progress display
        docs_lemma = process_batch_with_retry(
            texts_sep, 
            batch_size=batch_size, 
            n_process=suggested_processes
        )
        
        # Re-assemble back into the original nested structure
        lemma_dict = defaultdict(list)   # row_idx -> list of lists
        
        for i, (row_idx, sub_idx) in enumerate(zip(row_map, sub_map)):
            if i >= len(docs_lemma):
                continue
                
            lemma_pos = docs_lemma[i]
            
            # ensure outer list long enough
            while len(lemma_dict[row_idx]) <= sub_idx:
                lemma_dict[row_idx].append([])
            
            lemma_dict[row_idx][sub_idx] = lemma_pos
        
        # Add to DataFrame (a list comprehension keeps each value a plain list)
        news_df['lemma_pos_separate'] = [lemma_dict.get(i, []) for i in news_df.index]
        
        # Create a nested list with just the lemmas for convenience
        news_df['lemmas_separate'] = news_df['lemma_pos_separate'].apply(
            lambda headline_lists: [
                [item[2] for item in tokens] if tokens else []
                for tokens in headline_lists
            ] if headline_lists else []
        )
        
        # Quick stats
        total_headlines = sum(len(lemmas) for lemmas in news_df['lemmas_separate'])
        total_lemmas = sum(len(headline) for headlines in news_df['lemmas_separate'] for headline in headlines)
        print(f"Total headlines processed: {total_headlines:,}")
        print(f"Total lemmas generated: {total_lemmas:,}")

# Sample display
print("\nSample lemmatization results:")
if 'lemma_pos_combined' in news_df.columns and len(news_df) > 0:
    sample_lemmas = news_df['lemma_pos_combined'].iloc[0][:5]
    print(f"First 5 combined lemmas with POS: {sample_lemmas}")
    
    sample_just_lemmas = news_df['lemmas_combined'].iloc[0][:5]
    print(f"First 5 combined lemmas only: {sample_just_lemmas}")

if 'lemma_pos_separate' in news_df.columns and len(news_df) > 0:
    # Get first headline's lemmas from first row
    if news_df['lemma_pos_separate'].iloc[0] and news_df['lemma_pos_separate'].iloc[0][0]:
        sample_sep_lemmas = news_df['lemma_pos_separate'].iloc[0][0][:5]
        print(f"First 5 lemmas with POS from first separate headline: {sample_sep_lemmas}")
        
        sample_sep_just_lemmas = news_df['lemmas_separate'].iloc[0][0][:5] if news_df['lemmas_separate'].iloc[0] else []
        print(f"First 5 lemmas only from first separate headline: {sample_sep_just_lemmas}")

# Display information about the columns we've added
print("\nNew columns added to DataFrame:")
print("  - lemma_pos_combined: List of (token, POS, lemma) tuples for combined headlines")
print("  - lemmas_combined: List of just lemmas for combined headlines")
print("  - lemma_pos_separate: Nested lists of (token, POS, lemma) tuples for separate headlines")
print("  - lemmas_separate: Nested lists of just lemmas for separate headlines")

# Display a simple visualization of lemmatization effects
if 'tokens_normalized' in news_df.columns and 'lemmas_combined' in news_df.columns and len(news_df) > 0:
    print("\nLemmatization example (first row):")
    tokens = news_df['tokens_normalized'].iloc[0][:10]  # First 10 tokens
    lemmas = news_df['lemmas_combined'].iloc[0][:10]    # First 10 lemmas
    
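    if tokens and lemmas:
        print("Original tokens vs. Lemmas:")
        # Positional pairing is approximate: spaCy re-tokenizes the joined text,
        # so lemma positions can drift from the original token positions
        for token, lemma in zip(tokens, lemmas):
            if token != lemma:
                print(f"  {token} → {lemma}")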

# Part Of Speech Data Visualization

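Let's see what we can learn from the data processed so far. We start by plotting the POS tagging data and inspecting several of its aspects: the distribution of tags, the most common words per tag, frequent tag sequences, headline lengths, and how grammatically similar the headlines are to one another.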
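
To make the input format concrete, here is a minimal sketch of the `(text, POS, lemma)` tuples stored by the lemmatization step, in both the flat (combined) and nested (separate) layouts. The toy rows and the helper name `count_pos` are illustrative assumptions, not values or functions from the dataset or the cells below.

```python
from collections import Counter

# Toy stand-in for one row of `lemma_pos_combined`:
# a flat list of (text, POS, lemma) tuples. Values are made up.
combined_row = [
    ("leaders", "NOUN", "leader"),
    ("meet", "VERB", "meet"),
    ("brussels", "PROPN", "brussels"),
]

# Toy stand-in for one row of `lemma_pos_separate`:
# one inner list of tuples per headline.
separate_row = [
    [("markets", "NOUN", "market"), ("fall", "VERB", "fall")],
    [("storm", "NOUN", "storm"), ("hits", "VERB", "hit"), ("coast", "NOUN", "coast")],
]

def count_pos(tokens):
    """Count POS tags in a flat list of (text, POS, lemma) tuples."""
    return Counter(pos for _, pos, _ in tokens)

combined_counts = count_pos(combined_row)

# The nested layout has to be flattened before counting.
separate_counts = count_pos(
    [tok for headline in separate_row for tok in headline]
)

print(combined_counts)
print(separate_counts)
```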

In [None]:
# ================================================================
# POS TAGGING VISUALIZATION (FIXED VERSION)
# ================================================================
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
from collections import Counter, defaultdict
import re
from wordcloud import WordCloud
from matplotlib.ticker import PercentFormatter

print("Generating fixed visualizations for POS tagging data...")

# Set plot style
plt.style.use('ggplot')
sns.set(font_scale=1.2)

# ------------------------------------------------
# 1) Distribution of POS Tags (FIXED)
# ------------------------------------------------
print("Analyzing POS tag distributions...")

# Fixed function to count POS tags with better handling of nested structures
def count_tags(tags_column):
    tag_counts = Counter()
    
    # Skip empty DataFrames
    if len(tags_column) == 0:
        return tag_counts
        
    # Try to get a non-empty sample to determine the data structure
    sample = None
    for item in tags_column:
        if item:  # Find first non-empty item
            sample = item
            break
    
    if sample is None:
        return tag_counts
    
    # Handle different formats based on nesting level and tuple structure
    if isinstance(sample, list):
        # Check if it's the separate headlines format (list of lists of tuples)
        if sample and isinstance(sample[0], list):
            for headlines in tags_column:
                if not headlines:
                    continue
                for headline in headlines:
                    if not headline:
                        continue
                    # Check tuple format - (text, pos, lemma) or (text, pos)
                    if headline and isinstance(headline[0], tuple):
                        if len(headline[0]) == 3:  # New format (text, pos, lemma)
                            for _, pos, _ in headline:
                                tag_counts[pos] += 1
                        else:  # Old format (text, pos)
                            for _, pos in headline:
                                tag_counts[pos] += 1
        # It's the combined headlines format (list of tuples)
        elif sample and isinstance(sample[0], tuple):
            for tags in tags_column:
                if not tags:
                    continue
                # Check tuple format
                if len(tags[0]) == 3:  # New format (text, pos, lemma)
                    for _, pos, _ in tags:
                        tag_counts[pos] += 1
                else:  # Old format (text, pos)
                    for _, pos in tags:
                        tag_counts[pos] += 1
    
    return tag_counts

# Determine which columns to use for visualization
pos_columns = []
if 'lemma_pos_combined' in news_df.columns:
    pos_columns.append(('lemma_pos_combined', 'Combined Headlines'))
elif 'pos_tags_combined' in news_df.columns:
    pos_columns.append(('pos_tags_combined', 'Combined Headlines'))

if 'lemma_pos_separate' in news_df.columns:
    pos_columns.append(('lemma_pos_separate', 'Separate Headlines'))
elif 'pos_tags_separate' in news_df.columns:
    pos_columns.append(('pos_tags_separate', 'Separate Headlines'))

# Debug - print sample row data to verify structure
print("Debugging data structure:")
for col_name, title in pos_columns:
    print(f"\nColumn: {col_name}")
    # Find first non-empty row
    for i, row in enumerate(news_df[col_name]):
        if row:
            print(f"Sample row {i}: {type(row)}")
            if isinstance(row, list):
                if row and isinstance(row[0], list):
                    print(f"  First nested list: {type(row[0])}")
                    if row[0]:
                        print(f"    First element in nested list: {type(row[0][0])}")
                        print(f"    First few items: {row[0][:3]}")
                elif row and isinstance(row[0], tuple):
                    print(f"  First tuple: {type(row[0])}")
                    print(f"  Sample tuples: {row[:3]}")
            break

# Create separate figures based on available columns
num_plots = len(pos_columns)
if num_plots > 0:
    fig, axes = plt.subplots(1, num_plots, figsize=(10*num_plots, 8))
    # Handle single plot case
    if num_plots == 1:
        axes = [axes]
    
    for i, (col_name, title_suffix) in enumerate(pos_columns):
        # Count POS tags with the fixed function
        pos_counts = count_tags(news_df[col_name])
        
        # Debug to verify tag counting is working
        print(f"\nPOS counts for {col_name}:")
        print(f"Total unique POS tags found: {len(pos_counts)}")
        if pos_counts:
            top_5 = sorted(pos_counts.items(), key=lambda x: x[1], reverse=True)[:5]
            print(f"Top 5 POS tags: {top_5}")
        else:
            print("No POS tags found!")
        
        if not pos_counts:
            axes[i].text(0.5, 0.5, f"No POS tags found in {title_suffix}",
                         ha='center', va='center', fontsize=12)
            axes[i].set_title(f'POS Tags Distribution - {title_suffix}')
            axes[i].axis('off')
            continue
        
        pos_df = pd.DataFrame({
            'POS': list(pos_counts.keys()),
            'Count': list(pos_counts.values())
        }).sort_values('Count', ascending=False)
        
        # Calculate percentage
        total = pos_df['Count'].sum()
        pos_df['Percentage'] = pos_df['Count'] / total * 100
        
        # Plot top N tags
        top_n = min(15, len(pos_df))
        sns.barplot(x='POS', y='Percentage', data=pos_df.head(top_n), ax=axes[i])
        axes[i].set_title(f'Top {top_n} POS Tags Distribution - {title_suffix}')
        axes[i].set_ylabel('Percentage (%)')
        axes[i].set_xlabel('Part of Speech')
        axes[i].tick_params(axis='x', rotation=45)
        
        # Add value labels
        for j, p in enumerate(axes[i].patches):
            height = p.get_height()
            axes[i].text(p.get_x() + p.get_width()/2., height + 0.3,
                    f'{height:.1f}%', ha="center")
    
    plt.tight_layout()
    plt.savefig('pos_tag_distribution_fixed.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No POS tagging columns found in the DataFrame.")

# ------------------------------------------------
# 2) Word Clouds by POS Tag
# ------------------------------------------------
print("Generating word clouds for common POS tags...")

# Function to extract words by POS tag with support for new format
def get_words_by_pos(tags_column, target_pos):
    words = []
    
    # Skip empty DataFrames
    if len(tags_column) == 0:
        return words
        
    # Check the format of the data
    sample = next((x for x in tags_column if x), None)
    
    if sample is None:
        return words
        
    # Handle different formats. The word is the first tuple element and the
    # POS tag the second, in both the new (text, pos, lemma) layout and the
    # old (text, pos) layout, so positional indexing covers both.
    if isinstance(sample, list):
        if isinstance(sample[0], tuple):
            # Combined headlines: each row is a flat list of tuples
            for tags in tags_column:
                if not tags:
                    continue
                words.extend(tok[0].lower() for tok in tags if tok[1] == target_pos)
        elif isinstance(sample[0], list):
            # Separate headlines: each row is a list of headlines
            for headlines in tags_column:
                if not headlines:
                    continue

                for headline in headlines:
                    if not headline:
                        continue

                    words.extend(tok[0].lower() for tok in headline if tok[1] == target_pos)
    
    return words

# Select important POS categories to visualize
important_pos = ['NOUN', 'VERB', 'ADJ', 'PROPN']

# Choose which column to use for word clouds (prioritize lemma_pos_combined)
wordcloud_col = None
if 'lemma_pos_combined' in news_df.columns:
    wordcloud_col = 'lemma_pos_combined'
elif 'pos_tags_combined' in news_df.columns:
    wordcloud_col = 'pos_tags_combined'
elif 'lemma_pos_separate' in news_df.columns:
    wordcloud_col = 'lemma_pos_separate'
elif 'pos_tags_separate' in news_df.columns:
    wordcloud_col = 'pos_tags_separate'

if wordcloud_col:
    # Create word clouds for important POS tags
    fig, axes = plt.subplots(2, 2, figsize=(20, 12))
    axes = axes.flatten()
    
    # Use the selected column for word clouds
    for i, pos in enumerate(important_pos):
        words = get_words_by_pos(news_df[wordcloud_col], pos)
        word_counts = Counter(words)
        
        # Skip if no words found
        if not word_counts:
            axes[i].text(0.5, 0.5, f"No {pos} tags found", 
                         horizontalalignment='center', fontsize=18)
            axes[i].axis('off')
            continue
        
        # Generate word cloud
        wc = WordCloud(width=800, height=400, 
                       background_color='white', 
                       max_words=100,
                       colormap='viridis',
                       contour_width=1, contour_color='steelblue')
        
        wc.generate_from_frequencies(word_counts)
        
        axes[i].imshow(wc, interpolation='bilinear')
        axes[i].set_title(f'Most Common {pos} Words', fontsize=18)
        axes[i].axis('off')
    
    plt.tight_layout()
    plt.subplots_adjust(hspace=0)  # Removes the vertical gap between rows; try 0.05 or 0.1 if titles overlap
    plt.savefig('pos_word_clouds.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No suitable columns found for word cloud visualization.")

# ------------------------------------------------
# 3) Headline POS Tag Sequence Patterns
# ------------------------------------------------
print("Analyzing common POS tag sequences...")

# Function to extract common POS sequences with support for new format
def get_pos_sequences(tags_column, seq_length=3):
    sequences = []
    
    # Skip empty DataFrames
    if len(tags_column) == 0:
        return sequences
        
    # Check the format of the data
    sample = next((x for x in tags_column if x), None)
    
    if sample is None:
        return sequences
        
    # Helper: sliding windows of POS tags over one token list. The POS tag is
    # the second tuple element in both the new (text, pos, lemma) layout and
    # the old (text, pos) layout.
    def ngrams(tokens):
        pos_tags = [tok[1] for tok in tokens]
        return [tuple(pos_tags[i:i+seq_length])
                for i in range(len(pos_tags) - seq_length + 1)]

    # Handle different formats
    if isinstance(sample, list):
        if isinstance(sample[0], tuple):
            # Combined headlines: each row is a flat list of tuples
            for tags in tags_column:
                if tags:
                    sequences.extend(ngrams(tags))
        elif isinstance(sample[0], list):
            # Separate headlines: each row is a list of headlines
            for headlines in tags_column:
                if not headlines:
                    continue

                for headline in headlines:
                    if headline:
                        sequences.extend(ngrams(headline))
    
    return sequences

# Choose which column to use for sequence analysis (prioritize lemma_pos_combined)
sequence_col = None
if 'lemma_pos_combined' in news_df.columns:
    sequence_col = 'lemma_pos_combined'
elif 'pos_tags_combined' in news_df.columns:
    sequence_col = 'pos_tags_combined'
elif 'lemma_pos_separate' in news_df.columns:
    sequence_col = 'lemma_pos_separate'
elif 'pos_tags_separate' in news_df.columns:
    sequence_col = 'pos_tags_separate'

if sequence_col:
    seq_length = 3  # trigrams
    sequences = get_pos_sequences(news_df[sequence_col], seq_length)
    seq_counts = Counter(sequences)
    
    # Create DataFrame for top sequences
    top_n = 15
    seq_df = pd.DataFrame({
        'Sequence': [' → '.join(seq) for seq in seq_counts.keys()],
        'Count': list(seq_counts.values())
    }).sort_values('Count', ascending=False).head(top_n)
    
    # Calculate percentage
    total = sum(seq_counts.values())
    seq_df['Percentage'] = seq_df['Count'] / total * 100
    
    # Plot top sequences
    plt.figure(figsize=(14, 8))
    bars = sns.barplot(x='Sequence', y='Percentage', data=seq_df)
    plt.title(f'Top {top_n} POS Tag Sequences (Length {seq_length})', fontsize=16)
    plt.xlabel('POS Tag Sequence', fontsize=14)
    plt.ylabel('Percentage (%)', fontsize=14)
    plt.xticks(rotation=45, ha='right')
    
    # Add value labels
    for i, p in enumerate(bars.patches):
        height = p.get_height()
        plt.text(p.get_x() + p.get_width()/2., height + 0.1,
                f'{height:.1f}%', ha="center")
    
    plt.tight_layout()
    plt.savefig('pos_sequence_patterns.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No suitable columns found for sequence pattern analysis.")

# ------------------------------------------------
# 4) Headline Length Distribution by POS Count
# ------------------------------------------------
print("Analyzing headline length distributions...")

# Function to get POS tag counts per headline with support for new format
def get_pos_counts_per_headline(tags_column):
    counts = []
    if tags_column.empty:
        return counts

    # Use the first non-empty row to detect the format; an empty first row
    # would otherwise be mistaken for the combined format
    sample = next((x for x in tags_column if x), None)
    if sample is None:
        return counts

    if isinstance(sample[0], tuple):
        # Combined format: one headline per row
        for tags in tags_column:
            counts.append(len(tags))
    else:
        # Separate format: multiple headlines per row
        for headlines in tags_column:
            if not headlines:
                continue
            for headline in headlines:
                counts.append(len(headline))
    return counts

# Choose columns for length distribution. Combined headlines are commented out because they are not relevant here.
length_columns = []
#if 'lemma_pos_combined' in news_df.columns:
#    length_columns.append(('lemma_pos_combined', 'Combined Headlines', 'blue'))
#elif 'pos_tags_combined' in news_df.columns:
#    length_columns.append(('pos_tags_combined', 'Combined Headlines', 'blue'))

if 'lemma_pos_separate' in news_df.columns:
    length_columns.append(('lemma_pos_separate', 'Separate Headlines', 'green'))
elif 'pos_tags_separate' in news_df.columns:
    length_columns.append(('pos_tags_separate', 'Separate Headlines', 'green'))

# Plot headline length distribution if at least one column is available
if length_columns:
    plt.figure(figsize=(12, 6))
    
    for col_name, label, color in length_columns:
        counts = get_pos_counts_per_headline(news_df[col_name])
        sns.histplot(counts, color=color, alpha=0.6, label=label, bins=20)
    
    plt.title('Distribution of Headline Lengths (Word Count)', fontsize=16)
    plt.xlabel('Number of Words', fontsize=14)
    plt.ylabel('Frequency', fontsize=14)
    plt.xlim(0, 32)
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('headline_length_distribution.png', dpi=300, bbox_inches='tight')
    plt.show()
else:
    print("No suitable columns found for headline length analysis.")

# ------------------------------------------------
# 5) Comparative Analysis Between Different Headlines
# ------------------------------------------------
print("Analyzing POS tag variations across headlines...")

# Choose which column to use for comparative analysis (prioritize lemma_pos_separate)
comparison_col = None
if 'lemma_pos_separate' in news_df.columns:
    comparison_col = 'lemma_pos_separate'
elif 'pos_tags_separate' in news_df.columns:
    comparison_col = 'pos_tags_separate'

if comparison_col:
    # Function to calculate POS tag diversity with support for new format
    def calculate_pos_diversity(headlines_tags):
        headline_diversities = []
        
        for headlines in headlines_tags:
            # Drop empty placeholder headlines; they have no tags to compare
            headlines = [h for h in (headlines or []) if h]
            if len(headlines) >= 2:  # Need at least 2 headlines to compare
                # POS tag is the second tuple element in both the new
                # (text, pos, lemma) and the old (text, pos) formats
                pos_sets = [set(tok[1] for tok in h) for h in headlines]

                # Calculate Jaccard similarity between every pair of headlines
                similarities = []
                for i in range(len(pos_sets)):
                    for j in range(i+1, len(pos_sets)):
                        # Jaccard similarity (intersection / union)
                        intersection = len(pos_sets[i] & pos_sets[j])
                        union = len(pos_sets[i] | pos_sets[j])

                        if union > 0:
                            similarities.append(intersection / union)

                # Average similarity for this row
                if similarities:
                    headline_diversities.append(np.mean(similarities))
        
        return headline_diversities
    
    # Calculate diversity scores
    diversity_scores = calculate_pos_diversity(news_df[comparison_col])
    
    # Plot diversity distribution if we have enough data
    if diversity_scores:
        plt.figure(figsize=(12, 6))
        sns.histplot(diversity_scores, bins=20, kde=True)
        plt.title('POS Tag Similarity Between Headlines (Jaccard Index)', fontsize=16)
        plt.xlabel('Average Jaccard Similarity (higher = more similar)', fontsize=14)
        plt.ylabel('Frequency', fontsize=14)
        plt.xlim(0, 1)
        plt.axvline(np.mean(diversity_scores), color='red', linestyle='--', 
                    label=f'Mean: {np.mean(diversity_scores):.2f}')
        plt.legend()
        plt.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.savefig('headline_pos_similarity.png', dpi=300, bbox_inches='tight')
        plt.show()
    else:
        print("Not enough comparable headlines found for diversity analysis.")
else:
    print("No suitable columns found for comparative analysis.")

print("Visualization complete! All plots have been saved as PNG files.")

if 'lemma_pos_separate' in news_df.columns:
    print("Sample Separate Headlines data:", news_df['lemma_pos_separate'].dropna().head(3).tolist())

# Results

We already have some interesting observations:

- 63% of news headlines follow a similar grammatical pattern.
- Covid is NOT the most talked-about topic, even though it felt that way while living through that era.
- Russia, China, and Ukraine were the three countries that news outlets talked about the most.
- 15.3% of three-word POS sequences consist of nouns.
- Most news headlines are 5 to 10 words long.
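
The visualization code above computes two quantities that bear on these remarks: the Jaccard index between the POS tag sets of two headlines, and the share of a given POS trigram among all three-tag windows. The following is a minimal sketch of both on toy data; the function names `jaccard` and `trigram_share` and the example tag sequences are hypothetical and do not come from the notebook or the dataset.

```python
from collections import Counter

def jaccard(tags_a, tags_b):
    """Jaccard index of two tag collections: intersection size / union size."""
    set_a, set_b = set(tags_a), set(tags_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def trigram_share(pos_tags, target):
    """Fraction of all length-3 windows in `pos_tags` that equal `target`."""
    trigrams = [tuple(pos_tags[i:i + 3]) for i in range(len(pos_tags) - 2)]
    if not trigrams:
        return 0.0
    return Counter(trigrams)[target] / len(trigrams)

# Toy POS sequences for two headlines (made up for illustration).
headline_a = ["NOUN", "NOUN", "NOUN", "VERB", "ADJ"]
headline_b = ["PROPN", "VERB", "NOUN"]

# Shared tags: NOUN, VERB. All distinct tags: NOUN, VERB, ADJ, PROPN.
similarity = jaccard(headline_a, headline_b)

# headline_a has three windows, one of which is NOUN-NOUN-NOUN.
noun_share = trigram_share(headline_a, ("NOUN", "NOUN", "NOUN"))

print(f"Jaccard similarity: {similarity:.2f}")
print(f"All-noun trigram share: {noun_share:.1%}")
```

Because the Jaccard index works on sets, it ignores how often and in what order tags occur, so two headlines with very different structure can still score as similar if they use the same tag inventory.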


TO DO:

- Most common verbs
- Most common nouns
- Most common word phrases
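
One possible way to approach these items, sketched under the assumption that the data keeps the nested `(text, POS, lemma)` layout of `lemma_pos_separate`. The helpers `most_common_lemmas` and `most_common_bigrams` and the toy rows are hypothetical; counting by lemma rather than surface form, and treating a "phrase" as a two-lemma sequence, are design choices of this sketch only.

```python
from collections import Counter

# Toy stand-in for `news_df['lemma_pos_separate']`: rows -> headlines -> tuples.
rows = [
    [
        [("talks", "NOUN", "talk"), ("resume", "VERB", "resume")],
        [("leaders", "NOUN", "leader"), ("resume", "VERB", "resume"),
         ("talks", "NOUN", "talk")],
    ],
    [
        [("storm", "NOUN", "storm"), ("hits", "VERB", "hit")],
    ],
]

def most_common_lemmas(rows, target_pos, n=10):
    """Top-n lemmas whose POS tag equals `target_pos`."""
    counts = Counter(
        lemma
        for headlines in rows
        for headline in headlines
        for _, pos, lemma in headline
        if pos == target_pos
    )
    return counts.most_common(n)

def most_common_bigrams(rows, n=10):
    """Top-n adjacent lemma pairs, counted within each headline."""
    counts = Counter()
    for headlines in rows:
        for headline in headlines:
            lemmas = [lemma for _, _, lemma in headline]
            counts.update(zip(lemmas, lemmas[1:]))
    return counts.most_common(n)

top_verbs = most_common_lemmas(rows, "VERB")
top_nouns = most_common_lemmas(rows, "NOUN")
top_bigrams = most_common_bigrams(rows)

print(top_verbs)
print(top_nouns)
print(top_bigrams)
```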



In [None]:
import sys
print(sys.prefix)