Context: This script initializes the Natural Language Processing (NLP) environment within the 05_nlp.ipynb notebook. Before analyzing the rhetorical strategies of presidential candidates, the necessary software infrastructure must be established.

In [4]:
# ==========================================
# STEP 1: ENVIRONMENT SETUP & DEPENDENCY INSTALLATION
# ==========================================
# Note: The exclamation mark (!) allows the execution of shell commands directly within the Jupyter Notebook.

# 1. Install TextBlob
# We utilize 'TextBlob', a Python library built upon NLTK (Natural Language Toolkit), 
# chosen for its efficiency in performing standard NLP tasks such as sentiment analysis 
# and noun phrase extraction.
!pip install textblob

# 2. Download Linguistic Corpora
# TextBlob requires specific lexical resources (corpora) to function correctly.
# This command downloads the necessary datasets, including 'punkt_tab' (for tokenization) 
# and 'averaged_perceptron_tagger_eng' (for part-of-speech tagging), as listed in the log below.
!python -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt_tab to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package conll2000 to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]     /Users/jessicabourdouxhe/nltk_data...
[nltk_data]   Unzipping corpora/movie_reviews.zip.
Finished.
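
Note: because `!pip` runs in a shell, it can install into a different Python interpreter than the one backing the notebook kernel. The check below is a minimal sketch using only the standard library; the helper name `missing_packages` is hypothetical and not part of the notebook.

```python
import importlib.util
import sys


def missing_packages(names):
    """Return the names in `names` that this interpreter cannot import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Check the two top-level packages this notebook depends on.
missing = missing_packages(["textblob", "nltk"])
if missing:
    print(f"Not importable from {sys.executable}: {missing}")
else:
    print("All NLP dependencies are importable.")
```

If the check reports a missing package even though the install cell succeeded, the kernel and the shell are most likely using different environments.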


Context: This cell is the core analytical step of the 05_nlp.ipynb notebook. It moves from cleaned text to quantitative feature extraction: each document in the corpus (speeches and manifestos) is converted into two numerical scores, polarity (stored as 'sentiment') and subjectivity, which are then averaged per year and party so they can be analyzed statistically.
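
Two details of the pivot step in the cell that follows are easy to miss: the flattened column names depend on the order in which `pivot_table` emits its columns, and a row-wise mean skips missing values by default. The sketch below illustrates both on toy data. The values are hypothetical and not taken from the corpus, and deriving the names from the MultiIndex is one possible alternative, not the notebook's method.

```python
import pandas as pd

# Toy long-format data (hypothetical values); the Republican platform is missing.
toy = pd.DataFrame({
    "year": [2020, 2020, 2020],
    "party": ["Democrat", "Democrat", "Republican"],
    "source_type": ["Speech", "Platform", "Speech"],
    "sentiment": [0.15, 0.10, 0.16],
})

wide = toy.pivot_table(
    index=["year", "party"],
    columns="source_type",
    values=["sentiment"],
    aggfunc="mean",
).reset_index()

# Derive flat names from the MultiIndex instead of assuming a column order:
# ('sentiment', 'Speech') -> 'sentiment_speech', ('year', '') -> 'year'.
wide.columns = ["_".join(part.lower() for part in col if part) for col in wide.columns]

pair = ["sentiment_platform", "sentiment_speech"]
# Default behaviour: NaN is skipped, so a missing platform yields the speech score alone.
wide["sentiment_mean"] = wide[pair].mean(axis=1)
# skipna=False keeps the gap visible as NaN instead.
wide["sentiment_mean_strict"] = wide[pair].mean(axis=1, skipna=False)

print(wide)
```

Whether a one-source "mean" is acceptable depends on the analysis; keeping a strict variant alongside it makes rows with a missing source easy to spot.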

In [5]:
import pandas as pd
from textblob import TextBlob
import numpy as np
from pathlib import Path

# ==========================================
# STEP 1: DATA LOADING
# ==========================================
# Load the sanitized NLP dataset (V8 Cleaned Version).
# This file contains the unified corpus of speeches and manifestos.
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
FIG_DIR = PROJECT_ROOT / 'figures'

file_path = PROCESSED_DIR / "nlp_database_CLEAN_V8.csv"
try:
    df = pd.read_csv(file_path)
    print("✅ V8 Cleaned Database successfully loaded.")
except FileNotFoundError:
    # Fallback: load the alternate file from the current working directory
    df = pd.read_csv('nlp_database_CLEAN_TERMINATOR.csv')
    print("✅ Alternate Database loaded.")

# ==========================================
# STEP 2: DOCUMENT TYPOLOGY IDENTIFICATION
# ==========================================
# Distinguish between "Speeches" (Candidate Rhetoric) and "Platforms" (Party Ideology).
# We generate a categorical variable 'source_type' to facilitate comparative analysis.
def get_type(candidate_name):
    if "Party Platform" in str(candidate_name):
        return "Platform"
    else:
        return "Speech"

df['source_type'] = df['candidate'].apply(get_type)

print("Document Type Distribution:")
print(df['source_type'].value_counts())

# ==========================================
# STEP 3: SENTIMENT & SUBJECTIVITY SCORING (TEXTBLOB)
# ==========================================
print("\n🧠 Initiating Sentiment Analysis...")

def get_sentiment(text):
    # Returns the Polarity score: Float within range [-1.0, 1.0]
    # -1.0 = Highly Negative | 0 = Neutral | +1.0 = Highly Positive
    return TextBlob(str(text)).sentiment.polarity

def get_subjectivity(text):
    # Returns the Subjectivity score: Float within range [0.0, 1.0]
    # 0.0 = Objective/Factual | 1.0 = Subjective/Opinionated
    return TextBlob(str(text)).sentiment.subjectivity

# Apply the functions to the text corpus
df['sentiment'] = df['text'].apply(get_sentiment)
df['subjectivity'] = df['text'].apply(get_subjectivity)

# ==========================================
# STEP 4: DATA RESTRUCTURING (PIVOTING)
# ==========================================
# Transform the dataset from Long Format to Wide Format.
# Objective: Create distinct columns for 'Speech' and 'Platform' metrics for each Year-Party pair.
df_pivot = df.pivot_table(
    index=['year', 'party'], 
    columns='source_type', 
    values=['sentiment', 'subjectivity'],
    aggfunc='mean'
).reset_index()

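# Flatten the hierarchical MultiIndex columns created by the pivot table.
# Renaming schema: e.g., ('sentiment', 'Speech') becomes 'sentiment_speech'.
# Note: the hard-coded order below relies on pivot_table sorting column labels
# alphabetically ('sentiment' before 'subjectivity', 'Platform' before 'Speech').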
df_pivot.columns = ['year', 'party', 
                    'sentiment_platform', 'sentiment_speech', 
                    'subjectivity_platform', 'subjectivity_speech']

# ==========================================
# STEP 5: AGGREGATE METRIC CALCULATION
# ==========================================
# Compute a composite score (mean) combining oral rhetoric (Speech) and written policy (Platform).
# This summarizes the party's overall tone for that election cycle.
# Note: mean(axis=1) skips NaN, so when one source is missing the result is simply the available score.
df_pivot['sentiment_mean'] = df_pivot[['sentiment_platform', 'sentiment_speech']].mean(axis=1)
df_pivot['subjectivity_mean'] = df_pivot[['subjectivity_platform', 'subjectivity_speech']].mean(axis=1)

# Sort by year (most recent first), then by party (alphabetical)
df_pivot = df_pivot.sort_values(by=['year', 'party'], ascending=[False, True])

# ==========================================
# STEP 6: VALIDATION & EXPORT
# ==========================================
print("\n--- DETAILED FINAL RESULTS ---")
# Display a comparative snapshot of the sentiment metrics
print(df_pivot[['year', 'party', 'sentiment_speech', 'sentiment_platform', 'sentiment_mean']].head(10))

# --- Export processed data (sentiment analysis) ---
# Path is already imported and PROCESSED_DIR is already defined in STEP 1.

# 1. Safety check: create the output folder if it does not exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 2. Build the full output path
output_name = 'political_sentiment_DETAILED.csv'
save_path_sentiment = PROCESSED_DIR / output_name

# 3. Export
df_pivot.to_csv(save_path_sentiment, index=False)

print(f"\n✅ File '{output_name}' generated successfully!")
print(f"📍 Location: {save_path_sentiment}")

✅ V8 Cleaned Database successfully loaded.
Document Type Distribution:
source_type
Speech      14
Platform    13
Name: count, dtype: int64

🧠 Initiating Sentiment Analysis...

--- DETAILED FINAL RESULTS ---
    year       party  sentiment_speech  sentiment_platform  sentiment_mean
12  2024    Democrat          0.188678            0.107454        0.148066
13  2024  Republican          0.176667            0.089143        0.132905
10  2020    Democrat          0.154585            0.103591        0.129088
11  2020  Republican          0.157135                 NaN        0.157135
8   2016    Democrat          0.172119            0.131665        0.151892
9   2016  Republican          0.096610            0.097079        0.096845
6   2012    Democrat          0.157531            0.143446        0.150488
7   2012  Republican          0.193518            0.106685        0.150101
4   2008    Democrat          0.152744            0.123579        0.138161
5   2008  Republican          0.151983