# Comprehensive Transversal Study: Aviation and Communication in 1939 Belgian Newspapers

This notebook presents a transversal study of the theme "aviation AND communication" in the sub-corpus stored in the "txt_aviation" folder. The corpus contains historical newspaper texts from 1939 related to aviation. We apply a range of NLP techniques: corpus exploration, frequency analysis, keyword extraction, named entity recognition (NER), sentiment analysis, clustering, word2vec embeddings, and additional methods.

The study proceeds step by step and documents which techniques work well or less well, along with their advantages and limitations for the automatic processing of historical corpora. We focus on texts that combine aviation and communication aspects.

## Overview of Steps
1. **Corpus Exploration**: Examine the corpus structure, read sample files, understand content and metadata.
2. **Text Preprocessing**: Clean and prepare texts for analysis (tokenization, normalization, etc.).
3. **Filtering Relevant Texts**: Identify and focus on texts related to both aviation and communication.
4. **Frequency Analysis**: Compute word frequencies, n-grams, and statistical measures.
5. **Keyword Extraction**: Identify key terms and phrases.
6. **Named Entity Recognition**: Extract and analyze named entities.
7. **Sentiment Analysis**: Assess sentiment in relevant texts.
8. **Clustering**: Group documents or terms using clustering techniques.
9. **Word Embeddings**: Apply word2vec or similar for semantic analysis.
10. **Additional Techniques**: Use topic modeling, co-occurrence analysis, or other methods.
11. **Final Report**: Summarize findings and methodological observations.

## 1. Corpus Exploration

In this section, we examine the corpus structure from the 'txt_aviation' folder. We list the files, read sample files to understand the content, metadata, and assess relevance to aviation and communication.

The corpus consists of OCR-processed historical newspaper texts from Belgian newspapers in 1939, primarily related to aviation. Due to OCR errors common in historical texts, the quality may vary, with potential misrecognitions affecting analysis.

What we attempt: Load file paths, extract metadata from filenames, read and display sample texts.

Advantages: Provides an overview of the data, helps identify patterns in naming and content.

Limitations: OCR errors may obscure true content; manual inspection is limited to samples.

In [1]:
import os
import pandas as pd
from pathlib import Path

# Path to the corpus
corpus_path = Path('data/txt_aviation')

# List all files
files = list(corpus_path.glob('*.txt'))
print(f"Total files: {len(files)}")

# Extract metadata from filenames
metadata = []
for file in files:
    parts = file.stem.split('_')
    if len(parts) >= 4:
        institution = parts[0]
        newspaper_code = parts[1]
        date = parts[2]
        edition_page = parts[3]
        metadata.append({
            'filename': file.name,
            'institution': institution,
            'newspaper': newspaper_code,
            'date': date,
            'edition_page': edition_page,
            'year': date[:4],
            'month': date[4:6],
            'day': date[6:8]
        })

df_meta = pd.DataFrame(metadata)
print(df_meta.head())

# Read a sample file
sample_file = files[0]
with open(sample_file, 'r', encoding='utf-8') as f:
    sample_text = f.read()[:1000]  # First 1000 chars
print(f"\nSample from {sample_file.name}:\n{sample_text}...")

# Check for aviation and communication keywords
aviation_keywords = ['aviation', 'avion', 'aéro', 'pilote', 'vol', 'aéroport']
communication_keywords = ['communication', 'radio', 'téléphone', 'télégraphe', 'télévision']

def contains_keywords(text, keywords):
    return any(kw.lower() in text.lower() for kw in keywords)

# Quick check on a few files
relevant_files = []
for file in files[:10]:  # Check first 10
    with open(file, 'r', encoding='utf-8') as f:
        text = f.read()
    has_aviation = contains_keywords(text, aviation_keywords)
    has_communication = contains_keywords(text, communication_keywords)
    if has_aviation and has_communication:
        relevant_files.append(file.name)

print(f"\nFiles with both aviation and communication keywords (sample): {relevant_files}")

Total files: 531
                           filename institution newspaper        date  \
0  KB_JB427_1939-01-01_01-00002.txt          KB     JB427  1939-01-01   
1  KB_JB427_1939-01-10_01-00007.txt          KB     JB427  1939-01-10   
2  KB_JB427_1939-01-11_01-00003.txt          KB     JB427  1939-01-11   
3  KB_JB427_1939-02-01_01-00011.txt          KB     JB427  1939-02-01   
4  KB_JB427_1939-02-27_01-00001.txt          KB     JB427  1939-02-27   

  edition_page  year month day  
0     01-00002  1939    -0  1-  
1     01-00007  1939    -0  1-  
2     01-00003  1939    -0  1-  
3     01-00011  1939    -0  2-  
4     01-00001  1939    -0  2-  

Sample from KB_JB427_1939-01-01_01-00002.txt:
fkfmsm-* 2 - — — 1-2 Janvier 1939 tarif! ferroviaires, d'un relèvement des Urlfs postaux, sans compter les 65 million, camouflés en contribution nationale a l'assurance chômage. Cette ouestion de l'assurance chômage est d'ailleurs un des cauchemars du gouvernement. Il compte sur elle pour équilibre

### Observations from Corpus Exploration

What was attempted: We listed all 531 files in the corpus, extracted metadata from filenames (institution, newspaper code, date, edition-page), and read a sample text. We also performed a preliminary keyword search for aviation and communication terms on the first 10 files.

What worked well: The file listing and basic metadata parsing provided a clear overview of the corpus structure. The newspapers are from 1939, with codes like JB427 (La Libre Belgique), JB555 (L'Indépendance belge), JB838 (Le Soir). The keyword check identified several files containing both aviation and communication keywords, indicating potential relevance.

What worked less well: The metadata extraction mis-slices the date components. The date field is formatted as YYYY-MM-DD, so `date[4:6]` and `date[6:8]` return '-0' and '1-' instead of the month and day, as the `month` and `day` columns in the output show. The sample text exhibits significant OCR errors (e.g., "fkfmsm-*", "tarif!", "Urlfs"), which distort the original content and complicate analysis. Keyword matching is simplistic and may produce irrelevant matches because of OCR noise or polysemy.

Advantages: This step gives a foundational understanding of the data volume and diversity, essential for planning further analysis. For historical corpora, exploring metadata helps contextualize the sources.

Limitations: OCR errors are prevalent in historical texts and cause data quality issues that persist throughout the pipeline. The French of 1939 may differ from the language that modern models and resources assume, and the lack of ground truth makes validation hard. Manual inspection is time-consuming for large corpora.
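
The date-slicing problem noted above comes from treating `YYYY-MM-DD` as if it were `YYYYMMDD`. A minimal sketch of a more robust parser follows. It assumes every filename follows the `INSTITUTION_CODE_YYYY-MM-DD_EDITION-PAGE.txt` pattern seen in the listing; the helper name `parse_filename` is ours and is not part of the notebook.

```python
from datetime import datetime
from pathlib import Path


def parse_filename(filename):
    """Split a corpus filename into metadata fields.

    Assumes the pattern INSTITUTION_CODE_YYYY-MM-DD_EDITION-PAGE.txt
    (an assumption based on the filenames listed above).
    Returns None when the name does not fit that pattern.
    """
    parts = Path(filename).stem.split('_')
    if len(parts) < 4:
        return None
    institution, newspaper, date_str, edition_page = parts[:4]
    try:
        # Let strptime validate the date and split it into components
        date = datetime.strptime(date_str, '%Y-%m-%d')
    except ValueError:
        return None
    return {
        'filename': Path(filename).name,
        'institution': institution,
        'newspaper': newspaper,
        'date': date_str,
        'edition_page': edition_page,
        'year': date.year,
        'month': date.month,
        'day': date.day,
    }


print(parse_filename('KB_JB427_1939-02-27_01-00001.txt'))
```

Returning `None` for names that do not fit the pattern makes it possible to count malformed filenames instead of dropping them silently, and the integer `year`, `month`, and `day` values can be used directly for chronological grouping.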

## 2. Text Preprocessing

In this section, we clean and prepare the texts for analysis. Given the historical nature of the corpus and its OCR errors, preprocessing covers normalization, noise removal, tokenization, and stopword removal. No OCR correction or special handling of archaic spellings is attempted at this stage.

What we attempt: Load the texts, apply basic cleaning (lowercasing, whitespace normalization, removal of punctuation and digits), tokenize with NLTK for French, and remove stopwords.

Advantages: Preprocessing improves downstream NLP tasks by reducing noise.

Limitations: Historical language and OCR errors make perfect cleaning impossible; over-cleaning may remove relevant information.

In [3]:
import nltk
import re
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Download if needed
nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('french'))

# Function to clean text
def clean_text(text):
    # Lowercase
    text = text.lower()
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text)
    # Remove punctuation
    text = re.sub(r'[^\w\s]', '', text)
    # Remove digits
    text = re.sub(r'\d+', '', text)
    return text.strip()

# Load all texts
texts = {}
for file in tqdm(files[:50]):  # Limit to 50 for speed
    with open(file, 'r', encoding='utf-8', errors='ignore') as f:
        raw_text = f.read()
    cleaned = clean_text(raw_text)
    texts[file.name] = cleaned

print(f"Loaded and cleaned {len(texts)} texts.")

# Tokenize a sample
sample_tokens = word_tokenize(texts[sample_file.name][:1000], language='french')
filtered_tokens = [token for token in sample_tokens if token not in stop_words and token.isalpha()]
print(f"Sample tokens: {filtered_tokens[:20]}")

# Store tokenized texts (stopwords and non-alphabetic tokens removed)
tokenized_texts = {
    name: [
        token
        for token in word_tokenize(text, language='french')
        if token not in stop_words and token.isalpha()
    ]
    for name, text in tqdm(texts.items())
}

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\sophi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
100%|██████████| 50/50 [00:00<00:00, 100.87it/s]


Loaded and cleaned 50 texts.
Sample tokens: ['fkfmsm', 'janvier', 'tarif', 'ferroviaires', 'dun', 'relèvement', 'urlfs', 'postaux', 'sans', 'compter', 'million', 'camouflés', 'contribution', 'nationale', 'a', 'lassurance', 'chômage', 'cette', 'ouestion', 'lassurance']


100%|██████████| 50/50 [00:01<00:00, 31.10it/s]


### Observations from Text Preprocessing

What was attempted: We cleaned the texts by lowercasing them and removing punctuation and digits, then tokenized them with NLTK for French and removed stopwords. For performance reasons, we initially processed only 50 files.

What worked well: NLTK provided reliable tokenization, and basic cleaning reduced some noise. The process is reproducible and standard.

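What worked less well: OCR errors persist (e.g., 'fkfmsm', 'urlfs') and yield nonsensical tokens. Deleting apostrophes fuses elided forms into single tokens ('dun', 'lassurance' for "d'un", "l'assurance"). The modern French stopword list may not fit 1939 usage perfectly. Only 50 of the 531 files were processed; the timings shown above (about 31 files per second for tokenization) suggest the full corpus would be tractable, but this was not verified.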

Advantages: Preprocessing is crucial for historical corpora to handle inconsistencies and prepare for analysis. NLTK is lightweight and works well for basic tasks.

Limitations: Historical language evolution means modern stopwords or tokenizers may not align perfectly. OCR errors introduce noise that can't be fully corrected without manual intervention or advanced models. Scalability is an issue for large corpora.
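
The fused tokens in the sample output ('dun', 'lassurance') arise because punctuation is deleted rather than replaced by a space. Below is a minimal sketch of a variant, here called `clean_text_keep_elisions` (a hypothetical name), that turns apostrophes and other punctuation into spaces so that elided forms split into two tokens. It has not been run on the corpus, so its effect on downstream counts is untested.

```python
import re


def clean_text_keep_elisions(text):
    """Variant of clean_text that replaces punctuation with spaces.

    "d'un" becomes "d un" instead of "dun", so the elided article
    can later be dropped by a stopword or minimum-length filter.
    """
    text = text.lower()
    # Replace every non-word, non-space character (including ' and ’) with a space
    text = re.sub(r"[^\w\s]", " ", text)
    # Replace digit runs with a space as well
    text = re.sub(r"\d+", " ", text)
    # Collapse whitespace last, after the substitutions above
    text = re.sub(r"\s+", " ", text)
    return text.strip()


sample = "d'un relèvement des tarifs postaux, sans compter les 65 millions"
print(clean_text_keep_elisions(sample))
```

The single-letter remnants ('d', 'l') can then be removed with a length filter such as `len(token) > 1` alongside the existing stopword check.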

## 3. Filtering Relevant Texts

To focus on texts combining aviation and communication, we filter the corpus based on keyword presence in the tokenized texts.

What we attempt: Define expanded keyword lists and keep the documents that contain at least one term from each list.

Advantages: Simple and fast filtering.

Limitations: May miss contextual relevance; OCR errors could cause misses.

In [4]:
# Expanded keywords
aviation_keywords = ['aviation', 'avion', 'aéro', 'pilote', 'vol', 'aéroport', 'défense aérienne', 'chasseur', 'bombardier']
communication_keywords = ['communication', 'radio', 'téléphone', 'télégraphe', 'télévision', 'signal', 'transmission']

# Function to check keywords
def has_keywords(tokens, keywords):
    return any(any(kw in token for kw in keywords) for token in tokens)

# Filter
relevant_texts = {}
for name, tokens in tokenized_texts.items():
    if has_keywords(tokens, aviation_keywords) and has_keywords(tokens, communication_keywords):
        relevant_texts[name] = tokens

print(f"Found {len(relevant_texts)} relevant texts out of {len(tokenized_texts)}.")

# If none, relax
if not relevant_texts:
    print("No exact matches, checking original texts.")
    for name, text in texts.items():
        if any(kw in text for kw in aviation_keywords) and any(kw in text for kw in communication_keywords):
            relevant_texts[name] = tokenized_texts[name]

print(f"After relaxation: {len(relevant_texts)} relevant texts.")

Found 50 relevant texts out of 50.
After relaxation: 50 relevant texts.


### Observations from Filtering

What was attempted: We expanded keyword lists and filtered texts containing both aviation and communication terms.

What worked well: The filtering identified all processed texts as relevant, suggesting the corpus is focused on aviation, and communication terms are present.

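What worked less well: The method is simplistic: all 50 texts matched, which indicates over-inclusion. Because `has_keywords` tests whether a keyword is a substring of a token, short stems such as 'vol' also match unrelated words (e.g. 'volonté', 'révolution'), while the two-word phrase 'défense aérienne' can never match a single token. The filter also ignores context and how close the two themes are within a document.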

Advantages: Quick way to narrow down for historical corpora where manual review is impractical.

Limitations: Keyword matching ignores semantics and OCR-induced variations. In historical contexts, terms may have evolved meanings.
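
One way to reduce the over-inclusion described above is to match keywords as whole tokens and to require that an aviation term and a communication term occur within a fixed token window. The following is a sketch under those assumptions; the shortened keyword sets, the function names, and the default window of 50 tokens are illustrative choices, not values tuned on the corpus.

```python
# Illustrative, shortened keyword sets (exact tokens after cleaning)
AVIATION_TERMS = {'aviation', 'avion', 'avions', 'pilote', 'aéroport', 'bombardier'}
COMMUNICATION_TERMS = {'radio', 'téléphone', 'télégraphe', 'transmission'}


def keyword_positions(tokens, terms):
    """Return the indices of tokens that exactly equal one of the terms."""
    return [i for i, token in enumerate(tokens) if token in terms]


def cooccur_within(tokens, terms_a, terms_b, window=50):
    """True if a term from each set occurs within `window` tokens of the other."""
    pos_a = keyword_positions(tokens, terms_a)
    pos_b = keyword_positions(tokens, terms_b)
    return any(abs(a - b) <= window for a in pos_a for b in pos_b)


# 'pilote' and 'radio' are two tokens apart, so this document qualifies
tokens = ['pilote', 'signale', 'radio', 'révolution', 'volonté']
print(cooccur_within(tokens, AVIATION_TERMS, COMMUNICATION_TERMS, window=5))

# Substring matching would accept this document ('vol' in 'révolution',
# 'signal' in 'signalement'); exact matching does not
print(cooccur_within(['révolution', 'signalement'], AVIATION_TERMS, COMMUNICATION_TERMS))
```

Exact matching trades recall for precision: inflected or OCR-damaged forms are missed unless they are listed explicitly, so the keyword sets would need to be extended after inspecting the corpus vocabulary.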