# Mercedes F1 Infringement Document Filter - Strict Version

This notebook extracts Mercedes-specific infringement documents from the FIA PDF collection with strict filtering criteria.

## Objective
- Parse all PDF files in the Documents folder (2020-2024)
- Filter ONLY for documents addressed specifically to "Mercedes-AMG Petronas F1 Team"
- EXCLUDE documents addressed to "All Teams", "All Officials", or multiple teams
- Extract text content from filtered PDFs
- Save as individual .txt files organized by year

## Key Differences from Previous Version
- Stricter filtering: Only documents specifically addressed to Mercedes team
- No car number-based filtering (only team address filtering)
- Explicit exclusion of "All Teams" documents


In [1]:
# Import required libraries
import os
import re
from pathlib import Path
import PyPDF2
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')


In [2]:
# Configuration for folder structure
base_path = Path("Documents")
years = ["2020_inf_profile", "2021_inf_profile", 
         "2022_inf_profile", "2023_inf_profile", "2024_inf_profile"]

print("Folder structure:")
for year in years:
    year_path = base_path / year
    if year_path.exists():
        pdf_count = len(list(year_path.glob("*.pdf")))
        txt_count = len(list(year_path.glob("*.txt")))
        print(f"  {year}: {pdf_count} PDF files, {txt_count} TXT files")
    else:
        print(f"  {year}: Folder not found")


Folder structure:
  2020_inf_profile: 131 PDF files, 0 TXT files
  2021_inf_profile: 175 PDF files, 0 TXT files
  2022_inf_profile: 227 PDF files, 0 TXT files
  2023_inf_profile: 194 PDF files, 0 TXT files
  2024_inf_profile: 215 PDF files, 0 TXT files


In [3]:
# Function to extract text from PDF
def extract_pdf_text(pdf_path):
    """
    Extract text from a PDF file.
    Returns the full text content, or None if the file cannot be read.
    """
    try:
        with open(pdf_path, 'rb') as file:
            pdf_reader = PyPDF2.PdfReader(file)
            text = ""
            
            # Pages with no extractable text contribute an empty string
            for page in pdf_reader.pages:
                text += (page.extract_text() or "") + "\n"
            
            return text
    except Exception as e:
        print(f"Error reading {pdf_path}: {e}")
        return None


In [4]:
# STRICT function to check if document is addressed to the Mercedes team
def is_mercedes_document_strict(text):
    """
    Check if the header section names Mercedes-AMG Petronas F1 Team and
    contains no broadcast address ("All Teams", "All Officials", etc.).
    Note: the last Mercedes pattern has no "to" prefix, so any mention of
    the team name in the header section counts as a match.
    """
    if not text:
        return False
    
    # Convert to lowercase for case-insensitive matching
    text_lower = text.lower()
    
    # Check first 2000 characters (header section); documents shorter
    # than that are searched in full
    header_section = text_lower[:2000]
    
    # EXCLUDE documents addressed to multiple teams/officials
    exclude_patterns = [
        r'to\s+all\s+teams,?\s+all\s+officials',
        r'to\s+all\s+teams',
        r'all\s+teams,?\s+all\s+officials',
        r'to\s+all\s+competitors',
        r'all\s+competitors',
        r'to\s+all\s+participants',
        r'all\s+participants',
        r'to\s+all\s+drivers',
        r'all\s+drivers',
        r'to\s+all\s+constructors',
        r'all\s+constructors'
    ]
    
    for pattern in exclude_patterns:
        if re.search(pattern, header_section, re.IGNORECASE):
            return False  # Exclude this document
    
    # INCLUDE ONLY documents specifically addressed to Mercedes team
    mercedes_patterns = [
        r'to:\s*the team manager,?\s*mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'to\s+the team manager,?\s*mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'to\s+mercedes[-\s]*amg\s*petronas\s*f1\s*team',
        r'mercedes[-\s]*amg\s*petronas\s*f1\s*team\s*manager',
        r'mercedes[-\s]*amg\s*petronas\s*f1\s*team'
    ]
    
    # Check for Mercedes team references
    for pattern in mercedes_patterns:
        if re.search(pattern, header_section, re.IGNORECASE):
            return True
    
    return False
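
The final entry in `mercedes_patterns` has no `to` prefix, so the function above accepts any header section that mentions the team name, even when the document is addressed to someone else. A minimal sketch of a tighter check follows, assuming every header has the shape seen in this notebook's sample output (`From The Stewards To <recipient> Document <n> ...`). The names `extract_recipient` and `is_addressed_only_to_mercedes` are hypothetical helpers, not part of the original notebook, and the second header in the usage lines is made up for illustration.

```python
import re

# Assumed header shape: "... To <recipient> Document <n> ..."
RECIPIENT_RE = re.compile(r"\bto:?\s+(.*?)\s*document\s+\d+", re.IGNORECASE | re.DOTALL)
MERCEDES_RE = re.compile(r"mercedes[-\s]*amg\s*petronas\s*f1\s*team", re.IGNORECASE)
SALUTATION_RE = re.compile(r"the\s+team\s+managers?", re.IGNORECASE)


def extract_recipient(text, header_chars=2000):
    """Return the text between 'To' and 'Document <n>', or None if absent."""
    match = RECIPIENT_RE.search((text or "")[:header_chars])
    return match.group(1).strip() if match else None


def is_addressed_only_to_mercedes(text):
    """True only if the recipient line names Mercedes and nothing else."""
    recipient = extract_recipient(text)
    if not recipient or not MERCEDES_RE.search(recipient):
        return False
    # Strip the team name and salutation; anything left is another addressee
    leftover = SALUTATION_RE.sub("", MERCEDES_RE.sub("", recipient))
    return leftover.strip(" ,:;") == ""


mercedes_header = (
    "From The Stewards To The Team Manager, Mercedes-AMG Petronas F1 Team"
    "Document 33 Date 04 July 2020 Time 19:44"
)
other_header = (  # made-up header for illustration
    "From The Stewards To The Team Manager, Red Bull Racing Document 12 "
    "Competitor Mercedes-AMG Petronas F1 Team"
)
print(is_addressed_only_to_mercedes(mercedes_header))
print(is_addressed_only_to_mercedes(other_header))
```

Isolating the recipient first means the team name is tested only against the address line, so a mention further down the document no longer decides the outcome. If real headers deviate from the assumed shape, `extract_recipient` returns `None` and the document is rejected, which is worth auditing before relying on it.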


In [5]:
# Function to clean and save text content
def clean_text(text):
    """
    Clean extracted text
    """
    if not text:
        return ""
    
    # Remove excessive whitespace and normalize
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    
    return text
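
Because `clean_text` replaces every whitespace run, newlines included, with a single space, each saved `.txt` file is one long line, and later line-based lookups have no line ends to anchor on. If line structure matters downstream, a variant along these lines keeps one extracted line per output line. `clean_text_keep_lines` is a hypothetical name, not part of the original notebook.

```python
import re


def clean_text_keep_lines(text):
    """Collapse spaces and tabs within each line and drop empty lines."""
    if not text:
        return ""
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


raw = "From   The Stewards\n\n  To\tThe Team Manager  \n"
print(clean_text_keep_lines(raw))
```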


In [6]:
# Main processing function with strict filtering
def process_year_folder_strict(year_folder):
    """
    Process all PDFs in a year folder and extract Mercedes documents with strict filtering
    Skip PDFs that already have corresponding TXT files
    """
    year_path = base_path / year_folder
    if not year_path.exists():
        print(f"Folder {year_path} does not exist")
        return []
    
    pdf_files = list(year_path.glob("*.pdf"))
    print(f"\nProcessing {len(pdf_files)} PDF files in {year_folder}...")
    
    mercedes_docs = []
    skipped_count = 0
    excluded_count = 0
    
    for pdf_file in tqdm(pdf_files, desc="Processing PDFs"):
        # Check if TXT file already exists
        txt_filename = pdf_file.stem + ".txt"
        txt_path = year_path / txt_filename
        
        if txt_path.exists():
            # TXT file already exists, skip processing
            skipped_count += 1
            continue
        
        # Extract text from PDF
        text = extract_pdf_text(pdf_file)
        
        if text and is_mercedes_document_strict(text):
            # Clean the text
            cleaned_text = clean_text(text)
            
            # Save as text file
            try:
                with open(txt_path, 'w', encoding='utf-8') as f:
                    f.write(cleaned_text)
                
                mercedes_docs.append({
                    'pdf_file': pdf_file.name,
                    'txt_file': txt_filename,
                    'year': year_folder,
                    'text_length': len(cleaned_text)
                })
                
                print(f"✓ Found Mercedes document: {pdf_file.name}")
                
            except Exception as e:
                print(f"Error saving {txt_filename}: {e}")
        else:
            # Also counts PDFs whose text could not be extracted (text is None)
            excluded_count += 1
    
    print(f"Skipped {skipped_count} PDFs (TXT files already exist)")
    print(f"Excluded {excluded_count} PDFs (not addressed to Mercedes only)")
    return mercedes_docs
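
One consequence of the skip logic above: a `.txt` file is written only for PDFs that pass the filter, so on a rerun the matched PDFs are skipped while every non-matching PDF is parsed again. The sketch below isolates the "which PDFs still need work" step so it can be checked without real FIA files. `pending_pdfs` is a hypothetical helper and the file names are placeholders.

```python
import tempfile
from pathlib import Path


def pending_pdfs(folder):
    """Return PDFs in `folder` that have no same-named .txt file yet."""
    folder = Path(folder)
    return sorted(
        pdf for pdf in folder.glob("*.pdf")
        if not (folder / (pdf.stem + ".txt")).exists()
    )


# Demonstrate with placeholder files in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for name in ("a.pdf", "b.pdf", "b.txt"):
        (tmp_path / name).write_bytes(b"")
    pending_names = [pdf.name for pdf in pending_pdfs(tmp_path)]

print(pending_names)
```

To avoid reparsing rejected PDFs on every run, one option would be to record them in a separate marker file, at the cost of having to clear that file whenever the filter changes.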


In [7]:
# Process all year folders with strict filtering
print("="*60)
print("PROCESSING ALL YEARS WITH STRICT MERCEDES FILTERING")
print("="*60)

all_mercedes_docs = []

for year_folder in years:
    print(f"\n{'='*50}")
    print(f"PROCESSING {year_folder.upper()}")
    print(f"{'='*50}")
    
    mercedes_docs = process_year_folder_strict(year_folder)
    if mercedes_docs:
        all_mercedes_docs.extend(mercedes_docs)
        print(f"\n✓ Found {len(mercedes_docs)} Mercedes documents in {year_folder}")
        for doc in mercedes_docs:
            print(f"  - {doc['pdf_file']} ({doc['text_length']} chars)")
    else:
        print(f"\n✗ No Mercedes documents found in {year_folder}")


PROCESSING ALL YEARS WITH STRICT MERCEDES FILTERING

PROCESSING 2020_INF_PROFILE

Processing 131 PDF files in 2020_inf_profile...


Processing PDFs:   5%|▌         | 7/131 [00:00<00:15,  8.25it/s]

✓ Found Mercedes document: 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf


Processing PDFs:  10%|▉         | 13/131 [00:01<00:12,  9.30it/s]

✓ Found Mercedes document: 2020 Austrian Grand Prix - Decision - review of decision (document 33).pdf


Processing PDFs:  18%|█▊        | 23/131 [00:01<00:05, 19.20it/s]

✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Failure to slow for yellow flags (post review).pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - incident with car 23.pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Leaving the track in turn 10.pdf
✓ Found Mercedes document: 2020 Austrian Grand Prix - Offence - Car 44 - Track Limits turn 10.pdf


Processing PDFs:  53%|█████▎    | 70/131 [00:03<00:01, 35.82it/s]

✓ Found Mercedes document: 2020 Italian Grand Prix - Offence - Car 44 - Entering closed pit lane.pdf


Processing PDFs:  66%|██████▌   | 86/131 [00:03<00:01, 31.75it/s]

✓ Found Mercedes document: 2020 Russian Grand Prix - Decision - Car 44 - Turn 2 .pdf


Processing PDFs:  76%|███████▋  | 100/131 [00:03<00:00, 33.91it/s]

✓ Found Mercedes document: 2020 Russian Grand Prix - Replacement for Document 46 - Offence - Car 44 - 1st Practice Start.pdf
✓ Found Mercedes document: 2020 Russian Grand Prix - Replacement for Document 47 - Offence - Car 44 - 2nd Practice Start.pdf
✓ Found Mercedes document: 2020 Sakhir Grand Prix - Offence - Mercedes - Car 63 incorrect use of tyres.pdf


Processing PDFs: 100%|██████████| 131/131 [00:04<00:00, 28.26it/s]


Skipped 0 PDFs (TXT files already exist)
Excluded 120 PDFs (not addressed to Mercedes only)

✓ Found 11 Mercedes documents in 2020_inf_profile
  - 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf (1438 chars)
  - 2020 Austrian Grand Prix - Decision - review of decision (document 33).pdf (1124 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Failure to slow for yellow flags (post review).pdf (1632 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - incident with car 23.pdf (1418 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Leaving the track in turn 10.pdf (1623 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Track Limits turn 10.pdf (1361 chars)
  - 2020 Italian Grand Prix - Offence - Car 44 - Entering closed pit lane.pdf (1304 chars)
  - 2020 Russian Grand Prix - Decision - Car 44 - Turn 2 .pdf (1928 chars)
  - 2020 Russian Grand Prix - Replacement for Document 46 - Offence - Car 44 - 1st Practice Start.pdf (1

Processing PDFs:  15%|█▍        | 26/175 [00:00<00:02, 62.06it/s]

✓ Found Mercedes document: 2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .pdf


Processing PDFs:  30%|███       | 53/175 [00:01<00:02, 45.62it/s]

✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - DRS.pdf
✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - PU element.pdf
✓ Found Mercedes document: 2021 Brazilian Grand Prix - Offence - Car 44 - Safety Belts.pdf


Processing PDFs:  33%|███▎      | 58/175 [00:01<00:02, 41.95it/s]

✓ Found Mercedes document: 2021 British Grand Prix - Offence - Car 44 - Causing a collision with car 33.pdf


Processing PDFs:  57%|█████▋    | 100/175 [00:01<00:01, 58.48it/s]

✓ Found Mercedes document: 2021 Hungarian Grand Prix - Offence - Car 77 - causing a collision.pdf
✓ Found Mercedes document: 2021 Hungarian Grand Prix - Offence - Car 77 - Pre-Race procedure.pdf


Processing PDFs:  70%|███████   | 123/175 [00:02<00:00, 68.31it/s]

✓ Found Mercedes document: 2021 Italian Grand Prix - Offence - Car 77 - PU element.pdf
✓ Found Mercedes document: 2021 Italian Grand Prix - Offence - Car 77 - PU elements.pdf
✓ Found Mercedes document: 2021 Mexican Grand Prix - Offence - Car 44 - Turn 2 .pdf


Processing PDFs:  82%|████████▏ | 144/175 [00:02<00:00, 61.05it/s]

✓ Found Mercedes document: 2021 Qatar Grand Prix - Offence - Car 77 - Single waved yellow flag.pdf
✓ Found Mercedes document: 2021 Russian Grand Prix - Offence - Car 77 - PU elements.pdf


Processing PDFs:  90%|█████████ | 158/175 [00:02<00:00, 63.94it/s]

✓ Found Mercedes document: 2021 Saudi Arabian Grand Prix - Decision - Car 44 - double yellow.pdf
✓ Found Mercedes document: 2021 Saudi Arabian Grand Prix - Offence - Car 44 - Impeding.pdf


Processing PDFs:  98%|█████████▊| 172/175 [00:02<00:00, 65.11it/s]

✓ Found Mercedes document: 2021 Turkish Grand Prix - Offence - Car 44 - PU element.pdf
✓ Found Mercedes document: 2021 United States Grand Prix - Offence - Car 77 - PU elements.pdf


Processing PDFs: 100%|██████████| 175/175 [00:03<00:00, 54.69it/s]


Skipped 0 PDFs (TXT files already exist)
Excluded 159 PDFs (not addressed to Mercedes only)

✓ Found 16 Mercedes documents in 2021_inf_profile
  - 2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .pdf (1463 chars)
  - 2021 Brazilian Grand Prix - Offence - Car 44 - DRS.pdf (8030 chars)
  - 2021 Brazilian Grand Prix - Offence - Car 44 - PU element.pdf (1046 chars)
  - 2021 Brazilian Grand Prix - Offence - Car 44 - Safety Belts.pdf (1667 chars)
  - 2021 British Grand Prix - Offence - Car 44 - Causing a collision with car 33.pdf (1370 chars)
  - 2021 Hungarian Grand Prix - Offence - Car 77 - causing a collision.pdf (1230 chars)
  - 2021 Hungarian Grand Prix - Offence - Car 77 - Pre-Race procedure.pdf (1231 chars)
  - 2021 Italian Grand Prix - Offence - Car 77 - PU element.pdf (1136 chars)
  - 2021 Italian Grand Prix - Offence - Car 77 - PU elements.pdf (1141 chars)
  - 2021 Mexican Grand Prix - Offence - Car 44 - Turn 2 .pdf (1569 chars)
  - 2021 Qatar Gr

Processing PDFs:   3%|▎         | 7/227 [00:00<00:03, 64.60it/s]

✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Decision - Car 44 - Red Flag_0.pdf
✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Offence - Car 44 - Pit lane speeding.pdf
✓ Found Mercedes document: 2022 Abu Dhabi Grand Prix - Offence - Car 63 - Unsafe release.pdf
✓ Found Mercedes document: 2022 Australian Grand Prix - Decision - Car 44 - Alleged impeding of Car 18 at turn 13.pdf


Processing PDFs:  19%|█▉        | 43/227 [00:00<00:02, 67.72it/s]

✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 44 - Parc Ferme Instructions.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 63 - Causing a collision.pdf
✓ Found Mercedes document: 2022 Austrian Grand Prix - Offence - Car 63 - Entered the track on foot.pdf
✓ Found Mercedes document: 2022 Azerbaijan Grand Prix - Decision - Car 44 - Allegedly driving unnecessarily slowly during Qualifying.pdf


Processing PDFs:  33%|███▎      | 75/227 [00:01<00:02, 71.37it/s]

✓ Found Mercedes document: 2022 Belgian Grand Prix - Offence - Car 44 - Alleged causing a collision.pdf


Processing PDFs:  50%|█████     | 114/227 [00:01<00:01, 67.29it/s]

✓ Found Mercedes document: 2022 Dutch Grand Prix - Offence - Car 44 - Alleged impeding of Car 55.pdf


Processing PDFs:  64%|██████▍   | 146/227 [00:02<00:01, 69.93it/s]

✓ Found Mercedes document: 2022 Italian Grand Prix - Offence - Car 44 - PU element.pdf


Processing PDFs:  88%|████████▊ | 200/227 [00:03<00:00, 61.94it/s]

✓ Found Mercedes document: 2022 Singapore Grand Prix - Decision - Car 44 - Breach of Appendix L.pdf
✓ Found Mercedes document: 2022 Singapore Grand Prix - Offence - Car 63 - Pit lane speeding.pdf
✓ Found Mercedes document: 2022 Singapore Grand Prix - Offence - Car 63 - PU elements.pdf


Processing PDFs: 100%|██████████| 227/227 [00:03<00:00, 65.26it/s]


✓ Found Mercedes document: 2022 United States Grand Prix - Offence - Car 63 - T1 Incident with car 55.pdf
Skipped 0 PDFs (TXT files already exist)
Excluded 212 PDFs (not addressed to Mercedes only)

✓ Found 15 Mercedes documents in 2022_inf_profile
  - 2022 Abu Dhabi Grand Prix - Decision - Car 44 - Red Flag_0.pdf (2805 chars)
  - 2022 Abu Dhabi Grand Prix - Offence - Car 44 - Pit lane speeding.pdf (1015 chars)
  - 2022 Abu Dhabi Grand Prix - Offence - Car 63 - Unsafe release.pdf (1037 chars)
  - 2022 Australian Grand Prix - Decision - Car 44 - Alleged impeding of Car 18 at turn 13.pdf (1313 chars)
  - 2022 Austrian Grand Prix - Offence - Car 44 - Parc Ferme Instructions.pdf (1593 chars)
  - 2022 Austrian Grand Prix - Offence - Car 63 - Causing a collision.pdf (1443 chars)
  - 2022 Austrian Grand Prix - Offence - Car 63 - Entered the track on foot.pdf (1564 chars)
  - 2022 Azerbaijan Grand Prix - Decision - Car 44 - Allegedly driving unnecessarily slowly during Qualifying.pdf (1741 cha

Processing PDFs:   0%|          | 0/194 [00:00<?, ?it/s]

✓ Found Mercedes document: 2023 Abu Dhabi Grand Prix - Infringement - Mercedes - Team Principal (Updated).pdf


Processing PDFs:   9%|▉         | 17/194 [00:00<00:05, 32.26it/s]

✓ Found Mercedes document: 2023 Australian Grand Prix - Decision - Mercedes - Inaccurate Self Scrutineering Form.pdf


Processing PDFs:  20%|█▉        | 38/194 [00:00<00:03, 46.47it/s]

✓ Found Mercedes document: 2023 Austrian Grand Prix - Infringement - Car 44 - Leaving the track multiple times.pdf
✓ Found Mercedes document: 2023 Austrian Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf


Processing PDFs:  27%|██▋       | 53/194 [00:01<00:02, 56.24it/s]

✓ Found Mercedes document: 2023 Bahrain Grand Prix - Decision - Car 44 - Wearing of Jewellery.pdf


Processing PDFs:  35%|███▍      | 67/194 [00:01<00:02, 61.27it/s]

✓ Found Mercedes document: 2023 Belgian Grand Prix - Infringement - Car 44 - Causing a Collision.pdf
✓ Found Mercedes document: 2023 British Grand Prix - Infringement - Mercedes - Thursday Press Conference.pdf
✓ Found Mercedes document: 2023 Canadian Grand Prix - Decision - Car 44 - Alleged Unsafe Release.pdf


Processing PDFs:  56%|█████▌    | 109/194 [00:02<00:01, 61.49it/s]

✓ Found Mercedes document: 2023 Italian Grand Prix - Infringement - Car 44 - Causing a Collision.pdf
✓ Found Mercedes document: 2023 Italian Grand Prix - Infringement - Car 63 - Leaving the track.pdf
✓ Found Mercedes document: 2023 Las Vegas Grand Prix - Infringement - Car 63 - Causing a collision.pdf


Processing PDFs:  68%|██████▊   | 132/194 [00:02<00:00, 70.26it/s]

✓ Found Mercedes document: 2023 Miami Grand Prix - Decision - Car 44 - Turn 17 Incident.pdf


Processing PDFs:  72%|███████▏  | 140/194 [00:02<00:00, 71.81it/s]

✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf
✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 63 - Pit Lane Speeding.pdf
✓ Found Mercedes document: 2023 Monaco Grand Prix - Infringement - Car 63 - Unsafe Rejoin.pdf


Processing PDFs:  80%|███████▉  | 155/194 [00:02<00:00, 52.46it/s]

✓ Found Mercedes document: 2023 Qatar Grand Prix - Infringement - Car 44 - Crossing the track.pdf
✓ Found Mercedes document: 2023 Saudi Arabian Grand Prix - Offence - Mercedes - Inaccurate Scrutineering Form.pdf


Processing PDFs:  96%|█████████▌| 186/194 [00:03<00:00, 49.30it/s]

✓ Found Mercedes document: 2023 Spanish Grand Prix - Infringement - Car 63 - Abnormal change of direction.pdf
✓ Found Mercedes document: 2023 Spanish Grand Prix - Infringement - Mercedes - Parc Ferme.pdf
✓ Found Mercedes document: 2023 São Paulo Grand Prix - Infringement - Car 63 - Impeding at Pit Exit.pdf


Processing PDFs: 100%|██████████| 194/194 [00:03<00:00, 53.29it/s]


✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 44 - Technical non-compliance (Plank).pdf
✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 63 - Impeding of Car 16.pdf
✓ Found Mercedes document: 2023 United States Grand Prix - Infringement - Car 63 - Leaving the track.pdf
Skipped 0 PDFs (TXT files already exist)
Excluded 171 PDFs (not addressed to Mercedes only)

✓ Found 23 Mercedes documents in 2023_inf_profile
  - 2023 Abu Dhabi Grand Prix - Infringement - Mercedes - Team Principal (Updated).pdf (2240 chars)
  - 2023 Australian Grand Prix - Decision - Mercedes - Inaccurate Self Scrutineering Form.pdf (1421 chars)
  - 2023 Austrian Grand Prix - Infringement - Car 44 - Leaving the track multiple times.pdf (1281 chars)
  - 2023 Austrian Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf (1152 chars)
  - 2023 Bahrain Grand Prix - Decision - Car 44 - Wearing of Jewellery.pdf (1347 chars)
  - 2023 Belgian Grand Prix - Infrin

Processing PDFs:  18%|█▊        | 38/215 [00:00<00:03, 54.36it/s]

✓ Found Mercedes document: 2024 Austrian Grand Prix - Infringement - Car 44 - Crossing the line at Pit Entry.pdf
✓ Found Mercedes document: 2024 Austrian Grand Prix - Infringement - Car 44 - Unsafe release.pdf


Processing PDFs:  21%|██        | 45/215 [00:00<00:03, 56.16it/s]

✓ Found Mercedes document: 2024 Azerbaijan Grand Prix - Infringement - Car 44 - Changes made during Parc Ferme.pdf
✓ Found Mercedes document: 2024 Azerbaijan Grand Prix - Infringement - Car 63 - Failing to slow for yellow flags.pdf


Processing PDFs:  27%|██▋       | 58/215 [00:01<00:03, 43.52it/s]

✓ Found Mercedes document: 2024 Belgian Grand Prix - Infringement - Car 63 - Technical non-compliance (Weight).pdf


Processing PDFs:  43%|████▎     | 92/215 [00:01<00:02, 50.81it/s]

✓ Found Mercedes document: 2024 Dutch Grand Prix - Infringement - Car 44 - impeding of Car 11.pdf


Processing PDFs:  53%|█████▎    | 114/215 [00:02<00:02, 47.20it/s]

✓ Found Mercedes document: 2024 Japanese Grand Prix - Infringement - Car 63 - Unsafe release.pdf


Processing PDFs:  69%|██████▉   | 149/215 [00:03<00:01, 51.10it/s]

✓ Found Mercedes document: 2024 Miami Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf


Processing PDFs:  73%|███████▎  | 157/215 [00:03<00:01, 57.58it/s]

✓ Found Mercedes document: 2024 Qatar Grand Prix - Infringement - Car 44 - False Start.pdf
✓ Found Mercedes document: 2024 Qatar Grand Prix - Infringement - Car 44 - Speeding in the Pit Lane.pdf
✓ Found Mercedes document: 2024 Qatar Grand Prix - Infringement - Car 63 - Failing to maintain distance behind the Safety Car.pdf


Processing PDFs:  81%|████████  | 174/215 [00:03<00:00, 44.78it/s]

✓ Found Mercedes document: 2024 Saudi Arabian Grand Prix - Infringement - Car 44 - Impeding of Car 2.pdf


Processing PDFs:  91%|█████████ | 196/215 [00:04<00:00, 54.48it/s]

✓ Found Mercedes document: 2024 São Paulo Grand Prix - Infringement - Car 44 - Tyre Pressure Checks.pdf
✓ Found Mercedes document: 2024 São Paulo Grand Prix - Infringement - Car 63 - Aborted Start incident.pdf
✓ Found Mercedes document: 2024 São Paulo Grand Prix - Infringement - Car 63 - Tyre Pressure Checks.pdf


Processing PDFs:  97%|█████████▋| 209/215 [00:04<00:00, 48.38it/s]

✓ Found Mercedes document: 2024 United States Grand Prix - Infringement - Car 63 - Breach of Parc Ferme.pdf
✓ Found Mercedes document: 2024 United States Grand Prix - Infringement - Car 63 - Forcing another driver off the track.pdf


Processing PDFs: 100%|██████████| 215/215 [00:05<00:00, 42.47it/s]

Skipped 0 PDFs (TXT files already exist)
Excluded 198 PDFs (not addressed to Mercedes only)

✓ Found 17 Mercedes documents in 2024_inf_profile
  - 2024 Austrian Grand Prix - Infringement - Car 44 - Crossing the line at Pit Entry.pdf (1327 chars)
  - 2024 Austrian Grand Prix - Infringement - Car 44 - Unsafe release.pdf (1448 chars)
  - 2024 Azerbaijan Grand Prix - Infringement - Car 44 - Changes made during Parc Ferme.pdf (1739 chars)
  - 2024 Azerbaijan Grand Prix - Infringement - Car 63 - Failing to slow for yellow flags.pdf (3109 chars)
  - 2024 Belgian Grand Prix - Infringement - Car 63 - Technical non-compliance (Weight).pdf (2135 chars)
  - 2024 Dutch Grand Prix - Infringement - Car 44 - impeding of Car 11.pdf (2132 chars)
  - 2024 Japanese Grand Prix - Infringement - Car 63 - Unsafe release.pdf (2241 chars)
  - 2024 Miami Grand Prix - Infringement - Car 44 - Pit Lane Speeding.pdf (1484 chars)
  - 2024 Qatar Grand Prix - Infringement - Car 44 - False Start.pdf (1345 chars)
  - 202




In [8]:
# Summary statistics
if all_mercedes_docs:
    df_mercedes = pd.DataFrame(all_mercedes_docs)
    
    print("\n" + "="*60)
    print("MERCEDES DOCUMENT EXTRACTION SUMMARY (STRICT FILTERING)")
    print("="*60)
    
    print(f"Total Mercedes documents found: {len(all_mercedes_docs)}")
    print(f"Total text length: {sum([doc['text_length'] for doc in all_mercedes_docs]):,} characters")
    
    print("\nDocuments by year:")
    year_counts = df_mercedes['year'].value_counts().sort_index()
    for year, count in year_counts.items():
        print(f"  {year}: {count} documents")
    
    print("\nSample documents:")
    for _, row in df_mercedes.head().iterrows():
        print(f"  - {row['pdf_file']} ({row['text_length']} chars)")
    
    # Save summary
    df_mercedes.to_csv('mercedes_documents_summary_strict.csv', index=False)
    print("\nSummary saved to: mercedes_documents_summary_strict.csv")
    
else:
    print("\nNo Mercedes documents found. Please check the filtering criteria.")



MERCEDES DOCUMENT EXTRACTION SUMMARY (STRICT FILTERING)
Total Mercedes documents found: 82
Total text length: 137,437 characters

Documents by year:
  2020_inf_profile: 11 documents
  2021_inf_profile: 16 documents
  2022_inf_profile: 15 documents
  2023_inf_profile: 23 documents
  2024_inf_profile: 17 documents

Sample documents:
  - 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf (1438 chars)
  - 2020 Austrian Grand Prix - Decision - review of decision (document 33).pdf (1124 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Failure to slow for yellow flags (post review).pdf (1632 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - incident with car 23.pdf (1418 chars)
  - 2020 Austrian Grand Prix - Offence - Car 44 - Leaving the track in turn 10.pdf (1623 chars)

Summary saved to: mercedes_documents_summary_strict.csv
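
The per-year figures printed above can be cross-checked with a few lines of arithmetic: for each folder, found plus excluded plus skipped should equal the number of PDFs. The numbers below are copied from the run output in this notebook; the dictionary layout is only an illustration.

```python
# (total PDFs, Mercedes documents found, excluded, skipped) per folder,
# copied from the run output in this notebook
run_counts = {
    "2020_inf_profile": (131, 11, 120, 0),
    "2021_inf_profile": (175, 16, 159, 0),
    "2022_inf_profile": (227, 15, 212, 0),
    "2023_inf_profile": (194, 23, 171, 0),
    "2024_inf_profile": (215, 17, 198, 0),
}

for folder, (total, found, excluded, skipped) in run_counts.items():
    # Every PDF must land in exactly one bucket
    assert found + excluded + skipped == total, folder
    print(f"{folder}: {found}/{total} kept ({found / total:.1%})")

total_found = sum(found for _, found, _, _ in run_counts.values())
total_pdfs = sum(total for total, _, _, _ in run_counts.values())
print(f"Total: {total_found}/{total_pdfs} kept")
```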


In [9]:
# Display sample of extracted text for verification
if all_mercedes_docs:
    print("\n" + "="*60)
    print("SAMPLE EXTRACTED TEXT (STRICT FILTERING)")
    print("="*60)
    
    # Get first Mercedes document
    first_doc = all_mercedes_docs[0]
    year_path = base_path / first_doc['year']
    txt_path = year_path / first_doc['txt_file']
    
    if txt_path.exists():
        with open(txt_path, 'r', encoding='utf-8') as f:
            sample_text = f.read()
        
        print(f"Document: {first_doc['pdf_file']}")
        print(f"Year: {first_doc['year']}")
        print(f"Length: {len(sample_text)} characters")
        print("\nFirst 1000 characters:")
        print("-" * 50)
        print(sample_text[:1000] + "..." if len(sample_text) > 1000 else sample_text)
        
        # Show the "To:" section to verify filtering
        print("\n" + "="*60)
        print("VERIFICATION: 'TO:' SECTION")
        print("="*60)
        
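        # Find the recipient: the cleaned text has no newlines and the header reads
        # "To The Team Manager, ..." (no colon), so capture up to "Document N"
        to_match = re.search(r'\bto:?\s+(.*?)(?=\s*document\s+\d+|$)', sample_text, re.IGNORECASE | re.DOTALL)
        if to_match:
            print(f"Found 'To:' section: {to_match.group().strip()}")
        else:
            print("No 'To:' section found in the document")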



SAMPLE EXTRACTED TEXT (STRICT FILTERING)
Document: 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.pdf
Year: 2020_inf_profile
Length: 1438 characters

First 1000 characters:
--------------------------------------------------
From The Stewards To The Team Manager, Mercedes-AMG Petronas F1 TeamDocument 33 Date 04 July 2020 Time 19:44 2020 AUSTRIAN GRAND PRIX 2 - 5 July 2020 The StewardsThe Stewards, having received a report from the Race Director, summoned (document 29) and heard from the driver and team representative, have considered the following matter and determine the following: No / Driver 44 - Lewis Hamilton Competitor Mercedes-AMG Petronas F1 Team Time 15:59 Session Qualifying Fact Alleged failure to slow for single waved yellow flags between turn 5 and 7. Offence Alleged breach of Appendix H Article 2.5.5.1.b) of the FIA International Sporting Code. Decision No further action. Reason The Stewards heard from the driver of Car 44 (Lewis Ha

## Summary

This notebook implements **strict filtering** for Mercedes F1 infringement documents:

### ✅ **INCLUDED:**
- Documents addressed specifically to "Mercedes-AMG Petronas F1 Team"
- Documents addressed to "The Team Manager, Mercedes-AMG Petronas F1 Team"

### ❌ **EXCLUDED:**
- Documents addressed to "All Teams, All Officials"
- Documents addressed to "All Teams"
- Documents addressed to "All Competitors"
- Documents addressed to "All Participants"
- Documents addressed to "All Drivers"
- Documents addressed to "All Constructors"
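
Several entries in the code's exclusion list are already covered by a shorter sibling (for example, `to\s+all\s+teams` matches everything that `to\s+all\s+teams,?\s+all\s+officials` does). As a sketch, the eleven patterns can be collapsed into one alternation that gives the same yes/no result. `BROADCAST_RE`, the two helper functions, and the sample headers are illustrative and not from the original notebook. Both forms also fire on a phrase such as "all drivers" anywhere in the searched text, not only in the address line.

```python
import re

# The eleven exclusion patterns used by is_mercedes_document_strict
exclude_patterns = [
    r'to\s+all\s+teams,?\s+all\s+officials',
    r'to\s+all\s+teams',
    r'all\s+teams,?\s+all\s+officials',
    r'to\s+all\s+competitors',
    r'all\s+competitors',
    r'to\s+all\s+participants',
    r'all\s+participants',
    r'to\s+all\s+drivers',
    r'all\s+drivers',
    r'to\s+all\s+constructors',
    r'all\s+constructors',
]

# Single alternation covering the same cases
BROADCAST_RE = re.compile(
    r"to\s+all\s+teams"
    r"|all\s+teams,?\s+all\s+officials"
    r"|all\s+(?:competitors|participants|drivers|constructors)"
)


def excluded_by_list(header):
    return any(re.search(pattern, header) for pattern in exclude_patterns)


def excluded_by_single_regex(header):
    return BROADCAST_RE.search(header) is not None


# Lowercase samples, since the notebook lowercases the header section first
samples = [
    "from the race director to all teams, all officials document 3",
    "to the team manager, mercedes-amg petronas f1 team document 33",
    "the stewards remind all drivers of the track limits",
    "all teams must submit the form",
]
for sample in samples:
    print(excluded_by_list(sample), excluded_by_single_regex(sample), sample)
```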

### **Key Differences from Previous Versions:**
1. **No car number filtering** - Only team address filtering
2. **Stricter exclusion patterns** - More comprehensive list of "all teams" variations
3. **Focused on team-specific documents** - Only documents specifically addressed to Mercedes

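This narrows the output to documents whose header names the Mercedes team, rather than general FIA communications that happen to mention Mercedes drivers. Because the last Mercedes pattern matches the team name anywhere in the first 2000 characters, a document addressed to another team that mentions Mercedes within that span can still pass, so the extracted set is worth spot-checking.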
