# Mercedes F1 Infringement Documents - Preprocessing

This notebook preprocesses Mercedes F1 infringement documents (FIA documents from 2020 to 2024) to prepare them for text mining and summarization.

## Preprocessing Steps:
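1. **Whitespace Normalization**: Collapse runs of whitespace into single spaces
2. **Article Removal**: Remove the articles "the", "a", and "an"
3. **Header Removal**: Drop everything before the "No / Driver" marker
4. **Competitor and Time Removal**: Drop the "Competitor ... Time HH:MM" block
5. **Appeal Information Removal**: Drop the appeal notice and everything after it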

## Objective:
- Clean and normalize Mercedes infringement text data
- Prepare consolidated datasets for each year
- Remove unnecessary words while preserving important information


In [7]:
# Import required libraries
import os
import re
from pathlib import Path
import pandas as pd
from tqdm import tqdm
import warnings
warnings.filterwarnings('ignore')


In [8]:
# Configuration
base_path = Path("Documents")
years = ["2020_inf_profile", "2021_inf_profile", 
         "2022_inf_profile", "2023_inf_profile", "2024_inf_profile"]

# Create output directory structure
output_dir = Path("pre_proc_op")
output_dir.mkdir(exist_ok=True)

# Create year subfolders
year_folders = {}
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = output_dir / year
    year_folder.mkdir(exist_ok=True)
    year_folders[year] = year_folder

print("Configuration:")
print(f"Base path: {base_path}")
print(f"Output directory: {output_dir}")
print(f"Year folders created: {list(year_folders.keys())}")

# Check existing TXT files
print(f"\nExisting Mercedes TXT files:")
for year_folder in years:
    year_path = base_path / year_folder
    if year_path.exists():
        txt_count = len(list(year_path.glob("*.txt")))
        print(f"  {year_folder}: {txt_count} TXT files")
    else:
        print(f"  {year_folder}: Folder not found")


Configuration:
Base path: Documents
Output directory: pre_proc_op
Year folders created: ['2020', '2021', '2022', '2023', '2024']

Existing Mercedes TXT files:
  2020_inf_profile: 11 TXT files
  2021_inf_profile: 16 TXT files
  2022_inf_profile: 15 TXT files
  2023_inf_profile: 23 TXT files
  2024_inf_profile: 17 TXT files


In [9]:
# Define preprocessing functions

def remove_articles(text):
    """
    Remove common articles (the, a, an) from text
    """
    if not text:
        return ""
    
    # Define articles to remove
    articles = ['the', 'a', 'an']
    
    # Split text into words
    words = text.split()
    
    # Filter out articles (case-insensitive)
    filtered_words = []
    for word in words:
        # Remove punctuation and convert to lowercase for comparison
        clean_word = re.sub(r'[^\w]', '', word.lower())
        if clean_word not in articles:
            filtered_words.append(word)
    
    return ' '.join(filtered_words)

def basic_clean(text):
    """
    Basic text cleaning - normalize whitespace
    """
    if not text:
        return ""
    
    # Remove excessive whitespace and normalize
    text = re.sub(r'\s+', ' ', text)
    text = text.strip()
    
    return text

def preprocess_text(text):
    """
    Complete preprocessing pipeline
    """
    if not text:
        return ""
    
    # Apply preprocessing steps
    text = basic_clean(text)
    text = remove_articles(text)
    
    return text

# Test the preprocessing functions
test_text = "The Mercedes team was fined for the violation of the track limits during the race."
print("Preprocessing test:")
print(f"Original: {test_text}")
print(f"After article removal: {remove_articles(test_text)}")
print(f"After full preprocessing: {preprocess_text(test_text)}")


Preprocessing test:
Original: The Mercedes team was fined for the violation of the track limits during the race.
After article removal: Mercedes team was fined for violation of track limits during race.
After full preprocessing: Mercedes team was fined for violation of track limits during race.
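
The test sentence above contains no punctuation attached to an article. `remove_articles` compares each whitespace-separated token after stripping its punctuation, so a token such as `(the` is dropped together with its bracket. A possible alternative, sketched here under the assumption that surrounding punctuation should survive, matches articles as whole words with a regular expression. `remove_articles_regex` is a hypothetical name and is not used elsewhere in this notebook.

```python
import re

# Whole-word match for the three articles, plus any whitespace that follows.
ARTICLE_RE = re.compile(r'\b(?:the|an|a)\b\s*', re.IGNORECASE)


def remove_articles_regex(text):
    """Remove 'the', 'a', 'an' as whole words, keeping adjacent punctuation."""
    if not text:
        return ""
    stripped = ARTICLE_RE.sub('', text)
    # Collapse any whitespace runs left behind by the removal.
    return re.sub(r'\s+', ' ', stripped).strip()


print(remove_articles_regex(
    "The Mercedes team was fined for the violation of the track limits during the race."
))
# -> Mercedes team was fined for violation of track limits during race.

print(remove_articles_regex("(the car stopped)"))
# -> (car stopped)
```

Both variants treat a standalone "a" or "A" as an article, so single-letter labels in the source text would be removed by either one.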


In [10]:
# Process individual documents and create preprocessed versions

def process_year_documents(year_folder):
    """
    Process all Mercedes documents in a year folder
    """
    year_path = base_path / year_folder
    year_display = year_folder.replace('_inf_profile', '')
    
    if not year_path.exists():
        print(f"Folder {year_path} does not exist")
        return []
    
    # Get all TXT files
    txt_files = list(year_path.glob("*.txt"))
    print(f"\nProcessing {len(txt_files)} files in {year_display}...")
    
    processed_docs = []
    
    for txt_file in tqdm(txt_files, desc=f"Processing {year_display}"):
        try:
            # Read original text
            with open(txt_file, 'r', encoding='utf-8') as f:
                original_text = f.read()
            
            # Preprocess the text
            preprocessed_text = preprocess_text(original_text)
            
            # Create output filename with "no_articles_" prefix
            output_filename = f"no_articles_{txt_file.name}"
            output_path = year_folders[year_display] / output_filename
            
            # Save preprocessed text
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(preprocessed_text)
            
            processed_docs.append({
                'year': year_display,
                'original_file': txt_file.name,
                'preprocessed_file': output_filename,
                'original_length': len(original_text),
                'preprocessed_length': len(preprocessed_text),
                'reduction_percent': ((len(original_text) - len(preprocessed_text)) / len(original_text)) * 100 if len(original_text) > 0 else 0
            })
            
        except Exception as e:
            print(f"Error processing {txt_file.name}: {e}")
    
    return processed_docs

# Process all years
print("="*60)
print("PREPROCESSING MERCEDES DOCUMENTS")
print("="*60)

all_processed_docs = []

for year_folder in years:
    print(f"\n{'='*50}")
    print(f"PROCESSING {year_folder.upper()}")
    print(f"{'='*50}")
    
    processed_docs = process_year_documents(year_folder)
    if processed_docs:
        all_processed_docs.extend(processed_docs)
        print(f"\n✓ Processed {len(processed_docs)} documents in {year_folder}")
    else:
        print(f"\n✗ No documents processed in {year_folder}")

print(f"\nTotal documents processed: {len(all_processed_docs)}")


PREPROCESSING MERCEDES DOCUMENTS

PROCESSING 2020_INF_PROFILE

Processing 11 files in 2020...


Processing 2020: 100%|██████████| 11/11 [00:00<00:00, 124.99it/s]



✓ Processed 11 documents in 2020_inf_profile

PROCESSING 2021_INF_PROFILE

Processing 16 files in 2021...


Processing 2021: 100%|██████████| 16/16 [00:00<00:00, 184.32it/s]



✓ Processed 16 documents in 2021_inf_profile

PROCESSING 2022_INF_PROFILE

Processing 15 files in 2022...


Processing 2022: 100%|██████████| 15/15 [00:00<00:00, 347.97it/s]



✓ Processed 15 documents in 2022_inf_profile

PROCESSING 2023_INF_PROFILE

Processing 23 files in 2023...


Processing 2023: 100%|██████████| 23/23 [00:00<00:00, 347.75it/s]



✓ Processed 23 documents in 2023_inf_profile

PROCESSING 2024_INF_PROFILE

Processing 17 files in 2024...


Processing 2024: 100%|██████████| 17/17 [00:00<00:00, 288.74it/s]



✓ Processed 17 documents in 2024_inf_profile

Total documents processed: 82


In [11]:
# Summary and analysis of preprocessing results

print("\n" + "="*60)
print("PREPROCESSING SUMMARY")
print("="*60)

if all_processed_docs:
    # Create DataFrame for analysis
    df_processed = pd.DataFrame(all_processed_docs)
    
    print(f"Total documents processed: {len(all_processed_docs)}")
    print(f"Total original characters: {df_processed['original_length'].sum():,}")
    print(f"Total preprocessed characters: {df_processed['preprocessed_length'].sum():,}")
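    # Note: this is the unweighted mean of the per-document reductions,
    # not the corpus-wide ratio of total characters removed.
    print(f"Overall reduction: {df_processed['reduction_percent'].mean():.1f}%")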
    
    print(f"\nPreprocessing results by year:")
    print("-" * 50)
    year_summary = df_processed.groupby('year').agg({
        'original_file': 'count',
        'original_length': 'sum',
        'preprocessed_length': 'sum',
        'reduction_percent': 'mean'
    }).round(1)
    
    year_summary.columns = ['Documents', 'Original_Chars', 'Preprocessed_Chars', 'Avg_Reduction_%']
    print(year_summary)
    
    # Show sample of preprocessing
    print(f"\nSample preprocessing results:")
    print("-" * 50)
    sample_doc = all_processed_docs[0]
    print(f"Sample file: {sample_doc['original_file']}")
    print(f"Preprocessed file: {sample_doc['preprocessed_file']}")
    print(f"Original length: {sample_doc['original_length']} chars")
    print(f"Preprocessed length: {sample_doc['preprocessed_length']} chars")
    print(f"Reduction: {sample_doc['reduction_percent']:.1f}%")
    
    # Save summary
    df_processed.to_csv(output_dir / 'preprocessing_summary.csv', index=False)
    print(f"\nSummary saved to: {output_dir / 'preprocessing_summary.csv'}")
    
else:
    print("No documents were processed.")

print(f"\nOutput directory structure:")
print("-" * 30)
print(f"pre_proc_op/")
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    file_count = len(list(year_folder.glob("*.txt")))
    print(f"├── {year}/ ({file_count} files)")
    if file_count > 0:
        # Show first few files as examples
        files = list(year_folder.glob("*.txt"))[:3]
        for file in files:
            print(f"│   ├── {file.name}")
        if file_count > 3:
            print(f"│   └── ... and {file_count - 3} more files")
    else:
        print(f"│   └── (no files)")



PREPROCESSING SUMMARY
Total documents processed: 82
Total original characters: 137,437
Total preprocessed characters: 127,399
Overall reduction: 7.1%

Preprocessing results by year:
--------------------------------------------------
      Documents  Original_Chars  Preprocessed_Chars  Avg_Reduction_%
year                                                                
2020         11           17830               16581              7.1
2021         16           28277               26159              6.9
2022         15           22467               20923              6.8
2023         23           37206               34475              7.1
2024         17           31657               29261              7.5

Sample preprocessing results:
--------------------------------------------------
Sample file: 2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.txt
Preprocessed file: no_articles_2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to
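
The "Overall reduction" figure above is the unweighted mean of the per-document percentages (`reduction_percent.mean()`), so short documents count as much as long ones. As a cross-check, the corpus-wide reduction can be computed from the character totals in the table above. This is a small arithmetic sketch, and the variable names are illustrative.

```python
# (original, preprocessed) character totals per year, copied from the table above.
chars_by_year = {
    "2020": (17830, 16581),
    "2021": (28277, 26159),
    "2022": (22467, 20923),
    "2023": (37206, 34475),
    "2024": (31657, 29261),
}

total_original = sum(orig for orig, _ in chars_by_year.values())
total_preprocessed = sum(pre for _, pre in chars_by_year.values())

# Corpus-wide reduction: characters removed divided by characters read.
corpus_reduction = (total_original - total_preprocessed) / total_original * 100

print(f"Total original characters: {total_original:,}")
print(f"Total preprocessed characters: {total_preprocessed:,}")
print(f"Corpus-wide reduction: {corpus_reduction:.1f}%")
```

With these totals the corpus-wide reduction comes out at about 7.3%, slightly above the 7.1% mean of the per-document values.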

## Header Removal - Remove Everything Before "No / Driver"

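This step removes all header information from FIA documents by finding the "No / Driver" keyword and removing everything before it. If the keyword is missing, the function falls back to two regular expressions that match the From/To/Document/Date header up to the Grand Prix date, and it returns the text unchanged if neither matches. The input is the set of "no_articles_" files written by the previous step. This eliminates: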
- From/To information
- Document numbers
- Dates and times
- Event headers
- Steward information

The processed files are saved with the "no_header_" prefix.
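
The core of this step, cutting the text at the "No / Driver" marker, can be sketched on a made-up header. The sample string below is invented for illustration and is not taken from any of the processed documents. `cut_before_marker` is a hypothetical helper that also reports whether the marker was found, which makes it easy to count documents that would need a fallback.

```python
import re

# Same marker pattern as the notebook: "No / Driver" with flexible spacing.
MARKER_RE = re.compile(r'no\s*/\s*driver', re.IGNORECASE)


def cut_before_marker(text):
    """Return (text from the marker onward, True), or (text, False) if absent."""
    match = MARKER_RE.search(text)
    if match is None:
        return text, False
    return text[match.start():], True


# Invented sample header, for illustration only.
sample = (
    "From Stewards To Team Manager Document 12 Date 05 July 2020 "
    "Time 14:30 No / Driver 44 - Example Driver Fact Example text"
)

body, found = cut_before_marker(sample)
print(found)
print(body)
```

Returning the flag alongside the text keeps the fallback decision visible to the caller instead of hiding it inside the function.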


In [20]:
# Header removal function

def remove_headers_simple(text):
    """
    Remove everything before "No / Driver" keyword
    If "No / Driver" not found, remove header pattern up to Grand Prix date
    """
    if not text:
        return ""
    
    # First try: Look for "No / Driver" pattern (case insensitive)
    pattern1 = r'no\s*/\s*driver'
    match1 = re.search(pattern1, text, re.IGNORECASE)
    
    if match1:
        # Return everything from "No / Driver" onwards
        return text[match1.start():]
    
    # Second try: Remove header pattern up to Grand Prix date
    # Pattern: "From Stewards To Team Manager, Mercedes-AMG Petronas F1 TeamDocument XX Date XX Month XXXX Time XX:XX XXXX GRAND PRIX X - X Month XXXX"
    pattern2 = r'from\s+stewards.*?to\s+team\s+manager.*?mercedes.*?document\s+\d+.*?date\s+\d+.*?time\s+\d+:\d+.*?grand\s+prix.*?\d+\s*-\s*\d+.*?\d{4}'
    match2 = re.search(pattern2, text, re.IGNORECASE | re.DOTALL)
    
    if match2:
        # Return everything after the header pattern
        return text[match2.end():]
    
    # Third try: More flexible pattern for different header formats
    # Look for "From" followed by "To" and "Document" and "Date" patterns
    pattern3 = r'from\s+.*?to\s+.*?document\s+\d+.*?date\s+\d+.*?grand\s+prix.*?\d{4}'
    match3 = re.search(pattern3, text, re.IGNORECASE | re.DOTALL)
    
    if match3:
        # Return everything after the header pattern
        return text[match3.end():]
    
    # If no patterns found, return original text
    return text

def preprocess_text_with_header_removal(text):
    """
    Complete preprocessing pipeline with header removal
    """
    if not text:
        return ""
    
    # First remove headers
    text = remove_headers_simple(text)
    
    # Then apply other preprocessing
    text = basic_clean(text)
    text = remove_articles(text)
    
    return text


In [21]:
# Process documents to remove headers and create "no_header_" files

def process_documents_no_header():
    """
    Process all no_articles_ documents to remove headers and create no_header_ files
    """
    print("\n" + "="*60)
    print("REMOVING HEADERS AND CREATING NO_HEADER_ FILES")
    print("="*60)
    
    processed_count = 0
    
    for year in ["2020", "2021", "2022", "2023", "2024"]:
        year_folder = year_folders[year]
        
        # Get all no_articles_ files in this year folder
        no_articles_files = list(year_folder.glob("no_articles_*.txt"))
        
        if not no_articles_files:
            print(f"\nNo no_articles_ files found in {year}/")
            continue
        
        print(f"\nProcessing {len(no_articles_files)} files in {year}/...")
        
        for file_path in tqdm(no_articles_files, desc=f"Processing {year}"):
            try:
                # Read the no_articles_ file
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                
                # Process the content to remove headers
                processed_content = preprocess_text_with_header_removal(content)
                
                # Create output filename with "no_header_" prefix
                original_filename = file_path.name[len("no_articles_"):]  # strip only the leading prefix
                output_filename = f"no_header_{original_filename}"
                output_path = year_folder / output_filename
                
                # Save the processed content
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(processed_content)
                
                processed_count += 1
                
            except Exception as e:
                print(f"Error processing {file_path.name}: {e}")
    
    print(f"\nTotal no_header_ files created: {processed_count}")
    return processed_count

# Process all documents
header_processed_count = process_documents_no_header()

# Show summary of no_header_ files
print(f"\nNo_header_ files summary:")
print("-" * 40)
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_header_files = list(year_folder.glob("no_header_*.txt"))
    print(f"{year}: {len(no_header_files)} no_header_ files")

print(f"\nFinal output directory structure:")
print("-" * 30)
print(f"pre_proc_op/")
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_articles_count = len(list(year_folder.glob("no_articles_*.txt")))
    no_header_count = len(list(year_folder.glob("no_header_*.txt")))
    print(f"├── {year}/ ({no_articles_count} no_articles_, {no_header_count} no_header_)")
    
    if no_header_count > 0:
        # Show first few no_header_ files as examples
        files = list(year_folder.glob("no_header_*.txt"))[:2]
        for file in files:
            print(f"│   ├── {file.name}")
        if no_header_count > 2:
            print(f"│   └── ... and {no_header_count - 2} more no_header_ files")



REMOVING HEADERS AND CREATING NO_HEADER_ FILES

Processing 11 files in 2020/...


Processing 2020: 100%|██████████| 11/11 [00:00<00:00, 266.46it/s]



Processing 16 files in 2021/...


Processing 2021: 100%|██████████| 16/16 [00:00<00:00, 383.89it/s]



Processing 15 files in 2022/...


Processing 2022: 100%|██████████| 15/15 [00:00<00:00, 382.21it/s]



Processing 23 files in 2023/...


Processing 2023: 100%|██████████| 23/23 [00:00<00:00, 432.95it/s]



Processing 17 files in 2024/...


Processing 2024: 100%|██████████| 17/17 [00:00<00:00, 364.80it/s]


Total no_header_ files created: 82

No_header_ files summary:
----------------------------------------
2020: 11 no_header_ files
2021: 16 no_header_ files
2022: 15 no_header_ files
2023: 23 no_header_ files
2024: 17 no_header_ files

Final output directory structure:
------------------------------
pre_proc_op/
├── 2020/ (11 no_articles_, 11 no_header_)
│   ├── no_header_2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.txt
│   ├── no_header_2020 Austrian Grand Prix - Decision - review of decision (document 33).txt
│   └── ... and 9 more no_header_ files
├── 2021/ (16 no_articles_, 16 no_header_)
│   ├── no_header_2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .txt
│   ├── no_header_2021 Brazilian Grand Prix - Offence - Car 44 - DRS.txt
│   └── ... and 14 more no_header_ files
├── 2022/ (15 no_articles_, 15 no_header_)
│   ├── no_header_2022 Abu Dhabi Grand Prix - Decision - Car 44 - Red Flag_0.txt
│   ├── no_he




## Competitor and Time Removal

This step removes the competitor and time information from FIA documents. It removes patterns like:
- "Competitor Mercedes-AMG Petronas F1 Team Time 15:59"
- "Competitor [Team Name] Time [HH:MM]"

The pattern starts with "Competitor" and ends with a time in 24-hour format (HH:MM).

The processed files are saved with the "no_comp_time_" prefix.
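
Because the input text has already been collapsed onto a single line, a lazy `.*?` between "Competitor" and "Time" can stretch across body text when the word "competitor" appears earlier in the document. The sketch below compares the pattern used in this notebook with a bounded variant. The 60-character limit on the team name is an assumption chosen for illustration, and the sample text is invented.

```python
import re

# Pattern as used in this notebook: lazy, but with no upper bound on the gap.
UNBOUNDED_RE = re.compile(r'competitor\s+.*?time\s+\d{1,2}:\d{2}', re.IGNORECASE)

# Hypothetical variant: at most 60 characters between "Competitor" and "Time".
BOUNDED_RE = re.compile(
    r'competitor\s+.{1,60}?\s+time\s+\d{1,2}:\d{2}', re.IGNORECASE
)


def strip_block(pattern, text):
    """Remove every match of `pattern`, then normalize whitespace."""
    return re.sub(r'\s+', ' ', pattern.sub('', text)).strip()


# Invented sample: "competitor" appears in the body before the real block.
sample = (
    "Stewards heard from competitor representative. "
    + "word " * 20
    + "Competitor Mercedes-AMG Petronas F1 Team Time 15:59 Fact Example"
)

print(strip_block(UNBOUNDED_RE, sample))
print(strip_block(BOUNDED_RE, sample))
```

On this sample the unbounded pattern removes everything from the first "competitor" through the time, while the bounded one removes only the intended block. Whether such over-matching occurs in the real documents would need to be checked against the files.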


In [22]:
# Competitor and time removal function

def remove_competitor_time(text):
    """
    Remove competitor and time information from FIA documents
    Pattern: "Competitor [Team Name] Time [HH:MM]"
    """
    if not text:
        return ""
    
    # Pattern to match "Competitor" followed by team name and ending with time in HH:MM format
    # This will match patterns like:
    # "Competitor Mercedes-AMG Petronas F1 Team Time 15:59"
    # "Competitor [Any Team Name] Time [HH:MM]"
    pattern = r'competitor\s+.*?time\s+\d{1,2}:\d{2}'
    
    # Remove the pattern (case insensitive)
    cleaned_text = re.sub(pattern, '', text, flags=re.IGNORECASE)
    
    # Clean up any extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    cleaned_text = cleaned_text.strip()
    
    return cleaned_text

def preprocess_text_with_competitor_time_removal(text):
    """
    Complete preprocessing pipeline with competitor and time removal
    """
    if not text:
        return ""
    
    # Remove competitor and time information
    text = remove_competitor_time(text)
    
    # Apply other preprocessing
    text = basic_clean(text)
    text = remove_articles(text)
    
    return text


In [23]:
# Process documents to remove competitor and time info and create "no_comp_time_" files

def process_documents_no_competitor_time():
    """
    Process all no_header_ documents to remove competitor and time info and create no_comp_time_ files
    """
    print("\n" + "="*60)
    print("REMOVING COMPETITOR AND TIME INFO AND CREATING NO_COMP_TIME_ FILES")
    print("="*60)
    
    processed_count = 0
    
    for year in ["2020", "2021", "2022", "2023", "2024"]:
        year_folder = year_folders[year]
        
        # Get all no_header_ files in this year folder
        no_header_files = list(year_folder.glob("no_header_*.txt"))
        
        if not no_header_files:
            print(f"\nNo no_header_ files found in {year}/")
            continue
        
        print(f"\nProcessing {len(no_header_files)} files in {year}/...")
        
        for file_path in tqdm(no_header_files, desc=f"Processing {year}"):
            try:
                # Read the no_header_ file
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                
                # Process the content to remove competitor and time info
                processed_content = preprocess_text_with_competitor_time_removal(content)
                
                # Create output filename with "no_comp_time_" prefix
                original_filename = file_path.name[len("no_header_"):]  # strip only the leading prefix
                output_filename = f"no_comp_time_{original_filename}"
                output_path = year_folder / output_filename
                
                # Save the processed content
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(processed_content)
                
                processed_count += 1
                
            except Exception as e:
                print(f"Error processing {file_path.name}: {e}")
    
    print(f"\nTotal no_comp_time_ files created: {processed_count}")
    return processed_count

# Process all documents
comp_time_processed_count = process_documents_no_competitor_time()

# Show summary of no_comp_time_ files
print(f"\nNo_comp_time_ files summary:")
print("-" * 40)
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_comp_time_files = list(year_folder.glob("no_comp_time_*.txt"))
    print(f"{year}: {len(no_comp_time_files)} no_comp_time_ files")

print(f"\nFinal output directory structure:")
print("-" * 30)
print(f"pre_proc_op/")
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_articles_count = len(list(year_folder.glob("no_articles_*.txt")))
    no_header_count = len(list(year_folder.glob("no_header_*.txt")))
    no_comp_time_count = len(list(year_folder.glob("no_comp_time_*.txt")))
    print(f"├── {year}/ ({no_articles_count} no_articles_, {no_header_count} no_header_, {no_comp_time_count} no_comp_time_)")
    
    if no_comp_time_count > 0:
        # Show first few no_comp_time_ files as examples
        files = list(year_folder.glob("no_comp_time_*.txt"))[:2]
        for file in files:
            print(f"│   ├── {file.name}")
        if no_comp_time_count > 2:
            print(f"│   └── ... and {no_comp_time_count - 2} more no_comp_time_ files")



REMOVING COMPETITOR AND TIME INFO AND CREATING NO_COMP_TIME_ FILES

Processing 11 files in 2020/...


Processing 2020: 100%|██████████| 11/11 [00:00<00:00, 266.43it/s]



Processing 16 files in 2021/...


Processing 2021: 100%|██████████| 16/16 [00:00<00:00, 306.22it/s]



Processing 15 files in 2022/...


Processing 2022: 100%|██████████| 15/15 [00:00<00:00, 359.20it/s]



Processing 23 files in 2023/...


Processing 2023: 100%|██████████| 23/23 [00:00<00:00, 386.46it/s]



Processing 17 files in 2024/...


Processing 2024: 100%|██████████| 17/17 [00:00<00:00, 306.44it/s]



Total no_comp_time_ files created: 82

No_comp_time_ files summary:
----------------------------------------
2020: 11 no_comp_time_ files
2021: 16 no_comp_time_ files
2022: 15 no_comp_time_ files
2023: 23 no_comp_time_ files
2024: 17 no_comp_time_ files

Final output directory structure:
------------------------------
pre_proc_op/
├── 2020/ (11 no_articles_, 11 no_header_, 11 no_comp_time_)
│   ├── no_comp_time_2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.txt
│   ├── no_comp_time_2020 Austrian Grand Prix - Decision - review of decision (document 33).txt
│   └── ... and 9 more no_comp_time_ files
├── 2021/ (16 no_articles_, 16 no_header_, 16 no_comp_time_)
│   ├── no_comp_time_2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .txt
│   ├── no_comp_time_2021 Brazilian Grand Prix - Offence - Car 44 - DRS.txt
│   └── ... and 14 more no_comp_time_ files
├── 2022/ (15 no_articles_, 15 no_header_, 15 no_comp_time_)
│

## Appeal Information Removal

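This step removes the appeal information and signatures from FIA documents. Because the input files have already had articles removed, the marker sentence appears without "the". The step removes everything starting from: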
- "Competitors are reminded that they have right to appeal..."

This eliminates:
- Appeal rights information
- Steward signatures
- Administrative footers
- Legal disclaimers

The processed files are saved with the "no_footer_" prefix.
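
Removing "everything from the marker onward" amounts to slicing the text at the first match. A minimal sketch follows, using an invented sample string. `cut_at_appeal_notice` is a hypothetical helper that also reports whether the marker was present, so documents without an appeal notice can be identified.

```python
import re

# Marker that opens the appeal notice; matched case-insensitively.
APPEAL_RE = re.compile(r'competitors are reminded', re.IGNORECASE)


def cut_at_appeal_notice(text):
    """Return (text before the appeal notice, True), or (text, False) if absent."""
    match = APPEAL_RE.search(text)
    if match is None:
        return text.strip(), False
    return text[:match.start()].strip(), True


# Invented sample, for illustration only.
sample = (
    "Decision Reprimand. Reason Example reasoning. "
    "Competitors are reminded that they have right to appeal ... "
    "Stewards Example Name"
)

body, found = cut_at_appeal_notice(sample)
print(found)
print(body)
```

The marker is written with single literal spaces, so it relies on whitespace having been normalized by an earlier stage.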


In [26]:
# Appeal information removal function

def remove_appeal_info(text):
    """
    Remove appeal information and signatures from FIA documents
    Pattern: "Competitors are reminded that they have right to appeal..."
    """
    if not text:
        return ""
    
    # Pattern to match "Competitors are reminded..." and everything after it
    # This will match patterns like:
    # "Competitors are reminded that they have right to appeal..."
    # "Competitors are reminded that they have the right to appeal..."
    pattern = r'competitors are reminded.*'
    
    # Remove the pattern and everything after it (case insensitive)
    cleaned_text = re.sub(pattern, '', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Clean up any extra whitespace
    cleaned_text = re.sub(r'\s+', ' ', cleaned_text)
    cleaned_text = cleaned_text.strip()
    
    return cleaned_text

def preprocess_text_with_appeal_removal(text):
    """
    Complete preprocessing pipeline with appeal information removal
    """
    if not text:
        return ""
    
    # Remove appeal information
    text = remove_appeal_info(text)
    
    # Apply other preprocessing
    text = basic_clean(text)
    text = remove_articles(text)
    
    return text


In [27]:
# Process documents to remove appeal info and create "no_footer_" files

def process_documents_no_footer():
    """
    Process all no_comp_time_ documents to remove appeal info and create no_footer_ files
    """
    print("\n" + "="*60)
    print("REMOVING APPEAL INFO AND CREATING NO_FOOTER_ FILES")
    print("="*60)
    
    processed_count = 0
    
    for year in ["2020", "2021", "2022", "2023", "2024"]:
        year_folder = year_folders[year]
        
        # Get all no_comp_time_ files in this year folder
        no_comp_time_files = list(year_folder.glob("no_comp_time_*.txt"))
        
        if not no_comp_time_files:
            print(f"\nNo no_comp_time_ files found in {year}/")
            continue
        
        print(f"\nProcessing {len(no_comp_time_files)} files in {year}/...")
        
        for file_path in tqdm(no_comp_time_files, desc=f"Processing {year}"):
            try:
                # Read the no_comp_time_ file
                with open(file_path, 'r', encoding='utf-8') as f:
                    content = f.read()
                
                # Process the content to remove appeal info
                processed_content = preprocess_text_with_appeal_removal(content)
                
                # Create output filename with "no_footer_" prefix
                original_filename = file_path.name[len("no_comp_time_"):]  # strip only the leading prefix
                output_filename = f"no_footer_{original_filename}"
                output_path = year_folder / output_filename
                
                # Save the processed content
                with open(output_path, 'w', encoding='utf-8') as f:
                    f.write(processed_content)
                
                processed_count += 1
                
            except Exception as e:
                print(f"Error processing {file_path.name}: {e}")
    
    print(f"\nTotal no_footer_ files created: {processed_count}")
    return processed_count

# Process all documents
footer_processed_count = process_documents_no_footer()

# Show summary of no_footer_ files
print(f"\nNo_footer_ files summary:")
print("-" * 40)
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_footer_files = list(year_folder.glob("no_footer_*.txt"))
    print(f"{year}: {len(no_footer_files)} no_footer_ files")

print(f"\nFinal output directory structure:")
print("-" * 30)
print(f"pre_proc_op/")
for year in ["2020", "2021", "2022", "2023", "2024"]:
    year_folder = year_folders[year]
    no_articles_count = len(list(year_folder.glob("no_articles_*.txt")))
    no_header_count = len(list(year_folder.glob("no_header_*.txt")))
    no_comp_time_count = len(list(year_folder.glob("no_comp_time_*.txt")))
    no_footer_count = len(list(year_folder.glob("no_footer_*.txt")))
    print(f"├── {year}/ ({no_articles_count} no_articles_, {no_header_count} no_header_, {no_comp_time_count} no_comp_time_, {no_footer_count} no_footer_)")
    
    if no_footer_count > 0:
        # Show first few no_footer_ files as examples
        files = list(year_folder.glob("no_footer_*.txt"))[:2]
        for file in files:
            print(f"│   ├── {file.name}")
        if no_footer_count > 2:
            print(f"│   └── ... and {no_footer_count - 2} more no_footer_ files")



REMOVING APPEAL INFO AND CREATING NO_FOOTER_ FILES

Processing 11 files in 2020/...


Processing 2020: 100%|██████████| 11/11 [00:00<00:00, 671.20it/s]



Processing 16 files in 2021/...


Processing 2021: 100%|██████████| 16/16 [00:00<00:00, 172.47it/s]



Processing 15 files in 2022/...


Processing 2022: 100%|██████████| 15/15 [00:00<00:00, 228.18it/s]



Processing 23 files in 2023/...


Processing 2023: 100%|██████████| 23/23 [00:00<00:00, 211.29it/s]



Processing 17 files in 2024/...


Processing 2024: 100%|██████████| 17/17 [00:00<00:00, 458.54it/s]


Total no_footer_ files created: 82

No_footer_ files summary:
----------------------------------------
2020: 11 no_footer_ files
2021: 16 no_footer_ files
2022: 15 no_footer_ files
2023: 23 no_footer_ files
2024: 17 no_footer_ files

Final output directory structure:
------------------------------
pre_proc_op/
├── 2020/ (11 no_articles_, 11 no_header_, 11 no_comp_time_, 11 no_footer_)
│   ├── no_footer_2020 Austrian Grand Prix - Decision - Car 44 - alleged failure to slow for yellow flags.txt
│   ├── no_footer_2020 Austrian Grand Prix - Decision - review of decision (document 33).txt
│   └── ... and 9 more no_footer_ files
├── 2021/ (16 no_articles_, 16 no_header_, 16 no_comp_time_, 16 no_footer_)
│   ├── no_footer_2021 Austrian Grand Prix - Decision - Car 77 - Alleged driving unnecessarily slowly .txt
│   ├── no_footer_2021 Brazilian Grand Prix - Offence - Car 44 - DRS.txt
│   └── ... and 14 more no_footer_ files
├── 2022/ (15 no_articles_, 15 no_header_, 15 no_comp_time_, 15 no_footer_)
│   ├── no_footer_2022 Abu Dhabi Grand Prix - Decision - Car 44 - Red Flag_0.txt
│   ├── no_footer_2022 Abu Dhabi Grand Prix - Offence - Car 44 - Pit lane speeding.txt
│   └── ... and 13 more no_footer_ files
├── 2023/ (23 no_articles_, 23 no_header_, 23 no_comp_time_, 23 no_footer_)
│   ├── no_footer_2023 Abu Dhabi Grand Prix - Infringement - Mercedes - Team Principal (Updated).txt
│   ├── no_footer_2023 Australian Grand Prix - Decision - Mercedes - Inaccurate Self Scrutineering Form.txt
│   └── ... and 21 more no_footer_ files
├── 2024/ (17 no_articles_, 17 no_header_, 17 no_comp_time_, 17 no_footer_)
│   ├── no_footer_2024 Austrian Grand Prix - Infringement - Car 44 - Crossing the line at Pit Entry.txt
│   ├── no_footer_2024 Austrian Grand Prix - Infringement - Car 44 - Unsafe release.txt
│   └── ... and 15 more no_footer_ files
