# PDF Generation Workflow - With PII Integration

This notebook processes resume-job matches and generates PDFs for different treatment types with enhanced PII (Personally Identifiable Information) data integration.

## Features:
- **Interactive File Selection**: Choose specific files to process or process all files
- **Enhanced PII Generation**: Automatically generates culturally appropriate names, emails, and phone numbers
- **Geographic Cluster Mapping**: Maps countries to geographic regions for PII selection
- **Treatment Type Support**: Loops through all treatment types (control, Type_I, Type_II, Type_III)
- **Enhanced Webhook Integration**: Sends comprehensive PII data to the webhook endpoint
- **Comprehensive Logging**: Tracks all PII data used for each request
- **Results Management**: Saves the generated document links (Google Doc URLs) to CSV along with the PII data used
- **Append Mode**: Appends new records to the existing results CSV on each subsequent run

## PII Data Generation:
- **Country Mapping**: Automatically maps resume countries to geographic clusters
- **Cultural Names**: Draws first names from per-cluster pools of male and female names
- **Treatment-Specific Data**: Each cluster and treatment combination has its own assigned last name, email, and phone number
- **Random Selection**: Randomly selects a gender (50/50) and then a first name from the matching pool
- **Fallback Handling**: Falls back to placeholder values (`Test User`, `test@example.com`) when the country is missing or not in the mapping

## Configuration:
- Set `selection_choice` to `"1"`, `"2"`, `"3"`, or `"4"` to choose the file selection method
- Set `file_numbers` (option 2) or `filenames` (option 3) to select specific files; `num_files_to_process` controls option 4
- Set `treatment_types` to specify which treatments to process
- Set `test_url` and `authorization` for the API endpoint
- Results are saved to the path in `output_csv` (`google_doc_generation_results.csv` in this run)

## Data Sources:
- **Country Mapping**: `Country and Geographic_Cluster Mapping.csv`
- **PII Clusters**: `Resume Audit - Country Clusters - Clusters.csv`
- **Job Matches**: `Resume_study.resume_job_matches_filtered.csv`

## Webhook Integration:
Each webhook request body includes:
- Basic fields: `file_id`, `treatment_type`, `location`
- PII fields: `name`, `email`, `phone`, `first_name`, `last_name`, `gender`
- Geographic fields: `country`, `geographic_cluster`

## Output:
Enhanced CSV with all original data plus:
- PII information (names, emails, phones)
- Geographic clustering data
- Treatment-specific identifiers
- Complete audit trail of all operations

In [42]:
import requests
import json
import pandas as pd
import os
from datetime import datetime
import time
import random

In [43]:
# Configuration
num_files_to_process = 5  # Set this to control how many files to process
test_url = "https://prayag-is-dummy.app.n8n.cloud/webhook/9eb0c4bc-f2a4-4f23-bb71-26422deedf55"
authorization = ("prayag_purohit", "Resumeaudit")
output_csv = "google_doc_generation_results.csv"

# Treatment types to process
treatment_types = ['control', 'Type_I', 'Type_II', 'Type_III']

print("Configuration:")
print(f"- Files to process: {num_files_to_process}")
print(f"- Treatment types: {treatment_types}")
print(f"- Output CSV: {output_csv}")

Configuration:
- Files to process: 5
- Treatment types: ['control', 'Type_I', 'Type_II', 'Type_III']
- Output CSV: google_doc_generation_results.csv
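
The configuration cell keeps the webhook URL and basic-auth credentials in the notebook itself. A minimal sketch of reading them from environment variables instead is shown below. The variable names (`RESUME_WEBHOOK_URL`, `RESUME_WEBHOOK_USER`, `RESUME_WEBHOOK_PASSWORD`) and the helper `load_webhook_config` are hypothetical and not part of the original notebook.

```python
import os

def load_webhook_config(env=os.environ):
    """Read webhook settings from environment variables (names are hypothetical)."""
    names = ("RESUME_WEBHOOK_URL", "RESUME_WEBHOOK_USER", "RESUME_WEBHOOK_PASSWORD")
    values = {name: env.get(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    # Same shapes as test_url and authorization in the configuration cell
    url = values["RESUME_WEBHOOK_URL"]
    auth = (values["RESUME_WEBHOOK_USER"], values["RESUME_WEBHOOK_PASSWORD"])
    return url, auth

# Demonstration with a plain dict standing in for the real environment
demo_env = {
    "RESUME_WEBHOOK_URL": "https://example.invalid/webhook",
    "RESUME_WEBHOOK_USER": "demo_user",
    "RESUME_WEBHOOK_PASSWORD": "demo_password",
}
demo_url, demo_auth = load_webhook_config(demo_env)
print(demo_url)
```

In the notebook this would be called as `test_url, authorization = load_webhook_config()`, which raises a `RuntimeError` naming any variable that is not set.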


In [44]:
# Load the job matches data
file_path = "Resume_study.resume_job_matches_filtered.csv"
job_matches_df = pd.read_csv(file_path)

# Rename columns to match our endpoint requirements.
# 'title' keeps its original name because later cells read the 'title' column directly.
job_matches_df.rename(columns={'description': 'job_description'}, inplace=True)

print(f"Loaded {len(job_matches_df)} job matches")
print(f"Columns: {list(job_matches_df.columns)}")
job_matches_df.head()

Loaded 1226 job matches
Columns: ['_id', 'job_posting_id', 'title', 'job_description', 'file_id', 'key_metrics.basics.likely_home_country', 'match_score', 'location', 'date_posted']


Unnamed: 0,_id,job_posting_id,title,job_description,file_id,key_metrics.basics.likely_home_country,match_score,location,date_posted
0,68a29ca54105e44264b851f7,689d5acce78d625301071376,Database Developer (Software Developer),Job Description\nDatabase Developer\nThis is a...,ITC resume 20.pdf,India,90,"Toronto, ON, CA",2025-08-13
1,68a29cf6ff6560afd37a01a3,689d5acce78d62530107137b,"Application Developer, D365 Finance & Operations",Sporting Life Group is a proudly Canadian fami...,ITC resume 14.pdf,Saudi Arabia,75,"Vaughan, ON, CA",2025-08-13
2,68a29cf7ff6560afd37a01a4,689d5acce78d62530107137c,Backend Developer (Python),**Please note before applying:** \n\n* We’re ...,ITC resume 18.pdf,Eritrea,92,"Toronto, ON, CA",2025-08-13
3,68a29d01ff6560afd37a01a6,689d5acce78d62530107137d,Full Stack Developer,**Please note before applying:** \n\n* We’re ...,ITC resume 18.pdf,Eritrea,90,"Toronto, ON, CA",2025-08-13
4,68a29d1aff6560afd37a01ad,689d5acce78d625301071389,"QA Automation Developer– Java, Selenium, Sales...","**Role Description:**\n\n* Java, Selenium, Cuc...",ITC resume 09.pdf,Pakistan,92,"Toronto, ON, CA",2025-08-13
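
By default, `DataFrame.rename` ignores mapping keys that match no column, so a misspelled key is a silent no-op (passing `errors='raise'` makes pandas raise instead). The sketch below shows a plain-Python check that could run before the rename. The helper name `unknown_rename_keys` is hypothetical, and the column list is an assumed subset of the CSV's original headers.

```python
def unknown_rename_keys(columns, mapping):
    """Return the mapping keys that do not match any existing column name."""
    existing = set(columns)
    return [old for old in mapping if old not in existing]

# Assumed subset of the original column names, before any renaming
columns = ['_id', 'job_posting_id', 'title', 'description', 'file_id']
mapping = {'description': 'job_description', 'tile': 'job_title'}

print(unknown_rename_keys(columns, mapping))  # → ['tile']
```

In the notebook the first argument would be `job_matches_df.columns`, checked right after `pd.read_csv`.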


### Display Available Files

In [45]:
# Get unique files to process
unique_files_df = job_matches_df.drop_duplicates(subset='file_id')
print(f"Total unique files: {len(unique_files_df)}")

# Display all available files with their details
print("\nAvailable files to process:")
print("-" * 60)
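for idx, row in unique_files_df.iterrows():
    # idx is the original row label from job_matches_df, so the printed numbers
    # can skip values; they are not the 1-based positions used by option 2.
    print(f"{idx+1:2d}. {row['file_id']}")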
    if 'key_metrics.basics.likely_home_country' in row and pd.notna(row['key_metrics.basics.likely_home_country']):
        print(f"    Country: {row['key_metrics.basics.likely_home_country']}")
    if 'location' in row and pd.notna(row['location']):
        print(f"    Location: {row['location']}")
    print()

# File selection options
print("File selection options:")
print("1. Process all files")
print("2. Process specific files by number")
print("3. Process specific files by filename")
print("4. Process first N files (current behavior)")

Total unique files: 18

Available files to process:
------------------------------------------------------------
 1. ITC resume 20.pdf
    Country: India
    Location: Toronto, ON, CA

 2. ITC resume 14.pdf
    Country: Saudi Arabia
    Location: Vaughan, ON, CA

 3. ITC resume 18.pdf
    Country: Eritrea
    Location: Toronto, ON, CA

 5. ITC resume 09.pdf
    Country: Pakistan
    Location: Toronto, ON, CA

 6. ITC resume 07.pdf
    Country: Turkey
    Location: Toronto, ON, CA

 7. ITC resume 08.pdf
    Country: Mauritius
    Location: Toronto, ON, CA

11. ITC resume 16.pdf
    Country: USA
    Location: Toronto, ON, CA

18. ITC resume 01.pdf
    Country: Lebanon
    Location: Vancouver, BC, CA

47. ITC resume 17.pdf
    Country: India
    Location: Calgary, AB, CA

67. ITC resume 03.pdf
    Country: Bangladesh
    Location: Edmonton, AB, CA

75. ITC resume 15.pdf
    Country: India
    Location: Ottawa, ON, CA

393. ITC resume 04.pdf
    Country: India
    Location: Shelburne, ON, 
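
The numbers in this listing come from the DataFrame's row labels (`idx+1`), while option 2 in the selection cell picks rows by position with `iloc`. The two only agree while no duplicate rows have been dropped. The sketch below translates a listing number into the position that option 2 expects. The helper name is hypothetical, and `row_labels` is typed in from the first eight entries of the listing above; in the notebook it would be `list(unique_files_df.index)`.

```python
def position_for_displayed_number(row_labels, displayed_number):
    """Translate a listing number (row label + 1) into a 1-based position."""
    label = displayed_number - 1
    # list.index raises ValueError if that number was never listed
    return row_labels.index(label) + 1

# Row labels behind the first eight listing entries (displayed number - 1)
row_labels = [0, 1, 2, 4, 5, 6, 10, 17]

print(position_for_displayed_number(row_labels, 18))  # → 8
```

So the file listed as `18.` is the eighth unique file, and `file_numbers = [8]` would be the value that selects it by position.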

### File Selection and Processing Setup

In [46]:
# File selection configuration
# CHANGE THESE VARIABLES TO SELECT YOUR OPTION

selection_choice = "1"  # Change to "1", "2", "3", or "4"

# For option 2 (specific file numbers): 1-based positions in unique_files_df,
# not the numbers printed in the listing above (those come from the original row labels)
file_numbers = [18]  # Example: the 18th unique file

# For option 3 (specific filenames) - change these to the exact filenames you want
filenames = ["ITC resume 20.pdf", "another_file.pdf"]  # Example filenames

# File selection logic
if selection_choice == "1":
    # Process all files
    files_to_process = unique_files_df
    print(f"\nProcessing ALL {len(files_to_process)} files")
    
elif selection_choice == "2":
    # Process specific files by number
    print(f"\nProcessing specific files by number: {file_numbers}")
    try:
        # Convert to 0-based indexing
        valid_indices = [i - 1 for i in file_numbers if 1 <= i <= len(unique_files_df)]
        if valid_indices:
            files_to_process = unique_files_df.iloc[valid_indices]
            print(f"Processing {len(files_to_process)} selected files:")
            for idx, row in files_to_process.iterrows():
                print(f"  {row['file_id']}")
        else:
            print("No valid file numbers provided. Processing first file.")
            files_to_process = unique_files_df.head(1)
    except Exception as e:
        print(f"Error: {e}. Processing first file.")
        files_to_process = unique_files_df.head(1)
        
elif selection_choice == "3":
    # Process specific files by filename
    print(f"\nProcessing specific files by filename: {filenames}")
    try:
        files_to_process = unique_files_df[unique_files_df['file_id'].isin(filenames)]
        if len(files_to_process) > 0:
            print(f"Processing {len(files_to_process)} selected files:")
            for idx, row in files_to_process.iterrows():
                print(f"  {row['file_id']}")
        else:
            print("No matching filenames found. Processing first file.")
            files_to_process = unique_files_df.head(1)
    except Exception as e:
        print(f"Error: {e}. Processing first file.")
        files_to_process = unique_files_df.head(1)
        
elif selection_choice == "4":
    # Process first N files (current behavior)
    files_to_process = unique_files_df.head(num_files_to_process)
    print(f"\nProcessing first {num_files_to_process} files:")
    for idx, row in files_to_process.iterrows():
        print(f"  {row['file_id']}")
        
else:
    # Invalid input: fall back to the first file only
    print("Invalid choice. Processing first file.")
    files_to_process = unique_files_df.head(1)

print(f"\nFinal selection: {len(files_to_process)} files to process")


Processing ALL 18 files

Final selection: 18 files to process


### Load PII Mapping Data and Setup Functions

In [47]:
# Load the country to geographic cluster mapping
country_cluster_df = pd.read_csv("Country and Geographic_Cluster Mapping.csv")
print(f"Loaded country mapping: {len(country_cluster_df)} countries mapped to {country_cluster_df['Geographic_Cluster'].nunique()} clusters")

# Load the PII clusters data
pii_clusters_df = pd.read_csv("Resume Audit - Country Clusters - Clusters.csv")
print(f"Loaded PII clusters: {len(pii_clusters_df)} cluster-treatment combinations")

# Create lookup dictionaries for efficient access
country_to_cluster = dict(zip(country_cluster_df['Country'], country_cluster_df['Geographic_Cluster']))

# Create a nested dictionary for PII lookup: {cluster: {treatment: pii_data}}
pii_lookup = {}
for _, row in pii_clusters_df.iterrows():
    cluster = row['Geographic_Cluster']
    treatment = row['Treatment Type']
    
    if cluster not in pii_lookup:
        pii_lookup[cluster] = {}
    
    # Parse the name pools into lists
    male_names = [name.strip() for name in row['Male_First_Name_Pool'].split(',')]
    female_names = [name.strip() for name in row['Female_First_Name_Pool'].split(',')]
    
    pii_lookup[cluster][treatment] = {
        'last_name': row['Last_Name'],
        'email': row['Assigned_Email'],
        'phone': row['Assigned_Phone_Number'],
        'male_names': male_names,
        'female_names': female_names
    }

print(f"\nPII lookup structure created:")
for cluster in pii_lookup:
    print(f"  {cluster}: {list(pii_lookup[cluster].keys())}")

# Function to get PII data for a country and treatment type
def get_pii_data(country, treatment_type):
    """
    Get PII data for a given country and treatment type.
    
    Args:
        country (str): Country name
        treatment_type (str): Treatment type (control, Type_I, Type_II, Type_III)
    
    Returns:
        dict: PII data with keys: last_name, email, phone, male_names, female_names
        None: If country or treatment not found
    """
    # Map country to geographic cluster
    if country not in country_to_cluster:
        print(f"Warning: Country '{country}' not found in mapping")
        return None
    
    cluster = country_to_cluster[country]
    
    # Get PII data for cluster and treatment
    if cluster not in pii_lookup or treatment_type not in pii_lookup[cluster]:
        print(f"Warning: No PII data found for cluster '{cluster}' and treatment '{treatment_type}'")
        return None
    
    return pii_lookup[cluster][treatment_type]

# Function to generate a random name and PII
def generate_pii_for_file(country, treatment_type):
    """
    Generate PII data for a file based on country and treatment type.
    
    Args:
        country (str): Country name
        treatment_type (str): Treatment type
    
    Returns:
        dict: Complete PII data with full_name, email, phone, first_name, last_name, gender
        None: If PII data cannot be generated
    """
    pii_data = get_pii_data(country, treatment_type)
    if not pii_data:
        return None
    
    # Randomly select gender (50/50 chance)
    is_male = random.choice([True, False])
    
    # Select random name from appropriate pool
    if is_male:
        first_name = random.choice(pii_data['male_names'])
    else:
        first_name = random.choice(pii_data['female_names'])
    
    # Construct full name
    full_name = f"{first_name} {pii_data['last_name']}"
    
    return {
        'full_name': full_name,
        'email': pii_data['email'],
        'phone': pii_data['phone'],
        'first_name': first_name,
        'last_name': pii_data['last_name'],
        'gender': 'Male' if is_male else 'Female'
    }

# Test the functions with a sample
print(f"\nTesting PII generation:")
test_country = "India"
test_treatment = "Type_II"
test_pii = generate_pii_for_file(test_country, test_treatment)
if test_pii:
    print(f"  Country: {test_country}, Treatment: {test_treatment}")
    print(f"  Generated: {test_pii['full_name']} ({test_pii['gender']})")
    print(f"  Email: {test_pii['email']}")
    print(f"  Phone: {test_pii['phone']}")

Loaded country mapping: 53 countries mapped to 6 clusters
Loaded PII clusters: 24 cluster-treatment combinations

PII lookup structure created:
  South Asia: ['control', 'Type_I', 'Type_II', 'Type_III']
  Sub-Saharan Africa: ['control', 'Type_I', 'Type_II', 'Type_III']
  Middle East & North Africa: ['control', 'Type_I', 'Type_II', 'Type_III']
  East & Southeast Asia: ['control', 'Type_I', 'Type_II', 'Type_III']
  Eastern Europe: ['control', 'Type_I', 'Type_II', 'Type_III']
  Latin America: ['control', 'Type_I', 'Type_II', 'Type_III']

Testing PII generation:
  Country: India, Treatment: Type_II
  Generated: Aisha Kumar (Female)
  Email: kumar.xx@gmail.com
  Phone: +1 (647) 333-0003
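
`generate_pii_for_file` draws from the module-level `random` state, so each run can assign a different gender and first name to the same file. If repeatable assignments are wanted, one option is to pass an explicitly seeded `random.Random` instance, as in the sketch below. The function `pick_first_name`, the seed value, and the name pools are illustrative assumptions; the real pools come from the PII clusters CSV.

```python
import random

def pick_first_name(male_names, female_names, rng):
    """Mirror the 50/50 gender draw, using an explicit random.Random instance."""
    is_male = rng.choice([True, False])
    pool = male_names if is_male else female_names
    return rng.choice(pool), 'Male' if is_male else 'Female'

# Illustrative pools only, not the contents of the real CSV
male_pool = ['Amit', 'Sunil', 'Mohan']
female_pool = ['Kavya', 'Aisha']

# Two generators created with the same seed produce the same draw
first = pick_first_name(male_pool, female_pool, random.Random(2025))
second = pick_first_name(male_pool, female_pool, random.Random(2025))
print(first == second)  # → True
```

Recording the seed alongside the results would let a later run reproduce the same name assignments.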


### Apply PII Selection to Selected Files

In [48]:
# Apply PII selection to the files we want to process
print("Generating PII data for selected files...")
print("-" * 60)

# Add PII data to our files_to_process dataframe
files_with_pii = []

for idx, row in files_to_process.iterrows():
    file_id = row['file_id']
    country = row.get('key_metrics.basics.likely_home_country', 'Unknown')
    
    print(f"\nFile: {file_id}")
    print(f"Country: {country}")
    
    file_pii_data = {}
    
    # Generate PII for each treatment type
    for treatment_type in treatment_types:
        if pd.isna(country) or country == 'Unknown':
            print(f"  Warning: No country data for {treatment_type}, using default PII")
            # Use a default (you can customize this)
            file_pii_data[treatment_type] = {
                'full_name': 'Test User',
                'email': 'test@example.com',
                'phone': '123-456-7890',
                'first_name': 'Test',
                'last_name': 'User',
                'gender': 'Unknown'
            }
        else:
            pii = generate_pii_for_file(country, treatment_type)
            if pii:
                file_pii_data[treatment_type] = pii
                print(f"  {treatment_type}: {pii['full_name']} ({pii['gender']})")
                print(f"    Email: {pii['email']}")
                print(f"    Phone: {pii['phone']}")
            else:
                print(f"  {treatment_type}: Failed to generate PII")
                # Fallback to default
                file_pii_data[treatment_type] = {
                    'full_name': 'Test User',
                    'email': 'test@example.com',
                    'phone': '123-456-7890',
                    'first_name': 'Test',
                    'last_name': 'User',
                    'gender': 'Unknown'
                }
    
    # Store the file data with PII
    file_data = {
        'file_row': row,
        'pii_data': file_pii_data
    }
    files_with_pii.append(file_data)

print(f"\nPII generation completed for {len(files_with_pii)} files")

Generating PII data for selected files...
------------------------------------------------------------

File: ITC resume 20.pdf
Country: India
  control: Amit Patel (Male)
    Email: patel.xx@gmail.com
    Phone: +1 (416) 111-0001
  Type_I: Kavya Singh (Female)
    Email: singh.xx@gmail.com
    Phone: +1 (416) 222-0002
  Type_II: Sunil Kumar (Male)
    Email: kumar.xx@gmail.com
    Phone: +1 (647) 333-0003
  Type_III: Mohan Khan (Male)
    Email: khan.xx@gmail.com
    Phone: +1 (647) 444-0004

File: ITC resume 14.pdf
Country: Saudi Arabia
  control: Zara Hassan (Female)
    Email: hassan.xx@gmail.com
    Phone: +1 (416) 121-0009
  Type_I: Fatima Rahman (Female)
    Email: rahman.xx@gmail.com
    Phone: +1 (416) 121-0013
  Type_II: Aleena Karimi (Female)
    Email: karimi.xx@gmail.com
    Phone: +1 (647) 121-0014
  Type_III: Hamza Mohamed (Male)
    Email: mohamed.xx@gmail.com
    Phone: +1 (647) 121-0015

File: ITC resume 18.pdf
Country: Eritrea
  control: Louise Adebayo (Female)
    E

### Webhook Request with PII Data

In [49]:
# Enhanced webhook request function that includes PII data
def create_enhanced_request_body(file_row, treatment_type, pii_data):
    """
    Create an enhanced request body with PII data.
    
    Args:
        file_row: Row from the job matches dataframe
        treatment_type: Treatment type being processed
        pii_data: PII data for this treatment type
    
    Returns:
        dict: Enhanced request body
    """
    request_body = {
        'file_id': file_row['file_id'],
        'treatment_type': treatment_type,
        'name': pii_data['full_name'],
        'email': pii_data['email'],
        'phone': pii_data['phone'],
        'location': file_row.get('location') if pd.notna(file_row.get('location')) else 'Toronto, ON',
        # Additional PII fields for the webhook
        'first_name': pii_data['first_name'],
        'last_name': pii_data['last_name'],
        'gender': pii_data['gender'],
        'country': file_row.get('key_metrics.basics.likely_home_country', 'Unknown'),
        'geographic_cluster': country_to_cluster.get(file_row.get('key_metrics.basics.likely_home_country', ''), 'Unknown')
    }
    
    return request_body

# Test the enhanced request body creation
print("Testing enhanced request body creation:")
print("-" * 60)

if files_with_pii:
    test_file = files_with_pii[0]
    test_treatment = treatment_types[0]
    test_pii = test_file['pii_data'][test_treatment]
    
    test_request = create_enhanced_request_body(
        test_file['file_row'], 
        test_treatment, 
        test_pii
    )
    
    print(f"Sample request body for {test_file['file_row']['file_id']} - {test_treatment}:")
    for key, value in test_request.items():
        print(f"  {key}: {value}")

Testing enhanced request body creation:
------------------------------------------------------------
Sample request body for ITC resume 20.pdf - control:
  file_id: ITC resume 20.pdf
  treatment_type: control
  name: Amit Patel
  email: patel.xx@gmail.com
  phone: +1 (416) 111-0001
  location: Toronto, ON, CA
  first_name: Amit
  last_name: Patel
  gender: Male
  country: India
  geographic_cluster: South Asia
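
When a column exists but the value is missing, `file_row.get(column, 'Unknown')` returns `NaN` rather than the default, so the `country` field of the request body can carry a float `NaN`. `NaN` is not valid JSON, and strict encoders reject it. The sketch below replaces such values before sending; the helper name `replace_nan` is hypothetical, and how the webhook treats `NaN` is not shown in this notebook.

```python
import json
import math

def replace_nan(payload, default='Unknown'):
    """Return a copy of payload with float NaN values replaced by default."""
    return {
        key: default if isinstance(value, float) and math.isnan(value) else value
        for key, value in payload.items()
    }

# Illustrative payload with a missing country
payload = {'file_id': 'ITC resume 20.pdf', 'country': float('nan')}
clean = replace_nan(payload)

# allow_nan=False makes json.dumps raise ValueError on NaN, so this proves it is gone
print(json.dumps(clean, allow_nan=False))
```

This could be applied to the return value of `create_enhanced_request_body` just before the request is sent.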


### Load Existing Results and Initialize

In [50]:
# Load existing results if available
existing_results = []
if os.path.exists(output_csv):
    existing_results = pd.read_csv(output_csv).to_dict('records')
    print(f"Loaded {len(existing_results)} existing results from {output_csv}")
else:
    print(f"No existing results found. Will create new {output_csv}")

# Initialize results list
new_results = []

No existing results found. Will create new google_doc_generation_results.csv
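
Because new records are appended to whatever is already in the CSV, re-running the notebook over the same files adds a second row for every file and treatment pair. One possible way to avoid repeating finished work, not part of the original notebook, is to collect the pairs that already succeeded and skip them in the processing loop. The helper name and the sample records below are illustrative.

```python
def completed_pairs(existing_results):
    """Collect (file_id, treatment_type) pairs that already succeeded."""
    return {
        (record['file_id'], record['treatment_type'])
        for record in existing_results
        if record.get('status') == 'success'
    }

# Illustrative records in the same shape as rows of the results CSV
existing = [
    {'file_id': 'ITC resume 20.pdf', 'treatment_type': 'control', 'status': 'success'},
    {'file_id': 'ITC resume 20.pdf', 'treatment_type': 'Type_I', 'status': 'error'},
]
done = completed_pairs(existing)

print(('ITC resume 20.pdf', 'control') in done)  # → True
print(('ITC resume 20.pdf', 'Type_I') in done)   # → False
```

Inside the loop, a check such as `if (file_row['file_id'], treatment_type) in done: continue` would then retry only failed or missing pairs.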


### Process Files with PII Data

In [51]:
# Process each file with each treatment type (ENHANCED VERSION WITH PII)
total_operations = len(files_with_pii) * len(treatment_types)
current_operation = 0
stop_processing = False

print(f"Starting processing of {total_operations} operations with enhanced PII data...")
print("=" * 60)

for file_idx, file_data in enumerate(files_with_pii):
    if stop_processing:
        break
    
    file_row = file_data['file_row']
    file_pii = file_data['pii_data']
    
    print(f"\nProcessing file {file_idx + 1}/{len(files_with_pii)}: {file_row['file_id']}")
    print("-" * 40)
    
    for treatment_idx, treatment_type in enumerate(treatment_types):
        if stop_processing:
            break
            
        current_operation += 1
        print(f"  Treatment {treatment_idx + 1}/{len(treatment_types)}: {treatment_type}")
        
        # Get PII data for this treatment
        treatment_pii = file_pii[treatment_type]
        print(f"    Using PII: {treatment_pii['full_name']} ({treatment_pii['email']})")
        
        try:
            # Create enhanced request body with PII data
            request_body = create_enhanced_request_body(file_row, treatment_type, treatment_pii)
            
            # Send request
            print(f"    📤 Sending enhanced request for {treatment_type}...")
            response = requests.post(test_url, json=request_body, auth=authorization)
            
            if response.status_code == 200:
                response_data = response.json()
                print(f"    ✅ Response received: {response_data}")
                
                # Extract Google Doc information from the response format
                doc_id = response_data.get('documentID', '')
                doc_url = response_data.get('file_url', '')
                doc_status = response_data.get('status', '')
                doc_filename = response_data.get('fileName', '')
                
                # Create enhanced result record
                result_record = {
                    'timestamp': datetime.now().isoformat(),
                    'file_id': file_row['file_id'],
                    'treatment_type': treatment_type,
                    'google_doc_id': doc_id,
                    'google_doc_url': doc_url,
                    'google_doc_status': doc_status,
                    'google_doc_filename': doc_filename,
                    'status': 'success',
                    'response_status': response.status_code,
                    'response_data': json.dumps(response_data),
                    # Enhanced PII fields
                    'full_name': treatment_pii['full_name'],
                    'email': treatment_pii['email'],
                    'phone': treatment_pii['phone'],
                    'first_name': treatment_pii['first_name'],
                    'last_name': treatment_pii['last_name'],
                    'gender': treatment_pii['gender'],
                    'country': file_row.get('key_metrics.basics.likely_home_country', 'Unknown'),
                    'geographic_cluster': country_to_cluster.get(file_row.get('key_metrics.basics.likely_home_country', ''), 'Unknown')
                }
                
                # Add essential job data
                result_record['job_posting_id'] = file_row.get('job_posting_id', '')
                result_record['job_title'] = file_row.get('title', '')
                result_record['job_description'] = file_row.get('job_description', '')[:200] + '...' if len(str(file_row.get('job_description', ''))) > 200 else file_row.get('job_description', '')
                result_record['likely_home_country'] = file_row.get('key_metrics.basics.likely_home_country', '')
                result_record['match_score'] = file_row.get('match_score', '')
                result_record['location'] = file_row.get('location', '')
                result_record['date_posted'] = file_row.get('date_posted', '')
                
                new_results.append(result_record)
                print(f"    ✓ Success: {doc_url}")
                print(f"        Document ID: {doc_id}")
                print(f"        Filename: {doc_filename}")
                print(f"        PII: {treatment_pii['full_name']} ({treatment_pii['email']})")
                
            else:
                print(f"    ✗ Failed: HTTP {response.status_code}")
                print(f"    Response: {response.text}")
                
                # Create enhanced error record
                result_record = {
                    'timestamp': datetime.now().isoformat(),
                    'file_id': file_row['file_id'],
                    'treatment_type': treatment_type,
                    'google_doc_id': '',
                    'google_doc_url': '',
                    'google_doc_status': '',
                    'google_doc_filename': '',
                    'status': 'failed',
                    'response_status': response.status_code,
                    'error_message': response.text,
                    'response_data': '',
                    # Enhanced PII fields (even for failed requests)
                    'full_name': treatment_pii['full_name'],
                    'email': treatment_pii['email'],
                    'phone': treatment_pii['phone'],
                    'first_name': treatment_pii['first_name'],
                    'last_name': treatment_pii['last_name'],
                    'gender': treatment_pii['gender'],
                    'country': file_row.get('key_metrics.basics.likely_home_country', 'Unknown'),
                    'geographic_cluster': country_to_cluster.get(file_row.get('key_metrics.basics.likely_home_country', ''), 'Unknown')
                }
                
                # Add essential job data
                result_record['job_posting_id'] = file_row.get('job_posting_id', '')
                result_record['job_title'] = file_row.get('title', '')
                result_record['job_description'] = file_row.get('job_description', '')[:200] + '...' if len(str(file_row.get('job_description', ''))) > 200 else file_row.get('job_description', '')
                result_record['likely_home_country'] = file_row.get('key_metrics.basics.likely_home_country', '')
                result_record['match_score'] = file_row.get('match_score', '')
                result_record['location'] = file_row.get('location', '')
                result_record['date_posted'] = file_row.get('date_posted', '')
                
                new_results.append(result_record)
                
                # Check if it's a 404 error and stop processing
                if response.status_code == 404:
                    print(f"    404 Error detected. Stopping all processing.")
                    print(f"    Last processed: File {file_row['file_id']}, Treatment {treatment_type}")
                    stop_processing = True
                    break
                
        except Exception as e:
            print(f"    ✗ Error: {str(e)}")
            
            # Create enhanced error record
            result_record = {
                'timestamp': datetime.now().isoformat(),
                'file_id': file_row['file_id'],
                'treatment_type': treatment_type,
                'google_doc_id': '',
                'google_doc_url': '',
                'google_doc_status': '',
                'google_doc_filename': '',
                'status': 'error',
                'response_status': '',
                'error_message': str(e),
                'response_data': '',
                # Enhanced PII fields (even for errors)
                'full_name': treatment_pii['full_name'],
                'email': treatment_pii['email'],
                'phone': treatment_pii['phone'],
                'first_name': treatment_pii['first_name'],
                'last_name': treatment_pii['last_name'],
                'gender': treatment_pii['gender'],
                'country': file_row.get('key_metrics.basics.likely_home_country', 'Unknown'),
                'geographic_cluster': country_to_cluster.get(file_row.get('key_metrics.basics.likely_home_country', ''), 'Unknown')
            }
            
            # Add essential job data
            result_record['job_posting_id'] = file_row.get('job_posting_id', '')
            result_record['job_title'] = file_row.get('title', '')
            result_record['job_description'] = file_row.get('job_description', '')[:200] + '...' if len(str(file_row.get('job_description', ''))) > 200 else file_row.get('job_description', '')
            result_record['likely_home_country'] = file_row.get('key_metrics.basics.likely_home_country', '')
            result_record['match_score'] = file_row.get('match_score', '')
            result_record['location'] = file_row.get('location', '')
            result_record['date_posted'] = file_row.get('date_posted', '')
            
            new_results.append(result_record)
        
        # Progress update
        print(f"    Progress: {current_operation}/{total_operations} ({current_operation/total_operations*100:.1f}%)")
        
        # Add delay between requests to avoid overwhelming n8n
        if not stop_processing:
            print(f"    ⏳ Waiting 3 seconds before next request...")
            time.sleep(3)  # Wait 3 seconds between requests
            print(f"    ▶️ Continuing to next request...")

print("\n" + "=" * 60)
print(f"Processing completed! Generated {len(new_results)} new results with enhanced PII data.")

Starting processing of 72 operations with enhanced PII data...

Processing file 1/18: ITC resume 20.pdf
----------------------------------------
  Treatment 1/4: control
    Using PII: Amit Patel (patel.xx@gmail.com)
    📤 Sending enhanced request for control...
    ✅ Response received: {'documentID': '1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg', 'file_url': 'https://docs.google.com/document/d/1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg', 'status': 200, 'fileName': 'ITC resume 20.pdf_control'}
    ✓ Success: https://docs.google.com/document/d/1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg
        Document ID: 1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg
        Filename: ITC resume 20.pdf_control
        PII: Amit Patel (patel.xx@gmail.com)
    Progress: 1/72 (1.4%)
    ⏳ Waiting 3 seconds before next request...
    ▶️ Continuing to next request...
  Treatment 2/4: Type_I
    Using PII: Kavya Singh (singh.xx@gmail.com)
    📤 Sending enhanced request for Type_I...
    ✅ Response receiv
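
The processing cell builds nearly the same `result_record` three times, once each for success, HTTP failure, and exception. A sketch of a single helper that the three branches could share is shown below. The names `truncate` and `build_result_record` are hypothetical, and only a subset of the notebook's fields is included to keep the example short.

```python
from datetime import datetime

def truncate(text, limit=200):
    """Shorten strings longer than limit and append '...'; other values pass through."""
    if isinstance(text, str) and len(text) > limit:
        return text[:limit] + '...'
    return text

def build_result_record(file_row, treatment_type, pii, status, **extra):
    """Assemble one result row; extra overrides the status-specific fields."""
    record = {
        'timestamp': datetime.now().isoformat(),
        'file_id': file_row['file_id'],
        'treatment_type': treatment_type,
        'google_doc_id': '',
        'google_doc_url': '',
        'status': status,
        'response_status': '',
        'error_message': '',
        'full_name': pii['full_name'],
        'email': pii['email'],
        'job_description': truncate(file_row.get('job_description', '')),
    }
    record.update(extra)
    return record

# Illustrative inputs; a plain dict stands in for the DataFrame row
row = {'file_id': 'ITC resume 20.pdf', 'job_description': 'x' * 250}
pii = {'full_name': 'Test User', 'email': 'test@example.com'}

ok = build_result_record(row, 'control', pii, 'success',
                         google_doc_url='https://example.invalid/doc')
err = build_result_record(row, 'control', pii, 'error', error_message='timeout')
print(ok.keys() == err.keys())  # → True
```

Giving every record the same keys also keeps the CSV columns stable regardless of which branch produced a row.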

### Save Results and Display Summary

In [52]:
# Combine existing and new results
all_results = existing_results + new_results
results_df = pd.DataFrame(all_results)

# Save to CSV
results_df.to_csv(output_csv, index=False, encoding='utf-8')
print(f"Results saved to {output_csv}")
print(f"Total records: {len(results_df)}")
print(f"New records added: {len(new_results)}")

# Display summary
print("\nSummary:")
print(f"- Success: {len(results_df[results_df['status'] == 'success'])}")
print(f"- Failed: {len(results_df[results_df['status'] == 'failed'])}")
print(f"- Errors: {len(results_df[results_df['status'] == 'error'])}")

# Show first few results with enhanced PII data
print("\nFirst few results (with PII data):")
display_columns = ['timestamp', 'file_id', 'treatment_type', 'status', 'full_name', 'email', 'google_doc_url']
available_columns = [col for col in display_columns if col in results_df.columns]
results_df[available_columns].head(10)

Results saved to google_doc_generation_results.csv
Total records: 72
New records added: 72

Summary:
- Success: 64
- Failed: 0
- Errors: 8

First few results (with PII data):


Unnamed: 0,timestamp,file_id,treatment_type,status,full_name,email,google_doc_url
0,2025-08-23T23:40:17.900003,ITC resume 20.pdf,control,success,Amit Patel,patel.xx@gmail.com,https://docs.google.com/document/d/1mM5sq-DnOl...
1,2025-08-23T23:40:54.374821,ITC resume 20.pdf,Type_I,success,Kavya Singh,singh.xx@gmail.com,https://docs.google.com/document/d/1MQcZpAFTT0...
2,2025-08-23T23:41:22.160314,ITC resume 20.pdf,Type_II,success,Sunil Kumar,kumar.xx@gmail.com,https://docs.google.com/document/d/1Lnj1748MSi...
3,2025-08-23T23:41:48.183573,ITC resume 20.pdf,Type_III,success,Mohan Khan,khan.xx@gmail.com,https://docs.google.com/document/d/1vzYwRnVXmS...
4,2025-08-23T23:42:23.244079,ITC resume 14.pdf,control,success,Zara Hassan,hassan.xx@gmail.com,https://docs.google.com/document/d/1xMZxyYBqBZ...
5,2025-08-23T23:43:02.763763,ITC resume 14.pdf,Type_I,success,Fatima Rahman,rahman.xx@gmail.com,https://docs.google.com/document/d/1Oc_L91n6QM...
6,2025-08-23T23:43:48.041742,ITC resume 14.pdf,Type_II,success,Aleena Karimi,karimi.xx@gmail.com,https://docs.google.com/document/d/1prEDV-JhKx...
7,2025-08-23T23:44:17.949255,ITC resume 14.pdf,Type_III,success,Hamza Mohamed,mohamed.xx@gmail.com,https://docs.google.com/document/d/1gBIHZLGZod...
8,2025-08-23T23:44:46.121958,ITC resume 18.pdf,control,success,Louise Adebayo,adebayo.xx@gmail.com,https://docs.google.com/document/d/1TOyHwAsA4B...
9,2025-08-23T23:45:14.913050,ITC resume 18.pdf,Type_I,success,Amara Okoro,okoro.xx@gmail.com,https://docs.google.com/document/d/1sKeJxTdNBg...


### Detailed Results Display

In [53]:
# Optional: Display detailed results for a specific file
if len(new_results) > 0:
    print("Detailed results for the first processed file:")
    first_file_id = new_results[0]['file_id']
    file_results = results_df[results_df['file_id'] == first_file_id]
    
    for _, row in file_results.iterrows():
        print(f"\nTreatment: {row['treatment_type']}")
        print(f"Status: {row['status']}")
        print(f"PII: {row.get('full_name', 'N/A')} ({row.get('gender', 'N/A')})")
        print(f"Email: {row.get('email', 'N/A')}")
        print(f"Phone: {row.get('phone', 'N/A')}")
        print(f"Country: {row.get('country', 'N/A')}")
        print(f"Geographic Cluster: {row.get('geographic_cluster', 'N/A')}")
        
        if row['status'] == 'success':
            print(f"Google Doc URL: {row['google_doc_url']}")
            print(f"Document ID: {row['google_doc_id']}")
            print(f"Filename: {row['google_doc_filename']}")
        else:
            print(f"Error: {row.get('error_message', 'Unknown error')}")

Detailed results for the first processed file:

Treatment: control
Status: success
PII: Amit Patel (Male)
Email: patel.xx@gmail.com
Phone: +1 (416) 111-0001
Country: India
Geographic Cluster: South Asia
Google Doc URL: https://docs.google.com/document/d/1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg
Document ID: 1mM5sq-DnOlQRnnv9ZhOTTek5yKp4A8zNrPTemPJSJBg
Filename: ITC resume 20.pdf_control

Treatment: Type_I
Status: success
PII: Kavya Singh (Female)
Email: singh.xx@gmail.com
Phone: +1 (416) 222-0002
Country: India
Geographic Cluster: South Asia
Google Doc URL: https://docs.google.com/document/d/1MQcZpAFTT0qA4m2ZSoncv1ytlhdPEQjPi52fNyZIZeQ
Document ID: 1MQcZpAFTT0qA4m2ZSoncv1ytlhdPEQjPi52fNyZIZeQ
Filename: ITC resume 20_Type_I

Treatment: Type_II
Status: success
PII: Sunil Kumar (Male)
Email: kumar.xx@gmail.com
Phone: +1 (647) 333-0003
Country: India
Geographic Cluster: South Asia
Google Doc URL: https://docs.google.com/document/d/1Lnj1748MSiaJxECIdvkBuxlfgK4TAD0-NphsgL4CFJc
Document ID