# Data Processing Pipeline

We process the resume screening dataset to create Dataset A, in which demographic attributes are excluded from the model features.

## Steps:
1. Load & Clean the Dataset
2. Remove personal identifiers and demographic columns
3. Standardize features (Skills, Education, Experience)
4. Handle missing values
5. Label encoding for categorical variables
6. Train-test split (80/20)


In [16]:
# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import warnings
warnings.filterwarnings('ignore')

print("Libraries imported successfully!")


Libraries imported successfully!


## Step 1: Load & Clean the Dataset


In [17]:
df = pd.read_csv('AI_Resume_Screening_with_demographics_BIASED.csv')
df.head()

Unnamed: 0,Resume_ID,Name,Skills,Experience (Years),Education,Certifications,Job Role,Recruiter Decision,Salary Expectation ($),Projects Count,AI Score (0-100),Gender,Race,Age,Disability_Status,AI_Score_Biased,Recruiter_Decision_Biased
0,1,Ashley Ali,"TensorFlow, NLP, Pytorch",10,B.Sc,,AI Researcher,Hire,104895,8,100,Male,Asian,30,No,96.8,Hire
1,2,Wesley Roman,"Deep Learning, Machine Learning, Python, SQL",10,MBA,Google ML,Data Scientist,Hire,113002,1,100,Female,White,33,No,100.0,Hire
2,3,Corey Sanchez,"Ethical Hacking, Cybersecurity, Linux",1,MBA,Deep Learning Specialization,Cybersecurity Analyst,Hire,71766,7,70,Female,Hispanic,26,No,63.6,Hire
3,4,Elizabeth Carney,"Python, Pytorch, TensorFlow",7,B.Tech,AWS Certified,AI Researcher,Hire,46848,0,95,Male,Asian,30,No,91.0,Hire
4,5,Julie Hill,"SQL, React, Java",4,PhD,,Software Engineer,Hire,87441,9,100,Male,Hispanic,34,No,97.8,Hire


In [18]:
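# Drop the original score and decision columns; the `_Biased` versions are
# renamed to `AI_Score` and `Recruiter_Decision` in the next cell
df = df.drop(columns=['AI Score (0-100)', 'Recruiter Decision'])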

In [19]:
# Rename columns to use single words/underscores only (no spaces, parentheses, special chars)
column_rename_map = {
    'Experience (Years)': 'Experience',
    'Salary Expectation ($)': 'Salary_Expectation',
    'Projects Count': 'Projects_Count',
    'AI_Score_Biased': 'AI_Score',
    'Recruiter_Decision_Biased': 'Recruiter_Decision',
    'Job Role': 'Job_Role'
}
df = df.rename(columns=column_rename_map)

print(f"Dataset shape: {df.shape}")
print(f"\nColumns: {list(df.columns)}")
print(f"\nFirst few rows:")
df.head()


Dataset shape: (1000, 15)

Columns: ['Resume_ID', 'Name', 'Skills', 'Experience', 'Education', 'Certifications', 'Job_Role', 'Salary_Expectation', 'Projects_Count', 'Gender', 'Race', 'Age', 'Disability_Status', 'AI_Score', 'Recruiter_Decision']

First few rows:


Unnamed: 0,Resume_ID,Name,Skills,Experience,Education,Certifications,Job_Role,Salary_Expectation,Projects_Count,Gender,Race,Age,Disability_Status,AI_Score,Recruiter_Decision
0,1,Ashley Ali,"TensorFlow, NLP, Pytorch",10,B.Sc,,AI Researcher,104895,8,Male,Asian,30,No,96.8,Hire
1,2,Wesley Roman,"Deep Learning, Machine Learning, Python, SQL",10,MBA,Google ML,Data Scientist,113002,1,Female,White,33,No,100.0,Hire
2,3,Corey Sanchez,"Ethical Hacking, Cybersecurity, Linux",1,MBA,Deep Learning Specialization,Cybersecurity Analyst,71766,7,Female,Hispanic,26,No,63.6,Hire
3,4,Elizabeth Carney,"Python, Pytorch, TensorFlow",7,B.Tech,AWS Certified,AI Researcher,46848,0,Male,Asian,30,No,91.0,Hire
4,5,Julie Hill,"SQL, React, Java",4,PhD,,Software Engineer,87441,9,Male,Hispanic,34,No,97.8,Hire


In [20]:
# Check for missing values
print("Missing values per column:")
print(df.isnull().sum())
print(f"\nTotal missing values: {df.isnull().sum().sum()}")
print(f"\nNote: Column names have been standardized (no spaces, parentheses, or special characters)")


Missing values per column:
Resume_ID               0
Name                    0
Skills                  0
Experience              0
Education               0
Certifications        274
Job_Role                0
Salary_Expectation      0
Projects_Count          0
Gender                  0
Race                    0
Age                     0
Disability_Status       0
AI_Score                0
Recruiter_Decision      0
dtype: int64

Total missing values: 274

Note: Column names have been standardized (no spaces, parentheses, or special characters)


### Step 1.1: Save Demographic Features & Remove from Training Data

For Dataset A, we:
- **Save demographic features separately** (Gender, Race, Age, Disability_Status) for fairness metrics analysis
- **Remove from training features**: Name, Resume_ID, and demographic columns (not used as model inputs)
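
The separation described above can be sanity-checked with a small guard. This is a minimal sketch on a toy frame that reuses the notebook's column names; `split_features_and_demographics` is a hypothetical helper, not part of the notebook. Unlike the notebook cell, which falls back to an empty DataFrame, it raises if a demographic column is missing.

```python
import pandas as pd

DEMOGRAPHIC_COLUMNS = ["Gender", "Race", "Age", "Disability_Status"]
IDENTIFIER_COLUMNS = ["Name", "Resume_ID"]


def split_features_and_demographics(df):
    """Hypothetical helper: return (features, demographics) with disjoint columns."""
    missing = [c for c in DEMOGRAPHIC_COLUMNS if c not in df.columns]
    if missing:
        raise KeyError(f"Missing demographic columns: {missing}")
    demographics = df[DEMOGRAPHIC_COLUMNS].copy()
    features = df.drop(columns=IDENTIFIER_COLUMNS + DEMOGRAPHIC_COLUMNS, errors="ignore")
    return features, demographics


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Resume_ID": [1, 2],
    "Name": ["A", "B"],
    "Skills": ["Python, SQL", "Java"],
    "Gender": ["Male", "Female"],
    "Race": ["Asian", "White"],
    "Age": [30, 33],
    "Disability_Status": ["No", "No"],
})
features, demographics = split_features_and_demographics(toy)

# No identifier or demographic column may remain among the training features
leaked = set(features.columns) & set(DEMOGRAPHIC_COLUMNS + IDENTIFIER_COLUMNS)
print(list(features.columns), leaked)
```

Both frames keep the original row index, which is what later allows the demographics to be re-aligned with the train and test splits.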


In [21]:
# Save demographic features separately for fairness metrics analysis
demographic_columns = ['Gender', 'Race', 'Age', 'Disability_Status']
demographics_df = df[demographic_columns].copy() if all(col in df.columns for col in demographic_columns) else pd.DataFrame()

print("Demographic features saved separately:")
print(f"Demographics shape: {demographics_df.shape}")
print(f"Demographic columns: {list(demographics_df.columns)}")
print(f"\nDemographic features preview:")
print(demographics_df.head())
print(f"\nDemographic value counts:")
for col in demographic_columns:
    if col in demographics_df.columns:
        print(f"\n{col}:")
        print(demographics_df[col].value_counts())

# Create Dataset A by removing personal identifiers and demographic columns from training features
columns_to_remove = ['Name', 'Resume_ID'] + demographic_columns
df_cleaned = df.drop(columns=columns_to_remove, errors='ignore')

print(f"\n" + "="*60)
print(f"Original shape: {df.shape}")
print(f"Cleaned shape (for training): {df_cleaned.shape}")
print(f"Demographics shape (for fairness): {demographics_df.shape}")
print(f"\nRemaining columns for training: {list(df_cleaned.columns)}")
df_cleaned.head()


Demographic features saved separately:
Demographics shape: (1000, 4)
Demographic columns: ['Gender', 'Race', 'Age', 'Disability_Status']

Demographic features preview:
   Gender      Race  Age Disability_Status
0    Male     Asian   30                No
1  Female     White   33                No
2  Female  Hispanic   26                No
3    Male     Asian   30                No
4    Male  Hispanic   34                No

Demographic value counts:

Gender:
Gender
Male      503
Female    497
Name: count, dtype: int64

Race:
Race
White              566
Hispanic           177
Black              158
Asian               65
Other               25
Native American      9
Name: count, dtype: int64

Age:
Age
27    92
29    92
30    89
28    89
26    79
32    75
31    74
24    60
33    58
25    47
34    46
35    37
23    36
36    33
22    25
21    18
37    11
39     9
20     8
38     7
41     5
19     3
40     3
18     2
43     2
Name: count, dtype: int64

Disability_Status:
Disability_Status
No

Unnamed: 0,Skills,Experience,Education,Certifications,Job_Role,Salary_Expectation,Projects_Count,AI_Score,Recruiter_Decision
0,"TensorFlow, NLP, Pytorch",10,B.Sc,,AI Researcher,104895,8,96.8,Hire
1,"Deep Learning, Machine Learning, Python, SQL",10,MBA,Google ML,Data Scientist,113002,1,100.0,Hire
2,"Ethical Hacking, Cybersecurity, Linux",1,MBA,Deep Learning Specialization,Cybersecurity Analyst,71766,7,63.6,Hire
3,"Python, Pytorch, TensorFlow",7,B.Tech,AWS Certified,AI Researcher,46848,0,91.0,Hire
4,"SQL, React, Java",4,PhD,,Software Engineer,87441,9,97.8,Hire


### Step 1.2: Standardize Features

- **Skills**: Keep as text feature (will be included in features for modeling)
- **Education**: Convert to ordinal (High School = 1 < B.Sc/B.Tech = 2 < MBA/M.Tech/Masters = 3 < PhD = 4)
- **Experience**: Ensure numeric format
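
The notebook keeps `Skills` as raw text and does not show how it is processed later. One possible approach, sketched here as an assumption rather than the notebook's method, is to split the comma-separated string into one 0/1 indicator column per skill using `Series.str.get_dummies`.

```python
import pandas as pd

# Toy values in the same "a, b, c" format as the Skills column
skills = pd.Series([
    "TensorFlow, NLP, Pytorch",
    "SQL, React, Java",
    "Python, Pytorch, TensorFlow",
])

# Split on ", " and build one 0/1 column per distinct skill
skill_indicators = skills.str.get_dummies(sep=", ")
print(skill_indicators)
```

Each row then has a 1 in the column of every skill it lists, so the number of ones per row equals the number of skills on that resume.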


In [22]:
# Skills: Keep as text feature (will be processed later for modeling)
# Check for missing values in Skills
print("Skills column check:")
print(f"Missing Skills values: {df_cleaned['Skills'].isnull().sum()}")
print(f"\nSample Skills column:")
print(df_cleaned[['Skills']].head(10))
print(f"\nSkills data type: {df_cleaned['Skills'].dtype}")

Skills column check:
Missing Skills values: 0

Sample Skills column:
                                              Skills
0                           TensorFlow, NLP, Pytorch
1       Deep Learning, Machine Learning, Python, SQL
2              Ethical Hacking, Cybersecurity, Linux
3                        Python, Pytorch, TensorFlow
4                                   SQL, React, Java
5  Cybersecurity, Networking, Linux, Ethical Hacking
6         Networking, Cybersecurity, Ethical Hacking
7                           TensorFlow, Pytorch, NLP
8                        Networking, Ethical Hacking
9                   Python, TensorFlow, Pytorch, NLP

Skills data type: object


In [23]:
# Standardize Education: Convert to ordinal encoding
# Define education hierarchy
education_order = {
    'High School': 1,
    'B.Sc': 2,
    'B.Tech': 2,
    'MBA': 3,
    'M.Tech': 3,
    'Masters': 3,
    'PhD': 4
}

def map_education(edu):
    """Map education level to ordinal value"""
    if pd.isna(edu):
        return np.nan
    edu_str = str(edu).strip()
    # Check for exact matches first
    if edu_str in education_order:
        return education_order[edu_str]
    # Check for partial matches
    for key, value in education_order.items():
        if key.lower() in edu_str.lower():
            return value
    return np.nan

df_cleaned['Education_Ordinal'] = df_cleaned['Education'].apply(map_education)

print("Education mapping:")
print(df_cleaned[['Education', 'Education_Ordinal']].value_counts().sort_index())
print(f"\nMissing Education_Ordinal: {df_cleaned['Education_Ordinal'].isnull().sum()}")


Education mapping:
Education  Education_Ordinal
B.Sc       2                    205
B.Tech     2                    200
M.Tech     3                    198
MBA        3                    202
PhD        4                    195
Name: count, dtype: int64

Missing Education_Ordinal: 0


In [24]:
# Standardize Experience: Ensure numeric format
df_cleaned['Experience'] = pd.to_numeric(df_cleaned['Experience'], errors='coerce')

print("Experience statistics:")
print(df_cleaned['Experience'].describe())
print(f"\nMissing Experience values: {df_cleaned['Experience'].isnull().sum()}")


Experience statistics:
count    1000.000000
mean        4.896000
std         3.112695
min         0.000000
25%         2.000000
50%         5.000000
75%         8.000000
max        10.000000
Name: Experience, dtype: float64

Missing Experience values: 0


### Step 1.3: Handle Missing Values

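- **Certifications**: missing values filled with the string "None" (the only column with missing values, 274 rows)
- **Mode** for the other categorical variables
- **Median** for numeric variables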
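
The mode/median rule above can be written as a small function that returns a new frame instead of modifying columns in place. This is a sketch on toy data; `impute_mode_median` is a hypothetical name, not something defined in the notebook.

```python
import numpy as np
import pandas as pd


def impute_mode_median(df, categorical_cols, numeric_cols):
    """Hypothetical helper: mode-fill categoricals, median-fill numerics, on a copy."""
    out = df.copy()
    for col in categorical_cols:
        modes = out[col].mode()
        if not modes.empty:
            out[col] = out[col].fillna(modes.iloc[0])
    for col in numeric_cols:
        out[col] = out[col].fillna(out[col].median())
    return out


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "Job_Role": ["Data Scientist", None, "Data Scientist", "AI Researcher"],
    "Experience": [2.0, np.nan, 10.0, 4.0],
})
filled = impute_mode_median(toy, ["Job_Role"], ["Experience"])
print(filled)
```

In the notebook these statistics are computed on the full dataset before the split. Because only `Certifications` has missing values there, the mode and median fills leave the other columns unchanged.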

In [25]:
# Identify categorical and numeric columns
categorical_cols = ['Certifications', 'Job_Role', 'Education']
numeric_cols = ['Experience', 'Salary_Expectation', 'Projects_Count', 
                'AI_Score', 'Education_Ordinal']

print("Missing values before imputation:")
print(df_cleaned[categorical_cols + numeric_cols].isnull().sum())


Missing values before imputation:
Certifications        274
Job_Role                0
Education               0
Experience              0
Salary_Expectation      0
Projects_Count          0
AI_Score                0
Education_Ordinal       0
dtype: int64


In [26]:
# Fill missing values: Mode for categorical, Median for numeric
# Categorical: mode, except Certifications, where a missing value is filled with the string "None"
if 'Certifications' in df_cleaned.columns:
    df_cleaned['Certifications'] = df_cleaned['Certifications'].fillna("None")
    print("Certifications: Filled missing values with 'None'")

for col in categorical_cols:
    if col == 'Certifications':
        continue
    if col in df_cleaned.columns:
        mode_value = df_cleaned[col].mode()[0] if not df_cleaned[col].mode().empty else ''
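        # Assign back instead of inplace=True on a column selection, which newer pandas versions warn about
        df_cleaned[col] = df_cleaned[col].fillna(mode_value)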
        print(f"{col}: Filled with mode = '{mode_value}'")

# Numeric: Median
for col in numeric_cols:
    if col in df_cleaned.columns:
        median_value = df_cleaned[col].median()
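        # Assign back instead of inplace=True on a column selection, which newer pandas versions warn about
        df_cleaned[col] = df_cleaned[col].fillna(median_value)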
        print(f"{col}: Filled with median = {median_value:.2f}")

# Handle Skills missing values (text feature)
if 'Skills' in df_cleaned.columns:
    df_cleaned['Skills'] = df_cleaned['Skills'].fillna('')
    print("Skills: Filled missing values with empty string")

print("\nMissing values after imputation:")
print(df_cleaned[categorical_cols + numeric_cols + ['Skills']].isnull().sum().sum())


Certifications: Filled missing values with 'None'
Job_Role: Filled with mode = 'AI Researcher'
Education: Filled with mode = 'B.Sc'
Experience: Filled with median = 5.00
Salary_Expectation: Filled with median = 79834.50
Projects_Count: Filled with median = 5.00
AI_Score: Filled with median = 92.20
Education_Ordinal: Filled with median = 3.00
Skills: Filled missing values with empty string

Missing values after imputation:
0


### Step 1.4: Label Encoding for Categorical Variables


In [27]:
# Apply label encoding to categorical columns
label_encoders = {}
categorical_features = ['Certifications', 'Job_Role', 'Education']

for col in categorical_features:
    if col in df_cleaned.columns:
        le = LabelEncoder()
        df_cleaned[f'{col}_Encoded'] = le.fit_transform(df_cleaned[col].astype(str))
        label_encoders[col] = le
        print(f"{col} encoding:")
        print(f"  Unique values: {df_cleaned[col].unique()}")
        print(f"  Encoded mapping: {dict(zip(le.classes_, le.transform(le.classes_)))}")
        print()


Certifications encoding:
  Unique values: ['None' 'Google ML' 'Deep Learning Specialization' 'AWS Certified']
  Encoded mapping: {'AWS Certified': 0, 'Deep Learning Specialization': 1, 'Google ML': 2, 'None': 3}

Job_Role encoding:
  Unique values: ['AI Researcher' 'Data Scientist' 'Cybersecurity Analyst'
 'Software Engineer']
  Encoded mapping: {'AI Researcher': 0, 'Cybersecurity Analyst': 1, 'Data Scientist': 2, 'Software Engineer': 3}

Education encoding:
  Unique values: ['B.Sc' 'MBA' 'B.Tech' 'PhD' 'M.Tech']
  Encoded mapping: {'B.Sc': 0, 'B.Tech': 1, 'M.Tech': 2, 'MBA': 3, 'PhD': 4}
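
The fitted encoders are kept in `label_encoders`, so integer codes can be turned back into labels later. A minimal sketch with a freshly fitted encoder on the four job roles listed above:

```python
from sklearn.preprocessing import LabelEncoder

roles = ["AI Researcher", "Data Scientist", "Cybersecurity Analyst", "Software Engineer"]
encoder = LabelEncoder().fit(roles)

# Codes follow the alphabetical order of the class labels
codes = encoder.transform(["Data Scientist", "AI Researcher"])
decoded = encoder.inverse_transform(codes)
print(list(codes), list(decoded))
```

The codes are assigned alphabetically, so their numeric order carries no meaning for nominal columns such as `Job_Role`.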



### Step 1.5: Prepare Final Dataset A

Select features for modeling and prepare the target variable.

**Note**: 
- Target variable: **Recruiter_Decision**
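
The notebook keeps the target as the strings `Hire` and `Reject`. If a downstream model needs a numeric target, one option is an explicit mapping that fails on unexpected labels. This is a sketch under that assumption; the 1/0 assignment is a hypothetical choice, not taken from the notebook.

```python
import pandas as pd

# Toy target values, invented for illustration only
decisions = pd.Series(["Hire", "Reject", "Hire", "Hire"], name="Recruiter_Decision")

# Hypothetical mapping: 1 = Hire, 0 = Reject
label_map = {"Hire": 1, "Reject": 0}
unexpected = set(decisions.unique()) - set(label_map)
if unexpected:
    raise ValueError(f"Unexpected labels: {unexpected}")

y_binary = decisions.map(label_map).astype(int)
print(y_binary.tolist(), y_binary.mean())
```

The mean of the binary target is the share of `Hire` decisions, which gives a quick check of class balance.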

In [31]:
# Select features for Dataset A (include Skills text, exclude AI Score and Recruiter Decision)
feature_columns = [
    'Skills',  # text feature
    'Experience',
    'Education_Ordinal',
    'Certifications_Encoded',
    'Job_Role_Encoded',
    'Salary_Expectation',
    'Projects_Count'
]

# Create feature matrix X
X = df_cleaned[feature_columns].copy()

y_target = df_cleaned['Recruiter_Decision'].copy()  # Classification target (Hire/Reject)
ai_score = df_cleaned['AI_Score'].copy()            # For fairness metrics

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y_target.shape}")
print(f"\nFeature columns: {list(X.columns)}")
print(f"\nTarget variable (Recruiter_Decision) statistics:")
print(y_target.describe())
X.head()


Feature matrix shape: (1000, 7)
Target variable shape: (1000,)

Feature columns: ['Skills', 'Experience', 'Education_Ordinal', 'Certifications_Encoded', 'Job_Role_Encoded', 'Salary_Expectation', 'Projects_Count']

Target variable (Recruiter_Decision) statistics:
count     1000
unique       2
top       Hire
freq       807
Name: Recruiter_Decision, dtype: object


Unnamed: 0,Skills,Experience,Education_Ordinal,Certifications_Encoded,Job_Role_Encoded,Salary_Expectation,Projects_Count
0,"TensorFlow, NLP, Pytorch",10,2,3,0,104895,8
1,"Deep Learning, Machine Learning, Python, SQL",10,3,2,2,113002,1
2,"Ethical Hacking, Cybersecurity, Linux",1,3,1,1,71766,7
3,"Python, Pytorch, TensorFlow",7,2,0,0,46848,0
4,"SQL, React, Java",4,4,3,3,87441,9


### Step 1.6: Train-Test Split (80/20)


In [32]:
# Perform a stratified train-test split (80/20) so both sets keep the Hire/Reject ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y_target,  # Use y_target (Recruiter_Decision)
    test_size=0.2,
    random_state=42,
    stratify=y_target  
)

# Get the indices used in the split
train_indices = X_train.index
test_indices = X_test.index

# Align demographic features using the same indices for fairness metrics analysis
demographics_train = demographics_df.loc[train_indices].copy()
demographics_test = demographics_df.loc[test_indices].copy()

# Align AI_Score for fairness metrics
ai_score_train = ai_score.loc[train_indices].copy()
ai_score_test = ai_score.loc[test_indices].copy()

print(f"Training set shape: {X_train.shape}")
print(f"Test set shape: {X_test.shape}")
print(f"\nDemographics train shape: {demographics_train.shape}")
print(f"Demographics test shape: {demographics_test.shape}")
print(f"\nTraining target (Recruiter Decision) distribution:")
print(y_train.value_counts())
print(f"\nTest target (Recruiter Decision) distribution:")
print(y_test.value_counts())
print(f"\nAI Score - Training statistics:")
print(ai_score_train.describe())
print(f"\nAI Score - Test statistics:")
print(ai_score_test.describe())
print(f"\nDemographics test set (for fairness metrics):")
print(demographics_test.head())

Training set shape: (800, 7)
Test set shape: (200, 7)

Demographics train shape: (800, 4)
Demographics test shape: (200, 4)

Training target (Recruiter Decision) distribution:
Recruiter_Decision
Hire      646
Reject    154
Name: count, dtype: int64

Test target (Recruiter Decision) distribution:
Recruiter_Decision
Hire      161
Reject     39
Name: count, dtype: int64

AI Score - Training statistics:
count    800.000000
mean      81.813250
std       20.816799
min        5.100000
25%       67.975000
50%       91.250000
75%       99.900000
max      100.000000
Name: AI_Score, dtype: float64

AI Score - Test statistics:
count    200.000000
mean      84.640500
std       19.983546
min        8.100000
25%       74.500000
50%       93.550000
75%      100.000000
max      100.000000
Name: AI_Score, dtype: float64

Demographics test set (for fairness metrics):
    Gender      Race  Age Disability_Status
495   Male     Black   25                No
851   Male  Hispanic   22                No
517   M
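
The split above relies on two ideas: stratifying on the target, and re-using the split's row index to select the matching demographic rows. A minimal sketch on toy data (all values invented for illustration) shows both:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

n = 100
X = pd.DataFrame({"Experience": range(n)})
y = pd.Series(["Hire"] * 80 + ["Reject"] * 20, name="Recruiter_Decision")
demographics = pd.DataFrame({"Gender": ["Male", "Female"] * (n // 2)})

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Select the demographic rows that belong to each split via the shared index
demo_train = demographics.loc[X_train.index]
demo_test = demographics.loc[X_test.index]

print(y_train.value_counts(normalize=True).round(2).to_dict())
print(y_test.value_counts(normalize=True).round(2).to_dict())
```

Because the demographics frame is indexed with `.loc` on the split's index, its rows stay in the same order as the feature rows, which matters once the index is dropped for saving.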

### Step 1.7: Summary of Dataset A

Dataset A is now ready for modeling. It contains:
- **No demographic features in training** (Gender, Race, Age, Disability_Status excluded from model inputs)
- **Demographic features saved separately** (for fairness metrics analysis on test set)
- **No personal identifiers** (Name, Resume_ID removed)
- **Standardized features** (Skills as text, Education ordinal, Experience numeric)
- **No missing values** (imputed with mode/median)
- **Label encoded categorical variables**
- **Target variable: Recruiter_Decision (Hire/Reject)** - classification task
- **AI_Score kept separate** - for fairness metrics analysis (not used as feature)
- **Train-test split** (80/20, stratified on the target) with aligned demographic and AI_Score splits
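
The notebook does not say which fairness metrics are computed from the separately saved demographics. As one illustration, the selection rate per group and the gap between groups (a demographic parity difference) can be computed as below. The predictions here are toy values, not model output.

```python
import pandas as pd

# Toy stand-ins for the test-set demographics and a model's predictions
demographics_test = pd.DataFrame({"Gender": ["Male", "Male", "Female", "Female", "Female"]})
predictions = pd.Series(["Hire", "Reject", "Hire", "Hire", "Reject"])

# Share of "Hire" predictions within each group
selected = (predictions == "Hire").astype(int)
selection_rate = selected.groupby(demographics_test["Gender"]).mean()

# Difference between the highest and lowest group selection rate
parity_gap = selection_rate.max() - selection_rate.min()
print(selection_rate.to_dict(), round(parity_gap, 3))
```

This only works if predictions and demographics are aligned row by row, which is why the demographic frames are split with the same indices as the features.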


In [34]:
# Display summary statistics
print("=" * 60)
print("DATASET A SUMMARY")
print("=" * 60)
print(f"\nOriginal dataset shape: {df.shape}")
print(f"Cleaned dataset shape (for training): {df_cleaned.shape}")
print(f"\nFeatures used for modeling: {len(feature_columns)}")
print(f"  - {feature_columns}")
print(f"\nDemographic features saved separately: {list(demographics_df.columns)}")
print(f"  - For fairness metrics analysis (NOT used in model training)")
print(f"\nTraining samples: {len(X_train)} ({len(X_train)/len(X)*100:.1f}%)")
print(f"Test samples: {len(X_test)} ({len(X_test)/len(X)*100:.1f}%)")
print(f"Demographics train: {demographics_train.shape}")
print(f"Demographics test: {demographics_test.shape}")
# Target summary (classification)
print(f"\nTarget variable: Recruiter_Decision - CLASSIFICATION TASK")
print(f"  - Classes: {y_target.unique()}")
print(f"  - Distribution: {dict(y_target.value_counts())}")

print("Dataset A is ready for modeling!")
print("Demographic features and AI Score are saved separately for fairness analysis!")
print("=" * 60)


DATASET A SUMMARY

Original dataset shape: (1000, 15)
Cleaned dataset shape (for training): (1000, 13)

Features used for modeling: 7
  - ['Skills', 'Experience', 'Education_Ordinal', 'Certifications_Encoded', 'Job_Role_Encoded', 'Salary_Expectation', 'Projects_Count']

Demographic features saved separately: ['Gender', 'Race', 'Age', 'Disability_Status']
  - For fairness metrics analysis (NOT used in model training)

Training samples: 800 (80.0%)
Test samples: 200 (20.0%)
Demographics train: (800, 4)
Demographics test: (200, 4)

Target variable: Recruiter_Decision - CLASSIFICATION TASK
  - Classes: ['Hire' 'Reject']
  - Distribution: {'Hire': 807, 'Reject': 193}
Dataset A is ready for modeling!
Demographic features and AI Score are saved separately for fairness analysis!


### Step 1.8: Save Processed Datasets

Save all processed datasets for use in model training and fairness analysis.
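
Since the files are written without an index, features and targets are matched purely by row position when they are read back. The sketch below round-trips toy data through a temporary directory to check this; the file names are hypothetical. It also passes `keep_default_na=False`, because recent pandas versions may parse the literal string `None` (used for missing certifications) as a missing value on read.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-ins with a non-contiguous index, as produced by a shuffled split
X_part = pd.DataFrame(
    {"Experience": [10, 1, 7], "Certifications": ["None", "Google ML", "None"]},
    index=[495, 851, 517],
)
y_part = pd.Series(["Hire", "Reject", "Hire"], index=X_part.index, name="Recruiter_Decision")

with tempfile.TemporaryDirectory() as tmp:
    x_path = Path(tmp) / "X_example.csv"  # hypothetical file names
    y_path = Path(tmp) / "y_example.csv"

    # Drop the original index so both files are aligned by row position
    X_part.reset_index(drop=True).to_csv(x_path, index=False)
    y_part.reset_index(drop=True).to_frame().to_csv(y_path, index=False)

    # keep_default_na=False keeps the literal string "None" instead of NaN
    X_loaded = pd.read_csv(x_path, keep_default_na=False)
    y_loaded = pd.read_csv(y_path)

rows_match = len(X_loaded) == len(y_loaded)
print(rows_match, X_loaded["Certifications"].tolist())
```

The same check applies to the demographics and AI_Score files, which must have the same number of rows, in the same order, as the feature files.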


In [36]:
# Save the processed datasets
try:
    # Training and test features/targets (CLASSIFICATION: Recruiter_Decision)
    X_train_clean = X_train.reset_index(drop=True)
    X_test_clean = X_test.reset_index(drop=True)
    
    X_train_clean.to_csv('X_train_modelB.csv', index=False)
    X_test_clean.to_csv('X_test_modelB.csv', index=False)
    print("✓ Saved X_train_modelB.csv and X_test_modelB.csv")
    
    # Classification targets (Recruiter_Decision)
    y_train_df = pd.DataFrame({'Recruiter_Decision': y_train.values})
    y_test_df = pd.DataFrame({'Recruiter_Decision': y_test.values})
    
    y_train_df.to_csv('y_train_modelB.csv', index=False)
    y_test_df.to_csv('y_test_modelB.csv', index=False)
    print("✓ Saved y_train_modelB.csv and y_test_modelB.csv")
    
    # AI_Score (for fairness metrics analysis)
    ai_score_train_df = pd.DataFrame({'AI_Score': ai_score_train.values})
    ai_score_test_df = pd.DataFrame({'AI_Score': ai_score_test.values})
    
    ai_score_train_df.to_csv('ai_score_train_modelB.csv', index=False)
    ai_score_test_df.to_csv('ai_score_test_modelB.csv', index=False)
    print("✓ Saved ai_score_train_modelB.csv and ai_score_test_modelB.csv")
    
    # Demographic features (for fairness metrics analysis)
    demographics_train_clean = demographics_train.reset_index(drop=True)
    demographics_test_clean = demographics_test.reset_index(drop=True)
    
    demographics_train_clean.to_csv('demographics_train_modelB.csv', index=False)
    demographics_test_clean.to_csv('demographics_test_modelB.csv', index=False)
    print("✓ Saved demographics_train_modelB.csv and demographics_test_modelB.csv")
    
    # Full processed dataset
    df_cleaned_clean = df_cleaned.reset_index(drop=True)
    df_cleaned_clean.to_csv('Dataset_B_processed.csv', index=False)
    print("✓ Saved Dataset_B_processed.csv")
    
    print("\n" + "="*60)
    print("✓ All datasets saved successfully!")
    print("="*60)
    print("\nSaved files:")
    print("  - X_train_modelB.csv, X_test_modelB.csv (features)")
    print("  - y_train_modelB.csv, y_test_modelB.csv (Recruiter_Decision targets - CLASSIFICATION)")
    print("  - ai_score_train_modelB.csv, ai_score_test_modelB.csv (AI_Score for fairness)")
    print("  - demographics_train_modelB.csv, demographics_test_modelB.csv (demographics)")
    print("  - Dataset_B_processed.csv (full processed dataset)")
    print("\nNote:")
    print("- Target variable for modeling: Recruiter_Decision - CLASSIFICATION (Hire/Reject)")
    print("- AI_Score saved separately for fairness metrics analysis")
    print("- Demographic features saved separately for fairness metrics analysis")
    print("- They are NOT included in X_train/X_test (model training features)")
    
except Exception as e:
    print(f"Error saving files: {e}")
    import traceback
    traceback.print_exc()

✓ Saved X_train_modelB.csv and X_test_modelB.csv
✓ Saved y_train_modelB.csv and y_test_modelB.csv
✓ Saved ai_score_train_modelB.csv and ai_score_test_modelB.csv
✓ Saved demographics_train_modelB.csv and demographics_test_modelB.csv
✓ Saved Dataset_B_processed.csv

✓ All datasets saved successfully!

Saved files:
  - X_train_modelB.csv, X_test_modelB.csv (features)
  - y_train_modelB.csv, y_test_modelB.csv (Recruiter_Decision targets - CLASSIFICATION)
  - ai_score_train_modelB.csv, ai_score_test_modelB.csv (AI_Score for fairness)
  - demographics_train_modelB.csv, demographics_test_modelB.csv (demographics)
  - Dataset_B_processed.csv (full processed dataset)

Note:
- Target variable for modeling: Recruiter_Decision - CLASSIFICATION (Hire/Reject)
- AI_Score saved separately for fairness metrics analysis
- Demographic features saved separately for fairness metrics analysis
- They are NOT included in X_train/X_test (model training features)
