
## **Project Description**

This project investigates a multi-class classification problem aimed at categorizing job roles into Low, Medium, and High automation risk. Initial experiments using Logistic Regression and Random Forest yielded near-perfect accuracy, which raised concerns about model validity.

Rather than optimizing for accuracy, this study focuses on validating the dataset and identifying potential target leakage. Multiple diagnostic techniques, including label shuffling, single-feature dominance testing, and feature-removal experiments, were applied to evaluate whether the models were learning genuine patterns or simply exploiting leaked target information.

The findings demonstrate that the target variable is deterministically derived from one or more input features, rendering the dataset unsuitable for predictive modeling. This notebook documents the investigative process, results, and lessons learned regarding responsible model evaluation.

Mount the Drive

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### **Methodology**

The dataset was preprocessed using encoding and scaling where appropriate. Multiple classification algorithms including Logistic Regression and Random Forest were trained and evaluated using accuracy and class-wise metrics.

To validate model legitimacy, diagnostic tests were performed:

* Label shuffling to assess dependence on true targets

* Single-feature dominance testing to detect proxy leakage

These steps ensured that model performance was critically examined beyond surface-level accuracy.

Import the necessary libraries

In [4]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

Load the file

In [5]:
df = pd.read_csv('/content/drive/MyDrive/AI_Impact_on_Jobs_2030.csv')
df.head()

Unnamed: 0,Job_Title,Average_Salary,Years_Experience,Education_Level,AI_Exposure_Index,Tech_Growth_Factor,Automation_Probability_2030,Risk_Category,Skill_1,Skill_2,Skill_3,Skill_4,Skill_5,Skill_6,Skill_7,Skill_8,Skill_9,Skill_10
0,Security Guard,45795,28,Master's,0.18,1.28,0.85,High,0.45,0.1,0.46,0.33,0.14,0.65,0.06,0.72,0.94,0.0
1,Research Scientist,133355,20,PhD,0.62,1.11,0.05,Low,0.02,0.52,0.4,0.05,0.97,0.23,0.09,0.62,0.38,0.98
2,Construction Worker,146216,2,High School,0.86,1.18,0.81,High,0.01,0.94,0.56,0.39,0.02,0.23,0.24,0.68,0.61,0.83
3,Software Engineer,136530,13,PhD,0.39,0.68,0.6,Medium,0.43,0.21,0.57,0.03,0.84,0.45,0.4,0.93,0.73,0.33
4,Financial Analyst,70397,22,High School,0.52,1.46,0.64,Medium,0.75,0.54,0.59,0.97,0.61,0.28,0.3,0.17,0.02,0.42


Exploratory Data Analysis (EDA)

In [6]:
df.describe()

Unnamed: 0,Average_Salary,Years_Experience,AI_Exposure_Index,Tech_Growth_Factor,Automation_Probability_2030,Skill_1,Skill_2,Skill_3,Skill_4,Skill_5,Skill_6,Skill_7,Skill_8,Skill_9,Skill_10
count,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0,3000.0
mean,89372.279,14.677667,0.501283,0.995343,0.501503,0.496973,0.497233,0.499313,0.503667,0.49027,0.499807,0.49916,0.502843,0.501433,0.493627
std,34608.088767,8.739788,0.284004,0.287669,0.247881,0.287888,0.288085,0.288354,0.287063,0.285818,0.28605,0.288044,0.289832,0.285818,0.286464
min,30030.0,0.0,0.0,0.5,0.05,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,58640.0,7.0,0.26,0.74,0.31,0.24,0.25,0.25,0.26,0.24,0.26,0.25,0.25,0.26,0.25
50%,89318.0,15.0,0.5,1.0,0.5,0.505,0.5,0.5,0.51,0.49,0.5,0.49,0.5,0.5,0.49
75%,119086.5,22.0,0.74,1.24,0.7,0.74,0.74,0.75,0.75,0.73,0.74,0.75,0.75,0.74,0.74
max,149798.0,29.0,1.0,1.5,0.95,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
df.shape

(3000, 18)

In [8]:
df.isnull().sum()

Unnamed: 0,0
Job_Title,0
Average_Salary,0
Years_Experience,0
Education_Level,0
AI_Exposure_Index,0
Tech_Growth_Factor,0
Automation_Probability_2030,0
Risk_Category,0
Skill_1,0
Skill_2,0


In [9]:
df.dtypes

Unnamed: 0,0
Job_Title,object
Average_Salary,int64
Years_Experience,int64
Education_Level,object
AI_Exposure_Index,float64
Tech_Growth_Factor,float64
Automation_Probability_2030,float64
Risk_Category,object
Skill_1,float64
Skill_2,float64


Standardizing our data

In [10]:
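# NOTE: this scaler is fitted on the full dataset, before the train-test split,
# so test-set statistics influence the preprocessing of these two columns.
scalar = StandardScaler()
cols_to_scale = ['Average_Salary', 'Years_Experience']
df[cols_to_scale] = scalar.fit_transform(df[cols_to_scale])
print(df[cols_to_scale].describe())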


       Average_Salary  Years_Experience
count    3.000000e+03      3.000000e+03
mean     1.503982e-16     -3.789561e-17
std      1.000167e+00      1.000167e+00
min     -1.714980e+00     -1.679688e+00
25%     -8.881566e-01     -8.786193e-01
50%     -1.568652e-03      3.688729e-02
75%      8.587349e-01      8.379556e-01
max      1.746291e+00      1.639024e+00


Converting categorical data into numerical

In [11]:
# label encoding
df['Risk_Category'] = (
    df['Risk_Category'].astype(str).str.strip().str.lower()
)
print("Unique values after cleaning: ", df['Risk_Category'].unique())
df['Risk_Category'] = df['Risk_Category'].map({'low':0, 'medium':1, 'high':2})

print(df['Risk_Category'].unique())


Unique values after cleaning:  ['high' 'low' 'medium']
[2 0 1]


In [12]:
#separating target and features
X = df.drop('Risk_Category', axis=1)

y=df['Risk_Category']


In [13]:
X['Education_Level'] = (
    X['Education_Level'].astype(str).str.strip().str.lower()
)
print(X['Education_Level'].unique())

X['Education_Level'] = X['Education_Level'].map({'high school': 0,'bachelor\'s':1, 'master\'s':2, 'phd':3})

["master's" 'phd' 'high school' "bachelor's"]


In [14]:
X['Job_Title'] = X['Job_Title'].str.strip().str.lower()
X = pd.get_dummies(X, columns = ['Job_Title'], drop_first=True)


In [15]:
# Select boolean columns
bool_cols = X.select_dtypes(include='bool').columns

# Convert True/False to 1/0
X[bool_cols] = X[bool_cols].astype(int)


Train-Test split

In [16]:
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.2, random_state = 42, stratify=y)

In [17]:
# Fit the scaler on the training split only, then apply it to the test split
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
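
As a hedged alternative to scaling by hand, the scaler and the classifier can be wrapped in a scikit-learn pipeline so that the scaler is only ever fitted on training rows. The sketch below uses synthetic stand-in data (`X_demo`, `y_demo` are hypothetical names, not part of this notebook); in the notebook itself the pipeline would be fitted on `X_train` and `y_train`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical); replace with the notebook's X and y.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = (X_demo[:, 0] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The scaler lives inside the pipeline, so fit() only sees the training rows.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_tr, y_tr)

# score() applies the already-fitted scaler to the test rows before predicting.
demo_acc = pipe.score(X_te, y_te)
print(f"Held-out accuracy: {demo_acc:.3f}")
```

The same pipeline object can be passed to `cross_val_score`, which refits the scaler inside every fold.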

### **Adding diagnostic tests to validate the model**

Test 1: Shuffle-Label Diagnostic Test

In [18]:
# Shuffle only the training labels; the test labels stay intact
y_shuffled = np.random.permutation(y_train)

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_shuffled)
pred = rf.predict(X_test)

print("Accuracy with shuffled labels:", accuracy_score(y_test, pred))


Accuracy with shuffled labels: 0.5066666666666667
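
To interpret the shuffled-label score, it helps to compare it with a majority-class baseline. The sketch below is a minimal illustration that rebuilds the test-label distribution from the support counts in this notebook's classification report (148 Low, 304 Medium, 148 High); `y_test_demo` and `X_placeholder` are hypothetical names. In the notebook, the baseline would normally be fitted on `y_train` and scored on `y_test`.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Rebuild the test-label distribution from the reported support counts:
# 148 Low (0), 304 Medium (1), 148 High (2).
y_test_demo = np.repeat([0, 1, 2], [148, 304, 148])

# DummyClassifier ignores the features, so a column of zeros is enough.
X_placeholder = np.zeros((len(y_test_demo), 1))

# Always predict the most frequent class (Medium).
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_placeholder, y_test_demo)

baseline_acc = accuracy_score(y_test_demo, baseline.predict(X_placeholder))
print(f"Majority-class baseline: {baseline_acc:.4f}")  # 304 / 600 -> 0.5067
```

A shuffled-label accuracy close to this value suggests the model has nothing to learn once the link between features and labels is broken.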


Test 2: Single Feature Influence Test

In [19]:
for col in X_train.columns:
    rf = RandomForestClassifier(random_state=42)
    rf.fit(X_train[[col]], y_train)
    acc = accuracy_score(y_test, rf.predict(X_test[[col]]))

    if acc > 0.85:
        print(col, acc)


Automation_Probability_2030 1.0
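
One way to check whether the label is a deterministic binning of this column is to compare the per-class minimum and maximum of `Automation_Probability_2030`. The sketch below runs on a small illustrative frame (`demo` and its values are hypothetical and do not state the dataset's actual cutoffs); on the real data the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Illustrative rows only; the cutoffs implied here are NOT taken from the dataset.
demo = pd.DataFrame({
    "Automation_Probability_2030": [0.05, 0.20, 0.29, 0.35, 0.60, 0.64, 0.75, 0.81, 0.85],
    "Risk_Category": ["Low", "Low", "Low", "Medium", "Medium", "Medium",
                      "High", "High", "High"],
})

# Per-class value range of the suspected leaking feature, ordered by lower bound.
ranges = (
    demo.groupby("Risk_Category")["Automation_Probability_2030"]
    .agg(["min", "max"])
    .sort_values("min")
)
print(ranges)

# The ranges are disjoint if each class starts above the previous class's maximum.
disjoint = bool(
    (ranges["min"].iloc[1:].values > ranges["max"].iloc[:-1].values).all()
)
print("Non-overlapping ranges:", disjoint)
```

Non-overlapping ranges would mean a pair of thresholds on this single column reproduces the label exactly, which is consistent with a single-feature accuracy of 1.0.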


Defining Models

In [20]:
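models = {
    # With the lbfgs solver, multi-class targets are fitted as a multinomial model
    # by default; the explicit multi_class argument is deprecated since scikit-learn 1.5.
    'Logistic Regression': LogisticRegression(solver='lbfgs', max_iter=500),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
}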

Storing results

In [21]:
results = {}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    acc = accuracy_score(y_test, y_pred)
    results[name] = acc
    print(f"\nModel: {name}")
    print(f"Accuracy: {acc:.4f}")
    print(classification_report(y_test, y_pred))




Model: Logistic Regression
Accuracy: 0.9933
              precision    recall  f1-score   support

           0       1.00      0.99      0.99       148
           1       0.99      0.99      0.99       304
           2       0.99      1.00      0.99       148

    accuracy                           0.99       600
   macro avg       0.99      0.99      0.99       600
weighted avg       0.99      0.99      0.99       600


Model: Random Forest
Accuracy: 0.9983
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       148
           1       1.00      1.00      1.00       304
           2       0.99      1.00      1.00       148

    accuracy                           1.00       600
   macro avg       1.00      1.00      1.00       600
weighted avg       1.00      1.00      1.00       600



Summary of Accuracy

In [22]:
print("\nModel Accuracy Summary:")
for name, acc in results.items():
    print(f"{name}: {acc:.4f}")


Model Accuracy Summary:
Logistic Regression: 0.9933
Random Forest: 0.9983


## **Results and Analysis**

Initial model evaluations produced unusually high accuracy scores (≈99–100%) across multiple algorithms. However, diagnostic testing revealed that a single feature (Automation_Probability_2030) was sufficient to achieve perfect classification.

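Further, when the training labels were randomly shuffled, Random Forest accuracy dropped to about 0.507, which matches the majority-class share of the test set (304 of 600 samples are Medium). Performance therefore collapses to chance level once the link between features and true labels is broken, which indicates that the near-perfect scores come from a real relationship between the features and the target rather than from a flaw in the evaluation pipeline. Combined with the single-feature result, this points to target leakage through Automation_Probability_2030. Removing individual high-level features did not materially reduce performance, indicating proxy leakage across multiple correlated variables.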

Based on these findings, it was concluded that the dataset exhibits deterministic label generation and does not support meaningful predictive modeling.
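
A feature-removal experiment of the kind described above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's actual experiment: `leaky_score`, `noise_a`, `noise_b`, the 0.3 and 0.7 cutoffs, and the helper `accuracy_without` are all hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data: the label is a deterministic binning of one column (hypothetical).
rng = np.random.default_rng(0)
n = 600
demo_X = pd.DataFrame({
    "leaky_score": rng.uniform(0, 1, n),  # stand-in for the leaking feature
    "noise_a": rng.uniform(0, 1, n),
    "noise_b": rng.uniform(0, 1, n),
})
demo_y = np.digitize(demo_X["leaky_score"], bins=[0.3, 0.7])  # classes 0, 1, 2

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=42, stratify=demo_y
)

def accuracy_without(column=None):
    """Train on every column except `column` and return test accuracy."""
    cols = [c for c in demo_X.columns if c != column]
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X_tr[cols], y_tr)
    return accuracy_score(y_te, model.predict(X_te[cols]))

full_acc = accuracy_without()
ablation = {col: accuracy_without(col) for col in demo_X.columns}

print(f"All features: {full_acc:.3f}")
for col, acc in ablation.items():
    print(f"Without {col}: {acc:.3f}")
```

In this synthetic setup, dropping the leaking column should send accuracy toward chance while dropping a noise column should leave it roughly unchanged; a large asymmetry of that kind is the signature of a leaking feature.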

### **Limitations and Ethical Considerations**

Because the target variable is directly encoded within the feature space, any high-performing model trained on this dataset would be misleading and non-generalizable. Deploying such a model could result in false confidence and incorrect decision-making.

Consequently, no final predictive model was selected or deployed. This decision highlights the importance of dataset validation, leakage detection, and ethical responsibility in applied machine learning.

### **Conclusion**

This project demonstrates that high model accuracy does not necessarily indicate meaningful learning. Through systematic validation and leakage detection, the study underscores the critical role of dataset integrity in machine learning workflows. The key outcome of this work is not model performance, but the identification of conditions under which predictive modeling should not proceed.