# Train-Test-Validation leakage check

Train-test-validation data leakage occurs when information from the validation or test sets is inadvertently used during training, which inflates performance estimates and hides poor generalization. It typically arises when the training, validation, and testing datasets are not properly separated, or when features that should be excluded from training are included. Identifying and preventing this leakage is crucial for building accurate and reliable machine learning models.
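
One direct way to test whether two datasets are properly separated is to count the rows they share. The sketch below is illustrative only: `count_shared_rows` is a hypothetical helper, and the toy frames stand in for the real training and validation data. In survey data, two different passengers can legitimately give identical answers, so a nonzero count is a prompt to investigate rather than proof of leakage.

```python
import pandas as pd


def count_shared_rows(left: pd.DataFrame, right: pd.DataFrame) -> int:
    """Count distinct rows of `left` that also appear in `right`.

    Rows are compared on the columns the two frames have in common.
    """
    common = [c for c in left.columns if c in right.columns]
    # Drop duplicates first so a repeated row is counted once.
    left_unique = left[common].drop_duplicates()
    right_unique = right[common].drop_duplicates()
    # An inner merge keeps only rows present in both frames.
    return len(left_unique.merge(right_unique, on=common, how="inner"))


# Toy frames (assumption: stand-ins for the real train/validation data).
toy_train = pd.DataFrame({"Age": [25, 40, 33], "Flight Distance": [300, 1200, 850]})
toy_val = pd.DataFrame({"Age": [40, 51], "Flight Distance": [1200, 640]})

shared = count_shared_rows(toy_train, toy_val)
print(f"{shared} row(s) appear in both sets")
```

Applied to the notebook's data, the same helper could be called on the feature frames of any two splits.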






In [11]:
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score
import xgboost as xgb

In [21]:
import pandas as pd

# Read the data
data_train = pd.read_csv('/content/drive/MyDrive/Enterprise Data Science Group Project 2/airplane_train_processed_date.csv', 
                   true_values = ['yes'], false_values = ['no'])

data_val = pd.read_csv('/content/drive/MyDrive/Enterprise Data Science Group Project 2/airplane_test_processed_date.csv', 
                   true_values = ['yes'], false_values = ['no'])

In [22]:
# Select target and encode it: 0 = "neutral or dissatisfied", 1 = otherwise
y_1 = data_train.satisfaction
y_1_num = (y_1 != "neutral or dissatisfied").astype(int)

# Select predictors
X_1 = data_train.drop(['satisfaction'], axis=1)

In [23]:
# Convert 'Date' column to datetime data type
X_1['Date'] = pd.to_datetime(X_1['Date'])

# Extracting relevant information from 'Date'
X_1['Year'] = X_1['Date'].dt.year
X_1['Month'] = X_1['Date'].dt.month
X_1['Day'] = X_1['Date'].dt.day

# Drop the original 'Date' column
X_1.drop('Date', axis=1, inplace=True)

In [24]:
# Hold out 33% of the training file as the test set
X_train, X_test, y_train, y_test = train_test_split(X_1, y_1_num, test_size=0.33, random_state=5)

In [25]:
X_val = data_val[['Flight Distance', 'Departure Delay in Minutes',
       'Arrival Delay in Minutes', 'Gender_Female', 'Gender_Male',
       'Customer Type_Loyal Customer', 'Customer Type_disloyal Customer',
       'Type of Travel_Business travel', 'Type of Travel_Personal Travel',
       'Class_Business', 'Class_Eco', 'Class_Eco Plus', 'Age',
       'Inflight wifi service', 'Departure/Arrival time convenient',
       'Ease of Online booking', 'Gate location', 'Food and drink',
       'Online boarding', 'Seat comfort', 'Inflight entertainment',
       'On-board service', 'Leg room service', 'Baggage handling',
       'Checkin service', 'Inflight service', 'Cleanliness',"Date"]].copy()  # copy so the edits below do not act on a slice of data_val

# Encode the target: 0 = "neutral or dissatisfied", 1 = otherwise
y_val = (data_val['satisfaction'] != "neutral or dissatisfied").astype(int).values


In [26]:
# Convert 'Date' column to datetime data type
X_val['Date'] = pd.to_datetime(X_val['Date'])

# Extracting relevant information from 'Date'
X_val['Year'] = X_val['Date'].dt.year
X_val['Month'] = X_val['Date'].dt.month
X_val['Day'] = X_val['Date'].dt.day

# Drop the original 'Date' column
X_val.drop('Date', axis=1, inplace=True)

In [27]:
params = {'colsample_bytree': 0.6911920435612005, 'gamma': 8.593324118055857, 'max_depth': 37, 'min_child_weight': 9.0, 'reg_alpha': 72.0, 'reg_lambda': 0.7966579413290078}


The following code checks for data leakage between the train, test, and validation datasets.

It builds a pipeline that applies StandardScaler to the input features and then fits an XGBoost classifier, the model with the best accuracy in comparison with the other models, using the hyperparameters in `params`. It then runs 5-fold cross-validation separately on the training, validation, and testing sets, which yields five accuracy scores per set. Because `cross_val_score` refits the pipeline within each set, each mean score describes that set on its own rather than the predictions of a single fitted model. Finally, the code compares the mean accuracies: if the validation mean differs from the training or testing mean by more than 0.1, it prints a message indicating that train-test-validation data leakage has been detected. Otherwise, it prints a message indicating that no leakage has been detected.
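
A complementary comparison fits the model once on the training split and scores that single fitted model on data it has not seen. The sketch below is a minimal illustration under stated assumptions: it uses synthetic data from `make_classification` rather than the airline dataset, `LogisticRegression` stands in for the XGBoost classifier, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the airline dataset).
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=5
)

# Fit once, on the training portion only; the scaler is fitted inside
# the pipeline, so it never sees the held-out rows.
demo_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
demo_model.fit(X_fit, y_fit)

# Score the same fitted model on data it has and has not seen.
acc_fit = accuracy_score(y_fit, demo_model.predict(X_fit))
acc_hold = accuracy_score(y_hold, demo_model.predict(X_hold))
gap = acc_fit - acc_hold

print(f"train accuracy:    {acc_fit:.3f}")
print(f"held-out accuracy: {acc_hold:.3f}")
print(f"gap:               {gap:.3f}")
```

A large positive gap usually points to overfitting, while a held-out accuracy that looks implausibly high is one of the more common hints that information has leaked into training.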

In [28]:
# Build a pipeline with StandardScaler and an XGBoost classifier
pipeline = make_pipeline(
    StandardScaler(),
    xgb.XGBClassifier(
        n_estimators=1000,
        max_depth=int(params['max_depth']),
        gamma=params['gamma'],
        reg_alpha=params['reg_alpha'],
        min_child_weight=params['min_child_weight'],
        colsample_bytree=params['colsample_bytree'],
    ),
)

# Fit on the training data (cross_val_score below refits clones of the
# pipeline, so it does not reuse this fitted model)
pipeline.fit(X_train, y_train)

# 5-fold cross-validated accuracy, computed separately within each set
cv_scores_train = cross_val_score(pipeline, X_train, y_train, cv=5)
cv_scores_val = cross_val_score(pipeline, X_val, y_val, cv=5)
cv_scores_test = cross_val_score(pipeline, X_test, y_test, cv=5)

# Compare the mean accuracy of the validation set with the other two sets
train_val_diff = abs(cv_scores_train.mean() - cv_scores_val.mean())
test_val_diff = abs(cv_scores_test.mean() - cv_scores_val.mean())

if train_val_diff > 0.1 or test_val_diff > 0.1:
    print("Train-test-validation data leakage detected")
else:
    print("No train-test-validation data leakage detected")


No train-test-validation data leakage detected


**Conclusion**

The output `No train-test-validation data leakage detected` means that the mean cross-validated accuracies on the training, validation, and testing sets are within 0.1 of one another, so this check found no evidence of data leakage. This is a good sign: the model performs consistently across the three sets, which is what we would expect from a model that generalizes to new, unseen data. Overall, this is a positive result that suggests the model is a good candidate for further testing and refinement.