# Target Leakage Checking

Data leakage occurs when the training data used to build a model contains information about the target variable that will not be available at prediction time. As a result, the model can score well on the training set, and even on the validation set, yet perform poorly when used for real-world predictions.

Leakage makes a model appear accurate during training, but its accuracy drastically decreases when the model is used for making decisions in practical applications.

Target leakage happens when your predictors include data that will not be available at the time you make predictions, thereby compromising the accuracy of your model. It is essential to understand target leakage in terms of the timing or chronological order in which the data becomes available, rather than solely based on whether a feature aids in making accurate predictions.
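One way to make the timing criterion concrete is to record, for each feature, the stage at which its value becomes known, and compare that with the stage at which a prediction is needed. The sketch below is illustrative only: the feature names echo columns in this dataset, but the `Stage` enum, the `known_at` mapping, and the stages assigned to each feature are assumptions, not facts taken from the data source.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Hypothetical points in a passenger's journey, in chronological order."""
    BOOKING = 1
    BOARDING = 2
    POST_FLIGHT_SURVEY = 3


# Assumed availability of a few features (illustrative, not from the data source).
known_at = {
    "Class_Business": Stage.BOOKING,
    "Flight Distance": Stage.BOOKING,
    "Online boarding": Stage.POST_FLIGHT_SURVEY,
    "Baggage handling": Stage.POST_FLIGHT_SURVEY,
}


def leaky_features(known_at, prediction_stage):
    """Return features whose values only exist after the prediction is made."""
    return sorted(name for name, stage in known_at.items() if stage > prediction_stage)


suspects = leaky_features(known_at, prediction_stage=Stage.BOARDING)
print(suspects)
```

Under these assumed stages, a feature is flagged purely because of when it becomes available, regardless of how predictive it is.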

In [2]:
import pandas as pd

# Read the data
data = pd.read_csv('/content/drive/MyDrive/Enterprise Data Science Group Project 2/airplane_train_processed_date.csv', 
                   true_values = ['yes'], false_values = ['no'])

# Select target
y = data.satisfaction

# Select predictors
X = data.drop(['satisfaction'], axis=1)

print("Number of rows in the dataset:", X.shape[0])
X.head()

Number of rows in the dataset: 102825


| | Departure Delay in Minutes | Arrival Delay in Minutes | Gender_Female | Gender_Male | Customer Type_Loyal Customer | Customer Type_disloyal Customer | Type of Travel_Business travel | Type of Travel_Personal Travel | Class_Business | Class_Eco | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Age | Flight Distance | Date |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | ... | 2.0 | 4.0 | 4.0 | 4.0 | 4.0 | 4.0 | 2.0 | 43 | 508 | 2017-01-01 |
| 1 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 4.0 | 2.0 | 1.0 | 3.0 | 3.0 | 2.0 | 4.0 | 34 | 199 | 2017-01-01 |
| 2 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 4.0 | 5.0 | 5.0 | 54 | 2917 | 2017-01-01 |
| 3 | 36.0 | 27.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 57 | 270 | 2017-01-01 |
| 4 | 0.0 | 5.0 | 1.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | 1.0 | ... | 5.0 | 1.0 | 2.0 | 4.0 | 4.0 | 3.0 | 5.0 | 58 | 308 | 2017-01-01 |


In [3]:
# Convert 'Date' column to datetime data type
X['Date'] = pd.to_datetime(X['Date'])

# Extracting relevant information from 'Date'
X['Year'] = X['Date'].dt.year
X['Month'] = X['Date'].dt.month
X['Day'] = X['Date'].dt.day

# Drop the original 'Date' column
X.drop('Date', axis=1, inplace=True)

In [4]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import xgboost as xgb

In [5]:
# Encode the target: 0 for "neutral or dissatisfied", 1 otherwise
y_num = (y != "neutral or dissatisfied").astype(int)

In [19]:
params = {'colsample_bytree': 0.6911920435612005, 'gamma': 8.593324118055857, 'max_depth': 37, 'min_child_weight': 9.0, 'reg_alpha': 72.0, 'reg_lambda': 0.7966579413290078}

# No preprocessing is needed, so a pipeline is optional here (used anyway as good practice).
# Note: params['reg_lambda'] is not passed below, so XGBoost uses its default for it.
my_pipeline = make_pipeline(xgb.XGBClassifier(n_estimators=1000, max_depth=int(params['max_depth']), gamma=params['gamma'],
                        reg_alpha=params['reg_alpha'], min_child_weight=params['min_child_weight'],
                        colsample_bytree=params['colsample_bytree']))

cv_scores = cross_val_score(my_pipeline, X, y_num, 
                            cv=5,
                            scoring='accuracy')

print("Cross-validation accuracy: %f" % cv_scores.mean())

Cross-validation accuracy: 0.952978


In [6]:
# Relationship between loyal customers and satisfaction

num_loyal = data[data["Customer Type_Loyal Customer"] == 1].shape[0]
num_loyal_satisfied_customers = data[(data["satisfaction"] == 'satisfied') & (data['Customer Type_Loyal Customer'] == 1)].shape[0]
num_loyal_dissatisfied_customers = data[(data["satisfaction"] == 'neutral or dissatisfied') & (data['Customer Type_Loyal Customer'] == 1)].shape[0]


print('Fraction of loyal customers who were satisfied: ', round(num_loyal_satisfied_customers/num_loyal, 2))

print('Fraction of loyal customers who were unsatisfied: ', round(num_loyal_dissatisfied_customers/num_loyal, 2))


Fraction of loyal customers who were satisfied:  0.48
Fraction of loyal customers who were unsatisfied:  0.52


Loyal customers are split almost evenly between satisfied (48%) and neutral or dissatisfied (52%), so customer loyalty on its own appears to be only weakly related to satisfaction.
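The fraction for loyal customers is easier to interpret next to the rate for the remaining customers. A minimal sketch of that comparison with `groupby`, run on a small made-up frame: the toy values are invented for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "Customer Type_Loyal Customer": [1, 1, 1, 1, 0, 0, 0, 0],
    "satisfaction": [
        "satisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied",
        "neutral or dissatisfied", "neutral or dissatisfied",
        "satisfied", "neutral or dissatisfied",
    ],
})

# Satisfaction rate per group: the mean of a boolean column is a fraction.
rates = (
    toy.assign(satisfied=toy["satisfaction"].eq("satisfied"))
       .groupby("Customer Type_Loyal Customer")["satisfied"]
       .mean()
)
print(rates)
```

Applied to the real `data` frame, the same pattern would show both groups side by side instead of one group in isolation.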

**Identifying the top correlated variables**

In [7]:
df = data.copy()

In [8]:
df['satisfaction'] = df['satisfaction'].apply(lambda x: 1 if x == 'satisfied' else 0)

In [21]:
# Find the pairwise correlations among the numeric columns
# (numeric_only=True skips the string-valued 'Date' column)
corr_matrix = df.corr(numeric_only=True)

corr_matrix


| | Departure Delay in Minutes | Arrival Delay in Minutes | Gender_Female | Gender_Male | Customer Type_Loyal Customer | Customer Type_disloyal Customer | Type of Travel_Business travel | Type of Travel_Personal Travel | Class_Business | Class_Eco | ... | Inflight entertainment | On-board service | Leg room service | Baggage handling | Checkin service | Inflight service | Cleanliness | Age | Flight Distance | satisfaction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Departure Delay in Minutes | 1.0 | 0.934728 | -0.002991 | 0.002991 | -0.011209 | 0.011209 | 0.004293 | -0.004293 | -0.017294 | 0.012359 | ... | -0.029103 | -0.031752 | -0.003041 | -0.0154 | -0.025322 | -0.04387 | -0.020383 | -0.011023 | -0.020921 | -0.063013 |
| Arrival Delay in Minutes | 0.934728 | 1.0 | -0.000452 | 0.000452 | -0.01125 | 0.01125 | 0.003783 | -0.003783 | -0.023308 | 0.016732 | ... | -0.03323 | -0.036657 | -0.006118 | -0.019236 | -0.029303 | -0.050749 | -0.023861 | -0.013718 | -0.025892 | -0.072996 |
| Gender_Female | -0.002991 | -0.000452 | 1.0 | -1.0 | -0.03217 | 0.03217 | 0.007395 | -0.007395 | -0.010434 | 0.004947 | ... | -0.005915 | -0.008197 | -0.025812 | -0.03779 | -0.010743 | -0.039664 | -0.006643 | -0.009264 | -0.006116 | -0.012271 |
| Gender_Male | 0.002991 | 0.000452 | -1.0 | 1.0 | 0.03217 | -0.03217 | -0.007395 | 0.007395 | 0.010434 | -0.004947 | ... | 0.005915 | 0.008197 | 0.025812 | 0.03779 | 0.010743 | 0.039664 | 0.006643 | 0.009264 | 0.006116 | 0.012271 |
| Customer Type_Loyal Customer | -0.011209 | -0.01125 | -0.03217 | 0.03217 | 1.0 | -1.0 | -0.308851 | 0.308851 | 0.085051 | -0.118231 | ... | 0.110467 | 0.057041 | 0.052855 | -0.025226 | 0.032292 | -0.022899 | 0.084348 | 0.28248 | 0.224975 | 0.188103 |
| Customer Type_disloyal Customer | 0.011209 | 0.01125 | 0.03217 | -0.03217 | -1.0 | 1.0 | 0.308851 | -0.308851 | -0.085051 | 0.118231 | ... | -0.110467 | -0.057041 | -0.052855 | 0.025226 | -0.032292 | 0.022899 | -0.084348 | -0.28248 | -0.224975 | -0.188103 |
| Type of Travel_Business travel | 0.004293 | 0.003783 | 0.007395 | -0.007395 | -0.308851 | 0.308851 | 1.0 | -1.0 | 0.552058 | -0.500779 | ... | 0.150408 | 0.057012 | 0.130711 | 0.031773 | -0.018919 | 0.023483 | 0.078308 | 0.048303 | 0.268114 | 0.448892 |
| Type of Travel_Personal Travel | -0.004293 | -0.003783 | -0.007395 | 0.007395 | 0.308851 | -0.308851 | -1.0 | 1.0 | -0.552058 | 0.500779 | ... | -0.150408 | -0.057012 | -0.130711 | -0.031773 | 0.018919 | -0.023483 | -0.078308 | -0.048303 | -0.268114 | -0.448892 |
| Class_Business | -0.017294 | -0.023308 | -0.010434 | 0.010434 | 0.085051 | -0.085051 | 0.552058 | -0.552058 | 1.0 | -0.86526 | ... | 0.199539 | 0.222165 | 0.212315 | 0.171115 | 0.16209 | 0.166704 | 0.138775 | 0.139147 | 0.467183 | 0.503714 |
| Class_Eco | 0.012359 | 0.016732 | 0.004947 | -0.004947 | -0.118231 | 0.118231 | -0.500779 | 0.500779 | -0.86526 | 1.0 | ... | -0.176846 | -0.184866 | -0.181027 | -0.138871 | -0.129842 | -0.136127 | -0.122447 | -0.132712 | -0.404351 | -0.450987 |


In [33]:
corr_matrix["satisfaction"].sort_values(ascending=False).head(10)

satisfaction                      1.000000
Online boarding                   0.559446
Class_Business                    0.503714
Type of Travel_Business travel    0.448892
Inflight entertainment            0.398958
Inflight wifi service             0.375680
Seat comfort                      0.348482
On-board service                  0.324691
Leg room service                  0.317805
Cleanliness                       0.303716
Name: satisfaction, dtype: float64

The ranking shows that `Online boarding` has the strongest correlation with `satisfaction` (0.56): the higher the online boarding rating, the higher the satisfaction tends to be. This suggests that airlines should prioritize improving their online boarding service. It may also be a leaky predictor, because the `Online boarding` rating is not available until after the flight.

Class of travel is also strongly correlated with satisfaction: `Class_Business` has a correlation of 0.50.
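A descending sort surfaces only the positive correlations; strongly negative ones, such as `Class_Eco` (-0.45) in the matrix above, end up at the bottom of the list. The sketch below ranks features by the magnitude of their correlation instead. It uses synthetic data with invented column names (`pos_feature`, `neg_feature`, `noise`), so it illustrates the technique rather than reproducing the notebook's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
target = rng.integers(0, 2, size=n)

# Synthetic features with invented names: one rises with the target,
# one falls with it, and one is unrelated noise.
toy = pd.DataFrame({
    "pos_feature": target + rng.normal(0, 0.5, size=n),
    "neg_feature": -target + rng.normal(0, 0.5, size=n),
    "noise": rng.normal(0, 1, size=n),
    "target": target,
})

corr_with_target = toy.corr()["target"].drop("target")

# Rank by magnitude so strong negative relationships are not overlooked.
ranked = corr_with_target.reindex(
    corr_with_target.abs().sort_values(ascending=False).index
)
print(ranked)
```

The signed values are kept in `ranked`, so the direction of each relationship is still visible after sorting.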

In [32]:
##Relationship between business class customers and satisfaction 

num_busi = data[data["Class_Business"]== 1].shape[0]
num_busi_satisfied_customers = data[(data["satisfaction"] == 'satisfied') & (data["Class_Business"] == 1)].shape[0]
num_busi_dissatisfied_customers = data[(data["satisfaction"] == 'neutral or dissatisfied') & (data["Class_Business"] == 1)].shape[0]


print('Fraction of business class customers who were satisfied: ',round(num_busi_satisfied_customers/num_busi,2))

print('Fraction of business class customers who were unsatisfied: ' ,round(num_busi_dissatisfied_customers/num_busi,2))


Fraction of business class customers who were satisfied:  0.69
Fraction of business class customers who were unsatisfied:  0.31


People who travelled in business class are more likely to be satisfied than not (0.69 versus 0.31). However, travel class is already known before the flight, so `Class_Business` cannot be a leaky predictor.

Predictors such as `Online boarding`, `Inflight wifi service`, and `Baggage handling` are potentially leaky because these ratings become available only after the flight is completed.

In [20]:
# Drop leaky predictors from dataset
potential_leaks = ['Online boarding', 'Inflight wifi service', 'Baggage handling']
X2 = X.drop(potential_leaks, axis=1)

# Evaluate the model with leaky predictors removed
cv_scores = cross_val_score(my_pipeline, X2, y_num, 
                            cv=5,
                            scoring='accuracy')

print("Cross-val accuracy: %f" % cv_scores.mean())

Cross-val accuracy: 0.927527
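The same with-and-without comparison can be reproduced on synthetic data where the leak is known by construction. This is a sketch under stated assumptions: it swaps in scikit-learn's `LogisticRegression` for the XGBoost pipeline above to stay small, and the data, the column names `honest_feature` and `leaky_feature`, and the helper `cv_accuracy` are all invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 1000
y_toy = rng.integers(0, 2, size=n)

X_toy = pd.DataFrame({
    # Weak but legitimate signal, known before the outcome.
    "honest_feature": y_toy + rng.normal(0, 2.0, size=n),
    # Near-copy of the target, recorded after the outcome (the leak).
    "leaky_feature": y_toy + rng.normal(0, 0.1, size=n),
})


def cv_accuracy(features, target):
    """Mean 5-fold cross-validated accuracy of a simple classifier."""
    model = LogisticRegression()
    return cross_val_score(model, features, target, cv=5, scoring="accuracy").mean()


acc_with_leak = cv_accuracy(X_toy, y_toy)
acc_without_leak = cv_accuracy(X_toy.drop(columns=["leaky_feature"]), y_toy)

print(f"With leaky feature:    {acc_with_leak:.3f}")
print(f"Without leaky feature: {acc_without_leak:.3f}")
```

In this toy setup the gap is large because the leaky column is almost a copy of the target. A smaller drop, as in the cell above, does not by itself show whether the removed columns were leaky or merely informative; the timing argument still has to decide that.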


**Conclusion**

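Data leakage can be a multi-million-dollar mistake in many data science applications. Here, dropping three suspected leaky ratings lowered cross-validated accuracy from 0.953 to 0.928. A combination of caution, common sense, and data exploration can help identify target leakage.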