# Drift Monitoring
### Importance of Data Drift Monitoring:
1. Model Performance: Changes in the input data distribution can lead to a decline in model performance, as the model might not generalize well to the new data.
2. Decision Making: In many applications, especially those involving critical decisions, it's essential to ensure that the model's predictions remain reliable and aligned with the current state of the system.
3. Regulatory Compliance: In some industries, regulations may require continuous monitoring and validation of machine learning models to ensure they meet certain standards over time.


### Implementation of Data Drift Monitoring:
1. Define Baseline: Establish a baseline by recording the statistical properties of the training dataset, such as the mean, standard deviation, quantiles, and (for categorical features) category frequencies.
2. Continuous Monitoring: Regularly collect samples of the incoming data and compare their statistical properties to the baseline. This can be done using various metrics, such as the Kolmogorov-Smirnov statistic or the Jensen-Shannon divergence.
3. Thresholds and Alerts: Set thresholds for acceptable drift levels. If the monitored data drift exceeds these thresholds, trigger alerts to notify data scientists or system operators.
4. Update Models: When significant drift is detected, consider retraining the machine learning model with the most recent data to adapt to the new distribution.
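
The four steps above can be sketched for a single numeric feature as follows. This is a minimal illustration, not a definitive implementation: the function names `build_baseline` and `check_drift`, the 10-bin histogram, the `js_threshold` of 0.1, and the synthetic income-like data are all assumptions made for the example. Note that SciPy's `jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon


def build_baseline(train_values, n_bins=10):
    """Step 1: record summary statistics and a histogram of the training data."""
    train_values = np.asarray(train_values, dtype=float)
    counts, bin_edges = np.histogram(train_values, bins=n_bins)
    return {
        "mean": train_values.mean(),
        "std": train_values.std(ddof=1),
        "bin_edges": bin_edges,
        "hist": counts / counts.sum(),
    }


def check_drift(baseline, new_values, js_threshold=0.1):
    """Steps 2-3: compare an incoming batch to the baseline and flag drift."""
    new_values = np.asarray(new_values, dtype=float)
    edges = baseline["bin_edges"]
    # Clip so values outside the training range fall into the outermost bins.
    clipped = np.clip(new_values, edges[0], edges[-1])
    counts, _ = np.histogram(clipped, bins=edges)
    new_hist = counts / counts.sum()
    # Jensen-Shannon distance with base 2 lies between 0 and 1.
    js_distance = float(jensenshannon(baseline["hist"], new_hist, base=2))
    # Shift of the batch mean, expressed in baseline standard deviations.
    mean_shift = float(abs(new_values.mean() - baseline["mean"]) / baseline["std"])
    return {
        "js_distance": js_distance,
        "mean_shift_in_std": mean_shift,
        "drift": js_distance > js_threshold,
    }


# Synthetic data (hypothetical): one batch like the training data, one shifted.
rng = np.random.default_rng(0)
baseline = build_baseline(rng.normal(50_000, 10_000, size=5_000))
same = check_drift(baseline, rng.normal(50_000, 10_000, size=2_000))
shifted = check_drift(baseline, rng.normal(65_000, 10_000, size=2_000))

for name, report in [("same distribution", same), ("shifted mean", shifted)]:
    status = "ALERT" if report["drift"] else "OK"
    print(f"{status} ({name}): JS distance {report['js_distance']:.3f}, "
          f"mean shift {report['mean_shift_in_std']:.2f} std")
```

In practice the threshold would need tuning per feature, since the distance between two samples of the same distribution grows as batches get smaller or bins get finer. An alert here would be the trigger for step 4, reviewing and possibly retraining the model.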


### Common Problems and Solutions:
1. Feature Drift: The distributions of input features change over time. Solution: Regularly revisit the feature set and the feature engineering steps so they keep pace with the changing features.
2. Concept Drift: The relationship between features and labels changes. Solution: Periodically retrain the model with fresh labeled data.
3. Data Quality Issues: Noisy or incomplete data can cause false positives in drift detection. Solution: Validate and preprocess incoming data (for example, handle missing values and outliers) before running drift checks.
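
Concept drift cannot be seen from the feature distributions alone, so one common approach is to track model performance once true labels arrive. The sketch below is one possible way to do that, assuming labels become available for past predictions; the function name `windowed_accuracy_alerts`, the window size, and the tolerance are hypothetical choices for illustration.

```python
import numpy as np


def windowed_accuracy_alerts(y_true, y_pred, window, baseline_accuracy, tolerance=0.05):
    """Return (window_start, accuracy) for every non-overlapping window whose
    accuracy falls more than `tolerance` below `baseline_accuracy`."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    alerts = []
    for start in range(0, len(y_true) - window + 1, window):
        stop = start + window
        accuracy = float(np.mean(y_true[start:stop] == y_pred[start:stop]))
        if accuracy < baseline_accuracy - tolerance:
            alerts.append((start, accuracy))
    return alerts


# Toy stream: the predictions match the labels for the first 8 records,
# then half of the predictions in the last window are wrong.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0]

alerts = windowed_accuracy_alerts(y_true, y_pred, window=4,
                                  baseline_accuracy=1.0, tolerance=0.1)
print(alerts)  # [(8, 0.5)]
```

A sustained run of alerts, rather than a single window, is usually a better signal that retraining is warranted.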

In [7]:
import numpy as np
import pandas as pd

df_marketing_campaign = pd.read_csv('Dataset/marketing_campaign.csv', sep='\t')
df_marketing_campaign.head()

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
0,5524,1957,Graduation,Single,58138.0,0,0,04-09-2012,58,635,...,7,0,0,0,0,0,0,3,11,1
1,2174,1954,Graduation,Single,46344.0,1,1,08-03-2014,38,11,...,5,0,0,0,0,0,0,3,11,0
2,4141,1965,Graduation,Together,71613.0,0,0,21-08-2013,26,426,...,4,0,0,0,0,0,0,3,11,0
3,6182,1984,Graduation,Together,26646.0,1,0,10-02-2014,26,11,...,6,0,0,0,0,0,0,3,11,0
4,5324,1981,PhD,Married,58293.0,1,0,19-01-2014,94,173,...,5,0,0,0,0,0,0,3,11,0


In [8]:
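# Draw two independent random 50% samples of the same dataset to compare.
# Both come from the same distribution, so little or no drift is expected.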
data_1 = df_marketing_campaign.sample(frac=0.5)

data_2 = df_marketing_campaign.sample(frac=0.5)

display(data_1.head())

print('\n\n')

display(data_2.head())

Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
1054,5731,1983,Master,Married,27100.0,1,0,05-04-2013,64,12,...,7,0,0,0,0,0,0,3,11,0
1682,2156,1955,PhD,Married,22554.0,1,1,03-11-2012,38,27,...,5,0,0,0,0,0,0,3,11,0
1859,521,1985,Graduation,Together,54006.0,1,0,18-09-2012,42,174,...,7,0,0,0,0,0,0,3,11,0
299,3924,1965,PhD,Divorced,57912.0,0,1,17-03-2014,34,801,...,5,0,1,0,0,0,0,3,11,0
72,6312,1959,Graduation,Married,65031.0,0,1,17-03-2013,29,258,...,7,0,0,0,0,0,0,3,11,0







Unnamed: 0,ID,Year_Birth,Education,Marital_Status,Income,Kidhome,Teenhome,Dt_Customer,Recency,MntWines,...,NumWebVisitsMonth,AcceptedCmp3,AcceptedCmp4,AcceptedCmp5,AcceptedCmp1,AcceptedCmp2,Complain,Z_CostContact,Z_Revenue,Response
1176,5907,1952,Master,Married,33444.0,1,1,03-11-2012,24,8,...,8,0,0,0,0,0,0,3,11,0
2186,2666,1972,Master,Married,76234.0,0,1,06-02-2014,21,519,...,3,0,1,0,0,0,0,3,11,0
1036,10613,1958,PhD,Together,37334.0,1,1,26-02-2014,44,26,...,4,0,0,0,0,0,0,3,11,0
789,347,1976,Graduation,Divorced,40780.0,0,1,08-09-2012,30,229,...,9,0,0,0,0,0,0,3,11,0
516,11025,1961,Graduation,Married,36443.0,1,1,03-02-2013,9,65,...,8,0,0,0,0,0,0,3,11,0


##### The two-sample Kolmogorov-Smirnov (KS) test is a non-parametric test that compares the empirical distributions of two samples to assess whether they differ significantly.
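
As a standalone illustration of the test, the sketch below wraps SciPy's `ks_2samp` and returns a `(drift, statistic, p-value)` tuple. The helper name `ks_drift`, the significance level `alpha=0.05`, and the synthetic normal data are assumptions made for this example; the toolkit's own `evaluate_data_drift` is not shown in this notebook and may work differently.

```python
import numpy as np
from scipy.stats import ks_2samp


def ks_drift(reference, current, alpha=0.05):
    """Run a two-sample KS test and flag drift when the p-value is below alpha.

    Returns a tuple (drift_detected, statistic, p_value).
    """
    result = ks_2samp(reference, current)
    return bool(result.pvalue < alpha), float(result.statistic), float(result.pvalue)


# Synthetic data (hypothetical): one sample from the same distribution as the
# reference, and one whose mean is shifted by half a standard deviation.
rng = np.random.default_rng(42)
reference = rng.normal(loc=0.0, scale=1.0, size=1_000)
same_dist = rng.normal(loc=0.0, scale=1.0, size=1_000)
shifted = rng.normal(loc=0.5, scale=1.0, size=1_000)

print("same distribution:", ks_drift(reference, same_dist))
print("shifted mean:     ", ks_drift(reference, shifted))
```

The KS statistic is the largest gap between the two empirical cumulative distribution functions, so it is most natural for numeric features. With a significance level of 0.05, roughly one in twenty comparisons of truly identical distributions will still be flagged, which matters when many features are tested at once.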

In [11]:
from scipy.stats import ks_2samp
import sys
import os

# Make the local Toolkit directory importable
# (avoid naming the variable `dir`, which shadows the built-in)
toolkit_dir = os.path.dirname('../Toolkit/DataExplorationToolkit.py')
sys.path.append(toolkit_dir)

import DataExplorationToolkit as dtl
model_evaluation = dtl.ModelEvaluation()

# Check for drift in each feature.
# drift[0] is the drift flag, drift[1] the test statistic, drift[2] the p-value.
for feature in data_1.columns:
    drift = model_evaluation.evaluate_data_drift(data_1[feature], data_2[feature])
    print(f'{feature} Drift: {drift[0]}, Statistic: {drift[1]}, p-value: {drift[2]}')

ID Drift: False, Statistic: 0.025892857142857145, p-value: 0.8472105468520287
Year_Birth Drift: False, Statistic: 0.027678571428571427, p-value: 0.7844779479591066
Education Drift: False, Statistic: 0.01607142857142857, p-value: 0.998711583767401
Marital_Status Drift: False, Statistic: 0.01607142857142857, p-value: 0.998711583767401
Income Drift: False, Statistic: 0.01607142857142857, p-value: 0.998711583767401
Kidhome Drift: False, Statistic: 0.004464285714285714, p-value: 1.0
Teenhome Drift: False, Statistic: 0.0017857142857142857, p-value: 0.9999999999999994
Dt_Customer Drift: False, Statistic: 0.03214285714285714, p-value: 0.6095096717443
Recency Drift: False, Statistic: 0.022321428571428572, p-value: 0.9431423916136951
MntWines Drift: False, Statistic: 0.01607142857142857, p-value: 0.998711583767401
MntFruits Drift: False, Statistic: 0.033928571428571426, p-value: 0.5396366484453271
MntMeatProducts Drift: False, Statistic: 0.023214285714285715, p-value: 0.9236495948091168
MntFishP

In [2]:
# One-way ANOVA: test whether mean income differs across education levels
import pandas as pd
from scipy.stats import f_oneway

# Sample data: 90 rows, 30 per education level
data = {"Education Level": ["High School", "Bachelor's", "Master's"] * 30,
        "Income": [30000, 40000, 45000, 25000, 35000, 40000, 50000, 60000, 70000] * 10}

df = pd.DataFrame(data)

# Perform ANOVA on the income values of each education group
result = f_oneway(df.loc[df["Education Level"] == "High School", "Income"],
                  df.loc[df["Education Level"] == "Bachelor's", "Income"],
                  df.loc[df["Education Level"] == "Master's", "Income"])

# Display the result
print("ANOVA Result:")
print("F-statistic:", result.statistic)
print("P-value:", result.pvalue)

# Interpret the result
if result.pvalue < 0.05:
    print("Reject the null hypothesis. There are significant differences in income across education levels.")
else:
    print("Fail to reject the null hypothesis. No significant differences in income across education levels.")


ANOVA Result:
F-statistic: 15.0958904109589
P-value: 2.3553817216359917e-06
Reject the null hypothesis. There are significant differences in income across education levels.
