# Data-Centric Approaches

#### To detect feature drift in categorical features, a popular metric is the population stability index (PSI) (https://www.lexjansen.com/wuss/2017/47_Final_Paper_PDF.pdf). PSI is a statistical measure of how much the distribution of a variable has shifted over a period of time. If the overall drift score exceeds 0.2 (20%), the drift is considered significant, which establishes the presence of feature drift.
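
#### As a minimal sketch of the PSI calculation described above, the snippet below compares the category shares of a baseline sample against a current sample. The function name `population_stability_index`, the `eps` floor for unseen categories, and the toy data are illustrative assumptions, not taken from the referenced paper or repository.

```python
import math
from collections import Counter


def population_stability_index(expected, actual, eps=1e-4):
    """PSI between a baseline (expected) and a current (actual) categorical sample."""
    categories = set(expected) | set(actual)
    exp_counts, act_counts = Counter(expected), Counter(actual)
    psi = 0.0
    for cat in categories:
        # Share of each category; eps (an assumed floor) avoids log(0)
        # when a category is missing from one of the samples.
        e = max(exp_counts[cat] / len(expected), eps)
        a = max(act_counts[cat] / len(actual), eps)
        psi += (a - e) * math.log(a / e)
    return psi


# Toy data (hypothetical): category "C" becomes much more frequent.
baseline = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
current = ["A"] * 20 + ["B"] * 30 + ["C"] * 50

score = population_stability_index(baseline, current)
print(f"PSI = {score:.3f}, significant drift = {score > 0.2}")
```

#### With this toy data the score is well above the 0.2 threshold, so the feature would be flagged as drifted. Comparing a sample against itself gives a PSI of 0.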

#### To detect feature drift in numeric features, the Wasserstein metric (https://kowshikchilamkurthy.medium.com/wasserstein-distance-contraction-mapping-and-modern-rl-theory-93ef740ae867) is a popular choice. It is a distance function that measures how far apart two probability distributions are. As with PSI, if the drift score computed with the Wasserstein metric exceeds 0.2 (20%), the drift is considered significant and the numeric feature is considered to have drifted.
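
#### The following sketch computes the one-dimensional Wasserstein distance between two samples by integrating the absolute difference of their empirical CDFs. The raw distance is expressed in the units of the feature, so to compare it against a 20% threshold this example divides it by the baseline standard deviation. That normalization, the function name `wasserstein_1d`, and the toy data are assumptions made for illustration. In practice, `scipy.stats.wasserstein_distance` offers the same distance calculation.

```python
from bisect import bisect_right
from statistics import pstdev


def wasserstein_1d(u, v):
    """First Wasserstein distance between two 1-D samples (empirical distributions)."""
    u, v = sorted(u), sorted(v)
    points = sorted(set(u) | set(v))
    distance = 0.0
    for left, right in zip(points, points[1:]):
        # Empirical CDF values on the interval [left, right).
        cdf_u = bisect_right(u, left) / len(u)
        cdf_v = bisect_right(v, left) / len(v)
        distance += abs(cdf_u - cdf_v) * (right - left)
    return distance


# Toy data (hypothetical): the current sample is the baseline shifted by 3.
baseline = [10.0, 12.0, 14.0, 16.0, 18.0]
current = [13.0, 15.0, 17.0, 19.0, 21.0]

raw = wasserstein_1d(baseline, current)
# Assumed normalization: express the distance relative to the baseline spread.
drift_score = raw / pstdev(baseline)
print(f"distance = {raw:.2f}, normalized = {drift_score:.2f}, drift = {drift_score > 0.2}")
```

#### Other normalizations, such as dividing by the baseline range or mean, are equally plausible. The right one depends on how the 20% threshold is meant to be interpreted for a given feature.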

#### Similarly, concept drift can be detected using these metrics. For regression problems, the Wasserstein metric is effective, while for classification problems PSI is more effective. You can see these methods applied to a practical dataset at https://github.com/PacktPublishing/Applied-Machine-Learning-Explainability-Techniques/tree/main/Chapter03. Other statistical methods are also useful for detecting data drift, such as the Kullback-Leibler divergence (KL divergence), the Bhattacharyya distance, and the Jensen-Shannon divergence (JS divergence).
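
#### To make the additional measures mentioned above concrete, here is a minimal sketch of KL divergence, JS divergence, and the Bhattacharyya distance for two discrete distributions given as lists of probabilities over the same bins. The function names and the example distributions are illustrative assumptions. The KL function assumes `q` is strictly positive wherever `p` is.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2)."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def bhattacharyya_distance(p, q):
    """Negative log of the Bhattacharyya coefficient (overlap of p and q)."""
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(coefficient)


# Toy distributions (hypothetical) over three bins.
p = [0.5, 0.3, 0.2]
q = [0.2, 0.3, 0.5]

print(f"KL(p||q)      = {kl_divergence(p, q):.4f}")
print(f"JS(p, q)      = {js_divergence(p, q):.4f}")
print(f"Bhattacharyya = {bhattacharyya_distance(p, q):.4f}")
```

#### KL divergence is not symmetric, so `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ. JS divergence and the Bhattacharyya distance are symmetric, which can make them easier to threshold.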

#### Now that we know some effective ways to detect drift, what do we do once we have identified it? If the ML system is already in production, the first step is to alert our stakeholders. Incorrect predictions caused by data drift can affect many end users and might ultimately cost their trust. The next step is to check whether the drift is temporary, seasonal, or permanent. Analyzing the nature of the drift can be challenging, but if the changes causing the drift can be identified and reverted, that is the best solution.

#### If the drift is temporary, the first step is to identify the change that caused it and then revert that change. For seasonal drift, seasonal changes in the data should be accounted for during the training process, or handled in an additional preprocessing step that normalizes any seasonal effects, so that the model is aware of the seasonal pattern in the data. However, if the drift is permanent, the only option is to retrain the model on the new data and deploy the newly trained model to the production system.
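
#### One simple way to realize the seasonal preprocessing step mentioned above is to learn a per-season mean from the training data and subtract it from incoming values. The sketch below assumes the season is identified by the calendar month and that records arrive as `(month, value)` pairs. The function names and data are hypothetical, and real seasonal adjustment is often more involved.

```python
from collections import defaultdict
from statistics import mean


def fit_seasonal_means(records):
    """Learn the mean value per month from (month, value) training records."""
    by_month = defaultdict(list)
    for month, value in records:
        by_month[month].append(value)
    return {month: mean(values) for month, values in by_month.items()}


def deseasonalize(records, seasonal_means):
    """Subtract the learned monthly mean from each value.

    Raises KeyError for a month that was not present in the training data.
    """
    return [value - seasonal_means[month] for month, value in records]


# Toy data (hypothetical): values are higher in July (7) than in January (1).
training = [(1, 10.0), (1, 12.0), (7, 30.0), (7, 34.0)]
seasonal_means = fit_seasonal_means(training)

incoming = [(1, 11.0), (7, 33.0)]
adjusted = deseasonalize(incoming, seasonal_means)
print(adjusted)  # [0.0, 1.0]
```

#### After this adjustment, the drift metrics compare values on a common seasonal baseline, so an expected summer peak is less likely to be flagged as drift.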

