# Retrain Periodically
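Machine learning models can degrade over time as the data they see in production changes. Data drift is a shift in the distribution of the input features, while concept drift is a change in the relationship between the inputs and the target. Regular retraining helps ensure that models remain accurate and relevant as the underlying data evolves.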

Triggers for Retraining:
- New Data Availability: 
    - Retrain when sufficient new data accumulates (as a rule of thumb, 10-20% of the original training volume).
    - Example: A fraud detection model can be retrained once enough newly labeled transactions have accumulated.
- Performance Degradation: 
    - Monitor key performance metrics (e.g., accuracy, precision, recall, F1-score).
    - Retrain when metrics fall below predefined thresholds.
    - Example: If the accuracy of a recommendation system drops below 90%, trigger retraining.
- Detected Drift: 
    - Use statistical tests (e.g., Kolmogorov-Smirnov for continuous features, Chi-Square for categorical features) to detect changes in feature distributions or prediction patterns.
- Upstream Data Changes: 
    - Modifications in data sources, schemas, or preprocessing steps can affect model inputs, necessitating retraining.
- Changing Business Requirements: 
    - Shifts in organizational goals or key performance indicators (KPIs) might require model updates to align with new priorities.
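To make the drift trigger above concrete, here is a minimal sketch of a two-sample Kolmogorov-Smirnov check for a single continuous feature, written with the standard library only. The function names `ks_statistic` and `drift_detected` and the significance level `alpha=0.05` are illustrative assumptions, and the decision rule uses the large-sample approximation for the critical value. In practice a library routine such as `scipy.stats.ks_2samp`, which also returns a p-value, is likely the better choice.

```python
import math
import random
from bisect import bisect_right


def ks_statistic(reference, current):
    """Largest gap between the empirical CDFs of the two samples."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    d = 0.0
    # The maximum gap is always attained at one of the observed values
    for x in ref + cur:
        cdf_ref = bisect_right(ref, x) / n
        cdf_cur = bisect_right(cur, x) / m
        d = max(d, abs(cdf_ref - cdf_cur))
    return d


def drift_detected(reference, current, alpha=0.05):
    """Flag drift when the KS statistic exceeds the asymptotic critical value."""
    d = ks_statistic(reference, current)
    n, m = len(reference), len(current)
    # Large-sample critical value: c(alpha) * sqrt((n + m) / (n * m))
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    return d > c_alpha * math.sqrt((n + m) / (n * m))


# Hypothetical data: a training-time feature sample and a shifted production sample
rng = random.Random(0)
training_sample = [rng.gauss(0.0, 1.0) for _ in range(500)]
production_sample = [rng.gauss(0.5, 1.0) for _ in range(500)]
print("Drift detected:", drift_detected(training_sample, production_sample))
```

A check like this runs per feature, so with many features some correction for multiple comparisons is usually worth considering before a single flag triggers retraining.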

Implementation Strategies:
- Scheduled Retraining: 
    - Set up a regular retraining cycle (e.g., daily, weekly, monthly) depending on how rapidly your data changes.
- Triggered Retraining: 
    - Implement automated pipelines that continuously monitor model performance and data distributions.
    - Retrain when deviations exceed preset thresholds.
    - Example: Use a monitoring system to detect performance degradation and trigger retraining automatically.
- Pipeline Integration: 
    - Integrate retraining into your CI/CD pipelines using tools like Kubeflow, Apache Airflow, or custom automation scripts.
    - Automate the entire process—from data extraction and preprocessing to model evaluation and deployment.
    - Example: Use Kubeflow to automate the retraining and deployment of a sales forecasting model.
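The strategies above can be combined into one decision step that a scheduled job or monitoring hook calls. The sketch below is one possible shape for that step, not a prescribed design: `RetrainPolicy` and `retrain_reasons` are hypothetical names, and the default thresholds simply reuse the figures mentioned earlier in this section (10% new data, 90% accuracy).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RetrainPolicy:
    # Assumed defaults, taken from the rule-of-thumb values in the text
    min_new_data_fraction: float = 0.10
    min_accuracy: float = 0.90


def retrain_reasons(policy: RetrainPolicy, n_train_rows: int, n_new_rows: int,
                    current_accuracy: float, drift_flag: bool) -> List[str]:
    """Return the list of triggers that fired; an empty list means no retraining."""
    reasons = []
    if n_train_rows > 0 and n_new_rows / n_train_rows >= policy.min_new_data_fraction:
        reasons.append("new_data")
    if current_accuracy < policy.min_accuracy:
        reasons.append("performance")
    if drift_flag:
        reasons.append("drift")
    return reasons


# Example call with made-up monitoring values
reasons = retrain_reasons(RetrainPolicy(), n_train_rows=10_000, n_new_rows=1_500,
                          current_accuracy=0.93, drift_flag=False)
if reasons:
    print("Retraining triggered by:", ", ".join(reasons))
else:
    print("No retraining needed.")
```

Returning the reasons rather than a bare boolean makes it easier to log why each retraining run started.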


#### Best Practices for Retraining

1) Version Control:
    - Use tools like DVC or MLflow to track model versions, datasets, and hyperparameters.
    - Example: Track changes in training data and model performance over time.
2) Automated Testing:
    - Include automated tests for data quality, model performance, and deployment readiness.
    - Example: Use unit tests to ensure the retrained model meets performance thresholds.
3) Rollback Strategies:
    - Implement rollback mechanisms to revert to a previous model version if the retrained model underperforms.
    - Example: Use Kubernetes to roll back to a previous Docker image if the new model fails.
4) Monitoring:
    - Continuously monitor model performance and data drift in production.
    - Example: Use Prometheus and Grafana to track accuracy and latency.
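The testing and rollback practices above can be illustrated with a small promotion gate: a retrained model replaces the active one only if it scores at least as well on the same held-out data, and the previous version stays available for rollback. This is an in-memory sketch under stated assumptions. `ModelVersion`, `ModelRegistry`, `promote_if_better`, and `rollback` are hypothetical names, not the API of MLflow, DVC, or Kubernetes, and a production setup would persist versions in a tool of that kind.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class ModelVersion:
    version: int
    model: Any
    score: float  # metric measured on a fixed held-out set, higher is better


@dataclass
class ModelRegistry:
    history: List[ModelVersion] = field(default_factory=list)  # promoted versions, oldest first
    next_version: int = 1

    @property
    def active(self) -> Optional[ModelVersion]:
        """The most recently promoted version, or None if nothing is promoted."""
        return self.history[-1] if self.history else None

    def promote_if_better(self, model: Any, score: float, min_gain: float = 0.0) -> bool:
        """Promote the candidate only if it beats the active score by at least min_gain."""
        current = self.active
        if current is not None and score < current.score + min_gain:
            return False
        self.history.append(ModelVersion(self.next_version, model, score))
        self.next_version += 1
        return True

    def rollback(self) -> Optional[ModelVersion]:
        """Drop the active version and fall back to the previous one, if any."""
        if len(self.history) > 1:
            self.history.pop()
        return self.active


# Strings stand in for real model objects in this example
registry = ModelRegistry()
registry.promote_if_better("model-a", score=0.90)
registry.promote_if_better("model-b", score=0.88)  # rejected: worse than the active model
print("Active model:", registry.active.model)
```

The comparison is only meaningful when both scores come from the same evaluation set, so the held-out data should stay fixed between the old and the retrained model.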

In [None]:
# Scheduled Retraining with Apache Airflow

from datetime import datetime, timedelta

import joblib
import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x import path
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Define the retraining function
def retrain_model():
    # Load new data
    new_data = pd.read_csv('new_data.csv')
    X_new, y_new = new_data.drop('target', axis=1), new_data['target']

    # Hold out part of the new data so the model is not scored on its training rows
    X_train, X_test, y_train, y_test = train_test_split(
        X_new, y_new, test_size=0.2, random_state=42
    )

    # Load the existing model
    model = joblib.load('model.pkl')

    # Retrain the model (for most scikit-learn estimators, fit() refits from scratch)
    model.fit(X_train, y_train)

    # Evaluate the model on the held-out split
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Retrained Model Accuracy (held-out): {accuracy:.3f}")

    # Save the retrained model
    joblib.dump(model, 'model.pkl')

# Define the DAG
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2023, 1, 1),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'retrain_model_dag',
    default_args=default_args,
    description='A DAG to retrain the model periodically',
    schedule_interval=timedelta(days=7),  # Retrain weekly (newer Airflow versions use `schedule`)
    catchup=False,  # Do not backfill a run for every week since start_date
)

# Define the task
retrain_task = PythonOperator(
    task_id='retrain_model',
    python_callable=retrain_model,
    dag=dag,
)

In [None]:
# Triggered Retraining with Monitoring
import joblib
import pandas as pd
from sklearn.metrics import accuracy_score

# Load the existing model
model = joblib.load('model.pkl')

# Load new data
new_data = pd.read_csv('new_data.csv')
X_new, y_new = new_data.drop('target', axis=1), new_data['target']

# Evaluate the model on new data
y_pred = model.predict(X_new)
accuracy = accuracy_score(y_new, y_pred)

# Check if accuracy falls below a threshold
if accuracy < 0.9:  # Threshold = 90%
    print("Model performance degraded. Retraining...")
    model.fit(X_new, y_new)
    joblib.dump(model, 'model.pkl')
    print("Model retrained and saved.")
else:
    print("Model performance is acceptable.")

In [None]:
# Pipeline Integration with Kubeflow (KFP SDK v1 API)
import kfp
import kfp.compiler
from kfp import dsl
from kfp.components import func_to_container_op

# Define the retraining function
def retrain_model():
    # Imports go inside the function: func_to_container_op packages only the
    # function's source code, so module-level imports are not available at run time
    import joblib
    import pandas as pd

    # Load new data (the path must be reachable from inside the container)
    new_data = pd.read_csv('new_data.csv')
    X_new, y_new = new_data.drop('target', axis=1), new_data['target']

    # Load the existing model
    model = joblib.load('model.pkl')

    # Retrain the model
    model.fit(X_new, y_new)

    # Save the retrained model
    joblib.dump(model, 'retrained_model.pkl')

# Convert the function to a Kubeflow component and install its dependencies
retrain_op = func_to_container_op(
    retrain_model,
    packages_to_install=['pandas', 'scikit-learn', 'joblib'],
)

# Define the pipeline
@dsl.pipeline(
    name='Retrain Pipeline',
    description='A pipeline to retrain the model periodically.'
)
def retrain_pipeline():
    retrain_task = retrain_op()

# Compile the pipeline to a YAML file that can be uploaded to Kubeflow Pipelines
kfp.compiler.Compiler().compile(retrain_pipeline, 'retrain_pipeline.yaml')