## **Working of the `with` Keyword**

[GPT_Explanation](https://chatgpt.com/share/681f4657-6520-8006-b695-a215a1783899)

[Real_Python_Implementation](https://www.youtube.com/watch?v=iba-I4CrmyA)
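
The MLflow snippets later in these notes wrap each run in `with mlflow.start_run():`. As a rough sketch of what `with` does in general, the hypothetical `DemoRun` class below (not part of MLflow) implements the context-manager protocol: `__enter__` runs when the block starts and `__exit__` runs when it ends, even if an exception is raised.

```python
class DemoRun:
    """Hypothetical stand-in for a tracked run; it only illustrates the protocol."""

    def __init__(self, name):
        self.name = name
        self.events = []

    def __enter__(self):
        # Called when the `with` block is entered
        self.events.append("start")
        return self  # This object is bound to the name after `as`

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block exits, whether normally or through an exception
        self.events.append("end")
        return False  # Do not swallow exceptions


with DemoRun("demo") as run:
    run.events.append("log")

print(run.events)  # ['start', 'log', 'end']
```

The cleanup step in `__exit__` is what makes `with` convenient for runs, files, and connections: the resource is closed without an explicit call at the end of the block.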


## **Before Starting Project**

`source mlflow_env/bin/activate`

`mlflow ui`


## **ML Flow**

MLflow is an open-source platform. We can use it to track our machine learning projects, for example their parameters and performance metrics.

## **Lifecycle of a Data Science Project**

**Data Pre**

**EDA**

**Feature Eng**

**Model Training**

**Model Validation**

**Deployment**

**Monitoring**

## **How ML Flow is used by Data Scientist**

- Experiment Tracking

- Hypothesis Testing in EDA

- Code Structuring (Pipeline)

- Model Packaging and Dependency Management

- Evaluating Hyperparameters: Track every combination of hyperparameters

- Compare the results of the models and deploy the best-performing model

## **How ML Flow is used by ML Engineer**

- Manage the lifecycle of trained models both pre and post deployment

- Deploy models securely to the production environment

- Manage Deployment Dependencies


## **ML Flow Starter**

### **ML Flow Tracking Server**

To track our experiments, we first need a tracking server.

To start the server locally, run `mlflow ui`.

Then we point the client at the server by setting the tracking URI, so that everything is tracked by MLflow: `mlflow.set_tracking_uri("http://127.0.0.1:5000")`

Then we log our parameters and performance metrics as shown below:

```py

import mlflow
import mlflow.sklearn
from mlflow.models import infer_signature

# `model`, `params`, `accuracy`, and `x_train` come from the training code
mlflow.set_experiment("Day2")

# Start the MLflow run
with mlflow.start_run():
  # Log the hyperparameters
  mlflow.log_params(params)

  # Log the accuracy metric
  mlflow.log_metric("Accuracy", accuracy)

  # Set a tag that we can use to remind ourselves what this run was for
  mlflow.set_tag("Training Info", "Basic LR Model for Iris Data")

  # Infer the model signature
  signature = infer_signature(x_train, model.predict(x_train))

  # Log the model
  model_info = mlflow.sklearn.log_model(
    sk_model=model,
    artifact_path="Iris Mode",
    signature=signature,
    input_example=x_train,
    registered_model_name="Tracking-quickstart"
  )

```

A new folder named `mlruns` is created which stores all the info about our experiments. We should not delete the `mlruns` folder.

## **Tracking a ML Project with MLFlow**

`project.ipynb`

Let's create a separate folder for our ML project.

Once the ML project is set up, we need to keep track of the performance metrics obtained with each set of hyperparameters. For this we use MLflow.


## **Inference of Model Artifacts**

### **UI**

`path` : Path of artifacts

**Validate Before Deployment**

As soon as we finish training, the model is saved as `model.pkl` in the run's `artifacts`, but before using the model in production we need to validate it.

The base code for this is already provided in the UI.

```Py

# Validate The Model

import mlflow
from mlflow.models import Model

model_uri = 'runs:/cd866b98bcfb4235bbe3b225ece9fce9/Iris Mode'
# The model is logged with an input example
pyfunc_model = mlflow.pyfunc.load_model(model_uri)

predictions = pyfunc_model.predict(x_test)

predictions

```

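`mlflow.pyfunc.load_model` loads the model as a generic Python function (`pyfunc`), so `predict` can be called in the same way regardless of the library that produced the model.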

## **Model Registry Tracking**

Model Registry is a centralized model store, set of APIs, and UI to collaboratively manage the full lifecycle of an MLflow Model. It provides model lineage (which MLflow experiments and runs produced the model), model versioning, model aliasing, model tagging, and annotations.

In the previous code, we registered the model directly, without validating whether it is the best model. This happened because we passed the `registered_model_name="Tracking-quickstart"` argument to `log_model`, which registers the model and maintains its versioning.

To avoid this, we should not pass that parameter. If we do not pass it, the UI shows a `Register Model` button for the run. Once the model has been registered, the UI shows `Model Registered` along with its version.

How do we choose the best model? We need to compare the experiments and then find the experiment with the highest accuracy and then register that experiment.
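
A minimal sketch of that selection step, using plain dictionaries with made-up run IDs and accuracy values in place of real MLflow runs (in practice the same comparison can be done by sorting on the metric column in the MLflow UI). The helper name `pick_best_run` is an assumption for illustration:

```python
# Hypothetical run records; the IDs, parameters, and accuracies are invented
runs = [
    {"run_id": "run_a", "params": {"C": 0.1}, "accuracy": 0.91},
    {"run_id": "run_b", "params": {"C": 1.0}, "accuracy": 0.96},
    {"run_id": "run_c", "params": {"C": 10.0}, "accuracy": 0.94},
]


def pick_best_run(run_records, metric="accuracy"):
    """Return the record with the highest value of `metric`."""
    return max(run_records, key=lambda record: record[metric])


best = pick_best_run(runs)
print(best["run_id"], best["accuracy"])
```

The run chosen this way is the one we would then register in the Model Registry.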

Okay, we've saved our best model but how are we going to predict from the saved best model?

```Py

# Inferencing the Model from the Model Registry (Prediction from the Best Model)

import mlflow.sklearn

model_name = 'Tracking-quickstart'
model_version = '6' # Version of the best model {latest, number_version, ..}

# Path for the model from the Model Registry
model_uri = f"models:/{model_name}/{model_version}"

model = mlflow.sklearn.load_model(model_uri)

model.predict(x_test)

```
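
The two URI forms used in these notes differ only in what they point at: `runs:/<run_id>/<artifact_path>` addresses a model logged by one run, while `models:/<name>/<version>` addresses a registered model version. The helper functions below are hypothetical conveniences (not MLflow APIs) that only build those strings:

```python
def run_model_uri(run_id, artifact_path):
    """URI of a model logged under `artifact_path` in a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def registry_model_uri(name, version):
    """URI of a registered model version in the Model Registry."""
    return f"models:/{name}/{version}"


print(run_model_uri("cd866b98bcfb4235bbe3b225ece9fce9", "Iris Mode"))
print(registry_model_uri("Tracking-quickstart", 6))
```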


## **House Price Prediction (MLFlow)**

Refer to `ML_Project/Phase2(House).ipynb` file


## **ANN with MLFlow**

Refer to `ML_Flow(ANN_Project)`

### **Pipeline**

- Build an ANN Project

- Run a hyperparameter sweep on a training script.

- Compare the results of the runs in the MLFlow UI

- Choose the best run and register it as a model

- Deploy the Model to a REST API

- Build a container image suitable for deployment to a cloud platform

**Libraries**

`keras`

`tensorflow`

`hyperopt` : Hyperparameter Tuning for the `ANN`

**Documentation**

[Hyperopt](https://hyperopt.github.io/hyperopt/)

**Dataset**

```Py

import pandas as pd

# Load the white wine quality dataset

data = pd.read_csv(
  'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv',
  sep=';'
)

data

```

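In the above dataset, the target is `quality`, an integer score (the dataset description gives a range of 0 to 10). The model below has a single output unit and is trained with a mean squared error loss, so it predicts the score as a regression task.

Now, we need to build an ANN to predict it.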

```Py

model.compile(optimizer=keras.optimizers.SGD(
  learning_rate=params['lr'], momentum=params["momentum"]
))

```

We could also treat the size of the `Dense` layers as a hyperparameter, but tuning it would take a lot of time. Instead, we tune only the `learning_rate` and `momentum` hyperparameters.

Now, we will try to train our model for the different combination of values of `learning_rate` and `momentum` and track each and every experiment for each combination.

For the combination of different values of our `HyperParameters` we will use the `Hyperopt` library.

Once the model compilation code is written, we wrap model building, training, and MLflow logging in a single function:

```Py

# ANN Model

import keras
import mlflow
import mlflow.tensorflow
import numpy as np
from hyperopt import STATUS_OK


def train_model(params, epochs, train_x, train_y, valid_x, valid_y, test_x, test_y):

  # Normalization statistics computed from the training features
  mean = np.mean(train_x, axis=0)  # Mean of each column
  var = np.var(train_x, axis=0)  # Variance of each column

  model = keras.Sequential(
    [
      keras.Input([train_x.shape[1]]),
      keras.layers.Normalization(mean=mean, variance=var),
      keras.layers.Dense(64, activation='relu'),
      keras.layers.Dense(1)  # Single output: the predicted quality score
    ]
  )

  # Model Compile
  model.compile(
    optimizer=keras.optimizers.SGD(
      learning_rate=params['lr'], momentum=params["momentum"]
    ),
    loss="mean_squared_error",
    metrics=[keras.metrics.RootMeanSquaredError()]
  )

  # Train and track the hyperparameters with MLflow tracking

  with mlflow.start_run(nested=True):  # Each combination is a child run of the parent run
    model.fit(train_x, train_y, validation_data=(valid_x, valid_y),
              epochs=epochs,
              batch_size=64)

    # Evaluate the model
    eval_result = model.evaluate(valid_x, valid_y, batch_size=64)

    eval_rmse = eval_result[1]

    # Log the params (log_params takes a dict; log_param takes one key and value)
    mlflow.log_params(params)
    mlflow.log_metric("Eval Rms", eval_rmse)

    # Log the model. `signature` is assumed to be defined earlier at module level,
    # e.g. signature = infer_signature(train_x, train_y)
    mlflow.tensorflow.log_model(
      model,
      "model",
      signature=signature
    )

    return {
      'loss': eval_rmse,
      'status': STATUS_OK,
      'model': model
    }
```
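
The `Normalization` layer above standardizes each feature column using the mean and variance computed from the training data. A minimal pure-Python sketch of that arithmetic on a tiny made-up feature matrix (the values are illustrative only, and any numerical-stability safeguards of the real layer are omitted):

```python
import math

# Made-up training features: 3 rows, 2 columns
train_x = [
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
]


def column_stats(rows):
    """Per-column mean and population variance (like np.mean / np.var with axis=0)."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    variances = [
        sum((v - m) ** 2 for v in col) / n for col, m in zip(columns, means)
    ]
    return means, variances


def normalize(rows, means, variances):
    """Standardize each value: (x - mean) / sqrt(variance)."""
    return [
        [(v - m) / math.sqrt(var) for v, m, var in zip(row, means, variances)]
        for row in rows
    ]


means, variances = column_stats(train_x)
normalized = normalize(train_x, means, variances)
print(means)
print(normalized)
```

After this step both columns have mean 0 and unit variance, so the two features are on the same scale even though their raw ranges differ by a factor of ten.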

We will need to create an `objective` function for the `HyperOpt`.

```Py

# Objective Function for Hyperopt

def objective(params):

  # MlFlow will track the params and results for each run

  result = train_model(
    params,
    epochs=3,
    train_x=train_x,
    train_y=train_y,
    valid_x=valid_x,
    valid_y=valid_y,
    test_x=test_x,
    test_y=test_y
  )

  return result

```

**Parameters**

```Py

# Set all the parameters

import numpy as np
from hyperopt import hp

space = {
 'lr': hp.loguniform('lr',np.log(1e-5),np.log(1e-1)),
 'momentum': hp.uniform("momentum",0.0,1.0)
}

```
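
`hp.loguniform(label, low, high)` draws values whose logarithm is uniformly distributed between `low` and `high`, which is why the bounds are passed as `np.log(1e-5)` and `np.log(1e-1)`. A standard-library sketch of the same sampling rule follows. It is not Hyperopt's implementation, and unlike `hp.loguniform` the hypothetical helper `sample_loguniform` takes the raw bounds and applies the logarithm itself:

```python
import math
import random


def sample_loguniform(low, high, rng):
    """Draw exp(u) with u ~ Uniform(log(low), log(high))."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(0)  # Fixed seed so the sketch is reproducible
samples = [sample_loguniform(1e-5, 1e-1, rng) for _ in range(1000)]

# Every sample stays between the original bounds
print(min(samples), max(samples))

# About a quarter of the samples land in each of the four decades
in_first_decade = sum(1 for s in samples if s < 1e-4)
print(in_first_decade / len(samples))
```

This is why a log-uniform prior suits the learning rate: small values such as `1e-5` are sampled as often as values near `1e-2`, which a plain uniform distribution over the same range would almost never do.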

**Parent Run**

```Py

# Set Exp

import mlflow
import mlflow.tensorflow
from hyperopt import Trials, fmin, tpe


mlflow.set_experiment("/wine-quality")

# Create another run so that the nested run will work
with mlflow.start_run():

  # Conduct Hyperparameter search using Hyperopt
  trials = Trials()
  best = fmin(
    fn=objective,
    space=space,
    algo=tpe.suggest,
    max_evals=4,
    trials=trials
  )

  # Fetch the details of the best run
  best_run = sorted(trials.results, key=lambda x:x['loss'])[0]

  # Log the best parameters, loss and model
  for key, value in best.items():
    mlflow.log_param(key, value)

  mlflow.log_metric("Eval_RMSE", best_run['loss'])
  mlflow.tensorflow.log_model(
    best_run['model'],
    "model",
    signature=signature,
  )

  # Print out the best params and loss
  print(f"Best Param: {best}")
  print(f"Best Eval RMSE: {best_run['loss']}")

```

The `fmin` function calls the `objective` function 4 times because `max_evals=4`. Each call trains a new model using a new hyperparameter combination. The `results` attribute of the `Trials` object contains the results of all runs.
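
To make the contract between `fmin`, `objective`, and the list of results concrete, here is a stripped-down sketch that uses plain random search instead of Hyperopt's TPE algorithm and a made-up quadratic loss instead of model training. The names `random_search` and `toy_objective` are assumptions for illustration:

```python
import random


def toy_objective(params):
    """Stand-in for train_model: returns a dict with a 'loss' key, like the Hyperopt objective."""
    loss = (params["lr"] - 0.01) ** 2 + (params["momentum"] - 0.9) ** 2
    return {"loss": loss, "status": "ok", "params": params}


def random_search(objective, max_evals, rng):
    """Call the objective max_evals times with freshly sampled parameters and keep every result."""
    results = []
    for _ in range(max_evals):
        params = {
            "lr": rng.uniform(1e-5, 1e-1),
            "momentum": rng.uniform(0.0, 1.0),
        }
        results.append(objective(params))
    return results


results = random_search(toy_objective, max_evals=4, rng=random.Random(42))

# Same selection step as in the parent run: sort by loss and take the first entry
best_run = sorted(results, key=lambda r: r["loss"])[0]
print(len(results), best_run["params"], best_run["loss"])
```

TPE differs from this sketch in how it proposes the next parameters (it uses earlier results to guide the search), but the objective's return value and the "lowest loss wins" selection are the same.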

**Inferencing (Load and Predict)**

```Py

# Inferencing Model

import mlflow

model_uri = 'runs:/d6afa02fdf47443bb97a85e0068fd121/model'

loaded_model = mlflow.pyfunc.load_model(model_uri)
predictions = loaded_model.predict(test_x)
predictions

```


## **DVC (Data Version Control)**

DVC is used for versioning data.

DVC is useful if you store and process data files or datasets to produce other data or machine learning models, and you want to:

- track and save data and machine learning models the same way you capture code;

- create and switch between versions of data and ML models easily;

- understand how datasets and ML artifacts were built in the first place;

- compare model metrics among experiments;

- adopt engineering tools and best practices in data science projects.

`pip install dvc`

**Initialize the DVC**

`git` must be initialized in the project before DVC can be initialized.

`dvc init`

`.dvc` folder is created.

Note that Git should not track the `Dataset` folder.

Now, add the files or data that you want to keep track of: `dvc add location/file.txt`

After every change to the dataset, add the file again: `dvc add 'Datasets(DVC)/Day1.txt'`

Then we track only the `.dvc` file (which holds the hash value) and the `.gitignore` file with Git, not the dataset itself: `git add 'Datasets(DVC)/Day1.txt.dvc' 'Datasets(DVC)/.gitignore'`
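
The `.dvc` file stays small because it records a content hash of the data rather than the data itself (by default DVC stores an MD5 digest). The sketch below is not DVC's code. It only shows, with a temporary file, how a content hash changes whenever the file changes, which is what lets Git follow dataset versions through the `.dvc` file:

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path):
    """MD5 hex digest of a file's contents, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "Day1.txt"

    data_file.write_text("version 1 of the dataset\n")
    hash_v1 = file_md5(data_file)

    data_file.write_text("version 2 of the dataset\n")
    hash_v2 = file_md5(data_file)

print(hash_v1)
print(hash_v2)
print(hash_v1 != hash_v2)
```

Each Git commit therefore pins one hash, and `dvc checkout` can restore the data that matches the hash recorded in the checked-out `.dvc` file.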

**Switching Between Previous Versions of the Data**

First, find the ID of the commit you want to switch to.

Run `git log` and copy the commit ID.

Then,

`git checkout commit_id`

`dvc checkout`

The code and the data are now at the state of that commit.

Then set up the security credentials for the remote:

```bash

dvc remote modify origin --local access_key_id <your_dagshub_token>
dvc remote modify origin --local secret_access_key <your_dagshub_token>

```

```bash

(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete$ dvc remote list
origin  s3://dvc
(mlflow_env) toni-birat@tonibirat:/media/toni-birat/New Volume/ML_Flow_Complete$

```

Now, to pull and push the data we need another library, `dvc-s3` (installed with `pip install dvc-s3`).

DagsHub provides `10GB` of S3 storage for free by default.

Then run `dvc pull -r origin`. Here `origin` is the name of the remote location on DagsHub (Link).

Then run `dvc push -r origin` and `git push origin master`. Our dataset is now tracked.


## **Dags Hub**

Until now we've tracked our machine learning tasks locally, but with `DagsHub` we can host our MLflow tracking in a remote location. This way many people can access it and compare the results against their own models.

It provides features such as Version Control for Code, Version Control for Data (DVC), and Experiment Tracking.

To track the data, add DagsHub as a DVC remote:

```Text

dvc remote add origin s3://dvc
dvc remote modify origin endpointurl https://dagshub.com/ToniBirat7/ML_Flow_End_To_End_ML.s3

```

**This Did Not Work**

Try creating a new project.


## **Production Style Project Structure**

**Name** : First End to End Project

The project is created as a submodule of the main project.

A `.gitmodules` file is created to track the modules. There is also an independent repo for this project.

`git@github.com:ToniBirat7/First_End_To_End_ML_Project.git`

**Project Structure**

```Text
First_End_To_End_ML_Project
├── .gitignore
├── README.md
├── params.yaml
├── requirements.txt
└── src
    ├── __init__.py
    ├── preprocess.py
    ├── evaluate.py
    └── train.py
```

**`params.yaml`**

The `params.yaml` file is used to store the hyperparameters and other parameters that we will use in our project. It is a YAML file.

```yaml
# params.yaml
model:
  name: "Pima Indian Diabetes"
  version: "1.0"
  # Hyperparameters for the model
  type: "Logistic Regression"
  params:
    C: 0.1
    max_iter: 100
    random_state: 42
model_config:
  train_size: 0.8
  random_state: 42
  test_size: 0.2
model_params:
  # Hyperparameters for the model
  max_depth: 5
  n_estimators: 100
  learning_rate: 0.01
model_type: "Random Forest"
data:
  # Data parameters
  source: "https://raw.githubusercontent.com/jbrownlee/Datasets/master/pima-indians-diabetes.data.csv"
  columns:
    - "Pregnancies"
    - "Glucose"
    - "BloodPressure"
    - "SkinThickness"
    - "Insulin"
    - "BMI"
    - "DiabetesPedigreeFunction"
    - "Age"
    - "Outcome"
  target: "Outcome"
```

**`src/preprocess.py`**

This file is used to preprocess the data. It contains functions to load the data, split the data into train and test sets, and preprocess the data.

```python

import pandas as pd
import sys
import yaml
import os

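# Params

with open('params.yaml') as f:
  params = yaml.safe_load(f)['preprocess']

# Column names, taken from the `data.columns` list in params.yaml
COLUMN_NAMES = [
  "Pregnancies", "Glucose", "BloodPressure", "SkinThickness", "Insulin",
  "BMI", "DiabetesPedigreeFunction", "Age", "Outcome",
]

def preprocess_data(input_file, output_file):
  """
  Preprocess the data by reading from input_file, performing necessary transformations,
  and saving the processed data to output_file.
  """
  # Read the data (header=None: the raw CSV is read without a header row)
  df = pd.read_csv(input_file, header=None)

  # Preprocessing: the data is already clean, so we only rename the columns
  df.columns = COLUMN_NAMES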

  os.makedirs(os.path.dirname(output_file), exist_ok=True)

  # Save the processed data
  df.to_csv(output_file, index=False)
  print(f"Processed data saved to {output_file}")

if __name__ == "__main__":
  preprocess_data(
    input_file=params['input'],
    output_file=params['output']
  )

```

**`src/train.py`**

```Py

import mlflow.sklearn
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from mlflow.models import infer_signature
from dotenv import load_dotenv
from urllib.parse import urlparse
import pandas as pd
import mlflow
import yaml
import os
import pickle

print("CWD:", os.getcwd())

# Load environment variables from .env file
load_dotenv()

# Now use the environment variables
tracking_uri = os.getenv('MLFLOW_TRACKING_URI')
experiment_name = os.getenv('MLFLOW_EXPERIMENT_NAME')
username = os.getenv('MLFLOW_TRACKING_USERNAME')
password = os.getenv('MLFLOW_TRACKING_PASSWORD')

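# Set env vars for MLflow auth (only when the credentials are present,
# because assigning None to os.environ raises a TypeError)
if username:
  os.environ["MLFLOW_TRACKING_USERNAME"] = username
if password:
  os.environ["MLFLOW_TRACKING_PASSWORD"] = password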

print(f"Tracking URI: {tracking_uri}")

def hyperparameter_tuning(X_train, y_train, param_grid):
  """
  Perform hyperparameter tuning for the RandomForestClassifier.
  """
  # Start from a default RandomForestClassifier; GridSearchCV tries every combination in param_grid
  model = RandomForestClassifier()
  grid_search = GridSearchCV(
    estimator=model,
    param_grid=param_grid,
    cv=3,
    n_jobs=-1,
    verbose=2,
  )
  grid_search.fit(X_train, y_train)
  print(f"\nBest parameters: {grid_search.best_params_}")
  return grid_search

# Load all parameters
def load_training_params():
  with open('params.yaml', 'r') as file:
    params = yaml.safe_load(file)['train']
  return params


def train_model(params,model_path):
  """
  Train the RandomForestClassifier model with the given training data and parameters.
  """
  path = params['input']
  df = pd.read_csv(path)

  print("\n" + "="*50)
  print("DATA LOADING AND PREPROCESSING")
  print("="*50)
  print(f"Columns in DataFrame: {df.columns.tolist()}")

  X = df.drop(columns=['Outcome'])
  y = df['Outcome']

  print(f"\nData loaded from: {path}")
  print(f"Dataset shape: {df.shape}")
  print(f"Features shape: {X.shape}")
  print(f"Labels shape: {y.shape}")
  print("\nStarting Train Test Split...")

  # Split the data into training and test sets
  if 'test_size' not in params:
    params['test_size'] = 0.2  # Default test size if not specified
  if 'random_state' not in params:
    params['random_state'] = 42  # Default random state if not specified

  # Ensure the input data is in the correct format
  if not isinstance(X, pd.DataFrame):
    raise ValueError("Input features X must be a pandas DataFrame.")
  if not isinstance(y, pd.Series):
    raise ValueError("Input labels y must be a pandas Series.")
  if X.empty or y.empty:
    raise ValueError("Input features X and labels y cannot be empty.")
  if X.shape[0] != y.shape[0]:
    raise ValueError("Input features X and labels y must have the same number of samples.")

  print(f"\nSplitting data with:")
  print(f"  - Test size: {params['test_size']}")
  print(f"  - Random state: {params['random_state']}")

  X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=params['test_size'], random_state=params['random_state'])

  print(f"\nTrain/Test Split Results:")
  print(f"  - Training data shape: {X_train.shape}")
  print(f"  - Test data shape: {X_test.shape}")
  print(f"  - Training labels shape: {y_train.shape}")
  print(f"  - Test labels shape: {y_test.shape}")

  # Set the signature for the model
  signature = infer_signature(X_train, y_train)
  print(f"\nModel signature inferred successfully.")

  # Perform hyperparameter tuning
  print("\n" + "="*50)
  print("HYPERPARAMETER TUNING")
  print("="*50)

  # Set the tracking URI and experiment name
  print(f"\nMLflow Configuration:")
  print(f"  - Tracking URI: {tracking_uri}")
  print(f"  - Experiment name: {experiment_name}")
  mlflow.set_tracking_uri(tracking_uri)

  # Set the experiment name
  if username and password:
    mlflow.set_experiment(experiment_name)
    print("  - Authentication: Enabled")
  else:
    print("  - Authentication: No credentials provided for MLflow tracking server")

  # Start an MLflow run
  with mlflow.start_run():
    print(f"\n✓ MLflow run started successfully")

    # Perform hyperparameter tuning
    param_grid = {
      'n_estimators': [100, 200],
      'max_depth': [None, 10, 20],
      'min_samples_split': [2, 5],
      'min_samples_leaf': [1, 2]
    }

    grid_search = hyperparameter_tuning(X_train, y_train, param_grid)
    print(f"\n✓ Hyperparameter tuning completed")
    best_model = grid_search.best_estimator_
    print(f"✓ Best model found: {best_model}")

    # Train the model
    best_model.fit(X_train, y_train)
    print(f"\n✓ Model training completed")

    # Evaluate the model
    y_pred = best_model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)

    print("\n" + "="*50)
    print("MODEL EVALUATION")
    print("="*50)
    print(f"Model Accuracy: {accuracy:.4f}")

    # Log metrics
    mlflow.log_metric('accuracy', accuracy)

    # Log parameters
    print(f"\nLogging parameters to MLflow...")
    print(f"Best parameters found: {grid_search.best_params_}")
    for key, value in grid_search.best_params_.items():
      mlflow.log_param(key, value)
    print(f"✓ Parameters logged successfully")

    # Log the confusion matrix and classification report as text files in the Artifacts
    print(f"\nLogging evaluation artifacts...")
    cm = confusion_matrix(y_test, y_pred)
    cr = classification_report(y_test, y_pred, output_dict=True)
    mlflow.log_text(str(cm), "confusion_matrix.txt")
    mlflow.log_text(str(cr), "classification_report.txt")
    print(f"✓ Confusion matrix and classification report logged")

    # Log the model signature
    print(f"✓ Model signature logged")
    mlflow.log_param("signature", str(signature))

    # Log the model
    print(f"\n" + "="*50)
    print("MODEL LOGGING")
    print("="*50)

    tracking_uri_type_store = urlparse(mlflow.get_tracking_uri()).scheme
    print(f"Tracking URI type: {tracking_uri_type_store}")

    if tracking_uri_type_store != 'file':
      print(f"Using MLflow server for model logging...")
      mlflow.sklearn.log_model(
          sk_model=best_model,
          artifact_path="model",
          signature=signature,
          registered_model_name=params['model_name']
      )
    else:
      print(f"Using local file system for model logging...")
      mlflow.sklearn.log_model(
          sk_model=best_model,
          artifact_path="model",
          signature=signature
      )
    print(f"✓ Model logged to MLflow successfully")

    # Ensure the directory exists
    os.makedirs(os.path.dirname(model_path), exist_ok=True)
    print(f"\nSaving model locally...")
    print(f"Model path: {model_path}")

    # Save the model to a file
    with open(model_path, 'wb') as f:
      pickle.dump(best_model, f)

    print(f"✓ Model saved locally to {model_path}")

    print(f"\n" + "="*50)
    print("TRAINING COMPLETED SUCCESSFULLY")
    print("="*50)


if __name__ == "__main__":
  params = load_training_params()
  model_path = params['output']

  # Ensure the model path is set correctly
  if not model_path:
    raise ValueError("Model path is not set in params.yaml.")

  print(f"Model will be saved to {model_path}")

  train_model(params, model_path)
  print("Training script executed successfully.")

```
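
`train.py` decides whether to register the model by looking at the scheme of the tracking URI. A small standard-library sketch of what `urlparse(...).scheme` returns for a few example URIs (the URIs below are illustrative, not the project's real configuration):

```python
from urllib.parse import urlparse

# Illustrative tracking URIs: a local server, a hosted server, a file store, a bare path
example_uris = [
    "http://127.0.0.1:5000",
    "https://dagshub.com/some-user/some-repo.mlflow",
    "file:///tmp/mlruns",
    "mlruns",
]

schemes = {uri: urlparse(uri).scheme for uri in example_uris}
for uri, scheme in schemes.items():
    print(f"{uri!r} -> {scheme!r}")
```

Note that a bare relative path such as `mlruns` has an empty scheme, so the `!= 'file'` check in `train.py` would treat it like a remote server and try to register the model.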

**`src/evaluate.py`**

```Py

import mlflow
import mlflow.sklearn
import pandas as pd
import pickle
import numpy as np
import yaml
import os
import json
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    confusion_matrix,
    classification_report,
    roc_curve,
    precision_recall_curve
)
from sklearn.model_selection import train_test_split
from mlflow.models import infer_signature
from dotenv import load_dotenv
from urllib.parse import urlparse
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings

warnings.filterwarnings('ignore')

# Load environment variables from .env file
load_dotenv()

# MLflow and DagsHub configuration
tracking_uri = os.getenv('MLFLOW_TRACKING_URI')
experiment_name = "Evaluating the Trained Model Diabetes"
username = os.getenv('MLFLOW_TRACKING_USERNAME')
password = os.getenv('MLFLOW_TRACKING_PASSWORD')

# Set environment variables for MLflow authentication
os.environ["MLFLOW_TRACKING_USERNAME"] = username if username else ""
os.environ["MLFLOW_TRACKING_PASSWORD"] = password if password else ""

print(f"Username: {username}")
# Never print the password itself; only report whether it is set
print(f"Password set: {bool(password)}")

def load_evaluation_params():
  """Load evaluation parameters from params.yaml"""
  try:
    with open('params.yaml', 'r') as file:
      params = yaml.safe_load(file)
    return params
  except FileNotFoundError:
    print("Warning: params.yaml not found. Using default parameters.")
    return {
      'train': {
        'input': 'dataset/processed/diabetes_processed.csv',
        'output': 'model/diabetes_model.pkl',
        'test_size': 0.2,
        'random_state': 42,
        'model_name': 'RandomForestClassifierBestModel'
      }
    }

def load_data_and_model(data_path, model_path):
    """Load dataset and trained model"""
    print("\n" + "="*60)
    print("LOADING DATA AND MODEL")
    print("="*60)

    try:
        # Load dataset
        print(f"Loading dataset from: {data_path}")
        df = pd.read_csv(data_path)
        print(f"✓ Dataset loaded successfully")
        print(f"  - Dataset shape: {df.shape}")
        print(f"  - Features: {df.columns.tolist()}")

        # Prepare features and target
        X = df.drop(columns=['Outcome'])
        y = df['Outcome']
        print(f"  - Features shape: {X.shape}")
        print(f"  - Target shape: {y.shape}")
        print(f"  - Target distribution: {y.value_counts().to_dict()}")

        # Load trained model
        print(f"\nLoading trained model from: {model_path}")
        with open(model_path, 'rb') as f:
            model = pickle.load(f)
        print(f"✓ Model loaded successfully")
        print(f"  - Model type: {type(model).__name__}")

        return df, X, y, model

    except FileNotFoundError as e:
        print(f"✗ Error: File not found - {e}")
        raise
    except Exception as e:
        print(f"✗ Error loading data or model: {e}")
        raise

def split_data(X, y, test_size=0.2, random_state=42):
    """Split data into train and test sets"""
    print(f"\nSplitting data:")
    print(f"  - Test size: {test_size}")
    print(f"  - Random state: {random_state}")

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    print(f"✓ Data split completed")
    print(f"  - Training set: {X_train.shape[0]} samples")
    print(f"  - Test set: {X_test.shape[0]} samples")

    return X_train, X_test, y_train, y_test

def calculate_comprehensive_metrics(y_true, y_pred, y_pred_proba=None):
    """Calculate comprehensive evaluation metrics"""
    print("\n" + "="*60)
    print("CALCULATING EVALUATION METRICS")
    print("="*60)

    metrics = {}

    # Basic metrics
    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    metrics['precision'] = precision_score(y_true, y_pred, average='binary')
    metrics['recall'] = recall_score(y_true, y_pred, average='binary')
    metrics['f1_score'] = f1_score(y_true, y_pred, average='binary')

    # AUC-ROC if probabilities are available
    if y_pred_proba is not None:
        metrics['roc_auc'] = roc_auc_score(y_true, y_pred_proba[:, 1])

    # Confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    metrics['confusion_matrix'] = cm.tolist()

    # True/False Positives/Negatives
    tn, fp, fn, tp = cm.ravel()
    metrics['true_negatives'] = int(tn)
    metrics['false_positives'] = int(fp)
    metrics['false_negatives'] = int(fn)
    metrics['true_positives'] = int(tp)

    # Specificity and Sensitivity
    metrics['specificity'] = tn / (tn + fp) if (tn + fp) > 0 else 0
    metrics['sensitivity'] = tp / (tp + fn) if (tp + fn) > 0 else 0

    # Print metrics
    print(f"Basic Metrics:")
    print(f"  - Accuracy: {metrics['accuracy']:.4f}")
    print(f"  - Precision: {metrics['precision']:.4f}")
    print(f"  - Recall (Sensitivity): {metrics['recall']:.4f}")
    print(f"  - F1-Score: {metrics['f1_score']:.4f}")
    print(f"  - Specificity: {metrics['specificity']:.4f}")

    if 'roc_auc' in metrics:
        print(f"  - ROC-AUC: {metrics['roc_auc']:.4f}")

    print(f"\nConfusion Matrix:")
    print(f"  - True Negatives: {metrics['true_negatives']}")
    print(f"  - False Positives: {metrics['false_positives']}")
    print(f"  - False Negatives: {metrics['false_negatives']}")
    print(f"  - True Positives: {metrics['true_positives']}")

    return metrics

def generate_classification_report(y_true, y_pred):
    """Generate detailed classification report"""
    print(f"\nDetailed Classification Report:")
    report = classification_report(y_true, y_pred, output_dict=True)
    report_str = classification_report(y_true, y_pred)
    print(report_str)

    return report, report_str

def create_visualizations(y_true, y_pred, y_pred_proba=None, save_plots=True):
    """Create evaluation visualizations"""
    print("\n" + "="*60)
    print("GENERATING VISUALIZATIONS")
    print("="*60)

    plots_created = []

    try:
        # Set style
        plt.style.use('default')
        sns.set_palette("husl")

        # 1. Confusion Matrix Heatmap
        plt.figure(figsize=(8, 6))
        cm = confusion_matrix(y_true, y_pred)
        sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
                   xticklabels=['No Diabetes', 'Diabetes'],
                   yticklabels=['No Diabetes', 'Diabetes'])
        plt.title('Confusion Matrix - Diabetes Prediction')
        plt.ylabel('True Label')
        plt.xlabel('Predicted Label')

        if save_plots:
            confusion_matrix_path = 'artifacts/confusion_matrix.png'
            os.makedirs('artifacts', exist_ok=True)
            plt.savefig(confusion_matrix_path, dpi=300, bbox_inches='tight')
            plots_created.append(confusion_matrix_path)
            print(f"✓ Confusion matrix saved to: {confusion_matrix_path}")

        plt.close()

        # 2. ROC Curve (if probabilities available)
        if y_pred_proba is not None:
            plt.figure(figsize=(8, 6))
            fpr, tpr, _ = roc_curve(y_true, y_pred_proba[:, 1])
            auc_score = roc_auc_score(y_true, y_pred_proba[:, 1])

            plt.plot(fpr, tpr, linewidth=2, label=f'ROC Curve (AUC = {auc_score:.3f})')
            plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier')
            plt.xlabel('False Positive Rate')
            plt.ylabel('True Positive Rate')
            plt.title('ROC Curve - Diabetes Prediction')
            plt.legend()
            plt.grid(True, alpha=0.3)

            if save_plots:
                roc_curve_path = 'artifacts/roc_curve.png'
                plt.savefig(roc_curve_path, dpi=300, bbox_inches='tight')
                plots_created.append(roc_curve_path)
                print(f"✓ ROC curve saved to: {roc_curve_path}")

            plt.close()

            # 3. Precision-Recall Curve
            plt.figure(figsize=(8, 6))
            precision, recall, _ = precision_recall_curve(y_true, y_pred_proba[:, 1])

            plt.plot(recall, precision, linewidth=2, label='Precision-Recall Curve')
            plt.xlabel('Recall')
            plt.ylabel('Precision')
            plt.title('Precision-Recall Curve - Diabetes Prediction')
            plt.legend()
            plt.grid(True, alpha=0.3)

            if save_plots:
                pr_curve_path = 'artifacts/precision_recall_curve.png'
                plt.savefig(pr_curve_path, dpi=300, bbox_inches='tight')
                plots_created.append(pr_curve_path)
                print(f"✓ Precision-Recall curve saved to: {pr_curve_path}")

            plt.close()

        # 4. Prediction Distribution
        # Create subplots for actual vs predicted distributions
        # (plt.subplots creates its own figure, so no separate plt.figure call is needed)
        fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

        # Actual distribution
        y_true_counts = pd.Series(y_true).value_counts().sort_index()
        ax1.bar(['No Diabetes', 'Diabetes'], y_true_counts.values,
                color=['lightblue', 'lightcoral'], alpha=0.7)
        ax1.set_title('Actual Distribution')
        ax1.set_ylabel('Count')

        # Predicted distribution
        y_pred_counts = pd.Series(y_pred).value_counts().sort_index()
        ax2.bar(['No Diabetes', 'Diabetes'], y_pred_counts.values,
                color=['lightblue', 'lightcoral'], alpha=0.7)
        ax2.set_title('Predicted Distribution')
        ax2.set_ylabel('Count')

        plt.tight_layout()

        if save_plots:
            distribution_path = 'artifacts/prediction_distribution.png'
            plt.savefig(distribution_path, dpi=300, bbox_inches='tight')
            plots_created.append(distribution_path)
            print(f"✓ Prediction distribution saved to: {distribution_path}")

        plt.close()

        print(f"✓ All visualizations created successfully")

    except Exception as e:
        print(f"✗ Error creating visualizations: {e}")

    return plots_created


def log_to_mlflow(metrics, report, report_str, model, X_test, y_test, plots_created, model_name):
    """Log evaluation results to MLflow with DagsHub integration"""
    print("\n" + "="*60)
    print("LOGGING TO MLFLOW & DAGSHUB")
    print("="*60)

    # Configure MLflow
    print(f"MLflow Configuration:")
    print(f"  - Tracking URI: {tracking_uri}")
    print(f"  - Experiment name: {experiment_name}")

    mlflow.set_tracking_uri(tracking_uri)

    # Set authentication if credentials are provided
    if username and password:
        print("  - Authentication: Enabled")
    else:
        print("  - Authentication: No credentials provided")

    # Set experiment
    try:
        mlflow.set_experiment(experiment_name)
        print(f"✓ Experiment set: {experiment_name}")
    except Exception as e:
        print(f"✗ Error setting experiment: {e}")
        # Create experiment if it doesn't exist
        try:
            mlflow.create_experiment(experiment_name)
            mlflow.set_experiment(experiment_name)
            print(f"✓ Created and set new experiment: {experiment_name}")
        except Exception as e2:
            print(f"✗ Error creating experiment: {e2}")

    # Start MLflow run
    with mlflow.start_run(run_name=f"diabetes_evaluation_{datetime.now().strftime('%Y%m%d_%H%M%S')}"):
        print(f"✓ MLflow run started")

        # Log evaluation metrics
        print(f"\nLogging metrics...")
        for metric_name, metric_value in metrics.items():
            if isinstance(metric_value, (int, float)) and not isinstance(metric_value, bool):
                mlflow.log_metric(metric_name, metric_value)
                print(f"  - {metric_name}: {metric_value}")
        print(f"✓ Metrics logged successfully")

        # Log parameters
        print(f"\nLogging parameters...")
        mlflow.log_param("model_type", type(model).__name__)
        mlflow.log_param("test_samples", len(y_test))
        mlflow.log_param("evaluation_date", datetime.now().isoformat())
        mlflow.log_param("model_name", model_name)

        # Log model parameters if available
        if hasattr(model, 'get_params'):
            model_params = model.get_params()
            for param_name, param_value in model_params.items():
                mlflow.log_param(f"model_{param_name}", param_value)
        print(f"✓ Parameters logged successfully")

        # Log artifacts
        print(f"\nLogging artifacts...")

        # Log confusion matrix as text
        cm_text = f"Confusion Matrix:\n{metrics['confusion_matrix']}"
        mlflow.log_text(cm_text, "confusion_matrix.txt")

        # Log classification report
        mlflow.log_text(report_str, "classification_report.txt")

        # Log detailed metrics as JSON
        metrics_json = json.dumps(metrics, indent=2, default=str)
        mlflow.log_text(metrics_json, "detailed_metrics.json")

        # Log classification report as JSON
        report_json = json.dumps(report, indent=2, default=str)
        mlflow.log_text(report_json, "classification_report.json")

        print(f"✓ Text artifacts logged successfully")

        # Log visualization plots
        for plot_path in plots_created:
            try:
                mlflow.log_artifact(plot_path)
                print(f"  - Logged plot: {plot_path}")
            except Exception as e:
                print(f"  ✗ Error logging plot {plot_path}: {e}")

        print(f"✓ All artifacts logged successfully")

        # Get run info
        run_info = mlflow.active_run().info
        print(f"\n" + "="*60)
        print("MLFLOW RUN SUMMARY")
        print("="*60)
        print(f"Run ID: {run_info.run_id}")
        print(f"Experiment ID: {run_info.experiment_id}")
        print(f"Status: {run_info.status}")
        print(f"Start Time: {datetime.fromtimestamp(run_info.start_time/1000)}")

        if tracking_uri and "dagshub.com" in tracking_uri:
            # The tracking URI already points at the DagsHub-hosted MLflow server
            print(f"\n🔗 View on DagsHub: {tracking_uri}")

        return run_info.run_id


def evaluate_model(params_path='params.yaml'):
    """Main evaluation function"""
    print("\n" + "="*80)
    print("DIABETES PREDICTION MODEL EVALUATION")
    print("="*80)
    print(f"Started at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

    try:
        # Load parameters
        params = load_evaluation_params()
        train_params = params.get('train', {})

        data_path = train_params.get('input', 'dataset/processed/diabetes_processed.csv')
        model_path = train_params.get('output', 'model/diabetes_model.pkl')
        test_size = train_params.get('test_size', 0.2)
        random_state = train_params.get('random_state', 42)
        model_name = train_params.get('model_name', 'RandomForestClassifierBestModel')

        print(f"Configuration:")
        print(f"  - Data path: {data_path}")
        print(f"  - Model path: {model_path}")
        print(f"  - Test size: {test_size}")
        print(f"  - Random state: {random_state}")
        print(f"  - Model name: {model_name}")

        # Load data and model
        df, X, y, model = load_data_and_model(data_path, model_path)

        # Split data (using same split as training)
        X_train, X_test, y_train, y_test = split_data(X, y, test_size, random_state)

        # Make predictions
        print("\n" + "="*60)
        print("MAKING PREDICTIONS")
        print("="*60)

        y_pred = model.predict(X_test)
        print(f"✓ Predictions generated for {len(y_test)} samples")

        # Get prediction probabilities if available
        y_pred_proba = None
        if hasattr(model, 'predict_proba'):
            y_pred_proba = model.predict_proba(X_test)
            print(f"✓ Prediction probabilities generated")

        # Calculate metrics
        metrics = calculate_comprehensive_metrics(y_test, y_pred, y_pred_proba)

        # Generate classification report
        report, report_str = generate_classification_report(y_test, y_pred)

        # Create visualizations
        plots_created = create_visualizations(y_test, y_pred, y_pred_proba)

        # Log to MLflow
        run_id = log_to_mlflow(metrics, report, report_str, model, X_test, y_test, plots_created, model_name)

        print("\n" + "="*80)
        print("EVALUATION COMPLETED SUCCESSFULLY")
        print("="*80)
        print(f"Final Results Summary:")
        print(f"  - Accuracy: {metrics['accuracy']:.4f}")
        print(f"  - Precision: {metrics['precision']:.4f}")
        print(f"  - Recall: {metrics['recall']:.4f}")
        print(f"  - F1-Score: {metrics['f1_score']:.4f}")
        if 'roc_auc' in metrics:
            print(f"  - ROC-AUC: {metrics['roc_auc']:.4f}")
        print(f"  - MLflow Run ID: {run_id}")
        print(f"Completed at: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

        return metrics, run_id

    except Exception as e:
        print(f"\n✗ EVALUATION FAILED: {e}")
        import traceback
        traceback.print_exc()
        raise

if __name__ == "__main__":
    print("Starting Diabetes Prediction Model Evaluation...")
    metrics, run_id = evaluate_model()
    print("Evaluation script executed successfully.")

```
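
The filtering that `log_to_mlflow` applies before calling `mlflow.log_metric` can be tried without a tracking server. The sketch below assumes a metrics dictionary shaped like the one in the script; `split_metrics` and the example values are hypothetical and only illustrate why the confusion matrix ends up in a text artifact rather than a metric.

```python
import json

def split_metrics(metrics):
    """Separate values that can be logged as MLflow metrics from everything else."""
    scalars, others = {}, {}
    for name, value in metrics.items():
        # bool is a subclass of int, so it has to be excluded explicitly
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            scalars[name] = float(value)
        else:
            others[name] = value
    return scalars, others

# Hypothetical metrics dictionary (made-up numbers, not real results)
example = {
    "accuracy": 0.75,
    "f1_score": 0.5,
    "n_test": 154,
    "has_proba": True,
    "confusion_matrix": [[90, 10], [28, 26]],
}

scalars, others = split_metrics(example)
others_json = json.dumps(others, indent=2, default=str)  # same idea as detailed_metrics.json
print(scalars)
print(others_json)
```

Only the entries in `scalars` would be passed to `mlflow.log_metric`; everything in `others` is better stored as a JSON or text artifact.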


## **Complete Pipeline**

So far we have implemented each step individually; now we will build the complete workflow.

We will version the pipeline inputs, i.e. the data and `params.yaml`.

First we will track the `raw data` with DVC.

`dvc add dataset/raw/diabetes.csv`

`git add dataset/raw/diabetes.csv.dvc`

Now, to combine `preprocessing`, `train` and `evaluate` into one workflow, we can use a `DVC` feature called `DVC Stage`.

### **DVC Stage**

The `dvc stage add` command defines the stages of an ML or data pipeline. Each stage represents one step, such as data preprocessing, model training, or evaluation.

```Text

Preprocessing --> Training --> Evaluation

```

Each `dvc stage add` call defines one stage. Together the stages form the `pipeline` of our project, and DVC runs them in sequence based on their dependencies and outputs.

**Preprocessing Stage**

```bash

dvc stage add -n preprocess \
  -p preprocess.input,preprocess.output \
  -d src/preprocess.py -d dataset/raw/diabetes.csv \
  -o dataset/processed/diabetes_processed.csv \
  -- python src/preprocess.py

```

- `-n` : Name of the new stage.
- `-p` : Parameters to track from `params.yaml`. Our file has a `preprocess` section containing `input` and `output`.
- `-d` : Dependencies of the stage.
- `-o` : Output of the stage.
- `python src/preprocess.py` : The command that the stage runs.

As soon as we run this command, it will create a `dvc.yaml` file in the root directory of the project. This file contains the information about the stages and their dependencies.
The `dvc.yaml` file will look like this:

```yaml
stages:
  preprocess:
    cmd: python src/preprocess.py
    deps:
      - src/preprocess.py
      - dataset/raw/diabetes.csv
    outs:
      - dataset/processed/diabetes_processed.csv
    params:
      - preprocess.input
      - preprocess.output
```
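
The dotted names under `params` refer to nested keys in `params.yaml`. The sketch below shows that mapping with a plain dictionary standing in for the parsed file; `get_param` is a hypothetical helper and the dictionary contents are an assumption based on the paths used in the stage definition, not the actual `params.yaml`.

```python
def get_param(params, dotted_key):
    """Resolve a dotted name such as 'preprocess.input' in a nested dict."""
    node = params
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if the parameter is missing
    return node

# Stand-in for the parsed params.yaml (assumed contents)
params = {
    "preprocess": {
        "input": "dataset/raw/diabetes.csv",
        "output": "dataset/processed/diabetes_processed.csv",
    }
}

# The same names that were passed to `-p`
tracked = ["preprocess.input", "preprocess.output"]
resolved = {key: get_param(params, key) for key in tracked}
print(resolved)
```

When one of these tracked values changes, DVC treats the stage as out of date, in the same way as when a dependency file changes.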

**Training Stage**

```bash
dvc stage add -n train \
  -p train.input,train.output,train.test_size,train.random_state,train.model_name \
  -d src/train.py -d dataset/processed/diabetes_processed.csv \
  -o model/diabetes_model.pkl \
  -- python src/train.py

```

**Evaluation Stage**

```bash
dvc stage add -n evaluate \
  -p evaluate.input,evaluate.output,evaluate.metric \
  -d src/evaluate.py -d model/diabetes_model.pkl -d dataset/processed/diabetes_processed.csv \
  -o reports/diabetes_report.json \
  python src/evaluate.py

```
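
The order `Preprocessing --> Training --> Evaluation` is never stated explicitly in the three commands above; it follows from which stage's output is another stage's dependency. The sketch below reproduces that reasoning in plain Python. It is a simplified model of the idea, not DVC's own code, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# deps and outs copied from the three `dvc stage add` commands
stages = {
    "preprocess": {
        "deps": ["src/preprocess.py", "dataset/raw/diabetes.csv"],
        "outs": ["dataset/processed/diabetes_processed.csv"],
    },
    "train": {
        "deps": ["src/train.py", "dataset/processed/diabetes_processed.csv"],
        "outs": ["model/diabetes_model.pkl"],
    },
    "evaluate": {
        "deps": ["src/evaluate.py", "model/diabetes_model.pkl",
                 "dataset/processed/diabetes_processed.csv"],
        "outs": ["reports/diabetes_report.json"],
    },
}

def execution_order(stages):
    # Map every output path to the stage that produces it
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on whichever stage produces one of its deps
    graph = {
        name: {producer[d] for d in s["deps"] if d in producer}
        for name, s in stages.items()
    }
    # Predecessors come first in a topological order
    return list(TopologicalSorter(graph).static_order())

order = execution_order(stages)
print(" --> ".join(order))
```

Files such as `src/train.py` have no producing stage, so they only act as inputs that can invalidate a stage when they change.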

**Viewing the Pipeline**

To view the pipeline, we can use the `dvc dag` command (it replaced the older `dvc pipeline show`, which is no longer available in the DVC versions that provide `dvc stage add`). It prints the stages and their dependencies as an ASCII graph.

```bash
dvc dag

```

Conceptually, the graph looks like this:

```Text
Preprocessing --> Training --> Evaluation

```

**Running the Pipeline**

```bash
dvc repro
```

This command will run the pipeline and execute all the stages in the defined order.
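
`dvc repro` skips stages whose inputs have not changed, because DVC records content hashes of dependencies and outputs in `dvc.lock`. The sketch below illustrates that idea only; `file_hash`, `needs_rerun` and the in-memory `recorded` dictionary are hypothetical stand-ins, and SHA-256 is an arbitrary choice for the illustration.

```python
import hashlib
import tempfile
from pathlib import Path

def file_hash(path):
    """Hash a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def needs_rerun(deps, recorded):
    """True if any dependency is new or its hash differs from the recorded one."""
    return any(recorded.get(str(d)) != file_hash(d) for d in deps)

with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "diabetes.csv"
    data.write_text("glucose,outcome\n120,0\n")

    # First run: nothing is recorded yet, so the stage must run
    recorded = {}
    first = needs_rerun([data], recorded)
    recorded[str(data)] = file_hash(data)  # what a lock file would remember

    # Nothing changed: the stage can be skipped
    second = needs_rerun([data], recorded)

    # The data changed: the stage has to run again
    data.write_text("glucose,outcome\n120,0\n150,1\n")
    third = needs_rerun([data], recorded)

print(first, second, third)
```

This is also why editing only `src/evaluate.py` would re-run the evaluation stage while leaving preprocessing and training untouched.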

**Tracking the Pipeline**
The `dvc dag` command can be rerun at any time to inspect the directed acyclic graph (DAG) of the pipeline.

To also version the data remotely, we can use DagsHub as a DVC remote. For that, we need to add the remote repository to DVC.

```bash
dvc remote add -d origin https://dagshub.com/username/repo.git
```

Replace `username` and `repo` with your Dagshub username and repository name.

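Then, set up the credentials for the remote repository. The `--local` flag writes them to `.dvc/config.local`, which Git ignores, so the token never ends up in a commit. Use your own DagsHub access token in place of the placeholder.

```bash
dvc remote modify origin --local access_key_id YOUR_DAGSHUB_TOKEN
dvc remote modify origin --local secret_access_key YOUR_DAGSHUB_TOKEN
```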

Now, we can push the data to the remote repository.

```bash
dvc push -r origin
```

```bash
dvc pull -r origin
```

`dvc pull` does the reverse of `dvc push`: it downloads the data from the remote repository and updates the local workspace with the latest version.

Now we will need to track the `dvc.yaml` file and the `params.yaml` file in the git repository.

```bash
git add dvc.yaml params.yaml
git commit -m "Added DVC stages for preprocessing, training, and evaluation"
```

Now, we can push the changes to the remote repository.

```bash
git push origin master
```

Once pushed, you can view the pipeline on Dagshub. You can also view the data versions and the parameters used in the pipeline.

**In our current setup, however, the DVC remote on DagsHub is not available, so we will use DVC locally for now.**

## **Conclusion**

In this project, we have implemented an end-to-end machine learning pipeline using DVC and MLflow. We have tracked the data, parameters, and stages of the pipeline using DVC. We have also used MLflow to track the experiments and log the results.
This project can be extended further by adding more stages to the pipeline, such as hyperparameter tuning, model deployment, and monitoring. DVC and MLflow provide a powerful set of tools to manage the machine learning lifecycle and make it easier to collaborate with other team members.


## **MLFlow in AWS (Try It)**

To run MLflow in AWS, you can use the following steps:

1. **Set up an EC2 instance**: Launch an EC2 instance with the desired specifications. Make sure to configure the security group to allow inbound traffic on the port you will use for MLflow (default is 5000).
2. **Install MLflow**: SSH into your EC2 instance and install MLflow using pip:
   ```bash
   pip install mlflow
   ```
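3. **Run MLflow server**: Start the MLflow server on your EC2 instance. Bind it to `0.0.0.0`, because by default it listens only on `127.0.0.1` and cannot be reached from outside the instance:

   ```bash
   mlflow ui --host 0.0.0.0 --port 5000
   ```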

4. **Access MLflow UI**: Open your web browser and navigate to `http://<your-ec2-public-ip>:5000` to access the MLflow UI.

5. **Configure tracking URI**: In your MLflow scripts, set the tracking URI to point to your EC2 instance:
   ```python
   import mlflow
   mlflow.set_tracking_uri("http://<your-ec2-public-ip>:5000")
   ```
6. **Log experiments**: Use MLflow's logging functions in your scripts to log parameters, metrics, and models as you would normally do.
7. **Persist data**: If you want to persist your MLflow data, start the server with a remote artifact store (for example an S3 bucket via `--default-artifact-root`) and a database backend (via `--backend-store-uri`). These are server-side options; `mlflow.set_tracking_uri()` only tells the client where the server is.
8. **Security**: Consider securing your MLflow server by setting up authentication and HTTPS. You can use tools like Nginx or Apache to set up a reverse proxy with SSL.
9. **Monitoring and Scaling**: For production use, consider using AWS services like CloudWatch for monitoring and Auto Scaling for handling increased load.

10. **Backup**: Regularly back up your MLflow data and models to ensure you don't lose any important information.
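
Steps 5 and 6 can be combined into a small smoke test run from your own machine. This is a sketch under assumptions: `tracking_uri` and `log_smoke_test` are hypothetical helpers, the experiment name is made up, and `203.0.113.10` is a documentation-only address that stands in for your EC2 public IP. `MLFLOW_TRACKING_URI` is the environment variable MLflow itself reads.

```python
import os

def tracking_uri(host=None, port=5000):
    """Pick the tracking URI: explicit host first, then the environment, then local."""
    if host:
        return f"http://{host}:{port}"
    return os.environ.get("MLFLOW_TRACKING_URI", f"http://127.0.0.1:{port}")

def log_smoke_test(uri):
    """Log one parameter and one metric to confirm the server is reachable."""
    import mlflow  # imported lazily so the URI helper works without MLflow installed

    mlflow.set_tracking_uri(uri)
    mlflow.set_experiment("ec2_smoke_test")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("source", "laptop")
        mlflow.log_metric("ping", 1.0)

# Placeholder address; substitute your EC2 public IP
uri = tracking_uri(host="203.0.113.10")
print(uri)
# log_smoke_test(uri)  # uncomment once the server from step 3 is running
```

If the run shows up in the UI from step 4, the security group, the host binding, and the tracking URI are all configured consistently.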


## **Docker**

### **Dockerfile**

```Dockerfile
# Use the official Python image as the base image
FROM python:3.9-slim
# Set the working directory in the container
WORKDIR /app
# Copy the requirements file into the container
COPY requirements.txt .
# Install the required packages
RUN pip install --no-cache-dir -r requirements.txt
# Copy the entire application code into the container
COPY . .
# Expose the port that MLflow will run on
EXPOSE 5000
# Set environment variables for MLflow
ENV MLFLOW_TRACKING_URI=http://localhost:5000
ENV MLFLOW_EXPERIMENT_NAME=Diabetes_Prediction_Experiment
# Command to run the MLflow server
CMD ["mlflow", "ui", "--host", "0.0.0.0"]
```

### **Docker Compose**

```yaml
version: "3.8"
services:
  mlflow:
    build: .
    ports:
      - "5000:5000"
    environment:
      MLFLOW_TRACKING_URI: http://localhost:5000
      MLFLOW_EXPERIMENT_NAME: Diabetes_Prediction_Experiment
    volumes:
      - ./mlruns:/app/mlruns
```

### **Build and Run the Docker Container**

```bash
# Build the Docker image
docker build -t mlflow-diabetes-prediction .
# Run the Docker container
docker run -p 5000:5000 mlflow-diabetes-prediction
```

### **Docker Volume**

**What is Docker Volume?**

Docker volumes are a way to persist data generated by and used by Docker containers. When you create a volume, it is stored outside the container's filesystem, allowing you to keep the data even if the container is stopped or removed. This is particularly useful for applications like MLflow, where you want to retain experiment logs, models, and artifacts across container restarts.

To persist MLflow data, the `docker-compose.yml` file needs a `volumes` entry for the `mlruns` directory, which is where MLflow stores its experiment data. The relevant line is annotated below.

```yaml
version: "3.8"
services:
  mlflow:
    build: .
    ports:
      - "5000:5000"
    environment:
      MLFLOW_TRACKING_URI: http://localhost:5000
      MLFLOW_EXPERIMENT_NAME: Diabetes_Prediction_Experiment
    volumes:
      - ./mlruns:/app/mlruns # This line mounts the local mlruns directory to the container
```

This configuration mounts the `mlruns` directory from your local machine to `/app/mlruns` in the container (strictly speaking, a host path like `./mlruns` is a bind mount rather than a named volume). Any experiments, models, and artifacts logged by MLflow are therefore stored in `mlruns` on the host and remain available after the container is stopped or removed.


## **Apache Airflow**

**Intro**

Apache Airflow is an open-source platform to programmatically author, schedule, and monitor workflows. It allows you to define complex data pipelines as Directed Acyclic Graphs (DAGs) using Python code. Airflow provides a rich user interface to visualize the execution of tasks, monitor their status, and manage dependencies between tasks.

**Installation**

To install Apache Airflow, you can use the following command:

```bash
pip install apache-airflow
```

**Basic Concepts**

1. **DAG (Directed Acyclic Graph)**: A DAG is a collection of tasks with dependencies defined between them. In Airflow, you define your workflows as DAGs using Python code.

2. **Operators**: Operators are the building blocks of Airflow tasks. They define what kind of work a task will do. There are different types of operators for various tasks, such as BashOperator for running bash commands, PythonOperator for executing Python functions, and more.

3. **Tasks**: A task is a single unit of work within a DAG. Each task is represented by an instance of an operator.

4. **Scheduler**: The Airflow scheduler is responsible for executing tasks on a defined schedule. It monitors the DAGs and triggers tasks based on their dependencies and schedules.

5. **Executor**: The executor is the component that actually runs the tasks. Airflow supports different executors, such as the LocalExecutor for running tasks locally and the CeleryExecutor for distributed task execution.

6. **UI**: Airflow provides a web-based user interface to visualize and manage your workflows. You can view the status of tasks, trigger runs, and monitor progress through the UI.

**Example DAG**

```python
from datetime import datetime

from airflow import DAG
# Airflow 2.3+ import paths; EmptyOperator replaces the deprecated DummyOperator
from airflow.operators.empty import EmptyOperator
from airflow.operators.python import PythonOperator

def my_task():
    print("Hello, Airflow!")

# catchup=False avoids backfilling a run for every day since start_date
with DAG("my_dag", schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False) as dag:
    start = EmptyOperator(task_id="start")
    task1 = PythonOperator(task_id="task1", python_callable=my_task)
    end = EmptyOperator(task_id="end")

    # Run the tasks in sequence
    start >> task1 >> end

```
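
The `>>` in `start >> task1 >> end` is ordinary Python operator overloading. The toy class below is not Airflow's operator class; it is a hypothetical stand-in that shows why the chained form works: each `>>` records the right-hand task as downstream and returns it, so the next `>>` continues from there.

```python
class Task:
    """Toy stand-in for an Airflow task; not the real operator class."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        # `a >> b` records b as downstream of a and returns b,
        # which is what lets `a >> b >> c` chain left to right
        self.downstream.append(other)
        return other

def run_order(first):
    """Walk a linear chain from the first task and collect task ids."""
    order, current = [], first
    while current is not None:
        order.append(current.task_id)
        current = current.downstream[0] if current.downstream else None
    return order

start, task1, end = Task("start"), Task("task1"), Task("end")
start >> task1 >> end
order = run_order(start)
print(order)
```

Real DAGs can branch and join, so Airflow stores a full graph rather than a single chain, but the dependency bookkeeping follows the same pattern.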

### **Running Airflow**

To run Airflow, you need to initialize the database and start the web server and scheduler. Here are the steps:

```bash
# Initialize the metadata database
airflow db init
# Start the web server (keeps running; leave this terminal open)
airflow webserver --port 8080
# Start the scheduler in a second terminal
airflow scheduler
```

### **Accessing the Airflow UI**

Once the web server is running, you can access the Airflow UI by navigating to `http://localhost:8080` in your web browser. From there, you can view your DAGs, trigger runs, and monitor task execution.

### **Airflow with MLFlow**

To integrate Airflow with MLflow, you can create Airflow tasks that log experiments, parameters, and metrics to MLflow. This allows you to track your machine learning experiments and manage them effectively.

```python
from datetime import datetime

import mlflow
from airflow import DAG
from airflow.operators.python import PythonOperator

def log_experiment():
    # The context manager ends the run even if logging raises an exception
    with mlflow.start_run():
        mlflow.log_param("param1", 5)
        mlflow.log_metric("metric1", 0.85)

# catchup=False avoids backfilling a run for every day since start_date
with DAG("mlflow_integration", schedule_interval="@daily", start_date=datetime(2023, 1, 1), catchup=False) as dag:
    log_task = PythonOperator(task_id="log_experiment", python_callable=log_experiment)

```

This DAG defines a task that logs parameters and metrics to MLflow. You can run this DAG in Airflow to track your machine learning experiments.

### **Why Airflow for MLOps?**

Apache Airflow is a powerful tool for managing complex workflows and data pipelines, making it an excellent choice for MLOps (Machine Learning Operations). Here are some reasons why Airflow is suitable for MLOps:

1. **Workflow Orchestration**: Airflow allows you to define complex workflows as Directed Acyclic Graphs (DAGs), making it easy to manage dependencies between tasks and ensure that tasks are executed in the correct order.

2. **Task Scheduling**: Airflow provides a robust scheduling mechanism that allows you to run tasks at specific intervals or based on triggers, making it ideal for automating machine learning workflows.

3. **Extensibility**: Airflow supports a wide range of operators and hooks, allowing you to integrate with various tools and services commonly used in MLOps, such as MLflow, Kubernetes, and cloud platforms.

4. **Monitoring and Logging**: Airflow provides a web-based user interface to monitor the status of tasks, view logs, and track the progress of workflows. This is crucial for debugging and maintaining machine learning pipelines.

5. **Scalability**: Airflow can scale horizontally by using distributed executors like Celery or Kubernetes, allowing you to handle large-scale machine learning workloads efficiently.

6. **Version Control**: Because DAGs are plain Python files, they can be versioned in Git like any other code, which makes it easier to manage changes to your machine learning workflows over time.
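
To make the scheduling point concrete, the sketch below models what a `@daily` schedule with a `start_date` implies, with and without `catchup`. It is a simplified model of the behaviour, not Airflow's scheduler code; `daily_runs` is a hypothetical helper and the dates are arbitrary.

```python
from datetime import datetime, timedelta

def daily_runs(start_date, now, catchup=True):
    """Logical dates a daily schedule would cover between start_date and now."""
    runs, day = [], start_date
    # A daily run for `day` only becomes due once that day has fully passed
    while day + timedelta(days=1) <= now:
        runs.append(day)
        day += timedelta(days=1)
    # Without catchup only the most recent completed interval is scheduled
    return runs if catchup else runs[-1:]

start = datetime(2023, 1, 1)
now = datetime(2023, 1, 4, 9, 30)

backfill = daily_runs(start, now, catchup=True)
latest = daily_runs(start, now, catchup=False)
print([d.date().isoformat() for d in backfill])
print([d.date().isoformat() for d in latest])
```

With a `start_date` far in the past, leaving catchup enabled can queue a large number of historical runs, which is rarely what you want for a model training DAG.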

### **Astro With Airflow**

Astro is a managed service for Apache Airflow that simplifies the deployment, scaling, and management of Airflow instances. It provides a user-friendly interface and integrates with various cloud services, making it easier to run and manage Airflow workflows in production.

```python
from airflow import DAG
from airflow.providers.docker.operators.docker import DockerOperator
from datetime import datetime
with DAG(
    "docker_example",
    schedule_interval="@daily",
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:
    docker_task = DockerOperator(
        task_id="run_docker_container",
        image="python:3.9-slim",
        api_version="auto",
        auto_remove=True,
        command="python -c 'print(\"Hello from Docker!\")'",
        docker_url="unix://var/run/docker.sock",
        network_mode="bridge",
    )

    docker_task
```

**Astro with Airflow Example**

The example above uses the `DockerOperator` in Airflow to run a Docker container that executes a simple Python command. The `DockerOperator` runs tasks inside Docker containers, providing isolation and reproducibility for your workflows.


### **Next Day**

**Docker Intro**

https://www.udemy.com/course/complete-mlops-bootcamp-with-10-end-to-end-ml-projects/learn/lecture/45859167#overview
