<a href="https://colab.research.google.com/github/Bharadwaja196/AIHACKATHON/blob/main/MLLABEXP2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# MLOps with DVC and Git

This notebook demonstrates how to use DVC (Data Version Control) and Git to manage a simple machine learning project. We will create a pipeline to train a model on two different versions of a dataset, compare their performance, and select the best model.

**DVC** helps in:
- Versioning large datasets and models.
- Creating reproducible pipelines.
- Managing dependencies between data, code, and models.

**Git** is used for versioning the code, DVC files (`.dvc` files, `dvc.yaml`, `.gitignore`, `.dvcignore`), and other project files.

In [None]:
# Install necessary libraries
%pip install dvc[gdrive] scikit-learn pandas matplotlib joblib PyYAML



## 1. Setup Directories and Initialize Git and DVC

We will create the necessary directory structure for our project and initialize both Git and DVC.

In [None]:
!cd /content
!rm -rf ml-project/ml-project   # delete the extra nested copies inside
!cd ml-project                  # go into your main project
!pwd                            # show current directory
!ls                             # list files/folders here


/content/ml-project/ml-project
data  dvc.lock	dvc.yaml  ml-project  models  results  scripts


In [None]:
# Setup directories and initialize Git and DVC
!mkdir -p ml-project/{data,models,scripts,results}
%cd ml-project
!git init
!dvc init --no-scm # Use --no-scm as we are already in a Git repository

/content/ml-project/ml-project/ml-project
Reinitialized existing Git repository in /content/ml-project/ml-project/ml-project/.git/
[31mERROR[39m: failed to initiate DVC - '.dvc' exists. Use `-f` to force.
[0m

## 2. Configure Git and DVC Ignores

We will configure `.gitignore` to ignore DVC's internal files and our large data/model files, and `.dvcignore` to tell DVC which files to not track.

In [None]:
# Configure Git and DVC ignores
# .gitignore: Ignore DVC cache, large files (data, models), and Python artifacts
!echo ".dvc/cache" > .gitignore
!echo "models/*.pkl" >> .gitignore # Ignore model files
!echo "results/*.txt" >> .gitignore # Ignore results files
!echo "__pycache__/" >> .gitignore
!echo "*.csv" >> .gitignore # Ignore data files from Git
!echo "drive/" >> .gitignore # Ignore Google Drive mount
!echo "*.gdoc" >> .gitignore
!echo "*.gsheet" >> .gitignore

# .dvcignore: Tell DVC to ignore files that are not part of the pipeline
# We explicitly remove *.csv from .dvcignore in a later step to track data files
!echo "__pycache__/" > .dvcignore
!echo "*.pkl" >> .dvcignore
!echo "*.txt" >> .dvcignore

## 3. Fix `.dvcignore` to allow tracking `.csv` files

Previously, we added `*.csv` to `.dvcignore`, which prevented DVC from tracking our data files. We need to remove this line from `.dvcignore`.

In [None]:

# Fix .dvcignore to allow tracking .csv files
# Read the current .dvcignore content
with open(".dvcignore", "r") as f:
    dvcignore_lines = f.readlines()

# Remove the line containing "*.csv"
dvcignore_lines_fixed = [line for line in dvcignore_lines if "*.csv" not in line]

# Write the modified content back to .dvcignore
with open(".dvcignore", "w") as f:
    f.writelines(dvcignore_lines_fixed)

print("✅ Removed '*.csv' from .dvcignore")

# Add the modified .dvcignore to Git
!git add .dvcignore
!git commit -m "Fix: Allow tracking .csv files in .dvcignore"

✅ Removed '*.csv' from .dvcignore
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	[31mmodified:   .gitignore[m
	[31mmodified:   dvc.lock[m

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	[31m.dvc/[m

no changes added to commit (use "git add" and/or "git commit -a")


## 4. Create Dummy Data Files

We will create two versions of a dummy dataset (`iris_v1.csv` and `iris_v2.csv`) based on the Iris dataset. `iris_v2.csv` will have some added noise to simulate a data change.

In [None]:
# Create dummy data files
from sklearn.datasets import load_iris
import pandas as pd
import numpy as np

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name="target")

df_v1 = pd.concat([X, y], axis=1)
df_v1.to_csv("data/iris_v1.csv", index=False)

X_noisy = X.copy()
X_noisy.iloc[:, 0] += np.random.normal(0, 0.5, size=X.shape[0])
df_v2 = pd.concat([X_noisy, y], axis=1)
df_v2.to_csv("data/iris_v2.csv", index=False)

print("✅ iris_v1.csv and iris_v2.csv created successfully.")

✅ iris_v1.csv and iris_v2.csv created successfully.


## 5. Create Training Script

This script (`scripts/train.py`) will load data, train a Logistic Regression model, evaluate it, and save the trained model and metrics.

In [None]:
# Create training script
%%writefile scripts/train.py
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import joblib
import sys
import traceback

try:
    # Check if the correct number of arguments are provided
    if len(sys.argv) != 4:
        print(f"Usage: python {sys.argv[0]} <data_path> <model_path> <metrics_path>", file=sys.stderr)
        sys.exit(1)

    data_path, model_path, metrics_path = sys.argv[1], sys.argv[2], sys.argv[3]
    print(f"Loading data from: {data_path}")
    df = pd.read_csv(data_path)
    X = df.drop("target", axis=1)
    y = df["target"]
    print("Data loaded successfully.")

    print("Training Logistic Regression model...")
    model = LogisticRegression()
    model.fit(X, y)
    print("Model trained successfully.")

    preds = model.predict(X)
    acc = accuracy_score(y, preds)
    print(f"Calculated accuracy: {acc:.4f}")

    print(f"Saving model to: {model_path}")
    joblib.dump(model, model_path)
    print("Model saved successfully.")

    print(f"Saving metrics to: {metrics_path}")
    with open(metrics_path, "w") as f:
        f.write(f"accuracy: {acc:.4f}")
    print("Metrics saved successfully.")

except Exception as e:
    print(f"An error occurred during training: {e}", file=sys.stderr)
    traceback.print_exc(file=sys.stderr)
    sys.exit(1) # Exit with a non-zero code to indicate failure

Overwriting scripts/train.py


## 6. Create Comparison Script

This script (`scripts/compare_metrics.py`) will read the metrics from training on both datasets, compare them, and copy the best performing model to `models/production_model.pkl`.

In [None]:
!cd /content/ml-project && \
dvc init -f && \        # <- no ! here because it's still part of the same shell session
!git init && \
mkdir -p data models results scripts && \
dvc add data/iris_v1.csv && \
git add data/iris_v1.csv.dvc && \
git commit -m "Add iris_v1" && \
dvc add data/iris_v2.csv && \
git add data/iris_v2.csv.dvc && \
git commit -m "Add iris_v2" && \
dvc run -n train_v1 \
  -d scripts/train.py -d data/iris_v1.csv \
  -o models/model_v1.pkl -o results/metrics_v1.txt \
  python scripts/train.py data/iris_v1.csv models/model_v1.pkl results/metrics_v1.txt && \
dvc run -n train_v2 \
  -d scripts/train.py -d data/iris_v2.csv \
  -o models/model_v2.pkl -o results/metrics_v2.txt \
  python scripts/train.py data/iris_v2.csv models/model_v2.pkl results/metrics_v2.txt && \
dvc run -n compare_models \
  -d scripts/compare_metrics.py -d results/metrics_v1.txt -d results/metrics_v2.txt \
  python scripts/compare_metrics.py && \
git add dvc.yaml dvc.lock && \
git commit -m "DVC pipeline for v1 and v2 datasets"


Initialized DVC repository.

You can now commit the changes to git.

[31m+---------------------------------------------------------------------+
[0m[31m|[0m                                                                     [31m|[0m
[31m|[0m        DVC has enabled anonymous aggregate usage analytics.         [31m|[0m
[31m|[0m     Read the analytics documentation (and how to opt-out) here:     [31m|[0m
[31m|[0m             <[36mhttps://dvc.org/doc/user-guide/analytics[39m>              [31m|[0m
[31m|[0m                                                                     [31m|[0m
[31m+---------------------------------------------------------------------+
[0m
[33mWhat's next?[39m
[33m------------[39m
- Check out the documentation: <[36mhttps://dvc.org/doc[39m>
- Get help and share ideas: <[36mhttps://dvc.org/chat[39m>
- Star us on GitHub: <[36mhttps://github.com/iterative/dvc[39m>
[0m/bin/bash: line 1:  : command not found
Reinitialized existing Git r

In [None]:
!cd /content  # go to Colab root
!find . -name "dvc.yaml"

./dvc.yaml


In [None]:
!cd /content/ml-project   # change this path to match the search result
!pwd
!ls


/content/ml-project/ml-project/ml-project
data  dvc.lock	dvc.yaml  models  results  scripts


## 8. Define DVC Pipeline Stages

We will define the DVC pipeline in `dvc.yaml`. This includes stages for training models on both datasets and a stage for comparing the models.

In [None]:
# Remove the existing dvc.yaml to start fresh with stage definitions
!rm -f dvc.yaml

# Define DVC stages for training
!dvc stage add -n train_v1 \
  -d scripts/train.py -d data/iris_v1.csv \
  -o models/model_v1.pkl -o results/metrics_v1.txt \
  --run \
  python scripts/train.py data/iris_v1.csv models/model_v1.pkl results/metrics_v1.txt

!dvc stage add -n train_v2 \
  -d scripts/train.py -d data/iris_v2.csv \
  -o models/model_v2.pkl -o results/metrics_v2.txt \
  --run \
  python scripts/train.py data/iris_v2.csv models/model_v2.pkl results/metrics_v2.txt

# Define DVC stage for comparison
!dvc stage add -n compare_models \
  -d scripts/compare_metrics.py \
  -d results/metrics_v1.txt -d results/metrics_v2.txt \
  -d models/model_v1.pkl -d models/model_v2.pkl \
  -o models/production_model.pkl \
  --run \
  python scripts/compare_metrics.py

print("✅ Defined DVC stages in dvc.yaml")

! [31mERROR[39m: Path '/content/ml-project/ml-project/ml-project/models/model_v1.pkl' is ignored by
.dvcignore:2:*.pkl
[31mERROR[39m: Path '/content/ml-project/ml-project/ml-project/models/model_v2.pkl' is ignored by
.dvcignore:2:*.pkl
[31mERROR[39m: Path '/content/ml-project/ml-project/ml-project/models/production_model.pkl' is ignored by
.dvcignore:2:*.pkl
[0m✅ Defined DVC stages in dvc.yaml


## 9. Add Metrics to `dvc.yaml` for Tracking

We will explicitly add the metrics files to the `metrics` section in `dvc.yaml` so DVC tracks them.

In [None]:
import yaml

try:
    with open("dvc.yaml") as f:
        dvc_config = yaml.safe_load(f)
except FileNotFoundError:
    dvc_config = {}
except yaml.YAMLError as e:
    print(f"Error loading dvc.yaml: {e}", file=sys.stderr)
    dvc_config = {} # Start with an empty config if loading fails


if "metrics" not in dvc_config or not isinstance(dvc_config["metrics"], list):
    dvc_config["metrics"] = []

# Add metrics if not already present
if "results/metrics_v1.txt" not in dvc_config["metrics"]:
    dvc_config["metrics"].append("results/metrics_v1.txt")
if "results/metrics_v2.txt" not in dvc_config["metrics"]:
    dvc_config["metrics"].append("results/metrics_v2.txt")


with open("dvc.yaml", "w") as f:
    yaml.dump(dvc_config, f)

print("✅ Added metrics section to dvc.yaml.")

✅ Added metrics section to dvc.yaml.


## 10. Reproduce the Pipeline

Now we can reproduce the entire pipeline using `dvc repro`. This will execute the stages in the correct order based on their dependencies.

In [None]:
# Reproduce the pipeline
!dvc repro

Verifying data sources in stage: 'data/iris_v1.csv.dvc'
Use `dvc push` to send your updates to remote storage.
[0m

## 11. Show Metrics

We can use `dvc metrics show` to view the tracked metrics.

In [None]:
# Show metrics
!dvc metrics show

DVC failed to load some metrics for following revisions: ''.
Path
results/metrics_v1.txt
results/metrics_v2.txt
[0m

## 12. Add and Commit Changes to Git

We will add all the DVC-related files (`.dvc` files, `dvc.yaml`, `.gitignore`, `.dvcignore`) and our scripts to Git and commit them. The actual data and model files are ignored by Git but tracked by DVC.

In [None]:
# Add and commit changes to Git
!git add .
!git commit -m "Initial DVC pipeline setup with data, scripts, and stages"

[master b985720] Initial DVC pipeline setup with data, scripts, and stages
 9 files changed, 14 insertions(+), 39 deletions(-)
 create mode 100644 .dvc/config
 create mode 100644 .dvc/tmp/btime
 create mode 100644 .dvc/tmp/lock
 create mode 100644 .dvc/tmp/rwlock
 create mode 100644 .dvc/tmp/rwlock.lock
 rewrite dvc.yaml (98%)


In [None]:
ls /content/ml-project


[0m[01;34mdata[0m/  [01;34mml-project[0m/  [01;34mmodels[0m/  [01;34mresults[0m/  [01;34mscripts[0m/


In [None]:
cd /content/ml-project/ml-project


/content/ml-project/ml-project


In [None]:
!ls -l dvc.yaml
!ls -l scripts/train.py
!ls -l scripts/compare_metrics.py
!ls -l data/iris_v1.csv


-rw-r--r-- 1 root root 796 Aug 12 12:38 dvc.yaml
-rw-r--r-- 1 root root 1368 Aug 12 12:38 scripts/train.py
-rw-r--r-- 1 root root 477 Aug 12 12:08 scripts/compare_metrics.py
-rw-r--r-- 1 root root 2778 Aug 12 13:24 data/iris_v1.csv


In [None]:
import pandas as pd

# Step 0: Remove CSV ignore rule from .dvcignore
!sed -i '/\.csv/d' .dvcignore
print("✅ Removed '*.csv' rule from .dvcignore (so DVC will track CSV changes)")

# Step 1: Show metrics before change
print("\n📊 Metrics BEFORE change:")
!dvc metrics show

# Step 2: Modify first 3 rows in iris_v1.csv
df = pd.read_csv("data/iris_v1.csv")
df.iloc[0:3, 0] = df.iloc[0:3, 0] + 1.0
df.to_csv("data/iris_v1.csv", index=False)
print("✅ Modified data/iris_v1.csv")

# Step 3: Track change with DVC and commit to Git
!dvc add data/iris_v1.csv
!git add data/iris_v1.csv.dvc
!git commit -m "Updated iris_v1 dataset"

# Step 4: Force re-run only train_v1 stage
!dvc repro --force train_v1

# Step 5: Show metrics after change
print("\n📊 Metrics AFTER change:")
!dvc metrics show

# Step 6: Compare models
!python scripts/compare_metrics.py


✅ Removed '*.csv' rule from .dvcignore (so DVC will track CSV changes)

📊 Metrics BEFORE change:
Path                    accuracy
results/metrics_v1.txt  0.9733
results/metrics_v2.txt  0.98
[0m✅ Modified data/iris_v1.csv
[?25l[32m⠋[0m Checking graph
Adding...:   0% 0/1 [00:00<?, ?file/s{'info': ''}]
![A
          |0.00 [00:00,     ?file/s][A
                                    [A
![A
  0% |          |0/? [00:00<?,    ?files/s][A
                                           [A
Adding data/iris_v1.csv to cache:   0% 0/1 [00:00<?, ?file/s][A
Adding data/iris_v1.csv to cache:   0% 0/1 [00:00<?, ?file/s{'info': ''}][A
                                                                         [A
  0% 0/1 [00:00<?, ?files/s][A
  0% 0/1 [00:00<?, ?files/s{'info': ''}][A
Adding...: 100% 1/1 [00:00<00:00, 48.30file/s{'info': ''}]
[0m[master df54ad8] Updated iris_v1 dataset
 1 file changed, 2 insertions(+), 2 deletions(-)
Verifying data sources in stage: 'data/iris_v1.csv.dvc'
       