# Version Control and Reproducibility

In any machine learning project, **version control** and **reproducibility** are essential. They ensure that your work can be **tracked, shared, and replicated** by others (or by you in the future).

In MLOps, this goes beyond just code — it includes **datasets, models, metrics, and experiments**.

## 🎯 Learning Objectives

By the end of this notebook, you will:
- Understand why version control is crucial for ML.
- Learn how Git and DVC work together to manage code and data.
- Explore reproducibility principles in ML experiments.
- See practical examples of Git and DVC commands in action.

## 🧠 What is Version Control?

**Version Control** refers to tracking and managing changes to your code or project files over time.

In ML projects, we often have multiple versions of:
- Code (Python scripts, notebooks)
- Datasets
- Trained models
- Configuration files

**Git** is the most widely used tool for version control in software development, and **DVC (Data Version Control)** extends this concept to handle large files and datasets.

## ⚙️ Git Basics for ML Projects

Git helps track code changes and collaborate with others. Here are common commands:

```bash
# Initialize a Git repository
git init

# Stage and commit files
git add .
git commit -m "Initial commit"

# Create a branch and switch to it
git branch dev
git checkout dev

# Merge dev back into main (assumes the default branch is named main)
git checkout main
git merge dev

# Push to remote
git remote add origin https://github.com/username/mlops-project.git
git push -u origin main
```

Git works well for **code and small text files**, but large datasets and model binaries bloat the repository and slow down every clone. That's where **DVC** comes in.
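One way to decide which files belong in DVC rather than Git is to scan the project for files above a size threshold. The sketch below is a minimal illustration: the helper name `find_large_files` and the 1 KB demo threshold are assumptions made for this example, and a real project would pick a threshold that suits its hosting limits.

```python
import os
import tempfile
from pathlib import Path


def find_large_files(root, threshold_bytes):
    """Yield (path, size) for files under root at or above the threshold."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip the internal directories that Git and DVC manage themselves
        dirnames[:] = [d for d in dirnames if d not in {".git", ".dvc"}]
        for name in filenames:
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size
            except OSError:
                continue  # e.g. a broken symlink
            if size >= threshold_bytes:
                yield path, size


# Demo on a throwaway directory: one small script, one larger binary
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "train.py").write_text("print('hello')\n")
    (Path(tmp) / "weights.bin").write_bytes(b"\0" * 2048)
    for path, size in find_large_files(tmp, threshold_bytes=1024):
        print(f"{size:>8} bytes  {path.name}")
```

Files reported by a scan like this are candidates for `dvc add`; everything else can stay in Git.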

## 📦 DVC (Data Version Control)

**DVC** is an open-source tool that extends Git for handling large datasets, models, and pipelines.

It lets you version data and models without storing them in Git. Instead, Git tracks small `.dvc` **pointer** files, while the data itself lives in remote storage (such as AWS S3 or Google Drive).
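To make the pointer idea concrete, here is a simplified Python sketch of content-based tracking. It is not DVC's implementation or file format: the `.pointer.json` layout and the function names are hypothetical, chosen only to show how a small file holding a content hash can stand in for a large dataset.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer(data_path):
    """Write a small JSON pointer next to the data file (hypothetical format)."""
    data_path = Path(data_path)
    pointer = {
        "path": data_path.name,
        "sha256": file_sha256(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path


def matches_pointer(data_path, pointer_path):
    """Check whether the data file still has the content the pointer recorded."""
    pointer = json.loads(Path(pointer_path).read_text())
    return pointer["sha256"] == file_sha256(data_path)


with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "raw_data.csv"
    data.write_text("a,b\n1,2\n")
    pointer = write_pointer(data)
    print(matches_pointer(data, pointer))  # True: data is unchanged
    data.write_text("a,b\n1,3\n")
    print(matches_pointer(data, pointer))  # False: data was modified
```

Only the tiny pointer file needs to be committed to Git. Any change to the data changes the hash, so each commit identifies exactly one version of the dataset.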

### Basic DVC Workflow
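```bash
# Initialize DVC inside your project (must already be a Git repository)
dvc init
git commit -m "Initialize DVC"

# Add data files
dvc add data/raw_data.csv

# Commit the pointer file (dvc add also writes data/.gitignore)
git add data/raw_data.csv.dvc data/.gitignore
git commit -m "Add raw dataset tracking with DVC"

# Configure remote storage and push the dataset
dvc remote add -d myremote s3://mybucket/data
git add .dvc/config
git commit -m "Configure DVC remote"
dvc push
```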

When someone clones your repo and has access to the remote storage, they can retrieve the data by running:

```bash
dvc pull
```

## 📊 Reproducibility in Machine Learning

Reproducibility means that someone can **re-run your code on the same data and environment and get the same results**. In ML, it's vital for transparency and reliability.

### Factors Affecting Reproducibility:
- **Random seeds** not fixed.
- **Different library versions**.
- **Untracked datasets or model weights**.
- **Non-deterministic GPU operations**.
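The first factor is easy to demonstrate with the standard library alone. In this sketch, `noisy_metric` is a hypothetical stand-in for a training run whose result depends on random numbers; it is not part of any library.

```python
import random


def noisy_metric(seed=None, n=5):
    """Stand-in for a training run: returns n pseudo-random 'scores'."""
    rng = random.Random(seed)  # seed=None seeds from system entropy or time
    return [round(rng.random(), 6) for _ in range(n)]


unseeded_a = noisy_metric()
unseeded_b = noisy_metric()
seeded_a = noisy_metric(seed=42)
seeded_b = noisy_metric(seed=42)

print("unseeded runs match:", unseeded_a == unseeded_b)
print("seeded runs match:  ", seeded_a == seeded_b)
```

Two runs with the same seed produce identical sequences, while unseeded runs will almost always differ.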

### How to Ensure Reproducibility:
1. Fix random seeds:
   ```python
   import numpy as np
   import torch
   import random
   random.seed(42)
   np.random.seed(42)
   torch.manual_seed(42)
   ```
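2. Log your **environment** and **dependencies** using `requirements.txt` or `conda env export`.
3. Use tools like **MLflow**, **Weights & Biases**, or **DVC Experiments** for tracking parameters and results.
4. Version **datasets and model weights** with DVC so every run points at exact inputs.
5. For GPU training, enable deterministic operations where your framework supports them (for example, `torch.use_deterministic_algorithms(True)` in PyTorch).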
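The environment-logging step can also be done from inside Python, which is handy when you want to store the snapshot alongside each experiment. This is a minimal sketch using only the standard library; the function name `snapshot_environment` and the package list are illustrative assumptions.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment(packages):
    """Collect interpreter, OS, and installed versions for the given packages."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": versions,
    }


env = snapshot_environment(["numpy", "scikit-learn", "torch", "mlflow"])
print(json.dumps(env, indent=2))
```

Saving this JSON next to your results lets you see later which library versions produced them.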

```python
# 🧪 Example: Reproducible ML Experiment with Scikit-Learn
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import numpy as np
import random

# Set seeds for reproducibility
np.random.seed(42)
random.seed(42)

# Load data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Evaluate
preds = model.predict(X_test)
acc = accuracy_score(y_test, preds)
print(f"Accuracy: {acc:.4f}")
```

## 🧩 MLflow for Experiment Tracking

You can use **MLflow** to record experiment parameters and metrics for reproducibility.

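```python
import mlflow
import mlflow.sklearn

# `model` and `acc` come from the scikit-learn example above
with mlflow.start_run():
    # Log the settings needed to reproduce the run
    mlflow.log_param('n_estimators', 100)
    mlflow.log_param('random_state', 42)
    mlflow.log_metric('accuracy', acc)
    mlflow.sklearn.log_model(model, 'rf_model')
```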

This creates a **recorded run** that can be revisited later, which makes results traceable and easier to reproduce.
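To show what a recorded run boils down to, here is a standard-library sketch that stores parameters, metrics, and the current Git commit as one JSON file per run. It is not MLflow's storage format: the file layout and the helper names `save_run` and `best_run` are hypothetical, and the accuracy values are placeholders rather than measured results.

```python
import json
import subprocess
import tempfile
import time
from pathlib import Path


def current_commit():
    """Return the current Git commit hash, or None outside a repository."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (subprocess.CalledProcessError, FileNotFoundError):
        return None


def save_run(run_dir, params, metrics):
    """Write one JSON file per run (hypothetical layout, not MLflow's)."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "commit": current_commit(),
        "params": params,
        "metrics": metrics,
    }
    index = len(list(run_dir.glob("run_*.json")))
    path = run_dir / f"run_{index:04d}.json"
    path.write_text(json.dumps(record, indent=2))
    return path


def best_run(run_dir, metric):
    """Load every saved run and return the one with the highest metric."""
    runs = [json.loads(p.read_text()) for p in Path(run_dir).glob("run_*.json")]
    return max(runs, key=lambda r: r["metrics"][metric])


with tempfile.TemporaryDirectory() as tmp:
    # Placeholder numbers for illustration only
    save_run(tmp, {"n_estimators": 50}, {"accuracy": 0.90})
    save_run(tmp, {"n_estimators": 100}, {"accuracy": 0.95})
    print(best_run(tmp, "accuracy")["params"])
```

Storing the commit hash with each run ties the result back to the exact code version. Tools like MLflow handle this bookkeeping for you and add a UI on top.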

## Summary

- Version control is the foundation of **collaboration and reproducibility**.
- Use **Git** for code and **DVC** for data and models.
- Always fix random seeds and track environments.
- Use tools like **MLflow** for experiment tracking.

Next up → **[02-Data_Versioning_with_DVC.ipynb](./02-Data_Versioning_with_DVC.ipynb)**: Learn how to manage datasets efficiently in ML projects.