# Assignment: Track Experiment Versions using DVC

## Objective
This assignment focuses on leveraging Data Version Control (DVC) to track and manage different versions of data, models, and experiment parameters for a machine learning project. You will learn to use DVC alongside Git to ensure reproducibility and enable seamless collaboration on ML experiments.

## Part 1: Environment Setup and Project Initialization (25 Marks)

1.  **Environment Setup:**
    * Create a new Python virtual environment.
    * Install the necessary libraries: `dvc`, `scikit-learn`, `pandas`, `numpy`, and `matplotlib` (for basic visualization). Git is not a Python package, so make sure it is installed and configured on your system separately.
    * Provide a `requirements.txt` file.

2.  **Git Repository Initialization:**
    * Create a new, empty directory for your project (e.g., `dvc_ml_project`).
    * Initialize a Git repository within this directory (`git init`).
    * Make an initial commit (e.g., "Initial commit - empty project").
    * (Optional, but recommended for full experience) Create a remote repository on GitHub/GitLab and push your initial commit.

3.  **DVC Initialization:**
    * Initialize DVC within your Git repository (`dvc init`).
    * Configure a **local** DVC cache for this assignment (e.g., `dvc config cache.dir ../.dvc/cache`). This keeps the setup simple because no cloud storage is needed. Briefly explain how the setup would differ if you stored data in a cloud remote (e.g., one registered with `dvc remote add`).
    * Make a Git commit with the `.dvc` directory and `.dvcignore` file (`git commit -m "Initialize DVC"`).

4.  **Data Acquisition:**
    * Download a small, well-known classification or regression dataset (e.g., Iris, Wine, Diabetes, or a small CSV from Kaggle) and save it as a CSV file (e.g., `data/raw_data.csv`) within your project directory.
    * Describe your chosen dataset: what it represents, its size, and the task it's suitable for.

5.  **DVC Add Data:**
    * Add your raw data file to DVC (`dvc add data/raw_data.csv`).
    * Explain what `dvc add` does (creates `.dvc` file, moves data to cache, updates `.gitignore`).
    * Stage the files that `dvc add` generated (`git add data/raw_data.csv.dvc data/.gitignore`) and make a Git commit: `git commit -m "Add raw data"`.
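
To make the explanation of `dvc add` more concrete, the following Python sketch illustrates the idea of content-addressed storage: a file's MD5 hash determines where its copy lives in the cache, and the small `.dvc` file records that hash. This is a simplified illustration, not DVC's actual implementation. The function names `file_md5` and `cache_path` are hypothetical, and the real cache layout differs between DVC versions.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(md5_hex, cache_root=".dvc/cache"):
    """Illustrative content-addressed location: <root>/<first 2 chars>/<rest>.

    Assumption: this mirrors the general idea only; the exact directory
    layout depends on the DVC version.
    """
    return Path(cache_root) / md5_hex[:2] / md5_hex[2:]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "raw_data.csv"
        sample.write_bytes(b"hello")  # stand-in for real CSV content
        md5_hex = file_md5(sample)
        print(md5_hex)
        print(cache_path(md5_hex))
```

Because the location depends only on the content, two identical files share one cache entry, and any change to the data yields a new hash and therefore a new tracked version.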

In [None]:
# Your commands for environment setup, Git/DVC initialization, and data acquisition.
# Description of your dataset.
# Git/DVC commands and explanation of their effects.
# Show `git log` output to verify commits.

## Part 2: Data Preprocessing and Featurization (30 Marks)

1.  **Preprocessing Script:**
    * Create a Python script (`src/prepare.py`) that performs basic data preprocessing:
        * Loads `data/raw_data.csv`.
        * Splits the data into features (X) and target (y).
        * Performs a train-test split (e.g., 80/20).
        * Saves the processed data (e.g., `data/X_train.csv`, `data/X_test.csv`, `data/y_train.csv`, `data/y_test.csv`).
    * Provide the full code for `src/prepare.py`.

2.  **DVC Stage for Preprocessing:**
    * Define a DVC stage for your preprocessing step using `dvc stage add` (this replaces `dvc run`, which was deprecated in DVC 2.x and removed in DVC 3.0).
    * The stage should:
        * Name: `prepare`
        * Dependencies: `data/raw_data.csv`, `src/prepare.py`
        * Outputs: `data/X_train.csv`, `data/X_test.csv`, `data/y_train.csv`, `data/y_test.csv`
        * Command: `python src/prepare.py`
    * Explain what a DVC stage (`dvc.yaml`) represents and its components.
    * Run the DVC stage (`dvc repro`).
    * Make a Git commit that includes the generated `dvc.yaml` and `dvc.lock` files: `git commit -m "Add data preparation stage"`.
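
The preprocessing script from item 1 could be sketched as follows. This is one possible shape, not a required solution. It assumes the label lives in a column named `target` (a hypothetical name; adjust it to your dataset), and the `test_size` and `random_state` values are chosen only for illustration.

```python
"""src/prepare.py: split the raw data into train/test CSV files (sketch)."""
from pathlib import Path

import pandas as pd
from sklearn.model_selection import train_test_split

TARGET_COLUMN = "target"  # assumption: name of the label column in your CSV


def prepare(raw_path="data/raw_data.csv", out_dir="data",
            test_size=0.2, random_state=42):
    """Load the raw CSV, split it 80/20, and write four CSV files."""
    df = pd.read_csv(raw_path)
    X = df.drop(columns=[TARGET_COLUMN])
    y = df[[TARGET_COLUMN]]  # one-column DataFrame so the header is kept

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    X_train.to_csv(out / "X_train.csv", index=False)
    X_test.to_csv(out / "X_test.csv", index=False)
    y_train.to_csv(out / "y_train.csv", index=False)
    y_test.to_csv(out / "y_test.csv", index=False)
    return len(X_train), len(X_test)


if __name__ == "__main__":
    if Path("data/raw_data.csv").exists():
        n_train, n_test = prepare()
        print(f"train rows: {n_train}, test rows: {n_test}")
    else:
        print("data/raw_data.csv not found; run this from the project root")
```

Fixing `random_state` keeps the split deterministic, so `dvc repro` produces the same output files whenever the inputs are unchanged.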

In [None]:
# Full code for `src/prepare.py`.
# DVC commands and explanation of the stage.
# Show `dvc status` and `git log` output.
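
The `prepare` stage specified in this part can be registered from a notebook cell by assembling the `dvc stage add` command (the replacement for `dvc run` in current DVC releases) in Python. The sketch below only builds and prints the command. The helper name `stage_add_command` is hypothetical, and the commented-out `subprocess.run` call assumes that a DVC version providing `dvc stage add` is on your `PATH` and that you run it from the project root.

```python
import shlex
import subprocess


def stage_add_command(name, deps, outs, command, metrics=()):
    """Build the argv list for `dvc stage add` from the parts of a stage."""
    argv = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        argv += ["-d", dep]
    for out in outs:
        argv += ["-o", out]
    for metric in metrics:
        argv += ["-M", metric]  # -M: metrics file tracked by Git, not cached by DVC
    return argv + shlex.split(command)


prepare_cmd = stage_add_command(
    name="prepare",
    deps=["data/raw_data.csv", "src/prepare.py"],
    outs=["data/X_train.csv", "data/X_test.csv",
          "data/y_train.csv", "data/y_test.csv"],
    command="python src/prepare.py",
)
print(shlex.join(prepare_cmd))

# To actually register the stage (this writes dvc.yaml), uncomment:
# subprocess.run(prepare_cmd, check=True)
```

Each `-d` entry becomes a dependency and each `-o` entry an output in `dvc.yaml`; `dvc repro` then re-runs the command only when a dependency has changed.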

## Part 3: Model Training and Evaluation (30 Marks)

1.  **Training Script:**
    * Create a Python script (`src/train.py`) that:
        * Loads the processed training and test data (`data/X_train.csv`, `data/y_train.csv`, etc.).
        * Trains a simple machine learning model (e.g., `LogisticRegression`, `DecisionTreeClassifier`, `RandomForestClassifier`) using `X_train` and `y_train`.
        * Evaluates the model on `X_test` and `y_test`, calculating at least accuracy and F1-score (if you chose a regression dataset, use a regressor and regression metrics such as MSE and R² instead).
        * Saves the trained model (e.g., using `joblib.dump` or `pickle`) to `models/model.pkl`.
        * Saves evaluation metrics to a JSON file (e.g., `metrics.json`).
    * Provide the full code for `src/train.py`.

2.  **DVC Stage for Training:**
    * Define a DVC stage for your training step using `dvc stage add` (this replaces `dvc run`, which was deprecated in DVC 2.x and removed in DVC 3.0).
    * The stage should:
        * Name: `train`
        * Dependencies: Your processed data files (`data/X_train.csv`, etc.), `src/train.py`
        * Outputs: `models/model.pkl`
        * Metrics: `metrics.json`
        * Command: `python src/train.py`
    * Run the DVC stage (`dvc repro`).
    * Make a Git commit that includes the updated `dvc.yaml` and `dvc.lock` files: `git commit -m "Add model training stage"`.

3.  **DVC Metrics and Visualization:**
    * Use `dvc metrics show` to display your recorded metrics.
    * Show the output of `dvc metrics show`.
    * Explain how DVC tracks and allows viewing of metrics.
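
For the metrics file, DVC reads a plain JSON (or YAML) document of named numeric values. The sketch below shows one way `src/train.py` might write it. The `save_metrics` helper and the numbers in the usage example are hypothetical placeholders; in the real script the values would come from your evaluation step (for example `sklearn.metrics.accuracy_score` and `f1_score`).

```python
import json
import tempfile
from pathlib import Path


def save_metrics(metrics, path="metrics.json"):
    """Write a flat dict of metric names to float values as JSON."""
    # float() also converts NumPy scalars, which json cannot serialize directly.
    clean = {name: float(value) for name, value in metrics.items()}
    Path(path).write_text(json.dumps(clean, indent=2) + "\n")
    return clean


if __name__ == "__main__":
    # Placeholder values for illustration only, not real results.
    example = {"accuracy": 0.9, "f1_score": 0.88}
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "metrics.json"
        save_metrics(example, target)
        print(target.read_text())
```

Keeping the file flat and numeric makes the output of `dvc metrics show` easy to read, with one column per metric name.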

In [None]:
# Full code for `src/train.py`.
# DVC commands for the training stage and metrics.
# Output of `dvc metrics show`.

## Part 4: Experiment Tracking and Reproducibility (15 Marks)

1.  **Modify and Retrain:**
    * Make a small change to one of your pipeline scripts. For example:
        * Change a model hyperparameter in `src/train.py` (e.g., `max_depth` for a Decision Tree).
        * Change the random state for the train-test split in `src/prepare.py`.
    * Re-run the DVC pipeline to update the model and metrics (`dvc repro`).
    * Make a Git commit: `git commit -m "Experiment 2: Changed model hyperparameter"`.

2.  **Compare Experiments:**
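    * Use `dvc metrics diff` to compare the metrics between your initial training run and the new run. With no arguments the command compares the workspace against `HEAD`, so once the new run is committed, pass the revisions explicitly (e.g., `dvc metrics diff HEAD~1 HEAD`).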
    * Use `dvc dag` to visualize the pipeline dependency graph.
    * Show the output of `dvc metrics diff` and `dvc dag`.
    * Analyze the changes in metrics and explain how `dvc metrics diff` helps in comparing experiments.

3.  **Reproducibility (Bonus - 5 Marks):**
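    * Check out the Git commit of your first experiment (e.g., `git checkout [first_commit_hash]`, where the hash belongs to the "Add model training stage" commit rather than the empty initial commit).
    * Demonstrate how to restore the matching data and model for that version (`dvc checkout` if you use only the local cache, or `dvc pull` if you configured a remote), and then run `dvc repro` to regenerate the exact results from that past experiment.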
    * Explain how DVC and Git together ensure full reproducibility of any experiment version.

4.  **Reflection:**
    * Discuss the benefits of using DVC for MLOps, especially in terms of data versioning, experiment tracking, and reproducibility.
    * How does DVC complement Git in an ML workflow?
    * What challenges did you face, if any, and how did you overcome them?
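
For the comparison in item 2, the per-metric change that `dvc metrics diff` reports can be mimicked in a few lines of Python, which may help when interpreting its output. This is an illustrative sketch, not DVC's implementation: `diff_metrics` is a hypothetical helper, and the two dictionaries hold made-up values standing in for the `metrics.json` contents of two commits.

```python
def diff_metrics(old, new):
    """Return {metric: (old, new, change)} for metrics present in both runs."""
    return {
        name: (old[name], new[name], new[name] - old[name])
        for name in sorted(old.keys() & new.keys())
    }


if __name__ == "__main__":
    # Made-up numbers for illustration; substitute your own metrics.json values.
    baseline = {"accuracy": 0.50, "f1_score": 0.25}
    experiment = {"accuracy": 0.75, "f1_score": 0.50}
    for name, (before, after, change) in diff_metrics(baseline, experiment).items():
        print(f"{name:<10} {before:>6.2f} {after:>6.2f} {change:>+7.2f}")
```

A positive change means the metric increased in the newer run; whether that is an improvement depends on the metric (higher is better for accuracy, lower is better for an error measure).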

In [None]:
# Your code for modifying scripts and re-running DVC.
# Output of `dvc metrics diff` and `dvc dag`.
# (For bonus, show commands for reverting and reproducing a past state).
# Your written reflection.

## Submission Guidelines

* Submit this Jupyter Notebook (.ipynb file) with all cells executed and outputs visible.
* Submit a `.zip` archive or a link to a Git repository containing your full DVC project (including `.dvc` files, `src/` folder, `data/` folder, `models/` folder, `metrics.json`, `dvc.yaml`, `README.md`, `.gitignore`, and `requirements.txt`).
* Ensure your Git history is clean and demonstrates the DVC stages and experiments through commits.
* Ensure your code is well-commented and easy to understand.
* All commands and outputs should be clearly presented as requested.
* Make sure your project can be cloned and run successfully by the instructor.