# Module 2 Exercises: Reproducibility

**Objective**: Master Git, DVC, and MLflow through interactive debugging and challenges.

---

## 🛠️ Setup

```bash
pip install dvc mlflow
```

## 🐛 Part 1: The Bug Hunt

**Scenario**: A junior engineer wrote these snippets to track data and experiments. Each one either fails or follows a bad practice. **Find the mistake.**

### Bug 1: The Git Disaster
```bash
# Intent: Track 50GB dataset
git init
git add data/huge_dataset.csv
git commit -m "Add data"
dvc add data/huge_dataset.csv
```

<details>
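<summary><b>🔻 Click to Reveal Bug</b></summary>
<br>
<b>Mistake:</b> They ran <code>git add</code> before <code>dvc add</code>.
<br>
<b>Consequence:</b> The 50GB file is now in Git history, and GitHub will reject the push because it blocks files larger than 100 MB. <code>dvc add</code> also refuses to track a file that Git is already tracking.
<br>
<b>Fix:</b> Run <code>git rm --cached data/huge_dataset.csv</code>, then <code>dvc add ...</code>. Because the file was already committed, also rewrite that commit (for example with <code>git commit --amend</code>) so the file leaves the history before you push.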
</details>
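
One way to catch this mistake before it reaches a commit is to scan the working tree for oversized files first. The sketch below is only an illustration: `find_large_files`, `DEFAULT_LIMIT_BYTES`, and the demo file names are hypothetical, and the 1 KB limit in the demo is chosen so the example runs instantly.

```python
import os
import tempfile

# Assumed threshold for illustration; GitHub blocks files above 100 MB.
DEFAULT_LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit_bytes=DEFAULT_LIMIT_BYTES):
    """Return sorted (relative_path, size_in_bytes) pairs for files larger than limit_bytes."""
    large = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip Git and DVC internals; they are never candidates for `git add`.
        dirnames[:] = [d for d in dirnames if d not in (".git", ".dvc")]
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > limit_bytes:
                large.append((os.path.relpath(path, root), size))
    return sorted(large)


# Demo on a throwaway directory with a tiny 1 KB limit so it runs instantly.
with tempfile.TemporaryDirectory() as repo:
    os.makedirs(os.path.join(repo, "data"))
    with open(os.path.join(repo, "data", "huge_dataset.csv"), "wb") as fh:
        fh.write(b"x" * 4096)
    with open(os.path.join(repo, "README.md"), "wb") as fh:
        fh.write(b"small")
    flagged = find_large_files(repo, limit_bytes=1024)

for rel_path, size in flagged:
    print(f"{rel_path} is {size} bytes: track it with DVC, not Git")
```

Anything the scan flags is a candidate for `dvc add` rather than `git add`.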

### Bug 2: The MLflow Mystery
```python
import mlflow

mlflow.start_run()
mlflow.log_param("lr", 0.01)
accuracy = 0.85
# ... script ends ...
```

<details>
<summary><b>🔻 Click to Reveal Bug</b></summary>
<br>
<b>Mistake:</b> They never closed the run with <code>mlflow.end_run()</code> (or a context manager).
<br>
<b>Consequence:</b> In a long-lived session such as a notebook, the run can stay "Running" in the UI, and later logging calls may end up attached to this run instead of a new one.
<br>
<b>Fix:</b> Use <code>with mlflow.start_run():</code> so the run is closed when the block exits, even if the code inside raises.
</details>
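
To see why the context manager is the safer pattern, it can help to look at the mechanism without MLflow in the way. The sketch below is a stand-in written with `contextlib`, not MLflow's API: `start_run`, `active_runs`, and `finished_runs` are hypothetical names that only mimic the open/close bookkeeping of a tracking client.

```python
from contextlib import contextmanager

# Stand-in bookkeeping for a tracking client; NOT MLflow's real internals.
active_runs = []
finished_runs = []


@contextmanager
def start_run(name):
    """Open a run and guarantee it is closed, even if the body raises."""
    run = {"name": name, "params": {}, "status": "RUNNING"}
    active_runs.append(run)
    try:
        yield run
        run["status"] = "FINISHED"
    except Exception:
        run["status"] = "FAILED"
        raise
    finally:
        # Runs on both the success path and the failure path.
        active_runs.remove(run)
        finished_runs.append(run)


with start_run("ok") as run:
    run["params"]["lr"] = 0.01

try:
    with start_run("crashes") as run:
        raise ValueError("training blew up")
except ValueError:
    pass

print([(r["name"], r["status"]) for r in finished_runs])
# -> [('ok', 'FINISHED'), ('crashes', 'FAILED')]
print(len(active_runs))
# -> 0
```

Neither run is left open, including the one that crashed, which is the guarantee a bare `start_run()` call without a matching `end_run()` does not give you.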

## 🧠 Part 2: Rapid Fire Quiz

Run the cell below to test your knowledge.

```python
def quiz():
    questions = [
        {
            "q": "Which file stores the S3 path and MD5 hash of your data?",
            "options": ["A) dvc.yaml", "B) data.dvc", "C) .gitignore"],
            "ans": "B"
        },
        {
            "q": "In MLflow, what is an 'Artifact'?",
            "options": ["A) A metric like accuracy", "B) A parameter like lr", "C) A file like model.pkl or plot.png"],
            "ans": "C"
        }
    ]

    score = 0
    for i, item in enumerate(questions):
        print(f"\nQ{i+1}: {item['q']}")
        for opt in item['options']:
            print(opt)
        user_ans = input("Your Answer (A/B/C): ").upper()
        if user_ans == item['ans']:
            print("✅ Correct!")
            score += 1
        else:
            print(f"❌ Wrong. Correct was {item['ans']}")

    print(f"\nFinal Score: {score}/{len(questions)}")

# Uncomment to run
# quiz()
```

## 🛠️ Part 3: Challenge - The DVC Workflow

**Task**: You have a file `data/raw.csv`. You process it into `data/processed.csv` using `src/process.py`.
Write the DVC command that defines this step as a **reproducible pipeline stage**.

<details>
<summary><b>🔻 Click for Solution</b></summary>
<br>
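<pre><code>dvc stage add -n process_data \
-d data/raw.csv -d src/process.py \
-o data/processed.csv \
python src/process.py data/raw.csv data/processed.csv
dvc repro</code></pre>
<code>dvc stage add</code> writes the stage to <code>dvc.yaml</code>, and <code>dvc repro</code> runs it. On DVC versions before 3.0, <code>dvc run</code> with the same flags did both in one step; it was removed in DVC 3.0.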
</details>
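
What makes a stage "reproducible" is that DVC can tell when a dependency has changed and the stage needs to run again. The sketch below illustrates that idea with plain content hashing. It is a simplification under stated assumptions, not DVC's implementation: `file_md5`, `stage_is_stale`, and the `lock` dictionary (a stand-in for the hashes DVC records in `dvc.lock`) are hypothetical.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """MD5 of a file's contents, read in chunks so large files fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_stale(deps, recorded):
    """True if any dependency's current hash differs from the recorded one."""
    return any(file_md5(path) != recorded.get(path) for path in deps)


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw.csv")
    with open(raw, "w") as fh:
        fh.write("id,value\n1,10\n")

    # Hypothetical stand-in for the hashes recorded after the last run.
    lock = {raw: file_md5(raw)}
    stale_before_edit = stage_is_stale([raw], lock)

    # Append a row: the content hash changes, so the stage must rerun.
    with open(raw, "a") as fh:
        fh.write("2,20\n")
    stale_after_edit = stage_is_stale([raw], lock)

print(stale_before_edit, stale_after_edit)
# -> False True
```

Listing both `data/raw.csv` and `src/process.py` as `-d` dependencies matters for the same reason: a change to either the data or the code should invalidate `data/processed.csv`.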