# Data Versioning with DVC

The objective of this notebook is to set up DVC for data version control and to document every change made to the dataset.

Contents:
1. DVC Initialization Scripts
2. Dataset Inventory
3. Version History Documentation
4. Change Log Creation
5. DVC Workflow Guide

## Setup and Imports

In [None]:
import pandas as pd
import numpy as np
from pathlib import Path
from datetime import datetime
import subprocess
import os

# Define project paths
ROOT = Path.cwd().parent if (Path.cwd().parent / 'data').exists() else Path.cwd()
DATA_RAW = ROOT / "data" / "raw"
DATA_PROC = ROOT / "data" / "processed"
SCRIPTS = ROOT / "scripts"
REPORTS = ROOT / "reports"

# Create directories
for p in [DATA_RAW, DATA_PROC, SCRIPTS, REPORTS]:
    p.mkdir(parents=True, exist_ok=True)

print(f"{'='*80}")
print("DATA VERSIONING WITH DVC")
print(f"{'='*80}")
print(f"\nProject Root: {ROOT}")
print(f"Data Raw: {DATA_RAW}")
print(f"Data Processed: {DATA_PROC}")
print(f"Scripts: {SCRIPTS}")
print(f"Reports: {REPORTS}")

DATA VERSIONING WITH DVC

Project Root: /Users/lia/Desktop/Fase1
Data Raw: /Users/lia/Desktop/Fase1/data/raw
Data Processed: /Users/lia/Desktop/Fase1/data/processed
Scripts: /Users/lia/Desktop/Fase1/scripts
Reports: /Users/lia/Desktop/Fase1/reports


## Create DVC Initialization Script

In [2]:
# Create init_dvc.sh script
init_dvc_script = """#!/bin/bash
# DVC Initialization Script
# This script sets up DVC for data version control

set -euo pipefail

echo "=================================================="
echo "INITIALIZING DVC FOR DATA VERSION CONTROL"
echo "=================================================="

# Check if DVC is installed
if ! command -v dvc &> /dev/null; then
    echo "ERROR: DVC is not installed"
    echo "Install with: pip install dvc"
    exit 1
fi

# Check if git repository exists
if [ ! -d .git ]; then
    echo "ERROR: Not a git repository"
    echo "Initialize git first: git init && git add . && git commit -m 'init repo'"
    exit 1
fi

# Initialize DVC
echo ""
echo "Step 1: Initializing DVC..."

# Under 'set -e' a failing command aborts the script immediately,
# so test the command directly instead of inspecting $? afterwards
if dvc init; then
    echo "✓ DVC initialized successfully"
else
    echo "ERROR: DVC initialization failed"
    exit 1
fi

echo ""
echo "Step 2: Configuring DVC..."
# Set autostage to true (automatically stage DVC files)
dvc config core.autostage true
echo "✓ DVC configuration complete"

echo ""
echo "Step 3: Adding DVC files to git..."
git add .dvc .dvcignore || true
git commit -m "chore: init dvc" || true

echo ""
echo "Step 4: Add a local remote for storage"
mkdir -p ../dvcstore
dvc remote add -d localstore ../dvcstore || true
git add .dvc/config || true
git commit -m "chore: add local dvc remote" || true

echo ""
echo "=================================================="
echo " DVC INITIALIZATION COMPLETE"
echo "=================================================="
echo ""
echo "Next steps:"
echo "1. Put raw data in: data/raw/"
echo "2. Track raw dir: dvc add data/raw"
echo "3. Commit pointer: git add data/raw.dvc data/.gitignore && git commit -m 'track raw data'"
echo "4. Push data: dvc push"
echo "5. Define pipeline outs in dvc.yaml for data/processed, models, reports/figures"
"""

init_path = SCRIPTS / "init_dvc.sh"
init_path.write_text(init_dvc_script, encoding="utf-8")
os.chmod(init_path, 0o755)

print(f"\n✓ DVC initialization script created: {init_path}")
print("Usage: ./scripts/init_dvc.sh")


✓ DVC initialization script created: /Users/lia/Desktop/Fase1/scripts/init_dvc.sh
Usage: ./scripts/init_dvc.sh
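
The script's two guards (DVC on the PATH, a `.git` directory present) can also be checked from the notebook before running it. A minimal sketch, assuming the standard library's `shutil.which` is an acceptable stand-in for `command -v`; the function name `check_prerequisites` is hypothetical and not part of the project:

```python
import shutil
import tempfile
from pathlib import Path


def check_prerequisites(repo_root: Path) -> list[str]:
    """Return the problems that would make init_dvc.sh exit early."""
    problems = []
    # Mirrors `command -v dvc` in the shell script
    if shutil.which("dvc") is None:
        problems.append("DVC is not installed (pip install dvc)")
    # Mirrors the `[ ! -d .git ]` test
    if not (repo_root / ".git").is_dir():
        problems.append("Not a git repository (run git init first)")
    return problems


# Example: an empty temporary directory has no .git folder
with tempfile.TemporaryDirectory() as tmp:
    issues = check_prerequisites(Path(tmp))

print(issues)
```

In the notebook, `check_prerequisites(ROOT)` would return an empty list when the script can be expected to pass both checks.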


## Create DVC Data Tracking Script

In [3]:
# Create track_data_dvc.sh script
track_data_script = """#!/bin/bash
set -euo pipefail

# DVC Data Tracking Script
# Tracks raw data files with DVC

echo "=================================================="
echo "ADDING RAW DATA FILES TO DVC TRACKING"
echo "=================================================="

RAW_DIR="data/raw"

if [ -d "$RAW_DIR" ]; then
    echo "Tracking $RAW_DIR ..."
    dvc add "$RAW_DIR"
    echo "✓ Added $RAW_DIR to DVC"
else
    echo "ERROR: $RAW_DIR not found. Create it and place your raw files there."
    exit 1
fi

echo ""
echo "Adding .dvc pointers to git..."
# 'dvc add data/raw' writes its ignore rule to data/.gitignore, not the root .gitignore
git add data/raw.dvc data/.gitignore || true
git commit -m "chore: track raw data with DVC" || true

echo ""
echo "Pushing raw data to DVC remote..."
dvc push || true

echo ""
echo "=================================================="
echo " DATA TRACKING COMPLETE"
echo "=================================================="
echo ""
echo "NOTE: Processed data, models, and figures should be pipeline outs in dvc.yaml"
echo "Run the pipeline with: dvc repro"
"""

track_path = SCRIPTS / "track_data_dvc.sh"
track_path.write_text(track_data_script, encoding="utf-8")
os.chmod(track_path, 0o755)

print(f"\n✓ DVC tracking script created: {track_path}")
print("Usage: ./scripts/track_data_dvc.sh")


✓ DVC tracking script created: /Users/lia/Desktop/Fase1/scripts/track_data_dvc.sh
Usage: ./scripts/track_data_dvc.sh


## Create dvc.yaml Pipeline Configuration

In [4]:
dvc_yaml = """stages:
  preprocess:
    cmd: python notebooks/02_data_preprocessing.py
    deps:
      - notebooks/02_data_preprocessing.py
      - data/processed/student_entry_performance_eda.csv
    params:
      - preprocessing.test_size
      - preprocessing.random_state
    outs:
      - data/processed/student_performance_cleaned.csv:
          persist: true
      - data/processed/student_performance_preprocessed.csv:
          persist: true
      - data/processed/student_performance_train.csv:
          persist: true
      - data/processed/student_performance_test.csv:
          persist: true
      - data/processed/feature_names.txt:
          persist: true
      - reports/figures:
          persist: true
"""

dvc_yaml_path = ROOT / "dvc.yaml"
dvc_yaml_path.write_text(dvc_yaml, encoding="utf-8")

print(f"\n✓ DVC pipeline created: {dvc_yaml_path}")

# Create params.yaml
params_yaml = """preprocessing:
  test_size: 0.2
  random_state: 42
  scaler: "standard"

model:
  algorithm: "SVM"
  random_state: 42
"""

params_yaml_path = ROOT / "params.yaml"
params_yaml_path.write_text(params_yaml, encoding="utf-8")

print(f"✓ Parameters file created: {params_yaml_path}")


✓ DVC pipeline created: /Users/lia/Desktop/Fase1/dvc.yaml
✓ Parameters file created: /Users/lia/Desktop/Fase1/params.yaml
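
Each entry under `params:` in `dvc.yaml` is a dotted key that DVC resolves against `params.yaml`. The lookup can be sketched with a plain dictionary that mirrors the `params.yaml` content written in this cell; `lookup_param` is a hypothetical helper for illustration, not a DVC API:

```python
from functools import reduce

# Same content as the params.yaml created in this cell, as a Python dict
params = {
    "preprocessing": {"test_size": 0.2, "random_state": 42, "scaler": "standard"},
    "model": {"algorithm": "SVM", "random_state": 42},
}

# Keys declared under `params:` in the preprocess stage
declared = ["preprocessing.test_size", "preprocessing.random_state"]


def lookup_param(tree: dict, dotted_key: str):
    """Walk a nested dict following a dotted key such as 'a.b'."""
    return reduce(lambda node, part: node[part], dotted_key.split("."), tree)


resolved = {key: lookup_param(params, key) for key in declared}
print(resolved)
```

A misspelled key raises `KeyError` here, which is a cheap way to catch a mismatch between the two files before running `dvc repro`.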


## Dataset Inventory

In [7]:
# List all CSV files in processed directory
csv_files = sorted(DATA_PROC.glob("*.csv"))

print(f"\nFound {len(csv_files)} CSV files in processed data:\n")

dataset_info = []
for filepath in csv_files:
    if filepath.exists():
        size = filepath.stat().st_size
        try:
            df_temp = pd.read_csv(filepath)
            rows, cols = df_temp.shape
            dataset_info.append({
                'Filename': filepath.name,
                'Size (KB)': f"{size/1024:.2f}",
                'Rows': rows,
                'Columns': cols
            })
            print(f"- {filepath.name}")
            print(f"Size: {size/1024:.2f} KB | Rows: {rows} | Columns: {cols}\n")
        except Exception as e:
            print(f"⚠ {filepath.name} - Could not read: {e}\n")

# Create summary DataFrame
if dataset_info:
    df_inventory = pd.DataFrame(dataset_info)
    print(f"{'='*80}")
    print("DATASET SUMMARY TABLE")
    print(f"{'='*80}")
    print(df_inventory.to_string(index=False))
    print(f"\nAll datasets are stored in: {DATA_PROC}")


Found 7 CSV files in processed data:

- student_entry_performance_eda.csv
Size: 52.39 KB | Rows: 666 | Columns: 12

- student_performance_binary_preprocessed.csv
Size: 131.49 KB | Rows: 622 | Columns: 31

- student_performance_cleaned.csv
Size: 50.16 KB | Rows: 622 | Columns: 13

- student_performance_encoded.csv
Size: 50.16 KB | Rows: 622 | Columns: 13

- student_performance_preprocessed.csv
Size: 131.49 KB | Rows: 622 | Columns: 31

- student_performance_test.csv
Size: 26.96 KB | Rows: 125 | Columns: 31

- student_performance_train.csv
Size: 105.22 KB | Rows: 497 | Columns: 31

DATASET SUMMARY TABLE
                                   Filename Size (KB)  Rows  Columns
          student_entry_performance_eda.csv     52.39   666       12
student_performance_binary_preprocessed.csv    131.49   622       31
            student_performance_cleaned.csv     50.16   622       13
            student_performance_encoded.csv     50.16   622       13
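       student_performance_preprocessed.csv    131.49   622       31
               student_performance_test.csv     26.96   125       31
              student_performance_train.csv    105.22   497       31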
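
The inventory lists two pairs of files with identical size and shape (`student_performance_cleaned.csv` / `student_performance_encoded.csv`, and `student_performance_preprocessed.csv` / `student_performance_binary_preprocessed.csv`). Whether they are byte-for-byte duplicates is not stated here; comparing content hashes would settle it. A minimal sketch using the standard library's `hashlib`, with a hypothetical helper `file_md5` and throwaway files standing in for the real datasets:

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.md5()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo with throwaway files; swap in paths under data/processed
with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "a.csv"
    b = Path(tmp) / "b.csv"
    c = Path(tmp) / "c.csv"
    a.write_text("x,y\n1,2\n", encoding="utf-8")
    b.write_text("x,y\n1,2\n", encoding="utf-8")
    c.write_text("x,y\n1,3\n", encoding="utf-8")
    same_ab = file_md5(a) == file_md5(b)  # identical content
    same_ac = file_md5(a) == file_md5(c)  # one value differs

print(same_ab, same_ac)
```

Equal sizes alone do not prove equal content, so a hash comparison is the safer check before deleting either file of a pair.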

## Version History Documentation

In [8]:
# Helper function to get git revision
def git_rev():
    try:
        rev = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            stderr=subprocess.DEVNULL,
            cwd=ROOT
        ).decode('utf-8').strip()
        return rev
    except Exception:
        return "unknown"

git_revision = git_rev()
print(f"\nGit revision: {git_revision}")

# Define version history
version_entries = [
    {
        "Version": "v1.0",
        "Filename": "student_entry_performance_eda.csv",
        "Description": "Original dataset after EDA with fixed column names",
        "Changes": "Initial dataset – column names standardized, no other modifications."
    },
    {
        "Version": "v2.0",
        "Filename": "student_performance_cleaned.csv",
        "Description": "Cleaned data with duplicates removed + binary target",
        "Changes": "Removed 44 duplicates (6.6%); created Performance_Binary (0=Lower, 1=High); verified no missing values."
    },
    {
        "Version": "v3.0",
        "Filename": "student_performance_preprocessed.csv",
        "Description": "Model-ready: encoded and scaled features",
        "Changes": "Grouped rare study time categories; ordinal encoding for grades; one-hot encoding for nominal features; StandardScaler applied to ordinal features."
    },
    {
        "Version": "v3.1",
        "Filename": "student_performance_train.csv",
        "Description": "Training split (80% stratified)",
        "Changes": "train_test_split with stratify; random_state=42; maintains class balance."
    },
    {
        "Version": "v3.2",
        "Filename": "student_performance_test.csv",
        "Description": "Test split (20% stratified)",
        "Changes": "Hold-out set; not used in training; maintains class balance."
    },
]

# Get actual file stats
def file_stats(filepath):
    if not filepath.exists():
        return {"Rows": "—", "Columns": "—", "Size (KB)": "—", "Modified": "MISSING"}
    try:
        df = pd.read_csv(filepath)
        rows, cols = df.shape
        size_kb = f"{filepath.stat().st_size/1024:.2f}"
        mtime = datetime.fromtimestamp(filepath.stat().st_mtime).strftime("%Y-%m-%d %H:%M")
        return {"Rows": rows, "Columns": cols, "Size (KB)": size_kb, "Modified": mtime}
    except Exception as e:
        return {"Rows": "?", "Columns": "?", "Size (KB)": "?", "Modified": f"Error: {e.__class__.__name__}"}


Git revision: 1fa2c28
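
A short revision identifies a commit but does not say whether the working tree had uncommitted changes when the notebook ran. A hedged sketch of a companion check, built on `git status --porcelain`; the helper name `git_is_dirty` is hypothetical:

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Optional


def git_is_dirty(repo_root: Path) -> Optional[bool]:
    """True if there are uncommitted changes, None if git cannot answer."""
    try:
        out = subprocess.check_output(
            ["git", "status", "--porcelain"],
            stderr=subprocess.DEVNULL,
            cwd=repo_root,
        )
    except (OSError, subprocess.CalledProcessError):
        # git missing, or the directory is not inside a repository
        return None
    # --porcelain prints one line per changed path and nothing when clean
    return bool(out.strip())


# A fresh temporary directory is normally not inside a repository
with tempfile.TemporaryDirectory() as tmp:
    state = git_is_dirty(Path(tmp))

print(state)
```

Recording `git_is_dirty(ROOT)` next to `git_rev()` would make it visible when a changelog was generated from uncommitted code.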


In [9]:
print(f"\n{'='*80}")
print("DATASET VERSION CHANGELOG")
print(f"{'='*80}")

records = []
for item in version_entries:
    fp = DATA_PROC / item["Filename"]
    stats = file_stats(fp)
    
    records.append({
        "Version": item["Version"],
        "Filename": item["Filename"],
        "Rows": stats["Rows"],
        "Columns": stats["Columns"],
        "Size (KB)": stats["Size (KB)"],
        "Modified": stats["Modified"],
        "Description": item["Description"]
    })
    
    print(f"\n{item['Version']}: {item['Filename']}")
    print("-" * 80)
    print(f"Description: {item['Description']}")
    print(f"Shape: {stats['Rows']} rows × {stats['Columns']} columns | Size: {stats['Size (KB)']} KB")
    print(f"Modified: {stats['Modified']}")
    print("\nChanges Made:")
    print(item["Changes"])
    print("=" * 80)

# Create version summary table
version_df = pd.DataFrame(records, columns=[
    "Version", "Filename", "Rows", "Columns", "Size (KB)", "Modified", "Description"
])

print(f"\n{'='*80}")
print("VERSION SUMMARY TABLE")
print(f"{'='*80}")
print(version_df.to_string(index=False))


DATASET VERSION CHANGELOG

v1.0: student_entry_performance_eda.csv
--------------------------------------------------------------------------------
Description: Original dataset after EDA with fixed column names
Shape: 666 rows × 12 columns | Size: 52.39 KB
Modified: 2025-10-23 10:15

Changes Made:
Initial dataset – column names standardized, no other modifications.

v2.0: student_performance_cleaned.csv
--------------------------------------------------------------------------------
Description: Cleaned data with duplicates removed + binary target
Shape: 622 rows × 13 columns | Size: 50.16 KB
Modified: 2025-10-23 10:35

Changes Made:
Removed 44 duplicates (6.6%); created Performance_Binary (0=Lower, 1=High); verified no missing values.

v3.0: student_performance_preprocessed.csv
--------------------------------------------------------------------------------
Description: Model-ready: encoded and scaled features
Shape: 622 rows × 31 columns | Size: 131.49 KB
Modified: 2025-10-23 10:49

## Detailed Transformation Documentation

In [10]:
# Load datasets if they exist
def maybe_read(name):
    fp = DATA_PROC / name
    return (pd.read_csv(fp), fp) if fp.exists() else (None, fp)

df_v1, fp_v1 = maybe_read("student_entry_performance_eda.csv")
df_v2, fp_v2 = maybe_read("student_performance_cleaned.csv")
df_v3, fp_v3 = maybe_read("student_performance_preprocessed.csv")
df_tr, fp_tr = maybe_read("student_performance_train.csv")
df_te, fp_te = maybe_read("student_performance_test.csv")

# Helper functions
def shape_str(df):
    return f"{df.shape[0]} rows × {df.shape[1]} columns" if df is not None else "N/A"

def pct(x):
    return f"{100*x:.1f}%"

# Transformation 1: Data Cleaning (v1.0 → v2.0)
print(f"\n{'='*80}")
print("TRANSFORMATION 1: Data Cleaning (v1.0 → v2.0)")
print(f"{'='*80}")

cleaning_code = '''
# Remove duplicates
df_clean = df.drop_duplicates()

# Create binary target
def create_binary_target(performance):
    return 1 if performance in ['Excellent', 'Vg'] else 0

df_clean['Performance_Binary'] = df_clean['Performance'].apply(create_binary_target)
'''

print("\nOperations Performed:")
print("1. Duplicate removal")
print("2. Binary target creation")
print("3. Data validation")

print("\nCode Implementation:")
print(cleaning_code)

if df_v1 is not None and df_v2 is not None:
    dup_removed = df_v1.shape[0] - df_v2.shape[0]
    dup_pct = pct(dup_removed / df_v1.shape[0]) if df_v1.shape[0] else "0.0%"
    
    if "Performance_Binary" in df_v2.columns:
        bal = df_v2["Performance_Binary"].value_counts(normalize=True).rename({0: "Lower", 1: "High"})
        bal_str = " / ".join([f"{k}: {pct(v)}" for k, v in bal.sort_index().items()])
    else:
        bal_str = "Performance_Binary not found"
    
    print("\nImpact:")
    print(f"• Removed {dup_removed} rows ({dup_pct})")
    print(f"• Added Performance_Binary; class balance → {bal_str}")
    print(f"• Shapes: v1={shape_str(df_v1)} → v2={shape_str(df_v2)}")

# Transformation 2: Feature Engineering (v2.0 → v3.0)
print(f"\n{'='*80}")
print("TRANSFORMATION 2: Feature Engineering (v2.0 → v3.0)")
print(f"{'='*80}")

fe_code = '''
# Group rare study time categories
def group_study_time(time_val):
    return 'FOUR_PLUS' if time_val in ['FOUR', 'FIVE', 'SEVEN'] else time_val

# Ordinal mappings
grade_mapping = {'Average': 1, 'Good': 2, 'Vg': 3, 'Excellent': 4}
df_processed['Class_X_Grade_Encoded'] = df['Class_X_Percentage'].map(grade_mapping)

# One-hot encoding
df_encoded = pd.get_dummies(df, columns=nominal_columns, drop_first=True)
'''

print("\nOperations Performed:")
print("1. Rare category grouping")
print("2. Ordinal encoding (3 features)")
print("3. One-hot encoding (8 nominal variables)")

print("\nCode Implementation:")
print(fe_code)

if df_v2 is not None and df_v3 is not None:
    print("\nImpact:")
    print(f"• Features: {df_v2.shape[1]} → {df_v3.shape[1]} (+{df_v3.shape[1]-df_v2.shape[1]})")
    print(f"• Rows kept: {df_v3.shape[0]} (should match v2)")
    print(f"• Shapes: v2={shape_str(df_v2)} → v3={shape_str(df_v3)}")

# Transformation 3: Train/Test Split (v3.0 → v3.1 & v3.2)
print(f"\n{'='*80}")
print("TRANSFORMATION 3: Train-Test Split (v3.0 → v3.1 & v3.2)")
print(f"{'='*80}")

split_code = '''
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42, stratify=y
)
'''

print("\nOperations Performed:")
print("  1. Stratified splitting (80/20)")
print("  2. Random state fixing (42)")
print("  3. Class balance verification")

print("\nCode Implementation:")
print(split_code)

target_col = "Performance_Binary"
if df_tr is not None and df_te is not None:
    n_tr, n_te = df_tr.shape[0], df_te.shape[0]
    print("\nImpact:")
    print(f"• Train size: {n_tr} | Test size: {n_te} (total {n_tr + n_te})")
    
    if target_col in df_tr.columns and target_col in df_te.columns:
        bal_tr = df_tr[target_col].value_counts(normalize=True).sort_index()
        bal_te = df_te[target_col].value_counts(normalize=True).sort_index()
        print("• Class balance (train): " + " / ".join([f"{int(k)}={pct(v)}" for k, v in bal_tr.items()]))
        print("• Class balance (test): " + " / ".join([f"{int(k)}={pct(v)}" for k, v in bal_te.items()]))



TRANSFORMATION 1: Data Cleaning (v1.0 → v2.0)

Operations Performed:
1. Duplicate removal
2. Binary target creation
3. Data validation

Code Implementation:

# Remove duplicates
df_clean = df.drop_duplicates()

# Create binary target
def create_binary_target(performance):
    return 1 if performance in ['Excellent', 'Vg'] else 0

df_clean['Performance_Binary'] = df_clean['Performance'].apply(create_binary_target)


Impact:
• Removed 44 rows (6.6%)
• Added Performance_Binary; class balance → High: 44.4% / Lower: 55.6%
• Shapes: v1=666 rows × 12 columns → v2=622 rows × 13 columns

TRANSFORMATION 2: Feature Engineering (v2.0 → v3.0)

Operations Performed:
1. Rare category grouping
2. Ordinal encoding (3 features)
3. One-hot encoding (8 nominal variables)

Code Implementation:

# Group rare study time categories
def group_study_time(time_val):
    return 'FOUR_PLUS' if time_val in ['FOUR', 'FIVE', 'SEVEN'] else time_val

# Ordinal mappings
grade_mapping = {'Average': 1, 'Good': 2, 'Vg': 3
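
The counts reported in this section are consistent with each other, which a few lines of arithmetic confirm. The sketch below assumes scikit-learn's `train_test_split` rounds a fractional `test_size` up to the next whole row, which matches the 125/497 split reported in the Dataset Inventory; the variable names are illustrative:

```python
import math

rows_v1, rows_v2 = 666, 622   # row counts reported for v1.0 and v2.0
test_size = 0.2               # preprocessing.test_size in params.yaml

# Duplicate removal: rows dropped and their share of the original data
removed = rows_v1 - rows_v2
removed_pct = round(100 * removed / rows_v1, 1)

# Split sizes: test rows rounded up, train rows are the remainder
n_test = math.ceil(test_size * rows_v2)
n_train = rows_v2 - n_test

print(removed, removed_pct, n_test, n_train)
```

These values agree with the 44 rows (6.6%) removed and the 497-row train and 125-row test files listed earlier.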

## Create DVC Setup Instructions

In [11]:
dvc_instructions = f"""
DVC (Data Version Control) Setup Instructions
==============================================

Objectives:
- Track ONLY RAW data with DVC (e.g., 'data/raw/')
- Generate processed data/models/figures via pipeline ('dvc.yaml' outs)
- Never 'dvc add' processed files

Project root: {ROOT}

Step 0: One-time Git init (if needed)
--------------------------------------
git init
git add .
git commit -m "init repo"

Step 1: Initialize DVC
----------------------
Use the helper script you created (portable & repeatable):

./scripts/init_dvc.sh

(Equivalent to `dvc init`, configure, and commit .dvc files.)

Step 2: Track ONLY raw data with DVC
-----------------------------------
Place the original dataset(s) in `data/raw/` and run:

./scripts/track_data_dvc.sh

This executes:
- dvc add data/raw
- git add data/raw.dvc
- git commit -m "track raw data with DVC"

Step 3: Define and Run the pipeline
--------------------------------
The `dvc.yaml` stage runs preprocessing and declares outs:

dvc repro

On each run, it (re)creates:
- data/processed/*.csv
- models/ (if produced)
- reports/figures/ (plots)

Step 4: Check status & push artifacts
----------------------------
dvc status
git add dvc.yaml dvc.lock
git commit -m "pipeline run"
dvc push

(If you haven't set a remote, the init script added a local one: ../dvcstore)

Common Commands
---------------
# Check what changed
dvc status

# Pull data from remote
dvc pull

# Push data to remote
dvc push

# Reproduce pipeline
dvc repro

# Show pipeline DAG
dvc dag

Benefits of DVC
===============
✓ Version control for large datasets
✓ Reproducibility of data pipeline
✓ Efficient storage (only metadata in Git)
✓ Easy collaboration with team members
✓ Data integrity verification with MD5 hashes
"""

instruction_file = ROOT / "DVC_SETUP_INSTRUCTIONS.md"
instruction_file.write_text(dvc_instructions, encoding="utf-8")

print(f"\nInstructions saved to: {instruction_file}")


Instructions saved to: /Users/lia/Desktop/Fase1/DVC_SETUP_INSTRUCTIONS.md


## Final Summary Report

In [12]:
# Compute dynamic metrics
def shape(df):
    return (df.shape[0], df.shape[1]) if df is not None else (None, None)

r1, c1 = shape(df_v1)
r2, c2 = shape(df_v2)
r3, c3 = shape(df_v3)
rt, ct = shape(df_tr)
re, ce = shape(df_te)

# Duplicates removed
dup_removed = (r1 - r2) if (r1 is not None and r2 is not None) else None
dup_pct = (100*dup_removed/r1) if (dup_removed is not None and r1) else None

# Feature expansion
feat_from = c2 if c2 is not None else c1
feat_to = c3  # stays None when the preprocessed file is missing, so no bogus delta is reported
feat_delta = (feat_to - feat_from) if (feat_from is not None and feat_to is not None) else None

# Class balance
def balance_str(df, target="Performance_Binary"):
    if df is None or target not in df.columns:
        return "N/A"
    vc = df[target].value_counts(normalize=True).sort_index()
    return " / ".join([f"{int(k)}={v*100:.1f}%" for k, v in vc.items()])

bal_train = balance_str(df_tr)
bal_test = balance_str(df_te)

def fmt_shape(r, c):
    return f"{r} × {c}" if (r is not None and c is not None) else "MISSING"

summary = f"""
{'='*80}
DATA VERSIONING SUMMARY - COMPLETE
{'='*80}

1) DVC Implementation
   • DVC initialized and linked to Git
   • Raw data tracked with `dvc add data/raw/`
   • Processed data/figures are pipeline outs in `dvc.yaml`
   • Reproducibility anchored by Git commit + `dvc.lock`
   • Current Git rev: {git_revision}

2) Documentation of Data Modifications
   • Version history: v1.0 → v3.2
   • Each transformation explained with rationale and code snippets
   • Shapes computed from actual artifacts in `data/processed/`

3) Change Log / History
   v1.0 (Original/EDA): {fmt_shape(r1, c1)}
   ↓ [Remove duplicates, create binary target]
   v2.0 (Cleaned): {fmt_shape(r2, c2)}
   ↓ [Encode categorical (ordinal + one-hot)]
   v3.0 (Preprocessed): {fmt_shape(r3, c3)}
   ↓ [80/20 stratified split]
   v3.1 (Train): {fmt_shape(rt, ct)}
   v3.2 (Test): {fmt_shape(re, ce)}

Key Metrics
-----------
• Total duplicate rows removed: {dup_removed if dup_removed is not None else 'N/A'}
• Percentage removed: {f'{dup_pct:.1f}%' if dup_pct is not None else 'N/A'}
• Feature expansion: {feat_from} → {feat_to} ({f'{feat_delta:+d}' if feat_delta is not None else 'N/A'})
• Final class balance (train): {bal_train}
• Final class balance (test): {bal_test}
• Target column: Performance_Binary
• Random seed: 42

Files Created
-------------
1. scripts/init_dvc.sh - DVC initialization script
2. scripts/track_data_dvc.sh - Data tracking script
3. dvc.yaml - Pipeline configuration
4. params.yaml - Parameter file
5. DVC_SETUP_INSTRUCTIONS.md - Setup guide

Next Steps
----------
→ Run: ./scripts/init_dvc.sh (initialize DVC)
→ Run: ./scripts/track_data_dvc.sh (track raw data)
→ Run: dvc repro (reproduce pipeline)
→ Proceed to Notebook 04: Model Training with MLflow

{'='*80}
"""

print(summary)

# Save summary
summary_file = ROOT / "DATA_VERSIONING_SUMMARY.md"
summary_file.write_text(summary, encoding="utf-8")
print(f"\nSummary saved to: {summary_file}")

print(f"\n{'='*80}")
print("DATA VERSIONING COMPLETE!")
print(f"{'='*80}")

print("\nArtifacts Created:")
artifacts = [
    SCRIPTS / "init_dvc.sh",
    SCRIPTS / "track_data_dvc.sh",
    ROOT / "dvc.yaml",
    ROOT / "params.yaml",
    ROOT / "DVC_SETUP_INSTRUCTIONS.md",
    ROOT / "DATA_VERSIONING_SUMMARY.md"
]

for artifact in artifacts:
    status = "✓" if artifact.exists() else "✗"
    print(f"  {status} {artifact.relative_to(ROOT)}")


DATA VERSIONING SUMMARY - COMPLETE

1) DVC Implementation
   • DVC initialized and linked to Git
   • Raw data tracked with `dvc add data/raw/`
   • Processed data/figures are pipeline outs in `dvc.yaml`
   • Reproducibility anchored by Git commit + `dvc.lock`
   • Current Git rev: 1fa2c28

2) Documentation of Data Modifications
   • Version history: v1.0 → v3.2
   • Each transformation explained with rationale and code snippets
   • Shapes computed from actual artifacts in `data/processed/`

3) Change Log / History
   v1.0 (Original/EDA): 666 × 12
   ↓ [Remove duplicates, create binary target]
   v2.0 (Cleaned): 622 × 13
   ↓ [Encode categorical (ordinal + one-hot)]
   v3.0 (Preprocessed): 622 × 31
   ↓ [80/20 stratified split]
   v3.1 (Train): 497 × 31
   v3.2 (Test): 125 × 31

Key Metrics
-----------
• Total duplicate rows removed: 44
• Percentage removed: 6.6%
• Feature expansion: 13 → 31 (+18)
• Final class balance (train): 0=55.5% / 1=44.5%
• Final class balance (test): 0=56.0% 