# Task 2: Data Version Control (DVC)
**Date:** 2025-06-16

This notebook sets up Data Version Control (DVC) to ensure reproducibility of the dataset and compliance with industry standards for data traceability.


## 🔧 Step 1: Install and Initialize DVC


In [2]:
# Install DVC if not already installed
%pip install dvc

# Initialize DVC (run once, from the root of the Git repository, or pass --subdir)
!dvc init


Collecting dvc
  Using cached dvc-3.60.1-py3-none-any.whl.metadata (17 kB)
Collecting attrs>=22.2.0 (from dvc)
  Using cached attrs-25.3.0-py3-none-any.whl.metadata (10 kB)
Collecting celery (from dvc)
  Using cached celery-5.5.3-py3-none-any.whl.metadata (22 kB)
Collecting colorama>=0.3.9 (from dvc)
  Using cached colorama-0.4.6-py2.py3-none-any.whl.metadata (17 kB)
Collecting configobj>=5.0.9 (from dvc)
  Using cached configobj-5.0.9-py2.py3-none-any.whl.metadata (3.2 kB)
Collecting distro>=1.3 (from dvc)
  Using cached distro-1.9.0-py3-none-any.whl.metadata (6.8 kB)
Collecting dpath<3,>=2.1.0 (from dvc)
  Using cached dpath-2.2.0-py3-none-any.whl.metadata (15 kB)
Collecting dulwich (from dvc)
  Using cached dulwich-0.22.8-cp312-cp312-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (4.9 kB)
Collecting dvc-data<3.17,>=3.16.2 (from dvc)
  Using cached dvc_data-3.16.10-py3-none-any.whl.metadata (5.0 kB)
Collecting dvc-http>=2.29.0 (from dvc)
  Using cached dvc_http-2.32.0-py3-no
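
The output above is cut off before any message from `dvc init`, and the later steps report that no DVC repository exists, which suggests initialization did not succeed. One common cause is running `dvc init` from a subdirectory: by default it expects to run at the root of a Git repository, and `--subdir` or `--no-scm` relax that requirement. The following is a minimal pre-flight sketch, assuming the Git root is marked by a `.git` entry and an initialized DVC repository by a `.dvc` directory. The helper names `find_git_root` and `is_dvc_initialized` are hypothetical and not part of DVC.

```python
from pathlib import Path
from typing import Optional


def find_git_root(start: Path) -> Optional[Path]:
    """Walk upward from `start` and return the first directory containing `.git`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / ".git").exists():
            return candidate
    return None


def is_dvc_initialized(root: Path) -> bool:
    """Return True if `root` contains a `.dvc` directory (created by `dvc init`)."""
    return (root / ".dvc").is_dir()


# Report where `dvc init` should run and whether it has already been done there.
root = find_git_root(Path.cwd())
print("Git root:", root)
print("DVC initialized:", root is not None and is_dvc_initialized(root))
```

If the reported Git root differs from the notebook's working directory, running `dvc init` from that root (or using `--subdir`) is the likely fix.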

## 💾 Step 2: Configure Local Remote Storage


In [3]:
# Create a local directory to store versioned data
!mkdir -p ../dvc_storage

# Add the local directory as the DVC remote
!dvc remote add -d localstorage ../dvc_storage


Setting 'localstorage' as a default remote.
[31mERROR[39m: configuration error - config file error: Not inside a DVC repo
[0m
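
The "Setting 'localstorage' as a default remote." line appears even though the command then fails, so it should not be read as confirmation that the remote was saved. A small sketch for checking the saved configuration follows, assuming DVC stores remotes in `.dvc/config` under a section of the form `['remote "<name>"']`. The function name `remote_is_configured` and the `repo_root` argument are hypothetical.

```python
from pathlib import Path


def remote_is_configured(repo_root: Path, name: str) -> bool:
    """Return True if `.dvc/config` under `repo_root` has a section for remote `name`."""
    config_path = repo_root / ".dvc" / "config"
    if not config_path.is_file():
        return False  # no DVC repository here, hence no saved remote
    # Assumption: DVC writes remote sections as ['remote "<name>"'].
    return f'remote "{name}"' in config_path.read_text(encoding="utf-8")
```

Calling `remote_is_configured(Path(".."), "localstorage")` would cover the case where the repository root is the parent directory, which is an assumption about this project's layout. From the command line, `dvc remote list` serves the same purpose.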

## 📦 Step 3: Add Dataset to DVC


In [4]:
# Track the main dataset with DVC
!dvc add ../data/MachineLearningRating_v3.txt

# Check that the .dvc file was created
!ls ../data


[31mERROR[39m: you are not inside of a DVC repository (checked up to mount point '/')
[0minsurance.txt
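
This output points to two separate problems: the missing DVC repository, and a directory listing that shows only `insurance.txt`, so the path passed to `dvc add` does not exist in `../data`. Whether `insurance.txt` is the same dataset under another name is not stated here. A hedged sketch of a check that fails early with the available file names is shown below. The helper `require_dataset` is hypothetical.

```python
from pathlib import Path


def require_dataset(data_dir: Path, filename: str) -> Path:
    """Return the dataset path, or raise with a listing of what is in `data_dir`."""
    target = data_dir / filename
    if not target.is_file():
        available = (
            sorted(p.name for p in data_dir.iterdir() if p.is_file())
            if data_dir.is_dir()
            else []
        )
        raise FileNotFoundError(
            f"{target} not found; files in {data_dir}: {available}"
        )
    return target
```

Running `require_dataset(Path("../data"), "MachineLearningRating_v3.txt")` before `dvc add` would surface the file name mismatch directly.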


## ✅ Step 4: Commit Tracking Files to Git


In [5]:
# Add DVC tracking files and commit to Git
!git add ../data/MachineLearningRating_v3.txt.dvc .gitignore .dvc/config
!git commit -m "Track insurance dataset with DVC and setup local remote"


fatal: pathspec '../data/MachineLearningRating_v3.txt.dvc' did not match any files
On branch task-2
Your branch is up to date with 'origin/task-2'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   task2_dvc_setup.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	task4_predictive_modeling.ipynb

no changes added to commit (use "git add" and/or "git commit -a")
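
`git add` stops when a pathspec matches nothing, so no file was staged and `git commit` had nothing to record. After a successful `dvc add`, DVC writes a `.dvc` pointer file next to the data file and updates a `.gitignore` in that same directory, so the `.gitignore` to stage would normally sit in `../data` rather than the notebook's directory. The sketch below derives those two paths from the dataset path. The function name `dvc_tracking_files` is hypothetical, and the location of `.dvc/config` is left out because it depends on where the DVC repository root is.

```python
from pathlib import Path
from typing import List


def dvc_tracking_files(data_path: Path) -> List[Path]:
    """Paths that `dvc add <data_path>` is expected to create or update."""
    return [
        data_path.with_name(data_path.name + ".dvc"),  # pointer file with the data hash
        data_path.parent / ".gitignore",  # keeps the raw data file out of Git
    ]


for path in dvc_tracking_files(Path("../data/MachineLearningRating_v3.txt")):
    print(path)
```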


## ☁️ Step 5: Push Dataset to Remote Storage


In [6]:
# Push the dataset to the local remote
!dvc push


[31mERROR[39m: you are not inside of a DVC repository (checked up to mount point '/')
[0m
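
Shell commands started with `!` do not stop the notebook when they exit with an error, which is why each step here ran even though the previous one had failed. A minimal fail-fast sketch using `subprocess` follows. The `run_steps` helper is hypothetical, and the commands in the usage comment are the ones from this notebook.

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run commands in order, stop at the first failure, return how many succeeded."""
    completed = 0
    for cmd in steps:
        try:
            result = subprocess.run(cmd, capture_output=True, text=True)
        except FileNotFoundError:
            # The executable itself is missing (for example, dvc is not installed).
            print(f"command not found: {cmd[0]}", file=sys.stderr)
            break
        if result.returncode != 0:
            print(f"step failed: {' '.join(cmd)}\n{result.stderr}", file=sys.stderr)
            break
        completed += 1
    return completed


# Hypothetical usage with the commands from this notebook:
# run_steps([
#     ["dvc", "init"],
#     ["dvc", "remote", "add", "-d", "localstorage", "../dvc_storage"],
#     ["dvc", "add", "../data/MachineLearningRating_v3.txt"],
#     ["dvc", "push"],
# ])
```

With this pattern, a failed `dvc init` would halt the sequence before `dvc push` is attempted.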

## 📌 Summary

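- DVC initialization did not take effect in the recorded run: every later command reports that it is not inside a DVC repository.
- The local remote `localstorage` (`../dvc_storage`) was therefore not saved to the DVC configuration.
- The dataset was not tracked: `dvc add` failed, and `../data` contains `insurance.txt` rather than `MachineLearningRating_v3.txt`.
- Nothing was committed to Git, and no data was pushed to the remote.

Once all five steps run without errors, this setup provides data reproducibility and integrity for future analysis and modeling tasks.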
