# Task 2: Data Version Control (DVC)
**Date:** 2025-06-16

This notebook sets up Data Version Control (DVC) so that the dataset is versioned alongside the code, which supports reproducibility and data traceability.


## 🔧 Step 1: Install and Initialize DVC


In [22]:
# Install DVC if not already installed
%pip install dvc

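# Initialize DVC (only run once). DVC expects a Git-tracked directory, so run this
# from the repository root, or pass --subdir when initializing inside a subdirectory
!dvc init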


Note: you may need to restart the kernel to use updated packages.
ERROR: failed to initiate DVC - /home/ruhama/Desktop/End-to-End Insurance Risk Analytics & Predictive Modeling/notebooks is not tracked by any supported SCM tool (e.g. Git). Use `--no-scm` if you don't want to use any SCM or `--subdir` if initializing inside a subdirectory of a parent SCM repository.
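
The error above depends on where the notebook's working directory sits relative to a Git repository. A minimal sketch of how that choice could be made explicit is shown here; the helper names `git_toplevel` and `dvc_init_command` are hypothetical, and the `/repo` paths in the demo are placeholders rather than paths from this project. Whether `--subdir`, `--no-scm`, or moving to the repository root is the right fix depends on how the project is laid out.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def git_toplevel(cwd: Path) -> Optional[Path]:
    """Return the root of the Git repository containing cwd, or None if there is none."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--show-toplevel"],
            cwd=cwd, capture_output=True, text=True, check=False,
        )
    except FileNotFoundError:  # git is not installed
        return None
    if result.returncode != 0:  # cwd is not inside a Git repository
        return None
    return Path(result.stdout.strip())


def dvc_init_command(cwd: Path, git_root: Optional[Path]) -> List[str]:
    """Pick the `dvc init` variant that matches where cwd sits relative to Git."""
    if git_root is None:
        return ["dvc", "init", "--no-scm"]
    if cwd.resolve() != git_root.resolve():
        return ["dvc", "init", "--subdir"]
    return ["dvc", "init"]


# Placeholder paths: a notebooks/ folder inside a repository rooted at /repo.
print(dvc_init_command(Path("/repo/notebooks"), Path("/repo")))
```

In the notebook itself, `dvc_init_command(Path.cwd(), git_toplevel(Path.cwd()))` would return the command to run for the current working directory.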

## 💾 Step 2: Configure Local Remote Storage


In [23]:
# Create a local directory to store versioned data
!mkdir -p ../dvc_storage

# Add the local directory as the default DVC remote
# (add -f to overwrite 'localstorage' if it already exists)
!dvc remote add -d localstorage ../dvc_storage


Setting 'localstorage' as a default remote.
ERROR: configuration error - config file error: remote 'localstorage' already exists. Use `-f|--force` to overwrite it.
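
Re-running this cell fails because the remote is already configured. One way to make the step safe to re-run is to check the existing remotes first and add `--force` only when needed. The sketch below assumes that `dvc remote list` prints one remote per line with the name as the first field; the exact output format can differ between DVC versions, and the function names and sample string are illustrative.

```python
from typing import List


def remote_names(remote_list_output: str) -> List[str]:
    """Extract remote names from `dvc remote list` output (first field of each line)."""
    names = []
    for line in remote_list_output.splitlines():
        fields = line.split()
        if fields:
            names.append(fields[0])
    return names


def remote_add_command(name: str, url: str, existing: List[str]) -> List[str]:
    """Build `dvc remote add -d`, adding --force only when the name is already taken."""
    command = ["dvc", "remote", "add", "-d"]
    if name in existing:
        command.append("--force")
    return command + [name, url]


# Illustrative text shaped like `dvc remote list` output (an assumption, not a captured log).
sample = "localstorage\t../dvc_storage\n"
print(remote_add_command("localstorage", "../dvc_storage", remote_names(sample)))
```

The returned list can be passed to `subprocess.run` after capturing the real `dvc remote list` output.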

## 📦 Step 3: Add Dataset to DVC


In [24]:
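# Track the main dataset with DVC (the path must name a file that exists in ../data,
# and the resulting .dvc file must not be excluded by .gitignore)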
!dvc add ../data/MachineLearningRating_v3.txt

# Check that the .dvc file was created
!ls ../data


ERROR: bad DVC file name '../data/MachineLearningRating_v3.txt.dvc' is git-ignored.
insurance.txt
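
The `ls` check above relies on reading the listing by eye. A small sketch of the same check in Python follows: it reports data files that have no sibling `.dvc` pointer file. The function name `files_missing_dvc_pointer` is hypothetical, and skipping `.gitignore` is an assumption about what else may sit in the data directory.

```python
from pathlib import Path
from typing import List


def files_missing_dvc_pointer(data_dir: Path) -> List[str]:
    """List data files in data_dir that have no sibling `<name>.dvc` file."""
    missing = []
    for path in sorted(data_dir.iterdir()):
        # Skip directories, the pointer files themselves, and Git's ignore file.
        if not path.is_file() or path.suffix == ".dvc" or path.name == ".gitignore":
            continue
        if not path.with_name(path.name + ".dvc").exists():
            missing.append(path.name)
    return missing


data_dir = Path("../data")  # the data directory used in this notebook
if data_dir.is_dir():
    print(files_missing_dvc_pointer(data_dir))
```

An empty list would indicate that every data file has a pointer file; any name in the list still needs a successful `dvc add`.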


## ✅ Step 4: Commit Tracking Files to Git


In [25]:
# Add DVC tracking files and commit to Git
!git add ../data/insurance.txt.dvc .gitignore .dvc/config
!git commit -m "Track insurance dataset with DVC and setup local remote"


fatal: pathspec '../data/insurance.txt.dvc' did not match any files


[task-2 9aacf78] Track insurance dataset with DVC and setup local remote
 3 files changed, 6 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore


## ☁️ Step 5: Push Dataset to Remote Storage


In [26]:
# Push the dataset to the local remote
!dvc push


Collecting                                            |0.00 [00:00,    ?entry/s]
Pushing
Everything is up to date.
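
"Everything is up to date" does not by itself show that the dataset reached the remote; the same message can appear when nothing is tracked yet. A rough sanity check, sketched below, is to count the files under the local remote directory. The helper name `count_stored_files` is hypothetical, and the sketch makes no assumption about how DVC lays out files inside the remote.

```python
from pathlib import Path


def count_stored_files(storage_dir: Path) -> int:
    """Count regular files anywhere under the local remote directory."""
    if not storage_dir.is_dir():
        return 0
    return sum(1 for path in storage_dir.rglob("*") if path.is_file())


storage = Path("../dvc_storage")  # the local remote created in Step 2
print(f"{count_stored_files(storage)} file(s) under {storage}")
```

A count of zero after `dvc push` suggests that no data has been tracked and pushed yet.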

## 📌 Summary

- DVC was initialized in the repository.
- A local directory was configured as a DVC remote.
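- The dataset is tracked with `dvc add`, which writes a `.dvc` pointer file next to it. In the recorded run this step failed because the pointer file is git-ignored, so no `.dvc` file was committed.
- The DVC configuration files (`.dvc/config`, `.dvc/.gitignore`, `.dvcignore`) were committed to Git, and `dvc push` reported nothing to upload.

Once the dataset's `.dvc` file is committed, this setup lets future analysis and modeling tasks retrieve the exact data version they were built on.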
