# Introduction to Data Version Control With DVC

## What is Data Version Control (DVC)? 

- Definition and Importance of DVC
- Brief Overview of DVC’s Role in Data Science and Machine Learning Projects
- Benefits of Using DVC

## Prerequisites

Here are the topics you should be familiar with before we dive head-first into the article:

- [Version control with Git](https://www.datacamp.com/courses/introduction-to-git)
- [Python fundamentals](https://www.datacamp.com/tracks/python-data-fundamentals)
- [Machine learning fundamentals](https://www.datacamp.com/tracks/machine-learning-fundamentals-with-python)
- [Terminal](https://www.datacamp.com/courses/introduction-to-shell)
- [Managing GitHub repositories](https://www.datacamp.com/courses/github-concepts)


## Key concepts of DVC

DVC is not a complex tool to learn, but it does take some time to wrap your head around its most important concepts. They are crucial for using its commands correctly and avoiding time-consuming and rage-inducing mistakes later. So, in this section, we will cover some DVC internals and key terms.

### 1. DVC is not a version control system (technically)

When you get down to the technical details, DVC is not a version control system because we already have Git, the king. But we also know that Git sucks at versioning large files. Even the creator of Git, Linus Torvalds, admits this fact. Git is only good (extremely good) at detecting changes to small files. 

So, instead of reinventing the wheel for large files, DVC uses a workaround to take advantage of Git's capabilities. When you add a large file or a directory to DVC (using a simple `dvc add images` command), the tool creates a small metadata file named `images.dvc`. If you print the contents of this metadata file, you will see the following:

![image.png](attachment:50f420c6-f131-4728-81fe-13a3be4e75d7.png)

It lists the folder's size, its number of files and, most importantly, its MD5 hash. MD5 is a popular hashing function that produces a 128-bit digest, usually written as 32 hexadecimal characters, given a file or folder as input. The hash changes completely even if a single bit is changed in the tracked asset.
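
To make the hashing idea concrete, here is a minimal sketch using Python's standard `hashlib` module. It is not how DVC computes hashes internally; the sample bytes and the `md5_hex` helper are made up for illustration.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of `data` as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()


# Hypothetical file contents standing in for a tracked asset.
original = b"id,label\n1,cat\n2,dog\n"

# Flip the lowest bit of the first byte to simulate a one-bit change.
modified = bytes([original[0] ^ 0b00000001]) + original[1:]

digest_a = md5_hex(original)
digest_b = md5_hex(modified)

print(len(digest_a))         # 32 hexadecimal characters
print(digest_a)
print(digest_b)
print(digest_a != digest_b)  # the two digests differ
```

Printing both digests side by side shows that they share no obvious pattern, even though the inputs differ by a single bit.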

Using these hash values, Git is able to track large assets, no matter their size. Here is how it happens:
- You add an asset to DVC with `dvc add file/or/folder`
- An `asset.dvc` file is generated for the added asset with a unique MD5
- Under the hood, DVC also adds the asset to `.gitignore` so that it isn't tracked by Git
- You start tracking the `asset.dvc` file with `git add asset.dvc`
- Then, any time the large file changes, you run `dvc add` on it again so that the MD5 hash in `asset.dvc` is updated. You version this change by calling `git add asset.dvc` again and committing.
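
The pointer-file idea behind these steps can be imitated in a few lines of Python. This is only a toy sketch, not DVC's implementation: the `toy_pointer` helper is hypothetical, and its `md5`, `size` and `path` fields are loosely modeled on what a `.dvc` file records.

```python
import hashlib
import tempfile
from pathlib import Path


def toy_pointer(asset: Path) -> dict:
    """Build a tiny metadata record (hash, size, name) for one file."""
    data = asset.read_bytes()
    return {
        "md5": hashlib.md5(data).hexdigest(),
        "size": len(data),
        "path": asset.name,
    }


with tempfile.TemporaryDirectory() as tmp:
    asset = Path(tmp) / "data.csv"  # stand-in for a large file

    asset.write_bytes(b"id,label\n1,cat\n")
    pointer_v1 = toy_pointer(asset)  # small record that Git would track

    asset.write_bytes(b"id,label\n1,cat\n2,dog\n")  # the asset changes
    pointer_v2 = toy_pointer(asset)  # regenerated record with a new hash

print(pointer_v1)
print(pointer_v2)
```

Git only ever sees the small record, so its diff between the two versions is a couple of changed lines, regardless of how big the underlying asset is.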

There are some more details to this process but we already have a good overview of what will happen when we start running some DVC commands.

### 2. DVC cache

When you initialize Git inside a folder, it creates a hidden `.git` directory that holds Git-related configurations, its internals and metadata. The same happens when you initialize DVC. 

Among the contents of the hidden `.dvc` folder, you only need to worry about the DVC cache, located under `.dvc/cache`. When you start tracking a large asset with DVC, it gets copied to this cache. The asset's subsequent versions will also be stored there.

For example, let's say you are tracking a 1 GB CSV file with DVC. Its current version is added to the cache with its MD5. Then, if you make a change to it (like renaming its columns or making a transformation), the new version is also added to the cache with a new MD5. Yes, you guessed correctly - each version of the file takes up 1 GB of disk space. That is one of the unavoidable aspects of data version control.

On the bright side, things are a bit different for directories. If you are tracking a folder with thousands of images and you make a change that affects only a few files, DVC doesn't create a duplicate of the entire directory. Only the affected images will be saved in the cache with their new MD5s. 
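
The difference between single files and directories is easier to see with a toy content-addressed store. The sketch below is an assumption-laden illustration, not DVC's actual cache: `ToyCache` and its methods are hypothetical, and the "images" are just short byte strings.

```python
import hashlib


class ToyCache:
    """Stores each unique blob once, keyed by its MD5 hex digest."""

    def __init__(self) -> None:
        self.objects: dict[str, bytes] = {}

    def add_file(self, data: bytes) -> str:
        digest = hashlib.md5(data).hexdigest()
        # Content that is already cached is not stored a second time.
        self.objects.setdefault(digest, data)
        return digest

    def add_directory(self, files: dict[str, bytes]) -> dict[str, str]:
        # A directory snapshot maps each file name to its content hash.
        return {name: self.add_file(data) for name, data in files.items()}


cache = ToyCache()

# Hypothetical folder of 1,000 distinct "images".
images = {f"img_{i}.png": f"pixels-{i}".encode() for i in range(1000)}
snapshot_v1 = cache.add_directory(images)
objects_after_v1 = len(cache.objects)

# Change only two files, then snapshot the whole folder again.
images["img_7.png"] = b"retouched pixels"
images["img_42.png"] = b"cropped pixels"
snapshot_v2 = cache.add_directory(images)
objects_after_v2 = len(cache.objects)

print(objects_after_v1, objects_after_v2)  # 1000 1002
```

The second snapshot adds only two new objects because the 998 untouched files hash to entries that are already stored. A single large file has no such luck: any edit produces a new hash, so the whole file is stored again.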

DVC also has a few caching strategies that can significantly reduce disk usage. One is setting up a shared cache for all the computers on a network that have access to company data. For more details, read the [Large Dataset Optimization](https://dvc.org/doc/user-guide/data-management/large-dataset-optimization) page of the DVC user guide.

### 3. DVC remote

## Installation and Setup

- System Requirements
- Installation Steps 
- Initial Setup and Configuration


## Basic DVC And Git Workflow

- Initializing a DVC Repository
- Adding and Committing Data Files
- Tracking Changes and Creating Versions
- Checking Out Previous Versions


- Benefits of Integrating DVC with Git
- Steps to Integrate DVC with Git
- Best Practices for Managing Code and Data Together


## Managing Data Pipelines

- Introduction to Data Pipelines in DVC
- Creating and Managing Pipelines
- Running and Reproducing Pipelines
- Visualizing Pipelines and Dependencies


## Remote Storage and Collaboration

- Setting Up Remote Storage (AWS S3, Google Cloud Storage, etc.)
- Pushing and Pulling Data to/from Remote Storage
- Collaborating with Team Members Using DVC


## Experiment Management and Reproducibility

- Tracking Experiments with DVC
- Comparing Experiment Results
- Ensuring Reproducibility in Machine Learning Projects


## Conclusion