# **Resources Utilized for Project Implementation**
- [Get Started: Data Versioning](https://dvc.org/doc/start/data-management/data-versioning)
- [How to connect DVC to Google Drive (remote storage) to store and version your data](https://blog.devgenius.io/how-to-connect-dvc-to-google-drive-remote-storage-to-store-and-version-your-data-64db2fad73ad)
- [MLOps Tutorial #2: When data is too big for Git](https://youtu.be/kZKAuShWF0s)

# Importing Necessary Libraries

In [1]:
import os
import subprocess
import pandas as pd
from sklearn.model_selection import train_test_split

# Defining Necessary Constants

In [2]:
DATA_FOLDER = os.path.abspath("data")
SEED1 = 8576
SEED2 = 202016

# Defining Necessary Functions

In [3]:
def read_data(csv_name: str) -> pd.DataFrame:
    # Read CSV data from the DATA_FOLDER
    return pd.read_csv(
        os.path.join(
            DATA_FOLDER,
            csv_name,
        )
    )

def split_data(raw_data: pd.DataFrame, seed: int) -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # Split the data into training, validation, and test sets
    train_data, temp_data = train_test_split(raw_data, test_size = 0.2, random_state = seed)
    validation_data, test_data = train_test_split(temp_data, test_size = 0.5, random_state = seed)
    return train_data, validation_data, test_data

def save_data(train_data: pd.DataFrame, validation_data: pd.DataFrame, test_data: pd.DataFrame):
    # Save the split data into CSV files
    for df, filename in zip(
        [train_data, validation_data, test_data],
        ["train.csv", "validation.csv", "test.csv"],
    ):
        df.to_csv(
            os.path.join(
                DATA_FOLDER,
                filename,
            ),
            index = False,
        )

def read_split_and_save_data(csv_name: str, seed: int):
    # Combining reading, splitting and saving data together
    save_data(*split_data(read_data(csv_name), seed))

def print_distribution_of_the_splits():
    # Print the distribution of the splitted data
    print(f"Distribution of the Splitted Data:")
    for filename in ["train.csv", "validation.csv", "test.csv"]:
        data_type = os.path.splitext(filename)[0].title()
        df = read_data(filename)
        zero_count = (df["spam"] == 0).count()
        one_count = (df["spam"] == 1).count()
        print(f"\nData Type: {data_type}\n0 count: {zero_count}\n1 count: {one_count}")

# Initializing DVC

In [4]:
!dvc init --subdir

Initialized DVC repository.

You can now commit the changes to git.

[31m+---------------------------------------------------------------------+
[0m[31m|[0m                                                                     [31m|[0m
[31m|[0m        DVC has enabled anonymous aggregate usage analytics.         [31m|[0m
[31m|[0m     Read the analytics documentation (and how to opt-out) here:     [31m|[0m
[31m|[0m             <[36mhttps://dvc.org/doc/user-guide/analytics[39m>              [31m|[0m
[31m|[0m                                                                     [31m|[0m
[31m+---------------------------------------------------------------------+
[0m
[33mWhat's next?[39m
[33m------------[39m
- Check out the documentation: <[36mhttps://dvc.org/doc[39m>
- Get help and share ideas: <[36mhttps://dvc.org/chat[39m>
- Star us on GitHub: <[36mhttps://github.com/iterative/dvc[39m>
[0m

# Adding Gdrive as Remote

In [5]:
!dvc remote add --default drive gdrive://18yyvV_GDAQe3SpAQnCZg-aCau_XrogpR
!dvc remote modify drive gdrive_acknowledge_abuse true
!git add .dvc/config
!git commit -m "Adding Gdrive as Remote"

Setting 'drive' as a default remote.
[0m[0m[main 7304aa7] Adding Gdrive as Remote
 3 files changed, 11 insertions(+)
 create mode 100644 Assignment 2/.dvc/.gitignore
 create mode 100644 Assignment 2/.dvc/config
 create mode 100644 Assignment 2/.dvcignore


# Adding Raw Data via DVC

In [6]:
!dvc add data/raw_data.csv -q
!git add data/.gitignore data/raw_data.csv.dvc
!git commit -m "Adding Raw Data for Assignment 2"

[?25l[32m⠋[0m Checking graph
[1A[2K[0m[main ac4cc52] Adding Raw Data for Assignment 2
 2 files changed, 6 insertions(+)
 create mode 100644 Assignment 2/data/.gitignore
 create mode 100644 Assignment 2/data/raw_data.csv.dvc


# Reading, Splitting and Saving Data with SEED1

In [7]:
read_split_and_save_data("raw_data.csv", SEED1)

# Adding Train, Validation and Test Data via DVC

In [8]:
!dvc add data/train.csv data/validation.csv data/test.csv -q
!git add data/.gitignore data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc
!git commit -m "Adding Train, Validation and Test for Assignment 2"

[?25l[32m⠋[0m Checking graph
[1A[2K[0m[main b7b55e7] Adding Train, Validation and Test for Assignment 2
 4 files changed, 18 insertions(+)
 create mode 100644 Assignment 2/data/test.csv.dvc
 create mode 100644 Assignment 2/data/train.csv.dvc
 create mode 100644 Assignment 2/data/validation.csv.dvc


# Reading, Splitting and Saving Data with SEED2

In [9]:
read_split_and_save_data("raw_data.csv", SEED2)

# Adding Updated Train, Validation and Test Data via DVC

In [10]:
!dvc add data/train.csv data/validation.csv data/test.csv -q
!git add data/.gitignore data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc
!git commit -m "Adding Updated Train, Validation and Test for Assignment 2"

[?25l[32m⠋[0m Checking graph
[1A[2K[0m[main ef059f7] Adding Updated Train, Validation and Test for Assignment 2
 3 files changed, 6 insertions(+), 6 deletions(-)


# Checking Out the First Version

In [11]:
all_commits = subprocess.getoutput("git log --oneline").splitlines()
commit_id_line = [line for line in all_commits if "Adding Train, Validation and Test for Assignment 2" in line][0]
commit_id = commit_id_line.split()[0]

!git checkout $commit_id data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc
!dvc checkout

Updated 3 paths from 798e933
Building workspace index                              |5.00 [00:00, 5.02entry/s]
Comparing indexes                                     |6.00 [00:00,  596entry/s]
Applying changes                                      |3.00 [00:00,   406file/s]
[33mM[0m       data/test.csv
[33mM[0m       data/train.csv
[33mM[0m       data/validation.csv
[0m

# Printing Out the Distribution of the Target Variable before Update

In [12]:
print_distribution_of_the_splits()

Distribution of the Splitted Data:

Data Type: Train
0 count: 4582
1 count: 4582

Data Type: Validation
0 count: 573
1 count: 573

Data Type: Test
0 count: 573
1 count: 573


# Checking Out the Updated Version

In [13]:
all_commits = subprocess.getoutput("git log --oneline").splitlines()
commit_id_line = [line for line in all_commits if "Adding Updated Train, Validation and Test for Assignment 2" in line][0]
commit_id = commit_id_line.split()[0]

!git checkout $commit_id data/train.csv.dvc data/validation.csv.dvc data/test.csv.dvc
!dvc checkout

Updated 3 paths from 57c4bb9
Building workspace index                              |5.00 [00:00,  232entry/s]
Comparing indexes                                    |6.00 [00:00, 1.34kentry/s]
Applying changes                                      |3.00 [00:00,   488file/s]
[33mM[0m       data/validation.csv
[33mM[0m       data/train.csv
[33mM[0m       data/test.csv
[0m

# Printing out the Distribution of the Target Variable after Update

In [14]:
print_distribution_of_the_splits()

Distribution of the Splitted Data:

Data Type: Train
0 count: 4582
1 count: 4582

Data Type: Validation
0 count: 573
1 count: 573

Data Type: Test
0 count: 573
1 count: 573


# Pushing All Data to Google Drive

In [15]:
!dvc push -q

[0m