# Setting up `DVC`

In [4]:
# Initialize a new DVC project in the parent directory of the current working directory
%cd ..
!dvc init
%cd Assignment_2

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


#### Import Necessary Libraries

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from prettytable import PrettyTable

#### Function to read data

In [13]:
def read_sms_data(file_path='SMSSpamCollection'):
    """
    Read SMS data from a file and return a pandas DataFrame.
    
    Args:
        file_path (str): Path to the SMS data file (default: 'SMSSpamCollection')
        
    Returns:
        pandas.DataFrame: DataFrame containing the SMS data with 'label' and 'text' columns
    """
    try:
        df = pd.read_csv(file_path, delimiter='\t', header=None, names=['label', 'text'])
        print(f"Successfully read {len(df)} messages")
        return df
    except Exception as e:
        print(f"Error reading file: {e}")
        return None

#### Reading the data

In [14]:
df = read_sms_data()
df.head()

Successfully read 5572 messages


|   | label | text |
|---|-------|------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### Define `utility functions`

In [None]:
def save_data(data, output_file):
    """
    Save the raw SMS data to a CSV file.
    
    Args:
        data (pandas.DataFrame): DataFrame containing the SMS data
        output_file (str): Path of the output CSV file (e.g. 'raw_data.csv')
    """
    try:
        data.to_csv(output_file, index=False)
        print(f"Successfully saved {len(data)} messages to {output_file}")
    except Exception as e:
        print(f"Error saving file: {e}")

def compare_ham_spam(train_data, validation_data, test_data):
    """
    Compare the total number of hams and spams in train, validation, and test datasets.
    
    Args:
        train_data (pandas.DataFrame): Training dataset
        validation_data (pandas.DataFrame): Validation dataset
        test_data (pandas.DataFrame): Test dataset
        
    Returns:
        PrettyTable: Table comparing the number of hams and spams in each dataset
    """
    # Count hams and spams in each dataset
    train_counts = train_data['label'].value_counts()
    validation_counts = validation_data['label'].value_counts()
    test_counts = test_data['label'].value_counts()
    
    # Create a PrettyTable
    table = PrettyTable()
    table.field_names = ["Dataset", "Ham", "Spam"]
    
    # Add rows to the table
    table.add_row(["Train", train_counts.get('ham', 0), train_counts.get('spam', 0)])
    table.add_row(["Validation", validation_counts.get('ham', 0), validation_counts.get('spam', 0)])
    table.add_row(["Test", test_counts.get('ham', 0), test_counts.get('spam', 0)])
    
    return table

#### Saving the raw data to `raw_data.csv`

In [None]:
save_data(df, "raw_data.csv")

Successfully saved 5572 messages to raw_data.csv


#### Function to perform `train/test/val` split

In [51]:
def split_and_save_data(data, random_seed):
    """
    Split data into train, validation and test sets and save them as CSV files.
    
    Args:
        data (pandas.DataFrame): DataFrame containing the SMS data to split
        random_seed (int): Random seed for reproducibility
    """
    # Work on the DataFrame that was passed in
    df = data
    
    # First split: 70% train+val, 30% test
    train_val, test = train_test_split(df, test_size=0.3, 
                                     random_state=random_seed)
    
    # Second split: 80% train, 20% validation (from train_val)
    train, val = train_test_split(train_val, test_size=0.2, 
                                 random_state=random_seed)
    
    # Save the splits
    train.to_csv('train.csv', index=False)
    val.to_csv('validation.csv', index=False)
    test.to_csv('test.csv', index=False)
    
    print(f"Train size: {len(train)}")
    print(f"Validation size: {len(val)}")
    print(f"Test size: {len(test)}")

## Setting up `Version 0` of the data

In [22]:
split_and_save_data(df, random_seed=42)

Train size: 3565
Validation size: 892
Test size: 1115
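The sizes printed above can be checked by hand. The sketch below reproduces the two-stage split arithmetic, assuming the held-out part of each split is rounded up and the remainder goes to the other side (this is believed to match how `train_test_split` treats a fractional `test_size`, but treat it as an assumption). The helper name `expected_split_sizes` is hypothetical and not part of the notebook.

```python
import math


def expected_split_sizes(n_rows, test_fraction, val_fraction):
    """Return (train, validation, test) sizes for a two-stage split."""
    # Assumption: the held-out count is rounded up, the rest stays on the other side.
    n_test = math.ceil(n_rows * test_fraction)
    n_train_val = n_rows - n_test
    n_val = math.ceil(n_train_val * val_fraction)
    n_train = n_train_val - n_val
    return n_train, n_val, n_test


# 5572 messages with a 20% test hold-out, then 20% of the rest for validation
print(expected_split_sizes(5572, 0.2, 0.2))

# The same data with the 30% test hold-out used in the function as shown
print(expected_split_sizes(5572, 0.3, 0.2))
```

Under this assumption, the `Version 0` sizes (3565 / 892 / 1115) are consistent with a 20% test hold-out, while `test_size=0.3` gives 3120 / 780 / 1672, the sizes reported for `version 1`.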


## Understanding DVC (Data Version Control)

DVC (Data Version Control) is an open-source tool designed for data science and machine learning projects that enables versioning of large files.

#### How DVC Works

1. **Git Integration**: DVC extends Git's functionality by storing data file metadata in Git, while the actual data lives in DVC's local cache and, once pushed, in remote storage.

2. **Storage Management**: Instead of storing large datasets in Git (which can be inefficient), DVC stores references to the data in Git and the actual data in configurable remote storage (like Google Drive, S3, etc.).

3. **File Tracking**: DVC replaces large files with small metafiles that are tracked by Git. These metafiles contain information needed to uniquely identify and reproduce the data.
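To make the idea of a metafile concrete, here is a minimal sketch of identifying a file by a hash of its content. It is a simplified illustration, not DVC's actual implementation: real `.dvc` files are YAML, and the exact fields and hashing details may differ. The function name `describe_file` is hypothetical.

```python
import hashlib
import os


def describe_file(path, chunk_size=1 << 20):
    """Build a small record that identifies a file by its content."""
    digest = hashlib.md5()
    # Read in chunks so that large data files do not have to fit in memory.
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "md5": digest.hexdigest(),
        "size": os.path.getsize(path),
        "path": os.path.basename(path),
    }
```

A record like this is only a few lines long regardless of the size of the data file, which is why it is cheap to keep in Git. If the content of the file changes, the hash changes, so each commit pins one exact version of the data.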

#### Important DVC Commands

- `dvc init`: Initialize a DVC project in a Git repository
- `dvc add <file>`: Start tracking a file with DVC
- `dvc remote add`: Configure a remote storage location
- `dvc push`: Upload tracked data to remote storage
- `dvc pull`: Download data from remote storage
- `dvc checkout`: Update working directory with tracked files
- `dvc commit`: Record changes to tracked files
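The commands above are typically used together when recording a new version of the data. The sketch below only builds the command lines in the order this notebook uses them, without running anything. The helper name `version_data_commands` is hypothetical.

```python
def version_data_commands(data_files, message):
    """Return the CLI commands, as argument lists, for recording one data version."""
    # `dvc add` writes one `<name>.dvc` metafile per tracked data file.
    metafiles = [f"{name}.dvc" for name in data_files]
    return [
        ["dvc", "add", *data_files],       # track the data files with DVC
        ["git", "add", *metafiles],        # stage only the small metafiles
        ["git", "commit", "-m", message],  # record this version in Git
        ["dvc", "push"],                   # upload the data to the default remote
    ]


for command in version_data_commands(["train.csv", "test.csv"], "V0: Committing data files"):
    print(" ".join(command))
```

Each argument list could be passed to `subprocess.run(command, check=True)` to execute the step from Python instead of using `!` cells.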

#### Setting up remote storage
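The following commands register a Google Drive folder as the default DVC remote and set the credentials required to push the data files to it. DVC writes these settings to `.dvc/config`.

The pushed files (stored by DVC under hash-based names rather than their original file names) can be accessed [here](https://drive.google.com/drive/folders/10XuqI_krQMY_ZOFGlwqgMUDOX2Xg789O?usp=sharing).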

In [None]:
!dvc remote add -d gdrive gdrive://10XuqI_krQMY_ZOFGlwqgMUDOX2Xg789O
!dvc remote modify gdrive gdrive_client_id "<client_id>"
!dvc remote modify gdrive gdrive_client_secret "<client_secret>"

Setting 'gdrive' as a default remote.


#### Adding large data files to `dvc`.

Note that we add the newly generated `.dvc` files to `Git`, not the actual data files.

In [None]:
# Add the CSV files to DVC tracking
!dvc add raw_data.csv train.csv validation.csv test.csv

# Stage the generated .dvc metafiles in Git (the commit happens in a later cell)
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc


To track the changes with git, run:

	git add test.csv.dvc validation.csv.dvc train.csv.dvc raw_data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true





In [30]:
!git status

On branch main
Your branch is ahead of 'origin/main' by 1 commit.
  (use "git push" to publish your local commits)

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   raw_data.csv.dvc
	new file:   test.csv.dvc
	new file:   train.csv.dvc
	new file:   validation.csv.dvc

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	prepare.ipynb



#### Committing the changes

In [31]:
!dvc commit raw_data.csv train.csv validation.csv test.csv

!git commit -m "V0: Committing data files"

[main ca9936e] V0: Committing data files
 4 files changed, 20 insertions(+)
 create mode 100644 Assignment_2/raw_data.csv.dvc
 create mode 100644 Assignment_2/test.csv.dvc
 create mode 100644 Assignment_2/train.csv.dvc
 create mode 100644 Assignment_2/validation.csv.dvc


#### Pushing the data to remote storage

In [None]:
!dvc push

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=1046188398954-djkl9k2eb6dte6o11kt054hlv8tidlqq.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Authentication successful.
3 files pushed




In [None]:
train_data = pd.read_csv('train.csv')
validation_data = pd.read_csv('validation.csv')
test_data = pd.read_csv('test.csv')

In [None]:
print(compare_ham_spam(train_data, validation_data, test_data))

+------------+------+------+
|  Dataset   | Ham  | Spam |
+------------+------+------+
|   Train    | 3087 | 478  |
| Validation | 772  | 120  |
|    Test    | 966  | 149  |
+------------+------+------+
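The table suggests that the three splits have a similar class balance. As a quick check, the sketch below computes the spam fraction per split from the counts in the table above. The counts are copied from the table, and the helper name `spam_fraction` is hypothetical.

```python
# Ham/spam counts copied from the comparison table above
counts = {
    "Train": {"ham": 3087, "spam": 478},
    "Validation": {"ham": 772, "spam": 120},
    "Test": {"ham": 966, "spam": 149},
}


def spam_fraction(split_counts):
    """Return the share of spam messages in one split."""
    total = split_counts["ham"] + split_counts["spam"]
    return split_counts["spam"] / total


fractions = {name: spam_fraction(c) for name, c in counts.items()}
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.3f}")
```

All three fractions fall between 13% and 14%, so the random split kept the classes roughly balanced here. If closer agreement were needed, `train_test_split` accepts a `stratify` argument.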


## Setting up `version 1` of the data

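- Follow the same steps as before, this time with `random_seed=59`. The sizes below correspond to a 30% test split (`test_size=0.3`), whereas the `Version 0` sizes correspond to a 20% test split.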

In [57]:
split_and_save_data(df, random_seed=59)

Train size: 3120
Validation size: 780
Test size: 1672


In [None]:
!dvc add raw_data.csv train.csv validation.csv test.csv
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc


To track the changes with git, run:

	git add train.csv.dvc validation.csv.dvc raw_data.csv.dvc test.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true





In [None]:
!dvc commit raw_data.csv train.csv validation.csv test.csv
!git commit -m "V1: Committing data files"

[main 5a40bf0] V1: Committing data files
 3 files changed, 6 insertions(+), 6 deletions(-)


In [62]:
!dvc push 

3 files pushed


In [None]:
train_data = pd.read_csv('train.csv')
validation_data = pd.read_csv('validation.csv')
test_data = pd.read_csv('test.csv')

print(compare_ham_spam(train_data, validation_data, test_data))

+------------+------+------+
|  Dataset   | Ham  | Spam |
+------------+------+------+
|   Train    | 2713 | 407  |
| Validation | 674  | 106  |
|    Test    | 1438 | 234  |
+------------+------+------+


## Checking out `Version 0`

#### Show git commits

In [63]:
!git log --oneline

5a40bf0 V1: Committing data files
ca9936e V0: Committing data files
a4bb3dc Add `dvc` to the project
284cb58 Modularizing the functions
a54b433 Add `assignment 1`
5937698 Add helper files
d52a256 Initial commit


#### Checkout the `V0` version

In [64]:
!git checkout ca9936e

Note: switching to 'ca9936e'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at ca9936e V0: Committing data files


#### Restore the data files

In [65]:
!dvc checkout

M       train.csv
M       test.csv
M       validation.csv


In [66]:
train_data = pd.read_csv('train.csv')
validation_data = pd.read_csv('validation.csv')
test_data = pd.read_csv('test.csv')

print(compare_ham_spam(train_data, validation_data, test_data))

+------------+------+------+
|  Dataset   | Ham  | Spam |
+------------+------+------+
|   Train    | 3087 | 478  |
| Validation | 772  | 120  |
|    Test    | 966  | 149  |
+------------+------+------+


As we can see, this table is the same as the one produced for the `V0` version.

#### Checking out the latest version

In [2]:
!git checkout 5a40bf0

M	requirements.txt


HEAD is now at 5a40bf0 V1: Committing data files
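Switching between data versions takes two steps: `git checkout` moves the `.dvc` metafiles to the chosen commit, and `dvc checkout` then syncs the data files in the working directory with those metafiles. The cell above shows only the Git step, so a `dvc checkout` would presumably be needed afterwards to bring back the `V1` CSV files. A minimal sketch of the pair of commands follows; the helper name `restore_version_commands` is hypothetical.

```python
def restore_version_commands(revision):
    """Return the commands that restore the data files recorded at a Git revision."""
    return [
        ["git", "checkout", revision],  # move the .dvc metafiles to that revision
        ["dvc", "checkout"],            # sync the data files with the metafiles
    ]


for command in restore_version_commands("5a40bf0"):
    print(" ".join(command))
```

Checking out a commit hash leaves Git in a detached HEAD state, as the earlier output notes; `git switch -` returns to the previous branch.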
