## Setting up `DVC` and preprocessing data


### Initializing `DVC`

In [None]:
# Initialize a new DVC project in the parent directory of the current directory
%cd ..
!dvc init
%cd "Assignment 2"

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


### Setting up remote storage for `dvc`

1. DVC remote storage acts like Git remote but for data files
2. Data files are stored separately from code in Google Drive
3. `.dvc` files in Git track which version of data to retrieve


A Google Drive remote can be configured with:
- `dvc remote add <remote-name> gdrive://<folder-id>`: registers a Google Drive folder as the remote
- `dvc remote modify <remote-name> gdrive_client_id <client_id>`
- `dvc remote modify <remote-name> gdrive_client_secret <client_secret>`

(Follow [these](https://dvc.org/doc/user-guide/data-management/remote-storage/google-drive) steps to get the client ID and secret pair.)

These commands write the remote settings to the `config` file inside the `.dvc` directory.

The gdrive remote storage can be accessed [here](https://drive.google.com/drive/folders/1o-BE-tYaAi1a-_dChnH4jB1nwjIUL5cT?usp=sharing). 

In [None]:
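# Use one remote name throughout; -d makes it the default remote so a bare `dvc push` knows where to upload
!dvc remote add -d gdrive_remote gdrive://0AIac4JZqHhKmUk9PDA
!dvc remote modify gdrive_remote gdrive_client_id "<client_id>"
!dvc remote modify gdrive_remote gdrive_client_secret "<client_secret>"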

### Importing necessary libraries 

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from prettytable import PrettyTable

import warnings
warnings.filterwarnings("ignore")

### Defining `utility functions`

In [None]:
def read_data(file_path, names, sep='\t'):
    """
    Read data from a delimited text file into a pandas DataFrame.
    
    The file is assumed to have no header row: because `names` is passed,
    pandas reads the first line as data.
    
    Parameters:
        file_path (str): Path to the data file
        names (list): Column names to assign, e.g. ['category', 'message']
        sep (str): Separator used in the file (default: tab)
    
    Returns:
        pandas.DataFrame: DataFrame containing the SMS data
    """
        
    data = pd.read_csv(
        file_path,
        sep=sep,
        names=names
    )
    
    return data


def save_data(df, path):
    """
    Save DataFrame to CSV file.
    
    Parameters:
        df (DataFrame): Data to save
        path (str): File path where data will be saved
    """
    try:
        df.to_csv(path, index=False)
        print(f"Successfully saved {path}")
    except Exception as e:
        print(f"Error saving file {path}: {str(e)}")


def split_data(data, train_pct=0.7, val_pct=0.1, test_pct=0.2, random_state=42):
    """
    Split data into train, validation and test sets based on fractional inputs.
    
    Parameters:
        data (DataFrame): Input DataFrame to split
        train_pct (float): Fraction of data for training (default 0.7)
        val_pct (float): Fraction of data for validation (default 0.1)
        test_pct (float): Fraction of data for testing (default 0.2)
        random_state (int): Random seed for reproducibility
        
    Returns:
        tuple: (train_data, val_data, test_data)
    """
    assert round(train_pct + val_pct + test_pct, 3) == 1.0, "Percentages must sum to 1"
    
    train_data, temp_data = train_test_split(
        data, 
        train_size=train_pct,
        random_state=random_state,
    )
    
    val_ratio = val_pct / (val_pct + test_pct)
    val_data, test_data = train_test_split(
        temp_data,
        train_size=val_ratio,
        random_state=random_state,
    )
    
    print(f"Training data: {len(train_data)} samples ({train_pct*100:.1f}%)")
    print(f"Validation data: {len(val_data)} samples ({val_pct*100:.1f}%)")
    print(f"Test data: {len(test_data)} samples ({test_pct*100:.1f}%)")
    
    return train_data, val_data, test_data


def create_dataset_counts_table(train_data, val_data, test_data):
    """
    Create a pretty table showing spam/ham counts for each dataset
    
    Parameters:
        train_data (DataFrame): Training dataset
        val_data (DataFrame): Validation dataset
        test_data (DataFrame): Test dataset
        
    Returns:
        PrettyTable: Formatted table with dataset statistics
    """
    table = PrettyTable()
    table.field_names = ["Dataset", "Ham", "Spam", "Total"]
    
    for name, dataset in [("Training", train_data), ("Validation", val_data), ("Test", test_data)]:
        counts = dataset['category'].value_counts()
        table.add_row([
            name,
            counts.get('ham', 0),
            counts.get('spam', 0),
            len(dataset)
        ])
    
    return table

### Reading the data

In [59]:
data = read_data('sms-spam-collection/SMSSpamCollection', names=['category', 'message'])

### Saving the raw data

In [25]:
save_data(data, 'sms-spam-collection/raw_data.csv')

Successfully saved sms-spam-collection/raw_data.csv


## How DVC works

DVC tracks data files separately from code. When we commit changes:

1. First, we start tracking the files with `dvc add`.
2. This generates a `.dvc` file for each data file, containing metadata (such as a content hash) that identifies the data.
3. The `.dvc` file is committed to Git, while the data file itself is added to `.gitignore` and managed by DVC.
4. `dvc commit` updates the DVC cache and the `.dvc` file after a tracked file changes.
5. `dvc push` uploads the cached data to the configured remote storage.
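
To make step 2 concrete: a `.dvc` file identifies the data by an MD5 digest of its contents instead of storing the data itself. The sketch below is not DVC's implementation. It only uses `hashlib` to show that two different splits saved under the same file name produce different digests, which is why the `.dvc` file changes from one version to the next. The helper `content_md5` and the sample rows are made up for illustration.

```python
import hashlib


def content_md5(data: bytes) -> str:
    """Return the hex MD5 digest of raw file contents (hypothetical helper)."""
    return hashlib.md5(data).hexdigest()


# Two made-up versions of the same file name, e.g. train.csv from two splits.
train_v0 = b"category,message\nham,See you at 5\nspam,WIN a prize now\n"
train_v1 = b"category,message\nham,Ok\nspam,WIN a prize now\n"

digest_v0 = content_md5(train_v0)
digest_v1 = content_md5(train_v1)

print(digest_v0)
print(digest_v1)
print("same version:", digest_v0 == digest_v1)
```

Because Git only stores the small `.dvc` file with the digest, switching commits changes which digest is requested, and `dvc checkout` restores the matching data from the cache.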

#### Adding and committing a file with `dvc`

In [26]:
!dvc add sms-spam-collection/raw_data.csv
!dvc commit sms-spam-collection/raw_data.csv

#### Committing the added `.dvc` file to `Git` 

In [None]:
!git add sms-spam-collection/raw_data.csv.dvc
!git commit -m "Committing the tracked `.dvc` file for `raw_data.csv`"

### Setting up `Version0` for the data split

- Split the data into train, validation and test sets using random state `42`.
- Add and commit the splits with `dvc`.
- Push the files to remote storage.
- Add the `.dvc` files to Git and commit them.
- Create the `spam-ham` count table.

In [36]:
train_data, val_data, test_data = split_data(data, random_state=42)
save_data(train_data, 'sms-spam-collection/train.csv')
save_data(val_data, 'sms-spam-collection/val.csv')
save_data(test_data, 'sms-spam-collection/test.csv')

Training data: 3900 samples (70.0%)
Validation data: 557 samples (10.0%)
Test data: 1115 samples (20.0%)
Successfully saved sms-spam-collection/train.csv
Successfully saved sms-spam-collection/val.csv
Successfully saved sms-spam-collection/test.csv
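
The sizes printed above follow from the two-stage split in `split_data`: the validation fraction is rescaled to the leftover 30% before the second split. The sketch below reproduces only that arithmetic, with no shuffling. It assumes that `train_test_split` floors the sample count when given a fractional `train_size`, and it takes 5572 rows as the dataset size (the sum of the three reported counts). The helper `two_stage_sizes` is hypothetical.

```python
import math


def two_stage_sizes(n_samples, train_pct=0.7, val_pct=0.1, test_pct=0.2):
    """Mirror the size arithmetic of split_data (hypothetical helper, no shuffling)."""
    # Stage 1: the fractional train size is floored; the rest forms the temp set.
    n_train = math.floor(train_pct * n_samples)
    n_temp = n_samples - n_train
    # Stage 2: the validation share is taken relative to the temp set only.
    val_ratio = val_pct / (val_pct + test_pct)
    n_val = math.floor(val_ratio * n_temp)
    n_test = n_temp - n_val
    return n_train, n_val, n_test


sizes = two_stage_sizes(5572)
print(sizes)  # (3900, 557, 1115)
```

Passing `val_pct` directly to the second split would take 10% of the leftover set instead of 10% of the full dataset, which is why the ratio is rescaled.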


In [None]:
!dvc add sms-spam-collection/train.csv sms-spam-collection/val.csv sms-spam-collection/test.csv


To track the changes with git, run:

	git add 'sms-spam-collection\.gitignore' 'sms-spam-collection\val.csv.dvc' 'sms-spam-collection\test.csv.dvc' 'sms-spam-collection\train.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true





In [38]:
!dvc commit sms-spam-collection/train.csv sms-spam-collection/val.csv sms-spam-collection/test.csv

In [46]:
!dvc push

Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=517335322439-k799651rp5hmlceqhlr6koh1l0hc65je.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8080%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Authentication successful.
3 files pushed




In [60]:
!git add sms-spam-collection/train.csv.dvc sms-spam-collection/val.csv.dvc sms-spam-collection/test.csv.dvc
!git commit -m "Committing train, validation and test datasets: V0"

[main d016042] Committing train, validation and test datasets: V0
 3 files changed, 15 insertions(+)
 create mode 100644 Assignment 2/sms-spam-collection/test.csv.dvc
 create mode 100644 Assignment 2/sms-spam-collection/train.csv.dvc
 create mode 100644 Assignment 2/sms-spam-collection/val.csv.dvc


In [51]:
create_dataset_counts_table(train_data, val_data, test_data)

Dataset,Ham,Spam,Total
Training,3377,523,3900
Validation,482,75,557
Test,966,149,1115
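
`split_data` does not stratify by category, so it is worth checking that the class balance is similar across the three sets. A small sketch using only the counts from the table above (the helper `spam_share` is illustrative):

```python
# Counts copied from the Version0 table: (ham, spam) per split.
counts = {
    "Training": (3377, 523),
    "Validation": (482, 75),
    "Test": (966, 149),
}


def spam_share(ham, spam):
    """Return the spam percentage of a split, rounded to one decimal."""
    return round(100 * spam / (ham + spam), 1)


shares = {name: spam_share(ham, spam) for name, (ham, spam) in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share}% spam")
```

All three splits come out at roughly 13.4 to 13.5% spam, so the random split kept the class balance close to uniform here.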


### Setting up `Version1` for the data split

This follows the same workflow as above, but with random state `52`.

In [61]:
train_data, val_data, test_data = split_data(data, random_state=52)
save_data(train_data, 'sms-spam-collection/train.csv')
save_data(val_data, 'sms-spam-collection/val.csv')
save_data(test_data, 'sms-spam-collection/test.csv')

Training data: 3900 samples (70.0%)
Validation data: 557 samples (10.0%)
Test data: 1115 samples (20.0%)
Successfully saved sms-spam-collection/train.csv
Successfully saved sms-spam-collection/val.csv
Successfully saved sms-spam-collection/test.csv


In [62]:
create_dataset_counts_table(train_data, val_data, test_data)

Dataset,Ham,Spam,Total
Training,3381,519,3900
Validation,479,78,557
Test,965,150,1115


In [63]:
!dvc add sms-spam-collection/train.csv sms-spam-collection/val.csv sms-spam-collection/test.csv


To track the changes with git, run:

	git add 'sms-spam-collection\val.csv.dvc' 'sms-spam-collection\train.csv.dvc' 'sms-spam-collection\test.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true





In [64]:
!dvc commit sms-spam-collection/train.csv sms-spam-collection/val.csv sms-spam-collection/test.csv

In [66]:
!dvc push

3 files pushed


In [None]:
!git add sms-spam-collection/train.csv.dvc sms-spam-collection/val.csv.dvc sms-spam-collection/test.csv.dvc
!git commit -m "Committing train, validation and test datasets: V1"

[main eb249d8] Committing train, validation and test datasets: V1
 3 files changed, 6 insertions(+), 6 deletions(-)




### Checking out previous `dvc` version

#### Showing `Git` commits

In [67]:
!git log --oneline

eb249d8 Committing train, validation and test datasets: V1
d016042 Committing train, validation and test datasets: V0
8766195 Committing the tracked `.dvc` file for `raw_data.csv`
c8e3d88 Writing functions to remove repetitive codes
4fe49aa Update README.md
3896694 add `train.ipynb`
f4f0af5 Add `prepare.ipynb`
d68819d add `readme` and processed datasets.
0cc82ef add `.gitignore` and `requirements`
4b32ac1 Create README.md


#### Checking out the previous version `V0`

In [70]:
!git checkout d016042

M	.gitignore
M	README.md
M	requirements.txt


Note: switching to 'd016042'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at d016042 Committing train, validation and test datasets: V0


#### Reflect changes in files using `dvc checkout`

In [71]:
!dvc checkout

M       sms-spam-collection\train.csv
M       sms-spam-collection\test.csv
M       sms-spam-collection\val.csv


#### Load the dataset and view the `spam-ham` count

In [None]:
train_data = read_data('sms-spam-collection/train.csv', names=['category', 'message'], sep=',')
val_data = read_data('sms-spam-collection/val.csv', names=['category', 'message'], sep=',')
test_data = read_data('sms-spam-collection/test.csv', names=['category', 'message'], sep=',')

In [76]:
create_dataset_counts_table(train_data, val_data, test_data)

Dataset,Ham,Spam,Total
Training,3377,523,3901
Validation,482,75,558
Test,966,149,1116
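
In each row above, the total is one more than ham plus spam. A likely cause, assuming the CSV files were written by `save_data` with a header row, is that passing `names` to `pd.read_csv` makes pandas treat that header line as data. The sketch below reproduces the effect on a made-up in-memory CSV and shows that adding `header=0` skips the header while still applying the given names.

```python
import io

import pandas as pd

# Made-up CSV with a header row, shaped like the files written by save_data.
csv_text = "category,message\nham,See you at 5\nspam,WIN a prize now\nham,Ok\n"
columns = ["category", "message"]

# names without header=0: the header line is read as a data row.
header_as_data = pd.read_csv(io.StringIO(csv_text), names=columns)

# names with header=0: the header line is consumed and replaced by `names`.
header_skipped = pd.read_csv(io.StringIO(csv_text), names=columns, header=0)

print(len(header_as_data), len(header_skipped))  # 4 3
```

The extra row has the category value `category`, so it is counted in the total but in neither the ham nor the spam column.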


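#### Note that the ham and spam counts match the `Version0` table.

Each total is one higher than in `Version0` because `read_data` is called with `names`, so the header row written by `save_data` is counted as a data row.

This way, we can check out any Git commit and then run `dvc checkout` to restore the data from that version, without pushing large files to GitHub.

Checking out the most recent version (run `dvc checkout` afterwards to restore the matching data files).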

In [1]:
!git checkout eb249d8

M	.gitignore
M	README.md
M	requirements.txt


HEAD is now at eb249d8 Committing train, validation and test datasets: V1
