<div align = "center">

AML Assignment 2 : SMS Spam Classification Experiment Tracking

Part I - Data Version Control (DVC)

Trishita Patra

</div>

This notebook does the following:
* Loads the raw dataset, preprocesses it, and stores it as `raw_data.csv`.
* Splits the data into train, validation, and test datasets.
* Regenerates the splits with different random seeds and tracks each dataset version with DVC.
* Checks out earlier versions and optionally configures Google Drive as remote storage.

### Required Libraries

In [1]:
!pip install -q dvc
!git init
!dvc init

Reinitialized existing Git repository in /content/.git/
ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.

In [2]:
import os
import pandas as pd
from sklearn.model_selection import train_test_split
import hashlib

In [3]:
from google.colab import files
uploaded = files.upload()

Saving SMSSpamCollection to SMSSpamCollection (1)


### Helper Functions

In [4]:
def load_data(file_path: str) -> pd.DataFrame:
    """
    Load SMS Spam dataset from a file path.
    """
    df = pd.read_csv(
        file_path,
        sep="\t",
        header=None,
        names=["label", "text"]
    )
    return df

In [5]:
def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """
    Preprocess the dataset:
    - Display unique labels and their count
    - Validate label names and count
    - Lowercase text
    - Encode labels (ham -> 0, spam -> 1)
    """

    df = df.copy()

    # -------- Label validation --------
    unique_labels = df["label"].unique()
    num_unique = len(unique_labels)

    print(f"Unique labels found ({num_unique}): {unique_labels}")

    expected_labels = {"ham", "spam"}

    if num_unique != 2 or set(unique_labels) != expected_labels:
        raise ValueError(
            f"Label validation failed.\n"
            f"Expected labels: {expected_labels}\n"
            f"Found labels: {set(unique_labels)}"
        )

    # -------- Preprocessing --------
    df["text"] = df["text"].str.lower()
    df["label"] = df["label"].map({"ham": 0, "spam": 1})

    return df


In [6]:
def split_data(
    df: pd.DataFrame,
    test_size: float = 0.15,
    val_size: float = 0.15,
    random_state: int = 42
):
    """
    Split data into train, validation, and test sets (70/15/15).
    Only the validation/test split is stratified; the train/holdout split
    is not, so label ratios in the train set can vary with the seed.
    Missing values (if any) are removed before splitting.
    """

    # -------- Null check --------
    null_count = df.isnull().sum().sum()

    if null_count > 0:
        print(f"Found {null_count} missing values. Dropping rows with missing data.")
        df = df.dropna().reset_index(drop=True)
    else:
        print("No missing values found. Proceeding with split.")

    # -------- Split data --------
    train_df, temp_df = train_test_split(
        df,
        test_size=test_size + val_size,
        #stratify=df["label"],
        random_state=random_state
    )

    relative_val_size = val_size / (test_size + val_size)

    val_df, test_df = train_test_split(
        temp_df,
        test_size=1 - relative_val_size,
        stratify=temp_df["label"],
        random_state=random_state
    )

    return train_df, val_df, test_df
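
As a rough check of the 70/15/15 arithmetic, the sketch below reproduces the split sizes without scikit-learn. It assumes the dataset has 5572 rows (the sum of the split sizes printed later in this notebook) and that `train_test_split` rounds a fractional `test_size` up to the next whole row. `expected_split_sizes` is a hypothetical helper, not part of the notebook.

```python
import math


def expected_split_sizes(n_rows, test_size=0.15, val_size=0.15):
    """Estimate (train, val, test) row counts for the two-stage split.

    Assumption: the held-out size of each stage is rounded up,
    mirroring how scikit-learn treats a float test_size.
    """
    # Stage 1: hold out the test and validation rows together.
    holdout = math.ceil((test_size + val_size) * n_rows)
    train = n_rows - holdout

    # Stage 2: divide the holdout into validation and test.
    relative_val_size = val_size / (test_size + val_size)
    test = math.ceil((1 - relative_val_size) * holdout)
    val = holdout - test
    return train, val, test


sizes = expected_split_sizes(5572)
print(sizes)
```

Under these assumptions the sizes agree with the row counts reported by `show_distribution` after each checkout.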

In [7]:
def save_splits(train_df, val_df, test_df, raw_df, output_dir="."):
    """
    Save raw data and splits to CSV files.
    """
    raw_df.to_csv(os.path.join(output_dir, "raw_data.csv"), index=False)
    train_df.to_csv(os.path.join(output_dir, "train.csv"), index=False)
    val_df.to_csv(os.path.join(output_dir, "validation.csv"), index=False)
    test_df.to_csv(os.path.join(output_dir, "test.csv"), index=False)


In [8]:
def show_distribution(file): # To print label distribution
    df = pd.read_csv(file)
    print("\n", file)
    print(df["label"].value_counts())

In [9]:
def file_hash(path):
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()
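
The hash is used later to confirm that a checkout really swapped the file contents. A minimal sketch of that idea follows, using a chunked variant that avoids reading a large file into memory at once. The name `md5_of_file` and the temporary files are made up for illustration.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=8192):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "split_a.csv")
    path_b = os.path.join(tmp, "split_b.csv")
    path_c = os.path.join(tmp, "split_c.csv")

    # Two files with identical content and one with a different row order.
    with open(path_a, "wb") as f:
        f.write(b"label,text\n0,hello\n1,win a prize\n")
    with open(path_b, "wb") as f:
        f.write(b"label,text\n1,win a prize\n0,hello\n")
    with open(path_c, "wb") as f:
        f.write(b"label,text\n0,hello\n1,win a prize\n")

    hash_a = md5_of_file(path_a)
    hash_b = md5_of_file(path_b)
    hash_c = md5_of_file(path_c)

print("a == c:", hash_a == hash_c)
print("a == b:", hash_a == hash_b)
```

Identical bytes give identical digests, while any change in content or row order gives a different one, which is why differing hashes indicate a different dataset version.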

### Version 1

In [39]:
df = load_data("SMSSpamCollection")
df = preprocess_data(df)

train, val, test = split_data(df, random_state=42)
save_splits(train, val, test, df)

print("Seed = 42")

Unique labels found (2): ['ham' 'spam']
No missing values found. Proceeding with split.
Seed = 42


#### Track with DVC

In [11]:
# Credentials hidden before pushing to GitHub
!git config --global user.email "mail@gmail.com"
!git config --global user.name "userid"

In [40]:
!dvc add raw_data.csv train.csv validation.csv test.csv
!git add .
!git commit -m "first version"

### Version 2

In [41]:
train, val, test = split_data(df, random_state=123)
save_splits(train, val, test, df)

print("Seed = 123")

No missing values found. Proceeding with split.
Seed = 123


In [42]:
!dvc add train.csv validation.csv test.csv
!git add .
!git commit -m "updated version"

### Checkout V1

In [38]:
'''
!rm -rf .git
!git init
!git add .
!git commit -m "fresh start"
'''

hint: Using 'master' as the name for the initial branch. This default branch name
hint: is subject to change. To configure the initial branch name to use in all
hint: 
hint: 	git config --global init.defaultBranch <name>
hint: 
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and
hint: 'development'. The just-created branch can be renamed via this command:
hint: 
hint: 	git branch -m <name>
Initialized empty Git repository in /content/.git/
[master (root-commit) 5e3caf1] fresh start
 31 files changed, 62241 insertions(+)
 create mode 100644 .config/.last_opt_in_prompt.yaml
 create mode 100644 .config/.last_survey_prompt.yaml
 create mode 100644 .config/.last_update_check.json
 create mode 100644 .config/active_config
 create mode 100644 .config/config_sentinel
 create mode 100644 .config/configurations/config_default
 create mode 100644 .config/default_configs.db
 create mode 100644 .config/gce
 

''

In [43]:
!git log --oneline

0b588f2 (HEAD -> master) updated version
f85009a first version
5e3caf1 fresh start


In [44]:
!git checkout f85009a
!dvc checkout

Note: switching to 'f85009a'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at f85009a first version
Building workspace index          |4.00 [00:00, 40.5entry/s]
Comparing indexes          |5.00 [00:00, 2.62kentry/s]
Applying changes          |3.00 [00:00,   843file/s]
M       test.csv
M       train.csv
M       validation.csv

In [45]:
show_distribution("train.csv")
print("Hash:", file_hash("train.csv"))
print("\n")
print(pd.read_csv("train.csv").head())
show_distribution("validation.csv")
show_distribution("test.csv")


 train.csv
label
0    3377
1     523
Name: count, dtype: int64
Hash: ea9369cb339c429b8f9c1d6ba210e72c


   label                                               text
0      0  quite late lar... ard 12 anyway i wun b drivin...
1      0                      on a tuesday night r u 4 real
2      0  go chase after her and run her over while she'...
3      0   g says you never answer your texts, confirm/deny
4      0       still work going on:)it is very small house.

 validation.csv
label
0    724
1    112
Name: count, dtype: int64

 test.csv
label
0    724
1    112
Name: count, dtype: int64


### Checkout V2

In [46]:
!git checkout 0b588f2
!dvc checkout

Previous HEAD position was f85009a first version
HEAD is now at 0b588f2 updated version
Building workspace index          |4.00 [00:00,  112entry/s]
Comparing indexes          |5.00 [00:00, 2.67kentry/s]
Applying changes          |3.00 [00:00,   301file/s]
M       test.csv
M       train.csv
M       validation.csv

In [47]:
show_distribution("train.csv")
print("Hash:", file_hash("train.csv"))
print("\n")
print(pd.read_csv("train.csv").head())
show_distribution("validation.csv")
show_distribution("test.csv")


 train.csv
label
0    3383
1     517
Name: count, dtype: int64
Hash: 7e80ba2a31b843694a14f878fd7a02f7


   label                                               text
0      0                     what year. and how many miles.
1      0  ok im not sure what time i finish tomorrow but...
2      1  87077: kick off a new season with 2wks free go...
3      1  get ur 1st ringtone free now! reply to this ms...
4      0                         not yet chikku..wat abt u?

 validation.csv
label
0    721
1    115
Name: count, dtype: int64

 test.csv
label
0    721
1    115
Name: count, dtype: int64


### Observation

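- The class distribution in the train set does not remain identical across versions (3377 ham / 523 spam for seed 42 versus 3383 ham / 517 spam for seed 123) because the first split is not stratified. The validation and test sets are stratified only within the held-out portion, so their label counts also shift between versions.
- The actual samples in each split also changed, i.e. each seed produces a different dataset version, which is confirmed by:
  - Different file hashes
  - Different first rows of train.csv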
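
To make the shift in class balance concrete, the sketch below computes the spam rate of each split from the label counts printed by `show_distribution` above. The dictionary layout and the `spam_rate` helper are illustrative only and not part of the notebook.

```python
# Label counts copied from the show_distribution outputs: (ham, spam).
counts = {
    "v1 (seed 42)": {
        "train": (3377, 523),
        "validation": (724, 112),
        "test": (724, 112),
    },
    "v2 (seed 123)": {
        "train": (3383, 517),
        "validation": (721, 115),
        "test": (721, 115),
    },
}


def spam_rate(ham, spam):
    """Fraction of rows labelled spam."""
    return spam / (ham + spam)


rates = {
    version: {split: spam_rate(*pair) for split, pair in splits.items()}
    for version, splits in counts.items()
}

for version, splits in rates.items():
    for split, rate in splits.items():
        print(f"{version:14s} {split:10s} spam rate = {rate:.2%}")
```

The rates differ slightly between the two versions, which is what one would expect when the train/holdout split is not stratified.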

### Bonus

In [26]:
!pip install -q dvc[gdrive]

In [27]:
GDRIVE_FOLDER_ID="1G4k7FFYEHBFlZd3XY0cKLvxdsA-sS5Co"

!dvc remote add -d gdrive gdrive://$GDRIVE_FOLDER_ID

'''
!dvc remote modify gdrive gdrive_client_id '<ID>'
!dvc remote modify gdrive gdrive_client_secret '<SECRET>'
'''

Setting 'gdrive' as a default remote.

In [28]:
!dvc remote list

gdrive  gdrive://1G4k7FFYEHBFlZd3XY0cKLvxdsA-sS5Co      (default)

In [29]:
!dvc push

Collecting          |4.00 [00:00, 79.9entry/s]
Pushing
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?client_id=710796635688-iivsgbgsb6uv1fap6635dhvuei09o66c.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8090%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdrive.appdata&access_type=offline&response_type=code&approval_prompt=force

Pushing
ERROR: interrupted by the user
  0% |          |0/? [1:22:54<?,    ?files/s]

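The push was interrupted manually after roughly 1.5 hours without transferring any files. The exact cause was not identified. It is most likely an authentication problem: the OAuth flow tries to open a browser and redirect to `localhost:8090`, which may not be reachable from a hosted Colab runtime.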

In [30]:
!git add .dvc/config
!git commit -m "configure Google Drive as DVC remote"

[master 4601234] configure Google Drive as DVC remote
 1 file changed, 4 insertions(+)
