# Assignment 2: Experiment Tracking

## 1. Data Version Control

### Track Data Versions using DVC

In `prepare.ipynb`, track the versions of data using **DVC**:

1. Load the raw data into `raw_data.csv`.
2. Split the data and save it into `train.csv`, `validation.csv`, and `test.csv`.
3. Update the train/validation/test split by choosing a different random seed.
4. Check out the first version (before the update) using DVC and print the distribution of the target variable (number of 0s and number of 1s) in:
   - `train.csv`
   - `validation.csv`
   - `test.csv`
5. Check out the updated version using DVC and print the distribution of the target variable in:
   - `train.csv`
   - `validation.csv`
   - `test.csv`
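
Steps 4 and 5 come down to counting label values in each split file. A minimal sketch using only the standard library, assuming each CSV has a header row with a `label` column (the helper name `label_counts` is hypothetical and not part of the assignment):

```python
import csv
from collections import Counter


def label_counts(csv_path, label_column="label"):
    """Count how often each target value occurs in one CSV file.

    Values are returned as strings ("0", "1") because the csv module
    does not convert types.
    """
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[label_column] for row in reader)
```

Calling `label_counts("train.csv")` after each checkout would give the number of 0s and 1s for that version of the file.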

### Bonus

- **Decouple Compute and Storage**: Track the data versions using **Google Drive** as storage.

#### References for Data Version Control

- [DVC Documentation](https://dvc.org/doc/start/data-management/data-versioning)
- [Real Python: Data Version Control](https://realpython.com/python-data-version-control/)
- [Managing Google Drive with Python](https://towardsdatascience.com/how-to-manage-files-in-google-drive-with-python-d26471d91ecd)
- [MadeWithML - Versioning](https://madewithml.com/courses/mlops/versioning/)

---

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import re
import string
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import os

random_state=24

#### Set up DVC storage

In [None]:
# !dvc init

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>

In [6]:
!dvc remote add -d gdrive_remote gdrive://1g-40__aCwQ_38Afqz_sOsD_OpKcvENfr

Setting 'gdrive_remote' as a default remote.

In [7]:
!dvc remote modify gdrive_remote gdrive_use_service_account true
!dvc remote modify gdrive_remote --local \
            gdrive_service_account_json_file_path dvc-storage-451816-cc381df40019.json

#### 1. Load the data.

In [4]:
# 1. Load the data (tab-separated file with no header row)
raw_messages = pd.read_csv(
    '/Users/kalyani/Documents/CMI/Sem 4/AML/Assignment 1/sms+spam+collection/SMSSpamCollection',
    sep='\t',
    quoting=csv.QUOTE_NONE,
    names=["label", "message"],
)

In [5]:
raw_messages.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


In [8]:
# Save the raw data
raw_messages.to_csv("raw_data.csv", index=False)

# Track the raw data using DVC
!dvc add raw_data.csv
!git add raw_data.csv.dvc
!git commit -m "Added raw data"
!dvc push

100% Adding...|███████████████████████████████████████|1/1 [00:00, 109.94file/s]

To track the changes with git, run:

	git add raw_data.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true
HEAD detac

#### 2. Preprocess the data

In [9]:
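# Define stop words and lemmatizer
# Requires NLTK data (stopwords, wordnet, and the punkt tokenizer models), e.g. via nltk.download()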
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [10]:
def preprocess_text(text):
    """
    Clean and preprocess a single text message.
    """
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
    tokens = word_tokenize(text) # Tokenize words
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

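def preprocess_data(data):
    """
    Preprocess the entire dataset and return a new DataFrame.
    """
    data = data.copy()  # Work on a copy so the caller's DataFrame (raw_messages) is not modified
    data['message'] = data['message'].apply(preprocess_text)
    # Encode labels: spam -> 1, ham -> 0
    data['label'] = data['label'].map({'spam': 1, 'ham': 0})
    return data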
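
To see what the first three cleaning steps do without needing the NLTK resources, here is a simplified sketch that lowercases, strips digits and punctuation, and splits on whitespace. It deliberately skips stop-word removal and lemmatization, so its output will generally differ from `preprocess_text`; the name `basic_clean` is hypothetical.

```python
import re
import string


def basic_clean(text):
    """Lowercase, drop digits and punctuation, and normalize whitespace."""
    text = text.lower()
    text = re.sub(r"\d+", "", text)
    text = text.translate(str.maketrans("", "", string.punctuation))
    # split()/join collapses the gaps left behind by removed characters
    return " ".join(text.split())


print(basic_clean("Ok lar... Joking wif u oni..."))  # ok lar joking wif u oni
```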

In [11]:
data = preprocess_data(raw_messages)
data.head()

|   | label | message |
|---|-------|---------|
| 0 | 0 | go jurong point crazy available bugis n great ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry wkly comp win fa cup final tkts st ... |
| 3 | 0 | u dun say early hor u c already say |
| 4 | 0 | nah dont think go usf life around though |


In [12]:
X = data['message']
y = data['label']

#### 3. Split the data into train/validation/test. 

In [76]:
def split_and_save_data(X, y, label_column="label", test_size=0.2, val_size=0.1, random_state=24):
    """
    Split the data into train, validation, and test sets, and save them as CSV files.
    """
    # Train-test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    # Train-validation split (taken from the remaining training data)
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size, random_state=random_state
    )

    # Recombine features and labels; the assignment aligns on the shared index
    train_df = pd.DataFrame(X_train)
    train_df[label_column] = y_train

    test_df = pd.DataFrame(X_test)
    test_df[label_column] = y_test

    val_df = pd.DataFrame(X_val)
    val_df[label_column] = y_val

    train_df.to_csv("./train.csv", index=False)
    test_df.to_csv("./test.csv", index=False)
    val_df.to_csv("./val.csv", index=False)

    print("Data splits saved successfully!")
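
Because the validation split is carved out of what remains after the test split, `val_size=0.1` yields roughly 8% of the full dataset rather than 10%. The arithmetic can be sketched as follows, assuming `train_test_split` rounds the held-out count up (which matches the class counts printed later in this notebook). `split_sizes` is a hypothetical helper, and 5574 is the total row count implied by those printed counts.

```python
import math


def split_sizes(n_rows, test_size=0.2, val_size=0.1):
    """Return (n_train, n_val, n_test) for two successive splits."""
    n_test = math.ceil(test_size * n_rows)
    n_remaining = n_rows - n_test
    # The validation fraction applies to the remainder, not to n_rows
    n_val = math.ceil(val_size * n_remaining)
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


print(split_sizes(5574))  # (4013, 446, 1115)
```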

##### Version 1 of train/validation/test split

In [77]:
split_and_save_data(X, y, label_column="label", test_size=0.2, val_size=0.1, random_state=24)

Data splits saved successfully!


In [78]:
# Track the new version with DVC
!dvc add train.csv val.csv test.csv
!git add train.csv.dvc val.csv.dvc test.csv.dvc
!git commit -m "Version 1 of train/validation/test split"
!dvc push

##### Version 2: changing seed and saving again

In [79]:
# Perform new split with a different random seed
split_and_save_data(X, y, label_column="label", test_size=0.2, val_size=0.1, random_state=80)

Data splits saved successfully!


In [80]:
# Track the new version with DVC
!dvc add train.csv val.csv test.csv
!git add train.csv.dvc val.csv.dvc test.csv.dvc
!git commit -m "Version 2 of train/validation/test split"
!dvc push

In [81]:
!git log

commit 005cba67765f3555b67dd3686d103128a312bc06 (HEAD)
Author: Kvgohokar <kalyani.gohokar2406@gmail.com>
Date:   Mon Feb 24 14:06:46 2025 +0530

    Version 2 of train/validation/test split

commit e870ff4e5059239f73ce66a3f501dde7fbdc181e
Author: Kvgohokar <kalyani.gohokar2406@gmail.com>
Date:   Mon Feb 24 14:06:37 2025 +0530

    Version 1 of train/validation/test split

commit 609e54b03924e0df9320287e2d4777ffef14a4ef
Author: Kvgohokar <kalyani.gohokar2406@gmail.com>
Date:   Mon Feb 24 12:56:55 2025 +0530

    Version 2 of train/validation/test split

commit e41bac6a2fce321e1841f2e83961d06949bd1374
Author: Kvgohokar <kalyani.gohokar2406@gmail.com>
Date:   Sun Feb 23 22:52:37 2025 +0530

    Updated train/validation/test split with different random seed

commit 41b5893b9360a5a48446e66ea6722bd6fb677234
Author: Kvgohokar <kalyani.gohokar2406@gmail.com>
Date:   Sun Feb 23 22:50:12 2025 +0530

    Updated train/validation/te

##### Check out Version 1 and print the class distribution

In [82]:
!git checkout e870ff4e5059239f73ce66a3f501dde7fbdc181e  # Version 1
!dvc checkout

M	.dvc/config
M	.gitignore
any of your branches:

  005cba6 Version 2 of train/validation/test split

If you want to keep it by creating a new branch, this may be a good time
to do so with:

 git branch <new-branch-name> 005cba6

HEAD is now at e870ff4 Version 1 of train/validation/test split
Building workspace index                              |5.00 [00:00,  861entry/s]
Comparing indexes                                    |6.00 [00:00, 9.66kentry/s]
Applying changes                                      |3.00 [00:00, 3.86kfile/s]
M       test.csv
M       val.csv
M       train.csv
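
Restoring a data version takes the same two steps each time: `git checkout` moves the `.dvc` pointer files to the chosen commit, and `dvc checkout` then syncs the CSV files in the workspace to match those pointers. A small wrapper, sketched here under the hypothetical name `checkout_data_version` with an injectable `run` callable so it can be dry-run without Git or DVC installed, might look like this:

```python
import subprocess


def checkout_data_version(revision, run=subprocess.run):
    """Check out a Git revision, then sync DVC-tracked files to it.

    `run` defaults to subprocess.run; pass a stub to inspect the
    commands without executing them.
    """
    commands = [
        ["git", "checkout", revision],  # restores the .dvc pointer files
        ["dvc", "checkout"],            # restores the data those pointers reference
    ]
    for command in commands:
        run(command, check=True)  # check=True raises if a command fails
    return commands
```

Running the second command is what actually changes `train.csv`, `val.csv`, and `test.csv`; after `git checkout` alone the workspace would still hold the previous version's data.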

In [83]:
# Load and print class distributions
train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")
test = pd.read_csv("test.csv")

print("Version 1: class distribution:")
print("Train:\n", train["label"].value_counts())
print("Validation:\n", val["label"].value_counts())
print("Test:\n", test["label"].value_counts())

Version 1: class distribution:
Train:
 label
0    3464
1     549
Name: count, dtype: int64
Validation:
 label
0    385
1     61
Name: count, dtype: int64
Test:
 label
0    978
1    137
Name: count, dtype: int64


##### Check out Version 2 and print the class distribution

In [84]:
!git checkout 005cba67765f3555b67dd3686d103128a312bc06 #Version 2
!dvc checkout

M	.dvc/config
M	.gitignore
Previous HEAD position was e870ff4 Version 1 of train/validation/test split
HEAD is now at 005cba6 Version 2 of train/validation/test split
Building workspace index                              |5.00 [00:00,  899entry/s]
Comparing indexes                                    |6.00 [00:00, 13.7kentry/s]
Applying changes                                      |3.00 [00:00, 4.92kfile/s]
M       train.csv
M       test.csv
M       val.csv

In [85]:
# Load and print class distributions
train = pd.read_csv("train.csv")
val = pd.read_csv("val.csv")
test = pd.read_csv("test.csv")

print("Version 2: class distribution:")
print("Train:\n", train["label"].value_counts())
print("Validation:\n", val["label"].value_counts())
print("Test:\n", test["label"].value_counts())

Version 2: class distribution:
Train:
 label
0    3487
1     526
Name: count, dtype: int64
Validation:
 label
0    372
1     74
Name: count, dtype: int64
Test:
 label
0    968
1    147
Name: count, dtype: int64
