# Assignment 2: Experiment Tracking

## 1. Data Version Control

### Track Data Versions using DVC

In `prepare.ipynb`, track the versions of data using **DVC**:

1. Load the raw data into `raw_data.csv`.
2. Split the data and save it into `train.csv`, `validation.csv`, and `test.csv`.
3. Update the train/validation/test split by choosing a different random seed.
4. Check out the first version (before the update) using DVC and print the distribution of the target variable (number of 0s and number of 1s) in:
   - `train.csv`
   - `validation.csv`
   - `test.csv`
5. Check out the updated version using DVC and print the distribution of the target variable in:
   - `train.csv`
   - `validation.csv`
   - `test.csv`
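
The steps above map onto a short sequence of Git and DVC commands. The sketch below only assembles that sequence as a list so it can be inspected before anything is run. The helper name `dvc_versioning_commands`, the commit messages, and the use of `HEAD~1` to reach the first version are illustrative assumptions, not part of the assignment. It also assumes the CSV files are not already committed to Git, since DVC refuses to add files that Git tracks.

```python
import shlex

DATA_FILES = ["train.csv", "validation.csv", "test.csv"]


def dvc_versioning_commands(files=DATA_FILES):
    """Return the Git/DVC commands for versioning two data splits (hypothetical helper)."""
    pointers = [f"{name}.dvc" for name in files]  # small pointer files that Git tracks
    return [
        ["dvc", "init"],
        # Version 1: let DVC track the data, let Git track the pointer files.
        ["dvc", "add", *files],
        ["git", "add", *pointers, ".gitignore"],
        ["git", "commit", "-m", "Track first data split"],
        # (Re-run the split with a different random seed here.)
        # Version 2: record the updated files.
        ["dvc", "add", *files],
        ["git", "add", *pointers],
        ["git", "commit", "-m", "Track updated data split"],
        # Restore the first version: older pointer files, then matching data.
        ["git", "checkout", "HEAD~1", "--", *pointers],
        ["dvc", "checkout"],
        # Return to the updated version.
        ["git", "checkout", "HEAD", "--", *pointers],
        ["dvc", "checkout"],
    ]


for command in dvc_versioning_commands():
    print(shlex.join(command))
```

Each printed line could be run from a notebook cell with a leading `!`, in the same way as the Git commands in this notebook.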

### Bonus

- **Decouple Compute and Storage**: Track the data versions using **Google Drive** as storage.
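
A minimal sketch of the bonus setup, assuming DVC is installed with its Google Drive support and that a Drive folder already exists. The folder ID placeholder, the remote name `gdrive_storage`, and the helper name are hypothetical.

```python
def gdrive_remote_commands(folder_id, remote_name="gdrive_storage"):
    """Return commands that point DVC at a Google Drive folder (hypothetical helper)."""
    return [
        # Register the Drive folder as the default DVC remote.
        ["dvc", "remote", "add", "--default", remote_name, f"gdrive://{folder_id}"],
        # The remote is stored in .dvc/config, which Git should track.
        ["git", "add", ".dvc/config"],
        # Upload the cached data versions to the remote.
        ["dvc", "push"],
    ]


for command in gdrive_remote_commands("<folder-id>"):
    print(" ".join(command))
```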

#### References for Data Version Control

- [DVC Documentation](https://dvc.org/doc/start/data-management/data-versioning)
- [Real Python: Data Version Control](https://realpython.com/python-data-version-control/)
- [Managing Google Drive with Python](https://towardsdatascience.com/how-to-manage-files-in-google-drive-with-python-d26471d91ecd)
- [MadeWithML - Versioning](https://madewithml.com/courses/mlops/versioning/)

---

In [1]:
!git --version

git version 2.39.5 (Apple Git-154)


In [2]:
!git config --global user.name "Kvgohokar"
!git config --global user.email "kalyani.gohokar2406@gmail.com"

In [27]:
!git init

Initialized empty Git repository in /Users/kalyani/Documents/CMI/Sem 4/AML/Assingment 2/.git/


In [None]:
!git remote add origin https://github.com/Kvgohokar/AppliedMachineLearning.git

In [13]:
!git remote -v

origin	https://github.com/Kvgohokar/AppliedMachineLearning.git (fetch)
origin	https://github.com/Kvgohokar/AppliedMachineLearning.git (push)


In [6]:
!git add "/Users/kalyani/Documents/CMI/Sem 4/AML/Assignment 2"

In [7]:
!git commit -m "Added Assignment 2 folder"

[main (root-commit) e943979] Added Assignment 2 folder
 6 files changed, 7434 insertions(+)
 create mode 100644 prepare.ipynb
 create mode 100644 test.csv
 create mode 100644 train.csv
 create mode 100644 train.ipynb
 create mode 100644 val.csv
 create mode 100644 validation.csv


In [9]:
!git push origin main

fatal: protocol 'git@github.com:https' is not supported


In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import csv
import re
import string
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import nltk
import os

random_state=24

#### 1. Load the data

In [3]:
# 1. Load the data
raw_messages = pd.read_csv(
    '/Users/kalyani/Documents/CMI/Sem 4/AML/Assignment 1/sms+spam+collection/SMSSpamCollection',
    sep='\t',
    quoting=csv.QUOTE_NONE,
    names=["label", "message"],
)

In [4]:
raw_messages.head()

|   | label | message |
|---|-------|---------|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |


#### 2. Preprocess the data

In [5]:
# Define stop words and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

In [9]:
def preprocess_text(text):
    """
    Clean and preprocess a single text message.
    """
    text = text.lower() # Convert to lowercase
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = text.translate(str.maketrans('', '', string.punctuation)) # Remove punctuation
    tokens = word_tokenize(text) # Tokenize words
    # Remove stopwords and lemmatize
    tokens = [lemmatizer.lemmatize(word) for word in tokens if word not in stop_words]
    return ' '.join(tokens)

def preprocess_data(data):
    """
    Preprocess the entire dataset.
    """
    data['message'] = data['message'].apply(preprocess_text)
    # Encode labels: spam -> 1, ham -> 0
    data['label'] = data['label'].map({'spam': 1, 'ham': 0})
    return data
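
To see what the first three cleaning steps do on their own, the sketch below applies lowercasing, digit removal, and punctuation stripping to a made-up message using only the standard library. It deliberately leaves out tokenization, stop-word removal, and lemmatization, which need NLTK resources. Collapsing whitespace stands in for the tokenize-and-join step. The function name `basic_clean` and the sample string are illustrative.

```python
import re
import string


def basic_clean(text):
    """Apply only the regex/string steps of the preprocessing (illustrative helper)."""
    text = text.lower()                                               # lowercase
    text = re.sub(r"\d+", "", text)                                   # remove digits
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    return " ".join(text.split())                                     # collapse whitespace


print(basic_clean("WINNER!! Call 0900 now, it's FREE."))
# winner call now its free
```

Because the apostrophe counts as punctuation, contractions lose it (`it's` becomes `its`), which is why `don't` shows up as `dont` in the preprocessed output.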

In [10]:
data = preprocess_data(raw_messages)

In [11]:
data.head()

|   | label | message |
|---|-------|---------|
| 0 | 0 | go jurong point crazy available bugis n great ... |
| 1 | 0 | ok lar joking wif u oni |
| 2 | 1 | free entry wkly comp win fa cup final tkts st ... |
| 3 | 0 | u dun say early hor u c already say |
| 4 | 0 | nah dont think go usf life around though |


In [12]:
X = data['message']
y = data['label']

#### 3. Split the data into train/validation/test

In [13]:
def split_and_save_data(X, y, label_column="label", test_size=0.2, val_size=0.1, random_state=24):
    """
    Split the data into train, validation, and test sets, and save them as CSV files.

    test_size is a fraction of the full dataset; val_size is a fraction of the
    data that remains after the test set has been removed.
    """
    # Train-test split (stratified to preserve the spam/ham ratio)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # Train-validation split, taken from the remaining training data
    X_train, X_val, y_train, y_val = train_test_split(
        X_train, y_train, test_size=val_size, random_state=random_state, stratify=y_train
    )

    # Recombine features and labels so each CSV holds both columns
    train_df = pd.DataFrame(X_train)
    train_df[label_column] = y_train

    test_df = pd.DataFrame(X_test)
    test_df[label_column] = y_test

    val_df = pd.DataFrame(X_val)
    val_df[label_column] = y_val

    # Note: the validation split is saved as val.csv; the assignment text calls it validation.csv
    train_df.to_csv("./train.csv", index=False)
    test_df.to_csv("./test.csv", index=False)
    val_df.to_csv("./val.csv", index=False)

    print("Data splits saved successfully!")
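
Because the validation set is carved out of what remains after the test split, `val_size=0.1` corresponds to 10% of the remaining 80%, or about 8% of the full dataset. The sketch below works through that arithmetic for a hypothetical dataset of 1,000 rows. Rounding the held-out size up is my understanding of how scikit-learn handles fractional sizes and should be treated as an assumption. `split_sizes` is an illustrative helper.

```python
import math


def split_sizes(n_samples, test_size=0.2, val_size=0.1):
    """Return (n_train, n_val, n_test) for a two-stage split (illustrative helper)."""
    n_test = math.ceil(test_size * n_samples)    # first split: test set from all rows
    n_remaining = n_samples - n_test
    n_val = math.ceil(val_size * n_remaining)    # second split: validation from the rest
    n_train = n_remaining - n_val
    return n_train, n_val, n_test


n_train, n_val, n_test = split_sizes(1000)
print(n_train, n_val, n_test)  # 720 80 200
```

With these defaults the effective proportions are roughly 72% train, 8% validation, and 20% test.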

In [14]:
split_and_save_data(X, y, label_column="label", test_size=0.2, val_size=0.1, random_state=24)

Data splits saved successfully!
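
One way to print the target distribution for each split, as steps 4 and 5 of the assignment require, is to count the label values straight from the saved CSV files. This is a minimal sketch using only the standard library. It assumes the files are in the working directory with a `label` column, and the function names are illustrative.

```python
import csv
from collections import Counter


def label_distribution(csv_path, label_column="label"):
    """Count how often each label value occurs in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return Counter(row[label_column] for row in csv.DictReader(handle))


def print_distributions(paths=("train.csv", "val.csv", "test.csv")):
    """Print the number of 0s and 1s in each split file."""
    for path in paths:
        counts = label_distribution(path)
        # Values are read as strings, so the keys are '0' and '1'.
        print(f"{path}: 0s = {counts.get('0', 0)}, 1s = {counts.get('1', 0)}")
```

Calling `print_distributions()` after each `dvc checkout` would show whether the two versions of the split differ in class balance. With stratified splitting, the counts are expected to stay close even when the random seed changes.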
