In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import os
import warnings
import json

warnings.filterwarnings('ignore')

In [2]:
!dvc init --no-scm

Initialized DVC repository.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [3]:
!git add .dvc

In [4]:
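# Read the Google Drive folder ID from a local credentials file (closed automatically)
with open('credentials.json') as f:
    gdrive_link = json.load(f)['gdrive_folder']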

### Bonus: Google Drive as the DVC remote

In [5]:
# Add Google Drive remote
!dvc remote add -d myremote gdrive://{gdrive_link}

Setting 'myremote' as a default remote.


In [6]:
!dvc remote modify myremote gdrive_use_service_account true

In [7]:
!dvc remote modify myremote --local gdrive_service_account_json_file_path dvc-assignment-e9a57d4791c3.json

In [8]:
df = pd.read_csv("./dataset/SMSSpamCollection", encoding='latin-1', sep='\t', names=['label', 'message'])

In [9]:
# Save the loaded data to raw_data.csv
df.to_csv('./raw_data.csv', index=False)
print("Data has been successfully saved to raw_data.csv")

Data has been successfully saved to raw_data.csv


In [10]:
!dvc add raw_data.csv


In [11]:
!git add raw_data.csv.dvc
!git commit -m "Add raw data tracking with DVC"
!git push --all

[main 0108169] Add raw data tracking with DVC
 6 files changed, 22 insertions(+)
 create mode 100644 Assignment2/.dvc/config
 create mode 100644 Assignment2/.dvc/tmp/btime
 create mode 100644 Assignment2/raw_data.csv.dvc
 create mode 100644 Assignment2/test.csv.dvc
 create mode 100644 Assignment2/train.csv.dvc
 create mode 100644 Assignment2/validation.csv.dvc


To https://github.com/M-Aalekhya/AppliedMachineLearning
   75fec75..0108169  main -> main


In [12]:
# Push the DVC-tracked data to the Google Drive remote
!dvc push

1 file pushed
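The `.dvc` file committed to Git is only a small pointer; the data itself is pushed to the remote under a content hash. The sketch below illustrates that idea, assuming the hash is an MD5 digest of the raw file bytes (which, as far as I understand, is what recent DVC versions record for single files). The helper `file_md5` and the throwaway `demo.csv` are hypothetical names used only for illustration, not part of DVC.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file (hypothetical content, not the real dataset).
demo_bytes = b"label,message\nham,hello\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.csv")
    with open(demo_path, "wb") as f:
        f.write(demo_bytes)
    demo_hash = file_md5(demo_path)

print(demo_hash)
```

If the assumption holds, running `file_md5('raw_data.csv')` should give the value stored in `raw_data.csv.dvc`; any change to the file contents changes the hash, which is how a new version is detected.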


In [13]:
df

,label,message
0,ham,"Go until jurong point, crazy.. Available only ..."
1,ham,Ok lar... Joking wif u oni...
2,spam,Free entry in 2 a wkly comp to win FA Cup fina...
3,ham,U dun say so early hor... U c already then say...
4,ham,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,spam,This is the 2nd time we have tried 2 contact u...
5568,ham,Will Ã¼ b going to esplanade fr home?
5569,ham,"Pity, * was in mood for that. So...any other s..."
5570,ham,The guy did some bitching but I acted like i'd...


In [14]:
# Encode the label as an integer: spam -> 1, ham -> 0

df['label'] = (df['label'] == 'spam').astype(int)

In [15]:
df

,label,message
0,0,"Go until jurong point, crazy.. Available only ..."
1,0,Ok lar... Joking wif u oni...
2,1,Free entry in 2 a wkly comp to win FA Cup fina...
3,0,U dun say so early hor... U c already then say...
4,0,"Nah I don't think he goes to usf, he lives aro..."
...,...,...
5567,1,This is the 2nd time we have tried 2 contact u...
5568,0,Will Ã¼ b going to esplanade fr home?
5569,0,"Pity, * was in mood for that. So...any other s..."
5570,0,The guy did some bitching but I acted like i'd...


In [16]:
def splitandsave(df, random_state, train_path='./train.csv', val_path='./validation.csv', test_path='./test.csv',
                 val_size=0.25, test_size=0.15):
    """Split df into train/validation/test sets, save each to CSV, and return them.

    val_size and test_size are fractions of the full dataset; the rest is the training set.
    """
    # Hold out the test set first.
    train_val, test = train_test_split(df, test_size=test_size, random_state=random_state)
    # Rescale val_size so it is a fraction of the remaining (non-test) rows.
    val_adjusted_size = val_size / (1 - test_size)
    train, val = train_test_split(train_val, test_size=val_adjusted_size, random_state=random_state)

    # Save splits to CSV
    train.to_csv(train_path, index=False)
    val.to_csv(val_path, index=False)
    test.to_csv(test_path, index=False)

    print("Data split sizes:")
    print(f"Train: {len(train)} samples")
    print(f"Validation: {len(val)} samples")
    print(f"Test: {len(test)} samples")

    return train, val, test

In [17]:
# Split the data and save the three CSVs using the initial seed (42)
train, val, test = splitandsave(df, 42)

Data split sizes:
Train: 3343 samples
Validation: 1393 samples
Test: 836 samples
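The sizes printed above can be reproduced by hand. The sketch below redoes the arithmetic of the two-stage split, assuming scikit-learn rounds the held-out count up (`ceil`) and gives the remainder to the other side, which is consistent with the numbers shown. The variable names are illustrative only.

```python
import math

n_total = 3343 + 1393 + 836           # total rows, from the sizes printed above
val_size, test_size = 0.25, 0.15      # fractions of the full dataset

# Stage 1: hold out the test set.
n_test = math.ceil(n_total * test_size)
n_train_val = n_total - n_test

# Stage 2: validation is taken from the remaining rows, so its fraction is rescaled.
val_adjusted_size = val_size / (1 - test_size)
n_val = math.ceil(n_train_val * val_adjusted_size)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print([round(n / n_total, 3) for n in (n_train, n_val, n_test)])
```

Without the rescaling, the second split would take 25% of the remaining 85%, so the validation set would end up at roughly 21% of the data instead of 25%.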


In [18]:
# Track the split datasets with DVC
!dvc add train.csv validation.csv test.csv


In [19]:
# Commit this version
!git add train.csv.dvc validation.csv.dvc test.csv.dvc
!git commit -m "First data split with seed=42"
!git push --all

[main 3b86a45] First data split with seed=42
 3 files changed, 6 insertions(+), 6 deletions(-)


To https://github.com/M-Aalekhya/AppliedMachineLearning
   0108169..3b86a45  main -> main


In [20]:
# Push the first version of the splits to the Google Drive remote
!dvc push

3 files pushed


In [21]:
# Re-split with a different random seed (150); this overwrites the three CSVs
train, val, test = splitandsave(df, 150)

Data split sizes:
Train: 3343 samples
Validation: 1393 samples
Test: 836 samples


In [22]:
# Track the updated split datasets with DVC
!dvc add train.csv validation.csv test.csv


In [23]:
# Commit this version
!git add train.csv.dvc validation.csv.dvc test.csv.dvc
!git commit -m "Updated data split with seed=150"
!git push --all

[main d239f0c] Updated data split with seed=150
 3 files changed, 6 insertions(+), 6 deletions(-)


To https://github.com/M-Aalekhya/AppliedMachineLearning
   3b86a45..d239f0c  main -> main


In [24]:
# Push the updated splits to the Google Drive remote
!dvc push

3 files pushed


In [25]:
!git log --oneline

d239f0c Updated data split with seed=150
3b86a45 First data split with seed=42
0108169 Add raw data tracking with DVC
75fec75 minor changes
68980b0 Assignment1 finished
13b7e8f Initial commit


In [26]:
# Checkout the first version
!git checkout 3b86a45
!dvc checkout

# Load the first version datasets
train_df_v1 = pd.read_csv('train.csv')
val_df_v1 = pd.read_csv('validation.csv')
test_df_v1 = pd.read_csv('test.csv')

# Print distribution of target variable in first version
print("FIRST VERSION - Train set distribution:")
print(train_df_v1["label"].value_counts())

print("\nFIRST VERSION - Validation set distribution:")
print(val_df_v1["label"].value_counts())

print("\nFIRST VERSION - Test set distribution:")
print(test_df_v1["label"].value_counts())



M	Assignment2/.dvc/config


Note: switching to '3b86a45'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 3b86a45 First data split with seed=42


M       train.csv
M       validation.csv
M       test.csv
FIRST VERSION - Train set distribution:
label
0    2871
1     472
Name: count, dtype: int64

FIRST VERSION - Validation set distribution:
label
0    1225
1     168
Name: count, dtype: int64

FIRST VERSION - Test set distribution:
label
0    729
1    107
Name: count, dtype: int64


In [27]:
# Return to the latest version
!git checkout d239f0c
!dvc checkout

# Load the updated version datasets
train_df_v2 = pd.read_csv('train.csv')
val_df_v2 = pd.read_csv('validation.csv')
test_df_v2 = pd.read_csv('test.csv')

# Print distribution of target variable in updated version
print("UPDATED VERSION - Train set distribution:")
print(train_df_v2["label"].value_counts())

print("\nUPDATED VERSION - Validation set distribution:")
print(val_df_v2["label"].value_counts())

print("\nUPDATED VERSION - Test set distribution:")
print(test_df_v2["label"].value_counts())

M	Assignment2/.dvc/config


Previous HEAD position was 3b86a45 First data split with seed=42
HEAD is now at d239f0c Updated data split with seed=150


M       validation.csv
M       test.csv
M       train.csv
UPDATED VERSION - Train set distribution:
label
0    2912
1     431
Name: count, dtype: int64

UPDATED VERSION - Validation set distribution:
label
0    1200
1     193
Name: count, dtype: int64

UPDATED VERSION - Test set distribution:
label
0    713
1    123
Name: count, dtype: int64
