- Name: Deepanshi
- Roll No.: MDS202416
- Assignment 2

1. Imports, Setup, and Gitignore

In [1]:
!pip install python-dotenv

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 25.3 -> 26.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
!pip install "dvc[gdrive]"

import pandas as pd
import urllib.request
import zipfile
import os
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split

# Download necessary NLTK data
nltk.download('stopwords')
print("Libraries imported and NLTK data downloaded.")

# Initialize Stemmer and Stopwords
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

Libraries imported and NLTK data downloaded.


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Deepanshi\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [46]:
import os
from dotenv import load_dotenv
load_dotenv()
client_id = os.getenv("GDRIVE_CLIENT_ID")
client_secret = os.getenv("GDRIVE_CLIENT_SECRET")
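
os.getenv returns None when a variable is missing, so a typo in .env would only surface later when DVC tries to authenticate. A minimal sketch of a fail-fast check follows; the helper name require_env and the DEMO_ variable are hypothetical and not part of the notebook.

```python
import os


def require_env(name):
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Missing environment variable: {name}")
    return value


# Demo with a hypothetical variable so the snippet runs without a .env file.
os.environ["DEMO_GDRIVE_CLIENT_ID"] = "example-id"
print(require_env("DEMO_GDRIVE_CLIENT_ID"))
```

In the notebook this could replace the two os.getenv calls, for example client_id = require_env("GDRIVE_CLIENT_ID").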

In [47]:
# Create the .gitignore file (mode "w" overwrites any existing file).
# This keeps notebook checkpoints, the downloaded archive, and the .env secrets file out of Git.
with open(".gitignore", "w") as f:
    f.write("""
.ipynb_checkpoints/
sms_spam_collection.zip
.env
""")
print(".gitignore created.")

.gitignore created.
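
Opening .gitignore in "w" mode replaces whatever is already there, including the entries that dvc add appends for tracked data files. A possible alternative, sketched under the assumption that appending is acceptable, adds only the missing entries; the function name ensure_gitignore_entries is hypothetical.

```python
import os
import tempfile


def ensure_gitignore_entries(path, entries):
    """Append entries that are not yet listed in the file at `path`; return the ones added."""
    content = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            content = f.read()
    existing = {line.strip() for line in content.splitlines()}
    missing = [entry for entry in entries if entry not in existing]
    if missing:
        with open(path, "a", encoding="utf-8") as f:
            # Start on a fresh line if the file does not end with a newline.
            if content and not content.endswith("\n"):
                f.write("\n")
            f.write("\n".join(missing) + "\n")
    return missing


# Demo in a temporary directory so no real .gitignore is touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, ".gitignore")
    print(ensure_gitignore_entries(demo_path, [".ipynb_checkpoints/", ".env"]))
    print(ensure_gitignore_entries(demo_path, [".env", "sms_spam_collection.zip"]))
```

Running the helper twice with the same entries leaves the file unchanged, so the cell can be re-executed safely.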


2. Download and Load Dataset

In [5]:
# --- 1. Download the dataset from UCI ---
url = "https://archive.ics.uci.edu/static/public/228/sms+spam+collection.zip"
zip_path = "sms_spam_collection.zip"

In [6]:
# Download the zip file if it doesn't exist
if not os.path.exists(zip_path):
    urllib.request.urlretrieve(url, zip_path)
    print("Dataset downloaded.")

Dataset downloaded.


In [7]:
# --- 2. Unzip the file ---
with zipfile.ZipFile(zip_path, 'r') as zip_ref:
    zip_ref.extractall(".")

In [59]:
# --- 3. Load into Pandas ---
# The UCI dataset is tab-separated with no header.
# We manually assign the columns 'label' and 'text'.
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'text'])

In [60]:
print(f"Dataset shape: {df.shape}")
print(df.head())

Dataset shape: (5572, 2)
  label                                               text
0   ham  Go until jurong point, crazy.. Available only ...
1   ham                      Ok lar... Joking wif u oni...
2  spam  Free entry in 2 a wkly comp to win FA Cup fina...
3   ham  U dun say so early hor... U c already then say...
4   ham  Nah I don't think he goes to usf, he lives aro...


3. Basic EDA & Preprocessing

In [61]:
print("Class distribution:")
print(df["label"].value_counts())
print("\nMissing values:")
print(df.isnull().sum())
print("\nDuplicate rows:", df.duplicated().sum())

Class distribution:
label
ham     4825
spam     747
Name: count, dtype: int64

Missing values:
label    0
text     0
dtype: int64

Duplicate rows: 403
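
The counts above can be turned into proportions to make the class imbalance explicit. This is plain arithmetic on the printed numbers, not an additional notebook cell.

```python
# Counts copied from the output above.
ham, spam, duplicates = 4825, 747, 403
total = ham + spam

spam_share = spam / total
dup_share = duplicates / total

print(f"Total messages: {total}")
print(f"Spam share: {spam_share:.1%}")
print(f"Duplicate share: {dup_share:.1%}")
print(f"Ham-to-spam ratio: {ham / spam:.2f}")
```

With spam forming a small minority of the messages, stratifying the later train/validation/test splits on the label keeps that proportion the same in every split.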


In [62]:
# Preprocessing Pipeline 
def clean_text(text):
    """
    Preprocessing pipeline to normalize SMS text:
    1. Lowercase: Normalize case sensitivity.
    2. Regex: Remove special characters/numbers (keep only alphabets).
    3. Tokenize: Split string into a list of words.
    4. Remove Stopwords: Filter out common words (e.g., 'the', 'is').
    5. Stemming: Reduce words to their root form (e.g., 'calling' -> 'call').
    """
    if not isinstance(text, str):
        return ""
        
    # 1. Lowercase
    text = text.lower()
    
    # 2. Remove non-alphabetic characters (keep spaces)
    text = re.sub(r'[^a-z\s]', '', text)
    
    # 3. Tokenize
    words = text.split()
    
    # 4. & 5. Remove stopwords and Stem
    cleaned_words = [stemmer.stem(word) for word in words if word not in stop_words]
    
    return " ".join(cleaned_words)
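
To see what the first four steps do to a message, here is a minimal sketch that uses only the standard library. The stopword set DEMO_STOPWORDS is a small hypothetical stand-in for NLTK's English list, and stemming is left out.

```python
import re

# Hypothetical mini stopword list; the notebook uses NLTK's full English list.
DEMO_STOPWORDS = {"i", "he", "to", "don't", "the"}


def normalize(text, stop_words=DEMO_STOPWORDS):
    """Lowercase, strip non-letters, tokenize, and drop stopwords (no stemming)."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return [word for word in text.split() if word not in stop_words]


tokens = normalize("Nah I don't think he goes to usf, 2 days!")
print(tokens)
```

Note that "don't" becomes "dont" before the stopword lookup, because the apostrophe is stripped in step 2. A stopword that contains an apostrophe therefore no longer matches and stays in the output.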

4. Apply Cleaning + Save Raw Data

In [63]:
print(f"Original shape: {df.shape}")

Original shape: (5572, 2)


In [64]:
# --- 1. Drop Duplicates ---
# Critical for SMS datasets as duplicate messages can cause data leakage 
# between train and test sets.
df = df.drop_duplicates(keep='first')
print(f"Shape after dropping duplicates: {df.shape}")

Shape after dropping duplicates: (5169, 2)


In [65]:
# --- 2. Encode Labels ---
# Map string labels to binary integers: Spam = 1, Ham = 0
df['target'] = df['label'].map({'spam': 1, 'ham': 0})

In [66]:
# --- 3. Apply Text Cleaning ---
# Apply the clean_text function to create the feature column
df['clean_text'] = df['text'].apply(clean_text)

In [52]:
# Remove any empty rows created by aggressive cleaning (e.g., messages with only special chars)
df = df[df['clean_text'].str.len() > 0]

print("Preprocessing complete.")


Preprocessing complete.
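
Duplicates were dropped on the raw text, before cleaning. Messages that differ only in digits or punctuation can still collapse to the same clean_text afterwards. The sketch below illustrates the effect with hypothetical messages and a simplified cleaner (no stopword removal or stemming).

```python
import re
from collections import Counter


def strip_non_letters(text):
    """Lowercase the text and keep only letters, collapsing repeated whitespace."""
    return " ".join(re.sub(r"[^a-z\s]", "", text.lower()).split())


# Hypothetical messages: the first two differ only in digits and punctuation.
raw = ["Call 0800 now!", "Call 0900 now", "See you at 5"]
cleaned = [strip_non_letters(message) for message in raw]

# Cleaned strings that occur more than once.
collisions = {text: count for text, count in Counter(cleaned).items() if count > 1}
print(collisions)
```

In the notebook, df.duplicated(subset="clean_text").sum() would report how many such collisions remain after cleaning.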


In [53]:
print("After preprocessing:", df.shape)

After preprocessing: (5068, 2)


In [54]:
# Save the preprocessed, unsplit dataset (stored as raw_data.csv)
df = df[['target', 'clean_text']] # Keep only necessary columns
df.to_csv("raw_data.csv", index=False)
print("Preprocessing complete. raw_data.csv saved.")

Preprocessing complete. raw_data.csv saved.


5. Git/DVC Init

In [55]:
# Initialize Git and DVC in the project directory
!git init
!dvc init

Reinitialized existing Git repository in D:/CMI Courses/Applied ML/Assignment2/.git/


ERROR: failed to initiate DVC - '.dvc' exists. Use `-f` to force.


In [56]:
!git add .gitignore .dvc/config
!git commit -m "Initialize Git and DVC"

[detached HEAD f5a468e] Initialize Git and DVC
 2 files changed, 6 insertions(+), 2 deletions(-)


6. Version 1 (Seed 21)

In [57]:
train_val, test = train_test_split(
    df,
    test_size=0.15,
    random_state=21,
    stratify=df['target']
)

# Validation should be 15% of the full dataset, i.e. 0.15 / 0.85 of the remaining 85%
val_size = 0.15 / 0.85

train, val = train_test_split(
    train_val,
    test_size=val_size,
    random_state=21,
    stratify=train_val['target']
)

In [58]:
train.to_csv("train.csv", index=False)
val.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)

print("Split with seed 21 saved.")

Split with seed 21 saved.


In [23]:
print("=== Seed 21 Distribution ===")

print("\nTrain:")
print(train['target'].value_counts())

print("\nValidation:")
print(val['target'].value_counts())

print("\nTest:")
print(test['target'].value_counts())

=== Seed 21 Distribution ===

Train:
target
0    3156
1     457
Name: count, dtype: int64

Validation:
target
0    677
1     98
Name: count, dtype: int64

Test:
target
0    677
1     98
Name: count, dtype: int64
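
The split sizes printed above can be reproduced with a short calculation. The sketch assumes that the held-out count is rounded up at each step. That assumption matches the printed numbers, but the notebook itself does not state the rounding rule.

```python
import math

# Rows across the three splits printed above (class 0 + class 1 in each).
n_total = (3156 + 457) + (677 + 98) + (677 + 98)

test_fraction = 0.15
val_fraction_of_rest = 0.15 / 0.85  # same expression as val_size in the notebook

# Assumed rounding rule: the held-out count is rounded up.
n_test = math.ceil(test_fraction * n_total)
n_train_val = n_total - n_test
n_val = math.ceil(val_fraction_of_rest * n_train_val)
n_train = n_train_val - n_val

print(n_train, n_val, n_test)
print(f"Validation share of full data: {0.85 * val_fraction_of_rest:.2f}")
```

Dividing by 0.85 is what makes the validation set 15% of the full dataset rather than 15% of the remaining 85%.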


In [24]:
!dvc add raw_data.csv train.csv validation.csv test.csv
!git add raw_data.csv.dvc train.csv.dvc validation.csv.dvc test.csv.dvc
!git commit -m "Version 1: Split with seed 21"


To track the changes with git, run:

	git add train.csv.dvc raw_data.csv.dvc test.csv.dvc validation.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true


⠋ Checking graph



[main 873224d] Version 1: Split with seed 21
 4 files changed, 20 insertions(+)
 create mode 100644 raw_data.csv.dvc
 create mode 100644 test.csv.dvc
 create mode 100644 train.csv.dvc
 create mode 100644 validation.csv.dvc


In [25]:
!git log

commit 873224d3df14076339ab01f466c905d389f035f2
Author: deepSin <singladeepanshi6@gmail.com>
Date:   Sun Feb 15 20:31:46 2026 +0530

    Version 1: Split with seed 21

commit 34ceb5cdc139a3a4e5e069f71b17681eccb3b774
Author: deepSin <singladeepanshi6@gmail.com>
Date:   Sun Feb 15 20:31:34 2026 +0530

    Initialize Git and DVC


7. Version 2 (Seed 77)

In [26]:
train_val, test = train_test_split(
    df,
    test_size=0.15,
    random_state=77,
    stratify=df['target']
)

In [27]:
# Validation should be 15% of the full dataset, i.e. 0.15 / 0.85 of the remaining 85%
val_size = 0.15 / 0.85

train, val = train_test_split(
    train_val,
    test_size=val_size,
    random_state=77,
    stratify=train_val['target']
)

In [28]:
train.to_csv("train.csv", index=False)
val.to_csv("validation.csv", index=False)
test.to_csv("test.csv", index=False)

print("Split with seed 77 saved.")

Split with seed 77 saved.


In [29]:
print("=== Seed 77 Distribution ===")

print("\nTrain:")
print(train['target'].value_counts())

print("\nValidation:")
print(val['target'].value_counts())

print("\nTest:")
print(test['target'].value_counts())


=== Seed 77 Distribution ===

Train:
target
0    3156
1     457
Name: count, dtype: int64

Validation:
target
0    677
1     98
Name: count, dtype: int64

Test:
target
0    677
1     98
Name: count, dtype: int64


In [30]:
!dvc add train.csv validation.csv test.csv
!git add train.csv.dvc validation.csv.dvc test.csv.dvc
!git commit -m "Version 2: Updated split with seed 77"


To track the changes with git, run:

	git add train.csv.dvc validation.csv.dvc test.csv.dvc

To enable auto staging, run:

	dvc config core.autostage true


⠋ Checking graph



[main 1713e5e] Version 2: Updated split with seed 77
 3 files changed, 6 insertions(+), 6 deletions(-)


8. Checking Out and Printing Distributions

In [31]:
!git log --oneline

1713e5e Version 2: Updated split with seed 77
873224d Version 1: Split with seed 21
34ceb5c Initialize Git and DVC


In [32]:
!git log

commit 1713e5e1ae7478775c4c12e7d4c8068fd27206d2
Author: deepSin <singladeepanshi6@gmail.com>
Date:   Sun Feb 15 20:31:54 2026 +0530

    Version 2: Updated split with seed 77

commit 873224d3df14076339ab01f466c905d389f035f2
Author: deepSin <singladeepanshi6@gmail.com>
Date:   Sun Feb 15 20:31:46 2026 +0530

    Version 1: Split with seed 21

commit 34ceb5cdc139a3a4e5e069f71b17681eccb3b774
Author: deepSin <singladeepanshi6@gmail.com>
Date:   Sun Feb 15 20:31:34 2026 +0530

    Initialize Git and DVC


In [33]:
!git checkout HEAD~1
!dvc checkout

Note: switching to 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 873224d Version 1: Split with seed 21


M       test.csv
M       train.csv
M       validation.csv


In [34]:
print("=== Distribution for Version 1 (Seed 21) ===")
train_old = pd.read_csv("train.csv")
val_old = pd.read_csv("validation.csv")
test_old = pd.read_csv("test.csv")

print("Train distribution:")
print(train_old["target"].value_counts())

print("\nValidation distribution:")
print(val_old["target"].value_counts())

print("\nTest distribution:")
print(test_old["target"].value_counts())

=== Distribution for Version 1 (Seed 21) ===
Train distribution:
target
0    3156
1     457
Name: count, dtype: int64

Validation distribution:
target
0    677
1     98
Name: count, dtype: int64

Test distribution:
target
0    677
1     98
Name: count, dtype: int64


In [35]:
# Go back to the latest commit (Version 2)
!git checkout main
!dvc checkout

Previous HEAD position was 873224d Version 1: Split with seed 21
Switched to branch 'main'


M       test.csv
M       train.csv
M       validation.csv


In [36]:
print("=== Distribution for Version 2 (Seed 77) ===")
train_new = pd.read_csv("train.csv")
val_new = pd.read_csv("validation.csv")
test_new = pd.read_csv("test.csv")

print("Train distribution:")
print(train_new["target"].value_counts())

print("\nValidation distribution:")
print(val_new["target"].value_counts())

print("\nTest distribution:")
print(test_new["target"].value_counts())

=== Distribution for Version 2 (Seed 77) ===
Train distribution:
target
0    3156
1     457
Name: count, dtype: int64

Validation distribution:
target
0    677
1     98
Name: count, dtype: int64

Test distribution:
target
0    677
1     98
Name: count, dtype: int64
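
Because both splits are stratified, the class counts are identical for seed 21 and seed 77, so the counts alone do not show that the checkout swapped the files. One way to confirm it, sketched here with a hypothetical helper file_sha256 and temporary demo files, is to compare content fingerprints.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo: two files with the same rows in a different order get different digests.
with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "split_a.csv")
    path_b = os.path.join(tmp, "split_b.csv")
    with open(path_a, "w", encoding="utf-8") as f:
        f.write("target,clean_text\n0,ok lar\n1,free entri\n")
    with open(path_b, "w", encoding="utf-8") as f:
        f.write("target,clean_text\n1,free entri\n0,ok lar\n")
    digest_a = file_sha256(path_a)
    digest_b = file_sha256(path_b)

print(digest_a == digest_b)
```

In the notebook, calling file_sha256("train.csv") after each dvc checkout and comparing the two digests would show whether the two versions differ. The digest is only meant for comparing checkouts and is not claimed to equal the hash DVC stores in train.csv.dvc.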


9. Bonus: Google Drive Storage

In [37]:
!dvc remote add -d myremote gdrive://1ZhdeoELGohWWqaw7WgTTM13OJ4fy3R8U

Setting 'myremote' as a default remote.


In [38]:
# --local writes the credentials to .dvc/config.local, which DVC keeps out of Git
!dvc remote modify --local myremote gdrive_client_id "{client_id}"
!dvc remote modify --local myremote gdrive_client_secret "{client_secret}"

In [39]:
print("Credentials securely loaded from .env and applied to DVC!")

Credentials securely loaded from .env and applied to DVC!


In [40]:
import os

# DVC creates these lock files to stop concurrent commands from conflicting.
# If a DVC process is killed, they can be left behind. Remove them only when no DVC command is running.
lock_files = [".dvc/tmp/rwlock", ".dvc/tmp/lock"]

for file in lock_files:
    if os.path.exists(file):
        os.remove(file)
        print(f"Removed stuck lock file: {file}")

print("DVC is unlocked and ready to go!")

Removed stuck lock file: .dvc/tmp/rwlock
Removed stuck lock file: .dvc/tmp/lock
DVC is unlocked and ready to go!


In [43]:
!dvc push

Everything is up to date.


In [42]:
!dvc remote list

myremote        gdrive://1ZhdeoELGohWWqaw7WgTTM13OJ4fy3R8U      (default)


In [44]:
# Checkout the PREVIOUS version (Seed 21) and push that to Google Drive as well
!git checkout HEAD~1
!dvc checkout
!dvc push

M	.dvc/config


Note: switching to 'HEAD~1'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at 873224d Version 1: Split with seed 21


M       test.csv
M       train.csv
M       validation.csv
3 files pushed


In [45]:
# Return to main branch
!git checkout main
!dvc checkout
print("Both versions successfully pushed to Google Drive!")

M	.dvc/config


Previous HEAD position was 873224d Version 1: Split with seed 21
Switched to branch 'main'


M       test.csv
M       train.csv
M       validation.csv
Both versions successfully pushed to Google Drive!


Summary:

1. Data Cleaning & Processing: Downloaded the UCI SMS Spam Collection, dropped duplicate messages, and normalized each message (lowercasing, removing non-alphabetic characters, removing stopwords, and stemming) into a clean_text column. The spam/ham labels were encoded as a binary target variable (spam = 1, ham = 0).

2. Creating Two Dataset Versions: Split the processed data into training, validation, and test sets (70/15/15, stratified on the target). The split was performed twice with different random seeds (21 and 77), producing two versions of the dataset with identical class counts but different row assignments.

3. Data Version Control (DVC): Initialized DVC to track the data files (raw_data.csv, train.csv, validation.csv, test.csv). Instead of committing the CSVs to the Git repository, DVC created lightweight .dvc pointer files that record each file's hash.

4. Git Integration: Committed the .dvc files to Git, creating a clear, time-traversable history. Version 1 (Seed 21) and Version 2 (Seed 77) were saved as distinct commits in the Git timeline.

5. Remote Cloud Storage: Configured a Google Drive remote authenticated with an OAuth client ID and secret loaded from a .env file. The CSV files for both versions were pushed to the remote (dvc push), so either version can be retrieved later with dvc pull.