# Assignment 2 - Applied Machine Learning
---
## Arghadeep Ghosh

This notebook, `prepare.ipynb`, contains the code for loading the dataset and splitting it into training, validation and test sets.

In [208]:
import os

import pandas as pd
import csv
import numpy as np

from sklearn.model_selection import train_test_split

## Loading the Data
---
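The data is loaded from a tab-separated file with two columns, `label` and `message`. Quoting is disabled so that quote characters inside a message are read as literal text.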

In [209]:
messages = pd.read_csv('./SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE, names=["label", "message"], encoding="utf8")
messages

| | label | message |
|---|---|---|
| 0 | ham | Go until jurong point, crazy.. Available only ... |
| 1 | ham | Ok lar... Joking wif u oni... |
| 2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... |
| 3 | ham | U dun say so early hor... U c already then say... |
| 4 | ham | Nah I don't think he goes to usf, he lives aro... |
| ... | ... | ... |
| 5569 | spam | This is the 2nd time we have tried 2 contact u... |
| 5570 | ham | Will ü b going to esplanade fr home? |
| 5571 | ham | Pity, * was in mood for that. So...any other s... |
| 5572 | ham | The guy did some bitching but I acted like i'd... |
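
The `quoting=csv.QUOTE_NONE` argument matters because SMS text can contain stray quote characters. The sketch below illustrates the effect with the standard-library `csv` reader as a stand-in for the pandas parser; the two records are made up for illustration and are not taken from the dataset.

```python
import csv
import io

# Two made-up tab-separated records; the first message has an unmatched quote.
raw = 'ham\t"Hello there\nspam\tWin "cash" now\n'

# Quoting disabled: every quote character is kept as ordinary text.
rows_no_quoting = list(
    csv.reader(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
)

# Default quoting: the unmatched quote opens a quoted field that
# runs on past the line break, so records get merged.
rows_default = list(csv.reader(io.StringIO(raw), delimiter="\t"))

print(len(rows_no_quoting), rows_no_quoting)
print(len(rows_default))
```

With quoting disabled, each input line stays one record, which is the behaviour wanted for a one-message-per-line file.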


## Train, Validation and Test split
---
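The raw data is split into 60% training, 20% validation and 20% test data. The first call holds out 20% of the rows for testing; the second takes 25% of the remaining 80% (another 20% of the total) for validation. The first batch of splits is generated with random seed 1337.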

In [210]:
np.random.seed(1337)

X_temp, X_test, Y_temp, Y_test = train_test_split(messages['message'], messages['label'], test_size=0.2)

X_train, X_valid, Y_train, Y_valid = train_test_split(X_temp, Y_temp, test_size=0.25)

len(X_train), len(X_valid), len(X_test)

train = pd.DataFrame({'label': Y_train, 'message': X_train})
valid = pd.DataFrame({'label': Y_valid, 'message': X_valid})
test = pd.DataFrame({'label': Y_test, 'message': X_test})

len(train), len(valid), len(test)

(3344, 1115, 1115)
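
The sizes above can be reproduced by hand. The sketch below is plain arithmetic rather than a call into scikit-learn, and it assumes that a fractional `test_size` is rounded up to a whole number of rows, which is how `train_test_split` behaves as far as I know. The helper name `split_sizes` is hypothetical; the total of 5574 rows is the sum of the three sizes printed above.

```python
import math

def split_sizes(n_samples, test_fraction=0.2, valid_fraction_of_rest=0.25):
    """Return (train, validation, test) sizes for the two-stage split."""
    # Stage 1: hold out the test set (assumed to round up).
    n_test = math.ceil(test_fraction * n_samples)
    n_temp = n_samples - n_test
    # Stage 2: carve the validation set out of what remains.
    n_valid = math.ceil(valid_fraction_of_rest * n_temp)
    n_train = n_temp - n_valid
    return n_train, n_valid, n_test

sizes = split_sizes(3344 + 1115 + 1115)
print(sizes)  # (3344, 1115, 1115)
```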

In [211]:
# Create the output directory if it does not exist yet
os.makedirs('data', exist_ok=True)

messages.to_csv('data/raw_data.csv')
train.to_csv('data/train.csv')
valid.to_csv('data/validation.csv')
test.to_csv('data/test.csv')

## DVC Initialization
---
We initialize DVC and use a Google Drive remote to track different versions of the data.

In [212]:
!pip install dvc



In [213]:
!rd /s /q .git

In [214]:
!git init
!dvc init

Initialized empty Git repository in C:/Users/Argodep/Applied-Machine-Learning/Assignment 2/.git/
Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/iterative/dvc>


In [215]:
!dir data

 Volume in drive C has no label.
 Volume Serial Number is 7463-D5F5

 Directory of C:\Users\Argodep\Applied-Machine-Learning\Assignment 2\data

02/28/2023  04:44 AM    <DIR>          .
02/28/2023  04:44 AM    <DIR>          ..
02/28/2023  04:44 AM           513,426 raw_data.csv
02/28/2023  04:44 AM           100,926 test.csv
02/28/2023  04:44 AM           306,139 train.csv
02/28/2023  04:44 AM           106,393 validation.csv
               4 File(s)      1,026,884 bytes
               2 Dir(s)  31,878,471,680 bytes free


In [216]:
!pip install dvc_gdrive



In [217]:
!dvc add data
!git add data.dvc
!git add .gitignore
!git commit -m "Tracking Data"


To track the changes with git, run:

	git add .gitignore data.dvc

To enable auto staging, run:

	dvc config core.autostage true
[master (root-commit) 55b2bd7] Tracking Data
 5 files changed, 12 insertions(+)
 create mode 100644 .dvc/.gitignore
 create mode 100644 .dvc/config
 create mode 100644 .dvcignore
 create mode 100644 .gitignore
 create mode 100644 data.dvc


In [218]:
!dvc remote add --default drive gdrive://1HQL2eK29an7F5_mXvWXenehmy5TD30JV
!dvc remote modify drive gdrive_acknowledge_abuse true
!dvc push

Setting 'drive' as a default remote.
Everything is up to date.


In [219]:
!git status

On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   .dvc/config

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	SMSSpamCollection
	prepare.ipynb
	train.ipynb

no changes added to commit (use "git add" and/or "git commit -a")


In [220]:
!dir data.dvc
!type data.dvc

 Volume in drive C has no label.
 Volume Serial Number is 7463-D5F5

 Directory of C:\Users\Argodep\Applied-Machine-Learning\Assignment 2

02/28/2023  04:44 AM                96 data.dvc
               1 File(s)             96 bytes
               0 Dir(s)  31,877,373,952 bytes free
outs:
- md5: ddab8031e8e0d2cf31e9f997271193b1.dir
  size: 1026884
  nfiles: 4
  path: data
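
The `size` and `nfiles` fields in `data.dvc` can be cross-checked against the directory listing printed earlier. A small sketch, with the byte counts copied by hand from that listing (the dictionary names are only for illustration):

```python
# File sizes in bytes, as shown by `dir data` earlier in the notebook.
file_sizes = {
    "raw_data.csv": 513_426,
    "test.csv": 100_926,
    "train.csv": 306_139,
    "validation.csv": 106_393,
}

# Values recorded in data.dvc.
dvc_meta = {"size": 1_026_884, "nfiles": 4}

total_bytes = sum(file_sizes.values())
print(total_bytes == dvc_meta["size"])       # True
print(len(file_sizes) == dvc_meta["nfiles"])  # True
```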


## Data Distribution 
---
Class distribution of the training, validation and test data generated with seed 1337.

In [221]:
train = pd.read_csv('data/train.csv')
valid = pd.read_csv('data/validation.csv')
test = pd.read_csv('data/test.csv')

In [222]:
train['label'].value_counts(), train['label'].value_counts()['spam']/len(train)

(ham     2902
 spam     442
 Name: label, dtype: int64,
 0.13217703349282298)

In [223]:
valid['label'].value_counts(), valid['label'].value_counts()['spam']/len(valid)

(ham     944
 spam    171
 Name: label, dtype: int64,
 0.15336322869955157)

In [224]:
test['label'].value_counts(), test['label'].value_counts()['spam']/len(test)

(ham     981
 spam    134
 Name: label, dtype: int64,
 0.12017937219730941)
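
To compare the three splits side by side, the spam share can be recomputed from the counts printed above. This is a minimal sketch using plain dictionaries instead of the DataFrames; `spam_share` is a hypothetical helper name.

```python
# Class counts copied from the outputs above (seed 1337).
counts = {
    "train": {"ham": 2902, "spam": 442},
    "validation": {"ham": 944, "spam": 171},
    "test": {"ham": 981, "spam": 134},
}

def spam_share(split_counts):
    """Fraction of messages in a split that are labelled spam."""
    return split_counts["spam"] / (split_counts["ham"] + split_counts["spam"])

shares = {name: spam_share(c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.4f}")
```

The shares differ by a few percentage points because the split here is purely random. `train_test_split` also accepts a `stratify` argument that keeps class proportions roughly equal across splits; the notebook does not use it.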

## Generating New Data
---
A new set of training, validation and test data is generated using seed 6969. The old data is replaced and its distribution is printed.

In [225]:
np.random.seed(6969)

X_temp, X_test, Y_temp, Y_test = train_test_split(messages['message'], messages['label'], test_size=0.2)

X_train, X_valid, Y_train, Y_valid = train_test_split(X_temp, Y_temp, test_size=0.25)

len(X_train), len(X_valid), len(X_test)

train = pd.DataFrame({'label': Y_train, 'message': X_train})
valid = pd.DataFrame({'label': Y_valid, 'message': X_valid})
test = pd.DataFrame({'label': Y_test, 'message': X_test})

len(train), len(valid), len(test)

(3344, 1115, 1115)
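
The split sizes are unchanged because they depend only on the number of rows; what the seed changes is which rows land in each split. The sketch below illustrates that idea with Python's standard `random` module as a stand-in for NumPy's global generator, so the actual permutations will not match the ones scikit-learn produces. `shuffled_indices` is a hypothetical helper.

```python
import random

def shuffled_indices(n, seed):
    """Return a seeded permutation of range(n)."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    return indices

first = shuffled_indices(100, 1337)
again = shuffled_indices(100, 1337)
other = shuffled_indices(100, 6969)

print(first == again)  # same seed, same ordering
print(first == other)  # different seed, different ordering
```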

In [226]:
# Create the output directory if it does not exist yet
os.makedirs('data', exist_ok=True)

messages.to_csv('data/raw_data.csv')
train.to_csv('data/train.csv')
valid.to_csv('data/validation.csv')
test.to_csv('data/test.csv')

In [227]:
train['label'].value_counts(), train['label'].value_counts()['spam']/len(train)

(ham     2893
 spam     451
 Name: label, dtype: int64,
 0.13486842105263158)

In [228]:
valid['label'].value_counts(), valid['label'].value_counts()['spam']/len(valid)

(ham     964
 spam    151
 Name: label, dtype: int64,
 0.13542600896860987)

In [229]:
test['label'].value_counts(), test['label'].value_counts()['spam']/len(test)

(ham     970
 spam    145
 Name: label, dtype: int64,
 0.13004484304932734)

## DVC Checkout
---
`dvc checkout` restores the previously generated data (random seed 1337). The distributions are printed to show that they match the earlier ones.

In [230]:
!dvc checkout -f data.dvc

M       data\


In [231]:
train = pd.read_csv('data/train.csv')
valid = pd.read_csv('data/validation.csv')
test = pd.read_csv('data/test.csv')

len(train), len(valid), len(test)

(3344, 1115, 1115)

In [232]:
train['label'].value_counts(), train['label'].value_counts()['spam']/len(train)

(ham     2902
 spam     442
 Name: label, dtype: int64,
 0.13217703349282298)

In [233]:
valid['label'].value_counts(), valid['label'].value_counts()['spam']/len(valid)

(ham     944
 spam    171
 Name: label, dtype: int64,
 0.15336322869955157)

In [234]:
test['label'].value_counts(), test['label'].value_counts()['spam']/len(test)

(ham     981
 spam    134
 Name: label, dtype: int64,
 0.12017937219730941)
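
Matching class counts are strong evidence that the old files are back, but two different splits could in principle share the same counts. A stricter check is to compare file fingerprints taken before the overwrite and after the checkout. The sketch below is an illustration on a temporary file, not on the notebook's `data` directory; `file_md5` is a hypothetical helper, and it does not try to reproduce the directory-level `.dir` hash that DVC stores in `data.dvc`.

```python
import hashlib
import os
import tempfile

def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def write_bytes(path, payload):
    with open(path, "wb") as handle:
        handle.write(payload)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train.csv")

    write_bytes(path, b"label,message\nham,hello\n")      # original version
    before = file_md5(path)

    write_bytes(path, b"label,message\nspam,win cash\n")  # overwritten version
    changed = file_md5(path)

    write_bytes(path, b"label,message\nham,hello\n")      # restored version
    restored = file_md5(path)

print(before == restored, before != changed)
```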