## 1. Set Up DVC Remote

In [5]:
!dvc remote add -d myremote gs://mlops_dvc_lab -f
!dvc remote modify myremote credentialpath "C:\Users\jithi\Downloads\dvc-lab-486118-9dbdcb052a2d.json"

Setting 'myremote' as a default remote.


## 2. Initialize DVC

In [6]:
!dvc init --subdir -f

Initialized DVC repository.

You can now commit the changes to git.

+---------------------------------------------------------------------+
|                                                                     |
|        DVC has enabled anonymous aggregate usage analytics.         |
|     Read the analytics documentation (and how to opt-out) here:     |
|             <https://dvc.org/doc/user-guide/analytics>              |
|                                                                     |
+---------------------------------------------------------------------+

What's next?
------------
- Check out the documentation: <https://dvc.org/doc>
- Get help and share ideas: <https://dvc.org/chat>
- Star us on GitHub: <https://github.com/treeverse/dvc>


## 3. Track Data with DVC

In [7]:
!dvc add data/CC_GENERAL.csv


To track the changes with git, run:

	git add 'data\CC_GENERAL.csv.dvc' 'data\.gitignore'

To enable auto staging, run:

	dvc config core.autostage true




## 4. Push to Git & GCS

In [8]:
!dvc remote add -d myremote gs://mlops_dvc_lab -f
!dvc remote modify myremote credentialpath "C:\Users\jithi\Downloads\dvc-lab-486118-9dbdcb052a2d.json"

Setting 'myremote' as a default remote.


In [13]:
!git checkout -b Lab-2--DVC

!git add data/CC_GENERAL.csv.dvc
!git add data/.gitignore
!git commit -m "Add CC_GENERAL.csv to DVC tracking"
!git push origin Lab-2--DVC
!dvc push

Switched to a new branch 'Lab-2--DVC'


[Lab-2--DVC 90f455b] Add CC_GENERAL.csv to DVC tracking
 4 files changed, 9 insertions(+), 7 deletions(-)
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/.dvc/.gitignore
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/data/.gitignore
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/data/CC_GENERAL.csv.dvc


remote: 
remote: Create a pull request for 'Lab-2--DVC' on GitHub by visiting:        
remote:      https://github.com/Jithin-Veeragandham/MLOps/pull/new/Lab-2--DVC        
remote: 
To https://github.com/Jithin-Veeragandham/MLOps
 * [new branch]      Lab-2--DVC -> Lab-2--DVC


Everything is up to date.


md5: 48ba769c96f7eccf8d206be7143e474f


## 5. Run Pipeline (First Time)

In [17]:
!dvc repro

'data\CC_GENERAL.csv.dvc' didn't change, skipping
Running stage 'preprocess':
> python scripts/preprocess.py
Original: 8950 rows
After cleaning: 8636 rows
Generating lock file 'dvc.lock'
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add 'data\.gitignore' dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


## 6. Run Pipeline (Cached - Skips)

In [18]:
!dvc repro

'data\CC_GENERAL.csv.dvc' didn't change, skipping
Stage 'preprocess' didn't change, skipping
Data and pipelines are up to date.
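
The second run is skipped because the dependency hashes recorded in `dvc.lock` still match the files on disk. The sketch below illustrates that idea in plain Python. It is a simplified model under stated assumptions, not DVC's actual implementation: `file_md5`, `stage_is_stale`, and the `recorded` dictionary are hypothetical names standing in for the entries kept in `dvc.lock`.

```python
import hashlib
from pathlib import Path


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def stage_is_stale(recorded):
    """Return True if any dependency is missing or its hash has changed.

    `recorded` maps a dependency path to the hash stored at the last run
    (a hypothetical stand-in for what a lock file would hold).
    """
    for path, old_hash in recorded.items():
        if not Path(path).exists() or file_md5(path) != old_hash:
            return True
    return False
```

A stage would be rerun only when `stage_is_stale(...)` returns `True`, which mirrors the "didn't change, skipping" messages above.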


## 7. Commit Pipeline to Git

In [19]:
!git add dvc.yaml dvc.lock scripts/
!git commit -m "Add preprocessing pipeline"
!git push origin Lab-2--DVC
!dvc push

[Lab-2--DVC a108257] Add preprocessing pipeline
 3 files changed, 33 insertions(+)
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/dvc.lock
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/dvc.yaml
 create mode 100644 Labs/Data_Labs/DVC_Labs/Lab_1/scripts/preprocess.py


To https://github.com/Jithin-Veeragandham/MLOps
   90f455b..a108257  Lab-2--DVC -> Lab-2--DVC


1 file pushed


md5: 48ba769c96f7eccf8d206be7143e474f


In [20]:
!dvc dag

+-------------------------+  
| data\CC_GENERAL.csv.dvc |  
+-------------------------+  
              *              
              *              
              *              
      +------------+         
      | preprocess |         
      +------------+         
+-------------------+  
| data\data.txt.dvc |  
+-------------------+  


## 8. Modify Data (Multiply by 2)

In [None]:
import pandas as pd
df = pd.read_csv('data/CC_GENERAL.csv')
print("BEFORE:")
print(df.head())

BEFORE:
  CUST_ID      BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  \
0  C10001    40.900749           0.818182      95.40              0.00   
1  C10002  3202.467416           0.909091       0.00              0.00   
2  C10003  2495.148862           1.000000     773.17            773.17   
3  C10004  1666.670542           0.636364    1499.00           1499.00   
4  C10005   817.714335           1.000000      16.00             16.00   

   INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  \
0                    95.4      0.000000             0.166667   
1                     0.0   6442.945483             0.000000   
2                     0.0      0.000000             1.000000   
3                     0.0    205.788017             0.083333   
4                     0.0      0.000000             0.083333   

   ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  \
0                    0.000000                          0.083333   
1                    0.00000

In [None]:
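# Select every numeric column (CUST_ID is a string and is left untouched)
numeric_cols = df.select_dtypes(include=['number']).columns
# Double all numeric values to create a clearly different dataset version
df[numeric_cols] = df[numeric_cols] * 2
print("\nAFTER (multiplied by 2):")
print(df.head())

# Overwrite the DVC-tracked file with the modified data
df.to_csv('data/CC_GENERAL.csv', index=False)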


AFTER (multiplied by 2):
  CUST_ID      BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  \
0  C10001    81.801498           1.636364     190.80              0.00   
1  C10002  6404.934832           1.818182       0.00              0.00   
2  C10003  4990.297724           2.000000    1546.34           1546.34   
3  C10004  3333.341084           1.272728    2998.00           2998.00   
4  C10005  1635.428670           2.000000      32.00             32.00   

   INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  \
0                   190.8      0.000000             0.333334   
1                     0.0  12885.890966             0.000000   
2                     0.0      0.000000             2.000000   
3                     0.0    411.576034             0.166666   
4                     0.0      0.000000             0.166666   

   ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  \
0                    0.000000                          0.166666   
1         

## 9. Track Changes & Rerun Pipeline

In [None]:
!dvc add data/CC_GENERAL.csv
!dvc repro


To track the changes with git, run:

	git add 'data\CC_GENERAL.csv.dvc'

To enable auto staging, run:

	dvc config core.autostage true




'data\CC_GENERAL.csv.dvc' didn't change, skipping
Running stage 'preprocess':
> python scripts/preprocess.py
Original: 8950 rows
After cleaning: 8636 rows
Updating lock file 'dvc.lock'

To track the changes with git, run:

	git add dvc.lock

To enable auto staging, run:

	dvc config core.autostage true
Use `dvc push` to send your updates to remote storage.


## 10. Push Updated Data

In [None]:
!git add data/CC_GENERAL.csv.dvc dvc.lock
!git commit -m "v2: Multiply dataset by 2"
!git push origin Lab-2--DVC
!dvc push

[Lab-2--DVC 0cfef36] v2: Multiply dataset by 2
 2 files changed, 6 insertions(+), 6 deletions(-)


To https://github.com/Jithin-Veeragandham/MLOps
   a108257..0cfef36  Lab-2--DVC -> Lab-2--DVC


2 files pushed


md5: 48ba769c96f7eccf8d206be7143e474f


## 11. View Version History

In [None]:

!git log --oneline data/CC_GENERAL.csv.dvc

0cfef36 v2: Multiply dataset by 2
90f455b Add CC_GENERAL.csv to DVC tracking


## 12. Compare Hashes Between Versions

In [26]:
# Show the hash difference between the previous and current versions
!git show HEAD~1:./data/CC_GENERAL.csv.dvc
print("\n--- vs ---\n")
!git show HEAD:./data/CC_GENERAL.csv.dvc

outs:
- md5: c9b0bb7fc9e241b81da92c3528103664
  size: 902879
  hash: md5
  path: CC_GENERAL.csv

--- vs ---

outs:
- md5: 1e50f7d531b29e8c28cd1d43d8b5f2a8
  size: 1031504
  hash: md5
  path: CC_GENERAL.csv
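
The two pointer files above can also be compared programmatically. The following is a minimal sketch, assuming the simple single-output `.dvc` layout shown in the output; `read_dvc_fields` is a hypothetical helper that does naive line parsing rather than full YAML parsing.

```python
def read_dvc_fields(dvc_text):
    """Collect `key: value` pairs from the text of a simple .dvc file.

    Naive line-based parsing; only suitable for the flat, single-output
    layout shown above. The first occurrence of each key wins.
    """
    fields = {}
    for line in dvc_text.splitlines():
        stripped = line.strip().lstrip("- ").strip()
        if ":" in stripped:
            key, value = stripped.split(":", 1)
            if value.strip():
                fields.setdefault(key.strip(), value.strip())
    return fields


# Contents copied from the two `git show` outputs above
OLD = """outs:
- md5: c9b0bb7fc9e241b81da92c3528103664
  size: 902879
  hash: md5
  path: CC_GENERAL.csv
"""
NEW = """outs:
- md5: 1e50f7d531b29e8c28cd1d43d8b5f2a8
  size: 1031504
  hash: md5
  path: CC_GENERAL.csv
"""

old, new = read_dvc_fields(OLD), read_dvc_fields(NEW)
print("content changed:", old["md5"] != new["md5"])
print("size delta (bytes):", int(new["size"]) - int(old["size"]))
```

Only the small pointer file changes in Git; the differing `md5` values identify two separate data objects in DVC storage.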


## 13. Revert to Previous Version

In [None]:
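# Restore the previous version of the .dvc pointer file from Git
!git checkout HEAD~1 -- data/CC_GENERAL.csv.dvc
# Sync the workspace data with the restored pointer (overwrites local changes)
!dvc checkout --force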

# Verify it's back to original
df_reverted = pd.read_csv('data/CC_GENERAL.csv')
print("REVERTED:")
print(df_reverted.head())

M       data\CC_GENERAL.csv
REVERTED:
  CUST_ID      BALANCE  BALANCE_FREQUENCY  PURCHASES  ONEOFF_PURCHASES  \
0  C10001    40.900749           0.818182      95.40              0.00   
1  C10002  3202.467416           0.909091       0.00              0.00   
2  C10003  2495.148862           1.000000     773.17            773.17   
3  C10004  1666.670542           0.636364    1499.00           1499.00   
4  C10005   817.714335           1.000000      16.00             16.00   

   INSTALLMENTS_PURCHASES  CASH_ADVANCE  PURCHASES_FREQUENCY  \
0                    95.4      0.000000             0.166667   
1                     0.0   6442.945483             0.000000   
2                     0.0      0.000000             1.000000   
3                     0.0    205.788017             0.083333   
4                     0.0      0.000000             0.083333   

   ONEOFF_PURCHASES_FREQUENCY  PURCHASES_INSTALLMENTS_FREQUENCY  \
0                    0.000000                          0.083333  

ERROR: Checkout failed for following targets:
data\data.txt
Is your cache up to date?
<https://error.dvc.org/missing-files>
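
The error concerns `data\data.txt`, a separate tracked file that also appears in the `dvc dag` output. The `CC_GENERAL.csv` checkout itself succeeded, as the `REVERTED` output shows. The message indicates that the local cache has no object for that file. One way to check this is to look for the object in the cache directory. The sketch below assumes the default cache layout used by DVC 3.x (`.dvc/cache/files/md5/<first two hash characters>/<remaining characters>`); `expected_cache_path` and `is_cached` are hypothetical helpers, not part of DVC.

```python
from pathlib import Path


def expected_cache_path(md5, cache_dir=".dvc/cache"):
    """Build the path where a cached object for `md5` would be stored.

    Assumption: the DVC 3.x default layout, files/md5/<2 chars>/<rest>.
    """
    return Path(cache_dir) / "files" / "md5" / md5[:2] / md5[2:]


def is_cached(md5, cache_dir=".dvc/cache"):
    """Return True if an object with this hash exists in the local cache."""
    return expected_cache_path(md5, cache_dir).is_file()
```

If `is_cached(...)` returns `False` for the hash listed in the corresponding `.dvc` file, running `dvc pull` or `dvc fetch` would retrieve the object, provided it was previously pushed to the remote.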


## 14. Commit the Reverted Version

In [None]:

!git add data/CC_GENERAL.csv.dvc
!git commit -m "Revert to original dataset"
!git push origin Lab-2--DVC

[Lab-2--DVC ae8fe77] Revert to original dataset
 1 file changed, 2 insertions(+), 2 deletions(-)


To https://github.com/Jithin-Veeragandham/MLOps
   0cfef36..ae8fe77  Lab-2--DVC -> Lab-2--DVC
