# 1. Steps executed so far:
(Refer to the "Data Versioning using DVC" document.)

### **Initialize DVC repository:**

- Create a new GitHub repo
- Start a codespace
- Upload data (data/titanic.csv)
- Create and activate venv
- Install packages: <br>`pip install dvc==3.55.2 dvc-ssh==4.1.1 asyncssh==2.18.0`

- Initialize dvc repository: <br>`dvc init`


### **Configure Remote Storage:**

- Add remote storage: <br>`dvc remote add -d myremote ssh://<username>@<vm-ip>:22/<path-to-dvc-storage-folder>`

- Provide credentials (saved in `.dvc/config.local`, which Git does not track):
<br>`dvc remote modify --local myremote password <your-vm-password>`

- Commit remote-storage config: <br>`git add .dvc/config`
<br>`git commit -m "remote storage configured"`
<br>`git push`




### **Data Versioning:**

- Add data for dvc tracking: <br>`dvc add data/titanic.csv`

- Add metadata files for git tracking: <br>`git add data/.gitignore data/titanic.csv.dvc`

- Commit changes in git: <br>`git commit -m "Initial data"`

- Tag the data:  <br>`git tag -a v1.1 -m "Dataset v1.1"`
<br> `git tag` creates a named reference to a specific commit in the repository's history, typically to mark that commit as important. Tags often denote releases (e.g., version 1.0, 2.0); here the tag marks the commit that records this version of the dataset.

- Push data to remote-storage: <br>`dvc push`

- Push metadata file to git repo: <br>`git push`

- Push tag: <br>`git push origin v1.1`


### **Whenever new data comes in, run the following commands in sequence:**

- `dvc status`
- `dvc add data/titanic.csv`
- `git add data/titanic.csv.dvc`
- `git commit -m "dataset updated"`
- `git tag -a v1.x -m "Dataset v1.x"`
- `dvc push`
- `git push`
- `git push origin v1.x`
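
The sequence above can also be scripted. The following is a minimal sketch, not part of the original workflow: the helper names `dvc_update_commands` and `run_update` are hypothetical, and it assumes `dvc` and `git` are on the PATH and that the script runs from the repository root.

```python
import shlex
import subprocess
from typing import List


def dvc_update_commands(data_path: str, tag: str,
                        message: str = "dataset updated") -> List[List[str]]:
    """Return the update commands as argument lists, in the order listed above.

    Hypothetical helper: it only builds the commands, it does not run them.
    """
    return [
        ["dvc", "status"],
        ["dvc", "add", data_path],
        ["git", "add", f"{data_path}.dvc"],
        ["git", "commit", "-m", message],
        ["git", "tag", "-a", tag, "-m", f"Dataset {tag}"],
        ["dvc", "push"],
        ["git", "push"],
        ["git", "push", "origin", tag],
    ]


def run_update(data_path: str, tag: str, dry_run: bool = True) -> None:
    """Print the commands (dry run) or execute them, stopping at the first failure."""
    for cmd in dvc_update_commands(data_path, tag):
        if dry_run:
            print(shlex.join(cmd))
        else:
            subprocess.run(cmd, check=True)
```

Calling `run_update("data/titanic.csv", "v1.2")` only prints the commands; pass `dry_run=False` to execute them. Keeping the dry run as the default is a deliberate choice here, since the sequence creates tags and pushes to remotes.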

# 2. Use DVC-API to load the data:

In [1]:
%%capture
# Install dvc
!pip -q install dvc==3.55.2
!pip -q install dvc-ssh==4.1.1
!pip -q install asyncssh==2.18.0

In [2]:
!dvc --version

3.55.2

Provide the following credentials:

- your GitLab Username
- your GitLab Access token
- your Sandbox VM Password


In [3]:
import os

# Provide credentials and save them as environment variables. Later, whenever needed, access the credentials using environment variables only.
os.environ["GTLB_USERNAME"] = "add-your-username"
os.environ["GTLB_ACCESS_TOKEN"] = "add-your-access-token"
os.environ["VM_PASSWORD"] = "add-your-vm-password"
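
Before the credentials are used, it can help to confirm that the placeholder values were actually replaced. The check below is a minimal sketch under that assumption; `missing_credentials` is a hypothetical helper, and the `add-your-` prefix is taken from the placeholder strings in the cell above.

```python
import os

# Environment variable names used in this notebook.
REQUIRED_VARS = ("GTLB_USERNAME", "GTLB_ACCESS_TOKEN", "VM_PASSWORD")


def missing_credentials(env=None, placeholder_prefix="add-your-"):
    """Return the names of variables that are unset, empty, or still placeholders.

    `env` defaults to os.environ; a plain dict can be passed for testing.
    """
    env = os.environ if env is None else env
    missing = []
    for name in REQUIRED_VARS:
        value = env.get(name, "")
        if not value or value.startswith(placeholder_prefix):
            missing.append(name)
    return missing
```

An empty list means all three variables hold real values; otherwise the returned names show which ones still need to be set.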

In [5]:
import dvc.api
import pandas as pd

# Repo format:  "https://<gitlab-username>:<gitlab-token>@gitlab.com/<gitlab-username>/<repo-name>"
# For example: "https://yrajm1997:glpat_abcdefxxxx@gitlab.com/yrajm1997/titanic-data-repo"

# Repository used in this notebook: https://gitlab.com/yrajm1997/loan-data-repo

gtlb_username = os.environ['GTLB_USERNAME']
gtlb_access_token = os.environ['GTLB_ACCESS_TOKEN']

repo_name = "loan-data-repo"     # Change as per your GitLab repository name

repo_url = 'https://' + gtlb_username + ':' + gtlb_access_token + '@gitlab.com/' + gtlb_username + '/' + repo_name
print(repo_url)  # Caution: this prints the access token; avoid it in shared notebooks

# Data version to retrieve
data_revision = 'v1.2'

# Configurations to access remote storage
remote_config = {
    'password': os.environ["VM_PASSWORD"]
    }

# Open data file using dvc-api and load the dataset
with dvc.api.open('data/credit_risk_dataset.csv', repo=repo_url, rev=data_revision, remote_config=remote_config) as file:
    df = pd.read_csv(file)

df.tail()

https://yrajm1997:<gitlab-token>@gitlab.com/yrajm1997/loan-data-repo


| Unnamed: 0 | person_age | person_income | person_home_ownership | person_emp_length | loan_intent | loan_grade | loan_amnt | loan_int_rate | loan_status | loan_percent_income | cb_person_default_on_file | cb_person_cred_hist_length |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32577 | 54 | 120000 | MORTGAGE | 4.0 | PERSONAL | A | 17625 | 7.49 | 0 | 0.15 | N | 19 |
| 32578 | 65 | 76000 | RENT | 3.0 | HOMEIMPROVEMENT | B | 35000 | 10.99 | 1 | 0.46 | N | 28 |
| 32579 | 56 | 150000 | MORTGAGE | 5.0 | PERSONAL | B | 15000 | 11.48 | 0 | 0.1 | N | 26 |
| 32580 | 66 | 42000 | RENT | 2.0 | MEDICAL | B | 6475 | 9.99 | 0 | 0.15 | N | 30 |
| 32581 | 30 | 1000000 | OWN | 4.0 | PERSONAL | B | 6475 | 12.0 | 0 | 0.15 | N | 30 |
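
The repository URL in the loading cell embeds the access token, so two things can go wrong: a token with URL-reserved characters produces an invalid URL, and printing the URL exposes the token. The sketch below shows one way to handle both with the standard library. `build_repo_url` and `mask_token` are hypothetical helper names, and the `gitlab.com` default is taken from the URL format used above.

```python
from urllib.parse import quote


def build_repo_url(username: str, token: str, repo_name: str,
                   host: str = "gitlab.com") -> str:
    """Build an HTTPS repo URL with percent-encoded credentials."""
    user = quote(username, safe="")
    secret = quote(token, safe="")
    return f"https://{user}:{secret}@{host}/{username}/{repo_name}"


def mask_token(url: str, token: str) -> str:
    """Return the URL with the (encoded) token replaced by '***' for safe logging."""
    if not token:
        return url
    return url.replace(quote(token, safe=""), "***")


url = build_repo_url("alice", "glpat-abc/123", "loan-data-repo")
print(mask_token(url, "glpat-abc/123"))
# prints: https://alice:***@gitlab.com/alice/loan-data-repo
```

The unmasked `url` is what would be passed as `repo=` to `dvc.api.open`; only the masked form should be printed or logged.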


### References:

- [dvc.api.open()](https://dvc.org/doc/api-reference/open)