<a href="https://colab.research.google.com/github/ArgProgInti/DagsHub/blob/main/Using_DAGsHub_with_a_mirrored_repository_from_GitHub.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

<center> <a href="https://dagshub.com"><img alt="DAGsHub" width="500px" src="https://raw.githubusercontent.com/DAGsHub/client/master/dagshub_github.png"></a> </center>

<center><h1>The 'Hello-World' Project - Colab Environment</h1></center>

---
## Hello and Welcome to DAGsHub! 👋 

We are very excited to have you on [DAGsHub](https://dagshub.com) and can't wait to see what remarkable projects you will create and share with the Data Science community. 
<br>

The primary goal of this notebook is to **help you learn the basic features and usage of DAGsHub** while maintaining a relatively clean environment. By following this notebook, you will create your first 'hello-world' project on DAGsHub. We will see how to <u>configure Git and DVC</u> and use them to <u>track code and data files</u>. Then, we will define DAGsHub as the remote storage and <u>push DVC's tracked files</u> to it. Lastly, you will create your first <u>Data Science Experiment on DAGsHub</u>.
<br>

**The project** - In this walkthrough, we will train a model to classify 'Ham' and 'Spam' emails. We will use the Enron dataset, which stores labeled emails in a CSV file.


<img src="https://dragonballz.co.il/wp-content/uploads/2020/12/discord-logo.jpg" height="23"/> [Discord Channel](https://discord.com/channels/698874030052212737/698874030572437526) | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/c/c9/Linkedin.svg/1200px-Linkedin.svg.png" height="23"/> [LinkedIn](https://www.linkedin.com/company/dagshub/) | <img src="https://help.twitter.com/content/dam/help-twitter/brand/logo.png" height="25"/> [Twitter](https://twitter.com/TheRealDAGsHub) | <img src="https://res-2.cloudinary.com/crunchbase-production/image/upload/c_lpad,f_auto,q_auto:eco/plwmuai9t3okgwbuhkho" height="30"/> [DAGsHub](https://dagshub.com) | <img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/91/Octicons-mark-github.svg/1200px-Octicons-mark-github.svg.png" height="25"/> [GitHub](https://github.com/DAGsHub) 

# Configure DAGsHub, GitHub and Git

In [None]:
import requests
import getpass
import datetime

**Set Environment Variables - DAGsHub**


In [None]:
#@title Enter the DAGsHub repository owner name:

DAGSHUB_REPO_OWNER= "" #@param {type:"string"}

In [None]:
#@title Enter the DAGsHub repository name:

DAGSHUB_REPO_NAME= "" #@param {type:"string"}

In [None]:
#@title Enter the username of your DAGsHub account:

DAGSHUB_USER_NAME = "" #@param {type:"string"}

**Set Environment Variables - GitHub**


In [None]:
#@title Enter the GitHub repository owner name:

GITHUB_REPO_OWNER= "" #@param {type:"string"}

In [None]:
#@title Enter the GitHub repository name:

GITHUB_REPO_NAME= "" #@param {type:"string"}

In [None]:
#@title Enter the username of your GitHub account:

GITHUB_USER_NAME = "" #@param {type:"string"}

In [None]:
#@title Enter the email for your GitHub account:

GITHUB_EMAIL = "" #@param {type:"string"}

We take security very seriously and don't want your DAGsHub password to be saved in the notebook runtime. Thus, we created an API that generates an access token for your DAGsHub account. With this token, you will push your DVC-tracked files to DAGsHub without saving the password as a variable.

In [None]:
r = requests.post('https://dagshub.com/api/v1/user/tokens', 
                  json={"name": f"colab-token-{datetime.datetime.now()}"}, 
                  auth=(DAGSHUB_USER_NAME, getpass.getpass('Please enter your DAGsHub token or password: ')))
r.raise_for_status()
DAGSHUB_TOKEN=r.json()['sha1']
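
The token request above can also be wrapped in a small function, which makes it easier to reuse and to try out without network access. This is a minimal sketch that assumes the same endpoint and `sha1` response field used in the cell above; the function name `create_dagshub_token`, the `post` parameter, and the timestamp format are illustrative choices, not part of the DAGsHub client.

```python
import datetime


def create_dagshub_token(user, secret, post=None, timeout=30):
    """Request an access token and return its value.

    `post` defaults to requests.post; it is injectable (an assumption of
    this sketch) so the function can be exercised without a network call.
    """
    if post is None:
        import requests  # imported lazily; only needed for real calls
        post = requests.post

    # Timestamp without spaces or colons, to keep the token name compact
    stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")
    response = post(
        "https://dagshub.com/api/v1/user/tokens",
        json={"name": f"colab-token-{stamp}"},
        auth=(user, secret),
        timeout=timeout,
    )
    response.raise_for_status()  # fail loudly on bad credentials
    return response.json()["sha1"]
```

In the notebook, this would be called as `create_dagshub_token(DAGSHUB_USER_NAME, getpass.getpass(...))`, so the password still never lands in a variable.
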

In [None]:
GITHUB_TOKEN = getpass.getpass('Please enter your GitHub token or password: ')

**Configure Git**

In [None]:
# Quote the values so that names containing spaces are passed as a single argument
!git config --global user.email "{GITHUB_EMAIL}"
!git config --global user.name "{GITHUB_USER_NAME}"

**Clone the Repository**

In [None]:
!git clone https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

%cd {GITHUB_REPO_NAME}
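
The clone URL above embeds the username and token directly, which breaks if either contains characters such as `@` or `/`. The sketch below shows one way to percent-encode the credentials first and to hide the token before printing a URL. The helper names `authenticated_remote` and `redact` are hypothetical and exist only for illustration.

```python
from urllib.parse import quote


def authenticated_remote(user, token, owner, repo, host="github.com"):
    """Build an HTTPS Git URL with percent-encoded credentials."""
    # safe="" also encodes "/" and "@", which would otherwise break the URL
    credentials = f"{quote(user, safe='')}:{quote(token, safe='')}"
    return f"https://{credentials}@{host}/{owner}/{repo}.git"


def redact(url, token):
    """Hide the (encoded) token before printing or logging a URL."""
    if not token:
        return url
    return url.replace(quote(token, safe=""), "***")


url = authenticated_remote("me", "p@ss/word", "org", "repo")
print(redact(url, "p@ss/word"))
```

The resulting string can be passed to `git clone` or `git push` in place of the hand-built URL.
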

# Install and Configure DVC

**Initialize DVC**

In [None]:
# Install DVC
!pip install dvc &> /dev/null 

# Import DVC package - relevant only when working in a Colab environment
import dvc

# Initialize DVC in the local directory
!dvc init &> /dev/null 

# Track the changes with git
!git add .dvc .dvcignore .gitignore
!git commit -m "Initialize DVC"
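
Redirecting output to `/dev/null` keeps the notebook tidy, but it also hides error messages when a command fails. As an alternative, a command can be run from Python so that its output stays hidden on success and is shown on failure. This is a minimal sketch using only the standard library; the helper name `run` is an illustrative choice.

```python
import subprocess
import sys


def run(command):
    """Run a command quietly, but raise with its stderr if it fails."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{' '.join(command)} exited with {result.returncode}:\n"
            f"{result.stderr.strip()}"
        )
    return result.stdout


# Harmless demonstration that works anywhere Python is installed
print(run([sys.executable, "-c", "print('ok')"]), end="")
```

In this notebook, `run(["dvc", "init"])` would play the role of `!dvc init &> /dev/null`.
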

**Configure DVC**

In [None]:
# Set DVC remote storage as 'DAGsHub storage'
!dvc remote add origin --local https://dagshub.com/{DAGSHUB_REPO_OWNER}/{DAGSHUB_REPO_NAME}.dvc

# General DVC configuration
!dvc remote modify --local origin auth basic
!dvc remote modify --local origin user {DAGSHUB_USER_NAME}
!dvc remote modify --local origin password {DAGSHUB_TOKEN}
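
The `--local` flag matters here: it writes the remote settings, including the token, to `.dvc/config.local` instead of the shared `.dvc/config`, and `dvc init` normally lists `config.local` in `.dvc/.gitignore` so that it is never committed. The sketch below checks that assumption before pushing; the function name is hypothetical, and the exact ignore entry may vary between DVC versions.

```python
from pathlib import Path


def local_config_is_ignored(repo_root="."):
    """Return True if .dvc/.gitignore lists config.local."""
    ignore_file = Path(repo_root) / ".dvc" / ".gitignore"
    if not ignore_file.is_file():
        return False
    # Entries are usually written as "/config.local"; drop the leading slash
    entries = {
        line.strip().lstrip("/")
        for line in ignore_file.read_text().splitlines()
    }
    return "config.local" in entries
```

Calling `local_config_is_ignored()` from the repository root should return `True` on a freshly initialized DVC project.
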

# Project Setup 


At this point, we want to add the required files for our ML project to the local directory. We will use the `dvc get` command, which downloads files from a Git repository or DVC storage without tracking them.

**Download the project's files**

In [None]:
!dvc get https://dagshub.com/nirbarazida/hello-world requirements.txt
!dvc get https://dagshub.com/nirbarazida/hello-world src
!dvc get https://dagshub.com/nirbarazida/hello-world-files data/

**Install Requirements**

In [None]:
!pip install -r requirements.txt &> /dev/null

# Track Files Using DVC and Git 🏇🏼

The data directory contains the data sets for this project, which are quite big. Thus, we will track this directory using DVC and use Git to track the rest of the project's files.

**Track Files with DVC**



In [None]:
# Add the data directory to DVC tracking
!dvc add data

In [None]:
# Track the changes with Git
!git add data.dvc .gitignore
!git commit -m "Add the data directory to DVC tracking"
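
After `dvc add`, Git tracks only the small `data.dvc` pointer file, which records a content hash of the data rather than the data itself. The sketch below illustrates the idea of content hashing with the standard library. It assumes MD5 and is not DVC's own implementation; DVC's hashing details and the pointer file format depend on the DVC version.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large data files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b""
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two files with identical contents produce the same digest, which is what lets a pointer file stand in for data stored elsewhere.
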

**Track Files with Git**

In [None]:
!git add requirements.txt src/
!git commit -m "Add requirements and src to Git tracking"

# Push the Files to the Remotes 

**Push Git tracked files**


In [None]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

**Push DVC tracked files**


In [None]:
!dvc push -r origin

# Checkpoint 🎯



If you check your DAGsHub repository's new status, you will see all the files that we pushed with Git and DVC, as shown here.

- The main repository page:
<center><a><img src="https://i.ibb.co/F7TpFPw/5-repo-stat-after-push.png" alt="5-repo-stat-after-push" border="0"></a></center>
<br>

  <u>**Note**</u>: The DVC tracked files are marked with a blue background.

- The data directory:
<center><a><img src="https://i.ibb.co/6P9RrNj/6-data-dir-after-push.png" alt="6-data-dir-after-push" border="0"></a></center>
<br>

- The data file itself:
<center><a><img src="https://i.ibb.co/9HWvKTY/7-content-of-enron-file.png" alt="7-content-of-enron-file" border="0"></a></center>

# Process and Track Data Changes

We want to preprocess our data and track the results using DVC. By running the data_preprocessing.py module, we will generate four new files of processed data in the 'data' directory. We will track the new files with DVC and Git and push them to the remotes.

In [None]:
# Process the Data
!python src/data_preprocessing.py
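
To confirm which files the preprocessing step created, the directory can be listed before and after the run and the two listings compared. This is a minimal sketch with hypothetical helper names; it makes no assumption about what the new files are called.

```python
from pathlib import Path


def snapshot(directory):
    """Return the set of file paths under `directory`, relative to it."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file()
    }


def added_files(before, after):
    """Files present in `after` but not in `before`, sorted for display."""
    return sorted(after - before)
```

For example, store `before = snapshot("data")` ahead of the preprocessing cell, then call `added_files(before, snapshot("data"))` once it has finished.
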

In [None]:
# Track the Changes
!dvc add data &> /dev/null 
!git add data.dvc
!git commit -m "Process raw-data and save it to data directory"

**Push the Files to the Remotes**

In [None]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

!dvc push -r origin &> /dev/null 

# Checkpoint

If you check the data directory's new status in your DAGsHub repository, you will see all the new data files there, as shown below.

- The data directory
<center><a><img src="https://i.ibb.co/GxjTxB3/8-data-dir-after-push.png" alt="8-data-dir-after-push" border="0" /></a></center>

# Create Data Science Experiments 🧪 

In [None]:
!pip3 install dagshub &> /dev/null

**Run new experiment**

In [None]:
!python3 src/modeling.py

**Track the Experiment Files**

In [None]:
!git add metrics.csv params.yml
!git commit -m "New Experiment - Random Forest Classifier with basic processing"
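
Before committing, it can be useful to inspect what the experiment actually logged. The sketch below reads `metrics.csv` and returns the last recorded value for each metric. It assumes the file has `Name` and `Value` columns and that the values are numeric; check the header of your own file, since this layout is an assumption rather than something shown in this notebook.

```python
import csv


def latest_metrics(path="metrics.csv"):
    """Return {metric name: last logged value} from a metrics CSV file."""
    latest = {}
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle):
            # Later rows overwrite earlier ones, so the last value wins
            latest[row["Name"]] = float(row["Value"])
    return latest
```

Calling `latest_metrics()` from the repository root gives a quick view of the numbers that will appear in the experiment tab.
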

**Push the Files to the Remotes**

In [None]:
!git push https://{GITHUB_USER_NAME}:{GITHUB_TOKEN}@github.com/{GITHUB_REPO_OWNER}/{GITHUB_REPO_NAME}.git

# Checkpoint 🎯

If you check your DAGsHub repository's new status, you will see that a new experiment was added to the Experiments tab. If you go to the tab, you will see the model's hyperparameters and its performance metrics.

- The experiment tab:
<center><a href="https://ibb.co/PWwrpQD"><img src="https://i.ibb.co/wQMdHsc/10-experiment.png" alt="10-experiment" border="0" /></a></center>

# Finish Line 🏁

**Congratulations**  - You made it to the finish line! 🥳

In the Get Started section, we covered the fundamentals of DAGsHub usage. We started by creating a repository and configuring Git and DVC. Then, we added a project to the repository using Git and DVC to track the files. Lastly, we created our very first Data Science Experiment with DAGsHub Logger. <br><br>

More resources that can interest you:
- [DAGsHub Docs](https://dagshub.com/docs/).
- [Get Started Tutorial](https://dagshub.com/docs/getting-started/overview/).
- [DAGsHub Blog](https://dagshub.com/blog/).
- [FAQ](https://dagshub.com/docs/faq/).

<br>

We hope that this tutorial was helpful and made the onboarding process easier for you. If you found an issue in the notebook, please [let us know](https://dagshub.com/DAGsHub-Official/DAGsHub-Issues/issues/). If you have any questions, feel free to join our [Discord channel](https://discord.com/invite/9gU36Y6) and ask there. We can't wait to see what remarkable projects you will create and share with the Data Science community!
<br><br>