# Add new labeled data 🛰️

**Description:** Standalone notebook for adding new training and evaluation data.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/new_data.ipynb)

# 1. Setup

If you don't already have one, obtain a GitHub Personal Access Token by following [these steps](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token). Save this token somewhere private.

In [None]:
from ipywidgets import Password, Text, VBox

inputs = [
    Password(description="GitHub Token:"),
    Text(description="GitHub Email:"),
    Text(description="GitHub User:"),
    Text(description="GitHub URL:"),
]
VBox(inputs)

In [None]:
token = inputs[0].value
email = inputs[1].value
username = inputs[2].value
github_url = inputs[3].value

# user.email takes the email address, user.name takes the username
!git config --global user.email $email
!git config --global user.name $username
!git clone {github_url.replace("https://", f"https://{username}:{token}@")}
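
The clone command above places the username and token directly in the URL. If either value contains characters such as `@` or `/`, the resulting URL is ambiguous. A minimal sketch, assuming a hypothetical helper `authenticated_clone_url` (not part of openmapflow), that percent-encodes both values before building the URL:

```python
from urllib.parse import quote


def authenticated_clone_url(github_url: str, username: str, token: str) -> str:
    """Hypothetical helper: insert percent-encoded credentials after https://."""
    prefix = "https://"
    if not github_url.startswith(prefix):
        raise ValueError(f"Expected an https:// URL, got: {github_url}")
    # safe="" makes quote() encode every reserved character, including "/" and "@"
    credentials = f"{quote(username, safe='')}:{quote(token, safe='')}"
    return f"{prefix}{credentials}@{github_url[len(prefix):]}"
```

The result could then be passed to `!git clone {...}` in place of the inline `replace` call. The token still ends up in the cloned repository's remote URL, so the notebook output should be kept private.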

In [None]:
from pathlib import Path
path_to_yaml = input("Path to openmapflow.yaml: ")
%cd {Path(path_to_yaml).parent}

In [None]:
!pip install -r requirements.txt -q
!pip install pyyaml==5.4.1 -q

In [None]:
from google.colab import files
from openmapflow.utils import colab_gee_gcloud_login
from openmapflow.config import PROJECT_ROOT, DataPaths, GCLOUD_PROJECT_ID
from openmapflow.raw_labels import _read_in_file

In [None]:
colab_gee_gcloud_login(GCLOUD_PROJECT_ID)

In [None]:
# Existing branches
!git branch -r

In [None]:
choice = input("a) Checking progress of dataset creation OR \nb) Creating new dataset \na/b: ")
if choice == "a":
  branch_name = input("Existing branch name: ")
  !git checkout {branch_name}
  !git pull
elif choice == "b":
  branch_name = input("New branch name: ")
  !git checkout -b {branch_name}
else:
  print(f"Invalid choice: {choice}, must be 'a' or 'b'")
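
The cell above prints a message on invalid input but then continues without checking out a branch. One possible alternative, sketched here with a hypothetical `ask_choice` helper (not part of openmapflow), is to keep prompting until the answer is `a` or `b`:

```python
def ask_choice(ask=input):
    """Hypothetical helper: prompt repeatedly until the user answers 'a' or 'b'.

    `ask` defaults to the built-in input(); passing another callable
    makes the function usable without a keyboard.
    """
    prompt = "a) Checking progress of dataset creation OR \nb) Creating new dataset \na/b: "
    while True:
        choice = ask(prompt).strip().lower()
        if choice in ("a", "b"):
            return choice
        print(f"Invalid choice: {choice}, must be 'a' or 'b'")
```

Calling `choice = ask_choice()` before the `if`/`elif` block would make the final `else` branch unreachable.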


# 2. Download latest data
Data is stored in remote storage (e.g. Google Drive), so authentication is necessary.

In [None]:
!dvc pull -q

# 3. Upload labels

In [None]:
dataset_name = input("Dataset name (suggested format: <Country_Region_Year>): ")
while True:
    dataset_dir = PROJECT_ROOT / DataPaths.RAW_LABELS / dataset_name
    if dataset_dir.exists() and len(list(dataset_dir.iterdir())) > 0:
        dataset_name = input("Dataset name already exists, try a different name: ")
    else:
        dataset_dir.mkdir(exist_ok=True)
        break

print("--------------------------------------------------")
print(f"Dataset: {dataset_name} directory created")
print("--------------------------------------------------")
uploaded = files.upload()
for file_name in uploaded.keys():
    Path(file_name).rename(dataset_dir / file_name)
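
The loop above treats a dataset name as taken only when its directory exists and is non-empty. That rule can be sketched as a small standalone function. `is_name_available` and the dataset names below are hypothetical and only illustrate the check:

```python
import tempfile
from pathlib import Path


def is_name_available(raw_labels_dir: Path, dataset_name: str) -> bool:
    """Hypothetical helper: True if the name is unused or its directory is empty."""
    dataset_dir = raw_labels_dir / dataset_name
    if not dataset_dir.exists():
        return True
    if not dataset_dir.is_dir():
        # A plain file with this name would make mkdir() fail
        return False
    return not any(dataset_dir.iterdir())


# Usage against a temporary directory standing in for the raw labels folder
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "Kenya_2020").mkdir()
    (root / "Kenya_2020" / "labels.csv").write_text("lat,lon\n")
    (root / "Togo_2019").mkdir()
    taken = is_name_available(root, "Kenya_2020")  # directory with a file
    empty = is_name_available(root, "Togo_2019")   # empty directory
    fresh = is_name_available(root, "Mali_2021")   # does not exist yet
```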

In [None]:
# Assess dataset (previews the last uploaded file)
df = _read_in_file(dataset_dir / file_name)
df.head()

# 4. Create dataset
<img src="https://storage.googleapis.com/harvest-public-assets/openmapflow/new_data.png"/>

`openmapflow create-datasets` creates datasets from labels and Earth observation data referenced in `datasets.py`.

It first checks whether the necessary Earth observation data is already available in Cloud Storage, or whether an Earth Engine task for it is already active, so Google Cloud and Earth Engine authentication is needed.
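
The order of checks described above can be pictured with a small sketch. This is not openmapflow's implementation: `next_step`, its arguments, and the returned strings are hypothetical, and plain sets stand in for Cloud Storage and the Earth Engine task list.

```python
def next_step(data_id, ids_in_cloud_storage, active_task_ids):
    """Hypothetical illustration of the check order, not the library's code."""
    if data_id in ids_in_cloud_storage:
        # Data was exported previously, so nothing new needs to run
        return "use existing data"
    if data_id in active_task_ids:
        # An export is already in progress, so avoid starting a duplicate
        return "wait for active task"
    return "start new export"
```

Under these assumptions, re-running dataset creation while exports are still in progress would report progress rather than launch duplicate tasks.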

In [None]:
user_confirmation = input(
    "Open datasets.py and add a `LabeledDataset` object representing the labels just added.\n"+
    "Added `LabeledDataset` y/[n]: "
)
if user_confirmation.lower() != "y":
    print("New features can only be created when a `LabeledDataset` object is added.")

In [None]:
from openmapflow.labeled_dataset import create_datasets
from datasets import datasets

create_datasets(datasets)

# 5. Push new dataset to the repository

In [None]:
# Pushing to remote storage
!dvc commit {DataPaths.RAW_LABELS} -f -q
!dvc commit {DataPaths.DATASETS} -f -q
!dvc push

In [None]:
# Pushing reference to GitHub
commit_message = input("Commit message: ")
!git add .
!git commit -m '{commit_message}'
!git push 
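
Wrapping the commit message in single quotes, as above, breaks when the message itself contains an apostrophe. A minimal sketch of one way around this, using the standard library's `shlex.quote` in a hypothetical `git_commit_command` helper:

```python
import shlex


def git_commit_command(message: str) -> str:
    """Hypothetical helper: build a git commit command with a safely quoted message."""
    # shlex.quote escapes quotes, spaces, and other shell metacharacters
    return f"git commit -m {shlex.quote(message)}"
```

In the notebook, the result could be run with `!{git_commit_command(commit_message)}`.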

Create a Pull Request so the data can be merged into the main branch.