# Add new labeled data 🛰️

**Description:** Standalone notebook for adding new training and evaluation data.

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nasaharvest/openmapflow/blob/main/openmapflow/notebooks/new_data.ipynb)

# 1. Setup

If you don't already have one, create a GitHub Personal Access Token by following the steps [here](https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token). Save the token somewhere private.

In [1]:
try:
    from google.colab import files
    IN_COLAB = True
except ImportError:
    IN_COLAB = False
    
if IN_COLAB:
    from getpass import getpass
    github_url = input("Github HTTPS URL: ")
    email = input("Github email: ")
    username = input("Github username: ")
    token = getpass('Github Personal Access Token:')

    !git config --global user.email $email
    !git config --global user.name $username
    !git clone {github_url.replace("https://", f"https://{username}:{token}@")}

    # Temporary install from Github
    !pip install git+https://$username:$token@github.com/nasaharvest/openmapflow.git -q
else:
    !pip install google-auth -q
    print("Running notebook outside Google Colab. Assuming in local repository.")

Running notebook outside Google Colab. Assuming in local repository.
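
The setup cell builds the authenticated clone URL with a plain string replacement, which can produce an ambiguous URL if the username or token contains characters such as `@` or `:`. A minimal sketch of a more defensive variant follows; the helper name `authenticated_url` is hypothetical and not part of openmapflow.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def authenticated_url(github_url: str, username: str, token: str) -> str:
    """Return an HTTPS clone URL with the credentials embedded before the host.

    Hypothetical helper for illustration; not part of openmapflow.
    """
    parts = urlsplit(github_url)
    if parts.scheme != "https":
        raise ValueError("Expected an HTTPS URL")
    # Percent-encode the credentials so '@' or ':' inside them stay unambiguous.
    credentials = f"{quote(username, safe='')}:{quote(token, safe='')}"
    netloc = f"{credentials}@{parts.hostname}"
    if parts.port:
        netloc += f":{parts.port}"
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

In the notebook this could be used as `!git clone {authenticated_url(github_url, username, token)}`. Note that the token still ends up in the cloned repository's remote URL, just as in the original cell.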


In [5]:
from pathlib import Path
from ipywidgets import Box
from tqdm.notebook import tqdm
from openmapflow.constants import CONFIG_FILE
from openmapflow.utils import colab_gee_gcloud_login

import ipywidgets as widgets
import os

cwd = Path.cwd()
root = None
for p in [cwd, cwd.parent, cwd.parent.parent]:
    if (p / CONFIG_FILE).exists():
        root = p
        break
if root is None:
    root = input("Path to project_root: ")
%cd {root}

from openmapflow.config import PROJECT_ROOT, DataPaths

/Users/izvonkov/nasaharvest/openmapflow/buildings-example
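
The project-root search above checks the current directory and its two nearest ancestors for the config file. The same idea can be sketched as a reusable function; `find_project_root` and its `max_levels` parameter are hypothetical names, and the config file name is passed in rather than assumed.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, config_name: str, max_levels: int = 2) -> Optional[Path]:
    """Return the first directory, from `start` upwards, that contains `config_name`.

    Only `start` and its `max_levels` nearest ancestors are checked, which
    mirrors the [cwd, cwd.parent, cwd.parent.parent] list used in the cell above.
    """
    candidates = [start, *start.parents][: max_levels + 1]
    for candidate in candidates:
        if (candidate / config_name).exists():
            return candidate
    return None
```

Returning `None` instead of prompting keeps the function free of side effects, so the caller decides whether to fall back to `input(...)` as the notebook does.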


In [7]:
box_layout = widgets.Layout(flex_flow='column')

options = ["Add new labels", "Check progress of previously uploaded labels"]
use = widgets.RadioButtons(
    options=options,
    style= {'description_width': 'initial'},
    value=options[0],
    description='',
    disabled=False
)

branches_available = []
for branch in os.popen('git branch').read().split("\n"):
    if branch == "":
        continue
    branches_available.append(branch.replace("*", "").strip().replace("origin/", ""))

new_branch = widgets.Text(description='Enter a new branch name',
                        style={'description_width': 'initial'})
existing_branch = widgets.Dropdown(options=branches_available, 
                              description="Branch with existing labels",
                              style={'description_width': 'initial'})
existing_branch.layout.visibility = "hidden"

def change_visibility(event):
    try:
        i = event["new"]["index"]
    except (KeyError, TypeError):
        # Not a selection-change event, nothing to update
        return
    show_new = i == 0
    existing_branch.layout.visibility = "hidden" if show_new else "visible" 
    new_branch.layout.display = "block" if show_new else "none"

use.observe(change_visibility)
Box(children=[use, new_branch, existing_branch], layout=box_layout)

Box(children=(RadioButtons(options=('Add new labels', 'Check progress of previously uploaded labels'), style=D…
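
The branch list for the dropdown comes from parsing the text printed by `git branch`. A minimal sketch of that parsing as a pure function, so it can be checked without a repository, is shown here; `parse_branches` is a hypothetical name. Unlike the cell above, it strips `origin/` only as a prefix and skips the `(HEAD detached at ...)` entry that git prints in detached-HEAD state.

```python
from typing import List


def parse_branches(git_branch_output: str) -> List[str]:
    """Turn the text printed by `git branch` into a list of branch names."""
    branches = []
    for line in git_branch_output.splitlines():
        # The current branch is marked with a leading '*'.
        name = line.replace("*", "").strip()
        # Skip blank lines and the "(HEAD detached at ...)" pseudo-entry.
        if not name or name.startswith("("):
            continue
        if name.startswith("origin/"):
            name = name[len("origin/"):]
        branches.append(name)
    return branches
```

For example, `parse_branches(os.popen("git branch").read())` could replace the loop that fills `branches_available`.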

In [8]:
checking_progress_only = new_branch.value == ""
if checking_progress_only:
    !git checkout {existing_branch.value}
    !git pull
else:
    !git checkout -b '{new_branch.value}'

M	README.md
M	buildings-example/data/.gitignore
M	buildings-example/datasets.py
M	openmapflow/all_features.py
M	openmapflow/config.py
M	openmapflow/generate.py
D	openmapflow/github_workflows/openmapflow-deploy.yml
D	openmapflow/github_workflows/openmapflow-test.yml
M	setup.py
Already on 'uganda-buildings'
There is no tracking information for the current branch.
Please specify which branch you want to merge with.
See git-pull(1) for details.

    git pull <remote> <branch>

If you wish to set tracking information for this branch you can do so with:

    git branch --set-upstream-to=origin/<branch> uganda-buildings



# 2. Download latest data
Data is stored in remote storage (e.g. Google Drive), so authentication is necessary.

In [8]:
if not checking_progress_only:
    for p in tqdm([DataPaths.MODELS, DataPaths.PROCESSED_LABELS, DataPaths.COMPRESSED_FEATURES]):
        !dvc pull {p} -q

    !tar -xzf {DataPaths.COMPRESSED_FEATURES} -C data

  0%|          | 0/3 [00:00<?, ?it/s]


# 3. Upload labels

In [9]:
if checking_progress_only:
    print("Checking progress only, skipping this cell.")
else:
    dataset_name = input("Dataset name (suggested format: <Country_Region_Year>): ")
    while True:
        dataset_dir = PROJECT_ROOT / DataPaths.RAW_LABELS / dataset_name
        if dataset_dir.exists() and len(list(dataset_dir.iterdir())) > 0:
            dataset_name = input("Dataset name already exists, try a different name: ")
        else:
            dataset_dir.mkdir(exist_ok=True)
            break

    print("--------------------------------------------------")
    print(f"Dataset: {dataset_name} directory created")
    print("---------------------------------------------------")
    
    if IN_COLAB:
        uploaded = files.upload()

        for file_name in uploaded.keys():
            Path(file_name).rename(dataset_dir / file_name)
    else:
        print(f"Please add file(s) into {dataset_dir}")

Checking progress only, skipping this cell.
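
The upload cell treats a dataset name as available when its directory is either missing or empty. That rule can be sketched as a small predicate; `dataset_dir_is_free` is a hypothetical name, not an openmapflow function.

```python
from pathlib import Path


def dataset_dir_is_free(dataset_dir: Path) -> bool:
    """Return True if the directory does not exist yet or contains no entries."""
    # any() stops at the first entry, so large directories are not fully listed.
    return not dataset_dir.exists() or not any(dataset_dir.iterdir())
```

One design note, stated as an assumption about the project layout: if the raw labels directory itself might not exist yet, calling `dataset_dir.mkdir(parents=True, exist_ok=True)` would avoid a `FileNotFoundError` that `mkdir(exist_ok=True)` alone would raise.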


# 4. Create features
<img src="https://storage.googleapis.com/harvest-public-assets/openmapflow/new_data.png"/>

In [10]:
if checking_progress_only:
    print("Checking progress only, skipping this cell.")
else:
    user_confirmation = input(
        "Open datasets.py and add a `LabeledDataset` object representing the labels just added.\n"+
        "Added `LabeledDataset` y/[n]: "
    )
    if user_confirmation.lower() != "y":
        print("New features can only be created when a `LabeledDataset` object is added.")

Checking progress only, skipping this cell.


In [11]:
# TODO figure out public bucket permissions
if IN_COLAB:
    colab_gee_gcloud_login()
else:
    !earthengine authenticate

Fetching credentials using gcloud
Your browser has been opened to visit:

    https://accounts.google.com/o/oauth2/auth?response_type=code&client_id=517222506229-vsmmajv00ul0bs7p89v5m89qs8eb9359.apps.googleusercontent.com&redirect_uri=http%3A%2F%2Flocalhost%3A8085%2F&scope=https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fearthengine+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Fdevstorage.full_control+https%3A%2F%2Fwww.googleapis.com%2Fauth%2Faccounts.reauth&state=9YwTYNu3SxJPPX3FIZCIGDAfOn1cMe&access_type=offline&code_challenge=iyeOyJxPbI1Lp7wZMjZR9YPNK8ZNBEYIaV9yu5aDGnk&code_challenge_method=S256


Credentials saved to file: [/Users/izvonkov/.config/gcloud/application_default_credentials.json]

These credentials will be used by any library that requests Application Default Credentials (ADC).

Successfully saved authorization token.


`openmapflow-create-features` creates features from labels and earth observation data referenced in datasets.py.

It first checks whether the necessary Earth observation data is already available in Cloud Storage, or whether an Earth Engine task for it is already running, so both Google Cloud and Earth Engine authentication are needed.

In [15]:
!openmapflow-create-features

------------------------------
Uganda_buildings_2020
Loading tifs already on Google Cloud: 89535it [00:14, 6012.49it/s]
Generating BBoxes from paths: 100%|████| 89535/89535 [00:00<00:00, 90996.32it/s]
Matching labels to tif paths: 100%|████████| 7463/7463 [00:36<00:00, 205.75it/s]
7328 labels not matched
Loading Earth Engine tasks: 100%|██████| 3578/3578 [00:00<00:00, 2782721.99it/s]
No explicit export_identifier in labels. One will be constructed during export
Exporting 7328 labels
7328it [00:01, 5515.86it/s]
Exporting:: 100%|███████████████████████████| 7328/7328 [02:24<00:00, 50.65it/s]
  mean_per_band = np.nanmean(array, axis=0)
Creating pickled instances: 100%|█████████████| 135/135 [02:26<00:00,  1.09s/it]
------------------------------
Loading all features...
✔ F

In [16]:
!cat {DataPaths.DATASETS}

DATASET REPORT (autogenerated, do not edit directly)

Uganda_buildings_2020 (Timesteps: 24)
----------------------------------------------------------------------------
✖ training: 6449 labels, but 610 features
✖ testing: 848 labels, but 92 features
✖ validation: 824 labels, but 91 features


All data:
✔ Found no empty features
✔ No duplicates found
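
In the report above, each subset line compares the number of labels with the number of features created so far, and a ✖ marks a gap. A minimal sketch for pulling those gaps out of the report text follows. It assumes only the `<subset>: N labels, but M features` line shape visible above; the format of lines where the counts match is not shown here, so such lines are simply ignored. The name `find_mismatches` is hypothetical.

```python
import re
from typing import Dict, Tuple

# One status marker, the subset name, then the two counts.
REPORT_LINE = re.compile(r"^\s*\S\s+(\w+): (\d+) labels, but (\d+) features")


def find_mismatches(report: str) -> Dict[str, Tuple[int, int]]:
    """Map each subset name to (labels, features) where the counts differ."""
    mismatches = {}
    for line in report.splitlines():
        match = REPORT_LINE.match(line)
        if match is None:
            continue
        labels, features = int(match.group(2)), int(match.group(3))
        if labels != features:
            mismatches[match.group(1)] = (labels, features)
    return mismatches
```

Applied to the report above, this would flag all three subsets, which suggests that feature creation was still incomplete when the report was generated, for example because exports were still in progress.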

In [14]:
!git diff {DataPaths.DATASETS}

# 5. Push the new data to the repository

In [17]:
# Pushing to remote storage
for p in tqdm([DataPaths.RAW_LABELS, DataPaths.PROCESSED_LABELS, DataPaths.COMPRESSED_FEATURES]):
    !dvc commit {p} -f -q
!dvc push

  0%|          | 0/3 [00:00<?, ?it/s]

  0% Transferring|                                   |0/5 [00:00<?,     ?file/s]
 20% Transferring|██████▏                        |1/5 [00:04<00:18,  4.62s

In [19]:
# Pushing reference to GitHub
commit_message = input("Commit message: ")
!git add .
!git commit -m '{commit_message}'
!git push 

Commit message: New buildings data
[uganda-buildings e5e5a49] New buildings data
 5 files changed, 44 insertions(+), 35 deletions(-)
 create mode 100644 buildings-example/data/datasets.txt
 rewrite buildings-example/datasets.py (80%)
fatal: The current branch uganda-buildings has no upstream branch.
To push the current branch and set the remote as upstream, use

    git push --set-upstream origin uganda-buildings



In [20]:
!git push --set-upstream origin uganda-buildings

Enumerating objects: 69, done.
Counting objects: 100% (69/69), done.
Delta compression using up to 8 threads
Compressing objects: 100% (52/52), done.
Writing objects: 100% (52/52), 7.50 KiB | 1.50 MiB/s, done.
Total 52 (delta 28), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (28/28), completed with 11 local objects.
remote: 
remote: Create a pull request for 'uganda-buildings' on GitHub by visiting:
remote:      https://github.com/nasaharvest/openmapflow/pull/new/uganda-buildings
remote: 
To github.com:nasaharvest/openmapflow.git
 * [new branch]      uganda-buildings -> uganda-buildings
Branch 'uganda-buildings' set up to track remote branch 'uganda-buildings' from 'origin'.
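
The first `git push` above failed because a freshly created branch has no upstream, and a commit message containing an apostrophe would break the single-quoted `git commit -m '...'` call. A minimal sketch that builds the three commands with shell quoting and an explicit upstream follows; `commit_and_push_commands` is a hypothetical helper, and `origin` as the remote name is an assumption taken from the output above.

```python
import shlex
from typing import List


def commit_and_push_commands(message: str, branch: str, remote: str = "origin") -> List[str]:
    """Build shell commands that commit everything and push with an upstream set."""
    return [
        "git add .",
        # shlex.quote keeps messages with spaces or apostrophes as one argument.
        f"git commit -m {shlex.quote(message)}",
        # --set-upstream avoids the "no upstream branch" error on new branches.
        f"git push --set-upstream {shlex.quote(remote)} {shlex.quote(branch)}",
    ]
```

In a notebook cell, each returned string could be run with `!{command}`.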


Create a Pull Request so the data can be merged into the main branch.