# KG-Hub: Getting Started (from scratch)

This notebook serves as a walkthrough for creating a KG-Hub project, building a new graph, and using the graph for machine learning. Some familiarity with the command line, GitHub, and Python will be helpful.

## Planning and setup

This notebook lives in the KG-Hub project template repository, so you have likely found the template already.
If you found this notebook somewhere else, the template repository is https://github.com/Knowledge-Graph-Hub/kg-dtm-template.

First, create a name for your project repository. In KG-Hub, most projects include **KG** somewhere, e.g., "KG-Squid" or "KG-Mimivirus-Proteomics".

Make a new repository for your project, based on the template, in one of the following two ways:
* [Follow the browser-based directions here.](https://docs.github.com/en/repositories/creating-and-managing-repositories/creating-a-repository-from-a-template)
* Use the GitHub command line interface. This may be a preferable option if you're using a Windows command line and don't want to use a browser interface. Follow the [install instructions here](https://cli.github.com/manual/installation) as needed, then run:


In [None]:
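# Use the %cd magic here: `!cd` runs in a temporary subshell, so the directory change would not persist.
%cd ~/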
!gh repo create kg-project-name --public --clone --template https://github.com/Knowledge-Graph-Hub/kg-dtm-template

Next, select a name for your project. It should resemble your repository name, though any dashes will need to be changed to underscores.

Change the values below to your repository name and project name so they can be used later in the walkthrough.

In [3]:
kg_repo_name = "kg-project-name"
kg_project_name = "kg_project_name"
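
If you would rather not type the name twice, the project name can be derived from the repository name. This is a minimal sketch of the dash-to-underscore rule described above; the helper name `repo_to_project_name` is hypothetical and is not part of the template.

```python
def repo_to_project_name(repo_name: str) -> str:
    """Return the repository name with every dash replaced by an underscore."""
    return repo_name.replace("-", "_")


kg_repo_name = "kg-project-name"
kg_project_name = repo_to_project_name(kg_repo_name)
print(kg_project_name)
```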

Define some additional details so they may be used later as well.

In [4]:
description = '' # A short description of the project
long_description = '' # A slightly less short description of the project
gh_name = '' # Your GitHub user name, assuming that you created the project repository in your own account.
author_name = '' # Your name
author_email = '' # Your email address

## Walking through the KG project

Each KG project in KG-Hub is generally structured like this (with some omissions for clarity):
```
📦kg-project-name
 ┣ 📂kg_project_name
 ┃ ┣ 📂merge_utils
 ┃ ┃ ┗ 📜merge_kg.py - this produces the final, merged KG
 ┃ ┣ 📂transform_utils - data source-specific transformation functions
 ┃ ┃ ┣ 📂transform_one
 ┃ ┃ ┃ ┗ 📜transform_one.py
 ┃ ┃ ┣ 📂transform_two
 ┃ ┃ ┃ ┗ 📜transform_two.py
 ┃ ┃ ┗ 📜transform.py - sets defaults for transform outputs
 ┃ ┣ 📂utils - utilities and helper functions
 ┃ ┃ ┣ 📜download_utils.py
 ┃ ┃ ┣ 📜robot_utils.py - utilities for working with the ROBOT tool
 ┃ ┃ ┗ 📜transform_utils.py
 ┃ ┣ 📜download.py 
 ┃ ┗ 📜transform.py - sets up the individual transformations
 ┣ 📂tests
 ┃ ┣ 📂resources
 ┃ ┃ ┣ files required to run the tests
 ┃ ┣ various tests
 ┣ 📜LICENSE.txt
 ┣ 📜README.md - modify as needed
 ┣ 📜download.yaml - the download configuration file
 ┣ 📜merge.yaml - the merge configuration file
 ┣ 📜requirements.txt - empty by default, but add any new requirements here
 ┣ 📜run.py - the main interface for downloading, transforming, and merging
 ┗ 📜setup.py
```

The general process of *defining* how to assemble a KG looks like this:
1. Add data sources to `download.yaml`.
2. Add a new transform to `transform.py` and in the `transform_utils` directory to handle the new data source. If the data source is already a set of KGX TSV node and edge lists, then it may only require a 'passthrough' transform (i.e., files aren't modified but may be validated and moved).
3. Modify `merge.yaml` to include the new sources.
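
To make the 'passthrough' idea in step 2 concrete, here is a rough sketch of a function that checks a TSV header and then copies the file unchanged. It is not the template's own transform class: the function name `passthrough`, the constants, and the required-column sets are assumptions for illustration (KGX node files are assumed to need at least `id` and `category`, and edge files `subject`, `predicate`, and `object`).

```python
import csv
import shutil
from pathlib import Path

# Assumed minimal KGX TSV columns; consult the KGX documentation for the full set.
NODE_COLUMNS = {"id", "category"}
EDGE_COLUMNS = {"subject", "predicate", "object"}


def passthrough(source: Path, output_dir: Path, required: set) -> Path:
    """Check that a TSV header has the required columns, then copy the file unchanged."""
    with open(source, newline="") as handle:
        # Read only the first row; an empty file yields an empty header.
        header = next(csv.reader(handle, delimiter="\t"), [])
    missing = required - set(header)
    if missing:
        raise ValueError(f"{source.name} is missing columns: {sorted(missing)}")
    output_dir.mkdir(parents=True, exist_ok=True)
    # The file is copied byte for byte; nothing is rewritten.
    return Path(shutil.copy(source, output_dir / source.name))
```

A call might look like `passthrough(Path("source_nodes.tsv"), Path("transformed/source"), NODE_COLUMNS)`, where both paths are placeholders for wherever your project keeps its raw and transformed files.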

The general process of *assembling* the KG is to run the download, transform, and merge steps in that order, as in the cells below. Even if you haven't changed much in the new project yet, these commands will still retrieve several files, transform them, and merge them into a graph.

In [2]:
!python run.py download

Downloading files: 100%|██████████████████████████| 6/6 [00:03<00:00,  1.58it/s]


In [None]:
!python run.py transform

In [None]:
!python run.py merge
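
The three commands above can also be chained from Python so that a failed stage stops the run. The sketch below is one possible way to do that and is not part of the template: `run_stages` and `STAGES` are hypothetical names, and it assumes the working directory is the repository root, where `run.py` lives.

```python
import subprocess
import sys

# The stage names match the run.py subcommands used in this walkthrough.
STAGES = ("download", "transform", "merge")


def run_stages(stages=STAGES, runner=subprocess.run):
    """Run each run.py stage in order, stopping at the first failure."""
    completed = []
    for stage in stages:
        # check=True raises CalledProcessError if the stage exits with a non-zero status.
        runner([sys.executable, "run.py", stage], check=True)
        completed.append(stage)
    return completed
```

Calling `run_stages()` runs all three stages; passing a shorter tuple such as `("transform", "merge")` skips the download when the raw files are already in place.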

## Customizing your new KG project

Before going any further, open the project in your favorite development environment. You will need to replace all instances of `project_name` with your project's name.
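
If your editor has no project-wide find and replace, the substitution can be scripted. The following is a minimal sketch under stated assumptions: the function name `replace_placeholder` and the list of file suffixes are hypothetical, only file contents are rewritten (directories or files named after the placeholder still need to be renamed by hand), and anything under `.git` is skipped.

```python
from pathlib import Path

# Assumed set of text file types worth rewriting; adjust for your project.
TEXT_SUFFIXES = {".py", ".md", ".yaml", ".yml", ".txt", ".cfg", ".toml"}


def replace_placeholder(root: Path, old: str, new: str) -> list:
    """Replace `old` with `new` in text files under `root`; return the changed paths."""
    changed = []
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the Git metadata directory.
        if not path.is_file() or ".git" in path.relative_to(root).parts:
            continue
        if path.suffix not in TEXT_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8")
        if old in text:
            path.write_text(text.replace(old, new), encoding="utf-8")
            changed.append(path)
    return changed
```

Committing before running something like this makes it easy to review the result with `git diff` and to undo it if the replacement touched more than intended.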

In [5]:
print(kg_project_name)

kg_project_name


## KGX format basics and the KGX config files

## Biolink basics

## Components of the KG assembly pipeline (ETL, mostly)

## How to write transforms for new sources (including w/ Koza)

## How and where to store results

## Jenkins builds

## Loading graphs with GraPE

## Embeddings and basic ML approaches w/ NEAT