<a href="https://colab.research.google.com/github/ImagingDataCommons/IDC-Tutorials/blob/v14_whc/notebooks/getting_started/part1_prerequisites.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Getting started with IDC - Part 1: Setting up the prerequisites

## Summary

This notebook is part of [the "Getting started with IDC" series](https://github.com/ImagingDataCommons/IDC-Examples/tree/master/notebooks/getting_started), which introduces NCI Imaging Data Commons to users who want to interact with IDC programmatically.

In this notebook you will learn whether you need a Google account, and, if needed, how to set up your account.

Initial version: Nov 2022

Updated: May 2023

# What is IDC?

[NCI Imaging Data Commons (IDC)](https://datacommons.cancer.gov/repository/imaging-data-commons) is a cloud-based repository of publicly available cancer imaging data co-located with analysis and exploration tools and resources. IDC is a node within the broader NCI Cancer Research Data Commons (CRDC) infrastructure that provides secure access to a large, comprehensive, and expanding collection of cancer research data.

## What is an "IDC collection"?

IDC contains images and image-derived data (i.e., annotations and analysis results) from a variety of repositories. Those images are broadly organized by programs, which correspond to the various data collection initiatives. Programs consist of collections, which group data collected by a specific entity for a specific application. Collections include both original data that was collected by the contributing entity, and the image-derived data that might have been generated by other contributors, extending the original collection.
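
The hierarchy described above can be pictured as a small nested data structure. The sketch below is only an illustration: the class names and the program and collection identifiers are hypothetical placeholders, not actual IDC content or an IDC API.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    """A group of data collected by one entity for one application."""
    collection_id: str
    # Image-derived data (annotations, analysis results) that may have been
    # contributed later by others, extending the original collection.
    derived_data: list[str] = field(default_factory=list)


@dataclass
class Program:
    """A data collection initiative that groups related collections."""
    name: str
    collections: list[Collection] = field(default_factory=list)


# Hypothetical placeholder names, used only to show the nesting.
example_program = Program(
    name="example_program",
    collections=[
        Collection("example_collection_a", derived_data=["segmentations"]),
        Collection("example_collection_b"),
    ],
)

collection_ids = [c.collection_id for c in example_program.collections]
print(collection_ids)
```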

## What are Google Cloud Platform (GCP) and Amazon Web Services (AWS)?

GCP and AWS are cloud-based environments that provide access to a suite of tools and services, including compute, storage and database resources, to name a few. IDC is built upon services provided by both platforms.

In particular, IDC data is maintained in both Google Cloud Storage (GCS) and AWS S3. Egress of IDC data out of each cloud is free, whether to a VM or to a local machine.

IDC metadata is searched using GCP BigQuery (BQ). IDC maintains comprehensive metadata tables in BQ, which can be queried using standard SQL. Limited use of BQ is free, as explained below, but more extensive use will require a Google account.
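
To give a feel for the kind of SQL involved, here is a minimal local sketch that uses Python's built-in SQLite module as a stand-in for BigQuery. The toy table and its rows are made up for illustration; only the `collection_id` column name is taken from this notebook, and `SeriesInstanceUID` is assumed here as a second column. SQLite's dialect differs from BigQuery's in places, so treat this as a rough analogy rather than a BigQuery example.

```python
import sqlite3

# In-memory toy database standing in for a metadata table (made-up rows).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE toy_metadata (collection_id TEXT, SeriesInstanceUID TEXT)")
conn.executemany(
    "INSERT INTO toy_metadata VALUES (?, ?)",
    [
        ("collection_a", "1.2.3.1"),
        ("collection_a", "1.2.3.2"),
        ("collection_b", "1.2.3.3"),
    ],
)

# Count the distinct series in each collection.
rows = conn.execute(
    """
    SELECT collection_id, COUNT(DISTINCT SeriesInstanceUID) AS series_cnt
    FROM toy_metadata
    GROUP BY collection_id
    ORDER BY collection_id
    """
).fetchall()
conn.close()

for collection_id, series_cnt in rows:
    print(collection_id, series_cnt)
```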

## GCP is not a free service - do I need to pay to use IDC?

**NO!**

As mentioned above, egress of IDC data out of the cloud is free. Moreover, while BQ queries are, in general, not free, the GCP [BigQuery free tier](https://cloud.google.com/bigquery/pricing#free-tier) includes 1 TB of query data per month, which we believe will be sufficient to complete this notebook series and to cover many users' general querying needs.
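
As a rough back-of-the-envelope check, the sketch below estimates how many queries of a given size fit into the monthly free allowance. It assumes that 1 TB is counted as 2**40 bytes and that a query is charged for exactly the bytes it scans; the 5 GB per-query figure is an arbitrary example, not a measured IDC number. The pricing page linked above remains the authoritative source.

```python
FREE_TIER_BYTES_PER_MONTH = 2**40  # assumption: 1 TB counted as 2**40 bytes


def queries_within_free_tier(bytes_per_query: int) -> int:
    """Return how many queries of the given size fit in the monthly free tier."""
    if bytes_per_query <= 0:
        raise ValueError("bytes_per_query must be positive")
    return FREE_TIER_BYTES_PER_MONTH // bytes_per_query


# Arbitrary example: each query scans 5 GB (5 * 2**30 bytes).
example_query_bytes = 5 * 2**30
print(queries_within_free_tier(example_query_bytes))
```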

None of the activities in this tutorial series will require you to pay for use of any GCP services, to have cloud credits, or even to connect your credit card to your account.


## What do I need to get started?

All you need is a Google account (Google identity) and a web browser. If you don't have a Google account, you can create one [here](https://accounts.google.com/signup/v2/webcreateaccount?dsh=308321458437252901&continue=https%3A%2F%2Faccounts.google.com%2FManageAccount&flowName=GlifWebSignIn&flowEntry=SignUp#FirstName=&LastName=). Note that you do NOT need a Gmail email account - [you can use your non-Gmail email address to create one instead](https://support.google.com/accounts/answer/27441?hl=en#existingemail).

<font color='red'>**WARNING**</font>: if you have a Google account that was provided by your organization, it may not be suitable for this tutorial if the organization managing your account has restrictions in place related to GCP! If you experience issues using your organization account, please switch to a personal one (you can create one just for the purposes of this tutorial, if you prefer).

# Activate GCP for your account and create a GCP project

You do not need any special permissions or credits to create a project and use it in the subsequent parts of this tutorial series. You will **not** need cloud credits or a credit card to search or download IDC data! Please follow the steps below to create a project.

1. Go to https://console.cloud.google.com/ and accept the terms and conditions.

<img src="https://www.dropbox.com/s/d570wqaqt72zzaz/agreed.png?raw=1" alt="agree" width="400"/>

2. In the upper left corner of the GCP console click "Select a project"

<img src="https://www.dropbox.com/s/hzty1pgfq6ll7hy/select.png?raw=1" alt="select" width="400"/>

3. In the project selector click "Create new project". If you already have a project, you may be able to reuse it for this tutorial.

<img src="https://www.dropbox.com/s/ybhdloqsjnffdb1/new.png?raw=1" alt="new" width="400"/>

4. Open the GCP console menu by clicking the ☰ menu icon in the upper left corner, and select "Dashboard". You will see information about your project, including your Project ID. Enter that project ID as the value of `my_ProjectID` in the cell below. Running the cell will also prompt you to give Colab permission to act on your behalf.

In [None]:
#@title Enter your Project ID
# Set this variable to your Google Cloud project ID.
my_ProjectID = "" #@param {type:"string"}

import os
# Store the project ID in an environment variable.
os.environ["GCP_PROJECT_ID"] = my_ProjectID

from google.colab import auth
# Prompt for permission so that Colab can act on your behalf.
auth.authenticate_user()
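
Before moving on, it can help to catch an empty or mistyped project ID early. The sketch below is a hypothetical helper, not part of the original notebook. It checks a value against the documented Google Cloud project ID format (6 to 30 characters; lowercase letters, digits and hyphens; starting with a letter and not ending with a hyphen). It checks the format only, not whether the project exists or whether you have access to it.

```python
import re

# 6-30 characters: starts with a lowercase letter, ends with a letter or digit,
# and contains only lowercase letters, digits, and hyphens in between.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def looks_like_project_id(project_id: str) -> bool:
    """Return True if project_id matches the GCP project ID format."""
    return _PROJECT_ID_PATTERN.fullmatch(project_id) is not None


# "my-idc-project-123" is a made-up ID used only for illustration.
for candidate in ["my-idc-project-123", "", "Invalid_ID"]:
    print(repr(candidate), looks_like_project_id(candidate))
```

An empty string, which is the default value of `my_ProjectID` in the cell above, fails this check.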

## Locate and add `bigquery-public-data` project

`bigquery-public-data` is a public project that contains BigQuery tables with IDC metadata (we will work with those in part 2 of this series). To navigate those metadata tables, you need to manually add this project to your workspace.

1. Open the BigQuery console: https://console.cloud.google.com/bigquery, and click the `+ ADD DATA` button.

<img src="https://www.dropbox.com/s/cg99cyn1uzigw7s/add_data.png?raw=1" alt="add data" width="400"/>

2. Choose the "Star a project" option from the list.

<img src="https://www.dropbox.com/s/6688galhthr5vsn/star_a_project.png?raw=1" alt="star a project" width="400"/>

3. Type `bigquery-public-data` as the project name and click the `STAR` button.

<img src="https://www.dropbox.com/s/nzh7aybkre138g1/star.png?raw=1" alt="star" width="400"/>

In a few moments, the `bigquery-public-data` project should appear in the list on the left-hand side of the BigQuery console.

<img src="https://www.dropbox.com/s/s2f6vpolbimnyb8/bqpd_added.png?raw=1" alt="starred" width="400"/>

## Check the setup

Finally, let's run a query to confirm that the setup is working for your account.

In [18]:
%%bigquery --project=$my_ProjectID

SELECT COUNT(DISTINCT(collection_id)) as collections_cnt
FROM bigquery-public-data.idc_current.dicom_all



If the cell above completed without errors, you have completed the prerequisites and can proceed to the next tutorial in the series. Keep your project ID handy - you will need it.
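
The same check can also be run from plain Python instead of the `%%bigquery` cell magic. The sketch below assumes that the `google-cloud-bigquery` package is installed and that you are already authenticated; the helper names `build_check_query` and `count_collections` are hypothetical. The import sits inside the function so that the query text can be inspected even without the package.

```python
def build_check_query(table: str = "bigquery-public-data.idc_current.dicom_all") -> str:
    """Build the collection-count query, quoting the table name with backticks."""
    return (
        "SELECT COUNT(DISTINCT collection_id) AS collections_cnt\n"
        f"FROM `{table}`"
    )


def count_collections(project_id: str) -> int:
    """Run the check query under project_id and return the collection count."""
    # Imported lazily: requires the google-cloud-bigquery package.
    from google.cloud import bigquery

    client = bigquery.Client(project=project_id)
    rows = list(client.query(build_check_query()).result())
    return rows[0]["collections_cnt"]


print(build_check_query())
```

Calling `count_collections(my_ProjectID)` in an authenticated session should return the same single number as the query cell above.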

## Support

You can contact IDC support by sending an email to support@canceridc.dev or by posting your question on the [IDC User forum](https://discourse.canceridc.dev).

## Acknowledgments

Imaging Data Commons has been funded in whole or in part with Federal funds from the National Cancer Institute, National Institutes of Health, under Task Order No. HHSN26110071 under Contract No. HHSN261201500003I.

If you use IDC in your research, please cite the following publication:

> Fedorov, A., Longabaugh, W. J. R., Pot, D., Clunie, D. A., Pieper, S., Aerts, H. J. W. L., Homeyer, A., Lewis, R., Akbarzadeh, A., Bontempi, D., Clifford, W., Herrmann, M. D., Höfener, H., Octaviano, I., Osborne, C., Paquette, S., Petts, J., Punzo, D., Reyes, M., Schacherer, D. P., Tian, M., White, G., Ziegler, E., Shmulevich, I., Pihl, T., Wagner, U., Farahani, K. & Kikinis, R. NCI Imaging Data Commons. Cancer Res. 81, 4188–4193 (2021). http://dx.doi.org/10.1158/0008-5472.CAN-21-0950