# First Steps with *pyopencga*, the Python client of OpenCGA
------
# 1. Overview

This notebook provides guidance for getting started with the *pyopencga* library, the Python client of OpenCGA.

We assume that your workstation (Linux, Mac, or Windows) is connected to the internet and has Python 3 and the *pip* package manager installed. We then show you how to:

- Install *pyopencga*.
- Connect to an OpenCGA instance.
- Issue OpenCGA requests and work with responses.
- Launch asynchronous jobs and retrieve results.


Walk-through guides of some **common use cases** are provided in two further notebooks:<BR>
- ADD LINK TO NOTEBOOK-02
- ADD LINK TO NOTEBOOK-03
 
For reference, the methods implemented by *pyopencga* are listed here:
- https://docs.google.com/spreadsheets/d/1QpU9yl3UTneqwRqFX_WAqCiCfZBk5eU-4E3K-WVvuoc

The OpenCGA web service endpoints used by *pyopencga* are listed here:
- https://ws.opencb.org/opencga-prod/webservices



# 2. Installing and importing the *pyopencga* library

There are two main options for installing *pyopencga* in your Python setup: from source code or with the Python *pip* package manager. We recommend the latter.

## 2.1. Install *pyopencga* with *pip*

*pyopencga* is published on PyPI (the Python Package Index), so once it is installed with *pip* you can import it directly.<br>For further documentation refer to https://pypi.org/project/pyopencga/. <br> To install it, open a terminal (optionally inside a Python environment) and run:

`$ pip install pyopencga`
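
To confirm that the installation worked, you can ask Python which version of the package it sees. The snippet below is a minimal sketch that uses only the standard library (`importlib.metadata`, available from Python 3.8); the helper name `installed_version` is an assumption made for this example and is not part of *pyopencga*.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyopencga")
if version is None:
    print("pyopencga is not installed; run: pip install pyopencga")
else:
    print("pyopencga version:", version)
```

If the first message is printed, check that *pip* and the Python interpreter running the notebook belong to the same environment.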

## 2.2. Importing the *pyopencga* library

The recommended way to use *pyopencga* is to import the configuration and client classes:

In [1]:
from pyopencga.opencga_config import ClientConfiguration # import configuration module
from pyopencga.opencga_client import OpencgaClient # import client module
from pprint import pprint
import json


## 2.3. Setup the Client and Login in *pyopencga* 

**Configuration and Credentials** 

You need to provide **at least** a host server URL in the standard configuration format for OpenCGA, either as a Python dictionary or in a JSON file.

Regarding credentials, you can set both the user and the password as variables in the script. If you prefer not to show the password, you will be asked for it interactively, without echo.
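
If you want to keep the password out of the notebook but still pass it explicitly, one option is to read it from an environment variable and fall back to a no-echo prompt. The sketch below uses only the standard library; the variable name `OPENCGA_PASSWORD` and the helper `resolve_password` are assumptions made for this example, not something *pyopencga* defines.

```python
import getpass
import os


def resolve_password(env_var="OPENCGA_PASSWORD", prompt=getpass.getpass):
    """Return the password from the environment, else ask for it without echo.

    OPENCGA_PASSWORD is a hypothetical variable name chosen for this sketch.
    """
    password = os.environ.get(env_var)
    if password:
        return password
    # getpass.getpass reads from the terminal without echoing the typed characters
    return prompt("OpenCGA password: ")


# Example usage (prompts only if the environment variable is not set):
# passwd = resolve_password()
```

The returned value can then be used wherever the `passwd` variable appears in this notebook.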


### 2.3.1 Set variables for server host, user credentials and project owner

In [2]:
# server host
host = 'http://bioinfo.hpc.cam.ac.uk/opencga-prod'

# user credentials
user = "demouser"
passwd = "demouser" ## you can skip this, see below.

# the user demouser accesses projects owned by the user demo
prj_owner = "demo"

### 2.3.2 Creating the ClientConfiguration dictionary for the server connection

In [3]:
# Creating ClientConfiguration dict
host = 'http://bioinfo.hpc.cam.ac.uk/opencga-prod'

config_dict = {"rest": {
                       "host": host 
                    }
               }

print("Config information:\n",config_dict)



Config information:
 {'rest': {'host': 'http://bioinfo.hpc.cam.ac.uk/opencga-prod'}}
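
Since the configuration can also live in a JSON file, the same dictionary can be written to disk and read back with the standard `json` module. This is a minimal sketch: the file name `opencga_client.json` and the helpers `save_config` and `load_config` are assumptions made for this example, and the only structure it checks is the `rest` / `host` nesting shown above.

```python
import json
import os
import tempfile

config_dict = {"rest": {"host": "http://bioinfo.hpc.cam.ac.uk/opencga-prod"}}


def save_config(config, path):
    """Write the configuration dictionary to a JSON file."""
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)


def load_config(path):
    """Read a configuration dictionary and check that it names a host."""
    with open(path) as handle:
        config = json.load(handle)
    if "host" not in config.get("rest", {}):
        raise ValueError("configuration must define rest.host")
    return config


# Hypothetical file name, stored in a temporary directory for this example
path = os.path.join(tempfile.mkdtemp(), "opencga_client.json")
save_config(config_dict, path)
loaded = load_config(path)
print(loaded == config_dict)
```

The loaded dictionary has the same shape as `config_dict`, so it can be used in its place in the following steps.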


### 2.3.3 Initialize the client configuration

Now we need to pass the *config_dict* dictionary to the **ClientConfiguration** class and use the resulting configuration to create the client:

In [4]:
config = ClientConfiguration(config_dict)
oc = OpencgaClient(config)


### 2.3.4 Pass the user credentials to the previously defined *OpencgaClient* instance and log in

We can either pass the password as a variable, or pass only the user and be asked for the password interactively.

In [26]:
# here we put only the user in order to be asked for the password interactively
# oc.login(user)

In [5]:
# or you can pass the user and passwd
oc.login(user, passwd)

#### ✅  Congrats! You are now connected to OpenCGA

# 3. Understanding REST Response

*pyopencga* queries web services that return a RESTResponse object, which can be difficult to interpret. The RESTResponse type provides the data in a form that is less intuitive than a Python list or dictionary. Because of this, we have developed functionality that retrieves the data in a simpler format. 

[OpenCGA Client Libraries](http://docs.opencb.org/display/opencga/Using+OpenCGA), including *pyopencga*, implement a **RESTResponse wrapper** to make it even easier to work with REST web service responses. <br>REST responses include metadata, and OpenCGA 2.0.1 has been designed to work in a federation mode (more information about OpenCGA federations can be found **[here](http://docs.opencb.org/display/opencga/Roadmapg)**).

All of this can make first-time users struggle when they start working with the responses. Please read this brief documentation about **[OpenCGA RESTful Web Services](http://docs.opencb.org/display/opencga/RESTful+Web+Services#RESTfulWebServices-OpenCGA2.x)**.

Let's see a quick example of how to use the RESTResponse wrapper in *pyopencga*. 
You can find some extra information [here](http://docs.opencb.org/display/opencga/Python#Python-WorkingwiththeRestResponse). Let's execute a first simple query to fetch all projects for the user **demouser**, who already logged in at **[step 2.3](#2.3.-Setup-the-Client-and-Login-in-pyopencga)**.

In [28]:
## Let's fetch the available projects.
## First let's get the project client and execute the search() function
project_client = oc.projects
projects = project_client.search()

## Uncomment this line to view the JSON response.
## NOTE: it includes study information so this can be big
##pprint(projects.get_responses())

#### Although you can iterate over the projects in the response by indexing into the raw structure, as the next chunk of code does, this approach is **not recommended**.
The next query illustrates it by iterating over all the projects retrieved by `project_client.search()`:

In [29]:
## Loop through all the different projects 
for project in projects.responses[0]['results']:
   print(project['id'], project['name'])

family Family Studies GRCh37
population Population Studies GRCh38
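
To see why indexing the raw structure is fragile, it helps to look at the nesting on a plain dictionary. The sketch below uses a mock response: only the `responses` list and the `results` key are taken from the loop above, and the helper `extract_results` is an assumption made for this example. Direct indexing raises an error as soon as a level is missing, so a defensive version has to check every step, which is the kind of work the wrapper functions described next take care of.

```python
# Mock of the raw envelope; only the 'responses' -> 'results' nesting comes from the notebook
mock_response = {
    "responses": [
        {
            "results": [
                {"id": "family", "name": "Family Studies GRCh37"},
                {"id": "population", "name": "Population Studies GRCh38"},
            ]
        }
    ]
}


def extract_results(raw, response_pos=0):
    """Return the results list at the given response position, or [] if absent."""
    responses = raw.get("responses") or []
    if not 0 <= response_pos < len(responses):
        return []
    return responses[response_pos].get("results") or []


for project in extract_results(mock_response):
    print(project["id"], project["name"])

# An empty or partial response yields an empty list instead of an IndexError or KeyError
print(extract_results({}))
```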


## RestResponse API

Note: Table with API functions and their descriptions

## Using the `get_results()` function 

Using the functions that *pyopencga* implements for the RestResponse object makes things much easier! <br> Let's dig into an example using the same query as above:

In [30]:
# Loop through all the different projects 
for project in projects.get_results():
   print(project['id'], project['name'])

family Family Studies GRCh37
population Population Studies GRCh38


## Using the `result_iterator()` function to iterate over the Rest results

You can also iterate over the results, which is especially useful when fetching many results from the server:

In [31]:
## Iterate through all the different projects 
for project in projects.result_iterator():
   print(project['id'], project['name'])

family Family Studies GRCh37
population Population Studies GRCh38
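
The general idea behind iterating instead of loading everything at once can be sketched with a plain Python generator that pulls results one page at a time. This is not how `result_iterator()` is implemented; `iterate_pages`, `fetch_page`, `skip`, and `limit` are hypothetical names used only to illustrate the lazy pattern, and the data is a mock.

```python
def iterate_pages(fetch_page, page_size=100):
    """Yield results one by one, requesting a new page only when needed."""
    skip = 0
    while True:
        page = fetch_page(skip=skip, limit=page_size)
        yield from page
        # A short (or empty) page means there is nothing left to fetch
        if len(page) < page_size:
            return
        skip += page_size


# Mock data standing in for a server-side collection
MOCK_RESULTS = [{"id": "p{}".format(i)} for i in range(1, 6)]


def fake_fetch_page(skip, limit):
    """Hypothetical page fetcher backed by the mock list above."""
    return MOCK_RESULTS[skip:skip + limit]


for result in iterate_pages(fake_fetch_page, page_size=2):
    print(result["id"])
```

Because the generator is lazy, a caller that stops after the first few results never requests the remaining pages.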


## Using the `print_results()` function to print the Rest results

**IMPORTANT**: This function can be configured to exclude metadata, change the separator, or even select the fields to print. It then collects all the requested results and prints them directly in the terminal.<br>In this way, the RESTResponse object provides a very powerful custom function to print results 😎

In [32]:
## This function iterates over all the results, it can be configured to exclude metadata, change separator or even select the fields!
projects.print_results()

#Time: 81
#Num matches: -1
#Num results: 2
#Num inserted: 0
#Num updated: 0
#Num deleted: 0
#id	name	uuid	fqn	creationDate	modificationDate	description	organism	currentRelease	studies	internal	attributes
family	Family Studies GRCh37	eba0e1c7-0172-0001-0001-c7af712652b2	demo@family	20200625131808	20200625131808		{'scientificName': 'Homo sapiens', 'commonName': '', 'assembly': 'GRCh37'}	1	.	{'datastores': {}, 'status': {'name': 'READY', 'date': '20200625131808', 'description': ''}}	{}
population	Population Studies GRCh38	25f2842a-0173-0001-0001-e7bcbedc77ff	demo@population	20200706210517	20200706210517	Some population reference studies for GRCh38	{'scientificName': 'Homo sapiens', 'commonName': '', 'assembly': 'GRCh38'}	1	.	{'datastores': {}, 'status': {'name': 'READY', 'date': '20200706210517', 'description': ''}}	{}


##### Now, let's try to customize the output so that only the portion of the data we are interested in gets printed

In [33]:
## Let's exclude metadata and print only a few fields; use dot notation for nested fields
projects.print_results(fields='id,name,organism.scientificName,organism.assembly',metadata=False)
print()


#id	name	organism.scientificName	organism.assembly
family	Family Studies GRCh37	Homo sapiens	GRCh37
population	Population Studies GRCh38	Homo sapiens	GRCh38
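
Dot notation simply names a path through nested dictionaries. The sketch below shows one way such a path could be resolved on a plain dictionary; `get_nested` is a hypothetical helper written for this example, not the function that `print_results()` uses internally, and the record is copied from the output above.

```python
def get_nested(record, dotted_path, default=""):
    """Follow a dot-separated path through nested dicts; return default if a key is missing."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return default
        value = value[key]
    return value


project = {
    "id": "family",
    "name": "Family Studies GRCh37",
    "organism": {"scientificName": "Homo sapiens", "commonName": "", "assembly": "GRCh37"},
}

fields = "id,name,organism.scientificName,organism.assembly".split(",")
print("\t".join(str(get_nested(project, field)) for field in fields))
```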



##### A very useful parameter is the *separator*.<br>It lets the user decide the format in which the data is printed. For example, it's possible to print in a CSV-like style:

In [34]:
## You can change separator

print('Print the projects with a header and a different separator:\n')
projects.print_results(fields='id,name,organism.scientificName,organism.assembly', separator=',', metadata=False)


Print the projects with a header and a different separator:

#id,name,organism.scientificName,organism.assembly
family,Family Studies GRCh37,Homo sapiens,GRCh37
population,Population Studies GRCh38,Homo sapiens,GRCh38
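
The output above is CSV-like rather than strict CSV: the notebook does not show what happens when a value itself contains the separator. If you need a file that other tools can parse reliably, one option is to pass the result dictionaries through Python's standard `csv` module, which quotes such values. This is a minimal sketch on mock records; `records_to_csv` is a hypothetical helper, and the comma inside the second project name was added on purpose to show the quoting.

```python
import csv
import io


def records_to_csv(records, columns):
    """Serialise a list of dicts to CSV text, quoting values where needed."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Mock records; the comma in the second name is deliberate
projects = [
    {"id": "family", "name": "Family Studies GRCh37", "assembly": "GRCh37"},
    {"id": "population", "name": "Population Studies, GRCh38", "assembly": "GRCh38"},
]

csv_text = records_to_csv(projects, ["id", "name", "assembly"])
print(csv_text)
```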


# 4. Working with JOBS

OpenCGA implements a number of analyses and operations that are executed as jobs.

Note: Describe briefly how Jobs work and point to docs

## Job Info
Describe job information, ...


## Executing Jobs
Lifecycle, status, ...

### Example

In [3]:
## Execute GWAS analysis
rest_response = oc.variant().gwas()

## wait for the job to finish
oc.wait_for_job(rest_response)

rest_response.print_results()

NameError: name 'oc' is not defined
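
Waiting for an asynchronous job usually comes down to polling its status until it reaches a final state. The sketch below illustrates that pattern in plain Python and is not the implementation of `oc.wait_for_job`; the status names, the helper `wait_until_finished`, and the `fetch_status` callable are all assumptions made for this example.

```python
import time

# Assumed set of final states; check the OpenCGA documentation for the real ones
TERMINAL_STATUSES = {"DONE", "ERROR", "ABORTED"}


def wait_until_finished(fetch_status, retry_seconds=10, max_attempts=60, sleep=time.sleep):
    """Poll fetch_status() until it returns a terminal status or attempts run out."""
    for attempt in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        # Do not sleep after the last attempt
        if attempt < max_attempts - 1:
            sleep(retry_seconds)
    raise TimeoutError("job did not finish after {} attempts".format(max_attempts))


# Simulated job that is pending, then running, then done (no real waiting)
simulated = iter(["PENDING", "RUNNING", "DONE"])
final_status = wait_until_finished(lambda: next(simulated), sleep=lambda seconds: None)
print(final_status)
```

In a real session, `fetch_status` would query the server for the job's current state, and results would be read only once a successful final state is returned.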