# Use version control (Git), and create a script to automatically visualize my favorite database
Author: Nikolaj Pagh Kristensen

Background: My two favorite databases, which I often use in my project, contain information on T-cell targets and T-cell receptor (TCR) sequences. These are IEDB (http://www.iedb.org/) and VDJdb (https://vdjdb.cdr3.net/, named after the V, D and J segments of the TCR). I wish to create a script that automatically imports a subset of the database and calculates summary statistics for my favorite groups of T-cells. Furthermore, I wish to visualize summary statistics of the database in its current form.

To break it up into subgoals:
1. Create a GitHub repository
1. Create the necessary initial files according to public guides (https://kbroman.org/github_tutorial/pages/init.html)
1. Look for an API to download IEDB data with (if possible). Download only data relevant for CD8 T cells. 
1. What epitope length is cataloged for each HLA?
1. What viral antigens are frequently targeted by T-cells?
1. What viruses does the database contain information on, and how many epitopes?

Extra things:
1. Do something similar but with VDJdb
2. Perform sequence analysis of epitopes. Generate clusters of epitopes based on sequence similarity.

# Start a new github repository and start keeping track of version tags
From the terminal I will execute the following,
1. Make new dirs (Python_projects and Python_projects/Course_work),
2. Initialize git within the Python_course_report folder: `git init`,
3. Generate a README file (Markdown) as well as a .gitignore: `touch .gitignore`,
1. Use `git add` to add files to the staging area
1. Use `git commit -m` to commit the staging area together with a required message

I double-checked my commit by using `git log`.

Version numbering is a good practice: https://stackoverflow.com/questions/37814286/how-to-manage-the-version-number-in-git
    The general nomenclature is: `[major].[minor].[patch]-[build/beta/rc]`

1. Create the initial tag: `git tag -a v0.1.0 -m "Initial version"`
2. Inspect tags using `git describe`, which reports the most recent tag reachable from the current commit (`git tag` lists all tags in the repo)
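
The `[major].[minor].[patch]-[build/beta/rc]` nomenclature can be checked mechanically before a tag is created. The snippet below is a minimal sketch: the regular expression and the helper name `parse_version` are my own assumptions for illustration, not part of git.

```python
import re

# Pattern for the [major].[minor].[patch]-[build/beta/rc] nomenclature.
# The leading "v" and the trailing "-suffix" are both optional.
VERSION_PATTERN = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def parse_version(tag):
    """Split a tag such as 'v0.1.0' into (major, minor, patch, suffix)."""
    match = VERSION_PATTERN.match(tag)
    if match is None:
        raise ValueError(f"Not a valid version tag: {tag!r}")
    major, minor, patch, suffix = match.groups()
    return int(major), int(minor), int(patch), suffix


print(parse_version("v0.1.0"))  # the tag used in this repository
```

A tag that does not follow the scheme (for example `v0.1`) raises a `ValueError`, which makes typos visible before they end up in the repository history.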


# Scrape IEDB for data
IEDB released an API to collect information directly on the command line or in your Jupyter notebook. The API is built on the PostgREST platform (https://postgrest.org/en/stable/), which has a list of functions and operators that are useful to know (https://postgrest.org/en/v7.0.0/api.html). For more information on how it was implemented in the IEDB API, see https://query-api.iedb.org/docs/swagger/ and seek help here: https://help.iedb.org/hc/en-us/articles/4402872882189-Immune-Epitope-Database-Query-API-IQ-API-

## Core endpoints from the API
While there are many endpoints provided, we expect the majority of users will want to search against one or more of the following tables, which correspond to the tabbed search results on the IEDB. 

    epitope_search
    antigen_search
    tcell_search (assays) 
    bcell_search (assays)
    mhc_search (assays)
    tcr_search (receptors)
    bcr_search (receptors)
    reference_search

We're mainly interested in:

    epitope_search
    tcell_search (assays) 
    tcr_search (receptors)
   
A good example is found in file: use_case_1a.ipynb. As this is not my example, I have chosen to hide it with .gitignore
***
## Downloading information on a single epitope in a tabular format
This is described in the tutorial: use_case_1a.ipynb. 
<br>Let's start with the necessary packages:

In [29]:
import requests
import json
import pandas as pd
from io import StringIO

Let's define the url that we will be sending requests to. 

In [3]:
base_uri='https://query-api.iedb.org'

The use case recommends a function to print the curl command for a given request:

In [22]:
# function to print the curl command for a given request
# req is the Response object returned by requests.get
def print_curl_cmd(req):
    url = req.url
    print("This is the request url used with the curl cmd:") #My addition
    print("curl -X 'GET' '" + url + "'")

The first search should be *simple*, meaning that we will download all information related to a single epitope (my favorite epitope):
1. Define my_favorite_epitope
1. Define my search parameters (e.g. `search_params={'linear_sequence': 'eq.SIINFEKL'}`)
2. Define the table_name (what information to search for)
3. Define the full URL of the table that we search
4. Fetch the result by passing the full URL and the search parameters to `requests.get`, saving the response in `result`
5. Print the `curl -X 'GET'` command

In [28]:
my_favorite_epitope = "HPVGEADYFEY"  #HPVGEADYFEY is one of the longest, dominant epitopes on B*3501
search_params={ 'linear_sequence': 'eq.'+my_favorite_epitope} #eq. is a PostgREST operator; it abbreviates "equals"
table_name='epitope_search'
full_url=base_uri + '/' + table_name
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/epitope_search?linear_sequence=eq.HPVGEADYFEY'
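
To see how the table name and the search parameters end up in that URL, the same string can be assembled with the standard library alone, without sending a request. This is only a sketch: `build_query_url` is a hypothetical helper, and `requests` performs its own (equivalent for this simple case) encoding internally.

```python
from urllib.parse import urlencode

BASE_URI = "https://query-api.iedb.org"


def build_query_url(table_name, filters):
    """Return the full request URL for a table and a dict of column filters."""
    return f"{BASE_URI}/{table_name}?{urlencode(filters)}"


# 'eq.' is the PostgREST operator for "equals".
url = build_query_url("epitope_search", {"linear_sequence": "eq.HPVGEADYFEY"})
print(url)
```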


We have the results. Inspect it:

    what type is it?


In [29]:
print("The downloaded file is of type", type(result))

The downloaded file is of type <class 'requests.models.Response'>


We need to transform *class: 'requests.models.Response'* into something useful. The IEDB API tutorial recommends turning it into a data frame right away:

In [30]:
df = pd.json_normalize(result.json())
print("This is a data frame of the following type:", type(df))
df

This is a data frame of the following type: <class 'pandas.core.frame.DataFrame'>


Unnamed: 0,structure_id,structure_iri,structure_descriptions,curated_source_antigens,structure_type,linear_sequence,e_modification,linear_sequence_length,iedb_assay_ids,iedb_assay_iris,...,bcell_ids,bcell_iris,elution_ids,elution_iris,reference_types,pubmed_ids,reference_abstracts,reference_titles,reference_authors,reference_dates
0,24536,IEDB_EPITOPE:24536,[HPVGEADYFEY],"[{'accession': 'AFY97913.1', 'name': 'EBNA-1',...",Linear peptide,HPVGEADYFEY,,11,"[5452, 1328890, 1328891, 1382633, 1382634, 138...","[IEDB_ASSAY:1328890, IEDB_ASSAY:1328891, IEDB_...",...,,,"[1383248, 1383263, 1403131, 1403132, 1450643, ...","[IEDB_ASSAY:1383248, IEDB_ASSAY:1383263, IEDB_...","[Literature, Submission]","[11120837, 12576337, 14694109, 15148339, 15148...",[A classic feature of antigen presentation for...,[Allelic polymorphism in the T cell receptor a...,[Barbara Savoldo; John A Goss; Markus M Hammer...,"[2000, 2003, 2004, 2005, 2006, 2007, 2008, 200..."


The current dataframe has a single observation and 83 columns. That is because each epitope is represented as a row when using "epitope_search" with 83 different features.

Next aims:
- Inspect all column names
- Let's redo the epitope search, but let's limit the number of columns to receive. 
- Let's perform the epitope search for a given HLA allele instead (e.g. B\*35:01)

In [31]:
#Inspecting all column names
df.columns

Index(['structure_id', 'structure_iri', 'structure_descriptions',
       'curated_source_antigens', 'structure_type', 'linear_sequence',
       'e_modification', 'linear_sequence_length', 'iedb_assay_ids',
       'iedb_assay_iris', 'reference_ids', 'reference_iris', 'submission_ids',
       'submission_iris', 'pdb_ids', 'chebi_ids', 'qualitative_measures',
       'mhc_allele_evidences', 'antibody_isotypes', 'direct_ex_vivo_bool',
       'receptor_ids', 'receptor_group_ids', 'tcr_receptor_group_ids',
       'bcr_receptor_group_ids', 'receptor_group_iris',
       'tcr_receptor_group_iris', 'bcr_receptor_group_iris', 'receptor_types',
       'receptor_names', 'receptor_chain1_types', 'receptor_chain2_types',
       'receptor_chain1_full_seqs', 'receptor_chain2_full_seqs',
       'receptor_chain1_cdr1_seqs', 'receptor_chain2_cdr1_seqs',
       'receptor_chain1_cdr2_seqs', 'receptor_chain2_cdr2_seqs',
       'receptor_chain1_cdr3_seqs', 'receptor_chain2_cdr3_seqs',
       'host_organism_iri

These are the column names in Epitope_search: 
          
Of interest are:
- curated_source_antigens
- linear_sequence
- e_modification
- linear_sequence_length
- mhc_allele_names
- source_organism_names
- iedb_assay_ids
- reference_ids

**These also double as search parameters**
<br>See: https://query-api.iedb.org/docs/swagger/#/epitope_search/get_epitope_search


In [35]:
#This code now selects the exact columns we are interested in within the IEDB API, before the resulting df is generated
search_params={  'linear_sequence': 'eq.'+my_favorite_epitope,
                'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'}
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)
df = pd.json_normalize(result.json())
df

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/epitope_search?linear_sequence=eq.HPVGEADYFEY&select=curated_source_antigens%2C+linear_sequence%2C+e_modification%2C+linear_sequence_length%2C+mhc_allele_names%2C++++++++++++++++source_organism_names%2C+iedb_assay_ids%2C+reference_ids'


Unnamed: 0,curated_source_antigens,linear_sequence,e_modification,linear_sequence_length,mhc_allele_names,source_organism_names,iedb_assay_ids,reference_ids
0,"[{'accession': 'AFY97913.1', 'name': 'EBNA-1',...",HPVGEADYFEY,,11,"[HLA-A*02:01, HLA-A2, HLA-B*07:02, HLA-B*08:01...","[Human herpesvirus 4 (Epstein Barr virus), Hum...","[5452, 1328890, 1328891, 1382633, 1382634, 138...","[589, 1002081, 1004304, 1004493, 1004495, 1004..."
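
The long run of `+` signs in the URL above comes from the spaces and the line continuation inside the `select` string. A possible way to avoid them, sketched here with the standard library only, is to keep the column names in a list and join them with commas. The variable names are my own, and `safe=","` is merely a readability choice.

```python
from urllib.parse import urlencode

columns = [
    "curated_source_antigens", "linear_sequence", "e_modification",
    "linear_sequence_length", "mhc_allele_names", "source_organism_names",
    "iedb_assay_ids", "reference_ids",
]

params = {
    "linear_sequence": "eq.HPVGEADYFEY",
    "select": ",".join(columns),  # no spaces or line continuations
}

# safe="," keeps the commas readable instead of escaping them as %2C.
query_string = urlencode(params, safe=",")
print(query_string)
```

The same `params` dictionary could be passed to `requests.get(full_url, params=params)`; in that case `requests` does the escaping itself, so the commas would appear as `%2C` again, but the padding of `+` signs would be gone.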


From the tutorial: 
    "Note the additional complexity in the URL of the last query **There are two parameters (linear_sequence & select)**, multiple values for the latter parameter, and many URL escape codes for the commas. Python's 'request' module handles this all for you, but one should be aware that all portions of the query need to be URL-escaped."

# Downloading information on all epitopes in Epitope_search

In [36]:
#See title
search_params={  'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'} 
table_name='epitope_search'
full_url=base_uri + '/' + table_name
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)
df = pd.json_normalize(result.json())

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/epitope_search?select=curated_source_antigens%2C+linear_sequence%2C+e_modification%2C+linear_sequence_length%2C+mhc_allele_names%2C++++++++++++++++source_organism_names%2C+iedb_assay_ids%2C+reference_ids'


Unnamed: 0,curated_source_antigens,linear_sequence,e_modification,linear_sequence_length,mhc_allele_names,source_organism_names,iedb_assay_ids,reference_ids
0,"[{'accession': 'P26664.3', 'name': 'Genome pol...",GDLCGSVFL,,9,"[H2-Kk, Mamu-A1*011:01]",[Hepatitis C virus (isolate 1)],"[1182763, 1182764]",[1000983]
1,"[{'accession': 'AAB59097.1', 'name': 'exotoxin...",GDLDPSSIPDKEQAISALPD,,20,,[Pseudomonas aeruginosa],[1269457],[1001199]
2,"[{'accession': 'P08318.1', 'name': 'Large stru...",GDLFSGDEDSDSSDG,,15,,[Human herpesvirus 5 (Human cytomegalovirus)],[1382552],[1004651]
3,"[{'accession': 'YP_001623301.1', 'name': 'cell...",GDLGKKGFEDGDLVV,,15,,[Renibacterium salmoninarum],"[1478738, 1478739]",[1006177]
4,"[{'accession': 'P03306.2', 'name': 'Genome pol...",GDLGSIA,,7,,[Foot-and-mouth disease virus (strain A10-61)],"[1480902, 1480903, 1480906, 1480907, 1480909, ...","[1008339, 1010423]"
...,...,...,...,...,...,...,...,...
9995,"[{'accession': 'P03468.2', 'name': 'Neuraminid...",ITYKNSTWVK,,10,[H2-Db],[Influenza A virus (A/Puerto Rico/8/1934(H1N1)...,"[1006289, 1006290]",[1000168]
9996,"[{'accession': 'AAB00381.1', 'name': 'S1 glyco...",ITYKVMREVRALAYFVNGTA,,20,[chicken],[Avian infectious bronchitis virus (strain Vic...,"[1498067, 1498075, 1498257, 1498284, 1500789]",[1013036]
9997,"[{'accession': 'AAA69398.1', 'name': 'D7R', 'i...",ITYLMNRFK,,9,"[HLA-A*03:01, HLA-A*11:01, HLA-A*30:01, HLA-A*...","[Camelpox virus M-96, Vaccinia virus Copenhage...","[1301052, 1301053, 1301054, 1301055, 1301056, ...","[1001865, 1003530, 1006442, 1014017, 1028287, ..."
9998,"[{'accession': 'AAB96556.1', 'name': '68k anky...",ITYLMNRFKNIDI,,13,[human],[Modified Vaccinia Ankara virus (MVA virus)],[1502511],[1007651]
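
With `linear_sequence_length` and `mhc_allele_names` side by side, the question "what epitope length is cataloged for each HLA?" becomes a grouping problem. Below is a minimal sketch on three rows copied from the output above. It assumes the rows are plain dictionaries as returned by `result.json()`, where a missing allele list is `None`; the variable names are my own.

```python
from collections import defaultdict

# Three rows copied from the output above (only the two relevant columns).
rows = [
    {"linear_sequence_length": 9, "mhc_allele_names": ["H2-Kk", "Mamu-A1*011:01"]},
    {"linear_sequence_length": 10, "mhc_allele_names": ["H2-Db"]},
    {"linear_sequence_length": 15, "mhc_allele_names": None},  # no allele recorded
]

lengths_per_allele = defaultdict(list)
for row in rows:
    # mhc_allele_names is a list per epitope, or missing altogether.
    for allele in row["mhc_allele_names"] or []:
        lengths_per_allele[allele].append(row["linear_sequence_length"])

print(dict(lengths_per_allele))
```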


Inspect the data frame: 
- shape
- sample
- linear_sequence_length

In [37]:
df.shape

(10000, 8)

**HUH**, it appears weird that there are exactly 10.000 rows. There should be more...
How many epitopes are within IEDB currently?
- (26-08-2021) 39868 epitopes are available. There must be a limit or something on the API...

Interesting:
    The default request is limited to 10.000 records; any more will be hidden on a different "page"
    <br> *By default, the IQ-API has a maximum page size of 10,000 records. In practice, this means that queries that result in more than 10,000 results will be divided into pages and only the first 10,000 records will be returned by the initial query. The API will always return a count of the records matching the query, as well as the number of pages of results.*

There IS a way to download all hits. But it requires me to use the curl command in the terminal.

## Download Epitope_search and Assay_search by using the terminal
TL;DR: This did not work very well; I had to try a for loop with the initial method instead

In [50]:
#Define the epitope_search url
search_params={  'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'} 
table_name='epitope_search'
print(base_uri)
full_url=base_uri + '/' + table_name
print(full_url)

https://query-api.iedb.org
https://query-api.iedb.org/epitope_search


In [51]:
#Define the Antigen_search url
search_params={  'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'} 
table_name='antigen_search'
print(base_uri)
full_url=base_uri + '/' + table_name
print(full_url)

https://query-api.iedb.org
https://query-api.iedb.org/antigen_search


In the terminal:
```bash
$ curl -I "https://query-api.iedb.org/epitope_search" -H 'Prefer: count=exact'
HTTP/1.1 206 Partial Content
Server: nginx/1.14.2
Date: Thu, 26 Aug 2021 17:33:11 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Content-Range: 0-9999/1535288
Content-Location: /epitope_search
```

From the above, I can see that:
- There are 1535288 entries in the epitope_search database
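
The `Content-Range` header has the form `first-last/total`, so the number of pages needed follows directly from it. A small sketch (the helper name `parse_content_range` is hypothetical; the page size of 10,000 is the one stated by the IQ-API documentation quoted earlier):

```python
import math

PAGE_SIZE = 10_000  # maximum page size stated in the IQ-API documentation


def parse_content_range(header_value):
    """Parse 'first-last/total' into three integers."""
    span, total = header_value.split("/")
    first, last = span.split("-")
    return int(first), int(last), int(total)


first, last, total = parse_content_range("0-9999/1535288")
n_pages = math.ceil(total / PAGE_SIZE)
print(total, n_pages)
```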

I can access any part of the database above 9999 by specifying the offset parameter in the URL:

```bash
$ curl -I "https://query-api.iedb.org/antigen_search?offset=10000" -H 'Prefer: count=exact'
HTTP/1.1 206 Partial Content
Server: nginx/1.14.2
Date: Thu, 26 Aug 2021 17:41:01 GMT
Content-Type: application/json; charset=utf-8
Connection: keep-alive
Content-Range: 10000-19999/73218
Content-Location: /antigen_search?offset=10000
```
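
Given the total from the header, the offsets for all pages can be listed up front. This is a sketch under the assumption that every page holds 10,000 rows; the URLs are only built here, not requested.

```python
PAGE_SIZE = 10_000
total_rows = 73218  # total reported for antigen_search in the header above

offsets = list(range(0, total_rows, PAGE_SIZE))
page_urls = [
    f"https://query-api.iedb.org/antigen_search?offset={offset}"
    for offset in offsets
]
print(len(page_urls))
print(page_urls[1])
```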


### Downloading information has to be done in chunks followed with merging

Aims:
- Create a simple bash script to download, store and bind datasets together for Epitope_search
- Do the same for assay_search

Bash can be run from a Jupyter notebook by using "magics": http://blog.dominodatalab.com/lesser-known-ways-of-using-notebooks
<br> For example, `%%bash` lets you write the rest of the cell in bash

In [10]:
%%bash
for i in 1 2 3 4 5
do
   echo "Welcome $i times"
done

#This is an example of a chunk of code that will be executed with Bash instead of Python 3. 

Welcome 1 times
Welcome 2 times
Welcome 3 times
Welcome 4 times
Welcome 5 times


Can bash also handle my request to IEDB from Jupyter Notebook?
<br> YES. It even reports how much time was spent fetching the data. However, it doesn't appear that the data was stored anywhere...
<br> SOLVED. `curl -I` ONLY fetches the HTTP header from the server.
<br> Downloading the initial data as .csv is also possible if specified in the curl command: `-H "accept: text/csv"`

    curl -I, --head
       (HTTP/FTP/FILE)  Fetch the HTTP-header only! HTTP-servers feature the command HEAD which this uses
       to get nothing but the header of a document. When used on a FTP or FILE file,  curl  displays  the
       file size and last modification time only.

In [26]:
%%bash
#start by downloading the first 10 rows as CSV and printing them to the screen
curl "https://query-api.iedb.org/epitope_search?limit=10" -H  "accept: text/csv" 

structure_id,structure_iri,structure_descriptions,curated_source_antigens,structure_type,linear_sequence,e_modification,linear_sequence_length,iedb_assay_ids,iedb_assay_iris,reference_ids,reference_iris,submission_ids,submission_iris,pdb_ids,chebi_ids,qualitative_measures,mhc_allele_evidences,antibody_isotypes,direct_ex_vivo_bool,receptor_ids,receptor_group_ids,tcr_receptor_group_ids,bcr_receptor_group_ids,receptor_group_iris,tcr_receptor_group_iris,bcr_receptor_group_iris,receptor_types,receptor_names,receptor_chain1_types,receptor_chain2_types,receptor_chain1_full_seqs,receptor_chain2_full_seqs,receptor_chain1_cdr1_seqs,receptor_chain2_cdr1_seqs,receptor_chain1_cdr2_seqs,receptor_chain2_cdr2_seqs,receptor_chain1_cdr3_seqs,receptor_chain2_cdr3_seqs,host_organism_iri_search,host_organism_iris,host_organism_names,source_organism_iri_search,source_organism_iris,source_organism_names,mhc_allele_iri_search,mhc_allele_iris,mhc_allele_names,parent_source_antigen_iri_search,parent_source_anti

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 41525    0 41525    0     0  44602      0 --:--:-- --:--:-- --:--:--  109k


Okay, that became a very large output. Save it to a file directly instead of printing it:
https://stackoverflow.com/questions/13735051/how-to-capture-curl-output-to-a-file
<br> Bonus: there is also a solution for appending data to an existing file

In [27]:
%%bash
echo "the file is saved in: "
pwd #this is important to call, it will let you know where the file is saved
curl "https://query-api.iedb.org/epitope_search?limit=10" -H 'Prefer: count=exact' -o output.csv  #-o is for output. Note: without an 'accept: text/csv' header the saved content is JSON, despite the .csv name.


the file is saved in: 
/mnt/c/Users/nipagh/Desktop/Python_projects/Course_work


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 62231    0 62231    0     0  55414      0 --:--:--  0:00:01 --:--:--  105k


Awesome! After a couple of seconds, an "output.csv" file appeared in the working directory (Python_projects/Course_work)

Try downloading more:

    100.000 entries

In [101]:
%%bash
echo "the file is saved in: "
pwd #print working dir
curl "https://query-api.iedb.org/epitope_search?limit=100000" -H 'accept: text/csv' -o output.csv  #-o is for output. 

the file is saved in: 
/mnt/c/Users/nipagh/Desktop/Python_projects/Course_work


  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 83.5M    0 83.5M    0     0   763k      0 --:--:--  0:01:52 --:--:--  490k


It took almost two minutes to download the dataset. Let's inspect it

In [105]:
data = pd.read_csv('output.csv', low_memory = False)

I still cannot download more than 10.000 entries at a time. That sucks

Can I reduce the number of columns that I am interested in?

The Swagger documentation for the query API was immensely helpful (see the list of column names, what input they take, and how): https://query-api.iedb.org/docs/swagger/#/epitope_search/get_epitope_search

    You can search for a specific value by using eq._value_of_interest_
    You can search for a value within a nested cell (e.g. taxonomy search) by using cs.{10239, next, this too}. 
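
The curly braces of a `cs.` value have to be URL-escaped when they are written directly into a URL, which is where the `%7B` and `%7D` in the commands below come from. A minimal sketch of building such a value follows; `contains_filter` is a hypothetical helper, and keeping `:` and `,` unescaped is a readability choice.

```python
from urllib.parse import quote


def contains_filter(values):
    """Build a PostgREST 'cs.' (contains) value such as 'cs.{a,b}'."""
    return "cs.{" + ",".join(values) + "}"


raw = contains_filter(["NCBITaxon:10239"])
# Keep ':' and ',' readable; the curly braces become %7B and %7D.
encoded = quote(raw, safe=":,")
print(raw)
print(encoded)
```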

   

In [267]:
%%bash
#Get something that looks like:
#https://query-api.iedb.org/epitope_search?limit=10&select=structure_id,linear_sequence
#'https://query-api.iedb.org/mhc_search?linear_sequence=eq.SIINFEKL'
#Trial and error, but the array cells can be reached with contains (cs.). The value needs to match the cell contents exactly
#"https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7BNCBITaxon:10239%7D
#For T cell assays: assay_iri_search=cs.%7BOBI:1110037%7D   http://www.ontobee.org/ontology/OBI?iri=http://purl.obolibrary.org/obo/OBI_1110037


#what is the base url that we will modify?
base_url="https://query-api.iedb.org"
echo "base_url: $base_url"
    
#Define and add table of interest
table_name='epitope_search'
modified_url="$base_url/$table_name?"
echo "URL with table name: $modified_url"

#Define and add clade filters of interest
#The search tool at IEDB.org appears to use an organism/clade ID for all viruses: Virus (ID:10239). Can I filter for that?
Clade_interest="cs.%7BNCBITaxon:10239%7D" #10239 is viruses generally.
echo $Clade_interest
modified_url=$modified_url"source_organism_iri_search=$Clade_interest"
echo "URL with table name, and clade: $modified_url"

#Epitope search if needed
    #Epitope_interest="eq.HPVGEADYFEY"
    #modified_url="$modified_url?linear_sequence=$Epitope_interest"
    #echo "URL with table name and epitope target: $modified_url"
    
#Remove all negative assay information
#the relevant column is qualitative_measures - This will never change
modified_url=$modified_url"&qualitative_measures=cs.%7BPositive%7D"
echo "URL with table name, and clade: $modified_url"

#Remove all non-T-cell assay information
#the relevant column is assay_iri_search
modified_url=$modified_url"&assay_iri_search=cs.%7BOBI:1110037%7D"
echo "URL with table name, clade, measures, and outcome: $modified_url"

#Remove all epitopes not restricted to MHC class I
#the relevant column is mhc_allele_iri_search
modified_url=$modified_url"&mhc_allele_iri_search=cs.%7BMRO:0001675%7D"
echo "URL with table name, clade, measures, and outcome: $modified_url"


#Cols of interest
    #Note, currently
cols_interest=("mhc_allele_iri_search" "qualitative_measures" "source_organism_iri_search" 'parent_source_antigen_names' 'linear_sequence' 'e_modification' "linear_sequence_length" "mhc_allele_names" "source_organism_names" "iedb_assay_ids" "reference_ids")

#Can we loop through the cols_interest to add each element to the url?    (yes - but it ends in "," - remove that)
modified_url="$modified_url&select="
echo "URL for iteration over cols of interest: $modified_url"
for element in "${cols_interest[@]}"
do
    modified_url="$modified_url$element,"
done
echo "Output_url: $modified_url"

#Remove the last character of the string (the trailing comma)
modified_url="${modified_url::-1}"
echo "This is the final URL $modified_url"

#Go print the HTTP counts
echo " " #New line
curl -I $modified_url -H 'Prefer: count=exact' #-o output.csv  #-o is for output. 

#Go fetch the data
echo " " #New line
echo "the file is saved in: "
pwd #print working dir
curl $modified_url -H 'accept: text/csv' -o output.csv  #-o is for output. 

base_url: https://query-api.iedb.org
URL with table name: https://query-api.iedb.org/epitope_search?
cs.%7BNCBITaxon:10239%7D
URL with table name, and clade: https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7BNCBITaxon:10239%7D
URL with table name, and clade: https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7BNCBITaxon:10239%7D&qualitative_measures=cs.%7BPositive%7D
URL with table name, clade, measures, and outcome: https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7BNCBITaxon:10239%7D&qualitative_measures=cs.%7BPositive%7D&assay_iri_search=cs.%7BOBI:1110037%7D
URL with table name, clade, measures, and outcome: https://query-api.iedb.org/epitope_search?source_organism_iri_search=cs.%7BNCBITaxon:10239%7D&qualitative_measures=cs.%7BPositive%7D&assay_iri_search=cs.%7BOBI:1110037%7D&mhc_allele_iri_search=cs.%7BMRO:0001675%7D
URL for iteration over cols of interest: https://query-api.iedb.org/epitope_search?source_organi

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:08 --:--:--     0
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 3955k    0 3955k    0     0   467k      0 --:--:--  0:00:08 --:--:-- 1006k


In [261]:
%%bash
#troubleshooting - download with only the T-cell assay and positive-measure filters
curl "https://query-api.iedb.org/epitope_search?assay_iri_search=cs.%7BOBI:1110037%7D&qualitative_measures=cs.%7BPositive%7D&select=structure_id,source_organism_iri_search,qualitative_measures,iedb_assay_ids,iedb_assay_iris,assay_iri_search,assay_iris,assay_names" -H 'accept: text/csv' -o output.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 6072k    0 6072k    0     0  1066k      0 --:--:--  0:00:05 --:--:-- 1463k


In [256]:
%%bash
#troubleshooting - is 
curl "https://query-api.iedb.org/epitope_search?assay_iri_search=cs.%7BOBI:0001216%7D&select=structure_id,source_organism_iri_search,qualitative_measures,iedb_assay_ids,iedb_assay_iris,assay_iri_search,assay_iris,assay_names" -H 'accept: text/csv' -o output.csv

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1347k    0 1347k    0     0   167k      0 --:--:--  0:00:08 --:--:--  300k


In [212]:
search_params={ 'linear_sequences': 'cs.{SIINFEKL}'}
table_name='tcr_search'
full_url=base_uri + '/' + table_name
result = requests.get(full_url, params=search_params)
df = pd.json_normalize(result.json())
# function to print the curl command for a given request
# req is the Response object returned by requests.get
def print_curl_cmd(req):
    url = req.url
    print("This is the request url used with the curl cmd:") #My addition
    print("curl -X 'GET' '" + url + "'")
    
print_curl_cmd(result)

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/tcr_search?linear_sequences=cs.%7BSIINFEKL%7D'


Okay, so I tried a lot of different things with the CSV file that I got from the terminal, but it did not convert the way I wanted. Let's go back to the initial method of using `requests.get` to fetch the data from the server. I just have to write a for loop that modifies the URL, downloads the data, and appends it, continuing until all data has been appended.
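
The loop described here can be sketched independently of the network by handing it a function that returns one page at a time. Everything below is an assumption for illustration: `fetch_all` and `fake_fetch_page` are hypothetical names, and the fake table stands in for the API so the logic can be tried offline.

```python
def fetch_all(fetch_page, page_size=10_000):
    """Collect rows page by page until a page comes back short or empty."""
    rows = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        rows.extend(page)
        if len(page) < page_size:
            break  # last page reached
        offset += page_size
    return rows


# Stand-in for the API so the loop can be tried offline: 25 fake rows.
fake_table = [{"structure_id": i} for i in range(25)]


def fake_fetch_page(offset, limit):
    return fake_table[offset:offset + limit]


all_rows = fetch_all(fake_fetch_page, page_size=10)
print(len(all_rows))
```

For the real API, `fetch_page` would presumably wrap something like `requests.get(full_url, params={**search_params, 'offset': offset, 'limit': limit}).json()`, using the `offset` and `limit` parameters already seen in the curl commands.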

In [56]:
#Base variables and functions
base_url='https://query-api.iedb.org'
def print_curl_cmd(req):
    url = req.url
    print("This is the request url used with the curl cmd:") #My addition
    print("curl -X 'GET' '" + url + "'")

In [76]:
#The initial code for fetching data of interest: 
search_params={ 'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'} 
table_name='epitope_search'
full_url=base_url + '/' + table_name
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)
df = pd.json_normalize(result.json())

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/epitope_search?select=curated_source_antigens%2C+linear_sequence%2C+e_modification%2C+linear_sequence_length%2C+mhc_allele_names%2C++++++++++++++++source_organism_names%2C+iedb_assay_ids%2C+reference_ids'


In [84]:
#Inspect the result df
df[0:2]


Unnamed: 0,curated_source_antigens,linear_sequence,e_modification,linear_sequence_length,mhc_allele_names,source_organism_names,iedb_assay_ids,reference_ids
0,"[{'accession': 'AAM26117.1', 'name': 'lethal f...",NNIQSDLIKK,,10.0,,[Bacillus anthracis (anthrax bacterium)],"[1606108, 7621663]","[1014009, 1036746]"
1,"[{'accession': 'AAM26117.1', 'name': 'lethal f...",NNLTATLGAD,,10.0,,[Bacillus anthracis (anthrax bacterium)],"[1606030, 7621581]","[1014009, 1036746]"


In [98]:
#The first offset
base_url='https://query-api.iedb.org'
off_set = "1"
search_params={ 'select': 'curated_source_antigens, linear_sequence, e_modification, linear_sequence_length, mhc_allele_names,\
                source_organism_names, iedb_assay_ids, reference_ids'} 
table_name='epitope_search'
full_url=base_url + '/' + table_name + '?' + "offset=" + off_set
result = requests.get(full_url, params=search_params)
print_curl_cmd(result)
df = pd.json_normalize(result.json())

print("https://query-api.iedb.org/antigen_search?offset=10000")

This is the request url used with the curl cmd:
curl -X 'GET' 'https://query-api.iedb.org/epitope_search?offset=1&select=curated_source_antigens%2C+linear_sequence%2C+e_modification%2C+linear_sequence_length%2C+mhc_allele_names%2C++++++++++++++++source_organism_names%2C+iedb_assay_ids%2C+reference_ids'
https://query-api.iedb.org/antigen_search?offset=10000


In [99]:
#Inspect the result df
df[0:2]

Unnamed: 0,curated_source_antigens,linear_sequence,e_modification,linear_sequence_length,mhc_allele_names,source_organism_names,iedb_assay_ids,reference_ids
0,"[{'accession': 'Q9NW08.2', 'name': 'DNA-direct...",NAISTGNWSLKRFKM,,15.0,,[Homo sapiens (human)],"[1949997, 1949998]",[1024970]
1,"[{'accession': 'Q7Z1H1', 'name': 'Secreted pro...",NAKQCVFKHSQPN,,13.0,,[Necator americanus],"[1945502, 1945503, 1945504, 1945505]",[1024480]


# There appears to be an issue: the requests start at a seemingly random place within the database.
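
One possible explanation, offered as an assumption and not verified against the IEDB server: a database query without an explicit sort order may return rows in any order, so pages fetched with different offsets need not line up with each other. PostgREST accepts an `order` query parameter, and the sketch below adds one to the page parameters. Sorting on `structure_id` is my own choice of column, and `page_params` is a hypothetical helper.

```python
from urllib.parse import urlencode


def page_params(columns, offset, limit=10_000, order_by="structure_id"):
    """Query parameters for one page, with an explicit sort order."""
    return {
        "select": ",".join(columns),
        "order": order_by,  # assumption: sorting on structure_id is allowed
        "limit": limit,
        "offset": offset,
    }


params = page_params(["structure_id", "linear_sequence"], offset=10_000)
print(urlencode(params, safe=","))
```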