# Preliminary data exploration
This notebook includes basic analyses of the data available from the publication:  
Aflitos, Saulo, et al. "Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole‐genome sequencing." The Plant Journal 80.1 (2014): 136-148.  
http://onlinelibrary.wiley.com/doi/10.1111/tpj.12616/full  
The data consist of three "reference-level" genome assemblies of wild relatives of tomato and 84 resequenced lines of diverse tomatoes and relatives.  
The data were deposited in [ENA](https://www.ebi.ac.uk/ena) (project accession numbers are given in the paper).  
The main goal of this notebook is to explore the published data, extract summary statistics, and get a sense of its quality and of what can be done with it.

In [62]:
DATA_PATH = "../data/"
PY_PATH = "../python/"
FIGS_PATH = "../figs/"
OUT_PATH = "../output/"

import sys
sys.path.append('../')
import get_genome_stats
import pandas
from IPython.display import display
pandas.set_option('display.float_format', lambda x: "{:,.2f}".format(x) if int(x) != x else "{:,.0f}".format(x))

## Get the genome assemblies
Three assemblies of wild relatives need to be retrieved from ENA. Each assembly is provided as contigs/scaffolds (the same thing in this case), which are stored as separate records in ENA. The download is performed through the ENA REST API, following the supplied [documentation](https://www.ebi.ac.uk/ena/browse/data-retrieval-rest).

In [34]:
ENA_base_url = "http://www.ebi.ac.uk/ena/data/view/"
arcanum_range = "CBYQ010000001-CBYQ010046594"    # range of records of contigs

In [35]:
# download arcanum contigs
arcanum_url = ENA_base_url + arcanum_range + "&display=fasta"
arcanum_fasta_path = DATA_PATH + "arcanum_contigs.fasta"
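# Download is disabled here; uncomment to fetch the contigs.
# (download.file is an R function; urllib.request.urlretrieve is a Python equivalent.)
# import urllib.request
# urllib.request.urlretrieve(arcanum_url, arcanum_fasta_path)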
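
The URL construction and download step can be sketched in plain Python as follows. This is a minimal sketch, not the notebook's actual method: the helper names `build_fasta_url`, `records_in_range` and `download_fasta` are hypothetical, the URL layout is copied from the cells above (ENA may have changed this endpoint since), and the record count assumes that both accessions in the range share a six-character prefix followed by a numeric suffix.

```python
import urllib.request

# Base URL as used in this notebook; the endpoint may have changed since.
ENA_BASE_URL = "http://www.ebi.ac.uk/ena/data/view/"


def build_fasta_url(accession_range, base_url=ENA_BASE_URL):
    """Build the FASTA retrieval URL for a range of ENA records."""
    return base_url + accession_range + "&display=fasta"


def records_in_range(accession_range, prefix_len=6):
    """Count the records in a range such as 'CBYQ010000001-CBYQ010046594'.

    Assumption: both accessions share the same prefix of `prefix_len`
    characters and the remainder is a running number.
    """
    first, last = accession_range.split("-")
    if first[:prefix_len] != last[:prefix_len]:
        raise ValueError("accessions in the range do not share a prefix")
    return int(last[prefix_len:]) - int(first[prefix_len:]) + 1


def download_fasta(url, dest_path):
    """Download `url` to `dest_path` and return the destination path."""
    urllib.request.urlretrieve(url, dest_path)
    return dest_path


# Example (no network access is needed for these two calls):
# build_fasta_url("CBYQ010000001-CBYQ010046594")
# records_in_range("CBYQ010000001-CBYQ010046594")
```

Counting the records in the range before downloading gives a cheap sanity check: the number of sequences in the downloaded FASTA file should match it.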

## Calculate basic assembly stats

In [39]:
arcanum_genome_stats = get_genome_stats.get_stats(arcanum_fasta_path)

In [63]:
display(pandas.DataFrame.from_dict(arcanum_genome_stats, orient='index'))

| Statistic | Value |
| --- | --- |
| Total length | 665186956.0 |
| Total scaffolds | 46594.0 |
| # of gaps | 559128.0 |
| % gaps | 0.08 |
| N50 | 31288.0 |
| L50 | 5928.0 |
| N90 | 6904.0 |
| L90 | 22922.0 |
| Min scaffold length | 266.0 |
| Max scaffold length | 392206.0 |
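
For reference, statistics like the ones above can be derived from scaffold lengths alone. The following is a minimal sketch and not the implementation in `get_genome_stats`, which is not shown here: the function names `scaffold_lengths` and `nx_lx` are hypothetical, and it assumes that gaps are counted as `N` bases and that Nx is defined as the length of the shortest scaffold in the smallest set of longest scaffolds covering at least a fraction x of the assembly, with Lx the size of that set.

```python
def scaffold_lengths(lines):
    """Return (lengths, n_bases) for FASTA-formatted lines.

    `lines` can be any iterable of strings, e.g. an open file handle.
    Assumption: gap bases are represented as 'N' or 'n'.
    """
    lengths = []
    n_bases = 0
    current = None  # length of the record being read, None before the first header
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
            n_bases += line.upper().count("N")
    if current is not None:
        lengths.append(current)
    return lengths, n_bases


def nx_lx(lengths, fraction):
    """Return (Nx, Lx) for the given fraction, e.g. 0.5 for N50/L50."""
    total = sum(lengths)
    running = 0
    # Walk from the longest scaffold down until the requested fraction is covered.
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= fraction * total:
            return length, count
    raise ValueError("lengths must be non-empty")


# Toy example with five scaffolds (total length 30):
toy_lengths = [10, 8, 6, 4, 2]
n50, l50 = nx_lx(toy_lengths, 0.5)  # (8, 2): 10 + 8 covers at least half of 30
n90, l90 = nx_lx(toy_lengths, 0.9)  # (4, 4): 10 + 8 + 6 + 4 covers at least 27
```

With a real assembly, the lengths would come from something like `scaffold_lengths(open(arcanum_fasta_path))`, and the gap percentage would be `100 * n_bases / sum(lengths)`.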
