<div style="max-width:1200px"><img src="../_resources/mgnify_banner.png" width="100%"></div>

<img src="../_resources/mgnify_logo.png" width="200px">

# Locate samples on a map

**This notebook gives a few examples of maps that can be drawn from MGnify MAGs sample, study and genome data.**

This notebook is divided into 6 sections:
- 1: Libraries needed to run the full notebook and Spark Session
- 2: Load previously queried and saved datasets (Here we use the `studies`, `genomes` and `samples` as an example).
- 3: Join the three datasets together 
- 4: First map example: `Interactive Map` representing the `Number of Genomes according to their geographic origin` (in this case per continent).
- 5: Second map example: `Number of Samples per country vs. Number of Genomes per country` on two different interactive maps displayed side-by-side.
- 6: Third map example: Use of the `samples' latitude and longitude` to bring samples, genomes and studies together on an `interactive map`.

This is an interactive code notebook (a Jupyter Notebook).
To run this code, click into each cell and press the ▶ button in the top toolbar, or press `shift+enter`.

## Libraries and Spark Session

### Import python libraries

In [1]:
# Dataframes and display
import pandas as pd

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
from pyspark.sql import Window as W

# Transformation data
from functools import reduce

# Map plots
import geopandas as gpd
from lets_plot import *
from lets_plot import tilesets

LetsPlot.setup_html()

# Warning verbosity
import warnings 
warnings.filterwarnings(action="ignore")

### Create Spark Session

In [2]:
spark = SparkSession.builder.getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


22/12/01 15:04:20 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
spark

## Load the datafiles

### Load the `studies` dataset

A sample of the studies dataset has been queried and saved as a parquet file beforehand. To query the MGnify API, please refer to the following Python notebook: `genome_search_example.ipynb`.  

For the studies dataset, use the `studies` endpoint.   
A complete list of endpoints can be found at https://www.ebi.ac.uk/metagenomics/api/v1/.

In [4]:
studies_df = spark.read.parquet('outputs/parquets/studies')

In [5]:
studies_df.count()

4475

In [6]:
len(studies_df.columns)

14

**Output:** The studies dataset has 4475 rows and 14 columns.

In [7]:
studies_df.select('`attributes.accession`', '`attributes.secondary-accession`').show()

+--------------------+------------------------------+
|attributes.accession|attributes.secondary-accession|
+--------------------+------------------------------+
|        MGYS00006074|                     ERP135767|
|        MGYS00006073|                     ERP140676|
|        MGYS00006072|                     ERP139784|
|        MGYS00006034|                     ERP134737|
|        MGYS00000596|                     ERP012803|
|        MGYS00006069|                     ERP133351|
|        MGYS00005968|                     ERP133894|
|        MGYS00006067|                     ERP137915|
|        MGYS00001935|                     ERP090011|
|        MGYS00006063|                     ERP140432|
|        MGYS00006041|                     ERP125469|
|        MGYS00005766|                     ERP122587|
|        MGYS00005757|                     ERP129176|
|        MGYS00006060|                     ERP137998|
|        MGYS00006061|                     ERP137544|
|        MGYS00006059|      

**Output:** The column `attributes.secondary-accession` links the `studies` dataset to the `genomes` dataset.

### Load the `genomes` dataset

A sample of the genomes dataset has been queried and saved as a parquet file beforehand. To query the MGnify API, please refer to the following Python notebook: `genome_search_example.ipynb`.  

For the genomes dataset, use the `genomes` endpoint.   
A complete list of endpoints can be found at https://www.ebi.ac.uk/metagenomics/api/v1/.

In [8]:
genomes_df = spark.read.parquet('outputs/parquets/genomes')

In [9]:
genomes_df.count()

9421

In [10]:
len(genomes_df.columns)

37

**Outputs:** The genomes dataset has 9421 rows and 37 columns.

In [11]:
genomes_df.select('`attributes.accession`', '`attributes.ena-study-accession`', '`attributes.ena-sample-accession`').show()

+--------------------+------------------------------+-------------------------------+
|attributes.accession|attributes.ena-study-accession|attributes.ena-sample-accession|
+--------------------+------------------------------+-------------------------------+
|       MGYG000299273|                     ERP108069|                     ERS7599803|
|       MGYG000299272|                     ERP127228|                     ERS6080759|
|       MGYG000299270|                     ERP127229|                     ERS7365004|
|       MGYG000299265|                     SRP311368|                           null|
|       MGYG000299259|                     ERP127228|                     ERS6080683|
|       MGYG000299258|                     DRP005925|                     ERS7766739|
|       MGYG000299256|                     ERP127228|                     ERS6080782|
|       MGYG000299253|                     SRP311368|                           null|
|       MGYG000299252|                     ERP127228| 

**Output:** The column `attributes.ena-study-accession` links the `genomes` dataset to the `studies` dataset, and the column `attributes.ena-sample-accession` links the `genomes` dataset to the `samples` dataset.

### Load the `samples` dataset

A sample of the samples dataset has been queried and saved as a parquet file beforehand. To query the MGnify API, please refer to the following Python notebook: `genome_search_example.ipynb`.  

For the samples dataset, use the `samples` endpoint.   
A complete list of endpoints can be found at https://www.ebi.ac.uk/metagenomics/api/v1/.

In [12]:
samples_df = spark.read.parquet('outputs/parquets/samples.parquet')

In [13]:
samples_df.printSchema()

root
 |-- data: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- attributes: struct (nullable = true)
 |    |    |    |-- accession: string (nullable = true)
 |    |    |    |-- analysis-completed: string (nullable = true)
 |    |    |    |-- biosample: string (nullable = true)
 |    |    |    |-- collection-date: string (nullable = true)
 |    |    |    |-- environment-biome: string (nullable = true)
 |    |    |    |-- environment-feature: string (nullable = true)
 |    |    |    |-- environment-material: string (nullable = true)
 |    |    |    |-- geo-loc-name: string (nullable = true)
 |    |    |    |-- host-tax-id: long (nullable = true)
 |    |    |    |-- last-update: string (nullable = true)
 |    |    |    |-- latitude: double (nullable = true)
 |    |    |    |-- longitude: double (nullable = true)
 |    |    |    |-- sample-alias: string (nullable = true)
 |    |    |    |-- sample-desc: string (nullable = true)
 |    |    |    |-- sam

In [14]:
len(samples_df.columns)

3

In [15]:
samples_df.count()

997

**Output:** In our example here, the `samples` dataset parquet has been queried with links and metadata. We need to `explode` the dataset in order to access the attributes as individual columns, as is the case for the two other datasets.

In [16]:
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

**Output:** The lines above disable line wrapping in the cell outputs, so that wide tables are displayed in a 'human-readable manner'.

In [17]:
samples = samples_df.select(F.explode('data')).select('col.id', 'col.attributes.*', 'col.links', 'col.relationships.*', 'col.type')

In [18]:
samples.show(truncate=False)

+----------+----------+------------------+------------+---------------+---------------------+-------------------+--------------------+------------+-----------+-------------------+--------+---------+-------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [19]:
samples.count()

24925

In [20]:
len(samples.columns)

23

**Output:** The samples dataset has 24925 rows and 23 columns.
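
As a rough illustration of what `explode` does here, the sketch below flattens a nested structure in plain Python (no Spark needed). The page contents and IDs are made up, and `explode_pages` is a hypothetical helper, not part of the notebook. It mirrors the idea that each of the 997 rows holds an array of sample records (24925 / 997 = 25 records per row on average), and that each record becomes its own row with its attributes lifted to top-level columns.

```python
# Hypothetical miniature of the nested samples file: each row is one page
# whose `data` field holds a list of sample records.
pages = [
    {"data": [
        {"id": "S1", "attributes": {"latitude": 54.3, "longitude": 10.1}},
        {"id": "S2", "attributes": {"latitude": None, "longitude": None}},
    ]},
    {"data": [
        {"id": "S3", "attributes": {"latitude": 1.0, "longitude": 2.0}},
    ]},
]


def explode_pages(pages):
    """Return one flat dict per record, lifting `attributes` to top-level keys."""
    flat_rows = []
    for page in pages:
        for record in page["data"]:
            row = {"id": record["id"]}
            row.update(record["attributes"])
            flat_rows.append(row)
    return flat_rows


samples_flat = explode_pages(pages)
print(len(pages), "pages ->", len(samples_flat), "sample rows")
```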

## Join the 3 datasets into one dataset based on their unique IDs.

### Join the `studies` dataset on the `genomes` dataset

In [21]:
genomes_df.filter(F.col('`attributes.ena-study-accession`').isNotNull()).count()

8558

In [22]:
genomes_df.filter(F.col('`attributes.ena-study-accession`').isNotNull()).select('`attributes.ena-study-accession`').distinct().count()

99

**Outputs:** Among the 9421 rows of the genomes dataset, 8558 have a reference to a study.  
Together, these rows reference only 99 distinct studies.

*Note: These datasets are only a sample of the data available on the MGnify API.*

In [23]:
studies_df.filter(F.col('`attributes.secondary-accession`').isNotNull()).count()

4475

In [24]:
studies_df.filter(F.col('`attributes.secondary-accession`').isNotNull()).select('`attributes.secondary-accession`').distinct().count()

4475

**Outputs:** Each entry of the `studies` dataset has a `unique reference ID`.

In [25]:
new_df = genomes_df.join(
              reduce(lambda df, x: df.withColumnRenamed(x, f'studies_{x}'),
                   studies_df.columns,
                   studies_df,
                    ), 
              F.col('`attributes.ena-study-accession`')==F.col('`studies_attributes.secondary-accession`'), 
              'left',
          )

In [26]:
new_df.count()

9421

In [27]:
new_df.select('`attributes.accession`', '`attributes.ena-study-accession`', '`attributes.ena-sample-accession`', '`studies_attributes.secondary-accession`').show()

+--------------------+------------------------------+-------------------------------+--------------------------------------+
|attributes.accession|attributes.ena-study-accession|attributes.ena-sample-accession|studies_attributes.secondary-accession|
+--------------------+------------------------------+-------------------------------+--------------------------------------+
|       MGYG000299273|                     ERP108069|                     ERS7599803|                             ERP108069|
|       MGYG000299272|                     ERP127228|                     ERS6080759|                                  null|
|       MGYG000299270|                     ERP127229|                     ERS7365004|                                  null|
|       MGYG000299265|                     SRP311368|                           null|                                  null|
|       MGYG000299259|                     ERP127228|                     ERS6080683|                                  null|


In [28]:
new_df.filter(F.col('`studies_attributes.secondary-accession`').isNotNull()).select('`attributes.ena-study-accession`').distinct().count()

43

**Output:** Only 43 of the 99 distinct `attributes.ena-study-accession` references have a corresponding entry in the studies dataset.
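
The join pattern used above (prefix every column of the right-hand table, then left-join on the accession) can be sketched in plain Python as follows. This is only an illustration under stated assumptions: `left_join_with_prefix` is a hypothetical helper, the accessions are invented, and the right-hand keys are assumed unique, as was checked for the studies dataset.

```python
def left_join_with_prefix(left_rows, right_rows, left_key, right_key, prefix):
    """Left-join two lists of dicts, prefixing every right-hand column name."""
    # Index the right-hand rows by key (keys are assumed to be unique).
    right_index = {row[right_key]: row for row in right_rows}
    # Columns to fill with None when a left row has no match.
    right_columns = list(right_rows[0]) if right_rows else []
    joined = []
    for left in left_rows:
        key = left[left_key]
        # As in SQL, a missing key never matches anything.
        match = right_index.get(key) if key is not None else None
        right_part = {
            f"{prefix}{col}": (match[col] if match is not None else None)
            for col in right_columns
        }
        joined.append({**left, **right_part})
    return joined


# Invented example rows.
genomes = [
    {"accession": "G1", "ena-study-accession": "ERP001"},
    {"accession": "G2", "ena-study-accession": "ERP002"},
    {"accession": "G3", "ena-study-accession": None},
]
studies = [{"accession": "MGYS1", "secondary-accession": "ERP001"}]

joined = left_join_with_prefix(
    genomes, studies, "ena-study-accession", "secondary-accession", "studies_"
)
# Distinct study references that found a partner in the studies table.
matched_studies = {
    row["ena-study-accession"]
    for row in joined
    if row["studies_secondary-accession"] is not None
}
print(len(joined), "rows,", len(matched_studies), "matched study reference(s)")
```

A left join keeps every genome row, which is why the row count stays at 9421 in the notebook while only some rows carry study information.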

### Join the `samples` dataset on the genomes-studies dataset

In [29]:
samples.filter(F.col('id').isNotNull()).count()

24925

In [30]:
samples.filter(F.col('id').isNotNull()).select('id').distinct().count()

24925

**Output:** Each of the 24925 sample entries has a unique identifier.

In [31]:
new_df.filter(F.col('`attributes.ena-sample-accession`').isNotNull()).count()

8440

In [32]:
new_df.filter(F.col('`attributes.ena-sample-accession`').isNotNull()).select('`attributes.ena-sample-accession`').distinct().count()

7183

**Output:** Among the 9421 genome entries, 8440 have a reference to a sample ID. These references cover only 7183 distinct samples, so several genomes can be related to the same sample.

In [33]:
final_df = samples.join(
              reduce(lambda df, x: df.withColumnRenamed(x, f'g_{x}'),
                   new_df.columns,
                   new_df,
                    ), 
              F.col('`g_attributes.ena-sample-accession`')==F.col('id'), 
              'full',
          )

In [34]:
final_df.count()

34249

In [35]:
len(final_df.columns)

74

**Output:** The final dataframe has 34249 rows and 74 columns.

In [36]:
final_df.filter((F.col('`g_attributes.ena-sample-accession`').isNotNull()) & (F.col('id').isNotNull())).count()

156

In [37]:
final_df.filter((F.col('`g_attributes.ena-sample-accession`').isNotNull()) & (F.col('id').isNotNull())).select('id').distinct().count()

97

**Output:** 97 samples have at least one corresponding entry in the genomes dataset (156 matching rows in total).
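
The row count of the full join can be cross-checked from the numbers reported in the cells above. The short calculation below is a sanity check rather than part of the original analysis. It relies on the fact that sample IDs are unique, so each genome row can match at most one sample.

```python
# Counts reported in the cells above.
n_samples = 24925        # rows in `samples`, each with a unique id
n_genomes = 9421         # rows in `new_df`
n_matched_rows = 156     # joined rows where both sides are present
n_matched_samples = 97   # distinct sample ids among those rows

# Sample ids are unique, so every matched row corresponds to exactly one genome.
unmatched_samples = n_samples - n_matched_samples
unmatched_genomes = n_genomes - n_matched_rows

# A full join keeps matched rows plus the unmatched rows from both sides.
expected_rows = n_matched_rows + unmatched_samples + unmatched_genomes
print(expected_rows)  # 34249, the value returned by final_df.count()
```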

## First map example: Interactive Map representing the Number of Genomes according to their geographic origin

In [38]:
final_df.filter(F.col('`g_attributes.geographic-origin`').isNotNull()).count()

9421

**Outputs:** All the genomes (9421 entries) present in the dataset (see Section: **Load the `genomes` dataset**) have a non-null geographic origin.

In [39]:
continents = [continent[0] for continent in final_df.filter(F.col('`g_attributes.geographic-origin`').isNotNull()).groupBy('`g_attributes.geographic-origin`').count().toLocalIterator()]

In [40]:
continents

['Europe',
 'Africa',
 'North America',
 'South America',
 'not provided',
 'Oceania',
 'Asia']

**Outputs:** The list of the continents represented. As we can see, although every entry has a value, that value can be `not provided`.

In [41]:
genomes_count = [genomes[1] for genomes in final_df.filter(F.col('`g_attributes.geographic-origin`').isNotNull()).groupBy('`g_attributes.geographic-origin`').count().toLocalIterator()]

In [42]:
genomes_count

[3748, 829, 1361, 93, 1708, 342, 1340]

**Outputs:** The list of counts per continent. The `not provided` count is 1708 and represents `approx. 18%` of the genome data.

These genomes will be excluded from the map.
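
The percentage quoted above can be reproduced from the two lists, and the `not provided` entry can be dropped explicitly before plotting. The sketch below uses the values printed in the previous cells. Pairing names and counts in a single dictionary is a design choice of this sketch: it avoids relying on two separate Spark jobs returning the groups in the same order, which is not guaranteed in general. The name `mappable_data` is hypothetical.

```python
# Values printed by the cells above.
continents = ['Europe', 'Africa', 'North America', 'South America',
              'not provided', 'Oceania', 'Asia']
genomes_count = [3748, 829, 1361, 93, 1708, 342, 1340]

# Pair each origin with its count so the two can no longer get out of step.
counts_by_origin = dict(zip(continents, genomes_count))

total = sum(counts_by_origin.values())
share_not_provided = counts_by_origin['not provided'] / total
print(f"{share_not_provided:.1%} of {total} genomes have no origin")

# Keep only the origins that can be drawn on the map.
mappable = {k: v for k, v in counts_by_origin.items() if k != 'not provided'}
mappable_data = dict(
    geographical_origin=list(mappable),
    count=list(mappable.values()),
)
```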

In [43]:
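# Note: `gpd.datasets` was deprecated in geopandas 0.14 and removed in 1.0,
# so this cell requires geopandas < 1.0.
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))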

In [44]:
world.continent.unique()

array(['Oceania', 'Africa', 'North America', 'Asia', 'South America',
       'Europe', 'Seven seas (open ocean)', 'Antarctica'], dtype=object)

**Outputs:** `world` represents our base map that will be filled with the data. 

In [45]:
# Create a dictionary of the data that will be used for the map
continents_data = dict(
    geographical_origin = continents,
    count = genomes_count
)

In [46]:
continents_data

{'geographical_origin': ['Europe',
  'Africa',
  'North America',
  'South America',
  'not provided',
  'Oceania',
  'Asia'],
 'count': [3748, 829, 1361, 93, 1708, 342, 1340]}

In [47]:
# Parameters for the map display (will be reused in the next section)
world_limits = coord_map(ylim=[-80, 85])
map_theme = theme(axis_line='blank', axis_text='blank', axis_ticks='blank', axis_title='blank')

In [48]:
# Join `data` dictionary and `map` world using the `map_join` parameter.
(ggplot()
 + geom_map(aes(fill='count'), 
             data=continents_data, 
             map=world,
             map_join=['geographical_origin', 'continent'], 
             color="white",
             tooltips=layer_tooltips().line("^fill count"))
 + ggtitle('Number of genomes according to their geographical origin')
 + scale_fill_gradient(low='#79f8f8', high="#cf0a1d", name="Number of genomes")
 + ggsize(800, 400)
 + world_limits
 + map_theme
) 

**Output:** Interactive map representing the count of genomes according to their geographic origin. 

Hovering over the map with the mouse cursor shows the exact genome count per continent.

## Second map example: Number of Samples per country vs. Number of Genomes per country.

In [49]:
final_df.filter(F.col('geo-loc-name').isNotNull()).count()

3846

**Output:** Out of our 24925 sample entries, 3846 have a country name (`geo-loc-name`) for geolocation.

In [50]:
samples_country_generator = (final_df
                 .filter(F.col('geo-loc-name').isNotNull())
                 .select(F.regexp_extract(F.col('geo-loc-name'), r'^((\w+)|[\w\s]+)+(''|;|:)', 0).alias('country'))
                 .select(F.when(F.col('country').isin('GAZ', 'USA'), 'United States of America').otherwise(F.col('country')).alias('country'))
                 .groupby('country')
                 .count()
                 .toLocalIterator())

samples_country_count = [data for data in samples_country_generator]

In [51]:
# Samples count per country
samples_country_data = dict(
    countries = [country[0] for country in samples_country_count],
    sample_count = [count[1] for count in samples_country_count]
)

In [52]:
genomes_country_generator = (final_df
                             .filter(F.col('g_id').isNotNull())
                            .filter(F.col('geo-loc-name').isNotNull())
                            .select(F.regexp_extract(F.col('geo-loc-name'), r'^((\w+)|[\w\s]+)+(''|;|:)', 0).alias('country'))
                            .select(F.when(F.col('country').isin('GAZ', 'USA'), 'United States of America').otherwise(F.col('country')).alias('country'))
                            .groupby('country')
                            .count()
                            .toLocalIterator())

genomes_country_count = [data for data in genomes_country_generator]

In [53]:
# Genomes count per country
genomes_country_data = dict(
    countries = [country[0] for country in genomes_country_count],
    genome_count = [count[1] for count in genomes_country_count]
)

**Output:** We generate 2 dictionaries containing the data for each of the maps. 
Generating the data is more complex here than in the previous example, because the original values have to be transformed:
1. Only the `country` name is kept from `geo-loc-name` (in order to match the map data).
2. The names do not all share the same `format` (in our case, either the full country name or the ISO 3166 alpha-3 code), so they are mapped to the names used by the map.
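
The two transformation steps listed above can be sketched in plain Python as follows. This is a simplified stand-in for the Spark expression, not a copy of it: it splits on the first `:` or `;` instead of using the notebook's regular expression, `normalise_country` is a hypothetical helper, and the example `geo-loc-name` values are invented.

```python
import re
from collections import Counter

# Alias table: the notebook maps 'GAZ' and 'USA' to the name used by the map.
ALIASES = {
    "USA": "United States of America",
    "GAZ": "United States of America",
}


def normalise_country(geo_loc_name):
    """Keep the text before the first ':' or ';' and apply the alias table."""
    country = re.split(r"[:;]", geo_loc_name, maxsplit=1)[0].strip()
    return ALIASES.get(country, country)


# Invented example values.
examples = ["USA: California", "Germany", "Pacific Ocean; station 12", "USA"]
country_counts = Counter(normalise_country(name) for name in examples)
print(country_counts)
```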

In [54]:
set(samples_country_data['countries']+genomes_country_data['countries']).difference(world.name)

{'Arctic Ocean',
 'Atlantic Ocean',
 'Indian Ocean',
 'Pacific Ocean',
 'Western Tropical North Atlantic Ocean',
 'missing'}

In [55]:
final_df.filter((F.col('geo-loc-name').isNotNull()) & (F.col('geo-loc-name').endswith('Ocean'))).count()

152

In [56]:
final_df.filter((F.col('geo-loc-name').isNotNull()) & (F.col('geo-loc-name').contains('missing'))).count()

1

**Output:** One `drawback` of the map used in this example is that the samples and genomes retrieved from an ocean are ignored.

In our example, 152 samples are located in an ocean, which represents approx. `4% of the samples` that have a `geo-loc-name`, and only 1 sample has a `geo-loc-name` recorded as 'missing'.
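
The check performed above is a plain set difference between the names found in the data and the names known to the base map. A minimal sketch, with an invented subset of map names standing in for `world.name`, together with the share computed from the two counts reported in this section:

```python
# Names found in the data (invented subset) and names known to the map
# (hypothetical stand-in for `world.name`).
data_countries = {"Germany", "United States of America", "Pacific Ocean", "missing"}
map_names = {"Germany", "United States of America", "France"}

# Names that cannot be drawn because the map has no matching polygon.
unmatched = sorted(data_countries - map_names)
print(unmatched)

# Share of ocean entries among the entries that have a geo-loc-name,
# using the counts reported above (152 and 3846).
ocean_share = 152 / 3846
print(f"{ocean_share:.1%}")
```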

In [57]:
# The world map, the world_limits and the map_theme are reused from the previous sections. 

def get_map_plot(data: dict, col_fill: str, col_map: str, title: str, legend: str):
    return (ggplot()
        + geom_map(aes(fill=col_fill), 
                 data=data, 
                 map=world, map_join=[col_map, 'name'], 
                 color="white",
                 tooltips=layer_tooltips().line("^fill count"))
        + ggtitle(title)
        + scale_fill_gradient(low='#79f8f8', high="#cf0a1d", name=legend)
        + ggsize(800, 400)
        + world_limits
        + map_theme
    ) 

In [58]:
w, h = 480, 320
offset = 15 

bunch = GGBunch()
bunch.add_plot(get_map_plot(samples_country_data, 'sample_count', 'countries', 'Number of samples per country', 'Number of samples'), 0, 0, w, h)
bunch.add_plot(get_map_plot(genomes_country_data, 'genome_count', 'countries', 'Number of genomes per country', 'Number of genomes'), w + offset, 0, w, h)
bunch

**Output:** Interactive maps representing the count of samples vs. the count of genomes for a given country. 

Hovering over a map with the mouse cursor shows the exact count per country.

## Third map example: Use of the sample latitude and longitude to bring samples, genomes and studies together on an interactive map.

In [59]:
final_df.filter((F.col('latitude').isNotNull()) & (F.col('longitude').isNotNull())).count()

18132

**Output:** Out of our 24925 sample entries, 18132 are referenced with their latitude and longitude.

In [60]:
coord_samples = final_df.filter((F.col('latitude').isNotNull()) & (F.col('longitude').isNotNull())).select('latitude', 'longitude', 'id', 'g_id', 'g_studies_id')

In [61]:
coord_samples.show(n=10)

+--------+---------+----------+----+------------+
|latitude|longitude|        id|g_id|g_studies_id|
+--------+---------+----------+----+------------+
| 54.3384|    10.12|ERS1237320|null|        null|
| 54.3384|    10.12|ERS1237321|null|        null|
| 54.3384|    10.12|ERS1237322|null|        null|
| 54.3384|    10.12|ERS1237323|null|        null|
| 54.3384|    10.12|ERS1237325|null|        null|
| 54.3384|    10.12|ERS1237326|null|        null|
| 54.3384|    10.12|ERS1237327|null|        null|
| 54.3384|    10.12|ERS1237333|null|        null|
| 54.3384|    10.12|ERS1237334|null|        null|
| 54.3384|    10.12|ERS1237335|null|        null|
+--------+---------+----------+----+------------+
only showing top 10 rows



In [62]:
window_geo = W.partitionBy('latitude', 'longitude').rowsBetween(W.unboundedPreceding, W.unboundedFollowing)

In [63]:
coord_samples = (final_df
                 .filter((F.col('latitude').isNotNull()) & (F.col('longitude').isNotNull()))
                 .select('latitude', 'longitude', 'id', 'g_id', 'g_studies_id')
                 .withColumn('size', F.count('latitude').over(window_geo))
                 .withColumn("genome_id", F.collect_set("g_id").over(window_geo))
                 .withColumn("sample_id", F.collect_set("id").over(window_geo))
                 .withColumn("study_id", F.collect_set("g_studies_id").over(window_geo))
                 .select('latitude', 'longitude', 'size', 'genome_id', 'sample_id', 'study_id')
                 .distinct()
                )
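
For readers less familiar with window functions, the aggregation above can be pictured as grouping the rows by coordinate pair, counting them, and collecting the distinct IDs. The plain-Python sketch below illustrates that logic on invented rows. `aggregate_by_coordinates` is a hypothetical helper, and missing values are skipped when collecting IDs, as `collect_set` does. Note that `size` counts joined rows, so a sample linked to several genomes contributes more than once.

```python
from collections import defaultdict

# Invented rows with the same columns as the Spark selection.
rows = [
    {"latitude": 54.3384, "longitude": 10.12, "id": "S1", "g_id": None, "g_studies_id": None},
    {"latitude": 54.3384, "longitude": 10.12, "id": "S2", "g_id": "G1", "g_studies_id": "ST1"},
    {"latitude": 1.0, "longitude": 2.0, "id": "S3", "g_id": None, "g_studies_id": None},
]

# Output column -> input column.
ID_COLUMNS = {"sample_id": "id", "genome_id": "g_id", "study_id": "g_studies_id"}


def aggregate_by_coordinates(rows):
    """Group rows by (latitude, longitude); count rows and collect distinct IDs."""
    groups = defaultdict(
        lambda: {"size": 0, "sample_id": set(), "genome_id": set(), "study_id": set()}
    )
    for row in rows:
        group = groups[(row["latitude"], row["longitude"])]
        group["size"] += 1
        for target, source in ID_COLUMNS.items():
            # Skip missing values when collecting IDs.
            if row[source] is not None:
                group[target].add(row[source])
    return dict(groups)


points = aggregate_by_coordinates(rows)
for coords, info in points.items():
    print(coords, info["size"], sorted(info["sample_id"]))
```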

In [64]:
coord_samples.filter(F.col('size')>200).show(truncate=False)

+--------+---------+----+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

In [65]:
# Convert the spark DataFrame to Pandas for dictionary generation.
pdf = coord_samples.toPandas()

**Outputs**: Pandas DataFrame. Generating lists with the function `toLocalIterator()` is costly. For big datasets, the function `toPandas()` is recommended when working with Spark DataFrames.

In [66]:
# Data for the map, taken column by column from the pandas DataFrame
# (equivalent to collecting each column with toLocalIterator(), but cheaper).
coord_samples_dict = dict(
    longitude = pdf['longitude'],
    latitude = pdf['latitude'],
    sample_size = pdf['size'],
    genome_id = pdf['genome_id'],
    sample_id = pdf['sample_id'],
    study_id = pdf['study_id']
)

In [67]:
(ggplot() 
    + geom_livemap(tiles=tilesets.OPEN_TOPO_MAP, location=[10, 30], zoom=2, data_size_zoomin=2)
    + geom_point(aes('longitude', 'latitude', size='sample_size'),
              data=coord_samples_dict, 
              shape=21,
              alpha=.7,
              color='white',
              tooltips=layer_tooltips()
                        .title('@Study id|@study_id')
                        .line('Sample id|@sample_id')
                        .line('Genome id|@genome_id')
                        .line('Number of samples|@sample_size') 
                        .line('Longitude|^x')
                        .line('Latitude|^y')
               )
    + ggsize(1140, 860)
    + scale_size(range=[2, 20], trans='identity')
) 

**Output:** Interactive map representing the geolocation of the samples. Each dot size is proportional to the number of samples. It is possible to zoom in and out on the map.

Hovering over the map with the mouse cursor shows the `study` ID if available, the number of samples, and the latitude and longitude. 
The `sample` IDs and `genome` IDs can also be displayed; however, with the current settings, the tooltip is no longer readable when there are too many samples or genomes. 
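
One possible way to keep long ID lists readable in a tooltip is to shorten them before building the data dictionary. The sketch below is a suggestion, not something the notebook does: `summarise_ids` and its `max_shown` parameter are hypothetical, and the IDs are invented.

```python
def summarise_ids(ids, max_shown=3):
    """Return a short, readable label for a collection of IDs."""
    ids = sorted(ids)
    if len(ids) <= max_shown:
        return ", ".join(ids)
    hidden = len(ids) - max_shown
    return ", ".join(ids[:max_shown]) + f" (+{hidden} more)"


# Invented IDs.
print(summarise_ids(["S2", "S1"]))
print(summarise_ids(["S4", "S1", "S3", "S2", "S5"]))
```

Under these assumptions, the function could be applied to the ID columns of the pandas DataFrame (for example with `pdf['sample_id'].apply(summarise_ids)`) before the dictionary is passed to the plot.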