# Exploring the Ag1000G Phase 3 dataset (from MalariaGEN)

### Isabelle Rajendiran

The aim of this project is to use the genomic dataset from the Anopheles gambiae 1000 Genomes project (Ag1000G) to estimate historical and contemporary effective population size (Ne).

### 0. Installing and importing packages

This project uses the malariagen_data package, which needs to be installed along with the other relevant packages.

In [1]:
# INSTALLING AND IMPORTING PACKAGES

#!pip install -q malariagen_data
#!pip install ipyleaflet

import pandas as pd
import numpy as np
import dask
import dask.array as da
from dask.diagnostics.progress import ProgressBar
# silence some warnings
dask.config.set(**{'array.slicing.split_large_chunks': False})
import allel; print('scikit-allel', allel.__version__)
import malariagen_data
import ipyleaflet
from ipyleaflet import Map, basemaps
import plotly.express as px

scikit-allel 1.3.5


In [2]:
# IMPORT API
# AG3 DATA ACCESS FROM GOOGLE CLOUD

ag3 = malariagen_data.Ag3(pre=True) # pre=True (a boolean, not the string 'True') is needed to include data from all releases (3.0-3.7 at the time of this project)
ag3

MalariaGEN Ag3 API client
Please note that data are subject to terms of use, for more information see the MalariaGEN website or contact data@malariagen.net. See also the Ag3 API docs.
Storage URL,gs://vo_agam_release/
Data releases available,"3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8"
Results cache,
Cohorts analysis,20230516
Species analysis,aim_20220528
Site filters analysis,dt_20200416
Software version,malariagen_data 7.3.1
Client location,"England, GB"


If installed correctly, the API client details should be displayed as above.

EDIT: When this project was conducted, only releases 3.0-3.7 were available.

### 1. Exploring the sample sets

Each release (e.g. 3.0) organises its data into sample sets, each of which corresponds to a set of mosquito specimens contributed by a collaborating study.
For example, Ag3.0 data are organised into 28 sample sets.

#### 1.1 Release 3.0

In [None]:
# LOAD SAMPLE SETS FOR RELEASE 3.0

df_sample_set_3_0 = ag3.sample_sets(release="3.0")

print(df_sample_set_3_0)

Taken from the MalariaGEN GitHub: "The sample set identifiers all start with “AG1000G-” followed by the two-letter code of the country from which samples were collected (e.g., “AO” is Angola). Where there are multiple sample sets from the same country, these have been given alphabetical suffixes, e.g., “AG1000G-BF-A”, “AG1000G-BF-B” and “AG1000G-BF-C” are three sample sets from Burkina Faso."
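
The naming convention quoted above can be illustrated with a small parser. This is only a sketch: the pattern and the helper name `parse_sample_set_id` are my own, not part of malariagen_data, and identifiers that do not follow the quoted convention (such as the later `NNNN-VO-...` sample sets) simply return `None`.

```python
import re

# "AG1000G-" + two-letter country code + optional single-letter suffix
# (hypothetical pattern written from the description quoted above)
SAMPLE_SET_PATTERN = re.compile(r"^AG1000G-(?P<country>[A-Z]{2})(?:-(?P<suffix>[A-Z]))?$")


def parse_sample_set_id(sample_set):
    """Return (country_code, suffix) or None if the ID does not match the pattern."""
    match = SAMPLE_SET_PATTERN.match(sample_set)
    if match is None:
        return None
    return match.group("country"), match.group("suffix")


for sample_set in ["AG1000G-AO", "AG1000G-BF-A", "1288-VO-UG-DONNELLY-VMF00219"]:
    print(sample_set, "->", parse_sample_set_id(sample_set))
```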

In [None]:
# LOADING THE SAMPLE METADATA

df_sample_metadata_3_0= ag3.sample_metadata(sample_sets="3.0")

print(df_sample_metadata_3_0.columns) 

df_sample_metadata_3_0

#### 1.2 All releases (3.0-3.7)

In [3]:
# LOAD SAMPLE METADATA AND SAMPLE SETS

pd.set_option('display.max_rows', 10) 

df_sample_set_all = ag3.sample_sets()
print(df_sample_set_all)

df_sample_metadata_all= ag3.sample_metadata()
df_sample_metadata_all

                          sample_set  sample_count release
0                         AG1000G-AO            81     3.0
1                       AG1000G-BF-A           181     3.0
2                       AG1000G-BF-B           102     3.0
3                       AG1000G-BF-C            13     3.0
4                         AG1000G-CD            76     3.0
..                               ...           ...     ...
64      1288-VO-UG-DONNELLY-VMF00219           739     3.8
65  1314-VO-BF-KIENTEGA-KIMA-BF-2104           179     3.8
66   1315-VO-NG-OMITOLA-OMOL-NG-2008           117     3.8
67   1326-VO-UG-KAYONDO-KAJO-UG-2203           975     3.8
68                    tennessen-2021           208     3.8

[69 rows x 3 columns]



Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call,...,admin1_name,admin1_iso,admin2_name,taxon,cohort_admin1_year,cohort_admin1_month,cohort_admin1_quarter,cohort_admin2_year,cohort_admin2_month,cohort_admin2_quarter
0,AR0047-C,LUA047,Joao Pinto,Angola,Luanda,2009,4,-8.884,13.302,F,...,Luanda,AO-LUA,Luanda,coluzzii,AO-LUA_colu_2009,AO-LUA_colu_2009_04,AO-LUA_colu_2009_Q2,AO-LUA_Luanda_colu_2009,AO-LUA_Luanda_colu_2009_04,AO-LUA_Luanda_colu_2009_Q2
1,AR0049-C,LUA049,Joao Pinto,Angola,Luanda,2009,4,-8.884,13.302,F,...,Luanda,AO-LUA,Luanda,coluzzii,AO-LUA_colu_2009,AO-LUA_colu_2009_04,AO-LUA_colu_2009_Q2,AO-LUA_Luanda_colu_2009,AO-LUA_Luanda_colu_2009_04,AO-LUA_Luanda_colu_2009_Q2
2,AR0051-C,LUA051,Joao Pinto,Angola,Luanda,2009,4,-8.884,13.302,F,...,Luanda,AO-LUA,Luanda,coluzzii,AO-LUA_colu_2009,AO-LUA_colu_2009_04,AO-LUA_colu_2009_Q2,AO-LUA_Luanda_colu_2009,AO-LUA_Luanda_colu_2009_04,AO-LUA_Luanda_colu_2009_Q2
3,AR0061-C,LUA061,Joao Pinto,Angola,Luanda,2009,4,-8.884,13.302,F,...,Luanda,AO-LUA,Luanda,coluzzii,AO-LUA_colu_2009,AO-LUA_colu_2009_04,AO-LUA_colu_2009_Q2,AO-LUA_Luanda_colu_2009,AO-LUA_Luanda_colu_2009_04,AO-LUA_Luanda_colu_2009_Q2
4,AR0078-C,LUA078,Joao Pinto,Angola,Luanda,2009,4,-8.884,13.302,F,...,Luanda,AO-LUA,Luanda,coluzzii,AO-LUA_colu_2009,AO-LUA_colu_2009_04,AO-LUA_colu_2009_Q2,AO-LUA_Luanda_colu_2009,AO-LUA_Luanda_colu_2009_04,AO-LUA_Luanda_colu_2009_Q2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
15031,SAMN15222632,D342,Jacob Tennessen,Burkina Faso,Tengrela,2016,-1,10.700,-4.800,F,...,Cascades,BF-02,Comoe,coluzzii,BF-02_colu_2016,BF-02_colu_2016,BF-02_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016
15032,SAMN15222633,D343,Jacob Tennessen,Burkina Faso,Tengrela,2016,-1,10.700,-4.800,F,...,Cascades,BF-02,Comoe,coluzzii,BF-02_colu_2016,BF-02_colu_2016,BF-02_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016
15033,SAMN15222634,D346,Jacob Tennessen,Burkina Faso,Tengrela,2016,-1,10.700,-4.800,F,...,Cascades,BF-02,Comoe,coluzzii,BF-02_colu_2016,BF-02_colu_2016,BF-02_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016
15034,SAMN15222635,D347,Jacob Tennessen,Burkina Faso,Tengrela,2016,-1,10.700,-4.800,F,...,Cascades,BF-02,Comoe,coluzzii,BF-02_colu_2016,BF-02_colu_2016,BF-02_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016,BF-02_Comoe_colu_2016


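Across releases 3.0-3.7, there are 63 sample sets and 12775 samples in total. The output above also lists sample sets from release 3.8, which became available after this project was conducted.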

#### 1.4 Presenting the data as pivot tables

To select suitable samples, the data are first summarised as pivot tables counting the number of samples by country, year and taxon (and by sample set, for use in later steps).

In [5]:
# CONVERTING THE METADATA INFORMATION INTO PIVOT TABLES

# Display all rows
pd.set_option('display.max_rows', None) 

# Grouping the data by the desired columns (ORDER OF GROUPING MATTERS)
combined_data = pd.DataFrame(df_sample_metadata_all.groupby(['country','sample_set','year','taxon']).size()) 

# Convert into a pivot table for easier visualisation
combined_data_sample_set_table = pd.pivot_table(combined_data, values=None, index=['country','sample_set', 'year'], columns=['taxon']).fillna(0).astype(int) 
combined_data_sample_set_table

country,sample_set,year,arabiensis,coluzzii,fontenillei,gambiae,gcx1,gcx2,gcx3,goundry,melas_suspected,unassigned
Angola,AG1000G-AO,2009,0,81,0,0,0,0,0,0,0,0
Benin,1237-VO-BJ-DJOGBENOU-VMF00050,2017,0,90,0,0,0,0,0,0,0,0
Benin,1237-VO-BJ-DJOGBENOU-VMF00067,2017,0,78,0,64,0,0,0,0,0,0
Benin,1237-VO-BJ-DJOGBENOU-VMF00212,2018,0,380,0,21,0,0,0,0,50,0
Burkina Faso,1191-VO-MULTI-OLOUGHLIN-VMF00106,2014,0,3,0,28,0,0,0,0,0,0
Burkina Faso,1191-VO-MULTI-OLOUGHLIN-VMF00140,2014,9,90,0,94,0,0,0,0,0,2
Burkina Faso,1191-VO-MULTI-OLOUGHLIN-VMF00140,2015,88,183,0,141,0,0,0,0,0,0
Burkina Faso,1191-VO-MULTI-OLOUGHLIN-VMF00140,2016,14,167,0,104,0,0,0,0,0,1
Burkina Faso,1191-VO-MULTI-OLOUGHLIN-VMF00140,2017,1,164,0,37,0,0,0,0,0,0
Burkina Faso,AG1000G-BF-A,2012,0,82,0,99,0,0,0,0,0,0


In [None]:
# REPEATED FOR A TABLE WITHOUT SAMPLE SETS

combined_data_omit_sample_set = pd.DataFrame(df_sample_metadata_all.groupby(['country','year','taxon']).size()) 
combined_data_table = pd.pivot_table(combined_data_omit_sample_set, values=None, index=['country', 'year'], columns=['taxon']).fillna(0).astype(int)  
combined_data_table

In [None]:
# DROPPING IRRELEVANT TAXA

# List of taxa with few samples
taxa_to_drop = ['fontenillei','gcx1','gcx2','gcx3','goundry','melas_suspected','unassigned']

# Dropping them from the pivot table
# axis=1 drops columns rather than rows; level=1 is needed because the pivot table columns are a MultiIndex
combined_data_table_dropped = combined_data_table.drop(labels=taxa_to_drop, axis=1, level=1)
combined_data_table_dropped

In [None]:
# REPEATED FOR THE OTHER TABLE

taxa_to_drop = ['fontenillei','gcx1','gcx2','gcx3','goundry','melas_suspected','unassigned']
combined_data_table_dropped_2 = combined_data_sample_set_table.drop(labels=taxa_to_drop, axis=1, level=1)
combined_data_table_dropped_2
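
The group, pivot and drop steps above can also be written more compactly. The sketch below is an alternative, not the approach used in this notebook: it uses `unstack` so that the columns are a plain index of taxa (no `level=1` needed) and `reindex` to keep only the taxa of interest. The `toy_metadata` rows are made up for illustration and only reuse the column names of the real sample metadata.

```python
import pandas as pd

# Toy metadata with the same column names as the real sample metadata.
# The rows are invented for illustration only.
toy_metadata = pd.DataFrame({
    "country": ["Angola", "Angola", "Mali", "Mali", "Mali", "Mali"],
    "year": [2009, 2009, 2012, 2012, 2014, 2014],
    "taxon": ["coluzzii", "coluzzii", "coluzzii", "gambiae", "gambiae", "unassigned"],
})

# Count samples per (country, year, taxon), then spread taxa across columns
counts = (
    toy_metadata
    .groupby(["country", "year", "taxon"])
    .size()
    .unstack("taxon", fill_value=0)
)

# Keep only the taxa of interest; taxa absent from the data become columns of zeros
taxa_to_keep = ["arabiensis", "coluzzii", "gambiae"]
counts = counts.reindex(columns=taxa_to_keep, fill_value=0)

print(counts)
```

With this layout, `counts.reset_index()` gives a flat table that can be passed to `to_csv(index=False)`.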

#### 1.5 Exporting the tables as .csv files

In [None]:
# EXPORTING THE TABLES TO .CSV

# Converting the pivot tables to dataframes
omitted_sample_set_df = pd.DataFrame(combined_data_table_dropped.to_records())

# Correcting the column names and adding a header row called 'taxa'
# Create a new index with "taxa" as the top level and the existing column names as the second level
new_columns = pd.MultiIndex.from_arrays([['', '', 'taxa', 'taxa', 'taxa'], ['country', 'year', 'arabiensis', 'coluzzii', 'gambiae']])

# Assign the new index to the columns of the dataframe
omitted_sample_set_df.columns = new_columns

# Export the table to CSV file
omitted_sample_set_df.to_csv('Omitted_sample_set_df.csv',index=False)


In [None]:
# DISPLAY THE MODIFIED DATAFRAME

omitted_sample_set_df

In [None]:
# REPEATED FOR THE OTHER TABLE

sample_set_df =  pd.DataFrame(combined_data_table_dropped_2.to_records())
new_columns = pd.MultiIndex.from_arrays([['', '', '','taxa', 'taxa', 'taxa'], ['country','sample_set', 'year', 'arabiensis', 'coluzzii', 'gambiae']])
sample_set_df.columns = new_columns
sample_set_df.to_csv('Sample_set_df.csv',index=False)

### 2. Selecting candidate countries

From the tables above, I selected candidate countries that had a large number of samples (minimum 50) at time points taken 2 to 5 years apart, counting each taxon separately.

Note: Anopheles gambiae, Anopheles coluzzii and Anopheles arabiensis are hereafter referred to as gambiae, coluzzii and arabiensis.

Combined sample_sets:

- Burkina Faso for coluzzii and gambiae: 2012, 2014, 2015, 2016 and 2017
- Cameroon for coluzzii and gambiae: 2019 with 2021 for both.
- Democratic Republic of the Congo for gambiae: 2016/2017 (150+) with 2020 (145)
- Ghana for coluzzii: 2016 to 2018 (1000+). 2017 is also available but is too close in years. 2012 is possible (would need to add all species together)
- Mali for coluzzii: 2013 (80) with 2015 (346); gambiae: 2012 (70) with 2014/2015
- Tanzania for arabiensis: 2015 (137) with 2018 (467) (or 2019 (97))
- Uganda for gambiae: 2012 (207) with 2017, or 2017 (210) with 2019 (134)
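
The selection rule described above (at least 50 samples per taxon, time points 2 to 5 years apart) can be sketched as a small helper. The function name `candidate_year_pairs` and its parameters are my own; the counts are the Tanzania and Uganda figures from the list above. In this notebook the selection was done by eye from the pivot tables, so this is only an illustration of the rule.

```python
from itertools import combinations


def candidate_year_pairs(counts_by_year, min_samples=50, min_gap=2, max_gap=5):
    """Return (earlier, later) year pairs where both years have at least
    min_samples samples and are min_gap to max_gap years apart."""
    eligible = sorted(year for year, n in counts_by_year.items() if n >= min_samples)
    return [
        (earlier, later)
        for earlier, later in combinations(eligible, 2)
        if min_gap <= later - earlier <= max_gap
    ]


# Sample counts per year for one taxon in one country, as listed above
tanzania_arabiensis = {2015: 137, 2018: 467, 2019: 97}
uganda_gambiae = {2012: 207, 2017: 210, 2019: 134}

print("Tanzania:", candidate_year_pairs(tanzania_arabiensis))
print("Uganda:", candidate_year_pairs(uganda_gambiae))
```

For the real metadata, a dictionary of this shape could be built with something like `df.groupby('year').size().to_dict()` after filtering to one country and taxon.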


Corresponding sample_sets:
- Burkina Faso: 1191-VO-MULTI-OLOUGHLIN-VMF00140 (2014-2017), 1191-VO-MULTI-OLOUGHLIN-VMF00106 (2014), AG1000G-BF-A (2012), AG1000G-BF-B (2014)

- Cameroon: 1270-VO-MULTI-PAMGEN-VMF00193-VMF00202 (2021 (+2020)), 1281-VO-CM-CHRISTOPHE-VMF00208 (2019-2021)

- Democratic Republic of the Congo: 1264-VO-CD-WATSENGA-VMF00161 (2013-2018), 1264-VO-CD-WATSENGA-VMF00164 (2019-2020)

- Ghana: 1190-VO-GH-AMENGA-ETEGO-VMF00013 (2016), 1190-VO-GH-AMENGA-ETEGO-VMF00047 (2018 (+2017)), 1190-VO-GH-AMENGA-ETEGO-VMF00088 (2018), 1190-VO-GH-AMENGA-ETEGO-VMF00102 (2018), 1244-VO-GH-YAWSON-VMF00051 (2017-2018), 1244-VO-GH-YAWSON-VMF00149 (2018). (AG1000G-GH (2012))

- Mali: 1177-VO-ML-LEHMANN-VMF00004 (2012-2015), 1177-VO-ML-LEHMANN-VMF00015 (2013-2015), 1191-VO-MULTI-OLOUGHLIN-VMF00106 (2014-2015), AG1000G-GN-B (2012), AG1000G-ML-A (2014)

- Tanzania: 1246-VO-TZ-KABULA-VMF00185 (2018), 1246-VO-TZ-KABULA-VMF00197 (2019), AG1000G-TZ (2015(+2012,2013))

- Uganda: 1178-VO-UG-LAWNICZAK-VMF00025 (2017 (+2013,2017)), 1288-VO-UG-DONNELLY-VMF00168 (2017-2019), AG1000G-UG (2012)

#### 2.1 Exploring the location metadata for the potential countries

The samples for a country need to come from roughly the same geographical location (as well as number more than 50 and be taken at least 2 years apart). Ideally, a country would have samples for both species (A. gambiae and A. coluzzii). The list below separates the candidates by whether they have both species or a single species:

Both species:
- Burkina Faso for coluzzii and gambiae: 2012, 2014, 2015, 2016 and 2017 [2012-2014 has existing publications]
- Cameroon for coluzzii and gambiae: 2019 with 2021 for both.
- Mali for coluzzii: 2013 (80) with 2015 (346); gambiae: 2012 (70) with 2014/2015

Single species:
- Democratic Republic of the Congo for gambiae: 2016/2017 (150+) with 2020 (145)
- Ghana for coluzzii: 2016 to 2018 (+1000).
- Tanzania for arabiensis: 2015 (137) with 2018 (467) (or 2019 (97))
- Uganda for gambiae: 2012 (207) with 2017 or, 2017 (210) with 2019 (134)

#### Cameroon

In [None]:
# LOADING METADATA (LOCATION) FOR CAMEROON 
# Cameroon: 1270-VO-MULTI-PAMGEN-VMF00193-VMF00202 (2021 (+2020)), 1281-VO-CM-CHRISTOPHE-VMF00208 (2019-2021) for relevant years

# Limiting the number of rows shown
pd.set_option('display.max_rows', 10)

# Loading metadata for the relevant sample sets
Cameroon_metadata = ag3.sample_metadata(sample_sets=['1270-VO-MULTI-PAMGEN-VMF00193-VMF00202','1281-VO-CM-CHRISTOPHE-VMF00208'],
                                       sample_query =  "country == 'Cameroon' ")
Cameroon_metadata

There are 744 samples across all locations and years for Cameroon.

In [None]:
# DISPLAYING THE LOCATION (AND YEAR) AS A PIVOT TABLE

# Displays all rows
pd.set_option('display.max_rows', None)

# Group the metadata by location, year and taxon
Cameroon_metadata_location = pd.DataFrame(Cameroon_metadata.groupby(['location','year','taxon']).size())

# Convert to a pivot table 
Cameroon_metadata_location = pd.pivot_table(Cameroon_metadata_location, values=None, index=['location','year'], columns=['taxon']).fillna(0).astype(int)

# Dropping irrelevant taxa, then keeping only 2019 and 2021 (this excludes 2020)
Cameroon_metadata_location = Cameroon_metadata_location.drop(labels=['arabiensis','unassigned'],axis=1, level=1)
Cameroon_metadata_location.query("year == 2021 or year == 2019") # .query() selects and displays the rows of interest


No single location has more than 50 samples at two different time points. To judge how far apart the sampling locations are (and so whether nearby locations could be grouped), the samples were plotted onto a map.

In [None]:
# PLOTTING THE SAMPLES ONTO A MAP

# Selecting the samples of interest
# The country filter matches the one used when loading the metadata, as one of the sample sets is multi-country
sample_query = "(year == 2019 or year == 2021) and (taxon == 'coluzzii' or taxon == 'gambiae') and (country == 'Cameroon')"

# Plotting the samples onto a map using ipyleaflet
# basemap changes the type of map shown (the available maps are listed in the ipyleaflet manual)
ag3.plot_samples_interactive_map(sample_sets=['1270-VO-MULTI-PAMGEN-VMF00193-VMF00202','1281-VO-CM-CHRISTOPHE-VMF00208'],
                                 sample_query=sample_query,
                                 basemap=basemaps.Esri.WorldStreetMap, center=(-2, 20), zoom=3, min_samples=1)


The locations are quite distant (>100 km apart), so I am arbitrarily choosing not to group them. Alternatively, FST could be calculated between the locations to measure population differentiation and determine genetically whether they form one population.
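
As a rough illustration of the FST alternative mentioned above, the sketch below implements Hudson's estimator (in its ratio-of-averages form) with NumPy for biallelic SNPs. FST was not calculated in this notebook; the function and the allele counts are made up for illustration. In practice the allele counts would come from the Ag3 SNP calls, and scikit-allel's `allel.hudson_fst` could be used instead of this hand-written version.

```python
import numpy as np


def hudson_fst(alt_counts_1, n_chrom_1, alt_counts_2, n_chrom_2):
    """Hudson's FST across biallelic SNPs (ratio of summed numerators and denominators).

    alt_counts_* : alternate-allele counts per SNP in each population
    n_chrom_*    : number of sampled chromosomes (2 x number of diploid individuals)
    Assumes at least one SNP is variable, otherwise the denominator is zero.
    """
    p1 = np.asarray(alt_counts_1, dtype=float) / n_chrom_1
    p2 = np.asarray(alt_counts_2, dtype=float) / n_chrom_2
    num = (p1 - p2) ** 2 - p1 * (1 - p1) / (n_chrom_1 - 1) - p2 * (1 - p2) / (n_chrom_2 - 1)
    den = p1 * (1 - p2) + p2 * (1 - p1)
    return num.sum() / den.sum()


# Invented counts for 5 SNPs with 20 chromosomes sampled per location
location_a = [10, 4, 18, 0, 7]
location_b = [11, 5, 17, 1, 6]
fst_similar = hudson_fst(location_a, 20, location_b, 20)

# Two SNPs fixed for different alleles in the two locations
fst_fixed = hudson_fst([20, 0], 20, [0, 20], 20)

print(f"similar frequencies: {fst_similar:.3f}")
print(f"fixed differences:   {fst_fixed:.3f}")
```

Values near zero (or slightly negative, from sampling noise) suggest little differentiation, whereas values approaching 1 indicate strongly differentiated populations.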

#### Mali

In [6]:
# LOADING METADATA (LOCATION) FOR MALI
# Mali: 1177-VO-ML-LEHMANN-VMF00004 (2012-2015), 1177-VO-ML-LEHMANN-VMF00015 (2013-2015), 1191-VO-MULTI-OLOUGHLIN-VMF00106 (2014-2015), AG1000G-GN-B (2012), AG1000G-ML-A (2014)

# Displays only 10 rows
pd.set_option('display.max_rows', 10)

# Loading metadata for the relevant sample sets
Mali_metadata = ag3.sample_metadata(sample_sets=['1177-VO-ML-LEHMANN-VMF00004','1177-VO-ML-LEHMANN-VMF00015','1191-VO-MULTI-OLOUGHLIN-VMF00106', 'AG1000G-GN-B','AG1000G-ML-A' ],
                                   sample_query = "country == 'Mali' ")
Mali_metadata #1030 rows


Unnamed: 0,sample_id,partner_sample_id,contributor,country,location,year,month,latitude,longitude,sex_call,...,admin1_name,admin1_iso,admin2_name,taxon,cohort_admin1_year,cohort_admin1_month,cohort_admin1_quarter,cohort_admin2_year,cohort_admin2_month,cohort_admin2_quarter
0,VBS00256-4651STDY7017184,GP97,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F,...,Koulikouro,ML-2,Banamba,coluzzii,ML-2_colu_2012,ML-2_colu_2012_06,ML-2_colu_2012_Q2,ML-2_Banamba_colu_2012,ML-2_Banamba_colu_2012_06,ML-2_Banamba_colu_2012_Q2
1,VBS00257-4651STDY7017185,GP98,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F,...,Koulikouro,ML-2,Banamba,coluzzii,ML-2_colu_2012,ML-2_colu_2012_06,ML-2_colu_2012_Q2,ML-2_Banamba_colu_2012,ML-2_Banamba_colu_2012_06,ML-2_Banamba_colu_2012_Q2
2,VBS00259-4651STDY7017186,GP100,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F,...,Koulikouro,ML-2,Banamba,coluzzii,ML-2_colu_2012,ML-2_colu_2012_06,ML-2_colu_2012_Q2,ML-2_Banamba_colu_2012,ML-2_Banamba_colu_2012_06,ML-2_Banamba_colu_2012_Q2
3,VBS00262-4651STDY7017187,GP103,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F,...,Koulikouro,ML-2,Banamba,coluzzii,ML-2_colu_2012,ML-2_colu_2012_06,ML-2_colu_2012_Q2,ML-2_Banamba_colu_2012,ML-2_Banamba_colu_2012_06,ML-2_Banamba_colu_2012_Q2
4,VBS00277-4651STDY7017189,GP118,Tovi Lehmann,Mali,Dallowere,2012,6,13.616,-7.037,F,...,Koulikouro,ML-2,Banamba,coluzzii,ML-2_colu_2012,ML-2_colu_2012_06,ML-2_colu_2012_Q2,ML-2_Banamba_colu_2012,ML-2_Banamba_colu_2012_06,ML-2_Banamba_colu_2012_Q2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1025,AZ0434-CW,MA11-14,Austin Burt,Mali,Tieneguebougou,2014,8,12.810,-8.080,F,...,Koulikouro,ML-2,Kati,coluzzii,ML-2_colu_2014,ML-2_colu_2014_08,ML-2_colu_2014_Q3,ML-2_Kati_colu_2014,ML-2_Kati_colu_2014_08,ML-2_Kati_colu_2014_Q3
1026,AZ0416-CW,MA10-32,Austin Burt,Mali,Tieneguebougou,2014,8,12.810,-8.080,F,...,Koulikouro,ML-2,Kati,coluzzii,ML-2_colu_2014,ML-2_colu_2014_08,ML-2_colu_2014_Q3,ML-2_Kati_colu_2014,ML-2_Kati_colu_2014_08,ML-2_Kati_colu_2014_Q3
1027,AZ0444-CW,MA11-28,Austin Burt,Mali,Tieneguebougou,2014,8,12.810,-8.080,F,...,Koulikouro,ML-2,Kati,coluzzii,ML-2_colu_2014,ML-2_colu_2014_08,ML-2_colu_2014_Q3,ML-2_Kati_colu_2014,ML-2_Kati_colu_2014_08,ML-2_Kati_colu_2014_Q3
1028,AZ0484-CW,MA3-1,Austin Burt,Mali,Tieneguebougou,2014,8,12.810,-8.080,M,...,Koulikouro,ML-2,Kati,gambiae,ML-2_gamb_2014,ML-2_gamb_2014_08,ML-2_gamb_2014_Q3,ML-2_Kati_gamb_2014,ML-2_Kati_gamb_2014_08,ML-2_Kati_gamb_2014_Q3


A total of 1030 samples from Mali across all years and locations.

In [7]:
# DISPLAYING THE LOCATION (AND YEAR) AS A PIVOT TABLE

# Display all rows
pd.set_option('display.max_rows', None)

# Group the metadata by location, year and taxon
Mali_metadata_location = pd.DataFrame(Mali_metadata.groupby(['location','year','taxon']).size())

# Convert to a pivot table
Mali_metadata_location = pd.pivot_table(Mali_metadata_location, values=None, index=['location','year'], columns=['taxon']).fillna(0).astype(int)

# Dropping irrelevant taxa, then keeping only 2012, 2013 and 2015
Mali_metadata_location = Mali_metadata_location.drop(labels=['arabiensis','unassigned'],axis=1, level=1)
Mali_metadata_location.query("year == 2012 or year == 2013 or year == 2015")

location,year,coluzzii,gambiae
Dallowere,2012,28,5
Dallowere,2015,181,23
Kababougou,2015,14,10
Markabougou,2013,3,0
Ouassorola,2015,18,23
Siguima,2013,4,0
Siguima,2015,1,0
Sogolombougou,2015,9,24
Sokourani (Niono),2012,23,0
Sokourani (Niono),2013,73,0


In [8]:
# PLOTTING THE SAMPLES ONTO A MAP

# Selecting the relevant samples
sample_query = "(year == 2012 or year == 2013 or year == 2015) and (taxon == 'coluzzii' or taxon == 'gambiae') and (country == 'Mali')"

#Plotting the samples on a map
ag3.plot_samples_interactive_map(sample_sets=['1177-VO-ML-LEHMANN-VMF00004','1177-VO-ML-LEHMANN-VMF00015','1191-VO-MULTI-OLOUGHLIN-VMF00106', 'AG1000G-GN-B','AG1000G-ML-A'], 
                                 sample_query= sample_query, 
                                 basemap=basemaps.Esri.WorldStreetMap, center=(- 2, 20), zoom=3, min_samples=1)

[Interactive map output: sampling locations of the selected Mali samples]

Grouped regions:
- Tieneguebougou, Sogolombougou, Ouassorola, Kababougou (very close cluster)

Perhaps groupable:
- Thierola, Dallowere, Siguima - around 50 km apart
- Sokourani (Niono), Markabougou - around 50 km apart
- Takan, Toumani Oulena - around 100 km apart

The first close cluster is definitely groupable. The others would need justification.
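
The distances above were judged from the map. As a sketch of how they could be checked numerically, the great-circle (haversine) distance between two locations can be computed from the `latitude` and `longitude` columns of the sample metadata. The helper `haversine_km` is my own; the coordinates are those shown for Dallowere and Tieneguebougou in the metadata output above.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


# (latitude, longitude) as shown in the Mali sample metadata
dallowere = (13.616, -7.037)
tieneguebougou = (12.810, -8.080)

distance_km = haversine_km(*dallowere, *tieneguebougou)
print(f"Dallowere to Tieneguebougou: {distance_km:.0f} km")
```

Applying the same function to every pair of locations would give a distance matrix that could support (or challenge) the groupings proposed above.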

Taken from previously: 'Mali for coluzzii: 2013 (80) with 2015 (346); gambiae: 2012 (70) with 2015 (2014 possible too)'

Coluzzii:
- 2013: Markabougou (3), Siguima (4), Sokourani (Niono) (73)
- 2015: Dallowere (181), Ouassorola (18), Siguima (1), Sogolombougou (9), Sokourani (Niono) (104), Thierola (3), Tieneguebougou (16)

Gambiae:
- 2012: Dallowere (5), Toumani Oulena (60), Takan (5)
- 2015: Dallowere (23), Kababougou (10), Ouassorola (23), Sogolombougou (24), Tieneguebougou (12)


From this, for coluzzii, Sokourani (Niono) can be compared between 2013 (73 samples) and 2015 (104 samples), as both time points have more than 50 samples.