# **CHAPTER 6. Map of geographical distribution**

Import the modules needed for the analysis (just `pandas`)

In [1]:
import pandas as pd

Read the metadata retrieved from `RefSeq` with `Phyloki` in the first step of the study

In [2]:
df = pd.read_csv("../01_PanPhylo_analysis/metadata/metadata.tsv", sep="\t")

Let us see the whole dataframe

In [3]:
df

Unnamed: 0,AN,AN_OrganismName,Country,Year,Host
0,NC_082276.1,NC_082276.1 Antarctomyces psychrotrophicus,Antarctica,ND,ND
1,NC_082275.1,NC_082275.1 Thelebolus microsporus,Antarctica,ND,ND
2,NC_061762.1,NC_061762.1 Ciborinia camelliae,Italy,2016,Camellia japonica
3,NC_069201.1,NC_069201.1 Spathularia flavida,China,2021,ND
4,NC_071210.1,NC_071210.1 Erysiphe quercicola,China,ND,ND
5,NC_070174.1,NC_070174.1 Pseudofabraea citricarpa,ND,ND,ND
6,NC_056195.1,NC_056195.1 Monilinia fructicola,Turkey,ND,peach
7,NC_056148.1,NC_056148.1 Golovinomyces cichoracearum,ND,ND,ND
8,NC_056147.1,NC_056147.1 Erysiphe pisi,ND,ND,ND
9,NC_056146.1,NC_056146.1 Erysiphe necator,ND,ND,ND


Good, but we only need the 5 organisms of interest for further analysis. Let's filter the dataframe

In [4]:
# List of organism names to search for
organisms = [
    "Pseudogymnoascus destructans",
    "Pseudogymnoascus pannorum",
    "Thelebolus microsporus",
    "Antarctomyces pellizariae",
    "Antarctomyces psychrotrophicus"
]

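# Filter the DataFrame: keep rows whose name contains any organism of interest.
# .copy() makes the result independent of df, so the .loc assignments below
# do not trigger SettingWithCopyWarning
filtered_df = df[df["AN_OrganismName"].apply(lambda x: any(org in x for org in organisms))].copy()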

# Display result
filtered_df

Unnamed: 0,AN,AN_OrganismName,Country,Year,Host
0,NC_082276.1,NC_082276.1 Antarctomyces psychrotrophicus,Antarctica,ND,ND
1,NC_082275.1,NC_082275.1 Thelebolus microsporus,Antarctica,ND,ND
11,NC_048507.1,NC_048507.1 Antarctomyces pellizariae,Brazil,2015,ND
13,NC_033907.1,NC_033907.1 Pseudogymnoascus destructans,"USA: New York, Williams Hotel",2008,Myotis lucifugus
16,NC_027422.1,NC_027422.1 Pseudogymnoascus pannorum,ND,ND,ND
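
The same substring filter can also be written without `apply`. The following is a minimal sketch on a toy table (three rows copied from the output above, not the real `metadata.tsv`); the names `toy_df`, `organisms_of_interest` and `toy_filtered` are illustrative only.

```python
import re
import pandas as pd

# Toy stand-in for the metadata table (rows copied from the output above)
toy_df = pd.DataFrame({
    "AN_OrganismName": [
        "NC_082276.1 Antarctomyces psychrotrophicus",
        "NC_061762.1 Ciborinia camelliae",
        "NC_033907.1 Pseudogymnoascus destructans",
    ],
    "Country": ["Antarctica", "Italy", "USA: New York, Williams Hotel"],
})

organisms_of_interest = [
    "Pseudogymnoascus destructans",
    "Antarctomyces psychrotrophicus",
]

# Build one regex alternation; re.escape guards against special characters
pattern = "|".join(re.escape(name) for name in organisms_of_interest)

# str.contains returns a boolean mask over the column;
# .copy() gives an independent DataFrame that is safe to modify later
mask = toy_df["AN_OrganismName"].str.contains(pattern, regex=True)
toy_filtered = toy_df[mask].copy()
print(toy_filtered)
```

Both approaches select the same rows; the vectorized form is mostly a matter of taste at this table size.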


Perfect! Yet there seems to be a data artefact... Why does _Antarctomyces pellizariae_ have `Brazil` in the `Country` column? Let's see the whole `source` metadata from its `RefSeq` entry!

In [5]:
%%bash

head -2 ../02_Comparative_genomics/data/ap.gb
head -52 ../02_Comparative_genomics/data/ap.gb | tail -11

LOCUS       NC_048507              30121 bp    DNA     circular PLN 03-APR-2023
DEFINITION  Antarctomyces pellizariae strain UFMGCB 12416 mitochondrion,
FEATURES             Location/Qualifiers
     source          1..30121
                     /organism="Antarctomyces pellizariae"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /strain="UFMGCB 12416"
                     /isolation_source="Antarctic continent"
                     /type_material="holotype of Antarctomyces pellizariae"
                     /db_xref="taxon:1955577"
                     /geo_loc_name="Brazil"
                     /collection_date="Dec-2015"
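
To compare the two location fields programmatically, the qualifiers can be pulled out of the text with a regular expression. This is a rough sketch that works on an excerpt of the output above pasted as a string; `parse_qualifiers` is a hypothetical helper, not part of `Phyloki`, and it only handles simple quoted `/key="value"` qualifiers.

```python
import re

# Excerpt of the `source` feature printed above
source_block = '''
     source          1..30121
                     /organism="Antarctomyces pellizariae"
                     /isolation_source="Antarctic continent"
                     /geo_loc_name="Brazil"
                     /collection_date="Dec-2015"
'''

def parse_qualifiers(text):
    """Return a dict of /key="value" qualifiers found in a GenBank feature block."""
    return dict(re.findall(r'/(\w+)="([^"]*)"', text))

qualifiers = parse_qualifiers(source_block)
print(qualifiers["geo_loc_name"])      # Brazil
print(qualifiers["isolation_source"])  # Antarctic continent
```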


It seems that the researchers were from `Brazil` and did the study in `Brazil`, yet the fungus itself was isolated in `Antarctica` (see `/isolation_source`)! So, we change `Brazil` to `Antarctica`!

In [None]:
filtered_df.loc[filtered_df["AN_OrganismName"] == "NC_048507.1 Antarctomyces pellizariae", "Country"] = "Antarctica"

One more step: we need to change `USA: New York, Williams Hotel` to `United States of America`, the country name that `Natural Earth` works with

In [None]:
filtered_df.loc[filtered_df["AN_OrganismName"] == "NC_033907.1 Pseudogymnoascus destructans", "Country"] = "United States of America"
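
With more entries, fixing each row by hand would not scale. A small helper could strip the locality after the colon and translate the country name. The sketch below is an assumption about how that might look: `NATURAL_EARTH_NAMES` and `normalize_country` are hypothetical names, and the mapping holds only the one entry needed here.

```python
# Hypothetical mapping from the names found in the metadata to the names
# expected by Natural Earth; only the entry needed here is filled in
NATURAL_EARTH_NAMES = {"USA": "United States of America"}

def normalize_country(value):
    """Drop the locality after ':' and map the country to its Natural Earth name."""
    country = value.split(":", 1)[0].strip()
    return NATURAL_EARTH_NAMES.get(country, country)

print(normalize_country("USA: New York, Williams Hotel"))  # United States of America
print(normalize_country("Antarctica"))                     # Antarctica
print(normalize_country("ND"))                             # ND
```

Applied to the table, this would be `filtered_df["Country"].map(normalize_country)`; values without a mapping pass through unchanged.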

Let's see the metadata now

In [8]:
filtered_df

Unnamed: 0,AN,AN_OrganismName,Country,Year,Host
0,NC_082276.1,NC_082276.1 Antarctomyces psychrotrophicus,Antarctica,ND,ND
1,NC_082275.1,NC_082275.1 Thelebolus microsporus,Antarctica,ND,ND
11,NC_048507.1,NC_048507.1 Antarctomyces pellizariae,Antarctica,2015,ND
13,NC_033907.1,NC_033907.1 Pseudogymnoascus destructans,United States of America,2008,Myotis lucifugus
16,NC_027422.1,NC_027422.1 Pseudogymnoascus pannorum,ND,ND,ND


Perfect! Unfortunately, _Pseudogymnoascus pannorum_ will be left out of the further analysis because its `RefSeq` entry lacks location information. Let's save the table

In [9]:
filtered_df.to_csv("map_metadata.tsv", sep="\t", index=False)
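
Note that the saved table still contains the _Pseudogymnoascus pannorum_ row with `ND` as its country. A downstream step could drop such rows before plotting. Here is a minimal sketch on a toy table (`toy_meta` and `mappable` are illustrative names, and treating `ND` as the missing-data marker is taken from the output above):

```python
import pandas as pd

# Toy stand-in for map_metadata.tsv (values copied from the output above)
toy_meta = pd.DataFrame({
    "AN": ["NC_082275.1", "NC_033907.1", "NC_027422.1"],
    "Country": ["Antarctica", "United States of America", "ND"],
})

# Keep only rows whose Country is known ("ND" marks missing data in this table)
mappable = toy_meta[toy_meta["Country"] != "ND"].copy()
print(mappable["AN"].tolist())
```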

Finally, we need to download the whole `Natural Earth` database in `GeoPackage` format

In [None]:
! wget https://naciscdn.org/naturalearth/packages/natural_earth_vector.gpkg.zip

In [None]:
! unzip natural_earth_vector.gpkg.zip -d ./natural_earth_vector && rm -rf natural_earth_vector.gpkg.zip
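
On systems without `wget` or `unzip`, the same two steps can be done with the Python standard library. This is only a sketch of an equivalent, not what the notebook ran; `download` and `extract` are hypothetical helper names, and the usage lines are commented out because the archive is large.

```python
import urllib.request
import zipfile
from pathlib import Path

def download(url, target):
    """Download `url` to the file path `target` and return it as a Path."""
    target = Path(target)
    urllib.request.urlretrieve(url, target)
    return target

def extract(zip_path, dest_dir):
    """Unpack `zip_path` into `dest_dir` and return the archive member names."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest_dir)
        return archive.namelist()

# Usage (commented out: it downloads a large archive):
# archive = download(
#     "https://naciscdn.org/naturalearth/packages/natural_earth_vector.gpkg.zip",
#     "natural_earth_vector.gpkg.zip",
# )
# extract(archive, "./natural_earth_vector")
# archive.unlink()  # remove the zip afterwards, as `rm` does above
```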