In [20]:
import re
import pandas as pd
import numpy as np
import PyPDF2

In [2]:
pd.set_option("display.max_columns", 85)
pd.set_option("display.max_rows", 85)

### NaturDoc - TL BL WT 22-23

# Data acquisition and exploration:

Initial datasets found by the pitcher:
* one dataset behind a paywall
* a dataset of images for training the scanning feature, but no dataset containing information on the symptoms herbs are used against, possible treatments, etc.

We quickly realised that we would need to look for other datasets or find possible sources of scraping. We defined the one-sentence-pitch of our app as follows:

_Create an app to diagnose simple sickness based on form inputs, suggest nature-plant-based remedies or recipes to treat simple sickness._

We researched ethnobotany and folk medicine sources, notebooks on Kaggle, etc. The datasets and data sources we found were sometimes too simplistic, sometimes too specialised, and sometimes in formats that are difficult to work with. One dataset we found promising was the Duke Ethnobotanical dataset.

### Duke:

In [3]:
duke_df = pd.read_csv("./data/DUKE/ETHNOBOT.csv")

In [6]:
duke_df.head()

Unnamed: 0,ETHNO,ACTIVITY,GENUS,SPECIES,SPAUT,SPRANK,SPXNAM,SPXAUT,FAMILY,CNAME,COUNTRY,REFERENCE,LONGREF,EFFECTIVE,TAXON,TAXAUTHOR,USERID,CREATED,MODIFIED
0,1,Abcess,Achillea,millefolium,,,,,Asteraceae,,,Uphof,,,Achillea millefolium,,DUKE,02-FEB-98,
1,2,Abcess,Castilla,elastica,,,,,Moraceae,,Mexico,Standley,,,Castilla elastica,,DUKE,02-FEB-98,
2,3,Abcess,Pistia,stratiotes,,,,,Araceae,,Elsewhere,"Duke,1972",,,Pistia stratiotes,,DUKE,02-FEB-98,
3,4,Abdomen,Achillea,sibirica,,,,,Asteraceae,,China,Hunan,,,Achillea sibirica,,DUKE,02-FEB-98,
4,5,Abdomen,Adenia,lobata,,,,,Passifloraceae,,Upper Volta,Ayensu,,,Adenia lobata,,DUKE,02-FEB-98,


Each row of the dataset describes one treatment: it pairs an "activity" (= the ailment targeted by this specific treatment) with a specific plant. It also gives the reference of the research that described this relationship between plant and treatment, along with other information such as the country, genus, species, family and common name.

#### "Activities":

The "activities", i.e. curative properties ascribed to a given herb, cover a wide range of ailments and conditions.

In [8]:
activities = duke_df["ACTIVITY"].unique()
len(activities)

2235

As we can see, the dataset describes over 2200 unique activities. Sampling every 25th value gives an impression of their range:

In [20]:
print(*activities[::25].tolist(), sep=" --- ")

Abcess --- Acne --- Amebiasis --- Anodyne --- Antidote(Ipoh) --- Antidote(Cyanide) --- Antidote(Rhus) --- Antifertility --- Antismoking --- Appendicitis --- Ataxia(Locomotor) --- Beri-Beri --- Bitter-Aromatic --- Ache(Bones) --- Bunion --- Cancer(Eye) --- Cancer(Thyroid) --- Caries --- Cervix --- Cholera --- Colic(Veterinary) --- Copremia --- Cyanache --- Dentition --- Digestive disease --- Dye --- Egilops --- Enteromegaly --- Excitant --- Fear --- Finger --- Fracture(Veterinary) --- Gall Bladder --- Giddiness --- Gunshot --- Heart --- Hemolytic --- Hirudicide --- Hypertension --- Indigestion --- Intestinal-Troubles --- Kneecap --- Leg --- Liver --- Madness --- Medical --- Metroxenia --- Mucus --- Necklace --- Nightsweats --- Oncolytic --- Oxyuriasis --- Paroxysm --- Philtre --- Pneumonia --- Porridge --- Preventitive(Disease) --- Proctitis --- Puberty --- Rain-Making --- Repellant(Moth) --- Rib --- Scab(Veterinary) --- Sclerosis(Testicle) --- Shortwindedness --- Sore(Eye) --- Stasis -

The described disease or condition ranges from rather general descriptions, such as aches, to specific ailments and injuries such as gunshot wounds, diabetes and so on. Non-medical categories such as "spice" are also included. Because of this wide range of descriptors, the activities will need to be cleaned up further down the line.
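
One possible first cleaning step, sketched under assumptions: many labels in the sample above follow a `Base(Qualifier)` pattern (e.g. `Cancer(Breast)`), and some are not medical at all (e.g. `Dye`, `Porridge`). The helper `split_activity` and the hand-picked `NON_MEDICAL` set are hypothetical names used only for illustration; they are not part of the dataset or of any library.

```python
import re

# Assumed label shape, inferred from the sample output: "Base" or "Base(Qualifier)".
ACTIVITY_PATTERN = re.compile(r"^(?P<base>[^()]+?)\s*(?:\((?P<qualifier>[^()]*)\))?$")

# Hand-picked, illustrative set of labels that do not describe an ailment.
NON_MEDICAL = {"Dye", "Porridge", "Necklace", "Rain-Making", "Spice", "Perfume", "Liqueur"}


def split_activity(label):
    """Split 'Cancer(Breast)' into ('Cancer', 'Breast'); the qualifier is None if absent."""
    cleaned = label.strip()
    match = ACTIVITY_PATTERN.match(cleaned)
    if match is None:
        # Labels that do not fit the assumed shape are passed through unchanged.
        return cleaned, None
    return match.group("base"), match.group("qualifier")


sample = ["Abcess", "Cancer(Breast)", "Fracture(Veterinary)", "Rain-Making", "Dye"]
parsed = [split_activity(label) for label in sample]

# Keep only the base labels that are not in the illustrative exclusion set.
medical_bases = [base for base, _ in parsed if base not in NON_MEDICAL]
```

Applied to the real column, the same helper could be mapped over `duke_df["ACTIVITY"]`; whether qualifiers such as `Veterinary` should lead to exclusion is a separate decision.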

#### Herbs:

Herbs are mentioned by their taxonomic name and, if available, a common name.
* Open question: are several common names for the same herb spread across different entries?

In [21]:
herbs = duke_df["TAXON"].unique()

In [61]:
len(herbs)

13079

In total, the dataset contains data about over 13000 unique herbs that have been subjected to ethnobotanical or phytochemical research.

In [69]:
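# Collect every distinct TAXON / CNAME pairing that occurs in the dataset.
herbs_list = []
for taxon, cname in zip(duke_df["TAXON"], duke_df["CNAME"]):
    herbs_dict = {"TAXON": taxon, "CNAME": cname}
    # Linear membership test: simple, but slow for tens of thousands of rows.
    if herbs_dict not in herbs_list:
        herbs_list.append(herbs_dict)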

Pairing of taxonomic name and common name (if available):

In [76]:
print(*herbs_list[::1000], sep=" --- ")

{'TAXON': 'Achillea millefolium', 'CNAME': nan} --- {'TAXON': 'Lepidium sativum', 'CNAME': nan} --- {'TAXON': 'Nymphaea odorata', 'CNAME': 'Kokulu Nilufer'} --- {'TAXON': 'Crithmum maritimum', 'CNAME': 'Meerfenchel'} --- {'TAXON': 'Ocotea bullata', 'CNAME': 'Black Stinkwood'} --- {'TAXON': 'Typhonium orixense', 'CNAME': nan} --- {'TAXON': 'Viola odorata', 'CNAME': 'Banaf Shah'} --- {'TAXON': 'Curcuma domestica', 'CNAME': 'Kurkuma Uzum'} --- {'TAXON': 'Spondias mombin', 'CNAME': 'Mombin Franc'} --- {'TAXON': 'Hordeum vulgare', 'CNAME': 'Cebada'} --- {'TAXON': 'Dioscorea crispata', 'CNAME': nan} --- {'TAXON': 'Buddleja brasiliensis', 'CNAME': nan} --- {'TAXON': 'Befaria aestuans', 'CNAME': 'Bejar'} --- {'TAXON': 'Nesaea salicifolia', 'CNAME': nan} --- {'TAXON': 'Gastrochilus panduratum', 'CNAME': 'Dekunchi'} --- {'TAXON': 'Alocasia macrorrhiza', 'CNAME': 'Kuan Yin Lien'} --- {'TAXON': 'Cymbidium sp', 'CNAME': nan} --- {'TAXON': 'Hedychium philippense', 'CNAME': nan} --- {'TAXON': 'Cucurb

In [77]:
len(herbs_list)

25728

The list is almost twice as long as the count of unique herbs. A likely explanation:
* A herb can have several common names, each mentioned in a different entry, so every distinct "TAXON" + "CNAME" combination ends up as its own dictionary in the list.
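
This explanation can also be tested across all rows by counting distinct pairs and distinct common names per taxon. The sketch below uses toy rows (names taken from the outputs in this notebook, with `None` standing in for a missing common name) and plain Python sets, which also avoids the slow list membership test; the variable names are illustrative.

```python
from collections import defaultdict

# Toy (TAXON, CNAME) rows; None stands in for a missing common name.
rows = [
    ("Crithmum maritimum", "Meerfenchel"),
    ("Crithmum maritimum", "Sea Fennel"),
    ("Crithmum maritimum", "Sea Fennel"),  # same pair again, e.g. listed for a second activity
    ("Achillea millefolium", None),
    ("Zingiber officinale", "Ginger"),
]

# One entry per distinct TAXON + CNAME combination.
unique_pairs = set(rows)

# Distinct common names per taxon; taxa without any common name keep an empty set.
names_per_taxon = defaultdict(set)
for taxon, cname in rows:
    names = names_per_taxon[taxon]
    if cname is not None:
        names.add(cname)

# Taxa with more than one common name explain why pairs outnumber taxa.
multi_named = {taxon for taxon, names in names_per_taxon.items() if len(names) > 1}
```

On the real data the rows could come from `zip(duke_df["TAXON"], duke_df["CNAME"])`, with missing values mapped to `None` first.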

In [78]:
for herb in herbs_list:
    if herb["TAXON"] == "Crithmum maritimum":
        print(herb)

{'TAXON': 'Crithmum maritimum', 'CNAME': 'Meerfenchel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Fenouil Marin'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Finochio Marino'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Sea Fennel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Zeevenkel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Sanpetra'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Fennel,Sea'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Hinojo Marino'}


In [73]:
filt = duke_df["TAXON"] == "Crithmum maritimum"
duke_df.loc[filt, "CNAME"]

4225         Meerfenchel
15463      Fenouil Marin
17239    Finochio Marino
17862         Sea Fennel
42763          Zeevenkel
46606           Sanpetra
52103         Fennel,Sea
57628      Hinojo Marino
Name: CNAME, dtype: object

This needs to be kept in mind further down the line when transforming data from the dataset.
* Create a dataframe with one row per unique herb (== "TAXON"), containing lists of all its "CNAME" and "ACTIVITY" values.
    * Ignore the rest of the data?
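
A minimal sketch of such a per-herb table, assuming pandas named aggregation is an acceptable approach. The toy frame reuses names from the outputs above; the two ACTIVITY values for Crithmum maritimum are placeholders, since its activities are not shown in this notebook. `unique_sorted` and `per_herb` are illustrative names.

```python
import pandas as pd

# Toy rows with the three DUKE columns of interest.
toy_df = pd.DataFrame({
    "TAXON": [
        "Crithmum maritimum", "Crithmum maritimum",
        "Zingiber officinale", "Zingiber officinale",
        "Achillea millefolium",
    ],
    "CNAME": ["Meerfenchel", "Sea Fennel", "Ginger", "Ginger", None],
    # "Placeholder A/B" are stand-ins, not real DUKE activities.
    "ACTIVITY": ["Placeholder A", "Placeholder B", "Carminative", "Stimulant", "Abcess"],
})


def unique_sorted(series):
    """Return the sorted distinct non-missing values of a column as a list."""
    return sorted(series.dropna().unique())


# One row per TAXON, with list-valued CNAMES and ACTIVITIES columns.
per_herb = (
    toy_df.groupby("TAXON")
    .agg(CNAMES=("CNAME", unique_sorted), ACTIVITIES=("ACTIVITY", unique_sorted))
    .reset_index()
)
```

Running the same aggregation on `duke_df` would give one row per taxon; columns such as COUNTRY or REFERENCE could be added to the `agg` call if they turn out to be needed.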

##### Ex. Ginger:

In [79]:
filt_ginger = (duke_df["CNAME"] == "Ginger")
duke_df[filt_ginger]

Unnamed: 0,ETHNO,ACTIVITY,GENUS,SPECIES,SPAUT,SPRANK,SPXNAM,SPXAUT,FAMILY,CNAME,COUNTRY,REFERENCE,LONGREF,EFFECTIVE,TAXON,TAXAUTHOR,USERID,CREATED,MODIFIED
9573,9574,Cancer,Zingiber,officinale,,,,,Zingiberaceae,Ginger,China,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
9650,9651,Cancer(Breast),Zingiber,officinale,,,,,Zingiberaceae,Ginger,US,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
10795,10796,Carminative,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
45326,45327,Sore,Zingiber,officinale,,,,,Zingiberaceae,Ginger,China,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
46069,46070,Spice,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
55629,55630,Wen,Zingiber,officinale,,,,,Zingiberaceae,Ginger,UK,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
62901,62902,Perfume,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
62907,62908,Stimulant,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
62913,62914,Tumor(Hand),Zingiber,officinale,,,,,Zingiberaceae,Ginger,Indonesia,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
73314,73315,Liqueur,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,


## KEW Datasets:

The website is thorough and well made, but the downloadable CSVs do not contain all the information shown on the website. Some of the datasets contain links to the web pages of the individual plants, but a closer look reveals that these pages do not follow a standardised layout. Scraping the website for the missing information therefore seems futile.

#### DWCA:

In [6]:
dwca_dis_df = pd.read_csv("./data/DWCA/wcvp_distribution.csv", sep="|")
dwca_rep_df = pd.read_csv("./data/DWCA/wcvp_replacementNames.csv", sep="|")

The wcvp_taxon.csv file is extremely large (not pushed to GitHub).

In [7]:
dwca_tax_df = pd.read_csv("./data/DWCA/wcvp_taxon.csv", sep="|")
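
If memory becomes a problem, one option is to load only the columns needed for cross-referencing. This is a sketch on a tiny in-memory stand-in for the pipe-separated file, not something measured on the real wcvp_taxon.csv; the choice of columns in `wanted` is an assumption.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for the pipe-separated file (two rows from the head output).
toy_csv = io.StringIO(
    "taxonid|family|genus|scientfiicname|taxonomicstatus|taxonremarks\n"
    "1|Orchidaceae|Aa|Aa|Accepted|Costa Rica to NW. Venezuela and N. Argentina\n"
    "4|Orchidaceae|Aa|Aa brevis|Synonym|Peru\n"
)

# Assumed subset of columns that is enough for name matching.
wanted = ["taxonid", "scientfiicname", "taxonomicstatus"]

slim_df = pd.read_csv(
    toy_csv,
    sep="|",
    usecols=wanted,  # skip all other columns while parsing
    dtype={"taxonomicstatus": "category"},  # few distinct values, so store them compactly
)
```

For the real file, `toy_csv` would be replaced by the path `./data/DWCA/wcvp_taxon.csv`.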

### Head:

When cross-referencing, these datasets might be useful for their information on where the plants occur. This warrants a more thorough examination.

In [82]:
dwca_dis_df.head()

Unnamed: 0,coreid,locality,establishmentmeans,locationid,occurrencestatus,threatstatus
0,1,Argentina Northeast,,TDWG:AGE,,
1,1,Argentina Northwest,,TDWG:AGW,,
2,1,Bolivia,,TDWG:BOL,,
3,1,Brazil North,,TDWG:BZN,,
4,1,Colombia,,TDWG:CLM,,


In [85]:
dwca_rep_df.head()

Unnamed: 0,taxonid,relatednameusageid,relationtype,remarks
0,250443,250232,replacement name,
1,281366,298758,replacement name,
2,2595809,2595916,replacement name,
3,336840,49782,replacement name,
4,2569633,2569235,replacement name,


In [86]:
dwca_tax_df.head()

Unnamed: 0,taxonid,family,genus,specificepithet,infraspecificepithet,scientfiicname,scientfiicnameauthorship,taxonrank,taxonomicstatus,acceptednameusageid,parentnameusageid,originalnameusageid,namepublishedin,nomenclaturalstatus,taxonremarks,scientificnameid,dynamicproperties,references
0,1,Orchidaceae,Aa,,,Aa,Rchb.f.,Genus,Accepted,1.0,,,Xenia Orchid. 1: 18 (1854),,Costa Rica to NW. Venezuela and N. Argentina,ipni:28572-1,"{""powoid"":""28572-1"",""lifeform"":"""",""climate"":""""...",https://powo.science.kew.org/taxon/urn:lsid:ip...
1,2,Orchidaceae,Aa,achalensis,,Aa achalensis,Schltr.,Species,Accepted,2.0,1.0,,Repert. Spec. Nov. Regni Veg. 16: 358 (1920),,"NC. Argentina (La Rioja, San Luis, Córdoba)",ipni:303631-2,"{""powoid"":""303631-2"",""lifeform"":""tuberous geop...",https://powo.science.kew.org/taxon/urn:lsid:ip...
2,3,Orchidaceae,Aa,argyrolepis,,Aa argyrolepis,Rchb.f.,Species,Accepted,3.0,1.0,,Xenia Orchid. 1: 18 (1854),,W. South America to N. Brazil,ipni:614525-1,"{""powoid"":""614525-1"",""lifeform"":""tuberous geop...",https://powo.science.kew.org/taxon/urn:lsid:ip...
3,4,Orchidaceae,Aa,brevis,,Aa brevis,Schltr.,Species,Synonym,132066.0,,,Repert. Spec. Nov. Regni Veg. Beih. 9: 51 (1921),,Peru,ipni:301821-2,"{""powoid"":""301821-2"",""lifeform"":"""",""climate"":""...",https://powo.science.kew.org/taxon/urn:lsid:ip...
4,5,Orchidaceae,Aa,calceata,,Aa calceata,(Rchb.f.) Schltr.,Species,Accepted,5.0,1.0,7138.0,Repert. Spec. Nov. Regni Veg. 11: 150 (1912),,Peru to Bolivia,ipni:1008443-2,"{""powoid"":""1008443-2"",""lifeform"":""tuberous geo...",https://powo.science.kew.org/taxon/urn:lsid:ip...


"taxonremarks" to aid in localising the plants?

In [103]:
len(dwca_tax_df["taxonremarks"].unique().tolist())

80139

In [93]:
dwca_val = dwca_tax_df["taxonremarks"].value_counts()

In [108]:
print(dwca_val[::2000])

France                                                                                                           12422
China to Korea                                                                                                      28
Kazakhstan (Karatau)                                                                                                12
Iraq to Afghanistan                                                                                                  7
Irkutsk (Siberia, Asia-Temperate)                                                                                    5
S. India (Tirunelveli Hills)                                                                                         4
Ecuador (Loja) to Peru                                                                                               3
WC. Canada to NW. & WC. U.S.A.                                                                                       3
SE. California to SW. Arizona and Mexico (NE. Ba

Too many distinct values, and in an inconsistent format?
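
To illustrate the inconsistency, here is a rough sketch that tries to break a remark into place tokens. It assumes ranges are written as `X to Y` and that parenthesised parts are qualifiers, which is inferred from the sample above and will not hold for every value; `split_remark` is a hypothetical helper.

```python
import re


def split_remark(remark):
    """Split a free-text range such as 'Ecuador (Loja) to Peru' into rough place tokens."""
    parts = re.split(r"\s+to\s+", remark)
    # Drop parenthesised qualifiers such as '(Loja)' together with the whitespace before them.
    return [re.sub(r"\s*\([^)]*\)", "", part).strip() for part in parts]


remarks = [
    "China to Korea",
    "Kazakhstan (Karatau)",
    "Ecuador (Loja) to Peru",
    "WC. Canada to NW. & WC. U.S.A.",
]
tokens = [split_remark(remark) for remark in remarks]
```

Even after splitting, tokens such as `NW. & WC. U.S.A.` are not canonical place names, which suggests that a column with structured location codes would be easier to work with than this free text.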

In [115]:
len(dwca_dis_df["locality"].unique())

411

In [117]:
dwca_dis_df["locality"].value_counts()[::10]


Colombia                  30726
Brazil North              17061
Turkey                    14527
India                     12513
Mexico Central            10704
Cuba                       9699
Transcaucasus              8746
Germany                    8122
Morocco                    7457
Switzerland                6823
Cambodia                   6355
New York                   6055
Namibia                    5824
Georgia                    5466
Puerto Rico                5127
Palestine                  4825
Maryland                   4516
Kentucky                   4363
Oklahoma                   4217
Botswana                   4131
Inner Mongolia             3915
Minnesota                  3759
Baltic States              3578
North European Russi       3313
Mauritius                  3074
Irkutsk                    2875
Nova Scotia                2599
District of Columbia       2413
Kamchatka                  2021
Gambia                     1755
Tonga                      1387
Bermuda 

Fewer values, but the format is still somewhat inconsistent?

#### WCVP:

Both of these csv files are too large to be pushed to GitHub (e.g. wcvp_distribution.csv has over 1 million entries):

In [8]:
wcvp_dis_df = pd.read_csv("./data/WCVP/wcvp_distribution.csv", sep="|")
wcvp_names_df = pd.read_csv("./data/WCVP/wcvp_names.csv", sep="|")



The distribution file contains continent, region and area columns, which seems promising.

In [118]:
wcvp_dis_df.head()

Unnamed: 0,plant_locality_id,plant_name_id,continent_code_l1,continent,region_code_l2,region,area_code_l3,area,introduced,extinct,location_doubtful
0,1394111,1,8,SOUTHERN AMERICA,80.0,Central America,COS,Costa Rica,0,0,0
1,1394112,1,8,SOUTHERN AMERICA,82.0,Northern South America,VEN,Venezuela,0,0,0
2,1394113,1,8,SOUTHERN AMERICA,83.0,Western South America,BOL,Bolivia,0,0,0
3,1394114,1,8,SOUTHERN AMERICA,83.0,Western South America,CLM,Colombia,0,0,0
4,1394115,1,8,SOUTHERN AMERICA,83.0,Western South America,ECU,Ecuador,0,0,0


In [119]:
wcvp_names_df.head()

Unnamed: 0,plant_name_id,ipni_id,taxon_rank,taxon_status,family,genus_hybrid,genus,species_hybrid,species,infraspecific_rank,infraspecies,parenthetical_author,primary_author,publication_author,place_of_publication,volume_and_page,first_published,nomenclatural_remarks,geographic_area,lifeform_description,climate_description,taxon_name,taxon_authors,accepted_plant_name_id,basionym_plant_name_id,replaced_synonym_author,homotypic_synonym,parent_plant_name_id,powo_id,hybrid_formula,reviewed
0,250443,309465-1,Species,Unplaced,Cyperaceae,,Isolepis,,panamensis,,,,J.Presl & C.Presl,,Reliq. Haenk.,1: 351,(1830),,Panama,,,Isolepis panamensis,J.Presl & C.Presl,,250232.0,J.Presl & C.Presl,,250129.0,309465-1,,Y
1,345888,77078885-1,nothosubsp.,Unplaced,Orchidaceae,,Anacamptis,×,gennarii,nothosubsp.,bornemanniae,Asch.,"H.Kretzschmar, Eccarius & H.Dietr.",,Orchid Gen. Anacamptis Orchis Neotinea,", ed. 2: 430",(2007),,Medit.,tuberous geophyte,subtropical,Anacamptis × gennarii nothosubsp. bornemanniae,"(Asch.) H.Kretzschmar, Eccarius & H.Dietr.",,142282.0,,,8578.0,77078885-1,A. morio subsp. longicornu × A. papilionacea,Y
2,2490178,231532-2,Species,Unplaced,Cucurbitaceae,,Sechium,,tacaco,,,Pittier,C.Jeffrey,,Kew Bull.,33: 361,(1978),,Costa Rica,,wet tropical,Sechium tacaco,(Pittier) C.Jeffrey,,2409709.0,,,2490165.0,231532-2,,N
3,2602950,582307-1,Species,Synonym,Siparunaceae,,Siparuna,,lindenii,,,Seem.,A.DC.,A.P.de Candolle,Prodr.,16(2): 656,(1868),,Brazil (Santa Catarina),,wet tropical,Siparuna lindenii,(Seem.) A.DC.,,2723851.0,,,,582307-1,,N
4,3143990,17540160-1,Subspecies,Unplaced,Pteridaceae,,Pentagramma,,triangularis,subsp.,maxonii,Weath.,"Yatsk., Windham & E.Wollenw.",,Amer. Fern J.,80: 16,(1990),,S. California to SW. New Mexico and Mexico (Ba...,,,Pentagramma triangularis subsp. maxonii,"(Weath.) Yatsk., Windham & E.Wollenw.",,3164043.0,,,3143986.0,17540160-1,,N
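
Both WCVP tables share the `plant_name_id` column, so the areas of a plant could be attached to its name with a join. The sketch below uses toy rows with the column names shown above; the plant names are made up, and reading `introduced == 0` as "native" is an assumption about the column's meaning.

```python
import pandas as pd

# Toy rows using the WCVP column names; ids, names and pairings are made up for illustration.
toy_names = pd.DataFrame({
    "plant_name_id": [1, 2],
    "taxon_name": ["Planta prima", "Planta secunda"],
})
toy_dist = pd.DataFrame({
    "plant_name_id": [1, 1, 2],
    "continent": ["SOUTHERN AMERICA", "SOUTHERN AMERICA", "EUROPE"],
    "area": ["Costa Rica", "Bolivia", "Germany"],
    "introduced": [0, 0, 1],
})

# Assumption: introduced == 0 marks areas where the plant is native.
native = toy_dist[toy_dist["introduced"] == 0]

# Left join keeps plants without any native area; they end up with an empty list.
areas_per_plant = (
    toy_names.merge(native, on="plant_name_id", how="left")
    .groupby("taxon_name")["area"]
    .agg(lambda areas: sorted(areas.dropna()))
)
```

The resulting Series maps each `taxon_name` to a sorted list of areas, which could later be matched against the TAXON values of the DUKE dataset.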


### Cross-referencing DUKE and KEW datasets:

In [4]:
duke_df.loc[:, "TAXON"]

0        Achillea millefolium
1           Castilla elastica
2           Pistia stratiotes
3           Achillea sibirica
4               Adenia lobata
                 ...         
82868      Jatropha multifida
82869      Myroxylon pereirae
82870     Syzygium aromaticum
82871      Gossypium hirsutum
82872    Colocasia antiquorum
Name: TAXON, Length: 82873, dtype: object

In [9]:
dwca_tax_df.loc[:, "scientfiicname"]

0                                              Aa
1                                   Aa achalensis
2                                  Aa argyrolepis
3                                       Aa brevis
4                                     Aa calceata
                            ...                  
1415002                      Crataegus sorbifolia
1415003                        Menyanthes excelsa
1415004                        Mammillaria valida
1415005                 Echinocactus neumannianus
1415006    Typhonium roxburghii var. longispathum
Name: scientfiicname, Length: 1415007, dtype: object

DUKE dataset filtered down to the entries whose scientific name matches one found in the KEW dataset:

In [10]:

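# Boolean mask: True where the DUKE taxon also appears among the KEW scientific names.
# Named in_kew rather than filter, which would shadow the Python built-in.
in_kew = duke_df.loc[:, "TAXON"].isin(dwca_tax_df.loc[:, "scientfiicname"])
filt_df = duke_df[in_kew]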
filt_df.shape

(74641, 19)

In [14]:
len(duke_df) - len(filt_df)

8232

8232 entries were thus discarded from the DUKE dataset.

(Keep in mind that every single entry is a combination of a plant with a specific activity, so this number does not tell us how many plants were dropped.)

In [19]:
print(len(duke_df["TAXON"].unique()))
print(len(filt_df["TAXON"].unique()))
len(duke_df["TAXON"].unique()) - len(filt_df["TAXON"].unique())


13079
10695


2384

Entries belonging to 2384 unique scientific plant names, i.e. names without an exact match in the KEW taxon table, were dropped from the DUKE dataset.
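
Because `isin` only finds exact string matches, it may be worth listing the unmatched names and checking whether some differ only typographically. The sketch below uses toy stand-ins for the two name columns; the trailing space in one toy value is an invented example of such a difference, and `normalise` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-ins for duke_df["TAXON"] and dwca_tax_df["scientfiicname"].
duke_taxa = pd.Series(
    ["Achillea millefolium", "Zingiber officinale ", "Cymbidium sp", "Achillea millefolium"],
    name="TAXON",
)
kew_names = pd.Series(
    ["Achillea millefolium", "Zingiber officinale", "Aa achalensis"],
    name="scientfiicname",
)


def normalise(names):
    """Trim and collapse whitespace so purely typographic differences do not block a match."""
    return names.str.strip().str.replace(r"\s+", " ", regex=True)


# Exact matching after normalisation.
matched = normalise(duke_taxa).isin(normalise(kew_names))

# Distinct DUKE names that still have no counterpart in the KEW names.
unmatched_taxa = sorted(normalise(duke_taxa)[~matched].unique())
```

Inspecting `unmatched_taxa` on the real data would show whether the dropped names are mostly incomplete identifications (such as `Cymbidium sp`), spelling variants, or names that are simply absent from the KEW table.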

## WHO Monographs: