In [3]:
import re
import pandas as pd
import numpy as np
import PyPDF2

In [4]:
pd.set_option("display.max_columns", 85)
pd.set_option("display.max_rows", 85)

### NaturDoc - TL BL WT 22-23

# Data acquisition and exploration:

The datasets initially found by the pitcher had two problems:
* some were behind a paywall
* the remaining one was a dataset of images for training the scanning feature, not a dataset containing information on herbs such as the symptoms they treat or possible treatments

We quickly realised that we would need to look for other datasets or find sources to scrape. We defined the one-sentence pitch of our app as follows:

_Create an app to diagnose simple sickness based on form inputs, suggest nature-plant-based remedies or recipes to treat simple sickness._

We researched ethnobotany, folk medicine, notebooks on Kaggle and similar sources. The datasets and data sources we found were often too simplistic, too specialised, or in formats that are difficult to work with. One dataset that looked promising was the Duke Ethnobotanical dataset.

### Dr. Duke's Ethnobotanical Database:

[Dr. Duke's Phytochemical and Ethnobotanical Databases](https://catalog.data.gov/dataset/dr-dukes-phytochemical-and-ethnobotanical-databases-cecc0) contain a variety of databases, of which the ETHNOBOT.csv seemed particularly promising:

In [5]:
duke_df = pd.read_csv("../data/DUKE/ETHNOBOT.csv")

In [6]:
duke_df.head()

Unnamed: 0,ETHNO,ACTIVITY,GENUS,SPECIES,SPAUT,SPRANK,SPXNAM,SPXAUT,FAMILY,CNAME,COUNTRY,REFERENCE,LONGREF,EFFECTIVE,TAXON,TAXAUTHOR,USERID,CREATED,MODIFIED
0,1,Abcess,Achillea,millefolium,,,,,Asteraceae,,,Uphof,,,Achillea millefolium,,DUKE,02-FEB-98,
1,2,Abcess,Castilla,elastica,,,,,Moraceae,,Mexico,Standley,,,Castilla elastica,,DUKE,02-FEB-98,
2,3,Abcess,Pistia,stratiotes,,,,,Araceae,,Elsewhere,"Duke,1972",,,Pistia stratiotes,,DUKE,02-FEB-98,
3,4,Abdomen,Achillea,sibirica,,,,,Asteraceae,,China,Hunan,,,Achillea sibirica,,DUKE,02-FEB-98,
4,5,Abdomen,Adenia,lobata,,,,,Passifloraceae,,Upper Volta,Ayensu,,,Adenia lobata,,DUKE,02-FEB-98,


Each row of the dataset describes one treatment: it pairs an "activity" (= the ailment targeted by this specific treatment) with a specific plant. It also records the reference for the research that described this relationship, along with other information such as the country, genus, species, common name and family.

#### "Activities":

The "activities", i.e. curative properties ascribed to a given herb, cover a wide range of ailments and conditions.

In [7]:
activities = duke_df["ACTIVITY"].unique()
len(activities)

2235

As we can see, over 2200 unique conditions are described by the dataset. Looking at these afflictions in more detail:

In [8]:
print(*activities[::25].tolist(), sep=" --- ")

Abcess --- Acne --- Amebiasis --- Anodyne --- Antidote(Ipoh) --- Antidote(Cyanide) --- Antidote(Rhus) --- Antifertility --- Antismoking --- Appendicitis --- Ataxia(Locomotor) --- Beri-Beri --- Bitter-Aromatic --- Ache(Bones) --- Bunion --- Cancer(Eye) --- Cancer(Thyroid) --- Caries --- Cervix --- Cholera --- Colic(Veterinary) --- Copremia --- Cyanache --- Dentition --- Digestive disease --- Dye --- Egilops --- Enteromegaly --- Excitant --- Fear --- Finger --- Fracture(Veterinary) --- Gall Bladder --- Giddiness --- Gunshot --- Heart --- Hemolytic --- Hirudicide --- Hypertension --- Indigestion --- Intestinal-Troubles --- Kneecap --- Leg --- Liver --- Madness --- Medical --- Metroxenia --- Mucus --- Necklace --- Nightsweats --- Oncolytic --- Oxyuriasis --- Paroxysm --- Philtre --- Pneumonia --- Porridge --- Preventitive(Disease) --- Proctitis --- Puberty --- Rain-Making --- Repellant(Moth) --- Rib --- Scab(Veterinary) --- Sclerosis(Testicle) --- Shortwindedness --- Sore(Eye) --- Stasis -

The described disease or condition ranges from rather general descriptions, such as aches, to specific ailments and injuries such as gunshot wounds, diabetes and so on. Non-medical categories such as "spice" are also included. This wide range of descriptors means the "ACTIVITY" column will need further cleaning later on.
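One possible first step for that cleaning is to split each label into a base term and an optional qualifier. This is a minimal sketch, assuming that qualifiers always appear in a single trailing pair of parentheses (as in `Cancer(Breast)` or `Colic(Veterinary)`); the helper name `split_activity` is hypothetical and not part of the dataset or of pandas.

```python
import re

# Assumed label shape: "Base" or "Base(Qualifier)".
_ACTIVITY_RE = re.compile(r"^(?P<base>[^()]+?)\s*(?:\((?P<qualifier>[^()]*)\))?$")

def split_activity(label):
    """Return (base, qualifier); qualifier is None when there are no parentheses."""
    match = _ACTIVITY_RE.match(label.strip())
    if match is None:
        # The label does not fit the assumed shape, so return it unchanged.
        return label, None
    return match.group("base"), match.group("qualifier")

examples = ["Cancer(Breast)", "Colic(Veterinary)", "Gall Bladder", "Bitter-Aromatic"]
parsed = [split_activity(label) for label in examples]
print(parsed)
```

Grouping by the base term would then bring all variants of a condition such as "Cancer" together, while the qualifier stays available for finer filtering.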

#### Herbs:

Herbs are identified by their taxonomic name and, if available, a common name.
* Open question: are several common names recorded for the same herb in different entries?

In [9]:
herbs = duke_df["TAXON"].unique()

In [10]:
len(herbs)

13079

In total, the dataset contains data about over 13000 unique herbs that have been the subject of ethnobotanical or phytochemical research.

In [11]:
herbs_list = list()
for herb in duke_df.values:
    herbs_dict = dict()
    herbs_dict["TAXON"] = herb[14]
    herbs_dict["CNAME"] = herb[9]
    if herbs_dict not in herbs_list:
        herbs_list.append(herbs_dict)

Pairing of taxonomic name and common name (if available):

In [12]:
print(*herbs_list[::1000], sep=" --- ")

{'TAXON': 'Achillea millefolium', 'CNAME': nan} --- {'TAXON': 'Lepidium sativum', 'CNAME': nan} --- {'TAXON': 'Nymphaea odorata', 'CNAME': 'Kokulu Nilufer'} --- {'TAXON': 'Crithmum maritimum', 'CNAME': 'Meerfenchel'} --- {'TAXON': 'Ocotea bullata', 'CNAME': 'Black Stinkwood'} --- {'TAXON': 'Typhonium orixense', 'CNAME': nan} --- {'TAXON': 'Viola odorata', 'CNAME': 'Banaf Shah'} --- {'TAXON': 'Curcuma domestica', 'CNAME': 'Kurkuma Uzum'} --- {'TAXON': 'Spondias mombin', 'CNAME': 'Mombin Franc'} --- {'TAXON': 'Hordeum vulgare', 'CNAME': 'Cebada'} --- {'TAXON': 'Dioscorea crispata', 'CNAME': nan} --- {'TAXON': 'Buddleja brasiliensis', 'CNAME': nan} --- {'TAXON': 'Befaria aestuans', 'CNAME': 'Bejar'} --- {'TAXON': 'Nesaea salicifolia', 'CNAME': nan} --- {'TAXON': 'Gastrochilus panduratum', 'CNAME': 'Dekunchi'} --- {'TAXON': 'Alocasia macrorrhiza', 'CNAME': 'Kuan Yin Lien'} --- {'TAXON': 'Cymbidium sp', 'CNAME': nan} --- {'TAXON': 'Hedychium philippense', 'CNAME': nan} --- {'TAXON': 'Cucurb

In [13]:
len(herbs_list)

25728

The list is longer than the number of unique herbs (25728 pairs versus 13079 taxonomic names). Why?
* A likely explanation: a herb can have several common names, each recorded in a different entry, and every distinct "TAXON" + "CNAME" combination adds its own dictionary to the list.
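The pair count can also be cross-checked with pandas directly. The sketch below uses a small invented frame as a stand-in for `duke_df` (the rows are toy data, not taken from the dataset). Note that `drop_duplicates` treats missing values as equal to each other, so its count may differ from that of a loop-based membership test.

```python
import numpy as np
import pandas as pd

# Toy stand-in for duke_df with the two columns of interest.
toy_df = pd.DataFrame(
    {
        "TAXON": [
            "Crithmum maritimum",
            "Crithmum maritimum",
            "Crithmum maritimum",
            "Achillea millefolium",
            "Achillea millefolium",
        ],
        "CNAME": ["Sea Fennel", "Meerfenchel", "Sea Fennel", np.nan, np.nan],
    }
)

# Unique (TAXON, CNAME) pairs.
pairs = toy_df[["TAXON", "CNAME"]].drop_duplicates()

# Number of distinct common names per taxon (missing values are not counted).
names_per_taxon = toy_df.groupby("TAXON")["CNAME"].nunique()

print(len(pairs), toy_df["TAXON"].nunique())
print(names_per_taxon)
```

A taxon with more than one entry in `names_per_taxon` is one that contributes several pairs to the list.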

In [14]:
for herb in herbs_list:
    if herb["TAXON"] == "Crithmum maritimum":
        print(herb)

{'TAXON': 'Crithmum maritimum', 'CNAME': 'Meerfenchel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Fenouil Marin'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Finochio Marino'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Sea Fennel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Zeevenkel'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Sanpetra'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Fennel,Sea'}
{'TAXON': 'Crithmum maritimum', 'CNAME': 'Hinojo Marino'}


In [15]:
filt = (duke_df["TAXON"] == "Crithmum maritimum")
duke_df[filt]["CNAME"]

4225         Meerfenchel
15463      Fenouil Marin
17239    Finochio Marino
17862         Sea Fennel
42763          Zeevenkel
46606           Sanpetra
52103         Fennel,Sea
57628      Hinojo Marino
Name: CNAME, dtype: object

This needs to be kept in mind further down the line when transforming data from the dataset.
* create a dataframe with one row per unique herb (== "TAXON"), containing lists of all "CNAMES" and "ACTIVITIES"
    * ignore the rest of the data?
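The planned transformation could look roughly like the following. This is a sketch on invented toy rows, not the final implementation: the helper `unique_sorted`, the output column names and the placeholder activities for _Crithmum maritimum_ are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows in the shape of duke_df; "Example-A"/"Example-B" are placeholders.
toy_df = pd.DataFrame(
    {
        "TAXON": [
            "Zingiber officinale",
            "Zingiber officinale",
            "Crithmum maritimum",
            "Crithmum maritimum",
            "Achillea millefolium",
        ],
        "CNAME": ["Ginger", "Ginger", "Sea Fennel", "Meerfenchel", np.nan],
        "ACTIVITY": ["Spice", "Carminative", "Example-A", "Example-B", "Abcess"],
    }
)

def unique_sorted(series):
    """Collect the distinct non-missing values of a column as a sorted list."""
    return sorted(series.dropna().unique())

# One row per taxon, with list-valued columns for names and activities.
herb_df = (
    toy_df.groupby("TAXON")
    .agg(CNAMES=("CNAME", unique_sorted), ACTIVITIES=("ACTIVITY", unique_sorted))
    .reset_index()
)
print(herb_df)
```

Herbs without any common name end up with an empty list in `CNAMES`, which keeps the column type uniform.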

##### Ex. Ginger:

In [16]:
filt_ginger = (duke_df["CNAME"] == "Ginger")
duke_df[filt_ginger]

Unnamed: 0,ETHNO,ACTIVITY,GENUS,SPECIES,SPAUT,SPRANK,SPXNAM,SPXAUT,FAMILY,CNAME,COUNTRY,REFERENCE,LONGREF,EFFECTIVE,TAXON,TAXAUTHOR,USERID,CREATED,MODIFIED
9573,9574,Cancer,Zingiber,officinale,,,,,Zingiberaceae,Ginger,China,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
9650,9651,Cancer(Breast),Zingiber,officinale,,,,,Zingiberaceae,Ginger,US,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
10795,10796,Carminative,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
45326,45327,Sore,Zingiber,officinale,,,,,Zingiberaceae,Ginger,China,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
46069,46070,Spice,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
55629,55630,Wen,Zingiber,officinale,,,,,Zingiberaceae,Ginger,UK,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
62901,62902,Perfume,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
62907,62908,Stimulant,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,
62913,62914,Tumor(Hand),Zingiber,officinale,,,,,Zingiberaceae,Ginger,Indonesia,Hartwell,,,Zingiber officinale,,DUKE,02-FEB-98,
73314,73315,Liqueur,Zingiber,officinale,,,,,Zingiberaceae,Ginger,Asia,Uphof,,,Zingiber officinale,,DUKE,02-FEB-98,


## Royal Botanic Gardens Kew Datasets:

[Plants of the World Online by the Royal Botanic Gardens, Kew](https://powo.science.kew.org/)

_Note: some of the csv files from the KEW datasets were too large to be pushed to GitHub, so the cells below are best left with their saved output rather than executed again. In the next notebook, we will include the two csv files that contain the most useful information for us, reduced in size._

The website is thorough and well presented, but the downloadable csv files do not contain all the information shown on the website. Some of the datasets contain links to the webpage of each individual plant; a closer look, however, reveals that the plant pages do not follow a standardised layout. Scraping the website for additional information therefore does not seem worthwhile.

#### DWCA:

In [17]:
dwca_dis_df = pd.read_csv("../data/DWCA/wcvp_distribution.csv", sep="|")
dwca_rep_df = pd.read_csv("../data/DWCA/wcvp_replacementNames.csv", sep="|")

The wcvp_taxon.csv is extremely large (not pushed to GitHub).

In [18]:
dwca_tax_df = pd.read_csv("../data/DWCA/wcvp_taxon.csv", sep="|")

### Head:

These datasets might be useful for cross-referencing because they contain information on where the plants occur. This warrants a more thorough examination.

In [19]:
dwca_dis_df.head()

Unnamed: 0,coreid,locality,establishmentmeans,locationid,occurrencestatus,threatstatus
0,1,Argentina Northeast,,TDWG:AGE,,
1,1,Argentina Northwest,,TDWG:AGW,,
2,1,Bolivia,,TDWG:BOL,,
3,1,Brazil North,,TDWG:BZN,,
4,1,Colombia,,TDWG:CLM,,


In [20]:
dwca_rep_df.head()

Unnamed: 0,taxonid,relatednameusageid,relationtype,remarks
0,250443,250232,replacement name,
1,281366,298758,replacement name,
2,2595809,2595916,replacement name,
3,336840,49782,replacement name,
4,2569633,2569235,replacement name,


In [21]:
dwca_tax_df.head()

Unnamed: 0,taxonid,family,genus,specificepithet,infraspecificepithet,scientfiicname,scientfiicnameauthorship,taxonrank,taxonomicstatus,acceptednameusageid,parentnameusageid,originalnameusageid,namepublishedin,nomenclaturalstatus,taxonremarks,scientificnameid,dynamicproperties,references
0,1,Orchidaceae,Aa,,,Aa,Rchb.f.,Genus,Accepted,1.0,,,Xenia Orchid. 1: 18 (1854),,Costa Rica to NW. Venezuela and N. Argentina,ipni:28572-1,"{""powoid"":""28572-1"",""lifeform"":"""",""climate"":""""...",https://powo.science.kew.org/taxon/urn:lsid:ip...
1,2,Orchidaceae,Aa,achalensis,,Aa achalensis,Schltr.,Species,Accepted,2.0,1.0,,Repert. Spec. Nov. Regni Veg. 16: 358 (1920),,"NC. Argentina (La Rioja, San Luis, Córdoba)",ipni:303631-2,"{""powoid"":""303631-2"",""lifeform"":""tuberous geop...",https://powo.science.kew.org/taxon/urn:lsid:ip...
2,3,Orchidaceae,Aa,argyrolepis,,Aa argyrolepis,Rchb.f.,Species,Accepted,3.0,1.0,,Xenia Orchid. 1: 18 (1854),,W. South America to N. Brazil,ipni:614525-1,"{""powoid"":""614525-1"",""lifeform"":""tuberous geop...",https://powo.science.kew.org/taxon/urn:lsid:ip...
3,4,Orchidaceae,Aa,brevis,,Aa brevis,Schltr.,Species,Synonym,132066.0,,,Repert. Spec. Nov. Regni Veg. Beih. 9: 51 (1921),,Peru,ipni:301821-2,"{""powoid"":""301821-2"",""lifeform"":"""",""climate"":""...",https://powo.science.kew.org/taxon/urn:lsid:ip...
4,5,Orchidaceae,Aa,calceata,,Aa calceata,(Rchb.f.) Schltr.,Species,Accepted,5.0,1.0,7138.0,Repert. Spec. Nov. Regni Veg. 11: 150 (1912),,Peru to Bolivia,ipni:1008443-2,"{""powoid"":""1008443-2"",""lifeform"":""tuberous geo...",https://powo.science.kew.org/taxon/urn:lsid:ip...


"taxonremarks" to aid in localising the plants?

In [22]:
len(dwca_tax_df["taxonremarks"].unique().tolist())

80139

In [23]:
dwca_val = dwca_tax_df["taxonremarks"].value_counts()

In [24]:
print(dwca_val[::4000])

France                                                                                                           12422
Kazakhstan (Karatau)                                                                                                12
Irkutsk (Siberia, Asia-Temperate)                                                                                    5
Ecuador (Loja) to Peru                                                                                               3
SE. California to SW. Arizona and Mexico (NE. Baja California, NW. Sonora)                                           2
Labrador, Greenland, Europe                                                                                          2
Mexico (SE. Sonora, SW. Chihuahua, Sinaloa, NW. Durango)                                                             2
S. Altay to Mongolia                                                                                                 1
Mexico (México State, Jalisco, Colima)          

Too many values of inconsistent format?

In [25]:
len(dwca_dis_df["locality"].unique())

411

In [26]:
dwca_dis_df["locality"].value_counts()[::10]


Colombia                  30726
Brazil North              17061
Turkey                    14527
India                     12513
Mexico Central            10704
Cuba                       9699
Transcaucasus              8746
Germany                    8122
Morocco                    7457
Switzerland                6823
Cambodia                   6355
New York                   6055
Namibia                    5824
Georgia                    5466
Puerto Rico                5127
Palestine                  4825
Maryland                   4516
Kentucky                   4363
Oklahoma                   4217
Botswana                   4131
Inner Mongolia             3915
Minnesota                  3759
Baltic States              3578
North European Russi       3313
Mauritius                  3074
Irkutsk                    2875
Nova Scotia                2599
District of Columbia       2413
Kamchatka                  2021
Gambia                     1755
Tonga                      1387
Bermuda 

Fewer values, but the format is still somewhat inconsistent?

#### WCVP:

Both of these csv files are too large to be pushed to GitHub (e.g. wcvp_distribution.csv has over 1 million entries):

In [27]:
wcvp_dis_df = pd.read_csv("../data/WCVP/wcvp_distribution.csv", sep="|")
wcvp_names_df = pd.read_csv("../data/WCVP/wcvp_names.csv", sep="|")



Contains continent, region and area - seems promising?

In [28]:
wcvp_dis_df.head()

Unnamed: 0,plant_locality_id,plant_name_id,continent_code_l1,continent,region_code_l2,region,area_code_l3,area,introduced,extinct,location_doubtful
0,1394111,1,8,SOUTHERN AMERICA,80.0,Central America,COS,Costa Rica,0,0,0
1,1394112,1,8,SOUTHERN AMERICA,82.0,Northern South America,VEN,Venezuela,0,0,0
2,1394113,1,8,SOUTHERN AMERICA,83.0,Western South America,BOL,Bolivia,0,0,0
3,1394114,1,8,SOUTHERN AMERICA,83.0,Western South America,CLM,Colombia,0,0,0
4,1394115,1,8,SOUTHERN AMERICA,83.0,Western South America,ECU,Ecuador,0,0,0


In [29]:
wcvp_names_df.head()

Unnamed: 0,plant_name_id,ipni_id,taxon_rank,taxon_status,family,genus_hybrid,genus,species_hybrid,species,infraspecific_rank,infraspecies,parenthetical_author,primary_author,publication_author,place_of_publication,volume_and_page,first_published,nomenclatural_remarks,geographic_area,lifeform_description,climate_description,taxon_name,taxon_authors,accepted_plant_name_id,basionym_plant_name_id,replaced_synonym_author,homotypic_synonym,parent_plant_name_id,powo_id,hybrid_formula,reviewed
0,250443,309465-1,Species,Unplaced,Cyperaceae,,Isolepis,,panamensis,,,,J.Presl & C.Presl,,Reliq. Haenk.,1: 351,(1830),,Panama,,,Isolepis panamensis,J.Presl & C.Presl,,250232.0,J.Presl & C.Presl,,250129.0,309465-1,,Y
1,345888,77078885-1,nothosubsp.,Unplaced,Orchidaceae,,Anacamptis,×,gennarii,nothosubsp.,bornemanniae,Asch.,"H.Kretzschmar, Eccarius & H.Dietr.",,Orchid Gen. Anacamptis Orchis Neotinea,", ed. 2: 430",(2007),,Medit.,tuberous geophyte,subtropical,Anacamptis × gennarii nothosubsp. bornemanniae,"(Asch.) H.Kretzschmar, Eccarius & H.Dietr.",,142282.0,,,8578.0,77078885-1,A. morio subsp. longicornu × A. papilionacea,Y
2,2490178,231532-2,Species,Unplaced,Cucurbitaceae,,Sechium,,tacaco,,,Pittier,C.Jeffrey,,Kew Bull.,33: 361,(1978),,Costa Rica,,wet tropical,Sechium tacaco,(Pittier) C.Jeffrey,,2409709.0,,,2490165.0,231532-2,,N
3,2602950,582307-1,Species,Synonym,Siparunaceae,,Siparuna,,lindenii,,,Seem.,A.DC.,A.P.de Candolle,Prodr.,16(2): 656,(1868),,Brazil (Santa Catarina),,wet tropical,Siparuna lindenii,(Seem.) A.DC.,,2723851.0,,,,582307-1,,N
4,3143990,17540160-1,Subspecies,Unplaced,Pteridaceae,,Pentagramma,,triangularis,subsp.,maxonii,Weath.,"Yatsk., Windham & E.Wollenw.",,Amer. Fern J.,80: 16,(1990),,S. California to SW. New Mexico and Mexico (Ba...,,,Pentagramma triangularis subsp. maxonii,"(Weath.) Yatsk., Windham & E.Wollenw.",,3164043.0,,,3143986.0,17540160-1,,N


### Crossreferencing DUKE and KEW datasets:

In [30]:
duke_df.loc[:, "TAXON"]

0        Achillea millefolium
1           Castilla elastica
2           Pistia stratiotes
3           Achillea sibirica
4               Adenia lobata
                 ...         
82868      Jatropha multifida
82869      Myroxylon pereirae
82870     Syzygium aromaticum
82871      Gossypium hirsutum
82872    Colocasia antiquorum
Name: TAXON, Length: 82873, dtype: object

In [31]:
dwca_tax_df.loc[:, "scientfiicname"]

0                                              Aa
1                                   Aa achalensis
2                                  Aa argyrolepis
3                                       Aa brevis
4                                     Aa calceata
                            ...                  
1415002                      Crataegus sorbifolia
1415003                        Menyanthes excelsa
1415004                        Mammillaria valida
1415005                 Echinocactus neumannianus
1415006    Typhonium roxburghii var. longispathum
Name: scientfiicname, Length: 1415007, dtype: object

DUKE dataset filtered to the entries whose scientific name also appears in the KEW dataset:

In [32]:
# named kew_match to avoid shadowing the built-in filter()
kew_match = duke_df.loc[:, "TAXON"].isin(dwca_tax_df.loc[:, "scientfiicname"])
filt_df = duke_df[kew_match]
filt_df.shape

(74641, 19)

In [33]:
len(duke_df) - len(filt_df)

8232

8232 entries were thus discarded from the DUKE dataset.

(Keep in mind that every single entry is a combination of a plant with a specific activity, so this number does not tell us how many plants were dropped.)

In [34]:
print(len(duke_df["TAXON"].unique()))
print(len(filt_df["TAXON"].unique()))
len(duke_df["TAXON"].unique()) - len(filt_df["TAXON"].unique())


13079
10695


2384

Entries belonging to 2384 unique scientific plant names were dropped from the DUKE dataset.
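To inspect which names failed to match, a left merge with `indicator=True` can list them explicitly. The sketch below runs on invented toy frames; which names match here is made up for illustration and says nothing about the real datasets. The column name `scientfiicname` is kept as spelled in the KEW file.

```python
import pandas as pd

# Toy stand-ins for duke_df and dwca_tax_df.
duke_toy = pd.DataFrame(
    {"TAXON": ["Achillea millefolium", "Curcuma domestica",
               "Achillea millefolium", "Cymbidium sp"]}
)
kew_toy = pd.DataFrame({"scientfiicname": ["Achillea millefolium", "Aa achalensis"]})

# Compare unique names only, so each taxon is reported once.
unique_taxa = duke_toy[["TAXON"]].drop_duplicates()
kew_names = kew_toy[["scientfiicname"]].drop_duplicates()

merged = unique_taxa.merge(
    kew_names,
    how="left",
    left_on="TAXON",
    right_on="scientfiicname",
    indicator=True,  # adds a "_merge" column: "both" or "left_only"
)
unmatched = merged.loc[merged["_merge"] == "left_only", "TAXON"].tolist()
print(unmatched)
```

Looking through such a list may show whether the unmatched names are genuinely absent or differ only in spelling, for example an abbreviated species such as "sp".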

## WHO Monographs:

[WHO monographs on selected medicinal plants](https://apps.who.int/iris/handle/10665/42052)

There are several WHO monographs describing herbs and their medicinal applications in depth. The format presents some challenges, but the description of each herb follows the same structure. The text itself is written free-form, so some information is difficult to extract. Regardless, these monographs seem promising for the purpose of enriching the DUKE dataset with some more detailed text, e.g. serving as "instructions" of sorts.

Accessing a pdf with PyPDF2:

In [35]:
# opening the pdf file; the context manager closes it afterwards
with open('../data/WHO/monograph_01.pdf', 'rb') as who_01_pdf:
    # creating a pdf reader object
    who_01_reader = PyPDF2.PdfReader(who_01_pdf)

    # printing number of pages in pdf file
    print(len(who_01_reader.pages))

    # creating a page object
    page_obj = who_01_reader.pages[4]

    # extracting text from page
    print(page_obj.extract_text())

297
WHO monographs on selected medicinal plants
ivRadix Valerianae 267
Rhizoma Zingiberis 277
Annex
Participants in the WHO Consultation on Selected MedicinalPlants 288Contents


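Important: for this specific pdf, the page order is mangled. Printed page 1 is the 8th page of the file (+ 7), and the printed pages appear in the file in this order:
* 1-50 (in the correct order)
* 59, 60
* 67-76
* 61-66
* 51-58
* 77 onwards (in the correct order)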

The extracted text also contains unusual characters, e.g. a single ligature character in place of "fi" (it renders almost identically in markdown) or a dash character in place of "-":

-> <code>"Rauwol ﬁae"</code> Rauwolfiae

-> <code>"Deﬁnition"</code> Definition

The extraction sometimes also adds a space next to the ligature, but not consistently.

#### Reading and cleaning up the pdf:

Regular expressions are used to target the unusual "fi" and "-" characters, citations such as "( 1, 2)" and spaces before dots. Every page is appended to a list of pages, so that the pdf does not need to be opened and read again.

In [36]:
who_01_pdf = open('../data/WHO/monograph_01.pdf', 'rb')

who_01_reader = PyPDF2.PdfReader(who_01_pdf)

who_01_pages = list()

pattern_fi = re.compile("ﬁ")
pattern_fl = re.compile("ﬂ")
pattern_minus = re.compile("–")
# raw strings avoid invalid escape sequence warnings for "\(", "\." and "\s"
pattern_brackets = re.compile(r"\([\s0-9,-]+\)")
pattern_spacedot = re.compile(r" \.")
pattern_spacecomma = re.compile(" ,")
pattern_spacequotesingle = re.compile(" ’")
pattern_spacequotedouble = re.compile(" ”")
pattern_doublespace = re.compile(r"\s\s+")
pattern_dashspace = re.compile(r"-\s+")
  
for page in range(len(who_01_reader.pages)):
    page_obj = who_01_reader.pages[page]
    text = page_obj.extract_text()
    text = re.sub(pattern_fi, "fi", text)
    text = re.sub(pattern_fl, "fl", text)
    text = re.sub(pattern_minus, "-", text)
    text = re.sub(pattern_brackets, "", text)
    text = re.sub(pattern_spacedot, ".", text)
    text = re.sub(pattern_spacecomma, ",", text)
    text = re.sub(pattern_spacequotesingle, "'", text)
    text = re.sub(pattern_spacequotedouble, "\"", text)
    text = re.sub(pattern_doublespace, " ", text)
    text = re.sub(pattern_dashspace, "", text)
    lines = text.split("\n")
    who_01_pages.append(lines)
    
who_01_pdf.close()
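As a possible alternative to one pattern per ligature, Unicode compatibility normalisation (NFKC) from the standard library folds ligature characters into plain letters. This is a sketch of that alternative, not what the cell above does; the helper name `fold_ligatures` is hypothetical. NFKC also rewrites other compatibility characters (for example superscript digits), so it may change more than intended on scientific text.

```python
import unicodedata

def fold_ligatures(text):
    """Replace compatibility characters such as ligatures with plain letters."""
    return unicodedata.normalize("NFKC", text)

# "\ufb01" is the "fi" ligature, "\ufb02" is the "fl" ligature.
print(fold_ligatures("De\ufb01nition"))
print(fold_ligatures("in\ufb02ammation"))

# The en dash ("\u2013") has no compatibility mapping and is left untouched,
# so a separate replacement is still needed for it.
print(fold_ligatures("30\u201390cm") == "30\u201390cm")
```

Stray spaces around a ligature, as in "Rauwol fiae", are not removed by normalisation and still need separate handling.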

Creating one slice per run of pages (the slice indices include the offset of 7):

In [37]:
p_01_50 = who_01_pages[7 : 57]
p_51_58 = who_01_pages[75 : 83]
p_59_60 = who_01_pages[57 : 59]
p_61_66 = who_01_pages[69 : 75]
p_67_76 = who_01_pages[59 : 69]
p_77_end = who_01_pages[83 : ]

Appending slices to a new list:

In [38]:
who_01_pages_corrected = p_01_50 + p_51_58 + p_59_60 + p_61_66 + p_67_76 + p_77_end

who_01_pages_corrected[50][:10]

['Radix Astragali',
 '51oblong, glabrous, reticulate. The root is flexible and long and covered with a',
 'tough, wrinkled, yellowish brown epidermis, which has a tendency to break up',
 'into woolly fibres. The woody interior is yellowish white.',
 'Plant material of interest: root',
 'General appearance',
 'Radix Astragali is cylindrical, some upper branches relatively thick, 30 -90cm',
 'long, 1 -3.5cm in diameter. Externally pale brownish yellow or pale brown,',
 'with irregular, longitudinal wrinkles or furrows. Texture hard and tenacious,broken with dif ficulty, fracture highly fibrous and starchy, bark yellowish',
 'white, wood pale yellow, with radiate striations and fissures, the centre part of']

As we can see, page 51 is now at the correct index position (== 50, as indices start from 0).

In the original list, by comparison, page 51 is at index 75:

(The original list also always requires adding 7 to the index, because the document includes a cover and front-matter pages numbered i to vi before the numbering starts over at page 1.)

In [39]:
who_01_pages[75][:10]

['Radix Astragali',
 '51oblong, glabrous, reticulate. The root is flexible and long and covered with a',
 'tough, wrinkled, yellowish brown epidermis, which has a tendency to break up',
 'into woolly fibres. The woody interior is yellowish white.',
 'Plant material of interest: root',
 'General appearance',
 'Radix Astragali is cylindrical, some upper branches relatively thick, 30 -90cm',
 'long, 1 -3.5cm in diameter. Externally pale brownish yellow or pale brown,',
 'with irregular, longitudinal wrinkles or furrows. Texture hard and tenacious,broken with dif ficulty, fracture highly fibrous and starchy, bark yellowish',
 'white, wood pale yellow, with radiate striations and fissures, the centre part of']
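The manual slicing can also be derived from the observed page order. The following is a minimal sketch, not the approach used in this notebook: the helper `reorder_pages` and its parameters are hypothetical, and it assumes that the runs of printed page numbers are listed in the order in which they appear in the file.

```python
def reorder_pages(pages, physical_runs, offset):
    """Return the pages sorted by printed page number.

    pages         -- extracted pages in file order (front matter included)
    physical_runs -- (first, last) printed page numbers, in file order
    offset        -- number of front-matter pages before printed page 1
    """
    index_of = {}
    idx = offset
    for first, last in physical_runs:
        for printed in range(first, last + 1):
            # Printed page number -> position in the extracted list.
            index_of[printed] = idx
            idx += 1
    return [pages[index_of[printed]] for printed in sorted(index_of)]

# Toy document: two front-matter pages, then printed pages in the order 1-3, 6-7, 4-5.
toy_pages = ["cover", "contents", "p1", "p2", "p3", "p6", "p7", "p4", "p5"]
toy_runs = [(1, 3), (6, 7), (4, 5)]
print(reorder_pages(toy_pages, toy_runs, offset=2))
```

For the monograph, the runs would be (1, 50), (59, 60), (67, 76), (61, 66) and (51, 58), followed by a final run starting at 77, with an offset of 7.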

#### WHO Remedies:
The monograph titles already name the specific remedy, i.e. which part of the plant is used:

In [103]:
for page in who_01_pages_corrected:
    for i, line in enumerate(page):
        if "Definition" == line:
            # strip page numbers fused into the title line
            output = re.sub(r"[0-9]+", "", page[i-1])
            print(output)

Bulbus Allii Cepae
Bulbus Allii Sativi
Aloe
Aloe Vera Gel
Radix Astragali
Fructus Bruceae
Radix Bupleuri
Herba Centellae
Flos Chamomillae
Cortex Cinnamomi
Rhizoma Coptidis
Rhizoma Curcumae Longae
Radix Echinaceae
Herba Echinaceae Purpureae
Herba Ephedrae
Folium Ginkgo
Radix Ginseng
Radix Paeoniae
Semen Plantaginis
Radix Platycodi
Radix Rauwolfiae
Rhizoma Rhei
Folium Sennae
Fructus Sennae
Herba Thymi
Radix Valerianae
Rhizoma Zingiberis


### Definition:

For example, _Bulbus Allii Cepae_ specifically describes the fresh or dried bulbs of onions (taxonomic name _Allium cepa_). The name of the plant itself is mentioned following the definition header in the monograph:

In [104]:
who_01_def = list()
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Definition" == line:
            n_lines = 1
            # bounds check first, so a header on the last line cannot raise an IndexError
            while i + n_lines < max_len and page[i + n_lines] != "Synonyms":
                n_lines += 1
            who_01_def.append(" ".join(page[i + 1 : i + n_lines]))

In [105]:
print(*who_01_def[:5], sep="\n")

Bulbus Allii Cepae is the fresh or dried bulbs of Allium cepa L. (Liliaceae) or its varieties and cultivars.
Bulbus Allii Sativi consists of the fresh or dried bulbs of Allium sativum L. (Liliaceae).
Aloe is the dried juice of the leaves of Aloe vera (L.) Burm. f. or of A. ferox Mill. and its hybrids with A. africana Mill. and A. spicata Baker (Liliaceae).
Aloe Vera Gel is the colourless mucilaginous gel obtained from the parenchymatous cells in the fresh leaves of Aloe vera (L) Burm. f. (Liliaceae).
Radix Astragali is the dried root of Astragalus membranaceus (Fisch.) Bunge and Astragalus mongholicus Bunge (Fabaceae).


It is therefore possible to cross-reference the WHO definition with the scientific names saved in the DUKE dataset (see later chapter).

### Geographical distribution:

Issue: the format of this information varies a lot.

Locations are described with a mix of continents, countries and regions, which makes them difficult to cross-reference.

The entries also cover different kinds of information: native country, spread, commercial availability...

In [106]:
who_01_dis = list()
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Geographical distribution" == line:
            n_lines = 1
            # bounds check first, so a header on the last line cannot raise an IndexError
            while i + n_lines < max_len and page[i + n_lines] != "General identity tests":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
            else:
                output = " ".join(page[i + 1 : i + n_lines])

            who_01_dis.append(output)
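The extraction loops in this notebook all follow the same pattern: find a header line, then collect lines until the next header or the end of the page. A small helper could capture that pattern once. This is a sketch under that assumption; `extract_section` is a hypothetical name, and like the loops here it does not follow a section that continues on the next page.

```python
def extract_section(pages, start_header, end_header):
    """Collect the text between two header lines, one string per occurrence."""
    sections = []
    for page in pages:
        for i, line in enumerate(page):
            if line != start_header:
                continue
            body = []
            for following in page[i + 1:]:
                # Stop at the next header; otherwise run to the end of the page.
                if following == end_header:
                    break
                body.append(following)
            sections.append(" ".join(body))
    return sections

# Toy pages in the same shape as who_01_pages_corrected (lists of lines).
toy_pages = [
    ["Title", "Geographical distribution", "Native to", "western Asia.",
     "General identity tests", "Macroscopic examination"],
    ["Geographical distribution", "Cultivated worldwide."],
]
print(extract_section(toy_pages, "Geographical distribution", "General identity tests"))
```

Because slicing never runs past the end of a list, the helper needs no explicit bounds checks, even when the header is the last line of a page.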

In [107]:
print(*who_01_dis[:5], sep="\n")

Bulbus Allii Cepae ( “onion") is probably indigenous to western Asia, but it is commercially cultivated worldwide, especially in regions of moderate climate.
Bulbus Allii Sativi is probably indigenous to Asia, but it is commercially cultivated in most countries.
Native to southern and eastern Africa, and subsequently introduced into northern Africa, the Arabian peninsula, China, Gibraltar, the Mediterranean countries, and the West Indies. It is commercially cultivated in Aruba, Bonaire, Haiti, India, South Africa, the United States of America, and Venezuela.
Probably native to north Africa along the upper Nile in the Sudan, and subsequently introduced and naturalized in the Mediterranean region, most of thetropics and warmer areas of the world, including Asia, the Bahamas, Central America, Mexico, the southern United States of America, south-east Asia, and the West Indies.
Indigenous to China, the Democratic People's Republic of Korea, Mongolia, and Siberia. Commercially cultivated in 

### Vernacular names:

The names generally follow the format "name"-"comma"-"space"-"name"-"comma"-"space". Some entries start with an introductory sentence that ends with a full stop.

Idea: split at the full stop and, if the returned list has more than 2 elements, remove the element holding the introductory sentence?

Then split at "comma"-"space" to return a list of names?

In [108]:
who_01_vernacular = list()
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Selected vernacular names" == line:
            n_lines = 1
            # bounds check first, so a header on the last line cannot raise an IndexError
            while i + n_lines < max_len and page[i + n_lines] != "Description":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
            else:
                output = " ".join(page[i + 1 : i + n_lines])

            who_01_vernacular.append(output)

In [109]:
print(*who_01_vernacular[:5], sep="\n")

It is most commonly known as “onion". Basal, basl, cebolla, cebolla morada, cepa bulb, cepolla, cipolla, common onion, cu hanh, hom hua yai, hom khaao, hom yai, hu-t'sung, hu t'sung t'song, hua phak bhu, i-i-bsel, kesounni, khtim, Küchenzwiebel, l'oignon, loyon, Madras oignon, oignon, palandu, piyaj, piyaz, pyaz, pyaaz, ralu lunu, red globe onion, sibuyas, Spanish onion, tamanegi, umbi bawang merah, vengayan, yellow Bermuda onion, white globe onion, Zwiebel .
It is most commonly known as “garlic”. Ail, ail commun, ajo, akashneem, allium, alubosa elewe, ayo-ishi, ayu, banlasun, camphor of the poor, dai tóan, dasuan, dawang, dra thiam, foom, Gartenlauch, hom khaao, hom kía,hom thiam, hua thiam, kesumphin, kitunguu-sumu, Knoblauch, kra thiam, krathiam, krathiam cheen, krathiam khaao, l’ail, lahsun, lai, lashun, lasan, lasun, lasuna, Lauch, lay, layi, lehsun, lesun, lobha, majo, naharu, nectar of the gods,ninniku, pa-se-waa, poor man’s treacle, rason, rasonam, rasun, rustic treacles, seer,
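The splitting idea noted in this section could be sketched as follows. The helper `parse_vernacular` is hypothetical, and it rests on two assumptions: an introductory sentence contains no comma, and names themselves never contain a full stop followed by a space.

```python
import re

def parse_vernacular(raw):
    """Split a 'Selected vernacular names' entry into a list of names."""
    text = raw.strip()
    # Drop an introductory sentence such as 'It is most commonly known as "onion".'
    head, sep, tail = text.partition(". ")
    if sep and "," not in head:
        text = tail
    # Remove the trailing full stop (sometimes preceded by a space).
    text = text.rstrip(" .")
    # Split on commas with or without a following space ("hom kia,hom thiam").
    return [name.strip() for name in re.split(r",\s*", text) if name.strip()]

sample = 'It is most commonly known as "onion". Basal, basl, cebolla,cepa bulb, Zwiebel .'
print(parse_vernacular(sample))
```

Entries without an introductory sentence pass through the same function unchanged apart from the split.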

### Medicinal uses:

#### Clinical:

Issues with chemical illustrations being "read" at the end of the page, such as:

.chamazuleneCH3 CH3CH3 (/H11002)αbisabololCH3OH CH3H CH3CH3 O OOH OHOH apigenin --

In [110]:
who_01_clinical = list()
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses supported by clinical data" == line:
            n_lines = 1
            # bounds check first, so a header on the last line cannot raise an IndexError
            while i + n_lines < max_len and page[i + n_lines] != "Uses described in pharmacopoeias and in traditional systems of":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
            else:
                output = " ".join(page[i + 1 : i + n_lines])

            who_01_clinical.append(output)

In [111]:
who_01_clinical[8]

'Internal use Symptomatic treatment of digestive ailments such as dyspepsia, epigastric bloating, impaired digestion, and flatulence. Infusions of camomile flowers have been used in the treatment of restlessness and in mild cases of insomnia due to nervous disorders. External use Inflammation and irritations of the skin and mucosa (skin cracks, bruises, frostbite, and insect bites), including irritations and infections of the mouth and gums, and haemorrhoids.chamazuleneCH3 CH3CH3 (/H11002)αbisabololCH3OH CH3H CH3CH3 O OOH OHOH apigenin --'

Regex to target chemical formulae: they appear right after the final dot without a space in between:

In [112]:
who_01_clinical = list()

pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses supported by clinical data" == line:
            n_lines = 1
            # test the bound first so the index can never run past the end of the page
            while i + n_lines < max_len and page[i + n_lines] != "Uses described in pharmacopoeias and in traditional systems of":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            who_01_clinical.append(output)

In [113]:
who_01_clinical[8]

'Internal use Symptomatic treatment of digestive ailments such as dyspepsia, epigastric bloating, impaired digestion, and flatulence. Infusions of camomile flowers have been used in the treatment of restlessness and in mild cases of insomnia due to nervous disorders. External use Inflammation and irritations of the skin and mucosa (skin cracks, bruises, frostbite, and insect bites), including irritations and infections of the mouth and gums, and haemorrhoids.'
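
The effect of the substitution can be checked in isolation on a shortened version of the camomile string. The sketch below also runs the same pattern on a made-up dosage sentence (an assumption for illustration, not WHO text) to show a side effect: any full stop directly followed by a letter or digit triggers the cut, so decimal numbers truncate the text as well.

```python
import re

# Same pattern as in the notebook: a full stop immediately followed by
# letters or digits, plus everything after it on the same line.
pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")

# Shortened sample modelled on the camomile entry.
with_formula = "infections of the mouth and gums, and haemorrhoids.chamazuleneCH3 CH3CH3 apigenin --"
cleaned = pattern_formula.sub(".", with_formula)
print(cleaned)

# Hypothetical sentence: the full stop in "0.5" is followed by a digit,
# so everything after it is removed too.
with_decimal = "Daily dose: 0.5 g of dried bulb. Take with food."
truncated = pattern_formula.sub(".", with_decimal)
print(truncated)
```

This may matter for sections that contain decimal numbers, such as dosages.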

In [114]:
print(*who_01_clinical[:5], sep="\n")

The principal use of Bulbus Allii Cepae today is to prevent age-dependent changes in the blood vessels, and loss of appetite.
As an adjuvant to dietetic management in the treatment of hyperlipidaemia, and in the prevention of atherosclerotic (age-dependent) vascular changes. The drug may be useful in the treatment of mild hypertension.
Short-term treatment of occasional constipation.
None.
None.


#### Traditional:

The same regex pattern is applied here:

In [115]:
who_01_traditional = list()

pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses described in pharmacopoeias and in traditional systems of" == line:
            n_lines = 2
            # test the bound first so the index can never run past the end of the page
            while i + n_lines < max_len and "Uses described in folk medicine" not in page[i + n_lines]:
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 2 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 2 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            who_01_traditional.append(output)

In [116]:
print(*who_01_traditional[:5], sep="\n")

Treatment of bacterial infections such as dysentery, and as a diuretic. The drug has also been used to treat ulcers, wounds, scars, keloids, and asthma . Bulbus Allii Cepae has also been used as an adjuvant therapy for diabetes.
The treatment of respiratory and urinary tract infections, ringworm and rheumatic conditions. The herb has been used as a carminative in the treatment of dyspepsia.
None.
Aloe Vera Gel is widely used for the external treatment of minor wounds and inflammatory skin disorders. The gel is used in the treatment of minor skin irritations, including burns, bruises, and abrasions. The gel is further used in the cosmetics industry as a hydrating ingredient in liquids, creams, sun lotions, shaving creams, lip balms, healing ointments, and face packs. Aloe Vera Gel has been traditionally used as a natural remedy for burns. Aloe Vera Gel has been effectively used in the treatment of firstand second-degree thermal burns and radiation burns. Both thermal and radiationburns 

#### Folk medicine:

In [117]:
who_01_folk = list()

pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")
  
for page in who_01_pages_corrected:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses described in folk medicine" in line:
            n_lines = 2
            # test the bound first so the index can never run past the end of the page
            while i + n_lines < max_len and page[i + n_lines] != "Pharmacology":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 2 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 2 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            who_01_folk.append(output)

In [118]:
print(*who_01_folk[:5], sep="\n")

As an anthelminthic, aphrodisiac, carminative, emmenagogue, expectorant, and tonic, and for the treatment of bruises, bronchitis, cholera, colic, earache, fevers, high blood pressure, jaundice, pimples, and sores.
As an aphrodisiac, antipyretic, diuretic, emmenagogue, expectorant, and sedative, to treat asthma and bronchitis, and to promote hair growth.
Treatment of seborrhoeic dermatitis, peptic ulcers, tuberculosis, and fungal infections, and for reduction of blood sugar (glucose) levels.
The treatment of acne, haemorrhoids, psoriasis, anaemia, glaucoma, petit ulcer, tuberculosis, blindness, seborrhoeic dermatitis, and fungal infections.
Treatment of nephritis, chronic bronchitis, postpartum urine retention, leprosy, and the sequelae of cerebrovascular accidents.
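
The clinical, traditional and folk loops share one shape: find a start heading, collect lines until an end heading, join them, and strip formula residue. A minimal sketch of how this could be folded into a single function follows. `extract_section`, its parameters and the sample `page` are hypothetical names for illustration and not part of the notebook. Unlike the cells above, the sketch always matches headings by substring and returns only the first match per page. The two-line headings in the sample are an assumption made to mirror the offset of 2 used for the traditional and folk sections.

```python
import re

pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")


def extract_section(page, start, end, offset=1, strip_formulae=True):
    """Return the text between a start heading and an end heading.

    page: list of text lines; start/end: heading substrings;
    offset: lines to skip from the heading line (1 skips the heading only).
    Returns None when the start heading is not on the page.
    """
    for i, line in enumerate(page):
        if start in line:
            stop = len(page)  # fall back to the end of the page
            for j in range(i + offset, len(page)):
                if end in page[j]:
                    stop = j
                    break
            text = " ".join(page[i + offset:stop])
            if strip_formulae:
                text = pattern_formula.sub(".", text)
            return text
    return None


# Made-up page, loosely modelled on the outputs shown in this notebook.
page = [
    "Medicinal uses",
    "Uses supported by clinical data",
    "Short-term treatment of occasional constipation.",
    "Uses described in pharmacopoeias and in traditional systems of",
    "medicine",
    "None.",
    "Uses described in folk medicine, not supported by experimental",
    "or clinical data",
    "Treatment of seborrhoeic dermatitis.",
    "Pharmacology",
]

clinical = extract_section(page, "Uses supported by clinical data", "Uses described in pharmacopoeias")
traditional = extract_section(page, "Uses described in pharmacopoeias", "Uses described in folk medicine", offset=2)
folk = extract_section(page, "Uses described in folk medicine", "Pharmacology", offset=2)
print(clinical, traditional, folk, sep="\n")
```

Returning `None` for a missing heading, rather than skipping the page silently, would make it possible to keep one entry per page and so keep the lists aligned.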


Other areas of interest include:
* __Posology__
    * Posology, Dosage Forms
* __Toxicology__
* __Contraindications__
    * Contraindications, Warnings, Precautions, Adverse reactions

The _Precautions_ described in the WHO monographs fall into many different categories, such as _General_, _Drug interactions_ and _Carcinogenesis_. It is therefore difficult to account for every possible category, and this is probably more information than we would want to present on our website anyway.

### Issues with wrong or missing matches as well as "null" values:

As the counts below show, the lists differ in length, so entries at the same index no longer belong to the same herb:

In [119]:
print(len(who_01_def))
print(len(who_01_dis))
print(len(who_01_vernacular))
print(len(who_01_clinical))
print(len(who_01_traditional))
print(len(who_01_folk))

27
25
28
28
27
28


On further testing of the data extracted so far, we realised that footnotes cause issues (similar to chemical formulae). When the PDF is read, they are placed at the beginning of each page. As a result, some headings are missed when compared with the equality operator; the check for matching headings should instead use the "in" keyword.
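
The difference between the two comparisons can be shown on two sample lines. The second line is hypothetical: it assumes that footnote text ends up fused in front of a heading, which is one way the behaviour described here could arise.

```python
heading = "Uses supported by clinical data"

# Hypothetical extracted lines; the second has footnote text in front
# of the heading.
lines = [
    "Uses supported by clinical data",
    "1 See the monograph on Aloe. Uses supported by clinical data",
]

exact_matches = [line for line in lines if line == heading]
substring_matches = [line for line in lines if heading in line]

print(len(exact_matches))      # 1
print(len(substring_matches))  # 2
```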

#### Rerunning code with "in"-keyword:

When rerunning the previous extraction code with the "in" keyword, we sometimes need to exclude lines that contain words like "see" or "and"; otherwise, cross-references and combined headings produce wrong matches.
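
A small sketch of such a filter is given below. `is_heading` and the sample lines are hypothetical and only illustrate the idea of combining a substring match with a list of excluded words.

```python
def is_heading(line, heading, exclude=("see",)):
    """True if the line contains the heading and none of the excluded words."""
    lowered = line.lower()
    return heading in line and not any(word in lowered for word in exclude)


# Hypothetical lines, loosely modelled on cross-references in the monographs.
lines = [
    "Contraindications",
    "Pregnancy (see Contraindications).",
    "See also Contraindications and Warnings.",
]

headings = [line for line in lines if is_heading(line, "Contraindications")]
print(headings)  # ['Contraindications']
```

Because the exclusion is itself a substring test, a line containing "seed" would also be rejected by the "see" filter, so the excluded words need to be chosen with the actual text in mind.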

We skip "distribution", as it is difficult to use and the KEW datasets as well as our pytrend endeavors provide better data.

In [166]:
pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")

# Definitions with in-keyword:

who_01_def = list()
  
for page in who_01_pages_corrected[3:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Definition" in line:
            n_lines = 1
            # test the bound first so the index can never run past the end of the page
            while i + n_lines < max_len and page[i + n_lines] != "Synonyms":
                n_lines += 1
            who_01_def.append(" ".join(page[i + 1 : i + n_lines]))

# Vernacular names with in-keyword:

who_01_vernacular = list()
  
for page in who_01_pages_corrected[3:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Selected vernacular names" in line:
            n_lines = 1
            while i + n_lines < max_len and page[i + n_lines] != "Description":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
            else:
                output = " ".join(page[i + 1 : i + n_lines])
            
            output = output.replace(".", "")
            
            # Sometimes taxonomic names appear among the vernacular names in the WHO document as a header of sorts:
            for name in duke_names:
                output = output.replace(name, ",")
            
            output = output.split(", ")

            who_01_vernacular.append(output)

# Clinical medicine with in-keyword:

who_01_clinical = list()
  
for page in who_01_pages_corrected[3:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses supported by clinical data" in line:
            n_lines = 1
            while i + n_lines < max_len and "Uses described in pharmacopoeias and in traditional" not in page[i + n_lines]:
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            # in case the description gets cut off, drop the incomplete sentence:
            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."
            
            who_01_clinical.append(output)

# Traditional medicine with in-keyword:

who_01_traditional = list()
  
for page in who_01_pages_corrected[3:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses described in pharmacopoeias and in traditional" in line:
            n_lines = 2
            while i + n_lines < max_len and "Uses described in folk medicine" not in page[i + n_lines]:
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 2 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 2 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_traditional.append(output)

# Folk with in-keyword:

who_01_folk = list()
  
for page in who_01_pages_corrected[3:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Uses described in folk medicine" in line:
            n_lines = 2
            while i + n_lines < max_len and page[i + n_lines] != "Pharmacology":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 2 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 2 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_folk.append(output)
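
The step that drops a cut-off final sentence is repeated in several loops and fails with an `IndexError` if `output` happens to be empty. A minimal sketch of a reusable variant follows; `drop_incomplete_sentence` is a hypothetical helper, not part of the notebook. It differs in one respect: text without any full stop becomes an empty string instead of a lone ".".

```python
def drop_incomplete_sentence(text):
    """Cut text back to its last full stop; return "" if there is none."""
    text = text.strip()
    if not text or text.endswith("."):
        return text
    # rpartition splits at the LAST full stop; head and sep are empty
    # strings when the text contains no full stop at all.
    head, sep, _ = text.rpartition(".")
    return head + sep


print(drop_incomplete_sentence("Aloe is used for burns. Both thermal and radiationburns "))
```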

Verifying identical lengths:

In [167]:
print(len(who_01_def))
print(len(who_01_vernacular))
print(len(who_01_clinical))
print(len(who_01_traditional))
print(len(who_01_folk))

28
28
28
28
28
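
Instead of comparing printed numbers by eye, the check could be automated. The sketch below uses stand-in lists and a hypothetical `check_equal_lengths` function; in the notebook, the dictionary values would be the `who_01_*` lists.

```python
def check_equal_lengths(fields):
    """Return each list's length; raise ValueError if they are not all equal."""
    lengths = {name: len(values) for name, values in fields.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"List lengths differ: {lengths}")
    return lengths


# Stand-in data for illustration only.
fields = {
    "definition": ["d1", "d2", "d3"],
    "clinical": ["c1", "c2", "c3"],
    "folk": ["f1", "f2", "f3"],
}
print(check_equal_lengths(fields))
```

Equal lengths are a necessary condition, not proof of alignment: one missed match and one spurious match in the same list would cancel out, so spot checks of individual entries remain useful.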


### Completing the lists for remaining values:

We add posology, adverse reactions, contraindications and warnings:

In [168]:
pattern_formula = re.compile(r"\.[A-Za-z0-9]+.*")
            
who_01_con = list()
  
for page in who_01_pages_corrected[4:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Contraindications" in line and "see" not in line.lower():
            n_lines = 1
            # test the bound first so the index can never run past the end of the page
            while i + n_lines < max_len and page[i + n_lines] != "Warnings":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_con.append(output)
            
who_01_warn = list()
  
for page in who_01_pages_corrected[4:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Warnings" in line and "see" not in line.lower():
            n_lines = 1
            while i + n_lines < max_len and page[i + n_lines] != "Precautions":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_warn.append(output)
            
who_01_adv = list()

# matches a number followed by a full stop; lines containing one are not treated as headings
ref_pattern = re.compile(r"[0-9]+\.")

for page in who_01_pages_corrected[4:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Adverse reactions" in line and "see" not in line.lower() and not re.search(ref_pattern, line):
            n_lines = 1
            while i + n_lines < max_len and page[i + n_lines] != "Posology":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_adv.append(output)

who_01_pos = list()
  
for page in who_01_pages_corrected[4:]:
    max_len = len(page) - 1
    for i, line in enumerate(page):
        if "Posology" in line and "and" not in line.lower():
            n_lines = 1
            while i + n_lines < max_len and page[i + n_lines] != "References":
                n_lines += 1
            if i + n_lines == max_len:
                output = " ".join(page[i + 1 : i + n_lines + 1])
                # output = re.sub(pattern_formula, ".", output)
            else:
                output = " ".join(page[i + 1 : i + n_lines])
                # output = re.sub(pattern_formula, ".", output)

            if "." != output[-1]:
                output = output.split(".")
                output = output[:-1]
                output = ".".join(output)
                output = output + "."

            who_01_pos.append(output)

Verifying the lengths of these new lists:

In [174]:
print(len(who_01_con))
print(len(who_01_warn))
print(len(who_01_adv))
print(len(who_01_pos))

28
28
28
28


### Null values:

The WHO monographs sometimes contain placeholder strings indicating that no information is available for a given section of a specific herb. These can be treated as "null" values of sorts.

In [130]:
who_01_warn[6:13]

['Radix Bupleuri causes sedation when used in large doses ; therefore, patients should be cautious when operating a motor vehicle or hazardous machinery.',
 'No information available.',
 'No information available.',
 'No information available.',
 'No information available.',
 'No information available.',
 'None.']

As we can see, the absence of information is phrased in different ways. For now, we think it is best to replace all of these placeholders with an empty string.

In [170]:
null_values = ["No information available.", "None."]

def make_nan(target_list: list):
    """Replace placeholder strings with an empty string, modifying the list in place."""
    for i, value in enumerate(target_list):
        if value in null_values:
            target_list[i] = ""

In [171]:
make_nan(who_01_clinical)
make_nan(who_01_traditional)
make_nan(who_01_folk)
make_nan(who_01_con)
make_nan(who_01_warn)
make_nan(who_01_adv)
make_nan(who_01_pos)

In [175]:
who_01_warn[11:13]

['', '']
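
`make_nan` changes the list in place and only catches exact matches. A possible variation is sketched below, assuming that placeholders differing in case or surrounding whitespace should count as well. `normalise_nulls`, `NULL_VALUES` and the `placeholder` parameter are hypothetical names, and whether such variants occur in the monographs has not been checked here.

```python
# Lower-cased placeholder strings that stand for "no data".
NULL_VALUES = {"no information available.", "none."}


def normalise_nulls(values, placeholder=""):
    """Return a new list with null-like strings replaced by placeholder."""
    return [
        placeholder if value.strip().lower() in NULL_VALUES else value
        for value in values
    ]


sample = ["Radix Bupleuri causes sedation.", "No information available.", "None."]
print(normalise_nulls(sample))
```

Passing `placeholder=None` would mark the gaps as missing values instead of empty strings, which can be convenient if the lists are later loaded into a DataFrame.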

## Looking Ahead:

In the next notebook, we take a closer look at the datasets explored here to identify and experiment with possible approaches to manipulating and extracting data.