### Database Exploration

The objective of this notebook is to explore the database generated by the generate_database.py and update_database.py scripts.

Many of the Latin plant names extracted from Plantagen's website were not found when searched for with the Google search API restricted to the site https://missouribotanicalgarden.org/. The goal is therefore to increase the number of successful hits and add these to the database.

In [10]:
import sqlite3
import pandas as pd

In [12]:
# read in database, extract all plants and summarise result
DATABASE_LOC = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\PyForFun\House_Plant_Recommender\Database\house_plants.db"

conn = sqlite3.connect(DATABASE_LOC)
c = conn.cursor()
c.execute("""SELECT * FROM 'hyperlinks'""")

plants_found, plants_not_found = [], []
found, not_found = 0, 0

for row in c.fetchall():
    
    if row[1] == "no link found":
        not_found += 1
        plants_not_found.append(row)
    else:
        found += 1
        plants_found.append(row)
print(f"Total number of links now searched: {(found+not_found)}")
print(f"Number of links found: {found}")
print(f"Number of links not found: {not_found}")

c.close()

Total number of links now searched: 95
Number of links found: 32
Number of links not found: 63
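
The same tally can also be computed inside SQLite rather than in a Python loop. The following is a minimal sketch, not the notebook's actual code: it assumes the `hyperlinks` table has two text columns, here called `latin_name` and `link` (hypothetical names, since the cell above only accesses columns by position), and it runs against a small in-memory table holding three illustrative rows instead of the real `house_plants.db`.

```python
import sqlite3
from contextlib import closing

NO_LINK = "no link found"  # sentinel value stored when the search found nothing


def summarise_links(conn):
    """Return (found, not_found) counts for the hyperlinks table.

    Assumes a column named `link` (hypothetical name).
    """
    query = """
        SELECT COALESCE(SUM(link != ?), 0),
               COALESCE(SUM(link = ?), 0)
        FROM hyperlinks
    """
    found, not_found = conn.execute(query, (NO_LINK, NO_LINK)).fetchone()
    return found, not_found


# Illustrative rows only; the real table holds 95 entries.
demo_rows = [
    ("Sansevieria", NO_LINK),
    ("Echeveria", NO_LINK),
    ("Ficus elastica",
     "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b597"),
]

# closing() guarantees the connection is closed, not just the cursor.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE hyperlinks (latin_name TEXT, link TEXT)")
    conn.executemany("INSERT INTO hyperlinks VALUES (?, ?)", demo_rows)
    found, not_found = summarise_links(conn)

print(f"Total number of links now searched: {found + not_found}")
print(f"Number of links found: {found}")
print(f"Number of links not found: {not_found}")
```

Wrapping the connection in `closing` also closes the connection itself, whereas `c.close()` in the cell above only closes the cursor.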


In [16]:
missing_plants = [name[0] for name in plants_not_found] 
missing_plants

['Sansevieria',
 'Anthurium Andraeanum-Gruppen',
 'Myrsine africana',
 'Strelitzia nicolai',
 'Pelargonium (Peltatum-Zonale-Gruppen)',
 'Crassula coccinea',
 'Echeveria',
 'Parahemionitis cordata',
 'Alocasia gageana',
 'Nepenthes',
 'Bouvardia x domestica',
 'Sophora prostrata',
 'Cymbidium',
 'Tulipa gesneriana',
 'Wallisia cyanea',
 'Primula obconica',
 'Disocactus anguliger',
 'Cissus striata',
 'Philodendron erubescens',
 'Crocus x hybridus',
 'Haworthia attenuata',
 'Calathea warscewiczii',
 'Goeppertia rufibarba',
 'Pilea depressa',
 'Muscari botryoides',
 'Cissus rotundifolia',
 'x Oncidopsis grex',
 'Zamioculcas zamiifolia',
 'Sarracenia',
 'Muehlenbeckia axillaris',
 'Centella asiatica',
 'Philodendron',
 'Kalanchoe delagoensis',
 'Vriesea',
 'Euphorbia tithymaloides',
 'Goeppertia ornata',
 'Araucaria cunninghamii',
 'Echeveria purpusorum',
 'Ficus deltoidea',
 'Oreocereus trollii',
 'Alocasia baginda',
 'Ornithogalum dubium',
 'Ficus cyathistipula',
 'Euphorbia fruticosa',


#### Some notes from analysing the list above

- Some names contain Swedish words, e.g. 'Kalanchoe (Blossfeldiana-Gruppen)' and 'Pelargonium Peltatum-Gruppen', so these would not be expected to match.
- Some names contain abbreviations, e.g. 'Ficus americana ssp. guianensis', which may prevent a match.
- I ran a random selection of these names through both Google (to confirm that the name is real) and the https://missouribotanicalgarden.org search (to confirm that the name is not present on that site). The automated method therefore appears to be working as intended.
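
The first two notes suggest that some misses could be recovered by simplifying the scraped name before searching again. The sketch below is one possible way to do that and is not part of the existing scripts: `simplify_name` is a hypothetical helper that drops the Swedish "-Gruppen" suffix (keeping only the first name in the group, lower-cased) and removes a subspecies or variety suffix such as "ssp. guianensis". Any name it produces would still need to be checked by hand against the site.

```python
import re

# Matches an infraspecific rank and the epithet that follows it,
# e.g. " ssp. guianensis" or " var. minor".
RANK_PATTERN = re.compile(r"\s+(?:ssp|subsp|var)\.\s+\S+")


def simplify_name(name):
    """Return a simplified search string for a scraped plant name."""
    simplified = name.strip()
    if "-Gruppen" in simplified:
        genus, _, group = simplified.partition(" ")
        # "(Peltatum-Zonale-Gruppen)" -> "peltatum" (first name in the group only)
        epithet = group.strip("()").split("-")[0].lower()
        simplified = f"{genus} {epithet}"
    # Drop subspecies/variety suffixes so only genus and species remain.
    simplified = RANK_PATTERN.sub("", simplified)
    return simplified.strip()


examples = [
    "Kalanchoe (Blossfeldiana-Gruppen)",
    "Pelargonium Peltatum-Gruppen",
    "Ficus americana ssp. guianensis",
    "Bouvardia x domestica",
]
for example in examples:
    print(f"{example!r} -> {simplify_name(example)!r}")
```

Names without a group suffix or rank abbreviation, including hybrids such as 'Bouvardia x domestica', pass through unchanged. Keeping only the first name of a multi-name group is a simplification and may discard a valid alternative.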


#### On adding to the database
- We can clearly increase the number of plants (with links) in the database. For example, 'Pelargonium Peltatum-Gruppen' gives no matches, but 'Pelargonium peltatum' (https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a535) does.

- Whilst the Ficus species 'Ficus cyathistipula' (and others) were not found, the missouribotanicalgarden.org database has results for many other Ficus species that are houseplants and can also be added to the database.

- Depending on the importance of this project, a lot of time could be invested in increasing the size of the database (including by adding other sources). As this is supposed to be a fun side project, I will not invest too much time in this.



In [21]:
plants_to_add = [
    # Those with Swedish words in their name:
    ("Anthurium andraeanum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=276219"),
    ("Pelargonium peltatum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a535"),
    ("Kalanchoe blossfeldiana", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=279373"),
    # Those with abbreviations:
    # Ficus species available on missouribotanicalgarden.org:
    ("Ficus lyrata", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282753"),
    ("Ficus carica", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282762"),
    ("Ficus benjamina", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282745"),
    ("Ficus elastica", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b597"),
    ("Ficus pumila", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b599"),
    ("Ficus religiosa", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282754")
]
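
A list of `(name, link)` tuples like the one above still has to be written to the database. The following is a rough sketch of how that could be done, under the assumption that the `hyperlinks` table has two columns, here called `latin_name` and `link` (hypothetical names). `add_links` is an illustrative helper and not a function from generate_database.py or update_database.py, and the demo uses an in-memory table so the real database file is not touched.

```python
import sqlite3
from contextlib import closing


def add_links(conn, plants):
    """Insert (name, link) pairs, skipping names already stored.

    Returns the number of rows added. Column names are assumptions.
    """
    added = 0
    for name, link in plants:
        exists = conn.execute(
            "SELECT 1 FROM hyperlinks WHERE latin_name = ?", (name,)
        ).fetchone()
        if exists is None:
            conn.execute("INSERT INTO hyperlinks VALUES (?, ?)", (name, link))
            added += 1
    conn.commit()
    return added


# Two entries taken from the list above, used only for illustration.
sample_plants = [
    ("Ficus elastica",
     "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b597"),
    ("Ficus lyrata",
     "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282753"),
]

with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute("CREATE TABLE hyperlinks (latin_name TEXT, link TEXT)")
    # Pretend 'Ficus elastica' is already stored, so only one row is new.
    conn.execute("INSERT INTO hyperlinks VALUES (?, ?)", sample_plants[0])
    n_added = add_links(conn, sample_plants)
    n_rows = conn.execute("SELECT COUNT(*) FROM hyperlinks").fetchone()[0]

print(f"Rows added: {n_added}, rows in table: {n_rows}")
```

Passing values through `?` placeholders rather than formatting them into the SQL string keeps names containing quotes or brackets from breaking the query.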


# At this point, last plant searched was: 'Senecio macroglossus'

In [None]:
# Next step: create a variable named new_plants_to_search.

# It needs to be checked against the current list for duplicates.
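
One way the duplicate check could look is sketched below. `filter_new_plants` is a hypothetical helper, and the names in both lists are placeholders taken from earlier cells rather than the real contents of the database; the comparison ignores case and surrounding whitespace, which is an assumption about how duplicates should be defined.

```python
def filter_new_plants(candidates, already_searched):
    """Return candidates not yet searched, without repeats, in original order."""
    seen = {name.strip().lower() for name in already_searched}
    new_plants = []
    for name in candidates:
        key = name.strip().lower()
        if key not in seen:
            seen.add(key)  # also guards against repeats within candidates
            new_plants.append(name.strip())
    return new_plants


# Placeholder data for illustration only.
already_searched = ["Sansevieria", "Ficus lyrata", "Senecio macroglossus"]
candidates = ["Ficus lyrata", "Ficus benjamina", "ficus benjamina", "Ficus religiosa"]

new_plants_to_search = filter_new_plants(candidates, already_searched)
print(new_plants_to_search)
```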

