### Database Exploration

The objective of this notebook is to explore the database generated by the generate_database.py and update_database.py scripts. An attempt is then made to increase the number of successful hits and add these to the database.

**Summary of this notebook:**
 
- It's clear that many of the Latin plant names extracted from Plantagen's website were not found by the Google search API when searching the site https://missouribotanicalgarden.org/.
- This is not because of an issue with the script per se; many of the plants are simply not available in that database.
- A smaller subset could have been found in the database but contained Swedish words in their names (such as "gruppen", meaning "the group").
- These and other issues have been handled manually to help expand the database.

<ins>Originally the database had:</ins>

| Searches Made  | Links found | Links Not found |
| ---- | ---- |----|
| 368 | 126  | 242|

<ins>Now (after manual intervention) the database has:</ins>

| Searches Made  | Links found | Links Not found |
| ---- | ---- |----|
| N/A (368) | 147  | 242|
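
To put the two tables side by side, the share of rows that have a link can be worked out directly from the counts. This is only a small arithmetic sketch: the dictionary and function names are made up for illustration, and the "after" total includes the manually added rows, which were not search results.

```python
# Counts copied from the two tables above.
before = {"found": 126, "not_found": 242}
after = {"found": 147, "not_found": 242}


def share_found(counts):
    """Fraction of rows in the table that have a link."""
    total = counts["found"] + counts["not_found"]
    return counts["found"] / total


print(f"Before: {share_found(before):.1%}")  # Before: 34.2%
print(f"After:  {share_found(after):.1%}")   # After:  37.8%
```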



A lot more work could be put into increasing the database size, but as this is a fun side project for me to learn from, I will not put in further effort towards this. 

Note that the notebook was re-run from start to finish, so some of the cell outputs won't match what is written above (the manual additions had already been added at the time of re-running).


In [1]:
import sqlite3
import pandas as pd

In [2]:
# read in database, extract all plants and summarise result
DATABASE_LOC = r"C:\Users\Rory Crean\Dropbox (lkgroup)\Backup_HardDrive\Postdoc\PyForFun\House_Plant_Recommender\Database\house_plants.db"

conn = sqlite3.connect(DATABASE_LOC)
c = conn.cursor()
c.execute("""SELECT * FROM 'hyperlinks'""")

plants_found, plants_not_found = [], []
found, not_found = 0, 0

for row in c.fetchall():
    
    if row[1] == "no link found":
        not_found += 1
        plants_not_found.append(row)
    else:
        found += 1
        plants_found.append(row)
print(f"Total number of links now searched: {(found+not_found)}")
print(f"Number of links found: {found}")
print(f"Number of links not found: {not_found}")

c.close()

Total number of links now searched: 389
Number of links found: 147
Number of links not found: 242


In [3]:
missing_plants = [name[0] for name in plants_not_found] 
missing_plants

['Sansevieria',
 'Anthurium Andraeanum-Gruppen',
 'Myrsine africana',
 'Strelitzia nicolai',
 'Pelargonium (Peltatum-Zonale-Gruppen)',
 'Crassula coccinea',
 'Echeveria',
 'Parahemionitis cordata',
 'Alocasia gageana',
 'Nepenthes',
 'Bouvardia x domestica',
 'Sophora prostrata',
 'Cymbidium',
 'Tulipa gesneriana',
 'Wallisia cyanea',
 'Primula obconica',
 'Disocactus anguliger',
 'Cissus striata',
 'Philodendron erubescens',
 'Crocus x hybridus',
 'Haworthia attenuata',
 'Calathea warscewiczii',
 'Goeppertia rufibarba',
 'Pilea depressa',
 'Muscari botryoides',
 'Cissus rotundifolia',
 'x Oncidopsis grex',
 'Zamioculcas zamiifolia',
 'Sarracenia',
 'Muehlenbeckia axillaris',
 'Centella asiatica',
 'Philodendron',
 'Kalanchoe delagoensis',
 'Vriesea',
 'Euphorbia tithymaloides',
 'Goeppertia ornata',
 'Araucaria cunninghamii',
 'Echeveria purpusorum',
 'Ficus deltoidea',
 'Oreocereus trollii',
 'Alocasia baginda',
 'Ornithogalum dubium',
 'Ficus cyathistipula',
 'Euphorbia fruticosa',


#### Some notes from analysing the list above

- Some names contain Swedish words, e.g. 'Kalanchoe (Blossfeldiana-Gruppen)' and 'Pelargonium Peltatum-Gruppen', so these would obviously not match.
- Some names contain abbreviations, e.g. 'Ficus americana ssp. guianensis', which may cause match issues.
- I took a random selection of these names and ran them through both Google (to confirm the name is real) and the https://missouribotanicalgarden.org search API to confirm that the plant is indeed not present in the database. So the automated method is working fine as far as I can tell.
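
The first two notes could, in principle, be handled automatically before searching. The sketch below is not part of generate_database.py or update_database.py; `clean_latin_name` is a hypothetical helper, and whether dropping rank abbreviations such as "ssp." actually improves matching is an assumption that would need checking.

```python
import re


def clean_latin_name(raw_name):
    """Return a simplified search name (hypothetical helper for illustration)."""
    # Remove brackets, e.g. "Kalanchoe (Blossfeldiana-Gruppen)".
    name = raw_name.replace("(", " ").replace(")", " ")
    # Drop the Swedish suffix "-Gruppen" ("the group").
    name = re.sub(r"-gruppen\b", "", name, flags=re.IGNORECASE)
    # Drop rank abbreviations such as "ssp." or "var.".
    name = re.sub(r"\b(ssp|subsp|var)\.", " ", name)
    words = name.split()
    if not words:
        return ""
    # Genus capitalised, remaining words in lower case.
    return " ".join([words[0].capitalize()] + [w.lower() for w in words[1:]])


for raw in [
    "Kalanchoe (Blossfeldiana-Gruppen)",
    "Anthurium Andraeanum-Gruppen",
    "Ficus americana ssp. guianensis",
]:
    print(f"{raw!r} -> {clean_latin_name(raw)!r}")
```

Lower-casing everything after the genus would damage cultivar names such as 'Golden Dawn', and combined groups like 'Pelargonium (Peltatum-Zonale-Gruppen)' would still need manual attention, so this is a starting point rather than a complete fix.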


#### On adding to the database
- Clearly, we can increase the list of plants (with links) in the database. For example, 'Pelargonium Peltatum-Gruppen' gives no matches, but 'Pelargonium peltatum' (https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a535) does.

- Whilst the Ficus species 'Ficus cyathistipula' (and others) were not found, the missouribotanicalgarden.org database has results for many other Ficus species that are houseplants and can also be added to the database.

- If desired, a lot of time could be invested into increasing the size of the database (including by adding other sources).


In [4]:
more_plants = [
    # Those that had Swedish words (now removed/translated) in their name:
    ("Anthurium andraeanum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=276219"),
    ("Pelargonium peltatum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a535"),
    ("Kalanchoe blossfeldiana", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=279373"),
    ("Narcissus 'Golden Dawn'", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=260291"),
    ("Rosa rubiginosa", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=286363"), 
    ("Musa acuminata", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282778"), 
    ("Begonia rex-cultorum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=242218"), 
    ("Dracaena fragrans 'Lemon Lime'", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=366918"),
    ("Dracaena fragrans", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282260"),  
    ("Dracaena marginata", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b592"),  
    ("Fittonia albivenis", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b601"),  
    ("Pelargonium × hortorum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a537"),  
    ("Hippeastrum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599&basic=Hippeastrum"),  
    ("Pelargonium (scented-leaved group)", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a534"),  
    ("Pelargonium × hortorum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a537"),  
    ("Cyclamen persicum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a444"),  
    ("Capsicum annuum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=287148"),  
    # Those with acronyms.  
    ("Sarracenia rubra", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=286847"),  
    ("Sarracenia flava", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b927"),  
    ("Pericallis × hybrida", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=263628"),  
    ("Begonia", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=263000"),  
    # Available ficus plants.  
    ("Ficus lyrata", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282753"),
    ("Ficus carica", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282762"),
    ("Ficus benjamina", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282745"),
    ("Ficus elastica", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b597"),
    ("Ficus pumila", "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=b599"),
    ("Ficus religiosa", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282754")
]

len(more_plants)

27

Before adding these to the database, I need to check for duplicates. I will check both the name and the URL against what is already stored, to increase the chance of catching anything.
As a URL can have extra bits of text tagged on at the end, I have made sure that the URLs in "more_plants" have those bits removed, and I then check whether each URL appears among the URLs already in the database. That should work...
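
The manual trimming of URLs described above could also be done in code. The following is a minimal sketch using only the standard library; `normalise_url` and `ID_KEYS` are hypothetical names, and it assumes that the `http` and `https` variants of a link point to the same page and that `taxonid` or `kempercode` alone identifies a plant.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query keys that identify a plant page in the URLs used in this notebook.
ID_KEYS = ("taxonid", "kempercode")


def normalise_url(url):
    """Keep only the identifying query key and use https (illustrative helper)."""
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if k.lower() in ID_KEYS]
    return urlunsplit(("https", parts.netloc.lower(), parts.path, urlencode(query), ""))


# An illustrative pair: same page, different scheme and an extra query parameter.
stored = "http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599"
candidate = "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599&basic=Hippeastrum"

same_page = normalise_url(stored) == normalise_url(candidate)
print(same_page)  # True
```

Comparing normalised URLs on both sides would make the duplicate check independent of how each link happened to be copied.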

In [5]:
# lists to check against. 
plant_names_check = [plant[0].lower() for plant in plants_found]
plant_urls_check = [plant[1] for plant in plants_found]

In [6]:
duplicate_found = []
plants_to_add = []
for new_plant in more_plants:
    # name and URL check: both must already be present to count as a duplicate
    if new_plant[0].lower() in plant_names_check and new_plant[1] in plant_urls_check:
        duplicate_found.append(new_plant)
    else:
        plants_to_add.append(new_plant)
print(f"Plants without duplicates: {len(plants_to_add)}")
print(f"Plants with duplicates: {len(duplicate_found)}")

# Note that the notebook was re-run from start to finish hence why it now says there are no duplicates. 

Plants without duplicates: 0
Plants with duplicates: 27


In [7]:
# Checking those with duplicates
for duplicate in duplicate_found:
    print(f"Searching for duplicates for plant: {duplicate[0]} \n")
    if duplicate[0] in plant_names_check:
        print(duplicate[0])

    if duplicate[1] in plant_urls_check:
        print(duplicate[1])

Searching for duplicates for plant: Anthurium andraeanum 

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=276219
Searching for duplicates for plant: Pelargonium peltatum 

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?kempercode=a535
Searching for duplicates for plant: Kalanchoe blossfeldiana 

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=279373
Searching for duplicates for plant: Narcissus 'Golden Dawn' 

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=260291
Searching for duplicates for plant: Rosa rubiginosa 

https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=286363
Searching for duplicates for plant: Musa acuminata 

http://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=282778
Searching for duplicates for plant: Begonia rex-cultorum 

https://www.missouribotanicalgarden.org/PlantF

All duplicates have both a duplicated name and url, so I am happy they are true duplicates and need not be included.

In [8]:
# Now it is time to add these extra urls to the database. 
# Primary key is the latin name, so that will act as one final check. 
conn = sqlite3.connect(DATABASE_LOC)
c = conn.cursor()
c.executemany("""INSERT OR IGNORE INTO hyperlinks VALUES (?,?)""", plants_to_add)
conn.commit()
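
To illustrate what the primary key check does here, the sketch below runs `INSERT OR IGNORE` against a throwaway in-memory database. The schema is an assumption: the column name `latin_name` is hypothetical (only `url` appears elsewhere in this notebook), and the URLs are placeholders.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
# Assumed two-column schema with the Latin name as primary key.
demo.execute("CREATE TABLE hyperlinks (latin_name TEXT PRIMARY KEY, url TEXT)")
demo.execute("INSERT INTO hyperlinks VALUES (?, ?)", ("Ficus lyrata", "first-url"))

# The first row clashes with the existing primary key, the second is new.
rows = [("Ficus lyrata", "second-url"), ("Ficus carica", "third-url")]
demo.executemany("INSERT OR IGNORE INTO hyperlinks VALUES (?, ?)", rows)
demo.commit()

stored = dict(demo.execute("SELECT latin_name, url FROM hyperlinks"))
print(stored)
demo.close()
```

The clashing row is skipped silently and the existing URL is kept, so `INSERT OR IGNORE` protects against duplicate names but cannot correct a link that was stored wrongly. That needs an explicit `UPDATE`.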

In [9]:
# Now summarise the final form of the database. 
c.execute("""SELECT * FROM 'hyperlinks'""")

output = []
found, not_found = 0, 0

for row in c.fetchall():
    output.append(row)

    if row[1] == "no link found":
        not_found += 1
    else:
        found += 1

print(f"Total number of links now searched: {len(output)}")
print(f"Number of links found: {found}")
print(f"Number of links not found: {not_found}")

Total number of links now searched: 389
Number of links found: 147
Number of links not found: 242
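
The same summary can be computed inside SQLite rather than in a Python loop. This is a sketch on an in-memory table with made-up rows; the column name `latin_name` and the example URL are assumptions, while the `"no link found"` marker is the one used above.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hyperlinks (latin_name TEXT PRIMARY KEY, url TEXT)")
demo.executemany(
    "INSERT INTO hyperlinks VALUES (?, ?)",
    [
        ("Ficus lyrata", "https://example.org/a"),
        ("Echeveria", "no link found"),
        ("Nepenthes", "no link found"),
    ],
)

# A comparison evaluates to 0 or 1 in SQLite, so SUM counts the matching rows.
# COALESCE covers the empty-table case, where SUM would return NULL.
total, not_found = demo.execute(
    "SELECT COUNT(*), COALESCE(SUM(url = 'no link found'), 0) FROM hyperlinks"
).fetchone()
found = total - not_found
demo.close()

print(f"Total: {total}, found: {found}, not found: {not_found}")
```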


In [10]:
c.close()

##### Updating one of the manually added links, as it was entered incorrectly

In [11]:
# correct form: ("Hippeastrum", "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599&basic=Hippeastrum"),  
# Getting the old hyperlink below
for plant in plants_found:
    if plant[0] == "Hippeastrum":
        print(plant)

('Hippeastrum', 'https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599&basic=Hippeastrum')


In [None]:
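# Replace the incorrectly entered Hippeastrum link with the correct one.
# The values are passed as parameters because standard SQL reserves double
# quotes for identifiers, so string literals should not be written with them.
OLD_URL = "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=26459"
NEW_URL = "https://www.missouribotanicalgarden.org/PlantFinder/PlantFinderDetails.aspx?taxonid=264599&basic=Hippeastrum"

conn = sqlite3.connect(DATABASE_LOC)
c = conn.cursor()
c.execute("UPDATE hyperlinks SET url = ? WHERE url = ?", (NEW_URL, OLD_URL))
conn.commit()
c.close()
conn.close()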
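
An `UPDATE` like the one above only changes rows whose `url` still matches the old value, so it is worth checking how many rows were actually touched. The sketch below shows this with `Cursor.rowcount` on an in-memory table; the schema, the `update_url` helper and the placeholder URLs are all assumptions for illustration.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE hyperlinks (latin_name TEXT PRIMARY KEY, url TEXT)")
demo.execute("INSERT INTO hyperlinks VALUES (?, ?)", ("Hippeastrum", "old-url"))


def update_url(conn, old_url, new_url):
    """Swap old_url for new_url and return the number of rows changed."""
    cur = conn.execute(
        "UPDATE hyperlinks SET url = ? WHERE url = ?", (new_url, old_url)
    )
    conn.commit()
    return cur.rowcount


first = update_url(demo, "old-url", "new-url")   # the row still has the old link
second = update_url(demo, "old-url", "new-url")  # nothing left to change
print(first, second)  # 1 0
demo.close()
```

A count of zero on a re-run, as in the second call, simply means the correction has already been applied.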