# Invasive alien species internet activity data mining and processing for **iEcology-IAS-miner**

In this notebook, we explore the functionality of the **iEcology-IAS-miner** Python package, which is built to seamlessly extract internet activity, images, mentions and occurrences of invasive alien species across the EU from a variety of platforms. For the demonstration to work, all input files should be located in the same folder as the Python notebook. Note that these scripts will not work unless each platform's API key has been set in your local .env file, located in the root directory of the library. The .env file can be opened in any text editor and should look something like this:

YT_API_KEY='*insert youtube API key*'  
FLICKR_API_KEY='*insert flickr API key*'  
FLICKR_API_SECRET='*insert flickr secret*'  
WIKI_USER_AGENT='*insert project name* (*insert personal email*)'  
EASIN_EMAIL='*insert personal email*'  
EASIN_PW='*insert easin password*'  
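
Before running the mining cells, it can help to confirm that none of these keys is missing or left empty. The following is a minimal standard-library sketch, not part of the package: `parse_env_text` and `missing_keys` are hypothetical helper names, and the example text is made up.

```python
# Sketch only: parse_env_text and missing_keys are hypothetical helpers,
# not functions provided by iEcology-IAS-miner.

REQUIRED_KEYS = [
    "YT_API_KEY",
    "FLICKR_API_KEY",
    "FLICKR_API_SECRET",
    "WIKI_USER_AGENT",
    "EASIN_EMAIL",
    "EASIN_PW",
]


def parse_env_text(text):
    """Parse KEY='value' lines into a dict, skipping blank lines and comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Remove surrounding whitespace and quotes from the value
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in the given text."""
    values = parse_env_text(text)
    return [key for key in required if not values.get(key)]


# Made-up example content with two keys set and one left empty
example = "YT_API_KEY='abc'\nFLICKR_API_KEY=''\n# comment\nEASIN_EMAIL='me@example.org'\n"
print(missing_keys(example))
```

To check a real file, pass its contents instead of `example`, for instance the text read from the .env file in the root directory of the library.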

If issues persist, please contact me at **simon.reynaert@plantentuinmeise.be**.

In [2]:
#set the paths correctly on local device so that functions can be imported
import sys
import os

notebook_dir = os.getcwd()
print(f"The notebook is located at '{notebook_dir}'.")
src_path = os.path.abspath(os.path.join(notebook_dir, "../src"))
print(f"The functions are located at '{src_path}'.")

# Add to Python path if not already there
if src_path not in sys.path:
    sys.path.insert(0, src_path)

The notebook is located at 'c:\Users\simon\Documents\GitHub\iEcology-IAS-miner\scripts'.
The functions are located at 'c:\Users\simon\Documents\GitHub\iEcology-IAS-miner\src'.


## 1. Species list and synonyms mining

### 1.1. Get EASIN union list species names, synonyms and R identifiers

In [3]:
from list_mining.get_EASIN_unionlistofconcern import fetch_and_process_easin_data

fetch_and_process_easin_data(url="https://easin.jrc.ec.europa.eu/apixg/catxg/euconcern",
                             output_file="EASIN_unionlist_species_and_synonyms.csv")

import pandas as pd

df = pd.read_csv("EASIN_unionlist_species_and_synonyms.csv")
df.head()

Data successfully saved to EASIN_unionlist_species_and_synonyms.csv


Unnamed: 0,EASINID,Scientific Name,Label,All Names
0,R00046,Acacia mearnsii,Common Name,Acácia
1,R00046,Acacia mearnsii,Common Name,Acácia negra
2,R00046,Acacia mearnsii,Common Name,Acacia noir
3,R00046,Acacia mearnsii,Common Name,Acácia-negra
4,R00046,Acacia mearnsii,Common Name,Aromo
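
The table is in long format, with one row per name. For building search terms it may be more convenient to collect all names per species. The sketch below does this with the standard library on the five preview rows shown above; `names_by_species` is a hypothetical helper, not a package function.

```python
import csv
import io
from collections import defaultdict

# Rows copied from the preview above (the unnamed index column is omitted)
sample_csv = """EASINID,Scientific Name,Label,All Names
R00046,Acacia mearnsii,Common Name,Acácia
R00046,Acacia mearnsii,Common Name,Acácia negra
R00046,Acacia mearnsii,Common Name,Acacia noir
R00046,Acacia mearnsii,Common Name,Acácia-negra
R00046,Acacia mearnsii,Common Name,Aromo
"""


def names_by_species(csv_text):
    """Map each (EASINID, scientific name) pair to its list of distinct names."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["EASINID"], row["Scientific Name"])
        name = row["All Names"]
        if name not in grouped[key]:  # keep first occurrence, preserve order
            grouped[key].append(name)
    return dict(grouped)


search_terms = names_by_species(sample_csv)
print(search_terms[("R00046", "Acacia mearnsii")])
```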


### 1.2. Get Wikipedia union list species names and Q identifiers

In [None]:
from list_mining.get_unionlist_wiki import (
    run_easin_sitelinks_pipeline
)
# Define custom output filenames
custom_q_numbers_file = 'unionconcern_invasive_species_qnumbers_2025.csv'
custom_sitelinks_file = 'unionconcern_invasive_species_wikipedia_links_2025.csv'

# Run the pipeline
df_q_numbers, df_sitelinks = run_easin_sitelinks_pipeline(
    wiki_url='https://en.wikipedia.org/wiki/List_of_invasive_alien_species_of_Union_concern',
    q_number_file=custom_q_numbers_file,
    sitelinks_file=custom_sitelinks_file
)

df_sitelinks.head()

--- Starting Pipeline for URL: https://en.wikipedia.org/wiki/List_of_invasive_alien_species_of_Union_concern ---
Step 1/4: Fetching webpage and extracting scientific names...
Step 2/4: Getting Wikidata Q-numbers (This may take time)...


100%|██████████| 88/88 [00:54<00:00,  1.62it/s]


Step 3/4: Fetching sitelinks for all EU languages (This may take time)...


Fetching sitelinks: 100%|██████████| 88/88 [00:55<00:00,  1.58it/s]

Step 4/4: Saving data to unionconcern_invasive_species_qnumbers_2025.csv and unionconcern_invasive_species_wikipedia_links_2025.csv...
Pipeline completed and data saved successfully.





Unnamed: 0,Scientific Name,Q-number,Language,Wikipedia Title
0,Acacia saligna,Q402385,de,Weidenblatt-Akazie
1,Acacia saligna,Q402385,en,Acacia saligna
2,Acacia saligna,Q402385,es,Acacia saligna
3,Acacia saligna,Q402385,fi,Siniakaasia
4,Acacia saligna,Q402385,fr,Acacia saligna
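
Each row pairs a language code with a page title, which is enough to reconstruct the address of the corresponding Wikipedia page. The sketch below shows one way to do that; `wikipedia_url` is a hypothetical helper and the rows are taken from the preview above.

```python
from urllib.parse import quote


def wikipedia_url(language, title):
    """Build a page URL from a language code and a Wikipedia page title."""
    # Wikipedia page URLs use underscores in place of spaces;
    # quote() percent-encodes any remaining special characters.
    path = quote(title.replace(" ", "_"))
    return f"https://{language}.wikipedia.org/wiki/{path}"


# (Language, Wikipedia Title) pairs from the preview above
sitelinks = [
    ("de", "Weidenblatt-Akazie"),
    ("en", "Acacia saligna"),
    ("fi", "Siniakaasia"),
]

urls = {language: wikipedia_url(language, title) for language, title in sitelinks}
for language, url in urls.items():
    print(language, url)
```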


### 1.3. Get GBIF species synonyms 

## 2. Invasive alien species internet activity mining 

### 2.1. Fetching Flickr images

### 2.2. Fetching Wikipedia geolocated pageviews

### 2.3. Fetching Wikipedia language-based pageviews

### 2.4. Fetching YouTube videos

### 2.5. Fetching iNaturalist observations

### 2.6. Fetching GBIF observations

### 2.7. Fetching EASIN observations

In [None]:
#get EASIN credentials (prerequisite for mining EASIN data)

from EASIN_mining_and_map_generation.EASIN_API_credentials_registration import register_user

register_user()  # make sure your EASIN e-mail and password are set in your .env file (EASIN_EMAIL and EASIN_PW in the example at the top) - not shown here for safety reasons

📡 Sending registration request to EASIN...
⚠️ Unexpected response [406]: {'Message': "Generating EASIN user didn't succeed. Message: Name simon.reynaert@plantentuinmeise.be is already taken."}


In [None]:
# get EASIN union list IAS PER COUNTRY occurrence data ('glimpse') through publicly available REST API

from EASIN_mining_and_map_generation.get_unionlist_presence_EASIN_final import fetch_easin_presence

# 1. Define your file paths
input_csv_path = "list_of_union_concern.csv"
output_csv_path = "EASIN_IAS_occurrences_EU.csv"

# 2. Call the function and capture its return values
rows, missing_species = fetch_easin_presence(input_csv=input_csv_path, #actual function call
                                             output_csv=output_csv_path)

# 3. Print the informative summary using the captured values
print("\n Data Processing Complete")
print(f"   - {len(rows)} total presence records were written to '{output_csv_path}'.")

# Estimate the species count as records per unique country
# (exact only if every species has exactly one row per country)
unique_countries = {r['country'] for r in rows}
# Guard against division by zero if no rows are returned:
if unique_countries:
    estimated_species = len(rows) // len(unique_countries)
    print(f"   - These records cover approximately {estimated_species} species.")
else:
    print("   - No country records found in the output data.")


if missing_species:
    print("\n **Species with No Confirmed Match in EASIN:**")
    print(f"   - **{len(missing_species)}** species were not matched.")
    for species in missing_species:
        print(f"    - {species}")
else:
    print("\n All input species were successfully matched and processed.")


 Data Processing Complete
   - 5368 total presence records were written to 'EASIN_IAS_occurrences_EU.csv'.
   - These records cover approximately 88 species.

 All input species were successfully matched and processed.
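
The species count in this summary is an estimate: the number of records divided by the number of unique countries. It is exact only when every species has exactly one row per country. The figures above are consistent with that layout, since 5368 records over 88 species gives 61 rows per species with no remainder. The sketch below contrasts the estimate with a direct count on made-up rows; only the `country` key is known from the cell above, and the `species` key is an assumption for illustration.

```python
# Made-up rows: the 'species' key is hypothetical, only 'country' appears
# in the cell above.
rows = [
    {"species": "Acacia saligna", "country": "BE"},
    {"species": "Acacia saligna", "country": "FR"},
    {"species": "Xenopus laevis", "country": "BE"},
    {"species": "Xenopus laevis", "country": "FR"},
]

# Estimate used in the summary: records per unique country
unique_countries = {r["country"] for r in rows}
estimated_species = len(rows) // len(unique_countries) if unique_countries else 0

# Direct count, possible whenever a species field is available
exact_species = len({r["species"] for r in rows})
print(estimated_species, exact_species)

# Arithmetic check on the figures printed in the summary above
records, species = 5368, 88
rows_per_species, remainder = divmod(records, species)
print(rows_per_species, remainder)
```

If some species had rows for only a subset of countries, the two counts would diverge, so a direct count is the safer choice whenever the rows carry a species identifier.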


In [3]:
# get all available EASIN IAS occurrences (full records) using personal API credentials

from EASIN_mining_and_map_generation.get_EASIN_observations import run_easin_fetcher

run_easin_fetcher(species_file="UnionList_Species_Traits_85_present.csv",
                  output_file="EASIN_observations_BE_2010-2015.csv",
                  countries=["BE"],
                  start_date="2010",
                  end_date="2015")

✅ Created new output file 'EASIN_observations_BE_2010-2015.csv' with 11 fixed fields.
🔄 Loading species data from UnionList_Species_Traits_85_present.csv...
🔍 Found 85 unique species IDs to process.
ℹ️ Resuming: 0 species already processed in EASIN_observations_BE_2010-2015.csv.
🔎 Date Filter:  (2010 to 2015)


🦎 Processing species:   0%|          | 0/85 [00:00<?, ?species/s, Status=Saved, Records=0]

✅ Saved 0 records for species R00053


🦎 Processing species:   1%|          | 1/85 [00:04<02:43,  1.94s/species, Status=Saved, Records=22]

✅ Saved 22 records for species R00212


🦎 Processing species:   2%|▏         | 2/85 [00:07<03:44,  2.71s/species, Status=Saved, Records=3107]

✅ Saved 3085 records for species R00460


🦎 Processing species:   4%|▎         | 3/85 [00:26<12:09,  8.90s/species, Status=Saved, Records=3107]


KeyboardInterrupt: 
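
The log reports how many species were already processed in the output file, so an interrupted run like this one can be continued later. How the package tracks processed species is not shown here. The sketch below illustrates one possible approach, reading the species IDs already present in the output CSV and skipping them; the helper names and the `species_id` column are assumptions.

```python
import csv
import os
import tempfile


def already_processed(output_file, id_column="species_id"):
    """Return the set of species IDs found in an existing output CSV."""
    if not os.path.exists(output_file):
        return set()
    with open(output_file, newline="", encoding="utf-8") as handle:
        return {row[id_column] for row in csv.DictReader(handle) if row.get(id_column)}


def pending_species(all_ids, output_file, id_column="species_id"):
    """Return the IDs that still need to be fetched, in their original order."""
    done = already_processed(output_file, id_column)
    return [species_id for species_id in all_ids if species_id not in done]


# Demo with a temporary file standing in for a partially written output
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_output = os.path.join(tmp_dir, "observations.csv")
    with open(demo_output, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["species_id", "record"])  # hypothetical columns
        writer.writerow(["R00212", "example"])
    remaining = pending_species(["R00053", "R00212", "R00460"], demo_output)

print(remaining)
```

One limitation of this approach is that a species with zero records, such as R00053 in the log above, leaves no rows in the output and would be fetched again, so a real implementation would need separate bookkeeping for such cases.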

## 3. Cleaning up internet activity data 

### 3.1. Deduplicating and geolocating Flickr images

In [18]:
# !!only works if the mined data .csv is located in the same folder as this notebook!!
from data_processing.process_flickr_images import process_flickr_data

process_flickr_data("flickr_species_observations_eu_combined_latin_normtag_2004-now.csv", "output_flickr_processing_test.csv", 100)

Deduplicating rows: 100%|██████████| 5054/5054 [00:00<00:00, 23721.05it/s]

Deduplicated & geocoded European data saved to: output_flickr_processing_test.csv
Number of rows in final CSV: 3055
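
The log reports 5054 input rows and 3055 rows in the final CSV. The criteria that `process_flickr_data` applies are not shown in this notebook, so the following is only a generic sketch of key-based deduplication. The field names `photo_id` and `species`, the helper `deduplicate` and the rows are all made up for illustration.

```python
def deduplicate(rows, key_fields):
    """Keep the first row for each distinct combination of key fields."""
    seen = set()
    unique_rows = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


# Made-up rows: the same photo appears twice for the same species
photos = [
    {"photo_id": "1", "species": "Acacia saligna", "tag": "acacia"},
    {"photo_id": "1", "species": "Acacia saligna", "tag": "acaciasaligna"},
    {"photo_id": "2", "species": "Acacia saligna", "tag": "acacia"},
]

unique_photos = deduplicate(photos, key_fields=("photo_id", "species"))
print(len(photos), "->", len(unique_photos))
```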





### 3.2. Geolocating and pivoting iNaturalist observations

In [None]:
# !!only works if the mined data .csv is located in the same folder as this notebook!!
from data_processing.geolocate_process_inaturalist_data import process_inat_data

process_inat_data("species_inat_observations_onlycasual", "processed_inat_observations.csv")

Geolocated CSV saved to: species_inat_observations_onlycasual\Acacia_saligna_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Acacia_saligna_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Acridotheres_tristis_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Acridotheres_tristis_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Ailanthus_altissima_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Ailanthus_altissima_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Alopochen_aegyptiaca_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Alopochen_aegyptiaca_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Alternanthera_philoxeroides_geolocated.csv
Geolocated CSV saved to: species_inat_observations_onlycasual\Alternanthera_philoxeroides_geolocated.csv
Geolocated CSV saved

date_str,Scientific Name,Country,2016-01-01,2016-01-02,2016-01-03,2016-01-04,2016-01-05,2016-01-06,2016-01-07,2016-01-08,...,2025-07-07,2025-07-08,2025-07-09,2025-07-10,2025-07-11,2025-07-12,2025-07-13,2025-07-14,2025-07-15,2025-07-16
0,Acacia saligna,AL,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Acacia saligna,AT,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Acacia saligna,ES,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
3,Acacia saligna,FR,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Acacia saligna,GR,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
970,Xenopus laevis,PT,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
971,Xenopus laevis,RU,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
972,Xenopus laevis,SE,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
973,Xenopus laevis,SK,0.0,0.0,0,0.0,0,0.0,0,0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
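
The processed table is in wide format: one row per species and country, one column per day, with each cell holding that day's observation count. The sketch below shows the idea of such a pivot using only the standard library. It is not the package's implementation; `daily_counts` is a hypothetical helper and the observations are made up.

```python
from collections import Counter
from datetime import date, timedelta


def daily_counts(observations, start, end):
    """Pivot (species, country, day) records into daily counts per species and country."""
    counts = Counter(observations)
    days = [start + timedelta(days=offset) for offset in range((end - start).days + 1)]
    pairs = sorted({(species, country) for species, country, _ in observations})
    # Days without observations get a count of 0
    table = {
        (species, country): [counts[(species, country, day)] for day in days]
        for species, country in pairs
    }
    return days, table


# Made-up observations as (species, country, day) tuples
observations = [
    ("Acacia saligna", "ES", date(2025, 7, 9)),
    ("Acacia saligna", "ES", date(2025, 7, 10)),
    ("Acacia saligna", "ES", date(2025, 7, 10)),
    ("Acacia saligna", "GR", date(2025, 7, 10)),
]

days, table = daily_counts(observations, date(2025, 7, 9), date(2025, 7, 11))
for (species, country), row in table.items():
    print(species, country, row)
```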


### 3.3. Processing dates and pivoting GBIF data

In [None]:
# !!only works if the mined data .csv is located in the same folder as this notebook!!
from data_processing.process_GBIF_observations import process_gbif_data

process_gbif_data(
    input_file="GBIF_species_occurrences_EU.csv")

Loading data from GBIF_species_occurrences_EU.csv...
Successfully loaded 2,640,153 rows.

Parsing event dates...


Parsing event dates: 100%|██████████| 2640153/2640153 [00:57<00:00, 45602.54it/s]



--- Parsing Summary ---
Total rows: 2,640,153
Parsed successfully: 2,630,741 (99.64%)
Failed parses: 9,412 (0.36%)
Saved failed date ranges to: GBIF_species_occurrences_EU_failed_dates.csv

Creating time series from 2016-01-01 to 2025-07-13...

Processed dataset saved to: GBIF_species_occurrences_EU_processed.csv


date_str,Scientific Name,Country,2016-01-01,2016-01-02,2016-01-03,2016-01-04,2016-01-05,2016-01-06,2016-01-07,2016-01-08,...,2025-07-04,2025-07-05,2025-07-06,2025-07-07,2025-07-08,2025-07-09,2025-07-10,2025-07-11,2025-07-12,2025-07-13
0,Acacia saligna,AL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,Acacia saligna,BE,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,Acacia saligna,CY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,Acacia saligna,DK,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,Acacia saligna,ES,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
902,Xenopus laevis,FR,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
903,Xenopus laevis,GB,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
904,Xenopus laevis,IT,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
905,Xenopus laevis,NL,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
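
The parsing summary separates event dates that could be parsed from those that could not, and the failures are written to a file of failed date ranges. The sketch below illustrates this kind of split under the assumption that single-day ISO dates parse and date ranges (written with a `/`) do not. The package's actual rules are not shown in this notebook, and both helper names and all input values are made up.

```python
from datetime import datetime


def parse_event_date(value):
    """Return a date for single-day values, or None if the value cannot be parsed."""
    text = (value or "").strip()
    if not text or "/" in text:  # empty values and ranges count as failures here
        return None
    try:
        # Use only the YYYY-MM-DD part, ignoring any time component
        return datetime.strptime(text[:10], "%Y-%m-%d").date()
    except ValueError:
        return None


def split_parsed_failed(values):
    """Split raw values into parsed dates and the raw values that failed."""
    parsed, failed = [], []
    for value in values:
        parsed_date = parse_event_date(value)
        if parsed_date is None:
            failed.append(value)
        else:
            parsed.append(parsed_date)
    return parsed, failed


# Made-up input values
raw_dates = ["2016-01-01", "2016-03-15T10:30:00", "2016-01-01/2016-01-05", "", "2016-01"]
parsed, failed = split_parsed_failed(raw_dates)
print(f"Parsed successfully: {len(parsed)} ({len(parsed) / len(raw_dates):.2%})")
print(f"Failed parses: {len(failed)} ({len(failed) / len(raw_dates):.2%})")
```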


## 4. Data exploration and visualizations (more in Rmd files)

### 4.1. EASIN IAS presence map generation

In [None]:
# !!only works if the mined data .csv is located in the same folder as this notebook!!
from EASIN_mining_and_map_generation.generate_html_maps_IAS_presence_EASIN import generate_species_maps

generate_species_maps(csv_file="species_by_country_presence_EASIN_updated.csv",  # input CSV
                      shapefile_dir="natural_earth",
                      map_output_dir="easin_species_maps_output")

Downloading Natural Earth shapefile...
Shapefile downloaded and extracted.
Saving map for Acacia saligna → easin_species_maps_output\Acacia_saligna_map.html
Saving map for Acridotheres tristis → easin_species_maps_output\Acridotheres_tristis_map.html
Saving map for Ailanthus altissima → easin_species_maps_output\Ailanthus_altissima_map.html
Saving map for Alopochen aegyptiaca → easin_species_maps_output\Alopochen_aegyptiaca_map.html
Saving map for Alternanthera philoxeroides → easin_species_maps_output\Alternanthera_philoxeroides_map.html
Saving map for Ameiurus melas → easin_species_maps_output\Ameiurus_melas_map.html
Saving map for Andropogon virginicus → easin_species_maps_output\Andropogon_virginicus_map.html
Saving map for Arthurdendyus triangulatus → easin_species_maps_output\Arthurdendyus_triangulatus_map.html
Saving map for Asclepias syriaca → easin_species_maps_output\Asclepias_syriaca_map.html
Saving map for Axis axis → easin_species_maps_output\Axis_axis_map.html
Saving map 

['easin_species_maps_output\\Acacia_saligna_map.html',
 'easin_species_maps_output\\Acridotheres_tristis_map.html',
 'easin_species_maps_output\\Ailanthus_altissima_map.html',
 'easin_species_maps_output\\Alopochen_aegyptiaca_map.html',
 'easin_species_maps_output\\Alternanthera_philoxeroides_map.html',
 'easin_species_maps_output\\Ameiurus_melas_map.html',
 'easin_species_maps_output\\Andropogon_virginicus_map.html',
 'easin_species_maps_output\\Arthurdendyus_triangulatus_map.html',
 'easin_species_maps_output\\Asclepias_syriaca_map.html',
 'easin_species_maps_output\\Axis_axis_map.html',
 'easin_species_maps_output\\Baccharis_halimifolia_map.html',
 'easin_species_maps_output\\Cabomba_caroliniana_map.html',
 'easin_species_maps_output\\Callosciurus_erythraeus_map.html',
 'easin_species_maps_output\\Callosciurus_finlaysonii_map.html',
 'easin_species_maps_output\\Cardiospermum_grandiflorum_map.html',
 'easin_species_maps_output\\Celastrus_orbiculatus_map.html',
 'easin_species_maps_ou

In [5]:
from IPython.display import display, HTML
import os

# Define the file path of your saved map
map_filename = r'easin_species_maps_output\Acacia_saligna_map.html'

# Check if the file exists before attempting to display it
if os.path.exists(map_filename):
    print(f"Displaying map from: {map_filename}")
    
    # Read the HTML file content
    with open(map_filename, 'r', encoding='utf-8') as f:
        html_content = f.read()
    
    # Use IPython.display.HTML to render the HTML content directly
    # in the notebook output cell.
    display(HTML(html_content))
    
else:
    print(f"Error: Map file not found at '{map_filename}'.")
    print("Please ensure the map generation step was successful.")

Displaying map from: easin_species_maps_output\Acacia_saligna_map.html
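
The hard-coded path above uses a Windows-style backslash. Judging from the file names in the output, each map is saved as the scientific name with spaces replaced by underscores, followed by `_map.html`. Under that assumption, the path can be built in a platform-independent way with `pathlib`; `map_path` is a hypothetical helper, not a package function.

```python
from pathlib import Path


def map_path(species, output_dir="easin_species_maps_output"):
    """Build the expected map file path for a species (naming pattern assumed from the output above)."""
    file_name = f"{species.replace(' ', '_')}_map.html"
    return Path(output_dir) / file_name


path = map_path("Acacia saligna")
print(path.name)
print(path.exists())  # True only after the map generation step has been run
```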
