# Marine fish abundance investigation: Bay of Biscay and Algarve
This project investigates the drivers of fish abundance along the Atlantic coast of France in the Bay of Biscay, extending down the west-facing coasts of Spain and Portugal. The investigation uses data taken from OBIS (the Ocean Biodiversity Information System) and showcases my ability to extract/source, clean/manipulate, model/analyse and finally interpret/predict. The dataset holds presence data across a range of taxa. This is not a scientific investigation, so I shall not provide written interpretation of the models, nor explicitly explain scientific principles of which the reader may be unaware. Enjoy the show :)

This notebook focuses on **Step 1. Data Collection**

**Notebook Objectives**
- Set up filters to download data from the OBIS database.
- Limit the records extracted to a bounding box covering the Bay of Biscay to the Algarve
- Filter to specific taxa: Bony fishes (Actinopterygii)
- Save the dataset locally as a csv for future analysis
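
The bounding-box objective can be sketched as a small helper that turns the corner coordinates into a WKT polygon string. This is only an illustration: `bbox_to_wkt` is a hypothetical helper name, not part of OBIS or any library, and the coordinates are the ones used for the study area in this notebook.

```python
def bbox_to_wkt(min_lon, min_lat, max_lon, max_lat):
    """Return a closed WKT polygon (longitude latitude order) for a bounding box."""
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("min values must be smaller than max values")
    corners = [
        (min_lon, min_lat),
        (max_lon, min_lat),
        (max_lon, max_lat),
        (min_lon, max_lat),
        (min_lon, min_lat),  # repeat the first corner to close the ring
    ]
    # .10g drops trailing zeros (so -10.0 becomes "-10") while keeping decimals
    ring = ", ".join(f"{float(lon):.10g} {float(lat):.10g}" for lon, lat in corners)
    return f"POLYGON(({ring}))"


# Study area used in this notebook: Bay of Biscay down to the Algarve
study_area_wkt = bbox_to_wkt(-10, 36, -1, 48)
print(study_area_wkt)
```

Building the string from four numbers makes it harder to forget the closing corner or swap latitude and longitude when the study area changes.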

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import json

In [3]:
# After a few attempts I read up on OBIS pagination, and apparently the best way to
# ensure only spatially relevant records are pulled is to define a polygon for your study area, so...

wkt = "POLYGON((-10 36, -1 36, -1 48, -10 48, -10 36))"
obis_url = "https://api.obis.org/v3/occurrence"

# Now I need to specify each record pull: use the polygon for spatial relevance and bony fish as the focus.
# The size of 5000 is the maximum OBIS allows per pull, so I'll have to build a loop to get more

params_specific = {
    "scientificname": "Actinopterygii",  # class: bony fishes
    "geometry": wkt,                     # enforce spatial filter
    "size": 5000
}

# Define the parameters of the data pull loop: each pass fetches at most 5000 records, and I want up to, let's say, 100k records.
# I could go higher for greater accuracy/sample size, but that's not the point of this project
all_results = []
batch_size = 5000
offset = 0
max_records = 100000

## Cheeky print to let me know that the code isn't dead in the water
print("Starting geo-filtered download with geometry...")

# I had a few issues with records turning up from America,
# so I copy the base parameters with dict() on every pass to ensure the spatial filter is kept in each request.
# I've also added a couple of print calls to show the specific URL every time (to check I'm pulling the right records)

while offset < max_records:
   
    params = dict(params_specific)
    params["offset"] = offset

    r = requests.get(obis_url, params=params, timeout=60)
    print("URL:", r.url)  

    if r.status_code != 200:
        print(f"Error {r.status_code} at offset {offset}")
        break

    batch = r.json().get("results", [])
    if not batch:
        print("No more results.")
        break

    all_results.extend(batch)
    print(f"Fetched {len(batch)} | Total: {len(all_results)}")
    offset += batch_size

# Print a message to let me know when it's good to go

print("Done. Total records pulled:", len(all_results))

df_fish = pd.DataFrame(all_results)
print(df_fish.shape)

Starting geo-filtered download with geometry...
URL: https://api.obis.org/v3/occurrence?scientificname=Actinopterygii&geometry=POLYGON%28%28-10+36%2C+-1+36%2C+-1+48%2C+-10+48%2C+-10+36%29%29&size=5000&offset=0
Fetched 5000 | Total: 5000
URL: https://api.obis.org/v3/occurrence?scientificname=Actinopterygii&geometry=POLYGON%28%28-10+36%2C+-1+36%2C+-1+48%2C+-10+48%2C+-10+36%29%29&size=5000&offset=5000
Fetched 5000 | Total: 10000
URL: https://api.obis.org/v3/occurrence?scientificname=Actinopterygii&geometry=POLYGON%28%28-10+36%2C+-1+36%2C+-1+48%2C+-10+48%2C+-10+36%29%29&size=5000&offset=10000
Fetched 5000 | Total: 15000
URL: https://api.obis.org/v3/occurrence?scientificname=Actinopterygii&geometry=POLYGON%28%28-10+36%2C+-1+36%2C+-1+48%2C+-10+48%2C+-10+36%29%29&size=5000&offset=15000
Fetched 5000 | Total: 20000
URL: https://api.obis.org/v3/occurrence?scientificname=Actinopterygii&geometry=POLYGON%28%28-10+36%2C+-1+36%2C+-1+48%2C+-10+48%2C+-10+36%29%29&size=5000&offset=20000
Fetched 5000 | T
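
The totals printed above would look identical whether each batch held new records or the same 5000 repeated, so it may be worth checking for repeated identifiers before trusting the pull. A minimal sketch, assuming each OBIS record carries a unique `id` field (an assumption; check the keys of one record first). `summarise_ids` is a hypothetical helper and the sample records are made up.

```python
from collections import Counter


def summarise_ids(records, key="id"):
    """Count records with an identifier, unique identifiers, and repeated identifiers."""
    ids = [rec.get(key) for rec in records if rec.get(key) is not None]
    counts = Counter(ids)
    repeated = sum(1 for n in counts.values() if n > 1)
    return {"with_id": len(ids), "unique": len(counts), "repeated": repeated}


# Tiny made-up sample standing in for all_results
sample = [{"id": "a"}, {"id": "b"}, {"id": "a"}, {"scientificName": "no id here"}]
summary = summarise_ids(sample)
print(summary)
```

Calling `summarise_ids(all_results)` after the loop should give `unique` equal to `with_id` if pagination behaved; a large `repeated` count would suggest the offset was not applied as intended.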

In [12]:
## Convert the output to a data frame called df_fish
df_fish = pd.DataFrame(all_results)

## Preview the first few rows. Look at the dataset names: if any Great Lakes records show up, the spatial filter has gone wrong ##
df_fish.head()

Unnamed: 0,basisOfRecord,brackish,catalogNumber,class,classid,collectionCode,country,datasetID,datasetName,date_end,...,continent,higherGeography,verbatimLatitude,verbatimLongitude,otherCatalogNumbers,typeStatus,georeferenceRemarks,maximumElevationInMeters,minimumElevationInMeters,verbatimCoordinates
0,Occurrence,False,414798_75_126822_A1550_GOV_-9_1_1.0000_360_1_-...,Teleostei,293496.0,EVHOE,FR,https://marineinfo.org/id/dataset/2759,DATRAS: ICES Database of trawl surveys,1668730000000.0,...,,,,,,,,,,
1,Occurrence,False,1494889,Teleostei,293496.0,Mackerel and Horse Mackerel Eggs Survey,,https://marineinfo.org/id/dataset/2470,Mackerel and Horse Mackerel Eggs Survey,991008000000.0,...,,,,,,,,,,
2,Occurrence,False,185765_94_126716_C0447_GOV_U_1_1.0000_15___1_20,Teleostei,293496.0,EVHOE,FR,https://marineinfo.org/id/dataset/2759,DATRAS: ICES Database of trawl surveys,909878400000.0,...,,,,,,,,,,
3,Occurrence,True,329475_48_126426_Y0336_GOV_-9_1_1.0000_100_-9_...,Teleostei,293496.0,EVHOE,FR,https://marineinfo.org/id/dataset/2759,DATRAS: ICES Database of trawl surveys,1604448000000.0,...,,,,,,,,,,
4,Occurrence,False,1495126,Teleostei,293496.0,Mackerel and Horse Mackerel Eggs Survey,,https://marineinfo.org/id/dataset/2470,Mackerel and Horse Mackerel Eggs Survey,1085357000000.0,...,,,,,,,,,,
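
Eyeballing dataset names only covers the first few rows, so a coordinate check over every record is a more thorough way to catch strays. The sketch below assumes the records expose `decimalLatitude` and `decimalLongitude` as numbers and reuses the study-area bounds from the polygon; `outside_bbox` is a hypothetical helper, and the second sample point is a made-up Great Lakes position.

```python
def outside_bbox(records, min_lon=-10, min_lat=36, max_lon=-1, max_lat=48):
    """Return the records whose coordinates are missing or fall outside the box."""
    strays = []
    for rec in records:
        lon = rec.get("decimalLongitude")
        lat = rec.get("decimalLatitude")
        if lon is None or lat is None:
            strays.append(rec)  # no usable position at all
        elif not (min_lon <= lon <= max_lon and min_lat <= lat <= max_lat):
            strays.append(rec)  # position lies outside the study area
    return strays


sample = [
    {"decimalLatitude": 47.7797, "decimalLongitude": -7.5153},  # inside the study area
    {"decimalLatitude": 44.0, "decimalLongitude": -82.0},       # made-up Great Lakes point
]
strays = outside_bbox(sample)
print(len(strays), "record(s) outside the study area")
```

Running `outside_bbox(all_results)` should return an empty list if the geometry filter was applied on every request.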


In [13]:
## Now I'll choose some columns that will help me with the investigation later e.g. life stage, sex, family name
columns_to_keep = [
    'scientificName',
    'family',
    'decimalLatitude', 'decimalLongitude',
    'eventDate', 'depth',
    'basisOfRecord', 'datasetName', 'institutionCode',
    'lifeStage', 'sex', 'individualCount'
]

## Keep only the columns specified above; .copy() avoids pandas chained-assignment warnings if df_clean is modified later ##
df_clean = df_fish[columns_to_keep].copy()

## Show first few records for inspection ##
df_clean.head(10)

Unnamed: 0,scientificName,family,decimalLatitude,decimalLongitude,eventDate,depth,basisOfRecord,datasetName,institutionCode,lifeStage,sex,individualCount
0,Trachurus trachurus,Carangidae,47.7797,-7.5153,2022-11-18T07:37:00,131.0,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
1,Trachurus trachurus,Carangidae,47.25,-5.75,2001-05-28,205.0,Occurrence,Mackerel and Horse Mackerel Eggs Survey,,,,
2,Argentina sphyraena,Argentinidae,46.7597,-3.2023,1998-11-01T07:15:00,50.5,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
3,Engraulis encrasicolus,Engraulidae,44.8643,-1.2862,2020-11-04T11:11:00,13.0,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
4,Trachurus trachurus,Carangidae,47.25,-9.3167,2004-05-24,4004.0,Occurrence,Mackerel and Horse Mackerel Eggs Survey,,,,
5,Trisopterus minutus,Gadidae,47.8869,-7.2667,2012-11-12T14:59:00,95.0,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
6,Capros aper,Caproidae,47.7982,-6.6801,2011-11-08T07:20:00,65.5,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
7,Arnoglossus imperialis,Bothidae,46.5099,-3.3528,2007-10-20T07:47:00,61.0,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
8,Scomber scombrus,Scombridae,43.8738,-8.5008,2010-04-19,132.5,Occurrence,ICES EggsAndLarvae,,,,
9,Micromesistius poutassou,Gadidae,47.6663,-4.8051,1998-10-28T09:47:00,58.0,Occurrence,DATRAS: ICES Database of trawl surveys,,,,
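
In the ten rows shown, `institutionCode`, `lifeStage`, `sex` and `individualCount` are all empty, so it may be worth measuring how sparse each column is across the full download before relying on it. A minimal sketch working on plain record dicts; `missing_share` is a hypothetical helper and the sample rows are abbreviated from the preview above.

```python
def missing_share(records, columns):
    """Return, per column, the fraction of records where the value is absent, None or ''."""
    total = len(records)
    if total == 0:
        return {col: 0.0 for col in columns}
    shares = {}
    for col in columns:
        missing = sum(1 for rec in records if rec.get(col) in (None, ""))
        shares[col] = missing / total
    return shares


# Two rows abbreviated from the preview; lifeStage is empty in both
sample = [
    {"scientificName": "Trachurus trachurus", "depth": 131.0, "lifeStage": None},
    {"scientificName": "Argentina sphyraena", "depth": 50.5},
]
shares = missing_share(sample, ["scientificName", "depth", "lifeStage"])
print(shares)
```

On the DataFrame itself, `df_clean.isna().mean()` gives a similar per-column fraction of missing values.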


In [14]:
## Save cleaned raw data to a csv ##
df_clean.to_csv("raw_obis_fish_occurences.csv", index=False)