<img src="../Images/DSC_Logo.png" style="width: 400px;">

# Data Preprocessing of PANGAEA Datasets

This notebook prepares previously downloaded PANGAEA datasets for visualization and analysis of orca and other cetacean sightings along Polarstern cruises.  

> Note: Only the data tables are processed here. Metadata files that contain information such as copyright and citation are not handled in this workflow, but they must always be consulted for interpretation and referenced in any use of the datasets.

# 1 Import Libraries

In [1]:
import pandas as pd
import numpy as np
import os

# 2 Orca Datasets
## 2.1 Load Multiple Files into a Dataframe

Set path to orca data folder:

In [2]:
dataset_directory = "../Data/PANGAEA_orca_data/Datasets"

Define columns to keep:

In [3]:
columns = ["DATE/TIME", "LATITUDE", "LONGITUDE", "Whale species", "Individuals [#]", "Event"]

We want to load all PANGAEA orca datasets and combine them into a single dataframe. This is straightforward for our chosen example because the retrieved PANGAEA orca datasets are already standardized: they share the same column names, so we can combine them directly without extra preprocessing steps such as renaming columns.

In [4]:
frames = []

for filename in os.listdir(dataset_directory):
    file_path = os.path.join(dataset_directory, filename)
    df = pd.read_csv(file_path, sep="\t", usecols=columns)
    frames.append(df)

df_orca = pd.concat(frames, ignore_index=True)
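The direct concatenation above relies on every file exposing the same column names. A minimal sketch of how that assumption could be checked before combining, using small in-memory tables instead of the real files; the helper name `find_missing_columns` and the sample tables are hypothetical and not part of the original workflow:

```python
import io
import pandas as pd

# Columns the workflow expects in every orca dataset (same list as above)
expected = ["DATE/TIME", "LATITUDE", "LONGITUDE", "Whale species", "Individuals [#]", "Event"]

def find_missing_columns(source, expected_columns, sep="\t"):
    """Return the expected columns that are absent from a delimited file (hypothetical helper)."""
    # nrows=0 reads only the header line, so the check stays cheap for large files
    header = pd.read_csv(source, sep=sep, nrows=0)
    return [col for col in expected_columns if col not in header.columns]

# Two tiny stand-in tables: one complete, one lacking "Event"
complete = io.StringIO("DATE/TIME\tLATITUDE\tLONGITUDE\tWhale species\tIndividuals [#]\tEvent\n")
incomplete = io.StringIO("DATE/TIME\tLATITUDE\tLONGITUDE\tWhale species\tIndividuals [#]\n")

missing_in_complete = find_missing_columns(complete, expected)
missing_in_incomplete = find_missing_columns(incomplete, expected)
print(missing_in_complete)
print(missing_in_incomplete)
```

If such a check reported missing columns for a real file, `pd.read_csv(..., usecols=columns)` would fail for that file, so the check mainly gives a clearer message up front.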

Preview df:

In [5]:
df_orca.head()

|   | DATE/TIME | LATITUDE | LONGITUDE | Whale species | Individuals [#] | Event |
|---|---|---|---|---|---|---|
| 0 | 2018-07-12 23:20:00 | 64.34152 | 4.32615 | Whale, unidentified | 1.0 | PS114-track |
| 1 | 2018-07-15 15:24:00 | 76.98338 | 4.69592 | Large whale, unidentified | 1.0 | PS114-track |
| 2 | 2018-07-15 22:25:00 | 78.3317 | 4.45958 | Dolphins, unidentified | 4.0 | PS114-track |
| 3 | 2018-07-15 22:50:00 | 78.41395 | 4.44458 | Dolphins, unidentified | 1.0 | PS114-track |
| 4 | 2018-07-15 23:30:00 | 78.54113 | 4.42075 | Dolphins, unidentified | 3.0 | PS114-track |


Length of df:

In [6]:
len(df_orca)

618

## 2.2 Select Orca Data

The column "Whale species" contains sightings not only of orcas but also of other whale species:

In [7]:
df_orca['Whale species'].unique()

array(['Whale, unidentified', 'Large whale, unidentified',
       'Dolphins, unidentified', 'Lagenorhynchus albirostris',
       'Balaenoptera musculus', 'Baleen whale, unidentified',
       'Balaena mysticetus', 'Megaptera novaeangliae',
       'Physeter macrocephalus', 'Balaenoptera physalus',
       'Delphinapterus leucas', 'Balaenoptera acutorostrata',
       'Orcinus orca', 'Small whale, unidentified',
       'Lagenorhynchus acutus', 'Delphinus delphis', 'Lagenorhynchus sp.',
       'Globicephala melas', 'Phocoena phocoena', 'Balaenoptera borealis',
       'Monodon monocerus', 'Balaenotptera physalus', 'Monodon monoceros',
       'Lagenorhyncus sp.', 'Large whale, unidenified'], dtype=object)
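The output above lists a few labels that look like spelling variants of the same taxon (for example 'Balaenotptera physalus' next to 'Balaenoptera physalus'). They do not affect a selection of 'Orcinus orca', but if other species were analyzed, the variants could be merged first. A minimal sketch, assuming these pairs really denote the same taxon; the mapping `label_fixes` is our own illustration and not part of the PANGAEA datasets:

```python
import pandas as pd

# Assumed spelling variants (left) and the label they presumably stand for (right)
label_fixes = {
    "Balaenotptera physalus": "Balaenoptera physalus",
    "Monodon monocerus": "Monodon monoceros",
    "Lagenorhyncus sp.": "Lagenorhynchus sp.",
    "Large whale, unidenified": "Large whale, unidentified",
}

# Toy series standing in for df_orca['Whale species']
species = pd.Series([
    "Balaenotptera physalus",
    "Orcinus orca",
    "Monodon monocerus",
    "Balaenoptera physalus",
])

# replace() with a dict swaps exact matches and leaves all other labels untouched
harmonized = species.replace(label_fixes)
print(harmonized.unique())
```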

Here we are only interested in 'Orcinus orca', so we filter the dataframe:

In [8]:
# Filter
df_orca = df_orca[df_orca['Whale species'] == 'Orcinus orca']

# Reset index
df_orca.reset_index(drop=True, inplace=True)

Remove rows where "Individuals [#]" is NaN:

In [9]:
df_orca = df_orca.dropna(subset=['Individuals [#]']).reset_index(drop=True)
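Before dropping rows it can be useful to know how many records would be lost. A small sketch on toy data (the values are invented for illustration) that counts the missing entries first and then applies the same filter and `dropna` steps as above:

```python
import numpy as np
import pandas as pd

# Toy table: two orca records (one without a count) and one dolphin record
toy = pd.DataFrame({
    "Whale species": ["Orcinus orca", "Orcinus orca", "Delphinus delphis"],
    "Individuals [#]": [4.0, np.nan, 2.0],
})

# Select the orca rows
orcas = toy[toy["Whale species"] == "Orcinus orca"]

# Count rows that lack a number of individuals before removing them
n_missing = int(orcas["Individuals [#]"].isna().sum())

# Drop those rows and renumber the index from 0
orcas_clean = orcas.dropna(subset=["Individuals [#]"]).reset_index(drop=True)
print(n_missing, len(orcas_clean))
```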

Preview df:

In [10]:
df_orca.head()

|   | DATE/TIME | LATITUDE | LONGITUDE | Whale species | Individuals [#] | Event |
|---|---|---|---|---|---|---|
| 0 | 2018-08-06 02:05:00 | 71.13038 | 16.85015 | Orcinus orca | 4.0 | PS115/1-track |
| 1 | 2017-05-25 17:22:00 | 58.1091 | 4.45383 | Orcinus orca | 5.0 | PS106/1-track |
| 2 | 2016-06-17 19:57:00 | 69.68787 | 9.94335 | Orcinus orca | 10.0 | PS99.1-track |
| 3 | 2015-05-21 03:19:00 | 59.51962 | 3.43712 | Orcinus orca | 5.0 | PS92-track |
| 4 | 2015-05-24 15:12:00 | 70.2437 | 13.72962 | Orcinus orca | 5.0 | PS92-track |


Length of df:

In [11]:
len(df_orca)

27

After filtering and removing incomplete rows, our combined dataset contains 27 orca sightings.

## 2.3 Datetime Conversion

The column "DATE/TIME" contains the full timestamp of each sighting in the format YYYY-MM-DD HH:MM:SS. For our analysis, let's assume we are only interested in the year of each record.

We first convert the column into a proper datetime format in pandas:

In [12]:
df_orca['DATE/TIME'] = pd.to_datetime(df_orca['DATE/TIME'])

And then extract the year as a separate column:

In [13]:
df_orca['Year'] = df_orca['DATE/TIME'].dt.year
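The two steps above can be tried on a toy column. The sketch below additionally passes an explicit `format` and `errors="coerce"`, which is our own addition and not used in the original cell; it assumes that unparseable entries should become missing values rather than raise an error:

```python
import pandas as pd

# Two timestamps in the YYYY-MM-DD HH:MM:SS format plus one deliberately broken entry
toy = pd.DataFrame({
    "DATE/TIME": ["2018-08-06 02:05:00", "2017-05-25 17:22:00", "not a date"]
})

# errors="coerce" turns entries that do not match the format into NaT
toy["DATE/TIME"] = pd.to_datetime(
    toy["DATE/TIME"], format="%Y-%m-%d %H:%M:%S", errors="coerce"
)
n_unparsed = int(toy["DATE/TIME"].isna().sum())

# .dt.year extracts the year; rows with NaT get NaN, so the column becomes float
toy["Year"] = toy["DATE/TIME"].dt.year
print(toy)
```

With clean input, as in the orca data shown here, the plain `pd.to_datetime` call is sufficient.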

In [14]:
df_orca.head()

|   | DATE/TIME | LATITUDE | LONGITUDE | Whale species | Individuals [#] | Event | Year |
|---|---|---|---|---|---|---|---|
| 0 | 2018-08-06 02:05:00 | 71.13038 | 16.85015 | Orcinus orca | 4.0 | PS115/1-track | 2018 |
| 1 | 2017-05-25 17:22:00 | 58.1091 | 4.45383 | Orcinus orca | 5.0 | PS106/1-track | 2017 |
| 2 | 2016-06-17 19:57:00 | 69.68787 | 9.94335 | Orcinus orca | 10.0 | PS99.1-track | 2016 |
| 3 | 2015-05-21 03:19:00 | 59.51962 | 3.43712 | Orcinus orca | 5.0 | PS92-track | 2015 |
| 4 | 2015-05-24 15:12:00 | 70.2437 | 13.72962 | Orcinus orca | 5.0 | PS92-track | 2015 |


## 2.4 Save

In [15]:
# Path of the output file (a file, not a directory)
output_path = "../Data/PANGAEA_orca_data/Orca_preprocessed.txt"
df_orca.to_csv(output_path, sep="\t", index=False)
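A tab-separated text file does not store column types, so timestamps come back as plain strings unless they are parsed again on loading. A minimal round-trip sketch with a toy table and a temporary directory (file name and values are placeholders), assuming the file is later read with `pd.read_csv`:

```python
import os
import tempfile
import pandas as pd

# Toy table with a datetime column, similar in shape to the preprocessed orca data
toy = pd.DataFrame({
    "DATE/TIME": pd.to_datetime(["2018-08-06 02:05:00", "2017-05-25 17:22:00"]),
    "Event": ["PS115/1-track", "PS106/1-track"],
    "Year": [2018, 2017],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_preprocessed.txt")

    # Write without the index, as in the cell above
    toy.to_csv(out_path, sep="\t", index=False)

    # parse_dates restores the datetime type when reading the file back
    restored = pd.read_csv(out_path, sep="\t", parse_dates=["DATE/TIME"])

print(restored.dtypes)
```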

# 3 Master Track Datasets
## 3.1 Load Multiple Files into a Dataframe

We load multiple files into a single dataframe as in Sect. 2.1. This time, however, we skip empty files (caused by zipped duplicates) and manually exclude the remaining duplicate dataset for cruise PS92.

In [16]:
dataset_directory = "../Data/PANGAEA_mastertrack_data/Datasets"
columns = ["Date/Time", "Latitude", "Longitude", "Event"]

frames = []
for filename in os.listdir(dataset_directory):
    file_path = os.path.join(dataset_directory, filename)

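    # Skip the duplicate PS92 dataset before reading it
    if filename == "PANGAEA_master_dataset_905170.txt":
        print(f"Skipping file: {filename}")
        continue

    # Empty files (left over from zipped duplicates) raise EmptyDataError
    try:
        df = pd.read_csv(file_path, sep="\t", usecols=columns)
    except pd.errors.EmptyDataError:
        print(f"Skipping file: {filename}")
        continue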

    frames.append(df)

df_mastertrack = pd.concat(frames, ignore_index=True)

print(df_mastertrack.head())

Skipping file: PANGAEA_master_dataset_905170.txt
Skipping file: PANGAEA_master_dataset_962550.txt
Skipping file: PANGAEA_master_dataset_963304.txt
Skipping file: PANGAEA_master_dataset_963840.txt
Skipping file: PANGAEA_master_dataset_972614.txt
Skipping file: PANGAEA_master_dataset_972617.txt
Skipping file: PANGAEA_master_dataset_974028.txt
             Date/Time  Latitude  Longitude       Event
0  2014-06-06 00:10:00  53.56549    8.55573  PS85-track
1  2014-06-06 00:20:00  53.56550    8.55573  PS85-track
2  2014-06-06 00:30:00  53.56549    8.55573  PS85-track
3  2014-06-06 00:40:00  53.56549    8.55573  PS85-track
4  2014-06-06 00:50:00  53.56549    8.55572  PS85-track
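The loading cell detects empty files through the `EmptyDataError` raised by pandas. An alternative sketch, not used in the original workflow, checks the file size before reading; it only detects files of exactly zero bytes. The file names below are placeholders created in a temporary directory:

```python
import os
import tempfile
import pandas as pd

loaded, skipped = [], []

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical stand-ins: one zero-byte file and one small valid table
    open(os.path.join(tmp_dir, "empty.txt"), "w").close()
    with open(os.path.join(tmp_dir, "track.txt"), "w") as handle:
        handle.write("Date/Time\tLatitude\n2014-06-06 00:10:00\t53.56549\n")

    # sorted() makes the processing order reproducible; os.listdir() gives no ordering guarantee
    for filename in sorted(os.listdir(tmp_dir)):
        file_path = os.path.join(tmp_dir, filename)

        # A size of 0 bytes means there is nothing to parse
        if os.path.getsize(file_path) == 0:
            skipped.append(filename)
            continue

        loaded.append(pd.read_csv(file_path, sep="\t"))

print(skipped, len(loaded))
```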


## 3.2 Select Master Tracks With Orca Data

Get the unique events from the orca dataset:

In [17]:
orca_events = df_orca["Event"].unique()

Filter mastertrack dataset by those events:

In [18]:
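# .copy() gives an independent dataframe, so later column assignments do not trigger a SettingWithCopyWarning
df_mastertrack = df_mastertrack[df_mastertrack["Event"].isin(orca_events)].copy()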
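The selection by event can be illustrated on toy tables. The sketch below also checks whether every event with sightings has a matching track label, since `isin` silently returns nothing for labels that differ between the two tables; the values are invented and the check is our own addition:

```python
import pandas as pd

# Toy stand-ins for the sightings and the master tracks
sightings = pd.DataFrame({"Event": ["PS92-track", "PS92-track", "PS106/1-track"]})
tracks = pd.DataFrame({
    "Event": ["PS85-track", "PS92-track", "PS106/1-track", "PS85-track"],
    "Latitude": [53.6, 59.5, 58.1, 53.7],
})

# Unique event labels that occur in the sightings
events_with_sightings = sightings["Event"].unique()

# Keep only track rows whose event label occurs in the sightings
selected = tracks[tracks["Event"].isin(events_with_sightings)].copy()

# Events that have sightings but no track rows (empty set if all labels match)
events_without_track = set(events_with_sightings) - set(tracks["Event"])
print(selected)
print(events_without_track)
```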

## 3.3 Datetime Conversion

For plotting the master tracks on a large-scale map, a coarser time resolution is sufficient, so we resample the data from 10-minute intervals to daily means.

Convert 'Date/Time' to datetime type:

In [19]:
df_mastertrack['Date/Time'] = pd.to_datetime(df_mastertrack['Date/Time'])

Resample to daily resolution within each event and take the mean of the numeric columns:

In [20]:
df_mastertrack = (
    df_mastertrack
        .groupby('Event')
        .resample('D', on='Date/Time')
        .mean(numeric_only=True)
        .reset_index()   # brings Event and Date/Time back as columns
)

In [21]:
df_mastertrack.head()

|   | Event | Date/Time | Latitude | Longitude |
|---|---|---|---|---|
| 0 | PS106/1-track | 2017-05-24 | 54.374258 | 7.460776 |
| 1 | PS106/1-track | 2017-05-25 | 57.303512 | 5.130603 |
| 2 | PS106/1-track | 2017-05-26 | 61.146302 | 3.346363 |
| 3 | PS106/1-track | 2017-05-27 | 65.125203 | 2.69201 |
| 4 | PS106/1-track | 2017-05-28 | 69.163828 | 3.388939 |
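To see what the daily averaging does, here is a sketch on four invented track points. It uses `pd.Grouper` instead of `groupby(...).resample(...)`, which is a swapped-in formulation and not the cell above; for data without gaps the result should match, but days without any records may be treated differently by the two approaches, so this is not meant as a drop-in replacement:

```python
import pandas as pd

# Four toy positions on two consecutive days of one cruise
toy = pd.DataFrame({
    "Date/Time": pd.to_datetime([
        "2017-05-24 00:00:00", "2017-05-24 00:10:00",
        "2017-05-25 00:00:00", "2017-05-25 00:10:00",
    ]),
    "Latitude": [54.0, 54.2, 57.0, 57.4],
    "Longitude": [7.0, 7.2, 5.0, 5.2],
    "Event": ["PS106/1-track"] * 4,
})

# Group by event and by calendar day, then average the numeric columns
daily = (
    toy
    .groupby(["Event", pd.Grouper(key="Date/Time", freq="D")])
    .mean(numeric_only=True)
    .reset_index()   # brings Event and Date/Time back as columns
)
print(daily)
```

Note that an arithmetic mean of longitudes is only meaningful as long as a track does not cross the ±180° meridian within a day.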


## 3.4 Save

In [22]:
# Path of the output file (a file, not a directory)
output_path = "../Data/PANGAEA_mastertrack_data/Mastertracks_preprocessed.txt"
df_mastertrack.to_csv(output_path, sep="\t", index=False)

# 4 Select All Cetacean Sightings in the Individual Dataset

For the remaining single dataset (downloaded in Sect. 7 of notebook 1 - download pangaea data), we extract all cetacean sightings from the raw data. Following Jungblut et al., individual species are too scarce to analyze separately, so all records of whales and dolphins are pooled into a single group. The filtered dataset provides the basis for examining the distribution of cetaceans.

Load the dataset:

In [23]:
# Path of the input file (a file, not a directory)
dataset_path = "../Data/868991_dataset.txt"
df = pd.read_csv(dataset_path, sep="\t", encoding="utf-8")

Inspect columns:

In [24]:
list(df.columns)

['ID',
 'Date/Time',
 'Latitude',
 'Longitude',
 'Bathy depth',
 'Speed',
 'Distance',
 'Sal',
 'Temp',
 'F chl',
 'Zone',
 'E. chrysocome',
 'S. magellanicus',
 'Penguins',
 'T. melanophris',
 'T. cauta',
 'T. chlororhynchos',
 'D. exulans',
 'D. dabbenena',
 'Diomedea sp.',
 'D. epomophora',
 'D. sanfordi',
 'D. epomophora_2',
 'Albatrosses',
 'M. giganteus',
 'Macronectes sp.',
 'F. glacialis',
 'D. capense',
 'P. macroptera',
 'P. feae',
 'P. incerta',
 'P. arminjoniana',
 'P. mollis',
 'B. bulwerii',
 'P. aequinoctialis',
 'P. conspicillata',
 'Petrels',
 'Pachyptila sp.',
 'C. diomedea',
 'C. edwardsii',
 'P. gravis',
 'P. griseus',
 'P. puffinus',
 'P. mauretanicus',
 'P. baroli',
 'P. lherminieri',
 'P. assimilis',
 'Shearwaters',
 'O. oceanicus',
 'G. nereis',
 'P. marina',
 'F. grallaria',
 'O. castro',
 'O. leucorhoa',
 'O. castro_2',
 'H. pelagicus',
 'Storm-petrels',
 'P. aethereus',
 'P. lepturus',
 'Tropicbirds',
 'F. aquila',
 'F. magnificens',
 'M. bassanus',
 'S. cape

Define cetacean (whale + dolphin + porpoise) columns:

In [25]:
whale_columns = [
    "P. catodon",
    "M. novaeangliae",
    "B. physalus",
    "B. borealis",
    "B. acutorostrata",
    "B. brydei",
    "G. melas",
    "G. macrorhynchus",
    "Globicephala sp.",
    "Z. cavirostris",
    "Beaked whale",
    "Whale unident",
    "O. orca", # <-
    # dolphins & porpoises (also cetaceans, included by Jungblut et al.):
    "L. australis","L. albirostris","Lagenorhynchus sp.",
    "D. delphis","S. frontalis","S. coeruleoalba","S. clymene",
    "T. truncatus","G. griseus","S. bredanensis","S. longirostris",
    "Dolphins unident","P. phocoena"
]
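Selecting a list of columns from a dataframe raises a `KeyError` if any name in the list is missing, for example because of a typo. A short sketch, on an invented toy table, of how a column list could be checked against the dataframe beforehand; the names `wanted`, `present`, and `absent` are only for illustration:

```python
import pandas as pd

# Toy table with two species columns and one coordinate column
toy = pd.DataFrame({
    "O. orca": [0, 2],
    "D. delphis": [5, 0],
    "Latitude": [-32.6, -32.2],
})

# Column names we would like to pool; the last one is not in the toy table
wanted = ["O. orca", "D. delphis", "P. phocoena"]

# Split the list into names that exist in the table and names that do not
present = [col for col in wanted if col in toy.columns]
absent = [col for col in wanted if col not in toy.columns]
print(present, absent)
```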

Add pooled Whales column:

In [26]:
df["Whales"] = df[whale_columns].sum(axis=1)

Filter the dataset, keeping only rows where Whales is greater than 0:

In [27]:
df = df[df["Whales"] > 0]
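The pooling and filtering steps can be traced on a toy table. The sketch assumes that missing counts may occur, which we have not verified for this dataset: by default `sum(axis=1)` skips NaN values, so a row that contains only missing counts sums to 0 and is removed by the filter.

```python
import numpy as np
import pandas as pd

# Toy counts for two species; NaN marks a missing count
toy = pd.DataFrame({
    "O. orca": [0, 2, np.nan],
    "D. delphis": [5, np.nan, np.nan],
})

# Row-wise sum over the species columns (NaN values are skipped by default)
toy["Whales"] = toy[["O. orca", "D. delphis"]].sum(axis=1)

# Keep only rows with at least one counted animal
with_sightings = toy[toy["Whales"] > 0]
print(with_sightings)
```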

In [28]:
print(df[["Date/Time", "Latitude", "Longitude", "Whales"]].head())

               Date/Time  Latitude  Longitude  Whales
4    2014-03-09 08:30:00 -32.60402   16.37436      28
10   2014-03-09 11:30:00 -32.24832   15.78811      12
150  2014-03-16 10:20:00 -10.99490   -6.64012       3
206  2014-03-19 13:40:00   0.43368  -12.43823       1
271  2014-03-23 18:05:00  11.02789  -20.99941       8


Save selected columns:

In [29]:
# Path of the output file (a file, not a directory)
output_path = "../Data/868991_dataset_preprocessed.txt"
df[["Date/Time","Latitude","Longitude","Whales"]].to_csv(output_path, sep="\t", encoding="utf-8", index=False)