![Simons CMAP logo](https://cmap.readthedocs.io/en/latest/_static/CMAP_logos/CMAP_logo_High_Res.png)
# In this notebook we will download SeaFlow and environmental data using [Simons CMAP](https://simonscmap.com).

Below are the datasets that will be used.
The end goal is to create a dataset that has the variables listed below.



#### SeaFlow:
- time
- lat
- lon
- biomass
- CruiseName
- Temperature
- Salinity

In this notebook we will also use <u>depth and cruise</u> to match with other available CMAP dataframes.


#### Darwin Biogeochemistry Climatology Model
- NO3
- PO4
- Fe
- Si
- chl


# Loading Libraries

In [33]:
import pandas as pd
import numpy as np


## Creating Real Time data set
### Datetime is in UTC

In [34]:
# Set a working directory
import os

directory_path = '/Users/cristianswift/Desktop/armbrust-lab/Seaflow-Machine-Learning/'
os.chdir(directory_path)


## The covariate SeaFlow dataset is averaged per hour for every picoplankton population

In [35]:
covari_path = 'data/original/SeaFlow_covariates.csv'

# renaming columns to shorter names (e.g. the date column is just called 'time')
covari_cols = ['time', 'PopulationName', 'lat',
       'lon', 'CellAbundance_10^6_cells_per_L',
       'Biomass_pgC_per_L', 'CellQuotas_fgC_per_cell',
       'CellDiameter_micrometer', 'salin', 'temp',
       'cruisename', 'Light_micromolQuanta_m2_s', 'SiO4_micromol_per_L',
       'NO3NO2', 'PO4', 'Fe',
       'SatChl', 'MixedLayerDepth_m']

#reading in the csv to a pandas df
covari = (pd
          .read_csv(covari_path, names=covari_cols)
          # drop the first row: it is the file's original header, read in as data because names= is given
          .tail(-1)
         )

covari.head(3)

Unnamed: 0,time,PopulationName,lat,lon,CellAbundance_10^6_cells_per_L,Biomass_pgC_per_L,CellQuotas_fgC_per_cell,CellDiameter_micrometer,salin,temp,cruisename,Light_micromolQuanta_m2_s,SiO4_micromol_per_L,NO3NO2,PO4,Fe,SatChl,MixedLayerDepth_m
1,2016-04-20T00:00:00Z,Prochlorococcus,,,253.186148,9.232478116,0.036465179,0.57858,34.67912841,25.74054659,KOK1606,1764.076136,,,,,,26.96643939
2,2016-04-20T00:00:00Z,Synechococcus,,,1.588988684,0.279172358,0.175691848,1.06425,34.67912841,25.74054659,KOK1606,1764.076136,,,,,,26.96643939
3,2016-04-20T00:00:00Z,nanoeukaryotes (2-5µm),,,1.332641539,3.079117581,2.310536998,2.88901,34.67912841,25.74054659,KOK1606,1764.076136,,,,,,26.96643939
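
Reading the file with `names=` and then dropping the first row works, but the header row is parsed as data, which typically leaves the columns with `object` dtype. As a hedged alternative, the sketch below passes `header=0` together with `names=` so that pandas discards the original header itself. It assumes the file's first row is a header; the inline CSV and its three columns are made-up stand-ins, not the real `SeaFlow_covariates.csv`.

```python
import io
import pandas as pd

# Hypothetical three-column stand-in for the real covariates file
sample_csv = io.StringIO(
    "date,pop,temperature\n"
    "2016-04-20T00:00:00Z,Prochlorococcus,25.74\n"
    "2016-04-20T00:00:00Z,Synechococcus,25.74\n"
)
new_cols = ['time', 'PopulationName', 'temp']

# header=0 marks the first row as a header that `names` replaces
demo = pd.read_csv(sample_csv, names=new_cols, header=0)
print(demo.dtypes)
```

With this approach the numeric columns are inferred as numbers on load, and the index starts at 0 instead of 1.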


Making the time format CMAP-compatible by removing the trailing `Z` (the UTC designator)

In [36]:
covari['time'] = covari['time'].str.replace('Z', '')
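
As a quick check of this step, the sketch below strips the trailing `Z` from two made-up timestamps and confirms that the result still parses as a datetime. Passing `regex=False` is an addition of mine to make the literal replacement explicit; the sample values are illustrative only.

```python
import pandas as pd

times = pd.Series(['2016-04-20T00:00:00Z', '2016-04-20T01:00:00Z'])

# Literal replacement of the UTC designator 'Z'
cleaned = times.str.replace('Z', '', regex=False)

# The cleaned strings still parse, now as timezone-naive timestamps
parsed = pd.to_datetime(cleaned)
print(cleaned.tolist())
```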

### Adding a depth column so that we can get climatological data from the depths we are interested in
SeaFlow collects data at 7 m depth

In [48]:
covari['depth'] = 7

### Adjusting columns so that the dtypes are correct

In [38]:
def ChangeObjectTypes(df):
    for column in df:
        if column in ('PopulationName', 'cruisename', 'time'):
            # changing to string
            df[column] = df[column].astype(str)
        else:
            # changing to numeric type; values that cannot be parsed become NaN
            df[column] = pd.to_numeric(df[column], errors='coerce')
    return df

covari = ChangeObjectTypes(covari)
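
To illustrate what `errors='coerce'` does in the conversion above, here is a small sketch on made-up data. Entries that cannot be parsed as numbers become `NaN` instead of raising an error, so it is worth counting them afterwards.

```python
import pandas as pd

raw = pd.DataFrame({
    'cruisename': ['KOK1606', 'KOK1606', 'KOK1606'],
    'temp': ['25.74', 'not_a_number', None],  # illustrative values
})

# Non-numeric entries become NaN rather than raising an error
raw['temp'] = pd.to_numeric(raw['temp'], errors='coerce')
n_missing = int(raw['temp'].isna().sum())
print(raw.dtypes)
print(n_missing)
```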

## Keeping only the data variables that we will be using for the machine learning model

In [39]:
covari = (covari[['time', 'PopulationName', 'lat','lon',
                 'Biomass_pgC_per_L','salin', 'temp','cruisename']]
          .dropna()
          .reset_index(drop=True)
         )
covari.head(2)

Unnamed: 0,time,PopulationName,lat,lon,Biomass_pgC_per_L,salin,temp,cruisename
0,2016-04-20T07:00:00,Prochlorococcus,21.520326,-158.326984,10.520443,34.893785,24.351745,KOK1606
1,2016-04-20T07:00:00,Synechococcus,21.520326,-158.326984,0.341429,34.893785,24.351745,KOK1606
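
The filtered table starts at a later timestamp than the raw one because the earliest rows had no `lat`/`lon` and were removed by `dropna()`. A minimal sketch of that behavior on made-up values:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'lat': [np.nan, 21.52, 21.66],   # first row has no position
    'temp': [25.74, 24.35, 24.34],
})

# Drop rows with any missing value, then renumber the index from 0
clean = demo.dropna().reset_index(drop=True)
print(clean)
```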


In [40]:
covari.dtypes

time                  object
PopulationName        object
lat                  float64
lon                  float64
Biomass_pgC_per_L    float64
salin                float64
temp                 float64
cruisename            object
dtype: object

# Using Simons CMAP to gather additional features

### Our climatological data will come from the Darwin Biogeochemistry Climatology Model

#### First installing and importing pycmap 

In [41]:
# !pip install pycmap
import pycmap

### Prepping covariate data for colocalization using Simons CMAP

#### Setting API

In [42]:
# Replace the placeholder with your own key; avoid committing real keys to shared notebooks
api = pycmap.API(token='<YOUR_API_KEY>')
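
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, is to read the key from an environment variable. The variable name `CMAP_API_KEY` and the helper `load_cmap_token` are hypothetical names chosen for this example, not part of pycmap.

```python
import os

def load_cmap_token(env_var='CMAP_API_KEY'):
    """Return the API key stored in an environment variable (hypothetical name)."""
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f'Set the {env_var} environment variable first.')
    return token

# Usage (requires pycmap and a valid key):
# api = pycmap.API(token=load_cmap_token())
```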

In [43]:
api.get_dataset_metadata('tblDarwin_Nutrient_Climatology')

In [44]:
covari.head(3)

Unnamed: 0,time,PopulationName,lat,lon,Biomass_pgC_per_L,salin,temp,cruisename
0,2016-04-20T07:00:00,Prochlorococcus,21.520326,-158.326984,10.520443,34.893785,24.351745,KOK1606
1,2016-04-20T07:00:00,Synechococcus,21.520326,-158.326984,0.341429,34.893785,24.351745,KOK1606
2,2016-04-20T07:00:00,nanoeukaryotes (2-5µm),21.520326,-158.326984,3.338212,34.893785,24.351745,KOK1606


In [45]:
targets = {
        
        # Darwin Biogeochemistry Climatology Model
        "tblDarwin_Nutrient_Climatology": {
                          "variables": ["SiO2_darwin_clim", "POSi_darwin_clim", "PON_darwin_clim",
                                        "POFe_darwin_clim", "POC_darwin_clim", "PO4_darwin_clim",
                                        "PIC_darwin_clim", "O2_darwin_clim", "NO3_darwin_clim",
                                        "NO2_darwin_clim", "NH4_darwin_clim", "FeT_darwin_clim",
                                        "DOP_darwin_clim", "DON_darwin_clim", "DOFe_darwin_clim",
                                        "DOC_darwin_clim", "DIC_darwin_clim", "CDOM_darwin_clim",
                                        "ALK_darwin_clim"],
            # Tolerance variables/order: temporal [days], meridional [deg], zonal [deg], and vertical [m]
                          "tolerances": [1, 0.5, 0.5, 5]
                         }
        }


source = covari

covari_cmap = pycmap.Sample(
              source=source, 
              targets=targets, 
              replaceWithMonthlyClimatolog=False
             )


Gathering metadata .... 
Sampling starts
Sampling finished                                                                                                    
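
To make the tolerance settings concrete, the sketch below computes the search window that tolerances of `[1, 0.5, 0.5, 5]` would imply around a single SeaFlow observation at 7 m depth. It only illustrates the idea of matching within plus or minus each tolerance; it is not pycmap's internal implementation, and `tolerance_window` is a hypothetical helper.

```python
import pandas as pd

def tolerance_window(time, lat, lon, depth, tolerances):
    """Return (min, max) bounds per dimension for +/- tolerance matching."""
    dt_days, dlat, dlon, ddepth = tolerances
    t = pd.Timestamp(time)
    return {
        'time': (t - pd.Timedelta(days=dt_days), t + pd.Timedelta(days=dt_days)),
        'lat': (lat - dlat, lat + dlat),
        'lon': (lon - dlon, lon + dlon),
        'depth': (depth - ddepth, depth + ddepth),
    }

# First observation from the table above, at the 7 m sampling depth
window = tolerance_window('2016-04-20T07:00:00', 21.520326, -158.326984, 7,
                          [1, 0.5, 0.5, 5])
print(window['depth'])
```

Wider tolerances make a match more likely but average over a larger region, so neighboring observations can end up with identical climatology values.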

In [47]:
covari_cmap

Unnamed: 0,time,PopulationName,lat,lon,Biomass_pgC_per_L,salin,temp,cruisename,CMAP_SiO2_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_POSi_darwin_clim_tblDarwin_Nutrient_Climatology,...,CMAP_NO2_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_NH4_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_FeT_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_DOP_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_DON_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_DOFe_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_DOC_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_DIC_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_CDOM_darwin_clim_tblDarwin_Nutrient_Climatology,CMAP_ALK_darwin_clim_tblDarwin_Nutrient_Climatology
0,2016-04-20T07:00:00,Prochlorococcus,21.520326,-158.326984,10.520443,34.893785,24.351745,KOK1606,-0.022845,-0.000127,...,0.295276,1.282981,0.000015,0.013734,0.248717,0.000017,1.648093,1697.874775,0.000034,1954.876650
1,2016-04-20T07:00:00,Synechococcus,21.520326,-158.326984,0.341429,34.893785,24.351745,KOK1606,-0.022845,-0.000127,...,0.295276,1.282981,0.000015,0.013734,0.248717,0.000017,1.648093,1697.874775,0.000034,1954.876650
2,2016-04-20T07:00:00,nanoeukaryotes (2-5µm),21.520326,-158.326984,3.338212,34.893785,24.351745,KOK1606,-0.022845,-0.000127,...,0.295276,1.282981,0.000015,0.013734,0.248717,0.000017,1.648093,1697.874775,0.000034,1954.876650
3,2016-04-20T07:00:00,picoeukaryotes (< 2µm),21.520326,-158.326984,0.701902,34.893785,24.351745,KOK1606,-0.022845,-0.000127,...,0.295276,1.282981,0.000015,0.013734,0.248717,0.000017,1.648093,1697.874775,0.000034,1954.876650
4,2016-04-20T08:00:00,Prochlorococcus,21.662710,-158.323430,9.309387,34.902376,24.339265,KOK1606,-0.022845,-0.000127,...,0.295276,1.282981,0.000015,0.013734,0.248717,0.000017,1.648093,1697.874775,0.000034,1954.876650
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10906,2021-12-30T00:00:00,picoeukaryotes (< 2µm),32.673493,-117.545342,3.774488,33.468151,15.189021,TN398,0.363296,0.099231,...,0.202274,0.242743,0.000497,0.164132,2.736920,0.000178,19.695796,1819.587625,0.000756,2008.417775
10907,2021-12-30T01:00:00,Prochlorococcus,32.682100,-117.660321,0.874599,33.478846,15.327302,TN398,0.363296,0.099231,...,0.202274,0.242743,0.000497,0.164132,2.736920,0.000178,19.695796,1819.587625,0.000756,2008.417775
10908,2021-12-30T01:00:00,Synechococcus,32.682100,-117.660321,9.707579,33.478846,15.327302,TN398,0.363296,0.099231,...,0.202274,0.242743,0.000497,0.164132,2.736920,0.000178,19.695796,1819.587625,0.000756,2008.417775
10909,2021-12-30T01:00:00,nanoeukaryotes (2-5µm),32.682100,-117.660321,2.428084,33.478846,15.327302,TN398,0.363296,0.099231,...,0.202274,0.242743,0.000497,0.164132,2.736920,0.000178,19.695796,1819.587625,0.000756,2008.417775


### Checking for NaN values

In [46]:
covari_cmap.isna().sum()

time                                                    0
PopulationName                                          0
lat                                                     0
lon                                                     0
Biomass_pgC_per_L                                       0
salin                                                   0
temp                                                    0
cruisename                                              0
CMAP_SiO2_darwin_clim_tblDarwin_Nutrient_Climatology    0
CMAP_POSi_darwin_clim_tblDarwin_Nutrient_Climatology    0
CMAP_PON_darwin_clim_tblDarwin_Nutrient_Climatology     0
CMAP_POFe_darwin_clim_tblDarwin_Nutrient_Climatology    0
CMAP_POC_darwin_clim_tblDarwin_Nutrient_Climatology     0
CMAP_PO4_darwin_clim_tblDarwin_Nutrient_Climatology     0
CMAP_PIC_darwin_clim_tblDarwin_Nutrient_Climatology     0
CMAP_O2_darwin_clim_tblDarwin_Nutrient_Climatology      0
CMAP_NO3_darwin_clim_tblDarwin_Nutrient_Climatology     0
CMAP_NO2_darwi
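
When a table has many columns, it can help to list only those that contain missing values. A minimal sketch on made-up data (the column name `CMAP_NO3_demo` is illustrative, not a real CMAP variable):

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({
    'temp': [24.35, 15.19, 15.33],
    'CMAP_NO3_demo': [0.29, np.nan, 0.20],  # made-up column name and values
})

na_counts = demo.isna().sum()
# Keep only the columns that actually contain missing values
cols_with_na = na_counts[na_counts > 0]
print(cols_with_na)
```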

## Saving as a CSV

In [None]:
#saving as a CSV file
covari_cmap.to_csv('data/modified/Seaflow_covariates_CMAP.csv', index=False)
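
As a sanity check after saving, the file can be read back to confirm that the columns and row count survived. The sketch below demonstrates the round trip on a one-row made-up frame, using an in-memory buffer in place of a file on disk.

```python
import io
import pandas as pd

demo = pd.DataFrame({
    'time': ['2016-04-20T07:00:00'],
    'cruisename': ['KOK1606'],
    'temp': [24.351745],
})

buffer = io.StringIO()            # stands in for a file on disk
demo.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
```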
