# North coast kelp data conversion

Since Laura shared only an MS Access database file (.mdb) with me, I'll have to start by extracting the data from it.

Resources:
- [Wiki for pyodbc](https://github.com/mkleehammer/pyodbc/wiki)

In [123]:
## Imports

import pandas as pd
import numpy as np
import csv, pyodbc
import pickle
import datetime

## Connect to db and retrieve data

Note that (as described in the Wiki) the original database filename did not work because it included underscores. I renamed 'Abalone_DiveSurveys_EH_06242020.mdb' to 'AbaloneDiveSurveys-06242020.mdb' to fix this problem.

In [49]:
## Check available access driver

[x for x in pyodbc.drivers() if x.startswith('Microsoft Access Driver')]

['Microsoft Access Driver (*.mdb, *.accdb)']

In [87]:
## Function for connecting to db

def get_data_from_db(tbl_name):
    """Connects to .mdb file and extracts data from tbl_name. Saves data in a .csv file and pickled column names in a _cols.data file."""
    
    # Connect
    conn_str = (
        r'DRIVER={Microsoft Access Driver (*.mdb, *.accdb)};'
        r'DBQ=C:\Users\dianalg\PycharmProjects\PythonScripts\MPA data integration\North coast kelp\AbaloneDiveSurveys-06242020.mdb;'
        )
    cnxn = pyodbc.connect(conn_str)
    crsr = cnxn.cursor()
    
    # Get column names
    cols = [row.column_name for row in crsr.columns(table=tbl_name)]
    
    # Get rows (square brackets let Access accept table names that contain spaces)
    SQL = 'SELECT * FROM [' + tbl_name + '];'
    rows = crsr.execute(SQL).fetchall()
    
    # Close connection
    crsr.close()
    cnxn.close()
    
    # Save data (newline='' stops the csv module from writing blank lines between rows on Windows)
    with open(tbl_name + '.csv', 'w', newline='') as file:
        csv_writer = csv.writer(file)
        csv_writer.writerows(rows)
        
    # Save column names
    with open(tbl_name + '_cols.data', 'wb') as file:
        pickle.dump(cols, file)
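
As an alternative to pickling the column names separately, the header could be written as the first row of the CSV itself. This is only a minimal sketch: `write_table_csv` is a hypothetical helper (not part of this notebook), the toy `cols` and `rows` stand in for what pyodbc returns, and UTF-8 is an assumed encoding.

```python
import csv
import os
import tempfile

def write_table_csv(path, cols, rows):
    """Write column names as a header row, followed by the data rows."""
    # newline='' stops the csv module from emitting blank lines on Windows
    with open(path, 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(cols)
        writer.writerows(rows)

# Toy data standing in for the cols and rows retrieved from the database
example_path = os.path.join(tempfile.gettempdir(), 'example_table.csv')
write_table_csv(example_path,
                ['SiteID', 'SITE'],
                [('BR', 'Bodega Marine Life Refuge (BML)'), ('CC', 'Caspar Cove')])
```

A file written this way can be read with `pd.read_csv(path)` directly, without `header=None` or `names=`.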

**Note** that the table names in this database are:
- Paste Errors
- Switchboard Items
- Tbl_Diver_Lookup
- Tbl_Format_Lookup
- Tbl_New_size
- Tbl_SppSeen
- Tbl_Survey_Lookup
- tblCounts
- tblHabitat
- tblNewGrowth
- tblQuadrat
- tblSite
- TblSpecies
- tblSubstrate
- tblSurvey
- tblSwimGroups
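
A list like the one above can also be pulled from the database itself rather than typed by hand. The sketch below assumes an open pyodbc cursor such as `crsr`; `list_user_tables` is a hypothetical helper name.

```python
def list_user_tables(crsr):
    """Return the names of ordinary tables visible through an open pyodbc cursor."""
    # Cursor.tables() yields one row per table; tableType='TABLE' leaves out
    # system tables and views
    return [row.table_name for row in crsr.tables(tableType='TABLE')]
```

It would need to be called before the cursor is closed, e.g. right after `crsr = cnxn.cursor()`.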

In [104]:
## Extract data from .mdb file

tables = ['tblSite', 'tblSurvey', 'tblCounts', 'tblSpecies']

for table in tables:
    get_data_from_db(table)

## Load data

In [102]:
## Function to load data

def load_table(tbl_name):
    """Takes tbl_name (a string) and loads saved data from that table."""
    
    # Get filenames
    col_name = tbl_name + '_cols.data'
    data_name = tbl_name + '.csv'
    
    # Retrieve column names
    with open(col_name, 'rb') as file:
        cols = pickle.load(file)
        
    # Load data ('ANSI' is only recognized by Python on Windows, where it means the system code page)
    data = pd.read_csv(data_name, header=None, names=cols, encoding='ANSI')
    return data

In [179]:
## Load data

site = load_table('tblSite')
survey = load_table('tblSurvey')
count = load_table('tblCounts')
species = load_table('tblSpecies')

In [180]:
## Tidy survey data

# Remove columns that are all nan; there are no rows that are all nan
survey.drop(columns=survey.columns[survey.isna().all()], inplace=True)

# Can probably drop these columns: DIVER, Orientation, Buddy, RANGE, TIDEHEIGHT, SLAT, SLONG, ELAT, ELONG, SLAT_old, SLONG_old, ELAT_old, ELONG_old, Format, DISTANCE, COMMENTS
survey.drop(columns=['DIVER (LEFT FOR TRANSECT)', 'Orientation', 'Buddy (RIGHT FOR TRANSECT)', 'RANGE', 'TIDEHEIGHT', 'SLAT', 'SLONG', 'ELAT', 'ELONG', 'SLAT_old', 'SLONG_old',
                     'ELAT_old', 'ELONG_old', 'Format', 'DISTANCE', 'COMMENTS'], inplace=True)

# We are only interested in the emergent survey type at this stage
survey = survey[survey['TYPE'] == 'EMERGENT']

# View
print(survey.shape)
survey.head()

(455, 16)


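|    | Survey_ID | SiteID | Survey_Num | SURVEY | DATE | TYPE | TIME_of_Day | NUMBER | Avg Depth | Min_DEPTH | Max_Depth | SLAT_DD | SLONG_DD | ELAT_DD | ELONG_DD | TIME_MIN |
|----|-----------|--------|------------|--------|------|------|-------------|--------|-----------|-----------|-----------|---------|----------|---------|----------|----------|
| 37 | 1232 | BR | BR99-01 | Transect - 30m (Emergent) | 1999-07-29 00:00:00 | EMERGENT | | CAVE01 | 0.0 | 39.0 | | | | | | |
| 38 | 1243 | BR | BR99-02 | Transect - 30m (Emergent) | 1999-07-29 00:00:00 | EMERGENT | | CAVE02 | 0.0 | 52.0 | | | | | | |
| 39 | 1254 | BR | BR99-03 | Transect - 30m (Emergent) | 1999-07-29 00:00:00 | EMERGENT | | CAVE03 | 0.0 | 27.0 | | | | | | |
| 40 | 1258 | BR | BR99-04 | Transect - 30m (Emergent) | 1999-07-29 00:00:00 | EMERGENT | | CAVE04 | 0.0 | 25.0 | | | | | | |
| 41 | 1259 | BR | BR99-05 | Transect - 30m (Emergent) | 1999-07-29 00:00:00 | EMERGENT | | CAVE05 | 0.0 | 20.0 | | | | | | |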


**Note** that by filtering for TYPE = 'EMERGENT', we are discarding the majority of the data (all but 455 records out of ~ 3400). **Is this really the right thing to do?**
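
One way to see exactly what the filter throws away is to tabulate TYPE and SURVEY before filtering. The sketch below uses a toy frame rather than the real survey table, and the 'INVASIVE' TYPE value is made up for illustration (I haven't listed the real non-EMERGENT TYPE values here).

```python
import pandas as pd

# Toy stand-in for the survey table; 'INVASIVE' is a hypothetical TYPE value
toy_survey = pd.DataFrame({
    'TYPE': ['EMERGENT', 'EMERGENT', 'INVASIVE', None],
    'SURVEY': ['Transect - 30m (Emergent)', 'Transect - 30m (Emergent)',
               'Transect - 5m (Invasive)', 'Swim'],
})

# Count rows per TYPE, keeping missing values visible
type_counts = toy_survey['TYPE'].value_counts(dropna=False)

# Count rows per TYPE/SURVEY pair to see which protocols each TYPE covers
type_by_survey = (toy_survey.fillna('missing')
                  .groupby(['TYPE', 'SURVEY'])
                  .size())

print(type_counts)
print(type_by_survey)
```

Run against the unfiltered `survey` table, the same two calls would show how many records each TYPE and SURVEY combination contributes.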

Just for documentation, this eliminates the following values under SURVEY:
- 'Transect - 30m (Rapid Emergent)'
- 'Transect - 20mx2m (Emergent)'
- nan
- 'ARM'
- 'Transect - 30mx2m (Emergent)'
- 'Swim'
- 'Transect - 5m (Invasive)'
- 'Transect - 20mx5m (Emergent)'
- 'Transect - 16mx2m (Emergent)'
- 'Transect - 1/4m'
- 'Transect - 1/4M'
- 'Transect - running transect'
- 'Transect - Running Transect'

Only 'Transect - 30m (Emergent)' is left. 

**Also note** that there are some non-numeric values in the COUNT column of the count data, including:
- 'RG18-C3-'
- 'P'
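
Values like these can be located by coercing the column to numbers and looking at what fails. A minimal sketch on a toy COUNT column (the numeric entries are invented; only 'RG18-C3-' and 'P' come from the data):

```python
import pandas as pd

# Toy COUNT column including the two odd values noted above
toy_counts = pd.DataFrame({'COUNT': ['3', '12', 'RG18-C3-', 'P', '0']})

# errors='coerce' turns anything that is not a number into NaN
numeric = pd.to_numeric(toy_counts['COUNT'], errors='coerce')

# Rows where the original value was present but could not be parsed
bad_values = toy_counts.loc[numeric.isna() & toy_counts['COUNT'].notna(), 'COUNT']
print(bad_values.tolist())
```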

## Convert

There is very minimal metadata for this data set, so I'm not sure what a lot of the columns mean. I'm just going to try to work through a basic conversion process, and see where questions and problems arise. **I'm going to start by assuming that this data set can be summarized by an occurrence file only**, containing: eventID, eventDate, datasetID, locality, localityRemarks (if needed), countryCode, decimalLatitude, decimalLongitude, coordinateUncertaintyInMeters, occurrenceID, scientificName, scientificNameID, taxonID, nameAccordingTo, occurrenceStatus, basisOfRecord, identificationQualifier (if needed), occurrenceRemarks (if needed), individualCount (or organismQuantity and organismQuantityType), minimumDepthInMeters, maximumDepthInMeters, samplingProtocol and samplingEffort.

### eventID

As with other similar surveys, it seems reasonable to assume that each transect is an event. **I don't know what the broader study organization is for this, i.e. number of sites, inside/outside MPAs, number of transects per site, what was looked for, what was measured.** The metadata states that the Survey_Num field is composed of the site location, year and site ID. Transect numbers should only be assigned for rapid assessment surveys, which were filtered out because they're not type EMERGENT. 

That said, each record in the survey table has a unique Survey_Num, so hopefully that can be used as an eventID. **For now I'm going to assume that the Survey_Num format is the site ID + year + transect number. Note** that there are a number of Survey_Nums that do not fit this format, though, including:
- 'PA71C-01'
- 'PA86dfg-13'

In [181]:
## eventID - as with other similar surveys, I'll assume the event is a transect

occ = pd.DataFrame({'eventID':survey['Survey_Num']})
occ.head()

|    | eventID |
|----|---------|
| 37 | BR99-01 |
| 38 | BR99-02 |
| 39 | BR99-03 |
| 40 | BR99-04 |
| 41 | BR99-05 |


In [182]:
## eventDate

occ['eventDate'] = survey['DATE']

# format
eventDate = [datetime.datetime.strptime(dt, '%Y-%m-%d %H:%M:%S').date().isoformat() for dt in occ['eventDate']]
occ['eventDate'] = eventDate
occ.head()

|    | eventID | eventDate  |
|----|---------|------------|
| 37 | BR99-01 | 1999-07-29 |
| 38 | BR99-02 | 1999-07-29 |
| 39 | BR99-03 | 1999-07-29 |
| 40 | BR99-04 | 1999-07-29 |
| 41 | BR99-05 | 1999-07-29 |


There are 71 unique survey dates in this data set. **Note** that the earliest of these is 1971-09-01 and the latest of these is 2001-08-22. **This makes no sense, seeing as Laura was talking about having entered the 2017 data over the summer. It must be the 'EMERGENT' filtering step. But I'll let it be for now.**
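
The date range and the number of unique dates can be checked with a vectorized parse, which also tolerates missing dates (the list comprehension above would raise on a NaN). A minimal sketch on toy dates in the same format as the DATE column:

```python
import pandas as pd

# Toy DATE column in the same 'YYYY-MM-DD HH:MM:SS' form as the survey table
dates = pd.Series(['1999-07-29 00:00:00', '1971-09-01 00:00:00',
                   '2001-08-22 00:00:00', None])

# errors='coerce' maps missing or unparseable entries to NaT instead of raising
parsed = pd.to_datetime(dates, format='%Y-%m-%d %H:%M:%S', errors='coerce')

# ISO 8601 date strings; NaT entries come out as NaN
event_date = parsed.dt.strftime('%Y-%m-%d')

# Earliest date, latest date, and number of distinct dates
print(parsed.min().date(), parsed.max().date(), parsed.nunique())
```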

In [183]:
## datasetID

occ['datasetID'] = 'North coast kelp emergent transects'
occ.head()

|    | eventID | eventDate  | datasetID                           |
|----|---------|------------|-------------------------------------|
| 37 | BR99-01 | 1999-07-29 | North coast kelp emergent transects |
| 38 | BR99-02 | 1999-07-29 | North coast kelp emergent transects |
| 39 | BR99-03 | 1999-07-29 | North coast kelp emergent transects |
| 40 | BR99-04 | 1999-07-29 | North coast kelp emergent transects |
| 41 | BR99-05 | 1999-07-29 | North coast kelp emergent transects |


In [184]:
## locality

occ['SiteID'] = survey['SiteID']
occ = occ.merge(site, how='left', on='SiteID')
occ.rename(columns={'SITE':'locality'}, inplace=True)

## ------ TODO: Drop SiteID, but for now leave it in in case I need to merge on it again.

print(occ.shape)
occ.head()

(455, 5)


|   | eventID | eventDate  | datasetID                           | SiteID | locality                        |
|---|---------|------------|-------------------------------------|--------|---------------------------------|
| 0 | BR99-01 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) |
| 1 | BR99-02 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) |
| 2 | BR99-03 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) |
| 3 | BR99-04 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) |
| 4 | BR99-05 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) |


**Note** that for most of the kelp forest survey data sets, there's a localityRemarks column indicating whether each site was inside or outside an MPA. I don't seem to have this information here, though.

In [185]:
## countryCode

occ['countryCode'] = 'US'
occ.head()

|   | eventID | eventDate  | datasetID                           | SiteID | locality                        | countryCode |
|---|---------|------------|-------------------------------------|--------|---------------------------------|-------------|
| 0 | BR99-01 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |
| 1 | BR99-02 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |
| 2 | BR99-03 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |
| 3 | BR99-04 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |
| 4 | BR99-05 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |


**The site table doesn't contain lat/lon information, and the survey table seems to contain weird forms of it.** I received the following information from Laura in June:

| Site           | ID  | Latitude  | Longitude  |
|----------------|-----|-----------|------------|
| Albion Bay     | AB  | 39 22.900 | 123 77.881 |
| Caspar Cove    | CC  | 39 36.556 | 123 82.362 |
| Fort Ross      | FR  | 38 50.797 | 123 23.851 |
| Ocean Cove     | OC  | 38 55.363 | 123 30.688 |
| Point Arena    | PA  | 38 54.783 | 123 42.835 |
| Point Cabrillo | PC  | 39 20.752 | 123 49.628 |
| Russian Gulch  | RG  | 39 32.687 | 123 80.782 |
| Salt Point     | SP  | 38 33.972 | 123 20.172 |
| Sea Ranch      | SER | 38 42.468 | 123 27.302 |
| Timber Cove    | TC  | 38 53.476 | 123 28.192 |
| Todds Point    | TP  | 38 43.195 | 123 81.780 |
| Van Damme      | VD  | 39 16.044 | 123 47.662 |

**Note** that something is clearly wrong here. I assume these coordinates are in degrees + decimal minutes, but several of the minute entries are greater than 60, which is not possible in that format.
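
The out-of-range entries can be picked out mechanically. This sketch retypes the minute values from the table above and flags any site whose latitude or longitude minutes are 60 or more; it assumes, as I do above, that the values are meant to be degrees + decimal minutes.

```python
import pandas as pd

# Minute values copied from the table above, indexed by site ID
coords = pd.DataFrame(
    {'lat_min': [22.900, 36.556, 50.797, 55.363, 54.783, 20.752,
                 32.687, 33.972, 42.468, 53.476, 43.195, 16.044],
     'lon_min': [77.881, 82.362, 23.851, 30.688, 42.835, 49.628,
                 80.782, 20.172, 27.302, 28.192, 81.780, 47.662]},
    index=['AB', 'CC', 'FR', 'OC', 'PA', 'PC', 'RG', 'SP', 'SER', 'TC', 'TP', 'VD'])

# Valid minutes satisfy 0 <= minutes < 60; keep rows where either column fails
invalid = coords[(coords >= 60).any(axis=1)]
print(invalid)
```

All of the flagged values are in the longitude column; every latitude minute value is below 60.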

In [186]:
## Create a digital copy of the table Laura sent

# Define columns
site_name = ['Albion Bay', 'Caspar Cove', 'Fort Ross', 'Ocean Cove', 'Point Arena', 'Point Cabrillo', 'Russian Gulch', 'Salt Point', 'Sea Ranch', 'Timber Cove', 'Todds Point', 'Van Damme']
site_id = ['AB', 'CC', 'FR', 'OC', 'PA', 'PC', 'RG', 'SP', 'SER', 'TC', 'TP', 'VD']
lat_deg = [39, 39, 38, 38, 38, 39, 39, 38, 38, 38, 38, 39]
lat_min = [22.900, 36.556, 50.797, 55.363, 54.783, 20.752, 32.687, 33.972, 42.468, 53.476, 43.195, 16.044]
lon_deg = [123]*12
lon_min = [77.881, 82.362, 23.851, 30.688, 42.835, 49.628, 80.782, 20.172, 27.302, 28.192, 81.780, 47.662]

# Create df
site_lat_lon = pd.DataFrame({'Site':site_name,
                            'ID':site_id,
                            'Lat_deg':lat_deg,
                            'Lat_min':lat_min,
                            'Lon_deg':lon_deg,
                            'Lon_min':lon_min})

# Convert lat, lons to decimal degrees
# NOTE: these sites are in the western hemisphere, so Darwin Core decimalLongitude should be negative; the sign is not applied here
site_lat_lon['Lat_dd'] = site_lat_lon['Lat_deg'] + site_lat_lon['Lat_min']/60
site_lat_lon['Lon_dd'] = site_lat_lon['Lon_deg'] + site_lat_lon['Lon_min']/60

# Merge with site table to populate what few lat, lons we know at this time. Note that, based on the prior text box, some of these are wrong.
site = site.merge(site_lat_lon, how='left', left_on='SiteID', right_on='ID')
site.head()

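|   | SiteID | SITE | Site | ID | Lat_deg | Lat_min | Lon_deg | Lon_min | Lat_dd | Lon_dd |
|---|--------|------|------|----|---------|---------|---------|---------|--------|--------|
| 0 | ALB | Albion Bay | | | | | | | | |
| 1 | BR | Bodega Marine Life Refuge (BML) | | | | | | | | |
| 2 | CAT | Catalina Island, Bird Rock (Southern CA) | | | | | | | | |
| 3 | CC | Caspar Cove | Caspar Cove | CC | 39.0 | 36.556 | 123.0 | 82.362 | 39.609267 | 124.3727 |
| 4 | FM | Fisk Mill Cove | | | | | | | | |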
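
Albion Bay has no coordinates in this output even though it appears in Laura's table, because the site table uses the ID 'ALB' while her table uses 'AB'. Mismatches like this can be listed with an outer merge and `indicator=True`. A minimal sketch on toy frames that contain only a few of the rows:

```python
import pandas as pd

# Toy versions of the site table and of the coordinate table Laura sent
site_toy = pd.DataFrame({'SiteID': ['ALB', 'BR', 'CC'],
                         'SITE': ['Albion Bay', 'Bodega Marine Life Refuge (BML)', 'Caspar Cove']})
coords_toy = pd.DataFrame({'ID': ['AB', 'CC'],
                           'Site': ['Albion Bay', 'Caspar Cove']})

# indicator=True adds a '_merge' column saying which side each row came from
merged = site_toy.merge(coords_toy, how='outer',
                        left_on='SiteID', right_on='ID', indicator=True)

# IDs present in only one of the two tables
site_only = merged.loc[merged['_merge'] == 'left_only', 'SiteID'].tolist()
coords_only = merged.loc[merged['_merge'] == 'right_only', 'ID'].tolist()
print(site_only, coords_only)
```

Any ID that shows up in `coords_only` is a coordinate row that the left merge above silently dropped.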


In [187]:
## decimalLatitude, decimalLongitude

occ = occ.merge(site[['SiteID', 'Lat_dd', 'Lon_dd']], how='left', on='SiteID')
occ.rename(columns={'Lat_dd':'decimalLatitude', 'Lon_dd':'decimalLongitude'}, inplace=True)
print(occ.shape)
occ.head()

(455, 8)


|   | eventID | eventDate  | datasetID                           | SiteID | locality                        | countryCode | decimalLatitude | decimalLongitude |
|---|---------|------------|-------------------------------------|--------|---------------------------------|-------------|-----------------|------------------|
| 0 | BR99-01 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |                 |                  |
| 1 | BR99-02 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |                 |                  |
| 2 | BR99-03 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |                 |                  |
| 3 | BR99-04 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |                 |                  |
| 4 | BR99-05 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |                 |                  |


**Note** that once merged, the only lat, lons that are missing are from Bodega Marine Life Refuge.

In [169]:
# `out` is presumably an earlier name for `occ` (this cell was last run before the columns were renamed)
out[out['Lat_dd'].isna()]

|   | eventID | eventDate  | datasetID                           | SiteID | locality                        | countryCode | Lat_dd | Lon_dd |
|---|---------|------------|-------------------------------------|--------|---------------------------------|-------------|--------|--------|
| 0 | BR99-01 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 1 | BR99-02 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 2 | BR99-03 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 3 | BR99-04 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 4 | BR99-05 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 5 | BR99-06 | 1999-07-29 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 6 | BR99-07 | 1999-07-28 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 7 | BR99-08 | 1999-08-12 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 8 | BR99-09 | 1999-07-24 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
| 9 | BR99-10 | 1999-07-24 | North coast kelp emergent transects | BR     | Bodega Marine Life Refuge (BML) | US          |        |        |
