# Introduction

The purpose of this guide is to introduce you to using pandas and other Python libraries to parse and analyze data from the [NOAA Institutional Repository](https://repository.library.noaa.gov/) (NOAA IR) JSON API, a REST API that the NOAA IR offers as a web service for retrieving structured data.

### What skills are needed to follow along with this guide?

This guide assumes you have some general familiarity with the following:
* General programming concepts;
* REST APIs;
* Python.

### What data can I get from the NOAA IR JSON API?

By calling the NOAA IR JSON API, you can retrieve metadata for the items in a [collection](https://repository.library.noaa.gov/browse/collections) via its pid, which serves as the collection's endpoint. The metadata in an item's fields contains descriptive information (MODS and Dublin Core), as well as administrative information such as the item's creation and modification dates.

### What collections are available? 

The current collections, along with their pids, are listed below:

```
National Environmental Policy Act (NEPA) : 1
Weather Research and Forecasting Innovation Act : 23702
Coral Reef Conservation Program (CRCP) : 3
Ocean Exploration Program (OER) : 4
National Marine Fisheries Service (NMFS) : 5
National Weather Service (NWS) : 6
Office of Oceanic and Atmospheric Research (OAR) : 7
National Ocean Service (NOS) : 8
National Environmental Satellite and Data Information Service (NESDIS) : 9
Sea Grant Publications : 11
Education and Outreach : 12
NOAA General Documents : 10031
NOAA International Agreements : 11879
Office of Marine and Aviation Operations (OMAO) : 16402
Integrated Ecosystem Assessment (IEA) : 22022
NOAA Cooperative Institutes : 23649
Cooperative Science Centers : 24914
```

## Parsing with pandas 

In [1]:
# import the necessary libraries
import pandas as pd
import requests
import os

In [2]:
# set the base URL for the JSON API
base_url = "https://repository.library.noaa.gov/fedora/export/download/collection/"

In [3]:
# using the 'Ocean Exploration Program' endpoint, you can select
# the metadata associated with the collection with the lines below
url = base_url + '4'
response = requests.get(url)
json_d = response.json()
docs = json_d['response']['docs']
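
The two steps in the cell above (building the collection URL from a pid, then pulling the item records out of the parsed JSON) can be wrapped in small helpers. This is a minimal sketch: the function names `collection_url` and `extract_docs` and the sample payload are hypothetical, and the only thing assumed about the API is the `response` → `docs` nesting already used in this guide.

```python
BASE_URL = "https://repository.library.noaa.gov/fedora/export/download/collection/"


def collection_url(pid):
    """Build a collection endpoint URL from a collection pid (hypothetical helper)."""
    return BASE_URL + str(pid)


def extract_docs(payload):
    """Return the list of item records, or an empty list if the keys are absent."""
    return payload.get("response", {}).get("docs", [])


# Hypothetical payload shaped like the structure used in the cell above
sample_payload = {"response": {"docs": [{"PID": "noaa:1"}, {"PID": "noaa:2"}]}}

print(collection_url(4))
print(len(extract_docs(sample_payload)))
print(extract_docs({}))
```

When calling the live service, it is usually worth passing a `timeout` to `requests.get` and calling `response.raise_for_status()` before `response.json()`, so that a failed request surfaces as an error rather than as a confusing `KeyError` later on.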

In [4]:
# once you have filtered down to the items in the collection
# you convert them from a list of dictionary objects into 
# a pandas DataFrame
df = pd.DataFrame(docs)
df.head(3) # print out the first three rows of DataFrame

Unnamed: 0,mods.abstract,fgs.state,mods.related_series,dc.title,mods.type_of_resource,fgs.lastModifiedDate,keywords,mods.name_corporate,mods.ss_publishyear,mods.title,...,mods.part_hierarchy,mods.related_original,dc.language,mods.language,rdf.isPeerTo,mods.sm_issn,mods.grants,mods.pmcid,mods.table_of_contents,mods.name_conference
0,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],"[Okeanos Explorer ROV dive summary, EX1711, Di...",[Dive Summary],2020-05-01T16:34:38.972Z,"[Multibeam mapping, Okeanos Explorer (Ship), U...",[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",...,,,,,,,,,,
1,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],"[Okeanos Explorer ROV dive summary, EX1711, Di...",[Dive Summary],2020-05-01T16:34:39.359Z,"[Multibeam mapping, Okeanos Explorer (Ship), U...",[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",...,,,,,,,,,,
2,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],"[Okeanos Explorer ROV dive summary, EX1711, Di...",[Dive Summary],2020-05-01T16:34:39.704Z,"[Multibeam mapping, Okeanos Explorer (Ship), U...",[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",...,,,,,,,,,,


##### High-level view of a DataFrame

This initial output provides a lot of information, but there are a few ways to explore the dataset at a high level to get a better understanding of it.

In [5]:
# one of the first things we can do is see how many rows and columns are available
# using pandas' shape attribute
df.shape # rows, columns

(718, 49)
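
To make the shape of the result easier to reason about, here is a small sketch with made-up records (the values are hypothetical, only the field names come from this guide). It shows that `pd.DataFrame` builds one row per dictionary, fills keys that a record lacks with `NaN`, and that `shape` reports rows first and columns second.

```python
import pandas as pd

# Hypothetical records: the second one lacks the 'mods.abstract' key
records = [
    {"PID": "noaa:1", "mods.title": "First", "mods.abstract": ["Some text"]},
    {"PID": "noaa:2", "mods.title": "Second"},
]

toy = pd.DataFrame(records)

# shape is a (rows, columns) tuple
n_rows, n_cols = toy.shape
print(n_rows, n_cols)

# the missing key shows up as NaN in that row
print(toy["mods.abstract"].isna().tolist())
```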

In [6]:
# Another way to get a better understanding of the NOAA IR dataset is to view the column names
df.columns

Index(['mods.abstract', 'fgs.state', 'mods.related_series', 'dc.title',
       'mods.type_of_resource', 'fgs.lastModifiedDate', 'keywords',
       'mods.name_corporate', 'mods.ss_publishyear', 'mods.title',
       'mods.ss_jobid', 'rdf.isMemberOf', 'dc.subject', 'mods.name_personal',
       'rdf.isOpenAccess', 'mods.origin', 'dc.description', 'PID',
       'mods.raw_date', 'mods.sm_localcorpname', 'mods.subject_topic',
       'dc.contributor', 'fgs.createdDate', 'mods.physical_description',
       'mods.subject_name', 'mods.sm_digital_object_identifier', 'dc.format',
       'dc.coverage', 'mods.subject_geographic', 'mods.alt_title',
       'mods.publisher_place', 'mods.country', 'mods.note', 'mods.genre',
       'mods.sm_compliance', 'mods.issue', 'mods.volume', 'dc.source',
       'mods.journal_title', 'mods.part_hierarchy', 'mods.related_original',
       'dc.language', 'mods.language', 'rdf.isPeerTo', 'mods.sm_issn',
       'mods.grants', 'mods.pmcid', 'mods.table_of_contents',
       'mods.name_conference'],
      dtype='object')

While ```df.shape``` does provide insight into the size of the DataFrame you are dealing with, seeing the actual column names printed out gives a new perspective. In this instance, we want to remove all Dublin Core columns (dc.\*), as they duplicate the information in the MODS columns (mods.\*).

One method for carrying out such a task is to use pandas' ```filter``` method, which accepts regular expressions.

In [7]:
# keep columns that start with 'mods.' or 'fgs.', plus the 'PID' column
df = df.filter(regex=r'^(mods|fgs)\.|^PID$')
df.head(3)

Unnamed: 0,mods.abstract,fgs.state,mods.related_series,mods.type_of_resource,fgs.lastModifiedDate,mods.name_corporate,mods.ss_publishyear,mods.title,mods.ss_jobid,mods.name_personal,...,mods.volume,mods.journal_title,mods.part_hierarchy,mods.related_original,mods.language,mods.sm_issn,mods.grants,mods.pmcid,mods.table_of_contents,mods.name_conference
0,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],[Dive Summary],2020-05-01T16:34:38.972Z,[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",1501,"[Kennedy, Brian R.C.]",...,,,,,,,,,,
1,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],[Dive Summary],2020-05-01T16:34:39.359Z,[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",1501,"[Kennedy, Brian R.C.]",...,,,,,,,,,,
2,"[Dive Summary in PDF, from EX1711 Okeanos Expl...",Active,[EX-17-11],[Dive Summary],2020-05-01T16:34:39.704Z,[United States. National Oceanic and Atmospher...,2017,"Okeanos Explorer ROV dive summary, EX1711, Div...",1501,"[Kennedy, Brian R.C.]",...,,,,,,,,,,
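
The `regex` argument of `filter` is matched against each column name with `re.search`, so a pattern matches anywhere in the name unless it is anchored. The sketch below uses a toy frame with one made-up column, `old_PID`, to illustrate the difference between an unanchored and an anchored pattern; the column is hypothetical and does not exist in the NOAA IR data.

```python
import pandas as pd

# Empty toy frame; only the column names matter here.
# 'old_PID' is a hypothetical column added for illustration.
toy = pd.DataFrame(
    columns=["mods.title", "dc.title", "fgs.state", "PID", "old_PID", "keywords"]
)

# Unanchored: 'PID' also matches inside 'old_PID'
loose = list(toy.filter(regex="mods.*|fgs.*|PID").columns)
print(loose)

# Anchored, with the dot escaped: only the intended prefixes and the exact name 'PID'
strict = list(toy.filter(regex=r"^(mods|fgs)\.|^PID$").columns)
print(strict)
```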


Another common way to filter columns is to use pandas' ```reindex``` method. This method is often used when you know the names of the columns you want to keep. In this instance, we can filter down to a few essential columns that will be used for the remainder of the guide.

In [8]:
columns = ['PID', 'mods.title', 'mods.abstract',
           'mods.type_of_resource',
           'mods.sm_digital_object_identifier']

df = df.reindex(columns=columns)
df

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,mods.sm_digital_object_identifier
0,noaa:17445,"Okeanos Explorer ROV dive summary, EX1711, Div...","[Dive Summary in PDF, from EX1711 Okeanos Expl...",[Dive Summary],
1,noaa:17448,"Okeanos Explorer ROV dive summary, EX1711, Div...","[Dive Summary in PDF, from EX1711 Okeanos Expl...",[Dive Summary],
2,noaa:17446,"Okeanos Explorer ROV dive summary, EX1711, Div...","[Dive Summary in PDF, from EX1711 Okeanos Expl...",[Dive Summary],
3,noaa:17451,"Okeanos Explorer ROV dive summary, EX1711, Div...","[Dive Summary in PDF, from EX1711 Okeanos Expl...",[Dive Summary],
4,noaa:17226,"Okeanos Explorer ROV dive summary, EX1708, Div...","[Dive summary in PDF, from EX1708 Okeanos Expl...",[Dive Summary],
...,...,...,...,...,...
713,noaa:935,[Cruise summary for South Atlantic Bight 2006]...,[South Atlantic Bight 2006 final cruise report],[Professional Paper],
714,noaa:641,"Davidson Seamount 2006, Quick Look Report (QLR)",,[Professional Paper],
715,noaa:23750,Building the Knowledge-to-Action Pipeline in N...,[Ocean acidification (OA) describes the progre...,[Journal Article],[https://doi.org/10.3389/fmars.2019.00356]
716,noaa:514,Mesophotic coral ecosystem research strategy I...,"[""This document summarizes the results of the ...",[Technical Memorandum],
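
One practical difference between `reindex` and plain column selection is how they treat a requested column that does not exist. The sketch below uses a toy frame with hypothetical values: `reindex` creates the missing column filled with `NaN`, while selecting with a list of names raises a `KeyError`. That behavior can be convenient when a collection happens not to include a field, but it can also hide a misspelled column name.

```python
import pandas as pd

toy = pd.DataFrame({"PID": ["noaa:1", "noaa:2"], "mods.title": ["First", "Second"]})

wanted = ["PID", "mods.title", "mods.sm_digital_object_identifier"]

# reindex keeps going and creates the missing column, filled with NaN
reindexed = toy.reindex(columns=wanted)
print(reindexed["mods.sm_digital_object_identifier"].isna().all())

# plain selection raises KeyError when a requested column is absent
selection_failed = False
try:
    toy[wanted]
except KeyError:
    selection_failed = True
print(selection_failed)
```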


As with filtering on columns, pandas makes it simple to filter on rows. In particular, one often needs to filter out missing values before proceeding to further analysis. In this case, we will remove all records that lack DOIs using the ```notnull()``` method.

In [9]:
df = df[df['mods.sm_digital_object_identifier'].notnull()]
df

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,mods.sm_digital_object_identifier
27,noaa:19265,Mapping Data Acquisition Summary Report : CRUI...,[NOAA Ship Okeanos Explorer was in dry dock fr...,[Technical Report],[https://doi.org/10.25923/4man-s223]
40,noaa:24983,Mapping Data Acquisition and Processing Summar...,[The purpose of this report is to briefly desc...,[Technical Report],[https://doi.org/10.25923/6ha9-py06]
50,noaa:23488,Expedition Cruise Report: EX-16-06. 2016 Deepw...,"[In August of 2016, NOAA Ship Okeanos Explorer...",[Technical Report],[https://doi.org/10.25923/52d7-h744]
51,noaa:23485,"Project instructions. EX-18-07, mapping deepwa...",[This document contains project instructions f...,[Technical Report],[https://doi.org/10.25923/fhr0-m519]
52,noaa:21417,"Cruise report. EX-16-04, CAPSTONE Wake Island ...","[""Cruise Report: EX-16-04, CAPSTONE Wake Islan...",[Technical Report],[https://doi.org/10.25923/z35c-tm74]
...,...,...,...,...,...
694,noaa:23770,Assessment of Mesophotic Coral Ecosystem Conne...,"[In coral reef ecosystems, mesophotic coral ha...",[Journal Article],[https://doi.org/10.3389/fmars.2018.00174]
695,noaa:21136,New frontiers in ocean exploration: The E/V Na...,,[Journal Article],[https://doi.org/10.5670/oceanog.2019.suppleme...
705,noaa:24546,The Morphometry of the Deep-Water Sinuous Mend...,"[Mendocino Channel, a deep-water sinuous chann...",[Journal Article],[https://doi.org/10.3390/geosciences7040124]
715,noaa:23750,Building the Knowledge-to-Action Pipeline in N...,[Ocean acidification (OA) describes the progre...,[Journal Article],[https://doi.org/10.3389/fmars.2019.00356]


In [10]:
# Now that items have been removed from the DataFrame,
# the index is missing values. We can reset the index using `reset_index`
df = df.reset_index(drop=True)
df

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,mods.sm_digital_object_identifier
0,noaa:19265,Mapping Data Acquisition Summary Report : CRUI...,[NOAA Ship Okeanos Explorer was in dry dock fr...,[Technical Report],[https://doi.org/10.25923/4man-s223]
1,noaa:24983,Mapping Data Acquisition and Processing Summar...,[The purpose of this report is to briefly desc...,[Technical Report],[https://doi.org/10.25923/6ha9-py06]
2,noaa:23488,Expedition Cruise Report: EX-16-06. 2016 Deepw...,"[In August of 2016, NOAA Ship Okeanos Explorer...",[Technical Report],[https://doi.org/10.25923/52d7-h744]
3,noaa:23485,"Project instructions. EX-18-07, mapping deepwa...",[This document contains project instructions f...,[Technical Report],[https://doi.org/10.25923/fhr0-m519]
4,noaa:21417,"Cruise report. EX-16-04, CAPSTONE Wake Island ...","[""Cruise Report: EX-16-04, CAPSTONE Wake Islan...",[Technical Report],[https://doi.org/10.25923/z35c-tm74]
...,...,...,...,...,...
151,noaa:23770,Assessment of Mesophotic Coral Ecosystem Conne...,"[In coral reef ecosystems, mesophotic coral ha...",[Journal Article],[https://doi.org/10.3389/fmars.2018.00174]
152,noaa:21136,New frontiers in ocean exploration: The E/V Na...,,[Journal Article],[https://doi.org/10.5670/oceanog.2019.suppleme...
153,noaa:24546,The Morphometry of the Deep-Water Sinuous Mend...,"[Mendocino Channel, a deep-water sinuous chann...",[Journal Article],[https://doi.org/10.3390/geosciences7040124]
154,noaa:23750,Building the Knowledge-to-Action Pipeline in N...,[Ocean acidification (OA) describes the progre...,[Journal Article],[https://doi.org/10.3389/fmars.2019.00356]
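
The row filter and the index reset can be seen more clearly on a few rows. This sketch uses made-up PIDs and DOIs (all hypothetical) and also shows `dropna(subset=...)`, which is an equivalent way to drop rows with a missing value in a given column.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; two of them have no DOI
toy = pd.DataFrame({
    "PID": ["noaa:1", "noaa:2", "noaa:3", "noaa:4"],
    "doi": [np.nan, "10.0000/example-a", np.nan, "10.0000/example-b"],
})

# Boolean mask: keep rows where 'doi' is not null
with_doi = toy[toy["doi"].notnull()]
print(with_doi.index.tolist())  # original labels are kept, so gaps appear

# Renumber the rows from zero and discard the old labels
with_doi = with_doi.reset_index(drop=True)
print(with_doi.index.tolist())

# dropna on a subset of columns gives the same rows
same = toy.dropna(subset=["doi"]).reset_index(drop=True)
print(same.equals(with_doi))
```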


##### Transforming values

You may also have noticed that when using pandas to parse the NOAA IR JSON API data, certain fields are imported as Python list objects. pandas **accessor** methods allow you to easily transform these values from list objects into Python string objects. If you are a Python user, many of these **accessor** methods will be familiar to you, as they resemble Python's built-in string methods.

In [11]:
# By selecting the first value in 'mods.type_of_resource' field
# you can see that column values are Python list objects
type(df['mods.type_of_resource'][0])

list

In [12]:
# one way to solve this is to use pandas accessor methods,
# updating the column values with a new set of values
df['mods.type_of_resource'] = df['mods.type_of_resource'].str.join('')

In [13]:
# to confirm that the transformation from Python list to string
# has occurred we can check the data type once again
type(df['mods.type_of_resource'][0])

str

In [14]:
# However, if there is more than one column to convert, it may be more
# efficient to write a custom function, which you can apply
# to a column when needed

def list_to_str(value):
    """
    Converts list objects to string objects.
    
    Exception used to handle NaN values.
    """
    
    try:
        return ''.join(value)
    except TypeError:
        return value
    

In [15]:
# With a custom function we can apply it to the columns of interest
df['mods.abstract'] = df['mods.abstract'].apply(list_to_str)
df['mods.sm_digital_object_identifier'] = df['mods.sm_digital_object_identifier'].apply(list_to_str)

In [16]:
df.head()

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,mods.sm_digital_object_identifier
0,noaa:19265,Mapping Data Acquisition Summary Report : CRUI...,NOAA Ship Okeanos Explorer was in dry dock fro...,Technical Report,https://doi.org/10.25923/4man-s223
1,noaa:24983,Mapping Data Acquisition and Processing Summar...,The purpose of this report is to briefly descr...,Technical Report,https://doi.org/10.25923/6ha9-py06
2,noaa:23488,Expedition Cruise Report: EX-16-06. 2016 Deepw...,"In August of 2016, NOAA Ship Okeanos Explorer ...",Technical Report,https://doi.org/10.25923/52d7-h744
3,noaa:23485,"Project instructions. EX-18-07, mapping deepwa...",This document contains project instructions fo...,Technical Report,https://doi.org/10.25923/fhr0-m519
4,noaa:21417,"Cruise report. EX-16-04, CAPSTONE Wake Island ...","""Cruise Report: EX-16-04, CAPSTONE Wake Island...",Technical Report,https://doi.org/10.25923/z35c-tm74
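
One detail worth keeping in mind when joining lists into strings: with an empty separator, a list that holds more than one value is run together without any delimiter. The sketch below uses a toy Series with hypothetical values to compare an empty separator with `'; '`; which separator suits the NOAA IR fields is a judgment call and not something this guide prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical column: one single-value list, one two-value list, one missing value
s = pd.Series([["Technical Report"], ["Journal Article", "Dataset"], np.nan])

joined_plain = s.str.join("")    # values run together
joined_sep = s.str.join("; ")    # values stay distinguishable

print(joined_plain[1])
print(joined_sep[1])
print(pd.isna(joined_sep[2]))    # missing values stay missing
```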


##### Merging and concatenating DataFrames

pandas offers powerful features that allow you to easily concatenate and merge different DataFrames.

For instance, with a few commands it is possible to combine the current dataframe from this guide with a new dataframe to carry out some basic analysis.

In [17]:
# First we will create a new DataFrame using the National Marine Fisheries Service (NMFS) collection (pid 5).
# After downloading this dataset and creating the DataFrame, we will also format it
# as we did with the initial DataFrame

r = requests.get(base_url + '5')
json_d = r.json()
docs = json_d['response']['docs']
df2 = pd.DataFrame(docs)

In [18]:
# reindex to only five columns, not including index
df2 = df2.reindex(columns=columns)

# remove any records without DOIs
df2 = df2[df2['mods.sm_digital_object_identifier'].notnull()]

# convert list values to strings
df2['mods.abstract'] = df2['mods.abstract'].apply(list_to_str)
df2['mods.sm_digital_object_identifier'] = df2['mods.sm_digital_object_identifier'].apply(list_to_str)
df2['mods.type_of_resource'] = df2['mods.type_of_resource'].str.join('') 

In [19]:
df2.head(3)

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,mods.sm_digital_object_identifier
0,noaa:12604,"Spatial distribution, diet, and nutritional st...",Surveys were conducted in the Yukon River estu...,Technical Memorandum,http://doi.org/10.7289/V5/TM-AFSC-334
28,noaa:15033,A guide to landing shark species with fins na...,"""The practice of finning, defined as the remov...",Technical Memorandum,http://doi.org/10.7289/V5/TM-SEFSC-712
34,noaa:15742,When El Nino Rages How Satellite Data Can Help...,"There are more than 2,000 islands across Hawai...",Journal Article,http://dx.doi.org/10.1175/bams-d-15-00219.1


#### Concat

Now that the second dataset has been downloaded, imported, and cleaned up, we can combine it with the initial dataset using pandas' ```concat``` function to create a single DataFrame.

In [20]:
df3 = pd.concat([df,df2])

In [21]:
# We can print out all three DataFrame shapes to confirm the concat operation
print(df.shape)
print(df2.shape)
print(df3.shape)

(156, 5)
(2035, 5)
(2191, 5)


As items in the NOAA IR are shared across collections, it is a good idea to use pandas' ```drop_duplicates``` method to remove any potential duplicates from the newly created DataFrame.

In [22]:
print(df3.shape) # shape before dropping duplicates
df3 = df3.drop_duplicates()
print(df3.shape) # shape after dropping duplicates

(2191, 5)
(2190, 5)


#### Merge

pandas' ```merge``` function is often compared to a SQL join, as it allows you to join two datasets together using a single key or multiple keys.

In the case of the NOAA IR API data, it is often helpful to join datasets together based on DOIs to determine whether the NOAA IR is hosting particular journal articles.
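
A join on DOIs only finds matches when both sides write the DOI in the same form. The NOAA IR values seen earlier in this guide include resolver prefixes such as `https://doi.org/` and `http://dx.doi.org/`, whereas other sources may store the bare DOI. If the two datasets differ in this way, a normalization step before merging may help. The helper below, `normalize_doi`, is a hypothetical sketch and not part of the original workflow; lowercasing relies on DOIs being case-insensitive.

```python
import pandas as pd

# Resolver prefixes to strip (an assumed list; extend as needed)
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
)


def normalize_doi(value):
    """Strip a leading resolver URL and lowercase the DOI; leave non-strings alone."""
    if not isinstance(value, str):
        return value
    doi = value.strip().lower()
    for prefix in PREFIXES:
        if doi.startswith(prefix):
            return doi[len(prefix):]
    return doi


dois = pd.Series([
    "https://doi.org/10.25923/4man-s223",
    "http://dx.doi.org/10.1175/bams-d-15-00219.1",
    "10.1371/journal.pone.0171644",
])
normalized = dois.apply(normalize_doi)
print(normalized.tolist())
```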

In [23]:
# first we will need to import the dataset we will merge on
# For this example, we can use a list of Open Access journal articles
oa_data = 'OA-example.csv'
oa_df = pd.read_csv(os.path.join('data',oa_data))
oa_df.columns # print out the columns

Index(['doi', 'issue', 'journal.id', 'journal.title', 'title', 'type',
       'volume', 'year', 'open_access_types'],
      dtype='object')

In [25]:
# The merge function can be used in a variety of ways; however,
# the simplest approach is to carry out an 'inner' merge, the default merge type

merge_df = df3.merge(
        oa_df,
        left_on='mods.sm_digital_object_identifier',
        right_on='doi'
        )

In [29]:
# view the resulting columns and number of rows
print(merge_df.columns)
print()
print(merge_df.shape)

Index(['PID', 'mods.title', 'mods.abstract', 'mods.type_of_resource',
       'mods.sm_digital_object_identifier', 'doi', 'issue', 'journal.id',
       'journal.title', 'title', 'type', 'volume', 'year',
       'open_access_types'],
      dtype='object')

(78, 14)
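
An inner merge silently drops rows that have no match on the other side. To see which rows matched, one option is a left merge with `indicator=True`, which adds a `_merge` column. The sketch below uses toy frames with hypothetical DOIs and journal names, and only the column names follow this guide.

```python
import pandas as pd

left = pd.DataFrame({
    "PID": ["noaa:1", "noaa:2", "noaa:3"],
    "mods.sm_digital_object_identifier": ["10.0000/a", "10.0000/b", "10.0000/c"],
})
right = pd.DataFrame({
    "doi": ["10.0000/b", "10.0000/c", "10.0000/d"],
    "journal.title": ["Journal B", "Journal C", "Journal D"],
})

# Inner merge (the default): only DOIs present on both sides survive
inner = left.merge(
    right, left_on="mods.sm_digital_object_identifier", right_on="doi"
)
print(inner.shape)

# Left merge with an indicator: every row of `left` is kept and labeled
audit = left.merge(
    right,
    how="left",
    left_on="mods.sm_digital_object_identifier",
    right_on="doi",
    indicator=True,
)
print(audit["_merge"].tolist())
```

Rows labeled `left_only` are items from the left frame whose DOI was not found in the right frame.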


In [33]:
# seeing that there are duplicative columns, we can reindex by columns
# once again to create a final DataFrame
final_df = merge_df.reindex(columns=['PID', 'mods.title',
            'mods.abstract','mods.type_of_resource','doi',
            'journal.title'])

In [36]:
final_df.head(3) # print out results

Unnamed: 0,PID,mods.title,mods.abstract,mods.type_of_resource,doi,journal.title
0,noaa:24600,Stock assessment and end-to-end ecosystem mode...,Although all models are simplified approximati...,Journal Article,10.1371/journal.pone.0171644,PLoS ONE
1,noaa:24598,Validation of band counts in eyestalks for the...,Using known-age Antarctic krill (Euphausia sup...,Journal Article,10.1371/journal.pone.0171773,PLoS ONE
2,noaa:24579,Population growth is limited by nutritional im...,The Southern Resident killer whale population ...,Journal Article,10.1371/journal.pone.0179824,PLoS ONE
