---
syncID: 
title: "Downloading NEON Observation Data with Python"
description: ""
dateCreated: 2020-04-24
authors: Maxwell J. Burner
contributors: 
estimatedTime: 
packagesLibraries: requests, json, pandas
topics: api, data management
languagesTool: python
dataProduct: DP1.10003.001
code1: 
tutorialSeries: python-neon-api-series
urlTitle: python_neon_api_02_downloading_observational
---

In this tutorial we will learn to download Observational Sampling (OS) data from the NEON API into the Python environment.

<div id="ds-objectives" markdown="1">

### Objectives
After completing this tutorial, you will be able to:

* Navigate a NEON API request from the *data/* endpoint
* Describe the naming conventions of NEON OS data files
* Understand how to download NEON observational data using the Python Pandas library
* Describe the basic components of a Pandas dataframe


### Install Python Packages

* **requests**
* **json** 
* **numpy**
* **pandas**

We will not actually use the NumPy package in this tutorial; it is listed here because the Pandas package is built on top of NumPy, and requires that the latter be present.
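
Before starting, it can help to confirm that the packages are importable. The sketch below uses only the standard library; the helper name `check_packages` is made up for this example and is not part of any NEON tooling. Note that `json` ships with Python itself, so only the other three normally need installing.

```python
import importlib.util

# Packages this tutorial relies on (numpy is needed indirectly by pandas)
REQUIRED = ['requests', 'json', 'numpy', 'pandas']

def check_packages(names):
    """Return a dict mapping each package name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}

status = check_packages(REQUIRED)
for name, available in status.items():
    print(name, 'found' if available else 'MISSING')
```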

</div>

In this tutorial we will learn how to download specific NEON data files into Python. We will specifically look at how to use the Pandas package to read in CSV files of observational data.

In the previous tutorial, we saw some of the data files containing information on land bird breeding counts. These are an example of NEON *observational data*. NEON has three basic types of data: Observational Sampling (OS), Instrumentation Sampling (IS), and Remote Sensing data from the Airborne Observation Platform (AOP). The process for requesting data is about the same for all three, but downloading and navigating the data tends to be very different depending on which category we want.

Here we will discuss downloading observational data, as it tends to be the simplest to handle.

## Libraries Downloaded

In addition to using the requests and json packages again, we will use the Pandas package to read in the data. Pandas is a library that adds data frame objects to Python, based on the data frames of the R programming language; these offer a great way to store and manipulate tabular data.

In [1]:
import requests
import json
import pandas as pd

In [2]:
SERVER = 'http://data.neonscience.org/api/v0/'
SITECODE = 'TEAK'
PRODUCTCODE = 'DP1.10003.001'

## Look up Data Files

We already know from the last tutorial that landbird breeding counts (DP1.10003.001) are available at the Lower Teakettle site for 2018-06. We can again make a request to see what files in particular are available.

In [3]:
#Make Request
data_request = requests.get(SERVER+'data/'+PRODUCTCODE+'/'+SITECODE+'/'+'2018-06')
data_json = data_request.json()
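
The request URL above is assembled by string concatenation. As a minimal sketch, the same assembly can be wrapped in a small helper so that the product, site, and month are easy to swap; `build_data_url` is a hypothetical name introduced here for illustration only.

```python
def build_data_url(server, product_code, site_code, month):
    """Assemble a data/ endpoint URL from its parts (month given as 'YYYY-MM')."""
    # rstrip avoids a doubled slash when the server string already ends in '/'
    return server.rstrip('/') + '/' + '/'.join(['data', product_code, site_code, month])

url = build_data_url('http://data.neonscience.org/api/v0/',
                     'DP1.10003.001', 'TEAK', '2018-06')
print(url)
```

Passing this URL to `requests.get` and calling `raise_for_status()` on the response before `.json()` would surface HTTP errors early, rather than failing later on a missing key.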

In [4]:
#View names of files
for file in data_json['data']['files']:
    print(file['name'])

NEON.D17.TEAK.DP1.10003.001.readme.20191107T153235Z.txt
NEON.D17.TEAK.DP1.10003.001.2018-06.basic.20191107T153235Z.zip
NEON.D17.TEAK.DP1.10003.001.brd_perpoint.2018-06.basic.20191107T153235Z.csv
NEON.D17.TEAK.DP0.10003.001.validation.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.EML.20180619-20180622.20191107T153235Z.xml
NEON.D17.TEAK.DP1.10003.001.variables.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.expanded.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.brd_references.expanded.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.variables.20191107T153235Z.csv
NEON.Bird_Conservancy_of_the_Rockies.brd_personnel.csv
NEON.D17.TEAK.DP0.10003.001.validation.20191107T153235Z.csv
NEON.D17.TEAK.DP1.10003.001.2018-06.expanded.20191107T153235Z.zip
NEON.D17.TEAK.DP1.10003.001.EML.20180619-20180622.20191107T153235Z.xml
NEON.D17.TEAK.DP1.10003.001.readme.20191107T153235Z.txt
NEON.D17.TEAK.DP1.

Let's take a closer look at a file name.

In [5]:
print(data_json['data']['files'][6]['name'])

NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv


The format for most NEON data product file names is:

**NEON.[domain number].[site code].[data product ID].[file-specific name].[date of file creation]**

So the file whose name we singled out is from domain 17, Lower Teakettle site, Breeding Landbird point counts (DP1.10003.001), with the file-specific name brd_countdata.2018-06.basic, created 2019-11-07 at 15:32:35. The name brd_countdata.2018-06.basic indicates that this is the 'basic' version of the bird count data gathered in June 2018. The trailing .csv is the file extension.
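
The naming pattern can be checked programmatically. The following is a rough sketch, not an official NEON parser: `parse_neon_filename` is a hypothetical helper that assumes the product ID always has three dot-separated parts and that the creation timestamp is the second-to-last component. Names that do not fit (such as the personnel file) return `None`.

```python
import re
from datetime import datetime

def parse_neon_filename(name):
    """Split a NEON file name into labeled parts, or return None if it does not fit."""
    parts = name.split('.')
    # Expect: NEON, domain, site, 3-part product ID, description..., timestamp, extension
    if len(parts) < 9 or parts[0] != 'NEON' or not re.fullmatch(r'D\d{2}', parts[1]):
        return None
    try:
        created = datetime.strptime(parts[-2], '%Y%m%dT%H%M%SZ')
    except ValueError:
        return None
    return {
        'domain': parts[1],
        'site': parts[2],
        'product': '.'.join(parts[3:6]),
        'description': '.'.join(parts[6:-2]),
        'created': created,
        'extension': parts[-1],
    }

info = parse_neon_filename(
    'NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv')
print(info)
```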

Bird counts and other observational data are usually kept in CSV files in the NEON database. Often the data for a particular month-site combination will be available through two different .csv files, one for each of two 'download packages': a 'basic' package storing only the main measurements, and an 'expanded' package that also lists the uncertainties involved in each measurement. Let's save the URL for the basic count data CSV file.

In [6]:
#Print names and URLs of files with birdcount data
for file in data_json['data']['files']:
    if('countdata' in file['name']): #Show both basic and expanded files
        print(file['name'],file['url'])
        if('basic' in file['name']):
            bird_count_url = file['url'] #save url of file with basic bird count data


NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv https://neon-prod-pub-1.s3.data.neonscience.org/NEON.DOM.SITE.DP1.10003.001/PROV/TEAK/20180601T000000--20180701T000000/basic/NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200501T131326Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=pub-internal-read%2F20200501%2Fus-west-2%2Fs3%2Faws4_request&X-Amz-Signature=a1cef8c9d93111149e39b05fe5713bd5cd3fa853a1cbd8b4e76b9f0c67d66828
NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.expanded.20191107T153235Z.csv https://neon-prod-pub-1.s3.data.neonscience.org/NEON.DOM.SITE.DP1.10003.001/PROV/TEAK/20180601T000000--20180701T000000/expanded/NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.expanded.20191107T153235Z.csv?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Date=20200501T131326Z&X-Amz-SignedHeaders=host&X-Amz-Expires=3600&X-Amz-Credential=pub-internal-read%2F20200501%2Fus-west-2%2Fs3%2
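
The keyword filtering in the loop above can also be written as a reusable function. This is only a sketch: `find_file_url` is a hypothetical helper, and `sample_files` is a hand-made stand-in for `data_json['data']['files']` with placeholder URLs. The `X-Amz-Expires=3600` parameter in the printed URLs suggests the links are time-limited, so it may be safer to look up a fresh URL shortly before downloading.

```python
def find_file_url(files, *keywords):
    """Return the URL of the first file whose name contains every keyword, else None."""
    for entry in files:
        if all(keyword in entry['name'] for keyword in keywords):
            return entry['url']
    return None

# Hypothetical stand-in for data_json['data']['files']; the URLs are placeholders
sample_files = [
    {'name': 'NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.expanded.20191107T153235Z.csv',
     'url': 'https://example.org/expanded.csv'},
    {'name': 'NEON.D17.TEAK.DP1.10003.001.brd_countdata.2018-06.basic.20191107T153235Z.csv',
     'url': 'https://example.org/basic.csv'},
    {'name': 'NEON.D17.TEAK.DP1.10003.001.variables.20191107T153235Z.csv',
     'url': 'https://example.org/variables.csv'},
]

bird_count_url_sample = find_file_url(sample_files, 'countdata', 'basic')
print(bird_count_url_sample)
```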

## Read file into Pandas Dataframe

There are a couple of options for reading CSV files into Python. For files read directly from NEON's data repository, the best option seems to be the 'read_csv' function from the Pandas package. This function converts the contents of the target file into a pandas dataframe object, and has the added advantage of being able to read data files accessed through the web (Python has its own built-in csv package, but that package cannot fetch a file from a URL on its own).

In [7]:
#Read bird count CSV data into a Pandas Dataframe
df_bird = pd.read_csv(bird_count_url)
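
`read_csv` accepts a local path, a URL, or a file-like object. To try it without a network connection, the sketch below reads from an in-memory text buffer; the three-column CSV text is invented for illustration (including the taxon codes) and is not real NEON data.

```python
import io
import pandas as pd

# Hypothetical three-column excerpt standing in for a downloaded NEON CSV file
csv_text = """siteID,taxonID,clusterSize
TEAK,AAAA,1
TEAK,BBBB,2
"""

# StringIO wraps the text so read_csv can treat it like an open file
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample)
```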

Pandas is a popular Python package for data analysis and data manipulation. The package implements dataframe objects based on the dataframes used in the R programming language, and uses these objects for storing and manipulating tabular data.

A dataframe is a two-dimensional table of data, a grid built of rows and columns of values. Generally the columns correspond to the different variables being measured, while the rows correspond to each entry or measurement taken (in this case, each bird counted). Dataframes also have a header containing labels for each column, and an index containing labels for each row; both are 'index' objects stored as attributes of the dataframe object.

Pandas dataframes store their contents, header, and index in different attributes of the dataframe object. Other attributes contain metadata such as the overall shape of the dataframe, and the data type of each column.

You can find more about Pandas at the [official site](https://pandas.pydata.org/), which includes a tutorials page [here](https://pandas.pydata.org/pandas-docs/version/0.15/tutorials.html).
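
The attributes described above can be inspected on a small hand-made dataframe. This is an illustrative sketch: the two rows reuse species names and distances from the bird data, while the row labels `row_a` and `row_b` are arbitrary.

```python
import pandas as pd

df_small = pd.DataFrame(
    {'vernacularName': ['Dusky Flycatcher', 'Golden-crowned Kinglet'],
     'observerDistance': [92.0, 22.0]},
    index=['row_a', 'row_b'],
)

print(df_small.columns)  # header: labels for each column
print(df_small.index)    # index: labels for each row
print(df_small.shape)    # (number of rows, number of columns)
print(df_small.dtypes)   # data type of each column
print(df_small.values)   # the contents as a NumPy array
```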

In [8]:
#View the column names
df_bird.columns

Index(['uid', 'namedLocation', 'domainID', 'siteID', 'plotID', 'plotType',
       'pointID', 'startDate', 'eventID', 'pointCountMinute',
       'targetTaxaPresent', 'taxonID', 'scientificName', 'taxonRank',
       'vernacularName', 'observerDistance', 'detectionMethod',
       'visualConfirmation', 'sexOrAge', 'clusterSize', 'clusterCode',
       'identifiedBy'],
      dtype='object')

In [9]:
#Print out dimensions of the new dataframe
print('Number of columns: ',df_bird.shape[1])
print('Number of Rows: ',df_bird.shape[0])

Number of columns:  22
Number of Rows:  1729


In [10]:
#Print out names and data types of dataframe columns
print(df_bird.dtypes)

uid                    object
namedLocation          object
domainID               object
siteID                 object
plotID                 object
plotType               object
pointID                object
startDate              object
eventID                object
pointCountMinute        int64
targetTaxaPresent      object
taxonID                object
scientificName         object
taxonRank              object
vernacularName         object
observerDistance      float64
detectionMethod        object
visualConfirmation     object
sexOrAge               object
clusterSize           float64
clusterCode            object
identifiedBy           object
dtype: object


Here Pandas classifies each column as integer (int64), floating point (float64, decimal numbers), or object; the last category usually indicates data stored as strings, such as text labels or date-time data.
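
Because date-time values arrive as plain strings, they need converting before any date arithmetic. The sketch below shows one possible way to do this for strings shaped like the `startDate` values, assuming the trailing `Z` marks UTC as in ISO 8601; the second value in the series is made up for the example.

```python
import pandas as pd

# First value matches the startDate format in the bird data; the second is hypothetical
start = pd.Series(['2018-06-19T13Z', '2018-06-20T14Z'])

# Drop the trailing 'Z' (UTC marker), then parse with an explicit format
parsed = pd.to_datetime(start.str.rstrip('Z'), format='%Y-%m-%dT%H', utc=True)

print(parsed.dtype)
print(parsed.dt.hour.tolist())
```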

In [11]:
#View first five rows of dataframe using the 'head' method
df_bird.head(5)

| | uid | namedLocation | domainID | siteID | plotID | plotType | pointID | startDate | eventID | pointCountMinute | ... | scientificName | taxonRank | vernacularName | observerDistance | detectionMethod | visualConfirmation | sexOrAge | clusterSize | clusterCode | identifiedBy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 10176faf-160f-4ddb-a141-d53ddbdf11c0 | TEAK_010.birdGrid.brd | D17 | TEAK | TEAK_010 | distributed | C1 | 2018-06-19T13Z | TEAK_010.C1.2018-06-19T05:30-07:00[US/Pacific] | 5 | ... | Setophaga coronata | species | Yellow-rumped Warbler | 34.0 | singing | No | Unknown | 1.0 | | WHERS |
| 1 | 7bb4df8c-a481-4725-b03d-e2051e2b1e6d | TEAK_010.birdGrid.brd | D17 | TEAK | TEAK_010 | distributed | C1 | 2018-06-19T13Z | TEAK_010.C1.2018-06-19T05:30-07:00[US/Pacific] | 1 | ... | Setophaga coronata | species | Yellow-rumped Warbler | 89.0 | singing | No | Unknown | 1.0 | | WHERS |
| 2 | ed8e3f9e-78a9-42c2-9a53-9b47295538ea | TEAK_010.birdGrid.brd | D17 | TEAK | TEAK_010 | distributed | C1 | 2018-06-19T13Z | TEAK_010.C1.2018-06-19T05:30-07:00[US/Pacific] | 1 | ... | Regulus satrapa | species | Golden-crowned Kinglet | 22.0 | singing | No | Unknown | 1.0 | | WHERS |
| 3 | 60d9a773-0bfd-4284-b3e5-a4c020a3feb8 | TEAK_010.birdGrid.brd | D17 | TEAK | TEAK_010 | distributed | C1 | 2018-06-19T13Z | TEAK_010.C1.2018-06-19T05:30-07:00[US/Pacific] | 1 | ... | Empidonax oberholseri | species | Dusky Flycatcher | 92.0 | singing | No | Unknown | 1.0 | | WHERS |
| 4 | e4a6777d-68c9-4a54-b98b-a8a5d540b362 | TEAK_010.birdGrid.brd | D17 | TEAK | TEAK_010 | distributed | C1 | 2018-06-19T13Z | TEAK_010.C1.2018-06-19T05:30-07:00[US/Pacific] | 1 | ... | Regulus satrapa | species | Golden-crowned Kinglet | 80.0 | singing | No | Unknown | 1.0 | | WHERS |


We can now manipulate this dataframe using the various methods and functions of the Pandas library.
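
As a small illustration of such manipulation, the sketch below counts detections per species and averages the observer distance per species. To keep it self-contained it rebuilds only two columns from the five rows displayed by `head`, rather than using the full `df_bird`.

```python
import pandas as pd

# Two columns copied from the five rows shown by df_bird.head(5)
df_head = pd.DataFrame({
    'vernacularName': ['Yellow-rumped Warbler', 'Yellow-rumped Warbler',
                       'Golden-crowned Kinglet', 'Dusky Flycatcher',
                       'Golden-crowned Kinglet'],
    'observerDistance': [34.0, 89.0, 22.0, 92.0, 80.0],
})

# Number of rows recorded for each species
counts = df_head['vernacularName'].value_counts()

# Mean observer distance for each species
mean_distance = df_head.groupby('vernacularName')['observerDistance'].mean()

print(counts)
print(mean_distance)
```

The same two calls should work unchanged on `df_bird` itself, since it has both columns.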

## Variable Information

Look again at the list of files available, specifically those that are NOT counting data.

In [17]:
#View names of files
for file in data_json['data']['files']:
    if 'countdata' not in file['name'] and 'perpoint' not in file['name']:
        print(file['name'])

NEON.D17.TEAK.DP1.10003.001.EML.20190701-20190701.20191205T150127Z.xml
NEON.D17.TEAK.DP0.10003.001.validation.20191205T150127Z.csv
NEON.D17.TEAK.DP1.10003.001.readme.20191205T150127Z.txt
NEON.D17.TEAK.DP1.10003.001.variables.20191205T150127Z.csv
NEON.D17.TEAK.DP1.10003.001.2019-07.basic.20191205T150127Z.zip
NEON.D17.TEAK.DP1.10003.001.brd_references.expanded.20191205T150127Z.csv
NEON.D17.TEAK.DP1.10003.001.readme.20191205T150127Z.txt
NEON.Bird_Conservancy_of_the_Rockies.brd_personnel.csv
NEON.D17.TEAK.DP1.10003.001.2019-07.expanded.20191205T150127Z.zip
NEON.D17.TEAK.DP1.10003.001.variables.20191205T150127Z.csv
NEON.D17.TEAK.DP0.10003.001.validation.20191205T150127Z.csv
NEON.D17.TEAK.DP1.10003.001.EML.20190701-20190701.20191205T150127Z.xml


While the .zip files are packages containing multiple bird count data tables, the remaining files mostly serve to provide context to the data. The *variables* CSV file in particular contains a dataset with information on the variables used in the count data tables. This provides useful information such as units and definitions for each variable.

In [18]:
#Get variables information as pandas dataframe
for file in data_json['data']['files']:
    if('variables' in file['name']):
        df_variables = pd.read_csv(file['url'])
        break #the variables file name is listed more than once; reading it once is enough

In [22]:
#View metadata and first few rows

print('Number of rows: ', df_variables.shape[0])
print('Number of columns: ',df_variables.shape[1])

print('Data Columns:\n')
print(df_variables.dtypes)

df_variables.head(5)

Number of rows:  64
Number of columns:  6
Data Columns:

table          object
fieldName      object
description    object
dataType       object
units          object
downloadPkg    object
dtype: object


| | table | fieldName | description | dataType | units | downloadPkg |
|---|---|---|---|---|---|---|
| 0 | brd_perpoint | uid | Unique ID within NEON database; an identifier ... | string | | basic |
| 1 | brd_perpoint | namedLocation | Name of the measurement location in the NEON d... | string | | basic |
| 2 | brd_perpoint | domainID | Unique identifier of the NEON domain | string | | basic |
| 3 | brd_perpoint | siteID | NEON site code | string | | basic |
| 4 | brd_perpoint | plotID | Plot identifier (NEON site code_XXX) | string | | basic |


The table includes columns indicating the data table and the download package in which each variable appears. We want to see information on the variables for the basic bird count table, since that is the table we downloaded. We can do this using comparisons and subsetting.

In [23]:
#Subset to view only variables in the basic countdata table
df_variables[(df_variables['table'] == 'brd_countdata')&(df_variables['downloadPkg'] == 'basic')]

| | table | fieldName | description | dataType | units | downloadPkg |
|---|---|---|---|---|---|---|
| 27 | brd_countdata | uid | Unique ID within NEON database; an identifier ... | string | | basic |
| 28 | brd_countdata | namedLocation | Name of the measurement location in the NEON d... | string | | basic |
| 29 | brd_countdata | domainID | Unique identifier of the NEON domain | string | | basic |
| 30 | brd_countdata | siteID | NEON site code | string | | basic |
| 31 | brd_countdata | plotID | Plot identifier (NEON site code_XXX) | string | | basic |
| 32 | brd_countdata | plotType | NEON plot type in which sampling occurred: tow... | string | | basic |
| 33 | brd_countdata | pointID | Identifier for a point location | string | | basic |
| 34 | brd_countdata | startDate | The start date-time or interval during which a... | dateTime | | basic |
| 35 | brd_countdata | eventID | An identifier for the set of information assoc... | string | | basic |
| 36 | brd_countdata | pointCountMinute | The minute of sampling within the point count ... | unsigned integer | | basic |
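
The subsetting expression works by building one True/False series per comparison and combining them element-wise with `&`. The sketch below spells out those steps on a tiny invented table; the rows, including the field name `exampleField`, are hypothetical and not taken from the real variables file.

```python
import pandas as pd

# Hypothetical three-row excerpt shaped like the variables table
df_vars = pd.DataFrame({
    'table': ['brd_perpoint', 'brd_countdata', 'brd_countdata'],
    'fieldName': ['uid', 'uid', 'exampleField'],
    'downloadPkg': ['basic', 'basic', 'expanded'],
})

# Each comparison yields a boolean Series with one value per row
is_count = df_vars['table'] == 'brd_countdata'
is_basic = df_vars['downloadPkg'] == 'basic'

# '&' keeps only rows where both conditions are True
subset = df_vars[is_count & is_basic]

print(is_count.tolist())
print(subset)
```

When the comparisons are written inline, each one needs its own parentheses, because `&` binds more tightly than `==` in Python.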


### Challenge
The pandas *concat* function takes multiple dataframes that have the same column names and attributes, but different rows, and combines the rows from all of the input dataframes into one output dataframe. Get basic bird count data for other months at Lower Teakettle, and combine the resulting dataframes into one with *concat*.
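
As a starting point, here is a minimal sketch of how *concat* behaves, using two small invented dataframes in place of real monthly downloads (the dates and taxon codes are placeholders). Fetching the actual monthly files is left as the exercise.

```python
import pandas as pd

# Hypothetical stand-ins for two months of basic count data
df_month_a = pd.DataFrame({'startDate': ['2018-06-19T13Z', '2018-06-19T13Z'],
                           'taxonID': ['AAAA', 'BBBB']})
df_month_b = pd.DataFrame({'startDate': ['2018-07-02T13Z', '2018-07-02T14Z'],
                           'taxonID': ['AAAA', 'CCCC']})

# Stack the rows; ignore_index=True renumbers the rows 0..n-1 in the result
combined = pd.concat([df_month_a, df_month_b], ignore_index=True)

print(combined)
```

Without `ignore_index=True`, each input keeps its own row labels, so the combined index would contain repeated values.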