# Data Management and Reproducibility


### Reproducibility in Jupyter Notebooks  

Notebooks are a great tool for data exploration, plotting, and prototyping because they work as a REPL.

REPL - Read-Eval-Print Loop (each cell is read and evaluated, and its output is printed inline)

However, notebooks can be problematic: while exploring the data, you can run cells out of order or delete cells that later cells depend on, so the results on screen no longer match what the code produces when run from top to bottom. This is one reason professional software developers prefer to write and test entire scripts rather than analyze line by line.

In [None]:
# Toy example of cell execution order problems

x=3
print(x)

In [None]:
y=2*x
print(y)

In [None]:
x=2  # if the previous cell is re-run after this one, y becomes 4 instead of 6

One way to ensure reproducibility is to open the `Kernel` menu and click `Restart and Run All`, which confirms that your notebook is **linearized**: it runs properly when all the cells are executed in order from a fresh kernel.
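
The hidden-state problem in the toy example above can be made concrete with a small sketch. The `run_cells` helper below is hypothetical (it is not part of Jupyter); it simply executes the three toy cells against one shared namespace, the way a kernel does, in whatever order it is given.

```python
# The three toy cells from above, keyed by their position in the notebook.
CELLS = {1: "x = 3", 2: "y = 2 * x", 3: "x = 2"}

def run_cells(order):
    """Execute the toy cells in the given order against one shared namespace."""
    namespace = {}  # plays the role of the kernel's memory
    for cell_number in order:
        exec(CELLS[cell_number], namespace)
    return namespace["x"], namespace["y"]

top_to_bottom = run_cells([1, 2, 3])  # what a clean restart and full run does
out_of_order = run_cells([1, 3, 2])   # cell 3 was run before cell 2

print(top_to_bottom)  # (2, 6)
print(out_of_order)   # (2, 4)
```

Both runs end with the same `x`, but `y` differs, and nothing in the notebook itself records which order was used.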

Another option is to develop reproducible scripts in an IDE or code editor, and to group scripts and notebooks together in projects. Examples of IDEs and editors:

 - Atom
 - Sublime Text
 - Vim/nano/Emacs/Notepad++
 - PyCharm
 - RStudio
 - JupyterLab

Advantages of IDEs include syntax highlighting, code completion, linting, and integrations with Git, GitHub, and other tools.
 
 

### Loading data

* From files (csv, txt, etc.) 
    - Our example
* REST API (REpresentational State Transfer)
    - Example: twitteR, NEON
* wget/cURL
    - Example: DataDryad, Retriever, NASA SEDAC
* From a database (mention only, no worked example)
    - Example: GDELT/BLAST

#### NEON data: manual browsing

http://data.neonscience.org/browse-data?showAllDates=true&showAllSites=true&showTheme=org

### Code Structure 

Break down the data manipulation and analysis into discrete steps 

Write a function for each step, and present them in order 

Reuse code when possible 
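
As a minimal sketch of this structure, the example below splits a tiny analysis into three functions and calls them in order. The function names and the sample readings are made up for illustration; they are not part of the NEON workflow.

```python
from statistics import median

def parse_readings(lines):
    """Step 1: convert raw text lines to floats, using None for blanks."""
    return [float(line) if line.strip() else None for line in lines]

def drop_missing(values):
    """Step 2: remove missing readings."""
    return [v for v in values if v is not None]

def summarize(values):
    """Step 3: reduce the cleaned readings to a single metric."""
    return median(values)

raw_lines = ["81.5", "", "79.0", "83.2"]  # hypothetical sample input
result = summarize(drop_missing(parse_readings(raw_lines)))
print(result)  # 81.5
```

Because each step is its own function, it can be tested alone and reused on the next file.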

In [None]:
# what if I want to download 100s or 1000s of files? discussion on REST API

import requests

# redo the API call before each download so the download link is authorized and up to date
base_url = 'http://data.neonscience.org/api/v0'
endpoint = 'data'
product_code = 'DP1.00098.001'  # relative humidity
site_code = 'ABBY'
year_month = '2016-07'
package = '?package=basic'

# join the path segments with '/', then append the query string (no '/' before the '?')
api_call = '/'.join([base_url, endpoint, product_code, site_code, year_month]) + package
print(api_call)
r = requests.get(api_call)
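
To scale this up to hundreds of files, one option is to wrap the URL construction in a function and loop over the months of interest. This is a sketch only: `build_api_call` is a hypothetical helper, it assumes the same URL layout as the single call above, and it makes no network requests.

```python
BASE_URL = 'http://data.neonscience.org/api/v0'

def build_api_call(product_code, site_code, year_month, package='basic'):
    """Return the data-endpoint URL for one product, site, and month."""
    path = '/'.join([BASE_URL, 'data', product_code, site_code, year_month])
    return path + '?package=' + package

# a short stand-in list of months
months = ['2016-04', '2016-05', '2016-06']
api_calls = [build_api_call('DP1.00098.001', 'ABBY', m) for m in months]

for call in api_calls:
    print(call)
```

Each URL would then be passed to `requests.get` right before its download, since the links in the response need to be fresh.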

In [None]:
#import pickle

#with open('json_response.pickle','wb') as handle:
#    pickle.dump(r, handle)

In [None]:
# use this cell if the response times out
import pickle

with open('json_response.pickle','rb') as handle:
    r=pickle.load(handle)

In [None]:
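response_json = r.json()

# download link for one of the files in the response; the cells below use `url`
url = response_json['data']['files'][1]['url']
response_json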

In [None]:
## all available time periods 
year_months = [
          "2016-04",
          "2016-05",
          "2016-06",
          "2016-07",
          "2016-08",
          "2016-09",
          "2016-10",
          "2016-11",
          "2016-12",
          "2017-01",
          "2017-02",
          "2017-03",
          "2017-04",
          "2017-05",
          "2017-06",
          "2017-12",
          "2018-01",
          "2018-02",
          "2018-03",
          "2018-04",
          "2018-05",
          "2018-06",
          "2018-07",
          "2018-08",
          "2018-09"]

#### wget example

In [None]:
url.split('?')[0]

In [None]:
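import subprocess
# pass the arguments as a list so the URL is never interpreted by the shell
subprocess.run(['wget', url.split('?')[0]], check=True)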

In [None]:
import urllib.request
import shutil

file_name='ABBY_rel_humid_2016-07_RAW.csv' # distinguish raw data 

# Download the file from `url` and save it locally under `file_name`:
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
    shutil.copyfileobj(response, out_file)


### Preprocessing input data 

Need to deal with:
- null values (missing data). Options:
    1) collect more data
    2) imputation
    3) subsetting
- data types: categorical vs. ordinal; str vs. int vs. boolean
- sampling bias: how do we know our data is representative of the underlying system? One check is whether repeated sampling gives the same distribution (this is the essence of sample size analysis). The answer also depends on the definition of "same" and how you measure it (for example, are you assuming a Gaussian distribution?)
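
The imputation and subsetting options listed above can be sketched with pandas on a toy table. The values here are invented for illustration, and filling with the median is only one of many possible imputation strategies.

```python
import numpy as np
import pandas as pd

# toy data with one missing humidity reading
toy = pd.DataFrame({
    'RHMean': [80.0, np.nan, 90.0, 70.0],
    'site': ['ABBY', 'ABBY', 'ABBY', 'ABBY'],
})

# Option 2, imputation: replace missing values with the column median
imputed = toy.copy()
imputed['RHMean'] = imputed['RHMean'].fillna(imputed['RHMean'].median())

# Option 3, subsetting: keep only the rows that have a reading
subset = toy.dropna(subset=['RHMean'])

# data types: store a repeated label as a categorical instead of a plain string
toy['site'] = toy['site'].astype('category')

print(imputed['RHMean'].tolist())
print(len(subset))
print(toy.dtypes)
```

Imputation keeps every row but invents a value; subsetting keeps only observed values but shrinks the sample. Which trade-off is acceptable depends on the analysis.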

In [None]:
import pandas as pd
df = pd.read_csv('ABBY_rel_humid_2016-07_RAW.csv')
print(df.shape)
df.head()

In [None]:
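from IPython.display import display  # explicit import, so this also works outside a notebook cell

def display_all(df):
    # temporarily raise the pandas display limits so large frames are shown in full
    with pd.option_context("display.max_rows", 1000, "display.max_columns", 1000):
        display(df)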

In [None]:
print(df.shape)
display_all(df.T)

### Evaluating data quality 

completeness: 
- fraction of missing values 

consistency:
- unique values for each category 
- numbers represented with the same data type 
- draw 2 sets of random samples from data and compare the distributions 

representativeness/accuracy: 
- compare data from different time periods or different sources 
- calibration with 2nd data source (ground truthing) 
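
A few of these checks can be sketched on synthetic data. Everything below is illustrative: the values are randomly generated, the column names are placeholders, and comparing the medians of two random samples is a deliberately simple stand-in for a proper distribution comparison.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

# synthetic readings with 20% of the values removed
values = rng.normal(loc=75.0, scale=5.0, size=1000)
values[:200] = np.nan
data = pd.DataFrame({'RHMean': values, 'site': ['ABBY'] * 1000})

# completeness: fraction of missing values per column
missing_fraction = data.isnull().mean()

# consistency: unique values for each category
unique_sites = data['site'].unique()

# consistency: draw two random samples and compare their distributions
valid = data['RHMean'].dropna()
sample_a = valid.sample(n=300, random_state=1)
sample_b = valid.sample(n=300, random_state=2)
median_gap = abs(sample_a.median() - sample_b.median())

print(missing_fraction)
print(unique_sites)
print(median_gap)
```

A small gap between the two samples suggests the data is internally consistent; how small is "small enough" is a judgment that depends on the metric and the question.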


In [None]:
display_all(df.describe().T)

In [None]:
df.isnull().sum()

In [None]:
df.isnull().sum()['RHMean']/len(df) # wow, 20% null values 

In [None]:
from matplotlib import pyplot as plt
%matplotlib inline

idxs_null=df[df['RHMaximum'].isnull()].index
plt.hist(idxs_null) # what does this tell us? 

In [None]:
df.loc[idxs_null, 'startDateTime']  # timestamps of the rows where RHMaximum is missing

In [None]:
df=df.iloc[:35572]
df.isnull().sum()

In [None]:
df.to_csv('rel_humid_ABBY_2016-07.csv') # save intermediate output (name matches the 2016-07 raw file)

In [None]:
df['RHMean'].median()  #metric we want to aggregate for all time periods 

In [None]:
medians=[] #initialize an empty list

medians.append({year_month: df['RHMean'].median()}) # add the median for this file to the list 

In [None]:
import seaborn as sns 

sns.pairplot(df[['RHMean','tempRHMean','dewTempMean']])

In [None]:
import numpy as np

def get_sample(df, n):
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

In [None]:
g = sns.PairGrid(get_sample(df[['RHMean','tempRHMean','dewTempMean']],1000))
g = g.map_upper(plt.scatter)
g = g.map_lower(sns.kdeplot, cmap="Blues_d")
g = g.map_diag(sns.kdeplot, lw=3, legend=False)

In [None]:
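# save through the PairGrid itself: with inline plotting, the figure from the previous cell is no longer the current figure
g.savefig('RH_pairplot.png', bbox_inches='tight')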

###  Task separation and  dependency management  

When scripting a data pipeline, it's helpful to break the analysis into separate tasks and identify the dependencies of each task.


Linear progression of tasks: each task has only one dependency, the task before it.

If tasks have multiple dependencies, managing everything manually gets messy as the number of tasks increases.
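
When tasks have several dependencies, a valid run order can be computed instead of tracked by hand. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# each task maps to the set of tasks that must finish before it
dependencies = {
    'download': set(),
    'clean': {'download'},
    'summarize': {'clean'},
    'plot': {'clean'},
    'report': {'summarize', 'plot'},
}

# static_order() yields every task after all of its dependencies
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Tasks with no dependency between them, such as `summarize` and `plot` here, may come out in either order.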


 
### Execute each processing step, then save intermediate output

Use a `/tmp` directory for intermediate files you don't need to keep, and delete them periodically.

You can use the presence or absence of these files as a monitoring tool: they show which parts of the pipeline have completed.

E.g., if the intermediate .csv file exists, then we know that processing step happened.
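
One way to use intermediate files as checkpoints is to skip a step whenever its output already exists. This is a minimal sketch: `run_step` and the file name are hypothetical, and a temporary directory stands in for the project's scratch folder.

```python
import tempfile
from pathlib import Path

def run_step(output_path, compute):
    """Run `compute` and save its text output, unless the file already exists."""
    if output_path.exists():
        return 'skipped'  # the checkpoint file shows this step already ran
    output_path.write_text(compute())
    return 'ran'

scratch = Path(tempfile.mkdtemp())    # stand-in for a tmp/ folder
checkpoint = scratch / 'cleaned.csv'  # hypothetical intermediate output

first = run_step(checkpoint, lambda: 'RHMean\n80.0\n')
second = run_step(checkpoint, lambda: 'RHMean\n80.0\n')
print(first, second)  # ran skipped
```

Note that a file's presence shows the step was started, not that it finished correctly; a half-written file would pass this check too.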
 


## Serialization

The `df.to_csv()` method saves a DataFrame as CSV.

But what if I want to save my model object, or any other Python object?

In [None]:
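import os
import pickle

os.makedirs('tmp', exist_ok=True)  # the output folder must exist before writing

## faster, snapshot of memory. new feature in pandas 0.20; DataFrames only, and requires pyarrow
df.to_feather('tmp/forest-cover')

## slower, older. pickle is common in the python ecosystem.
df.to_pickle('tmp/forest-cover-pickle.p')

# pickle works for most Python objects, not just DataFrames

with open('RH_graph_object.pickle','wb') as handle:
    pickle.dump(g, handle)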
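
Saving is only half of serialization; the object also has to load back. The sketch below round-trips a small dictionary with `pickle`. The dictionary and file name are placeholders, and the same pattern applies to other picklable objects, such as a fitted model.

```python
import pickle
import tempfile
from pathlib import Path

# placeholder object; other picklable Python objects work the same way
results = {'site': 'ABBY', 'medians': [80.0, 75.5]}

path = Path(tempfile.mkdtemp()) / 'results.pickle'

with open(path, 'wb') as handle:
    pickle.dump(results, handle)

with open(path, 'rb') as handle:
    restored = pickle.load(handle)

print(restored == results)  # True
```

Only unpickle files you trust, because loading a pickle can execute arbitrary code.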

### Let's turn this data exploration into reproducible code 

### Should I run the script on my machine or on a cluster?

#### Memory limitations:

RAM - what is it, and why is it important?

Can the data fit in memory?

HD space - can the data fit on disk?

If not, use cloud storage or another external store, but beware of I/O speed limitations.
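
When a file is too large for RAM, one option is to process it in pieces. The sketch below uses the `chunksize` argument of `pandas.read_csv` to accumulate a mean without loading the whole file at once; the CSV is a small generated stand-in, and the chunk size is arbitrary.

```python
import tempfile
from pathlib import Path

import pandas as pd

# write a small stand-in for a file that would not fit in memory
csv_path = Path(tempfile.mkdtemp()) / 'big.csv'
pd.DataFrame({'RHMean': range(1, 101)}).to_csv(csv_path, index=False)

total = 0.0
count = 0
# with chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(csv_path, chunksize=25):
    total += chunk['RHMean'].sum()
    count += chunk['RHMean'].count()  # count() ignores missing values

mean_rh = total / count
print(mean_rh)  # 50.5
```

A mean can be accumulated chunk by chunk like this; an exact median cannot, so working out of memory sometimes means changing the metric or the algorithm.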

#### Speed limitations: the data and processing fit in memory, but throughput is limited

Parallelization - multicore or cluster computing

Modern options for cluster computing - institutional clusters, AWS, Google Cloud, Microsoft Azure
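
On a single machine, the standard library's `concurrent.futures` module is one way to run independent tasks side by side. The sketch below uses a thread pool, which suits I/O-bound work such as downloads; `fetch` is a hypothetical placeholder that sleeps instead of touching the network.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(year_month):
    """Placeholder for a slow, I/O-bound task such as downloading one file."""
    time.sleep(0.1)
    return year_month + '.csv'

months = ['2016-04', '2016-05', '2016-06', '2016-07']

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() returns results in the same order as the inputs
    files = list(pool.map(fetch, months))
elapsed = time.perf_counter() - start

print(files)
print(round(elapsed, 2))
```

For CPU-bound work, `ProcessPoolExecutor` offers the same interface and spreads the tasks across multiple cores.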



### Directory structure for a project - organizational suggestions

`/data/raw/` - immutable. never change these files 

`/code` or `/scripts` 

`/tmp` - temporary folder i.e. "scratch paper" 

`/data/clean/` - post-processed data 

`/figures`  

Git workflow - always commit the code and the outputs. If the inputs are large, make sure you have a system for handling large data. Do not commit temporary data (list it in `.gitignore`).
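
The layout above can also be created from Python with `pathlib`. In this sketch the project root is a temporary directory and the folder names follow the list above (choosing `code` over `scripts`); adapt both to your own project.

```python
import tempfile
from pathlib import Path

project_root = Path(tempfile.mkdtemp())  # stand-in for your project folder

folders = ['data/raw', 'data/clean', 'code', 'tmp', 'figures']
for folder in folders:
    # parents=True creates data/ as needed; exist_ok=True makes reruns safe
    (project_root / folder).mkdir(parents=True, exist_ok=True)

# keep scratch files out of version control
(project_root / '.gitignore').write_text('tmp/\n')

created = sorted(p.relative_to(project_root).as_posix()
                 for p in project_root.rglob('*') if p.is_dir())
print(created)
```

Creating the folders in code means a collaborator can rebuild the same layout with one command.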


#### Create folders in terminal