# Download and process static data
This notebook downloads static data sources, loads them into memory, cleans the data, and saves a cleaned data product that can be used later for analysis.


In [None]:
import os
import numpy as np
import pandas as pd
from datetime import datetime

# Download files from online host
Including this step in the pipeline makes the data processing steps reproducible.
It also spares you the severe headache of trying to share one particular cleaned data product.

In [None]:
%%bash
DATA_DIR="../data/camels/"
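# Start from an empty directory so stale files never mix with fresh downloads.
if [ -d "$DATA_DIR" ]; then rm -Rf "$DATA_DIR"; fi
# -p also creates ../data if it does not exist yet.
mkdir -p "$DATA_DIR"
filenames=(camels_clim.txt camels_geol.txt camels_hydro.txt camels_name.txt camels_soil.txt camels_topo.txt camels_vege.txt)
for filename in "${filenames[@]}"
do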
    wget -O "${DATA_DIR}${filename}" "https://gdex.ucar.edu/dataset/camels/file/${filename}"
done
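
A quick check after the download can catch silent failures, since `wget -O` can leave an empty file behind when a request fails. The following is a minimal sketch and not part of the original pipeline; the helper name `missing_or_empty` is hypothetical.

```python
from pathlib import Path


def missing_or_empty(data_dir, filenames):
    """Return the expected file names that are absent or have zero bytes."""
    data_dir = Path(data_dir)
    problems = []
    for name in filenames:
        path = data_dir / name
        # A failed download shows up as a missing file or a zero-byte file.
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(name)
    return problems
```

Calling `missing_or_empty("../data/camels/", [...])` with the seven file names from the download cell should return an empty list if every file arrived with some content.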

# Share the processing code, rather than the processed data
This isn't always feasible as your datasets grow larger. But maintaining clean code that runs from source to product is essential for reproducibility.

In [None]:
filenames=["camels_clim.txt",
           "camels_geol.txt",
           "camels_hydro.txt",
           "camels_name.txt",
           "camels_soil.txt",
           "camels_topo.txt",
           "camels_vege.txt"]
# Read each attribute table, indexed by gauge_id.
dfs = {}
for filename in filenames:
    dfs[filename] = pd.read_csv(f"../data/camels/{filename}", sep=";", index_col="gauge_id")
# Join the tables side by side; rows are aligned on the shared gauge_id index.
df = pd.concat([dfs[filename] for filename in filenames], axis=1)

# What to do with text data and NaNs
When we bring all the data into a single dataframe, we see that some columns hold text and some values are NaN. For the example we will be working with, these data will not help us. We need complete data (no NaNs), and we need continuous real numbers (bounded is okay).

In [None]:
df.head()
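
Beyond eyeballing `df.head()`, it can help to list exactly which columns contain NaNs and which are non-numeric. The sketch below does this on a small toy frame; the column names and values are made up for illustration and are not taken from the CAMELS files.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged attributes table (values are made up).
toy = pd.DataFrame(
    {
        "elevation": [250.3, 92.7, 143.8],
        "land_cover": ["forest", "cropland", "forest"],
        "soil_depth": [1.2, np.nan, 0.9],
    },
    index=pd.Index([1, 2, 3], name="gauge_id"),
)

# Number of NaNs in each column, then the names of columns with at least one.
nan_counts = toy.isna().sum()
columns_with_nans = nan_counts[nan_counts > 0].index.tolist()

# Columns whose dtype is not numeric (text columns in this toy frame).
non_numeric_columns = toy.select_dtypes(exclude="number").columns.tolist()

print(columns_with_nans)
print(non_numeric_columns)
```

The same three statements can be run against `df` in place of `toy`.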

# Clean the data for our hypothetical analysis
Sometimes NaN values are perfectly fine, and perfectly natural, but for this example we'll remove them by dropping every column that contains at least one NaN.

In [None]:
df = df.dropna(axis=1)
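
Note that `dropna(axis=1)` removes whole columns, not rows. A small toy example (made-up values) shows the difference between the two axes:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, np.nan, 6.0]})

# axis=1 drops every column containing a NaN: column "b" goes, all rows stay.
dropped_columns = toy.dropna(axis=1)

# axis=0 drops every row containing a NaN: the middle row goes, both columns stay.
dropped_rows = toy.dropna(axis=0)

print(dropped_columns.shape)
print(dropped_rows.shape)
```

Dropping columns keeps every gauge but loses attributes; dropping rows would keep every attribute but lose gauges. This notebook takes the first route.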

# Clean the data for our hypothetical analysis
Sometimes text columns (categorical data that would typically be one-hot encoded) are perfectly fine, and they are perfectly natural, but for this example, we'll remove them as well.

In [None]:
# Collect the columns whose first value is a string, then drop them.
drop_these_columns = []
for camels_data_column in df.columns:
    if isinstance(df[camels_data_column].iloc[0], str):
        drop_these_columns.append(camels_data_column)
df = df.drop(columns=drop_these_columns)
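
As an alternative to the loop above, pandas can select columns by dtype directly. This is a sketch of a different technique, not the notebook's method, shown on a toy frame with made-up columns. It is not strictly equivalent: it decides by column dtype rather than by inspecting the first value, and it would also drop boolean columns.

```python
import pandas as pd

toy = pd.DataFrame(
    {
        "area": [10.5, 20.1],
        "n_sites": [1, 2],
        "rock_type": ["granite", "basalt"],
    }
)

# Keep only integer and float columns; text columns are left out.
numeric_only = toy.select_dtypes(include="number")

print(list(numeric_only.columns))
```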

# Save file with unique name
You might end up sharing this file around. It might get copied many times, and then something could change in the cleaning procedure, or new data might be added. To avoid getting tripped up by code that crashes on a different file version, name the file something really annoying and specific.

In [None]:
nowstring = datetime.today().strftime("%d-%m-%Y_%H%M")
creator_initials = "jf"
df.to_csv(f"../data/camels/camels_attributes_cleaned_{nowstring}_{creator_initials}_DO_NOT_COPY.csv")
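
One way to gain confidence in the saved file is to read it back and compare it with what was written. The sketch below wraps the naming pattern from the cell above in a hypothetical helper, `cleaned_filename`, and round-trips a toy frame (made-up values) through a temporary directory rather than `../data/camels/`.

```python
import tempfile
from datetime import datetime
from pathlib import Path

import pandas as pd


def cleaned_filename(creator_initials, now=None):
    """Build a timestamped file name in the same pattern as the cell above."""
    now = now or datetime.today()
    nowstring = now.strftime("%d-%m-%Y_%H%M")
    return f"camels_attributes_cleaned_{nowstring}_{creator_initials}_DO_NOT_COPY.csv"


# Toy frame standing in for the cleaned attributes table.
toy = pd.DataFrame({"a": [1.5, 2.25]}, index=pd.Index([101, 102], name="gauge_id"))

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / cleaned_filename("jf")
    toy.to_csv(path)
    # Restore gauge_id as the index when reading the file back.
    reloaded = pd.read_csv(path, index_col="gauge_id")

# Raises an AssertionError if the reloaded frame differs from the original.
pd.testing.assert_frame_equal(toy, reloaded)
```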