# Trying out the `src.data.download` module.

This notebook shows how the new module might help us download source data.

## 0. Set up.

In [1]:
import pandas as pd
from src.data import download

In [2]:
# Enable editing of the package with instant feedback.
%load_ext autoreload
%autoreload 2

## 1. Download some data.

Try downloading `nfirs.csv` for the first time. Some notes:

- `gdown` provides some useful download status updates.
- The function automatically creates `Data/raw/` if it doesn't exist.

In [3]:
# Download nfirs.csv for the first time. Takes a few seconds.
fname = download.fetch("nfirs.csv")
fname

Downloading file 'nfirs.csv' from 'https://drive.google.com/uc?id=1ENJZwazX7hJ4GwI03DKgX51y-644x-cZ' to '/home/nathan/projects/rcp2/Data/raw'.
Downloading...
From: https://drive.google.com/uc?id=1ENJZwazX7hJ4GwI03DKgX51y-644x-cZ
To: /home/nathan/projects/rcp2/Data/raw/tmpj0pc_tdg
273MB [00:14, 18.7MB/s] 


'/home/nathan/projects/rcp2/Data/raw/nfirs.csv'

Try downloading it a second time. The function recognizes that the file already exists and skips the download.

In [4]:
# Skip the download this time.
download.fetch("nfirs.csv")

'/home/nathan/projects/rcp2/Data/raw/nfirs.csv'
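
The skip-on-second-call behavior could be implemented roughly as follows. This is a minimal sketch, not the actual `src.data.download` code: `fetch_if_missing` and its `downloader` callback are hypothetical names, and the temporary-file step is only inferred from the `tmpj0pc_tdg` path in the first download log.

```python
import os
import tempfile
from pathlib import Path


def fetch_if_missing(name, directory, downloader):
    """Return the local path for `name`, downloading it only if it is absent.

    `downloader` is a hypothetical callable that writes the file's contents
    to the path it is given.
    """
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder on demand
    target = directory / name
    if target.exists():
        return str(target)  # already downloaded: skip the network entirely

    # Download to a temporary file first so a failed transfer never leaves
    # a partial file under the final name.
    fd, tmp = tempfile.mkstemp(dir=directory)
    os.close(fd)
    try:
        downloader(tmp)
        os.replace(tmp, target)
    finally:
        if os.path.exists(tmp):
            os.remove(tmp)
    return str(target)
```

Calling it twice with the same `name` invokes `downloader` only once; the second call returns the existing path immediately.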

In [5]:
# The file looks ok.
pd.read_csv(fname, encoding="latin-1").info(verbose=False)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1950641 entries, 0 to 1950640
Columns: 29 entries, id to longitude
dtypes: float64(3), int64(8), object(18)
memory usage: 431.6+ MB


## 2. Inspect available source data.

You can register new raw data sources by adding them to `src.data.download.SOURCES`. Each entry has:

- A downloader with special download steps (optional).
- A SHA256 hash to verify download integrity.
- A URL for the source data.

In [6]:
download.SOURCES

{'nfirs.csv': {'downloader': 'download_from_google_drive',
  'sha256': '0fcd2c4edae304dbb21c1b0dc6ca9afd17d7d65f21e51cd26571f9d42db7f825',
  'url': 'https://drive.google.com/uc?id=1ENJZwazX7hJ4GwI03DKgX51y-644x-cZ'}}
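
Registering a new file requires its SHA256 digest. Here is a minimal sketch of computing one with the standard library and checking a download against a `SOURCES`-style entry. The helper names `sha256_of` and `matches_registry` are illustrative assumptions, not functions from `src.data.download`.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large files never sit fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_registry(path, entry):
    """Check a file against the `sha256` field of a SOURCES-style entry."""
    return sha256_of(path) == entry["sha256"]
```

For example, `matches_registry(fname, download.SOURCES["nfirs.csv"])` should return `True` for an intact download, assuming the module verifies the same digest over the raw file bytes.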