In [None]:
# Disable the IPython pager
# https://gist.github.com/minrk/7715212
from IPython.core import page
def my_print(s):
    try:
        print(s["text/plain"])
    except (KeyError, TypeError):
        print(s)
page.page = my_print

# 02 - Data Sourcing
This notebook demonstrates the use of [Intake](https://intake.readthedocs.io/) to create data Catalogs for better-organized and reusable data sourcing.

In [None]:
import pandas as pd
import intake

Instead of ad-hoc Python code, we're going to define all the details (location, file format, parsing specifics, etc.) in a Catalog file, which can then be shared with others to simplify and automate the loading process.

An Intake Catalog is a text file in [YAML format](https://yaml.org/).

Let's start with a simple example for the "housing" dataset

In [None]:
%pycat ../data/catalog.yml

We can use `intake.open_catalog()` to load the catalog into Python

In [None]:
catalog = intake.open_catalog("../data/catalog.yml")

List the available data sources and inspect their properties

In [None]:
list(catalog)

In [None]:
catalog.housing  # also catalog["housing"] 

Load the data using `read()` to get a pandas DataFrame directly

In [None]:
housing = catalog.housing.read()

In [None]:
housing

There are still a few improvements that we can take care of:

1. The "median_house_value" column is of type "object" (string) because "," is used as a thousands separator

In [None]:
housing["median_house_value"].head(5)

In [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html), there's an option for that called `thousands`

In [None]:
pd.read_csv?

We can add it to the catalog definition under `csv_kwargs`
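
Before editing the catalog, it can help to try the option directly in pandas on a small in-memory sample. The sketch below is only an illustration: the two rows and the `households` column values are made up, and it calls `pandas.read_csv()` directly rather than going through Intake.

```python
import io

import pandas as pd

# Hypothetical two-row sample; the values are made up for illustration.
sample = 'median_house_value,households\n"452,600",126\n"358,500",1138\n'

# Without `thousands`, the quoted "452,600" strings stay as text.
raw = pd.read_csv(io.StringIO(sample))

# With thousands=",", pandas strips the separators and parses integers.
parsed = pd.read_csv(io.StringIO(sample), thousands=",")

print(raw["median_house_value"].dtype)
print(parsed["median_house_value"].dtype)
print(parsed["median_house_value"].tolist())
```

Whatever keyword works here should carry over to the catalog, since `csv_kwargs` is passed on to `pandas.read_csv()`.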

...

After updating the catalog, we can reload it and test the data again

In [None]:
intake.open_catalog("../data/catalog.yml").housing.read()

2. The "near_ocean" column has a mixture of Yes/yes/No/no. We'd like 0/1 instead

This can also be done via `csv_kwargs`. You can use all options from [`pandas.read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html)
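
One possible way to do this, shown here as an assumption rather than as the exact options used in this notebook's catalog, is the `true_values` / `false_values` pair of `pandas.read_csv()`. The sample data below is made up. Note that these options produce booleans (`True`/`False`), which behave like 1/0 in arithmetic and can be cast to integers after loading if literal 0/1 values are needed.

```python
import io

import pandas as pd

# Hypothetical sample with the mixed spellings described above.
sample = "near_ocean\nYes\nyes\nNo\nno\n"

# Every value in the column matches one of the listed strings,
# so pandas parses the column as boolean.
flags = pd.read_csv(
    io.StringIO(sample),
    true_values=["Yes", "yes"],
    false_values=["No", "no"],
)

# Optional post-processing step: cast True/False to 1/0.
as_int = flags["near_ocean"].astype(int)

print(flags["near_ocean"].dtype)
print(as_int.tolist())
```

Both options take plain lists of strings, so they can be written in a YAML file without any Python code.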

...

In [None]:
intake.open_catalog("../data/catalog.yml").housing.read()

3. Finally, the "housing_median_age" column contains some "None" values, which should be parsed as `NaN`
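
A minimal sketch of how this could look in plain pandas, assuming the `na_values` option of `pandas.read_csv()` is the one used (the sample rows are made up). Some pandas versions may already treat the string "None" as missing by default; listing it explicitly keeps the behaviour the same either way.

```python
import io

import pandas as pd

# Hypothetical sample; the second row has the literal string "None".
sample = "housing_median_age,total_rooms\n41,880\nNone,7099\n52,1467\n"

# Treat the string "None" as a missing value when parsing.
ages = pd.read_csv(io.StringIO(sample), na_values=["None"])

# The column becomes floating point so that it can hold NaN.
print(ages["housing_median_age"].dtype)
print(ages["housing_median_age"].isna().tolist())
```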

...

In [None]:
intake.open_catalog("../data/catalog.yml").housing.read()