## Introducing `dcpy` to the world...
One-stop-shop tool for data ingestion.

To showcase the capabilities of `dcpy`, we will work with the **City Owned and Leased Property (COLP)** dataset from NYC Open Data ([link](https://data.cityofnewyork.us/City-Government/City-Owned-and-Leased-Property-COLP-/fn4k-qyk2/about_data)).

First, we will show a common approach: extracting the data and doing basic cleaning with `pandas`. Then we will achieve the same result with `dcpy`.


### Ingesting data with `pandas`

In [None]:
# The code below is from the official Open Data portal API docs.
# link: https://dev.socrata.com/foundry/data.cityofnewyork.us/fn4k-qyk2

import pandas as pd
from sodapy import Socrata


client = Socrata("data.cityofnewyork.us", None)

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("fn4k-qyk2", limit=50000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

# show data
results_df.head()

In [None]:
# Explore the dataset
print(results_df.columns)
print(results_df.dtypes)
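
The request above is capped by `limit=50000`, so a dataset with more rows than the limit would come back truncated. As a rough sketch of one way to guard against that, the hypothetical helper `fetch_all` below keeps requesting pages until a short page comes back. It takes the page-fetching function as an argument, so it runs offline here against made-up rows; it is not part of `sodapy` or `dcpy`.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]


def fetch_all(
    get_page: Callable[[int, int], List[Record]], page_size: int = 50000
) -> List[Record]:
    """Collect every record by requesting pages until a short page comes back."""
    records: List[Record] = []
    offset = 0
    while True:
        page = get_page(page_size, offset)
        records.extend(page)
        # A page smaller than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            return records
        offset += page_size


# Stand-in for the API so the sketch runs offline: 7 made-up rows.
fake_rows = [{"bbl": str(i)} for i in range(7)]
all_rows = fetch_all(
    lambda limit, offset: fake_rows[offset:offset + limit], page_size=3
)
print(len(all_rows))
```

With the client from the first cell, the page function could be something like `lambda limit, offset: client.get("fn4k-qyk2", limit=limit, offset=offset)`. Passing an `order` argument as well is probably wise, so that pages do not shift between requests.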

### Cleaning with `pandas`
Suppose we need to clean up the dataset a bit: it has too many columns, and the data types are not correct.

We only need the `bbl`, `agency`, and `parcel_name` columns for our hypothetical analysis. We also need to remove duplicate records, if any exist.
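
The data type problem can be handled with a cast. As a minimal sketch, assuming the API delivers values as strings (worth confirming with `results_df.dtypes`), the toy frame below converts `bbl` to a nullable integer. The records are made up for illustration, and whether `bbl` should be numeric or stay as text depends on the analysis.

```python
import pandas as pd

# Toy records shaped like the API output: every value is a string.
# The values are made up for illustration, not taken from COLP.
toy_df = pd.DataFrame.from_records(
    [
        {"bbl": "1000010010", "agency": "PARKS", "parcel_name": "EXAMPLE PARK"},
        {"bbl": "not-a-number", "agency": "DOE", "parcel_name": "EXAMPLE SCHOOL"},
    ]
)

# errors="coerce" turns unparseable values into missing values instead of
# raising; the nullable Int64 dtype keeps the column integer-typed anyway.
toy_df["bbl"] = pd.to_numeric(toy_df["bbl"], errors="coerce").astype("Int64")

print(toy_df.dtypes)
```

The same two calls could be applied to `results_df["bbl"]` before or after the column filter.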

In [None]:
# Filter columns
results_df = results_df[["bbl", "agency", "parcel_name"]]

# Remove duplicates
results_df = results_df.drop_duplicates()

results_df.head()

In [None]:
# Save the file locally in Parquet format
results_df.to_parquet("./dcp_colp.parquet")

### Ingesting data with `dcpy`

First, we need to create a YAML config file for the **COLP** dataset in the `templates/` directory, as follows:

```yaml
# templates/dcp_colp.yml

id: dcp_colp

attributes:
  name: City Owned and Leased Property (COLP)

ingestion:
  source:
    type: socrata
    org: nyc
    uid: fn4k-qyk2
    format: geojson
  file_format:
    type: geojson

  processing_steps:
  - name: filter_columns
    args:
      columns: [ "bbl", "agency", "parcel_name" ]
  - name: deduplicate
```
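
To relate the template to the `pandas` cells earlier, here is a rough sketch of what the two `processing_steps` amount to when written by hand. The `STEP_FUNCTIONS` mapping and `apply_steps` helper are hypothetical names invented for this illustration; they are not `dcpy`'s implementation, and the toy data is made up.

```python
import pandas as pd

# The processing_steps list from the template, written as plain Python data.
processing_steps = [
    {"name": "filter_columns", "args": {"columns": ["bbl", "agency", "parcel_name"]}},
    {"name": "deduplicate"},
]

# Hypothetical pandas equivalents, keyed by step name (not dcpy's own code).
STEP_FUNCTIONS = {
    "filter_columns": lambda df, columns: df[columns],
    "deduplicate": lambda df: df.drop_duplicates(),
}


def apply_steps(df: pd.DataFrame, steps: list) -> pd.DataFrame:
    """Run each named step in order, passing along its args if it has any."""
    for step in steps:
        df = STEP_FUNCTIONS[step["name"]](df, **step.get("args", {}))
    return df


# Made-up rows: the first two become identical once "extra" is dropped.
toy_df = pd.DataFrame(
    {
        "bbl": ["1", "1", "2"],
        "agency": ["A", "A", "B"],
        "parcel_name": ["X", "X", "Y"],
        "extra": ["p", "q", "r"],
    }
)
cleaned = apply_steps(toy_df, processing_steps)
print(cleaned)
```

The order of the steps matters here: deduplicating after the column filter removes rows that differ only in the dropped columns.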


Now we are ready to extract, clean, and save the data locally with `dcpy`:

In [None]:
%%bash
# run CLI command
python3 -m dcpy.cli lifecycle ingest dcp_colp --template-dir ./templates