## Introducing `dcpy` to the world...
One-stop-shop tool for data ingestion.

### Introduction
To showcase the capabilities of `dcpy`, we will work with the **City Owned and Leased Property (COLP)** dataset from NYC Open Data ([link](https://data.cityofnewyork.us/City-Government/City-Owned-and-Leased-Property-COLP-/fn4k-qyk2/about_data)).

First, we will show a common approach to extracting the data and doing basic cleaning with `pandas`. Then we will achieve the same result with the `dcpy` tool.


### Ingesting data

In [None]:
%%bash

# Install sodapy to use Open Data API
pip install sodapy


In [None]:
# The code below is adapted from the official Open Data portal API docs.
# link: https://dev.socrata.com/foundry/data.cityofnewyork.us/fn4k-qyk2

import pandas as pd
from sodapy import Socrata


client = Socrata("data.cityofnewyork.us", None)

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("fn4k-qyk2", limit=50000)

# Convert to pandas DataFrame and save in parquet file
results_df = pd.DataFrame.from_records(results)
results_df.to_parquet("./dcp_colp.parquet")

# show data
results_df.head()



| Unnamed: 0 | borough | tax_block | tax_lot | bbl | billbbl | cd | parcel_name | agency | use_code | use_type | ... | :@computed_region_yeji_bk3q | :@computed_region_sbqj_enih | :@computed_region_92fq_4b7q | :@computed_region_f5dn_yrer | street_name | address | final_commitment | agreement | house_number | leased_properties |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 7071 | 200 | 3070710200.0 | 3070710200.0 | 313 | FORD AMPHITHEATER | DCAS | 1500 | NO USE | ... | 2.0 | 35.0 | 45.0 | 21.0 |  |  |  |  |  |  |
| 1 | 3 | 2134 | 126 | 3021340126.0 | 3021340126.0 | 301 |  | DSBS | 1500 | NO USE | ... |  |  |  |  | KENT AVENUE | KENT AVENUE |  |  |  |  |
| 2 | 2 | 5519 | 150 | 2055190150.0 | 2055190150.0 | 210 |  | DCAS | 1520 | NO USE-VACANT LAND | ... |  |  |  |  | SHORE DRIVE | SHORE DRIVE |  |  |  |  |
| 3 | 1 | 2 | 23 | 1000020023.0 | 1000020023.0 | 101 | PIER 6 | DSBS | 1500 | NO USE | ... |  |  |  |  | SOUTH STREET | SOUTH STREET | D |  |  |  |
| 4 | 3 | 8591 | 200 | 3085910200.0 | 3085910200.0 | 318 |  | DSBS | 1500 | NO USE | ... |  |  |  |  |  |  |  |  |  |  |
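
The `limit=50000` request above returns at most 50,000 rows, so it silently truncates any dataset larger than that. A minimal sketch of one way to guard against this is to page through the API with `limit` and `offset` until a short page comes back. The helper name `fetch_all` and its `fetch_page` callback are hypothetical and not part of `sodapy` or `dcpy`.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


def fetch_all(
    fetch_page: Callable[[int, int], List[Record]],
    page_size: int = 50000,
) -> List[Record]:
    """Collect all records by requesting pages until a short page is returned.

    `fetch_page(limit, offset)` is a hypothetical callback that returns
    one page of records as a list of dictionaries.
    """
    if page_size <= 0:
        raise ValueError("page_size must be positive")

    records: List[Record] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        records.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            return records
        offset += page_size


# Possible usage with the sodapy client from the cell above (not executed here).
# A stable `order` keeps pages from overlapping between requests:
# results = fetch_all(
#     lambda limit, offset: client.get(
#         "fn4k-qyk2", limit=limit, offset=offset, order=":id"
#     )
# )
```

Passing the page fetcher as a callback keeps the paging logic independent of the HTTP client, so it can be exercised without network access.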


### Ingesting data with `dcpy`

First, we need to create a YAML config file for the **COLP** dataset in the `templates/` directory, as follows:

```yaml
# templates/dcp_colp.yml

id: dcp_colp

attributes:
  name: City Owned and Leased Property (COLP)

ingestion:
  source:
    type: socrata
    org: nyc
    uid: fn4k-qyk2
    format: geojson
  file_format:
    type: geojson
```
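
If you are following along in a notebook, the template can also be written to disk from Python. This is a minimal sketch using only the standard library; the helper name `write_template` is hypothetical, and naming the file after the dataset `id` is an assumption based on the `templates/dcp_colp.yml` path shown above.

```python
from pathlib import Path
from typing import Union

# Same content as the template shown above.
TEMPLATE = """\
id: dcp_colp

attributes:
  name: City Owned and Leased Property (COLP)

ingestion:
  source:
    type: socrata
    org: nyc
    uid: fn4k-qyk2
    format: geojson
  file_format:
    type: geojson
"""


def write_template(
    template_dir: Union[str, Path] = "templates",
    dataset_id: str = "dcp_colp",
) -> Path:
    """Write the template to <template_dir>/<dataset_id>.yml and return its path."""
    path = Path(template_dir) / f"{dataset_id}.yml"
    # Create the templates directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(TEMPLATE, encoding="utf-8")
    return path


# Possible usage: write_template() creates ./templates/dcp_colp.yml
```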


Now we are ready to extract data with `dcpy`: 

In [None]:
%%bash
# set necessary env variables
export RECIPES_BUCKET=test
export PUBLISHING_BUCKET=test

# run CLI command
python3 -m dcpy.cli lifecycle ingest dcp_colp --template-dir ./templates

INFO:dcpy:registering edm.recipes
INFO:dcpy:registering edm.publishing.drafts
INFO:dcpy:registering edm.publishing.published
INFO:dcpy:Registered Connectors: ['edm.recipes', 'edm.publishing.drafts', 'edm.publishing.published']
INFO:dcpy:Reading template from templates/dcp_colp.yml
INFO:dcpy:Reading template from templates/dcp_colp.yml
INFO:dcpy:✅ Raw data was found locally at .lifecycle/ingest/staging/dcp_colp/2025-03-18T19:32:29.777006-04:00/dcp_colp.geojson
INFO:dcpy:✅ Converted raw data to parquet file and saved as .lifecycle/ingest/staging/dcp_colp/2025-03-18T19:32:29.777006-04:00/init.parquet
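
The log shows that each run writes its output into a timestamped folder under `.lifecycle/ingest/staging/dcp_colp/`. A minimal sketch for locating the most recent `init.parquet` follows. The helper `latest_ingest_output` is hypothetical and not part of `dcpy`, and the directory layout is assumed from the log lines above rather than from a documented contract.

```python
from pathlib import Path
from typing import Optional, Union


def latest_ingest_output(
    dataset_id: str,
    staging_root: Union[str, Path] = ".lifecycle/ingest/staging",
    filename: str = "init.parquet",
) -> Optional[Path]:
    """Return the newest <staging_root>/<dataset_id>/<timestamp>/<filename>, if any."""
    dataset_dir = Path(staging_root) / dataset_id
    if not dataset_dir.is_dir():
        return None

    # Keep only run folders that actually contain the expected file.
    runs = sorted(p for p in dataset_dir.iterdir() if (p / filename).is_file())
    # ISO 8601 folder names sort chronologically when they share a UTC offset.
    return runs[-1] / filename if runs else None


# Possible usage (not executed here):
# import pandas as pd
# path = latest_ingest_output("dcp_colp")
# if path is not None:
#     df = pd.read_parquet(path)
```

Sorting by folder name avoids relying on file modification times, which can change when files are copied.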
