## Introducing `dcpy` to the world...
`dcpy` is a one-stop-shop tool for data ingestion.

To showcase the capabilities of `dcpy`, we will work with the **City Owned and Leased Property (COLP)** dataset from NYC Open Data ([link](https://data.cityofnewyork.us/City-Government/City-Owned-and-Leased-Property-COLP-/fn4k-qyk2/about_data)).

First, we will show a common approach to extracting the data and doing basic cleaning with `pandas`. Then we will achieve the same result with `dcpy`.


### Ingesting data with `pandas`

In [1]:
# The code below is from the official Open Data portal API docs.
# link: https://dev.socrata.com/foundry/data.cityofnewyork.us/fn4k-qyk2

import pandas as pd
from sodapy import Socrata


client = Socrata("data.cityofnewyork.us", None)

# Returned as JSON from API / converted to Python list of
# dictionaries by sodapy.
results = client.get("fn4k-qyk2", limit=50000)

# Convert to pandas DataFrame
results_df = pd.DataFrame.from_records(results)

# show data
results_df.head()



| | borough | tax_block | tax_lot | bbl | billbbl | cd | street_name | address | agency | use_code | ... | the_geom | :@computed_region_yeji_bk3q | :@computed_region_sbqj_enih | :@computed_region_92fq_4b7q | :@computed_region_f5dn_yrer | house_number | parcel_name | final_commitment | leased_properties | agreement |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 7033 | 84 | 3070330084.0 | 3070330084.0 | 313 | OCEAN VIEW AVENUE | OCEAN VIEW AVENUE | DCAS | 1520 | ... | {'type': 'Point', 'coordinates': [-74.011649, ... | 2 | 35 | 45 | 21 | | | | | |
| 1 | 2 | 3448 | 33 | 2034480033.0 | 2034480033.0 | 209 | GILDERSLEEVE AVENUE | 2008 GILDERSLEEVE AVENUE | PARKS | 440 | ... | {'type': 'Point', 'coordinates': [-73.850548, ... | 5 | 26 | 31 | 58 | 2008.0 | WATERFRONT GARDEN | | | |
| 2 | 5 | 3303 | 18 | 5033030018.0 | 5033030018.0 | 502 | JEFFERSON STREET | 50 JEFFERSON STREET | EDUC | 211 | ... | {'type': 'Point', 'coordinates': [-74.098252, ... | 1 | 76 | 14 | 30 | 50.0 | PS 11 | | | |
| 3 | 4 | 14255 | 2780 | 4142552780.0 | 4142552780.0 | 410 | 165 AVENUE | 165 AVENUE | DCAS | 1520 | ... | {'type': 'Point', 'coordinates': [-73.828104, ... | 3 | 64 | 41 | 62 | | | | | |
| 4 | 4 | 1835 | 58 | 4018350058.0 | 4018350058.0 | 404 | 51 AVENUE | 51 AVENUE | DCAS | 1520 | ... | {'type': 'Point', 'coordinates': [-73.874507, ... | 3 | 68 | 5 | 66 | | | | | |


In [2]:
# Explore the dataset
print(results_df.columns)
print(results_df.dtypes)

Index(['borough', 'tax_block', 'tax_lot', 'bbl', 'billbbl', 'cd',
       'street_name', 'address', 'agency', 'use_code', 'use_type',
       'non_city_ownership', 'category_code', 'expanded_category_code',
       'excatdesc', 'x_coordinate', 'y_coordinate', 'latitude', 'longitude',
       'the_geom', ':@computed_region_yeji_bk3q',
       ':@computed_region_sbqj_enih', ':@computed_region_92fq_4b7q',
       ':@computed_region_f5dn_yrer', 'house_number', 'parcel_name',
       'final_commitment', 'leased_properties', 'agreement'],
      dtype='object')
borough                        object
tax_block                      object
tax_lot                        object
bbl                            object
billbbl                        object
cd                             object
street_name                    object
address                        object
agency                         object
use_code                       object
use_type                       object
non_city_ownership          

### Cleaning with `pandas`
Suppose we need to clean up the dataset a bit: it has more columns than we need, and its records may not be unique.

For our hypothetical analysis, we only need the `bbl`, `agency`, and `parcel_name` columns. We also want to remove duplicate records, if any exist.

In [3]:
# Filter columns
results_df = results_df[['bbl', 'agency', 'parcel_name']]

# Remove duplicates
results_df = results_df.drop_duplicates()

results_df.head()

| | bbl | agency | parcel_name |
|---|---|---|---|
| 0 | 3070330084.0 | DCAS | |
| 1 | 2034480033.0 | PARKS | WATERFRONT GARDEN |
| 2 | 5033030018.0 | EDUC | PS 11 |
| 3 | 4142552780.0 | DCAS | |
| 4 | 4018350058.0 | DCAS | |


In [4]:
# Save file in parquet format locally
results_df.to_parquet("./dcp_colp.parquet")

### Ingesting data with `dcpy`

First, we need to create a YAML config file for the **COLP** dataset in the `templates/` directory, as follows:

```yaml
# templates/dcp_colp.yml

id: dcp_colp

attributes:
  name: City Owned and Leased Property (COLP)

ingestion:
  source:
    type: socrata
    org: nyc
    uid: fn4k-qyk2
    format: geojson
  file_format:
    type: geojson

  processing_steps:
  - name: filter_columns
    args:
      columns: [ "bbl", "agency", "parcel_name" ]
  - name: deduplicate
```
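
For comparison with the `pandas` cells earlier, the two `processing_steps` in this template could be mimicked in plain `pandas` roughly as follows. This is only an illustrative sketch: the `STEP_FUNCTIONS` registry and the `run_steps` helper are hypothetical names invented here, not part of `dcpy`, and the sample rows are made up. How `dcpy` actually implements these steps may differ.

```python
import pandas as pd


def filter_columns(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Keep only the listed columns, in the given order."""
    return df[columns]


def deduplicate(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows that are exact duplicates of an earlier row."""
    return df.drop_duplicates()


# Hypothetical registry mapping step names from the YAML to functions.
STEP_FUNCTIONS = {
    "filter_columns": filter_columns,
    "deduplicate": deduplicate,
}


def run_steps(df: pd.DataFrame, steps: list[dict]) -> pd.DataFrame:
    """Apply each configured step in order, passing its optional args."""
    for step in steps:
        func = STEP_FUNCTIONS[step["name"]]
        df = func(df, **step.get("args", {}))
    return df


# The same steps as in the template, written as Python data.
steps = [
    {"name": "filter_columns", "args": {"columns": ["bbl", "agency", "parcel_name"]}},
    {"name": "deduplicate"},
]

# Tiny made-up sample standing in for the COLP data.
sample = pd.DataFrame(
    {
        "bbl": ["1", "1", "2"],
        "agency": ["DCAS", "DCAS", "PARKS"],
        "parcel_name": ["LOT A", "LOT A", "GARDEN"],
        "borough": ["3", "4", "2"],
    }
)

cleaned = run_steps(sample, steps)
print(cleaned)
```

The first two sample rows differ only in `borough`, so they become duplicates once the columns are filtered. This is why the order of the steps matters.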


Now we are ready to extract, clean, and save data locally with `dcpy`: 

In [5]:
%%bash
# run CLI command
python3 -m dcpy.cli lifecycle ingest dcp_colp --template-dir ./templates

13:05:30 INFO:dcpy:registering edm.recipes
13:05:30 INFO:dcpy:registering edm.publishing.drafts
13:05:30 INFO:dcpy:registering edm.publishing.published
13:05:30 INFO:dcpy:Registered Connectors: ['edm.recipes', 'edm.publishing.drafts', 'edm.publishing.published']
13:05:33 INFO:dcpy:Reading template from templates/dcp_colp.yml
13:05:33 INFO:dcpy:Using .lifecycle/ingest/staging/dcp_colp/2025-03-26T13:05:33.390037-04:00 to stage data
13:05:33 INFO:dcpy:Extracting dcp_colp.geojson from source to staging folder
13:05:35 INFO:dcpy:Converting dcp_colp.geojson to init.parquet
13:05:41 INFO:dcpy:Processing init.parquet to dcp_colp.parquet
13:05:42 INFO:dcpy:Copying dcp_colp.parquet to .lifecycle/ingest/datasets/dcp_colp/20240826
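
To check the result, the output can be read back with `pandas`. The sketch below assumes, based only on the last log line, that the file ends up at `<datasets dir>/<dataset id>/<version>/<dataset id>.parquet`. The helper `ingested_parquet_path` is a hypothetical name, not part of `dcpy`.

```python
from pathlib import Path

import pandas as pd

# Base folder taken from the log output of the ingest command.
DATASETS_DIR = Path(".lifecycle/ingest/datasets")


def ingested_parquet_path(dataset_id: str, version: str) -> Path:
    """Build the assumed output path for an ingested dataset (hypothetical helper)."""
    return DATASETS_DIR / dataset_id / version / f"{dataset_id}.parquet"


path = ingested_parquet_path("dcp_colp", "20240826")
print(path)

# Only read the file if the ingest command has actually produced it.
if path.exists():
    df = pd.read_parquet(path)
    print(df.columns.tolist())
    print(len(df))
```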
