# Buffalo Data Science Community
April 2023: Data Viz & Storytelling Workshop



# Define `get_dataset()` Function

Call `get_dataset(dataset_name)` to load one of four datasets:

1. `recycling`: https://data.buffalony.gov/Quality-of-Life/Monthly-Recycling-and-Waste-Collection-Statistics/2cjd-uvx7
2. `neighborhood`: https://data.buffalony.gov/Economic-Neighborhood-Development/Neighborhood-Metrics/adai-75jt
3. `expenditures`: https://data.buffalony.gov/Government/Open-Expenditures-Filtered/bktd-jwim
4. `ev`: https://catalog.data.gov/dataset/electric-vehicle-title-and-registration-activity

For example, `get_dataset('ev')` returns the Electric Vehicle Title and Registration Activity dataset as a `pd.DataFrame`.
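
As defined in the cell below, `get_dataset()` lowercases the name and strips surrounding whitespace before looking it up, so `get_dataset(' EV ')` behaves like `get_dataset('ev')`, while an unrecognized name raises a bare `KeyError`. The following is a minimal sketch of the same lookup with a friendlier error message. `DATASET_NAMES` and `normalize_dataset_name` are hypothetical names introduced for illustration and are not part of the notebook.

```python
# Hypothetical helper, not part of the workshop notebook: it mirrors the
# strip().lower() normalization used by get_dataset() and lists the valid
# names when the lookup fails.
DATASET_NAMES = ('recycling', 'neighborhood', 'expenditures', 'ev')


def normalize_dataset_name(dataset_name: str) -> str:
    """Return the canonical dataset name or raise a descriptive ValueError."""
    name = dataset_name.strip().lower()
    if name not in DATASET_NAMES:
        valid = ', '.join(DATASET_NAMES)
        raise ValueError(f'Unknown dataset {dataset_name!r}; choose one of: {valid}')
    return name


canonical = normalize_dataset_name('  EV ')
```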

In [15]:
import pandas as pd


def _get_dataset_chunks(https_path: str, n_rows: int) -> pd.DataFrame:
    """
    Download an entire dataset from a SODA API endpoint in chunks.

    By default, the SODA API returns only the first 1,000 rows, and a single call can retrieve at most
    50,000 rows. This helper requests 50,000 rows per call and steps the `$offset` query parameter
    through the dataset until all `n_rows` rows have been requested.

    :param https_path: str; the HTTPS URL of the dataset's JSON endpoint

    :param n_rows: int; the total number of rows in the dataset

    :return: pd.DataFrame
    """
    max_allowable_rows_per_chunk = 50_000
    data_chunks = [
        pd.read_json(f'{https_path}?$limit={max_allowable_rows_per_chunk}&$offset={i}')
        for i in range(0, n_rows, max_allowable_rows_per_chunk)
    ]
    return pd.concat(data_chunks, ignore_index=True)


def get_dataset(dataset_name: str) -> pd.DataFrame:
    """
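    Download one of the workshop datasets from its open data portal.

    :param dataset_name: str; one of 'recycling', 'neighborhood', 'expenditures', or 'ev'
        (case and surrounding whitespace are ignored).

    :return: pd.DataFrame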
    """
    dataset_name = dataset_name.strip().lower()
    links = {
        'recycling': ('https://data.buffalony.gov/resource/2cjd-uvx7.json', 1_536),
        'neighborhood': ('https://data.buffalony.gov/resource/adai-75jt.json', 35),
        'expenditures': ('https://data.buffalony.gov/resource/bktd-jwim.json', 173_321),
        'ev': ('https://data.wa.gov/resource/rpr4-cgyd.json', 681_315)
    }
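    # Look up the endpoint URL and row count; an unknown name raises KeyError.
    args = links[dataset_name]
    data = _get_dataset_chunks(*args)
    # Raise the pandas display limit so every column of this dataset is shown.
    pd.set_option('display.max_columns', len(data.columns))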
    return data
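
The row counts hard-coded in `links` may go stale as the portals publish new rows. One possible alternative, sketched below under stated assumptions, is to keep requesting pages until one comes back shorter than the page size, which removes the need to know the total in advance. `fetch_all_rows`, `fetch_page`, and `make_fake_endpoint` are hypothetical names, and the fake endpoint stands in for the network call so the paging logic can be run offline.

```python
from typing import Callable, Dict, List, Tuple


def fetch_all_rows(
    fetch_page: Callable[[int, int], List[Dict]],
    page_size: int = 50_000,
) -> List[Dict]:
    """Call fetch_page(limit, offset) until a page shorter than page_size arrives."""
    if page_size <= 0:
        raise ValueError('page_size must be positive')
    rows: List[Dict] = []
    offset = 0
    while True:
        page = fetch_page(page_size, offset)
        rows.extend(page)
        # A short (or empty) page means the dataset is exhausted.
        if len(page) < page_size:
            break
        offset += page_size
    return rows


def make_fake_endpoint(total_rows: int) -> Tuple[Callable[[int, int], List[Dict]], List[Tuple[int, int]]]:
    """Build an in-memory stand-in for a paged endpoint and a log of its calls."""
    calls: List[Tuple[int, int]] = []

    def fetch_page(limit: int, offset: int) -> List[Dict]:
        calls.append((limit, offset))
        return [{'row': i} for i in range(offset, min(offset + limit, total_rows))]

    return fetch_page, calls


fake_fetch, calls = make_fake_endpoint(total_rows=12)
all_rows = fetch_all_rows(fake_fetch, page_size=5)
```

To use this against a real endpoint, `fetch_page` could wrap a `pd.read_json` call with the same `$limit` and `$offset` parameters used in `_get_dataset_chunks` and return the records. When paging, an `$order` clause (for example `$order=:id`) is commonly added so that rows come back in a stable order across calls.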

# Install PyGWalker

In [14]:
# Import PyGWalker, installing it first if it is not already available.
try:
    import pygwalker as pyg
except ImportError:
    !pip install pygwalker
    import pygwalker as pyg

# Start Your Code Here

In [17]:
data = get_dataset('recycling')

In [18]:
data.head()

|   | date | month | type | total_in_tons |
|---|------|-------|------|---------------|
| 0 | 2022-12-31 | December | Curb Recycling | 1054.0 |
| 1 | 2022-12-31 | December | Misc. Garbage | 512.85 |
| 2 | 2022-12-31 | December | Curb Garbage | 5425.67 |
| 3 | 2022-12-31 | December | E-Waste | 12.6 |
| 4 | 2022-12-31 | December | Haz Waste | 0.0 |
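
Before moving to an interactive view, a quick aggregation can suggest which story the chart should tell. The sketch below retypes the five rows from the `data.head()` output by hand (so it runs without a download) and totals tonnage by waste type. The variable names `sample`, `tons_by_type`, and `share` are illustrative. On the full dataset the same `groupby` call would apply to `data` directly.

```python
import pandas as pd

# The five rows from the data.head() output, retyped by hand for illustration.
sample = pd.DataFrame({
    'date': ['2022-12-31'] * 5,
    'month': ['December'] * 5,
    'type': ['Curb Recycling', 'Misc. Garbage', 'Curb Garbage', 'E-Waste', 'Haz Waste'],
    'total_in_tons': [1054.0, 512.85, 5425.67, 12.6, 0.0],
})

# Total tonnage per waste type, largest first.
tons_by_type = (
    sample.groupby('type')['total_in_tons']
    .sum()
    .sort_values(ascending=False)
)

# Each type's share of the total tonnage in this sample.
share = tons_by_type / tons_by_type.sum()
```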


In [19]:
pyg.walk(data)
