In [None]:
%matplotlib inline

# UrbanSim Data Set Exploration


This notebook is a template that shows how to identify UrbanSim datasets of interest on Socrata (based on tags), pull them into pandas DataFrames, and begin exploratory data analysis (EDA). It is meant as a starting point for each data steward's own EDA. When you have completed your EDA, add a folder containing your modified notebook and any outputs to this [Box folder](https://mtcdrive.app.box.com/folder/76889542701).

**NOTE**: Please make and edit a COPY of this notebook. Add your name as a prefix to the filename and remove the Template suffix, so that the resulting filename looks like this: `<Your name> DS INPUT UrbanSim Dataset Exploration.ipynb`

## How to use this notebook:

1. Make a copy of this notebook to perform your own EDA
2. Save your modified notebook copy and any outputs in a folder and copy that folder to this [Box folder](https://mtcdrive.app.box.com/folder/76889542701)
3. Communicate your updates
    - write data-specific notes via the BASIS dataset Feedback feature
    - write general communications in the Slack channel #urbansim-data-collab
    - open a GitHub issue [here](https://github.com/BayAreaMetro/DataServices/issues)

In [None]:
# Requirements: run this cell first, if needed, to install the required Python libraries

! pip install boto3
! pip install s3fs
! conda install -y -c anaconda psycopg2
! pip install sodapy
! pip install sqlalchemy-redshift

In [None]:
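import os
import sys
import pandas as pd

# Add the shared utility code folder on Box to the import path
user = os.environ['USER']
sys.path.insert(0, '/Users/{}/Box/DataViz Projects/Utility Code'.format(user))
# utils_io provides pull_df_from_socrata, used below
from utils_io import *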

# UrbanSim Data Set Exploration

Pythonic Approach to Managing Socrata Data Assets using SodaPy

**SodaPy source**:
https://github.com/xmunoz/sodapy

**GitHub Documentation**: [BayAreaMetro/DataServices](https://github.com/BayAreaMetro/DataServices/tree/master/Project-Documentation/mdm)
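
The introduction mentions identifying datasets by tag. As a minimal sketch of how that lookup could be done with SodaPy, the helper below calls the client's `datasets` method with a `tags` filter and keeps each asset's ID and name. The helper name `find_datasets_by_tag` is an assumption, and the result layout (ID and name under a `resource` key) is assumed to follow the Socrata Discovery API; check a real response before relying on it.

```python
import pandas as pd


def find_datasets_by_tag(client, tag):
    """List dataset IDs and names for assets on the client's domain that carry `tag`.

    `client` needs a SodaPy-style datasets(tags=[...]) method. Each result is
    assumed to hold its ID and name under the 'resource' key.
    """
    results = client.datasets(tags=[tag])
    return pd.DataFrame(
        [{'id': r['resource']['id'], 'name': r['resource']['name']} for r in results],
        columns=['id', 'name'],
    )
```

With an authenticated `sodapy.Socrata` client, a call such as `find_datasets_by_tag(client, 'urbansim')` would return one row per matching asset; the tag value here is only a placeholder.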

## Socrata Data Sets


### UrbanSim Buildings

**Documentation**: [UrbanSim Buildings](https://github.com/BayAreaMetro/DataServices/blob/master/Project-Documentation/mdm/land-people-mdm/buildings.md)

**Socrata dataset ID**: ahwz-jtst

**Socrata dataset**: [UrbanSim Buildings](https://data.bayareametro.gov/d/ahwz-jtst)

**Primary key**: `building_id`

**Foreign key**: `joinid`

### UrbanSim Parcels

**Documentation**: [UrbanSim Parcels](https://github.com/BayAreaMetro/DataServices/blob/master/Project-Documentation/mdm/land-people-mdm/urbansim_parcels.md)

**Socrata dataset ID**: 6q7r-gybw

**Socrata dataset**: [UrbanSim Parcels](https://data.bayareametro.gov/Cadastral/UrbanSim-Parcels/6q7r-gybw)

**Primary key**: `joinid`

### General Plan and Zoning 2018

**Socrata dataset ID**: udk3-z2d5

**Socrata dataset**: [General Plan and Zoning](https://data.bayareametro.gov/Land-Use/General-Plan-and-Zoning-2018/udk3-z2d5)

**Primary key**: `zoning_id` (matches `recid` in General Plan and Regional Zoning tables and `RecID` in Codes tables)

**Foreign key**: `joinid`


### General Plan Codes

**Socrata dataset ID**: vzcc-dhby

**Socrata dataset**: [General Plan Codes](https://data.bayareametro.gov/Land-Use/Regional-General-Plan-Codes-2018/vzcc-dhby)

**Primary key**: `RecID` (matches `zoning_id` in General Plan and Zoning 2018 table and `recid` in General Plan and Regional Zoning tables)


### General Plan

**Socrata dataset ID**: cc3g-fj4w

**Socrata dataset**: [General Plan](https://data.bayareametro.gov/Land-Use/View-based-on-Regional-General-Plan-Codes-2018/cc3g-fj4w)

**Primary key**: `recid` (matches `zoning_id` in General Plan and Zoning 2018 table and `RecID` in Codes tables)


### Regional Zoning Codes

**Socrata dataset ID**: qdrp-c5ra

**Socrata dataset**: [Regional Zoning Codes](https://data.bayareametro.gov/Land-Use/Regional-Zoning-Codes-2018/qdrp-c5ra)

**Primary key**: `RecID` (matches `zoning_id` in General Plan and Zoning 2018 table and `recid` in General Plan and Regional Zoning tables)

### Regional Zoning

**Socrata dataset ID**: q2p6-hbrp

**Socrata dataset**: [Regional Zoning](https://data.bayareametro.gov/Land-Use/View-of-Parcels-and-Regional-Zoning-2018/q2p6-hbrp)

**Primary key**: `recid` (matches `zoning_id` in General Plan and Zoning 2018 table and `RecID` in Codes tables)
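
To make the key relationships listed above concrete, here is a small sketch using made-up rows: buildings link to parcels through `joinid`, and `zoning_id` in General Plan and Zoning links to `recid` in the codes tables. All values, and the `acres` and `max_far` columns, are invented for illustration; the lowercase `recid` follows the column name used in the cells later in this notebook.

```python
import pandas as pd

# Toy rows only; the real tables come from Socrata and have many more columns.
buildings = pd.DataFrame({
    'building_id': ['b1', 'b2', 'b3'],
    'joinid': ['p1', 'p1', 'p9'],   # p9 has no matching parcel
})
parcels = pd.DataFrame({
    'joinid': ['p1', 'p2'],
    'acres': [0.25, 1.5],           # hypothetical attribute column
})

# Each building should map to at most one parcel; validate raises an error
# if joinid turns out to be duplicated on the parcels side.
buildings_parcels = buildings.merge(
    parcels, on='joinid', how='left', validate='many_to_one', indicator=True
)

# Buildings whose joinid is missing from the parcels table
orphans = buildings_parcels.loc[
    buildings_parcels['_merge'] == 'left_only', 'building_id'
]
print(orphans.tolist())

# zoning_id in General Plan and Zoning corresponds to recid in the codes tables
gp_zoning = pd.DataFrame({'joinid': ['p1', 'p2'], 'zoning_id': ['10', '11']})
gp_codes = pd.DataFrame({'recid': ['10', '11'], 'max_far': ['1.0', '2.5']})  # max_far is hypothetical
zoning_with_codes = gp_zoning.merge(
    gp_codes, left_on='zoning_id', right_on='recid', how='left', validate='many_to_one'
)
```

The `indicator=True` column (`_merge`) is a convenient way to count unmatched foreign keys before trusting a join.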


## Socrata Credentials

Socrata credentials for this notebook are managed by Kaya Tollas (ktollas@bayareametro.gov).
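
The `pull_df_from_socrata` helper used in the next section comes from `utils_io` and is not shown in this notebook. As a rough sketch of what a pull like that could look like with SodaPy (not necessarily how `utils_io` does it), the example below pages through a dataset with `client.get` and builds a DataFrame. The function names, the page size, and the `SOCRATA_*` environment variable names are assumptions made for illustration.

```python
import os

import pandas as pd


def make_client(domain='data.bayareametro.gov'):
    """Build a SodaPy client; the environment variable names are hypothetical."""
    from sodapy import Socrata  # imported here so fetch_all_rows works without sodapy
    return Socrata(
        domain,
        os.environ.get('SOCRATA_APP_TOKEN'),
        username=os.environ.get('SOCRATA_USERNAME'),
        password=os.environ.get('SOCRATA_PASSWORD'),
    )


def fetch_all_rows(client, dataset_id, page_size=50000):
    """Page through a Socrata dataset and return every row as a DataFrame.

    `client` needs a SodaPy-style get(dataset_id, limit=..., offset=..., order=...)
    method that returns a list of dicts.
    """
    rows = []
    offset = 0
    while True:
        # Ordering by the system field :id keeps pages stable between requests
        page = client.get(dataset_id, limit=page_size, offset=offset, order=':id')
        rows.extend(page)
        if len(page) < page_size:
            break  # a short (or empty) page means the end has been reached
        offset += page_size
    return pd.DataFrame.from_records(rows)
```

With credentials in place, `fetch_all_rows(make_client(), '6q7r-gybw')` would pull UrbanSim Parcels. Records arrive as JSON, so numeric columns may come back as strings and need `pd.to_numeric` afterwards.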

## 1. Pull these datasets into pandas DataFrames

#### Pull UrbanSim Buildings

In [None]:
urbansim_buildings_id = 'ahwz-jtst'
urbansim_buildings = pull_df_from_socrata(urbansim_buildings_id)

In [None]:
urbansim_buildings.head()

In [None]:
urbansim_buildings.shape  # (3655207, 15)

In [None]:
urbansim_buildings['apn'].nunique()  # 2507764

#### Pull UrbanSim Parcels

In [None]:
urbansim_parcels_id = '6q7r-gybw'
urbansim_parcels = pull_df_from_socrata(urbansim_parcels_id)

In [None]:
urbansim_parcels.head()

In [None]:
urbansim_parcels.shape  # (3655207, 7)

In [None]:
urbansim_parcels['apn'].nunique()  # 2643041

### General Plan and Zoning

#### Pull General Plan and Zoning 2018

In [None]:
gp_zoning_id = 'udk3-z2d5'
gp_zoning = pull_df_from_socrata(gp_zoning_id)

In [None]:
gp_zoning.head()

In [None]:
gp_zoning.shape  # (2142677, 11)

In [None]:
gp_zoning['joinid'].nunique()  # 2091536

#### Pull General Plan Codes

In [None]:
gp_codes_id = 'vzcc-dhby'
gp_codes = pull_df_from_socrata(gp_codes_id)

In [None]:
gp_codes.head()

In [None]:
gp_codes.shape  # (2240, 15)

In [None]:
gp_codes['recid'].nunique()  # 2240

#### Pull General Plan

In [None]:
general_plan_id = 'cc3g-fj4w'
general_plan = pull_df_from_socrata(general_plan_id)

In [None]:
general_plan.head()

In [None]:
general_plan.shape  # (334271, 16)

In [None]:
general_plan['recid'].nunique()  # 147

#### Pull Regional Zoning Codes

In [None]:
rz_codes_id = 'qdrp-c5ra'
rz_codes = pull_df_from_socrata(rz_codes_id)

In [None]:
rz_codes.head()

In [None]:
rz_codes.shape  # (2979, 16)

In [None]:
rz_codes['recid'].nunique()  # 2979

#### Pull Regional Zoning

In [None]:
regional_zoning_id = 'q2p6-hbrp'
regional_zoning = pull_df_from_socrata(regional_zoning_id)

In [None]:
regional_zoning.head()

In [None]:
regional_zoning.shape  # (148496, 17)

In [None]:
regional_zoning['recid'].nunique()  # 240

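The `.shape` and `.nunique()` checks above hint at whether a column behaves like a key: a primary key should have as many distinct values as the table has rows, and no nulls. A small helper along these lines makes that comparison explicit and returns the offending rows. The name `check_key` is an assumption (it is not part of `utils_io`), and the rows are toy data.

```python
import pandas as pd


def check_key(df, key):
    """Summarize whether `key` uniquely identifies the rows of `df`."""
    # Rows whose key value appears more than once (nulls are reported separately)
    dup_mask = df[key].duplicated(keep=False) & df[key].notnull()
    return {
        'rows': len(df),
        'unique': int(df[key].nunique()),
        'nulls': int(df[key].isnull().sum()),
        'duplicated_rows': df.loc[dup_mask].sort_values(key),
    }


# Toy rows for illustration only
toy = pd.DataFrame({'joinid': ['p1', 'p2', 'p2', None], 'apn': ['a', 'b', 'c', 'd']})
report = check_key(toy, 'joinid')
print(report['rows'], report['unique'], report['nulls'])
print(report['duplicated_rows'])
```

On the real tables, a call such as `check_key(urbansim_parcels, 'joinid')` would show whether the documented primary key holds in the pulled data.
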
## 2. EDA

Your exploratory data analysis steps go here:

Example EDA steps: https://kite.com/blog/python/data-analysis-visualization-python

Things to check for: unique values, missing values, data quality issues (e.g. misspellings or mismatched values), and any data cleaning steps the dataset will need.
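
As one possible starting point for these checks, the sketch below builds a per-column profile (dtype, null count, null share, distinct values) and counts categorical values after trimming whitespace and lowercasing, which can surface spelling or capitalization variants. The helper names, the `city` column, and all sample values are assumptions for illustration.

```python
import pandas as pd


def profile_columns(df):
    """Return one row per column: dtype, null count, null share, distinct values."""
    return pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'n_null': df.isnull().sum(),
        'pct_null': (df.isnull().mean() * 100).round(1),
        'n_unique': df.nunique(dropna=True),
    })


def normalized_value_counts(series):
    """Count values after trimming whitespace and lowercasing them."""
    return series.dropna().astype(str).str.strip().str.lower().value_counts()


# Made-up sample rows; swap in a real DataFrame such as urbansim_buildings
sample = pd.DataFrame({
    'fipco': ['CA001', 'CA001', 'CA013', None],
    'city': ['Oakland', 'oakland ', 'Berkeley', 'Berkeley'],  # hypothetical column
})
profile = profile_columns(sample)
city_counts = normalized_value_counts(sample['city'])
print(profile)
print(city_counts)
```

Comparing `n_unique` before and after normalization is a quick way to estimate how many categories differ only by formatting.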

In [None]:
# get descriptive statistics for all fields
urbansim_buildings.describe(include='all')

In [None]:
# check if this dataset contains null values
urbansim_buildings.isnull().values.any()

In [None]:
# get number of null values per field
urbansim_buildings.isnull().sum()

In [None]:
# visualize some categorical data
urbansim_buildings['fipco'].value_counts().plot.bar(title='FIPCO counts');