# 2022-47 Base Year Forecast Output QC

Test Plan: https://sandag.sharepoint.com/qaqc/_layouts/15/Doc.aspx?sourcedoc={f8b3d630-1290-445b-99a1-2fa9041ade92}&action=edit

Documentation: https://sandag.sharepoint.com/:w:/r/qaqc/_layouts/15/Doc.aspx?sourcedoc=%7B3AF20D75-0A22-4B9C-9CC4-85B3EEC294E6%7D&file=MGRABased_input_ABM_2019_process_notes.docx 

### Library Imports

In [33]:
import pandas as pd
import numpy as np

from pathlib import Path

### Download Data


In [34]:
def download_data(user):
    """
    This function loads the csv data for the 2019 Forecast Output from the local project folder

    :param user:    The user whose local folder holds the data. Mostly here so that others can
                    more easily run my code

    :returns:       List with [mgra data, person data (with mgra added), household data]
    """
    # Data is stored in this folder
    data_folder_path = Path(
        f"C:/Users/{user}/San Diego Association of Governments/"
        "SANDAG QA QC - Documents/Projects/2022/2022-47 Base Year Forecast Output QC/Data/"
    )
    
    # Define the files here
    files = ["mgra_ind.csv", "persons_2019_01.csv", "households_2019_01.csv"]

    # Read each file from the data folder into a dataframe
    dfs = []
    for file in files:
        dfs.append(pd.read_csv(data_folder_path / file))

    # Use the households file to add which mgra each person belongs to
    # (default inner join: persons without a matching household would be dropped)
    hhid_to_mgra = dfs[2][["hhid", "mgra"]]
    dfs[1] = dfs[1].merge(hhid_to_mgra, on="hhid")

    return dfs

# Get data and put the dfs into named containers
mgra, persons, _ = download_data("eli")
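
Because `merge` defaults to an inner join, any person whose `hhid` is missing from the households file would be dropped without warning. The following is a minimal sketch of a guard against that, run on small made-up frames rather than the real CSVs. The helper name `attach_mgra` and the `perid` column are hypothetical, introduced only for illustration.

```python
import pandas as pd

def attach_mgra(persons_df, households_df):
    """Attach each person's mgra via hhid, failing loudly on unmatched rows."""
    merged = persons_df.merge(
        households_df[["hhid", "mgra"]],
        on="hhid",
        how="left",               # keep every person row
        validate="many_to_one",   # each hhid may appear only once in households
        indicator=True,           # adds a _merge column describing the match
    )
    unmatched = int((merged["_merge"] == "left_only").sum())
    if unmatched:
        raise ValueError(f"{unmatched} person rows have no matching household")
    return merged.drop(columns="_merge")

# Toy data (hypothetical values, not from the project files)
toy_persons = pd.DataFrame({"perid": [1, 2, 3], "hhid": [10, 10, 20]})
toy_households = pd.DataFrame({"hhid": [10, 20], "mgra": [501, 502]})
toy_merged = attach_mgra(toy_persons, toy_households)
```

If the row count after the merge equals the row count of the persons file, the inner join used above lost nothing.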

## Tests

#### Get the number of people in the region in the following categories in the file persons.csv

1. Full time worker (18+)
2. Part time worker (18+)
3. College student (18+)

In [35]:
# ptype == 1 corresponds to full time worker (18+)
print("Number of full time workers (18+):")
persons[persons["ptype"] == 1].shape[0]

Number of full time workers (18+):


1122490

In [36]:
# ptype == 2 corresponds to part time worker (18+)
print("Number of part time workers (18+):")
persons[persons["ptype"] == 2].shape[0]

Number of part time workers (18+):


359033

In [37]:
# ptype == 3 corresponds to college students (18+)
print("Number of college students (18+):")
persons[persons["ptype"] == 3].shape[0]

Number of college students (18+):


184270
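
The three counts above can also be produced in one pass. This is a sketch on made-up data: the `ptype` codes 1 to 3 and their labels come from the cells above, while the helper name `count_ptypes` and the toy frame are assumptions for illustration.

```python
import pandas as pd

# Labels for the ptype codes used in the cells above
PTYPE_LABELS = {
    1: "Full time worker (18+)",
    2: "Part time worker (18+)",
    3: "College student (18+)",
}

def count_ptypes(persons_df, labels=PTYPE_LABELS):
    """Return {label: number of rows} for each ptype code in labels."""
    counts = persons_df["ptype"].value_counts()
    # .get falls back to 0 when a code does not occur at all
    return {label: int(counts.get(code, 0)) for code, label in labels.items()}

# Toy data (hypothetical); ptype 4 is present but not one of the requested categories
toy = pd.DataFrame({"ptype": [1, 1, 2, 3, 3, 3, 4]})
ptype_counts = count_ptypes(toy)
```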

#### Get the number of self-employed people in the region from persons.csv and confirm with mgra.csv

Spoiler: persons.csv does not contain any data on self-employment.

In [38]:
# occen5 = legal census occupation code
persons["occen5"].value_counts()

0    3294988
Name: occen5, dtype: int64
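
A column that holds a single value for every row, as `occen5` does here, carries no information. A quick way to flag such columns is sketched below on made-up data; the helper name `constant_columns` is hypothetical.

```python
import pandas as pd

def constant_columns(df, columns):
    """Return {column: value} for each listed column with exactly one distinct value."""
    result = {}
    for col in columns:
        values = df[col].dropna().unique()
        if len(values) == 1:
            result[col] = values[0]
    return result

# Toy data (hypothetical): occen5 is all zeros, occsoc5 varies
toy = pd.DataFrame({
    "occen5": [0, 0, 0],
    "occsoc5": ["00-0000", "11-1021", "00-0000"],
})
placeholders = constant_columns(toy, ["occen5", "occsoc5"])
```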

In [39]:
# occsoc5 = detailed occupation codes defined by the Standard Occupational Classification (SOC) system
persons["occsoc5"].value_counts()

00-0000    1369996
11-1021     768421
41-1011     456391
31-1010     318000
51-1011     146098
45-1010     136716
55-1010      99366
Name: occsoc5, dtype: int64

Using 2018 codes unless noted otherwise:

11-1021: General and Operations Managers

31-1010: Not defined in the 2018 codes; in the 2010 codes it is Nursing, Psychiatric, and Home Health Aides

41-1011: First-Line Supervisors of Retail Sales Workers

45-1010: First-Line Supervisors of Farming, Fishing, and Forestry Workers

51-1011: First-Line Supervisors of Production and Operating Workers

55-1010: Military Officer Special and Tactical Operations Leaders
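
To read the `value_counts()` output without looking each code up by hand, the titles can be attached to the counts. This is a sketch on made-up data: the titles are the ones listed above (with 11-1021 as it appears in the output), and the helper name `labeled_counts` is hypothetical.

```python
import pandas as pd

# SOC titles as listed in this notebook
SOC_TITLES = {
    "11-1021": "General and Operations Managers",
    "31-1010": "Nursing, Psychiatric, and Home Health Aides (2010 code)",
    "41-1011": "First-Line Supervisors of Retail Sales Workers",
    "45-1010": "First-Line Supervisors of Farming, Fishing, and Forestry Workers",
    "51-1011": "First-Line Supervisors of Production and Operating Workers",
    "55-1010": "Military Officer Special and Tactical Operations Leaders",
}

def labeled_counts(codes, titles=SOC_TITLES):
    """Count each code and attach its title; codes without a title are marked."""
    counts = codes.value_counts().rename_axis("code").reset_index(name="count")
    counts["title"] = counts["code"].map(titles).fillna("(no title listed)")
    return counts

# Toy data (hypothetical)
toy_codes = pd.Series(["00-0000", "11-1021", "11-1021", "55-1010"], name="occsoc5")
soc_table = labeled_counts(toy_codes)
```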

In [40]:
# indcen = industry code defined in PECAS: there are about 270 industry categories grouped by
# 6-digit NAICS code (North American Industry Classification System)
persons["indcen"].value_counts()

0       3195622
9770      99366
Name: indcen, dtype: int64

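9770 does not correspond to any NAICS code. Its count (99,366) equals the count for occsoc5 code 55-1010 above, so it may be flagging the same military records.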
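
The `indcen` count for 9770 (99,366) equals the `occsoc5` count for 55-1010 in the earlier output. Whether these are the same rows can be checked with a cross-tabulation. The sketch below uses made-up data in place of `persons`; on the real data the same two expressions would be run against `persons["indcen"]` and `persons["occsoc5"]`.

```python
import pandas as pd

# Toy data (hypothetical), shaped like the two columns in persons
toy = pd.DataFrame({
    "occsoc5": ["55-1010", "55-1010", "11-1021", "00-0000"],
    "indcen": [9770, 9770, 0, 0],
})

# Rows = indcen values, columns = occsoc5 values, cells = number of persons
overlap = pd.crosstab(toy["indcen"], toy["occsoc5"])

# True only if indcen == 9770 exactly where occsoc5 == "55-1010"
same_rows = bool(((toy["indcen"] == 9770) == (toy["occsoc5"] == "55-1010")).all())
```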