# New York City
In this activity we will visualize data about New York City (NYC) and compare it with the state of New York and the United States (US). The data comes from the 2017 1-year estimate of the American Community Survey (ACS) Public Use Microdata Sample (PUMS), documented at https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.2017.html.

Download the following datasets and place the extracted CSV files in the data subdirectory:
https://www2.census.gov/programs-surveys/acs/data/pums/2017/1-Year/csv_pny.zip
https://www2.census.gov/programs-surveys/acs/data/pums/2017/1-Year/csv_hny.zip

In this activity the datasets 'New York Population Records' (./data/pny.csv) and 'New York Housing Unit Records' (./data/hny.csv) are used. The first dataset contains information about the New York population, and the second contains information about housing units. Each dataset covers about 1% of the population and housing units. Because of the large amount of data, the datasets for the whole US are not provided; instead, the required US figures are given where necessary. The PDF 'PUMS_Data_Dictionary_2017.pdf' gives an overview and description of all variables. A further description of the codes can be found in 'ACSPUMS2017CodeLists.xls'.

Use the following code cell to define all required import statements.

In [None]:
# Import statements


Use pandas to read both CSV files located in the subdirectory 'data'.
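As a minimal sketch of the loading step, the helper below wraps `pandas.read_csv` and optionally restricts the columns that are read, which can reduce memory use for wide files. The helper name `load_pums` and the in-memory sample are hypothetical; the column names `PUMA`, `PWGTP`, and `WAGP` are assumed to match the PUMS data dictionary and should be checked against it.

```python
import io

import pandas as pd


def load_pums(source, columns=None):
    """Read a PUMS CSV from a path or file-like object.

    `columns` optionally limits the columns that are loaded.
    """
    return pd.read_csv(source, usecols=columns)


# Tiny in-memory stand-in for a file such as ./data/pny.csv
sample_csv = "PUMA,PWGTP,WAGP\n3701,12,30000\n4101,8,52000\n"
sample = load_pums(io.StringIO(sample_csv))
subset = load_pums(io.StringIO(sample_csv), columns=["PUMA", "PWGTP"])
print(sample.shape, list(subset.columns))
```

With the real files, the same call would take a path such as `'./data/pny.csv'` instead of the `StringIO` object.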

Use the given PUMA ranges to further divide the dataset into NYC districts (Bronx, Manhattan, Staten Island, Brooklyn, and Queens). A PUMA (Public Use Microdata Area) code, based on the 2010 Census definition, identifies an area with a population of 100,000 or more.
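One way to apply such a range is sketched below, assuming each range is an inclusive `[first, last]` pair and that the area code is stored in a column named `PUMA`. The helper `filter_by_puma` and the sample rows are hypothetical; the Bronx bounds 3701 and 3710 are the ones given in this activity.

```python
import pandas as pd


def filter_by_puma(df, puma_range):
    """Return the rows whose PUMA code lies in the inclusive range."""
    low, high = puma_range
    # Series.between is inclusive on both ends by default
    return df[df["PUMA"].between(low, high)]


bronx_range = [3701, 3710]
people = pd.DataFrame({"PUMA": [3701, 3710, 3801, 100],
                       "PWGTP": [10, 20, 30, 40]})
bronx_people = filter_by_puma(people, bronx_range)
print(bronx_people)
```

Because the district ranges are contiguous, a single range from the first Bronx code to the last Queens code selects all of NYC.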

In [None]:
# PUMA ranges
bronx = [3701, 3710]
manhatten = [3801, 3810]
staten_island = [3901, 3903]
brooklyn = [4001, 4017]
queens = [4101, 4114]
nyc = [bronx[0], queens[1]]



In the dataset, each sample carries a weight that indicates how many people or housing units in the full population it represents. Therefore, we cannot simply calculate the median of the raw values. Use the weighted_median function given in the following cell to compute the median.

In [None]:
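# Function for a 'weighted' median
import numpy as np


def weighted_frequency(values, weights):
    # Repeat each value as many times as its weight (weights must be non-negative integers)
    weighted_values = []
    for value, weight in zip(values, weights):
        weighted_values.extend(np.repeat(value, weight))
    return weighted_values


def weighted_median(values, weights):
    # Median of the expanded list, so each value counts once per unit of weight
    return np.median(weighted_frequency(values, weights))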
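To see why the weights matter, the small worked example below compares an unweighted and a weighted median. It is only a sketch: the incomes and weights are made up, and it calls `np.repeat` on whole arrays, which should expand the values the same way as the loop above when the weights are non-negative integers.

```python
import numpy as np

incomes = np.array([20000, 50000, 90000])
weights = np.array([1, 1, 3])  # made-up sample weights

# Each income is repeated according to its weight:
# [20000, 50000, 90000, 90000, 90000]
expanded = np.repeat(incomes, weights)

unweighted = np.median(incomes)  # ignores the weights
weighted = np.median(expanded)   # each record counts once per unit of weight
print(unweighted, weighted)
```

The unweighted median is 50000, while the weighted median is 90000, because the highest income stands for three of the five represented units.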

## Wages
In this subtask we will create a plot containing multiple subplots that visualize information about NYC wages.
- Visualize the median household income for the US, New York, New York City, and its districts.
- Visualize the average wage by gender for the given occupation categories for the population of NYC.
- Visualize the wage distribution for New York and NYC. Use the following yearly wage intervals: 10k steps between 0 and 100k, 50k steps between 100k and 200k, and one open-ended interval above 200k.
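The wage intervals in the last item can be sketched with `pandas.cut`. The column roles are assumptions: wages are taken to be in a column such as `WAGP` and person weights in `PWGTP`, as described in the PUMS data dictionary. The helper `wage_distribution` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd

# 10k steps up to 100k, then 150k and 200k, then an open-ended top interval
WAGE_EDGES = list(range(0, 100001, 10000)) + [150000, 200000, np.inf]


def wage_distribution(wages, weights):
    """Return the weighted percentage of records in each wage interval."""
    # include_lowest=True keeps wages of exactly 0 in the first interval
    bins = pd.cut(wages, bins=WAGE_EDGES, include_lowest=True)
    weighted_counts = weights.groupby(bins, observed=False).sum()
    return weighted_counts / weighted_counts.sum() * 100


wages = pd.Series([5000, 10000, 15000, 120000, 250000])
weights = pd.Series([2, 5, 1, 3, 4])
dist = wage_distribution(wages, weights)
print(dist)
```

Whether records with a wage of 0 belong in the first interval is a modeling choice; filtering them out beforehand is an equally reasonable option.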

In [None]:
# Data wrangling for median household income


In [None]:
# Data wrangling for wage by gender for different occupation categories
occ_categories = ['Management,\nBusiness,\nScience,\nand Arts\nOccupations', 'Service\nOccupations',
                 'Sales and\nOffice\nOccupations', 'Natural Resources,\nConstruction,\nand Maintenance\nOccupations',
                 'Production,\nTransportation,\nand Material Moving\nOccupations']
occ_ranges = {'Management, Business, Science, and Arts Occupations': [10, 3540],
              'Service Occupations': [3600, 4650],
              'Sales and Office Occupations': [4700, 5940],
              'Natural Resources, Construction, and Maintenance Occupations': [6000, 7630],
              'Production, Transportation, and Material Moving Occupations': [7700, 9750]}
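A possible way to use these ranges is sketched below: each record is assigned to a category by its occupation code, and the wage is averaged per gender using the person weights. The column names `OCCP`, `SEX` (1 for male, 2 for female), `WAGP`, and `PWGTP` are assumptions based on the PUMS data dictionary, weighting the average by `PWGTP` is a design choice rather than a requirement stated above, and the helper `average_wage_by_gender` is hypothetical.

```python
import numpy as np
import pandas as pd

# Two of the category ranges from the cell above (inclusive on both ends)
example_ranges = {"Service Occupations": [3600, 4650],
                  "Sales and Office Occupations": [4700, 5940]}


def average_wage_by_gender(df, ranges):
    """Weighted average wage per occupation category and gender."""
    rows = []
    for name, (low, high) in ranges.items():
        in_category = df[df["OCCP"].between(low, high)]
        for sex_code, sex_label in [(1, "Male"), (2, "Female")]:
            group = in_category[in_category["SEX"] == sex_code]
            if group["PWGTP"].sum() == 0:
                continue  # no records for this combination
            wage = np.average(group["WAGP"], weights=group["PWGTP"])
            rows.append({"category": name, "gender": sex_label, "wage": wage})
    return pd.DataFrame(rows)


workers = pd.DataFrame({"OCCP": [3600, 4000, 4700, 5000, 9000],
                        "SEX": [1, 1, 2, 2, 1],
                        "WAGP": [10000, 30000, 20000, 40000, 99999],
                        "PWGTP": [1, 3, 1, 1, 5]})
result = average_wage_by_gender(workers, example_ranges)
print(result)
```

The resulting table has one row per category and gender, which maps directly onto a grouped bar chart.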



In [None]:
# Data wrangling for wage distribution


In [None]:
# Create figure with four subplots


# Median household income in the US
us_income_median = 60336

# Median household income


# Wage by gender in common jobs


# Wage distribution


# Overall figure


## Occupations
Use a tree map to visualize the percentage for the given occupation subcategories for the population of NYC.

In [None]:
# Data wrangling for occupations
occ_subcategories = {'Management,\nBusiness,\nand Financial': [10, 950],
                    'Computer, Engineering,\nand Science': [1000, 1965],
                    'Education,\nLegal,\nCommunity Service,\nArts,\nand Media': [2000, 2960],
                    'Healthcare\nPractitioners\nand\nTechnical': [3000, 3540],
                    'Service': [3600, 4650],
                    'Sales\nand Related': [4700, 4965],
                    'Office\nand Administrative\nSupport': [5000, 5940],
                    '': [6000, 6130],
                    'Construction\nand Extraction': [6200, 6940],
                    'Installation,\nMaintenance,\nand Repair': [7000, 7630],
                    'Production': [7700, 8965],
                    'Transportation\nand Material\nMoving': [9000, 9750]}
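The percentages for the tree map could be derived along the following lines. This is a sketch under assumptions: occupation codes are taken from a column named `OCCP`, person weights from `PWGTP`, and the helper `occupation_shares` is hypothetical. Only two of the subcategories are used to keep the example short.

```python
import pandas as pd

# Two of the subcategory ranges from the cell above (inclusive on both ends)
example_subcategories = {"Service": [3600, 4650],
                         "Production": [7700, 8965]}


def occupation_shares(df, ranges):
    """Weighted percentage of records per occupation subcategory."""
    totals = {label: df.loc[df["OCCP"].between(low, high), "PWGTP"].sum()
              for label, (low, high) in ranges.items()}
    shares = pd.Series(totals, dtype=float)
    # Normalize over the listed subcategories so the shares sum to 100
    return shares / shares.sum() * 100


workers = pd.DataFrame({"OCCP": [3600, 4000, 7700, 9999],
                        "PWGTP": [1, 2, 1, 6]})
shares = occupation_shares(workers, example_subcategories)
print(shares)
```

The values and index labels of `shares` can then serve as the sizes and labels of a tree map, for example with the `squarify` package if it is installed.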



In [None]:
# Visualization of tree map


## Correlation


Use a heatmap to show the correlation between difficulties (self-care difficulty, hearing difficulty, vision difficulty, independent living difficulty, ambulatory difficulty, veteran service-connected disability, and cognitive difficulty) and age groups (<5, 5-11, 12-14, 15-17, 18-24, 25-34, 35-44, 45-54, 55-64, 65-74, 75+) in New York City.
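One possible reading of this task is to compute, for each age group, the weighted percentage of people who report a given difficulty, and to plot that table as a heatmap. The sketch below builds such a table. It assumes age is stored in `AGEP`, person weights in `PWGTP`, and that each difficulty column (for example `DEAR` for hearing difficulty) uses the code 1 for "yes"; these assumptions should be checked against the data dictionary. The helper `difficulty_share_by_age` is hypothetical.

```python
import numpy as np
import pandas as pd

# Left-closed age intervals: [0, 5), [5, 12), ..., [75, inf)
AGE_EDGES = [0, 5, 12, 15, 18, 25, 35, 45, 55, 65, 75, np.inf]
AGE_LABELS = ["<5", "5-11", "12-14", "15-17", "18-24", "25-34",
              "35-44", "45-54", "55-64", "65-74", "75+"]


def difficulty_share_by_age(df, difficulty_cols):
    """Weighted percentage of people with each difficulty per age group."""
    age_group = pd.cut(df["AGEP"], bins=AGE_EDGES, labels=AGE_LABELS, right=False)
    total = df["PWGTP"].groupby(age_group, observed=False).sum()
    shares = {}
    for col in difficulty_cols:
        # Keep the weight where the difficulty is reported (code 1), else 0
        affected = df["PWGTP"].where(df[col] == 1, 0)
        shares[col] = affected.groupby(age_group, observed=False).sum() / total * 100
    return pd.DataFrame(shares)


people = pd.DataFrame({"AGEP": [3, 30, 30, 80],
                       "PWGTP": [10, 5, 15, 20],
                       "DEAR": [2, 1, 2, 1]})
table = difficulty_share_by_age(people, ["DEAR"])
print(table)
```

Age groups without any records yield NaN in this sketch. The table has age groups as rows and difficulties as columns, so it can be passed to a heatmap function such as seaborn's `heatmap`.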

In [None]:
# Data wrangling for New York City population difficulties


In [None]:
# Heatmap
