# The Urbanizational Development of New York City 

***The City That Never Sleeps - From Seed To Apple***


# Introduction

The notebook is a behind-the-scenes look at the data wrangling behind the story on *The Urbanizational Evolution of New York City - From Seed to Apple* presented on https://esbenbl.github.io/


The structure of the notebook is:
1. [Motivation](#Motivation)
2. [Basic Statistics](#base_stats)
    - [Dataset of the buildings of New York City](#Dataset_of_the_buildings)
    - [Dataset of the population evolution of New York](#Dataset_of_the_population)
    - [Dataset on ethnicity population in NYC neighborhoods between 20XX and 2020](#Dataset_ethnicitiy)
3. [Data Analysis](#Data_Analysis)
4. [Genre](#Genre)
5. [Visualization](#Visualization)
    - [TITLE Figure 1](#Figur_1) (Heatmap of residential and commercial development)
    - [TITLE Figure 2](#Figur_2) (Residential vs. industrial bar plot)
    - [TITLE Figure 3](#Figur_3) (Commercial development by borough)
    - [TITLE Figure 4](#Figur_4) (The general population development in New York by borough)
    - [TITLE Figure 5](#Figur_5) (Development of the foreign-born population by borough)
    - [TITLE Figure 6](#Figur_6) (Heatmap of the distribution of different ethnicities in NYC)
    - [TITLE Figure 7](#Figur_7) (Choropleth of the different ethnicity distributions)
6. [Discussion](#Discussion)
7. [Contribution](#Controbution)
8. [References](#References)

To run the notebook, you need to load the packages in the next section.

In [None]:
import geopandas as gpd
import requests 
import matplotlib.pyplot as plt 
%matplotlib inline
plt.rcParams["font.family"] = "Garamond"
plt.rcParams['axes.facecolor'] = "#FFF6E9"
import numpy as np
import pandas as pd
import seaborn as sns
import folium 
from folium import plugins
import warnings
warnings.filterwarnings("ignore")
import matplotlib.colors as mcolors
import plotly.graph_objs as go
import json
from jinja2 import Template
from folium.map import Layer
from branca.element import Template, MacroElement, Figure

# 1 Motivation <a id="Motivation"></a>
- What is your dataset?
- Why did you choose this/these particular dataset(s)?
- What was your goal for the end user's experience?


Let's paint the picture! A boat filled with tired but hopeful families glides across the still morning waters. They have all left their old homes for the promises and opportunities of the new world. In the distance, a statue the size of a skyscraper breaks through the morning fog. The mood among the sea-worn travellers brightens as they realize that they have finally arrived at their destination. Soon after, a skyline of actual skyscrapers paints the horizon and the bustling sounds of city life fill the air. Sound familiar?

New York City is undeniably one of the most influential and renowned cities in western society, if not *the* most. In many ways, the city encapsulates the idea of the American Dream, with its buzzing streets, opportune business life, and rich cultural scene, spawning sayings like *"if you can make it here, you can make it anywhere"*. Pop cultural references to New York City are omnipresent in western movies, music, and art, and each year [millions of tourists](https://en.wikipedia.org/wiki/List_of_cities_by_international_visitors) from all over the world flock to the city to see all the "familiar" places in real life. *The City that never sleeps*, *the Capital of the World*, *the Big Apple*: New York City has earned itself many nicknames. The city has the [second highest GDP in the world](http://www.citymayors.com/statistics/richest-cities-2020.html) and has been [the largest city in the USA for over 200 years](https://www.newyorkfed.org/medialibrary/media/research/epr/05v11n2/0512glae.pdf), with a population of around 8.5 million people today. Edward L. Glaeser, professor of economics at Harvard University, states that *"While Boston's history is one of ongoing crises and reinvention (Glaeser 2005), New York's is one of almost unbroken triumph"* (Glaeser 2005:1). But what could be the explanation for NYC's urbanizational triumph over the last 200 years?

Just as the history of hopeful immigrants arriving in New York City on boats from Europe is a central part of the city's identity, so are its iconic skyscrapers. The physical appearance of the city has come to shape its economic, social and cultural fabric. Take for instance this very popular picture from 1932 of a band of construction workers taking their lunch break on a beam floating high above the streets of New York. The picture on the right shows the characters from the hugely popular sitcom *Friends* doing more or less the same thing.

<div style="display: flex; justify-content: center;">
  <img style="width: 29%; height: 30%; margin-right: 1%;" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9c/Lunch_atop_a_Skyscraper_-_Charles_Clyde_Ebbets.jpg/1280px-Lunch_atop_a_Skyscraper_-_Charles_Clyde_Ebbets.jpg">
  <img style="width: 40%; height: 10%; margin-left: 1%;" src="https://i.pinimg.com/originals/e5/7a/29/e57a29d29399922bd29bc220ce794a42.jpg">
</div>

The history of how New York City came to be The Big Apple, economically, socially, and culturally, thus has to include the history of how the city developed its physical form.


Answering this question is a complex task, but we seek to give the reader a more granular understanding of the urbanizational development of NYC by using visualizations and statistics, and as a result maybe also come a bit closer to answering the question of NYC's 200-year triumph. On our website we look into two important parts of the history of NYC: its commercial development and its migration. We look into the commercial development because we want to better understand how NYC became such a rich city, with the second highest GDP in the world. We look into migration because much of the cultural and historical identity of NYC is closely tied to its status as [a city built by migrants](https://www.osc.state.ny.us/files/reports/osdc/pdf/report-7-2016.pdf). By visualizing patterns in the urbanizational development of NYC with commercialization and migration as our points of interest, we hope to make the historical effects visible through time and to understand contemporary NYC better.

In this notebook we explain the thoughts behind the project and present all the code used for it. We start with "`Basic Statistics`", where we describe the three datasets we constructed for our analysis. Then, in "`Data Analysis`", we go through the analysis presented on the website. We then discuss our choices of data structure and data narrative in the section "`Genre`". Thereafter, in the section "`Visualization`", we explain each of the visualizations used in the analysis, including our design considerations and choices. Finally, in "`Discussion`", we conclude with some final thoughts on the findings of our project, as well as some thoughts on further studies.


# 2 Basic Statistics <a id="base_stats"></a>
- Write about your choices in data cleaning and preprocessing
- Write a short section that discusses the dataset stats, containing key points/plots from your exploratory data analysis.

In this project we use multiple datasets in order to get the most nuanced and representative data on NYC's development through time, with a focus on segregation and migration.

This section on basic statistics has three headings, each representing a different dimension for which we have constructed a dataset. All our datasets come from trustworthy US government organisations and institutions.

To be able to construct our maps of NYC, we use a GeoJSON file with the geometries of the different boroughs of NYC. We load this file before anything else.

In [None]:
# NYC Borough GeoJson
new_york_boroughs_map = gpd.read_file("https://raw.githubusercontent.com/codeforgermany/click_that_hood/main/public/data/new-york-city-boroughs.geojson")

### **Dataset 1: Land Use in New York <a id="Dataset_of_the_buildings"></a>**

One of the essential sources of data for this project is the Primary Land Use Tax Lot Output (PLUTO) data from [NYC Planning](https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page), which has data on the use of every land plot in NYC. This dataset gives us information on the use of each land plot (*landuse*) and the year in which construction of the buildings on the plot was completed (*yearbuilt*). Using this dataset we can map the development of NYC from 1800 to 2020.

The PLUTO dataset also has some limitations. One limitation is that the dataset only has data for buildings and land plots in use today. Thus we have no information on, for example, buildings which were demolished and rebuilt, or on whether a plot was once residential and later changed to commercial. A second limitation concerns the variable *yearbuilt*: the data dictionary states that construction was not necessarily completed in the indicated year, and between 1910 and 1985 a majority of the construction years end in 5 or 0 (Pluto Data Dictionary: 35). A third limitation is that a large number of the buildings which were built between 1800 and the early 1900s are recorded with a construction year between 1899 and 1901 (Pluto Data Dictionary: 35). Acknowledging these limitations, we still argue that the data represent an opportunity to see NYC's urbanizational development through time.

In the following code we load and clean the data so it is ready for analysis.

In [None]:
# PLUTO Data from https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
columns_subset = ["borough","cd", "latitude", "longitude", "yearbuilt", 'landuse', "assesstot", "numbldgs",
                  "numfloors", "unitstotal", "bldgarea", "comarea", "resarea", "bbl"]

land_use_dataaset = pd.read_csv("Exam_datasets/pluto_22v3_1.csv")[columns_subset]
land_use_dataaset.shape

Before the data cleaning process, the dataset has a total of 858,619 rows and 14 columns.

##### Cleaning PLUTO data 

In [None]:
# drop nan for PLUTO
land_use_dataaset = land_use_dataaset.dropna().copy() # drop na 
land_use_dataaset.bbl = land_use_dataaset.bbl.apply(round).astype(str)
land_use_dataaset = land_use_dataaset.query("yearbuilt >= 1600 & yearbuilt < 2020").copy() # removed before 1600 due to uncertainty and after 2019 because decade is not finished
land_use_dataaset.yearbuilt = land_use_dataaset.yearbuilt.astype(int)

# Construct bins 
ten_year_bins = [0]+[year for year in range(1899,2029,10)]
ten_year_labels = ["Before 1900"] + [str(year)+"s" for year in range(1900,2020,10)]

# Into Bins
land_use_dataaset["yearbuilt_intervals"] = pd.cut(land_use_dataaset.yearbuilt, bins = ten_year_bins, labels = ten_year_labels)

# Dicts to convert values  
landuse_key = {1:"One & Two Family Building",
                2:"Multi-Family Walk-Up Buildings",
                3:"Multi-Family Elevator Buildings",
                4:"Mixed Residential & Commercial Buildings",
                5:"Commerical & Office Buildings",
                6:"Industrial & Manufacturing Buildings",
                7:"Transportation & Utility",
                8:"Public Facilities & Institutions",
                9:"Open Space & Outdoor Recreation",
                10: "Parking Facilities",
                11:"Vacant Land"}

borough_key = {"BK":"Brooklyn",
               "QN":"Queens",
               "MN":"Manhattan",
               "BX":"Bronx",
               "SI":"Staten Island"}

# Map dicts
land_use_dataaset.borough = land_use_dataaset.borough.apply(lambda x: borough_key[x])
land_use_dataaset["landuse_label"] =  land_use_dataaset.landuse.apply(lambda x: landuse_key[x])
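
The decade binning in the cell above can be checked on a handful of years. The sketch below reuses the same bin edges and labels; the `toy_years` values are hand-picked for illustration and are not PLUTO rows. It assumes the default behaviour of `pd.cut`, where intervals are closed on the right, which is why the edges sit at 1899, 1909 and so on.

```python
import pandas as pd

# Same bin edges and labels as in the cleaning step above
ten_year_bins = [0] + list(range(1899, 2029, 10))
ten_year_labels = ["Before 1900"] + [f"{year}s" for year in range(1900, 2020, 10)]

# Hand-picked years around the bin edges (toy values, not PLUTO rows)
toy_years = pd.Series([1850, 1899, 1900, 1909, 1910, 2019])

# pd.cut uses right-closed intervals by default, e.g. (1899, 1909] -> "1900s"
toy_decades = pd.cut(toy_years, bins=ten_year_bins, labels=ten_year_labels).astype(str).tolist()
print(toy_decades)
```

With these edges, 1899 still lands in "Before 1900" while 1900 and 1909 both land in "1900s".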

NOTE: When should we remove these?

In [None]:
# Removed early observations - not much happening in these years
land_use_dataaset_trimmed = land_use_dataaset.query("yearbuilt >= 1800").copy()

In [None]:
print(f"We end with a final dataset with a total of {land_use_dataaset.shape[0]} observations and a total of {land_use_dataaset.shape[1]} columns")

After the cleaning and preparation process, we have a final dataset with a total of 807,672 plots of land and 16 columns. This means we have removed approximately 50,000 observations.

One of the visualizations we are going to present using the PLUTO data is a temporal heatmap of the geographical distribution of construction in NYC, which distinguishes between residential land use and commercial/non-residential land use. Below, we wrangle the PLUTO data to allow for such an analysis.

In [None]:
# For mixed land use, classify as either Mostly Commercial or Mostly Residential based on square feet 
land_use_dataaset_trimmed["land_use_for_heatmap"] = np.where((land_use_dataaset_trimmed.landuse_label == "Mixed Residential & Commercial Buildings") &\
                                                                 (land_use_dataaset_trimmed.comarea > land_use_dataaset_trimmed.resarea),
                                                                 "Mostly Commercial",  # must match the label used in the commercial subset below
                                                                  land_use_dataaset_trimmed.landuse_label)

land_use_dataaset_trimmed["land_use_for_heatmap"] = land_use_dataaset_trimmed["land_use_for_heatmap"].replace("Mixed Residential & Commercial Buildings", "Mostly Residential")

As mentioned above, the exact year of construction is associated with a certain degree of uncertainty. Thus, we group the construction years into 5-year bins by rounding down to the nearest multiple of 5, so e.g. the years 1995-1999 are grouped into 1995. Likewise, the years 1990-1994 are grouped into 1990.
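
There are two common ways of mapping a year to a 5-year bin: rounding down to the start of the bin, or rounding to the nearest multiple of 5. The notebook's code uses integer division, which rounds down. The sketch below compares the two on a few toy years; both helper names are ours and only serve as illustration.

```python
def floor_to_five(year: int) -> int:
    """Map a year to the start of its 5-year bin (rounds down)."""
    return year // 5 * 5


def nearest_five(year: int) -> int:
    """Alternative for comparison: map a year to the nearest multiple of 5."""
    return (year + 2) // 5 * 5


toy_years = [1993, 1995, 1997, 1999, 2000]
floored = [floor_to_five(y) for y in toy_years]
nearest = [nearest_five(y) for y in toy_years]

for year, f, n in zip(toy_years, floored, nearest):
    print(year, "-> floor:", f, "| nearest:", n)
```

The two rules only disagree for years ending in 3, 4, 8 or 9, so the choice mainly shifts which bin absorbs the years just before a multiple of 5.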

In [None]:
# group year to the nearest multiple of 5 (rounding down, e.g. 1997->1995)
land_use_dataaset_trimmed['yearbuilt_grouped'] = land_use_dataaset_trimmed['yearbuilt'] // 5 * 5

We split PLUTO into a residential and a commercial/non-residential subset based on primary land use.
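
As a minimal sketch of the two steps involved, the example below relabels mixed-use lots by their dominant floor area and then splits on the resulting label. The four toy lots are made up; only the column names and label strings follow the notebook.

```python
import numpy as np
import pandas as pd

# Toy lots (made-up values); column names follow the PLUTO subset used above
toy_lots = pd.DataFrame({
    "landuse_label": ["Mixed Residential & Commercial Buildings",
                      "Mixed Residential & Commercial Buildings",
                      "One & Two Family Building",
                      "Industrial & Manufacturing Buildings"],
    "comarea": [9000, 1000, 0, 5000],
    "resarea": [2000, 6000, 1500, 0],
})

is_mixed = toy_lots.landuse_label == "Mixed Residential & Commercial Buildings"
mostly_commercial = is_mixed & (toy_lots.comarea > toy_lots.resarea)

# Mixed lots get one of two labels; every other lot keeps its original label
toy_lots["land_use_for_heatmap"] = np.where(
    mostly_commercial, "Mostly Commercial",
    np.where(is_mixed, "Mostly Residential", toy_lots.landuse_label))

# Split on the new label
residential_labels = ["One & Two Family Building", "Mostly Residential"]
toy_residential = toy_lots[toy_lots.land_use_for_heatmap.isin(residential_labels)]
toy_commercial = toy_lots[~toy_lots.land_use_for_heatmap.isin(residential_labels)]
```

Because `isin` matches on exact strings, a label has to be spelled identically where it is assigned and where it is filtered; otherwise the affected lots silently fall out of both subsets.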

In [None]:
## Subset based on residential and non-residential land-use 
residential_buildings_and_landuse = land_use_dataaset_trimmed[land_use_dataaset_trimmed.land_use_for_heatmap.isin(['One & Two Family Building',
                                                                                                                    "Mostly Residential",
                                                                                                                    'Multi-Family Walk-Up Buildings',
                                                                                                                    'Multi-Family Elevator Buildings',
                                                                                                                    ])].copy()

commercial_buildings_and_landuse = land_use_dataaset_trimmed[land_use_dataaset_trimmed.land_use_for_heatmap.isin(['Mostly Commercial',
                                                                                                                    "Commerical & Office Buildings",
                                                                                                                    'Industrial & Manufacturing Buildings',
                                                                                                                    'Transportation & Utility',
                                                                                                                    "Parking Facilities",
                                                                                                                    "Public Facilities & Institutions"
                                                                                                                    ])].copy()

In [None]:
print(f"Lots used primarily for commercial/industrial purposes: {commercial_buildings_and_landuse.shape[0]}")
print(f"Lots used primarily for residential purposes: {residential_buildings_and_landuse.shape[0]}")

There are quite a few more residential lots than commercial/industrial ones. Let's see how the two compare in terms of total square feet.

In [None]:
print(f"Billion square feet for commercial/industrial use: {commercial_buildings_and_landuse.bldgarea.sum() / 1_000_000_000}")
print(f"Billion square feet for residential use: {residential_buildings_and_landuse.bldgarea.sum() / 1_000_000_000}")

Judging by square feet, we see that there is about 2.5 times more residential space than commercial/industrial space.

Next, we need to divide the data based on each construction's associated 5-year time bin.

In [None]:
# Define data range from the 5-year bins, so that every value matches a yearbuilt_grouped value
data_range = range(min(residential_buildings_and_landuse.yearbuilt_grouped.min(), commercial_buildings_and_landuse.yearbuilt_grouped.min()),
                   max(residential_buildings_and_landuse.yearbuilt_grouped.max(), commercial_buildings_and_landuse.yearbuilt_grouped.max())+1, 5)

residential_list_of_lists_with_coordinates = [residential_buildings_and_landuse.query(f"yearbuilt_grouped == {year}")[["latitude", "longitude"]].values.tolist() for year in data_range]
commercial_list_of_lists_with_coordinates = [commercial_buildings_and_landuse.query(f"yearbuilt_grouped == {year}")[["latitude", "longitude"]].values.tolist() for year in data_range]

data_range

The PLUTO data has now been divided based on primary land use and converted into a list of lists with coordinates, where the index of the coordinate-list represents a 5-year bin between 1800 and 2020. We now have all we need for our temporal heatmap.  
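
To make the resulting structure concrete, here is a small sketch on made-up coordinates. The names `toy_lots`, `toy_range` and `toy_frames` are ours; the column names follow the notebook.

```python
import pandas as pd

# Toy lots with made-up coordinates; column names follow the notebook
toy_lots = pd.DataFrame({
    "yearbuilt_grouped": [1800, 1800, 1810],
    "latitude": [40.70, 40.71, 40.75],
    "longitude": [-74.01, -74.00, -73.98],
})

toy_range = range(1800, 1815, 5)  # bins 1800, 1805, 1810

# One inner list of [lat, lon] pairs per 5-year bin; empty bins yield []
toy_frames = [
    toy_lots.loc[toy_lots.yearbuilt_grouped == year, ["latitude", "longitude"]].values.tolist()
    for year in toy_range
]
print(toy_frames)
```

Each position in the outer list is one time step, which is the nested shape that time-indexed heatmaps such as folium's `HeatMapWithTime` plugin take as input. Bins without construction are kept as empty lists so that positions and years stay aligned.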

## **Dataset 2: Population Development and Foreign Born Population of New York<a id="Dataset_of_the_population"></a>**

Besides our dataset of the buildings of NYC, another essential part of our analysis is the population development of NYC. To further examine migration and population movements in NYC, we have used data from [the official website of NYC](https://www.nyc.gov/). We have used two datasets from this website, [one on the overall population development](https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_total_pop_1900-2010.xlsx) and [one on the development of the foreign born population](https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_fb_pop_1900-2010.xlsx). The two datasets are rather similar: the first contains the overall population of each NYC borough for each decade, and the second contains the size of the foreign born population in each NYC borough for each decade.

### **Dataset 2.1: Overall Population**

##### Loading and preparing the datasets

In [None]:
population_by_borough = pd.read_excel("https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_total_pop_1900-2010.xlsx", skiprows = 3, index_col = 0)
population_by_borough = population_by_borough.dropna().reset_index(names = ["decade"])

The dataset which we download above does not have data for 2020, so we also download the population data **from 2020** from the [US Census Bureau's Decennial Census](https://www.census.gov/data/developers/data-sets/decennial-census.html). We also use this API in our construction of Dataset 3, which we explain further in the next section.

In [None]:
# US Census parameters 

API_KEY = "ffe97aa3a40b95750950c76a41624538483d4731"

NY_STATE = ','.join(["36"])

COUNTIES = ','.join(["047", "061", "005", "081", "085"])

COUNTY_TO_BOROUGH = {"081": "Queens",
                     "085": "Staten Island",
                     "047": "Brooklyn",
                     "005": "Bronx",
                     "061": "Manhattan"}

POP_LABEL = {"P1_001N":"population"}

In [None]:
url = f"https://api.census.gov/data/2020/dec/pl?get=NAME,P1_001N&for=county:{COUNTIES}&in=state:{NY_STATE}&key={API_KEY}"
resp = requests.get(url).json()

NYC_pop_2020 = pd.DataFrame(resp[1:], columns = resp[0])

# Clean data 
NYC_pop_2020["BOROUGH"] = NYC_pop_2020.county.map(COUNTY_TO_BOROUGH)
NYC_pop_2020 = NYC_pop_2020.rename(columns = POP_LABEL)
NYC_pop_2020["population"] = NYC_pop_2020["population"].astype(float) 
NYC_pop_2020 = NYC_pop_2020.drop(["NAME", "state", "county"], axis = 1)
NYC_pop_2020["decade"] = 2020
NYC_pop_2020 = NYC_pop_2020.pivot(index = "decade", values= "population", columns = "BOROUGH")
NYC_pop_2020["New York City"] = NYC_pop_2020.values.sum()
NYC_pop_2020 = NYC_pop_2020.reset_index()

Concatenate 2020 data with 1900-2010 data

In [None]:
population_by_borough = pd.concat([population_by_borough,NYC_pop_2020]).set_index("decade")

Get population estimates from NYC Planning (Manual extraction from PDF): https://www.nyc.gov/assets/planning/download/pdf/planning-level/nyc-population/projections_report_2010_2040.pdf

Note: the projections were made in 2013. 

In [None]:
NYC_pop_projection = {"New York City":{2030:8_821_027,
                                       2040:9_025_145},
                      "Bronx":{2030:1_518_998, 
                               2040:1_579_245},
                      "Brooklyn":{2030:2_754_009,
                                  2040:2_840_525},
                      "Manhattan":{2030:1_676_720,
                                   2040:1_691_617},
                      "Queens":{2030:2_373_551,
                                2040:2_412_649},
                      "Staten Island":{2030:497_749,
                                       2040:501_109}}

NYC_pop_projection = pd.DataFrame(NYC_pop_projection)

# Concat 2020 data to make the lines in the plot have the same offset 
NYC_pop_2020 = population_by_borough.loc[2020:]
NYC_pop_projection = pd.concat([NYC_pop_2020, NYC_pop_projection])

### **Dataset 2.2: Foreign Born Population of New York**  

In [None]:
foreign_born_pop = pd.read_excel("https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_fb_pop_1900-2010.xlsx", skiprows = 3, index_col = 0)
foreign_born_pop = foreign_born_pop.dropna().reset_index(names = ["decade"])
foreign_born_pop.decade = foreign_born_pop.decade.apply(lambda x: str(x).replace("*","")).astype(int)
foreign_born_pop = foreign_born_pop.set_index("decade")

Calculate the share of foreign born population within each borough since 1900
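
The division relies on pandas aligning the two tables on both index (decade) and columns (borough). A minimal sketch with made-up numbers, where `toy_total` and `toy_foreign` stand in for the two real tables:

```python
import pandas as pd

# Made-up numbers for two boroughs and three census years
toy_total = pd.DataFrame({"Bronx": [200, 400, 500], "Queens": [100, 300, 600]},
                         index=[1900, 1910, 1920])
toy_foreign = pd.DataFrame({"Bronx": [50, 100], "Queens": [20, 150]},
                           index=[1900, 1910])

# Division aligns on both index (decade) and columns (borough);
# .loc[1900:1910] is label-based and includes both endpoints
toy_share = toy_foreign / toy_total.loc[1900:1910] * 100
print(toy_share)
```

Slicing the total population to the years covered by the foreign born table avoids rows of missing values for decades that exist in only one of the two tables.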

In [None]:
share_of_foreign_born_by_decade = foreign_born_pop / population_by_borough.loc[1900:2010] * 100
share_of_foreign_born_by_decade = share_of_foreign_born_by_decade.drop("New York City", axis =1 )

### **Final Thoughts on Dataset 2**

Because of the simplicity of both datasets, there is not much cleaning or preparation to take care of, since each dataset consists of 12 rows and 7 columns. The construction of the datasets seems to have been a rather complicated endeavour combining data from multiple sources, and it would probably be possible to write a full data science project on the composition of these datasets, which present exact population numbers for each decade over a 110-year period. We decide to trust the NYC government and the validity of its data, but anyone who would like to scrutinise this decision could start by going through the sources used for each dataset here: [documentation for population development](https://www.nyc.gov/assets/planning/download/pdf/planning-level/nyc-population/historical-population/nyc_total_pop_1900-2010.pdf) and [documentation for foreign born population development](https://www.nyc.gov/assets/planning/download/pdf/planning-level/nyc-population/historical-population/nyc_fb_pop_1900-2010.pdf).

To supplement our population data with figures for 2020, we also used an API from the Census Bureau. In the next section we further explain our use of this API, and how we also used it to construct our third and final dataset.

### **Dataset 3: The Ethnic Population in New York City Neighborhoods**

In the third and final dataset we map ethnicity throughout contemporary NYC. We obtained this data using [the United States Census Bureau's API](https://www.census.gov/data/developers/data-sets.html), which is rather large and offers numerous datasets covering all of the United States of America. We use a dataset gathered as part of the [American Community Survey (ACS)](https://www.census.gov/programs-surveys/acs), which collects data on the communities of the USA to help officials. The ACS has four survey-based datasets: [1-year estimates, 1-year supplemental estimates, 3-year estimates and 5-year estimates](https://www.census.gov/programs-surveys/acs/guidance/estimates.html). For this project we use the [5-year estimates](https://www.census.gov/data/developers/data-sets/acs-5year.html) because they give the most reliable and granular data. The 5-year estimates cover all of the USA while being granular down to the block group level; we query them at the census tract level to investigate the contemporary neighborhood compositions of NYC. We are interested in the *race* variables of the survey on a neighborhood level, but one should be a bit cautious. The [*race* variable](https://www.census.gov/topics/population/race.html) is based on self-reporting and self-identification, so people choose the group they identify with themselves, and they can choose multiple groups. We found, for example, that the *race* variables sum to more than 100% of the population in each neighborhood. As we only want an estimate of the ethnic distribution across the neighborhoods of NYC, we argue that the data still probably represent the overall ethnic tendencies of the neighborhoods.

Using the API requires a key, as well as navigating through various variable and geography codes. Since we are solely focused on New York City within New York State, we need to specify the state and the counties explicitly. The code below calls the API and then processes the response into a structured dataset.
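
The Census API returns its result as a JSON array of arrays in which the first row holds the column names and every value is a string. The sketch below parses a mocked payload of that shape without any network call; the rows in `mock_resp` are made up for illustration, and `county_to_borough` is a shortened version of the mapping used in this notebook.

```python
import pandas as pd

# Hypothetical payload in the shape the Census API returns:
# a JSON array of arrays whose first row is the header (values are made up)
mock_resp = [
    ["NAME", "DP05_0001E", "state", "county", "tract"],
    ["Census Tract 1, Example County, New York", "1234", "36", "047", "000100"],
    ["Census Tract 2, Example County, New York", "0", "36", "061", "000200"],
]

# First row becomes the header, the remaining rows become the data
mock_df = pd.DataFrame(mock_resp[1:], columns=mock_resp[0])

# All values arrive as strings, so numeric columns need an explicit cast
mock_df["DP05_0001E"] = mock_df["DP05_0001E"].astype(float)

# County FIPS codes are kept as zero-padded strings for the lookup
county_to_borough = {"047": "Brooklyn", "061": "Manhattan"}
mock_df["BOROUGH"] = mock_df["county"].map(county_to_borough)
print(mock_df)
```

Keeping the county and tract codes as strings preserves their leading zeros, which matters later when the codes are concatenated into tract IDs.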

### **Dataset 3.1: Data on Ethnicity**

We start by defining the overall variables which we are using to construct the dataset

In [None]:
# Preparing a dataframe for the dataset from the API
COLUMN_LABELS = {"DP05_0001E": "total_pop",
                 "DP05_0003PE": "pct_female",
                 "DP05_0002PE": "pct_male",
                 "DP03_0002PE": "pct_employed_over_16",
                 "DP03_0005PE": "pct_unemployed_over_16",
                 "DP03_0052PE": "pct_below_10000_income", 
                 "DP03_0062E": "median_household_income",
                 "DP03_0063E" : "also_median_household_income?",
                 "DP05_0037PE": "pct_white_one_race",
                 "DP05_0038PE": "pct_black_one_race",
                 "DP05_0044PE": "pct_asian_one_race",
                 "DP05_0071PE": "pct_hispanic_or_latino_any",
                 "DP05_0086E": "total_housing_units"}

COUNTY_TO_BOROUGH = {"081": "Queens",
                     "085": "Staten Island",
                     "047": "Brooklyn",
                     "005": "Bronx",
                     "061": "Manhattan"}


SEX_VAR = ["DP05_0003PE","DP05_0002PE"]
POP_VAR = ["DP05_0001E"]
LABOUR_VAR = ["DP03_0002PE", "DP03_0005PE"]
INCOME_VAR = ["DP03_0052PE", "DP03_0062E","DP03_0063E"]
RACE_VAR = ["DP05_0037PE", "DP05_0038PE", "DP05_0044PE", "DP05_0071PE"]
HOUSING_VAR = ["DP05_0086E"]

QUERY = ",".join(SEX_VAR+POP_VAR+LABOUR_VAR+INCOME_VAR+RACE_VAR+HOUSING_VAR)

Sending a request to the API using the QUERY, COUNTIES, NY_STATE and API_KEY values which we defined earlier. 

In [None]:
url = f"https://api.census.gov/data/2021/acs/acs5/profile?get=NAME,{QUERY}&for=tract:*&in=county:{COUNTIES}&in=state:{NY_STATE}&key={API_KEY}"
resp = requests.get(url).json()
us_census_df = pd.DataFrame(resp[1:], columns = resp[0])

# Clean response 
us_census_df = us_census_df.rename(columns = COLUMN_LABELS)
us_census_df["BOROUGH"] = us_census_df.county.map(COUNTY_TO_BOROUGH)

us_census_df = us_census_df.query("tract!='990100'") # Remove "erroneous" tracts
us_census_df[list(COLUMN_LABELS.values())] = us_census_df[list(COLUMN_LABELS.values())].astype(float) # Convert to float
us_census_df = us_census_df.replace(-666666666, np.nan) # Convert nan values 

### In areas where no one lives (based on "total_pop"), replace NaN values in the other columns with 0.
for col in list(COLUMN_LABELS.values()):
    # Each comparison needs its own parentheses because & binds tighter than ==
    us_census_df[col] = np.where((us_census_df["total_pop"] == 0) & us_census_df[col].isna(),
                                 0,
                                 us_census_df[col])
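
A note on combining conditions in pandas: in Python, `&` binds more tightly than `==`, so an expression like `a == 0 & b` is read as `a == (0 & b)`. Wrapping each comparison in its own parentheses avoids this. A minimal sketch on made-up tracts (the `toy` frame and its values are ours):

```python
import numpy as np
import pandas as pd

# Toy tracts (made-up values): the second tract has residents but a missing value
toy = pd.DataFrame({"total_pop": [0.0, 500.0, 0.0],
                    "pct_female": [np.nan, np.nan, 0.0]})

# Parenthesized: True only where nobody lives AND the value is missing
mask = (toy["total_pop"] == 0) & toy["pct_female"].isna()

# Fill with 0 only where the mask holds; leave every other value untouched
toy["pct_female_filled"] = np.where(mask, 0, toy["pct_female"])
print(toy)
```

The missing value in the populated tract stays missing, which keeps genuinely unknown values distinguishable from empty areas.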

### **Dataset 3.2: Geodata on Census Tract Level**

Load GeoJson with Census Tract geometry

In [None]:
census_tracts_geo = gpd.read_file("https://services5.arcgis.com/GfwWNkhOj9bNBqoJ/arcgis/rest/services/NYC_Census_Tracts_for_2020_US_Census/FeatureServer/0/query?where=1=1&outFields=*&outSR=4326&f=pgeojson")

Align with the dataset from the US Census Bureau
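
The alignment works by rebuilding the geo file's tract ID from two parts: the one-digit borough code followed by the six-digit tract code. A minimal sketch on made-up rows, where `toy_geo` and `toy_census` stand in for the two real tables:

```python
import pandas as pd

# Made-up rows; the real BoroCode/BoroCT2020 values come from the geo file
toy_geo = pd.DataFrame({"BoroName": ["Manhattan", "Brooklyn"],
                        "BoroCode": ["1", "3"],
                        "BoroCT2020": ["1000100", "3000200"]})
toy_census = pd.DataFrame({"BOROUGH": ["Brooklyn", "Manhattan"],
                           "tract": ["000200", "000100"]})

# Borough name -> borough code lookup, built from the geo data itself
boro_code = (toy_geo[["BoroName", "BoroCode"]]
             .drop_duplicates()
             .set_index("BoroName")["BoroCode"]
             .to_dict())

# One-digit borough code + six-digit tract code gives the 7-character ID
toy_census["tract_id"] = toy_census["BOROUGH"].map(boro_code).astype(str) + toy_census["tract"]
print(toy_census)
```

Building the lookup from the geo data itself means the borough codes never have to be typed in by hand.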

In [None]:
## Match US Census data tract IDs to the census tract geo data IDs
US_CENSUS_KEY = census_tracts_geo[["BoroName", "BoroCode"]].drop_duplicates().set_index("BoroName").to_dict()["BoroCode"]
us_census_df["borough_id"] = us_census_df.BOROUGH.map(US_CENSUS_KEY).astype("str")
us_census_df["tract_id"] = us_census_df["borough_id"] + us_census_df["tract"]

## Subset columns and rename 
census_tracts_geo = census_tracts_geo[["NTAName", "CDTANAME", "BoroCT2020", "geometry"]].rename(columns={"BoroCT2020":"tract_id"}).copy()

Merge geometry on Census data

In [None]:
census_tract_data = pd.merge(census_tracts_geo, us_census_df, on = "tract_id", how = "outer", indicator = True)

Sanity Check on merge

In [None]:
census_tract_data._merge.value_counts() # All tracts with demographic data are in the merge 

Examine the tract that could not be merged

In [None]:
census_tract_data.query("_merge == 'left_only'")

`Hoffman & Swinburne Islands` are two small, uninhabited islands off the coast of Staten Island. Thus, filtering these out will not affect the remaining demographic analysis.
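
The sanity check uses the `indicator=True` option of `pd.merge`, which adds a `_merge` column recording whether each row was found in the left table, the right table, or both. A minimal sketch on made-up IDs (all `toy_` names are ours):

```python
import pandas as pd

# Toy tables with made-up IDs: tract "C" has geometry but no demographic data
toy_geo = pd.DataFrame({"tract_id": ["A", "B", "C"], "name": ["n1", "n2", "n3"]})
toy_demo = pd.DataFrame({"tract_id": ["A", "B"], "total_pop": [100.0, 250.0]})

toy_merged = pd.merge(toy_geo, toy_demo, on="tract_id", how="outer", indicator=True)

# The _merge column records where each row came from
merge_counts = toy_merged["_merge"].value_counts().to_dict()
print(merge_counts)

# Keep only the tracts present in both tables
toy_both = toy_merged[toy_merged["_merge"] == "both"].copy()
```

An outer merge keeps unmatched rows from both sides, so the counts show directly how many tracts would be lost by keeping only the matches.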

Keep only observations present in both datasets

In [None]:
census_tract_data = census_tract_data.query("_merge == 'both'")

### **Thoughts on Dataset 3**
Dataset 3 is an opportunity for us to examine the overall segregation of different ethnicities in NYC. An important consideration is that the dataset is self-reported, and when asking people questions regarding their ethnicity one should keep in mind two general weaknesses of surveys. The first weakness is the issue of differential item functioning, i.e. the question of whether a respondent interprets the survey question in the same way as other respondents and the researcher do. A topic as widely discussed as ethnicity is one where people could be expected to have very different interpretations (Jæger 2006:62). The second potential weakness is social desirability bias, meaning that people choose the survey answers which they think are the most desirable (Krumpal 2011:2026). This is especially a problem when asking people controversial questions regarding topics such as stereotypes, sexual preferences, extreme opinions or racism.

Even with these problems, we use the dataset because we trust the quality of data coming from the ACS, and because the data still represent the best opportunity for what we want to research.


### **Final thoughts on Basic Statistics**

In this section we have explained the three datasets on which we build our analysis, while also showing the code. The three datasets are:
 1. The PLUTO dataset, which has information regarding all the land plots of NYC. 
 2. The population dataset, which has information regarding the development of NYC's population and foreign born population from 1900-2010.
 3. The ACS dataset with 5-year estimates, which has information regarding the overall segregation of the different ethnicities on neighborhood level in NYC. 

# **3 Data Analysis <a id="Data_Analysis"></a>**
- Describe your data analysis and explain what you've learned about the dataset.

`In Figure 1`, we find an interesting analytical point about modern urbanization challenges, where sky-rocketing housing prices and social segregation invoke new geographical distinctions between social classes, i.e. those who can afford an apartment in the city and those who cannot (owners vs. subletters) (sources for this claim are still missing). In our motivation we cited Glaeser (2005), who stated that the history of NYC is one of constant triumph with near-constant growth, and looking at this plot one really gets a feeling of the construction boom that NYC has seen almost continuously for at least 200 years.

`In Figure 2`, the most notable decade is the 1920s, also known as *"the roaring twenties"*. More than 800 million square feet of building area were constructed in NYC during that decade. In the 1930s the amount of construction falls, and in the 1940s it is only about 25% of what was constructed in the 1920s. Two events stand out as likely explanations: the Wall Street Crash of 1929, which probably played an important part in halting NYC's construction growth, and the Second World War from 1939-1945. After the 1930s-1940s, NYC's construction growth returns in the 1960s, only to fall again in the 1970s, 1980s and 1990s. This makes historical sense, as the 1970s and 1980s were hard decades for NYC (https://www.businessinsider.com/new-york-city-used-to-be-a-terrifying-place-photos-2013-7?r=US&IR=T & https://en.wikipedia.org/wiki/History_of_New_York_City_(1946%E2%80%931977)). Looking at this phenomenon more broadly, however, it seems that nearly all of the larger cities fared even worse than NYC in that period (https://www.newyorkfed.org/medialibrary/media/research/epr/05v11n2/0512glae.pdf p. 20).  

The relation between commercial and residential construction seems quite steady through time. The highest share of commercial construction occurs in the 1990s - a time characterised in pop culture by yuppies in Manhattan - when circa half of all construction was meant for commercial use. The 1940s, on the other hand, have the highest share of residential construction compared to commercial. 
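
The commercial share discussed above can be computed directly from the two area columns. A small sketch, assuming the `yearbuilt_intervals`, `resarea` and `comarea` columns used in the plotting code for Figure 2; the decade labels and numbers in the toy frame are made up purely for illustration and are not taken from PLUTO:

```python
import pandas as pd

# Hypothetical toy data: two lots per decade, areas in square feet
toy = pd.DataFrame({
    "yearbuilt_intervals": ["1940s", "1940s", "1990s", "1990s"],
    "resarea": [800, 100, 300, 200],
    "comarea": [50, 50, 400, 100],
})

# Sum residential and commercial area per decade
by_decade = toy.groupby("yearbuilt_intervals")[["resarea", "comarea"]].sum()

# Commercial share = commercial area / total constructed area
by_decade["commercial_share"] = by_decade["comarea"] / (
    by_decade["resarea"] + by_decade["comarea"]
)
print(by_decade)
```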

The physical development of NYC is clearly affected by the world around it. Energy crises, wars, and economic recession are all manifested in the plot above. With the geographical understanding of the positioning of the five boroughs, let's take a deeper dive into the construction of New York City from a residential vs. industrial perspective. 

`In Figure 3`, Brooklyn and Manhattan stand out as the boroughs with the highest construction growth, especially at the start of the 20th century. As one could also glimpse in Figure 1, NYC has evolved outward from Manhattan and Brooklyn, as these are the first boroughs to see construction. It is also apparent that a lot of the commercial construction has been centered on Manhattan, with the 1920s being especially active. This makes sense, as the 1920s are known as a period of commercial transformation of [Manhattan](https://www.entrepreneur.com/growing-a-business/built-for-business-midtown-manhattan-in-the-1920s/239257). 

In Figure 2, we mentioned the roaring twenties as an overall phenomenon. Here we see that Brooklyn and Manhattan both have a very steep rise in residential and commercial construction, while Queens and the Bronx take a big jump in residential construction in particular. Queens seems to be the borough that comes through the 1940s best, as its construction does not fall as much as in the other boroughs.

When initially examining the five boroughs depicted in Figure 3, it is evident that Staten Island has considerably fewer square feet constructed than the others. Staten Island only really began to see growth after the 1940s, which is probably related to the borough becoming better connected to the rest of New York through the construction of multiple bridges, one of the most renowned being the [Verrazzano-Narrows Bridge](https://en.wikipedia.org/wiki/Verrazzano-Narrows_Bridge). It is also the only borough whose construction peaks in the 1970s. Both Staten Island and the Bronx seem to have a very high number of residential land plots and few commercial ones.



`In Figure 4`, we find that New York City's population has generally grown, with the most rapid increase in the early 20th century and a decrease during 1970-2000. Based on the population projection, the growth seems to continue going forward. 
Queens in particular, but also Brooklyn, has seen a rise in residents compared to the other boroughs. In Figure 4 we see Brooklyn overtaking Manhattan in population size around 1920, and Queens overtaking Manhattan during the 1950s. This makes sense given what we saw in Figure 3: Brooklyn and Queens are mostly made up of residential land plots, so they can house more residents even though they may have less total construction than Manhattan. Furthermore, Queens' construction in the 1940s was the highest among all the boroughs, which may have been necessary as its number of residents was rising at such a high rate. 

Manhattan started the 1900s as the most populous borough, home to nearly half of NYC's population. But Manhattan's population has fallen since the 1910s, while the other boroughs have seen a rise. Manhattan's fall in population is probably because it has evolved from a residential area into more of a commercial hub, thus perhaps also 'pushing' citizens from Manhattan to other boroughs. For the Bronx we observe a steady rise in the early 1900s followed by a flatter curve, and it is now at nearly the same population level as Manhattan.

With Staten Island we see the same pattern as in its rise of construction in Figure 3. It has a very small population for a long time, then a steeper curve around the 1950s and a steady rise thereafter. We theorised under Figure 3 that the rise of Staten Island was probably related to the bridges which connected it with the rest of NYC. 

As we claimed at the start of this project, NYC was built by immigrants, and this is apparent in `Figure 5`. A large share of the population in each borough was foreign born from 1900 onwards. The number of migrants rose until 1910 and then started falling around 1920-1930, continuing until 1970. This large fall in the share of foreign born makes sense, as NYC was subject to a [national restriction law](https://en.wikipedia.org/wiki/Emergency_Quota_Act) which ended the rise in the share of foreign-born immigrants. Comparing the two line plots in Figure 5, the fall in the share of foreign population seems more drastic than the fall in the total number of foreign born in the boroughs. Furthermore, it is of analytical interest that Figure 4 does not show a fall in the general population of NYC even though the foreign population is falling. Thus one can conclude that either New York's native-born population was growing, or people from within the USA moved to NYC, lowering the overall share of foreign-born people. 
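
The point about share versus absolute numbers follows from simple arithmetic: the foreign-born share is the foreign-born count divided by the total population, so it can fall even when the count stays flat, as long as the total grows. A tiny sketch with made-up round numbers (not taken from the dataset):

```python
def foreign_born_share(foreign_born: int, total: int) -> float:
    """Share of the population that is foreign born."""
    return foreign_born / total

# Hypothetical borough: the foreign-born count is constant
# while the total population grows
earlier = foreign_born_share(foreign_born=400_000, total=1_000_000)
later = foreign_born_share(foreign_born=400_000, total=1_600_000)

print(f"earlier share: {earlier:.0%}")  # 40%
print(f"later share:   {later:.0%}")    # 25%
```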

Examining the individual boroughs, it is clear that Manhattan has gone from being the area with the absolute highest number of foreign born to being overtaken by Queens, Brooklyn and the Bronx. One hypothesis for this development is that the land plots that are now commercial were previously used by migrants for residence (admittedly, this is where it gets political). Manhattan is also the borough with the smallest rise in its share of foreign born since 1970; all the other boroughs have seen a large rise. 

In contrast to the other four boroughs, Staten Island seems to have always had a rather low foreign-born population, both in absolute numbers and in share, with under 10% of the population being foreign born. 


Generally, `Figure 6` shows how the neighborhoods in New York City are highly segregated along racial lines. The white population is mainly clustered in Manhattan, Brooklyn and Staten Island. A broad observation is that, besides Staten Island, the white population is a majority in neighborhoods nearer the middle of NYC (i.e. the northern part of Brooklyn, the southern and middle part of Manhattan, and the left side of Queens). The Asian population seems to live rather close to the white population. Besides small enclaves, such as the one in Manhattan which we will look into in the next figure, the areas with the highest percentage of Asians are in Brooks and Brooklyn, and are rather closely intertwined with the white population in these two boroughs.  

The black population is very much concentrated in certain neighbourhoods in the Bronx, Brooklyn and Queens. The Hispanic/Latino population seems more spread out than the other ethnicities, but is especially concentrated in the Bronx. Thus the Bronx seems to be a mostly Hispanic/Latino and black borough, with the exception of the white population living by the ocean in the northern part of the Bronx.


In Figure 6 we mentioned a small enclave of Asian residents in the middle of multiple white neighborhoods in Manhattan. In `Figure 7` we can look at this in more depth and see that this specific neighborhood is *Chinatown-Two Bridges*, which, logically enough, is predominantly Asian.  

Just below the Bronx and above Queens we see NYC's renowned prison island (*Rikers Island*), with at that time its 6000 prisoners. In this prison one can observe a population of 49.1% self-reporting as black, 40.2% self-reporting as white, 32.5% self-reporting as Hispanic/Latino, and only 2.7% self-reporting as Asian. The council of NYC has planned [to build four new prisons so as to move the prisons closer to families and courthouses](https://gothamist.com/news/rikers-island-is-supposed-to-close-in-2027-so-why-is-mayor-adams-talking-about-plan-b). It would be interesting to see what kind of changes this would bring to the four neighborhoods.  

If one looks at the east side of the predominantly white Staten Island, one sees the neighborhood of *Miller Field*, which is the only predominantly black neighborhood in a largely white zone. Looking into [Miller Field](https://en.wikipedia.org/wiki/Miller_Field_(Staten_Island)), one sees that this area is actually an old military facility which has mostly been transformed into a park, and the total population is stated as only 260 citizens. Therefore one should be careful, since a very small number of survey answers would be able to change the shares of the different ethnicities quite a lot in this neighborhood.  
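
To make the caution concrete: with a total population of 260, every single resident corresponds to roughly 0.4 percentage points of the neighborhood's composition. A quick back-of-the-envelope sketch (the 260 comes from the text above; the count of 10 residents is a hypothetical example):

```python
total_population = 260  # Miller Field total population, as stated above

# Percentage points contributed by one resident
points_per_person = 100 / total_population
print(f"{points_per_person:.2f} percentage points per resident")

# Hypothetical: how far a group's share moves if 10 residents are counted differently
shift = 10 * points_per_person
print(f"{shift:.1f} percentage points for 10 residents")
```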



# **4 Genre <a id="Genre"></a>**
- Which tools did you use from each of the 3 categories of Visual Narrative (Figure 7 in Segel and Heer). Why?
- Which tools did you use from each of the 3 categories of Narrative Structure (Figure 7 in Segel and Heer). Why?

When constructing a narrative and visualization-based project it is essential to have a clear and well-thought-out plan for the "data story", and for this we take inspiration from Segel and Heer (2010). Segel & Heer draw on the work of artists and journalists to further their understanding of narrative visualizations and storytelling through visualizations (Segel & Heer 2010:2). Their paper seems mostly focused on stories told through a single visualization, which contrasts somewhat with our project, as we use multiple figures spanning multiple genres. Thus, we have taken inspiration from multiple genres, but mainly from the magazine-style genre, as we have chosen to use quite a lot of text and still figures (Segel & Heer 2010:7). As we are seeking to make a not too uptight but still rather serious data-scientific blog post, we felt that this genre gave us the clearest way to formulate a coherent scientific argument using our visualizations. As a final note, we also take to heart one of their key findings regarding narrative structure and messaging: they find a pattern in the articles showing an under-utilization of common narrative messaging techniques, such as commentaries, repetition, multi-messaging, and annotations to emphasize key observations (Segel & Heer 2010:8). They hypothesize that such techniques help a visualization feel more like a “story” and less like a data tool. Keeping this in mind, we have tried to strike a balance: keeping the website understandable and short, while also explaining key points and findings. In the following we go through the figures and give our thoughts on the visual and narrative structure of each. 


`Figure 1` is an interactive plot. In Segel & Heer's terms, one would probably categorize it within the film/video/animation genre, with some annotated graph/map elements. Its main visual narrative element is a timebar, which helps the reader understand the overall concept of this project. Figure 1 has multiple interactive elements for the reader: zooming, motion, an interactive timebar, and feature distinction. The narrative structure of this visualization is rather linear but still somewhat open, because the figure lets readers look further into their own areas and time periods of interest.  

`Figure 2, 3, 4 and 5` are all stills and therefore within the magazine genre. They all have quite a similar visual narrative, and here we also present more concrete numbers for the reader. With this somewhat heavier narrative structure in mind, we have taken measures to keep the figures as accessible as possible. One is keeping the timebar, which supports the overall visual narrative of the article as a project investigating the historical development of NYC. Another is using familiar objects across the graphs, so that Figures 2 & 3 share characteristics and Figures 4 & 5 share characteristics. All of this is meant to make the narrative structure lighter.   

`Figure 6` is a visualization consisting of four still maps. We have chosen to keep this figure simple with a minimal amount of interaction, as we want the narrative focus to be on the overall perspective of the highly segregated areas of NYC and the five boroughs. 

`Figure 7` is an interactive map, which we chose because, after the broader and more linear narrative structure of the previous five figures, we want to open up the narrative by giving the reader the opportunity to look further into what they find interesting. Therefore we also introduce some of the interactivity options in narrative structures defined by Segel & Heer, such as hover highlighting and filtering. 




# **5 Visualizations<a id="Visualizations"></a>**

In this section, we plot and explain the visualizations we have chosen to include in our data story. For each visualization, we start by presenting the code needed for it, which naturally concludes with plotting the visualization. Then, we present our thoughts behind the choice of visualization and its contribution to the overall narrative. 

We have an overall of X visualizations in our project. 

### **Visualization 1: The Temporal and Geographical Development of New York City's Residential and Commercial Body<a id="Figur_1"></a>**

First, we define a function that creates a colormap that is compatible with folium's `HeatMapWithTime` plugin. Our goal with this is to distinguish between residential and commercial/industrial construction using colors.

In [None]:
def color_gradient_heatmap(hexcode, n_gradients):
    ''' Create colormap from 'White' to the specified hexcode with n_gradients ''' 
    _cmap = mcolors.LinearSegmentedColormap.from_list(
    "Custom",["#FFFFFF", hexcode], N=n_gradients
    )

    gradient_dict = {i/n_gradients:mcolors.rgb2hex(_cmap(i)) for i in range(n_gradients+1)}

    return gradient_dict
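
As a quick check of what the function returns, the sketch below repeats the definition so that it runs on its own and builds a small gradient with 4 steps (the step count is chosen only to keep the output short). The keys are positions between 0 and 1 and the values are hex colors running from white to the requested color, which is the kind of dictionary passed as `gradient` to the heatmap layers further down.

```python
import matplotlib.colors as mcolors

def color_gradient_heatmap(hexcode, n_gradients):
    """Create a {position: hex color} mapping from white to `hexcode`."""
    cmap = mcolors.LinearSegmentedColormap.from_list(
        "Custom", ["#FFFFFF", hexcode], N=n_gradients
    )
    # Integer inputs index the colormap's lookup table directly
    return {i / n_gradients: mcolors.rgb2hex(cmap(i)) for i in range(n_gradients + 1)}

gradient = color_gradient_heatmap("#0097FF", 4)
for position, color in gradient.items():
    print(position, color)
```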

As the `HeatMapWithTime` module does not have a built-in method for distinguishing between two categories/groups of datapoints (residential vs. non-residential), we have to tweak the module a bit. Below, we define the class `HeatMapWithTimeAdditional`, which allows us to add an extra layer to the `HeatMapWithTime`, where both the main and the additional layer share the same control bar / slider. Thanks to @Conengmo for providing a solution to this in the following GitHub issue: https://github.com/python-visualization/folium/issues/1062

In [None]:
from folium.map import Layer  # base class for folium map layers

class HeatMapWithTimeAdditional(Layer):
    _template = Template("""
        {% macro script(this, kwargs) %}
            var {{this.get_name()}} = new TDHeatmap({{ this.data }},
                {heatmapOptions: {
                    radius: {{this.radius}},
                    minOpacity: {{this.min_opacity}},
                    maxOpacity: {{this.max_opacity}},
                    scaleRadius: {{this.scale_radius}},
                    useLocalExtrema: {{this.use_local_extrema}},
                    defaultWeight: 1,
                    {% if this.gradient %}gradient: {{ this.gradient }}{% endif %}
                }
            }).addTo({{ this._parent.get_name() }});
        {% endmacro %}
    """)

    def __init__(self, data, name=None, radius=15,
                 min_opacity=0, max_opacity=0.6,
                 scale_radius=False, gradient=None, use_local_extrema=False,
                 overlay=True, control=True, show=True):
        super(HeatMapWithTimeAdditional, self).__init__(
            name=name, overlay=overlay, control=control, show=show
        )
        self._name = 'HeatMap'
        self.data = data

        # Heatmap settings.
        self.radius = radius
        self.min_opacity = min_opacity
        self.max_opacity = max_opacity
        self.scale_radius = 'true' if scale_radius else 'false'
        self.use_local_extrema = 'true' if use_local_extrema else 'false'
        self.gradient = gradient

The tweaking is not over yet, though. Aside from `HeatMapWithTimeAdditional`, we also need to do some HTML/CSS tweaking to add a legend to the plot. Thanks to the TileMill documentation here: https://tilemill-project.github.io/tilemill/docs/guides/advanced-legends/

In [None]:
legend_template = """
{% macro html(this, kwargs) %}

<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1">
  <title>jQuery UI Draggable - Default functionality</title>
  <link rel="stylesheet" href="//code.jquery.com/ui/1.12.1/themes/base/jquery-ui.css">

  <script src="https://code.jquery.com/jquery-1.12.4.js"></script>
  <script src="https://code.jquery.com/ui/1.12.1/jquery-ui.js"></script>
  
  <script>
  $( function() {
    $( "#maplegend" ).draggable({
                    start: function (event, ui) {
                        $(this).css({
                            right: "auto",
                            top: "auto",
                            bottom: "auto"
                        });
                    }
                });
});

  </script>
</head>
<body>

 
<div id='maplegend' class='maplegend' 
    style='position: absolute; z-index:9999; border:2px solid grey; background-color:rgba(255, 255, 255, 0.8);
     border-radius:6px; padding: 10px; font-size:14px; left: 730px; top: 430px;'>
     
<div class='legend-title'>Primary Lot Use</div>
<div class='legend-scale'>
  <ul class='legend-labels'>
    <li><span style='background:#FF2B2B;opacity:0.7;height:15px;width:15px;border-radius:50%;display:block;float:left;align-items:center;'></span>Residential Construction</li>
    <li><span style='background:#0097FF;opacity:0.7;height:15px;width:15px;border-radius:50%;display:block;float:left;align-items:center;'></span>Commercial/Industrial Construction</li>
    <li><strong>To Start:</strong> Turn on the layers you want to display through <br> the layer-control in the top right corner and press play to<br>initiate the temporal animation.</li>
  </ul>
</div>
</div>
 
</body>
</html>

<style type='text/css'>
  .maplegend .legend-title {
    text-align: left;
    margin-bottom: 5px;
    font-weight: bold;
    font-size: 90%;
    }
  .maplegend .legend-scale ul {
    margin: 0;
    margin-bottom: 5px;
    padding: 0;
    float: left;
    list-style: none;
    }
  .maplegend .legend-scale ul li {
    font-size: 80%;
    list-style: none;
    margin-left: 0;
    line-height: 18px;
    margin-bottom: 2px;
    }
  .maplegend ul.legend-labels li span {
    display: block;
    float: left;
    height: 16px;
    width: 30px;
    margin-right: 5px;
    margin-left: 0;
    border: 1px solid #999;
    }
  .maplegend .legend-source {
    font-size: 80%;
    color: #777;
    clear: both;
    }
  .maplegend a {
    color: #777;
    }
</style>
{% endmacro %}"""

Finally, we are mostly set up. Below, we plot the geographical and temporal development of New York City's residential and commercial/industrial body. The red markers represent residential construction while the blue markers are commercial/industrial construction. We have also outlined the 5 boroughs of New York City: Manhattan, Brooklyn, Bronx, Queens, and Staten Island. 

**Please Note**: By default, the plot is paused and the `layers` controlling whether the residential and/or commercial construction are visible are both turned off. Turn the layers on through the layer-control in the top right corner and press play to initiate the temporal animation. The division into `layers` allows the reader to explore the geographical construction of residential and commercial buildings over time, either separately or combined.

In [None]:
from branca.element import Figure, MacroElement  # figure wrapper and legend element used below

## Params 
radius = 5
opacity = 0.8
blur = 0.8
show = False
overlay = True
width = 1050
height = 600

# Base fig
fig = Figure(width=width, height=height)

# Create NYC map 
NY_coord = [40.690610, -73.935242]
NY_map = folium.Map(NY_coord,
                    width=width, 
                    height=height,
                    tiles =None,
                    zoom_start = 10,
                    min_zoom = 10,
                    max_lat = 41,
                    min_lat = 40.3,
                    max_lon = -73.3,
                    min_lon = -74.5,
                    max_bounds = True)

# Name base layer
folium.TileLayer('cartodbpositron', name='NYC Buildings').add_to(NY_map)

# Color gradients for the commercial and residential heatmap layers
commercial_gradient = color_gradient_heatmap("#0097FF", 100)
residential_gradient = color_gradient_heatmap("#FF2B2B", 100)

## Residential Heatmap
plugins.HeatMapWithTime(residential_list_of_lists_with_coordinates,
                        index = [f"Years: {year}-{year+4}" for year in data_range],
                        auto_play = False,
                        max_opacity=opacity,
                        radius = radius,
                        blur = blur,
                        overlay = overlay,
                        show = show,
                        name = "Residential Buildings",
                        gradient = residential_gradient,
                        ).add_to(NY_map)

## Add Commercial Heatmap 
HeatMapWithTimeAdditional(commercial_list_of_lists_with_coordinates,
                            radius = radius,
                            max_opacity=opacity,
                            overlay = overlay,
                            show = show, 
                            name = "Commercial Buildings",
                            gradient = commercial_gradient).add_to(NY_map)



# Add borough outlines
folium.GeoJson(new_york_boroughs_map[["geometry", "name"]],
                name = "Borough",
                tooltip=folium.GeoJsonTooltip(fields=['name'], aliases=['Borough:']),
                style_function =  lambda x: {"fillColor":"#18B406" if x["properties"]["name"]=="Staten Island" else \
                                                            "#A15C03" if x["properties"]["name"]=="Queens" else \
                                                            "#FFAA00" if x["properties"]["name"]=="Brooklyn" else \
                                                            "#FF0000" if x["properties"]["name"]=="Manhattan" else \
                                                            "#2BAAFF" if x["properties"]["name"]=="Bronx" else "",
                                                            "color":"black",
                                                            "weight":0.5}).add_to(NY_map)

folium.LayerControl().add_to(NY_map)


# add map to fig 
fig.add_child(NY_map)


# Add legend from template
macro = MacroElement()
macro._template = Template(legend_template)
fig.get_root().add_child(macro)

# Show plot 
fig

In [None]:
fig.save("Plots/construction_heatmap.html")
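
For readers reconstructing the inputs: folium's `HeatMapWithTime` expects one list of `[lat, lon]` points per time step, in the same order as the `index` labels. How `residential_list_of_lists_with_coordinates` was actually built is not shown in this section, so the following is only a sketch of that data shape, using hypothetical records and a hypothetical 5-year binning helper:

```python
from collections import defaultdict

# Hypothetical records: (year built, latitude, longitude)
records = [
    (1901, 40.71, -74.00),
    (1903, 40.68, -73.95),
    (1907, 40.75, -73.98),
]

def five_year_bin(year: int, start: int = 1900) -> int:
    """Return the first year of the 5-year interval that contains `year`."""
    return start + ((year - start) // 5) * 5

# Group the coordinates by the interval in which the lot was built
points_by_bin = defaultdict(list)
for year, lat, lon in records:
    points_by_bin[five_year_bin(year)].append([lat, lon])

# One inner list of points per time step, plus matching slider labels
data_range = sorted(points_by_bin)
list_of_lists = [points_by_bin[start] for start in data_range]
index = [f"Years: {year}-{year + 4}" for year in data_range]

print(index)
print(list_of_lists)
```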

### **Thoughts on Visualization 1**
The heatmap displays the geographical distribution of residential vs. commercial/industrial construction over time, which is meant to give a historical introduction to how New York City has developed physically over the past 200-odd years. Furthermore, it provides an overall geographical understanding of New York City and the positional relation between its five boroughs. Again, **remember to turn the layers on** through the layer-control in the top right corner and press play to initiate the animation.

From a more meta-plot perspective, we are using two different types of encodings, namely `position` where longitudinal and latitudinal coordinates are used to locate the building lots on a map, and `color` to distinguish between the two categories, residential and commercial/industrial construction. We are aware that these encodings, especially color, are not best suited for quantitative and comparative impressions of the data. However, for geographical exploration and for overall introductory purposes, we find the temporal heatmap to be really well suited.

Activating both the residential and commercial layers and pressing play, one can see how New York City's journey from seed to apple started with residential and industrial life making their way from Manhattan out into Brooklyn during the 19th century, rapidly spreading throughout the early 20th century, and ultimately taking up all available space in the 5 boroughs. Quite quickly, starting around 1900, the heatmap becomes rather cluttered, and at times one can hardly spot any of the underlying map simply due to the excessive degree of construction. The plot is interactive, inviting the reader to zoom in and explore the development in areas of their own interest, switching between residential and commercial construction (or both jointly) in various periods of time.

One has to keep in mind the limitations of this dataset (see Section: [*Dataset 1: Land Use in New York*](#Dataset_of_the_buildings)). The data for Figure 1 has some uncertainty associated with the apparent "boom" in construction starting in 1900, either because of faulty records prior to this point, or because uncertainty regarding the build year was handled by recording the construction year as some round number, e.g. 1900.


### **Figure 2: The Amount of Construction Per Decade<a id="Figur_2"></a>**

In [None]:
fig, ax  = plt.subplots(figsize = (10,6))
fig.patch.set_facecolor("#FFF6E9")

residential_squarefeet_by_decade = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").resarea.sum().sort_index()
commercial_squarefeet_by_decade = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").comarea.sum().sort_index()

residential_squarefeet_by_decade.plot.bar(ax=ax, color = "#E9655C",  alpha = 0.8, width = 0.8)
commercial_squarefeet_by_decade.plot.bar(ax=ax, bottom = residential_squarefeet_by_decade, alpha = 0.8, color = "skyblue", width = 0.8)

ax.yaxis.get_offset_text().set_visible(False) # Removes the scientific notation
#ax.set_title("Constructed Square Feet by Decade", size = 20)
ax.set_ylabel("Square Feet (in hundred millions)", size = 20)
ax.set_xlabel("", size = 20)
ax.legend(["Residential", "Commercial"],loc='upper center', 
          bbox_to_anchor=(0.5, -0.15),
          ncol = 2,
          fontsize="xx-large")

plt.xticks(fontsize=15)
plt.yticks(fontsize=15)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)

plt.savefig("Plots/squarefeet_by_decade.png", dpi = 300, bbox_inches = "tight")

plt.tight_layout()
plt.show()

### **Thoughts on Figure 2**

Having provided a geographical introduction to the physical construction of New York City over time in Figure 1, we now want a more quantified and comparative understanding of how New York City grew into the city we know today. In terms of encodings, we therefore switch to `length` rather than `position` in Figure 2, visualizing the sum of square feet constructed within each decade in a stacked barplot that distinguishes between residential and commercial construction. We stack the bars to make it easier to compare the total amount of construction between the decades. We are still at the beginning of our data narrative, so the scope of the plot is still rather broad, i.e. spanning 100+ years and not distinguishing between boroughs, neighborhoods or census tracts.  

### **Figure 3: Decadal Construction Within the Boroughs<a id="Figur_3"></a>**

In [None]:
BOROUGH_COLORS = {"Bronx":"#2BAAFF",
                  "Brooklyn":"#FFAA00",
                  "Manhattan":"#FF0000",
                  "Queens":"#A15C03",
                  "Staten Island":"#18B406"}

In [None]:
import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(nrows=1, ncols=5, figsize=(20, 5), sharey=True)
fig.patch.set_facecolor("#FFF6E9")
#fig.suptitle('Constructed Square Feet by Decade and Borough', fontsize=22, y=1.05)

categories = land_use_dataaset_trimmed['borough'].unique()

for i, category in enumerate(categories):
    ax = axs[i]
 
    residential_squarefeet_by_decade = land_use_dataaset_trimmed[land_use_dataaset_trimmed['borough'] == category].groupby("yearbuilt_intervals").resarea.sum().sort_index()
    commercial_squarefeet_by_decade = land_use_dataaset_trimmed[land_use_dataaset_trimmed['borough'] == category].groupby("yearbuilt_intervals").comarea.sum().sort_index()

    ax.bar(residential_squarefeet_by_decade.index, residential_squarefeet_by_decade.values, color="#E9655C", alpha=0.8, width=0.8)
    ax.bar(commercial_squarefeet_by_decade.index, commercial_squarefeet_by_decade.values, bottom=residential_squarefeet_by_decade.values, color="skyblue", alpha=0.8, width=0.8)
    ax.yaxis.get_offset_text().set_visible(False) # Removes the scientific notation

    ax.set_title(category, color = BOROUGH_COLORS[category], size=25)
    ax.set_xlabel("")
    ax.set_xticks(np.arange(0, len(residential_squarefeet_by_decade.index), 1))
    ax.set_xticklabels(residential_squarefeet_by_decade.index, rotation=90, fontsize=15)

    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)


    if i==0:
        ax.set_ylabel("Square Feet (in hundred million)", size=20)
        ax.set_yticks(ticks = ax.get_yticks(), labels = ax.get_yticks()/100_000_000, size=15)
    else:
        ax.tick_params(
            axis='y',      
            which='both',      # both major and minor ticks are affected
            left=False,       # hide the tick marks on the left edge
            labelright=False  # keep tick labels off the right edge
        )

fig.legend(["Residential", "Commercial"], loc="upper center", ncol=2, fontsize="xx-large", bbox_to_anchor=(0.5, 0))

plt.tight_layout()

plt.savefig("Plots/squarefeet_by_borough.png", dpi = 300, bbox_inches = "tight")
plt.show()


### **Thoughts on Figure 3**
Recalling the geographical awareness that the reader gained in Figure 1 and the general impression of New York City's physical evolution over the decades in Figure 2, we now go a step deeper in Figure 3. Here, we examine the constructional development of each of the 5 boroughs in New York City over time. We use the same stacked barplot presentation as in Figure 2 to help the reader intuitively understand that we are still analysing the same topic of residential/commercial land lot use. Furthermore, the 5 boroughs share the same y-axis, making them more readily visually comparable.

### **Figure 4: The Population of New York City and Its Boroughs<a id="Figur_4"></a>**

In [None]:
# Population projections taken from the NYC Planning report: https://www.nyc.gov/assets/planning/download/pdf/planning-level/nyc-population/projections_report_2010_2040.pdf
NYC_pop_projection = {"New York City":{2030:8_821_027,
                                       2040:9_025_145},
                      "Bronx":{2030:1_518_998, 
                               2040:1_579_245},
                      "Brooklyn":{2030:2_754_009,
                                  2040:2_840_525},
                      "Manhattan":{2030:1_676_720,
                                   2040:1_691_617},
                      "Queens":{2030:2_373_551,
                                2040:2_412_649},
                      "Staten Island":{2030:497_749,
                                       2040:501_109}}

NYC_pop_projection = pd.DataFrame(NYC_pop_projection)

# Prepend the 2020 observations so the dashed projection lines start where the observed lines end
NYC_pop_2020 = population_by_borough.loc[2020:]
NYC_pop_projection = pd.concat([NYC_pop_2020, NYC_pop_projection])

In [None]:
fig, ax = plt.subplots(figsize = (10,6))
fig.patch.set_facecolor("#FFF6E9")

colors = ["#000000", "#2BAAFF", "#FFAA00", "#FF0000", "#A15C03", "#18B406"]

population_by_borough.plot(color= colors, ax=ax, legend = False, alpha = 0.7)
NYC_pop_projection.plot(color = colors, style="--", legend = False, ax=ax, alpha = 0.7)

#ax.set_title("Population by Borough", size = 20)
ax.set_ylabel("Population (in millions)", size = 18)
ax.set_xlabel("Year", size = 15)
plt.xticks(list(population_by_borough.index) + [2030, 2040], fontsize=15)
plt.yticks(fontsize=15)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(2020, linestyle = "--", linewidth = 1, c="gray", alpha = 0.5)

# Change y-ticks
ax.set_yticklabels([int(tick/1000000) for tick in ax.get_yticks()])

# Annotate
ax.text(2040, NYC_pop_projection.loc[2040,'New York City'], 'New York\n     City', size=12, color="#000000")
ax.text(2041, NYC_pop_projection.loc[2040,"Bronx"]-150_000, 'Bronx', size=12, color="#2BAAFF")
ax.text(2041, NYC_pop_projection.loc[2040,"Brooklyn"], 'Brooklyn', size=12, color="#FFAA00")
ax.text(2041, NYC_pop_projection.loc[2040,"Manhattan"]+150_000, 'Manhattan', size=12, color="#FF0000")
ax.text(2041, NYC_pop_projection.loc[2040,"Queens"], 'Queens', size=12, color="#A15C03")
ax.text(2041, NYC_pop_projection.loc[2040,"Staten Island"]-200_000, 'Staten\nIsland', size=12, color="#18B406")
# Text and Arrow for population projection 
ax.text(2023, 5_000_000, 'Population\nprojection', size=12, color='gray')
ax.annotate("", xy=(2038, 4_700_000), xytext=(2022.5, 4_700_000),
            arrowprops=dict(arrowstyle="->", color ="gray", alpha=0.7))

# Apply the layout before saving so the saved file matches what is shown
plt.tight_layout()
plt.savefig("Plots/population_by_borough.png", dpi = 300, bbox_inches = "tight")
plt.show()

### **Thoughts on Figure 4**

Having examined the constructional evolution of New York City and its five boroughs from a residential versus commercial perspective, `Figure 4` turns to the development in the number of inhabitants to gain a better understanding of the people who turned the city from a British settlement into the *Capital of the World*. The insights from `Figure 1-3` will be actively included to attain deeper knowledge about how New York City's inhabitants are affected by the physical construction of the city, or maybe vice versa. For instance, what happens when a city prioritizes building commercial structures over residential ones, as is the case with Manhattan? We plot a simple lineplot of the development in inhabitants from 1900 to 2020 within each of the 5 boroughs and in total. We have also visualised the projected population in 2030 and 2040 with dashed lines, to make it easier for the reader to discern between observed and estimated data.
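
As a quick arithmetic check on the projection figures used in `Figure 4`, the sketch below verifies that the five borough projections add up to the citywide projection and computes the projected growth from 2030 to 2040. The numbers are copied from the `NYC_pop_projection` dictionary above; the helper name `projected_growth_pct` is our own and only illustrative.

```python
# Projection figures copied from the NYC_pop_projection dictionary above.
projection = {
    "New York City": {2030: 8_821_027, 2040: 9_025_145},
    "Bronx": {2030: 1_518_998, 2040: 1_579_245},
    "Brooklyn": {2030: 2_754_009, 2040: 2_840_525},
    "Manhattan": {2030: 1_676_720, 2040: 1_691_617},
    "Queens": {2030: 2_373_551, 2040: 2_412_649},
    "Staten Island": {2030: 497_749, 2040: 501_109},
}

boroughs = [name for name in projection if name != "New York City"]

# Sanity check: the borough projections should sum to the citywide projection.
borough_sums = {year: sum(projection[b][year] for b in boroughs) for year in (2030, 2040)}

def projected_growth_pct(area, start=2030, end=2040):
    """Hypothetical helper: percentage change in projected population between two years."""
    first, last = projection[area][start], projection[area][end]
    return 100 * (last - first) / first

growth = {area: round(projected_growth_pct(area), 2) for area in projection}
print(borough_sums)
print(growth)
```

If the sums did not match the citywide figures, that would be a sign of a transcription error when copying the numbers from the report.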

### **Figure 5: New York City's Foreign Born Population<a id="Figur_5"></a>**

In [None]:
import matplotlib.pyplot as plt

fig, axs = plt.subplots(nrows=2, ncols=1, figsize=(10, 10))
fig.patch.set_facecolor("#FFF6E9")

#fig.suptitle("Foreign Born Population by Borough from 1900-2010", size=20)
## Foreign born population
colors = ["#000000", "#2BAAFF", "#FFAA00", "#FF0000", "#A15C03", "#18B406"]
foreign_born_pop.plot(color=colors, ax=axs[0], legend=False, alpha=0.7)
axs[0].set_ylabel("Foreign Born Population (in millions)", size=18)
axs[0].set_xticks(share_of_foreign_born_by_decade.index)
axs[0].set_xlabel("", size=15)
axs[0].yaxis.get_offset_text().set_visible(False) # Removes the scientific notation
#axs[0].set_yticklabels(['%.1f' % float(tick/1000000)+ ' mil.' for tick in axs[0].get_yticks()])
axs[0].set_xticklabels("")
axs[0].tick_params(axis='both', which='major', labelsize=15)
axs[0].spines['top'].set_visible(False)
axs[0].spines['right'].set_visible(False)

# Annotate
axs[0].text(2011, foreign_born_pop.loc[2010,'New York City'], 'New York\n     City', size=12, color=colors[0])
axs[0].text(2011, foreign_born_pop.loc[2010,"Bronx"]+50_000, 'Bronx', size=12, color=colors[1])
axs[0].text(2011, foreign_born_pop.loc[2010,"Brooklyn"]-30_000, 'Brooklyn', size=12, color=colors[2])
axs[0].text(2011, foreign_born_pop.loc[2010,"Manhattan"]-50_000, 'Manhattan', size=12, color=colors[3])
axs[0].text(2011, foreign_born_pop.loc[2010,"Queens"], 'Queens', size=12, color=colors[4])
axs[0].text(2011, foreign_born_pop.loc[2010,"Staten Island"]-50_000, 'Staten\nIsland', size=12, color=colors[5])

## Share of foreignborn in population
colors = ["#2BAAFF", "#FFAA00", "#FF0000", "#A15C03", "#18B406"]
share_of_foreign_born_by_decade.plot(color=colors, ax=axs[1], legend=False, alpha=0.7)
axs[1].set_title("", size=20)
axs[1].set_ylabel("Share of Foreign Born Population", size=18)
axs[1].set_xlabel("Year", size=15)
axs[1].set_xticks(share_of_foreign_born_by_decade.index)
axs[1].tick_params(axis='both', which='major', labelsize=15)
axs[1].spines['top'].set_visible(False)
axs[1].spines['right'].set_visible(False)

# Annotate
axs[1].text(2011, share_of_foreign_born_by_decade.loc[2010,"Bronx"], 'Bronx', size=12, color=colors[1-1])
axs[1].text(2011, share_of_foreign_born_by_decade.loc[2010,"Brooklyn"], 'Brooklyn', size=12, color=colors[2-1])
axs[1].text(2011, share_of_foreign_born_by_decade.loc[2010,"Manhattan"], 'Manhattan', size=12, color=colors[3-1])
axs[1].text(2011, share_of_foreign_born_by_decade.loc[2010,"Queens"], 'Queens', size=12, color=colors[4-1])
axs[1].text(2011, share_of_foreign_born_by_decade.loc[2010,"Staten Island"], 'Staten\nIsland', size=12, color=colors[5-1])

# Apply the layout before saving so the saved file matches what is shown
plt.tight_layout()
plt.savefig("Plots/foreign_And_share_population_by_borough.png", dpi = 300, bbox_inches = "tight")
plt.show()


### **Thoughts on Figure 5**
Having examined the general evolution of New York City's population since 1900, we want to take a closer look at a cornerstone in the history of not just how New York City came to be, but how all of the United States managed to become the world's third most populous country: immigration. Thus, in `Figure 5` we visualize the number of foreign born residents in New York City across time (the top plot) as well as the share that the foreign born inhabitants make up of the total population within each borough (the bottom plot). We do this by stacking two lineplots vertically in one figure, where both plots share the same x-axis. This makes it easier to compare the absolute and relative degree of foreign born immigration at given points in time. Furthermore, we have chosen to keep the same style for the lineplots in `Figure 5` as in `Figure 4`, to give the reader an intuitive understanding that we are still examining the population of NYC. We continue to use the same colors to discern between the boroughs as in all the preceding plots.
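
The relationship between the two panels is a simple division: the share in the bottom plot is the foreign born count from the top plot divided by the total population of the same borough and year. A minimal sketch, using made-up numbers rather than the actual census figures, and the hypothetical frame names `toy_foreign_born` and `toy_total_pop`:

```python
import pandas as pd

# Made-up illustrative numbers, NOT the real census figures.
years = [1990, 2000, 2010]
toy_foreign_born = pd.DataFrame(
    {"Bronx": [250_000, 400_000, 450_000], "Queens": [700_000, 1_000_000, 1_100_000]},
    index=years,
)
toy_total_pop = pd.DataFrame(
    {"Bronx": [1_250_000, 1_600_000, 1_500_000], "Queens": [2_000_000, 2_500_000, 2_200_000]},
    index=years,
)

# Element-wise division aligns on both the year index and the borough columns.
toy_share = toy_foreign_born.div(toy_total_pop)
print(toy_share.round(2))
```

Because the division aligns on labels, a borough or year present in only one of the two frames would show up as a missing value instead of silently shifting the numbers.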

### **Figure 6: New York City's Racial Composition Today<a id="Figur_6"></a>**

In [None]:
fig, ax = plt.subplots(2,2, figsize = (10,10))
fig.patch.set_facecolor("#FFF6E9")

# White Residents
census_tract_data.plot(column='pct_white_one_race', cmap='YlOrBr', edgecolor = "grey", linewidth=0.0, ax = ax[0][0])
new_york_boroughs_map.plot(facecolor = "none", edgecolor = "black", linewidth = 0.1, ax = ax[0][0])
ax[0][0].set_axis_off()

# Add a colorbar
cbar = ax[0][0].get_figure().colorbar(ax[0][0].collections[0], shrink = 0.5, location='bottom', pad = 0)
cbar.set_label('Pct. White Residents', size=15, labelpad = 5)

# Black Residents 
census_tract_data.plot(column='pct_black_one_race', cmap='YlOrBr', ax = ax[0][1])
new_york_boroughs_map.plot(facecolor = "none", edgecolor = "black", linewidth = 0.1, ax = ax[0][1])
ax[0][1].set_axis_off()
# Add a colorbar
cbar = ax[0][1].get_figure().colorbar(ax[0][1].collections[0], shrink = 0.5, location='bottom', pad = 0)
cbar.set_label('Pct. Black Residents', size=15, labelpad = 5)

# Asian Residents 
census_tract_data.plot(column='pct_asian_one_race', cmap='YlOrBr', ax = ax[1][0])
new_york_boroughs_map.plot(facecolor = "none", edgecolor = "black", linewidth = 0.1, ax = ax[1][0])
ax[1][0].set_axis_off()
# Add a colorbar
cbar = ax[1][0].get_figure().colorbar(ax[1][0].collections[0], shrink = 0.5, location='bottom', pad = 0)
cbar.set_label('Pct. Asian Residents', size=15, labelpad = 5)

# Hispanic and Latino Residents 
census_tract_data.plot(column='pct_hispanic_or_latino_any', cmap='YlOrBr', ax = ax[1][1])
new_york_boroughs_map.plot(facecolor = "none", edgecolor = "black", linewidth = 0.1, ax = ax[1][1])
ax[1][1].set_axis_off()
# Add a colorbar
cbar = ax[1][1].get_figure().colorbar(ax[1][1].collections[0], shrink = 0.5, location='bottom', pad = 0)
cbar.set_label('Pct. Hispanic/Latino Residents', size=15, labelpad = 5)

# Show the map
plt.tight_layout()

# Save 
plt.savefig("Plots/race_dist_by_tract.png", dpi = 300, bbox_inches = "tight")

plt.show()

### **Thoughts on Figure 6**
With `Figure 1-5`, we now have a broad but solid impression of both the physical and social development of New York City all the way from 1800 up until today. We now wish to take a more contemporary look at New York City and how its physical and social development manifests itself in the city's residential composition today. Concretely, we want to take a look at neighborhood segregation in New York City from a racial perspective. In a multicultural melting pot such as New York City, which geographical patterns emerge when one observes it through a racial lens? For instance, how is the relatively high degree of commercial construction in Manhattan during the 20th century reflected in the characteristics of contemporary Manhattanites?

To examine New York City's contemporary racial composition, we once again opt for a map visualization in `Figure 6`. More specifically, we have chosen a choropleth map of the respective share of white, black, Asian, and Hispanic/Latino residents in the given neighborhoods. Thus, we make use of the same encodings as in `Figure 1`, i.e. `position` and `color`. Here, however, we use color intensity to distinguish between the neighborhoods' share of residents with a specific racial characteristic. As mentioned earlier, color intensity can make it difficult to convey an exact and accurate impression of the magnitudinal relation between the data points. Still, we think it does really well at making the highly racially concentrated neighborhoods stand out in the choropleth map, in spite of the loss of accuracy associated with this encoding. By showing all four maps of New York City alongside each other in the figure, it is easier for the reader to quickly gain a comparative macro understanding of the racial segregation between the neighborhoods or even broader areas.
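
To make the color-intensity encoding concrete: a sequential colormap places each value on a 0 to 1 scale between the smallest and largest value being plotted and then picks the corresponding shade. The sketch below mimics that with a hypothetical five-step palette and a hypothetical helper `shade_for`; it is not the code geopandas runs internally. One consequence worth keeping in mind is that, unless `vmin` and `vmax` are fixed, each panel is scaled to its own data range, so the same shade need not mean the same percentage in two different maps.

```python
# Hypothetical five-step sequential palette, lightest to darkest (illustrative hex values).
PALETTE = ["#ffffe5", "#fee391", "#fe9929", "#cc4c02", "#662506"]

def shade_for(value, vmin, vmax, palette=PALETTE):
    """Map a value onto a palette by linear normalization between vmin and vmax."""
    if vmax == vmin:
        return palette[0]
    position = (value - vmin) / (vmax - vmin)   # 0.0 .. 1.0 inside the range
    position = min(max(position, 0.0), 1.0)     # clip values outside the range
    index = min(int(position * len(palette)), len(palette) - 1)
    return palette[index]

# The same 40 percent share gets a different shade depending on the panel's range.
print(shade_for(40, vmin=0, vmax=100))  # panel whose values span 0-100 percent
print(shade_for(40, vmin=0, vmax=50))   # panel whose values span 0-50 percent
```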

### **Figure 7: An Interactive Look at New York City's Racial Composition<a id="Figur_7"></a>**

In [None]:
census_tract_data["hovertext"] = "<br>Neighborhood: " + census_tract_data.NTAName +\
                                 "<br>Borough: " + census_tract_data.BOROUGH +\
                                 "<br>Total population: " + census_tract_data.total_pop.apply(round).astype(str)

In [None]:
def cmap_with_alpha(alpha=0.7, n_colors=100, plt_cmap='YlOrBr'):
    cmap_rgba_vals = plt.get_cmap(plt_cmap, n_colors) # get rgba values from the plt_cmap
    cmap_rgb_vals = [tuple(map(lambda x: x*255, cmap_rgba_vals(i)[:3])) for i in range(n_colors)][5:] # convert to rgb + slice to exclude white as lower color
    GEOMAP_CMAP = [f"rgba{rgb+tuple([alpha])}" for rgb in cmap_rgb_vals] # set alpha and convert to plotly-compatible format
    return GEOMAP_CMAP

In [None]:
label_dict = {'pct_white_one_race':"Pct. White Residents",
              'pct_black_one_race':"Pct. Black Residents",
              'pct_asian_one_race':"Pct. Asian Residents",
              'pct_hispanic_or_latino_any':"Pct. Hispanic & Latino Residents"}

# we need to add this to select which trace 
# is going to be visible
visible = np.array(list(label_dict.keys()))

# define traces and buttons at once
traces = []
buttons = []
for col, label in label_dict.items():

    traces.append(go.Choroplethmapbox(geojson=json.loads(census_tract_data.to_json()),
                                    z = census_tract_data[col],
                                    locations = census_tract_data.index,
                                    colorscale=cmap_with_alpha(),
                                    #colorbar_title = label,
                                    colorbar=dict(
                                        title=label,
                                        titleside='top', # set the position of the colorbar title
                                        len = 0.8,
                                        orientation = "h",
                                        y = -0.17,
                                        tickfont = dict(size=15)
                                    ),
                                    marker_line_width=0.1,
                                    visible= True if col==list(label_dict.keys())[0] else False,
                                    text = census_tract_data["hovertext"],
                                    hovertemplate= f"{label}: " + "%{z}%"+
                                                    "%{text}<extra></extra>"
                                    ))

    buttons.append(dict(label=label,
                        method="update",
                        args=[{"visible":list(visible==col)},
                              {"title":""}]))

updatemenus = [{"active":0,
                "buttons":buttons,
                "direction":'down',
                "showactive":True,
                "x":0.02,
                "y":0.98,
                "xanchor":"left",
                "yanchor":"top"}]


# Show figure
fig = go.Figure(data=traces,
                layout=dict(updatemenus=updatemenus))

fig.layout['title']['text'] = ""
                  


fig.update_layout(mapbox_style="carto-positron",
                    height = 600,
                    width = 1000,
                    autosize=True,
                    paper_bgcolor= "#FFF6E9",
                    margin={"r":0,"t":0,"l":0,"b":0},
                    mapbox=dict(center={"lat": 40.690610, "lon": -73.935242},zoom=9),
                    font_family = "Garamond",
                    font_size = 18
                    )

fig.show()

### **Thoughts on Figure 7**
In `Figure 7`, we plot the same data as in `Figure 6` using the same style of visualisation, a choropleth map, and thus the same encodings: `position` and `color`. In `Figure 7`, however, we want to utilize the granularity of the data even further and allow the reader to interact with the data. This time around, the choropleth map is interactive, allowing the reader to zoom in on specific neighborhoods and to switch between racial categories with a drop-down menu. Furthermore, we have added hover-text presenting the exact share of residents belonging to a given racial category in a neighborhood, the total number of residents in the neighborhood, the name of the neighborhood, and the borough within which the neighborhood lies. This information appears when hovering the cursor over a particular neighborhood.

Though `Figure 6` and `Figure 7` are generally similar, they convey two different messages. In `Figure 6`, the geographical distribution of all 4 racial categories is presented simultaneously, giving the reader an immediate comparative understanding of how residential composition is highly clustered between neighborhoods, as well as which parts of the city are mainly inhabited by which racial groups. `Figure 7` displays only a single racial category at a time and is meant for more detailed exploration of the individual areas/neighborhoods and racial categories, inviting the reader to examine and unveil their own insights. Having had a quite broad scope throughout `Figure 1-6`, spanning multiple decades, millions of people, and hundreds of thousands of buildings, we want `Figure 7` to allow for a higher degree of granularity.
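
The drop-down works by keeping one trace per racial category in the figure and letting each button switch on exactly one of them. The sketch below reproduces only that bookkeeping in plain Python, without plotly, so the visibility masks can be inspected directly. The dictionary mirrors `label_dict` from the cell above; `visibility_mask` is a hypothetical helper name.

```python
label_dict = {
    "pct_white_one_race": "Pct. White Residents",
    "pct_black_one_race": "Pct. Black Residents",
    "pct_asian_one_race": "Pct. Asian Residents",
    "pct_hispanic_or_latino_any": "Pct. Hispanic & Latino Residents",
}

def visibility_mask(selected, columns):
    """Hypothetical helper: one boolean per trace, True only for the selected column."""
    return [col == selected for col in columns]

columns = list(label_dict)
buttons = [
    {
        "label": label,
        "method": "update",
        # First dict updates trace properties, second dict updates the layout.
        "args": [{"visible": visibility_mask(col, columns)}, {"title": ""}],
    }
    for col, label in label_dict.items()
]

for button in buttons:
    print(button["label"], button["args"][0]["visible"])
```

Each mask has exactly one `True`, in the same position as the trace it belongs to, which is why the order of the traces and the order of the buttons must agree.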

# **6 Discussion<a id="Discussion"></a>**

### Thoughts for further studies

In this section we present some considerations for further studies and ways in which to improve on this project.

The PLUTO dataset presents some exceptional opportunities, but it also has some limitations, which we already explained in the [Basic Statistics section](#Dataset_of_the_buildings). For further studies it would be highly relevant to construct a dataset without the limitations of the PLUTO dataset. If one were to find records on demolished buildings in NYC and put these into a usable dataset, it would be interesting to see which buildings were demolished and which were allowed to stay. We concluded that the rise in construction in 1920 was a result of the booming twenties, but maybe the high amount of construction in the dataset simply reflects that the buildings from the 20s are the ones which have been kept.

The overall population of NYC and the share of foreign born in NYC play an essential role in our analysis. But this also becomes a rather broad analysis, with *foreign born* being a very broad term covering the many different migrants which NYC has received in the last centuries. If it were possible to acquire this data on an even more granular scale, it would be highly interesting to map the urbanization of NYC in conjunction with the different waves of immigrants and their neighborhoods.

Even though the project has included multiple data sources and visualizations, we have only scratched the surface of the opportunities for digging deeper into the data of NYC. With the code and our gained experience with the [CENSUS API](#Dataset_ethnicitiy), one could rather easily draw the same data for other metropolises in the USA and make comparative studies. In our analysis of [Figure 2](#Figur_2) we found that NYC had some tough years in 1970-1990. Looking further into this phenomenon, we found that almost all the big cities in the USA had these tough years, so this was more the result of a macro phenomenon than something specific to NYC. If one were to include the other metropolises, it would be possible to isolate which patterns were unique to NYC and which were a result of more general national or global trends. One could also use the CENSUS API to get even more data specifically on NYC. For Staten Island we hypothesized that the population rise was related to the construction of bridges between Staten Island and the rest of NYC. Using the API, one could maybe get data on the infrastructural development of NYC, such as harbors, railways, roads, airfields, and bridges, to further investigate this aspect. These fundamental infrastructural constructions could be essential to include in the data if one were to understand the urbanization success of NYC.
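
To illustrate how the same kind of request could be pointed at another metropolis, the sketch below builds a query URL following the pattern of the ACS flows request used in this notebook, without sending it. The function name `build_flows_url`, the placeholder key, and the choice of Cook County, Illinois (state FIPS 17, county FIPS 031) are our own assumptions for illustration; which variables and years the endpoint supports should be checked against the Census API documentation.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.census.gov/data/2020/acs/flows"
VARIABLES = ["MOVEDIN", "MOVEDOUT", "MOVEDNET", "FULL1_NAME", "FULL2_NAME"]

def build_flows_url(state_fips, county_fips, api_key):
    """Hypothetical helper: assemble a county-level flows query for any state/county."""
    params = {
        "get": ",".join(VARIABLES),
        "for": "county:" + ",".join(county_fips),
        "in": "state:" + state_fips,
        "key": api_key,
    }
    # Keep ':' and ',' unescaped so the URL reads like the hand-written one.
    return BASE_URL + "?" + urlencode(params, safe=":,")

# Cook County (Chicago), Illinois. "YOUR_KEY" is a placeholder, not a real key.
url = build_flows_url("17", ["031"], api_key="YOUR_KEY")
print(url)
```

Keeping the state and county codes as parameters means the rest of the wrangling code could, in principle, be reused unchanged for each city in a comparison.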


# **7 Contributions<a id="Contributions"></a>**

# **8 References <a id="References"></a>**

- Glaeser, Edward L. 2005: "Urban Colossus: Why is New York America's Largest City?", Economic Policy Review. https://www.newyorkfed.org/medialibrary/media/research/epr/05v11n2/0512glae.pdf

- Jæger, M.M. 2006: “Description as Choice”, Oxford Economic Papers, Vol. 32(3)

- Krumpal, Ivar 2011: "Determinants of social desirability bias in sensitive surveys: a literature review", Quality & Quantity (2013), pp. 2025-2047, Springer Science+Business Media

- PLUTO data dictionary: https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page

# **Datasets**

- Building footprints, [City of New York](https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh) (to be removed)
- PLUTO data, [NYC Planning](https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page)
- [The overall population development of NYC's boroughs](https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_total_pop_1900-2010.xlsx) 
- [The overall development of the foreign born population of NYC's boroughs](https://www.nyc.gov/assets/planning/download/office/planning-level/nyc-population/historical-population/nyc_fb_pop_1900-2010.xlsx)

In [None]:
COUNTY_TO_BOROUGH = {"081": "Queens",
                     "085": "Staten Island",
                     "047": "Brooklyn",
                     "005": "Bronx",
                     "061": "Manhattan"}

In [None]:
url = f"https://api.census.gov/data/2020/acs/flows?get=MOVEDIN,GEOID1,GEOID2,MOVEDOUT,FULL1_NAME,FULL2_NAME,COUNTY2,MOVEDNET&for=county:{COUNTIES}&in=state:{NY_STATE}&key={API_KEY}"
resp = requests.get(url).json()
data = pd.DataFrame(resp[1:], columns = resp[0], dtype = str).dropna()
data.MOVEDNET = data.MOVEDNET.astype(int)
data["from_borough"] = data.county.map(COUNTY_TO_BOROUGH)


to_borough = []
for county in data.COUNTY2:
    try:
        to_borough.append(COUNTY_TO_BOROUGH["0"+county])
    except (KeyError, TypeError):  # unknown or missing county code
        to_borough.append("Outside NYC")
        
data["to_borough"] = to_borough

# Aggregate
data = data.groupby(["from_borough", "to_borough"])[["MOVEDIN", "MOVEDOUT", "MOVEDNET"]].apply(lambda x : x.astype(int).sum()).reset_index()

#data = data[data.from_borough != data.to_borough]

In [None]:
data
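
County FIPS codes are only unique within a state, so a three-digit code such as `005` can also occur outside New York. A minimal sketch of a more defensive lookup, assuming `GEOID2` holds the five-digit state-plus-county FIPS code of the other county (an assumption that should be checked against the actual API response); `GEOID_TO_BOROUGH` and `to_borough_name` are hypothetical names:

```python
NY_STATE_FIPS = "36"

COUNTY_TO_BOROUGH = {"081": "Queens",
                     "085": "Staten Island",
                     "047": "Brooklyn",
                     "005": "Bronx",
                     "061": "Manhattan"}

# Prefix every county code with the state code to get an unambiguous key.
GEOID_TO_BOROUGH = {NY_STATE_FIPS + county: borough
                    for county, borough in COUNTY_TO_BOROUGH.items()}

def to_borough_name(geoid):
    """Hypothetical helper: borough for a five-digit FIPS code, else 'Outside NYC'."""
    if geoid is None:
        return "Outside NYC"
    return GEOID_TO_BOROUGH.get(str(geoid).zfill(5), "Outside NYC")

print(to_borough_name("36005"))  # Bronx County, New York
print(to_borough_name("34005"))  # county code 005 in another state
```

Using `dict.get` with a default also removes the need for a `try`/`except` around the lookup.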

# OLD

Our dataset of the buildings and their related plots of land is essential for our initial analysis and illustration of New York City's evolution throughout time. This dataset is a combination of two datasets: one dataset from [City of New York](https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh) on the building footprints of New York, and another dataset from [NYC Planning](https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page) on the land plots of NYC. In the following code we load, clean, and merge the two datasets.

In [None]:
# ## Loading the two datasets

# # Building footprints from https://data.cityofnewyork.us/Housing-Development/Building-Footprints/nqwf-w8eh
# # Documentation https://github.com/CityOfNewYork/nyc-geo-metadata/blob/master/Metadata/Metadata_BuildingFootprints.md
# building_footprints = gpd.read_file("Exam_datasets/Building_Footprints.geojson") 

# # NYC Borough GeoJson
# new_york_boroughs_map = gpd.read_file("https://raw.githubusercontent.com/codeforgermany/click_that_hood/main/public/data/new-york-city-boroughs.geojson")

# # PLUTO Data from https://www.nyc.gov/site/planning/data-maps/open-data/dwn-pluto-mappluto.page
# columns_subset = ["borough","cd", "latitude", "longitude", 'landuse', "assesstot", "numbldgs",
#                   "numfloors", "unitstotal", "bldgarea", "comarea", "resarea", "bbl"]
# land_use_dataaset = pd.read_csv("Exam_datasets/pluto_22v3_1.csv")[columns_subset]

In [None]:
# buildings_and_landuse = pd.merge(building_footprints, land_use_dataaset, on = "bbl", how = "outer", indicator = True)

In [None]:
# print("Total amount of unique buildings in the dataset: {}".format(len(buildings_and_landuse)))
# print("Total amount of unique land plots: {}".format(buildings_and_landuse["bbl"].nunique()))
# print("The dataset has {} columns".format(len(buildings_and_landuse.columns)))

We combine the two datasets based on the BBL column, which stands for "borough, block and lot". Merging on this variable makes it possible to combine our data on buildings with our data on land lots. Before cleaning, the dataset consists of 31 columns with a total of 10,898,686 buildings and a total of 817,842 unique land plots. Furthermore, we get information on how each building's land plot is used, so we can distinguish between whether a building is built on commercial or residential land.

This is not the final data, as we first do a sanity check of the merge and check how many BBL values from each of the two datasets do not exist in the other.

##### Sanity check on merge between Buildings and land plots

In [None]:
# # We construct two explorative datasets to explore how many buildings which only exist in one of the two datasets
# only_in_building_footprints = buildings_and_landuse.query("_merge == 'left_only'")
# only_in_pluto = buildings_and_landuse.query("_merge == 'right_only'")

# print(f"Buildings not found in Building Footprint: {only_in_pluto.shape[0]}")
# print(f"Buildings not found in PLUTO: {only_in_building_footprints.shape[0]}")

With a dataset consisting of a total of 10,898,686 observations, some missing values seem to be unavoidable. Looking into the amount of missing values, we have a total of 20,443 missing values, which corresponds to 0.2 percent of the total dataset. We think this is rather low, but even a low amount of missing values can be a problem if it is biased towards a certain group or category. Therefore, we will look into the distribution of the missing values.
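
The sanity check described here rests on the merge indicator in pandas. A minimal sketch with toy data (the BBL values and frame names are made up for illustration) shows how an outer merge with `indicator=True` labels each row as present in both tables or only in one, and how the unmatched share is computed:

```python
import pandas as pd

# Toy stand-ins for the two datasets; BBL values are made up.
toy_footprints = pd.DataFrame({"bbl": [1, 1, 2, 3], "heightroof": [30, 12, 45, 20]})
toy_pluto = pd.DataFrame({"bbl": [1, 2, 4], "landuse": [1, 5, 2]})

merged = pd.merge(toy_footprints, toy_pluto, on="bbl", how="outer", indicator=True)

# Count rows per merge status: 'both', 'left_only' or 'right_only'.
status_counts = merged["_merge"].value_counts()
unmatched_share = (merged["_merge"] != "both").mean()

print(status_counts)
print(f"Share of unmatched rows: {unmatched_share:.1%}")
```

Note that a land plot with several buildings (BBL 1 in the toy data) produces one merged row per building, which is why the merged table can have more rows than there are unique land plots.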

In [None]:
# # Create a figure with two subplots
# fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))

# # Plot the first bar plot onto the first subplot
# only_in_pluto.borough.value_counts().plot.bar(title="Distribution of missing buildings from the Building Footprints dataset", ax=ax1)

# # Plot the second bar plot onto the second subplot
# buildings_and_landuse.borough.value_counts().plot.bar(title="Distribution of buildings in the main dataset", ax=ax2)

# # Set a common y-axis label for both subplots
# #fig.text(0.04, 0.5, va='center', rotation='vertical')

# # Display the plot
# plt.tight_layout()
# plt.show()

The Building Footprints data has a total of 6,033 missing values. We look into the distribution of these missing values among the five boroughs of NYC and conclude that the missing values look rather evenly spread. It is apparent that Queens and Brooklyn are the boroughs with the highest number of missing values, but this indicates an unbiased distribution, as these two are also the boroughs with the highest overall number of buildings.
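
The comparison behind this argument can be made explicit by putting both distributions on the same scale: if the missing rows are unbiased, each borough's share of the missing rows should be close to its share of all rows. A small sketch with made-up counts (not the actual figures) and a hypothetical helper `shares`:

```python
# Made-up counts per borough, NOT the real figures.
all_buildings = {"Queens": 450, "Brooklyn": 330, "Staten Island": 120, "Bronx": 60, "Manhattan": 40}
missing_buildings = {"Queens": 44, "Brooklyn": 35, "Staten Island": 11, "Bronx": 6, "Manhattan": 4}

def shares(counts):
    """Hypothetical helper: convert raw counts to shares that sum to 1."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()}

overall = shares(all_buildings)
missing = shares(missing_buildings)

# A large gap for any borough would hint at biased missingness.
gaps = {borough: round(missing[borough] - overall[borough], 3) for borough in overall}
print(gaps)
```

Comparing shares instead of raw counts avoids mistaking a borough with many buildings for a borough with unusually many missing values.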

In [None]:
# # Create a figure with two subplots
# fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(12, 4))

# # Plot the first bar plot onto the first subplot
# only_in_building_footprints.cnstrct_yr_intervals.value_counts().sort_index().plot.bar(title = "Distribution of missing buildings not found in PLUTO", ax=ax1)

# # Plot the second bar plot onto the second subplot
# buildings_and_landuse.cnstrct_yr_intervals.value_counts().sort_index().plot.bar(title = "Distribution of buildings in the main dataset", ax=ax2)

# # Set a common y-axis label for both subplots
# #fig.text(0.04, 0.5, va='center', rotation='vertical')

# # Display the plot
# plt.tight_layout()
# plt.show()

In [None]:
# buildings_and_landuse = buildings_and_landuse.query("_merge == 'both'").copy()
# buildings_and_landuse.cnstrct_yr = buildings_and_landuse.cnstrct_yr.astype("int64")
# # Dimensions of dataset 
# buildings_and_landuse.shape

This has been a presentation of one of the three essential datasets used in this project, the buildings and land use dataset. We have explained the overall dataset and argue that the amount of missing data when combining the two datasets is minuscule and unbiased. After cleaning and merging, we have a dataset containing 31 columns with 1,069,453 rows of unique buildings and their related land plots.

In [None]:
# import matplotlib.gridspec as gridspec

# # Create a 2x1 grid for the top plot and the bottom row of subplots
# fig = plt.figure(figsize=(20, 12))
# gs = gridspec.GridSpec(nrows=2, ncols=1, height_ratios=[1, 1])

# # Create the top plot
# ax1 = fig.add_subplot(gs[0])
# residential_squarefeet_by_decade = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").resarea.sum().sort_index()
# commercial_squarefeet_by_decade = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").comarea.sum().sort_index()
# residential_squarefeet_by_decade.plot.bar(ax=ax1, color="#E9655C", alpha=0.8, width=0.8)
# commercial_squarefeet_by_decade.plot.bar(ax=ax1, bottom=residential_squarefeet_by_decade, alpha=0.8, color="skyblue", width=0.8)
# ax1.set_title("Squarefeet by Decade in New York ", size=22)
# ax1.set_ylabel("Square Feet (in hundred millions)", size=20)
# ax1.set_xlabel("")
# ax1.yaxis.get_offset_text().set_visible(False) # Removes the scientific notation
# ax1.legend(["Residential", "Commercial"], fontsize="x-large")
# ax1.tick_params(axis='y', labelsize=15) # Only change ytick font size
# ax1.spines['top'].set_visible(False)
# ax1.spines['right'].set_visible(False)

# # Create a 1x5 grid for the subplots
# gs_subplots = gridspec.GridSpecFromSubplotSpec(nrows=1, ncols=5, subplot_spec=gs[1], wspace=0.1)

# categories = land_use_dataaset_trimmed['borough'].unique()

# # Find the maximum value of residential and commercial square feet across all boroughs
# max_residential = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").resarea.sum().max()
# max_commercial = land_use_dataaset_trimmed.groupby("yearbuilt_intervals").comarea.sum().max()
# y_max = max(max_residential, max_commercial)

# for i, category in enumerate(categories):
#     ax = fig.add_subplot(gs_subplots[i], sharey=ax1)

#     residential_squarefeet_by_decade = land_use_dataaset_trimmed[land_use_dataaset_trimmed['borough'] == category].groupby("yearbuilt_intervals").resarea.sum().sort_index()
#     commercial_squarefeet_by_decade = land_use_dataaset_trimmed[land_use_dataaset_trimmed['borough'] == category].groupby("yearbuilt_intervals").comarea.sum().sort_index()

#     ax.bar(residential_squarefeet_by_decade.index, residential_squarefeet_by_decade.values, color="#E9655C", alpha=0.8, width=0.8)
#     ax.bar(commercial_squarefeet_by_decade.index, commercial_squarefeet_by_decade.values, bottom=residential_squarefeet_by_decade.values, color="skyblue", alpha=0.8, width=0.8)
#     ax.yaxis.get_offset_text().set_visible(False) # Removes the scientific notation

#     ax.set_title(category, size=18)
#     ax.set_xlabel("")
#     ax.set_xticks(np.arange(0, len(residential_squarefeet_by_decade.index), 1))
#     ax.set_xticklabels(residential_squarefeet_by_decade.index, rotation=90, fontsize=12)

#     if i == 0:
#         ax.set_ylabel("Square Feet (in hundred millions)", size=20)
# #        ax.legend(["Residential", "Commercial"], fontsize="x-large") # Add legend 
    
#     ax.spines['top'].set_visible(False)
#     ax.spines['right'].set_visible(False)

# plt.tight_layout()
# plt.show()
