# Data Bootcamp Project Outline
## Inès Ajimi

### New York City Snapshot

New York City has changed considerably in the past ten years -- but how?

The project will focus on current concerns about the **geographic dispersion of socio-economic inequality** in NYC, at the census tract level. The project is wide enough in scope to yield interesting exploratory results (think cool color-coded maps) but could also be used to try to predict future changes in the city (particularly housing trends).

So far, the project relies primarily on the **American Community Survey** (ACS). The ACS is an ongoing Census Bureau survey of demographic, social, economic, and housing characteristics; this project uses its 5-year estimates. The census tract is small enough to provide a fine-grained view of NYC.

Variables of interest at the Census Tract level:
- Total Population
  - by sex, age, origin, household type
- Population per Acre (population density)
- School Enrollment 
- Educational Attainment
- Employment Status
- Occupation
- Commute Time + Mode of Commute
- Income
  - including % below poverty line
- Health Insurance Coverage by type
- Housing
  - age, number of rooms, rent as % of income

There are two problems with the ACS:
- it is fairly recent, so it only goes back to roughly 2005
- it is survey data, so it is subject to sampling error
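
The sampling error can be quantified. The ACS publishes a margin of error at the 90 percent confidence level next to each estimate (in the API, the margin-of-error variable ends in `M` instead of `E`, e.g. `B01001_001M`). Below is a minimal sketch of turning a margin of error into a coefficient of variation, which could be used to flag unreliable tract estimates. The helper name and the numbers are illustrative assumptions, not real ACS values.

```python
Z_90 = 1.645  # z-score for the 90% confidence level used by ACS margins of error


def coefficient_of_variation(estimate, moe):
    """Return the CV (standard error / estimate), or None when the estimate is 0."""
    if estimate == 0:
        return None
    standard_error = moe / Z_90
    return standard_error / estimate


# Illustrative numbers only, not real ACS values.
cv = coefficient_of_variation(estimate=2000, moe=329)
print(round(cv, 3))
```

A higher CV means a noisier estimate; which threshold counts as "too noisy" is a judgment call for the analysis.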

The project would have the following structure:
- look at some salient facts of the city today
  - who lives where, earns what, etc
- use older ACS data to observe socio-economic changes in neighborhoods, paying specific attention to issues of income and racial inequality
- if time and resources allow, use additional information (e.g. Yelp API + NYC Open Data) to find out *what drives the location decisions of individuals*
  - rents, public 'goods' (e.g. parks, response time to 311 calls), access to subway, quality of schools, attractiveness of location (possible proxy: filming permits, "hotness" of nearby bars/restaurants measured by Yelp reviews), etc
  - a concern about the above is whether the geographic unit of analysis can be matched to the ACS
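
One way the matching concern could be handled for point data (e.g. a 311 call or a restaurant with coordinates) is to test which tract polygon contains each point. The sketch below uses a plain ray-casting test on toy squares. The function names, tract ids, and coordinates are all hypothetical, and real tract boundaries would come from a shapefile.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order; the last vertex
    connects back to the first.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray going right from the point cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_tract(x, y, tracts):
    """Return the id of the first tract whose polygon contains the point, else None."""
    for tract_id, polygon in tracts.items():
        if point_in_polygon(x, y, polygon):
            return tract_id
    return None


# Toy square "tracts" with hypothetical ids, not real tract boundaries.
toy_tracts = {
    "A": [(0, 0), (1, 0), (1, 1), (0, 1)],
    "B": [(1, 0), (2, 0), (2, 1), (1, 1)],
}
print(assign_tract(1.5, 0.5, toy_tracts))
```

In practice, a spatial join from a geospatial library such as geopandas would likely be a better fit than hand-written geometry.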

The results will be presented using mapping and geo-visualization packages.

--- 

#### Current Progress

In [1]:
import pandas as pd
import numpy  as np
import matplotlib.pyplot as plt

from census import Census
from us import states

In [2]:
my_api_key = 'put key here' 
c = Census(my_api_key)

I use the Census API to retrieve information for every *census tract* in New York County (Manhattan); the other four boroughs would need their own county codes. These are a few examples of variables I could use for my project.

In [3]:
# ACS variable codes; the "E" suffix marks an estimate
code = ("NAME", "B01001_001E", "B00002_001E", "B02001_002E", "B02001_003E", "B02001_005E", "B02001_008E",
        "B03003_001E", "B07001_001E", "B07004A_001E", "B07004B_001E", "B08303_001E", "B08302_001E")
nyc = c.acs5.state_county_tract(code, states.NY.fips, "061", Census.ALL)  # 061 is the FIPS code of New York County (Manhattan)
nyc = pd.DataFrame(nyc)
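
The query above covers only county `061`. New York City spans five counties, so a city-wide table needs one request per county. The sketch below shows one way to loop over them. The helper `fetch_all_boroughs` and the stand-in `fake_fetch` are hypothetical, and the loop takes the fetch function as an argument so it can be tried without an API key.

```python
# FIPS codes of the five counties (boroughs) that make up New York City.
NYC_COUNTY_FIPS = {
    "Bronx": "005",
    "Brooklyn": "047",       # Kings County
    "Manhattan": "061",      # New York County
    "Queens": "081",
    "Staten Island": "085",  # Richmond County
}


def fetch_all_boroughs(fetch, fields):
    """Call `fetch(fields, county_fips)` per borough and concatenate the rows."""
    rows = []
    for borough, county_fips in NYC_COUNTY_FIPS.items():
        for row in fetch(fields, county_fips):
            rows.append({**row, "borough": borough})
    return rows


# Stand-in for the real API call, returning one fake row per county.
def fake_fetch(fields, county_fips):
    return [{"county": county_fips, "tract": "000100"}]


rows = fetch_all_boroughs(fake_fetch, ("NAME",))
print(len(rows))
```

In the notebook, `fetch` could be something like `lambda fields, county: c.acs5.state_county_tract(fields, states.NY.fips, county, Census.ALL)`, with the result passed to `pd.DataFrame` as before.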

In [4]:
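# Give the ACS codes readable names. Codes ending in _001E are table totals, not a specific
# subgroup (e.g. 'hisp', B03003_001E, equals 'pop' in the output below), and 'housing_units'
# (B00002_001E) is the unweighted sample count of housing units, not the total housing stock.
nyc = nyc.rename(columns = {'B00002_001E': 'housing_units', 'B01001_001E': 'pop', 'B02001_002E': 'white', 'B02001_003E': 'af_am',
       'B02001_005E': 'asian', 'B02001_008E': 'mixed', 'B03003_001E': 'hisp', 'B07001_001E': 'geomob',
       'B07004A_001E': 'geomob_white', 'B07004B_001E': 'geomob_af_am', 'B08303_001E': 'travel_time_work', 
                            'B08302_001E': 'time_leave4_work', 'NAME': 'name'    
})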

In [6]:
nyc.head(3)

Unnamed: 0,housing_units,pop,white,af_am,asian,mixed,hisp,geomob,geomob_white,geomob_af_am,time_leave4_work,travel_time_work,name,county,state,tract
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,"Census Tract 1, New York County, New York",61,36,100
1,83.0,2791.0,643.0,179.0,1310.0,20.0,2791.0,2791.0,643.0,179.0,1056.0,1056.0,"Census Tract 2.01, New York County, New York",61,36,201
2,160.0,7768.0,2853.0,1141.0,2186.0,462.0,7768.0,7768.0,2853.0,1141.0,2386.0,2386.0,"Census Tract 2.02, New York County, New York",61,36,202


The names of the census tracts are fairly uninformative (where's census tract 2.02, census tract 5?) so I merge the table with [NYC Planning](https://www1.nyc.gov/site/planning/data-maps/nyc-population/geographic-reference.page)'s Census Tract to Neighborhood Tabulation Area conversion table.

In [7]:
neighb = pd.read_excel(r"https://www1.nyc.gov/assets/planning/download/office/data-maps/nyc-population/census2010/nyc2010census_tabulation_equiv.xlsx", skiprows = 3)

Since we are going to merge these two datasets, we need to make sure they are 'compatible'. This involves renaming columns and checking that the columns we're going to merge on have the same keys.

We first rename the columns and keep only the columns of interest in the conversion table dataframe:

In [8]:
neighb.columns

Index(['Borough', '2010 Census Bureau FIPS County Code',
       '2010 NYC Borough Code', '2010 Census Tract', 'PUMA',
       'Neighborhood Tabulation Area (NTA)', 'Unnamed: 6'],
      dtype='object')

In [9]:
neighb = neighb.rename(columns = {'2010 Census Tract': 'tract', '2010 Census Bureau FIPS County Code': 'county', 
                                  'Unnamed: 6': 'neighborhood'})

In [10]:
neighb = neighb[["tract", "county", "neighborhood"]]

Then we check the dtypes of both DataFrames:

In [11]:
neighb.dtypes

tract           float64
county          float64
neighborhood     object
dtype: object

In [12]:
nyc[["tract", "county"]].dtypes

tract     object
county    object
dtype: object

There is a clear problem here: the columns that will be used as merge keys have different types. They are also formatted differently: the Census API returns zero-padded strings (county `061`, six-digit tracts such as `003400`), while the keys in the neighborhood table were read as numbers and lost their leading zeros.

In [13]:
# Zero-pad the keys so they match the Census API strings (6-digit tract, 3-digit county);
# rows with a missing key get an empty string.
neighb["tract"] = neighb["tract"].apply(lambda x: str(int(x)).zfill(6) if pd.notna(x) else "")
neighb["county"] = neighb["county"].apply(lambda x: str(int(x)).zfill(3) if pd.notna(x) else "")

In [14]:
neighb.dtypes

tract           object
county          object
neighborhood    object
dtype: object

Let's now merge the datasets.

In [15]:
nyc = pd.merge(nyc, neighb, on  = ("county", "tract"), how = "left")
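
Because a left merge silently keeps unmatched rows, it can help to check how many tracts actually found a neighborhood. The sketch below uses toy tables with made-up values to show two standard `pd.merge` options, `validate` and `indicator`, that could be used for that check.

```python
import pandas as pd

# Toy stand-ins for the two tables; the values are made up.
left = pd.DataFrame({"county": ["061", "061"], "tract": ["000100", "000201"]})
right = pd.DataFrame({
    "county": ["061"],
    "tract": ["000100"],
    "neighborhood": ["Example NTA"],
})

# validate="m:1" raises if a (county, tract) pair appears more than once on the right;
# indicator=True adds a "_merge" column saying where each row was found.
merged = pd.merge(left, right, on=["county", "tract"], how="left",
                  validate="m:1", indicator=True)

unmatched = merged[merged["_merge"] == "left_only"]
print(unmatched[["county", "tract"]])
```

A large number of `left_only` rows would suggest that the keys are still formatted differently in the two tables.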

#### Removing NAs

Some census tracts have no data -- signalled by a population or housing-unit count of 0. I remove these observations.

In [16]:
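# .copy() makes the filtered table independent, so adding columns later does not trigger a SettingWithCopyWarning
nyc = nyc[(nyc["pop"] > 0) & (nyc["housing_units"] > 0)].copy()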

#### Demographics

In [17]:
for i in ["white", "af_am", "asian", "mixed", "hisp"]:
    name = "per_" + i
    nyc[name] = nyc[i] / nyc["pop"]
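
The same shares can be computed without an explicit loop. A minimal sketch on toy data with made-up counts, assuming the same column naming scheme as above:

```python
import pandas as pd

# Toy data with made-up counts.
df = pd.DataFrame({"pop": [100.0, 200.0], "white": [50.0, 20.0], "asian": [25.0, 80.0]})

groups = ["white", "asian"]
# Divide every group column by the population column row by row,
# then prefix the resulting column names with "per_".
shares = df[groups].div(df["pop"], axis=0).add_prefix("per_")
df = df.join(shares)
print(df)
```

Despite the `per_` prefix, the values are shares between 0 and 1 rather than percentages.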

In [18]:
# Ratio of population to the housing_units column (not population per acre)
nyc["density"] = nyc["pop"]/nyc["housing_units"]

Let's look at what we have for NYU's neighborhood:

In [19]:
nyc[nyc["tract"] == "003400"]

Unnamed: 0,housing_units,pop,white,af_am,asian,mixed,hisp,geomob,geomob_white,geomob_af_am,...,county,state,tract,neighborhood,per_white,per_af_am,per_asian,per_mixed,per_hisp,density
34,157.0,6476.0,5104.0,96.0,960.0,151.0,6476.0,6476.0,5104.0,96.0,...,61,36,3400,East Village,0.788141,0.014824,0.14824,0.023317,1.0,41.248408
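
Tract-level counts like these could feed a summary measure of how unevenly two groups are spread across the city. One common choice is the index of dissimilarity, D = ½ Σ |aᵢ/A − bᵢ/B|, where aᵢ and bᵢ are the two groups' counts in tract i and A and B are their city-wide totals. The sketch below uses toy counts, and the function name is hypothetical.

```python
def dissimilarity_index(group_a, group_b):
    """Index of dissimilarity between two groups across areas (0 = even, 1 = fully separated)."""
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("both groups need a positive total")
    return 0.5 * sum(abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b))


# Toy counts per tract, not real data.
white = [90, 10]
af_am = [10, 90]
print(dissimilarity_index(white, af_am))
```

With the table above, the call could look like `dissimilarity_index(nyc["white"], nyc["af_am"])`.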


--- 