In this notebook Rafa and I do some exploratory analysis to set up our cleaning functions.  We want to pick out the organizations that are both located in NYC and involved in art.  The end goal is a function that takes a raw csv as input and outputs only the organizations we want, also as a csv.  The purpose of this notebook is to document some of our thought process when deciding which organizations to include.  It also includes some function prototypes that operate on dataframes.  The prototypes go through a couple of iterations, so the best versions are at the end.  The final functions will be in a different document.

In [1]:
import pandas as pd

Read in the data we're interested in.  These csvs take up about 9 GB of space altogether and will probably hang a computer with less than 16 GB of RAM if you try to read them all in like this.  I did it on a Kaggle kernel, which had enough memory.

The IRS dataset includes 81 csvs: three different types for each year from 1989 to 2015.  The three types are Public Charity (pc), Private Foundation (pf), and Other (co).  The naming convention I've used here puts the type first, followed by the year.  I couldn't load every year, so I took a sample of roughly every fifth year (1989, 1995, 2000, 2005, 2010, 2015).
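If memory is the bottleneck, one option is to read only the columns needed for the location check.  The following is a minimal sketch, not what was run in this notebook: `load_location_columns` and `LOCATION_COLUMNS` are hypothetical names, and it assumes it is acceptable to read every selected column as text.

```python
import io
import pandas as pd

# Hypothetical default: the identifier and location columns used later in this notebook.
LOCATION_COLUMNS = ("EIN", "NAME", "ADDRESS", "CITY", "STATE", "ZIP", "ZIP5", "FIPS")

def load_location_columns(path_or_buffer, columns=LOCATION_COLUMNS):
    """Read only the requested columns, skipping any the file does not have."""
    wanted = set(columns)
    return pd.read_csv(
        path_or_buffer,
        usecols=lambda name: name in wanted,  # a callable tolerates missing columns
        dtype=str,                            # read codes as text so types stay consistent
    )

# Toy csv standing in for one of the real files.
sample = io.StringIO("EIN,NAME,ZIP5,FIPS,TOTREV\n1,ORG A,10025,36061,500\n2,ORG B,,,0\n")
small = load_location_columns(sample)
print(small)
```

Reading the codes as strings would also sidestep mixed float/string columns, at the cost of having to compare against string ZIP and FIPS lists.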

In [2]:
pc_1989 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core1989pc.csv')
pc_1995 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core1995pc.csv')
pc_2000 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core2000pc.csv')
pc_2005 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core2005pc.csv')
pc_2010 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core2010pc.csv')
pc_2015 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core2015pc.csv')
pf_1989 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core1989pf.csv')
pf_1995 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core1995pf.csv')
pf_2000 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core2000pf.csv')
pf_2005 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core2005pf.csv')
pf_2010 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core2010pf.csv')
pf_2015 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core2015pf.csv')
co_1989 = pd.read_csv('../input/irs-5yrsample-other/coreco.core1989co.csv')
co_1995 = pd.read_csv('../input/irs-5yrsample-other/coreco.core1995co.csv')
co_2000 = pd.read_csv('../input/irs-5yrsample-other/coreco.core2000co.csv')
co_2005 = pd.read_csv('../input/irs-5yrsample-other/coreco.core2005co.csv')
co_2010 = pd.read_csv('../input/irs-5yrsample-other/coreco.core2010co.csv')
co_2015 = pd.read_csv('../input/irs-5yrsample-other/coreco.core2015co.csv')


Setting up some lists and dictionaries that I think I might need later, and printing the shape of each dataframe.  Worth noting that the dataframes have pretty different columns.  This will be a problem for us to look at another day.

In [3]:
datasets = [pc_1989, pc_1995, pc_2000, pc_2005, pc_2010, pc_2015, pf_1989, pf_1995,
            pf_2000, pf_2005, pf_2010, pf_2015, co_1989, co_1995, co_2000, co_2005,
            co_2010, co_2015]
datasetnames = ['pc_1989', 'pc_1995', 'pc_2000', 'pc_2005', 'pc_2010', 'pc_2015', 'pf_1989', 'pf_1995',
            'pf_2000', 'pf_2005', 'pf_2010', 'pf_2015', 'co_1989', 'co_1995', 'co_2000', 'co_2005',
            'co_2010', 'co_2015']
name_dict = {name:dataset for name, dataset in zip(datasetnames, datasets)}
for name in name_dict.keys():
    print(name, 'has shape', name_dict[name].shape)

pc_1989 has shape (137459, 115)
pc_1995 has shape (190531, 116)
pc_2000 has shape (252006, 198)
pc_2005 has shape (315224, 185)
pc_2010 has shape (367146, 233)
pc_2015 has shape (429338, 154)
pf_1989 has shape (42174, 105)
pf_1995 has shape (51278, 107)
pf_2000 has shape (74407, 179)
pf_2005 has shape (87813, 177)
pf_2010 has shape (102130, 174)
pf_2015 has shape (109983, 170)
co_1989 has shape (226667, 104)
co_1995 has shape (202014, 112)
co_2000 has shape (122991, 148)
co_2005 has shape (157211, 134)
co_2010 has shape (166478, 154)
co_2015 has shape (147772, 154)
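Since the column counts differ so much, it may help to know which columns every dataframe shares.  A minimal sketch on toy frames (`shared_columns` is a hypothetical helper, not part of the final functions):

```python
import pandas as pd

def shared_columns(frames):
    """Return the sorted column names present in every dataframe in `frames`."""
    frame_iter = iter(frames.values())
    common = set(next(frame_iter).columns)
    for frame in frame_iter:
        common &= set(frame.columns)  # keep only names seen in every frame so far
    return sorted(common)

# Toy frames standing in for name_dict.
toy = {
    "a": pd.DataFrame(columns=["EIN", "ZIP5", "NTEE1"]),
    "b": pd.DataFrame(columns=["EIN", "ZIP5", "NTEECC"]),
}
print(shared_columns(toy))
```

On the real data this would be called as `shared_columns(name_dict)`.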


In Rafa's initial analysis, he recommended using the ZIP5 column to pick out the organizations in NYC.  I wanted to double check this for data from other years.  The following cell prints the number of rows missing data in the columns CITY, FIPS, ZIP5, ZIP, and ADDRESS.

Taking a rough glance at the output, it looks like ZIP5 is missing the fewest entries in most of the dataframes.  Also worth noting is that some of the 'other' years are missing tens of thousands of entries.

In [4]:
#looks like ZIP5 is the best option to start with, can try to catch some missing ones after
for name, df in name_dict.items():
    print(name, 'Rows missing in column CITY:', df.CITY.isna().sum(),
          '  FIPS:', df.FIPS.isna().sum(),
          '  ZIP5:', df.ZIP5.isna().sum(),
          '  ZIP:', df.ZIP.isna().sum(),
          '  ADDRESS:', df.ADDRESS.isna().sum())

pc_1989 Rows missing in column CITY: 191   FIPS: 1606   ZIP5: 191   ZIP: 191   ADDRESS: 3623
pc_1995 Rows missing in column CITY: 304   FIPS: 2171   ZIP5: 299   ZIP: 299   ADDRESS: 379
pc_2000 Rows missing in column CITY: 1379   FIPS: 1752   ZIP5: 1356   ZIP: 1356   ADDRESS: 1379
pc_2005 Rows missing in column CITY: 690   FIPS: 868   ZIP5: 436   ZIP: 436   ADDRESS: 690
pc_2010 Rows missing in column CITY: 302   FIPS: 763   ZIP5: 14   ZIP: 14   ADDRESS: 378
pc_2015 Rows missing in column CITY: 962   FIPS: 1525   ZIP5: 939   ZIP: 6778   ADDRESS: 973
pf_1989 Rows missing in column CITY: 486   FIPS: 1044   ZIP5: 484   ZIP: 484   ADDRESS: 6826
pf_1995 Rows missing in column CITY: 292   FIPS: 889   ZIP5: 285   ZIP: 285   ADDRESS: 2206
pf_2000 Rows missing in column CITY: 350   FIPS: 601   ZIP5: 343   ZIP: 343   ADDRESS: 350
pf_2005 Rows missing in column CITY: 191   FIPS: 305   ZIP5: 171   ZIP: 171   ADDRESS: 191
pf_2010 Rows missing in column CITY: 283   FIPS: 406   ZIP5: 283   ZIP: 283   A
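The same counts are easier to compare as a table.  Here is a minimal sketch on toy data, assuming some frames may lack some of the columns; `missing_report` is a hypothetical helper name.

```python
import pandas as pd

def missing_report(frames, columns):
    """Count missing values per column for each frame; None marks an absent column."""
    rows = {}
    for name, frame in frames.items():
        rows[name] = {
            col: int(frame[col].isna().sum()) if col in frame.columns else None
            for col in columns
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy frames standing in for name_dict.
toy = {
    "a": pd.DataFrame({"CITY": ["NEW YORK", None], "ZIP5": [10025.0, None]}),
    "b": pd.DataFrame({"ZIP5": [None, None]}),  # no CITY column at all
}
report = missing_report(toy, ["CITY", "ZIP5"])
print(report)
```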

Here is the list of ZIP codes from Rafa's notebook.  A quick Google search suggests that zip codes don't change much over time, so it should work pretty well to use the same list in any year from 1989-2015.  We will need to come back and double check this.

In [5]:
# copied from Rafa's notebook on github
# check how zip codes change over time
NYZIPS = [10453, 10457, 10460,
        10458, 10467, 10468,
        10451, 10452, 10456,
        10454, 10455, 10459, 10474,
        10463, 10471,
        10466, 10469, 10470, 10475,
        10461, 10462,10464, 10465, 10472, 10473,
        11212, 11213, 11216, 11233, 11238,
        11209, 11214, 11228,
        11204, 11218, 11219, 11230,
        11234, 11236, 11239,
        11223, 11224, 11229, 11235,
        11201, 11205, 11215, 11217, 11231,
        11203, 11210, 11225, 11226,
        11207, 11208,
        11211, 11222,
        11220, 11232,
        11206, 11221, 11237,
        10026, 10027, 10030, 10037, 10039,
        10001, 10011, 10018, 10019, 10020, 10036,
        10029, 10035,
        10010, 10016, 10017, 10022,
        10012, 10013, 10014,
        10004, 10005, 10006, 10007, 10038, 10280,
        10002, 10003, 10009,
        10021, 10028, 10044, 10065, 10075, 10128,
        10023, 10024, 10025,
        10031, 10032, 10033, 10034, 10040,
        11361, 11362, 11363, 11364,
        11354, 11355, 11356, 11357, 11358, 11359, 11360,
        11365, 11366, 11367,
        11412, 11423, 11432, 11433, 11434, 11435, 11436,
        11101, 11102, 11103, 11104, 11105, 11106,
        11374, 11375, 11379, 11385,
        11691, 11692, 11693, 11694, 11695, 11697,
        11004, 11005, 11411, 11413, 11422, 11426, 11427, 11428, 11429,
        11414, 11415, 11416, 11417, 11418, 11419, 11420, 11421,
        11368, 11369, 11370, 11372, 11373, 11377, 11378,
        10302, 10303, 10310,
        10306, 10307, 10308, 10309, 10312,
        10301, 10304, 10305,
        10314]
NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0]

Let's take a look at some instances missing ZIP5 values.

In [6]:
pc_2015.loc[pc_2015.ZIP5.isna(), ['EIN','NAME', 'TOTREV', 'ADDRESS', 'ZIP', 'ZIP5', 'FIPS', 'STATE']]

Unnamed: 0,EIN,NAME,TOTREV,ADDRESS,ZIP,ZIP5,FIPS,STATE
4332,20712445,,25300.0,,19124-3627,,,
12586,43804540,,0.0,,,,,
15451,61135265,,27696.0,,06443-4000,,,
18346,112296115,,1680795.0,,11428-2047,,,
20498,113430805,,33845.0,,11561-5023,,,
30192,141911784,,23373.0,,01569-1644,,,
30267,141938025,,54450.0,,30736-0897,,,
53656,208351795,,97141.0,,90201-4523,,,
62300,223550617,,0.0,,08232-2535,,,
67123,232636992,,15003.0,,19150-1215,,,


In [7]:
pc_2010.loc[pc_2010.ZIP5.isna(), ['EIN','NAME', 'TOTREV','ADDRESS', 'ZIP', 'ZIP5', 'FIPS', 'STATE']]

Unnamed: 0,EIN,NAME,TOTREV,ADDRESS,ZIP,ZIP5,FIPS,STATE
68617,205690632,,520474,,,,,
83961,223840394,,125456,,,,,
112759,260708942,,805461,,,,,
123960,263676303,,49063,,,,,
127071,264633829,,52771,,,,,
137549,300538381,,100000,,,,,
189352,391905679,,153869,,,,,
200991,421629062,,200054,,,,,
284936,710791653,,69982,,,,,
334977,900472490,,48188,,,,,


In [8]:
co_2015.loc[co_2015.ZIP5.isna(), ['EIN','NAME', 'TOTREV', 'ADDRESS', 'ZIP', 'ZIP5', 'FIPS', 'STATE']]

Unnamed: 0,EIN,NAME,TOTREV,ADDRESS,ZIP,ZIP5,FIPS,STATE
452,10592583,,4852,,11229-4111,,,
1771,36011288,,24855,,05060-0441,,,
3122,42784595,,1302,,01341-0000,,,
3361,43125734,,4865,,01331-1145,,,
3435,43191878,,33239,,02382-1409,,,
3495,43252622,,1003629,,01966-1645,,,
3653,43464646,,10703,,02563-2503,,,
6217,66062878,,31786,,06131-0041,,,
8842,133648357,,-523538,,10036-1308,,,
12599,200590420,,18345,,72110-0388,,,


In [9]:
pf_2015.loc[pf_2015.ZIP5.isna(), ['EIN','NAME','ADDRESS', 'ZIP', 'ZIP5', 'FIPS', 'STATE']]

Unnamed: 0,EIN,NAME,ADDRESS,ZIP,ZIP5,FIPS,STATE
9,10024907,,,,,,
10,10131950,,,,,,
11,10211537,,,,,,
12,10211545,,,,,,
13,10211547,,,,,,
14,10211792,,,,,,
15,10212437,,,,,,
16,10213987,,,,,,
17,10214019,,,,,,
18,10215216,,,,,,


In [10]:
co_2005.loc[co_2005.ZIP5.isna(), ['EIN','NAME','TOTREV', 'ADDRESS', 'ZIP', 'ZIP5', 'FIPS', 'STATE']]

Unnamed: 0,EIN,NAME,TOTREV,ADDRESS,ZIP,ZIP5,FIPS,STATE
229,10303581,,75429,,,,,
277,10345865,,31962,,,,,
284,10348179,,27673,,,,,
309,10363352,,188449,,,,,
495,10497806,,32111,,,,,
523,10527150,,222083,,,,,
539,10536651,,147071,,,,,
562,10553766,,182799,,,,,
590,10588134,,40089,,,,,
594,10590873,,42148,,,,,


A couple of things.  It seems like a lot of the instances missing ZIP5 values are missing every location identifier.  However, they do have an EIN.  One can look up an organization by EIN here:  https://apps.irs.gov/app/eos/.  If we decide it's worth it, this is something we could spend time on later.

For recent years (2015) it seems like there are some entries that are missing a ZIP5 but have a full ZIP.  In these cases we can use the full ZIP to get a location.  However, I'm not going to do this now, because it seems like these entries are missing a lot of other data and thus won't be that useful to us anyway.  Our theory is that the data from recent years hasn't been cleaned as extensively yet by the NCCS.
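For reference, recovering a ZIP5 from a full ZIP could look something like the sketch below.  It is not applied anywhere in this notebook.  `fill_zip5_from_zip` is a hypothetical helper, and it assumes ZIP is a string such as '11428-2047' and ZIP5 is stored as a float, as in the outputs above.

```python
import pandas as pd

def fill_zip5_from_zip(df):
    """Return a copy of df where missing ZIP5 values are taken from the first 5 digits of ZIP."""
    out = df.copy()
    # Pull the leading five digits; rows without them become NaN.
    prefix = out["ZIP"].astype(str).str.extract(r"^(\d{5})", expand=False)
    # Convert to float to match the float ZIP5 column (leading zeros are lost, as in ZIP5).
    from_zip = pd.to_numeric(prefix, errors="coerce")
    out["ZIP5"] = out["ZIP5"].fillna(from_zip)
    return out

toy = pd.DataFrame({
    "ZIP5": [10025.0, None, None],
    "ZIP": ["10025-0000", "11428-2047", None],
})
filled = fill_zip5_from_zip(toy)
print(filled)
```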

Now we're going to write a function that takes a dataframe as input and returns all the rows of that dataframe that are in NYC as a new dataframe.  It's going to check both ZIP5 and FIPS.

In [11]:
# defining function to get only new york city organizations
def get_ny(dataframe):
    # still the zip codes from Rafa's notebook
    NYZIPS = [10453, 10457, 10460,
        10458, 10467, 10468,
        10451, 10452, 10456,
        10454, 10455, 10459, 10474,
        10463, 10471,
        10466, 10469, 10470, 10475,
        10461, 10462,10464, 10465, 10472, 10473,
        11212, 11213, 11216, 11233, 11238,
        11209, 11214, 11228,
        11204, 11218, 11219, 11230,
        11234, 11236, 11239,
        11223, 11224, 11229, 11235,
        11201, 11205, 11215, 11217, 11231,
        11203, 11210, 11225, 11226,
        11207, 11208,
        11211, 11222,
        11220, 11232,
        11206, 11221, 11237,
        10026, 10027, 10030, 10037, 10039,
        10001, 10011, 10018, 10019, 10020, 10036,
        10029, 10035,
        10010, 10016, 10017, 10022,
        10012, 10013, 10014,
        10004, 10005, 10006, 10007, 10038, 10280,
        10002, 10003, 10009,
        10021, 10028, 10044, 10065, 10075, 10128,
        10023, 10024, 10025,
        10031, 10032, 10033, 10034, 10040,
        11361, 11362, 11363, 11364,
        11354, 11355, 11356, 11357, 11358, 11359, 11360,
        11365, 11366, 11367,
        11412, 11423, 11432, 11433, 11434, 11435, 11436,
        11101, 11102, 11103, 11104, 11105, 11106,
        11374, 11375, 11379, 11385,
        11691, 11692, 11693, 11694, 11695, 11697,
        11004, 11005, 11411, 11413, 11422, 11426, 11427, 11428, 11429,
        11414, 11415, 11416, 11417, 11418, 11419, 11420, 11421,
        11368, 11369, 11370, 11372, 11373, 11377, 11378,
        10302, 10303, 10310,
        10306, 10307, 10308, 10309, 10312,
        10301, 10304, 10305,
        10314]
    # FIPS codes also from Rafa's notebook.
    NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0]
    new_df = dataframe[dataframe.ZIP5.isin(NYZIPS) | dataframe.FIPS.isin(NYFIPS)]
    return new_df

Check that our function worked: first going to check rows where the FIPS column says NYC but the ZIP5 column says not NYC.

In [12]:
ny_df = get_ny(pc_2015)
ny_df.loc[ny_df.FIPS.isin(NYFIPS) & ~ny_df.ZIP5.isin(NYZIPS), ['ADDRESS','ZIP5', 'ZIP', 'FIPS']]

Unnamed: 0,ADDRESS,ZIP5,ZIP,FIPS
1279,1 PENN PLZ RM 3000,10119.0,10119-0032,36061.0
2139,79 N 11TH ST,11249.0,11249-1913,36047.0
2255,PO BOX 3357,10008.0,10008-3357,36061.0
2487,223 BROADWAY SUITE 1801,10279.0,10038-1754,36061.0
3616,120 BROADWAY FL 7,10271.0,10271-0021,36061.0
3655,1 PENN PLAZA C/O WENDER LAW NO 2527,10119.0,10119-0002,36061.0
3725,55 WATER STREET,10041.0,10041-0004,36061.0
4057,PO BOX 3852,10163.0,10163-3852,36061.0
4096,616 BEDFORD AVE STE 2B,11249.0,11249-9613,36047.0
4490,PO BOX 3142,10008.0,10008-3142,36061.0


Looks like something's up: there are 785 rows where the FIPS and ZIP5 don't match.  Let's check the zipcodes.
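To see which zipcodes account for most of the mismatches, the rows can be counted per ZIP5.  A minimal sketch on toy data (`zip_counts_outside_list` is a hypothetical helper; the toy zip and FIPS lists are illustrative only):

```python
import pandas as pd

def zip_counts_outside_list(df, zips, fips):
    """Count rows per ZIP5 where FIPS is in `fips` but ZIP5 is not in `zips`."""
    in_fips = df["FIPS"].isin(fips)
    in_zips = df["ZIP5"].isin(zips)
    return df.loc[in_fips & ~in_zips, "ZIP5"].value_counts()

toy = pd.DataFrame({
    "ZIP5": [10119.0, 10119.0, 11249.0, 10025.0, 90210.0],
    "FIPS": [36061.0, 36061.0, 36047.0, 36061.0, 6037.0],
})
counts = zip_counts_outside_list(toy, zips=[10025], fips=[36061.0, 36047.0])
print(counts)
```

With the real data this would be called as `zip_counts_outside_list(pc_2015, NYZIPS, NYFIPS)`.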

In [13]:
ny_df = get_ny(pc_2015)
missing_zips = ny_df.loc[ny_df.FIPS.isin(NYFIPS) & ~ny_df.ZIP5.isin(NYZIPS), 'ZIP5'].unique()
for i in missing_zips:
    print(i, i in NYZIPS)

10119.0 False
11249.0 False
10008.0 False
10279.0 False
10271.0 False
10041.0 False
10163.0 False
10107.0 False
10108.0 False
10113.0 False
10123.0 False
11351.0 False
10115.0 False
10276.0 False
10150.0 False
11439.0 False
11451.0 False
11202.0 False
10170.0 False
11424.0 False
10185.0 False
10122.0 False
11690.0 False
11242.0 False
11352.0 False
10116.0 False
10167.0 False
10282.0 False
11247.0 False
10278.0 False
10121.0 False
10155.0 False
10168.0 False
10281.0 False
10118.0 False
10110.0 False
10158.0 False
10159.0 False
10165.0 False
11241.0 False
10156.0 False
10178.0 False
10120.0 False
10105.0 False
10104.0 False
10175.0 False
10101.0 False
10153.0 False
10268.0 False
10173.0 False
10111.0 False
10311.0 False
10166.0 False
10069.0 False
10272.0 False
10112.0 False
10176.0 False
10162.0 False
10174.0 False
10177.0 False
10151.0 False
11430.0 False
11386.0 False
10106.0 False
10169.0 False
10154.0 False
11109.0 False
11380.0 False
10129.0 False
10103.0 False
10045.0 False
10171.

I checked the first 5 zipcodes in the list and they were all in NYC.  This makes me think our zipcode list is missing some entries.  For now, I'm going to append all the missing zipcodes, but it may be worth it later to come back and check these one by one.  We also need to go back and change our function to have the updated zips.

In [14]:
print('before:', len(NYZIPS), ' adding:', len(missing_zips))
for i in missing_zips:
    NYZIPS.append(i)
print('after:', len(NYZIPS))

before: 178  adding: 77
after: 255


Now if we try the same thing we get an empty dataframe.  Let's also try for some more years.

In [15]:
ny_df = get_ny(pc_2015)
ny_df.loc[ny_df.FIPS.isin(NYFIPS) & ~ny_df.ZIP5.isin(NYZIPS), ['ADDRESS','ZIP5', 'ZIP', 'FIPS']]

Unnamed: 0,ADDRESS,ZIP5,ZIP,FIPS


In [16]:
ny_df = get_ny(pc_2010)
ny_df.loc[ny_df.FIPS.isin(NYFIPS) & ~ny_df.ZIP5.isin(NYZIPS), ['ADDRESS','ZIP5', 'ZIP', 'FIPS']]

Unnamed: 0,ADDRESS,ZIP5,ZIP,FIPS
28914,PO BOX 1436,10138.0,10138-0001,36061
36324,301 GENERAL ROBERT E LEE AVE 2ND FL,11252.0,11252-0000,36047
52369,277 PARK AVE FL 40,10172.0,10172-2902,36061


I'm beginning to worry we're missing a lot of zips, so I'm going to go through each dataframe I loaded and get the zips that aren't in our list.

In [17]:
for df in datasets:
    ny_df = get_ny(df)
    zips = ny_df.loc[ny_df.FIPS.isin(NYFIPS) & ~ny_df.ZIP5.isin(NYZIPS), 'ZIP5'].unique()
    for i in zips:
        if i not in NYZIPS:
            NYZIPS.append(i)
    print(zips)

[11243. 11240. 10015. 10048. 10249. 10285. 10152. 10270. 10102. 10043.
 10172.]
[10109.]
[10081. 11252. 10055. 10313.]
[11251. 10125. 10133. 10117.]
[10138.]
[]
[10164. 10292. 10260.]
[]
[10072. 10080.]
[10179.     0.]
[]
['10021' '10065' '11219' '10022' '10003' '10028' '10122' '11217' '10017'
 '11361' '10013' '10004' '10005' '10001' '10168' '10016' '11210' '10031'
 '11223' '10036' '10018' '11211' '10024' '10019' '10119' '11204' '10023'
 '10128' '11106' '11234' '11120' '11375' '10008' '10274' '11694' '10165'
 '10025' '11249' '10173' '10158' '10471' '10309' '10312' '11427' '10014'
 '10010' '11230' '10170' '11205' '11215' '10177' '11201' '10020' '11238'
 '11231' '10111' '10461' '10150' '10306' '11214' '10007' '11224' '10118'
 '10153' '10110' '10012' '10163' '11218' '11373' '10107' '11367' '11220'
 '10032' '10075' '10115' '11235' '10011' '11415' '10027' '10463' '10002'
 '10026' '11101' '10120' '10103' '10055' '10039']
[11245.0 11256.0 11425.0 10046.0 10199.0 '10123' '10009' '11378' '11229

Now we have some strings in our NYZIPS list, but I'm not too worried about it right now.  All we use it for is checking if a zip belongs to the list, so honestly it may be nice to have the strings too.  I am going to take out the 0.0 though.
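If the mixed types become a problem later, one alternative is to normalize every zip to a 5-character string before comparing.  This is only a sketch of that idea; `normalize_zip` is a hypothetical helper and is not used in the cells that follow.

```python
import math

def normalize_zip(value):
    """Turn 10025, 10025.0, '10025' or '10025-0000' into '10025'; return None if unusable."""
    if value is None:
        return None
    if isinstance(value, float) and math.isnan(value):
        return None
    # Keep only the part before any ZIP+4 suffix.
    text = str(value).strip().split("-")[0]
    try:
        number = int(float(text))
    except (ValueError, OverflowError):
        return None
    if number <= 0:  # drops placeholder values such as 0.0
        return None
    return f"{number:05d}"

print([normalize_zip(v) for v in [10025.0, "10025", "06443-4000", 0.0, float("nan")]])
```

A set built with `{normalize_zip(z) for z in NYZIPS}` would then hold one entry per zip, regardless of whether it arrived as a float or a string.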

In [18]:
# printing it for easier copying and pasting
for j, i in enumerate(NYZIPS):
    if j%10 == 9:
        print(i, end = ', \n')
    else:
        print(i, end=", ")

10453, 10457, 10460, 10458, 10467, 10468, 10451, 10452, 10456, 10454, 
10455, 10459, 10474, 10463, 10471, 10466, 10469, 10470, 10475, 10461, 
10462, 10464, 10465, 10472, 10473, 11212, 11213, 11216, 11233, 11238, 
11209, 11214, 11228, 11204, 11218, 11219, 11230, 11234, 11236, 11239, 
11223, 11224, 11229, 11235, 11201, 11205, 11215, 11217, 11231, 11203, 
11210, 11225, 11226, 11207, 11208, 11211, 11222, 11220, 11232, 11206, 
11221, 11237, 10026, 10027, 10030, 10037, 10039, 10001, 10011, 10018, 
10019, 10020, 10036, 10029, 10035, 10010, 10016, 10017, 10022, 10012, 
10013, 10014, 10004, 10005, 10006, 10007, 10038, 10280, 10002, 10003, 
10009, 10021, 10028, 10044, 10065, 10075, 10128, 10023, 10024, 10025, 
10031, 10032, 10033, 10034, 10040, 11361, 11362, 11363, 11364, 11354, 
11355, 11356, 11357, 11358, 11359, 11360, 11365, 11366, 11367, 11412, 
11423, 11432, 11433, 11434, 11435, 11436, 11101, 11102, 11103, 11104, 
11105, 11106, 11374, 11375, 11379, 11385, 11691, 11692, 11693, 11694, 
11695,

In [19]:
# redefining the function with new zips
def get_ny(dataframe):
    # the zip codes from Rafa's notebook + our new ones
    NYZIPS = [10453.0, 10457.0, 10460.0, 10458.0, 10467.0, 10468.0, 10451.0, 10452.0, 10456.0, 10454.0, 
10455.0, 10459.0, 10474.0, 10463.0, 10471.0, 10466.0, 10469.0, 10470.0, 10475.0, 10461.0, 
10462.0, 10464.0, 10465.0, 10472.0, 10473.0, 11212.0, 11213.0, 11216.0, 11233.0, 11238.0, 
11209.0, 11214.0, 11228.0, 11204.0, 11218.0, 11219.0, 11230.0, 11234.0, 11236.0, 11239.0, 
11223.0, 11224.0, 11229.0, 11235.0, 11201.0, 11205.0, 11215.0, 11217.0, 11231.0, 11203.0, 
11210.0, 11225.0, 11226.0, 11207.0, 11208.0, 11211.0, 11222.0, 11220.0, 11232.0, 11206.0, 
11221.0, 11237.0, 10026.0, 10027.0, 10030.0, 10037.0, 10039.0, 10001.0, 10011.0, 10018.0, 
10019.0, 10020.0, 10036.0, 10029.0, 10035.0, 10010.0, 10016.0, 10017.0, 10022.0, 10012.0, 
10013.0, 10014.0, 10004.0, 10005.0, 10006.0, 10007.0, 10038.0, 10280.0, 10002.0, 10003.0, 
10009.0, 10021.0, 10028.0, 10044.0, 10065.0, 10075.0, 10128.0, 10023.0, 10024.0, 10025.0, 
10031.0, 10032.0, 10033.0, 10034.0, 10040.0, 11361.0, 11362.0, 11363.0, 11364.0, 11354.0, 
11355.0, 11356.0, 11357.0, 11358.0, 11359.0, 11360.0, 11365.0, 11366.0, 11367.0, 11412.0, 
11423.0, 11432.0, 11433.0, 11434.0, 11435.0, 11436.0, 11101.0, 11102.0, 11103.0, 11104.0, 
11105.0, 11106.0, 11374.0, 11375.0, 11379.0, 11385.0, 11691.0, 11692.0, 11693.0, 11694.0, 
11695.0, 11697.0, 11004.0, 11005.0, 11411.0, 11413.0, 11422.0, 11426.0, 11427.0, 11428.0, 
11429.0, 11414.0, 11415.0, 11416.0, 11417.0, 11418.0, 11419.0, 11420.0, 11421.0, 11368.0, 
11369.0, 11370.0, 11372.0, 11373.0, 11377.0, 11378.0, 10302.0, 10303.0, 10310.0, 10306.0, 
10307.0, 10308.0, 10309.0, 10312.0, 10301.0, 10304.0, 10305.0, 10314.0, 10119.0, 11249.0, 
10008.0, 10279.0, 10271.0, 10041.0, 10163.0, 10107.0, 10108.0, 10113.0, 10123.0, 11351.0, 
10115.0, 10276.0, 10150.0, 11439.0, 11451.0, 11202.0, 10170.0, 11424.0, 10185.0, 10122.0, 
11690.0, 11242.0, 11352.0, 10116.0, 10167.0, 10282.0, 11247.0, 10278.0, 10121.0, 10155.0, 
10168.0, 10281.0, 10118.0, 10110.0, 10158.0, 10159.0, 10165.0, 11241.0, 10156.0, 10178.0, 
10120.0, 10105.0, 10104.0, 10175.0, 10101.0, 10153.0, 10268.0, 10173.0, 10111.0, 10311.0, 
10166.0, 10069.0, 10272.0, 10112.0, 10176.0, 10162.0, 10174.0, 10177.0, 10151.0, 11430.0, 
11386.0, 10106.0, 10169.0, 10154.0, 11109.0, 11380.0, 10129.0, 10103.0, 10045.0, 10171.0, 
10286.0, 11371.0, 11120.0, 11431.0, 10274.0, 11243.0, 11240.0, 10015.0, 10048.0, 10249.0, 
10285.0, 10152.0, 10270.0, 10102.0, 10043.0, 10172.0, 10109.0, 10081.0, 11252.0, 10055.0, 
10313.0, 11251.0, 10125.0, 10133.0, 10117.0, 10138.0, 10164.0, 10292.0, 10260.0, 10072.0, 
10080.0, 10179.0, 10021, 10065, 11219, 10022, 10003, 10028, 10122, 
11217, 10017, 11361, 10013, 10004, 10005, 10001, 10168, 10016, 11210, 
10031, 11223, 10036, 10018, 11211, 10024, 10019, 10119, 11204, 10023, 
10128, 11106, 11234, 11120, 11375, 10008, 10274, 11694, 10165, 10025, 
11249, 10173, 10158, 10471, 10309, 10312, 11427, 10014, 10010, 11230, 
10170, 11205, 11215, 10177, 11201, 10020, 11238, 11231, 10111, 10461, 
10150, 10306, 11214, 10007, 11224, 10118, 10153, 10110, 10012, 10163, 
11218, 11373, 10107, 11367, 11220, 10032, 10075, 10115, 11235, 10011, 
11415, 10027, 10463, 10002, 10026, 11101, 10120, 10103, 10055, 10039, 
11245.0, 11256.0, 11425.0, 10046.0, 10199.0, 10123, 10009, 11378, 11229, 10006, 
10038, 10155, 11364, 11418, 10279, 10470, 10468, 11241, 10310, 10467, 
11434, 11372, 10314, 10272, 10048, 10116, 11228, 10308, 10462, 10307, 
10304, 11430, 11358, 11209, 11374, 11354, 11377, 11421, 10286, 11232, 
11245, 10469, 10176, 11385, 10044, 11102, 10459, 11435, 10281, 10034, 
10130.0, 11381.0, 10114.0]
    # FIPS codes also from Rafa's notebook.
    NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0]
    new_df = dataframe[dataframe.ZIP5.isin(NYZIPS) | dataframe.FIPS.isin(NYFIPS)]
    return new_df

Now we're going to do the opposite: check for instances where the ZIP5 belongs to NYC but the FIPS code is outside NYC.

In [20]:
df = pc_2015
ny_df = get_ny(df)
print(ny_df.loc[~ny_df.FIPS.isin(NYFIPS) & ny_df.ZIP5.isin(NYZIPS), ['ADDRESS', 'ZIP5', 'ZIP', 'FIPS']])

       ADDRESS     ZIP5         ZIP  FIPS
370829   STE 2  10025.0  10025-0000   NaN


This is fine.  In fact, it's good, because we are able to pick up an address that doesn't have a FIPS code.  This is the reason to check both columns.  Let's check the rest of the dataframes.

In [21]:
for name in name_dict.keys():
    ny_df = get_ny(name_dict[name])
    print(name, 'number of unmatched entries:', 
          len(ny_df.loc[~ny_df.FIPS.isin(NYFIPS) & ny_df.ZIP5.isin(NYZIPS), ['ADDRESS', 'ZIP5', 'ZIP', 'FIPS']]))
    

pc_1989 number of unmatched entries: 477
pc_1995 number of unmatched entries: 12
pc_2000 number of unmatched entries: 0
pc_2005 number of unmatched entries: 188
pc_2010 number of unmatched entries: 190
pc_2015 number of unmatched entries: 1
pf_1989 number of unmatched entries: 0
pf_1995 number of unmatched entries: 0
pf_2000 number of unmatched entries: 0
pf_2005 number of unmatched entries: 0
pf_2010 number of unmatched entries: 0
pf_2015 number of unmatched entries: 4
co_1989 number of unmatched entries: 257
co_1995 number of unmatched entries: 122
co_2000 number of unmatched entries: 41
co_2005 number of unmatched entries: 68
co_2010 number of unmatched entries: 133
co_2015 number of unmatched entries: 0


Looks like we have a couple dataframes to investigate.

In [22]:
ny_df = get_ny(pc_1989)
ny_df.loc[~ny_df.FIPS.isin(NYFIPS) & ny_df.ZIP5.isin(NYZIPS), ['ADDRESS', 'ZIP5', 'ZIP', 'FIPS']]

Unnamed: 0,ADDRESS,ZIP5,ZIP,FIPS
32783,PO BOX 20240,10025.0,10025-1511,36061
32790,1435 PROSPECT PL,11213.0,11213-2404,36047
32808,112 21 72ND AVE,11375.0,11375-4644,36081
32883,60 E 42ND ST STE 1419,10165.0,10165-1419,36061
32908,180 SECOND AVE,10003.0,10003-5778,36061
32920,295 WOODBINE ST,11237.0,11237-5914,36047
32935,5101 4TH AVE,11220.0,11220-1815,36047
32960,424 W 42ND ST STE 3R,10036.0,10036-6809,36061
32992,PO BOX 297 006,11229.0,11229-0297,36047
32999,BOX 1002,10314.0,10314-0004,36085


In [23]:
type(ny_df.loc[32783, 'FIPS'])

str

The problem here is that this dataframe stores its FIPS codes as strings.  I'll go ahead and add string and int versions of the codes to our list of FIPS.

In [24]:
# redefining the function with new fips
NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0, 36005, 36047, 36061, 36081, 36085,
             '36005', '36047', '36061', '36081', '36085' ]
    
def get_ny(dataframe):
    # zip codes from Rafa's notebook + missing ones
    NYZIPS = [10453.0, 10457.0, 10460.0, 10458.0, 10467.0, 10468.0, 10451.0, 10452.0, 10456.0, 10454.0, 
10455.0, 10459.0, 10474.0, 10463.0, 10471.0, 10466.0, 10469.0, 10470.0, 10475.0, 10461.0, 
10462.0, 10464.0, 10465.0, 10472.0, 10473.0, 11212.0, 11213.0, 11216.0, 11233.0, 11238.0, 
11209.0, 11214.0, 11228.0, 11204.0, 11218.0, 11219.0, 11230.0, 11234.0, 11236.0, 11239.0, 
11223.0, 11224.0, 11229.0, 11235.0, 11201.0, 11205.0, 11215.0, 11217.0, 11231.0, 11203.0, 
11210.0, 11225.0, 11226.0, 11207.0, 11208.0, 11211.0, 11222.0, 11220.0, 11232.0, 11206.0, 
11221.0, 11237.0, 10026.0, 10027.0, 10030.0, 10037.0, 10039.0, 10001.0, 10011.0, 10018.0, 
10019.0, 10020.0, 10036.0, 10029.0, 10035.0, 10010.0, 10016.0, 10017.0, 10022.0, 10012.0, 
10013.0, 10014.0, 10004.0, 10005.0, 10006.0, 10007.0, 10038.0, 10280.0, 10002.0, 10003.0, 
10009.0, 10021.0, 10028.0, 10044.0, 10065.0, 10075.0, 10128.0, 10023.0, 10024.0, 10025.0, 
10031.0, 10032.0, 10033.0, 10034.0, 10040.0, 11361.0, 11362.0, 11363.0, 11364.0, 11354.0, 
11355.0, 11356.0, 11357.0, 11358.0, 11359.0, 11360.0, 11365.0, 11366.0, 11367.0, 11412.0, 
11423.0, 11432.0, 11433.0, 11434.0, 11435.0, 11436.0, 11101.0, 11102.0, 11103.0, 11104.0, 
11105.0, 11106.0, 11374.0, 11375.0, 11379.0, 11385.0, 11691.0, 11692.0, 11693.0, 11694.0, 
11695.0, 11697.0, 11004.0, 11005.0, 11411.0, 11413.0, 11422.0, 11426.0, 11427.0, 11428.0, 
11429.0, 11414.0, 11415.0, 11416.0, 11417.0, 11418.0, 11419.0, 11420.0, 11421.0, 11368.0, 
11369.0, 11370.0, 11372.0, 11373.0, 11377.0, 11378.0, 10302.0, 10303.0, 10310.0, 10306.0, 
10307.0, 10308.0, 10309.0, 10312.0, 10301.0, 10304.0, 10305.0, 10314.0, 10119.0, 11249.0, 
10008.0, 10279.0, 10271.0, 10041.0, 10163.0, 10107.0, 10108.0, 10113.0, 10123.0, 11351.0, 
10115.0, 10276.0, 10150.0, 11439.0, 11451.0, 11202.0, 10170.0, 11424.0, 10185.0, 10122.0, 
11690.0, 11242.0, 11352.0, 10116.0, 10167.0, 10282.0, 11247.0, 10278.0, 10121.0, 10155.0, 
10168.0, 10281.0, 10118.0, 10110.0, 10158.0, 10159.0, 10165.0, 11241.0, 10156.0, 10178.0, 
10120.0, 10105.0, 10104.0, 10175.0, 10101.0, 10153.0, 10268.0, 10173.0, 10111.0, 10311.0, 
10166.0, 10069.0, 10272.0, 10112.0, 10176.0, 10162.0, 10174.0, 10177.0, 10151.0, 11430.0, 
11386.0, 10106.0, 10169.0, 10154.0, 11109.0, 11380.0, 10129.0, 10103.0, 10045.0, 10171.0, 
10286.0, 11371.0, 11120.0, 11431.0, 10274.0, 11243.0, 11240.0, 10015.0, 10048.0, 10249.0, 
10285.0, 10152.0, 10270.0, 10102.0, 10043.0, 10172.0, 10109.0, 10081.0, 11252.0, 10055.0, 
10313.0, 11251.0, 10125.0, 10133.0, 10117.0, 10138.0, 10164.0, 10292.0, 10260.0, 10072.0, 
10080.0, 10179.0, 10021, 10065, 11219, 10022, 10003, 10028, 10122, 
11217, 10017, 11361, 10013, 10004, 10005, 10001, 10168, 10016, 11210, 
10031, 11223, 10036, 10018, 11211, 10024, 10019, 10119, 11204, 10023, 
10128, 11106, 11234, 11120, 11375, 10008, 10274, 11694, 10165, 10025, 
11249, 10173, 10158, 10471, 10309, 10312, 11427, 10014, 10010, 11230, 
10170, 11205, 11215, 10177, 11201, 10020, 11238, 11231, 10111, 10461, 
10150, 10306, 11214, 10007, 11224, 10118, 10153, 10110, 10012, 10163, 
11218, 11373, 10107, 11367, 11220, 10032, 10075, 10115, 11235, 10011, 
11415, 10027, 10463, 10002, 10026, 11101, 10120, 10103, 10055, 10039, 
11245.0, 11256.0, 11425.0, 10046.0, 10199.0, 10123, 10009, 11378, 11229, 10006, 
10038, 10155, 11364, 11418, 10279, 10470, 10468, 11241, 10310, 10467, 
11434, 11372, 10314, 10272, 10048, 10116, 11228, 10308, 10462, 10307, 
10304, 11430, 11358, 11209, 11374, 11354, 11377, 11421, 10286, 11232, 
11245, 10469, 10176, 11385, 10044, 11102, 10459, 11435, 10281, 10034, 
10130.0, 11381.0, 10114.0]
    # now with new FIPS
    NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0, 36005, 36047, 36061, 36081, 36085,
             '36005', '36047', '36061', '36081', '36085' ]
    new_df = dataframe[dataframe.ZIP5.isin(NYZIPS) | dataframe.FIPS.isin(NYFIPS)]
    return new_df
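Listing every code as a float, an int, and a string works, but it makes the lists long.  An alternative, sketched here under the assumption that ZIP5 and FIPS values are always numeric or numeric-looking strings, is to convert both columns to zero-padded strings before the membership test.  `to_code` and `get_ny_normalized` are hypothetical names, and the short code sets in the example are placeholders rather than the full lists.

```python
import pandas as pd

def to_code(series, width=5):
    """Convert floats, ints, or numeric strings to zero-padded code strings (None if unusable)."""
    numbers = pd.to_numeric(series, errors="coerce")
    return numbers.map(
        lambda n: f"{int(n):0{width}d}" if pd.notna(n) and n > 0 else None
    )

def get_ny_normalized(dataframe, ny_zips, ny_fips):
    """Keep rows whose ZIP5 or FIPS, once normalized, appears in the given string sets."""
    zip_ok = to_code(dataframe["ZIP5"]).isin(ny_zips)
    fips_ok = to_code(dataframe["FIPS"]).isin(ny_fips)
    return dataframe[zip_ok | fips_ok]

# Toy frame mixing the types seen across the real dataframes.
toy = pd.DataFrame({
    "ZIP5": [10025.0, "11213", 90210.0, None],
    "FIPS": ["36061", 36047.0, 6037, 36005],
})
result = get_ny_normalized(toy, ny_zips={"10025", "11213"}, ny_fips={"36061", "36047", "36005"})
print(result)
```

The last toy row has no ZIP5 but is still kept through its FIPS code, which mirrors the reason for checking both columns.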

Let's see if that fixed the problem.

In [25]:
for name in name_dict.keys():
    ny_df = get_ny(name_dict[name])
    print(name, 'number of unmatched entries:', 
          len(ny_df.loc[~ny_df.FIPS.isin(NYFIPS) & ny_df.ZIP5.isin(NYZIPS), ['ADDRESS', 'ZIP5', 'ZIP', 'FIPS']]))
    

pc_1989 number of unmatched entries: 0
pc_1995 number of unmatched entries: 0
pc_2000 number of unmatched entries: 0
pc_2005 number of unmatched entries: 0
pc_2010 number of unmatched entries: 0
pc_2015 number of unmatched entries: 1
pf_1989 number of unmatched entries: 0
pf_1995 number of unmatched entries: 0
pf_2000 number of unmatched entries: 0
pf_2005 number of unmatched entries: 0
pf_2010 number of unmatched entries: 0
pf_2015 number of unmatched entries: 4
co_1989 number of unmatched entries: 0
co_1995 number of unmatched entries: 0
co_2000 number of unmatched entries: 0
co_2005 number of unmatched entries: 0
co_2010 number of unmatched entries: 0
co_2015 number of unmatched entries: 0


The remaining dataframes with unmatched entries only have entries where the FIPS is missing, so we're good to go.  Honestly, I was expecting a lot more unmatched entries.  The fact that we don't have many suggests that the data was cleaned pretty well.  I think it also means we don't really have to worry about past zip codes being different.

We'll leave the get_ny() function there for now and move on to the NTEE codes.  We could do this two different ways: by checking the NTEE1 column or by checking the NTEECC column.  Let's do it both ways and see if there's a difference.

In [26]:
pc_2015_ntee1 = pc_2015[pc_2015['NTEE1'] == 'A']
pc_2015_nteecc = pc_2015[pc_2015['NTEECC'].str.startswith('A')]

ValueError: cannot index with vector containing NA / NaN values

We can easily fix this error by telling the startswith function what to do with missing values.

In [27]:
pc_2015_ntee1 = pc_2015[pc_2015['NTEE1'] == 'A']
pc_2015_nteecc = pc_2015[pc_2015['NTEECC'].str.startswith('A', na=False)]
print(len(pc_2015_ntee1.index))
print(len(pc_2015_nteecc.index))

45831
45831


The counts are the same for pc_2015.  Let's see whether that holds for every dataframe.

In [28]:
for name in name_dict.keys():
    df = name_dict[name]
    art_df_1 = df[df['NTEE1'] == 'A']
    art_df_cc = df[df['NTEECC'].str.startswith('A', na=False)]
    if len(art_df_1.index) == len(art_df_cc.index):
        pass
    else:
        print(name, 'has a difference between NTEE1 and NTEECC')

KeyError: 'NTEE1'

We got a KeyError, meaning some of our dataframes don't have the right column names.  Let's try again, but make the error more helpful.

In [29]:
for name in name_dict.keys():
    df = name_dict[name]
    try:
        art_df_1 = df[df['NTEE1'] == 'A']
    except KeyError:
        print(name, 'is missing column NTEE1')
        continue
    try:
        art_df_cc = df[df['NTEECC'].str.startswith('A', na=False)]
    except KeyError:
        print(name, 'is missing column NTEECC')
        continue
    if len(art_df_1.index) == len(art_df_cc.index):
        print(name, 'everything matches up')
    else:
        print(name, 'has a difference between NTEE1 and NTEECC')

pc_1989 is missing column NTEE1
pc_1995 is missing column NTEECC
pc_2000 everything matches up
pc_2005 everything matches up
pc_2010 has a difference between NTEE1 and NTEECC
pc_2015 everything matches up
pf_1989 is missing column NTEE1
pf_1995 is missing column NTEE1
pf_2000 has a difference between NTEE1 and NTEECC
pf_2005 has a difference between NTEE1 and NTEECC
pf_2010 everything matches up
pf_2015 everything matches up
co_1989 everything matches up
co_1995 everything matches up
co_2000 everything matches up
co_2005 everything matches up
co_2010 everything matches up
co_2015 everything matches up
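
Because of the continue statements, the loop above reports only the first missing column for each dataframe.  Here is a small sketch that lists every missing column and also shows any column whose name contains 'ntee', which could help when looking up where the older files keep this information.  The helper names are hypothetical, and the toy frame just borrows column names reported in this notebook.

```python
import pandas as pd

REQUIRED = ('NTEE1', 'NTEECC')

def missing_columns(df, required=REQUIRED):
    """Return the required columns that df lacks."""
    return [col for col in required if col not in df.columns]

def ntee_like_columns(df):
    """Return columns whose name contains 'ntee', ignoring case."""
    return [col for col in df.columns if 'ntee' in str(col).lower()]

# Toy frame standing in for one of the older files
toy = pd.DataFrame({'NAME': ['X'], 'nteeFinal': ['A23'], 'nteeFinal1': ['A']})
print(missing_columns(toy), ntee_like_columns(toy))
```

Running both helpers over name_dict.items() would give one line per dataframe, in the same style as the loop above.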


So it looks like the public charity and private foundation data from '89 and '95 has its NTEE information in some other column; we'll have to look that up.
It turns out these old dataframes have 'nteeFinal' and 'nteeFinal1' instead of NTEECC and NTEE1.
We can also see that pc_2010, pf_2000, and pf_2005 have a difference, so we'll investigate and see what's up.

In [30]:
def difference_report(df):
    # Build each mask once; na=False counts a missing NTEECC as "does not start with A"
    ntee1_is_a = df['NTEE1'] == 'A'
    nteecc_is_a = df['NTEECC'].str.startswith('A', na=False)
    cols = ['NTEE1', 'NTEECC']
    print(df.loc[~ntee1_is_a & nteecc_is_a, cols].head())
    print(df.loc[ntee1_is_a & ~nteecc_is_a, cols].head())
    print('starts with A but does not have NTEE1==A:', (~ntee1_is_a & nteecc_is_a).sum())
    print('has NTEE1==A but does not start with A:', (ntee1_is_a & ~nteecc_is_a).sum())
difference_report(pc_2010)
difference_report(pf_2000)
difference_report(pf_2005)

     NTEE1 NTEECC
1926     S    A84
2115     G    A65
2246     Q    A6E
2458     O    A90
2648     F    A90
     NTEE1 NTEECC
1124     A    M40
1259     A    Q12
1498     A    P99
2192     A    C30
2220     A    X81
starts with A but does not have NTEE1==A: 748
has NTEE1==A but does not start with A: 951
Empty DataFrame
Columns: [NTEE1, NTEECC]
Index: []
      NTEE1 NTEECC
653       A    T23
51205     A    T23
57985     A    T23
starts with A but does not have NTEE1==A: 0
has NTEE1==A but does not start with A: 3
Empty DataFrame
Columns: [NTEE1, NTEECC]
Index: []
      NTEE1 NTEECC
823       A    T23
2635      A    T23
9914      A    T23
49185     A    T23
63367     A    T23
starts with A but does not have NTEE1==A: 0
has NTEE1==A but does not start with A: 6


It seems notable that all the misclassified rows in the latter two are T23.  Here's the description from the IRS:
T23, Private Operating Foundations: Private foundations that use a bulk of their resources to provide charitable services or run charitable programs of their own. They make few, if any, grants to outside organizations and, like private independent foundations, they generally do not raise funds from the public.
This doesn't really give me any insight, and there are so few of them that I won't worry about it.

On the other hand, pc_2010 is a bit of a problem.  Let's see how many mismatches there are in the NYC-specific data.

Also worth noting: the earlier years have no missing values in the NTEE1 and NTEECC columns, which is suspicious.  Our best guess is that missing values were coded as some special string.  We looked for it but couldn't figure out what it was, and it isn't mentioned in the data dictionary either.
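
One way to hunt for such a placeholder, sketched here under the assumption that a later year uses real missing values and only legitimate codes, is to list the values that show up in an early year but never in the later one, most frequent first.  The function name sentinel_candidates is hypothetical, and this is only a heuristic: a code that was simply retired would show up too.

```python
import pandas as pd

def sentinel_candidates(early, late):
    """Values present in `early` but absent from `late`, with their counts."""
    counts = early.value_counts(dropna=False)   # most frequent first
    known = set(late.dropna().unique())         # codes seen in the later year
    return counts[~counts.index.isin(known)]

# Toy columns; 'ZZZ' plays the role of a suspected placeholder
early = pd.Series(['A23', 'ZZZ', 'ZZZ', 'T23'])
late = pd.Series(['A23', 'T23', None])
print(sentinel_candidates(early, late))
```

With the real data this could be called on, for example, the NTEECC column of an early and a late public charity dataframe.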

In [31]:
pc_2010_nyc = get_ny(pc_2010)
difference_report(pc_2010_nyc)

      NTEE1 NTEECC
7679      B    A11
24181     W    A23
26065     X    A23
27001     N    A23
27134     Y    A23
      NTEE1 NTEECC
3082      A    X20
21901     A    E86
24732     A    O50
26111     A    X83
26115     A    Q70
starts with A but does not have NTEE1==A: 36
has NTEE1==A but does not start with A: 50


This isn't that big of a deal, but it's still worth trying to figure out what's going on.  
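
One way to get an overview before reading individual names is to cross-tabulate NTEE1 against the first letter of NTEECC for the rows where the two disagree.  This is only a sketch: ntee_mismatch_table is a hypothetical helper, and it drops rows where either column is missing.

```python
import pandas as pd

def ntee_mismatch_table(df):
    """Count mismatches by (NTEE1, first letter of NTEECC)."""
    both = df.dropna(subset=['NTEE1', 'NTEECC'])
    cc_letter = both['NTEECC'].str[:1]
    mismatched = both[both['NTEE1'] != cc_letter]
    return pd.crosstab(
        mismatched['NTEE1'],
        mismatched['NTEECC'].str[:1],
        rownames=['NTEE1'],
        colnames=['NTEECC first letter'],
    )

# Toy rows in the same shape as the difference_report output
toy = pd.DataFrame({
    'NTEE1': ['A', 'B', 'A', 'A', None],
    'NTEECC': ['A23', 'A11', 'X20', 'X83', 'A10'],
})
table = ntee_mismatch_table(toy)
print(table)
```

Calling ntee_mismatch_table(pc_2010_nyc) would show whether the mismatches cluster in a few letter pairs or are spread out.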

In [32]:
pc_2010_nyc.loc[~(pc_2010_nyc['NTEE1'] == 'A') & (pc_2010_nyc['NTEECC'].str.startswith('A', na=False)), ['NAME','NTEE1', 'NTEECC']]
    

Unnamed: 0,NAME,NTEE1,NTEECC
7679,SW FINANCING INC,B,A11
24181,FEDERATION OF MULTICULTURAL PROGRAMS INC,W,A23
26065,TURKISH AMERICAN EYUP SULTAN CULTURAL CENTER INC,X,A23
27001,HAKKA ASSOCIATION OF N Y INC,N,A23
27134,PAN ICARIAN BROTHERHOOD OF NEW YORK INC,Y,A23
29882,THE JAMES BEARD FOUNDATION INC,D,A70
30057,BHAGAVAN FOUNDATION INC,T,A23
31091,ABC NO RIO INC C/O PETER CRAMER,X,A25
33289,WORLD STUDIO FOUNDATION INC,B,A40
33834,NEW YORK FRIENDS OF IRELAND,T,A23


In [33]:
pc_2010_nyc.loc[(pc_2010_nyc['NTEE1'] == 'A') & ~(pc_2010_nyc['NTEECC'].str.startswith('A', na=False)), ['NAME','NTEE1', 'NTEECC']]
    

Unnamed: 0,NAME,NTEE1,NTEECC
3082,SING FOR HOPE,A,X20
21901,SOARINGWORDS INC,A,E86
24732,40 GREENE AVE CULTURAL CENTER INC,A,O50
26111,PEACE TIMES WEEKLY INC,A,X83
26115,DR WANG KANG-LUS MEMORIAL FOUNDATION INC,A,Q70
27150,BATYA-FRIENDS OF UNITED HATZALAH INC,A,B11
27615,TRIANGLE SHIRTWAIST FACTORY FIRE MEMORIAL INC,A,B82
28748,FORWARD ASSOC INC,A,X83
29370,FOUNDATION FOR JEWISH CULTURE INC,A,X30
29634,JOY IN SINGING INC,A,X20


Based on the names, some of these look legit and some don't.  For the moment, I'm going to cast a wide net, but this is something we should discuss.
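
To make that discussion easier, here is a sketch that labels each row by which rule matched, so the rows that only one column calls art can be pulled out and reviewed.  The helper tag_arts_match and the column name arts_match are hypothetical.

```python
import pandas as pd

def tag_arts_match(df):
    """Return a copy of df with an 'arts_match' label per row."""
    by_ntee1 = df['NTEE1'] == 'A'
    by_nteecc = df['NTEECC'].str.startswith('A', na=False)

    labels = pd.Series('neither', index=df.index)
    labels[by_nteecc] = 'NTEECC only'
    labels[by_ntee1] = 'NTEE1 only'
    labels[by_ntee1 & by_nteecc] = 'both'   # applied last so it wins
    return df.assign(arts_match=labels)

# Toy rows covering all four cases
toy = pd.DataFrame({
    'NTEE1': ['A', 'B', 'A', 'C'],
    'NTEECC': ['A23', 'A11', 'X20', None],
})
tagged = tag_arts_match(toy)
print(tagged)
```

Casting a wide net then corresponds to keeping every row whose label is not 'neither'.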

Let's make our get_arts function.  I'll handle the different column names for older dataframes later.  Without handling the different column names, our function is pretty simple.  I've also copied the most recent get_ny function for easy access.

In [34]:
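def get_arts(dataframe):
    # Wide net: keep a row if NTEE1 is 'A' or NTEECC starts with 'A'.
    # na=False treats a missing NTEECC as a non-match instead of raising.
    new_df = dataframe[(dataframe['NTEE1'] == 'A') | (dataframe['NTEECC'].str.startswith('A', na=False))]
    return new_df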

def get_ny(dataframe):
    NYZIPS = [10453.0, 10457.0, 10460.0, 10458.0, 10467.0, 10468.0, 10451.0, 10452.0, 10456.0, 10454.0, 
10455.0, 10459.0, 10474.0, 10463.0, 10471.0, 10466.0, 10469.0, 10470.0, 10475.0, 10461.0, 
10462.0, 10464.0, 10465.0, 10472.0, 10473.0, 11212.0, 11213.0, 11216.0, 11233.0, 11238.0, 
11209.0, 11214.0, 11228.0, 11204.0, 11218.0, 11219.0, 11230.0, 11234.0, 11236.0, 11239.0, 
11223.0, 11224.0, 11229.0, 11235.0, 11201.0, 11205.0, 11215.0, 11217.0, 11231.0, 11203.0, 
11210.0, 11225.0, 11226.0, 11207.0, 11208.0, 11211.0, 11222.0, 11220.0, 11232.0, 11206.0, 
11221.0, 11237.0, 10026.0, 10027.0, 10030.0, 10037.0, 10039.0, 10001.0, 10011.0, 10018.0, 
10019.0, 10020.0, 10036.0, 10029.0, 10035.0, 10010.0, 10016.0, 10017.0, 10022.0, 10012.0, 
10013.0, 10014.0, 10004.0, 10005.0, 10006.0, 10007.0, 10038.0, 10280.0, 10002.0, 10003.0, 
10009.0, 10021.0, 10028.0, 10044.0, 10065.0, 10075.0, 10128.0, 10023.0, 10024.0, 10025.0, 
10031.0, 10032.0, 10033.0, 10034.0, 10040.0, 11361.0, 11362.0, 11363.0, 11364.0, 11354.0, 
11355.0, 11356.0, 11357.0, 11358.0, 11359.0, 11360.0, 11365.0, 11366.0, 11367.0, 11412.0, 
11423.0, 11432.0, 11433.0, 11434.0, 11435.0, 11436.0, 11101.0, 11102.0, 11103.0, 11104.0, 
11105.0, 11106.0, 11374.0, 11375.0, 11379.0, 11385.0, 11691.0, 11692.0, 11693.0, 11694.0, 
11695.0, 11697.0, 11004.0, 11005.0, 11411.0, 11413.0, 11422.0, 11426.0, 11427.0, 11428.0, 
11429.0, 11414.0, 11415.0, 11416.0, 11417.0, 11418.0, 11419.0, 11420.0, 11421.0, 11368.0, 
11369.0, 11370.0, 11372.0, 11373.0, 11377.0, 11378.0, 10302.0, 10303.0, 10310.0, 10306.0, 
10307.0, 10308.0, 10309.0, 10312.0, 10301.0, 10304.0, 10305.0, 10314.0, 10119.0, 11249.0, 
10008.0, 10279.0, 10271.0, 10041.0, 10163.0, 10107.0, 10108.0, 10113.0, 10123.0, 11351.0, 
10115.0, 10276.0, 10150.0, 11439.0, 11451.0, 11202.0, 10170.0, 11424.0, 10185.0, 10122.0, 
11690.0, 11242.0, 11352.0, 10116.0, 10167.0, 10282.0, 11247.0, 10278.0, 10121.0, 10155.0, 
10168.0, 10281.0, 10118.0, 10110.0, 10158.0, 10159.0, 10165.0, 11241.0, 10156.0, 10178.0, 
10120.0, 10105.0, 10104.0, 10175.0, 10101.0, 10153.0, 10268.0, 10173.0, 10111.0, 10311.0, 
10166.0, 10069.0, 10272.0, 10112.0, 10176.0, 10162.0, 10174.0, 10177.0, 10151.0, 11430.0, 
11386.0, 10106.0, 10169.0, 10154.0, 11109.0, 11380.0, 10129.0, 10103.0, 10045.0, 10171.0, 
10286.0, 11371.0, 11120.0, 11431.0, 10274.0, 11243.0, 11240.0, 10015.0, 10048.0, 10249.0, 
10285.0, 10152.0, 10270.0, 10102.0, 10043.0, 10172.0, 10109.0, 10081.0, 11252.0, 10055.0, 
10313.0, 11251.0, 10125.0, 10133.0, 10117.0, 10138.0, 10164.0, 10292.0, 10260.0, 10072.0, 
10080.0, 10179.0, 10021, 10065, 11219, 10022, 10003, 10028, 10122, 
11217, 10017, 11361, 10013, 10004, 10005, 10001, 10168, 10016, 11210, 
10031, 11223, 10036, 10018, 11211, 10024, 10019, 10119, 11204, 10023, 
10128, 11106, 11234, 11120, 11375, 10008, 10274, 11694, 10165, 10025, 
11249, 10173, 10158, 10471, 10309, 10312, 11427, 10014, 10010, 11230, 
10170, 11205, 11215, 10177, 11201, 10020, 11238, 11231, 10111, 10461, 
10150, 10306, 11214, 10007, 11224, 10118, 10153, 10110, 10012, 10163, 
11218, 11373, 10107, 11367, 11220, 10032, 10075, 10115, 11235, 10011, 
11415, 10027, 10463, 10002, 10026, 11101, 10120, 10103, 10055, 10039, 
11245.0, 11256.0, 11425.0, 10046.0, 10199.0, 10123, 10009, 11378, 11229, 10006, 
10038, 10155, 11364, 11418, 10279, 10470, 10468, 11241, 10310, 10467, 
11434, 11372, 10314, 10272, 10048, 10116, 11228, 10308, 10462, 10307, 
10304, 11430, 11358, 11209, 11374, 11354, 11377, 11421, 10286, 11232, 
11245, 10469, 10176, 11385, 10044, 11102, 10459, 11435, 10281, 10034, 
10130.0, 11381.0, 10114.0]
    NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0, 36005, 36047, 36061, 36081, 36085,
             '36005', '36047', '36061', '36081', '36085' ]
    new_df = dataframe[dataframe.ZIP5.isin(NYZIPS) | dataframe.FIPS.isin(NYFIPS)]
    return new_df
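
For the deferred column-name handling, here is a minimal sketch.  It assumes the older files use exactly the names noted earlier ('nteeFinal' in place of NTEECC and 'nteeFinal1' in place of NTEE1) and that renaming them in the output is acceptable.  get_arts_any_year and LEGACY_NTEE_COLUMNS are hypothetical names, not the final functions.

```python
import pandas as pd

# ASSUMPTION: legacy names as noted earlier in this notebook
LEGACY_NTEE_COLUMNS = {'nteeFinal': 'NTEECC', 'nteeFinal1': 'NTEE1'}

def get_arts_any_year(dataframe):
    """Like get_arts, but tolerant of legacy or missing NTEE columns."""
    # rename ignores keys that are not present, so newer files pass through
    df = dataframe.rename(columns=LEGACY_NTEE_COLUMNS)
    is_art = pd.Series(False, index=df.index)
    if 'NTEE1' in df.columns:
        is_art = is_art | (df['NTEE1'] == 'A')
    if 'NTEECC' in df.columns:
        is_art = is_art | df['NTEECC'].str.startswith('A', na=False)
    return df[is_art]

# Toy frames: one with legacy names, one with only NTEE1
old = pd.DataFrame({'nteeFinal': ['A23', 'B11', None],
                    'nteeFinal1': ['A', 'B', 'A']})
new = pd.DataFrame({'NTEE1': ['A', 'Q']})
print(get_arts_any_year(old))
print(get_arts_any_year(new))
```

Checking each column separately means a dataframe that has only one of the two columns is still filtered on the one it has.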