In this notebook I investigate some of the discrepancies between our IRS data set and both the DCLA list of cultural organizations and the DataArts data set.  

These discrepancies can go in one of two directions:
(1) There is an organization in either the DCLA or DataArts data set that we didn't pick out of the IRS data. 
(2) There is an organization that we picked out of the IRS data as doing work in Arts & Culture that isn't present in the DCLA or DataArts data set.

The type (1) discrepancies are a concern because we don't want to be systematically missing organizations that we should be including.  By looking for patterns in these organizations we can make our net wider and catch more organizations.

The type (2) discrepancies are a concern because we don't want to be including organizations that don't meet our criteria.  By looking for patterns in these organizations we can make our net more discriminative.



In [1]:
import pandas as pd

In [2]:
co_2015 = pd.read_csv('../input/irs-5yrsample-other/coreco.core2015co.csv')
pf_2015 = pd.read_csv('../input/irs-5yearsample-pf/nccs.core2015pf.csv')
pc_2015 = pd.read_csv('../input/irs-5yearsample-pc/nccs.core2015pc.csv')
pc_2014 = pd.read_csv('../input/irs2014/nccs.core2014pc.csv')
pf_2014 = pd.read_csv('../input/irs2014/nccs.core2014pf.csv')
co_2014 = pd.read_csv('../input/irs2014/coreco.core2014co.csv')
dcla = pd.read_csv('../input/dcla-cultural-organizations/DCLA_Cultural_Organizations.csv')
dataarts = pd.read_excel('../input/dataarts/NYC Completes.xlsx')



In [3]:
def get_arts(dataframe):
    new_df = dataframe[(dataframe['NTEE1'] == 'A') | (dataframe['NTEECC'].str.startswith('A', na=False))]
    return new_df

def get_ny(dataframe):
    NYZIPS = [10453.0, 10457.0, 10460.0, 10458.0, 10467.0, 10468.0, 10451.0, 10452.0, 10456.0, 10454.0, 
10455.0, 10459.0, 10474.0, 10463.0, 10471.0, 10466.0, 10469.0, 10470.0, 10475.0, 10461.0, 
10462.0, 10464.0, 10465.0, 10472.0, 10473.0, 11212.0, 11213.0, 11216.0, 11233.0, 11238.0, 
11209.0, 11214.0, 11228.0, 11204.0, 11218.0, 11219.0, 11230.0, 11234.0, 11236.0, 11239.0, 
11223.0, 11224.0, 11229.0, 11235.0, 11201.0, 11205.0, 11215.0, 11217.0, 11231.0, 11203.0, 
11210.0, 11225.0, 11226.0, 11207.0, 11208.0, 11211.0, 11222.0, 11220.0, 11232.0, 11206.0, 
11221.0, 11237.0, 10026.0, 10027.0, 10030.0, 10037.0, 10039.0, 10001.0, 10011.0, 10018.0, 
10019.0, 10020.0, 10036.0, 10029.0, 10035.0, 10010.0, 10016.0, 10017.0, 10022.0, 10012.0, 
10013.0, 10014.0, 10004.0, 10005.0, 10006.0, 10007.0, 10038.0, 10280.0, 10002.0, 10003.0, 
10009.0, 10021.0, 10028.0, 10044.0, 10065.0, 10075.0, 10128.0, 10023.0, 10024.0, 10025.0, 
10031.0, 10032.0, 10033.0, 10034.0, 10040.0, 11361.0, 11362.0, 11363.0, 11364.0, 11354.0, 
11355.0, 11356.0, 11357.0, 11358.0, 11359.0, 11360.0, 11365.0, 11366.0, 11367.0, 11412.0, 
11423.0, 11432.0, 11433.0, 11434.0, 11435.0, 11436.0, 11101.0, 11102.0, 11103.0, 11104.0, 
11105.0, 11106.0, 11374.0, 11375.0, 11379.0, 11385.0, 11691.0, 11692.0, 11693.0, 11694.0, 
11695.0, 11697.0, 11004.0, 11005.0, 11411.0, 11413.0, 11422.0, 11426.0, 11427.0, 11428.0, 
11429.0, 11414.0, 11415.0, 11416.0, 11417.0, 11418.0, 11419.0, 11420.0, 11421.0, 11368.0, 
11369.0, 11370.0, 11372.0, 11373.0, 11377.0, 11378.0, 10302.0, 10303.0, 10310.0, 10306.0, 
10307.0, 10308.0, 10309.0, 10312.0, 10301.0, 10304.0, 10305.0, 10314.0, 10119.0, 11249.0, 
10008.0, 10279.0, 10271.0, 10041.0, 10163.0, 10107.0, 10108.0, 10113.0, 10123.0, 11351.0, 
10115.0, 10276.0, 10150.0, 11439.0, 11451.0, 11202.0, 10170.0, 11424.0, 10185.0, 10122.0, 
11690.0, 11242.0, 11352.0, 10116.0, 10167.0, 10282.0, 11247.0, 10278.0, 10121.0, 10155.0, 
10168.0, 10281.0, 10118.0, 10110.0, 10158.0, 10159.0, 10165.0, 11241.0, 10156.0, 10178.0, 
10120.0, 10105.0, 10104.0, 10175.0, 10101.0, 10153.0, 10268.0, 10173.0, 10111.0, 10311.0, 
10166.0, 10069.0, 10272.0, 10112.0, 10176.0, 10162.0, 10174.0, 10177.0, 10151.0, 11430.0, 
11386.0, 10106.0, 10169.0, 10154.0, 11109.0, 11380.0, 10129.0, 10103.0, 10045.0, 10171.0, 
10286.0, 11371.0, 11120.0, 11431.0, 10274.0, 11243.0, 11240.0, 10015.0, 10048.0, 10249.0, 
10285.0, 10152.0, 10270.0, 10102.0, 10043.0, 10172.0, 10109.0, 10081.0, 11252.0, 10055.0, 
10313.0, 11251.0, 10125.0, 10133.0, 10117.0, 10138.0, 10164.0, 10292.0, 10260.0, 10072.0, 
10080.0, 10179.0, 10021, 10065, 11219, 10022, 10003, 10028, 10122, 
11217, 10017, 11361, 10013, 10004, 10005, 10001, 10168, 10016, 11210, 
10031, 11223, 10036, 10018, 11211, 10024, 10019, 10119, 11204, 10023, 
10128, 11106, 11234, 11120, 11375, 10008, 10274, 11694, 10165, 10025, 
11249, 10173, 10158, 10471, 10309, 10312, 11427, 10014, 10010, 11230, 
10170, 11205, 11215, 10177, 11201, 10020, 11238, 11231, 10111, 10461, 
10150, 10306, 11214, 10007, 11224, 10118, 10153, 10110, 10012, 10163, 
11218, 11373, 10107, 11367, 11220, 10032, 10075, 10115, 11235, 10011, 
11415, 10027, 10463, 10002, 10026, 11101, 10120, 10103, 10055, 10039, 
11245.0, 11256.0, 11425.0, 10046.0, 10199.0, 10123, 10009, 11378, 11229, 10006, 
10038, 10155, 11364, 11418, 10279, 10470, 10468, 11241, 10310, 10467, 
11434, 11372, 10314, 10272, 10048, 10116, 11228, 10308, 10462, 10307, 
10304, 11430, 11358, 11209, 11374, 11354, 11377, 11421, 10286, 11232, 
11245, 10469, 10176, 11385, 10044, 11102, 10459, 11435, 10281, 10034, 
10130.0, 11381.0, 10114.0]
    NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0, 36005, 36047, 36061, 36081, 36085,
             '36005', '36047', '36061', '36081', '36085' ]
    new_df = dataframe[dataframe.ZIP5.isin(NYZIPS) | dataframe.FIPS.isin(NYFIPS)]
    return new_df

In [4]:
NYZIPS = [10453.0, 10457.0, 10460.0, 10458.0, 10467.0, 10468.0, 10451.0, 10452.0, 10456.0, 10454.0, 
10455.0, 10459.0, 10474.0, 10463.0, 10471.0, 10466.0, 10469.0, 10470.0, 10475.0, 10461.0, 
10462.0, 10464.0, 10465.0, 10472.0, 10473.0, 11212.0, 11213.0, 11216.0, 11233.0, 11238.0, 
11209.0, 11214.0, 11228.0, 11204.0, 11218.0, 11219.0, 11230.0, 11234.0, 11236.0, 11239.0, 
11223.0, 11224.0, 11229.0, 11235.0, 11201.0, 11205.0, 11215.0, 11217.0, 11231.0, 11203.0, 
11210.0, 11225.0, 11226.0, 11207.0, 11208.0, 11211.0, 11222.0, 11220.0, 11232.0, 11206.0, 
11221.0, 11237.0, 10026.0, 10027.0, 10030.0, 10037.0, 10039.0, 10001.0, 10011.0, 10018.0, 
10019.0, 10020.0, 10036.0, 10029.0, 10035.0, 10010.0, 10016.0, 10017.0, 10022.0, 10012.0, 
10013.0, 10014.0, 10004.0, 10005.0, 10006.0, 10007.0, 10038.0, 10280.0, 10002.0, 10003.0, 
10009.0, 10021.0, 10028.0, 10044.0, 10065.0, 10075.0, 10128.0, 10023.0, 10024.0, 10025.0, 
10031.0, 10032.0, 10033.0, 10034.0, 10040.0, 11361.0, 11362.0, 11363.0, 11364.0, 11354.0, 
11355.0, 11356.0, 11357.0, 11358.0, 11359.0, 11360.0, 11365.0, 11366.0, 11367.0, 11412.0, 
11423.0, 11432.0, 11433.0, 11434.0, 11435.0, 11436.0, 11101.0, 11102.0, 11103.0, 11104.0, 
11105.0, 11106.0, 11374.0, 11375.0, 11379.0, 11385.0, 11691.0, 11692.0, 11693.0, 11694.0, 
11695.0, 11697.0, 11004.0, 11005.0, 11411.0, 11413.0, 11422.0, 11426.0, 11427.0, 11428.0, 
11429.0, 11414.0, 11415.0, 11416.0, 11417.0, 11418.0, 11419.0, 11420.0, 11421.0, 11368.0, 
11369.0, 11370.0, 11372.0, 11373.0, 11377.0, 11378.0, 10302.0, 10303.0, 10310.0, 10306.0, 
10307.0, 10308.0, 10309.0, 10312.0, 10301.0, 10304.0, 10305.0, 10314.0, 10119.0, 11249.0, 
10008.0, 10279.0, 10271.0, 10041.0, 10163.0, 10107.0, 10108.0, 10113.0, 10123.0, 11351.0, 
10115.0, 10276.0, 10150.0, 11439.0, 11451.0, 11202.0, 10170.0, 11424.0, 10185.0, 10122.0, 
11690.0, 11242.0, 11352.0, 10116.0, 10167.0, 10282.0, 11247.0, 10278.0, 10121.0, 10155.0, 
10168.0, 10281.0, 10118.0, 10110.0, 10158.0, 10159.0, 10165.0, 11241.0, 10156.0, 10178.0, 
10120.0, 10105.0, 10104.0, 10175.0, 10101.0, 10153.0, 10268.0, 10173.0, 10111.0, 10311.0, 
10166.0, 10069.0, 10272.0, 10112.0, 10176.0, 10162.0, 10174.0, 10177.0, 10151.0, 11430.0, 
11386.0, 10106.0, 10169.0, 10154.0, 11109.0, 11380.0, 10129.0, 10103.0, 10045.0, 10171.0, 
10286.0, 11371.0, 11120.0, 11431.0, 10274.0, 11243.0, 11240.0, 10015.0, 10048.0, 10249.0, 
10285.0, 10152.0, 10270.0, 10102.0, 10043.0, 10172.0, 10109.0, 10081.0, 11252.0, 10055.0, 
10313.0, 11251.0, 10125.0, 10133.0, 10117.0, 10138.0, 10164.0, 10292.0, 10260.0, 10072.0, 
10080.0, 10179.0, 10021, 10065, 11219, 10022, 10003, 10028, 10122, 
11217, 10017, 11361, 10013, 10004, 10005, 10001, 10168, 10016, 11210, 
10031, 11223, 10036, 10018, 11211, 10024, 10019, 10119, 11204, 10023, 
10128, 11106, 11234, 11120, 11375, 10008, 10274, 11694, 10165, 10025, 
11249, 10173, 10158, 10471, 10309, 10312, 11427, 10014, 10010, 11230, 
10170, 11205, 11215, 10177, 11201, 10020, 11238, 11231, 10111, 10461, 
10150, 10306, 11214, 10007, 11224, 10118, 10153, 10110, 10012, 10163, 
11218, 11373, 10107, 11367, 11220, 10032, 10075, 10115, 11235, 10011, 
11415, 10027, 10463, 10002, 10026, 11101, 10120, 10103, 10055, 10039, 
11245.0, 11256.0, 11425.0, 10046.0, 10199.0, 10123, 10009, 11378, 11229, 10006, 
10038, 10155, 11364, 11418, 10279, 10470, 10468, 11241, 10310, 10467, 
11434, 11372, 10314, 10272, 10048, 10116, 11228, 10308, 10462, 10307, 
10304, 11430, 11358, 11209, 11374, 11354, 11377, 11421, 10286, 11232, 
11245, 10469, 10176, 11385, 10044, 11102, 10459, 11435, 10281, 10034, 
10130.0, 11381.0, 10114.0]
NYFIPS = [36005.0, 36047.0, 36061.0, 36081.0, 36085.0, 36005, 36047, 36061, 36081, 36085,
             '36005', '36047', '36061', '36081', '36085' ]
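
The ZIP and FIPS lists above repeat each code as a float, an int, and (for FIPS) a string, presumably because the column types differ between files. An alternative is to normalize each value to a zero-padded string before comparing. This is only a sketch; `normalize_code` and `NYC_FIPS` are hypothetical names, not part of the notebook:

```python
import math


def normalize_code(value, width=5):
    """Return a zero-padded digit string for a ZIP/FIPS-like value, or None if unusable."""
    if value is None:
        return None
    if isinstance(value, float):
        if math.isnan(value):
            return None
        value = int(value)          # 10453.0 -> 10453
    text = str(value).strip()
    if text.endswith(".0"):         # the string "36005.0"
        text = text[:-2]
    text = text.split("-")[0]       # drop a ZIP+4 suffix such as "-2405"
    if not text.isdigit():
        return None
    return text.zfill(width)        # 7646 -> "07646"


# The five county codes from NYFIPS above, written once as strings.
NYC_FIPS = {"36005", "36047", "36061", "36081", "36085"}
```

With this, a filter could look like `dataframe.FIPS.apply(normalize_code).isin(NYC_FIPS)`, and the ZIP list would only need one entry per code.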

In [5]:
# the columns we keep from every IRS file
keep_cols = ['EIN', 'NAME', 'NTEE1', 'NTEECC', 'ADDRESS', 'STATE', 'ZIP', 'ZIP5', 'FIPS']
pc_2015c = get_ny(get_arts(pc_2015))[keep_cols]
pf_2015c = get_ny(get_arts(pf_2015))[keep_cols]
co_2015c = get_ny(get_arts(co_2015))[keep_cols]
pc_2014c = get_ny(get_arts(pc_2014))[keep_cols]
pf_2014c = get_ny(get_arts(pf_2014))[keep_cols]
co_2014c = get_ny(get_arts(co_2014))[keep_cols]
# DataFrame.append was removed in pandas 2.0; pd.concat keeps the original row indices, as append did
df_2015 = pd.concat([pc_2015c, pf_2015c, co_2015c]) # the cleaned 2015 dataset, the 3 types combined for convenience
df_2014 = pd.concat([pc_2014c, pf_2014c, co_2014c]) # the cleaned 2014 dataset, the 3 types combined for convenience
# the uncleaned datasets (all subject areas, all locations), same columns
old_2015 = pd.concat([pc_2015[keep_cols], pf_2015[keep_cols], co_2015[keep_cols]])
old_2014 = pd.concat([pc_2014[keep_cols], pf_2014[keep_cols], co_2014[keep_cols]])


In [6]:
def standardize_id(string):
    # a helper function to standardize the tax ids, some of them have hyphens in them.
    string = str(string)
    string = string.replace("-", "")
    return string

In [7]:
# the following for year 2015
# get a list of all the EIN ids in the DataArts database
dataarts_ids = list(dataarts.loc[dataarts['fiscal_year'] == 2015, 'organizations_tax_id']
                        .apply(standardize_id).unique())
# do the same thing to get a list of all the EIN ids in our IRS database
irs_ids = list(df_2015['EIN'].apply(standardize_id).unique())
print('Number of organizations in the IRS dataset: ', len(irs_ids))
print('Number of organizations in the DataArts dataset: ', len(dataarts_ids))
# initialize some empty lists
in_dataarts_missing_from_irs = []
in_irs_missing_from_dataarts = []
dataarts_ids_in_irs = []
irs_ids_in_dataarts = [] # should be the same as the one above, just double checking
# getting taxids that are in DataArts but missing from IRS
for taxid in dataarts_ids:
    if taxid in irs_ids:
        dataarts_ids_in_irs.append(taxid)
    else:
        in_dataarts_missing_from_irs.append(taxid)
# getting taxids that are in IRS but missing from DataArts
for taxid in irs_ids:
    if taxid in dataarts_ids:
        irs_ids_in_dataarts.append(taxid)
    else:
        in_irs_missing_from_dataarts.append(taxid)
print('Number of organizations in both datasets:', len(dataarts_ids_in_irs))
print('Number of organizations in both, should match above:', len(irs_ids_in_dataarts))
print('')
print('Number of organizations in DataArts but missing from IRS:', len(in_dataarts_missing_from_irs))
print('Number of organizations in IRS but missing from DataArts:', len(in_irs_missing_from_dataarts))

Number of organizations in the IRS dataset:  3248
Number of organizations in the DataArts dataset:  1280
Number of organizations in both datasets: 888
Number of organizations in both, should match above: 888

Number of organizations in DataArts but missing from IRS: 392
Number of organizations in IRS but missing from DataArts: 2360
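
The two loops above can also be written with set operations, which avoids scanning a list for every id. A minimal sketch on made-up ids; `compare_ids` is a hypothetical helper, not something defined elsewhere in this notebook:

```python
def compare_ids(left_ids, right_ids):
    """Split two collections of ids into shared ids and ids unique to each side."""
    left, right = set(left_ids), set(right_ids)
    return {
        "both": left & right,         # present in both collections
        "left_only": left - right,    # e.g. in IRS but missing from DataArts
        "right_only": right - left,   # e.g. in DataArts but missing from IRS
    }


# Made-up ids, for illustration only.
result = compare_ids(["1", "2", "3"], ["2", "3", "4", "5"])

# Sanity check on the counts printed above: each total is "both" plus "only".
irs_total_ok = 888 + 2360 == 3248
dataarts_total_ok = 888 + 392 == 1280
```

The printed counts are consistent with this identity, which is a quick way to confirm that neither id list contained duplicates after `.unique()`.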


In [8]:
# look up the EINs that were missing from the cleaned IRS dataset in the old, uncleaned IRS dataset
missing_df = old_2015.loc[old_2015.EIN.apply(standardize_id).isin(in_dataarts_missing_from_irs)]
# get the ones that meet the arts criteria, meaning they were missed because of address
missing_address = get_arts(missing_df)
# get the ones that meet the address criteria, meaning they were missed because of arts
missing_classification = get_ny(missing_df)
# check to see if the ones we are still missing are in 2014
still_missing = set(in_dataarts_missing_from_irs) - set(missing_df.EIN.apply(standardize_id))
missing_df2014 = old_2014.loc[old_2014.EIN.apply(standardize_id).isin(still_missing)]
print('Number of organizations in DataArts but missing from IRS:', len(in_dataarts_missing_from_irs))
print('Number of these DataArts organizations that were found in old IRS data:', len(missing_df))
print('Number of these found organizations that were missed because of address issue:',
      len(missing_address))
print('Number of these found organizations that were missed because of classification issue:',
      len(missing_classification))
print('')
print('Number of organizations that were not found in old IRS 2015 data but were found in 2014 data, implying that we missed them because they did not file in 2015:', len(missing_df2014))


Number of organizations in DataArts but missing from IRS: 392
Number of these DataArts organizations that were found in old IRS data: 223
Number of these found organizations that were missed because of address issue: 27
Number of these found organizations that were missed because of classification issue: 190

Number of organizations that were not found in old IRS 2015 data but were found in 2014 data, implying that we missed them because they did not file in 2015: 2
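
Note that the two "missed because of" counts do not add up to the number found: 223 - 27 - 190 = 6. Assuming each found row is one organization, the likely reading is that those 6 fail both the arts filter and the NYC filter (a row passing both would already be in the cleaned data). A small sketch of that partition on made-up flags; `miss_reason` is a hypothetical helper:

```python
from collections import Counter


def miss_reason(is_arts, is_nyc):
    """Label why a record fell out of the cleaned data (hypothetical helper)."""
    if is_arts and is_nyc:
        return "kept"            # passes both filters, so it was not missed
    if is_arts:
        return "address"         # arts code, but outside the NYC ZIP/FIPS lists
    if is_nyc:
        return "classification"  # NYC address, but not an 'A' NTEE code
    return "both"                # fails both filters


# Made-up (is_arts, is_nyc) flags, for illustration only.
toy_records = [(True, False), (False, True), (False, True), (False, False)]
reason_counts = Counter(miss_reason(a, n) for a, n in toy_records)

# Using the counts printed above: found rows not explained by a single filter.
unexplained = 223 - 27 - 190
```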


There are 169 organizations in the DataArts data that are still missing.  Two hypotheses about why this could be:
(1) There's a typo in the EIN in either the IRS or DataArts database. We could try to match them by name or address to see if this is the case, but as we're about to see with the DCLA, matching by name and address is hard, so I'm not going into it now. Someone could come back to it later.
(2) These organizations could be for-profit organizations; our IRS data only has non-profits.

To test this second hypothesis, let's look at the DataArts data that's missing from the IRS.

In [9]:
dataarts[dataarts['organizations_tax_id'].apply(standardize_id).isin(still_missing) 
         & (dataarts['fiscal_year'] == 2015)]

Unnamed: 0,fiscal_year,fiscal_year_end,organization_id,profile_id,form_type,status_code,status_internal,status,legacy,survey_responses_updated_at,organizations_id,organizations_name,organizations_legal_name,organizations_tax_id,organizations_fiscal_year_end,organizations_addr_street,organizations_addr_line2,organizations_addr_city,organizations_addr_state,organizations_addr_zip,organizations_addr_country,organizations_phone,organizations_url,organizations_year_founded,organizations_year_incorporated,organizations_city_council_district,organizations_state_house_district,organizations_state_senate_district,organizations_federal_congressional_district,organizations_county,organizations_county_supervisorial_district,organizations_parent_name,organizations_fiscal_sponsor_name,organizations_irs_exemption_date,organizations_nmbr_board_members,organizations_address2,organizations_audited,organizations_month_count,organizations_duns_number,organizations_cdp_taxonomy,...,constituencies_gender_particular_group,constituencies_gender_served,constituencies_age_particular,constituencies_age_served,constituencies_other_particular_groups,constituencies_other_served,constituencies_other_served_description,percent_non_local_sf,percent_non_local_long_form,percent_non_local,percent_under_18_sf,events_distinct_sf_ag,events_occurrences_sf_ag,publications_distinct_ag,publications_distributed_ag,total_working_capital_unrestricted,total_working_capital,total_working_capital_total,months_of_operating_cash_unrestricted,months_of_operating_cash_total,self_sufficiency_ratio,current_assets_total_over_current_liabilities_total,debt_service_impact,depreciation_as_percent_fixed_assets,in_kind_operating_ratio,leverage_ratio,operating_margin,unrestricted_net_assets_net_of_fixed_assets,total_operating_expenses_per_physical_attendee,liquid_unrestricted_net_assets,MARKETING_EXPENSE_AS_PERCENT_OF_TOTAL_EXPENSES,MARKETING_EXPENSE_AS_PERCENT_OF_REVENUE,FUNDRAISING_EFFICIENCY,SUPP_TRUSTEE_BOARD_AVERAGE,SUPP_INDIVIDUAL_AVERAGE,SUPP_CORPORATE_AVERAGE,SUPP_FOUNDATION_AVERAGE,TOTAL_EMPLOYEES_PEOPLE,TOTAL_EMPLOYEES_FTES,NYC_MARKER
263,2015,2015-12-31,4805,87016,L,2,Completed,Completed,0,2017-02-04 23:01:25,4805,"Ryan Repertory Company, Inc.","Ryan Repertory Company, Inc.",115279252,2017-12-31 00:00:00,"2445 Bath Avenue, Brooklyn New York",,Brooklyn,NY,11214,USA,718-996-4800,www.ryanrep.org,1972.0,1981.0,47,47,22,13,Kings,0.0,,,1981.0,13.0,,0.0,12.0,,5,...,No,,No,,No,,,,,,,3.0,33.0,,,,0.0,0.0,0.000,0.000,0.266,0.000,0.000,,0.977,,0.953,,12.590,,0.000,0.000,,,1.000,,,14.0,8.375,Brooklyn
448,2015,2015-12-31,4898,86481,L,2,Completed,Completed,0,2017-02-05 01:07:34,4898,New York African Chorus Ensemble Inc.,New York African Chorus Ensemble Inc.,201090906,2017-12-31 00:00:00,515 WEST 151st STREET,SUITE 2W,New York,NY,10031,USA,(347)938-9335,www.nyafricanensemble.com,2004.0,2004.0,7,71,30,15,New York,0.0,,,2004.0,4.0,,0.0,12.0,808844166.0,5,...,No,,No,,No,,,,20.0,20.0,,8.0,11.0,,,,0.0,0.0,0.000,0.000,0.200,,,,0.000,,-0.024,,5.808,,0.132,3.764,,,22.707,0.00,11784.000,21.0,1.073,Manhattan
482,2015,2015-07-31,4920,27605,L,2,Completed,Completed,0,2017-02-09 20:06:39,4920,Sing for Hope,Sing for Hope,010856384,2018-07-31 00:00:00,575 Eighth Ave,Suite 1812,New York,NY,10018,USA,212-966-5955,www.singforhope.org,2006.0,2006.0,1,66,25,8,New York,0.0,,,2006.0,20.0,,1.0,12.0,809701480.0,6,...,No,,No,,Yes,"IndDisabilities,IndInstitutions,IndLowIncome,M...",,,10.0,10.0,,209.0,226.0,0.0,0.0,1575597.0,1575597.0,1575597.0,5.845,5.845,0.000,27.713,0.000,,0.289,0.0,0.000,685597.0,1.301,744580.0,0.037,,9.613,2272.727,543.478,41363.80,0.000,311.0,311.000,Manhattan
620,2015,2015-12-31,4986,27887,L,2,Completed - Migrated,Completed,1,2016-02-25 17:34:12,4986,nicu's spoon,nicu's spoon,06-1614045,2015-12-31 00:00:00,"38 west 38th st, 5th floor",ste. 3B,New York,NY,10025,,212-245-6467,www.spoontheater.org,2001.0,2001.0,5,75,26,14,New York,0.0,,,2001.0,8.0,,0.0,12.0,99464518.0,5,...,No,,Yes,"PreK,K12,OlderAdults",Yes,"IndDisabilities,IndLowIncome,Immigrants,LGBT",,,,,,4.0,30.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.287,,0.000,,0.000,0.0,0.091,0.0,8.313,,0.072,0.252,5.775,1000.000,621.370,,633.333,4.0,2.500,Manhattan
697,2015,2015-06-30,5042,28165,L,5,"Complete - Historically, Review Complete",Completed,1,2015-11-16 19:08:06,5042,Citywide Youth Opera,Citywide Youth Opera,05-0616054,2015-06-30 00:00:00,321 E. 43rd Street,,New York,NY,10017,,212-539-3561,http://www.citywideyouthopera.org,2005.0,2005.0,4,74,26,14,New York,0.0,,,2005.0,7.0,,0.0,12.0,181227864.0,5,...,No,,Yes,"K12,YoungAdults",No,,,,,,,8.0,8.0,0.0,0.0,0.0,0.0,0.0,0.000,0.000,0.679,0.000,0.000,0.000,0.000,0.0,0.163,0.0,81.322,0.0,0.058,0.088,,1145.000,248.671,1000.00,,0.0,0.000,Manhattan
1368,2015,2015-12-31,5571,87541,L,2,Completed,Completed,0,2017-02-12 23:25:37,5571,"Uptown Dance Academy, Inc.","Uptown Dance Academy, Inc.",133891881,2017-12-31 00:00:00,167 E 121st,,New York,NY,10027,USA,917-202-1601,www.uptowndanceacademy.com,1996.0,1996.0,9,70,30,13,New York,0.0,,,1997.0,9.0,,0.0,12.0,,3,...,No,,No,,No,,,,20.0,20.0,,10.0,52.0,,,44.0,44.0,44.0,0.002,0.002,0.847,,,,0.000,0.0,0.000,44.0,2696.110,44.0,0.000,0.000,,,18.045,,,19.0,1.000,Manhattan
1560,2015,2015-06-30,5763,32047,L,2,Completed - Migrated,Completed,1,2016-06-29 17:21:01,5763,Fourth Arts Block,"Fourth Arts Block, Inc.",043767933,2016-06-30 00:00:00,61 East 4th Street,Ground Floor,New York,NY,10003,USA,212-228-4670,www.fabnyc.org,2001.0,2003.0,2,66,29,8,New York,0.0,,,2003.0,5.0,,1.0,12.0,828573290.0,7,...,No,,No,,Yes,"IndArtists,IndLowIncome,LGBT",,,,,,46.0,110.0,3.0,6000.0,45861.0,45861.0,225569.0,1.995,1.995,0.168,18.658,0.000,0.926,0.000,0.0,0.751,46186.0,15.791,58960.0,0.037,0.227,20.599,,51.616,4428.00,42377.800,43.0,5.750,Manhattan
1614,2015,2015-07-31,5797,87393,L,2,Completed,Completed,0,2017-02-13 22:55:51,5797,Caribbean Cultural Theatre,Caribbean Cultural Theatre,830508237,2015-07-31 00:00:00,138 South Oxford Street,Suite 4E,Brooklyn,NY,11217,USA,718-783-8345,none,1995.0,1997.0,35,57,18,11,Kings,0.0,,,2009.0,7.0,,0.0,1.0,130237824.0,2,...,,,,,,,,,25.0,25.0,,15.0,63.0,,,,1722.0,1722.0,0.000,0.473,0.112,0.000,0.000,,0.185,,0.236,,29.855,,0.114,1.018,,,106.818,,,4.0,1.013,Brooklyn
1817,2015,2015-06-30,6082,86208,S,2,Completed,Completed,0,2017-02-06 02:07:14,6082,Immediate Medium,Immediate Medium,061607955,2017-06-30 00:00:00,PO Box 1138,,New York,NY,10276,USA,212-696-7227,www.immediatemedium.org,2002.0,2002.0,34,66,29,8,Kings,0.0,,,2003.0,15.0,,0.0,12.0,,5,...,No,,No,,Yes,Artists,,0.0,,0.0,0.0,2.0,16.0,,,,-928.0,-928.0,0.000,0.000,0.199,0.882,,,0.000,,,,41.127,,0.002,0.014,42.211,,,,,4.0,0.100,Brooklyn
1868,2015,2015-06-30,6128,87361,S,2,Completed,Completed,0,2017-02-10 23:19:11,6128,"FreshStart-Cultural Theatre Arts Productions, ...","FreshStart-Cultural Theatre Arts Productions, ...",861135829,2017-06-30 00:00:00,10 Richman Plaza,Suite 23A,Bronx,NY,10453,USA,(347)323-6961,freshstartctaproductions.org,2002.0,2005.0,16,77,28,16,Bronx,0.0,,,2008.0,6.0,,,12.0,,5,...,No,,Yes,Older Adults (65+ years),Yes,Other distinct group(s) (please describe),NYC Adult and Family Shelter Facility Populations,0.0,,0.0,0.0,3.0,3.0,,,,0.0,0.0,0.000,0.000,0.000,0.000,,,0.000,,,,16.250,,0.185,,100.000,,,,,10.0,0.500,Bronx


I googled 10 of the organizations in this list: 9 were non-profits, and 1 didn't say but seemed like a non-profit.  So there may be a couple of for-profit companies in this list, but that doesn't seem to be the main reason.

I don't have any other ideas about why they could be missing, so I'm going to move on.
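
One further possibility is visible in the outputs of this notebook: several of the DataArts tax ids shown above start with 0 (for example 010856384 for Sing for Hope), while the IRS listing further down shows SING FOR HOPE INC with EIN 10856384. If pandas read the IRS EIN column as integers, the leading zero is lost and the string comparison fails. This would not explain every row (Ryan Repertory's id has no leading zero), so treat it as something to check rather than a confirmed cause. A sketch, with `standardize_id_padded` as a hypothetical variant of `standardize_id`:

```python
def standardize_id_padded(value):
    """Strip non-digits and left-pad to the 9 digits of an EIN (hypothetical helper)."""
    if isinstance(value, float) and value.is_integer():
        value = int(value)                    # 10856384.0 -> 10856384
    digits = "".join(ch for ch in str(value) if ch.isdigit())
    if not digits:
        return None                           # missing or non-numeric id
    return digits.zfill(9)                    # restores leading zeros dropped by int parsing


# Values taken from the outputs shown in this notebook.
irs_side = standardize_id_padded(10856384)          # EIN as it appears in the IRS listing
dataarts_side = standardize_id_padded("010856384")  # tax id as it appears in DataArts
```

Rerunning the comparison with padded ids on both sides would show how many of the 169 are explained this way.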

27 organizations were missing because their address wasn't picked up by our address function. Let's look at those.

In [10]:
missing_address

Unnamed: 0,EIN,NAME,NTEE1,NTEECC,ADDRESS,STATE,ZIP,ZIP5,FIPS
21733,115347056,EN GARDE ARTS INC,A,A65,120 HAMILTON AVE,NY,10706-2405,10706,36119.0
22478,131924971,NAUMBURG ORCHESTRAL CONC,A,A69,50 GLENWOOD RD,NY,11803-0000,11803,36059.0
23472,133035736,SANDY GROUND HISTORICAL SOCIETY INC,A,A82,79 FIELDSTONE LN,NY,11581-2303,11581,36059.0
24056,133223977,CANTICORUM VIRTUOSI INC,A,A6B,2 COVE RD,NY,10590-1023,10590,36119.0
26296,133966046,MARIE CHRISTINE GIORDANO DANCE COMPANY INC,A,A62,PO BOX 174,NJ,07646-0174,7646,34003.0
27118,134198925,ETHELS FOUNDATION FOR THE ARTS INC,A,A126,303 S BROADWAY STE 105,NY,10591-5410,10591,36119.0
38044,201125609,ALARM WILL SOUND INC,A,A69,49 ROWLEY ST STE 1,NY,14607-2630,14607,36055.0
96059,264210673,PRISM QUARTET INCORPORATED,A,A68,257 W HARVEY ST,PA,19144-3320,19144,42101.0
193849,463203280,ON SITE OPERA INC,A,A6A,17 BOILING SPRING RD,NJ,07423-1302,7423,34003.0
224236,522158599,AMERICAN DANCE INSTITUTE,A,A6E,1570 E JEFFERSON ST,MD,10001-6809,20852,24031.0


A lot of these don't have addresses in NY. Of the ones that do, I checked about 10 and they were all outside of NYC, so I think we can safely ignore these.  They may be in the DataArts dataset because they do work in NYC, or for some similar reason.

There were 190 organizations we missed because they aren't classified under 'A' by the IRS. Let's look at these.

In [11]:
missing_classification

Unnamed: 0,EIN,NAME,NTEE1,NTEECC,ADDRESS,STATE,ZIP,ZIP5,FIPS
18015,111633524,CONRAD POPPENHUSEN ASSOCIATION,P,P28,PO BOX 560091,NY,11356-0091,11356,36081.0
18022,111635083,QUEENS BOTANICAL GARDEN SOCIETY INC,C,C41,4350 MAIN ST,NY,11355-4742,11355,36081.0
18234,112137138,BROOKLYN NAVY YARD DEVELOPMENT CORPORATION,S,S20,BLDG 29263 FLUSHING AVE300,NY,11205-0000,11205,36047.0
18431,112405466,ALLEY POND ENVIRONMENTAL CENTER INC,C,C60,22806 NORTHERN BLVD,NY,11362-1068,11362,36081.0
18447,112417338,BROOKLYN BOTANIC GARDEN CORPORATION,C,C41,1000 WASHINGTON AVE,NY,11225-1008,11225,36047.0
18508,112453853,RIDGEWOOD BUSHWICK SENIOR CITIZENS COUNCIL INC,P,P81,555 BUSHWICK AVE,NY,11206-4657,11206,36047.0
18618,112507910,FEDERATION OF ITALIAN AMERICAN ORG OF BROOKLYN...,Y,Y40,7403 18TH AVE,NY,11204-5614,11204,36047.0
18712,112547268,BRIC ARTS MEDIA BKLYN INC,S,S41,647 FULTON STREET,NY,11217-1152,11217,36047.0
18872,112614265,EL PUENTE DE WILLIAMSBURG INC,O,O20,211 S 4TH ST,NY,11211-5605,11211,36047.0
18956,112652331,CENTRAL ASTORIA LOCAL DEVELOPMENT COALITION IN...,S,S20,25-69 28TH ST,NY,11103-4164,11103,36081.0


In [12]:
missing_classification.NTEECC.value_counts()

S20     13
P28      8
P20      8
B90      8
B99      7
Q20      5
P84      5
C41      4
B70      4
O50      4
N52      4
T20      4
X20      3
P27      3
O20      3
J40      3
N113     3
C50      3
S31      2
W70      2
O99      2
D50      2
N67      2
S80      2
B43      2
N32      2
P81      2
I20      2
B114     2
S41      2
        ..
P85      1
W19      1
P40      1
L20      1
U30      1
T12      1
B60      1
B118     1
B92      1
C42      1
S50      1
S99      1
G42      1
V35      1
B30      1
T22      1
B80      1
L81      1
O55      1
X99      1
X21      1
Q30      1
N80      1
P88      1
X30      1
P30      1
B24      1
B11      1
W30      1
B199     1
Name: NTEECC, Length: 101, dtype: int64

This list gives us the most common classifications that we missed.  Let's see if any of them are worth including.

S20 	Community & Neighborhood Development 

Organizations that focus broadly on strengthening, unifying and building the economic, cultural, educational and social services of an urban community or neighborhood. Use this code for community and neighborhood improvement organizations other than those specified below. 

P20 	Human Service Organizations 	
Organizations that provide a broad range of social services for individuals or families. Use this code for multiservice organizations such as Lutheran Social Services, Catholic Social Services and other community service organizations not specified below that provide a variety of services from throughout the P section or services from the P section in combination with services described in other sections (e.g., an organization that provides family counseling, substance abuse services, employment assistance and services for at - risk youth). 

B90 	Educational Services 	
Organizations that provide educational programs within the formal educational system or offered as an adjunct to the traditional school curriculum which help students succeed in school and prepare for life. Includes organizations that partner parents, families, schools, business and/or community leaders to broker resources for the benefit of local schools. 

P28 	Neighborhood Centers 	
Neighborhood-based multipurpose centers that offer, at a single location, a wide variety of services and activities that are structured to meet the needs of the entire community through different programs for different age and interest groups. 

None of these IRS codes seems worth including.  However, there are clearly some organizations in the last dataframe that we would like to have in our list.  I'm not totally sure what to do about this: I can't think of a way to separate them that doesn't involve going one by one, which I don't think is worth it.

I think my best idea is to include any that have a word like 'art' or 'theater' in their name.  Let's just see what that list would look like.

In [13]:
def check_name(inputstring):
    # totally random list I'm coming up with right now
    strings_to_check = ['art', 'theater', 'music', 'dance', 'choir']
    # True if any keyword occurs anywhere in the (lowercased) name; this is a plain substring test
    return any(checkstring in inputstring for checkstring in strings_to_check)
missing_classification[missing_classification.NAME.str.lower().apply(check_name)]

Unnamed: 0,EIN,NAME,NTEE1,NTEECC,ADDRESS,STATE,ZIP,ZIP5,FIPS
18712,112547268,BRIC ARTS MEDIA BKLYN INC,S,S41,647 FULTON STREET,NY,11217-1152,11217,36047.0
19408,112953522,INTERNATIONAL AFRICAN ARTS FESTIVAL,N,N52,1360 FULTON ST STE 401,NY,11216-2600,11216,36047.0
23130,132917442,SONORA HOUSE INC C O JUDITH MARTIN,L,L99,PO BOX 823,NY,10108-0823,10108,36061.0
24389,133363579,DOING ART TOGETHER INC,B,B90,127 W 127TH ST,NY,10027-3723,10027,36061.0
25100,133613210,EDUCATION THROUGH MUSIC INC,W,W19,122 E 42ND ST RM 1501,NY,10168-1503,10168,36061.0
25126,133621169,VISUAL AIDS FOR THE ARTS INC,G,G198,526 W 26TH ST RM 510,NY,10001-5521,10001,36061.0
26893,134137551,TOPAZ ARTS INC,B,B99,PO BOX 770150,NY,11377-0150,11377,36081.0
26967,134158573,VISION INTO ART PRESENTS INC,G,G41,25 COLUMBUS CIRCLE NO 68B,NY,10019-1107,10019,36061.0
27129,134201577,MAKING BOOKS SING INC NEW YORK CITY CHILDRENS ...,B,B92,340 E 46TH ST,NY,10017-3003,10017,36061.0
39680,201633483,YOUNG URBAN CHRISTIANS AND ARTISTS INC,X,X20,754 MELROSE AVE,NY,10451-4446,10451,36005.0


Well, was it worth it to get these 21 organizations?  Definitely not, haha.  To avoid confusion, I don't think it's worth including them in our dataset.  But if the group feels differently, here are the EINs.
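
One caveat if anyone does use the name filter: it is a substring test, so 'art' also matches inside other words. In the output above, SONORA HOUSE INC C O JUDITH MARTIN appears to be picked up only because of "martin". A stricter sketch using whole-word matching is below; the optional plural "s" and the name `check_name_words` are my own assumptions, and this version would no longer match "ARTISTS":

```python
import re

# Same keyword list as check_name, matched as whole words with an optional plural "s".
KEYWORD_PATTERN = re.compile(r"\b(?:art|theater|music|dance|choir)s?\b")


def check_name_words(name):
    """Return True if the lowercased name contains one of the keywords as a whole word."""
    return KEYWORD_PATTERN.search(str(name).lower()) is not None


# Names taken from the output above.
keeps_arts = check_name_words("BRIC ARTS MEDIA BKLYN INC")
drops_martin = not check_name_words("SONORA HOUSE INC C O JUDITH MARTIN")
```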

In [14]:
orgs_to_add = missing_classification[missing_classification.NAME.str.lower().apply(check_name)]
list(orgs_to_add.EIN)

[112547268,
 112953522,
 132917442,
 133363579,
 133613210,
 133621169,
 134137551,
 134158573,
 134201577,
 201633483,
 204557408,
 264540036,
 271409736,
 474287935,
 753077676,
 134236056,
 463276178,
 133958495,
 421642691,
 132925233,
 132752494]

That does it for organizations we were missing from DataArts.  Now we're going to go the other direction and check organizations that were in our IRS dataset but aren't in DataArts.  It's harder to come up with hypotheses about why they are not in DataArts because I don't know how DataArts constructed their dataset.  Let's start by taking a look at the data.

In [15]:
missing_from_dataarts = df_2015.loc[df_2015.EIN.apply(standardize_id).isin(in_irs_missing_from_dataarts)]
missing_from_dataarts

Unnamed: 0,EIN,NAME,NTEE1,NTEECC,ADDRESS,STATE,ZIP,ZIP5,FIPS
148,10263908,SKOWHEGAN SCHOOL OF PAINTING AND SCULPTURE INC,A,A25,136 WEST 22ND STREET,NY,10011-2424,10011,36061.0
482,10391592,MAINE JAZZ CAMP,A,A68,VAN BRUNT STATION BOX 150-597,NY,11215-0597,11231,36047.0
1354,10621206,MUSIC OF THE SPHERES SOCIETY INC,A,A68,46 RIVERSIDE DR APT 2N,NY,10024-6893,10024,36061.0
1400,10632725,GAMETOPHYTE INC,A,A60,528 HANCOCK ST STE 3 FL,NY,11233-1019,11233,36047.0
1521,10668318,INTERNATIONAL KEYBOARD INSTITUTE AND FESTIVAL,A,A68,229 W 97TH ST RM/STE 1B,NY,10025-4115,10025,36061.0
1537,10671893,ALEXANDRIA AND AKEAS PLAYHOUSE INC AAPI SERVICES,A,A61,134 MARINERS LN,NY,10303-2548,10303,36085.0
1741,10728746,UGLY DUCKLING PRESSE LTD,A,A20,232 3RD ST STE E002,NY,11215-2733,11215,36047.0
1994,10798319,THE TANK LTD,A,A65,151 WEST 46TH STREET 8TH FLOOR,NY,10036-8512,10036,36061.0
2138,10848423,DIPLATANOS SOCIETY AGIA MARINA INC,A,A23,1992 CONEY ISLAND AVE,NY,11223-2329,11223,36047.0
2157,10856384,SING FOR HOPE INC,A,A68,575 8TH AVE RM 1812,NY,10018-3501,10018,36061.0


A pretty mixed bag: there are definitely some in here that don't belong, but it seems like the majority do.  I'm not really sure what to do here, so I'm going to not worry about it and call it good.
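
If we did want to look for patterns here, one option would be to tabulate the NTEE codes of the unmatched organizations.  This is only a sketch: `sample` is a small hypothetical stand-in for `missing_from_dataarts`, built from a few of the rows shown above.

```python
import pandas as pd

# Toy stand-in for `missing_from_dataarts`; rows copied from the table above.
sample = pd.DataFrame({
    "NAME": ["MAINE JAZZ CAMP", "MUSIC OF THE SPHERES SOCIETY INC",
             "GAMETOPHYTE INC", "DIPLATANOS SOCIETY AGIA MARINA INC",
             "SING FOR HOPE INC"],
    "NTEECC": ["A68", "A68", "A60", "A23", "A68"],
})

# Count organizations per NTEE code, most common first.
code_counts = sample["NTEECC"].value_counts()
print(code_counts)
```

A code that shows up far more often among the unmatched organizations than in the dataset as a whole might be worth a closer look.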

Moving on to the DCLA now.  This is going to be much harder because the DCLA doesn't have EIN as a column in their dataset.  They have three variables that could be used to identify an organization: Name, Address, and Phone Number.  The IRS data doesn't have phone numbers, which leaves Name and Address.  It will be hard to match these exactly, but we can do our best.  First we're going to define some standardizing functions that will help us match things up.

Also, I had to reset the indices of some dataframes in place to get this part to work, so if you go back up and re-run some of the earlier cells afterwards, they'll be messed up.
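
One way to avoid that side effect, sketched here on a made-up frame, would be to reset the index on a copy instead of in place.  The names `df` and `df_matching` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy frame with a non-contiguous index, like a filtered slice of the IRS data.
df = pd.DataFrame({"NAME": ["A ORG", "B ORG", "C ORG"]}, index=[148, 482, 1354])

# reset_index() returns a new frame by default, so `df` itself is untouched.
# drop=True discards the old labels instead of adding an 'index' column.
df_matching = df.reset_index(drop=True)

print(list(df.index))           # original labels are preserved
print(list(df_matching.index))  # positions 0..n-1, safe to use with .iloc
```

With this pattern the earlier cells keep working on the original frame, while the matching code gets the positional index it needs.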

In [16]:
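def standardize_name(string):
    """Lowercase a name and strip everything except letters and digits."""
    string = str(string)
    string = string.lower()
    string = ''.join(e for e in string if e.isalnum())
    return string


def standardize_address(string):
    """Lowercase an address, abbreviate common street words, keep only the
    part before the first comma, and strip everything except letters and digits."""
    string = str(string)
    string = string.lower()
    # these are plain substring replacements, so e.g. 'broadway' becomes 'brdway';
    # that is harmless here because both datasets go through the same function
    string = string.replace('street', 'st')
    string = string.replace('avenue', 'ave')
    string = string.replace('road', 'rd')
    # drop anything after the first comma (city, state, etc.)
    string = string.split(',', 1)[0]
    string = ''.join(e for e in string if e.isalnum())
    return string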

df_2015.reset_index(inplace = True)

irs_names = list(df_2015.NAME.apply(standardize_name))
irs_addresses = list(df_2015.ADDRESS.apply(standardize_address))
dcla_names = list(dcla['Organization Name'].apply(standardize_name))
dcla_addresses = list(dcla['Address'].apply(standardize_address))

in_dcla_missing_from_irs = []
in_irs_missing_from_dcla = []
matching_dcla_irs = []
matching_irs_dcla = []
for name, address, index in zip(dcla['Organization Name'], dcla['Address'], dcla.index):
    checkname = standardize_name(name)
    checkaddress = standardize_address(address)
    # now is where I'm getting the sense that there's an easier way to do this than with these
    # loops, so if you know of one please let me know
    if checkname in irs_names:
        matching_dcla_irs.append(index)
    elif checkaddress in irs_addresses:
        matching_dcla_irs.append(index)
    else:
        in_dcla_missing_from_irs.append(index)

for name, address, index in zip(df_2015.NAME, df_2015.ADDRESS, df_2015.index):
    checkname = standardize_name(name)
    checkaddress = standardize_address(address)
    
    if checkname in dcla_names:
        matching_irs_dcla.append(index)
    elif checkaddress in dcla_addresses:
        matching_irs_dcla.append(index)
    else:
        in_irs_missing_from_dcla.append(index)
        
print('Number of organizations in the IRS dataset: ', len(irs_names))
print('Number of organizations in the DCLA dataset: ', len(dcla_names))
print('')
print('Number of organizations in both datasets:', len(matching_dcla_irs))
print('Number of organizations in both, should match above:', len(matching_irs_dcla))
print('')
print('Number of organizations in DCLA but missing from IRS:', len(in_dcla_missing_from_irs))
print('Number of organizations in IRS but missing from DCLA:', len(in_irs_missing_from_dcla))

Number of organizations in the IRS dataset:  3272
Number of organizations in the DCLA dataset:  2130

Number of organizations in both datasets: 921
Number of organizations in both, should match above: 874

Number of organizations in DCLA but missing from IRS: 1209
Number of organizations in IRS but missing from DCLA: 2398
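
On the question in the code comment about an easier way than the loops: one possibility is to let pandas do the membership tests with `Series.isin`.  The sketch below uses made-up rows and a simplified `normalize` helper standing in for `standardize_name` and `standardize_address`; both are assumptions for illustration, not the code that produced the numbers above.

```python
import pandas as pd

def normalize(value):
    """Lowercase and keep only letters and digits; missing values become ''."""
    if pd.isna(value):
        return ""
    return "".join(ch for ch in str(value).lower() if ch.isalnum())

# Toy stand-ins for the IRS and DCLA frames (made-up rows).
irs = pd.DataFrame({
    "NAME": ["THE TANK LTD", "SING FOR HOPE INC", "GAMETOPHYTE INC"],
    "ADDRESS": ["151 W 46TH ST", "575 8TH AVE", "528 HANCOCK ST"],
})
dcla = pd.DataFrame({
    "Organization Name": ["The Tank, Ltd.", "Some Other Group", "Third Group"],
    "Address": ["1 Main St", "575 8th Ave", "575 8TH AVE"],
})

irs_name, irs_addr = irs["NAME"].map(normalize), irs["ADDRESS"].map(normalize)
dcla_name = dcla["Organization Name"].map(normalize)
dcla_addr = dcla["Address"].map(normalize)

# A DCLA row matches if its name OR its address appears anywhere in the IRS data.
dcla_matched = dcla_name.isin(set(irs_name) - {""}) | dcla_addr.isin(set(irs_addr) - {""})
# The same test in the other direction.
irs_matched = irs_name.isin(set(dcla_name) - {""}) | irs_addr.isin(set(dcla_addr) - {""})

print(dcla_matched.sum(), irs_matched.sum())
```

Sets make each lookup fast, and removing the empty string keeps missing names or addresses from matching each other.  In the loop version, `str()` turns a missing value into the text `'nan'`, so that may be worth checking there too.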


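I don't think there's much more than this that we can do.  The two "in both" counts don't agree (921 vs. 874), presumably because the matching isn't one-to-one: each count is the number of rows on one side that match anything on the other side, so shared addresses or repeated names can make them differ.  We can't hunt down the missing ones like we could with DataArts, other than doing more work to try to match the names.  We can take a look at the data in the IRS that's missing from DCLA just to see.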
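
If we did want to do more work on matching names, approximate string matching is one option.  Here is a minimal sketch using `difflib` from the standard library.  The DCLA-style names are made up, and the `cutoff` of 0.8 is an arbitrary assumption that would need tuning against real data.

```python
import difflib

def normalize(name):
    """Lowercase and keep only letters and digits."""
    return "".join(ch for ch in str(name).lower() if ch.isalnum())

# Made-up examples: names that differ slightly between the two sources.
irs_names = ["UGLY DUCKLING PRESSE LTD", "SING FOR HOPE INC", "THE TANK LTD"]
dcla_names = ["Ugly Duckling Presse", "Sing for Hope", "Brooklyn Museum"]

# Map each normalized IRS name back to its original spelling.
irs_lookup = {normalize(n): n for n in irs_names}

def closest_irs_name(dcla_name, cutoff=0.8):
    """Return the most similar IRS name, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(
        normalize(dcla_name), list(irs_lookup), n=1, cutoff=cutoff
    )
    return irs_lookup[hits[0]] if hits else None

for name in dcla_names:
    print(name, "->", closest_irs_name(name))
```

Fuzzy matches can be wrong in both directions, so any pairs found this way would still need a manual check before we count them as the same organization.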

In [17]:
df_2015.iloc[in_irs_missing_from_dcla]

Unnamed: 0,index,EIN,NAME,NTEE1,NTEECC,ADDRESS,STATE,ZIP,ZIP5,FIPS
0,148,10263908,SKOWHEGAN SCHOOL OF PAINTING AND SCULPTURE INC,A,A25,136 WEST 22ND STREET,NY,10011-2424,10011,36061.0
1,482,10391592,MAINE JAZZ CAMP,A,A68,VAN BRUNT STATION BOX 150-597,NY,11215-0597,11231,36047.0
2,1354,10621206,MUSIC OF THE SPHERES SOCIETY INC,A,A68,46 RIVERSIDE DR APT 2N,NY,10024-6893,10024,36061.0
3,1400,10632725,GAMETOPHYTE INC,A,A60,528 HANCOCK ST STE 3 FL,NY,11233-1019,11233,36047.0
4,1521,10668318,INTERNATIONAL KEYBOARD INSTITUTE AND FESTIVAL,A,A68,229 W 97TH ST RM/STE 1B,NY,10025-4115,10025,36061.0
5,1537,10671893,ALEXANDRIA AND AKEAS PLAYHOUSE INC AAPI SERVICES,A,A61,134 MARINERS LN,NY,10303-2548,10303,36085.0
6,1741,10728746,UGLY DUCKLING PRESSE LTD,A,A20,232 3RD ST STE E002,NY,11215-2733,11215,36047.0
7,1994,10798319,THE TANK LTD,A,A65,151 WEST 46TH STREET 8TH FLOOR,NY,10036-8512,10036,36061.0
8,2138,10848423,DIPLATANOS SOCIETY AGIA MARINA INC,A,A23,1992 CONEY ISLAND AVE,NY,11223-2329,11223,36047.0
9,2157,10856384,SING FOR HOPE INC,A,A68,575 8TH AVE RM 1812,NY,10018-3501,10018,36061.0


These mostly look good.  I think the most likely explanation is that they didn't receive any funds from the city, and that's why they didn't make it into the DCLA dataset.

So at the end of the day nothing has really changed, but at least we have a better idea about the kinds of organizations we have and the kinds of organizations we might be missing.  