## API Calls to ProPublica for Retrieving Filing Data

This section outlines how we call ProPublica's API to obtain additional data for each organization: its name, revenue from previous years' filings (for reference), and its NTEE code, which we use to categorize nonprofits by type.

### Motivation

The motivation behind making API calls to ProPublica is twofold:

1. **Initial Understanding:** These calls provide an initial understanding of the data required for the project.

2. **Additional Features:** ProPublica's API offers fields, such as NTEE codes and links to filing PDFs, that can enhance our main dataset.

In [None]:
# Importing the libraries needed for the API calls
import os # For pathing and batch folder naming
import requests # This library is the reason why API calls work
import pandas as pd # Good ol' pandas, can't do any data scraping without this puppy
import concurrent.futures # Did someone say multithreading?

### The Function to get our Data and Write to a CSV

1. **API Endpoint:** We accessed ProPublica's API and queried it by Employer Identification Number (EIN), one request per organization. Because ProPublica's nonprofit API is free to use, the requests below need no paid access.

2. **Data Collection:** The API calls fetched data for organizations based on their EINs. This data includes financial information, tax period years, PDF document URLs, organization details (such as name, city, state), and more.

3. **Multithreading:** To speed up data retrieval, we parallelized the API requests with `concurrent.futures.ThreadPoolExecutor`, which also gave us a chance to learn more about multithreading.
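
The three steps above can be sketched without touching the network. The example below is a minimal illustration, not the project's actual code: `to_number`, `summarize`, and `SAMPLE_PAYLOADS` are hypothetical names, and the payloads are made-up stand-ins that only mimic the fields the lookup function reads (`organization`, `filings_with_data`, `totrevenue`, `totfuncexpns`). It shows how missing values can be handled before computing net revenue, and how `executor.map` returns results in the same order as its inputs.

```python
import concurrent.futures


def to_number(value):
    """Return value as a float, or None when it is missing or non-numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize(payload):
    """Pick out name, revenue and net revenue from one organization payload."""
    filings = payload.get("filings_with_data") or []
    if not filings:
        return None  # Nothing to report for this organization
    filing = filings[0]
    revenue = to_number(filing.get("totrevenue"))
    expenses = to_number(filing.get("totfuncexpns"))
    # Only subtract when both numbers are present
    if revenue is not None and expenses is not None:
        net = revenue - expenses
    else:
        net = None
    return (payload["organization"]["name"], revenue, net)


# Made-up payloads for illustration only; real API responses carry many more fields
SAMPLE_PAYLOADS = [
    {"organization": {"name": "Example A"},
     "filings_with_data": [{"totrevenue": 100, "totfuncexpns": 40}]},
    {"organization": {"name": "Example B"}, "filings_with_data": []},
    {"organization": {"name": "Example C"},
     "filings_with_data": [{"totrevenue": 250}]},
]

# executor.map keeps results in input order, even if threads finish out of order
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    summaries = list(executor.map(summarize, SAMPLE_PAYLOADS))

for summary in summaries:
    print(summary)
```

Threads suit this workload because each request spends most of its time waiting on the network rather than computing.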


In [None]:
def einLookup(eins): # Our einLookup function, which takes a list of EINs and returns the data found for each
    results = []

    def fetch_data(ein): # Fetching the Data, essentially sending query and getting output
        try: # Try catch, good practice to prevent disruptions
            q = f'https://projects.propublica.org/nonprofits/api/v2/organizations/{ein}.json'
            response = requests.get(q, timeout=30) # Timeout so a stalled request cannot hang a worker thread
            response.raise_for_status()
            data = response.json() # Getting our data from the response of the API
            # print(data)

            if 'filings_with_data' in data and data['filings_with_data']: # Now we look for filings with data in the data given
                filingData = data['filings_with_data'][0] # So if we find it (since filings will contain our total rev and total expenses)
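                yearEnd = filingData.get('tax_prd_yr', 'N/A') # Get the tax period year

                totRevenue = filingData.get('totrevenue', 'N/A') # Total revenue
                totfuncexpns = filingData.get('totfuncexpns', 'N/A') # Total functional expenses
                # Only subtract when both values are numeric; 'N/A' or None would raise a TypeError
                if isinstance(totRevenue, (int, float)) and isinstance(totfuncexpns, (int, float)):
                    netRevenue = totRevenue - totfuncexpns # Calculating net revenue
                else:
                    netRevenue = 'N/A'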

                result = [ # Storing our result here to later append to the csv
                    ein,
                    data['organization']['name'],
                    totRevenue,
                    netRevenue,
                    yearEnd,
                    filingData.get('pdf_url', 'N/A'),
                    data['organization'].get('city', 'N/A'),
                    data['organization'].get('state', 'N/A'),
                    data['organization'].get('ntee_code', 'N/A'),
                    filingData.get('formtype', 'N/A')
                ]
                return result # Returning results
            else:
                print(f'No data found for EIN {ein}') # If no data is found, just mention EIN of skipped
                return None
        except Exception as e: # If there is exception we catch it
            print(f'Error for EIN {ein}: {e}')
            return None

    with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor: # Up to 8 worker threads (threads, not CPU cores)
        results = list(executor.map(fetch_data, eins))

    results = [r for r in results if r is not None]

    return results

def csvWriter(filename, data): # The infamous CSV writer
    # Setting Col names
    df = pd.DataFrame(data, columns=['EIN_value', 'Name', 'Gross_Revenue', 'Net_Revenue', 'Tax_Period_Year', 'PDF_URL', 'City', 'State', 'NTEE_Code', 'Form_Type'])
    df.to_csv(filename, index=False) # Writing the DataFrame to a CSV file
    print(f'Data has been written to {filename}.') # Confirming the write

def einLoader(): # Now our EIN value loader, and where we get the data
    # Setting path to read data, more specifically EIN values from the XML data set pulled from the IRS
    einPath = "../OpportunityHack/Data/einval.xlsx"
    einVal = pd.read_excel(einPath)
    einList = [str(ein) for ein in einVal['EIN']]

    # Setting path to store the batches of data
    batchFolder = "../OpportunityHack/Batch"
    os.makedirs(batchFolder, exist_ok=True) # Create the batch folder if it does not exist yet
    batchSize = 50 # Batches are nice: you can see progress, they are easy to merge, and you keep partial data if a run stops early
    length = len(einList)
    # length = 2 # Testing for single batch to see what we get

    for i in range(0, length, batchSize): # Loop through length, write files to each batch and store them
        batch = einList[i:i + batchSize]
        results = einLookup(batch)
        batchFileName = os.path.join(batchFolder, f'out_batch_{i // batchSize}.csv')
        csvWriter(batchFileName, results)
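
The batching loop above relies on list slicing, which can be seen in isolation with a small sketch. `chunked` is a hypothetical helper (it is not part of the notebook), and the EINs are placeholder strings; the point is that a slice past the end of the list simply returns the shorter final batch, so no special case is needed.

```python
def chunked(items, size):
    """Yield (batch_index, batch) pairs, each batch holding at most `size` items."""
    for start in range(0, len(items), size):
        # Slicing past the end is safe: the last batch is just shorter
        yield start // size, items[start:start + size]


# Placeholder values standing in for real EINs
eins = [str(n) for n in range(1, 8)]
batches = list(chunked(eins, 3))

for index, batch in batches:
    print(f"out_batch_{index}.csv would hold {batch}")
```

With seven items and a batch size of three, this produces batches 0, 1, and 2, matching the `i // batchSize` numbering used for the file names.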

### Calling our Main Function for actually getting the Data

In [None]:
if __name__ == "__main__": # The Main Function where we call our function to get the data
    einLoader()

### The Merger that Combines the Batch CSVs into a Final Data Set for Analysis

In [None]:
pathOfBatch = "../OpportunityHack/Batch" # We read the dir where we have batch
filePath = os.listdir(pathOfBatch) # List everything in the batch folder
files = [file for file in filePath if os.path.isfile(os.path.join(pathOfBatch, file))] # Keep only regular files

num = len(files) # Batch files are numbered 0..num-1, so the file count is the loop bound (+1 would look for a file that does not exist)
mergeDF = pd.DataFrame() # Making the mergeDF an actual df

for i in range(num): # Looping through num of files
    filename = f'../OpportunityHack/Batch/out_batch_{i}.csv' 
    df = pd.read_csv(filename) # Reading each file
    mergeDF = pd.concat([mergeDF, df], ignore_index=True) # Concatenating them

mergeName = '../OpportunityHack/Data/merged_out.csv' # Setting our merged csv
mergeDF.to_csv(mergeName, index=False)

for i in range(num): # Now removing the remaining batch files
    filename = f'../OpportunityHack/Batch/out_batch_{i}.csv'
    os.remove(filename)

print(f'Merged data successfully written to {mergeName}. Batch files have been deleted.')
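
The merge step assumes the batch folder contains only files named `out_batch_0.csv`, `out_batch_1.csv`, and so on, with no gaps. A more defensive variant, sketched below under that naming assumption, filters a directory listing by pattern and sorts by the numeric index instead of counting files. `BATCH_PATTERN` and `ordered_batch_files` are hypothetical names, and the listing is a made-up example rather than real output from `os.listdir`.

```python
import re

# Matches names like out_batch_12.csv and captures the batch number
BATCH_PATTERN = re.compile(r"^out_batch_(\d+)\.csv$")


def ordered_batch_files(names):
    """Return only batch file names, sorted by their numeric batch index."""
    matched = []
    for name in names:
        match = BATCH_PATTERN.match(name)
        if match:
            matched.append((int(match.group(1)), name))
    # Sorting on the integer index avoids "10" sorting before "2"
    return [name for _, name in sorted(matched)]


# Made-up directory listing, including a stray file that should be ignored
listing = ["out_batch_10.csv", "out_batch_2.csv", "notes.txt", "out_batch_0.csv"]
ordered = ordered_batch_files(listing)
print(ordered)
```

The resulting names could then be joined with the batch folder path and read one by one, so a stray file or a missing batch would not stop the merge.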
