# Tutorial 10-02 - Multiprocessing

Going back to our work with GeoNinjas PythonAnalytics, we've been given another task to speed up.  In this case, we have a single dataset composed of highways intersected with California counties.  Our colleagues would like us to set up a repeatable process that creates a zipped file geodatabase for each county, containing a feature class with only the highways for that county.  We could do this sequentially and it wouldn't take forever, but it's a great candidate for multiprocessing and good practice for more complex and time-consuming operations.

## Set Up Inputs and Get a List of Counties

#### 1.  Import packages.

In this case, you'll be using `arcpy` for creating file geodatabases and feature classes.  You'll be using the `os` package for dealing with folders and paths.  You'll be using the `zipfile` package to compress a file geodatabase into a single file.  Finally, you'll be using the `multiprocessing` package to perform your logic multiple times in parallel.

In [None]:
import arcpy
import os
import zipfile
import multiprocessing

#### 2.  Set up inputs

The first thing you'll want to do is set up variables for your user inputs.  Even if you don't end up turning this script into a tool, it's helpful to keep the inputs together at the top of the script, where you can easily change them later to reuse the script.

In [None]:
# input file geodatabase path
input_fgdb = r"..\Chapter 03 - ArcPy Basics\Chapter 03 Files\Chapter 02 - Working with Maps.gdb"

# input feature class name
input_fc_name = "Highways_intersect"

# output folder
output_folder = r".\zipped_outputs"

#### 3.  Create and test input feature class path

It's a good idea to test that your input paths are valid to start.  You can use the `os` package to combine the file geodatabase path and feature class name to get a full path.  Then you can use the `arcpy.Exists()` function to ensure that the path is valid.

In [None]:
full_fc_path = os.path.join(
    input_fgdb, input_fc_name
)

full_fc_path

In [None]:
arcpy.Exists(full_fc_path)

#### 4.  Generate a list of counties

Assuming the path is valid, you can use a `SearchCursor` and a list comprehension to generate a list of counties.  For more details on this, check out the **ArcPy Basics** chapter.

In [None]:
counties = [r[0] for r in arcpy.da.SearchCursor(full_fc_path, ['NAMELSAD'])]
len(counties)

You can use a set object to narrow that list of all counties down to unique values.  For more information on this, review the **Data Structures** and **ArcPy Basics** chapters.

In [None]:
counties = list(set(counties))
counties.sort()
counties
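The deduplication and sort can also be combined into one step with `sorted()`, which accepts any iterable and returns a new sorted list.  Here's a minimal sketch; the hand-typed list is an illustrative stand-in for the cursor results, not values read from the dataset.

```python
# Illustrative values standing in for the SearchCursor results;
# the real list has one entry per highway feature.
raw_counties = ["Butte County", "Alameda County", "Butte County", "Yolo County"]

# set() drops the duplicates, sorted() returns a new alphabetized list
unique_counties = sorted(set(raw_counties))
print(unique_counties)
```

Either approach gives the same list; `sorted()` simply avoids the separate in-place `.sort()` call.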

#### 5. Set up an output folder

You can use the `os` package to test and see if a folder exists at a specified path.  In this case, if it doesn't exist, we'll make it using the `os.mkdir()` function.

In [None]:
if not os.path.exists(output_folder):
    os.mkdir(output_folder)
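If the output folder might sit inside folders that don't exist yet, `os.makedirs()` with `exist_ok=True` is one alternative.  It creates any missing parent folders and doesn't raise an error when the folder is already there.  The sketch below uses a temporary directory and a hypothetical nested path so it can run anywhere.

```python
import os
import tempfile

# Temporary parent directory so the example leaves nothing behind
with tempfile.TemporaryDirectory() as parent:
    # Hypothetical nested output path; neither level exists yet
    nested_output = os.path.join(parent, "results", "zipped_outputs")

    # Creates "results" and "zipped_outputs" in one call
    os.makedirs(nested_output, exist_ok=True)

    # Calling it again is harmless because of exist_ok=True
    os.makedirs(nested_output, exist_ok=True)

    folder_created = os.path.isdir(nested_output)

print(folder_created)
```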

## Set Up Logic for a Single County

Now that you've got a list of counties, the next step is to set up the logic you're going to execute on each county individually.

#### 1.  Pick a single county

Execute the following cell to use the first value in the counties list.

In [None]:
county = counties[0]
county

#### 2.  Replace any spaces.

Feature class names don't allow spaces, so you can use the string type's built-in `.replace()` method to swap every space for an underscore.

In [None]:
county_no_spaces = county.replace(" ", "_")
county_no_spaces
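Spaces are the only characters handled here, but names in other datasets may contain periods, apostrophes, or hyphens.  Below is a sketch of a more general cleanup using a hypothetical helper called `make_safe_name`.  The rules it applies (letters, digits, and underscores only, starting with a letter) are a conservative assumption rather than the exact geodatabase naming rules.  ArcPy also provides `arcpy.ValidateTableName()` for this kind of check.

```python
import re


def make_safe_name(name):
    """Hypothetical helper: keep only letters, digits, and underscores."""
    # Replace every character outside A-Z, a-z, 0-9, and _ with an underscore
    safe = re.sub(r"[^A-Za-z0-9_]", "_", name)

    # Assumed rule: prefix names that are empty or don't start with a letter
    if not safe or not safe[0].isalpha():
        safe = f"fc_{safe}"
    return safe


print(make_safe_name("San Luis Obispo County"))
print(make_safe_name("St. Mary's County"))
```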

#### 3. Create a file geodatabase

You can use the `arcpy` package to create a file geodatabase.  There's a geoprocessing tool (function) called `CreateFileGDB` in the `management` toolbox (module).

NOTE - The `CreateFileGDB` tool returns an `arcpy.Result` object rather than a plain string.  To get the path to the file geodatabase, index into the result: `fgdb[0]` is the path, and that's how we'll refer to it in many cases.

In [None]:
fgdb = arcpy.management.CreateFileGDB(
    out_folder_path = output_folder,
    out_name = f"{county_no_spaces}_Output"
)

fgdb

In [None]:
fgdb[0]

#### 4.  Create a feature class for the county

To export the features for a single county, you'll be using the `ExportFeatures` tool in the `conversion` toolbox.  This tool allows you to set a `where_clause` to pick features by an attribute expression.

In [None]:
output_fc = arcpy.conversion.ExportFeatures(
    in_features = full_fc_path,
    out_features = os.path.join(fgdb[0], 
                                f"{county_no_spaces}_Highways"),
    where_clause = f"NAMELSAD = '{county}'"
)
output_fc[0]

To check that the export worked, you can use `arcpy.Exists()` to confirm that the output feature class was created, or `arcpy.management.GetCount()` to find out how many records were exported.

In [None]:
arcpy.management.GetCount(output_fc)

#### 5.  Compress the file geodatabase into a zip file

Now that you've created a file geodatabase and feature class, you can zip the file geodatabase up into a single zip file for easy transfer.  You'll use the `zipfile` package to do this.

In [None]:
# define a path for the zip file
zip_file_path = os.path.join(output_folder, f"{county} Highways.zip") 

# use a context manager to create a zipfile object
with zipfile.ZipFile(zip_file_path, "w") as zipper:
    
    # use os.walk to iterate through each file in the file geodatabase
    for root, dirs, files in os.walk(fgdb[0]):
        for file in files:
            
            # original file path
            fpath = os.path.join(root, file)
            
            # relative zipfile path
            zpath = os.path.relpath(
                        os.path.join(root, file),
                        os.path.join(fgdb[0], '..')
                    )
            
            # write the file
            zipper.write(
                fpath,
                zpath
            )
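The same walk-and-write pattern can be tried without ArcPy.  The sketch below wraps it in a hypothetical `zip_folder` helper and runs it against a made-up stand-in folder, then reads the archive back with `namelist()` to confirm that the folder name is kept as the top-level entry.  Note that `zipfile.ZipFile` stores files uncompressed by default; passing `zipfile.ZIP_DEFLATED`, as this sketch does, turns on compression.

```python
import os
import tempfile
import zipfile


def zip_folder(folder_path, zip_file_path):
    """Hypothetical helper: zip a folder, keeping the folder name in the archive."""
    # Archive paths are made relative to the folder's parent directory
    parent = os.path.dirname(os.path.abspath(folder_path))

    with zipfile.ZipFile(zip_file_path, "w", zipfile.ZIP_DEFLATED) as zipper:
        for root, dirs, files in os.walk(folder_path):
            for file in files:
                fpath = os.path.join(root, file)
                zipper.write(fpath, os.path.relpath(fpath, parent))
    return zip_file_path


with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a file geodatabase: a folder with one placeholder file
    # (the folder and file names are made up for this example)
    fake_gdb = os.path.join(tmp, "Example_Output.gdb")
    os.mkdir(fake_gdb)
    with open(os.path.join(fake_gdb, "a00000001.gdbtable"), "w") as f:
        f.write("placeholder")

    zip_path = zip_folder(fake_gdb, os.path.join(tmp, "Example.zip"))

    # Read the archive back to see what was stored
    with zipfile.ZipFile(zip_path) as zipped:
        archive_names = zipped.namelist()

print(archive_names)
```

Keeping the `.gdb` folder name inside the archive matters because a file geodatabase is only recognized as one when its files sit inside a folder with that extension.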

#### 6.  Delete the temporary file geodatabase

Since you've zipped the file geodatabase, you can delete the file geodatabase as part of your cleanup.

In [None]:
arcpy.management.Delete(fgdb)

## Create a repeatable function

As in the previous exercise, we're now going to take all the logic we just developed and turn it into a function.  Because we used variables for our inputs, this will require very little refactoring.

#### 1.  Refactor the individual logic as a function

In the cell below, you can copy and paste all the code you wrote in the previous step.  You can turn it into a function and set some parameters for the input feature class, output folder, and individual county.

In [None]:
def zip_county_highways(full_fc_path, output_folder, county):
    
    # remove spaces from county name
    county_no_spaces = county.replace(" ", "_")
    
    # create a file geodatabase
    fgdb = arcpy.management.CreateFileGDB(
        out_folder_path = output_folder,
        out_name = f"{county_no_spaces}_Output"
    )

    # Create a feature class
    output_fc = arcpy.conversion.ExportFeatures(
        in_features = full_fc_path,
        out_features = os.path.join(fgdb[0], 
                                    f"{county_no_spaces}_Highways"),
        where_clause = f"NAMELSAD = '{county}'"
    )
    
    # define a path for the zip file
    zip_file_path = os.path.join(output_folder, f"{county} Highways.zip") 

    # zip the file geodatabase
    with zipfile.ZipFile(zip_file_path, "w") as zipper:
        for root, dirs, files in os.walk(fgdb[0]):
            for file in files:
                fpath = os.path.join(root, file)
                zpath = os.path.relpath(
                            os.path.join(root, file),
                            os.path.join(fgdb[0], '..')
                        )
                zipper.write(
                    fpath,
                    zpath
                )
    
    # delete the file geodatabase
    arcpy.management.Delete(fgdb)
    
    # return the zip file path
    return zip_file_path

#### 2.  Test your function

In [None]:
zip_county_highways(full_fc_path, output_folder, "Butte County")

## Set Up Multiprocessing

The last section of this tutorial is where we finally set up a script to execute this process in parallel.  Ultimately, we'll execute this as a script outside the Jupyter Notebook because there are some special conditions involving IPython that make multiprocessing particularly difficult.  It's just much easier to do as a script.

Until then, there are a couple of things we can do here to make our script writing easier.

#### 1.  Find out how many cores you have available.

You can use the `multiprocessing` package to find out how many logical CPU cores your machine or environment reports.  This will help inform how many parallel processes you can run at the same time.

In [None]:
multiprocessing.cpu_count()
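The core count is an upper bound rather than a required setting.  One possible approach, sketched below with a hypothetical `pick_worker_count` helper, is to leave a core free for the rest of the system and never start more workers than there are tasks.  The `reserve=1` default is an assumption to tune for your own machine.

```python
import multiprocessing


def pick_worker_count(task_count, reserve=1):
    """Hypothetical helper: leave `reserve` cores free, never exceed the task count."""
    available = multiprocessing.cpu_count() - reserve

    # Always return at least one worker, even on a single-core machine
    return max(1, min(available, task_count))


# California has 58 counties, so there are at most 58 tasks to run
print(pick_worker_count(task_count=58))
```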

#### 2.  Clean up your code and prepare a script

Now you can take all this code that you've developed and clean it up to be used in a script.  You can start with defining your function, then importing your packages, and defining your inputs.  Then you can insert the logic to generate the list of unique counties.  You can stop there for now and we'll work on the multiprocessing next.

In [None]:
def zip_county_highways(full_fc_path, output_folder, county):
    
    # remove spaces from county name
    county_no_spaces = county.replace(" ", "_")
    
    # create a file geodatabase
    fgdb = arcpy.management.CreateFileGDB(
        out_folder_path = output_folder,
        out_name = f"{county_no_spaces}_Output"
    )

    # Create a feature class
    output_fc = arcpy.conversion.ExportFeatures(
        in_features = full_fc_path,
        out_features = os.path.join(fgdb[0], 
                                    f"{county_no_spaces}_Highways"),
        where_clause = f"NAMELSAD = '{county}'"
    )
    
    # define a path for the zip file
    zip_file_path = os.path.join(output_folder, f"{county} Highways.zip") 

    # zip the file geodatabase
    with zipfile.ZipFile(zip_file_path, "w") as zipper:
        for root, dirs, files in os.walk(fgdb[0]):
            for file in files:
                fpath = os.path.join(root, file)
                zpath = os.path.relpath(
                            os.path.join(root, file),
                            os.path.join(fgdb[0], '..')
                        )
                zipper.write(
                    fpath,
                    zpath
                )
    
    # delete the file geodatabase
    arcpy.management.Delete(fgdb)
    
    # return the zip file path
    return zip_file_path

# package imports
import arcpy
import os
import zipfile
import multiprocessing

#=============================== INPUTS ===================================
# input file geodatabase path
input_fgdb = r"..\Chapter 03 - ArcPy Basics\Chapter 03 Files\Chapter 02 - Working with Maps.gdb"

# input feature class name
input_fc_name = "Highways_intersect"

# output folder
output_folder = r".\zipped_outputs"
#==========================================================================

if __name__ == '__main__':
    # get the full feature class path
    full_fc_path = os.path.join(
        input_fgdb, input_fc_name
    )

    # get the county for each feature
    counties = [r[0] for r in arcpy.da.SearchCursor(full_fc_path, ['NAMELSAD'])]

    # narrow the counties down to unique counties
    counties = list(set(counties))
    counties.sort()

    # create the output folder
    if not os.path.exists(output_folder):
        os.mkdir(output_folder)
        
    # get your cpu count for multiprocessing
    process_count = multiprocessing.cpu_count()

#### 3.  Set up multiprocessing

Finally, you can write your multiprocessing logic.  **This will not work in a Jupyter Notebook**, so your last step will be putting this all in a script file and executing that.

You'll be using the `concurrent.futures` module in much the same way you did during the Threading tutorial earlier in this chapter.  This time, you'll be creating a `ProcessPoolExecutor` instead of a `ThreadPoolExecutor`.  All the other logic still applies.  You'll be submitting your function to the executor and iterating through a list of futures.

In [None]:
if __name__ == '__main__':

    from concurrent.futures import ProcessPoolExecutor, as_completed
    
    # set up the process pool executor
    with ProcessPoolExecutor(max_workers=process_count) as executor:
        
        # set up a list to contain all the future objects
        futures_list = []
        
        # submit each job to the executor
        for county in counties:
            futures_list.append(executor.submit(zip_county_highways, full_fc_path, output_folder, county))

        # iterate through the futures to see when they're completed
        for future in as_completed(futures_list):
            print(future.result())
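Calling `future.result()` re-raises any exception that occurred in the worker, so a single failed county would stop the loop above.  One way to keep going is to catch the exception for each future and record which input it belonged to.  The sketch below uses a hypothetical `run_jobs` helper and a stand-in job function in place of `zip_county_highways`, simplified to take one argument.  It runs on a `ThreadPoolExecutor` so that it also works inside a notebook; the helper accepts any `concurrent.futures` executor, so in a script you could pass it a `ProcessPoolExecutor` created under `if __name__ == '__main__':`.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_jobs(executor, func, items):
    """Hypothetical helper: submit func(item) per item, return (results, errors)."""
    # Map each future back to the item it was created for
    futures = {executor.submit(func, item): item for item in items}

    results, errors = {}, {}
    for future in as_completed(futures):
        item = futures[future]
        try:
            results[item] = future.result()
        except Exception as exc:
            # Record the failure and keep processing the other futures
            errors[item] = exc
    return results, errors


def demo_job(name):
    # Stand-in for zip_county_highways; fails on purpose for one made-up input
    if name == "Bad County":
        raise ValueError(f"could not process {name}")
    return f"{name} Highways.zip"


with ThreadPoolExecutor(max_workers=2) as executor:
    results, errors = run_jobs(executor, demo_job, ["Butte County", "Bad County"])

print(results)
print(errors)
```

Keying the futures by their input also lets you report failures by county name rather than by position in the list.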

#### 4.  Finalize and run your script

The final step here is to combine all this logic into a .py script file and run that outside your Jupyter Notebook.