<a href="https://github.com/theonaunheim">
    <img style="border-radius: 100%; float: right;" src="static/strawberry_thief_square.png" width=10% alt="Theo Naunheim's Github">
</a>
<br style="clear: both">
<hr>
<br>

<h1 align='center'>Subprocesses</h1>

<br>

<div style="display: table; width: 100%">
    <div style="display: table-row; width: 100%;">
        <div style="display: table-cell; width: 50%; vertical-align: middle;">
            <img src="static/subprocess.png" width="300">
        </div>
        <div style="display: table-cell; width: 10%">
        </div>
        <div style="display: table-cell; width: 40%; vertical-align: top;">
            <blockquote>
                <p style="font-style: italic;">"I love to delegate. I am either lazy enough, or busy enough, or trusting enough, or congenial enough, that the notion of leaving tasks in someone else's lap doesn't just sound wise to me, it sounds attractive."</p>
                <br>
                <p>-John Ortberg</p>
            </blockquote>
        </div>
    </div>
</div>

<br>

<div align='left'>
    <br>
    Image courtesy of <a href='https://commons.wikimedia.org/w/index.php?search=split+lane&title=Special:Search&profile=default&fulltext=1&searchToken=dmt3fqeomz3cl82rr4p82nmwh#/media/File:Singapore_Road_Signs_-_Regulatory_Sign_-_Split_Way.svg'>Woodennature</a> under the <a href='https://creativecommons.org/licenses/by/3.0/'>CC BY 3.0</a>
</div>

<hr>

# Generally

Sometimes you just need to call another process. Python makes handing off work to other processes fairly painless.

---

# Modules covered

### Standard Library
* [asyncio](https://docs.python.org/3/library/asyncio.html)
* [multiprocessing](https://docs.python.org/3.6/library/multiprocessing.html)
* [os](https://docs.python.org/3/library/os.html)
* [pathlib](https://docs.python.org/3/library/pathlib.html)
* [random](https://docs.python.org/3/library/random.html)
* [subprocess](https://docs.python.org/3/library/subprocess.html)
* [time](https://docs.python.org/3/library/time.html)

### Third Party Libraries
* [comtypes.client](https://pythonhosted.org/comtypes/)
* [win32com.client](http://docs.activestate.com/activepython/2.4/pywin32/com.html)


# Modules not covered

### Standard Library
* None

### Third Party Libraries
* None

---

In [None]:
# Python stdlib imports
import asyncio
import multiprocessing
import os
import pathlib
import random
import subprocess
import time

# Third party imports
import comtypes.client
import win32com.client

# subprocess module

The subprocess module is the easiest way to launch other programs. Previously this was done through subprocess.call(), subprocess.check_output(), and subprocess.Popen. Since Python 3.5 it is largely done through subprocess.run().

### <a href='https://docs.python.org/3/library/subprocess.html#using-the-subprocess-module'>run()</a>

In [None]:
# If you don't need output, run it without a stdout argument. Simply pass a list of arguments.
process = subprocess.run(['msg', os.environ['USERNAME'], 'We triggered the "msg" program!'])

process

In [None]:
# If you need the output, pass stdout=subprocess.PIPE to capture it (ping's -n flag is Windows-specific).
process = subprocess.run(['ping', 'localhost', '-n', '1'], stdout=subprocess.PIPE)

print(process.stdout.decode())
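
The cell above captures raw bytes and decodes them by hand. As a minimal sketch, assuming Python 3.7 or later, `capture_output=True` and `text=True` let `run()` do both steps, and `check=True` raises `CalledProcessError` if the child exits with a non-zero code. The child here is just the current interpreter (`sys.executable`), chosen as a stand-in so the example runs on any platform.

```python
import subprocess
import sys

# Run a tiny child Python process instead of a platform-specific tool.
completed = subprocess.run(
    [sys.executable, '-c', 'print("hello from a child process")'],
    capture_output=True,  # Capture both stdout and stderr (Python 3.7+)
    text=True,            # Decode the output to str for us
    check=True,           # Raise CalledProcessError on a non-zero exit code
)

print(completed.returncode)
print(completed.stdout.strip())
```

The returned `CompletedProcess` object also carries `args` and `stderr`, which is handy when logging failures.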

In [None]:
# This is great for offloading work onto efficient C-based programs. For each PDF under ./data:
for path in pathlib.Path('./data').rglob('*.pdf'):
    # Set up arguments ('-' tells pdftotext to write to standard output)
    args = ['pdftotext', str(path.absolute()), '-']
    # Extract text with the pdftotext subprocess (you have this if you have Git).
    # A list of arguments should not be combined with shell=True.
    process = subprocess.run(args, stdout=subprocess.PIPE)
    # Decode the text, replacing undecodable bytes, and keep the first 100 characters
    output = process.stdout.decode(errors='replace')[:100]
    # Display.
    print('''The first 100 characters of {} are:\n\n{}\n\n\n\n'''.format(path, output))
    

# If this errors out, you probably don't have pdftotext installed (it comes with Git)

### <a href="https://docs.python.org/3/library/subprocess.html#subprocess.check_output">check_output()</a>

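When all you need is the output, check_output() may be simpler. It returns the child's standard output as bytes and raises CalledProcessError if the child exits with a non-zero code.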

In [None]:
# Windows only. Lists matching processes only if Microsoft Edge is running
output = subprocess.check_output(['tasklist', '/FI', 'IMAGENAME eq MicrosoftEdge.exe'])

print(output.decode())
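
Because check_output() raises on failure, it is worth seeing what that looks like. This is a small sketch, not tied to any particular tool: the child is the current interpreter (`sys.executable`) deliberately exiting with code 3, and the variable name `exit_code` is just for illustration.

```python
import subprocess
import sys

try:
    # The child exits with status 3, so check_output() raises instead of returning.
    subprocess.check_output([sys.executable, '-c', 'import sys; sys.exit(3)'])
except subprocess.CalledProcessError as error:
    # The exception records the exit status of the failed child.
    exit_code = error.returncode
    print('Child failed with exit code {}'.format(exit_code))
else:
    exit_code = 0
```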

# comtypes.client / win32com.client

COM objects are a useful way to hook into Windows-specific programs. You may be familiar with this through "CreateObject" in VBA or from C#. Usage follows roughly the same path as in VBA, but with clearer syntax.

Note: you shouldn't distribute data in Excel format if you can help it, but if you need to, we have xlwings, pandas.read_excel, and DataFrame.to_excel().

This may not work unless you have the newest version of Office.

In [None]:
# This requires Microsoft Word.

# Let's take this content ...
CONTENT = '''

https://youtu.be/kCBxI9yKLgw

When I Was a Lad

When I was a lad I served a term
As office boy to an Attorney's firm.
I cleaned the windows and I swept the floor,
And I polished up the handle of the big front door.
(He polished up the handle of the big front door.)
I polished up that handle so carefully
That now I am the Ruler of the Queen's Navy!
(He polished up that handle so carefully,
That now he is the ruler of the Queen's Navy!)

'''

# Create the path for the Word file.
OUTPUT_PATH = str(pathlib.Path('data/hms_pinafore.docx').absolute())

# Create a Word application object
# word = comtypes.client.CreateObject("Word.Application")
word = win32com.client.Dispatch('Word.Application')

# Let's make it Visible so we can see what is going on.
word.Visible = True

# Create a document
doc = word.Documents.Add()

# Set the text
word.Selection.Text = CONTENT

# Save to disk.
doc.SaveAs(OUTPUT_PATH)

# Close the doc
doc.Close()

# Quit the program. If it is not closed, Word keeps consuming memory, even after an error.
word.Quit()

# multiprocessing module

# Note: these examples may not work inside Jupyter (notably on Windows), because worker functions must be importable from a real module rather than defined in a notebook cell. Run them from a script instead.

If you simply want to spin off Python jobs onto other CPU cores, you can use the multiprocessing module. This can be as simple or as complicated as you want it to be.

Also, if you want managers, sockets, mutexes, events, and all sorts of fancy, you can do that.

Note: this allows you to bypass the [Global Interpreter Lock (GIL)](https://wiki.python.org/moin/GlobalInterpreterLock).

### Basic Pool

In [None]:
def blocking_process(delay):
    '''This function blocks for a specified period of time and returns a string.'''
    time.sleep(delay)
    return "I slept for {} seconds!".format(delay)
    
if __name__ == '__main__':

    # Get the core count so you don't overburden your computer
    core_count = multiprocessing.cpu_count()
    # Lay out your arguments
    args = [0, 1, 2, 3, 4]
    # Create a pool based on core_count or core_count - 1; the with block cleans up the workers
    with multiprocessing.Pool(core_count) as pool:
        # Map the function and arguments to the pool
        result = pool.map(blocking_process, args)

    this_would_produce = [
        "I slept for 0 seconds!",
        "I slept for 1 seconds!",
        "I slept for 2 seconds!",
        "I slept for 3 seconds!",
        "I slept for 4 seconds!",
    ]
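
Sleeping does not really exercise the CPU, so the example above does not show the benefit of sidestepping the GIL. As a rough sketch, the same pattern applies to CPU-bound work. The names `sum_of_squares` and `run_pool` are made up for this illustration, and the code is meant to be saved and run as a script rather than in a notebook.

```python
import multiprocessing

def sum_of_squares(limit):
    '''CPU-bound work: add up i * i for every i below limit.'''
    return sum(i * i for i in range(limit))

def run_pool(limits):
    '''Spread the calculations across worker processes; results keep input order.'''
    # Leave one core free, but always use at least one worker.
    worker_count = max(1, multiprocessing.cpu_count() - 1)
    with multiprocessing.Pool(worker_count) as pool:
        return pool.map(sum_of_squares, limits)

if __name__ == '__main__':
    print(run_pool([10, 100, 1000]))
```

Each call to `sum_of_squares` runs in its own process with its own interpreter, so the calculations can proceed on separate cores at the same time.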

### More generally

In [None]:
def blocking_process(delay, queue):
    '''This is like the above, but it runs in its own process so the caller isn't blocked. It puts its result in the queue.'''
    time.sleep(delay)
    queue.put("I slept for {} seconds without inconveniencing you. You're welcome.".format(delay))
    
if __name__ == '__main__':

    # Create a queue for holding results and create two processes.
    q = multiprocessing.Queue()
    process_1 = multiprocessing.Process(target=blocking_process, args=(3,q), name='process_1')
    process_2 = multiprocessing.Process(target=blocking_process, args=(5,q), name='process_2')

    # Start the processes
    print('Starting processes!')
    process_1.start()
    process_2.start()

    print(multiprocessing.active_children())

    # Get the results
    print('We are not blocked, but we can wait on the result if we want.')
    result_1 = q.get(timeout=10)
    print(result_1)
    result_2 = q.get(timeout=10)
    print(result_2)

    # Join the processes
    process_1.join()
    process_2.join()

    this_would_produce = '''

    Starting processes!
    [<Process(process_2, started)>, <Process(process_1, started)>]
    We are not blocked, but we can wait on the result if we want.
    I slept for 3 seconds without inconveniencing you. You're welcome.
    I slept for 5 seconds without inconveniencing you. You're welcome.

    '''

# asyncio module

# Note: none of this will work in Jupyter because Jupyter already runs its own event loop.

Asynchronous programming is a deep topic, and this just scratches the surface. We are simply going to create the event loop and then let it go. Asyncio can cover everything from file watchers to concurrent web requests to pretty much anything else.

In [None]:
# Get PDF paths
PDF_PATHS = [
     str(pathlib.Path('./data/sub2/foundations_of_data_science.pdf').absolute()),
     str(pathlib.Path('./data/sub2/JPM Big Data and AI Strategies.pdf').absolute()),
] * 5

# Note the async keyword
async def get_text(pdf_path): 
    '''This extracts text from a single pdf.'''
    # Create the subprocess, redirect the standard output into a pipe
    process = await asyncio.create_subprocess_exec('pdftotext',
                                                   pdf_path,
                                                   '-',
                                                   stdout=asyncio.subprocess.PIPE,
                                                   stderr=asyncio.subprocess.PIPE) 
    # Read all output; communicate() also waits for the process to exit
    stdout, stderr = await process.communicate()
    # Decode as cp1252 for Windows, replacing any bytes that don't map
    decoded_data = stdout.decode('cp1252', errors='replace')
    return decoded_data

# Create the loop (Windows requires Proactor loop)
loop = asyncio.ProactorEventLoop()

# Set the loop
asyncio.set_event_loop(loop)

# Set up tasks
print('Creating tasks ...')
tasks = [get_text(path) for path in PDF_PATHS]

# Get the text
print('Running {} concurrent subprocesses ...'.format(len(tasks)))
data = loop.run_until_complete(asyncio.gather(*tasks))

# Close the loop
loop.close()
print(data)

# Show the first 300 characters of the first doc.
print('\n\nTasks done. First 300 characters of doc 1 are:')
print(data[0][:300])
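
On newer Python versions the loop bookkeeping above can be trimmed. The following is a minimal sketch, assuming Python 3.8 or later (where `asyncio.run()` is available and Windows uses the Proactor loop by default). It uses the current interpreter (`sys.executable`) as a stand-in for pdftotext so it runs anywhere, and `run_child` and `main` are illustrative names. Like the cell above, it should be run as a script, not inside Jupyter.

```python
import asyncio
import sys

async def run_child(number):
    '''Run one short-lived child interpreter and return its decoded stdout.'''
    process = await asyncio.create_subprocess_exec(
        sys.executable, '-c', 'print({} * 2)'.format(number),
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    # communicate() reads the pipes and waits for the child to exit.
    stdout, _ = await process.communicate()
    return stdout.decode().strip()

async def main():
    # gather() runs the children concurrently and returns results in argument order.
    return await asyncio.gather(*(run_child(n) for n in range(3)))

# asyncio.run() creates the event loop, runs main(), and closes the loop afterwards.
results = asyncio.run(main())
print(results)
```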

# Additional Learning Resources

* ### [Developing with Asyncio](https://docs.python.org/3/library/asyncio-dev.html)

---

# Next Up: [Other](6_other.ipynb)

<img style="margin-left: 0;" src="static/other.png" width="200">

<div align='left'>
    <br>
    Image courtesy of <a href='https://commons.wikimedia.org/wiki/File:Flag_of_None.svg'>Rainer Zenz</a>. Image is public domain.
</div>


---