# Building Command-Line Tools with Python 

> Multiple exclamation marks are a sure sign of a diseased mind.
>
> --- Terry Pratchett

The [Jupyter Notebook](https://jupyter.org/), PyCharm, and other graphical interfaces
are great for prototyping code and exploring data, but eventually we may need to apply our code to thousands of data files,
run it with many different parameters, or combine it with other programs as part of a data analysis pipeline.
The easiest way to do this is often to turn our code into a standalone program that can be run in the Unix shell
just like other command-line tools {cite:p}`Tasc2017`.

In this chapter we will develop a few **command-line Python programs** that handle input and output in the same way as other shell commands,
can be controlled by several option flags, and provide useful information when things go wrong.
The result will have more scaffolding than useful application code, but that scaffolding stays more or less the same as programs get larger.

After the previous chapters, our Zipf's Law project should have the following files and directories:


```text
zipf/
├── bin
│   └── book_summary.sh
└── data
    ├── README.md
    ├── dracula.txt
    ├── frankenstein.txt
    ├── jane_eyre.txt
    ├── moby_dick.txt
    ├── sense_and_sensibility.txt
    ├── sherlock_holmes.txt
    └── time_machine.txt
```

> **Python Style**
>
> When writing Python code there are many style choices to make.
> How many spaces should I put between functions?
> Should I use capital letters in variable names?
> How should I order all the different elements of a Python script?
> Fortunately,
> there are well established conventions and guidelines
> for good Python style.
> We follow those guidelines throughout this book
> and discuss them in detail in Appendix **TODO** \@ref(style).

## Programs and Modules 

To create a Python program that can run from the command line,
the first thing we do is add the following to the bottom of the file:

```python
if __name__ == '__main__':
```

This strange-looking check tells us whether the file is running as a standalone program or whether it is being imported as a module by some other program.
When we import a Python file as a module in another program, the `__name__` variable is automatically set to the name of the module, i.e., the file name without its `.py` extension.
When we run a Python file as a standalone program, on the other hand, `__name__` is always set to the special string `"__main__"`.
To illustrate this, let's consider a script named `print_name.py` that prints the value of the `__name__` variable:

```python
print(__name__)
```


When we run this file directly, it will print `__main__`: 

```bash
$ python print_name.py
```

```text
__main__
```

But if we import `print_name.py` from another file or from the Python interpreter, it will print the name of the file, i.e., `print_name`.

```bash
$ python
```

```text
Python 3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: 
Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license"
for more information.
```

```python
>>> import print_name
```
```text
print_name
```
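
The same comparison can be reproduced from a single script. The sketch below is only an illustration: it writes a throwaway copy of `print_name.py` into a temporary directory, runs it once as a program using the standard library's `runpy.run_path`, and then imports it as a module. The temporary directory and the variable names `as_program` and `as_module` are assumptions made for this example.

```python
import contextlib
import io
import pathlib
import runpy
import sys
import tempfile

with tempfile.TemporaryDirectory() as folder:
    # Create a throwaway copy of print_name.py (illustrative only).
    script = pathlib.Path(folder) / 'print_name.py'
    script.write_text('print(__name__)\n')

    # Run the file as if it were a standalone program.
    as_program = io.StringIO()
    with contextlib.redirect_stdout(as_program):
        runpy.run_path(str(script), run_name='__main__')

    # Import the same file as a module.
    as_module = io.StringIO()
    sys.path.insert(0, folder)
    try:
        with contextlib.redirect_stdout(as_module):
            import print_name
    finally:
        sys.path.remove(folder)

print(as_program.getvalue().strip())  # __main__
print(as_module.getvalue().strip())   # print_name
```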

Checking the value of the variable `__name__` therefore tells us whether our file is the top-level program or not. If it is, we can handle command-line options, print help, or whatever else is appropriate;
if it isn't, we should assume that some other code is doing this. 

We could put the main program code directly under the `if` statement like this:

```python
if __name__ == "__main__":
    # code goes here
```

but that is considered poor practice, since it makes testing harder (Chapter **TODO** ref(testing)). Instead, we put the high-level logic in a function, then call that function if our file is being run directly:



```python
def main():
    print('Hello World!')


if __name__ == '__main__':
    main()
```

```text
Hello World!
```


This top-level function is usually called `main`, but we can use whatever name we want.
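
One way to see why a separate function helps with testing is the sketch below. It assumes a hypothetical test function, `test_main_prints_greeting`, that calls `main` directly and captures what it prints. This would not be possible if the logic sat directly under the `if` statement, since importing the file would never execute it.

```python
import contextlib
import io


def main():
    print('Hello World!')


def test_main_prints_greeting():
    # Hypothetical test: capture what main() writes to standard output.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        main()
    assert buffer.getvalue() == 'Hello World!\n'


if __name__ == '__main__':
    main()
```

Because the test only calls `main`, it behaves the same whether the file is run directly or imported by a test runner.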

## Handling Command-Line Options 

The main function in a program usually starts by parsing any options the user gave on the command line.
The most commonly used library for doing this in Python is [`argparse`](https://docs.python.org/3/library/argparse.html), which can handle options with or without arguments, convert arguments from strings to numbers or other types, display help, and many other things.

The simplest way to explain how `argparse` works is by example. 

Let's create a short Python program called `script_template.py`:


```python
import argparse


def main(args):
    print('Input file:', args.infile)
    print('Output file:', args.outfile)


if __name__ == '__main__':
    USAGE = 'Brief description of what the script does.'
    parser = argparse.ArgumentParser(description=USAGE)
    parser.add_argument('infile', type=str,
                        help='Input file name')
    parser.add_argument('outfile', type=str,
                        help='Output file name')

    # Parse the arguments given on the command line.
    args = parser.parse_args()
    # A Jupyter notebook has no command line, so simulate one there instead:
    # args = argparse.Namespace(infile='input.txt', outfile='output.txt')
    main(args)
```


> **Empty Lines, Again**
> 
> As we discussed in the last chapter for shell scripts,
> remember to end your Python scripts in a newline character
> (which we view as an empty line).


If `script_template.py` is run as a standalone program at the command line, then `__name__ == '__main__'` is true, so the program uses `argparse` to create an argument parser. 
It then specifies that it expects two command-line arguments: an input filename (`infile`) and an output filename (`outfile`).
The program uses `parser.parse_args()` to parse the actual command-line arguments given by the user and stores the result in a variable called `args`,
which it passes to `main`. That function can then get the values using the names specified in the `parser.add_argument` calls.

> **Specifying Types**
>
> We have passed `type=str` to `add_argument` to tell `argparse` that
> we want `infile` and `outfile` to be treated as strings.
> `str` is not quoted because it is not a string itself:
> instead,
> it is the built-in Python function that converts things to strings.
> As we will see below,
> we can pass in other functions like `int`
> if we want arguments converted to numbers.
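
As a minimal sketch of that conversion, the example below adds a hypothetical `--num` option (it is not part of `script_template.py`) with `type=int`. Passing a list to `parse_args` stands in for the arguments a user would type in the shell, which makes it possible to try the parser without leaving Python.

```python
import argparse

parser = argparse.ArgumentParser(description='Illustrate type conversion.')
parser.add_argument('infile', type=str, help='Input file name')
# Hypothetical option, not part of script_template.py.
parser.add_argument('-n', '--num', type=int, default=10,
                    help='Number of words to report')

# Passing a list stands in for what the user would type in the shell.
args = parser.parse_args(['dracula.txt', '--num', '5'])
print(args.infile)   # dracula.txt
print(args.num + 1)  # 6, because '5' was converted to the integer 5
```

If the option is left out, `args.num` takes the default value given to `add_argument`.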

If we run `script_template.py` at the command line, the output shows us that `argparse` has successfully handled the arguments:


```bash
$ python script_template.py in.csv out.png
```

```text
Input file: in.csv
Output file: out.png
```



It also displays an error message if we give the program invalid arguments:

```bash
$ python script_template.py in.csv
```

```text
usage: script_template.py [-h] infile outfile
script_template.py: error: the following arguments are
  required: outfile
```
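
Behind that message, `argparse` stops the program by raising `SystemExit`. The sketch below rebuilds the same two-argument parser and catches the exception so that the exit status can be inspected. Passing `prog` explicitly and catching `SystemExit` are choices made for this illustration, not something the template itself does.

```python
import argparse

# Rebuild the template's parser; prog is set only so the usage line matches.
parser = argparse.ArgumentParser(prog='script_template.py')
parser.add_argument('infile', type=str, help='Input file name')
parser.add_argument('outfile', type=str, help='Output file name')

try:
    # Only one of the two required arguments is supplied.
    parser.parse_args(['in.csv'])
except SystemExit as error:
    exit_status = error.code

print('exit status:', exit_status)  # exit status: 2
```

The usage message is written to standard error, and the non-zero exit status lets shell scripts and pipelines detect that the program failed.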

Finally, it automatically generates help information (which we can get using the `-h` option):


```bash
$ python script_template.py -h
```

```text
usage: script_template.py [-h] infile outfile

Brief description of what the script does.

positional arguments:
  infile      Input file name
  outfile     Output file name

options:
  -h, --help  show this help message and exit
```


## Documentation 

Our program template is a good starting point, but we can improve it right away by adding a bit of documentation.
To demonstrate, let's write a function that doubles a number:



```python
def double(num):
    'Double the input.'
    return 2 * num
```

The first line of this function is a string that isn't assigned to a variable.
Such a string is called a documentation string, or **docstring** for short. If we call our function it does what we expect:

```python
double(3)
```

```text
6
```

However, we can also ask for the function's documentation, which is stored in `double.__doc__`:

```python
double.__doc__
```

```text
'Double the input.'
```

Python creates the variable `__doc__` automatically for every function, just as it creates the variable `__name__` for every file.
If we don't write a docstring for a function, `__doc__`'s value is `None`.
We can put whatever text we want into a function's docstring, but it is usually used to provide online documentation.
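
Longer docstrings are usually written as a short summary line followed by more detail. The sketch below uses a hypothetical function, `scale`, to show one such docstring and how the standard library's `inspect.getdoc` returns it with the indentation cleaned up.

```python
import inspect


def scale(num, factor=2):
    """Multiply a number by a factor.

    The first line is a short summary; later lines add detail.
    """
    return num * factor


# The raw docstring is available as an attribute of the function.
print(scale.__doc__.splitlines()[0])  # Multiply a number by a factor.

# inspect.getdoc removes the leading indentation and surrounding blank lines.
print(inspect.getdoc(scale))
```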

We can also put a docstring at the start of a file, in which case it is assigned to a variable called `__doc__`
that is visible inside the file. If we add documentation to our template, it becomes:

```python
"""Brief description of what the script does."""

import argparse


def main(args):
    """Run the program."""
    print('Input file:', args.infile)
    print('Output file:', args.outfile)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument('infile', type=str,
                        help='Input file name')
    parser.add_argument('outfile', type=str,
                        help='Output file name')

    # Parse the arguments given on the command line.
    args = parser.parse_args()
    # A Jupyter notebook has no command line, so simulate one there instead:
    # args = argparse.Namespace(infile='input.txt', outfile='output.txt')
    main(args)
```


Note that docstrings are usually written using triple-quoted strings, since these can span multiple lines.
Note also how we pass `description=__doc__` to `argparse.ArgumentParser`. This saves us from typing the same information twice, but more importantly ensures that
the help message provided in response to the `-h` option will be the same as the interactive help. 

Let's try this out in an interactive Python session. (Remember, do not type the `>>>` prompt: Python provides this for us.)

**NOTE** The Python prompt cannot be used in a Jupyter notebook, which is why this example is shown as a transcript rather than run interactively.

```bash
$ python
```
```text
Python 3.7.6 (default, Jan  8 2020, 13:42:34) 
[Clang 4.0.1 (tags/RELEASE_401/final)] :: 
Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license"
for more information.
```
```python
>>> import script_template
>>> script_template.__doc__
```
```text
'Brief description of what the script does.'
```
```python
>>> help(script_template)
```
```text
Help on module script_template:

NAME
    script_template - Brief description of what the script does.

FUNCTIONS
    main(args)
        Run the program.

FILE
    /Users/amira/script_template.py
```

As this example shows, if we ask for help on the module, Python formats and displays all of the docstrings for everything in the file.
We talk more about what to put in a docstring in Appendix **TODO** ref(documentation).


## Counting Words

Now that we have a template for command-line Python programs, we can use it to check Zipf's Law for our collection of classic novels.
We start by moving the template into the directory where we store our runnable programs (Section **TODO** ref(getting-started-organize)):


In [27]:
%%bash 
python