# Learn to Write Functions Others Can Use in Python
## I would rather read binary and bleed from my eyes
<img src='images/relaxed.jpg'></img>
<figcaption style="text-align: center;">
    <strong>
        Photo by 
        <a href='https://www.pexels.com/@dziana-hasanbekava?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Dziana Hasanbekava</a>
        on 
        <a href='https://www.pexels.com/photo/unrecognizable-man-relaxing-on-hammock-5480702/?utm_content=attributionCopyText&utm_medium=referral&utm_source=pexels'>Pexels</a>
    </strong>
</figcaption>

### Introduction

Let's get this straight: every single line of code is written for other humans to read. Period. 

Ever wondered why everyone codes in English? Why not in Chinese, Russian, Klingon, or ancient Farsi? Does the language of the keywords even matter to the computer? Well, no. All source code, whatever language it is written in, is eventually translated into machine instructions, the only form a computer can actually work with. The real reason programming languages use English keywords is that English is a global language that is widely understood. Writing source code in English makes it easier for humans to create programs and to collaborate with other programmers across the globe. It all comes down to writing understandable code.

The code you write is your face, the first thing other programmers judge you by. The sooner you take this truth to heart, the better.

This post will be about how to write clean, well-documented functions that follow best practices and are a delight to IDEs.

### Docstrings

After you have written a function, the first step to make it understandable is to add a docstring. Here is the anatomy of a good docstring:

In [1]:
def function_name(arguments):
    """
    1. Description of what the function does.
    2. Description of the arguments, if any.
    3. Description of the return value(s), if any.
    4. Description of errors, if any.
    5. Optional extra notes or examples of usage.
    """

Well-documented, popular libraries follow this anatomy in different formats. There are four main docstring formats:
- Google style
- Numpydoc
- reStructuredText
- Epytext

We will only focus on the first two since they are the most popular.

### Google Style Docstrings

Let's start with the function description section of Google Style:

In [2]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    """

The first sentence should contain the purpose of the function, like a topic sentence in an essay. It should start right after opening the triple quotes. Optional explanations should be given as separate, unindented paragraphs:

In [3]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    
    Lorem ipsum dolor sit amet, consectetur adipiscing elit.
    Aliquam venenatis magna a consequat mollis. In ultrices consequat nibh. 
    Sed eu sollicitudin dui. Phasellus eu iaculis justo. 
    
    Curabitur faucibus ipsum vel aliquet convallis. 
    Maecenas eros lorem, varius nec accumsan eu, suscipit eget quam. 
    In a ultricies est. Morbi varius maximus elit, non tempus metus viverra et.    
    """

The next is the arguments section:

In [4]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    
    ...
    
    Args:
      arg_1 (type): Description of arg_1 that can continue
        to the next line with 2 space indent.
      arg_2 (int, optional): Write optional when the argument
        has a default value
    """

Starting a new paragraph with `Args:` indicates that you are defining the parameters. Each parameter goes on a new line, indented with 2 spaces. After the argument name, give its data type in parentheses. For arguments with a default value, add the word _optional_ after the type.

Finally, define the return values:

In [5]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    
    ...
    
    Args:
    
      ...
      
    Returns:
      bool: Optional desc. of the return value_1
      dict: Optional desc. of the return value_2
      Extra lines shouldn't be indented
    """

You can also add a `Raises` section if your function intentionally raises any errors:

In [6]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    
    ...
    
    Args:
    
      ...
      
    Returns:
      
      ...
    
    Raises:
      ValueError: Describe the case where your 
        function intentionally raises this error    
    """

Sometimes, you might need to include examples or extra notes at the end:

In [7]:
def google_style(arg_1, arg_2=42):
    """Description of what the function does.
    
    ...
    
    Args:
    
      ...
      
    Returns:
      
      ...
    
    Raises:
      ...
    
    Notes:
      Extra notes and use cases of the function in the
      form of free text.
    """
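
To see the sections above filled in for a real function, here is a minimal sketch. The `clamp` function and its parameters are hypothetical and exist only for illustration; the docstring follows the same 2-space layout used in the templates above.

```python
def clamp(value, low=0.0, high=1.0):
    """Restrict a number to the closed interval [low, high].

    Args:
      value (float): The number to restrict.
      low (float, optional): Lower bound. Defaults to 0.0.
      high (float, optional): Upper bound. Defaults to 1.0.

    Returns:
      float: `low` if `value` is below it, `high` if `value`
        is above it, otherwise `value` unchanged.

    Raises:
      ValueError: If `low` is greater than `high`.
    """
    # Reject an impossible interval before doing any work
    if low > high:
        raise ValueError(f"low ({low}) must not exceed high ({high})")
    # Cap from above first, then from below
    return max(low, min(value, high))


print(clamp(1.5))
print(clamp(0.3))
```

Notice that the `Raises` entry describes exactly the condition checked in the first `if`, so a reader never has to open the function body to learn when it fails.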

### Numpydoc Format Docstrings

This format is the most popular in the data science community. Here is the full format:

In [8]:
def numpy_style(arg_1, arg_2=42):
    """
    Description of the function's purpose
    
    ...
    
    Parameters
    ----------
    arg_1 : expected type of arg_1
        Description of the argument.
        Multi-lines are allowed
    arg_2 : int, optional
        Again, write optional when argument
        has a default value
      
    Returns
    -------
    The type of the return value
        Can include a desc of the returned value.
    """

The `Raises` and `Notes` sections follow the same pattern. Even though it takes more lines, I like this one better.
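
As a rough illustration of that pattern, the sketch below adds `Raises` and `Notes` sections to a Numpydoc docstring. The `safe_divide` function is hypothetical and only serves to show the layout: each section gets a title underlined with dashes, just like `Parameters` and `Returns`.

```python
def safe_divide(numerator, denominator):
    """
    Divide one number by another.

    Parameters
    ----------
    numerator : float
        The value to be divided.
    denominator : float
        The value to divide by.

    Returns
    -------
    float
        The quotient of `numerator` and `denominator`.

    Raises
    ------
    ValueError
        If `denominator` is zero.

    Notes
    -----
    Free-form remarks about the function go here.
    """
    # Fail early with a clear message instead of a ZeroDivisionError
    if denominator == 0:
        raise ValueError("denominator must not be zero")
    return numerator / denominator


print(safe_divide(9, 3))
```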

Here is a sample function in both styles:

In [9]:
import requests
import json

In [17]:
# Google style
def send_request(key: str, lat: float = 0, lon: float = 0):
    """Send a request to Climacell Weather API
    to get weather info based on lat/lon.
    
    Climacell API provides realtime weather
    information which can be accessed using
    their 'Realtime Endpoint'.
    
    Args:
      key (str): an API key with length of 32 chars.
      lat (float, optional): value for latitude.
        Default=0
      lon (float, optional): value for longitude.
        Default=0
    
    Returns:
      int: status code of the result 
      dict: Result of the call as a dict
    
    Notes:
      See https://www.climacell.co/weather-api/ 
      for more info on Weather API. You can get
      API key from there, too.
    """
    # Store the endpoint
    endpoint = "https://api.climacell.co/v3/weather/realtime"
    # Build query string params
    params = {'lat': lat, 'lon': lon, 'fields': 'temp',
              'apikey': key, 'unit_system': 'si'}
    # Get response
    response = requests.request('GET', endpoint, params=params)
    # Extract response code
    code = response.status_code
    # Convert results to dict
    result = json.loads(response.content)
    
    return code, result

In [18]:
# Numpydoc style
def send_request(key: str, lat: float = 0, lon: float = 0):
    """
    Send a request to Climacell Weather API
    to get weather info based on lat/lon.
    
    Climacell API provides realtime weather
    information which can be accessed using
    their 'Realtime Endpoint'.
    
    Parameters
    ----------
    key : str
        An API key with a length of 32 chars.
    lat : float, optional
        Value for latitude. Default=0
    lon : float, optional
        Value for longitude. Default=0
    
    Returns
    -------
    int
        Status code of the result.
    dict
        Result of the call as a dict.
    
    Notes
    -----
    See https://www.climacell.co/weather-api/
    for more info on Weather API. You can get
    an API key from there, too.
    """
    # Store the endpoint
    endpoint = "https://api.climacell.co/v3/weather/realtime"
    # Build query string params
    params = {'lat': lat, 'lon': lon, 'fields': 'temp',
              'apikey': key, 'unit_system': 'si'}
    # Get response
    response = requests.request('GET', endpoint, params=params)
    # Extract response code
    code = response.status_code
    # Convert results to dict
    result = json.loads(response.content)
    
    return code, result

> If the function uses the `yield` keyword, you can replace the `Returns` section with `Yields`.
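
Here is a small sketch of a generator documented this way, in Google style. The `countdown` function is a hypothetical example, not part of the weather code above.

```python
def countdown(start):
    """Count down from `start` to 1.

    Args:
      start (int): The number to start counting from.

    Yields:
      int: The next number in the countdown.
    """
    # Produce one value per iteration until we reach zero
    while start > 0:
        yield start
        start -= 1


for number in countdown(3):
    print(number)
```

The `Yields` entry describes a single produced item, not the generator object as a whole.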

Adding type hints can be very helpful to users of modern IDEs like PyCharm. You can learn more about type hinting [here](https://towardsdatascience.com/type-checking-in-python-in-the-right-way-5e18eba44296).
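
Type hints are stored on the function itself, which is how tools can read them. The sketch below uses the standard library's `typing.get_type_hints` to inspect them; the `scale` function is a hypothetical example.

```python
from typing import get_type_hints


def scale(value: float, factor: float = 2.0) -> float:
    """Multiply `value` by `factor`."""
    return value * factor


# Collect the annotations of the parameters and the return value
hints = get_type_hints(scale)
for name, hinted_type in hints.items():
    print(name, hinted_type)
```

An IDE relies on this same information to warn you when, say, a string is passed where a `float` is expected.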

### Accessing the Docstrings of Functions Without Googling

You can also access any function's docstring through its `__doc__` attribute:

In [12]:
print(range.__doc__)

range(stop) -> range object
range(start, stop[, step]) -> range object

Return an object that produces a sequence of integers from start (inclusive)
to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
These are exactly the valid indices for a list of 4 elements.
When step is given, it specifies the increment (or decrement).


In [13]:
print(print.__doc__)

print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the values to a stream, or to sys.stdout by default.
Optional keyword arguments:
file:  a file-like object (stream); defaults to the current sys.stdout.
sep:   string inserted between values, default a space.
end:   string appended after the last value, default a newline.
flush: whether to forcibly flush the stream.


Printing `__doc__` works nicely for short docstrings, but for objects with long docstrings, such as `numpy.ndarray`, you can use the `inspect` module's `getdoc` function, which also cleans up the indentation:

In [14]:
from numpy import ndarray
import inspect

print(inspect.getdoc(ndarray))

ndarray(shape, dtype=float, buffer=None, offset=0,
        strides=None, order=None)

An array object represents a multidimensional, homogeneous array
of fixed-size items.  An associated data-type object describes the
format of each element in the array (its byte-order, how many bytes it
occupies in memory, whether it is an integer, a floating point number,
or something else, etc.)

Arrays should be constructed using `array`, `zeros` or `empty` (refer
to the See Also section below).  The parameters given here refer to
a low-level method (`ndarray(...)`) for instantiating an array.

For more information, refer to the `numpy` module and examine the
methods and attributes of an array.

Parameters
----------
(for the __new__ method; see Notes below)

shape : tuple of ints
    Shape of created array.
dtype : data-type, optional
    Any object that can be interpreted as a numpy data type.
buffer : object exposing buffer interface, optional
    Used to fill the array with data.
offset : int, 

It displays the documentation in an easy-to-read manner.
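
The same works for your own functions. The sketch below, built around a hypothetical `greet` function, shows what `inspect.getdoc` returns: the surrounding blank lines and the common indentation are removed, while relative indentation inside the docstring is kept. How much of this cleanup the raw `__doc__` already contains may depend on your Python version, so the raw value is only printed with `repr` for comparison.

```python
import inspect


def greet(name):
    """
    Build a greeting.

        Indented detail line.
    """
    return f"Hello, {name}!"


# The raw attribute, shown with repr() so whitespace is visible
print(repr(greet.__doc__))

# The cleaned version: no leading/trailing blank lines,
# common indentation stripped, relative indentation preserved
clean = inspect.getdoc(greet)
print(clean)
```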

### Do One Thing At a Time

A common mistake many beginners make is writing functions that are too long and complicated. It is best to design each function to perform one specific task. Small, precise functions are easier to test and debug with modern IDEs, and they are more flexible to reuse.

Now, you might be thinking: 'I have never had to spend more than 2-3 minutes on an error in my code. I could always fix it with a few `print` statements and some playing around...' That's a classic rookie misunderstanding. When I first started learning Python, I also thought the books and courses I took made too big a deal of bugs. Back then, the code I wrote simply was not sophisticated enough to produce giant pain-in-the-a** bugs or errors.

If you still think like that, try writing a working script or program that is at least 200 lines long, and then we will talk. In the meantime, consider this function:

In [22]:
import pandas as pd


def top25(path, country):
    """
    Reads a csv file into pandas.DataFrame
    from `path` and returns top 25 cities
    of `country` based on population
    """
    df = pd.read_csv(path)
    # Subset for cities of given country
    subset = df[df['country'] == country][['city_ascii', 'lat',
                                           'lng', 'population']]
    # Extract top 25 based on population size
    subset_sorted = subset.sort_values('population',
                                       ascending=False).iloc[:25]
    # Rename lng column to lon
    subset_sorted['lon'] = subset_sorted['lng']
    # Drop lng column
    subset_sorted.drop('lng', axis='columns', inplace=True)
    # Reorder columns
    subset_sorted = subset_sorted[['city_ascii', 'lat',
                                   'lon', 'population']]
    return subset_sorted.reset_index().drop('index', axis='columns')

First of all, the docstring does give a good description of the function. If we spend some time reading the code, we realize that its main purpose is to read a `csv` file from the `path` argument, subset it using the `country` argument, and return the 25 most populated cities of that country. If you pay attention, that main purpose is accomplished by the two statements under the first two comments. The other lines perform cleaning tasks whose intent is not very clear.

What would be ideal is to break this function up so that all the cleaning is done in one chunk and subsetting for 25 cities in another. Let's start with cleaning:

In [28]:
import pandas as pd


def preprocess(path):
    """
    Loads a CSV file into a pandas.DataFrame
    and performs basic data cleaning.
    
    Parameters
    ----------
    path : str
        A path to the CSV file.
    
    Returns
    -------
    pandas.DataFrame
        The cleaned data.
    """
    # Load the data
    df = pd.read_csv(path)
    # Rename lng column to lon
    df.rename(columns={'lng': 'lon'}, inplace=True)
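    # Reorder columns, keeping `country` because
    # `top_25` filters on it later
    df = df[['city_ascii', 'country', 'lat', 'lon', 'population']]
    
    return df

In [30]:
cities = preprocess('data/worldcities.csv')
cities.head()

|   | city_ascii | country | lat | lon | population |
|---|---|---|---|---|---|
| 0 | Tokyo | Japan | 35.6897 | 139.6922 | 37977000.0 |
| 1 | Jakarta | Indonesia | -6.2146 | 106.8451 | 34540000.0 |
| 2 | Delhi | India | 28.66 | 77.23 | 29617000.0 |
| 3 | Mumbai | India | 18.9667 | 72.8333 | 23355000.0 |
| 4 | Manila | Philippines | 14.5958 | 120.9772 | 23088000.0 |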


In the function above, I used the `.rename` method to rename the `lng` column to `lon`. The original, dirty function first created a new column and then dropped the old one, which was unnecessary.

The next step is to create another function that subsets for the top cities of a given country:

In [32]:
def top_25(df, country: str):
    """
    Filters the `df` for `country`
    and isolates its 25 most populated
    cities.
    
    Parameters
    ----------
    df : pandas.DataFrame
        The dataframe to filter, containing
        countries and cities data.
    country : str
        The name of the country to filter for.
      
    Returns
    -------
    pandas.DataFrame or None
        The top 25 cities as a pandas.DataFrame,
        or None if no match is found for `country`.
    """
    # Filter for `country` in `df`
    if country in df['country'].unique():
        filtered = df[df['country'] == country]
    else:  # Return None if no match
        return None
    # Sort in descending order based on population
    pop_sorted = filtered.sort_values('population', ascending=False)
    # Extract top 25 cities
    top = pop_sorted.iloc[:25]
    
    return top

This function is better than the original because it also explicitly handles the case where the data contains no matching country, returning `None` in that situation.

Both of the new functions contain better documentation and follow best practices.
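
One payoff of splitting the work is that each piece can be checked in isolation. The sketch below illustrates this with a dependency-free stand-in: it uses the standard library's `csv` module in place of pandas, and the names `load_cities` and `top_n` are hypothetical. The sample rows reuse three population figures from the table above; the country labels are filled in as an assumption for the sake of the example.

```python
import csv
import io


def load_cities(source):
    """Read city rows from an open CSV file and convert population to float."""
    rows = list(csv.DictReader(source))
    for row in rows:
        row["population"] = float(row["population"])
    return rows


def top_n(rows, country, n=25):
    """Return the `n` most populated cities of `country`, or None if absent."""
    matches = [row for row in rows if row["country"] == country]
    if not matches:  # Mirror the "no match" branch of `top_25`
        return None
    by_population = sorted(matches, key=lambda row: row["population"],
                           reverse=True)
    return by_population[:n]


# A tiny in-memory CSV, so no file on disk is needed
sample = io.StringIO(
    "city_ascii,country,population\n"
    "Mumbai,India,23355000\n"
    "Tokyo,Japan,37977000\n"
    "Delhi,India,29617000\n"
)
rows = load_cities(sample)

print([row["city_ascii"] for row in top_n(rows, "India")])
print(top_n(rows, "Atlantis"))
```

Because loading and filtering are separate, the filtering step can be exercised with three rows of in-memory data instead of the full file.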