# Lesson I 

## Docstrings

In this course, you'll learn how to write functions that others can use. Docstrings are a Python best practice that will make your code much easier to use, read, and maintain.

### A Complex Function

```python
    import numpy as np
    import pandas as pd

    def split_and_stack(df, new_names):
        half = int(len(df.columns) / 2)
        left = df.iloc[:, :half]
        right = df.iloc[:, half:]
        return pd.DataFrame(
            data=np.vstack([left.values, right.values]),
            columns=new_names
        )
```

Look at this ``split_and_stack()`` function. If you wanted to understand what the function does, what the arguments are supposed to be, and what it returns, you would have to spend some time deciphering the code.

#### A Complex function with a docstring

With a docstring though, it is much easier to tell what the expected inputs and outputs should be, as well as what the function does. This makes it easier for you and other engineers to use your code in the future.

```python
    def split_and_stack(df, new_names):
        """ Split a DataFrame's columns into two halves and then stack them vertically,
        returning a new DataFrame with 'new_names' as the column names.

        Args:
            df (DataFrame): The DataFrame to split.
            new_names (iterable of str): The column names for the new DataFrame.

        Returns:
            DataFrame    
        """

        half = int(len(df.columns) / 2)
        left = df.iloc[:, :half]
        right = df.iloc[:, half:]
        return pd.DataFrame(
            data=np.vstack([left.values, right.values]),
            columns=new_names
        )
```
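
To see what the docstring promises, it can help to run the function on a tiny DataFrame. The sketch below assumes the two halves are the slices `:half` and `half:`, which is what the description in the docstring implies. The four-column frame `wide` and the new names `x` and `y` are made up for illustration.

```python
import numpy as np
import pandas as pd


def split_and_stack(df, new_names):
    """Split a DataFrame's columns into two halves and stack them vertically.

    Args:
        df (DataFrame): The DataFrame to split.
        new_names (iterable of str): The column names for the new DataFrame.

    Returns:
        DataFrame
    """
    half = int(len(df.columns) / 2)
    left = df.iloc[:, :half]   # assumed: first half of the columns
    right = df.iloc[:, half:]  # assumed: second half of the columns
    return pd.DataFrame(
        data=np.vstack([left.values, right.values]),
        columns=new_names,
    )


# Hypothetical data: two (x, y) pairs stored side by side.
wide = pd.DataFrame({'x1': [1, 2], 'y1': [3, 4], 'x2': [5, 6], 'y2': [7, 8]})
stacked = split_and_stack(wide, ['x', 'y'])
print(stacked)
```

The result has two columns and four rows: the rows of the left half come first, followed by the rows of the right half.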

### Anatomy of a Docstring

A docstring is a string literal written as the first statement in a function's body. Because docstrings usually span multiple lines, they are enclosed in triple quotes, Python's way of writing multi-line strings. Every docstring has some (although usually not all) of these five key pieces of information:

* Description of what the function does.
* Description of the arguments, if any.
* Description of the return value, if any.
* Description of errors that may occur, if any.
* Optional extra notes or examples of usage.

### Docstring Formats

Consistent style makes a project easier to read, and the Python community has evolved several standards for how to format your docstrings.

* Google Style
* Numpydoc
* reStructuredText
* Epytext

Google-style and Numpydoc are the most popular formats.

#### Google Style - description

In Google style, the docstring starts with a concise description of what the function does. This should be in imperative language. For instance: "Split the data frame and stack the columns" instead of "This function will split the data frame and stack the columns".

```python
    def function(arg_1, arg_2=42):
        """ Description of what the function does. 
        """
```

#### Google Style - arguments

Next comes the *"Args"* section where you list each argument name, followed by its expected type in parentheses, and then what its role is in the function.

```python
    def function(arg_1, arg_2=42):
        """ Description of what the function does.

        Args:
            arg_1 (str): Description of arg_1 that can break onto the next line
                if needed.
            arg_2 (int, optional): Write optional when an argument has a default value.

        """
```

If you need extra space, you can break to the next line and indent as I've done here. If an argument has a default value, mark it as "optional" when describing the type. If the function does not take any parameters, feel free to leave this section out.

#### Google Style - return value

The next section is the "Returns" section, where you list the expected type or types of what gets returned. You can also provide some comment about what gets returned, but often the name of the function and the description will make this clear. Additional lines should not be indented.

```python
    def function(arg_1, arg_2=42):
        """ Description of what the function does.

        Args:
            arg_1 (str): Description of arg_1 that can break onto the next line
                if needed.
            arg_2 (int, optional): Write optional when an argument has a default value.

        Returns:
            bool: Optional description of the return value
            Extra lines are not indented.
        """
```

#### Google Style - errors and extra notes

Finally, if your function intentionally raises any errors, you should add a "Raises" section. You can also include any additional notes or examples of usage in free form text at the end.

```python
    def function(arg_1, arg_2=42):
        """ Description of what the function does.

        Args:
            arg_1 (str): Description of arg_1 that can break onto the next line
                if needed.
            arg_2 (int, optional): Write optional when an argument has a default value.

        Returns:
            bool: Optional description of the return value
            Extra lines are not indented.

        Raises:
            ValueError: Include any error types that the function intentionally raises.

        Notes:
            See the docstring for the function for more information.
        """
```
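
A filled-in version may make the template easier to picture. The function `repeat_text()` below is hypothetical and exists only to show each Google-style section attached to working code.

```python
def repeat_text(text, times=2):
    """Repeat a string a given number of times.

    Args:
        text (str): The string to repeat. A long description can break
            onto the next line if it is indented.
        times (int, optional): How many copies to join together.
            Defaults to 2.

    Returns:
        str: The repeated string.

    Raises:
        ValueError: If `times` is negative.

    Notes:
        An empty string is returned when `times` is 0.
    """
    if times < 0:
        raise ValueError('`times` must not be negative.')
    return text * times


print(repeat_text('ab'))
print(repeat_text('ab', times=3))
```

Note that the "Raises" section only lists the error the function raises on purpose, not every error that could bubble up from the code inside it.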

### Numpydoc

The Numpydoc format is very similar and is the most common format in the scientific Python community.

```python
    def function(arg_1, arg_2=42):
        """
        Description of what the function does.

        Parameters
        ----------
        arg_1 : str
            Description of arg_1 that can break onto the next line 
            if needed.
        arg_2 : int, optional
            write optional when argument has a default value.

        Returns
        -------
        bool
            Optional description of the return value
            Extra lines are not indented.
        """
```
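
For comparison, here is a small working function documented in the Numpydoc layout. `clip_value()` is a made-up helper used only as an illustration; the section names ("Parameters", "Returns", "Raises") are the standard Numpydoc ones.

```python
def clip_value(value, lower=0.0, upper=1.0):
    """
    Limit a number to the closed interval [lower, upper].

    Parameters
    ----------
    value : float
        The number to clip.
    lower : float, optional
        Smallest value to return. Defaults to 0.0.
    upper : float, optional
        Largest value to return. Defaults to 1.0.

    Returns
    -------
    float
        `value` if it lies inside the interval, otherwise the nearest bound.

    Raises
    ------
    ValueError
        If `lower` is greater than `upper`.
    """
    if lower > upper:
        raise ValueError('`lower` must not exceed `upper`.')
    return max(lower, min(value, upper))


print(clip_value(1.5))
print(clip_value(0.25))
```

Numpydoc takes up more vertical space than Google style, but each argument's type sits on its own line, which some readers find easier to scan.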

### Retrieving the Docstring

Sometimes it is useful for your code to access the contents of your function's docstring. Every function in Python comes with a ``__doc__`` attribute that holds this information. Notice that the ``__doc__`` attribute contains the *raw docstring*, including any tabs or spaces that were added to make the words line up visually.

```python
def the_answer():
    """Return the answer to life,
    the universe, and everything.
    
    Returns:
     int
    """
    return 42

print(the_answer.__doc__)
```

```
Return the answer to life,
    the universe, and everything.
    
    Returns:
     int
    
```


To get a cleaner version, with those leading spaces removed, you can use the ``getdoc()`` function from the ``inspect`` module. The ``inspect`` module contains a lot of useful functions for gathering information about functions.

```python
import inspect
print(inspect.getdoc(the_answer))
```

```
Return the answer to life,
the universe, and everything.

Returns:
 int
```
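
``getdoc()`` is only one of the helpers in ``inspect``. As a rough sketch, ``inspect.signature()`` can be combined with it to build a small help text. The ``describe()`` helper below is a made-up name for illustration, not part of the standard library.

```python
import inspect


def the_answer():
    """Return the answer to life,
    the universe, and everything.

    Returns:
     int
    """
    return 42


def describe(function):
    """Build a short help text from a function's signature and docstring.

    Args:
        function (callable): The function to describe.

    Returns:
        str
    """
    # signature() reads the parameter list; getdoc() returns the cleaned docstring
    signature = inspect.signature(function)
    docstring = inspect.getdoc(function) or '(no docstring)'
    return '{}{}\n{}'.format(function.__name__, signature, docstring)


print(describe(the_answer))
```

The first line of the result shows how to call the function, and the rest is the docstring with its leading indentation stripped.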


## Exercise

### Crafting a docstring

You've decided to write the world's greatest open-source natural language processing Python package. It will revolutionize working with free-form text, the way numpy did for arrays, pandas did for tabular data, and scikit-learn did for machine learning.

The first function you write is ``count_letter()``. It takes a string and a single letter and returns the number of times the letter appears in the string. You want the users of your open-source package to be able to understand how this function works easily, so you will need to give it a docstring. Build up a Google Style docstring for this function.

```python
# Add a docstring to count_letter()
def count_letter(content, letter):
    """
    Count the number of times 'letter' 
    appears in 'Content'
    
    # Add a Google Style Arguments section
     Args:
      content (str) : The string to search
      letter (str): The letter to search for.
    
    # Add a returns section
     Returns:
       int
    # Add a section detailing what errors might be raised
     Raises:
      ValueError: If 'letter' is not a one-character string.   
    """
    if (not isinstance(letter, str)) or len(letter) != 1:
        raise ValueError('`letter` must be a single character string.')
    return len([char for char in content if char == letter])
```

### Retrieving the docstring

You and a group of friends are working on building an amazing new Python IDE (integrated development environment -- like PyCharm, Spyder, Eclipse, Visual Studio, etc.). The team wants to add a feature that displays a tooltip with a function's docstring whenever the user starts typing the function name. That way, the user doesn't have to go elsewhere to look up the documentation for the function they are trying to use. You've been asked to complete the ``build_tooltip()`` function that retrieves a docstring from an arbitrary function.

You will be reusing the ``count_letter()`` function that you developed in the last exercise to show that we can properly extract its docstring.

```python
# Get the "count_letter" docstring by using an attribute of the function
docstring = count_letter.__doc__

border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))
```

```
############################

    Count the number of times 'letter' 
    appears in 'Content'
    
    # Add a Google Style Arguments section
     Args:
      content (str) : The string to search
      letter (str): The letter to search for.
    
    # Add a returns section
     Returns:
       int
    # Add a section detailing what errors might be raised
     Raises:
      ValueError: If 'letter' is not a one-character string.   
    
############################
```


```python
import inspect

# Inspect the count_letter() function to get its docstring
docstring = inspect.getdoc(count_letter)

border = '#' * 28
print('{}\n{}\n{}'.format(border, docstring, border))
```

```
############################
Count the number of times 'letter' 
appears in 'Content'

# Add a Google Style Arguments section
 Args:
  content (str) : The string to search
  letter (str): The letter to search for.

# Add a returns section
 Returns:
   int
# Add a section detailing what errors might be raised
 Raises:
  ValueError: If 'letter' is not a one-character string.   
############################
```


```python
import inspect

def build_tooltip(function):
  """Create a tooltip for any function that shows the
  function's docstring.

  Args:
    function (callable): The function we want a tooltip for.

  Returns:
    str
  """
  # Get the docstring for the "function" argument by using inspect
  docstring = inspect.getdoc(function)
  border = '#' * 28
  return '{}\n{}\n{}'.format(border, docstring, border)

print(build_tooltip(count_letter))
print(build_tooltip(range))
print(build_tooltip(print))
```

```
############################
Count the number of times 'letter' 
appears in 'Content'

# Add a Google Style Arguments section
 Args:
  content (str) : The string to search
  letter (str): The letter to search for.

# Add a returns section
 Returns:
   int
# Add a section detailing what errors might be raised
 Raises:
  ValueError: If 'letter' is not a one-character string.   
############################
############################
range(stop) -> range object
range(start, stop[, step]) -> range object

Return an object that produces a sequence of integers from start (inclusive)
to stop (exclusive) by step.  range(i, j) produces i, i+1, i+2, ..., j-1.
start defaults to 0, and stop is omitted!  range(4) produces 0, 1, 2, 3.
These are exactly the valid indices for a list of 4 elements.
When step is given, it specifies the increment (or decrement).
############################
############################
print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False)

Prints the value
```

# Lesson II

## DRY and "Do One Thing"

*DRY (also known as "don't repeat yourself")* and the *"Do One Thing"* principle are good ways to ensure that your functions are well designed and easy to test. Let's see how.

### Don't Repeat Yourself (DRY)

When you are writing code to look for answers to a research question, it is totally normal to copy and paste a bit of code, tweak it slightly, and re-run it. However, this kind of repeated code can lead to real problems.

```python
    import matplotlib.pyplot as plt
    import pandas as pd
    from sklearn.decomposition import PCA

    train = pd.read_csv('train.csv')
    train_y = train['labels'].values
    train_x = train[[col for col in train.columns if col != 'labels']].values
    train_pca = PCA(n_components=2).fit_transform(train_x)
    plt.scatter(train_pca[:, 0], train_pca[:, 1])
```

```python
    validation = pd.read_csv('validation.csv')
    validation_y = validation['labels'].values
    validation_x = validation[[col for col in validation.columns if col != 'labels']].values
    validation_pca = PCA(n_components=2).fit_transform(validation_x)
    plt.scatter(validation_pca[:, 0], validation_pca[:, 1])
```

```python
    test = pd.read_csv('test.csv')
    test_y = test['labels'].values
    test_x = test[[col for col in test.columns if col != 'labels']].values
    test_pca = PCA(n_components=2).fit_transform(train_x)
    plt.scatter(test_pca[:, 0], test_pca[:, 1])
```

In this code snippet, I load my train, validation, and test data, and plot the first two principal components of each dataset. I wrote the code for the train dataset, then copied it and pasted it into the next two blocks, updating the paths and the variable names.

#### The problem with repeating yourself

One of the problems with copying and pasting is that it is easy to accidentally introduce errors that are hard to spot. If you look closely at the last block, you'll notice that I accidentally took the principal components of the train data instead of the test data. **Yikes!**

```python
    test = pd.read_csv('test.csv')
    test_y = test['labels'].values
    test_x = test[[col for col in test.columns if col != 'labels']].values
    test_pca = PCA(n_components=2).fit_transform(train_x) ### Yikes! ###
    plt.scatter(test_pca[:, 0], test_pca[:, 1])
```

Another problem with repeated code is that if you want to change something, you often have to do it in multiple places. For instance, if we realized that our CSVs used the column name ``"label"`` instead of ``"labels"``, we would have to change our code in six places. Repeated code like this is a good sign that you should write a function. So let's do that.

### Use functions to avoid repetition

Wrapping the repeated logic in a function and then calling that function several times makes it much easier to avoid the kind of errors introduced by copying and pasting. And if you ever need to change the column "label" back to "labels", or you want to swap out PCA for some other dimensionality reduction technique, you only have to do it in one or two places.

```python
    def load_and_plot(path):
        """ Load a data set and plot the first two principal components.

        Args:
            path (str): Path to the CSV file containing the data.
        
        Returns:
            tuple of ndarray : (features, labels)
        
        """
        # Load the data
        data = pd.read_csv(path)
        y = data['labels'].values
        x = data[[col for col in data.columns if col != 'labels']].values
        
        # plot the first two principal components
        pca = PCA(n_components=2).fit_transform(x)
        plt.scatter(pca[:, 0], pca[:, 1])
        
        # return loaded data
        return x, y
```

```python
    train_x, train_y = load_and_plot('train.csv')
    validation_x, validation_y = load_and_plot('validation.csv')
    test_x, test_y = load_and_plot('test.csv')
```

However, there is still a big problem with this function.

* First, it loads the data
* Then it plots the data
* Then it returns the data

This function violates another software engineering principle: **Do One Thing**. Every function should have a single responsibility. Let's look at how we could split this one up.

### Do One Thing

Instead of one big function, we could have a more nimble function that just loads the data and a second one for plotting. 

We get several advantages from splitting the ``load_and_plot()`` function into two smaller functions. 

```python
    def load_data(path):
        """ Load a data set.

        Args:
            path (str): Path to the CSV file containing the data.

        Returns:
            tuple of ndarray : (features, labels)
        
        """

        data = pd.read_csv(path)
        y = data['labels'].values
        x = data[[col for col in data.columns if col != 'labels']].values

        return x, y    
```

```python
    def plot_data(x, y):
        """ Plot the first two principal components.

        Args:
            x (ndarray): Features
            y (ndarray): Labels

        """
        pca = PCA(n_components=2).fit_transform(x)
        plt.scatter(pca[:, 0], pca[:, 1])
```

First of all, our code has become more flexible. Imagine that later on in your script, you just want to load the data and not plot it. That's easy now with the ``load_data()`` function. Likewise, if you wanted to do some transformation to the data before plotting, you can do the transformation and then call the ``plot_data()`` function. 

We have decoupled the loading functionality from the plotting functionality.

#### Advantages of doing one thing

The code becomes:

* More flexible
* More easily understood
* Simpler to test
* Simpler to debug
* Easier to maintain and change
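
As one illustration of the "simpler to test" point, a loader that does nothing but load can be checked without drawing a single plot. This is only a sketch: it repeats ``load_data()`` with the column selection written as a list, and the file name ``tiny.csv`` and its contents are invented for the test.

```python
import os
import tempfile

import pandas as pd


def load_data(path):
    """Load a data set.

    Args:
        path (str): Path to the CSV file containing the data.

    Returns:
        tuple of ndarray: (features, labels)
    """
    data = pd.read_csv(path)
    y = data['labels'].values
    x = data[[col for col in data.columns if col != 'labels']].values
    return x, y


def test_load_data():
    """Check load_data() against a tiny CSV written to a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, 'tiny.csv')  # hypothetical file name
        with open(path, 'w') as csv_file:
            csv_file.write('a,b,labels\n1,2,0\n3,4,1\n')
        x, y = load_data(path)
    # The 'labels' column must end up in y and nowhere in x
    assert x.tolist() == [[1, 2], [3, 4]]
    assert y.tolist() == [0, 1]


test_load_data()
print('load_data() passed its check')
```

With the original ``load_and_plot()``, the same check would also have run PCA and opened a plot, even though neither has anything to do with loading.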

### Code smells and refactoring

Repeated code and functions that do more than one thing are examples of *"code smells"*, which are indications that you may need to refactor. 
Refactoring is the process of improving code by changing it a little bit at a time. This process is well described in *Martin Fowler's book, "Refactoring"*, which is a good read for any aspiring software engineer.

## Exercise

### Extract a Function

While you were developing a model to predict the likelihood of a student graduating from college, you wrote this bit of code to get the z-scores of students' yearly GPAs. Now you're ready to turn it into a production-quality system, so you need to do something about the repetition. Writing a function to calculate the z-scores would improve this code.

```python
    # Standardize the GPAs for each year
    df['y1_z'] = (df.y1_gpa - df.y1_gpa.mean()) / df.y1_gpa.std()
    df['y2_z'] = (df.y2_gpa - df.y2_gpa.mean()) / df.y2_gpa.std()
    df['y3_z'] = (df.y3_gpa - df.y3_gpa.mean()) / df.y3_gpa.std()
    df['y4_z'] = (df.y4_gpa - df.y4_gpa.mean()) / df.y4_gpa.std()    
```

*Note: df is a pandas DataFrame where each row is a student with 4 columns of yearly student GPAs: ``y1_gpa``, ``y2_gpa``, ``y3_gpa``, ``y4_gpa``*

```python
def standardize(column):
    """Standardize the values in a column.

    Args:
        column (pandas Series): The data to standardize.

    Returns:
        pandas Series: the values as z-scores
    """
    # Finish the function so that it returns the z-scores
    z_score = (column - column.mean()) / column.std()
    return z_score

# Use the standardize() function to calculate the z-scores
df['y1_z'] = standardize(df['y1_gpa'])
df['y2_z'] = standardize(df['y2_gpa'])
df['y3_z'] = standardize(df['y3_gpa'])
df['y4_z'] = standardize(df['y4_gpa'])
```
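
The four calls still repeat the column names by hand. One possible next step, sketched here, is to loop over the years so that the naming pattern is written only once. The lesson does not show the real `df`, so the GPAs below are invented sample data.

```python
import pandas as pd


def standardize(column):
    """Standardize the values in a column.

    Args:
        column (pandas Series): The data to standardize.

    Returns:
        pandas Series: the values as z-scores
    """
    return (column - column.mean()) / column.std()


# Hypothetical GPAs for three students (not the data from the exercise)
df = pd.DataFrame({
    'y1_gpa': [2.0, 3.0, 4.0],
    'y2_gpa': [2.5, 3.0, 3.5],
    'y3_gpa': [3.0, 3.5, 4.0],
    'y4_gpa': [1.0, 2.0, 3.0],
})

# Build each pair of column names from the year number
for year in range(1, 5):
    df['y{}_z'.format(year)] = standardize(df['y{}_gpa'.format(year)])

print(df[['y1_z', 'y2_z', 'y3_z', 'y4_z']])
```

Whether the loop is an improvement is a judgment call: it removes the repetition, but four explicit lines can be easier to read when the list of columns is short and unlikely to change.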