# Programming techniques

This notebook demonstrates a few programming techniques that can be useful when performing data exploration and data preparation. These techniques help you make code more concise, more reusable and easier to modify.

It is divided into two sections: general techniques that can be used in *all* Python code and techniques specific to the Pandas library.


In [1]:
# Imports
import pandas as pd

## General techniques

This section contains programming techniques that can be used in all Python code, regardless of which library is used.

### Data types

Many of the techniques in this notebook make use of two common data types: the **list** and the **dictionary**.

#### The dictionary

A dictionary (often abbreviated to "dict") is a simple data structure where every data item has a unique name (a "key"). You define a dict as follows:

In [2]:
mydict = {
    'key1' : 'value1',
    'key2' : 42,
    # 'key3' : ...
}

The values of a dictionary can be anything: strings, integers, lists (see below), other dictionaries, even objects and functions.

To add an item to a dictionary you do: `mydict['newkey'] = newvalue`.
To loop through all the keys in a dictionary you do:


In [3]:
for key in mydict.keys():
    val = mydict[key]
    # do something

It's also possible to get keys and values at the same time, like so:


In [4]:
for key, value in mydict.items():
    # do something, such as:
    print(f'{key} is {value}')


key1 is value1
key2 is 42
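As a minimal sketch of the dictionary operations described above, the cell below adds an item, overwrites one, and looks up a key that may be missing. The `inventory` dict and its contents are made up for illustration and are not part of the original notebook.

```python
# Hypothetical example data, only for illustration
inventory = {'apples': 3, 'pears': 5}

# Adding a new key and overwriting an existing key use the same syntax
inventory['plums'] = 7
inventory['apples'] = 4

# .get() returns a default value instead of raising a KeyError
# when the key does not exist
kiwis = inventory.get('kiwis', 0)

# 'in' tests whether a key is present
has_pears = 'pears' in inventory

print(inventory)
print(kiwis, has_pears)
```

Using `.get()` with a default is often convenient in data preparation, where you cannot be sure every key is present.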


#### The list

A list is simply a sequence of things. `[1, 2, 3, 4, 5]` is a list. So is `['mary', 'bob', 'jane']`. But a list can also contain dictionaries, functions, objects, etc.

Lists allow you to perform the same *process* multiple times, for different *values*. Repeatedly doing the same thing is of course what automation is all about, which is why Python has many tools for working with lists.

Here is a simple example:

In [5]:
mylist = [1, 2, 3, 4, 5]
for item in mylist:
    print(f'Item {item} + 1 is {item + 1}')

Item 1 + 1 is 2
Item 2 + 1 is 3
Item 3 + 1 is 4
Item 4 + 1 is 5
Item 5 + 1 is 6


To append something to a list you use the `append` method. To remove the last item from a list you use the `pop` method.

In [6]:
mylist = [1, 2, 3, 4, 5]
mylist.append(6)
print(mylist)
mylist.pop()
print(mylist)

[1, 2, 3, 4, 5, 6]
[1, 2, 3, 4, 5]
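A short sketch of a few related list operations, assuming the same `mylist` as above: `pop` returns the item it removes, it accepts an optional position, and `extend` appends several items at once.

```python
mylist = [1, 2, 3, 4, 5]

# pop() removes the last item and returns it
last = mylist.pop()

# pop(0) removes and returns the item at position 0 (the first item)
first = mylist.pop(0)

# extend() appends every item of another list
mylist.extend([10, 11])

print(last, first, mylist)
```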


##### List comprehensions

Python has a powerful, if initially confusing, feature for building new lists from existing ones: *list comprehensions*.

In their simplest form they look like this:



In [7]:
newlist = [item for item in mylist]
print(newlist)


[1, 2, 3, 4, 5]


This simply creates a copy of the original list.

You can, however, perform more complicated tasks, such as creating a new list that only contains items that match a certain criterion.

Here we create a new list that only contains the even numbers from the original list (the % or 'modulo' operator returns the remainder after performing integer division):

In [8]:
newlist = [item for item in mylist if item % 2 == 0]
print(newlist)

[2, 4]


Here is a slightly more complicated example: keep the even numbers from the original list and replace every odd number with 0:


In [9]:
newlist = [item if item % 2 == 0 else 0 for item in mylist]
newlist

[0, 2, 0, 4, 0]

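Notice that the `if ... else` part moved to the front of the list comprehension. It is not the same construct as the trailing `if` in the previous example: a trailing `if` filters items out of the result, whereas `item if condition else 0` is a conditional expression that decides which value to produce for every item.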
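The two forms can be compared side by side, and they can also be combined in one comprehension. The sketch below uses a made-up `numbers` list; the variable names are illustrative only.

```python
numbers = [1, 2, 3, 4, 5, 6]

# Trailing 'if' filters: items that fail the test are left out,
# so the result can be shorter than the input
evens_only = [n for n in numbers if n % 2 == 0]

# Leading 'x if cond else y' chooses a value for every item,
# so the result has the same length as the input
odds_zeroed = [n if n % 2 == 0 else 0 for n in numbers]

# Both at once: first keep only numbers greater than 2,
# then label each remaining number
labels = ['even' if n % 2 == 0 else 'odd' for n in numbers if n > 2]

print(evens_only)
print(odds_zeroed)
print(labels)
```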

### Reuse your code using functions

One of the core principles of programming is DRY: Don't Repeat Yourself.

If you have written code that could be used in other contexts, put it in a function. A "function" is a bit of code that takes zero or more inputs, performs some task and returns a result, like so:

```python
def your_function(a, b, c):
    # Do something with a, b and c
    return 1 # whatever it is you want to return
```

It's good practice to store functions you use often in a library file that you can reuse in all your projects.

Here, for example, we have a file "myfunctions.py" that is stored in a folder "lib" (for "library") from which we can import useful functions:

In [10]:

from lib.myfunctions import add_one_to
print(f'Adding one to 2 gives {add_one_to(2)}')

Adding one to 2 gives 3
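The contents of "myfunctions.py" are not shown in this notebook. Judging only by the output above, `add_one_to` could look something like the following sketch; the body and docstring are assumptions, not the actual file.

```python
# Hypothetical contents of lib/myfunctions.py
# (an assumption based on the output above, not the real file)
def add_one_to(value):
    """Return the given value plus one."""
    return value + 1


result = add_one_to(2)
print(f'Adding one to 2 gives {result}')
```

For the import to work, the "lib" folder must be located where Python can find it, for example next to the notebook.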


We can combine functions and list comprehensions to perform complex tasks:

In [11]:
newlist = [add_one_to(item) if item % 2 == 0 else item for item in [1, 2, 3, 4, 5]]
print(newlist)

[1, 3, 3, 5, 5]


It is not always necessary to define a named function before using it. Especially in a list comprehension, it is often enough to create a temporary function without a name (such a function is called a "lambda"):

In [12]:
newlist = [(lambda x : x + 1)(item) for item in [1, 2, 3, 4]]
print(newlist)

[2, 3, 4, 5]


Notice the parentheses () around the lambda and around the item (which is passed to the lambda as an argument).
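For something as simple as adding 1, a plain expression gives the same result as the lambda. Lambdas tend to be more useful when another function expects a function as an argument, such as the `key` parameter of the built-in `sorted`. A small sketch with made-up data:

```python
numbers = [1, 2, 3, 4]

# Calling a lambda inline, as in the cell above
with_lambda = [(lambda x: x + 1)(n) for n in numbers]

# The same result with a plain expression
with_expression = [n + 1 for n in numbers]

# A lambda passed as an argument: sort words by their length
words = ['pear', 'fig', 'banana']
by_length = sorted(words, key=lambda w: len(w))

print(with_lambda, with_expression, by_length)
```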

## Pandas specific techniques

The techniques described above can be used in all Python code. They are not specific to any one library or tool.

The techniques below, however, are specific to the Pandas library that is often used in data science.

### Applying functions to Pandas data

A Pandas dataframe behaves much like a dictionary in which every value is a list (a series). It is, however, a bad idea to treat it as one: pandas adds all sorts of functionality on top of these dictionaries of lists (an index, data types, vectorized operations), and you lose that functionality if you process the data with, say, a list comprehension.

Still, it is possible to perform some of the same tasks.

Here, for example, we add 1 to every value in a dataframe column (series):

In [13]:
# Create a new dataframe with one column
df = pd.DataFrame({'col' : [1, 2, 3, 4, 5]})
df['col'].map(lambda x : x + 1)

0    2
1    3
2    4
3    5
4    6
Name: col, dtype: int64

Notice that applying ("mapping") the lambda did not change the values in the original column.

If you want to keep the output of your map + lambda, you will need to create a new column based on another column:

In [14]:
df['col2'] = df['col'].map(lambda x : x + 1)
df

   col  col2
0    1     2
1    2     3
2    3     4
3    4     5
4    5     6


If you want to update the values in a column in place, you probably shouldn't, but if you must:

In [15]:
df['col'] = df['col'].map(lambda x : x + 1)
df

   col  col2
0    2     2
1    3     3
2    4     4
3    5     5
4    6     6
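For simple arithmetic like this, pandas can also operate on the whole column at once, without `map` or a lambda. The sketch below checks that both approaches give the same result; `map` remains useful when the per-value logic cannot be written as a column-wide operation.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]})

# Element by element, using map and a lambda
mapped = df['col'].map(lambda x: x + 1)

# Vectorized: the + 1 is applied to the entire column in one go
vectorized = df['col'] + 1

# equals() returns True when both series hold the same values
same = mapped.equals(vectorized)
print(same)
```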


### Using a dictionary to selectively replace values in a column

A common task in data preparation is to replace values in a column with some other value, for example to remove typos, to standardize values, etc.

You can create a dictionary with the original values as the keys and the new values as the values. Such a dictionary is called a "map" (not to be confused with the `map` method above).

To replace values in a column, use the `replace` method, like so:

In [16]:
mapdict = {
    2 : 'abc',
    4 : 'def',
    6 : 'geh'
}
df['col'].replace(mapdict)

0    abc
1      3
2    def
3      5
4    geh
Name: col, dtype: object
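The `map` method also accepts a dictionary, but it treats values that are missing from the dictionary differently. The sketch below uses a made-up column of misspelled colour names to show the difference: `replace` leaves unlisted values alone, while `map` turns them into NaN.

```python
import pandas as pd

# Hypothetical column with typos, only for illustration
colors = pd.Series(['red', 'gren', 'blue', 'bleu'])
fixes = {'gren': 'green', 'bleu': 'blue'}

# replace: values that are not keys in the dictionary stay as they are
replaced = colors.replace(fixes)

# map: values that are not keys in the dictionary become NaN
mapped = colors.map(fixes)

print(replaced.tolist())
print(mapped.tolist())
```

This is why `replace` is usually the safer choice when the dictionary only lists the values you want to correct.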

### Shift

We often want to compare values to previous or next values in the same column. In Pandas we do this by creating copies of this column and shifting them up or down. Then, once we have done this, we can compare the original and the shifted values:

In [17]:
df = pd.DataFrame({'col' : [1, 2, 3, 4, 5]})
# Shift down
df['col'].shift(1)

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
Name: col, dtype: float64

Notice we ended up with a NaN value at the start of the column. The reason for this makes sense (there is no data before the first item in the original column) but it is annoying because we will have to decide what to do with this NaN value. Fill it with 0? Drop the entire row?
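The options mentioned above could be sketched as follows. Which one is appropriate depends on your data; the variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 3, 4, 5]})

# Option 1: fill the NaN with 0 after shifting
filled = df['col'].shift(1).fillna(0)

# Option 2: drop the entries that have no previous value
dropped = df['col'].shift(1).dropna()

# Option 3: let shift fill the gap itself via its fill_value parameter,
# so no NaN is created in the first place
filled_directly = df['col'].shift(1, fill_value=0)

print(filled.tolist())
print(dropped.tolist())
print(filled_directly.tolist())
```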

Passing a positive value as a parameter to shift causes the column to be shifted down. Passing a negative value causes it to be shifted up:

In [18]:
df = pd.DataFrame({'col' : [1, 2, 3, 4, 5]})
# Shift up. Notice we end up with a NaN at the end that we need to fill with something
df['col'].shift(-1)

0    2.0
1    3.0
2    4.0
3    5.0
4    NaN
Name: col, dtype: float64

Shifting is useful to determine deltas (differences between values in the same column):

In [19]:
df = pd.DataFrame({'col' : [ 1, 2, 4, 6, 9, 14]})
df['shifted'] = df['col'].shift(1)
df['delta'] = df['col'] - df['shifted']
df

   col  shifted  delta
0    1      NaN    NaN
1    2      1.0    1.0
2    4      2.0    2.0
3    6      4.0    2.0
4    9      6.0    3.0
5   14      9.0    5.0
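For this particular case pandas also provides the `diff` method, which computes the difference with the previous row directly. The sketch below checks that it matches the manual shift-and-subtract approach; `shift` remains the more general tool when you need the neighbouring value itself.

```python
import pandas as pd

df = pd.DataFrame({'col': [1, 2, 4, 6, 9, 14]})

# Manual approach: subtract the shifted column
manual = df['col'] - df['col'].shift(1)

# Built-in approach: diff() does the same in one step
builtin = df['col'].diff()

# equals() treats NaN values in the same position as equal
same = manual.equals(builtin)
print(builtin.tolist())
print(same)
```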


### Pandas boolean indexing

To perform operations on Pandas dataframes and series we often use "boolean indexes". "Boolean" means: consists of True / False values only.

A boolean index is a list of True / False values. When it is applied to a dataframe or series it returns only those elements for which the boolean index contains a "True" value.

Here is an example:

In [20]:
df = pd.DataFrame({'col' : [1, 2, 3, 4, 5]})
df[[True, False, True, False, True]]

   col
0    1
2    3
4    5


This is the reason why something like `df[df['col'] == 2]` works: `df['col'] == 2` returns True if 'col' is equal to 2 and False if it is not:

In [21]:
df['col'] == 2

0    False
1     True
2    False
3    False
4    False
Name: col, dtype: bool

You can also use this technique to compare two columns to each other:

In [22]:
df = pd.DataFrame({
    'col1' : [1, 2, 3, 4, 5],
    'col2' : [1, 1, 3, 3, 4]
})
df['col1'] == df['col2']

0     True
1    False
2     True
3    False
4    False
dtype: bool
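Boolean indexes can be combined with `&` (and), `|` (or) and `~` (not). Each condition needs its own parentheses because these operators bind more tightly than comparisons such as `==`. A minimal sketch using the same dataframe as above:

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': [1, 1, 3, 3, 4]
})

# and: columns are equal AND col1 is greater than 1
both = df[(df['col1'] == df['col2']) & (df['col1'] > 1)]

# or: col1 is greater than 4 OR col2 is less than 2
either = df[(df['col1'] > 4) | (df['col2'] < 2)]

# not: rows where the columns are NOT equal
negated = df[~(df['col1'] == df['col2'])]

print(both.index.tolist())
print(either.index.tolist())
print(negated.index.tolist())
```

Python's own `and`, `or` and `not` keywords do not work element by element on a series, which is why the symbol operators are used here.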

### Updating columns based on some other column value

If you want to use boolean indexes to determine which values to update and then update them, you need to use the `df.loc` indexer.

With boolean indexes, `df.loc` takes two arguments between square brackets: a boolean index that selects the rows and the name of the column to update:

In [23]:
df = pd.DataFrame({
    'col1' : [1, 2, 3, 4, 5],
    'col2' : [1, 1, 3, 3, 4],
    'columns_are_equal' : [False, False, False, False, False]
})
df.loc[df['col1'] == df['col2'], 'columns_are_equal'] = True
df

   col1  col2  columns_are_equal
0     1     1               True
1     2     1              False
2     3     3               True
3     4     3              False
4     5     4              False
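When the new column is simply the result of a comparison, you can also assign the boolean index directly, without creating a column of `False` values first. `df.loc` is then mostly needed when only some rows should change. The sketch below shows both; setting `col2` to 0 is an arbitrary update chosen for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5],
    'col2': [1, 1, 3, 3, 4]
})

# Direct assignment: the comparison already yields True / False per row
df['columns_are_equal'] = df['col1'] == df['col2']

# Selective update with loc: set col2 to 0 only where col1 is larger
df.loc[df['col1'] > df['col2'], 'col2'] = 0

print(df)
```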
