In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:80% !important; }</style>"))

# Lecture 10 - Defining, calling and applying functions and miscellanea
---


### Content

1. User-defined functions
2. Applying functions to dataframes
3. Removing duplicates
4. Re-shaping dataframes with transpose



### Learning Outcomes

At the end of this lecture, you should be able to:

* write custom functions in Python and call them
* apply functions to dataframes
* remove duplicate rows in dataframes
* transpose dataframes


---

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
from matplotlib import rcParams

%matplotlib inline

In [None]:
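# Set some Pandas options as you like
# (the full 'display.' prefix avoids ambiguous option names in recent pandas versions)
pd.set_option('display.max_columns', 30)
pd.set_option('display.max_rows', 30)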

In [None]:
rcParams['figure.figsize'] = 15, 10
rcParams['font.size'] = 20

## 1. Functions

Functions are groups of statements that perform a single unit of functionality. They are the primary and most important method of code organization and reuse in programming. Without them our code would become unmanageable.

Arguably most programmers doing data analysis do not write enough functions and there really is no such thing as having too many.

Functions are declared using the **def** keyword. We can pass arguments to functions. Functions can perform calculations and optionally return a value, which they do using the **return** keyword.

Once we have defined them, we can call them from almost anywhere in our scripts.

In [None]:
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

There is no issue with having multiple return statements. If the end of a function is
reached without encountering a return statement, **None** (null value) is returned.
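
As a minimal sketch of the second point, the hypothetical function below (`print_greeting` is an illustrative name, not part of the lecture code) has no return statement, so the value of the call is **None**:

```python
def print_greeting(name):
    # This function only prints; it never reaches a return statement
    print("Hello, " + name)

result = print_greeting("world")

# The call still evaluates to something: the null value None
print(result is None)
```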

Each function can have some number of **positional arguments** and some number of
keyword arguments. **Keyword arguments** are most commonly used to specify default
values or optional arguments. 

In the above function, x and y are positional arguments
while z is a keyword argument. This means that z can be supplied either by name
or by position, so both of the following calls are valid:

In [None]:
my_function(5, 6, z=0.7)

In [None]:
my_function(3.14, 7, 3.5)

The main restriction on function arguments is that the keyword arguments must follow
the positional arguments (if any). You can specify keyword arguments in any order;
this frees you from having to remember the order in which the function arguments were
specified, as you only need to remember their names.
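
The sketch below repeats the definition of my_function so that it runs on its own, and shows the default value being used, keyword arguments given out of order, and (as a comment) the ordering rule. The argument values are arbitrary.

```python
def my_function(x, y, z=1.5):
    if z > 1:
        return z * (x + y)
    else:
        return z / (x + y)

# z is omitted, so the default z=1.5 is used: 1.5 * (5 + 6)
with_default = my_function(5, 6)

# All arguments passed by name, in a different order from the definition
by_keyword = my_function(z=2, y=6, x=5)

# The same call with positional arguments gives the same result
by_position = my_function(5, 6, 2)

print(with_default, by_keyword, by_position)

# Not allowed: a positional argument after a keyword argument
# my_function(x=5, 6)   # SyntaxError
```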

### Scope

Functions can access variables in two different scopes: global and local. An alternate
and more descriptive name describing a variable scope in Python is a namespace. 

Any variables that are assigned within a function by default are assigned to the local namespace.
The local namespace is created when the function is called and immediately
populated by the function’s arguments. After the function is finished, the local namespace
is destroyed. 

Consider the following function:

In [None]:
list(range(5))

In [None]:
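def func():
    a = []  # 'a' is local: it exists only while func() is running
    for i in range(5):
        a.append(i)
func()
a  # raises NameError if no global 'a' has been defined yet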

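Upon calling func(), the empty list a is created, 5 elements are appended, and then a is
destroyed when the function exits, so a is not available afterwards (evaluating it raises a NameError unless a global a already exists).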

Suppose instead we had declared a in the global namespace as follows:

In [None]:
a = []
def func():
    for i in range(5):
        a.append(i)
func()
a

The example above illustrates that variables in the global namespace are visible inside functions. Relying on this is bad practice, however, and can lead to convoluted code. As seen previously, it is best to pass functions the objects they should operate on as arguments.
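
One possible way to avoid the global variable altogether is sketched here: the function builds the list locally and hands it back with return. The names `build_list`, `n` and `values` are illustrative choices, not part of the lecture code.

```python
def build_list(n):
    # 'values' lives in the local namespace of this call only
    values = []
    for i in range(n):
        values.append(i)
    # Returning the list is what makes it available to the caller
    return values

a = build_list(5)
print(a)
```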

In [None]:
my_list = [1,2,3]
def func_with_arguments(b):
    b[2] = "we can change values that the arguments (variables) reference if they are mutable"
    
func_with_arguments(my_list)
my_list

In [None]:
my_list = [1,2,3]
def func_with_arguments(b):
    b = "but can't reflect changes to the references themselves in the variables from the outer scope"
    
#calling the function with a list..
func_with_arguments(my_list)
print(my_list)

    
#or calling the function with a scalar value will not alter it
my_scalar = 10000000
func_with_arguments(my_scalar)
print(my_scalar)

If we want a variable in the outer scope to refer to a new object created inside a function, the function has to return that object and we have to assign the result back to the variable:

In [None]:
my_list = [1,2,3]
print(my_list)
def func_with_arguments_and_return_value(b):
    b = "but we can circumvent things this way...."
    return b
    
my_list = func_with_arguments_and_return_value(my_list)
my_list

### Returning Multiple Values

Many programming languages restrict a function to returning a single value.

One of the most convenient features of Python is that a function can return multiple values.

Here is a simple example:

In [None]:
def f():
    a = 5
    b = 6
    c = 7
    return a, b, c

a, b, c = f()
print(a)
print(b)

In data analysis and other scientific applications, you will likely find yourself doing this
very often as many functions may have multiple outputs, whether those are data structures
or other auxiliary data computed inside the function. 

If you think about tuple packing and unpacking from earlier lectures, you may realize that what’s happening
here is that the function is actually just returning one object, namely a tuple,
which is then being unpacked into the result variables. In the above example, we could
have done instead:

In [None]:
return_value = f()
return_value

In this case, return_value would be, as you may guess, a 3-tuple with the three returned
variables.

In [None]:
return_value[0]

**Exercise:** Define a function which accepts two arguments and returns the argument with the larger value. Call this function in order to verify that it is working.

**Exercise:** Define a function which accepts a list and returns the mean and the standard deviation of its values. Call this function in order to verify that it is working.

## 2. Functions and Dataframes - Using *apply()* and *applymap()* 

Built-in or user-defined functions can be applied along either axis of a dataframe.

To do this, we use the apply() method, which takes an optional axis argument: axis=0 (the default) applies the function to each column, and axis=1 applies it to each row.
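
To make the axis argument concrete, here is a small sketch with fixed numbers instead of random ones. The dataframe `scores` and the function `value_range` are hypothetical names introduced only for this illustration.

```python
import pandas as pd

scores = pd.DataFrame({'one': [1.0, 2.0, 3.0],
                       'two': [10.0, 20.0, 60.0]},
                      index=['a', 'b', 'c'])

def value_range(values):
    # 'values' is a whole column (axis=0) or a whole row (axis=1), as a Series
    return values.max() - values.min()

# axis=0: one result per column
per_column = scores.apply(value_range, axis=0)

# axis=1: one result per row
per_row = scores.apply(value_range, axis=1)

print(per_column)
print(per_row)
```

In both cases the result is a Series: indexed by the column labels for axis=0 and by the row labels for axis=1.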

### Functions along an axis

In [None]:
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

df = df[['one','two','three']]
df

Below is an example of applying NumPy's sum function to each column:

In [None]:
df.apply(np.sum, axis=0)

**Exercise**: Apply the mean function to the above dataframe in a row-wise manner.

**Exercise**: Replace the missing value in both columns with the row-wise mean value.

**Exercise**: Calculate the column-wise product for the first and third columns only.     

**Exercise**: Write a function which calculates the sum of a vector and then returns the square of the sum. Once you have done this, apply your function to the dataframe in a row-wise manner and insert the result into a new column 'four'.

### Functions applied element-wise

The apply() method passes a whole row or column to the function, which typically returns some form of aggregate. applymap(), on the other hand, gives us the flexibility of applying functions which manipulate single elements in a dataframe. (In pandas 2.1 and later, applymap() is deprecated in favour of the equivalent DataFrame.map() method.)

Say we would like to define a function which returns 'pos' for a non-negative number and 'neg' otherwise:

In [None]:
def pos_neg_to_string(x):
    if x >= 0:
        return 'pos'
    else: 
        return 'neg'

We can apply this to our dataframe as follows:

In [None]:
df.applymap(pos_neg_to_string)
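
One detail to watch with element-wise functions is missing data: a comparison such as `NaN >= 0` evaluates to False, so a function based on that comparison alone labels missing values as 'neg'. The sketch below is one possible NaN-aware variant. The names `sign_label` and `frame` are hypothetical, and the element-wise step is written with `apply` plus `Series.map`, which is an assumption made here so that it runs on both older and newer pandas versions.

```python
import numpy as np
import pandas as pd

def sign_label(x):
    # Pass missing values through instead of labelling them
    if pd.isna(x):
        return np.nan
    return 'pos' if x >= 0 else 'neg'

frame = pd.DataFrame({'one': [1.5, -0.5, np.nan],
                      'two': [-2.0, 0.0, 3.0]})

# Apply the function to every element, one column at a time
labels = frame.apply(lambda column: column.map(sign_label))
print(labels)
```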

Having the ability to apply element-wise operations on dataframes is extremely useful when it comes to dataset cleaning and transformations.

Let's take a look at a dataset with a column called "OCCUPATION_M":

In [None]:
assig = pd.read_csv("../datasets/RURAL_LS_SAMPLE_TRIMMED.csv")
assig.OCCUPATION_M

Clearly the values in this column need to be cleaned up.

Let's first find out what all the unique values are in this dataset.

In [None]:
assig.OCCUPATION_M.unique()

We can now write a function that removes the first 3 characters in each entry in order to tidy the values.

In [None]:
def remove_first_three_chars(x):
    # Slice off the prefix; str.replace would also remove any later repeats of it
    return x[3:]

In [None]:
assig[['OCCUPATION_M']].applymap(remove_first_three_chars)

In order to make the change permanent, we need to assign the result to the dataframe:

In [None]:
assig['OCCUPATION_M'] = assig[['OCCUPATION_M']].applymap(remove_first_three_chars)
assig[['OCCUPATION_M']]
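
Two details are worth checking when stripping a fixed-length prefix. First, slicing with `x[3:]` removes only the leading characters, whereas `x.replace(x[:3], '')` removes every occurrence of those characters in the string. Second, pandas offers a vectorised `.str` accessor that does the slicing without a user-defined function. The sketch below uses made-up values, since the actual contents of OCCUPATION_M are not shown here.

```python
import pandas as pd

# Hypothetical values: a 3-character code followed by a label
occupations = pd.Series(['01 Farmer', '02 Teacher', '03 Clerk 03 '])

# Element-wise slicing with a function
sliced = occupations.map(lambda x: x[3:])

# The same result with the vectorised string accessor
vectorised = occupations.str[3:]

# replace() removes the prefix wherever it appears, not only at the start
replaced = occupations.map(lambda x: x.replace(x[:3], ''))

print(sliced.tolist())
print(replaced.tolist())
```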

### Dummy Variables


A dummy variable is a numerical variable used in data analysis to represent subgroups of the sample under study.

In research design, a dummy variable is often used to distinguish different treatment groups. Regression analysis also requires that categorical values be converted into appropriate numerical values, which dummy variables achieve. This is accomplished by taking the distinct values from a column and creating a new column for each of them, populated with 0 or 1 to indicate whether or not a particular data point belongs to that category.

This is a frequent operation that can be done easily in Python.

In [None]:
assig['OCCUPATION_M'].str.get_dummies()

We can also specify a separator for cells that contain multiple values, so that each value is treated as a separate column. In this example, the forward slash separates the distinct values for which we would like to generate columns.

In [None]:
assig['OCCUPATION_M'].str.get_dummies('/')
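
The effect of the separator is easier to see on a tiny example. The sketch below uses a made-up Series called `jobs`. Note that when no separator is passed, `str.get_dummies()` splits on the pipe character '|', so a cell such as 'teacher/farmer' is kept as a single category.

```python
import pandas as pd

# Hypothetical values; the second entry holds two occupations
jobs = pd.Series(['farmer', 'teacher/farmer', 'clerk'])

# Default separator is '|', so 'teacher/farmer' becomes its own column
single = jobs.str.get_dummies()

# Splitting on '/' gives one column per distinct occupation
split = jobs.str.get_dummies('/')

print(single)
print(split)
```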

**Exercise:** From the same dataset, consider the column 'supermarket spend in a week'. The '\\$' character can cause issues in some applications. Clean up this column so that the first 3 characters and the '\\$' character are removed, and replace 'No Answer' entries with np.nan to mark them as missing values. Write a function to do this and apply it to the column.

Verify that your code works. 

## 3. Removing Duplicates

Duplicate rows may be naturally occurring in some datasets or they might arise from input errors. In many instances, like machine learning, these duplicate entries need to be removed from the datasets. 

Dataframes provide straightforward functionality to remove such records.

Here is an example:


In [None]:
df = pd.DataFrame({'c1': ['one'] * 3 + ['two'] * 4,
                  'c2': [1, 1, 2, 3, 3, 4, 4]})
df

`drop_duplicates` returns a DataFrame where the duplicated rows **across all columns** are dropped:

In [None]:
df.drop_duplicates()

In [None]:
df

We can also pass the column (or list of columns) in which to look for duplicates. Let's first make a change to the dataframe:

In [None]:
df.loc[1, 'c1'] = 'five'
df

In [None]:
df.drop_duplicates(['c2'])

Notice that `drop_duplicates` by default keeps the first observed value combination.
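
Which occurrence survives is controlled by the `keep` argument of `drop_duplicates`, and the related `duplicated()` method returns the Boolean mask of rows that would be dropped. The sketch below rebuilds the original example data under the hypothetical name `demo` so that it runs on its own.

```python
import pandas as pd

demo = pd.DataFrame({'c1': ['one'] * 3 + ['two'] * 4,
                     'c2': [1, 1, 2, 3, 3, 4, 4]})

# True for every row that repeats an earlier row
flags = demo.duplicated()

# Default: keep the first occurrence of each duplicated row
first = demo.drop_duplicates()

# Keep the last occurrence instead
last = demo.drop_duplicates(keep='last')

# Drop every row that has a duplicate anywhere
unique_only = demo.drop_duplicates(keep=False)

print(first.index.tolist(), last.index.tolist(), unique_only.index.tolist())
```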

## 4. Transpose

Transposing is a special form of reshaping tabular data in such a way that the rows become columns and likewise the columns become rows.

In [None]:
df = pd.DataFrame({'one' : pd.Series(np.random.randn(3), index=['a', 'b', 'c']),
                'two' : pd.Series(np.random.randn(4), index=['a', 'b', 'c', 'd']),
                'three' : pd.Series(np.random.randn(3), index=['b', 'c', 'd'])})

df = df[['one','two','three']]
df

A dataframe can be transposed using either the transpose() method or the shorter .T attribute:

In [None]:
df.T

Transpose operations are not permanent unless you re-assign the result back to the original dataframe.

In [None]:
df
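
The same behaviour can be checked on a small dataframe with fixed values. In this sketch, `wide` and `flipped` are hypothetical names; the point is that `.T` returns a new object and leaves the original untouched until the result is assigned.

```python
import pandas as pd

wide = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]},
                    index=['a', 'b'])

# .T returns a transposed dataframe; 'wide' itself is unchanged
flipped = wide.T

print(wide.shape)     # rows x columns of the original
print(flipped.shape)  # the two dimensions swapped

# Row and column labels swap roles
print(flipped.loc['two', 'b'] == wide.loc['b', 'two'])
```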

**Exercise:** Slice and select out a dataframe with rows 'c' and 'd' and columns 'one' and 'two', then execute a transpose.  
