# Week 5 - Functions I

1. [Question 1](#bullet1)
2. [Question 2](#bullet2)
3. [How to name functions](#bullet3)
4. [Comments](#bullet4)
5. [Practice](#bullet5)
6. [Looking at function source code](#bullet6)
7. [Control Statements: if...elif()](#bullet7)
8. [Types of function parameters](#bullet8)
9. [Handling errors](#bullet9)
10. [Args and Kwargs](#bullet10)
11. [Return statements](#bullet11)
12. [Lazy evaluation](#bullet12)
13. [Lexical scoping](#bullet13)

In [1]:
## importing necessary libraries
import numpy as np
import pandas as pd
import scipy.stats  # import the submodule explicitly so scipy.stats.norm is available

## 1.  Question: What does the following code do?<a class="anchor" id="bullet1"></a>

In [6]:
rng = np.random.default_rng(seed=32521)

a = rng.standard_normal(10)
b = rng.standard_normal(10)
c = rng.standard_normal(10)
d = rng.standard_normal(10)

the_dict = {'a':a,'b':b,'c':c,'d':d}

df = pd.DataFrame(the_dict)
df

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | 0.980714 | 2.102271 | -0.475476 | 0.192623 |
| 1 | 2.467158 | 0.689942 | 0.092262 | -0.317355 |
| 2 | -0.959375 | -0.234055 | -0.041392 | 1.468489 |
| 3 | 0.771635 | 0.753869 | -0.52985 | -1.024178 |
| 4 | -1.249973 | -1.253808 | -0.064688 | 0.521052 |
| 5 | 0.594245 | -1.508091 | -1.245175 | 1.485229 |
| 6 | 0.164417 | 0.380744 | 0.274783 | -0.535932 |
| 7 | -1.505085 | 1.693092 | -1.606498 | -1.364671 |
| 8 | 0.001804 | -0.007817 | -0.598265 | -1.316875 |
| 9 | 0.4847 | -1.013335 | 0.944282 | 0.789056 |


## 2. Question: What does the following code do?<a class="anchor" id="bullet2"></a>

In [40]:
df['a'] = (df['a'] - min(df['a'])) / (max(df['a']) - min(df['a']))
df['b'] = (df['b'] - min(df['b'])) / (max(df['b']) - min(df['b']))
df['c'] = (df['c'] - min(df['c'])) / (max(df['a']) - min(df['c']))
df['d'] = (df['d'] - min(df['d'])) / (max(df['d']) - min(df['d']))

This code rescales the data in each column to values between 0 and 1.

**Issue**: There is a mistake in the code above. Can you spot it?

**Lesson**: Writing a function will simplify this code and reduce the chance of errors. Let's write a function that replaces the code above. Before we do so, note the general syntax for writing functions.

**Key parts of a function**:
1. You need to name the function (e.g., rescale01)
2. You need parameters (inputs or arguments) 
3. You need code in the body of the function 

In [41]:
def function_name(arg1, arg2):
    return # Do something!

Your function can have any number of inputs or arguments.  
__Tip 1:__ When we write functions, it's a good idea to **name the arguments something generic, like x**

In [2]:
# Our function requires a single input, a pd.Series object, which we will name x
def rescale01(x):
    return (x - min(x)) / (max(x)-min(x))

This is a good start. But notice how we are calling min(x) two separate times. This is not ideal in two respects. First, should you ever want to update min(x) with something else, you'll have to change it everywhere in the code. Second, whenever you need the same value more than once, your code will be more readable (and faster!) if you compute it one time and store the result rather than 2+ times.  
  
__Tip 2:__ Within your function, make your code as concise as possible and avoid repeating the same computation multiple times 

In [43]:
# Avoid calling min() and max() several times: compute each once and store the result
def rescale01(x):
    min_ = x.min()  # min and max are built-in functions, so a trailing "_" avoids shadowing them
    max_ = x.max()
    return (x - min_) / (max_ - min_)

In [44]:
# Compare the readability of this code to what we saw above
df['a'] = rescale01(df['a'])  
df['b'] = rescale01(df['b'])
df['c'] = rescale01(df['c'])
df['d'] = rescale01(df['d'])

**Question**: Is it easier to understand what the point of the code is?  
**Question**: Is it easier to read/less complex?  
**Question**: Is it easier to update the code / functionality?  
  
**Answers**:  YES!!  

In [3]:
# Example: the following fails, as it returns incorrect values because of the np.inf element
x = pd.Series([1, 2, 3, 4, np.inf])
rescale01(x)

0    0.0
1    0.0
2    0.0
3    0.0
4    NaN
dtype: float64

#### Challenge: Fix the function above so that if np.inf values are passed in, the function won't fail

0    0.000000
1    0.333333
2    0.666667
3    1.000000
4         NaN
dtype: float64
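
One possible answer to the challenge is sketched below. It assumes that infinite values should be treated as missing, so they are replaced with NaN before the minimum and maximum are computed. The name `rescale01_finite` is hypothetical and is used only to keep the original `rescale01` intact.

```python
import numpy as np
import pandas as pd


def rescale01_finite(x):
    """Rescale a pd.Series to [0, 1], treating infinite values as missing."""
    # Keep finite values and turn +/-inf into NaN
    x = x.where(np.isfinite(x))
    min_ = x.min()  # pandas skips NaN by default
    max_ = x.max()
    return (x - min_) / (max_ - min_)


x = pd.Series([1, 2, 3, 4, np.inf])
print(rescale01_finite(x))
```

The finite values are rescaled as usual, and the infinite entry comes back as NaN, which matches the output shown above.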

Coding best practices: __"Don't Repeat Yourself" (aka "DRY")__. More repetition in code means a higher chance of errors.
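
The same principle applies to the calls themselves: four nearly identical assignment lines can be collapsed into one. The sketch below rebuilds the example data frame and uses `DataFrame.apply`, which passes each column to the function in turn. The name `df_scaled` is an assumption made for illustration.

```python
import numpy as np
import pandas as pd


def rescale01(x):
    min_ = x.min()
    max_ = x.max()
    return (x - min_) / (max_ - min_)


rng = np.random.default_rng(seed=32521)
df = pd.DataFrame({name: rng.standard_normal(10) for name in "abcd"})

# One call replaces the four separate df['a'] = rescale01(df['a']) lines
df_scaled = df.apply(rescale01)

print(df_scaled.min().tolist())  # smallest value in each column
print(df_scaled.max().tolist())  # largest value in each column
```

With no column name typed by hand, the copy-and-paste mistake from Question 2 cannot occur.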

## 3. How to name functions<a class="anchor" id="bullet3"></a>

A name should be as short as possible while making clear what the function does. Prefer **verbs** to nouns where possible.

- Too short: f()
- Not a verb and not descriptive: my_awesome_function()
- Long, but clear: impute_missing() or collapse_years()

**Snake case versus camel case**  
Snake case is when you write the function name in the following way, using "_" to connect words:  
snake_case()

Camel case is written in the following manner:  
camelCase()

In [5]:
# Whichever you choose, be consistent rather than moving back and forth, i.e., don't do the following:
def col_mins(x,y): 
    return min(x,y)

def rowMaxes(x,y): 
    return max(x,y)

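If you have a family of functions that work towards a similar goal, give them consistent names and arguments. A common prefix is better than a common suffix, because autocomplete then lists every member of the family as soon as you type the prefix.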
  
**Good**  
input_select()  
input_checkbox()  
input_text()  
  
**Not so good**  
select_input()  
checkbox_input()  
text_input()  

Moreover, try not to give your functions names that overwrite functions or variables that are already part of Python (keywords cannot be reused at all: Python raises a SyntaxError).

In [19]:
int = 0  # don't do this! the name int now shadows the built-in int type
del int  # remove our variable so the built-in int is accessible again

Also, don't give a variable a name that overwrites a function such as sum, which is already built into Python. If you do, the variable sum *shadows* the built-in function, making it inaccessible in your code, as shown below.

In [4]:
sum = 10 + 5
print(sum)
sum([10, 15])

15


TypeError: 'int' object is not callable

In [7]:
del sum # delete sum var so our code later will work !

## 4. Comments<a class="anchor" id="bullet4"></a>

In [22]:
# Now a few comments about comments...

# Comments should explain the why of the code rather than the what or the how
# the what or the how should be obvious in the code itself

# to break code into chunks that are easy to read use "-" or "="

# Load data ------------------------------------------------------------
x1 = 7 # code entered here

# Plot data ============================================================
x2 = 8 # code entered here

## 5. Practice<a class="anchor" id="bullet5"></a>

In [221]:
# Practice naming functions: Read the source code for each of the following 
# three functions, puzzle out what they do, and then brainstorm better names.

In [222]:
def prefix_match(string, prefix):
    # str.startswith checks only the beginning of the string;
    # str.find would also match the prefix anywhere inside it
    return string.startswith(prefix)

In [20]:
prefix_match('dog', 'do')

True

In [21]:
prefix_match('dog', 'st')

False

In [47]:
def remove_last(x):
    if len(x) <= 1:
        return None
    return x[:-1]

In [48]:
remove_last("dog")

'do'

In [49]:
remove_last([1, 2, 3, 4, 5])

[1, 2, 3, 4]

In [52]:
def match_length(x, y):
    lst = []
    
    for i in range(len(x)):
        lst.append(y)
    return lst

In [53]:
match_length("dog", "d")

['d', 'd', 'd']

## 6. Looking at function source code<a class="anchor" id="bullet6"></a>

In [11]:
?np.mean

Signature:
np.mean(
    a,
    axis=None,
    dtype=None,
    out=None,
    keepdims=<no value>,
    *,
    where=<no value>,
)
Docstring:
Compute the arithmetic mean along the specified axis.

Returns the average of the array elements.  The average is taken over
the flattened array by default, otherwise over the specified axis.
`float64` intermediate and return values are used for integer inputs.

Parameters
----------
a : array_like
    Array containing numbers whose mean is desired. If `a` is not an
    arr

In [12]:
?rescale01 # Our function

Object `rescale01 # Our function` not found.


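**Question:** Why does our function not return a help menu?  
**Answer**: Two reasons. First, IPython read the trailing comment as part of the name, hence the "not found" message; `?rescale01` must be on a line by itself. Second, even then only the signature is shown, because we never wrote any documentation for the function. We will learn how to later.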
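
A help entry comes from a docstring, the string literal placed as the first statement of the function body. The sketch below adds one to `rescale01`; the NumPy-style section layout is a common convention rather than a requirement, and the wording of the docstring is an assumption.

```python
def rescale01(x):
    """Rescale the values of x to the range [0, 1].

    Parameters
    ----------
    x : pd.Series
        Numeric values to rescale.

    Returns
    -------
    pd.Series
        (x - min) / (max - min), computed element-wise.
    """
    min_ = x.min()
    max_ = x.max()
    return (x - min_) / (max_ - min_)


help(rescale01)           # prints the signature and the docstring
print(rescale01.__doc__)  # the raw docstring is stored on the function object
```

In a notebook, `?rescale01` on a line by itself displays the same text.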

## 7. Control Statements: if...elif() <a class="anchor" id="bullet7"></a>

Code usually executes in a straightforward, linear manner. We call this **sequential execution**. But there are a few ways you can change the execution order. The first is to use **if...else statements**. These statements are easy to understand: after checking some condition(s), a snippet of code will either execute or not.

In [None]:
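# The syntax of if...else statements:
condition = True  # placeholder: any expression that evaluates to True or False
if condition:
    print("condition is True")   # code executed when condition is True
else:
    print("condition is False")  # code executed when condition is False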

In [19]:
# Example of a function using if then statements ...
def is_even(x):
    if x % 2 == 0:
        return True
    return False

# why does this work without an else clause?
# what does this function do?

In [60]:
# Can combine several conditions using `and` / `or`
# (& and | are the element-wise versions, used with NumPy arrays and pandas Series)
def temp2(x):
    if (x % 2 == 0) and (x % 5 == 0):
        return True
    return False

def temp3(x):
    if (x % 2 == 0) or (x % 5 == 0):
        return True
    return False
# what is the difference between these functions?

In [8]:
vec = [1, 3, 5, 7, 9]
if sum(vec) < 1:
    print("empty list")
elif sum(vec) == 1:
    print("I'm a unitary vector")
else:
    print("I'm not a unitary vector")

I'm not a unitary vector


`any()` and `all()` are also helpful if your condition returns a vector and you need to collapse it to a single True/False value.

In [62]:
any([True, False, False])

True

In [63]:
all([True, False, False])

False

In [64]:
any([True if i > np.mean(vec) else False for i in vec])

True

In [65]:
all([True if i > np.mean(vec) else False for i in vec])

False
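
The need to collapse a vector of conditions is most visible with pandas: comparing a Series to a number gives a boolean Series, and using that Series directly in an `if` raises a `ValueError`. A minimal sketch, with the hypothetical names `s` and `above_mean`:

```python
import pandas as pd

s = pd.Series([1, 3, 5, 7, 9])
above_mean = s > s.mean()  # element-wise comparison gives a boolean Series

# `if above_mean:` would raise a ValueError, because the truth value of a
# whole Series is ambiguous, so collapse it to a single value first
if above_mean.any():
    print("at least one value is above the mean")
if not above_mean.all():
    print("not every value is above the mean")
```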

##### Question: what is the difference between any and all?

## 8. Types of function parameters <a class="anchor" id="bullet8"></a>

There are two broad types of function arguments: data arguments and arguments that control the details of the computation. A general rule of thumb is that **data arguments should come first, followed by the arguments that control the computation**.

In [24]:
# Example:
x = pd.Series([1, 24, 4325, 8432, 34])

def my_mean(lst, condition=False):
    temp = lst
    if condition:
        temp = [i for i in lst if i > 1000]
    return sum(temp)/len(temp)
    
    
my_mean(x, condition=True)

# lst refers to the data 
# condition (set to False by default) determines details of computation 

# In this case, best practice is to put lst, the data arg, first in the function definition

6378.5

In [25]:
# Another example: compute confidence interval around mean using normal approximation
def mean_ci(x, conf = 0.95):
    se = np.std(x) / np.sqrt(len(x))
    alpha = 1 - conf
    return np.mean(x) + se * scipy.stats.norm.ppf([alpha/2, 1 - alpha/2])

In [26]:
print("X    Freq")
print(x)
print("\n")
print("Mean: "+  str(my_mean(x)))
print("The 95% CI is (" + str(mean_ci(x, conf=0.95)) + ")")

X    Freq
0       1
1      24
2    4325
3    8432
4      34
dtype: int64


Mean: 2563.2
The 95% CI is ([-395.13852072 5521.53852072])


##### Question: Why is this 95% CI for the mean so wide?

In [27]:
x = np.random.uniform(1,100,1000)
print("Mean: "+  str(round(my_mean(x), 3)))
print("The 95% CI is (" + str(mean_ci(x, conf=0.95)) + ")")

Mean: 51.925
The 95% CI is ([50.0950069  53.75421898])


In [28]:
x = np.random.uniform(1,100,10)
print("Mean: "+  str(round(my_mean(x), 3)))
print("The 95% CI is (" + str(mean_ci(x, conf=0.95)) + ")")

Mean: 43.494
The 95% CI is ([26.90972997 60.07838596])


You can give your arguments default values, in case the user does not supply a value. This is exactly what we did when we specified `condition=False`: when the condition argument is not given in the function call, it defaults to False.

In [29]:
def my_mean(lst, condition=False):
    temp = lst
    if condition:
        temp = [i for i in lst if i > 1000]
    return sum(temp)/len(temp)

## 9. Handling errors<a class="anchor" id="bullet9"></a>

In [130]:
len([4,5,6])

3

In [30]:
# It's good practice to check important preconditions, and throw an error if they are not true:
def wt_mean(x, w):
    assert len(x) == len(w), "x and w must be the same length!!!"  # assert statements throw an error if not True!
    return sum(np.multiply(x, w))/sum(w)

In [31]:
x = [1,2,3]
w = [1,1,1]
print("The weighted mean of x is: " + str(wt_mean(x, w)))

The weighted mean of x is: 2.0


In [32]:
x = [1,2,3]
w = [1,1,1,1]
wt_mean(x, w)

AssertionError: x and w must be the same length!!!

##### Question: What does the wt_mean() function call below return?

In [None]:
x = [1,2,3]
w = [1,1,10]
wt_mean(x, w)
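
Python skips `assert` statements when it is run with the `-O` flag, so for checks that must always run, an explicit `raise` is a common alternative. The sketch below uses a hypothetical variant named `wt_mean_checked` and shows how a caller can catch the error with `try`/`except`.

```python
import numpy as np


def wt_mean_checked(x, w):
    # Hypothetical variant of wt_mean that raises instead of asserting
    if len(x) != len(w):
        raise ValueError(
            f"x and w must be the same length, got {len(x)} and {len(w)}"
        )
    return np.sum(np.multiply(x, w)) / np.sum(w)


try:
    wt_mean_checked([1, 2, 3], [1, 1, 1, 1])
except ValueError as err:
    print(f"Caught an error: {err}")

print(wt_mean_checked([1, 2, 3], [1, 1, 1]))
```

Raising a specific exception type such as `ValueError` lets the calling code decide how to react, instead of stopping the whole program.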

## 10. Flexible Arguments: \*args and \**kwargs<a class="anchor" id="bullet10"></a>

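__*args__ is used if you need to pass a variable number of objects into the function. Placing an asterisk before a parameter name in a function definition tells Python to pack any remaining positional arguments into a tuple, which is passed to the args parameter. In a function call, the same asterisk does the reverse and acts as an __unpacking operator__, spreading a list or tuple into separate arguments.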

In [36]:
def my_sum(*args):
    result = 0
    # Iterating over the Python args tuple
    for x in args:
        result += x  # same thing as result = result + x
    return result
print(my_sum(1, 2, 3 ))
print("=======\n" + str(my_sum(1, 2, 3, 4, 5)))

6
15


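__**kwargs__ is like `*args`, except that it collects __named__ (keyword) arguments. In a function definition, the double asterisks pack the extra keyword arguments into a dictionary; in a function call, they act as a __dictionary unpacking operator__. Recall that dictionaries are great for handling key-value pairs.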

In [None]:
def concatenate(**kwargs):
    result = ""
    # Iterating over the Python kwargs dictionary
    for arg in kwargs.values():
        result += arg  # the += operator is short hand for code like this: result = result + arg
    return result

print(concatenate(b="Python ", c="Is ", d="Great", e="!"))

Python Is Great!


In [None]:
def myFun(**kwargs):
    for key, value in kwargs.items():
        print((key, value))
    
# Check code
myFun(first='Geeks', mid='for', last='Geeks')

('first', 'Geeks')
('mid', 'for')
('last', 'Geeks')


In [None]:
def foo1(a, *args):
    print(f"a is: {a}")
    print(f"args are: {args}") # args is a tuple holding the extra positional arguments

In [None]:
foo1(1)

a is: 1
args are: ()


In [None]:
foo1(1, 2, 3, 4)

a is: 1
args are: (2, 3, 4)


In [None]:
def foo2(a, **kwargs):
    print(f"a is: {a}")
    print(f"kwargs are: {kwargs}") # kwargs is a dictionary holding the extra keyword arguments

In [None]:
foo2(1)

a is: 1
kwargs are: {}


In [None]:
foo2(1, b=2, c=3, d=4)

a is: 1
kwargs are: {'b': 2, 'c': 3, 'd': 4}


In [None]:
# can use them together
def foo(a, *args, **kwargs):
    print(f"a is: {a}")
    print(f"args are: {args}")
    print(f"kwargs are: {kwargs}")

In [None]:
foo(1, 2, 3, d=4, e=5)

a is: 1
args are: (2, 3)
kwargs are: {'d': 4, 'e': 5}


In [None]:
# print_list.py
my_list = [1, 2, 3]
print(my_list)

[1, 2, 3]


In [None]:
my_list = [1, 2, 3]
print(my_list)   # print list
print(*my_list)  # print unpacked list, which is just the contents of the previous list

[1, 2, 3]
1 2 3


In [None]:
def my_sum(*args):
    result = 0
    for x in args:
        result += x
    return result

list1 = [1, 2, 3]
list2 = [4, 5]
list3 = [6, 7, 8, 9]

print(my_sum(*list1, *list2, *list3))

45
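
The double asterisk works the same way in a function call: it unpacks a dictionary into keyword arguments. In the sketch below, the function `describe` and the dictionary `summary` are hypothetical names chosen for illustration.

```python
def describe(name, mean, n):
    return f"{name}: mean={mean}, n={n}"


summary = {"name": "a", "mean": 0.5, "n": 10}

# describe(**summary) is equivalent to describe(name="a", mean=0.5, n=10)
print(describe(**summary))
```

The dictionary keys must match the parameter names of the function; otherwise Python raises a `TypeError`.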


## 11. Return statements <a class="anchor" id="bullet11"></a>

Every function we have seen so far contains a return statement. What does it do? A function runs some calculation or task, and usually there is output we want to get back. The return statement hands that output to the caller. A function that finishes without reaching a return statement returns None.

In [97]:
# Example
def check(x):
    result = ''
    if x > 0:
        result = 'Positive'
    elif x < 0:
        result = 'Negative'
    else:
        result = 'Zero'
    return result

In [98]:
check(1)

'Positive'

**Question**: What if we wanted to return multiple things at once? How can we do this?  
**Answer**: Return a list object with all the output you want to return!

In [None]:
# Calc col mean and return the sample size n
def col_mean(df):
    
    n = df.shape[0]
    
    output = []
    for i in df.columns:
        output.append(np.mean(df[i]))
    return [n, output]

In [None]:
# Create data frame to run through function
rng = np.random.default_rng(seed=3252)

a = rng.standard_normal(10)
b = rng.standard_normal(10)
c = rng.standard_normal(10)
d = rng.standard_normal(10)

the_dict = {'a':a,'b':b,'c':c,'d':d}

df = pd.DataFrame(the_dict)
df

|   | a | b | c | d |
|---|---|---|---|---|
| 0 | -0.757929 | 0.295862 | 0.417556 | 1.167163 |
| 1 | 0.807697 | 0.019923 | -1.726416 | 1.767645 |
| 2 | 0.433601 | -2.352702 | -0.585652 | 0.461624 |
| 3 | -0.949885 | -0.548677 | -0.012005 | 1.457183 |
| 4 | -1.119631 | 1.061025 | -0.981373 | -0.216714 |
| 5 | 0.064981 | 0.760106 | -1.244226 | 0.287626 |
| 6 | -0.6124 | -0.671701 | 1.370228 | -1.224794 |
| 7 | -1.144289 | 1.442454 | -1.705773 | 0.694906 |
| 8 | -1.710033 | -0.398088 | -0.13943 | 0.402294 |
| 9 | -0.663915 | -0.341473 | -2.086549 | -0.861754 |


In [None]:
n, means = col_mean(df)  # unpack the sample size and the list of column means
means

[-0.5651802647782048,
 -0.07332713554716727,
 -0.6693640830516738,
 0.3935178923880054]

In [110]:
# Recall: You can even return an entire dataset as one element of a list!
from sklearn import datasets # load iris dataset
iris0 = datasets.load_iris()
iris = pd.DataFrame(data=iris0.data, columns=iris0.feature_names)
x = [iris, 0, 1, 2]
x

[     sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
 0                  5.1               3.5                1.4               0.2
 1                  4.9               3.0                1.4               0.2
 2                  4.7               3.2                1.3               0.2
 3                  4.6               3.1                1.5               0.2
 4                  5.0               3.6                1.4               0.2
 ..                 ...               ...                ...               ...
 145                6.7               3.0                5.2               2.3
 146                6.3               2.5                5.0               1.9
 147                6.5               3.0                5.2               2.0
 148                6.2               3.4                5.4               2.3
 149                5.9               3.0                5.1               1.8
 
 [150 rows x 4 columns],
 0,
 1,
 2]

In [111]:
iris0

{'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

In [None]:
iris

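|   | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) |
|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 |
| ... | ... | ... | ... | ... |
| 145 | 6.7 | 3.0 | 5.2 | 2.3 |
| 146 | 6.3 | 2.5 | 5.0 | 1.9 |
| 147 | 6.5 | 3.0 | 5.2 | 2.0 |
| 148 | 6.2 | 3.4 | 5.4 | 2.3 |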


Note, you can also return tuples from functions! It does not matter (usually) which you use. If your object needs to be further edited/modified, then use a list. If you want the output from your function to be immutable, then tuples are better.

From a performance angle, tuples are a little more efficient than list objects. 
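
Returning a tuple also pairs well with unpacking, which gives each returned value its own name at the call site. A minimal sketch, using a hypothetical variant of `col_mean` called `col_mean_tuple` and a small made-up data frame:

```python
import numpy as np
import pandas as pd


def col_mean_tuple(df):
    # Hypothetical variant of col_mean that returns a tuple instead of a list
    n = df.shape[0]
    means = [float(np.mean(df[col])) for col in df.columns]
    return n, means  # the comma builds the tuple; parentheses are optional


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0, 6.0]})

n, means = col_mean_tuple(df)  # unpack the two returned values
print(n, means)
```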

Lastly, note it is also possible to return functions! We will see an example of this at the end of today's lecture notes.

## 12. Lazy evaluation <a class="anchor" id="bullet12"></a>

In a nutshell, lazy evaluation means that the object is evaluated when it is needed, not when it is created.
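
Python evaluates ordinary expressions and function arguments eagerly, but generators are one place where lazy evaluation shows up. The sketch below compares a list comprehension with a generator expression; the helper `noisy_square` and the `calls` list are hypothetical and exist only to record when the work actually happens.

```python
calls = []


def noisy_square(i):
    calls.append(i)  # record each time the function actually runs
    return i ** 2


# A list comprehension is eager: all three values are computed immediately
eager = [noisy_square(i) for i in range(3)]
print(calls)  # the function has already run three times

calls.clear()

# A generator expression is lazy: creating it computes nothing
lazy = (noisy_square(i) for i in range(3))
print(calls)  # still empty

first = next(lazy)  # only now is the first value computed
print(first, calls)
```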

## 13. Lexical scoping <a class="anchor" id="bullet13"></a>

You do not need to understand this to write functions in Python, but it can't hurt. This is a rather advanced programming concept.

The environment of a function controls how Python finds the value associated with a name. The key concept for function environments is called "lexical scoping".

When a function runs, it first looks inside its own body for any variable that is referenced. We call this the **local scope**. If a variable is not found in the local scope, Python steps up to the environment in which the function was defined. At the very top of the hierarchy is the **global environment**. Objects in your global environment are accessible to all functions. There is also a third scoping level, the **built-in scope**, which includes all of base Python's variables and functions, such as the print function.

In [1]:
# Local scope: the variable new_val only exists within the function call
def square(value):
    new_val = value ** 2  # new val is not defined globally
    return new_val
print(square(10))
print(new_val)

100


NameError: name 'new_val' is not defined

In [4]:
# Global scope: the global keyword makes new_val a global variable, so it persists after the function call
def square(value):
    global new_val 
    new_val = value ** 2  # new val is defined globally
    return new_val
print(square(10))
print(new_val)  # this works ! 

100
100


## 14. Nested functions <a class="anchor" id="bullet13"></a>

Once we get to nested functions, there are four or more levels of scope, since each enclosing function is now part of the picture.

In [None]:
def outer(...):
    
    x = ...
    
    def inner(...):
        y = x ** 2
    return ....

Above, we can see that each time the outer function is called, it will create the inner function to use during that function's execution. 

For the inner() function, when it starts searching for x it first looks within the local scope of the inner function. If it doesn't find x there, then it continues searching in the outer function. We call this outer function an **enclosing function**, since it encloses the inner() function. If Python still cannot find x in the outer() function, it then searches the global scope, and then the built-in scope.
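
The search order described above can be checked with a small runnable sketch. The function names `outer` and `outer_without_x` and the string values are assumptions made for illustration.

```python
x = "global x"


def outer():
    x = "enclosing x"  # local to outer, enclosing for inner

    def inner():
        # inner has no x of its own, so Python finds the one in outer
        return x

    return inner()


def outer_without_x():
    def inner():
        # no x in inner or in outer_without_x, so the global x is used
        return x

    return inner()


print(outer())
print(outer_without_x())
print(len("abc"))  # len is not defined anywhere above: it comes from the built-in scope
```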

**Question:** Why nest functions?  
**Answer:** What if you need to run some calculation many times? Your code will not scale well unless it runs on function calls to automate a set of intermediary steps. 

In [7]:
# Nesting functions: Returning functions
def raise_val(n):
    
    def inner(x):
        raised = x ** n
        return raised
    
    return inner
square = raise_val(2)  # Create the function square(), which squares any number
cube   = raise_val(3)  # Create the function cube(), which cubes any number
print(square(3), cube(3))

9 27
