# RC I
---
ECE4710J  2022SP

Materials collected by Sizhe Zhou. Credit goes to all the related online resources. Use of this notebook is limited to educational purposes.

Feb. 25th, 2022

---

# 1. Python

## 1.1 with statement

The `with` statement in Python wraps a block of code with setup and cleanup actions, which makes resource handling cleaner and much more readable. It simplifies the management of common resources like file streams. The following example shows how the `with` statement makes code cleaner.

In [None]:
# file handling
  
# 1) without using with statement
file = open('file_path', 'w')
file.write('hello world !')
file.close()
  
# 2) without using with statement, but with try/finally
file = open('file_path', 'w')
try:
    file.write('hello world')
finally:
    file.close()


# using with statement
with open('file_path', 'w') as file:
    file.write('hello world !')

Notice that, unlike the first two implementations, there is no need to call `file.close()` when using the `with` statement. The `with` statement itself ensures proper acquisition and release of resources. In the first implementation, an exception during the `file.write()` call prevents `file.close()` from running, which may introduce bugs in the code: for example, many changes to a file do not take effect until the file is properly closed.

The second approach in the above example handles exceptions correctly, but the `with` statement makes the code more compact and much more readable. The `with` statement helps avoid bugs and leaks by ensuring that a resource is released when the block using it finishes, whether normally or through an exception. It is commonly used with file streams, as shown above, and with locks, sockets, subprocesses, Telnet connections, etc.

>More advanced usage: context Manager; contextlib module...
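
As a minimal sketch of the `contextlib` route mentioned above, the following example builds a context manager from a generator function. The names `managed_resource` and `events` are hypothetical and only serve the illustration; the point is that the cleanup code after `yield` runs even when the body of the `with` block raises.

```python
from contextlib import contextmanager

events = []  # hypothetical log that records the order of operations


@contextmanager
def managed_resource(name):
    # Code before `yield` plays the role of acquiring the resource
    events.append(f"acquire {name}")
    try:
        yield name
    finally:
        # Code in `finally` runs on normal exit and on exceptions alike
        events.append(f"release {name}")


try:
    with managed_resource("demo") as res:
        events.append(f"use {res}")
        raise ValueError("simulated failure")
except ValueError:
    events.append("handled error")

print(events)
# ['acquire demo', 'use demo', 'release demo', 'handled error']
```

The release step is logged before the error is handled, which mirrors what `try`/`finally` does in the second file example.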


## 1.2 Slicing (Similar to NumPy)
List slicing returns a new list built from the existing list.


In [None]:
lst = [1,2,3]
print('lst: ', id(lst))

# be careful: plain assignment does not copy, so lst1 is the same object as lst
lst1 = lst
print('lst1: ', id(lst1))

lst2 = lst[:2]
print('lst2: ', id(lst2))

Syntax: `Lst[ Initial : End : IndexJump ]`

If `Lst` is a list, then the above expression returns the portion of the list from index `Initial` up to, but not including, index `End`, at a step size of `IndexJump`.
If `Initial` is omitted, slicing starts from index 0. If `End` is omitted, slicing continues to the end of the list. The default value of `IndexJump` is 1.
> Note: if `IndexJump` is 1 and both indices are non-negative and within the bounds of the list, the length of the slice is `End - Initial`.
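
The rules above can be checked with a short sketch. The list contents and variable names here are made up for illustration.

```python
lst = [10, 20, 30, 40, 50, 60]

first_three = lst[:3]      # Initial omitted: start from index 0
last_two = lst[-2:]        # End omitted: run to the end of the list
every_other = lst[::2]     # step of 2
middle = lst[1:5:2]        # indices 1 and 3
reversed_copy = lst[::-1]  # a negative step walks backwards
shallow_copy = lst[:]      # a full slice is a new list object

print(first_three)    # [10, 20, 30]
print(last_two)       # [50, 60]
print(every_other)    # [10, 30, 50]
print(middle)         # [20, 40]
print(reversed_copy)  # [60, 50, 40, 30, 20, 10]
print(shallow_copy is lst)  # False
print(len(lst[1:4]))  # 3, i.e. End - Initial
```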

## 1.3 Lambda Function

Syntax: `lambda_expr ::=  "lambda" [parameter_list] ":" expression`


Lambda expressions (sometimes called lambda forms) are used to create anonymous functions. The expression `lambda parameters: expression` yields a function object. The unnamed object behaves like a function object defined with:

```python

def <lambda>(parameters):
    return expression
    
```


Note that functions created with lambda expressions cannot contain statements or annotations.

In [None]:
# Add 10 to argument a, and return the result:
x = lambda a: a + 10
print(x(5))

In [None]:
# Multiply argument a with argument b and return the result:
x = lambda a, b : a * b
print(x(5, 6))

### Why Use Lambda Functions?
The power of lambda is better shown when you use it as an anonymous function inside another function.

Say you have a function definition that takes one argument, and that argument will be multiplied by an unknown number:

In [None]:
def myfunc(n):
  return lambda a : a * n

In [None]:
def myfunc(n):
  return lambda a : a * n

mydoubler = myfunc(2)

print(mydoubler(11))

In [None]:
def myfunc(n):
  return lambda a : a * n

mydoubler = myfunc(2)
mytripler = myfunc(3)

print(mydoubler(11))
print(mytripler(11))

Use lambda functions when an anonymous function is required for a short period of time.
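
A common short-lived use is the `key` argument of built-ins such as `sorted` and `max`. The following is a small sketch with made-up data.

```python
words = ['banana', 'kiwi', 'apple']
points = [(1, 5), (3, 2), (2, 8)]

# Sort by word length instead of alphabetically
by_length = sorted(words, key=lambda w: len(w))

# Pick the point with the largest second coordinate
highest = max(points, key=lambda p: p[1])

print(by_length)  # ['kiwi', 'apple', 'banana']
print(highest)    # (2, 8)
```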

## 1.4 Generators
First, let's understand iterators. According to Wikipedia, an iterator is an object that enables a programmer to traverse a container, particularly lists. However, an iterator performs traversal and gives access to data elements in a container, but does not perform iteration. There are three related terms, namely:

- Iterable
- Iterator
- Iteration

### 1.4.1 Iterable
An iterable is any object in Python that defines an `__iter__` method, which returns an iterator, or a `__getitem__` method, which can take indexes. In short, an iterable is any object that can provide us with an iterator. So what is an iterator?

### 1.4.2 Iterator
An iterator is any object in Python that defines a `next` (Python 2) or `__next__` (Python 3) method. That’s it. That’s an iterator. Now let’s understand iteration.

### 1.4.3 Iteration
In simple words, iteration is the process of taking items one at a time from something, e.g. a list. When we use a loop to go over something, that is called iteration. It is the name given to the process itself. Now that we have a basic understanding of these terms, let’s understand generators.
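
Before moving on, the three terms can be tied together in a small sketch. The `Countdown` class is a hypothetical example, not part of any library: it defines `__iter__`, so it is an iterable, and `__next__`, so it is also its own iterator.

```python
class Countdown:
    """Hypothetical iterator that counts down from `start` to 1."""

    def __init__(self, start):
        self.current = start

    def __iter__(self):
        # An iterator returns itself from __iter__
        return self

    def __next__(self):
        if self.current <= 0:
            # Signals that there are no more items
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


counter = Countdown(3)
print(iter(counter) is counter)  # True
print(list(counter))             # [3, 2, 1]
print(list(counter))             # [] because the iterator is exhausted
```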

### 1.4.4 Generators
Generators are iterators, but you can only iterate over them once. This is because they do not store all the values in memory; they generate the values on the fly. You use them by iterating over them, either with a `for` loop or by passing them to any function or construct that iterates. Most of the time generators are implemented as functions. However, they do not `return` a value; they `yield` it. Here is a simple example of a generator function:

In [None]:
def generator_function():
    for i in range(10):
        yield i

for item in generator_function():
    print(item)

# Output: 0
# 1
# 2
# 3
# 4
# 5
# 6
# 7
# 8
# 9

A generator is not really useful in this simple case. Generators are best for calculating large sets of results (particularly calculations involving loops themselves) where you don’t want to allocate the memory for all results at the same time. Many built-in functions that return lists in Python 2, such as `range`, `map`, `filter` and `zip`, return lazy objects in Python 3 because these require fewer resources.

Here is an example generator which calculates Fibonacci numbers:

In [None]:
# generator version
def fibon(n):
    a = b = 1
    for i in range(n):
        yield a
        a, b = b, a + b

Now we can use it like this:

In [None]:
for x in fibon(10):
    print(x)

This way we do not have to worry about it using a lot of resources. However, if we had implemented it like this:

In [None]:
def fibon(n):
    a = b = 1
    result = []
    for i in range(n):
        result.append(a)
        a, b = b, a + b
    return result

It would build the entire list in memory, which can use a lot of resources for a large input. We have said that we can iterate over generators only once, but we haven’t tested it. Before testing it, you need to know about one more built-in function of Python, `next()`. It retrieves the next element from an iterator. So let’s test our understanding:

In [None]:
def generator_function():
    for i in range(3):
        yield i

gen = generator_function()
print(next(gen))
# Output: 0
print(next(gen))
# Output: 1
print(next(gen))
# Output: 2
print(next(gen))
# Output: Traceback (most recent call last):
#            File "<stdin>", line 1, in <module>
#         StopIteration

As we can see, after all the values have been yielded, `next()` raises a `StopIteration` exception. This exception informs us that all the values have been yielded. You might be wondering why we don’t get this error when using a `for` loop. The answer is simple: the `for` loop automatically catches this exception and stops calling `next`. Did you know that a few built-in data types in Python also support iteration? Let’s check it out:

In [None]:
my_string = "Yasoob"
next(my_string)
# Output: Traceback (most recent call last):
#      File "<stdin>", line 1, in <module>
#    TypeError: str object is not an iterator

Well, that’s not what we expected. The error says that `str` is not an iterator. And it’s right! A string is an iterable but not an iterator. This means that it supports iteration, but we can’t call `next` on it directly. So how would we iterate over it? It’s time to learn about one more built-in function, `iter`. It returns an iterator object from an iterable. An `int` isn’t an iterable, but a string is, so we can use `iter` on it!

In [None]:
my_string = "Yasoob"
my_iter = iter(my_string)
print(next(my_iter))
# Output: 'Y'
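
The way a `for` loop uses `iter`, `next` and `StopIteration`, as described above, can be sketched by hand. This is an illustration of the protocol, not how the interpreter is literally implemented; the variable names are made up.

```python
def generator_function():
    for i in range(3):
        yield i


collected = []
iterator = iter(generator_function())  # step 1: obtain an iterator
while True:
    try:
        item = next(iterator)          # step 2: ask for the next item
    except StopIteration:
        break                          # step 3: stop quietly when exhausted
    collected.append(item)

print(collected)       # [0, 1, 2]
print(list(iterator))  # [] because a generator can only be consumed once
```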

## 1.5 Zip and Unzip

### 1.5.1 Zip
Let’s say that we have two lists, one that includes first names, and the other includes last names. We would like to somehow combine the first names with the corresponding last names as tuples. In other words, we would like to combine elements from multiple iterables that have the same index together in a list of tuples:

```
list_1 = ['Jane', 'John', 'Jennifer']
list_2 = ['Doe', 'Williams', 'Smith']
Desired Output:
[('Jane', 'Doe'), ('John', 'Williams'), ('Jennifer', 'Smith')]
```

We can accomplish this with the zip() function, which is a built-in Python function. The zip() function is named after the physical zipper, which works in an analogous way. When you zip something, you bring both sides together. And that’s how the zip() function works! It brings elements of the same index from multiple iterable objects together as elements of the same tuples.

Syntax: `zip(*iterables)`

The zip() function takes in iterables as arguments, such as lists, files, tuples, sets, etc. The zip() function will then create an iterator that aggregates elements from each of the iterables passed in. In other words, it will return an iterator of tuples, where the i-th tuple will contain the i-th element from each of the iterables passed in. This iterator will stop once the shortest input iterable has been exhausted.


In [None]:
# Using the zip() function
first_names = ['Jane', 'John', 'Jennifer']
last_names = ['Doe', 'Williams', 'Smith']
full_names = list(zip(first_names, last_names))
print(full_names)
# [('Jane', 'Doe'), ('John', 'Williams'), ('Jennifer', 'Smith')]

Remember, the zip() function returns an iterator. Thus, we need to use the list() function that will use this returned iterator (or zip object) to create a list. In addition, as long as the iterables passed in are ordered (sequences), then the tuples will contain elements in the same left-to-right order of the arguments passed in the zip() function.

In [None]:
# What if we have three iterable objects?
first_names = ['Jane', 'John', 'Jennifer']
last_names = ['Doe', 'Williams', 'Smith']
ages = [20, 40, 30]
names_and_ages = list(zip(first_names, last_names, ages))
print(names_and_ages)
# [('Jane', 'Doe', 20), ('John', 'Williams', 40), ('Jennifer', 'Smith', 30)]

In [None]:
# Passing in one argument to zip()
# If we only pass in one iterable object to the zip() function, then we will get a list of 1-item tuples as follows:
first_names = ['Jane', 'John', 'Jennifer']
print(list(zip(first_names)))
# [('Jane',), ('John',), ('Jennifer',)]

In [None]:
# Iterables with unequal lengths
first_names = ['Jane', 'John', 'Jennifer']
last_names = ['Doe', 'Williams', 'Smith', 'Jones']
full_names = list(zip(first_names, last_names))
print(full_names)
# [('Jane', 'Doe'), ('John', 'Williams'), ('Jennifer', 'Smith')]


If the elements in the longer iterables are needed, then we can use the zip_longest() function from the itertools module instead of zip(). It will continue until the longest iterable is exhausted, and will replace any missing values with the value passed in for the fillvalue argument (default is None).
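
A short sketch of `itertools.zip_longest`, reusing the unequal-length lists from the cell above. The `'?'` fill value is an arbitrary choice for illustration.

```python
from itertools import zip_longest

first_names = ['Jane', 'John', 'Jennifer']
last_names = ['Doe', 'Williams', 'Smith', 'Jones']

# Missing values are filled with None by default
padded_default = list(zip_longest(first_names, last_names))
print(padded_default)
# [('Jane', 'Doe'), ('John', 'Williams'), ('Jennifer', 'Smith'), (None, 'Jones')]

# A custom placeholder can be supplied through fillvalue
padded_custom = list(zip_longest(first_names, last_names, fillvalue='?'))
print(padded_custom[-1])
# ('?', 'Jones')
```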

#### Parallel Iteration of Iterables
We can use the zip() function to iterate in parallel over multiple iterables. Since the zip() function returns an iterator, we can use this zip object (the iterator it returns) in a for loop. And since with each iteration of this iterator a tuple is returned, we can unpack the elements of this tuple within the for loop:


In [None]:
first_names = ['Jane', 'John', 'Jennifer']
last_names = ['Doe', 'Williams', 'Smith']
ages = [20, 40, 30]
for first, last, age in zip(first_names, last_names, ages):
    print(f'{first} {last} is {age} years old')
# Output: 
# Jane Doe is 20 years old
# John Williams is 40 years old
# Jennifer Smith is 30 years old

### 1.5.2 Unzip
Let’s say that we have the following list of tuples:

In [None]:
first_and_last_names = [('Jane', 'Doe'), ('John', 'Williams'), ('Jennifer', 'Smith')]

And we want to separate the elements in these tuples into two separate lists. Well, since that is the opposite of zipping (bringing things together), it would be unzipping (taking things apart). To unzip in Python, we can use the unpacking operator * with the zip() function as follows:

In [None]:
first_names, last_names = zip(*first_and_last_names)
first_names = list(first_names)
last_names = list(last_names)
print(first_names)
# ['Jane', 'John', 'Jennifer']
print(last_names)
# ['Doe', 'Williams', 'Smith']

The unpacking operator * will unpack the first_and_last_names list of tuples into its tuples. These tuples will then be passed to the zip() function, which will take these separate iterable objects (the tuples) and combine their same-indexed elements together into tuples, making two separate tuples. Lastly, through tuple unpacking, these separated tuples will be assigned to the first_names and last_names variables. We then use the list() function to convert these tuples into lists.

## 1.6 Unpacking Operators in Python
### 1.6.1 * Operator

In [None]:
# Let’s say we have a list:
num_list = [1,2,3,4,5]

# And we define a function that takes in 5 arguments and returns their sum:
def num_sum(num1,num2,num3,num4,num5):
    return num1 + num2 + num3 + num4 + num5

And we want to find the sum of all the elements in num_list. Well, we can accomplish this by passing in all the elements of num_list to the function num_sum. Since num_list has five elements in it, the num_sum function contains five parameters, one for each element in num_list.
One way to do this would be to pass the elements by using their index as follows:

In [None]:
num_sum(num_list[0], num_list[1], num_list[2], num_list[3], num_list[4])
# 15

However, there is a much easier way to do this, and that’s by using the * operator. The * operator is an unpacking operator that will unpack the values from any iterable object, such as lists, tuples, strings, etc…
For example, if we want to unpack num_list and pass in the 5 elements as separate arguments for the num_sum function, we could do so as follows:

In [None]:
num_sum(*num_list)
# 15

And that’s it! The asterisk, *, or unpacking operator, unpacks num_list, and passes the values, or elements, of num_list as separate arguments to the num_sum function.

Note: For this to work, the number of elements in num_list must match the number of parameters in the num_sum function. If they don’t match, we get a TypeError.
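
The mismatch case from the note can be sketched as follows. `short_list` and `outcome` are hypothetical names, and the built-in `sum` is shown only as a contrast because it accepts an iterable of any length.

```python
def num_sum(num1, num2, num3, num4, num5):
    return num1 + num2 + num3 + num4 + num5


short_list = [1, 2, 3]  # only three elements for five parameters

try:
    num_sum(*short_list)
    outcome = 'no error'
except TypeError:
    outcome = 'TypeError'

print(outcome)          # TypeError
print(sum(short_list))  # 6
```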

In [None]:
# * Operator with Built-In Functions:
# We can also use the asterisk, *, or unpacking operator, with built-in functions in python, such as print:
print(*num_list)
# 1 2 3 4 5

In [None]:
# Unpacking Multiple Lists:
# Let’s say we have another list:
num_list_2 = [6,7,8,9,10]

# And we want to print all the elements in both num_list and num_list_2. We can use the unpacking operator, *, to accomplish this as follows:
print(*num_list, *num_list_2)
# 1 2 3 4 5 6 7 8 9 10
# Both num_list and num_list_2 are unpacked. Then, all the elements are passed in to print as separate arguments.

In [None]:
# Merging Multiple Lists:
# We can also create a new list that contains all the elements from num_list and num_list_2:
new_list = [*num_list, *num_list_2]
# [1,2,3,4,5,6,7,8,9,10]
# Note: We could have simply added num_list and num_list_2 to create new_list. However, this was just to portray the functionality of the unpacking operator.

In [None]:
# Other Uses of * Operator:
# Let’s say that we have a string assigned to the variable name:
name = 'Michael'

# And we want to break this name up into 3 parts, with the first letter being assigned to a variable, 
# the last letter being assigned to another variable, 
# and everything in the middle assigned to a third variable. We can do so as follows:
first, *middle, last = name
# first == 'M', middle == ['i', 'c', 'h', 'a', 'e'], last == 'l'

And that’s it! Since name is a string, and strings are iterable objects, we can unpack them. The values on the right side of the assignment operator will be assigned to the variables on the left depending on their relative position in the iterable object. As such, the first letter of `'Michael'` is assigned to the variable first, which would be `'M'` in this case. The last letter, `'l'`, is assigned to the variable last. And the variable middle will contain all the letters between `'M'` and `'l'` in the form of a list: `['i', 'c', 'h', 'a', 'e']`.

Note: The first and last variables above are called mandatory variables, as they must be assigned concrete values. The middle variable, due to using the * or unpacking operator, can have any number of values, including zero. If there are not enough values to unpack for the mandatory variables, we will get a ValueError.
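
The behaviour described in the note can be sketched with three strings of decreasing length. The variable names are made up for illustration.

```python
# Plenty of values: the starred variable collects the middle
first, *middle, last = 'Michael'
print(first, middle, last)  # M ['i', 'c', 'h', 'a', 'e'] l

# Exactly two values: the starred variable is an empty list
head, *rest, tail = 'Al'
print(head, rest, tail)     # A [] l

# One value is not enough for the two mandatory variables
try:
    a, *b, c = 'A'
    unpack_failed = False
except ValueError:
    unpack_failed = True
print(unpack_failed)        # True
```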

### 1.6.2 Packing with * Operator:
We can also use the * operator to pack multiple values into a single variable. For example:

In [None]:
*names, = 'Michael', 'John', 'Nancy'
print(names)
# ['Michael', 'John', 'Nancy']

A trailing comma is needed after `*names` because a starred target must appear inside a tuple or list on the left side of the assignment. The names variable now contains all the names on the right side in the form of a list.

Note: This is what we do when we define functions that can receive a varying number of arguments! That is the concept of *args and **kwargs!


#### *args:
For example, let’s say we have a function, names_tuple, that takes in names as arguments and returns them back. However, the number of names that we pass in to this function can vary. Well, we can’t just choose a number of parameters that this function would have since the number of positional arguments can change with each calling of the function. We can instead use the * operator to pack the arguments passed in into a tuple as follows:

In [None]:
def names_tuple(*args):
    return args
names_tuple('Michael', 'John', 'Nancy')
# ('Michael', 'John', 'Nancy')
names_tuple('Jennifer', 'Nancy')
# ('Jennifer', 'Nancy')

No matter what number of positional arguments we pass in when we call the names_tuple function, the *args argument will pack the positional arguments into a tuple, similar to the *names assignment above.

#### **kwargs
To pass in a varying number of keyword or named arguments, we use the ** operator when defining a function. In a function definition, the ** operator packs the varying number of named arguments we pass in into a dictionary.



In [None]:
def names_dict(**kwargs):
    return kwargs
names_dict(Jane = 'Doe')
# {'Jane': 'Doe'}
names_dict(Jane = 'Doe', John = 'Smith')
# {'Jane': 'Doe', 'John': 'Smith'}
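
Both forms can appear in the same signature, after any ordinary parameters. The following is a minimal sketch; `describe` and `forward` are hypothetical helper names.

```python
def describe(title, *args, **kwargs):
    # title takes the first positional argument, args collects the
    # remaining positional ones, kwargs collects the named ones
    return title, args, kwargs


result = describe('report', 1, 2, author='Jane', year=2022)
print(result)
# ('report', (1, 2), {'author': 'Jane', 'year': 2022})


def forward(*args, **kwargs):
    # Pack on the way in, unpack on the way out: a common way to
    # pass arguments through to another function unchanged
    return describe(*args, **kwargs)


print(forward('memo', 7, urgent=True))
# ('memo', (7,), {'urgent': True})
```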

#### Dictionaries
What happens when we try to use the * operator with a dictionary?

In [None]:
num_dict = {'a': 1, 'b': 2, 'c': 3}
print(*num_dict)
# a b c

Notice how it printed the keys of the dictionary and not the values? To unpack a dictionary, we need to use the ** unpacking operator. However, since each value is associated with a specific key, the function that we pass these arguments to must have parameters with the same names as the keys of the dictionary being unpacked. For example:

In [None]:
def dict_sum(a,b,c):
    return a+b+c

This dict_sum function has three parameters: a, b, and c. These three parameters are named the same as the keys of num_dict. Therefore, once we pass in the unpacked dictionary using the ** operator, it’ll assign in the values of the keys according to the corresponding parameter names:

In [None]:
dict_sum(**num_dict)
# 6
# Thus, the values, or arguments, for the a, b, and c parameters in dict_sum will be 1, 2, and 3, respectively. 
# And the sum of these three values is 6.

#### Merging Dictionaries:
Just as the * operator can merge lists, the ** operator can be used to merge two or more dictionaries:

In [None]:
num_dict = {'a': 1, 'b': 2, 'c': 3}
num_dict_2 = {'d': 4, 'e': 5, 'f': 6}
new_dict = {**num_dict, **num_dict_2}
new_dict
# {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5, 'f': 6}
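
When the dictionaries share a key, the value from the dictionary unpacked last wins. A small sketch with made-up configuration data:

```python
defaults = {'host': 'localhost', 'port': 8080}
overrides = {'port': 9090, 'debug': True}

# 'port' appears in both; the later unpacking overrides the earlier one
merged = {**defaults, **overrides}
print(merged)
# {'host': 'localhost', 'port': 9090, 'debug': True}

# The inputs are left unchanged
print(defaults['port'])
# 8080
```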

## 1.7 Comprehension
Comprehensions are constructs that allow sequences to be built from other sequences. Python 2.0 introduced list comprehensions, and Python 3.0 added dictionary and set comprehensions.

### 1.7.1 List Comprehensions
A list comprehension consists of the following parts:

- An Input Sequence.
- A Variable representing members of the input sequence.
- An Optional Predicate expression.
- An Output Expression producing elements of the output list from members of the Input Sequence that satisfy the predicate.


Say we need to pick out all the integers in a sequence and square them:

In [None]:
a_list = [1, '4', 9, 'a', 0, 4]

squared_ints = [e**2 for e in a_list if type(e) == int]

print(squared_ints)
# [1, 81, 0, 16]

<img src='./assets/listComprehension.gif' style='zoom: 150%'/>

The iterator part iterates through each member e of the input sequence a_list.
The predicate checks if the member is an integer.
If the member is an integer then it is passed to the output expression, squared, to become a member of the output list.


Much the same results can be achieved using the built in functions, map, filter and the anonymous lambda function.

The filter function applies a predicate to a sequence:

In [None]:
# filter() is lazy in Python 3, so wrap it in list() to see the result: [1, 9, 0, 4]
list(filter(lambda e: type(e) == int, a_list))

Map modifies each member of a sequence:

In [None]:
# Applied to integers only, because '4' ** 2 in a_list would raise a TypeError
list(map(lambda e: e**2, [1, 9, 0, 4]))

The two can be combined:

In [None]:
# types.IntType exists only in Python 2; in Python 3 compare against int
list(map(lambda e: e**2, filter(lambda e: type(e) == int, a_list)))

The above example involves function calls to map, filter and type, and two lambda functions that are called once per element. Function calls in Python are expensive. Furthermore, in Python 2 the input sequence is traversed twice and an intermediate list is produced by filter. In Python 3, map and filter are lazy, so no intermediate list is built, but the per-element function calls remain.

The list comprehension is enclosed within square brackets, so it is immediately evident that a list is being produced. There is only one function call, to type, and no call to the cryptic lambda. Instead, the list comprehension uses a conventional iterator, an expression, and an if clause for the optional predicate.
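
To make the comparison concrete, here is a sketch that computes the same result three ways: an explicit loop, the list comprehension, and the map/filter combination. The variable names are hypothetical.

```python
a_list = [1, '4', 9, 'a', 0, 4]

# 1) explicit loop
squared_loop = []
for e in a_list:
    if type(e) == int:
        squared_loop.append(e ** 2)

# 2) list comprehension
squared_comp = [e ** 2 for e in a_list if type(e) == int]

# 3) map and filter with lambdas
squared_map = list(map(lambda e: e ** 2,
                       filter(lambda e: type(e) == int, a_list)))

print(squared_loop)                                  # [1, 81, 0, 16]
print(squared_loop == squared_comp == squared_map)   # True
```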

### 1.7.2 Nested Comprehensions
![Matrix](./assets/idMatrix.png)

In Python we can represent such a matrix by a list of lists, where each sub-list represents a row. A 3 by 3 identity matrix can be built with the following nested comprehension:

In [None]:
[ [ 1 if item_idx == row_idx else 0 for item_idx in range(0, 3) ] for row_idx in range(0, 3) ]
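
Two related patterns are sketched below under made-up names: flattening a list of lists with two `for` clauses in one comprehension, and transposing a matrix with a comprehension nested inside another.

```python
identity = [[1 if col == row else 0 for col in range(3)] for row in range(3)]
print(identity)
# [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Two for clauses read left to right, like nested loops
flat = [item for row in identity for item in row]
print(flat)
# [1, 0, 0, 0, 1, 0, 0, 0, 1]

# The inner comprehension builds one column at a time
matrix = [[1, 2, 3], [4, 5, 6]]
transposed = [[row[i] for row in matrix] for i in range(3)]
print(transposed)
# [[1, 4], [2, 5], [3, 6]]
```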

### 1.7.3 Techniques
Using zip() and dealing with two or more elements at a time:

```
['%s=%s' % (n, v) for n, v in zip(self.all_names, self)]
```

Multiple types (auto unpacking of a tuple):
```
[f(v) for (n, f), v in zip(cls.all_slots, values)]
```


A two-level list comprehension using os.walk():
```
# Comprehensions/os_walk_comprehension.py
import os
restFiles = [os.path.join(d[0], f) for d in os.walk(".")
             for f in d[2] if f.endswith(".rst")]
for r in restFiles:
    print(r)
```

os.walk()'s signature:
`walk(top, topdown=True, onerror=None, followlinks=False)`

### 1.7.4 Set Comprehensions
Set comprehensions allow sets to be constructed using the same principles as list comprehensions. The only difference is that the result is a set.

Say we have a list of names. The list can contain names which only differ in the case used to represent them, duplicates, and names consisting of only one character. We are only interested in names longer than one character and wish to represent all names in the same format: the first letter should be capitalised and all other characters should be lower case.

Given the list:

In [None]:
names = [ 'Bob', 'JOHN', 'alice', 'bob', 'ALICE', 'J', 'Bob' ]

We require the set:
{ 'Bob', 'John', 'Alice' }

Note the new syntax for denoting a set. Members are enclosed in curly braces.

The following set comprehension accomplishes this:

In [None]:
{ name[0].upper() + name[1:].lower() for name in names if len(name) > 1 }

### 1.7.5 Dictionary Comprehensions
Say we have a dictionary the keys of which are characters and the values of which map to the number of times that character appears in some text. The dictionary currently distinguishes between upper and lower case characters.

We require a dictionary in which the occurrences of upper and lower case characters are combined:

In [None]:
mcase = {'a':10, 'b': 34, 'A': 7, 'Z':3}

mcase_frequency = { k.lower() : mcase.get(k.lower(), 0) + mcase.get(k.upper(), 0) for k in mcase.keys() }

# mcase_frequency == {'a': 17, 'z': 3, 'b': 34}

## 1.8 Pipeline
We will not talk about this in this RC, but it will probably be covered by your project. Given all the previous knowledge, you will easily understand this concept, and I don't want to rob you of the pleasure.

## 1.9 Decorator (May not be useful to you yet)
Decorators are a significant part of Python. In simple words: they are functions which modify the functionality of other functions. They help to make our code shorter and more Pythonic. Most beginners do not know where to use them so I am going to share some areas where decorators can make your code more concise.

First, let’s discuss how to write your own decorator.

It is perhaps one of the most difficult concepts to grasp. We will take it one step at a time so that you can fully understand it.

### 1.9.1 Everything in Python is an object:
First of all let’s understand functions in Python:

In [None]:
def hi(name="yasoob"):
    return "hi " + name

print(hi())
# output: 'hi yasoob'

# We can even assign a function to a variable like
greet = hi
# We are not using parentheses here because we are not calling the function hi
# instead we are just putting it into the greet variable. Let's try to run this

print(greet())
# output: 'hi yasoob'

# Let's see what happens if we delete the old hi function!
del hi
print(hi())
#outputs: NameError

In [None]:
print(greet())
#outputs: 'hi yasoob'

### 1.9.2. Defining functions within functions:
So those are the basics when it comes to functions. Let’s take your knowledge one step further. In Python we can define functions inside other functions:

In [None]:
def hi(name="yasoob"):
    print("now you are inside the hi() function")

    def greet():
        return "now you are in the greet() function"

    def welcome():
        return "now you are in the welcome() function"

    print(greet())
    print(welcome())
    print("now you are back in the hi() function")

hi()
#output:now you are inside the hi() function
#       now you are in the greet() function
#       now you are in the welcome() function
#       now you are back in the hi() function

# This shows that whenever you call hi(), greet() and welcome()
# are also called. However the greet() and welcome() functions
# are not available outside the hi() function e.g:

greet() # dont run the above code block if you want it to output right :)
#outputs: NameError: name 'greet' is not defined

So now we know that we can define functions in other functions. In other words: we can make nested functions. Now you need to learn one more thing, that functions can return functions too.

### 1.9.3. Returning functions from within functions:
It is not necessary to execute a function within another function, we can return it as an output as well:

In [None]:
def hi(name="yasoob"):
    def greet():
        return "now you are in the greet() function"

    def welcome():
        return "now you are in the welcome() function"

    if name == "yasoob":
        return greet
    else:
        return welcome

a = hi()
print(a)
#outputs: <function hi.<locals>.greet at 0x7f2143c01500>

#This clearly shows that `a` now points to the greet() function in hi()
#Now try this

print(a())
#outputs: now you are in the greet() function

Just take a look at the code again. In the if/else clause we are returning `greet` and `welcome`, not `greet()` and `welcome()`. Why is that? It’s because when you put a pair of parentheses after a function name, the function gets executed; whereas if you don’t put parentheses after it, it can be passed around and assigned to other variables without being executed. Did you get it? Let me explain it in a little more detail. When we write `a = hi()`, `hi()` gets executed and, because the name is yasoob by default, the function `greet` is returned. If we change the statement to `a = hi(name = "ali")`, then the `welcome` function will be returned. We can also do `print(hi()())`, which outputs now you are in the greet() function.

### 1.9.4 Giving a function as an argument to another function:

In [None]:
def hi():
    return "hi yasoob!"

def doSomethingBeforeHi(func):
    print("I am doing some boring work before executing hi()")
    print(func())

doSomethingBeforeHi(hi)
#outputs:I am doing some boring work before executing hi()
#        hi yasoob!

Now you have all the required knowledge to learn what decorators really are. Decorators let you execute code before and after a function.

### 1.9.5. Writing your first decorator:
In the last example we actually made a decorator! Let’s modify the previous decorator and make a slightly more usable program:

In [None]:
def a_new_decorator(a_func):

    def wrapTheFunction():
        print("I am doing some boring work before executing a_func()")

        a_func()

        print("I am doing some boring work after executing a_func()")

    return wrapTheFunction

def a_function_requiring_decoration():
    print("I am the function which needs some decoration to remove my foul smell")

a_function_requiring_decoration()
#outputs: "I am the function which needs some decoration to remove my foul smell"

a_function_requiring_decoration = a_new_decorator(a_function_requiring_decoration)
#now a_function_requiring_decoration is wrapped by wrapTheFunction()

a_function_requiring_decoration()
#outputs:I am doing some boring work before executing a_func()
#        I am the function which needs some decoration to remove my foul smell
#        I am doing some boring work after executing a_func()

Did you get it? We just applied the previously learned principles. This is exactly what the decorators do in Python! They wrap a function and modify its behaviour in one way or another. Now you might be wondering why we did not use the @ anywhere in our code? That is just a short way of making up a decorated function. Here is how we could have run the previous code sample using @.

In [None]:
@a_new_decorator
def a_function_requiring_decoration():
    """Hey you! Decorate me!"""
    print("I am the function which needs some decoration to "
          "remove my foul smell")

a_function_requiring_decoration()
#outputs: I am doing some boring work before executing a_func()
#         I am the function which needs some decoration to remove my foul smell
#         I am doing some boring work after executing a_func()

#the @a_new_decorator is just a short way of saying (left commented out so the function is not wrapped twice):
#a_function_requiring_decoration = a_new_decorator(a_function_requiring_decoration)

I hope you now have a basic understanding of how decorators work in Python. Now there is one problem with our code. If we run:

In [None]:
print(a_function_requiring_decoration.__name__)
# Output: wrapTheFunction

That’s not what we expected! The name should be “a_function_requiring_decoration”. Well, our function was replaced by wrapTheFunction, which overrode the name and docstring of our function. Luckily, Python provides a simple function to solve this problem: functools.wraps. Let’s modify our previous example to use functools.wraps:

In [None]:
from functools import wraps

def a_new_decorator(a_func):
    @wraps(a_func)
    def wrapTheFunction():
        print("I am doing some boring work before executing a_func()")
        a_func()
        print("I am doing some boring work after executing a_func()")
    return wrapTheFunction

@a_new_decorator
def a_function_requiring_decoration():
    """Hey yo! Decorate me!"""
    print("I am the function which needs some decoration to "
          "remove my foul smell")

print(a_function_requiring_decoration.__name__)
# Output: a_function_requiring_decoration

Now that is much better. Let’s move on and learn some use-cases of decorators.

#### Blueprint:


In [None]:
from functools import wraps
def decorator_name(f):
    @wraps(f)
    def decorated(*args, **kwargs):
        if not can_run:
            return "Function will not run"
        return f(*args, **kwargs)
    return decorated

@decorator_name
def func():
    return("Function is running")

can_run = True
print(func())
# Output: Function is running

can_run = False
print(func())
# Output: Function will not run

Note: @wraps takes a function to be decorated and adds the functionality of copying over the function name, docstring, arguments list, etc. This allows us to access the pre-decorated function’s properties in the decorator.
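As a minimal sketch of what @wraps preserves, compare a wrapper written with and without it. The names plain_decorator, wrapping_decorator, greet_plain and greet_wrapped are made up for this illustration.

```python
from functools import wraps

def plain_decorator(f):
    # No @wraps: the wrapper keeps its own name and has no docstring
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

def wrapping_decorator(f):
    @wraps(f)  # copies __name__, __doc__, etc. from f onto wrapper
    def wrapper(*args, **kwargs):
        return f(*args, **kwargs)
    return wrapper

@plain_decorator
def greet_plain():
    """Say hello."""
    return "hello"

@wrapping_decorator
def greet_wrapped():
    """Say hello."""
    return "hello"

print(greet_plain.__name__, greet_plain.__doc__)      # wrapper None
print(greet_wrapped.__name__, greet_wrapped.__doc__)  # greet_wrapped Say hello.

# wraps also stores the original, undecorated function as __wrapped__
print(greet_wrapped.__wrapped__())                    # hello
```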

#### Authorization
Decorators can help to check whether someone is authorized to use an endpoint in a web application. They are extensively used in Flask web framework and Django. Here is an example to employ decorator based authentication:


In [None]:
from functools import wraps

def requires_auth(f):
    @wraps(f)
    def decorated(*args, **kwargs):
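        # request, check_auth and authenticate are assumed to be provided
        # by the web application (e.g. Flask's request object)
        auth = request.authorization
        if not auth or not check_auth(auth.username, auth.password):
            # return here so that f is not executed for unauthorized users
            return authenticate()
        return f(*args, **kwargs)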
    return decorated
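The snippet above depends on objects supplied by a web framework, so it does not run on its own. The following is a framework-free sketch of the same idea. The USERS dictionary, the explicit auth tuple and the "401 Unauthorized" return value are assumptions made for illustration, not part of Flask or Django.

```python
from functools import wraps

# Hypothetical credential store, for illustration only
USERS = {"alice": "secret"}

def check_auth(username, password):
    """Return True if the username/password pair is known."""
    return USERS.get(username) == password

def requires_auth(f):
    @wraps(f)
    def decorated(auth, *args, **kwargs):
        # auth is a (username, password) tuple or None; in a web framework
        # this information would come from the incoming request instead
        if not auth or not check_auth(*auth):
            return "401 Unauthorized"
        return f(*args, **kwargs)
    return decorated

@requires_auth
def secret_page():
    return "Welcome!"

print(secret_page(("alice", "secret")))  # Welcome!
print(secret_page(("alice", "wrong")))   # 401 Unauthorized
print(secret_page(None))                 # 401 Unauthorized
```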

#### Logging

In [None]:
from functools import wraps

def logit(func):
    @wraps(func)
    def with_logging(*args, **kwargs):
        print(func.__name__ + " was called")
        return func(*args, **kwargs)
    return with_logging

@logit
def addition_func(x):
   """Do some math."""
   return x + x


result = addition_func(4)
# Output: addition_func was called

### 1.9.6. Decorators with Arguments
Come to think of it, isn’t @wraps also a decorator? But, it takes an argument like any normal function can do. So, why can’t we do that too?

This is because when you use the @my_decorator syntax, you are applying a wrapper function with a single function as a parameter. Remember, everything in Python is an object, and this includes functions! With that in mind, we can write a function that returns a wrapper function.

#### 1.9.6.1. Nesting a Decorator Within a Function
Let’s go back to our logging example, and create a wrapper which lets us specify a logfile to output to.

In [None]:
from functools import wraps

def logit(logfile='out.log'):
    def logging_decorator(func):
        @wraps(func)
        def wrapped_function(*args, **kwargs):
            log_string = func.__name__ + " was called"
            print(log_string)
            # Open the logfile and append
            with open(logfile, 'a') as opened_file:
                # Now we log to the specified logfile
                opened_file.write(log_string + '\n')
            return func(*args, **kwargs)
        return wrapped_function
    return logging_decorator

@logit()
def myfunc1():
    pass

myfunc1()
# Output: myfunc1 was called
# A file called out.log now exists, with the above string

@logit(logfile='func2.log')
def myfunc2():
    pass

myfunc2()
# Output: myfunc2 was called
# A file called func2.log now exists, with the above string
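To make the extra level of nesting explicit, here is a small sketch with a hypothetical repeat decorator factory (not part of the original example). It shows that @repeat(times=3) is simply "call the factory, then apply the decorator it returns".

```python
from functools import wraps

def repeat(times=2):
    """Decorator factory: returns a decorator that calls func `times` times."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            results = []
            for _ in range(times):
                results.append(func(*args, **kwargs))
            return results
        return wrapper
    return decorator

@repeat(times=3)
def ping():
    return "pong"

def ping_plain():
    return "pong"

# Equivalent to the @repeat(times=3) form above:
# first call the factory, then apply the returned decorator
ping_manual = repeat(times=3)(ping_plain)

print(ping())         # ['pong', 'pong', 'pong']
print(ping_manual())  # ['pong', 'pong', 'pong']
```

This is also why @logit() needs the parentheses in the example above: without them, the function being decorated would be passed as the logfile argument.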

#### 1.9.6.2. Decorator Classes
Now we have our logit decorator in production, but when some parts of our application are considered critical, failure might be something that needs more immediate attention. Let’s say sometimes you want to just log to a file. Other times you want an email sent, so the problem is brought to your attention, and still keep a log for your own records. This is a case for using inheritance, but so far we’ve only seen functions being used to build decorators.

Luckily, classes can also be used to build decorators. So, let’s rebuild logit as a class instead of a function.

In [None]:
class logit(object):

    _logfile = 'out.log'

    def __init__(self, func):
        self.func = func

    def __call__(self, *args):
        log_string = self.func.__name__ + " was called"
        print(log_string)
        # Open the logfile and append
        with open(self._logfile, 'a') as opened_file:
            # Now we log to the specified logfile
            opened_file.write(log_string + '\n')
        # Now, send a notification
        self.notify()

        # return base func
        return self.func(*args)



    def notify(self):
        # logit only logs, no more
        pass


This implementation has an additional advantage of being much cleaner than the nested function approach, and wrapping a function still will use the same syntax as before:

In [None]:
logit._logfile = 'out2.log' # if change log file
@logit
def myfunc1():
    pass

myfunc1()
# Output: myfunc1 was called

Now, let’s subclass logit to add email functionality (though this topic will not be covered here).

In [None]:
class email_logit(logit):
    '''
    A logit implementation for sending emails to admins
    when the function is called.
    '''
    def __init__(self, func, email='admin@myproject.com'):
        # func must come first: @email_logit passes the decorated function
        # as the first positional argument
        self.email = email
        super(email_logit, self).__init__(func)

    def notify(self):
        # Send an email to self.email
        # Will not be implemented here
        pass

From here, @email_logit works just like @logit but sends an email to the admin in addition to logging.

---

---

# 2. Numpy
## 2.1 Broadcasting
The term broadcasting describes how NumPy treats arrays with different shapes during arithmetic operations. Subject to certain constraints, the smaller array is “broadcast” across the larger array so that they have compatible shapes. Broadcasting provides a means of vectorizing array operations so that looping occurs in C instead of Python. It does this without making needless copies of data and usually leads to efficient algorithm implementations. There are, however, cases where broadcasting is a bad idea because it leads to inefficient use of memory that slows computation.

NumPy operations are usually done on pairs of arrays on an element-by-element basis. In the simplest case, the two arrays must have exactly the same shape, as in the following example:

In [None]:
import numpy as np

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b

NumPy’s broadcasting rule relaxes this constraint when the arrays’ shapes meet certain constraints. The simplest broadcasting example occurs when an array and a scalar value are combined in an operation:

In [None]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

The result is equivalent to the previous example where b was an array. We can think of the scalar b being stretched during the arithmetic operation into an array with the same shape as a. The new elements in b, as shown in Figure 1, are simply copies of the original scalar. The stretching analogy is only conceptual. NumPy is smart enough to use the original scalar value without actually making copies so that broadcasting operations are as memory and computationally efficient as possible.

![BC](./assets/broadcasting_1.png)

The code in the second example is more efficient than that in the first because broadcasting moves less memory around during the multiplication (b is a scalar rather than an array).

### 2.1.1 General Broadcasting Rules
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when

1. they are equal, or

2. one of them is 1

If these conditions are not met, a ValueError: operands could not be broadcast together exception is thrown, indicating that the arrays have incompatible shapes. The size of the resulting array is the size that is not 1 along each axis of the inputs.
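A short sketch of these rules in action; the array names and shapes are chosen only for illustration.

```python
import numpy as np

x = np.ones((4, 3))                 # shape (4, 3)
row = np.arange(3)                  # shape (3,): trailing dims 3 and 3 are equal
col = np.arange(4).reshape(4, 1)    # shape (4, 1): the 1 is stretched to 3

print((x + row).shape)    # (4, 3)
print((x + col).shape)    # (4, 3)
print((row + col).shape)  # (4, 3): compare 3 with 1, then a missing dim with 4

bad = np.arange(4)        # shape (4,): trailing dims 3 and 4 differ, neither is 1
raised = False
try:
    x + bad
except ValueError as err:
    raised = True
    print("ValueError:", err)
```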


### 2.1.2 Broadcastable arrays
A set of arrays is called “broadcastable” to the same shape if the above rules produce a valid result.

For example, if a.shape is (5,1), b.shape is (1,6), c.shape is (6,) and d.shape is () so that d is a scalar, then a, b, c, and d are all broadcastable to dimension (5,6); and

- a acts like a (5,6) array where a[:,0] is broadcast to the other columns,

- b acts like a (5,6) array where b[0,:] is broadcast to the other rows,

- c acts like a (1,6) array and therefore like a (5,6) array where c[:] is broadcast to every row, and finally,

- d acts like a (5,6) array where the single value is repeated.
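The example with a, b, c and d can be checked directly. This sketch uses np.broadcast_arrays, which returns each input stretched to the common shape; the concrete values are arbitrary.

```python
import numpy as np

a = np.arange(5).reshape(5, 1)   # shape (5, 1)
b = np.arange(6).reshape(1, 6)   # shape (1, 6)
c = np.arange(6)                 # shape (6,)
d = np.array(10)                 # shape (), a 0-d array (scalar)

a2, b2, c2, d2 = np.broadcast_arrays(a, b, c, d)
print(a2.shape, b2.shape, c2.shape, d2.shape)  # (5, 6) (5, 6) (5, 6) (5, 6)

print((a2[:, 0:1] == a2).all())  # True: every column repeats a[:, 0]
print((b2[0:1, :] == b2).all())  # True: every row repeats b[0, :]
print((c2 == c).all())           # True: every row repeats c
print((a + b + c + d).shape)     # (5, 6)
```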


---

---

# 3. Pandas


In [None]:
import pandas as pd

I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers) and sex (male/female) data.

In [None]:
df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

df

To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys will be used as column headers and the values in each list as columns of the DataFrame.

<!-- <img src="./assets/df.svg" style="zoom:70%" /> -->
![df](./assets/df.svg)

## Each column in a DataFrame is a Series

![series](./assets/01_table_series.svg)

In [None]:
# I’m just interested in working with the data in the column Age
df["Age"]

When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use the column label in between square brackets [].

You can create a Series from scratch as well:

In [None]:
ages = pd.Series([22, 35, 58], name="Age")

ages

A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row labels.

---
The describe() method provides a quick overview of the numerical data in a DataFrame. As the Name and Sex columns are textual data, these are by default not taken into account by the describe() method.

Many pandas operations return a DataFrame or a Series. The describe() method is an example of a pandas operation returning a pandas Series or a pandas DataFrame.

---

dtypes is an attribute of a DataFrame and Series.

---

shape is an attribute of both a pandas Series and a DataFrame. For a DataFrame it contains the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional, so only the number of rows is returned.
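The three notes above on describe(), dtypes and shape can be tried on the small passenger table. This is a sketch that rebuilds the table so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "Name": [
            "Braund, Mr. Owen Harris",
            "Allen, Mr. William Henry",
            "Bonnell, Miss. Elizabeth",
        ],
        "Age": [22, 35, 58],
        "Sex": ["male", "male", "female"],
    }
)

summary = df.describe()        # a DataFrame; only the numeric Age column is summarised
print(list(summary.columns))   # ['Age']
print(summary.loc["mean", "Age"])

print(df["Age"].describe())    # describe() on a Series returns a Series

print(df.dtypes)               # one dtype per column
print(df.shape)                # (3, 3)
print(df["Age"].shape)         # (3,)
```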

---


## 3.1 How do I filter specific rows from a DataFrame?
I’m interested in the passengers older than 35 years.

In [None]:
titanic = pd.read_csv("./assets/titanic.csv")

titanic.head()

In [None]:
above_35 = titanic[titanic["Age"] > 35]

above_35.head()

To select rows based on a conditional expression, use a condition inside the selection brackets [].

The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a value larger than 35:

In [None]:
titanic["Age"] > 35

The output of the conditional expression (>, but also ==, !=, <, <=,… would work) is actually a pandas Series of boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows for which the value is True will be selected.

I’m interested in the Titanic passengers from cabin class 2 and 3.

In [None]:
class_23 = titanic[titanic["Pclass"].isin([2, 3])]

Similar to the conditional expression, the isin() conditional function returns True for each row whose value is in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.

The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with an | (or) operator:

In [None]:
class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]

class_23.head()

When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you cannot use the Python keywords or/and; use the element-wise or operator | and the and operator & instead.

I want to work with passenger data for which the age is known.

In [None]:
age_no_na = titanic[titanic["Age"].notna()]

age_no_na.head()

The notna() conditional function returns True for each row whose value is not a null value. As such, this can be combined with the selection brackets [] to filter the data table.

## 3.2 How do I select specific rows and columns from a DataFrame?


I’m interested in the names of the passengers older than 35 years.

In [None]:
adult_names = titanic.loc[titanic["Age"] > 35, "Name"]

adult_names.head()

In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the part before the comma is the rows you want, and the part after the comma is the columns you want to select.

When using the column names, row labels or a condition expression, use the loc operator in front of the selection brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels, a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
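The different kinds of selectors accepted by loc can be sketched on a tiny table. The people DataFrame and its labels are made up for this illustration.

```python
import pandas as pd

people = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cid", "Dee"],
        "Age": [22, 41, 58, 36],
        "Pclass": [3, 1, 2, 3],
    },
    index=["a", "b", "c", "d"],
)

print(people.loc["b", "Name"])                  # single labels -> Bob
print(people.loc[["a", "c"], ["Name", "Age"]])  # lists of labels
print(people.loc["b":"d", "Age"])               # label slice, end label included
print(people.loc[people["Age"] > 35, "Name"])   # conditional expression on rows
print(people.loc[:, "Pclass"])                  # colon = all rows
```

Unlike positional slices with iloc, a label slice with loc includes its end label.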

I’m interested in rows 10 till 25 and columns 3 to 5.

In [None]:
titanic.iloc[9:25, 2:5]

Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the iloc operator in front of the selection brackets [].

When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For example, to assign the name anonymous to the first 3 elements of the fourth column (position 3, since positions start at 0):

In [None]:
titanic.iloc[0:3, 3] = "anonymous"

titanic.head()

## 3.3 How to create plots in pandas?


In [None]:
import pandas as pd

import matplotlib.pyplot as plt

In [None]:
air_quality = pd.read_csv("./assets/air_quality_no2.csv", index_col=0, parse_dates=True)

air_quality.head()

In [None]:
air_quality.plot()


With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.

I want to plot only the columns of the data table with the data from Paris.

In [None]:
air_quality["station_paris"].plot()

To plot a specific column, use the column selection described above in combination with the plot() method. Hence, the plot() method works on both Series and DataFrame.

I want to visually compare the NO2 values measured in London versus Paris.

In [None]:
air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)

Apart from the default line plot when using the plot function, a number of alternatives are available to plot data. Let’s use some standard Python to get an overview of the available plot methods:

In [None]:
[
    method_name
    for method_name in dir(air_quality.plot)
    if not method_name.startswith("_")
]


In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an overview of the available methods, for example air_quality.plot. + TAB.

I want each of the columns in a separate subplot.

In [None]:
axs = air_quality.plot.area(figsize=(12, 4), subplots=True)

Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. The builtin options available in each of the pandas plot functions are worth reviewing.

I want to further customize, extend or save the resulting plot.

In [None]:
fig, axs = plt.subplots(figsize=(12, 4))

In [None]:
air_quality.plot.area(ax=axs)

In [None]:
axs.set_ylabel("NO$_2$ concentration")

In [None]:
fig.savefig("./assets/no2_concentrations.png")

Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit makes the full power of Matplotlib available for the plot. This strategy is applied in the previous example:

In [None]:
fig, axs = plt.subplots(figsize=(12, 4))        # Create an empty matplotlib Figure and Axes
air_quality.plot.area(ax=axs)                   # Use pandas to put the area plot on the prepared Figure/Axes
axs.set_ylabel("NO$_2$ concentration")          # Do any matplotlib customization you like
fig.savefig("./assets/no2_concentrations.png")           # Save the Figure/Axes using the existing matplotlib method.

## 3.3 How to create new columns derived from existing columns


I want to express the NO2 concentration of the station in London in mg/m3

(If we assume temperature of 25 degrees Celsius and pressure of 1013 hPa, the conversion factor is 1.882)

In [None]:
air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882

air_quality.head()

To create a new column, use the [] brackets with the new column name at the left side of the assignment.

The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!

I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column

In [None]:
air_quality["ratio_paris_antwerp"] = (
    air_quality["station_paris"] / air_quality["station_antwerp"]
)


air_quality.head()

The calculation is again element-wise, so the / is applied for the values in each row.

Other mathematical operators (+, -, \*, /) and logical operators (<, >, ==,…) also work element-wise. The latter were already used in the row-filtering section above to filter rows of a table using a conditional expression.

If you need more advanced logic, you can use arbitrary Python code via apply().
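As a minimal sketch of apply(), the following uses a small made-up table. The label function and its thresholds are hypothetical and only serve to show logic that is awkward to write with plain operators.

```python
import pandas as pd

readings = pd.DataFrame(
    {"station_paris": [20.0, 45.0, 90.0], "station_london": [30.0, 40.0, 10.0]}
)

def label(value):
    """Classify a value; the thresholds are made up for illustration."""
    if value < 25:
        return "low"
    if value < 50:
        return "medium"
    return "high"

# Series.apply calls the function once per element
readings["paris_level"] = readings["station_paris"].apply(label)

# DataFrame.apply with axis=1 calls the function once per row
readings["higher"] = readings.apply(
    lambda row: "paris" if row["station_paris"] > row["station_london"] else "london",
    axis=1,
)

print(readings)
```

apply() runs Python code for every element or row, so it is usually slower than the element-wise operators shown above; it is best kept for logic that cannot be expressed with them.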

I want to rename the data columns to the corresponding station identifiers used by openAQ

In [None]:
air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)

air_quality_renamed.head()

The rename() function can be used for both row labels and column labels. Provide a dictionary with the keys the current names and the values the new names to update the corresponding names.

The mapping is not restricted to fixed names; it can be a mapping function as well. For example, converting the column names to lowercase letters can be done using a function:

In [None]:
air_quality_renamed = air_quality_renamed.rename(columns=str.lower)

air_quality_renamed.head()

## 3.4 How to calculate summary statistics?
A statistic applied to multiple columns of a DataFrame (selecting two columns returns a DataFrame) is calculated for each numeric column.

In [None]:
import pandas as pd


In [None]:
titanic = pd.read_csv("./assets/titanic.csv")

titanic.head()

Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined using the DataFrame.agg() method:

In [None]:
titanic.agg(
    {
        "Age": ["min", "max", "median", "skew"],
        "Fare": ["min", "max", "median", "mean"],
    }
)

## 3.5 Aggregating statistics grouped by category
What is the average age for male versus female Titanic passengers?

In [None]:
titanic[["Sex", "Age"]].groupby("Sex").mean()

As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[["Sex", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category. The average age for each gender is calculated and returned.

Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method is used to support this type of operation. More generally, this fits the split-apply-combine pattern:

- Split the data into groups

- Apply a function to each group independently

- Combine the results into a data structure

The apply and combine steps are typically done together in pandas.
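The three steps can be written out by hand to see what groupby does internally. This is a sketch on a small made-up table, not how one would normally write it.

```python
import pandas as pd

people = pd.DataFrame(
    {"Sex": ["male", "female", "male", "female"], "Age": [22.0, 38.0, 35.0, 26.0]}
)

# Split: iterating over a groupby yields (key, sub-DataFrame) pairs
# Apply: compute the mean age of each sub-DataFrame
pieces = {key: group["Age"].mean() for key, group in people.groupby("Sex")}

# Combine: collect the per-group results into one Series
manual = pd.Series(pieces, name="Age")

# groupby followed by mean performs all three steps at once
direct = people.groupby("Sex")["Age"].mean()

print(manual)
print(direct)
```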

In the previous example, we explicitly selected the 2 columns first. If not, the mean method is applied to each column containing numerical data:

In [None]:
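# numeric_only=True skips text columns such as Name, which recent pandas versions refuse to average
titanic.groupby("Sex").mean(numeric_only=True)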

It does not make much sense to get the average value of the Pclass. If we are only interested in the average age for each gender, the selection of columns (rectangular brackets [] as usual) is supported on the grouped data as well:

In [None]:
titanic.groupby("Sex")["Age"].mean()

![gpby](./assets/06_groupby_select_detail.svg)

What is the mean ticket fare price for each of the sex and cabin class combinations?

In [None]:
titanic.groupby(["Sex", "Pclass"])["Fare"].mean()

Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby() method.

## 3.6 Count number of records by category
![cnrc](./assets/06_valuecounts.svg)

What is the number of passengers in each of the cabin classes?

In [None]:
titanic["Pclass"].value_counts()

The value_counts() method counts the number of records for each category in a column.

The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records within each group:

In [None]:
titanic.groupby("Pclass")["Pclass"].count()

Both size and count can be used in combination with groupby. Whereas size includes NaN values and just provides the number of rows (size of the table), count excludes the missing values. In the value_counts method, use the dropna argument to include or exclude the NaN values.
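The difference between size, count and the dropna argument only shows up when values are missing. A small sketch with made-up data:

```python
import numpy as np
import pandas as pd

cabins = pd.DataFrame(
    {
        "Pclass": [1, 1, 2, 3, 3, 3],
        "Age": [38.0, np.nan, 27.0, 22.0, np.nan, 4.0],
    }
)

grouped = cabins.groupby("Pclass")["Age"]
print(grouped.size())    # rows per group, NaN included: 2, 1, 3
print(grouped.count())   # non-missing values per group: 1, 1, 2

ages = cabins["Age"]
print(len(ages.value_counts()))              # 4 distinct ages, NaN dropped by default
print(len(ages.value_counts(dropna=False)))  # 5: NaN is counted as its own category
```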

## 3.7 How to reshape the layout of tables?

In [None]:
import pandas as pd

In [None]:
air_quality = pd.read_csv(
    "./assets/air_quality_long.csv", index_col="date.utc", parse_dates=True
)


air_quality.head()

### 3.7.1 Sort table rows

I want to sort the Titanic data according to the age of the passengers.

In [None]:
titanic.sort_values(by="Age").head()

I want to sort the Titanic data according to the cabin class and age in descending order.


In [None]:
titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()

With DataFrame.sort_values(), the rows in the table are sorted according to the defined column(s). The index will follow the row order.

### 3.7.2 Long to wide table format

Let’s use a small subset of the air quality data set. We focus on NO2 data and only use the first two measurements of each location (i.e. the head of each group). The subset of data will be called no2_subset

In [None]:
# filter for no2 data only
no2 = air_quality[air_quality["parameter"] == "no2"]

In [None]:
# use 2 measurements (head) for each location (groupby)
no2_subset = no2.sort_index().groupby(["location"]).head(2) # understand the split-apply-combine pattern

no2_subset

In [None]:
no2_subset.T # just for fun

I want the values for the three stations as separate columns next to each other

In [None]:
no2_subset.pivot(columns="location", values="value")

The pivot() function is purely reshaping of the data: a single value for each index/column combination is required.

As pandas supports plotting of multiple columns out of the box, the conversion from long to wide table format enables the plotting of the different time series at the same time:

In [None]:
no2.head()

In [None]:
no2.pivot(columns="location", values="value").plot()

When the index parameter is not defined, the existing index (row labels) is used.

### 3.7.3 Pivot table
![pivot_table](./assets/07_pivot_table.svg)

I want the mean concentrations for NO2 and PM2.5 in each of the stations in table form

In [None]:
air_quality.pivot_table(
    values="value", index="location", columns="parameter", aggfunc="mean"
)

In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case, the values on different time steps) pivot_table() can be used, providing an aggregation function (e.g. mean) on how to combine these values.

Pivot table is a well known concept in spreadsheet software. When interested in summary rows/columns (subtotals) for each variable as well, set the margins parameter to True:

In [None]:
air_quality.pivot_table(
    values="value",
    index="location",
    columns="parameter",
    aggfunc="mean",
    margins=True,
)

In case you are wondering, pivot_table() is indeed directly linked to groupby(). The same result can be derived by grouping on both parameter and location:

In [None]:
# select the numeric value column first; the text columns cannot be averaged
air_quality.groupby(["parameter", "location"])[["value"]].mean()

### 3.7.4 Wide to long format


In [None]:
no2

Starting again from the wide format table created in the previous section:

In [None]:
no2_pivoted = no2.pivot(columns="location", values="value").reset_index()

no2_pivoted.head()

I want to collect all air quality NO2 measurements in a single column (long format)

In [None]:
no_2 = no2_pivoted.melt(id_vars="date.utc")

no_2.head()

The pandas.melt() method on a DataFrame converts the data table from wide format to long format. The column headers become the variable names in a newly created column.

This is the short version of how to apply pandas.melt(). The method will melt all columns NOT mentioned in id_vars together into two columns: a column with the column header names and a column with the values themselves. The latter column gets the name value by default.

The pandas.melt() method can be defined in more detail:

In [None]:
no_2 = no2_pivoted.melt(
    id_vars="date.utc",
    value_vars=["BETR801", "FR04014", "London Westminster"],
    value_name="NO_2",
    var_name="id_location",
)


no_2.head()

The result is the same, but defined in more detail:

- value_vars defines explicitly which columns to melt together

- value_name provides a custom column name for the values column instead of the default column name value

- var_name provides a custom column name for the column collecting the column header names. Otherwise it takes the index name or a default variable

Hence, the arguments value_name and var_name are just user-defined names for the two generated columns. The columns to melt are defined by id_vars and value_vars.
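The effect of these arguments on the resulting column names can be seen in a self-contained sketch. The wide table below uses made-up values and only two station columns.

```python
import pandas as pd

# Made-up measurements in wide format, for illustration only
wide = pd.DataFrame(
    {
        "date.utc": ["2019-04-09 01:00", "2019-04-09 02:00"],
        "BETR801": [22.5, 53.5],
        "FR04014": [24.4, 27.4],
    }
)

long_default = wide.melt(id_vars="date.utc")
print(list(long_default.columns))  # ['date.utc', 'variable', 'value']

long_named = wide.melt(
    id_vars="date.utc",
    value_vars=["BETR801", "FR04014"],
    value_name="NO_2",
    var_name="id_location",
)
print(list(long_named.columns))    # ['date.utc', 'id_location', 'NO_2']
print(long_named.shape)            # (4, 3): 2 rows x 2 melted columns
```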

## 3.8 How to combine data from multiple tables?
![concat](./assets/08_concat_row.svg)

In [None]:
air_quality_no2 = pd.read_csv("./assets/air_quality_no2_long.csv",
                              parse_dates=True)


air_quality_no2 = air_quality_no2[["date.utc", "location",
                                   "parameter", "value"]]


air_quality_no2.head()

In [None]:
air_quality_pm25 = pd.read_csv("./assets/air_quality_pm25_long.csv",
                               parse_dates=True)


air_quality_pm25 = air_quality_pm25[["date.utc", "location",
                                     "parameter", "value"]]


air_quality_pm25.head()

### 3.8.1 Concatenating objects
I want to combine the measurements of NO2 and PM2.5, two tables with a similar structure, in a single table

In [None]:
air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)

air_quality.head()

The concat() function performs concatenation operations of multiple tables along one of the axes (row-wise or column-wise).

By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let’s check the shape of the original and the concatenated tables to verify the operation:

In [None]:
print('Shape of the ``air_quality_pm25`` table: ', air_quality_pm25.shape)

print('Shape of the ``air_quality_no2`` table: ', air_quality_no2.shape)

print('Shape of the resulting ``air_quality`` table: ', air_quality.shape)

Hence, the resulting table has 3178 = 1110 + 2068 rows.

The axis argument recurs in a number of pandas methods that can be applied along an axis. A DataFrame has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Most operations like concatenation or summary statistics are by default across rows (axis 0), but can be applied across columns as well.

Sorting the table on the datetime information illustrates also the combination of both tables, with the parameter column defining the origin of the table (either no2 from table air_quality_no2 or pm25 from table air_quality_pm25):

In [None]:
air_quality = air_quality.sort_values("date.utc")

air_quality.head()

In this specific example, the parameter column provided by the data ensures that each of the original tables can be identified. This is not always the case. The concat function provides a convenient solution with the keys argument, adding an additional (hierarchical) row index. For example:

In [None]:
air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])
air_quality_.head()

The existence of multiple row/column indices at the same time has not been mentioned within these tutorials. Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.

Multi-indexing is out of scope for this pandas introduction. For the moment, remember that the function reset_index can be used to convert any level of an index to a column, e.g. air_quality.reset_index(level=0)
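A small sketch of the keys argument and of reset_index on made-up tables; the pm25 and no2 frames below are stand-ins for the real data, not the tutorial files.

```python
import pandas as pd

pm25 = pd.DataFrame({"location": ["BETR801", "FR04014"], "value": [18.0, 12.0]})
no2 = pd.DataFrame(
    {"location": ["BETR801", "FR04014", "FR04014"], "value": [26.5, 20.0, 21.8]}
)

combined = pd.concat([pm25, no2], keys=["PM25", "NO2"])
print(combined.shape)          # (5, 2): 2 + 3 rows
print(combined.index.nlevels)  # 2: the key plus the original row label
print(combined.loc["NO2"])     # the rows that came from the no2 table

# Move the outer index level (the keys) into an ordinary column
flat = combined.reset_index(level=0)
print(list(flat.columns))      # ['level_0', 'location', 'value']
```

Because the key level has no name, the new column gets the default name level_0.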

### 3.8.2 Join tables using a common identifier
![MERGE](./assets/08_merge_left.svg)

In [None]:
stations_coord = pd.read_csv("./assets/air_quality_stations.csv")

stations_coord.head()

In [None]:
air_quality.head()

In [None]:
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")

air_quality.head()

Using the merge() function, for each of the rows in the air_quality table, the corresponding coordinates are added from the stations_coord table. Both tables have the column location in common which is used as a key to combine the information. By choosing the left join, only the locations available in the air_quality (left) table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The merge function supports multiple join options similar to database-style operations.

If the two tables have no common column name, on cannot be used. For example, the parameter column in the air_quality table and the id column in an air_quality_parameters_name table both provide the measured variable in a common format. The left_on and right_on arguments (instead of just on) then make the link between the two tables.
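A minimal sketch of a merge with left_on and right_on. Both tables are made up here, and the name column of the lookup table is an assumption for illustration.

```python
import pandas as pd

measurements = pd.DataFrame(
    {
        "location": ["FR04014", "BETR801", "FR04014"],
        "parameter": ["no2", "no2", "pm25"],
        "value": [20.0, 26.5, 12.0],
    }
)

# Hypothetical lookup table describing each measured parameter
parameters = pd.DataFrame(
    {"id": ["no2", "pm25"], "name": ["Nitrogen dioxide", "Particulate matter < 2.5 um"]}
)

merged = pd.merge(
    measurements, parameters, how="left", left_on="parameter", right_on="id"
)
print(merged)
```

Both key columns (parameter and id) are kept in the result, since they have different names.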

### 3.8.3 Database-style DataFrame or named Series joining/merging


```
pd.merge(
    left,
    right,
    how="inner",
    on=None,
    left_on=None,
    right_on=None,
    left_index=False,
    right_index=False,
    sort=False,
    suffixes=("_x", "_y"),
    copy=True,
    indicator=False,
    validate=None,
)
```

The return type will be the same as left. If left is a DataFrame or named Series and right is a subclass of DataFrame, the return type will still be DataFrame.

merge is a function in the pandas namespace, and it is also available as a DataFrame instance method merge(), with the calling DataFrame being implicitly considered the left object in the join.

The related join() method uses merge internally for the index-on-index (by default) and column(s)-on-index join. If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.

#### many-to-many joins: joining columns on columns.

It is worth spending some time understanding the result of the many-to-many join case. In SQL / standard relational algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian product of the associated data. Here is a very basic example with one unique key combination:

In [None]:
left = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key": ["K0", "K1", "K2", "K3"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


result = pd.merge(left, right, on="key")
result

![relation](./assets/merging_merge_on_key.png)
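In the example above every key is unique, so no Cartesian product is visible. The following sketch, with toy frames invented for illustration, repeats a key on both sides to show the many-to-many behaviour.

```python
import pandas as pd

# Toy frames with duplicated keys (made up for this illustration).
left = pd.DataFrame({"key": ["K0", "K0", "K1"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame(
    {"key": ["K0", "K0", "K0", "K1"], "C": ["C0", "C1", "C2", "C3"]}
)

result = pd.merge(left, right, on="key")

# K0 appears 2 times in left and 3 times in right: 2 * 3 = 6 rows.
# K1 appears once on each side: 1 * 1 = 1 row.
print(result)
print(len(result))
```

Every left row with a given key is paired with every right row with the same key, so duplicated keys can make the result much larger than either input.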

Here is a more complicated example with multiple join keys. Only the key combinations appearing in both left and right are present (the intersection), since how='inner' is the default.

In [None]:
left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)


right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)


result = pd.merge(left, right, on=["key1", "key2"])
result

![multiple](./assets/merging_merge_on_key_multiple.png)

The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a summary of the how options and their SQL equivalent names:

| Merge method | SQL Join Name      | Description                                         |
| ------------ | ------------------ | --------------------------------------------------- |
| `left`       | `LEFT OUTER JOIN`  | Use keys from left frame only                       |
| `right`      | `RIGHT OUTER JOIN` | Use keys from right frame only                      |
| `outer`      | `FULL OUTER JOIN`  | Use union of keys from both frames                  |
| `inner`      | `INNER JOIN`       | Use intersection of keys from both frames           |
| `cross`      | `CROSS JOIN`       | Create the cartesian product of rows of both frames |

In [None]:
result = pd.merge(left, right, how="left", on=["key1", "key2"])
result

![left join](./assets/merging_merge_on_key_left.png)

In [None]:
result = pd.merge(left, right, how="right", on=["key1", "key2"])
result

![right join](./assets/merging_merge_on_key_right.png)

In [None]:
result = pd.merge(left, right, how="outer", on=["key1", "key2"])
result

![](./assets/merging_merge_on_key_outer.png)
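To see which side each row of an outer join comes from, the indicator argument listed in the signature above can be set to True. The sketch below redefines the same left and right frames so that it runs on its own.

```python
import pandas as pd

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# indicator=True adds a _merge column with the values
# "left_only", "right_only" or "both" for each row.
flagged = pd.merge(left, right, how="outer", on=["key1", "key2"], indicator=True)
counts = flagged["_merge"].value_counts()
print(flagged)
print(counts)
```

Rows marked left_only or right_only are the ones whose columns from the other table are filled with NA.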

In [None]:
result = pd.merge(left, right, how="inner", on=["key1", "key2"])
result

![](./assets/merging_merge_on_key_inner.png)

In [None]:
result = pd.merge(left, right, how="cross")
result

![](./assets/merging_merge_cross.png)
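One quick way to compare the how options is to count the rows each one produces. The sketch below does this for the same left and right frames, redefined here so that it runs on its own. It assumes a pandas version that supports how="cross" (1.2 or later).

```python
import pandas as pd

left = pd.DataFrame(
    {
        "key1": ["K0", "K0", "K1", "K2"],
        "key2": ["K0", "K1", "K0", "K1"],
        "A": ["A0", "A1", "A2", "A3"],
        "B": ["B0", "B1", "B2", "B3"],
    }
)
right = pd.DataFrame(
    {
        "key1": ["K0", "K1", "K1", "K2"],
        "key2": ["K0", "K0", "K0", "K0"],
        "C": ["C0", "C1", "C2", "C3"],
        "D": ["D0", "D1", "D2", "D3"],
    }
)

# Count the rows produced by each key-based join type.
row_counts = {}
for how in ["left", "right", "outer", "inner"]:
    row_counts[how] = len(pd.merge(left, right, how=how, on=["key1", "key2"]))

# A cross join takes no keys: every left row is paired with every right row.
row_counts["cross"] = len(pd.merge(left, right, how="cross"))

print(row_counts)
# {'left': 5, 'right': 4, 'outer': 6, 'inner': 3, 'cross': 16}
```

The key combination (K1, K0) appears once in left and twice in right, which is why the left join returns 5 rows rather than 4.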

---
# 4. Problems with conda sources in China
---

---
# 5. Pandas Exercise (see materials posted on Canvas)

Let's play a game!

---