# Introduction to Python, Pandas and Seaborn

# Python basics

## Variables and simple output

A **variable** is a container in which we _store_ a piece of data, and to which we give a name :

In [71]:
number1 = 42
string2 = "hello"

The above code creates two variables : `number1` and `string2`.
As we see, `number1` stores the number *42*, and `string2` stores a sequence of characters (or *string*) representing the word _"hello"_.

The easiest method to **output** what is stored in a variable is to use the function `print` :

In [72]:
print(number1)


42


In [73]:
print(string2)

hello


The function `print` is a little bit more powerful than that, as it can output text directly and several values at once :

In [74]:
print("Hello, World!")

Hello, World!


In [75]:
print("variable 'number1' contains :", number1)

variable 'number1' contains : 42


In [76]:
print(number1, "°C")

42 °C


#### Exercise :

Create 3 variables : the first one, `var1`, will contain the number $3,14159$, the second one, `text`, will contain the string *python*, and the third one, `not0`, will contain the number *-20* :

In [77]:
var1 = 3.14159
text = "python"
not0 = -20


(Hints : a decimal number uses a dot `.` (not a comma !), a string is delimited by single quotes `'` or double quotes `"`, and putting a minus sign `-` in front of a number makes it negative.)

## Simple operations

It is as easy to perform operations on numbers as on your pocket calculator :

In [78]:
print("1 + 1 = ", 1 + 1)

1 + 1 =  2


If the data stored in a variable is a number, it behaves exactly as a number :

In [79]:
a = 42
print(a / 2)

21.0


Note : Here we see that several instructions can be put in a row. They will be executed one after the other, in the order they appear.
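
The `21.0` above deserves a remark : in Python 3, the operator `/` always produces a decimal number (a *float*), even when the division is exact. The sketch below, whose variable names are made up for the illustration, contrasts it with the integer division `//` and the remainder `%` :

```python
# Illustrative values; the variable names are assumptions made for this sketch.
total = 42

half = total / 2     # true division: always gives a float
whole = total // 5   # integer (floor) division: gives an int here
rest = total % 5     # remainder of the integer division

print(half)   # 21.0
print(whole)  # 8
print(rest)   # 2
```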

#### Exercise :

Do the multiplication `6 * 9`. Does it result in 42 ?

In [80]:
42 == (6*9)


False

What happens if you divide a number (say, 42) by zero ?

In [81]:
42/0

ZeroDivisionError: division by zero

(Hint : if your error says `ZeroDivisionError: division by zero`, then you are correct.)

## Lists and `for` loops

What if we want to store a large set of **consecutive** results, so that we can `print` them later ? Well...

In [None]:
result1 = 50
result2 = -2
result3 = 64
result4 = 100
# and so on

...is already boring to write, and...

In [None]:
print(result1)
print(result2)
print(result3)
print(result4)
# and so on

50
-2
64
100


...is frankly _tedious_ .

A **list** is a data structure that allows us to store related elements **in one variable**. The point is to keep these elements **ordered**.

In [None]:
results = []
results.append(50)
results.append(-2)
results.append(64)
results.append(100)
# still tedious right now

A `for` loop is a construct that **iterates** through a _collection_ : a loop variable (here `elem`) successively takes each value of the collection. Observe this :

In [None]:
for elem in results:
    print(elem)
# that was way shorter to write :)

50
-2
64
100


The above code is going through each element stored in the variable `results`, and prints its content.

### Inline definition

Compare the above definition of `results`, with this one :

In [None]:
results = [50, -2, 64, 100]

Both lists actually store the **same** elements, but one is much shorter to write !
And that way, the whole process is shorter, when outputting a long sequence of data :

In [None]:
for elem in results:
    print(elem)

50
-2
64
100


(For your information : _when should you use one notation rather than the other ?_ If you **know** in advance the data you want to store in a list, use the inline notation ; if you don't, use the `append` method.)
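
As a small sketch of the second case, here is a list whose content is computed while the code runs, so it is filled with `append`. The names `squares` and `n` are made up for this illustration :

```python
# The values are not written by hand: they are computed one after the other.
squares = []
for n in [1, 2, 3, 4]:
    squares.append(n * n)

print(squares)  # [1, 4, 9, 16]

# When the values are known in advance, the inline notation gives the same list.
print(squares == [1, 4, 9, 16])  # True
```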

#### Exercise :

You are given a list of temperatures for several days, which goes as follows (in °C) :
* day 1 : 5
* day 2 : -2
* day 3 : 0
* day 4 : 1
* day 5 : -5
* day 6 : 3

Use both methods to create a list `temperatures` to store the six temperatures, and use a `for` loop to print these temperatures with a _"°C"_ after the number :

In [None]:
temperatures = [5, -2, 0, 1, -5, 3]
for temp in temperatures:
    print(temp, "°C")

5 °C
-2 °C
0 °C
1 °C
-5 °C
3 °C


(Hint : look above, where the `print` function is used, for the Celsius part.)

## Blocks, indentation and instructions

### Recap :

Observe the following pieces of code :

In [None]:
for elem in [1, 2]:
    print(elem)

1
2


In [None]:
for _ in [1, 2]:
    print("hello")

hello
hello


In the first piece of code, the list $[1, 2]$ is **iterated through**, and we use `print` on each element `elem`. In the second piece of code, the same list is iterated through, but we just `print` _"hello"_.

### Novelty :

In [None]:
for elem in [1, 2]:
    print(elem)
    print("hello")

1
hello
2
hello


In [None]:
for elem in [1, 2]:
    print(elem)
print("hello")

1
2
hello


These pieces of code contain some novelty : in both codes we have a `for` loop and two **instructions** (here : `print`), but the output seems completely different ! Why ?

It is due to **indentation**, which is the number of spaces before each instruction. In the first code, both have the same number, so they are considered as being part of the same **block** of instructions, whereas in the second, they are part of different blocks of instructions.

The rule is the following : if instructions are part of the same block, they are executed consecutively.

This is why, in the first piece of code, the word _hello_ is printed after each element of the list, whereas in the second piece of code, the second `print` is part of another block and is therefore only executed **after** the `for` loop.

### Nested `for` loops :

#### Exercise :

Explain the behaviour of this code :

In [None]:
for elem in [1, 2]:
    print(elem)
    for _ in [1, 2]:
        print("hello")

1
hello
hello
2
hello
hello


## Conditions

What if we want to output _"hello"_ only if our list `results` contains a certain value, say $64$ ?

This is where an `if` block is used :

In [None]:
if 64 in results:
    print("hello")

hello


And what if we also want to output _"not hello"_ if our list doesn't contain it ?

In [None]:
if 64 in results:
    print("hello")
else:
    print("not hello")

hello


What can be observed here is that `if` and `else` each create a **block** of instructions. So for example, if we want to also append the number $42$ to the `results` list if it isn't already part of it, we can write the following :

In [None]:
if 42 not in results:
    print("not hello")
    results.append(42)

not hello


#### Exercise :

Create a list `numbers` containing the numbers $2$, $4$, $8$, $16$, $32$ and $128$ :

In [None]:
numbers = [2, 4, 8, 16, 32, 128]


Now create a piece of code where you check if the number $64$ is included in `numbers`, and :
* if it is, output _64 is included_
* otherwise, append it to the list, and output _64 is now included_

In [None]:
if 64 in numbers:
    print("64 is included")
else:
    numbers.append(64)
    print("64 is now included")

64 is now included


Repeat here your previous code :

In [None]:
if 64 in numbers:
    print("64 is included")
else:
    numbers.append(64)
    print("64 is now included")


64 is included


What do you observe ?

### Boolean expressions

The above condition is of **boolean** nature : $64$ (or $42$) is **either** _in_ the list or _not in_ the list.

Similarly, we might want to test if the value of an element is bigger or smaller than another value. Observe this :

In [None]:
if 42 < 64:
    print("42 is smaller than 64")

42 is smaller than 64


In [None]:
if 42 <= 42:
    print("42 is smaller or equal to 42")

42 is smaller or equal to 42


In [None]:
if 42 > 64:
    print("42 is greater than 64")  # never printed : the condition is False

In [None]:
if 42 >= 42:
    print("42 is greater or equal to 42")

42 is greater or equal to 42


In [None]:
if 4 != 42:
    print("4 is different from 42")

4 is different from 42


**Important note :** We haven't considered checking if a number is the same as another number yet. This is because you should remember to use **the correct operator**, which is `==`. If you use a single `=`, it would mean that you're trying to store something into a variable, which is fundamentally different !

Observe :

In [None]:
if "hello" == "hello":
    print("both strings are equal")

both strings are equal


### Combining boolean expressions

Sometimes, we want to test several things at the same time ; this is where the keywords `and` and `or` come in.

Observe :

In [None]:
random_number = 4

In [None]:
if (random_number < 100) and (random_number not in results):
    results.append(random_number)
    print("appended the random number")

appended the random number


In [None]:
if (9 in results) or (random_number in results):
    print("9 is not included, but the random number is")

9 is not included, but the random number is


Note : The parentheses around the boolean expressions are not mandatory here, but they are recommended if they make your code more readable.
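
To see what `and` and `or` actually evaluate to, the following sketch prints the result of a few combinations directly. The variables `n` and `values` are assumptions made for the example, and are independent from the lists used above :

```python
n = 4
values = [50, -2, 64, 100]

# 'and' is True only when both sides are True
both = (n < 100) and (n not in values)
# 'or' is True as soon as one side is True
either = (9 in values) or (64 in values)
# 'not' inverts a boolean value
neither = not either

print(both)     # True
print(either)   # True
print(neither)  # False
```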

### Nesting blocks

All of these block types can be nested. For example, this is one way to remove the duplicates from a list :

In [None]:
numbers = [5, 2, 4, 5, 6, 4, -20, 2]

In [None]:
uniques = []

for n in numbers:
    if n not in uniques:
        uniques.append(n)

In [None]:
print(numbers)
print(uniques)

[5, 2, 4, 5, 6, 4, -20, 2]
[5, 2, 4, 6, -20]


Similarly, an `if` block can contain another `if` block :

In [None]:
if random_number < 10:
    if random_number > 0:
        print("the random number is smaller than 10 and greater than 0")

the random number is smaller than 10 and greater than 0


#### Exercise :

Print the elements of the above `uniques` list, but only when they are positive and their double is smaller than 10 :

In [82]:
for u in uniques:
    if u > 0 and u*2 < 10:
        print(u)


2
4


(Hint : Nest an `if` block in a `for` loop.)

#### Exercise :

Try to rewrite the following conditions as nested conditions :

In [None]:
if random_number > 0 and (random_number < 10 and random_number - 2 > 0):
    print("success ?")

success ?


In [None]:
if random_number > 0 and (random_number * 4 < 10 or random_number * 2 < 10):
    print("success ?")

success ?


## Index in a list

A list is an **ordered** collection of elements : there is a _first_ element, then a _second_ element, then a _third_ element, and so on. That means that we can access the stored data if we know its **index** in the list :

In [None]:
print(numbers)

[5, 2, 4, 5, 6, 4, -20, 2]


In [None]:
print(" first element :", numbers[0])
print("second element :", numbers[1])
print("fourth element :", numbers[3])
print("  last element :", numbers[-1])

 first element : 5
second element : 2
fourth element : 5
  last element : 2


**Important note** : The index of a list starts at $0$, not at 1.

In the same way, we can also modify the data stored at a certain index :

In [None]:
numbers[3] = 42

In [None]:
print(numbers)

[5, 2, 4, 42, 6, 4, -20, 2]


Notice the change, compared to above : the fourth element went from being $5$ to being $42$.
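
A short sketch of how positive and negative indices relate, using a throwaway list `letters` (an assumption for this example) so that `numbers` is left untouched :

```python
letters = ["a", "b", "c", "d"]

# Positive indices count from the start, beginning at 0.
print(letters[0])   # a

# Negative indices count from the end: -1 is the last element.
print(letters[-1])  # d
print(letters[-2])  # c

# Assigning to an index replaces the element stored there.
letters[1] = "B"
print(letters)      # ['a', 'B', 'c', 'd']
```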

#### Exercise :

Using the above `numbers` list, output the content of the third element ($4$) and the second last element ($-20$) :

Now, modify the list so that $42$ becomes $16$, and $-20$ becomes $0$ :

Finally, try to access the tenth element of the list :

(Hint : if your error says `IndexError: list index out of range`, then you are correct.)

## Dictionaries

Let's say we want to store people's name and age. With lists, we would do this :

In [None]:
names = ["Pekka", "Francesca", "Hans", "Isabelle", "Stijn", "Marina"]
ages  = [73,      18,          93,     28,         36,      6]

Now let's assume we want to find how old _Isabelle_ is. We would have to find the index where her name is stored, and then use it to find her age :

In [None]:
print(names[3])
print("Isabelle is", ages[3], "years old.")

Isabelle
Isabelle is 28 years old.


This is quite an annoying thing to do. Let's try something different :

In [None]:
ages = { "Pekka": 73, "Francesca": 18, "Hans": 93, "Isabelle": 28, "Stijn": 36, "Marina": 6 }

This data structure is called a **dictionary** : it associates a **key** (here the name) to a **value** (here the age). Let's find out how old _Pekka_ is :

In [None]:
print("Pekka is", ages["Pekka"], "years old.")

Pekka is 73 years old.


One advantage of dictionaries is that we can add data on the fly :

In [None]:
ages["Ueli"] = 45
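
The sketch below shows this on a small, separate dictionary (`stock` is a made-up name), together with two operations that come up often : testing whether a key exists with `in`, and looping over the keys :

```python
stock = {"apples": 3, "pears": 0}

# Adding a new key, and replacing the value of an existing one.
stock["plums"] = 12
stock["pears"] = 5

# 'in' checks the keys of a dictionary, not its values.
print("plums" in stock)     # True
print("cherries" in stock)  # False

# A for loop over a dictionary goes through its keys, in insertion order.
for fruit in stock:
    print(fruit, ":", stock[fruit])
```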

## Functions

When writing code, we often need to perform the same routine several times.

For example, obtaining the length of the `results` list can be done this way :

In [None]:
length = 0

for _ in results:
    length = length + 1

print("length of 'results' =", length)

length of 'results' = 6


And to obtain the length of the `numbers` list :

In [None]:
length = 0

for _ in numbers:
    length = length + 1

print("length of 'numbers' =", length)

length of 'numbers' = 8


It is obvious that hardly anything changes between those two pieces of code (and they actually have been copy-pasted !). Wouldn't it be nice if we didn't have to copy-paste it ?

By defining a **function** (here called `length_of`), we can do this :

In [None]:
def length_of(xs):
    length = 0
    
    for _ in xs:
        length = length + 1
    
    return length

In [None]:
print("length of 'results' =", length_of(results))
print("length of 'numbers' =", length_of(numbers))

length of 'results' = 6
length of 'numbers' = 8


There are a few new elements, here :
* The keyword `def` indicates the definition of a function.
* The definition of a function is a **block** of instructions, in which blocks can be nested (for example, here the `for` loop).
* A function takes a certain number of **parameters**, which are the names written in the parentheses ; there can be any number of them, including none at all.
* The function is **called** by writing its name, followed by parentheses containing the values we want to give to its parameters.
* A call to the function evaluates to the value that follows the keyword `return`.
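
The last point is easy to overlook, so here is a minimal sketch contrasting a function that returns a value with one that only prints. Both function names are invented for the illustration :

```python
def double(x):
    # The call 'double(21)' evaluates to the returned value.
    return x * 2

def show_double(x):
    # This function prints, but has no 'return': a call evaluates to None.
    print(x * 2)

result = double(21)
print(result + 1)          # 43

nothing = show_double(21)  # prints 42
print(nothing)             # None
```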

### More on parameters

Let's illustrate a function without any parameter, and one with a single parameter :

In [None]:
def say_hello():
    print("hello")

def say(words):
    print(words)

In [None]:
say_hello()

say("hello")
say(words = "good bye")

hello
hello
good bye


Observe how the parameter name can be omitted or given explicitly when calling the function.

Now, let's illustrate a function with two parameters :

In [None]:
def subtract(x, y):
    return x - y

In [None]:
print(subtract(64, 32))
print(subtract(32, 64))

print(subtract(x = 64, y = 32))
print(subtract(y = 32, x = 64))

print(subtract(64, y = 32))

32
-32
32
32
32


The first two calls to `subtract` show the importance of the **order** in which the parameters are provided. The next two calls show a trick to bypass this order constraint : provide the parameter names. Finally, the last call shows that both styles can be mixed, as long as the unnamed values come first.

Finally, let's illustrate a function that has a **default** value :

In [None]:
def print_error(msg, prefix = "error"):
    print(prefix, ":", msg)

In [None]:
print_error("you made a mistake")
print_error(msg = "you made a mistake")

print_error("write code, not poems", "usage")
print_error(prefix = "usage", msg = "write code, not poems")

error : you made a mistake
error : you made a mistake
usage : write code, not poems
usage : write code, not poems


Here we see that if a default value is provided for a parameter, we can leave it unspecified when calling the function if we want. Otherwise, it behaves the same way as usual.

#### Exercise :

Write a function `add` that returns the sum of three numbers it takes as parameters :

## Some useful functions

Python comes with some useful functions that allow you to do a few useful tricks. Here are a few of them :
* the `len` function returns the number of elements in a collection, such as a list or a dictionary :

In [None]:
print(len(names))
print(len(ages))

6
7


* the `range` function creates an iterable collection of numbers, starting at $0$ and stopping just before the one specified :

In [None]:
for i in range(10):
    print(i)

0
1
2
3
4
5
6
7
8
9


These two functions combined allow us to do some interesting things. Observe :

In [None]:
for n in numbers:
    print(n)

5
2
4
42
6
4
-20
2


In [None]:
for i in range(len(numbers)):
    print(numbers[i])

5
2
4
42
6
4
-20
2


* the `sum` function adds the various elements of an iterable collection :

In [None]:
print(sum(numbers))

45


* the `sorted` function, which returns a sorted version of an iterable collection :

In [None]:
print(sorted(numbers))

[-20, 2, 2, 4, 4, 5, 6, 42]


* the `int`, `float` and `list` functions, that convert data into other types of data :

In [None]:
print(int(-273.15))
print(float("-273.15"))

print(list(range(10)))

-273
-273.15
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]


As we see, using the `int` function on a float removes its decimal part.
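
Note that removing the decimal part is not the same as rounding. A quick sketch of the difference, using the built-in `round` function (not covered above) for comparison :

```python
# int() drops the decimal part, moving towards zero.
print(int(2.9))     # 2
print(int(-2.9))    # -2

# round() goes to the nearest whole number instead.
print(round(2.9))   # 3
print(round(-2.9))  # -3

# A string holding a decimal number must go through float() first.
print(int(float("-273.15")))  # -273
```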

#### Exercise :

Use the `len` function on the `numbers` variable, and store it in a variable `n` :

In [None]:
n = 


Use the `sorted` function on the `numbers` variable, and store it in a list `l` :

In [None]:
l = 

Now, print the median of the numbers by using `int(n/2)` as index for `l` :

## Objects

Not everything in Python is numbers. Most of the data you will encounter are actually **objects**.

The short and easy way to explain object-oriented programming is this : objects have **properties** and **methods**. You access them by putting a dot `.` between the variable in which the object is stored and the name of the method (or property) :

In [None]:
print("'numbers' contains", numbers.count(4), "times the number 4")
print("The index of Isabelle in 'names' is", names.index("Isabelle"))

**Spoiler :** Lists are objects, and `append` is one of their methods ! Dictionaries and strings are objects too.
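
A few more method calls, as a sketch, on a string and on a dictionary. The variables `word` and `prices` are assumptions made for the example ; the methods themselves (`upper`, `count`, `keys`, `values`) are standard Python :

```python
word = "hello"
prices = {"tea": 2, "coffee": 3}

# Methods of a string object
print(word.upper())     # HELLO
print(word.count("l"))  # 2

# Methods of a dictionary object
print(list(prices.keys()))    # ['tea', 'coffee']
print(list(prices.values()))  # [2, 3]
```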

## Modules

In the same way that we use functions as a way to reduce the amount of repeated code, we use **modules** as a way to reduce the number of times we rewrite functions across files and projects. More importantly, it also allows us to share code with other people in a cleaner way.

In order to use things defined in a module, we use the **import** keyword :

In [None]:
import numpy as np

What we see here additionally is that we can rename a module using the **as** keyword. From now on, we can use objects and functions from the _NumPy_ project in the following way :

In [None]:
print(np.array(numbers))
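
The same mechanism works for the modules shipped with Python itself. As a sketch, here is the standard `math` module imported in two ways ; the alias `m` is an arbitrary choice for this example :

```python
import math as m       # import the whole module under a shorter name
from math import sqrt  # import a single function from the module

print(m.pi)            # 3.141592653589793
print(m.floor(2.7))    # 2
print(sqrt(16))        # 4.0
```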

#### Exercise :

Import the _pandas_ module as _pd_ :

# NumPy and PyPlot basics

**_NumPy_** contains a lot of things, and not all of them are useful for us here. We will restrict ourselves therefore to one data structure, which is the `np.array`. It is an object that looks like our lists from before, except it can do more things.

In [None]:
import numpy as np


## Creating an `np.array`

As a starter, there are several ways to create an `np.array` :
* From a regular `list` :

In [None]:
np.array([1, 2, 4, 8, 16, 32])

array([ 1,  2,  4,  8, 16, 32])

* As an array of zeroes :

In [None]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

## Multiple dimensions

One nice thing with `np.array` is that they can have multiple dimensions. Let's create two of them :

In [None]:
np.array([[1, 2], [3, 4]])

In [None]:
np.zeros((3, 4))

The way we can access the various dimensions is by using the `shape` property :

In [None]:
array = np.array([[1, 2, 3], [4, 5, 6]])

print(array.shape)
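
As a sketch of how to read that property : `shape` holds one entry per dimension, here the number of rows followed by the number of columns. The variable name `grid` is an assumption, chosen so as not to overwrite `array` :

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.shape)     # (2, 3): 2 rows, 3 columns
print(grid.shape[0])  # 2
print(grid.ndim)      # 2, the number of dimensions
print(grid.size)      # 6, the total number of elements
```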

## Accessing elements and modifying an `np.array`

The dimensions of an `np.array` are fixed, so unlike a `list` object, we cannot append elements. On the other hand, we can access and modify elements much more easily :

In [None]:
array = np.array(numbers)

print(numbers[5])
print(array[5])

Especially for multidimensional arrays, we can now use the following notation :

In [None]:
array = np.zeros((3, 4))

array[2, 1] = 1

In [None]:
print(array)

We can obtain the values according to the first dimension this way :

In [None]:
array[2]

Or the second dimension that way :

In [None]:
array[:, 1]

This will set all the values of the second row to $2$ :

In [None]:
array[1] = 2

print(array)

And this will set the third column to $-1$ :

In [None]:
array[:, 2] = -1

print(array)

#### Exercise :

Create an array called `ys`, from our `numbers` list :

In [None]:
ys = 

## Operations on `np.array`

Observe the result of the following operations :

In [None]:
array + 1

In [None]:
array * 2

In [None]:
array * array

What happens when we use another array ? Let's see :

In [None]:
array + np.array([1, 2, 3, 4])

In [None]:
array * np.array([1, 2, 3, 4])
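
What happens here is called *broadcasting* : since the one-dimensional array has as many elements as `array` has columns, it is combined with every row. The sketch below reproduces this on a small example with made-up names (`table`, `row`) :

```python
import numpy as np

table = np.zeros((3, 4))      # 3 rows, 4 columns, filled with 0
row = np.array([1, 2, 3, 4])  # 4 elements, one per column

# 'row' is combined with each of the 3 rows of 'table'.
added = table + row
print(added)

# The multiplication follows the same rule, column by column.
scaled = (table + 1) * row
print(scaled)
```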

## Operations on dimensions

The following methods might be useful during this class :
* The **transpose** of an array swaps its axes : rows become columns, and columns become rows. Compare :

In [None]:
print(array)

In [None]:
print(array.T)

* The `reshape` method returns an array with the same elements arranged in a different shape (the total number of elements must stay the same). Compare :

In [None]:
np.array(numbers)

In [None]:
np.array(numbers).reshape((2, 4))

#### Exercise :

From the previous exercise's `ys` variable, use its `reshape` method to create a 4x2 array :

## Mathematical operations

_NumPy_ comes with several common mathematical operations, such as `np.exp`, `np.log`, `np.sqrt` and `np.power` :

In [None]:
np.exp(array)

In [None]:
np.log(array)

In [None]:
np.sqrt(array)

In [None]:
np.power(array, 3)

#### Exercise :

Use `np.exp` on `array` and store it in a variable `x` :

In [None]:
x = 

Now use `np.log` on `x` :

What do you observe ?

## PyPlot basics

_**PyPlot**_ is a gigantic module. We will restrict ourselves here to only a few types of plots.

In [None]:
import matplotlib.pyplot as plt

The main idea is that you can create the canvas of a figure as an `Axes` object, and then either call its **plotting methods** or set its **axes properties**.

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

### Plotting lines

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.plot(numbers)

# changing the axes properties
axes.set_xlabel("index")
axes.set_ylabel("number")
axes.set_title("numbers");

It is possible to plot several lines on one plot by simply calling a plotting method again :

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.plot(numbers)
axes.plot(uniques)

# changing the axes properties
axes.set_xlabel("index")
axes.set_ylabel("number")
axes.set_title("numbers");

### Plotting bars

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.bar(ages.keys(), ages.values())

# changing the axes properties
axes.set_xlabel("name")
axes.set_ylabel("age")
axes.set_title("people and their age");

### Scatter plot

In [None]:
xs = [1, 3,  5, 7, 2, 4,  6, 8]
ys = [5, 0, -2, 8, 2, 4, -1, 0]

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.scatter(xs, ys)

# changing the axes properties
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_title("numbers");

### Histogram

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.hist(numbers)

# changing the axes properties
axes.set_xlabel("number")
axes.set_ylabel("occurrences")
axes.set_title("numbers histogram");

The number of bins can be changed by setting the `bins` parameter :

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.hist(numbers, bins = 100)

# changing the axes properties
axes.set_xlabel("number")
axes.set_ylabel("occurrences")
axes.set_title("numbers histogram");

If plotting one histogram over another, it is a good idea to set the `alpha` parameter to a value between $0$ and $1$ :

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.hist(numbers, alpha = 0.4)
axes.hist(numbers, bins = 100, alpha = 0.4)

# changing the axes properties
axes.set_xlabel("number")
axes.set_ylabel("occurrences")
axes.set_title("numbers histogram");

### Multiple plots per figure

The `plt.subplots` function can create more than just one plot per figure. This means that there will be more than just one `Axes` object. Observe :

In [None]:
_, axes = plt.subplots(nrows = 2, figsize = (16, 10))

# first plot
axes[0].plot(numbers)

axes[0].set_xlabel("index")
axes[0].set_ylabel("number")
axes[0].set_title("numbers")

# second plot
axes[1].bar(ages.keys(), ages.values())

axes[1].set_xlabel("name")
axes[1].set_ylabel("age")
axes[1].set_title("people and their age");

In [None]:
_, axes = plt.subplots(ncols = 2, figsize = (16, 10))

# first plot
axes[0].plot(numbers)

axes[0].set_xlabel("index")
axes[0].set_ylabel("number")
axes[0].set_title("numbers")

# second plot
axes[1].bar(ages.keys(), ages.values())

axes[1].set_xlabel("name")
axes[1].set_ylabel("age")
axes[1].set_title("people and their age");

In [None]:
_, axes = plt.subplots(nrows = 2, ncols = 2, figsize = (16, 10))

# top-left plot
axes[0, 0].plot(numbers)

axes[0, 0].set_xlabel("index")
axes[0, 0].set_ylabel("number")
axes[0, 0].set_title("numbers")

# top-right plot
axes[0, 1].bar(ages.keys(), ages.values())

axes[0, 1].set_xlabel("name")
axes[0, 1].set_ylabel("age")
axes[0, 1].set_title("people and their age")

# bottom-left plot
axes[1, 0].bar(ages.keys(), ages.values())

axes[1, 0].set_xlabel("name")
axes[1, 0].set_ylabel("age")
axes[1, 0].set_title("people and their age")

# bottom-right plot
axes[1, 1].plot(numbers)

axes[1, 1].set_xlabel("index")
axes[1, 1].set_ylabel("number")
axes[1, 1].set_title("numbers");

### Heatmaps

In [None]:
_, axes = plt.subplots(figsize = (16, 10))

# calling a plotting method
axes.imshow(array)

# changing the axes properties
axes.set_xlabel("x")
axes.set_ylabel("y")
axes.set_title("the content of 'array'");

# Pandas basics

The `pandas` library contains one main structure that is going to interest us : `DataFrame`. It is basically a table with column names.

In [None]:
import pandas as pd

## Creating a `pd.DataFrame`

As with `np.array`, there are several ways to create a `pd.DataFrame` :
* From an existing `np.array` :

In [None]:
pd.DataFrame(array)


* From a **dictionary** :

In [None]:
pd.DataFrame({
    "name": list(ages.keys()),
    "age":  list(ages.values())
})


* From scratch, filling in the columns one after the other :

In [None]:
df = pd.DataFrame()

df["name"] = list(ages.keys())
df["age"]  = list(ages.values())

In [None]:
df

## Renaming the columns and the index

A `pd.DataFrame` has row names and column names. The names of the rows are usually not important, but the names of the columns should always be descriptive of the data they contain.

Therefore, we can rename both rows and columns.

In [None]:
people = pd.DataFrame(list(ages.values()))

people.columns = ["age"]
people.index = list(ages.keys())

In [None]:
people

#### Exercise :

Create a `pd.DataFrame` from the `numbers` variable, and name its only column _number_ :

## Accessing and modifying data

It is very easy to create a new column :

In [None]:
people["height"] = [178, 167, 173, 172, 189, 61, 192]
people["grade"]  = [  5,   0,   4,   2,   3,  0,   5]

In [None]:
people

It is also easy to obtain the values of a column :

In [None]:
people["height"]

And we can obtain the value of several columns by indicating their name as a list of values :

In [None]:
people[["age", "grade"]]

One common problem when accessing data is that we might want to filter. For example, we might want only the people who failed :

In [None]:
people[people["grade"] == 0]

Or we want to have only those who failed **and** are old enough to go to university :

In [None]:
people[(people["grade"] == 0) & (people["age"] >= 18)]

Or maybe we want to have the people who are young **or** old :

In [None]:
people[(people["age"] <= 20) | (people["age"] >= 50)]

What if we want to get only the grade, from those people ? Well, we combine the above access :

In [None]:
people[(people["age"] <= 20) | (people["age"] >= 50)]["grade"]
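
It helps to look at what is written inside the brackets : a comparison on a column gives one `True`/`False` value per row, and the data frame keeps the rows marked `True`. A minimal sketch on a separate, made-up data frame `pets` :

```python
import pandas as pd

pets = pd.DataFrame({
    "name":   ["Rex", "Tom", "Jerry"],
    "age":    [7, 3, 1],
    "weight": [30.0, 4.5, 0.1]
})

# The condition alone is a column of booleans.
mask = pets["age"] >= 3
print(mask)

# Using it as a filter keeps the rows where the mask is True.
print(pets[mask])

# Conditions are combined with & (and) and | (or); the parentheses are required.
light_adults = pets[(pets["age"] >= 3) & (pets["weight"] < 10)]["name"]
print(light_adults)
```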

#### Exercise :

Display the height of people who are between the age of 20 and 50 :

### Fancier access

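The `loc` indexer allows us to access rows and columns at once : the first entry selects the rows (by name or by condition) and the second one selects the columns. Here we view the grade of the two people who failed :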

In [None]:
people.loc[people["grade"] == 0, "grade"]

Now, what if we want to modify the data so that _Francesca_ and _Marina_ get a bare pass instead of failing ? It's easy :

In [None]:
people.loc[people["grade"] == 0, "grade"] = 1

In [None]:
people

You can access the contents of a data frame by position (similar to a `np.array`) using the `iloc` indexer.
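
As a sketch of what position-based access looks like in practice, here is `iloc` next to `loc` on a made-up data frame `scores` (positions start at $0$, as with lists) :

```python
import pandas as pd

scores = pd.DataFrame(
    {"maths": [4, 5, 3], "physics": [6, 2, 5]},
    index = ["Ada", "Bob", "Cleo"]
)

# Row by position: the first row, whatever its name is.
print(scores.iloc[0])

# Single value: second row, first column.
print(scores.iloc[1, 0])           # 5

# The same value by names, using loc.
print(scores.loc["Bob", "maths"])  # 5

# All rows of the second column.
print(scores.iloc[:, 1])
```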

### Transpose

Exactly as with `np.array`, we can take the **transpose** of a `pd.DataFrame` :

In [None]:
people.T

## Missing data

Very often when working with data, parts of it can be missing. For example, we could have no idea how tall _Hans_ is. Or no idea what grade a _fail_ is supposed to be.

These values are usually represented as `np.nan`. Let's change them in our dataframe :

In [None]:
people.loc["Hans", "height"] = np.nan

people.loc[["Francesca", "Marina"], "grade"] = np.nan

In [None]:
people

How do we know which values are `NaN` ? We can use the `isna` method :

In [None]:
people[people["grade"].isna()]

Or maybe we might want to work only with data that is complete. Then we use `dropna` :

In [None]:
people.dropna()

### Filling missing values

Another option is to replace the missing values with a chosen number using `fillna` :

In [None]:
people.fillna(0)

In some cases, instead of filling in one unique number, we might try to take the number from the previous entry.

Let's create a time series :

In [None]:
temperatures = {
    "january":  -2,
    "february":  1,
    "march":     5,
    "april":     3,
    "may":       np.nan,
    "june":      15,
    "july":      23,
    "august":    29,
    "september": 15,
    "october":   np.nan,
    "november":  5,
    "december":  2
}

In [None]:
temperatures = pd.DataFrame(
    data    = list(temperatures.values()),
    columns = ["year 1"],
    index   = list(temperatures.keys())
)

In [None]:
temperatures.T

It can make more sense to take the previous value in the series. Let's do it :

In [None]:
temperatures.ffill().T

#### Exercise :

Fill the missing temperature values with 30 :

## Statistics

_Pandas_ comes with a lot of methods to compute statistics. We will stay with the _average_ and the _standard deviation_ here, though :

In [None]:
people.mean()

In [None]:
people.std()

What if we want only the average height ? Well :

In [None]:
people.mean()["height"]

These statistics are implemented to **ignore missing values**. If we want a fail to count as a zero instead, we fill the missing grades first and get this average grade :

In [None]:
people["grade"].fillna(0).mean()
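
The difference is visible on a tiny example. The sketch below uses a made-up data frame `marks`, so that `people` is left untouched :

```python
import numpy as np
import pandas as pd

marks = pd.DataFrame({"grade": [5, np.nan, 4, np.nan, 3]})

# Missing values are skipped: (5 + 4 + 3) / 3
skipped = marks["grade"].mean()
print(skipped)  # 4.0

# After replacing them with 0: (5 + 0 + 4 + 0 + 3) / 5
filled = marks["grade"].fillna(0).mean()
print(filled)   # 2.4
```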

## Plotting with Pandas

Pandas implements its own ways to plot figures, usually with lots of additional preconfigured fluff.

For example, let's see if there is a correlation between _age_ and _grade_ :

In [None]:
people.plot.scatter(x = "age", y = "grade", figsize = (16, 10));

Or let's plot the time series :

In [None]:
temperatures.plot(figsize = (16, 10));

What happens if we add another year ?

In [None]:
temperatures["year 2"] = [-5, -2, 4, 7, 15, 21, 26, 29, 20, 10, 3, 0]

In [None]:
temperatures.plot(figsize = (16, 10));

#### Exercise :

Plot only _year 1_ values, and with filled values from the previous entry :

## Wide format versus long format

Both data frames we have been using here were stored in **wide format** : each row is an entry on its own, and each kind of measurement has its own column.

We can switch to long format using the `melt` method :

In [None]:
wide = temperatures.copy()
wide["month"] = temperatures.index

In [None]:
long = wide.melt(
    id_vars    = ["month"],
    value_vars = ["year 1", "year 2"],
    var_name   = "year",
    value_name = "°C"
)

In [None]:
long

Data is often stored in long format, so it's better to know how to handle it. Long format can be converted into wide using the `pivot` method :

In [None]:
long.pivot(
    index   = "month",
    columns = "year",
    values  = "°C"
)

# Seaborn basics

The _Seaborn_ project is based on _PyPlot_, and brings a **lot** of fluff to our plots.

Let's import it :

In [None]:
import seaborn as sns

Let's also add to the `people` data frame a _name_ column :

In [None]:
people["name"] = people.index


## Plotting time series

Plotting our time series in wide format makes _Seaborn_ use the data frame's index as the X axis, and each column becomes an entry on its own :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.lineplot(data = temperatures, ax = ax);

As can be seen, the months do not appear in the right order : with a wide-format data frame, it is safer to use a numeric index. Let's remedy this by using our data in long format :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.lineplot(x = "month", y = "°C", data = long, ax = ax);

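As we see, when several values share the same X position, _Seaborn_'s `lineplot` plots their average by default (here, the average of both years for each month).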

Additionally, the `hue` and `style` parameters can be of great help when dealing with different **kinds** of data :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.lineplot(x = "month", y = "°C", hue = "year", data = long, ax = ax);

## Scatter plots

Let's use `scatterplot` to see if there is any correlation between _age_ and _height_ :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.scatterplot(x = "age", y = "height", data = people, ax = ax);

It can be nice to identify various points ; this is where the `hue` and `style` parameters become very useful again :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.scatterplot(x = "age", y = "height", hue = "name", data = people, ax = ax);

## Histograms

_Seaborn_'s `histplot` (the replacement for `distplot`, which is deprecated since _Seaborn_ 0.11) has some nice additions, such as kernel density estimation :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.histplot(temperatures["year 1"].ffill(), kde = True, ax = ax)
sns.histplot(temperatures["year 2"], kde = True, ax = ax);

## Heatmaps

The `heatmap` function is the one you actually want to use for **heatmaps**, as it provides the correct indexing on all axes, and can display a color bar as well as the number in each cell using the `annot` parameter :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.heatmap(array, annot = True, ax = ax);

The `center` parameter can help with differentiating positive and negative values :

In [None]:
_, ax = plt.subplots(figsize = (16, 10))

sns.heatmap(array, annot = True, center = 0, ax = ax);