Notebook adapted by Andrew Ferguson from code prepared by Mathieu Blondel.

# Welcome

Welcome to the introduction to Python in Google Colab! In this practical, we will learn about the programming language [Python](https://www.python.org/) as well as [NumPy](https://numpy.org/) and [Matplotlib](https://matplotlib.org/), two fundamental tools for data science and machine learning in Python.

# Notebooks

This week, we will use [Jupyter notebooks](https://jupyter.org/) and [Google Colab](https://colab.research.google.com/) as the primary way to practice machine learning. Notebooks are a great way to mix executable code with rich content (e.g., HTML, images, equations written in LaTeX). Colab lets you run notebooks in the cloud for free, without any prior installation, while leveraging the power of GPUs.

The document that you are reading is not a static web page, but an interactive environment called a notebook, which lets you write and execute code. Notebooks consist of so-called code cells: blocks of one or more Python instructions. For example, here is a code cell that stores the result of a computation (the number of seconds in a day) in a variable and displays its value:

In [1]:
seconds_in_a_day = 24 * 60 * 60
seconds_in_a_day

86400

Click on the "play" button to execute the cell. You should be able to see the result. Alternatively, you can also execute the cell by pressing Ctrl + Enter if you are on Windows / Linux or Command + Enter if you are on a Mac.

Variables that you defined in one cell can later be used in other cells:

In [2]:
seconds_in_a_week = 7 * seconds_in_a_day
seconds_in_a_week

604800

Note that the order of execution is important. For instance, if we do not run the cell storing *seconds_in_a_day* beforehand, the above cell will raise an error, as it depends on this variable.

The order in which the cells were executed is indicated by an integer between square brackets on the left side of the cell (e.g., [2]). Unexecuted cells have an empty set of square brackets (e.g., [ ]).

To make sure that you run all the cells in the correct order, you can also click on "Runtime" in the top-level menu, then "Run all".

**Exercise.** Add a cell below this cell: click on this cell then click on "+ Code". In the new cell, compute and display the number of seconds in a year by reusing the variable *seconds_in_a_day*. Run the new cell.

In [3]:
days_in_a_year = 365
seconds_in_a_year = days_in_a_year * seconds_in_a_day
seconds_in_a_year

31536000

# Python

Python is one of the most popular programming languages for machine learning, both in academia and in industry. Learning it is therefore essential for anyone interested in machine learning. In this section, we will review Python basics.

It is impossible to do any more than scratch the surface of the Python language in this short introduction. For more information, here are some entry-level resources:

* List of Python [tutorials](https://wiki.python.org/moin/BeginnersGuide/Programmers)
* Four-hour [course](https://www.youtube.com/watch?v=rfscVS0vtbw) on YouTube



## Lists

Lists are a container type for ordered sequences of elements. Lists can be initialized empty

In [5]:
my_list = []

or with some initial elements

In [1]:
my_list = [1, 2, 3]

Lists have a dynamic size and elements can be added (appended) to them

In [2]:
my_list.append(4)
my_list

[1, 2, 3, 4]

We can access individual elements of a list using integers.

The *first* element of the list `my_list` is accessed as `my_list[0]`. The *last* element of the list `my_list` is accessed as `my_list[-1]`.

**N.B. Python indexing starts from 0!**

In [3]:
my_list[2]

3

In [4]:
my_list[0]

1

In [5]:
my_list[-1]

4

We can access "slices" of a list using `my_list[i:j]`, where `i` is the start of the slice (again, indexing starts from 0) and `j` is the end of the slice. The element at index `j` itself is not included. For instance:

In [6]:
my_list[1:3]

[2, 3]

Omitting the second index means that the slice should run until the end of the list

In [7]:
my_list[1:]

[2, 3, 4]
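
The slicing rules above extend to a few more cases. The sketch below uses a hypothetical list `letters` (not part of the original notebook) to show that the end index is excluded, that the start index can be omitted as well, and that negative indices count from the end of the list.

```python
# `letters` is a hypothetical example list, separate from `my_list`.
letters = ["a", "b", "c", "d", "e"]

first_two = letters[:2]     # omitting the start index: from the beginning
middle = letters[1:4]       # indices 1, 2 and 3 (index 4 is excluded)
last_two = letters[-2:]     # negative indices count from the end
every_other = letters[::2]  # an optional third value gives the step

print(first_two)    # ['a', 'b']
print(middle)       # ['b', 'c', 'd']
print(last_two)     # ['d', 'e']
print(every_other)  # ['a', 'c', 'e']
```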

We can check if an element is in the list using `in`

In [8]:
5 in my_list

False

The length of a list can be obtained using the `len` function

In [9]:
len(my_list)

4

## Strings

Strings are used to store text. They can be delimited using either single quotes or double quotes

In [None]:
string1 = "some text"
string2 = 'some other text'

Strings behave similarly to lists. As such we can access individual elements in exactly the same way

In [None]:
string1[3]

and similarly for slices

In [None]:
string1[5:]

String concatenation is performed using the `+` operator

In [None]:
string1 + " " + string2
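
Strings support `len` and `in` just like lists, with one difference worth knowing: strings are immutable, so an individual character cannot be reassigned. A minimal sketch, using a hypothetical variable `greeting`:

```python
greeting = "hello world"  # hypothetical example string

print(len(greeting))        # 11
print("world" in greeting)  # True: `in` checks for substrings
print(greeting[0:5])        # hello

# Unlike lists, strings cannot be modified in place.
try:
    greeting[0] = "H"
except TypeError as error:
    print("TypeError:", error)

# Build a new string instead.
capitalized = "H" + greeting[1:]
print(capitalized)          # Hello world
```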

## Arithmetic operations

Python supports the usual arithmetic operators: `+` (addition), `-` (subtraction), `*` (multiplication), `/` (division), `**` (power), `//` (integer division) and `%` (remainder).
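
As a quick illustration, the snippet below applies these operators, including subtraction and the remainder operator `%`, to the same pair of numbers. The variable names `m` and `n` are arbitrary. Note that `/` always returns a float, while `//` rounds the quotient down.

```python
m = 7
n = 2

print(m + n)    # 9
print(m - n)    # 5
print(m * n)    # 14
print(m / n)    # 3.5  (true division, always a float)
print(m // n)   # 3    (integer division, rounds down)
print(m % n)    # 1    (remainder of the integer division)
print(m ** n)   # 49
print(-m // n)  # -4   (rounding is toward negative infinity)
```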

## Conditionals

As their name indicates, conditionals are a way to execute code depending on whether a condition is True or False. As in other languages, Python supports `if` and `else` but `else if` is contracted into `elif`, as the example below demonstrates.

In [None]:
my_variable = 5
if my_variable < 0:
  print("negative")
elif my_variable == 0:
  print("null")
else: # my_variable > 0
  print("positive")

Notice the use of `print` to display messages. It can also display the value of variables. In addition, a notebook automatically displays the value of the last line of a cell when that line is an expression.

In [None]:
my_variable = 5
print(my_variable)
my_variable # This behaves as print(my_variable)

In [None]:
my_variable = 5
my_variable # This does not behave as print(my_variable) (since it is not the last line)
print(my_variable)

In [None]:
# Notice the two spaces before `print` in the snippet with the if, elif, else statements. These spaces are called indentation, and they are important because they tell Python where the body of each condition starts and stops. Run the following examples and check that you fully understand the output.

In [None]:
my_variable = 5
if my_variable > 0:
  print("Instruction bla")
  print("Instruction bli")
  print("Instruction blu")

In [None]:
my_variable = 5
if my_variable < 0:
  print("Instruction bla")
  print("Instruction bli")
  print("Instruction blu")

In [None]:
my_variable = 5
if my_variable < 0:
  print("Instruction bla")
print("Instruction bli")
print("Instruction blu")

Incorrect indentation will produce an error. What is wrong with the following code?

In [None]:
my_variable = 5
if my_variable < 0:
  print("Instruction bla")
print("Instruction bli")
  print("Instruction blu")

Here `<` and `>` are the strict less-than and greater-than operators, while `==` is the equality operator (not to be confused with `=`, the variable assignment operator). The operators `<=` and `>=` can be used for less-than-or-equal and greater-than-or-equal comparisons, respectively.

Here, we use 2-space indentation, but many programmers use 4-space indentation instead. Either is fine as long as you are consistent throughout your code.
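
Comparison operators evaluate to the Boolean values `True` or `False`, which is what `if` and `elif` test. A small sketch with a hypothetical variable `temperature` (the `!=` operator, meaning "not equal", is an addition not mentioned above):

```python
temperature = 21  # hypothetical example value

print(temperature < 0)    # False
print(temperature >= 21)  # True
print(temperature == 21)  # True
print(temperature != 21)  # False

# The result of a comparison can be stored and reused in a conditional.
is_warm = temperature > 18
if is_warm:
    label = "warm"
else:
    label = "cold"
print(label)  # warm
```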

## Loops

Loops are a way to execute a block of code multiple times. There are two main types of loops: while loops and for loops. Here as well, indentation is used to tell Python where the loop starts and stops.

While loop

In [None]:
i = 0
while i < len(my_list):
  print(my_list[i])
  i += 1 # equivalent to i = i + 1

For loop

In [None]:
for i in range(len(my_list)):
  print(my_list[i])

If the goal is simply to iterate over a list, we can do so directly as follows

In [None]:
for element in my_list:
  print(element)
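
When both the index and the element are needed, the two styles of for loop can be combined with the built-in `enumerate`. The sketch below also shows `range` with an explicit start. The list `colors` and the variable `total` are hypothetical names used only for illustration.

```python
colors = ["red", "green", "blue"]  # hypothetical example list

# enumerate yields (index, element) pairs.
for index, color in enumerate(colors):
    print(index, color)

# range also accepts a start and a stop (the stop is excluded).
total = 0
for k in range(1, 5):
    total += k
print(total)  # 1 + 2 + 3 + 4 = 10
```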

## Functions

To improve code readability, it is common to separate the code into different blocks, each responsible for performing a precise action: functions. A function takes some inputs and processes them to return some outputs.

In [None]:
def square(x):
  return x ** 2

def multiply(a, b):
  return a * b

# Functions can be composed.
square(multiply(3, 2))

The code inside a function should be indented. When used in combination with a loop, it is better to define the function once outside the loop.

The following code redefines the function on every iteration of the loop. This is inefficient and also harder to read.

In [None]:

a = 0
for i in range(10):
  def add_one(x):
    y = x + 1
    return y
  a = add_one(a)
print(a)

Here is a better version:

In [None]:

def add_one(x):
  y = x + 1
  return y

a = 0
for i in range(10):
  a = add_one(a)
print(a)

An interesting feature of functions is that variables defined inside a function cannot be accessed from outside it.
However, a function can access values defined outside its scope (although it is not good practice to do so).
Check that you understand the behavior of the following examples.

In [None]:
# Local variables cannot be accessed from the outside
c = 5
a = 3

def my_function():
  a = 4
  b = 6
  c = a + b
  return c

print(a)
print(c)
print(my_function())

In [None]:
# A function accessing values outside its scope
a = 5
c = 3

def my_function():
  b = 6
  c = a + b
  return c

print(a)
print(c)
print(my_function())

In [None]:
# Better practice: avoid accessing variables defined outside the scope of the function.
a = 5
c = 3

def my_function():
  h = 5
  b = 6
  c = h + b
  return c

print(a)
print(c)
print(my_function())
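
One common way to avoid relying on outside variables, sketched below, is to pass the values in as arguments. The function name `add_values` and its parameter names are hypothetical and do not appear elsewhere in this notebook.

```python
a = 5
c = 3

def add_values(first, second):
    # Everything the function needs arrives through its parameters.
    result = first + second
    return result

print(add_values(a, 6))  # 11
print(a)                 # 5: unchanged by the call
print(c)                 # 3: unchanged by the call
```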

To improve code readability, it is sometimes useful to explicitly name the arguments

In [None]:
square(multiply(a=3, b=2))
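
Named arguments can be given in any order, and a parameter can also be assigned a default value in the function definition. The following sketch uses a hypothetical function `power` to illustrate both points.

```python
def power(base, exponent=2):
    # `exponent` has a default value, so the caller may omit it.
    return base ** exponent

print(power(3))                   # 9
print(power(base=3, exponent=3))  # 27
print(power(exponent=3, base=2))  # 8: named arguments can come in any order
```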

## Exercises

**Exercise 1.** Using a conditional, write the [relu](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)) function defined as follows

$\text{relu}(x) = \left\{
   \begin{array}{rl}
     x, & \text{if }  x \ge 0 \\
     0, & \text{otherwise }.
   \end{array}\right.$

In [None]:
def relu(x):
    if x > 0:
        return x
    else:
        return 0

relu(-3)

**Exercise 2.** Using a for loop, write a function that computes the [Euclidean norm](https://en.wikipedia.org/wiki/Norm_(mathematics)#Euclidean_norm) of a vector, represented as a list.

In [None]:
def euclidean_norm(vector):
    sos = 0
    for v in vector:
        sos += v**2
    return sos ** 0.5

my_vector = [0.5, -1.2, 3.3, 4.5]
# The result should be roughly 5.729746940310715
euclidean_norm(my_vector)

**Exercise 3.** Using a for loop and a conditional, write a function that returns the maximum value in a vector.

In [None]:
def vector_maximum(vector):
    maximum = vector[0]
    for v in vector:
        if maximum <= v:
            maximum = v
    return maximum
vector_maximum(my_vector)

**Bonus exercise.** If time permits, write a function that sorts a list in ascending order (from smallest to largest) using the [bubble sort](https://en.wikipedia.org/wiki/Bubble_sort) algorithm.

In [None]:
def is_sorted(my_list):
    for i in range(1, len(my_list)):
        if my_list[i-1] > my_list[i]:
            return False
    return True

def bubble_sort(my_list):
    while not is_sorted(my_list):
        for i in range(1, len(my_list)):
            if my_list[i-1] > my_list[i]:
                # Swap the two neighbouring elements.
                my_list[i-1], my_list[i] = my_list[i], my_list[i-1]
    return my_list

my_list = [1, -3, 3, 2]
# Should return [-3, 1, 2, 3]
bubble_sort(my_list)

## Going further

Clearly, it is impossible to cover all the language features in this short introduction. To go further, we recommend the following resources:



* List of Python [tutorials](https://wiki.python.org/moin/BeginnersGuide/Programmers)
* Four-hour [course](https://www.youtube.com/watch?v=rfscVS0vtbw) on YouTube



# NumPy

NumPy is a popular library for storing arrays of numbers and performing computations on them. Not only does it often make code more succinct, it also makes it faster, since most NumPy routines are implemented in C.

To use NumPy in your program, you need to import it as follows

In [None]:
import numpy as np

## Array creation



NumPy arrays can be created from Python lists

In [None]:
my_array = np.array([1, 2, 3])
my_array

NumPy supports arrays of arbitrary dimension. For example, we can create two-dimensional arrays (e.g. to store a matrix) as follows

In [None]:
my_2d_array = np.array([[1, 2, 3], [4, 5, 6]])
my_2d_array

We can access individual elements of a 2d-array using two indices

In [None]:
my_2d_array[1, 2]

We can also access rows

In [None]:
my_2d_array[1]

and columns

In [None]:
my_2d_array[:, 2]

Arrays have a `shape` attribute

In [None]:
print(my_array.shape)
print(my_2d_array.shape)
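
Alongside `shape`, arrays expose a few related attributes. The sketch below uses a hypothetical array `grid` to show how they relate to one another.

```python
import numpy as np

grid = np.array([[1, 2, 3], [4, 5, 6]])  # hypothetical example array

print(grid.shape)  # (2, 3): 2 rows, 3 columns
print(grid.ndim)   # 2: number of dimensions
print(grid.size)   # 6: total number of elements
print(len(grid))   # 2: length of the first dimension
```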

Unlike Python lists, NumPy arrays have a data type (`dtype`), and all elements of an array must share that type.

In [None]:
my_array.dtype

The main types are `int32` (32-bit integers), `int64` (64-bit integers), `float32` (32-bit floating-point values) and `float64` (64-bit floating-point values).

The `dtype` can be specified when creating the array

In [None]:
my_array = np.array([1, 2, 3], dtype=np.float64)
my_array.dtype
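
When no `dtype` is given, NumPy infers one that can hold every element, and an existing array can be converted with `astype`. A short sketch, with the hypothetical names `mixed` and `as_int`:

```python
import numpy as np

mixed = np.array([1, 2.5, 3])  # one float makes the whole array float64
print(mixed.dtype)             # float64

# astype returns a new array; converting floats to integers drops the
# fractional part.
as_int = mixed.astype(np.int32)
print(as_int.dtype)            # int32
print(as_int)                  # values 1, 2, 3
print(mixed.dtype)             # float64: the original array is unchanged
```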

We can create arrays of all zeros using

In [None]:
zero_array = np.zeros((2, 3))
zero_array

and similarly for all ones using `ones` instead of `zeros`.

We can create a range of values using

In [None]:
np.arange(5)

or specifying the starting point

In [None]:
np.arange(3, 5)

Another useful routine is `linspace`, which creates evenly spaced values in an interval, with both endpoints included. For instance, to create 10 values in `[0, 1]`, we can use

In [None]:
np.linspace(0, 1, 10)
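
The difference between `linspace` and `arange` is easiest to see side by side: `linspace` takes the number of points and includes both endpoints, whereas `arange` takes a step and excludes the stop value. The names `points` and `steps` below are hypothetical.

```python
import numpy as np

points = np.linspace(0, 1, 5)  # 5 values, both endpoints included
steps = np.arange(0, 1, 0.25)  # step of 0.25, stop value excluded

print(points)  # values 0, 0.25, 0.5, 0.75, 1
print(steps)   # values 0, 0.25, 0.5, 0.75
```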

Another important operation is `reshape`, for changing the shape of an array

In [None]:
my_array = np.array([1, 2, 3, 4, 5, 6])
my_array.reshape(3, 2)
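
Two details of `reshape` are worth trying out: one dimension can be given as `-1`, in which case NumPy infers it, and the total number of elements must stay the same. A minimal sketch, using the hypothetical names `values` and `two_rows`:

```python
import numpy as np

values = np.arange(6)             # 6 elements: 0, 1, 2, 3, 4, 5
two_rows = values.reshape(2, -1)  # -1 lets NumPy infer the missing dimension
print(two_rows.shape)             # (2, 3)
print(values.shape)               # (6,): the original array keeps its shape

# The total number of elements must stay the same.
try:
    values.reshape(4, 2)
except ValueError as error:
    print("ValueError:", error)
```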

Play with these operations and make sure you understand them well.

## Basic operations

In NumPy, we express computations directly over arrays. This makes the code much more succinct.

Arithmetic operations can be performed directly over arrays. For instance, assuming two arrays have a compatible shape, we can add them as follows

In [None]:
array_a = np.array([1, 2, 3])
array_b = np.array([4, 5, 6])
array_a + array_b

Compare this with the equivalent computation using a for loop

In [None]:
array_out = np.zeros_like(array_a)
for i in range(len(array_a)):
  array_out[i] = array_a[i] + array_b[i]
array_out

Not only is this code more verbose, it also runs much more slowly.
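
The other arithmetic operators work element-wise in the same way, and a scalar operand is applied to every element. The sketch below uses two hypothetical arrays, `u` and `v`.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

print(u * v)      # element-wise product: 4, 10, 18
print(u - v)      # element-wise difference: -3, -3, -3
print(u / v)      # element-wise quotient: 0.25, 0.4, 0.5
print(2 * u + 1)  # the scalars apply to every element: 3, 5, 7
```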

In NumPy, functions that operate on arrays in an element-wise fashion are called [universal functions](https://numpy.org/doc/stable/reference/ufuncs.html). `np.sin` is one such function

In [None]:
np.sin(array_a)

The inner product of two vectors can be computed using `np.dot`

In [None]:
np.dot(array_a, array_b)

When the two arguments to `np.dot` are both 2d arrays, `np.dot` becomes matrix multiplication

In [None]:
array_A = np.random.rand(5, 3)
array_B = np.random.randn(3, 4)
np.dot(array_A, array_B)
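
The shapes must be compatible for matrix multiplication: a (5, 3) matrix times a (3, 4) matrix gives a (5, 4) matrix. The sketch below checks this and also shows the `@` operator, which performs the same operation for 2d arrays. The names `left`, `right` and `product` are hypothetical.

```python
import numpy as np

left = np.random.rand(5, 3)   # hypothetical 5x3 matrix
right = np.random.rand(3, 4)  # hypothetical 3x4 matrix

product = np.dot(left, right)
print(product.shape)  # (5, 4)

# The @ operator performs the same matrix multiplication.
print(np.allclose(product, left @ right))  # True
```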

Matrix transpose can be done using `.transpose()` or `.T` for short

In [None]:
array_A.T

## Slicing and masking

Like Python lists, NumPy arrays support slicing

In [None]:
np.arange(10)[5:]

We can also select only certain elements of an array using a Boolean mask, that is, an array of `True`/`False` values with the same shape

In [None]:
x = np.arange(10)
mask = x >= 5
x[mask]
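
Masks can be combined and can also be used for assignment. The following sketch assumes a hypothetical array `data`; the element-wise operators `&` (and) and `|` (or) are used here instead of Python's `and` and `or`, which do not work element-wise on arrays.

```python
import numpy as np

data = np.arange(10)

print(data >= 5)  # the mask itself is an array of booleans

# Each comparison needs its own parentheses when combined with & or |.
in_between = data[(data >= 3) & (data < 7)]
print(in_between)  # values 3, 4, 5, 6

# A mask can also select the elements to overwrite.
clipped = data.copy()
clipped[clipped > 6] = 6
print(clipped)  # values above 6 are replaced by 6
```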

## Exercises

**Exercise 1.** Create a 3d array of shape (2, 2, 2), containing 8 values. Access individual elements and slices.

In [None]:
A = np.random.rand(2, 2, 2)
print(A)
print(A[1, 1])
print(A[1, :, -1])

**Exercise 2.** Rewrite the relu function (see Python section) using [np.maximum](https://numpy.org/doc/stable/reference/generated/numpy.maximum.html). Check that it works on both a single value and on an array of values.

In [None]:
def relu_numpy(x):
  return np.maximum(x, 0)

print(relu_numpy(np.array([1, -3, 2.5])))
print(relu_numpy(-4))

**Exercise 3.** Rewrite the Euclidean norm of a vector (1d array) using NumPy (without a for loop).

In [None]:
def euclidean_norm_numpy(x):
  return np.sqrt(np.sum(x**2))

my_vector = np.array([0.5, -1.2, 3.3, 4.5])
euclidean_norm_numpy(my_vector)

**Exercise 4.** Write a function that computes the Euclidean norms of a matrix (2d array) in a row-wise fashion. Hint: use the `axis` argument of [np.sum](https://numpy.org/doc/stable/reference/generated/numpy.sum.html).

In [None]:
def euclidean_norm_2d(X):
  return np.sqrt(np.sum(X**2, axis=1))

my_matrix = np.array([[0.5, -1.2, 4.5],
                      [-3.2, 1.9, 2.7]])
# Should return an array of size 2.
euclidean_norm_2d(my_matrix)
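
One way to check a solution like this is to compare it with `np.linalg.norm`, which also accepts an `axis` argument. The sketch below uses a hypothetical matrix whose row norms are easy to verify by hand (3-4-5 and 6-8-10 triangles).

```python
import numpy as np

matrix = np.array([[3.0, 4.0],
                   [6.0, 8.0]])  # hypothetical example with easy norms

by_hand = np.sqrt(np.sum(matrix ** 2, axis=1))  # sum over each row
by_numpy = np.linalg.norm(matrix, axis=1)

print(by_hand)   # norms 5 and 10
print(by_numpy)  # same values
```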

**Exercise 5.** Compute the mean value of the features in the [iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html). Hint: use the `axis` argument on [np.mean](https://numpy.org/doc/stable/reference/generated/numpy.mean.html).

In [None]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

# Result should be an array of size 4.
np.mean(X, axis=0)

## Going further

* NumPy [reference](https://numpy.org/doc/stable/reference/)
* SciPy [lectures](https://scipy-lectures.org/)
* One-hour [tutorial](https://www.youtube.com/watch?v=QUT1VHiLmmI) on YouTube



# Matplotlib

## Basic plots

Matplotlib is a plotting library for Python.

We start with a rudimentary plotting example.

In [None]:
from matplotlib import pyplot as plt

x_values = np.linspace(-3, 3, 100)

plt.figure()
plt.plot(x_values, np.sin(x_values), label="Sinusoid")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.title("Matplotlib example")
plt.legend(loc="upper left")
plt.show()

We continue with a rudimentary scatter plot example. This example displays samples from the [iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html) using the first two features. Colors indicate class membership (there are 3 classes).

In [None]:
from sklearn.datasets import load_iris
X, y = load_iris(return_X_y=True)

X_class0 = X[y == 0]
X_class1 = X[y == 1]
X_class2 = X[y == 2]

plt.figure()
plt.scatter(X_class0[:, 0], X_class0[:, 1], label="Class 0", color="C0")
plt.scatter(X_class1[:, 0], X_class1[:, 1], label="Class 1", color="C1")
plt.scatter(X_class2[:, 0], X_class2[:, 1], label="Class 2", color="C2")
plt.legend()  # display the class labels passed to plt.scatter
plt.show()

We see that samples belonging to class 0 can be linearly separated from the rest using only the first two features.

## Exercises



**Exercise 1.** Plot the relu and the [softplus](https://en.wikipedia.org/wiki/Rectifier_(neural_networks)#Softplus) functions on the same graph.

In [None]:
x = np.linspace(-3, 3, 100)
plt.figure()
plt.plot(x, relu_numpy(x), label="relu")
plt.plot(x, np.log(1 + np.exp(x)), label="softplus")
plt.legend()  # tell the two curves apart
plt.show()

What is the main difference between the two functions?

In [None]:
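# Answer: softplus is smooth, with a derivative that varies continuously, whereas relu has a kink at 0 and a piecewise-constant derivative (0 for negative x, 1 for positive x).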
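
To see the difference numerically rather than visually, the sketch below evaluates both functions at a few points. It computes softplus with `np.logaddexp(0, x)`, which equals `log(exp(0) + exp(x))` and therefore `log(1 + exp(x))`. This choice is an assumption of this sketch, not the notebook's solution, and is made because it avoids overflow for large inputs. The helper name `softplus` is hypothetical.

```python
import numpy as np

def softplus(x):
    # log(1 + exp(x)), written as log(exp(0) + exp(x)).
    return np.logaddexp(0, x)

points = np.array([-10.0, 0.0, 10.0])
relu_values = np.maximum(points, 0)
softplus_values = softplus(points)

print(relu_values)
print(softplus_values)  # close to relu far from 0, equal to log(2) at 0
```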

**Exercise 2.** Repeat the same scatter plot but using the [digits dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_digits.html) instead.

In [None]:
from sklearn.datasets import load_digits
X, y = load_digits(return_X_y=True)

plt.figure()
for cl in np.unique(y):
    plt.scatter(X[y==cl, 34], X[y==cl, 35], c="C" + str(cl))
plt.show()

Are pixel values good features for classifying samples?

In [None]:
# Answer: Based on the value of these 2 pixels, it seems hard to classify anything ...

## Going further

* Official [tutorial](https://matplotlib.org/tutorials/introductory/pyplot.html)
* [Tutorial](https://www.youtube.com/watch?v=qErBw-R2Ybk) on YouTube