# Introduction to notebooks, Python, NumPy, Pandas and Scikit-learn

Notebooks, such as this one, consist of cells containing programming code that you can run interactively. Some other cells, such as the one you are reading now, can contain text and even images.

### Exercise
* Select the next cell with the mouse and run it by clicking the ▶️ ("play") button above.
* Change the message to be printed and run the cell again, this time by using the keyboard shortcut: Shift-Enter.

In [None]:
print("Hello, world")

## Python

This course uses version 3 of the Python programming language. Let's try some more Python code:

In [None]:
x = 2
y = 3
print('x =', x)
print('y =', y)

In [None]:
y += x
print('y =', y)

### Exercise

What happens if you run the previous cell again (the one with `y += x`)? Why?

## NumPy

[NumPy](https://numpy.org/) is the fundamental package for scientific computing with Python. It provides a data type, the NumPy array, which is much more efficient for numerical calculations than Python's own list data type.

In [None]:
import numpy as np
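
As a small illustration of why arrays are convenient for numerical work, the sketch below doubles every element of a list and of an array. The variable names and values are made up for this example. With a plain list we have to loop over the elements ourselves, while NumPy applies the operation to the whole array at once.

```python
import numpy as np

values = [1.0, 2.0, 3.0, 4.0]  # made-up example values

# Plain Python list: we loop over the elements explicitly
doubled_list = [2 * v for v in values]

# NumPy array: the multiplication is applied element-wise to the whole array
arr = np.array(values)
doubled_arr = 2 * arr

print(doubled_list)
print(doubled_arr)
```

Note that `2 * values` on a plain list would repeat the list rather than double its elements.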

Let's create a small NumPy array and print its data type and shape (size):

In [None]:
a = np.array((1, 2, 3, 4))
print(a)
print(a.dtype)
print(a.shape)


The shape is shown as the tuple `(4,)`, which just means that the array has a single dimension of size 4.

We can also specify the data type ourselves by giving the `dtype` parameter:

In [None]:
a = np.array((1,2,3,4), dtype=float) # Type can be explicitly specified
print(a)
print(a.dtype)
print(a.shape)

Multidimensional arrays can also be created. For example, here is a 2x3 array, or matrix:

In [None]:
b = np.array([[1,2,3], [4,5,6]])
print(b)
print(b.shape)

NumPy has many convenience functions for creating arrays. For example, to create a 3x3 array initialized to zeros:

In [None]:
c = np.zeros((3,3), int)
print(c)
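
A few other creation functions work in a similar way. The sketch below shows `np.ones`, `np.arange` and `np.full`; the shapes and values are arbitrary choices for illustration.

```python
import numpy as np

ones = np.ones((2, 3))        # 2x3 array of floating-point ones
steps = np.arange(0, 10, 2)   # integers from 0 up to (but excluding) 10, in steps of 2
filled = np.full((2, 2), 7)   # 2x2 array where every element is 7

print(ones)
print(steps)
print(filled)
```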

You can access single elements or slices (subsets) with the normal Python slicing syntax:

In [None]:
print(a)
print(a[1])
print(a[1:3])

You can also slice matrices:

In [None]:
print(b)
print('first row:\n', b[0,:])
print('second and third columns:\n', b[:,1:3])

You can get the [transpose of a matrix](https://en.wikipedia.org/wiki/Transpose) with the `.T` attribute:

In [None]:
print(b.T)

### Exercise

- create a 3x3 identity matrix, that is, a 3x3 matrix of zeros except for the diagonal from top-left to bottom-right, which contains ones
- print the second row and second column of that matrix

Hint: read the [NumPy documentation on array creation](https://numpy.org/doc/stable/reference/routines.array-creation.html)

NumPy contains linear algebra operations for matrix and vector products, eigenproblems and linear systems. Typically, NumPy is built against optimized BLAS libraries which means that these operations are quite efficient.

In [None]:
A = np.array(((2, 1), (1, 3)))
B = np.array(((-2, 4.2), (4.2, 6)))
C = np.dot(A, B) # matrix-matrix product
print(C)

In [None]:
w, v = np.linalg.eig(A) # eigenvalues in w, eigenvectors in v
print(w)
print(v)

In [None]:
b = np.array((1, 2))
x = np.linalg.solve(C, b) # Solve Cx = b
print(x)
print(np.dot(C, x)) # np.dot also calculates matrix-vector and vector-vector products
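
One way to sanity-check the eigendecomposition is to verify that each eigenpair satisfies `A v = w v`. The sketch below assumes the same matrix `A` as above and uses the `@` operator, which for 2-D and 1-D arrays computes the same product as `np.dot`.

```python
import numpy as np

A = np.array(((2, 1), (1, 3)))
w, v = np.linalg.eig(A)

# Column v[:, i] is the eigenvector that belongs to eigenvalue w[i]
for i in range(len(w)):
    lhs = A @ v[:, i]      # matrix times eigenvector
    rhs = w[i] * v[:, i]   # eigenvalue times eigenvector
    print(np.allclose(lhs, rhs))
```

`np.allclose` is used instead of `==` because floating-point results are rarely exactly equal.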

Further reading:

* [NumPy Quickstart tutorial](https://numpy.org/doc/stable/user/quickstart.html)
* [NumPy: the absolute basics for beginners](https://numpy.org/doc/stable/user/absolute_beginners.html)
* [NumPy Reference](https://numpy.org/doc/stable/reference/index.html)

## Simple plotting with Matplotlib 

[Matplotlib](https://matplotlib.org/) is a useful library for plotting visualizations in Python.

In [None]:
import matplotlib.pyplot as plt

In notebooks we can add this "magic" command to make the figures appear inside the notebook.

In [None]:
%matplotlib inline

In [None]:
x = np.linspace(-np.pi, np.pi, 100)
y = np.sin(x)
plt.plot(x, y)
plt.title('A simple plot')
plt.xlabel('time (s)')
plt.show()

## Pandas

[Pandas](https://pandas.pydata.org/) is a tool for data analysis in Python. It's mainly used for tabular data, i.e., data that is best expressed in a column-oriented format rather than as multi-dimensional arrays. In the column-based format each column can have a different data type, e.g., integer, float or string. For this Pandas introduces the `DataFrame` data type.

In [None]:
import pandas as pd
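
Before loading a real dataset, here is a minimal sketch of a DataFrame whose columns have different data types. The column names and values are invented purely for illustration.

```python
import pandas as pd

# Each dictionary key becomes a column; each column has its own data type
df = pd.DataFrame({
    'name': ['a', 'b', 'c'],       # strings
    'count': [1, 2, 3],            # integers
    'weight': [0.5, 1.5, 2.5],     # floats
})

print(df)
print(df.dtypes)
```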

To demonstrate Pandas DataFrames, let's load the [Iris flowers dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). First, we'll use a shell command to download the CSV file if it doesn't exist.

In [None]:
import os
if not os.path.isfile('iris.csv'):
    !wget https://a3s.fi/mldata/iris.csv

The file `iris.csv` should appear in the Jupyter Notebook file browser to the left. You can double-click the file if you wish to inspect it.

Next, we'll load the CSV data as Pandas DataFrame:

In [None]:
data = pd.read_csv('iris.csv')

Let's look at the columns and their data types:

In [None]:
data.dtypes

We can also print the DataFrame:

In [None]:
data

A specific column can be accessed by its name:

In [None]:
print(data['sepal length'])

We can also use numerical indices as we are used to from NumPy with `.iloc`:

In [None]:
print('First row:')
print(data.iloc[0])
print()
print('Second column:')
print(data.iloc[:,1])

### Exercise

Take a look at the [documentation of Pandas](https://pandas.pydata.org/docs/user_guide/index.html) and figure out how to:
- print the "head" (i.e., first 10 rows) of the dataset above
- get all the flowers of the class 'Iris-setosa', i.e., all the rows of `data` where the column named 'class' equals 'Iris-setosa'

## Scikit-learn

On the first day of this course we will mostly rely on [Scikit-learn, a machine learning framework for Python](https://scikit-learn.org/stable/index.html).

In scikit-learn all machine learning models follow the same pattern:

1. First create a model object with the appropriate constructor for the method you are using.  Here you can also specify _hyperparameters_ for the method:
```
clf = SomeModel(param1=a, param2=b)
```


2. Next, fit your model to the training set (e.g., train your classifier):
```
clf.fit(X_train, y_train)
```


3. Finally, use the model for inference (e.g., predict the classes of new unseen items with your trained classifier):
```
y_predicted_test = clf.predict(X_test)
```
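
Put together, the three steps might look like the following minimal sketch. It uses `LogisticRegression` as the model and a tiny made-up dataset with a single feature and two well-separated classes; both are assumptions chosen only to keep the example short.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy data (invented for illustration): one feature, two classes
X_train = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])
X_test = np.array([[0.5], [9.5]])

clf = LogisticRegression(C=1.0)          # 1. create the model object
clf.fit(X_train, y_train)                # 2. fit it to the training set
y_predicted_test = clf.predict(X_test)   # 3. predict classes for unseen items

print(y_predicted_test)
```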


### Iris flower classifier

Now let's try to create a classifier that predicts the class for the Iris flowers!

First, let's remind ourselves what the data looked like.

In [None]:
data.head()

A classifier would take the characteristics of the flower as input, i.e., the four measurement values: sepal length, sepal width, petal length, and petal width.

The output of the classifier would be the class of the flower. There are three unique classes:

In [None]:
data['class'].unique()

We'll create a DataFrame `X` with just the input values, which is the first four columns of the dataset.

In [None]:
X = data.iloc[:,:4]
X

In `y` we'll put the class to be predicted by the classifier.

In [None]:
y = data['class']
y

### Training and testing datasets

Scikit-learn has a convenient function for splitting the data into training and testing parts. 

(**Remember:** if you are doing real science, you should have three datasets: training, validation and testing!)

In [None]:
from sklearn.model_selection import train_test_split
np.random.seed(5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1)

print('X_train', X_train.shape)
print('y_train', y_train.shape)
print('X_test', X_test.shape)
print('y_test', y_test.shape)
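
The three-dataset setup mentioned in the reminder can be sketched by calling `train_test_split` twice. The stand-in data, the `_demo` variable names and the 60/20/20 proportions below are assumptions for illustration, not a recommendation from this course.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in data: 100 samples with 4 features each (values are arbitrary)
X_demo = np.arange(400).reshape(100, 4)
y_demo = np.arange(100) % 2

# First split off 20% of all samples as the test set ...
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=5)

# ... then split the remaining 80 samples into training (75%) and validation (25%)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=5)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Here the `random_state` argument fixes the shuffling for each call, as an alternative to seeding NumPy's global random number generator.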

### Train the actual model

We'll use a simple logistic regression model (which is a classifier despite the name) using the [`LogisticRegression` class from Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

For the moment, don't worry about the actual model and how it works. We'll cover that in the next lecture!

In [None]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(C=1.0)

Next we'll do the actual training of the model by feeding it the data in `X_train` and asking it to update its internal parameters to fit to the correct labels in `y_train`.

Often training is the most time-consuming part, so we'll use the magic `%%time` command in Jupyter notebooks to measure how long it took.

In [None]:
%%time

clf.fit(X_train, y_train)

Next, let's try our model on the test set.

In [None]:
y_pred = clf.predict(X_test)

We can make a temporary table with the correct and predicted class side-by-side to inspect the result.

In [None]:
tmp = X_test.copy()
tmp.loc[:, 'correct class'] = y_test
tmp.loc[:, 'predicted class'] = y_pred
tmp.loc[:, 'correct prediction?'] = y_test == y_pred

tmp

Instead of just visually examining the results we can also calculate some performance metric, for example the accuracy, i.e., the proportion of correctly classified items.

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test, y_pred)
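
To make the definition concrete, the sketch below computes the accuracy by hand as the fraction of matching labels and compares it with `accuracy_score`. The labels are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true and predicted labels; four of the five predictions are correct
y_true_demo = np.array(['a', 'b', 'b', 'a', 'c'])
y_pred_demo = np.array(['a', 'b', 'a', 'a', 'c'])

# Comparing element-wise gives booleans; their mean is the fraction of matches
manual_accuracy = np.mean(y_true_demo == y_pred_demo)

print(manual_accuracy)
print(accuracy_score(y_true_demo, y_pred_demo))
```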

### Exercise

Check the [documentation of `LogisticRegression` in Scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and rerun the training and testing above with different parameters. At the very least, try different values for the `C` parameter.