# Introduction

The goal of this lecture is to introduce the technologies you will need during the course *and* to show you some examples of how they are used.

<center>

<a href="https://app.sli.do/dm-intro">

![](../images/01-slido-qr-intro.png)

</a>

</center>

But first and foremost, let us settle once and for all the long-standing question of how to pronounce "Jupyter". It is pronounced /ˈdʒuːpɪtər/, just like the name of the planet Jupiter. [Here is a link to a video in which the creator of Jupyter, Fernando Pérez, pronounces it](https://www.youtube.com/watch?v=cc2hHjARNTY&t=335s).

## Numpy

From: [https://numpy.org](https://numpy.org)
> Fast and versatile, the NumPy vectorization, indexing, and broadcasting concepts are the de-facto standards of array computing today.

In [1]:
import numpy as np

## Arrays

An **array** is the central data structure of the NumPy library. It is a grid of values that can be indexed in various ways, together with information about the raw data, how to locate an element, and how to interpret an element. **The elements are all of the same type, referred to as the array dtype**.

NumPy will cast the elements of the array to the most general type among them. For example, if you build an array from a list of integers and floats, the integers will be cast to floats; if the list also contains a string, every element will be cast to a string.

In [2]:
np.array([1,2,3,4]).dtype

dtype('int32')

In [3]:
np.array([1.0,2,3,4]).dtype

dtype('float64')

In [5]:
np.array(['1',2,3,4]).dtype

dtype('<U11')
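
The casting rule can be checked directly, and the dtype can also be chosen explicitly instead of being inferred. The following is a minimal sketch; the variable names (`mixed`, `as_int`, `forced`) are made up for this example.

```python
import numpy as np

# Integers and floats together: the integers are promoted to float64.
mixed = np.array([1, 2.5, 3])

# astype returns a new array with the requested dtype;
# converting floats to integers drops the fractional part.
as_int = mixed.astype(np.int64)

# The dtype can also be fixed when the array is created.
forced = np.array([1, 2, 3], dtype=np.float32)

print(mixed.dtype, as_int, forced.dtype)
```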

## Axes and Shape

Numpy arrays can represent any number of dimensions. Dimensions are called axes, and the number of elements in each axis determines the shape of the array:

In [None]:
arr = np.array([[1,2,3], [4,5,6]])
arr

This array has two axes: the first, ranging over the rows of the matrix, has size 2; the second, ranging over the columns, has size 3.

In [None]:
arr.shape
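
The same attributes are available for arrays with more axes. As an illustrative sketch (the name `cube` and the shape `(2, 3, 4)` are arbitrary choices), `ndim` gives the number of axes, `shape` the size of each axis, and `size` the total number of elements:

```python
import numpy as np

# An array with three axes of sizes 2, 3 and 4, filled with zeros.
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # number of axes
print(cube.shape)  # size of each axis
print(cube.size)   # total number of elements (2 * 3 * 4)
```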

### Exercises

- Write a function that takes a list of numbers and returns a numpy array with the same elements.
- Write a numpy array with shape (2, 2, 2) and fill it with numbers of your choice.

## Creating arrays

There are many different ways to create an array:

In [None]:
np.array([1, 2, 3])

In [None]:
np.zeros(2)

In [None]:
np.ones(3)

`arange` is similar to the built-in `range`, but returns an array instead of a `range` object:

In [None]:
np.arange(1, 10, 2)

linspace creates an array with a specified number of elements, and spaced equally between the specified beginning and end values (extremes included).

In [None]:
np.linspace(1, 10, 4)
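
The main practical difference between the two functions is how the end value is treated: `arange` takes a step and excludes the end value, while `linspace` takes a number of points and includes it. A small sketch with arbitrarily chosen values (the names `stepped` and `spaced` are illustrative):

```python
import numpy as np

# Step of 0.25 starting from 0; the end value 1 is excluded.
stepped = np.arange(0, 1, 0.25)

# Five equally spaced points from 0 to 1; both extremes are included.
spaced = np.linspace(0, 1, 5)

print(stepped)
print(spaced)
```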

### Exercises

- create an array containing the first 10 positive even numbers using only the `arange` function
- create an array containing the first 10 positive even numbers using only the `linspace` function


## Indexing and slicing

You can index and slice NumPy arrays in the same ways you can slice Python lists.

In [None]:
data = np.array([1, 2, 3])

In [None]:
data[1]

In [None]:
data[0:2]

In [None]:
data[1:]


In [None]:
data[-2:]
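
On arrays with more than one axis, you can give one index or slice per axis, separated by commas. A minimal sketch on a small 2x4 array (the names `grid`, `element` and `middle` are made up for this example):

```python
import numpy as np

grid = np.array([[0, 1, 2, 3],
                 [4, 5, 6, 7]])

# One index per axis: row 1, column 2.
element = grid[1, 2]

# A slice per axis: all rows, columns 1 and 2.
middle = grid[:, 1:3]

print(element)
print(middle)
```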

### Exercises

- Create a 2D array with shape (3, 3) and fill it with numbers from 1 to 9. If you already know how to use the reshape function, you can use it.
- then print:
  - the first row
  - the first column
  - the lower right 2x2 submatrix in two different ways


## Changing the array

Once an array is created, its size is fixed: you cannot add or remove elements from an axis. You can, however, build new arrays that fulfil your requirements.

### Array concatenation

To concatenate two arrays, you can use the concatenate function. It takes a tuple or list of arrays as its first argument, and the axis along which the arrays should be concatenated as its second argument.

In [None]:
a = np.array([1,2,3,4])
b = np.array([5,6,7,8])

If you concatenate two one-dimensional arrays, the result is a longer one-dimensional array. By default, the arrays are concatenated along the first axis:

In [None]:
np.concatenate((a,b))

In [None]:
a = np.array(
    [ [1,2],
      [3,4],
      [5,6] ])

b = np.array(
    [ [7,8],
      [9,10],
      [11,12] ])

If you concatenate along the first axis, you will be concatenating the rows of the two arrays.

In [None]:
np.concatenate((a,b), axis=0)

If you concatenate along the second axis, you will be concatenating the columns of the two arrays.

**Note**: the arrays must have the same size along every axis except the one along which you concatenate.

In [None]:
np.concatenate((a,b), axis=1)
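
In general, `np.concatenate` requires the arrays to agree in size on every axis other than the one being joined, and raises a `ValueError` otherwise. The sketch below illustrates this with two arrays of shapes `(3, 2)` and `(1, 2)`; the names `tall`, `short`, `stacked` and `failed` are hypothetical.

```python
import numpy as np

tall = np.ones((3, 2))
short = np.zeros((1, 2))

# Both arrays have 2 columns, so they can be joined along axis 0.
stacked = np.concatenate((tall, short), axis=0)
print(stacked.shape)

# They have a different number of rows, so joining along axis 1 fails.
try:
    np.concatenate((tall, short), axis=1)
    failed = False
except ValueError as err:
    failed = True
    print("cannot concatenate:", err)
```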


To delete a row or column, you can use the delete function. It takes the array to delete from as its first argument, the index of the row/column (or a slice of those indices) to delete as its second argument, and the axis along which to delete the element as its third argument.


In [None]:

ab = np.concatenate((a,b,a,b), axis=1)
ab

In [None]:
np.delete(ab, 1, axis=0)

In [None]:
np.delete(ab, slice(1,3), axis=1)
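
`np.delete` also accepts a list of indices, and it leaves the original array untouched. A short sketch, using a small matrix `m` invented for the example:

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])

# Drop columns 0 and 2; the result is a new array.
trimmed = np.delete(m, [0, 2], axis=1)

print(trimmed)
print(m.shape)  # the original array still has all its columns
```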

## Reshaping

Arrays can be reshaped using the `.reshape` method. The only constraint is that the number of elements implied by the new shape matches the number of elements in the array.

In [None]:
a

In [None]:
a.reshape([2,3])

As usual, the `reshape` method does not modify the original array. It returns a new array object (a view on the same data whenever possible).

In [None]:
a

In [None]:
a.reshape([6,1])

In [None]:
a.reshape([1,6])

To create a one-dimensional array from a matrix (e.g., from a row or column vector), you can use the `flatten` method. First, build a matrix and reshape it into a row vector:

In [None]:
a = np.arange(1,10).reshape(3,3)
a

In [None]:
rowvec = a.reshape(1,9)
rowvec

This is a row vector (i.e., a 2-dimensional matrix having only one row), not a proper one-dimensional vector.

To get a proper one-dimensional vector, you can use the `flatten` method:

In [None]:
rowvec.flatten()

Alternatively, you can pass to `reshape` a tuple of length 1 containing the number of elements of the original array:

In [None]:
a.reshape((9,))
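
When one dimension is given as `-1`, `reshape` infers it from the number of elements. The sketch below also shows that `flatten` returns a copy, so changing the flattened array does not affect the original. The names `m`, `flat` and `copy` are illustrative.

```python
import numpy as np

# 12 elements and 3 rows: the number of columns (4) is inferred.
m = np.arange(12).reshape(3, -1)

# A single -1 flattens the array to one dimension.
flat = m.reshape(-1)

# flatten always returns a copy of the data.
copy = m.flatten()
copy[0] = 99

print(m.shape, flat.shape)
print(m[0, 0])  # unchanged by the assignment to copy[0]
```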

### Exercises

- create a 2D array with shape (8,8) containing the numbers from 1 to 64
- from the previous array, create a 1D array containing the elements of the upper right submatrix of size 3x3

## Adding an axis

Axes can be added to an array using np.newaxis. As usual the operation will create a new array with the specified dimensions.

In [None]:
a =  np.array([1,2,3,4,5])
a.shape

We can change this 1D vector to a row or a column vector by adding a new axis.

Specifically, if you add the new axis as the first axis, the resulting vector will have shape `(1, len(a))`:

In [None]:
a[np.newaxis, :]

If you add the new axis as the second axis, the resulting vector will have shape `(len(a), 1)`:

In [None]:
a[:, np.newaxis]

`reshape` can often be used instead of `np.newaxis`: for instance, `a.reshape(1, -1)` is equivalent to `a[np.newaxis, :]`. However, `np.newaxis` is more flexible, since `reshape` can only infer one unknown dimension (`-1`) at a time.

In [None]:
a = np.linspace(1, 9, 9).reshape(3,3)
# a.reshape(-1, -1, 1) # This is invalid
a[:, :, np.newaxis]
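
The effect of `np.newaxis` is easiest to see by looking at the resulting shapes. A minimal sketch on a one-dimensional array (the names `v`, `row`, `col` and `same` are made up for this example):

```python
import numpy as np

v = np.array([10, 20, 30])

row = v[np.newaxis, :]   # new first axis
col = v[:, np.newaxis]   # new second axis

# For a 1D array, reshape(1, -1) gives the same result as v[np.newaxis, :].
same = np.array_equal(row, v.reshape(1, -1))

print(v.shape, row.shape, col.shape, same)
```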

### Exercises

- create a 2D array with shape (3,3) containing the numbers from 1 to 9
- reshape it to a 3D array with shape (3, 3, 1), assuming that you know the number of rows and columns
- reshape it to a 3D array with shape (3, 1, 3), assuming that the number of rows and the number of columns are unknown

## Basic operations

`+`, `-`, `*`, `/` are defined over arrays and implement the corresponding elementwise operations:

In [None]:
a = np.arange(10).reshape(5,2)
b = np.arange(2, 12).reshape(5,2)
a,b

In [None]:
a+b

In [None]:
a-b

In [None]:
a*b

In [None]:
a/b

Matrix multiplication can be done using the `@` operator (or the `np.matmul` function). Matrix transposition can be done using the `.T` attribute.

In [None]:
a = np.arange(9).reshape(3,3)
aT = a.T

a,aT

In [None]:
a @ aT, np.matmul(a, aT)
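
It is easy to confuse `*` with `@`: the former multiplies the arrays element by element, while the latter computes the matrix product. A small sketch with two 2x2 matrices, `p` and `q`, chosen only for illustration:

```python
import numpy as np

p = np.array([[1, 2],
              [3, 4]])
q = np.array([[0, 1],
              [1, 0]])

elementwise = p * q  # each entry of p times the matching entry of q
matrix = p @ q       # rows of p times columns of q

print(elementwise)
print(matrix)
```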

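Matrix inversion can be done using the `np.linalg.inv` function. The matrix must be square and non-singular. Note that `aT @ a` from the cells above is singular (`a` has rank 2), so a different, invertible matrix is used here.


In [None]:
m = np.array([[2., 1.], [1., 3.]])  # invertible: its determinant is 5
np.linalg.inv(m)

In [None]:
np.linalg.matrix_power(m, -1)  # same result as np.linalg.inv(m)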
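
A computed inverse can be sanity-checked by multiplying it with the original matrix and comparing the product with the identity. Because of floating point rounding, the comparison is done with `np.allclose` rather than with `==`. The matrix `m` below is an arbitrary invertible example.

```python
import numpy as np

m = np.array([[4., 7.],
              [2., 6.]])  # determinant is 10, so m is invertible

m_inv = np.linalg.inv(m)

# The product should be the identity, up to rounding errors.
ok = np.allclose(m @ m_inv, np.eye(2))
print(ok)
```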

### Exercises

- create a 2D array A with shape (3,3) containing the values of the expression $n^2$ (where $n$ is the index of the element in the flattened array);
- create a 2D array B with shape (3,3) containing the values of the expression $3n$;
- create a final array with shape (3,3) containing the values of the expression $n^2 + 3n + 4$.


## Other useful operations

Many useful operations are defined on numpy arrays:

- max
- min
- sum

If no axis is provided, these functions work as if the array were flat. Otherwise, they operate along the given axis.

In [None]:
a = np.arange(9).reshape(3,3)
a

In [None]:
a.min()

In [None]:
a.min(axis=0)

In [None]:
a.min(axis=1)

In [None]:
a.sum()

In [None]:
a.sum(axis=0)

In [None]:
a.sum(axis=1)
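
A non-square array makes it easier to see which axis is being collapsed: reducing along `axis=0` leaves one value per column, reducing along `axis=1` leaves one value per row. The array `t` below is an illustrative 2x3 example.

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 6]])

col_sums = t.sum(axis=0)  # collapses the rows: one sum per column
row_sums = t.sum(axis=1)  # collapses the columns: one sum per row
total = t.sum()           # no axis: sum of all elements

print(col_sums, row_sums, total)
```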

## Pandas

Pandas is a library that simplifies handling tabular data. Common tasks that pandas handles well include:
- reading/writing data from common formats (csv, excel, latex, xml, sql, ...)
- reshaping
- filtering
- aggregating
- merging/joining
- plotting
- ...

In [None]:
import pandas as pd

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its **index**.

### Creation

In [None]:
# From a list
s = pd.Series([1, 2, 3, 4])
s

In [None]:
# From a numpy array
s = pd.Series(np.array([1, 2, 3, 4]))
s

In [None]:
# From a list, with custom index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s
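
A series can also be built from a dictionary, in which case the keys become the index labels. A minimal sketch (the name `prices` and its contents are invented for the example):

```python
import pandas as pd

# Keys become the index, values become the data.
prices = pd.Series({'apple': 1.5, 'pear': 2.0})

print(prices)
print(list(prices.index))
```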

### Indexing and slicing series

When slicing a series, the resulting series keeps the index labels of the selected elements. If you select exactly one element, the result is a scalar rather than a series.

In [None]:
s.iloc[0]

In [None]:
s.iloc[1:3]

In [None]:
s.iloc[[0, 2]]

In [None]:
# Indexing and slicing a series with a custom index

s['a']

In [None]:
s['b':'c']

In [None]:
s[['a', 'c']]

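To access the last element of a series, **you cannot rely on the negative index -1**: pandas' `[]` operator is primarily label-based, so `s[-1]` is not a reliable way to count from the end (with the default integer index it raises a `KeyError`).

Instead, you can use the `tail()` method (which returns a series of length one) or the `iloc` attribute (which returns the element itself):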

In [None]:
s.tail(1)

In [None]:
s.iloc[-1]
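
The difference between labels and positions is clearest when the index contains integers that do not match the positions. In the sketch below (the series and the variable names are made up for this example), `loc` looks up the label while `iloc` counts positions:

```python
import pandas as pd

# Integer labels in reverse order with respect to the positions.
s_int = pd.Series([10, 20, 30], index=[2, 1, 0])

by_label = s_int.loc[0]      # element whose label is 0
by_position = s_int.iloc[0]  # element in the first position
last = s_int.iloc[-1]        # element in the last position

print(by_label, by_position, last)
```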

More about the `loc` and `iloc` attributes in the next section.

## DataFrames

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.


### Creation

In [None]:
# From a dictionary
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df

In [None]:
# From a list of dictionaries
df = pd.DataFrame([{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}])
df

In [None]:
# From a list of lists or a numpy array
df = pd.DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2'])
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['col1', 'col2'])
df

In [None]:
# From a list of tuples
df = pd.DataFrame([(1, 3), (2, 4)], columns=['col1', 'col2'])
df
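
Since a DataFrame can be thought of as a dict of Series sharing the same index, it can also be built that way. A short sketch with two hypothetical series, `a` and `b`, of different types:

```python
import pandas as pd

a = pd.Series([1, 2], index=['x', 'y'])
b = pd.Series([3.0, 4.0], index=['x', 'y'])

# Each series becomes a column; the shared labels become the row index.
frame = pd.DataFrame({'a': a, 'b': b})

print(frame)
print(frame.dtypes)  # each column keeps its own type
```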

### Reading data

The function `read_csv` is the most common way to read data from a csv file. It takes as input the path to the file and returns a DataFrame. It automatically adds a row index to the DataFrame.

In [None]:
print(open('data.csv').read())

In [None]:
data = pd.read_csv('data.csv')
display(data)

The `to_csv` method writes a DataFrame to a csv file:

In [65]:
data.to_csv('data_w_index.csv', index=True)

The output file will have an extra column containing the index:

In [None]:

print(open('data_w_index.csv', 'r').read())


If you read the file back, the index you just saved comes back as an ordinary column (named `Unnamed: 0`), in addition to the row index that pandas adds automatically:

In [None]:
data = pd.read_csv('data_w_index.csv')
data


To use the first column of the csv file as the row index, pass `index_col=0`:


In [None]:

data = pd.read_csv('data_w_index.csv', index_col=0)
display(data)


To write the csv file without the index, use `index=False`. If you also want to remove the column names, use `header=False`.

In [None]:

data.to_csv('data_wo_index.csv', index=False)
print(open('data_wo_index.csv','r').read())
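
The whole round trip can be reproduced without touching the disk: when no path is given, `to_csv` returns the csv text as a string, and `read_csv` can read from an `io.StringIO` buffer. The sketch below uses a small made-up dataframe (`frame`) instead of `data.csv`.

```python
import io
import pandas as pd

frame = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

# Without a path, to_csv returns the csv content as a string.
with_index = frame.to_csv(index=True)
without_index = frame.to_csv(index=False)

# Reading back naively turns the saved index into an extra column.
naive = pd.read_csv(io.StringIO(with_index))

# index_col=0 uses the first column as the row index instead.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)

print(list(naive.columns))
print(list(restored.columns))
```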


### Exercise

- read the file `data.csv` and save the first 3 rows in a new file `data_first3.csv`. The output file should be in the same format as the input file.


### Indexing and slicing dataframes


In [None]:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]})
display(df)


If you use column names to index a dataframe, pandas will return the corresponding columns.

In [None]:

print("df['col1']")
display(df['col1'])

print("df[['col1', 'col2']]")
display(df[['col1', 'col2']])


If you use a slice of integers, pandas will return the corresponding rows:

In [None]:

print("df[0:2]")
display(df[0:2])

If you use a list of booleans, pandas will return the corresponding rows.

In [None]:

df[[True, False, True, False]]



This is very powerful, because you can use boolean expressions to filter the rows of the dataframe. For example, you can select the rows where the value of `col1` is greater than 2:

In [None]:

df[df['col1'] > 2]

You can combine boolean expressions using the `&` (and) and `|` (or) operators.

**Note:** put each piece of the expression in parentheses: `&` and `|` bind more tightly than comparison operators such as `>` and `<`, so without parentheses the expression is parsed incorrectly and you will get an error.

In [None]:
df[(df['col1'] > 2) | (df['col3'] < 10)]
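
Conditions can also be negated with the `~` (not) operator, and a combined condition can be stored in a variable before it is used as a filter. A minimal sketch on a small dataframe (the names `frame`, `mask` and `selected` are illustrative):

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]})

# Rows where col1 is greater than 1 AND col2 is NOT equal to 8.
mask = (frame['col1'] > 1) & ~(frame['col2'] == 8)
selected = frame[mask]

print(mask.tolist())
print(selected)
```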

If you want to select a subset of both the rows and the columns, you can use the `loc` attribute (to select by index label) or the `iloc` attribute (to select by position):

In [None]:
df.loc[0:2, 'col1':'col2']

In [None]:
df.iloc[0:2, 0:2]

**Note:** when you use the loc attribute, the first element of the tuple is the index of the row, the second element is the name of the column. When you use the iloc attribute, the first element of the tuple is the position of the row, the second element is the position of the column.

Indexing using the **iloc** attribute follows python's slicing rules, so you can use the : operator to select a range of rows or columns. Keep in mind that ranges are always exclusive on the right side. On the contrary, when you use the **loc** attribute, results will include the right side of the range.

**In general when using the index, the right side of the range is included, while when using the position, the right side is excluded.**
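
The different treatment of the right end of a range can be verified by counting the selected rows. A short sketch on a dataframe with the default integer index (the names `frame`, `by_label` and `by_position` are made up for this example):

```python
import pandas as pd

frame = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8]})

by_label = frame.loc[0:2, 'col1']   # labels 0, 1 and 2: right end included
by_position = frame.iloc[0:2, 0]    # positions 0 and 1: right end excluded

print(len(by_label), len(by_position))
```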

### Exercises

- read the file `data.csv` and use it to build a new dataframe containing only the rows where `y` and `value` are both even and `name` is not 'origin'.

### Statistics

Pandas provides a lot of useful functions to compute statistics on dataframes:
- mean
- median
- std
- var
- min
- max
- sum
- count
- ...

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris

To evaluate numerical statistics, we have to discard the non-numeric columns. We can do this using the select_dtypes method.

In [None]:
iris.select_dtypes('float').mean() # in this case this is equivalent to iris.iloc[:, :-1].mean()


In [None]:
iris.select_dtypes('float').median()

To compute aggregate statistics on the dataframe, we can use the describe method.

In [None]:
iris.describe()

To compute a set of specific statistics, we can use the agg method.

In [None]:
iris.select_dtypes('float').agg(['mean', 'std'])

You can group the rows of a dataframe using the groupby method. This method returns a GroupBy object, which can be used to compute aggregate statistics on the groups.

In [None]:
iris.groupby('species').agg(['mean', 'std'])
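
The behaviour of `groupby` is easier to follow on a tiny table where the result can be checked by hand. The sketch below uses an invented dataframe (`toy`) with two groups and computes the mean of each group:

```python
import pandas as pd

toy = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 10.0, 20.0],
})

# Split the rows by 'group', then average 'value' within each group.
means = toy.groupby('group')['value'].mean()

print(means)
```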

### Plotting

`iris.plot()` will plot all the numeric columns of the dataframe, using the index as the x axis.

In [None]:
import matplotlib.pyplot as plt

plt.close('all')
iris.plot()
plt.show()

More information about chart visualization here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

## Convert to numpy array

To convert a dataframe to a numpy array, use the `.values` attribute (or the `.to_numpy()` method, which the pandas documentation recommends):

In [None]:
iris.select_dtypes('float').values
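
The conversion can be tried on a small dataframe without downloading anything. In this sketch the dataframe `frame` and its column names are invented; the non-numeric column is discarded before converting, so the resulting array has a single float dtype.

```python
import pandas as pd

frame = pd.DataFrame({
    'height': [1.5, 2.5],
    'width': [3.0, 4.0],
    'label': ['a', 'b'],
})

# Keep only the float columns, then convert them to a numpy array.
matrix = frame.select_dtypes('float').to_numpy()

print(matrix.shape, matrix.dtype)
```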

### Exercises

On the iris dataset perform the following operations:
1. Select only the rows belonging to 'setosa', then evaluate the mean and standard deviation of each column
2. Select only the rows belonging to 'setosa' **or** 'versicolor', then evaluate the mean and standard deviation of each column, grouping the results by species
3. Add a target variable by converting 'species' to numerical values (0,1,2), and evaluate the least squares solution of the resulting linear regression problem.
    - Note: the least squares formula is: $w = (X^T X)^{-1} X^T y$
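4. Compute the predictions of the model on the training set ($\hat{y} = X w$), and evaluate the accuracy of the model on the training set. Note that the expression $\frac{1}{n}\sum_{i=1}^n |y_i - \hat{y}_i|$ measures the mean absolute error; an accuracy is obtained by rounding each $\hat{y}_i$ to the nearest class label and computing the fraction of samples for which it equals $y_i$.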

If everything is ok, you should be able to get an accuracy of 96% (on the training set).
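
The least squares formula can be tried out on a tiny synthetic problem before applying it to the iris dataset. In this sketch the data (`X`, `y`) are invented for illustration, and the column of ones in `X` is an assumption that adds an intercept term; it is not a solution to the exercise above.

```python
import numpy as np

# Three samples: a column of ones (intercept) and one feature.
X = np.array([[1., 0.],
              [1., 1.],
              [1., 2.]])
y = np.array([1., 3., 5.])  # generated as y = 1 + 2 * feature

# Normal equation: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# Predictions on the same data used for fitting.
y_hat = X @ w

print(w)
print(y_hat)
```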


In [None]:
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(url, names=[
                   'sepal length', 'sepal width', 'petal length', 'petal width', 'species'])
iris
