## NumPy

[https://numpy.org](https://numpy.org)

Fast and versatile, NumPy's vectorization, indexing, and broadcasting concepts are today's de facto standards for array computing.

In [1]:
import numpy as np

## Arrays

An **array** is the central data structure of the NumPy library. It is a grid of values that carries information about the raw data, how to locate an element, and how to interpret it. Its elements can be indexed in various ways. **All elements have the same type, known as the dtype**.

In [3]:
np.array([1,2,3,4]).dtype

dtype('int64')

In [4]:
np.array([1.0,2,3,4]).dtype

dtype('float64')

In [8]:
np.array(['1',2,3,4]).dtype

dtype('<U21')

## Axes and shape

NumPy arrays can have any number of dimensions. The dimensions are called axes, and the number of elements along each axis determines the shape of the array:

In [13]:
arr = np.array([[1,2,3], [4,5,6]])
arr

array([[1, 2, 3],
       [4, 5, 6]])

This array has two axes: the first, which runs across the rows of the matrix, has length 2, while the second has length 3.

In [10]:
arr.shape

(2, 3)
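As a small sketch of how axes and shape relate, the attributes `ndim`, `shape`, and `size` can be inspected together. The name `grid` is only an illustrative choice.

```python
import numpy as np

# A 2 x 3 array: 2 rows (first axis), 3 columns (second axis).
grid = np.array([[1, 2, 3], [4, 5, 6]])

print(grid.ndim)   # number of axes
print(grid.shape)  # length of each axis
print(grid.size)   # total number of elements (product of the shape)
```

Here `size` equals the product of the entries of `shape`, which is why reshaping can never change the number of elements.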

### Converting function arguments into a NumPy array

In [11]:
def list_2_np_array(*args):
    print(args)
    return np.array(args)  # Convert the arguments directly into a NumPy array

list_np = list_2_np_array(1, 2, 3, 4, 5, 6, 7)
print(list_np)

(1, 2, 3, 4, 5, 6, 7)
[1 2 3 4 5 6 7]


### Converting a list into a NumPy array

In [15]:
def array_2_np_array(lst):
    return np.array(lst)  # Convert a list into a NumPy array

num_list = [1, 2, 3, 4, 5]
print(type(num_list))
array_np = array_2_np_array(num_list)
print(array_np)
print(type(array_np))

<class 'list'>
[1 2 3 4 5]
<class 'numpy.ndarray'>


### Creating a NumPy array with shape (2, 2, 2)

In [16]:
# exercise 2

arr = np.array([[[1,2],[1,2]], [[4,5],[4,5]]])
arr.shape


(2, 2, 2)

## Creating arrays

There are many different ways to create an array.

In [17]:
np.array([1, 2, 3])

array([1, 2, 3])

In [18]:
np.zeros(2)

array([0., 0.])

In [19]:
np.ones(3)

array([1., 1., 1.])

`arange` is similar to Python's built-in `range` function, but it returns an array instead of a `range` object. As with `range`, the stop value is excluded:

In [20]:
np.arange(1, 10, 1)

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

`linspace` creates an array with a specified number of elements, evenly spaced between the given start and end values (both endpoints included).

In [21]:
np.linspace(1, 10, 3)

array([ 1. ,  5.5, 10. ])
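The contrast between the two functions can be sketched as follows: `linspace` is driven by the number of samples and includes the end value, while `arange` is driven by the step and excludes it. The `retstep=True` argument is used here only to reveal the spacing that `linspace` computed.

```python
import numpy as np

# With retstep=True, linspace also returns the spacing between samples.
values, step = np.linspace(1, 10, 3, retstep=True)
print(values, step)

# arange excludes the stop value, so 10 is not part of the result.
stepped = np.arange(1, 10, 3)
print(stepped)
```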

### Exercise 1
- Create an array containing the first 10 positive even numbers using only the `arange` function.

In [22]:
np.arange(2, 21, 2)

array([ 2,  4,  6,  8, 10, 12, 14, 16, 18, 20])

### Exercise 2
- Create an array containing the first 10 positive even numbers using only the `linspace` function.

In [23]:
np.linspace(2, 20, 10)

array([ 2.,  4.,  6.,  8., 10., 12., 14., 16., 18., 20.])

## Indexing and slicing

You can index and slice NumPy arrays in the same way you slice Python lists.

Note: slicing means extracting part of a sequence, such as a list or an array, by specifying a range of elements. The syntax defines the start, the end (excluded) and, optionally, the step of the selection.

In [28]:
data = np.array([1, 2, 3, 4])

In [35]:
elemento = data[1]

print(elemento)  # Output: 2

2


In [36]:
data[0:2]

array([1, 2])

In [37]:
data[1:]


array([2, 3, 4])

In [38]:
data[-2:]

array([3, 4])
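The optional step of a slice is not shown in the cells above, so here is a minimal sketch of it on the same values. The variable names are illustrative.

```python
import numpy as np

data = np.array([1, 2, 3, 4])

every_other = data[::2]     # start:stop:step, taking every second element
reversed_data = data[::-1]  # a negative step walks the array backwards

print(every_other)
print(reversed_data)
```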

### Exercises

- Create a 2D array with shape (3, 3) and fill it with the numbers from 1 to 9.
- Then print:
  - the first row
  - the first column
  - the 2x2 submatrix in the lower-right corner, in two different ways

In [40]:
arr = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [41]:
#first row
arr[0]

array([1, 2, 3])

In [42]:
#first column
arr[:, 0]

array([1, 4, 7])

In [43]:
#first way lower right matrix 2x2
arr[1:3, 1:3]

array([[5, 6],
       [8, 9]])

In [44]:
#second way
arr[1:,1:]

array([[5, 6],
       [8, 9]])

## Modifying the array

Once created, an array has a fixed size: you cannot add or remove elements along an axis in place. You can, however, build new arrays that suit your needs.
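A minimal sketch of what "building a new array" means in practice, using `np.append` as one possible way to do it: the call returns a fresh array and leaves its input untouched.

```python
import numpy as np

original = np.array([1, 2, 3])

# np.append does not modify `original`; it builds and returns a new array.
extended = np.append(original, 4)

print(original)  # still three elements
print(extended)  # four elements
```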

### Concatenating arrays

To concatenate two arrays, use the `concatenate` function. Its first argument is a tuple or list of arrays, and its second argument is the axis along which the arrays are joined.

In [45]:
a = np.array([1,2,3,4])
b = np.array([3,6,7,8])

If you concatenate two arrays with the same shape, the result is a larger array with the same number of dimensions. By default, arrays are concatenated along the first axis:

In [46]:
np.concatenate((a,b))

array([1, 2, 3, 4, 3, 6, 7, 8])

In [47]:
a = np.array(
    [ [1,2], 
      [3,4], 
      [5,6] ])

b = np.array(
    [ [7,8], 
      [9,10], 
      [11,12] ])

Concatenating along the first axis stacks the rows of the two arrays.

In [53]:
np.concatenate((a,b), axis=0)

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12]])

Concatenating along the second axis places the columns of the two arrays side by side.

**Note**: the arrays must have the same size along every axis except the one being concatenated.

In [100]:
np.concatenate((a,b), axis=1)

array([[ 1,  2,  7,  8],
       [ 3,  4,  9, 10],
       [ 5,  6, 11, 12]])
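The shape requirement of `concatenate` can be seen with two arrays of different heights. This is only a sketch; `tall` and `short` are illustrative names, and the point is that every axis other than the one being joined has to match.

```python
import numpy as np

tall = np.ones((3, 2))   # 3 rows, 2 columns
short = np.ones((2, 2))  # 2 rows, 2 columns

# Stacking rows works: both arrays have 2 columns.
stacked = np.concatenate((tall, short), axis=0)
print(stacked.shape)

# Placing them side by side fails: the row counts (3 and 2) differ.
try:
    np.concatenate((tall, short), axis=1)
except ValueError as err:
    print("cannot concatenate:", err)
```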

To delete a row or a column, use the `delete` function. Its first argument is the array to delete from, its second argument is the index of the row/column (or a slice of indices) to delete, and its third argument is the axis along which to delete. Like `concatenate`, it returns a new array.

In [54]:
ab = np.concatenate((a,b,a,b), axis=1)
ab

array([[ 1,  2,  7,  8,  1,  2,  7,  8],
       [ 3,  4,  9, 10,  3,  4,  9, 10],
       [ 5,  6, 11, 12,  5,  6, 11, 12]])

In [59]:
np.delete(ab, 0, axis=0)

array([[ 3,  4,  9, 10,  3,  4,  9, 10],
       [ 5,  6, 11, 12,  5,  6, 11, 12]])

In [60]:
np.delete(ab, slice(1,3), axis=1)

array([[ 1,  8,  1,  2,  7,  8],
       [ 3, 10,  3,  4,  9, 10],
       [ 5, 12,  5,  6, 11, 12]])

## Reshaping

Arrays can be reshaped with the `.reshape` method. The only constraint is that the number of elements implied by the new shape must match the number of elements in the array.

In [61]:
a

array([[1, 2],
       [3, 4],
       [5, 6]])

In [62]:
a.reshape([2,3])

array([[1, 2, 3],
       [4, 5, 6]])

As usual, `reshape` returns a new array object and leaves the shape of the original array unchanged.

In [63]:
a

array([[1, 2],
       [3, 4],
       [5, 6]])

In [64]:
a.reshape([6,1])

array([[1],
       [2],
       [3],
       [4],
       [5],
       [6]])

In [65]:
a.reshape([1,6])

array([[1, 2, 3, 4, 5, 6]])
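The element-count constraint can be sketched in both directions: a `-1` lets NumPy infer one axis from the total number of elements, while a shape that does not account for every element is rejected.

```python
import numpy as np

a = np.array([[1, 2], [3, 4], [5, 6]])  # 6 elements

# -1 asks NumPy to infer that axis from the total number of elements.
inferred = a.reshape(2, -1)
print(inferred.shape)

# 6 elements cannot fill a 4 x 2 grid, so this raises ValueError.
try:
    a.reshape(4, 2)
except ValueError as err:
    print("invalid shape:", err)
```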

To create a one-dimensional array from a matrix (for example, from a column vector), you can use the `flatten` method:

In [66]:
a = np.arange(1,10)
a

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [67]:
a = a.reshape(3,3)
a

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [68]:
rowvec = a.reshape(1,9)
rowvec

array([[1, 2, 3, 4, 5, 6, 7, 8, 9]])

This is a row vector (that is, a two-dimensional matrix with a single row), not a true one-dimensional vector.

To obtain a true one-dimensional vector, you can use the `flatten` method.

In [69]:
rowvec.flatten()

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

Alternatively, pass the `reshape` method a tuple of length 1 containing the number of elements of the original array.

In [116]:
a.reshape((9,))

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

### Exercises

#### Exercise 1
Create a 2D array with shape (8, 8) containing the numbers from 1 to 64.

In [72]:
array = np.arange(1,65)
array = array.reshape(8, 8)
array

array([[ 1,  2,  3,  4,  5,  6,  7,  8],
       [ 9, 10, 11, 12, 13, 14, 15, 16],
       [17, 18, 19, 20, 21, 22, 23, 24],
       [25, 26, 27, 28, 29, 30, 31, 32],
       [33, 34, 35, 36, 37, 38, 39, 40],
       [41, 42, 43, 44, 45, 46, 47, 48],
       [49, 50, 51, 52, 53, 54, 55, 56],
       [57, 58, 59, 60, 61, 62, 63, 64]])

#### Exercise 2
From the previous array, create a 1D array containing the elements of the 3x3 submatrix in the upper-right corner.

In [74]:
array = np.arange(1,65)
array = array.reshape(8, 8)
array = array[0:3,5:]
array.flatten()

array([ 6,  7,  8, 14, 15, 16, 22, 23, 24])

## Adding an axis

Axes can be added to an array using `np.newaxis`. As usual, the operation returns a new array with the requested dimensions.

In [124]:
a =  np.array([1,2,3,4,5])
a.shape

(5,)

We can turn this 1D vector into a row or a column vector by adding a new axis.

Specifically, if you add the new axis as the first axis, the result has shape `(1, len(a))`:

In [125]:
a[np.newaxis, :]

array([[1, 2, 3, 4, 5]])

If you add the new axis as the second axis, the result has shape `(len(a), 1)`:

In [126]:
a[:, np.newaxis]

array([[1],
       [2],
       [3],
       [4],
       [5]])

`reshape` can often be used instead of `np.newaxis`: for instance, `a.reshape(1, -1)` is equivalent to `a[np.newaxis, :]`. However, `np.newaxis` is more flexible, because `reshape` can infer only one unknown dimension (`-1`) at a time.
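For the 1D case, the equivalence between `reshape` and `np.newaxis` can be checked directly. This is a small sketch with illustrative variable names.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

row_newaxis = a[np.newaxis, :]
row_reshape = a.reshape(1, -1)
col_newaxis = a[:, np.newaxis]
col_reshape = a.reshape(-1, 1)

# Same values and same shapes, whichever spelling is used.
print(np.array_equal(row_newaxis, row_reshape))
print(np.array_equal(col_newaxis, col_reshape))
print(row_newaxis.shape, col_newaxis.shape)
```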

In [128]:
a = np.linspace(1, 9, 9)
a

array([1., 2., 3., 4., 5., 6., 7., 8., 9.])

In [129]:
a = a.reshape(3,3)
a

array([[1., 2., 3.],
       [4., 5., 6.],
       [7., 8., 9.]])

In [130]:
# a.reshape(-1, -1, 1) is invalid: reshape accepts at most one -1
a[:, :, np.newaxis]

array([[[1.],
        [2.],
        [3.]],

       [[4.],
        [5.],
        [6.]],

       [[7.],
        [8.],
        [9.]]])

### Exercises

- create a 2D array with shape (3,3) containing the numbers from 1 to 9
- reshape it to a 3D array with shape (3, 3, 1), assuming you know the number of rows and columns
- reshape it to a 3D array with shape (3, 1, 3) assuming that the number of rows and the number of columns are unknown

## Basic operations

The operators `+`, `-`, `*`, and `/` are defined on arrays and perform the corresponding elementwise operations:

In [None]:
a = np.arange(10).reshape(5,2)
b = np.arange(2, 12).reshape(5,2)
a,b

In [None]:
a+b

In [None]:
a-b

In [None]:
a*b

In [None]:
a/b
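Since the cells above are shown without output, here is a smaller sketch with values that are easy to verify by hand. The arrays `x` and `y` are illustrative and not part of the original notebook.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])
y = np.array([[10, 20], [30, 40]])

total = x + y     # each element of x added to the matching element of y
product = x * y   # elementwise product, not a matrix product
ratio = y / x     # true division yields floats even for integer inputs

print(total)
print(product)
print(ratio)
```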

Matrix multiplication can be done using the `@` operator (or the `np.matmul` function). Matrix transposition can be done using the `.T` attribute.

In [None]:
a = np.arange(9).reshape(3,3)
aT = a.T

a,aT

In [None]:
a @ aT, np.matmul(a, aT)

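Matrix inversion can be done using the `np.linalg.inv` function. The matrix must be invertible: `aT @ a` is singular for the `a` defined above, so the identity is added here to obtain an invertible matrix.

In [None]:
np.linalg.inv(aT @ a + np.eye(3))

In [None]:
np.linalg.matrix_power(aT @ a + np.eye(3), -1)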
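One way to sanity-check an inverse is to multiply it back and compare with the identity. The sketch below uses a small matrix `m` chosen for this example (it is not from the notebook) because its determinant is clearly non-zero.

```python
import numpy as np

m = np.array([[2.0, 1.0], [1.0, 3.0]])  # determinant 5, so invertible

m_inv = np.linalg.inv(m)

# The product of a matrix and its inverse should be (numerically) the identity.
print(np.allclose(m @ m_inv, np.eye(2)))

# matrix_power with exponent -1 computes the same inverse.
print(np.allclose(np.linalg.matrix_power(m, -1), m_inv))
```

`np.allclose` is used instead of `==` because floating-point results are rarely exact.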

### Exercises

- create a 2D array A with shape (3,3) containing the values of the expression $n^2$ (where $n$ is the index of the element in the flattened array);
- create a 2D array B with shape (3,3) containing the values of the expression $3n$;
- create a final array with shape (3,3) containing the values of the expression $n^2 + 3n + 4$.


## Other useful operations

Many useful operations are defined on numpy arrays:

- max
- min
- sum

If no axis is provided, these functions operate as if the array were flat. Otherwise, they reduce the array along the given axis.

In [None]:
a = np.arange(9).reshape(3,3)
a

In [None]:
a.min()

In [None]:
a.min(axis=0)

In [None]:
a.min(axis=1)

In [None]:
a.sum()

In [None]:
a.sum(axis=0)

In [None]:
a.sum(axis=1)
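The effect of the `axis` argument is easiest to see next to the results. The sketch below repeats some of the calls above on the same 3x3 array and keeps the results in illustrative variables.

```python
import numpy as np

a = np.arange(9).reshape(3, 3)

total = a.sum()            # all elements together
per_column = a.sum(axis=0)  # collapses the rows: one value per column
per_row = a.sum(axis=1)     # collapses the columns: one value per row
column_max = a.max(axis=0)

print(total)       # 36
print(per_column)  # [ 9 12 15]
print(per_row)     # [ 3 12 21]
print(column_max)  # [6 7 8]
```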

# Pandas

Pandas is a library that simplifies handling tabular data. Common tasks that pandas handles well are:
- reading/writing data from common formats (csv, excel, latex, xml, sql, ...)
- reshaping
- filtering
- aggregating
- merging/joining
- plotting
- ...

In [131]:
import pandas as pd

## Series

A Series is a one-dimensional array-like object containing an array of data (of any NumPy data type) and an associated array of data labels, called its **index**.

### Creation

In [134]:
# From a list
s = pd.Series([1, 2, 3, 4])
s

0    1
1    2
2    3
3    4
dtype: int64

In [135]:
# From a numpy array
s = pd.Series(np.array([1, 2, 3, 4]))
s

0    1
1    2
2    3
3    4
dtype: int32

In [136]:
# From a list, with custom index
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s

a    1
b    2
c    3
d    4
dtype: int64
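The two components of a Series, the data and the index, can be inspected separately. This is a minimal sketch on the same values as above.

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

labels = list(s.index)   # the labels
values = s.to_numpy()    # the underlying data as a NumPy array

print(labels)
print(values)
```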

### Indexing and slicing series

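When slicing a series, the result keeps the index labels of the selected elements. If you select exactly one element, the result is a scalar rather than a series. Note that integer keys such as `s[0]` are treated as positions only when the index is not made of integers, a fallback that recent pandas versions deprecate; `s.iloc[0]` is the explicit positional form.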

In [137]:
s[0]

1

In [138]:
s[1:3]

b    2
c    3
dtype: int64

In [139]:
s[[0, 2]]

a    1
c    3
dtype: int64

In [140]:
# Indexing and slicing a series with a custom index

s['a']

1

In [141]:
s['b':'c']

b    2
c    3
dtype: int64

In [142]:
s[['a', 'c']]

a    1
c    3
dtype: int64

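To access the last element of a series, **do not rely on the negative index -1** with the `[]` operator: on a series with an integer index, `-1` is looked up as a label and raises a `KeyError`.

Instead, use the `tail()` method (which returns a series) or the `iloc` indexer (which returns the element itself).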

In [143]:
s.tail(1)

d    4
dtype: int64

In [144]:
s.iloc[-1]

4

More about the `loc` and `iloc` indexers in the next section.

## DataFrames

A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can be a different value type (numeric, string, boolean, etc.). The DataFrame has both a row and column index; it can be thought of as a dict of Series all sharing the same index.


### Creation

In [145]:
# From a dictionary
df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
df

|  | col1 | col2 |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |


In [146]:
# From a list of dictionaries
df = pd.DataFrame([{'col1': 1, 'col2': 3}, {'col1': 2, 'col2': 4}])
df

|  | col1 | col2 |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |


In [148]:
# From a list of lists or a numpy array
df = pd.DataFrame([[1, 2], [3, 4]], columns=['col1', 'col2'])
df = pd.DataFrame(np.array([[1, 2], [3, 4]]), columns=['col1', 'col2'])
df

|  | col1 | col2 |
|---|---|---|
| 0 | 1 | 2 |
| 1 | 3 | 4 |


In [149]:
# From a list of tuples
df = pd.DataFrame([(1, 3), (2, 4)], columns=['col1', 'col2'])
df

|  | col1 | col2 |
|---|---|---|
| 0 | 1 | 3 |
| 1 | 2 | 4 |
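The "dict of Series sharing the same index" view of a DataFrame can be sketched as follows. The names and values (`heights`, `ages`, `ann`, `bob`) are made up for the illustration; the point is that columns are aligned by label, not by position.

```python
import pandas as pd

heights = pd.Series([1.70, 1.82], index=['ann', 'bob'])
ages = pd.Series([31, 45], index=['bob', 'ann'])  # same labels, different order

# Rows are aligned by label, so each person keeps the right age.
people = pd.DataFrame({'height': heights, 'age': ages})
print(people)
```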


### Reading data

The function `read_csv` is the most common way to read data from a csv file. It takes as input the path to the file and returns a DataFrame. It automatically adds a row index to the DataFrame.

In [150]:
print(open('data.csv').read())

FileNotFoundError: [Errno 2] No such file or directory: 'data.csv'

In [151]:
data = pd.read_csv('data.csv')
display(data)

|  | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 20635 | -121.09 | 39.48 | 25.0 | 1665.0 | 374.0 | 845.0 | 330.0 | 1.5603 | 78100.0 | INLAND |
| 20636 | -121.21 | 39.49 | 18.0 | 697.0 | 150.0 | 356.0 | 114.0 | 2.5568 | 77100.0 | INLAND |
| 20637 | -121.22 | 39.43 | 17.0 | 2254.0 | 485.0 | 1007.0 | 433.0 | 1.7000 | 92300.0 | INLAND |
| 20638 | -121.32 | 39.43 | 18.0 | 1860.0 | 409.0 | 741.0 | 349.0 | 1.8672 | 84700.0 | INLAND |


To write data to a CSV file, use the `to_csv` method:

In [None]:
data.to_csv('data_w_index.csv', index=True)

The output file will have an extra column containing the index:

In [None]:

print(open('data_w_index.csv', 'r').read())


If you read the file back, the saved index comes back as an ordinary column (named `Unnamed: 0`), in addition to the row index that pandas adds automatically:

In [None]:
data = pd.read_csv('data_w_index.csv')


To use the first column of the CSV file as the row index, pass `index_col=0`:


In [None]:

data = pd.read_csv('data_w_index.csv', index_col=0)
display(data)


To write the CSV file without the index, use `index=False`. If you also want to omit the column names, use `header=False`.

In [None]:

data.to_csv('data_wo_index.csv', index=False)
print(open('data_wo_index.csv','r').read())
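The index round trip can be reproduced without touching the disk. The sketch below is an assumption-laden stand-in for the cells above: it uses a tiny made-up DataFrame and an in-memory `io.StringIO` buffer instead of `data.csv`.

```python
import io
import pandas as pd

df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})

# With no path argument, to_csv returns the CSV text as a string.
csv_text = df.to_csv(index=True)

naive = pd.read_csv(io.StringIO(csv_text))
restored = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(naive.columns))     # the saved index shows up as an extra column
print(list(restored.columns))  # only the original columns
```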


### Exercise

- Read the file `data.csv` and save the first 3 rows in a new file `data_first3.csv`. The output file should be in the same format as the input file.


### Indexing and slicing dataframes


In [None]:
df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]})
display(df)


If you index a dataframe with column names, pandas returns the corresponding columns.

In [None]:

print("df['col1']")
display(df['col1'])

print("df[['col1', 'col2']]")
display(df[['col1', 'col2']])


If you use an integer slice, pandas returns the corresponding rows.

In [None]:

print("df[0:2]")
display(df[0:2])

If you use a list of booleans, pandas returns the rows where the value is `True`.

In [None]:

df[[True, False, True, False]]



This is very powerful, because you can use boolean expressions to filter the rows of the dataframe. For example, you can keep only the rows where the value of `col1` is greater than 2:

In [None]:

df[df['col1'] > 2]

You can combine boolean expressions using the `&` (and) and `|` (or) operators.

**Note:** put each piece of the expression in parentheses, otherwise the precedence of the operators will be wrong and you will get an error.

In [None]:
df[(df['col1'] > 2) | (df['col3'] < 10)]
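As a further sketch on the same dataframe, `&` keeps rows that satisfy both conditions, and `~` (an operator not used in the cells above) negates a condition. The parentheses around each comparison are required.

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]})

both = df[(df['col1'] > 1) & (df['col3'] < 12)]  # AND: rows 1 and 2
negated = df[~(df['col1'] > 2)]                  # NOT: rows 0 and 1

print(both)
print(negated)
```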

If you want to select a subset of both the rows and the columns, use the `loc` indexer (to select by index label) or the `iloc` indexer (to select by position).

In [None]:
df.loc[0:2, 'col1':'col2']

In [None]:
df.iloc[0:2, 0:2]

**Note:** when you use the loc attribute, the first element of the tuple is the index of the row, the second element is the name of the column. When you use the iloc attribute, the first element of the tuple is the position of the row, the second element is the position of the column.

Indexing using the **iloc** attribute follows python's slicing rules, so you can use the : operator to select a range of rows or columns. Keep in mind that ranges are always exclusive on the right side. On the contrary, when you use the **loc** attribute, results will include the right side of the range.

**In general when using the index, the right side of the range is included, while when using the position, the right side is excluded.**
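The difference in how the right end of a range is treated shows up in the number of rows returned. A minimal sketch on the same dataframe:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4], 'col2': [5, 6, 7, 8], 'col3': [9, 10, 11, 12]})

by_label = df.loc[0:2, 'col1':'col2']  # labels 0, 1 and 2 -> 3 rows
by_position = df.iloc[0:2, 0:2]        # positions 0 and 1 -> 2 rows

print(by_label.shape)
print(by_position.shape)
```

Both selections contain the same two columns; only the row count differs.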

### Exercises

- read the file `data.csv` and use it to build a new dataframe having only the rows where the `y` and `value` are both even and `name` is not 'origin'.

### Statistics

Pandas provides a lot of useful functions to compute statistics on dataframes:
- mean
- median
- std
- var
- min
- max
- sum
- count
- ...

In [None]:
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris

To evaluate numerical statistics, we have to discard the non-numeric columns. We can do this using the `select_dtypes` method.

In [None]:
iris.select_dtypes('float').mean() # in this case this is equivalent to iris.iloc[:, :-1].mean()


In [None]:
iris.select_dtypes('float').median()

To compute aggregate statistics on the dataframe, we can use the describe method.

In [None]:
iris.describe()

To compute a set of specific statistics, we can use the agg method.

In [None]:
iris.select_dtypes('float').agg(['mean', 'std'])

You can group the rows of a dataframe using the groupby method. This method returns a GroupBy object, which can be used to compute aggregate statistics on the groups.

In [None]:
iris.groupby('species').agg(['mean', 'std'])
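To see what `groupby` followed by `agg` produces without downloading the iris data, here is a sketch on a tiny made-up table. The column names `group` and `value` are hypothetical.

```python
import pandas as pd

measurements = pd.DataFrame({
    'group': ['a', 'a', 'b', 'b'],
    'value': [1.0, 3.0, 10.0, 20.0],
})

# One row per group, one column per requested statistic.
summary = measurements.groupby('group')['value'].agg(['mean', 'max'])
print(summary)
```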

### Plotting

`iris.plot()` will plot all the numeric columns of the dataframe, using the index as the x axis.

In [None]:
import matplotlib.pyplot as plt

plt.close('all')
iris.plot()
plt.show()

More information about chart visualization here: https://pandas.pydata.org/pandas-docs/stable/user_guide/visualization.html

## Convert to NumPy array

To convert a dataframe to a NumPy array, use the `.values` attribute (the pandas documentation recommends the equivalent `.to_numpy()` method).

In [None]:
iris.select_dtypes('float').values

### Exercises

On the iris dataset perform the following operations:
1. Select only the rows belonging to 'setosa', then evaluate the mean and standard deviation of each column
2. Select only the rows belonging to 'setosa' **or** 'versicolor', then evaluate the mean and standard deviation of each column, grouping the results by species
3. Add a target variable by converting 'species' to numerical values (0,1,2), and evaluate the least squares solution of the resulting linear regression problem. 
    - Note: the least squares formula is: $w = (X^T X)^{-1} X^T y$
4. Compute the predictions of the model on the training set ($\hat{y} = X w$), round them to the nearest class label, and evaluate the accuracy of the model on the training set ($\frac{1}{n}\sum_{i=1}^n \mathbb{1}[\operatorname{round}(\hat{y}_i) = y_i]$).

If everything is correct, you should get an accuracy of about 96% (on the training set).
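The least squares formula can be tried on synthetic data before applying it to iris. The sketch below rests on assumptions made for illustration only: the data is random and noiseless, the weights `true_w` are made up, and a column of ones is appended to model an intercept (the exercise does not say whether to include one).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (not the iris dataset): y = 2*x0 - 1*x1 + 0.5, no noise.
features = rng.normal(size=(50, 2))
X = np.column_stack([features, np.ones(50)])  # last column models the intercept
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w

# Normal equations: w = (X^T X)^{-1} X^T y
w = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq solves the same problem without forming the inverse.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w, true_w), np.allclose(w, w_lstsq))
```

Explicitly inverting $X^T X$ mirrors the formula, but `np.linalg.lstsq` is usually the safer choice when the columns of $X$ are nearly collinear.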


In [None]:
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
iris = pd.read_csv(url, names=[
                   'sepal length', 'sepal width', 'petal length', 'petal width', 'species'])
iris
