# Lecture 2: Numpy

The clever student will have recognized that all the images in the previous notebooks come from movies. Indeed, the data we are going to analyze is the so-called stardom network of the 500 most-rated movies of 2018 according to IMDB.
![IMDB](l2_imdb.png)  
You can find the data in the 'data' folder of the repository.

But first, ...

## Numpy, your new best friend

[numpy](http://www.numpy.org/) is cool for a lot of reasons, but mostly because it is the Python module for working with big (or almost big) arrays, and it comes with many non-trivial mathematical functions.

In [1]:
import numpy as np

### numpy array, your new best friend

In [2]:
cacca=np.array([[1, 2, 3], [4,5,6]])

In [3]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

.size (the total number of elements; it is an attribute, not a method, so no parentheses)

In [4]:
cacca.size

6

.shape (the number of elements along each dimension; also an attribute, so no parentheses)

In [5]:
cacca.shape

(2, 3)

##### Accessing elements

In [6]:
cacca[0]

array([1, 2, 3])

In [7]:
cacca[0,1]

2

In [8]:
cacca[:,0]

array([1, 4])

Operations are applied element by element!

In [9]:
cacca *2.

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [10]:
cacca %2

array([[1, 0, 1],
       [0, 1, 0]])

Boolean masks

In [11]:
cacca %2==0

array([[False,  True, False],
       [ True, False,  True]])

In [12]:
cacca[cacca %2==0]

array([2, 4, 6])
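Masks can also be combined with `&` (and), `|` (or) and `~` (not), and they can appear on the left-hand side of an assignment. A minimal sketch follows; the names `values`, `even_and_big` and `clipped` are only illustrative.

```python
import numpy as np

values = np.array([[1, 2, 3], [4, 5, 6]])

# combine two conditions: each comparison needs its own parentheses
even_and_big = (values % 2 == 0) & (values > 2)
selected = values[even_and_big]      # 1-D array with the selected elements

# a mask can also choose which elements to overwrite (on a copy here)
clipped = values.copy()
clipped[clipped > 4] = 4
```

Note that indexing with a mask always returns a flat 1-D array, whatever the shape of the original one.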

##### Data types and structured data types

In [13]:
np.array([[1, 2, 3], [4,5,6]], dtype=float)
#we can also set the type of the elements of the array with the dtype keyword argument

array([[1., 2., 3.],
       [4., 5., 6.]])

In [14]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=object)

array([[1.0, 'cacca', 42],
       [0, 'bad', 1.4]], dtype=object)

In [15]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=float)

ValueError: could not convert string to float: cacca

###### Structured data type

In [None]:
tmnt_list=['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
tmnt_ages=[14, 15, 13, 16]

In [None]:
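tmnt_np=np.array(list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','U20')])
#list() is needed in Python 3, where zip returns an iterator
#'i8' is the type for 8-byte (64-bit) integers
#'U20' is the type for unicode strings of max length 20 (with 'S20' the names are
#stored as bytes and, in Python 3, would not compare equal to 'Leonardo')
#a structured dtype organizes the array in named fields, a bit like a dictionary,
#which makes searching for elements easy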

In [None]:
tmnt_np

In [None]:
tmnt_np['name']

In [None]:
tmnt_np[0]

Searching an array with a structured data type

In [None]:
tmnt_np[tmnt_np['name']=='Leonardo']['age']
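A self-contained sketch of the same search, written for Python 3. It assumes a unicode field (`'U20'`) so that the comparison with an ordinary string works, and it wraps `zip` in `list(...)`; the names `turtles`, `leo_age` and `by_age` are illustrative.

```python
import numpy as np

names = ['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
ages = [14, 15, 13, 16]

# list(...) is needed in Python 3, where zip returns an iterator
turtles = np.array(list(zip(ages, names)),
                   dtype=[('age', 'i8'), ('name', 'U20')])

# boolean mask built on one field, then read another field
leo_age = turtles[turtles['name'] == 'Leonardo']['age']

# records can also be sorted by one of the fields
by_age = np.sort(turtles, order='age')
youngest = by_age['name'][0]
```

The mask returns an array (possibly with several matches), so `leo_age` is a one-element array rather than a plain integer.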

##### Operations between arrays

In [16]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

In [17]:
cacca.shape

(2, 3)

Transpose

In [18]:
cacca.T

array([[1, 4],
       [2, 5],
       [3, 6]])

dot product

In [19]:
np.dot(cacca, cacca.T)

array([[14, 32],
       [32, 77]])
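As a side note, in Python 3.5 or later the `@` operator gives the same matrix product as `np.dot` for 2-D arrays, while `*` stays element by element. A small sketch on an illustrative array `m`:

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

product_dot = np.dot(m, m.T)   # matrix product, shape (2, 2)
product_at = m @ m.T           # same result with the @ operator
elementwise = m * m            # NOT a matrix product: element by element
```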

##### Reading/writing from/to file

In [20]:
adjacency_matrix=np.genfromtxt('something.txt', delimiter=',', dtype='i8')

In [21]:
np.savetxt('something_new.txt',adjacency_matrix, fmt='%u', delimiter=',') #'%u': unsigned integer format (in Python it behaves like '%d')

In [22]:
np.genfromtxt('something_new.txt', delimiter=',', dtype='i8')

array([-1, -1])
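By default `np.genfromtxt` fills the integer entries it cannot parse with -1, so an output like the one above may be a hint that the delimiter or the dtype does not match the content of the file. The following sketch shows a full write/read round trip on a made-up matrix; the file name `toy_matrix.txt` and the temporary folder are assumptions made only for this example.

```python
import os
import tempfile

import numpy as np

matrix = np.array([[0, 1, 1],
                   [1, 0, 0],
                   [1, 0, 0]], dtype='i8')

# write to a throw-away folder, so that no real file is touched
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix.txt')
np.savetxt(path, matrix, fmt='%d', delimiter=',')

# read it back using the same delimiter and dtype
reloaded = np.genfromtxt(path, delimiter=',', dtype='i8')
```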

##### Interesting stuff and functions

np.zeros()

In [23]:
np.zeros(42, dtype='int')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [24]:
np.zeros(42, dtype=str)

array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', ''], dtype='|S1')

np.ones()

In [25]:
np.ones(42, dtype='int')

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [26]:
np.ones(42, dtype=str)

array(['1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1', '1',
       '1', '1', '1'], dtype='|S1')

np.arange()

In [27]:
np.arange(42)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
       34, 35, 36, 37, 38, 39, 40, 41])

In [28]:
np.arange(4,42)

array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
       21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
       38, 39, 40, 41])

np.unique()

In [29]:
cacca=np.array([1,2,4,1,2,4,12,42])

In [30]:
np.unique(cacca)

array([ 1,  2,  4, 12, 42])

In [31]:
np.unique(cacca, return_counts=True) #shift+tab to see explanation

(array([ 1,  2,  4, 12, 42]), array([2, 2, 2, 1, 1]))

np.sum()

In [32]:
adjacency_matrix

array([-1, -1])

In [33]:
np.sum(adjacency_matrix)

-2

In [34]:
np.sum(adjacency_matrix, axis=0)

-2

In [35]:
np.sum(adjacency_matrix, axis=1)

AxisError: axis 1 is out of bounds for array of dimension 1
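The error appears because `adjacency_matrix` is one-dimensional here, so it has no axis 1. On a two-dimensional array the `axis` argument selects the direction along which the sum runs; a small sketch on a made-up array `toy`:

```python
import numpy as np

toy = np.array([[1, 2, 3],
                [4, 5, 6]])

total = np.sum(toy)             # sum of all the entries
col_sums = np.sum(toy, axis=0)  # collapse the rows: one value per column
row_sums = np.sum(toy, axis=1)  # collapse the columns: one value per row
```

On a binary adjacency matrix these row and column sums are the degrees of the nodes.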

np.where()

In [36]:
np.where(cacca==1)

(array([0, 3]),)

In [37]:
np.where(adjacency_matrix==1)

(array([], dtype=int64),)

## Exercise:
1. **load the file ./data/imdb_2018_films_actors.txt**. It is an edge list (_what is an edge list?_) of a bipartite network (_what is a bipartite network?_): the first column contains the films and the second the actors;
2. **calculate the degree sequence** for both layers (_what is a layer?_);
3. **build the biadjacency matrix** (_what is a biadjacency matrix?_);
4. **calculate the nearest neighbours degree** (_what is the nearest neighbours degree?_).

#### 1. Load the file

In [64]:
adj_a=np.genfromtxt('./data/imdb_2018_films_actors.txt', delimiter='\t', encoding = 'utf-8', dtype=[('film','U50'),('actor','U50')])
print(adj_a)


[(u'Avengers: Infinity War', u'Chris Hemsworth')
 (u'Avengers: Infinity War', u'Chris Evans')
 (u'Avengers: Infinity War', u'Don Cheadle') ...
 (u'Colette', u'Izzy Bayley-King') (u'Colette', u'Karl Farrer')
 (u'Colette', u'Masayoshi Haneda')]


#### 2. The degree sequence

In [68]:
films, k_films = np.unique(adj_a['film'], return_counts= True)
actors, k_actors = np.unique(adj_a['actor'], return_counts= True)
print(actors[2],k_actors[2])

(u'?gnes B?nfalvy', 1)


In [66]:
len(actors)

11128

In [67]:
len(films)

199

#### 3. The biadjacency matrix

In [91]:
biadj = np.zeros((len(actors),len(films)),dtype='i2')
for i in adj_a:
    a_pos = np.where(actors==i['actor'])[0]
    f_pos = np.where(films==i['film'])[0]
    biadj[a_pos,f_pos] = 1  

In [92]:
np.sum(biadj)

12912
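The loop above calls `np.where` twice per edge, which may become slow on long edge lists. A possible alternative, sketched here on a tiny made-up edge list (the names `toy_edges`, `f_idx` and `a_idx` are illustrative), asks `np.unique` for the position of every entry through `return_inverse=True`:

```python
import numpy as np

toy_edges = np.array([('Film A', 'Ann'), ('Film A', 'Bob'),
                      ('Film B', 'Bob'), ('Film B', 'Cid')],
                     dtype=[('film', 'U50'), ('actor', 'U50')])

# return_inverse gives, for every edge, the index of its film/actor
# in the sorted array of unique names
toy_films, f_idx = np.unique(toy_edges['film'], return_inverse=True)
toy_actors, a_idx = np.unique(toy_edges['actor'], return_inverse=True)

toy_biadj = np.zeros((len(toy_actors), len(toy_films)), dtype='i2')
toy_biadj[a_idx, f_idx] = 1   # fancy indexing sets all the edges at once
```

As in the loop version, repeated edges would still give a single 1 in the matrix.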

#### 4. The Nearest Neighbour Degree

In [None]:
# k^{nn}_i = \sum_\alpha m_{i\alpha} k_\alpha / k_i

In [116]:
knn_prod_actors = np.dot(biadj,k_films)
knn_actors =(1.* knn_prod_actors) / k_actors
print(knn_actors)



[111. 107.  56. ...  81.  75.  75.]


In [118]:
knn_prod_films = np.dot(biadj.T,k_actors)
knn_films = knn_prod_films/ (1.*k_films)
print(knn_films)

[1.5        1.01333333 1.175      1.17       1.42857143 1.14285714
 1.19047619 1.16363636 1.32258065 1.60810811 1.         1.08045977
 1.17647059 1.45882353 1.96428571 1.11320755 2.08426966 1.23684211
 2.29906542 1.20754717 1.66666667 1.49315068 1.25       1.15384615
 1.53846154 1.26923077 2.17037037 1.40350877 2.40789474 1.44144144
 1.0952381  1.31578947 1.4        1.61165049 1.3125     1.05882353
 1.09090909 1.60493827 1.04166667 1.         1.37777778 1.05555556
 1.10606061 1.11969112 1.51428571 1.21333333 1.         1.81081081
 1.08333333 1.         1.29787234 2.38157895 1.15384615 1.2195122
 1.19444444 1.         1.23333333 1.25316456 1.63513514 1.51923077
 1.23529412 1.86708861 1.58333333 1.89230769 1.28571429 1.71830986
 1.3        1.21       1.3877551  1.25       1.03225806 1.18421053
 1.65306122 1.5625     1.35555556 1.53623188 1.11864407 1.39252336
 1.13513514 1.4516129  1.04545455 1.25       1.4        1.90909091
 1.14285714 1.38823529 1.36842105 1.64423077 1.23913043 1.18421
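The nearest neighbours degree of a node is the average degree of its neighbours. To check the two computations above by hand, here is the same recipe on a made-up biadjacency matrix with 3 actors and 2 films (all the `toy_*` names are illustrative):

```python
import numpy as np

toy_biadj = np.array([[1, 0],
                      [1, 1],
                      [1, 0]])   # rows: actors, columns: films

toy_k_actors = toy_biadj.sum(axis=1)   # degrees of the actors: [1, 2, 1]
toy_k_films = toy_biadj.sum(axis=0)    # degrees of the films:  [3, 1]

# sum of the degrees of the neighbours, divided by the number of neighbours
toy_knn_actors = np.dot(toy_biadj, toy_k_films) / (1. * toy_k_actors)
toy_knn_films = np.dot(toy_biadj.T, toy_k_actors) / (1. * toy_k_films)
```

The second actor plays in both films, whose degrees are 3 and 1, so its nearest neighbours degree is (3 + 1) / 2 = 2.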

### Other interesting functions of numpy

np.diag()

In [40]:
np.diag(adjacency_matrix)

array([[-1,  0],
       [ 0, -1]])

extracts the diagonal when it is given a 2-D array (here `adjacency_matrix` is 1-D, so a diagonal matrix is built instead), or

In [41]:
np.diag([1,2,3,4])

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

builds a diagonal matrix when it is given a 1-D array.

np.vstack()

In [42]:
np.vstack((np.arange(4), np.arange(4,8)))

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

stacks the two arrays one on top of the other, row by row.

np.isin()

In [43]:
cacca=np.array([0,1,3,4,5,42])

In [44]:
np.isin(cacca, np.array([0,42,3]))

array([ True, False,  True, False, False,  True])

In [45]:
cacca[np.isin(cacca, np.array([0,42,3]))]

array([ 0,  3, 42])

In [46]:
np.isin(np.array([0,42,3]),cacca)

array([ True,  True,  True])

### Exercise: project the network on the two layers and get the adjacency matrix (i.e. the binarized version of the weighted matrix)
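One possible approach, sketched on a made-up 3x2 biadjacency matrix rather than on the IMDB data: the product of the biadjacency matrix with its transpose counts the common neighbours of every pair of nodes (the weighted projection), and binarizing it after removing the diagonal gives the adjacency matrix. All names are illustrative.

```python
import numpy as np

toy_biadj = np.array([[1, 0],
                      [1, 1],
                      [0, 1]])   # rows: actors, columns: films

# weighted projections: entry (i, j) = number of common neighbours
w_actors = np.dot(toy_biadj, toy_biadj.T)
w_films = np.dot(toy_biadj.T, toy_biadj)

# the diagonal holds the degrees, not links: set it to zero (in place)
np.fill_diagonal(w_actors, 0)
np.fill_diagonal(w_films, 0)

# binarize: a link exists if there is at least one common neighbour
adj_actors = (w_actors > 0).astype(int)
adj_films = (w_films > 0).astype(int)
```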

### Exercise: calculate the clustering per node and its average value on the film projection
The clustering of a node is the number of observed triangles passing through that node divided by the number of possible ones, i.e. k(k-1)/2 for a node of degree k.
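A hedged sketch of one standard way to compute it with matrix products, on a made-up undirected adjacency matrix `toy_adj` without self-loops. The diagonal of the cube of the adjacency matrix counts every triangle through a node twice, and a node of degree k can take part in at most k(k-1)/2 triangles. Setting the clustering of nodes with degree below 2 to zero is a convention assumed here.

```python
import numpy as np

toy_adj = np.array([[0, 1, 1, 0],
                    [1, 0, 1, 0],
                    [1, 1, 0, 1],
                    [0, 0, 1, 0]])

degree = toy_adj.sum(axis=1)

# diag(A^3)[i] = closed walks of length 3 from i = 2 * triangles through i
triangles = np.diag(np.linalg.matrix_power(toy_adj, 3)) / 2.

# maximum number of triangles for a node of degree k: k(k-1)/2
possible = degree * (degree - 1) / 2.

# nodes with degree < 2 keep clustering 0 (assumed convention)
clustering = np.zeros(len(toy_adj))
ok = possible > 0
clustering[ok] = triangles[ok] / possible[ok]

average_clustering = clustering.mean()
```

In this toy graph the nodes 0, 1 and 2 form a triangle and node 3 hangs from node 2, so the clustering is 1, 1, 1/3 and 0 respectively.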