# Lecture 2: Numpy

You may have recognized that all the images in the previous notebooks come from movies. Indeed, the data we are going to analyze is the so-called stardom network of the 500 most-rated movies of 2018 according to IMDB.
![IMDB](l2_imdb.png)  
You can find the data in the 'data' folder of the repository.

But first, ...

## Numpy, your new best friend

[numpy](http://www.numpy.org/) is cool for a lot of reasons, but mostly because it is the Python module for working with big (or almost big) arrays, and it ships with non-trivial mathematical functions.

In [1]:
import numpy as np

In [2]:
cacca = np.array( [[1, 2, 3], [4,5,6]] )

In [3]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

### .size -- total number of elements

In [4]:
N = cacca.size
print(N)

6


### .shape -- matrix dimensions: (#rows, #columns)

In [5]:
cacca.shape

(2L, 3L)

### Accessing elements / Operation on elements

In [6]:
cacca[0]

array([1, 2, 3])

In [7]:
cacca[0,1]

2

In [8]:
cacca[:,0] # all the elements of the first column [x, 0]

array([1, 4])

In [9]:
cacca *2.

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [10]:
cacca %2

array([[1, 0, 1],
       [0, 1, 0]])

### Mask -- an operation with an implied question mark: are the elements of cacca divisible by 2?

In [11]:
cacca %2 == 0

array([[False,  True, False],
       [ True, False,  True]])

In [12]:
cacca != 2

array([[ True, False,  True],
       [ True,  True,  True]])

In [13]:
cacca[cacca %2==0] # select only the elements for which the mask is True

array([2, 4, 6])
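Masks can also be combined element by element and used on the left-hand side of an assignment. The following is a minimal sketch; the array names `a` and `b` are chosen only for illustration.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# combine two conditions element-wise; the parentheses are required
# because & and | bind more tightly than the comparisons
even_and_big = (a % 2 == 0) & (a > 2)
selected = a[even_and_big]      # 1-D array with the matching elements

# a mask can also select the elements to overwrite
b = a.copy()
b[b % 2 == 1] = 0               # set every odd element to zero
```

Note that selecting with a mask always returns a flat, one-dimensional array, whatever the shape of the original one.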

##### Data types and structured data types

In [14]:
np.array([[1, 2, 3], [4,5,6]], dtype=float)

array([[1., 2., 3.],
       [4., 5., 6.]])

In [7]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=object) # the correct way to define a mixed-type array

array([[1.0, 'cacca', 42],
       [0, 'bad', 1.4]], dtype=object)

In [16]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=float)

ValueError: could not convert string to float: cacca

In [6]:
np.array([[1,2,3], [1,2,3]], dtype = str) 

array([['1', '2', '3'],
       ['1', '2', '3']], dtype='|S1')

###### Structured data type

In [None]:
tmnt_list=['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
tmnt_ages=[14, 15, 13, 16]

In [None]:
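# list(zip(...)) is needed in Python 3, where zip returns an iterator; 'U20'/'U1' are unicode strings of at most 20/1 characters
tmnt_np = np.array( list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','U20')])
tmnt_np2 = np.array( list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','U1')])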

In [None]:
tmnt_np

In [None]:
tmnt_np2

In [None]:
tmnt_np['name']

In [None]:
tmnt_np[0]

Searching an array with a structured data type

In [None]:
tmnt_np[ tmnt_np['name']=='Leonardo']['age'] # I am using a mask!

True_chi_cerco = tmnt_np['name']=='Leonardo'
tmnt_np[True_chi_cerco]['age'], tmnt_np2[True_chi_cerco]['age']
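The difference between the two field widths is easy to miss, so here is a minimal self-contained sketch of the same search. It assumes Python 3 and unicode fields (`'U20'`, `'U1'`); the variable names are illustrative only.

```python
import numpy as np

names = ['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
ages = [14, 15, 13, 16]

# 'U20' keeps up to 20 characters, 'U1' silently keeps only the first one
full = np.array(list(zip(ages, names)), dtype=[('age', 'i8'), ('name', 'U20')])
short = np.array(list(zip(ages, names)), dtype=[('age', 'i8'), ('name', 'U1')])

mask = full['name'] == 'Leonardo'       # boolean mask built on the full names
age_of_leonardo = full[mask]['age']

# the truncated names can no longer be matched against the full string
found_in_short = np.any(short['name'] == 'Leonardo')

# records can also be sorted by one of their fields
by_age = np.sort(full, order='age')
```

A mask built on one array can be applied to another array of the same length, which is why the mask from the full names still works on the truncated array.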

## Transpose array

In [None]:
cacca.T

### dot product

In [8]:
np.dot(cacca, cacca.T)

array([[14, 32],
       [32, 77]])
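To see what the dot product is doing, the sketch below recomputes one entry by hand and contrasts it with the element-wise product. It assumes Python 3.5 or later for the `@` operator; the names `m`, `gram` and `entry_01` are illustrative.

```python
import numpy as np

m = np.array([[1, 2, 3], [4, 5, 6]])

gram = np.dot(m, m.T)       # shapes (2, 3) x (3, 2) -> (2, 2)
same = m @ m.T              # the @ operator is equivalent for 2-D arrays

# entry [0, 1] is the sum over k of m[0, k] * m[1, k]
entry_01 = sum(m[0, k] * m[1, k] for k in range(m.shape[1]))

elementwise = m * m         # NOT a matrix product: multiplies entry by entry
```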

## Reading/writing from/to file

In [9]:
adjacency_matrix=np.genfromtxt('something.txt', delimiter=',', dtype='i8')

IOError: something.txt not found.

In [None]:
np.savetxt('something_new.txt',adjacency_matrix, fmt='%u', delimiter=',')

In [None]:
np.genfromtxt('something_new.txt', delimiter=',', dtype='i8')
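Since `something.txt` is not available here, the round trip can be tried on a small matrix written to a temporary file. This is only a sketch: the file name `toy_matrix.txt` and the matrix are made up for the example.

```python
import os
import tempfile
import numpy as np

matrix = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])

# write to a temporary folder so the example does not touch real data
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix.txt')
np.savetxt(path, matrix, fmt='%d', delimiter=',')

# read it back; the delimiter must match the one used when writing
loaded = np.genfromtxt(path, delimiter=',', dtype='i8')
```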

##### Interesting stuff and functions

### np.zeros(N, dtype='type') - create an array of N zeros of type 'type'

In [10]:
np.zeros(42, dtype='int')

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])

In [12]:
np.zeros(42, dtype=str)

array(['', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '',
       '', '', '', '', '', '', '', ''], dtype='|S1')

### np.ones(N, dtype='type') - guess what

In [None]:
np.ones(42, dtype='int')

In [None]:
np.ones(42, dtype=str)

### np.arange(N) - create an array that counts from 0 up to N-1

In [None]:
np.arange(42)

In [13]:
np.arange(4,42)

array([ 4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20,
       21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37,
       38, 39, 40, 41])

### np.unique(cacca) - returns a sorted array with each distinct element of cacca appearing only once

In [None]:
cacca=np.array([1,2,4,1,2,4,12,42])

In [15]:
np.unique(cacca)

array([ 1,  2,  4, 12, 42])

In [16]:
np.unique(cacca, return_counts=True)

(array([ 1,  2,  4, 12, 42]), array([2, 2, 2, 1, 1], dtype=int64))
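Besides the counts, `np.unique` can also return, for every original entry, the position of its value in the array of distinct values. A minimal sketch on a made-up array of labels:

```python
import numpy as np

labels = np.array(['b', 'a', 'b', 'c', 'a', 'b'])

values, inverse, counts = np.unique(labels, return_inverse=True, return_counts=True)
# values  -> sorted distinct labels
# inverse -> for each original entry, its position in `values`
# counts  -> how many times each distinct label occurs

rebuilt = values[inverse]   # reproduces the original array
```

The `inverse` array is handy whenever names have to be turned into row or column indices.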

### np.sum()

In [21]:
adjacency_matrix = np.array([[0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 1],[0, 0, 1, 0]])
adjacency_matrix

array([[0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 1],
       [0, 0, 1, 0]])

In [22]:
np.sum(adjacency_matrix) # sum over all the elements

8

In [23]:
np.sum(adjacency_matrix, axis=0) # sum along axis 0 (over the rows): one total per column

array([1, 2, 2, 3])

In [24]:
np.sum(adjacency_matrix, axis=1)

array([2, 2, 3, 1])
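If the matrix is read as a directed graph (an assumption made here only for illustration, with `A[i, j] = 1` meaning an edge from node `i` to node `j`), the two axis sums are the in-degree and out-degree sequences. A minimal sketch:

```python
import numpy as np

# same toy matrix as above, read as a directed graph (assumption)
A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

k_in = A.sum(axis=0)    # collapse the rows: one total per column
k_out = A.sum(axis=1)   # collapse the columns: one total per row
n_edges = A.sum()       # every edge is counted exactly once
```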

### np.where()

In [17]:
np.where(cacca==1)

(array([0, 3], dtype=int64),)

In [18]:
np.where(adjacency_matrix==1)

(array([0, 0, 1, 1, 2, 2, 2, 3], dtype=int64), array([1, 3, 2, 3, 0, 1, 3, 2], dtype=int64))
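`np.where` with a single argument returns one index array per dimension, so on a matrix it gives the row and column indices of the matching entries. The sketch below, on a made-up 3x3 matrix, turns them into (row, column) pairs and also shows the three-argument form.

```python
import numpy as np

A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])

rows, cols = np.where(A == 1)                      # one index array per dimension
edges = list(zip(rows.tolist(), cols.tolist()))    # (row, column) pairs

# three-argument form: pick from the 2nd or the 3rd argument, element by element
flipped = np.where(A == 1, 0, 1)
```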

## Exercise:
1. **load the file ./data/imdb_2018_films_actors.txt** It is an edge list (_what is an edge list?_) of a bipartite network (_what is a bipartite network?_), with films in the first column and actors in the second;
2. **calculate the degree sequence** for both layers (_what is a layer?_);
3. **build the biadjacency matrix** (_what is a biadjacency matrix?_);
4. **calculate the nearest neighbours degree** (_what is nn?_).

#### 1. Load the file

In [77]:
# Alternative: read everything as fixed-width byte strings
#ad_mat = np.genfromtxt('./data/imdb_2018_films_actors.txt', delimiter='\t', dtype='|S100')

# Alternative: build a {movie: [actors]} dictionary by hand
#with open('./data/imdb_2018_films_actors.txt', 'r') as f_in:
#    text = f_in.read()
#
#lines = text.splitlines()
#movie_to_actor = {}
#for line in lines:
#    movie, actor = line.split('\t')
#    if movie in movie_to_actor:
#        movie_to_actor[movie].append(actor)
#    else:
#        movie_to_actor[movie] = [actor]
#
#for key in movie_to_actor:
#    print(key, movie_to_actor[key])

# structured array with one (film, actor) record per line of the tab-separated file
adjacency_matrix2 = np.genfromtxt('./data/imdb_2018_films_actors.txt', encoding='utf-8', delimiter='\t',
                                  dtype=[('film', 'U50'), ('actors', 'U50')])
adjacency_matrix2

array([(u'Avengers: Infinity War', u'Chris Hemsworth'),
       (u'Avengers: Infinity War', u'Chris Evans'),
       (u'Avengers: Infinity War', u'Don Cheadle'), ...,
       (u'Colette', u'Izzy Bayley-King'), (u'Colette', u'Karl Farrer'),
       (u'Colette', u'Masayoshi Haneda')],
      dtype=[('film', '<U50'), ('actors', '<U50')])

#### 2. The degree sequence

In [78]:
#np.unique(adjacency_matrix2, return_counts=True)[0]

films, k_films = np.unique(adjacency_matrix2['film'], return_counts=True)
actors, k_actors = np.unique(adjacency_matrix2['actors'], return_counts=True)
films


array([u'12 Strong', u'22 July', u'7 Days in Entebbe',
       u'A Futile and Stupid Gesture', u'A Quiet Place',
       u'A Simple Favor', u'A Star Is Born', u'A Wrinkle in Time',
       u'Active Measures', u'Acts of Violence', u'Adrift',
       u'Alex Strangelove', u'Alpha', u'American Animals',
       u'Annihilation', u'Anon', u'Ant-Man and the Wasp', u'Apostle',
       u'Avengers: Infinity War', u'Bad Samaritan',
       u'Bad Times at the El Royale', u'Beautiful Boy', u'Beirut',
       u'Billionaire Boys Club', u'Bird Box', u'BlacKkKlansman',
       u'Black Panther', u'Blindspotting', u'Blockers',
       u'Bohemian Rhapsody', u'Book Club', u'Braven', u'Breaking In',
       u'Bumblebee', u'Burning', u'Calibre', u'Cam', u'Christopher Robin',
       u'Climax', u'Cold War', u'Colette', u'Corredor Assombrado',
       u'Crazy Rich Asians', u'Creed II', u'Deadpool 2', u'Death Wish',
       u'Death of a Nation', u'Den of Thieves', u'Destination Wedding',
       u'Dogman', u"Don't Worry, He W

#### 3. The biadjacency matrix

In [79]:
adj_films = np.zeros((len(films), len(actors)), dtype=int) 
adj_films

array([[0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       ...,
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0],
       [0, 0, 0, ..., 0, 0, 0]])

In [80]:
# Unused alternative: collect the actors of each film in a dictionary
#dictionary = {}
#for element in range(len(adjacency_matrix2)):
#    print(adjacency_matrix2[element])
#    if adjacency_matrix2[element][0] not in dictionary:
#        dictionary[adjacency_matrix2[element][0]] = [adjacency_matrix2[element][1]]
#    else:
#        dictionary[adjacency_matrix2[element][0]].append(adjacency_matrix2[element][1])

In [81]:
for i in adjacency_matrix2:
    # row index of the film and column index of the actor of this edge
    f_pos = np.where(films==i['film'])[0]
    a_pos = np.where(actors==i['actors'])[0]
    adj_films[f_pos, a_pos] = 1
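The loop calls `np.where` twice for every edge, which becomes slow on long edge lists. One possible alternative, sketched here on a hypothetical toy edge list (the film and actor labels are made up), is to let `np.unique` return the indices directly and fill the matrix in a single assignment.

```python
import numpy as np

# hypothetical toy edge list of (film, actor) pairs
edges = np.array([('F1', 'A'), ('F1', 'B'), ('F2', 'B'), ('F2', 'C'), ('F3', 'C')],
                 dtype=[('film', 'U10'), ('actors', 'U10')])

# return_inverse gives, for each edge, the index of its film / actor
films, film_idx = np.unique(edges['film'], return_inverse=True)
actors, actor_idx = np.unique(edges['actors'], return_inverse=True)

biadj = np.zeros((len(films), len(actors)), dtype=int)
biadj[film_idx, actor_idx] = 1      # one assignment for all the edges at once
```

The row sums of `biadj` give back the film degrees and the column sums the actor degrees, which is a quick consistency check against the counts returned by `np.unique`.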

In [82]:
adj_films.shape

(199L, 11128L)

In [85]:
films, k_films

(array([u'12 Strong', u'22 July', u'7 Days in Entebbe',
        u'A Futile and Stupid Gesture', u'A Quiet Place',
        u'A Simple Favor', u'A Star Is Born', u'A Wrinkle in Time',
        u'Active Measures', u'Acts of Violence', u'Adrift',
        u'Alex Strangelove', u'Alpha', u'American Animals',
        u'Annihilation', u'Anon', u'Ant-Man and the Wasp', u'Apostle',
        u'Avengers: Infinity War', u'Bad Samaritan',
        u'Bad Times at the El Royale', u'Beautiful Boy', u'Beirut',
        u'Billionaire Boys Club', u'Bird Box', u'BlacKkKlansman',
        u'Black Panther', u'Blindspotting', u'Blockers',
        u'Bohemian Rhapsody', u'Book Club', u'Braven', u'Breaking In',
        u'Bumblebee', u'Burning', u'Calibre', u'Cam', u'Christopher Robin',
        u'Climax', u'Cold War', u'Colette', u'Corredor Assombrado',
        u'Crazy Rich Asians', u'Creed II', u'Deadpool 2', u'Death Wish',
        u'Death of a Nation', u'Den of Thieves', u'Destination Wedding',
        u'Dogman', u"D

#### 4. The Nearest Neighbour Degree

In [89]:

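# for each film: average degree of the actors in its cast
k_nn_f = np.dot(adj_films,  k_actors)
k_nn_f = 1.*k_nn_f/k_films
# for each actor: average degree (cast size) of the films they appear in
k_nn_a = np.dot(adj_films.T,  k_films)
k_nn_a = 1.*k_nn_a/k_actors
k_nn_f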

array([1.5       , 1.01333333, 1.175     , 1.17      , 1.42857143,
       1.14285714, 1.19047619, 1.16363636, 1.32258065, 1.60810811,
       1.        , 1.08045977, 1.17647059, 1.45882353, 1.96428571,
       1.11320755, 2.08426966, 1.23684211, 2.29906542, 1.20754717,
       1.66666667, 1.49315068, 1.25      , 1.15384615, 1.53846154,
       1.26923077, 2.17037037, 1.40350877, 2.40789474, 1.44144144,
       1.0952381 , 1.31578947, 1.4       , 1.61165049, 1.3125    ,
       1.05882353, 1.09090909, 1.60493827, 1.04166667, 1.        ,
       1.37777778, 1.05555556, 1.10606061, 1.11969112, 1.51428571,
       1.21333333, 1.        , 1.81081081, 1.08333333, 1.        ,
       1.29787234, 2.38157895, 1.15384615, 1.2195122 , 1.19444444,
       1.        , 1.23333333, 1.25316456, 1.63513514, 1.51923077,
       1.23529412, 1.86708861, 1.58333333, 1.89230769, 1.28571429,
       1.71830986, 1.3       , 1.21      , 1.3877551 , 1.25      ,
       1.03225806, 1.18421053, 1.65306122, 1.5625    , 1.35555
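The nearest neighbour degree of a node is the average degree of its neighbours: for a film, the sum of the degrees of its actors divided by the number of its actors, and the other way round for an actor. A small worked sketch on a hypothetical 3x3 biadjacency matrix `B` (not the IMDB data) makes the arithmetic easy to check by hand.

```python
import numpy as np

# hypothetical toy biadjacency matrix: 3 films (rows) x 3 actors (columns)
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 0, 1]])

k_films = B.sum(axis=1)     # cast size of each film
k_actors = B.sum(axis=0)    # number of films of each actor

# for each film: sum of its actors' degrees, divided by its own degree
k_nn_films = B.dot(k_actors) / k_films.astype(float)

# for each actor: sum of its films' degrees, divided by its own degree
k_nn_actors = B.T.dot(k_films) / k_actors.astype(float)
```

For instance, the first film has two actors with degrees 1 and 2, so its nearest neighbour degree is (1 + 2) / 2 = 1.5.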

### Other interesting numpy functions

np.diag()

In [None]:
np.diag(adjacency_matrix)

extracts the diagonal from a square matrix or

In [None]:
np.diag([1,2,3,4])

builds a diagonal matrix from the array in the argument.

np.vstack()

In [None]:
np.vstack((np.arange(4), np.arange(4,8)))

it stacks the two arrays one over the other.

np.isin()

In [None]:
cacca=np.array([0,1,3,4,5,42])

In [None]:
np.isin(cacca, np.array([0,42,3]))

In [None]:
cacca[np.isin(cacca, np.array([0,42,3]))]

In [None]:
np.isin(np.array([0,42,3]),cacca)

### Exercise: project the network on the two layers and get the adjacency matrix (i.e. the binarized version of the weighted matrix)

### Exercise: calculate the clustering per node and its average value on the film projection
The clustering of a node is the number of observed triangles it belongs to divided by the number of triangles that could be formed among its neighbours.
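As a hint, the definition can be checked on a hypothetical toy undirected graph (not the film projection). The sketch assumes a symmetric 0/1 adjacency matrix with an empty diagonal, and assigns clustering 0 to nodes with fewer than two neighbours, which is a convention chosen here.

```python
import numpy as np

# hypothetical toy graph: a triangle 0-1-2 plus a pendant node 3 attached to 2
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

k = A.sum(axis=1)                               # node degrees
# diag(A^3) counts closed walks of length 3; each triangle is walked twice
triangles = np.diag(A.dot(A).dot(A)) / 2.0
possible = k * (k - 1) / 2.0                    # pairs of neighbours

# nodes with fewer than two neighbours get clustering 0 (convention)
clustering = np.zeros(len(k))
ok = possible > 0
clustering[ok] = triangles[ok] / possible[ok]
average_clustering = clustering.mean()
```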