# Lecture 2: Numpy

The clever student will have recognized that all the images in the previous notebooks come from movies. Indeed, the data we are going to analyze is the so-called stardom network of the 500 most-rated movies of 2018 according to IMDB.
![IMDB](l2_imdb.png)  
You can find the data in the 'data' folder of the repository.

But first, ...

## Numpy, your new best friend

[numpy](http://www.numpy.org/) is cool for a lot of reasons, but mostly because it is the Python module for working with big (or almost big) arrays, and it ships with non-trivial mathematical functions.

In [1]:
import numpy as np

In [2]:
cacca = np.array( [[1, 2, 3], [4,5,6]] )

In [3]:
cacca

array([[1, 2, 3],
       [4, 5, 6]])

### .size -- total number of elements

In [4]:
N = cacca.size
print(N)

6


### .shape -- matrix dimensions: (#rows, #columns)

In [5]:
cacca.shape

(2, 3)

### Accessing elements / Operation on elements

In [6]:
cacca[0]

array([1, 2, 3])

In [7]:
cacca[0,1]

2

In [8]:
cacca[:,0]  # all the elements of the first column [x, 0]

array([1, 4])

In [9]:
cacca *2.

array([[ 2.,  4.,  6.],
       [ 8., 10., 12.]])

In [10]:
cacca %2

array([[1, 0, 1],
       [0, 1, 0]])

### Mask -- an operation with an implied question mark:
### are the elements of cacca divisible by 2?

In [11]:
cacca %2 == 0

array([[False,  True, False],
       [ True, False,  True]])

In [12]:
cacca != 2

array([[ True, False,  True],
       [ True,  True,  True]])

In [13]:
cacca[cacca %2==0]  # select only the elements for which the mask is True

array([2, 4, 6])
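
Masks can also be combined. The cell below is a minimal sketch of that idea; the names `a`, `even`, `big` and `b` are made up for the illustration, and `a` simply holds the same values as the array used above.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# Each comparison yields a boolean array with the same shape as `a`.
even = a % 2 == 0
big = a > 2

# Combine masks element-wise with & (and), | (or), ~ (not);
# the Python keywords `and` / `or` do not work on arrays.
even_and_big = a[even & big]
odd_or_big = a[~even | big]

# A mask can also select the entries to overwrite: zero the even entries of a copy.
b = a.copy()
b[even] = 0
```

Selecting with a mask always returns a flat (1-D) array, while assigning through a mask keeps the original shape.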

##### Data types and structured data types

In [14]:
np.array([[1, 2, 3], [4,5,6]], dtype=float)

array([[1., 2., 3.],
       [4., 5., 6.]])

In [15]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=object)

array([[1.0, 'cacca', 42],
       [0, 'bad', 1.4]], dtype=object)

In [16]:
np.array([[1., 'cacca', 42], [0,'bad',1.4]], dtype=float)

ValueError: could not convert string to float: cacca

In [None]:
np.array([[1,2,3], [1,2,3]], dtype = str)

###### Structured data type

In [None]:
tmnt_list=['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
tmnt_ages=[14, 15, 13, 16]

In [None]:
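# list(...) is needed on Python 3, where zip returns an iterator;
# 'U20' / 'U1' are unicode strings of at most 20 / 1 characters
tmnt_np = np.array( list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','U20')])
tmnt_np2 = np.array( list(zip(tmnt_ages,tmnt_list)), dtype=[('age', 'i8'), ('name','U1')])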

In [None]:
tmnt_np

In [None]:
tmnt_np2

In [None]:
tmnt_np['name']

In [None]:
tmnt_np[0]

Searching an array with a structured data type

In [None]:
tmnt_np[ tmnt_np['name']=='Leonardo']['age']  # I am using a mask!

True_chi_cerco = tmnt_np['name']=='Leonardo'
tmnt_np[True_chi_cerco]['age'], tmnt_np2[True_chi_cerco]['age']
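
Why reuse the mask built on the first array to index the second one? The sketch below, written for Python 3 and with made-up names (`full`, `short`, `mask`), suggests the reason: a field that is too short silently truncates the names, so searching for the full name in the truncated array finds nothing.

```python
import numpy as np

names = ['Donatello', 'Raffaello', 'Michelangelo', 'Leonardo']
ages = [14, 15, 13, 16]

# 'U20' keeps up to 20 unicode characters, 'U1' keeps only the first one.
full = np.array(list(zip(ages, names)), dtype=[('age', 'i8'), ('name', 'U20')])
short = np.array(list(zip(ages, names)), dtype=[('age', 'i8'), ('name', 'U1')])

truncated_names = short['name'].tolist()

# The full name matches one record in `full` and none in `short`.
mask = full['name'] == 'Leonardo'
found_in_short = bool((short['name'] == 'Leonardo').any())

# Both arrays list the records in the same order, so the mask built on `full`
# still selects the right record in `short`.
age_from_short = short[mask]['age']
```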

## Transpose array

In [None]:
cacca.T

### dot product

In [None]:
np.dot(cacca, cacca.T)

## Reading/writing from/to file

In [18]:
adjacency_matrix=np.genfromtxt('something.txt', delimiter=',', dtype='i8')

IOError: something.txt not found.

In [None]:
np.savetxt('something_new.txt',adjacency_matrix, fmt='%u', delimiter=',')

In [None]:
np.genfromtxt('something_new.txt', delimiter=',', dtype='i8')
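
Since `something.txt` is not available, here is a small self-contained round trip that can be run anywhere. It is only a sketch: the matrix `m` and the file name `toy_matrix.txt` are invented for the example, and the file is written to a temporary directory.

```python
import os
import tempfile

import numpy as np

m = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]])

# Write to a temporary directory so that nothing in the repository is touched.
path = os.path.join(tempfile.mkdtemp(), 'toy_matrix.txt')
np.savetxt(path, m, fmt='%d', delimiter=',')

# Read it back with the same delimiter and an integer dtype.
loaded = np.genfromtxt(path, delimiter=',', dtype='i8')
same = np.array_equal(m, loaded)
```

Without `dtype='i8'`, `np.genfromtxt` would return floats, which is rarely what you want for an adjacency matrix.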

##### Interesting stuff and functions

### np.zeros(N, dtype='type') - create an array of N zeros of type 'type'

In [None]:
np.zeros(42, dtype='int')

In [None]:
np.zeros(42, dtype=str)

### np.ones(N, dtype='type') - guess what

In [None]:
np.ones(42, dtype='int')

In [None]:
np.ones(42, dtype=str)

### np.arange(N) - create an array that counts from 0 up to N-1

In [None]:
np.arange(42)

In [None]:
np.arange(4,42)

### np.unique(cacca) - returns a sorted array with the elements of cacca, each appearing only once

In [None]:
cacca=np.array([1,2,4,1,2,4,12,42])

In [None]:
np.unique(cacca)

In [None]:
np.unique(cacca, return_counts=True)
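
With `return_counts=True` you get two aligned arrays, the values and how often each occurs. A possible way to use them (the names `values`, `frequency` and `repeated` are just for this sketch):

```python
import numpy as np

values = np.array([1, 2, 4, 1, 2, 4, 12, 42])

# `uniq` is sorted; counts[i] is how many times uniq[i] occurs in `values`.
uniq, counts = np.unique(values, return_counts=True)

# Pair each value with its count.
frequency = dict(zip(uniq.tolist(), counts.tolist()))

# The counts can be used as a mask on the unique values:
# keep only the values that occur more than once.
repeated = uniq[counts > 1]
```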

### np.sum()

In [21]:
adjacency_matrix = np.array([[0, 1, 0, 1], [0, 0, 1, 1], [1, 1, 0, 1],[0, 0, 1, 0]])
adjacency_matrix

array([[0, 1, 0, 1],
       [0, 0, 1, 1],
       [1, 1, 0, 1],
       [0, 0, 1, 0]])

In [22]:
np.sum(adjacency_matrix)  # sum over all the elements

8

In [23]:
np.sum(adjacency_matrix, axis=0)  # sum along axis 0 (down the rows): one total per column

array([1, 2, 2, 3])

In [24]:
np.sum(adjacency_matrix, axis=1)  # sum along axis 1 (across the columns): one total per row

array([2, 2, 3, 1])
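
The `axis` argument names the axis that disappears in the result. For an adjacency matrix these sums are degrees; the sketch below assumes the convention that `A[i, j] = 1` means a link from node `i` to node `j` (the notebook does not say which convention it uses, so treat the in/out labels as an assumption).

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# axis=0 collapses the rows: one total per column.
col_totals = A.sum(axis=0)
# axis=1 collapses the columns: one total per row.
row_totals = A.sum(axis=1)

# Under the assumed convention, rows count outgoing links and columns incoming ones.
out_degree = row_totals
in_degree = col_totals

# Every link is counted once in each sequence, so both add up to the number of links.
n_links = A.sum()
```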

### np.where()

In [None]:
np.where(cacca==1)

In [None]:
np.where(adjacency_matrix==1)
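
On a 2-D array `np.where(condition)` returns a tuple with one index array per axis. Zipping them together turns an adjacency matrix back into a list of (row, column) pairs, as in this sketch (the names `rows`, `cols`, `edges` and `relabelled` are made up here):

```python
import numpy as np

A = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

# One array of row indices and one of column indices, in row-major order.
rows, cols = np.where(A == 1)
edges = list(zip(rows.tolist(), cols.tolist()))

# The three-argument form picks a value per element instead of returning indices:
# 10 where the condition holds, -1 elsewhere.
relabelled = np.where(A == 1, 10, -1)
```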

## Exercise:
1. **load the file ./data/imdb_2018_films_actors.txt** It is an edge list (_what is an edge list?_) of a bipartite network (_what is a bipartite network?_), with films in the first column and actors in the second;
2. **calculate the degree sequence** for both layers (_what is a layer?_);
3. **build the biadjacency matrix** (_what is a biadjacency matrix?_);
4. **calculate the nearest neighbour degree** (_what is the nearest neighbour degree?_).

#### 1. Load the file

In [36]:
#ad_mat = np.genfromtxt('./data/imdb_2018_films_actors.txt', delimiter='\t', dtype='|S100')

with open('./data/imdb_2018_films_actors.txt', 'r') as f_in:
    text = f_in.read()

lines = text.splitlines()
movie_to_actor = {}
for line in lines:
    movie, actor = line.split('\t')  # the two columns are tab-separated
    if movie in movie_to_actor:
        movie_to_actor[movie].append(actor)
    else:
        movie_to_actor[movie]=[actor]
        
for key in movie_to_actor:
    print(key, movie_to_actor[key])

('Adrift', ['Sam Claflin', 'Elizabeth Hawthorne', 'Tami Ashcraft', 'Kael Damlamian', 'Neil Andrea', 'Tim Solomon', 'Shailene Woodley', 'Jeffrey Thomas', 'Grace Palmer', 'Siale Tunoka', 'Lei-Ming Caine', 'Apakuki Nalawa'])
('Eighth Grade', ['Josh Hamilton', 'Jake Ryan', 'Fred Hechinger', 'Luke Prael', 'Nora Mullins', 'Missy Yager', 'Greg Crowe', 'Frank Deal', 'Tiffany Grossfeld', 'Trinity Goscinsky-Lynch', 'Kevin R. Free', 'Deborah Unger', 'Marguerite Stimpson', 'Veronica Bikowicz', 'Castor Feinberg', 'Courtney Gonzalez', 'Faith Kelly', 'Luke Mulligan', 'Kaileen Quinones', 'Tom Stratford', 'Elsie Fisher', 'Emily Robinson', 'Daniel Zolghadri', 'Imani Lewis', 'Catherine Oliviere', 'Gerald W. Jones', 'Shacha Temirov', "Thomas John O'Reilly", 'J. Tucker Smith', 'David Shih', 'Natalie Carter', 'Keith Maurice Davis', 'William Alexander Wunsch', 'Phoebe Amirault', 'Dan Chen', 'Andrew Geher', 'Gerald W. Jones III', 'Jalesia Martinez', 'Dina Pearlman', 'Shane Stackpole', 'Kathryn Zimmer'])
('Und

#### 2. The degree sequence
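
One possible approach, sketched on a tiny made-up edge list rather than on the real file (the film and actor names below are placeholders): the degree of a node is the number of times it appears in its column of the edge list, which is exactly what `np.unique(..., return_counts=True)` counts.

```python
import numpy as np

# Hypothetical toy edge list (film, actor) with the same two-column layout as the file.
edges = [('Film A', 'Actor 1'), ('Film A', 'Actor 2'),
         ('Film B', 'Actor 2'), ('Film B', 'Actor 3'),
         ('Film C', 'Actor 2')]

films = np.array([film for film, _ in edges])
actors = np.array([actor for _, actor in edges])

# Each row of the edge list is one link, so counting occurrences gives the degree.
film_names, film_degree = np.unique(films, return_counts=True)
actor_names, actor_degree = np.unique(actors, return_counts=True)
```

This assumes that no (film, actor) pair is repeated in the edge list; duplicates would be counted twice.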

#### 3. The biadjacency matrix
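
A sketch of one way to do it, again on a made-up toy edge list: `np.unique(..., return_inverse=True)` gives every film and every actor an integer index, and those indices say which entries of the matrix to switch on. Rows are films and columns are actors here; that orientation is a choice made for this example.

```python
import numpy as np

# Hypothetical toy edge list (film, actor).
edges = [('Film A', 'Actor 1'), ('Film A', 'Actor 2'),
         ('Film B', 'Actor 2'), ('Film B', 'Actor 3'),
         ('Film C', 'Actor 2')]

films = np.array([film for film, _ in edges])
actors = np.array([actor for _, actor in edges])

# film_idx[e] is the row of the film in edge e, actor_idx[e] the column of its actor.
film_names, film_idx = np.unique(films, return_inverse=True)
actor_names, actor_idx = np.unique(actors, return_inverse=True)

# B[i, j] = 1 if actor j plays in film i, 0 otherwise.
B = np.zeros((len(film_names), len(actor_names)), dtype=int)
B[film_idx, actor_idx] = 1
```

The row sums and column sums of `B` give back the two degree sequences.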

#### 4. The Nearest Neighbour Degree
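
Taking the nearest neighbour degree of a node to mean the average degree of its neighbours (an assumption; check it against the definition given in class), it can be sketched with a matrix product. The biadjacency matrix `B` below is a made-up toy example with films on the rows and actors on the columns.

```python
import numpy as np

# Hypothetical toy biadjacency matrix: 3 films (rows) x 3 actors (columns).
B = np.array([[1, 1, 0],
              [0, 1, 1],
              [0, 1, 0]])

film_degree = B.sum(axis=1).astype(float)
actor_degree = B.sum(axis=0).astype(float)

# For film i: sum the degrees of its actors, then divide by the number of actors.
film_knn = B.dot(actor_degree) / film_degree

# For actor j: the same, using the films the actor plays in.
actor_knn = B.T.dot(film_degree) / actor_degree
```

The division assumes that every node has at least one link, which is always the case for nodes read from an edge list.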

### Other interesting functions of numpy

np.diag()

In [None]:
np.diag(adjacency_matrix)

extracts the diagonal from a square matrix or

In [None]:
np.diag([1,2,3,4])

builds a diagonal matrix from the array in the argument.

np.vstack()

In [None]:
np.vstack((np.arange(4), np.arange(4,8)))

It stacks the two arrays one on top of the other.

np.isin()

In [None]:
cacca=np.array([0,1,3,4,5,42])

In [None]:
np.isin(cacca, np.array([0,42,3]))

In [None]:
cacca[np.isin(cacca, np.array([0,42,3]))]

In [None]:
np.isin(np.array([0,42,3]),cacca)

### Exercise: project the network on the two layers and get the adjacency matrix (i.e. the binarized version of the weighted matrix)
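
As a hint, here is a sketch on a made-up biadjacency matrix (films on the rows, actors on the columns). Multiplying `B` by its transpose counts the shared neighbours of every pair of nodes in one layer; dropping the diagonal and keeping only "greater than zero" gives the binary adjacency matrix.

```python
import numpy as np

# Hypothetical toy biadjacency matrix: 3 films (rows) x 3 actors (columns).
B = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 0, 1]])

# W_films[i, j] = number of actors shared by films i and j.
W_films = B.dot(B.T)
np.fill_diagonal(W_films, 0)      # the diagonal holds the film degrees: no self-links
A_films = (W_films > 0).astype(int)

# Same construction on the other layer: films shared by each pair of actors.
W_actors = B.T.dot(B)
np.fill_diagonal(W_actors, 0)
A_actors = (W_actors > 0).astype(int)
```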

### Exercise: calculate the clustering per node and its average value on the film projection
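The clustering coefficient of a node is the number of observed triangles passing through that node divided by the number of triangles that could possibly be realised, i.e. one for each pair of its neighbours.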
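
The definition can be sketched on a small made-up undirected network (a triangle plus one extra node hanging off it). The sketch relies on the fact that the diagonal of the third power of a binary, symmetric adjacency matrix with zero diagonal counts every triangle through a node twice. Setting the clustering of nodes with fewer than two neighbours to 0 is a convention chosen here, not something stated in the exercise.

```python
import numpy as np

# Hypothetical toy network: nodes 0, 1, 2 form a triangle, node 3 is linked to node 2 only.
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]])

k = A.sum(axis=1)

# diag(A^3)[i] = closed walks of length 3 starting at i = 2 * (triangles through i).
triangles = np.diag(A.dot(A).dot(A)) / 2.0

# A node with k neighbours has k * (k - 1) / 2 pairs of neighbours.
possible = k * (k - 1) / 2.0

# Avoid dividing by zero for nodes with fewer than two neighbours.
clustering = np.zeros(len(k))
has_pairs = possible > 0
clustering[has_pairs] = triangles[has_pairs] / possible[has_pairs]

average_clustering = clustering.mean()
```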