## CENTRALITY MEASURES
These are the measures we will apply to analyse the people collaboration network

* Degree centrality
* Eigenvector centrality
* Closeness centrality ( most well connected agents)
* what are the movies in which our most important (influential and well connected) agents of Network 2 participated and how that is related to the influence ranking of the movie 
* components - clustering coefficient - local clustering coefficient - redundancy (e the connectedness of the network as well as individuate structural holes, hence those nodes, whose missing links give it more control over diffusion of knowledge and information between neighbors)

# Degree centrality

We want to find out who are the most connected people in the network

In [13]:
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd


with open("../datasets/network2proj.gexf", 'r', encoding='utf-8') as f:
    
# Read the graph from the file
    n2 = nx.read_gexf(f)

degree = nx.degree_centrality(n2) # the function returns a dictionary with nodes as keys and degree as value
degree = sorted(degree.items(), key=lambda x: x[1], reverse=True) #sorting in descending order
total = len(degree) #saving the total nodes in the network

#filter out all nodes whose degree is 0.0
for tup in degree:
    if tup[1] == 0.0:
        degree.pop(degree.index(tup))

filtered = len(degree) # updated dimension of the network

net_percentage = (filtered*100) // total
print("the current network is " + str(net_percentage) + " percent of the initial one")
highest_degree = degree[:20]
print(highest_degree)



the current network is 98 percent of the initial one
[('nm0372942', 0.006879383264628727), ('nm0002031', 0.005139461427271267), ('nm0909825', 0.004095508324856791), ('nm0204862', 0.003988436211788639), ('nm0586969', 0.0039081321269875255), ('nm0001379', 0.0039081321269875255), ('nm0851537', 0.00385459607045345), ('nm0953123', 0.00385459607045345), ('nm0748784', 0.0038278280421864125), ('nm0503777', 0.003720755929118261), ('nm0324875', 0.0036136838160501093), ('nm0064415', 0.0036136838160501093), ('nm0000406', 0.0036136838160501093), ('nm0000636', 0.003586915787783072), ('nm0004976', 0.003586915787783072), ('nm0874308', 0.003506611702981958), ('nm0322946', 0.0034798436747149205), ('nm0215877', 0.0033995395899138066), ('nm0000229', 0.0032924674768456553), ('nm0907003', 0.0032924674768456553)]


Now I will create a dataframe with relevant attributes of the people with highest degree.

In [14]:
def createDF(tuple_list):
    names= []
    measure= []
    role= []
    movies= [] #I want to retrieve the movies he collaborated in

    for tup in tuple_list:
        edges = n2.edges(tup[0], data=True)
        edge_labels = [d['label'] for u, v, d in edges] # u and v are the endpoints of the edge, and d is a dictionary containing any edge attributes.
        names.append(n2.nodes[tup[0]]['name'])
        measure.append(tup[1])
        role.append(n2.nodes[tup[0]]['role'])
        movies.append(edge_labels) #these are 257 movies for the person with the highest degree, it is not possible to put those in a df
                                    # we will need to do some computation on the fly to have a "movie importance rank kinda thing"

    df_data = {'name': names, 'roles': role, 'measure': measure, 'works' : movies}
    measure_rank = pd.DataFrame(df_data)
    return measure_rank

print(createDF(highest_degree))


                   name                     roles   measure  \
0             Ben Hecht           writer,director  0.006879   
1        Michael Curtiz                  director  0.005139   
2           Raoul Walsh                  director  0.004096   
3            John Davis                  producer  0.003988   
4         Arnon Milchan                  producer  0.003908   
5           John Huston           director,writer  0.003908   
6         Norman Taurog                  director  0.003855   
7      Darryl F. Zanuck  producer,writer,director  0.003855   
8           Scott Rudin                  producer  0.003828   
9          Mervyn LeRoy         director,producer  0.003721   
10        Menahem Golan  producer,director,writer  0.003614   
11     William Beaudine           director,writer  0.003614   
12            John Ford         director,producer  0.003614   
13  William Shakespeare                    writer  0.003587   
14         Brian Grazer           producer,writer  0.00

# Eigenvector centrality

Eigenvector centrality is a measure of the importance of a node in a network based on the centrality of its neighboring nodes. In other words, it considers both the number of connections a node has and the importance of those connections. <br>
Key individuals may have a high number of collaborations, but more importantly, they collaborate with other key individuals in the network. This makes them more influential and potentially more important for the success of the collaborations

In the <code>eigenvector_centrality()</code> function, the <code>weight</code> parameter sets the <b>edge attribute</b> to use as weight for computing the measure.<br>
Since our edge is a movie, it would make sense to put as edge attribute the influence value computed from Network 1, in order to make this measure meaningful according to the importance of the movie that connects two people. Or, have a weight in terms of <b>country</b> attribute in order to highlight collaborations on East Asian movies

In [15]:
# returns a dictionary of nodes with eigenvector centrality as the value.

eigenvector = nx.eigenvector_centrality(n2, max_iter=1000)

values = []

eigenvector = sorted(eigenvector.items(), key=lambda x: x[1], reverse=True)

highest_eigenvector = eigenvector[:20]

for tup in highest_eigenvector:
    print(tup)


('nm0372942', 0.16769836434824734)
('nm0002031', 0.11702280101180426)
('nm0953123', 0.09984826036129019)
('nm0851537', 0.09353679952061277)
('nm0000581', 0.09250452786362423)
('nm0000406', 0.09137714078707712)
('nm0536941', 0.09111510406994863)
('nm0503777', 0.08580524228097451)
('nm0001328', 0.08564516199076809)
('nm0002030', 0.08479871044656977)
('nm0861703', 0.0840986397534976)
('nm0907003', 0.08117372688282819)
('nm0550892', 0.07864400661304938)
('nm0215877', 0.07840430433304797)
('nm0368871', 0.07797530705648269)
('nm0909825', 0.07727217077665534)
('nm0886754', 0.07606355572952471)
('nm0397022', 0.07585057380098514)
('nm0176699', 0.07578508854820164)
('nm0542534', 0.07331147136069123)


In [16]:
print(createDF(highest_eigenvector))

                    name                     roles   measure  \
0              Ben Hecht           writer,director  0.167698   
1         Michael Curtiz                  director  0.117023   
2       Darryl F. Zanuck  producer,writer,director  0.099848   
3          Norman Taurog                  director  0.093537   
4   Joseph L. Mankiewicz  writer,director,producer  0.092505   
5              John Ford         director,producer  0.091377   
6         John Lee Mahin                    writer  0.091115   
7           Mervyn LeRoy         director,producer  0.085805   
8           Howard Hawks           director,writer  0.085645   
9           George Cukor                  director  0.084799   
10        Richard Thorpe                  director  0.084099   
11            Jerry Wald           producer,writer  0.081174   
12       George Marshall                  director  0.078644   
13          Roy Del Ruth                  director  0.078404   
14        Henry Hathaway                

# connected components

There are many things that one can do with components, also generate the most strongly connected componest or the most weakly connected. Check Networkx documentation eventually.

In [17]:
components = nx.connected_components(n2)
sorted_components = sorted(components, key=len, reverse=True)

for c in sorted_components[:5]:
    # create a subgraph for the current component
    subgraph = n2.subgraph(c)

# n_components = nx.number_connected_components(n2) #3477
# components_sets = [len(c) for c in sorted_components]

# clustering coefficient

In [23]:
clustering = nx.clustering(subgraph)
sorted_clustering = sorted(clustering.items(), key=lambda x: x[1], reverse=True)

for tup in sorted_clustering[:20]:
    print(tup)

min_cluster = min(clustering, key=lambda x: x[1])
max_cluster = max(clustering, key=lambda x: x[1])

print("\n max: " + str(sorted_clustering[0]))
print("\n min: " + str(sorted_clustering[-1]))


('nm1141362', 1.0)
('nm4122751', 1.0)
('nm0608378', 1.0)
('nm4977228', 1.0)
('nm2242713', 1.0)
('nm2565760', 1.0)
('nm0730361', 1.0)
('nm2300570', 1.0)
('nm0321040', 1.0)
('nm2735837', 1.0)
('nm5384593', 1.0)
('nm0250180', 1.0)
('nm4193081', 0.6666666666666666)
('nm2591458', 0.6666666666666666)
('nm3510596', 0.6)
('nm6091305', 0.4642857142857143)
('nm3146718', 0.4)
('nm4251668', 0.32142857142857145)

 max: ('nm1141362', 1.0)

 min: ('nm4251668', 0.32142857142857145)


In network analysis, the clustering coefficient is a measure of the extent to which nodes in a network tend to cluster together. It is a measure of the density of triangles in a network, where a triangle is a set of three nodes that are mutually connected by edges.

The local clustering coefficient is a specific type of clustering coefficient that is calculated for individual nodes in a network. It measures the proportion of a node's neighbors that are also neighbors of each other. In other words, it measures the extent to which a node's neighbors are connected to each other.

So, while the clustering coefficient is a measure of the overall clustering in a network, the local clustering coefficient measures the clustering around individual nodes. The local clustering coefficient is often used to identify highly connected nodes, or "hubs," in a network, while the clustering coefficient is more commonly used to assess the overall structure of a network.

# Local clustering coefficient

TO DO

The local clustering coefficient can be used to identify hubs in a network by measuring the extent to which a node's neighbors are connected to each other. A high local clustering coefficient for a node indicates that its neighbors are highly connected, which suggests that the node is located in a densely connected region of the network.

To identify hubs using the local clustering coefficient, you can calculate the local clustering coefficient for each node in the network and rank them in descending order based on their scores. Nodes with the highest scores are considered to be potential hubs.

However, it is important to note that not all nodes with high local clustering coefficients are necessarily hubs. In some cases, a high local clustering coefficient may simply indicate that a node is located in a small densely connected subgraph, without necessarily having connections to many other parts of the network.

To get a more complete picture of hub nodes in a network, it can be useful to combine the local clustering coefficient with other measures of centrality, such as degree centrality, betweenness centrality, or closeness centrality. By considering multiple measures of centrality, you can identify nodes that are not only well-connected locally, but also have connections to many other parts of the network, making them true hubs.

# Closeness centrality and Betwenness centrality

Please note that the code I used to calculate Closeness centrality and Betwennes centrality (below) is commented out. I took - on a network wit approximately 22000 nodes (condiseding just the 5 most numerous components, total nw was over 40000)- nearly 14 hours. I saved the result in a txt file in order to be able to access these data more comfortably at a later stage.

In [19]:
#closeness centrality
'''for c in sorted_components[:5]:
    # create a subgraph for the current component
    subgraph = n2.subgraph(c)
    
    # compute the closeness centrality for the current component
    closeness = nx.closeness_centrality(subgraph)
    
    # print the closeness centrality for each node in the component
    for node, centrality in closeness.items():
        print("Node {} has closeness centrality {}".format(node, centrality))'''

#betwennes centrality
'''for c in sorted_components[:5]:
    # create a subgraph for the current component
    subgraph = n2.subgraph(c)

    betweeness = nx.betweenness_centrality(subgraph)
    
    # print the closeness centrality for each node in the component
    for node, centrality in sorted(betweeness.items(), key=lambda x: x[1], reverse=True):
        print("Node {} has betwenness centrality {}".format(node, centrality))'''


betwenness = open("../datasets/betwenness.txt", "r")
closeness = open("../datasets/closness_centrality.txt", "r")
 
def create_tup_list(file, value_list):
    for line in file:
        line = line.rstrip()
        words = line.split(" ")
        if words[-1] != '0.0': # if value is 0.0 remove
            tup = (words[1], float(words[-1]))
            value_list.append(tup)
    return value_list

betwenness_list = []
closeness_list = []
betwenness_values = create_tup_list(betwenness, betwenness_list)
closeness_values = create_tup_list(closeness, closeness_list)

betwenness.close()
closeness.close()

Now visualize betwenness' results:

In [20]:
for tup in betwenness_values:
    print(tup)


('nm0000636', 0.02606911777595779)
('nm0000339', 0.019059521608900093)
('nm0204862', 0.017114773046759093)
('nm0281507', 0.016899771697342404)
('nm0406772', 0.015296300076494247)
('nm0324875', 0.014969859729868602)
('nm0372942', 0.014218035735378269)
('nm0000041', 0.0121801170762063)
('nm0322946', 0.012128714292970768)
('nm0586969', 0.01208197593191623)
('nm0000229', 0.01205003863363551)
('nm0001801', 0.011743381076384625)
('nm0007139', 0.011574292301976001)
('nm0000175', 0.011383899826026504)
('nm0004976', 0.01137692908746441)
('nm0393094', 0.011352399021130215)
('nm0000247', 0.011062367504596781)
('nm0748784', 0.01061693356733859)
('nm0000600', 0.010544208933902528)
('nm0005428', 0.010501650717345011)
('nm0849083', 0.010432766072987869)
('nm0224537', 0.010332373422858403)
('nm0000590', 0.010133186912334703)
('nm0522454', 0.010132724319608405)
('nm0083696', 0.010051821823983146)
('nm0791217', 0.009939614682651773)
('nm0297935', 0.00942354628495639)
('nm0001379', 0.009206287976820667)


Now visualize closeness' results:

In [29]:
closeness_sorted = sorted(closeness_values, key=lambda x: x[1], reverse=True) #no items because it's a list not a dict
for tup in closeness_sorted:
    print(tup)

('nm1017500', 0.7083333333333334)
('nm7869716', 0.6538461538461539)
('nm3125601', 0.6333333333333333)
('nm0325181', 0.625)
('nm3329072', 0.59375)
('nm4765142', 0.5862068965517241)
('nm1993920', 0.5555555555555556)
('nm4461680', 0.5483870967741935)
('nm7008275', 0.5483870967741935)
('nm7187470', 0.5483870967741935)
('nm2402050', 0.5277777777777778)
('nm1442694', 0.5263157894736842)
('nm0740427', 0.5128205128205128)
('nm3146718', 0.5)
('nm0550521', 0.4878048780487805)
('nm3244902', 0.48717948717948717)
('nm10431442', 0.48717948717948717)
('nm3063966', 0.48717948717948717)
('nm2591458', 0.4857142857142857)
('nm4193081', 0.4857142857142857)
('nm5668578', 0.475)
('nm5311712', 0.475)
('nm1788605', 0.475)
('nm5349173', 0.4722222222222222)
('nm1706468', 0.4722222222222222)
('nm1936347', 0.4722222222222222)
('nm6482501', 0.4722222222222222)
('nm5149457', 0.4722222222222222)
('nm5349154', 0.4722222222222222)
('nm3095307', 0.46511627906976744)
('nm4113381', 0.46511627906976744)
('nm1645871', 0.46

Create betwenness DF:

In [None]:
highest_betwenness = betwenness_values[:20]
print(createDF(highest_betwenness))

                   name                     roles   measure  \
0   William Shakespeare                    writer  0.026069   
1          Roger Corman  director,producer,writer  0.019060   
2            John Davis                  producer  0.017115   
3     Richard Fleischer                  director  0.016900   
4      Takashige Ichise  producer,writer,director  0.015296   
5         Menahem Golan  producer,director,writer  0.014970   
6             Ben Hecht           writer,director  0.014218   
7        Akira Kurosawa           writer,director  0.012180   
8          Yoram Globus                  producer  0.012129   
9         Arnon Milchan                  producer  0.012082   
10     Steven Spielberg  director,producer,writer  0.012050   
11         Robert Towne           writer,director  0.011743   
12            Hark Tsui  director,producer,writer  0.011574   
13         Stephen King           writer,director  0.011384   
14         Brian Grazer           producer,writer  0.01

Create closeness DF:

In [28]:
highest_closeness = closeness_sorted[:20]
print(createDF(highest_closeness))

                      name     roles   measure  \
0         Nobuyoshi Sasaki  director  0.080913   
1              Ryûto Arita    writer  0.080913   
2            Masanori Kudô  director  0.080913   
3       Manpachi Tachibana    writer  0.080913   
4         Satoshi Sakamoto  director  0.080913   
5              Tôru Daimon  director  0.088035   
6             Sen'yo Arima    writer  0.088035   
7           Shunsuke Godai  producer  0.088036   
8             Takeshi Katô  producer  0.088073   
9          Jun'ichi Yagami    writer  0.088073   
10           Josh Kriegman  director  0.088658   
11         Elyse Steinberg  director  0.088658   
12  Gabriela Cowperthwaite  director  0.088658   
13          Tim Zimmermann    writer  0.088658   
14            Manny Oteyza  producer  0.088658   
15             Anna Mastro  director  0.090079   
16          Paul Shoulberg    writer  0.090079   
17             Ryan Harris  producer  0.090079   
18    Brenden Patrick Hill  producer  0.090079   


Nodes with the highest degree do not necessarily have higher closeness centrality. The degree of a node measures the number of edges that connect to it, while closeness centrality measures how quickly information can be transmitted from a node to all other nodes in the network. Although nodes with a higher degree are more directly connected to other nodes in the network, they may not necessarily be located in the most central location in terms of shortest path distances to all other nodes.

In fact, there are cases where nodes with low degree can have high closeness centrality. For example, consider a network where a single node is connected to all other nodes in the network. This node would have a high degree, but its closeness centrality would be low, because it would be far from all other nodes except for the ones it is directly connected to.

Conversely, there are cases where nodes with high degree can have low closeness centrality. For example, consider a star-shaped network where one node is connected to all other nodes in the network. The center node in this case would have the highest degree, but its closeness centrality would be lower than any of the nodes on the outer layer of the star, which are closer to each other and to all other nodes in the network.

In summary, while there can be some correlation between node degree and closeness centrality, they are distinct measures that capture different aspects of node centrality in a network, and there can be cases where nodes with high degree have low closeness centrality and vice versa.

# Redundancy

the connectedness of the network as well as individuate structural holes, hence those nodes, whose missing links give it more control over diffusion of knowledge and information between neighbors


 local clustering coefficient