# Part 2: Email Behaviour Data Analysis

---

### Install Python packages (pip only)

In [1]:
#e.g., %pip install some-package
%pip install networkx numpy matplotlib



### Import Python packages

In [37]:
#e.g., import some-package
import numpy as np
import networkx as nx
import json

---

### Task 1 of 2

Examine the file "emails.edgelist", which represents email behaviour at an organisation. Each line contains two numbers, 𝑢 and 𝑣, separated by a blank space. Each number identifies an individual in the organisation, and each line indicates that individual 𝑢 sent at least one email to individual 𝑣 at some point. Model the data using an appropriate directed network representation and answer the following questions:

##### Q1. Do the majority of individuals have a higher or lower ratio of mutual connections than average in the network?

In [38]:

# loading emails edgelist and creating a directed graph. 
email_network = nx.read_edgelist("emails.edgelist", create_using=nx.DiGraph())

mutual_connection_ratio = {}

# Calculating mutual connections in the network
for node in email_network.nodes():
    # Getting outgoing edges
    out_degree = email_network.out_degree(node)
    mutual_connections = 0
    # looping through the neighbors of the node 
    for neighbor in email_network.neighbors(node): 
        # Checking if there's a connection
        if email_network.has_edge(neighbor, node):
            mutual_connections +=1 
    if out_degree > 0: 
        mutual_connection_ratio[node] = mutual_connections / out_degree
    else: 
        mutual_connection_ratio[node] = 0 # Handling nodes with no outgoing edges
        
average_mutual_connection_ratio = np.mean(list(mutual_connection_ratio.values()))

# getting count of mutual connection ratio for nodes in the network greater/less than average
greater_than_average = sum(1 for ratio in mutual_connection_ratio.values() if ratio > average_mutual_connection_ratio)

less_than_average = sum(1 for ratio in mutual_connection_ratio.values() if ratio < average_mutual_connection_ratio)


if greater_than_average > less_than_average: 
    print(f"Majority of the individuals have a higher ratio of mutual connections than the average in the network.\nTotal Number: {greater_than_average}")
elif greater_than_average < less_than_average: 
    print(f"Majority of the individuals have a lower ratio of mutual connections than the average in the network.\nTotal Number: {less_than_average}")
else: 
    print("There's an equal number of individuals with lower and higher mutual connection ratios than the average in the network")
  


Majority of the individuals have a higher ratio of mutual connections than the average in the network.
Total Number: 448
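
The per-node ratio used above can also be expressed with set intersections. The following is a minimal sketch on a hypothetical four-node toy graph; the edges and the helper name `mutual_ratio` are illustrative assumptions and are not taken from "emails.edgelist".

```python
import networkx as nx
import numpy as np

def mutual_ratio(graph, node):
    """Share of a node's outgoing links that are reciprocated (0.0 if it sends none)."""
    successors = set(graph.successors(node))
    if not successors:
        return 0.0
    reciprocated = successors & set(graph.predecessors(node))
    return len(reciprocated) / len(successors)

# Hypothetical toy network, not taken from emails.edgelist
toy = nx.DiGraph([(1, 2), (2, 1), (1, 3), (3, 4)])
ratios = {node: mutual_ratio(toy, node) for node in toy}
mean_ratio = np.mean(list(ratios.values()))

# Counting nodes strictly above and strictly below the network average
above = sum(r > mean_ratio for r in ratios.values())
below = sum(r < mean_ratio for r in ratios.values())
print(ratios, mean_ratio, above, below)
```

Nodes whose ratio is exactly equal to the average are counted in neither group, which matches the strict comparisons used in the cell above.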


##### Q2. Using the largest strongly connected component (where at least one path exists between each individual and all others), could the connectivity of the component be said to reflect a small-world phenomenon in comparison to the typical connectivity of 10 comparable random networks?

In [39]:
#CODE:
# Finding the largest strongly connected component
largest_scc = max(nx.strongly_connected_components(email_network), key=len)
email_scc = email_network.subgraph(largest_scc)

def random_network_generator(n, p): 
    """
    Function to generate random networks 
    """
    return nx.erdos_renyi_graph(n, p, directed=True)

def generate_average_path_length(Graph):
    """
    Function to calculate the average shortest path length
    """
    # Using a try/except block in case a randomly generated network
    # is not strongly connected, which makes networkx raise NetworkXError
    try: 
        return nx.average_shortest_path_length(Graph)
    except (nx.NetworkXError, nx.NetworkXNoPath): 
        return np.inf

#calculating the average path length of the largest scc
email_scc_path_length = generate_average_path_length(email_scc)

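# Generating 10 comparative random networks with the same number of nodes
# (the edge probability is fixed at 0.1 rather than matched to the density of email_scc)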
random_networks_path_lengths = []
for _ in range(10):
    random_network = random_network_generator(email_scc.number_of_nodes(), 0.1)
    random_networks_path_lengths.append(generate_average_path_length(random_network))

average_random_path_length = np.mean(random_networks_path_lengths)

if email_scc_path_length < average_random_path_length: 
    
    print("The small path length of the largest SCC compared to random networks suggests a small world phenomenon.")
else: 
    print("The path length of the largest SCC is comparable to random networks. The network may not exhibit a strong small world phenomenon.")

The path length of the largest SCC is comparable to random networks. The network may not exhibit a strong small world phenomenon.
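
A small-world network is usually characterised by an average path length close to that of a comparable random network together with noticeably higher clustering. The sketch below illustrates one way to make that comparison with random networks whose edge probability matches the observed density. The helper names, the `seed` parameter and the Watts-Strogatz stand-in graph are assumptions for illustration; it is not the method used in the cell above.

```python
import networkx as nx
import numpy as np

def largest_scc_subgraph(graph):
    """Return the subgraph induced by the largest strongly connected component."""
    nodes = max(nx.strongly_connected_components(graph), key=len)
    return graph.subgraph(nodes)

def small_world_summary(graph, n_random=10, seed=0):
    """Compare path length and clustering with density-matched random digraphs."""
    n = graph.number_of_nodes()
    p = nx.density(graph)  # edge probability that matches the observed density
    lengths, clusterings = [], []
    for i in range(n_random):
        random_graph = nx.gnp_random_graph(n, p, seed=seed + i, directed=True)
        # Path lengths are only defined inside a strongly connected component
        random_scc = largest_scc_subgraph(random_graph)
        lengths.append(nx.average_shortest_path_length(random_scc))
        clusterings.append(nx.average_clustering(random_scc))
    return {
        "observed_path_length": nx.average_shortest_path_length(graph),
        "random_path_length": float(np.mean(lengths)),
        "observed_clustering": nx.average_clustering(graph),
        "random_clustering": float(np.mean(clusterings)),
    }

# Hypothetical stand-in for email_scc: a small-world graph with every edge reciprocated
toy_scc = nx.connected_watts_strogatz_graph(60, 6, 0.1, seed=1).to_directed()
summary = small_world_summary(toy_scc)
print(summary)
```

With the real data, `email_scc` would take the place of `toy_scc`; similar path lengths combined with much higher observed clustering would point towards a small-world structure.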


##### Q3. Are occurrences of induced, connected subgraphs of 3 individuals (triads) with only mutual connections more abundant in the largest, strongly connected component than those with a mixture of asymmetric and mutual connections? What does this suggest about how mutual connections are distributed in the component?

In [40]:
#CODE:
def triad_occurrences(Graph, triad_type): 
    """
    Count pairs of a node's out-neighbours that are linked to each other.

    "mutual" counts pairs linked in both directions; "mixed" counts pairs
    linked in at least one direction, so it also includes the mutual pairs.
    """
    count = 0 
    
    for node in Graph.nodes(): 
        
        neighbors = list(Graph.neighbors(node))
        # comparing every pair of out-neighbours of the node
        for i in range(len(neighbors) - 1): 
            for j in range(i+1, len(neighbors)): 
                neighbor1, neighbor2 = neighbors[i], neighbors[j]
                
                if triad_type == "mutual" and Graph.has_edge(neighbor1, neighbor2) and Graph.has_edge(neighbor2, neighbor1):
                    count += 1
                elif triad_type == "mixed" and (Graph.has_edge(neighbor1, neighbor2) or Graph.has_edge(neighbor2, neighbor1)):
                    count += 1
    
    return count


mutual_triad_occurrences = triad_occurrences(email_scc, "mutual")
mixed_triad_occurrences = triad_occurrences(email_scc, "mixed")

if mutual_triad_occurrences > mixed_triad_occurrences: 
    print("Mutual connections are more abundant in triads within the largest SCC")
elif mutual_triad_occurrences < mixed_triad_occurrences: 
    print("Mixed connections are more abundant in triads.")
else: 
    print("Mutual and mixed connections are equally abundant in triads.")
                    

Mixed connections are more abundant in triads.
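
An alternative way to count induced triads is the standard 16-type triadic census, which networkx provides as `nx.triadic_census`. The sketch below groups the census types into connected triads containing only mutual dyads and connected triads containing both mutual and asymmetric dyads. The grouping, the helper name and the toy graph are assumptions made for illustration, and "connected" is taken to mean weakly connected.

```python
import networkx as nx

# Triad types in MAN notation (mutual, asymmetric, null dyad counts)
ONLY_MUTUAL = ("201", "300")                              # connected, mutual dyads only
MIXED = ("111D", "111U", "120D", "120U", "120C", "210")   # mutual and asymmetric dyads

def mutual_vs_mixed_triads(graph):
    """Return (count of only-mutual triads, count of mixed triads) for a digraph."""
    census = nx.triadic_census(graph)
    only_mutual = sum(census[code] for code in ONLY_MUTUAL)
    mixed = sum(census[code] for code in MIXED)
    return only_mutual, mixed

# Hypothetical toy network: a fully reciprocated triangle plus one one-way link
toy = nx.DiGraph([(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (0, 3)])
only_mutual_count, mixed_count = mutual_vs_mixed_triads(toy)
print(only_mutual_count, mixed_count)
```

Because every triad is assigned to exactly one census type, the two groups do not overlap, which is not the case for the neighbour-pair counts in the cell above.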


---
### Task 2 of 2

Examine the JSON file "emails_departments.json" (departments file). Keys in the departments file represent individuals using the same ids as in the "emails.edgelist" file in Part 2, Task 1 and the values represent a department id that the individual can be attributed to. Using the contents of the departments file in combination with the network in Part 2, Task 1, answer the following questions:

##### Q1. Using the connections that individuals have in the network, are they more likely to mix with others in their department or those with a similar number of outward connections?

In [41]:
# reading the json file
with open("emails_departments.json", "r") as f: 
    
    departments = json.load(f)
    
# Assigning department information to network nodes

for node, dept_id in departments.items(): 
    email_network.nodes[node]["department"] = dept_id

# Getting connection ratios within departments
connections_in_departments = {}

for node, neighbors in email_network.adj.items():
    
    department = email_network.nodes[node]["department"]
    in_department_count = 0
    for neighbor in neighbors: 
        if email_network.nodes[neighbor]["department"] == department: 
            in_department_count+=1
    try: 
        connections_in_departments[node] = in_department_count / len(neighbors)
    except ZeroDivisionError: 
        # where the node has no outgoing neighbors
        connections_in_departments[node] = 0.0

# Grouping nodes by identical total degree (in + out), used here as a
# proxy for a similar number of connections
similar_outward_connections = {}

for node, degree in email_network.degree(): 
    similar_outward_connections[node] = []
    
    for node_2 in email_network.nodes(): 
        if node_2 != node and email_network.degree(node_2) == degree: 
            similar_outward_connections[node].append(node_2)
            
connections_counter = 0
for node, ratio in connections_in_departments.items(): 
    # Counting nodes whose links go mostly to their own department and
    # that share their degree with at least one other node
    if ratio > 0.5 and len(similar_outward_connections[node])>0: 
        connections_counter+=1

final_ratio = connections_counter/ len(connections_in_departments)

if final_ratio > 0.5: 
    print("More likely to mix with others in the department")
else: 
    print("Not likely to mix with others in the department")

Not likely to mix with others in the department
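
Mixing by a categorical attribute and mixing by degree can also be compared directly with assortativity coefficients, where values near 1 indicate that links mostly join similar nodes. The sketch below uses a hypothetical six-node toy graph with made-up department labels; it illustrates the two networkx calls and is not the approach taken in the cell above.

```python
import networkx as nx

# Hypothetical toy network with two departments and no links between them
toy = nx.DiGraph()
toy.add_edges_from([(0, 1), (0, 2), (1, 2), (2, 0),
                    (3, 4), (3, 5), (4, 5), (5, 3)])
toy_departments = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
nx.set_node_attributes(toy, toy_departments, "department")

# Tendency to link to nodes in the same department
department_mixing = nx.attribute_assortativity_coefficient(toy, "department")

# Tendency to link to nodes with a similar number of outward connections
out_degree_mixing = nx.degree_assortativity_coefficient(toy, x="out", y="out")

print(department_mixing, out_degree_mixing)
```

On the real data, the larger of the two coefficients would indicate which kind of similarity is more strongly associated with who emails whom.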


##### Q2. Are all departments with 15 or more members more tightly connected amongst themselves in comparison to all individuals across the overall network irrespective of their department? In this context, 'more tightly connected' is defined as having more mutual AND clustered connections. In addition to answering the overall question as yes or no, provide a list of departments this is true for (if any) and not true for (if any).

In [36]:
# Grouping the departments

department_groups = {}

for member, dept_id in departments.items(): 
    if dept_id not in department_groups:
        #using a set to prevent duplicates
        department_groups[dept_id] = set()
    department_groups[dept_id].add(member)
    
# keeping departments with 15 or more members
targeted_depts = [dept_id for dept_id in department_groups.keys() if len(department_groups[dept_id]) >= 15]

# analyzing the cohesion of each department based on mutual clustered connection
dept_connectedness = {}

for dept_id in targeted_depts:
    dept_graph = email_network.subgraph(department_groups[dept_id])
    mutual_edges = sum(1 for u, v in dept_graph.edges() if email_network.has_edge(v, u))
    clustering_coefficient = nx.average_clustering(dept_graph)
    total_edges = dept_graph.number_of_edges()
    try: 
        mutual_connection_ratio = mutual_edges / total_edges 
    except ZeroDivisionError: 
        # where the department has no internal edges
        mutual_connection_ratio = 0
        
    dept_connectedness[dept_id] = (mutual_connection_ratio, clustering_coefficient)


total_mutual_edges = sum(1 for u, v in email_network.edges() if email_network.has_edge(v, u))
total_clustering_coefficient = nx.average_clustering(email_network)
total_edges = email_network.number_of_edges()
total_mutual_connections_ratio = total_mutual_edges /total_edges if total_edges > 0 else 0

more_tightly_connected_departments = []
not_tightly_connected_departments = []

for dept_id, (mutual_connection_ratio, clustering_coefficient) in dept_connectedness.items(): 
    if (mutual_connection_ratio > total_mutual_connections_ratio) and (clustering_coefficient > total_clustering_coefficient):
        more_tightly_connected_departments.append(dept_id)
    else: 
        not_tightly_connected_departments.append(dept_id)
        
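if len(not_tightly_connected_departments) > 0: 
    
    print("No, not all departments with 15 or more members are tightly connected amongst themselves.")
    print(f"Departments that are not tightly connected among themselves: {not_tightly_connected_departments}\n More tightly connected among themselves: {more_tightly_connected_departments}")
else: 
    print("Yes, all departments with 15 or more members are more tightly connected amongst themselves.")
    print(f"More tightly connected among themselves: {more_tightly_connected_departments}")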
        

No, not all departments with 15 or more members are tightly connected amongst themselves.
Departments that are not tightly connected among themselves: ['1', '14', '7', '15', '6', '0', '23']
 More tightly connected among themselves: ['21', '9', '17', '11', '10', '36', '4', '22', '16', '13', '19']
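
The two measures compared above, the share of reciprocated edges and the average clustering coefficient, can be wrapped in a single helper. The sketch below is a minimal illustration on a hypothetical four-node toy graph; the helper name `tightness` and the toy department are assumptions, and `nx.overall_reciprocity` is used as a stand-in for the manual mutual-edge count.

```python
import networkx as nx

def tightness(graph):
    """Return (share of reciprocated edges, average clustering) for a digraph."""
    if graph.number_of_edges() == 0:
        return 0.0, 0.0
    return nx.overall_reciprocity(graph), nx.average_clustering(graph)

# Hypothetical toy network: a fully reciprocated triangle plus one one-way link
toy = nx.DiGraph([(0, 1), (1, 0), (1, 2), (2, 1), (0, 2), (2, 0), (2, 3)])
dept_members = {0, 1, 2}  # made-up department

dept_reciprocity, dept_clustering = tightness(toy.subgraph(dept_members).copy())
all_reciprocity, all_clustering = tightness(toy)

# A department is "more tightly connected" only if it exceeds the network on both measures
is_tighter = dept_reciprocity > all_reciprocity and dept_clustering > all_clustering
print(dept_reciprocity, dept_clustering, all_reciprocity, all_clustering, is_tighter)
```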
