Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name below:

In [None]:
NAME = ""
STUDENT_ID = ""

---

# Social Network Analysis Assignment

This assignment aims to familiarize you with the concepts and practical applications of social network analysis. You will be working with Python and the NetworkX library to create, analyze, and understand various types of networks.


---

## Task 1: Create a Small Network

**Objective:** Generate a network with 10 nodes and a density of 0.2 using NetworkX.

**Instructions:**
1. Calculate the number of edges `m` needed to achieve a density `d = 0.2` for a network of `n = 10` nodes using the formula:

$$d = \frac{2m}{n(n-1)}$$


2. Create a graph in NetworkX in variable `G` and add the calculated number of edges.
3. Ensure the network consists of one large component.
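
Rearranging the density formula for the edge count gives $m = d\,n(n-1)/2$. The snippet below is a minimal sketch of that arithmetic on a different example (`n = 6`, `d = 0.4`), not the values used in this task. The helper names `edges_for_density` and `density` are hypothetical and are not part of NetworkX.

```python
def edges_for_density(n: int, d: float) -> int:
    """Edge count m that gives density d in an undirected simple graph on n nodes."""
    # Rearranged from d = 2m / (n(n-1)); round() guards against floating-point noise.
    return round(d * n * (n - 1) / 2)


def density(n: int, m: int) -> float:
    """Density of an undirected simple graph with n nodes and m edges."""
    return 2 * m / (n * (n - 1))


m = edges_for_density(6, 0.4)
print(m)  # 6
print(density(6, m))
```

Plugging the result back into the original formula is a quick way to confirm that the rounding did not change the target density.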

In [None]:
import networkx as nx

In [None]:

G = ...

# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert type(G) == nx.Graph

## Task 2: Create a Non-Small-World Network

**Objective:** Write a script to generate a network that is not a small world. The network should have 500 nodes and at least 1000 edges, with an average shortest path length of at least 50.

**Instructions:**
1. Create a function `create_large_world_network` that generates a network according to the specified criteria.
2. Ensure that the network is not a small world by design.
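
Whether a network behaves like a small world shows up in how its average shortest path length grows with its size. As a rough illustration on toy graphs (much smaller than the 500 nodes required here), the sketch below compares a complete graph, where every pair of nodes is one step apart, with a path graph, where distances grow with the number of nodes. The graph sizes are arbitrary choices made for this example.

```python
import networkx as nx

for n in (5, 20, 80):
    dense = nx.complete_graph(n)  # every pair of nodes is directly connected
    chain = nx.path_graph(n)      # nodes arranged in a single line
    avg_dense = nx.average_shortest_path_length(dense)
    avg_chain = nx.average_shortest_path_length(chain)
    print(f"n={n}: complete={avg_dense:.2f}, path={avg_chain:.2f}")
```

The complete graph stays at an average distance of 1 regardless of size, while the path graph's average distance keeps increasing as nodes are added.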

In [None]:
import networkx as nx
import matplotlib.pyplot as plt

def create_large_world_network():
    # YOUR CODE HERE
    raise NotImplementedError()
    return G

G = create_large_world_network()

print(f"Nodes: {len(G.nodes())}; edges: {len(G.edges())}; average shortest path: {nx.average_shortest_path_length(G)}")



## Task 3: Analyze the Florentine Families Network

**Objective:** Analyze the Florentine families network to find the most central node in terms of PageRank, betweenness, degree, and eigenvector centrality.

**Instructions:**
1. Load the Florentine families dataset (Padgett & Ansell, 1993).
2. Calculate and return the most central node for each centrality measure (PageRank, betweenness, degree, and eigenvector centrality).
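
NetworkX centrality functions return a dictionary that maps each node to its score, so the most central node is the key with the largest value. The sketch below shows that pattern on a toy star graph rather than on the Florentine families data; the variable names are chosen only for this illustration.

```python
import networkx as nx

# Toy graph: node 0 is the hub of a star with four leaves (nodes 1-4).
star = nx.star_graph(4)

degree_scores = nx.degree_centrality(star)            # dict: node -> score
betweenness_scores = nx.betweenness_centrality(star)  # dict: node -> score

# max() with key=dict.get returns the key whose value is largest.
top_degree_node = max(degree_scores, key=degree_scores.get)
top_betweenness_node = max(betweenness_scores, key=betweenness_scores.get)
print(top_degree_node, top_betweenness_node)
```

The same `max(..., key=...)` idiom should apply to the dictionaries returned by `nx.pagerank` and `nx.eigenvector_centrality`.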

In [None]:
import matplotlib.pyplot as plt


def find_most_central_nodes(G):
    # YOUR CODE HERE
    raise NotImplementedError()
    return top_pagerank, top_betweenness, top_degree, top_eigenvector

# Load the graph first so the centrality measures are computed on the Florentine families network
G = nx.florentine_families_graph()

top_pagerank, top_betweenness, top_degree, top_eigenvector = find_most_central_nodes(G)

print(f"Node with highest PageRank centrality: {top_pagerank}")
print(f"Node with highest Betweenness centrality: {top_betweenness}")
print(f"Node with highest Degree centrality: {top_degree}")
print(f"Node with highest Eigenvector centrality: {top_eigenvector}")

# Visualize the network
nx.draw(G, with_labels=True, node_color='lightblue', edge_color='gray', font_size=10, node_size=700)
plt.show()

---

## Task 4: Generate a Graph with Clustered Communities

**Objective:** Write a function to generate a graph with N clusters, each containing M nodes. Within each cluster, edges should connect randomly selected pairs of nodes so that the cluster has density D. The clusters should then be linked to one another with as few connections as possible, so that the whole graph forms a single component.

**Instructions:**
1. Create a function `create_clustered_graph` that generates a graph according to the description above.
2. Analyze the graph using Louvain community detection to examine the resulting modularity.

Now generate two networks with your algorithm: one with 3 clusters and one with 10 clusters. Keep the other parameters the same as in the example call (20 nodes per cluster, intra-cluster density of 1).

Run Louvain community detection on both. Which of the networks has a higher modularity? What do you make of this?
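
As a rough sketch of what running Louvain and reading off the modularity can look like, the example below uses NetworkX's own `louvain_communities` and `modularity` functions on a toy graph of two cliques joined by one edge. This is a swapped-in alternative and assumes a recent NetworkX release that ships `louvain_communities`; the notebook itself imports the `python-louvain` package, where the corresponding calls are `community_louvain.best_partition(G)` and `community_louvain.modularity(partition, G)`.

```python
import networkx as nx
from networkx.algorithms.community import louvain_communities, modularity

# Toy graph: two 5-node cliques (nodes 0-4 and 5-9) joined by a single bridging edge.
toy = nx.disjoint_union(nx.complete_graph(5), nx.complete_graph(5))
toy.add_edge(0, 5)

communities = louvain_communities(toy, seed=0)  # list of sets of nodes
q = modularity(toy, communities)
print(f"{len(communities)} communities, modularity {q:.3f}")
```

Fixing `seed` makes the result reproducible, which helps when comparing modularity values across graphs.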

In [None]:
from community import community_louvain

def create_clustered_graph(N, M, D):
    G = nx.Graph()

    # YOUR CODE HERE
    raise NotImplementedError()

    return G

def identify_louvain_modularity(G):
    # Use Louvain to identify communities in the graph and return the modularity.
    # YOUR CODE HERE
    raise NotImplementedError()


num_clusters = 5
nodes_per_cluster = 20
intra_density = 1
G = create_clustered_graph(num_clusters, nodes_per_cluster, intra_density)

modularity = identify_louvain_modularity(G)
print(f"The identified modularity is {modularity}.")

# Draw the graph
nx.draw(G, with_labels=True, node_color='lightblue')



---

## Task 5: Twitter Retweet Network Analysis

**Objective:** Create a network from a Twitter dataset and analyze it to find the most central politicians and calculate the weighted homophily based on party.

**Instructions:**
1. Load the Twitter dataset, `sweden_2018.df.pickle`, and create a network where nodes represent politicians and edges represent retweets.
2. Calculate the weighted homophily of the network. Here, weighted homophily is the tendency of individuals from the same party to form connections with one another.
3. Identify the most central politicians using different centrality measures.
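
To make the homophily measure concrete, the sketch below works through the arithmetic on hypothetical toy data, taking weighted homophily to be the share of total edge weight that falls on edges joining nodes with the same party. The node names, party labels, and weights are invented for illustration and do not come from the dataset.

```python
# Hypothetical toy data: (source, target, retweet_count) plus a party label per node.
edges = [("a", "b", 3), ("a", "c", 1), ("c", "d", 2)]
party = {"a": "X", "b": "X", "c": "Y", "d": "Y"}

# Weight on edges whose endpoints share a party, and weight on all edges.
same_party_weight = sum(w for u, v, w in edges if party[u] == party[v])
total_weight = sum(w for _, _, w in edges)

homophily_ratio = same_party_weight / total_weight
print(same_party_weight, total_weight, homophily_ratio)
```

Here 5 of the 6 units of weight stay within a party, so the ratio is 5/6. On a NetworkX graph, the same sums could be taken over `G.edges(data="weight")` with the party read from the node attributes.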

In [None]:
import pandas as pd
import re

df = pd.read_pickle('sweden_2018.df.pickle')
display(df.head())

def extract_retweeted_username(tweet):

    pattern = r'RT @(\w+):'
    match = re.search(pattern, tweet)
    if match:
        return match.group(1)  # Returns the captured username
    return None  # Returns None if no match is found

# 1. Extract retweets.
# 2. Include only retweets of politicians who have also sent at least one message.
# 3. Count the number of messages from each user to each user.
# 4. Create a weighted network from the resulting dataframe. The nodes should include the party name as an attribute.

# YOUR CODE HERE
raise NotImplementedError()
nx.draw(G)

def calculate_weighted_homophily(graph, attribute):
    # We now want to know the homophily of the network; that is, how much likelier nodes are to connect with nodes of the same party.
    # The weighted homophily is defined as the sum of edge weights that go between nodes with the same attributes, divided by the total edge weight.
    # Unfortunately, there is no built-in function for weighted homophily in networkx, so you have to write your own function!

    # YOUR CODE HERE
    raise NotImplementedError()
    return homophily_ratio

homophily = calculate_weighted_homophily(G, 'party')

print(f"The resulting homophily is {homophily}.")

def find_most_central_nodes(G):
    # YOUR CODE HERE
    raise NotImplementedError()
    return top_pagerank, top_betweenness, top_degree, top_eigenvector

top_pagerank, top_betweenness, top_degree, top_eigenvector = find_most_central_nodes(G)

print("Node with highest PageRank centrality:", top_pagerank)
print("Node with highest Betweenness centrality:", top_betweenness)
print("Node with highest Degree centrality:", top_degree)
print("Node with highest Eigenvector centrality:", top_eigenvector)
