# Week 3: Exercises - Social Media & Graph Analytics

**Web and Social Network Analytics**

---

**Instructions**: Complete each exercise in the provided code cells. If you get stuck, open the hints in order; each one reveals a little more help than the last.

**Note**: Some exercises build on previous ones, so complete them in order.

## Setup

Run this cell first to import all required libraries.

In [None]:
# Graph analysis
import networkx as nx
from networkx.algorithms import bipartite

# Visualization
import matplotlib.pyplot as plt

# Data handling
import pandas as pd
import pprint as pp

print('All libraries imported successfully!')

---

## Exercise 1: Building a Social Network Graph (Easy)

**Task**: Create a directed graph representing Twitter follows between 6 users.

**Requirements**:
1. Create a directed graph (`DiGraph`)
2. Add at least 8 edges representing "follows" relationships
3. Include at least one mutual follow (A follows B AND B follows A)
4. Visualize the graph with labeled nodes

**Expected Output**: A network visualization showing arrows indicating follow direction.

**Skills Practiced**:
- Creating directed graphs with `nx.DiGraph()`
- Adding edges with `add_edge()`
- Visualizing with `nx.draw()` and `spring_layout`

---

<details>
<summary>Hint 1: Creating a directed graph</summary>

```python
# Create a directed graph
G = nx.DiGraph()

# Add edges (source follows target)
G.add_edge('alice', 'bob')  # alice follows bob
```
</details>

<details>
<summary>Hint 2: Adding multiple edges</summary>

```python
# Add multiple edges at once
G.add_edges_from([
    ('alice', 'bob'),
    ('alice', 'carol'),
    ('bob', 'alice'),  # mutual follow
])
```
</details>

<details>
<summary>Hint 3: Visualization</summary>

```python
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, with_labels=True, node_size=1500, 
        arrows=True, arrowsize=20)
plt.title("Twitter Follows Network")
plt.show()
```
</details>
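
<details>
<summary>Hint 4: Counting nodes, edges, and mutual follows</summary>

A minimal sketch of how the statistics in Step 3 and the mutual-follow requirement could be checked. The graph `G_demo` and the users `u1` to `u3` are hypothetical and only serve as an illustration; they are not the expected solution.

```python
import networkx as nx

# Toy follow graph with hypothetical users (not the exercise solution)
G_demo = nx.DiGraph()
G_demo.add_edges_from([('u1', 'u2'), ('u2', 'u1'), ('u2', 'u3'), ('u3', 'u1')])

print(f"Number of nodes: {G_demo.number_of_nodes()}")
print(f"Number of edges: {G_demo.number_of_edges()}")

# A follow is mutual when both (u, v) and (v, u) are edges.
# Sorting each pair and collecting into a set reports every mutual pair once.
mutual = sorted({tuple(sorted((u, v)))
                 for u, v in G_demo.edges()
                 if G_demo.has_edge(v, u)})
print(f"Mutual follows: {mutual}")
```
</details>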

In [None]:
# Exercise 1: Your code here
# ===========================

# Step 1: Create a directed graph


# Step 2: Add edges representing follows (at least 8 edges)
# Users: alice, bob, carol, dave, eve, frank


# Step 3: Print basic statistics
print(f"Number of nodes: ")
print(f"Number of edges: ")

# Step 4: Visualize the graph


---

## Exercise 2: Calculating Centrality Measures (Easy-Medium)

**Task**: Calculate degree centrality and betweenness centrality for the graph you created in Exercise 1.

**Requirements**:
1. Calculate degree centrality for all nodes
2. Calculate betweenness centrality for all nodes
3. Identify the node with the highest degree centrality
4. Identify the node with the highest betweenness centrality
5. Visualize the graph with node sizes proportional to degree centrality

**Expected Output**: 
- Two dictionaries showing centrality scores
- Identification of most central nodes
- Visualization with varying node sizes

**Skills Practiced**:
- Using `nx.degree_centrality()` and `nx.betweenness_centrality()`
- Sorting dictionaries by value
- Dynamic node sizing in visualizations

---

<details>
<summary>Hint 1: Calculating centrality</summary>

```python
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)

print("Degree Centrality:")
pp.pprint(degree)
```
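
It can help to know what the score means: `nx.degree_centrality` divides each node's degree by n − 1, and in a `DiGraph` the degree counts incoming plus outgoing edges, so a score above 1 is possible. The sketch below recomputes the values by hand on a hypothetical three-node graph (`G_demo` is an illustration, not your Exercise 1 graph).

```python
import networkx as nx

# Toy directed graph with hypothetical nodes (not the Exercise 1 solution)
G_demo = nx.DiGraph([('a', 'b'), ('b', 'a'), ('a', 'c')])
n = G_demo.number_of_nodes()

# Degree centrality by hand: (in-degree + out-degree) / (n - 1)
by_hand = {node: (G_demo.in_degree(node) + G_demo.out_degree(node)) / (n - 1)
           for node in G_demo.nodes()}

print("By hand: ", by_hand)
print("NetworkX:", nx.degree_centrality(G_demo))
```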
</details>

<details>
<summary>Hint 2: Finding the maximum</summary>

```python
# Find node with highest degree centrality
max_degree_node = max(degree, key=degree.get)
print(f"Most central by degree: {max_degree_node}")
```
</details>

<details>
<summary>Hint 3: Visualization with variable node sizes</summary>

```python
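# Recompute the Exercise 1 layout (same seed, so nodes keep their positions)
pos = nx.spring_layout(G, seed=42)

# Scale node sizes by centrality
sizes = [degree[node] * 3000 for node in G.nodes()]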

nx.draw(G, pos, with_labels=True, node_size=sizes)
plt.title("Node Size = Degree Centrality")
plt.show()
```
</details>

In [None]:
# Exercise 2: Your code here
# ===========================

# Step 1: Calculate degree centrality


# Step 2: Calculate betweenness centrality


# Step 3: Find and print the most central nodes
print("Most central by degree: ")
print("Most central by betweenness: ")

# Step 4: Visualize with node sizes based on degree centrality


---

## Exercise 3: Clustering Coefficient Calculation (Medium)

**Task**: Calculate clustering coefficients by hand and verify with NetworkX.

**Given Graph**: Create the following undirected graph:
- Nodes: A, B, C, D, E
- Edges: A-B, A-C, A-D, B-C, C-D, D-E

**Requirements**:
1. Create the graph and visualize it
2. Calculate the clustering coefficient for node A by hand (show your work)
3. Calculate the clustering coefficient for node D by hand
4. Verify your calculations using `nx.clustering()`
5. Calculate the average clustering coefficient

**Expected Output**: 
- Hand calculations matching NetworkX results
- For node A: the neighbors are B, C, D. Edges between neighbors: B-C and C-D, so 2. Maximum possible: 3 × 2 / 2 = 3. Clustering = 2/3 ≈ 0.667

**Skills Practiced**:
- Understanding the clustering coefficient formula
- Using `nx.clustering()` and `nx.average_clustering()`
- Verifying algorithmic results by hand

---

<details>
<summary>Hint 1: Clustering coefficient formula</summary>

For undirected graphs:
$$C_i = \frac{2 \times \text{edges between neighbors}}{k_i \times (k_i - 1)}$$

where $k_i$ is the degree of node $i$.
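
The formula can be translated into code almost literally. The sketch below does this on a small hypothetical graph (`G_toy`, with nodes P, Q, R, S, deliberately different from the exercise graph) and compares the result with `nx.clustering`. The helper name `clustering_by_hand` is made up for this illustration.

```python
from itertools import combinations

import networkx as nx


def clustering_by_hand(G, node):
    """Apply C_i = 2 * (edges between neighbors) / (k_i * (k_i - 1))."""
    neighbors = list(G.neighbors(node))
    k = len(neighbors)
    if k < 2:
        # With fewer than two neighbors no triangle is possible
        return 0.0
    # Count the edges that actually exist among the neighbors
    links = sum(1 for u, v in combinations(neighbors, 2) if G.has_edge(u, v))
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph, not the graph from this exercise
G_toy = nx.Graph([('P', 'Q'), ('P', 'R'), ('P', 'S'), ('Q', 'R')])

for node in G_toy.nodes():
    by_hand = clustering_by_hand(G_toy, node)
    print(f"{node}: by hand = {by_hand:.3f}, NetworkX = {nx.clustering(G_toy, node):.3f}")
```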
</details>

<details>
<summary>Hint 2: Finding neighbors</summary>

```python
# Get neighbors of a node
neighbors = list(G.neighbors('A'))
print(f"Neighbors of A: {neighbors}")
print(f"Degree of A: {len(neighbors)}")
```
</details>

<details>
<summary>Hint 3: Verifying with NetworkX</summary>

```python
# Get clustering for single node
print(f"Clustering of A: {nx.clustering(G, 'A'):.3f}")

# Get all clustering coefficients
all_clustering = nx.clustering(G)
pp.pprint(all_clustering)

# Average clustering
print(f"Average: {nx.average_clustering(G):.3f}")
```
</details>

In [None]:
# Exercise 3: Your code here
# ===========================

# Step 1: Create the undirected graph
G_cluster = nx.Graph()
# Add edges: A-B, A-C, A-D, B-C, C-D, D-E


# Step 2: Visualize the graph


# Step 3: Hand calculation for node A
print("=== Hand Calculation for Node A ===")
print(f"Neighbors of A: ")
print(f"Degree of A: ")
print(f"Edges between neighbors: ")
print(f"Max possible edges: ")
print(f"Clustering coefficient: ")

# Step 4: Hand calculation for node D
print("\n=== Hand Calculation for Node D ===")
# Your work here


# Step 5: Verify with NetworkX
print("\n=== NetworkX Verification ===")


# Step 6: Calculate average clustering coefficient


---

## Exercise 4: Community Detection with Kernighan-Lin (Medium)

**Task**: Apply the Kernighan-Lin algorithm to partition a graph into two communities.

**Given Graph**: Create a graph with 8 nodes that has two natural communities:
- Community 1: A, B, C, D (densely connected)
- Community 2: E, F, G, H (densely connected)
- A few edges connecting the communities

**Requirements**:
1. Create the graph with clear community structure
2. Calculate the initial cut size if split by original labels (A-D vs E-H)
3. Apply Kernighan-Lin bisection
4. Compare the original partition with Kernighan-Lin result
5. Visualize both partitions with different colors

**Expected Output**: 
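- Kernighan-Lin should usually recover the natural partition, or one with an equally small cut; it is a heuristic that starts from a random split, so results can vary between runs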
- Visualization showing the two communities

**Skills Practiced**:
- Using `nx.cut_size()`
- Applying `kernighan_lin_bisection()`
- Comparing partition quality

---

<details>
<summary>Hint 1: Creating a graph with communities</summary>

```python
G = nx.Graph()

# Community 1: densely connected
G.add_edges_from([('A','B'), ('A','C'), ('A','D'), ('B','C'), ('B','D'), ('C','D')])

# Community 2: densely connected
G.add_edges_from([('E','F'), ('E','G'), ('E','H'), ('F','G'), ('F','H'), ('G','H')])

# Bridge edges (few)
G.add_edges_from([('D','E'), ('C','F')])
```
</details>

<details>
<summary>Hint 2: Calculate cut size</summary>

```python
set1 = {'A', 'B', 'C', 'D'}
set2 = {'E', 'F', 'G', 'H'}

cut = nx.cut_size(G, set1, set2)
print(f"Cut size: {cut}")
```
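
To see what `nx.cut_size` measures, the crossing edges can also be counted directly. This is a sketch on a hypothetical toy graph (`G_toy`, two triangles joined by one bridge), not the exercise graph.

```python
import networkx as nx

# Hypothetical toy graph: two triangles joined by a single bridge edge (3, 4)
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

left = {1, 2, 3}
right = {4, 5, 6}

# An edge crosses the cut when exactly one of its endpoints lies in `left`
crossing = [(u, v) for u, v in G_toy.edges() if (u in left) != (v in left)]

print(f"Crossing edges: {crossing}")
print(f"Counted by hand: {len(crossing)}")
print(f"nx.cut_size:     {nx.cut_size(G_toy, left, right)}")
```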
</details>

<details>
<summary>Hint 3: Apply Kernighan-Lin</summary>

```python
from networkx.algorithms.community import kernighan_lin_bisection

# Kernighan-Lin starts from a random split; a fixed seed makes the result repeatable
partition1, partition2 = kernighan_lin_bisection(G, seed=42)

print(f"Partition 1: {partition1}")
print(f"Partition 2: {partition2}")
print(f"New cut size: {nx.cut_size(G, partition1, partition2)}")
```
</details>
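
<details>
<summary>Hint 4: Coloring nodes by partition</summary>

One possible way to prepare the colors for the partition plots, sketched on a hypothetical toy graph (`G_toy`). The color names are an arbitrary choice.

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

# Hypothetical toy graph: two triangles joined by one bridge edge
G_toy = nx.Graph()
G_toy.add_edges_from([(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)])

# A fixed seed makes the random starting split repeatable
part_a, part_b = kernighan_lin_bisection(G_toy, seed=42)

# Build one color per node, in the same order as G_toy.nodes()
colors = ['tab:blue' if node in part_a else 'tab:orange'
          for node in G_toy.nodes()]

print(dict(zip(G_toy.nodes(), colors)))
```

The resulting list can be passed to `nx.draw` through its `node_color` argument, in the same call that already sets `with_labels=True`.
</details>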

In [None]:
# Exercise 4: Your code here
# ===========================

# Step 1: Create graph with two communities
G_community = nx.Graph()

# Add edges for community 1 (A, B, C, D)


# Add edges for community 2 (E, F, G, H)


# Add bridge edges between communities


# Step 2: Visualize the original graph


# Step 3: Calculate original cut size
original_set1 = {'A', 'B', 'C', 'D'}
original_set2 = {'E', 'F', 'G', 'H'}

print(f"Original cut size: ")

# Step 4: Apply Kernighan-Lin bisection


# Step 5: Compare results
print(f"\nKernighan-Lin partition 1: ")
print(f"Kernighan-Lin partition 2: ")
print(f"Kernighan-Lin cut size: ")

# Step 6: Visualize Kernighan-Lin partition


---

## Exercise 5: Analyzing Student Network Data (Medium-Hard)

**Task**: Load the `graph_large.csv` dataset and perform comprehensive network analysis.

The dataset contains student familiarity data where each student rated how well they know other students (1-5 scale).

**Requirements**:
1. Load the data and build a directed graph (only include edges with weight > 3)
2. Calculate all centrality measures: degree, betweenness, PageRank, HITS
3. Create a comparison table of the top 5 students by each measure
4. Visualize the network with node sizes based on PageRank
5. Write a brief analysis: Which students are most "important"? Why might different measures give different rankings?

**Expected Output**: 
- Network with ~30 nodes
- Comparison table
- PageRank visualization
- Written analysis

**Skills Practiced**:
- Loading and processing CSV data
- Building graphs from DataFrames
- Comparing multiple centrality measures
- Interpreting network analysis results

---

<details>
<summary>Hint 1: Loading and understanding the data</summary>

```python
df = pd.read_csv('data/graph_large.csv', index_col=0)

# The columns include personality questions, experiences, and student IDs
col_names = list(df.columns)

# Student IDs are the last 30 columns (p0, p1, ..., p29)
people = col_names[-30:]
```
</details>

<details>
<summary>Hint 2: Building the graph</summary>

```python
DG = nx.DiGraph()

# Each row holds one student's answers; only the `people` columns are familiarity ratings
for student, ratings in df.iterrows():
    for other in people:
        value = ratings[other]
        if value > 3:  # keep only ratings above 3
            DG.add_edge(student, other, weight=value)
```
</details>

<details>
<summary>Hint 3: Creating comparison table</summary>

```python
# Calculate all measures
degree = nx.degree_centrality(DG)
betweenness = nx.betweenness_centrality(DG)
pagerank = nx.pagerank(DG)
hubs, authorities = nx.hits(DG)

# Get top 5 by PageRank
top_pr = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:5]
print("Top 5 by PageRank:")
for node, score in top_pr:
    print(f"  {node}: {score:.3f}")
```
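
For the side-by-side comparison, one option is to collect the top nodes of every measure into a single DataFrame. The sketch below assumes a hypothetical helper `top_k_table` and a toy graph `DG_toy` (not the student data). It uses only degree-based and betweenness scores to stay short; the dictionaries returned by `nx.pagerank` and `nx.hits` could be passed in the same way.

```python
import networkx as nx
import pandas as pd


def top_k_table(measures, k=5):
    """Return a DataFrame with one column per measure listing its top-k nodes."""
    columns = {}
    for name, scores in measures.items():
        # Sort nodes by score, highest first, and keep the first k
        ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:k]
        columns[name] = [f"{node} ({score:.3f})" for node, score in ranked]
    table = pd.DataFrame(columns)
    table.index = range(1, len(table) + 1)  # rank 1 = highest score
    table.index.name = "rank"
    return table


# Hypothetical toy graph; in the exercise this would be DG
DG_toy = nx.DiGraph([('s1', 's2'), ('s1', 's3'), ('s2', 's3'), ('s3', 's4'),
                     ('s4', 's5'), ('s5', 's3'), ('s6', 's3')])

measures = {
    "degree": nx.degree_centrality(DG_toy),
    "in_degree": nx.in_degree_centrality(DG_toy),
    "betweenness": nx.betweenness_centrality(DG_toy),
}
table = top_k_table(measures, k=3)
print(table)
```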
</details>

In [None]:
# Exercise 5: Your code here
# ===========================

# Step 1: Load the data
df = pd.read_csv('data/graph_large.csv', index_col=0)

# Get column information
col_names = list(df.columns)
personality_questions = col_names[:10]
people = col_names[-30:]
experiences_questions = col_names[10:-30]

print(f"Number of personality questions: {len(personality_questions)}")
print(f"Number of experience questions: {len(experiences_questions)}")
print(f"Number of students: {len(people)}")

In [None]:
# Step 2: Build the directed graph (edges with weight > 3)
DG = nx.DiGraph()

# Your code here


print(f"\nGraph created with {DG.number_of_nodes()} nodes and {DG.number_of_edges()} edges")

In [None]:
# Step 3: Calculate all centrality measures


# Step 4: Print top 5 by each measure
print("=== Top 5 by Degree Centrality ===")


print("\n=== Top 5 by Betweenness Centrality ===")


print("\n=== Top 5 by PageRank ===")


print("\n=== Top 5 by Hub Score ===")


print("\n=== Top 5 by Authority Score ===")


In [None]:
# Step 5: Visualize with PageRank-based node sizes
plt.figure(figsize=(12, 10))

# Your visualization code here


plt.title("Student Network - Node Size = PageRank")
plt.show()

**Your Analysis**: 

1. Which students appear most "important"? Do different measures agree?

*Write your answer here:*

2. Why might degree centrality and betweenness centrality give different top students?

*Write your answer here:*

3. What does a high hub score vs high authority score mean in this context?

*Write your answer here:*

---

## Bonus Challenge: Triadic Closure Analysis

**Task**: Identify potential new connections based on triadic closure.

**Requirements**:
1. Consider every pair of nodes that are NOT directly connected
2. Keep the pairs that share at least 2 common neighbors
3. List these pairs as "potential new connections"
4. Rank them by number of common neighbors, highest first

**Challenge**: Use the student network from Exercise 5.
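
<details>
<summary>Hint: Useful building blocks</summary>

NetworkX provides `nx.non_edges` (pairs of nodes with no edge between them) and `nx.common_neighbors` (shared neighbors of two nodes in an undirected graph). The sketch below shows both on a hypothetical toy graph (`G_toy`); applying the threshold and the ranking is left to you.

```python
import networkx as nx

# Hypothetical undirected toy graph, not the student network
G_toy = nx.Graph([('a', 'b'), ('a', 'c'), ('b', 'd'), ('c', 'd'), ('d', 'e')])

# Map each unconnected pair to the number of neighbors the two nodes share
counts = {}
for u, v in nx.non_edges(G_toy):
    shared = set(nx.common_neighbors(G_toy, u, v))
    counts[frozenset((u, v))] = len(shared)

for pair, n_shared in counts.items():
    print(f"{sorted(pair)}: {n_shared} common neighbors")
```
</details>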

In [None]:
# Bonus Challenge: Your code here
# ================================

def find_potential_connections(G, min_common_neighbors=2):
    """Find pairs of nodes that might become connected via triadic closure."""
    potential = []
    
    # Your code here
    
    return potential

# Test with the student network (convert to undirected for simplicity)
# G_undirected = DG.to_undirected()
# potential_connections = find_potential_connections(G_undirected)

# Print top 10 potential connections
# for node1, node2, common in potential_connections[:10]:
#     print(f"{node1} - {node2}: {common} common neighbors")