# Assessment Preparation Challenge 3: Twitter Network Analysis

---

## About This Challenge

| | |
|---|---|
| **Purpose** | Practice network analysis on real social media data |
| **Graded?** | No - purely for your benefit |
| **Solutions provided?** | No - similar tasks appear in the assessment |
| **Difficulty** | Medium-Hard |
| **Prerequisites** | Complete Week3-exercise.ipynb first |
| **Time estimate** | 45-60 minutes |

---

## What You'll Learn

1. Load and process real Twitter follower data
2. Build a directed graph from CSV adjacency list format
3. Calculate centrality measures on large networks
4. Filter nodes to focus on important users
5. Visualize large networks effectively
6. Identify influential accounts and potential communities

---

## The Challenge

You have been given anonymized Twitter follower data for a data science institution (encoded as `company_x`). Your task is to analyze this network to understand:

- Who are the most influential accounts?
- Are there bridge accounts connecting different communities?
- What can the network structure tell us about the institution's reach?

**Note**: This exercise is exploratory - there are no single "correct" answers. The goal is to practice your network analysis skills.

---

## Setup

Run this cell first to import all required libraries.

In [None]:
import networkx as nx
import matplotlib.pyplot as plt
import pprint as pp
import pandas as pd

print('All libraries imported successfully!')

---

## Step 1: Understanding the Data Format

### 1.1 Adjacency List Format

The data file `twitter_followers_of_company_x.csv` uses **adjacency list** format:

```
account_name,follower1,follower2,follower3,...
```

| Format | Meaning |
|--------|--------|
| `company_x,alice,bob,carol` | Alice, Bob, and Carol follow company_x |
| `alice,bob,carol` | Bob and Carol follow Alice |
| `bob` | Nobody follows Bob (or no data available) |

### 1.2 Example Data

```csv
company_x,oyster_marshmallow,ruby_hogbean,cherry_basil
dollar_meadowsweet,banana_marshmallow,cherry_basil
cherry_basil,banana_marshmallow
banana_marshmallow
```

**Reading this:**
- 3 accounts follow `company_x`
- 2 accounts follow `dollar_meadowsweet`
- 1 account follows `cherry_basil`
- Nobody follows `banana_marshmallow` (in our data)
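
The reading above can be checked mechanically. The following is a minimal sketch, using only the standard library, that turns the four example lines into `(follower, followed)` pairs and counts followers per account. The names `lines`, `follow_edges`, and `follower_counts` are illustrative and not part of the notebook.

```python
from collections import Counter

# The four example lines from section 1.2
lines = [
    "company_x,oyster_marshmallow,ruby_hogbean,cherry_basil",
    "dollar_meadowsweet,banana_marshmallow,cherry_basil",
    "cherry_basil,banana_marshmallow",
    "banana_marshmallow",
]

follow_edges = []  # (follower, followed) pairs
for line in lines:
    # First name is the account, the rest are its followers
    account, *followers = line.strip().split(",")
    for follower in followers:
        follow_edges.append((follower, account))

# Count how many followers each account has
follower_counts = Counter(followed for _, followed in follow_edges)
for line in lines:
    account = line.split(",")[0]
    print(f"{account}: {follower_counts[account]} follower(s)")
```

Storing each pair as `(follower, followed)` makes the direction of the relationship explicit, which matters later when interpreting in-degree and PageRank.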

---

## Step 2: Loading the Data

### 2.1 Reading the Adjacency List

NetworkX can read adjacency list files directly using `nx.read_adjlist()`.

**Important**: We use `DiGraph` (directed graph) because Twitter follows are directional - if Alice follows Bob, that doesn't mean Bob follows Alice.

In [None]:
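# read_adjlist() adds edges from the first name on each line to the rest,
# i.e. account -> follower. Reverse them so each edge points
# follower -> followed account and in-degree counts followers.
G = nx.read_adjlist('twitter_followers_of_company_x.csv',
                    delimiter=",",
                    create_using=nx.DiGraph).reverse()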

<details>
<summary>Why DiGraph?</summary>

Without `create_using=nx.DiGraph()`, NetworkX creates an undirected graph by default. For social media follow relationships, we need a **directed** graph because:

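- Follows are one-way: A following B does not imply that B follows A
- In-degree = number of followers, provided each edge points from follower to followed account
- Out-degree = number of accounts being followed, under the same edge direction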
</details>
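
To see which way the edges point after loading, here is a small sketch on the example lines from section 1.2. It uses `nx.parse_adjlist`, which parses an in-memory list of lines in the same adjacency list format; `toy_lines`, `G_toy`, and `G_follows` are illustrative names. NetworkX adds an edge from the first name on each line to every other name, so the parsed graph points from account to follower, and reversing it gives follower-to-followed edges.

```python
import networkx as nx

toy_lines = [
    "company_x,oyster_marshmallow,ruby_hogbean,cherry_basil",
    "dollar_meadowsweet,banana_marshmallow,cherry_basil",
    "cherry_basil,banana_marshmallow",
    "banana_marshmallow",
]

# Edges run from the first name on each line to the remaining names
G_toy = nx.parse_adjlist(toy_lines, delimiter=",", create_using=nx.DiGraph)
print(G_toy.out_degree("company_x"))  # followers listed on company_x's line

# Reverse so that each edge points follower -> followed account
G_follows = G_toy.reverse()
print(G_follows.in_degree("company_x"))      # number of followers
print(G_follows.out_degree("cherry_basil"))  # accounts cherry_basil follows
```

Checking a known account this way is a quick sanity test before trusting any in-degree or PageRank result.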

### 2.2 Basic Graph Statistics

**Your Task**: Print basic statistics about the loaded graph.

In [None]:
# Print basic statistics
print(f"Number of accounts (nodes): {G.number_of_nodes()}")
print(f"Number of follow relationships (edges): {G.number_of_edges()}")
print(f"Network density: {nx.density(G):.4f}")

# Check if company_x is in the network
print(f"\nIs 'company_x' in the network? {'company_x' in G.nodes()}")
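
For a directed graph without self-loops, the density reported by `nx.density` is the number of edges divided by the number of possible ordered pairs, `n * (n - 1)`. The sketch below reproduces that arithmetic by hand; `directed_density` is a hypothetical helper written for illustration.

```python
def directed_density(n_nodes: int, n_edges: int) -> float:
    """Fraction of possible directed edges (no self-loops) that are present."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))

# Toy example from section 1.2: 6 accounts, 6 follow relationships
print(f"{directed_density(6, 6):.4f}")  # prints 0.2000
```

Real follower networks are usually far sparser than this toy example, so expect a much smaller value on the full data set.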

---

## Step 3: Initial Visualization

### 3.1 Warning: Large Graph

**This network has ~500 nodes.** Visualizing it will be slow (1+ minute) and may look cluttered. This is normal for real-world network analysis!

### 3.2 Basic Visualization

Run this cell to see the full network. Be patient!

In [None]:
# WARNING: This may take 1-2 minutes to compute and render
print("Computing layout... (this takes a while)")

plt.figure(figsize=(12, 12))
pos = nx.spring_layout(G, seed=42)  # seed for reproducibility

nx.draw(G, pos, node_size=5, width=0.1, alpha=0.5, arrows=False)
plt.title("Twitter Network (Full Graph - ~500 nodes)")
plt.show()

print("Done!")

---

## Step 4: Filtering for Important Nodes

### 4.1 Why Filter?

A graph with roughly 500 nodes is slow to draw and hard to interpret. We can focus on the most important accounts by filtering.

### 4.2 Filtering by Degree Centrality

**Your Task**: Filter to keep only accounts with above-average degree centrality.

In [None]:
# Calculate degree centrality for all nodes
# (on a DiGraph this counts in-edges plus out-edges, divided by n - 1)
degree = nx.degree_centrality(G)

# Find the average degree centrality
avg_degree = sum(degree.values()) / len(degree)
print(f"Average degree centrality: {avg_degree:.4f}")

# Filter to keep only high-degree nodes
# YOUR CODE HERE: Filter degree dictionary to keep values > avg_degree
high_degree_accounts = {
    # Use dictionary comprehension
}

print(f"\nFiltered from {len(degree)} to {len(high_degree_accounts)} accounts")

<details>
<summary>Hint: Dictionary Comprehension</summary>

To filter a dictionary, use dictionary comprehension:

```python
high_degree_accounts = {
    key: value 
    for key, value in degree.items() 
    if value > avg_degree
}
```
</details>
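
It can help to see what the filter is comparing. The sketch below computes degree centrality by hand as total degree (in plus out) divided by `n - 1`, which is the normalisation `nx.degree_centrality` applies, and then keeps the above-average accounts. The follow edges are hypothetical and chosen only for illustration.

```python
# Hypothetical follow edges as (follower, followed) pairs
edges = [
    ("alice", "company_x"),
    ("bob", "company_x"),
    ("carol", "company_x"),
    ("company_x", "alice"),
    ("company_x", "bob"),
]

nodes = sorted({name for edge in edges for name in edge})
n = len(nodes)

# Total degree = in-degree + out-degree
total_degree = {node: 0 for node in nodes}
for follower, followed in edges:
    total_degree[follower] += 1
    total_degree[followed] += 1

# Degree centrality = total degree / (n - 1)
centrality = {node: deg / (n - 1) for node, deg in total_degree.items()}
average = sum(centrality.values()) / n

above_average = {node: c for node, c in centrality.items() if c > average}
print(above_average)
```

Because in-edges and out-edges are both counted, an account in a directed graph can score above 1, as `company_x` does here.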

### 4.3 Creating a Subgraph

**Your Task**: Create a subgraph containing only the high-degree accounts.

In [None]:
# Create subgraph with only high-degree nodes
# YOUR CODE HERE
G_filtered = G.subgraph(high_degree_accounts.keys()).copy()

print(f"Filtered graph: {G_filtered.number_of_nodes()} nodes, {G_filtered.number_of_edges()} edges")

---

## Step 5: Centrality Analysis

Now let's calculate various centrality measures on the **full graph** (not filtered) to find the most important accounts.

### 5.1 Degree Centrality

**Your Task**: Find the top 10 accounts by degree centrality.

In [None]:
# We already calculated degree above, let's sort and print top 10
# YOUR CODE HERE
print("Top 10 by Degree Centrality:")
print("-" * 50)
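
One possible way to rank any of the score dictionaries is to sort the `(name, score)` pairs in descending order and slice. The sketch below shows the idea on made-up scores; `top_n` and `toy_scores` are hypothetical names, and this is one approach among several.

```python
def top_n(scores: dict, n: int = 10) -> list:
    """Return the n (name, score) pairs with the highest scores, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:n]

# Made-up scores for illustration only
toy_scores = {"alice": 0.12, "bob": 0.40, "carol": 0.05, "dave": 0.31}

for rank, (name, score) in enumerate(top_n(toy_scores, n=3), start=1):
    print(f"{rank:>2}. {name:<10} {score:.4f}")
```

The same helper works unchanged for betweenness and PageRank, since all three measures return a dictionary keyed by node.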

### 5.2 Betweenness Centrality

**Your Task**: Calculate betweenness centrality and find top 10.

In [None]:
# Calculate betweenness centrality
print("Calculating betweenness centrality... (may take a moment)")
betweenness = nx.betweenness_centrality(G)

# Print top 10
# YOUR CODE HERE
print("\nTop 10 by Betweenness Centrality:")
print("-" * 50)
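
To build intuition for what a high betweenness score means, here is a small sketch on a hypothetical graph in which two groups are joined only through one account. The node names are invented for illustration.

```python
import networkx as nx

# Two small groups joined only through "bridge" (hypothetical names)
G_toy = nx.DiGraph()
G_toy.add_edges_from([
    ("a1", "a2"), ("a2", "a1"),
    ("a1", "bridge"), ("a2", "bridge"),
    ("bridge", "b1"), ("bridge", "b2"),
    ("b1", "b2"), ("b2", "b1"),
])

betweenness_toy = nx.betweenness_centrality(G_toy)
top_account = max(betweenness_toy, key=betweenness_toy.get)
print(top_account, round(betweenness_toy[top_account], 4))
```

Every shortest path from the `a` group to the `b` group passes through `bridge`, so it is the only account with a non-zero score. On larger graphs, `nx.betweenness_centrality` also accepts a `k` argument that estimates the scores from a sample of `k` source nodes, trading accuracy for speed.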

### 5.3 PageRank

**Your Task**: Calculate PageRank and find top 10.

In [None]:
# Calculate PageRank
pagerank = nx.pagerank(G)

# Print top 10
# YOUR CODE HERE
print("Top 10 by PageRank:")
print("-" * 50)
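
PageRank scores depend on edge direction: rank flows along each edge to the node it points at. The following is a simplified power-iteration sketch in plain Python, not NetworkX's implementation. It assumes edges point from follower to followed account, runs a fixed number of iterations instead of testing for convergence, and spreads the rank of accounts with no outgoing edges evenly over all nodes. All names are hypothetical.

```python
def pagerank_sketch(out_links, damping=0.85, iterations=100):
    """Simplified PageRank by power iteration.

    out_links maps each node to the list of nodes it points to.
    Every target must also appear as a key.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is shared among all nodes
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for node, targets in out_links.items():
            for target in targets:
                new_rank[target] += damping * rank[node] / len(targets)
        rank = new_rank
    return rank

# Hypothetical follow graph: edges point follower -> followed account
follows = {
    "alice": ["company_x"],
    "bob": ["company_x", "alice"],
    "carol": ["company_x"],
    "company_x": [],
}
scores = pagerank_sketch(follows)
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {score:.4f}")
```

The account that everyone follows ends up with the highest score. If the edges were stored the other way round, the ranking would instead favour accounts that follow many others.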

### 5.4 Comparison Table

**Your Task**: Create a table comparing the top 10 accounts across different measures.

In [None]:
# Create a comparison DataFrame for top accounts
# YOUR CODE HERE: Create a pandas DataFrame with columns:
# Account, Degree, Betweenness, PageRank

# Get union of top 10 from each measure
top_degree = {k for k, _ in sorted(degree.items(), key=lambda x: x[1], reverse=True)[:10]}
top_betweenness = {k for k, _ in sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:10]}
top_pagerank = {k for k, _ in sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:10]}

important_accounts = top_degree | top_betweenness | top_pagerank

# Create comparison data
comparison_data = []
for account in important_accounts:
    comparison_data.append({
        'Account': account,
        'Degree': round(degree.get(account, 0), 4),
        'Betweenness': round(betweenness.get(account, 0), 4),
        'PageRank': round(pagerank.get(account, 0), 4)
    })

df = pd.DataFrame(comparison_data).sort_values('PageRank', ascending=False)
print("Comparison of Important Accounts:")
print(df.to_string(index=False))

---

## Step 6: Identifying Key Accounts

### 6.1 Find company_x's Scores

**Your Task**: What are company_x's centrality scores? How does it rank?

In [None]:
# Find company_x's scores
account = 'company_x'

if account in G.nodes():
    print(f"Scores for {account}:")
    print(f"  Degree centrality: {degree.get(account, 0):.4f}")
    print(f"  Betweenness centrality: {betweenness.get(account, 0):.4f}")
    print(f"  PageRank: {pagerank.get(account, 0):.4f}")
    
    # Find rank by PageRank
    sorted_pr = sorted(pagerank.items(), key=lambda x: x[1], reverse=True)
    rank = next(i for i, (k, _) in enumerate(sorted_pr, start=1) if k == account)
    print(f"\n  Rank by PageRank: {rank} out of {len(pagerank)}")
else:
    print(f"{account} not found in network")

### 6.2 Finding Influencers (High PageRank)

**Your Task**: Identify accounts that might be good for company_x to engage with.

In [None]:
# Find high PageRank accounts (potential influencers)
# YOUR CODE HERE: Find accounts in top 5% by PageRank

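# The cutoff is the score at index n_top - 1; index n_top would admit one extra account
n_top = max(1, int(len(pagerank) * 0.05))
threshold = sorted(pagerank.values(), reverse=True)[n_top - 1]
influencers = {k: v for k, v in pagerank.items() if v >= threshold}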

print(f"Found {len(influencers)} potential influencers (top 5%):")
for account, score in sorted(influencers.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {account}: {score:.4f}")
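
An alternative to a score threshold is to rank the accounts and slice off the top fraction. The sketch below illustrates this on made-up scores. Slicing returns a fixed number of accounts, whereas a `>=` threshold also keeps any accounts tied at the cutoff. `top_fraction` and `toy_scores` are hypothetical names.

```python
import math

def top_fraction(scores: dict, fraction: float = 0.05) -> dict:
    """Keep the highest-scoring `fraction` of entries (at least one)."""
    n_keep = max(1, math.floor(len(scores) * fraction))
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:n_keep])

# 40 hypothetical accounts with distinct scores
toy_scores = {f"account_{i}": i / 1000 for i in range(1, 41)}

selected = top_fraction(toy_scores, 0.05)
print(selected)
```

With 40 accounts, 5% corresponds to exactly two entries, so the two highest-scoring accounts are returned.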

### 6.3 Finding Bridge Accounts (High Betweenness)

**Your Task**: Find accounts that connect different parts of the network.

In [None]:
# Find high betweenness accounts (bridges)
# YOUR CODE HERE

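# Cutoff is the score at index n_top - 1. If that score is 0, the >= test keeps every account.
n_top = max(1, int(len(betweenness) * 0.05))
threshold_betw = sorted(betweenness.values(), reverse=True)[n_top - 1]
bridges = {k: v for k, v in betweenness.items() if v >= threshold_betw}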

print(f"Found {len(bridges)} bridge accounts (top 5% betweenness):")
for account, score in sorted(bridges.items(), key=lambda x: x[1], reverse=True)[:10]:
    print(f"  {account}: {score:.4f}")

---

## Step 7: Improved Visualization

### 7.1 Visualization Tips for Large Graphs

- Use the **filtered subgraph** (fewer nodes)
- Size nodes by **PageRank** to highlight importance
- Use **transparency** (alpha) to reduce visual clutter
- Consider hiding arrows for clarity

In [None]:
# Visualize filtered graph with PageRank-based sizing
plt.figure(figsize=(14, 12))

# Use the filtered graph
pos_filtered = nx.spring_layout(G_filtered, seed=42)

# Scale node sizes by PageRank
sizes = [pagerank.get(node, 0.001) * 30000 for node in G_filtered.nodes()]

# Draw
nx.draw(G_filtered, pos_filtered, 
        node_size=sizes, 
        node_color='lightcoral',
        alpha=0.7,
        width=0.3,
        arrows=False)  # Hide arrows for clarity

# Add labels only for top nodes
top_nodes = {node for node, _ in sorted(pagerank.items(), key=lambda x: x[1], reverse=True)[:15]}
labels = {node: node for node in G_filtered.nodes() if node in top_nodes}
nx.draw_networkx_labels(G_filtered, pos_filtered, labels, font_size=8)

plt.title("Twitter Network - Node Size = PageRank (Top accounts labeled)")
plt.show()
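
Multiplying PageRank by a fixed constant works when the scores fall in a convenient range, but the resulting sizes depend on the size of the graph. One alternative, sketched below under the assumption that a linear mapping is acceptable, is to rescale the scores onto a chosen range of marker sizes. `scale_sizes` and its default bounds are hypothetical choices.

```python
def scale_sizes(values, min_size=20.0, max_size=1500.0):
    """Linearly map scores onto a range of marker sizes."""
    low, high = min(values), max(values)
    if high == low:
        # All scores equal: use one mid-range size for every node
        return [(min_size + max_size) / 2 for _ in values]
    span = high - low
    return [min_size + (v - low) / span * (max_size - min_size) for v in values]

# Made-up PageRank values for four nodes
toy_pagerank = [0.002, 0.004, 0.010, 0.050]
sizes = scale_sizes(toy_pagerank)
print([round(s, 1) for s in sizes])
```

The returned list can be passed as `node_size` to `nx.draw`, as long as it is built in the same order as the graph's nodes.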

---

## Step 8: Your Analysis

Based on your exploration, answer these questions:

### Question 1: Top 5 Influential Accounts

Which 5 accounts have the most influence in this network? Why do you think so?

*Write your answer here:*

---

### Question 2: Bridge Accounts

Are there accounts that seem to connect different groups? How can you identify them?

*Write your answer here:*

---

### Question 3: Degree Distribution

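Does the degree distribution look heavy-tailed, as a power law would predict (a few highly connected nodes, many with few connections)? Note that a histogram can suggest a power law but cannot confirm one.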

In [None]:
# Plot degree distribution to check for power law
degrees = [G.degree(n) for n in G.nodes()]

plt.figure(figsize=(12, 4))

plt.subplot(1, 2, 1)
plt.hist(degrees, bins=30, edgecolor='black')
plt.xlabel('Degree')
plt.ylabel('Frequency')
plt.title('Degree Distribution (Linear Scale)')

plt.subplot(1, 2, 2)
plt.hist(degrees, bins=30, edgecolor='black')
plt.xlabel('Degree')
plt.ylabel('Frequency (log scale)')
plt.yscale('log')
plt.title('Degree Distribution (Log Scale)')

plt.tight_layout()
plt.show()
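
A power law appears as a roughly straight line when both axes are logarithmic, so a log-log view is often more informative than a log-scaled histogram. The sketch below computes the complementary cumulative distribution (the fraction of nodes with at least a given degree), which avoids choosing histogram bins. The degree list is made up, and `degree_ccdf` is a hypothetical helper.

```python
from collections import Counter

def degree_ccdf(degrees):
    """Return (degree, fraction of nodes with degree >= that value) pairs."""
    counts = Counter(degrees)
    total = len(degrees)
    remaining = total
    ccdf = []
    for degree in sorted(counts):
        ccdf.append((degree, remaining / total))
        remaining -= counts[degree]
    return ccdf

# Made-up degree sequence: many low-degree nodes, one hub
toy_degrees = [1, 1, 1, 1, 1, 1, 2, 2, 3, 8]
for degree, fraction in degree_ccdf(toy_degrees):
    print(degree, fraction)
```

The resulting pairs can be drawn with `plt.loglog`. A straight-looking line is suggestive rather than conclusive, particularly on a network of only a few hundred nodes.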

*Write your answer here:*

---

### Question 4: Strategic Recommendations

If you were advising company_x, which accounts should they engage with to expand their reach? Why?

*Write your answer here:*

---

### Question 5: Patterns Observed

What other interesting patterns did you observe in this network?

*Write your answer here:*

---

## Bonus Challenges

If you finish early, try these:

1. **HITS Algorithm**: Calculate hub and authority scores. How do they compare to PageRank?

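2. **Community Detection**: Try Kernighan-Lin bisection (`networkx.algorithms.community.kernighan_lin_bisection`, which expects an undirected graph, so convert with `G.to_undirected()` first) or another community detection algorithm. Can you identify distinct communities?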

3. **Ego Network**: Extract the ego network (immediate neighbors) of company_x and visualize it.

4. **Performance**: If the visualization is too slow, try the `graph-tool` library instead of NetworkX.

In [None]:
# Bonus: HITS Algorithm
# YOUR CODE HERE


In [None]:
# Bonus: Ego network of company_x
# YOUR CODE HERE
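
As a starting point for the ego network, the sketch below shows how `nx.ego_graph` behaves on a small hypothetical directed graph. By default it follows outgoing edges only. Passing `undirected=True` also includes accounts that point at the centre node. The node names are invented for illustration.

```python
import networkx as nx

# Hypothetical follow graph: edges point follower -> followed account
G_toy = nx.DiGraph()
G_toy.add_edges_from([
    ("alice", "company_x"),
    ("bob", "company_x"),
    ("company_x", "carol"),
    ("dave", "alice"),
])

# Default: only nodes reachable along outgoing edges within one step
ego_out = nx.ego_graph(G_toy, "company_x", radius=1)
print(sorted(ego_out.nodes()))

# undirected=True: in-neighbours and out-neighbours within one step
ego_all = nx.ego_graph(G_toy, "company_x", radius=1, undirected=True)
print(sorted(ego_all.nodes()))
```

Which variant is appropriate depends on the stored edge direction: to capture an account's followers, the ego network must include edges that point at it.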


---

## Summary

In this challenge, you practiced:

- Loading real social network data from CSV
- Working with large graphs (handling performance)
- Calculating and interpreting centrality measures
- Filtering and creating subgraphs
- Visualizing networks effectively
- Drawing analytical conclusions from network data

These skills are directly applicable to the final assessment!