# Measure 4: Social Network Analysis (The "Hub" & The "Isolate")

## 1. Introduction & Objective
**Objective:** To visualize the social structure of *Anna Karenina* and quantify character interactions using Network Theory metrics.

**Theoretical Framework:**
Leo Tolstoy constructs the novel around two parallel plots that rarely intersect. This analysis tests whether that structural separation shows up quantitatively in the character network:
1.  **The Society Plot (Anna):** A dense, interconnected web of St. Petersburg/Moscow society.
2.  **The Rural Plot (Levin):** An isolated, philosophical narrative largely removed from the main social centers.

## 2. Methodology
We utilize a **weighted undirected graph** where:
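* **Nodes (Characters):** Sized by **Degree Centrality** (the fraction of the other characters a node is linked to, ignoring edge weight) and colored on a **Heatmap** using the `plasma` colormap (Yellow = Highly Connected, Purple = Weakly Connected).
* **Edges (Lines):** Represent co-occurrence in the same sentence. Line thickness scales with the number of sentences in which the two characters appear together.
* **Layout:** A force-directed algorithm (`spring_layout`) pulls strongly connected characters together and pushes weakly connected ones toward the periphery.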
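
To make the node metric concrete, the sketch below computes degree centrality by hand on a toy three-character graph, next to weighted degree (the sum of edge weights). It uses only the standard library; the node names and weights are invented for illustration and are not taken from the novel. `networkx.degree_centrality` applies the same degree / (n - 1) normalization.

```python
# Toy weighted undirected graph: (u, v) -> co-occurrence count.
# These weights are made up for illustration only.
toy_edges = {("A", "B"): 10, ("A", "C"): 1}
nodes = ["A", "B", "C"]

def degree_centrality(nodes, edges):
    """Fraction of the other nodes each node is connected to (weights ignored)."""
    n = len(nodes)
    degree = {node: 0 for node in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return {node: d / (n - 1) for node, d in degree.items()}

def weighted_degree(nodes, edges):
    """Sum of the weights of all edges touching each node."""
    strength = {node: 0 for node in nodes}
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
    return strength

centrality = degree_centrality(nodes, toy_edges)
strength = weighted_degree(nodes, toy_edges)
print(centrality)  # {'A': 1.0, 'B': 0.5, 'C': 0.5}
print(strength)    # {'A': 11, 'B': 10, 'C': 1}
```

Note that B and C receive the same degree centrality even though B's single tie is ten times heavier. With only a handful of characters, unweighted degree can saturate quickly, so weighted degree may be a useful companion metric.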

In [None]:
# Install dependencies (Run once)
%pip install networkx pandas matplotlib nltk

In [None]:
import os
import itertools
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from IPython.display import display

# Ensure NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')

# --- PATH CONFIGURATION ---
DATA_DIR = '../data'
RESULTS_DIR = '../results'

# Create the results directory if it does not exist yet
os.makedirs(RESULTS_DIR, exist_ok=True)

# --- TARGET CONFIGURATION ---
# Filtering for the 8 primary characters to maintain graph readability
CONFIG = {
    "filename": "The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy.txt",
    "characters": ["Anna", "Vronsky", "Levin", "Kitty", "Karenin", "Stiva", "Dolly", "Betsy"]
}

## 3. Data Processing Pipeline

The following functions handle the text processing:
1.  **Sentence Segmentation:** Splitting the raw text into distinct sentences.
2.  **Interaction Scanning:** Identifying which characters appear together in a sentence.
3.  **Graph Construction:** Building the mathematical model of connections.
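
The three steps above can be sketched on a tiny example before running them on the full novel. This is a minimal illustration under stated assumptions: it swaps NLTK for a naive regex sentence splitter so that it runs with the standard library alone, the helper name `cooccurrence_counts` is hypothetical, and the sample sentences are invented rather than quoted from the text.

```python
import itertools
import re
from collections import Counter

def cooccurrence_counts(text, characters):
    """Count how often each pair of characters shares a sentence."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    names = {c.lower(): c for c in characters}
    counts = Counter()
    for sentence in sentences:
        # Lowercased word tokens, de-duplicated per sentence
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        # Sort so each pair is always stored in the same order
        found = sorted(names[t] for t in tokens if t in names)
        for pair in itertools.combinations(found, 2):
            counts[pair] += 1
    return counts

# Invented sample text, for illustration only
sample = (
    "Anna met Vronsky at the station. "
    "Levin thought of Kitty. "
    "Anna wrote to Vronsky again."
)
pair_counts = cooccurrence_counts(sample, ["Anna", "Vronsky", "Levin", "Kitty"])
print(pair_counts)
```

Each pair count corresponds to one edge weight in the graph; pairs that never share a sentence produce no edge at all.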

In [None]:
def load_text(filename):
    """Loads text file from the data directory."""
    filepath = os.path.join(DATA_DIR, filename)
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        print(f"ERROR: Could not find {filepath}")
        return ""

def build_graph(text, characters):
    """Builds a weighted undirected graph based on sentence co-occurrence."""
    sentences = sent_tokenize(text)
    G = nx.Graph()
    G.add_nodes_from(characters)
    
    # Case-insensitive mapping
    char_map = {c.lower(): c for c in characters}
    
    print(f"Processing {len(sentences)} sentences...")
    
    for sent in sentences:
        tokens = set(word_tokenize(sent.lower()))
        found = [char_map[c] for c in char_map if c in tokens]
        
        # Create edges for all pairs found in the sentence
        if len(found) > 1:
            for pair in itertools.combinations(found, 2):
                u, v = pair
                if G.has_edge(u, v):
                    G[u][v]['weight'] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G

## 4. Visualization Engine

This section generates the final high-resolution visualization.

**Visual Features:**
* **Force-Directed Layout:** Simulates physical forces to cluster related characters.
* **Heatmap Coloring:** Nodes range from Yellow (High Centrality) to Purple (Low Centrality).
* **Embedded Legend:** A numbered list in the bottom-right corner ranks the characters by number of connections (degree).
* **Source Credit:** Citation included in the bottom-left.

In [None]:
def analyze_and_draw(G):
    # --- SETUP PLOT ---
    plt.figure(figsize=(14, 10), facecolor='white')
    ax = plt.gca()
    
    # --- LAYOUT & METRICS ---
    # seed=42 ensures the graph looks the same every time you run it
    pos = nx.spring_layout(G, k=1.5, iterations=50, seed=42) 
    centrality = nx.degree_centrality(G)
    
    # Scale node size: big enough to see, but not covering the whole screen
    node_sizes = [v * 8000 + 500 for v in centrality.values()]
    
    # Calculate edge weights for line thickness
    weights = [G[u][v]['weight'] for u, v in G.edges()]
    max_weight = max(weights) if weights else 1
    
    # --- DRAW EDGES ---
    for (u, v, d) in G.edges(data=True):
        width = (d['weight'] / max_weight) * 4 + 0.5
        # For undirected graphs, connectionstyle (curved edges) only takes effect when
        # arrows=True; arrowstyle="-" then draws the curve without an arrowhead.
        nx.draw_networkx_edges(G, pos, edgelist=[(u, v)], width=width, alpha=0.3, 
                               edge_color="#555555", connectionstyle="arc3,rad=0.1", 
                               arrows=True, arrowstyle="-", ax=ax)

    # --- DRAW NODES ---
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=list(centrality.values()), 
                           cmap=plt.cm.plasma, alpha=0.9, edgecolors='white', linewidths=2, ax=ax)
    
    # --- DRAW LABELS ---
    labels = nx.draw_networkx_labels(G, pos, font_size=12, font_weight="bold", ax=ax)
    # Add a white "halo" outline to text so it's readable over lines
    import matplotlib.patheffects as path_effects
    for _, label in labels.items():
        label.set_path_effects([path_effects.withStroke(linewidth=3, foreground='white')])

    # --- ADD NUMBERED MINI-TABLE (BOTTOM RIGHT) ---
    degrees = dict(G.degree())
    sorted_stats = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    
    # Build the numbered legend text, one "rank. name : degree" line per character
    legend_text = "\n".join(
        f"{i}. {char:<10} : {count}" for i, (char, count) in enumerate(sorted_stats, 1)
    )

    # Place text box
    props = dict(boxstyle='round', facecolor='white', alpha=0.9, edgecolor='gray')
    plt.text(0.96, 0.02, legend_text, transform=ax.transAxes, fontsize=11,
             verticalalignment='bottom', horizontalalignment='right', bbox=props, fontfamily='monospace')

    # --- ADD SOURCE CREDIT (BOTTOM LEFT) ---
    plt.text(0.02, 0.02, "Data Source: Project Gutenberg", transform=ax.transAxes, 
             fontsize=10, color='gray', style='italic')

    # --- TITLES & SAVING ---
    plt.title("Character Interaction Network: Anna Karenina", fontsize=18, fontweight='bold', pad=20)
    plt.axis('off')
    
    save_path = os.path.join(RESULTS_DIR, "anna_karenina_network_final.png")
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    plt.show()
    print(f"Graph saved to: {save_path}")

    # --- DISPLAY FULL TABLE BELOW ---
    print("\n" + "="*40)
    print("FULL CONNECTION DATA TABLE")
    print("="*40)
    
    # Rank characters by degree; the 1-based index doubles as the rank
    df = pd.DataFrame(list(degrees.items()), columns=['Character', 'Connections (Degree)'])
    df = df.sort_values(by='Connections (Degree)', ascending=False).reset_index(drop=True)
    df.index += 1
    df.index.name = 'Rank'
    
    display(df)
    
    # Keep the index so the Rank column is written to the CSV as well
    csv_path = os.path.join(RESULTS_DIR, "anna_karenina_network_table.csv")
    df.to_csv(csv_path)
    print(f"\nTable saved to: {csv_path}")

## 5. Main Execution
Run this cell to load the data, generate the visual, and export the statistics.

In [None]:
def run_analysis():
    print("Loading text data...")
    text = load_text(CONFIG['filename'])
    
    if text:
        G = build_graph(text, CONFIG['characters'])
        if G.number_of_edges() > 0:
            print("Generating network visualization and data table...")
            analyze_and_draw(G)
        else:
            print("No interactions found among the specified characters.")
    else:
        print("No text loaded. Check DATA_DIR and CONFIG['filename'].")

run_analysis()

## 6. Interpretation of Results

### Reading the Graph
1.  **The Hub (Anna):** 
    * **Color:** Bright Yellow (high degree centrality).
    * **Role:** She is the structural center, connecting her husband (Karenin), her lover (Vronsky), and the socialite circle (Betsy).
    
2.  **The Isolate (Levin):** 
    * **Color:** Orange/Purple (lower degree centrality).
    * **Role:** Pushed to the periphery by the layout algorithm. He lacks direct connections to the antagonist figures (Karenin/Vronsky), which is consistent with his isolation from the main society plot. Node position alone is only suggestive, so this reading should be checked against the degree table.

3.  **The Bridge (Stiva):** 
    * Stiva Oblonsky, Anna's brother and Levin's friend, acts as the crucial connector between the two storylines (Anna's vs. Levin's).
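
One way to check the separation claim numerically rather than by eye is to compare how much edge weight falls inside each plot versus across the two. The sketch below is illustrative only: the `plot_of` grouping is a hypothetical analyst's choice, the helper `weight_by_relation` is not part of this notebook, and the edge weights are placeholders rather than output from the analysis. In practice the triples would come from `G.edges(data='weight')`.

```python
from collections import defaultdict

# Hypothetical assignment of characters to plots (a manual choice, not derived from data).
# Bridging characters such as Stiva and Dolly are deliberately left unassigned.
plot_of = {
    "Anna": "society", "Vronsky": "society", "Karenin": "society", "Betsy": "society",
    "Levin": "rural", "Kitty": "rural",
}

# Placeholder (u, v, weight) triples, for illustration only
example_edges = [
    ("Anna", "Vronsky", 30),
    ("Anna", "Karenin", 20),
    ("Levin", "Kitty", 40),
    ("Levin", "Vronsky", 10),
]

def weight_by_relation(edges, plot_of):
    """Sum edge weights within a plot versus across plots."""
    totals = defaultdict(int)
    for u, v, w in edges:
        # Skip edges touching characters that were not assigned to a plot
        if u not in plot_of or v not in plot_of:
            continue
        key = "within" if plot_of[u] == plot_of[v] else "across"
        totals[key] += w
    return dict(totals)

totals = weight_by_relation(example_edges, plot_of)
cross_share = totals["across"] / sum(totals.values())
print(totals)       # {'within': 90, 'across': 10}
print(cross_share)  # 0.1
```

A small cross-plot share on the real graph would support the two-plot reading; a large one would suggest that the storylines are more entangled at the sentence level than the layout implies.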