# Measure 4: Social Network Analysis (Graph & Data)

## 1. Introduction & Objective
**Objective:** To visualize the social structure of *Anna Karenina* and quantify character interactions using Network Theory metrics.

**The Thesis:** Tolstoy's novel is structured around two parallel plots that rarely intersect:
1.  **The Society Plot (Anna):** A dense, interconnected web of St. Petersburg/Moscow society.
2.  **The Rural Plot (Levin):** An isolated, philosophical narrative largely removed from the main social centers.

## 2. Methodology & Visualization
* **Nodes:** Characters. 
    * **Size:** Scales with Degree Centrality (the share of the other characters a node is linked to), used here as a proxy for popularity.
    * **Color:** Acts as a **Heatmap** of the same metric. Yellow nodes are highly connected hubs; Purple nodes are peripheral.
* **Edges:** Co-occurrence in the same sentence (weighted by frequency).
* **Legend:** The graph includes a **statistical legend** in the top-right corner listing each character's degree, i.e. the number of distinct characters they are linked to (edge weights are not counted).
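
The three quantities mentioned above (degree, degree centrality, and edge weight) are easy to conflate, so a small worked example may help. This is a minimal sketch on a hypothetical toy graph; the edges and weights are invented for illustration and are not output from the novel.

```python
import networkx as nx

# Hypothetical toy graph; weights stand in for sentence co-occurrence counts.
toy = nx.Graph()
toy.add_edge("Anna", "Vronsky", weight=5)
toy.add_edge("Anna", "Karenin", weight=2)
toy.add_edge("Anna", "Stiva", weight=1)
toy.add_edge("Stiva", "Levin", weight=3)

degree = dict(toy.degree())                   # number of distinct neighbours
strength = dict(toy.degree(weight="weight"))  # sum of incident edge weights
centrality = nx.degree_centrality(toy)        # degree / (n - 1), with n = 5 nodes

print(degree["Anna"], strength["Anna"], centrality["Anna"])  # 3 8 0.75
```

Node size and color in this notebook follow `centrality`, the legend reports `degree`, and only edge thickness reflects the weights.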

In [None]:
# Install dependencies (Run once)
%pip install networkx pandas matplotlib nltk

In [None]:
import os
import itertools
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

# Ensure NLTK data is downloaded
nltk.download('punkt')
nltk.download('punkt_tab')

# --- PATH CONFIGURATION ---
DATA_DIR = '../data'
RESULTS_DIR = '../results'

os.makedirs(RESULTS_DIR, exist_ok=True)

# --- TARGET CONFIGURATION ---
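# Filtering for the 8 primary characters to maintain graph readability.
# Matching is on these exact names only; other forms of address
# (e.g. "Oblonsky" for Stiva) are not counted.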
CONFIG = {
    "filename": "The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy.txt",
    "characters": ["Anna", "Vronsky", "Levin", "Kitty", "Karenin", "Stiva", "Dolly", "Betsy"]
}

## 3. Data Processing Logic

The following functions handle the text processing pipeline:
1.  **Sentence Segmentation:** Splitting the raw text into distinct sentences.
2.  **Interaction Scanning:** If two target characters appear in the *same sentence*, an edge is created between them.
3.  **Weighting:** Each further co-occurrence increments the edge weight, which is later drawn as the thickness of the connecting line.
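
The three steps above can be traced by hand on a few sentences. The sketch below is illustrative only: the sample sentences are invented (not quoted from the novel), and a simple regex tokenizer stands in for NLTK's `word_tokenize` so the example runs without downloading any data.

```python
import itertools
import re
from collections import Counter

# Hypothetical sample sentences, written for illustration only.
sentences = [
    "Anna looked at Vronsky.",
    "Vronsky spoke to Anna and Karenin.",
    "Levin walked alone.",
]
characters = ["Anna", "Vronsky", "Karenin", "Levin"]

pair_counts = Counter()
for sent in sentences:
    # Regex tokenizer as a stand-in for NLTK's word_tokenize
    tokens = set(re.findall(r"[a-z]+", sent.lower()))
    # Sort so each pair is always stored in the same order
    found = sorted(c for c in characters if c.lower() in tokens)
    for pair in itertools.combinations(found, 2):
        pair_counts[pair] += 1

for pair, weight in sorted(pair_counts.items()):
    print(pair, weight)
```

Anna and Vronsky share two sentences, so their edge ends up with weight 2; Levin never shares a sentence with another target character and therefore gets no edge at all.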

In [None]:
def load_text(filename):
    """Loads text file from the data directory."""
    filepath = os.path.join(DATA_DIR, filename)
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            return f.read()
    except FileNotFoundError:
        print(f"ERROR: Could not find {filepath}")
        return ""

def build_graph(text, characters):
    """Builds a weighted undirected graph based on sentence co-occurrence."""
    sentences = sent_tokenize(text)
    G = nx.Graph()
    G.add_nodes_from(characters)
    
    # Case-insensitive mapping
    char_map = {c.lower(): c for c in characters}
    
    print(f"Processing {len(sentences)} sentences...")
    
    for sent in sentences:
        tokens = set(word_tokenize(sent.lower()))
        found = [char_map[c] for c in char_map if c in tokens]
        
        # Create edges for all pairs found in the sentence
        if len(found) > 1:
            for pair in itertools.combinations(found, 2):
                u, v = pair
                if G.has_edge(u, v):
                    G[u][v]['weight'] += 1
                else:
                    G.add_edge(u, v, weight=1)
    return G

## 4. Visualization Engine (With Legend)

This section defines the aesthetic parameters for the graph:
* **Layout:** `spring_layout` (Force-directed placement).
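* **Heatmap:** Nodes are colored by degree centrality (Plasma colormap: Yellow=High, Purple=Low). Because no explicit color limits are set, the scale is relative to the lowest and highest values present in the graph.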
* **Data Overlay:** A legend box is added to the top-right corner, listing the exact **Degree Count** (number of edges) for every character, providing immediate quantitative context.
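
To make the heatmap mapping concrete, the sketch below reproduces by hand the rescaling matplotlib applies when no `vmin`/`vmax` is passed. The centrality values and node names are hypothetical; only `Normalize` and the Plasma colormap are real matplotlib objects.

```python
from matplotlib import cm
from matplotlib.colors import Normalize

# Hypothetical centrality values for three nodes
values = {"hub": 1.0, "middle": 0.6, "periphery": 0.2}

# Rescale to the data's own range, as matplotlib does by default
norm = Normalize(vmin=min(values.values()), vmax=max(values.values()))

# Each entry is an (R, G, B, A) tuple with components in [0, 1]
colors = {name: cm.plasma(norm(v)) for name, v in values.items()}

for name, rgba in colors.items():
    print(name, tuple(round(float(c), 2) for c in rgba))
```

The highest value maps to the yellow end of the colormap and the lowest to the dark purple end, regardless of their absolute magnitudes. A graph in which every character is almost equally connected will therefore still show the full color range.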

In [None]:
def analyze_and_draw(G):
    """Generates and saves the final high-resolution network graph with a data legend."""
    plt.figure(figsize=(15, 10), facecolor='white')
    ax = plt.gca()
    
    # 1. Metrics & Layout
    pos = nx.spring_layout(G, k=1.5, iterations=50, seed=42) 
    centrality = nx.degree_centrality(G)
    node_sizes = [v * 8000 + 500 for v in centrality.values()]
    
    # 2. Draw Edges (Curved & Weighted)
    weights = [G[u][v]['weight'] for u, v in G.edges()]
    max_weight = max(weights) if weights else 1
    
    for (u, v, d) in G.edges(data=True):
        width = (d['weight'] / max_weight) * 4 + 0.5
        nx.draw_networkx_edges(G, pos, edgelist=[(u, v)], width=width, alpha=0.3, 
                               edge_color="#555555", connectionstyle="arc3,rad=0.1", 
                               arrows=True, arrowstyle="-", ax=ax)

    # 3. Draw Nodes (Heatmap Color)
    # cmap=plt.cm.plasma ensures the Yellow-to-Purple gradient
    nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color=list(centrality.values()), 
                           cmap=plt.cm.plasma, alpha=0.9, edgecolors='white', linewidths=2, ax=ax)
    
    # 4. Draw Labels (a white outline keeps the text legible over edges)
    import matplotlib.patheffects as path_effects  # local import; could sit with the other imports
    labels = nx.draw_networkx_labels(G, pos, font_size=12, font_weight="bold", ax=ax)
    for label in labels.values():
        label.set_path_effects([path_effects.withStroke(linewidth=3, foreground='white')])

    # 5. Add Data Legend Box
    degrees = dict(G.degree())
    sorted_stats = sorted(degrees.items(), key=lambda item: item[1], reverse=True)
    
    legend_text = "NETWORK LEGEND:\n"
    legend_text += "Size  = Popularity\n"
    legend_text += "Color = Heatmap (Yellow=High)\n"
    legend_text += "-"*25 + "\n"
    legend_text += "CONNECTIONS (Count):\n"
    
    for char, count in sorted_stats:
        legend_text += f"{char:<10} : {count}\n"
    
    props = dict(boxstyle='round', facecolor='white', alpha=0.8, edgecolor='gray')
    plt.text(0.95, 0.95, legend_text, transform=ax.transAxes, fontsize=11,
             verticalalignment='top', horizontalalignment='right', bbox=props, fontfamily='monospace')

    # 6. Save & Show
    plt.title("Character Interaction Network: Anna Karenina", fontsize=18, fontweight='bold', pad=20)
    plt.axis('off')
    
    save_path = os.path.join(RESULTS_DIR, "anna_karenina_network_legend_final.png")
    plt.savefig(save_path, dpi=300, bbox_inches='tight')
    print(f"Graph saved to: {save_path}")
    plt.show()

## 5. Main Execution
Run this cell to load the data and generate the visualization.

In [None]:
def run_analysis():
    print("Loading text data...")
    text = load_text(CONFIG['filename'])
    
    if text:
        G = build_graph(text, CONFIG['characters'])
        if G.number_of_edges() > 0:
            print("Generating network visualization with legend...")
            analyze_and_draw(G)
        else:
            print("No interactions found among the specified characters.")
    else:
        print("No text loaded. Please check DATA_DIR and the filename in CONFIG.")

run_analysis()

## 6. Interpretation of Results

### Reading the Graph & Legend
1.  **The Hub (Anna):** 
    * **Color:** Bright Yellow (High Heatmap score).
    * **Data:** The legend confirms she has the highest (or tied for highest) number of connections. She is the structural center.
    
2.  **The Isolate (Levin):** 
    * **Color:** Orange/Purple (Lower Heatmap score).
    * **Position:** Pushed to the periphery by the force-directed layout.
    * **Interpretation:** He lacks direct edges to the antagonist figures (Karenin/Vronsky). His only tethers are family (Kitty, Dolly, Stiva).

3.  **The Bridge (Stiva):** 
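    * Stiva Oblonsky acts as the crucial connector between the two plots. The stronger claim, that removing his node would fracture the graph into two disconnected components, holds only if he is an articulation point. That in turn requires that Levin's other tethers (Kitty, Dolly) have no edges of their own into the society cluster.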
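
The bridge claim in point 3 can be tested directly rather than read off the picture. The sketch below uses a hypothetical toy edge list (not the notebook's output) to show two equivalent checks: asking NetworkX for the articulation points, and removing the node and counting what remains.

```python
import networkx as nx

# Hypothetical toy edges for illustration only; not results from the novel.
toy_edges = [
    ("Anna", "Vronsky"), ("Anna", "Karenin"), ("Vronsky", "Karenin"),
    ("Anna", "Stiva"),
    ("Stiva", "Levin"), ("Levin", "Kitty"),
]
toy = nx.Graph(toy_edges)

# Articulation points are nodes whose removal disconnects the graph.
cut_nodes = set(nx.articulation_points(toy))

# Direct check: drop Stiva and count the remaining connected components.
without_stiva = toy.copy()
without_stiva.remove_node("Stiva")
n_components = nx.number_connected_components(without_stiva)

print(sorted(cut_nodes))
print(n_components)
```

In this toy graph Stiva is an articulation point and removing him leaves two components. Running the same two checks on the graph returned by `build_graph` would show whether the real co-occurrence data supports the interpretation.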