# NLP Assignment: Character Signatures (Collocates & PMI)

## Goal
Analyze the text to find **"Character Signatures"** for selected characters:

1. **Frequency:** Words that appear most often near the character.
2. **PMI (Pointwise Mutual Information):** Words that are statistically strongly associated with the character ("unique" to them relative to the whole book).

## Output
* **Visuals:** Side-by-side bar charts (Frequency vs. PMI) for each main character.
* **Data:** CSV files with the full top lists for each character and metric.


## 0. Environment Setup (Run Once)

This cell installs any missing libraries (e.g. `wordcloud`) inside the current environment.

You usually only need to run this cell **once per environment**.

In [None]:
# INSTALL MISSING LIBRARIES (run once, then you can comment this out)
%pip install wordcloud pandas matplotlib nltk

## 1. Imports & Configuration

Set up paths, global parameters (window size, thresholds), and download NLTK resources.

In [None]:
import os
import math
from collections import Counter

import pandas as pd
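import matplotlib.pyplot as plt
from IPython.display import display  # built in inside Jupyter; explicit import keeps the code runnable elsewhere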

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import sent_tokenize, word_tokenize

# --- NLTK DATA (run once; safe to re-run) ---
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt_tab')  # some installations require this

# --- PATHS ---
# Assuming this notebook lives in `notebooks/` and the text is in `../data/`
DATA_DIR = '../data'
RESULTS_DIR = '../results'

os.makedirs(RESULTS_DIR, exist_ok=True)

# --- ANALYSIS SETTINGS ---
BOOK_FILENAME = 'anna_karenina.txt'   # file inside DATA_DIR
BOOK_NAME = 'Anna Karenina'

# Characters to analyze (as they appear after lowercasing / lemmatization)
CHARACTERS = ['anna', 'levin']

WINDOW_SIZE = 5          # +/- 5 words around the character
PMI_FREQ_THRESHOLD = 5   # Only compute PMI for words that appear >= 5 times as collocates
TOP_N = 10               # Top N words for graphs / CSV

# Matplotlib default style tweaks (optional)
plt.rcParams['figure.dpi'] = 120
plt.rcParams['axes.grid'] = False


## 2. Text Loading & Preprocessing

Steps:
1. Read the raw text file.
2. Split into sentences.
3. Tokenize words.
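4. Keep only alphanumeric tokens (this also drops hyphenated words and contraction fragments such as `n't`), lowercase them, and remove stopwords.
5. Lemmatize tokens. `WordNetLemmatizer.lemmatize` treats every word as a noun by default, so inflected verbs such as "smiled" are left unchanged.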

The function returns:
* `full_tokens`: a flat list of all tokens in the book.
* `processed_sentences`: a list of sentences, each as a list of cleaned tokens.
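
To make the two return values concrete, here is a minimal sketch of the same pipeline on a two-sentence string. It is an illustration under stated assumptions, not the notebook's implementation: `toy_clean` and `TOY_STOPWORDS` are hypothetical names, the regex-based splitting stands in for NLTK's tokenizers, and lemmatization is skipped.

```python
import re

# Hypothetical stand-ins for illustration only: the notebook itself uses
# NLTK's stopword list, tokenizers, and WordNetLemmatizer.
TOY_STOPWORDS = {"the", "a", "and", "was", "to", "at", "he", "she"}


def toy_clean(text):
    """Mimic the cleaning steps on a tiny scale (no lemmatization)."""
    # Naive sentence split: break after ., ! or ? when followed by whitespace
    raw_sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    sentences = []
    for sent in raw_sentences:
        # Naive word tokenization: runs of letters/digits only
        words = re.findall(r"[A-Za-z0-9]+", sent)
        kept = [w.lower() for w in words if w.lower() not in TOY_STOPWORDS]
        if kept:
            sentences.append(kept)
    # Flatten the sentence-level lists into one corpus-level token list
    tokens = [w for sent in sentences for w in sent]
    return tokens, sentences


toy_tokens, toy_sentences = toy_clean("Anna smiled at Levin. He was silent!")
print(toy_sentences)  # [['anna', 'smiled', 'levin'], ['silent']]
print(toy_tokens)     # ['anna', 'smiled', 'levin', 'silent']
```

The sentence-level list is what the collocate search walks over, while the flat list supplies the corpus-wide counts used for PMI.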


In [None]:
def load_and_clean(filename):
    """Load a text file and return:
    - full_tokens: list of all cleaned tokens in the corpus
    - processed_sentences: list of sentence-level token lists
    """
    filepath = os.path.join(DATA_DIR, filename)
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            text = f.read()
    except FileNotFoundError:
        print(f"ERROR: Could not find {filepath}")
        return [], []

    stop_words = set(stopwords.words('english'))
    lemmatizer = WordNetLemmatizer()

    # Sentence tokenization
    sentences = sent_tokenize(text)
    processed_sentences = []
    full_tokens = []

    print(f"Processing {len(sentences)} raw sentences from {filename}...")

    for sent in sentences:
        words = word_tokenize(sent)
        clean_words = []
        for w in words:
            # Keep only alphanumeric tokens (no pure punctuation)
            if w.isalnum():
                w_lower = w.lower()
                if w_lower not in stop_words:
                    lemma = lemmatizer.lemmatize(w_lower)
                    clean_words.append(lemma)
                    full_tokens.append(lemma)

        if clean_words:
            processed_sentences.append(clean_words)

    print(f"After cleaning: {len(processed_sentences)} sentences, {len(full_tokens)} tokens.\n")
    return full_tokens, processed_sentences


## 3. Collocates & PMI Computation

### 3.1 Collocates
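For a given character token, we take a symmetric window of `WINDOW_SIZE` tokens on each side of every occurrence and collect all surrounding words (excluding the character token itself). Windows are measured on the cleaned tokens, so stopwords do not count toward the distance, and they never cross a sentence boundary.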
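
The windowing can be sketched on a single cleaned sentence. This is a small illustration, assuming a made-up sentence; `window_context` is a hypothetical helper that mirrors the slicing logic, not a function defined elsewhere in the notebook.

```python
def window_context(sent, i, window):
    """Return the tokens within +/- `window` positions of index i, excluding sent[i]."""
    start = max(0, i - window)
    end = min(len(sent), i + window + 1)
    return sent[start:i] + sent[i + 1:end]


# Hypothetical cleaned sentence (stopwords already removed)
toy_sentence = ["princess", "looked", "anna", "smiled", "warmly", "old", "friend"]
idx = toy_sentence.index("anna")

print(window_context(toy_sentence, idx, window=2))
# ['princess', 'looked', 'smiled', 'warmly']
print(window_context(toy_sentence, idx, window=5))
# ['princess', 'looked', 'smiled', 'warmly', 'old', 'friend']
```

With `window=5` the window is clipped at both ends of the sentence, which is why short sentences contribute fewer than ten collocates per occurrence.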

### 3.2 PMI
PMI is computed as:
\begin{equation}
PMI(w, c) = \log \frac{P(w \mid c)}{P(w)}
\end{equation}
where:
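* $P(w \mid c)$ is the probability of seeing word *w* in the character's context windows, estimated as the count of *w* among the character's collocates divided by the total number of collocate tokens.
* $P(w)$ is the probability of seeing *w* anywhere in the corpus, estimated as the count of *w* divided by the total number of cleaned tokens.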
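
The conditional form above agrees with the more common symmetric definition of PMI, because $P(w \mid c) = P(w, c) / P(c)$. A short derivation follows; $P(w, c)$ (the probability that *w* occurs in a window around *c*) and $P(c)$ are symbols introduced here for illustration only.

```latex
\begin{equation}
PMI(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{P(w, c) / P(c)}{P(w)} = \log \frac{P(w \mid c)}{P(w)}
\end{equation}
```

A positive value means *w* is more frequent near the character than in the book overall; a negative value means it is less frequent.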

We compute PMI only for collocates that occur at least `PMI_FREQ_THRESHOLD` times in the character's windows, because PMI estimates for rare words are noisy and tend to be inflated.
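
A worked example on toy counts shows both the arithmetic and why a frequency threshold helps. All numbers below are invented for illustration, and `toy_pmi` is a hypothetical helper, not part of the notebook.

```python
import math
from collections import Counter

# Hypothetical toy data, chosen only to make the arithmetic easy to follow
toy_corpus = ["anna"] * 4 + ["smile"] * 10 + ["train"] * 1 + ["day"] * 85   # 100 tokens
toy_context = ["smile"] * 5 + ["train"] * 1 + ["day"] * 14                  # 20 collocates

corpus_counts = Counter(toy_corpus)
context_counts = Counter(toy_context)


def toy_pmi(word):
    """PMI(w, c) = log( P(w | c) / P(w) ), using the natural log."""
    p_w_given_c = context_counts[word] / len(toy_context)
    p_w = corpus_counts[word] / len(toy_corpus)
    return math.log(p_w_given_c / p_w)


for word in ["smile", "train", "day"]:
    print(f"{word:>5}: count={context_counts[word]:>2}  PMI={toy_pmi(word):+.3f}")
# smile: count= 5  PMI=+0.916
# train: count= 1  PMI=+1.609
#   day: count=14  PMI=-0.194
```

Here "train" gets the highest score from a single co-occurrence, which is the kind of unstable value a minimum count filters out; "day" is the most frequent collocate but scores below zero because it is common everywhere.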

In [None]:
def get_collocates(target, sentences, window=5):
    """Collect all collocates for a target token within +/- `window` words.
    `target` is expected to be lowercased and lemmatized.
    """
    target = target.lower()
    collocates = []

    for sent in sentences:
        for i, word in enumerate(sent):
            if word == target:
                start = max(0, i - window)
                end = min(len(sent), i + window + 1)
                # Add context (excluding the target word itself)
                collocates.extend(sent[start:i] + sent[i+1:end])

    return collocates


def calculate_pmi(collocates, full_tokens, top_n=10):
    """Given collocates for a character and the full corpus tokens,
    return (top_freq, top_pmi):
      - top_freq: list of (word, count) sorted by frequency
      - top_pmi: list of (word, pmi_score) sorted by PMI
    """
    if not collocates or not full_tokens:
        return [], []

    collocate_counts = Counter(collocates)
    corpus_counts = Counter(full_tokens)
    total_collocates = len(collocates)
    total_corpus = len(full_tokens)

    pmi_scores = {}
    for word, count in collocate_counts.items():
        if count < PMI_FREQ_THRESHOLD:
            continue

        p_w_given_c = count / total_collocates
        p_w = corpus_counts[word] / total_corpus

        if p_w > 0:
            # Natural log is fine; any base is acceptable as long as it's consistent
            pmi_scores[word] = math.log(p_w_given_c / p_w)

    # Top by raw frequency
    top_freq = collocate_counts.most_common(top_n)

    # Top by PMI score
    top_pmi = sorted(pmi_scores.items(), key=lambda x: x[1], reverse=True)[:top_n]

    return top_freq, top_pmi


## 4. Visualization & CSV Export

### 4.1 Side-by-side bar charts
For each character, we create a two-panel figure:

* Left: **Top N collocates by frequency**.
* Right: **Top N collocates by PMI**.

### 4.2 CSV files
For each character we also save two CSVs:

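* `<CHAR>_topN_frequency.csv`
* `<CHAR>_topN_pmi.csv`

Here `<CHAR>` is the character token and `N` is the value of `TOP_N`, e.g. `anna_top10_frequency.csv`.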


In [None]:
def create_assignment_graph(char_name, top_freq, top_pmi, book_name):
    """Create and save the side-by-side bar chart for frequency vs PMI."""
    if not top_freq or not top_pmi:
        print(f"[Warning] Not enough data to plot for {char_name}.")
        return

    fig, axes = plt.subplots(1, 2, figsize=(14, 6))

    # 1. Frequency plot
    words_f, counts_f = zip(*top_freq)
    axes[0].barh(words_f[::-1], counts_f[::-1])
    axes[0].set_title(f"Top {len(words_f)} Collocates (Frequency)")  # actual number of bars shown
    axes[0].set_xlabel("Count")

    # 2. PMI plot
    words_p, scores_p = zip(*top_pmi)
    axes[1].barh(words_p[::-1], scores_p[::-1])
    axes[1].set_title(f"Top {len(words_p)} Collocates (PMI)")  # may be fewer than TOP_N after thresholding
    axes[1].set_xlabel("PMI value")

    plt.suptitle(f"Character Signature: {char_name} ({book_name})", fontsize=16)
    plt.tight_layout()

    filename = os.path.join(RESULTS_DIR, f"{char_name}_signature.png")
    plt.savefig(filename, bbox_inches='tight')
    plt.show()
    print(f"Graph saved to: {filename}\n")


def save_results_to_csv(char_name, top_freq, top_pmi):
    """Save frequency and PMI results for a character as CSV files."""
    if top_freq:
        df_freq = pd.DataFrame(top_freq, columns=["word", "count"])
        freq_path = os.path.join(RESULTS_DIR, f"{char_name}_top{TOP_N}_frequency.csv")
        df_freq.to_csv(freq_path, index=False)
        print(f"Frequency CSV saved to: {freq_path}")

    if top_pmi:
        df_pmi = pd.DataFrame(top_pmi, columns=["word", "pmi"])
        pmi_path = os.path.join(RESULTS_DIR, f"{char_name}_top{TOP_N}_pmi.csv")
        df_pmi.to_csv(pmi_path, index=False)
        print(f"PMI CSV saved to: {pmi_path}")

    print()


## 5. Run the Analysis

We now:

1. Load and preprocess the book.
2. For each character in `CHARACTERS`:
   * Extract collocates within the specified window.
   * Compute top collocates by frequency and PMI.
   * Save the results as CSV.
   * Plot the side-by-side bar chart.


In [None]:
# --- MAIN EXECUTION ---
full_tokens, processed_sentences = load_and_clean(BOOK_FILENAME)

if not full_tokens or not processed_sentences:
    print("Aborting: could not load or process the text file.")
else:
    for char in CHARACTERS:
        print(f"Analyzing {char.title()}...\n")

        collocates = get_collocates(char, processed_sentences, window=WINDOW_SIZE)
        print(f"  Found {len(collocates)} collocate tokens for '{char}'.")

        top_freq, top_pmi = calculate_pmi(collocates, full_tokens, top_n=TOP_N)

        # Show quick preview in the notebook
        if top_freq:
            print("  Top frequency collocates:")
            display(pd.DataFrame(top_freq, columns=["word", "count"]))
        if top_pmi:
            print("  Top PMI collocates:")
            display(pd.DataFrame(top_pmi, columns=["word", "pmi"]))

        # CSV + plot
        save_results_to_csv(char, top_freq, top_pmi)
        create_assignment_graph(char.title(), top_freq, top_pmi, BOOK_NAME)

    print(f"Done! Check the '{RESULTS_DIR}' folder for PNGs and CSV files.")
