# 1_AppliedNLP_Session2_Bi_Trigrams

This notebook analyzes the most frequent **bigrams** and **trigrams** in *War and Peace* and *Anna Karenina* by Leo Tolstoy. The structure follows the same logic as `01_frequent_words(1).ipynb`.

In [1]:
import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
import pandas as pd
import string
import re

# Ensure tokenizer resources are available (handles newer NLTK too)
def _ensure_nltk():
    try:
        nltk.data.find("tokenizers/punkt")
    except LookupError:
        nltk.download("punkt", quiet=True)
    # Some NLTK versions require 'punkt_tab' separately
    try:
        nltk.data.find("tokenizers/punkt_tab")
    except LookupError:
        try:
            nltk.download("punkt_tab", quiet=True)
        except Exception:
            pass

_ensure_nltk()


### üîç What does `Counter` do?
`Counter` from Python's `collections` module counts how many times each item appears in a list. For example:
```python
Counter(['apple', 'banana', 'apple'])
```
returns:
```
Counter({'apple': 2, 'banana': 1})
```
This helps us find how frequently each bigram or trigram occurs.

In [2]:
import os

# Construct the path to the data folder (one level up from notebooks/)
data_dir = os.path.join(os.path.dirname(os.getcwd()), "data")

file_war = os.path.join(data_dir, "The Project Gutenberg eBook of War and Peace, by Leo Tolstoy.txt")
file_anna = os.path.join(data_dir, "The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy.txt")


def load_and_clean_text(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower()
    for p in string.punctuation:
        text = text.replace(p, ' ')
    # Tokenize with NLTK; if resources are missing, fall back to regex
    try:
        tokens = word_tokenize(text)
    except LookupError:
        # Fallback: simple regex tokenizer to avoid NLTK data errors
        tokens = re.findall(r"[a-zA-Z]+", text)
    tokens = [t for t in tokens if t.isalpha()]  # keep only alphabetic tokens
    return tokens

tokens_war = load_and_clean_text(file_war)
tokens_anna = load_and_clean_text(file_anna)

print(f"Loaded {len(tokens_war)} tokens from War and Peace")
print(f"Loaded {len(tokens_anna)} tokens from Anna Karenina")


Loaded 572173 tokens from War and Peace
Loaded 359574 tokens from Anna Karenina


In [3]:

# Generate bigrams and trigrams
bigrams_war = list(ngrams(tokens_war, 2))
bigrams_anna = list(ngrams(tokens_anna, 2))
trigrams_war = list(ngrams(tokens_war, 3))
trigrams_anna = list(ngrams(tokens_anna, 3))

# Count frequencies
counter_bi_war = Counter(bigrams_war)
counter_bi_anna = Counter(bigrams_anna)
counter_tri_war = Counter(trigrams_war)
counter_tri_anna = Counter(trigrams_anna)
    

In [4]:

# Convert to DataFrame for easy display
def top_ngrams(counter, n=20):
    df = pd.DataFrame(counter.most_common(n), columns=['N-gram', 'Frequency'])
    df['N-gram'] = df['N-gram'].apply(lambda x: ' '.join(x))
    df.index = df.index + 1  # Shift index to start at 1
    return df


print("Top 20 Bigrams - War and Peace")
display(top_ngrams(counter_bi_war))

print("Top 20 Bigrams - Anna Karenina")
display(top_ngrams(counter_bi_anna))

print("Top 20 Trigrams - War and Peace")
display(top_ngrams(counter_tri_war))

print("Top 20 Trigrams - Anna Karenina")
display(top_ngrams(counter_tri_anna))
    

Top 20 Bigrams - War and Peace


Unnamed: 0,N-gram,Frequency
1,of the,4072
2,in the,2336
3,to the,2329
4,and the,1482
5,at the,1346
6,on the,1334
7,he had,1210
8,prince andrew,1065
9,did not,1048
10,he was,951


Top 20 Bigrams - Anna Karenina


Unnamed: 0,N-gram,Frequency
1,of the,1945
2,in the,1634
3,to the,1007
4,he had,1004
5,he was,886
6,at the,886
7,and the,729
8,it was,673
9,on the,655
10,did not,643


Top 20 Trigrams - War and Peace


Unnamed: 0,N-gram,Frequency
1,he did not,223
2,i don t,203
3,one of the,187
4,out of the,178
5,that he was,155
6,commander in chief,147
7,as soon as,146
8,up to the,129
9,he could not,129
10,that it was,125


Top 20 Trigrams - Anna Karenina


Unnamed: 0,N-gram,Frequency
1,i don t,254
2,he could not,198
3,he did not,197
4,out of the,177
5,i can t,142
6,that he had,136
7,said stepan arkadyevitch,125
8,that he was,121
9,in spite of,116
10,that it was,107


In [5]:

# Merge bigrams comparison
def merge_comparison(counter1, counter2, label1, label2, n=20):
    all_ngrams = set(counter1) | set(counter2)
    data = []
    for ng in all_ngrams:
        data.append({
            'N-gram': ' '.join(ng),
            f'{label1} Count': counter1.get(ng, 0),
            f'{label2} Count': counter2.get(ng, 0)
        })
    df = pd.DataFrame(data)
    df['Total'] = df[f'{label1} Count'] + df[f'{label2} Count']
    df = df.sort_values(by='Total', ascending=False).head(n)
    return df

print("### Merged Comparison (Bigrams)")
display(merge_comparison(counter_bi_war, counter_bi_anna, "War and Peace", "Anna Karenina"))

print("### Merged Comparison (Trigrams)")
display(merge_comparison(counter_tri_war, counter_tri_anna, "War and Peace", "Anna Karenina"))
    

### Merged Comparison (Bigrams)


Unnamed: 0,N-gram,War and Peace Count,Anna Karenina Count,Total
118292,of the,4072,1945,6017
184003,in the,2336,1634,3970
279164,to the,2329,1007,3336
119716,at the,1346,886,2232
81181,he had,1210,1004,2214
228332,and the,1482,729,2211
177063,on the,1334,655,1989
125479,he was,951,886,1837
114483,did not,1048,643,1691
257884,it was,881,673,1554


### Merged Comparison (Trigrams)


Unnamed: 0,N-gram,War and Peace Count,Anna Karenina Count,Total
165572,i don t,203,254,457
475675,he did not,223,197,420
398770,out of the,178,177,355
128464,he could not,129,198,327
163230,that he was,155,121,276
36568,one of the,187,81,268
441191,that he had,102,136,238
72002,that it was,125,107,232
606815,i can t,80,142,222
495777,as soon as,146,61,207


In [6]:

# Display both separately again for clarity
print("### Separate Comparison (Bigrams)")
print("\nTop 10 shared bigrams:")
shared_bi = set(counter_bi_war) & set(counter_bi_anna)
shared_bi_counts = [(ng, counter_bi_war[ng] + counter_bi_anna[ng]) for ng in shared_bi]
shared_bi_sorted = sorted(shared_bi_counts, key=lambda x: x[1], reverse=True)[:10]
for ng, count in shared_bi_sorted:
    print(f"{' '.join(ng)} ‚Äî Total: {count}")

print("\n### Separate Comparison (Trigrams)")
shared_tri = set(counter_tri_war) & set(counter_tri_anna)
shared_tri_counts = [(ng, counter_tri_war[ng] + counter_tri_anna[ng]) for ng in shared_tri]
shared_tri_sorted = sorted(shared_tri_counts, key=lambda x: x[1], reverse=True)[:10]
for ng, count in shared_tri_sorted:
    print(f"{' '.join(ng)} ‚Äî Total: {count}")
    

### Separate Comparison (Bigrams)

Top 10 shared bigrams:
of the ‚Äî Total: 6017
in the ‚Äî Total: 3970
to the ‚Äî Total: 3336
at the ‚Äî Total: 2232
he had ‚Äî Total: 2214
and the ‚Äî Total: 2211
on the ‚Äî Total: 1989
he was ‚Äî Total: 1837
did not ‚Äî Total: 1691
it was ‚Äî Total: 1554

### Separate Comparison (Trigrams)
i don t ‚Äî Total: 457
he did not ‚Äî Total: 420
out of the ‚Äî Total: 355
he could not ‚Äî Total: 327
that he was ‚Äî Total: 276
one of the ‚Äî Total: 268
that he had ‚Äî Total: 238
that it was ‚Äî Total: 232
i can t ‚Äî Total: 222
as soon as ‚Äî Total: 207



### üìä Summary of Comparison
Both books share many common language patterns typical of 19th-century literature.  
* Common bigrams like **'of the'**, **'in the'**, or **'to the'** appear frequently in both.  
* Unique n-grams often reflect thematic differences ‚Äî *War and Peace* includes more military or political terms, while *Anna Karenina* contains more social and emotional expressions.  
You can analyze further by filtering stopwords or increasing the `n` value for deeper contextual insights.


In [13]:

import matplotlib.pyplot as plt
from collections import Counter

def get_top_ngrams_from_text(text: str, n: int=2, top_k: int=20):
    # Tokenize and extract alphabetic tokens
    doc = nlp(text)
    tokens = [t.text.lower() for t in doc if t.is_alpha]
    if len(tokens) < n:
        return []
    ngrams = (" ".join(tokens[i:i+n]) for i in range(len(tokens)-n+1))
    counts = Counter(ngrams)
    return counts.most_common(top_k)


def plot_bigrams_trigrams_for_book(book_title: str, text: str, top_k: int=15, savepath=None):
    bigrams = get_top_ngrams_from_text(text, n=2, top_k=top_k)
    trigrams = get_top_ngrams_from_text(text, n=3, top_k=top_k)

    bigram_labels, bigram_vals = zip(*bigrams) if bigrams else ([], [])
    trigram_labels, trigram_vals = zip(*trigrams) if trigrams else ([], [])

    bigram_labels = list(bigram_labels)[::-1]
    bigram_vals = list(bigram_vals)[::-1]
    trigram_labels = list(trigram_labels)[::-1]
    trigram_vals = list(trigram_vals)[::-1]

    fig, axes = plt.subplots(2, 1, figsize=(12, 10), constrained_layout=True)

    # Trigrams
    ax = axes[0]
    y_pos = range(len(trigram_labels))
    ax.barh(y_pos, trigram_vals)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(trigram_labels, fontsize=10)
    ax.set_xlabel("Frequency")
    ax.set_title(f"{book_title} ‚Äî Top {top_k} Trigrams")
    ax.invert_yaxis()

    # Bigrams
    ax = axes[1]
    y_pos = range(len(bigram_labels))
    ax.barh(y_pos, bigram_vals)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(bigram_labels, fontsize=10)
    ax.set_xlabel("Frequency")
    ax.set_title(f"{book_title} ‚Äî Top {top_k} Bigrams")
    ax.invert_yaxis()

    if savepath:
        fig.savefig(savepath, bbox_inches='tight')
    plt.show()
    return savepath, fig

# Run for both books (assuming file_war, file_anna, read_text, strip_gutenberg_headers, and OUTDIR defined earlier)
if 'war_file' in globals():
    war_text = strip_gutenberg_headers(read_text(file_war))
    war_png = OUTDIR / "War_and_Peace_bi_tri_ngrams.png"
    print(f"Creating n-gram plots for War and Peace -> {war_png}")
    plot_bigrams_trigrams_for_book("War and Peace", war_text, top_k=15, savepath=str(war_png))

if 'file_anna' in globals():
    anna_text = strip_gutenberg_headers(read_text(file_anna))
    anna_png = OUTDIR / "Anna_Karenina_bi_tri_ngrams.png"
    print(f"Creating n-gram plots for Anna Karenina -> {anna_png}")
    plot_bigrams_trigrams_for_book("Anna Karenina", anna_text, top_k=15, savepath=str(anna_png))

print("Saved n-gram visualization PNGs if savepath provided.")


Creating n-gram plots for Anna Karenina -> outputs\Anna_Karenina_bi_tri_ngrams.png


ValueError: [E088] Text of length 1982314 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.

In [12]:
# ==========================================
# Fully self-contained chunked n-gram analysis + plotting
# Works for War and Peace and Anna Karenina
# ==========================================

import re
import spacy
import matplotlib.pyplot as plt
from collections import Counter
from pathlib import Path

# ---------- SETUP ----------
print("Loading spaCy model...")
nlp = spacy.load("en_core_web_sm")

OUTDIR = Path("outputs")
OUTDIR.mkdir(exist_ok=True)

def read_text(path):
    """Read UTF-8 text from file."""
    return Path(path).read_text(encoding='utf-8', errors='ignore')

def strip_gutenberg_headers(text):
    """Remove Project Gutenberg headers/footers."""
    text = re.sub(r"\*\*\* START OF.*?\*\*\*", "", text, flags=re.DOTALL)
    text = re.sub(r"\*\*\* END OF.*?\*\*\*", "", text, flags=re.DOTALL)
    return text.strip()

def get_paragraph_chunks(text: str):
    """Split into paragraph-like chunks (based on blank lines), falling back to fixed-size chunks."""
    text = text.replace('\r\n', '\n')
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    max_paragraph_length = 20000
    chunks = []
    for p in paragraphs:
        if len(p) <= max_paragraph_length:
            chunks.append(p)
        else:
            start = 0
            L = len(p)
            while start < L:
                end = min(start + max_paragraph_length, L)
                if end < L:
                    nxt = p.rfind(' ', start, end)
                    if nxt > start:
                        end = nxt
                chunks.append(p[start:end].strip())
                start = end
    return chunks

def get_top_ngrams_from_text_chunked(text: str, n: int=2, top_k: int=20):
    """Efficiently compute top n-grams from a large text by processing in chunks."""
    counts = Counter()
    chunks = [c for c in get_paragraph_chunks(text) if c]
    if not chunks:
        return []
    for doc in nlp.pipe(chunks, batch_size=50):
        tokens = [t.text.lower() for t in doc if t.is_alpha]
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i+n])
            counts[gram] += 1
    return counts.most_common(top_k)

def plot_bigrams_trigrams_for_book_chunked(book_title: str, text: str, top_k: int=15, savepath=None):
    bigrams = get_top_ngrams_from_text_chunked(text, n=2, top_k=top_k)
    trigrams = get_top_ngrams_from_text_chunked(text, n=3, top_k=top_k)

    bigram_labels, bigram_vals = zip(*bigrams) if bigrams else ([], [])
    trigram_labels, trigram_vals = zip(*trigrams) if trigrams else ([], [])

    bigram_labels = list(bigram_labels)[::-1]
    bigram_vals = list(bigram_vals)[::-1]
    trigram_labels = list(trigram_labels)[::-1]
    trigram_vals = list(trigram_vals)[::-1]

    fig, axes = plt.subplots(2, 1, figsize=(12, 10), constrained_layout=True)

    # Trigrams
    ax = axes[0]
    y_pos = range(len(trigram_labels))
    ax.barh(y_pos, trigram_vals)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(trigram_labels, fontsize=9)
    ax.set_xlabel("Frequency")
    ax.set_title(f"{book_title} ‚Äî Top {top_k} Trigrams")
    ax.invert_yaxis()

    # Bigrams
    ax = axes[1]
    y_pos = range(len(bigram_labels))
    ax.barh(y_pos, bigram_vals)
    ax.set_yticks(y_pos)
    ax.set_yticklabels(bigram_labels, fontsize=9)
    ax.set_xlabel("Frequency")
    ax.set_title(f"{book_title} ‚Äî Top {top_k} Bigrams")
    ax.invert_yaxis()

    if savepath:
        Path(savepath).parent.mkdir(parents=True, exist_ok=True)
        fig.savefig(savepath, bbox_inches='tight')
        print(f"‚úÖ Saved plot to: {savepath}")
    plt.show()
    return savepath, fig

# ---------- MAIN EXECUTION ----------
# Point these to your actual text files
file_war = Path("War_and_Peace.txt")
anna_file = Path("Anna_Karenina.txt")

if file_war.exists():
    print("Processing War and Peace (chunked)...")
    war_text = strip_gutenberg_headers(read_text(file_war))
    war_png = OUTDIR / "War_and_Peace_bi_tri_ngrams_chunked.png"
    plot_bigrams_trigrams_for_book_chunked("War and Peace", war_text, top_k=15, savepath=str(war_png))
else:
    print("‚ö†Ô∏è War_and_Peace.txt not found in current directory; please add it.")

if anna_file.exists():
    print("Processing Anna Karenina (chunked)...")
    anna_text = strip_gutenberg_headers(read_text(anna_file))
    anna_png = OUTDIR / "Anna_Karenina_bi_tri_ngrams_chunked.png"
    plot_bigrams_trigrams_for_book_chunked("Anna Karenina", anna_text, top_k=15, savepath=str(anna_png))
else:
    print("‚ö†Ô∏è Anna_Karenina.txt not found in current directory; please add it.")


Loading spaCy model...
‚ö†Ô∏è War_and_Peace.txt not found in current directory; please add it.
‚ö†Ô∏è Anna_Karenina.txt not found in current directory; please add it.


-"War and Peace" feels like author's voice the most since it includes more characters such as commander in chief, the old prince, the drawing room, prince andrew (andrei).

-Most frequent items in Anna karenina are mostly common phrases people use nowadays, so there are couple items that is part of author's work like "said stepan arkadyevitch" andd "alexeey alexandrovitch. As for War and peace - there are more frequent items that are characters and interesting words like "the drawing room".