# 1_AppliedNLP_Session2_Bi_Trigrams

This notebook analyzes the most frequent **bigrams** and **trigrams** in *War and Peace* and *Anna Karenina* by Leo Tolstoy. The structure follows the same logic as `01_frequent_words(1).ipynb`.

In [None]:

import nltk
from nltk import word_tokenize
from nltk.util import ngrams
from collections import Counter
import pandas as pd
import string

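# Download tokenizer resources (only needed the first time)
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases load the tokenizer data from this package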
    

### 🔍 What does `Counter` do?
`Counter` from Python's `collections` module counts how many times each item appears in a list. For example:
```python
Counter(['apple', 'banana', 'apple'])
```
returns:
```
Counter({'apple': 2, 'banana': 1})
```
This helps us find how frequently each bigram or trigram occurs.
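
The same idea applies to n-grams, because each n-gram is a tuple and tuples can be used as `Counter` keys. The snippet below is a minimal sketch with made-up bigrams (the `pairs` list is illustrative, not taken from either novel):

```python
from collections import Counter

# Made-up bigrams for illustration; each bigram is a tuple of two tokens
pairs = [("of", "the"), ("in", "the"), ("of", "the"), ("to", "the"), ("of", "the")]
counts = Counter(pairs)

print(counts[("of", "the")])  # 3
print(counts[("on", "the")])  # 0, because missing keys count as zero
print(counts.most_common(1))  # [(('of', 'the'), 3)]
```

`most_common(n)` returns the `n` highest counts in descending order, which is what the ranking tables later in this notebook rely on.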

In [None]:

# Load both texts
file_war = "The Project Gutenberg eBook of War and Peace, by Leo Tolstoy.txt"
file_anna = "The Project Gutenberg eBook of Anna Karenina, by Leo Tolstoy.txt"

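def load_and_clean_text(filepath):
    """Read a text file and return its lowercase, alphabetic word tokens."""
    with open(filepath, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower()
    # Replace ASCII punctuation with spaces in one pass.
    # Note: this also splits contractions ("don't" becomes "don" and "t").
    text = text.translate(str.maketrans(string.punctuation, ' ' * len(string.punctuation)))
    tokens = word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha()]  # keep only alphabetic tokens
    return tokens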

tokens_war = load_and_clean_text(file_war)
tokens_anna = load_and_clean_text(file_anna)

print(f"Loaded {len(tokens_war)} tokens from War and Peace")
print(f"Loaded {len(tokens_anna)} tokens from Anna Karenina")
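
Project Gutenberg files usually wrap the book in a license header and footer, and those words end up in the n-gram counts if they are not removed. The sketch below shows one possible way to cut them off before tokenizing. It assumes the files contain the usual `*** START OF THE PROJECT GUTENBERG EBOOK ... ***` and `*** END OF ...` marker lines; `strip_gutenberg_boilerplate` is a hypothetical helper, not part of the original notebook, and the `sample` string is invented.

```python
import re

# Assumed marker format; check the actual files, since the wording has varied over time
START_RE = re.compile(r"\*\*\*\s*START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)
END_RE = re.compile(r"\*\*\*\s*END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*?\*\*\*", re.IGNORECASE)

def strip_gutenberg_boilerplate(text):
    """Return the text between the START and END markers (whole text if a marker is missing)."""
    start = START_RE.search(text)
    end = END_RE.search(text)
    begin = start.end() if start else 0
    finish = end.start() if end else len(text)
    return text[begin:finish].strip()

# Invented sample that mimics the layout of a Gutenberg file
sample = (
    "License header text\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n"
    "The body of the book goes here.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK WAR AND PEACE ***\n"
    "License footer text\n"
)
print(strip_gutenberg_boilerplate(sample))  # The body of the book goes here.
```

To use it, the call would go right after `f.read()` in `load_and_clean_text`, before the text is lowercased.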
    

In [None]:

# Generate bigrams and trigrams
bigrams_war = list(ngrams(tokens_war, 2))
bigrams_anna = list(ngrams(tokens_anna, 2))
trigrams_war = list(ngrams(tokens_war, 3))
trigrams_anna = list(ngrams(tokens_anna, 3))

# Count frequencies
counter_bi_war = Counter(bigrams_war)
counter_bi_anna = Counter(bigrams_anna)
counter_tri_war = Counter(trigrams_war)
counter_tri_anna = Counter(trigrams_anna)
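
To see what `ngrams` produces, it can help to look at a plain-Python equivalent on a short token list. The function below is a sketch that slides a window of size `n` over the tokens; `sliding_ngrams` is a hypothetical name, and it only mirrors the default behavior of `nltk.util.ngrams` (no padding).

```python
def sliding_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    # zip stops at the shortest slice, so N tokens yield N - n + 1 n-grams
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = ["the", "old", "prince", "looked", "at", "him"]

print(sliding_ngrams(tokens, 2))
# [('the', 'old'), ('old', 'prince'), ('prince', 'looked'), ('looked', 'at'), ('at', 'him')]
print(len(sliding_ngrams(tokens, 3)))  # 4
```

Because the windows overlap, every token except the first and last ones appears in several n-grams.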
    

In [None]:

# Convert to DataFrame for easy display
def top_ngrams(counter, n=20):
    df = pd.DataFrame(counter.most_common(n), columns=['N-gram', 'Frequency'])
    df['N-gram'] = df['N-gram'].apply(lambda x: ' '.join(x))
    return df

print("Top 20 Bigrams - War and Peace")
display(top_ngrams(counter_bi_war))

print("Top 20 Bigrams - Anna Karenina")
display(top_ngrams(counter_bi_anna))

print("Top 20 Trigrams - War and Peace")
display(top_ngrams(counter_tri_war))

print("Top 20 Trigrams - Anna Karenina")
display(top_ngrams(counter_tri_anna))
    

In [None]:

# Merge the counts from both books into one table, ranked by combined frequency
def merge_comparison(counter1, counter2, label1, label2, n=20):
    all_ngrams = set(counter1) | set(counter2)
    data = []
    for ng in all_ngrams:
        data.append({
            'N-gram': ' '.join(ng),
            f'{label1} Count': counter1.get(ng, 0),
            f'{label2} Count': counter2.get(ng, 0)
        })
    df = pd.DataFrame(data)
    df['Total'] = df[f'{label1} Count'] + df[f'{label2} Count']
    df = df.sort_values(by='Total', ascending=False).head(n)
    return df

print("### Merged Comparison (Bigrams)")
display(merge_comparison(counter_bi_war, counter_bi_anna, "War and Peace", "Anna Karenina"))

print("### Merged Comparison (Trigrams)")
display(merge_comparison(counter_tri_war, counter_tri_anna, "War and Peace", "Anna Karenina"))
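
`merge_comparison` builds one row for every distinct n-gram before sorting, which can be slow on full novels. A lighter alternative, sketched below with made-up counts, is to add the two `Counter` objects and ask for the top entries directly. The names `book_a` and `book_b` are placeholders for the real counters.

```python
from collections import Counter

# Made-up counts standing in for the real bigram counters
book_a = Counter({("of", "the"): 5, ("in", "the"): 3, ("the", "army"): 2})
book_b = Counter({("of", "the"): 4, ("in", "the"): 1, ("her", "husband"): 2})

# Adding two Counters sums the counts key by key
combined = book_a + book_b

rows = [
    (" ".join(ngram), book_a[ngram], book_b[ngram], total)
    for ngram, total in combined.most_common(2)
]
for row in rows:
    print(row)
# ('of the', 5, 4, 9)
# ('in the', 3, 1, 4)
```

The resulting `rows` list can be passed to `pd.DataFrame` with four column names to get a table like the one above.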
    

In [None]:

# List the n-grams that occur in both books, ranked by combined frequency
print("### Shared N-grams (Bigrams)")
print("\nTop 10 shared bigrams:")
shared_bi = set(counter_bi_war) & set(counter_bi_anna)
shared_bi_counts = [(ng, counter_bi_war[ng] + counter_bi_anna[ng]) for ng in shared_bi]
shared_bi_sorted = sorted(shared_bi_counts, key=lambda x: x[1], reverse=True)[:10]
for ng, count in shared_bi_sorted:
    print(f"{' '.join(ng)} - Total: {count}")

print("\n### Shared N-grams (Trigrams)")
print("\nTop 10 shared trigrams:")
shared_tri = set(counter_tri_war) & set(counter_tri_anna)
shared_tri_counts = [(ng, counter_tri_war[ng] + counter_tri_anna[ng]) for ng in shared_tri]
shared_tri_sorted = sorted(shared_tri_counts, key=lambda x: x[1], reverse=True)[:10]
for ng, count in shared_tri_sorted:
    print(f"{' '.join(ng)} - Total: {count}")
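
Combined raw counts favor the longer text, so if the two token totals printed earlier differ a lot, relative frequencies may give a fairer comparison. The sketch below scales counts to occurrences per 10,000 n-grams; `per_10k` is a hypothetical helper and the two counters are invented.

```python
from collections import Counter

def per_10k(counter):
    """Scale raw counts to occurrences per 10,000 n-grams."""
    total = sum(counter.values()) or 1  # avoid dividing by zero on an empty counter
    return {ng: 10_000 * c / total for ng, c in counter.items()}

# Invented counters: the first "book" is twice as long as the second
long_book = Counter({("of", "the"): 60, ("in", "the"): 40})
short_book = Counter({("of", "the"): 30, ("in", "the"): 20})

print(per_10k(long_book)[("of", "the")])   # 6000.0
print(per_10k(short_book)[("of", "the")])  # 6000.0
```

Here the raw counts differ by a factor of two, but the relative frequencies are identical, which shows that the bigram is equally typical of both texts.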
    


### 📊 Summary of Comparison
Both books share many common language patterns typical of 19th-century literature.
* Common bigrams like **'of the'**, **'in the'**, or **'to the'** appear frequently in both.
* N-grams unique to one book often reflect thematic differences: *War and Peace* includes more military or political terms, while *Anna Karenina* contains more social and emotional expressions.

You can analyze further by filtering stopwords or by increasing the n-gram size (the second argument to `ngrams`) to capture more context.
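
As a starting point for the stopword filtering mentioned above, the sketch below drops every n-gram that contains a stopword. The `STOPWORDS` set is a tiny hand-written placeholder and `drop_stopword_ngrams` is a hypothetical helper; in practice a fuller list such as `nltk.corpus.stopwords.words('english')` (after `nltk.download('stopwords')`) could be used instead.

```python
from collections import Counter

# Tiny placeholder list; a real analysis would use a fuller stopword list
STOPWORDS = {"of", "the", "in", "to", "and", "a"}

def drop_stopword_ngrams(counter, stopwords=STOPWORDS):
    """Keep only the n-grams in which no token is a stopword."""
    return Counter({
        ng: count for ng, count in counter.items()
        if not any(tok in stopwords for tok in ng)
    })

# Invented counts for illustration
counts = Counter({("of", "the"): 9, ("prince", "andrew"): 4, ("the", "countess"): 3})
filtered = drop_stopword_ngrams(counts)
print(filtered.most_common())  # [(('prince', 'andrew'), 4)]
```

A looser variant would drop only the n-grams made up entirely of stopwords (replace `any` with `all`), which keeps phrases like "the countess".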
