# Assignment 2: N-grams and Language Identification
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name:**  
**Student ID:**  
**Due Date:** 16 November 2025 (Sunday) before midnight

---

## Overview

This assignment focuses on:
1. Building **character-based** 2-gram and 3-gram language models with Laplace smoothing
2. Sentence-based language identification using 10-fold cross-validation
3. Evaluation using accuracy, precision, recall, and F1-score
4. Comparison and analysis

**Note:** For language identification, we use **character n-grams** rather than word n-grams because they better capture language-specific patterns like letter combinations, diacritics, and writing systems.

**Grading:**
- Written Questions (7 √ó 4 pts): **28 pts**
- Code Tasks with TODO (11 total): **72 pts** distributed by effort level:
  - Simple tasks: 4 pts each (2 cells)
  - Moderate tasks: 6 pts each (4 cells)
  - Complex tasks: 8 pts each (5 cells)
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [2]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import re

# Scikit-learn for cross-validation and metrics
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import ttest_rel


# Set random seed for reproducibility
np.random.seed(42)

---

# Task 1: Corpus Preparation and Statistics (22 points)

## 1.1: Upload Corpus Files

Prepare your text files in **two different languages** (accepted formats: `.txt`, `.pdf`, or `.docx`). When you run the cell below, you'll be prompted to upload files for each language separately. Make sure your files contain substantial text (reports, essays, or similar content from other courses). Each language requires at least **5000** words in its corpus.

In [4]:
from google.colab import files

print("Upload your ENGLISH corpus file(s):")
english_files = files.upload()

print("\nUpload your SECOND LANGUAGE corpus file(s):")
second_lang_files = files.upload()
lang1_files = list(english_files.keys())
lang2_files = list(second_lang_files.keys())


Upload your ENGLISH corpus file(s):


Saving ENGLISH_CORPUS.txt to ENGLISH_CORPUS (1).txt

Upload your SECOND LANGUAGE corpus file(s):


Saving TURKISH_CORPUS.txt to TURKISH_CORPUS (1).txt


## 1.2: Load and Preprocess Data (12 points)

Load your uploaded files, extract text, preprocess, split into sentences, and tokenize. You'll need helper functions to handle different file formats.

**Steps:**
1. Read files based on format (`.txt`, `.pdf`, `.docx`) and combine them into single text for each language
2. Apply preprocessing (e.g., lowercasing, handling punctuation)
3. Split each corpus into individual sentences
4. Tokenize each sentence into words (for statistics)
5. Store the results as two lists of tokenized sentences

**Important:** You'll use word tokenization for calculating statistics, but for the n-gram models in Task 2, you'll work with character n-grams directly on the sentence strings.

In [5]:
import re
from typing import List

def read_txt_file(filename: str) -> str:
    """Read a .txt file and return its content."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

def read_pdf_file(filename: str) -> str:
    """Read a .pdf file and return its text content."""
    # TODO: Install and use PyPDF2 or pdfplumber
    # Example: pip install PyPDF2
    pass

def read_docx_file(filename: str) -> str:
    """Read a .docx file and return its text content."""
    # TODO: Install and use python-docx
    # Example: pip install python-docx
    pass

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    # TODO: Implement sentence splitting
    # You can use simple regex or nltk.sent_tokenize
    text=re.sub(r"\s+"," ",text.strip())
    sentences=re.split(r"(?<=[.!?])\s+",text)
    sentences=[s for s in sentences if s]
    return sentences

def tokenize_sentence(sentence: str) -> List[str]:
    """Tokenize a sentence into words."""
    # TODO: Implement word tokenization
    # You can use str.split() or nltk.word_tokenize
    sentence=sentence.lower()
    tokens=re.findall(r"\b\w+\b",sentence)
    return tokens
def read_corpus(file_list: List[str]) -> str:
    all_texts = []
    for filename in file_list:
        if filename.endswith(".txt"):
            all_texts.append(read_txt_file(filename))
        elif filename.endswith(".pdf"):
            all_texts.append(read_pdf_file(filename))
        elif filename.endswith(".docx"):
            all_texts.append(read_docx_file(filename))
        else:
            print("Unsupported file type, skipping:", filename)
    full_text = " ".join(all_texts)
    full_text = re.sub(r"\s+", " ", full_text.strip())
    return full_text
lang1_text = read_corpus(lang1_files)
lang2_text = read_corpus(lang2_files)

lang1_sentences = split_into_sentences(lang1_text)
lang2_sentences = split_into_sentences(lang2_text)

lang1_sentences_tokenized = [tokenize_sentence(s) for s in lang1_sentences]
lang2_sentences_tokenized = [tokenize_sentence(s) for s in lang2_sentences]

print("\nlang1_sentences =", lang1_sentences[:3], "...")
print("lang2_sentences =", lang2_sentences[:3], "...\n")

print("lang1_sentences_tokenized =", lang1_sentences_tokenized[:3], "...")
print("lang2_sentences_tokenized =", lang2_sentences_tokenized[:3], "...")


lang1_sentences = ['ABSTRACT This report explains the details of my summer internship.', 'The main goals of my summer internship were to develop a Java application and gain experience at the backend in a professional environment.', 'While on my summer internship, I developed a backend application for newspapers.'] ...
lang2_sentences = ['√ñZET Bu rapor, yaz stajƒ±mƒ±n detaylarƒ±nƒ± a√ßƒ±klamaktadƒ±r.', 'Yaz stajƒ±mƒ±n temel hedefleri bir Java uygulamasƒ± geli≈ütirmek ve profesyonel bir ortamda backend alanƒ±nda deneyim kazanmaktƒ±.', 'Yaz stajƒ±m boyunca gazeteler i√ßin bir backend uygulamasƒ± geli≈ütirdim.'] ...

lang1_sentences_tokenized = [['abstract', 'this', 'report', 'explains', 'the', 'details', 'of', 'my', 'summer', 'internship'], ['the', 'main', 'goals', 'of', 'my', 'summer', 'internship', 'were', 'to', 'develop', 'a', 'java', 'application', 'and', 'gain', 'experience', 'at', 'the', 'backend', 'in', 'a', 'professional', 'environment'], ['while', 'on', 'my', 'summer', 'interns

**Question 1.1:** What preprocessing choices did you make and why? (3-5 sentences)

**[YOUR ANSWER HERE]**

## 1.3: Basic Statistics (10 points)

Calculate and display key statistics for both language corpora to understand their characteristics.

In [8]:
# TODO: Calculate statistics for BOTH languages
punct_pattern = r"[^\w\s]"

lang1_total_characters = sum(len(s) for s in lang1_sentences)
lang1_special_characters = sum(len(re.findall(punct_pattern, s)) for s in lang1_sentences)
lang1_char_vocabulary = len(set("".join(lang1_sentences)))
lang1_total_words = sum(len(s) for s in lang1_sentences_tokenized)
lang1_word_vocabulary = len({w for sent in lang1_sentences_tokenized for w in sent})
lang1_sentence_count = len(lang1_sentences)
lang1_avg_sentence_len = (lang1_total_words / lang1_sentence_count) if lang1_sentence_count else 0

lang2_total_characters = sum(len(s) for s in lang2_sentences)
lang2_special_characters = sum(len(re.findall(punct_pattern, s)) for s in lang2_sentences)
lang2_char_vocabulary = len(set("".join(lang2_sentences)))
lang2_total_words = sum(len(s) for s in lang2_sentences_tokenized)
lang2_word_vocabulary = len({w for sent in lang2_sentences_tokenized for w in sent})
lang2_sentence_count = len(lang2_sentences)
lang2_avg_sentence_len = (lang2_total_words / lang2_sentence_count) if lang2_sentence_count else 0

print(f"{'Statistic':30} {'Language 1':15} {'Language 2'}")
print("-" * 60)
print(f"{'Total characters':30} {lang1_total_characters:<15} {lang2_total_characters}")
print(f"{'Special/punctuation count':30} {lang1_special_characters:<15} {lang2_special_characters}")
print(f"{'Character vocabulary size':30} {lang1_char_vocabulary:<15} {lang2_char_vocabulary}")
print(f"{'Total words':30} {lang1_total_words:<15} {lang2_total_words}")
print(f"{'Word vocabulary size':30} {lang1_word_vocabulary:<15} {lang2_word_vocabulary}")
print(f"{'Sentence count':30} {lang1_sentence_count:<15} {lang2_sentence_count}")
print(f"{'Avg sentence length (words)':30} {lang1_avg_sentence_len:<15.2f} {lang2_avg_sentence_len:.2f}")

Statistic                      Language 1      Language 2
------------------------------------------------------------
Total characters               13060           10049
Special/punctuation count      395             263
Character vocabulary size      83              88
Total words                    2131            1302
Word vocabulary size           571             646
Sentence count                 128             101
Avg sentence length (words)    16.65           12.89


**Question 1.2:** What are the key differences between your two corpora? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 2: Character N-gram Language Identification (58 points)

**Baseline (46 pts):** Implement character-based 2-gram and 3-gram models, run 10-fold CV, report accuracy.  
**Creativity (12 pts):** Out-of-vocabulary analysis.

## 2.1: Implement Character N-gram Models (12 points)

Implement the `CharNgramLanguageModel` class with Laplace smoothing using NLTK's n-gram utilities. The model should count **character** n-grams during training and calculate sentence probabilities with smoothing.

**Key difference from word n-grams:** Instead of tokenizing sentences into words, you'll work with individual characters in each sentence.

In [9]:
import nltk
from nltk.util import ngrams, pad_sequence
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from typing import List

# Download required NLTK data
nltk.download('punkt', quiet=True)

class CharNgramLanguageModel:

    def __init__(self, n: int = 2):
        self.n = n
        self.model = Laplace(n)

    def train(self, sentences: List[str]):
        char_sequences = [list(s) for s in sentences]
        train_data, vocab = padded_everygram_pipeline(self.n, char_sequences)
        self.model.fit(train_data, vocab)
        pass

    def get_probability(self, sentence: str) -> float:
        chars = list(sentence)
        padded = list(
            pad_sequence(
                chars,
                n=self.n,
                pad_left=True,
                pad_right=True,
                left_pad_symbol="<s>",
                right_pad_symbol="</s>"
            )
        )
        log_prob = 0.0
        for ng in ngrams(padded, self.n):
            context = ng[:-1]
            token = ng[-1]
            p = self.model.score(token, context)
            if p > 0:
                log_prob += math.log(p)
            else:
                log_prob += math.log(1e-12)
        return math.exp(log_prob)

### Spot Check: Inspect Your N-gram Models

After implementing the model, train sample models on both languages and inspect what they learned.

In [10]:
model_2gram_lang1 = CharNgramLanguageModel(n=2)
model_2gram_lang1.train(lang1_sentences)

model_3gram_lang1 = CharNgramLanguageModel(n=3)
model_3gram_lang1.train(lang1_sentences)

model_2gram_lang2 = CharNgramLanguageModel(n=2)
model_2gram_lang2.train(lang2_sentences)

model_3gram_lang2 = CharNgramLanguageModel(n=3)
model_3gram_lang2.train(lang2_sentences)

from collections import Counter

def show_top_ngrams(model, sentences, top_k=10):
    seqs = [list(s) for s in sentences]
    counts = Counter()
    for seq in seqs:
        padded = list(
            pad_sequence(
                seq,
                n=model.n,
                pad_left=True,
                pad_right=True,
                left_pad_symbol="<s>",
                right_pad_symbol="</s>"
            )
        )
        for ng in ngrams(padded, model.n):
            counts[ng] += 1
    print(f"n = {model.n}")
    print("vocab size:", len(model.model.vocab))
    for ng, c in counts.most_common(top_k):
        print(ng, c)
    print()

print("Language 1 bigram model:")
show_top_ngrams(model_2gram_lang1, lang1_sentences)

print("Language 1 trigram model:")
show_top_ngrams(model_3gram_lang1, lang1_sentences)

print("Language 2 bigram model:")
show_top_ngrams(model_2gram_lang2, lang2_sentences)

print("Language 2 trigram model:")
show_top_ngrams(model_3gram_lang2, lang2_sentences)

Language 1 bigram model:
n = 2
vocab size: 86
('e', ' ') 308
(' ', 'a') 246
('s', ' ') 218
('i', 'n') 216
(' ', 't') 206
('o', 'n') 186
('n', ' ') 185
(',', ' ') 185
('e', 'r') 175
('d', ' ') 175

Language 1 trigram model:
n = 3
vocab size: 86
(' ', 't', 'h') 131
('.', '</s>', '</s>') 128
('t', 'h', 'e') 120
('h', 'e', ' ') 109
('i', 'o', 'n') 108
('i', 'n', 'g') 104
('t', 'i', 'o') 100
('n', 'g', ' ') 99
('a', 't', 'i') 91
('e', 'd', ' ') 91

Language 2 bigram model:
n = 2
vocab size: 91
('e', 'r') 177
('l', 'a') 173
('e', ' ') 159
('a', 'n') 143
('n', ' ') 130
('m', 'a') 121
('r', 'i') 119
('l', 'e') 118
('a', ' ') 116
('i', 'r') 113

Language 2 trigram model:
n = 3
vocab size: 91
('.', '</s>', '</s>') 101
('e', 'r', 'i') 76
(' ', 'v', 'e') 72
('l', 'e', 'r') 68
('a', 'm', 'a') 51
('v', 'e', ' ') 49
('l', 'a', 'n') 49
('u', 'l', 'a') 47
('l', 'a', 'r') 46
('l', 'a', 'm') 45



## 2.2: Implement Language Identification (8 points)

Create a function that compares sentence probabilities from two language models and returns the predicted label.

In [15]:
def identify_language(sentence: str,
                     model_lang1: CharNgramLanguageModel,
                     model_lang2: CharNgramLanguageModel) -> int:
    # TODO: Your code here
    def identify_language(sentence: str,
                      model_lang1: CharNgramLanguageModel,
                      model_lang2: CharNgramLanguageModel) -> int:

        p1 = model_lang1.get_probability(sentence)
        p2 = model_lang2.get_probability(sentence)

        return 0 if p1 > p2 else 1

# [8 pts]


## 2.3: Implement Evaluation Function (6 points)

Create a function that calculates accuracy, precision, recall, and F1-score given predicted and true labels.

In [17]:
def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """
    Calculate evaluation metrics.

    Args:
        y_true: True labels
        y_pred: Predicted labels

    Returns:
        Dictionary with accuracy, precision, recall, f1_score
    """
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support
    from typing import Dict, List
    def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
      acc = accuracy_score(y_true, y_pred)
      precision, recall, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average='binary', zero_division=0
      )
    return {
        "accuracy": acc,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }


# [6 pts]

## 2.4: 10-Fold Cross-Validation for Language Identification (8 points)

Implement 10-fold cross-validation to evaluate your character-based n-gram models. In each fold, split the data, train separate models for each language and n-gram order, make predictions, and evaluate performance.

In [24]:
from sklearn.model_selection import KFold

X = lang1_sentences_str + lang2_sentences_str
y = [0] * len(lang1_sentences_str) + [1] * len(lang2_sentences_str)

print(f"Dataset prepared:")
print(f"  Total sentences: {len(X)}")
print(f"  Language 1 (label 0): {sum(1 for label in y if label == 0)} sentences")
print(f"  Language 2 (label 1): {sum(1 for label in y if label == 1)} sentences")
print()

kfold = KFold(n_splits=10, shuffle=True, random_state=42)

results_2gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
results_3gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

for fold_idx, (train_idx, test_idx) in enumerate(kfold.split(X), 1):
    print(f"\n{'='*50}")
    print(f"Fold {fold_idx}/10")
    print(f"{'='*50}")

    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]

    train_lang1 = [s for s, lab in zip(X_train, y_train) if lab == 0]
    train_lang2 = [s for s, lab in zip(X_train, y_train) if lab == 1]

    model2_lang1 = CharNgramLanguageModel(n=2)
    model2_lang1.train(train_lang1)
    model2_lang2 = CharNgramLanguageModel(n=2)
    model2_lang2.train(train_lang2)

    model3_lang1 = CharNgramLanguageModel(n=3)
    model3_lang1.train(train_lang1)
    model3_lang2 = CharNgramLanguageModel(n=3)
    model3_lang2.train(train_lang2)

    y_pred_2 = [identify_language(s, model2_lang1, model2_lang2) for s in X_test]
    y_pred_3 = [identify_language(s, model3_lang1, model3_lang2) for s in X_test]

    metrics_2 = calculate_metrics(y_test, y_pred_2)
    metrics_3 = calculate_metrics(y_test, y_pred_3)

    results_2gram['accuracy'].append(metrics_2['accuracy'])
    results_2gram['precision'].append(metrics_2['precision'])
    results_2gram['recall'].append(metrics_2['recall'])
    results_2gram['f1'].append(metrics_2['f1_score'])

    results_3gram['accuracy'].append(metrics_3['accuracy'])
    results_3gram['precision'].append(metrics_3['precision'])
    results_3gram['recall'].append(metrics_3['recall'])
    results_3gram['f1'].append(metrics_3['f1_score'])

print("\n" + "="*50)
print("Cross-validation completed!")
print("="*50)


Dataset prepared:
  Total sentences: 229
  Language 1 (label 0): 128 sentences
  Language 2 (label 1): 101 sentences


Fold 1/10


NameError: name 'acc' is not defined

## 2.5: Display Results (12)

*Create a table showing for each model:*
Mean accuracy, precision, recall, F1 (with std)

In [None]:
# TODO: Calculate and display summary statistics
#
#
# Example:
# results_df = pd.DataFrame({
#     'Model': ['2-gram', '3-gram'],
#     'Accuracy': [...],
#     'Precision': [...],
#     ...
# })

# [4 pts]

**Question 2.1:** Which of your trained models performed best on the validation data, and why? (3-4 sentences)

**[YOUR ANSWER HERE]**

**Question 2.2:** Were the results consistent across different folds of cross-validation? (2-3 sentences)

**[YOUR ANSWER HERE]**

## 2.6: Out-of-Vocabulary Testing (12 pts)

Test your models with **five** sentences containing characters or character combinations not common in your training corpus. For character n-grams, this might include unusual letter combinations, foreign words, or made-up words that still follow language patterns.

In [None]:
# TODO: Create and test OOV sentences
#
# - Be creative!

# [8 pts]

**Question 2.3:** How well did your models handle out-of-vocabulary (OOV) samples? (2-3 sentences)

**[YOUR ANSWER HERE]**

---

# Task 3: Statistical Analysis (20 points)

**Baseline (10 pts):** Statistical significance testing and comparison.  
**Creativity (10 pts):** Advanced analysis (confusion matrices, error analysis, etc.).

## 3.1: Statistical Significance Testing (10 points)

Use paired t-test to compare models. p-value < 0.05 indicates statistically significant difference.

In [None]:
# TODO: Perform paired t-tests
#
# Compare: 2-gram vs 3-gram
#
# Use: t_stat, p_value = ttest_rel(results_1, results_2)

# [6 pts]

**Question 3.1:** Are the performance differences statistically significant? Explain what 'statistical significance' means in this context. (2-3 sentences)

**[YOUR ANSWER HERE]**

## 3.2: Advanced Analysis (10 points)

Perform deeper analysis such as per-language performance, misclassification patterns, etc.

In [None]:
# TODO: Your advanced analysis here

# [6 pts]

**Question 3.2:** What interesting patterns or insights did you discover from your results? (4-5 sentences)

**[YOUR ANSWER HERE]**

# Convert Your Colab Notebook to PDF

### Step 1: Download Your Notebook
- Go to **File ‚Üí Download ‚Üí Download .ipynb**
- Save the file to your computer

### Step 2: Upload to Colab
- Click the **üìÅ folder icon** on the left sidebar
- Click the **upload button**
- Select your downloaded .ipynb file

### Step 3: Run the Code Below
- **Uncomment the cell below** and run the cell
- This will take about 1-2 minutes to install required packages
- When prompted, type your notebook name (e.g.`gs_000000_as2.ipynb`) and press Enter

### The PDF will be automatically downloaded to your computer


In [None]:
# # Install required packages (this takes about 30 seconds)
# print("Installing PDF converter... please wait...")
# !apt-get update -qq
# !apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
# !pip install -q nbconvert

# print("\n" + "="*50)

# # Get notebook name from user
# notebook_name = input("\nEnter your notebook name: ")

# # Add .ipynb if missing
# if not notebook_name.endswith('.ipynb'):
#     notebook_name += '.ipynb'

# import os
# notebook_path = f'/content/{notebook_name}'

# # Check if file exists
# if not os.path.exists(notebook_path):
#     print(f"\n‚ö† Error: '{notebook_name}' not found in /content/")
#     print("\nMake sure you uploaded the file using the folder icon (üìÅ) on the left!")
# else:
#     print(f"\n‚úì Found {notebook_name}")
#     print("Converting to PDF... this may take 1-2 minutes...\n")

#     # Convert the notebook to PDF
#     !jupyter nbconvert --to pdf "{notebook_path}"

#     # Download the PDF
#     from google.colab import files
#     pdf_name = notebook_name.replace('.ipynb', '.pdf')
#     pdf_path = f'/content/{pdf_name}'

#     if os.path.exists(pdf_path):
#         print("‚úì SUCCESS! Downloading your PDF now...")
#         files.download(pdf_path)
#         print("\n‚úì Done! Check your downloads folder.")
#     else:
#         print("‚ö† Error: Could not create PDF")