<a href="https://colab.research.google.com/github/AlaraGuzel/CNG463/blob/main/assignment2/cng463_assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assignment 2: N-grams and Language Identification
## CNG463 - Introduction to Natural Language Processing
### METU NCC Computer Engineering | Fall 2025-26

**Student Name: Alara Güzel**  
**Student ID: 2585057**  
**Due Date:** 16 November 2025 (Sunday) before midnight

---

## Overview

This assignment focuses on:
1. Building **character-based** 2-gram and 3-gram language models with Laplace smoothing
2. Sentence-based language identification using 10-fold cross-validation
3. Evaluation using accuracy, precision, recall, and F1-score
4. Comparison and analysis

**Note:** For language identification, we use **character n-grams** rather than word n-grams because they better capture language-specific patterns like letter combinations, diacritics, and writing systems.
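
To make the idea of a character n-gram concrete, here is a minimal sketch that extracts overlapping character sequences from a string. The helper name `char_ngrams` is hypothetical and is not part of the assignment template; it only illustrates what the models count.

```python
from typing import List, Tuple

def char_ngrams(text: str, n: int) -> List[Tuple[str, ...]]:
    """Return all overlapping character n-grams of `text` (hypothetical helper)."""
    # A string shorter than n yields an empty list.
    return [tuple(text[i:i + n]) for i in range(len(text) - n + 1)]

english = char_ngrams("the", 2)
turkish = char_ngrams("şey", 2)
print(english)
print(turkish)
```

Sequences such as `('t', 'h')` are frequent in English, while pairs containing letters like `ş` are far more likely in Turkish, which is the signal the models rely on.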

**Grading:**
- Written Questions (7 × 4 pts): **28 pts**
- Code Tasks with TODO (11 total): **72 pts** distributed by effort level:
  - Simple tasks: 4 pts each (2 cells)
  - Moderate tasks: 6 pts each (4 cells)
  - Complex tasks: 8 pts each (5 cells)
- **Total: 100 pts**

---

## Pre-Submission Checklist

- [ ] Name and student ID at top
- [ ] No cells are added or removed
- [ ] All TODO sections completed
- [ ] All questions answered
- [ ] Code runs without errors
- [ ] Results tables included
- [ ] Run All before saving

## Setup and Imports

In [1]:
# Standard libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict, Counter
from typing import List, Tuple, Dict
import re

# Scikit-learn for cross-validation and metrics
from sklearn.model_selection import KFold
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from scipy.stats import ttest_rel


# Set random seed for reproducibility
np.random.seed(42)

---

# Task 1: Corpus Preparation and Statistics (22 points)

## 1.1: Upload Corpus Files

Prepare your text files in **two different languages** (accepted formats: `.txt`, `.pdf`, or `.docx`). When you run the cell below, you'll be prompted to upload files for each language separately. Make sure your files contain substantial text (reports, essays, or similar content from other courses). Each language requires at least **5000** words in its corpus.

In [42]:
from google.colab import files

print("Upload your ENGLISH corpus file(s):")
english_files = files.upload()

print("\nUpload your SECOND LANGUAGE corpus file(s):")
second_lang_files = files.upload()


Upload your ENGLISH corpus file(s):


Saving ENGLISH_CORPUS.txt to ENGLISH_CORPUS (3).txt

Upload your SECOND LANGUAGE corpus file(s):


Saving TURKISH_CORPUS.txt to TURKISH_CORPUS (3).txt


## 1.2: Load and Preprocess Data (12 points)

Load your uploaded files, extract text, preprocess, split into sentences, and tokenize. You'll need helper functions to handle different file formats.

**Steps:**
1. Read files based on format (`.txt`, `.pdf`, `.docx`) and combine them into single text for each language
2. Apply preprocessing (e.g., lowercasing, handling punctuation)
3. Split each corpus into individual sentences
4. Tokenize each sentence into words (for statistics)
5. Store the results as two lists of tokenized sentences

**Important:** You'll use word tokenization for calculating statistics, but for the n-gram models in Task 2, you'll work with character n-grams directly on the sentence strings.
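
One preprocessing detail that may matter when the second language is Turkish: Python's default `str.lower()` maps `I` to `i`, whereas Turkish lowercases `I` to the dotless `ı` and `İ` to `i`. The sketch below shows one possible way to handle this; `turkish_lower` is a hypothetical helper, and it should only be applied to text already known to be Turkish, since it would mis-handle English `I`.

```python
def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for I and İ (hypothetical helper)."""
    # Map the two Turkish-specific capitals first, then defer to str.lower().
    return text.replace("I", "ı").replace("İ", "i").lower()

default_result = "IŞIK".lower()         # default Unicode rules
turkish_result = turkish_lower("IŞIK")  # Turkish rules
print(default_result, turkish_result)
```

Whether this matters for the word statistics depends on how many capitalized words with `I` or `İ` the corpus contains.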

In [43]:
import re
from typing import List

def read_txt_file(filename: str) -> str:
    """Read a .txt file and return its content."""
    with open(filename, 'r', encoding='utf-8') as f:
        return f.read()

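def read_pdf_file(filename: str) -> str:
    """Read a .pdf file and return its text content."""
    # Requires PyPDF2 (pip install PyPDF2); imported lazily so .txt-only runs work without it.
    from PyPDF2 import PdfReader
    reader = PdfReader(filename)
    # extract_text() can return None for pages without a text layer.
    return " ".join(page.extract_text() or "" for page in reader.pages)

def read_docx_file(filename: str) -> str:
    """Read a .docx file and return its text content."""
    # Requires python-docx (pip install python-docx); imported lazily for the same reason.
    from docx import Document
    document = Document(filename)
    return " ".join(paragraph.text for paragraph in document.paragraphs)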

def split_into_sentences(text: str) -> List[str]:
    """Split text into sentences."""
    text=re.sub(r"\s+"," ",text.strip())
    sentences=re.split(r"(?<=[.!?])\s+",text)
    sentences=[s for s in sentences if s]
    return sentences

def tokenize_sentence(sentence: str) -> List[str]:
    """Tokenize a sentence into words."""
    sentence=sentence.lower()
    tokens=re.findall(r"\b\w+\b",sentence)
    return tokens

lang1_files = list(english_files.keys())
lang2_files = list(second_lang_files.keys())

def read_corpus(file_list: List[str]) -> str:
    all_texts = []
    for filename in file_list:
        if filename.endswith(".txt"):
            all_texts.append(read_txt_file(filename))
        elif filename.endswith(".pdf"):
            all_texts.append(read_pdf_file(filename))
        elif filename.endswith(".docx"):
            all_texts.append(read_docx_file(filename))
        else:
            print("Unsupported file type, skipping:", filename)
    full_text = " ".join(all_texts)
    full_text = re.sub(r"\s+", " ", full_text.strip())
    return full_text
lang1_text = read_corpus(lang1_files)
lang2_text = read_corpus(lang2_files)

lang1_sentences = split_into_sentences(lang1_text)
lang2_sentences = split_into_sentences(lang2_text)

lang1_sentences_tokenized = [tokenize_sentence(s) for s in lang1_sentences]
lang2_sentences_tokenized = [tokenize_sentence(s) for s in lang2_sentences]

print("\nlang1_sentences =", lang1_sentences[:3], "...")
print("lang2_sentences =", lang2_sentences[:3], "...\n")

print("lang1_sentences_tokenized =", lang1_sentences_tokenized[:3], "...")
print("lang2_sentences_tokenized =", lang2_sentences_tokenized[:3], "...")


lang1_sentences = ['ABSTRACT This report explains the details of my summer internship.', 'The main goals of my summer internship were to develop a Java application and gain experience at the backend in a professional environment.', 'While on my summer internship, I developed a backend application for newspapers.'] ...
lang2_sentences = ['ÖZET Bu rapor, yaz stajımın ayrıntılarını açıklamaktadır.', 'Yaz stajımın temel hedefleri bir Java uygulaması geliştirmek ve profesyonel bir ortamda backend alanında deneyim kazanmaktı.', 'Yaz stajım boyunca gazeteler için bir backend uygulaması geliştirdim.'] ...

lang1_sentences_tokenized = [['abstract', 'this', 'report', 'explains', 'the', 'details', 'of', 'my', 'summer', 'internship'], ['the', 'main', 'goals', 'of', 'my', 'summer', 'internship', 'were', 'to', 'develop', 'a', 'java', 'application', 'and', 'gain', 'experience', 'at', 'the', 'backend', 'in', 'a', 'professional', 'environment'], ['while', 'on', 'my', 'summer', 'int

**Question 1.1:** What preprocessing choices did you make and why? (3-5 sentences)

**I collapsed runs of whitespace and joined the corpus files into a single text for each language. I then split each corpus into sentences at sentence-final punctuation (`.`, `!`, `?`). Lastly, I lowercased and tokenized each sentence using a simple word-based regex (`\b\w+\b`), which also drops punctuation from the word tokens.**

## 1.3: Basic Statistics (10 points)

Calculate and display key statistics for both language corpora to understand their characteristics.

In [52]:
punct_pattern = r"[^\w\s]"

lang1_total_characters = sum(len(s) for s in lang1_sentences)
lang1_special_characters = sum(len(re.findall(punct_pattern, s)) for s in lang1_sentences)
lang1_char_voc = len(set("".join(lang1_sentences)))
lang1_total_words = sum(len(s) for s in lang1_sentences_tokenized)
lang1_word_voc = len({w for sent in lang1_sentences_tokenized for w in sent})
lang1_sentence_count = len(lang1_sentences)
lang1_avg_sentence_len = (lang1_total_words/lang1_sentence_count) if lang1_sentence_count else 0

lang2_total_characters = sum(len(s) for s in lang2_sentences)
lang2_special_characters = sum(len(re.findall(punct_pattern, s)) for s in lang2_sentences)
lang2_char_voc = len(set("".join(lang2_sentences)))
lang2_total_words = sum(len(s) for s in lang2_sentences_tokenized)
lang2_word_voc = len({w for sent in lang2_sentences_tokenized for w in sent})
lang2_sentence_count = len(lang2_sentences)
lang2_avg_sentence_len = (lang2_total_words/lang2_sentence_count) if lang2_sentence_count else 0

print("Statistic".ljust(28), "Language 1".ljust(14), "Language 2")
print("-"*55)
print("Total characters".ljust(28), lang1_total_characters, " "*(10-len(str(lang1_total_characters))), lang2_total_characters)
print("Special chars".ljust(28), lang1_special_characters, " "*(10-len(str(lang1_special_characters))), lang2_special_characters)
print("Char vocab size".ljust(28), lang1_char_voc, " "*(10-len(str(lang1_char_voc))), lang2_char_voc)
print("Total words".ljust(28), lang1_total_words, " "*(10-len(str(lang1_total_words))), lang2_total_words)
print("Word vocab size".ljust(28), lang1_word_voc, " "*(10-len(str(lang1_word_voc))), lang2_word_voc)
print("Sentence count".ljust(28), lang1_sentence_count, " "*(10-len(str(lang1_sentence_count))), lang2_sentence_count)
print("Avg sentence len".ljust(28), f"{lang1_avg_sentence_len:.2f}", " "*(10-len(f"{lang1_avg_sentence_len:.2f}")), f"{lang2_avg_sentence_len:.2f}")




Statistic                    Language 1     Language 2
-------------------------------------------------------
Total characters             42236       44048
Special chars                4141        4079
Char vocab size              96          102
Total words                  6198        5589
Word vocab size              1430        2050
Sentence count               361         352
Avg sentence len             17.17       15.88


**Question 1.2:** What are the key differences between your two corpora? (2-3 sentences)

**The first corpus has more words and sentences, but fewer total characters and a smaller word vocabulary than the second corpus (1430 vs. 2050 distinct words). This means corpus 2 includes a wider range of distinct words and distinct characters. Average sentence length also differs slightly: corpus 1 has longer sentences than corpus 2 (17.17 vs. 15.88 words).**

---

# Task 2: Character N-gram Language Identification (58 points)

**Baseline (46 pts):** Implement character-based 2-gram and 3-gram models, run 10-fold CV, report accuracy.  
**Creativity (12 pts):** Out-of-vocabulary analysis.

## 2.1: Implement Character N-gram Models (12 points)

Implement the `CharNgramLanguageModel` class with Laplace smoothing using NLTK's n-gram utilities. The model should count **character** n-grams during training and calculate sentence probabilities with smoothing.

**Key difference from word n-grams:** Instead of tokenizing sentences into words, you'll work with individual characters in each sentence.
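
As a worked illustration of Laplace (add-1) smoothing for bigrams, the estimate is (count(prev, char) + 1) / (count(prev) + V), where V is the vocabulary size. The sketch below computes it by hand on a toy string; `laplace_bigram_prob` and the toy data are assumptions for illustration only and do not reproduce NLTK's internals.

```python
from collections import Counter

def laplace_bigram_prob(prev: str, char: str, bigrams: Counter,
                        contexts: Counter, vocab_size: int) -> float:
    """Add-one estimate: (count(prev, char) + 1) / (count(prev) + V)."""
    return (bigrams[(prev, char)] + 1) / (contexts[prev] + vocab_size)

text = "abab"                            # toy training string
bigrams = Counter(zip(text, text[1:]))   # counts of adjacent character pairs
contexts = Counter(text[:-1])            # counts of each character as a context
vocab_size = len(set(text))              # V = 2 distinct characters

seen = laplace_bigram_prob("a", "b", bigrams, contexts, vocab_size)
unseen = laplace_bigram_prob("a", "a", bigrams, contexts, vocab_size)
print(seen, unseen)
```

The unseen pair still receives a non-zero probability, and the two estimates for context `a` sum to 1. In NLTK the vocabulary also contains the padding symbols and an unknown-token entry, so V there is slightly larger than the raw character count.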

In [54]:
import nltk
from nltk.util import ngrams, pad_sequence
from nltk.lm import Laplace
from nltk.lm.preprocessing import padded_everygram_pipeline
from typing import List

# Download required NLTK data
nltk.download('punkt', quiet=True)

class CharNgramLanguageModel:
    """
    Character-based N-gram language model with Laplace (add-1) smoothing using NLTK.
    """

    def __init__(self, n: int = 2):
        """
        Initialize the character n-gram model.

        Args:
            n: Order of n-gram (2 for bigram, 3 for trigram)
        """
        self.n = n
        self.model = Laplace(n)

    def train(self, sentences: List[str]):
        """
        Train the model on a list of sentences.

        Args:
            sentences: List of sentences (each sentence is a string)
        """
        char_sequences = [list(s) for s in sentences]
        train_data, vocab = padded_everygram_pipeline(self.n, char_sequences)
        self.model.fit(train_data, vocab)

    def get_probability(self, sentence: str) -> float:
        """
        Calculate the probability of a sentence.

        Args:
            sentence: Sentence string

        Returns:
            Probability of the sentence
        """
        import math

        chars = list(sentence)
        padded = list(
            pad_sequence(
                chars,
                n=self.n,
                pad_left=True,
                pad_right=True,
                left_pad_symbol="<s>",
                right_pad_symbol="</s>"
            )
        )
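        # Accumulate in log space so the running product does not underflow.
        log_prob = 0.0
        for ng in ngrams(padded, self.n):
            context = ng[:-1]
            token = ng[-1]
            p = self.model.score(token, context)
            # Laplace smoothing keeps p > 0; the floor is only a safeguard.
            log_prob += math.log(p) if p > 0 else math.log(1e-12)
        # Caution: exp() can still underflow to 0.0 for very long sentences.
        return math.exp(log_prob)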

# [8 pts]

### Spot Check: Inspect Your N-gram Models

After implementing the model, train sample models on both languages and inspect what they learned.

In [55]:
model_2gram_lang1 = CharNgramLanguageModel(n=2)
model_2gram_lang1.train(lang1_sentences)

model_3gram_lang1 = CharNgramLanguageModel(n=3)
model_3gram_lang1.train(lang1_sentences)

model_2gram_lang2 = CharNgramLanguageModel(n=2)
model_2gram_lang2.train(lang2_sentences)

model_3gram_lang2 = CharNgramLanguageModel(n=3)
model_3gram_lang2.train(lang2_sentences)

from collections import Counter

def show_top_ngrams(model, sentences, top_k=10):
    seqs = [list(s) for s in sentences]
    counts = Counter()
    for seq in seqs:
        padded = list(
            pad_sequence(
                seq,
                n=model.n,
                pad_left=True,
                pad_right=True,
                left_pad_symbol="<s>",
                right_pad_symbol="</s>"
            )
        )
        for ng in ngrams(padded, model.n):
            counts[ng] += 1
    print(f"n = {model.n}")
    print("vocab size:", len(model.model.vocab))
    for ng, c in counts.most_common(top_k):
        print(ng, c)
    print()

print("Language 1 bigram model:")
show_top_ngrams(model_2gram_lang1, lang1_sentences)

print("Language 1 trigram model:")
show_top_ngrams(model_3gram_lang1, lang1_sentences)

print("Language 2 bigram model:")
show_top_ngrams(model_2gram_lang2, lang2_sentences)

print("Language 2 trigram model:")
show_top_ngrams(model_3gram_lang2, lang2_sentences)

Language 1 bigram model:
n = 2
vocab size: 99
('.', '.') 2728
('e', ' ') 764
('s', ' ') 748
(' ', 'a') 694
('i', 'n') 588
('d', ' ') 555
('o', 'n') 508
('a', 'n') 508
('e', 'r') 501
(' ', 't') 493

Language 1 trigram model:
n = 3
vocab size: 99
('.', '.', '.') 2676
('.', '</s>', '</s>') 360
(' ', 't', 'h') 292
('i', 'o', 'n') 291
('e', 'd', ' ') 270
('t', 'h', 'e') 258
('t', 'i', 'o') 242
(' ', 'a', 'n') 234
('e', 'n', 't') 234
('n', 'd', ' ') 228

Language 2 bigram model:
n = 2
vocab size: 105
('.', '.') 2688
('e', 'r') 734
('l', 'a') 727
('e', ' ') 583
('a', 'r') 539
('l', 'e') 512
('i', 'n') 504
('i', ' ') 493
('a', 'n') 484
('n', ' ') 483

Language 2 trigram model:
n = 3
vocab size: 105
('.', '.', '.') 2636
('.', '</s>', '</s>') 351
('e', 'r', 'i') 322
('l', 'a', 'r') 308
('l', 'e', 'r') 289
(' ', 'v', 'e') 283
('v', 'e', ' ') 211
('l', 'a', 'n') 185
('r', '.', '</s>') 177
('i', 'n', ' ') 173



## 2.2: Implement Language Identification (8 points)

Create a function that compares sentence probabilities from two language models and returns the predicted label.
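
One thing to watch when comparing sentence probabilities: a sentence probability is a product of many small numbers, so for long sentences it can underflow to exactly `0.0` under both models. With a rule of the form `0 if prob1 > prob2 else 1`, such a tie always falls to label 1. The sketch below uses hypothetical log-probabilities to show the effect and how comparing in log space avoids it; `compare_in_log_space` is an illustrative helper, not part of the template.

```python
import math

def compare_in_log_space(log_prob_lang1: float, log_prob_lang2: float) -> int:
    """Return 0 if language 1 is more likely, else 1 (same labels as the task)."""
    return 0 if log_prob_lang1 > log_prob_lang2 else 1

# Hypothetical log-probabilities of one long sentence under each model.
log_p1, log_p2 = -800.0, -900.0

# Converting back to plain probabilities underflows: both become 0.0.
p1, p2 = math.exp(log_p1), math.exp(log_p2)
print(p1, p2, p1 > p2)

# Comparing the logs keeps the ordering intact.
print(compare_in_log_space(log_p1, log_p2))
```

If long sentences from language 1 are consistently labelled as language 2, this kind of underflow is one plausible cause worth checking.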

In [56]:
def identify_language(sentence: str,
                     model_lang1: CharNgramLanguageModel,
                     model_lang2: CharNgramLanguageModel) -> int:
    """
    Identify the language of a sentence using two character-based language models.

    Args:
        sentence: Sentence string
        model_lang1: Language model for language 1 (label 0)
        model_lang2: Language model for language 2 (label 1)

    Returns:
        Predicted label (0 or 1)
    """
    prob1 = model_lang1.get_probability(sentence)
    prob2 = model_lang2.get_probability(sentence)
    return 0 if prob1 > prob2 else 1

lang1_sentences_str = lang1_sentences
lang2_sentences_str = lang2_sentences


# [8 pts]


## 2.3: Implement Evaluation Function (6 points)

Create a function that calculates accuracy, precision, recall, and F1-score given predicted and true labels.
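
For reference, the four metrics can be computed by hand from the counts of true positives, false positives, and false negatives. The sketch below treats label 1 as the positive class, which is an assumption matching the usual binary convention; `binary_metrics` is a hypothetical helper for illustration.

```python
from typing import Dict, List

def binary_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision,
            "recall": recall, "f1_score": f1}

metrics = binary_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(metrics)
```

In this toy case one language-1 sentence is mislabelled, so recall for label 1 stays perfect while precision drops.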

In [57]:
def calculate_metrics(y_true: List[int], y_pred: List[int]) -> Dict[str, float]:
    """
    Calculate evaluation metrics.

    Args:
        y_true: True labels
        y_pred: Predicted labels

    Returns:
        Dictionary with accuracy, precision, recall, f1_score
    """
    from sklearn.metrics import accuracy_score, precision_recall_fscore_support

    y_true_flat = [int(l) for l in y_true]
    y_pred_flat = [0 if l is None else int(l) for l in y_pred]

    accuracy = accuracy_score(y_true_flat, y_pred_flat)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_true_flat, y_pred_flat, average='binary', zero_division=0
    )

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1
    }

# [6 pts]

## 2.4: 10-Fold Cross-Validation for Language Identification (8 points)

Implement 10-fold cross-validation to evaluate your character-based n-gram models. In each fold, split the data, train separate models for each language and n-gram order, make predictions, and evaluate performance.
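
The splitting step can be sketched without any library to show what each fold contains. This is a simplified stand-in, not scikit-learn's exact algorithm: `kfold_indices` is a hypothetical helper, and the sample count of 713 is taken from the dataset summary printed in this notebook.

```python
import random
from typing import List, Tuple

def kfold_indices(n_samples: int, k: int, seed: int = 42) -> List[Tuple[List[int], List[int]]]:
    """Shuffle indices once, then use each of k slices as the test set in turn."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal slices
    splits = []
    for i, test_idx in enumerate(folds):
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train_idx, test_idx))
    return splits

splits = kfold_indices(n_samples=713, k=10)
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Every sentence appears in exactly one test fold, so each prediction is made by models that never saw that sentence during training.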

In [58]:
from sklearn.model_selection import KFold

# Prepare dataset: combine sentence STRINGS from both languages with labels
X = lang1_sentences_str + lang2_sentences_str
y = [0] * len(lang1_sentences_str) + [1] * len(lang2_sentences_str)

print(f"Dataset prepared:")
print(f"  Total sentences: {len(X)}")
print(f"  Language 1 (label 0): {sum(1 for label in y if label == 0)} sentences")
print(f"  Language 2 (label 1): {sum(1 for label in y if label == 1)} sentences")
print()

# Initialize 10-fold cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Store results for each fold
results_2gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}
results_3gram = {'accuracy': [], 'precision': [], 'recall': [], 'f1': []}

for fold_no, idx_pair in enumerate(kfold.split(X), start=1):
    train_idx, test_idx = idx_pair

    print("\n" + "=" * 50)
    print("Fold {}/10".format(fold_no))
    print("=" * 50)

    X_train = []
    y_train = []
    for i in train_idx:
        X_train.append(X[i])
        y_train.append(y[i])

    X_test = []
    y_test = []
    for j in test_idx:
        X_test.append(X[j])
        y_test.append(y[j])

    train_lang1 = []
    train_lang2 = []
    for sent, lab in zip(X_train, y_train):
        if lab == 0:
            train_lang1.append(sent)
        elif lab == 1:
            train_lang2.append(sent)

    bigram_l1 = CharNgramLanguageModel(2)
    bigram_l1.train(train_lang1)
    bigram_l2 = CharNgramLanguageModel(2)
    bigram_l2.train(train_lang2)

    trigram_l1 = CharNgramLanguageModel(3)
    trigram_l1.train(train_lang1)
    trigram_l2 = CharNgramLanguageModel(3)
    trigram_l2.train(train_lang2)

    y_pred_2 = []
    for s in X_test:
        lbl2 = identify_language(s, bigram_l1, bigram_l2)
        y_pred_2.append(lbl2)

    y_pred_3 = []
    k = 0
    while k < len(X_test):
        lbl3 = identify_language(X_test[k], trigram_l1, trigram_l2)
        y_pred_3.append(lbl3)
        k += 1

    m2 = calculate_metrics(y_test, y_pred_2)
    m3 = calculate_metrics(y_test, y_pred_3)

    results_2gram["accuracy"].append(m2["accuracy"])
    results_2gram["precision"].append(m2["precision"])
    results_2gram["recall"].append(m2["recall"])
    results_2gram["f1"].append(m2["f1_score"])

    results_3gram["accuracy"].append(m3["accuracy"])
    results_3gram["precision"].append(m3["precision"])
    results_3gram["recall"].append(m3["recall"])
    results_3gram["f1"].append(m3["f1_score"])

print("\n" + "=" * 50)
print("Cross-validation completed!")
print("=" * 50)


# [8 pts]

Dataset prepared:
  Total sentences: 713
  Language 1 (label 0): 361 sentences
  Language 2 (label 1): 352 sentences


Fold 1/10

Fold 2/10

Fold 3/10

Fold 4/10

Fold 5/10

Fold 6/10

Fold 7/10

Fold 8/10

Fold 9/10

Fold 10/10

Cross-validation completed!


## 2.5: Display Results (12 points)

*Create a table showing for each model:*
Mean accuracy, precision, recall, F1 (with std)
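
When reporting a standard deviation over folds, it helps to state which definition is used. NumPy's `np.std` defaults to the population form (dividing by n); passing `ddof=1` gives the sample form (dividing by n - 1). The sketch below shows the difference with the standard library on hypothetical fold scores.

```python
import statistics

fold_scores = [0.90, 0.95, 1.00]  # hypothetical per-fold accuracies

mean = statistics.mean(fold_scores)
population_std = statistics.pstdev(fold_scores)  # divides by n
sample_std = statistics.stdev(fold_scores)       # divides by n - 1

print(f"mean={mean:.4f} population_std={population_std:.4f} sample_std={sample_std:.4f}")
```

With only 10 folds the two values differ noticeably, so the table is easier to interpret when the choice is stated.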

In [29]:
results_df = pd.DataFrame({
    'Model': ['2-gram', '3-gram'],
    'Accuracy Mean': [np.mean(results_2gram['accuracy']), np.mean(results_3gram['accuracy'])],
    'Accuracy Std': [np.std(results_2gram['accuracy']), np.std(results_3gram['accuracy'])],
    'Precision Mean': [np.mean(results_2gram['precision']), np.mean(results_3gram['precision'])],
    'Precision Std': [np.std(results_2gram['precision']), np.std(results_3gram['precision'])],
    'Recall Mean': [np.mean(results_2gram['recall']), np.mean(results_3gram['recall'])],
    'Recall Std': [np.std(results_2gram['recall']), np.std(results_3gram['recall'])],
    'F1 Mean': [np.mean(results_2gram['f1']), np.mean(results_3gram['f1'])],
    'F1 Std': [np.std(results_2gram['f1']), np.std(results_3gram['f1'])]
})

print(results_df)

# [4 pts]

    Model  Accuracy Mean  Accuracy Std  Precision Mean  Precision Std  \
0  2-gram       0.938735      0.035098        0.928492       0.050012   
1  3-gram       0.925692      0.047948        0.919802       0.054889   

   Recall Mean  Recall Std   F1 Mean    F1 Std  
0     0.930635    0.064559  0.927774  0.040870  
1     0.910635    0.094833  0.912113  0.058941  


**Question 2.1:** Which of your trained models performed best on the validation data, and why? (3-4 sentences)

**The 2-gram model performed best on the validation data because it achieved the highest mean accuracy and the most stable performance across folds.**

**Question 2.2:** Were the results consistent across different folds of cross-validation? (2-3 sentences)

**Yes, especially for the 2-gram model, whose accuracy and F1 varied little across folds. The 3-gram model had higher variance, which is expected since it needs more data to estimate its larger set of n-grams.**

## 2.6: Out-of-Vocabulary Testing (12 points)

Test your models with **five** sentences containing characters or character combinations not common in your training corpus. For character n-grams, this might include unusual letter combinations, foreign words, or made-up words that still follow language patterns.
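
Before scoring a test sentence, it can be useful to check which of its characters never occur in the training data at all. The sketch below does this with a set difference; `unseen_characters` is a hypothetical helper and the single training sentence is a stand-in for a real corpus.

```python
from typing import List, Set

def unseen_characters(sentence: str, training_sentences: List[str]) -> Set[str]:
    """Return the characters of `sentence` that never appear in the training data."""
    seen = set("".join(training_sentences))
    return set(sentence) - seen

training = ["the report explains the details"]  # hypothetical stand-in corpus
unseen = unseen_characters("qzxw lerip koln", training)
print(sorted(unseen))
```

Characters that are absent from the training data can only be scored through smoothing, so sentences with many of them tend to receive very small probabilities under both models.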

In [31]:
oov_sentences = [
    "qzxw lerip koln",
    "thraaavven doplik",
    "çğmpstl örnk",
    "vertilonium spragetti",
    "flornüsk brentagül"
]

print("OOV Testing:\n")

for s in oov_sentences:
    # Use the full-corpus models trained in the spot-check cell.
    p2_lang1 = model_2gram_lang1.get_probability(s)
    p2_lang2 = model_2gram_lang2.get_probability(s)
    p3_lang1 = model_3gram_lang1.get_probability(s)
    p3_lang2 = model_3gram_lang2.get_probability(s)

    pred_2gram = identify_language(s, model_2gram_lang1, model_2gram_lang2)
    pred_3gram = identify_language(s, model_3gram_lang1, model_3gram_lang2)

    print(f"Sentence: {s}")
    print(f"  2-gram: predicted language {pred_2gram} (L1 prob={p2_lang1:.5e}, L2 prob={p2_lang2:.5e})")
    print(f"  3-gram: predicted language {pred_3gram} (L1 prob={p3_lang1:.5e}, L2 prob={p3_lang2:.5e})")
    print()

# [8 pts]

OOV Testing:

Sentence: qzxw lerip koln
  2-gram: predicted language 1 (L1 prob=9.96011e-31, L2 prob=4.21047e-29)
  3-gram: predicted language 1 (L1 prob=2.05881e-32, L2 prob=2.18439e-29)

Sentence: thraaavven doplik
  2-gram: predicted language 0 (L1 prob=1.22883e-29, L2 prob=3.55138e-32)
  3-gram: predicted language 0 (L1 prob=2.80941e-34, L2 prob=1.68723e-34)

Sentence: çğmpstl örnk
  2-gram: predicted language 0 (L1 prob=2.56116e-25, L2 prob=7.10301e-27)
  3-gram: predicted language 1 (L1 prob=3.66006e-28, L2 prob=4.91708e-28)

Sentence: vertilonium spragetti
  2-gram: predicted language 1 (L1 prob=4.85873e-33, L2 prob=4.15323e-32)
  3-gram: predicted language 0 (L1 prob=1.05277e-42, L2 prob=4.93829e-44)

Sentence: flornüsk brentagül
  2-gram: predicted language 0 (L1 prob=4.32994e-33, L2 prob=6.93551e-35)
  3-gram: predicted language 0 (L1 prob=3.14050e-38, L2 prob=1.11345e-39)



**Question 2.3:** How well did your models handle out-of-vocabulary (OOV) samples? (2-3 sentences)

**The 2-gram model was more stable, while the 3-gram model sometimes changed its prediction when the words looked unusual: the two models disagreed on two of the five sentences.**

---

# Task 3: Statistical Analysis (20 points)

**Baseline (10 pts):** Statistical significance testing and comparison.  
**Creativity (10 pts):** Advanced analysis (confusion matrices, error analysis, etc.).

## 3.1: Statistical Significance Testing (10 points)

Use a paired t-test to compare the models. A p-value < 0.05 indicates a statistically significant difference.
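
The statistic behind a paired t-test is the mean of the per-fold differences divided by its standard error, with n - 1 degrees of freedom. The sketch below computes it by hand on hypothetical scores; `paired_t_statistic` is an illustrative helper and does not compute the p-value.

```python
import math
import statistics
from typing import List

def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t = mean(d) / (stdev(d) / sqrt(n)) for paired differences d = a - b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

scores_a = [0.90, 0.95, 0.92, 0.97]  # hypothetical per-fold scores, model A
scores_b = [0.88, 0.94, 0.93, 0.95]  # hypothetical per-fold scores, model B
t_stat = paired_t_statistic(scores_a, scores_b)
print(round(t_stat, 4))
```

With 10 folds there are 9 degrees of freedom, for which the two-sided 5% critical value is about 2.262; a smaller absolute t corresponds to p > 0.05.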

In [32]:
acc_t, acc_p = ttest_rel(results_2gram['accuracy'], results_3gram['accuracy'])
prec_t, prec_p = ttest_rel(results_2gram['precision'], results_3gram['precision'])
rec_t, rec_p = ttest_rel(results_2gram['recall'], results_3gram['recall'])
f1_t, f1_p = ttest_rel(results_2gram['f1'], results_3gram['f1'])

print("Paired t-tests (2-gram vs 3-gram):\n")
print(f"Accuracy:  t = {acc_t:.4f},  p = {acc_p:.4f}")
print(f"Precision: t = {prec_t:.4f},  p = {prec_p:.4f}")
print(f"Recall:    t = {rec_t:.4f},  p = {rec_p:.4f}")
print(f"F1 Score:  t = {f1_t:.4f},  p = {f1_p:.4f}")

Paired t-tests (2-gram vs 3-gram):

Accuracy:  t = 1.9640,  p = 0.0811
Precision: t = 1.4153,  p = 0.1906
Recall:    t = 1.5000,  p = 0.1679
F1 Score:  t = 1.8839,  p = 0.0922


**Question 3.1:** Are the performance differences statistically significant? Explain what 'statistical significance' means in this context. (2-3 sentences)

**No, the differences are not statistically significant because all p-values are higher than 0.05. In this context, statistical significance means the observed difference between the two models is unlikely to be explained by random variation across folds alone.**

## 3.2: Advanced Analysis (10 points)

Perform deeper analysis such as per-language performance, misclassification patterns, etc.
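
A confusion matrix and per-language accuracy can be derived directly from the label lists. The sketch below follows the usual layout of rows as true labels and columns as predicted labels; `confusion_counts` and the toy labels are assumptions for illustration.

```python
from typing import List

def confusion_counts(y_true: List[int], y_pred: List[int]) -> List[List[int]]:
    """2x2 matrix with rows = true label and columns = predicted label."""
    matrix = [[0, 0], [0, 0]]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

matrix = confusion_counts([0, 0, 0, 1, 1], [0, 0, 1, 1, 0])
# Per-language accuracy is the diagonal entry divided by the row total.
per_class_accuracy = [row[i] / sum(row) for i, row in enumerate(matrix)]
print(matrix, per_class_accuracy)
```

For an unbiased picture, the predictions fed into such a matrix would ideally come from held-out folds rather than from models trained on the same sentences.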

In [33]:
from sklearn.metrics import confusion_matrix

m2l1 = CharNgramLanguageModel(2)
m2l1.train(lang1_sentences)
m2L2 = CharNgramLanguageModel(2)
m2L2.train(lang2_sentences)

m3_L1=CharNgramLanguageModel(3)
m3_L1.train(lang1_sentences)
m3_L_2=CharNgramLanguageModel(3)
m3_L_2.train(lang2_sentences)

# Predict every sentence in the full dataset with each model pair.
pred_2_full = [identify_language(s, m2l1, m2L2) for s in X]
pred_3_full = [identify_language(s, m3_L1, m3_L_2) for s in X]

# identify_language always returns 0 or 1, so a plain int cast is sufficient.
y_true_cm = [int(label) for label in y]
y_pred_2_cm = [int(p) for p in pred_2_full]
y_pred_3_cm = [int(p) for p in pred_3_full]

cM2 = confusion_matrix(y_true_cm,y_pred_2_cm)
cM3=confusion_matrix(y_true_cm,y_pred_3_cm)

print("Confusion Matrix (2-gram):")
print(cM2)
print("\nConfusion Matrix (3-gram):")
print(cM3)

lang1_idx=[]
lang2_idx=[]
for i_ in range(len(y)):
    if y[i_] == 0:
        lang1_idx.append(i_)
    else:
        lang2_idx.append(i_)

a1=0
for t1 in lang1_idx:
    if pred_2_full[t1] == 0:
        a1=a1+1
acc11=a1/len(lang1_idx)

a2=0
for t2 in lang2_idx:
    if pred_2_full[t2] == 1:
        a2=a2+1
acc12=a2/len(lang2_idx)

a3=0
for t3 in lang1_idx:
    if pred_3_full[t3] == 0:
        a3+=1
acc31=a3/len(lang1_idx)

a4=0
for t4 in lang2_idx:
    if pred_3_full[t4] == 1:
        a4=a4+1
acc32=a4/len(lang2_idx)

print("\nPer-language accuracy:")
print("2-gram:",acc11,acc12)
print("3-gram:",acc31,acc32)

print("\nMisclassified examples (2-gram):")
zzz=0
for zz in range(len(X)):
    if y[zz] != pred_2_full[zz]:
        print("T:",y[zz],"P:",pred_2_full[zz],"|",X[zz])
        zzz+=1
        if zzz>=5:
            break

print("\nMisclassified examples (3-gram):")
g=0
for h in range(len(X)):
    if y[h] != pred_3_full[h]:
        print("T:",y[h],"P:",pred_3_full[h],"|",X[h])
        g=g+1
        if g>=5:
            break

# [6 pts]

Confusion Matrix (2-gram):
[[125   3]
 [  5  96]]

Confusion Matrix (3-gram):
[[126   2]
 [  5  96]]

Per-language accuracy:
2-gram: 0.9765625 0.9504950495049505
3-gram: 0.984375 0.9504950495049505

Misclassified examples (2-gram):
T: 0 P: 1 | Kafein Technology Solutions provides services in the fields of IT Operation Management, Advanced Analytical Solutions, Software Development, Managed Services Consultancy, Data Warehouse Consultancy, Digital Transformation Consultancy, Test Management Consultancy, Data Decision Management Systems Consultancy, Outsourcing Services, PDPL Project Consultancy, Robotic Process Automation Service, Data Virtualization Consultancy, Cyber Security Solutions and Turnkey Projects.
T: 0 P: 1 | The administrative tree consists of six people, which are Ali Cem Kalyoncu as chairman, Neval Önen as vice chairman, Hatice Sevim Oral as member of the board, Kenan Sübekci as member of the board, Murat Kaan Güneri as non-executive director and Murat Ethem Sümer as 

**Question 3.2:** What interesting patterns or insights did you discover from your results? (4-5 sentences)

**I don't think there's anything interesting.**

# Convert Your Colab Notebook to PDF

### Step 1: Download Your Notebook
- Go to **File → Download → Download .ipynb**
- Save the file to your computer

### Step 2: Upload to Colab
- Click the **📁 folder icon** on the left sidebar
- Click the **upload button**
- Select your downloaded .ipynb file

### Step 3: Run the Code Below
- **Uncomment the cell below** and run the cell
- This will take about 1-2 minutes to install required packages
- When prompted, type your notebook name (e.g. `gs_000000_as2.ipynb`) and press Enter

### The PDF will be automatically downloaded to your computer


In [None]:
# # Install required packages (this takes about 30 seconds)
# print("Installing PDF converter... please wait...")
# !apt-get update -qq
# !apt-get install -y texlive-xetex texlive-fonts-recommended texlive-plain-generic pandoc > /dev/null 2>&1
# !pip install -q nbconvert

# print("\n" + "="*50)

# # Get notebook name from user
# notebook_name = input("\nEnter your notebook name: ")

# # Add .ipynb if missing
# if not notebook_name.endswith('.ipynb'):
#     notebook_name += '.ipynb'

# import os
# notebook_path = f'/content/{notebook_name}'

# # Check if file exists
# if not os.path.exists(notebook_path):
#     print(f"\n⚠ Error: '{notebook_name}' not found in /content/")
#     print("\nMake sure you uploaded the file using the folder icon (📁) on the left!")
# else:
#     print(f"\n✓ Found {notebook_name}")
#     print("Converting to PDF... this may take 1-2 minutes...\n")

#     # Convert the notebook to PDF
#     !jupyter nbconvert --to pdf "{notebook_path}"

#     # Download the PDF
#     from google.colab import files
#     pdf_name = notebook_name.replace('.ipynb', '.pdf')
#     pdf_path = f'/content/{pdf_name}'

#     if os.path.exists(pdf_path):
#         print("✓ SUCCESS! Downloading your PDF now...")
#         files.download(pdf_path)
#         print("\n✓ Done! Check your downloads folder.")
#     else:
#         print("⚠ Error: Could not create PDF")