# 🧠 Language Detection Using CNN

This notebook trains a Convolutional Neural Network (CNN) to recognize languages from text input.

## Project Overview

**Objective:** Design and develop a CNN-based model capable of accurately identifying the language of a given text while remaining lightweight, adaptable, and easy to implement.

**Key Features:**
- Character-level CNN for language classification
- Supports multiple languages (English, French, Spanish, Khmer, Japanese, etc.)
- Efficient extraction of language-specific character and word patterns
- Easy-to-use preprocessing and evaluation pipeline

---

## Table of Contents
1. [Setup & Installation](#1-setup--installation)
2. [Import Libraries](#2-import-libraries)
3. [Data Preprocessing](#3-data-preprocessing)
4. [Dataset & DataLoader](#4-dataset--dataloader)
5. [Model Architecture](#5-model-architecture)
6. [Training](#6-training)
7. [Evaluation & Visualization](#7-evaluation--visualization)
8. [Inference - Predict Language](#8-inference---predict-language)
9. [Usage Instructions](#9-usage-instructions)

## 1. Setup & Installation

First, install the required dependencies. Run this cell only once when setting up the environment.

In [None]:
# Install required packages (uncomment and run if needed)
# !pip install torch torchvision numpy pandas scikit-learn matplotlib seaborn tqdm pillow

# Optional: Install OCR packages for text extraction from images
# !pip install pytesseract opencv-python easyocr

## 2. Import Libraries

Import all necessary libraries for data processing, model building, and visualization.

In [None]:
import os
import json
import numpy as np
import pandas as pd
import unicodedata
from glob import glob
from collections import Counter
from tqdm import tqdm

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Metrics
from sklearn.metrics import accuracy_score, f1_score, confusion_matrix, classification_report

# Set random seeds for reproducibility
def set_seed(seed=42):
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Check device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"🔧 Using device: {device}")
print(f"📦 PyTorch version: {torch.__version__}")

In [None]:
# Set up paths (adjust these if running from different location)
BASE_DIR = os.path.abspath(os.path.join(os.getcwd(), ".."))  # Project root
DATA_RAW = os.path.join(BASE_DIR, "data", "raw")
DATA_PROC = os.path.join(BASE_DIR, "data", "processed")
MODEL_DIR = os.path.join(BASE_DIR, "models")

# Create directories if they don't exist
os.makedirs(DATA_RAW, exist_ok=True)
os.makedirs(DATA_PROC, exist_ok=True)
os.makedirs(MODEL_DIR, exist_ok=True)

print(f"📁 Base directory: {BASE_DIR}")
print(f"📁 Raw data: {DATA_RAW}")
print(f"📁 Processed data: {DATA_PROC}")
print(f"📁 Models: {MODEL_DIR}")

## 3. Data Preprocessing

### 3.1 Data Loading Functions

The preprocessing pipeline:
1. Looks for CSV files in `data/raw/` with columns (text, label) OR `.txt` files (one sample per line) named `<lang>.txt`
2. Builds character-level vocabulary
3. Encodes text → fixed-length integer sequences
4. Saves processed outputs to `data/processed/`

**Supported data formats:**
- **CSV files:** Must have columns `text` and `label`
- **TXT files:** Named after the language (e.g., `en.txt`, `fr.txt`) with one sample per line
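
Steps 2 and 3 of the pipeline can be illustrated on a toy corpus. The sketch below is a simplified, standalone version of the vocabulary and encoding logic; the helper names `toy_vocab` and `toy_encode` are illustrative and not part of the notebook. It assumes index 0 is padding and index 1 is the unknown character, matching the functions defined in this section.

```python
from collections import Counter

def toy_vocab(texts):
    """Map each character to an index; 0 = <pad>, 1 = <unk>."""
    counts = Counter(ch for t in texts for ch in t)
    idx2char = ["<pad>", "<unk>"] + [c for c, _ in counts.most_common()]
    return {c: i for i, c in enumerate(idx2char)}

def toy_encode(text, char2idx, max_len):
    """Truncate to max_len, map unseen characters to 1, right-pad with 0."""
    ids = [char2idx.get(ch, 1) for ch in text[:max_len]]
    return ids + [0] * (max_len - len(ids))

char2idx = toy_vocab(["aab", "ab"])   # 'a' -> 2, 'b' -> 3
encoded = toy_encode("abz", char2idx, max_len=5)
print(encoded)  # [2, 3, 1, 0, 0]
```

The character `z` never appears in the toy corpus, so it maps to the unknown index 1, and the two trailing zeros are padding.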

In [None]:
# ============================================================
# DATA PREPROCESSING FUNCTIONS
# ============================================================

def unicode_normalize(s):
    """Normalize unicode characters to NFKC form."""
    return unicodedata.normalize("NFKC", str(s))

def load_raw_data(data_raw_path):
    """
    Load raw data from CSV and TXT files.
    
    Returns DataFrame with columns ['text', 'label'].
    
    Acceptable raw formats:
      - CSV files under data/raw/ with columns text, label
      - TXT files named lang.txt containing samples line-by-line
    """
    rows = []
    
    # Load CSV files
    for csv_path in glob(os.path.join(data_raw_path, "*.csv")):
        try:
            df = pd.read_csv(csv_path, usecols=["text", "label"])
            rows.append(df)
            print(f"  ✅ Loaded CSV: {os.path.basename(csv_path)} ({len(df)} samples)")
        except Exception:
            # Try fallback: assume two columns without headers
            df = pd.read_csv(csv_path, header=None, names=["text", "label"])
            rows.append(df)
            print(f"  ✅ Loaded CSV (no header): {os.path.basename(csv_path)} ({len(df)} samples)")
    
    # Load TXT files (each file = one language)
    for txt_path in glob(os.path.join(data_raw_path, "*.txt")):
        name = os.path.splitext(os.path.basename(txt_path))[0]
        with open(txt_path, "r", encoding="utf-8") as f:
            lines = [l.strip() for l in f if l.strip()]
        if lines:
            df = pd.DataFrame({"text": lines, "label": [name] * len(lines)})
            rows.append(df)
            print(f"  ✅ Loaded TXT: {os.path.basename(txt_path)} ({len(lines)} samples, label='{name}')")
    
    if not rows:
        return None
    
    df = pd.concat(rows, ignore_index=True)
    df['text'] = df['text'].astype(str).map(unicode_normalize)
    df['label'] = df['label'].astype(str)
    return df

def generate_sample_dataset(out_path):
    """
    Generate a sample dataset for testing the pipeline.
    In production, replace this with your actual dataset.
    """
    samples = [
        # English samples
        ("hello world", "en"),
        ("this is a test", "en"),
        ("how are you today", "en"),
        ("machine learning is fascinating", "en"),
        ("deep neural networks", "en"),
        ("natural language processing", "en"),
        ("the quick brown fox jumps", "en"),
        ("artificial intelligence research", "en"),
        
        # French samples
        ("bonjour le monde", "fr"),
        ("je suis étudiant", "fr"),
        ("comment allez-vous", "fr"),
        ("apprentissage automatique", "fr"),
        ("traitement du langage naturel", "fr"),
        ("intelligence artificielle", "fr"),
        ("bonne journée à tous", "fr"),
        ("merci beaucoup", "fr"),
        
        # Spanish samples
        ("hola mundo", "es"),
        ("buenos días", "es"),
        ("cómo estás hoy", "es"),
        ("aprendizaje automático", "es"),
        ("procesamiento del lenguaje", "es"),
        ("inteligencia artificial", "es"),
        ("muchas gracias", "es"),
        ("hasta luego amigos", "es"),
        
        # Khmer samples
        ("សួស្តី​ពិភពលោក", "km"),
        ("ជំរាបសួរ", "km"),
        ("សូមអរគុណ", "km"),
        ("រៀនភាសា", "km"),
        ("កម្ពុជា", "km"),
        ("ភ្នំពេញ", "km"),
        
        # Japanese samples
        ("こんにちは世界", "jp"),
        ("おはようございます", "jp"),
        ("ありがとうございます", "jp"),
        ("機械学習", "jp"),
        ("自然言語処理", "jp"),
        ("人工知能", "jp"),
    ]
    
    df = pd.DataFrame(samples, columns=["text", "label"])
    df.to_csv(out_path, index=False, encoding="utf-8")
    print(f"✅ Sample dataset saved to: {out_path}")
    return df

def build_char_vocab(texts, min_freq=1, max_vocab=None):
    """
    Build character-level vocabulary from texts.
    
    Returns:
        idx2char: list of characters (index -> char)
        char2idx: dict mapping char -> index
    """
    cnt = Counter()
    for t in texts:
        cnt.update(list(t))
    
    items = [c for c, f in cnt.most_common() if f >= min_freq]
    if max_vocab:
        items = items[:max_vocab]
    
    # Reserve 0 for PAD, 1 for UNK
    idx2char = ["<pad>", "<unk>"] + items
    char2idx = {c: i for i, c in enumerate(idx2char)}
    
    return idx2char, char2idx

def encode_text(s, char2idx, max_len):
    """
    Encode text to fixed-length integer sequence.
    
    Args:
        s: input text
        char2idx: character to index mapping
        max_len: maximum sequence length
    
    Returns:
        List of integers (padded/truncated to max_len)
    """
    s = s[:max_len]
    ids = [char2idx.get(ch, 1) for ch in s]  # UNK -> 1
    if len(ids) < max_len:
        ids = ids + [0] * (max_len - len(ids))  # PAD -> 0
    return ids

print("✅ Preprocessing functions defined!")

### 3.2 Load and Preprocess Data

Run this cell to load your data and create the processed files. If no data is found, a sample dataset will be generated.

In [None]:
# ============================================================
# PREPROCESSING CONFIGURATION
# ============================================================

# Hyperparameters for preprocessing
MAX_LEN = 128      # Maximum sequence length
MIN_FREQ = 1       # Minimum character frequency to include in vocabulary
MAX_VOCAB = None   # Maximum vocabulary size (None = no limit)

# ============================================================
# LOAD AND PREPROCESS DATA
# ============================================================

print("📊 Loading raw data...")
df = load_raw_data(DATA_RAW)

# If no data found, generate sample dataset
if df is None or df.empty:
    print("⚠️ No raw data found. Generating sample dataset for testing...")
    sample_path = os.path.join(DATA_RAW, "sample_data.csv")
    df = generate_sample_dataset(sample_path)
    df = load_raw_data(DATA_RAW)

print(f"\n📈 Total samples loaded: {len(df)}")

# Basic cleaning: drop empty texts
df = df[df['text'].str.strip().astype(bool)].reset_index(drop=True)
print(f"📈 After cleaning: {len(df)} samples")

# Show sample distribution
print("\n📊 Label distribution:")
print(df['label'].value_counts())

In [None]:
# ============================================================
# BUILD VOCABULARY AND ENCODE DATA
# ============================================================

# Build label mapping
labels = sorted(df['label'].unique().tolist())
label2id = {l: i for i, l in enumerate(labels)}
id2label = {i: l for l, i in label2id.items()}

print("🏷️ Label mapping:")
for label, idx in label2id.items():
    print(f"   {label} -> {idx}")

# Encode labels
y = df['label'].map(label2id).astype(np.int32).values

# Build character vocabulary
texts = df['text'].astype(str).tolist()
idx2char, char2idx = build_char_vocab(texts, min_freq=MIN_FREQ, max_vocab=MAX_VOCAB)

print(f"\n📝 Vocabulary size: {len(idx2char)} (including PAD/UNK)")
print(f"   Sample chars: {idx2char[:20]}...")

# Encode all texts
print("\n⏳ Encoding texts...")
X = np.array([encode_text(t, char2idx, MAX_LEN) for t in tqdm(texts)], dtype=np.int32)

print(f"\n✅ Encoded data shape: X={X.shape}, y={y.shape}")

In [None]:
# ============================================================
# SAVE PROCESSED DATA
# ============================================================

# Save numpy arrays
np.save(os.path.join(DATA_PROC, "X.npy"), X)
np.save(os.path.join(DATA_PROC, "y.npy"), y)

# Save vocabulary and label mapping
with open(os.path.join(DATA_PROC, "label_map.json"), "w", encoding="utf-8") as f:
    json.dump(label2id, f, ensure_ascii=False, indent=2)

with open(os.path.join(DATA_PROC, "vocab.json"), "w", encoding="utf-8") as f:
    json.dump(idx2char, f, ensure_ascii=False, indent=2)

print(f"✅ Saved processed data to: {DATA_PROC}")
print(f"   - X.npy: {X.shape}")
print(f"   - y.npy: {y.shape}")
print(f"   - vocab.json: {len(idx2char)} characters")
print(f"   - label_map.json: {len(label2id)} labels")

## 4. Dataset & DataLoader

Create PyTorch Dataset and DataLoader for training, validation, and testing.
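
The split arithmetic used by `LangDataset` in this section can be checked in isolation. The sketch below mirrors that arithmetic in a hypothetical helper, `split_sizes`, which is not part of the notebook: the test and validation sizes are rounded down, and the training split takes the remainder.

```python
def split_sizes(n, test_frac=0.15, val_frac=0.15):
    """Return (n_train, n_val, n_test); val/test sizes are floored."""
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    return n - n_test - n_val, n_val, n_test

# 36 is the size of the built-in sample dataset (8 + 8 + 8 + 6 + 6).
sizes = split_sizes(36)
print(sizes)  # (26, 5, 5)
```

With only 5 validation and 5 test samples, metrics computed on the sample dataset are very noisy and should be read as a pipeline check rather than a measure of model quality.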

In [None]:
# ============================================================
# PYTORCH DATASET CLASS
# ============================================================

class LangDataset(Dataset):
    """
    PyTorch Dataset for language detection.
    
    Loads preprocessed data (X.npy, y.npy) and performs deterministic train/val/test split.
    """
    
    def __init__(self, split="train", test_frac=0.15, val_frac=0.15, load_path=None):
        """
        Args:
            split: 'train', 'val', or 'test'
            test_frac: fraction of data for testing
            val_frac: fraction of data for validation
            load_path: path to processed data directory
        """
        if load_path is None:
            load_path = DATA_PROC
            
        # Load data
        xp = np.load(os.path.join(load_path, "X.npy"))
        yp = np.load(os.path.join(load_path, "y.npy"))
        
        # Deterministic shuffle
        rng = np.random.RandomState(42)
        perm = rng.permutation(len(xp))
        xp = xp[perm]
        yp = yp[perm]
        
        # Calculate split sizes
        n = len(xp)
        n_test = int(n * test_frac)
        n_val = int(n * val_frac)
        n_train = n - n_test - n_val
        
        # Split data
        train_X, train_y = xp[:n_train], yp[:n_train]
        val_X, val_y = xp[n_train:n_train+n_val], yp[n_train:n_train+n_val]
        test_X, test_y = xp[n_train+n_val:], yp[n_train+n_val:]
        
        if split == "train":
            self.X, self.y = train_X, train_y
        elif split == "val":
            self.X, self.y = val_X, val_y
        elif split == "test":
            self.X, self.y = test_X, test_y
        else:
            raise ValueError("split must be one of: train, val, test")
        
        # Load metadata
        vocab_path = os.path.join(load_path, "vocab.json")
        label_path = os.path.join(load_path, "label_map.json")
        
        self.idx2char = []
        self.label2id = {}
        
        if os.path.exists(vocab_path):
            with open(vocab_path, "r", encoding="utf-8") as f:
                self.idx2char = json.load(f)
        if os.path.exists(label_path):
            with open(label_path, "r", encoding="utf-8") as f:
                self.label2id = json.load(f)
    
    def __len__(self):
        return len(self.X)
    
    def __getitem__(self, idx):
        x = torch.tensor(self.X[idx], dtype=torch.long)
        y = torch.tensor(int(self.y[idx]), dtype=torch.long)
        return x, y

print("✅ LangDataset class defined!")

In [None]:
# ============================================================
# CREATE DATASETS AND DATALOADERS
# ============================================================

# Hyperparameters
BATCH_SIZE = 32

# Create datasets
train_ds = LangDataset(split="train")
val_ds = LangDataset(split="val")
test_ds = LangDataset(split="test")

print(f"📊 Dataset sizes:")
print(f"   Training:   {len(train_ds)} samples")
print(f"   Validation: {len(val_ds)} samples")
print(f"   Testing:    {len(test_ds)} samples")

# Create dataloaders
train_loader = DataLoader(train_ds, batch_size=BATCH_SIZE, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)
test_loader = DataLoader(test_ds, batch_size=BATCH_SIZE, shuffle=False, num_workers=0)

# Get vocab and label info
vocab_size = len(train_ds.idx2char)
num_classes = len(train_ds.label2id)

print(f"\n📝 Vocabulary size: {vocab_size}")
print(f"🏷️ Number of classes: {num_classes}")

## 5. Model Architecture

### Character-level CNN for Language Detection

The model uses:
1. **Embedding Layer:** Converts character indices to dense vectors
2. **Convolutional Layers:** Multiple kernels (3, 5, 7) to capture n-gram patterns
3. **Max Pooling:** Extract most important features
4. **Dropout:** Regularization to prevent overfitting
5. **Fully Connected Layer:** Final classification

```
Input (batch, seq_len)
    ↓
Embedding (batch, seq_len, embed_dim)
    ↓
Transpose (batch, embed_dim, seq_len)
    ↓
[Conv1D → ReLU → MaxPool] × 3 (different kernel sizes)
    ↓
Concatenate
    ↓
Dropout
    ↓
Fully Connected → Logits
```
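
The tensor shapes in the diagram follow from simple arithmetic. As a rough sketch, assuming the default settings used in this notebook (sequence length 128, 128 filters, kernel sizes 3, 5, 7) and convolutions with stride 1 and no padding, the output length of each branch is `seq_len - kernel_size + 1`. The helper name `conv_output_length` is illustrative only.

```python
def conv_output_length(seq_len, kernel_size):
    """Length after an unpadded, stride-1 1D convolution."""
    return seq_len - kernel_size + 1

seq_len, num_filters, kernel_sizes = 128, 128, (3, 5, 7)

# One output length per parallel branch, before global max pooling.
lengths = [conv_output_length(seq_len, k) for k in kernel_sizes]

# Global max pooling keeps one value per filter, so the concatenated
# feature vector has num_filters entries per branch.
feature_dim = num_filters * len(kernel_sizes)

print(lengths)      # [126, 124, 122]
print(feature_dim)  # 384
```

Because every input is padded to the same length, each branch always sees a sequence longer than its kernel, and the pooled feature size does not depend on the text length.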

In [None]:
# ============================================================
# CHARACTER-LEVEL CNN MODEL
# ============================================================

class CharCNN(nn.Module):
    """
    Character-level Convolutional Neural Network for Language Detection.
    
    Architecture:
    - Embedding layer for character representations
    - Multiple parallel Conv1D layers with different kernel sizes
    - Global max pooling
    - Dropout for regularization
    - Fully connected output layer
    """
    
    def __init__(self, vocab_size, embed_dim, num_classes, 
                 num_filters=128, kernel_sizes=(3, 5, 7), dropout=0.3):
        """
        Args:
            vocab_size: Size of character vocabulary (including PAD/UNK)
            embed_dim: Embedding dimension for characters
            num_classes: Number of language classes
            num_filters: Number of filters per convolution
            kernel_sizes: Tuple of kernel sizes for parallel convolutions
            dropout: Dropout probability
        """
        super().__init__()
        
        # Character embedding layer
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        
        # Parallel convolution layers with different kernel sizes
        self.convs = nn.ModuleList([
            nn.Conv1d(
                in_channels=embed_dim,
                out_channels=num_filters,
                kernel_size=k
            )
            for k in kernel_sizes
        ])
        
        # Regularization
        self.dropout = nn.Dropout(dropout)
        
        # Output layer
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)
    
    def forward(self, x):
        """
        Forward pass.
        
        Args:
            x: Input tensor of shape (batch, seq_len)
        
        Returns:
            Logits tensor of shape (batch, num_classes)
        """
        # Embedding: (batch, seq_len) -> (batch, seq_len, embed_dim)
        emb = self.embedding(x)
        
        # Transpose for Conv1d: (batch, embed_dim, seq_len)
        emb = emb.transpose(1, 2)
        
        # Apply each convolution and pool
        conv_outs = []
        for conv in self.convs:
            c = F.relu(conv(emb))  # (batch, num_filters, L_out)
            c = F.max_pool1d(c, kernel_size=c.size(2)).squeeze(2)  # (batch, num_filters)
            conv_outs.append(c)
        
        # Concatenate all conv outputs
        cat = torch.cat(conv_outs, dim=1)  # (batch, num_filters * len(kernel_sizes))
        
        # Dropout
        cat = self.dropout(cat)
        
        # Final classification
        logits = self.fc(cat)
        
        return logits

print("✅ CharCNN model class defined!")

In [None]:
# ============================================================
# MODEL CONFIGURATION & INITIALIZATION
# ============================================================

# Model hyperparameters
EMBED_DIM = 64       # Character embedding dimension
NUM_FILTERS = 128    # Number of convolution filters
KERNEL_SIZES = (3, 5, 7)  # Different n-gram sizes
DROPOUT = 0.3        # Dropout probability
LEARNING_RATE = 1e-3 # Learning rate
NUM_EPOCHS = 10      # Number of training epochs

# Create model
model = CharCNN(
    vocab_size=vocab_size,
    embed_dim=EMBED_DIM,
    num_classes=num_classes,
    num_filters=NUM_FILTERS,
    kernel_sizes=KERNEL_SIZES,
    dropout=DROPOUT
)

# Move to device
model = model.to(device)

# Print model summary
print("🧠 Model Architecture:")
print("=" * 50)
print(model)
print("=" * 50)

# Count parameters
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"\n📊 Total parameters: {total_params:,}")
print(f"📊 Trainable parameters: {trainable_params:,}")

## 6. Training

Train the CNN model, evaluate it on the validation set after each epoch, and save a checkpoint whenever validation accuracy improves.

In [None]:
# ============================================================
# EVALUATION FUNCTION
# ============================================================

def evaluate_model(model, loader, device):
    """
    Evaluate model on a data loader.
    
    Returns:
        accuracy, f1_score, true_labels, predictions
    """
    model.eval()
    preds = []
    gold = []
    
    with torch.no_grad():
        for x, y in loader:
            x = x.to(device)
            logits = model(x)
            p = torch.argmax(logits, dim=1).cpu().numpy()
            preds.extend(p.tolist())
            gold.extend(y.numpy().tolist())
    
    acc = accuracy_score(gold, preds)
    f1 = f1_score(gold, preds, average="macro")
    
    return acc, f1, gold, preds

print("✅ Evaluation function defined!")

In [None]:
# ============================================================
# TRAINING LOOP
# ============================================================

# Loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)

# Training history
history = {
    "train_loss": [],
    "val_acc": [],
    "val_f1": []
}

best_val_acc = 0.0

print("🚀 Starting training...")
print("=" * 60)

for epoch in range(1, NUM_EPOCHS + 1):
    # Training phase
    model.train()
    total_loss = 0.0
    
    pbar = tqdm(train_loader, desc=f"Epoch {epoch}/{NUM_EPOCHS}")
    for step, (x, y) in enumerate(pbar, start=1):
        x = x.to(device)
        y = y.to(device)
        
        # Forward pass
        optimizer.zero_grad()
        logits = model(x)
        loss = criterion(logits, y)
        
        # Backward pass
        loss.backward()
        optimizer.step()
        
        total_loss += loss.item()
        # Running mean over completed batches (pbar.n can lag behind the true step count)
        pbar.set_postfix(loss=total_loss / step)
    
    avg_loss = total_loss / len(train_loader)
    
    # Validation phase
    val_acc, val_f1, _, _ = evaluate_model(model, val_loader, device)
    
    # Log metrics
    history["train_loss"].append(avg_loss)
    history["val_acc"].append(val_acc)
    history["val_f1"].append(val_f1)
    
    print(f"[Epoch {epoch}] Loss: {avg_loss:.4f} | Val Acc: {val_acc:.4f} | Val F1: {val_f1:.4f}")
    
    # Save best model
    if val_acc > best_val_acc:
        best_val_acc = val_acc
        checkpoint = {
            "model_state": model.state_dict(),
            "vocab": idx2char,
            "label_map": label2id,
            "args": {
                "embed_dim": EMBED_DIM,
                "num_filters": NUM_FILTERS,
                "kernels": ",".join(map(str, KERNEL_SIZES)),
                "dropout": DROPOUT
            }
        }
        torch.save(checkpoint, os.path.join(MODEL_DIR, "best_model.pt"))
        print(f"  ✅ Saved best model (val_acc={val_acc:.4f})")

print("=" * 60)
print(f"🎉 Training complete! Best validation accuracy: {best_val_acc:.4f}")

In [None]:
# ============================================================
# SAVE TRAINING HISTORY
# ============================================================

# Save history to JSON
with open(os.path.join(MODEL_DIR, "train_history.json"), "w") as f:
    json.dump(history, f, indent=2)

print(f"✅ Training history saved to: {os.path.join(MODEL_DIR, 'train_history.json')}")

## 7. Evaluation & Visualization

Evaluate the trained model and visualize the results.
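
The evaluation reports accuracy and macro-averaged F1, both computed with scikit-learn. To make the difference between the two concrete, the sketch below recomputes macro F1 by hand on a made-up set of labels; `macro_f1` is an illustrative helper, not the notebook's code, and it assumes a class with no true positives scores 0.

```python
def macro_f1(gold, preds):
    """Unweighted mean of per-class F1 over all classes seen in gold or preds."""
    labels = sorted(set(gold) | set(preds))
    scores = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, preds))
        fp = sum(g != c and p == c for g, p in zip(gold, preds))
        fn = sum(g == c and p != c for g, p in zip(gold, preds))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# A classifier that always predicts class 0 on an imbalanced toy set.
gold = [0, 0, 0, 0, 1]
preds = [0, 0, 0, 0, 0]
accuracy = sum(g == p for g, p in zip(gold, preds)) / len(gold)
score = macro_f1(gold, preds)
print(accuracy)         # 0.8
print(round(score, 4))  # 0.4444
```

Accuracy looks acceptable here, while macro F1 exposes that one class is never predicted. This is why macro F1 is the more informative number when languages have unequal sample counts.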

In [None]:
# ============================================================
# PLOT TRAINING CURVES
# ============================================================

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Training Loss
axes[0].plot(history["train_loss"], 'b-', linewidth=2)
axes[0].set_title("Training Loss", fontsize=12)
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Loss")
axes[0].grid(True, alpha=0.3)

# Validation Accuracy
axes[1].plot(history["val_acc"], 'g-', linewidth=2)
axes[1].set_title("Validation Accuracy", fontsize=12)
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Accuracy")
axes[1].grid(True, alpha=0.3)

# Validation F1 Score
axes[2].plot(history["val_f1"], 'r-', linewidth=2)
axes[2].set_title("Validation F1 Score", fontsize=12)
axes[2].set_xlabel("Epoch")
axes[2].set_ylabel("F1 Score")
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig(os.path.join(MODEL_DIR, "training_curves.png"), dpi=150)
plt.show()

print(f"✅ Training curves saved to: {os.path.join(MODEL_DIR, 'training_curves.png')}")

In [None]:
# ============================================================
# LOAD BEST MODEL AND EVALUATE ON TEST SET
# ============================================================

# Load the best checkpoint
checkpoint_path = os.path.join(MODEL_DIR, "best_model.pt")
checkpoint = torch.load(checkpoint_path, map_location=device)

# Recreate model with saved configuration
saved_args = checkpoint.get("args", {})
model_best = CharCNN(
    vocab_size=len(checkpoint["vocab"]),
    embed_dim=saved_args.get("embed_dim", EMBED_DIM),
    num_classes=len(checkpoint["label_map"]),
    num_filters=saved_args.get("num_filters", NUM_FILTERS),
    kernel_sizes=tuple(map(int, saved_args.get("kernels", "3,5,7").split(","))),
    dropout=saved_args.get("dropout", DROPOUT)
)
model_best.load_state_dict(checkpoint["model_state"])
model_best = model_best.to(device)
model_best.eval()

# Evaluate on test set
test_acc, test_f1, gold, preds = evaluate_model(model_best, test_loader, device)

print("=" * 50)
print("📊 TEST SET RESULTS")
print("=" * 50)
print(f"   Accuracy: {test_acc:.4f} ({test_acc*100:.2f}%)")
print(f"   F1 Score: {test_f1:.4f}")
print("=" * 50)

In [None]:
# ============================================================
# CLASSIFICATION REPORT
# ============================================================

# Get label names
id2label_saved = {int(v): k for k, v in checkpoint["label_map"].items()}
label_names = [id2label_saved[i] for i in range(len(id2label_saved))]

print("📋 CLASSIFICATION REPORT")
print("=" * 50)
print(classification_report(gold, preds, target_names=label_names))

In [None]:
# ============================================================
# CONFUSION MATRIX VISUALIZATION
# ============================================================

cm = confusion_matrix(gold, preds)

plt.figure(figsize=(10, 8))
sns.heatmap(
    cm, 
    annot=True, 
    fmt="d", 
    cmap="Blues",
    xticklabels=label_names,
    yticklabels=label_names,
    cbar_kws={'label': 'Count'}
)
plt.xlabel("Predicted Label", fontsize=12)
plt.ylabel("True Label", fontsize=12)
plt.title("Confusion Matrix - Language Detection CNN", fontsize=14)
plt.tight_layout()
plt.savefig(os.path.join(MODEL_DIR, "confusion_matrix.png"), dpi=150)
plt.show()

print(f"✅ Confusion matrix saved to: {os.path.join(MODEL_DIR, 'confusion_matrix.png')}")

## 8. Inference - Predict Language

Use the trained model to predict the language of new text inputs.
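
Prediction comes down to turning the model's logits into probabilities with a softmax and ranking them. The sketch below shows that step on made-up logits using only the standard library; the function names `softmax` and `top_k` and the example values are illustrative, not taken from the notebook.

```python
import math

def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(probs, labels, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(zip(labels, probs), key=lambda lp: lp[1], reverse=True)
    return ranked[:k]

probs = softmax([2.0, 1.0, 0.1])          # hypothetical logits
best = top_k(probs, ["en", "fr", "es"], k=2)
print(best[0][0])  # en
```

The softmax always sums to 1 over the known labels, so text in a language the model was never trained on still receives a confident-looking prediction. The confidence value is best treated as a relative ranking rather than a calibrated probability.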

In [None]:
# ============================================================
# PREDICTION FUNCTION
# ============================================================

class LanguagePredictor:
    """
    A class for making language predictions with the trained CNN model.
    """
    
    def __init__(self, checkpoint_path, max_len=128):
        """
        Load model from checkpoint.
        
        Args:
            checkpoint_path: Path to the saved model checkpoint
            max_len: Maximum sequence length
        """
        self.max_len = max_len
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        
        # Load checkpoint
        self.checkpoint = torch.load(checkpoint_path, map_location=self.device)
        self.idx2char = self.checkpoint["vocab"]
        self.label_map = self.checkpoint["label_map"]
        self.id2label = {int(v): k for k, v in self.label_map.items()}
        self.char2idx = {c: i for i, c in enumerate(self.idx2char)}
        
        # Build model
        args = self.checkpoint.get("args", {})
        self.model = CharCNN(
            vocab_size=len(self.idx2char),
            embed_dim=args.get("embed_dim", 64),
            num_classes=len(self.label_map),
            num_filters=args.get("num_filters", 128),
            kernel_sizes=tuple(map(int, args.get("kernels", "3,5,7").split(","))),
            dropout=args.get("dropout", 0.3)
        )
        self.model.load_state_dict(self.checkpoint["model_state"])
        self.model = self.model.to(self.device)
        self.model.eval()
    
    def encode(self, text):
        """Encode text to integer sequence."""
        text = unicodedata.normalize("NFKC", str(text))
        ids = [self.char2idx.get(ch, 1) for ch in text[:self.max_len]]
        if len(ids) < self.max_len:
            ids = ids + [0] * (self.max_len - len(ids))
        return ids
    
    def predict(self, text, top_k=3):
        """
        Predict language for input text.
        
        Args:
            text: Input text string
            top_k: Number of top predictions to return
        
        Returns:
            Tuple of (predicted_language, confidence, top_k_results), where
            top_k_results is a list of {"language", "confidence"} dicts sorted
            by descending confidence
        """
        ids = torch.tensor([self.encode(text)], dtype=torch.long).to(self.device)
        
        with torch.no_grad():
            logits = self.model(ids)
            probs = torch.softmax(logits, dim=1).cpu().numpy()[0]
        
        # Get top-k predictions
        top_indices = probs.argsort()[::-1][:top_k]
        
        results = []
        for idx in top_indices:
            results.append({
                "language": self.id2label[idx],
                "confidence": float(probs[idx])
            })
        
        return results[0]["language"], results[0]["confidence"], results

print("✅ LanguagePredictor class defined!")

In [None]:
# ============================================================
# TEST PREDICTIONS
# ============================================================

# Initialize predictor
predictor = LanguagePredictor(os.path.join(MODEL_DIR, "best_model.pt"))

# Test samples
test_samples = [
    "Hello, how are you doing today?",
    "Bonjour, comment allez-vous?",
    "Hola, ¿cómo estás?",
    "សួស្តី​ពិភពលោក",
    "こんにちは、元気ですか？",
    "Machine learning is amazing",
    "L'intelligence artificielle",
]

print("🔮 LANGUAGE PREDICTIONS")
print("=" * 70)

for sample in test_samples:
    lang, conf, top_results = predictor.predict(sample, top_k=3)
    print(f"\n📝 Input: \"{sample}\"")
    print(f"   ➜ Predicted: {lang.upper()} (confidence: {conf:.2%})")
    # Format top 3 results
    top3_str = ", ".join([f"{r['language']}:{r['confidence']:.2%}" for r in top_results])
    print(f"   Top 3: {top3_str}")

In [None]:
# ============================================================
# INTERACTIVE PREDICTION
# ============================================================

def predict_language(text):
    """
    Simple function to predict language of input text.
    
    Usage:
        predict_language("Hello world")
    """
    # Request every class so the "All probabilities" listing is complete
    lang, conf, results = predictor.predict(text, top_k=len(predictor.id2label))
    print(f"🌐 Language: {lang.upper()}")
    print(f"📊 Confidence: {conf:.2%}")
    print(f"📈 All probabilities:")
    for r in results:
        bar = "█" * int(r["confidence"] * 20)
        print(f"   {r['language']:>5}: {bar} {r['confidence']:.2%}")
    return lang, conf

# Try it yourself! Change the text below:
predict_language("This is a test sentence in English")

## 9. Usage Instructions

### 📁 Project Structure
```
Deep Learning/
├── data/
│   ├── raw/           # Place your raw data files here
│   │   ├── *.csv      # CSV with columns: text, label
│   │   └── *.txt      # TXT files named <language>.txt
│   └── processed/     # Preprocessed data (auto-generated)
├── models/            # Saved models and artifacts
├── notebooks/
│   └── language_detection_cnn.ipynb
├── src/               # Source code modules
└── requirements.txt
```

---

### 🚀 Quick Start Guide

#### Step 1: Prepare Your Data
Place your data in `data/raw/` in one of these formats:

**Option A: CSV files**
```csv
text,label
Hello world,en
Bonjour le monde,fr
Hola mundo,es
```

**Option B: TXT files (one per language)**
- `en.txt` - One English sample per line
- `fr.txt` - One French sample per line
- etc.

#### Step 2: Run the Notebook
Execute cells in order:
1. **Sections 1-2**: Install dependencies, import libraries, and set up paths
2. **Section 3**: Preprocess data (builds vocabulary, encodes text)
3. **Section 4**: Create datasets and dataloaders
4. **Section 5**: Define model architecture
5. **Section 6**: Train the model
6. **Section 7**: Evaluate and visualize results
7. **Section 8**: Make predictions on new text

---

### ⚙️ Hyperparameters

| Parameter | Default | Description |
|-----------|---------|-------------|
| `MAX_LEN` | 128 | Maximum text sequence length |
| `BATCH_SIZE` | 32 | Training batch size |
| `EMBED_DIM` | 64 | Character embedding dimension |
| `NUM_FILTERS` | 128 | CNN filter count |
| `KERNEL_SIZES` | (3, 5, 7) | N-gram sizes to capture |
| `DROPOUT` | 0.3 | Dropout rate |
| `LEARNING_RATE` | 0.001 | Adam optimizer learning rate |
| `NUM_EPOCHS` | 10 | Training epochs |
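
To get a feel for how these settings affect model size, the parameter count can be estimated by hand. The sketch below assumes the architecture described in Section 5 (one embedding table, one unpadded `Conv1d` per kernel size with bias, and a single linear output layer with bias). The vocabulary size of 100 and the 5 classes are hypothetical values chosen for illustration, and `char_cnn_param_count` is not part of the notebook.

```python
def char_cnn_param_count(vocab_size, num_classes, embed_dim=64,
                         num_filters=128, kernel_sizes=(3, 5, 7)):
    """Estimate the number of weights and biases in the character CNN."""
    embedding = vocab_size * embed_dim
    # Each conv has embed_dim * num_filters * k weights plus num_filters biases.
    convs = sum(embed_dim * num_filters * k + num_filters for k in kernel_sizes)
    # The linear layer maps the concatenated pooled features to class logits.
    fc = num_filters * len(kernel_sizes) * num_classes + num_classes
    return embedding + convs + fc

total = char_cnn_param_count(vocab_size=100, num_classes=5)
print(total)  # 131589
```

Under these assumptions the convolutions dominate the count, so `NUM_FILTERS` and `KERNEL_SIZES` are the settings that change model size the most, while the embedding table grows with the number of distinct characters in the training data.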

---

### 📊 Adding More Languages

1. Add training data to `data/raw/`:
   - CSV: Add rows with new language label
   - TXT: Create `<lang_code>.txt` file

2. Re-run preprocessing cells (Section 3)

3. Re-train the model (Section 6)
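
Retraining is needed because the label ids are assigned from the sorted list of labels, so a new language can shift the ids of existing ones, and the output layer also gains a unit. The sketch below reproduces that mapping logic with a hypothetical helper, `build_label_map`, and made-up language codes.

```python
def build_label_map(labels):
    """Assign ids in sorted order, as the preprocessing step does."""
    ordered = sorted(set(labels))
    return {label: i for i, label in enumerate(ordered)}

before = build_label_map(["en", "fr", "es"])
after = build_label_map(["en", "fr", "es", "de"])
print(before)  # {'en': 0, 'es': 1, 'fr': 2}
print(after)   # {'de': 0, 'en': 1, 'es': 2, 'fr': 3}
```

A checkpoint trained with the first mapping would therefore mislabel every prediction under the second one, which is why the label map is stored alongside the model weights.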

---

### 💾 Using the Trained Model

```python
# Load the trained model
predictor = LanguagePredictor("models/best_model.pt")

# Predict language
language, confidence, all_results = predictor.predict("Your text here")
print(f"Language: {language}, Confidence: {confidence:.2%}")
```

---

### 🔧 Command Line Usage

You can also use the source files directly:

```bash
# Preprocess data
python src/preprocess.py --max-len 128

# Train model
python src/train.py --epochs 10 --batch-size 64

# Evaluate/predict
python src/evaluate.py --text "Hello world"
```

---

### 📚 Recommended Datasets

For production use, consider these multilingual datasets:
- **Tatoeba**: Sentence translations in 300+ languages
- **Wikipedia**: Text dumps for most languages
- **WMT**: Machine translation dataset
- **OPUS**: Parallel corpus collection

---

### 🎯 Tips for Better Performance

1. **More data**: Aim for 1000+ samples per language
2. **Balanced classes**: Equal samples per language
3. **Data augmentation**: Add noise, typos, case variations
4. **Longer training**: Increase epochs for larger datasets
5. **Adjust architecture**: Increase filters for more languages
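
As one possible take on the data augmentation tip, the sketch below randomly flips character case and drops characters to simulate typos. It is an illustrative example rather than part of the notebook: the function name `augment` and the probabilities `p_case` and `p_drop` are assumptions, and suitable values would need to be tuned on real data.

```python
import random

def augment(text, rng, p_case=0.1, p_drop=0.05):
    """Return a noisy copy of text: some characters dropped, some case-flipped."""
    out = []
    for ch in text:
        r = rng.random()
        if r < p_drop:
            continue                 # simulate a missing character
        if r < p_drop + p_case:
            ch = ch.swapcase()       # simulate inconsistent casing
        out.append(ch)
    # Fall back to the original if every character was dropped.
    return "".join(out) or text

rng = random.Random(0)               # seeded for reproducibility
original = "machine learning is fascinating"
noisy = augment(original, rng)
print(noisy)
```

Case flipping has no effect on scripts without letter case, such as Khmer or Japanese, so only the character dropping applies there. Augmented samples would normally be added to the training split only, so that validation and test metrics still reflect clean input.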

---

## 🎉 Congratulations!

You have successfully trained a CNN-based language detection model. 

**What you've learned:**
- Character-level text preprocessing
- Building CNN architectures for text classification
- Training and evaluating deep learning models
- Making predictions with trained models

**Next Steps:**
- Add more languages to your dataset
- Experiment with different hyperparameters
- Try integrating with OCR for image-based language detection
- Deploy as a web API or mobile application

---

*Created for the Language Detection Deep Learning Project*