# English → Farsi Translation with Transformers

This course teaches you to build a transformer model **from scratch** and train it to translate English sentences into Farsi. You'll learn every component—tokenization, embeddings, attention, FFN layers, encoder/decoder blocks—by implementing them yourself before seeing how production libraries do it.

## Environment Setup

Before we begin, ensure you have the necessary packages installed. Run the cell below (it's safe to run even if packages are already installed).

In [1]:
!pip install -r requirements.txt


[1;31merror[0m: [1mexternally-managed-environment[0m

[31m×[0m This environment is externally managed
[31m╰─>[0m To install Python packages system-wide, try apt install
[31m   [0m python3-xyz, where xyz is the package you are trying to
[31m   [0m install.
[31m   [0m 
[31m   [0m If you wish to install a non-Debian-packaged Python package,
[31m   [0m create a virtual environment using python3 -m venv path/to/venv.
[31m   [0m Then use path/to/venv/bin/python and path/to/venv/bin/pip. Make
[31m   [0m sure you have python3-full installed.
[31m   [0m 
[31m   [0m If you wish to install a non-Debian packaged Python application,
[31m   [0m it may be easiest to use pipx install xyz, which will manage a
[31m   [0m virtual environment for you. Make sure you have pipx installed.
[31m   [0m 
[31m   [0m See /usr/share/doc/python3.13/README.venv for more information.

[1;35mnote[0m: If you believe this is a mistake, please contact your Python installation or OS dist

## Download the Helsinki-Persian Dataset

The Helsinki-Persian-Opus-100 dataset contains 2,047 English-Farsi parallel sentences. We'll download it and cache it locally in the `.data/` folder (which is in `.gitignore` to keep the repository size small).

Run the cell below to download and cache the dataset:

In [2]:
import os
import json
from datasets import load_dataset

# Create .data folder if it doesn't exist
os.makedirs('.data', exist_ok=True)

print("Downloading Helsinki-Persian-Opus-100 dataset...")
dataset = load_dataset("Maani/Helsinki-Persian-Opus-100")
train_data = dataset['train']

print(f"✓ Downloaded {len(train_data)} translation pairs")

# Inspect the structure
print(f"\nDataset columns: {train_data.column_names}")
print(f"Example structure:")
sample = train_data[0]
for key, value in sample.items():
    print(f"  {key}: {value}")

# Save locally to .data/en_fa_train.jsonl for faster loading in future cells
print("\nCaching dataset locally to .data/en_fa_train.jsonl...")
with open('.data/en_fa_train.jsonl', 'w', encoding='utf-8') as f:
    for item in train_data:
        f.write(json.dumps(item, ensure_ascii=False) + '\n')

print("✓ Dataset cached successfully!")


Downloading Helsinki-Persian-Opus-100 dataset...
✓ Downloaded 2047 translation pairs

Dataset columns: ['instruction', 'input', 'output']
Example structure:
  instruction: Translate this sentence from English to Persian.
  input: Pack your stuff.
  output: بند و بساطتو جمع کن.

Caching dataset locally to .data/en_fa_train.jsonl...
✓ Dataset cached successfully!
✓ Downloaded 2047 translation pairs

Dataset columns: ['instruction', 'input', 'output']
Example structure:
  instruction: Translate this sentence from English to Persian.
  input: Pack your stuff.
  output: بند و بساطتو جمع کن.

Caching dataset locally to .data/en_fa_train.jsonl...
✓ Dataset cached successfully!


## NumPy Basics

NumPy is the foundational library for numerical computing in Python. It provides efficient arrays and mathematical operations.

In [3]:
import numpy as np

# Creating arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.zeros((3, 4))  # 3x4 matrix of zeros
arr3 = np.ones((2, 3))   # 2x3 matrix of ones
arr4 = np.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

print("Array 1:", arr1)
print("Array 2 shape:", arr2.shape)
print("Array 4:", arr4)

Array 1: [1 2 3 4 5]
Array 2 shape: (3, 4)
Array 4: [0 2 4 6 8]


In [4]:
# Basic operations
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

print("Addition:", a + b)
print("Multiplication:", a * b)
print("Dot product:", np.dot(a, b))
print("Sum:", np.sum(a))
print("Mean:", np.mean(a))

Addition: [5 7 9]
Multiplication: [ 4 10 18]
Dot product: 32
Sum: 6
Mean: 2.0


## PyTorch Basics

PyTorch is a deep learning framework that provides tensors (similar to NumPy arrays) with GPU acceleration and automatic differentiation for neural networks.

In [5]:
import torch

# Creating tensors
tensor1 = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
tensor2 = torch.zeros(3, 4)  # 3x4 tensor of zeros
tensor3 = torch.ones(2, 3)   # 2x3 tensor of ones
tensor4 = torch.arange(0, 10, 2)  # [0, 2, 4, 6, 8]

print("Tensor 1:", tensor1)
print("Tensor 2 shape:", tensor2.shape)
print("Device (CPU/GPU):", tensor1.device)

Tensor 1: tensor([1., 2., 3., 4., 5.])
Tensor 2 shape: torch.Size([3, 4])
Device (CPU/GPU): cpu


In [6]:
# Basic tensor operations
t1 = torch.tensor([1.0, 2.0, 3.0])
t2 = torch.tensor([4.0, 5.0, 6.0])

print("Addition:", t1 + t2)
print("Multiplication:", t1 * t2)
print("Dot product:", torch.dot(t1, t2))
print("Sum:", torch.sum(t1))
print("Mean:", torch.mean(t1))

Addition: tensor([5., 7., 9.])
Multiplication: tensor([ 4., 10., 18.])
Dot product: tensor(32.)
Sum: tensor(6.)
Mean: tensor(2.)


## NumPy to PyTorch Conversion

PyTorch tensors can be easily converted to and from NumPy arrays.

In [7]:
# NumPy to PyTorch
np_array = np.array([1, 2, 3, 4, 5])
torch_tensor = torch.from_numpy(np_array)
print("NumPy array:", np_array)
print("PyTorch tensor:", torch_tensor)
print("Tensor dtype:", torch_tensor.dtype)

# PyTorch to NumPy
torch_tensor_float = torch.tensor([1.0, 2.0, 3.0])
np_array_from_torch = torch_tensor_float.numpy()
print("\nConverted back to NumPy:", np_array_from_torch)
print("NumPy dtype:", np_array_from_torch.dtype)

NumPy array: [1 2 3 4 5]
PyTorch tensor: tensor([1, 2, 3, 4, 5])
Tensor dtype: torch.int64

Converted back to NumPy: [1. 2. 3.]
NumPy dtype: float32


## Key Differences: NumPy vs PyTorch

| Feature | NumPy | PyTorch |
|---------|-------|----------|
| GPU Support | No | Yes |
| Automatic Differentiation | No | Yes (requires `.requires_grad=True`) |
| Deep Learning Frameworks | Not designed for it | Built for neural networks |
| Speed (CPU) | Very fast | Comparable |
| Ecosystem | Scientific computing | Deep learning and AI |

Both are essential: NumPy for data manipulation and preprocessing, PyTorch for building and training neural networks.

## Exploring the Translation Data

Let's look at some actual translation examples to understand what we're working with:

In [8]:
# Load the cached dataset
import json

with open('../.data/en_fa_train.jsonl', 'r', encoding='utf-8') as f:
    all_samples = [json.loads(line) for line in f]

print(f"Total samples: {len(all_samples)}\n")

# Display the first 5 translation pairs
print("First 5 English → Farsi translation pairs:")
print("=" * 80)
for i in range(5):
    sample = all_samples[i]
    english = sample['input']
    farsi = sample['output']
    print(f"\n{i+1}. English: {english}")
    print(f"   Farsi:    {farsi}")

print("\n" + "=" * 80)
print("\nNote: The 'instruction' field tells the model what to do (e.g., 'Translate this sentence from English to Persian.').")
print("In our transformer, we'll use 'input' (English) as the encoder input and 'output' (Farsi) as the decoder target.")


Total samples: 2047

First 5 English → Farsi translation pairs:

1. English: I invited my foolish friend Jay around for tennis because I thought he'd make me look good.
   Farsi:    دوست ابله ام جِی رو مهمون کردم تنیس، چون پنداشتم انگیزه ای می‌شه که من بهتر به چشم بیام.

2. English: Pack your stuff.
   Farsi:    بند و بساطتو جمع کن.

3. English: Aunt Silvy, stop yelling!
   Farsi:    عمه سیلوی، داد نزن!

4. English: I need to get out of here.
   Farsi:    باید از اینجا بزنم بیرون.

5. English: Which means the mommy of the smartest physicist at the university is not my mommy as I had thought.
   Farsi:    که یعنی مامانِ باهوش‌ترین فیزیکدانِ دانشگاه، اون‌طور که فکر می‌کردم، مامان من نیست.


Note: The 'instruction' field tells the model what to do (e.g., 'Translate this sentence from English to Persian.').
In our transformer, we'll use 'input' (English) as the encoder input and 'output' (Farsi) as the decoder target.
