
# Flow Matching for Receipt Generation

This notebook demonstrates how to train a Flow Matching model to generate synthetic receipt images. We'll cover the entire process from data preparation to model training and image generation. The goal is to create a generative model that can produce realistic-looking receipts based on a small dataset of real receipts.

## Setup and Hyperparameters

This section imports necessary libraries and defines global hyperparameters (`HPARAMS`) that configure our model and training process. Understanding these parameters is crucial for customizing the model's behavior and performance.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.utils import make_grid
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import random
import os
from PIL import Image, ImageDraw

# Check for GPU availability and set device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Define Hyperparameters
HPARAMS = {
    "img_size": 256,         # Size of the generated images (e.g., 256x256 pixels)
    "inference_steps": 100,  # Number of steps for the ODE solver during sampling (higher = better quality, slower)
    "batch_size": 32,        # Number of images processed per training step
    "lr": 1e-4,              # Learning rate for the Adam optimizer
    "epochs": 50,            # Number of full passes through the training dataset
    "channels": 1,           # Number of image channels (1 for grayscale, 3 for RGB)
    "num_classes": 1         # Number of distinct classes the model should generate (here, only 'receipts')
}
print("Hyperparameters:", HPARAMS)


## Data Extraction

Before we can train our model, we need to extract the receipt images from the provided ZIP file. This step uses a shell command to decompress the archive into a designated directory.

In [None]:
!unzip /content/receipts_sample-20260102T154712Z-3-001.zip -d receipt_images
print("Zip file extracted to 'receipt_images' directory.")
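
Outside Colab, the `!unzip` shell magic may not be available. The following is a minimal sketch of the same extraction step using Python's standard `zipfile` module; the helper name `extract_archive` is an illustrative assumption, not part of the original notebook.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, dest_dir):
    """Extract every member of zip_path into dest_dir and return the member names.

    Hypothetical helper for illustration; it mirrors `unzip <zip> -d <dir>`.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)  # create the target folder if needed
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()
```

With the archive used above, the call would look like `extract_archive("/content/receipts_sample-20260102T154712Z-3-001.zip", "receipt_images")`.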

## Model Architecture and Flow Matching Logic

This section defines the core components of our generative model: the `ConditionalUNet` architecture, which is a neural network designed for image generation tasks, and the `FlowMatching` class, which implements the training and sampling logic for flow-based generative models. It also includes our custom `ReceiptDataset` to handle loading and transforming the image data.

### SinusoidalPositionEmbeddings
This class creates sinusoidal positional embeddings for time steps. In generative models like Flow Matching, these embeddings help the network understand the 'progress' of the generation process (from noise to data) by providing a unique, continuous representation for each time step.
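
To make the embedding concrete, here is a small pure-Python sketch of the same computation for a single scalar time value. The function name `sinusoidal_embedding` is an assumption made for illustration; the notebook's class does the equivalent on batched tensors.

```python
import math


def sinusoidal_embedding(t, dim):
    """Return a list of `dim` values encoding the scalar time `t`.

    Assumes `dim` is even and at least 4, so that half_dim - 1 > 0.
    """
    half_dim = dim // 2
    # Geometrically spaced frequencies from 1 down to 1/10000
    scale = math.log(10000) / (half_dim - 1)
    freqs = [math.exp(-i * scale) for i in range(half_dim)]
    angles = [t * f for f in freqs]
    # First half holds the sines, second half the cosines
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]
```

At `t = 0` the sine half is all zeros and the cosine half is all ones; any other `t` produces a different vector, which is what lets the network tell time steps apart.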

### Block
The `Block` class represents a fundamental building block of the U-Net architecture. It consists of convolutional layers, batch normalization, and ReLU activations. It also integrates time embeddings, allowing the model to condition its output on the current time step in the flow. The `transform` layer handles downsampling (Conv2d with stride 2) or upsampling (ConvTranspose2d) within the U-Net.
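
The spatial effect of the `transform` layer can be checked with the standard output-size formulas for convolutions (dilation 1, no output padding). This is a sketch of the arithmetic only; the helper names are assumptions for illustration, and the defaults match the kernel size 4, stride 2, padding 1 used in `Block`.

```python
def conv_out_size(size, kernel=4, stride=2, padding=1):
    """Output height/width of a Conv2d layer (dilation 1)."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_transpose_out_size(size, kernel=4, stride=2, padding=1):
    """Output height/width of a ConvTranspose2d layer (dilation 1, output_padding 0)."""
    return (size - 1) * stride - 2 * padding + kernel
```

With these settings a 256-pixel side is halved to 128 by the strided convolution, and the transposed convolution doubles 128 back to 256, while the 3x3 convolutions with padding 1 leave the size unchanged.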

### ConditionalUNet
The `ConditionalUNet` is the main neural network architecture. It is a U-Net variant that is 'conditional' because it takes three inputs: the image, the current time step (`t`), and a class label (though here there is only one class). The class conditioning lets a single model learn to generate specific types of images. The network consists of a series of `Block`s for downsampling (`downs`) and upsampling (`ups`), with skip connections (`residuals`) that preserve detail.
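
The channel and resolution bookkeeping of the U-Net can be traced without running the network. The sketch below is an illustrative pure-Python walk through the layers, assuming each down block halves and each up block doubles the spatial size (true for even input sizes with the strided layers used here); `trace_unet_shapes` is a hypothetical helper, not part of the notebook.

```python
def trace_unet_shapes(img_size=256, img_channels=1,
                      down_channels=(32, 64, 128), up_channels=(128, 64, 32)):
    """Return a list of (layer, channels, spatial_size) tuples for the U-Net."""
    trace = []
    channels, size = down_channels[0], img_size  # conv0 keeps the resolution
    trace.append(("conv0", channels, size))
    skips = []
    for i in range(len(down_channels) - 1):
        channels, size = down_channels[i + 1], size // 2
        skips.append(channels)  # each down output is stored as a skip connection
        trace.append((f"down{i}", channels, size))
    for i in range(len(up_channels) - 1):
        # The skip tensor is concatenated on the channel axis before the up block,
        # which is why the up block's first convolution expects 2 * in_ch channels
        concat_channels = channels + skips.pop()
        assert concat_channels == 2 * up_channels[i]
        channels, size = up_channels[i + 1], size * 2
        trace.append((f"up{i}", channels, size))
    trace.append(("output", img_channels, size))
    return trace
```

For the default configuration the output has the same spatial size and channel count as the input image, which is required because the network predicts one velocity value per pixel.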

### FlowMatching
This class encapsulates the core logic of the Flow Matching generative process. Unlike diffusion models that iteratively add and remove noise, Flow Matching learns a continuous-time vector field that smoothly transports a simple prior distribution (e.g., Gaussian noise, `x_0`) to a complex target distribution (the real data, `x_1`).

-   **`compute_loss`**: Calculates the loss for training. It takes a batch of real data `x_1`, samples noise `x_0` of the same shape, and draws a random time `t` in [0, 1) for each image. It then computes the intermediate point `x_t = (1 - t) * x_0 + t * x_1` on the straight-line path between `x_0` and `x_1`, and asks the model to predict the velocity vector `v_target = x_1 - x_0` that points from `x_0` to `x_1`. The loss is the mean squared error between the model's predicted velocity (`v_pred`) and the target velocity.
-   **`sample`**: Implements an Euler ODE solver for generating new images. Starting from random noise (`x_0` at `t=0`), it iteratively updates the image by following the velocity field predicted by the trained `ConditionalUNet`. This process effectively traces a path from noise to a generated image (`x_1` at `t=1`).
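
The two methods can be illustrated on plain Python lists. This is a toy sketch under stated assumptions: the helper names are hypothetical, and the "oracle" velocity function that returns the exact target velocity stands in for a trained network.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight line from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Constant velocity of the straight-line path: x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def mse(pred, target):
    """Mean squared error, the quantity minimised during training."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def euler_sample(x0, velocity_fn, steps):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i / steps
        v = velocity_fn(x, t)
        x = [xi + vi * dt for xi, vi in zip(x, v)]
    return x


random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(4)]  # "noise"
x1 = [0.0, 0.25, 0.5, 1.0]                       # "data"


def oracle(x, t):
    # Assumption: a perfectly trained model would return this velocity
    return target_velocity(x0, x1)


generated = euler_sample(x0, oracle, steps=100)
```

Because the oracle velocity is constant along the path, Euler integration lands on `x1` up to floating-point error. A trained network only approximates this field, which is one reason more inference steps can help in practice.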

### ReceiptDataset
This custom `torch.utils.data.Dataset` class is responsible for loading our receipt images. It collects the image files from the specified directory and converts each one to grayscale; the `transform` passed to the constructor then resizes the image to the model's input dimensions and converts it to a PyTorch tensor. Since we are only generating receipts, it assigns a single class label (0) to all images.
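
The dataset's file discovery uses lowercase `glob` patterns (`*.jpg`, `*.jpeg`, `*.png`), which on a case-sensitive file system such as Colab's will not match files named like `scan.JPG`. A possible case-insensitive alternative is sketched below; `find_images` and `IMAGE_SUFFIXES` are hypothetical names introduced for illustration.

```python
from pathlib import Path

# Extensions treated as images, compared in lowercase
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def find_images(folder):
    """Return a sorted list of image paths in `folder`, ignoring extension case."""
    return sorted(
        str(p) for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Sorting the result also makes the dataset order reproducible across runs, since directory listing order is not guaranteed.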

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.utils import make_grid
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import random
import os
from PIL import Image
import glob

# Re-define device and HPARAMS so this cell can run on its own
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

HPARAMS = {
    "img_size": 256,
    "inference_steps": 100,
    "batch_size": 32,
    "lr": 1e-4,
    "epochs": 50,
    "channels": 1,
    "num_classes": 1  # Single class: receipts
}

# Model architecture: SinusoidalPositionEmbeddings, Block, ConditionalUNet
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = np.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

class Block(nn.Module):
    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            self.conv1 = nn.Conv2d(2*in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose2d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv2d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm2d(out_ch)
        self.bnorm2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        h = self.bnorm1(self.relu(self.conv1(x)))
        time_emb = self.relu(self.time_mlp(t))
        time_emb = time_emb[(..., ) + (None, ) * 2]
        h = h + time_emb
        h = self.bnorm2(self.relu(self.conv2(h)))
        return self.transform(h)

class ConditionalUNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_channels = HPARAMS["channels"]
        down_channels = (32, 64, 128)
        up_channels = (128, 64, 32)
        out_dim = img_channels
        time_emb_dim = 32
        classes = HPARAMS["num_classes"]

        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.ReLU()
        )
        self.class_emb = nn.Embedding(classes, time_emb_dim)
        self.conv0 = nn.Conv2d(img_channels, down_channels[0], 3, padding=1)
        self.downs = nn.ModuleList([Block(down_channels[i], down_channels[i+1], time_emb_dim) for i in range(len(down_channels)-1)])
        self.ups = nn.ModuleList([Block(up_channels[i], up_channels[i+1], time_emb_dim, up=True) for i in range(len(up_channels)-1)])
        self.output = nn.Conv2d(up_channels[-1], out_dim, 1)

    def forward(self, x, t_float, class_label):
        t = self.time_mlp(t_float)
        c = self.class_emb(class_label)
        t = t + c
        x = self.conv0(x)
        residuals = []
        for down in self.downs:
            x = down(x, t)
            residuals.append(x)
        for up in self.ups:
            residual = residuals.pop()
            x = torch.cat((x, residual), dim=1)
            x = up(x, t)
        return self.output(x)

# Flow Matching training loss and Euler ODE sampler
class FlowMatching:
    def __init__(self):
        pass

    def compute_loss(self, model, x_1, labels):
        b = x_1.shape[0]
        x_0 = torch.randn_like(x_1)
        t = torch.rand(b, device=x_1.device)
        t_view = t.view(b, 1, 1, 1)
        x_t = (1 - t_view) * x_0 + t_view * x_1
        v_target = x_1 - x_0
        v_pred = model(x_t, t, labels)
        return F.mse_loss(v_pred, v_target)

    @torch.no_grad()
    def sample(self, model, n_samples, class_label_idx, size, steps=50):
        model.eval()
        # Start from Gaussian noise with the configured number of channels
        x = torch.randn((n_samples, HPARAMS["channels"], size, size), device=device)
        labels = torch.full((n_samples,), class_label_idx, dtype=torch.long, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t_curr = torch.ones(n_samples).to(device) * (i / steps)
            v_pred = model(x, t_curr, labels)
            x = x + v_pred * dt
        model.train()
        return x

# Dataset that loads the receipt images from disk
class ReceiptDataset(torch.utils.data.Dataset):
    def __init__(self, root_dir, size=(64, 64), transform=None):
        self.root_dir = root_dir
        self.size = size
        self.transform = transform
        self.image_paths = []

        # The unzip step places the images in a 'receipts_sample' subfolder
        image_folder = os.path.join(root_dir, 'receipts_sample')
        for ext in ['jpg', 'jpeg', 'png']:
            self.image_paths.extend(glob.glob(os.path.join(image_folder, f'*.{ext}')))

        if not self.image_paths:
            raise RuntimeError(f"No images found in {image_folder}. Please check the path and file types.")

        print(f"Found {len(self.image_paths)} receipt images in {image_folder}")

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        img_path = self.image_paths[idx]
        img = Image.open(img_path).convert('L') # Convert to grayscale
        label = 0 # Assign a single label for receipts

        if self.transform:
            img = self.transform(img)
        return img, label

# Save the trained weights (state_dict only) to disk
def save_model(model, filename="fintech_flow_model.pth"):
    torch.save(model.state_dict(), filename)
    print(f"✅ Model saved to {filename}")


def train_receipt_model():
    # 1. Prepare Data using ReceiptDataset
    receipt_dataset = ReceiptDataset(
        root_dir='receipt_images',
        size=(HPARAMS["img_size"], HPARAMS["img_size"]),
        transform=transforms.Compose([
            transforms.Resize((HPARAMS["img_size"], HPARAMS["img_size"])),
            transforms.ToTensor()
        ])
    )
    receipt_dataloader = DataLoader(receipt_dataset, batch_size=HPARAMS["batch_size"], shuffle=True)

    # 2. Initialize Model and Flow Matching (ConditionalUNet uses HPARAMS["num_classes"] internally)
    model = ConditionalUNet().to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=HPARAMS["lr"])
    flow = FlowMatching() # Use Flow Matching manager

    print(f"Starting Flow Matching training for {HPARAMS['epochs']} epochs on Receipts...")
    for epoch in range(HPARAMS['epochs']):
        pbar = tqdm(receipt_dataloader)
        epoch_loss = 0
        for step, (images, labels) in enumerate(pbar):
            images = images.to(device)
            labels = labels.to(device)

            # Compute Flow Matching Loss
            loss = flow.compute_loss(model, images, labels)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

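            epoch_loss += loss.item()
            pbar.set_description(f"Epoch {epoch} | Loss: {loss.item():.4f}")

        # Report the mean loss over the epoch
        print(f"Epoch {epoch} | Mean loss: {epoch_loss / len(receipt_dataloader):.4f}")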

    return model, flow

def generate_receipt_grid(model, flow, n_samples_per_row=4, total_rows=3):
    print("\nGenerating Grid of Synthetic Receipts (Euler Step)...")
    steps = HPARAMS["inference_steps"]
    # For a single class (receipts), all samples will have the same label (0)
    generated_receipts = flow.sample(model, n_samples=n_samples_per_row * total_rows,
                                     class_label_idx=0, # Fixed label for receipts
                                     size=HPARAMS["img_size"], steps=steps)

    grid = make_grid(generated_receipts, nrow=n_samples_per_row, padding=2, normalize=True)
    plt.figure(figsize=(10, 8))
    plt.imshow(grid.permute(1, 2, 0).cpu().numpy(), cmap='gray')
    plt.title(f"Generated Receipts (Single Class: {n_samples_per_row*total_rows} samples)")
    plt.axis('off')
    plt.show()

# --- Main execution block ---
if __name__ == "__main__":
    # 1. Train the Flow Matching model specifically for receipts
    trained_receipt_model, receipt_flow_manager = train_receipt_model()

    # 2. Save the trained model
    save_model(trained_receipt_model, "fintech_receipt_flow_model.pth")

    # 3. Visualize the generated receipts
    generate_receipt_grid(trained_receipt_model, receipt_flow_manager, n_samples_per_row=1, total_rows=4)


## Generate New Receipt Image

Now that the model has been trained, we can load it and use it to generate new, synthetic receipt images. This section loads the saved model and utilizes the `generate_single_document` function to create and display a new receipt.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision.utils import make_grid
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
from tqdm import tqdm
import random
import os
from PIL import Image
import glob

# Re-define device and HPARAMS so this cell can run in a fresh session
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

HPARAMS = {
    "img_size": 256,
    "inference_steps": 100,
    "batch_size": 32,
    "lr": 1e-4,
    "epochs": 50,
    "channels": 1,
    "num_classes": 1  # Single class: receipts
}

# Model architecture (must match the definitions used for training)
class SinusoidalPositionEmbeddings(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dim = dim
    def forward(self, time):
        device = time.device
        half_dim = self.dim // 2
        embeddings = np.log(10000) / (half_dim - 1)
        embeddings = torch.exp(torch.arange(half_dim, device=device) * -embeddings)
        embeddings = time[:, None] * embeddings[None, :]
        embeddings = torch.cat((embeddings.sin(), embeddings.cos()), dim=-1)
        return embeddings

class Block(nn.Module):
    def __init__(self, in_ch, out_ch, time_emb_dim, up=False):
        super().__init__()
        self.time_mlp = nn.Linear(time_emb_dim, out_ch)
        if up:
            self.conv1 = nn.Conv2d(2*in_ch, out_ch, 3, padding=1)
            self.transform = nn.ConvTranspose2d(out_ch, out_ch, 4, 2, 1)
        else:
            self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
            self.transform = nn.Conv2d(out_ch, out_ch, 4, 2, 1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.bnorm1 = nn.BatchNorm2d(out_ch)
        self.bnorm2 = nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU()

    def forward(self, x, t):
        h = self.bnorm1(self.relu(self.conv1(x)))
        time_emb = self.relu(self.time_mlp(t))
        time_emb = time_emb[(..., ) + (None, ) * 2]
        h = h + time_emb
        h = self.bnorm2(self.relu(self.conv2(h)))
        return self.transform(h)

class ConditionalUNet(nn.Module):
    def __init__(self):
        super().__init__()
        img_channels = HPARAMS["channels"]
        down_channels = (32, 64, 128)
        up_channels = (128, 64, 32)
        out_dim = img_channels
        time_emb_dim = 32
        classes = HPARAMS["num_classes"]

        self.time_mlp = nn.Sequential(
            SinusoidalPositionEmbeddings(time_emb_dim),
            nn.Linear(time_emb_dim, time_emb_dim),
            nn.ReLU()
        )
        self.class_emb = nn.Embedding(classes, time_emb_dim)
        self.conv0 = nn.Conv2d(img_channels, down_channels[0], 3, padding=1)
        self.downs = nn.ModuleList([Block(down_channels[i], down_channels[i+1], time_emb_dim) for i in range(len(down_channels)-1)])
        self.ups = nn.ModuleList([Block(up_channels[i], up_channels[i+1], time_emb_dim, up=True) for i in range(len(up_channels)-1)])
        self.output = nn.Conv2d(up_channels[-1], out_dim, 1)

    def forward(self, x, t_float, class_label):
        t = self.time_mlp(t_float)
        c = self.class_emb(class_label)
        t = t + c
        x = self.conv0(x)
        residuals = []
        for down in self.downs:
            x = down(x, t)
            residuals.append(x)
        for up in self.ups:
            residual = residuals.pop()
            x = torch.cat((x, residual), dim=1)
            x = up(x, t)
        return self.output(x)

# Flow Matching training loss and Euler ODE sampler
class FlowMatching:
    def __init__(self):
        pass

    def compute_loss(self, model, x_1, labels):
        b = x_1.shape[0]
        x_0 = torch.randn_like(x_1)
        t = torch.rand(b, device=x_1.device)
        t_view = t.view(b, 1, 1, 1)
        x_t = (1 - t_view) * x_0 + t_view * x_1
        v_target = x_1 - x_0
        v_pred = model(x_t, t, labels)
        return F.mse_loss(v_pred, v_target)

    @torch.no_grad()
    def sample(self, model, n_samples, class_label_idx, size, steps=50):
        model.eval()
        # Start from Gaussian noise with the configured number of channels
        x = torch.randn((n_samples, HPARAMS["channels"], size, size), device=device)
        labels = torch.full((n_samples,), class_label_idx, dtype=torch.long, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t_curr = torch.ones(n_samples).to(device) * (i / steps)
            v_pred = model(x, t_curr, labels)
            x = x + v_pred * dt
        model.train()
        return x

# Load saved weights into a freshly constructed ConditionalUNet
def load_model(filename="fintech_receipt_flow_model.pth"):
    if not os.path.exists(filename):
        print(f"❌ Error: {filename} not found.")
        return None

    # HPARAMS["num_classes"] must match the value used in training (1 for receipts),
    # otherwise the class-embedding weights will not fit the checkpoint
    model = ConditionalUNet().to(device)
    model.load_state_dict(torch.load(filename, map_location=device))
    model.eval()
    print(f"✅ Model loaded from {filename}")
    return model

# Generate and display one image of the requested document type
def generate_single_document(model, flow, doc_type='receipt'):
    """Generates a single image of the requested type."""
    # Only 'receipt' is supported here; its class label is 0
    label_map = {'receipt': 0}

    if doc_type not in label_map:
        print(f"❌ Unknown type: {doc_type}. Only 'receipt' is supported.")
        return

    print(f"🎨 Generating new {doc_type} with ODE Solver...")
    sample_tensor = flow.sample(model, n_samples=1, class_label_idx=label_map[doc_type],
                                size=HPARAMS["img_size"], steps=HPARAMS["inference_steps"])

    # Convert tensor to displayable image
    img = sample_tensor[0].cpu().permute(1, 2, 0).numpy()
    img = (img - img.min()) / (img.max() - img.min() + 1e-8)  # Normalize to 0-1; epsilon avoids division by zero

    plt.figure(figsize=(4,4))
    plt.imshow(img, cmap='gray')
    plt.title(f"Generated {doc_type.capitalize()}")
    plt.axis('off')
    plt.show()

# --- Execution for generating a single receipt ---

# Load the previously trained model
loaded_receipt_model = load_model("fintech_receipt_flow_model.pth")

# Initialize FlowMatching class (no state needed, just methods)
receipt_flow_manager = FlowMatching()

# Generate and display a single receipt
if loaded_receipt_model:
    generate_single_document(loaded_receipt_model, receipt_flow_manager, doc_type='receipt')

# Required Task 11

Download the zip file `floorplans_v2-20251223T170650Z-3-001.zip`, which contains a large sample of floorplan images. Your task is to train a Flow Matching model on these images and write code that generates new synthetic floorplans.