# MonReader

### Background:

Our company develops innovative Artificial Intelligence and Computer Vision solutions that revolutionize industries. Machines that can see: We pack our solutions in small yet intelligent devices that can be easily integrated into your existing data flow. Computer vision for everyone: Our devices can recognize faces, estimate age and gender, classify clothing types and colors, identify everyday objects and detect motion. Technical consultancy: We help you identify use cases of artificial intelligence and computer vision in your industry. Artificial intelligence is the technology of today, not the future.

MonReader is a new mobile document digitization experience for the blind, for researchers, and for anyone else who needs fully automatic, fast, high-quality document scanning in bulk. It is a mobile app: all the user needs to do is flip pages, and MonReader handles the rest. It detects page flips from the low-resolution camera preview and takes a high-resolution picture of the document. It then recognizes the document's corners and crops it accordingly, dewarps the cropped document to obtain a bird's-eye view, and sharpens the contrast between the text and the background. Finally, it recognizes the text with formatting kept intact, and the result is further corrected by MonReader's ML-powered redactor.

### Goal(s):

Predict if the page is being flipped using a single image.

### Success Metrics:

Evaluate model performance based on F1 score, the higher the better.


In [3]:
# Import libraries
import os
import numpy as np
import matplotlib.pyplot as plt
import cv2
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, f1_score, accuracy_score
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader
from torchvision import models, transforms
import torch.nn.functional as F

In [4]:
# Data loading and preprocessing function
def load_data(data_dir="images", img_size=(224, 224)):
    """Load all images from the training folder as RGB float arrays in [0, 1]"""
    images = []
    labels = []
    
    for class_name in ['notflip', 'flip']:
        class_idx = 0 if class_name == 'notflip' else 1
        class_dir = os.path.join(data_dir, "training", class_name)
        
        for filename in os.listdir(class_dir):
            if filename.lower().endswith(('.jpg', '.jpeg', '.png')):
                img_path = os.path.join(class_dir, filename)
                img = cv2.imread(img_path)
                if img is None:
                    # cv2.imread returns None for unreadable or corrupt files
                    continue
                img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
                img = cv2.resize(img, img_size)
                img = img.astype(np.float32) / 255.0
                
                images.append(img)
                labels.append(class_idx)
    
    return np.array(images), np.array(labels)

Next, let's build our own CNN model from scratch:

In [5]:
def create_simple_cnn(num_classes=2):
    """Create a simple CNN using PyTorch layers"""
    model = nn.Sequential(
        # Convolutional layers
        nn.Conv2d(3, 32, kernel_size=3, padding=1),
        nn.BatchNorm2d(32),
        nn.ReLU(),
        nn.MaxPool2d(2, 2),
        
        nn.Conv2d(32, 64, kernel_size=3, padding=1),
        nn.BatchNorm2d(64),
        nn.ReLU(),
        nn.MaxPool2d(2, 2),
        
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
        nn.BatchNorm2d(128),
        nn.ReLU(),
        nn.MaxPool2d(2, 2),
        
        nn.Conv2d(128, 256, kernel_size=3, padding=1),
        nn.BatchNorm2d(256),
        nn.ReLU(),
        nn.MaxPool2d(2, 2),
        
        # Flatten
        nn.Flatten(),
        
        # Fully connected layers
        # (four 2x2 poolings reduce a 224x224 input to 14x14)
        nn.Linear(256 * 14 * 14, 512),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(256, num_classes)
    )

    return model
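
The `256 * 14 * 14` input size of the first `nn.Linear` layer is tied to the default 224x224 image size. As a minimal sketch (the helper name `flattened_features` is hypothetical, not part of the notebook), the value can be recomputed for other input sizes, assuming square inputs and the four conv/pool stages defined above:

```python
def flattened_features(input_size=224, channels=256, num_pools=4):
    """Return the number of features entering the first Linear layer."""
    size = input_size
    for _ in range(num_pools):
        # A 3x3 convolution with padding=1 keeps the spatial size;
        # MaxPool2d(2, 2) halves it, rounding down.
        size //= 2
    return channels * size * size


print(flattened_features(224))  # 256 * 14 * 14 = 50176
```

If `img_size` in `load_data` is changed, the first `nn.Linear` layer would need the matching value, otherwise the forward pass fails with a shape mismatch.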

Next, we'll use transfer learning with a pretrained ResNet18:

In [6]:
# Transfer learning model
def create_transfer_model(num_classes=2):
    """Create a transfer learning model using ResNet18"""
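    # `pretrained=True` is deprecated since torchvision 0.13; the weights enum
    # below loads the same ImageNet weights
    model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)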
    
    # Freeze the base layers
    for param in model.parameters():
        param.requires_grad = False
    
    # Replace the final layer
    num_features = model.fc.in_features
    model.fc = nn.Sequential(
        nn.Linear(num_features, 512),
        nn.ReLU(),
        nn.Dropout(0.5),
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Dropout(0.3),
        nn.Linear(256, num_classes)
    )
    
    return model
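
torchvision's ImageNet-pretrained models were trained on inputs normalized with fixed per-channel mean and standard deviation, while this notebook feeds images scaled only to [0, 1]. The sketch below shows what that normalization could look like on the `(N, H, W, 3)` arrays returned by `load_data`. The helper name `normalize_imagenet` is an assumption for illustration, and whether this step improves the scores reported here has not been tested.

```python
import numpy as np

# Channel statistics used when training torchvision's ImageNet models
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def normalize_imagenet(images):
    """Normalize an (N, H, W, 3) float array in [0, 1], channel by channel."""
    # Broadcasting applies the three statistics along the last (channel) axis
    return (images - IMAGENET_MEAN) / IMAGENET_STD


# Dummy batch standing in for the output of load_data()
batch = np.full((2, 224, 224, 3), 0.5, dtype=np.float32)
normalized = normalize_imagenet(batch)
print(normalized.shape)
```

The normalization would have to be applied before the `permute(0, 3, 1, 2)` call, while the channel axis is still last.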

In [7]:
# Model training
def train_model(model, train_loader, val_loader, epochs=10):
    """Training Function"""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    
    for epoch in range(epochs):
        # Train
        model.train()
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()

        # Quick validation
        model.eval()
        all_preds = []
        all_labels = []
        with torch.no_grad():
            for images, labels in val_loader:
                outputs = model(images)
                predicted = outputs.argmax(dim=1)
                all_preds.extend(predicted.cpu().numpy())
                all_labels.extend(labels.cpu().numpy())
       
        # Calculate metrics
        accuracy = accuracy_score(all_labels, all_preds)
        f1 = f1_score(all_labels, all_preds, average='weighted')
        print(f'Epoch {epoch+1}: Accuracy: {100 * accuracy:.1f}%, F1: {f1:.3f}')
    
    return model
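
`train_model` returns the weights from the last epoch, yet the validation scores it prints can swing widely between epochs (the simple CNN run later in this notebook goes from an F1 of 0.973 down to 0.318). One common remedy is to keep the weights from the best validation epoch. The following is a minimal, framework-agnostic sketch of that idea; the class `BestCheckpoint` is hypothetical and a plain dict stands in for `model.state_dict()`.

```python
import copy


class BestCheckpoint:
    """Remember the state that achieved the highest validation score."""

    def __init__(self):
        self.best_score = float("-inf")
        self.best_epoch = None
        self.best_state = None

    def update(self, epoch, score, state):
        """Store a deep copy of `state` if `score` beats the best so far."""
        if score > self.best_score:
            self.best_score = score
            self.best_epoch = epoch
            self.best_state = copy.deepcopy(state)
            return True
        return False


# Validation F1 per epoch, copied from the simple CNN log in this notebook
val_f1 = [0.608, 0.741, 0.910, 0.452, 0.869, 0.680, 0.935, 0.704,
          0.973, 0.964, 0.969, 0.656, 0.318, 0.549, 0.952]

tracker = BestCheckpoint()
for epoch, f1 in enumerate(val_f1, start=1):
    # A plain dict stands in for model.state_dict() in this illustration
    tracker.update(epoch, f1, {"epoch": epoch})

print(tracker.best_epoch, tracker.best_score)  # 9 0.973
```

Inside the training loop this would amount to calling `tracker.update(epoch + 1, f1, model.state_dict())` after validation and `model.load_state_dict(tracker.best_state)` once training ends.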

In [8]:
# Model evaluation
def evaluate_model(model, test_loader):
    """Evaluation Function"""
    model.eval()
    all_preds = []
    all_labels = []
    
    with torch.no_grad():
        for images, labels in test_loader:
            outputs = model(images)
            predicted = outputs.argmax(dim=1)
            all_preds.extend(predicted.cpu().numpy())
            all_labels.extend(labels.cpu().numpy())
    
    # Calculate metrics
    accuracy = accuracy_score(all_labels, all_preds)
    f1 = f1_score(all_labels, all_preds, average='weighted')
    
    print(f"Accuracy: {100 * accuracy:.2f}%")
    print(f"F1 Score: {f1:.3f}")
    print("\nClassification Report:")
    print(classification_report(all_labels, all_preds, target_names=['Not Flip', 'Flip']))
    print("\nConfusion Matrix:")
    print(confusion_matrix(all_labels, all_preds))
    
    return accuracy, f1

In [9]:
# Load and prepare data
print("Loading data...")
X_train_full, y_train_full = load_data()
print(f"Loaded {len(X_train_full)} images")

# Split data
X_train, X_val, y_train, y_val = train_test_split(
    X_train_full, y_train_full, test_size=0.2, random_state=42, stratify=y_train_full
)
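# NOTE: no separate test set is loaded here. The "test" set is the full loaded
# dataset, so it overlaps the training and validation splits and the final
# scores below are optimistic.
X_test, y_test = X_train_full, y_train_full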

Loading data...
Loaded 2392 images
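
Because the test set above is the same data the model trains and validates on, the final scores cannot show how well the model generalizes. A minimal sketch of one alternative, a stratified three-way split, is shown below. It splits index arrays rather than the images themselves, and the stand-in labels simply reproduce the class counts reported in this notebook; the 60/20/20 proportions are an assumption, not the author's choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in labels with the class counts reported in this notebook
# (1230 "notflip" and 1162 "flip"); real code would use y_train_full.
labels = np.array([0] * 1230 + [1] * 1162)
indices = np.arange(len(labels))

# First carve out 20% as a test set that is never used for training
trainval_idx, test_idx = train_test_split(
    indices, test_size=0.2, random_state=42, stratify=labels
)
# Then split the remainder into training and validation sets
train_idx, val_idx = train_test_split(
    trainval_idx, test_size=0.25, random_state=42, stratify=labels[trainval_idx]
)

print(len(train_idx), len(val_idx), len(test_idx))
```

The resulting index arrays can then select rows, for example `X_train_full[train_idx]` and `y_train_full[train_idx]`.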


In [10]:
# Convert to tensors
X_train_tensor = torch.FloatTensor(X_train).permute(0, 3, 1, 2)
y_train_tensor = torch.LongTensor(y_train)
X_val_tensor = torch.FloatTensor(X_val).permute(0, 3, 1, 2)
y_val_tensor = torch.LongTensor(y_val)
X_test_tensor = torch.FloatTensor(X_test).permute(0, 3, 1, 2)
y_test_tensor = torch.LongTensor(y_test)

# Create data loaders
train_loader = DataLoader(TensorDataset(X_train_tensor, y_train_tensor), batch_size=32, shuffle=True)
val_loader = DataLoader(TensorDataset(X_val_tensor, y_val_tensor), batch_size=32, shuffle=False)
test_loader = DataLoader(TensorDataset(X_test_tensor, y_test_tensor), batch_size=32, shuffle=False)

In [11]:
# Train and evaluate models

print("TRAINING SIMPLE CNN")

simple_cnn = train_model(create_simple_cnn(), train_loader, val_loader, epochs=15)
simple_acc, simple_f1 = evaluate_model(simple_cnn, test_loader)


print("TRAINING TRANSFER LEARNING MODEL")

transfer_model = train_model(create_transfer_model(), train_loader, val_loader, epochs=10)
transfer_acc, transfer_f1 = evaluate_model(transfer_model, test_loader)

TRAINING SIMPLE CNN
Epoch 1: Accuracy: 64.1%, F1: 0.608
Epoch 2: Accuracy: 75.8%, F1: 0.741
Epoch 3: Accuracy: 91.0%, F1: 0.910
Epoch 4: Accuracy: 55.3%, F1: 0.452
Epoch 5: Accuracy: 87.1%, F1: 0.869
Epoch 6: Accuracy: 71.0%, F1: 0.680
Epoch 7: Accuracy: 93.5%, F1: 0.935
Epoch 8: Accuracy: 72.9%, F1: 0.704
Epoch 9: Accuracy: 97.3%, F1: 0.973
Epoch 10: Accuracy: 96.5%, F1: 0.964
Epoch 11: Accuracy: 96.9%, F1: 0.969
Epoch 12: Accuracy: 68.7%, F1: 0.656
Epoch 13: Accuracy: 48.6%, F1: 0.318
Epoch 14: Accuracy: 61.2%, F1: 0.549
Epoch 15: Accuracy: 95.2%, F1: 0.952
Accuracy: 94.57%
F1 Score: 0.946

Classification Report:
              precision    recall  f1-score   support

    Not Flip       1.00      0.90      0.94      1230
        Flip       0.90      1.00      0.95      1162

    accuracy                           0.95      2392
   macro avg       0.95      0.95      0.95      2392
weighted avg       0.95      0.95      0.95      2392


Confusion Matrix:
[[1105  125]
 [   5 1157]]
TRAINING TRANSFER LEARNING MODEL
Epoch 1: Accuracy: 89.4%, F1: 0.894
Epoch 2: Accuracy: 74.3%, F1: 0.723
Epoch 3: Accuracy: 95.8%, F1: 0.958
Epoch 4: Accuracy: 93.3%, F1: 0.933
Epoch 5: Accuracy: 93.9%, F1: 0.939
Epoch 6: Accuracy: 95.8%, F1: 0.958
Epoch 7: Accuracy: 94.2%, F1: 0.942
Epoch 8: Accuracy: 95.6%, F1: 0.956
Epoch 9: Accuracy: 97.1%, F1: 0.971
Epoch 10: Accuracy: 95.0%, F1: 0.950
Accuracy: 95.61%
F1 Score: 0.956

Classification Report:
              precision    recall  f1-score   support

    Not Flip       0.92      1.00      0.96      1230
        Flip       1.00      0.91      0.95      1162

    accuracy                           0.96      2392
   macro avg       0.96      0.95      0.96      2392
weighted avg       0.96      0.96      0.96      2392


Confusion Matrix:
[[1229    1]
 [ 104 1058]]
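
The F1 scores in both classification reports can be re-derived from the confusion matrices, which makes the metric easier to interpret. The sketch below uses only the numbers printed above; the helper names `f1_per_class` and `weighted_f1` are hypothetical. It assumes scikit-learn's convention that rows are true labels and columns are predictions.

```python
def f1_per_class(cm):
    """Return per-class F1 for a square confusion matrix (rows = true labels)."""
    scores = []
    for k in range(len(cm)):
        tp = cm[k][k]
        fp = sum(row[k] for row in cm) - tp  # predicted k, true label differs
        fn = sum(cm[k]) - tp                 # true k, predicted label differs
        scores.append(2 * tp / (2 * tp + fp + fn))
    return scores


def weighted_f1(cm):
    """Average the per-class F1 scores, weighted by class support."""
    supports = [sum(row) for row in cm]
    scores = f1_per_class(cm)
    return sum(s * n for s, n in zip(scores, supports)) / sum(supports)


simple_cnn_cm = [[1105, 125], [5, 1157]]
transfer_cm = [[1229, 1], [104, 1058]]

print([round(s, 2) for s in f1_per_class(simple_cnn_cm)])  # [0.94, 0.95]
print(round(weighted_f1(simple_cnn_cm), 3))                # 0.946
print([round(s, 2) for s in f1_per_class(transfer_cm)])    # [0.96, 0.95]
print(round(weighted_f1(transfer_cm), 3))                  # 0.956
```

The matrices also show that the two models err in opposite directions: the simple CNN mislabels 125 "Not Flip" images as "Flip", while the transfer model misses 104 "Flip" images.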


In [13]:
# Compare results
print("MODEL COMPARISON")

# evaluate_model returns an (accuracy, f1) tuple; unpack it if needed
if isinstance(simple_acc, tuple):
    simple_acc, simple_f1 = simple_acc
if isinstance(transfer_acc, tuple):
    transfer_acc, transfer_f1 = transfer_acc

# Accuracy is stored as a fraction, so scale it to a percentage for display
print(f"Simple CNN - Accuracy: {100 * simple_acc:.2f}%, F1: {simple_f1:.3f}")
print(f"Transfer Learning - Accuracy: {100 * transfer_acc:.2f}%, F1: {transfer_f1:.3f}")

MODEL COMPARISON
Simple CNN - Accuracy: 94.57%, F1: 0.946
Transfer Learning - Accuracy: 95.61%, F1: 0.956
