# LLM Text Preprocessing Foundations


## Introduction

Before a large language model (LLM) can be trained on raw text, the text has to be converted into token IDs and arranged into fixed-length training samples.

In this lab, a short text file (`the-verdict.txt`) is tokenized with the GPT-2 byte-pair encoding provided by `tiktoken` and then split into overlapping input/target sequences using a sliding window. The lab closes by comparing how different `max_length` and `stride` settings change the number of training samples.

## 1. Setup

## Environment & Dependencies

This notebook was executed locally on macOS using Python 3.
The required libraries are installed in the active environment:

- torch
- tiktoken
- notebook / jupyter

In [1]:
%pip install torch tiktoken notebook ipykernel

Note: you may need to restart the kernel to use updated packages.


In [2]:
import torch
import tiktoken

print("Torch version:", torch.__version__)
print("Tiktoken loaded correctly")

Torch version: 2.10.0
Tiktoken loaded correctly


In [3]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    text = f.read()

print("Characters:", len(text))

Characters: 20479


In [4]:
tokenizer = tiktoken.get_encoding("gpt2")
tokens = tokenizer.encode(text)

print("Number of tokens:", len(tokens))

Number of tokens: 5145


In [5]:
max_length = 50
stride = 25

input_ids = []
target_ids = []

for i in range(0, len(tokens) - max_length, stride):
    input_ids.append(tokens[i:i + max_length])
    target_ids.append(tokens[i + 1:i + max_length + 1])

print("Number of samples:", len(input_ids))

Number of samples: 204
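
To make the relationship between inputs and targets easier to see, the same sliding-window loop can be run on a tiny toy sequence. This is only an illustrative sketch: the helper name `make_windows` and the toy values are assumptions, not part of the original notebook. Each target is the input shifted one position to the right, so the model learns to predict the next token at every position.

```python
def make_windows(token_ids, max_length, stride):
    """Hypothetical helper wrapping the sliding-window loop from the cell above."""
    inputs, targets = [], []
    for i in range(0, len(token_ids) - max_length, stride):
        inputs.append(token_ids[i:i + max_length])           # tokens i .. i+max_length-1
        targets.append(token_ids[i + 1:i + max_length + 1])  # same window shifted by one
    return inputs, targets


# Toy "token IDs" 0..9 so the shift is easy to read.
toy_tokens = list(range(10))
inputs, targets = make_windows(toy_tokens, max_length=4, stride=2)

for x, y in zip(inputs, targets):
    print(x, "->", y)
```

With `max_length=4` and `stride=2`, consecutive inputs share two tokens, and every target has the same length as its input because the loop stops before the shifted window would run past the end of the list.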


In [7]:
configs = [
    (32, 16),
    (64, 32),
    (128, 64)
]

for max_length, stride in configs:
    # Number of window start positions produced by the sliding-window loop
    samples = len(range(0, len(tokens) - max_length, stride))
    print(f"max_length={max_length}, stride={stride} → samples={samples}")

max_length=32, stride=16 → samples=320
max_length=64, stride=32 → samples=159
max_length=128, stride=64 → samples=79


## Effect of max_length and stride

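Increasing `max_length` gives the model more context per sample. In the
configurations above, `stride` is always half of `max_length`, so each
doubling of `max_length` also doubles `stride` and roughly halves the
number of samples (320, 159, 79). The sample count is approximately
`(len(tokens) - max_length) / stride`, so it depends mostly on `stride`;
at a fixed `stride`, a larger `max_length` only removes a few windows at
the end of the text.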
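
The sample counts reported above follow directly from the bounds of the sliding-window loop, so they can be computed without iterating. The sketch below assumes the same loop, `range(0, num_tokens - max_length, stride)`, and uses the token count of 5145 from the tokenization cell; the function name `count_windows` is hypothetical. It also varies `max_length` at a fixed `stride` to show each parameter's effect separately.

```python
import math


def count_windows(num_tokens, max_length, stride):
    """Closed-form length of range(0, num_tokens - max_length, stride)."""
    usable = num_tokens - max_length  # exclusive upper bound on start positions
    return max(0, math.ceil(usable / stride))


num_tokens = 5145  # token count reported for the-verdict.txt above

# Configurations used in this notebook (stride is half of max_length).
for max_length, stride in [(50, 25), (32, 16), (64, 32), (128, 64)]:
    print(f"max_length={max_length}, stride={stride}: "
          f"{count_windows(num_tokens, max_length, stride)} samples")

# Same max_length values, but with stride held fixed at 16.
for max_length in (32, 64, 128):
    print(f"max_length={max_length}, stride=16: "
          f"{count_windows(num_tokens, max_length, 16)} samples")
```

The first loop reproduces the counts printed by the notebook cells (204, 320, 159, 79). In the second loop the count changes only slightly as `max_length` grows, since only the last few start positions are lost.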

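Using a smaller `stride` increases the overlap between consecutive
samples: with `stride = max_length / 2`, as in all configurations above,
each window shares half of its tokens with the next one. Overlapping
windows let the model see the same token at different positions in the
window, with different amounts of preceding context. This can act as a
mild form of data augmentation and may help generalization, at the cost
of more samples to process per epoch.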
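
The trade-off between overlap and compute can be made concrete by counting how often each token position appears across all windows. This is a minimal sketch on a toy sequence of 10 tokens, assuming the same loop bounds as the cells above; the helper name `coverage` is hypothetical.

```python
from collections import Counter


def coverage(num_tokens, max_length, stride):
    """Count how many input windows contain each token position."""
    counts = Counter()
    for start in range(0, num_tokens - max_length, stride):
        for pos in range(start, start + max_length):
            counts[pos] += 1
    return counts


num_tokens, max_length = 10, 4
for stride in (4, 2, 1):
    counts = coverage(num_tokens, max_length, stride)
    n_windows = len(range(0, num_tokens - max_length, stride))
    print(f"stride={stride}: windows={n_windows}, "
          f"max appearances per token={max(counts.values())}, "
          f"token slots processed={n_windows * max_length}")
```

With `stride` equal to `max_length` the windows do not overlap and each covered token appears once. Halving the stride lets interior tokens appear in up to `max_length / stride` windows, and the total number of token slots the model has to process per epoch grows accordingly.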