# **Building Synthetic Medical Records using GANs**

### *You’ve probably seen AI generate images, text, and even code. You can use the same technology to create realistic synthetic datasets for healthcare, finance, and more.*

### To build synthetic medical records with GANs, we’ll break the work into the following steps:

1) Data Preprocessing

2) GAN Architecture

3) Training Loop

4) Evaluating Model Performance

5) Generating Synthetic Medical Records

## 1) Data Preprocessing: Where most GAN projects fail before they start

**GANs work on numbers, not on text or categorical labels. So our first job is to:**

1) Convert categorical features into one-hot encoded vectors

2) Scale numerical values to the range -1 to 1 (because our Generator ends in a Tanh activation)
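
To make the one-hot step concrete, here is a minimal pure-Python sketch of encoding a label and of decoding a Generator-style output vector back to a label. The `smoking_status` categories and the helper names `one_hot` and `decode` are hypothetical, chosen only for illustration; the notebook itself relies on scikit-learn's `OneHotEncoder`.

```python
# Hypothetical categories for an illustrative "smoking_status" column
CATEGORIES = ["current", "former", "never"]

def one_hot(value, categories):
    """Return a list with 1.0 at the position of `value` and 0.0 elsewhere."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1.0 if c == value else 0.0 for c in categories]

def decode(scores, categories):
    """Map a vector of continuous scores back to the highest-scoring category."""
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return categories[best_index]

encoded = one_hot("former", CATEGORIES)

# A Generator emits continuous values rather than clean 0/1 vectors,
# so decoding picks the position with the largest score (argmax).
decoded = decode([-0.4, 0.7, 0.1], CATEGORIES)

print(encoded, decoded)
```

The argmax decoding is why a network with continuous outputs can still produce categorical fields.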

In [1]:
import pandas as pd

df = pd.read_csv("Users/allullaswarantej/Downloads/Follow-up_Records.csv")

print(df.head())

   patient_id  visit_date  age_years  weight_kg   bmi  systolic_bp_mmHg  \
0  P-2025-001  2024-02-15         52       83.7  28.3               138   
1  P-2025-001  2024-03-15         52       83.4  28.2               147   
2  P-2025-001  2024-04-15         52       83.1  28.1               140   
3  P-2025-001  2024-05-15         52       83.0  28.1               136   
4  P-2025-001  2024-06-15         52       82.6  27.9               133   

   diastolic_bp_mmHg  heart_rate_bpm  body_temp_C  fasting_glucose_mg_dL  ...  \
0                 86              80         36.8                    137  ...   
1                 89              80         37.0                    140  ...   
2                 84              76         36.8                    122  ...   
3                 88              77         36.8                    112  ...   
4                 88              78         36.8                    101  ...   

   diet_quality_score_0_100  sleep_hours  exercise_sessions_pe

In [2]:
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
import numpy as np

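# split columns by dtype: numeric columns get scaled, object (string) columns get one-hot encoded
# note: the object columns here include identifiers such as patient_id and visit_date;
# consider dropping them or handling them separately before encoding
num_cols = df.select_dtypes(include=['int64', 'float64']).columns
cat_cols = df.select_dtypes(include=['object']).columns

# one-hot encode the categorical columns (sparse_output requires scikit-learn >= 1.2)
encoder = OneHotEncoder(sparse_output=False)
cat_encoded = encoder.fit_transform(df[cat_cols])

# scale numeric columns to [-1, 1] to match the Generator's Tanh output
scaler = MinMaxScaler(feature_range=(-1, 1))
num_scaled = scaler.fit_transform(df[num_cols])

# combine processed data: numeric columns first, then one-hot columns
data_processed = np.hstack((num_scaled, cat_encoded))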

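**If you skip preprocessing, your GAN will likely either:**

Fail to learn the patterns in the data, or collapse to a handful of near-identical samples (mode collapse)

Generate nonsensical outputs

Scaling matters because Tanh outputs lie between -1 and 1, so the Generator can only reproduce real values that have been mapped into that same range.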
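
The mapping into the Tanh range is a simple linear rescale per column. The sketch below reproduces it in plain Python on the five `systolic_bp_mmHg` values printed earlier; the helper names `scale_to_tanh_range` and `unscale` are made up for this illustration, and `MinMaxScaler(feature_range=(-1, 1))` applies the same formula column by column.

```python
def scale_to_tanh_range(values):
    """Linearly map values so the minimum becomes -1 and the maximum becomes 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot rescale a constant column with this helper")
    scaled = [2 * (v - lo) / (hi - lo) - 1 for v in values]
    return scaled, (lo, hi)

def unscale(scaled, bounds):
    """Invert the mapping, taking values in [-1, 1] back to the original units."""
    lo, hi = bounds
    return [(s + 1) / 2 * (hi - lo) + lo for s in scaled]

# First five systolic_bp_mmHg rows from the df.head() output
systolic = [138, 147, 140, 136, 133]

scaled, bounds = scale_to_tanh_range(systolic)
restored = unscale(scaled, bounds)

print(scaled)    # every value lies in [-1, 1]
print(restored)  # back in mmHg, up to floating-point error
```

The inverse function is what later turns Generator outputs back into clinical units, which is why the fitted scaler has to be kept around after training.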

## 2) GAN Architecture: Two networks playing a game

A GAN has two networks:

**Generator:** Starts with random noise, learns to produce realistic samples

**Discriminator:** Tries to tell real from fake data

In [3]:
import torch
import torch.nn as nn

In [4]:
data_dim = data_processed.shape[1]  # total features
latent_dim = 64  # size of random noise input

# generator
class Generator(nn.Module):
    def __init__(self):
        super(Generator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(latent_dim, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, data_dim),
            nn.Tanh()  # output in range [-1, 1]
        )
    def forward(self, z):
        return self.model(z)

# discriminator
class Discriminator(nn.Module):
    def __init__(self):
        super(Discriminator, self).__init__()
        self.model = nn.Sequential(
            nn.Linear(data_dim, 256),
            nn.LeakyReLU(0.2),
            nn.Linear(256, 128),
            nn.LeakyReLU(0.2),
            nn.Linear(128, 1),
            nn.Sigmoid()  # probability of real/fake
        )
    def forward(self, x):
        return self.model(x)

**While building GANs, always make sure to:**

Use **LeakyReLU** in hidden layers to avoid “dead” neurons.

Match the Generator’s output range (Tanh) to your data scaling.

Keep the architecture small at first; big models overfit small datasets quickly.
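
One rough way to judge whether an architecture is "small" is to count its trainable parameters and compare that with the number of training rows. The sketch below does this by hand for the layer sizes used above. `data_dim = 20` is an assumed value for illustration (the real one is `data_processed.shape[1]`), and the helper names are hypothetical.

```python
def linear_params(in_features, out_features):
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features

def mlp_params(layer_sizes):
    """Total parameters of a stack of fully connected layers.

    Activations such as LeakyReLU, Tanh, and Sigmoid add no parameters.
    """
    return sum(linear_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:]))

latent_dim = 64
data_dim = 20  # assumed for illustration only

generator_params = mlp_params([latent_dim, 128, 256, data_dim])
discriminator_params = mlp_params([data_dim, 256, 128, 1])

print("Generator parameters:", generator_params)
print("Discriminator parameters:", discriminator_params)
```

If these counts are far larger than the number of rows in the dataset, the networks have enough capacity to memorize records rather than generalize, which is one reason to start with narrow layers.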

## 3) Training Loop: Where the Model Learns

Next, we train both networks in alternating steps:

**Discriminator:** Learns to classify real vs fake correctly

**Generator:** Learns to fool the Discriminator
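
Both objectives use the same binary cross-entropy loss and differ only in the labels they are paired with. The sketch below computes the two losses by hand for a single real and a single fake sample; the probabilities are illustrative values rather than outputs of the trained model, and `bce` is a hypothetical helper (PyTorch's `nn.BCELoss` averages the same quantity over a batch).

```python
import math

def bce(prob, label):
    """Binary cross-entropy for one prediction; label is 1.0 (real) or 0.0 (fake)."""
    eps = 1e-12  # keep log() away from zero
    prob = min(max(prob, eps), 1 - eps)
    return -(label * math.log(prob) + (1 - label) * math.log(1 - prob))

# Illustrative Discriminator outputs: the probability that a sample is real.
# 0.5 for both means the Discriminator is simply guessing.
p_real, p_fake = 0.5, 0.5

# Discriminator target: real samples -> 1, fake samples -> 0
d_loss = (bce(p_real, 1.0) + bce(p_fake, 0.0)) / 2

# Generator target: fake samples scored as real -> 1
g_loss = bce(p_fake, 1.0)

print(round(d_loss, 4), round(g_loss, 4))
```

When the Discriminator is guessing, both losses equal ln 2 (about 0.693), which is close to what a freshly initialized model typically logs. A Discriminator loss far below that level suggests it is separating real from fake with ease.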

In [5]:
from torch.utils.data import DataLoader, TensorDataset

In [6]:
# convert data to PyTorch tensors
real_data = torch.tensor(data_processed, dtype=torch.float32)
dataset = TensorDataset(real_data)
loader = DataLoader(dataset, batch_size=16, shuffle=True)

# initialize models
generator = Generator()
discriminator = Discriminator()

# optimizers
lr = 0.0002
optim_G = torch.optim.Adam(generator.parameters(), lr=lr)
optim_D = torch.optim.Adam(discriminator.parameters(), lr=lr)

# loss
criterion = nn.BCELoss()

epochs = 2000
for epoch in range(epochs):
    for real_batch, in loader:
        batch_size = real_batch.size(0)

        # labels for real and fake data
        real_labels = torch.ones((batch_size, 1))
        fake_labels = torch.zeros((batch_size, 1))

        # train discriminator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)

        real_loss = criterion(discriminator(real_batch), real_labels)
        fake_loss = criterion(discriminator(fake_data.detach()), fake_labels)
        d_loss = (real_loss + fake_loss) / 2

        optim_D.zero_grad()
        d_loss.backward()
        optim_D.step()

        # train generator
        z = torch.randn(batch_size, latent_dim)
        fake_data = generator(z)
        g_loss = criterion(discriminator(fake_data), real_labels)  # want fake to be real

        optim_G.zero_grad()
        g_loss.backward()
        optim_G.step()

    if epoch % 200 == 0:
        print(f"Epoch [{epoch}/{epochs}]  D_loss: {d_loss.item():.4f}  G_loss: {g_loss.item():.4f}")

Epoch [0/2000]  D_loss: 0.6890  G_loss: 0.6736
Epoch [200/2000]  D_loss: 0.1810  G_loss: 1.6598
Epoch [400/2000]  D_loss: 0.7586  G_loss: 1.5357
Epoch [600/2000]  D_loss: 0.2532  G_loss: 2.6024
Epoch [800/2000]  D_loss: 0.2025  G_loss: 3.7295
Epoch [1000/2000]  D_loss: 0.2216  G_loss: 2.2439
Epoch [1200/2000]  D_loss: 0.1681  G_loss: 2.2328
Epoch [1400/2000]  D_loss: 0.0373  G_loss: 2.5638
Epoch [1600/2000]  D_loss: 0.0565  G_loss: 2.1592
Epoch [1800/2000]  D_loss: 0.0421  G_loss: 3.5040


## 5) Generating Synthetic Medical Records

Once trained, we can sample new patient records from random noise:



In [7]:
# generate new synthetic data
z = torch.randn(10, latent_dim)  # 10 synthetic samples
synthetic_data_scaled = generator(z).detach().numpy()

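# inverse transform: undo the [-1, 1] scaling for the numeric columns;
# for the one-hot columns, inverse_transform picks the highest-scoring category in each group
num_synthetic = scaler.inverse_transform(synthetic_data_scaled[:, :len(num_cols)])
cat_synthetic = encoder.inverse_transform(synthetic_data_scaled[:, len(num_cols):])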

# combine into dataframe
synthetic_df = pd.DataFrame(num_synthetic, columns=num_cols)
synthetic_df[cat_cols] = cat_synthetic

print(synthetic_df)

   age_years  weight_kg        bmi  systolic_bp_mmHg  diastolic_bp_mmHg  \
0  52.215721  81.647049  27.470219        132.253036          75.502655   
1  52.000015  83.323463  28.208431        139.023346          87.512344   
2  52.000000  83.676216  28.296289        143.129211          80.897232   
3  52.013641  82.676727  27.854200        136.505112          80.290665   
4  52.997349  80.907654  27.324089        121.709457          76.158119   
5  52.004547  82.693222  27.940123        137.178101          77.277008   
6  52.000919  82.734573  27.915161        137.230469          86.769585   
7  52.025944  82.288742  27.799381        135.899490          78.339447   
8  52.000843  83.002518  28.023554        138.547623          87.643448   
9  52.000000  83.487457  28.240891        141.202377          90.832367   

   heart_rate_bpm  body_temp_C  fasting_glucose_mg_dL  \
0       74.158348    36.705112              85.436127   
1       79.873962    37.009682             137.123260   
2  
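
The outline lists evaluating model performance as a step, but the notebook has no cell for it. As a first sanity check, one option is to compare per-column summary statistics of real and synthetic values. The sketch below does this for `weight_kg`, using only the five real rows and ten synthetic rows printed in this article, so the numbers are illustrative rather than a proper evaluation. The helper names `summarize` and `compare` are hypothetical.

```python
from statistics import mean, stdev

def summarize(values):
    """Basic summary statistics for one column."""
    return {
        "mean": mean(values),
        "std": stdev(values),
        "min": min(values),
        "max": max(values),
    }

def compare(real, synthetic):
    """Return (real, synthetic, difference) for each statistic."""
    r, s = summarize(real), summarize(synthetic)
    return {name: (r[name], s[name], s[name] - r[name]) for name in r}

# weight_kg values copied from the df.head() and synthetic_df outputs
real_weight = [83.7, 83.4, 83.1, 83.0, 82.6]
synthetic_weight = [81.647049, 83.323463, 83.676216, 82.676727, 80.907654,
                    82.693222, 82.734573, 82.288742, 83.002518, 83.487457]

report = compare(real_weight, synthetic_weight)
for name, (r, s, diff) in report.items():
    print(f"{name:>4}: real={r:.2f}  synthetic={s:.2f}  diff={diff:+.2f}")
```

Matching summary statistics is a necessary check, not a sufficient one: it says nothing about relationships between columns, such as weight and BMI moving together.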

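### **Now we have realistic-looking synthetic data for experiments. It can also be used to augment training datasets, which may improve ML model performance. Before treating it as privacy-safe, check that the Generator has not simply memorized real records.**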

### Building GANs for real-world datasets like medical records comes down to:

**Data preprocessing discipline**

**Matching architecture to data**

**Careful training to avoid collapse**

**Post-processing to make data usable**
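
As an illustration of the post-processing point, the generated records above contain fractional values such as an age of 52.215721, while the source data stores `age_years` and `systolic_bp_mmHg` as integers. The sketch below rounds integer-valued fields and clips values to a plausible range. The `postprocess` helper and the bounds are assumptions made for this example; in practice the bounds would come from the real data or from clinical limits.

```python
def postprocess(record, int_fields, bounds):
    """Clip bounded fields to their allowed range, then round integer-valued fields."""
    cleaned = dict(record)
    for field, (lo, hi) in bounds.items():
        cleaned[field] = min(max(cleaned[field], lo), hi)
    for field in int_fields:
        cleaned[field] = int(round(cleaned[field]))
    return cleaned

# First synthetic row from the printed output (subset of columns)
raw = {"age_years": 52.215721, "weight_kg": 81.647049, "systolic_bp_mmHg": 132.253036}

# Assumed ranges, for illustration only
bounds = {
    "age_years": (0, 120),
    "weight_kg": (30.0, 300.0),
    "systolic_bp_mmHg": (60, 260),
}
int_fields = ["age_years", "systolic_bp_mmHg"]

clean = postprocess(raw, int_fields, bounds)
print(clean)
```

On a pandas DataFrame the same idea can be expressed with `round` and `clip` on the relevant columns.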