### GANDataBalancer Class

The **GANDataBalancer** class utilizes Generative Adversarial Networks (GANs) to generate synthetic data, addressing class imbalance issues in datasets. It identifies minority and majority classes, trains a GAN, and generates synthetic samples to augment the minority class, achieving a balanced dataset for improved model performance.

---

#### **Functions**

##### 1. __init__
- **Description**:  
  Initializes the GANDataBalancer class with configurable parameters for sampling strategy, random seed, and latent dimension for the GAN.
- **Parameters**:  
  - sampling_strategy (float): Proportion of the majority class size to determine the number of samples to generate. Default is 0.05.  
  - random_state (int): Random seed for reproducibility. Default is 42.  
  - latent_dim (int): Size of the latent dimension for the generator's input. Default is 100.  
- **Usage**:  
  Prepares the necessary attributes for dataset balancing and GAN training.

---

##### 2. build_generator
- **Description**:  
  Constructs the generator model. It takes random noise from the latent space as input and generates synthetic samples that resemble the minority class.
- **Parameters**:  
  - output_dim (int): Number of features in the generated data, matching the dataset feature count.  
- **Returns**:  
  - A Keras Sequential model for the generator.

---

##### 3. build_discriminator
- **Description**:  
  Constructs the discriminator model, which distinguishes between real (from the dataset) and fake (from the generator) samples.
- **Parameters**:  
  - input_dim (int): Number of features in the input data, matching the dataset feature count.  
- **Returns**:  
  - A compiled Keras Sequential model for the discriminator, trained to classify real vs. fake samples.

---

##### 4. build_gan
- **Description**:  
  Combines the generator and discriminator to construct the GAN model. The discriminator is kept non-trainable during GAN training to ensure only the generator updates.
- **Parameters**:  
  None.  
- **Returns**:  
  - A compiled Keras Model for the GAN.

---

##### 5. fit
- **Description**:  
  Trains the GAN by alternately updating the discriminator and generator. The discriminator learns to differentiate real and fake samples, while the generator learns to produce more realistic data.
- **Parameters**:  
  - X (DataFrame): Input feature dataset.  
  - y (Series): Target labels for the dataset.  
  - epochs (int): Number of training epochs for the GAN. Default is 1000.  
  - batch_size (int): Size of each training batch. Default is 64.  
- **Usage**:  
  Monitors and logs discriminator and generator losses during training.

---

##### 6. resample
- **Description**:  
  Generates synthetic samples using the trained generator and combines them with the original dataset to balance the minority class.
- **Parameters**:  
  - X (DataFrame): Original input feature dataset.  
  - y (Series): Target labels for the dataset.  
- **Returns**:  
  - balanced_X (DataFrame): Feature dataset with added synthetic samples.  
  - balanced_y (Series): Updated target labels for the balanced dataset.

---

##### 7. fit_resample
- **Description**:  
  Combines the training (fit) and resampling (resample) processes into a single function. It trains the GAN and generates a balanced dataset in one call.
- **Parameters**:  
  - X (DataFrame): Input feature dataset.  
  - y (Series): Target labels for the dataset.  
  - epochs (int): Number of training epochs for the GAN. Default is 1000.  
  - batch_size (int): Size of each training batch. Default is 64.  
- **Returns**:  
  - balanced_X (DataFrame): Feature dataset with added synthetic samples.  
  - balanced_y (Series): Updated target labels for the balanced dataset.


In [14]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from keras.models import Sequential, Model
from keras.layers import Dense, Input

class GANDataBalancer:
    def __init__(self, sampling_strategy=0.05, random_state=42, latent_dim=100):
        self.sampling_strategy = sampling_strategy
        self.random_state = random_state
        self.latent_dim = latent_dim

    def build_generator(self, output_dim):
        model = Sequential([
            Dense(16, activation='relu', input_dim=self.latent_dim),
            Dense(32, activation='relu'),
            Dense(output_dim, activation='linear')
        ])
        return model

    def build_discriminator(self, input_dim):
        model = Sequential([
            Dense(32, activation='relu', input_dim=input_dim),
            Dense(16, activation='relu'),
            Dense(1, activation='sigmoid')
        ])
        model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
        return model

    def build_gan(self):
        self.discriminator.trainable = False
        gan_input = Input(shape=(self.latent_dim,))
        x = self.generator(gan_input)
        gan_output = self.discriminator(x)
        gan = Model(gan_input, gan_output)
        gan.compile(optimizer='adam', loss='binary_crossentropy')
        return gan

    def fit(self, X, y, epochs=1000, batch_size=64):
        # Identify minority and majority classes
        X_minority = X[y == 1]
        X_majority = X[y == 0]
        self.resample_number = int(self.sampling_strategy * len(X_majority) - len(X_minority))
        self.X_columns = X.columns
        minority_class_samples = X_minority.values

        self.generator = self.build_generator(X.shape[1])
        self.discriminator = self.build_discriminator(X.shape[1])
        self.gan = self.build_gan()

        for epoch in range(epochs):
            # Train Discriminator
            noise = np.random.normal(0, 1, (batch_size, self.latent_dim))
            fake_data = self.generator.predict(noise)
            real_data = minority_class_samples[
                np.random.randint(0, minority_class_samples.shape[0], size=batch_size)
            ]
            X_combined = np.vstack((real_data, fake_data))
            y_combined = np.hstack((np.ones(batch_size), np.zeros(batch_size)))
            self.discriminator.trainable = True
            d_loss = self.discriminator.train_on_batch(X_combined, y_combined)

            # Train Generator
            noise = np.random.normal(0, 1, (batch_size, self.latent_dim))
            y_gen = np.ones(batch_size)
            self.discriminator.trainable = False
            g_loss = self.gan.train_on_batch(noise, y_gen)

            if epoch % 100 == 0:
                print(f"Epoch {epoch}/{epochs} | Discriminator Loss: {d_loss} | Generator Loss: {g_loss}")

    def resample(self, X, y):
        synthetic_data = self.generator.predict(
            np.random.normal(0, 1, (self.resample_number, self.latent_dim))
        )
        synthetic_df = pd.DataFrame(synthetic_data, columns=X.columns)
        balanced_X = pd.concat([X, synthetic_df])
        balanced_y = pd.concat([y, pd.Series([1] * synthetic_df.shape[0])])
        return balanced_X, balanced_y

    def fit_resample(self, X, y, epochs=1000, batch_size=64):
        self.fit(X, y, epochs=epochs, batch_size=batch_size)
        return self.resample(X, y)

In [15]:
# Load dataset
data = pd.read_csv('creditcard.csv')
X = data.drop(columns=['Class'])
y = data['Class']

# Initialize GANDataBalancer
gan_balancer = GANDataBalancer(latent_dim=100)

# Train the GAN and resample the dataset
balanced_X, balanced_y = gan_balancer.fit_resample(X, y, epochs=1000, batch_size=64)

# Save the balanced dataset
balanced_dataset = pd.concat([balanced_X, balanced_y.rename('Class')], axis=1)
balanced_dataset.to_csv('balanced_dataset.csv', index=False)
print("Balanced dataset saved to 'balanced_dataset.csv'")

[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 924us/step


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


Epoch 0/1000 | Discriminator Loss: [array(0.5308013, dtype=float32), array(0.5, dtype=float32)] | Generator Loss: [array(0.5308013, dtype=float32), array(0.5308013, dtype=float32), array(0.5, dtype=float32)]
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 941us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 921us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 876us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 886us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 914us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 940us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 786us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 868us/step
[1m2/2[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 847u