# Module 1, Task 2: Data Loading and Augmentation Using Keras

**Objective:** Build an efficient input pipeline with `tf.data` and apply image augmentations using Keras preprocessing layers.

In [None]:
# Install necessary libraries
!pip install tensorflow tensorflow-datasets matplotlib

### Setup
Import libraries and load the EuroSAT dataset. The TFDS build of EuroSAT provides only a `train` split, so we carve it into 80% training and 20% validation.

In [None]:
import tensorflow as tf
import tensorflow_datasets as tfds
import matplotlib.pyplot as plt
import numpy as np

# Load the dataset
(ds_train, ds_validation), ds_info = tfds.load(
    'eurosat/rgb',
    split=['train[:80%]', 'train[80%:]'], # Use 80% for training, 20% for validation
    shuffle_files=True,
    as_supervised=True,
    with_info=True,
)

CLASS_NAMES = ds_info.features['label'].names
NUM_CLASSES = ds_info.features['label'].num_classes
IMG_SIZE = 128 # Target side length in pixels (EuroSAT RGB tiles are 64x64, so this upsamples)
BATCH_SIZE = 64

# .numpy() prints the plain integer instead of a tf.Tensor repr
print(f"Number of training samples: {ds_train.cardinality().numpy()}")
print(f"Number of validation samples: {ds_validation.cardinality().numpy()}")
print(f"Number of classes: {NUM_CLASSES}")
print(f"Class names: {CLASS_NAMES}")
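The 80/20 split and the batch size together determine how many batches each epoch yields. The arithmetic can be sketched in plain Python. The total of 27,000 images is an assumption about the TFDS `eurosat/rgb` build, not something printed above, so substitute the cardinality reported by your own run.

```python
import math

TOTAL_IMAGES = 27_000  # assumption: size of the TFDS eurosat/rgb train split
BATCH_SIZE = 64

# 'train[:80%]' and 'train[80%:]' partition the split without overlap
n_train = TOTAL_IMAGES * 80 // 100
n_val = TOTAL_IMAGES - n_train

def batches_per_epoch(num_samples, batch_size):
    """Number of batches when the final partial batch is kept."""
    return math.ceil(num_samples / batch_size)

train_batches = batches_per_epoch(n_train, BATCH_SIZE)
val_batches = batches_per_epoch(n_val, BATCH_SIZE)

print(n_train, n_val)              # 21600 5400
print(train_batches, val_batches)  # 338 85
```

Because neither count is a multiple of 64, the last batch of each epoch is smaller than `BATCH_SIZE` unless `drop_remainder=True` is passed to `batch()`.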

### Building the Data Pipeline

We will create a function to process the data:
1.  **Resize Images:** Standardize image sizes.
2.  **Normalize Pixel Values:** Scale pixel values from `[0, 255]` to `[0, 1]`, which keeps inputs in a small, consistent range and tends to stabilize training.

In [None]:
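def process_image(image, label):
    # tf.image.resize already returns float32, so the cast below is only a safeguard
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0 # Normalize to [0,1]
    return image, label

# Apply the processing to our datasets, letting tf.data parallelize the map
ds_train_processed = ds_train.map(process_image, num_parallel_calls=tf.data.AUTOTUNE)
ds_validation_processed = ds_validation.map(process_image, num_parallel_calls=tf.data.AUTOTUNE)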
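A quick sanity check of the preprocessing step can catch shape or range mistakes before training. The sketch below runs the same `process_image` logic on a synthetic image, so it needs no dataset download. The 64x64 `uint8` input is an assumption that mimics a EuroSAT RGB tile.

```python
import numpy as np
import tensorflow as tf

IMG_SIZE = 128

def process_image(image, label):
    image = tf.image.resize(image, (IMG_SIZE, IMG_SIZE))
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

# Synthetic stand-in for one dataset element (assumed 64x64 uint8 RGB)
rng = np.random.default_rng(0)
fake_image = tf.constant(rng.integers(0, 256, size=(64, 64, 3), dtype=np.uint8))
fake_label = tf.constant(3, dtype=tf.int64)

image, label = process_image(fake_image, fake_label)

out_shape = tuple(image.shape)
out_min = float(tf.reduce_min(image))
out_max = float(tf.reduce_max(image))
print(out_shape, image.dtype.name, out_min, out_max)
```

The result should be a `float32` tensor of shape `(128, 128, 3)` with values inside `[0, 1]`, and the label should pass through unchanged.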

### Data Augmentation

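Data augmentation creates modified versions of images in the training set to help the model generalize better and reduce overfitting. Keras preprocessing layers are convenient for this: the same layers can run inside a `tf.data` pipeline, as in this notebook, or sit at the front of the model so the augmentation is saved with it. The random layers are only active in training mode and pass images through unchanged at inference.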

In [None]:
data_augmentation = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal_and_vertical"),
    tf.keras.layers.RandomRotation(0.2),
    tf.keras.layers.RandomZoom(0.2),
    tf.keras.layers.RandomContrast(0.2),
], name="data_augmentation")

### Visualizing the Augmented Data

Let's see the effect of our augmentation pipeline on a single image.

In [None]:
# Take one image from the processed training set (the dataset is not batched yet)
for image, label in ds_train_processed.take(1):
    image_to_show = image
    label_to_show = CLASS_NAMES[int(label)]

plt.figure(figsize=(12, 12))
plt.subplot(3, 3, 1)
plt.imshow(image_to_show)
plt.title(f"Original: {label_to_show}")
plt.axis('off')

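# Apply augmentation 8 times to the same image
for i in range(8):
    # training=True makes sure the random layers are active outside model.fit()
    augmented_image = data_augmentation(tf.expand_dims(image_to_show, 0), training=True)
    plt.subplot(3, 3, i + 2)
    # Contrast changes can push values slightly outside [0, 1]; clip for display
    plt.imshow(tf.clip_by_value(augmented_image[0], 0.0, 1.0))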
    plt.title("Augmented")
    plt.axis('off')

plt.suptitle("Data Augmentation Examples")
plt.show()

### Finalizing the Pipeline for Performance

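To create a performant input pipeline, we apply the following steps in order:
1.  **Caching:** Cache the preprocessed dataset in memory so resizing and normalization run once instead of every epoch.
2.  **Shuffling:** Shuffle the samples (training set only).
3.  **Batching:** Group samples into batches.
4.  **Augmentation:** Apply the augmentation layers to each training batch. Because this happens after the cache, every epoch sees fresh random transformations.
5.  **Prefetching:** Overlap data preprocessing with model execution so the accelerator is not left waiting for input.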

In [None]:
AUTOTUNE = tf.data.AUTOTUNE

def configure_dataset(ds, shuffle=False):
    # Cache the deterministic preprocessing; random steps must come after this
    ds = ds.cache()
    if shuffle:
        # A 1000-element buffer gives only an approximate shuffle of the full set
        ds = ds.shuffle(buffer_size=1000)
    ds = ds.batch(BATCH_SIZE)
    # Apply augmentation per batch (vectorized); tf.data runs this map on the host CPU
    if shuffle: # The shuffle flag doubles as the "training set" flag
        ds = ds.map(lambda x, y: (data_augmentation(x, training=True), y), num_parallel_calls=AUTOTUNE)
    # Prefetch to overlap data production with consumption
    ds = ds.prefetch(buffer_size=AUTOTUNE)
    return ds

train_ds_final = configure_dataset(ds_train_processed, shuffle=True)
validation_ds_final = configure_dataset(ds_validation_processed)

print("Finalized tf.data pipelines for training and validation.")
print(train_ds_final)
print(validation_ds_final)
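The `buffer_size=1000` passed to `shuffle()` is smaller than the training set, so the shuffle is only approximate. The pure-Python sketch below models the documented buffer behavior (fill a buffer, emit a random element, refill from the input). It is a simplified illustration, not TensorFlow's implementation, and `buffered_shuffle` is a hypothetical helper name.

```python
import random

def buffered_shuffle(items, buffer_size, seed=0):
    """Yield items in the order a bounded shuffle buffer would emit them."""
    rng = random.Random(seed)
    buffer = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # still filling the buffer
            continue
        idx = rng.randrange(buffer_size)
        yield buffer[idx]        # emit a random buffered element
        buffer[idx] = item       # replace it with the next input element
    rng.shuffle(buffer)          # input exhausted: drain what is left
    yield from buffer

small = list(buffered_shuffle(range(100), buffer_size=10))
full = list(buffered_shuffle(range(100), buffer_size=100))

# With a small buffer, the element at output position k always comes from
# the first k + buffer_size inputs, so early outputs stay near the front.
print(max(small[:10]), max(full[:10]))
```

With a buffer of 10, the first ten outputs can only contain values below 20, while a buffer covering the whole input can place any value anywhere. Setting `buffer_size` to the dataset cardinality gives a uniform shuffle at the cost of more memory.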

### Conclusion

We have built a complete Keras data pipeline: it loads the raw data, resizes and normalizes it, caches the result, and then shuffles, batches, augments, and prefetches the training set. `train_ds_final` and `validation_ds_final` yield `(image, label)` batches with integer labels and can be passed directly to a `model.fit()` call.
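As a rough illustration of that last step, the sketch below trains a tiny model on a batched `tf.data.Dataset` for one epoch. The model architecture is a placeholder and the random images are a synthetic stand-in for the notebook's datasets, both assumptions made only so the example runs without a download. The integer labels are why the loss is `sparse_categorical_crossentropy`.

```python
import numpy as np
import tensorflow as tf

NUM_CLASSES = 10  # EuroSAT has 10 land-use classes
IMG_SIZE = 128
BATCH_SIZE = 16

# Synthetic stand-in for the finalized training dataset (assumption)
rng = np.random.default_rng(0)
images = rng.random((32, IMG_SIZE, IMG_SIZE, 3), dtype=np.float32)
labels = rng.integers(0, NUM_CLASSES, size=32)
demo_ds = (
    tf.data.Dataset.from_tensor_slices((images, labels))
    .batch(BATCH_SIZE)
    .prefetch(tf.data.AUTOTUNE)
)

# Placeholder model, only meant to show the fit() call
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
    tf.keras.layers.Conv2D(16, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # labels are integers, not one-hot
    metrics=["accuracy"],
)

history = model.fit(demo_ds, epochs=1, verbose=0)
print(sorted(history.history.keys()))
```

In the notebook itself, the call would presumably look like `model.fit(train_ds_final, validation_data=validation_ds_final, epochs=...)`, with the epoch count chosen for the task.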