# Keras for Large Datasets 
---
- Use a generator to feed data to the model one batch at a time
- If a dataset is too large to train an LSTM on in one pass, it can be split into smaller chunks
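
As a minimal sketch of the chunking idea, the generator below cuts a long 1-D series into sliding windows and yields them in small batches. The function name `sequence_chunks`, the window length, and the toy series are assumptions made for illustration; they are not part of the original notes.

```python
import numpy as np


def sequence_chunks(series, window, batch_size):
    """Yield (x, y) batches of sliding windows over a long 1-D series.

    x has shape (batch, window, 1), the (samples, timesteps, features)
    layout an LSTM layer expects; y holds the value following each window.
    """
    n_windows = len(series) - window  # number of (window, next value) pairs
    for start in range(0, n_windows, batch_size):
        stop = min(start + batch_size, n_windows)
        x = np.stack([series[i:i + window] for i in range(start, stop)])
        y = np.array([series[i + window] for i in range(start, stop)])
        yield x[..., np.newaxis].astype(np.float32), y.astype(np.float32)


# Toy stand-in for a series that would be too large to window all at once
series = np.arange(10, dtype=np.float32)
batches = list(sequence_chunks(series, window=3, batch_size=4))
```

Only one batch of windows exists in memory at a time, which is the property that matters when the full set of windows would not fit.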

### Learning Objectives
- Learn how to handle datasets that will not fit into memory and use them for Keras training
- Understand what `fit_generator` and `predict_generator` do
- What is Transfer Learning

## Train a deep learning model on a large dataset

- In Keras, `fit()` and `predict()` work well for smaller datasets that fit in memory

- In practice, however, many datasets are too large to load into memory at once, so the data has to be read in batches

In [4]:
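import cv2
import numpy as np
from keras.utils import to_categorical


def batch_generator(df, batch_size, path_tiles, num_classes):
    """
        This generator uses a pandas DataFrame to read images (df.tile_name) from disk.
    """
    N = df.shape[0]
    while True:  # loop forever; Keras stops each epoch after steps_per_epoch batches
        for start in range(0, N, batch_size):
            x_batch = []
            y_batch = []
            end = min(start + batch_size, N)
            df_tmp = df.iloc[start:end]  # positional slice holding the current batch
            ids_batch = df_tmp.tile_name
            for tile_id in ids_batch:
                # cv2.imread returns None if the file cannot be read
                img = cv2.imread(path_tiles + '/{}'.format(tile_id))
                # [0] since duplicated names
                labelname = df_tmp['y'][df_tmp.tile_name == tile_id].values[0]
                # .item() replaces np.asscalar, which recent NumPy no longer provides
                labelname = labelname.item()
                x_batch.append(img)
                y_batch.append(labelname)
            # Scale pixels to [0, 1]; assumes all images share the same shape
            x_batch = np.array(x_batch, np.float32) / 255
            y_batch = to_categorical(y_batch, num_classes)
            yield (x_batch, y_batch)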

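# Assumes `model` is a compiled Keras model and that df_train, batch_size,
# path_tiles, num_classes and epochs are already defined.
model.fit_generator(generator=batch_generator(df_train,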
                                              batch_size=batch_size,
                                              path_tiles=path_tiles,
                                              num_classes=num_classes),
                    steps_per_epoch=len(df_train) // batch_size,
                    epochs=epochs)

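Instead of `model.fit()`, we then call `model.fit_generator(generator=batch_generator(df_train, ...))`, which pulls one batch at a time from the generator so the full dataset never has to be in memory. In recent TensorFlow/Keras releases, `fit_generator` is deprecated and `model.fit()` accepts a generator directly.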
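
One detail worth checking is `steps_per_epoch`. A generator that yields a short final batch, as `batch_generator` does, produces one more batch per pass than `len(df_train) // batch_size` counts whenever the dataset size is not a multiple of the batch size. The sketch below illustrates the arithmetic with plain NumPy arrays; `array_batches` and the toy data are hypothetical stand-ins, not part of the original notes.

```python
import math

import numpy as np


def array_batches(x, y, batch_size):
    """Loop forever over (x, y) in order, yielding a short final batch if needed."""
    n = len(x)
    while True:
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield x[start:end], y[start:end]


x = np.arange(10).reshape(10, 1)
y = np.arange(10)
batch_size = 4

# Floor division counts only full batches: the last 2 samples spill into the next epoch
steps_floor = len(x) // batch_size
# Rounding up covers the short final batch, so each epoch is exactly one pass
steps_ceil = math.ceil(len(x) / batch_size)

gen = array_batches(x, y, batch_size)
seen = sum(len(next(gen)[0]) for _ in range(steps_ceil))  # samples drawn in one epoch
```

With floor division, training still works, but epoch boundaries drift relative to the data; rounding up keeps one epoch equal to one full pass.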

In [2]:
import numpy as np


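def data_gen(df, batch_size):
    """Yield (x_batch, y_batch) forever, one full batch at a time."""
    n_batches = len(df) // batch_size  # a trailing partial batch is dropped
    while True:
        for j in range(n_batches):
            # Allocate fresh arrays per batch so earlier yielded batches are not overwritten
            x_batch = np.zeros((batch_size, 3, 224, 224))
            y_batch = np.zeros((batch_size, 1))
            rows = slice(j * batch_size, (j + 1) * batch_size)
            # Assumes each df['url'] entry is already an image array of shape (3, 224, 224);
            # a URL or path string would have to be loaded into an array first.
            for b, (m, k) in enumerate(zip(df['url'].values[rows],
                                           df['class'].values[rows])):
                x_batch[b] = m
                y_batch[b] = k
            yield (x_batch, y_batch)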


# model.fit_generator(generator=data_gen(df_train, batch_size=batch_size), 
#                     steps_per_epoch=len(df_train) // batch_size, epochs=epochs)
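
The same batching idea applies at prediction time, which is what `predict_generator` is for: the generator yields inputs only, once and in their original order, so the predictions line up with the rows. The sketch below shows that pattern without a real model; `predict_batches` and `fake_predict` are hypothetical names, and `fake_predict` merely stands in for a trained model's per-batch output.

```python
import numpy as np


def predict_batches(x, batch_size):
    """Yield input batches only, in original order, including a short final batch."""
    for start in range(0, len(x), batch_size):
        yield x[start:start + batch_size]


def fake_predict(batch):
    # Hypothetical stand-in for a trained model's prediction on one batch
    return batch.sum(axis=1)


x_test = np.arange(14, dtype=np.float32).reshape(7, 2)

# Predict batch by batch, then stitch the results back together in order
preds = np.concatenate([fake_predict(b) for b in predict_batches(x_test, batch_size=3)])
```

With an actual Keras model, the list comprehension would presumably be replaced by a single call that consumes the generator for a fixed number of steps.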

## Data Augmentation

- One of the best ways to improve the performance of a Deep Learning model is to add more data to the training set

- We want the dataset to be representative of the many different positions, angles, lighting conditions, and miscellaneous distortions the model may encounter

- In Keras, there are two ways to augment data:

    - Use `ImageDataGenerator`
    - Write our custom code
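
As a rough illustration of the custom-code route, the function below applies a random horizontal flip and a small random shift to each image in a batch using only NumPy. The name `augment_batch`, the channels-last layout, and the `max_shift` parameter are assumptions chosen for this sketch; a real pipeline would likely add rotations, brightness changes, and proper padding.

```python
import numpy as np


def augment_batch(x_batch, rng, max_shift=2):
    """Return a randomly flipped and shifted copy of a batch.

    Assumes x_batch has shape (batch, height, width, channels).
    """
    out = np.empty_like(x_batch)
    for i, img in enumerate(x_batch):
        if rng.random() < 0.5:
            img = img[:, ::-1, :]  # horizontal flip
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        # np.roll wraps pixels around the border; a simple stand-in for a padded shift
        out[i] = np.roll(img, shift=(dy, dx), axis=(0, 1))
    return out


rng = np.random.default_rng(0)
x_batch = rng.random((8, 32, 32, 3)).astype(np.float32)  # toy image batch
x_aug = augment_batch(x_batch, rng)
```

Calling a function like this inside a batch generator, just before the `yield`, would give the model a slightly different version of each image on every epoch.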