# Review of using Keras with Large Datasets and Generators

- When working with large datasets, the standard approach that we have used before will not work.
- This is because `model.fit(x, y, ...)` needs `x` and `y` to be fully loaded into physical memory (RAM). If the data is big enough, the machine can run out of memory and crash.
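To get a feel for how quickly memory runs out, here is a rough back-of-the-envelope sketch. The dataset dimensions are hypothetical, and the helper `dataset_size_gib` is only an illustration, not part of Keras.

```python
def dataset_size_gib(num_images, height, width, channels, bytes_per_value=4):
    """Estimate the in-memory size of a dense image array, in GiB."""
    total_bytes = num_images * height * width * channels * bytes_per_value
    return total_bytes / 1024**3


# Hypothetical dataset: 100,000 RGB images of 224x224 pixels stored as float32
size = dataset_size_gib(100_000, 224, 224, 3)
print(f"Approximate size in memory: {size:.1f} GiB")
```

Under these assumptions the array needs roughly 56 GiB, far more RAM than most workstations have.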

**This is what Generators are used for**, so let's start by going over what they are!

## Generator Function
**Def:** A generator function is a Python function that returns a lazy iterator. It follows a call-by-need (lazy evaluation) strategy: values are produced one at a time, only when they are requested, which reduces the amount of memory a program needs.  
  
A very common use case for generator functions is reading a huge text file line by line, which is what we will do here. Here is an example: 

In [2]:
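def generator_function(file):
    # Open the file in a context manager so it is closed when iteration ends
    with open(file, 'r') as f:
        for row in f:
            # Yield one line at a time instead of loading the whole file
            yield row

# Count the lines (named line_count to avoid shadowing the built-in sum)
line_count = 0
for row in generator_function('titanic.csv'):
    line_count += 1

print(f'# lines in file: {line_count}')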

# lines in file: 892


The key difference is that the function does not return one specific value. It uses the **yield** keyword, so calling the function returns a **lazy iterator** (a generator) that produces one row each time the next value is requested. 
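A small sketch can make the laziness visible. The function `count_up_to` is a made-up example: its body prints a message, so you can see that nothing runs until the first value is requested.

```python
def count_up_to(limit):
    """Hypothetical generator that announces when its body actually starts."""
    print("generator body started")
    for i in range(limit):
        yield i


gen = count_up_to(3)        # nothing is printed yet: the body has not run
kind = type(gen).__name__   # the call returned a generator object
print(kind)

first = next(gen)           # the body now runs up to the first yield
print(first)

rest = list(gen)            # consume the remaining values
print(rest)
```

The message is printed only when `next(gen)` is called, not when `count_up_to(3)` is called.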

#### Here is another example:

In [3]:
gen = (i for i in range(0,10))

print(next(gen))
print(next(gen))
print(next(gen))

0
1
2
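The expression in parentheses above is a generator expression. As a rough illustration of the memory benefit, the sketch below compares it with the equivalent list comprehension. The exact byte counts depend on the Python version, so none are shown.

```python
import sys

squares_list = [i * i for i in range(1_000_000)]  # all values built up front
squares_gen = (i * i for i in range(1_000_000))   # values produced on demand

# The list object alone takes megabytes; the generator object stays tiny
list_bytes = sys.getsizeof(squares_list)
gen_bytes = sys.getsizeof(squares_gen)
print(list_bytes, gen_bytes)

total = sum(squares_gen)            # iterating consumes the generator
leftover = next(squares_gen, None)  # exhausted, so the default is returned
print(leftover)
```

Note that a generator can only be iterated once. After `sum()` has consumed it, there are no values left.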


## Using Generators with Keras
- This comes in handy when we want to work with a very large dataset, which is usually images but could be text as well.
- The main difference is that we are going to use the **model.fit_generator()** function instead of just **model.fit()**, and pass our data in as a generator! (In TensorFlow 2.1 and later, `fit_generator()` is deprecated and `model.fit()` accepts a generator directly.)

In [4]:
import os

import cv2
import numpy as np
from keras.utils import to_categorical


def batch_generator(df, batch_size, path_tiles, num_classes):
    """Yield (x_batch, y_batch) forever, reading images (df.tile_name) from disk."""

    # Number of images
    N = df.shape[0]

    while True:
        # Loop from 0 to N in batch_size increments
        for start in range(0, N, batch_size):

            x_batch = []
            y_batch = []

            # Clamp the end of the slice so the last batch does not run past N
            end = min(start + batch_size, N)
            df_tmp = df.iloc[start:end]

            # Get the file names of the images in this batch
            ids_batch = df_tmp.tile_name

            # Run through each file name and load the image from disk
            for tile_name in ids_batch:

                # Use the OpenCV library to read an image from a file
                img = cv2.imread(os.path.join(path_tiles, tile_name))

                # Look up the label for the current image as a Python scalar
                # (.item() replaces the deprecated np.asscalar)
                label = df_tmp['y'][df_tmp.tile_name == tile_name].values[0].item()

                # Append image to x_batch and label to y_batch
                x_batch.append(img)
                y_batch.append(label)

            # Scale the x_batch to values between 0 and 1
            x_batch = np.array(x_batch, np.float32) / 255

            # Turn y_batch into one-hot (categorical) vectors
            y_batch = to_categorical(y_batch, num_classes)

            # Yield one (inputs, targets) batch, then pause until the next is requested
            yield (x_batch, y_batch)
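The batching logic is easier to follow without the image loading. Here is a minimal sketch of the same loop structure, using plain lists as stand-ins for images and labels. `toy_batch_generator` is a hypothetical name used only for this illustration.

```python
from itertools import islice


def toy_batch_generator(samples, labels, batch_size):
    """Yield (x_batch, y_batch) pairs forever, like batch_generator above."""
    n = len(samples)
    while True:  # generators passed to Keras are expected to loop indefinitely
        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            yield samples[start:end], labels[start:end]


samples = list(range(10))          # stand-ins for 10 images
labels = [s % 2 for s in samples]  # stand-ins for their labels

# With n=10 and batch_size=4, one pass over the data is 3 batches,
# so the 4th batch wraps around to the start of the data.
batches = list(islice(toy_batch_generator(samples, labels, 4), 4))
for x_batch, y_batch in batches:
    print(x_batch, y_batch)
```

Because of the `while True`, the generator never stops on its own. That is why the training call has to be told how many batches make up one epoch.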

**Now that we have the generator, here is how we can use fit_generator in Keras**

In [None]:
model.fit_generator(generator=batch_generator(df_train,
                                              batch_size=batch_size,
                                              path_tiles=path_tiles,
                                              num_classes=num_classes),
                    steps_per_epoch=len(df_train) // batch_size,
                    epochs=100)
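The `steps_per_epoch` argument tells Keras how many batches to draw from the generator per epoch. The sketch below compares floor division, as used above, with rounding up. The sample count and batch size are hypothetical.

```python
import math


def steps_per_epoch_floor(n_samples, batch_size):
    """Count full batches only; a smaller final batch is not counted."""
    return n_samples // batch_size


def steps_per_epoch_ceil(n_samples, batch_size):
    """Count every batch in one full pass, including a smaller final one."""
    return math.ceil(n_samples / batch_size)


n_samples, batch_size = 1000, 32  # hypothetical values
floor_steps = steps_per_epoch_floor(n_samples, batch_size)
ceil_steps = steps_per_epoch_ceil(n_samples, batch_size)
last_batch = n_samples - floor_steps * batch_size  # samples left after full batches

print(floor_steps, ceil_steps, last_batch)
```

Since `batch_generator` also yields the smaller final batch, one full pass takes the rounded-up number of batches. With floor division no data is lost, because the generator loops forever, but the leftover batch is consumed at the start of the next epoch, so epochs no longer line up exactly with passes over the data.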

## Data/Image Augmentation - and why to use it.

- Data/Image Augmentation is one of the best ways to improve the performance of a Deep Learning model, because it adds variety to the training dataset.
- We want our input images to be representative of different lighting, angles, positions, and so on.

**How do we do this in Keras?**  
- Can use ImageDataGenerator
- Write our own custom code
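As a rough idea of what the "custom code" option could look like, here is a minimal NumPy sketch that randomly flips and brightens a batch of images already scaled to the range 0 to 1. The function name `augment_batch`, its parameters, and the chosen transformations are assumptions made for illustration. They are not part of Keras.

```python
import numpy as np


def augment_batch(x_batch, rng, max_brightness_shift=0.2):
    """Randomly flip and brighten images of shape (batch, height, width, channels)."""
    out = x_batch.copy()

    # Flip roughly half of the images left-to-right (reverse the width axis)
    flip_mask = rng.random(len(out)) < 0.5
    out[flip_mask] = out[flip_mask, :, ::-1, :]

    # Add one random brightness offset per image, then clip back to [0, 1]
    shift = rng.uniform(-max_brightness_shift, max_brightness_shift,
                        size=(len(out), 1, 1, 1))
    return np.clip(out + shift, 0.0, 1.0).astype(np.float32)


rng = np.random.default_rng(seed=0)
x_batch = rng.random((8, 32, 32, 3), dtype=np.float32)  # stand-in images
x_aug = augment_batch(x_batch, rng)
print(x_aug.shape, x_aug.dtype)
```

A function like this could be applied to `x_batch` inside a batch generator just before the `yield`, so that every epoch sees slightly different versions of the same images.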

In [None]:
from 
train_data_gen 