# Gender recognition using Tensorflow v2 and InceptionV3 on entire dataset
### Inspiration
I copied this notebook from [Marcos Alvarado's notebook](https://www.kaggle.com/bmarcos/image-recognition-gender-detection-inceptionv3) and adapted it to work on the entire dataset and with TensorFlow v2.
**Note:** I've removed some of the details from the original notebook, so I recommend reading it as well.

### Dataset
The dataset is available [here](https://www.kaggle.com/jessicali9530/celeba-dataset) and contains:
- 202,599 face images of various celebrities
- 10,177 unique identities, but names of identities are not given
- 40 binary attribute annotations per image
- 5 landmark locations

### Modelling and structure

#### Transfer learning with InceptionV3
I use a pretrained InceptionV3 model: the first layers stay frozen, the remaining layers are retrained, and new output layers are attached to perform the new classification task.

#### Target variable
As my target variable, I only use the gender feature available in the dataset and detect if the image shows a man or a woman.

#### Training on the entire dataset
I use Keras' ImageDataGenerator and flow_from_dataframe to avoid loading all images into memory, which makes it possible to fit the model on the entire dataset. The process goes this way:
##### Training
I augment data using the image generator and then fit the model using the generator:
```python
train_datagen =  ImageDataGenerator(
  preprocessing_function=preprocess_input,
  rotation_range=30,
  width_shift_range=0.2,
  height_shift_range=0.2,
  shear_range=0.2,
  zoom_range=0.2,
  horizontal_flip=True,
  #brightness_range=[0.4,1.5],
  rescale=1./255,
)

train_generator = train_datagen.flow_from_dataframe(
    df_train,
    batch_size=20,
    x_col="image_id", 
    y_col="gender",
    class_mode="binary",
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    validate_filenames=False)
```
The model can then be fitted on the generator:
```python
model_.fit(train_generator,
           validation_data=valid_generator,
           steps_per_epoch=len(train_generator),   # one full pass over the training data
           validation_steps=len(valid_generator),  # one full pass over the validation data
           epochs=NUM_EPOCHS,
           callbacks=[checkpointer],
           verbose=1)
```
##### Testing
The same process takes place, but I do not augment the data, and I **make sure shuffle is set to False**. I also set validate_filenames to False to save some time.
```python
test_datagen =  ImageDataGenerator(
  preprocessing_function=preprocess_input,
  rescale=1./255,
)

test_generator = test_datagen.flow_from_dataframe(
    df_test,
    batch_size=20,
    x_col="image_id", 
    y_col="gender",
    class_mode="raw",
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    validate_filenames=False,
    shuffle=False)
```
Predictions can then be generated using this generator:
```python
model_predictions = model_.predict(test_generator)  # runs over every batch of the generator once
```

##### Generating new predictions
We could make batch predictions using the testing generator that I defined previously (and I include a function for this in this notebook). However, it can also be useful to re-create the same preprocessing to generate predictions without it. I have recreated the behavior this way:
```python
img = image.load_img(path, target_size=(IMG_HEIGHT, IMG_WIDTH))  # Keras function; (height, width) of the model input
img = image.img_to_array(img)
img = np.expand_dims(img, axis=0)
img = np.vstack([img])
img = preprocess_input(img) # Keras function 
img = img / 255
```
This is the same preprocessing the generator performs (resize, preprocess_input, then division by 255), so it yields the same results when making predictions.
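
To make the combined scaling concrete, here is a NumPy-only sketch. It assumes that InceptionV3's preprocess_input maps pixels from [0, 255] to [-1, 1] via `x / 127.5 - 1`; `inception_preprocess` and `generator_like_preprocess` are hypothetical names used only for this illustration. Since the generator applies the preprocessing function first and the rescale factor afterwards, the network ends up seeing values of roughly ±0.004, which is worth keeping in mind if either step is changed.

```python
import numpy as np


def inception_preprocess(x: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for preprocess_input: maps [0, 255] to [-1, 1]."""
    return x.astype("float32") / 127.5 - 1.0


def generator_like_preprocess(x: np.ndarray) -> np.ndarray:
    """Preprocessing function first, then rescale=1/255, as in the generators."""
    return inception_preprocess(x) / 255.0


pixels = np.array([0.0, 127.5, 255.0])  # darkest, mid-grey and brightest pixel values
scaled = generator_like_preprocess(pixels)
print(scaled)  # values between -1/255 and 1/255
```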

### Inception V3
The Inception model is available from Keras ([here](https://keras.io/api/applications/inceptionv3/)). Its pretrained weights can either be downloaded automatically when initializing the model:
```python
inc_model = InceptionV3(weights="imagenet",
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))
```
Or by connecting the notebook to [this Kaggle dataset](https://www.kaggle.com/keras/inceptionv3) containing the weights:
```python
inc_model = InceptionV3(weights="../input/inceptionv3/inception_v3_weights_tf_dim_ordering_tf_kernels_notop.h5",
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))
```

## The code

 ### Required libraries:
 - cv2
 - sklearn
 - tensorflow (v2)
 - keras
 - PIL
 - pandas
 - numpy
 - matplotlib
 - seaborn
 
 #### Imports

In [None]:
import os

import pandas as pd
import numpy as np
import cv2    
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score

from keras.applications.inception_v3 import InceptionV3, preprocess_input
from keras import optimizers
from keras.models import Sequential, Model 
from keras.layers import Dropout, Flatten, Dense, GlobalAveragePooling2D
from keras.callbacks import ModelCheckpoint
from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img
from keras.preprocessing import image

from IPython.display import display, HTML
from PIL import Image
from io import BytesIO
import base64

import tensorflow as tf
print(f"Built using tensorflow version {tf.__version__}")

In [None]:
%matplotlib inline

### Variables
Setting the image folder, image properties and training parameters.

In [None]:
# set variables 
main_folder = "../input/celeba-dataset"
images_folder = os.path.join(main_folder, 'img_align_celeba', 'img_align_celeba')

IMG_WIDTH = 178
IMG_HEIGHT = 218
BATCH_SIZE = 128
NUM_EPOCHS = 10

#### Importing the data
Here I read and process the dataframe containing the labels and the filenames they correspond to. Note that it does not contain the images themselves: reading them is handled by the Keras generator (flow_from_dataframe).

In [None]:
df_imgs = pd.read_csv(os.path.join(main_folder, 'list_attr_celeba.csv'), usecols=["image_id", "Male"])  
df_imgs.replace(to_replace={"Male": -1}, value="Female", inplace=True)
df_imgs.replace(to_replace={"Male": 1}, value="Male", inplace=True)
df_imgs.rename(columns={"Male": "gender"}, inplace=True)
df_imgs.head() 
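
The same relabelling can also be written with `Series.map`, which turns any unexpected value into `NaN` instead of leaving it untouched. A small sketch on a made-up frame (the file names are placeholders, not actual dataset rows):

```python
import pandas as pd

# Toy stand-in for list_attr_celeba.csv: the "Male" attribute is coded as -1 or 1.
toy = pd.DataFrame({"image_id": ["a.jpg", "b.jpg", "c.jpg"], "Male": [-1, 1, -1]})

# Map the numeric codes to readable labels, then drop the original column.
toy["gender"] = toy["Male"].map({-1: "Female", 1: "Male"})
toy = toy.drop(columns="Male")
print(toy)
```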

### Partitioning data into train, valid and test
I will use:
- train: training data, will be augmented 
- valid: validation data during training. We will save the model that performs the best on the valid dataset
- test: once I have the best model according to train and valid, I verify the results on the test dataset.

#### Class distribution
The dataset is slightly imbalanced.

In [None]:
# Female or Male?
plt.title('Female or Male')
sns.countplot(y='gender', data=df_imgs, color="c")
plt.show()
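
The plot shows raw counts; the class shares can be printed as well with `value_counts(normalize=True)`. A sketch on made-up labels (the 60/40 split below is illustrative, not the dataset's actual ratio):

```python
import pandas as pd

# Illustrative labels only: 6 "Female" and 4 "Male".
labels = pd.Series(["Female"] * 6 + ["Male"] * 4, name="gender")

# normalize=True returns fractions instead of counts.
shares = labels.value_counts(normalize=True)
print(shares)
```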

#### Dataset recommended partition
The dataset provides us with the following partition:
- 0: train
- 1: valid
- 2: test

In [None]:
df_partition = pd.read_csv(os.path.join(main_folder, 'list_eval_partition.csv'))
df_partition.head()

In [None]:
df_partition['partition'].value_counts().sort_index()
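
For readability, the partition codes can be mapped to names before counting. A sketch on a toy frame, assuming the 0/1/2 coding listed above; `PARTITION_NAMES` is a name introduced here for illustration:

```python
import pandas as pd

PARTITION_NAMES = {0: "train", 1: "valid", 2: "test"}  # hypothetical helper mapping

# Toy stand-in for list_eval_partition.csv.
toy_partition = pd.DataFrame({
    "image_id": ["a.jpg", "b.jpg", "c.jpg", "d.jpg"],
    "partition": [0, 0, 1, 2],
})

counts = toy_partition["partition"].map(PARTITION_NAMES).value_counts()
print(counts)
```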

#### Joining the partition with the dataframe containing the labels
I also change image_id to hold the full image path, so flow_from_dataframe reads the images from there. This path-joining step could be skipped by passing the images folder as the directory argument of flow_from_dataframe instead.

In [None]:
df_imgs = df_imgs.merge(df_partition, on="image_id")
df_imgs.loc[:, "image_id"] = df_imgs.loc[:, "image_id"].apply(lambda x: os.path.join(images_folder, x))
df_imgs.head()

#### Creating the train, valid and test dataframes 
They will all point to the relevant image and contain the label ("Male" or "Female").

In [None]:
df_train = df_imgs.loc[df_imgs.loc[:, "partition"] == 0, 
                      ["image_id", "gender"]]
df_valid = df_imgs.loc[df_imgs.loc[:, "partition"] == 1, 
                      ["image_id", "gender"]]
df_test = df_imgs.loc[df_imgs.loc[:, "partition"] == 2, 
                      ["image_id", "gender"]]

### Data Augmentation
#### Demonstration 
The following code uses the image generator and flow_from_dataframe to load images from a dataframe (I recommend passing a single image) and displays 10 augmented versions of it.

In [None]:
# Generate image generator for data augmentation
datagen =  ImageDataGenerator(
  rotation_range=30,
  width_shift_range=0.2,
  height_shift_range=0.2,
  shear_range=0.2,
  zoom_range=0.2,
  horizontal_flip=True,
  rescale=1./255,
    
)

# load one image and plot augmented versions of it
def display_image(img: pd.DataFrame) -> None:
    """Plots 10 augmented versions of the image referenced in the dataframe."""

    plt.figure(figsize=(20,10))
    plt.suptitle('Data Augmentation', fontsize=28)

    i = 0
    for batch in datagen.flow_from_dataframe(img,
                                             batch_size=1,
                                             x_col="image_id", 
                                             y_col="gender",
                                             class_mode="raw",
                                             target_size=(IMG_HEIGHT, IMG_WIDTH)):
        batch_image = batch[0]
        batch_label = batch[1]
        plt.subplot(3, 5, i+1)
        plt.grid(False)
        plt.imshow(batch_image[0])  # drop the batch dimension
        
        if i == 9:
            break
        i += 1
    print(f"Label: {batch_label}")
    plt.show()

display_image(df_imgs.loc[:0, :])

### Create the train and valid data generators

#### The training data needs to be augmented

In [None]:
# Train - Data Preparation - Data Augmentation with generators
train_datagen =  ImageDataGenerator(
  preprocessing_function=preprocess_input,
  rotation_range=30,
  width_shift_range=0.2,
  height_shift_range=0.2,
  shear_range=0.2,
  zoom_range=0.2,
  horizontal_flip=True,
  #brightness_range=[0.4,1.5],
  rescale=1./255,
)

train_generator = train_datagen.flow_from_dataframe(
    df_train,
    batch_size=20,
    x_col="image_id", 
    y_col="gender",
    class_mode="binary",
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    validate_filenames=False)

#### Validation data isn't augmented
We don't need to set shuffle=False here: during training and validation each batch carries its labels alongside its images, so the order does not matter. When generating predictions, however, the outputs are compared with labels read from the dataframe, so the generator must yield the images in dataframe order. If it shuffles, the predictions no longer line up with those labels. So make sure shuffle is set to False when generating predictions.
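
A NumPy-only illustration of why the order matters at prediction time. It simulates a hypothetical "perfect" model whose output equals the label, then scores its predictions against the dataframe-ordered labels, once in the original order and once in a permuted order (which is what a shuffling generator would produce). The data is random and made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Labels in dataframe order (random 0/1 values, for illustration only).
y_true = rng.integers(0, 2, size=1000)

# shuffle=False: predictions come back in dataframe order.
preds_ordered = y_true.copy()

# shuffle=True: rows are visited in a random order, so the predictions
# come back permuted relative to the dataframe.
order = rng.permutation(len(y_true))
preds_shuffled = y_true[order]

acc_ordered = (preds_ordered == y_true).mean()
acc_shuffled = (preds_shuffled == y_true).mean()
print(f"ordered: {acc_ordered:.2f}, shuffled: {acc_shuffled:.2f}")
```

Even with a perfect model, the shuffled case scores close to chance, which is exactly the symptom to look for when test accuracy is unexpectedly poor.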

In [None]:
valid_datagen =  ImageDataGenerator(
  preprocessing_function=preprocess_input,
  rescale=1./255,
)

valid_generator = valid_datagen.flow_from_dataframe(
    df_valid,
    batch_size=20,
    x_col="image_id", 
    y_col="gender",
    class_mode="binary",
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    validate_filenames=False)

### Creating the model
#### Pretrained model
I'm using InceptionV3 pretrained on the ImageNet dataset. This code will automatically download the weights.

#### Setting the initial layers to be non trainable
Training all layers could lead to overfitting, and training none would result in underfitting (as the image representation would not be tuned). Freezing the first 52 layers and training the remaining ones seems to work well.

In [None]:
# Import InceptionV3 Model
inc_model = InceptionV3(weights="imagenet",
                        include_top=False,
                        input_shape=(IMG_HEIGHT, IMG_WIDTH, 3))

print("number of layers:", len(inc_model.layers))
# Freeze the first 52 layers so they are not trained
for layer in inc_model.layers[:52]:
    layer.trainable = False

#### Connecting the representation layers to the classification layers

In [None]:
#Adding custom Layers
x = inc_model.output
x = GlobalAveragePooling2D()(x)
x = Dense(1024, activation="relu")(x)
x = Dropout(0.5)(x)
x = Dense(512, activation="relu")(x)
predictions = Dense(1, activation="sigmoid")(x)

#### Final model

In [None]:
# creating the final model 
model_ = Model(inputs=inc_model.input, outputs=predictions)

# compile the model
model_.compile(optimizer="adam", 
               loss='binary_crossentropy', 
               metrics=['accuracy'])

#### Making sure we use the best model from training according to its performance on the valid set

In [None]:
#https://keras.io/models/sequential/ fit generator
checkpointer = ModelCheckpoint(filepath='weights.best.inc.male.hdf5', 
                               verbose=1, save_best_only=True)
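
With save_best_only=True and the default monitored quantity (the validation loss), the checkpoint file is only overwritten when the validation loss improves. A plain-Python sketch of that rule with made-up loss values; `epochs_saved` is a hypothetical helper, not part of Keras:

```python
from typing import List


def epochs_saved(val_losses: List[float]) -> List[int]:
    """Return the (1-based) epochs at which a best-only checkpoint would be written."""
    best = float("inf")
    saved = []
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:  # strictly better than anything seen so far
            best = loss
            saved.append(epoch)
    return saved


saved = epochs_saved([0.30, 0.25, 0.27, 0.22, 0.24])
print(saved)  # [1, 2, 4]
```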

#### Fit the model and plot training performance over time

In [None]:
hist = model_.fit(train_generator,
                  validation_data=valid_generator,
                  steps_per_epoch=len(train_generator),   # one full pass over the training data
                  validation_steps=len(valid_generator),  # one full pass over the validation data
                  epochs=NUM_EPOCHS,
                  callbacks=[checkpointer],
                  verbose=1)
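
The relationship between dataset size, batch size and steps per epoch, as plain arithmetic. The sample count below is illustrative, not the real partition size. Integer division drops the last partial batch, while rounding up (which is what `len()` of a Keras generator reports) covers every image once. The batch size used here has to be the one the generator was created with, otherwise an epoch covers only part of the data.

```python
import math

n_images = 1000   # illustrative sample count
batch_size = 128  # must match the generator's batch_size

steps_floor = n_images // batch_size           # skips the final partial batch
steps_ceil = math.ceil(n_images / batch_size)  # covers every image once

print(steps_floor, steps_ceil)    # 7 8
print(steps_floor * batch_size)   # 896 images seen per epoch with floor division
```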

In [None]:
# Plot loss function value through epochs
plt.figure(figsize=(18, 4))
plt.plot(hist.history['loss'], label = 'train')
plt.plot(hist.history['val_loss'], label = 'valid')
plt.legend()
plt.title('Loss Function')
plt.show()

In [None]:
# Plot accuracy through epochs
plt.figure(figsize=(18, 4))
plt.plot(hist.history['accuracy'], label = 'train')
plt.plot(hist.history['val_accuracy'], label = 'valid')
plt.legend()
plt.title('Accuracy')
plt.show()

#### Select and use the best model from training according to the validation set

In [None]:
#load the best model
model_.load_weights('weights.best.inc.male.hdf5')

### Generating predictions
#### Batch generating: using a generator
Note that shuffle is set to False. Sorry for insisting, I lost some time on that!

In [None]:
test_datagen =  ImageDataGenerator(
  preprocessing_function=preprocess_input,
  rescale=1./255,
)

test_generator = test_datagen.flow_from_dataframe(
    df_test,
    batch_size=20,
    x_col="image_id", 
    y_col="gender",
    class_mode="raw",
    target_size=(IMG_HEIGHT, IMG_WIDTH),  # Keras expects (height, width)
    validate_filenames=False,
    shuffle=False)

In [None]:
# generate prediction
model_predictions = model_.predict(test_generator)  # runs over every batch of the generator once

In [None]:
# report test accuracy
y_true = df_test.loc[:, "gender"].map({"Male": 1, "Female": 0}).values
preds = (model_predictions > 0.5).astype(int).reshape(-1)  # flatten (n, 1) to (n,)
test_accuracy = 100 * accuracy_score(y_true, preds)
print('Model Evaluation')
print('Test accuracy: %.4f%%' % test_accuracy)
print('f1_score:', f1_score(y_true, preds))
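
To show what these two metrics measure, here is a hand-computed version on five made-up predictions. It uses the same 0.5 threshold and treats "Male" (1) as the positive class; the probabilities and labels are invented for the example.

```python
import numpy as np

probs = np.array([0.9, 0.2, 0.6, 0.4, 0.8])  # made-up sigmoid outputs
y_true = np.array([1, 0, 0, 1, 1])           # 1 = Male, 0 = Female

y_pred = (probs > 0.5).astype(int)           # thresholding gives [1, 0, 1, 0, 1]

tp = int(((y_pred == 1) & (y_true == 1)).sum())  # true positives
fp = int(((y_pred == 1) & (y_true == 0)).sum())  # false positives
fn = int(((y_pred == 0) & (y_true == 1)).sum())  # false negatives

accuracy = float((y_pred == y_true).mean())
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
print(f"accuracy={accuracy:.2f}, f1={f1:.2f}")
```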

## Generating new predictions
This is heavily inspired by Marcos' notebook:
- read_image: recreates the preprocessing done by the data generator (reads the image and preprocesses it)
- display_result: displays the output nicely using html
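
The display code turns the single sigmoid output into a label and a confidence. A minimal sketch of that rule, assuming the output is the probability of "Male" (the class encoded as 1); `label_and_confidence` is a hypothetical helper that is not defined in this notebook:

```python
from typing import Tuple


def label_and_confidence(p_male: float, threshold: float = 0.5) -> Tuple[str, float]:
    """Map a sigmoid output (probability of 'Male') to a label and its confidence."""
    if p_male > threshold:
        return "Male", p_male
    # At or below the threshold the prediction is "Female" with confidence 1 - p.
    return "Female", 1.0 - p_male


label_hi, conf_hi = label_and_confidence(0.9)
label_lo, conf_lo = label_and_confidence(0.2)
print(label_hi, round(conf_hi, 2))
print(label_lo, round(conf_lo, 2))
```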

In [None]:
def read_image(path: str) -> np.ndarray:
    """Replicates the image preprocessing from the data generator"""
    # predicting images
    img = image.load_img(path, target_size=(IMG_HEIGHT, IMG_WIDTH))  # (height, width) of the model input
    img = image.img_to_array(img)
    img = np.expand_dims(img, axis=0)
    img = np.vstack([img])
    img = preprocess_input(img) # preprocess for our model input
    return img / 255.


def img_to_display(filename: str):
    """
    Reads a jpeg image. Goal is to display it nicely using html in the function below.
    
    (Copied from Marcos' notebook, this is his note :) )
    # inspired on this kernel:
    # https://www.kaggle.com/stassl/displaying-inline-images-in-pandas-dataframe
    # credits to stassl :)
    """
    
    i = Image.open(filename)
    i.thumbnail((200, 200), Image.LANCZOS)
    
    with BytesIO() as buffer:
        i.save(buffer, 'jpeg')
        return base64.b64encode(buffer.getvalue()).decode()
    

def display_result(filename: str, target: str) -> None:
    """
    Display the results in HTML
    
    :param filename: path to the image
    :param target: real label from that image
    """
    gender = 'Male'
    gender_icon = "https://i.imgur.com/nxWan2u.png"
    
    prediction = model_.predict(read_image(filename), steps=1).reshape(-1)
    
        
    if prediction <= 0.5:
        gender_icon = "https://i.imgur.com/oAAb8rd.png"
        gender = 'Female'
        prediction = 1 - prediction
            
    display_html = '''
    <div style="overflow: auto;  border: 2px solid #D8D8D8;
        padding: 5px; width: 420px;" >
        <img src="data:image/jpeg;base64,{}" style="float: left;" width="200" height="200">
        <div style="padding: 10px 0px 0px 20px; overflow: auto;">
            <img src="{}" style="float: left;" width="40" height="40">
            <h3 style="margin-left: 50px; margin-top: 2px;">{}</h3>
            <p style="margin-left: 50px; margin-top: -6px; font-size: 12px">{} prob.</p>
            <p style="margin-left: 50px; margin-top: -16px; font-size: 12px">Real Target: {}</p>
            <p style="margin-left: 50px; margin-top: -16px; font-size: 12px">Filename: {}</p>
        </div>
    </div>
    '''.format(img_to_display(filename)
               , gender_icon
               , gender
               , "{0:.2f}%".format(np.round(max(prediction)*100,2))
               , target
               , filename.split('/')[-1]
               )

    display(HTML(display_html))

In [None]:
# select the first 10 images of the test partition
df_to_test = df_test.iloc[: 10, :]

for _, row in df_to_test.iterrows():
    display_result(row["image_id"], row["gender"])

In [None]:
model_.save("test_model_save.h5")

#### Additional code: batch prediction with flow from dataframe

In [None]:
def predict(img: pd.DataFrame) -> np.ndarray:
    """
    :param img: pandas DataFrame containing the image paths under image_id
    
    :return: numpy array with the predicted probability of "Male" for each image.
    """
    datagen =  ImageDataGenerator(
      rescale=1./255,
      preprocessing_function=preprocess_input,
    )

    generator = datagen.flow_from_dataframe(img,
                                            shuffle=False,
                                            batch_size=len(img),
                                            x_col="image_id", 
                                            class_mode=None,
                                            target_size=(IMG_HEIGHT, IMG_WIDTH))
    # a single batch holds every image, so one step is enough
    return model_.predict(generator, steps=1)

predict(df_imgs.loc[:10]) >= 0.5