# Merge image, structured and text data in the same neural net with fast.ai

In this notebook we will predict the adoption speed of pets in the [PetFinder Kaggle competition](https://www.kaggle.com/c/petfinder-adoption-prediction/). This competition gives access to three kinds of data: **images** of the pets, **structured** data such as their age, breed and color, and finally **text** data in the form of a description of each pet.

It would be very interesting to merge all this data inside the same neural network, so that the network can use whatever information it finds in any of the three sources to predict how fast a pet is going to get adopted.

Keep in mind that **this is my first Kaggle competition**, so I might not be using the best strategies or validation schemes. I just wanted to explore this idea of merging different types of data inside the same neural network.

## Fast.ai
We are going to use fast.ai because it offers most of what we need. Chiefly, it has a very intuitive [data block](https://docs.fast.ai/data_block.html) API that we will use to load our various data from disk, line them up and pass them as input to our neural network. It also provides easily accessible pre-trained models that we can use for our tasks.

## Leveraging pre-trained models
![caption](Diagram.jpg)

In [1]:
import warnings
warnings.filterwarnings('ignore')

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import json
import os
import feather
from fastai.text import *

from petfinder.data import *

%matplotlib inline
%load_ext autoreload
%autoreload 2

# Get the structured data
The get_data function contains all the rather boring data wrangling. We open the structured data in train.csv, which holds information about each pet (identified by a PetID): the age of the pet, the breed, the color, whether it was vaccinated, a textual description of the pet and so on. The PetFinder competition also ran each description through the Google sentiment analysis service and provided us with the results. I use some of this information and create a few new columns from it as well.

We also find images in the train_images folder. We create a dataframe with one row per image, containing the PetID of the image and its path on disk. We then merge this dataframe with the main structured data by PetID. This yields a dataframe with one row per image, where each row carries all the structured information about the pet.

Kaggle also provided some metadata for each pet, but I didn't spend the time parsing those files...

We have to predict one of 5 AdoptionSpeed values. This is a classification problem, but a lot of people in the competition used regression and then searched for the best rounding thresholds with the OptimizedRounder class, which is used at the end of this notebook. I tried multi-class classification with this model but did not get good results.
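
To make the "regress, then round" idea concrete, here is a minimal sketch of the rounding step only. The function name `round_with_thresholds` and the cut points are hypothetical, chosen for illustration; the actual OptimizedRounder lives in the petfinder package and is not shown in this notebook, so this should not be read as its implementation.

```python
from bisect import bisect_right

def round_with_thresholds(predictions, thresholds):
    """Map continuous predictions to class indices using sorted cut points."""
    # bisect_right counts how many thresholds are <= the prediction, which is
    # the class index when classes run from 0 to len(thresholds).
    return [bisect_right(thresholds, p) for p in predictions]

# Hypothetical cut points; an optimizer would tune these on validation data
# instead of using the naive .5 boundaries.
thresholds = [0.5, 1.5, 2.5, 3.5]
raw = [0.2, 1.49, 1.5, 2.7, 3.9]
classes = round_with_thresholds(raw, thresholds)
print(classes)
```

Tuning the four cut points against the competition metric, rather than rounding to the nearest integer, is what lets a regression model compete on a 5-class problem.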

In [2]:
path = 'C:\\work\\ML\\PetFinder\\'
bs=64

pets = get_data(isTest=False)
petsTest = get_data(isTest=True)

pets.AdoptionSpeed = pets.AdoptionSpeed.astype(float)

petsTest['AdoptionSpeed'] = 0

# Language Model

See the notebook *PetFinder Language Model* for how we train and fine-tune a text language model on the pet descriptions.

# Structured data

Here we have some decisions to make for our structured variables. We need to decide which ones are going to be categorical variables and which ones are going to be continuous.

Just because a variable is a number does not mean it should be a continuous variable. If the variable only contains a small number of unique values, it might be better to model it as a categorical variable. We can use [embeddings](https://www.fast.ai/2018/04/29/categorical-embeddings/) for categorical data, which lets us learn a far richer representation for them and is sometimes more powerful than using a continuous variable.
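
The cardinality rule can be sketched in a few lines of plain Python. This is a simplified illustration, not the logic of fastai's cont_cat_split: the helper name `split_by_cardinality` and the toy columns are assumptions made for the example.

```python
def split_by_cardinality(columns, max_card=10):
    """Return (cont_names, cat_names) for a dict of column name -> values."""
    cont_names, cat_names = [], []
    for name, values in columns.items():
        # Booleans are ints in Python, so exclude them explicitly.
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        # Numeric columns with many distinct values are treated as continuous;
        # everything else (text, or numbers with few distinct values) is categorical.
        if is_numeric and len(set(values)) > max_card:
            cont_names.append(name)
        else:
            cat_names.append(name)
    return cont_names, cat_names

# Toy data for illustration only.
columns = {
    "Age": list(range(60)),
    "Gender": [1, 2, 3, 1, 2],
    "Name": ["Nibble", "Rex", "Luna"],
}
cont_names, cat_names = split_by_cardinality(columns, max_card=10)
print(cont_names, cat_names)
```

Here Gender is stored as a number but only takes three values, so it ends up categorical and can get its own embedding.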

Fastai takes care of choosing the embedding sizes. It also fills missing values and normalizes the structured data for the neural network.

In [3]:
from fastai.tabular import *
from fastai.vision import *
from fastai.metrics import *
from fastai.text import *

dep_var = 'AdoptionSpeed'
cont_names, cat_names = cont_cat_split(pets, dep_var=dep_var, max_card=10)
procs = [FillMissing, Categorify, Normalize]
cat_names.remove('Filename')
cat_names.remove('PicturePath')
cat_names.remove('PetID')
cat_names.remove('Description')

# for name in cont_names:
#     pets[name] = np.log(pets[name] - pets[name].min() + 1)

In [4]:
cont_names, cat_names

(['Age',
  'Quantity',
  'Fee',
  'VideoAmt',
  'PhotoAmt',
  'RescuerDogCount',
  'AvgSentenceSentimentMagnitude',
  'AvgSentenceSentimentScore',
  'SentimentMagnitude',
  'SentimentScore',
  'state_gdp',
  'state_population',
  'gdp_vs_population'],
 ['Type',
  'Name',
  'Breed1',
  'Breed2',
  'Gender',
  'Color1',
  'Color2',
  'Color3',
  'MaturitySize',
  'FurLength',
  'Vaccinated',
  'Dewormed',
  'Sterilized',
  'Health',
  'State',
  'RescuerID',
  'NoImage',
  'NoDescription'])

In [5]:
from petfinder.model import *

# Loading and lining up the data

We want to load our data. Ideally we would like to re-use existing functionality and not have to write a custom data loader. fast.ai has us covered, thanks to the amazing [data block api](https://docs.fast.ai/data_block.html)!

First we need an ItemList per type of data: one for images, one for structured data and one for text. Each of them pre-processes its input and keeps track of the processing it applies to the data, such as normalization.

Then we merge them using a MixedItemList. A MixedItemList simply takes an item from each ItemList it contains and merges them into one item. When fast.ai passes data to our model's forward method, we can therefore expect as many inputs as there are ItemLists in our MixedItemList.

I pickle the MixedItemList to avoid recomputing it every time I reload the notebook, because some of the ItemList pre-processing can take a long time (the text pre-processing in particular).

In [6]:
byPetID = pets.groupby('PetID').size().reset_index()
byPetID = byPetID.sample(frac=.1, random_state=42).drop([0], axis=1)
byPetID['IsValidation'] = True
pets = pd.merge(pets, byPetID, how='left', on='PetID')
pets.IsValidation = pets.IsValidation.fillna(False)

In [7]:
from fastai.callbacks import *

bs = 32
size = 224
np.random.seed(42)

data_lm = load_data(path, 'data_lm_descriptions.pkl', bs=bs)
vocab = data_lm.vocab

imgList = ImageList.from_df(pets, path=path, cols='PicturePath')
tabList = TabularList.from_df(pets, cat_names=cat_names, cont_names=cont_names, procs=procs, path=path)
textList = TextList.from_df(pets, cols='Description', path=path, vocab=vocab)

mixed_path = path + 'mixed_img_tab_text.pkl'

if not os.path.isfile(mixed_path):
    mixed = (MixedItemList([imgList, tabList, textList], path, inner_df=tabList.inner_df)
            .split_from_df(col='IsValidation')
            .label_from_df(cols='AdoptionSpeed', label_cls=FloatList)
            .transform([[get_transforms()[0], [], []], [get_transforms()[1], [], []]], size=size))

    # Cache the processed item list so the notebook reloads quickly
    with open(mixed_path, 'wb') as outfile:
        pickle.dump(mixed, outfile)
else:
    with open(mixed_path, 'rb') as infile:
        mixed = pickle.load(infile)

This makes a text databunch used later on to create our learner (for the text portion of our learner).  We need this to construct a pre-trained RNN for classification.

In [8]:
if os.path.isfile(path + 'text-classification-databunch.pkl'):
    data_text = load_data(path, 'text-classification-databunch.pkl')
else:
    petsAll = pd.concat([pets, petsTest])
    petsAll = petsAll.dropna(subset=['Description'])
    
    data_text = (TextList.from_df(petsAll, cols='Description', path=path, vocab=vocab)).split_none().label_from_df(cols='AdoptionSpeed').databunch(bs=bs)
    data_text.save('text-classification-databunch.pkl')

# Special functions
Neural network frameworks like to process data in batches, and every row in a batch must have the same size. Our images and structured data always have the same size, but our text data varies in length, since the description of each pet is different.

We have to modify some functions in fastai to make them work with our inputs. First, since we are using a pre-trained resnet34 network for our images, we need to normalize our images using statistics from ImageNet. But the normalize function for images in fastai expects a certain tensor shape, so we need a custom normalize function that takes our custom input structure into account.

Each row in our batch contains three things: first the image data, then the structured data and last the text data.

``` python

def _normalize_images_batch(b:Tuple[Tensor,Tensor], mean:FloatTensor, std:FloatTensor)->Tuple[Tensor,Tensor]:
    "`b` = `x`,`y` - normalize only the images in `x[0]`; the tabular and text inputs and `y` are left untouched."
    x,y = b
    mean,std = mean.to(x[0].device),std.to(x[0].device)
    x[0] = normalize(x[0],mean,std)
    return x,y

def normalize_custom_funcs(mean:FloatTensor, std:FloatTensor, do_x:bool=True, do_y:bool=False)->Tuple[Callable,Callable]:
    "Create normalize/denormalize funcs using `mean` and `std`; `do_x` and `do_y` are accepted but unused."
    mean,std = tensor(mean),tensor(std)
    return (partial(_normalize_images_batch, mean=mean, std=std),
            partial(denormalize, mean=mean, std=std))
```

**collate_mixed** is the function responsible for taking a batch whose rows have variable sizes (because the Description text varies in length) and making them all the same length, so that we get uniform batches. We find the row in the batch with the longest text, take its length and bring all other rows to that length by padding them with zeros at the end.

``` python
def collate_mixed(samples, pad_idx:int=0):
    # Find max length of the text from the MixedItemList
    max_len = max([len(s[0].data[2]) for s in samples])

    for s in samples:
        # pad_idx is the fill value, not extra length (identical to the original for the default pad_idx=0)
        res = np.full(max_len, pad_idx, dtype=np.int64)
        res[:len(s[0].data[2])] = s[0].data[2]
        s[0].data[2] = res

    return data_collate(samples)
```
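
Stripped of the fastai item structure, the padding step above boils down to the following sketch. It works on plain lists of token ids; `pad_to_longest` is a hypothetical helper written for illustration and the token ids are made up.

```python
def pad_to_longest(sequences, pad_value=0):
    """Right-pad every sequence with pad_value up to the longest length."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [pad_value] * (max_len - len(seq)) for seq in sequences]

# Three "descriptions" of different lengths, as token ids.
batch = [[5, 8, 2], [7], [4, 9]]
padded = pad_to_longest(batch)
print(padded)
```

Once every row has the same length, the rows can be stacked into a single rectangular tensor, which is what the default collate function expects.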

Then we turn our MixedItemList into a databunch, passing our collate function to get equal-size batches, and we normalize the images using the custom normalize function from earlier.

In [9]:
data = mixed.databunch(bs=bs, collate_fn=collate_mixed)

norm, denorm = normalize_custom_funcs(*imagenet_stats)
data.add_tfm(norm) # normalize images

When fastai processes your structured data, it creates a new column for every column that had NaN values. This new column is True where the original column was NaN and False otherwise. If you want to use those columns, simply uncomment the next cell.

In [10]:
# cat_names = mixed.train.x.item_lists[1].cat_names
# cont_names = mixed.train.x.item_lists[1].cont_names
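
The `_na` indicator columns described above can be pictured with a small stand-alone sketch. It assumes a median fill strategy and uses a hypothetical helper, `fill_missing_with_flag`, on made-up values; it is an illustration of the idea rather than fastai's FillMissing code.

```python
from statistics import median

def fill_missing_with_flag(values):
    """Fill None entries with the median and report which entries were missing."""
    present = [v for v in values if v is not None]
    fill = median(present)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing

# Made-up sentiment scores with one missing entry.
scores = [0.1, None, 0.5, 0.3]
filled, scores_na = fill_missing_with_flag(scores)
print(filled, scores_na)
```

The flag column lets the network distinguish a pet whose value really is the median from one whose value was simply unknown.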

# Custom model
Here is the custom PyTorch model I created. It expects a list of embedding sizes, one per categorical variable (emb_szs), the number of continuous variables (n_cont), the size of the text vocabulary for the language model (vocab_sz) and finally our pre-trained language model encoder (encoder).

**self.cnn** is responsible for the image data. Notice that we use AdaptiveConcatPool2d so that any image size can be used as input.

**self.lm_encoder** is responsible for the text data.  It uses our fine-tuned language model encoder we trained in the notebook PetFinder Language Model.

**self.tab** is responsible for the structured data.  It will create embeddings for categorical variables.

**self.reduce** is simply to reduce the size of the output of the cnn to a more manageable size.

Once the data is passed through each specialist network (cnn, encoder and tabular), we concatenate their output into a single vector.

**self.merge and self.final** are then responsible for reducing this concatenated vector to a single output. AdoptionSpeed is a categorical variable with 5 unique values (0 to 4), but, as discussed earlier, we treat the problem as a regression.

**use_trainer** is set to true if we are using RNNTrainer.

The **reset** method is used to reset the internal state of the RNN in self.lm_encoder.

We output a single value for the regression and force it into the range 0-4 with a scaled sigmoid.
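
The range constraint is easy to check in isolation. The following sketch reproduces the `sigmoid(x) * 4` idea with the standard library only; `scaled_sigmoid` is a hypothetical name, and the numerically stable branch is an addition for the example, not something the model code needs.

```python
import math

def scaled_sigmoid(x, low=0.0, high=4.0):
    """Squash any real number into the open interval (low, high)."""
    # Use the form of the sigmoid that avoids overflow for large |x|.
    if x >= 0:
        s = 1.0 / (1.0 + math.exp(-x))
    else:
        z = math.exp(x)
        s = z / (1.0 + z)
    return low + (high - low) * s

for raw in (-10.0, 0.0, 10.0):
    print(raw, scaled_sigmoid(raw))
```

One consequence of this design is that the extreme values 0 and 4 are only approached asymptotically, so the rounding thresholds found later have to do the work of assigning the first and last classes.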

``` python
class ImageTabularTextModel(nn.Module):
    def __init__(self, emb_szs:ListSizes, n_cont:int, vocab_sz:int, encoder, use_trainer):
        super().__init__()
        self.use_trainer = use_trainer
        self.cnn = create_body(models.resnet34)
        nf = num_features_model(self.cnn) * 2
        drop = .5

        self.lm_encoder = SequentialRNN(encoder[0], PoolingLinearClassifier([400 * 3] + [32], [.4]))

        self.tab = TabularModel(emb_szs, n_cont, 128, [512, 256])

        self.reduce = nn.Sequential(*([AdaptiveConcatPool2d(), Flatten()] + bn_drop_lin(nf, 512, bn=True, p=drop, actn=nn.ReLU(inplace=True))))
        self.merge = nn.Sequential(*bn_drop_lin(512 + 128 + 32, 128, bn=True, p=drop, actn=nn.ReLU(inplace=True)))
        self.final = nn.Sequential(*bn_drop_lin(128, 1, bn=False, p=0., actn=None))

    def forward(self, img:Tensor, x:Tensor, text:Tensor) -> Tensor:
        imgCnn = self.cnn(img)
        imgLatent = self.reduce(imgCnn)
        tabLatent = self.tab(x[0], x[1])
        textLatent = self.lm_encoder(text)

        cat = torch.cat([imgLatent, F.relu(tabLatent), F.relu(textLatent[0])], dim=1)

        pred = self.final(self.merge(cat))
        pred = torch.sigmoid(pred) * 4 # making sure this is in the range 0-4

        if(not self.use_trainer):
            return pred
        else:
            return pred, textLatent
    
    def reset(self):
        for c in self.children():
            if hasattr(c, 'reset'): c.reset()
```

# Custom learner functions

We need a split_layers function to tell fastai how to split the layers into groups when using [discriminative learning rates](https://towardsdatascience.com/understanding-learning-rates-and-how-it-improves-performance-in-deep-learning-d0d4059c1c10). This is also what determines which layers get frozen when we call the Learner.freeze method. This one could certainly be better... Looking at how other split functions handle the pre-trained RNN and reset, we should probably structure this differently.

``` python
def split_layers(model:nn.Module) -> List[nn.Module]:
    groups = [[model.cnn, model.lm_encoder]]
    groups += [[model.tab, model.reduce, model.merge, model.final]]
    return groups
```
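
To see what the layer groups are for, here is a sketch of how a learning rate range can be spread over them. It assumes the rates are spaced geometrically between the two ends of the range, so that earlier groups train more slowly; `spread_learning_rates` is a hypothetical helper, not a fastai function.

```python
def spread_learning_rates(lowest, highest, n_groups):
    """Return one learning rate per layer group, spaced geometrically."""
    if n_groups == 1:
        return [highest]
    step = (highest / lowest) ** (1.0 / (n_groups - 1))
    return [lowest * step ** i for i in range(n_groups)]

# Two groups, as returned by split_layers, with the range used later on.
rates = spread_learning_rates(1e-6, 1e-4, 2)
print(rates)

# With three groups the middle one would sit at the geometric midpoint.
print(spread_learning_rates(1e-6, 1e-4, 3))
```

With the two groups defined above and a range of 1e-6 to 1e-4, the pre-trained cnn and lm_encoder would get the small rate while the freshly initialised layers get the large one.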

We create our custom Learner class to be able to set some custom parameters.  I added an option to use RNNTrainer which is supposed to help if the language model is overfitting.  It is based on the [AWD_LSTM paper](https://arxiv.org/abs/1708.02182).  I had to modify the default version because of how I was passing data to it.

``` python
class RNNTrainerCustom(RNNTrainer):
    def on_loss_begin(self, last_output:Tuple[Tensor,Tensor,Tensor], **kwargs):
        "Save the extra outputs for later and only returns the true output."
        self.raw_out,self.out = last_output[1][1],last_output[1][2]
        return {'last_output': last_output[0]}

class ImageTabularTextLearner(Learner):
    def __init__(self, data:DataBunch, model:nn.Module, use_trainer:bool=False, alpha:float=2., beta:float=1., **learn_kwargs):
        super().__init__(data, model, **learn_kwargs)
        if(use_trainer):
            self.callbacks.append(RNNTrainerCustom(self, alpha=alpha, beta=beta))
        self.split(split_layers)
```

Finally, a helper function constructs our model and learner. We use the text_classifier_learner function from fastai to construct a pre-trained language model into which we load our fine-tuned encoder. This function returns a learner, but we only care about the model it contains, which we use inside our own model.

This Kaggle competition is [evaluated on the quadratic weighted kappa](https://www.kaggle.com/c/petfinder-adoption-prediction/overview/evaluation). We compute it on the validation set at the end of the notebook to see how we are doing; during training the learner tracks the mean absolute error.
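
For readers unfamiliar with the metric, here is a minimal pure-Python sketch of quadratic weighted kappa for integer labels. It is written for illustration under the usual definition (squared-distance weights, expected counts from the two label histograms); it is not the quadratic_weighted_kappa function imported from the petfinder package later on, whose code is not shown here.

```python
def quadratic_weighted_kappa_sketch(y_true, y_pred, n_classes=5):
    """Agreement between two integer ratings, penalising distant errors more."""
    n = len(y_true)
    # Observed confusion matrix: rows are true labels, columns are predictions.
    observed = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes)) for j in range(n_classes)]

    numerator = 0.0
    denominator = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            weight = (i - j) ** 2 / (n_classes - 1) ** 2
            expected = hist_true[i] * hist_pred[j] / n  # counts expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0:
        # Only happens when both ratings use one and the same class everywhere.
        return 1.0
    return 1.0 - numerator / denominator

print(quadratic_weighted_kappa_sketch([0, 1, 2, 3, 4], [0, 1, 2, 3, 4]))
print(quadratic_weighted_kappa_sketch([0, 0, 4, 4], [0, 4, 0, 4]))
```

A score of 1 means perfect agreement, 0 means no better than chance, and predicting class 4 for a class 0 pet costs sixteen times more than predicting class 1.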

``` python
def image_tabular_text_learner(data, len_cont_names, vocab_sz, data_lm, use_trainer:bool=False):
    l = text_classifier_learner(data_lm, AWD_LSTM, drop_mult=0.5)
    l.load_encoder('fine_tuned_enc')

    emb = data.train_ds.x.item_lists[1].get_emb_szs()
    model = ImageTabularTextModel(emb, len_cont_names, vocab_sz, l.model, use_trainer)

    learn = ImageTabularTextLearner(data, model, use_trainer, metrics=[mae])
    return learn
```

In [11]:
learn = image_tabular_text_learner(data, len(cont_names), len(vocab.itos), data_text, use_trainer=True)

In [12]:
# learn.callback_fns +=[partial(EarlyStoppingCallback, monitor='accuracy', min_delta=0.005, patience=3)]
# learn.callback_fns += [(partial(LearnerTensorboardWriter, base_dir=Path(path + 'logs\\'), name='mixed-metadata'))]

In [13]:
data.c

1

In [14]:
# learn.lr_find()
# learn.recorder.plot()

In [15]:
lr = 1e-3

In [16]:
# learn.to_fp16 doesn't work with this model for some reason
# learn = learn.to_fp16()
learn.freeze()
learn.fit_one_cycle(2, lr, callbacks=SaveModelCallback(learn, every='improvement', mode='min', monitor='mean_absolute_error', name='mixed'))

epoch,train_loss,valid_loss,mean_absolute_error,time
0,0.42663,1.140555,0.857836,10:22
1,0.181755,1.148888,0.837642,10:13


Better model found at epoch 0 with mean_absolute_error value: 0.8578364849090576.
Better model found at epoch 1 with mean_absolute_error value: 0.8376424312591553.


In [None]:
learn.purge()

In [22]:
# learn.lr_find()
# learn.recorder.plot()

In [None]:
bs=8
data = mixed.databunch(bs=bs, collate_fn=collate_mixed)

norm, denorm = normalize_custom_funcs(*imagenet_stats)
data.add_tfm(norm) # normalize images

learn = image_tabular_text_learner(data, len(cont_names), len(vocab.itos), data_text, use_trainer=True)
# learn.callback_fns +=[partial(EarlyStoppingCallback, monitor='kappa_score', min_delta=0.005, patience=3)]
learn.load('mixed-unfrozen')

In [None]:
learn.unfreeze()
learn.fit_one_cycle(4, max_lr=slice(1e-6,1e-4), callbacks=SaveModelCallback(learn, every='improvement', mode='min', monitor='mean_absolute_error', name='mixed-unfrozen'))

epoch,train_loss,valid_loss,mean_absolute_error,root_mean_squared_error,time
0,0.252965,1.149033,0.857493,0.961548,45:28


Better model found at epoch 0 with mean_absolute_error value: 0.8574932813644409.


In [None]:
learn.load('mixed')

In [24]:
p,y = learn.get_preds(ds_type=DatasetType.Valid)

In [25]:
from petfinder.test import *

In [37]:
optR = OptimizedRounder()
optR.fit(p.numpy()[:, 0], y.numpy())
coeff = optR.coefficients()

In [69]:
preds = optR.predict(p.numpy()[:, 0], coeff).astype(int)

In [71]:
predictions = pets[pets.IsValidation == True][['PetID', 'AdoptionSpeed']]
predictions['Prediction'] = preds
predictions = predictions.groupby('PetID').mean()[['Prediction', 'AdoptionSpeed']]
# preds, y = predictions['Prediction'], predictions['AdoptionSpeed']

In [72]:
quadratic_weighted_kappa(predictions['Prediction'], predictions['AdoptionSpeed'])

0.4232357217350937

# Generating a submission for the competition

Unfortunately fastai export does not support MixedItemList yet. So to run my code on the test set I had to trick fastai into thinking that the test set is actually the validation set. I just set all labels of the test set to 0.

In [74]:
pets['IsTest'] = False
petsTest['IsTest'] = True
petsTest['AdoptionSpeed'] = 0

petsAll = pd.concat([pets, petsTest])

This is pretty much the same code as for training, but here we use .split_from_df(col='IsTest') to tell fastai that the validation set consists only of the rows in the dataframe where the column IsTest is True.

In [75]:
imgListTest = ImageList.from_df(petsAll, path=path, cols='PicturePath')
tabListTest = TabularList.from_df(petsAll, cat_names=cat_names, cont_names=cont_names, procs=procs, path=path)
textListTest = TextList.from_df(petsAll, cols='Description', path=path, vocab=vocab)

mixedTest = (MixedItemList([imgListTest, tabListTest, textListTest], path, inner_df=tabListTest.inner_df)
            .split_from_df(col='IsTest')
            .label_from_df(cols='AdoptionSpeed', label_cls=FloatList)
            .transform([[get_transforms()[0], [], []], [get_transforms()[1], [], []]], size=size))

In [95]:
dataTest = mixedTest.databunch(bs=bs, collate_fn=collate_mixed)
dataTest.add_tfm(norm) # normalize images

learn = image_tabular_text_learner(dataTest, len(cont_names), len(vocab.itos), data_text, use_trainer=True)
learn.load('mixed')

ImageTabularTextLearner(data=DataBunch;

Train: LabelList (58652 items)
x: MixedItemList
MixedItem
Image (3, 224, 224)
TabularLine Type 2; Name Nibble; Breed1 299; Breed2 0; Gender 1; Color1 1; Color2 7; Color3 0; MaturitySize 1; FurLength 1; Vaccinated 2; Dewormed 2; Sterilized 2; Health 1; State 41326; RescuerID 8480853f516546f6cf33aa88cd76c379; NoImage False; NoDescription False; AvgSentenceSentimentMagnitude_na False; AvgSentenceSentimentScore_na False; SentimentMagnitude_na False; SentimentScore_na False; Age -0.3679; Quantity -0.4497; Fee 0.9909; VideoAmt -0.2133; PhotoAmt -1.0298; RescuerDogCount -0.3862; AvgSentenceSentimentMagnitude -0.1566; AvgSentenceSentimentScore 0.0500; SentimentMagnitude -0.0702; SentimentScore 0.1080; state_gdp 0.7094; state_population 0.7996; gdp_vs_population -0.5051; 
Text xxbos xxmaj nibble is a 3 + month old ball of cuteness . xxmaj he is energetic and playful . i rescued a couple of cats a few months ago but could not get them neutered in time as 

In [107]:
preds,y = learn.get_preds(ds_type=DatasetType.Valid)

In [112]:
p,y = preds.numpy()[:, 0], y.numpy()

In [118]:
optR = OptimizedRounder()
preds = optR.predict(p, coeff).astype(int)

In [139]:
predictions = petsTest
predictions['AdoptionSpeed'] = preds
predictions = predictions.groupby('PetID').mean()['AdoptionSpeed'].reset_index()
predictions['AdoptionSpeed'] = predictions['AdoptionSpeed'].astype(int)
predictions.to_csv('submission.csv', index=False)