# "pytorch-widedeep, deep learning for tabular data I: data preprocessing, model components and basic use"
> a flexible package to combine tabular data with text and images using wide and deep models.

- author: Javier Rodriguez
- toc: true 
- badges: true
- comments: true

This is the first of two posts introducing [pytorch-widedeep](https://github.com/jrzaurin/pytorch-widedeep), which is intended to be a flexible package to use Deep Learning (hereafter DL) with tabular data and combine it with text and images via wide and deep models. `pytorch-widedeep` is partially based on Heng-Tze Cheng et al., 2016 [paper](https://arxiv.org/abs/1606.07792) [1].

In this post I describe the data preprocessing functionalities of the library, the main components of the model, and the basic use of the library. In a second post I will show more advanced uses of `pytorch-widedeep`.

Before I move any further I just want to emphasize that there are a number of libraries that implement functionalities to use DL on tabular data. To cite a few, the ubiquitous and fantastic [FastAI](https://docs.fast.ai/tutorial.tabular.html) (and their tabular API), the "rising star" [NVTabular](https://github.com/NVIDIA/NVTabular), the powerful [pytorch-tabnet](https://github.com/dreamquark-ai/tabnet) based on the work of Sercan O. Arik and Tomas Pfister [2], which is starting to take victories in Kaggle competitions, and perhaps my favourite, [AutoGluon Tabular](https://arxiv.org/abs/2003.06505) [3].

It is not my intention to "compete" against these libraries. `pytorch-widedeep` started as an attempt to package and automate an algorithm I had to use a couple of times at work and ended up becoming the entertaining process that is building a library. Needless to say that if you wanted to apply DL to tabular data you should go and check all the libraries I mentioned before (as well as this one 🙂). 

## 1. Installation 

To install the package simply use pip:

```bash
pip install pytorch-widedeep
```

or install it directly from GitHub:

```bash
pip install git+https://github.com/jrzaurin/pytorch-widedeep.git
```

**Important note for Mac Users**

Note that the following comments are not directly related to the package, but to the interplay between `pytorch` and `OSX` (more precisely `pytorch`'s dependency on `OpenMP` I believe) and in general parallel processing in Mac. 

In the first place, at the time of writing the latest `pytorch` version is `1.7`. This version is known to have some [issues](https://stackoverflow.com/questions/64772335/pytorch-w-parallelnative-cpp206) when running on Mac and the data-loaders might not run in parallel. 

On the other hand, since `Python 3.8` the default start method of the `multiprocessing` library on macOS changed from ['fork' to 'spawn'](https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods). This also affects the data-loaders (for any torch version) and they will not run in parallel. 
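
If you are not sure which start method your interpreter is using, the standard library can tell you. The snippet below is only a diagnostic sketch: it reports the current default and does not change how the data-loaders behave.

```python
import multiprocessing as mp
import platform
import sys

# Report the interpreter version, the operating system and the default start method.
# On macOS with Python >= 3.8 the default is expected to be "spawn".
start_method = mp.get_start_method()
python_version = f"{sys.version_info.major}.{sys.version_info.minor}"
print(f"python {python_version} on {platform.system()}: start method = {start_method}")
```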

Therefore, for Mac users I suggest using `python 3.7` and `torch <= 1.6` (with its corresponding `torchvision` version, i.e. `<= 0.7.0`). I could have enforced this versioning via the `setup.py` file. However, there are a number of unknowns and I preferred to leave it as it is. For example, I developed the package using *macOS Catalina* and maybe some of these issues are not present in the new release *Big Sur*. Also, I hope that a patch for `pytorch 1.7` is released soon and that some, if not all, of these problems disappear. 

Installing `pytorch-widedeep` via `pip` will install the latest version. Therefore, if these problems are present and the dataloaders do not run in parallel, one can easily downgrade manually: 

```bash
pip install torch==1.6.0 torchvision==0.7.0
```

*None of these issues affect Linux users*

## 2. `pytorch-widedeep`  DL Architectures

As I mentioned earlier `pytorch-widedeep` combines tabular data with text and images via wide and deep models. 

With that in mind, the two main architectures that can be built with a few lines of code using `pytorch-widedeep` are:

![](figures/pytorch-widedeep/arch_1.png)

**Architecture 1**: architecture 1 combines the `Wide` linear model with the outputs from the `DeepDense` or `DeepDenseResnet`, `DeepText` and `DeepImage` components, all connected to a final output neuron or neurons, depending on whether we are performing a binary classification or regression, or a multi-class classification. The components within the faded-pink rectangles are concatenated. Later in the post I will describe each of the individual components in detail.

In math terms, and following the notation in the [paper](https://arxiv.org/abs/1606.07792), Architecture 1 can be formulated as:

$$
preds = \sigma(W^{T}_{wide}[x, \phi(x)] + W^{T}_{deepdense}a^{(l_f)}_{dense} + W^{T}_{deeptext}a^{(l_f)}_{text} + W^{T}_{deepimage}a^{(l_f)}_{image} + b) 
$$

Where $W$ are the weight matrices applied to the wide model and to the final activations of the deep models, '$a$' are the final activations, and $\phi(x)$ are the cross product transformations of the original features '$x$'. In case you are wondering what are *"cross product transformations"*, here is a quote taken directly from the paper: *"For binary features, a cross-product transformation (e.g., “AND(gender=female, language=en)”) is 1 if and only if the constituent features (“gender=female” and “language=en”) are all 1, and 0 otherwise"*.
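
The quote translates almost literally into code. The helper below is a hypothetical illustration of the definition (the name `cross_product` and the dictionary-based row are my own choices for this sketch, not part of the library):

```python
def cross_product(row: dict, constituents: list) -> int:
    """Return 1 only if every (feature, value) pair in `constituents` holds for `row`."""
    return int(all(row.get(feature) == value for feature, value in constituents))


# AND(gender=female, language=en), as in the example quoted from the paper
and_female_en = [("gender", "female"), ("language", "en")]

both_true = cross_product({"gender": "female", "language": "en"}, and_female_en)
one_false = cross_product({"gender": "female", "language": "es"}, and_female_en)
print(both_true, one_false)
```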

![](figures/pytorch-widedeep/arch_2.png)

**Architecture 2**: architecture 2 combines the `Wide` linear model with the `Deep` components of the model connected to the output neuron(s), after the different `Deep` components have been themselves combined through a Fully-Connected-Head (hereafter FC-Head), which I refer to as `DeepHead`.

In math terms, and following the notation in the [paper](https://arxiv.org/abs/1606.07792), Architecture 2 can be formulated as:

$$
preds = \sigma(W^{T}_{wide}[x, \phi(x)] + W^{T}_{deephead}a^{(l_f)}_{deephead} + b)
$$
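
To make the difference between the two formulas concrete, here is a toy sketch in plain Python with made-up numbers. Everything in it is hypothetical (the weights, the activations and the helper names `sigmoid` and `dot`), and the FC-Head is reduced to a single linear layer followed by a ReLU, which is a simplification for illustration only.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def dot(weights: list, values: list) -> float:
    return sum(w * v for w, v in zip(weights, values))


# Toy inputs: wide features [x, phi(x)] and the final activations of two deep components.
x_wide = [1.0, 0.0, 1.0]
a_dense = [0.5, -0.2]
a_text = [0.1, 0.3]

# Made-up weights and bias.
w_wide = [0.2, -0.1, 0.4]
w_dense = [0.3, 0.6]
w_text = [-0.5, 0.2]
b = 0.1

# Architecture 1: each component is projected onto the output neuron and the results are summed.
logit_1 = dot(w_wide, x_wide) + dot(w_dense, a_dense) + dot(w_text, a_text) + b
pred_1 = sigmoid(logit_1)

# Architecture 2: the deep activations are first concatenated and passed through a head,
# and only the head activations are projected onto the output neuron.
head_weights = [[0.1, 0.2, 0.3, 0.4], [-0.4, 0.3, 0.2, -0.1]]
concatenated = a_dense + a_text
a_deephead = [max(0.0, dot(row, concatenated)) for row in head_weights]
w_deephead = [0.7, -0.3]
logit_2 = dot(w_wide, x_wide) + dot(w_deephead, a_deephead) + b
pred_2 = sigmoid(logit_2)

print(round(pred_1, 4), round(pred_2, 4))
```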

It is important to mention that each individual component, `wide`, `deepdense` (either `DeepDense` or `DeepDenseResnet`), `deeptext` and `deepimage`, can be used independently and in isolation. For example, one could use only `wide`, which is simply a linear model. Or use `DeepDense`, which is in essence a similar implementation to that of the [Tabular](https://docs.fast.ai/tabular.learner) API in the `fastai` library (which I strongly recommend).

## 3. Quick start (TL;DR)

Maybe I should have started with this section, but I thought that knowing at least the architectures one can build with `pytorch-widedeep` was "kind of" necessary. In any case, and before diving into the details of the library, let's say that you just want to quickly run one example and get a feel for how `pytorch-widedeep` works. Let's do so using the [adult census dataset](http://archive.ics.uci.edu/ml/datasets/Adult). 

In this example we will be fitting a model composed of two components: `Wide` and `DeepDense`.

In [1]:
#collapse-hide
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [2]:
#collapse-hide
adult = pd.read_csv("data/adult/adult.csv.zip")
adult.columns = [c.replace("-", "_") for c in adult.columns]
adult["income_label"] = (adult["income"].apply(lambda x: ">50K" in x)).astype(int)
adult.drop("income", axis=1, inplace=True)

for c in adult.columns:
    if adult[c].dtype == 'O':
        adult[c] = adult[c].apply(lambda x: "unknown" if x == "?" else x)
        adult[c] = adult[c].str.lower()

In [3]:
adult_train, adult_test = train_test_split(adult, test_size=0.2, stratify=adult.income_label)

adult.head()

| | age | workclass | fnlwgt | education | educational_num | marital_status | occupation | relationship | race | gender | capital_gain | capital_loss | hours_per_week | native_country | income_label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | private | 226802 | 11th | 7 | never-married | machine-op-inspct | own-child | black | male | 0 | 0 | 40 | united-states | 0 |
| 1 | 38 | private | 89814 | hs-grad | 9 | married-civ-spouse | farming-fishing | husband | white | male | 0 | 0 | 50 | united-states | 0 |
| 2 | 28 | local-gov | 336951 | assoc-acdm | 12 | married-civ-spouse | protective-serv | husband | white | male | 0 | 0 | 40 | united-states | 1 |
| 3 | 44 | private | 160323 | some-college | 10 | married-civ-spouse | machine-op-inspct | husband | black | male | 7688 | 0 | 40 | united-states | 1 |
| 4 | 18 | unknown | 103497 | some-college | 10 | never-married | unknown | own-child | white | female | 0 | 0 | 30 | united-states | 0 |


The lines below are all you need:

In [4]:
from pytorch_widedeep.preprocessing import WidePreprocessor, DensePreprocessor
from pytorch_widedeep.models import Wide, DeepDense, WideDeep
from pytorch_widedeep.metrics import Accuracy

# define wide, crossed, embedding and continuous columns, and target
wide_cols = ["education", "relationship", "workclass", "occupation", "native_country", "gender"]
cross_cols = [("education", "occupation"), ("native_country", "occupation")]
embed_cols = [("education", 10), ("workclass", 10), ("occupation", 10), ("native_country", 10)]
cont_cols = ["age", "hours_per_week"]
target = adult_train["income_label"].values

# prepare wide component
preprocess_wide = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = preprocess_wide.fit_transform(adult_train)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)

# prepare deepdense component
preprocess_deep = DensePreprocessor(embed_cols=embed_cols, continuous_cols=cont_cols)
X_deep = preprocess_deep.fit_transform(adult_train)
deepdense = DeepDense(hidden_layers=[64, 32], deep_column_idx=preprocess_deep.deep_column_idx, 
                      embed_input=preprocess_deep.embeddings_input, continuous_cols=cont_cols)

# build, compile and fit
model = WideDeep(wide=wide, deepdense=deepdense)
model.compile(method="binary", metrics=[Accuracy])
model.fit(X_wide=X_wide, X_deep=X_deep, target=target, n_epochs=2, batch_size=256) 

# predict
X_wide_te = preprocess_wide.transform(adult_test)
X_deep_te = preprocess_deep.transform(adult_test)
preds = model.predict(X_wide=X_wide_te, X_deep=X_deep_te)

  0%|          | 0/153 [00:00<?, ?it/s]

Training


epoch 1: 100%|██████████| 153/153 [00:03<00:00, 46.74it/s, loss=0.571, metrics={'acc': 0.7135}]
epoch 2: 100%|██████████| 153/153 [00:02<00:00, 51.37it/s, loss=0.395, metrics={'acc': 0.8159}]
predict: 100%|██████████| 39/39 [00:00<00:00, 182.93it/s]


## 4. Preprocessors

As you can see in Section 3, and as with any ML algorithm, the data need to be prepared/preprocessed before going through the model. This is handled internally by the `pytorch-widedeep` preprocessors. There is one preprocessor per `WideDeep` model component:

```
WidePreprocessor
DensePreprocessor
TextPreprocessor
ImagePreprocessor
```

"Behind the scenes", these preprocessors use a series of helper functions and classes that are in the `utils` module. Initially I did not intend to "expose" them to the user, but I believe they can be useful for all sorts of preprocessing tasks, so I made them available. The `utils` tools are:

```
deep_utils.LabelEncoder
text_utils.simple_preprocess
text_utils.get_texts
text_utils.pad_sequences
text_utils.build_embeddings_matrix
fastai_transforms.Tokenizer
fastai_transforms.Vocab
image_utils.SimplePreprocessor
image_utils.AspectAwarePreprocessor
```

They are accessible directly from `utils`, e.g.:

```python
from pytorch_widedeep.utils import LabelEncoder
```

Note that here I will be concentrating directly on the preprocessors. If you want more details on the `utils` tools, have a look at the [source code](https://github.com/jrzaurin/pytorch-widedeep/tree/master/pytorch_widedeep/utils) or read the [documentation](https://pytorch-widedeep.readthedocs.io/en/latest/index.html).

### 4.1. `WidePreprocessor`

The Wide component of the model is a linear model that, in principle, could be implemented as a linear layer receiving the one-hot encoded categorical columns. However, this is not memory efficient (at all). Therefore, we implement the linear layer as an Embedding layer plus a bias. I will explain it in a bit more detail later. For now, just know that `WidePreprocessor` simply encodes the categories numerically so that they are the indexes of the lookup table that is an Embedding layer.

In [5]:
from pytorch_widedeep.preprocessing import WidePreprocessor

wide_cols = ['education', 'relationship','workclass','occupation','native_country','gender']
crossed_cols = [('education', 'occupation'), ('native_country', 'occupation')]

wide_preprocessor = WidePreprocessor(wide_cols=wide_cols, crossed_cols=crossed_cols)
X_wide = wide_preprocessor.fit_transform(adult)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_wide = wide_preprocessor.transform(new_df)

In [6]:
X_wide

array([[  1,  17,  23, ...,  89,  91, 316],
       [  2,  18,  23, ...,  89,  92, 317],
       [  3,  18,  24, ...,  89,  93, 318],
       ...,
       [  2,  20,  23, ...,  90, 103, 323],
       [  2,  17,  23, ...,  89, 103, 323],
       [  2,  21,  29, ...,  90, 115, 324]])

In [7]:
X_wide[0]

array([  1,  17,  23,  32,  47,  89,  91, 316])

In [8]:
wide_preprocessor.inverse_transform(X_wide[:1])

| | education | relationship | workclass | occupation | native_country | gender | education_occupation | native_country_occupation |
|---|---|---|---|---|---|---|---|---|
| 0 | 11th | own-child | private | machine-op-inspct | united-states | male | 11th-machine-op-inspct | united-states-machine-op-inspct |


As we can see, `wide_preprocessor` numerically encodes the `wide_cols` and the `crossed_cols`, which can be recovered using the method `inverse_transform`.
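
To give an intuition of what this encoding does, here is a minimal pure-Python sketch. It is not the library's implementation: the helpers `wide_keys`, `fit_wide_encoding` and `transform_wide` are hypothetical names, and the key format is an assumption. The idea it illustrates is that all columns (crossed columns included) share a single index space that starts at 1, with 0 reserved for values not seen during fitting.

```python
def wide_keys(row, wide_cols, crossed_cols):
    """Build one 'column_value' key per wide column and per crossed column."""
    keys = [f"{col}_{row[col]}" for col in wide_cols]
    keys += [f"{a}_{b}_{row[a]}-{row[b]}" for a, b in crossed_cols]
    return keys


def fit_wide_encoding(rows, wide_cols, crossed_cols):
    """Assign a unique integer to every key, column by column, starting from 1."""
    encoding = {}
    n_keys = len(wide_cols) + len(crossed_cols)
    for position in range(n_keys):
        for row in rows:
            key = wide_keys(row, wide_cols, crossed_cols)[position]
            encoding.setdefault(key, len(encoding) + 1)
    return encoding


def transform_wide(rows, encoding, wide_cols, crossed_cols):
    """Replace every key by its integer; unseen keys are mapped to 0."""
    return [[encoding.get(key, 0) for key in wide_keys(row, wide_cols, crossed_cols)]
            for row in rows]


rows = [
    {"education": "11th", "gender": "male"},
    {"education": "hs-grad", "gender": "male"},
    {"education": "11th", "gender": "female"},
]
toy_wide_cols = ["education", "gender"]
toy_crossed_cols = [("education", "gender")]

encoding = fit_wide_encoding(rows, toy_wide_cols, toy_crossed_cols)
encoded = transform_wide(rows, encoding, toy_wide_cols, toy_crossed_cols)
unseen = transform_wide([{"education": "phd", "gender": "male"}], encoding,
                        toy_wide_cols, toy_crossed_cols)
print(encoded, unseen)
```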

### 4.2 `DensePreprocessor`

Simply, `DensePreprocessor` label-encodes the categorical columns and normalizes the numerical ones (unless otherwise specified).

In [9]:
from pytorch_widedeep.preprocessing import DensePreprocessor

# cat_embed_cols = [(column_name, embed_dim), ...]
cat_embed_cols = [('education',10), ('relationship',8), ('workclass',10), ('occupation',10),('native_country',10)]
continuous_cols = ["age","hours_per_week"]

deep_preprocessor = DensePreprocessor(embed_cols=cat_embed_cols, continuous_cols=continuous_cols)
X_deep = deep_preprocessor.fit_transform(adult)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_deep = deep_preprocessor.transform(new_df)

In [10]:
print(X_deep[:5])

[[ 0.          0.          0.          0.          0.         -0.99512893
  -0.03408696]
 [ 1.          1.          0.          1.          0.         -0.04694151
   0.77292975]
 [ 2.          1.          1.          2.          0.         -0.77631645
  -0.03408696]
 [ 3.          1.          0.          0.          0.          0.39068346
  -0.03408696]
 [ 3.          0.          2.          3.          0.         -1.50569139
  -0.84110367]]


Behind the scenes, `DensePreprocessor` uses [LabelEncoder](https://pytorch-widedeep.readthedocs.io/en/latest/utils/dense_utils.html), simply a custom numerical encoder for categorical features, available via

```python
from pytorch_widedeep.utils import LabelEncoder
```
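
For intuition, the two operations can be sketched in a few lines of plain Python. This is a simplified illustration, not the library's code: `label_encode` and `standardize` are hypothetical helpers, and the use of the population standard deviation is an assumption on my part.

```python
from statistics import mean, pstdev


def label_encode(values):
    """Assign integers to categories in order of first appearance."""
    mapping = {}
    for value in values:
        mapping.setdefault(value, len(mapping))
    return [mapping[value] for value in values], mapping


def standardize(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(value - mu) / sigma for value in values]


# First five rows of the adult dataset shown in Section 3
education = ["11th", "hs-grad", "assoc-acdm", "some-college", "some-college"]
age = [25, 38, 28, 44, 18]

encoded_education, mapping = label_encode(education)
scaled_age = standardize(age)
print(encoded_education)
print([round(a, 3) for a in scaled_age])
```

Note that the scaled ages will not match the values printed earlier, since here the mean and standard deviation are computed on five rows only rather than on the whole dataset.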

### 4.3. `TextPreprocessor`

This preprocessor returns the tokenized, padded sequences that will be directly "fed" to the `deeptext` component.

To illustrate the text and image preprocessors I will use a small sample of the Airbnb listing dataset, which you can get [here](http://insideairbnb.com/get-the-data.html).

In [11]:
airbnb=pd.read_csv("data/airbnb/airbnb_sample.csv")

In [12]:
texts = airbnb.description.tolist()
texts[0]

"My bright double bedroom with a large window has a relaxed feeling! It comfortably fits one or two and is centrally located just two blocks from Finsbury Park. Enjoy great restaurants in the area and easy access to easy transport tubes, trains and buses. Babies and children of all ages are welcome. Hello Everyone, I'm offering my lovely double bedroom in Finsbury Park area (zone 2) for let in a shared apartment.  You will share the apartment with me and it is fully furnished with a self catering kitchen. Two people can easily sleep well as the room has a queen size bed. I also have a travel cot for a baby for guest with small children.  I will require a deposit up front as a security gesture on both our parts and will be given back to you when you return the keys.  I trust anyone who will be responding to this add would treat my home with care and respect .  Best Wishes  Alina Guest will have access to the self catering kitchen and bathroom. There is the flat is equipped wifi internet

In [13]:
from pytorch_widedeep.preprocessing import TextPreprocessor

text_preprocessor = TextPreprocessor(text_col='description')
X_text = text_preprocessor.fit_transform(airbnb)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_text = text_preprocessor.transform(new_df)

The vocabulary contains 2192 tokens


In [14]:
print(X_text[0])

[  29   48   37  367  818   17  910   17  177   15  122  349   53  879
 1174  126  393   40  911    0   23  228   71  819    9   53   55 1380
  225   11   18  308   18 1564   10  755    0  942  239   53   55    0
   11   36 1013  277 1974   70   62   15 1475    9  943    5  251    5
    0    5    0    5  177   53   37   75   11   10  294  726   32    9
   42    5   25   12   10   22   12  136  100  145]


`TextPreprocessor` uses the utilities within the [text_utils](https://pytorch-widedeep.readthedocs.io/en/latest/utils/text_utils.html) and the [fastai_transforms](https://pytorch-widedeep.readthedocs.io/en/latest/utils/fastai_transforms.html) modules. Again, all the utilities within those modules are directly accessible from `utils`, e.g.:

```python
from pytorch_widedeep.utils import simple_preprocess, pad_sequences, build_embeddings_matrix, Tokenizer, Vocab
```
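
If you have never built such sequences by hand, the sketch below shows the general idea in plain Python: build a vocabulary, map tokens to integers and pad to a fixed length. It is a deliberately naive illustration (whitespace tokenization, hypothetical helpers `build_vocab` and `numericalize_and_pad`), and choices such as padding on the left and reserving index 1 for unknown tokens are assumptions of this sketch, not a description of the library's internals.

```python
from collections import Counter


def build_vocab(tokenized_texts, min_freq=1):
    """Map tokens to integers; 0 is the padding token and 1 the unknown token."""
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    itos = ["<pad>", "<unk>"] + sorted(t for t, c in counts.items() if c >= min_freq)
    return {token: idx for idx, token in enumerate(itos)}


def numericalize_and_pad(tokens, stoi, maxlen):
    """Convert tokens to ids, keep the last `maxlen` ids and pad on the left with 0."""
    ids = [stoi.get(token, stoi["<unk>"]) for token in tokens][-maxlen:]
    return [stoi["<pad>"]] * (maxlen - len(ids)) + ids


toy_texts = ["My bright double bedroom", "a lovely double bedroom in Finsbury Park"]
tokenized = [text.lower().split() for text in toy_texts]
stoi = build_vocab(tokenized)
toy_X_text = [numericalize_and_pad(tokens, stoi, maxlen=6) for tokens in tokenized]
print(toy_X_text)
```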

### 4.4 `ImagePreprocessor`

Finally, `ImagePreprocessor` simply resizes the images, being aware of the aspect ratio. By default they will be resized to `(224, 224, ...)`. This is because the default `deepimage` component of the model is a pre-trained `ResNet` model, which requires inputs of height and width of 224.

Let's have a look

In [15]:
from pytorch_widedeep.preprocessing import ImagePreprocessor

image_preprocessor = ImagePreprocessor(img_col='id', img_path="data/airbnb/property_picture/")
X_images = image_preprocessor.fit_transform(airbnb)
# From here on, any new observation can be prepared by simply running `.transform`
# new_X_images = image_preprocessor.transform(new_df)

Reading Images from data/airbnb/property_picture/


  8%|▊         | 79/1001 [00:00<00:02, 394.13it/s]

Resizing


100%|██████████| 1001/1001 [00:02<00:00, 396.28it/s]


Computing normalisation metrics


In [16]:
X_images[0].shape

(224, 224, 3)

`ImagePreprocessor` uses two helpers: [`SimplePreprocessor` and `AspectAwarePreprocessor`](https://pytorch-widedeep.readthedocs.io/en/latest/utils/image_utils.html), available from the `utils` module, e.g.: 

```python
from pytorch_widedeep.utils import SimplePreprocessor, AspectAwarePreprocessor
```

These two classes are directly taken from Adrian Rosebrock's fantastic book "Deep Learning for Computer Vision". Therefore, all credit to Adrian.
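
The arithmetic behind an aspect-aware resize can be sketched without any image library. The function below is my own simplified illustration of the idea (resize along the shorter side, then center-crop the longer one); `aspect_aware_plan` is a hypothetical helper, it assumes a square target, and it only computes sizes and offsets rather than touching pixels.

```python
def aspect_aware_plan(height, width, target_height=224, target_width=224):
    """Return the intermediate (height, width) after resizing and the crop offsets (d_h, d_w)."""
    if width < height:
        # Resize so that the width matches the target, then crop the excess height.
        new_width = target_width
        new_height = int(round(height * target_width / width))
        d_h, d_w = (new_height - target_height) // 2, 0
    else:
        # Resize so that the height matches the target, then crop the excess width.
        new_height = target_height
        new_width = int(round(width * target_height / height))
        d_h, d_w = 0, (new_width - target_width) // 2
    return (new_height, new_width), (d_h, d_w)


landscape = aspect_aware_plan(height=600, width=800)
portrait = aspect_aware_plan(height=800, width=600)
print(landscape, portrait)
```

After cropping `d_h` rows from the top and bottom (or `d_w` columns from each side), a final plain resize to the exact target takes care of any off-by-one pixel left by the integer division.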

## 5. Model Components

Let's now have a look at the components that can be used to build a wide and deep model. The 5 main components of `WideDeep` are:

```
Wide
DeepDense or DeepDenseResnet
DeepText
DeepImage
deephead
```

The first 4 of them will be collected and combined by the `WideDeep` "collector" class, while the 5th one is optional and added to the `WideDeep` model through its corresponding parameters: `deephead` or, alternatively, `head_layers`, `head_dropout` and `head_batchnorm`.

### 5.1. `Wide`

The wide component is a Linear layer "plugged" into the output neuron(s).

The only particularity of our implementation is that we have implemented the linear layer via an Embedding layer plus a bias. While the implementations are equivalent, the latter is faster and far more memory efficient, since we do not need to one hot encode the categorical features.

Let's have a look:

In [11]:
import torch
import pandas as pd
import numpy as np

from torch import nn

In [12]:
df = pd.DataFrame({'color': ['r', 'b', 'g'], 'size': ['s', 'n', 'l']})
df.head()

| | color | size |
|---|---|---|
| 0 | r | s |
| 1 | b | n |
| 2 | g | l |


One-hot encoded, the first observation (`color: r, size: s`) would be:

In [13]:
obs_0_oh = (np.array([1., 0., 0., 1., 0., 0.])).astype('float32')

If we simply numerically encode (or label encode) the values:

In [14]:
obs_0_le = (np.array([0, 3])).astype('int64')

Note that in the implementation of the package we start from 1, saving 0 for padding, i.e. unseen values.

Now, let's see if the two implementations are equivalent

In [15]:
# we have 6 different values. Let's assume we are performing a regression, so pred_dim = 1
lin = nn.Linear(6, 1)

In [16]:
emb = nn.Embedding(6, 1) 
emb.weight = nn.Parameter(lin.weight.reshape_as(emb.weight))

In [17]:
lin(torch.tensor(obs_0_oh))

tensor([0.0875], grad_fn=<AddBackward0>)

In [18]:
emb(torch.tensor(obs_0_le)).sum() + lin.bias

tensor([0.0875], grad_fn=<AddBackward0>)
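
Written out for this toy example, with zero-based indices and $w_i$ denoting the weights shared by the linear and the embedding layers, the equivalence checked above is simply:

$$
W^{T}x_{oh} + b = \sum_{i=0}^{5} w_i \, x_{oh,i} + b = w_0 + w_3 + b
$$

The one-hot vector only selects the weights at the positions of the active categories (0 and 3 here), which is exactly what a lookup in the embedding table followed by a sum does.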

And this is precisely how the linear component `Wide` is implemented:

In [19]:
from pytorch_widedeep.models import Wide
wide = Wide(wide_dim=10, pred_dim=1)
wide

Wide(
  (wide_linear): Embedding(11, 1, padding_idx=0)
)

Again, let me emphasize that even though the input dim is 10, the `Embedding` layer has 11 weights. This is because we save 0 for padding, which is used for unseen values during the encoding process.

### 5.2. `DeepDense` and `DeepDenseResnet`

There are two alternatives for the so called `deepdense` component of the model: `DeepDense` and `DeepDenseResnet`.

`DeepDense` consists of a stack of dense layers that receive the embedding representation of the categorical features concatenated with the numerical continuous features (normalized unless otherwise specified). For those familiar with `fastai`'s tabular API, `DeepDense` is almost identical to their [tabular model](https://github.com/fastai/fastai/blob/f633356359a29f8d869ce36659f7aa25660e946a/fastai/tabular/model.py#L28).

`DeepDenseResnet` is similar to `DeepDense` but instead of dense layers, the embedding representation of the categorical features concatenated with numerical continuous features are passed through a series of dense `ResNet` layers. Each basic block comprises the following operations:

<img src="figures/pytorch-widedeep/resnet_block.png" alt="resnet_block" width="400"/>

Let's have a look first at `DeepDense`:

In [20]:
from pytorch_widedeep.models import DeepDense

# fake dataset
X_deep = torch.cat((torch.empty(5, 4).random_(4), torch.rand(5, 1)), axis=1)
colnames = ['a', 'b', 'c', 'd', 'e']
embed_input = [(u,i,j) for u,i,j in zip(colnames[:4], [4]*4, [8]*4)]
deep_column_idx = {k:v for v,k in enumerate(colnames)}
continuous_cols = ['e']

# my advice would be not to use dropout in the last layer, but I add the option because you never
# know... there are crazy people everywhere.
deepdense = DeepDense(hidden_layers=[16,8], dropout=[0.5, 0.], batchnorm=True, deep_column_idx=deep_column_idx,
                      embed_input=embed_input, continuous_cols=continuous_cols)
deepdense

DeepDense(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(4, 8)
    (emb_layer_b): Embedding(4, 8)
    (emb_layer_c): Embedding(4, 8)
    (emb_layer_d): Embedding(4, 8)
  )
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (dense): Sequential(
    (dense_layer_0): Sequential(
      (0): Linear(in_features=33, out_features=16, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Dropout(p=0.5, inplace=False)
    )
    (dense_layer_1): Sequential(
      (0): Linear(in_features=16, out_features=8, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Dropout(p=0.0, inplace=False)
    )
  )
)

In [21]:
deepdense(X_deep)

tensor([[ 1.5131, -0.2351,  1.9998,  1.9794, -0.7550, -0.7819, -0.5403, -0.7078],
        [-1.0226, -0.1453, -0.4958, -0.5826,  1.7906,  1.1094,  1.6101,  0.1329],
        [-1.0305, -1.3890, -0.4859, -0.5942,  0.4141, -0.8311, -0.9921, -0.5953],
        [-0.2079,  1.7296, -0.5191, -0.5849, -0.7293, -0.8299,  0.7122, -0.7265],
        [ 0.7479,  0.0398, -0.4990, -0.2177, -0.7204,  1.3335, -0.7899,  1.8967]],
       grad_fn=<NativeBatchNormBackward>)

Let's now have a look at `DeepDenseResnet`:

In [22]:
from pytorch_widedeep.models import DeepDenseResnet

deepdense = DeepDenseResnet(blocks=[16,8], dropout=0.5, deep_column_idx=deep_column_idx,
                      embed_input=embed_input, continuous_cols=continuous_cols)

deepdense

DeepDenseResnet(
  (embed_layers): ModuleDict(
    (emb_layer_a): Embedding(4, 8)
    (emb_layer_b): Embedding(4, 8)
    (emb_layer_c): Embedding(4, 8)
    (emb_layer_d): Embedding(4, 8)
  )
  (embed_dropout): Dropout(p=0.0, inplace=False)
  (dense_resnet): Sequential(
    (lin1): Linear(in_features=33, out_features=16, bias=True)
    (bn1): BatchNorm1d(16, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (block_0): BasicBlock(
      (lin1): Linear(in_features=16, out_features=8, bias=True)
      (bn1): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (leaky_relu): LeakyReLU(negative_slope=0.01, inplace=True)
      (dp): Dropout(p=0.5, inplace=False)
      (lin2): Linear(in_features=8, out_features=8, bias=True)
      (bn2): BatchNorm1d(8, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (resize): Sequential(
        (0): Linear(in_features=16, out_features=8, bias=True)
        (1): BatchNorm1d(8, eps=1e-05, m

In [23]:
deepdense(X_deep)

tensor([[-1.0085e-02,  1.0810e+00, -9.7679e-03, -1.1307e-02,  4.4197e-01,
          5.4999e-01, -4.2809e-03, -1.1828e-02],
        [ 1.9041e+00, -1.1581e-02,  1.8792e+00,  7.7701e-01,  5.2448e-01,
         -5.5533e-03,  1.6126e+00,  1.5220e+00],
        [ 1.2326e+00, -1.6174e-03,  1.1446e+00,  1.6528e+00, -1.5239e-02,
         -8.0476e-03,  3.3760e-02,  1.1949e+00],
        [-2.9320e-02,  8.9202e-01,  9.6470e-01, -1.9520e-02, -2.7348e-03,
          3.3712e+00,  6.0935e-01,  5.4437e-01],
        [ 8.0375e-01, -6.5314e-03, -3.0117e-02,  6.5286e-01,  8.3092e-01,
         -2.5611e-02, -1.8277e-02, -2.0785e-02]], grad_fn=<LeakyReluBackward1>)

### 5.3. `DeepText`

The `DeepText` class within the `WideDeep` package is a standard and simple stack of LSTMs on top of word embeddings. You could also add a FC-Head on top of the LSTMs. The word embeddings can be pre-trained. 

In the future I aim to include full pre-trained models (maybe bringing the [huggingface](https://github.com/huggingface) library, [ULMFiT](https://arxiv.org/abs/1801.06146) [4] or [SHA-RNN](https://arxiv.org/abs/1911.11423) [5]) so that the combination between text and images is "fair" (since for the image model one can choose full pre-trained models, as we will see in the next sub-section).

On the other hand, while I recommend using the `Wide` and `DeepDense` classes within `pytorch-widedeep` when building the corresponding model components, it is very likely that the user will want to use custom text and image models. That is perfectly possible. Simply, build them and pass them as the corresponding parameters. Note that the custom models **MUST** return a last layer of activations (i.e. not the final prediction) so that these activations are collected by `WideDeep` and combined accordingly. In addition, the models **MUST** also contain an attribute `output_dim` with the size of these last layers of activations.
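
As a rough idea of what such a custom component could look like, here is a minimal sketch of a text model that satisfies the two requirements above. `MyTextModel` is a hypothetical class made up for this example (mean-pooled word embeddings followed by one dense layer); it only depends on `torch`, and how it is then passed to `WideDeep` (I would expect via the `deeptext` parameter) is left for the second post.

```python
import torch
from torch import nn


class MyTextModel(nn.Module):
    """Hypothetical custom text component: mean-pooled word embeddings + one dense layer."""

    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.dense = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())
        # Requirement 2: expose the size of the last layer of activations.
        self.output_dim = hidden_dim

    def forward(self, X: torch.Tensor) -> torch.Tensor:
        embedded = self.word_embed(X.long())  # (batch, seq_len, embed_dim)
        pooled = embedded.mean(dim=1)         # (batch, embed_dim)
        # Requirement 1: return activations, not a final prediction.
        return self.dense(pooled)             # (batch, hidden_dim)


custom_text = MyTextModel(vocab_size=4, embed_dim=4, hidden_dim=8)
toy_tokens = torch.cat((torch.zeros([5, 1]), torch.empty(5, 4).random_(1, 4)), dim=1)
activations = custom_text(toy_tokens)
print(activations.shape, custom_text.output_dim)
```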

I will illustrate all of the above in more detail in the second post of this series.

Let's have a look at `DeepText`:

In [24]:
import torch
from pytorch_widedeep.models import DeepText

In [25]:
X_text = torch.cat((torch.zeros([5,1]), torch.empty(5, 4).random_(1,4)), axis=1)
deeptext = DeepText(vocab_size=4, hidden_dim=4, n_layers=1, padding_idx=0, embed_dim=4)
deeptext

DeepText(
  (word_embed): Embedding(4, 4, padding_idx=0)
  (rnn): LSTM(4, 4, batch_first=True)
)

In [26]:
deeptext(X_text)

tensor([[ 0.3217,  0.0459,  0.2314,  0.2049],
        [ 0.0146,  0.0818, -0.3678,  0.2232],
        [-0.0175,  0.1433, -0.3312,  0.2409],
        [ 0.0154,  0.1384, -0.3101,  0.2398],
        [ 0.1574,  0.0884, -0.2047,  0.2322]], grad_fn=<SelectBackward>)

You could, if you wanted, add a Fully Connected Head (FC-Head) on top of it:

In [27]:
deeptext = DeepText(vocab_size=4, hidden_dim=8, n_layers=1, padding_idx=0, embed_dim=4, 
                    head_layers=[8,4], head_batchnorm=True, head_dropout=[0.5, 0.5])

In [28]:
deeptext

DeepText(
  (word_embed): Embedding(4, 4, padding_idx=0)
  (rnn): LSTM(4, 8, batch_first=True)
  (texthead): Sequential(
    (dense_layer_0): Sequential(
      (0): Linear(in_features=8, out_features=4, bias=True)
      (1): LeakyReLU(negative_slope=0.01, inplace=True)
      (2): BatchNorm1d(4, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (3): Dropout(p=0.5, inplace=False)
    )
  )
)

In [29]:
deeptext(X_text)

tensor([[ 0.2332, -0.0000,  0.0000,  3.3695],
        [-0.0000,  0.2724, -0.0000, -0.0000],
        [ 0.0000, -0.0000, -1.1019, -0.0000],
        [-0.0000,  0.0640, -0.0000, -0.7391],
        [ 0.0000, -0.0000, -0.0000,  0.0000]], grad_fn=<MulBackward0>)

### 5.4. `DeepImage`

The `DeepImage` class within the `WideDeep` package builds either a pre-trained `ResNet` (18, 34, or 50; the default is 18) or a stack of CNNs, to which one can add a FC-Head. If it is a pre-trained ResNet, you can choose how many layers deep into the network you want to unfreeze with the parameter `freeze`. Read [here](https://pytorch-widedeep.readthedocs.io/en/latest/model_components.html#pytorch_widedeep.models.deep_image.DeepImage) for more details. 

In [32]:
from pytorch_widedeep.models import DeepImage

X_img = torch.rand((2,3,224,224))
deepimage = DeepImage(pretrained=False, head_layers=[512, 64, 8])

deepimage

DeepImage(
  (backbone): Sequential(
    (0): Sequential(
      (0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
      (1): BatchNorm2d(64, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.1, inplace=True)
      (maxpool): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
    )
    (1): Sequential(
      (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(128, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.1, inplace=True)
    )
    (2): Sequential(
      (0): Conv2d(128, 256, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(256, eps=1e-05, momentum=0.01, affine=True, track_running_stats=True)
      (2): LeakyReLU(negative_slope=0.1, inplace=True)
    )
    (3): Sequential(
      (0): Conv2d(256, 512, kernel_size=(1, 1), stride=(1, 1))
      (1): BatchNorm2d(512, eps=1e-05, momentum=0.01, affine=T

In [33]:
deepimage(X_img)

tensor([[ 0.2624, -0.0012,  0.1050, -0.0005, -0.0006,  0.1206, -0.0026,  0.0511],
        [ 0.2639, -0.0012,  0.1068, -0.0005, -0.0006,  0.1228, -0.0026,  0.0504]],
       grad_fn=<LeakyReluBackward1>)

### 5.5. `deephead`

Note that I do not use uppercase here. This is because, by default, the `deephead` is not necessarily defined outside `WideDeep` as a class on its own, like the rest of the components.

When defining the `WideDeep` model there is a parameter called `head_layers` (and the corresponding `head_dropout` and `head_batchnorm`) that defines the FC-head on top of `DeepDense`, `DeepText` and `DeepImage`.

Of course, you could also choose to define it yourself externally and pass it using the parameter `deephead`. Have a look at the [documentation](https://pytorch-widedeep.readthedocs.io/en/latest/wide_deep.html).

## 6. Conclusion

This is the first of 2 posts introducing the python library `pytorch-widedeep`. This library is intended to be a flexible framework to combine tabular data with text and images via wide and deep models. Of course, it can also be used directly on "traditional" tabular data, without text and/or images, as we saw in Section 3.

In this post I have shown how to quickly start using the library (Section 3) and explained the utilities available in the `preprocessing` module (Section 4) and the model component definitions (Section 5), available in the `models` module. 

In the next post I will show more advanced uses that hopefully will illustrate `pytorch-widedeep`'s flexibility to build wide and deep models. 

#### References

[1] Wide & Deep Learning for Recommender Systems. Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, et al., 2016. [arXiv:1606.07792](https://arxiv.org/abs/1606.07792)

[2] TabNet: Attentive Interpretable Tabular Learning. Sercan O. Arik, Tomas Pfister, 2020. [arXiv:1908.07442](https://arxiv.org/abs/1908.07442)

[3] AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data. Nick Erickson, Jonas Mueller, Alexander Shirkov, et al., 2020. [arXiv:2003.06505](https://arxiv.org/abs/2003.06505)

[4] Universal Language Model Fine-tuning for Text Classification. Jeremy Howard, Sebastian Ruder, 2018. [arXiv:1801.06146v5](https://arxiv.org/abs/1801.06146)

[5] Single Headed Attention RNN: Stop Thinking With Your Head. Stephen Merity, 2019. [arXiv:1911.11423v2](https://arxiv.org/abs/1911.11423)