[(Source)](https://medium.com/modern-nlp/productionizing-nlp-models-9a2b8a0c7d14)

## Cost optimisations
### Caching

Unlike word2vec and GloVe, which are non-contextual embeddings with a fixed vocabulary,
language models like ELMo and BERT are contextual: the embedding of a word depends on
the text around it, so there is no fixed lookup table of vectors. The downside is that
embeddings have to be computed by running the model on every request. This became a
real problem for us, as we saw heavy CPU spikes from model inference.

Since our text phrases had an average length of 5 and were repetitive in
nature, we cached the embedding of each phrase to avoid recomputing it.
Just by adding this small wrapper to our code, we got a 20x speedup 🏄

```
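# Earlier: every call ran the model
language_model.get_sentence_embedding(sentence)

# Later: repeated sentences are served from an LRU cache
from cachetools import LRUCache, cached

@cached(cache=LRUCache(maxsize=10000))
def get_sentence_embedding(sentence):
    # The sentence string is the cache key; the model runs only on a miss
    return language_model.get_sentence_embedding(sentence)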
```

### Cache size optimisation

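Each entry in the **LRU** (Least Recently Used) cache holds an embedding in memory, so the smaller the better.
(Lookups in a hash-map-backed LRU cache are $O(1)$ on average rather than $O(\log n)$, so memory, not lookup time, is the cost that grows with size.)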
But we also want to cache as much as possible, so the bigger the better.
This meant we had to tune the cache's maxsize empirically.
We found 50000 to be the sweet spot for us.
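
One way to run this kind of empirical tuning offline is to replay a log of phrases through a simulated LRU cache at several sizes and compare hit rates. The sketch below is only an illustration, not the script we used: `lru_hit_rate` and `synthetic_traffic` are hypothetical helpers, the cache is simulated with the standard library's `OrderedDict` instead of `cachetools`, and the skewed synthetic traffic is an assumption standing in for real request logs.

```python
import random
from collections import OrderedDict


def lru_hit_rate(phrases, maxsize):
    """Replay phrases through a simulated LRU cache; return the fraction of hits."""
    cache = OrderedDict()  # keys only; the real cache would store embeddings
    hits = 0
    for phrase in phrases:
        if phrase in cache:
            hits += 1
            cache.move_to_end(phrase)  # mark as most recently used
        else:
            cache[phrase] = None
            if len(cache) > maxsize:
                cache.popitem(last=False)  # evict the least recently used key
    return hits / len(phrases)


def synthetic_traffic(n_requests, n_distinct, seed=0):
    """Hypothetical skewed traffic: a few phrases are requested far more often."""
    rng = random.Random(seed)
    weights = [1.0 / (rank + 1) for rank in range(n_distinct)]
    ids = rng.choices(range(n_distinct), weights=weights, k=n_requests)
    return [f"phrase {i}" for i in ids]


if __name__ == "__main__":
    # Replace synthetic_traffic with phrases taken from real request logs.
    traffic = synthetic_traffic(n_requests=50_000, n_distinct=20_000)
    for size in (100, 1_000, 10_000, 50_000):
        print(f"maxsize={size:>6}  hit rate={lru_hit_rate(traffic, size):.3f}")
```

The size at which the hit rate stops improving noticeably is a reasonable candidate for maxsize, to be weighed against the memory each cached embedding takes.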

### Revised load testing method

With the cache in place, we could no longer load test with just a few samples, because after the first request the cache would serve them without any computation.
Hence, we had to generate varied test cases that simulate real text samples.
We did this with a Python script that creates request samples, and ran the load test with **JMeter**.
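
A minimal sketch of what such a sample generator could look like is shown below. It is not our original script: `make_request_samples`, `write_csv`, the `repeat_ratio` knob, the toy vocabulary, and the output file name are all assumptions made for illustration. The idea is to mix fresh phrases with repeats of earlier ones so that the load test sees both cache misses and cache hits, then write them to a CSV file that a load-testing tool can read as input data.

```python
import csv
import random


def make_request_samples(vocabulary, n_samples, repeat_ratio=0.5, mean_length=5, seed=0):
    """Build phrases where roughly repeat_ratio of them repeat an earlier phrase."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        if samples and rng.random() < repeat_ratio:
            samples.append(rng.choice(samples))  # repeated phrase: likely a cache hit
        else:
            # Fresh phrase with a length scattered around mean_length words
            length = max(1, round(rng.gauss(mean_length, 1.5)))
            samples.append(" ".join(rng.choice(vocabulary) for _ in range(length)))
    return samples


def write_csv(samples, path):
    """Write one phrase per row, with a header, for use as a load-test data file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerows([s] for s in samples)


if __name__ == "__main__":
    # Toy vocabulary; in practice, sample words or whole phrases from real traffic.
    vocab = ["red", "running", "shoes", "size", "nine", "leather", "wallet", "blue"]
    write_csv(make_request_samples(vocab, n_samples=10_000), "request_samples.csv")
```

Setting `repeat_ratio` close to the repetition seen in production keeps the measured latency from being dominated by either pure cache hits or pure model calls.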
