# Using our new Hugging Face model

Now that our new Hugging Face model is available on the model hub, we can use it as we would any other model on the hub 😀

## Install our required packages 

First we install our required packages. 


In [1]:
!pip install git+https://github.com/huggingface/transformers

In [2]:
!pip install datasets

In [3]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import pandas as pd

## Loading our Model and Tokenizer

We create a tokenizer and a model from the pre-trained model we published to the hub. The handy `Auto*` classes (`AutoTokenizer` and `AutoModelForSequenceClassification`) take care of this.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("BritishLibraryLabs/bl-books-genre")
model = AutoModelForSequenceClassification.from_pretrained("BritishLibraryLabs/bl-books-genre")

To keep inference straightforward we can use a [`pipeline`](https://huggingface.co/transformers/main_classes/pipelines.html). Pipelines are intended to make it easy to use models for inference across a wide range of tasks. We need to tell the pipeline what kind of task we're doing and pass in our model and tokenizer.

In [5]:
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer, device=0)

Passing `device=0` places the pipeline on the first GPU, so this assumes a GPU is available. We can double-check the placement using the `device` attribute.

In [6]:
classifier.device

device(type='cuda', index=0)
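
Hard-coding `device=0` fails on a machine without a GPU. A minimal sketch of choosing the device at runtime, assuming the integer convention used by the pipeline above (`0` for the first GPU, `-1` for CPU). The helper name `pick_pipeline_device` is hypothetical:

```python
def pick_pipeline_device(cuda_available: bool) -> int:
    """Return 0 (first GPU) when CUDA is usable, otherwise -1 (CPU)."""
    return 0 if cuda_available else -1


try:
    import torch

    cuda_available = torch.cuda.is_available()
except ImportError:
    # torch is not installed, so only the CPU can be used
    cuda_available = False

device = pick_pipeline_device(cuda_available)
print(device)
```

The result could then be passed as `device=device` when creating the pipeline.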

## Viewing Predictions
Let's try it out with some made-up book titles.

In [7]:
title = "The Coal Fields of South Wales"

In [8]:
classifier(title)

[{'label': 'Non-fiction', 'score': 0.9989659786224365}]

In [9]:
classifier("Oliver Twist")

[{'label': 'Fiction', 'score': 0.9980145692825317}]
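
The pipeline returns a list with one dictionary per input, each holding a `label` and a `score`. A small sketch of unpacking that structure, using the output for "The Coal Fields of South Wales" as literal data so no model is needed (the helper `format_prediction` is hypothetical):

```python
def format_prediction(predictions):
    """Format the first prediction in a pipeline-style result list."""
    top = predictions[0]
    return f"{top['label']} ({top['score']:.1%})"


# Output copied from the "The Coal Fields of South Wales" example
result = [{"label": "Non-fiction", "score": 0.9989659786224365}]
print(format_prediction(result))
```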

Now we essentially have a function that takes some text and returns some predictions. We can now predict against our full dataset. Since we're not going to be doing this over and over, we won't worry too much about the performance of our approach. 

## Getting the data we want to augment

Our original goal was to augment data with additional genre labels. We can now grab that metadata to use for inference.

In [10]:
csv_url = "https://bl.iro.bl.uk/downloads/e4bf0f74-2c64-4322-93c7-0dcc5e5246da?locale=en"

In [11]:
dtypes = {
    "BL record ID": "string",
    "Type of resource": "category",
    "BNB number": "category",
    "ISBN": "category",
    "Name": "category",
    "Type of name": "category",
    "Country of publication": "category",
    "Place of publication": "category",
    "Genre": "category",
    "Dewey classification": "string",
    "BL record ID for physical resource": "string",
}

In [12]:
df = pd.read_csv(
    csv_url,
    dtype=dtypes,
    low_memory=False
)
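
Passing `dtype` matters for more than memory: reading an identifier column as `"string"` keeps details such as leading zeros intact, while `"category"` stores repeated values compactly. A minimal sketch on a few made-up rows (the sample data is invented for illustration and is not taken from the British Library file):

```python
import io

import pandas as pd

# Made-up rows that mimic a few of the columns in the real metadata
sample_csv = io.StringIO(
    "BL record ID,Genre,Title\n"
    "001,Fiction,A Tale\n"
    "002,Non-fiction,A Survey\n"
    "003,Fiction,Another Tale\n"
)

sample = pd.read_csv(
    sample_csv,
    dtype={"BL record ID": "string", "Genre": "category"},
)

print(sample.dtypes)
print(sample["BL record ID"].tolist())  # leading zeros are preserved
```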

## 🤗 datasets

We'll again use one of the libraries from the 🤗 ecosystem to help make our prediction process *quite* efficient. We use a library called `datasets`. This library provides access to a huge number of existing datasets, but we can also use it as a way of processing datasets stored locally in other formats. We won't go into the details of that library here. If you are interested, the video below gives a great overview.

<iframe width="560" height="315" src="https://www.youtube.com/embed/_BZearw7f0w" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

We begin by importing the library

In [13]:
import datasets

As an initial step we remove any rows without a title in the metadata.

In [14]:
df = df[~df.Title.isna()]

For inference we only want a subset of the DataFrame. We grab two columns and copy them to a new dataframe. 

In [15]:
pred_df = df[['Title','BL record ID']].copy(deep=True)

:::{note}
You might wonder why we include `BL record ID` when we don't need this for prediction. We do this mainly because the datasets library expects a `DataFrame` not a `Series` (which is what we'd get with a single column). 

Beyond this, when we use machine learning for these kinds of applications it is often important that we can link our new augmented data back to the existing metadata. `BL record ID` can also serve this function since it's a unique ID.
:::
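
To illustrate the second point of the note, here is a sketch of linking predictions back to the original metadata through the shared identifier. The two small frames and their values are made up; only the `BL record ID` column name comes from the data above:

```python
import pandas as pd

# Made-up metadata and predictions that share an identifier column
metadata = pd.DataFrame(
    {"BL record ID": ["001", "002", "003"], "Title": ["A", "B", "C"]}
)
predictions = pd.DataFrame(
    {"BL record ID": ["003", "001"], "predicted_label": ["Fiction", "Non-fiction"]}
)

# A left join keeps every metadata row, even those without a prediction;
# validate raises an error if an identifier is duplicated on either side
linked = metadata.merge(
    predictions, on="BL record ID", how="left", validate="one_to_one"
)
print(linked)
```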


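Our model has a maximum input length. We could deal with this in various ways, but here we'll take a slightly crude approach and chop off any title text beyond 512 characters. Note that this slice counts characters, whereas the model's limit is counted in tokens; the `truncation` argument we pass at inference time enforces the token limit.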

In [16]:
pred_df['text'] = pred_df['Title'].str[:512]
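
The slice only affects titles longer than the limit; shorter ones pass through unchanged. A plain-Python sketch of the same operation on single strings (the helper `truncate_title` is hypothetical; `512` mirrors the value used above):

```python
def truncate_title(title: str, max_chars: int = 512) -> str:
    """Keep at most `max_chars` characters from the start of `title`."""
    return title[:max_chars]


short_title = "Oliver Twist"
long_title = "A" * 600  # stand-in for an unusually long title

print(len(truncate_title(short_title)))  # unchanged
print(len(truncate_title(long_title)))   # cut to max_chars
```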

We just do a quick check to make sure our prediction DataFrame and our original DataFrame have equal lengths.

In [17]:
assert len(df) == len(pred_df)

We can now create a `datasets.Dataset` using the `from_pandas` method.

In [18]:
dataset = datasets.Dataset.from_pandas(pred_df)

Let's take a look at what this looks like

In [19]:
dataset

Dataset({
    features: ['Title', 'BL record ID', 'text', '__index_level_0__'],
    num_rows: 1752072
})

You can see here that we have the columns we passed in to our `dataset` and some information about the number of rows. If we want to check again we can assert the length of everything is the same.
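
The `__index_level_0__` column is the pandas index carried over into the dataset: after filtering out rows the index has gaps, so it is no longer a plain range. A sketch on a made-up frame of how the gaps arise, and how `reset_index(drop=True)` would restore a plain range before conversion (an optional step, not something the code above does):

```python
import pandas as pd

# Made-up titles with one missing value
toy = pd.DataFrame({"Title": ["A", None, "C"]})

filtered = toy[~toy.Title.isna()]
print(list(filtered.index))  # the dropped row leaves a gap

# Renumber the rows from zero and discard the old index
renumbered = filtered.reset_index(drop=True)
print(list(renumbered.index))
```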

In [20]:
assert len(df) == len(pred_df) == len(dataset)

:::{note}
Checking all of these lengths here might seem a bit silly since we've already seen what the lengths are but it can often be good to chuck the odd `assert` in our code. If we run this code in a script we won't be able to visually check these things so easily. In that case having an `assert` statement acts as a 'sort of test' and will flag if something isn't what we expect before we've spent hours training a model. 
:::
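
In a script, an `assert` with a message makes a failure easier to diagnose. A minimal sketch, where `check_same_length` is a hypothetical helper rather than part of any library:

```python
def check_same_length(**named_collections):
    """Assert that all collections have the same length and return it."""
    lengths = {name: len(obj) for name, obj in named_collections.items()}
    assert len(set(lengths.values())) == 1, f"Length mismatch: {lengths}"
    return next(iter(lengths.values()))


rows = ["a", "b", "c"]
preds = [0.1, 0.9, 0.5]
print(check_same_length(rows=rows, preds=preds))
```

If the lengths differ, the error message lists each name alongside its length.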


## Inference 

We're now ready for inference. We'll import one more class for doing this.

In [21]:
from transformers.pipelines.base import KeyDataset

We set a fairly high batch size. You may have to reduce this if you have less GPU memory available. 

In [22]:
bs = 256

We now create a loop that will batch up our data and run it through our model. We then save the predictions in a new list. 

In [23]:
all_preds = []
for pred in classifier(KeyDataset(dataset, "text"), batch_size=bs, truncation="only_first"):
    all_preds.append(pred)

CPU times: user 45min 48s, sys: 5.65 s, total: 45min 54s
Wall time: 42min 19s
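
To make the idea of batching concrete, here is a pure-Python sketch that groups an iterable into lists of at most `batch_size` items. It only illustrates the concept; it is not how `pipeline` batches internally, and the `batched` helper is hypothetical:

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most `batch_size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


titles = ["title one", "title two", "title three", "title four", "title five"]
for batch in batched(titles, batch_size=2):
    print(batch)
```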


This will take a bit of time but in the grand scheme of things isn't too long to wait. If we were running this model in a 'live' system and really cared about latency we might need to explore other approaches.
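
For a rough sense of scale, the row count and wall time reported above can be turned into a throughput figure and a batch count. This is simple arithmetic on those reported numbers; results on other hardware will differ:

```python
import math

n_titles = 1_752_072         # rows in the dataset
wall_seconds = 42 * 60 + 19  # wall time of 42min 19s
batch_size = 256

titles_per_second = n_titles / wall_seconds
n_batches = math.ceil(n_titles / batch_size)

print(f"{titles_per_second:.0f} titles per second")
print(f"{n_batches} batches")
```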

Again we can check the length of our predictions to see if it looks okay.

In [24]:
len(all_preds)

1752072

We now just do a bit of tidying to get our labels into the format we want. We'll start by checking what the predictions look like

In [25]:
all_preds[0]

{'label': 'Non-fiction', 'score': 0.518064022064209}

We store these in a new column

In [26]:
df['raw_predictions'] = all_preds

We now grab the labels from our predictions 

In [27]:
df["predicted_label"] = df['raw_predictions'].apply(lambda x: x["label"])

We now store the probabilities

In [29]:
df["prob"] = df['raw_predictions'].apply(lambda x: x["score"])

In [30]:
df['prob']

0          0.518064
1          0.954458
2          0.955602
3          0.997447
4          0.820073
             ...   
1752073    0.999866
1752074    0.992083
1752075    0.998614
1752076    0.524758
1752077    0.999718
Name: prob, Length: 1752072, dtype: float64

In [31]:
def get_fiction_prob(x):
    # `prob` is the score of the predicted label, so flip it when that label is not "Fiction"
    if x.predicted_label == "Fiction":
        return x.prob
    return 1 - x.prob

In [32]:
df["fiction_probs"] = df.apply(get_fiction_prob, axis=1)

In [33]:
df["non_fiction_probs"] = 1 - df["fiction_probs"]

In [34]:
df = df.drop(columns=["prob"])
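
Row-wise `apply` is easy to read but can be slow on large frames. As an alternative sketch, the same probability can be computed column-wise with `Series.where`, shown here on a made-up frame that reuses the column names from above. It assumes, as in the outputs above, that there are only two labels:

```python
import pandas as pd

# Made-up predictions using the same column names as above
toy = pd.DataFrame(
    {
        "predicted_label": ["Fiction", "Non-fiction", "Fiction"],
        "prob": [0.9, 0.8, 0.6],
    }
)

is_fiction = toy["predicted_label"] == "Fiction"
# Keep `prob` where the label is "Fiction", otherwise use its complement
toy["fiction_probs"] = toy["prob"].where(is_fiction, 1 - toy["prob"])
toy["non_fiction_probs"] = 1 - toy["fiction_probs"]
print(toy)
```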

## Saving our updated metadata 

Now we can save our results to a CSV file.

In [35]:
df.to_csv("bl_books_w_genre_transformer.csv")
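
By default `to_csv` also writes the DataFrame index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. A sketch on a made-up frame, writing to an in-memory buffer instead of a file:

```python
import io

import pandas as pd

toy = pd.DataFrame(
    {"BL record ID": ["001", "002"], "predicted_label": ["Fiction", "Non-fiction"]}
)

with_index = io.StringIO()
toy.to_csv(with_index)  # default: index written as the first column

without_index = io.StringIO()
toy.to_csv(without_index, index=False)

print(with_index.getvalue().splitlines()[0])
print(without_index.getvalue().splitlines()[0])
```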

## Conclusion 

We have now reached the point where we have a version of the Microsoft Books metadata augmented with predicted genre labels and probabilities 🦾