# Using our new Hugging Face model

Now that our new Hugging Face model is available on the model hub, we can use it as we would any other model on the hub 😀

## Install our required packages 

First we install our required packages. 


In [None]:
!pip install transformers tqdm --upgrade



In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch
import pandas as pd
from tqdm.auto import tqdm

In [None]:
tqdm.pandas()

## Loading our model and tokenizer 

We load a tokenizer and a model from the pre-trained model we created. 

In [None]:
tokenizer = AutoTokenizer.from_pretrained("davanstrien/bl-books-genre")
model = AutoModelForSequenceClassification.from_pretrained("davanstrien/bl-books-genre")

To make inference straightforward, we can use a [`pipeline`](https://huggingface.co/transformers/main_classes/pipelines.html). Pipelines are intended to make it easy to use models for inference across a wide range of tasks. We need to tell the pipeline what kind of task we're doing and pass in our model and tokenizer. 

In [None]:
classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

Let's try it out:

In [None]:
title = "The Coal Fields of South Wales"

In [None]:
classifier(title)

[{'label': 'Non-fiction', 'score': 0.9989659786224365}]

In [None]:
classifier("Oliver Twist")

[{'label': 'Fiction', 'score': 0.9980145692825317}]

Now we essentially have a function that takes some text and returns some predictions. We can now predict against our full dataset. Since we're not going to be doing this over and over, we won't worry too much about the performance of our approach. 

In [None]:
df = pd.read_csv(
    "MS digitised books 2021-01-09.csv",
    dtype={
        "BL record ID": "string",
        "BL record ID for physical resource": "string",
        "Title": "string",
    },
)

In [None]:
len(df)

52695

In [None]:
df.head(1)

Unnamed: 0,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,Number within series,Country of publication,Place of publication,Publisher,Date of publication,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource
0,14602826,Monograph,"Yearsley, Ann",1753-1806,person,,"More, Hannah, 1745-1833 [person] ; Yearsley, A...",Poems on several occasions [With a prefatory l...,,,,England,London,,1786,Fourth edition MANUSCRIPT note,,,Digital Store 11644.d.32,,,English,,3996603


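We'll create a function that tries to get a prediction for a title and returns `NaN` if the prediction fails. It first does a quick check to see whether we need to chop off some of our text: our model has a maximum input length of 512 tokens, and slicing to 512 characters is a rough proxy that in practice keeps almost every title within that limit. 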

In [None]:
import numpy as np

In [None]:
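def try_predict(x):
    """Return the classifier's prediction for `x`, or NaN if prediction fails."""
    pred = np.nan
    # Rough guard against over-long inputs: keep only the first 512 characters.
    if len(x) > 512:
        x = x[:512]
    try:
        pred = classifier(x)
    except Exception as e:
        # Report the failure and the offending text, then fall back to NaN.
        print("\U0001F92E", e, x)
    return pred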
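
If token-level truncation is preferred over slicing the string, the pipeline can be asked to truncate for us. The following is a minimal sketch, assuming a recent version of `transformers` in which the text-classification pipeline forwards the `truncation` and `max_length` keyword arguments to its tokenizer. `predict_truncated` is a hypothetical helper name, not part of the original notebook, and the classifier is passed in as an argument so that the function works with any callable.

```python
import math


def predict_truncated(classifier, text, max_length=512):
    """Return the classifier's prediction for `text`, or NaN on failure.

    Assumption: `classifier` accepts `truncation` and `max_length` keyword
    arguments and passes them on to its tokenizer.
    """
    try:
        # Truncation happens at the token level inside the pipeline,
        # so no manual slicing of the string is needed.
        return classifier(text, truncation=True, max_length=max_length)
    except Exception as e:
        # Report the failure and fall back to NaN, as try_predict does.
        print("prediction failed:", e)
        return math.nan
```

With the pipeline created earlier, this could be called as `predict_truncated(classifier, title)`.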

This is quite an ugly solution, but since we're not planning to do anything beyond making these predictions once, we'll allow a bit of sloppiness 😉. We can use `progress_apply` to get a little progress bar so we know how long it's going to take. 

In [None]:
preds = df["Title"].progress_apply(try_predict)


In [None]:
preds[0]

[{'label': 'Fiction', 'score': 0.9999717473983765}]
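
Predicting one title at a time is fine for a one-off run. If speed did matter, one option would be to pass lists of titles to the pipeline instead. The sketch below assumes that a text-classification pipeline called with a list of strings returns one `{'label': ..., 'score': ...}` dict per input, which is the behaviour of recent `transformers` versions as far as we know. `predict_in_batches` and its `batch_size` default are hypothetical and not part of the original notebook.

```python
def predict_in_batches(classifier, texts, batch_size=32):
    """Yield one prediction per text, calling the classifier on lists of texts."""
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Assumption: for a list input the classifier returns one result
        # per text, in the same order as the input.
        yield from classifier(batch)
```

Note that each yielded item here is a single dict, whereas `try_predict` returns a one-element list per title, so any later tidying code would need adjusting to match.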

We now just do a bit of tidying to get our labels into the format we want.

In [None]:
df["predicted_label"] = preds.apply(lambda x: x[0]["label"])

In [None]:
df["predicted_label"][0]

'Fiction'

In [None]:
df["prob"] = preds.apply(lambda x: x[0]["score"])

In [None]:
def get_fiction_prob(x):
    if x.predicted_label == "Fiction":
        return x.prob
    else:
        return 1 - x.prob

In [None]:
df["fiction_probs"] = df.apply(get_fiction_prob, axis=1)

In [None]:
df["non_fiction_probs"] = 1 - df["fiction_probs"]
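
The tidying steps above can also be expressed as a single function that works directly on the pipeline output. This is only a sketch: it assumes the model has exactly two labels, `Fiction` and `Non-fiction`, as in the outputs shown earlier, and `fiction_probability` is a hypothetical name. Unlike the lambdas above, it returns `NaN` for rows where no prediction was made.

```python
import math


def fiction_probability(pred):
    """Map a pipeline result such as [{'label': 'Fiction', 'score': 0.99}]
    to the probability of the 'Fiction' class (NaN if there is no result)."""
    if not isinstance(pred, list) or not pred:
        # try_predict returns NaN rather than a list when prediction fails.
        return math.nan
    top = pred[0]
    # With two classes, the other class's probability is 1 - score.
    return top["score"] if top["label"] == "Fiction" else 1 - top["score"]
```

It could then be applied with `preds.apply(fiction_probability)`.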

In [None]:
df = df.drop(columns=["prob"])

In [None]:
df.head(1)

Unnamed: 0,BL record ID,Type of resource,Name,Dates associated with name,Type of name,Role,All names,Title,Variant titles,Series title,Number within series,Country of publication,Place of publication,Publisher,Date of publication,Edition,Physical description,Dewey classification,BL shelfmark,Topics,Genre,Languages,Notes,BL record ID for physical resource,predicted_label,fiction_probs,non_fiction_probs
0,14602826,Monograph,"Yearsley, Ann",1753-1806,person,,"More, Hannah, 1745-1833 [person] ; Yearsley, A...",Poems on several occasions [With a prefatory l...,,,,England,London,,1786,Fourth edition MANUSCRIPT note,,,Digital Store 11644.d.32,,,English,,3996603,Fiction,0.999972,2.8e-05


## Saving our updated metadata 

Now we can save our results to a CSV file.

In [None]:
df.to_csv("data/bl_books_w_genre_transformer.csv")
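
The call above assumes that a `data/` directory already exists, and it writes the DataFrame index as an extra unnamed column (which is likely how the `Unnamed: 0` column in our input file arose). A slightly more defensive variant might look like the sketch below, where `save_metadata` is a hypothetical helper rather than part of the original notebook.

```python
from pathlib import Path

import pandas as pd


def save_metadata(df, path):
    """Write `df` to `path` as CSV, creating the parent directory if needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # index=False avoids writing the row index as an extra unnamed column.
    df.to_csv(path, index=False)
    return path
```

For example: `save_metadata(df, "data/bl_books_w_genre_transformer.csv")`.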

## Conclusion 

We have now got to the point where we have a version of the BL books metadata enriched with a predicted genre label and the associated fiction and non-fiction probabilities. 