# Generating German doctor reviews with a GPT-2 model
## Fine-tuning of a pretrained **Hugging Face** transformer decoder
In this notebook we will use a GPT-2 model that was fine-tuned to synthesize doctor reviews mimicking actual patients' text comments.

A detailed description of the **German language reviews of doctors by patients 2019** dataset can be found [here](https://data.world/mc51/german-language-reviews-of-doctors-by-patients).


For this exercise, we will use the [**Hugging Face**](https://huggingface.co/) implementation of transformers for TensorFlow 2.0. The Transformers library provides general-purpose implementations of several state-of-the-art model architectures in the natural language domain.

NOTE: This notebook and its implementation are heavily influenced by the [data-dive](https://data-dive.com/) blog post *Natural Language Processing of German texts*.

In [None]:
!pip install -U transformers==4.9.2

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [None]:
import pandas as pd
import tensorflow as tf

from transformers import AutoTokenizer, TFGPT2LMHeadModel

pd.options.display.max_colwidth = 600
pd.options.display.max_rows = 400

## Setting up the decoder model
Hugging Face's Transformers library allows for conveniently loading pre-configured text tokenizers and pre-trained models from local resources.

Here we will use a tokenizer and a GPT-2 model that was fine-tuned on the doctor review dataset.


In [None]:
!rm -rf gpt2_doctorreview_finetuned* __MACOSX
!gdown "https://drive.google.com/uc?id=13wbf5bsLmvRFD-AgbmruWo6bdiyjkwd9" -O gpt2_doctorreview_finetuned.zip
!unzip gpt2_doctorreview_finetuned.zip

Downloading...
From: https://drive.google.com/uc?id=13wbf5bsLmvRFD-AgbmruWo6bdiyjkwd9
To: /content/gpt2_doctorreview_finetuned.zip
100% 463M/463M [00:02<00:00, 189MB/s]
Archive:  gpt2_doctorreview_finetuned.zip
   creating: gpt2_doctorreview_finetuned/
  inflating: gpt2_doctorreview_finetuned/.DS_Store  
  inflating: __MACOSX/gpt2_doctorreview_finetuned/._.DS_Store  
   creating: gpt2_doctorreview_finetuned/tokenizer/
   creating: gpt2_doctorreview_finetuned/model/
  inflating: gpt2_doctorreview_finetuned/tokenizer/added_tokens.json  
  inflating: __MACOSX/gpt2_doctorreview_finetuned/tokenizer/._added_tokens.json  
  inflating: gpt2_doctorreview_finetuned/tokenizer/tokenizer_config.json  
  inflating: __MACOSX/gpt2_doctorreview_finetuned/tokenizer/._tokenizer_config.json  
  inflating: gpt2_doctorreview_finetuned/tokenizer/special_tokens_map.json  
  inflating: __MACOSX/gpt2_doctorreview_finetuned/tokenizer/._special_tokens_map.json  
  inflating: gpt2_doctorreview_finetuned/tokenizer/

In [None]:
tokenizer = AutoTokenizer.from_pretrained('gpt2_doctorreview_finetuned/tokenizer')
model = TFGPT2LMHeadModel.from_pretrained('gpt2_doctorreview_finetuned/model')

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2_doctorreview_finetuned/model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


## Generating doctor reviews
The model has been conditioned on control tokens, so we can choose whether a positive or a negative review is generated.

Because the model is auto-regressive, it generates the sequence by continuing the input sequence it is given. We can use this to control the polarity of the review by starting the input with either the token for positive or the token for negative reviews.

In [None]:
POS_TOKEN = "<|review_pos|>"
NEG_TOKEN = "<|review_neg|>"
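As a small illustration of how a control token conditions the prompt, the sketch below builds the input string for a desired polarity. The helper `build_prompt` is hypothetical and not part of the original notebook; it only mirrors the `POS_TOKEN + ' Ich'` pattern used in this notebook.

```python
POS_TOKEN = "<|review_pos|>"
NEG_TOKEN = "<|review_neg|>"


def build_prompt(positive: bool, start_text: str = "") -> str:
    """Prefix the start text with the control token for the desired polarity.

    Hypothetical helper for illustration only.
    """
    control_token = POS_TOKEN if positive else NEG_TOKEN
    # rstrip() removes the trailing space when no start text is given
    return f"{control_token} {start_text}".rstrip()


print(build_prompt(True, "Ich"))  # <|review_pos|> Ich
print(build_prompt(False))        # <|review_neg|>
```

The resulting string is what gets tokenized and passed to the model as the beginning of the sequence.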

### Simple greedy search
Let's implement our own greedy-search-based text generator. Generation happens in a loop in which one token is generated at a time. In each iteration, the token with the highest probability is selected and appended to the input for the next iteration.

**HINT**:
- Useful functions for this exercise:
 - [`tf.math.argmax`](https://www.tensorflow.org/api_docs/python/tf/math/argmax)
 - [`tf.cast`](https://www.tensorflow.org/api_docs/python/tf/cast)
 - [`tf.expand_dims`](https://www.tensorflow.org/api_docs/python/tf/expand_dims)
 - [`tf.concat`](https://www.tensorflow.org/api_docs/python/tf/concat)
- TensorFlow tensors can be sliced using the same syntax as NumPy arrays
 - Use slicing to retrieve specific tensor parts


In [None]:
def generate_greedy(inputs: str, max_length=15):
    """Generate a text sequence from an input string using greedy search.

    Note: max_length is the number of newly generated tokens.
    """
    input_ids = tokenizer.encode(inputs, return_tensors='tf')

    ##########################
    ## YOUR CODE HERE START ##
    ##########################

    # loop until the max_length is reached
    for __ in range(max_length):
        
        logits = model.predict(input_ids).logits

        # retrieve the predicted logits for the *last* token
        # Dimensions are [batch_size, input_tokens, vocab_size]
        next_token_logits = logits[:, -1, :]  

        # Select the token with the highest probability
        next_token = tf.math.argmax(next_token_logits, axis=-1)

        # convert it to tf.int32
        next_token = tf.cast(next_token, tf.int32)

        # We have to expand the dimension of the next token to match 
        # the shape of input_ids
        next_token = tf.expand_dims(next_token, axis=-1)

        # Now we need to add the next_token to the previous ones.
        # Concat input_ids with the next_token
        input_ids = tf.concat([input_ids, next_token], axis=1)

    output_ids = input_ids.numpy().squeeze()

    ### Use your tokenizer to decode the output ids into text
    decoded = tokenizer.decode(output_ids)

    ##########################
    ## YOUR CODE HERE END ##
    ##########################
    return decoded

When you invoke your greedy search implementation, you get generated text that continues the beginning of the input text.

Starting with `POS_TOKEN` increases the likelihood of generating a positive text, while `NEG_TOKEN` tends to result in a negative text.

In [None]:
generate_greedy(POS_TOKEN + ' Ich', max_length=15)

'<|review_pos|> Ich bin seit Jahren bei Dr. Heuer in Behandlung und bin sehr zufrieden.'

### More advanced text generation
So far so good. Now we understand how text can be generated.

However, our function ignores the case where the model predicts EOS (end-of-sequence). What would be necessary to incorporate this into our function?

What if we wanted to generate multiple different review comments?
Did you generate long reviews? Have you started to see repetitions in the generated output? Why is that?
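For the EOS question, one possible approach is to check each newly selected token and leave the loop once it equals the EOS token. The sketch below shows the idea in plain Python so that it runs without the model. The function `greedy_decode` and the stand-in model `toy_logits` are assumptions made for illustration; they are not part of the notebook or of the Transformers API.

```python
from typing import Callable, List, Sequence


def greedy_decode(
    next_token_logits: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    eos_token_id: int,
    max_new_tokens: int = 15,
) -> List[int]:
    """Greedy decoding that stops as soon as the EOS token is produced."""
    ids = list(input_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)
        # Greedy choice: index of the largest logit
        next_id = max(range(len(logits)), key=logits.__getitem__)
        ids.append(next_id)
        if next_id == eos_token_id:
            break  # the model signalled the end of the sequence
    return ids


# Hypothetical stand-in for the model: a vocabulary of 5 tokens in which
# token 4 plays the role of EOS. It always prefers "last token + 1".
def toy_logits(ids: Sequence[int]) -> List[float]:
    preferred = min(ids[-1] + 1, 4)
    return [1.0 if i == preferred else 0.0 for i in range(5)]


print(greedy_decode(toy_logits, [0], eos_token_id=4))  # [0, 1, 2, 3, 4]
```

In `generate_greedy`, the equivalent check would compare `next_token` with `tokenizer.eos_token_id` inside the loop, assuming the fine-tuned tokenizer defines an EOS token.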

Luckily the Hugging Face implementation offers various ways for us to generate higher quality reviews.

#### Greedy search
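The following code generates text with `model.generate` using its default decoding settings. Pure greedy search is deterministic and therefore cannot return several different sequences; the varied outputs below suggest that sampling is active here, presumably through generation defaults stored in the fine-tuned model's config.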

In [None]:
# encode context the generation is conditioned on
input_ids = tokenizer.encode(POS_TOKEN, return_tensors='tf')

# generate text until the output length
# (which includes the context length) reaches 50 
greedy_outputs = model.generate(
    input_ids, 
    max_length=50,
    num_return_sequences=3,
    )

generated_reviews = [{'generated_text': tokenizer.decode(output, skip_special_tokens=True)}
                     for output in greedy_outputs]
pd.DataFrame(generated_reviews)

|   | generated_text |
|---|---|
| 0 | Sehr freundlich, die beste Ärztin die ich kenne und nimmt sich viel Zeit für die Patienten. Die Sprechstundenhilfen sind auch sehr freundlich. |
| 1 | Als Hausärztin kann ich Fr. Dr. Harder nur wärmstens empfehlen. Sie findet immer eine Lösung, nimmt sich Zeit, arbeitet gewissenhaft und schnell und hat für jede Problemchen ein Ohr. |
| 2 | Tolle Praxis!!! Ich bin schon seid fast Jahren bei Dr. Dahmen. Der Arzt macht das was er macht!! Nur zu empfehlen!!! Keine Abzocke, Nette Mädels, super Arzt!!! |


#### Beam search 
Beam search can be considered as an alternative. At each generation step, the `num_beams` most probable partial sequences (the beam) are kept instead of only the single most probable token. At the end of the generation, the sequences with the highest overall probability are returned.
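To make the idea concrete, here is a minimal beam search sketch in plain Python. It is a simplified illustration, not the Hugging Face implementation: it ignores EOS handling and length penalties, and the names `beam_search`, `toy_log_probs`, and `TOY_TABLE` are assumptions introduced for this example. The toy model is chosen so that the greedy path is not the most probable sequence overall.

```python
import math
from typing import Callable, List, Sequence, Tuple


def beam_search(
    next_token_log_probs: Callable[[Sequence[int]], Sequence[float]],
    input_ids: Sequence[int],
    num_beams: int = 2,
    max_new_tokens: int = 2,
) -> List[Tuple[List[int], float]]:
    """Return the final beams as (token ids, summed log-probability), best first."""
    beams = [(list(input_ids), 0.0)]
    for _ in range(max_new_tokens):
        candidates = []
        for ids, score in beams:
            log_probs = next_token_log_probs(ids)
            # Extend every hypothesis in the beam by every possible token
            for token_id, log_prob in enumerate(log_probs):
                candidates.append((ids + [token_id], score + log_prob))
        # Keep only the num_beams highest-scoring hypotheses
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams


# Hypothetical toy model: next-token log-probabilities depend on the last token.
NEG_INF = float("-inf")
TOY_TABLE = {
    0: [NEG_INF, math.log(0.6), math.log(0.4)],
    1: [NEG_INF, math.log(0.5), math.log(0.5)],
    2: [NEG_INF, math.log(0.9), math.log(0.1)],
}


def toy_log_probs(ids: Sequence[int]) -> List[float]:
    return TOY_TABLE[ids[-1]]


print(beam_search(toy_log_probs, [0], num_beams=1)[0][0])  # [0, 1, 1]
print(beam_search(toy_log_probs, [0], num_beams=2)[0][0])  # [0, 2, 1]
```

With a single beam the search is greedy and ends on a sequence with probability 0.6 · 0.5 = 0.30. With two beams the initially less likely token 2 survives the first step and leads to a sequence with probability 0.4 · 0.9 = 0.36.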

What do the parameters `no_repeat_ngram_size` and `temperature` control?
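As a hint for this question, the sketch below illustrates the two underlying ideas in plain Python: temperature rescales the logits before the softmax, and an n-gram restriction bans tokens that would repeat an n-gram that has already been generated. Both functions are simplified assumptions written for illustration and are not the code that `model.generate` runs internally.

```python
import math
from typing import List, Sequence, Set


def softmax_with_temperature(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def banned_next_tokens(ids: Sequence[int], ngram_size: int) -> Set[int]:
    """Tokens that would complete an n-gram already present in ids."""
    if ngram_size < 1 or len(ids) < ngram_size:
        return set()
    # The last (ngram_size - 1) tokens form the start of the next n-gram
    prefix = tuple(ids[len(ids) - (ngram_size - 1):])
    banned = set()
    for start in range(len(ids) - ngram_size + 1):
        ngram = tuple(ids[start:start + ngram_size])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


logits = [2.0, 1.0, 0.0]
print(softmax_with_temperature(logits, temperature=1.0))
print(softmax_with_temperature(logits, temperature=0.7))

# The sequence ends in (5, 6); the trigram (5, 6, 7) already occurred
print(banned_next_tokens([5, 6, 7, 5, 6], ngram_size=3))  # {7}
```

A temperature below 1 sharpens the distribution towards the most likely token, while a temperature above 1 flattens it.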

Generating text using beam search is done like this:

In [None]:
beam_outputs = model.generate(
    input_ids,
    max_length=50,
    num_beams=7,
    no_repeat_ngram_size=3,
    num_return_sequences=3,
    early_stopping=True,
    temperature=0.7
)

generated_reviews = [{'generated_text': tokenizer.decode(output, skip_special_tokens=True)}
                     for output in beam_outputs]
pd.DataFrame(generated_reviews)

|   | generated_text |
|---|---|
| 0 | Ich bin seit Jahren bei Frau Dr. Henze und bin sehr zufrieden. Sie ist sehr kompetent und nimmt sich Zeit für ihre Patienten. Ich fühle mich bei ihr sehr gut aufgehoben. |
| 1 | Ich bin seit Jahren bei Frau Dr. Henze und fühle mich bei ihr sehr gut aufgehoben. Sie ist sehr kompetent und nimmt sich Zeit für ihre Patienten. Ich kann sie nur weiter empfehlen. |
| 2 | Ich bin seit Jahren bei Frau Dr. Henze und fühle mich bei ihr sehr gut aufgehoben. Sie ist sehr kompetent und nimmt sich Zeit für ihre Patienten. Ich kann sie nur weiterempfehlen. |


#### High level pipeline
The easiest way to use the model is Hugging Face's Transformers `pipeline` implementation, which encapsulates the previously loaded `model` and `tokenizer`.

The documentation for the [**pipeline**](https://huggingface.co/transformers/main_classes/pipelines.html) abstraction describes how to do the setup.

While it is able to generate reviews with very high fidelity, it is also the slowest approach. Can you find out why?


In [None]:
from transformers import pipeline

In [None]:
##########################
## YOUR CODE HERE START ##
##########################
# build a transformer-pipeline 
# to generate text using the 
# previously loaded model and tokenizer

review_generator = pipeline(
  "text-generation",
  model=model,
  tokenizer=tokenizer,
)

##########################
## YOUR CODE HERE END   ##
##########################

In [None]:
pos_generated_reviews = review_generator(POS_TOKEN, max_length=50, num_return_sequences=3)
pd.DataFrame(pos_generated_reviews)

|   | generated_text |
|---|---|
| 0 | <\|review_pos\|> Herr Dr. Boos ist ein sehr freundlicher und kompetenter Arzt, der sich sehr viel Zeit für seine Patienten nimmt. Seine Behandlungsmethoden sind effektiv und führten immer zum Erfolg, man hat wirklich keinerlei Schmerzen, das gesamte Personal ist freundlich und hilfsbereit |
| 1 | <\|review_pos\|> Super kompetentes und freundliches Praxisteam immer freundlich hilfsbereit. Frau Dr. Dordea nimmt sich Zeit für einen |
| 2 | <\|review_pos\|> Man hat das Gefühl das man sich wohl fühlt. Man wird gut beraten. |


In [None]:
neg_generated_reviews = review_generator(NEG_TOKEN, max_length=50, num_return_sequences=3)
pd.DataFrame(neg_generated_reviews)

|   | generated_text |
|---|---|
| 0 | <\|review_neg\|> Es ist schon sehr traurig, dass er als Privatpatient als Kassenpatient ein solch betriebswirtschaftliches Verhalten anzweifeln muss. |
| 1 | <\|review_neg\|> Ich bin in diesem Praxis nur zur Krebsvorsorge und zur Krebsvorsorge gekommen, da beim Krebsvorsorgetermin auch ein Abstz gemacht werden muss. Leider war die Leistung nicht von der Krankenkasse erstattet worden. Als ich dann im Anschluss an die Untersuchung den |
| 2 | <\|review_neg\|> Ich bin auf Grund von starken Rückenschmerzen und Verspannungen in die Praxis gegangen. Herr Dr. Heuer hat mir ein Medikament verschrieben, wodurch ich die starken Rückenschmerzen in die Schulter stellte. Nach einem kurzen Blick in meinen Hals teilte mir H |


# Congratulations
You have explored different ways to generate text with a GPT-2 transformer model.