You can also read the contents of this notebook on my [blog](https://thomasdelatte.com/eu-text-generator).

Natural language generation is the process of producing meaningful phrases and sentences in natural language. Over the past few years, natural language processing (NLP) models have shown a stunning ability to write coherent texts, going beyond what we thought was possible in the short term. [OpenAI's GPT-2](https://github.com/openai/gpt-2) is a model that stunned the world in 2019 by automatically writing consistent and passionate stories.

The GPT-2 isn’t based on a notably innovative architecture: it is similar to the already-familiar decoder-only transformer. It is rather the scale of the model and of the data it is trained on that sets its performance apart.

The GPT-2 is a pre-trained language model, meaning that it has already been trained on a large text corpus to predict the next word of a sentence. Since its release, the GPT-2 has been fine-tuned on various datasets for various language generation tasks, ranging from a [chatbot completing any sentence you give it](https://talktotransformer.com) to a [generator of words that do not exist, and their definitions](https://www.thisworddoesnotexist.com/).

For an illustrated and accessible explanation of how GPT-2 works, check out [this amazing blog post by Jay Alammar](http://jalammar.github.io/illustrated-gpt2/)!

<figure class="image">
  <img src="https://miro.medium.com/max/1400/1*MqyjtyN3EYRQ4WVmUm1z2Q.gif" alt="openAI Gif">
  <figcaption>Credit: Ben Barry / OpenAI</figcaption>
</figure>

To the best of my knowledge, GPT-2 has not been used to generate texts of a legal nature. It might seem odd indeed to want to generate legal texts when legal drafting requires more rigor and method than creativity. 

I nevertheless believe that it is a good way to test whether the model stays consistent within a generated text. Moreover, legal document automation applications have been around for a long time, and language models trained on legal documents are increasingly used for natural language understanding and its applications.

For our purpose, we will tap into the huge number of legal acts (Directives, Regulations, Decisions) produced in the last 30 years by the European Union. You can generate text yourself right away [here](https://thomasdelatte.com/app).

### Using the GPT2 model on EU Acts
First, let's import the necessary modules. We will use the [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) library to conveniently play around with GPT-2. 

In [0]:
import glob
import os
import pandas as pd

%tensorflow_version 1.x

try:
  import gpt_2_simple as gpt2
except ImportError:
  # Install the library on first run, then retry the import
  !pip3 -q install gpt-2-simple
  import gpt_2_simple as gpt2

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



## Get the data 
To train the GPT2 model to generate EU legislative acts, we obviously need to get existing EU acts.

Luckily, we will not need to scrape the data from the web: EURLEX57K is a dataset of 57,000 European Union Directives, Decisions and Regulations from 1990 to 2019. It is available [here](https://github.com/iliaschalkidis/lmtc-eurlex57k).

In [0]:
# Create the Colab working folder (absolute path, matching the commands below)
os.makedirs('/content', exist_ok=True)

In [0]:
# Download the datasets through the command line
!wget -O /content/datasets.zip http://nlp.cs.aueb.gr/software_and_datasets/EURLEX57K/datasets.zip
!unzip /content/datasets.zip -d /content/EURLEX57K
!rm /content/datasets.zip
!rm -rf /content/EURLEX57K/__MACOSX

Streaming output truncated to the last 5000 lines.
  inflating: /content/EURLEX57K/dataset/dev/32004R1662.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._32004R1662.json  
  inflating: /content/EURLEX57K/dataset/dev/32004R0970.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._32004R0970.json  
  inflating: /content/EURLEX57K/dataset/dev/31999R0014.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._31999R0014.json  
  inflating: /content/EURLEX57K/dataset/dev/32007R1267.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._32007R1267.json  
  inflating: /content/EURLEX57K/dataset/dev/31987R3625.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._31987R3625.json  
  inflating: /content/EURLEX57K/dataset/dev/31999R2183.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._31999R2183.json  
  inflating: /content/EURLEX57K/dataset/dev/32012R0997.json  
  inflating: /content/EURLEX57K/__MACOSX/dataset/dev/._32012R09

Each EU legislative act is stored in a separate JSON file, which amounts to no fewer than 57,000 files in our dataset! Let's glob them together for further processing.

We also load the files into a Pandas DataFrame. As the number of files to load is large (57,000!), this might take a few minutes.

In [0]:
%%time
# Glob the files together
files = glob.glob("/content/EURLEX57K/dataset/**/*.json", recursive=True)
# Load the files into a single Pandas DataFrame
data = pd.concat((pd.read_json(file, lines=True) for file in files), ignore_index=True)

CPU times: user 4min 22s, sys: 3.95 s, total: 4min 26s
Wall time: 4min 26s
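Much of that loading time probably goes into building 57,000 tiny DataFrames and concatenating them. A possible alternative, sketched below, reads the files with the standard `json` module and builds the DataFrame once. It assumes that each file holds a single JSON object; `load_json_records` is a hypothetical helper name, not part of any library.

```python
import glob
import json


def load_json_records(pattern):
    """Read every JSON file matching `pattern` into a list of dicts.

    Assumes each file contains exactly one JSON object (one legislative act).
    """
    records = []
    for path in sorted(glob.glob(pattern, recursive=True)):
        with open(path, encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records
```

The resulting list can then be turned into a single DataFrame in one call, for example `data = pd.DataFrame.from_records(load_json_records("/content/EURLEX57K/dataset/**/*.json"))`.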


Let's take a glimpse at the data:

In [0]:
data.head()

|   | celex_id | uri | type | concepts | title | header | recitals | main_body | attachments |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 32003R1781 | http://publications.europa.eu/resource/cellar/... | Regulation | [252, 2668] | Commission Regulation (EC) No 1781/2003 of 10 ... | Commission Regulation (EC) No 1781/2003\nof 10... | ,\nHaving regard to the Treaty establishing th... | [The world price for unginned cotton as referr... | Done at Brussels, 10 October 2003.\nFor the Co... |
| 1 | 32013D0092 | http://publications.europa.eu/resource/cellar/... | Decision | [191, 2746, 2754, 2771, 3191, 4079, 4509, 5969] | 2013/92/EU: Commission Implementing Decision o... | 20.2.2013 EN Official Journal of the European ... | ,\nHaving regard to the Treaty on the Function... | [Definitions\nFor the purpose of this decision... | Done at Brussels, 18 February 2013.\nFor the C... |
| 2 | 31990R1317 | http://publications.europa.eu/resource/cellar/... | Regulation | [2676, 4472, 6042] | Council Regulation (EEC) No 1317/90 of 14 May ... | COUNCIL REGULATION (EEC) N° 1317/90\nof 14 Ma... | ,\nHaving regard to the Treaty establishing th... | [For the 1990/91 marketing year, the target p... | Done at Brussels, 14 May 1990.\nFor the Counci... |
| 3 | 32009R0755 | http://publications.europa.eu/resource/cellar/... | Regulation | [1118, 1605, 2443, 2635, 693] | Commission Regulation (EC) No 755/2009 of 18 A... | 19.8.2009 EN Official Journal of the European ... | ,\nHaving regard to the Treaty establishing th... | [The standard import values referred to in Art... | Done at Brussels, 18 August 2009.\nFor the Com... |
| 4 | 31991R0591 | http://publications.europa.eu/resource/cellar/... | Regulation | [2984, 3605, 693] | Commission Regulation (EEC) No 591/91 of 12 Ma... | COMMISSION REGULATION (EEC) No 591/91 of 12 M... | ,\nHaving regard to the Treaty establishing th... | [1. In order to establish the register of citr... | Done at Brussels, 12 March 1991.\nFor the Comm... |


Legislative acts are usually cluttered with a lot of legalese that is difficult to understand for non-lawyers. Let's get rid of it and keep only the title of the legislative act and its body.

In [0]:
# Drop unnecessary columns
small_data = data.drop(["celex_id", "uri", "type", "concepts", "header", "recitals", "attachments"], axis=1)
# Convert the main_body column from a list of strings to a single string
small_data["main_body"] = [" ".join(map(str, parts)) for parts in small_data["main_body"]]

In [0]:
# A peek at the data
small_data.head()

|   | title | main_body |
|---|---|---|
| 0 | Commission Regulation (EC) No 1781/2003 of 10 ... | The world price for unginned cotton as referre... |
| 1 | 2013/92/EU: Commission Implementing Decision o... | Definitions\nFor the purpose of this decision,... |
| 2 | Council Regulation (EEC) No 1317/90 of 14 May ... | For the 1990/91 marketing year, the target pr... |
| 3 | Commission Regulation (EC) No 755/2009 of 18 A... | The standard import values referred to in Arti... |
| 4 | Commission Regulation (EEC) No 591/91 of 12 Ma... | 1. In order to establish the register of citru... |


Finally, the gpt-2-simple module requires a text file as input. We add `<startoftext>` and `<endoftext>` tags before and after each legislative act, so that the model can learn where each text starts and ends.

In [0]:
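# Save to a text file; the name must match the `dataset` argument passed to gpt2.finetune
# line_terminator uses tags to delimit the start and end of texts
# (pandas 1.5+ renamed this argument to `lineterminator`)
small_data.to_csv("eu_data.txt", index=False, sep=" ", header=None, line_terminator="<endoftext> \n <startoftext>")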
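The `to_csv` trick above leaves the very first act without an opening tag and ends the file with a dangling one. A more explicit alternative is sketched here: it wraps each act in its own pair of tags. The helper names `format_act` and `write_tagged_corpus` are hypothetical, and the default tags are simply the ones used in this notebook.

```python
def format_act(title, body, start_tag="<startoftext>", end_tag="<endoftext>"):
    """Wrap one legislative act (title + body) in start/end delimiters."""
    return f"{start_tag}{title}\n{body}{end_tag}"


def write_tagged_corpus(acts, path, **tags):
    """Write an iterable of (title, body) pairs to `path`, one tagged act each."""
    with open(path, "w", encoding="utf-8") as handle:
        for title, body in acts:
            handle.write(format_act(title, body, **tags) + "\n")
```

With the DataFrame built earlier, the call could look like `write_tagged_corpus(zip(small_data["title"], small_data["main_body"]), "eu_data.txt")`.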

### Training the Model
OpenAI has released 4 different models, each larger than the last: 124M parameters, 355M parameters, 774M parameters and finally a gigantic 1.5 Billion (!) parameters model.

For our purposes we will be downloading the smallest model with 124M parameters.

Our job is to fine-tune the model. This means that we keep the weights of the already-trained neural network and only make small adjustments to them for our task. Fine-tuning is usually used to speed up training (given the size of the neural network) and sometimes to overcome a small dataset size (which is not really our case here).

In [0]:
gpt2.download_gpt2(model_name="124M")

Fetching checkpoint: 1.05Mit [00:00, 525Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 108Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 432Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 498Mit [00:02, 228Mit/s]                                   
Fetching model.ckpt.index: 1.05Mit [00:00, 646Mit/s]                                                
Fetching model.ckpt.meta: 1.05Mit [00:00, 148Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 191Mit/s]                                                       


Let's start the session and train our model. The API of the gpt-2-simple module makes it very easy.

Given the number of parameters and the size of our data, this is going to take a while: using Colab's GPU is essential.

In [0]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset="eu_data.txt",
              model_name='124M',
              steps=1000,
              restore_from='fresh',
              run_name='run1',
              print_every=10,
              sample_every=100,
              save_every=500)

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Loading checkpoint models/124M/model.ckpt
INFO:tensorflow:Restoring parameters from models/124M/model.ckpt


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [01:24<00:00, 84.78s/it]


dataset has 17477821 tokens
Training...
[10 | 19.57] loss=2.50 avg=2.50
[20 | 32.13] loss=1.79 avg=2.14
[30 | 44.66] loss=2.00 avg=2.10
[40 | 57.19] loss=2.35 avg=2.16
[50 | 69.74] loss=1.82 avg=2.09
[60 | 82.30] loss=1.80 avg=2.04
[70 | 94.81] loss=1.48 avg=1.96
[80 | 107.36] loss=1.72 avg=1.93
[90 | 119.88] loss=1.72 avg=1.90
[100 | 132.45] loss=1.97 avg=1.91
 Regulation Regulation (EC) No 1727/93 laying down detailed rules for the use of ECAG (ecagium.org)  (5) shall be amended as follows:
1. The introductory paragraph shall be deleted.
2. In the second paragraph the following paragraphs shall be added: `not later than 1 January 1994, the European Parliament may apply new rules for the application of the same rules that have been adopted during the period referred to in the first subparagraph:
(a) the Regulation,
(b) the Commission Regulation (EC) No 1727/93, and
(c) this Regulation'.
This Regulation shall enter into force on the third day following its publication in the Official J

There are a few things to note about the training process.

The model being already pretrained, it is no surprise that sentences are written in proper intelligible English after only 100 iterations.

However, the texts from the first few hundred iterations are filled with unnecessary repetitions and contain some nonsensical sequences of words. This is less pronounced in the second half of the training process: the cohesiveness and structure of the output improve as the model is fine-tuned.

The average training loss keeps going down through the iterations (from 2.5 to 1.52). However, it almost stopped decreasing in the last 300 iterations (from 1.56 to 1.52), which may be a sign that the model has almost converged and that additional training would not improve it much.
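To make this kind of plateau check less of an eyeball exercise, the training log can be parsed and the drop in average loss measured over the final steps. The sketch below assumes log lines shaped like the ones printed above (`[step | seconds] loss=... avg=...`); `parse_training_log` and `avg_loss_drop` are hypothetical helper names.

```python
import re

# Matches lines such as "[10 | 19.57] loss=2.50 avg=2.50"
LOG_LINE = re.compile(r"\[(\d+) \| ([\d.]+)\] loss=([\d.]+) avg=([\d.]+)")


def parse_training_log(text):
    """Extract (step, loss, avg) tuples from the printed training log."""
    return [
        (int(m.group(1)), float(m.group(3)), float(m.group(4)))
        for m in LOG_LINE.finditer(text)
    ]


def avg_loss_drop(entries, last_steps):
    """Return how much the average loss fell over the final `last_steps` steps."""
    final_step, _, final_avg = entries[-1]
    # First logged entry that falls inside the final window
    start_avg = next(
        avg for step, _, avg in entries if step >= final_step - last_steps
    )
    return start_avg - final_avg
```

A drop close to zero over the last few hundred steps, like the 0.04 observed over the final 300 iterations here, suggests that the loss has plateaued.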

### Text Generation
We can now use the model weights obtained from fine-tuning to generate some text. Again, the gpt-2-simple module makes it straightforward.

In [0]:
gpt2.generate(sess,
              length=500,
              temperature=0.8,
              prefix="<startoftext>",
              truncate='<|endoftext|>',
              include_prefix=False,
              nsamples=1,
              batch_size=1)

<startoftext> 
 <startoftext>"Commission Implementing Regulation (EU) No 853/2011 of 3 June 2011 establishing the standard import values for determining the entry price of certain fruit and vegetables
" "The standard import values referred to in Article 138 of Implementing Regulation (EU) No 543/2011 are fixed in the Annex hereto. This Regulation shall enter into force on the day of its publication in the Official Journal of the European Union.
This Regulation shall be binding in its entirety and directly applicable in all Member States."<endoftext> 
 <startoftext>"Commission Regulation (EEC) No 39/90 of 3 January 1990 allowing for an exemption from the duties applicable to imports of certain textile products originating in the United States of America creating a safeguard for the Community tariff quota
" "1. From 1 January to 31 December 1990, the duties applicable to imports of the products listed in Article 1 of Commission Regulation (EEC) No 3847/90 (1) shall be suspended at the le
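The sample above runs through several acts and stops mid-sentence, probably because the `truncate` value `<|endoftext|>` does not match the `<endoftext>` tag written to the training file. One way to post-process such output is sketched below: it keeps only the acts that have both an opening and a closing tag. `split_generated_acts` is a hypothetical helper, and it assumes the raw text is available as a string (for instance by calling `gpt2.generate` with `return_as_list=True`).

```python
def split_generated_acts(text, start_tag="<startoftext>", end_tag="<endoftext>"):
    """Return the complete acts found in `text`, dropping any truncated tail."""
    acts = []
    for chunk in text.split(start_tag):
        if end_tag in chunk:
            # Keep only what precedes the closing tag
            act = chunk.split(end_tag, 1)[0].strip()
            if act:
                acts.append(act)
    return acts
```

Applied to the output above, this would keep the complete Implementing Regulation and discard the second act, which was cut off by the `length=500` limit.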

You can generate your own GPT-2 EU legislative act [here](https://thomasdelatte.com/app).

Let’s take a look at a first example of generated text:

> __COUNCIL DECISION 2011/783/CFSP of 19 January 2011 adapting, for the seventh time, the detailed rules of application of the Single Programming Document for the Internal Market in Electronic Records__
>
>The detailed rules of application for the Single Programming Document for the Internal Market in Electronic Records (the Document):
>
>(a) shall be as set out in the Annex; and
>
>(b) shall be supplemented, inter alia, by additional provisions:
>
>(i) for the application of Articles 308a, 308b, 308c, 308d, 308e, 308f, 308g, 308h, 308i, 308j, 308k, 308l, 308m, 308n, 308o, 308p and 308q;
>
>(ii) for the application of Articles 308a, 308b, 308c, 308d, 308e, 308f, 308g, 308h, 308i, 308j, 308k, 308l, 308m, 308n, 308o, 308p, 308q.
>
>The combined nomenclature for electronic purposes in Annex I to this Decision shall be as set out in the Annex hereto.
>
>This Decision is addressed to the Member States.

In any case, we cannot blame GPT2 for having changed the impenetrable style of legislative writing. As a lawyer, I can assure you it looks like the typical Council Decision. I can see the Council adapting this kind of arcane document "for the seventh time".

Joking aside, the text respects the basic structure of a Council Decision and is coherent in its use of dates and internal references.

Here is another example, with a lower temperature set at 0.5 (less creativity). 

> __Commission Regulation (EC) No 1232/2005 of 12 October 2005 amending Regulation (EC) No 2318/2001 laying down detailed rules for the application of Council Regulation (EC) No 2400/2001 as regards the granting of aid for the manufacture of certain vegetable oils and fats, and amending Regulation (EC) No 2400/2001__
>
> The Annex to Regulation (EC) No 2318/2001 is replaced by the Annex to this Regulation.
>
> This Regulation shall enter into force on the third day following its publication in the Official Journal of the European Union.
> This Regulation shall be binding in its entirety and directly applicable in all Member States.

The lower temperature used makes the text even less decipherable. This Regulation seems almost void of content and only filled with internal and external references. The title suggests the amendment of two Regulations (2318/2001 and 2400/2001), while the body of the text only modifies one of the Regulations. This seems to be a small shortcoming in the model's understanding of the structure of legal acts.
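For intuition on why a lower temperature gives more predictable text: temperature is commonly implemented by dividing the model's raw scores (logits) by the temperature before applying the softmax. The sketch below illustrates that standard technique in plain Python, on the assumption that gpt-2-simple follows it; the function name and the example scores are made up for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    scaled = [score / temperature for score in logits]
    # Subtract the maximum for numerical stability
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate next words
logits = [2.0, 1.0, 0.1]
creative = softmax_with_temperature(logits, temperature=0.8)
cautious = softmax_with_temperature(logits, temperature=0.5)
```

With these scores, `cautious` puts more probability on the top-scoring word than `creative` does, so sampling at 0.5 picks the most likely continuation more often.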

Let's look at one last, somewhat more creative example:

> __COMMISSION IMPLEMENTING DECISION 2002/551/CFSP: Commission Decision of 20 December 2001 on the approval of the research programme for the eradication of leukosis in sheep and goat (notified under document number C(2001) 3923)__
>
>The programme for the eradication of leukosis in sheep and goat, as set out in the Annex, is hereby approved.
>
>The approval of the programme, as referred to in Article 16 of Decision 2001/974/CFSP, is conditional on the completion of the following essential measures:
>
>(a) the minimum number of sheep-to-human transmission lines for transmission by sheep to be connected to the transmission line for the transmission of leukosis;
>
>(b) the capacity to use the transmission lines for transmission of leukosis and the material needed for the transmission of the disease to the human population, as set out in Annex II;
>
>(c) the capacity to operate the transmission lines;
>
>(d) the standard equipment for the transmission of infectious leukosis;
>
>(e) the method for monitoring the transmission of leukosis;
>
>(f) the means available for monitoring the transmission of infectious leukosis and leukosis-associated infectious leukosis to the affected area.
>
>Member States shall ensure that the implementation of the programme is carried out in accordance with the provisions of Regulations (EC) No 1445/95, (EC) No 1547/95, (EC) No 1608/95 and (EC) No 1754/95.
>
>This Decision is addressed to the Member States.

## Final Thoughts
Overall, the generated texts seem pretty good. I would definitely have a hard time distinguishing a GPT-2 generated text from a genuine EU legislative act, except for the occasional small inconsistencies. The model usually stays coherent with dates and internal references.

It is worth checking whether these texts are plagiarized, i.e. whether the model simply copied text from the training dataset. A quick manual check found the generated texts to be mostly original, although sometimes made up of combinations of words that exist in the dataset.
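The manual check could be complemented by a rough automated one. The sketch below measures the share of word n-grams in a generated text that also occur in the training corpus. It is not the check performed for this post; the helper names and the default `n=8` are arbitrary choices for illustration.

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text`."""
    words = text.split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(generated, corpus, n=8):
    """Share of the generated text's n-grams that also appear in the corpus."""
    generated_ngrams = ngrams(generated, n)
    if not generated_ngrams:
        return 0.0
    return len(generated_ngrams & ngrams(corpus, n)) / len(generated_ngrams)
```

A ratio close to 1 for long n-grams would point to verbatim copying, while a low ratio is consistent with the mostly original texts observed above. Boilerplate such as "This Regulation shall be binding in its entirety" will match in any case, so the number should be read with that in mind.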

What I find particularly impressive is the ease with which the GPT-2 model could be trained to generate content of such a quality. Only a few lines of code were necessary to load and preprocess the dataset as well as for fine-tuning the model. The barrier to entry to cutting-edge Natural Language Processing has been considerably reduced in the last two years. This is indeed the promise of transfer learning in the form of pretrained language models. 

A next step would be to train GPT-2 on another very interesting set of legal texts: court judgments. Given a large enough dataset, the ability of GPT-2 to stay coherent could really shine.