# **Fine-tuning GPT-2 for text generation**
### This notebook has been taken from [here](https://gist.github.com/GeorgeDittmar/5c57a35332b2b5818e51618af7953351)
- It contains code to fine-tune GPT-2 on two datasets:
  1. [Tiny Shakespeare](https://huggingface.co/datasets/tiny_shakespeare) - 40,000 lines drawn from a variety of Shakespeare's plays
  2. [Bill Sum](https://huggingface.co/datasets/billsum) - summaries of US Congressional and California state bills


## Table of contents
1. [Imports and Installation section](#Imports)
2. [Shakespeare Data preparation](#Shakespeare)
3. [GPT-2 Model Fine-tuning on Shakespeare Dataset](#FineTuning)
4. [Shakespeare Text Generation using GPT-2](#TextGeneration)
5. [Bill Sum Dataset preparation](#BillSum)
6. [GPT-2 Model Fine-tuning on Bill Sum Dataset](#FineTuningBillSum)
7. [Bill Sum Text Generation using GPT-2](#BillSumTextGeneration)

## Part 1: Imports and Installation section <a name="Imports"></a>

In [None]:
# import necessary modules
import os
import json
import pandas as pd
from sklearn.model_selection import train_test_split

In [None]:
# mount google drive
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
#Clone the transformers repo into the notebook
!git clone https://github.com/huggingface/transformers

Cloning into 'transformers'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (18/18), done.
remote: Total 62213 (delta 3), reused 9 (delta 0), pack-reused 62194
Receiving objects: 100% (62213/62213), 47.44 MiB | 28.97 MiB/s, done.
Resolving deltas: 100% (44035/44035), done.


In [None]:
# Clone should now be in the machine
!ls

drive  sample_data  transformers


Change into the `examples/language-modeling` folder of the cloned repo, then install its requirements.

In [None]:
os.chdir("transformers")
os.chdir("./examples/language-modeling")
!ls

README.md	  run_clm.py	   run_mlm.py
requirements.txt  run_mlm_flax.py  run_plm.py


In [None]:
!pip install -r requirements.txt

Collecting datasets>=1.1.3
  Downloading https://files.pythonhosted.org/packages/06/9b/d097f2238fc3c028495cf5f8c65378972b9f1b2cbb27f3c57c7219195aa9/datasets-1.2.1-py3-none-any.whl (159kB)
     |████████████████████████████████| 163kB 17.1MB/s 
Collecting sentencepiece!=0.1.92
  Downloading https://files.pythonhosted.org/packages/14/67/e42bd1181472c95c8cda79305df848264f2a7f62740995a46945d9797b67/sentencepiece-0.1.95-cp36-cp36m-manylinux2014_x86_64.whl (1.2MB)
     |████████████████████████████████| 1.2MB 47.0MB/s 
Collecting pyarrow>=0.17.1
  Downloading https://files.pythonhosted.org/packages/33/67/2f4fcce1b41bcc7e88a6bfdb42046597ae72e5bc95c2789b7c5ac893c433/pyarrow-3.0.0-cp36-cp36m-manylinux2014_x86_64.whl (20.7MB)
     |████████████████████████████████| 20.7MB 1.2MB/s 
Collecting xxhash
  Downloading https://files.pythonhosted.org/packages/f7/73/826b19f3594756cb1c6c23d2fbd8ca6a77a9cd3b650c9dec5acc85004c38/xxhash-2.0.0-cp36-cp36m-manylinux2010_x8

In [None]:
!ls

README.md	  run_clm.py	   run_mlm.py
requirements.txt  run_mlm_flax.py  run_plm.py


In [None]:
!pip install pyarrow --upgrade

Requirement already up-to-date: pyarrow in /usr/local/lib/python3.6/dist-packages (3.0.0)


In [None]:
os.chdir("/content/transformers/examples/")
os.chdir("./language-modeling")

In [None]:
# Install the latest transformers package from GitHub so the example scripts run correctly.
# GitHub no longer serves the unauthenticated git:// protocol, so the URL uses https://.
!pip install git+https://github.com/huggingface/transformers

Collecting git+git://github.com/huggingface/transformers/
  Cloning git://github.com/huggingface/transformers/ to /tmp/pip-req-build-txc9i948
  Running command git clone -q git://github.com/huggingface/transformers/ /tmp/pip-req-build-txc9i948
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Collecting tokenizers<0.11,>=0.10.1
  Downloading https://files.pythonhosted.org/packages/fd/5b/44baae602e0a30bcc53fbdbc60bd940c15e143d252d658dfdefce736ece5/tokenizers-0.10.1-cp36-cp36m-manylinux2010_x86_64.whl (3.2MB)
     |████████████████████████████████| 3.2MB 16.9MB/s 
Collecting sacremoses
  Downloading https://files.pythonhosted.org/packages/7d/34/09d19aff26edcc8eb2a01bed8e98f13a1537005d31e95233fd48216eed10/sacremoses-0.0.43.tar.gz (883kB)
     |████████████████████████████████| 890kB 59.3MB/s 
Building wheels for collected packages: transformers
  Building

## Part 2: Shakespeare Data Preparation section <a name="Shakespeare"></a>
- Read the corpus from a text file, prefix every line with `<|title|>`, split the lines into training and evaluation sets, and write each set to a file with records separated by `<|endoftext|>`.

In [None]:
# Prefix every line of the corpus with the <|title|> marker
with open('/content/drive/MyDrive/bill_sum_shakespeare/shakespeare.txt', 'r') as data:
  dataset = ["<|title|>" + x.strip() for x in data.readlines()]

# 90/10 split; named eval_set to avoid shadowing the built-in eval()
train, eval_set = train_test_split(dataset, train_size = 0.9, random_state = 42)
print("Training size:" + str(len(train)))
print("Evaluation size: " + str(len(eval_set)))

# Join records with <|endoftext|> so the model sees a boundary between them
with open('train_tmp.txt', 'w') as file_handle:
  file_handle.write("<|endoftext|>".join(train))

with open('eval_tmp.txt', 'w') as file_handle:
  file_handle.write("<|endoftext|>".join(eval_set))

Training size:36000
Evaluation size: 4000
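To make the file layout concrete, here is a minimal sketch of the same formatting step applied to three in-memory lines instead of the Drive file. The helper name `format_records` and the sample lines are illustrative assumptions; only the `<|title|>` prefix and the `<|endoftext|>` separator come from the cell above.

```python
# Illustrative sketch: the helper name and the sample lines are assumptions;
# only the two marker strings are taken from the notebook.
TITLE = "<|title|>"
EOT = "<|endoftext|>"

def format_records(lines):
    """Prefix each stripped line with the title marker and join with the separator."""
    return EOT.join(TITLE + line.strip() for line in lines)

sample = [
    "First Citizen:\n",
    "Before we proceed any further, hear me speak.\n",
    "All:\n",
]
formatted = format_records(sample)
print(formatted)

# Splitting on the separator recovers one record per input line.
records = formatted.split(EOT)
print(len(records))
```

Because `join` only places the separator between records, the last record in each file is not followed by `<|endoftext|>`. Note also that `<|endoftext|>` is GPT-2's end-of-text token, whereas `<|title|>` is not part of the pretrained vocabulary and is tokenized as ordinary text unless it is added to the tokenizer.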


## Part 3: GPT-2 Model Fine-tuning on Shakespeare Dataset <a name="FineTuning"></a>
### Fine-tuning with **1 epoch**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 1 \
--fp16 \
--learning_rate 5e-5 \
--output_dir="Output dir"

2021-02-10 14:01:11.237413: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/10/2021 14:01:12 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Output dir>, overwrite_output_dir=False, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.NO, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.LINEAR, warmup_steps=0, logging_dir=runs/Feb10_14-01-12_cf74782d0fe2, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debu
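Because the run passes `--do_eval`, it produces an evaluation loss, and for a causal language model the usual summary metric is perplexity, the exponential of the mean cross-entropy loss. The sketch below shows that conversion so runs with different settings can be compared on one number. The loss values and run names are made-up placeholders, not results from this notebook.

```python
import math

def perplexity(eval_loss):
    """Perplexity of a language model given its mean cross-entropy loss in nats."""
    return math.exp(eval_loss)

# Hypothetical losses for illustration only; substitute the values from your own runs.
hypothetical_losses = {"run_a": 3.2, "run_b": 2.9}
for name, loss in hypothetical_losses.items():
    print(f"{name}: loss={loss:.2f} perplexity={perplexity(loss):.1f}")
```

Lower is better: a lower evaluation loss always gives a lower perplexity, so either number ranks the runs the same way.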

### Fine-tuning with **3 epochs**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 3 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 10:27:58.071755: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 10:27:59 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_10-27-59_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine-tuning with **5 epochs**


In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 10:48:48.805022: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 10:48:50 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_10-48-50_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine-tuning with **7 epochs**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 7 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 11:20:44.331601: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 11:20:45 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=7.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_11-20-45_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine-tuning with **learning rate = 4e-5** and number of **epochs = 5**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--learning_rate 4e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 12:02:47.056848: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 12:02:48 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=4e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_12-02-48_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine-tuning with **learning rate = 3e-5** and number of **epochs = 5**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--learning_rate 3e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 12:31:54.267605: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 12:31:56 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=3e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_12-31-56_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine-tuning with **learning rate = 2e-5** and number of **epochs = 5**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--learning_rate 2e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-09 13:01:03.892428: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/09/2021 13:01:05 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=2e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb09_13-01-05_aa1c1fd0c52e, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_
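The runs from 3 epochs onward differ only in `--num_train_epochs` and `--learning_rate`, so the sweep can be written more compactly. Below is a rough sketch that builds the same argument list for each setting. The helper name `build_run_clm_args` and the `grid` variable are assumptions for illustration, and the flags are copied from the cells above rather than checked against any particular `transformers` release.

```python
import shlex

def build_run_clm_args(epochs, learning_rate, output_dir="Output dir"):
    """Assemble the run_clm.py command line used in this notebook for one setting."""
    return [
        "python", "run_clm.py",
        "--model_type", "gpt2-medium",
        "--model_name_or_path", "gpt2-medium",
        "--train_file", "train_tmp.txt",
        "--do_train",
        "--validation_file", "eval_tmp.txt",
        "--do_eval",
        "--per_gpu_train_batch_size", "1",
        "--save_steps", "-1",
        "--num_train_epochs", str(epochs),
        "--fp16",
        "--learning_rate", str(learning_rate),
        "--lr_scheduler_type", "constant",
        "--evaluation_strategy", "epoch",
        "--output_dir", output_dir,
        "--overwrite_output_dir", "True",
    ]

# Grid mirroring the (epochs, learning rate) pairs tried above.
grid = [(3, "5e-5"), (5, "5e-5"), (7, "5e-5"), (5, "4e-5"), (5, "3e-5"), (5, "2e-5")]
for epochs, lr in grid:
    args = build_run_clm_args(epochs, lr)
    # Print a shell-safe version of the command; quoting handles the space in "Output dir".
    print(" ".join(shlex.quote(a) for a in args))
```

Each list could be handed to `subprocess.run(args, check=True)` from the `language-modeling` directory. Since every run above writes to the same output directory with `--overwrite_output_dir`, only the most recent model is kept unless `output_dir` is varied per run.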

In [None]:
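from transformers import TFGPT2LMHeadModel
from transformers import GPT2Tokenizer

# from_pt = True converts the PyTorch checkpoint written by run_clm.py into a TensorFlow model
model = TFGPT2LMHeadModel.from_pretrained("Output dir", from_pt = True)
# all GPT-2 sizes share one vocabulary, so the base "gpt2" tokenizer matches gpt2-medium
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")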

Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFGPT2LMHeadModel: ['transformer.h.22.attn.masked_bias', 'transformer.h.18.attn.masked_bias', 'transformer.h.11.attn.masked_bias', 'transformer.h.20.attn.masked_bias', 'transformer.h.2.attn.masked_bias', 'transformer.h.16.attn.masked_bias', 'transformer.h.0.attn.masked_bias', 'transformer.h.12.attn.masked_bias', 'transformer.h.5.attn.masked_bias', 'transformer.h.6.attn.masked_bias', 'transformer.h.17.attn.masked_bias', 'lm_head.weight', 'transformer.h.14.attn.masked_bias', 'transformer.h.1.attn.masked_bias', 'transformer.h.10.attn.masked_bias', 'transformer.h.4.attn.masked_bias', 'transformer.h.21.attn.masked_bias', 'transformer.h.7.attn.masked_bias', 'transformer.h.13.attn.masked_bias', 'transformer.h.3.attn.masked_bias', 'transformer.h.8.attn.masked_bias', 'transformer.h.9.attn.masked_bias', 'transformer.h.15.attn.masked_bias', 'transformer.h.19.attn.masked_bias', 'transformer.h.23.attn.masked_bias']

## Part 4: Shakespeare Text Generation using GPT-2 <a name="TextGeneration"></a>

In [None]:
# set the initial text to start the process of text generation
input_ids = tokenizer.encode("He that will give good words to thee will", return_tensors = 'tf')

# print the tensor ids
input_ids[0]

<tf.Tensor: shape=(9,), dtype=int32, numpy=
array([ 1544,   326,   481,  1577,   922,  2456,   284, 17903,   481],
      dtype=int32)>

In [None]:
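generated_text_samples = model.generate(
    input_ids,
    max_length = 128,            # total length in tokens, prompt included
    num_return_sequences = 5,    # number of independent samples to return
    no_repeat_ngram_size = 2,    # never repeat the same 2-gram within a sample
    repetition_penalty = 1.5,    # values above 1 penalize tokens already in the sequence
    top_p = 0.92,                # nucleus sampling: smallest token set with cumulative probability >= 0.92
    temperature = 0.85,          # values below 1 sharpen the distribution
    do_sample = True,            # sample instead of decoding greedily
    top_k = 125,                 # consider only the 125 most likely tokens
    early_stopping = True        # only relevant to beam search; no effect on pure sampling
)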

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
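The `temperature`, `top_k`, and `top_p` arguments above all reshape the next-token distribution before a token is sampled. The following is a simplified, pure-Python sketch of that idea on a toy vocabulary of four tokens. It is an illustration under stated assumptions, not the library's implementation: the function name `filter_probs` is hypothetical, and the repetition penalty and n-gram blocking are left out.

```python
import math

def filter_probs(logits, temperature=0.85, top_k=125, top_p=0.92):
    """Return next-token probabilities after temperature, top-k and top-p filtering."""
    # Temperature: dividing by a value below 1 sharpens the distribution.
    scaled = [x / temperature for x in logits]

    # Top-k: keep only the indices of the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)
    kept = order[:top_k]

    # Softmax over the surviving tokens (shifted by the maximum for stability).
    peak = scaled[kept[0]]
    weights = [math.exp(scaled[i] - peak) for i in kept]
    total = sum(weights)
    probs = [w / total for w in weights]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    nucleus, cumulative = [], 0.0
    for index, p in zip(kept, probs):
        nucleus.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept tokens sum to 1; every other token gets probability 0.
    mass = sum(p for _, p in nucleus)
    result = [0.0] * len(logits)
    for index, p in nucleus:
        result[index] = p / mass
    return result

probs = filter_probs([2.0, 1.0, 0.0, -1.0], temperature=1.0, top_k=3, top_p=0.9)
print([round(p, 3) for p in probs])  # [0.731, 0.269, 0.0, 0.0]
```

In this toy case top-k drops the least likely token, and top-p then drops the third one because the first two already cover more than 90% of the probability mass. A token would then be drawn from the two that remain, for example with `random.choices(range(len(probs)), weights=probs)`.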


In [None]:
# Print each generated sequence, decoded back to text
for i, sample in enumerate(generated_text_samples):
  print("{}: {}".format(i, tokenizer.decode(sample, skip_special_tokens = True)))
  print()

0: He that will give good words to thee will also deliver thy spirit into the hands of God. - Psalm 83:20.
There is a reason for this prophecy, however; it points to Christ's return and His coming again on Earth upon his own behalf (see Hebrews 13). However how does he accomplish these goals by making himself available at once? And then why did Jesus bring back all those things which have been stolen from Him as He had already done so in Heaven before leaving us here today just years ago? On one side are various Christian saints who would not tolerate their former status with our modern world but still remain faithful

1: He that will give good words to thee will be more profitable for me than any king. And let him who has been made a great man say unto his disciples, O men! I have laid before you all the kingdom of Heaven; and thy place is this: ye shall not go down from it until thou canst see clearly what my wrath hath caused iniquity by those things which are done among us."
17 The

## Part 5: Bill Sum Dataset preparation <a name="BillSum"></a>

In [None]:
# read the training data into pandas dataframe
data_path = '/content/drive/MyDrive/bill_sum_shakespeare/us_train_data_final_OFFICIAL.jsonl'
data = pd.read_json(data_path, lines = True)

# display settings
pd.set_option('display.max_colwidth', None)

# display few rows of the dataframe
data.head(2)

Unnamed: 0,bill_id,text,summary,title,text_len,sum_len
0,107_hr2256,"SECTION 1. SHORT TITLE.\n\n This Act may be cited as the ``Border Hospital Survival and Illegal \nImmigrant Care Act''.\n\nSEC. 2. FINDINGS.\n\n The Congress finds as follows:\n (1) Immigration is a Federal responsibility.\n (2) The Immigration and Naturalization Service does not \n take into custody all aliens who are unlawfully present in the \n United States.\n (3) Section 1867 of the Social Security Act (42 U.S.C. \n 1395dd) and State laws require that, if any individual (whether \n or not lawfully present in the United States) comes to a \n hospital and the hospital determines that the individual has an \n emergency medical condition, the hospital must provide either, \n within the staff and facilities available at the hospital, for \n such further medical examination and such treatment as may be \n required to stabilize the medical condition, or, if \n appropriate, for transfer of the individual to another medical \n facility.\n (4) The Southwest border region is ill-equipped to absorb \n the expense of providing health care to undocumented aliens \n because it ranks last in the country in terms of per capita \n income.\n (5) The Southwest border region has been designated as a \n health professional shortage area under section 332 of the \n Public Health Service Act (42 U.S.C. 254e).\n (6) The unreimbursed costs associated with caring for \n undocumented aliens are severely threatening the financial \n stability of health care providers in Arizona.\n\nSEC. 3. REIMBURSEMENT TO HEALTH CARE PROVIDERS FOR EMERGENCY MEDICAL \n CARE RENDERED TO CERTAIN ALIENS.\n\n Section 322 of the Public Health Service Act (42 U.S.C. 
249) is \namended by adding at the end the following:\n ``(d)(1) The Secretary shall establish and implement a 5-year pilot \nprogram under which funds made available under paragraph (6) are used \nto reimburse providers for items and services described in section \n411(b)(1) of the Personal Responsibility and Work Opportunity \nReconciliation Act of 1996 (8 U.S.C. 1621(b)(1)) provided in Arizona to \naliens described in paragraph (3), and to reimburse suppliers of \nemergency ambulance services furnished to such aliens for which the \ntransportation originates in Arizona (where the use of other methods of \ntransportation is contraindicated by the alien's condition), if payment \nmay not be made to reimburse the provider or supplier under any Federal \nprogram or law other than this subsection (such as title XIX of the \nSocial Security Act), any State or local program or law, any group or \nindividual health plan, or any insurance policy.\n ``(2) As part of the pilot program, in a case in which an alien \ndescribed in paragraph (3) arrived at a hospital in Arizona and the \nhospital provided for such medical examination and treatment of the \nalien as the hospital determined was required to stabilize an emergency \nmedical condition (within the meaning of section 1867(e)(1) of the \nSocial Security Act (42 U.S.C. 
1395dd(e)(1))), the Secretary shall use \nfunds made available under paragraph (6) to reimburse the hospital for \nany transportation costs paid by the hospital to return the alien to \nthe United States border, if--\n ``(A) the hospital requested the Attorney General to take \n the alien into custody after such stabilization;\n ``(B) such request was denied within 24 hours after its \n receipt, or the Attorney General gave no response to it within \n such period; and\n ``(C) the hospital determined that discharging the alien \n without providing for such transportation might pose a threat \n to the health or safety of the alien (or, with respect to a \n pregnant alien, the health or safety of the alien or her unborn \n child).\n ``(3) An alien is described in this paragraph if the alien--\n ``(A) is not lawfully present in the United States and not \n detained by any Federal, State, or local law enforcement \n authority; or\n ``(B) is paroled into the United States under section \n 212(d)(5) of the Immigration and Nationality Act (8 U.S.C. \n 1182(d)(5)) for less than one year in order to receive \n treatment for an emergency medical condition.\n ``(4) During the period in which the pilot program is operating, \nthe Secretary shall submit annual reports to the Congress on its \noperation. 
Each report shall contain at least the following \ninformation:\n ``(A) The number of aliens to whom assistance was rendered \n for which payment was made under this subsection during the \n previous year.\n ``(B) The nationality of such aliens.\n ``(C) The average cost per alien of such assistance.\n ``(D) The total annual amount paid to each provider or \n supplier of assistance.\n ``(E) The feasibility and estimated cost of expanding the \n pilot program to items and services provided anywhere in the \n Southwest border region of the United States.\n ``(5) Nothing in this subsection shall be construed to authorize \nany reduction in the funds payable to any person under any Federal \nprogram or law other than this subsection (such as title XIX of the \nSocial Security Act), any State or local program or law, any group or \nindividual health plan, or any insurance policy.\n ``(6) To the extent provided in appropriations Acts, from amounts \nmade available to the Immigration and Naturalization Service for \nenforcement and border affairs for each of the 5 fiscal years following \nthe fiscal year in which the Border Hospital Survival and Illegal \nImmigrant Care Act is enacted, the Attorney General may transfer to the \nHealth Resources and Services Administration of the Department of \nHealth and Human Services such amounts as may be necessary to carry out \nthis subsection, not to exceed $50,000,000 for each such year.''.","Border Hospital Survival and Illegal Immigrant Care Act - Amends the Public Health Service Act to direct the Secretary of Health and Human Services to establish a five-year pilot program of health care provider reimbursement for the costs associated with providing emergency medical and ambulance services in Arizona to: (1) illegal aliens who are not detained by any Federal, State, or local law enforcement authority. 
Or (2) aliens paroled into the United States for less than one year to receive emergency medical treatment.","To amend the Public Health Service Act to establish a 5-year pilot program under which health care providers are reimbursed by the Secretary of Health and Human Services for the costs associated with providing emergency medical care to aliens who are not lawfully present in the United States and are not detained by any law enforcement authority, and for other purposes.",6100,527
1,111_hr4710,"SECTION 1. SHORT TITLE.\n\n This Act may be cited as the ``Farm to School Improvements Act of \n2010''.\n\nSEC. 2. FARM TO SCHOOL PROGRAM.\n\n (a) Amendment.--The Richard B. Russell National School Lunch Act \n(42 U.S.C. 1751 et seq.) is amended by inserting after section 19, the \nfollowing:\n\n``SEC. 19A. FARM TO SCHOOL PROGRAM.\n\n ``(a) In General.--The Secretary shall provide assistance, through \ncompetitive matching grants and technical assistance, to eligible \nentities for farm to school programs that--\n ``(1) improve access to local foods in schools and \n institutions participating in programs under this Act and \n section 4 of the Child Nutrition Act of 1966 (42 U.S.C. 1773) \n through farm to school activities, including the purchase of \n local food, establishment of effective relationships between \n school and institutional food service providers, distributors, \n and producers or groups of producers, school gardens, \n appropriate equipment, and the provision of training and \n education; and\n ``(2) are designed to--\n ``(A) improve the nutritional health and well being \n of children;\n ``(B) procure healthy local foods from small and \n medium-sized farms for meals at eligible schools and \n institutions;\n ``(C) support experiential nutrition education \n activities and curriculum planning that incorporates \n the participation of school children in farm and \n garden-based agricultural education activities;\n ``(D) develop a sustained commitment to farm to \n school programs in the community by linking schools and \n institutions, State and local agencies including Indian \n Tribal Organizations, institutions of higher education, \n agricultural producers, parents, community garden \n groups and other community stakeholders; and\n ``(E) increase farm income by facilitating farmers' \n access to institutional markets including schools.\n ``(b) Eligible Entity.--For purposes of this section, the term \n`eligible entity' means--\n 
``(1) a school;\n ``(2) nonprofit organization; or\n ``(3) other entity that the Secretary determines offers a \n unique ability to provide services or farm-to-school programs.\n ``(c) Grants.--\n ``(1) Types of grants.--A grant awarded under this section \n may include--\n ``(A) an implementation grant to support the cost \n of implementing a farm to school program;\n ``(B) a training and technical assistance grant to \n support the cost of--\n ``(i) providing the training, operational \n support, information, and access to resources \n necessary to implement a successful farm to \n school program; and\n ``(ii) encouraging collaboration between \n public and private entities; or\n ``(C) a planning grant to support the cost of \n conducting research, identifying resources, and \n developing partnerships to design a successful and \n sustainable farm to school program.\n ``(2) Grant amounts.--A grant awarded under this section to \n an eligible entity shall not exceed--\n ``(A) in the case of an implementation or training \n and technical assistance grant, $100,000; and\n ``(B) in the case of a planning grant, $25,000.\n ``(3) Grant duration.--A grant under this section shall be \n awarded for a period--\n ``(A) in the case of an implementation or training \n and technical assistance grant, not to exceed 2 years; \n and\n ``(B) in the case of a planning grant, not to \n exceed 1 year.\n ``(d) Cost Share.--\n ``(1) In general.--The amount of a grant made under this \n section shall not exceed 75 percent of the cost of the proposed \n grant activities.\n ``(2) Non-federal support.--A recipient of a grant under \n this section shall be required to provide at least 25 percent \n of the cost of the proposed grant activities in the form of \n cash or in-kind contributions (including facilities, equipment, \n training, or services provided by State and local governments \n and private sources).\n ``(e) Evaluation.--A recipient of a grant under this section shall 
\ncooperate in an evaluation by the Secretary of the programs carried out \nusing such grant funds.\n ``(f) Regional Balance.--In making awards and providing technical \nassistance under this section, the Secretary shall to the maximum \nextent practicable, ensure--\n ``(1) geographical diversity; and\n ``(2) equitable treatment of urban, rural, and tribal \n communities.\n ``(g) Technical Assistance.--The Secretary shall provide recipients \nof grants under this section with technical assistance, which shall \ninclude sharing information, best practices, research, and data on \nexisting farm to school programs.\n ``(h) Proposals.--\n ``(1) In general.--An eligible entity desiring to receive a \n grant under this section shall submit a proposal to the \n Secretary at such time, in such manner, and containing such \n information as the Secretary may require.\n ``(2) Competitive award selection.--The Secretary shall \n form review panels to evaluate proposals submitted under \n paragraph (1) based on the criteria described in paragraph (3). 
\n Such review panels shall include--\n ``(A) representatives of schools and eligible \n institutions;\n ``(B) registered dietitians;\n ``(C) operators of small and medium-sized farms;\n ``(D) public agencies;\n ``(E) non-governmental and community-based \n organizations with expertise in local food systems and \n farm to school programs; and\n ``(F) other appropriate parties as determined by \n the Secretary.\n ``(3) Proposal review criteria.--In making awards under \n this section, the Secretary shall evaluate proposals based on \n the extent to which the proposed program--\n ``(A) improves the nutritional health and well \n being of children;\n ``(B) makes local food products available on the \n menu of the school or institution;\n ``(C) benefits local small and medium-sized farms;\n ``(D) incorporates experiential nutrition education \n activities and curriculum planning that incorporates \n the participation of school children in farm and \n garden-based agricultural education activities;\n ``(E) serves schools and eligible institutions with \n a high proportion of children who are eligible for free \n and reduced price lunches;\n ``(F) demonstrates collaboration between schools or \n institutions, non-governmental and community-based \n organizations, farmer groups, and other community \n partners;\n ``(G) demonstrates the potential for long-term \n program sustainability;\n ``(H) includes adequate and participatory \n evaluation plans; and\n ``(I) meets such other related criteria as the \n Secretary may determine relevant.\n ``(i) Funding.--Beginning on October 1, 2010, or of any funds in \nthe Treasury not otherwise appropriated, the Secretary of the Treasury \nshall transfer to the Secretary of Agriculture to carry out this \nsection $10,000,000 each fiscal year, to remain available until \nexpended.''.\n (b) Conforming Change.--Section 18(g) of the Richard B. Russell \nSchool Lunch Act (42 U.S.C. 
1769(g)) is amended--\n (1) by striking paragraphs (1) and (2); and\n (2) by redesignating paragraphs (3) and (4) as paragraphs \n (1) and (2), respectively.","Farm to School Improvements Act of 2010 - Amends the Richard B. Russell National School Lunch Act to direct the Secretary of Agriculture to provide competitive matching grants to schools, nonprofit organizations, and other able entities for farm to school programs that improve the access of school lunch and breakfast program participants to local foods. Provides that each grant may include an implementation grant, training and technical assistance grant, and planning grant. Requires farm to school programs to be designed to: (1) improve the nutritional health and well being of children, (2) procure healthy local foods from small and medium-sized farms. (3) support experiential nutrition education by involving school children in farm and garden-based agricultural education activities. (4) commit public and private community stakeholders to the sustained success of such programs. And (5) increase farmers' income by facilitating their access to institutional markets. Directs the Secretary to provide grant recipients with technical assistance that includes sharing information, best practices, research, and data on existing farm to school programs.",To amend the Richard B. Russell National School Lunch Act to award grants to eligible entities for farm to school programs.,8628,1161


In [None]:
# check the shape of the data frame
data.shape

(18949, 6)

In [None]:
# extract the summary text of each bill
text = data['summary'].values.tolist()

# split data into train and validation sets
# (named eval_set so the built-in eval() is not shadowed)
train, eval_set = train_test_split(text, train_size = 0.9, random_state = 42)
print("Training size:" + str(len(train)))
print("Evaluation size: " + str(len(eval_set)))

# write each split to one file, separating summaries with GPT-2's end-of-text token
with open('train_tmp.txt', 'w', encoding = 'utf-8') as file_handle:
  file_handle.write("<|endoftext|>".join(train))

with open('eval_tmp.txt', 'w', encoding = 'utf-8') as file_handle:
  file_handle.write("<|endoftext|>".join(eval_set))

Training size:17054
Evaluation size: 1895
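
Joining the summaries with `<|endoftext|>` keeps the document boundaries visible inside a single text file. The following is a minimal sketch, using made-up toy strings rather than the Bill Sum data, of how such a file could be read back and split to confirm that the number of documents survives the round trip. The helper names `write_joined` and `read_docs` and the file name `toy_train.txt` are hypothetical and not part of the notebook.

```python
import os
import tempfile

SEPARATOR = "<|endoftext|>"  # same separator string as in the cell above


def write_joined(docs, path):
    """Write all documents to one file, separated by SEPARATOR (hypothetical helper)."""
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(SEPARATOR.join(docs))


def read_docs(path):
    """Read the file back and split it into the original documents (hypothetical helper)."""
    with open(path, encoding="utf-8") as fh:
        return fh.read().split(SEPARATOR)


# toy stand-ins for the bill summaries
toy_docs = ["First toy summary.", "Second toy summary.", "Third toy summary."]

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_train.txt")
    write_joined(toy_docs, toy_path)
    recovered = read_docs(toy_path)

print(len(recovered), "documents recovered")
```

The same check could be pointed at `train_tmp.txt`, where the count would be expected to match the training size printed above, provided no summary itself contains the separator string.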


## Part 6: GPT-2 Model Fine tuning on Bill Sum Dataset <a name="FineTuningBillSum"></a>
### Fine tuning with **1 epoch**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 1 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-10 01:23:20.838664: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/10/2021 01:23:22 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb10_01-23-22_db3590ecbfe2, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine tuning with **3 epochs**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 3 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-10 02:10:09.675636: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/10/2021 02:10:11 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=3.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb10_02-10-11_db3590ecbfe2, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_

### Fine tuning with **5 epochs**

In [None]:
!python run_clm.py \
--model_type gpt2-medium \
--model_name_or_path gpt2-medium \
--train_file "train_tmp.txt" \
--do_train \
--validation_file "eval_tmp.txt" \
--do_eval \
--per_gpu_train_batch_size 1 \
--save_steps -1 \
--num_train_epochs 5 \
--fp16 \
--learning_rate 5e-5 \
--lr_scheduler_type constant \
--evaluation_strategy epoch \
--output_dir="Output dir" \
--overwrite_output_dir True

2021-02-10 04:21:35.962423: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.10.1
02/10/2021 04:21:37 - INFO - __main__ -   Training/evaluation parameters TrainingArguments(output_dir=<Path to your output dir>, overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=False, evaluation_strategy=EvaluationStrategy.EPOCH, prediction_loss_only=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, gradient_accumulation_steps=1, eval_accumulation_steps=None, learning_rate=5e-05, weight_decay=0.0, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=5.0, max_steps=-1, lr_scheduler_type=SchedulerType.CONSTANT, warmup_steps=0, logging_dir=runs/Feb10_04-21-37_db3590ecbfe2, logging_first_step=False, logging_steps=500, save_steps=-1, save_total_limit=None, no_cuda=False, seed=42, fp16=True, fp16_opt_level=O1, fp16_backend=auto, local_rank=-1, tpu_num_cores=None, tpu_metrics_
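
The three runs above differ only in `--num_train_epochs`. If they also share one output directory, as the `"Output dir"` placeholder suggests, `--overwrite_output_dir True` means each run replaces the model saved by the previous one. The sketch below builds the same command line from a single function and gives every run its own directory. The function name `build_clm_command` and the `gpt2-billsum-<n>ep` directory names are hypothetical; the flags are copied from the cells above. The code only assembles and prints the commands, it does not launch training.

```python
import shlex


def build_clm_command(num_epochs, output_dir):
    """Assemble the run_clm.py call used above for a given epoch count (hypothetical helper)."""
    return [
        "python", "run_clm.py",
        "--model_type", "gpt2-medium",
        "--model_name_or_path", "gpt2-medium",
        "--train_file", "train_tmp.txt",
        "--do_train",
        "--validation_file", "eval_tmp.txt",
        "--do_eval",
        "--per_gpu_train_batch_size", "1",
        "--save_steps", "-1",
        "--num_train_epochs", str(num_epochs),
        "--fp16",
        "--learning_rate", "5e-5",
        "--lr_scheduler_type", "constant",
        "--evaluation_strategy", "epoch",
        "--output_dir", output_dir,
        "--overwrite_output_dir", "True",
    ]


# one command per epoch setting, each with its own (hypothetical) output directory
commands = {n: build_clm_command(n, f"gpt2-billsum-{n}ep") for n in (1, 3, 5)}

for n, cmd in commands.items():
    print(shlex.join(cmd))
```

Each list could be passed to `subprocess.run(cmd, check=True)` from the `examples/language-modeling` directory, so that the 1, 3 and 5 epoch models stay available side by side for comparison.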

## Part 7: Bill Sum Text Generation using GPT-2 <a name="BillSumTextGeneration"></a>


In [None]:
# load the fine-tuned PyTorch checkpoint into the TensorFlow model class
model = TFGPT2LMHeadModel.from_pretrained("Output dir", from_pt = True)

# encode the prompt that seeds the text generation
bill_sum_input_ids = tokenizer.encode("This Act may be cited as the ``Farm to School Improvements Act", return_tensors = 'tf')

# display the token ids of the prompt
bill_sum_input_ids[0]

<tf.Tensor: shape=(13,), dtype=int32, numpy=
array([ 1212,  2191,   743,   307,  9181,   355,   262,  7559, 48412,
         284,  3961, 45097,  2191], dtype=int32)>

### Generate Text

In [None]:
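bill_sum_generated_text_samples = model.generate(
    bill_sum_input_ids,
    max_length = 128,            # prompt plus generated tokens
    num_return_sequences = 5,    # number of independent samples
    no_repeat_ngram_size = 2,    # never repeat the same 2-gram
    repetition_penalty = 1.5,    # penalize tokens that were already generated
    top_p = 0.92,                # nucleus sampling threshold
    temperature = 0.85,          # below 1.0 sharpens the distribution
    do_sample = True,            # sample instead of greedy decoding
    top_k = 125,                 # keep only the 125 most likely tokens
    early_stopping = True        # only relevant to beam search; no effect here
)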

Setting `pad_token_id` to 50256 (first `eos_token_id`) to generate sequence
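
To make the roles of `temperature`, `top_k` and `top_p` more concrete, here is a simplified pure-Python sketch of how these three settings can narrow a next-token distribution before a token is drawn. It is an illustration under stated assumptions, not the library's implementation: the function name `filter_top_k_top_p` and the five toy logits are made up, and `repetition_penalty` and `no_repeat_ngram_size` are not modeled.

```python
import math
import random


def filter_top_k_top_p(logits, temperature=0.85, top_k=125, top_p=0.92):
    """Return {token_id: probability} after temperature, top-k and top-p filtering.

    Hypothetical helper for illustration only.
    """
    # 1. temperature: dividing by a value below 1.0 sharpens the distribution
    scaled = [x / temperature for x in logits]

    # 2. top-k: keep the k highest-scoring token ids
    ranked = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]

    # softmax over the surviving tokens (shifted by the max for numerical stability)
    peak = scaled[ranked[0]]
    exps = {i: math.exp(scaled[i] - peak) for i in ranked}
    total = sum(exps.values())
    probs = {i: e / total for i, e in exps.items()}

    # 3. top-p: keep the smallest prefix whose cumulative probability reaches top_p
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break

    # renormalize so the kept probabilities sum to 1
    norm = sum(probs[i] for i in kept)
    return {i: probs[i] / norm for i in kept}


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # made-up scores for a 5-token vocabulary
distribution = filter_top_k_top_p(toy_logits, temperature=0.85, top_k=3, top_p=0.8)

# draw one token id from the filtered distribution
rng = random.Random(0)
sampled_id = rng.choices(list(distribution), weights=list(distribution.values()), k=1)[0]
print(distribution, sampled_id)
```

Lowering `top_k` or `top_p` shrinks the set of candidate tokens, which tends to make samples more conservative; raising them lets less likely tokens through.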


In [None]:
# Decode and print each generated sequence
for i, sample_ids in enumerate(bill_sum_generated_text_samples):
  print("{}: {}".format(i, tokenizer.decode(sample_ids, skip_special_tokens = True)))
  print()

0: This Act may be cited as the ``Farm to School Improvements Act''. [[Page 127 STAT. 1294]] (2) <<NOTE: Deadline.>> Effective date.--The amendment made by paragraph 18 of section 1346(e)(4), and in addition, all amendments making such an award under this subtitle shall take effect on September 1st 2009. SEC., 2013A--FREEDOM OF HAWTHORNE COUNTY REGISTER AND RIGHTS ASSESSMENTS WITH RESPECT TO THE PURPOSE FOR SHALLOWING A GRASS FIBERATION PROGRAMS OR ADMISSION REQUIREMENT BY RESIDENTS COMPET

1: This Act may be cited as the ``Farm to School Improvements Act''. SEC. 576A--MULTIPLE DISTRIBUTION ADMINISTRATION OF SMALL, UNIFORMED AND PROFIT-DEDUCTIBLE FINANCIAL SERVICES ACT of 2015 INTELLIGENCE AUTHORITY FOR FARMING LOCATIONS WITH LAND PROGRAMS THAT DO NOT PROVIDE SUBMARINE RESEARCH FUNDS TO OTHER SCHOOL REPORTS OR CONTRACTORS AS DEFINITIONALLY NEEDY IS COMMONLY DESIGNATED BY THE PRESIDENCY GOVERNMENT ARGUMENT AFFAIR D

2: This Act may be cited as the ``Farm to School Improvements Act''. (b