## Using Gradio to wrap a text-to-text interface around GPT-2

Check out the library on [GitHub](https://github.com/gradio-app/gradio-UI) and see the [getting started](https://gradio.app/getting_started.html) page for more demos.

### Installs and Imports

In [None]:
!pip install -q gradio
!pip install -q transformers
!pip install -q tokenizers

In [None]:
import gradio as gr
import numpy as np
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
import random
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import GPT2Tokenizer, GPT2LMHeadModel, AdamW, get_linear_schedule_with_warmup
from tqdm import tqdm, trange
import torch.nn.functional as F
import csv
import os

### Loading the model and creating the generate function

Note: You can also change to `hebrew-gpt_neo-tiny`, `hebrew-gpt_neo-small` or `hebrew-gpt_neo-xl`

In [None]:
#model_name = "Norod78/hebrew-gpt_neo-tiny"
model_name = "Norod78/hebrew-gpt_neo-small"

# Use the XL model only outside Colab, on a local machine with plenty of RAM; Colab doesn't have enough.
# model_name = "Norod78/hebrew-gpt_neo-xl"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)

loading file vocab.json from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/snapshots/fd891958cc050222616d6fa5b697bf5d43ff8955/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/snapshots/fd891958cc050222616d6fa5b697bf5d43ff8955/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/snapshots/fd891958cc050222616d6fa5b697bf5d43ff8955/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/snapshots/fd891958cc050222616d6fa5b697bf5d43ff8955/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/snapshots/fd891958cc050222616d6fa5b697bf5d43ff8955/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--Norod78--hebrew-gpt_neo-small/s

Load the data scraped from Twitter. The text file holds the tweets separated by the marker `END$`.

In [None]:
from google.colab import drive
# This will prompt for authorization.
drive.mount('/content/drive')

In [None]:
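# Read the whole file and split it into individual tweets on the END$ marker.
with open("/content/drive/My Drive/data.txt", encoding="utf-8") as f:
    twts = f.read().split('END$')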

In [None]:
twts[4]

'הרגע יצאתי באמצע התוכנית של אופירה וברקוביץ. עשו עלי אמבוש. הכניסו בחור שלא סובל אותי שילכלכך עלי בפריים טיים.\n\nכל היום התכוננתי לראיון איתם, ישבתי בבית ועברתי על טקסטים. באמת זה היה לי חשוב.\n\nניראה אותם עושים את זה לאיילת שקד. \nאבל היי, לרמוס את הדר מוכתר בשידור- מביא רייטינג. '
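Splitting on the marker alone can leave stray whitespace and an empty last entry when the file ends with `END$`. A minimal sketch of a slightly more defensive split follows; the `raw` string is a made-up miniature of `data.txt`, not the real data.

```python
# Hypothetical miniature of data.txt: tweets separated by the END$ marker.
raw = "first tweet\nwith two lines END$second tweet END$\n"

# Same split as in the notebook, plus whitespace stripping and removal of
# the empty piece that a trailing marker leaves behind.
tweets = [t.strip() for t in raw.split("END$") if t.strip()]

print(len(tweets))  # 2
print(tweets[1])    # second tweet
```

Whether the extra cleaning matters depends on how the scraped file was written, so treat it as optional.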

In [None]:
import re
import json
from sklearn.model_selection import train_test_split



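def build_text_files(texts, dest_path):
    # Write all tweets into one text file, each followed by two spaces.
    # The `with` block makes sure the file is flushed and closed.
    with open(dest_path, 'w', encoding='utf-8') as f:
        f.write("".join(text + "  " for text in texts))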

train, test = train_test_split(twts,test_size=0.05) 


build_text_files(train,'train_dataset.txt')
build_text_files(test,'test_dataset.txt')

print("Train dataset length: "+str(len(train)))
print("Test dataset length: "+ str(len(test)))

Train dataset length: 364
Test dataset length: 20
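The two lengths add up to 384 segments. With a fractional `test_size`, `train_test_split` rounds the test share up and gives the rest to training, which is where 20 and 364 come from. A small sketch of that arithmetic (the helper name `split_sizes` is made up for illustration):

```python
import math

def split_sizes(n_samples, test_size):
    """Sizes of a fractional hold-out split: test count rounded up, rest is train."""
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test

n_train, n_test = split_sizes(384, 0.05)
print(n_train, n_test)  # 364 20
```

The split is shuffled randomly on every run; passing `random_state` to `train_test_split` makes it repeatable.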


Tokenize the data

In [None]:
from transformers import TextDataset,DataCollatorForLanguageModeling

def load_dataset(train_path,test_path,tokenizer):
    train_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=train_path,
          block_size=128)
     
    test_dataset = TextDataset(
          tokenizer=tokenizer,
          file_path=test_path,
          block_size=128)   
    
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=False,
    )
    return train_dataset,test_dataset,data_collator

train_dataset,test_dataset,data_collator = load_dataset('train_dataset.txt','test_dataset.txt',tokenizer)

Token indices sequence length is longer than the specified maximum sequence length for this model (11727 > 1024). Running this sequence through the model will result in indexing errors
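The warning appears to come from tokenizing the whole training file in one go; the token stream is then cut into blocks of `block_size=128`, so no single training example is longer than the model allows. The sketch below mimics that chunking on stand-in ids (the helper `chunk_into_blocks` is hypothetical, not part of `transformers`) and assumes the incomplete tail block is dropped.

```python
def chunk_into_blocks(token_ids, block_size=128):
    """Cut a flat token list into consecutive full blocks; the short tail is dropped."""
    return [
        token_ids[i:i + block_size]
        for i in range(0, len(token_ids) - block_size + 1, block_size)
    ]

# Stand-in ids: 11727 is the token count reported in the warning above.
fake_ids = list(range(11727))
blocks = chunk_into_blocks(fake_ids)
print(len(blocks), len(blocks[-1]))  # 91 128
```

The 91 blocks match the `Num examples = 91` line in the training log of this notebook.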


### Initialize the Trainer

In [None]:
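from transformers import Trainer, TrainingArguments, AutoModelForCausalLM

# AutoModelWithLMHead is deprecated; AutoModelForCausalLM is the matching class for GPT-style models.
model = AutoModelForCausalLM.from_pretrained(model_name)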


training_args = TrainingArguments(
    output_dir="./gpt2-muchtar-small",  # output directory
    overwrite_output_dir=True,          # overwrite the contents of the output directory
    num_train_epochs=90,                # number of training epochs
    per_device_train_batch_size=32,     # batch size for training
    per_device_eval_batch_size=64,      # batch size for evaluation
    eval_steps=400,                     # update steps between two evaluations (only used when step-based evaluation is enabled)
    save_steps=800,                     # save a checkpoint every 800 update steps
    warmup_steps=500,                   # number of warmup steps for the learning rate scheduler
    prediction_loss_only=True,
)


trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
)
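Note that `warmup_steps=500` is larger than the 270 optimization steps this run ends up performing (see the training log), so the learning rate is still ramping up when training stops. The sketch below reproduces the multiplier of a linear warmup and decay schedule; it assumes the Trainer defaults (linear schedule, base learning rate 5e-5), neither of which is set explicitly in the notebook, and the function name is made up for illustration.

```python
def linear_warmup_factor(step, warmup_steps, total_steps):
    """Learning-rate multiplier for linear warmup followed by linear decay."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 5e-5        # assumption: Trainer default, not set in the notebook
WARMUP_STEPS = 500    # from the TrainingArguments above
TOTAL_STEPS = 270     # from the training log of this run

peak = max(linear_warmup_factor(s, WARMUP_STEPS, TOTAL_STEPS)
           for s in range(TOTAL_STEPS + 1))
print(peak)  # 0.54
```

Under these assumptions the run never sees more than about 54% of the base learning rate; a smaller `warmup_steps` would let it reach the full value.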



In [None]:
trainer.train()

***** Running training *****
  Num examples = 91
  Num Epochs = 90
  Instantaneous batch size per device = 32
  Total train batch size (w. parallel, distributed & accumulation) = 32
  Gradient Accumulation steps = 1
  Total optimization steps = 270


Training completed. Do not forget to share your model on huggingface.co/models =)




TrainOutput(global_step=270, training_loss=0.9737677680121528, metrics={'train_runtime': 271.9767, 'train_samples_per_second': 30.113, 'train_steps_per_second': 0.993, 'total_flos': 534821531811840.0, 'train_loss': 0.9737677680121528, 'epoch': 90.0})
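The step count in the log follows directly from the dataset size and batch size. A quick check of the arithmetic, assuming a single device and no gradient accumulation (as the log reports):

```python
import math

num_examples = 91   # "Num examples" in the log
batch_size = 32     # per_device_train_batch_size on a single device
epochs = 90         # num_train_epochs

# The last batch of each epoch is smaller, but it still counts as one step.
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)  # 3 270
```

With `save_steps=800` and only 270 steps in total, no intermediate checkpoint is written during this run; the model is saved only by the explicit `trainer.save_model()` call.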

In [None]:
trainer.save_model()

Saving model checkpoint to ./gpt2-muchtar-small
Configuration saved in ./gpt2-muchtar-small/config.json
Model weights saved in ./gpt2-muchtar-small/pytorch_model.bin


In [None]:
from transformers import pipeline

# `chef` is the text-generation pipeline used by the cells below.
chef = pipeline('text-generation', model='./gpt2-muchtar-small', tokenizer=model_name)

result = chef('Zuerst Hähnchen')[0]['generated_text']

loading configuration file ./gpt2-muchtar-small/config.json
Model config GPTNeoConfig {
  "_name_or_path": "./gpt2-muchtar-small",
  "activation_function": "gelu_new",
  "architectures": [
    "GPTNeoForCausalLM"
  ],
  "attention_dropout": 0,
  "attention_layers": [
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global",
    "global"
  ],
  "attention_types": [
    [
      [
        "global"
      ],
      12
    ]
  ],
  "bos_token_id": 50256,
  "embed_dropout": 0,
  "eos_token_id": 50256,
  "gradient_checkpointing": false,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": null,
  "layer_norm_epsilon": 1e-05,
  "max_position_embeddings": 2048,
  "model_type": "gpt_neo",
  "num_heads": 12,
  "num_layers": 12,
  "pad_token_id": 50256,
  "resid_dropout": 0,
  "summary_activation": null,
  "summary_first_dropout": 0.1,
  "summary_proj_to_labels": true,
  "summary_type

In [None]:
chef('      ')

[{'generated_text': '       איילת שקד- תיכנסו  _ Ayelet Shaked\nאין על המר'}]

### Creating the interface and launching!

In [None]:
def generate_text(prompt):
    # The pipeline returns a list of dicts; show only the generated string.
    return chef(prompt)[0]['generated_text']

output_text = gr.Textbox()  # gr.outputs.Textbox() is deprecated
gr.Interface(generate_text, "textbox", output_text, title=model_name,
             description="Go ahead and input a sentence and see what it completes "
                         "it with! Takes around 20s to run.").launch(debug=True)


Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://13547.gradio.app

This share link expires in 72 hours. For free permanent hosting, check out Spaces: https://huggingface.co/spaces


Input length of input_ids is 25, but `max_length` is set to 20. This can lead to unexpected behavior. You should consider increasing `max_new_tokens`.


Keyboard interruption in main thread... closing server.


(<gradio.routes.App at 0x7f0008a484d0>,
 'http://127.0.0.1:7860/',
 'https://13547.gradio.app')
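The `max_length` warning in the log above shows up because the default total length (20 tokens) is shorter than the 25-token prompt. One way around it is to pass `max_new_tokens`, which budgets only the generated part. The sketch below wraps any pipeline-like callable; the function name, the default of 50 tokens, and the `fake_generator` stand-in are assumptions made so the example runs without a model.

```python
def complete_with_budget(generator, prompt, max_new_tokens=50):
    """Return only the generated string, with an explicit budget of new tokens."""
    outputs = generator(prompt, max_new_tokens=max_new_tokens)
    return outputs[0]["generated_text"]

# Stand-in for the real pipeline: same call shape, same output shape.
def fake_generator(prompt, max_new_tokens):
    return [{"generated_text": prompt + " ..."}]

print(complete_with_budget(fake_generator, "abc"))  # abc ...
```

In the notebook the first argument would be the `chef` pipeline, for example `complete_with_budget(chef, 'איילת שקד')`.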

#### The model is now live on the gradio.app link shown above. Go ahead and open that in a new tab!

Please contact us [here](mailto:team@gradio.app) if you have any questions, or [open an issue](https://github.com/gradio-app/gradio-UI/issues/new/choose) at our GitHub repo.

