<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#-What-is-Transfer-learning?-" data-toc-modified-id="-What-is-Transfer-learning?--0.1"><span class="toc-item-num">0.1&nbsp;&nbsp;</span> What is Transfer learning? </a></span></li><li><span><a href="#-What-is-GPT-2-" data-toc-modified-id="-What-is-GPT-2--0.2"><span class="toc-item-num">0.2&nbsp;&nbsp;</span> What is GPT-2 </a></span></li></ul></li><li><span><a href="#-Imports-and-installation-" data-toc-modified-id="-Imports-and-installation--1"><span class="toc-item-num">1&nbsp;&nbsp;</span> Imports and installation </a></span><ul class="toc-item"><li><ul class="toc-item"><li><span><a href="#Install-aitextgen-package" data-toc-modified-id="Install-aitextgen-package-1.0.1"><span class="toc-item-num">1.0.1&nbsp;&nbsp;</span>Install <code>aitextgen</code> package</a></span></li><li><span><a href="#Download-the-GPT2-Model" data-toc-modified-id="Download-the-GPT2-Model-1.0.2"><span class="toc-item-num">1.0.2&nbsp;&nbsp;</span>Download the GPT2 Model</a></span></li></ul></li><li><span><a href="#Setup-data" data-toc-modified-id="Setup-data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Setup data</a></span><ul class="toc-item"><li><span><a href="#Read-and-tokenize-the-Input-Dataset" data-toc-modified-id="Read-and-tokenize-the-Input-Dataset-1.1.1"><span class="toc-item-num">1.1.1&nbsp;&nbsp;</span>Read and tokenize the Input Dataset</a></span></li><li><span><a href="#Use-the-above-saved-text-file-for-fine-tuning---set-the-right-parameters" data-toc-modified-id="Use-the-above-saved-text-file-for-fine-tuning---set-the-right-parameters-1.1.2"><span class="toc-item-num">1.1.2&nbsp;&nbsp;</span>Use the above saved text file for fine-tuning - set the right parameters</a></span></li></ul></li><li><span><a href="#Train-gpt-2" data-toc-modified-id="Train-gpt-2-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Train gpt-2</a></span></li><li><span><a href="#Try-inference" 
data-toc-modified-id="Try-inference-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Try inference</a></span><ul class="toc-item"><li><span><a href="#Load-the-newly-fine-tuned-model-which-is-saved-in-trained_model-directory" data-toc-modified-id="Load-the-newly-fine-tuned-model-which-is-saved-in-trained_model-directory-1.3.1"><span class="toc-item-num">1.3.1&nbsp;&nbsp;</span>Load the newly fine-tuned model which is saved in <code>trained_model</code> directory</a></span></li><li><span><a href="#Time-to-see-the-generated-text-in-action" data-toc-modified-id="Time-to-see-the-generated-text-in-action-1.3.2"><span class="toc-item-num">1.3.2&nbsp;&nbsp;</span>Time to see the generated text in action</a></span></li></ul></li></ul></li></ul></div>

<h2> What is Transfer learning? </h2>
In short, transfer learning reuses a model trained on one task as the starting point for a different task, which saves the time and effort of training from scratch.
A helpful resource for further reading:

https://ruder.io/transfer-learning/
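
To make the "starting point" idea concrete, here is a minimal toy sketch in plain Python. It has nothing to do with GPT-2 itself: the two linear "tasks", the helper names `fit` and `mse`, and all hyperparameters are assumptions chosen only for illustration. A model fitted on a large source task is used as the initial weights for a small, related target task and compared with training from zero for the same number of steps.

```python
def mse(w, b, data):
    """Mean squared error of the line y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def fit(data, w=0.0, b=0.0, steps=20, lr=0.05):
    """Plain gradient descent on the MSE, starting from the given (w, b)."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


xs = [i / 10 for i in range(-10, 11)]
source_task = [(x, 3.0 * x + 1.0) for x in xs]       # large "pretraining" task
target_task = [(x, 3.2 * x + 0.8) for x in xs[::5]]  # small, related task

pretrained = fit(source_task, steps=500)             # learn the source task first
scratch = fit(target_task, steps=20)                 # cold start from w=0, b=0
finetuned = fit(target_task, *pretrained, steps=20)  # warm start from pretrained weights

loss_scratch = mse(*scratch, target_task)
loss_finetuned = mse(*finetuned, target_task)
print(f"from scratch: {loss_scratch:.4f}  fine-tuned: {loss_finetuned:.4f}")
```

With the same small training budget, the warm-started model should end up much closer to the target, which is the effect this notebook relies on when fine-tuning a pretrained GPT-2.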

<h2> What is GPT-2 </h2>
GPT-2 is a large transformer-based language model with 1.5 billion parameters, trained on a dataset of 8 million web pages. The notebook below loads the smaller 124M-parameter variant by default. OpenAI describes the training dataset as follows:

"We created a new dataset which emphasizes diversity of content, by scraping content from the Internet. In order to preserve document quality, we used only pages which have been curated/filtered by humans; specifically, we used outbound links from Reddit which received at least 3 karma. This can be thought of as a heuristic indicator for whether other users found the link interesting (whether educational or funny), leading to higher data quality than other similar datasets, such as CommonCrawl."

[Source](https://openai.com/blog/better-language-models/)

<h1> Imports and installation </h1>

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# Can be any text file whose text is split across multiple lines; here we use the
# lyrics dataset scraped in the scrape_genius notebook
input_path = "./song_lyrics_data/terrible_german_lyrics.txt"

### Install `aitextgen` package

In [None]:
!pip install -q aitextgen  # install the main package

In [None]:
from aitextgen import aitextgen

### Download the GPT2 Model

In [None]:
# aitextgen's default GPT-2 model (124M parameters), pretrained on English text
ai = aitextgen(tf_gpt2="124M", to_gpu=True)

# Alternative: load a GPT-2 model pretrained on German text instead
# ai = aitextgen(model="dbmdz/german-gpt2", to_gpu=True)

## Setup data

### Read and tokenize the Input Dataset

In [None]:
!head --lines=10 {input_path}
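
The `--lines` flag is specific to GNU `head`, so the cell above may not run outside a Linux environment such as Kaggle. As a portable alternative, here is a small standard-library sketch; the helper name `preview_lines` and the throwaway demo file are assumptions for illustration, not part of the original notebook.

```python
import os
import tempfile
from itertools import islice


def preview_lines(path, n=10, encoding="utf-8"):
    """Return the first n lines of a text file without reading the whole file."""
    with open(path, encoding=encoding) as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Demo on a temporary file so the sketch runs anywhere
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("zeile eins\nzeile zwei\nzeile drei\n")

demo_lines = preview_lines(tmp.name, n=2)
os.remove(tmp.name)
print(demo_lines)
```

In the notebook itself, `preview_lines(input_path)` would show the first ten lines of the lyrics file.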

In [None]:
from aitextgen.TokenDataset import TokenDataset

In [None]:
data = TokenDataset(input_path, line_by_line=True)

### Use the above saved text file for fine-tuning - set the right parameters 

In [None]:
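# Count the lines in the dataset; the context manager closes the file afterwards
with open(input_path, encoding="utf-8") as f:
    dataset_elems = sum(1 for _ in f)
dataset_elems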

## Train gpt-2

In [None]:
ai.train(
    input_path,
    line_by_line=False,
    from_cache=False,
    num_steps=dataset_elems * 4,  # ~4 passes over the lines at batch_size=1 (approximate, since line_by_line=False)
    generate_every=2000,
    save_every=2000,
    save_gdrive=False,
    learning_rate=1e-3,
    batch_size=1,
)
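
The `num_steps` value above encodes a simple relation between training steps, dataset size, and epochs. The sketch below spells that arithmetic out under the assumption that each step consumes `batch_size` samples and that one sample corresponds to one line of the input file. The helper `steps_for_epochs` is hypothetical and not part of aitextgen.

```python
import math


def steps_for_epochs(num_samples, epochs, batch_size=1):
    """Training steps needed to see num_samples examples `epochs` times."""
    return math.ceil(num_samples * epochs / batch_size)


print(steps_for_epochs(1000, epochs=4))                # 4000
print(steps_for_epochs(1000, epochs=4, batch_size=8))  # 500
```

With `batch_size=1` this reduces to `dataset_elems * 4`, as used in the training call. A larger batch size would need proportionally fewer steps for the same number of passes.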

## Try inference

### Load the newly fine-tuned model which is saved in `trained_model` directory

In [None]:
ai = aitextgen(model_folder="./trained_model/", config="./trained_model/config.json", to_gpu=True)

### Time to see the generated text in action

In [None]:
ai.generate(
    n=5,
    batch_size=1,
    max_length=200,
    temperature=1.0,
    top_p=0.9
)
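
The `temperature` and `top_p` arguments control how the next token is sampled. The following is a minimal pure-Python sketch of the general technique (temperature scaling followed by nucleus sampling) on a toy five-token vocabulary. It is not the code aitextgen or its backend actually runs; the function names and the toy logits are assumptions for illustration.

```python
import math
import random


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p=0.9):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, renormalized."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=1.0, top_p=0.9, rng=random):
    """Draw one token id from the temperature-scaled, top-p-truncated distribution."""
    nucleus = top_p_filter(softmax_with_temperature(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]
nucleus = top_p_filter(softmax_with_temperature(toy_logits, 1.0), top_p=0.9)
sharper = top_p_filter(softmax_with_temperature(toy_logits, 0.5), top_p=0.9)
print(sorted(nucleus), sorted(sharper))
print(sample_token(toy_logits, temperature=1.0, top_p=0.9))
```

Lowering the temperature concentrates probability on the top tokens, so fewer candidates survive the `top_p` cut and the generated text becomes more predictable.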

In [None]:
ai.generate_samples(
    prompt="jeder von uns hat einen schulabschluss",
    n=1,
    batch_size=1,
    max_length=500,
    top_p=0.9
)