# GPT-2 fasttext

### Mount the drive

In [0]:
from google.colab import drive
drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


### Create necessary folders

In [0]:
!mkdir -p /content/drive/'My Drive'/gpt-2-fasttext
%cd /content/drive/'My Drive'/gpt-2-fasttext

/content/drive/My Drive/gpt-2-fasttext


### Download the dataset

In [0]:
import os
import sys
import requests

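def fetch(model, dataset):
    """Download one split of the GPT-2 output dataset into the current directory."""
    filename = model + "." + dataset + '.jsonl'
    url = "https://storage.googleapis.com/gpt-2/output-dataset/v1/" + filename
    # Stream to disk in chunks so the whole file is never held in memory.
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=1 << 20):
                f.write(chunk)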

model = 'xl-1542M-k40' #@param ["small-117M", "small-117M-k40", "medium-345M", "medium-345M-k40", "large-762M", "large-762M-k40", "xl-1542M", "xl-1542M-k40"]
dataset = 'train' #@param ["train", "valid", "test"]

fetch(model, dataset)
fetch('webtext', dataset)

### Convert dataset to fasttext format and save it as `dataset.txt`

In [0]:
import json
from itertools import islice

gpt2_path = f'{model}.{dataset}.jsonl'
webtext_path = f'webtext.{dataset}.jsonl'

max_samples = None  # set to an int to cap the number of samples read per file

# fastText expects one example per line, so newlines inside a sample become spaces.
# The `with` blocks make sure dataset.txt is flushed and closed before training.
with open("dataset.txt", "w", encoding="utf-8") as output:
    for label, path in (("bot", gpt2_path), ("human", webtext_path)):
        with open(path, encoding="utf-8") as samples:
            for line in islice(samples, max_samples):
                text = json.loads(line)['text'].replace("\n", " ")
                output.write(f"__label__{label} {text}\n")
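To make the target format concrete, here is a minimal sketch of the per-sample conversion. The helper name `to_fasttext_line` and the sample record are hypothetical and not part of the notebook. Unlike the cell above, which only replaces newlines, this sketch also collapses repeated whitespace.

```python
import json


def to_fasttext_line(label, text):
    """Format one sample as a fastText supervised-training line."""
    # fastText reads one example per line, so collapse every whitespace run
    # (including newlines) into a single space.
    flat = " ".join(text.split())
    return f"__label__{label} {flat}"


# Hypothetical record in the same shape as the .jsonl files ({"text": ...}).
record = json.loads('{"text": "First line.\\nSecond line."}')
print(to_fasttext_line("bot", record["text"]))
# prints: __label__bot First line. Second line.
```

Each line starts with the `__label__` prefix, followed by the class name and the flattened text.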

### Verify files are there

In [0]:
!ls

dataset.txt		 webtext.train.jsonl  xl-1542M-k40.train.jsonl
gpt2-fasttext-model.bin  webtext.valid.jsonl  xl-1542M-k40.valid.jsonl


### Train fasttext model

In [0]:
!pip install fasttext

import fasttext
model = fasttext.train_supervised(input='dataset.txt', epoch=50, lr=1.0, wordNgrams=2)
model.save_model("gpt2-fasttext-model.bin")

Collecting fasttext
  Downloading https://files.pythonhosted.org/packages/10/61/2e01f1397ec533756c1d893c22d9d5ed3fce3a6e4af1976e0d86bb13ea97/fasttext-0.9.1.tar.gz (57kB)
Building wheels for collected packages: fasttext
  Building wheel for fasttext (setup.py) ... done
  Created wheel for fasttext: filename=fasttext-0.9.1-cp36-cp36m-linux_x86_64.whl size=2384449 sha256=eb58c1bc53775bec817684b20172fc3cb6781d7a4a431ea342df5dc5c4455f05
  Stored in directory: /root/.cache/pip/wheels/9f/f0/04/caa82c912aee89ce76358ff954f3f0729b7577c8ff23a292e3
Successfully built fasttext
Installing collected packages: fasttext
Successfully installed fasttext-0.9.1
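The notebook does not measure how well the trained classifier performs. A minimal sketch of an evaluation step follows, assuming the classifier object exposes fastText's `test(path, k)` method, which returns the sample count, precision and recall. The helper name `evaluate` and the file `valid.txt` are hypothetical; `valid.txt` would have to be built from the "valid" split in the same `__label__` format as `dataset.txt`.

```python
def evaluate(classifier, path, k=1):
    """Return sample count, precision@k, recall@k and F1 for a labelled file."""
    n, precision, recall = classifier.test(path, k)
    total = precision + recall
    # Guard against division by zero when both scores are 0.
    f1 = 2 * precision * recall / total if total else 0.0
    return {"samples": n, "precision": precision, "recall": recall, "f1": f1}


# Usage in the notebook (assumes the hypothetical valid.txt exists):
#   import fasttext
#   clf = fasttext.load_model("gpt2-fasttext-model.bin")
#   print(evaluate(clf, "valid.txt"))
```

Scoring on a held-out split rather than on `dataset.txt` gives a fairer picture, since the model has already seen every training line.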


### Test fasttext prediction (Unicorn text)

In [0]:
import fasttext
model = fasttext.load_model("gpt2-fasttext-model.bin")

unicorn = """
The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.

Pérez and the others then ventured further into the valley. “By the time we reached the top of one peak, the water looked blue, with some crystals on top,” said Pérez.

Pérez and his friends were astonished to see the unicorn herd. These creatures could be seen from the air without having to move too much to see them – they were so close they could touch their horns.

While examining these bizarre creatures the scientists discovered that the creatures also spoke some fairly regular English. Pérez stated, “We can see, for example, that they have a common ‘language,’ something like a dialect or dialectic.”

Dr. Pérez believes that the unicorns may have originated in Argentina, where the animals were believed to be descendants of a lost race of people who lived there before the arrival of humans in those parts of South America.

While their origins are still unclear, some believe that perhaps the creatures were created when a human and a unicorn met each other in a time before human civilization. According to Pérez, “In South America, such incidents seem to be quite common.”

However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.
"""

model.predict(unicorn.replace("\n", " "))




(('__label__human',), array([1.00001001]))
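The raw result is a pair of labels and probabilities. A small sketch for turning it into a readable value follows; the helper name `format_prediction` is hypothetical. It strips the `__label__` prefix and clamps the probability, since the value reported above (1.00001001) slightly exceeds 1.

```python
def format_prediction(result):
    """Turn a (labels, probabilities) pair into (label, probability)."""
    labels, probs = result
    # Drop the "__label__" prefix that marks class names in the training file.
    label = labels[0].replace("__label__", "", 1)
    # Clamp into [0, 1]; the reported value can overshoot 1.0 slightly.
    prob = min(max(float(probs[0]), 0.0), 1.0)
    return label, prob


print(format_prediction((("__label__human",), [1.00001001])))
# prints: ('human', 1.0)
```

In the notebook, the helper could wrap the call directly: `format_prediction(model.predict(unicorn.replace("\n", " ")))`.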