<a href="https://colab.research.google.com/github/Katunya/CulturePortal/blob/master/Copy_of_NLP_master_class.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Training a Language Model on Nietzsche's Books

First, check that a GPU is available.

In [26]:
!nvidia-smi

Mon Apr  6 16:34:25 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   53C    P0    25W /  75W |   1467MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
+-------

### Our plan:
1. Load the data (text) and prepare it for training the model
2. Load a language model pretrained on Wikipedia
3. Fine-tune the language model on our data, thereby adapting its knowledge to this text

In [0]:
from fastai.text import * 

### 1. Load the data

In [0]:
path = Path('.')

Download the .txt file with Nietzsche's texts

In [29]:
!wget "https://s3.amazonaws.com/text-datasets/nietzsche.txt"

--2020-04-06 16:34:27--  https://s3.amazonaws.com/text-datasets/nietzsche.txt
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.96.229
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.96.229|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 600901 (587K) [text/plain]
Saving to: ‘nietzsche.txt.1’


2020-04-06 16:34:27 (1.90 MB/s) - ‘nietzsche.txt.1’ saved [600901/600901]



Read the lines from the file and collect them in the list `text`

In [0]:
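text = []

# Read the whole file and keep every non-empty line as a separate entry
with open('nietzsche.txt') as fp:
    contents = fp.read()
    for entry in contents.split('\n'):
        # After splitting on '\n' an entry cannot contain a newline,
        # so no further cleanup is needed
        if entry:
            text.append(entry)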

Look at a random example

In [31]:
random.choice(text)

'and more fundamental value for life generally should be assigned to'

Put the data into a pandas DataFrame

In [32]:
nietzsche_df = pd.DataFrame({'text': text})
nietzsche_df.to_csv('nietzsche.csv', index=False)
nietzsche_df.head()

Unnamed: 0,text
0,PREFACE
1,SUPPOSING that Truth is a woman--what then? Is...
2,"for suspecting that all philosophers, in so fa..."
3,"dogmatists, have failed to understand women--t..."
4,seriousness and clumsy importunity with which ...


Unlike images in Computer Vision, text cannot be fed into a model directly: a model consumes numbers, and raw text is a sequence of characters rather than a numeric array.

So the first thing we need to do is split the text into tokens (roughly speaking, words). This process is called *tokenization*. Then each token has to be mapped to a number, which is called *numericalization*.

After that, our data can be fed into the model.
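
The two steps above can be sketched in plain Python. This is only an illustration under simplifying assumptions: the helper names `tokenize`, `build_vocab` and `numericalize` are hypothetical, the tokenizer just splits on whitespace, and fastai's own tokenizer and vocabulary handling are considerably more elaborate. The `xxunk` name for the unknown token mirrors fastai's naming.

```python
from collections import Counter

def tokenize(line):
    # Naive whitespace tokenizer (an assumption for illustration only)
    return line.lower().split()

def build_vocab(token_lists, min_freq=1):
    # Id 0 is reserved for unknown tokens; the rest are ordered by frequency
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = ["xxunk"] + [tok for tok, c in counts.most_common() if c >= min_freq]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi

def numericalize(tokens, stoi):
    # Tokens missing from the vocabulary map to id 0
    return [stoi.get(tok, 0) for tok in tokens]

lines = ["truth is a woman", "what is truth"]
tokenized = [tokenize(line) for line in lines]
itos, stoi = build_vocab(tokenized)
ids = [numericalize(toks, stoi) for toks in tokenized]
print(ids)
```

Mapping the ids back through `itos` recovers the tokens, which is roughly what happens when the generated ids are turned back into text later in the notebook.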

#### The fast.ai library provides a `databunch` object for language models that performs tokenization and splits the data into [text / word to predict] pairs

In [33]:
data_lm = TextLMDataBunch.from_csv(path, 'nietzsche.csv', text_cols="text")
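
To make the [text / word to predict] split concrete, here is a minimal sketch of how a stream of token ids can be cut into input and target windows, where the target is the input shifted by one token. The function `lm_pairs` and the `seq_len` value are hypothetical and chosen for illustration; how `TextLMDataBunch` actually batches the text internally is not shown here.

```python
def lm_pairs(ids, seq_len):
    # Cut a long stream of token ids into (input, target) windows;
    # each target is its input shifted one position to the right
    pairs = []
    for start in range(0, len(ids) - 1, seq_len):
        x = ids[start:start + seq_len]
        y = ids[start + 1:start + seq_len + 1]
        # Drop a trailing window whose target would be shorter than its input
        if len(x) == len(y):
            pairs.append((x, y))
    return pairs

stream = [10, 11, 12, 13, 14, 15, 16]
for x, y in lm_pairs(stream, seq_len=3):
    print(x, "->", y)
```

At every position the model sees the tokens up to that point and is trained to predict the next one.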

In [0]:
data_lm.save('data_lm_export.pkl')

`bs` (batch size): how much text we feed to the neural network at a time

In [0]:
bs=32

In [0]:
data_lm = load_data(path, 'data_lm_export.pkl', bs=bs)

### 2. Load the language model pretrained on Wikipedia

In [0]:
torch.cuda.set_device(0)

#### Create a language model based on the AWD-LSTM architecture; `pretrained=True` means it was trained on the `wikitext-103` dataset

In [0]:
learn = language_model_learner(data_lm, AWD_LSTM, pretrained=True, drop_mult=0.5)

#### Let's generate some text with the pretrained model

In [0]:
START_TEXT = ["This is a review about", "I dont like this", "Human should be"]
N_SENTENCES = len(START_TEXT)
N_WORDS = 100 

In [40]:
print("\n\n".join(str(i+1) + ". " + learn.predict(START_TEXT[i], N_WORDS, temperature=0.8) for i in range(N_SENTENCES)))


1. This is a review about Jesus , but one he enjoys the ability to become a Christian , but is also one of the few to have a mention of the Church in Christian mythology . The church of Saint - Paul in Paris , in France , is possibly the only one of the five churches in the European Community to be named after Jesus . The Saint Paul Church in Paris is a Jewish , the Jewish Catholic Church , and the

2. I dont like this all of other species of this species . The " black - and - red " form of the blue - ruled North Indian species Lie is a product of the complex culture explaining the development of red - green , green , and black - and - red black spots in the United States ( see Blue Line ) . The species was a third for US National Park and National Park Service . It has since become a species of European

3. Human should be a human species ( Nature ) = = = by a common definition . The word " animal " is a non - human word that is used in human mind as an explanation for the animal 's h
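
The `temperature` argument used above controls how adventurous the sampling is. A minimal sketch of the idea, assuming temperature is applied the same way as in the custom `predict` helper later in this notebook (`res.pow_(1 / temp)` on the predicted probabilities); the function name `apply_temperature` and the toy distribution are hypothetical.

```python
def apply_temperature(probs, temp):
    # Raise each probability to the power 1/temp, then renormalize to sum to 1
    powered = [p ** (1.0 / temp) for p in probs]
    total = sum(powered)
    return [p / total for p in powered]

probs = [0.5, 0.3, 0.2]
for t in (1.0, 0.75, 0.25):
    print(t, [round(p, 3) for p in apply_temperature(probs, t)])
```

Temperatures below 1 concentrate probability on the most likely tokens, so the text becomes more predictable; a temperature of 1 leaves the distribution unchanged.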

### 3. Fine-tune the language model on Nietzsche's texts
First, train for one epoch

In [41]:
learn.fit_one_cycle(1, 1e-2)

epoch,train_loss,valid_loss,accuracy,time
0,5.197776,5.326664,0.171429,00:06


#### Unfreeze the whole network and train for 3 more epochs

In [42]:
learn.unfreeze()
learn.fit_one_cycle(3, slice(1e-4,1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,4.564016,4.586343,0.228571,00:06
1,4.344941,4.431863,0.242857,00:06
2,4.023057,4.470833,0.214286,00:06


epoch,train_loss,valid_loss,accuracy,time
0,4.566662,5.02576,0.228571,00:06
1,4.279663,4.976703,0.157143,00:06
2,3.980535,5.000048,0.185714,00:06
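
The `slice(1e-4,1e-2)` passed to `fit_one_cycle` above requests different learning rates for different layer groups: the earliest layers get the smallest rate and the last layers the largest. The sketch below shows one way such a range can be spread across groups. It rests on assumptions: the geometric spacing reflects my understanding of fastai v1 and is not confirmed by this notebook, the helper `spread_lrs` is hypothetical, and the group count of 3 is chosen only for illustration.

```python
def spread_lrs(lo, hi, n_groups):
    # Geometric (log-evenly spaced) interpolation from lo to hi
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1.0 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

# Hypothetical model with three layer groups
lrs = spread_lrs(1e-4, 1e-2, n_groups=3)
print(lrs)
```

The motivation is that early layers hold general language knowledge from pretraining and should change slowly, while later layers need to adapt more to the new text.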


#### Generate text!

In [0]:
START_TEXT = ["This is a review about", "I dont like this", "Human should be"]
N_SENTENCES = len(START_TEXT)
N_WORDS = 100 

In [44]:
print("\n\n".join(str(i+1) + ". " + learn.predict(START_TEXT[i], N_WORDS, temperature=0.75) for i in range(N_SENTENCES)))

1. This is a review about the feeling of self - torment ? Hence LA LA OUTSIDE xxbos " LA QUE LA LA , " more or less LA PURPOSES . In xxbos and all the things which are at present xxbos they have to live , not as a result of their appearance , but as xxbos with , as a feats of men , a new xxbos until the end of a time has been xxbos hence it is far from being a tragedy . The thing that is to xxbos are

2. I dont like this ; i mean to do one 's life , xxbos They are capable of the same task , and the entire history of their xxbos our conscience is a xxbos always to be preached , and xxbos fundamental scheme , which is a fundamental scheme of morality , as xxbos and , in the case of Frederick the Great , as the vital after - effect of the French Revolution . As God has xxbos the Florentine Saga , what is considered as Christian science xxbos RELIGIOUS LIFE

3. Human should be more concerned about how much SYMPATHY xxbos Concerning the conditions of life , an explanation of the decision an

Sometimes the generated text makes no sense, because we have little data and did not train the model for long. Note, however, that the model follows basic grammar, which it inherited from the pretrained model.

### 4. Let's try generating text with nucleus sampling instead of greedy decoding

In [0]:
def predict(learn, text, n_words=1, temp=1., top1=False, min_p=None, sep=' ', decoder=decode_spec_tokens):
    '''
    Based on fastai implementation.
    For every word, gets the network activations, sets unknown token to 0,
    only considers tokens above a certain value, then either returns the token
    with the highest activation or samples from the distribution of activations.
    '''
    learn.model.reset()
    xb,yb = learn.data.one_item(text)
    new_idx = []
    for _ in range(n_words):
        res = learn.pred_batch(batch=(xb,yb))[0][-1]
        res[learn.data.vocab.stoi[UNK]] = 0.
        if min_p is not None: res[res < min_p] = 0.
        res.pow_(1 / temp)
        if top1: idx = torch.argmax(res).item() # greedy decoding
        else: idx = torch.multinomial(res, 1).item()
        new_idx.append(idx)
        xb = xb.new_tensor([idx])[None]
    return '[' + text + ']' + sep + sep.join(decoder(learn.data.vocab.textify(new_idx, sep=None)))

In [0]:
def predict_nucleus(learn, text, n_words=1, p=0.5, temp=1., min_p=None, sep=' ', decoder=decode_spec_tokens):
    '''
    Performs top-p sampling as described in the paper:
    finds the k which corresponds to the desired cumulative
    probability, then samples from the k most probable tokens.
    '''
    learn.model.reset()
    xb,yb = learn.data.one_item(text)
    new_idx = []
    for _ in range(n_words):
        # As in `predict` above, the values from pred_batch are treated as
        # probabilities (assumption: pred_batch already applies the softmax)
        outp = learn.pred_batch(batch=(xb,yb))[0][-1]
        outp[learn.data.vocab.stoi[UNK]] = 0.
        if min_p is not None: outp[outp < min_p] = 0.
        # Apply temperature in probability space and renormalize,
        # instead of running a second softmax over probabilities
        probs = outp.pow(1 / temp)
        probs = probs / probs.sum()
        cumsum_prob = (probs.sort(descending=True)[0]).cumsum(0)
        # Smallest k whose top-k tokens cover more than p of the probability mass
        k = (cumsum_prob > p).nonzero().view(-1)[0].item() + 1
        vals,idxs = probs.topk(k, dim=-1)
        # Sample among the k kept tokens, proportionally to their probabilities
        idx = idxs[torch.multinomial(vals, 1).item()].item()
        new_idx.append(idx)
        xb = xb.new_tensor([idx])[None]
    return '[' + text + ']' + sep + sep.join(decoder(learn.data.vocab.textify(new_idx, sep=None)))

In [0]:
temp = 0.75

In [48]:
print("\n\n".join(str(i+1) + ". " + predict(learn, START_TEXT[i], N_WORDS, temp, True)
                  for i in range(N_SENTENCES)))

1. [This is a review about] the German taste , which is xxbos The German German is the most successful German of all times , and is xxbos of the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the

2. [I dont like this] , and i have to say , " What is the " God " ? The God of God is the God of God , and the God of God is the God of God , and xxbos the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian

3. [Human should be] the most effective means of xxbos The German is a German who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German sp

In [49]:
print("\n\n".join(str(i+1) + ". " + predict_nucleus(learn, START_TEXT[i], N_WORDS, p=1e-4, temp=temp) for i in range(N_SENTENCES)))

1. [This is a review about] the German taste , which is xxbos The German German is the most successful German of all times , and is xxbos of the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the

2. [I dont like this] , and i have to say , " What is the " God " ? The God of God is the God of God , and the God of God is the God of God , and xxbos the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian Faith , the Christian

3. [Human should be] the most effective means of xxbos The German is a German who has xxbos the German spirit , and the German spirit , who has xxbos the German spirit , and the German spirit , who has xxbos the German sp
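
The nucleus output above is identical to the greedy output, and the cutoff explains why: with `p=1e-4` the single most probable token already exceeds the threshold, so only one candidate is kept at every step. The following pure-Python sketch of the same cutoff rule (keep the smallest set of top tokens whose cumulative probability exceeds `p`) shows the effect on a toy distribution. The names `nucleus_filter` and `sample_nucleus` and the probabilities are hypothetical.

```python
import random

def nucleus_filter(probs, p):
    # Sort token ids by probability, highest first
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        # Stop once the kept tokens cover more than p of the probability mass
        if cumulative > p:
            break
    return kept

def sample_nucleus(probs, p, rng=random):
    kept = nucleus_filter(probs, p)
    # random.choices treats the weights as relative, so no renormalization is needed
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

probs = [0.05, 0.6, 0.25, 0.1]
print(nucleus_filter(probs, 1e-4))  # only the top token survives
print(nucleus_filter(probs, 0.8))   # the two most probable tokens survive
```

With a larger `p` several candidates remain in the nucleus, so sampling can pick different continuations and is less prone to the repetitive loops seen in the greedy output.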
