# Monero Research Lab text generator

Isthmus / Mitchell, December 2020

Based on the [GPT-2 tutorial](https://colab.research.google.com/drive/1VLG8e7YSEwypxU-noRNhsv5dW4NfTGce) notebook by [Max Woolf](http://minimaxir.com)

## Settings and parameters

Note:
* `124M` (default): the "small" model, 500MB on disk.
* `355M`: the "medium" model, 1.5GB on disk.
* `774M`: the "large" model (used in this notebook).
* `1558M`: the "extra large", true model.

In [1]:
model_size = "774M"
this_run_name = 'run1' + model_size
logs_URL = "https://raw.githubusercontent.com/Mitchellpkt/log_based_text_generator/main/mrl_logs_cleaned.txt"

## Environment

In [2]:
%tensorflow_version 1.x
!pip install -q gpt-2-simple
import gpt_2_simple as gpt2
from datetime import datetime
from google.colab import files
import re 

TensorFlow 1.x selected.
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.



Peep the instance specs

In [3]:
!nvidia-smi

Sat Dec 26 23:31:31 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.27.04    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   38C    P0    23W / 300W |      0MiB / 16130MiB |      0%      Default |
|                               |                      |                 ERR! |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

## Load GPT-2

In [4]:
gpt2.download_gpt2(model_name=model_size)

Fetching checkpoint: 1.05Mit [00:00, 552Mit/s]                                                      
Fetching encoder.json: 1.05Mit [00:00, 123Mit/s]                                                    
Fetching hparams.json: 1.05Mit [00:00, 646Mit/s]                                                    
Fetching model.ckpt.data-00000-of-00001: 3.10Git [00:29, 104Mit/s]                                  
Fetching model.ckpt.index: 1.05Mit [00:00, 535Mit/s]                                                
Fetching model.ckpt.meta: 2.10Mit [00:00, 227Mit/s]                                                 
Fetching vocab.bpe: 1.05Mit [00:00, 233Mit/s]                                                       


## Data wrangling MRL logs

Retrieve the IRC logs from the URL of a plaintext data dump

In [5]:
import requests

file_name = "mrl.txt"

# Download the plaintext log dump and stop early on an HTTP error
url = logs_URL
data = requests.get(url)
data.raise_for_status()
manip_text_raw = data.text

In [6]:
# Head of data look alright?
print(manip_text_raw[0:3000])


 <ukoehb> is transaction fee 8 bytes?
 <moneromooo> it is a 64 bit value. it is typically encoded as a varint, if that's what you're asking.
 <ukoehb> just looking at storage required
 <ukoehb> varint = variable length integer, so is storage not constant?
 <moneromooo> yes.
 <ukoehb> thanks :)
 <serhack> morning :)
 <suraenoether> monero coffee chat yall~
 <sarang> how did the coffee chat go?
 <sarang> i had a volunteer commitment during that time
 <sarang> we repair bikes and donate them to veterans and kids who need them
 <sneurlax1> good with bikes, eh?
 <sarang> i worked part-time as a mechanic for a few years
 <sneurlax1> i missed the meeting so have no useful comment there sorry.
 <sarang> fixing bikes is a ton of fun
 <sneurlax1> i skipped straight to motorcycles and need to get handy with it quickly
 — sarang is moving bike convo to #monero-research-lounge 
 <needmoney90> my call with bisq is wednesday, would anyone be available to chat about the technical details of how multi

Process the text to strip out irrelevant messages (people joining, leaving, etc)

In [7]:
# Optionally drop case
# manip_text_raw = manip_text_raw.lower()

# Remove channel notifications
words_to_remove = ('mode','timestamp','joined','left','quit','seconds','channel', '#monero-research-lab', '→', 'chanserv')
# Strip anything inside (...) or [...]; this removes the timestamps
manip_text = re.sub(r"[\(\[].*?[\)\]]", "", manip_text_raw)
for this_word in words_to_remove:
    print('Removing lines containing: ' + this_word)
    # re.escape keeps characters such as '#' and '-' literal in the pattern
    manip_text = re.sub(".*" + re.escape(this_word) + ".*", "", manip_text)

# Collapse the runs of blank lines left behind by the removals
manip_text = re.sub(r"\n{2,}", "\n", manip_text)
    
final_string = manip_text

# Peep the results
print(manip_text[0:1000])

Removing lines containing: mode
Removing lines containing: timestamp
Removing lines containing: joined
Removing lines containing: left
Removing lines containing: quit
Removing lines containing: seconds
Removing lines containing: channel
Removing lines containing: #monero-research-lab
Removing lines containing: →
Removing lines containing: chanserv

 <ukoehb> is transaction fee 8 bytes?
 <moneromooo> it is a 64 bit value. it is typically encoded as a varint, if that's what you're asking.
 <ukoehb> just looking at storage required
 <ukoehb> varint = variable length integer, so is storage not constant?
 <moneromooo> yes.
 <ukoehb> thanks :)
 <serhack> morning :)
 <suraenoether> monero coffee chat yall~
 <sarang> how did the coffee chat go?
 <sarang> i had a volunteer commitment during that time
 <sarang> we repair bikes and donate them to veterans and kids who need them
 <sneurlax1> good with bikes, eh?
 <sarang> i worked part-time as a mechanic for a few years
 <sneurlax1> i missed the m
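
To see what the cleaning step does in isolation, here is a minimal, self-contained sketch of the same three passes (strip bracketed spans, blank out notification lines, collapse blank lines). The `sample` text and the `clean_log` helper are hypothetical and only illustrate the idea; the real logs may be shaped differently.

```python
import re

# Hypothetical sample shaped like raw IRC logs (not taken from the real data).
sample = (
    "[12:00:01] <alice> is the fee a varint?\n"
    "[12:00:05] → bob joined the channel\n"
    "[12:00:09] <carol> yes (usually)\n"
    "[12:00:12] bob quit\n"
    "[12:00:15] <alice> thanks\n"
)

def clean_log(text, words_to_remove):
    """Strip bracketed spans, drop lines containing any listed word, collapse blank lines."""
    # Remove anything inside (...) or [...]; this takes the timestamps with it.
    text = re.sub(r"[\(\[].*?[\)\]]", "", text)
    for word in words_to_remove:
        # '.' does not match newlines, so this blanks one whole line at a time.
        text = re.sub(".*" + re.escape(word) + ".*", "", text)
    # Collapse the runs of newlines left behind by the blanked lines.
    return re.sub(r"\n{2,}", "\n", text)

cleaned = clean_log(sample, ("joined", "quit", "→"))
print(cleaned)
```

Note that the bracket pattern is broader than timestamps: the parenthetical `(usually)` in the sample is removed as well, so bracketed asides in real messages are lost too.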

Write the file

In [8]:
with open(file_name, 'w') as f:
  f.write(final_string)

The commented-out code below links Google Drive

In [9]:
# gpt2.mount_gdrive()
# gpt2.copy_file_from_gdrive(file_name)
# gpt2.load_f

## Finetune GPT-2

The next cell starts the actual finetuning of GPT-2. It creates a persistent TensorFlow session which stores the training config, then runs the training for the specified number of `steps`. (To have the finetuning run indefinitely, set `steps = -1`, as the cell below does.)

The model checkpoints are saved in `/checkpoint/run1` by default; because this notebook sets `run_name`, they land in `checkpoint/run1774M` instead. Checkpoints are written every 500 steps (set by `save_every`) and when the cell is stopped.

The training might time out after 4ish hours; make sure you end training and save the results so you don't lose them!

**IMPORTANT NOTE:** If you want to rerun this cell, **restart the VM first** (Runtime -> Restart Runtime). You will need to rerun imports but not recopy files.

Other optional-but-helpful parameters for `gpt2.finetune`:


* **`restore_from`**: Set to `fresh` to start training from the base GPT-2, or set to `latest` to restart training from an existing checkpoint.
* **`sample_every`**: Number of steps between printing example output.
* **`print_every`**: Number of steps between printing training progress.
* **`learning_rate`**: Learning rate for the training. (default `1e-4`, can lower to `1e-5` if you have <1MB input data)
* **`run_name`**: Subfolder within `checkpoint` to save the model. This is useful if you want to work with multiple models (you will also need to specify `run_name` when loading the model).
* **`overwrite`**: Set to `True` if you want to continue finetuning an existing model (w/ `restore_from='latest'`) without creating duplicate copies.
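
The `learning_rate` rule of thumb above can be written down as a tiny helper. This is only a sketch: `suggest_learning_rate` and its `small_bytes` threshold are hypothetical names, not part of `gpt_2_simple`, and the 1 MB cutoff is just the figure quoted in the list.

```python
import os
import tempfile

def suggest_learning_rate(path, small_bytes=1_000_000):
    """Return 1e-5 for inputs under roughly 1 MB, otherwise the 1e-4 default."""
    return 1e-5 if os.path.getsize(path) < small_bytes else 1e-4

# Demo on a throwaway file that is far smaller than 1 MB.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write(" <alice> hello\n" * 10)
demo_rate = suggest_learning_rate(tmp.name)
os.remove(tmp.name)
print(demo_rate)
```

The result could then be passed along as `learning_rate=suggest_learning_rate(file_name)` in the `gpt2.finetune` call.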

In [10]:
sess = gpt2.start_tf_sess()

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_size,
              steps=-1,
              restore_from='latest',
              run_name= this_run_name,
              print_every=10,
              sample_every=250,
              save_every=500 
              )

Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
Instructions for updating:
Please use tensorflow.python.ops.op_selector.get_backward_walk_ops.
Loading checkpoint checkpoint/run1774M/model-10750
INFO:tensorflow:Restoring parameters from checkpoint/run1774M/model-10750


  0%|          | 0/1 [00:00<?, ?it/s]

Loading dataset...


100%|██████████| 1/1 [00:08<00:00,  8.46s/it]


dataset has 1891095 tokens
Training...
 w and d are chosen by h to be constants for the group order q, and the other members of q are chosen by the same distribution, so txn matching can be done with no additional input except for the pubkey index. in the above situation, r = {0,1} and d = {0,1} 
 <moneromooo> the argument for not including it, is that "you only have to look once", and that can't be borne, no ?
 <moneromooo> it'd also be impractical, since you could make a transaction for that output.
 <moneromooo> so you'd use only the outputs from n transactions and pay a fee for each output. that's eliminate all checks, and in case you stumble across something, you just need to remember it.
 <moneromooo> it would still require the output index, since you know where each output came from, for statistical purposes.
 <moneromooo> oh, not for that. because then you can rekey any output, if you don't remember the key image.
 <moneromooo> this would still be required for decoy output sele

In [11]:
## To save a model to gdrive:

# gpt2.copy_checkpoint_to_gdrive(run_name=this_run_name)

In [12]:
# # To load a model from gdrive:

# gpt2.copy_checkpoint_from_gdrive(run_name=this_run_name)
# sess = gpt2.start_tf_sess()
# gpt2.load_gpt2(sess, run_name=this_run_name)

## Generate Text From The Trained Model

Unseeded example:

In [18]:
gpt2.generate(sess, run_name=this_run_name)

 His
 <moneromooo> i think the reason is that it was added very late, when bytecoin realised some "client distinguishability" was needed.
 <moneromooo> i think the intent was to offer a layer of "this transaction was built with good intelligence, and i don't know why" that would be helpful.
 <sarang> the r value would be for the sender, not the receiver
 <sarang> and it's important that the sender know the r value
 <sarang> the receiver could be influenced by the use of an index that is linked to the true signing index, which can't be easily verified anyway
 <sarang> the point is that the way the indexing appears in the signature is *good* but doesn't matter for the receiver
 <sarang> the signer should include a hash of all the indices used, so the receiver knows, say, the index for the true signer
 <sarang> i'm thinking through the consequences to the current use of the non-indexed indices in the ring
 <sarang> 
 <sarang> and there's no good reason other than index linking
 <sarang> t

Parameters for `gpt2.generate`:

* **`length`**: Number of tokens to generate (default 1023, the maximum)
* **`temperature`**: The higher the temperature, the crazier the text (default 0.7, recommended to keep between 0.7 and 1.0)
* **`top_k`**: Limits the generated guesses to the top *k* guesses (default 0 which disables the behavior; if the generated output is super crazy, you may want to set `top_k=40`)
* **`top_p`**: Nucleus sampling: limits the generated guesses to a cumulative probability. (gets good results on a dataset with `top_p=0.9`)
* **`truncate`**: Truncates the generated text at a given sequence, excluding that sequence (e.g. if `truncate='<|endoftext|>'`, the returned text will include everything before the first `<|endoftext|>`). It may be useful to combine this with a smaller `length` if the input texts are short.
* **`include_prefix`**: If using `truncate` and `include_prefix=False`, the specified `prefix` will not be included in the returned text.
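
To make `temperature`, `top_k` and `top_p` more concrete, here is a toy, pure-Python sketch of the general techniques on a four-word vocabulary. It is an illustration under made-up numbers, not the code `gpt_2_simple` runs internally; the function names, `vocab` and `logits` are all hypothetical.

```python
import math
import random

def softmax_with_temperature(logits, temperature=0.7):
    """Convert logits to probabilities; a lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def filter_top_k(probs, k):
    """Keep the k most likely tokens (k <= 0 disables the filter), then renormalise."""
    if k <= 0:
        return list(probs)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:k])
    kept = [p if i in keep else 0.0 for i, p in enumerate(probs)]
    total = sum(kept)
    return [p / total for p in kept]

def filter_top_p(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in order:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    kept = [q if i in keep else 0.0 for i, q in enumerate(probs)]
    total = sum(kept)
    return [q / total for q in kept]

vocab = ["ring", "signature", "bike", "coffee"]  # toy vocabulary
logits = [2.0, 1.5, 0.2, -1.0]                   # made-up scores

probs = softmax_with_temperature(logits, temperature=0.7)
top2 = filter_top_k(probs, k=2)
nucleus = filter_top_p(probs, p=0.9)
token = random.choices(vocab, weights=nucleus, k=1)[0]
print(top2, nucleus, token)
```

With these numbers both filters discard the two unlikely words, so the sampled token is always one of the first two. Raising the temperature flattens `probs` and gives the tail more weight before any filtering happens.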

In [None]:
gpt2.generate(sess,
              run_name=this_run_name,  # needed because the run is not named 'run1'
              length=400,
              temperature=0.7,
              # prefix=" <handle>",
              # top_p = 0.9,
              nsamples=10,
              batch_size=5)


## Ask GPT-MRL about specific topics

In [26]:
topics = ['meeting',
          'scalarmult',
          'quantum',
          'bulletproof',
          'dynamic block',
          'ring signature',
          'zero knowledge',
          'anonymity set',
          'triptych',
          'clsag',
          'mlsag',
          'arcturus',
          ''
          ]

In [29]:
for t in topics:
  print(("*"*25 + "\n")*2 + "MRL-AI (GPT-2) on " + t + ":\n")
  gpt2.generate(sess,
              length=500,
              temperature=0.7,
              prefix=t,
              include_prefix=False,
              run_name = this_run_name,
              nsamples=5,
              batch_size=5
              #top_p = 0.9
              )

*************************
*************************
MRL-AI (GPT-2) on meeting:

meeting the consensus rule that they cannot be reused
 <knaccc> they are currently in the prng shared secret, and while that is  is not shared, but they are stored along with every transaction. if that weren't the case, i'd say something like "use  instead of  and  instead of , i'd just use the one-time stealth adresses."
 <knaccc> because that would require all transactions to be constructed from the signature's contents, which is always a subset of the tx
 <knaccc> and that would add lots of computation to scanning
 <knaccc> it's also why i don't like the idea of using tx_extra  instead of storing the tx secret, where you can get away with a greater degree of privacy, by applying the unsubstituation assumption
 <knaccc> i'm also not sure i like the idea of tx_extra  being used more generally. i'm not sure if that's just a blog post thought, or a user experience thing that's worth investigating.
 <knaccc> 

For bulk generation, you can generate a large amount of text to a file and sort out the samples locally on your computer. The next cell writes the generated text to a file with a unique timestamp in its name.

You can rerun the cells as many times as you want for even more generated texts!

In [None]:
gen_file = 'gpt2_gentext_{:%Y%m%d_%H%M%S}.txt'.format(datetime.utcnow())

gpt2.generate_to_file(sess,
                      run_name=this_run_name,  # needed because the run is not named 'run1'
                      destination_path=gen_file,
                      length=500,
                      temperature=0.7,
                      nsamples=100,
                      batch_size=20
                      )

In [None]:
# may have to run twice to get file to download
files.download(gen_file)
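
Once the file is downloaded, sorting out the samples locally can be as simple as splitting on the separator line. The sketch below assumes samples are separated by a line of twenty `=` characters (check your own file and adjust `SAMPLE_DELIM` if it differs); `split_samples` and `keep_mentioning` are hypothetical helpers, and the demo file is a stand-in for real output.

```python
import os
import tempfile

SAMPLE_DELIM = "=" * 20 + "\n"  # assumed separator between samples

def split_samples(path, delim=SAMPLE_DELIM):
    """Read a generated-text file and return its non-empty samples."""
    with open(path, encoding="utf-8") as f:
        chunks = f.read().split(delim)
    return [c.strip("\n") for c in chunks if c.strip()]

def keep_mentioning(samples, keyword):
    """Keep only the samples that mention the keyword (case-insensitive)."""
    return [s for s in samples if keyword.lower() in s.lower()]

# Demo on a stand-in file with two tiny samples.
demo_text = (
    " <alice> clsag looks good\n" + SAMPLE_DELIM +
    " <bob> coffee time\n" + SAMPLE_DELIM
)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(demo_text)
samples = split_samples(tmp.name)
os.remove(tmp.name)
print(len(samples), keep_mentioning(samples, "CLSAG"))
```

For a real run, call `split_samples` on the downloaded `gpt2_gentext_*.txt` file instead of the temporary one.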

# LICENSE

MIT License

Copyright (c) 2019 Max Woolf

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.