# hw6-sequences-a

The goal of this assignment is to develop a simple next-character prediction model using a training corpus, and then to use that model to generate strings of characters.

In [None]:
import tensorflow as tf
import pandas as pd
import numpy as np
import os
import time

In this assignment we will work with a dataset of Elon Musk's tweets. The data comes from this [Kaggle Dataset](https://www.kaggle.com/vidyapb/elon-musk-tweets-2015-to-2020?select=elonmusk.csv), and you can download it from this [link](https://raw.githubusercontent.com/beckyleii/data-bucket/master/elonmusk.csv).  **Optional:** Check out the EDA notebooks created for this dataset. The one called "NLP with TextHero | EDA with SweetViz" was created by the person who added this dataset to Kaggle and includes example code for scraping Twitter data.

### Task ETL

Using the code cell below, write some ETL code to load the data from the link above and store it in memory as a single (very long) string. The data is small enough to do this. **Note:** When working with a larger [text corpus](https://en.wikipedia.org/wiki/Text_corpus), the ETL would need to store the intermediates to disk.

Join individual tweets with a newline character ("\n") or some other special character.  In many applications, some additional character-level (or word-level) filtering and transformations might be applied at this step, but there is no need to do so in this assignment.
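
As a rough illustration of the load-and-join pattern described above, here is a minimal sketch on a tiny in-memory CSV. The toy data and the column name `tweet` are assumptions made for this example only; inspect the real file to find the actual column. `pd.read_csv` also accepts a URL in place of the in-memory buffer.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV; the column name "tweet" is an assumption.
toy_csv = io.StringIO("id,tweet\n1,Hello Mars\n2,Rockets are hard\n")

toy_df = pd.read_csv(toy_csv)                       # extract
toy_tweets = toy_df["tweet"].astype(str)            # transform: make sure every row is a string
toy_text = "\n".join(toy_tweets)                    # load: one long string, one tweet per line

print(repr(toy_text))
```

Joining with "\n" keeps tweet boundaries visible to a character-level model, since the newline becomes just another token in the vocabulary.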

In [None]:
## BEGIN SOLUTION Task ETL -- 3 lines of code



## END SOLUTION Task ETL

vocab = sorted(list(set(text)))
vocab_size = len(vocab)
assert vocab_size == 364

## Text as numbers, part I

To train a neural net using gradient descent, we need text as numbers!  Sounds reasonable - but how? 

Unfortunately, there isn't a definitive way to do this.  One simple approach is to break the text into a list of tokens (character-by-character in this assignment) and then assign a numerical representation to each token.

Since we will want to go back and forth between the tokens and their numerical representation, it is helpful to create the inverse of this mapping as well.

### Task `token2nrep` and `nrep2token`
Create two dictionaries: one mapping vocabulary characters to numbers, named `token2nrep`, and another from numbers to tokens, named `nrep2token`. 

*Python pro-tip:* `enumerate` is a lovely construct
```
for index, value in enumerate(L):
    # do something
```

*Python pro-tip:* dictionary comprehensions are a thing!
```
char2idx = {??? for x in L}
```
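
To make the two pro-tips concrete, here is a small sketch of each one on its own. The list `L` and the names `indexed` and `squares` are made up for illustration and are not part of the assignment; combining the two ideas is left to the task below.

```python
L = ['a', 'b', 'c']

# enumerate yields (index, value) pairs
indexed = [(index, value) for index, value in enumerate(L)]
print(indexed)

# a dictionary comprehension builds a dict from any iterable
squares = {x: x * x for x in [1, 2, 3]}
print(squares)
```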


In [None]:
## BEGIN SOLUTION - combine the pro-tips and try for a 2 line solution! 


## END SOLUTION Task 2.1

assert nrep2token[token2nrep['a']]=='a'
assert len(token2nrep) == len(vocab)

for char in 'Elon':
  nrep = token2nrep[char]
  print(f"{char} -> {nrep:3} -> {nrep2token[nrep]}")

def text2nrep(s):
  return np.array([token2nrep[c] for c in s])

def nrep2text(nrep):
  return ''.join([nrep2token[n] for n in nrep])

s="Elon"
print(f'"{s}" -> {text2nrep(s)} -> "{nrep2text(text2nrep(s))}"')  

assert nrep2text(text2nrep("Elon")) == "Elon"
assert len(text2nrep(text)) == len(text)
assert nrep2text(text2nrep(text)) == text


E ->  38 -> E
l ->  75 -> l
o ->  78 -> o
n ->  77 -> n
"Elon" -> [38 75 78 77] -> "Elon"


### Create training examples and targets with tf.data

In this task, we will write code that converts text into training data. Each training example will contain a sequence of input characters and a target sequence of characters, both represented in numerical form.

The model we are creating will predict the target sequence from each input sequence.  For the setup in this assignment, the target will contain the same token sequence as the input, except it will be shifted one token to the right.  The model requires fixed length input and output sequences.

The approach we will take is to break the text into chunks of `seq_length+1` characters, and then split them into an input and target, each of length `seq_length`.

For example, if the input text is "Trump is done." and `seq_length` is 3, decoding the sequences with `nrep2text` gives the following (input, target) pairs:


```
"Tru", "rum"
"rum", "ump"
"ump", "mp "
"mp ", "p i"
"p i", " is"
...
```
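
The chunk-and-split idea above can be sketched in plain Python, without `tf.data`, to check the expected pairs. The helper name `sliding_pairs` is hypothetical and works directly on characters rather than on numerical representations.

```python
def sliding_pairs(text, seq_length):
    """Return (input, target) string pairs from overlapping windows of seq_length + 1."""
    pairs = []
    # Each window has seq_length + 1 characters; incomplete windows at the end are dropped.
    for start in range(len(text) - seq_length):
        chunk = text[start:start + seq_length + 1]
        pairs.append((chunk[:-1], chunk[1:]))  # target is the input shifted by one
    return pairs

pairs = sliding_pairs("Trump is done.", seq_length=3)
for x, y in pairs[:5]:
    print(repr(x), repr(y))
```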


### Task: Text Dataset design
The `tf.data` API provides a scalable way to do this. You need to:

1. Create a `Dataset` from `text2nrep(corpus)` using the `from_tensor_slices` constructor.
2. Use the `window` method to configure a 'dataset of datasets', where each window returned is **always** of length `seq_length+1` (see `drop_remainder`).
3. Use `flat_map` and the provided `sub_to_batch` function to flatten the window dataset of datasets into a single dataset of sequential, overlapping windows of text.
4. Use the `map` method and the provided `split_input_target` function to split these sequences into appropriate (X input, y target) pairs.

In [None]:
def create_seq_data(corpus, text2nrep, seq_length):
    def sub_to_batch(sub):
        return sub.batch(seq_length+1, drop_remainder=True)

    def split_input_target(seq):
        input_seq = seq[:-1]
        target_seq = seq[1:]
        return input_seq, target_seq

    ## BEGIN SOLUTION -- 3 lines of code
 

 
    ## END SOLUTIONS
    return seq

dataset = create_seq_data("Trump is done.", text2nrep, seq_length=3)
for it in dataset.take(5):
  print([nrep2text(it[0].numpy()), nrep2text(it[1].numpy())])
print('...')

['Tru', 'rum']
['rum', 'ump']
['ump', 'mp ']
['mp ', 'p i']
['p i', ' is']
...


The output above should match the following:
```
['Tru', 'rum']
['rum', 'ump']
['ump', 'mp ']
['mp ', 'p i']
['p i', ' is']
```

### Task: Batching with `tf.data`

Below is some code to shuffle and batch using `create_seq_data`. Answer the following in Prismia:

1. How many batches of data are created per epoch? Is any data unused? Explain.
2. More generally, add an explanation of what the parameters SEQ_LENGTH, BUFFER_SIZE, BATCH_SIZE are doing.
3. Below we are shuffling and then batching. Explain what happens if instead, you shuffle after you batch. 
4. In general, when training, should you shuffle and then batch, or batch and then shuffle? 

Hints: Use `nrep2text` to get a clearer picture of what is happening.  If you are still confused, try uncommenting the alternate `# test_text` variable.
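
When reasoning about batch counts, a little arithmetic helps. The sketch below assumes windows of length `seq_length + 1` that advance one character at a time, and batching with `drop_remainder=True`, as in the setup described above. The helper name `count_batches` and the numbers are hypothetical, chosen so they differ from the cell below.

```python
def count_batches(n_chars, seq_length, batch_size):
    """Return (n_windows, n_batches, n_unused_windows) for stride-1 windows."""
    # Number of complete windows of length seq_length + 1 in a text of n_chars.
    n_windows = max(n_chars - (seq_length + 1) + 1, 0)
    # drop_remainder=True discards the final partial batch.
    n_batches = n_windows // batch_size
    n_unused = n_windows - n_batches * batch_size
    return n_windows, n_batches, n_unused

counts = count_batches(n_chars=1000, seq_length=100, batch_size=32)
print(counts)
```

Shuffling changes which windows land in which batch, but not how many windows or batches there are.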


In [None]:
SEQ_LENGTH = 30
BUFFER_SIZE = 2
BATCH_SIZE = 4
test_text = "The quick brown fox jumps over the lazy dog.."
# test_text = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRS'
print(test_text)

dataset = create_seq_data(test_text, text2nrep, seq_length=SEQ_LENGTH)
batched_dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

print(batched_dataset)
for X, y in batched_dataset:
  print(f"X = \n{X}")
  print(f"y = \n{y}")

print(f'X.shape = {X.shape}')
print(f'y.shape = {y.shape}')


The quick brown fox jumps over the lazy dog..
<BatchDataset shapes: ((4, 30), (4, 30)), types: (tf.int64, tf.int64)>
X = 
[[71 68  1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82
   1 78 85 68 81  1]
 [68  1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1
  78 85 68 81  1 83]
 [53 71 68  1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79
  82  1 78 85 68 81]
 [ 1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1 78
  85 68 81  1 83 71]]
y = 
[[68  1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1
  78 85 68 81  1 83]
 [ 1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1 78
  85 68 81  1 83 71]
 [71 68  1 80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82
   1 78 85 68 81  1]
 [80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1 78 85
  68 81  1 83 71 68]]
X = 
[[80 84 72 66 74  1 65 81 78 86 77  1 69 78 87  1 73 84 76 79 82  1 78 85
  68 81  1 83 71 68]
 [84 72 66 

In [None]:
# BEGIN Hint Solution



# END Hint Solution

<BatchDataset shapes: ((4, 30), (4, 30)), types: (tf.int64, tf.int64)>
X = 
['The quick brown fox jumps over', 'e quick brown fox jumps over t', ' quick brown fox jumps over th', 'he quick brown fox jumps over ']
y = 
['he quick brown fox jumps over ', ' quick brown fox jumps over th', 'quick brown fox jumps over the', 'e quick brown fox jumps over t']
X = 
['quick brown fox jumps over the', 'ick brown fox jumps over the l', 'uick brown fox jumps over the ', 'ck brown fox jumps over the la']
y = 
['uick brown fox jumps over the ', 'ck brown fox jumps over the la', 'ick brown fox jumps over the l', 'k brown fox jumps over the laz']
X = 
['k brown fox jumps over the laz', 'brown fox jumps over the lazy ', 'rown fox jumps over the lazy d', ' brown fox jumps over the lazy']
y = 
[' brown fox jumps over the lazy', 'rown fox jumps over the lazy d', 'own fox jumps over the lazy do', 'brown fox jumps over the lazy ']


## A simple parameterized RNN model

Below we define a simple parameterized RNN model:

* the first layer is a Keras [embedding layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding)

* the second layer can be either a Keras [GRU layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) or [LSTM layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM) - or both!

The model is said to be **parameterized** because many of its parameters are passed in via the parameter dictionary `p`.

In general, storing experimental parameters, including model hyper-parameters, in dictionaries is a great way to keep track of your experiments.
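
As a small illustration of this pattern, the sketch below keeps a baseline dictionary and derives a variant from it, then passes the variant to a function with `**` unpacking. The function `describe_run` and the dictionaries `base` and `variant` are hypothetical names used only for this example.

```python
def describe_run(**p):
    # Keyword arguments arrive as an ordinary dict named p.
    return f"{p['GRU_units']} GRU units, batch size {p['BATCH_size']}"

base = {'GRU_units': 1024, 'BATCH_size': 32}

# Derive a new experiment by overriding one entry; base is left untouched.
variant = {**base, 'GRU_units': 256}

print(describe_run(**base))
print(describe_run(**variant))
```

Because each experiment is a plain dictionary, it can be logged, compared, or saved alongside the results.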

In [None]:
def build_model(**p):
  tf.keras.backend.clear_session()
  m = tf.keras.Sequential()
  m.add(tf.keras.layers.InputLayer(
           input_shape=(p['SEQUENCE_length'],) 
          ,batch_size=p['BATCH_size']
          ))
  m.add(tf.keras.layers.Embedding(
           input_length=p['SEQUENCE_length']  # match the input sequence length
          ,input_dim = p['VOCAB_size']
          ,output_dim = p['EMBEDDING_dim']
          ))
    
  if p['GRU_units'] > 0:
    m.add(tf.keras.layers.GRU(
             units = p['GRU_units']
            ,return_sequences=True
            ,stateful=True
            ,recurrent_initializer='glorot_uniform'
            ))

  if p['LSTM_units'] > 0:
    # Use an LSTM layer here (the GRU layer is handled above)
    m.add(tf.keras.layers.LSTM(
             units = p['LSTM_units']
            ,return_sequences=True
            ,stateful=True
            ,recurrent_initializer='glorot_uniform'
            ))

  m.add(tf.keras.layers.Dense(p['VOCAB_size']))
  
  m.compile(
       optimizer='adam'
      ,loss=tf.keras.losses.SparseCategoricalCrossentropy(
          from_logits=True)
      ,metrics=['accuracy']
      )


  return m

p = {
     'EMBEDDING_dim':100
    ,'GRU_units':1024
    ,'LSTM_units':0
    ,'VOCAB_size':vocab_size
    ,'BUFFER_size':1000
    ,'SEQUENCE_length':100
    ,'BATCH_size':32
    ,'BATCH_per_epoch':100
    ,'CORPUS_fraction':.01
    }

model = build_model(**p) 
# model.summary()

## A simple parameterized data pipeline

The `setup_training_dataset` function returns a training dataset appropriate for the RNN model.  It uses the same approach to parameterization as we did in `build_model`.

In [None]:
def setup_training_dataset(**p):
  dataset = create_seq_data(text[:int(p['CORPUS_fraction']*len(text))], text2nrep, seq_length=p['SEQUENCE_length'])
  training_dataset = dataset.shuffle(p['BUFFER_size']).batch(p['BATCH_size'], drop_remainder=True).take(p['BATCH_per_epoch'])
  return training_dataset

for input_example_batch, target_example_batch in setup_training_dataset(**p):
    example_batch_predictions = model(input_example_batch)
    assert example_batch_predictions.shape == (p['BATCH_size'], p['SEQUENCE_length'], len(vocab))

### Task Understanding Parameterized Models
Add a comment for each parameter in the parameter dictionary above, paste a copy to Prismia as well. Your answer should look something like this:

```
p = {
     'EMBEDDING_dim':100     # short comment
    ,'GRU_units':1024        # short comment
    ,'LSTM_units':0          # ...
    ,'VOCAB_size':vocab_size # 
    ,'BUFFER_size':1000      #
    ,'SEQUENCE_length':100   #
    ,'BATCH_size':32         #
    ,'BATCH_per_epoch':100   #
    ,'CORPUS_fraction':.01   #
    }
```

# Fit model and generate fake Elon tweets
Now that everything is nicely set up and organized, let's fit the model and generate some fake tweets.

In [None]:
p = {
     'EMBEDDING_dim':100
    ,'GRU_units':1024
    ,'LSTM_units':0
    ,'VOCAB_size':vocab_size
    ,'BUFFER_size':1000
    ,'SEQUENCE_length':100
    ,'BATCH_size':32
    ,'BATCH_per_epoch':100
    ,'CORPUS_fraction':.01
    , 'EPOCHS':10
}
model = build_model(**p)
model.fit(setup_training_dataset(**p), epochs=p['EPOCHS'])

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<tensorflow.python.keras.callbacks.History at 0x7f27563ad510>

## A simple generative approach

The `generate_text` function below creates next-token predictions.  It is fun to play with!
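
The sampling step in this function divides the model's logits by a `temperature` before drawing from a categorical distribution. The NumPy sketch below is not part of the notebook; the function name `softmax_with_temperature` and the toy logits are assumptions for illustration. It shows how the temperature reshapes the probabilities that the sampler draws from.

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after scaling by 1 / temperature."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max()  # subtract the max for numerical stability
    probs = np.exp(scaled)
    return probs / probs.sum()

toy_logits = [2.0, 1.0, 0.0]
cold = softmax_with_temperature(toy_logits, temperature=0.1)   # sharply peaked
hot = softmax_with_temperature(toy_logits, temperature=10.0)   # close to uniform

print(np.round(cold, 3))
print(np.round(hot, 3))
```

At a low temperature nearly all of the probability mass sits on the highest-scoring token, so the output is more predictable; at a high temperature the tokens become almost equally likely, so the output is noisier.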

### Task: [Tabula rasa](https://en.wikipedia.org/wiki/Tabula_rasa)
Try running the cell below with the `set_weights` line commented out and also uncommented.  Explain what you see in Prismia.  Include copy-and-paste examples of both runs.

In [None]:
# Generating text using the learned model

def generate_text(model, start_string, count = 1000):
  # Note: Because of the way the RNN state is passed from timestep to 
  # timestep, the model only accepts a fixed batch size once built.
  # To run the model with a different `batch_size`, we need to rebuild 
  # the model to accommodate a batch size of 1. 

  gmodel = tf.keras.models.clone_model(model, input_tensors = tf.keras.Input(batch_input_shape=(1,100)))
  gmodel.set_weights(model.get_weights())
  input_seq = tf.expand_dims(text2nrep(start_string), 0) # add batch dimension
  
  # Low temperatures result in more predictable text.
  # Higher temperatures result in more surprising text.
  temperature = 1.0 # 0.1 is very rigid; 10 is very noisy

  # Empty list to store the generated characters
  text_generated = []

  for i in range(count):
      predictions = gmodel(input_seq)
      predictions = tf.squeeze(predictions, 0) # remove batch dimension
      
      # sample the next character from a categorical distribution over the model's logits
      predictions = predictions / temperature
      predicted_id = tf.random.categorical(predictions, num_samples=1)[-1,0].numpy()      
      text_generated.append(nrep2token[predicted_id])

      # Use the predicted nrep as the next input to the model
      input_seq = tf.expand_dims([predicted_id], 0) # add batch dimension

  return (start_string + ''.join(text_generated))
  
print(generate_text(model, start_string=u"Wild times! ", count=1000))

Wild times! xthe furutheas so surreal, but the negative propaganda is still all out there & easy to find in social media & press interviews, so it’s not just our imagination!
Make sure to read ur terms & conditions before clicking accept!
Samwise Gamgee
Altho Dumb and Dumber is 🔥🔥
Progress update August 28
Sure
If you can’t beat em, join em
Neuralink mission statement
Tesla China team is awesome!
Words are a very lossy compression of thought
If you get past Mars, the asteroids, moons of Jupiter & Saturn, inevitably you reach Uranus!
🖤✨Carl Sagan ✨🖤
Essentially. Long-term purpose of my Tesla stock is to help make life multiplanetary to ensure it’s continuance. The massive capital needs are in 10 to 20 years. By then, if we’re fortunate, Tesla’s goal of acceleratan’t.
AI symbiosis while u wait
There’s some of that too
True, it sounds so surreal, but the negative propaganda is still all out there & easy to fying (finally).
True
Wow, IHOP & GitHub are close
Best use of the term “Full Stack

## Experiments

All the organizational work of parameterizing parts of this notebook will allow you to easily and methodically experiment with different versions of the model.

Weights and Biases is a user-friendly alternative to TensorBoard.dev. Run the cell below to set it up, then use the last cell in this notebook to try out different parameters.  Clicking the link will let you see all of your training runs, including the parameters that were used.   You can even group runs by different parameters.


## Weights and Biases 
Weights and Biases provides an easy way to track experiments.  Below is the setup code needed.  They provide free accounts, and additional provisioning for .edu accounts.  The first time you run the block you will have to authenticate, in a manner similar to the way you authenticate a Google Drive.

In [None]:
!pip -q install wandb
import wandb
wandb.login()

True


### Task: Research Question
The synthetic tweets are very impressive when you first see them. But if you re-run `generate_text` multiple times, they start to seem very familiar.  Try to figure out why this is the case, then try to figure out a way to improve the results.

Document your efforts in the last Prismia problem.  Do not spend more than a couple of hours on this.  

If you create some impressive or particularly funny output, please consider posting to the Piazza hw6 [Fake Tweet Fun](https://piazza.com/class/kjj6m8xbzbp141?cid=238) thread.  

We are much more interested in you exploring the experimental setup and parameter approach used in this notebook than in creating a better solution to this toy problem.

In [None]:
p = {
     'EMBEDDING_dim':100
    ,'GRU_units':1024
    ,'LSTM_units':0
    ,'VOCAB_size':vocab_size
    ,'BUFFER_size':1000
    ,'SEQUENCE_length':100
    ,'BATCH_size':32
    ,'BATCH_per_epoch':100
    ,'CORPUS_fraction':.01
    ,'EPOCHS':10
}

wandb.init(project="hw6-sequence-a", config=p)
model = build_model(**p)
model.fit(setup_training_dataset(**p), epochs=p['EPOCHS'], callbacks=[wandb.keras.WandbCallback()])
tweets = generate_text(model, start_string=u"Wild times! ", count=1000)
wandb.log({'fake-tweets':wandb.Table(data=[[tweets]], columns=["tweets"])})
print(f'\n{tweets}\n')
wandb.finish()

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10

Wild times! we’re fortunate, Tesla’s goal of accelerating sustainable energy & autonomy will be mostly accomplished.
Thank goodness for modern medicine!
For sure
Coming soon, our battle with Big Tequila! It’s real.
That would be next-level 🤣🤣
I bought a pair of XL
Also true. Haha you ! to ensure it’s continuance. The massive capital needs are in 10 to 20 years. By then, if we’re fortunate, Tesla’s goal of accelerating sustainable elerty m65 years without war. That’s the amazing part.
Yeah!
Lord of the Rings
Looks cooo
Thant wall Stank goodn’s real.
👀
Itt part haha
Yes, in plan. Superchargers and public high power wall connectors will keep growing exponentially every year.
👀
I think so
Doing range testing now. Number wall connectors will keep growing exponentially every year.
👀
I think so
Doing range testing now. Number will be significantly higher than 300. Extremely good for


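| metric | value |
|---|---|
| epoch | 9.0 |
| loss | 0.14953 |
| accuracy | 0.9686 |
| _runtime | 47.0 |
| _timestamp | 1616125296.0 |
| _step | 10.0 |

| metric | history |
|---|---|
| epoch | ▁▂▃▃▄▅▆▆▇█ |
| loss | █▆▅▃▁▁▁▁▁▁ |
| accuracy | ▁▂▃▆██████ |
| _runtime | ▁▂▂▃▄▅▅▆▇▇█ |
| _timestamp | ▁▂▂▃▄▅▅▆▇▇█ |
| _step | ▁▂▂▃▄▅▅▆▇▇█ |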
