### Problem 1

![](https://firebasestorage.googleapis.com/v0/b/prismia.appspot.com/o/user-images%252Fimage-84dabff3-51aa-477c-9201-756f22ed4d88.png?alt=media&token=abc957e0-5eb5-4883-b9ba-8cee42143d1a)

Using this [colab-notebook stencil](https://drive.google.com/file/d/1NcWzchZFKh_yQLRQ-tzcakp5ocPOXVNs/view?usp=sharing), you will create a simple character-based tweet generation model.

Instructions: There is a Prismia problem corresponding to each **Task** in the notebook. For each one:
1. Copy and paste your completed code cell(s).
2. Provide a written summary of how the code cell(s) operate.
3. Include relevant notes (and, where appropriate, example output) on their operation with respect to the supplied dataset. (You may use multiple code cells for this last step.)


® Here is a course staff accessible link to my completed notebook:


https://drive.google.com/file/d/1cxqg3gg8Bi-Qj4CMU5c3dKtpNHmx92MO/view?usp=sharing

### Problem 2

### ®[5] Task ETL
Using the code cell below, write some ETL code to load the data using the link above and store it in memory as a single (very long) string. The data is small enough to do this. **Note:** When working with a larger [text corpus](https://en.wikipedia.org/wiki/Text_corpus), the ETL would need to store the intermediates to disk.

Join individual tweets with a newline character ("\n") or some other special character. In many applications, some additional character-level (or word-level) filtering and transformations might be applied at this step, but there is no need to do so in this assignment.


In [None]:
## BEGIN SOLUTION Task ETL -- 3 lines of code
data = pd.read_csv("https://raw.githubusercontent.com/beckyleii/data-bucket/master/elonmusk.csv")
tweets = np.array(data['tweet'])
text = "\n".join(tweets)


## END SOLUTION Task 1

vocab = sorted(list(set(text)))
vocab_size = len(vocab)
assert vocab_size == 364

The code reads the CSV file from the URL into a DataFrame, takes the "tweet" column, and joins the tweets with the "\n" character. The resulting text contains 364 distinct characters, including digits, symbols, English and Chinese characters, and emoji, so the assert passes.
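The join and vocabulary steps can be sketched on a tiny in-memory stand-in for the tweet column. The three sample strings and the `sample_*` names below are made up for illustration and are not taken from the dataset.

```python
# Hypothetical stand-in for data['tweet']; not taken from the real CSV.
sample_tweets = ["Mars soon", "Doge!", "To the moon"]

# Join tweets with a newline so tweet boundaries stay visible in the corpus.
sample_text = "\n".join(sample_tweets)

# The vocabulary is the sorted set of distinct characters, newline included.
sample_vocab = sorted(set(sample_text))

print(repr(sample_text))
print(len(sample_vocab), sample_vocab)
```

Because the separator is itself a character of the corpus, it ends up in the vocabulary, which gives a character-level model a chance to learn where tweets end.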

### Problem 3

### ®[5] Task `token2nrep` and `nrep2token`
Create two dictionaries: one mapping vocabulary characters to numbers, named `token2nrep`, and another from numbers to tokens, named `nrep2token`. 

_Python pro-tip:_ `enumerate` is a lovely construct
```null
for index, value in enumerate(L):
    # do something
```

_Python pro-tip:_ dictionary comprehensions are a thing!
```null
char2idx = {??? for x in L}
```


In [None]:
## BEGIN SOLUTION - combine the pro-tips and try for a 2 line solution! 
nrep2token = {index:value for index, value in enumerate(vocab)}
token2nrep = {value:index for index, value in enumerate(vocab)}

## END SOLUTION Task 2.1

assert nrep2token[token2nrep['a']]=='a'
assert len(token2nrep) == len(vocab)

for char in 'Elon':
  nrep = token2nrep[char]
  print(f"{char} -> {nrep:3} -> {nrep2token[nrep]}")

def text2nrep(s):
  return np.array([token2nrep[c] for c in s])

def nrep2text(nrep):
  return ''.join([nrep2token[n] for n in nrep])

s="Elon"
print(f'"{s}" -> {text2nrep(s)} -> "{nrep2text(text2nrep(s))}"')  

assert nrep2text(text2nrep("Elon")) == "Elon"
assert len(text2nrep(text)) == len(text)
assert nrep2text(text2nrep(text)) == text


We use two dictionary comprehensions to build the two dictionaries: `nrep2token` maps each index to its character, and `token2nrep` maps each character back to its index. The asserts check that the mapping and its inverse round-trip correctly, both for a single character and for the whole text.

### Problem 4

### ®[30] Task: Text Dataset design
The tf.data API provides a scalable way to do this. You need to:
1. Create a `Dataset` of `text2nrep('text')` using the `from_tensor_slices` constructor.
2. Use the `window` method to configure a 'dataset of datasets', where each window returned is **always** of length `seq_length+1` (see `drop_remainder`).
3. Use `flat_map` and the provided `sub_to_batch` function to flatten the window dataset of datasets into a sequential dataset containing sequential overlapping windows of text.
4. Use the `map` method and the `split_input_target` function to split these sequences into appropriate (X input, y target) pairs.


In [None]:
def create_seq_data(corpus, text2nrep, seq_length):
    def sub_to_batch(sub):
        return sub.batch(seq_length+1, drop_remainder=True)

    def split_input_target(seq):
        input_seq = seq[:-1]
        target_seq = seq[1:]
        return input_seq, target_seq

    ## BEGIN SOLUTION -- 3 lines of code
    dataset = tf.data.Dataset.from_tensor_slices(text2nrep(corpus)).window(seq_length+1, shift=1, drop_remainder=True)
    dataset = dataset.flat_map(sub_to_batch)
    seq = dataset.map(split_input_target)
    ## END SOLUTIONS
    return seq

dataset = create_seq_data("Trump is done.", text2nrep, seq_length=3)
for it in dataset.take(5):
  print([nrep2text(it[0].numpy()), nrep2text(it[1].numpy())])
print('...')

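This creates a `tf.data` dataset of (input, target) pairs: each input is a window of `seq_length` characters, and its target is the same window shifted one character to the right.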
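To make the output of the pipeline concrete, here is a plain-Python sketch that mimics the window, flatten, and split steps with string slicing. It does not use TensorFlow, and `sliding_pairs` is a hypothetical helper written only for this illustration.

```python
def sliding_pairs(corpus, seq_length):
    """Mimic window(seq_length+1, shift=1, drop_remainder=True) followed by
    split_input_target, using plain string slicing instead of tf.data."""
    pairs = []
    # Only full windows are kept, so there are len(corpus) - seq_length of them.
    for start in range(len(corpus) - seq_length):
        window = corpus[start:start + seq_length + 1]
        pairs.append((window[:-1], window[1:]))  # (input, target shifted by one)
    return pairs

pairs = sliding_pairs("Trump is done.", seq_length=3)
for x, y in pairs[:5]:
    print([x, y])
print(len(pairs), "pairs in total")
```

For the 14-character test string and `seq_length=3`, this prints `['Tru', 'rum']`, `['rum', 'ump']`, `['ump', 'mp ']`, `['mp ', 'p i']`, `['p i', ' is']`, and reports 11 pairs in total.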

### Problem 5

### ®[10] Task: Batching with `tf.data`
Below is some code to shuffle and batch using `create_seq_data`. Answer the following in Prismia:
1. How many batches of data are created per epoch? Is any data unused? Explain.
2. More generally, add an explanation of what the parameters SEQ_LENGTH, BUFFER_SIZE, BATCH_SIZE are doing.
3. Below we are shuffling and then batching. Explain what happens if instead, you shuffle after you batch. 
4. In general, when training, should you shuffle and then batch, or batch and then shuffle? 


Hints: Use `nrep2text` to get a clearer picture of what is happening. If you are still confused, try uncommenting the alternate `# test_text` variable.


In [None]:
# BEGIN Hint Solution
for X, y in batched_dataset:
  print("Batch: ")
  for e in range(len(X)):
      print("X= " + nrep2text(X[e].numpy()), " | y= " + nrep2text(y[e].numpy()))
# END Hint Solution

1. There are 3 batches of data per epoch. Some data is unused because we set `drop_remainder=True`, which drops the sequences left over after the last full batch.
2. SEQ_LENGTH sets the input length of each training data point. BUFFER_SIZE sets the size of the buffer from which we sample. A BUFFER_SIZE of 2 means that each element is picked at random from a buffer of 2 sequences, and the picked one is replaced with the next element in the dataset. BATCH_SIZE sets the number of sequences in each batch; in this case we have 4 sequences (data points) per batch.
3. If we batch before we shuffle, each batch contains 4 consecutive elements from the input, and only the order of the batches is shuffled; if we shuffle before we batch, the elements of each batch are randomly sampled from the input.
4. We should shuffle before we batch. If we batch first, the elements within each batch are always consecutive, which can cause the model to overfit to that ordering. If we shuffle the whole input before batching, we avoid this problem: the order of the batches does not matter, since the elements within each batch change every time.
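The batch arithmetic and the shuffle-buffer behaviour described in these answers can be sketched in plain Python. Both functions are hypothetical helpers written for this illustration, the numbers passed to them are made up, and the simulation only approximates how a `tf.data` shuffle buffer behaves.

```python
import random

def batches_per_epoch(n_chars, seq_length, batch_size):
    """Count full batches when windows of seq_length+1 slide by one character
    and both the windowing and the batching drop incomplete remainders."""
    n_sequences = max(n_chars - seq_length, 0)  # full windows only
    n_batches = n_sequences // batch_size       # full batches only
    n_dropped = n_sequences % batch_size        # leftover sequences are discarded
    return n_batches, n_dropped

def buffered_shuffle(items, buffer_size, seed=0):
    """Simulate a shuffle buffer: emit a random element of the buffer and
    refill the freed slot with the next incoming element."""
    rng = random.Random(seed)
    buffer, out = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)      # still filling the buffer
            continue
        i = rng.randrange(buffer_size)
        out.append(buffer[i])        # emit a random buffered element
        buffer[i] = item             # replace it with the next element
    rng.shuffle(buffer)              # drain what is left in random order
    out.extend(buffer)
    return out

print(batches_per_epoch(n_chars=20, seq_length=5, batch_size=4))  # -> (3, 3)
print(buffered_shuffle(range(10), buffer_size=2))
```

With a buffer of 2, an element can move only a short distance from its original position, so a small BUFFER_SIZE gives a weak shuffle; a buffer at least as large as the dataset gives a full shuffle.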

### Problem 6

### ®[10] Task Understanding Parameterized Models
Add a comment for each parameter in the parameter dictionary above, paste a copy to Prismia as well. Your answer should look something like this:
```null
p = {
     'EMBEDDING_dim':100     # short comment
    ,'GRU_units':1024        # short comment
    ,'LSTM_units':0          # ...
    ,'VOCAB_size':vocab_size # 
    ,'BUFFER_size':1000      #
    ,'SEQUENCE_length':100   #
    ,'BATCH_size':32         #
    ,'BATCH_per_epoch':100   #
    ,'CORPUS_fraction':.01   #
    }
```


In [None]:
p = {
     'EMBEDDING_dim':100     # output dimension of the embedding layer
    ,'GRU_units':1024        # dimensionality of the GRU layer's output space
    ,'LSTM_units':0          # dimensionality of the LSTM hidden state, i.e. the length of the activation vector passed to the next LSTM step
    ,'VOCAB_size':vocab_size # vocabulary size, used as the input dimension of the embedding layer
    ,'BUFFER_size':1000      # size of the shuffle buffer we sample from
    ,'SEQUENCE_length':100   # length of the sequence for each input data point
    ,'BATCH_size':32         # number of sequences in each training batch
    ,'BATCH_per_epoch':100   # number of batches in a training epoch
    ,'CORPUS_fraction':.01   # fraction of the corpus used to build the dataset
    }

### Problem 7

### ®[10] Task: [Tabula rasa](https://en.wikipedia.org/wiki/Tabula_rasa)
Try running the cell below with the set_weights commented out and also uncommented.  Explain what you see in Prismia.  Include copy-and-paste examples of both runs.


If we do not set the weights, each character of the generated text is essentially a random pick from the vocabulary. If we call `set_weights`, the generated text comes from our trained model's predictions, because the cloned model then has the same weights as the previously trained model.
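The difference between the two runs can be illustrated with synthetic logits in NumPy. This sketch does not use the notebook's model: the small random logits stand in for an untrained network, the boosted logit stands in for a trained one, and index 42 is an arbitrary choice.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
vocab_size = 364  # vocabulary size asserted in Task ETL

# Assumption: a freshly initialized model emits small, unstructured logits.
untrained_logits = rng.normal(scale=0.01, size=vocab_size)

# Assumption: a trained model strongly prefers one next character (index 42 is arbitrary).
trained_logits = untrained_logits.copy()
trained_logits[42] += 8.0

p_untrained = softmax(untrained_logits)
p_trained = softmax(trained_logits)

print("uniform probability:", 1 / vocab_size)
print("untrained max prob: ", p_untrained.max())
print("trained max prob:   ", p_trained.max())
```

Under these assumptions the untrained distribution stays close to uniform over all 364 characters, so sampling from it looks like noise, while the trained distribution concentrates most of its mass on a single character.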

### Problem 8

### ®[30] Task: Research Question
The synthetic tweets results are very impressive when you first see them. But if you re-run  `generate_text` multiple times, the results start to seem very familiar.  Try and figure out why this is the case, then try and figure out a way to improve the results.  

Document your efforts in the last Prismia problem.  Do not spend more than a couple of hours on this.  

If you create some impressive or particularly funny output, please consider posting to the Piazza hw6 [Fake Tweet Fun](https://piazza.com/class/kjj6m8xbzbp141?cid=238) thread.  

We are much more interested in you exploring the experimental setup and parameter approach used in this notebook than in creating a better solution to this toy problem.


In the corpus that the model sees, it is rare for different sequences to start with the same few characters. Therefore, given the start of a sequence, `generate_text` produces similar outputs, since that is the only pattern the model saw.
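One commonly used knob for making sampled text less repetitive is a sampling temperature. The answer above does not explore it, and whether the notebook's `generate_text` exposes such a parameter is not stated here, so the sketch below is a standalone NumPy illustration with made-up logits and a hypothetical helper name.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng):
    """Sample one index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
toy_logits = [4.0, 2.0, 1.0, 0.5]  # made-up logits over a 4-character vocabulary

for t in (0.2, 1.0, 2.0):
    draws = [sample_with_temperature(toy_logits, t, rng) for _ in range(1000)]
    top_share = draws.count(0) / len(draws)
    print(f"temperature={t}: share of most likely character = {top_share:.2f}")
```

Lower temperatures push almost all draws onto the most likely character, which mirrors the repetitive behaviour described above; higher temperatures spread the draws across more characters at the cost of more errors.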

### Problem 9