# Salazar
*A Python-writing wizard*

## Introduction

Coding, while rewarding and essential to modern life, can be monotonous at times. With modern neural networks, however, it becomes possible to imagine a world where we write the initial stages of a software solution and then let a machine finish the work for us. In such a future, programmers could shed much of the overhead of debugging and testing and spend more time on the planning stage of a project.

Generative text has been around for some time, but generative coding is still a relatively new application of its paradigms. The difficulty with generative code is akin to training a model to write stories on a dataset containing mostly sci-fi and then expecting that model to write a Nicholas Sparks novel. That is, the problem stems from the multitude of libraries and packages built on top of programming languages to make them useful for specific tasks. Just because a model can produce C code doesn’t mean it can build an operating system. So what do we do if we want a Swiss Army knife for writing nearly every variation of code in a specific language? One option is to combine a significant amount of data (strings of code) with an effective way of seeding the model. Thankfully, any two programs written in Python share enough similarities that a network should be able to learn some of the rules; combining that with the right amount of “starter code” should be enough to get relatively useful outputs.


### Import Dependencies

In [2]:
# dependencies
"""
Numpy:             matrix manipulation and math
Pandas:            csv parsing and various data structure tasks
Matplotlib.pyplot: data visualization
set_trace:         debug breaks
keras:             a machine learning library that is intuitive to read
tensorflow:        backend for keras; also the most widely used machine learning library
re:                regular expressions
"""
from copy import deepcopy as copy
from IPython.core.debugger import set_trace

import sys
import numpy as np
import pandas as pd
import scipy.special as sci
import matplotlib.pyplot as plt 
import os
import tensorflow as tf
import keras
import re

tf.config.optimizer.set_jit(True) # enable XLA just-in-time compilation

from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

Using TensorFlow backend.


## Concatenate files as a single string
The function below uses Python's ```os.walk``` to "walk" through a directory tree and read the files within it, appending their contents to the string ```content```.

Currently using [this repository](https://github.com/TheAlgorithms/Python) as the dataset.

In [3]:
"""
concat_files
----------

Concatenate text files in a directory as a string

dependent on 'os' Python module

parameters
----------
directory: string; path of the target directory

f_type:    tuple of strings; target file extensions Ex: ('.py', '.cpp')

return
---------
content:   concatenated string

"""
def concat_files(directory,f_type):
    import os
    # List all files in the dataset directory
    # ------------------
    all_file = []
    content = ""

    # walk through every directory and open every f_type file
    # concatenate into var string "content"
    for root, dirs, files in os.walk(directory): 
        for name in files:
            if name.endswith(f_type): # keep only files with a target extension
                all_file.append(name)
                with open(os.path.join(root,name), "r",encoding="utf8") as f:
                    content += f.read() + "\n"
    return content

In [4]:
content = concat_files("dataset", ('.py',)) # trailing comma makes this a one-element tuple

### Get a regular expression representation of non-ASCII characters
This will be useful for distinguishing the characters that matter when writing Python code from characters that appear only in documentation, such as emoji and other non-Latin characters. The pattern below matches any character outside the ASCII range (```\x00``` to ```\x7F```). Filtering those characters out slims down our data shape and increases training speed.

In [5]:
r_all_ascii = "[^\x00-\x7F]"
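
As a small illustration of how this pattern behaves, the sketch below strips non-ASCII characters from a sample string. The helper name `strip_non_ascii` and the sample text are assumptions made for this example; the notebook itself filters characters during encoding rather than with `re.sub`. The pattern is written as a raw string here so that `re` interprets the escapes itself.

```python
import re

# Same character class as above: any single character outside \x00-\x7F
r_all_ascii = r"[^\x00-\x7F]"

def strip_non_ascii(text):
    """Return text with every non-ASCII character removed (hypothetical helper)."""
    return re.sub(r_all_ascii, "", text)

sample = "print('héllo')  # ✓ done"
cleaned = strip_non_ascii(sample)
print(cleaned)
```

Pure-ASCII input passes through unchanged, so running this over source code only affects comments and string literals that contain non-ASCII text.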

## Below are two functions with the same name
[**The first**](#First-preprocessing-option) is for when you are *introducing new data to the dataset*, i.e. pulling new repositories into the "dataset" directory. It performs the *encoding step*, converting every character in the concatenated string into an integer stored in a one-dimensional array. 

[**The second**](#Second-preprocessing-option) is intended for loading an already encoded string and storing it in the variable we will use going forward. This saves a significant amount of time, as it takes a while to encode a large string.

### First preprocessing option

In [13]:
"""
pre_proc
-----------
Generate a dictionary representation of the characters found 
in a string keyed with integer representations

Returns two dictionaries and an array. The two dictionaries are 
necessary to convert the string to integer representation
and back again. The array is the string encoded as integer values.

parameters
----------
content:      string; to be processed

return
----------
vocab_to_int: dict; character to integer representation of unique characters in the string

int_to_vocab: dict; integer to string representation

encoded:      array; string encoded as integer values
"""

def pre_proc(content):   
    # Convert the string "content" into a list of integers
#   -------------------------------------------------
#     ### creates a set of the individual characters
#     vocab = set(content)
#     ### attempt to clean out non-ascii characters
#     vocab_c = copy(vocab)
#     for i, char in enumerate(vocab_c):
#         if re.search(r_all_ascii,char):
#             vocab.remove(char)
#     print(vocab)
#     print(len(vocab))
#     ### use the set to sequentially generate a dictionary
#     vocab_to_int = {c: i for i, c in enumerate(vocab)} 
#     # print(vocab_to_int)
#     ### make keys the numerical values
#     int_to_vocab = dict(enumerate(vocab)) 
    
#     ### encode the "content" string using dict
#     ### encoded = np.array([vocab_to_int[c] for c in content], dtype=np.int32)
    
#     encoded = np.array([],dtype=np.int16)
#     for c in content:
#         if c in vocab_to_int:
#             encoded = np.append(encoded,vocab_to_int[c])
#   -------------------------------------------------


# use the lines below if you want a dictionary of all basic ASCII characters.
# otherwise, comment out.
#   -------------------------------------------------
    int_to_vocab = {i: chr(i) for i in range(127)}
    vocab_to_int = {chr(i): i for i in range(127)}

    encoded = np.array([],dtype=np.int16)
    for c in content:
        if c in vocab_to_int:
            encoded = np.append(encoded,vocab_to_int[c])     
    
    return vocab_to_int, int_to_vocab, encoded
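
Calling `np.append` inside a loop copies the whole array on every iteration, which likely contributes to the slow encoding mentioned above. A minimal sketch of an alternative, assuming the same 127-character ASCII vocabulary, builds the array in a single pass. The name `encode_fast` is hypothetical and not part of the notebook.

```python
import numpy as np

def encode_fast(content):
    """Encode a string as int16 codes, skipping characters outside the vocabulary."""
    vocab_to_int = {chr(i): i for i in range(127)}
    int_to_vocab = {i: chr(i) for i in range(127)}
    # One list comprehension and one array allocation instead of repeated np.append
    encoded = np.array(
        [vocab_to_int[c] for c in content if c in vocab_to_int],
        dtype=np.int16,
    )
    return vocab_to_int, int_to_vocab, encoded

_, int_to_vocab, encoded = encode_fast("x = 1  # π\n")
decoded = "".join(int_to_vocab[i] for i in encoded)
print(encoded)
print(repr(decoded))
```

The non-ASCII character in the sample is dropped, so decoding returns the original string minus that character.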

### Second preprocessing option

In [6]:
# Run if you want to use data that is already preprocessed 
def pre_proc(content):    
    import json
    
    infile1 = "./encoded.txt"        # path to encoded string
    infile2 = "./vocab_to_int.json"
    infile3 = "./int_to_vocab.json"
    
    encoded = np.loadtxt(infile1, dtype=int) # load as an array of integers
    
#     with open(infile2, 'r') as fp:
#         vocab_to_int = json.load(fp)
    
#     with open(infile3, 'r') as fp:
#         int_to_vocab = json.load(fp)
        
    int_to_vocab = {i: chr(i) for i in range(127)}
    vocab_to_int = {chr(i): i for i in range(127)}
#   --------------------------------------------------    
    
    return vocab_to_int, int_to_vocab, encoded

### Run the preprocessing function
If you run the next cell, you will see the encoded string.

In [7]:
vocab_to_int, int_to_vocab, encoded = pre_proc(content)

In [10]:
#print(content)
print(int_to_vocab)
# this is all of the files concatenated, with each character encoded using vocab_to_int
print()
print("Encoded string:",encoded)

{0: '\x00', 1: '\x01', 2: '\x02', 3: '\x03', 4: '\x04', 5: '\x05', 6: '\x06', 7: '\x07', 8: '\x08', 9: '\t', 10: '\n', 11: '\x0b', 12: '\x0c', 13: '\r', 14: '\x0e', 15: '\x0f', 16: '\x10', 17: '\x11', 18: '\x12', 19: '\x13', 20: '\x14', 21: '\x15', 22: '\x16', 23: '\x17', 24: '\x18', 25: '\x19', 26: '\x1a', 27: '\x1b', 28: '\x1c', 29: '\x1d', 30: '\x1e', 31: '\x1f', 32: ' ', 33: '!', 34: '"', 35: '#', 36: '$', 37: '%', 38: '&', 39: "'", 40: '(', 41: ')', 42: '*', 43: '+', 44: ',', 45: '-', 46: '.', 47: '/', 48: '0', 49: '1', 50: '2', 51: '3', 52: '4', 53: '5', 54: '6', 55: '7', 56: '8', 57: '9', 58: ':', 59: ';', 60: '<', 61: '=', 62: '>', 63: '?', 64: '@', 65: 'A', 66: 'B', 67: 'C', 68: 'D', 69: 'E', 70: 'F', 71: 'G', 72: 'H', 73: 'I', 74: 'J', 75: 'K', 76: 'L', 77: 'M', 78: 'N', 79: 'O', 80: 'P', 81: 'Q', 82: 'R', 83: 'S', 84: 'T', 85: 'U', 86: 'V', 87: 'W', 88: 'X', 89: 'Y', 90: 'Z', 91: '[', 92: '\\', 93: ']', 94: '^', 95: '_', 96: '`', 97: 'a', 98: 'b', 99: 'c', 100: 'd', 101: 'e'

## $\rightarrow$ Save encoded array to avoid heavy computation

In [32]:
outfile1 = "./encoded.txt"
outfile2 = "./vocab_to_int.json"
outfile3 = "./int_to_vocab.json"

np.savetxt(outfile1, encoded, fmt='%d') # outfile1 is the path defined above

# with open(outfile2, 'w') as fp:
#     json.dump(vocab_to_int, fp)

# with open(outfile3, 'w') as fp:
#     json.dump(int_to_vocab, fp)
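
As an alternative sketch, NumPy's binary `.npy` format can round-trip the encoded array while preserving its dtype, which a text file does not record. The file name `encoded.npy` and the toy array below are assumptions for illustration; the notebook itself uses `np.savetxt` and `np.loadtxt`.

```python
import os
import tempfile
import numpy as np

encoded = np.array([100, 101, 102, 10], dtype=np.int16)  # stand-in for the real array

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "encoded.npy")  # hypothetical file name
    np.save(path, encoded)                   # binary format, keeps dtype and shape
    restored = np.load(path)

print(restored.dtype, np.array_equal(encoded, restored))
```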

## Reshape data into sequences

In [16]:
"""
sequence_gen
---------------

Partition an array of encoded characters into sequences.

Parameters
---------------
encoded:         array of encoded characters; representation of a string
vocab_to_int:    dictionary for conversion from character to integer
int_to_vocab:    dictionary for conversion from integer to character

Settings
--------------
sequence_length: Specify the desired length of the sequences
"""
def sequence_gen(encoded,vocab_to_int,int_to_vocab, **params):
    global n_chars, n_vocab, n_patterns, datax, datay
    n_chars = len(encoded)
    n_vocab = len(vocab_to_int)
    seq_len = params.pop("sequence_length") # change from 50
    datax = []
    datay = []

    # Loop through the encoded data and store 
    # sequences in datax and datay
    for i in range(0, n_chars - seq_len, 1):
        seq_in = encoded[i:i + seq_len] 
        seq_out = encoded[i + seq_len]
        datax.append(seq_in)
        datay.append(seq_out)
    n_patterns = len(datax)
    print("Total patterns: ", n_patterns)
    print("Total unique characters: ", n_vocab)
    print ("\"", ''.join([int_to_vocab[value] for value in datax[100]]), "\"")

In [20]:
seq_len = 100
sequence_gen(encoded,vocab_to_int, int_to_vocab, sequence_length=seq_len)

Total patterns:  1112800
Total unique characters:  127
" ing bolzano

    start = a
    end = b
    if function(a) == 0:  # one of the a or b is a root for t "
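
To make the windowing concrete, here is a minimal sketch of the same loop on a toy string with a window of three characters. The sample string and the short `seq_len` are assumptions chosen so the output is easy to read.

```python
import numpy as np

encoded = np.array([ord(c) for c in "def f():"], dtype=np.int16)
seq_len = 3

datax, datay = [], []
for i in range(len(encoded) - seq_len):
    datax.append(encoded[i:i + seq_len])   # input window
    datay.append(encoded[i + seq_len])     # the character that follows it

for x, y in zip(datax, datay):
    print(repr("".join(chr(v) for v in x)), "->", repr(chr(y)))
```

Each window overlaps the previous one by all but one character, which is why an array of `n` characters yields `n - seq_len` patterns.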


## Shape the sequences in a format that is better suited to LSTM units
We will retain the pattern x sequence shape while adding an additional dimension to represent the number of features; in this case one, since each value represents a single ASCII character. In terms of the LSTM, the sequence length functions as the number of time steps for our LSTM cells.

In [21]:
# to_categorical is imported from tensorflow.keras to match the model imports below
from tensorflow.keras.utils import to_categorical

# reshape datax -- > [n_patterns, time steps, features]
X = np.reshape(datax, (n_patterns,seq_len,1))
X = X / float(n_vocab)
Y = to_categorical(datay)
#Y = np.asarray(datay) # for sparse categorical cross-entropy
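
The shapes involved can be checked on a toy example. The sketch below uses plain NumPy, with `np.eye` standing in for `to_categorical` (an assumption made to keep the example free of Keras); note that it always produces `n_vocab` columns, whereas `to_categorical` without `num_classes` infers the width from the largest label.

```python
import numpy as np

n_vocab = 127
seq_len = 3
datax = [[100, 101, 102], [101, 102, 32]]   # two toy input windows
datay = [32, 102]                            # their target characters

# [n_patterns, time steps, features], scaled into [0, 1)
X = np.reshape(datax, (len(datax), seq_len, 1)) / float(n_vocab)

# One-hot targets: row i has a single 1 at column datay[i]
Y = np.eye(n_vocab)[datay]

print(X.shape, Y.shape)
```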

## Model 
Here are the primary components of the model:
```python
model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))
```
The Dropout and Dense layers are common in many other machine learning applications, so we will focus on the layer that matters most to the effectiveness of this model: the LSTM layer.

![](The_LSTM_cell.png)
*From Wikipedia*

The image above depicts what is commonly found in **LSTM cells**. A typical LSTM has an additional input over vanilla RNNs known as the **cell-state vector**. This vector along with the **hidden-state vector** and **input data** allow the LSTM cell to "remember" or "forget" certain sequences.

There are three primary gates within a cell, each using the sigmoid function:
1. input gate $\rightarrow$ controls whether the memory cell is updated; contributes to the cell-state
2. forget gate $\rightarrow$ controls whether the memory cell is reset to zero; also contributes to the cell-state
3. output gate $\rightarrow$ controls whether the information in the current cell-state is made visible; contributes directly to the hidden-state
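
The three gates can be sketched as a single time step in NumPy. This follows the standard LSTM formulation without peephole connections; the function name `lstm_step`, the stacked weight layout, and the gate ordering are assumptions for illustration and do not reflect how Keras stores its weights internally.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U and b hold the stacked parameters for the
    input gate, forget gate, output gate and candidate values, in that order."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b          # all four pre-activations at once
    i = sigmoid(z[0:n])                 # input gate
    f = sigmoid(z[n:2 * n])             # forget gate
    o = sigmoid(z[2 * n:3 * n])         # output gate
    g = np.tanh(z[3 * n:4 * n])         # candidate cell values
    c = f * c_prev + i * g              # new cell-state
    h = o * np.tanh(c)                  # new hidden-state
    return h, c

rng = np.random.default_rng(0)
n_hidden, n_features = 4, 1
W = rng.normal(size=(4 * n_hidden, n_features))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)

h, c = lstm_step(np.array([0.5]), np.zeros(n_hidden), np.zeros(n_hidden), W, U, b)
print(h.shape, c.shape)
```

A forget gate near zero wipes the previous cell-state, while an input gate near zero leaves it untouched by the current input, which is the "remember or forget" behaviour described above.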


**TODO: briefly explain compilation choice**
```python
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
```
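
As a hedged illustration of what the `categorical_crossentropy` loss measures, the sketch below computes it by hand for one-hot targets. The function is a simplified stand-in written for this example, not Keras's implementation, and the probabilities are made up.

```python
import numpy as np

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean cross-entropy between one-hot targets and predicted probabilities."""
    y_pred = np.clip(y_pred, eps, 1.0)          # avoid log(0)
    return float(np.mean(-np.sum(y_true * np.log(y_pred), axis=1)))

y_true = np.array([[0.0, 1.0, 0.0]])             # the correct character is class 1
confident = np.array([[0.05, 0.90, 0.05]])
unsure = np.array([[0.40, 0.30, 0.30]])

loss_confident = categorical_crossentropy(y_true, confident)
loss_unsure = categorical_crossentropy(y_true, unsure)
print(loss_confident, loss_unsure)
```

The loss only looks at the probability assigned to the correct character, so it pairs naturally with a softmax output layer and one-hot targets.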

In [32]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop, SGD

# hidden vector size is still arbitrary at this point of testing,
# so it is hard coded to 128
model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax')) # output should be probabilities of character options

#---------------------------------------
# optimizer = RMSprop(learning_rate=0.05)

# model.compile(loss='categorical_crossentropy',
#               optimizer=optimizer,
#               metrics=['accuracy'])
#---------------------------------------

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

## Saving trained weights
Because this model takes a significant amount of time to train (about 10 minutes for 100 epochs on an RTX 2070 Super), we decided to save the weights that produce the lowest loss.

In [23]:
# checkpoint
from tensorflow.keras.callbacks import ModelCheckpoint
filepath = "best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',verbose=1, save_best_only=True, mode='min')
callback_list = [checkpoint]

## Train
**TODO: explain training method**

In [33]:
# model.fit(inp, targets, steps_per_epoch=10, epochs=10)
#model.get_weights().shape

model.fit(X[:120000] , Y[:120000], epochs=250, batch_size=500, callbacks=callback_list)

Train on 120000 samples
Epoch 1/250
Epoch 00001: loss did not improve from 1.25241
Epoch 2/250
Epoch 00002: loss did not improve from 1.25241
Epoch 3/250
Epoch 00003: loss did not improve from 1.25241


KeyboardInterrupt: 

In [25]:
from tensorflow.keras.models import load_model
# model.save('.\model')
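# note: this slice lies inside the training range X[:120000], so the score is not a held-out estimate
score = model.evaluate(X[60000:120000], Y[60000:120000])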
print(score)

[1.041020267889897, 0.70821667]


### Load best weights recorded from training 

In [28]:
# filename should reflect the name of the best weights available 
# in the local directory after training
filename = "best-weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

## Test model's ability to generate code
**TODO: explain generation**
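
The idea behind generation can be sketched without a trained model: feed a seed window to the predictor, take the most likely next character, append it, and drop the oldest character so the window keeps its length. In the sketch below `fake_predict` is a hypothetical stand-in for `model.predict` that simply favours the next character code; it exists only to make the loop runnable.

```python
import numpy as np

def fake_predict(window, n_vocab=127):
    """Stand-in for model.predict: returns a probability vector that
    favours the character code following the last one in the window."""
    probs = np.full(n_vocab, 1e-3)
    probs[(window[-1] + 1) % n_vocab] = 1.0
    return probs / probs.sum()

pattern = np.array([ord(c) for c in "abc"])   # toy seed
generated = []
for _ in range(5):
    index = int(np.argmax(fake_predict(pattern)))   # greedy choice
    generated.append(chr(index))
    pattern = np.append(pattern, index)[1:]         # slide the window, length unchanged

print("".join(generated))
```

Because every step picks the single most likely character, a model that is unsure can get stuck repeating the same output.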

In [35]:
# pick a random seed
start = np.random.randint(0, len(datax)-1)
pattern = datax[start]
print("Seed:")
print ("\"", ''.join([int_to_vocab[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    # shape the window like the training data: [1, time steps, features]
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction) # greedy: always take the most likely character
    result = int_to_vocab[index]
    sys.stdout.write(result)
    # slide the window: append the prediction, then drop the oldest
    # character so the pattern returns to its original length
    pattern = np.append(pattern, [index], axis=0)
    pattern = pattern[1:]
print ("\nDone.")

Seed:
" > geometric_series(4, -2, 2)
    [-2, '-4.0', '-8.0', '-16.0']
    >>> geometric_series(-4, 2, 2)
   "
                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                         

**Possible TODO: Use autoencoder to translate natural language into code.**

# References

1. [LSTM: A Search Space Odyssey](https://arxiv.org/pdf/1503.04069.pdf)