# Salazar
*A Python-writing wizard*

## Introduction

Coding, while rewarding and essential to modern life, can be monotonous at times. With modern neural networks, however, we can imagine a world where a programmer writes the initial stages of a software solution and then lets a machine finish the work. In that future, programmers could shed much of the overhead of debugging and testing and spend more time on the planning stage of a project.

Generative text has been around for some time, but generative coding is still a relatively new application of its paradigms. The difficulty with generative code is akin to training a model to write stories on a dataset of mostly sci-fi and then expecting it to write a Nicholas Sparks novel. That is, the problem stems from the multitude of libraries and packages built on top of programming languages to make them useful for specific tasks. Just because a model can produce C code doesn't mean it can build an operating system. So what do we do if we want a Swiss Army knife that can write nearly every variation of code in a specific language? One option is to combine a significant amount of data (strings of code) with an effective way of seeding the model. Thankfully, any two programs written in Python share enough similarities that a network should be able to learn some general rules; combining that with the right amount of “starter code” should be enough to get relatively useful outputs.


### Import Dependencies

In [1]:
# dependencies
"""
Numpy: matrix manipulation and math
Pandas: csv parsing and various data structure tasks
Matplotlib.pyplot: data visualization
set_trace: debug breaks
keras: a machine learning library that is intuitive to read
tensorflow: backend for keras... also the most widely used machine learning library
re: regular expressions
"""
from copy import deepcopy as copy
from IPython.core.debugger import set_trace

import sys
import numpy as np
import pandas as pd
import scipy.special as sci
import matplotlib.pyplot as plt 
import os
import tensorflow as tf
import keras
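import json  # used by encode_string to load the saved vocabularies
import re

tf.config.optimizer.set_jit(True) # enable XLA just-in-time compilation of TensorFlow graphs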

from mpl_toolkits.mplot3d import Axes3D
%matplotlib inline

Using TensorFlow backend.


## Concatenate files as a single string
The function below uses the Python function ```os.walk``` to "walk" through a directory tree and read the files within it, appending their contents to the string ```content```.

In [2]:
"""
concat_files
----------

Concatenate text files in a directory as a string

dependent on 'os' Python module

parameters
----------
directory: string; path of the target directory

f_type:    tuple of strings; target file extensions Ex: ('.py', '.cpp')

return
---------
content:   concatenated string

"""
def concat_files(directory,f_type):
    import os
    content = ""

    # walk through every directory and open every file whose name ends with f_type
    # concatenate the file contents into the string "content"
    for root, dirs, files in os.walk(directory): 
        for name in files:
            if name.endswith(f_type): # f_type may be a string or a tuple of suffixes
                with open(os.path.join(root,name), "r",encoding="utf8") as f:
                    content += f.read() + "\n"
    return content

In [3]:
content = concat_files("dataset",('.py',)) # trailing comma makes this a one-element tuple
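
As a quick check of the idea, the sketch below builds a throwaway directory and concatenates only the files with matching extensions. The helper name `concat_matching_files` and the file names are hypothetical, chosen for illustration; the sketch mirrors `concat_files` rather than replacing it.

```python
import os
import tempfile

def concat_matching_files(directory, extensions):
    """Concatenate every file under `directory` whose name ends with one of `extensions`."""
    parts = []
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            # str.endswith accepts a single suffix or a tuple of suffixes
            if name.endswith(extensions):
                with open(os.path.join(root, name), "r", encoding="utf8") as f:
                    parts.append(f.read())
    return "\n".join(parts)

# Build a small temporary tree (hypothetical file names) and read it back.
with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, "pkg"))
    with open(os.path.join(tmp, "a.py"), "w", encoding="utf8") as f:
        f.write("print('a')")
    with open(os.path.join(tmp, "pkg", "b.py"), "w", encoding="utf8") as f:
        f.write("print('b')")
    with open(os.path.join(tmp, "notes.txt"), "w", encoding="utf8") as f:
        f.write("not code")
    demo_content = concat_matching_files(tmp, (".py",))
```

Both `.py` files end up in `demo_content`, including the one in the nested `pkg` directory, while `notes.txt` is skipped.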

### Get a regular expression representation of ASCII characters
This will be useful in distinguishing the characters that matter for writing Python code from characters that appear only in documentation, such as emoji and other non-Latin characters. Filtering them out shrinks the vocabulary and therefore the data shape, which speeds up training.

In [13]:
# despite its name, this pattern matches any character that is NOT 7-bit ASCII
r_all_ascii = "[^\x00-\x7F]"
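
To see what this pattern does, here is a minimal sketch on a made-up sample string. The names `NON_ASCII` and `sample` are illustrative only; the pattern is the same negated character class as above, written as a raw string.

```python
import re

# Negated character class: matches any single character outside \x00-\x7F
NON_ASCII = re.compile(r"[^\x00-\x7F]")

sample = "x = 1  # café ☕"

# Characters that would be excluded from the vocabulary
non_ascii_chars = NON_ASCII.findall(sample)

# The same string with those characters removed
stripped = NON_ASCII.sub("", sample)
```

Here `non_ascii_chars` holds the accented letter and the emoji, and `stripped` contains only ASCII characters.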

In [30]:
"""
encode_string
-----------
Generate a dictionary representation of the characters found 
in a string keyed with integer representations

Returns two dictionaries and an array. The two dictionaries are 
necessary to convert the string to integer representation
and back again. The array is the string encoded as integer values.

parameters
----------
content:      string; to be processed

return
----------
vocab_to_int: dict; character to integer representation of unique characters in the string

int_to_vocab: dict; integer to string representation

encoded:      array; string encoded as integer values
"""

def encode_string(content):   
    # Convert the string "content" into a list of integers
    # *** Uncomment the below lines if you haven't saved the encoded array
    # Then rerun cell
#   -------------------------------------------------
#     ### creates a set of the individual characters
#     vocab = set(content)
#     ### attempt to clean out non-ascii characters
#     vocab_c = copy(vocab)
#     for i, char in enumerate(vocab_c):
#         if re.search(r_all_ascii,char):
#             vocab.remove(char)
#     print(vocab)
#     print(len(vocab))
#     ### use the set to sequentially generate a dictionary
#     vocab_to_int = {c: i for i, c in enumerate(vocab)} 
#     # print(vocab_to_int)
#     ### make keys the numerical values
#     int_to_vocab = dict(enumerate(vocab)) 
    
#     ### encode the "content" string using dict
#     ### encoded = np.array([vocab_to_int[c] for c in content], dtype=np.int32)
    
#     encoded = np.array([],dtype=np.int16)
#     for c in content:
#         if c in vocab_to_int:
#             encoded = np.append(encoded,vocab_to_int[c])
#   -------------------------------------------------


# use the lines below if you want a dictionary of all basic ASCII characters.
# otherwise, comment out.
#   -------------------------------------------------
#     int_to_vocab = {i: chr(i) for i in range(127)}
#     vocab_to_int = {chr(i): i for i in range(127)}

#     encoded = np.array([],dtype=np.int16)
#     for c in content:
#         if c in vocab_to_int:
#             encoded = np.append(encoded,vocab_to_int[c])    

# Comment out if you are using this function for the first time 
# and do not have the required txt and json files
#   -------------------------------------------------    
    
    infile1 = "./encoded.txt"
    infile2 = "./vocab_to_int.json"
    infile3 = "./int_to_vocab.json"
    encoded = np.loadtxt(infile1, dtype=int) # comment out if above lines are uncommented
    
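    with open(infile2, 'r') as fp:
        vocab_to_int = json.load(fp)
    
    with open(infile3, 'r') as fp:
        # JSON object keys are always strings; convert them back to int
        # so that lookups by encoded value work
        int_to_vocab = {int(k): v for k, v in json.load(fp).items()}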
#   --------------------------------------------------    
    
    return vocab_to_int, int_to_vocab, encoded

In [31]:
vocab_to_int, int_to_vocab, encoded = encode_string(content)
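
The round trip that `encode_string` sets up can be sketched on a tiny string. This is an illustration under stated assumptions, not the function itself: the names `char_to_int`, `int_to_char`, `encoded_demo`, and `decoded_demo` are hypothetical, and the vocabulary is sorted here so the mapping is reproducible (a plain `set` of strings may iterate in a different order between interpreter runs).

```python
text = "def f(x):\n    return x\n"

# Unique characters, sorted so the mapping is the same on every run
vocab = sorted(set(text))

# Two lookup tables: character -> integer and integer -> character
char_to_int = {c: i for i, c in enumerate(vocab)}
int_to_char = {i: c for c, i in char_to_int.items()}

# Encode the string as integers, then decode it back
encoded_demo = [char_to_int[c] for c in text]
decoded_demo = "".join(int_to_char[i] for i in encoded_demo)
```

Decoding the encoded list reproduces the original string exactly, which is the property the two dictionaries need to provide.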

In [25]:
#print(content)
print(int_to_vocab)
# all of the files concatenated, with each character encoded as an integer via vocab_to_int
print(encoded)

{'0': 'c', '1': 'W', '2': '8', '3': '.', '4': 'r', '5': ';', '6': 's', '7': '\t', '8': 'e', '9': ',', '10': 'Z', '11': '3', '12': '<', '13': '5', '14': 'u', '15': 'k', '16': 'G', '17': '&', '18': 'm', '19': '?', '20': 'n', '21': '^', '22': ')', '23': 'q', '24': 'l', '25': 'E', '26': 'x', '27': "'", '28': '@', '29': 'L', '30': 'P', '31': 'y', '32': 'I', '33': 'U', '34': 'M', '35': '{', '36': ':', '37': '6', '38': '0', '39': 'X', '40': '(', '41': 'F', '42': 'Q', '43': 'C', '44': '}', '45': '~', '46': '4', '47': '=', '48': '*', '49': '$', '50': ' ', '51': '1', '52': '/', '53': '+', '54': '7', '55': '\n', '56': 'H', '57': 'A', '58': 'J', '59': '-', '60': '>', '61': 'a', '62': 'B', '63': 'd', '64': 'w', '65': '|', '66': 'f', '67': 'N', '68': '%', '69': 'z', '70': 'T', '71': 'b', '72': '2', '73': 'g', '74': '#', '75': '_', '76': 'O', '77': 'V', '78': 'o', '79': 'h', '80': '!', '81': ']', '82': '`', '83': 'i', '84': 'p', '85': 'R', '86': 'D', '87': '"', '88': '[', '89': 'K', '90': 'v', '91': 

## $\rightarrow$ Save encoded array to avoid heavy computation

In [32]:
import json

outfile1 = "./encoded.txt"
outfile2 = "./vocab_to_int.json"
outfile3 = "./int_to_vocab.json"

np.savetxt(outfile1, encoded, fmt='%d')

with open(outfile2, 'w') as fp:
    json.dump(vocab_to_int, fp)

with open(outfile3, 'w') as fp:
    json.dump(int_to_vocab, fp)
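
One detail worth knowing when saving `int_to_vocab` this way: JSON object keys are always strings, which is why the printed dictionary earlier shows keys such as `'0'` rather than `0`. The sketch below, using a made-up two-entry dictionary, shows the effect and one way to restore integer keys after loading.

```python
import json

int_to_char = {0: "a", 1: "b"}

# Serialising and parsing again turns the integer keys into strings
restored = json.loads(json.dumps(int_to_char))

# Convert the keys back to int so lookups by encoded value work
fixed = {int(k): v for k, v in restored.items()}
```

After the round trip, `restored` is keyed by `"0"` and `"1"`, while `fixed` matches the original dictionary.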

## Reshape data into sequences

In [33]:
"""
sequence_gen
---------------

Partition an array of encoded characters into sequences.

Parameters
---------------
encoded:         array of encoded characters; representation of a string
vocab_to_int:    dictionary for conversion from character to integer
int_to_vocab:    dictionary for conversion from integer to character

Settings
--------------
sequence_length: Specify the desired length of the sequences
"""
def sequence_gen(encoded,vocab_to_int,int_to_vocab, **params):
    global n_chars, n_vocab, n_patterns, datax, datay, seq_len
    n_chars = len(encoded)
    n_vocab = len(vocab_to_int)
    seq_len = params.pop("sequence_length") # global as well: the reshape step reuses it
    datax = []
    datay = []

    # Loop through the encoded data and store 
    # sequences in datax and datay
    for i in range(0, n_chars - seq_len, 1):
        seq_in = encoded[i:i + seq_len] 
        seq_out = encoded[i + seq_len]
        datax.append(seq_in)
        datay.append(seq_out)
    n_patterns = len(datax)
    print("Total patterns: ", n_patterns)
    print("Total unique characters: ", n_vocab)
    print ("\"", ''.join([int_to_vocab[value] for value in datax[100]]), "\"")

In [34]:
sequence_gen(encoded,vocab_to_int, int_to_vocab, sequence_length=100)

Total patterns:  1112800
Total unique characters:  127
" ing bolzano

    start = a
    end = b
    if function(a) == 0:  # one of the a or b is a root for t "
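
The sliding-window partitioning done by `sequence_gen` is easier to see on a toy array. The helper `make_windows` and the toy values are hypothetical; the loop follows the same indexing as above, with each window of `seq_len` values paired with the value that immediately follows it.

```python
def make_windows(encoded, seq_len):
    """Split `encoded` into overlapping input windows and next-value targets."""
    inputs, targets = [], []
    for i in range(len(encoded) - seq_len):
        inputs.append(encoded[i:i + seq_len])   # seq_len consecutive values
        targets.append(encoded[i + seq_len])    # the value right after the window
    return inputs, targets

toy = [10, 11, 12, 13, 14, 15]
toy_x, toy_y = make_windows(toy, seq_len=3)
# toy_x -> [[10, 11, 12], [11, 12, 13], [12, 13, 14]]
# toy_y -> [13, 14, 15]
```

An array of length `n` yields `n - seq_len` patterns, consistent with the stride of 1 in the loop above.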


## Shape the sequences in a format that is better suited to LSTM units
We will retain the pattern × sequence shape while adding an additional dimension to represent the number of features; in this case one, since each value represents a single ASCII character. The sequence length will serve as the number of time steps for our LSTM units.

In [35]:
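# import from the same tensorflow.keras namespace that the model uses
from tensorflow.keras.utils import to_categorical

# reshape datax -- > [n_patterns, time steps, features]
X = np.reshape(datax, (n_patterns,seq_len,1))
X = X / float(n_vocab) # scale the integer codes to the range [0, 1)
Y = to_categorical(datay) # one-hot encode the target characters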
#Y = np.asarray(datay) # for sparse categorical cross-entropy
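
The target shapes can be checked on toy data with plain NumPy. This is a sketch under assumptions: the toy arrays and `n_vocab_demo` are made up, and `np.eye(...)[...]` is used as a stand-in that produces the same kind of one-hot rows as the categorical encoding above.

```python
import numpy as np

toy_x = [[0, 1, 2], [1, 2, 3]]   # 2 patterns, 3 time steps each
toy_y = [3, 0]                   # next-character targets
n_vocab_demo = 4                 # hypothetical vocabulary size

# [n_patterns, time steps, features], scaled to the range [0, 1)
X_demo = np.reshape(toy_x, (len(toy_x), 3, 1)) / float(n_vocab_demo)

# One-hot rows: row k has a 1 in column toy_y[k]
Y_demo = np.eye(n_vocab_demo)[toy_y]

print(X_demo.shape, Y_demo.shape)
```

Each input pattern becomes a column of scaled values, one per time step, and each target becomes a row with a single 1.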

## Model 
Here are the primary components of the model:
```python
model = Sequential()
model.add(LSTM(256, input_shape=(X.shape[1], X.shape[2])))
model.add(Dropout(0.5))
model.add(Dense(Y.shape[1], activation='softmax'))
```
The Dropout and Dense layers are commonly found in many other machine learning applications, so we will focus on the layer most critical to the effectiveness of this model: the LSTM layer.



![](lstm.png)

The image above depicts the internal structure commonly found in an LSTM unit.
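
For readers who prefer code to diagrams, here is a minimal NumPy sketch of one time step of a standard LSTM cell (no peephole connections). It is an illustration, not the Keras implementation: the function `lstm_step`, the stacked gate ordering, and the random weights are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step.

    W: (4*H, D) input weights, U: (4*H, H) recurrent weights, b: (4*H,) biases.
    Gates are stacked in the order input, forget, candidate, output (an assumption).
    """
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = sigmoid(z[0:H])            # input gate: how much new information to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old cell state to keep
    g = np.tanh(z[2 * H:3 * H])    # candidate values for the cell state
    o = sigmoid(z[3 * H:4 * H])    # output gate: how much of the cell to expose
    c = f * c_prev + i * g         # updated cell state
    h = o * np.tanh(c)             # updated hidden state
    return h, c

rng = np.random.default_rng(0)
D, H = 1, 4                        # one feature per time step, four hidden units
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h = np.zeros(H)
c = np.zeros(H)
for x_t in [0.1, 0.5, 0.9]:        # a short sequence of scaled character codes
    h, c = lstm_step(np.array([x_t]), h, c, W, U, b)
```

The hidden state `h` is carried from one time step to the next, which is how the sequence length acts as the number of time steps.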

In [46]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Activation
from tensorflow.keras.layers import Embedding
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop, SGD

model = Sequential()
model.add(LSTM(128, input_shape=(X.shape[1], X.shape[2]), return_sequences=True))
model.add(LSTM(128))
model.add(Dropout(0.2))
model.add(Dense(Y.shape[1], activation='softmax'))

# optimizer = RMSprop(learning_rate=0.05)

# model.compile(loss='categorical_crossentropy',
#               optimizer=optimizer,
#               metrics=['accuracy'])

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [37]:
# checkpoint
from tensorflow.keras.callbacks import ModelCheckpoint
filepath = "best-weights.hdf5"
checkpoint = ModelCheckpoint(filepath, monitor='loss',verbose=1, save_best_only=True, mode='min')
callback_list = [checkpoint]

In [48]:
# model.fit(inp, targets, steps_per_epoch=10, epochs=10)
#model.get_weights().shape

model.fit(X[:60000] , Y[:60000], epochs=100, batch_size=300, callbacks=callback_list)

Train on 60000 samples
Epoch 1/100
Epoch 00001: loss improved from 2.96475 to 2.91887, saving model to best-weights.hdf5
Epoch 2/100
Epoch 00002: loss improved from 2.91887 to 2.88659, saving model to best-weights.hdf5
Epoch 3/100
Epoch 00003: loss improved from 2.88659 to 2.85502, saving model to best-weights.hdf5
Epoch 4/100
Epoch 00004: loss improved from 2.85502 to 2.81978, saving model to best-weights.hdf5
Epoch 5/100
Epoch 00005: loss improved from 2.81978 to 2.78238, saving model to best-weights.hdf5
Epoch 6/100
Epoch 00006: loss improved from 2.78238 to 2.74183, saving model to best-weights.hdf5
Epoch 7/100
Epoch 00007: loss improved from 2.74183 to 2.69934, saving model to best-weights.hdf5
Epoch 8/100
Epoch 00008: loss improved from 2.69934 to 2.65714, saving model to best-weights.hdf5
Epoch 9/100
Epoch 00009: loss improved from 2.65714 to 2.61524, saving model to best-weights.hdf5
Epoch 10/100
Epoch 00010: loss improved from 2.61524 to 2.57876, saving model to best-weights.h

Epoch 30/100
Epoch 00030: loss improved from 2.03159 to 2.00330, saving model to best-weights.hdf5
Epoch 31/100
Epoch 00031: loss improved from 2.00330 to 1.97982, saving model to best-weights.hdf5
Epoch 32/100
Epoch 00032: loss improved from 1.97982 to 1.95695, saving model to best-weights.hdf5
Epoch 33/100
Epoch 00033: loss improved from 1.95695 to 1.93213, saving model to best-weights.hdf5
Epoch 34/100
Epoch 00034: loss improved from 1.93213 to 1.91482, saving model to best-weights.hdf5
Epoch 35/100
Epoch 00035: loss improved from 1.91482 to 1.89135, saving model to best-weights.hdf5
Epoch 36/100
Epoch 00036: loss improved from 1.89135 to 1.87241, saving model to best-weights.hdf5
Epoch 37/100
Epoch 00037: loss improved from 1.87241 to 1.85179, saving model to best-weights.hdf5
Epoch 38/100
Epoch 00038: loss improved from 1.85179 to 1.83697, saving model to best-weights.hdf5
Epoch 39/100
Epoch 00039: loss improved from 1.83697 to 1.81377, saving model to best-weights.hdf5
Epoch 40/1

Epoch 59/100
Epoch 00059: loss improved from 1.50437 to 1.49669, saving model to best-weights.hdf5
Epoch 60/100
Epoch 00060: loss improved from 1.49669 to 1.47254, saving model to best-weights.hdf5
Epoch 61/100
Epoch 00061: loss improved from 1.47254 to 1.46522, saving model to best-weights.hdf5
Epoch 62/100
Epoch 00062: loss did not improve from 1.46522
Epoch 63/100
Epoch 00063: loss did not improve from 1.46522
Epoch 64/100
Epoch 00064: loss improved from 1.46522 to 1.43351, saving model to best-weights.hdf5
Epoch 65/100
Epoch 00065: loss improved from 1.43351 to 1.42197, saving model to best-weights.hdf5
Epoch 66/100
Epoch 00066: loss improved from 1.42197 to 1.40707, saving model to best-weights.hdf5
Epoch 67/100
Epoch 00067: loss improved from 1.40707 to 1.40207, saving model to best-weights.hdf5
Epoch 68/100
Epoch 00068: loss improved from 1.40207 to 1.38436, saving model to best-weights.hdf5
Epoch 69/100
Epoch 00069: loss improved from 1.38436 to 1.38005, saving model to best-we

Epoch 00088: loss improved from 1.20879 to 1.20608, saving model to best-weights.hdf5
Epoch 89/100
Epoch 00089: loss improved from 1.20608 to 1.18957, saving model to best-weights.hdf5
Epoch 90/100
Epoch 00090: loss improved from 1.18957 to 1.18788, saving model to best-weights.hdf5
Epoch 91/100
Epoch 00091: loss improved from 1.18788 to 1.17159, saving model to best-weights.hdf5
Epoch 92/100
Epoch 00092: loss improved from 1.17159 to 1.16665, saving model to best-weights.hdf5
Epoch 93/100
Epoch 00093: loss improved from 1.16665 to 1.16263, saving model to best-weights.hdf5
Epoch 94/100
Epoch 00094: loss improved from 1.16263 to 1.15154, saving model to best-weights.hdf5
Epoch 95/100
Epoch 00095: loss did not improve from 1.15154
Epoch 96/100
Epoch 00096: loss improved from 1.15154 to 1.15120, saving model to best-weights.hdf5
Epoch 97/100
Epoch 00097: loss improved from 1.15120 to 1.13450, saving model to best-weights.hdf5
Epoch 98/100
Epoch 00098: loss improved from 1.13450 to 1.1242

<tensorflow.python.keras.callbacks.History at 0x1f80df63088>

In [49]:
from tensorflow.keras.models import load_model
# model.save('.\model')
score = model.evaluate(X[60000:120000], Y[60000:120000])
print(score)

[3.72255146261851, 0.3041]


### Load best weights recorded from training 

In [50]:
# filename should reflect the name of the best weights available 
# in the local directory after training
filename = "best-weights.hdf5"
model.load_weights(filename)
model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

In [51]:
# pick a random seed
start = np.random.randint(0, len(datax)-1)
pattern = []
pattern = datax[start]
print("Seed:")
print ("\"", ''.join([int_to_vocab[value] for value in pattern]), "\"")
# generate characters
for i in range(1000):
    x = np.reshape(pattern, (1, len(pattern), 1))
    x = x / float(n_vocab)
    prediction = model.predict(x, verbose=0)
    index = np.argmax(prediction)
    result = int_to_vocab[index]
    sys.stdout.write(result)
    # slide the window: drop the oldest character and append the prediction,
    # so the pattern length stays constant
    pattern = np.append(pattern[1:], index)
print ("\nDone.")

Seed:
" lection
    for number in collection:
        counting_arr[number - coll_min] += 1

    # sum each p "
omuens that ao alcce teet the andrypine tesuenreeern crlcharcsesnsereser))
                Co                         [ocrypted_d = 0
            reyu = [[
            rotut("""  "
            cteruint__am] = iumut("Plert nnete tet se teet in ret bntereten keye andetetedede.

        )

    batert = "       x, = lowar__t_dxisin(b,  b % b  %  * (*sxir()) lel(sey( - sel) %  0
        print_n = andey(+ 1)
        return False


def ms_necellet(aryent_iceey, potat_sesu):
    pet = 0
    for = iumat(inp( - l_n1    2  * inn ruinte ko seteng:
            reiuet = 0
            deccy_xem = 0

    ror i in range(len(sentete)):
        if nomrt(ionc) : comceiitex,
            denuinteted( = comuenit.
            rumntnnndddm(= 0, comution
        rf sow(x_wue(seluence, ==):
        re boart[n][j] == 1:
            return False
    for i in range(len(board)):
        if board[i][c]l]l] =
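
The generation loop above is a form of greedy decoding over a sliding window. The sketch below reproduces that loop with a stand-in predictor so the window mechanics can be checked without a trained model. `fake_predict` is hypothetical and simply favours the value after the last one in the window; it plays the role of `model.predict`.

```python
import numpy as np

def fake_predict(window, n_vocab):
    """Stand-in for model.predict: a probability vector favouring (last + 1) % n_vocab."""
    probs = np.full(n_vocab, 0.1 / (n_vocab - 1))
    probs[(window[-1] + 1) % n_vocab] = 0.9
    return probs

n_vocab_demo = 5
window = np.array([0, 1, 2])       # seed pattern
generated = []

for _ in range(4):
    probs = fake_predict(window, n_vocab_demo)
    index = int(np.argmax(probs))              # greedy choice: most probable next value
    generated.append(index)
    window = np.append(window[1:], index)      # drop the oldest value, append the new one
```

The window keeps its original length on every iteration, and each prediction becomes part of the input for the next step.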

# References

1. [LSTM: A Search Space Odyssey](https://arxiv.org/pdf/1503.04069.pdf)