## Restricted Boltzmann Machines

Our goal is to build a Restricted Boltzmann Machine (RBM) that generates new names of people, places and things. In other words, we want a model that spits out funny names.

The problem that RBMs solve is learning a probability distribution. Here, we want to learn a function P that assigns every string a probability according to its plausibility as a particular kind of name.
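
For reference, the textbook binary RBM (a standard formulation, not taken from this notebook) defines such a distribution through an energy function over visible units $v$ (the encoded string) and hidden units $h$. The symbols $a$, $b$, $W$ and $Z$ are assumed here to be the usual visible biases, hidden biases, weight matrix and normalizing constant:

$$
E(v, h) = -a^\top v - b^\top h - v^\top W h, \qquad
P(v) = \frac{1}{Z} \sum_{h} e^{-E(v, h)}, \qquad
Z = \sum_{v, h} e^{-E(v, h)}
$$

Strings with low energy get high probability, so training amounts to lowering the energy of plausible names.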

#### DATA INPUT

Any data we want to feed into a neural network must first be transformed into a vector of numbers. So we represent names as sequences of one-hot vectors of length N, where N is the size of our alphabet.

We also need to fix a maximum string length M ahead of time. Names shorter than M are padded with a special character.
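
As a minimal sketch of this representation (not the codec used in this notebook), assume a lowercase alphabet plus a hypothetical padding character `$` and M = 5. The names `ALPHABET` and `to_onehot` are made up for illustration:

```python
import numpy as np

# Assumed for illustration: lowercase letters plus '$' as the padding character
ALPHABET = 'abcdefghijklmnopqrstuvwxyz$'
N = len(ALPHABET)   # alphabet size (27)
M = 5               # maximum name length

def to_onehot(name, alphabet=ALPHABET, maxlen=M):
    """Pad `name` to `maxlen` and return (indices, flat one-hot vector)."""
    if len(name) > maxlen:
        raise ValueError('name is longer than the maximum length')
    # Pad on the right with the last character of the alphabet
    padded = name.lower() + alphabet[-1] * (maxlen - len(name))
    indices = [alphabet.index(c) for c in padded]
    # Row i of the identity matrix is the one-hot vector for index i
    onehot = np.eye(len(alphabet))[indices].ravel()
    return indices, onehot

indices, onehot = to_onehot('deb')
print(indices)        # [3, 4, 1, 26, 26]
print(onehot.shape)   # (135,)
```

Each name becomes a vector of M × N values with exactly M ones, one per character position.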

To convert the raw names into this input format we use a codec, which encodes strings as vectors and decodes vectors back into strings.

Here is the encoder used in the original code.

The class ShortTextCodec is the encoder: it maps every character of a word to its index in the alphabet and pads the word up to the maximum length we set previously. For example, with a maximum length of 5, the word "deb" is encoded as "[3 4 1 26 26]". Indices are zero-based (a = 0), and 26 is the index of the padding (filler) character, which is '$' by default.

In [6]:
import random

import numpy as np
from scipy.sparse import issparse


class NonEncodableTextException(Exception):
    
    def __init__(self, reason=None, *args):
        self.reason = reason
        super(NonEncodableTextException, self).__init__(*args)

class ShortTextCodec(object):
    # TODO: problematic if this char appears in the training text
    FILLER = '$' 

    # If a one-hot vector can't be decoded meaningfully, render this char in its place
    MYSTERY = '?'

    # Backward-compatibility. Was probably a mistake to have FILLER be a class var rather than instance
    @property
    def filler(self):
        if self.__class__.FILLER in self.alphabet:
            return self.__class__.FILLER
        # Old versions of this class used ' ' as filler
        return ' '

    def __init__(self, extra_chars, maxlength, minlength=0, preserve_case=False, leftpad=False):
        assert 0 <= minlength <= maxlength
        if self.FILLER not in extra_chars and maxlength != minlength:
            extra_chars = self.FILLER + extra_chars
        self.maxlen = maxlength
        self.minlen = minlength
        self.char_lookup = {}
        self.leftpad_ = leftpad
        self.alphabet = ''
        for i, o in enumerate(range(ord('a'), ord('z') + 1)):
            self.char_lookup[chr(o)] = i
            self.alphabet += chr(o)
        nextidx = len(self.alphabet)
        for i, o in enumerate(range(ord('A'), ord('Z') + 1)):
            if preserve_case:
                self.char_lookup[chr(o)] = nextidx
                nextidx += 1
                self.alphabet += chr(o)
            else:
                self.char_lookup[chr(o)] = i

        offset = len(self.alphabet)
        for i, extra in enumerate(extra_chars):
            self.char_lookup[extra] = i + offset
            self.alphabet += extra

    def debug_description(self):
        return ' '.join('{}={}'.format(attr, repr(getattr(self, attr, None))) for attr in ['maxlen', 'minlen', 'leftpad', 'alphabet', 'nchars'])

    @property
    def leftpad(self):
        return getattr(self, 'leftpad_', False)

    @property
    def nchars(self):
        return len(self.alphabet)

    @property
    def non_special_char_alphabet(self):
        return ''.join(c for c in self.alphabet if (c != ' ' and c != self.FILLER)) 

    def _encode(self, s, padlen):
        if len(s) > padlen:
            raise NonEncodableTextException(reason='toolong')
        padding = [self.char_lookup[self.filler] for _ in range(padlen - len(s))]
        try:
            payload = [self.char_lookup[c] for c in s]
        except KeyError:
            raise NonEncodableTextException(reason='illegal_char')
        if self.leftpad:
            return padding + payload
        else:
            return payload + padding


    def encode(self, s, mutagen=None):
        if len(s) > self.maxlen: 
            raise NonEncodableTextException(reason='toolong')
        elif (hasattr(self, 'minlen') and len(s) < self.minlen):
            raise NonEncodableTextException(reason='tooshort')
        if mutagen:
            s = mutagen(s)
        return self._encode(s, self.maxlen)

    def encode_onehot(self, s):
        indices = self.encode(s)
        return np.eye(self.nchars)[indices].ravel()

    def decode(self, vec, pretty=False, strict=True):
        # TODO: Whether we should use 'strict' mode depends on whether the model
        # we got this vector from does softmax sampling of visibles. Anywhere this
        # is called on fantasy samples, we should use the model to set this param.
        if issparse(vec):
            vec = vec.toarray().reshape(-1)
        assert vec.shape == (self.nchars * self.maxlen,)
        chars = []
        for position_index in range(self.maxlen):
            # Hack - insert a tab between name parts in binomial mode
            if isinstance(self, BinomialShortTextCodec) and pretty and position_index == self.maxlen/2:
                chars.append('\t')
            subarr = vec[position_index * self.nchars:(position_index + 1) * self.nchars]
            if np.count_nonzero(subarr) != 1 and strict:
                char = self.MYSTERY
            else:
                char_index = np.argmax(subarr)
                char = self.alphabet[char_index]
                if pretty and char == self.FILLER:
                    # Hack
                    char = ' ' if isinstance(self, BinomialShortTextCodec) else ''
            chars.append(char)
        return ''.join(chars)

    def shape(self):
        """The shape of a set of RBM inputs given this codecs configuration."""
        return (self.maxlen, len(self.alphabet))

    def mutagen_nudge(self, s):
        # Mutate a single character chosen uniformly at random.
        # If s is shorter than the max length, include an extra virtual character at the end
        i = random.randint(0, min(len(s), self.maxlen-1))
        def roll(forbidden):
            newchar = random.choice(self.alphabet)
            while newchar in forbidden:
                newchar = random.choice(self.alphabet)
            return newchar
                
        if i == len(s):
            return s + roll(self.FILLER + ' ')
        if i == len(s)-1:
            replacement = roll(' ' + s[-1])
            if replacement == self.FILLER:
                return s[:-1]
            # Reuse the character already rolled; rolling again could yield FILLER
            return s[:-1] + replacement
        else:
            return s[:i] + roll(s[i] + self.FILLER) + s[i+1:]


    def mutagen_silhouettes(self, s):
        newchars = []
        for char in s:
            if char == ' ':
                newchars.append(char)
            else:
                newchars.append(random.choice(self.non_special_char_alphabet))
        return ''.join(newchars)
        
    def mutagen_noise(self, s):
        return ''.join(random.choice(self.alphabet) for _ in range(self.maxlen))
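
The decoding step of the codec can be isolated as a small standalone sketch. It mirrors the idea of `decode` (one argmax per character position, with a placeholder when a position is not a valid one-hot block), but `decode_onehot` and its parameters are hypothetical names, not part of the original class:

```python
import numpy as np

def decode_onehot(vec, alphabet, maxlen, strict=True, mystery='?'):
    """Turn a flat vector of maxlen * len(alphabet) values back into a string."""
    n = len(alphabet)
    vec = np.asarray(vec)
    assert vec.shape == (n * maxlen,)
    chars = []
    for pos in range(maxlen):
        block = vec[pos * n:(pos + 1) * n]
        if strict and np.count_nonzero(block) != 1:
            # Not a valid one-hot block: zero or several active units
            chars.append(mystery)
        else:
            chars.append(alphabet[int(np.argmax(block))])
    return ''.join(chars)

alphabet = 'abcdefghijklmnopqrstuvwxyz$'
vec = np.eye(len(alphabet))[[3, 4, 1, 26, 26]].ravel()
print(decode_onehot(vec, alphabet, 5))                 # deb$$

vec[0:27] = 0                                          # blank out the first position
print(decode_onehot(vec, alphabet, 5))                 # ?eb$$
print(decode_onehot(vec, alphabet, 5, strict=False))   # aeb$$
```

In non-strict mode an all-zero block falls back to index 0, which is why the last call prints `a` in the first position.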

This is an example of the encoder defined above in use. It also shows scikit-learn's OneHotEncoder, which turns the integer-encoded vectors into the one-hot matrix described earlier.

In [7]:
from collections import Counter
from sklearn.preprocessing import OneHotEncoder
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import train_test_split

# No extra characters, names of at most 6 characters
codec = ShortTextCodec('', 6)
skipped = Counter()
vecs = []
with open("./names.txt") as f:
    for line in f:
        line = line.strip()
        try:
            vecs.append(codec.encode(line))
        except NonEncodableTextException as e:
            # Too long, or illegal characters
            skipped[e.reason] += 1
vecs = np.asarray(vecs)
# The positional argument is the legacy n_values parameter of older
# scikit-learn releases; newer releases take categories= instead
vecsOneHot = OneHotEncoder(len(codec.alphabet)).fit_transform(vecs)

## OWN IMPLEMENTATION

Now we are going to implement a similar encoder that produces the same result.

In [8]:
import numpy as np

def OneHotVector(data, alphabet, maxLength=10):
    """Encode each space-separated word of `data` as a flat one-hot vector."""
    char_to_int = {c: i for i, c in enumerate(alphabet)}
    # Padding index: the last character of the alphabet (' ', index 26 here)
    pad_index = len(alphabet) - 1
    vecs = []
    for word in data.split(' '):
        integer_encoded = [char_to_int[char] for char in word]
        # Pad short words up to maxLength
        integer_encoded += [pad_index] * (maxLength - len(integer_encoded))
        row = []
        for value in integer_encoded:
            letter = [0] * len(alphabet)
            letter[value] = 1
            row.extend(letter)
        vecs.append(row)
    return np.asarray(vecs, dtype=float)

def IntegerEncoded(data, alphabet, maxLength=10):
    """Encode each space-separated word of `data` as a padded vector of indices."""
    char_to_int = {c: i for i, c in enumerate(alphabet)}
    pad_index = len(alphabet) - 1
    vecs = []
    for word in data.split(' '):
        integer_encoded = [char_to_int[char] for char in word]
        integer_encoded += [pad_index] * (maxLength - len(integer_encoded))
        vecs.append(integer_encoded)
    return np.asarray(vecs, dtype=float)

def ConverToWords(vec, alphabet='abcdefghijklmnopqrstuvwxyz '):
    """Decode flat one-hot vectors into lists of characters, dropping the padding."""
    pad_index = len(alphabet) - 1
    finalWord = []
    for row in vec:
        chars = []
        for i, value in enumerate(row):
            if value == 1:
                # Position inside the current character block
                index = i % len(alphabet)
                if index != pad_index:
                    chars.append(alphabet[index])
        finalWord.append(chars)
    return finalWord

#### Version without OneHotEncoder

In [9]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
data = "test encoder"

### Data "test encoder" converted to oneHotVector
vec = OneHotVector(data,alphabet,7)
print(vec)
### Recover the words from the oneHotVector
finalWord = ConverToWords(vec)
print(finalWord)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.

#### Version with OneHotEncoder

In [10]:
alphabet = 'abcdefghijklmnopqrstuvwxyz '
data = "test encoder"
vec = IntegerEncoded(data, alphabet, 7)
oneHotEncoder = OneHotEncoder(len(alphabet), sparse=False).fit(vec)

### Data to oneHotVector
vecOneHot = oneHotEncoder.transform(vec)
print(vecOneHot)
### Convert oneHotVector to words
words = "final test"
newVec = IntegerEncoded(words,alphabet,7)
newOneHotVector = oneHotEncoder.transform(newVec)
finalWord = ConverToWords(newOneHotVector)
print(finalWord)

[[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0.
  0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.