<a href="https://colab.research.google.com/github/JungMYEONG-jin/Stats_Project/blob/window/OneHotEncoding.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import tensorflow

# One-hot encoding of words or characters

This notebook contains the first code sample found in Chapter 6, Section 1 of [Deep Learning with Python](https://www.manning.com/books/deep-learning-with-python?a_aid=keras&a_bid=76564dff). Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.

----

One-hot encoding is the most common, most basic way to turn a token into a vector. You already saw it in action in our initial IMDB and 
Reuters examples from chapter 3 (done with words, in our case). It consists of associating a unique integer index with every word, then 
turning this integer index i into a binary vector of size N, the size of the vocabulary, that is all zeros except for the i-th 
entry, which is 1.
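
As a minimal sketch of that definition, an integer index i can be turned into a vector of size N as follows. The helper name `one_hot` and the toy values are assumptions made for illustration; they do not come from the book.

```python
def one_hot(i, n):
    """Return a length-n list that is all zeros except for a 1.0 at position i."""
    if not 0 <= i < n:
        raise ValueError(f"index {i} is outside a vocabulary of size {n}")
    return [1.0 if k == i else 0.0 for k in range(n)]


# Hypothetical vocabulary of size 5; encode the token whose index is 2.
vector = one_hot(2, 5)
print(vector)  # [0.0, 0.0, 1.0, 0.0, 0.0]
```

The vector length grows with the vocabulary, which is why large vocabularies make this representation expensive.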

Of course, one-hot encoding can be done at the character level as well. To unambiguously drive home what one-hot encoding is and how to 
implement it, here are two toy examples of one-hot encoding: one for words, the other for characters.

In [4]:
import numpy as np


samples = ["The cat sat on the mat.", "The dog ate my homework"]

token_index = {}


In [5]:
for i in samples:
  for word in i.split():
    if word not in token_index:
      token_index[word] = len(token_index)+1
      # Store each distinct word only once (duplicates are skipped).

token_index

{'The': 1,
 'ate': 8,
 'cat': 2,
 'dog': 7,
 'homework': 10,
 'mat.': 6,
 'my': 9,
 'on': 4,
 'sat': 3,
 'the': 5}

In [6]:
# Vectorize our samples.
# Only the first max_len words of each sample are considered.

max_len = 10

# Shape: (samples, max_len, vocabulary size + 1); index 0 is never assigned to a word.
results = np.zeros((len(samples), max_len, max(token_index.values())+1))

for i, j in enumerate(samples):
  for k, word in list(enumerate(j.split()))[:max_len]:
    index = token_index.get(word)
    results[i,k,index] = 1.

In [7]:
results
# The cat sat on the mat.
# The dog ate my homework

array([[[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]],

       [[0., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]])

In [9]:
import string

characters = string.printable

characters

'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'

In [10]:
token_index = dict(zip(characters, range(1, len(characters)+1)))

max_len = 50
results = np.zeros((len(samples), max_len, max(token_index.values())+1))

In [12]:
for i, j in enumerate(samples):
  for k, character in enumerate(j[:max_len]):
    index = token_index.get(character)
    results[i,k,index] = 1.

In [13]:
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
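
Because the printed array is mostly zeros, it is hard to see that the encoding kept the text. The sketch below encodes one sample at the character level and then decodes it again. It uses plain Python lists instead of NumPy so that it runs without dependencies; the function names `encode_chars` and `decode_chars` are hypothetical helpers, not part of the original notebook.

```python
import string

# Same character vocabulary as above; index 0 is left unused (no character).
characters = string.printable
token_index = {ch: i for i, ch in enumerate(characters, start=1)}
index_token = {i: ch for ch, i in token_index.items()}


def encode_chars(text, max_len):
    """One-hot encode the first max_len characters of text as a list of rows."""
    width = len(token_index) + 1
    rows = [[0.0] * width for _ in range(max_len)]
    for k, ch in enumerate(text[:max_len]):
        # Raises KeyError for characters outside string.printable.
        rows[k][token_index[ch]] = 1.0
    return rows


def decode_chars(rows):
    """Recover the text: each non-padding row holds a single 1.0 at the character's index."""
    chars = []
    for row in rows:
        if 1.0 in row:  # all-zero rows are padding beyond the end of the text
            chars.append(index_token[row.index(1.0)])
    return "".join(chars)


sample = "The cat sat on the mat."
encoded = encode_chars(sample, max_len=50)
decoded = decode_chars(encoded)
print(decoded == sample)  # True
```

Rows past the end of the sample stay all-zero, which is what the truncated printout of `results` mostly shows.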

Note that Keras has built-in utilities for one-hot encoding text at the word level or character level, starting from raw text data. 
This is what you should actually be using, as it takes care of a number of important features, such as stripping special characters 
from strings, or only taking into account the top N most common words in your dataset (a common restriction to avoid dealing with very large input 
vector spaces).

In [14]:
from tensorflow.keras.preprocessing.text import Tokenizer



In [15]:
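# Create a tokenizer configured to only take into account the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)
# Build the word index from the samples.
tokenizer.fit_on_texts(samples)

# Turn each string into a list of integer indices.
seq = tokenizer.texts_to_sequences(samples)

# Get the one-hot binary representations directly.
oh_res = tokenizer.texts_to_matrix(samples, mode="binary")

# Recover the word index that was computed.
word_index = tokenizer.word_index
print("%s unique tokens" %len(word_index))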

9 unique tokens
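
The tokenizer reports 9 unique tokens, whereas the hand-built index earlier had 10 entries. The difference comes from the preprocessing mentioned above: lowercasing merges "The" and "the", and stripping punctuation turns "mat." into "mat". The following pure-Python sketch imitates that behaviour under simplifying assumptions. It is an approximation for illustration, not Keras's implementation, and `simple_tokenize` and `build_word_index` are hypothetical names.

```python
import string
from collections import Counter

samples = ["The cat sat on the mat.", "The dog ate my homework"]


def simple_tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.lower().translate(table).split()


def build_word_index(texts, num_words=None):
    """Index words from 1 upward, most frequent first, keeping at most num_words of them."""
    counts = Counter(word for text in texts for word in simple_tokenize(text))
    ranked = [word for word, _ in counts.most_common(num_words)]
    return {word: rank for rank, word in enumerate(ranked, start=1)}


word_index = build_word_index(samples)
print(len(word_index), "unique tokens")  # 9 unique tokens
print(word_index["the"])                 # 1, since "the" is the most frequent word
```

Passing a small `num_words` to `build_word_index` shows the effect of restricting the vocabulary to the most common words.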



A variant of one-hot encoding is the so-called "one-hot hashing trick", which can be used when the number of unique tokens in your 
vocabulary is too large to handle explicitly. Instead of explicitly assigning an index to each word and keeping a reference of these 
indices in a dictionary, one may hash words into vectors of fixed size. This is typically done with a very lightweight hashing function. 
The main advantage of this method is that it does away with maintaining an explicit word index, which 
saves memory and allows online encoding of the data (starting to generate token vectors right away, before having seen all of the available 
data). The one drawback of this method is that it is susceptible to "hash collisions": two different words may end up with the same hash, 
and subsequently any machine learning model looking at these hashes won't be able to tell the difference between these words. The likelihood 
of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed.

In [16]:
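# Store words as vectors of size 1000. With close to 1,000 words or more,
# hash collisions become frequent and reduce the accuracy of this encoding.
dim = 1000
max_len = 10

results = np.zeros((len(samples), max_len, dim))
for i, j in enumerate(samples):
  for k, word in list(enumerate(j.split()))[:max_len]:
    # Hash the word into an integer index between 0 and dim - 1.
    index = abs(hash(word)) % dim
    results[i, k, index] = 1.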


In [17]:
results

array([[[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]],

       [[0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        ...,
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.],
        [0., 0., 0., ..., 0., 0., 0.]]])
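
One practical caveat: Python's built-in `hash()` is salted per interpreter process for strings by default, so the indices computed above can change from one run to the next. The sketch below swaps in `zlib.crc32`, a lightweight checksum that is stable across runs, and lists which words share a bucket for a given hashing space. This is one possible choice of hash function rather than the book's method, and `stable_bucket` and `find_collisions` are hypothetical helper names.

```python
import zlib


def stable_bucket(word, dim):
    """Map a word to an index in [0, dim) using CRC32, which is stable across runs."""
    return zlib.crc32(word.encode("utf-8")) % dim


def find_collisions(words, dim):
    """Group distinct words by bucket and return only the buckets shared by several words."""
    buckets = {}
    for word in set(words):
        buckets.setdefault(stable_bucket(word, dim), []).append(word)
    return {b: sorted(ws) for b, ws in buckets.items() if len(ws) > 1}


words = "The cat sat on the mat. The dog ate my homework".split()

# With far fewer buckets than distinct words, collisions are unavoidable.
print(find_collisions(words, dim=4))
# With a much larger hashing space they become unlikely, though never impossible.
print(find_collisions(words, dim=1000))
```

Comparing the two printed dictionaries illustrates the trade-off described above: a larger `dim` costs memory per vector but lowers the chance that two words become indistinguishable.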