---
layout: post
title:  "Neural Network Encoding - A Super Basic Example"
desc: "Using python, Keras and some colours to illustrate encoding as simply as possible"
date: ###DATE
categories: [tutorial]
tags: [statistics]
loc: ###LOC
permalink: ###LINK 
redirect_from: "/encoding_colours"

---

The goal is to:

1. Start with some colours
2. Create a dataset similar to the ones Word2Vec and related models are trained on
3. Create a model with an embedding layer and train it
4. Visualise the embedding layer

Let's get the dataset: https://www.kaggle.com/ravikanth/colour-name-and-rgb-codes

In [1]:
import pandas as pd
import numpy as np

df_original = pd.read_csv("encoding_colours/colours.csv")
df_original = df_original.dropna(subset=["Color Name"])
num_colours = df_original.shape[0]
print(f"We have {num_colours} colours")
df_original.sample(10)

We have 646 colours


|  | Color Name | Credits | R;G;B Dec | RGB Hex | CSS Hex | BG/FG color sample |
| --- | --- | --- | --- | --- | --- | --- |
| 627 | goldenrod | X | 218;165;32 | DAA520 |  | ### SAMPLE ### |
| 522 | maroon2 | X | 238;48;167 | EE30A7 |  | ### SAMPLE ### |
| 241 | Medium Slate Blue | N | 127;0;255 | 7F00FF |  | ### SAMPLE ### |
| 350 | khaki1 | X | 255;246;143 | FFF68F |  | ### SAMPLE ### |
| 188 | SkyBlue1 | X | 135;206;255 | 87CEFF |  | ### SAMPLE ### |
| 154 | LightCyan2 | X | 209;238;238 | D1EEEE |  | ### SAMPLE ### |
| 232 | turquoise3 | X | 0;197;205 | 00C5CD |  | ### SAMPLE ### |
| 388 | coral | N | 255;127;0 | FF7F00 |  | ### SAMPLE ### |
| 122 | Free Speech Grey | F | 99;86;136 | 635688 |  | ### SAMPLE ### |
| 248 | Sky Blue | N | 50;153;204 | 3299CC |  | ### SAMPLE ### |


The columns we care about are the name and the RGB Dec values. Let's reformat this into R, G, B columns normalised to the range 0 to 1.

In [2]:
df = df_original.loc[:, ["Color Name", "R;G;B Dec"]]
df[["r", "g", "b"]] = df["R;G;B Dec"].str.split(";", expand=True).astype(int) / 255
df = df.drop(columns="R;G;B Dec")
df = df.rename(columns={"Color Name": "name"})
df = df.reset_index(drop=True)
df.sample(10)

|  | name | r | g | b |
| --- | --- | --- | --- | --- |
| 120 | CadetBlue | 0.372549 | 0.619608 | 0.627451 |
| 199 | azure1 | 0.941176 | 1.0 | 1.0 |
| 552 | NavajoWhite | 1.0 | 0.870588 | 0.678431 |
| 163 | MediumTurquoise | 0.282353 | 0.819608 | 0.8 |
| 258 | burlywood2 | 0.933333 | 0.772549 | 0.568627 |
| 510 | maroon3 | 0.803922 | 0.160784 | 0.564706 |
| 554 | NavajoWhite2 | 0.933333 | 0.811765 | 0.631373 |
| 392 | salmon | 0.980392 | 0.501961 | 0.447059 |
| 543 | Violet | 0.309804 | 0.184314 | 0.309804 |
| 348 | Medium Spring Green | 0.498039 | 1.0 | 0.0 |
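
To make the reformatting step concrete, here is a minimal plain-Python sketch of what happens to a single `R;G;B Dec` string. The helper name `parse_rgb` is made up for illustration; the post itself does this for the whole column at once with pandas.

```python
def parse_rgb(text):
    """Turn a string like "218;165;32" into (r, g, b) floats in the range 0..1."""
    return tuple(int(part) / 255 for part in text.split(";"))

# The "goldenrod" row from the sample shown earlier in the post.
r, g, b = parse_rgb("218;165;32")
print(round(r, 6), round(g, 6), round(b, 6))
```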


Now there's just one more issue: you don't pass strings or text into a neural network. You pass in numbers. So let's encode our colours as integers to give them a numeric representation. We *could* use the Keras preprocessing `one_hot` here... but we've got this nice dataframe which already has an index... so we'll use that, and I'll make it explicit and add it as a column. (Strictly speaking this is an integer index rather than a one-hot vector, which is the input the Keras `Embedding` layer expects.)


In [3]:
df["num"] = df.index
df

|  | name | r | g | b | num |
| --- | --- | --- | --- | --- | --- |
| 0 | Grey | 0.329412 | 0.329412 | 0.329412 | 0 |
| 1 | Grey, Silver | 0.752941 | 0.752941 | 0.752941 | 1 |
| 2 | grey | 0.745098 | 0.745098 | 0.745098 | 2 |
| 3 | LightGray | 0.827451 | 0.827451 | 0.827451 | 3 |
| 4 | LightSlateGrey | 0.466667 | 0.533333 | 0.600000 | 4 |
| ... | ... | ... | ... | ... | ... |
| 641 | gold | 0.803922 | 0.498039 | 0.196078 | 641 |
| 642 | silver | 0.901961 | 0.909804 | 0.980392 | 642 |
| 643 | Silver, Grey | 0.752941 | 0.752941 | 0.752941 | 643 |
| 644 | Light Steel Blue | 0.329412 | 0.329412 | 0.329412 | 644 |
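
Why is a plain integer enough? Looking up row `i` of an embedding matrix gives the same result as multiplying a one-hot vector by that matrix. Here is a minimal pure-Python sketch of that equivalence. The 4-colour, 2-dimensional table and the helper names are invented for illustration; they are not the trained weights or any Keras API.

```python
# Hypothetical embedding table: 4 colours, 2 dimensions each (made-up values).
embedding = [
    [0.1, 0.9],
    [0.4, 0.2],
    [0.7, 0.5],
    [0.3, 0.8],
]

def one_hot(index, size):
    """Return a list of zeros with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vector_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix using plain Python lists."""
    n_cols = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_cols)]

num = 2  # the integer stored in the "num" column for some colour
by_lookup = embedding[num]
by_one_hot = vector_times_matrix(one_hot(num, len(embedding)), embedding)
print(by_lookup, by_one_hot)
```

Both routes pick out the same row, so the integer index is just a compact way of writing the one-hot vector.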


Hurray! So this is our actual starting point. 

Let's generate a bunch of colour pairs, to simulate pairs of words or things that are grouped together (like colour palettes).

In [4]:
n = 100000 # Num samples
colour_1 = df.sample(n=n, replace=True, random_state=0).reset_index(drop=True)
colour_2 = df.sample(n=n, replace=True, random_state=42).reset_index(drop=True)
print(colour_1.shape, colour_2.shape)

(100000, 5) (100000, 5)


Now we merge them and compute a similarity metric. For words appearing together, a simple way of doing this is to set the metric to 1 for actual text and 0 for text generated by random sampling (aka gobbledegook), which makes it trivial to generate as much 'bad text' as you want. For colours, we instead use the mean squared difference between the two RGB values, so a smaller `diff` means more similar colours.

In [5]:
c = colour_1.merge(colour_2, left_index=True, right_index=True)
c["diff"] = ((c.r_x - c.r_y)**2 + (c.g_x - c.g_y)**2 + (c.b_x - c.b_y)**2) / 3
c = c.drop(columns=["r_x", "r_y", "g_x", "g_y", "b_x", "b_y"])
c

|  | name_x | num_x | name_y | num_y | diff |
| --- | --- | --- | --- | --- | --- |
| 0 | gainsboro | 559 | grey91 | 102 | 0.002215 |
| 1 | Goldenrod | 629 | OrangeRed4 | 435 | 0.266913 |
| 2 | SteelBlue4 | 192 | tan2 | 270 | 0.210832 |
| 3 | DarkSalmon | 359 | grey95 | 106 | 0.117621 |
| 4 | SlateGray4 | 9 | grey60 | 71 | 0.015999 |
| ... | ... | ... | ... | ... | ... |
| 99995 | grey65 | 76 | Very Dark Brown | 281 | 0.149199 |
| 99996 | SkyBlue2 | 180 | DarkOrchid2 | 479 | 0.105908 |
| 99997 | LemonChiffon2 | 592 | purple1 | 525 | 0.231757 |
| 99998 | cornsilk1 | 609 | grey74 | 85 | 0.045101 |
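
As a worked example of the `diff` column, here is the same formula applied by hand to two colours from the earlier sample table. This is a plain-Python sketch; `colour_diff` is a hypothetical helper name, not something defined in the notebook.

```python
def colour_diff(rgb_a, rgb_b):
    """Mean squared difference between two RGB triples scaled to 0..1."""
    return sum((a - b) ** 2 for a, b in zip(rgb_a, rgb_b)) / 3

# Values copied from the normalised sample table earlier in the post.
cadet_blue = (0.372549, 0.619608, 0.627451)
azure1 = (0.941176, 1.0, 1.0)

print(round(colour_diff(cadet_blue, azure1), 4))
print(colour_diff(azure1, azure1))  # identical colours give 0.0
```

Because each channel lies between 0 and 1, the metric itself runs from 0 (identical colours) to 1 (black against white).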


Finally, let's try to get this closer to a bag of words. In those models, you just have words appearing together or not. So what we can do is create a 0/1 label based on the difference: the smaller the difference, the more likely a pair is labelled 1.

In [6]:
c["predict"] = (c["diff"] < 0.2 * np.random.random(c.shape[0]) ** 2).astype(int)
c.sample(20)

|  | name_x | num_x | name_y | num_y | diff | predict |
| --- | --- | --- | --- | --- | --- | --- |
| 53682 | LightPink4 | 424 | LightSteelBlue1 | 156 | 0.228553 | 0 |
| 82453 | thistle3 | 532 | grey74 | 85 | 0.002953 | 1 |
| 42488 | Midnight Blue | 231 | LightGoldenrod3 | 598 | 0.23838 | 0 |
| 4259 | RosyBrown1 | 242 | Dark Purple | 536 | 0.235668 | 0 |
| 48752 | purple1 | 525 | bisque4 | 375 | 0.143991 | 0 |
| 93519 | violet | 534 | LightPink4 | 424 | 0.152736 | 0 |
| 86315 | LightSalmon3 | 364 | pink4 | 455 | 0.027456 | 1 |
| 42436 | goldenrod2 | 615 | Rich Blue | 235 | 0.252472 | 0 |
| 17258 | brown | 249 | firebrick3 | 449 | 0.008366 | 1 |
| 57618 | Silver, Grey | 643 | cyan3 | 213 | 0.190706 | 0 |


In [7]:
c.predict.mean()

0.22915
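
The labelling rule `diff < 0.2 * U**2`, with `U` uniform on [0, 1), is random, so it helps to see how likely a pair with a given `diff` is to get a 1. Rearranging gives a probability of `1 - sqrt(diff / 0.2)` when `diff < 0.2`, and 0 otherwise. The sketch below checks that derivation against a quick simulation using only the standard library. The function names are made up for illustration, and the simulation uses Python's `random` module rather than NumPy.

```python
import math
import random

def p_similar(diff, scale=0.2):
    """Probability that `diff < scale * U**2` for U uniform on [0, 1)."""
    if diff >= scale:
        return 0.0
    return 1.0 - math.sqrt(diff / scale)

def simulate(diff, trials=100_000, scale=0.2, seed=0):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(diff < scale * rng.random() ** 2 for _ in range(trials))
    return hits / trials

for d in (0.002, 0.05, 0.15, 0.25):
    print(d, round(p_similar(d), 3), round(simulate(d), 3))
```

So very close pairs are almost always labelled 1, the chance falls off quickly as `diff` grows, and anything with `diff` of 0.2 or more is always labelled 0.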

Fantastic, training and test data done. Let's make a Keras model.

In [8]:
from tensorflow import keras
from tensorflow.keras.layers import Embedding, Dense, Lambda, Input, Subtract
import tensorflow.keras.backend as K
from tensorflow.keras.callbacks import LambdaCallback

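def sum_dist(x):
    # x has shape (batch, 2, embedding_dims): one embedding per colour in the pair
    n = K.permute_dimensions(x, pattern=(1, 0, 2))  # -> (2, batch, embedding_dims)
    a, b = n[0], n[1]
    # Squared Euclidean distance between the two embeddings, shape (batch, 1)
    return K.sum((a - b)**2, axis=-1, keepdims=True)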

def get_model():
    embedding_dims = 2
    model = keras.Sequential()
    model.add(Embedding(num_colours, embedding_dims, input_length=2))
    model.add(Lambda(sum_dist, output_shape=(1,)))
    model.add(Dense(1, activation="sigmoid"))
    print(model.summary())
    model.compile(loss='binary_crossentropy', optimizer="adam", metrics=["mse"])
    return model
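
To see what this model computes for one pair, here is a plain-Python sketch of the forward pass: the squared distance between the two embeddings, fed through a single-weight sigmoid unit. The weight, bias and embedding values are invented for illustration. In the real model they are learned, and presumably the weight ends up negative so that a larger distance gives a lower probability.

```python
import math

def squared_distance(vec_a, vec_b):
    """Same quantity as `sum_dist`: summed squared difference of two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(vec_a, vec_b))

def predict_pair(vec_a, vec_b, weight, bias):
    """A Dense(1, sigmoid) unit applied to the squared distance."""
    z = weight * squared_distance(vec_a, vec_b) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters and embeddings (not taken from the trained model).
weight, bias = -4.0, 2.0
near = predict_pair([0.1, 0.2], [0.15, 0.25], weight, bias)
far = predict_pair([0.1, 0.2], [1.5, -1.0], weight, bias)
print(round(near, 3), round(far, 3))
```

The only way the network can push the output towards 1 for similar colours is to place their embeddings close together, which is what makes the embedding layer worth plotting.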

In [9]:
# Store a copy of the embedding weights after every epoch so we can animate them later
weights = []
save = LambdaCallback(on_epoch_end=lambda epoch, logs: weights.append(model.layers[0].get_weights()[0]))

model = get_model()
X, y = c[["num_x", "num_y"]], c["predict"]
model.fit(X, y, epochs=500, verbose=0, batch_size=512, callbacks=[save])
#model.fit([X.num_x, X.num_y], y, epochs=100, verbose=1, batch_size=256)

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
embedding (Embedding)        (None, 2, 2)              1292      
_________________________________________________________________
lambda (Lambda)              (None, 1)                 0         
_________________________________________________________________
dense (Dense)                (None, 1)                 2         
Total params: 1,294
Trainable params: 1,294
Non-trainable params: 0
_________________________________________________________________
None


<tensorflow.python.keras.callbacks.History at 0x1b4873c2088>

In [10]:
%%capture
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation
from IPython.display import HTML

fig, ax = plt.subplots()
ln, = plt.plot([], [], 'ro')
cs = df[["r", "g", "b"]].to_numpy()
scat = ax.scatter(weights[0][:, 0], weights[0][:, 1], color=cs, s=5);

def init():
    ax.set_xlim(weights[-1][:,0].min(), weights[-1][:,0].max())
    ax.set_ylim(weights[-1][:,1].min(), weights[-1][:,1].max())
    return scat,
def update(i):
    scat.set_offsets(weights[i])
    return scat,

n_weight = len(weights) - 1
n_frames = 30 * 6
power = 2
frames = (np.linspace(1, n_weight**(1 / power), n_frames)**power).astype(int)
frames = pd.unique(frames)
ani = FuncAnimation(fig, update, frames=frames, init_func=init, blit=True, interval=1000/30);

In [11]:
HTML(ani.to_html5_video())
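
A note on the `frames` expression in the animation cell: it spaces the plotted epochs evenly in the square root of the epoch number and then squares them, so early epochs are sampled densely and later ones sparsely. The sketch below mirrors that expression in plain Python rather than NumPy. `frame_schedule` is a hypothetical helper, and 499 assumes the 500 training epochs used above.

```python
def frame_schedule(n_weight, n_frames, power=2):
    """Epoch indices spaced evenly in epoch**(1/power), then mapped back."""
    root = n_weight ** (1 / power)
    step = (root - 1) / (n_frames - 1)
    raw = [int((1 + k * step) ** power) for k in range(n_frames)]
    return list(dict.fromkeys(raw))  # drop duplicates, keep order

frames = frame_schedule(n_weight=499, n_frames=180)
print(frames[:8], frames[-4:])
```

The gaps between consecutive frames grow towards the end of training, which spends most of the animation on the early epochs.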