# Hypermapping Bastard GAN Punks

<center>
<img src="https://sparklewerk.com/projects/hypermaps/bastards_plot_2.png?uncache_please" alt="demo hypermap" />
</center>

<br/>

<center>
<img src="https://sparklewerk.com/images/brands/sparklewerk/sparklewerk_wordmark.png" alt="Sparklewerk wordmark" align="center" />
</center>
<br/>




## GPU detection

UMAP knows how to use server-side GPUs, and Colab provides one for free.

In the menubar, select `Runtime→Change Runtime Type`, then
select GPU from the Hardware Accelerator drop-down.


First, let's see if we have an Nvidia GPU.

In the following code cell, if the result is:
```
NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver.
```
then the Runtime is not set to GPU.

In [None]:
!nvidia-smi
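
To branch on the result in code instead of reading the output by eye, here is a minimal sketch. It assumes that `nvidia-smi` exits with a non-zero status when it cannot reach the driver. The helper name `nvidia_gpu_available` is made up for illustration.

```python
import shutil
import subprocess


def nvidia_gpu_available() -> bool:
    """Return True when `nvidia-smi` is on the PATH and exits successfully."""
    if shutil.which("nvidia-smi") is None:
        return False  # the tool is not installed at all
    try:
        result = subprocess.run(
            ["nvidia-smi"], capture_output=True, text=True, timeout=30
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    # Assumption: a failed driver handshake shows up as a non-zero exit code
    return result.returncode == 0


print("Nvidia GPU available:", nvidia_gpu_available())
```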

Not all GPUs on Colab are Nvidia models. So, the check above may not have found an Nvidia GPU, yet there may still be another manufacturer's GPU. Here is how to check:

In [None]:
%tensorflow_version 2.x
import tensorflow as tf
device_name = tf.test.gpu_device_name()
if device_name != '/device:GPU:0':
  raise SystemError('GPU device not found')
print('Found GPU at: {}'.format(device_name))

## Set up

### Constants

These two constants control how big the bastards will be resized to:
- `bastards_into_umap_size`: (int, int) for UMAP processing
- `bastards_into_projector_size`: (int, int) for Projector viz spritesheet



In [None]:
# But first some switches:
render_2d_umap_plot_image = False

#### UMAP input size

In [None]:
bastards_into_umap_width = bastards_into_umap_height = 48
bastards_into_umap_size = (bastards_into_umap_width, bastards_into_umap_height)

#### Projector sprite size
By experimentation, it seems that for `bastards_into_projector_size`, (64, 64) is a decent downsampling size:

- (24, 24) is too small to see much detail. It does work but meh…
- (48, 48) kinda works but meh
- (64, 64) is a round size, and a 100 x 100 spritesheet of them is (6400, 6400), which Projector accepts
- (96, 96) is too big for Projector

Although (96, 96) is a nice size (details show well), Projector refuses to accept a spritesheet that large. The largest it will accept is (8192, 8192), and a 100 x 100 grid of (96, 96) sprites for 10K bastards would be (9600, 9600), which is too big.
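
The arithmetic behind that list can be sketched as follows. This is only an illustration: the (8192, 8192) limit is the one quoted above, and the helper names `sprites_per_side` and `largest_sprite_side` are made up for this example.

```python
import math

PROJECTOR_MAX_SHEET_SIDE = 8192  # limit quoted in the text above


def sprites_per_side(image_count: int) -> int:
    """Smallest square grid side that holds `image_count` sprites (count >= 1)."""
    return math.isqrt(image_count - 1) + 1  # integer ceil(sqrt(image_count))


def largest_sprite_side(image_count: int,
                        max_sheet_side: int = PROJECTOR_MAX_SHEET_SIDE) -> int:
    """Largest square sprite, in pixels, that keeps the sheet within the limit."""
    return max_sheet_side // sprites_per_side(image_count)


# Check each candidate size against a 100 x 100 grid of 10,000 sprites
for sprite_side in (24, 48, 64, 96):
    sheet_side = sprite_side * sprites_per_side(10_000)
    verdict = "fits" if sheet_side <= PROJECTOR_MAX_SHEET_SIDE else "too big"
    print(f"({sprite_side}, {sprite_side}) -> sheet ({sheet_side}, {sheet_side}): {verdict}")

print("Largest sprite side for 10,000 sprites:", largest_sprite_side(10_000))
```

Under these assumptions, anything up to 81 pixels per side would still fit for 10,000 sprites, so 64 leaves some headroom.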


In [None]:
bastards_into_projector_witdth = bastards_into_projector_height = 64
bastards_into_projector_size = (bastards_into_projector_witdth, bastards_into_projector_height)

### Installs


In [None]:
# TODO: there are ways to detect if umap-learn has already been installed. 
# Would make this a wee faster on repeat runs.
!pip install umap-learn
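
One possible way to address that TODO is to ask Python whether the module can already be imported before calling pip. This is a sketch, not the notebook's method: `is_installed` and `ensure_installed` are hypothetical helper names, and the check only looks for an importable module called `umap`, so it cannot tell `umap-learn` apart from any other package that happens to provide a module of that name.

```python
import importlib.util
import subprocess
import sys


def is_installed(module_name: str) -> bool:
    """Return True when a top-level module can be imported in this environment."""
    return importlib.util.find_spec(module_name) is not None


def ensure_installed(module_name: str, pip_name: str) -> None:
    """Run `pip install pip_name` only if `module_name` is missing."""
    if not is_installed(module_name):
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_name])


# In this notebook the call would be:
#   ensure_installed("umap", "umap-learn")
print("os importable:", is_installed("os"))
```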

### Imports

In [None]:
%matplotlib inline
import numpy as np
import os
import pandas as pd
import umap

from PIL import Image, ImageDraw
from math import trunc
from matplotlib import pyplot as plt
from numpy import asarray
from packaging import version
from skimage.color import rgb2gray

In [None]:
# TODO: is this needed these days?
%matplotlib inline

### TensorFlow set up

In [None]:
try:
  # This tensorflow_version is a Colab-only thing
  %tensorflow_version 2.x
except Exception:
  pass

import tensorflow as tf
print("TensorFlow version: ", tf.__version__)

assert version.parse(tf.__version__).release[0] >= 2, \
    "This notebook requires TensorFlow 2.0 or above."

from tensorboard.plugins import projector
%load_ext tensorboard

# TODO: print tensorboard version?

tensorboard_data_dump_dir = '/content/tensorboard_data'
os.makedirs(tensorboard_data_dump_dir, exist_ok=True)

## Data wrangling

The full bastards image collection can be found in the allbastards.com repo on GitHub.


### Inspect data

First let's make sure we're parsing the data correctly.

In [None]:
# If repo has already been cloned, doing so again will error so let's not
if not os.path.isdir('allbastards.com'):
  !git clone https://github.com/rkalis/allbastards.com.git

In [None]:
!ls allbastards.com/public/img/full | wc -l

In [None]:
path, dirs, files = next(os.walk("allbastards.com/public/img/full"))
file_count = len(files)
print(file_count)
print(f'files[0] = "{files[0]}"')
print(f'Trimmed  = "{files[0][:-5]}"')
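
The `[:-5]` slice works because every file name ends in `.webp`, which is five characters long. A sketch that does not depend on the extension length is shown here; the helper name `bastard_id_from_filename` is made up for illustration.

```python
import os


def bastard_id_from_filename(filename: str) -> int:
    """Extract the numeric id from a name such as '42.webp'."""
    stem, _extension = os.path.splitext(os.path.basename(filename))
    return int(stem)  # raises ValueError if the stem is not a number


print(bastard_id_from_filename("42.webp"))
print(bastard_id_from_filename("allbastards.com/public/img/full/10001.webp"))
```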

In [None]:
def get_bastard_by_id(an_id: int):
  a_bastard = Image.open(os.path.join("allbastards.com/public/img/full",f'{an_id}.webp'))
  return a_bastard

The calmAF bastards (read: static webp files) are all (1024, 1024).

In [None]:
%%time
calm_ids = []
hyped_ids = []

for file in files:
    a_bastard = Image.open(os.path.join("allbastards.com/public/img/full",file))
    if a_bastard.is_animated:
      hyped_ids.append(int(file[:-5]))
    else:
      calm_ids.append(int(file[:-5]))
                
print(f'calms: {len(calm_ids)}')
print(f'hypes: {len(hyped_ids)}')

Let's use Punk #42 as our poster child. And isn't he just a handsome boy?

In [None]:
a_bastard_filename = os.path.join('allbastards.com/public/img/full', '42.webp')
a_bastard = Image.open(a_bastard_filename) 

width, height = a_bastard.size
print('size = (', width, ',', height, ')')
print('format = ', a_bastard.format)
print('mode = ', a_bastard.mode)

plt.imshow(np.asarray(a_bastard))

Each image is (1024, 1024), and 1024 is 2 to the 10th, so that is about a megapixel per bastard. We need to downsample that before presenting the data to UMAP. The bastards are also color images (R,G,B), but UMAP wants simple scalars for values, so the color needs to be grayscaled. (Notice how the axis numbering changes.)

In [None]:
print(f'Downsampling to {bastards_into_umap_size}')
a_bastard_downsized = a_bastard.resize(bastards_into_umap_size)
a_bastard_downsized_grayed = rgb2gray(np.asarray(a_bastard_downsized))

plt.imshow(a_bastard_downsized_grayed, interpolation='nearest', cmap='gray')
plt.show()

And then we flatten() each image into a 1D array, which becomes one row of the 2D array to be presented to UMAP.

In [None]:
print(a_bastard_downsized_grayed.flatten()[0])
print(a_bastard_downsized_grayed.flatten()[254])
print(type(a_bastard_downsized_grayed.flatten()[254]))

Yup, that's the correct data type.

## UMAP

### Data transforming

Next we manipulate the data in prep for feeding it to UMAP.

We need to present all the bastards to Projector in the structure it wants, which is a 2D array. That array is a list of all the bastards to be projected. Each bastard has to be recast as a 1D feature vector, each feature a single number. So, each 2D image gets reshaped to a 1D array, and each color pixel (R,G,B) gets grayscaled to a single value.
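
The reshaping described above can be sketched on synthetic data, without touching the repo. The assumptions are loud ones: the luminance weights below are what skimage's `rgb2gray` is assumed to apply, the random arrays stand in for resized bastards, and `images_to_feature_rows` is a made-up helper name.

```python
import numpy as np

# Luminance weights; assumed to match what skimage's rgb2gray applies
LUMA_WEIGHTS = np.array([0.2125, 0.7154, 0.0721])


def images_to_feature_rows(rgb_images: np.ndarray) -> np.ndarray:
    """Turn (n, height, width, 3) uint8 images into (n, height * width) rows in [0, 1]."""
    scaled = rgb_images.astype(np.float64) / 255.0
    grayed = scaled @ LUMA_WEIGHTS  # collapses the color axis: (n, height, width)
    return grayed.reshape(len(rgb_images), -1)  # one flat feature vector per image


# Five random 48 x 48 RGB images stand in for downsampled bastards
rng = np.random.default_rng(0)
fake_bastards = rng.integers(0, 256, size=(5, 48, 48, 3), dtype=np.uint8)
rows = images_to_feature_rows(fake_bastards)
print(rows.shape)
```

Each row has 48 x 48 = 2304 features, one per pixel, which matches the `pixel0` … `pixel2303` column names built in the next cell.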

In [None]:
%%time

def load_calm_bastards_from_repo():
  # TODO: Surely there is some elegant pythonic way of doing this.

  # Just create the column names first, one for each pixel
  number_of_pixels = bastards_into_umap_width * bastards_into_umap_height
  feat_cols = [ 'pixel'+str(i) for i in range(number_of_pixels) ]

  calms_images = np.zeros((len(calm_ids), number_of_pixels))
  print(f'Shape of calm images: {calms_images.shape}')

  idx = 0
  for file in files:
    a_bastard = Image.open(os.path.join("allbastards.com/public/img/full",file))
    # TODO: let's get a progress bar going here. We do know how many files to process.
    if not a_bastard.is_animated:
      a_smaller_bastard = a_bastard.resize(bastards_into_umap_size)
      a_bastard_grayed = rgb2gray(asarray(a_smaller_bastard)) # This normalizes to [0..1]
      calms_images[idx] = a_bastard_grayed.flatten()
      idx = idx + 1
        
  return pd.DataFrame(calms_images,columns=feat_cols)

calms = load_calm_bastards_from_repo()

In [None]:
# Optionally, peek inside the DataFrame
calms

### Visualize embedding

Note: we are not setting a random seed (see the docs for [random_state](https://umap-learn.readthedocs.io/en/latest/reproducibility.html)). This way is faster, but the plots will look different between runs. That is fine here, since we are not aiming for reproducible science papers.


In [None]:
def embed_data():
  return umap.UMAP(n_neighbors=10, min_dist=0.1, n_components=2).fit_transform(calms)

In [None]:
def show_simple_scatterplot(): 
  """
  Plots all bastards in 2D space as blue dots, no images
  """
  subset_of_embedding = embedding #[0:100]
  fig = plt.figure(figsize=(15, 15))
  plt.scatter(subset_of_embedding[:,0], subset_of_embedding[:,1], s=1)
  plt.show()

if render_2d_umap_plot_image:
  # The plot reads the global `embedding`, so compute it before plotting
  embedding = embed_data()
  show_simple_scatterplot()

So, what is the range of the X and Y values? Those ranges define the bounding box of the plot.

[TODO: this should just be in the code. No manual calc'ing.]

In [None]:
if render_2d_umap_plot_image:
  embedding = embed_data()

  print(type(embedding))
  print('({}, {})'.format(np.min(embedding[:,0]), np.max(embedding[:,0])))
  print('({}, {})'.format(np.min(embedding[:,1]), np.max(embedding[:,1])))

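So, if we added 2 to each dimension, that would make every coordinate positive.

Then the ranges are roughly 0 to just under 15 for X, and 0 to just under 11 for Y.

If that were blown up to (1000, 1000), we should multiply by about 1000/15 ≈ 67, so 66 is safe. The code below uses a (2000, 2000) canvas, so it doubles that factor to 132.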
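
As the TODO above suggests, the offset and scale could be derived from the embedding itself instead of by hand. Here is a minimal sketch, assuming the embedding is an (n, 2) array as returned by `fit_transform`. The function name `embedding_to_canvas` and the demo coordinates are made up for illustration.

```python
import numpy as np


def embedding_to_canvas(embedding, canvas_size: int = 2000, sprite_size: int = 64):
    """Map (n, 2) embedding coordinates to integer pixel positions on a square canvas."""
    points = np.asarray(embedding, dtype=float)
    mins = points.min(axis=0)  # per-axis offset that shifts everything to >= 0
    largest_span = (points.max(axis=0) - mins).max()
    if largest_span == 0:
        return np.zeros(points.shape, dtype=int)  # all points coincide
    # One scale for both axes keeps the aspect ratio; leave room for a sprite at the edge
    scale = (canvas_size - sprite_size) / largest_span
    return np.floor((points - mins) * scale).astype(int)


# Made-up coordinates roughly in the ranges discussed above
demo_embedding = np.array([[-1.7, -1.2], [12.9, 8.8], [5.0, 3.0]])
pixel_locations = embedding_to_canvas(demo_embedding)
print(pixel_locations)
```

Each row of the result could then replace the hand-computed `location` tuple passed to `paste`.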

#### 2D


In [None]:
def scatter_bastards_in_2d():
  canvas_dimensions = (2000, 2000)
  embedding_neighborhood = Image.new('RGBA', canvas_dimensions, (0,0,0,0))
  print(f'For rendering in TensorBoard Projector, down sampling to {bastards_into_projector_size}')

  idx = 0
  for file in files:
    a_bastard = Image.open(os.path.join("allbastards.com/public/img/full",file))
    if not a_bastard.is_animated:
      a_smaller_bastard = a_bastard.resize(bastards_into_projector_size)
      location = ( trunc((embedding[idx,0]+2)*132) , trunc((embedding[idx,1]+2)*132) )
      embedding_neighborhood.paste(a_smaller_bastard, location) #, mask=a_smaller_bastard)
      idx = idx + 1
  return embedding_neighborhood    


In [None]:
%%time
if render_2d_umap_plot_image:
  scattered_bastards = scatter_bastards_in_2d()
  display(scattered_bastards)

#### 3D

Next, feed the data into TensorBoard's Embedding Projector (or simply, Projector). 



##### Sprite sheet

To actually show the images floating in a 2D or 3D space, TensorBoard Projector requires a sprite sheet which contains a sprite for each image to be projected.

For now we're just using the calmAFs (static images), not the hypedAFs (animated GIFs). There are 10459 calms and 847 hypeds. The sprite sheet needs to be square, so let's just use the first 10000, for a 100 x 100 sprite sheet. [TODO: plot all 10459 calms.]

The sprite sheet can be a PNG or a JPEG. (Not sure if an animated GIF will work in Projector.) So, for just-the-calms we'll go PNG.

In [None]:
def create_sprite_sheet():
  spritesheet_square_length = 100 # 10,000 = 100 x 100 in spritesheet
  master_width = bastards_into_projector_witdth * spritesheet_square_length
  master_height = bastards_into_projector_height * spritesheet_square_length
  spriteimage = Image.new(
    mode='RGBA',
    size=(master_width, master_height),
    color=(0,0,0,0) # fully transparent
  )

  # This CUT_OFF_LIMIT is a vile hack. Spritesheet must be square. Padding needed, but not now
  CUT_OFF_LIMIT = 10000 # TODO: remove this hack, sprite up ENTIRE collection

  punk_index = 0
  for x in range(CUT_OFF_LIMIT):
    a_punk = get_bastard_by_id(calm_ids[x]).resize(bastards_into_projector_size)
    div, mod = divmod(punk_index, spritesheet_square_length)
    h_loc = bastards_into_projector_height * div
    w_loc = bastards_into_projector_witdth * mod
    spriteimage.paste(a_punk, (w_loc, h_loc))
    punk_index = punk_index + 1

  return spriteimage
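
The `divmod` placement used above, extended to a padded grid that could hold the entire collection, can be sketched like this. It assumes Projector reads sprites row by row and tolerates empty cells at the end of the last row. `sprite_sheet_layout` and `sprite_location` are hypothetical helper names.

```python
import math


def sprite_sheet_layout(image_count: int, sprite_width: int, sprite_height: int):
    """Return (sprites_per_row, sheet_width, sheet_height) for a padded square grid."""
    sprites_per_row = math.isqrt(image_count - 1) + 1  # integer ceil(sqrt(image_count))
    return sprites_per_row, sprites_per_row * sprite_width, sprites_per_row * sprite_height


def sprite_location(index: int, sprites_per_row: int, sprite_width: int, sprite_height: int):
    """Top-left pixel (x, y) of sprite `index`, filling rows left to right."""
    row, column = divmod(index, sprites_per_row)
    return column * sprite_width, row * sprite_height


# All 10459 calms at (64, 64) instead of only the first 10000
per_row, sheet_width, sheet_height = sprite_sheet_layout(10_459, 64, 64)
print(per_row, sheet_width, sheet_height)
print(sprite_location(10_458, per_row, 64, 64))  # position of the last calm
```

With these numbers the grid is 103 x 103 and the sheet is (6592, 6592), which stays under the (8192, 8192) limit mentioned earlier.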

In [None]:
%%time

def write_files_for_tensorboard():
  # First, generate spritesheet for Projector to use downsampled sprites
  sprite_sheet = create_sprite_sheet()
  sprite_filename = os.path.join(tensorboard_data_dump_dir, 'embeddings/sprite.png')
  if not os.path.exists(os.path.dirname(sprite_filename)):
    os.makedirs(os.path.dirname(sprite_filename))
  sprite_sheet.save(sprite_filename)

  # Next the data for the dimensionality reducers (UMAP, t-SNE, PCA) to crunch on
  vectorized_punks = tf.Variable(calms[0:10000])  # 10,000 rows, one per sprite in the sheet
  checkpoint = tf.train.Checkpoint(embedding=vectorized_punks)
  checkpoint.save(os.path.join(tensorboard_data_dump_dir, 'embedding.ckpt'))


  config = projector.ProjectorConfig()
  embedder = config.embeddings.add()

  embedder.tensor_name = 'embedding/.ATTRIBUTES/VARIABLE_VALUE'

  embedder.sprite.image_path = sprite_filename
  embedder.sprite.single_image_dim.extend(bastards_into_projector_size)

  projector.visualize_embeddings(tensorboard_data_dump_dir, config)

write_files_for_tensorboard()

**Bug in TensorBoard launch**

**NOTE:** TensorBoard regularly fails to find the data just written to the file system. If so just rerun the following cell; that usually gets it to wake up and get to work.

Also note that:
- "Fetching tensor values…" normally takes a minute or two
- "Fetching sprite image…" normally takes a minute
- Then PCA will run automatically
- When PCA is done, click on UMAP or t-SNE for other hypermap algorithms that will each provide a different view of the collection.

In [None]:
%reload_ext tensorboard
%tensorboard --logdir={tensorboard_data_dump_dir}