# Preprocessing

In this notebook we perform the necessary preprocessing steps on the images and the text.

First, let's import the necessary modules:

In [1]:
import pandas as pd
import numpy as np
from nltk.tokenize import word_tokenize
from PIL import Image

The pandas version should be 1.5.3, since the pickled data file was created with that version and may not load correctly with a different one.

In [2]:
pd.__version__

'1.5.3'

Run the following cell to download the Punkt tokenizer data that NLTK's word_tokenize depends on:

In [None]:
import nltk
nltk.download("punkt")

Read the data file with pandas:

In [2]:
images_df = pd.read_pickle("images.pkl")
images_df.head()

Unnamed: 0,image,prompt
0,<PIL.Image.Image image mode=RGB size=128x128 a...,"a renaissance portrait of dwayne johnson, art ..."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,"portrait of a dancing eagle woman, beautiful b..."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,"epic 3 d, become legend shiji! gpu mecha contr..."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster. painted ...


Every image has been resized to 128x128 prior to exporting the data file. Other columns from the original dataset have also been left out, as we only need the images and their corresponding prompts.

## Preprocessing images

First, we use NumPy to create a float32 array representation of each image and store it in a new column. (An array can be turned back into an image with Image.fromarray() after casting it back to uint8.)

In [3]:
images_df['image_array'] = images_df['image'].apply(lambda img: np.asarray(img).astype('float32'))
images_df.head()

Unnamed: 0,image,prompt,image_array
0,<PIL.Image.Image image mode=RGB size=128x128 a...,"a renaissance portrait of dwayne johnson, art ...","[[[13.0, 11.0, 10.0], [22.0, 20.0, 18.0], [23...."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,"portrait of a dancing eagle woman, beautiful b...","[[[169.0, 155.0, 133.0], [169.0, 155.0, 131.0]..."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,"epic 3 d, become legend shiji! gpu mecha contr...","[[[59.0, 75.0, 98.0], [59.0, 76.0, 99.0], [59...."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...,"[[[115.0, 156.0, 184.0], [120.0, 159.0, 183.0]..."
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster. painted ...,"[[[101.0, 105.0, 89.0], [100.0, 105.0, 90.0], ..."
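
Because the arrays are stored as float32, they have to be cast back to 8-bit integers before Pillow can interpret them as RGB data. The following is a minimal sketch of that round trip, using a small synthetic image as a stand-in for one entry of images_df['image']; the names `original` and `restored` are illustrative and not part of the notebook.

```python
import numpy as np
from PIL import Image

# Synthetic 4x4 RGB image standing in for one entry of images_df['image'].
original = Image.fromarray(np.arange(48, dtype=np.uint8).reshape(4, 4, 3))

# Same conversion as in the cell above.
image_array = np.asarray(original).astype("float32")

# Cast back to uint8 so that Pillow reads the array as 8-bit RGB data.
restored = Image.fromarray(image_array.astype("uint8"))

print(restored.mode, restored.size)
```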


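Next, we define a function that rescales each color channel of an image to the range 0 to 1, using that image's own per-channel minimum and maximum (min-max scaling). Note that this is not the same as dividing every value by 255: a channel whose values span 0 to 250, for example, is divided by 250.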

In [4]:
def normalize(array):
    """
    Min-max normalize an image array of shape (height, width, 3),
    where the last axis holds the RGB channels.
    Each channel is scaled independently to the range [0, 1].
    Returns the normalized array with the same shape.
    """
    min_r, min_g, min_b = np.min(array, axis=(0, 1))
    max_r, max_g, max_b = np.max(array, axis=(0, 1))

    normalized_r = (array[:, :, 0] - min_r) / (max_r - min_r)
    normalized_g = (array[:, :, 1] - min_g) / (max_g - min_g)
    normalized_b = (array[:, :, 2] - min_b) / (max_b - min_b)

    return np.stack([normalized_r, normalized_g, normalized_b], axis=-1)


images_df['normalized_array'] = images_df['image_array'].apply(normalize)
images_df.head()

Unnamed: 0,image,prompt,image_array,normalized_array
0,<PIL.Image.Image image mode=RGB size=128x128 a...,"a renaissance portrait of dwayne johnson, art ...","[[[13.0, 11.0, 10.0], [22.0, 20.0, 18.0], [23....","[[[0.050980393, 0.044, 0.041493777], [0.086274..."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,"portrait of a dancing eagle woman, beautiful b...","[[[169.0, 155.0, 133.0], [169.0, 155.0, 131.0]...","[[[0.6875, 0.7323232, 0.71511626], [0.6875, 0...."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,"epic 3 d, become legend shiji! gpu mecha contr...","[[[59.0, 75.0, 98.0], [59.0, 76.0, 99.0], [59....","[[[0.23137255, 0.29411766, 0.374502], [0.23137..."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...,"[[[115.0, 156.0, 184.0], [120.0, 159.0, 183.0]...","[[[0.44541484, 0.6502242, 0.84236455], [0.4672..."
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster. painted ...,"[[[101.0, 105.0, 89.0], [100.0, 105.0, 90.0], ...","[[[0.36082473, 0.33333334, 0.2804878], [0.3556..."
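
The function above divides by `max - min` for each channel, which would produce NaN values for an image in which a channel is constant (a completely black image, for example). One possible vectorized variant that guards against this is sketched below. `normalize_safe` is a hypothetical name that does not appear in the notebook, and mapping a constant channel to 0 is an assumption rather than the notebook's behavior.

```python
import numpy as np


def normalize_safe(array):
    """Per-channel min-max scaling for an array of shape (height, width, 3)."""
    array = np.asarray(array, dtype="float32")
    # Minimum and value range of each channel, both with shape (3,).
    # Broadcasting applies them to every pixel.
    channel_min = array.min(axis=(0, 1))
    channel_range = array.max(axis=(0, 1)) - channel_min
    # Assumption: a constant channel is mapped to 0 instead of dividing by zero.
    safe_range = np.where(channel_range == 0, 1.0, channel_range)
    return (array - channel_min) / safe_range


demo = np.zeros((2, 2, 3), dtype="float32")
demo[..., 0] = [[0, 51], [102, 255]]  # red channel varies
demo[..., 1] = 7                      # green channel is constant
result = normalize_safe(demo)
print(result[..., 0])
```

For images in which every channel has at least two distinct values, this should give the same result as `normalize`.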


## Preprocessing text

The next step is to convert every prompt into a numerical representation that the model can train on.

We replace all punctuation with spaces, then apply word_tokenize to every prompt and store the results in a new column, 'tokenized_prompt'.

In [5]:
images_df['prompt'] = images_df['prompt'].str.replace(r'\W+', ' ', regex=True)
images_df['tokenized_prompt'] = images_df['prompt'].apply(word_tokenize)
images_df.head()

Unnamed: 0,image,prompt,image_array,normalized_array,tokenized_prompt
0,<PIL.Image.Image image mode=RGB size=128x128 a...,a renaissance portrait of dwayne johnson art ...,"[[[13.0, 11.0, 10.0], [22.0, 20.0, 18.0], [23....","[[[0.050980393, 0.044, 0.041493777], [0.086274...","[a, renaissance, portrait, of, dwayne, johnson..."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,portrait of a dancing eagle woman beautiful b...,"[[[169.0, 155.0, 133.0], [169.0, 155.0, 131.0]...","[[[0.6875, 0.7323232, 0.71511626], [0.6875, 0....","[portrait, of, a, dancing, eagle, woman, beaut..."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,epic 3 d become legend shiji gpu mecha contr...,"[[[59.0, 75.0, 98.0], [59.0, 76.0, 99.0], [59....","[[[0.23137255, 0.29411766, 0.374502], [0.23137...","[epic, 3, d, become, legend, shiji, gpu, mecha..."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...,"[[[115.0, 156.0, 184.0], [120.0, 159.0, 183.0]...","[[[0.44541484, 0.6502242, 0.84236455], [0.4672...","[an, airbrush, painting, of, cyber, war, machi..."
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster painted ...,"[[[101.0, 105.0, 89.0], [100.0, 105.0, 90.0], ...","[[[0.36082473, 0.33333334, 0.2804878], [0.3556...","[concept, art, of, a, silent, hill, monster, p..."
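
The effect of the cleanup can be checked on a single string with Python's `re` module. The sketch below compares the character-class form `[\W+]`, which replaces each non-word character separately, with `\W+`, which collapses a run of them into one space; after splitting on whitespace the tokens are identical. It also shows that `\W` leaves underscores in place. The example prompt is made up, and `str.split` is only a stand-in for `word_tokenize`.

```python
import re

prompt = "portrait of a woman, 4k!! trending_on artstation"

# Character class: each non-word character is replaced on its own,
# so ", " turns into two spaces.
per_character = re.sub(r"[\W+]", " ", prompt)

# Quantified form: each run of non-word characters becomes one space.
per_run = re.sub(r"\W+", " ", prompt)

# str.split() discards the extra spaces, so both patterns lead to the
# same tokens in this simple example.
tokens_a = per_character.split()
tokens_b = per_run.split()

print(per_run)
print(tokens_a == tokens_b)
print("trending_on" in tokens_b)  # the underscore counts as a word character
```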


## Use ALL 2 million prompts to train a Word2Vec model

In [12]:
from urllib.request import urlretrieve

table_url = 'https://huggingface.co/datasets/poloclub/diffusiondb/resolve/main/metadata.parquet'
urlretrieve(table_url, 'metadata.parquet')

('metadata.parquet', <http.client.HTTPMessage at 0x19a1f2fd6d0>)

In [16]:
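# Only the prompt column is needed, so skip the other metadata columns.
text_df = pd.read_parquet('metadata.parquet', columns=['prompt'])

text_df['prompt'] = text_df['prompt'].str.replace(r'\W+', ' ', regex=True)
# From here on, text_df is a Series of token lists rather than a DataFrame.
text_df = text_df['prompt'].apply(word_tokenize)
text_df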

0          [a, portrait, of, a, female, robot, made, from...
1          [a, portrait, of, a, female, robot, made, from...
2          [only, memories, remain, trending, on, artstat...
3                      [dream, swimming, pool, with, nobody]
4              [a, dog, doing, weights, epic, oil, painting]
                                 ...                        
1999995    [david, bowie, giving, a, piggy, back, ride, t...
1999996    [david, bowie, giving, a, piggy, back, ride, t...
1999997                                    [funny, computer]
1999998               [hilarious, witty, computing, machine]
1999999    [hilarious, witty, computing, machine, lichten...
Name: prompt, Length: 2000000, dtype: object

In [28]:
two_m_sentences = text_df.tolist()

Next, create a list of the tokenized prompts in images_df.

In [6]:
sentences = images_df['tokenized_prompt'].tolist()

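Now we import Word2Vec and build one model per corpus.
Setting min_count=1 keeps every word that appears at least once, so every word in the corpus gets an embedding.
Passing the sentences to the constructor already builds the vocabulary and runs an initial training pass with the default number of epochs.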

In [29]:
from gensim.models import Word2Vec

model = Word2Vec(sentences, min_count=1)
two_m_model = Word2Vec(two_m_sentences, min_count=1)

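Continue training both models on their sentences for 18 more epochs. The tuple shown below is the return value of the last train call.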

In [30]:
model.train(sentences, total_examples=len(sentences), epochs=18)
two_m_model.train(two_m_sentences, total_examples=len(two_m_sentences), epochs=18)

(715179117, 876959748)

Save both models to disk so we can use them in other notebooks.

In [10]:
model.save("word2vec.model")

In [31]:
two_m_model.save("2m_word2vec.model")

Let's check the embedding that the model learned for the third token of the first prompt, 'portrait'. It is a vector with 100 dimensions:

In [11]:
word = model.wv[images_df.iloc[0]['tokenized_prompt']][2]
word.shape

(100,)

In [14]:
model.wv.similar_by_vector(word)

[('portrait', 1.0000001192092896),
 ('young', 0.4828406870365143),
 ('female', 0.4440945088863373),
 ('demon', 0.41361796855926514),
 ('portait', 0.4041849970817566),
 ('rivia', 0.3979431986808777),
 ('gorgeous', 0.3954363763332367),
 ('princess', 0.3941110670566559),
 ('siren', 0.3887902498245239),
 ('korean', 0.38710951805114746)]
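
The scores returned by `similar_by_vector` are cosine similarities between the query vector and the vectors in the vocabulary, which is why the word itself scores approximately 1. Below is a minimal NumPy sketch of that kind of ranking. The three-dimensional vocabulary is made up and stands in for the trained model, and `rank_by_cosine` is a hypothetical helper, not a gensim function.

```python
import numpy as np

# Made-up toy vocabulary with 3-dimensional vectors.
vocab = ["portrait", "painting", "computer"]
vectors = np.array([
    [1.0, 2.0, 0.0],
    [2.0, 3.0, 1.0],
    [-1.0, 0.0, 3.0],
], dtype="float32")


def rank_by_cosine(query, vectors, vocab):
    """Return (word, cosine similarity) pairs, most similar first."""
    # Normalize to unit length so that the dot product equals the cosine.
    unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    unit_query = query / np.linalg.norm(query)
    scores = unit_vectors @ unit_query
    order = np.argsort(-scores)
    return [(vocab[i], float(scores[i])) for i in order]


ranking = rank_by_cosine(vectors[0], vectors, vocab)
for entry in ranking:
    print(entry)
```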

Now we want to look up the embedding of every token in every prompt in the dataset, so we create a new column for that:

In [35]:
def get_vectorized_prompt(tokens):
    return np.array([model.wv[token] for token in tokens])


images_df['vectorized_prompt'] = images_df['tokenized_prompt'].apply(get_vectorized_prompt)

In [36]:
def get_vectorized_prompt_2m(tokens):
    return np.array([two_m_model.wv[token] for token in tokens])


images_df['2m_vectorized_prompt'] = images_df['tokenized_prompt'].apply(get_vectorized_prompt_2m)
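
Each vectorized prompt is a 2-D array with one row per token, so prompts of different lengths produce arrays of different shapes. The sketch below illustrates this with a hypothetical dictionary, `toy_vectors`, standing in for `model.wv`, and with 4 dimensions instead of 100; the random values are placeholders, not trained embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for model.wv: one random 4-dimensional vector per word.
toy_vocab = ["a", "dog", "doing", "weights", "funny", "computer"]
toy_vectors = {word: rng.normal(size=4).astype("float32") for word in toy_vocab}


def vectorize(tokens, lookup):
    """Stack the vector of every token into an array of shape (n_tokens, dim)."""
    return np.array([lookup[token] for token in tokens])


short_prompt = vectorize(["funny", "computer"], toy_vectors)
long_prompt = vectorize(["a", "dog", "doing", "weights"], toy_vectors)

print(short_prompt.shape, long_prompt.shape)
```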

This is what the resulting dataframe looks like:

In [37]:
images_df.head()

Unnamed: 0,image,prompt,image_array,normalized_array,tokenized_prompt,vectorized_prompt,2m_vectorized_prompt
0,<PIL.Image.Image image mode=RGB size=128x128 a...,a renaissance portrait of dwayne johnson art ...,"[[[13.0, 11.0, 10.0], [22.0, 20.0, 18.0], [23....","[[[0.050980393, 0.044, 0.041493777], [0.086274...","[a, renaissance, portrait, of, dwayne, johnson...","[[1.327685, -0.37005645, 0.43590719, -1.324833...","[[-0.58317715, 1.1634145, 0.90053946, -0.22644..."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,portrait of a dancing eagle woman beautiful b...,"[[[169.0, 155.0, 133.0], [169.0, 155.0, 131.0]...","[[[0.6875, 0.7323232, 0.71511626], [0.6875, 0....","[portrait, of, a, dancing, eagle, woman, beaut...","[[-0.6610553, 2.0926023, -1.9930851, 0.9011854...","[[-1.1437063, 1.7737198, -2.8723955, 0.8827879..."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,epic 3 d become legend shiji gpu mecha contr...,"[[[59.0, 75.0, 98.0], [59.0, 76.0, 99.0], [59....","[[[0.23137255, 0.29411766, 0.374502], [0.23137...","[epic, 3, d, become, legend, shiji, gpu, mecha...","[[0.14250197, -2.1592658, 0.4262475, -0.807887...","[[-0.4053494, 2.3315654, -0.9218749, 4.496642,..."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...,"[[[115.0, 156.0, 184.0], [120.0, 159.0, 183.0]...","[[[0.44541484, 0.6502242, 0.84236455], [0.4672...","[an, airbrush, painting, of, cyber, war, machi...","[[4.240111, -2.4562259, -1.1775157, -1.3560923...","[[1.1083556, -2.7111382, -0.6331274, 1.8477451..."
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster painted ...,"[[[101.0, 105.0, 89.0], [100.0, 105.0, 90.0], ...","[[[0.36082473, 0.33333334, 0.2804878], [0.3556...","[concept, art, of, a, silent, hill, monster, p...","[[3.2343996, 1.4021288, -2.152833, 0.5782454, ...","[[-1.8549491, 4.8363824, 1.5333705, -0.2730061..."


We want to export only the images, prompts, normalized arrays, and vectorized prompts, without the intermediate columns, so we create a new dataframe containing just those columns:

In [38]:
new_df = images_df[['image', 'prompt', 'normalized_array', 'vectorized_prompt', '2m_vectorized_prompt']]

In [39]:
new_df.head()

Unnamed: 0,image,prompt,normalized_array,vectorized_prompt,2m_vectorized_prompt
0,<PIL.Image.Image image mode=RGB size=128x128 a...,a renaissance portrait of dwayne johnson art ...,"[[[0.050980393, 0.044, 0.041493777], [0.086274...","[[1.327685, -0.37005645, 0.43590719, -1.324833...","[[-0.58317715, 1.1634145, 0.90053946, -0.22644..."
1,<PIL.Image.Image image mode=RGB size=128x128 a...,portrait of a dancing eagle woman beautiful b...,"[[[0.6875, 0.7323232, 0.71511626], [0.6875, 0....","[[-0.6610553, 2.0926023, -1.9930851, 0.9011854...","[[-1.1437063, 1.7737198, -2.8723955, 0.8827879..."
2,<PIL.Image.Image image mode=RGB size=128x128 a...,epic 3 d become legend shiji gpu mecha contr...,"[[[0.23137255, 0.29411766, 0.374502], [0.23137...","[[0.14250197, -2.1592658, 0.4262475, -0.807887...","[[-0.4053494, 2.3315654, -0.9218749, 4.496642,..."
3,<PIL.Image.Image image mode=RGB size=128x128 a...,an airbrush painting of cyber war machine scen...,"[[[0.44541484, 0.6502242, 0.84236455], [0.4672...","[[4.240111, -2.4562259, -1.1775157, -1.3560923...","[[1.1083556, -2.7111382, -0.6331274, 1.8477451..."
4,<PIL.Image.Image image mode=RGB size=128x128 a...,concept art of a silent hill monster painted ...,"[[[0.36082473, 0.33333334, 0.2804878], [0.3556...","[[3.2343996, 1.4021288, -2.152833, 0.5782454, ...","[[-1.8549491, 4.8363824, 1.5333705, -0.2730061..."


Now export the dataframe, which contains the values that are ready for training:

In [41]:
new_df.to_pickle("ready_to_train.pkl")
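
When the file is loaded in another notebook, a quick check that the columns and array shapes survived the round trip can be useful. The following is a minimal sketch with a tiny stand-in dataframe and a temporary file. The contents and the file name `demo.pkl` are made up; the real file is ready_to_train.pkl.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Tiny stand-in for new_df, built with apply() as in the cells above.
demo_df = pd.DataFrame({"prompt": ["funny computer"]})
demo_df["normalized_array"] = demo_df["prompt"].apply(
    lambda _: np.zeros((128, 128, 3), dtype="float32")
)
demo_df["vectorized_prompt"] = demo_df["prompt"].apply(
    lambda text: np.ones((len(text.split()), 100), dtype="float32")
)

# Write to a temporary file and read it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.pkl")
    demo_df.to_pickle(path)
    loaded = pd.read_pickle(path)

print(list(loaded.columns))
print(loaded.loc[0, "normalized_array"].shape)
print(loaded.loc[0, "vectorized_prompt"].shape)
```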