We have seen how AI, and DL in particular, can analyse data - but can it also create?

2015: Google's DeepDream produced psychedelic, pareidolic images (pareidolia is the human tendency to see or hear a pattern where none exists)

2016: Prisma turns photos into 'paintings'

2016: [Sunspring](https://www.youtube.com/watch?v=LY7x2Ihqjmc), an experimental film with an LSTM-generated script

2000s: Neural-network-generated music

Not human replacement, but augmented intelligence

A different kind of intelligence



##### The idea

Artistic creation involves pattern recognition and technical skill - tedious work that can be mechanised

Perceptual modalities, language, artwork and music all have statistical structure, and statistical structure can be learnt by DL algorithms

DL algorithms can learn a statistical *latent space*.

Sampling from the latent space 'creates' new artworks similar to the training data.

The algorithm attaches no meaning to the process and the product - but we might.

Potentially eliminates technical skill; enables free expression; separates art from craft.

# 8.1 LSTM text generation

RNNs can generate new sequence data

- Musical notes

- Brushstrokes recorded on an iPad

- Handwriting

- Speech synthesis

- Chatbot dialogue

- Google's Smart Reply (2016) - automatic generation of short replies to emails and text messages.


## 8.1.2 How to generate sequence data

Train an ANN to predict the next token, or tokens, in a sequence using the previous tokens as input.

Text = words or characters

Any text-trained model is known as a *language model*.

A language model captures the latent space of language i.e. its statistical structure.

1. Train model
2. Present an initial *conditioning* text string
3. Model predicts the next token(s)
4. Add the generated text to the input text
5. Go back to step 3.
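
Steps 2 to 5 can be sketched without any deep learning library. In this sketch a toy bigram counter stands in for the trained model (in this chapter that role is played by an LSTM). The names `build_bigram_model`, `next_char_probs` and `generate`, and the tiny corpus, are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def build_bigram_model(text):
    """Count which character follows which: a toy stand-in for a trained language model."""
    counts = defaultdict(Counter)
    for current, following in zip(text, text[1:]):
        counts[current][following] += 1
    return counts


def next_char_probs(model, context):
    """Return {char: probability} for the next character.

    This toy model only looks at the last character of the context.
    """
    followers = model[context[-1]]
    total = sum(followers.values())
    return {ch: n / total for ch, n in followers.items()}


def generate(model, seed_text, length, rng):
    """Steps 2-5: predict the next token, append it to the input, repeat."""
    generated = seed_text
    for _ in range(length):
        probs = next_char_probs(model, generated)
        if not probs:  # the last character never had a successor in the training text
            break
        chars = list(probs)
        weights = [probs[ch] for ch in chars]
        generated += rng.choices(chars, weights=weights, k=1)[0]
    return generated


corpus = "the cat sat on the mat and the cat ate the rat "
model = build_bigram_model(corpus)                      # step 1: 'train'
sample = generate(model, seed_text="th", length=40, rng=random.Random(0))
print(sample)
```

Every pair of adjacent characters in the output occurs somewhere in the corpus, which is all a bigram model can guarantee; a longer context is what lets an LSTM produce more plausible text.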

Section 8.1.4 demonstrates character-level natural language modeling.

Strings of N characters fed to an LSTM, LSTM predicts the N+1'th character.
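
Preparing training data for this set-up amounts to sliding a window of N characters over the text and one-hot encoding each window together with the character that follows it. A minimal NumPy sketch; the function name `vectorise` and the values of `maxlen` (the window length N) and `step` are assumptions for illustration, not taken from the listings.

```python
import numpy as np


def vectorise(text, maxlen, step):
    """Cut text into overlapping windows of maxlen characters and one-hot encode them.

    Returns x with shape (num_windows, maxlen, num_chars), y with shape
    (num_windows, num_chars) and the sorted character vocabulary.
    """
    chars = sorted(set(text))
    char_index = {ch: i for i, ch in enumerate(chars)}

    starts = range(0, len(text) - maxlen, step)
    x = np.zeros((len(starts), maxlen, len(chars)), dtype=bool)
    y = np.zeros((len(starts), len(chars)), dtype=bool)
    for n, start in enumerate(starts):
        for t, ch in enumerate(text[start:start + maxlen]):
            x[n, t, char_index[ch]] = True
        y[n, char_index[text[start + maxlen]]] = True  # the N+1'th character
    return x, y, chars


x, y, chars = vectorise("hello world", maxlen=4, step=2)
print(x.shape, y.shape)
```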

## 8.1.3 The sampling strategy

1. *Greedy sampling.* Select the most probable token - repetitive and unrealistic

2. *Stochastic sampling.* Sample from the probability distribution of the next character.

Possible to sample from the softmax output which we know produces a 'probability' distribution.

But - uncontrollable.

*Uniform sampling* Each token has the same probability - maximum randomness (the distribution has maximum entropy) 

*Spiked distribution* i.e. greedy sampling, $p_i = \delta_{ik}$ where $k$ is the index of the most probable token - minimum randomness, the spiked distribution has minimum entropy.

*Intermediate randomness* controlled by the softmax temperature - this is where we expect to find the more interesting, creative outputs.
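
Entropy makes 'randomness' concrete. A small sketch using Shannon entropy in nats on an illustrative four-token vocabulary; the `entropy` helper and the example distributions are assumptions for illustration.

```python
import numpy as np


def entropy(p):
    """Shannon entropy sum(p * log(1/p)) in nats, treating 0 * log(1/0) as 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p[p > 0]
    return float(np.sum(nonzero * np.log(1.0 / nonzero)))


uniform = [0.25, 0.25, 0.25, 0.25]    # maximum randomness
spiked = [0.0, 1.0, 0.0, 0.0]         # greedy sampling: no randomness
intermediate = [0.1, 0.6, 0.2, 0.1]   # somewhere in between

for name, dist in [("uniform", uniform), ("spiked", spiked), ("intermediate", intermediate)]:
    print(name, round(entropy(dist), 3))
```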

###### Softmax temperature

$
q_i = \frac{e^{\frac{x_i}{T}}}{\sum_j e^{\frac{x_j}{T}}}
$

Remember, the softmax output is 

$
p_i = \frac{1}{N}e^{x_i}
$

where $N = \sum_j e^{x_j}$. The parameterised softmax distribution is computed like this:

1. Take logs and divide by $T$: $\frac{\log(p_i)}{T} = \frac{x_i}{T} - \frac{\log N}{T}$
2. Re-exponentiate: $e^{\frac{\log(p_i)}{T}} = c(T) e^{\frac{x_i}{T}}$ where $c(T) = N^{-\frac{1}{T}}$ is a temperature dependent constant
3. Find new normalisation: $N' = \sum_j c(T) e^{\frac{x_j}{T}}$
4. Temperatured softmax: $q_i = \frac{1}{N'}c(T)e^{\frac{x_i}{T}} = \frac{ e^{\frac{x_i}{T}} }{\sum_j e^{\frac{x_j}{T}}}$

Limiting cases

1. $T \rightarrow 0$. 

The largest term, $\max_i e^{\frac{x_i}{T}}$, dominates the sum, so $q_i \rightarrow 1$ for the most probable token and $q_i \rightarrow 0$ for all others - greedy sampling.

2. $T \rightarrow \infty$. 

$e^{\frac{x_i}{T}} \rightarrow 1$ so $q_i \rightarrow \frac{1}{M}$ where $M$ is the number of softmax outputs - a uniform distribution.
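
The reweighting and both limiting cases can be checked numerically. The sketch below takes logs, divides by $T$, re-exponentiates and renormalises; the names `reweight_distribution` and `sample_index` and the example probabilities are illustrative assumptions, and all probabilities are assumed to be strictly positive.

```python
import numpy as np


def reweight_distribution(p, temperature):
    """Return q_i proportional to p_i ** (1 / T), i.e. a softmax of x_i / T."""
    log_p = np.log(np.asarray(p, dtype=float)) / temperature
    log_p -= log_p.max()             # subtract the max for numerical stability
    q = np.exp(log_p)
    return q / q.sum()               # renormalise so that q sums to 1


def sample_index(p, temperature, rng):
    """Draw one token index from the temperature-adjusted distribution."""
    q = reweight_distribution(p, temperature)
    return int(rng.choice(len(q), p=q))


p = np.array([0.5, 0.3, 0.15, 0.05])   # a softmax output, i.e. T = 1
for T in (0.05, 0.5, 1.0, 100.0):
    print(T, np.round(reweight_distribution(p, T), 3))

rng = np.random.default_rng(0)
print(sample_index(p, temperature=0.5, rng=rng))
```

At small $T$ almost all of the mass sits on the most probable token; at large $T$ the four probabilities approach $\frac{1}{4}$.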

## 8.1.4 Implementing character-level LSTM text generation

A large training set is required for a good language model.

Any large text file will do, such as Lord of the Rings, or even a set of texts such as Wikipedia.

Listings 8.2-8.8 prepare a model of the writings of a late C19 German philosopher. 

We find that low temperatures produce dull repetitive texts and high temperatures yield gibberish. 

$T\approx0.5$ is a reasonable compromise - the generated text is surprising and creative: even the new non-English words are plausible.

Better results can be obtained with larger training sets.

The generated text is not, of course, meaningful, unless by chance.

Semantics and statistical modeling are two different things.

# 8.2 DeepDream

Released by Google in 2015.

Trippy pictures (I wouldn't know, I didn't inhale) with pareidolic effects, bird feathers, dog eyes... a byproduct of training the DeepDream convnet on ImageNet which is replete with dogs and birds.

The DD algorithm follows the convnet filter visualisation technique of Chapter 5: run a convnet in reverse with gradient ascent on the input image in order to maximise the response of a specific filter in the upper layer.

Except 

1. DD mixes visualisations of a large number of features by maximising the activation of entire layers rather than a specific filter, as is the case with the techniques of Chapter 5.

2. Start with an existing image, not an image of random pixel values, so that the algorithm can work with pre-existing visual patterns

3. The input images are processed at different scales.
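
For the third point, one common arrangement is a list of successively larger scales ('octaves'): gradient ascent is run at the smallest scale, the result is upscaled, and the process repeats. A sketch of how such a list of shapes might be computed; the helper name, the number of octaves and the scale factor 1.4 are illustrative assumptions.

```python
def octave_shapes(original_shape, num_octaves=3, octave_scale=1.4):
    """Return (height, width) shapes from smallest to largest, ending at the original."""
    height, width = original_shape
    shapes = [
        (int(height / octave_scale ** i), int(width / octave_scale ** i))
        for i in range(num_octaves)
    ]
    return shapes[::-1]  # smallest scale first


print(octave_shapes((400, 600)))
```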

Listings 8.8-8.13 contain the DD algorithm.

The layers that are maximised for activation are specified in the `layer_contributions` dictionary.

Get the layer names from `model.summary()`

Experiment with lower and upper layers - the lower layers should produce more geometric patterns, the upper layers more abstract (noses, feathers, eyes etc.) 

The process is not specific to convnets or even to images (speech, music...) since it merely attempts to increase the activation of layers by gradient ascent in the image space - completely general.
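
To illustrate that generality, here is a toy gradient ascent on the input of a single dense layer, with no convnet and no image. Everything in it is an assumption for illustration: the random `weights`, the softplus activation (a smooth stand-in for ReLU), and the analytic gradient, which a real implementation would obtain from the framework's automatic differentiation.

```python
import numpy as np


def softplus(z):
    """Smooth stand-in for ReLU, computed stably as log(1 + exp(z))."""
    return np.logaddexp(0.0, z)


def layer_loss(weights, x):
    """Activation of the whole toy layer, summarised as half its squared L2 norm."""
    return 0.5 * float(np.sum(softplus(weights @ x) ** 2))


def loss_gradient(weights, x):
    """Analytic gradient of layer_loss with respect to the input x."""
    z = weights @ x
    sigmoid = 1.0 / (1.0 + np.exp(-z))           # derivative of softplus
    return weights.T @ (softplus(z) * sigmoid)


def gradient_ascent(weights, x, steps=20, step_size=0.1):
    """Repeatedly nudge the input in the direction that increases the layer loss."""
    x = x.copy()
    for _ in range(steps):
        grad = loss_gradient(weights, x)
        grad = grad / max(np.mean(np.abs(grad)), 1e-7)   # normalise the gradient
        x = x + step_size * grad
    return x


rng = np.random.default_rng(0)
weights = rng.normal(size=(8, 16))   # hypothetical layer: 8 units, 16 inputs
x0 = rng.normal(size=16)             # the existing 'image' we start from
x1 = gradient_ascent(weights, x0)
print(layer_loss(weights, x0), layer_loss(weights, x1))
```

The second printed value is larger than the first: the input has been modified to excite the layer more strongly, which is all the procedure does.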

# 8.3 Neural style transfer

2015, again.

The aim is to apply the *style* of one image to the *content* of another image.

Style = textures, colours, visual patterns

Content = high level macrostructure

E.g. apply van Gogh-like brushstrokes to a photograph.

![style transfer](https://s3.amazonaws.com/book.keras.io/img/ch8/style_transfer.png)


##### How?

Define a loss function for what you wish to achieve and then minimise it.

loss = 

distance(style(reference_image), style(generated_image))

+

distance(content(original_image), content(generated_image))

##### Yes, but how?

The upper layers contain global information - the macrostructure or content.

Content loss is L2 norm (distance)

content() = activation on the upper layer of a pretrained convnet.

Style loss is tricky. The aim is to capture the appearance of the target image at all spatial scales. 

Style loss aims to preserve the *correlations* within the activations of each of several layers. The idea is that textures are represented as correlations at different scales.

1. Use a pre-trained convnet e.g. VGG19 (VGG16 + 3 extra conv. layers)
2. Compute the layer activations for the style-reference image, the original image and the generated image
3. Minimise the loss with gradient descent
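
The two loss terms can be sketched in NumPy. Here random arrays stand in for convnet activations (no real network is involved), the Gram matrix is used as the measure of correlation between the feature maps of a layer, and the normalisation constant and loss weights are conventional but illustrative choices; all names are assumptions.

```python
import numpy as np


def gram_matrix(features):
    """Correlations between the channels of one layer's activations.

    features has shape (height, width, channels).
    """
    flat = features.reshape(-1, features.shape[-1])   # (positions, channels)
    return flat.T @ flat                              # (channels, channels)


def content_loss(original_features, generated_features):
    """Squared L2 distance between upper-layer activations."""
    return float(np.sum((generated_features - original_features) ** 2))


def style_loss(style_features, generated_features):
    """Squared distance between Gram matrices, with a conventional normalisation."""
    height, width, channels = style_features.shape
    diff = gram_matrix(style_features) - gram_matrix(generated_features)
    return float(np.sum(diff ** 2)) / (4.0 * channels ** 2 * (height * width) ** 2)


rng = np.random.default_rng(0)
style = rng.normal(size=(8, 8, 4))      # stand-in for style-reference activations
original = rng.normal(size=(8, 8, 4))   # stand-in for original-image activations
generated = original.copy()             # start the generated image at the original

style_weight, content_weight = 1.0, 0.025   # illustrative weights
total = (style_weight * style_loss(style, generated)
         + content_weight * content_loss(original, generated))
print(total)
```

Because the Gram matrix sums over spatial positions, it is unchanged when the positions are shuffled: it records which features occur together, not where, which is why it captures texture rather than content.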

Listings 8.14-8.22 provide code. The computation is RAM intensive: you could try reducing `img_height` if you have run out of RAM.   


# 8.4 Generating images with variational autoencoders

Create new images by sampling a latent image space.

Currently the most popular and successful application of creative AI.

Two techniques:

1. Variational autoencoders (VAEs)

2. Generative adversarial networks (GANs)

## 8.4.1 Sampling from latent spaces of images

The latent space is low dimensional.

Each point in the space corresponds to a realistic image.

A *generator (GAN) or decoder (VAE)* module realises the mapping from latent space to realistic images.

The latent space can be sampled; new, never seen before, yet (hopefully) realistic, images are created.

GANs and VAEs generate latent spaces in different ways.

VAEs are good at generating structured latent spaces where specific directions encode meaningful data.

GANs generate very realistic images but the latent space may not be so structured and is less useful for image editing.

## 8.4.2 Concept vectors for image editing

Similar to embeddings.

Different directions encode interesting axes e.g. a 'smile' vector.

If you have found a smile vector, you could project an image into the latent space, add a smile vector, and then decode.

Concept vectors exist for any independent axes e.g. adding sunglasses, female-face to male-face etc! 
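
A sketch of the arithmetic, using synthetic latent codes in place of an encoder's output. Estimating the concept vector as the difference between the mean code of images with the attribute and the mean code of images without it is one simple approach, assumed here for illustration; the names and the two-dimensional latent space are also assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim = 2

# Synthetic latent codes: in practice these would come from encoding labelled images.
true_direction = np.array([1.0, 0.0])
neutral_codes = rng.normal(size=(100, latent_dim))
smiling_codes = rng.normal(size=(100, latent_dim)) + 3.0 * true_direction

# One simple estimate of a concept vector: the difference of the class means.
smile_vector = smiling_codes.mean(axis=0) - neutral_codes.mean(axis=0)


def add_concept(z, concept_vector, strength=1.0):
    """Move a latent point along the concept direction; the result is then decoded."""
    return z + strength * concept_vector


z = neutral_codes[0]
z_edited = add_concept(z, smile_vector)
print(smile_vector, z_edited)
```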

## 8.4.3 VAEs

A classic autoencoder maps an image to a low dimensional latent space - a kind of compression - and then decodes back to the original dimensionality.

The autoencoder is trained to reconstruct its input, i.e. to make the decoded output match the original image.

But these classic spaces are not structured in a useful way.

![Autoencoder](https://s3.amazonaws.com/book.keras.io/img/ch8/autoencoder.jpg)

VAEs do not compress into a fixed code in the latent space, but find parameters (mean and std) of a probability distribution.

The assumption is that the output image is the result of a statistical process and the randomness of this process should be taken into account during encoding and decoding.

The randomness (stochasticity) improves robustness and forces the latent space to encode meaningful representations everywhere.

![VAE](https://s3.amazonaws.com/book.keras.io/img/ch8/vae.png)

##### How?

1. Encoder maps an input image to $\mu$ and $\sigma$.
2. Randomly sample: $z \sim N(\mu, \sigma^2)$ i.e. $z = \mu + \sigma \epsilon$ with $\epsilon \sim N(0, 1)$.
3. A decoder module maps $z$ back to the original image space.

There are two loss functions:
- reconstruction loss: forces the decoded samples to match the initial inputs
- regularisation loss: ensures well-formed latent spaces and minimal overfitting
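
The sampling step and the two losses can be written down in a few lines of NumPy. This is a sketch under stated assumptions: the encoder is assumed to output the log-variance `z_log_var` $= \log \sigma^2$ rather than $\sigma$ itself (a common parameterisation), the regularisation loss is taken to be the KL divergence to a standard normal, and mean squared error is one possible reconstruction loss. Encoder and decoder networks are omitted.

```python
import numpy as np


def sample_latent(z_mean, z_log_var, rng):
    """Reparameterised sample z = mu + sigma * epsilon with epsilon ~ N(0, 1)."""
    epsilon = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon


def regularisation_loss(z_mean, z_log_var):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over latent dimensions."""
    return 0.5 * np.sum(z_mean ** 2 + np.exp(z_log_var) - z_log_var - 1.0, axis=-1)


def reconstruction_loss(x, x_decoded):
    """Mean squared error between the input and its reconstruction."""
    return np.mean((x - x_decoded) ** 2, axis=-1)


rng = np.random.default_rng(0)
z_mean = np.array([[0.0, 0.0], [2.0, -1.0]])       # two inputs, 2-D latent space
z_log_var = np.array([[0.0, 0.0], [-1.0, 0.5]])
z = sample_latent(z_mean, z_log_var, rng)
print(z.shape, regularisation_loss(z_mean, z_log_var))
```

The regularisation loss is zero only when $\mu = 0$ and $\sigma = 1$, so it pulls every encoded distribution towards the centre of the latent space.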

![VAE1.png](attachment:VAE1.png)
*Irhum Shafkat, 'Intuitively Understanding Variational Autoencoders', https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf*

$\mu$ and $\sigma$ are $n$-dimensional vectors. The latent space is $n$-dimensional.

The latent vector is constructed by sampling:

![VAE2.png](attachment:VAE2.png)
*Irhum Shafkat, 'Intuitively Understanding Variational Autoencoders', https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf*

The stochastic sampling means that slightly different latent vectors will be generated from the same source. Since the decoder is attempting to decode all the random latent vectors emanating from the same source to the same target (which is identical to the source!), neighbouring points in the latent space are decoded to very similar images.

You can think of $z$ as a point in the latent space and $\sigma$ defining an area around this point.

The regularisation loss ensures that the $z$'s are clustered together at the centre of the latent space. The $\sigma$ areas overlap so that a continuous and structured representation is built.

New images are generated by decoding a selected point $z$ in the latent space.


# 8.5 Generative Adversarial Networks

Forger produces some fake Picassos.

The fakes are mixed with original Picassos and an art dealer classifies the paintings.

The forger acts on the dealer's feedback and paints some more fakes.

The new fakes are again mixed with originals and the art dealer makes a new assessment, but this time she is better at distinguishing fakes from originals because she has more experience. 

Over time, the painter becomes a better forger and the dealer becomes a better critic.

A GAN has 
- a *generator* that takes a random point in the latent space and decodes it into a synthetic image
- a *discriminator* that predicts whether an image came from the training set or from the generator.

The generator is trained to fool the discriminator and the discriminator is constantly adapting.

There is no guarantee that the latent space is structured or even continuous.

A GAN does not have a fixed loss landscape (loss plotted against parameters) - the entire landscape changes with each step.

The aim is not minimisation, but equilibrium - and this is difficult and involves a lot of 'alchemy'.

## 8.5.1 Schematics

![GAN.png](attachment:GAN.png)
*Rohith Gandhi, 'Generative Adversarial Networks — Explained', https://towardsdatascience.com/generative-adversarial-networks-explained-34472718707a*

A GAN chains a generator to a discriminator. 

1. A generator maps vectors from the latent space to images. The vectors are random $n$-dimensional vectors (normal distribution) where $n$ is the dimension of the latent space.
2. The generated images are mixed with real ones (e.g.) from CIFAR10. 
3. A discriminator network outputs a score in $[0, 1]$, i.e. a 'probability', for each image in the mixed set.
4. The discriminator is frozen while the generator is being trained, so that only the generator's weights are updated in that step.
5. The generator is trained to maximise the discriminator loss i.e. to produce images that the discriminator will classify as real; the discriminator is trained to lower its loss i.e. to become better at spotting fakes.
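
The two competing objectives can be made concrete by computing both losses from a batch of discriminator outputs. This sketch omits the networks and the weight updates altogether; the label convention (1 = real, 0 = generated), the use of binary cross-entropy and all function names are assumptions for illustration.

```python
import numpy as np


def binary_crossentropy(targets, predictions, eps=1e-7):
    """Mean binary cross-entropy between 0/1 targets and predicted probabilities."""
    p = np.clip(predictions, eps, 1.0 - eps)
    return float(-np.mean(targets * np.log(p) + (1.0 - targets) * np.log(1.0 - p)))


def gan_losses(d_on_real, d_on_fake):
    """Losses for one step, given discriminator outputs on real and generated images."""
    scores = np.concatenate([d_on_real, d_on_fake])
    labels = np.concatenate([np.ones_like(d_on_real), np.zeros_like(d_on_fake)])
    d_loss = binary_crossentropy(labels, scores)
    # The generator is scored against 'misleading' targets: it wants fakes labelled real.
    g_loss = binary_crossentropy(np.ones_like(d_on_fake), d_on_fake)
    return d_loss, g_loss


# A confident, correct discriminator: low discriminator loss, high generator loss.
sharp = gan_losses(np.array([0.99, 0.98]), np.array([0.02, 0.01]))
# A discriminator that cannot tell real from fake: both losses equal log(2).
confused = gan_losses(np.array([0.5, 0.5]), np.array([0.5, 0.5]))
print(sharp, confused)
```

Improving one loss tends to worsen the other, which is why training seeks an equilibrium between the two networks rather than a minimum of a single loss.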


## 8.5.7 Wrapping up

- A GAN is a generator network coupled to a discriminator network. The discriminator is trained to differentiate generator output from real training images. The generator never sees real images but is trained to produce images that the discriminator will classify as real.
- GANs are difficult to train because they lack a static loss landscape. There are many tricks, and a lot of training.
- GANs can, nevertheless, generate highly realistic images, but the latent space is almost certainly unstructured and so will not possess concept vectors.