<a href="https://colab.research.google.com/github/TUIlmenauAMS/PyBerlin/blob/main/PyBerlinUsingDeepLanguageModels.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# <center> Using Deep Language Models with Txtai and DiscoArt
## <center> Gerald Schuller
## <center> Ilmenau University of Technology
## <center> Ilmenau, Germany


- Slides are in: https://github.com/TUIlmenauAMS/PyBerlin

## Introduction
- I have YouTube channels called "Climate Change Calculated" and "Klimawandel Nachgerechnet"
- Since I like to use calculations, I use Colab Jupyter notebooks with markdown cells and code cells
- Since I use 2 languages, English and German, an automatic translation would be practical
- Problem: **popular translation services** like DeepL **don't work for Jupyter notebooks**.
- Hence my approach: use Python to automatically distinguish between markdown and code cells, and translate markdown using the "Txtai" module.
- This in turn uses "embeddings".
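
The cell-by-cell approach described above can be sketched as follows. This is a minimal illustration, not the author's actual tool: the function name `translate_markdown_cells` and the `translate` argument are hypothetical, and `str.upper` stands in for a real translation function. The only thing assumed about the notebook is the standard `.ipynb` JSON layout with a `"cells"` list.

```python
import json


def translate_markdown_cells(notebook, translate):
    """Return a copy of the notebook with only its markdown cells translated.

    notebook:  dict in the Jupyter .ipynb JSON layout (with a "cells" list).
    translate: any callable str -> str (a hypothetical placeholder here).
    """
    result = json.loads(json.dumps(notebook))  # deep copy via a JSON round trip
    for cell in result.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue  # code cells are left untouched
        source = cell.get("source", "")
        # The notebook format stores source as one string or as a list of lines
        text = "".join(source) if isinstance(source, list) else source
        cell["source"] = translate(text)
    return result


# Tiny in-memory notebook with one markdown cell and one code cell
demo_notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Title\n", "Some text"]},
        {"cell_type": "code", "metadata": {}, "source": ["print('hello')"],
         "outputs": [], "execution_count": None},
    ],
    "metadata": {}, "nbformat": 4, "nbformat_minor": 5,
}

# str.upper stands in for a real translation function
translated = translate_markdown_cells(demo_notebook, str.upper)
print(translated["cells"][0]["source"])
print(translated["cells"][1]["source"])
```

In practice, a real translation callable would be passed instead of `str.upper`. Translating whole markdown cells may also alter markup such as links or formulas, so a real tool would probably need extra care there.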




## Embeddings

- Deep neural network language models use **"embeddings"** to construct a **"distance measure"** for meanings.
- Embeddings are **vector representations**, such that **close vectors** come from words with **close meaning**.
- They are trained to predict the probability of the next word in a text (**prediction**), given e.g. the previous 2 words.
- Words which frequently appear in the **same context** get **similar embedding** vectors.
- This **minimizes** the **prediction error** in the embeddings domain for the predicted next word.
 - See also the book: Eugene Charniak: "Introduction to Deep Learning", chap. 4.
 - https://en.wikipedia.org/wiki/Word_embedding
 - https://machinelearningmastery.com/what-are-word-embeddings/
- This can be used for e.g. **translations** or **transcriptions** (using Txtai)
- Txtai home page:
 - https://github.com/neuml/txtai
 - "txtai is built with Python 3.7+, Hugging Face Transformers, Sentence Transformers and FastAPI"
 - These are pre-trained neural networks.
- Txtai embeddings examples, showing diagrams:
 - https://neuml.github.io/txtai/embeddings/ 
- The following shows how to install and import it.

In [None]:
!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]
from txtai.embeddings import Embeddings
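
As a toy illustration of the statement above that words appearing in the same context get similar vectors, the following sketch builds simple count-based context vectors. This is a swapped-in technique, not the trained neural embeddings that Txtai uses: it only counts neighbouring words. The corpus and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus: "cat" and "dog" share contexts, "car" does not
corpus = [
    "the cat drinks milk",
    "the dog drinks water",
    "the cat chases the mouse",
    "the dog chases the ball",
    "the car needs fuel",
    "the bus needs fuel",
]


def context_vectors(sentences, window=1):
    """Count, for every word, the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            lo, hi = max(0, i - window), min(len(words), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][words[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


vectors = context_vectors(corpus)
sim_cat_dog = cosine(vectors["cat"], vectors["dog"])
sim_cat_car = cosine(vectors["cat"], vectors["car"])
print("cat vs dog:", sim_cat_dog)
print("cat vs car:", sim_cat_car)
```

Because "cat" and "dog" occur next to the same words in this corpus, their context vectors are closer to each other than those of "cat" and "car".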


## Embeddings Example
- The following example shows how to use Txtai embeddings with a simple text example,
- and also the structure of the embeddings, with their vector dimension.

In [None]:
embeddings = Embeddings({"path": "sentence-transformers/nli-mpnet-base-v2"})
data = [
  "US tops 5 million confirmed virus cases",
  "Canada's last fully intact ice shelf has suddenly collapsed, " +
  "forming a Manhattan-sized iceberg",
  "Beijing mobilises invasion craft along coast as Taiwan tensions escalate",
  "The National Park Service warns against sacrificing slower friends " +
  "in a bear attack",
  "Maine man wins $1M from $25 lottery ticket",
  "Make huge profits without work, earn up to $100,000 a day"
]
embeddings.index([(uid, text, None) for uid, text in enumerate(data)])

query = "feel good story"
# Return the 2 best hits in meaning as a list of tuples: (index, similarity)
emb = embeddings.search(query, 2)
print("(index, similarity) of best 2 hits:", emb)
# e.g. [(4, 0.08329004049301147), (5, 0.010758552700281143)]
index = emb[0][0]  # index of the best hit, instead of a hard-coded 4
print("Best hit data[index]=", data[index])
# Show the dimensions of the embedding vectors:
print("embeddings.info()=")
embeddings.info()  # prints the index configuration, including "dimensions": 768
help(embeddings.info)

(index, similarity) of best 2 hits: [(4, 0.08329012244939804), (5, 0.010758575983345509)]
Best hit data[index]= Maine man wins $1M from $25 lottery ticket
embeddings.info()=
{
  "backend": "faiss",
  "build": {
    "create": "2022-08-20T05:53:18Z",
    "python": "3.7.13",
    "settings": {
      "components": "IDMap,Flat"
    },
    "system": "Linux (x86_64)",
    "txtai": "4.7.0"
  },
  "content": null,
  "dimensions": 768,
  "offset": 6,
  "path": "sentence-transformers/nli-mpnet-base-v2",
  "update": "2022-08-20T05:53:18Z"
}
Help on method info in module txtai.embeddings.base:

info() method of txtai.embeddings.base.Embeddings instance
    Prints the current embeddings index configuration.
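
The `(index, similarity)` pairs printed above can be read as a ranking by closeness of vectors. The sketch below shows the idea with cosine similarity on hypothetical 3-dimensional toy vectors. Whether Txtai computes its scores in exactly this way is an assumption here, and the function names are made up for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def search(query_vec, index_vecs, limit):
    """Return the `limit` best (index, similarity) pairs, best hit first."""
    scores = [(i, cosine_similarity(query_vec, vec))
              for i, vec in enumerate(index_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:limit]


# Hypothetical toy "embeddings" (real ones have e.g. 768 dimensions)
toy_index = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
]
toy_query = [0.5, 0.9, 0.1]

hits = search(toy_query, toy_index, 2)
print("(index, similarity) of best 2 hits:", hits)
```

The vector pointing in nearly the same direction as the query gets the highest score, regardless of vector length.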



## Use for Translation
- This is a toy example in Colab:
https://colab.research.google.com/github/neuml/txtai/blob/master/examples/33_Query_translation.ipynb
- The example translates a given text from English to German.

In [None]:
#!pip install git+https://github.com/neuml/txtai#egg=txtai[pipeline]
from txtai.pipeline import Translation
# Create translation model
translate = Translation()
translate("This is a test translation into German", "de")


'Dies ist eine Testübersetzung ins Deutsche'

## Use for Audio Transcription
- Embeddings can also be created from wav files of spoken text (wav2vec).
- This is a simple Colab example for transcribing given spoken English sound files into text:
 - https://colab.research.google.com/github/neuml/txtai/blob/master/examples/11_Transcribe_audio_to_text.ipynb

## Colab Notebook Examples

- My resulting Colab Jupyter notebook for the **translation of Colab Jupyter notebooks** is this:

 - https://github.com/TUIlmenauAMS/TranslateColabJupyterNotebooks

- This is my practical example for the **audio transcription of a speech recording to text**:

 - https://github.com/TUIlmenauAMS/AudioTranscription

## Image from Text Generation

- For image generation, an image "denoiser" (Disco Diffusion) iteratively denoises an image.
 - The Disco Diffusion model is described here: https://accomplice.ai/models/5dd13fb8-bd26-4d13-9543-608c7e3d27cb
 - Zippy's Disco Diffusion Cheatsheet v0.3 for experiments: https://docs.google.com/document/d/1l8s7uS2dGqjztYSjPpzlmXLjl5PM3IGkRWI3IiCuK7g/mobilebasic
- The denoising is steered such that the image, as judged by another DNN (CLIP: https://openai.com/blog/clip/), comes close in meaning to the provided text.
- This is implemented in **"DiscoArt"**, and its Colab notebook:
 - https://github.com/jina-ai/discoart
 - Try for instance:         
            create(text_prompts='Berlin') 
- **DALL-E** is another text-to-image generator, from OpenAI:
 - https://openai.com/dall-e-2/ 
  - in Colab: https://colab.research.google.com/github/cedro3/others/blob/master/DALL_e_sample.ipynb

 - Another Colab implementation: https://colab.research.google.com/github/borisdayma/dalle-mini/blob/main/tools/inference/inference_pipeline.ipynb
- DALL-E seems to have somewhat better image quality.
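
The idea of steering a result so that it comes close in meaning to a text can be sketched with a toy example. This is not how DiscoArt or CLIP are implemented: both vectors below are hypothetical stand-ins for embeddings, and the loop is plain gradient ascent on the cosine similarity between them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))


def guidance_step(x, target, step=0.1):
    """One gradient-ascent step on cosine(x, target) with respect to x."""
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_t = math.sqrt(sum(b * b for b in target))
    cos = cosine(x, target)
    # Gradient of the cosine similarity with respect to x
    grad = [t / (norm_x * norm_t) - cos * a / (norm_x ** 2)
            for a, t in zip(x, target)]
    return [a + step * g for a, g in zip(x, grad)]


target = [0.2, 0.9, 0.4]   # hypothetical embedding of the text prompt
x = [1.0, -0.5, 0.3]       # hypothetical starting point, far from the target

history = [cosine(x, target)]
for _ in range(200):
    x = guidance_step(x, target)
    history.append(cosine(x, target))

print("similarity at start:", history[0])
print("similarity at end:  ", history[-1])
```

Each step moves the vector a little in the direction that increases its similarity to the target, which is the role the text prompt plays in guiding the image generation.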



## Conclusions
- **Embeddings** can measure a **"distance" between meanings** of two texts
- This can be used for **translations**
 - e.g. of Jupyter notebooks
- or for audio transcription
- It can also be used to **generate images** which have descriptions that are **close in meaning to a provided text**.
- Slides are in: https://github.com/TUIlmenauAMS/PyBerlin