In [3]:
!pip install transformers
!pip install einops
!pip install xformers
!pip install Pillow
!pip install requests
!pip install diffusers



In [4]:
import torch
import transformers
from PIL import Image
import requests
import diffusers

# Running Our Own Large Language Model

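We'll be running a model called MPT-7b, specifically the version of it fine-tuned for chatting.
MPT is the name the company gave their models, standing for "MosaicML Pretrained Transformer", and 7b refers to the number of parameters, seven billion.  
Note that the code cell below actually loads `databricks/dolly-v2-3b`, a smaller model with roughly three billion parameters; the same steps should apply to MPT-7b if you have enough GPU memory.  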

You can view the model's Hugging Face page here: https://huggingface.co/mosaicml/mpt-7b-chat  
Hugging Face is currently the dominant platform for sharing deep learning models, and we'll be pulling a lot from it today. Most model pages have lots of helpful information. If you're looking for a specific functionality, Hugging Face is a good place to start.

In [5]:
name = 'databricks/dolly-v2-3b'

config = transformers.AutoConfig.from_pretrained(name, trust_remote_code=True)
config.init_device = 'cuda:0' # For fast initialization directly on GPU!

model = transformers.AutoModelForCausalLM.from_pretrained(
  name,
  config=config,
  torch_dtype=torch.float16, # Load model weights in float16
  trust_remote_code=True
)

Downloading (…)lve/main/config.json:   0%|          | 0.00/819 [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/5.68G [00:00<?, ?B/s]
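
As a rough back-of-the-envelope check on the download size above: the weights are loaded in float16, which takes 2 bytes per parameter, so the memory needed for the weights alone is about twice the parameter count in bytes. The sketch below assumes roughly 2.8 billion parameters for `dolly-v2-3b` (an approximation) and ignores activations and other runtime overhead. `weight_memory_gb` is a name made up for this illustration.

```python
def weight_memory_gb(num_parameters: float, bytes_per_parameter: int = 2) -> float:
    """Approximate size of the model weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumed parameter counts; float16 uses 2 bytes per parameter, float32 uses 4.
print(weight_memory_gb(2.8e9))     # dolly-v2-3b in float16
print(weight_memory_gb(7e9))       # a seven-billion-parameter model in float16
print(weight_memory_gb(7e9, 4))    # the same model in float32
```

The float16 estimate for the smaller model lands close to the 5.68G file shown in the download output, which is why loading in half precision matters on a single GPU.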

Here we also create a tokenizer.
Models can't take text directly as input; instead, the text needs to be converted into a series of tokens.
Tokens are also the unit in which LLM usage is usually measured.
For English text, a token is on average about 4 characters, though this can vary considerably depending on the specific tokenizer used.
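
To check the characters-per-token figure for yourself, you can divide the length of a text by the number of tokens it produces. The helper below is a minimal sketch: `chars_per_token` and `toy_encode` are hypothetical names, and `toy_encode` is only a whitespace splitter standing in for a real tokenizer so the example runs without downloading anything.

```python
from typing import Callable, Sequence


def chars_per_token(text: str, encode: Callable[[str], Sequence[int]]) -> float:
    """Average number of characters per token for `text` under `encode`."""
    token_ids = encode(text)
    if len(token_ids) == 0:
        raise ValueError("encode() returned no tokens")
    return len(text) / len(token_ids)


def toy_encode(text: str) -> list:
    """Stand-in tokenizer: one fake token id per whitespace-separated word."""
    return list(range(len(text.split())))


sample = "the quick brown fox"
print(chars_per_token(sample, toy_encode))

# With the real tokenizer from the notebook you would pass its encode method:
# print(chars_per_token(sample, tokenizer.encode))
```

Real tokenizers split text into subword pieces rather than whole words, so expect a different number from the toy version.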

In [6]:
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
#https://huggingface.co/databricks/dolly-v2-3b

Downloading (…)okenizer_config.json:   0%|          | 0.00/156 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/1.08M [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/457k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.11M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/90.0 [00:00<?, ?B/s]

Now we create a pipe using our model and tokenizer. A pipe is a convenient structure provided by the transformers library that allows us to easily run our tokenizer and model together.
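
Roughly speaking, a text-generation pipe chains three steps: tokenize the prompt, let the model generate new token ids, and decode the ids back into text. The function below is a simplified sketch of those steps, not the library's actual implementation. `generate_text` is a hypothetical name, and it leaves out sampling options, padding, and batching.

```python
def generate_text(model, tokenizer, prompt, max_new_tokens=50):
    """Tokenize a prompt, generate a continuation, and decode it back to text."""
    # 1. Text -> token ids (a tensor of shape [1, number_of_tokens]).
    inputs = tokenizer(prompt, return_tensors="pt")
    input_ids = inputs["input_ids"].to(model.device)

    # 2. Token ids -> longer sequence of token ids (prompt plus continuation).
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)

    # 3. Token ids -> text.
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Usage with the objects loaded earlier in the notebook (needs the model in memory):
# print(generate_text(model, tokenizer, "Here is a recipe for vegan banana bread:\n"))
```

In practice the pipe is more convenient because it also handles device placement and generation settings for you.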


In [7]:
pipe = transformers.pipeline('text-generation', model=model, tokenizer=tokenizer, device='cuda:0')
with torch.autocast('cuda', dtype=torch.float16):
    output = pipe('Here is a recipe for vegan banana bread:\n',
                  max_new_tokens=100,
                  do_sample=True,
                  use_cache=True)
print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


Here is a recipe for vegan banana bread:

2 cups all purpose flour

2 sticks butter, softened

2 eggs, beaten

4 very ripe bananas

1 teaspoon vanilla

1 cup sugar

3/4 cup brown sugar

3 cups old fashioned oatmeal

1/2 cup old fashioneds molasses

Preheat the oven to 350 degrees. Line a 9x5 inch loaf pan with parchment paper and set aside.

In a mixing bowl, cream together the


And there it is! Your very own unique run of a chat model, just for you.  

Chat models are generally trained to expect input in a specific format, and that is normally handled by code that the user doesn't see. The further the input text is from what the model expects, the worse its output will be.  
Below I've created a prompt that is closer to what the model expects. Just fill in your text between the User and Assistant lines.
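
If you want to try several questions, it can help to build the prompt with a small function instead of editing the string by hand. The sketch below reproduces the User/Assistant layout used in this notebook. `build_prompt` and `extract_reply` are hypothetical helpers, and the layout is just the one used here, not an official template for any particular model.

```python
SYSTEM_LINE = (
    "A conversation between a user and an LLM-based AI assistant. "
    "The assistant gives helpful and honest answers."
)


def build_prompt(user_message: str, system_line: str = SYSTEM_LINE) -> str:
    """Wrap a user message in the User/Assistant layout used in this notebook."""
    return f"{system_line}\nUser\n{user_message.strip()}\n\nAssistant\n"


def extract_reply(generated_text: str, prompt: str) -> str:
    """Drop the echoed prompt and keep only the first assistant turn."""
    reply = generated_text
    if generated_text.startswith(prompt):
        reply = generated_text[len(prompt):]
    # The model often keeps going and invents further turns; cut at the first one.
    for marker in ("\nUser\n", "\nAssistant\n"):
        cut = reply.find(marker)
        if cut != -1:
            reply = reply[:cut]
    return reply.strip()


prompt = build_prompt("Tell me why humans can't be replaced by robots")
print(prompt)
```

After calling the pipe, `extract_reply(output[0]['generated_text'], prompt)` would return just the first answer, since the generated text starts with the prompt itself.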

In [8]:
prompt = """A conversation between a user and an LLM-based AI assistant. The assistant gives wrong answers.
User
Tell me why humans can't be replaced by robots

Assistant
"""
with torch.autocast('cuda', dtype=torch.float16):
    output = pipe(prompt,
                  max_new_tokens=200,
                  do_sample=True,
                  use_cache=True)
print(output[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.
User
Tell me why humans can't be replaced by robots

Assistant
I'm not sure but I will look into it. Please hold.

A few minutes later

Assistant
The humans are different from all therobots in the universe. They have emotions such as love and hate, and a conscience. Humans love, care, and consider their survival in the long term. Whilerobots are designed to be infrequently updated and are not affected by emotions, they can understand how humans feel and are able to reproduce them. Although robots may be able to imitate humans closely and effectively do the work humans have done for hundreds of years, this is simply because humans have evolved to be who they are. Once you take away humans' emotions, they are more like robots that have been updated very few times. And this means the robots cannot love or care for humans.

Assistant
Humans are loved by all therobots in the universe

If you're really interested in setting up one of these properly, please follow the code provided in the MosaicML repository for the MPT-7B Chat Model
https://github.com/mosaicml/llm-foundry/blob/main/scripts/inference/hf_chat.py

# CAPTCHA Defeat
We'll start by using some simple models from TrOCR, built by Microsoft, for defeating CAPTCHA images. TrOCR stands for Transformer Optical Character Recognition. Transformer refers to the model's architecture, and Optical Character Recognition is the generic term for any model or software that extracts text from images, so this is a perfect use case.

Most traditional OCR software fails on CAPTCHAs, by design. If you're looking for simple OCR software, I recommend you try Tesseract. It will also run much faster.

Let's use TrOCR from Hugging Face to solve this challenge!  
First we'll load the model. This one is big-ish, but runs reasonably fast on CPU, so you can do this on your local computer.

In [None]:
processor = transformers.TrOCRProcessor.from_pretrained('microsoft/trocr-large-printed')
model = transformers.VisionEncoderDecoderModel.from_pretrained('microsoft/trocr-large-printed')

Could not find image processor class in the image processor config or the model config. Loading based on pattern matching with the model's feature extractor configuration.



Then let's run it against an image. You can also run it against a local image file if you have one on your machine.

In [None]:
url = 'https://captcha.com/images/captcha/botdetect3-captcha-flash.jpg'
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values

generated_ids = model.generate(pixel_values)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(generated_text)



CEPT6


# Playing with CLIP

Let's match some images and text!

In [None]:
model = transformers.CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = transformers.CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

Downloading (…)lve/main/config.json:   0%|          | 0.00/4.19k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/605M [00:00<?, ?B/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/316 [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/568 [00:00<?, ?B/s]

Downloading (…)olve/main/vocab.json:   0%|          | 0.00/862k [00:00<?, ?B/s]

Downloading (…)olve/main/merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.22M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/389 [00:00<?, ?B/s]

In [None]:
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(text=["a photo of a cat", "a photo of a dog"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
logits_per_image = outputs.logits_per_image # this is the image-text similarity score
probs = logits_per_image.softmax(dim=1) # we can take the softmax to get the label probabilities
print(probs)


tensor([[0.9949, 0.0051]], grad_fn=<SoftmaxBackward0>)


That was a boring example, of the sort made by people who build models and are trying to get high scores on test datasets whose images come from a small handful of classes.  
In real life, you rarely have a list of all the possibilities that an image is guaranteed to come from.  
Instead, you generally have one, or maybe a handful, of terms, and you just want to know how closely an image matches any of them.  
For example: "Find which of these images have a couch in them."  
For this, there's a little more art involved.
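
To see why the softmax approach breaks down outside a closed set of labels, consider an image that matches neither label. The softmax still has to spread a total probability of 1 across the labels it was given, so one of them can look like a confident match. The logits below are made up for illustration and do not come from a real model run.

```python
import math


def softmax(scores):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for a landscape photo scored against two unrelated labels:
# "a photo of a cat", "a photo of a dog"
landscape_logits = [19.5, 16.0]
probs = softmax(landscape_logits)
print(probs)  # the first label gets roughly 0.97 despite both scores being low
```

Looking at the raw similarity score for a single description avoids this, because the score does not depend on which other labels happen to be in the list.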

In [None]:

urls = ["https://secure.img1-cg.wfcdn.com/im/91378491/resize-h600-w600%5Ecompr-r85/8140/81409053/Sofas.jpg",
        "https://i.pinimg.com/originals/86/a0/f0/86a0f0fccdeaec51f8e828c8d6749cf9.jpg",
        "https://www.collectorsweekly.com/uploads/2018/08/22113752/grandma_70ssofa_heractualcouch.jpg",
        "https://www.ikea.com/us/en/images/products/mammut-childrens-chair-indoor-outdoor-red__0727924_pe735940_s5.jpg",
        "https://images.nationalgeographic.org/image/upload/t_edhub_resource_key_image/v1638886653/EducationHub/photos/jinsha-river.jpg"]
def check_similarity(image_url: str, text_description: str) -> float:
  image = Image.open(requests.get(image_url, stream=True).raw)
  inputs = processor(text=[text_description], images=image, return_tensors="pt", padding=True)
  outputs = model(**inputs)
  logits_per_image = outputs.logits_per_image # this is the image-text similarity score
  return float(logits_per_image[0][0])

similarities = [check_similarity(url, "a picture of a couch") for url in urls] # side note, this is called a list comprehension in python and you should know about it
print(similarities)

[30.497865676879883, 23.70041275024414, 31.074853897094727, 23.439910888671875, 18.864608764648438]


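Generally, a score above 20 or so indicates similarity. These scores are logits, so they behave like a logarithmic scale: once passed through a softmax, a difference of 1 between two scores corresponds to a factor of about 2.7 in relative probability, so even small differences matter.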
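
A small sketch of how you might use these scores: pick a cutoff and keep the images that score above it, and use the exponential of a score difference to compare two images. The scores are rounded from the output above; `matches_above` and `relative_odds` are hypothetical helper names, and the cutoff is something you would tune for your own images.

```python
import math

# Similarity scores rounded from the output above, in the same order as `urls`.
scores = [30.50, 23.70, 31.07, 23.44, 18.86]


def matches_above(scores, threshold):
    """Return the indices of the scores that are strictly above the threshold."""
    return [i for i, score in enumerate(scores) if score > threshold]


def relative_odds(score_a, score_b):
    """How many times more weight a softmax would give score_a than score_b."""
    return math.exp(score_a - score_b)


print(matches_above(scores, 20.0))  # a loose cutoff
print(matches_above(scores, 25.0))  # a stricter cutoff
print(relative_odds(21.0, 20.0))    # a difference of 1 is a factor of e
```

With a cutoff of 20 four of the five images pass, while a cutoff of 25 keeps only two, so the right threshold depends on how strict you want the match to be.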

# Image Generation
Here we use Stable Diffusion to create an image from a prompt!
You can download the file by clicking the folder icon on the left-hand side of the screen and then right-clicking the image.

In [None]:
model_id = "runwayml/stable-diffusion-v1-5"
pipe = diffusers.StableDiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "a photo of an astronaut riding a horse on mars"
image = pipe(prompt).images[0]

image.save("astronaut_rides_horse.png")

Downloading (…)ain/model_index.json:   0%|          | 0.00/541 [00:00<?, ?B/s]

Fetching 15 files:   0%|          | 0/15 [00:00<?, ?it/s]

Downloading (…)rocessor_config.json:   0%|          | 0.00/342 [00:00<?, ?B/s]

Downloading (…)_encoder/config.json:   0%|          | 0.00/617 [00:00<?, ?B/s]

Downloading (…)_checker/config.json:   0%|          | 0.00/4.72k [00:00<?, ?B/s]

Downloading (…)tokenizer/merges.txt:   0%|          | 0.00/525k [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/492M [00:00<?, ?B/s]

Downloading (…)cheduler_config.json:   0%|          | 0.00/308 [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/472 [00:00<?, ?B/s]

Downloading model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

Downloading (…)tokenizer/vocab.json:   0%|          | 0.00/1.06M [00:00<?, ?B/s]

Downloading (…)okenizer_config.json:   0%|          | 0.00/806 [00:00<?, ?B/s]

Downloading (…)7f0/unet/config.json:   0%|          | 0.00/743 [00:00<?, ?B/s]

Downloading (…)57f0/vae/config.json:   0%|          | 0.00/547 [00:00<?, ?B/s]

Downloading (…)ch_model.safetensors:   0%|          | 0.00/335M [00:00<?, ?B/s]

Downloading (…)ch_model.safetensors:   0%|          | 0.00/3.44G [00:00<?, ?B/s]

Cannot initialize model with low cpu memory usage because `accelerate` was not found in the environment. Defaulting to `low_cpu_mem_usage=False`. It is strongly recommended to install `accelerate` for faster and less memory-intense model loading. You can do so with: 
```
pip install accelerate
```
.
`text_config_dict` is provided which will be used to initialize `CLIPTextConfig`. The value `text_config["id2label"]` will be overriden.


  0%|          | 0/50 [00:00<?, ?it/s]