<a href="https://colab.research.google.com/github/25b3nk/llm-projects/blob/main/dataset_creator.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create a dataset generator

We will build a dataset generator that produces a dataset from a description supplied by the user. Then we will add a Gradio UI on top of it.

## Please note:

When you run the pip installs in the first cell below, you might see the following error. It sounds severe, but it can be safely ignored; it doesn't seem to affect anything else in this project.


> ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
> gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.


In [1]:
!pip install -q requests torch bitsandbytes transformers sentencepiece accelerate openai httpx==0.27.2


In [2]:
# imports

import os
import requests
from IPython.display import Markdown, display, update_display
from openai import OpenAI
from google.colab import drive
from huggingface_hub import login
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [3]:
# Constants

AUDIO_MODEL = "whisper-1"
LLAMA = "meta-llama/Meta-Llama-3.1-8B-Instruct"

In [None]:
# New capability - connect this Colab to my Google Drive
# See immediately below this for instructions to obtain denver_extract.mp3

drive.mount("/content/drive")
audio_filename = "/content/drive/MyDrive/llms/denver_extract.mp3"

Mounted at /content/drive


# Download denver_extract.mp3

You can either use the same file as me, an extract from Denver city council minutes, or try your own.

If you want to use the same file, please download my extract here and put it on your Google Drive:  
https://drive.google.com/file/d/1N_kpSojRR5RYzupz6nqM8hMSoEF_R7pU/view?usp=sharing


In [None]:
# Sign in to HuggingFace Hub

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# Sign in to OpenAI using Secrets in Colab

openai_api_key = userdata.get('OPENAI_API_KEY')
openai = OpenAI(api_key=openai_api_key)

In [None]:
# Use the Whisper OpenAI model to convert the Audio to Text
# If you'd prefer to use an Open Source model, class student Youssef has contributed an open source version
# which I've added to the bottom of this colab

# Open the file in a context manager so the handle is closed after the upload
with open(audio_filename, "rb") as audio_file:
    transcription = openai.audio.transcriptions.create(model=AUDIO_MODEL, file=audio_file, response_format="text")
print(transcription)

In [4]:
system_message = "You are an assistant that produces datasets based on description provided."
user_input = "Film critics of 1900s"
user_prompt = f"Below is the description for which you need to generate dataset.\n{user_input}"

messages = [
    {"role": "system", "content": system_message},
    {"role": "user", "content": user_prompt}
  ]
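
The prompt construction above can be wrapped in a small helper, so that a different description (for example, one typed into a UI later) produces a fresh message list. This is a minimal sketch: the name `build_messages` and the empty-input check are assumptions for illustration, not part of the original notebook.

```python
# Hypothetical helper: wraps the prompt construction shown above.
SYSTEM_MESSAGE = "You are an assistant that produces datasets based on description provided."


def build_messages(description: str) -> list[dict[str, str]]:
    """Return a chat-style message list for the given dataset description."""
    description = description.strip()
    if not description:
        raise ValueError("description must not be empty")
    user_prompt = (
        "Below is the description for which you need to generate dataset.\n"
        f"{description}"
    )
    return [
        {"role": "system", "content": SYSTEM_MESSAGE},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Film critics of 1900s")
```

Keeping the system message in one constant makes it easier to adjust the instructions without touching the call sites.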


In [5]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
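
As a rough illustration of why 4-bit loading matters on a Colab GPU, the sketch below estimates the memory needed for the weights alone. It assumes roughly 8 billion parameters (read off the "8B" in the model name) and ignores activations, the KV cache, and quantization metadata, so real usage will be higher than these figures.

```python
# Back-of-the-envelope estimate of weight memory; all numbers are assumptions.
PARAMS = 8e9  # approximate parameter count implied by "8B"


def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


bf16_gb = weight_memory_gb(PARAMS, 16)  # unquantized 16-bit weights
nf4_gb = weight_memory_gb(PARAMS, 4)    # 4-bit quantized weights
print(f"bfloat16 weights: ~{bf16_gb:.0f} GB")
print(f"4-bit weights:    ~{nf4_gb:.0f} GB")
```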

In [6]:
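tokenizer = AutoTokenizer.from_pretrained(LLAMA)
# Reuse the end-of-sequence token for padding
tokenizer.pad_token = tokenizer.eos_token
# Render the chat messages with the model's template and move the token ids to the GPU
inputs = tokenizer.apply_chat_template(messages, return_tensors="pt").to("cuda")
# Print tokens to the cell output as they are generated
streamer = TextStreamer(tokenizer)
# Load the model with the 4-bit quantization config defined above
model = AutoModelForCausalLM.from_pretrained(LLAMA, device_map="auto", quantization_config=quant_config)
outputs = model.generate(inputs, max_new_tokens=2000, streamer=streamer)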

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are an assistant that produces datasets based on description provided.<|eot_id|><|start_header_id|>user<|end_header_id|>

Below is the description for which you need to generate dataset.
Film critics of 1900s<|eot_id|><|start_header_id|>assistant<|end_header_id|>

To generate a dataset for film critics of the 1900s, I'll create a sample dataset with the following variables:

1. **Name**: The name of the film critic.
2. **Year**: The year the critic was active.
3. **Nationality**: The nationality of the critic.
4. **Notable Work**: A notable publication or work associated with the critic.
5. **Influence**: A measure of the critic's influence on film criticism (on a scale of 1-10).
6. **Quote**: A notable quote from the critic.

Here's a sample dataset with 10 film critics from the 1900s:

| Name | Year | Nationality | Notable Work | Influence | Quote |
| --- | 

In [7]:
response = tokenizer.decode(outputs[0])
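
The decoded string still contains the prompt and Llama's special tokens, as the streamed output above shows. One way to keep only the assistant's reply is plain string handling on the header and end-of-turn markers. In the sketch below, `extract_assistant_reply` is a hypothetical helper and `sample` is a shortened stand-in for the real `response`.

```python
ASSISTANT_HEADER = "<|start_header_id|>assistant<|end_header_id|>"
END_OF_TURN = "<|eot_id|>"


def extract_assistant_reply(decoded: str) -> str:
    """Return the text after the last assistant header, without the end-of-turn marker."""
    # rpartition returns the whole string as the last element if the header is absent.
    _, _, reply = decoded.rpartition(ASSISTANT_HEADER)
    # Drop the end-of-turn marker and anything after it.
    reply = reply.split(END_OF_TURN, 1)[0]
    return reply.strip()


# Shortened stand-in for the decoded model output.
sample = (
    "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
    "Film critics of 1900s<|eot_id|>"
    "<|start_header_id|>assistant<|end_header_id|>\n\n"
    "Here's a sample dataset.<|eot_id|>"
)
reply = extract_assistant_reply(sample)
print(reply)
```

Another common approach is to slice off the prompt tokens before decoding and pass `skip_special_tokens=True` to `tokenizer.decode`.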

In [8]:
display(Markdown(response))

<|begin_of_text|><|start_header_id|>system<|end_header_id|>

Cutting Knowledge Date: December 2023
Today Date: 26 Jul 2024

You are an assistant that produces datasets based on description provided.<|eot_id|><|start_header_id|>user<|end_header_id|>

Below is the description for which you need to generate dataset.
Film critics of 1900s<|eot_id|><|start_header_id|>assistant<|end_header_id|>

To generate a dataset for film critics of the 1900s, I'll create a sample dataset with the following variables:

1. **Name**: The name of the film critic.
2. **Year**: The year the critic was active.
3. **Nationality**: The nationality of the critic.
4. **Notable Work**: A notable publication or work associated with the critic.
5. **Influence**: A measure of the critic's influence on film criticism (on a scale of 1-10).
6. **Quote**: A notable quote from the critic.

Here's a sample dataset with 10 film critics from the 1900s:

| Name | Year | Nationality | Notable Work | Influence | Quote |
| --- | --- | --- | --- | --- | --- |
| James Quayle Dealey | 1900 | American | The Moving Picture World | 8 | "The cinema is a powerful instrument for the education of the masses." |
| Theodore Dreiser | 1905 | American | The American Mercury | 9 | "The movies are a reflection of the American people's desire for escape and fantasy." |
| Gilbert Seldes | 1908 | American | The Seven Lively Arts | 8 | "The cinema is a democratic art, accessible to all and yet capable of elevating the masses." |
| Vachel Lindsay | 1910 | American | The Art of the Moving Picture | 7 | "The movies are a new form of poetry, a new way of telling stories." |
| Hugo Münsterberg | 1912 | German-American | The Film: A Subversive Art | 8 | "The cinema is a powerful tool for shaping public opinion and influencing behavior." |
| Gilbert Chesterton | 1915 | British | The Witness | 6 | "The movies are a harmless form of entertainment, a distraction from the troubles of the world." |
| I.A.L. Diamond | 1917 | American | The New York World | 5 | "The cinema is a reflection of the American people's love of spectacle and entertainment." |
| Gilbert Seldes | 1920 | American | The Seven Lively Arts | 8 | "The movies are a new form of art, a new way of expressing the human experience." |
| Benjamin B. Hampton | 1922 | American | The Miracle of the Movies | 7 | "The cinema is a powerful instrument for shaping public opinion and influencing behavior." |
| Robert S. Birchard | 1925 | American | The Art of the Cinema | 6 | "The movies are a reflection of the American people's desire for escape and fantasy." |

This dataset provides a sample of film critics from the 1900s, along with their notable works, influence, and quotes.<|eot_id|>
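
Since the goal is a dataset rather than rendered Markdown, the table in the reply can be converted into structured rows. The sketch below parses a pipe-delimited Markdown table into dictionaries and serializes them as CSV, using only the standard library. `parse_markdown_table` is a hypothetical helper, the sample uses two rows copied from the output above, and it assumes a single table with a header and separator row and no literal `|` inside a cell.

```python
import csv
import io


def parse_markdown_table(text: str) -> list[dict[str, str]]:
    """Parse the pipe-delimited rows in text (assumes one table: header, separator, body)."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # not a table row
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    if len(rows) < 2:
        return []
    header, body = rows[0], rows[2:]  # rows[1] is the | --- | separator
    # Skip malformed rows whose cell count does not match the header.
    return [dict(zip(header, cells)) for cells in body if len(cells) == len(header)]


sample = """
| Name | Year | Nationality | Notable Work | Influence | Quote |
| --- | --- | --- | --- | --- | --- |
| Vachel Lindsay | 1910 | American | The Art of the Moving Picture | 7 | "The movies are a new form of poetry, a new way of telling stories." |
| Gilbert Seldes | 1920 | American | The Seven Lively Arts | 8 | "The movies are a new form of art, a new way of expressing the human experience." |
"""

records = parse_markdown_table(sample)

# Serialize to CSV in memory; write to a file instead if needed.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(records[0].keys()))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Generated rows should still be checked before use: in the output above, Gilbert Seldes appears twice and at least one quote is repeated verbatim under two different names.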

## Alternative implementation

Class student Youssef has contributed this variation, which uses an open-source model to transcribe the meeting audio.

Thank you Youssef!

In [None]:
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline

AUDIO_MODEL = "openai/whisper-medium"
speech_model = AutoModelForSpeechSeq2Seq.from_pretrained(AUDIO_MODEL, torch_dtype=torch.float16, low_cpu_mem_usage=True, use_safetensors=True)
speech_model.to('cuda')
processor = AutoProcessor.from_pretrained(AUDIO_MODEL)

pipe = pipeline(
    "automatic-speech-recognition",
    model=speech_model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch.float16,
    device='cuda',
    return_timestamps=True,
)

Device set to use cuda


In [None]:
# Use the open-source Whisper pipeline defined above to convert the audio to text
result = pipe(audio_filename)

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.


In [None]:
transcription = result["text"]
print(transcription)

NameError: name 'result' is not defined
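
Because the pipeline is created with `return_timestamps=True`, the result is expected to carry a `chunks` list with per-segment timestamps alongside `text`. The sketch below formats such a structure. `format_chunks` is a hypothetical helper and `fake_result` is invented sample data standing in for a real pipeline result, so the exact shape should be checked against an actual run.

```python
def format_chunks(result: dict) -> str:
    """Render timestamped chunks as '[start - end] text' lines."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: the end time may be None for a final, cut-off segment.
        end_label = f"{end:.1f}s" if end is not None else "end"
        lines.append(f"[{start:.1f}s - {end_label}] {chunk['text'].strip()}")
    return "\n".join(lines)


# Invented sample data shaped like the expected pipeline output.
fake_result = {
    "text": " Good evening. The meeting will come to order.",
    "chunks": [
        {"timestamp": (0.0, 1.5), "text": " Good evening."},
        {"timestamp": (1.5, None), "text": " The meeting will come to order."},
    ],
}
formatted = format_chunks(fake_result)
print(formatted)
```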