<a href="https://colab.research.google.com/github/NegarMirgati/llm_engineering_quests/blob/main/week3/Week_3_exercise_synthetic_data_generation.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Create meeting minutes from an Audio file

I downloaded some Denver City Council meeting minutes and selected a portion of the meeting for us to transcribe. You can download it here:  
https://drive.google.com/file/d/1N_kpSojRR5RYzupz6nqM8hMSoEF_R7pU/view?usp=sharing

If you'd rather work with the original data, the HuggingFace dataset is [here](https://huggingface.co/datasets/huuuyeah/meetingbank) and the audio can be downloaded [here](https://huggingface.co/datasets/huuuyeah/MeetingBank_Audio/tree/main).

The goal of this product is to use the audio to generate meeting minutes, including action items.

For this project, you can either use the Denver meeting minutes, or you can record something of your own!

## Please note:

When you run the pip installs in the first cell below, you might see the error shown here. It sounds quite severe, but it can be safely ignored: it doesn't seem to affect anything else in this project!


> ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
gcsfs 2024.10.0 requires fsspec==2024.10.0, but you have fsspec 2024.9.0 which is incompatible.



In [None]:
# Quote the version specifier so the shell does not treat ">" as an output redirect
!pip install -q requests torch "transformers>=4.45.1" sentencepiece accelerate openai httpx==0.27.2 gradio bitsandbytes

In [None]:
!pip install -U bitsandbytes

In [None]:
# imports

import os
import requests
from IPython.display import Markdown, display, update_display
from openai import OpenAI
from google.colab import drive
from huggingface_hub import login
from google.colab import userdata
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, BitsAndBytesConfig
import torch

In [None]:
# Constants: short keys mapped to Hugging Face model ids
model_mapping = {
    "LLAMA": "meta-llama/Meta-Llama-3.1-8B-Instruct",
    "PHI3": "microsoft/Phi-3-mini-4k-instruct",
    "QWEN2": "Qwen/Qwen2-7B-Instruct",
}


# Download denver_extract.mp3

You can either use the same file as me, the extract from the Denver City Council minutes, or you can try your own.

If you want to use the same file as me, please download my extract here and put it on your Google Drive:  
https://drive.google.com/file/d/1N_kpSojRR5RYzupz6nqM8hMSoEF_R7pU/view?usp=sharing


In [None]:
# Sign in to HuggingFace Hub

hf_token = userdata.get('HF_TOKEN')
login(hf_token, add_to_git_credential=True)

In [None]:
# Sign in to OpenAI using Secrets in Colab

openai_api_key = userdata.get('OPENAI_API_KEY')
openai = OpenAI(api_key=openai_api_key)

In [None]:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4"
)
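
As a rough sanity check on why 4-bit loading matters here, the sketch below estimates the memory taken by the weights alone at 16-bit and 4-bit precision. The parameter counts are approximations assumed from the model names, not values measured in this notebook, and the estimate ignores quantization constants, activations, and the KV cache, so real usage will be higher.

```python
# Approximate parameter counts in billions (assumed, not measured in this notebook)
APPROX_PARAMS_BILLIONS = {"LLAMA": 8.0, "PHI3": 3.8, "QWEN2": 7.6}


def weight_memory_gb(params_billions, bits_per_param):
    """Estimate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_total = params_billions * 1e9 * bits_per_param / 8
    return bytes_total / 1e9


for name, size in APPROX_PARAMS_BILLIONS.items():
    fp16 = weight_memory_gb(size, 16)
    nf4 = weight_memory_gb(size, 4)
    print(f"{name}: ~{fp16:.1f} GB in 16-bit, ~{nf4:.1f} GB in 4-bit")
```

Under these assumptions an 8B model drops from about 16 GB to about 4 GB of weights, which is what makes the 12GB budget mentioned in `load_model` plausible for one model at a time.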

In [None]:
model_cache = {}

In [None]:
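import gc


def load_model(model_key):
    # Reuse the model if it is already loaded
    if model_key in model_cache:
        return model_cache[model_key]

    # Drop any other loaded model first (to stay under 12GB).
    # gc.collect() releases the dropped references so empty_cache() can return the GPU memory.
    model_cache.clear()
    gc.collect()
    torch.cuda.empty_cache()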

    model_id = model_mapping[model_key]

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_type="nf4"
    )

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        device_map="auto",
        quantization_config=quant_config
    )

    model_cache[model_key] = (model, tokenizer)
    return model, tokenizer


In [None]:
load_model("LLAMA")

In [None]:
# get_reviews prompts the selected model for synthetic book entries.
# It is wired into a custom Gradio UI further down, since we're not using the preset Chat interface.
# Passing in inbrowser=True to demo.launch() will cause a Gradio window to pop up immediately.
import gradio as gr
import re


def get_reviews(dataset_size, model_name):
    try:
      system_prompt = f"""Please generate detailed book reviews from provided book information, including summary, key themes, strengths, weaknesses, and genre analysis. \
      Respond only with valid JSON, in a list. Do not write an introduction or summary or include roles like assistant in the output. \
      Note that I don't want code or anything. Just a list containing book info in JSON format for {dataset_size} random books.\
      Make sure if you are generating more than one JSON, put them in a LIST, separated by comma. """
      user_prompt = f"""Below are a few example entries of a dataset about books. \
      I want you to generate more synthetic data like this so that I can have a dataset of {dataset_size} entries like these. \
      Please generate more synthetic data following this format and without including the above instructions in the output. \
      Respond only with valid JSON. Do not write an introduction or summary. \
      Here is one example of a dataset containing two instances: \
        [ {{
        "title": "The Silent Echo",
        "author": "Emily Watson",
        "year_of_release": 2021,
        "genre": "Mystery",
        "avg_rating": 4.3,
        "num_ratings": 1245,
        "summary": "In the quiet town of Willow Creek, a series of mysterious disappearances \
        leaves the locals on edge. Detective Sarah Hayes, known for her sharp instincts, is called \
        in to unravel the case. As she digs deeper into the hidden secrets of the town, she discovers \
        a chilling connection between the victims and a decades-old unsolved crime. With each step closer \
        to the truth, Sarah finds herself caught in a dangerous game of cat and mouse, where nothing is as it \
        seems and the past refuses to stay buried."}} \
        , {{
        "title": "The Lost Garden",
        "author": "David Miller",
        "year_of_release": 2018,
        "genre": "Historical Fiction",
        "avg_rating": 4.7,
        "num_ratings": 3221,
        "summary": "Set against the backdrop of World War II, 'The Lost Garden' tells the story of Lily Adams, \
        a young British woman who volunteers as a gardener for the Women’s Land Army. While working on a secluded \
        estate in the English countryside, she uncovers a hidden garden that has been abandoned for years. As she \
        brings the garden back to life, she begins to uncover secrets about the estate's mysterious past, including \
        a love story that could change her own future forever. Through the garden's blooms, Lily learns the power \
        of resilience, love, and the healing that nature can bring."}} ]"""
      messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt}
      ]
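      model, tokenizer = load_model(model_name)
      # Apply the chat template, tokenize, and move the prompt to the GPU
      inputs = tokenizer.apply_chat_template(messages, return_tensors="pt", add_generation_prompt=True).to("cuda")
      streamer = TextStreamer(tokenizer)
      output = model.generate(inputs, max_new_tokens=2000, streamer=streamer)
      # Decode only the newly generated tokens, skipping the prompt
      response = tokenizer.batch_decode(
          output[:, inputs.shape[1]:],
          skip_special_tokens=True)
      # Strip any leftover special tokens of the form <|...|>
      cleaned_text = re.sub(r'<\|.*?\|>', '', response[0]).strip()
      # Drop the local references; the model itself stays in model_cache for reuse
      del model, tokenizer, inputs, output
      torch.cuda.empty_cache()
      return cleaned_text
    except Exception as e:
      print(f"Error: {e}")
      return None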
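
Because `get_reviews` returns the raw model text, it can help to confirm that the text really contains a JSON list before using it as a dataset. The sketch below is one possible check, assuming the model may wrap the list in extra prose. `extract_json_list` and `REQUIRED_KEYS` are hypothetical names introduced for illustration; the keys are taken from the example entries in the prompt.

```python
import json

# Field names taken from the example entries in the prompt
REQUIRED_KEYS = {"title", "author", "year_of_release", "genre",
                 "avg_rating", "num_ratings", "summary"}


def extract_json_list(text):
    """Parse the outermost [...] span in text and keep only complete entries.

    Hypothetical helper: returns [] when no valid JSON list can be parsed.
    """
    start = text.find("[")
    end = text.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        data = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(data, list):
        return []
    # Keep only dict entries that contain every expected field
    return [item for item in data
            if isinstance(item, dict) and REQUIRED_KEYS.issubset(item)]


sample = ('Here you go: [{"title": "T", "author": "A", "year_of_release": 2020, '
          '"genre": "G", "avg_rating": 4.0, "num_ratings": 10, "summary": "S"}, '
          '{"title": "incomplete"}]')
books = extract_json_list(sample)
print(len(books), books[0]["title"])
```

Entries that miss a field are dropped silently here; depending on the use case, it may be preferable to log them or to ask the model to regenerate.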

In [None]:
get_reviews(2, "LLAMA")

In [None]:
with gr.Blocks() as demo:
    with gr.Row():
        dataset_size = gr.Textbox(label="Dataset Size:")
    with gr.Row():
        model_name = gr.Dropdown(["LLAMA", "PHI3", "QWEN2"], label="Model:")
    with gr.Row():
        submit_btn = gr.Button("Get Dataset")
    with gr.Row():
        output = gr.JSON(label="Dataset:")

    # Wire the button to get_reviews; the result is rendered in the JSON panel
    submit_btn.click(fn=get_reviews, inputs=[dataset_size, model_name], outputs=output)

demo.launch(inbrowser=True)
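
The `gr.Textbox` above hands its value to `get_reviews` as a string, so the dataset size reaches the prompt unvalidated. A minimal sketch of a guard that could run at the start of the callback is shown below. `parse_dataset_size` is a hypothetical helper, and the default cap of 10 is an assumption chosen only because generation is limited to 2000 new tokens.

```python
def parse_dataset_size(raw, max_size=10):
    """Convert the textbox value to an int in [1, max_size]; raise ValueError otherwise."""
    try:
        size = int(str(raw).strip())
    except ValueError:
        raise ValueError(f"Dataset size must be a whole number, got {raw!r}") from None
    if not 1 <= size <= max_size:
        raise ValueError(f"Dataset size must be between 1 and {max_size}, got {size}")
    return size


print(parse_dataset_size(" 3 "))
```

Raising a clear error early avoids spending a full generation pass on a prompt that asks for "abc" or 5000 entries.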