To run this, press "*Runtime*" and press "*Run all*" on a **free** Tesla T4 Google Colab instance!
<div class="align-center">
  <a href="https://github.com/unslothai/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/unsloth%20new%20logo.png" width="115"></a>
  <a href="https://discord.gg/u54VK8m8tk"><img src="https://github.com/unslothai/unsloth/raw/main/images/Discord button.png" width="145"></a>
  <a href="https://ko-fi.com/unsloth"><img src="https://github.com/unslothai/unsloth/raw/main/images/Kofi button.png" width="145"></a></a> Join Discord if you need help + support us if you can!
</div>

To install Unsloth on your own computer, follow the installation instructions on our Github page [here](https://github.com/unslothai/unsloth#installation-instructions---conda).

You will learn how to do [data prep](#Data), how to [train](#Train), how to [run the model](#Inference), & [how to save it](#Save) (eg for Llama.cpp).

In [1]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps "xformers<0.0.26" trl peft accelerate bitsandbytes

* We support Llama, Mistral, CodeLlama, TinyLlama, Vicuna, Open Hermes etc
* And Yi, Qwen ([llamafied](https://huggingface.co/models?sort=trending&search=qwen+llama)), Deepseek, all Llama, Mistral derived archs.
* We support 16bit LoRA or 4bit QLoRA. Both 2x faster.
* `max_seq_length` can be set to anything, since we do automatic RoPE Scaling via [kaiokendev's](https://kaiokendev.github.io/til) method.
* With [PR 26037](https://github.com/huggingface/transformers/pull/26037), we support downloading 4bit models **4x faster**! [Our repo](https://huggingface.co/unsloth) has Llama, Mistral 4bit models.
* [**NEW**] We make Gemma 6 trillion tokens **2.5x faster**! See our [Gemma notebook](https://colab.research.google.com/drive/10NbwlsRChbma1v55m8LAPYG15uQv6HLo?usp=sharing)

In [2]:
from unsloth import FastLanguageModel
import torch
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.

# 4bit pre quantized models we support for 4x faster downloading + no OOMs.
fourbit_models = [
    "unsloth/mistral-7b-bnb-4bit",
    "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    "unsloth/llama-2-7b-bnb-4bit",
    "unsloth/llama-2-13b-bnb-4bit",
    "unsloth/codellama-34b-bnb-4bit",
    "unsloth/tinyllama-bnb-4bit",
    "unsloth/gemma-7b-bnb-4bit", # New Google 6 trillion tokens model 2.5x faster!
    "unsloth/gemma-2b-bnb-4bit",
] # More models at https://huggingface.co/unsloth

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-bnb-4bit", # Choose ANY! eg teknium/OpenHermes-2.5-Mistral-7B
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    # token = "hf_...", # use one if using gated models like meta-llama/Llama-2-7b-hf
)

==((====))==  Unsloth: Fast Mistral patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.


We now add LoRA adapters so we only need to update 1 to 10% of all parameters!

In [3]:
model = FastLanguageModel.get_peft_model(
    model,
    r = 16, # Choose any number > 0 ! Suggested 8, 16, 32, 64, 128
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = False,  # We support rank stabilized LoRA
    loftq_config = None, # And LoftQ
)

Unsloth 2024.4 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


<a name="Data"></a>
### Data Prep
We now use the Alpaca dataset from [yahma](https://huggingface.co/datasets/yahma/alpaca-cleaned), which is a filtered version of 52K of the original [Alpaca dataset](https://crfm.stanford.edu/2023/03/13/alpaca.html). You can replace this code section with your own data prep.

**[NOTE]** To train only on completions (ignoring the user's input) read TRL's docs [here](https://huggingface.co/docs/trl/sft_trainer#train-on-completions-only).

**[NOTE]** Remember to add the **EOS_TOKEN** to the tokenized output!! Otherwise you'll get infinite generations!

If you want to use the `ChatML` template for ShareGPT datasets, try our conversational [notebook](https://colab.research.google.com/drive/1Aau3lgPzeZKQ-98h69CCu1UJcvIBLmy2?usp=sharing).

For text completions like novel writing, try this [notebook](https://colab.research.google.com/drive/1ef-tab5bhkvWmBOObepl1WgJvfvSzn5Q?usp=sharing).

In [4]:
# !pip install datasets
from datasets import Dataset
import pandas as pd

df = pd.read_csv("/content/custom_dataset.csv")

# Assuming you have a pandas DataFrame called df
dataset_dict = {
    "output": df["output"].tolist(),
    "input": df["input"].tolist(),
    "instruction": df["instruction"].tolist(),
}

# Create a Hugging Face Dataset
custom_dataset = Dataset.from_dict(dataset_dict)
custom_dataset

Dataset({
    features: ['output', 'input', 'instruction'],
    num_rows: 247131
})

In [5]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""


EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    for instruction, input, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
pass

# from datasets import load_dataset
# dataset = load_dataset("yahma/alpaca-cleaned", split = "train")
dataset = custom_dataset.map(formatting_prompts_func, batched = True,)

Map:   0%|          | 0/247131 [00:00<?, ? examples/s]

<a name="Train"></a>
### Train the model
Now let's use Huggingface TRL's `SFTTrainer`! More docs here: [TRL SFT docs](https://huggingface.co/docs/trl/sft_trainer). We do 60 steps to speed things up, but you can set `num_train_epochs=1` for a full run, and turn off `max_steps=None`. We also support TRL's `DPOTrainer`!

In [6]:
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 60, # Set num_train_epochs = 1 for full training runs
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)

  self.pid = os.fork()


Map (num_proc=2):   0%|          | 0/247131 [00:00<?, ? examples/s]

TimeoutError: 

In [None]:
#@title Show current memory stats
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

In [None]:
trainer_stats = trainer.train()

In [None]:
#@title Show final memory and time stats
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory         /max_memory*100, 3)
lora_percentage = round(used_memory_for_lora/max_memory*100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

<a name="Inference"></a>
### Inference
Let's run the model! You can change the instruction and input - leave the output blank!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "i have the following symptoms, what disease do i have?", # instruction
        "depressive or psychotic symptoms, asnlvl insomnia, bvdf bsb abnormal involuntary movements,  sdf db chest tightness, irregular heartbeat, breathing fast", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

 You can also use a `TextStreamer` for continuous inference - so you can see the generation token by token, instead of waiting the whole time!

In [None]:
# alpaca_prompt = Copied from above
FastLanguageModel.for_inference(model) # Enable native 2x faster inference
inputs = tokenizer(
[
    alpaca_prompt.format(
        "i have the following symptoms, what disease do i have?", # instruction
        "depressive or psychotic shnnbs symptoms, asnlvl insomnia, bvdf bsb abnormal involuntary shqeh movements,  sdf db chest tightness, irregular heartbeat, breathing fast", # input
        "", # output - leave this blank for generation!
    )
], return_tensors = "pt").to("cuda")

from transformers import TextStreamer
text_streamer = TextStreamer(tokenizer)
_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)

<a name="Save"></a>
### Saving, loading finetuned models
To save the final model as LoRA adapters, either use Huggingface's `push_to_hub` for an online save or `save_pretrained` for a local save.

**[NOTE]** This ONLY saves the LoRA adapters, and not the full model. To save to 16bit or GGUF, scroll down!

In [15]:
# model.save_pretrained("lora_model") # Local saving
# tokenizer.save_pretrained("lora_model")
# model.push_to_hub("nehalahmedshaikh/lora_model", token = "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") # Online saving
# tokenizer.push_to_hub("nehalahmedshaikh/lora_model", token = "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") # Online saving

('lora_model/tokenizer_config.json',
 'lora_model/special_tokens_map.json',
 'lora_model/tokenizer.model',
 'lora_model/added_tokens.json',
 'lora_model/tokenizer.json')

Now if you want to load the LoRA adapters we just saved for inference, set `False` to `True`:

In [7]:
!rm -rf CS-5302-Project-Group-15
!git clone https://github.com/CS-5302/CS-5302-Project-Group-15.git

%pip install -U openai
%pip install llama-index
%pip install llama-index-vector-stores-chroma
%pip install llama-index-storage-store-chroma
%pip install llama-index-llms-huggingface
%pip install llama-index-embeddings-huggingface
%pip install llama_index-response-synthesizers
%pip install llama-index-llms
%pip install llama-index-embeddings
%pip install llama-index-llms-openai
%pip install -U llama-index-core llama-index-llms-openai llama-index-embeddings-openai
%pip install llama-index-llms-replicate
%pip install sounddevice numpy scipy
%pip install keyboard
!sudo apt-get install portaudio19-dev
%pip install pyaudio
%pip install audiorecorder
%pip install streamlit-audiorecorder
%pip install audio-recorder-streamlit
%pip install faster-whisper
%pip install gradio
%pip install mistral-lang
%pip install jsonlines
%pip install langdetect
%pip install gtts

Cloning into 'CS-5302-Project-Group-15'...
remote: Enumerating objects: 896, done.[K
remote: Counting objects: 100% (37/37), done.[K
remote: Compressing objects: 100% (28/28), done.[K
remote: Total 896 (delta 14), reused 24 (delta 9), pack-reused 859[K
Receiving objects: 100% (896/896), 66.44 MiB | 8.73 MiB/s, done.
Resolving deltas: 100% (550/550), done.
Updating files: 100% (77/77), done.
[31mERROR: Could not find a version that satisfies the requirement llama-index-storage-store-chroma (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for llama-index-storage-store-chroma[0m[31m
[31mERROR: Could not find a version that satisfies the requirement llama_index-response-synthesizers (from versions: none)[0m[31m
[0m[31mERROR: No matching distribution found for llama_index-response-synthesizers[0m[31m
[0m[31mERROR: Could not find a version that satisfies the requirement llama-index-llms (from versions: none)[0m[31m
[0m[31mERROR: No matching di

In [8]:
import gradio as gr
import importlib
import re
import os
import pickle
from IPython.display import Markdown
import sys
sys.path.insert(0,'CS-5302-Project-Group-15/')
from python_scripts import machine_translation, text_to_speech, whisper_setup, get_audio, utils
import numpy as np
from scipy.io.wavfile import write
import librosa

In [9]:
from unsloth import FastLanguageModel
max_seq_length = 2048 # Choose any! We auto support RoPE Scaling internally!
dtype = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True # Use 4bit quantization to reduce memory usage. Can be False.
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # YOUR MODEL YOU USED FOR TRAINING
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

inputs = tokenizer(
[
    alpaca_prompt.format(
        "i have the following symptoms, what disease do i have?", # instruction
        "hoarse voice, sore throat", # input
        "", # output - leave this blank for generation!
    ),
], return_tensors = "pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)

==((====))==  Unsloth: Fast Mistral patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


['<s> Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\ni have the following symptoms, what disease do i have?\n\n### Input:\nhoarse voice, sore throat\n\n### Response:\nbased on the symptoms, i think you have laryngitis.</s>']

In [27]:
PATH = os.getcwd()

with open('/content/CS-5302-Project-Group-15/symptom_list.pkl', 'rb') as f:
    symptom_list = pickle.load(f)

root_path = PATH + '/CS-5302-Project-Group-15/Datasets/MeDAL'
audio_path = PATH + '/CS-5302-Project-Group-15/Datasets/Audio_Files'

print(PATH, root_path, audio_path, sep = '\n')

/content
/content/CS-5302-Project-Group-15/Datasets/MeDAL
/content/CS-5302-Project-Group-15/Datasets/Audio_Files


In [28]:
import os
import getpass
import openai
from tqdm.notebook import tqdm
from uuid import uuid4 # assigns unique ID to documents
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader # caveat. SimpleDirectoryReader prefers .txt.
from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
# from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.llms.openai import OpenAI # resp = OpenAI().complete("Paul Graham is ")
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core import Settings # Settings.embed_model = OpenAIEmbedding()
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import get_response_synthesizer
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import PromptTemplate
from IPython.display import Markdown, display
import chromadb
import llama_index
from llama_index.core.prompts import PromptTemplate
from llama_index.llms.replicate import Replicate
from python_scripts import utils
# from gradientai import Gradient

# os.environ['REPLICATE_API_TOKEN'] = getpass.getpass("REPLICATE_API_TOKEN")
# os.environ['GRADIENT_WORKSPACE_ID'] = getpass.getpass("GRADIENT_WORKSPACE_ID")
# os.environ['GRADIENT_ACCESS_TOKEN'] = getpass.getpass("GRADIENT_ACCESS_TOKEN")

class DocumentEmbeddingPipeline:

    """
    A class to manage the process of loading documents, embedding them,
    indexing with ChromaDB, and querying.
    """

    def __init__(self, model_version = "mistralai/mixtral-8x7b-instruct-v0.1", chroma_path = None, fine_tune = False):
        """
        Initialize the pipeline with the necessary configurations.

        :param model_version: Version of the machine learning model to use for embedding documents.
        :param path: Optional path for persistent storage, used for ChromaDB.
        """
        self.model_version = model_version  # Model version for document embedding
        self.fine_tune = fine_tune  # Flag for fine-tuning the model

        self.chroma_path = chroma_path  # Path for ChromaDB storage, if persistent storage is used

    def setup_nous_hermes2(instruction):
      gradient = Gradient()
      # Load the pre-fine-tuned model adapter using the saved ID or state
      model_adapter = gradient.get_model_adapter(model_adapter_id = "bea513a0-b418-4442-8ca1-c5861f851ff6_model_adapter")

      sample_query = instruction + '\n\n### Response:'
      completion = model_adapter.complete(query = sample_query, max_generated_token_count = 100).generated_output

      return completion

    def setup_lora_model(self, model_name, instructions, input):
      max_seq_length   = 2048 # Choose any! We auto support RoPE Scaling internally!
      dtype            = None # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
      load_in_4bit     = True # Use 4bit quantization to reduce memory usage. Can be False.
      model, tokenizer = FastLanguageModel.from_pretrained(
          model_name     = model_name, # YOUR MODEL YOU USED FOR TRAINING
          max_seq_length = max_seq_length,
          dtype          = dtype,
          load_in_4bit   = load_in_4bit,
      )

      FastLanguageModel.for_inference(model) # Enable native 2x faster inference

      alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

      ### Instruction:
      {}

      ### Input:
      {}

      ### Response:
      {}"""

      inputs = tokenizer(
      [
          alpaca_prompt.format(
              instructions, # instruction
              input, # input
              "", # output - leave this blank for generation!
          ),
      ], return_tensors = "pt").to("cuda")

      outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
      # Extract the response part
      start_index = tokenizer.batch_decode(outputs)[0].find("Response:") + len("Response:")
      end_index = tokenizer.batch_decode(outputs)[0].find("</s>", start_index)

      response_part = tokenizer.batch_decode(outputs)[0][start_index:end_index].strip()
      return response_part

    def setup_environment(self):
        """
        Setup the environment by initializing the required settings for embedding and parsing.
        """
        # Service context is a hypothetical construct that manages settings and configurations
        self.service_context = Settings  # Initialize settings for the service context
        # Initialize language model with specific configurations like model version and token limits
        if not self.fine_tune:
          self.service_context.llm = Replicate(model = self.model_version, is_chat_model = True, additional_kwargs = {"max_new_tokens": 512})
          print("GOTTEN MODEL")
        # Define the embedding model to use locally
        self.service_context.embed_model = "local:BAAI/bge-small-en-v1.5"
        # Initialize the node parser for sentence splitting with specified chunk size and overlap
        self.service_context.node_parser = SentenceSplitter()

    def prepare_documents(self, collection_name,  joining = True, persistent = False):

        """
        Load documents from a specified path and initialize a collection in ChromaDB.

        :param path: Path to the directory containing the documents to be processed.
        :param collection_name: Name of the collection to be created or used in ChromaDB.
        :param joining: If true, the documents will be joined into a single string.
        :param persistent: If true, ChromaDB will use persistent storage; otherwise, it uses ephemeral storage.
        """

        self.persistent = persistent  # Set persistent storage requirement
        # Initialize ChromaDB client based on the persistence requirement
        chroma_client = chromadb.PersistentClient(path = self.chroma_path) if persistent else chromadb.Client()
        cl = chroma_client.list_collections()
        # Check if the specified collection already exists in ChromaDB
        if collection_name in cl:
            # If the collection exists, retrieve it
            self.chroma_collection = chroma_client.get_collection(name = collection_name)
        else:
            # If the collection does not exist, create a new one with the specified name and metadata
            self.chroma_collection = chroma_client.create_collection(get_or_create = True, name = collection_name, metadata = {"hnsw:space": 'cosine'})

        idx = 0
        print("HI")
        for file in os.listdir(self.chroma_path):
            file_path = os.path.join(self.chroma_path, file)
            if file_path.endswith(('.jsonl', '.ndjson')):
                idx = idx + 1
                destination_path = os.path.join(self.chroma_path, f'output{idx}.txt')
                utils.jsonl_to_text(file_path, destination_path, 'text')

        required_exts = ['.txt']
        print(f'chroma_path = {self.chroma_path}')
        print("HELLO")
        #maryam[0], nehals[1]
        reader = SimpleDirectoryReader(
            input_dir = os.path.dirname(self.chroma_path),
            required_exts = required_exts,
            recursive = True
        )
        print("HI")
        self.documents = (reader.load_data(True))[:2]
        print(len(self.documents))


    def embed_and_index(self, model_name = "BAAI/bge-small-en-v1.5"):
        """
        Embed the documents using a specified model and index them in ChromaDB.

        :param model_name: Name of the embedding model to use for generating document embeddings.
        """
        # Initialize the embedding model
        if not self.persistent:
            Settings.embed_model = HuggingFaceEmbedding(model_name = model_name)
            # Parse and chunk documents for embedding
            chunks = self.service_context.node_parser.get_nodes_from_documents(self.documents, True)
            # Initialize lists to store texts, embeddings, and metadata
            texts, text_embeds, metadatas = [], [], []

            # Iterate over chunks, embed texts, and prepare metadata
            for chunk in tqdm(chunks, desc='Chunking data'):
                texts.append(chunk.text)
                text_embeds.append(Settings.embed_model.get_text_embedding(chunk.text))
                metadatas.append({'source': self.chroma_collection.name, 'text': chunk.text})

            # Generate unique identifiers for each embedded document
            ids = [str(uuid4()) for _ in range(len(text_embeds))]

            # Add the embedded texts and metadata to the ChromaDB collection
            self.chroma_collection.add(embeddings = text_embeds, documents = texts, metadatas = metadatas, ids = ids)

        print("EMBEDDINGS DONE!!")
        # Prepare a vector store for indexing the documents in ChromaDB
        vector_store = ChromaVectorStore(chroma_collection = self.chroma_collection, add_sparse_vector = True, include_metadata = True)
        storage_context = StorageContext.from_defaults(vector_store = vector_store)

        # Create an index from the documents using the vector store and embedding model
        self.index = VectorStoreIndex.from_documents(self.documents, storage_context, True, service_context = self.service_context)
    def query_data(self, query):
        """
        Query data from the index and display the result.

        :param index: The index object to query from.
        :param query: The query string.
        """
        query_engine = self.index.as_query_engine()
        response = query_engine.query(query)
        return response

In [31]:
def SMTS(Query):
    try:
        # Process the audio input
        file_path = 'output_testing.wav'
        print("JJ")
        write(file_path, data = np.array(Query[1], dtype = np.int16), rate = Query[0])
        print("FF")
        audio_processed = utils.preprocess_audio(file_path)
        print("!!")
        # Transcribe Query to English
        whisper_models = ["tiny", "base", "small", "medium", "large"]
        print("FFG")

        transcript = whisper_setup.transcribe_audio(audio_processed, ['tiny'])
        print("HH")
        text = (transcript['tiny'][2]).lower()
        print(text)

        # Regular expression pattern to match symptoms containing 'or' any symptoms from the list
        pattern = r'\b(?:' + '|'.join(map(re.escape, symptom_list)) + \
        '|'.join('(?:{}|{})'.format(re.escape(symptom.split(' or ')[0]), re.escape(symptom.split(' or ')[1])) \
                 for symptom in symptom_list if ' or ' in symptom) + r')\b'

        # Extract symptoms from the query
        extracted_symptoms = re.findall(pattern, text, flags = re.IGNORECASE)
        print(extracted_symptoms)

        # Feed query into the LLM
        models = {
        'llama_ours': 'ubaidtariq8/llama2-med-genai', # fine tuned model
        'lora_model': 'nehals_fine_tuned_model',
        'nous-hermes2': 'maryams_fine_tuned_model',
        'llama_13b': 'a16z-infra/llama13b-v2-chat:df7690f1994d94e96ad9d568eac121aecf50684a0b0963b25a41cc40061269e5',
        'mixtral': 'mistralai/mixtral-8x7b-instruct-v0.1',
        'llama_70b': 'meta/llama-2-70b-chat:2796ee9483c3fd7aa2e171d38f4ca12251a30609463dcfd4cd76703f22e96cdf'
        }

        fine_tune = input('Please specify which pipeline to use. Press 1 for Pipeline 1 (No fine-tuning), 2 for Pipeline 2 \n')
        model_option = ''

        if fine_tune == '2':
            model_option = 'lora_model' if input('Please specify which fine-tuned model to use. Press 1 for Mistral 7B, 2 for Nous-Hermes2 \n') == '1' else 'nous-hermes2'
        else:
          model_option = 'mixtral'

        model = DocumentEmbeddingPipeline(model_version = models[model_option], chroma_path = root_path)
        model.setup_environment()
        model.prepare_documents(collection_name = "muqeem", joining = True, persistent = True)
        model.embed_and_index()

        instructions = 'You are a medical doctor. A patient has come to you for desperate need of help. Give as accurate diagnosis as possible on the symptoms listed. '
        input_lora = ', '.join(extracted_symptoms) + '. Also consider the whole query ' + text + ' ' + 'Give also suggestions for mitigating the problem.'
        query = instructions + input_lora

        # Pipeline 1
        if model_option == 'mixtral':
          response = model.query_data(query)
        # Pipeline 2
        elif model_option == 'lora_model':
          response = model.setup_lora_model("lora_model", instructions, input_lora)
        else:
            response = model.setup_nous_hermes2(query) # clean response if needed and bring it into pure string format

        # Translate it back to the user language
        translated_text = machine_translation.translate_text(text = response, src_lang = 'en', trg_lang = transcript['tiny'][0])
        print(translated_text)
        # Step 5: Now speak the response in the user's language
        audio_answer_path = audio_path + '/audio.wav'
        text_to_speech.multilingual_text_to_speech(text = translated_text, filepath = audio_answer_path)
        utils.sasti_harkat(audio_answer_path)
        arr, sr = librosa.load(audio_answer_path)
        # display(Markdown(f"<b>{translated_text}</b>"))
        return text, translated_text, (sr, arr)
    except Exception as e:
        print("An error occurred:", e)

In [None]:
# Launch the Gradio Interface
demo = gr.Interface(
    fn = SMTS,
    inputs = [gr.Audio(label = 'Get your Voice Heard! üîç', sources = ['microphone'])],
    outputs = [gr.Textbox(label = "We have heard your Voice! üëÇ"), gr.Textbox(label = "This is what we recommend üìã"), gr.Audio(label = 'Press Play to Listen to your Mecial Report üîä')],
    allow_flagging = 'never',
    theme = 'gradio/base',
    title = 'SymptoCare: Your Personalized Healthcare Assistant ü§ñ',
    description = '''## Welcome to SymptoCare! üåü
    Discover the power of seamless communication in healthcare with SymptoCare, your personalized healthcare assistant!
    ### How It Works:
    1. üé§ Speak your symptoms.
    2. üîÑ Let SymptoCare translate them into actionable insights.
    3. üó®Ô∏è Engage with your healthcare provider like never before!''',

    article = '''### What We Offer:
    - üó£Ô∏è Break language barriers with ease.
    - üì≤ Translate your symptoms into accurate diagnoses.
    - ü§ù Empower your healthcare journey with personalized care.

    ### Join Us Today:
    Get started now and take control of your healthcare journey! Check our [Github](https://github.com/CS-5302/CS-5302-Project-Group-15) here! Do give us a star if you like our work! üòÄ'''
)

demo.launch(debug = True)

Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. This cell will run indefinitely so that you can see errors and logs. To turn off, set debug=False in launch().
Running on public URL: https://51976b330e7ba54cfd.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)


JJ
An error occurred: 'NoneType' object is not subscriptable


Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/gradio/queueing.py", line 527, in process_events
    response = await route_utils.call_process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/route_utils.py", line 270, in call_process_api
    output = await app.get_blocks().process_api(
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1856, in process_api
    data = await self.postprocess_data(fn_index, result["prediction"], state)
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1634, in postprocess_data
    self.validate_outputs(fn_index, predictions)  # type: ignore
  File "/usr/local/lib/python3.10/dist-packages/gradio/blocks.py", line 1610, in validate_outputs
    raise ValueError(
ValueError: An event handler (SMTS) didn't receive enough output values (needed: 3, received: 1).
Wanted outputs:
    [<gradio.components.textbox.Textbox object at 0x7cdaff6e7fd0>, <gradio.components.textbox.Text

JJ
FF
!!
FFG
HH
 i have fever, fatigue, difficulty, and breathing.
['fever', 'fatigue']
GOTTEN MODEL
HI
chroma_path = /content/CS-5302-Project-Group-15/Datasets/MeDAL
HELLO
HI


Loading files: 100%|‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà‚ñà| 5/5 [00:00<00:00,  6.98file/s]

2
EMBEDDINGS DONE!!





Parsing nodes:   0%|          | 0/2 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/67 [00:00<?, ?it/s]

==((====))==  Unsloth: Fast Mistral patching release 2024.4
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.2.1+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.25.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


Unused kwargs: ['quant_method']. These kwargs are not used in <class 'transformers.utils.quantization_config.BitsAndBytesConfig'>.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


based on the symptoms, i think you have pneumonia.
Detected language: en
Speech saved to /content/CS-5302-Project-Group-15/Datasets/Audio_Files/audio.wav
