# Part 2 (21 Marks)

## The Task
Pick one of the interactions from Part 1 and implement it, using audio input and output on
your computer. Instead of putting you into a telephone queue (or doing some other action),
the program will just print out the queue/action it would have done.
There is a step-by-step breakdown and marking rubric later in this document.

## Format
Submit your git repository via github classroom. You can either submit a Jupyter notebook,
or you can submit a program. Writing it as a program is probably more fun though. You
don't need to include any models you have downloaded in the git repo if the program can
download them from the internet.

## Use of ChatGPT
Absolutely yes, do use any large language model you like to help you code this faster, and
help you out when things don't work.

# Step-by-step breakdown

1. Start by writing a program that can play audio through the speakers (or headphones). Test that the system audio is enabled. There are many dfferent audio file formats (MP3, WAV, AU); if you need to, ask ChatGPT for ways to load these files into a format that your program can use. I found the __sounddevice__ library easy to set up and use. If you get stuck, just print something out with print() and use it as a placeholder.

2. Then write a program that can record 3 seconds of audio. I found it helpful to print something out to mark the time that it started recording, and then print something out 3 seconds later. If you get stuck, record some audio samples on your phone and copy them on to your computer.

3. Write a program that takes an audio  le and transcribes it. This is a handy little program to have around anyway; I write a lot of stu  using speech recognition nowadays. If you get stuck, you can use input() and have someone type the transcript in instead.

In [1]:
# As mentioned in part 1, we need to install this library. 
# To work with audio files in Python, you can use libraries like sounddevice, pydub, or speech_recognition. 
#!pip install sounddevice

In [7]:
!brew --version
!python --version
#!keras --version
#!transformers --version
print(tf.__version__)

Homebrew 4.2.11
Python 3.10.13
2.16.1


In [38]:
# Libraries
import sounddevice as sd
import scipy.io.wavfile as wavfile
import soundfile as sf
import speech_recognition as sr
from gtts import gTTS
import os
import tensorflow as tf

# LLM Libraries
from transformers import pipeline
import ollama

In [30]:
file_path  = "audio2.wav"
recorded_audio = "recorded_audio.wav"

In [31]:
# Function 1: Play Audio

def play_audio(file_name):
    # Load data
    fs, data = wavfile.read(file_name)

    # Play data
    sd.play(data, fs)
    status = sd.wait()

play_audio(file_path)

In [32]:
# Function 2: Record Audio

def record_audio():
    # Define the duration of the recording in seconds
    duration = 3  # seconds

    # Define the sampling frequency
    fs = 44100  # Hz

    # Start recorder with the given sampling frequency and channels as input
    # The default input device will be used
    print("Recording Audio Now...")
    recording = sd.rec(int(duration * fs), samplerate=fs, channels=1, dtype='int16')

    # Record audio for the given number of frames
    sd.wait()  # Wait until recording is finished

    # Stop recorder
    sd.stop()

    print("Audio Recording Complete!")

    # Save the recording to a .wav file
    sf.write('recorded_audio.wav', recording, fs)

In [6]:
record_audio()

Recording Audio Now...
Audio Recording Complete!


In [33]:
# Use the plau_audio function to play the recorded audio
play_audio(recorded_audio)

In [34]:
# Function 3: Audio to text

def audio_to_text(file_name):
    # Create a recognizer object
    r = sr.Recognizer()

    # Load an audio file
    with sr.AudioFile(file_name) as f:
        audio_data = r.record(f)

    # Transcribe the audio data
    text = r.recognize_google(audio_data)
    print("Transcription: ", text)

In [14]:
# Speech recognition with Whisper pipeline

transcriber = pipeline(task="automatic-speech-recognition", model="openai/whisper-small")

In [15]:
# Function 4: Speech recognition with Whisper

def speech_recognition(file_name):
    result = transcriber(file_name)
    #print(result)
    return result

In [15]:
audio_to_text(recorded_audio)

Transcription:  hello how's it going


In [34]:
speech_recognition(recorded_audio)

{'text': ' I want to cancel my service.'}

# Step-by-step Breakdown

4. Record some audio files in your best sounding voice (or get someone else to do it, find a text-to-speech system or find some samples on the internet). You'll need a greeting, and a confirmation message for each dfferent intent.

5. Play around with a few different (local) generative language models. This will involve a lot of downloading and waiting, so do this somewhere where you have a lot of band-width. (Or when you have a lot of chores to do.) Come up with a prompt that includes some examples and make sure that it works with your intents. Once you have that figured out, turn it into a program that prints out what the language model returned.

6. Now modify that program so that you can identify the intent that is being mentioned in the output. Maybe you will look for a keyword being used. You can see whether the model has a JSON mode. That might make it easier to process the model's output. Remember that these generative language models aren't always super-reliable, so don't stress out if you can't get something to work 100% of the time. 80% is good enough for this project. Also, don't stress out if it takes a long time to run. It's easier for debugging if you can make it faster (you could see if there's a streaming mode... maybeyou can stop processing after a few tokens).

In [25]:
# Program 4: Text to speech
# I am using (Google Text-to-Speech) library in Python to generate text-to-speech audio files. 

# Set up the text to be converted to speech
text = "Welcome to ABC telecommunication services, how may I help you today?"

# Create a gTTS object and set the language and slow parameter
speech = gTTS(text=text, lang='en-us', tld ='co.in', slow=False) #tld ='co.in'

# Save the audio data to a temporary file
speech.save("temp.mp3")

# Convert the temporary file to a .wav file using ffmpeg
os.system("ffmpeg -y -i temp.mp3 -f wav -acodec pcm_s16le greeting.wav")

# Remove the temporary file
os.remove("temp.mp3")

ffmpeg version 6.1.1 Copyright (c) 2000-2023 the FFmpeg developers
  built with Apple clang version 15.0.0 (clang-1500.1.0.2.5)
  configuration: --prefix=/opt/homebrew/Cellar/ffmpeg/6.1.1_4 --enable-shared --enable-pthreads --enable-version3 --cc=clang --host-cflags= --host-ldflags='-Wl,-ld_classic' --enable-ffplay --enable-gnutls --enable-gpl --enable-libaom --enable-libaribb24 --enable-libbluray --enable-libdav1d --enable-libharfbuzz --enable-libjxl --enable-libmp3lame --enable-libopus --enable-librav1e --enable-librist --enable-librubberband --enable-libsnappy --enable-libsrt --enable-libssh --enable-libsvtav1 --enable-libtesseract --enable-libtheora --enable-libvidstab --enable-libvmaf --enable-libvorbis --enable-libvpx --enable-libwebp --enable-libx264 --enable-libx265 --enable-libxml2 --enable-libxvid --enable-lzma --enable-libfontconfig --enable-libfreetype --enable-frei0r --enable-libass --enable-libopencore-amrnb --enable-libopencore-amrwb --enable-libopenjpeg --enable-libopen

In [35]:
# Function 5: text to speech 

def text_to_speech(text):
    """
    Converts the given text to speech, saving the audio as a WAV file.

    Parameters:
    - text (str): The text to convert to speech.

    The function saves the generated audio file as 'assistant_reply.wav'.
    """
    # Create a gTTS object
    speech = gTTS(text=text, lang='en', tld='co.in', slow=False)

    # Save the audio data to a temporary MP3 file
    temp_file = "temp.mp3"
    speech.save(temp_file)

    # Convert the temporary MP3 file to a WAV file using ffmpeg
    wav_file = "assistant_reply.wav"
    os.system(f"ffmpeg -y -i {temp_file} -acodec pcm_s16le -ar 44100 -ac 1 {wav_file}")

    # Remove the temporary MP3 file
    os.remove(temp_file)

    print("Speech conversion completed and saved as 'assistant_reply.wav'")

In [36]:
greeting = "greeting.wav"

play_audio(greeting)

In [37]:
# Program 5: LLMs for Identifying Intent (Intent Recognition)

# Initialize the text classification pipeline
classifier = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

In [40]:
# Function 6: Classify Intent and Respond

def identify_intent_and_respond(transcribed_text):
    # Classify the intent of the transcribed text
    intent_prediction = classifier(transcribed_text)[0]
    print(f"Detected intent: {intent_prediction['label']} with confidence {intent_prediction['score']}")

    # Map the predicted intent to a predefined response
    # This is a simplified approach; you might need a more complex mapping strategy
    for intent, response in intents_responses.items():
        if intent in transcribed_text.lower():
            return response
    return "I'm sorry, I didn't understand that. Can you please repeat?"

## Zero shot learning with Ollama

In [None]:
# Initialize the model and tokenizer
# Note: Replace "gpt-3" with the actual model identifier you're using, such as "EleutherAI/gpt-neo-2.7B" or another suitable model
#model_name = "EleutherAI/gpt-neo-2.7B" # This is a placeholder. Replace with the model you intend to use.
#generator = pipeline('text-generation', model=model_name)

In [49]:
# Create a model function

intent1 = "I want to inquire about new plans being offered."
intent2 = "I am not satisified with my service, I want to cancel it."
intent3 = "blah blah blahhhhhh blah!"

def ollama_model(text):
    # Construct the few-shot learning prompt
    prompt = """
    1. 
    Customer: I want to check my balance.
    Assistant: Sure, I can help with that. I'll need to verify your account details first. Is that alright with you?

    2. 
    Customer: Can you help me with my recent charges?
    Assistant: Absolutely, I'll take a look into your recent billing history for any charges. Is that correct?

    3. 
    Customer: I need to update my payment method.
    Assistant: No problem. I'll direct you to our secure payment update portal. Is that what you want me to do?

    4. 
    Customer: I want to cancel the service.
    Assistant: I am sorry to hear that you want to cancel our service. I would now connect you to our team member for further assistance. I hope thats ok with you.

    Customer: {}""".format(text)  # Assuming text is the input from the customer

    prompt += "\n Based on the above conversations select the main category that prompt is based on"

    #Initiate the model
    response = ollama.generate(model='gemma:2b', prompt=prompt + text)['response']

    print(response)

    # def extract_response_for_intent(response_dict):
    #     # Extract the full response string from the dictionary
    #     full_response_text = response_dict['response']
    
    #     # Assuming each response starts with a number and a period as shown in your output
    #     # We split the full response text into individual responses
    #     individual_responses = full_response_text.split("\n\n")
    
    #     # The latest response of interest might be the last in the list
    #     # This depends on how the responses are organized; you might need to adjust
    #     latest_response = individual_responses[-1]
    
    #     # Further processing to clean or isolate the part of the response you need
    #     # For example, if you need to remove certain known leading characters or phrases, you can do so here
    
    #     return latest_response

    # # Using the function
    # response_text = extract_response_for_intent(response)
    # print("Extracted Response:", response_text)


    # Assuming 'response' is a dictionary that contains the generated text
    # You'll need to adjust the following line based on the actual structure of 'response'
#    response_text = response['generated_text'].split('Assistant:')[-1].strip()

#    print("Generated Response:", response_text)
#    return response_text  # To use this text in further steps

    # Generate a response
    #responses = generator(prompt, max_length=200, num_return_sequences=1)
    #response_text = responses[0]['generated_text'].split('Assistant:')[-1].strip()

    #print("Generated Response:", response_text)

ollama_model(intent1)

These are all examples of customer service responses using natural language processing (NLP) for customer service chatbots. 

**Here's a breakdown of the responses:**

1. **"Sure, I can help with that. I'll need to verify your account details first. Is that alright with you?"** This is a clear and straightforward request for account verification.


2. **"Absolutely, I'll take a look into your recent billing history for any charges. Is that correct?"** This shows empathy and attention to customer needs.


3. **"I need to update my payment method. Can you help me with that?"** This expresses a specific need for assistance with payment methods.


4. **"I am sorry to hear that you want to cancel the service. I would now connect you to our team member for further assistance. I hope thats ok with you."** This handles a cancellation request with empathy and offers further assistance.


5. **"I want to inquire about new plans being offered.I want to inquire about new plans being offered."** Th

## Few shot learning on DistilBERT

In [20]:
# Prepare the data

dataset = [
    {"text": "How do I pay my bill?", "intent": "billing"},
    {"text": "I want to see my latest charges.", "intent": "billing"},
    {"text": "Is there an issue with the network?", "intent": "technical support"},
    {"text": "My internet is very slow.", "intent": "technical support"},
    {"text": "I would like to buy more data for my plan.", "intent": "purchase data plan"},
    {"text": "How can I upgrade my current data package?", "intent": "purchase data plan"},
]

# Convert intents to numerical labels
intent_to_label = {intent: i for i, intent in enumerate(set([item['intent'] for item in dataset]))}
labels = [intent_to_label[item['intent']] for item in dataset]
texts = [item['text'] for item in dataset]

In [21]:
# Tokenisation and dataset creation

from transformers import DistilBertTokenizer
from tensorflow.keras.utils import to_categorical
import numpy as np

tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

# Tokenize the texts
input_ids = []
attention_masks = []

for text in texts:
    encoded_dict = tokenizer.encode_plus(
                        text,                      # Sentence to encode
                        add_special_tokens = True, # Add '[CLS]' and '[SEP]'
                        max_length = 64, # Pad & truncate all sentences
                        padding = 'max_length',
                        truncation = True,
                        return_attention_mask = True,   # Construct attn. masks
                        return_tensors = 'tf',     # Return TensorFlow tensors
                   )
    
    input_ids.append(encoded_dict['input_ids'])
    attention_masks.append(encoded_dict['attention_mask'])

input_ids = np.concatenate(input_ids, axis=0)
attention_masks = np.concatenate(attention_masks, axis=0)
labels = np.array(labels)

# Convert labels to one-hot
labels = to_categorical(labels, num_classes=len(intent_to_label))

In [23]:
# Fine-tuning the model

from transformers import TFDistilBertForSequenceClassification
from tensorflow.keras.optimizers import Adam

model = TFDistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased') #, num_labels=len(intent_to_label)

optimizer = Adam(learning_rate=5e-5)
loss = 'categorical_crossentropy' # Since we're using one-hot labels
metric = 'accuracy'

model.compile(optimizer=optimizer, loss=loss, metrics=[metric])

# Prepare the data as TensorFlow dataset
import tensorflow as tf

train_dataset = tf.data.Dataset.from_tensor_slices(({"input_ids": input_ids, "attention_mask": attention_masks}, labels))
train_dataset = train_dataset.batch(8) # You can adjust the batch size

# Fine-tune the model
model.fit(train_dataset, epochs=3) # Adjust the epochs as needed

TypeError: 'NoneType' object is not callable

In [50]:
# Save the model

model.save_pretrained('./my_intent_model')

In [None]:
# Load the fine-tuned model

from transformers import TFDistilBertForSequenceClassification, DistilBertTokenizer

model_path = './my_intent_model'
model = TFDistilBertForSequenceClassification.from_pretrained(model_path)
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')

In [None]:
# modified Intent Recognition Function

def identify_intent_and_respond(model, tokenizer, text):
    # Tokenize the text
    encoded_input = tokenizer(text, return_tensors='tf', padding=True, truncation=True, max_length=64)
    
    # Predict
    output = model(encoded_input)
    prediction = tf.nn.softmax(output.logits, axis=-1)
    predicted_intent_id = tf.argmax(prediction, axis=1).numpy()[0]  # Assuming batch_size=1 for simplicity

    # Map predicted intent ID back to intent label
    label_to_intent = {value: key for key, value in intent_to_label.items()}
    predicted_intent = label_to_intent[predicted_intent_id]

    # Define responses for each intent
    intents_responses = {
        "billing": "It sounds like you have a question about your bill. I'll connect you with our billing department.",
        "technical support": "It seems you're experiencing technical issues. Let me put you through to technical support.",
        "purchase data plan": "You'd like to purchase a data plan. I'll direct you to our sales team for assistance."
    }

    # Fetch the appropriate response
    response = intents_responses.get(predicted_intent, "I'm sorry, I didn't understand that. Can you please repeat?")
    return response

## Putting It All Together

Now, integrate these components into your IVR system flow:

Play the greeting message.

1. Record the customer's speech.
2. Transcribe the speech to text.
3. Identify the intent and generate a response.
4. Play or use text-to-speech to communicate the response back to the customer.

In [42]:
# final function which combines all

def run_ivr_system():
    # Step 1: Play greeting
    play_audio("greeting.wav")
    
    # Step 2: Record the customer's request
    record_audio()
    
    # Step 3: Transcribe the speech
    transcribed_text = speech_recognition("recorded_audio.wav")['text']
    print(f"Transcribed Text: {transcribed_text}")

    # Step 4: Identify the intent and generate response using ollama
    ollama_model(transcribed_text)

    # Step 5: Use TTS to communicate the response 
    response_audio = text_to_speech(response_text)  # This is a placeholder for a text-to-speech function
    play_audio("assistant_reply.wav")

# Finally, call the function to run the IVR system
run_ivr_system()

Recording Audio Now...
Audio Recording Complete!
Transcribed Text:  you


KeyError: 'generated_text'

In [None]:
    # Step 4: Identify the intent and generate response
    #response = identify_intent_and_respond(model, tokenizer, transcribed_text)
    #print(f"Assistant Response: {response}")
    
    # Step 4: Identify the intent and select a response
    #response = identify_intent_and_respond(transcribed_text)
    #print(f"Assistant Response: {response}")