#  Install Required Libraries  
We install all the required Python packages such as `transformers`, `accelerate`, `datasets`, `soundfile`, `scipy`, `groq`, and `python-dotenv`.  
These are needed for speech recognition, text generation, and text-to-speech.


In [24]:
!pip install transformers accelerate datasets soundfile scipy groq python-dotenv
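
After installation, it can help to confirm that each package is actually importable before running the rest of the notebook. The following is a minimal sketch, not part of the original notebook: the helper name `missing_modules` is hypothetical, and the list uses import names rather than pip names (for example, `python-dotenv` is imported as `dotenv`).

```python
import importlib.util

def missing_modules(module_names):
    """Return the top-level module names that cannot be found on the import path."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names (not pip names) for the packages installed above.
required = ["transformers", "accelerate", "datasets", "soundfile", "scipy", "groq", "dotenv"]
print("Missing modules:", missing_modules(required))
```

An empty list means every package resolved; any name that is printed needs to be installed again.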



#  Import Libraries  
We import the necessary libraries for:  
- Audio transcription with Whisper  
- Text generation with Hugging Face models  
- Voice output (`gTTS` itself is installed and imported later, in the text-to-speech step)  
- Environment management and file handling in Colab


In [25]:
import os
import torch
import soundfile as sf
from transformers import pipeline, AutoProcessor, BarkModel
from groq import Groq
from google.colab import files
from dotenv import load_dotenv

# Set API Key  
We enter the **Groq API key** with `getpass()`, which hides the input as it is typed.  
The key is stored in the `GROQ_API_KEY` environment variable so later code can read it without hard-coding it in the notebook.

In [26]:
import os
from getpass import getpass

groq_api_key = getpass(" Enter your Groq API key: ")
os.environ["GROQ_API_KEY"] = groq_api_key
print(" Groq API key set")


 Enter your Groq API key: ··········
 Groq API key set
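
When the notebook is re-run, prompting every time can be avoided by checking the environment first. The sketch below is one possible variant, assuming a hypothetical helper named `get_api_key`; the `prompt` parameter is an illustrative addition that makes the function testable without keyboard input.

```python
import os
from getpass import getpass

def get_api_key(env_var="GROQ_API_KEY", prompt=getpass):
    """Return the key from the environment, prompting only when it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        key = prompt(f"Enter value for {env_var}: ").strip()
        if not key:
            raise ValueError(f"{env_var} must not be empty")
        # Store the key so later cells can read it from the environment.
        os.environ[env_var] = key
    return key
```

Calling `groq_api_key = get_api_key()` would then replace the direct `getpass()` call in the cell above.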


#  Upload Audio File  
We upload an audio file (`mp3`, `m4a`, or `wav`) from the local computer into Google Colab.  
The filename is sanitized (spaces replaced with underscores, parentheses removed) to avoid errors.


In [32]:
from google.colab import files
import os

print("Upload your audio file (mp3/m4a/wav)")
uploaded = files.upload()
audio_file = list(uploaded.keys())[0]
print("Uploaded:", audio_file)

# --- Sanitize filename (spaces -> underscores, parentheses removed) ---
safe_file = audio_file.replace(" ", "_").replace("(", "").replace(")", "")
if safe_file != audio_file:
    os.rename(audio_file, safe_file)
    audio_file = safe_file

print("Safe filename:", audio_file)


Upload your audio file (mp3/m4a/wav)


Saving File.m4a to File (5).m4a
Uploaded: File (5).m4a
Safe filename: File_5.m4a
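
The chained `replace()` calls above handle spaces and parentheses only. A more general variant, sketched here as an assumption rather than the notebook's own method, uses a regular expression to keep a small set of safe characters; the function name `sanitize_filename` is hypothetical.

```python
import re

def sanitize_filename(name):
    """Replace whitespace runs with underscores and drop characters outside [A-Za-z0-9._-]."""
    name = re.sub(r"\s+", "_", name.strip())
    return re.sub(r"[^A-Za-z0-9._-]", "", name)

print(sanitize_filename("File (5).m4a"))  # File_5.m4a
```

This produces the same result as the original code for the uploaded file, while also removing characters such as `&` or quotes that can cause trouble on a command line.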


#  Convert Audio to WAV Format  
If the uploaded file is not already in `.wav` format, it is converted to a **16 kHz mono WAV file** using **FFmpeg**.  
This ensures compatibility with the Whisper ASR model.


In [33]:
import subprocess

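def convert_m4a_to_wav(m4a_file, wav_file):
    """Convert any ffmpeg-readable audio file to a 16 kHz mono 16-bit PCM WAV file."""
    # -y overwrites an existing output file instead of prompting, which matters on re-runs
    command = ["ffmpeg", "-y", "-i", m4a_file, "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", wav_file]
    process = subprocess.run(command, capture_output=True, text=True)
    if process.returncode != 0:
        # ffmpeg writes its banner and diagnostics to stderr
        print("❌ ffmpeg error:")
        print(process.stderr)
    else:
        print(f"Converted {m4a_file} → {wav_file}")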

wav_audio = "converted_audio.wav"

# Only convert if it's not already .wav
if not audio_file.lower().endswith(".wav"):
    convert_m4a_to_wav(audio_file, wav_audio)
else:
    wav_audio = audio_file


❌ ffmpeg error:
ffmpeg version 4.4.2-0ubuntu0.22.04.1 Copyright (c) 2000-2021 the FFmpeg developers
  built with gcc 11 (Ubuntu 11.2.0-19ubuntu1)
  configuration: --prefix=/usr --extra-version=0ubuntu0.22.04.1 --toolchain=hardened --libdir=/usr/lib/x86_64-linux-gnu --incdir=/usr/include/x86_64-linux-gnu --arch=amd64 --enable-gpl --disable-stripping --enable-gnutls --enable-ladspa --enable-libaom --enable-libass --enable-libbluray --enable-libbs2b --enable-libcaca --enable-libcdio --enable-libcodec2 --enable-libdav1d --enable-libflite --enable-libfontconfig --enable-libfreetype --enable-libfribidi --enable-libgme --enable-libgsm --enable-libjack --enable-libmp3lame --enable-libmysofa --enable-libopenjpeg --enable-libopenmpt --enable-libopus --enable-libpulse --enable-librabbitmq --enable-librubberband --enable-libshine --enable-libsnappy --enable-libsoxr --enable-libspeex --enable-libsrt --enable-libssh --enable-libtheora --enable-libtwolame --enable-libvidstab --enable-libvorbis --enab
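
Because a failed conversion only prints an error, it can be worth checking the WAV file before transcription. The sketch below uses Python's standard `wave` module to read the header and compare it with the format requested from FFmpeg above (mono, 16000 Hz, 16-bit samples). The helper names `wav_format` and `is_16k_mono_pcm` are hypothetical.

```python
import wave

def wav_format(path):
    """Return (channels, sample_rate_hz, sample_width_bytes) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels(), wav.getframerate(), wav.getsampwidth()

def is_16k_mono_pcm(path):
    """True when the file is mono, 16 kHz, 16-bit PCM, matching the ffmpeg flags above."""
    return wav_format(path) == (1, 16000, 2)
```

For example, `is_16k_mono_pcm(wav_audio)` could be checked before calling the transcription step; `wave.open` raises an error if the file is missing or is not a valid PCM WAV file.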

#  Transcribe Audio  
We use the **Whisper model (`openai/whisper-small`)** to transcribe the uploaded audio into text.  
The recognized text is printed as output.


In [34]:
from transformers import pipeline

# Load ASR pipeline
asr_pipe = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def transcribe_audio(file_path):
    result = asr_pipe(file_path)
    return result["text"]

# Run transcription
text = transcribe_audio(wav_audio)
print("📝 Transcribed text:\n", text)


Device set to use cpu


📝 Transcribed text:
  Hello, how are you?
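
The printed transcript above appears to begin with an extra space, so a small cleanup step before passing the text on may be useful. This is a minimal sketch under that assumption; the helper name `normalize_transcript` is hypothetical.

```python
def normalize_transcript(text):
    """Collapse runs of whitespace and trim both ends of a transcript string."""
    return " ".join(text.split())

print(repr(normalize_transcript("  Hello, how are you?\n")))  # 'Hello, how are you?'
```

In the notebook this could be applied as `text = normalize_transcript(transcribe_audio(wav_audio))`.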


#  Generate AI Response  
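The transcribed text is passed to a **text generation model (`google/flan-t5-base`)**.  
We pass parameters for response length (`max_new_tokens`), randomness (`temperature`, `top_p`), and repetition (`repetition_penalty`). Note that `temperature` and `top_p` apply only when sampling is enabled with `do_sample=True`; with the default greedy decoding used here they are ignored, which is what the warning in the cell output reports.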


In [36]:
from transformers import pipeline

# Use a more stable text model (Flan-T5 is good for Q&A / instruction-following)
gen_pipe = pipeline("text2text-generation", model="google/flan-t5-base")

def generate_response(user_text):
    response = gen_pipe(
        user_text,
        max_new_tokens=128,
        temperature=0.7,       # sampling temperature; only used when do_sample=True
        top_p=0.9,             # nucleus sampling; only used when do_sample=True
        repetition_penalty=1.2 # discourages repeated tokens
    )[0]['generated_text']
    return response

# Test
ai_response = generate_response(text)
print("🤖 AI Response:\n", ai_response)


Device set to use cpu
The following generation flags are not valid and may be ignored: ['temperature', 'top_p']. Set `TRANSFORMERS_VERBOSITY=info` for more details.


🤖 AI Response:
 I'm fine.
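
The warning in the output above indicates that `temperature` and `top_p` were not used. One way to keep the settings consistent is to build the generation arguments in one place and include the sampling parameters only when sampling is requested. This is a sketch, not the notebook's original code: `build_generation_kwargs` is a hypothetical helper, while the keys it returns are standard Hugging Face generation arguments.

```python
def build_generation_kwargs(sample=False, max_new_tokens=128):
    """Build keyword arguments for a text-generation call.

    temperature and top_p are included only when sampling is requested,
    since they have no effect on greedy decoding.
    """
    kwargs = {
        "max_new_tokens": max_new_tokens,
        "repetition_penalty": 1.2,
        "do_sample": sample,
    }
    if sample:
        kwargs.update(temperature=0.7, top_p=0.9)
    return kwargs

print(build_generation_kwargs(sample=True))
```

Assuming the pipeline forwards these arguments to generation, as it does for the parameters already passed above, a call such as `gen_pipe(user_text, **build_generation_kwargs(sample=True))[0]["generated_text"]` would enable sampling, at the cost of non-deterministic responses.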


# Convert AI Response to Speech  
Finally, the generated AI response is converted back into speech using **Google Text-to-Speech (`gTTS`)**.  
The audio is saved as `ai_response.mp3` and played directly in Colab.


In [37]:
!pip install gTTS pydub

from gtts import gTTS
from IPython.display import Audio

def text_to_speech(text, filename="ai_response.mp3"):
    tts = gTTS(text)
    tts.save(filename)
    return filename

# Convert AI response to audio
audio_out = text_to_speech(ai_response)
print("✅ AI response saved as audio:", audio_out)

# Play in Colab
Audio(audio_out)


✅ AI response saved as audio: ai_response.mp3
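
If the text model returns an empty or whitespace-only string, there is nothing for the speech step to say. The sketch below guards against that case before calling `text_to_speech`; the helper name `prepare_tts_text` and the fallback phrase are hypothetical.

```python
def prepare_tts_text(text, fallback="Sorry, I have no response."):
    """Return the text with whitespace normalized, or a fallback phrase when it is empty."""
    cleaned = " ".join((text or "").split())
    return cleaned if cleaned else fallback
```

It could be used as `audio_out = text_to_speech(prepare_tts_text(ai_response))`.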
