Realtime Conversation with Voice Translation

Description: Create a machine learning model that facilitates real-time conversation between an English-speaking person and a Spanish-speaking person. The model should:

- Extract Spanish words from voice input, translate them into English, and read the translation aloud.
- Take English voice input from the other user, translate it into Spanish, and read the translation aloud.

Guidelines: Make your own machine learning model. A GUI is not mandatory. This is a tough task, but accuracy doesn't matter; the only thing that matters is the amount of effort you put in. The evaluation will be conducted on the models' overall performance.

**Please Note:** Microphone access is not supported on this platform, so to approximate real-time conversation the audio must be recorded first and then uploaded as a .WAV file. There are two ways to do this: record with any phone voice recorder that produces a 3GPP file and upload it to *cloudconvert.com* to convert it to WAV, or use an online or mobile application that records directly in WAV format. (Sample .WAV files are attached to the mail.) Thank you.
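
Since everything downstream depends on the upload really being a WAV file, it can help to inspect the file before handing it to the recognizer. The following is a minimal sketch using only Python's standard `wave` module; the helper name `describe_wav` is hypothetical and not part of this notebook. The `wave` module reads uncompressed PCM WAV only, so a 3GPP recording that was merely renamed to `.wav` is rejected.

```python
import wave


def describe_wav(path):
    """Return basic properties of a PCM WAV file, or raise ValueError if it is not one."""
    try:
        with wave.open(path, "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "sample_rate_hz": rate,
                "sample_width_bytes": wav.getsampwidth(),
                # Duration follows from frame count divided by frames per second.
                "duration_s": frames / rate if rate else 0.0,
            }
    except (wave.Error, EOFError) as exc:
        # wave.Error covers a missing RIFF/WAVE header; EOFError covers truncated files.
        raise ValueError(f"{path!r} is not a readable PCM WAV file: {exc}") from exc
```

Calling `describe_wav` on the uploaded file name before transcription would surface a wrong format early, with a clearer message than a failure inside the speech recognizer.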

In [1]:
!pip install gTTS SpeechRecognition pydub
!apt-get -y install ffmpeg

Collecting gTTS
  Downloading gTTS-2.5.4-py3-none-any.whl.metadata (4.1 kB)
Collecting SpeechRecognition
  Downloading speechrecognition-3.14.3-py3-none-any.whl.metadata (30 kB)
Collecting pydub
  Downloading pydub-0.25.1-py2.py3-none-any.whl.metadata (1.4 kB)
Collecting click<8.2,>=7.1 (from gTTS)
  Downloading click-8.1.8-py3-none-any.whl.metadata (2.3 kB)
Downloading gTTS-2.5.4-py3-none-any.whl (29 kB)
Downloading speechrecognition-3.14.3-py3-none-any.whl (32.9 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 32.9/32.9 MB 19.5 MB/s eta 0:00:00
Downloading pydub-0.25.1-py2.py3-none-any.whl (32 kB)
Downloading click-8.1.8-py3-none-any.whl (98 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 98.2/98.2 kB 6.0 MB/s eta 0:00:00
Installing collected packages: pydub, SpeechRecognition, click, gTTS
  Attempting uninstall: click
    Found existing installation: click 8.2.1
    Uninstalling click-8.2.1:
 

In [2]:
import numpy as np
import speech_recognition as sr
from gtts import gTTS
from IPython.display import Audio, display
from pydub import AudioSegment
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from google.colab import files


# Parallel toy corpus: eng_sentences[i] is paired with spa_sentences[i].
eng_sentences = [
    "hello", "thank you", "how are you", "good night", "i am fine",
    "good morning", "see you soon", "i like books", "this is great", "where is the pen",
    "my teacher", "good evening", "good afternoon", "see you tomorrow", "what is your name",
    "where do you live", "i am hungry", "i am thirsty", "can you help me", "i am learning spanish",
    "do you understand", "i need water", "i want to eat", "this is delicious", "my name is john",
    "i am a student", "i am from spain", "do you have time", "it is raining", "it is cold",
    "i am tired", "let us go", "what time is it", "i love music", "please sit down",
    "open the door", "close the window", "where are you going", "i am at home", "i am busy",
    "call me later", "i am coming", "wait for me", "i am happy", "i am sad",
    "i am angry", "i am scared", "i am excited", "i am bored", "i am ready",
    "i am not ready", "i am lost", "can i help you", "what happened", "don't worry",
    "take care", "be careful", "good luck", "congratulations", "happy birthday",
    "happy new year", "happy anniversary", "see you later", "see you next week", "have a nice day",
    "have a good night", "sweet dreams", "nice to meet you", "i am okay", "i am not well",
    "i am sick", "i have a headache", "i have a fever", "i am going out", "i am coming back"
]

spa_sentences = [
    "hola", "gracias", "cómo estás", "buenas noches", "estoy bien",
    "buenos días", "hasta pronto", "me gustan los libros", "esto es genial", "dónde está el bolígrafo",
    "mi profesor", "buenas tardes", "buenas tardes", "hasta mañana", "cómo te llamas",
    "dónde vives", "tengo hambre", "tengo sed", "puedes ayudarme", "estoy aprendiendo español",
    "entiendes", "necesito agua", "quiero comer", "esto es delicioso", "me llamo Juan",
    "soy estudiante", "soy de España", "tienes tiempo", "está lloviendo", "hace frío",
    "estoy cansado", "vámonos", "qué hora es", "me encanta la música", "por favor siéntate",
    "abre la puerta", "cierra la ventana", "adónde vas", "estoy en casa", "estoy ocupado",
    "llámame más tarde", "estoy llegando", "espérame", "estoy feliz", "estoy triste",
    "estoy enojado", "tengo miedo", "estoy emocionado", "estoy aburrido", "estoy listo",
    "no estoy listo", "estoy perdido", "puedo ayudarte", "qué pasó", "no te preocupes",
    "cuídate", "ten cuidado", "buena suerte", "felicidades", "feliz cumpleaños",
    "feliz año nuevo", "feliz aniversario", "hasta luego", "nos vemos la próxima semana", "que tengas un buen día",
    "que tengas una buena noche", "dulces sueños", "mucho gusto", "estoy bien", "no me siento bien",
    "estoy enfermo", "tengo dolor de cabeza", "tengo fiebre", "voy a salir", "voy a volver"
]



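# Word-level tokenizers, one per language. Index 0 is reserved for padding.
eng_tok = Tokenizer()
spa_tok = Tokenizer()
eng_tok.fit_on_texts(eng_sentences)
spa_tok.fit_on_texts(spa_sentences)

# Longest sentence (in words) per language; every sequence is padded to this length.
eng_max = max(len(s.split()) for s in eng_sentences)
spa_max = max(len(s.split()) for s in spa_sentences)

# English -> Spanish training pairs. Targets get a trailing axis of size 1,
# the shape used here with sparse_categorical_crossentropy.
X_eng = pad_sequences(eng_tok.texts_to_sequences(eng_sentences), maxlen=eng_max, padding='post')
y_spa = pad_sequences(spa_tok.texts_to_sequences(spa_sentences), maxlen=spa_max, padding='post')
y_spa = np.expand_dims(y_spa, -1)

# Spanish -> English training pairs, built the same way.
X_spa = pad_sequences(spa_tok.texts_to_sequences(spa_sentences), maxlen=spa_max, padding='post')
y_eng = pad_sequences(eng_tok.texts_to_sequences(eng_sentences), maxlen=eng_max, padding='post')
y_eng = np.expand_dims(y_eng, -1)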


# English -> Spanish: an LSTM encoder compresses the sentence into one vector,
# RepeatVector feeds it to every decoder step, and a softmax picks a word per step.
model_eng_spa = Sequential([
    Embedding(len(eng_tok.word_index)+1, 64),
    LSTM(64),
    RepeatVector(spa_max),
    LSTM(64, return_sequences=True),
    TimeDistributed(Dense(len(spa_tok.word_index)+1, activation='softmax'))
])
model_eng_spa.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model_eng_spa.fit(X_eng, y_spa, epochs=200, verbose=0)


# Spanish -> English: same architecture with the two vocabularies swapped.
model_spa_eng = Sequential([
    Embedding(len(spa_tok.word_index)+1, 64),
    LSTM(64),
    RepeatVector(eng_max),
    LSTM(64, return_sequences=True),
    TimeDistributed(Dense(len(eng_tok.word_index)+1, activation='softmax'))
])
model_spa_eng.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model_spa_eng.fit(X_spa, y_eng, epochs=200, verbose=0)

print("Models are trained")


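# Reverse lookups from token index back to word, used to decode predictions.
eng_index_word = {v: k for k, v in eng_tok.word_index.items()}
spa_index_word = {v: k for k, v in spa_tok.word_index.items()}

def translate(text, model, tokenizer, max_len, index_map):
    """Translate `text` by greedy decoding: take the most likely word at each output step.

    Words the tokenizer did not see during training are silently dropped from the input.
    """
    seq = tokenizer.texts_to_sequences([text.lower()])
    padded = pad_sequences(seq, maxlen=max_len, padding='post')
    pred = model.predict(padded)[0]  # shape: (output_steps, target_vocab_size)
    decoded = [int(np.argmax(p)) for p in pred]
    # Skip padding (index 0) and map the remaining indices back to words.
    return ' '.join(index_map.get(i, '') for i in decoded if i != 0).strip()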


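lang_choice = input("What is the language of your audio? Please manually type 'english' or 'spanish': ").strip().lower()
# Reject anything else explicitly instead of silently treating it as Spanish.
if lang_choice not in ("english", "spanish"):
    raise ValueError(f"Unsupported language {lang_choice!r}; type 'english' or 'spanish'.")
lang_code = "en-US" if lang_choice == "english" else "es-ES"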

print("Please upload your WAV file")
uploaded = files.upload()
audio_filename = next(iter(uploaded.keys()))


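# Transcribe the uploaded file with the Google Web Speech API (needs network access).
# recognize_google raises sr.UnknownValueError if the speech is unintelligible
# and sr.RequestError if the API cannot be reached.
recognizer = sr.Recognizer()
with sr.AudioFile(audio_filename) as source:
    audio_data = recognizer.record(source)  # read the whole file
    input_text = recognizer.recognize_google(audio_data, language=lang_code)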

print(f"\n You said ({lang_choice}): {input_text}")


if lang_choice == "spanish":
    translated = translate(input_text, model_spa_eng, spa_tok, spa_max, eng_index_word)
    speak_lang = "en"
else:
    translated = translate(input_text, model_eng_spa, eng_tok, eng_max, spa_index_word)
    speak_lang = "es"

print(f"Translated Sentence: {translated}")


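# Synthesize the translation with gTTS (MP3), convert it to WAV via pydub/ffmpeg, and play it.
if not translated:
    raise ValueError("Empty translation; nothing to read aloud.")
tts = gTTS(translated, lang=speak_lang)
tts.save("translated.mp3")
AudioSegment.from_mp3("translated.mp3").export("translated.wav", format="wav")
display(Audio("translated.wav", autoplay=True))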




Models are trained
What is the language of your audio? Please manually type 'english' or 'spanish': spanish
Please upload your WAV file


Saving Spanish Sample3.wav to Spanish Sample3.wav

 You said (spanish): esto es genial
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 396ms/step
Translated Sentence: this is great
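
The `translate` step in the run above comes down to greedy decoding: for each output position the model emits a probability distribution over the target vocabulary, the most likely index is taken, padding (index 0) is skipped, and the remaining indices are mapped back to words. Below is a dependency-free sketch of that step. The three-word vocabulary and the hand-written probabilities are illustrative assumptions standing in for `model.predict` output; they are not values from the trained model, and `greedy_decode` is a hypothetical name.

```python
def greedy_decode(step_probs, index_to_word):
    """Pick the most likely word index at each step, dropping padding (index 0)."""
    words = []
    for probs in step_probs:
        # Plain-Python argmax over one step's distribution.
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != 0:
            words.append(index_to_word.get(best, ""))
    return " ".join(words).strip()


# Hypothetical vocabulary and per-step distributions (index 0 = padding).
index_to_word = {1: "this", 2: "is", 3: "great"}
step_probs = [
    [0.05, 0.80, 0.10, 0.05],  # step 1: index 1 wins
    [0.05, 0.10, 0.75, 0.10],  # step 2: index 2 wins
    [0.02, 0.03, 0.05, 0.90],  # step 3: index 3 wins
    [0.97, 0.01, 0.01, 0.01],  # step 4: padding wins and is skipped
]
print(greedy_decode(step_probs, index_to_word))  # this is great
```

Each position is decoded independently from a fixed vocabulary, so the output can only contain words that appeared in the training sentences, and input words the tokenizer has never seen are dropped before the model receives them.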
