# Arabic Voice Interface for City Operation Center

A city operations center (COC) enables smart city operators to integrate data from different sectors and agencies, manage resources, engage with citizens, and address their concerns.
Giza Systems offers a software platform for city operation centers that enables operators to manage IoT assets in smart cities: collecting data from these assets, creating alarms based on the received data, calculating KPIs, configuring schedulers, managing Standard Operating Procedures (SOPs), building dashboards, and training ML models.
In this project, we aimed to build a voice interface for the Asset-360 view screen. The COC operator simply asks questions related to an asset, and the interface replies with the answers.
The answer to the operator's question is given in the same language as the question (AR -> AR, ENG -> ENG). This saves operators time they would otherwise spend navigating through the screens of different assets.
The goal of this project is to map operator questions to the relevant pieces of information about the asset.
## Table of Contents

- Introduction
- Project Structure
- Getting Started
- Pipeline
- Running The Pipeline
- Examples
- Team Members
- Contributing
- Future Work
- Acknowledgments
## Introduction

This repository contains the full code for an Arabic & English virtual assistant.
It was developed as a final graduation project for ITI Intake 43 (AI, Mansoura Branch) in July 2023, under the supervision of Giza Systems.
## Project Structure

```
├── Interface
│   ├── google_app
│   ├── interface
├── data
├── notebooks
├── src
│   ├── rasa
│   ├── speechtotext
│   ├── texttospeech
│   ├── translation
│   └── wav2lip
└── utils
```
The repository is organized as follows:
- `Interface/`: This directory is the Django project and contains `google_app` as the Django app.
- `data/`: This directory contains the dataset used for training and evaluation. It includes both the raw data and preprocessed versions, if applicable.
- `notebooks/`: This directory contains Jupyter notebooks that provide step-by-step explanations of the data exploration, preprocessing, model training, and evaluation processes.
- `src/`: This directory contains the source code for the project, including data preprocessing scripts, model training scripts, and evaluation scripts.
- `utils/`: This directory contains utility functions and helper scripts used throughout the project.
## Getting Started

- It is recommended to set up a virtual environment for this project using Python 3.8.16.
- You need to provide API keys for Google Cloud services and Azure Cognitive Services in the following modules (see the sketch after this list):
  - utils/detect_language.py
  - src/translation/azure_translator.py
  - src/translation/google_translator.py
  - src/texttospeech/google_text_to_speech.py
  - src/texttospeech/azure_text_to_speech.py
  - src/speechtotext/google_speech_to_text.py
  - src/speechtotext/azure_speech_to_text.py
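For illustration, here is a minimal sketch of how the Azure-based modules typically build a client from a key and region. The environment-variable names are hypothetical; check the modules above for where the keys actually go.

```python
import os

import azure.cognitiveservices.speech as speechsdk

# Hypothetical environment variables; the modules listed above may expect
# the key/region somewhere else, so adjust accordingly.
speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],  # e.g. "westeurope"
)
```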
To get started with the project, follow these steps:
- Clone the repository:
  `git clone https://github.com/Aylore/Arabic-Voice-Interface-for-City-Operation-Center`
- Change directory into the repository:
  `cd Arabic-Voice-Interface-for-City-Operation-Center`
- Install the required dependencies:
  `make install`
  (You only need to do this the first time; feel free to use your own setup.)
- Download the pretrained weights for the wav2lip model using `make wav2lip-model`.
- Train the rasa chatbot using `make rasa-train`.
## Pipeline

- The first step of the pipeline is to transcribe the user's spoken question into text using a speech-to-text system. We use the Azure Speech Services API for this task; for more information, check the SST-online branch README, where we compare speech-to-text services including AWS and Google Cloud. A minimal sketch of this step follows.
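The sketch below shows the kind of call involved, using the Azure Speech SDK; the locale and file path are illustrative, not necessarily what `src/speechtotext/azure_speech_to_text.py` does.

```python
import azure.cognitiveservices.speech as speechsdk

def transcribe(wav_path: str, key: str, region: str, language: str = "ar-EG") -> str:
    """Transcribe one recorded question into text with the Azure Speech service."""
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_recognition_language = language  # illustrative locale
    audio_config = speechsdk.audio.AudioConfig(filename=wav_path)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config,
                                            audio_config=audio_config)
    return recognizer.recognize_once().text  # single-utterance recognition
```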
- If the user asked the question in Arabic, the text is translated to English before the question is fed to the chatbot (see the sketch below).
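One way to do this step, sketched against the Azure Translator v3.0 REST API (the repo also has a Google translator module, so this is illustrative, not the repo's exact code); the same call with `src="en", dest="ar"` covers the reverse direction used later in the pipeline.

```python
import requests

def translate(text: str, key: str, region: str, src: str = "ar", dest: str = "en") -> str:
    """Translate `text` between Arabic and English via the Azure Translator v3.0 API."""
    resp = requests.post(
        "https://api.cognitive.microsofttranslator.com/translate",
        params={"api-version": "3.0", "from": src, "to": dest},
        headers={
            "Ocp-Apim-Subscription-Key": key,
            "Ocp-Apim-Subscription-Region": region,
            "Content-Type": "application/json",
        },
        json=[{"text": text}],
    )
    resp.raise_for_status()
    return resp.json()[0]["translations"][0]["text"]
```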
- After getting the transcript of the question, the chatbot generates a response based on the intent and entities identified in the question; it calls an API endpoint to retrieve the answer. A hypothetical custom-action sketch follows.
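In rasa, this kind of lookup is typically done in a custom action. The sketch below is hypothetical: the action name, slot name, and endpoint URL are placeholders, not the repo's actual ones.

```python
from typing import Any, Dict, List, Text

import requests
from rasa_sdk import Action, Tracker
from rasa_sdk.executor import CollectingDispatcher

class ActionGetAssetInfo(Action):
    """Hypothetical custom action that fetches the asset field the intent asks about."""

    def name(self) -> Text:
        return "action_get_asset_info"

    def run(self, dispatcher: CollectingDispatcher, tracker: Tracker,
            domain: Dict[Text, Any]) -> List[Dict[Text, Any]]:
        asset_id = tracker.get_slot("asset_id")  # entity extracted from the question
        # Placeholder endpoint; the real URL lives in the repo's action code.
        answer = requests.get(f"http://localhost:8000/assets/{asset_id}").json()
        dispatcher.utter_message(text=str(answer))
        return []
```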
- If the question was asked in Arabic, the answer is translated from English back to Arabic after it comes back from the chatbot and before the audio file is generated (the reverse direction of the translation sketch above).
- After getting the response from our chatbot, we use the Azure Speech SDK to synthesize it into an audio file, which can be played back to the user as the chatbot's spoken response (sketched below).
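A minimal synthesis sketch with the Azure Speech SDK; the voice name is illustrative, not necessarily the one the project uses.

```python
import azure.cognitiveservices.speech as speechsdk

def synthesize(text: str, out_path: str, key: str, region: str,
               voice: str = "ar-EG-SalmaNeural") -> None:
    """Render the chatbot's answer to a WAV file with the Azure Speech SDK."""
    speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
    speech_config.speech_synthesis_voice_name = voice  # illustrative voice
    audio_config = speechsdk.audio.AudioOutputConfig(filename=out_path)
    synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                              audio_config=audio_config)
    synthesizer.speak_text_async(text).get()  # block until the file is written
```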
- After getting the audio response, we had to present the answer to the user in a convenient way, so we trained a lip-sync model, on an agent of our choosing, using the current SOTA model wav2lip; for more information, check the training notebook in this branch.
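For inference, the upstream Wav2Lip repo exposes a script that takes the agent video and the synthesized audio. An illustrative invocation (the flags come from the public Wav2Lip repo; file names are placeholders, and the checkpoint is the one fetched by `make wav2lip-model`):

```python
import subprocess

# Illustrative call to the upstream Wav2Lip inference script; paths are placeholders.
subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # weights from `make wav2lip-model`
    "--face", "agent.mp4",             # video of the agent we trained on
    "--audio", "answer.wav",           # TTS output from the previous step
    "--outfile", "results/answer.mp4",
], check=True)
```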
- Because of the quality of the wav2lip model's output, we used an image enhancement model, Code Former, to restore it.
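An illustrative call to the public CodeFormer inference script; the flags follow the current upstream CodeFormer repo, and the paths and fidelity weight are placeholders rather than this project's settings.

```python
import subprocess

# Illustrative call to the upstream CodeFormer script; paths are placeholders.
subprocess.run([
    "python", "inference_codeformer.py",
    "-w", "0.7",                           # fidelity weight: lower favors quality, higher favors identity
    "--input_path", "results/answer.mp4",  # Wav2Lip output from the previous step
    "--bg_upsampler", "realesrgan",        # optional background upsampling
    "--face_upsample",
], check=True)
```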
- After the video response is generated, we send it to a Django web application, which displays the video response to the user along with any additional information or functionality needed.
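On the Django side, serving the generated clip can be as simple as a file-streaming view. A hypothetical sketch (the view name and file path are placeholders):

```python
from django.http import FileResponse

def video_response(request):
    """Hypothetical view: stream the restored answer video back to the interface."""
    return FileResponse(open("results/answer_restored.mp4", "rb"),  # placeholder path
                        content_type="video/mp4")
```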
## Running The Pipeline

- Run the FastAPI app with the uvicorn server:
  `make fastapi`
- Start the rasa API:
  `make rasa-run`
- Run the rasa action server, which fetches the data from the API:
  `make rasa-actions`
- Run the Django server to use the interface:
  `make django`
Approximate latency per stage:

- Speech to text: ~2 s
- Translation: ~2 s
- Chatbot: ~250 ms
- Text to speech: ~2 s
- Wav2Lip: ~30-40 s
- Face restoration: ~4-7 min

These numbers were measured on an M1 MacBook Air with 16 GB of RAM.
## Examples

- english-example.mp4
- arabic-example.mp4
## Team Members

| Name | GitHub | |
|---|---|---|
| Ahmed Elghitany | | |
| Israa Okil | | |
| Khaled Ehab | | |
| Osama Oun | | |
## Contributing

If you would like to contribute to this project, feel free to make a pull request or contact one of our team members via the links above.
## Future Work

- Modify the face restoration step to use a simpler model for face detection, or combine it with wav2lip somehow; this needs further research.
- Take feedback from the user after they receive an answer, to find areas of development and further enhance the pipeline.
- Build an end-to-end Arabic pipeline with an Arabic chatbot, so that no translation is needed.
## Acknowledgments

- wav2lip: "A Lip Sync Expert Is All You Need for Speech to Lip Generation In the Wild", ACM Multimedia 2020.
- Code Former: "Towards Robust Blind Face Restoration with Codebook Lookup Transformer", NeurIPS 2022.