This project implements an Audio Language Model (ALM) that jointly recognizes, understands, and reasons over speech and non-speech audio.
Developed by: Akshay Rathod
- Speech Recognition for Asian languages (Mandarin, Urdu, Hindi, Telugu, Tamil, Bangla) and English
- Non-Speech Audio Understanding (music, alarms, environmental noises)
- Speaker Diarization (differentiating between speakers)
- Paralinguistic Analysis (emotion, tone, hesitation)
- Audio Event Detection (car honking, dog barking, aircraft sounds, etc.)
- Joint understanding of speech and non-speech elements for complex reasoning
- Web interface for easy access and deployment
├── data/
│   ├── raw/
│   ├── processed/
│   └── datasets.py
├── models/
│   ├── alm_model.py
│   ├── speech_encoder.py
│   ├── audio_encoder.py
│   └── fusion_module.py
├── training/
│   ├── train.py
│   └── trainer.py
├── utils/
│   ├── preprocessing.py
│   └── evaluation.py
├── config/
│   └── config.yaml
├── templates/
│   └── index.html
├── main.py
├── web_app.py
├── deploy.py
└── requirements.txt
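The models/ layout above suggests a two-encoder design: speech_encoder.py and audio_encoder.py each produce feature sequences that fusion_module.py combines before reasoning. As a rough sketch of one common way to implement such fusion (cross-attention), where all class and parameter names are illustrative assumptions rather than the repository's actual API:

    import torch
    import torch.nn as nn

    class CrossAttentionFusion(nn.Module):
        """Hypothetical fusion of speech and non-speech audio features.

        Illustrative only; the repository's fusion_module.py may differ.
        """

        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # Speech tokens attend over non-speech audio tokens.
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, speech: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
            # speech: (batch, T_speech, dim); audio: (batch, T_audio, dim)
            fused, _ = self.cross_attn(speech, audio, audio)
            return self.norm(speech + fused)  # residual + layer norm

    # Example: fuse 100 speech frames with 50 non-speech audio frames.
    fusion = CrossAttentionFusion()
    out = fusion(torch.randn(2, 100, 512), torch.randn(2, 50, 512))
    print(out.shape)  # torch.Size([2, 100, 512])

Cross-attention of this kind lets each speech frame gather context from concurrent non-speech events, which is what joint understanding of the two streams relies on.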
- Python 3.8+
- PyTorch 1.9+
- Transformers
- Librosa
- SoundFile
- NumPy
- Pandas
- Flask
- Gunicorn
- Clone the repository:
  git clone https://github.com/Akshay-Notfound/Deep-Learning-based-ALM.git
  cd Deep-Learning-based-ALM
- Install the required dependencies:
  pip install -r requirements.txt
- (Optional) Initialize the project with sample data:
  python init_project.py
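To confirm the core dependencies installed correctly, a quick optional sanity check (not part of the repository) is:

    python -c "import torch, transformers, librosa, soundfile, flask; print(torch.__version__)"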
The project includes a web interface built with Flask for easy access to the ALM functionality.
To run the web interface locally:
- Start the web application:
  python web_app.py
- Open your browser and navigate to http://localhost:5000
To run the application in development mode:
python deploy.py --mode dev
To create production deployment scripts:
python deploy.py --mode prod
This generates deploy.sh (for Linux/macOS) and deploy.bat (for Windows) scripts for deploying the application in a production environment.
For manual deployment, you can use Gunicorn:
gunicorn --bind 0.0.0.0:8000 --workers 4 web_app:app
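As a rule of thumb from Gunicorn's documentation, set --workers to roughly (2 × CPU cores) + 1 and adjust for your host.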
Run the ALM system from the command line:
python main.py --config config/config.yaml --checkpoint path/to/checkpoint.pt --audio path/to/audio.wav
- Upload an audio file using the web interface
- Optionally, ask a question about the audio content
- View the analysis results including speech recognition, audio events, speaker diarization, and paralinguistic analysis
- GET / - Main web interface
- POST /analyze - Analyze an uploaded audio file
- GET /api/status - Check API status
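The analyze endpoint can also be called programmatically. The sketch below assumes a multipart form with a file field and an optional question field; check web_app.py for the actual field names:

    import requests

    # Hypothetical client for the local web app; field names are assumptions.
    with open("path/to/audio.wav", "rb") as f:
        resp = requests.post(
            "http://localhost:5000/analyze",
            files={"file": f},
            data={"question": "What sounds occur besides speech?"},
        )
    resp.raise_for_status()
    print(resp.json())  # analysis results returned by the server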
Run the demo to see a simulation of the ALM capabilities without requiring heavy dependencies:
python alm_demo.py
This project is licensed under the MIT License - see the LICENSE file for details.
Akshay Rathod - Final Year Project
Copyright (c) 2025 Akshay Rathod. All rights reserved.
This Audio Language Model (ALM) system is provided "as is" without warranty of any kind, either express or implied. The developer makes no representations or warranties regarding the accuracy, reliability, or suitability of the system for any purpose. Use of this system is at your own risk.
In no event shall the developer be liable for any direct, indirect, incidental, special, exemplary, or consequential damages arising out of the use or inability to use this system.
The project also includes a Streamlit web interface for easy access to the ALM functionality. To run it:
- Install the required dependencies:
  pip install streamlit librosa
- Run the Streamlit app:
  streamlit run streamlit_app.py
- If the streamlit command is not found, try:
  python -m streamlit run streamlit_app.py
- On Windows, you might need to specify Python 3.10 explicitly:
  py -3.10 -m streamlit run streamlit_app.py
- The app will open in your default browser at http://localhost:8501
The Streamlit interface provides:
- File upload for audio analysis
- Question answering about audio content
- Visual results display
- Demo mode to see sample outputs
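As a rough idea of how such an interface fits together, a minimal upload-and-analyze page might look like the sketch below (illustrative only; streamlit_app.py in the repository is the actual implementation):

    import librosa
    import streamlit as st

    st.title("Audio Language Model Demo")

    uploaded = st.file_uploader("Upload an audio file", type=["wav", "mp3", "flac"])
    question = st.text_input("Ask a question about the audio (optional)")

    if uploaded is not None:
        st.audio(uploaded)  # playback widget
        uploaded.seek(0)    # rewind before decoding
        # 16 kHz is a common sample rate for speech models.
        audio, sr = librosa.load(uploaded, sr=16000)
        st.write(f"Loaded {len(audio) / sr:.1f} s of audio at {sr} Hz")
        if question:
            st.write("Question:", question)
        # A real app would run the ALM here and render its analysis.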