PAT (PodcastProject Analytics Toolkit)

Toolkit to analyze several aspects of either podcasts themselves or books talked about in the podcast

This project aims to create a collection of tools and workflows surrounding all kinds of data that might (or might not) happen to be in the reach of a podcast. Since this project started with a podcast talking about books the first modules all focus on analyzing the podcast audio files and the books.

Install

pip install git+https://github.com/GameOfPods/PAT.git

Running the application

See

PAT -h

for all available options

See

PAT -ls

for a list of all available Modules

Modules

PAT automatically discovers all modules implementing PATModule. Each of these modules can introduce a set of command line options that get automatically added to the cli of PAT with a prefix.

When loading a file every Module is asked if the module can and wants to handle the file. If yes all combinations of files and modules are queued and the ordered by module. Like this a module that has a long start and stop time (loading models, etc.) only has to go through them once.

The results are stored in a dedicated folder named A_date_time.

The results consist of a basic json file and potentially some additional files produced by the modules that can have different shapes.

Currently, a speaker and a book module are implemented. More to come.

SpeakerModule

This module aims to produce insights into the recorded podcast and the podcasters. It uses different methods to diarize the speakers, transcribe the audio and then summarize it. For all these steps all data is recorded.

The diarization is done with either Nvidia NEMO if present and all requirements are installed or using pyannote if its requirements are met. If both are present the user can select which to use. I found Nvidia NEMO to be more reliable but that might be very dependant on the data present in the specific usecase.

The transcription is done using openAI whisper with whisperX to better align the transcription to the single words.

If both are present, diarization and transcription, they are mapped to each other to create a transcription per person. Finally, for each part the punctuation is restored using deepmultilingualpunctuation.

If OpenAI is configured to run a summary is created over the transcription. For OpenAI either the official web api can be used if a valid token is given or a local OpenAI compatible installation (eg. localai ) can be used by using the OPENAI_API_BASE environment variable. Also the model that should be used has to be given in cli.

BookModule

This module is used to do several nlp analysis on books. Currently only epub files are supported but this will be extended in the future. For all extracted content from the book spacy is used to create the data. Especially NER is implemented at the time to see which persons, locations are most important and present in each chapter. This can be used for some fun analysis on books and stories. More nlp stuff will be implemented in the future.

If OpenAI is configured to run a summary is created over the chapters and then the whole book. For OpenAI either the official web api can be used if a valid token is given or a local OpenAI compatible installation (eg. localai ) can be used by using the OPENAI_API_BASE environment variable. Also the model that should be used has to be given in cli.

Vizualization

The focus of PAT is on creating raw data that can then be used by different UI (Websites, etc) of further tools. Mainly podcastproject will get native support for several modules to show the data directly on a website that also shows the podcast.

But PAT also has a sister module PAT_Visualizer that aims to create some basic insights into some of the data. This should really be just a glimps of what is possible with the data PAT generates.

For docu on PAT_Visualizer see

PAT-Visualizer -h

TODOs

Create full documentation (readthedocs or similar)
Improve mapping of words to speaker. Sometimes very buggy
Seperate requirements by modules. Utilize extras requires from setuptools?

History

0.0.3
- Implemented Speaker Identification using audio snippets cut using a VAD and then matching with MFCC

Libraries

numpy
requests - Apache-2.0
tqdm - License
psutil - BSD 3
roman - Zope Public License
bs4 - MIT License
EbookLib - AGPL-3.0
pyyaml - MIT License
scikit-learn - BSD License
XlsxWriter - BSD License
torch - BSD License
pyannote.audio - MIT License
audiosegment - MIT License
silero-vad - MIT License
demucs - MIT License
nemo_toolkit - Apache-2.0
pydub - MIT License
librosa - ISC License
omegaconf - BSD License
rttm-manager
whisperX - MIT License
deepmultilingualpunctuation - MIT License
langchain - MIT License
deep-translator - MIT License
spacy - MIT License
langdetect - Apache-2.0

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.idea		.idea
PAT		PAT
PAT_Visualizer		PAT_Visualizer
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PAT (PodcastProject Analytics Toolkit)

Install

Running the application

Modules

SpeakerModule

BookModule

Vizualization

TODOs

History

Libraries

About

Releases

Packages

Languages

License

GameOfPods/PAT

Folders and files

Latest commit

History

Repository files navigation

PAT (PodcastProject Analytics Toolkit)

Install

Running the application

Modules

SpeakerModule

BookModule

Vizualization

TODOs

History

Libraries

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages