Voice Gender Recognition

Note(2017-10-27): This is the very first version of repo. Future work is required to push the prediction performance to higher level.

Summary

Inspired by LSTM Networks for Sentiment Analysis. Here is an implementation repo for training a LSTM neurtal networks for recogonizing audio data's speaker gender. The audio data used is from Vox Forge. Data Science is supposed to handle all kinds of problems without having domain knowledge. We assume no knowledge exist for data scientist in Audio signal processing. 'FFT' is an alternative optioin instead of letting the neural nets learn the audio signal logic from raw wave data them self.

Scraping down tgz audio file

Run scrap.py will download every tgz file and save to local directory ./rawdata.

Parse the README file in each package

In README file, the 5th line contains the speak gender information of all audio .wav files in the directory. The 'labeling' function in vocal_gender_lstm.py parse the README file and return the data label and wave raw data in the format of numpy array.

Neural Nets Graph

Use fixed number of LSTM cells to take input from squential wave raw data. The hidden state of each cells are concated nated to 2-D matrix as output. The output data dimension is reduced by takeing average pooling in large strides. Then the output layer is stardard softmax on pooling results. The cost function is constructed by caculating the cross entroy between data label and softmax output from the networks.

Mini Batch training

The training process takes each tgz file as one mini batch.All 10 audio files are taken for one epoch of opitimizing process. Every 100 mini batch, the network prediction performance is validated by run 100 out-of-sample validation samples. The classification accuracy is printed by percentage. By using mini batch, the disk space and memory is saved.

Performance

Mini Batches	Accuracy Achieved	Mini Batches	Accuracy Achieved
1	63.20%	1000	81.70%
100	71.30%	1100	81.70%
200	74.50%	1200	82.30%
300	75.70%	1300	82.40%
400	78.0%	1400	82.20%
500	79.10%	1500	82.40%
600	79.50%	1600	82.70%
700	80.20%	1700	83.0%
800	80.90%	1800	83.30%
900	81.30%	1900	83.50%

Requirments

tensorflow, numpy, scipy

Scripts

The experiment can be reproduced by running following command:

mkdir rawdata
python scrap.py
python vocal_gender_lstm.py > ./train_results.txt

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
LICENSE		LICENSE
README.md		README.md
scrap.py		scrap.py
train_results.png		train_results.png
vocal_gender_lstm.py		vocal_gender_lstm.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

LICENSE

LICENSE

README.md

README.md

scrap.py

scrap.py

train_results.png

train_results.png

vocal_gender_lstm.py

vocal_gender_lstm.py

Repository files navigation

Voice Gender Recognition

Summary

Scraping down tgz audio file

Parse the README file in each package

Neural Nets Graph

Mini Batch training

Performance

Requirments

Scripts

About

Releases

Packages

Languages

License

JinScientist/voice-gender-recognition

Folders and files

Latest commit

History

Repository files navigation

Voice Gender Recognition

Summary

Scraping down tgz audio file

Parse the README file in each package

Neural Nets Graph

Mini Batch training

Performance

Requirments

Scripts

About

Resources

License

Stars

Watchers

Forks

Languages