A CNN for Speech Onset Time (SOT) detection in Mandarin



MandSOT: Mandarin Speech Onset Time (SOT) Detection Using Machine Learning

MandSOT is a Convolutional Neural Network (CNN) model trained to detect Speech Onset Time (SOT) automatically in Mandarin speech.



Mandarin Speech

  • EEG Picture Naming Records
    • Source
      • Institution: Department of Chinese and Bilingual Studies, Hong Kong Polytechnic University
      • Research Lead: Dr. Xiaocong Chen (GitHub profile, Google Scholar)
    • Dataset Description
      • 12,522 audio recordings in WAV format, sampled at 48 kHz, captured as part of an EEG study of Mandarin speech.
    • Speaker Details
      • Number of Speakers: 38
      • Language: Mandarin
    • Annotations
      • Each recording comes with a Speech Onset Time (SOT) annotation, marked in Praat by Dr. Xiaocong Chen and others.

Acoustic Noises

  • DEMAND Dataset
  • Other Noises
    • Background noise recorded in room ZB217 at UBSN, Hong Kong Polytechnic University, with the AC fan set to levels 1, 2, and 3.
    • Background noise recorded in office GH709, Hong Kong Polytechnic University.

Network Structure

        Layer (type)               Output Shape         Param #
            Conv1d-1             [-1, 32, 4096]          21,536
         MaxPool1d-2             [-1, 32, 2048]               0
            Conv1d-3             [-1, 64, 2046]           6,208
         MaxPool1d-4             [-1, 64, 1023]               0
            Conv1d-5             [-1, 32, 1021]           6,176
         MaxPool1d-6              [-1, 32, 510]               0
            Conv1d-7              [-1, 64, 508]           6,208
         MaxPool1d-8              [-1, 64, 254]               0
            Linear-9                  [-1, 128]       2,080,896
           Linear-10                    [-1, 1]             129
Total params: 2,121,153
Trainable params: 2,121,153
Non-trainable params: 0
Input size (MB): 3.50
Forward/backward pass size (MB): 3.75
Params size (MB): 8.09
Estimated Total Size (MB): 15.34
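
The shapes and parameter counts in the summary above can be cross-checked with a little arithmetic. The sketch below is not taken from the MandSOT source: it assumes a kernel size of 3 for every Conv1d layer, padding of 1 on the first convolution only, MaxPool1d with kernel size 2, and a 224-channel input of length 4096. These values are inferred from the numbers in the table, and the helper names are hypothetical.

```python
# Recompute the output shapes and parameter counts listed in the summary.
# Assumptions (inferred from the table, not stated in the README):
#   kernel_size=3 for every Conv1d, padding=1 on the first conv only,
#   MaxPool1d(kernel_size=2), input of 224 channels x 4096 frames.

def conv1d_len(length, kernel=3, padding=0, stride=1):
    """Output length of a 1-D convolution."""
    return (length + 2 * padding - kernel) // stride + 1

def conv1d_params(in_ch, out_ch, kernel=3):
    """Weights plus one bias per output channel."""
    return out_ch * in_ch * kernel + out_ch

def linear_params(in_features, out_features):
    """Weights plus one bias per output feature."""
    return in_features * out_features + out_features

def summarize(in_ch=224, length=4096):
    rows = []
    conv_specs = [(32, 1), (64, 0), (32, 0), (64, 0)]  # (out_channels, padding)
    ch = in_ch
    for out_ch, pad in conv_specs:
        length = conv1d_len(length, padding=pad)
        rows.append(("Conv1d", (out_ch, length), conv1d_params(ch, out_ch)))
        length //= 2  # max pooling halves the length (floor division)
        rows.append(("MaxPool1d", (out_ch, length), 0))
        ch = out_ch
    flat = ch * length  # flattened features fed to the first Linear layer
    rows.append(("Linear", (128,), linear_params(flat, 128)))
    rows.append(("Linear", (1,), linear_params(128, 1)))
    return rows

rows = summarize()
total_params = sum(params for _, _, params in rows)
for name, shape, params in rows:
    print(f"{name:<10} {str(shape):<12} {params:>10,}")
print(f"Total params: {total_params:,}")
```

Under these assumptions the per-layer counts and the total agree with the summary, which suggests the inferred kernel size and padding are at least consistent with the published table.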


Dataset Preparation

INPUT <dataset, pd.DataFrame, [0, 0]>
|--- Read SOT annotation CSV(s) <dataset, pd.DataFrame, [2('wav','onset'), N_audio]>
|--- Load Audio (wav path from CSV(s))
|       |--- Read raw audio signal
|       |--- Check Sample Rate (sr)
|       |       |--- Resample to 48kHz if sr != 48000
|       |
|       |--- Data Augmentation (adding noise)
|       |--- Padding (Zero-padding)
|       |--- Apply Pre-emphasis (y_emp[0] = y[0]; y_emp[1:] = y[1:] - alpha * y[:-1])
|       |--- Perform MFCC Feature Extraction
|               |--- Configuration:
|               |       - Number of MFCC features (n_mfcc): 32/64/128
|               |       - Window length: 256/512/1024
|               |       - Hop length: window_length / 2
|               |       - Number of FFT points (n_fft): window_length
|               |       - Number of Mel filter banks (n_mels): 32/64/128
|               |       - Maximum frequency (fmax): 10000 Hz
|               |       - Window function: 'hamming'
|               |
|               |--- Compute and Combine MFCC Features (librosa.feature.mfcc)
|--- Return MFCC Features (mfcc, np.array, [224, 4096])
OUTPUT <dataset, pd.DataFrame, [3('wav','onset','mfcc'), N]>
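
The padding and pre-emphasis steps in the pipeline above can be sketched in plain Python. This is an illustration, not the package's implementation: the function names, the default `alpha=0.97`, right-side padding, and truncation of over-long signals are all assumptions. The pairing of each `n_mfcc` value with a window length (32 with 256, 64 with 512, 128 with 1024) is also an assumption, since the pipeline lists the options without saying how they combine.

```python
def pre_emphasis(y, alpha=0.97):
    """Keep the first sample; subtract alpha * previous sample from each later one.

    alpha=0.97 is a commonly used default, assumed here.
    """
    if not y:
        return []
    return [y[0]] + [y[n] - alpha * y[n - 1] for n in range(1, len(y))]

def zero_pad(y, target_len):
    """Right-pad with zeros up to target_len (assumed); longer input is truncated."""
    return list(y[:target_len]) + [0.0] * max(0, target_len - len(y))

# Three MFCC configurations built from the listed options.
# The (n_mfcc, window length) pairing is an assumption; hop length is window / 2.
mfcc_configs = [
    {
        "n_mfcc": n,
        "n_mels": n,
        "win_length": w,
        "n_fft": w,
        "hop_length": w // 2,
        "fmax": 10000,
        "window": "hamming",
    }
    for n, w in zip((32, 64, 128), (256, 512, 1024))
]

# Stacking the three feature sets gives 32 + 64 + 128 = 224 rows,
# which matches the first dimension of the returned [224, 4096] array.
total_rows = sum(cfg["n_mfcc"] for cfg in mfcc_configs)
```

Each dictionary holds the keyword values that would be passed to `librosa.feature.mfcc` for one configuration. How the three outputs are aligned to a common length of 4096 frames is not described in this README, so that step is left out.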

Model initialization






pip install mandsot

Praat Plugin

In progress...






  • Prepare dataset
    • example.csv
      wav_name                       onset  on/off
      example_audio_1.wav            898    1
      example_audio_2.wav            1145   1
      example_audio_3.wav            764    1
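
An annotation file like the example above could be read with the standard library as sketched here. The sketch assumes the file is comma-separated with the header names shown (`wav_name`, `onset`, `on/off`); `read_annotations` is a hypothetical helper, not part of the `mandsot` package, and the unit of `onset` and the meaning of `on/off` are not stated in this README, so the values are kept as plain integers.

```python
import csv
import io

# Hypothetical file contents mirroring the example table,
# assuming comma-separated columns with the same header names.
EXAMPLE_CSV = """wav_name,onset,on/off
example_audio_1.wav,898,1
example_audio_2.wav,1145,1
example_audio_3.wav,764,1
"""

def read_annotations(text):
    """Return a list of (wav_name, onset, flag) tuples with numeric fields as int."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        (row["wav_name"], int(row["onset"]), int(row["on/off"]))
        for row in reader
    ]

annotations = read_annotations(EXAMPLE_CSV)
for wav_name, onset, flag in annotations:
    print(wav_name, onset, flag)
```

For a file on disk, the same `csv.DictReader` call works on a handle from `open(path, newline="")` in place of the `io.StringIO` wrapper.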



MIT © Ryan Alloriadonis