Play Drums in your Browser.
Drums-app lets you simulate any percussion instrument in your browser using only your webcam. All machine learning models run locally, so no user information is sent to any server.
Check the demo at drums-app.com
Simply serve src/index.html from a local web server, or go to drums-app.com.
Select Set Template to build your own drums template by uploading images and attaching your sounds to them.
Turn on your webcam and enjoy it!
*No cats were harmed during this recording.*
The processing pipeline runs entirely in the browser and uses two Machine Learning models:
- Hands Model: A Computer Vision model that detects 21 landmarks (x, y, z) for each hand.
- HitNet: An LSTM model developed specifically for this application and then converted to run in the browser. It takes the last N positions of a hand and predicts the probability that this sequence corresponds to a Hit.
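As an illustration of the data exchanged between the two models, here is a minimal Python sketch (names are illustrative, and the 4-frame window is taken from the HitNet description further below) of how a hand detection could be flattened and buffered before being scored by HitNet:

```python
# Illustrative sketch (not the app's actual code): assembling a HitNet input
# sequence from consecutive hand detections.
from collections import deque

SEQ_LEN = 4          # number of consecutive detections fed to HitNet
N_LANDMARKS = 21     # landmarks per hand, each with (x, y, z)

def flatten_hand(landmarks):
    """Flatten [(x, y, z), ...] for one hand into a 63-value feature vector."""
    return [coord for point in landmarks for coord in point]

# Rolling buffer holding the last SEQ_LEN flattened detections of one hand.
sequence_buffer = deque(maxlen=SEQ_LEN)

def on_hand_detected(landmarks):
    sequence_buffer.append(flatten_hand(landmarks))
    # Once full, the buffer (SEQ_LEN x 63 values) is exactly what HitNet
    # receives to score the probability of a hit.
    return len(sequence_buffer) == SEQ_LEN
```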
The dataset used for training has been built in the following way:
- A representative landmark (Index Finger Dip [Y]) of each detected hand is plotted in an interactive chart.
- Any time that a key is pressed, a grey mark is plotted on the same chart.
- I start playing drums with one hand while pressing a key on the keyboard (with the other hand) every time I hit an imaginary drum. [Gif Left]
- I use the mouse to select in the chart the points that should be considered a hit. [Gif Right]
- When the "Save Dataset" button is clicked, all hand positions together with their corresponding tags (1 if the frame was considered a hit, 0 otherwise) are downloaded as a JSON file.
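The exact schema of that JSON file is not documented here, so the following loader is only a sketch under an assumed structure (one record per frame, holding the flattened landmarks and the 0/1 tag):

```python
# Sketch of loading the exported dataset; the record keys and the file name
# are assumptions, the real schema may differ.
import json
import numpy as np

def load_dataset(path):
    with open(path) as f:
        records = json.load(f)
    X = np.array([r["landmarks"] for r in records], dtype=np.float32)
    y = np.array([r["tag"] for r in records], dtype=np.float32)
    return X, y

X, y = load_dataset("dataset.json")
print(X.shape, y.mean())  # y.mean() shows how underrepresented the positive class is
```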
HitNet has been built, trained, and then exported to run in the browser. To avoid any noticeable lag between the hit on the drum and the produced sound, HitNet must run as fast as possible; for this reason it implements an extremely simple architecture.
It takes as input the last 4 detections of a hand [a flattened version of its 21 landmarks (x, y, z)] and outputs the probability that this sequence corresponds to a hit. It is composed only of an LSTM layer with a ReLU activation (using dropout with p = 0.25) and a Dense output layer with a single unit, followed by a sigmoid activation.
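A minimal Keras sketch of this architecture (Keras itself and the number of LSTM units are assumptions; the source only fixes the layer types, the dropout, and the input/output shapes):

```python
from tensorflow.keras import layers, models

SEQ_LEN = 4          # last 4 detections of a hand
N_FEATURES = 21 * 3  # 21 landmarks, (x, y, z) each, flattened

hitnet = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATURES)),
    # LSTM with ReLU activation and dropout p = 0.25, as described above;
    # 32 units is a placeholder, not taken from the source.
    layers.LSTM(32, activation="relu", dropout=0.25),
    # Single-unit Dense output with a sigmoid -> probability of a hit.
    layers.Dense(1, activation="sigmoid"),
])
hitnet.summary()
```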
HitNet has been trained with the following parameterization (sketches of the augmentation and of the loss/checkpoint setup are given after the list):
- Epochs: 3000.
- Optimizer: Adam.
- Loss: Weighted Binary Cross Entropy*.
- Training/Val Split: 0.85-0.15.
- Data Augmentation:
  - Mirroring: X axis.
  - Shift: Applied as a single block to the whole sequence.
    - X Shift: ±0.3.
    - Y Shift: ±0.3.
    - Z Shift: ±0.5.
  - Interframe Noise: Small shift applied independently to each frame of the sequence.
    - Interframe Noise X: ±0.01.
    - Interframe Noise Y: ±0.01.
    - Interframe Noise Z: ±0.0025.
  - Intraframe Noise: Extremely small shift applied independently to each individual landmark of a hand.
    - Intraframe Noise X: ±0.0025.
    - Intraframe Noise Y: ±0.0025.
    - Intraframe Noise Z: ±0.0001.
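As a rough illustration, the augmentations above could be applied to a (4, 63) sequence as in the sketch below; the ranges are the ones listed, while the mirroring convention and broadcasting details are assumptions:

```python
import numpy as np

def augment_sequence(seq, rng=None):
    """Augment one sequence of shape (SEQ_LEN, 63), laid out as (x, y, z) per landmark."""
    if rng is None:
        rng = np.random.default_rng()
    seq = seq.copy().reshape(seq.shape[0], -1, 3)  # -> (SEQ_LEN, 21, 3)
    n_frames, n_landmarks = seq.shape[0], seq.shape[1]

    # Mirroring on the X axis (assuming coordinates normalized to [0, 1]).
    if rng.random() < 0.5:
        seq[..., 0] = 1.0 - seq[..., 0]

    # Shift: one offset applied as a block to the whole sequence.
    seq[..., 0] += rng.uniform(-0.3, 0.3)
    seq[..., 1] += rng.uniform(-0.3, 0.3)
    seq[..., 2] += rng.uniform(-0.5, 0.5)

    # Interframe noise: one small offset per frame.
    seq[..., 0] += rng.uniform(-0.01, 0.01, size=(n_frames, 1))
    seq[..., 1] += rng.uniform(-0.01, 0.01, size=(n_frames, 1))
    seq[..., 2] += rng.uniform(-0.0025, 0.0025, size=(n_frames, 1))

    # Intraframe noise: one tiny offset per individual landmark.
    seq[..., 0] += rng.uniform(-0.0025, 0.0025, size=(n_frames, n_landmarks))
    seq[..., 1] += rng.uniform(-0.0025, 0.0025, size=(n_frames, n_landmarks))
    seq[..., 2] += rng.uniform(-0.0001, 0.0001, size=(n_frames, n_landmarks))

    return seq.reshape(n_frames, -1)
```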
The exported weights are not those of the last epoch, but those of whichever intermediate epoch minimized the Validation Loss.
*Loss is weighted since the positive class is extremely underrepresented in the training set.
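A sketch of how the weighted loss and the best-epoch checkpointing could be set up: `hitnet` is the model sketched earlier, while `X_seq` and `y_seq` stand for the windowed (num_sequences, 4, 63) inputs and their 0/1 tags, assumed to have been built from the per-frame dataset; using `class_weight` with plain binary cross entropy is an assumption that reproduces the effect of a weighted loss:

```python
import tensorflow as tf

# Weight the positive (hit) class more heavily, since it is underrepresented.
n_pos = int(y_seq.sum())
n_neg = int(len(y_seq) - n_pos)
class_weight = {0: 1.0, 1: n_neg / max(n_pos, 1)}

hitnet.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Keep the weights of the epoch with the lowest validation loss,
# not the weights of the final epoch.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "hitnet_best.keras", monitor="val_loss", save_best_only=True
)

hitnet.fit(
    X_seq, y_seq,
    validation_split=0.15,   # 0.85-0.15 train/val split
    epochs=3000,
    class_weight=class_weight,
    callbacks=[checkpoint],
)
```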
Confusion matrices show that results are strong for both classes when the confidence threshold is set at 0.5.
Although these False Positives and False Negatives could worsen the user experience in a network that is executed several times per second, they do not noticeably affect playtime in a real situation, for three reasons:
- Most False Positives come from the frames immediately before or after the hit. In practice, this is solved by emptying the sequence buffers every time a hit is detected.
- The small number of False Negatives in the training set comes from Data Augmentation or from hits detected on the previous or following frame. In real cases, these displacements do not affect the experience.
- The remaining False Positives rarely appear in real cases since, during playtime, only sequences whose detections enter one of the predefined drums are analyzed. In practice, this works as a double check for the positive cases.
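For illustration only, the runtime logic described in these points could look roughly like the following sketch (the real app runs in the browser; `flatten_hand` comes from the earlier sketch, and every other name here is illustrative):

```python
HIT_THRESHOLD = 0.5  # confidence threshold mentioned above

def process_frame(landmarks, inside_drum, sequence_buffer, predict_fn):
    """Return True when a hit is confirmed (the caller then plays the sound)."""
    sequence_buffer.append(flatten_hand(landmarks))
    if len(sequence_buffer) < sequence_buffer.maxlen:
        return False
    # Only sequences whose detections enter a predefined drum region are
    # analyzed; this acts as a double check against spurious positives.
    if not inside_drum:
        return False
    hit_probability = predict_fn(list(sequence_buffer))
    if hit_probability > HIT_THRESHOLD:
        # Empty the buffer so the frames right after the hit cannot
        # trigger a second, spurious detection.
        sequence_buffer.clear()
        return True
    return False
```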
The evolution of the Train/Validation Loss during training confirms that there was no overfitting.