This notebook provides an overview of the Synthesia project.

In [3]:
import pandas as pd
import numpy as np
import cv2
from IPython.display import Image
import os
from collections import Counter #Used by ConvertListToNote for vote counting

# [Project Preface](#first)

Package usage:
- Pandas (For organization in dataframes)
- Numpy (For math operations such as rounding and ceiling)
- Collections, os, and IPython for small features and convenience
- LilyPond \[GNU license\] (To write music sheets from text)
- Mingus \[GNU License\] (To talk to LilyPond and create the .ly file LilyPond needs)
- cv2 \[BSD License\] -- Opencv (To analyze the video feed)

# [Intro & Goal](#second)

## Project Goal

In this project I set out to convert a Synthesia video into a printable music sheet!

I achieved this through video processing (OpenCV) to detect pixel movement, converting pixels to strings (note names), and strings to engraved musical notes (Mingus).

![title](ProjectGoal.png)

## Why this approach? Why video processing?

There are libraries such as `Librosa` which process audio files (such as mp3) as time-series data, much like you would see in applications like `Audacity`. This would be the more logical approach: it would likely involve less noise, would probably be easier to convert to a music sheet, and would work on any video you find online (whereas my script only works on Synthesia videos).

However, I had never worked with videos before (at best, only images), and I wanted to learn more and expand my knowledge. Doing this project has been a big learning experience and I'm glad I did it the way I did.

In the future, though, it'd be great to revisit this project with mp3 as the input, since an mp3 file is easier to acquire and the approach would thus be more generally useful.

---------------------

# [OpenCV](#third)

## Obtaining video feed and masking using OpenCV2

Sentdex (who runs https://pythonprogramming.net/) has a YouTube playlist that helped me greatly, and I would recommend their tutorials if you're looking to learn OpenCV2!
- https://www.youtube.com/watch?v=Z78zbnLlPUA&list=PLQVvvaa0QuDdttJXlLtAJxJetJcqmqlQq


Using HSV (Hue, Saturation, Value) bounds, I created blue and green masks of the original video feed. BGR (OpenCV2's default channel order, instead of RGB) makes it difficult to precisely differentiate hues of one color (e.g. light blue vs. dark blue), whereas HSV keeps the hue in a single channel.

![title](Masking.png)

Masks were created from [lower,upper] HSV bounds for green and then blue, using the HSV values of pure green and pure blue as the starting point (below)

In [4]:
#Utilize BGR scale to create pure green and blue colors
green = np.uint8([[[0,255,0 ]]])
blue = np.uint8([[[255,0,0 ]]])

#Convert to HSV scale
hsv_green = cv2.cvtColor(green,cv2.COLOR_BGR2HSV)
hsv_blue = cv2.cvtColor(blue,cv2.COLOR_BGR2HSV)

#Printout the HSV version of pure green and pure blue
print( "Green in HSV: ",hsv_green )
print( "Blue in HSV: ",hsv_blue )

Green in HSV:  [[[ 60 255 255]]]
Blue in HSV:  [[[120 255 255]]]
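
The hue values above (60 for green, 120 for blue) can look odd if you expect degrees on a 0-360 color wheel. For 8-bit images OpenCV stores hue as degrees divided by 2 so that it fits in a byte. Below is a minimal sketch of that convention using only the standard library's `colorsys`; the helper name `bgr_to_opencv_hsv` is made up for illustration and this is not how `cv2.cvtColor` is implemented internally.

```python
import colorsys

def bgr_to_opencv_hsv(b, g, r):
    """Convert an 8-bit BGR triple to OpenCV's 8-bit HSV convention."""
    # colorsys expects RGB floats in [0, 1] and returns h, s, v in [0, 1]
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    # 8-bit OpenCV hue is degrees / 2 (0-179); saturation and value scale to 0-255
    return round(h * 180) % 180, round(s * 255), round(v * 255)

print(bgr_to_opencv_hsv(0, 255, 0))  # pure green, given in BGR order
print(bgr_to_opencv_hsv(255, 0, 0))  # pure blue, given in BGR order
```

This is why the mask bounds used later sit around hue 30-80 for green and 75-140 for blue rather than around 120 and 240.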


## Pixel detection

First I created a cropped 1-px-tall region (indicated by the red bounding box in the image below) which is ($w$,1) pixels in dimension, where $w$ is the width of the video (640 px)

![title](Crop_region.png)

Below is a summary of the code used to set up the bounding box (`cropped`) to look for green (`grn_mask_crop`) and blue (`blu_mask_crop`) notes within the bounding box.

In [None]:
cropped = frame[start:start+1, 0:w] #Start = 250 == A random height at which to draw the bounding box
hsv_crop = cv2.cvtColor(cropped,cv2.COLOR_BGR2HSV) #Convert bounding box to HSV scale

#Green filter
grn_lower = np.array([30,90,100]) #Lower bound of "Green" color
grn_upper = np.array([80,255,255]) #Upper bound of "Green" color
grn_mask_crop = cv2.inRange(hsv_crop,grn_lower,grn_upper) #Within the bounding box, keep only green notes

#Blue filter
blu_lower = np.array([75,20,120]) #Lower bound of "Blue" color
blu_upper = np.array([140,230,250]) #Upper bound of "Blue" color
blu_mask_crop = cv2.inRange(hsv_crop,blu_lower,blu_upper) #Within the bounding box, keep only blue notes

#These full-frame masks are the feeds shown in the images above, with only green or blue notes.
#They serve debug/visual interpretation purposes only and do not contribute to the bounding box detection.
hsv = cv2.cvtColor(frame,cv2.COLOR_BGR2HSV) #Convert entire video feed to HSV scale
grn_mask = cv2.inRange(hsv,grn_lower,grn_upper) #Full-frame green mask (visual purposes only)
blu_mask = cv2.inRange(hsv,blu_lower,blu_upper) #Full-frame blue mask (visual purposes only)

To actually detect the notes I used `cv2.findNonZero(grn_mask_crop)` to detect any pixels that aren't black in the bounding box --- In this case, any green notes in the bounded region.


In [None]:
coord_right=cv2.findNonZero(grn_mask_crop) #Pixel locations of where there are green
coord_left=cv2.findNonZero(blu_mask_crop) #Pixel locations of where there are blue

These coordinate outputs have the following format, shown here for a pure "G" note located in the 3rd octave:

\>> coord_right <br>
[ [ 246 , 0 ]<br>
  [ 247 , 0 ]<br>
  [ 248 , 0 ]<br>
  [ 249 , 0 ]<br>
  [ 250 , 0 ]<br>
  [ 251 , 0 ]<br>
  [ 252 , 0 ]<br>
  [ 253 , 0 ]<br>
  [ 254 , 0 ]<br>
  [ 255 , 0 ]<br>
  [ 256 , 0 ]<br>
  [ 257 , 0 ] ]
  
Note above is a pure "G-3"
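
To make the two steps (mask, then find the non-black pixels) concrete, here is a toy pure-Python sketch of what they do on a 1-px-tall strip. The helper names `in_range` and `find_nonzero_row` and the 6-px strip are invented for illustration; the real script uses `cv2.inRange` and `cv2.findNonZero` on NumPy arrays.

```python
def in_range(pixel, lower, upper):
    """True when every HSV channel lies inside [lower, upper] (inclusive)."""
    return all(lo <= ch <= hi for ch, lo, hi in zip(pixel, lower, upper))

def find_nonzero_row(row, lower, upper):
    """Return [x, 0] for each pixel of a 1-px-tall strip that passes the mask."""
    return [[x, 0] for x, pixel in enumerate(row) if in_range(pixel, lower, upper)]

grn_lower, grn_upper = (30, 90, 100), (80, 255, 255)

# Toy 6-px strip in HSV: dark background with a green note on x = 2..4
background, green_note = (0, 0, 20), (60, 200, 220)
row = [background, background, green_note, green_note, green_note, background]

print(find_nonzero_row(row, grn_lower, grn_upper))
```

One difference to keep in mind: this toy version returns an empty list when nothing matches, whereas the detection code later in this notebook relies on `cv2.findNonZero` handing back `None` in that case.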

So great! We have pixels, but we want notes.

# [Pixel to Notes](#fourth)

## Counting pixels ... literally 
Utilizing MSpaint or your favorite image editing software such as Photoshop, you can experience the joy of meticulously counting the pixels that align to each of the 88 keys of the piano keyboard!

![PixelMeasuring](PixelMeasuring.png)

Here is the pixel alignment measurement I did for middle C (C-4).

As you'll note in the image, I have "Old" and "New" values that I gave to the pixels -- Originally I thought maybe it would be smart to prioritize pixel-real-estate to the white keys because they're bigger. This however yielded many false detections, especially false negatives among the black keys.

By changing and updating to the new values, I gave each key an equal voice which aided in my noise-filtering algorithm.

From the middle-C octave (you'll note that Synthesia calls middle C "C3", but everyone else on the internet calls it "C4" as far as I've been trained) I gathered that an octave is 86 px wide, so I simply duplicated this 1-octave measurement up and down the pixel map until I had a map from pixel 0 --> pixel 640 (the width of the video feed).
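
The "measure one octave, then repeat it every 86 px" idea can also be written as a small table lookup. This is only a sketch under stated assumptions: the key widths and the starting pixel (112 for `C-2`) are read off the mapping function shown in the next section, the function name `pixel_to_note_regular` is made up, and it only covers octaves 2 through 6, where the pattern repeats exactly. The edge octaves (0, 1, 7, 8) have slightly different widths and are not handled here.

```python
from bisect import bisect_right
from itertools import accumulate

NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
WIDTHS = [7, 7, 7, 7, 8, 7, 7, 7, 7, 7, 7, 8]  # px per key, sums to 86
ENDS = list(accumulate(WIDTHS))                 # exclusive end offset of each key
OCTAVE_PX = sum(WIDTHS)                         # 86

def pixel_to_note_regular(val, first_pixel=112, first_octave=2, n_octaves=5):
    """Map a pixel to a note for the evenly repeated octaves (2 through 6)."""
    rel = val - first_pixel
    if not 0 <= rel < OCTAVE_PX * n_octaves:
        return None  # outside the regular region (octaves 0, 1, 7, 8)
    octave, within = divmod(rel, OCTAVE_PX)
    name = NAMES[bisect_right(ENDS, within)]
    return f"{name}-{first_octave + octave}"

print(pixel_to_note_regular(251))  # G-3
print(pixel_to_note_regular(284))  # C-4
```

The trade-off is readability versus length: the explicit if/elif version below is much taller but lets every boundary be hand-tuned, which mattered for the irregular edge octaves.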

## Mapping pixels to Notes

With a pixel --> note assignment for every pixel in the video feed, I did the following:

- 1) I created a lookup function called `ConvertPixelToNote(val)` which converts a single pixel position to the note whose pixel range contains it.
- 2) I extended the mapping to list objects, such as those from the bounding box detection, and wrapped it into the function `ConvertListToNote(lst,vote_threshold=5,percent_threshold=0.2)`

Below are the functions themselves -- beware, the pixel-to-note mapping is a very tall function!

In [5]:
#Define custom functions for Pixel <-> Note conversion
def ConvertPixelToNote(val):
    '''
    Converts single pixel into note
    '''
    if 0 <= val <= 24: #Octave 0
        if 0 <= val <= 9:
            note = "A-0"
        elif 10 <= val <= 16:
            note = "A#-0"
        elif 17 <= val <= 24:
            note = "B-0"
    elif 25 <= val <= 111: #Octave 1
        if 25 <= val <= 31:
            note = "C-1"
        elif 32 <= val <= 38:
            note = "C#-1"
        elif 39 <= val <= 45:
            note = "D-1"
        elif 46 <= val <= 52:
            note = "D#-1"
        elif 53 <= val <= 60:
            note = "E-1"
        elif 61 <= val <= 67:
            note = "F-1"
        elif 68 <= val <= 74:
            note = "F#-1"
        elif 75 <= val <= 81:
            note = "G-1"
        elif 82 <= val <= 88:
            note = "G#-1"
        elif 89 <= val <= 95:
            note = "A-1"
        elif 96 <= val <= 102:
            note = "A#-1"
        elif 103 <= val <= 111:
            note = "B-1"
    elif 112 <= val <= 197: #Octave 2
        if 112 <= val <= 118:
            note = "C-2"
        elif 119 <= val <= 125:
            note = "C#-2"
        elif 126 <= val <= 132:
            note = "D-2"
        elif 133 <= val <= 139:
            note = "D#-2"
        elif 140 <= val <= 147:
            note = "E-2"
        elif 148 <= val <= 154:
            note = "F-2"
        elif 155 <= val <= 161:
            note = "F#-2"
        elif 162 <= val <= 168:
            note = "G-2"
        elif 169 <= val <= 175:
            note = "G#-2"
        elif 176 <= val <= 182:
            note = "A-2"
        elif 183 <= val <= 189:
            note = "A#-2"
        elif 190 <= val <= 197:
            note = "B-2"        
    elif 198 <= val <= 283: #Octave 3
        if 198 <= val <= 204:
            note = "C-3"
        elif 205 <= val <= 211:
            note = "C#-3"
        elif 212 <= val <= 218:
            note = "D-3"
        elif 219 <= val <= 225:
            note = "D#-3"
        elif 226 <= val <= 233:
            note = "E-3"
        elif 234 <= val <= 240:
            note = "F-3"
        elif 241 <= val <= 247:
            note = "F#-3"
        elif 248 <= val <= 254:
            note = "G-3"
        elif 255 <= val <= 261:
            note = "G#-3"
        elif 262 <= val <= 268:
            note = "A-3"
        elif 269 <= val <= 275:
            note = "A#-3"
        elif 276 <= val <= 283:
            note = "B-3"
    elif 284 <= val <= 369: #Octave 4
        if 284 <= val <= 290:
            note = "C-4"
        elif 291 <= val <= 297:
            note = "C#-4"
        elif 298 <= val <= 304:
            note = "D-4"
        elif 305 <= val <= 311:
            note = "D#-4"
        elif 312 <= val <= 319:
            note = "E-4"
        elif 320 <= val <= 326:
            note = "F-4"
        elif 327 <= val <= 333:
            note = "F#-4"
        elif 334 <= val <= 340:
            note = "G-4"
        elif 341 <= val <= 347:
            note = "G#-4"
        elif 348 <= val <= 354:
            note = "A-4"
        elif 355 <= val <= 361:
            note = "A#-4"
        elif 362 <= val <= 369:
            note = "B-4"
    elif 370 <= val <= 455: #Octave 5
        if 370 <= val <= 376:
            note = "C-5"
        elif 377 <= val <= 383:
            note = "C#-5"
        elif 384 <= val <= 390:
            note = "D-5"
        elif 391 <= val <= 397:
            note = "D#-5"
        elif 398 <= val <= 405:
            note = "E-5"
        elif 406 <= val <= 412:
            note = "F-5"
        elif 413 <= val <= 419:
            note = "F#-5"
        elif 420 <= val <= 426:
            note = "G-5"
        elif 427 <= val <= 433:
            note = "G#-5"
        elif 434 <= val <= 440:
            note = "A-5"
        elif 441 <= val <= 447:
            note = "A#-5"
        elif 448 <= val <= 455:
            note = "B-5"
    elif 456 <= val <= 541: #Octave 6
        if 456 <= val <= 462:
            note = "C-6"
        elif 463 <= val <= 469:
            note = "C#-6"
        elif 470 <= val <= 476:
            note = "D-6"
        elif 477 <= val <= 483:
            note = "D#-6"
        elif 484 <= val <= 491:
            note = "E-6"
        elif 492 <= val <= 498:
            note = "F-6"
        elif 499 <= val <= 505:
            note = "F#-6"
        elif 506 <= val <= 512:
            note = "G-6"
        elif 513 <= val <= 519:
            note = "G#-6"
        elif 520 <= val <= 526:
            note = "A-6"
        elif 527 <= val <= 533:
            note = "A#-6"
        elif 534 <= val <= 541:
            note = "B-6"
    elif 542 <= val <= 640: #Octave 7 & 8
        if 542 <= val <= 548:
            note = "C-7"
        elif 549 <= val <= 555:
            note = "C#-7"
        elif 556 <= val <= 562:
            note = "D-7"
        elif 563 <= val <= 569:
            note = "D#-7"
        elif 570 <= val <= 577:
            note = "E-7"
        elif 578 <= val <= 584:
            note = "F-7"
        elif 585 <= val <= 591:
            note = "F#-7"
        elif 592 <= val <= 598:
            note = "G-7"
        elif 599 <= val <= 605:
            note = "G#-7"
        elif 606 <= val <= 612:
            note = "A-7"
        elif 613 <= val <= 619:
            note = "A#-7"
        elif 620 <= val <= 628:
            note = "B-7" 
        elif 629 <= val <= 640:
            note = "C-8" 
    try:
        return note #Return the note if one was matched
    except UnboundLocalError: #No range matched (pixel outside 0-640), so there is no note
        return None

#Step up ConvertPixelToNote to work for a list object & apply Collaborative Voting        
def ConvertListToNote(lst,vote_threshold=5,percent_threshold=0.20,debug=False):
    '''
    Uses ConvertPixelToNote to convert a list object to note(s) based on collaborative voting
    '''
    temp = Counter()
    for val in lst: #Read all pixels, convert to notes, and increment counts
        converted_note = ConvertPixelToNote(val)
        if converted_note is not None: #Do not count "None" as a note
            temp[converted_note] += 1
    #Stage 1 elimination
    #Eliminate smaller noise so that they don't contribute to the overall percentage
    keys = list(temp.keys())
    backup = temp.copy() #Copy (not alias) so the debug printout shows the counts before elimination
    for note in keys:
        if temp[note] <= vote_threshold:
            temp.pop(note)
    
    #Stage 2 elimination -- Now compute total_hits after first elimination
    total_hits = np.sum(list(temp.values()))
    if debug==True:
        print("Keys before stage 1 elimination: ",backup,
              "\nKeys going into stage 2: ",temp,"\nTotal Hits: ",total_hits)
    for note in list(temp.keys()): #For each note detected
        #Want to keep majority by count or percentage
        #Remove notes that don't occur often or are not majority %'age
        if temp[note]/total_hits<=percent_threshold:
            temp.pop(note)
    return list(temp.keys())

If you were paying attention, you may have noticed some extra arguments in the last function.

What are `vote_threshold` and `percent_threshold`? These are the hyperparameters of my filtering algorithm.

## Filtering out noise -- False positive and False negative handling

### Approach 1 - Take the median
The approach I've taken to analyze video feeds instead of raw mp3 audio as an input is inherently noisy. There are bound to be misreads.

Originally I took the simple approach of "find the median pixel ---> That is your note"

For simple cases such as the pure G-3 detection provided earlier ... the median pixel would be about `251`, which based on the mapping would churn out `G-3` ... perfect!

However, this approach falls apart when you're given chords such as a ['G-2','G-3'] octave: the median lands halfway between the two notes and produces `C#-3`.

-- There needed to be a better approach that could not only handle multiple notes, but also filter out the noise.
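
Here is a quick worked check of that failure using the pixel ranges from the mapping above (`G-2` = 162-168, `G-3` = 248-254, `C#-3` = 205-211). It is an idealized sketch that assumes each note lights up exactly its own pixel range with no bleeding.

```python
from statistics import median

g2_pixels = list(range(162, 169))  # G-2 occupies pixels 162-168 in the mapping
g3_pixels = list(range(248, 255))  # G-3 occupies pixels 248-254 in the mapping

single = median(g3_pixels)              # falls inside 248-254, so G-3
octave = median(g2_pixels + g3_pixels)  # falls inside 205-211, so C#-3

print(single, octave)
```

The single note gives 251, but the octave gives 208.0, a pixel that neither note actually covers.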

### Approach 2 - Collaborative Voting
Notes in Synthesia bleed into neighboring notes' pixel ranges. As seen below, I used the marquee tool to draw a dotted rectangle around the `G-3` and `A-3` notes coming down to the keyboard.

![BleedingNotes](BleedingNotes.png)

You'll notice
- `G-3` bleeds into `F#-3` a little and quite heavily into `G#-3`
- `A-3` bleeds into `A#-3` a little and quite heavily into `G#-3`

This has 2 implications...

- 1) There needs to be a 'voting' system which can filter out the false-positive detections of neighboring notes
- 2) If a ['G-3','A-3'] chord or any other 2nd-interval chord appears, it will completely cover [`G-3`,`G#-3`,`A-3`] and be impossible to resolve!

To apply this 'voting' system I had a 2-stage elimination process

- 1) Eliminate false detections if they don't have enough of their pixels firing (Less than or equal to 5 counts)
- 2) Eliminate the remaining notes if they don't contribute to the majority vote (less than or equal to 20% of the overall detections)

Refer to `ConvertListToNote()` function above to see the stage-1 and stage-2 elimination code. <br>
An example of the Collaborative Voting in action is provided below:

![CollaborativeVoting](CollaborativeVoting3.png)
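
As a small worked example of the two stages, the sketch below applies the same thresholds directly to per-note pixel counts instead of raw pixel coordinates. The function name `vote` is made up for this illustration; it mirrors the `<=` eliminations in `ConvertListToNote()` by keeping only notes strictly above each threshold.

```python
from collections import Counter

def vote(counts, vote_threshold=5, percent_threshold=0.20):
    """Two-stage elimination on a {note: pixel_hits} mapping."""
    # Stage 1: drop notes with too few pixels firing
    survivors = {n: c for n, c in counts.items() if c > vote_threshold}
    # Stage 2: drop notes holding too small a share of the remaining hits
    total = sum(survivors.values())
    return [n for n, c in survivors.items() if c / total > percent_threshold]

# Pixels 246-257 from the earlier G-3 example: 2 land on F#-3, 7 on G-3, 3 on G#-3
single_note = Counter({"F#-3": 2, "G-3": 7, "G#-3": 3})
print(vote(single_note))

# A made-up triad with a little bleed into G#-4
triad = Counter({"C-4": 7, "E-4": 8, "G-4": 7, "G#-4": 2})
print(vote(triad))
```

In the first case the bleed into `F#-3` and `G#-3` is removed in stage 1 and only `G-3` survives; in the second case all three chord notes hold more than 20% of the remaining votes and are kept.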

### Further reductions in noise

1) Read every other frame

This was done to reduce computation (which I thought would be a problem, but this code runs quite fast) and to reduce the chance of detections like the one below, where the next note bleeds into the bounding box before the current note is finished

![EveryOtherFrame](NoisyDetection.png)

2) Only record notes if they are different from the previously registered note

Previously I was recording information in the following format

| Frame | 150 | 151 | 152 | 153 | 154 | 155  | 156  | 157  |
|-------|-----|-----|-----|-----|-----|------|------|------|
| Note  | A-3 | A-3 | A-3 | A-3 | A-3 | F#-3 | F#-3 | F#-3 |

Note duration (i.e. how long a note should be played for) is important for musical compositions, but I decided to measure how long a note lasts by the time difference to the next note being played.

Thus, I don't need each frame that a note is sustained for -- just when it begins and when the next note begins to yield the equivalent below table:

| Frame | 150 | 155  | 
|-------|-----|-----|
| Note  | A-3 | F#-3 | 

Below is the code utilized to do that in summary:
- 1) Detects that the bounding box detection `coord_right` found something in the box
- 2) Converts the detection to a note
- 3) Check that the note is not None
- 3b) Append the note only if it wasn't the previous note
- 3c) Or if `allow_duplicate_right/left`==True
- 4) Last checkpoint: when collaborative voting eliminates every detection, `ConvertListToNote()` returns an empty list `[]`, which passes the "None" filter, so empty results are skipped here
- 5) Append that note to the registry

Comment on (3c): if a note is repeated, such as a repeating bass line in the left hand, I want to capture each hit of that note! Using a boolean lock, I allowed repeated notes by resetting the lock whenever the bounding box registered `None`, which suggests there was a gap between hits of the note.

In [None]:
#Right hand detection
if type(coord_right) == np.ndarray: #If not None basically
    note_right = ConvertListToNote(coord_right[:,0][:,0]) #Convert the coordinates to a note
    #Update only if there is a change to previously detected notes
        #Determine if coord_right has read a "None" detection
        #If so, then create a boolean lock to allow duplicate notes
        #None detection denotes that there was a gap in note detection (Common in cases like repeated bass notes)
    if note_right != None:
        if note_right != notes_right[-1] or allow_duplicate_right==True: #If not the same as previously registered
            #Register frames & notes to appropriate lists
            if debug==True:
                print("RH:",current_frame,note_right)
            if len(note_right) > 0: #Skip empty results (ConvertListToNote returns [] when every candidate is voted out)
                #Add to registry
                frames_right.append(current_frame)
                notes_right.append(note_right)
                #print("RH:",current_frame,note_right)
                allow_duplicate_right = False #Disallow duplication of this held note
else: #If previous reading was "None" (the only other reading type) allow duplicate notes
    allow_duplicate_right = True
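
To see the boolean lock in isolation, here is a small simulation over a made-up sequence of per-frame readings. Everything in it is an assumption for illustration: `register_notes` and `readings` are hypothetical names, `None` stands for "nothing in the bounding box", and the first note is always registered because the sketch starts with an empty registry.

```python
def register_notes(readings):
    """readings: one entry per processed frame, a list of notes or None."""
    frames, notes = [], []
    allow_duplicate = True
    for frame, detected in enumerate(readings):
        if detected is None:        # nothing in the bounding box: a gap
            allow_duplicate = True  # reset the lock
            continue
        if not detected:            # [] means every candidate was voted out
            continue
        if not notes or detected != notes[-1] or allow_duplicate:
            frames.append(frame)
            notes.append(detected)
            allow_duplicate = False  # the note is now "held"
    return frames, notes

readings = [["A-3"], ["A-3"], None, ["A-3"], ["A-3"], ["F#-3"]]
frames, notes = register_notes(readings)
print(frames, notes)
```

The held `A-3` on frames 0-1 is stored once, the gap on frame 2 resets the lock so the repeated `A-3` on frame 3 is stored again, and `F#-3` is stored because it differs from the previous note.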

# Measure & Note Durations

## Measure Detection

Unfortunately this is going to be a small section.

Synthesia has measure bars running down the video as thin white-lines which also have a number attached (Eg: 14) to them stating "This is measure 14".

Trying to create a white-mask to identify measure bars crossing the bounding region did not work for me.

Why? Probably because the video I downloaded was very low quality

So why not download a better quality? I was afraid all the work I had put into the project would be wasted if I downloaded a new video (eg: What if it is no longer 640x360 format --> Redo all the pixel alignments)

In the end, I simply detected notes and found the lowest frame at which notes were detected --- normalized everything to frame 0, and determined how many frames make up a measure based on my intuition of the song/beats.

Process:

1) Find the frames on which notes occur <br>
LH_frames = [147,160,170,205, ... ] <br>
RH_frames = [155,170,189,194, ... ] 

2) Subtract the lowest number (147) from all frames <br>
zeroed_LH_frames = [0,13,23,58, ... ] <br>
zeroed_RH_frames = [8,23,42,47, ... ]

3) __Hard-code__ that a measure is ~58 frames (based on the song, I know measure 2 begins on `D-3`, which is played on that frame in the left hand)

## Note Durations & Measure separation

As mentioned, note durations are computed on a "time till the next note" basis. Given the above "normalized" set of numbers, I further divide everything by `measure_length` = 57.4

>Note: <br>As mentioned, there are <span style="text-decoration: underline">exactly</span> 58 frames from Measure 1 starting to Measure 2 starting. However, in testing, the video starts to exhibit some measures of differing lengths. To capture the measures correctly, I found that division by a number less than 58 yielded correct alignment of measures 1~16 and would correct itself by measure 18 ... leaving only Measure 17 incorrectly constructed (unfortunate). I'm sure this issue propagates further in the composition, but as I do not own the sheet music for the piece I'm transcribing (the whole point of the project), I'm unable to determine exactly which measures are not correct

normalized_LH_frames = [0,0.22,0.40,1.01, ... ] <br>
normalized_RH_frames = [0.14,0.40,0.73,0.82, ... ]

Utilizing these values, I applied a `np.ceil()` function to separate each note into their respective measures with the final dataframe in `pandas` looking as such:

![dataframe](df_final.png)
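
The zeroing, division by `measure_length`, and measure assignment described above can be sketched in a few lines. This uses `math.ceil` from the standard library in place of `np.ceil`, the function name `to_measures` is made up, and clamping position 0 into measure 1 is an assumption of this sketch (a plain ceiling of 0 would give measure 0).

```python
import math

def to_measures(frames, first_frame, measure_length):
    """Zero the frame numbers, express them in measures, and bucket them."""
    rows = []
    for frame in frames:
        position = (frame - first_frame) / measure_length  # in units of measures
        # ceil() puts (0, 1] into measure 1, (1, 2] into measure 2, and so on;
        # position 0 is clamped into measure 1 (an assumption of this sketch)
        measure = max(1, math.ceil(position))
        rows.append((round(position, 2), measure))
    return rows

rh_rows = to_measures([155, 170, 189, 194], first_frame=147, measure_length=57.4)
lh_rows = to_measures([147, 205], first_frame=147, measure_length=57.4)
print(rh_rows)
print(lh_rows)
```

The right-hand frames land at 0.14, 0.4, 0.73 and 0.82 (all in measure 1), while left-hand frame 205 lands at 1.01 and is therefore the first note of measure 2.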

### Note Resolution

In the above image, the `Diff` column is the difference between the next frame and the current frame ... but in "Note space".

$$ Diff = note\_space\{ Frame_{i+1} - Frame_{i} \}$$

Conversion to note space is done by dictionary mapping the time-space values to note-space values for Mingus/LilyPond consumption (more on that later).

Computation:
- 1) Take the "Frame" values and compute the difference as mentioned in the above equation
- 2) Round the difference to the nearest 16th note (1/16 = 0.0625) value
- 3) Convert rounded value to Note_space using Mingus

| Frame | Difference | Rounded | Length in Note_space |
| ----- | ----- | ----- | ----- |
|0.00 | 0.10 - 0.00 = 0.10 | 0.125 | 8 |
|0.10 | 0.24 - 0.10 = 0.14 | 0.125 | 8 |
|0.24 | 0.35 - 0.24 = 0.11 | 0.125 | 8 |
|0.35 | 0.63 - 0.35 = 0.28 | 0.25 | 4 |
|0.63 | --- | 0.125 | 8 |
| ... | ... | ... | ... |

To round to an arbitrary step such as the nearest 0.0625, divide the value by that step, round it, and multiply by the step again:

$$ \mathrm{np.round}\left(\frac{frame_{i+1}-frame_{i}}{c}\right) \cdot c $$

In the above equation, $c = 0.0625$ is the step size you want to round to.
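
As a quick check of the table above, here is the same computation in plain Python. The helper name `round_to_step` is made up, and the built-in `round` stands in for `np.round` (both round exact ties to the nearest even integer).

```python
def round_to_step(value, step=0.0625):
    """Round value to the nearest multiple of step (1/16 of a measure by default)."""
    return round(value / step) * step

onsets = [0.00, 0.10, 0.24, 0.35, 0.63]  # note starts, in measures
diffs = [b - a for a, b in zip(onsets, onsets[1:])]
rounded = [round_to_step(d) for d in diffs]
print(rounded)
```

This reproduces the "Rounded" column: 0.125, 0.125, 0.125 and 0.25. One thing to watch for is that a difference smaller than 1/32 of a measure rounds to 0, which has no entry in the length-to-note dictionary used in the next step.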

Conversion from decimal "time space" to "Note_space" was done via pandas mapping.

... Basically what is happening is:
- 1/8 = 0.125 --> Map 0.125 to "8" --> an eighth note in LilyPond syntax
- 1/4 = 0.25 --> Map 0.25 to "4" --> a quarter note in LilyPond syntax

In [None]:
import mingus.core.value as value #Provides value.dots() for dotted note lengths

#Last step -- Dictionary convert note length to note type (eighth/quarter etc)
LengthToNote = {0.0625:16, #Sixteenth
                0.125:8, #Eighth
                0.1875:8, #Round to eighth
                0.25:4, #Quarter
                0.3125:4, #Round to quarter
                0.375:value.dots(4), #Dotted quarter
                0.4375:value.dots(4), #Round to dotted quarter
                0.5:2, #Half
                0.5625:2, #Round to half
                0.625:2, #Round to half
                0.6875:2, #Round to half
                0.75:value.dots(2), #Dotted half
                0.8125:value.dots(2), #Round to dotted half
                0.875:value.dots(2), #Round to dotted half
                0.9375:value.dots(2), #Round to dotted half
                1:1} #Whole
df["Diff"] = df["Diff"].map(LengthToNote) #Map to note values for Mingus

### Rest durations
Among the notes, rests had to be determined as well. Any time the dataframe had a blank entry for "Notes_l" or "Notes_r", a rest is placed with a duration equal to that of the note being played by the opposite hand.

Eg: In the first row of the above dataframe

| Frame | Notes_l | Notes_r | Diff | Measure | Notes_both |
| - | - | - | - | - | - |
|0.00 | [D-3] | | 8 | 1.0 | [D-3] |

`Notes_r` is empty and thus a rest is placed in the right hand with value equal to that of the row --> eighth rest (Diff column reads 8) which is reflected in the music sheet as seen below.

![FirstMeasures](Firstmeasures.png)

This method of placing rests looks horrible because the sheet is now littered with rests of varying values. A musician may get mad that instead of one quarter rest my program spat out two eighth rests in measure 1 ... but I'm just looking forward to being able to play the song, so I didn't bother to write code to combine adjacent rests.

# Mingus

Mingus (https://bspaans.github.io/python-mingus/) is a Python package that can link to LilyPond. LilyPond (https://lilypond.org/index.html) is a music engraving program which takes text and converts it to sheets or even midi. I won't go through how to use Mingus or LilyPond, but I'll take some time to talk about their nuances as they come up.

To learn about Mingus
- Here is a link to various tutorials as well as class structure of the different objects within Mingus: https://bspaans.github.io/python-mingus/

To learn about LilyPond
- Here is a basic intro to the script syntax: https://lilypond.org/text-input.html
- Here are a multitude of tutorials/examples on what LilyPond can do: http://lilypond.org/doc/v2.19/Documentation/notation/index

## Creating Notes
Utilizing Mingus's note structure ....

I need to create a `Track` to place `Bars` (musical measures) on and `NoteContainers` to place on those `Bars`.

^ This is the class structure of Mingus ^

From `mingus.containers` I import `NoteContainer`, `Bar`, and `Track`. I used `NoteContainer` instead of `Note` because the container object can hold chords/intervals while the note object can only hold single notes (not useful!)



## Creating Measures & Track

A musical `Track` is created separately for Left and Right hands. This was done because notes would otherwise always stem from the bass clef and attempt to draw from the bass clef even though it would make more sense to draw it on the treble clef.

Example:
![OutofScale](OutOfScaleNote.png)

Once a track is created ... `Bar` objects are created sequentially for each existing measure

The `NoteContainer` objects are placed within those bars based on the notes that exist in that measure

Lastly, the `Bar` is appended to the `Track` and a LilyPond file (.ly) string is generated via `import mingus.extra.lilypond as LilyPond`

In [None]:
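from mingus.containers import NoteContainer, Bar, Track
import mingus.extra.lilypond as LilyPond

#Creating tracks --- Left
t_l = Track()
for i in sorted(df["Measure"].unique()[:-1]): #Walk the measures in ascending order (a set has no guaranteed order)
    b = Bar(key=keysig)
    subset = df[df["Measure"]==i]
    for j,k in zip(subset["Notes_l"],subset["Diff"]):
        if len(j)>0: #If the left hand plays a note in this row
            nc = NoteContainer(j) #Define the note
            b.place_notes(nc,k) #Add note to bar with length
        else:
            b.place_notes(None,k) #Place a rest the length of the other hand's note
    t_l + b #Track's + operator appends the bar to the track in place
LithHarbor_ly_left = LilyPond.from_Track(t_l) #Create the .ly string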

Note1: The above code is only for the left hand ... similar code was copy pasted and used to generate the right hand track

Note2: I should have called the .ly string a different name than `LithHarbor_ly_left` but ... the intent of the project was to only generate this specific song from the town of Maplestory called Lithharbor. So if you're using this code in the future --- that is your ly string object that LilyPond needs to generate its png/pdf files!

## Creating Music Sheets

From the Left <ins>and</ins> Right .ly strings `LithHarbor_ly_left` and `LithHarbor_ly_right` a musical composition is generated into png or pdf via some stitching of strings.

It isn't quite as simple as adding the 2 strings together in LilyPond syntax.

- 1) Create a variable in LilyPond which I called "rhMusic =" for the right hand .ly string `LithHarbor_ly_right`
- 2) Create a variable in LilyPond which I called "lhMusic =" for the left hand .ly string `LithHarbor_ly_left`
- 3) Define a score
- 4) Define the score to be a piano sheet (thus containing a bass and treble clef)
- 5) Define the upper staff ("RH" by default LilyPond nomenclature) to be the `rhMusic` variable defined in Step #1
- 6) Define the lower staff ("LH" by default LilyPond nomenclature) to be the `lhMusic` variable defined in Step #2
- 6b) Additionally define this to be a "\\clef bass" just in case --- I never actually tested whether the \\PianoStaff argument handles that already
- 7) Optional things were defined like the header object which contains the title and composer info
- 8) Lastly the png/pdf is generated through `mingus.extra.lilypond.to_png({.ly string object},{Output name})`

In [None]:
header = '\\header { title = "' + title + '" composer = "' + author + '" opus = "" } '
combine_test = header + "rhMusic =  {" + LithHarbor_ly_right + "}"
combine_test = combine_test + "lhMusic =  {" + LithHarbor_ly_left + "}"
combine_test = combine_test + """
\\score {
  \\new PianoStaff <<
    \\new Staff = "RH"  <<
      \\rhMusic
    >>
    \\new Staff = "LH" <<
      \\clef "bass"
      \\lhMusic
    >>
  >>
}"""

LilyPond.to_png(combine_test, outputname) #Create png

## Creating Midi

As a side note to the png/pdf file -- I wanted a .midi file to generate so I could "test" the output of the python script by having it play the png/pdf file back to me.

A .midi file is generated from mingus via `from mingus.midi import midi_file_out as MidiFileOut`

The code below will look <ins>extremely</ins> similar to the track generation code, but I combined both left and right hand notes into 1 single track object as follows:

In [None]:
#### Create midi file:
#Combine Notes_r and Notes_l from the df into 1 conglomerate
combined_notes = []
for i in range(df.shape[0]):
    try:
        combined_notes.append(df.iloc[i]["Notes_l"]+df.iloc[i]["Notes_r"])
    except TypeError: #One hand is empty (not a list), so keep whichever hand has notes
        if df.iloc[i]["Notes_l"]!="":
            combined_notes.append(df.iloc[i]["Notes_l"])
        else:
            combined_notes.append(df.iloc[i]["Notes_r"])
df["Notes_both"] = combined_notes

Now with the combined track info, I can generate the track and the subsequent midi

In [None]:
#Now create a track by adding these notes from both hands
t_both = Track()
for i in sorted(df["Measure"].unique()[:-1]): #Walk the measures in ascending order (a set has no guaranteed order)
    b = Bar(key=keysig)
    subset = df[df["Measure"]==i]
    for j,k in zip(subset["Notes_both"],subset["Diff"]):
        if j: #If note is not NaN
            nc = NoteContainer(j) #Define the note
            b.place_notes(nc,k) #Add note to bar with length
    t_both + b

# Final Product

The final product of all this was a python script which ... unfortunately ... only works for this 1 specific video file.

If you're reading this, you may be thinking -- wow, that sucks

But that isn't the complete truth. This code can be adapted to other Synthesia .mp4 inputs by adjusting the hyperparameters in the script


## Hyperparameters

In [None]:
###Hyperparameters
debug = False #For output from Pixel <-> Note dictionaries
vod_name = "Above the Treetops - Lith Harbor Synthesia.mp4" #Name of the file to read (.mp4!)
fps = 1 #Speed of the video processing --- Inverse btw .. lower is faster!
measure_length = 57.2 #The number of frames that yield a full measure
start = 230 #Defining the height of the crop region
keysig = "D" #Defining the key signature (in Major only!)
bpm = 120 #Defining the BPM for the midi file -- Has no impact to png/pdf generation
title = "Above the Treetops - Lith Harbor" #Title displayed above the score
author = "Leegle" #Author displayed at the top right of the score
outputname = "LithHarbor" #Name to use for all the output files ... eg: LithHarbor.png/.mid/.pdf
#Mask filter parameters
blu_lower = np.array([75,20,120]) #Lower bound of the Left hand
blu_upper = np.array([140,230,250]) #Upper bound of the Left hand
grn_lower = np.array([30,90,100]) #Lower bound of the Right hand
grn_upper = np.array([80,255,255]) #Upper bound of the Right hand

By adjusting the parameters at the start of the Python script, you can tune it to work for whatever Synthesia video file you want!

- vod_name = Name of the video file you want to process
- fps = Processing speed of the video (inverse: lower is faster). Keep at 1 for normal runs; increase to 10~60 when you want to debug and look into finer details
- measure_length = How many frames you believe 1 measure of the musical score to be. If a measure looks like 58 frames, I recommend a slightly smaller value, like the 57.2 given in the example
- start = The height at which the detection region is drawn --- Adjust to avoid any post-processing edits such as channel logos or text
- keysig = The key signature of the score. Only works for major keys, but I'm sure it can be fixed with a little more LilyPond work
- bpm = Speed for the .midi file to play back at
- title/author = The text to display on the png/pdf score
- outputname = The base name that all output files will use. Eg: "LithHarbor.png"
- Mask parameters = Fine tuning to capture the different colors that the video will be playing back --- Lots and lots of Trial & Error!
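
When tuning the mask bounds, it can help to test a few sampled pixel values against the ranges before re-running the whole video. The sketch below is a plain-Python stand-in for the inclusive per-pixel test that `cv2.inRange` performs. It assumes the bounds are in OpenCV's HSV convention (hue 0-179, saturation/value 0-255); the helpers `in_range` and `classify` and all sample pixel values are made up for illustration and are not part of the script.

```python
# Bounds copied from the hyperparameters above (ASSUMED to be OpenCV HSV values)
blu_lower, blu_upper = (75, 20, 120), (140, 230, 250)  # left hand
grn_lower, grn_upper = (30, 90, 100), (80, 255, 255)   # right hand

def in_range(pixel, lower, upper):
    """True if every channel of `pixel` lies within [lower, upper], inclusive."""
    return all(lo <= p <= hi for p, lo, hi in zip(pixel, lower, upper))

def classify(pixel):
    """Hypothetical helper: which hand's mask would catch this pixel?"""
    left = in_range(pixel, blu_lower, blu_upper)
    right = in_range(pixel, grn_lower, grn_upper)
    if left and right:
        return "both"  # the two hue ranges overlap between 75 and 80
    if left:
        return "left"
    if right:
        return "right"
    return "none"

# Made-up (H, S, V) samples, NOT measured from the video
samples = {
    "bluish": (110, 200, 200),
    "greenish": (50, 200, 200),
    "overlap": (78, 150, 200),
    "dark background": (0, 0, 30),
}
for name, px in samples.items():
    print(name, "->", classify(px))
```

A pixel that lands in "both" would show up in the left-hand and the right-hand mask at the same time, so if a video uses colors near hue 75-80, narrowing one of the two ranges is probably the first thing to try.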

## What you need to do to work the script for your video file

You'll have to adjust any parameters relating to your personal file such as the `vod_name` of course. 

Beyond that, the only tuning parameter I suggest adjusting is `measure_length`. If the video uses the default Synthesia colors (blue and green), you shouldn't have to adjust the mask parameters at all.

# Shortcomings
Nothing is perfect. I only expected this project to spit out a string like "D-A-F#-E-F#-A" for a single measure. This project has gone far beyond that initial scope and I couldn't be happier.

## Measure Length
Without a higher-quality video, I don't think OpenCV can capture the measure bar (the thin white line) moving down the track. Because of this, I had to hard-code how long a measure is (`measure_length`).

This means 2 things:
- 1) The code is less user-friendly & requires the user to put in quite a bit of leg work (especially the leg work of knowing music well enough to play it in their own head & discern whether that "D-3" should be in Measure 1 or in Measure 2)
- 2) The measure length seems to change throughout the video, which results in notes being cut off (because Mingus has a hard definition of measure length being 4 beats --- assuming a 4/4 time signature)
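
To get a feel for how sensitive the measure split is to `measure_length`, here is a minimal sketch. It assumes a note is assigned to a measure by taking the ceiling of its onset frame divided by `measure_length` (using `math.ceil` as a stand-in for `np.ceil`). The helper `measure_of` and the onset frames are hypothetical; the script's actual bookkeeping may differ.

```python
import math

def measure_of(frame, measure_length):
    """Hypothetical helper: 1-based measure index for a note starting at `frame`."""
    return math.ceil(frame / measure_length)

# Made-up note onsets, counted in frames since the first measure began
onset_frames = [10, 60, 300, 575]

for measure_length in (58, 57.2):
    measures = [measure_of(f, measure_length) for f in onset_frames]
    print(measure_length, measures)
```

With these numbers the first three notes land in the same measures either way, but the note at frame 575 moves from measure 10 to measure 11 once `measure_length` drops from 58 to 57.2. Small errors add up over the length of the video, which is why later measures are the ones to check first.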

## Time signature is locked at 4/4
This is a quick and easy fix to be quite honest, but my song was already in 4/4 and I didn't feel the need to code that into the `combined_test` string output.

## Missing notes
Kind of related to the measure length issue ... notes would get chopped off.

Ex: If I have 6 quarter notes in a row and 5 end up in measure 1, the 5th would disappear into the void and the 6th would appear as the first beat of measure 2 (assuming a 4/4 time signature).
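
The void-note effect can be reproduced without Mingus. The sketch below models a bar as a fixed capacity of one whole note (4/4) and refuses any note that no longer fits, which is the behaviour described above. `fill_bar` is a hypothetical helper for illustration only, not a Mingus function.

```python
def fill_bar(durations, capacity=1.0):
    """Place durations (fractions of a whole note) into one bar.

    Returns (placed, dropped): notes that fit, and notes that found no room.
    """
    placed, dropped, used = [], [], 0.0
    for d in durations:
        if used + d <= capacity + 1e-9:  # small tolerance for float error
            placed.append(d)
            used += d
        else:
            dropped.append(d)  # this note "disappears into the void"
    return placed, dropped

# Five quarter notes were assigned to measure 1 of a 4/4 score
placed, dropped = fill_bar([0.25] * 5)
print(len(placed), "placed,", len(dropped), "dropped")
```

Running a check like this per measure before building the bars would at least flag which measures are overfull, so `measure_length` can be adjusted where the notes go missing.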

On top of Mingus's hard definition of a measure producing void notes, my pixel detection method is inherently noisy, and I had to filter out the noise through the `collaborative voting` algorithm I created. It's not perfect, and thus some notes that should have existed don't make it through the filter.

Lastly, another source of missing notes is the resolution of the note_duration mapping. I rounded all frame `Diff` counts to the nearest 0.0625, which is 1/16 (a sixteenth note). That results in 2 things:
- 1) 32nd notes cannot be resolved/detected
- 2) Triplets or other syncopated notes were not coded for --- and thus may either register as incorrect beats or round to 0 and be deleted as a false detection

Note that in case 2, the syncopated notes could then either shift more notes into that measure (which wouldn't result in voided notes, because I hard-sectioned off measures based on the `np.ceil()` function) or shift fewer notes into the measure and cut off the excess notes into the void because too many beats were placed into 1 single measure.
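
The sixteenth-note rounding described in this section can be sketched as follows. It assumes durations are expressed as a fraction of a 4/4 measure and that rounding uses Python's built-in `round` (which rounds exact ties to the even neighbour); the script's own rounding call may behave differently on ties. `quantize` and the example durations are illustrative only.

```python
GRID = 0.0625  # one sixteenth note, as a fraction of a 4/4 measure

def quantize(duration):
    """Round a duration (fraction of a measure) to the nearest multiple of GRID."""
    return round(duration / GRID) * GRID

examples = {
    "quarter note": 1 / 4,
    "32nd note": 1 / 32,
    "eighth-note triplet": 1 / 12,
    "quarter-note triplet": 1 / 6,
}
for name, duration in examples.items():
    q = quantize(duration)
    status = "dropped" if q == 0 else f"kept as {q}"
    print(f"{name}: {duration:.4f} -> {status}")
```

Under these assumptions the quarter note survives unchanged, the 32nd note rounds to 0 and is dropped, and both triplet values are snapped to the nearest sixteenth-note multiple, so they come out with the wrong length rather than disappearing.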