## Columns in a unit dictionary
- {
- 'musicname' : 'original filename',
- 'staff_id' : int,
- 'start' : int, (number of start measure)
- 'end' : int, (number of end measure)
- 'tempo' : int,
- 'timesig' : '4/4' or '6/8' or..., (time signature)
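- 'beats' : '1/4,0; 1/2,1; 1/8,1; 1/8,0', (assuming 4/4 time, the example on the left shows the beats in order: a quarter rest, a half note, an eighth note, an eighth rest; 1/4 + 1/2 + 1/8 + 1/8 = 1; in each pair the flag after the comma is 0 for a rest and 1 for a note) (in practice the beat lengths are stored as decimals such as 0.25, 0.125, 0.0625)
- 'instrument' : string, (the instrument this Staff represents)
- 'instsrc' : string (the instrument source selected for this Staff)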
- }
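
The `beats` string can be decoded back into (length, flag) pairs. The following is a minimal sketch, assuming the stored form uses decimal lengths separated by `;`, as `genDataDic()` produces; `parse_beats` is a hypothetical helper and not part of the notebook.

```python
from decimal import Decimal

def parse_beats(beat_str):
    """Split a 'beats' string into (length, is_note) pairs."""
    pairs = []
    for item in beat_str.split(';'):
        length, flag = item.strip().split(',')
        # Flag '1' marks a note, '0' marks a rest
        pairs.append((Decimal(length), flag == '1'))
    return pairs

# One 4/4 measure: quarter rest, half note, eighth note, eighth rest
example = parse_beats('0.25,0;0.5,1;0.125,1;0.125,0')
total = sum(length for length, _ in example)
```

Here `total` is the summed length of the measure, which is the quantity `doubleCheck()` compares against the time signature.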

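## Algorithm and notes
- Ignore the additional melodies (voices) that occasionally appear within the same staff, marked by the tag < track >
- The tempo of a unit is the maximum tempo found within that unit
- If the time signature changes inside a unit (ex: 4/4 to 6/8), that is, the change happens at the second, third, or fourth measure, discard the unit
- About 'instsrc', which records the instrument selected for a Staff: a < Part > in the mscx metadata may hold several < Channel > and therefore several instruments; record only the number from < program value="number" > in the first < Channel >
- Handle files with or without a specified tempo, as well as files where no tempo is specified at the beginning and one only appears later
- A special first measure carries a 'len' attribute in its tag, which shortens it to 1/2, 1/4, etc. of the original time signature; ignore this special first measure but do not ignore the first unit (when the special first measure appears, a normal first measure is also present, so there are two first measures, and ignoring the special one does not affect unit extraction)
- When the number of staffs exceeds 16, units can still be extracted normally, but MuseScore seems to show only 16 staffs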
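
The unit rules above can be sketched as three small helpers. This is an illustrative sketch under the assumption that units are fixed blocks of four measures numbered from 1, as in `getBeats()`; the function names are hypothetical and do not appear in the notebook.

```python
def unit_bounds(measure_no, size=4):
    """Return (start, end) of the unit containing a 1-based measure number."""
    start = (measure_no - 1) // size * size + 1
    return start, start + size - 1

def change_discards_unit(measure_no, size=4):
    """A time-signature change is only harmless on the first measure of a unit."""
    return measure_no % size != 1

def unit_tempo(tempos_in_unit, previous):
    """Take the maximum tempo seen in the unit, else carry the previous tempo over."""
    return max(tempos_in_unit) if tempos_in_unit else previous
```

For example, a change at measure 5 starts a new unit cleanly, while a change at measure 6 discards the unit covering measures 5 to 8.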

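## Rests: tag < Rest >
* First store the beat length in BeatStr,
* then, when consecutive rests are found, modify BeatStr:
* take out the length of the previous rest, add the new length to it, and overwrite the previous rest's length in BeatStr with the merged value.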
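
The merge step can be sketched as follows. This is a simplified alternative to the commented-out `lastRestOrNot()` in the code further down, assuming the beat string always ends with `;`; `append_rest` is a hypothetical name.

```python
from decimal import Decimal

def append_rest(beat_str, this_beat):
    """Append a rest to a beat string, merging it into a directly preceding rest."""
    entries = beat_str.split(';')[:-1]  # drop the empty piece after the final ';'
    if entries and entries[-1].endswith(',0'):
        # Previous entry is a rest: overwrite it with the combined length
        last_beat = Decimal(entries[-1].split(',')[0])
        entries[-1] = str(last_beat + this_beat) + ',0'
    else:
        # First entry of the unit, or the previous entry is a note
        entries.append(str(this_beat) + ',0')
    return ';'.join(entries) + ';'

merged = append_rest(append_rest('', Decimal('0.25')), Decimal('0.125'))
```

Working on the split entries avoids the index arithmetic on `rsplit` results that the string-based version needs.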

## Ties: tag < Tie >
- Do not store the beat length in BeatStr right away,
- store it in the temporary tiedBeats instead,
- and once the tie is confirmed to have ended, store tiedBeats in BeatStr.
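
The accumulation can be sketched on plain tuples rather than parsed XML. This is a simplified illustration, assuming each note is described by its length and whether it carries a `Tie` or an `endSpanner`; unlike `ifTie()` in the code further down, it does not abort on inconsistent tie states, and `merge_ties` is a hypothetical name.

```python
from decimal import Decimal

def merge_ties(notes):
    """notes: (length, has_tie, has_end_spanner) tuples for consecutive notes."""
    beat_str = ''
    tied_beats = Decimal(0)
    for length, has_tie, has_end_spanner in notes:
        if has_tie:
            # A tie starts or continues: hold the length back
            tied_beats += length
        elif has_end_spanner and tied_beats:
            # The tie ends here: flush the accumulated length as one note
            beat_str += str(tied_beats + length) + ',1;'
            tied_beats = Decimal(0)
        else:
            beat_str += str(length) + ',1;'
    return beat_str, tied_beats

beat_str, pending = merge_ties([
    (Decimal('0.5'), True, False),    # tie starts
    (Decimal('0.25'), False, True),   # tie ends
    (Decimal('0.25'), False, False),  # untied note
])
```

A non-zero `pending` value at the end of a unit corresponds to the case where the caller appends the leftover `tiedBeats` before emitting the unit.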

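## Tuplets: tag < Tuplet >
- A tuplet group always lies within a single measure
- The total number of notes does not necessarily equal the actualNotes value
- The sum of all note lengths * normalNotes / actualNotes gives the actual length
- When verifying the sum of beat lengths per measure or per unit, the if test must use "absolute difference smaller than about 0.000000001" instead of "equal"
- Filter out units in the following cases: tuplets other than 3-, 5-, 7-, or 9-tuplets, tuplets containing dotted notes, a first note tied to the preceding note < endSpanner >, a last note tied to the following note < Tie >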
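
The scaling and the tolerance check can be sketched as below. This is a minimal illustration, assuming lengths are `Decimal` fractions of a whole note as in `beatDic`; `scaled_length` and `nearly_equal` are hypothetical names. Dividing by 3, 5, 7, or 9 is not exact in decimal arithmetic, so the summed lengths can carry a tiny rounding remainder, which is why the comparison uses a tolerance.

```python
from decimal import Decimal

EPSILON = Decimal('0.000000001')

def scaled_length(written_length, normal_notes, actual_notes):
    """Actual length of one tuplet note: written length * normalNotes / actualNotes."""
    return written_length * (Decimal(normal_notes) / Decimal(actual_notes))

def nearly_equal(a, b, eps=EPSILON):
    """Compare two beat sums with a tolerance instead of ==."""
    return abs(a - b) < eps

# Eighth-note triplet (3 notes in the time of 2) followed by three quarter notes, in 4/4
eighth = Decimal(1) / 8
quarter = Decimal(1) / 4
measure_sum = sum(scaled_length(eighth, 2, 3) for _ in range(3)) + 3 * quarter
measure_ok = nearly_equal(measure_sum, Decimal(1))
```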

## OOP in the future
- MscxObj: self.staffList, self.tempoDic, self.instrumentDic, self.usedInstSrcDic
- MscxObj: loopMeasure(), lastMeasure(), per4Measure(), genDataDic()
- UnitObj: self.musicname, self.timesig, self.beats, self.tempo
- UnitObj: self.start, self.end, self.staff_id, self.instrument
- UnitObj: getTempo(), getInstrument(), getBeats(), ifChordOrRest(), ifDots()
- UnitObj: ifTuplet(), ifTie(), ifTimeSigChanges(), lastRestOrNot()
- UnitObj: pickleDump(), pickleLoad()
- MainClass: mainList, beatDic, instSrcDic, fileLoopForADir, doubleCheck()

In [53]:
from bs4 import BeautifulSoup as bs
from decimal import Decimal
import os
import sys
import pickle
import csv
import pymongo
from pymongo import MongoClient

#Beat dictionary
beatDic = {
    'measure':None,
    'longa':Decimal(4),
    'breve':Decimal(2),
    'whole':Decimal(1),
    'half':Decimal(1)/2,
    'quarter':Decimal(1)/4,
    'eighth':Decimal(1)/8,
    '16th':Decimal(1)/16,
    '32nd':Decimal(1)/32,
    '64th':Decimal(1)/64,
    '128th':Decimal(1)/128
}

#Build an instrument source dictionary
with open('C:/Users/BigData/git/DownloadMusic/instrument_table.csv','r') as infile:
    instSrcDic = dict(csv.reader(infile))

#Generate unit-data dictionary containing below properties
def genDataDic(staff_id, timeSig, beatStr, filename, start, end):
    beatStr = beatStr[:-1]
    if 'only1' in tempoDic:
        tempo = tempoDic['only1']
    else:
        tempo = tempoDic[end]
    return {
        'staff_id':staff_id,
        'timesig':timeSig,
        'beats':beatStr,
        'musicname':filename,
        'start':start,
        'end':end,
        'tempo':tempo,
        'instrument':instrumentDic[staff_id],
        'instsrc':instSrcDic[usedInstSrcDic[staff_id]]
    }

In [54]:
def doubleCheck(j, dic, mod):
    #To speed up, comment below 1 print statement
#     print '[Unit #'+str(j)+']'
    timeSig = Decimal(dic['timesig'].split('/')[0]) / Decimal(dic['timesig'].split('/')[1])
    unitBeatSum2 = 0
    for value in dic['beats'].split(';'):
        unitBeatSum2 += Decimal(value[:-2])
    if abs(unitBeatSum2 - timeSig * mod) < 0.000000001:
        #To speed up, comment below 2 print statements
#         print dic
#         print '------------------------------------'
        pass
    else:
        print 'unitBeatSum2:', unitBeatSum2
        print 'timeSig * mod =', timeSig, '*', mod, '=', timeSig * mod
        #Print the separator before exiting; a statement after sys.exit() never runs
        print '------------------------------------'
        sys.exit('FATAL: Incorrect sum of beats of a unit.\n'
                 'The 1st check passed but the 2nd failed.')

In [55]:
#Called by getBeats() to simplify the process handling the last measure
def doLastMeasure(staff, i, sigN, sigD, beatStr, unitBeatSum, filename):
    timeSig = str(sigN)+'/'+str(sigD)
    if i % 4 != 0:
        multi = i % 4
        start = i-(i % 4)+1
    else:
        multi = 4
        start = i-3
    if abs(unitBeatSum - beatDic['measure'] * multi) < 0.000000001:
        #To speed up, comment below 4 print statements
#         print 'beatStr: ' + beatStr
#         print '------------------------------------'
#         print 'True, unitBeatSum: ' + str(unitBeatSum)
#         print '------------------------------------'
        mainList.append(genDataDic(staff, timeSig, beatStr, filename, start, i))
    else:
        print 'False, unitBeatSum: ' + str(unitBeatSum)
        sys.exit('Error: Incorrect sum of beats of a unit')

In [56]:
# #Called by countBeats() to determine if last note is a Rest
# #whose beat should be merged with this Rest
# def lastRestOrNot(beatStr, thisBeat):
#     #If beatStr is True, it's not the 1st note(Rest) of a unit
#     if beatStr:
#         #If last note is not a Rest
#         if beatStr[-2] == '1':
#             beatStr += str(thisBeat)+',0;'
#         #If last note is a Rest
#         else:
#             lastBeat = Decimal(beatStr.rsplit(';', 2)[-2].split(',')[0])
#             combinedBeat = str(lastBeat + thisBeat)
#             #beatStr contains equivalent to or more than 2 pairs
#             if len(beatStr.rsplit(';', 2)) > 2:
#                 beatStr = beatStr.rsplit(';', 2)[-3] + ';' + combinedBeat + ',0;'
#             #beatStr contains exactly 1 pair
#             elif len(beatStr.rsplit(';', 2)) == 2:
#                 beatStr = combinedBeat + ',0;'
#     #If beatStr is False, it's the 1st note(Rest) of a unit
#     else:
#         beatStr = str(thisBeat)+',0;'
#     return beatStr

#Called by countBeats() to handle tag <Tie>. It seems many of these conditions could be merged and simplified
def ifTie(beatStr, thisBeat, note, tiedBeats):
    #Totally 2*4=8 conditions
    if tiedBeats == 0:
        #No tie
        if not note.find('Tie') and not note.find('endSpanner'):
            beatStr += str(thisBeat) + ',1;'
        #Tie starts.
        elif note.find('Tie') and not note.find('endSpanner'):
            tiedBeats += thisBeat
        #Tie ends at start of a unit
        elif not note.find('Tie') and note.find('endSpanner'):
            beatStr += str(thisBeat) + ',1;'
        #Tie continues at start of a unit
        elif note.find('Tie') and note.find('endSpanner'):
            tiedBeats += thisBeat
    else: #if tiedBeats != 0:
        #It shouldn't happen. Just for debugging
        if not note.find('Tie') and not note.find('endSpanner'):
            sys.exit('tiedBeats != 0 and not Tie and not endSpanner')
        #It shouldn't happen. Just for debugging
        elif note.find('Tie') and not note.find('endSpanner'):
            sys.exit('tiedBeats != 0 and Tie and not endSpanner')
        #Tie ends
        elif not note.find('Tie') and note.find('endSpanner'):
            tiedBeats += thisBeat
            beatStr += str(tiedBeats) + ',1;'
            tiedBeats = 0
        #Tie continues
        elif note.find('Tie') and note.find('endSpanner'):
            tiedBeats += thisBeat
    return beatStr, tiedBeats

#Called by countBeats() to handle tag <dots>
def ifDots(note):
    if not note.find('dots'):
        return Decimal(1)
    #n dots stretch a note to (2 - 1/2**n) times its length: 1.5, 1.75, 1.875, ...
    #The formula also covers dot counts above 3, which previously left multi undefined
    dots = int(note.find('dots').text)
    return 2 - Decimal(1) / 2**dots

#Inner <Tuplet> inside <Chord> or <Rest>
def ifTuplet(note, tupPosi, ignUnit):
    if tupPosi == 'head' and note.find('endSpanner'):
        ignUnit = True
    elif tupPosi == 'tail' and note.find('Tie'):
        ignUnit = True
    return tupPosi, ignUnit

#Called by countBeats to simplify the codes
def ifChordOrRest(note, beatSum, tupID, tupBase, tupRatio, ignUnit, tupAcc, tupPosi):
    #To speed up, comment below 1 print statement
#     print note.name,
    noteKey = note.find('durationType').text
    multi = ifDots(note)
    
    #Complete <Tuplet> process starts here
    if note.find('Tuplet') and note.find('Tuplet').text == tupID and not note.find('dots'):
        multi *= tupRatio
        if tupAcc + beatDic[noteKey] == tupBase:
            tupPosi = 'tail'
        tupPosi, ignUnit = ifTuplet(note, tupPosi, ignUnit)
        if tupAcc == 0:
            tupPosi = 'body'
        tupAcc += beatDic[noteKey]
    elif note.find('Tuplet') and note.find('Tuplet').text == tupID and note.find('dots'):
        multi *= tupRatio
        ignUnit = True

    thisBeat = beatDic[noteKey] * multi
    beatSum += thisBeat
    return thisBeat, beatSum, ignUnit, tupAcc, tupPosi

#Called by getBeats() to count beats & accumulate the beat-string of a unit
def countBeats(measure, beatStr, unitBeatSum, tiedBeats, ignUnit):
    beatSum = 0
    tupID = ''
    tupBase = 0
    tupRatio = 1
    tupPosi = 'head'
    tupAcc = 0
    
    #recursive=False to avoid <Tuplet> inside <Chord> or <Rest>
    for note in measure.find_all(['Chord','Rest','Tuplet'], recursive=False):
        if not note.find('track'):
            if note.name == 'Chord':
                thisBeat, beatSum, ignUnit, tupAcc, tupPosi = \
                ifChordOrRest(note, beatSum, tupID, tupBase, tupRatio, ignUnit, tupAcc, tupPosi)
                beatStr, tiedBeats = ifTie(beatStr, thisBeat, note, tiedBeats)
            elif note.name == 'Rest':
                thisBeat, beatSum, ignUnit, tupAcc, tupPosi = \
                ifChordOrRest(note, beatSum, tupID, tupBase, tupRatio, ignUnit, tupAcc, tupPosi)
                beatStr += str(thisBeat)+',0;'
                #Comment last line while revive this line and function lastRestOrNot
#                 beatStr = lastRestOrNot(beatStr, thisBeat)
            elif note.name == 'Tuplet':
                tupID = note['id']
                #Reset the per-tuplet state so an earlier tuplet in this measure doesn't leak in
                tupAcc = 0
                tupPosi = 'head'
                baseNoteKey = note.find('baseNote').text
                tupNum = Decimal(note.select_one('Number > text').text)
                #Only 3-, 5-, 7- and 9-tuplets are supported
                if tupNum not in (3, 5, 7, 9):
                    ignUnit = True
                tupBase = beatDic[baseNoteKey] * tupNum
                normalNotes = Decimal(note.find('normalNotes').text)
                actualNotes = Decimal(note.find('actualNotes').text)
                tupRatio = normalNotes / actualNotes
    if abs(beatSum - beatDic['measure']) < 0.000000001:
        #To speed up, comment below 1 print statement
#         print '\nTrue, beatSum: ' + str(beatSum)
        unitBeatSum += beatSum
        return beatStr, unitBeatSum, tiedBeats, ignUnit
    else:
        print '\nFalse, beatSum: ' + str(beatSum)
        print beatStr
        sys.exit('Error: Incorrect sum of beats of a measure')

In [57]:
#Called by main function in the loop for multiple mscx files
def getBeats(dirPath, filename, staffList, mainList):
    filePath = dirPath + filename
    with open(filePath, 'r') as f:
        mscx = bs(f.read(), 'xml')
#     if max(staffList) > len(mscx.select('Score > Staff')):
#         sys.exit('Error: Staff number out of range')
    filename = filename.rsplit('.', 1)[0]
    print '[Filename: ' + filename + ']'
    print '------------------------------------'
    getTempo(mscx)
    
    #Build a dictionary of the instrument each staff represents and a dictionary of the instrument source it uses
    for part in mscx.find_all('Part'):
        staffID = []
        for tag in part.find_all(['Staff', 'instrumentId', 'program']):
            if tag.name == 'Staff':
                staffID.append(int(tag['id']))
            elif tag.name == 'instrumentId':
                for ID in staffID:
                    instrumentDic[ID] = str(tag.text)
            else: #tag.name == 'program'
                for ID in staffID:
                    usedInstSrcDic[ID] = str(tag['value'])
                break #while reaching 1st <program> in <Channel>
        if not part.find('instrumentId'):
            for ID in staffID:
                instrumentDic[ID] = None
    
    #Get 1st time signature
    staff1TimeSig = mscx.select_one('Score > Staff:nth-of-type(1) TimeSig')
    sigN = Decimal(staff1TimeSig.find('sigN').text)
    sigD = Decimal(staff1TimeSig.find('sigD').text)
    beatDic['measure'] = sigN/sigD
    diffTimeSigs = False
    
    #Currently staffList is not used to choose which staffs (parts) to extract; getBeats(), which generates the unit dictionaries, processes all staffs
#     for staff in staffList:
    #Main loop for each staff
    for staff in range(1, len(mscx.select('Score > Staff'))+1):
        beatStr = ''
        unitBeatSum = 0
        tiedBeats = 0
        ignUnit = False
        for measure in mscx.select('Score > \
                              Staff:nth-of-type('+str(staff)+') > \
                              Measure')[:-1]:
            i = int(measure['number'])
            #To speed up, comment below 1 print statement
#             print '[Staff #'+str(staff)+', Measure #'+str(i)+']'
            
            #Determine Time Signature & detect if it changes
            if measure.find('TimeSig'):
                sigN = Decimal(measure.find('sigN').text)
                sigD = Decimal(measure.find('sigD').text)
                if sigN/sigD != beatDic['measure']:
                    beatDic['measure'] = sigN/sigD
                    if i % 4 != 1:
                        diffTimeSigs = True
            
            if measure.get('len'):
                #To speed up, comment below 2 print statements
#                 print '\'len\' attribute found, ignore this measure.'
#                 print '------------------------------------'
                continue
            
            #Count beats & accumulate the beat-string of a unit
            beatStr, unitBeatSum, tiedBeats, ignUnit = \
            countBeats(measure, beatStr, unitBeatSum, tiedBeats, ignUnit)
            
            #Generate a unit dictionary per 4 measures
            if i % 4 != 0:
                #To speed up, comment below 2 print statements
#                 print 'beatStr: ' + beatStr
#                 print '------------------------------------'
                pass
            else:
                if ignUnit == False:
                    if diffTimeSigs == False:
                        if abs(unitBeatSum - beatDic['measure'] * 4) < 0.000000001:
                            timeSig = str(sigN)+'/'+str(sigD)
                            if tiedBeats != 0:
                                beatStr += str(tiedBeats) + ',1;'
                            mainList.append(genDataDic(staff, timeSig, beatStr, filename, i-3, i))
                            #To speed up, comment below 4 print statements
#                             print 'beatStr: ' + beatStr
#                             print '------------------------------------'
#                             print 'True, unitBeatSum: ' + str(unitBeatSum)
#                             print '------------------------------------'
                        else:
                            print 'False, unitBeatSum: ' + str(unitBeatSum)
                            sys.exit('Error: Incorrect sum of beats of a unit')
                    else:
                        #To speed up, comment below 2 print statements
#                         print 'Time Signature changes, ignore this unit.'
#                         print '------------------------------------'
                        pass
                else:
                    #To speed up, comment below 2 print statements
#                     print 'Tuplet complexity occurs, ignore this unit.'
#                     print '------------------------------------'
                    pass
                diffTimeSigs = False
                tiedBeats = 0
                beatStr = ''
                unitBeatSum = 0
                ignUnit = False
                    
        #Handle last-measure situations with for-else
        else:
            lastMeasure = mscx.select('Score > \
                                       Staff:nth-of-type('+str(staff)+') > \
                                       Measure')[-1]
            i = int(lastMeasure['number'])
            #To speed up, comment below 1 print statement
#             print '[Staff #'+str(staff)+', Measure #'+str(i)+']'
            if ignUnit == False:
                if diffTimeSigs == False:
                    if not lastMeasure.find('TimeSig'):
                        beatStr, unitBeatSum, tiedBeats, ignUnit = \
                        countBeats(lastMeasure, beatStr, unitBeatSum, tiedBeats, ignUnit)
                        if ignUnit == False:
                            if tiedBeats != 0:
                                beatStr += str(tiedBeats) + ',1;'
                            doLastMeasure(staff, i, sigN, sigD, beatStr, unitBeatSum, filename)
                        else:
                            #To speed up, comment below 2 print statements
#                             print 'Tuplet complexity occurs, ignore this unit.'
#                             print '------------------------------------'
                            pass
                    else:
                        sigN = Decimal(lastMeasure.find('sigN').text)
                        sigD = Decimal(lastMeasure.find('sigD').text)
                        if sigN/sigD == beatDic['measure']:
                            beatStr, unitBeatSum, tiedBeats, ignUnit = \
                            countBeats(lastMeasure, beatStr, unitBeatSum, tiedBeats, ignUnit)
                            if ignUnit == False:
                                if tiedBeats != 0:
                                    beatStr += str(tiedBeats) + ',1;'
                                doLastMeasure(staff, i, sigN, sigD, beatStr, unitBeatSum, filename)
                            else:
                                #To speed up, comment below 2 print statements
#                                 print 'Tuplet complexity occurs, ignore this unit.'
#                                 print '------------------------------------'
                                pass
                        else:
                            #To speed up, comment below 2 print statements
#                             print 'Time Signature changes, ignore this unit.'
#                             print '------------------------------------'
                            pass
                else:
                    #To speed up, comment below 2 print statements
#                     print 'Time Signature changes, ignore this unit.'
#                     print '------------------------------------'
                    pass
            else:
                #To speed up, comment below 2 print statements
#                 print 'Tuplet complexity occurs, ignore this unit.'
#                 print '------------------------------------'
                pass

In [58]:
#Called by main function in the loop for multiple mscx files
def getTempo(mscx):
    if mscx.select('Score > Staff:nth-of-type(1) > Measure > Tempo'):
        
        #Get 1st tempo
        tempo =  int(mscx.select('Score > Staff:nth-of-type(1) > \
        Measure > Tempo > text')[0].text.rsplit(' ', 1)[-1])
        
        #If there is only 1 tempo, no need to go through every measure.
        if len(mscx.select('Score > Staff:nth-of-type(1) > Measure > Tempo')) == 1:
            tempoDic['only1'] = tempo
        elif len(mscx.select('Score > Staff:nth-of-type(1) > Measure > Tempo')) > 1:
            tmp = []
            staff1Measures = mscx.select('Score > Staff:nth-of-type(1) > Measure')
            for measure in staff1Measures[:-1]:
                i = int(measure['number'])
                for t in measure.select('Tempo > text'):
                    tmp.append(int(t.text.rsplit(' ', 1)[-1]))
                if i % 4 == 0:
                    if tmp:
                        tempo = max(tmp)
                    tempoDic[i] = tempo
                    del tmp[:]

            #Handle last-measure situations with for-else
            else:
                i = int(staff1Measures[-1]['number'])
                for t in staff1Measures[-1].select('Tempo > text'):
                    tmp.append(int(t.text.rsplit(' ', 1)[-1]))
                if tmp:
                    tempo = max(tmp)
                tempoDic[i] = tempo
    else: #No tempo found, make the tempo 120
        tempoDic['only1'] = 120

In [None]:
import traceback

#Main function here
tempoDic = {}
instrumentDic = {}
usedInstSrcDic = {}
mainList = []

#Currently staffList is not used to choose which staffs (parts) to extract; getBeats(), which generates the unit dictionaries, processes all staffs
staffList = [1, 3, 5]

dirPath = 'C:/Users/BigData/Desktop/mscx/'
for filename in os.listdir(dirPath):
    getBeats(dirPath, filename, staffList, mainList)
    tempoDic.clear()
    instrumentDic.clear()
    usedInstSrcDic.clear()

    #Double check if the beats accumulation of each unit is correct
    for j, dic in enumerate(mainList, 1):
        mod = dic['end'] - dic['start'] + 1
        doubleCheck(j, dic, mod)
    
#     #Store data into MongoDB
#     client = MongoClient('mongodb://10.120.30.8:27017')
#     db = client['music']
#     collect = db['tempo_beats']
#     try:
#         collect.insert_many(mainList)
#     except:
#         exc_type, exc_value, exc_traceback = sys.exc_info()
#         with open('C:/Users/BigData/Desktop/mscxError.txt', 'a') as f:
#             f.write('[' + filename + ']\n')
#             traceback.print_tb(exc_traceback, None, f)
#             f.write('\n')
    
    del mainList[:]

del staffList[:]

In [18]:
#Query MongoDB
client = MongoClient('mongodb://10.120.30.8:27017')
db = client['music']
collect = db['tempo_beats']
cur = collect.find({'staff_id':{'$gt':24}},{'_id':0,'musicname':1,'staff_id':1})
for i, item in enumerate(cur):
    print i, item
    print '------------------------------------'

In [None]:
#Query MongoDB
client = MongoClient('mongodb://10.120.30.8:27017')
db = client['music']
collect = db['tempo_beats']
cur = collect.find({'musicname':'Vital_Zoetic','staff_id':1},{'_id':0, 'beats':1})
for i, item in enumerate(cur):
    print i, item
    print '------------------------------------'

In [None]:
#Count units of each musicname(file)
db.tempo_beats.group({"musicname":1}, {}, {"count":0}, "function(obj, prev){prev.count++}")

In [16]:
#Caution! Remove all documents from a collection
# result = db.tempo_beats.delete_many({})