# From tsv to JSON for original and transliterated files

This notebook converts the sanai lyrics tsv files to JSON files, for both the original-script and the transliterated tsv files.

## Prerequisites
1. Input Format:
    The tsv files are expected to be in the format produced by 1) the docx_to_tsv notebook and 2) the transliterate_tsv notebook.
2. Paths: 
    Original-script tsv files should be placed in tsv_files/original/, and transliterated tsv files in tsv_files/transliterated/.
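
Before running the conversion, it can help to confirm that both input folders contain files. The following is a minimal sketch, not part of the original notebook; `find_inputs` is a hypothetical helper name, and the folder layout is the one listed above.

```python
import glob
import os

def find_inputs(base="tsv_files"):
    """Return the sorted tsv paths found for each variant under `base`."""
    return {
        variant: sorted(glob.glob(os.path.join(base, variant, "*.tsv")))
        for variant in ("original", "transliterated")
    }

inputs = find_inputs()
for variant, paths in inputs.items():
    # An empty list usually means the files were placed in the wrong folder.
    print(variant, len(paths), "file(s)")
```

If one of the counts is zero, the corresponding loop in the conversion cell simply does nothing, so this check makes a misplaced folder easier to spot.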
    

## Output Format

For more information about the tsv and JSON formats and their contents, please consult the README of the Arab Andalusian Lyrics Dataset, available at https://zenodo.org/record/3337623#.XS3WqpMza8o. Information on the tsv format can also be found in the previous notebooks.

The following implementation decisions are worth noting:

1. If a poem type is missing, the poem type (ptype) in the JSON output is set to either 'مجهول' or 'unidentified', depending on the file language.

2. The identifier key is set to the first lyrics tsv column of the first row of the poem.

3. stype is an abbreviation for sanaa type. The name is slightly misleading: it refers to the type of the text currently being processed, whether that is a sanaa, a mawwal, an inshad, a tawshiya, etc. Likewise, the variable name recording_sanas should not confuse the user: at the end, this array holds all the elements that appeared in the lyrics of the recording.

4. Not all 'tawshiyat al mizan' entries have lyrics. In those without lyrics, the identifier could be empty. For those with lyrics, the handling is no different than 1. 

5. The 'sections' entry in the JSON dictionary corresponding to a sanaa/inshad/mawwal is deeply nested. It is an array of sections, where each section is an array of lines. Since each line is itself an array of poem line sections, the 'sections' entry is a 3D array. The outermost term 'sections' should not be confused with the line sections of a poem: in this context, sections is synonymous with parts, meaning groups of lines within a poem that share a similar structure.

6. Spacing between words in Arabic can cause substring mismatches and, eventually, bugs. For example, the keyword list in the conversion code includes (tawshiyat al mizan) written without spaces. When creating the lyrics files, check this notebook to see which keywords are expected and in which format, so that the script works correctly.
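
The spacing issue in point 6 can be illustrated with a small check. This sketch is not part of the notebook's pipeline: `squash_spaces` is a hypothetical helper that removes all whitespace before comparison, shown only as one possible way to make a keyword match insensitive to spacing.

```python
# Hypothetical helper (not used by tsv_to_json): drop all whitespace so that
# spaced and unspaced spellings of a keyword compare equal.
def squash_spaces(text):
    return ''.join(text.split())

first, second = 'توشية', 'الميزان'
with_space = first + ' ' + second   # spelling with a space between the words
without_space = first + second      # spelling without a space

print(with_space == without_space)                                # False
print(squash_spaces(with_space) == squash_spaces(without_space))  # True
```

The conversion code takes a different route: it lists the spellings it accepts explicitly, so a spelling that is not in the list is reported as missing.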


In [5]:
import sys, os, glob, json, shutil, codecs, platform
import pandas as pd
import ArabicTransliterator

from IPython.display import display, HTML

## JSON Conversion

Parameters:
1. s is either a file stream or a string stream containing the input tsv that is to be converted to JSON.

2. v is a character that indicates whether the original or the transliterated version of the lyrics is being processed, because the creation of the JSON dictionary differs accordingly. 'o' refers to the original version and 't' to the transliterated one.
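
Because the function only iterates over `s` line by line, an opened file and an in-memory `io.StringIO` can be used interchangeably. The sketch below uses made-up rows (not taken from the dataset) and a hypothetical helper, `split_row`, that repeats the splitting rule of the conversion function to show the main kinds of rows it distinguishes. The special two-column tawshiyat al-mizan row is left out for brevity.

```python
import io

def split_row(line):
    # Same rule as in tsv_to_json: drop the line ending and trailing tabs,
    # then split the remainder on tabs.
    return line.strip('\r\n').rstrip('\t').split('\t')

# Made-up input: a title row, one lyrics row, and an empty row.
sample = "mawwāl 1. zajal\ncol0\tcol1\tcol2\n\n"

rows = []
for line in io.StringIO(sample):
    data = split_row(line)
    if len(data) == 1 and data[0]:
        kind = "title"          # no tabs, non-empty: starts a new entry
    elif len(data) == 1:
        kind = "section break"  # empty line: starts a new section
    else:
        kind = "lyrics row"     # tab-separated cells of one line
    rows.append((kind, data))
    print(kind, data)
```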

In [26]:
def strip_digits(s):
    return ''.join([i for i in s if not i.isdigit()])

def tsv_to_json(s, v):
    recording_sanas = []
    keywords = {'o' : {'type1' :['توشية الميزان', 'توشيةالميزان', 'الصنعة', 'إنشاد', 'موال', 'صنعة', 'توشيةالنوبة'], 'type2' : ["توشية"] } , 
                't': {'type1' : ["tawshīt al-mīzān", "tawshīt al-mīzān", "al-ṣan‘ah", "inshād", "mawwāl", "ṣan‘ah", "tawshīt al-nūbah"], 'type2' : ["tawshīya"] } }
    missing = {'o' : 'مجهول', 't' : 'unidentified'}
    known_ptypes = {'t': ['zajal','qaṣīdah', 'tawshīḥ', 'birūlatin', 'birūālatin', 'tawshīḥun', 'qaṣīdatun'], 'o' : ['زجل', 'قصيدة', 'توشيح', 'برولة']}
    
    poem_types, sana_types = {}, {}
    for line in s:
        data = line.strip('\r\n').rstrip('\t').split("\t")
        if len(data) == 1:       #if no tabs found
            if len(data[0]) > 0: #and non empty line
                title = data[0].strip('\r\n').split(".")
                stype, ptype = "", ""
                stype = strip_digits(title[0]).strip()  
                
                if stype not in keywords[v]['type1']:
                    stype = missing[v]     
                sana_types.setdefault(stype, 0)
                sana_types[stype] += 1
            
                ptype_matches = [p for p in known_ptypes[v] if p in line] #if more than 1 match, we use first.
            
                if len(ptype_matches) > 0:
                    ptype = ptype_matches[0]
                else:
                    ptype = missing[v]
                poem_types.setdefault(ptype, 0)
                poem_types[ptype] += 1
                
                recording_sanas.append({"title": data[0].strip(), "sections": [[]], "type": stype, "identifier": "", "poem" : ptype})
                
            elif recording_sanas: #empty line: start a new section, unless no title has been read yet
                recording_sanas[-1]['sections'].append([])
              
        elif len(data) == 2 and (data[1] == keywords[v]['type1'][0] or data[1] == keywords[v]['type1'][1]): #"tawshīt al-mīzān" or its arabic script equiv.
            stype = data[1]
            recording_sanas.append({"title": data[1].strip(), "sections": [[]], "type": stype, "identifier": "", "poem" : ""})
    
        elif len(data) >= 2: #line in a sanaa body
            if(len(recording_sanas[-1]['sections'][0])) == 0: #if this is the first row in the sanaa
                recording_sanas[-1]['identifier'] = data[1] 
            recording_sanas[-1]['sections'][-1].append(data)
             
    #some post-processing
    for i in range(len(recording_sanas)): 
        if len(recording_sanas[i]['sections'][-1]) == 0: #remove empty sections in a recording if found.. 
            recording_sanas[i]['sections'] = recording_sanas[i]['sections'][:-1]
        if recording_sanas[i]['identifier'] == keywords[v]['type2'][0]:  #if the identifier is a tawshiya, take the following line section
            if len(recording_sanas[i]['sections'][0][0]) > 2: #index 1 is the identifier itself, so the following line section is at index 2
                recording_sanas[i]['identifier'] = recording_sanas[i]['sections'][0][0][2]
    
    return (recording_sanas, sana_types, poem_types)
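
As a worked example of how a title row is turned into a type and a poem type, the sketch below isolates that step. It is a simplified illustration, not the function itself: `classify_title` is a hypothetical name, the keyword lists are a reduced subset of the transliterated ones, and the titles are made up.

```python
def strip_digits(s):
    return ''.join(ch for ch in s if not ch.isdigit())

# Reduced keyword tables for the transliterated case only (assumed subset).
known_stypes = ["inshād", "mawwāl"]
known_ptypes = ["zajal", "qaṣīdah"]
missing = "unidentified"

def classify_title(title):
    """Return (stype, ptype) for a title row, falling back to `missing`."""
    # The type is the text before the first period, without digits.
    stype = strip_digits(title.split(".")[0]).strip()
    if stype not in known_stypes:
        stype = missing
    # The poem type is the first known poem type found anywhere in the title.
    ptype = next((p for p in known_ptypes if p in title), missing)
    return stype, ptype

print(classify_title("mawwāl 1. zajal"))   # ('mawwāl', 'zajal')
print(classify_title("inshād 2"))          # ('inshād', 'unidentified')
print(classify_title("some other title"))  # ('unidentified', 'unidentified')
```

Note that the type must match a keyword exactly, whereas the poem type only needs to occur as a substring of the title.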

This cell loads the files and performs the actual tsv to JSON conversion, for both the original and the transliterated versions.

In [27]:
path = os.getcwd()

original_files = sorted(glob.glob("tsv_files/original/*.tsv"))
transliterated_files = sorted(glob.glob("tsv_files/transliterated/*.tsv"))

outputdir_transliterated = 'json_files/transliterated/'
outputdir_original = 'json_files/original/'

if not os.path.isdir(outputdir_transliterated):
    os.makedirs(outputdir_transliterated)
    
if not os.path.isdir(outputdir_original):
    os.makedirs(outputdir_original)

processed_original_files = []
processed_transliterated_files = []

for filename in original_files:
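    mbid = os.path.splitext(os.path.basename(filename))[0] #file name without directory or extension
    with codecs.open(filename, "r", "utf-8") as f:
        (recording_sanas, sana_types, poem_types) = tsv_to_json(f, 'o')
    #write through a context manager so that the output file is flushed and closed
    with codecs.open(os.path.join(outputdir_original, "%s.json" % mbid), "w", "utf-8") as out:
        json.dump(recording_sanas, out)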
    processed_original_files.append(filename)
    display(filename)
    
    for stype, freq in sorted(sana_types.items(), key=lambda x: x[1], reverse=True):
        print(stype)
        print(freq)

    for ptype, freq in sorted(poem_types.items(), key=lambda x: x[1], reverse=True):
        print(ptype)
        print(freq)
        
for filename in transliterated_files:
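    mbid = os.path.splitext(os.path.basename(filename))[0] #file name without directory or extension
    with codecs.open(filename, "r", "utf-8") as f:
        (recording_sanas, sana_types, poem_types) = tsv_to_json(f, 't')
    #write through a context manager so that the output file is flushed and closed
    with codecs.open(os.path.join(outputdir_transliterated, "%s.json" % mbid), "w", "utf-8") as out:
        json.dump(recording_sanas, out)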
    processed_transliterated_files.append(filename)
    display(filename)
    
    for stype, freq in sorted(sana_types.items(), key=lambda x: x[1], reverse=True):
        print(stype)
        print(freq)

    for ptype, freq in sorted(poem_types.items(), key=lambda x: x[1], reverse=True):
        print(ptype)
        print(freq)

'tsv_files/original/3fb6107c-13be-4006-851a-a857ed2f80bb.tsv'

الصنعة
10
مجهول
1
إنشاد
1
زجل
7
توشيح
2
مجهول
2
قصيدة
1


'tsv_files/original/70c04adf-b886-4d62-a88a-abdde5d93715.tsv'

الصنعة
9
مجهول
1
زجل
6
توشيح
3
مجهول
1


'tsv_files/transliterated/3fb6107c-13be-4006-851a-a857ed2f80bb.tsv'

al-ṣan‘ah
10
unidentified
1
inshād
1
zajal
7
tawshīḥ
2
unidentified
2
qaṣīdah
1


'tsv_files/transliterated/70c04adf-b886-4d62-a88a-abdde5d93715.tsv'

al-ṣan‘ah
9
unidentified
1
zajal
6
tawshīḥ
3
unidentified
1
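
To illustrate the nested 'sections' layout described under the implementation decisions, the sketch below builds one made-up entry with the same keys that the conversion function writes, saves it with `json.dump`, and reads it back. The entry contents and the temporary file are placeholders, not data from the collection.

```python
import json
import os
import tempfile

# Made-up entry: two sections; each section is a list of lines,
# and each line is a list of tsv cells.
entry = {
    "title": "mawwāl 1. zajal",
    "sections": [
        [["a", "id-1", "line 1 part 1", "line 1 part 2"],
         ["b", "id-2", "line 2 part 1", "line 2 part 2"]],
        [["c", "id-3", "line 3 part 1"]],
    ],
    "type": "mawwāl",
    "identifier": "id-1",
    "poem": "zajal",
}

path = os.path.join(tempfile.mkdtemp(), "example.json")
with open(path, "w", encoding="utf-8") as out:
    json.dump([entry], out)  # a recording is a list of such entries

with open(path, encoding="utf-8") as f:
    recording = json.load(f)

first = recording[0]
print(len(first["sections"]))            # 2
print(len(first["sections"][0]))         # 2
print(first["sections"][0][1][2])        # line 2 part 1
print(first["title"] == entry["title"])  # True
```

By default `json.dump` writes non-ASCII characters as `\uXXXX` escapes, so the files are not directly readable in a text editor, but `json.load` restores the original characters.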
