# Transliteration of tsvs with the Sanaas

This is the second of three notebooks on the transliteration of the Arab Andalusian lyrics. It uses the tsv format output by the first notebook. Special terms that should bypass the transliteration script should be placed in special_terms.csv.
Note that if the romanization of the Arab Andalusian Corpus lyrics is updated, the keywords part of the JSON conversion function might need to be updated as well.

The lyrics of each recording are stored in a single tsv file named after the mbid of the recording. The transliterate_recording function handles everything related to the structure of the recording. transliterate_text simply calls the entry point of the transliteration module, and load_special_terms is a utility function that loads terms for which we want to use predetermined transliterations instead of the tool's output. Such terms are only checked for in the titles of the sanaas.


In [1]:
import codecs
import os

from os import listdir

import ArabicTransliterator
from ArabicTransliterator import ALA_LC_Transliterator
import mishkal.tashkeel.tashkeel as tashkeel

ModuleNotFoundError: No module named 'ArabicTransliterator'

In [5]:
transliterator = ALA_LC_Transliterator()

In [6]:
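def transliterate_text(text, vocalize=True):
    """Romanize Arabic-script text, optionally vocalizing it first."""
    voc = text
    if vocalize:
        # Add diacritics with Mishkal before romanizing.
        vocalizer = tashkeel.TashkeelClass()
        voc = vocalizer.tashkeel(text)
    return transliterator.do(voc.strip())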

In [7]:
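def load_special_terms():
    """Map each term to replace (column 1) to its fixed transliteration (column 2)."""
    special_terms = {}
    with codecs.open("special_terms.csv", "r", "utf-8") as f:
        for line in f:
            data = line.strip().split(",")  # comma-separated columns
            if len(data) > 2:  # skip blank or incomplete lines
                special_terms[data[1]] = data[2]
    return special_terms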
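
To make the lookup concrete, here is a minimal sketch of the same idea using the standard csv module on an in-memory file. The helper name `load_special_terms_from` and the sample rows are hypothetical; the only thing taken from the function above is that column 1 holds the text to replace and column 2 the preferred form, while column 0 is ignored.

```python
import csv
import io


def load_special_terms_from(buf):
    """Build {text_to_replace: preferred_form} from comma-separated rows."""
    terms = {}
    for row in csv.reader(buf):
        if len(row) < 3:  # skip blank or incomplete rows
            continue
        terms[row[1]] = row[2]
    return terms


# Hypothetical sample content; the real special_terms.csv may look different.
sample = io.StringIO("1,basit,basīṭ\n\n2,qaim wa nisf,qāʾim wa-niṣf\n")
terms = load_special_terms_from(sample)

# Apply the replacements to an already transliterated title,
# in the same way transliterate_recording does for title rows.
title = "basit. qaim wa nisf"
for key, value in terms.items():
    title = title.replace(key, value)
print(title)
```

Because the replacement is a plain substring substitution, a key that is contained in another key could affect it; ordering the terms from longest to shortest is one possible way to avoid that.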

The transliterate_recording function checks whether the row being processed is a title row or a regular row. A title row has a single column containing dot-separated keywords that identify the lyrics form that follows. A regular row should have at least two tab-separated columns, the first being either numeric or empty and the rest being text in Arabic script. The function reads directly from the input file buffer and writes to the output file buffer.

In [14]:
def strip_digits(s):
    return ''.join([i for i in s if not i.isdigit()])

def transliterate_recording(inbuf, outbuf):
    special_terms = load_special_terms()
    for line in inbuf:
        data = line.strip('\r\n').split('\t')
        if len(data) == 1: #Transliterate san'a title
            if len(data[0]) > 0:
                text = strip_digits(data[0]).strip().replace(u"\u0640", u".")
                transliterated_data = []
                for elem in text.split("."):
                    transliterated_data.append(transliterate_text(elem.strip()))
                transliterated_data = u". ".join(transliterated_data)
                for k,v in special_terms.items():
                    transliterated_data = transliterated_data.replace(k, v)
                outbuf.write(transliterated_data+"\n")
            else:
                outbuf.write(line)
        else: # Transliterate san'a text
            i = 0
            if data[i].isdigit():
                transliterated_data = [data[i]]
                i += 1
            else:
                transliterated_data = []

            for elem in data[i:]:
                transliterated_data.append(transliterate_text(elem.strip(), vocalize=True))
            outbuf.write("\t".join(transliterated_data)+"\n")
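
The row handling can also be tried in isolation, without the transliteration libraries installed. The following is a rough sketch, not the notebook's function: `fake_transliterate` is a hypothetical stand-in for transliterate_text that only wraps its input in angle brackets, `process_row` is a hypothetical helper, the special-terms replacement is left out, and the sample rows use Latin placeholders instead of Arabic text.

```python
def fake_transliterate(text):
    # Hypothetical stand-in for transliterate_text: it only tags the input.
    return "<" + text + ">"


def process_row(line, translit=fake_transliterate):
    """Return the output row for one input line, without the line ending."""
    data = line.strip("\r\n").split("\t")
    if len(data) == 1:
        # Title row (or blank line): drop digits, treat tatweel as a dot,
        # then transliterate each dot-separated keyword on its own.
        if not data[0]:
            return ""
        text = "".join(c for c in data[0] if not c.isdigit())
        text = text.strip().replace("\u0640", ".")
        return ". ".join(translit(part.strip()) for part in text.split("."))
    # Regular row: keep a leading line number as is, transliterate the rest.
    if data[0].isdigit():
        kept, rest = [data[0]], data[1:]
    else:
        kept, rest = [], data
    return "\t".join(kept + [translit(cell.strip()) for cell in rest])


print(repr(process_row("12 a\u0640b\n")))   # title row
print(repr(process_row("1\tabc\tdef\n")))   # numbered regular row
print(repr(process_row("\tabc\n")))         # regular row with empty first column
```

Note that an empty first column is still passed to the transliteration function, which mirrors what transliterate_recording does with such rows.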

## Usage: Choosing Input Files

Select whether the source is a single file or a full folder. In both cases, the transliterated tsvs are written to the same output directory.
If a folder is given, it is up to the user to ensure that all files inside adhere to the expected structure.

To fit into the three-step pipeline of these notebooks, the input file and folder paths point to the output location of notebook 1.
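
The selection logic can be sketched on its own as follows. This is only an illustration under assumptions: `build_queue` and `mbid_from_path` are hypothetical helper names, and the files are empty placeholders created in a temporary folder rather than real output of notebook 1.

```python
import os
import tempfile


def build_queue(source, is_file):
    """Return the list of tsv paths to process for a file or a folder."""
    if is_file:
        return [source] if os.path.isfile(source) else []
    if not os.path.isdir(source):
        return []
    return sorted(
        os.path.join(source, name)
        for name in os.listdir(source)
        if name.endswith(".tsv")
    )


def mbid_from_path(path):
    # The file name without its extension is taken to be the mbid.
    return os.path.splitext(os.path.basename(path))[0]


with tempfile.TemporaryDirectory() as folder:
    # Placeholder files; only the two .tsv files should be queued.
    for name in ("aaa.tsv", "bbb.tsv", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    queue = build_queue(folder, is_file=False)
    mbids = [mbid_from_path(p) for p in queue]

print(mbids)
```

Using os.path.basename and os.path.splitext instead of slicing on "/" and "." keeps the mbid extraction independent of the path separator of the platform.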

In [17]:
isfile = False  # True for a single file, False for a folder
              
filepath = 'tsv_files/original/3fb6107c-13be-4006-851a-a857ed2f80bb.tsv'
folderpath = 'tsv_files/original/'

outputdir = './tsv_files/transliterated/'

path = os.getcwd()

In [21]:
file_queue = []

if not os.path.isdir(outputdir):
    os.makedirs(outputdir)

if isfile:
    if os.path.isfile(filepath):
        file_queue.append(filepath)
else:
    if os.path.isdir(folderpath):
        file_queue = [os.path.join(folderpath, fi) for fi in listdir(folderpath) if fi.endswith(".tsv")]

print("{} files in queue".format(len(file_queue)))

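for file in file_queue:
    # The file name without its extension is the mbid of the recording.
    mbid = os.path.splitext(os.path.basename(file))[0]
    outpath = os.path.join(outputdir, "%s.tsv" % (mbid))
    print(outpath)

    # Write UTF-8 explicitly so the romanized diacritics do not depend on the platform default.
    with codecs.open(file, "r", "utf-8") as fin, open(outpath, "w", encoding="utf-8") as fout:
        transliterate_recording(fin, fout)
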

print("Finished Processing Files")

2 files in queue
./tsv_files/transliterated/3fb6107c-13be-4006-851a-a857ed2f80bb.tsv
./tsv_files/transliterated/70c04adf-b886-4d62-a88a-abdde5d93715.tsv
Finished Processing Files
