# Docx to Tsv

This notebook converts the Word documents containing the sanai lyrics into TSV format. It is the first of three steps in the transliteration pipeline. The input Word documents must strictly follow the expected format described below for the whole transliteration process to run smoothly. Be sure to install all third-party libraries used in the imports.
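
As a quick pre-flight check, the sketch below reports which required modules cannot be found. It assumes the only non-standard dependency is the `docx` module (provided by the `python-docx` package on PyPI); `missing_modules` is a hypothetical helper name and not part of the pipeline.

```python
from importlib.util import find_spec

def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]

# Assumption: `docx` is the only third-party import used in this notebook.
missing = missing_modules(["docx"])
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All third-party modules are available.")
```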

In [12]:
import docx
import csv, os
from io import StringIO
import sys, getopt
import argparse

from os import listdir, getcwd
from os.path import isfile, isdir, join

from docx.api import Document

## Expected Format
1. Each table would typically contain a single lyrics form, which could be a sanaa, inshad, mawal, or other. 

2. A table should not contain more than one sanaa/inshad/mawal. Do not group several pieces in one table, whether they are of the same form or of different forms.

3. Cells must not be merged. All rows of a table should have the same number of columns.

4. This conversion script assumes four types of cells:

    a. Title Cell: present in the first row of every table. It typically contains a dot-separated list of identifiers or keywords that characterize the contents of the table.

    b. Section Metadata: in some cases, the cell after the title cell contains metadata rather than actual lyrics. Such cells usually indicate instrumental interludes.

    c. Multiline Sanaa: a single line of lyrics is usually divided into one or more line sections, each placed in its own table column. When multiple poem lines are grouped in one row, the line sections belonging to different lines share the same cell, separated by line breaks.

    d. Singleline Sanaa: the contents of each line are placed on their own row.
    
    
5. The first column of each table is reserved for timing-related information. It should only have values on rows that contain lyrics or metadata (i.e., all rows other than the title row). If the timing information is missing, the first column should be left empty, but never deleted.

6. Empty columns are ignored. 

7. Each table has at least one row, the title row. A table with only a title row is useful when a sanaa is known to be sung at a given location but its lyrics could not be detected. In those cases, the title states that it is a sanaa and includes the keyword 'مجهول', meaning unidentified.

8. The first row should have text in only one column, as the search moves on to the next row once it finishes processing the first row it finds with text.
    
The following diagram shows an example of a table corresponding to a multiline sanaa.
![title](img/multiline_table.png)
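
Some of the structural rules above (a uniform column count and a title row with text in a single column) can be checked before conversion. The following is a minimal sketch that works on a table already read into a list of rows, each row being a list of cell strings. `check_table` and the sample data are hypothetical and not part of the conversion script.

```python
def check_table(rows):
    """Return a list of human-readable problems found in a table.

    `rows` is a list of rows, each a list of cell strings.
    """
    if not rows:
        return ["table has no rows (rule 7 requires a title row)"]

    problems = []

    # Rule 3: every row must have the same number of columns.
    widths = {len(row) for row in rows}
    if len(widths) > 1:
        problems.append(
            "rows have differing column counts: {}".format(sorted(widths)))

    # Rule 8: the title row should have text in exactly one column.
    filled = [cell for cell in rows[0] if cell.strip()]
    if len(filled) != 1:
        problems.append(
            "title row has text in {} columns, expected 1".format(len(filled)))

    return problems

# Made-up sample tables for illustration only.
good = [["", "الصنعة 1", ""], ["0:12", "line a", "line b"]]
bad = [["الصنعة 1", "extra"], ["0:12", "line a", "line b"]]
print(check_table(good))
print(check_table(bad))
```

An empty result means no problem was detected by these two checks; it does not guarantee that the table satisfies the remaining rules.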


In [15]:
#reload(sys)
#sys.setdefaultencoding('utf-8') #quick hack to get past python encoding issues for now.

In [None]:
#parser info

#returns a dictionary of arrays, one for each column found with meaningful data. works when several lines are in the same cell or when each cell only has one line
#it assumes that, at least within a row, the number of columns with meaningful data is the same

In [22]:
def process_cell(cell): #types: 1: title cell, 2: section metadata, 3: multiline sanaa, 4: singleline sanaa    
        keywords_type1 = ["موال","إنشاد", "الصنعة"] #keywords that allow us to distinguish a cell as a title cell
        keywords_type2 = ["توشية"]                  #keywords that allow us to distinguish a cell as a section metadata cell

        cell_type = -1
        for k in keywords_type1:
                if k in cell.text:
                        cell_type = 1
        for k in keywords_type2:
                if k in cell.text:
                        cell_type = 2
        lines = cell.text.replace('"', '').strip().split('\n')
        lines = [l.strip() for l in lines]

        reduced_lines = []
        for l in lines:
                if l != '' and l != '\t' and l != ' ':
                        reduced_lines.append(l)
        lines = reduced_lines

        if cell_type == -1:     #no keyword matched, so classify by the number of lines in the cell
                if len(lines) == 1:
                        cell_type = 4
                else:
                        cell_type = 3
        return cell_type, lines         #will never return -1 as all paths lead to an assignment
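
To illustrate the four cell types without opening a Word file, the sketch below applies the same kind of classification to plain strings. It assumes that a cell matching no keyword falls back to type 4 when it holds one line and to type 3 otherwise. `classify_text` is a hypothetical stand-in for `process_cell`, which receives a python-docx cell instead of a string.

```python
TITLE_KEYWORDS = ["موال", "إنشاد", "الصنعة"]   # mark a title cell (type 1)
METADATA_KEYWORDS = ["توشية"]                  # mark a section metadata cell (type 2)

def classify_text(text):
    """Return (cell_type, lines) for the raw text of one cell."""
    lines = [l.strip() for l in text.replace('"', '').split('\n')]
    lines = [l for l in lines if l]  # drop empty lines

    # Metadata keywords take precedence over title keywords.
    if any(k in text for k in METADATA_KEYWORDS):
        return 2, lines
    if any(k in text for k in TITLE_KEYWORDS):
        return 1, lines
    # No keyword: one line is a singleline cell, anything else is multiline.
    return (4 if len(lines) == 1 else 3), lines

for sample in ["الصنعة 1. بسيط", "توشية", "first half\nsecond half", "a single line"]:
    print(classify_text(sample))
```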

In [28]:
def process_row(row): #returns the rows as lists, each representing the contents of a tsv row, so that the caller can just keep appending the results
        row_type = 4  #row_type mirrors the cell types. start from singleline (4) and keep the lowest cell type found in the row
        row_cells = []

        for cell in row.cells:
                cell_type, cell_lines = process_cell(cell)
                #print("cell_type {} ".format(cell_type))
                if cell_type < row_type:
                        row_type = cell_type
                row_cells.append(cell_lines)

        #at this point there are n arrays depending on how many relevant columns there were
        #they should be parallel arrays, so go over each array and form separate tsv rows

        #if ! title row, the first column should be reserved for numbers even if empty
        row_lines= []
        #find longest, only relevant for type 3 rows (multiline sanaas)
        longest = max(len(col) for col in row_cells)
        for i in range(0, longest):
                tsv_string = StringIO()
                tsv_writer = csv.writer(tsv_string, delimiter='\t')
                line_sections = []
                for col, rc in enumerate(row_cells):
                        if row_type != 1: #if not title row
                                if col == 0: #if we are in the first col
                                        if len(rc) >= i+1: #if there is a numeric entry
                                                line_sections.append(rc[i]) #insert it normally
                                        else:                           #if there isn't
                                                line_sections.append('')    #still keep the column empty
                                else:
                                        if len(rc) >= i+1:
                                                line_sections.append(rc[i])
                        else:
                                if len(rc) >= i+1:
                                        line_sections.append(rc[i])
                tsv_writer.writerow(line_sections)
                row_lines.append(tsv_string.getvalue())

        if row_type != 1 and row_type != 2:
                row_lines.append('\r\n') #empty line to divide all different classed rows. the lack of handling for last one rids us of the need to add an empty line before a title row
        return  row_type, row_lines
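
The central step of `process_row` is turning parallel per-column line lists into one TSV row per line index. The sketch below shows that step in isolation using only the standard library, keeping the first column as an empty placeholder when no timing value is present. `columns_to_tsv` is a hypothetical name and the sample values are made up.

```python
import csv
from io import StringIO

def columns_to_tsv(columns):
    """Convert parallel column lists into TSV text, one row per line index.

    The first column is always emitted (empty if it has no value at that
    index); other columns are skipped when they have no value.
    """
    out = StringIO()
    writer = csv.writer(out, delimiter='\t')
    longest = max((len(col) for col in columns), default=0)
    for i in range(longest):
        fields = []
        for col_index, col in enumerate(columns):
            if i < len(col):
                fields.append(col[i])
            elif col_index == 0:
                fields.append('')  # keep the timing column in place
        writer.writerow(fields)
    return out.getvalue()

# One timing value, two lyric columns holding two lines each.
columns = [["0:12"], ["a1", "b1"], ["a2", "b2"]]
print(repr(columns_to_tsv(columns)))
```

Note that `csv.writer` terminates rows with `\r\n` by default, which is why the conversion code appends a bare `'\r\n'` to produce an empty separator line.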
  

In [29]:
def process_sanaa(sanaa): #sanaa is a table object according to the python-docx api
        #sanaa_string = StringIO.StringIO()
        #sanaa_writer = csv.writer(sanaa_string)

        sanaa_string = ""
        for i, row in enumerate(sanaa.rows):
                row_type, lines = process_row(row)
                #DEBUG:
                #print ("Lines in ROW: {}".format(len(lines)))
                for l in lines:
                        sanaa_string+=l
        return sanaa_string


In [30]:
def process_document(doc_path):
        tsv = ""
        doc = Document(doc_path)
        for table in doc.tables:
                tsv += process_sanaa(table)
        return tsv

#nawba_folders = [fo for fo in listdir(path) if isdir(join(path, fo))]

## Usage: Choosing Input Files
Select whether the source is a single file or a full folder. In both cases, the TSV versions are written to the same directory as the source files.
If a folder is given, it is up to the user to ensure that all files inside adhere to the expected structure.

A small number of .docx examples are provided along with these notebooks for demonstration.
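
One way to build the input queue and derive the output names is sketched below. It assumes that inputs end in `.docx` and that each output sits next to its input with a `.tsv` extension; `build_queue` and `tsv_path` are hypothetical helpers, and the example path is made up.

```python
import os

def build_queue(path):
    """Return a list of .docx paths for a single file or a folder."""
    if os.path.isfile(path):
        return [path] if path.lower().endswith('.docx') else []
    if os.path.isdir(path):
        return sorted(
            os.path.join(path, name)
            for name in os.listdir(path)
            if name.lower().endswith('.docx')
            and os.path.isfile(os.path.join(path, name))
        )
    return []  # path does not exist

def tsv_path(docx_path):
    """Replace the file extension with .tsv."""
    root, _ = os.path.splitext(docx_path)
    return root + '.tsv'

print(tsv_path('docx_files/example.docx'))
```

Filtering on the extension keeps previously generated `.tsv` files out of the queue when the same folder is processed more than once.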

In [25]:
isfile = True #True when file, False when folder. note: this name shadows os.path.isfile imported above, so use os.path.isfile(...) for path checks
              
filepath = 'docx_files/3fb6107c-13be-4006-851a-a857ed2f80bb.docx'
folderpath = 'docx_files/'

path = os.getcwd()

In [31]:
file_queue = []
if isfile:
    if os.path.isfile(filepath):
        file_queue.append(filepath)
else:
    if os.path.isdir(folderpath):
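        #keep full paths, and only .docx files so that previously written .tsv files are skipped
        file_queue = [join(folderpath, fi) for fi in listdir(folderpath) if fi.endswith('.docx') and os.path.isfile(join(folderpath, fi))]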

print("{} files in queue".format(len(file_queue)))

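for file in file_queue:
    tsv = process_document(file)
    tsv = tsv.replace(',', '')
    tsv = tsv.replace('"', '')
    out_path = os.path.splitext(file)[0] + '.tsv'
    #utf-8 keeps the Arabic text intact; newline='' preserves the csv writer's own line endings
    with open(out_path, 'w', encoding='utf-8', newline='') as fileout:
        print(out_path)
        fileout.write(tsv)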

print("Finished Processing Files")

1 files in queue
docx_files/3fb6107c-13be-4006-851a-a857ed2f80bb.tsv
Finished Processing Files
