### Notebook for converting cases from Word documents to text files

The notebook uses pywin32 to control a full installation of MS Office (on a Windows machine). It opens each Word document in a target directory (both doc and docx), has Word convert the document to a text file, saves that text file in an output folder, and closes the document.

This approach was adopted because (as of Nov 2023) Python packages for accessing Word documents are not reliable at extracting footnotes and various other features (depending on how those are encoded and on whether one is using doc or docx).

It is slow (500 files per minute), but that is fine because we typically don't have more than a couple hundred files to add at any given time, and we only need to process the full backlog (100k cases) once, which at that rate is a little over three hours of run time. So no effort has been made to speed this up, to try parallel processing, etc. There are also a few hundred cases that don't load properly because the underlying Word file is corrupted (some XML paths are apparently missing). Those files throw errors, so the run needs some babysitting. We haven't tried to fix them because they account for under 0.5% of files.
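
One way to reduce the babysitting might be to screen docx files before handing them to Word. The sketch below is not part of the notebook. It assumes the corruption shows up as missing parts inside the docx zip package, and the list in `REQUIRED_PARTS` and the helper name `docx_problems` are illustrative choices, not something the notebook defines. It does not apply to the older binary doc format, which is not a zip archive.

```python
import zipfile
from pathlib import Path

# Assumption: a readable .docx needs at least these two package parts.
REQUIRED_PARTS = ("[Content_Types].xml", "word/document.xml")


def docx_problems(path):
    """Return a list of problems found in a .docx package (empty if none)."""
    path = Path(path)
    # A .docx file is a zip archive; anything else cannot be a valid package.
    if not zipfile.is_zipfile(path):
        return ["not a zip archive"]
    with zipfile.ZipFile(path) as package:
        names = set(package.namelist())
    # Report every required part that is absent from the archive.
    return [f"missing part: {part}" for part in REQUIRED_PARTS if part not in names]
```

Files that return a non-empty list could be written to a log and skipped, so the main loop only opens documents that pass this basic check. Passing the check does not guarantee that Word can open the file.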

At some point in the future we may care about formatting (bold, etc) which is lost in this process, in which case we can rethink the approach.

At some point we can think about integrating this into an automated pipeline.
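
A small first step toward such a pipeline could be to separate "which files still need converting" from the conversion itself, so that a scheduled job can check for new work without starting Word. This is only a sketch: the function name `pending_conversions` is hypothetical, and it assumes the same convention the notebook uses, where each output file is named after the input file's stem with a `.txt` extension.

```python
from pathlib import Path


def pending_conversions(input_folder, output_folder):
    """Return (input_path, output_path) pairs for documents with no text file yet."""
    input_folder = Path(input_folder)
    output_folder = Path(output_folder)
    pairs = []
    # Same pattern as the notebook: matches both .doc and .docx files.
    for source in sorted(input_folder.glob("*.doc*")):
        target = output_folder / (source.stem + ".txt")
        if not target.exists():
            pairs.append((source, target))
    return pairs
```

An empty result would mean there is nothing to do, and the job could exit without creating a Word instance.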

In [1]:
import win32com.client as win32
import pathlib

# for regular updates of cases obtained by email
input_folder = 'D:\\RAD Decisions\\'
output_folder = 'D:\\RAD Decisions TEXT\\'

# For initial data dump from the IRB
#input_folder = 'D:\\documents\\'
#output_folder = 'D:\\IRB Decisions - Initial Request - TEXT\\'

In [2]:
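def convert_doc_to_txt(input_path, output_path):
    """Open one Word document and save it as text; on failure, report and move on."""
    print('Converting file: ', input_path)
    doc = None
    try:
        # open the word document and keep a handle to it
        doc = word.Documents.Open(str(input_path))

        # save word doc as text using word's built in functionality
        # (FileFormat=7 is wdFormatUnicodeText in Word's WdSaveFormat enumeration)
        doc.SaveAs(str(output_path), FileFormat=7)

        # Close the document
        doc.Close()

    except Exception as e:
        # the loop below carries on with the next file after an error
        print(f"An error occurred, skipping file: {e}")

        # Close the document if it was opened
        if doc is not None:
            try:
                doc.Close()
            except Exception:
                pass

    return None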

# get list of files that are doc/docx
files = pathlib.Path(input_folder).glob('*.doc*')

# output folder path
output_folder = pathlib.Path(output_folder).absolute()

# create output folder if it doesn't exist
if not output_folder.exists():
    output_folder.mkdir(parents=True)

# Create a new instance of Word
word = win32.gencache.EnsureDispatch('Word.Application')

# Loop through the doc files and convert them
for file in files:
    output_path = output_folder / (file.stem + '.txt')
    # check if output file already exists
    if output_path.exists():
        print('File already exists, skipping: ', file)
        continue
    input_path = file.absolute()
    convert_doc_to_txt(input_path, output_path)

# close word
word.Quit()



File already exists, skipping:  D:\RAD Decisions\1. MC3-00542 (Ouellet - Mexico - 111(1)(a) - Dismissed 1Fa Exclusion)-Signed v2.docx
File already exists, skipping:  D:\RAD Decisions\2. MC3-00542tf.docx
File already exists, skipping:  D:\RAD Decisions\AppToReopenTB8-14702a.docx
File already exists, skipping:  D:\RAD Decisions\AppToReopenTB8-14702tf.docx
File already exists, skipping:  D:\RAD Decisions\ASC Voting - TC1-04323 - EN.docx
File already exists, skipping:  D:\RAD Decisions\ASC Voting - TC1-04323 - FR.docx
File already exists, skipping:  D:\RAD Decisions\Decision concerning G. Bazin (finale version) (amended 5.5.2022) (003) - sanitized.docx
File already exists, skipping:  D:\RAD Decisions\Decision G. Bazin_ francais_modifiee (003) - sanitized.docx
Converting file:  D:\RAD Decisions\IRB Decision concerning Tarlochan Singh.docx
Converting file:  D:\RAD Decisions\IRB Decision concerning Tarlochan Singh_tf.docx
File already exists, skipping:  D:\RAD Decisions\MB7-00112a.docx
File a