# Best Practices
1. Save a local copy of this notebook to be included in your archive.
2. Save the cleaned text and also the raw OCR text and corresponding note files.
3. Remove the key and endpoint information from any saved copy that will be publicly available.
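
One way to follow the third practice is to scrub the credential values before sharing. The sketch below is only an illustration: it assumes the notebook has been exported as plain Python source and that the values are assigned to variables named `endpoint` and `key`, as in this notebook. The `redact_credentials` helper and the `"REDACTED"` placeholder are hypothetical names.

```python
import re

# Matches a line that assigns a quoted string to `endpoint` or `key`
ASSIGNMENT = re.compile(r'^([ \t]*(?:endpoint|key)[ \t]*=[ \t]*)(["\']).*?\2', re.MULTILINE)

def redact_credentials(source):
    """Replace the quoted values assigned to endpoint and key with a placeholder."""
    return ASSIGNMENT.sub(r'\1"REDACTED"', source)
```

Always open the saved copy and confirm the values are gone before publishing; a pattern like this can miss credentials written in another form.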

## I. Using Microsoft Azure's Document Intelligence to Transcribe Documents

### Run this to install the required code library from Microsoft Azure

In [None]:
!pip install azure-ai-documentintelligence

Collecting azure-ai-documentintelligence
  Downloading azure_ai_documentintelligence-1.0.0b4-py3-none-any.whl.metadata (48 kB)
Collecting isodate>=0.6.1 (from azure-ai-documentintelligence)
  Downloading isodate-0.7.2-py3-none-any.whl.metadata (11 kB)
Collecting azure-core>=1.30.0 (from azure-ai-documentintelligence)
  Downloading azure_core-1.32.0-py3-none-any.whl.metadata (39 kB)
Downloading azure_ai_documentintelligence-1.0.0b4-py3-none-any.whl (99 kB)
Downloading azure_core-1.32.0-py3-none-any.whl (198 kB)
Downloading isodate-0.7.2-py3-none-any.whl (22 kB)
Installing collected packages: isodate, azure-core, azur

### Create folders for the raw OCR text (data), the analysis notes (notes), and the cleaned text files (cleaned)

In [None]:
!mkdir data
!mkdir notes
!mkdir cleaned
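
`!mkdir` reports an error when a folder already exists, for example when the cell is run a second time. A minimal Python sketch that creates the same three folders and tolerates re-runs is shown here; the `ensure_folders` helper and its `base` parameter are illustrative and not part of the original notebook.

```python
import os

def ensure_folders(base=".", names=("data", "notes", "cleaned")):
    """Create each folder under base if it does not already exist."""
    created = []
    for name in names:
        path = os.path.join(base, name)
        os.makedirs(path, exist_ok=True)  # no error if the folder is already there
        created.append(path)
    return created
```

Calling `ensure_folders()` with no arguments creates the folders next to the notebook.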

### Upload PDFs

#### OPTION 1: Upload from Local Drive
Use this to upload the unprocessed PDFs or any text files you wish to clean.

In [None]:
from google.colab import files
uploaded = files.upload()

Saving CH_T-02_Aklavik.pdf to CH_T-02_Aklavik.pdf
Saving CH_T-03_Aklavik.pdf to CH_T-03_Aklavik.pdf


Define text parsing and file saving functions.

In [None]:
import os

# USED BY DOCUMENT INTELLIGENCE

# Get words
def get_words(page, line):
    result = []
    for word in page.words:
        if _in_span(word, line.spans):
            result.append(word)
    return result

# To learn the detailed concept of "span" in the following code, visit: https://aka.ms/spans
def _in_span(word, spans):
    for span in spans:
        if word.span.offset >= span.offset and (word.span.offset + word.span.length) <= (span.offset + span.length):
            return True
    return False

# USED TO INTERACT WITH FILES

sourcepath = './'
outputdir = './data/'

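# Save text (txt) to a new file (filename)
def save_to_file(filename,txt):
    # The with-block closes the file even if the write fails; UTF-8 keeps accented characters intact
    with open(filename,"w",encoding="utf-8") as wfile:
        wfile.write(txt)

# Read/transcribe a pdf and save it to the data directory
def read_pdf_into_ocr(fname,sourcedir,outputdir):
    try:
        result_text = analyze_read(sourcedir,fname)
        # splitext keeps dots inside the file name (e.g. "a.b.pdf" -> "a.b")
        newfileloc = outputdir+os.path.splitext(fname)[0]
        save_to_file(newfileloc+'_ocr.txt',result_text)
    except Exception as err:
        # Report which file failed and why, then carry on with the next file
        print(f'Error while reading {fname}: {err}')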
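
The span test in `_in_span` can be tried without calling Azure. This is a minimal sketch under stated assumptions: `Span` is a hypothetical stand-in for the span objects that Document Intelligence returns, and `span_contains` is an illustrative helper that repeats the same offset/length comparison.

```python
from collections import namedtuple

# Hypothetical stand-in for a Document Intelligence span:
# offset is the start position in the full text, length is the character count.
Span = namedtuple("Span", ["offset", "length"])

def span_contains(outer, inner):
    """Return True if inner lies entirely within outer."""
    return (inner.offset >= outer.offset
            and inner.offset + inner.length <= outer.offset + outer.length)

content = "Hello world\nSecond line"
line_span = Span(offset=0, length=11)    # covers "Hello world"
word_span = Span(offset=6, length=5)     # covers "world"
other_span = Span(offset=12, length=6)   # covers "Second"

inside = span_contains(line_span, word_span)     # word belongs to the first line
outside = span_contains(line_span, other_span)   # word belongs to another line
```

A word is assigned to a line only when its whole span fits inside one of the line's spans, which is why `other_span` is rejected.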



### REQUIRED: Enter your Endpoint and Key values

### **Without these credentials, the Document Intelligence functions won't work**

*   Replace "YOUR_FORM_RECOGNIZER_ENDPOINT" with the Endpoint value provided by Microsoft Azure
*   Replace "YOUR_FORM_RECOGNIZER_KEY" with the Key value provided by Microsoft Azure

### **These key credentials are from your paid account and should not be shared publicly.**

Remember to remove the key from your code when you're done, and never post it publicly. For production, use secure methods to store and access your credentials.

For a walkthrough on setting up Document Intelligence on Azure and finding your Endpoint and Key, visit:
* https://github.com/DiSA-Projects/Doc-Intel-Transcription/blob/main/SetupDocIntel.md

You can read more about keeping credentials secure, including the use of environment variables, here:
* https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-security?tabs=command-line%2Ccsharp#environment-variables-and-application-configuration


In [None]:
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"  # Replace with Endpoint from Azure Document Intelligence
key = "YOUR_FORM_RECOGNIZER_KEY"            # Replace with Key from Azure Document Intelligence
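
As an alternative to pasting the values into the notebook, the credentials can be read from environment variables. The sketch below assumes the variable names `DOCUMENTINTELLIGENCE_ENDPOINT` and `DOCUMENTINTELLIGENCE_API_KEY`, which are the names used in Microsoft's sample code; the `load_credentials` helper itself is hypothetical.

```python
import os

def load_credentials(env=None):
    """Return (endpoint, key) read from environment variables."""
    env = os.environ if env is None else env
    try:
        return env["DOCUMENTINTELLIGENCE_ENDPOINT"], env["DOCUMENTINTELLIGENCE_API_KEY"]
    except KeyError as missing:
        # Fail early with a clear message instead of sending an empty key to Azure
        raise RuntimeError(f"Environment variable {missing} is not set") from None
```

With the variables set, `endpoint, key = load_credentials()` replaces the two assignments above, and no secret is stored in the saved notebook.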

### Define the analyze_read function used by Document Intelligence to transcribe texts

In [None]:
# Modified version of Analyze_Read function provided as a template by Microsoft Azure Document Intelligence
# * Instead of printing output to the terminal, the analytical notes are saved in a separate text file in notes.

def analyze_read(sourcepath,fname):
    filetext = ''
    filenotes = ''
    filename = sourcepath+fname

    from azure.core.credentials import AzureKeyCredential
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import DocumentAnalysisFeature, AnalyzeResult, AnalyzeDocumentRequest

    # The endpoint and key are defined in the credentials cell above.
    # To read them from environment variables instead, uncomment the next two lines.
    #endpoint = os.environ["DOCUMENTINTELLIGENCE_ENDPOINT"]
    #key = os.environ["DOCUMENTINTELLIGENCE_API_KEY"]

    document_intelligence_client = DocumentIntelligenceClient(endpoint=endpoint, credential=AzureKeyCredential(key))

    # Analyze a document at a URL:
    # formUrl = "https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/read.png"
    # Replace with your actual formUrl:
    # If you use the URL of a public website, to find more URLs, please visit: https://aka.ms/more-URLs
    # If you analyze a document in Blob Storage, you need to generate Public SAS URL, please visit: https://aka.ms/create-sas-tokens
    # poller = document_intelligence_client.begin_analyze_document(
    #    "prebuilt-read",
    #    AnalyzeDocumentRequest(url_source=formUrl),
    #    features=[DocumentAnalysisFeature.LANGUAGES]
    # )

    # Analyze a local document: open the file and send its bytes to the prebuilt read model.
    path_to_sample_document = filename
    print(f'Filename {filename}')
    with open(path_to_sample_document, "rb") as f:
        poller = document_intelligence_client.begin_analyze_document(
            "prebuilt-read",
            analyze_request=f,
            features=[DocumentAnalysisFeature.LANGUAGES],
            content_type="application/octet-stream",
        )
    result: AnalyzeResult = poller.result()

    # [START analyze_read]
    # Detect languages.
    print("----Languages detected in the document----")
    if result.languages is not None:
        for language in result.languages:
            filenotes = filenotes + '\n'+'Language code: '+language.locale+' with confidence '+str(language.confidence)
            #print(f"Language code: '{language.locale}' with confidence {language.confidence}")

    # To learn the detailed concept of "bounding polygon" in the following content, visit: https://aka.ms/bounding-region
    # Analyze pages.
    for page in result.pages:
        #print(f"----Analyzing document from page #{page.page_number}----")
        filenotes = filenotes+'\n'+'----Analyzing document from page #'+str(page.page_number)+'----'
        filenotes = filenotes+'\n'+'Page has width: '+str(page.width)+' and height: '+str(page.height)+', measured with unit: '+page.unit
        #print(f"Page has width: {page.width} and height: {page.height}, measured with unit: {page.unit}")

        # Analyze lines.
        if page.lines:
            for line_idx, line in enumerate(page.lines):
                words = get_words(page, line)
                #print(
                #    f"...Line # {line_idx} has {len(words)} words and text '{line.content}' within bounding polygon '{line.polygon}'"
                #)
                filenotes = filenotes+'\n'+'...Line # '+str(line_idx)+' has '+str(len(words))+ ' words and text '+line.content+' within bounding polygon '+str(line.polygon)

                # Analyze words.
                for word in words:
                    #print(f"......Word '{word.content}' has a confidence of {word.confidence}")
                    filenotes = filenotes+'\n'+'......Word '+ word.content+' has a confidence of '+str(word.confidence)

    # Analyze paragraphs.
    if result.paragraphs:
        print(f"----Detected #{len(result.paragraphs)} paragraphs in the document: {fname}----")
        for paragraph in result.paragraphs:
            #print(f"Found paragraph within {paragraph.bounding_regions} bounding region")
            #print(f"...with content: '{paragraph.content}'")
            filetext = filetext + '\n'+paragraph.content

    # splitext keeps dots inside the file name (e.g. "a.b.pdf" -> "a.b")
    newfileloc = './notes/'+os.path.splitext(fname)[0]+'_notes.txt'
    save_to_file(newfileloc,filenotes)
    print("----------------------------------------")
    filetext = filetext + '\n----------------------------------------'
    return filetext
# [END analyze_read]
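
The notes files record a confidence score for every word, which makes it possible to list the words most likely to be mistranscribed. The sketch below assumes the `......Word <text> has a confidence of <score>` line format written by `analyze_read`; the `low_confidence_words` helper and the 0.8 threshold are illustrative choices, not recommendations from Microsoft.

```python
import re

# One note line per word, e.g. "......Word Fort has a confidence of 0.993"
WORD_NOTE = re.compile(r'^\.{6}Word (.+) has a confidence of (\d+(?:\.\d+)?)$', re.MULTILINE)

def low_confidence_words(notes_text, threshold=0.8):
    """Return (word, confidence) pairs below threshold, in file order."""
    flagged = []
    for word, conf in WORD_NOTE.findall(notes_text):
        if float(conf) < threshold:
            flagged.append((word, float(conf)))
    return flagged
```

Reading a file from the `notes` folder and passing its text to this function gives a short list of words worth checking against the original scan.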

### Convert PDFs to text files
Use Microsoft Azure's Document Intelligence (AI-aided transcription) to read each PDF, identify its words, and produce a transcribed text version.

In [None]:
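# Only send PDFs to Document Intelligence; other files in the folder are skipped
pdf_files = [f for f in os.listdir(sourcepath)
             if os.path.isfile(os.path.join(sourcepath, f)) and f.lower().endswith('.pdf')]
for f in pdf_files:
    read_pdf_into_ocr(f,sourcepath,outputdir)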

ENDPOINT https://westus2.api.cognitive.microsoft.com/
Filename ./CH_T-02_Aklavik.pdf
----Languages detected in the document----
----Detected #826 paragraphs in the document: CH_T-02_Aklavik.pdf----
----------------------------------------
ENDPOINT https://westus2.api.cognitive.microsoft.com/
Filename ./CH_T-03_Aklavik.pdf
----Languages detected in the document----
----Detected #312 paragraphs in the document: CH_T-03_Aklavik.pdf----
----------------------------------------
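
Because each call to Document Intelligence is billed to your account, it can help to skip PDFs that already have a transcript. This is a minimal sketch, assuming output files follow the `<name>_ocr.txt` convention used above and that file names contain a single dot; `pending_pdfs` is a hypothetical helper.

```python
import os

def pending_pdfs(sourcedir, outputdir):
    """List PDFs in sourcedir that have no matching *_ocr.txt in outputdir."""
    pending = []
    for name in sorted(os.listdir(sourcedir)):
        stem, ext = os.path.splitext(name)
        if ext.lower() != ".pdf":
            continue  # ignore folders and non-PDF files
        if not os.path.exists(os.path.join(outputdir, stem + "_ocr.txt")):
            pending.append(name)
    return pending
```

Looping over `pending_pdfs('./', './data/')` instead of every file in the folder means a re-run only transcribes what is new.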


## II. Processing and Cleaning Transcripts (after Document Intelligence is done)

## Text File Cleaning Template

This Python code removes unwanted line numbers, paragraph markers, details about the digitization company, and page numbers from the raw text files generated by AI-assisted transcription (the output from using Microsoft Azure's Document Intelligence). The output should be human-readable text suitable for textual analysis. Additional functions can be written to clean up other details.


#### Import Python libraries for Regular Expressions and Strings

In [None]:
import re
import string

#### Function to remove paragraph and section markers created by AI transcription

In [None]:
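def remove_paragraph_markers(text):
    # Remove "Paragraph:", "===PARAGRAPH===", and runs of "=" using regular expressions
    text = re.sub(r'Paragraph:','',text)
    text = re.sub(r'===PARAGRAPH===','',text)
    text = re.sub(r'={10}','',text)
    # Collapse the blank lines left behind into one line break so neighbouring lines are not glued together
    text = re.sub(r'\n{2,}','\n',text)
    return text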

#### Remove any language related to the digital transcription company that digitized/scanned the text/document

You can customize this part to keep specific strings out of the final version (e.g. the name and address of the transcription service that scanned the documents, which are not part of the content itself).

In [None]:
def remove_transcription_comp(text):
    # Example of removing references to a transcription company's name, phone numbers, and website since they are not relevant to our research
    unwanted = ['FALLGUY REPORTING LTD.', 'Ph: 604-555-4444 Fax: 604-333-8888', 'www.fallguyservice.com']
    for phrase in unwanted:
        # re.escape makes '.', '-', and other special characters match literally
        text = re.sub(re.escape(phrase),'',text)
    return text

#### Remove any non-relevant artifacts and strings embedded in the text.

For legal documents and court transcripts, there will often be line numbers on the left margin. Sometimes Document Intelligence will misinterpret hole punch marks as characters -- these too can be removed.

**Remove Line Numbers**
* The current code removes just line numbers that appear at the start of a line.
* This also removes page numbers if they appear alone on a line.
* If removing a line number produces a blank line, the blank line is removed (this can be changed if needed).
* Volume number and dates are left in place (no changes made to them).

**Remove Hole Punch Marks**
* The holes left by a three-hole punch are often misinterpreted as 0, O, or : -- this function removes any solo appearance of these characters (alone on a line).



In [None]:
def is_all_numbers(line):
    # True if every space-separated token starts with a 1 or 2 digit number
    # or is one of the stray marks that OCR commonly produces in the margin
    numlist = line.split(' ')
    for n in numlist:
        if re.match(r'^\d{1,2}',n) is None:
            if n not in ['!','i','O',':','.','C']:
                return False
    return True

def remove_line_numbers(text):
    # Search the file to see if there are any lines that begin with a number
    if re.search(r'^\d',text,flags=re.MULTILINE) is None:
        # If there are no lines which begin with numbers, just return the original text
        return text

    kept = []
    # Split the file into a list of lines and examine each line
    for line in text.split('\n'):
        # Drop page numbers that appear alone on a line
        if re.fullmatch(r'\d+',line.strip()):
            continue
        # Remove a 1 or 2 digit number from the start of the line only;
        # longer numbers (such as years) and numbers later in the line are kept
        line = re.sub(r'^\d{1,2}(?!\d)','',line)

        # Remove any leading space left over from removing the line number
        if line.startswith(' '):
            line = line[1:]
        if len(line)>0:
            # If the line is not blank, add it back to the file
            kept.append(line)

    # Return the updated version of the file
    return '\n'.join(kept)

# Remove misinterpreted hole punch marks. Can be modified to catch additional ways a hole punch is misread
def remove_hole_punch_marks(text):
    newtext = ''
    lines = text.split('\n')
    for line in lines:
        if len(line) == 1 and line[0] in ['O','0',':']:
            newtext = newtext+'\n'
        else:
            newtext = newtext+'\n'+line

    return newtext
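
The line-number rules described above can be checked on a few sample lines before running them on a full transcript. The sketch below is independent of the notebook's functions: `strip_margin_number` is a hypothetical helper that applies only the "1 or 2 digits at the start of a line" rule, and the sample lines are invented for illustration.

```python
import re

def strip_margin_number(line):
    """Remove a 1 or 2 digit number at the start of a line, plus one following space."""
    # (?!\d) stops the pattern from biting into longer numbers such as years
    return re.sub(r'^\d{1,2}(?!\d) ?', '', line)

sample = [
    "12 Q. Where were you living on May 5, 1975?",
    "1975 was the year we moved to Aklavik.",
    "Volume 3",
]
stripped = [strip_margin_number(line) for line in sample]
```

Only the margin number `12` is removed; the date inside the first line, the year that opens the second line, and the volume number are left in place.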

#### Call the above functions to clean the text in a given file
Comment out any function calls that are not needed (just put a # in front of the line).

**NOTE:** It is strongly recommended that you create a new function for each step of data cleaning you are attempting, rather than put all text cleaning substitutions in the same function. This will make it easier to debug and gives you more control.



In [None]:
def clean_text(text):
    text = remove_transcription_comp(text)
    text = remove_paragraph_markers(text)
    text = remove_hole_punch_marks(text)
    text = remove_line_numbers(text)
    return text
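
Keeping each cleaning step in its own function also makes it possible to measure what every step does. A minimal sketch of that idea follows; `apply_steps` and the two toy steps are hypothetical stand-ins, not the notebook's own cleaning functions.

```python
def apply_steps(text, steps):
    """Run each cleaning function in order and record how many characters it removed."""
    report = []
    for step in steps:
        before = len(text)
        text = step(text)
        report.append((step.__name__, before - len(text)))
    return text, report

# Two toy steps standing in for the notebook's cleaning functions
def drop_marker(text):
    return text.replace("===PARAGRAPH===", "")

def drop_trailing_spaces(text):
    return "\n".join(line.rstrip() for line in text.split("\n"))

cleaned, report = apply_steps("===PARAGRAPH===Hello  \nworld ", [drop_marker, drop_trailing_spaces])
```

A step that reports zero characters removed, or far more than expected, is a good place to start debugging.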


#### (OPTIONAL) Upload/Import the raw data files here (if you didn't use Document Intelligence above)
If you are just testing text cleaning steps or aren't using Document Intelligence for transcription, you can use the snippet below to import raw text files from your local drive.

In [None]:
from google.colab import files
uploaded = files.upload()

#### After importing files, wait a moment for the File Explorer on the left to refresh.

#### Read a raw text file and output a clean version

- Cleaned text will be saved in the **/cleaned** folder
- As the code runs, the name of the data file currently being processed will be printed out, as will the name of the output file.

In [None]:
def readDataFile(filename,datadir,outputdir):
    import os
    try:
        with open(datadir+filename,'r',encoding='utf-8') as text:
            lines = text.read()
        cleaned = clean_text(lines)
        # splitext keeps dots inside the file name (e.g. "a.b.txt" -> "a.b")
        newfile = os.path.splitext(filename)[0]+'_clean.txt'
        # outputdir is passed as '/cleaned/', so the leading '.' makes the path relative to the notebook
        with open('.'+outputdir+newfile,'w',encoding='utf-8') as wfile:
            wfile.write(cleaned)
        #print(cleaned)
        print(filename+' processed. Creating '+newfile)
    except Exception:
        print('Error while cleaning '+filename)
        raise


### Process data files
Run this snippet to process/clean the data files. Cleaned versions of the text files will appear when you **refresh** the file explorer on the left.



In [None]:
import os
datadir = './data/'
outputdir = '/cleaned/'
datafiles = [f for f in os.listdir(datadir) if os.path.isfile(os.path.join(datadir, f))]
for f in datafiles:
    readDataFile(f,datadir,outputdir)

CH_T-03_Aklavik_ocr.txt processed. Creating CH_T-03_Aklavik_ocr_clean.txt
CH_T-02_Aklavik_ocr.txt processed. Creating CH_T-02_Aklavik_ocr_clean.txt


## III. Zip the raw, annotated, and cleaned files for easy download

### OPTIONAL: Create a Zip file containing all processed files

Add cleaned files to data-clean.zip

In [None]:
!zip -r /content/data-clean.zip /content/cleaned

  adding: content/cleaned/ (stored 0%)
  adding: content/cleaned/CH_T-03_Aklavik_ocr_clean.txt (deflated 64%)
  adding: content/cleaned/CH_T-02_Aklavik_ocr_clean.txt (deflated 66%)


Add raw OCR files (uncleaned) to data-clean.zip

In [None]:
!zip -r /content/data-clean.zip /content/data

  adding: content/data/ (stored 0%)
  adding: content/data/CH_T-03_Aklavik_ocr.txt (deflated 66%)
  adding: content/data/CH_T-02_Aklavik_ocr.txt (deflated 68%)


Add OCR notes to data-clean.zip

In [None]:
!zip -r /content/data-clean.zip /content/notes

  adding: content/notes/ (stored 0%)
  adding: content/notes/CH_T-03_Aklavik_notes.txt (deflated 86%)
  adding: content/notes/CH_T-02_Aklavik_notes.txt (deflated 86%)
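
The same archive can be built with Python's standard `zipfile` module, which is useful when the notebook runs somewhere the `zip` command is unavailable. This is a sketch under stated assumptions: `zip_folders` is a hypothetical helper, and unlike `zip -r`, opening the archive in `"w"` mode replaces an existing file instead of adding to it.

```python
import os
import zipfile

def zip_folders(zip_path, folders):
    """Write every file under the given folders into one zip archive."""
    added = []
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for folder in folders:
            for root, _dirs, names in os.walk(folder):
                for name in sorted(names):
                    path = os.path.join(root, name)
                    archive.write(path)  # stored under its path, like zip -r
                    added.append(path)
    return added
```

For this notebook the call would be `zip_folders('data-clean.zip', ['cleaned', 'data', 'notes'])`, with the archive written outside the folders being zipped.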


### Download the Zip file containing all processed files

In [None]:
from google.colab import files
files.download("data-clean.zip")