### Lower case document content

In [1]:
def decapitalize_content(pages: list[str]):

    """Turns document content into lower case"""

    for p in pages:
        p.page_content = p.page_content.lower()
    return pages

### Removes non ASCII characters

In [8]:
import re

def remove_non_ASCII(pages: list[str]):

    """Removes non ASCII characters from document. Not suitable for many non english languages 
    which have several non ASCII characters """

    for p in pages:
        p.page_content = re.sub(r'[^\x00-\x7F]+', '', p.page_content)
    return pages

### Removes bulleted and numbered lists

In [11]:
import re

def remove_bullets(pages: list[str]):

    """Removes bullets from document """

    for p in pages:
        p.page_content = re.sub(r'^[→•\-*✔●✗]\s*', '', p.page_content, flags = re.MULTILINE)
        p.page_content = re.sub(r'\d+\.\s*', '', p.page_content)
    return pages

### Removes multiple consecutive escape characters

In [4]:
def remove_escape(pages: list[str]):

    """Turns multiple consecutive escape characters into a single white space"""
    
    for p in pages:
        p.page_content = ' '.join(p.page_content.split())
    return pages

### Load all file paths from source folder

In [42]:
import os

folder = "../sources"
files = []

for f in os.listdir(folder):
    file = os.path.join(folder, f)
    if os.path.isfile(file):
        files.append(file)

### Parse documents into pages 

In [51]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader("../sources/VDA.pdf")
pages = []

async for page in loader.alazy_load():
    pages.append(page)

### Pre-process content (text cleaning)

In [52]:
cleaned_pages = remove_non_ASCII(pages)
cleaned_pages = decapitalize_content(cleaned_pages)
cleaned_pages = remove_bullets(cleaned_pages)
cleaned_pages = remove_escape(cleaned_pages)

In [53]:
print(cleaned_pages[9].page_content)

10 jin et al. multi-source domain adaptation (msda). when running our method for msda, we similarly merge multiple source domains in mcc and compare it to existing da algorithms that are specically designed for msda on domainnet. as shown in table 1, based on the inductive bias of minimizing the class confusion, mcc signicantly outperforms m3sda [31], the state-of-the-art method by a big margin (0%). note that these specic methods are of very complex architecture and loss designs and may be hard to use in practical applications. table 1: accuracy (%) on domainnet for mtda and msda (resnet-101). (a) mtda method c: i: p: q: r: s: avg resnet [13] 6 8 8 2 6 3 1 se [7] 3 5 5 8 0 7 6 mcd [36] 1 1 0 4 2 5 7 dada [32] 1 0 5 9 7 8 5 mcc 6 0 4 5 0 3 8 (b) msda method :c :i :p :q :r :s avg resnet [13] 6 0 1 3 9 7 9 mcd [36] 3 1 7 6 4 5 5 dctn [48] 6 5 8 2 5 3 2 m3sda [31] 6 0 3 3 7 5 6 mcc 5 0 6 5 0 7 6 partial domain adaptation (pda). due to the existence of source outlier classes, pda is known 

### Parse content with Unstructured

In [54]:
from langchain_unstructured import UnstructuredLoader

loader = UnstructuredLoader(file_path = "../sources/VDA.pdf", strategy = "hi_res")
docs = []
async for doc in loader.alazy_load():
    docs.append(doc)

INFO: Reading PDF for file: ../sources/VDA.pdf ...


In [None]:
for doc in docs:
    print(doc.page_content)