# PDF Document Analysis
We will now analyse the page, character and word counts of all PDF documents.

<small>**NOTE:** This notebook was used to extract PDF metadata.</small>

<small>**NOTE:** This is an extra analysis on top of the final analysis. It is included in the appendix.</small>

First we will load the needed packages:

In [2]:
import pandas as pd
import pymupdf
from tqdm.notebook import tqdm

from pathlib import Path
import re

We will then define the base path, where the PDFs are located, and load the document data needed for the analysis:

In [3]:
BASE_PATH = Path('../study_documents')

analysis_df = pd.read_excel(BASE_PATH / 'all_documents_analysis_data.xlsx', index_col=0)

analysis_df

Unnamed: 0_level_0,path,url,name,risk_management_plan
eu_pas_register_number,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1587,rmp_other/EUPAS1587_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Tesis Maria Jose Alcala,
1591,rmp_other/EUPAS1591_result_tables.pdf,https://catalogues.ema.europa.eu/sites/default...,Report_Rosiglitazone_use,Not applicable
1597,rmp_other/EUPAS1597_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,Protocol INAS-FOCUS,EU RMP category 3 (required)
1597,rmp_other/EUPAS1597_result_tables.pdf,https://catalogues.ema.europa.eu/sites/default...,IFOC_FinalStudyReport_Public Version 20200819,EU RMP category 3 (required)
1613,rmp_other/EUPAS1613_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,VIPOS_Study Protocol,EU RMP category 3 (required)
...,...,...,...,...
108254,rmp_other/EUPAS108254_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,CEIM_LEGIT_MC_EVCDAO_2019 Modificacion Favorab...,Not applicable
108260,rmp_other/EUPAS108260_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,LEGIT_COVIDX_EVCDAO_2022 Protocol Multipatholo...,Not applicable
108260,rmp_other/EUPAS108260_other_document_#1.pdf,https://catalogues.ema.europa.eu/sites/default...,CEIm_LEGIT_COVIDX_EVCDAO_2022_TRJON-8abc0f12d8...,Not applicable
108481,rmp_other/EUPAS108481_protocol_document.pdf,https://catalogues.ema.europa.eu/sites/default...,MK-5592-141-00-v1-Protocol_final-redaction,Not applicable


Next we will define helper variables and functions to find the Table of Contents (ToC) in a document:

In [6]:
# Study report TOC headings
study_report_headings = [
    "Abstract",
    "List of abbreviations",
    "Investigators",
    "Other responsible parties",
    "Milestones",
    "Rationale and background",
    "Research question and objectives",
    "Amendments and updates",
    "Research methods",
    "Study design",
    "Setting",
    "Subjects",
    "Variables",
    "Data sources and measurement",
    "Bias",
    "Study size",
    "Data transformation",
    "Statistical methods",
    "Main summary measures",
    "Main statistical methods",
    "Missing values",
    "Sensitivity analyses",
    "Amendments to the statistical analysis plan",
    "Quality control",
    "Results",
    "Participants",
    "Descriptive data",
    "Outcome data",
    "Main results",
    "Other analyses",
    r"Adverse (?:events|reactions)",
    "Discussion",
    "Key results",
    "Limitations",
    "Interpretation",
    "Generalisability",
    "Other information",
    "Conclusion",
    "References",
    "Appendices",
    "Annex",
]

# Protocol TOC headings
protocol_headings = [
    "Table of contents",
    "List of abbreviations",
    "Responsible parties",
    "Abstract",
    "Amendments and updates",
    "Milestones",
    "Rationale and background",
    "Research question and objectives",
    "Research methods",
    "Study design",
    "Setting",
    "Variables",
    "Data sources",
    "Study size",
    "Data management",
    "Data analysis",
    "Quality control",
    "Limitations of the research methods",
    "Other aspects",
    "Protection of human subjects",
    r"Management and reporting of adverse (?:events|reactions)",
    "Plans for disseminating and communicating study results",
    "References",
    "Annex"
]

# Compile regex patterns for TOC headings
# (raw strings avoid the invalid escape sequence '\s' in a normal string literal)
study_report_patterns = [re.compile(heading.replace(' ', r'\s+'), re.IGNORECASE) for heading in study_report_headings]
protocol_patterns = [re.compile(heading.replace(' ', r'\s+'), re.IGNORECASE) for heading in protocol_headings]

def extract_toc_info(doc, texts):
    toc_info = {}
    
    # Regex patterns for TOC and dots
    toc_patterns = [
        re.compile(pattern, re.IGNORECASE)
        for pattern in [r'\btable\s+of\s+contents\b', r'\btoc\b', r'\bcontents\b\s*(?:\r\n|\r|\n)']
    ]

    dot_pattern = re.compile(r'\.{3,}', re.IGNORECASE)
    
    # Initialize toc_info dictionary
    for i in range(len(texts)):
        toc_info[i] = {
            "page_number": i,
            "has_toc_pattern": False,
            "has_goto_link": False,
            "has_digit_and_dots": False,
            "study_report_toc_headings": [],
            "protocol_toc_headings": [],
            "is_in_toc_chain": False
        }
    
    # Check for TOC patterns
    for i, text in enumerate(texts):
        if any(pattern.search(text) for pattern in toc_patterns):
            toc_info[i]["has_toc_pattern"] = True
    
    # Check for lines with at least one digit and a run of three or more dots
    for i, text in enumerate(texts):
        for line in text.splitlines():
            # Test the dots on the same line as the digit, not on the whole page text
            if any(char.isdigit() for char in line) and dot_pattern.search(line):
                toc_info[i]["has_digit_and_dots"] = True
                break
    
    # Check for goto links
    link_chain = []
    for i, page in enumerate(doc):
        links = [link for link in page.get_links() if link['kind'] == pymupdf.LINK_GOTO]
        if links:
            toc_info[i]["has_goto_link"] = True
            link_chain.append((i, len(links)))
        else:
            if link_chain:
                link_chain = []
    return toc_info

def find_relevant_toc_chain(toc_info):
    # Collect pages that have goto links or digit and dots
    chain_candidates = [page_number for page_number, info in toc_info.items() if info["has_goto_link"] or info["has_digit_and_dots"]]

    goto_links = []
    current_chain = []
    
    for page_number in range(len(toc_info)):
        if page_number in chain_candidates:
            current_chain.append(page_number)
        else:
            if current_chain:
                goto_links.append(current_chain)
                current_chain = []
    if current_chain:
        goto_links.append(current_chain)
    
    first_toc_page = next((page for page, info in toc_info.items() if info["has_toc_pattern"]), None)
    toc_chain = None
    for chain in goto_links:
        if first_toc_page is not None and first_toc_page in chain:
            toc_chain = chain
            break
        elif len(chain) > 1:
            toc_chain = chain
            break
    
    if toc_chain:
        for page_number in toc_chain:
            toc_info[page_number]["is_in_toc_chain"] = True

    return toc_info

def check_headings_in_toc(toc_text, reference_patterns):
    found_headings = []
    for pattern in reference_patterns:
        if pattern.search(toc_text):
            found_headings.append(pattern.pattern)
    return found_headings
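
To illustrate how the heading patterns and the dotted-leader check behave, here is a minimal sketch on a made-up ToC page. The heading subset and the sample text are assumptions for illustration only and are not taken from any study document.

```python
import re

# Hypothetical subset of the reference headings, for illustration only
headings = ["Study design", "Data sources", r"Adverse (?:events|reactions)"]

# Replace each space with \s+ so a heading still matches across line breaks
patterns = [re.compile(h.replace(" ", r"\s+"), re.IGNORECASE) for h in headings]

# Made-up ToC page text; "Data sources" is split across two lines
toc_text = (
    "TABLE OF CONTENTS\n"
    "9.1 Study design ........ 12\n"
    "9.4 Data\nsources ........ 15\n"
)

# Headings whose pattern occurs anywhere in the page text
found = [p.pattern for p in patterns if p.search(toc_text)]

# Lines that contain a digit and a run of three or more dots
dot_leader = re.compile(r"\.{3,}")
leader_lines = [
    line for line in toc_text.splitlines()
    if any(ch.isdigit() for ch in line) and dot_leader.search(line)
]

print(found)
print(leader_lines)
```

Because the search runs on the whole page text, a heading broken by a line break is still found, while the dotted-leader check is evaluated line by line.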

And a helper variable to find *ClinicalTrials.gov* IDs:

In [None]:
# Compiled NCT ID regex pattern
# nct_pattern = re.compile(r'NCT\d{8}', re.IGNORECASE)

This is the main function, which will be applied to every row of `analysis_df`:

In [None]:
def pdf_scores(series):
    path = series['path']
    document_name = series['name']
    
    pdf_path = Path(BASE_PATH / path)

    index = int(pdf_path.stem.split('_')[0][5:])
    document_type = pdf_path.stem.partition('_')[-1]

    with pymupdf.open(pdf_path) as doc:
        # print(doc.name, f'Number of pages: {len(doc)}', sep='\n')
        text_pages = [page.get_textpage() for page in doc]
        
        # table_page_numbers_with_amount = []
        # for i, page in enumerate(doc):
        #     tables = page.find_tables(
        #         # strategy='text'
        #     ).tables
        #     if tables:
        #         table_page_numbers_with_amount.append((i, len(tables)))
        # print('Table proportion: ', f'{100 * len(table_page_numbers_with_amount) / len(doc):.2f}%')
        
        # Extract texts, words and characters
        texts = [text_page.extractText() for text_page in text_pages]
        character_count = sum([len(text) for text in texts])
        # print('Character Count:', character_count)
        
        words = [text_page.extractWORDS() for text_page in text_pages]
        word_count = sum([len(word) for word in words])
        # print('Word Count:', word_count)
        
        # Extract TOC info
        extracted_toc_info = extract_toc_info(doc, texts)
        
        # Find relevant TOC chain
        extracted_toc_info = find_relevant_toc_chain(extracted_toc_info)
        
        # Collect and check TOC headings
        for page_number, info in extracted_toc_info.items():
            if info["has_toc_pattern"] or info["is_in_toc_chain"]:
                toc_text = texts[page_number]
                study_report_found_headings = check_headings_in_toc(toc_text, study_report_patterns)
                protocol_found_headings = check_headings_in_toc(toc_text, protocol_patterns)
                extracted_toc_info[page_number]["study_report_toc_headings"] = study_report_found_headings
                extracted_toc_info[page_number]["protocol_toc_headings"] = protocol_found_headings
        
        # Summarise TOC results
        extracted_toc_pages = [info for info in extracted_toc_info.values() if info["has_toc_pattern"] or info["is_in_toc_chain"]]
        extracted_first_toc_page_number = None
        extracted_unique_study_report_headings = None
        extracted_unique_protocol_headings = None
        if extracted_toc_pages:
            toc_df = pd.DataFrame.from_records(extracted_toc_pages, index='page_number')
            # display(toc_df)
            extracted_first_toc_page_number = toc_df[toc_df['has_toc_pattern']].index.min() if not toc_df[toc_df['has_toc_pattern']].empty else toc_df.index.min()
            extracted_first_toc_page_number = extracted_first_toc_page_number if not toc_df.empty else None
            extracted_unique_study_report_headings = len(set(heading for headings in toc_df['study_report_toc_headings'] for heading in headings))
            extracted_unique_protocol_headings = len(set(heading for headings in toc_df['protocol_toc_headings'] for heading in headings))
        # print("Protocol headings found in extracted TOC Chain:", extracted_unique_protocol_headings)
        # print("Study report headings found in extracted TOC Chain:", extracted_unique_study_report_headings)
        # print("First ToC Page Number:", extracted_first_toc_page_number)

        document_toc_info = doc.get_toc()
        document_unique_study_report_headings = None
        document_unique_protocol_headings = None
        if document_toc_info:
            document_toc_info = [info[1] for info in document_toc_info]
            document_found_study_report_headings = [check_headings_in_toc(text, study_report_patterns) for text in document_toc_info]
            document_found_protocol_headings = [check_headings_in_toc(text, protocol_patterns) for text in document_toc_info]
            document_unique_study_report_headings = len(set(heading for found_headings in document_found_study_report_headings for heading in found_headings))
            document_unique_protocol_headings = len(set(heading for found_headings in document_found_protocol_headings for heading in found_headings))
        # print("Protocol headings found in document TOC Chain:", document_unique_protocol_headings)
        # print("Study report headings found in document TOC Chain:", document_unique_study_report_headings)

        # Create a search_range for meta_data in the first few pages or before first extracted ToC page
        # search_range = texts[:extracted_first_toc_page_number] if extracted_first_toc_page_number is not None else texts[:min(9, len(texts))]

        # Try to find EU PAS ID
        # eupas_pattern = re.compile(rf'(?<!\d){index}\b')
        # eupas_page_numbers = []
        # for i, text in enumerate(search_range):
        #     ids = eupas_pattern.findall(text)
        #     if ids:
        #         eupas_page_numbers.append(i)
        # print("EU PAS ID page numbers:", '; '.join(eupas_page_numbers) if eupas_page_numbers else 'None')
        
        # Try to find NCT IDs
        # nct_ids = set()
        # for i, text in enumerate(search_range):
        #     ids = nct_pattern.findall(text)
        #     if ids:
        #         nct_ids |= nct_ids.union(ids)
        # print("NCT IDs:", '; '.join(nct_ids) if nct_ids else 'None')
        
        # print('\n')

        result = {
            'document_type': document_type,
            'URL': series['url'],
            'pdf_name': document_name,
            'pages': len(doc),
            'characters': character_count,
            'words': word_count,
            'extracted_unique_protocol_headings_absolute': extracted_unique_protocol_headings,
            'extracted_unique_protocol_headings_relative': pd.NA,
            'extracted_unique_protocol_headings': pd.NA,
            'extracted_unique_study_result_headings_absolute': extracted_unique_study_report_headings,
            'extracted_unique_study_result_headings_relative': pd.NA,
            'extracted_unique_study_result_headings': pd.NA,
            'meta_unique_protocol_headings_absolute': document_unique_protocol_headings,
            'meta_unique_protocol_headings_relative': pd.NA,
            'meta_unique_protocol_headings': pd.NA,
            'meta_unique_study_result_headings_absolute': document_unique_study_report_headings,
            'meta_unique_study_result_headings_relative': pd.NA,
            'meta_unique_study_result_headings': pd.NA,
            'first_toc_page_number': extracted_first_toc_page_number,
            # 'table_pages': pd.NA,
            # 'eupas_page_numbers': pd.NA,
            # 'nct_ids': pd.NA
        }

        if extracted_unique_protocol_headings:
            relative = 100 * extracted_unique_protocol_headings / len(protocol_headings)
            result.update({
                'extracted_unique_protocol_headings_relative': relative,
                'extracted_unique_protocol_headings': f'{extracted_unique_protocol_headings} ({relative:.2f}%)'
            })
                
        if extracted_unique_study_report_headings:
            relative = 100 * extracted_unique_study_report_headings / len(study_report_headings)
            result.update({
                'extracted_unique_study_result_headings_relative': relative,
                'extracted_unique_study_result_headings': f'{extracted_unique_study_report_headings} ({relative:.2f}%)'
            })

        if document_unique_protocol_headings:
            relative = 100 * document_unique_protocol_headings / len(protocol_headings)
            result.update({
                'meta_unique_protocol_headings_relative': relative,
                'meta_unique_protocol_headings': f'{document_unique_protocol_headings} ({relative:.2f}%)'
            })
                
        if document_unique_study_report_headings:
            relative = 100 * document_unique_study_report_headings / len(study_report_headings)
            result.update({
                'meta_unique_study_result_headings_relative': relative,
                'meta_unique_study_result_headings': f'{document_unique_study_report_headings} ({relative:.2f}%)'
            })

        # if table_page_numbers_with_amount:
        #     result.update({
        #         'table_pages': f'{len(table_page_numbers_with_amount)} ({100 * len(table_page_numbers_with_amount) / len(doc):.2f}%)'
        #     })          
        
        # if eupas_page_numbers:
        #     result.update({
        #         'eupas_page_numbers': '; '.join([str(i) for i in eupas_page_numbers])
        #     })

        # if nct_ids:
        #     result.update({
        #         'nct_ids': '; '.join(nct_ids)
        #     })

        return result
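
The file name parsing and the "count (percent)" formatting used in `pdf_scores` can be tried in isolation. The sketch below assumes a file name that follows the naming scheme visible in the `path` column; `format_share` is a hypothetical helper introduced only for this example.

```python
from pathlib import Path

# Hypothetical path following the naming scheme seen in analysis_df
pdf_path = Path("rmp_other/EUPAS1597_result_tables.pdf")

# "EUPAS1597_result_tables" -> "EUPAS1597" -> "1597" -> 1597
register_number = int(pdf_path.stem.split("_")[0][5:])

# Everything after the first underscore is the document type
document_type = pdf_path.stem.partition("_")[-1]


def format_share(found, total):
    """Format a heading count with its share of the reference list (hypothetical helper)."""
    relative = 100 * found / total
    return f"{found} ({relative:.2f}%)"


# 19 of the 24 protocol headings
share = format_share(19, 24)

print(register_number, document_type, share)
```

Note that `pdf_scores` only fills the relative columns when the count is truthy, so a count of 0 leaves them as `pd.NA`.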

Applying `pdf_scores` to `analysis_df`:

In [13]:
tqdm.pandas(
    desc = pdf_scores.__name__,
    total = len(analysis_df),
    unit = 'studies'
)

result_df = analysis_df.progress_apply(pdf_scores, axis=1, result_type='expand').reset_index().set_index(['eu_pas_register_number', 'document_type'])

result_df

pdf_scores:   0%|          | 0/2972 [00:00<?, ?studies/s]

MuPDF error: syntax error: cannot find ExtGState resource 'GS0'




Unnamed: 0_level_0,Unnamed: 1_level_0,URL,pdf_name,pages,characters,words,extracted_unique_protocol_headings_absolute,extracted_unique_protocol_headings_relative,extracted_unique_protocol_headings,extracted_unique_study_result_headings_absolute,extracted_unique_study_result_headings_relative,...,meta_unique_protocol_headings_absolute,meta_unique_protocol_headings_relative,meta_unique_protocol_headings,meta_unique_study_result_headings_absolute,meta_unique_study_result_headings_relative,meta_unique_study_result_headings,first_toc_page_number,table_pages,eupas_page_numbers,nct_ids
eu_pas_register_number,document_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1
1587,protocol_document,https://catalogues.ema.europa.eu/sites/default...,Tesis Maria Jose Alcala,164,215167,31958,0.0,,,1.0,2.439024,...,,,,,,,6.0,42 (25.61%),,
1591,result_tables,https://catalogues.ema.europa.eu/sites/default...,Report_Rosiglitazone_use,14,41705,3526,,,,,,...,0.0,,,2.0,4.878049,2 (4.88%),,2 (14.29%),,
1597,protocol_document,https://catalogues.ema.europa.eu/sites/default...,Protocol INAS-FOCUS,29,67031,9652,5.0,20.833333,5 (20.83%),5.0,12.195122,...,,,,,,,1.0,7 (24.14%),,
1597,result_tables,https://catalogues.ema.europa.eu/sites/default...,IFOC_FinalStudyReport_Public Version 20200819,227,476543,70085,19.0,79.166667,19 (79.17%),40.0,97.560976,...,19.0,79.166667,19 (79.17%),40.0,97.560976,40 (97.56%),2.0,63 (27.75%),0,
1613,protocol_document,https://catalogues.ema.europa.eu/sites/default...,VIPOS_Study Protocol,26,60917,8619,5.0,20.833333,5 (20.83%),5.0,12.195122,...,4.0,16.666667,4 (16.67%),5.0,12.195122,5 (12.20%),1.0,4 (15.38%),,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
108254,other_document_#1,https://catalogues.ema.europa.eu/sites/default...,CEIM_LEGIT_MC_EVCDAO_2019 Modificacion Favorab...,2,3200,420,,,,,,...,,,,,,,,,,
108260,protocol_document,https://catalogues.ema.europa.eu/sites/default...,LEGIT_COVIDX_EVCDAO_2022 Protocol Multipatholo...,34,51047,8146,,,,,,...,,,,,,,,34 (100.00%),,
108260,other_document_#1,https://catalogues.ema.europa.eu/sites/default...,CEIm_LEGIT_COVIDX_EVCDAO_2022_TRJON-8abc0f12d8...,2,3492,448,,,,,,...,,,,,,,,1 (50.00%),,
108481,protocol_document,https://catalogues.ema.europa.eu/sites/default...,MK-5592-141-00-v1-Protocol_final-redaction,61,110814,15065,23.0,95.833333,23 (95.83%),18.0,43.902439,...,23.0,95.833333,23 (95.83%),18.0,43.902439,18 (43.90%),2.0,61 (100.00%),,
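
The row-wise `apply` with `result_type='expand'` used above turns each returned dictionary into one row of columns. A minimal sketch on made-up data, using plain `apply` instead of `progress_apply` and a hypothetical `score` function, might look like this:

```python
from pathlib import Path

import pandas as pd

# Made-up input with the same index name as analysis_df
df = pd.DataFrame(
    {"path": ["a/EUPAS1_protocol_document.pdf", "a/EUPAS2_result_tables.pdf"]},
    index=pd.Index([1, 2], name="eu_pas_register_number"),
)


def score(row):
    """Return one dictionary per row; its keys become column names (hypothetical)."""
    stem = Path(row["path"]).stem
    return {"document_type": stem.partition("_")[-1], "path_length": len(row["path"])}


# Expand the dictionaries into columns, then move document_type into the index
expanded = (
    df.apply(score, axis=1, result_type="expand")
    .reset_index()
    .set_index(["eu_pas_register_number", "document_type"])
)

print(expanded)
```

The two-level index is what later allows the exported sheet to be reloaded with `index_col=[0,1]`.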


Next we will export and reload the generated results.

In [14]:
with pd.ExcelWriter('documents_analysis.xlsx') as writer:
    result_df.dropna(how='all').to_excel(writer, sheet_name='table')
    result_df.dropna(how='all').describe().to_excel(writer, sheet_name='description')

In [4]:
result_df = pd.read_excel(
    'documents_analysis.xlsx', 
    index_col=[0,1]
)

## Format / Finalise Table + Extra Analysis

In [5]:
extra_analysis_df = result_df.drop(
    [9953, 26001], # Drop 2 additional cancelled studies found after document classification 
    level=0,
    axis='index'
)[[
    # 'URL', 
    'pdf_name',
    'pages'
]].rename_axis(
    ['eu_pas_register_number', 'uploaded_document_type']
)

We will calculate the median (IQR) page counts for PAS with due results that have an abstract only or a final study report, respectively.

In [6]:
variables_due_result = pd.read_excel(
    '../../output/ema_rwd/ema_rwd_final_statistics_variables.xlsx', 
    sheet_name='due_result', 
    index_col=0
)

outcomes = pd.read_excel(
    '../study_documents/merge_classifications/outcomes_manual_individual.xlsx',
    index_col=[0,4]
)[['has_abstract_only_manual', 'has_final_study_report_manual']]

due_result_with_result = variables_due_result.index.intersection(outcomes.index.get_level_values(0))

In [7]:
page_analysis_df = outcomes.loc[due_result_with_result, :, :].merge(
    extra_analysis_df, left_index=True, right_index=True
)

abstract_pages_df = page_analysis_df[page_analysis_df['has_abstract_only_manual']]
_ , abstract_bins = pd.qcut(abstract_pages_df['pages'], 4, retbins=True)

final_report_pages_df = page_analysis_df[page_analysis_df['has_final_study_report_manual']]
_ , final_report_bins = pd.qcut(final_report_pages_df['pages'], 4, retbins=True)

display(
    'Median page counts for abstracts',
    f'{abstract_bins[2]:.0f} ({abstract_bins[1]:.0f} - {abstract_bins[3]:.0f})',
    'Median page counts for final study reports',
    f'{final_report_bins[2]:.0f} ({final_report_bins[1]:.0f} - {final_report_bins[3]:.0f})'
)

'Median page counts for abstracts'

'5 (3 - 8)'

'Median page counts for final study reports'

'77 (44 - 132)'
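
The bin edges returned by `pd.qcut(..., 4, retbins=True)` are the quartiles, so edges 1 to 3 give Q1, the median and Q3. As a cross-check, the same values can be obtained with `Series.quantile`. The sketch below uses made-up page counts, not the study data:

```python
import numpy as np
import pandas as pd

# Made-up page counts, not taken from the study data
pages = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Quartile edges from qcut: [min, Q1, median, Q3, max]
_, bins = pd.qcut(pages, 4, retbins=True)

# The same quartiles computed directly
q1, median, q3 = pages.quantile([0.25, 0.5, 0.75])

summary = f"{median:.0f} ({q1:.0f} - {q3:.0f})"

print(bins)
print(summary)
```

One practical difference: `pd.qcut` raises a `ValueError` when the quartile edges are not unique (unless `duplicates='drop'` is passed), whereas `Series.quantile` has no such restriction.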

We will now format and finalise the table for publication:

In [7]:
document_type_map = {
    'other_document': 'Study, other information',
    'protocol_document': 'Protocol document',
    'result_document': 'Study report',
    'result_tables': 'Results tables'
}

In [8]:
formated_result_df = extra_analysis_df.reset_index().assign(
    uploaded_document_type = lambda df: df['uploaded_document_type'].str.replace(r'_#\d+', '', regex=True).map(document_type_map)
).set_index(
    ['eu_pas_register_number', 'uploaded_document_type']
).rename(
    columns={
        'pdf_name': 'Document Name',
        'pages': 'Page count',
        'words': 'Word count'
    }
).rename_axis(
    ['EU PAS Register Number', 'Upload Section']
).sort_index(axis='index')

formated_result_df

Unnamed: 0_level_0,Unnamed: 1_level_0,Document Name,Page count
EU PAS Register Number,Upload Section,Unnamed: 2_level_1,Unnamed: 3_level_1
1587,Protocol document,Tesis Maria Jose Alcala,164
1591,Results tables,Report_Rosiglitazone_use,14
1597,Protocol document,Protocol INAS-FOCUS,29
1597,Results tables,IFOC_FinalStudyReport_Public Version 20200819,227
1613,Protocol document,VIPOS_Study Protocol,26
...,...,...,...
108254,"Study, other information",CEIM_LEGIT_MC_EVCDAO_2019 Modificacion Favorab...,2
108260,Protocol document,LEGIT_COVIDX_EVCDAO_2022 Protocol Multipatholo...,34
108260,"Study, other information",CEIm_LEGIT_COVIDX_EVCDAO_2022_TRJON-8abc0f12d8...,2
108481,Protocol document,MK-5592-141-00-v1-Protocol_final-redaction,61
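
The relabelling step strips the `_#<n>` suffix and then maps the remaining key to a display label. A small sketch with made-up values shows the behaviour, including what happens to a key that is missing from the map (the value `unknown_type` is hypothetical):

```python
import pandas as pd

document_type_map = {
    "other_document": "Study, other information",
    "protocol_document": "Protocol document",
    "result_document": "Study report",
    "result_tables": "Results tables",
}

# Made-up raw document types; "unknown_type" is deliberately not in the map
raw = pd.Series(["other_document_#1", "protocol_document", "result_tables", "unknown_type"])

# Remove the "_#<n>" suffix, then translate to the display label
labels = raw.str.replace(r"_#\d+", "", regex=True).map(document_type_map)

# Keys without a mapping become NaN and can be listed for inspection
unmapped = raw[labels.isna()]

print(labels.tolist())
print(unmapped.tolist())
```

Checking `unmapped` before exporting is a cheap way to notice document types that would otherwise end up with an empty label.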


In [9]:
formated_result_df.to_excel('documents_analysis_formated.xlsx')

## Experiments

Testing the extraction of other PDF document features:

In [56]:
def test(series):
    path = series['path']
    document_name = series['name']

    pdf_path = Path(BASE_PATH / path)

    # index = int(pdf_path.stem.split('_')[0][5:])
    document_type = pdf_path.stem.partition('_')[-1]
    
    # Compile regex patterns for report type key words
    report_type_patterns = [
        re.compile(r'\bfinal\s+study\s+report\b', re.IGNORECASE)
    ]

    with pymupdf.open(pdf_path) as doc:
        text_pages = [page.get_textpage() for page in doc]

        # Extract texts, words and characters
        texts = [text_page.extractText() for text_page in text_pages]

        search_range = texts[:min(9, len(doc))]
        report_type_page_number = []
        for i, text in enumerate(search_range):
            if any(pattern.search(text) for pattern in report_type_patterns):
                report_type_page_number.append(i + 1)

        return {
            'document_type': document_type,
            'URL': series['url'],
            'pdf_name': document_name,
            'pages': len(doc),
            'report_page_number': '; '.join(map(str, report_type_page_number)) if report_type_page_number else pd.NA
        }
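
The keyword search in `test` can be illustrated without opening a PDF. The sketch below runs the same pattern over made-up page texts and reports 1-based page numbers, limited to the first nine pages:

```python
import re

pattern = re.compile(r"\bfinal\s+study\s+report\b", re.IGNORECASE)

# Made-up page texts; the first one splits the phrase across a line break
texts = [
    "Title page\nFINAL STUDY\nREPORT",
    "Table of contents",
    "This final study report describes the results.",
]

# Search only the first nine pages and report 1-based page numbers
hits = [i + 1 for i, text in enumerate(texts[:9]) if pattern.search(text)]

report_page_number = "; ".join(map(str, hits)) if hits else None

print(report_page_number)
```

Since `\s+` also matches newlines, the phrase is found even when the title is wrapped over several lines.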

Applying `test` to `analysis_df`: 

In [None]:
tqdm.pandas(
    desc = test.__name__,
    total = len(analysis_df[:10]),
    unit = 'studies'
)

analysis_df[:10].progress_apply(test, axis=1, result_type='expand').reset_index().set_index(['eu_pas_register_number', 'document_type'])

test:   0%|          | 0/10 [00:00<?, ?it/s]

Unnamed: 0_level_0,Unnamed: 1_level_0,URL,pdf_name,pages,report_page_number
eu_pas_register_number,document_type,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
1587,protocol_document,https://catalogues.ema.europa.eu/sites/default...,Tesis Maria Jose Alcala,164,
1591,result_tables,https://catalogues.ema.europa.eu/sites/default...,Report_Rosiglitazone_use,14,
1597,protocol_document,https://catalogues.ema.europa.eu/sites/default...,Protocol INAS-FOCUS,29,
1597,result_tables,https://catalogues.ema.europa.eu/sites/default...,IFOC_FinalStudyReport_Public Version 20200819,227,1.0
1613,protocol_document,https://catalogues.ema.europa.eu/sites/default...,VIPOS_Study Protocol,26,
1613,result_tables,https://catalogues.ema.europa.eu/sites/default...,INAS-VIPOS_FinalStudyReport_PublicVersion,132,1.0
1705,result_tables,https://catalogues.ema.europa.eu/sites/default...,EMA H1N1 Jan 2013_EMA,13,
1705,result_document,https://catalogues.ema.europa.eu/sites/default...,Charlton_DeVries_Final report data sources for...,50,
1777,protocol_document,https://catalogues.ema.europa.eu/sites/default...,Protocol_ENcEPP_Submitted,45,
1777,result_tables,https://catalogues.ema.europa.eu/sites/default...,Rosiglitazone_executive_summary_for_ENCePP,1,
