***Author: Ritwika VPS, ritwika@ucmerced.edu***  
***Written: February 2025***  
###### (see below for modifications log as applicable)  

This script extracts specific details from paper PDFs for the DARCLE paper's review component. As of now, the extracted categories are: recording device, segmentation software, countries studied, languages studied, paper title, whether manual coding has been done, mentions of {ACLEW, Homebank, etc.}, and mentions of specific topics.

- Recording device keywords (case insensitive): LENA, Camera, Recorder/Audio recorder/voice recorder/microphone (uses the same list as Laudanska et al., 2025, Neuroscience & Biobehavioral Reviews)
- Software keywords (case insensitive): LENA, Praat, OpenSMILE, ALICE, Speech detection algorithm/Speech detection algorithms (uses the same list as Laudanska et al., 2025, Neuroscience & Biobehavioral Reviews)
- Manual annotation keywords: human annota*, human validation, human listener, human cod*, human label*, manual annot*, manual validation, manual cod*, manual label*, manually annot*, manually label*, annot*, hand annot*, hand-cod*, hand cod*, ground truth, manual*
- ACLEW, etc keywords (case insensitive): DARCLE, ACLEW, Homebank, Talkbank, ALICE, VTC/Voice Type Classifier, LangView, Many Paths to Language/MPaL
- Topics: 
    - CDSvADS: child directed speech/input, adult directed speech/input, CDS, ADS, child-directed speech/input, adult-directed speech/input, infant directed speech/input, infant-directed speech/input, IDS
    - vocal maturity: maturity, vocal maturity, speech maturity
    - Multilingualism: bi-lingual*, multi-lingual*, bilingual*, multilingual*, tri-lingual*, trilingual*
    - communication disorders: ASD, Autism spectrum disorder, hearing loss, deaf*, hearing impair*, cerebral palsy, Down syndrome, cleft palate, speech-language pathology, speech language pathology, speech-language therapy, speech language therapy, speech therapy, SLP, language disorder*, speech disorder*, Fragile X Syndrome, Angelman Syndrome, Phelan-McDermid Syndrome, Phelan McDermid Syndrome, Developmental Language Disorder, DLD, Specific Language Impairment, SLI, HOH, hard of hearing, speech therapy, speech-language services, speech-language intervention, speech language services, speech language intervention, phonological disorder, phonological impairment, speech impair*, language difficulties, speech-language impair*, speech language impair*, speech difficulties, speech sound disorder, articulation disorder, speech sound impairment*, speech sound difficulties, articulation impairment*, articulation difficulties, language impair*, language-impair*, language impair*, speech-sound disorder, speech sound disorder, speech-sound impairment, speech sound impairment, speech-sound difficulties, speech sound difficulties, communication impairment, developmental disabilities, neurodevelopmental disorder, word finding difficulties, word-finding difficulties, Specific-language-impaired, Specific language impaired, developmental disorder of language and communication
    - SES: SES, socioeconomic status, socio-economic status, socioeconomically disadvantaged, socio-economically disadvantaged, low income, low-income, socio-economically diverse, socioeconomically diverse 
    - school programs: school, classroom, class-room, teacher, student, preschool, pre-school
    - Multi-modal: GPS, accelerometer, eye-track*, eye track*, EEG 
    - stress/anxiety/emotion:
    - cries/non-cries OR crying: cries, cry, non-cries, non cries, non-cry, non cry, crying
    - NICU: neonatal intensive care unit, preterm, pre-term, premature
- Countries and languages studied: (uses GeoText and pycountry)
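
The starred entries in the keyword lists above (e.g., `hand cod*`) read as prefix wildcards. As a minimal sketch of how such an entry could be turned into a case-insensitive regular expression, assuming `*` stands for "zero or more word characters" and that a space in the keyword may also appear as a hyphen in the text (the helper name `KeywordToPattern` is hypothetical and is not part of this script):

```python
import re

def KeywordToPattern(Keyword):
    # Hypothetical helper (not defined elsewhere in this script)
    HasWildcard = Keyword.endswith('*')           # a trailing '*' marks a prefix wildcard
    Words = Keyword.rstrip('*').split()           # split multi-word keywords at white space
    Body = r'[\s-]+'.join(re.escape(w) for w in Words)  # words may be separated by spaces or hyphens
    Suffix = r'\w*' if HasWildcard else ''        # assumption: '*' means zero or more word characters
    return re.compile(r'\b' + Body + Suffix + r'\b', re.IGNORECASE)

HandCodPattern = KeywordToPattern('hand cod*')
SESPattern = KeywordToPattern('SES')
```

The word boundaries (`\b`) keep short keywords such as `SES` from matching inside longer words such as "processes".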

Notes:
- Paper title, author names, year, and journal are present in the Covidence exported .csv. As such, I am only extracting paper title from the table of contents so I can match the details against the covidence csv. However, note that some paper titles are not extracted properly, so these will need to be matched based on the file name at a later point (which does get extracted with this script)
- Recorder device: extracted from METHODS ONLY, uses keywords (see keyword list above), outputs all sentences with keywords into a spreadsheet that will then need to be manually reviewed. If a paper does not seem to have any rec device sentences extracted, we'll have to manually go in.
- Segmentation software: extracted from METHODS ONLY, uses keywords (see keyword list above), outputs all sentences with keywords into a spreadsheet that will then need to be manually reviewed. Note that we'll have to manually identify items that constitute 'Other', and if the 'speech processing algorithm' text is not part of Methods, we'll also have to manually complement that item. As in, if a custom algorithm was written (but is not explicitly described as a speech detection algorithm(s)) or a less common but named algorithm is used, this code will not capture that. However, I am fairly confident that this should only be a few cases (see Laudanska et al.'s results). Similarly, if a paper does not seem to have any software sentences extracted, we'll have to manually go in.
- Manual annotation: extracted from METHODS ONLY, uses keywords (see keyword list above), outputs all sentences with keywords into a spreadsheet that will then need to be manually reviewed. My hunch is that if no manual annotation text is extracted, it was not done? 
- Languages and countries (and US states): extracted from METHODS ONLY, uses specific python libraries (GeoText and pycountry), and are simply extracted from Methods and added to a spreadsheet. The extraction is more dynamic than the others, because I use the country lists extracted for each paper (+ a list of US states) to then extract sentences with mentions of the countries or US states.
    - Countries (and US states): I do also extract sentences containing countries and sentences containing the names of US states (because some US studies only mention the US state re: location) for further manual verification. Also note that some papers may mention US vs. United States, and this gets extracted as United States using the country code, but the list  
    - Languages: flagging is case sensitive as it is set up. While I do not think this is an issue for more common languages, there are also languages such as 'As' and 'To', and if the matcher is case insensitive, 'as' and 'to' will be matched, and we don't want that. At any rate, some of the languages in the language list will need to be manually deleted (see point about 'As' and 'to', for instance). Also note that because some of these more common words are extracted as languages, I have chosen not to extract sentences with language references.
    - It might be useful to determine if we want to just go with the extracted list or go in and check for languages/countries/states for cases where there is no extracted language or country list for a paper.  
- Mentions of ACLEW etc: extracted from ALL text except refs, keyword match-based extraction without further manual verification. Since these are very specific keywords, my hunch is that this should be sufficient. Some papers (such as some PNAS papers) do not have a titled References section; for these, the entire text is used (but these are very few). This is simply a YES or NO spreadsheet and is not set up for further manual review.
- Topics are extracted from Methods and Results combined. This is also a YES or NO spreadsheet and is not set up for further manual review
- Methods and Results text are extracted by identifying the start of the Methods Section and a reasonable guess of the next section (see code block where these regexp patterns are defined; MethodsStartPattern, etc) and similarly for the Results section
    - However, if methods and results sections cannot be reliably identified, extraction is attempted using pymupdf's table of contents by identifying text that has Methods (or other variants). If this also fails, that paper is not extracted, and will need to be manually extracted. Community/consortia details are extracted from every paper (unless it is not read in by pymupdf4llm)
- Extraction from tables might be hit or miss, so there's a degree of error here if some of this info is in a table. That broadly does not seem to be the case, however. But it might be worth quickly looking manually for tables in the main text, if we choose to do so (I think it might be ok to miss anything that's relegated to the SI).
- Paper title is extracted both from the table of contents (PaperTitle_ToC) and separately by pulling all of the largest text in the first page (PaperTitle_NoToC) 
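
To illustrate the note above about case-sensitive language flagging, here is a minimal sketch using a hand-written mini language list (hypothetical; the script itself draws language names from pycountry) and a made-up sentence. The function name `FlagLanguages` is also an assumption for illustration only.

```python
import re

# Hypothetical mini language list and toy sentence (not taken from any paper)
LanguageNames = ['English', 'Spanish', 'As', 'To']
Sentence = 'Families spoke English or Spanish to their infants, as reported by caregivers.'

def FlagLanguages(Text, Names, CaseSensitive=True):
    # Return the language names that occur in Text as whole words
    Flags = 0 if CaseSensitive else re.IGNORECASE
    return [n for n in Names if re.search(r'\b' + re.escape(n) + r'\b', Text, Flags)]

CaseSensitiveHits = FlagLanguages(Sentence, LanguageNames)
CaseInsensitiveHits = FlagLanguages(Sentence, LanguageNames, CaseSensitive=False)
```

With case-sensitive matching only 'English' and 'Spanish' are flagged here, whereas the case-insensitive variant also picks up 'as' and 'to'. A sentence that begins with "As" or "To" would still be flagged in the case-sensitive variant, which is why some manual clean-up of the language list remains necessary.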


In [2]:
""" 
Import packages
"""
import fitz #pymupdf
import pymupdf4llm #to read to markdown, etc
from pathlib import Path #for paths
#import pathlib
#import os
import math #to check if something is na
import numpy as np
import re #regular expressions
import spacy #nlp stuff
import pandas as pd #data frames
from flashgeotext.geotext import GeoText #for countries 
import pycountry #for languages

In [630]:
""" 
This function removes illegal characters (that throw IllegalCharacterError for writing to excel) from text

Input: the text to be processed for illegal chars (should be str)
Output: processed text
"""
def RemoveIllegalChars(Txt):
    if isinstance(Txt, str): #check if input is str
        # This regex removes control characters (\x00-\x1f) except tab (\x09), newline (\x0a), and carriage return (\x0d); it also removes DEL and everything else in \x7f-\xff, which includes accented Latin-1 letters (e.g., é)
        Txt = re.sub(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]', '', Txt)
    else:
        print('Input is not a string')

    return Txt
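
As a quick illustration of what the character class in `RemoveIllegalChars` strips, the sketch below applies the same regular expression to a made-up string. `ControlCharsOnly` is a hypothetical narrower alternative (control characters only) shown purely for comparison; it is not what the script uses.

```python
import re

SamePattern = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]')  # same class as in RemoveIllegalChars
ControlCharsOnly = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')      # hypothetical variant: control chars only

# Toy string containing é (\xe9), a NUL byte (\x00), ï (\xef), and a tab
Sample = 'caf\xe9\x00 na\xefve\ttab'

CleanedSame = SamePattern.sub('', Sample)           # accented letters are dropped along with the NUL byte
CleanedNarrow = ControlCharsOnly.sub('', Sample)    # only the NUL byte is dropped
```

Both variants keep the tab. The difference matters mainly for extracted sentences containing accented names or words, which lose those letters under the pattern used in the script.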

In [631]:
""" 
Takes in doc (opened through pymupdf) and extracts paper title by identifying and appending all of the largest text from page 1. Note that this may not work perfectly depending on 
typesetting but is a broadly good heuristic! 
"""
def GetPaperTitle_NoToC(doc):

    MaxSize = 0; TitleList = [] #Initialise
    Page1 = doc[0] #Get page 1

    BlocksFromPg1 = Page1.get_text("dict")["blocks"] #get text as blocks (which will allow us to parse through the nested dictionary structure)

    for block in BlocksFromPg1: #go through blocks
        if "lines" in block: #if a 'lines' key exists
            for line in block["lines"]: #go through lines
                for span in line["spans"]: #go through spans within lines
                    CurrTxtSize = span["size"] #get the size
                    if CurrTxtSize > MaxSize: #iteratively update MaxSize
                        MaxSize = CurrTxtSize

    for block in BlocksFromPg1: #repeat, but this time to extract max-sized text
        if "lines" in block:
            for line in block["lines"]:
                for span in line["spans"]:
                    CurrTxtSize = span["size"] #get size
                    CurrTxt = span["text"].strip() #get associated text
                    if CurrTxtSize == MaxSize: #if the text is at max size, add to the Title list
                        TitleList.append(CurrTxt)

    MergedTitle = ' '.join(TitleList) #merge into a full title

    return MergedTitle 
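
The largest-font heuristic can be tried without a PDF by hand-building a structure shaped like `Page1.get_text("dict")["blocks"]`. The sketch below is a condensed, single-expression version of the two loops above; all sizes and text in `ToyBlocks` are made up.

```python
# Toy stand-in for Page1.get_text("dict")["blocks"]; sizes and text are invented for illustration
ToyBlocks = [
    {"lines": [{"spans": [{"size": 9.0, "text": "Journal of Examples"}]}]},
    {"lines": [{"spans": [{"size": 18.0, "text": "Daylong recordings "}]},
               {"spans": [{"size": 18.0, "text": "in the home"}]}]},
    {"type": 1},  # e.g., an image block, which has no "lines" key
    {"lines": [{"spans": [{"size": 11.0, "text": "A. Author"}]}]},
]

# Flatten all text spans, skipping blocks without a "lines" key
Spans = [span for block in ToyBlocks if "lines" in block
         for line in block["lines"] for span in line["spans"]]

MaxSize = max(span["size"] for span in Spans)  # largest font size on the page
Title = ' '.join(span["text"].strip() for span in Spans if span["size"] == MaxSize)
```

If some other element on the first page (for instance, a journal name set in a larger font) is bigger than the title, this heuristic would return that text instead, which is one way the typesetting caveat above can show up.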

In [632]:
""" 
This function takes in markdown text (of the entire paper text) and removes the References from it.

Input: - MkDwnTxt: markdown text of the entire paper text
       - FileName: for error message

Output: NoRefsTxt: input text with References section removed (if there is no References as a titled section, the entire input text is simply passed as output text)
"""
def RemoveRefs(MkDwnTxt,FileName):

    #[...] is a character class, so # and * are treated as literal characters, \s is for white space, . is a literal period, and [...]* matches 0 or more of these characters
    #\b marks a word boundary, so the keyword must start at a word boundary (so that, e.g., aReferences is not matched)
    #\w* is for zero or more word characters
    RefsPattern = re.compile(r'^[#*\d\s.|]*'             # 0 or more symbols/digits/dots/spaces at start (for markdown's # and *, and when sections are numbered (e.g., 1.2. etc))
                             r'\b('                         #Group start: 
                             r'References\w*|Bibliograph\w*|Citation\w*|Work\w*\s*Cited'
                             r')\b'                         # Group end
                             r'[*#\s]*$',                   # Trailing symbols and spaces
                             re.IGNORECASE | re.MULTILINE)
    
    #Some PNAS articles do not have an explicit references section, ergo the if else condition
    MatchesIterator = list(RefsPattern.finditer(MkDwnTxt)) #finditer finds all instances of the RefsPattern and list() lists it
    if len(MatchesIterator) != 0: # if there is at least one element
        LastMatch = MatchesIterator[-1] #get the last instance
        LastMatch_st = LastMatch.start() #gets the start position in text of this last match
        NoRefsTxt = MkDwnTxt[:LastMatch_st] #extracts up till this last match
    else:
        NoRefsTxt = MkDwnTxt
        print(f'Refs section has not been found for {FileName}. This can happen for some PNAS articles. Entire input text passed as output')

    return NoRefsTxt
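
A small worked example of the reference-stripping logic, using the same pattern on a made-up markdown string (`ToyMarkdown` is purely illustrative). It shows that an inline mention of "references" is left alone, because the heading has to sit on a line of its own, and that the text is cut at the last matching heading.

```python
import re

# Same pattern as in RemoveRefs, written on one line
RefsPattern = re.compile(
    r'^[#*\d\s.|]*\b(References\w*|Bibliograph\w*|Citation\w*|Work\w*\s*Cited)\b[*#\s]*$',
    re.IGNORECASE | re.MULTILINE)

# Made-up markdown text
ToyMarkdown = (
    "# A toy paper\n"
    "See the references in the appendix for details.\n"
    "## 4. Discussion\n"
    "Some discussion text.\n"
    "**References**\n"
    "1. Someone et al. (2020).\n"
)

Matches = list(RefsPattern.finditer(ToyMarkdown))
# Cut at the start of the last matching heading; keep everything if no heading is found
NoRefsTxt = ToyMarkdown[:Matches[-1].start()] if Matches else ToyMarkdown
```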


In [633]:
""" 
This function identifies the start and end of a specified section using a start pattern (say, to identify the start of Methods) and an ending pattern (say, to identify the start
of a section that can reasonably be expected to follow, such as Results or Discussion or Acknowledgement) as well as the full document (markdown) text and returns the starting
and ending positions of the required section with respect to the full input text

Inputs:
    - StartPattern: pattern to id the start of the required section (an example shown to id the methods section. Note how we are being v precise with the various forms the Methods
                    section title can take)
                    e.g., MethodsStartPattern = re.compile(r'^[#*\d\s.]*'             # 0 or more symbols/digits/dots/spaces at start
                                                r'\b('                         # Group start
                                                r'Material\w*\s*(?:&|and)\s*Method\w*|' # Most specific first
                                                r'Experimental\s*Method\w*|'
                                                r'Methodolog\w*|'
                                                r'Method\w*'               # Most general last
                                                r')\b'                         # Group end
                                                r'[*#\s]*$',                   # Trailing symbols and spaces
                                                re.IGNORECASE | re.MULTILINE)
    - NextSecPattern: pattern to id the start of possible next sections. With the search for the start and end of the Methods section as an example, see how we are being 
                      exhaustive/comprehensive, re: which sections could possibly follow
                      MethodsNextSecPattern = re.compile(r'^[#*\d\s.]*'             # 0 or more symbols/digits/dots/spaces at start
                                              r'\b('                         #group start
                                              r'Result\w*|Discussion|Result\w*\s*(?:&|and)\s*Discussion|Experimental\s*Finding\w*|Finding\w*|'
                                              r'Reference\w*|Acknowledgment\w*|Conclusion\w*|Citation\w*|Work\w*\s*Cited|Bibliograph\w*'
                                              r')\b'                         # Group end
                                              r'[*#\s]*$',                   # Trailing symbols and spaces
                                              re.IGNORECASE | re.MULTILINE)

    - SectionIdStr: a string to identify the section (but only to print messages to the console)
    - FullDocTxt: the full markdown of the paper text

Outputs: 
    - StartPos: starting position of the required section wrt FullDocTxt
    - EndPos: ending position of the required section wrt FullDocTxt
""" 
def GetSectionStart_NoToc(StartPattern,NextSecPattern,SectionIdStr,FullDocTxt):

    #Initialise ops as nan in case we don't find the required strings, etc
    StartPos = np.nan; EndPos = np.nan
    
    #get iterators for the starting and ending pattern
    MatchIterator = list(StartPattern.finditer(FullDocTxt))
    NextSecMatchIterator = list(NextSecPattern.finditer(FullDocTxt))

    #If there is exactly one match for the required section (which should be the ideal case)
    if len(MatchIterator) == 1:
        #print(f'{SectionIdStr} section is {MatchIterator[0].group()}, starts at {MatchIterator[0].start()}')
        StartPos = MatchIterator[0].start() #get the start position

        #Now to find the ending (we do this by finding where the next section starts)
        for match in NextSecMatchIterator: #go through the list of potential next sections
            if match.start() > StartPos: #find the first start position out of the next sections that is *after* the start of the required section
                #(so, if the required section is Methods, we might expect the next section to be Results or Discussion or References, etc. But, for some papers, Methods can come
                #after the results and discussion, ergo this check)
                EndPos = match.start() #Get the start of the first next section as the ending pos of the current section
                #print(f'Next section is {match.group()}, starts at {match.start()}')
                break #Immediately break

    return StartPos, EndPos
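
To make the start/end logic concrete, here is a minimal sketch on a made-up markdown string. The start pattern is the one from the docstring above; the next-section pattern is a shortened version of it. Unlike the function, the sketch uses `None` rather than `np.nan` for "not found" so that it has no dependencies beyond `re`.

```python
import re

MethodsStartPattern = re.compile(
    r'^[#*\d\s.]*\b(Material\w*\s*(?:&|and)\s*Method\w*|Experimental\s*Method\w*|Methodolog\w*|Method\w*)\b[*#\s]*$',
    re.IGNORECASE | re.MULTILINE)
# Shortened list of possible next sections (assumption: enough for this toy text)
MethodsNextSecPattern = re.compile(
    r'^[#*\d\s.]*\b(Result\w*|Discussion|Reference\w*|Conclusion\w*)\b[*#\s]*$',
    re.IGNORECASE | re.MULTILINE)

# Made-up markdown text
ToyMarkdown = ("# Title\n"
               "## 1. Introduction\nIntro text.\n"
               "## 2. Methods\nWe used LENA recorders.\n"
               "## 3. Results\nResults text.\n"
               "## 4. Discussion\nDiscussion text.\n")

StartMatches = list(MethodsStartPattern.finditer(ToyMarkdown))
StartPos = StartMatches[0].start() if len(StartMatches) == 1 else None  # only accept exactly one match

EndPos = None
if StartPos is not None:
    # first candidate next-section heading that starts after the Methods heading
    EndPos = next((m.start() for m in MethodsNextSecPattern.finditer(ToyMarkdown)
                   if m.start() > StartPos), None)

MethodsTxt = ToyMarkdown[StartPos:EndPos] if StartPos is not None and EndPos is not None else ''
```

Requiring exactly one start match is conservative: a paper with, say, both a "Methods" and a "Statistical Methods" heading would fall through to the ToC-based fallback rather than risk picking the wrong one.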



In [634]:
"""
This function takes in the document table of contents (ToC), the query string for the section title we are looking for (ReqSectionStr; e.g., METHOD for Methods), 
and logical inputs for whether the ToC should be printed (printToc) and whether the output should be printed (printOp) and returns the start page number (ReqSecStPage), 
the level representing the indentation (ReqSecLvl) so this can be matched to the level of the next section to ensure these are the same, and the title (ReqSecTitle)
of the section being queried for; as well as the same items--starting page number, level, and title--of the next section (NextSecStPage, NextSecLvl, NextSecTitle)

NOTE that this is only used if extracting by identifying Methods and Results sections without ToC fails

Notes: 
- to extract pages, 1 must be subtracted from ReqSecStPage because pymupdf4llm indexes pages from 0 (whereas ToC page numbers start at 1). NextSecStPage does not need 1 subtracted from it
- ReqSectionStr MUST be in all upper case! 
"""
def GetReqSectionPgNos_wToC(ToC, ReqSectionStr,printToc,printOp):
    if ToC: #if table of contents exists (but this is a redundant check)
        DoesSectionExist = False; NextSectionFound = False #initialise flags
        ReqSectionInd = 0 #to get an index for the required section so that the next section can be checked to make sure that it occurs *after* the current section

        for level, title, pageNo in ToC: #go through the level (representing indentation), section title, and corresponding page num (this is the table of contents structure)
            ReqSectionInd = ReqSectionInd + 1
            if ReqSectionStr in title.upper(): #if the section text we are looking for is in the section title
                ReqSecStPage = pageNo; ReqSecLvl = level; ReqSecTitle = title #get page number, level, and title of the required section
                DoesSectionExist = True #toggle flag
                break
        
        SectionIndTracker = 0 #to keep track of the section index to make sure that the next section is indeed *after* the current section
        if DoesSectionExist: #if required section exists, we want to find the next section
            for level, title, pageNo in ToC:
                SectionIndTracker = SectionIndTracker + 1
                if level == ReqSecLvl and pageNo >= ReqSecStPage and title != ReqSecTitle and SectionIndTracker > ReqSectionInd: # the section level has to be the same, 
                    #the page number should be at least equal to page num of the required section, the section title of the next section should not be the same as that of 
                    # the required section, and the section itself should have an index greater than the required section
                    NextSecStPage = pageNo; NextSecLvl = level; NextSecTitle = title
                    NextSectionFound = True #toggle flag
                    break
        else: #if queried section does not exist
            #print(f"{ReqSectionStr} section not found") #print message
            ReqSecStPage = np.nan; ReqSecLvl = np.nan; ReqSecTitle = "" #set outputs
            NextSecStPage = np.nan; NextSecLvl = np.nan; NextSecTitle = ""

        if not NextSectionFound: #if the next section is not found (this is redundant, but still)
            #print('Next section not found')
            NextSecStPage = np.nan; NextSecLvl = np.nan; NextSecTitle = ""
        
        if printToc: #if ToC printing is true
            for entry in ToC:
                print(entry)        
        
    else: #if no ToC
        #print('No ToC')
        ReqSecStPage = np.nan; ReqSecLvl = np.nan; ReqSecTitle = ""
        NextSecStPage = np.nan; NextSecLvl = np.nan; NextSecTitle = ""

    if printOp: #if output printing is true
        print(ReqSecStPage, ReqSecLvl, ReqSecTitle)
        print(NextSecStPage, NextSecLvl, NextSecTitle)
    
    return ReqSecStPage, ReqSecLvl, ReqSecTitle, NextSecStPage, NextSecLvl, NextSecTitle
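
The ToC lookup can be illustrated with a hand-written table of contents in the `[level, title, page]` form. The sketch below is a condensed version of the logic above, not a replacement for it; `FindSectionAndNext` and `ToyToC` are hypothetical names, and the entries are made up.

```python
# Toy table of contents in [level, title, page] form (page numbers are 1-based); entries are invented
ToyToC = [
    [1, 'Introduction', 1],
    [1, 'Materials and Methods', 3],
    [2, 'Participants', 3],
    [2, 'Recording procedure', 4],
    [1, 'Results', 6],
    [1, 'Discussion', 9],
]

def FindSectionAndNext(ToC, ReqSectionStr):
    # Hypothetical condensed lookup: returns (required entry, next same-level entry); None where not found
    for Ind, (Level, Title, PageNo) in enumerate(ToC):
        if ReqSectionStr in Title.upper():  # ReqSectionStr must be upper case
            for NextLevel, NextTitle, NextPageNo in ToC[Ind + 1:]:  # only look at later entries
                if NextLevel == Level and NextPageNo >= PageNo and NextTitle != Title:
                    return (Level, Title, PageNo), (NextLevel, NextTitle, NextPageNo)
            return (Level, Title, PageNo), None
    return None, None

ReqSec, NextSec = FindSectionAndNext(ToyToC, 'METHOD')

# 0-based page indices spanning the section: subtract 1 from the start page only
PageInds = list(range(ReqSec[2] - 1, NextSec[2]))
```

Because the end of the range is exclusive, leaving the next section's start page as-is means the page on which the next section begins is still included, which is useful when the required section ends partway down that page.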

In [635]:
""" 
This function takes in spacy-processed markdown text, takes a list of lists containing keywords and associated regexp patterns, and appends sentences that contain keywords to 
an appropriately structured dataframe (for extracting recorder details, software details, manual annotation details ONLY). 

Spacy-processed markdown text is used so that it can be broken into sentences, etc, which I suspect can also be done by splitting at '.', but that would also interfere
with citations with et al. etc

NOTE: this function is only used for extracting recorder details, software details, manual annotation details

Inputs:
    - SpacyTxt: spacy-processed markdown text
    - KeywordPatts_LoL: list of lists containing keywords and corresponding regexp patterns 
                        e.g.: [['LENA','Camera','RecOrMic'],                #keywords (used for output df colnames)
                              [r'\bLENA\b',r'\bcamera\b',r'\bmicrophone\b|\bvoice recorder\b|\brecorder\b']] #corresponding keyword patterns
    - FullOpDf: the dataframe the output is appended to
    - PaperTitle_ToC, _NoToC: title of the paper being processed (using pymupdf's table of contents, for _ToC; and by id'ing largest txt on the 1st page, for _NoToC) 
                            (goes into the dataframe under the 'PaperTitle_ToC, _NoToC' cols)
    - FileName: the file name that the paper is saved under (goes into the dataframe under the 'FileName' col)

Outputs:
    - DictForDf: the op dictionary for the input paper that gets converted into a df and then gets appended to the input FullOpDf
    - CurrOpDf: the op df for the input paper that gets converted from DictForDf and then gets appended to the input FullOpDf
    - FullOpDf: the input FullOpDf with the details extracted from the current paper appended to it (so it is cumulative across papers)
"""
def GetMatchedSentences(SpacyTxt, KeywordPatts_LoL, FullOpDf, PaperTitle_ToC, PaperTitle_NoToC, FileName):
    
    #--1. Initialising and set up---------------------------------------------------------------------------------------------------------
    #get the list of keywords and corresponding patterns
    Keywords = KeywordPatts_LoL[0]; Patterns = KeywordPatts_LoL[1]

    # Initialize a list of lists to hold sentences for each pattern. This is the equivalent of: MatchSentenceCell = cell(1, numel(Patterns))
    MatchedSentence_LoL = [[] for _ in range(len(Patterns))]

    #--2. Keyword flagging and extraction---------------------------------------------------------------------------------------------------------
    #keyword flagging in the spacy-processed text by going sentence by sentence and by looping through the keywords
    for CurrSentence in SpacyTxt.sents: #go through sentence by sentence
        sentence_text = CurrSentence.text.replace("\n", " ").strip() #.replace("\n", " ") swaps line breaks for a single space
        #and strip() removes all leading and trailing whitespace from the string from the text (.text) of CurrSentence

        #Collapse multiple spaces into one: .split(), since there are no splitters provided, will split the sentence into words, using any white space (tabs, newlines, multiple spaces)
        # as the splitting delimiter. Then, the join stitches them back together with only a single space
        sentence_text = " ".join(sentence_text.split())
        
        #Skip very short "sentences" (often noise or page numbers)
        if len(sentence_text) < 15:
            #print(sentence_text)
            continue #ends the current iteration of the for loop
        
        for keyword_ind, pattern in enumerate(Patterns): #go through list of patterns, by using enumerate, we also get an index for the pattern
            if re.search(pattern, sentence_text, re.IGNORECASE): #regexp search
                sentence_text = RemoveIllegalChars(sentence_text) #Remove any illegal characters (see user defined fn above)
                MatchedSentence_LoL[keyword_ind].append(sentence_text) #add the matched sentence to the corresponding list in the matched sentence list of lists

    #--3. Putting together op df---------------------------------------------------------------------------------------------------------
    #initialise the dictionary that we will convert to op df
    DictForDf = {"PaperTitle_ToC": PaperTitle_ToC,
                 "PaperTitle_NoToC": PaperTitle_NoToC,
                 "FileName": FileName}
    
    ColNames = [kywrd_i + "_sents" for kywrd_i in Keywords] #get columns names for op df by adding '_sents' (for sentences)

    #Add the dynamic matched sentence blocks to the dict
    for i in range(len(ColNames)):
        #Join each list of sentences into a single text block, giving dictionary entries of the form: {'Keyword1_sents': 'Sent1; Sent2', 'Keyword2_sents': 'Sent3', ...}
        DictForDf[ColNames[i]] = "; ".join(MatchedSentence_LoL[i]) #the sentences in MatchedSentence_LoL are joined with '; ' separating them

    CurrOpDf = pd.DataFrame([DictForDf]) #This dict can then be converted into the output df, with the dict keys as colnames. Note that this will only have the extracted details from 
    #the current paper (given by SpacyTxt)

    FullOpDf = pd.concat([FullOpDf,CurrOpDf], ignore_index=True) #concat with the full df that stores all the paper extracted info, with a row for each paper

    return DictForDf, CurrOpDf, FullOpDf 
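
The core of the sentence matching can be sketched without spaCy or pandas by using a hand-split list of made-up sentences in place of `SpacyTxt.sents`. The keyword/pattern list is the example from the docstring above; `ToySentences` and `RowDict` are illustrative names only.

```python
import re

KeywordPatts_LoL = [['LENA', 'Camera', 'RecOrMic'],
                    [r'\bLENA\b', r'\bcamera\b', r'\bmicrophone\b|\bvoice recorder\b|\brecorder\b']]

# Hand-split toy sentences standing in for SpacyTxt.sents
ToySentences = ['Infants wore a LENA recorder for one full day.',
                'Sessions were also filmed with a   video camera.',
                'p. 12',
                'The LENA software produced automated counts.']

Keywords, Patterns = KeywordPatts_LoL

# Collapse white space, then drop very short "sentences" (fewer than 15 characters)
CleanSentences = [" ".join(s.split()) for s in ToySentences]
CleanSentences = [s for s in CleanSentences if len(s) >= 15]

# One '<keyword>_sents' entry per keyword, with matching sentences joined by '; '
RowDict = {Keyword + '_sents': "; ".join(s for s in CleanSentences if re.search(Pattern, s, re.IGNORECASE))
           for Keyword, Pattern in zip(Keywords, Patterns)}
```

Note that one sentence can land in several columns (the first toy sentence matches both the LENA and the recorder patterns), so the same text may need to be reviewed under more than one heading.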

In [636]:
""" 
This function takes in spacy-processed markdown text, takes a list of lists containing keywords and associated regexp patterns, and flags whether each keyword is mentioned, in 
an appropriately structured dataframe **SPECIFICALLY** for Topics and Communities (ACLEW, etc) extraction. This ideally can be merged with the previous general function, but 
it's going to take some refactoring that I do not want to do. The key difference is that if there is a mention of a keyword (per the specified pattern), the search flags YES for
that keyword, breaks the search loop, and goes to the next keyword. Essentially, the output is a YES/NO dataframe and NO sentences are extracted. 

NOTE: this function is only used for extracting Topics and Communities details, as a binary YES/NO df

Spacy-processed markdown text is used so that it can be broken into sentences, etc, which I suspect can also be done by splitting at '.', but that would also interfere
with citations with et al. etc

Inputs:
    - SpacyTxt: spacy-processed markdown text
    - KeywordPatts_LoL: list of lists containing keywords and corresponding regexp patterns (ONLY for topics and communities extraction)
                        e.g.: KeywordPatts_Community_LoL = [['DARCLE', 'ACLEW', 'Homebank', 'Talkbank', 'ALICE', 'VTC', 'LangView', 'MPaL','ManyBabies'],      #keywords
                                                            [r'\bDARCLE\b', r'\bACLEW\b', r'\bHomebank\b', r'\bTalkbank\b', r'\bALICE\b',                 #corresponding patterns
                                                            r'\bVTC\b|\bVoice Type Classifier\b', r'\bLangView\b', r'\bMany Paths to Language\b|\bMPaL\b',r'\bManyBabies\b']]
    - FullOpDf: the dataframe the output is appended to
    - PaperTitle_ToC, _NoToC: title of the paper being processed (using pymupdf's table of contents, for _ToC; and by id'ing largest txt on the 1st page, for _NoToC) 
                            (goes into the dataframe under the 'PaperTitle_ToC, _NoToC' cols)
    - FileName: the file name that the paper is saved under (goes into the dataframe under the 'FileName' col)

Outputs:
    - DictForDf: the op dictionary for the input paper that gets converted into a df and then gets appended to the input FullOpDf
    - CurrOpDf: the op df for the input paper that gets converted from DictForDf and then gets appended to the input FullOpDf
    - FullOpDf: the input FullOpDf with the details extracted from the current paper appended to it (so it is cumulative across papers)
"""
def GetOpDf_TopicsCommunities(SpacyTxt, KeywordPatts_LoL, FullOpDf, PaperTitle_ToC, PaperTitle_NoToC, FileName):
    
    #--1. Initialising and set up---------------------------------------------------------------------------------------------------------
    #get the list of keywords and corresponding patterns
    Keywords = KeywordPatts_LoL[0]; Patterns = KeywordPatts_LoL[1]

    #Initialize a flat list holding one YES/NO flag per pattern (rather than a list of lists of sentences, as in GetMatchedSentences)
    #Every flag is 'NO' by default
    MatchFound_List = ['NO' for _ in range(len(Patterns))]

    #--2. Keyword flagging and extraction---------------------------------------------------------------------------------------------------------
    #keyword flagging in the spacy-processed text by going sentence by sentence and by looping through the keywords
    #NOTE we are choosing to loop through keywords first because we are only flagging for presence and as such, can stop once we hit the first instance
    for keyword_ind, pattern in enumerate(Patterns): #go through list of patterns, by using enumerate, we also get an index for the pattern
        for CurrSentence in SpacyTxt.sents: #go through sentence by sentence
            sentence_text = CurrSentence.text.replace("\n", " ").strip() #.replace("\n", " ") swaps line breaks for a single space
            #and strip() removes all leading and trailing whitespace from the string from the text (.text) of CurrSentence

            #Collapse multiple spaces into one: .split(), since there are no splitters provided, will split the sentence into words, using any white space (tabs, newlines, multiple spaces)
            # as the splitting delimiter. Then, the join stitches them back together with only a single space
            sentence_text = " ".join(sentence_text.split())
    
            #Skip very short "sentences" (often noise or page numbers)
            if len(sentence_text) < 15:
                #print(sentence_text)
                continue #ends the current iteration of the for loop

            if re.search(pattern, sentence_text, re.IGNORECASE): #regexp search
                MatchFound_List[keyword_ind] = 'YES'
                break #if a mention is found, BREAK, and then go to the next keyword (because this will break out of the CurrSentence for loop)

    #--3. Putting together op df---------------------------------------------------------------------------------------------------------
    #initialise the dictionary that we will convert to op df
    DictForDf = {"PaperTitle_ToC": PaperTitle_ToC,
                 "PaperTitle_NoToC": PaperTitle_NoToC,
                 "FileName": FileName}

    ColNames = Keywords #get columns names for op df
    
    #Add the YES/NO flag for each keyword to the dict
    for i in range(len(ColNames)):
        #This builds a dictionary of the form: {'Keyword1': 'YES', 'Keyword2': 'NO', ...}
        DictForDf[ColNames[i]] = MatchFound_List[i] #the flag for the i-th keyword goes under the i-th column name
    
    CurrOpDf = pd.DataFrame([DictForDf]) #This dict can then be converted into the output df, with the dict keys as colnames. Note that this will only have the extracted details from 
    #the current paper (given by SpacyTxt)

    FullOpDf = pd.concat([FullOpDf,CurrOpDf], ignore_index=True) #concat with the full df that stores all the paper extracted info, with a row for each paper
    #ignore_index=True so that the original row index from each concatenated df is discarded and the rows of the result are renumbered 0, 1, 2, ...

    return DictForDf, CurrOpDf, FullOpDf
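
The presence flagging above can be sketched without spaCy or pandas. This is a minimal illustration, not the notebook's implementation: `flag_keywords` is a hypothetical helper, and the sentences are assumed to arrive as a plain list of strings rather than as `SpacyTxt.sents`. It cleans each sentence once, drops very short fragments, and sets one YES/NO flag per pattern.

```python
import re

def flag_keywords(sentences, patterns, min_len=15):
    """Return one 'YES'/'NO' flag per pattern (hypothetical helper for illustration)."""
    flags = ['NO'] * len(patterns)
    # Collapse all whitespace runs (newlines, tabs, repeated spaces) into single spaces
    cleaned = [" ".join(s.split()) for s in sentences]
    # Drop very short fragments, which are often page numbers or stray headers
    cleaned = [s for s in cleaned if len(s) >= min_len]
    for ind, pattern in enumerate(patterns):
        # any() stops at the first matching sentence, like the break in the loop above
        if any(re.search(pattern, s, re.IGNORECASE) for s in cleaned):
            flags[ind] = 'YES'
    return flags

example_sentences = [
    "Recordings were made with\nthe LENA   device.",
    "Page 3",
    "Bilingual families were recruited from two cities.",
]
example_patterns = [r'\bLENA\b', r'\b(bi|multi|tri)[\s-]*lingual\w*\b', r'\bNICU\b']
example_flags = flag_keywords(example_sentences, example_patterns)
print(example_flags)
```

Cleaning the sentences once, before the keyword loop, avoids repeating the same normalisation for every pattern; whether that matters in practice depends on how many papers and patterns are processed.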

In [637]:
""" 
This function takes in spacy-processed markdown text, a list of countries, and a list of US states, and appends sentences that contain those names to
a list of lists **SPECIFICALLY** for Countries and US states extraction. This ideally can also be merged with the previous general function, but it's going to take some
refactoring that I do not want to do.

Specifically, this function extracts sentences where countries or US states are mentioned and collects them so that the calling function can concatenate them and pack them
into the OpDf. The key difference between this function and the previous general function is that the keyword patterns are dynamic: the inputs are a list of countries
mentioned in the paper's Methods text, and a list of all 50 states in the US. The corresponding keyword pattern for the country list (and separately, the state list) is constituted
within the function, so the country keyword pattern is built for each paper separately. The rest of the sentence matching works as in the previous functions, except
that the output is a list of lists, where the first list has the sentences with country mentions, and the second list has the sentences with state mentions. There is no sentence
extraction for languages (see markdown text)

NOTE: this function is only used as part of the countries and languages extraction

Spacy-processed markdown text is used so that it can be broken into sentences, etc, which I suspect can also be done by splitting at '.', but that would also interfere
with citations with et al. etc

Inputs:
    - SpacyTxt: spacy-processed markdown text
    - Countries_List, StatesUS_List: list of countries mentioned in the Methods, and list of US states. Both of these are separately used to make relevant keyword patterns
                                     to then search for sentences with mentions, which then gets packed up into the outputs

Outputs:
    - MatchedSentence_LoL: list of lists containing matched sentences with all country names in the paper's methods text (based on input Countries_List), and matched sentences
                           containing all mentions of US states (based on StatesUS_List input). The order of MatchedSentence_LoL is [[countries_sentences],[US_states_sentences]]
"""
def GetMatchedSentences_CountriesEtc(SpacyTxt, Countries_List, StatesUS_List):

    #--1. Initialising and set up---------------------------------------------------------------------------------------------------------
    # Initialize a list of lists to hold sentences for each pattern. 
    MatchedSentence_LoL = [[] for _ in range(2)] #for countries and states_US

    Patterns_LoL = [Countries_List, StatesUS_List] #create list of lists for the patterns. 
    #Note that these are simply lists (e.g., [United States, United Kingdom], etc) that are then plopped into a list of lists

    #--2. Keyword flagging and extraction---------------------------------------------------------------------------------------------------------
    #keyword flagging in the spacy-processed text by going sentence by sentence and by looping through the keywords
    for CurrSentence in SpacyTxt.sents: #go through sentence by sentence
        sentence_text = CurrSentence.text.replace("\n", " ").strip() #.replace("\n", " ") swaps line breaks for a single space
        #and strip() removes all leading and trailing whitespace from the string from the text (.text) of CurrSentence

        #Collapse multiple spaces into one: .split(), since there are no splitters provided, will split the sentence into words, using any white space (tabs, newlines, multiple spaces)
        # as the splitting delimiter. Then, the join stitches them back together with only a single space
        sentence_text = " ".join(sentence_text.split())
        
        #Skip very short "sentences" (often noise or page numbers)
        if len(sentence_text) < 15:
            #print(sentence_text)
            continue #ends the current iteration of the for loop
        
        for keyword_ind, pattern in enumerate(Patterns_LoL): #go through list of patterns, by using enumerate, we also get an index for the pattern. So, for keyword_ind = 0,
            #we pull out the country list, and so on
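            if pattern: #if the list is not empty (an empty list would give an empty pattern, which matches all text; because we assemble the pattern from the list of
                #countries found, this varies for each paper)
                CurrPattern = r"\b(?:" + '|'.join(re.escape(name) for name in pattern) + r")\b" #create regexp out of the list of names
                #That is, we grab the keyword_ind-th list of strings from Patterns_LoL and stitch it into a regexp. The non-capturing group (?:...) makes the \b word
                #boundaries apply to every name (without it, \b would only bind to the first and last names), and re.escape protects any regexp special characters in the names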
                if re.search(CurrPattern, sentence_text, re.IGNORECASE): #regexp search
                    #print(sentence_text)
                    sentence_text = RemoveIllegalChars(sentence_text)
                    MatchedSentence_LoL[keyword_ind].append(sentence_text) #add the matched sentence to the corresponding list in the matched sentence list of lists

    return MatchedSentence_LoL 
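
When a list of names is joined with `|`, a `\b` placed outside a group only binds to the first and last alternatives. The sketch below shows the difference on an illustrative list; the names and the sample sentence are made up for the example and are not taken from any paper.

```python
import re

names = ["India", "United States"]  # illustrative list, not real extraction output

# Ungrouped: reads as (\bIndia) OR (United States\b), so "India" has no trailing boundary
ungrouped = r"\b" + "|".join(names) + r"\b"

# Grouped: \b applies to every alternative; re.escape guards names containing regex metacharacters
grouped = r"\b(?:" + "|".join(re.escape(n) for n in names) + r")\b"

text = "Families were recruited in Indiana."
ungrouped_hit = re.search(ungrouped, text, re.IGNORECASE) is not None
grouped_hit = re.search(grouped, text, re.IGNORECASE) is not None

print(ungrouped_hit, grouped_hit)
```

With the ungrouped pattern, "Indiana" counts as a mention of "India"; with the grouped pattern it does not, while genuine mentions of either name still match.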

In [638]:
""" 
This function takes in input text (spacy-processed markdown and regular markdown) of the Methods section, takes other inputs to parse for countries, languages, and US states, and 
returns a dict and a df with extracted details for the current paper as well as appends the df for the current paper to the larger output df for the entire dataset. This includes
a list of identified countries and languages, as well as all sentences with mentions of the identified countries and all sentences with mentions of US states in the Methods
text, all as separate df columns. This ideally can also be merged with the previous general function, but it's going to take some refactoring that I do not want to do. 

See the user-defined function that does the sentence extraction for countries and states ('GetMatchedSentences_CountriesEtc') as well as the markdown text at the start of this doc
for details. 

NOTE: this function is only used for getting the output df for countries and languages (which includes US states mentioned; see relevant df initialisation for details)

Spacy-processed markdown text is used so that it can be broken into sentences, etc, which I suspect can also be done by splitting at '.', but that would also interfere
with citations with et al. etc, while geotext requires non-spacy text

Inputs:
    - geotext: the main class constructor for the GeoText library, which needs to be initialised (done outside the function, once, and then passed in) that does the work
               of parsing through the text and identifying countries, states, etc.
    - LangMatcher: the language matcher object from spacy that has the languages group and corresponding keywords from pycountry added to identify languages in the text using NLP
    - StatesUS_List: list of US states
    - PaperTitle_ToC, _NoToC: title of the paper being processed (using pymupdf's table of contents, for _ToC; and by id'ing largest txt on the 1st page, for _NoToC) 
                            (goes into the dataframe under the 'PaperTitle_ToC, _NoToC' cols)
    - FileName: the file name that the paper is saved under (goes into the dataframe under the 'FileName' col)
    - Methods_Mkdwn: markdown text of the methods section (from pymupdf4llm)
    - Methods_spacy: spacy-processed markdown text of the methods section
    - CountriesAndLangsFullDf: the dataframe the output is appended to

Outputs:
    - Temp_CountriesAndLangsDict: the op dictionary for the input paper that gets converted into a df and then gets appended to the input CountriesAndLangsFullDf
    - Temp_CountriesAndLangsDf: the op df for the input paper that gets converted from Temp_CountriesAndLangsDict and then gets appended to the input CountriesAndLangsFullDf
    - CountriesAndLangsFullDf: the input CountriesAndLangsFullDf with the details extracted from the current paper appended to it (so it accumulates one row per paper)
"""
def GetCountriesAndLangs_OpDf(geotext, LangMatcher, StatesUS_List, PaperTitle_ToC, PaperTitle_NoToC, FileName, Methods_Mkdwn, Methods_spacy, CountriesAndLangsFullDf):

    #--1. Initialising and set up---------------------------------------------------------------------------------------------------------
    #set up temp dict for countries and languages extraction from this paper
    Temp_CountriesAndLangsDict = {"PaperTitle_ToC": PaperTitle_ToC,
                                  "PaperTitle_NoToC": PaperTitle_NoToC,
                                  "FileName": FileName} #initialise temporary dict
    
    #--2. Getting mentioned countries---------------------------------------------------------------------------------------------------------
    GeoOp = geotext.extract(Methods_Mkdwn) #geotext output (has countries, cities, etc). Input text should be a string so cannot use spacy-d text as input
    FoundCountries_List_PreProcess = list(set(GeoOp['countries'].keys())) #list(set()) removes duplicates (set()) and then converts back to a list
    FoundCountries_List = [RemoveIllegalChars(ctry) for ctry in FoundCountries_List_PreProcess] #remove illegal chars from each and re-assign
    FoundCountries = '; '.join(FoundCountries_List) 
    Temp_CountriesAndLangsDict["Countries"] = FoundCountries #add country list to the dict
    MatchedSentence_CountriesAndUsStates_LoL = GetMatchedSentences_CountriesEtc(Methods_spacy, FoundCountries_List, StatesUS_List)
    Temp_CountriesAndLangsDict["Countries_sents"] = "; ".join(MatchedSentence_CountriesAndUsStates_LoL[0])
    Temp_CountriesAndLangsDict["StatesUS_sents"] = "; ".join(MatchedSentence_CountriesAndUsStates_LoL[1])

    #--3. Getting mentioned langs---------------------------------------------------------------------------------------------------------
    FoundLangs = LangMatcher(Methods_spacy) #use the matcher on the methods text
    #Matcher op is a list of 3-element tuples, each with the match_id and the start and end token indices of the matched span in the full text. We can use these to extract the
    #languages from the text
    FoundLangsList_PreProcess = list(set([Methods_spacy[start:end].text for _, start, end in FoundLangs])) # Extract unique found languages. set() removes duplicates,
    #and list() converts back to a list
    FoundLangsList = [RemoveIllegalChars(lang_i) for lang_i in FoundLangsList_PreProcess] #remove illegal chars from each and re-assign
    Temp_CountriesAndLangsDict["Languages"] = "; ".join(FoundLangsList) #joins the language list of found languages

    #Packing countries and languages into a temp data frame and concatenating with the full data frame
    Temp_CountriesAndLangsDf = pd.DataFrame([Temp_CountriesAndLangsDict]) #convert country and language dict to df
    CountriesAndLangsFullDf = pd.concat([CountriesAndLangsFullDf,Temp_CountriesAndLangsDf], ignore_index=True) #concat with the full country and language df

    return Temp_CountriesAndLangsDict, Temp_CountriesAndLangsDf, CountriesAndLangsFullDf 
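
`list(set(...))` removes duplicates but does not guarantee any particular order, so the joined string can differ between runs. If a stable, first-seen order is useful (an assumption about how the output is checked later), `dict.fromkeys` is one way to get it. The language names below are illustrative only.

```python
# Illustrative matches in the order they might appear in a Methods section
found = ["Spanish", "English", "Spanish", "Tseltal", "English"]

# dict keys are unique and keep insertion order, so this de-duplicates without reordering
unique_in_order = list(dict.fromkeys(found))
joined = "; ".join(unique_in_order)

print(joined)
```

The same idea would apply to the country list before it is joined with `'; '`.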


In [639]:
""" 
This function processes markdown text that needs to be extracted using spacy, takes in all other inputs for extraction, and does the extraction for all items except 
communities/consortia (which extracts from the entire text minus refs)

Inputs: 
    - nlp: english language model used for spacy processing
    - PaperTitle_ToC, PaperTitle_NoToC, FileName: paper title and file name that go into the output dfs
    - Methods_Mkdwn, MethodsAndResults_Mkdwn: input mark down text
    - KeywordPatts_RecDevice_LoL, RecDevFullDf: keyword pattern list of lists ([[keyword],[corresponding patterns]]) used for extraction as well as the initialised full dataframe
                                                within which results are stored, for recording device extraction. See relevant user-defined fn (GetMatchedSentences)
    - KeywordPatts_Software_LoL, SoftwareFullDf: similarly for software extraction
    - KeywordPatts_ManualAnnot_LoL, ManualAnnotFullDf: similarly for manual annotation extraction
    - KeywordPatts_Topics_LoL, TopicsFullDf: similarly for topic extraction
    - geotext,LangMatcher,StatesUS_List,CountriesAndLangsFullDf: details for language and country extraction; see relevant user defined fn (GetCountriesAndLangs_OpDf)

Outputs: 
    RecDevFullDf, SoftwareFullDf, ManualAnnotFullDf, TopicsFullDf, CountriesAndLangsFullDf: output dataframes containing all extracted details that get saved
"""
def GetExtractedDfs_ExceptCommunities(nlp, PaperTitle_ToC, PaperTitle_NoToC,FileName,
                                      Methods_Mkdwn, MethodsAndResults_Mkdwn,
                                      KeywordPatts_RecDevice_LoL, RecDevFullDf,
                                      KeywordPatts_Software_LoL, SoftwareFullDf,
                                      KeywordPatts_ManualAnnot_LoL, ManualAnnotFullDf,
                                      KeywordPatts_Topics_LoL, TopicsFullDf,
                                      geotext, LangMatcher, StatesUS_List, CountriesAndLangsFullDf):
    
    Methods_spacy = nlp(Methods_Mkdwn) #run the full spacy pipeline (tokenisation, plus the parsing that provides the sentence boundaries used later)
    MethodsAndResults_spacy = nlp(MethodsAndResults_Mkdwn)

    #--2.1 Recording device, software, manual annot (the ones where sentences are extracted)-------------------------------------------------------------------
    #process text and add output df from this paper to the full op df for various item lists (the unassigned outputs are the dictionary and df with the
    #current paper's extracted details, which are not needed here)
    _, _, RecDevFullDf = GetMatchedSentences(Methods_spacy,KeywordPatts_RecDevice_LoL, RecDevFullDf, PaperTitle_ToC, PaperTitle_NoToC, FileName) #get the rec device output df 
    _, _, SoftwareFullDf = GetMatchedSentences(Methods_spacy,KeywordPatts_Software_LoL, SoftwareFullDf, PaperTitle_ToC, PaperTitle_NoToC, FileName) #get the software output df
    _, _, ManualAnnotFullDf = GetMatchedSentences(Methods_spacy,KeywordPatts_ManualAnnot_LoL, ManualAnnotFullDf, PaperTitle_ToC, PaperTitle_NoToC, FileName) #get the manual annot df

    #--2.2. topics (binary YES/NO, no sentences extracted)--------------------------------------------------------------------------------------------------------
    _, _, TopicsFullDf = GetOpDf_TopicsCommunities(MethodsAndResults_spacy,KeywordPatts_Topics_LoL, TopicsFullDf, PaperTitle_ToC, PaperTitle_NoToC, FileName) #get the topics df
    
    #--2.3. Countries and languages (sentences extracted for countries and US states)-------------------------------------------------------------------------------
    _, _, CountriesAndLangsFullDf = GetCountriesAndLangs_OpDf(geotext,LangMatcher,StatesUS_List,PaperTitle_ToC, PaperTitle_NoToC,FileName,
                                                              Methods_Mkdwn,Methods_spacy,CountriesAndLangsFullDf)

    return RecDevFullDf, SoftwareFullDf, ManualAnnotFullDf, TopicsFullDf, CountriesAndLangsFullDf

In [4]:
""" 
Set up for extraction: load english language model, country extraction libraries, data paths, files, etc
"""
#--1. infrastructure for extraction: spacy english model, geotext, etc----------------------------------------------------------------------------------------------------
nlp = spacy.load("en_core_web_sm") #load english language model (for spacy--does the sentence segmentation)
geotext = GeoText() #extracts countries, cities, etc

#--2. infrastructure for extraction: languages from pycountry and spacy phrasematcher, etc---------------------------------------------------------------------------------
LangList_PyCtry = [lang.name for lang in pycountry.languages] #get list of languages from pycountry (provides ISO databases;
#see: https://en.wikipedia.org/wiki/List_of_ISO_639_language_codes)

#set up spacy phrase matcher
LangMatcher = spacy.matcher.PhraseMatcher(nlp.vocab) #, attr="LOWER") #Initialise the matcher object by connecting it to spacy's vocab. Matching is case sensitive as written;
#the commented-out attr="LOWER" option would make it case insensitive
LangKeywords = [nlp.make_doc(lang) for lang in LangList_PyCtry] #get the language keywords by iterating through the language list. nlp.make_doc does tokenisation (instead of the full
#nlp processing, which would include parsing to understand grammatical structure, tagging to understand if the word is a noun or verb etc, finding names or dates, etc.). Tokenisation
#is all we need for language matching. So each of these is now a spacy doc object, and what we have is a list of spacy doc objects
LangMatcher.add("LANGUAGES", LangKeywords) #add the tokenised language names into the matcher under the label 'LANGUAGES'. Because we are only matching one category, the 'LANGUAGES'
#label is not functional, but usually, the output of the matcher is a list of tuples where the first element is the match_id, which points to the label category. So, if there
#are multiple label categories that are being matched, say LANGUAGES and FRUITS, the match_id tells you whether the match is for LANGUAGES or FRUITS. So, if you look at the matcher
#op, all the match_ids will be the same, because we only have one category that we are matching

#--3. data path and files----------------------------------------------------------------------------------------------------------------------------------------------------
DataPath = Path('/Users/ritwikavps/Desktop/GoogleDriveFiles/research/DARCLEPaper2025/PapersForExtraction/') #path object where files lives
Files = DataPath.glob('*.pdf') #get .pdf files in path. NOTE THAT because this is a generator, this block will have to be executed every time, because once this is stepped through,
#it will be empty, and won't be a perpetually stored thing! 
FileList = list(Files)

#--4. Read in extraction file for checks + covidence export .csv
MainPathStr = '/Users/ritwikavps/Desktop/GoogleDriveFiles/research/DARCLEPaper2025/DARCLEPaper2025_ExtractionCode/'
Tab_CovidenceExport = pd.read_csv(MainPathStr + 'CovidenceExport_ListOfPapersForReview.csv') #list of papers from covidence (has title, authors, abstract, etc)
Tab_Extraction_HG = pd.read_excel(MainPathStr + 'DARCLE_extraction_2.16.26.xlsx') #has list of papers downloaded for extraction
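
The point about `Path.glob` returning a generator can be seen in a small, self-contained sketch. It uses a temporary folder with made-up file names, not the real data path. Sorting the list is an optional extra, assumed here to be useful for a reproducible processing order.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    for name in ("paper_b.pdf", "paper_a.pdf", "notes.txt"):
        (tmp_path / name).touch()  # create empty placeholder files

    pdf_iter = tmp_path.glob("*.pdf")          # generator: can be stepped through only once
    first_pass = [p.name for p in pdf_iter]
    second_pass = [p.name for p in pdf_iter]   # generator is exhausted, so this is empty

    pdf_list = sorted(tmp_path.glob("*.pdf"))  # a list can be reused; sorting fixes the order
    pdf_names = [p.name for p in pdf_list]

print(first_pass, second_pass, pdf_names)
```

Converting to a list straight away, as the cell above does with `FileList`, sidesteps the exhaustion problem for everything downstream.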

In [641]:
""" 
Set up keywords for extraction and initialise opdfs (sentence matching: recording device, software, manual annotation, countries/states/langs)
""" 
#--1. Recording device------------------------------------------------------------------------------------------------------------------------------
#Recording device keywords list of lists: where first list has the items we are checking for and the second list has the corresponding keyword patterns. 
#Not using a dict because it won't allow for looping by indexing
KeywordPatts_RecDevice_LoL = [['LENA','Camera','RecOrMic'],
                              [r'\bLENA\b',r'\bcamera\b',r'\bmicrophone\b|\b(audio|voice)[\s-]*recorder\b|\brecorder\b']]
RecDevFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','LENA_sents','Camera_sents','RecOrMic_sents']) #corresponding df

#--2. software------------------------------------------------------------------------------------------------------------------------------
KeywordPatts_Software_LoL = [['LENA','Praat','OpenSMILE','ALICE','SpeechDetAlg'],
                         [r'\bLENA\b',r'\bPraat\b',r'\bOpenSMILE\b',r'\bALICE\b',r'\bspeech detection algorithm\w*\b']]
SoftwareFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','LENA_sents','Praat_sents','OpenSMILE_sents','ALICE_sents','SpeechDetAlg_sents']) #corresponding df
                              
#--3. manual annotation------------------------------------------------------------------------------------------------------------------------------
ManualAnnots_RegExps = r'\b(human|manual|manually|hand)[\s-]*(annot\w*|cod\w*|label\w*|valid\w*|listen\w*)\b|\b(manual\w+|annot\w+)\b|\bground truth\b'
#This consists of several combos:
#1. \b(human|manual|manually|hand)[\s-]*(annot\w*|cod\w*|label\w*|valid\w*|listen\w*)\b: this is the most complicated bit. This matches two-parters, with the first part
    #being 'human', 'manual', 'manually', or 'hand', followed by any number or combo of whitespace characters and hyphens, including none (the '*' asks regexp to match the
    #preceding character class 0 or more times), followed by 'annot\w*|cod\w*|label\w*|valid\w*|listen\w*'. Here, '\w*' matches a word character 0 or more times (because of the *).
    #A word character is a-z, A-Z, 0-9, or underscore. This two-part setup has to sit between word boundaries (\b) on either side
#2. \b(manual\w+|annot\w+)\b: The '\w+' matches one or more word characters (while * matches 0 or more)
#3. \bground truth\b: this is straightforward. Simply, 'ground truth' between word boundaries
#ManualAnnots_RegExp_Joined = '|'.join(ManualAnnots_RegExps) #This was from when this was a more explicit list that needed to be joined
KeywordPatts_ManualAnnot_LoL = [['ManualAnnot'],
                                [ManualAnnots_RegExps]]
ManualAnnotFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','ManualAnnot_sents']) #corresponding df

#--4. Countries and languages------------------------------------------------------------------------------------------------------------------------------
CountriesAndLangsFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','Countries','Countries_sents','StatesUS_sents','Languages']) #countries and languages df 
#(adding US states because a lot of US studies simply mention the states)
StatesUS_List = ["Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho", 
                 "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", 
                 "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", 
                 "Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas", "Utah", "Vermont", "Virginia", 
                 "Washington", "West Virginia", "Wisconsin", "Wyoming"]

#--5. Df to store details of papers for which extraction based on the Methods and Results sections did not happen--------------------------------------------------------------
#(so, extraction for countries and langs; software; audio rec; manual annot; topics)
NoMethResExtract_FullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName'])
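
To see what the manual annotation pattern does and does not catch, it can be run against a few sample sentences. The pattern below is copied from the cell above; the sentences and expected outcomes are made up for illustration.

```python
import re

manual_annot_pattern = (
    r'\b(human|manual|manually|hand)[\s-]*(annot\w*|cod\w*|label\w*|valid\w*|listen\w*)\b'
    r'|\b(manual\w+|annot\w+)\b'
    r'|\bground truth\b'
)

samples = {
    "Clips were hand-coded by two research assistants.": True,   # two-parter with a hyphen
    "Human listeners judged each clip.": True,                   # two-parter with a space
    "Segments were manually checked.": True,                     # manual\w+ matches 'manually'
    "We compared output against ground truth.": True,
    "The LENA algorithm labelled all segments.": False,          # 'labelled' alone is not enough
    "A manual check followed.": False,                           # bare 'manual': \w+ needs one more character
}

results = {s: re.search(manual_annot_pattern, s, re.IGNORECASE) is not None for s in samples}
for sentence, hit in results.items():
    print(hit, sentence)
```

The last sample shows that the bare word "manual" is not matched unless it is followed by one of the second-part stems, which may or may not be the intended behaviour for the `manual*` keyword.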

In [642]:
""" 
Set up keywords for extraction and initialise op dfs (binary YES/NO: communities/consortia and topics)
""" 
#--1. communities/consortia------------------------------------------------------------------------------------------------------------------------------
KeywordPatts_Community_LoL = [['DARCLE', 'ACLEW', 'Homebank', 'Talkbank', 'ALICE', 'VTC', 'LangView', 'MPaL','ManyBabies'],
                              [r'\bDARCLE\b', r'\bACLEW\b', r'\bHomebank\b', r'\bTalkbank\b', r'\bALICE\b', 
                               r'\bVTC\b|\bVoice Type Classifier\b', r'\bLangView\b', r'\bMany Paths to Language\b|\bMPaL\b',r'\bManyBabies\b']]
CommunityFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','DARCLE', 'ACLEW', 
                                        'Homebank', 'Talkbank', 'ALICE', 'VTC', 'LangView', 'MPaL','ManyBabies']) #corresponding df

#--2. Topics------------------------------------------------------------------------------------------------------------------------------
CDSPatts = r'\b(child|adult|infant)[\s-]*(directed)[\s-]*(speech|input)\b|\bCDS\b|\bADS\b|\bIDS\b'
MaturityPatts = r'\bmaturity\b|\bvocal maturity\b|\bspeech maturity\b'
MultilingPatts = r'\b(bi|multi|tri)[\s-]*lingual\w*\b'
CommDisOrderPattList = [r'\bASD\b', r'\bAutism spectrum disorder\b', r'\bhearing loss\b', r'\bdeaf\w*\b', r'\bhearing[\s-]*impair\w*\b', r'\bcerebral palsy\b', 
                        r'\bDown syndrome\b', r'\bcleft palate\b', 
                        r'\b(speech[\s-]*language|speech|language|communication)[\s-]*(pathology|therapy|disorder\w*|services|intervention|impair\w*|difficult\w*)\b',
                        r'\bSLP\b',r'\bFragile X Syndrome\b', r'\bAngelman Syndrome\b', r'\bPhelan[\s-]*McDermid Syndrome\b', r'\bDevelopmental Language Disorder\b', r'\bDLD\b', 
                        r'\bSpecific Language Impairment\b', r'\bSLI\b', r'\bHOH\b', r'\bhard of hearing\b', r'\bdevelopmental disabilities\b', r'\bword[\s-]*finding difficulties\b', 
                        r'\b(phonological|speech[\s-]*sound|articulation|neurodevelopmental)[\s-]*(disorder|impair\w*|difficult\w*)\b',
                        r'\bSpecific[\s-]*language[\s-]*impaired\b', r'\bdevelopmental disorder of language and communication\b'] 
CommDisorderPatts = '|'.join(CommDisOrderPattList)
SesPatts = r'\bSES\b|\b(socio[\s-]*economic\w*)[\s-]*(status|disadvantage\w*|divers\w*)\b|\blow[\s-]*income\b'
SchoolProgPatts = r'\bschool\b|\bclass[\s-]*room\b|\bteacher\b|\bstudent\b|\bpre[\s-]*school\b'
MultimodalPatts = r'\bGPS\b|\baccelerometer\b|\beye[\s-]*track\w*\b|\bEEG\b'
StressAnxEmoPatts = r'\bstress\b|\banxiety\b|\banxious\b|\bemotion\b'
CryNonCryPatts = r'\bcries\b|\bcry\w*\b|\bnon[\s-]*cries\b|\bnon[\s-]*cry\b'
NICUPatts = r'\bNICU\b|\bneonatal intensive care unit\b|\bpre[\s-]*(term|mature)[\s-]*(infant\w*|child\w*|birth\w*)\b'

#put it all together into keywords LoL
KeywordPatts_Topics_LoL = [['CDSvsADS','VocMaturity','Multiling','CommDisorder','SES','SchoolProg','Multimodal','StressAnxEmo','CryNonCry','NICU'],
                           [CDSPatts, MaturityPatts, MultilingPatts, CommDisorderPatts, SesPatts, SchoolProgPatts, MultimodalPatts, StressAnxEmoPatts, CryNonCryPatts, NICUPatts]]

#corresponding df
TopicsFullDf = pd.DataFrame(columns=['PaperTitle_ToC','PaperTitle_NoToC','FileName','CDSvsADS','VocMaturity','Multiling','CommDisorder','SES',
                                     'SchoolProg','Multimodal','StressAnxEmo','CryNonCry','NICU']) 
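
Several of the topic patterns use `[\s-]*` between word parts so that hyphenated, spaced, and closed-up spellings are all caught by one expression. A short sketch using the multilingualism pattern from the cell above, with made-up test strings:

```python
import re

multiling_pattern = r'\b(bi|multi|tri)[\s-]*lingual\w*\b'

variants = ["bilingual", "bi-lingual", "Multi lingual", "trilingualism", "monolingual"]

# Keep the variants that the pattern matches (case insensitive, as in the extraction functions)
matched = [v for v in variants if re.search(multiling_pattern, v, re.IGNORECASE)]

print(matched)
```

"monolingual" is left out because the leading `\b` requires the prefix (bi, multi, or tri) to start at a word boundary.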

In [643]:
""" 
Set up keywords to extract sections (Methods and Results)
""" 
#-------1. Extraction keywords for Methods section and next section---------------------------------------------------------------------------------------
MethodsStartPattern = re.compile(r'^[#*\d\s.|]*'             # 0 or more symbols/digits/dots/spaces at start; some dev sci articles have a | in the section titles
                                 r'\b('                         # Group start
                                 r'Material\w*\s*(?:&|and)\s*Method\w*|' # Most specific first
                                 r'Experimental\s*Method\w*|'
                                 r'Methodolog\w*|'
                                 r'Method\w*'               # Most general last
                                 r')\b'                         # Group end
                                 r'[*#\s]*$',                   # Trailing symbols and spaces
                                 re.IGNORECASE | re.MULTILINE)
MethodsNextSecPattern = re.compile(r'^[#*\d\s.|]*'             # 0 or more symbols/digits/dots/spaces at start
                                   r'\b('                         #group start
                                   r'Result\w*|Discussion|Result\w*\s*(?:&|and)\s*Discussion|Experimental\s*Finding\w*|Finding\w*|'
                                   r'Reference\w*|Acknowledgment\w*|Conclusion\w*|Citation\w*|Work\w*\s*Cited|Bibliograph\w*'
                                   r')\b'                         # Group end
                                   r'[*#\s]*$',                   # Trailing symbols and spaces
                                   re.IGNORECASE | re.MULTILINE)

#-------2. Extraction keywords for Results section and next section---------------------------------------------------------------------------------------
ResultsStartPattern = re.compile(r'^[#*\d\s.|]*'             # 0 or more symbols/digits/dots/spaces at start
                                 r'\b('                        #group start
                                 r'Result\w*|Result\w*\s*(?:&|and)\s*Discussion|Experimental\s*Result\w*|Experimental\s*Finding\w*|Finding\w*'
                                 r')\b'                         # Group end
                                 r'[*#\s]*$',                   # Trailing symbols and spaces
                                 re.IGNORECASE | re.MULTILINE)
ResultsNextSecPattern =  re.compile(r'^[#*\d\s.|]*'             # 0 or more symbols/digits/dots/spaces at start
                                   r'\b('                        #group start
                                   r'Discussion|Reference\w*|Acknowledgment\w*|Conclusion\w*|Citation\w*|Work\w*\s*Cited|Bibliograph\w*|'
                                   r'Method\w*|Experimental\s*Method\w*|Material\w*\s*(?:&|and)\s*Method\w*|Methodolog\w*'
                                   r')\b'                         # Group end
                                   r'[*#\s]*$',                   # Trailing symbols and spaces
                                   re.IGNORECASE | re.MULTILINE)
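
The section patterns are meant to match heading lines only, not words inside running text. The sketch below uses simplified versions of the patterns (fewer alternatives than the ones above) and a made-up markdown snippet to show how a start match and a next-section match can be used to slice out the Methods text. How the notebook itself does the slicing is not shown in this section, so the slicing step here is an assumption.

```python
import re

heading_prefix = r'^[#*\d\s.|]*'   # symbols, digits, dots, spaces before the heading word
heading_suffix = r'[*#\s]*$'       # trailing symbols and spaces up to the end of the line

methods_start = re.compile(heading_prefix + r'\b(Method\w*)\b' + heading_suffix,
                           re.IGNORECASE | re.MULTILINE)
next_section = re.compile(heading_prefix + r'\b(Result\w*|Discussion)\b' + heading_suffix,
                          re.IGNORECASE | re.MULTILINE)

markdown_text = (
    "# A study of daylong recordings\n"
    "Our methods build on earlier work.\n"   # 'methods' mid-sentence: not a heading
    "## 2. Methods\n"
    "Families wore a recorder for one day.\n"
    "## 3. Results\n"
    "Children vocalised often.\n"
)

start = methods_start.search(markdown_text)
# Look for the next section heading only after the Methods heading
end = next_section.search(markdown_text, start.end()) if start else None
methods_text = markdown_text[start.end():end.start()].strip() if start and end else ""

print(repr(methods_text))
```

Because the whole line has to consist of the prefix, the heading word, and the suffix, the sentence "Our methods build on earlier work." is skipped and only the `## 2. Methods` line is treated as the section start.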

In [644]:
""" 
Basic checks before proceeding
""" 
#Check if the number of downloaded files match the number of files tagged for extraction
DownloadedFiles_SubTab = Tab_Extraction_HG[Tab_Extraction_HG['Downloaded'].str.match('yes',case=False,na=False)] #get the subset of the table that has 'yes' in the downloaded column
#The .str accessor applies the string method to each entry in the column, treating each entry as a string.
ObsOrExptType_SubTab = Tab_Extraction_HG[Tab_Extraction_HG['Type'].str.contains('Experimental|Observational', case=False,na=False)] #similarly for the observational or 
#experimental types

if not DownloadedFiles_SubTab.equals(ObsOrExptType_SubTab):
    print("The tables are different (they should not be).")

#check number of files tagged for download and actually downloaded files
NumFilesFromExtractionTab = DownloadedFiles_SubTab.shape[0]
NumFilesInExtractionFolder = len(FileList)

if NumFilesFromExtractionTab != NumFilesInExtractionFolder:
    raise ValueError('Number of files tagged for extraction (in the extraction sheet) is not the same as actual number of downloaded files for extraction.')

print('Keep in mind that these checks do not check whether the files tagged for extraction are actually the files that have been downloaded!\n' \
'This check (or associated precautions) needs to be built into post-processing!')

Keep in mind that these checks do not check whether the files tagged for extraction are actually the files that have been downloaded!
This check (or associated precautions) needs to be built into post-processing!
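
One possible way to add the missing check is to compare the saved PDF names against the names tagged in the extraction sheet. The sketch below is a minimal illustration, assuming the sheet has (or can be given) a column holding each paper's saved file name; the helper `CompareTaggedAndDownloaded` and the toy inputs are hypothetical and not part of this notebook.

```python
from pathlib import Path

def CompareTaggedAndDownloaded(TaggedNames, FileList):
    """Return (tagged but not downloaded, downloaded but not tagged), comparing names case-insensitively."""
    TaggedSet = {str(Name).strip().lower() for Name in TaggedNames}
    DownloadedSet = {Path(File).stem.strip().lower() for File in FileList}  # stem = file name without extension
    return sorted(TaggedSet - DownloadedSet), sorted(DownloadedSet - TaggedSet)

# Toy inputs (made up for illustration)
TaggedNames_Demo = ['Paper A 2010', 'Paper B 2012', 'Paper C 2019']
FileList_Demo = [Path('pdfs/Paper A 2010.pdf'), Path('pdfs/paper b 2012.pdf'), Path('pdfs/Paper D 2021.pdf')]

MissingFiles, UntaggedFiles = CompareTaggedAndDownloaded(TaggedNames_Demo, FileList_Demo)
print('Tagged but not downloaded:', MissingFiles)
print('Downloaded but not tagged:', UntaggedFiles)
```

Both lists being empty would indicate that the tagged and downloaded sets agree by name. A count-only comparison cannot detect the case where one file is missing and another, untagged file is present.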


In [645]:
"""
Main extraction for loop for METHODS section (loops through files and saves outputs): this includes recording device, manual annotation, segmentation software, list of countries,
list of languages

OTHER THINGS TO DO: 
    - get number of files where refs aren't removed
"""
for i_file, file in enumerate(FileList): #go through files + get an index if we need it
    with fitz.open(file) as doc: #open current doc (the with block closes it automatically on exit)
        ToC = doc.get_toc() #get table of contents

        #print('-------------------------------------------')
        #print(f'{i_file}. {FileName}')

        #--1. Set up and initialisation--------------------------------------------------------------------------------------------------------------------------
        #set up paper title (no easy way to extract if there is no ToC). I am sure this can be done, but it's going to take even more wrangling
        if ToC:
            PaperTitle_ToC = ToC[0][1] #gets the paper title, [0] indexes the first entry, and [1] indexes the title of the first entry (ToC format is level, title, pageNo)
            PaperTitle_ToC = RemoveIllegalChars(PaperTitle_ToC)
        else:
            PaperTitle_ToC = ''

        PaperTitle_NoToC = GetPaperTitle_NoToC(doc) #get paper title from page 1 of the document by extracting all of the largest font text, and add that to its own df
        FileName = file.stem #gets the file name (i.e., pdf name as saved)

        #initialise temporary dict to keep track of cases where extraction based on methods and results section has not happened
        Temp_Dict_NoMethResExtract = {"PaperTitle_ToC": PaperTitle_ToC,
                                      "PaperTitle_NoToC": PaperTitle_NoToC,
                                      "FileName": FileName} 

        #--2. Get full text markdown + nlp processing + getting other sections + extraction (by id'ing Methods and results sections, without using TOC) ----------------------------
        FullTxtMkDwn = pymupdf4llm.to_markdown(doc, write_images=False, ignore_graphics=True) #get full text as markdown

        #remove refs and nlp process + get topics and consortia df
        RefsRemovedTxt = RemoveRefs(FullTxtMkDwn, FileName)
        RefsRemoved_spacy = nlp(RefsRemovedTxt)

        #get the communities and consortia df--binary YES/NO df, no sentences extracted (the unassigned outputs, marked _, are the dictionary and df with the
        #current paper's extracted details)
        _, _, CommunityFullDf = GetOpDf_TopicsCommunities(RefsRemoved_spacy, KeywordPatts_Community_LoL, CommunityFullDf, PaperTitle_ToC, PaperTitle_NoToC, FileName) 
        
        #get start and end positions for methods and results
        MethodsStartPos, MethodsEndPos = GetSectionStart_NoToc(MethodsStartPattern, MethodsNextSecPattern,'Methods', FullTxtMkDwn)
        ResultsStartPos, ResultsEndPos = GetSectionStart_NoToc(ResultsStartPattern, ResultsNextSecPattern,'Results', FullTxtMkDwn)

        #get methods and results sections *IF* we can id the start and end + other conditions
        if (not math.isnan(MethodsStartPos) and not math.isnan(MethodsEndPos) and not math.isnan(ResultsStartPos) and 
                not math.isnan(ResultsEndPos) and MethodsStartPos < MethodsEndPos and ResultsStartPos < ResultsEndPos):

                Methods_Mkdwn = FullTxtMkDwn[MethodsStartPos:MethodsEndPos] #methods
                Results_MkDwn = FullTxtMkDwn[ResultsStartPos:ResultsEndPos] #results
                MethodsAndResults_Mkdwn = Methods_Mkdwn + Results_MkDwn #because these are necessarily exclusive, we can just add these together
                #print(Methods_Mkdwn)

                #extraction of everything except communities, packed into one function
                RecDevFullDf, SoftwareFullDf, ManualAnnotFullDf, TopicsFullDf, CountriesAndLangsFullDf = GetExtractedDfs_ExceptCommunities(nlp, 
                                                                                            PaperTitle_ToC, PaperTitle_NoToC, FileName,
                                                                                            Methods_Mkdwn, MethodsAndResults_Mkdwn,
                                                                                            KeywordPatts_RecDevice_LoL, RecDevFullDf,
                                                                                            KeywordPatts_Software_LoL, SoftwareFullDf,
                                                                                            KeywordPatts_ManualAnnot_LoL, ManualAnnotFullDf,
                                                                                            KeywordPatts_Topics_LoL, TopicsFullDf,
                                                                                            geotext, LangMatcher, StatesUS_List, CountriesAndLangsFullDf)

        #--3. Get full text markdown + nlp processing + getting other sections + extraction (by id'ing Methods and results sections, WITH using TOC) ----------------------------
        else:
            if ToC: #if methods OR results cannot be extracted, we can check if ToC exists and extract based on that
            
                #_ outputs are, in order, the level (from ToC) and title of the required section, and the level and title of the next section
                MethodStPage, _, _, MethodsNextSecStPage, _, _ = GetReqSectionPgNos_wToC(ToC, "METHOD",printToc=False,printOp=False)
                ResultsStPage, _, _, ResultsNextSecStPage, _, _ = GetReqSectionPgNos_wToC(ToC, "RESULT",printToc=False,printOp=False)

                #if both methods start page and next section start page exist, proceed
                if not math.isnan(MethodStPage) and not math.isnan(MethodsNextSecStPage) and not math.isnan(ResultsStPage) and not math.isnan(ResultsNextSecStPage): 
                    
                    #get page numbers for methods and results
                    MethodsPgs = set(range(MethodStPage-1, MethodsNextSecStPage)) 
                    ResultsPgs = set(range(ResultsStPage-1, ResultsNextSecStPage))
                    MethdsAndResultsPgs = sorted(MethodsPgs|ResultsPgs) #get set union (so duplicates are removed)

                    Methods_Mkdwn = pymupdf4llm.to_markdown(doc, write_images=False, ignore_graphics=True, pages=list(range(MethodStPage-1, MethodsNextSecStPage)))
                    MethodsAndResults_Mkdwn = pymupdf4llm.to_markdown(doc, write_images=False, ignore_graphics=True, pages=MethdsAndResultsPgs) #get methods and results together

                    #extraction of everything except communities, packed into one function
                    RecDevFullDf, SoftwareFullDf, ManualAnnotFullDf, TopicsFullDf, CountriesAndLangsFullDf = GetExtractedDfs_ExceptCommunities(nlp, 
                                                                                                PaperTitle_ToC, PaperTitle_NoToC, FileName,
                                                                                                Methods_Mkdwn, MethodsAndResults_Mkdwn,
                                                                                                KeywordPatts_RecDevice_LoL, RecDevFullDf,
                                                                                                KeywordPatts_Software_LoL, SoftwareFullDf,
                                                                                                KeywordPatts_ManualAnnot_LoL, ManualAnnotFullDf,
                                                                                                KeywordPatts_Topics_LoL, TopicsFullDf,
                                                                                                geotext, LangMatcher, StatesUS_List, CountriesAndLangsFullDf)

                else: #ToC exists, but Methods or Results pages could not be identified from it; record the paper as not extracted
                    print(f'ToC found, but Methods or Results pages could not be identified for {FileName}')
                    Temp_df_NoMethResExtract = pd.DataFrame([Temp_Dict_NoMethResExtract])
                    NoMethResExtract_FullDf = pd.concat([NoMethResExtract_FullDf,Temp_df_NoMethResExtract],ignore_index=True)

            else: #if ToC also does not exist
                print(f'No Methods or Results section OR ToC found for {FileName}')
                Temp_df_NoMethResExtract = pd.DataFrame([Temp_Dict_NoMethResExtract])
                NoMethResExtract_FullDf = pd.concat([NoMethResExtract_FullDf,Temp_df_NoMethResExtract],ignore_index=True)
             
#sentences extracted dfs
RecDevFullDf.to_excel('RecDevice_ExtractedSents.xlsx', index=False) #index = False so there are no row indices
SoftwareFullDf.to_excel('Software_ExtractedSents.xlsx', index=False)  
ManualAnnotFullDf.to_excel('ManualAnnot_ExtractedSents.xlsx', index=False)
CountriesAndLangsFullDf.to_excel('CountriesLangsStatesUS_ExtractedSents.xlsx', index=False) #countries, US states, and langs

#binary dfs
TopicsFullDf.to_excel('Topics_ExtractedBinary.xlsx', index=False)  
CommunityFullDf.to_excel('CommunityConsortia_ExtractedBinary.xlsx', index=False) 

#non extracted dfs
NoMethResExtract_FullDf.to_excel('ListOfNotExtractedPapers_ExceptTopics.xlsx',index=False)

No Methods or Results section OR ToC found for TrackingAcquisitionOfLanguageInKidsTALKStudyProtocolALongitudinalInvestigationOfInfantsAtHighVsLowRiskForAtypicalSpeechAndLanguageDevelopment_Goble
No Methods or Results section OR ToC found for Mahr 2018
Refs section has not been found for Oller 2010. This can happen for some PNAS articles. Entire input text passed as output
No Methods or Results section OR ToC found for Oller 2010
No Methods or Results section OR ToC found for csd-28-4-828
Refs section has not been found for Warlaumont 2014. This can happen for some PNAS articles. Entire input text passed as output
No Methods or Results section OR ToC found for Xu 2012
No Methods or Results section OR ToC found for Caskey and Vohr - 2013 - Assessing language and language environment of hig
No Methods or Results section OR ToC found for Ha 2019
No Methods or Results section OR ToC found for suskind-et-al-2013-an-exploratory-study-of-quantitative-linguistic-feedback
No Methods or Results s