## Extraction of Accession Numbers from PDF documents


This notebook is for extraction all referenced NCBI accession numbers from a journal paper.


Information:

### Ref Seq Numbers format

2 letters, '_' , 1 to 6 numbers.
r'\w{2}_\d{1,6}'


### Genbank Numbers format

DDBJ/EMBL/GenBank Accession Prefix Format
The format for GenBank Accession numbers are:

Nucleotide:    
1 letter + 5 numerals
r'\w\d{5}'

2 letters + 6 numerals
r'\w{2}\d{5}'

2 letters + 8 numerals
r'\w{2}\d{8}'

r'\w{1,2}\d{5,8}


Protein:       
3 letters + 5 numerals
r'\w{3}\d{5}'


3 letters + 7 numerals
r'\w{3}\d{5|7}'


WGS:           
4 letters + 2 numerals for WGS assembly version + 6 or more numerals
r'\w{4}\d{2}\d{6,}'

6 letters + 2 numerals for WGS assembly version + 7 or more numerals
r'\w{6}\d{2}\d{7,}'               
               
               
### SRA numbers format

Bioproject
5 letters + 6 numbers
r'[A-Z]{5}\d{6,}'

Biosample
4 letters + numbers
r'\w{4}\d{5,}'

Experiment
3 letters + numbers
r'\w{4}\d{5,}'

Run numbers
3 letters + numbers
r'\w{4}\d{5,}'



In [1]:
import PyPDF2
import re

In [2]:
#pattern_index

refseq_pattern = r'[A-Z]{2}_\d{1,6}'
genbank_nucleotide_pattern = r'[A-Z]{1,2}\d{5,8}'
genbank_protein_pattern = r'[A-Z]{3}\d{5|7}'
genbank_WGS_pattern = r'[A-Z]{4|6}\d{2}\d{6,}' 
sra_bioproject_pattern = r'[A-Z]{5}\d{6,}'
sra_biosample_pattern = r'[A-Z]{4}\d{5,}'
sra_experiment_pattern = r'[A-Z]{4}\d{5,}'
sra_runnumber_pattern =r'[A-Z]{4}\d{5,}'


pattern_choice_dict = {1:refseq_pattern, 
                      2:genbank_nucleotide_pattern,
                      3:genbank_protein_pattern,
                      4:genbank_WGS_pattern,
                      5:sra_bioproject_pattern,
                      6:sra_biosample_pattern,
                      7:sra_experiment_pattern,
                      8:sra_runnumber_pattern}

#print(pattern_choice_dict[1])

In [3]:
#function for searching for pattern

def find_pattern_in_text(pattern, pdf_text):
    
    search_result = []

    for text in pdf_text[1:]:
    
        for match in re.finditer(pattern,text):     
            #print(match.span()) 
            #print(match.group())
            search_result.append(match.group())
        
    return search_result
    

In [4]:
#open file

myfile = r's12864-023-09114-w.pdf' #please enter file name and path here

mypdf = open(myfile,'rb')

pdf_reader = PyPDF2.PdfReader(mypdf)


#extracting the text

pdf_text = [0]

for p in range(len(pdf_reader.pages)):
    page = pdf_reader.pages[p]
    pdf_text.append(page.extract_text())
    
mypdf.close()



In [5]:
#for user to enter choices

choice_input = []
choices_allowed_list =['1', '2', '3', '4', '5', '6', '7', '8']

print('Here are the things that this notebook can scan a paper for:\n1:refseq_pattern, \n2:genbank_nucleotide_pattern, \n3:genbank_protein_pattern, \n4:genbank_WGS_pattern, \n5:sra_bioproject_pattern, \n6:sra_biosample_pattern, \n7:sra_experiment_pattern, \n8:sra_runnumber_pattern \n')
print("Please enter your choice(s): ")

while True:
    user_choice = input()
    if user_choice:
        if user_choice in choices_allowed_list:
            choice_input.append(user_choice)
        else:
            print('Sorry that is not a valid choice.')
    else:
        break

print('Choices received :')
print(choice_input)

Here are the things that this notebook can scan a paper for:
1:refseq_pattern, 
2:genbank_nucleotide_pattern, 
3:genbank_protein_pattern, 
4:genbank_WGS_pattern, 
5:sra_bioproject_pattern, 
6:sra_biosample_pattern, 
7:sra_experiment_pattern, 
8:sra_runnumber_pattern 

Please enter your choice(s): 
1
2
3
4
5
6

Choices received :
['1', '2', '3', '4', '5', '6']


In [6]:
patterns_to_run = []

for choice in choice_input:
    retrieve_pattern_from_choice = pattern_choice_dict[int(choice)]
    patterns_to_run.append(retrieve_pattern_from_choice)

print(patterns_to_run)

['[A-Z]{2}_\\d{1,6}', '[A-Z]{1,2}\\d{5,8}', '[A-Z]{3}\\d{5|7}', '[A-Z]{4|6}\\d{2}\\d{6,}', '[A-Z]{5}\\d{6,}', '[A-Z]{4}\\d{5,}']


In [7]:
for pattern in patterns_to_run:
    
    testresult = find_pattern_in_text(pattern, pdf_text)
    print(testresult)
    
    

['SA_2022', 'SA_2022', 'SA_2022']
['ON563414', 'ON563414', 'ON563414', 'ON563414', 'ON563414', 'ON563414', 'OP120937', 'OP120938', 'NA863094']
[]
[]
['PRJNA863094']
['RJNA863094']


Improvements:

- not having repeated results in the list
- output improvements: print choice + result of the choice
- turning it into a proper python script


In [None]:
#pattern = r'[A-Z]{2}\d{5}' #pattern for experiment number in SRA

In [None]:
#print(pdf_text)

In [None]:
#test script

# search_result = []

# for text in pdf_text[1:]:
    
#     for match in re.finditer(pattern,text):     
#         #print(match.span()) 
#         #print(match.group())
#         search_result.append(match.group())
        
# print(search_result)

In [None]:
# testresult = find_pattern_in_text(pattern, pdf_text)
# print(testresult)