# NLP Algorithms

## spaCy
spaCy is a fast NLP library built for practical use. It aims to make state-of-the-art NLP accessible, and it works alongside other well-known libraries such as Gensim and scikit-learn. Written in Python and Cython, it is optimized for performance and gives developers a straightforward path to advanced NLP tasks such as named entity recognition.

# Preparing the Bibliography
The bibliography is parsed first because its entries hold the paths to the files attached in the Zotero library.

In [36]:
from pybtex import database

In [37]:
class Library:
    """Wraps a bibliography file parsed with pybtex."""

    def __init__(self, path, bib_format='bibtex'):
        self.path = path
        self.library = database.parse_file(path, bib_format=bib_format)
        # Keep the entries in a list so that they can be accessed by position.
        self.entries = [self.library.entries[key] for key in self.library.entries]

In [38]:
library = Library('/Users/paul/Desktop/FOM_MSc_Thesis.bib')

In [39]:
# TODO: print titles and paths for the files in the BibTeX file; count documents with a file path.
# For now, count all entries in the library.

len(library.entries)

41
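
The comment in the cell above asks for titles, file paths, and a count of entries with an attachment, while the cell itself only counts all entries. A minimal sketch of such a listing could look as follows. It assumes only that each entry exposes a `fields` mapping, as pybtex entries do; the helper name `entries_with_files`, the two stand-in entries, and the sample `file` value are hypothetical and not taken from the thesis library.

```python
from types import SimpleNamespace


def entries_with_files(entries):
    """Return (title, file field) pairs for the entries that have a 'file' field."""
    return [
        (entry.fields['title'], entry.fields['file'])
        for entry in entries
        if 'file' in entry.fields
    ]


# Hypothetical stand-ins for pybtex entries; only `.fields` is used.
sample_entries = [
    SimpleNamespace(fields={'title': 'Paper A',
                            'file': 'Full Text:/tmp/storage/AAAA/a.pdf:application/pdf'}),
    SimpleNamespace(fields={'title': 'Paper B'}),
]

with_files = entries_with_files(sample_entries)
for title, file_field in with_files:
    print(title, '->', file_field)
print(f'{len(with_files)} of {len(sample_entries)} entries have a file attached')
```

With the real library, `entries_with_files(library.entries)` should work in the same way, since the function only reads `entry.fields`.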

In [40]:
library.entries[-1]

Entry('misc',
  fields=[
    ('title', 'Automating {Systematic} {Literature} {Reviews} with {Natural} {Language} {Processing} and {Text} {Mining}: a {Systematic} {Literature} {Review}'), 
    ('shorttitle', 'Automating {Systematic} {Literature} {Reviews} with {Natural} {Language} {Processing} and {Text} {Mining}'), 
    ('url', 'http://arxiv.org/abs/2211.15397'), 
    ('abstract', 'Objectives: An SLR is presented focusing on text mining based automation of SLR creation. The present review identifies the objectives of the automation studies and the aspects of those steps that were automated. In so doing, the various ML techniques used, challenges, limitations and scope of further research are explained. Methods: Accessible published literature studies that primarily focus on automation of study selection, study quality assessment, data extraction and data synthesis portions of SLR. Twenty-nine studies were analyzed. Results: This review identifies the objectives of the automation studie

## Creating the Document Class

In [41]:
import re
from pdfminer.high_level import extract_text

In [64]:
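class Document:
    base_path = '/Users/paul/Zotero/storage/'

    def __init__(self, entry):
        self.entry = entry
        self.title = self.entry.fields['title']
        self.fields = self.entry.fields.keys()
        if 'file' in self.fields:
            # Keep the part after the storage folder, up to the next ':' separator.
            self.file = self.entry.fields['file'].split(self.base_path)[1].split(':')[0]
        else:
            self.file = ''
        # Escape the dot so that it matches a literal '.' and not any character.
        self.is_pdf = re.search(r'\.pdf', self.file)
        self.text = None

    def get_text(self, base_path=base_path):
        # Extract the raw text of the PDF with pdfminer.
        self.text = extract_text(base_path + self.file)
        return self.text

    def split_sentences(self):
        # Split at whitespace that follows '.' or '?' (skipping some abbreviations) and at blank lines.
        sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|(\n){2,}', self.get_text())
        # Drop empty pieces and the separators captured by the group, then flatten line breaks.
        sentences = [sentence.replace('\n', ' ') for sentence in sentences if sentence not in [None, '\n', '', ' ', '  ']]
        # Discard pieces that contain no ASCII letters at all.
        return [sentence for sentence in sentences if not re.match(r'^[^a-zA-Z]*$', sentence)]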
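
To see what the pattern in `split_sentences` does, it helps to run it on a short string instead of a full PDF. The sketch below copies the pattern and the two filters into a standalone function; the constant name `SPLIT_PATTERN` and the sample text are made up for illustration.

```python
import re

# Same pattern as in Document.split_sentences.
SPLIT_PATTERN = r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|(\n){2,}'


def split_sentences(text):
    pieces = re.split(SPLIT_PATTERN, text)
    # re.split also returns what the capturing group matched (None or '\n'); drop those.
    pieces = [p.replace('\n', ' ') for p in pieces if p not in [None, '\n', '', ' ', '  ']]
    # Keep only pieces that contain at least one ASCII letter.
    return [p for p in pieces if not re.match(r'^[^a-zA-Z]*$', p)]


sample = 'Dr. Smith reviewed e.g. ten papers. Was it enough?\n\n12 34\n\nYes.'
print(split_sentences(sample))
# ['Dr. Smith reviewed e.g. ten papers.', 'Was it enough?', 'Yes.']
```

The two negative lookbehinds protect abbreviations shaped like `e.g.` and `Dr.`, and the final filter removes the digits-only piece. Abbreviations with other shapes are not covered: text such as `et al.` followed by a space still produces a split.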



In [66]:
documents = []

for entry in library.entries:
    document = Document(entry)
    if document.is_pdf:
        documents.append(document)
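
The filter above depends on how `Document` derives a relative path from the BibTeX `file` field. The sketch below isolates that step. It assumes the field has the form `label:absolute_path:mime_type`, which is what the splitting logic in `Document.__init__` implies; the function name `relative_attachment_path` and the sample field value are hypothetical.

```python
def relative_attachment_path(file_field, base_path):
    """Return the attachment path relative to base_path, or '' if base_path is absent."""
    if base_path not in file_field:
        return ''
    # Take the text after the storage folder and cut it at the next ':' separator.
    return file_field.split(base_path, 1)[1].split(':')[0]


base = '/Users/paul/Zotero/storage/'
field = ('Full Text PDF:/Users/paul/Zotero/storage/ABCD1234/'
         'Example - 2023 - Title.pdf:application/pdf')

print(relative_attachment_path(field, base))
# ABCD1234/Example - 2023 - Title.pdf
print(repr(relative_attachment_path('no attachment here', base)))
# ''
```

Unlike the class, this variant returns an empty string when the storage folder does not occur in the field, where indexing the split result would raise an `IndexError`. A file name that itself contains a colon would be truncated by either version.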


In [69]:
for doc in documents[-3:-2]:
    print('Title: '+doc.title, 
          '\n\tFilepath: '+doc.file)
    sentences = doc.split_sentences()
    print(sentences)


Title: An automated method for developing search strategies for systematic review using {Natural} {Language} {Processing} ({NLP}) 
	Filepath: 2W9KB3PQ/Kwabena et al. - 2023 - An automated method for developing search strategies for systematic review using Natural Language Pr.pdf
['MethodsX 10 (2023) 101935 ', 'Contents lists available at ScienceDirect ', 'MethodsX ', 'journal homepage: www.elsevier.com/locate/mex ', 'Method Article ', 'An automated method for developing search strategies for  systematic review using Natural Language Processing (NLP) ', 'Antwi Eﬀah Kwabena a , ∗ , Owusu-Banahene Wiafe b , Boakye-Danquah John a ,  Asare Bernard b , Frimpong A.F. Boateng b  a  Canadian Forest Service, Great Lakes Forestry Centre, 1219 Queen Street East, Sault Ste.', 'Marie, Ontario, P6A 2E5  b  University of Ghana, Department of Computer Engineering, P.O. BOX LG 77, Legon, Accra, Ghana ', 'a r t i c l e ', 'i n f o ', 'a b s t r a c t ', 'Method name:  Search Strategy  Search Terms  Data 

# Extracting Text from PDFs

# Preprocessing

In [70]:
import spacy
nlp = spacy.load("en_core_web_trf")

In [71]:
class Sentence:
    def __init__(self, sentence, file):
        self.file = file

        self.tokens = None
        self.inventory = None
        self.contains_noun = None
        self.contains_verb = None
        self.contains_cid = None
        self.valid = None

        # Remove digits trailing a word; those digits are usually footnote markers.
        # The lookbehind keeps the preceding letter, which would otherwise be deleted as well.
        self.sentence = re.sub(r'(?<=[A-Za-z])\d+\b', '', sentence)

        # Crop the sentence so that it starts at the first uppercase ASCII letter.
        for i, char in enumerate(self.sentence):
            if re.match(r'[A-Z]', char):
                self.sentence = self.sentence[i:]
                break
        


    def tokenize(self):
        self.tokens = [(word.text, word.pos_) for word in nlp(self.sentence)]
    
    def count_tokens(self):
        if self.tokens is None:
            self.tokenize()

        inventory = {}
        for _, value in self.tokens:
            inventory[value] = inventory.get(value, 0) + 1
    
        self.inventory = inventory
    
    def check_validity(self):
        if self.inventory is None:
            self.count_tokens()

        word_types = self.inventory.keys()

        # Record which of the required word types occur in the sentence.
        self.contains_noun = 'NOUN' in word_types
        self.contains_verb = 'VERB' in word_types

        # Flag "(cid:NNN)" placeholders, which pdfminer emits for glyphs it cannot map to Unicode.
        self.contains_cid = bool(re.search(r'\(cid:\d{1,4}\)', self.sentence))
        
        if self.contains_noun and self.contains_verb:
            self.valid = True
        else:
            self.valid = False
    
    def summarize(self, show_token_details=False):
        print(f'The origin file is: {self.file}')
        print(f'The sentence is:\n{self.sentence}')
        print(f'The inventory holds:\n{self.inventory}')
        if show_token_details:
            print(f'The token details are:\n{self.tokens}')
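
The two regex-based cleaning steps of `Sentence` can be tried in isolation. The sketch below uses a lookbehind so that only the digits are removed; a pattern that includes the letter in the match, such as `[A-Za-z]\d+\b`, also deletes the last letter of the word. The function names and sample sentences are made up for illustration.

```python
import re


def strip_footnote_digits(sentence):
    """Remove digits glued to the end of a word, keeping the word itself."""
    return re.sub(r'(?<=[A-Za-z])\d+\b', '', sentence)


def contains_cid(sentence):
    """Report whether a '(cid:NNN)' placeholder occurs anywhere in the sentence."""
    return re.search(r'\(cid:\d{1,4}\)', sentence) is not None


text = 'Reviews are costly12 and slow3.'
print(strip_footnote_digits(text))
# Reviews are costly and slow.
print(re.sub(r'[A-Za-z]\d+\b', '', text))
# Reviews are costl and slo.
print(contains_cid('The (cid:0) symbol was not decoded.'))
# True
print(contains_cid('A clean sentence.'))
# False
```

The digit rule is a heuristic: it also shortens legitimate tokens that end in digits, for example `H2` becomes `H`.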

         
        

In [72]:
valid_sentences = []

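# `sentences` still holds the result of the last split_sentences() call above.
# Note: documents[0] is recorded as the origin and may not be the document the sentences came from.
for sen in sentences:

    Sen = Sentence(sen, documents[0])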
    Sen.check_validity()
    if Sen.valid:
        valid_sentences.append(Sen)

In [73]:
for counter, s in enumerate(valid_sentences, start=1):
    print(counter, s.sentence)

1 An automated method for developing search strategies for  systematic review using Natural Language Processing (NLP) 
2 Method name:  Search Strategy  Search Terms  Data Deduplication  Software Implementation  Evidence Synthesis  Systematic Literature Review  Text mining and keyword co ‐occurrence  networks to identify the most important terms  for a review 
3 The design and implementation of systematic reviews and meta-analyses are often hampered by  high ﬁnancial costs, signiﬁcant time commitment, and biases due to researchers’ familiarity with  studies.
4 We proposed and implemented a fast and standardized method for search term selection  using Natural Language Processing (NLP) and co-occurrence networks to identify relevant search  terms to reduce biases in conducting systematic reviews and meta-analyses.
5 The method was implemented using Python packaged dubbed Ananse, which is benchmarked  on the search terms strategy for naïve search proposed by Grames et al.
6 Ananse was appl

In [75]:
valid_sentences[-50].summarize(True)


The origin file is: <__main__.Document object at 0x7fc3eaeb5d90>
The sentence is:
These eﬀorts  should help facilitate the reproducibility of ecological reviews, enhance transparency, and improve the rigor of evidence used to guide  policy decisions [40] .
The inventory holds:
{'DET': 3, 'NOUN': 8, 'SPACE': 2, 'AUX': 1, 'VERB': 6, 'ADP': 2, 'ADJ': 1, 'PUNCT': 3, 'CCONJ': 1, 'PART': 1, 'X': 3}
The token details are:
[('These', 'DET'), ('eﬀorts', 'NOUN'), (' ', 'SPACE'), ('should', 'AUX'), ('help', 'VERB'), ('facilitate', 'VERB'), ('the', 'DET'), ('reproducibility', 'NOUN'), ('of', 'ADP'), ('ecological', 'ADJ'), ('reviews', 'NOUN'), (',', 'PUNCT'), ('enhance', 'VERB'), ('transparency', 'NOUN'), (',', 'PUNCT'), ('and', 'CCONJ'), ('improve', 'VERB'), ('the', 'DET'), ('rigor', 'NOUN'), ('of', 'ADP'), ('evidence', 'NOUN'), ('used', 'VERB'), ('to', 'PART'), ('guide', 'VERB'), (' ', 'SPACE'), ('policy', 'NOUN'), ('decisions', 'NOUN'), ('[', 'X'), ('40', 'X'), (']', 'X'), ('.', 'PUNCT')]
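
The inventory printed above is a count of part-of-speech tags, and `check_validity` treats a sentence as valid when it contains at least one noun and one verb. The same bookkeeping can be sketched with `collections.Counter`, without loading a spaCy model. The token pairs are a slice of the output above; the variable names are illustrative.

```python
from collections import Counter

# A slice of the (text, part-of-speech) pairs from the summary above.
tokens = [('the', 'DET'), ('rigor', 'NOUN'), ('of', 'ADP'), ('evidence', 'NOUN'),
          ('used', 'VERB'), ('to', 'PART'), ('guide', 'VERB')]

# Count how often each part-of-speech tag occurs.
inventory = Counter(pos for _, pos in tokens)

# A Counter returns 0 for missing tags, so no explicit membership test is needed.
is_valid = inventory['NOUN'] > 0 and inventory['VERB'] > 0

print(dict(inventory))
# {'DET': 1, 'NOUN': 2, 'ADP': 1, 'VERB': 2, 'PART': 1}
print(is_valid)
# True
```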


## Tokenization

## Lemmatization

## Stemming

# Bag of Words

# Topic Modeling with LDA

# Topic Clustering