# NLP Algorithms

## SpaCy
SpaCy is fast and agile. It’s designed to amp up cutting edge NLP by making it practical and accessible. It works with other well-known libraries like Gensim and Scikit Learn. Written in Python and Cython, it’s optimized for performance and allows developers a more natural path to more advanced NLP tasks like named entity recognition.

# Prepareing Bibliography
This is necessary to find the files attached in the Zotero Library.

In [51]:
import pybtex

In [88]:
class Library:

    def __init__(self, path, format='bibtex'):
        self.path = path
        self.library = pybtex.database.parse_file(path, bib_format=format)
        self.entries = []

    def collect_entries(self):
        for entry in self.library.entries:
            self.entries.append(self.library.entries[entry])

In [89]:
library = Library('/Users/paul/Desktop/FOM_MSc_Thesis.bib')
library.entries

[]

In [90]:
library.collect_entries()
library.entries

[Entry('article',
   fields=[
     ('title', 'A systematic literature review of data governance and cloud data governance'), 
     ('volume', '23'), 
     ('issn', '1617-4909, 1617-4917'), 
     ('url', 'http://link.springer.com/10.1007/s00779-017-1104-3'), 
     ('doi', '10.1007/s00779-017-1104-3'), 
     ('abstract', 'Data management solutions on their own are becoming very expensive and not able to cope with the reality of everlasting data complexity. Businesses have grown more sophisticated in their use of data, which drives new demands that require different ways to handle this data. Forward thinking organizations believe that the only way to solve the data problem will be the implementation of an effective data governance. Attempts in governing data failed before, as they were driven by IT, and affected by rigid processes and fragmented activities carried out on system by system basis. Up to very recently governance is mostly informal with very ambiguous and generic regulations, 

In [145]:
class Document:
    
    def __init__(self, entry):
        self.entry = entry
        self.title = self.entry.fields['title']
        self.fields = self.entry.fields.keys()
        if 'file' in self.fields:
           self.file = self.entry.fields['file'].split('/Users/paul/Zotero/storage/')[1].split(':application/')[0]
        else:
            self.file = None

In [146]:
documents = []

for entry in library.entries:
    document = Document(entry)
    documents.append(document)

In [148]:
#print tiltles and paths for files in bibtexfile. count documents with filepaht

counter = 0
for document in documents:
    print(document.title, document.file)
    if document.file is not None:
        counter += 1
print(counter)


A systematic literature review of data governance and cloud data governance KJ3KT8X6/Al-Ruithe et al. - 2019 - A systematic literature review of data governance .pdf
Data governance: {A} conceptual framework, structured review, and research agenda F4RDFWV2/Abraham et al. - 2019 - Data governance A conceptual framework, structure.pdf
Systematic review of data-centric approaches in artificial intelligence and machine learning - {ScienceDirect} None
Data governance: {A} conceptual framework, structured review, and research agenda LIHTS4UD/Abraham et al. - 2019 - Data governance A conceptual framework, structured review, and research agenda.pdf
Information security management needs more holistic approach: {A} literature review Q2PXU82Q/Soomro et al. - 2016 - Information security management needs more holistic approach A literature review.pdf
A systematic literature review of data governance and cloud data governance S5M6U3UU/Al-Ruithe et al. - 2019 - A systematic literature review of data 

# Extracting Text from PDFs

In [10]:
from pdfminer.high_level import extract_text
import re


In [151]:
base_path = '/Users/paul/Zotero/storage/'
file_paths = [document.file for document in documents if document.file is not None]
file_paths[:5]

['KJ3KT8X6/Al-Ruithe et al. - 2019 - A systematic literature review of data governance .pdf',
 'F4RDFWV2/Abraham et al. - 2019 - Data governance A conceptual framework, structure.pdf',
 'LIHTS4UD/Abraham et al. - 2019 - Data governance A conceptual framework, structured review, and research agenda.pdf',
 'Q2PXU82Q/Soomro et al. - 2016 - Information security management needs more holistic approach A literature review.pdf',
 'S5M6U3UU/Al-Ruithe et al. - 2019 - A systematic literature review of data governance and cloud data governance.pdf']

In [154]:
text = extract_text(base_path+file_paths[0])

In [155]:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|(\n){2,}',text)
sentences = [sentence.replace('\n',' ') for sentence in sentences if sentence not in [None,'\n','',' ','  ']]
sentences = [sentence for sentence in sentences if not re.match(r'^[^a-zA-Z]*$', sentence)]
for sentence in sentences:
    print(repr(sentence))


'See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/322259947'
'A systematic literature review of data governance and cloud data governance'
'Article\xa0\xa0in\xa0\xa0Personal and Ubiquitous Computing · November 2019'
'DOI: 10.1007/s00779-017-1104-3'
'CITATIONS 105'
'3 authors:'
'READS 11,596'
'Majid Al-Ruithe'
'Ministry of Interior,Saudi Arabia,GDP'
'13 PUBLICATIONS\xa0\xa0\xa0414 CITATIONS\xa0\xa0\xa0'
'Elhadj Benkhelifa'
'Staffordshire University'
'177 PUBLICATIONS\xa0\xa0\xa03,222 CITATIONS\xa0\xa0\xa0'
'SEE PROFILE'
'SEE PROFILE'
'Khawar Hameed'
'Birmingham City University'
'19 PUBLICATIONS\xa0\xa0\xa0463 CITATIONS\xa0\xa0\xa0'
'SEE PROFILE'
'All content following this page was uploaded by Elhadj Benkhelifa on 18 December 2019.'
' The user has requested enhancement of the downloaded file.'
' \x0cA Systematic Literature Review of Data Governance &  Cloud Data Governance  '
'Majid Al-Ruithe[1], Elhadj Benkhelifa[2] , Khawar 

# Preprocessing

In [156]:
import spacy
nlp = spacy.load("en_core_web_trf")

In [157]:
class Sentence:
    def __init__(self, sentence, file):
        self.file = file
        self.sentence = sentence 
        self.tokens = None
        self.inventory = None
        self.contains_noun = None
        self.contains_verb = None
        self.contains_cid = None
        self.valid = None

         #corp sentence to beginning based on first alphabtic character
        for i, char in enumerate(self.sentence):
            if char.isalpha():
                self.sentence = self.sentence[i:]
                break
        
        #replace tailing digits on words. those digits are usually footnotes
        self.sentence = re.sub(r'[A-Za-z]\d+\b', '', self.sentence)

    def tokenize(self):
        self.tokens = [(word.text, word.pos_) for word in nlp(self.sentence)]
    
    def count_tokens(self):
        if self.tokens is None:
            self.tokenize()

        inventory = {}
        for _, value in self.tokens:
            inventory[value] = inventory.get(value, 0) + 1
    
        self.inventory = inventory
    
    def check_validity(self):
        if self.inventory is None:
            self.count_tokens()

        word_types = self.inventory.keys()

        if 'NOUN' in word_types:
            self.contains_noun = True
        else:
            self.contains_verb = False

        if 'VERB' in word_types:
            self.contains_verb = True
        else:
            self.contains_verb = False

        if re.match(r'\(cid:\d{1,4}\)', self.sentence):
            self.contains_cid = True
        
        if self.contains_noun and self.contains_verb:
            self.valid = True
        else:
            self.valid = False
    
    def summarize(self, show_token_details=False):
        print(f'The origin file is: {self.file}')
        print(f'The sentence is:\n{self.sentence}')
        print(f'The inventory holds:\n{self.inventory}')
        if show_token_details:
            print(f'The token details are:\n{self.tokens}')

         
        

In [158]:
valid_sentences = []

for sen in sentences:
    Sen = Sentence(sen, file)
    Sen.check_validity()
    if Sen.valid:
        valid_sentences.append(Sen)

In [159]:
for s in valid_sentences:
    print(s.sentence)

See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/322259947
SEE PROFILE
SEE PROFILE
SEE PROFILE
All content following this page was uploaded by Elhadj Benkhelifa on 18 December 2019.
The user has requested enhancement of the downloaded file.
Data management solutions on their own are becoming  very expensive and  not able to cope with the reality of everlasting data complexity.
Businesses have grown  more sophisticated in their use of data, which drives new demands that require different  ways to handle this data.
Forward thinking organizations believe that the only way to  solve the data problem will be the implementation of an effective data governance.
At- tempts in governing data failed before, as they were driven by IT, and affected by rigid  processes and fragmented activities carried out on system by system basis.
Up to very  recently governance is mostly informal with very ambiguous and generic regulations, in  siloes 

In [162]:
valid_sentences[157].summarize(True)


The origin file is: /Users/paul/Zotero/storage/2UPT8ZFX/Auer und Claessens - 2018 - Regulating cryptocurrencies assessing market reac.pdf
The sentence is:
Out of 41  records, the total studies on data governance for non-cloud computing which were pub- lished by practice oriented were 26.83 % (n=11), while 73.17 % (n=30) were published  by academic research.
The inventory holds:
{'ADP': 6, 'NUM': 5, 'SPACE': 2, 'NOUN': 9, 'PUNCT': 8, 'DET': 1, 'ADJ': 5, 'PRON': 1, 'AUX': 3, 'VERB': 3, 'SCONJ': 1}
The token details are:
[('Out', 'ADP'), ('of', 'ADP'), ('41', 'NUM'), (' ', 'SPACE'), ('records', 'NOUN'), (',', 'PUNCT'), ('the', 'DET'), ('total', 'ADJ'), ('studies', 'NOUN'), ('on', 'ADP'), ('data', 'NOUN'), ('governance', 'NOUN'), ('for', 'ADP'), ('non', 'ADJ'), ('-', 'ADJ'), ('cloud', 'ADJ'), ('computing', 'NOUN'), ('which', 'PRON'), ('were', 'AUX'), ('pub-', 'PUNCT'), ('lished', 'VERB'), ('by', 'ADP'), ('practice', 'NOUN'), ('oriented', 'VERB'), ('were', 'AUX'), ('26.83', 'NUM'), ('%', 'N

## Tokenization

## Lemmatization

## Stemming

# Bag of Words

# Topic Modeling with LDA

# Topic Clustering