# NLP Algorithms

## SpaCy
SpaCy is fast and agile. It’s designed to amp up cutting edge NLP by making it practical and accessible. It works with other well-known libraries like Gensim and Scikit Learn. Written in Python and Cython, it’s optimized for performance and allows developers a more natural path to more advanced NLP tasks like named entity recognition.

# Prepareing Bibliography
This is necessary to find the files attached in the Zotero Library.

In [51]:
import pybtex

In [77]:
class Library:

    def __init__(self, path, format='bibtex'):
        self.path = path
        self.library = pybtex.database.parse_file(path, bib_format=format)
        self.entries = self.library.entries

In [79]:
library = Library('/Users/paul/Desktop/FOM_MSc_Thesis.bib')

OrderedCaseInsensitiveDict([('al-ruithe_systematic_2019', Entry('article',
  fields=[
    ('title', 'A systematic literature review of data governance and cloud data governance'), 
    ('volume', '23'), 
    ('issn', '1617-4909, 1617-4917'), 
    ('url', 'http://link.springer.com/10.1007/s00779-017-1104-3'), 
    ('doi', '10.1007/s00779-017-1104-3'), 
    ('abstract', 'Data management solutions on their own are becoming very expensive and not able to cope with the reality of everlasting data complexity. Businesses have grown more sophisticated in their use of data, which drives new demands that require different ways to handle this data. Forward thinking organizations believe that the only way to solve the data problem will be the implementation of an effective data governance. Attempts in governing data failed before, as they were driven by IT, and affected by rigid processes and fragmented activities carried out on system by system basis. Up to very recently governance is mostly info

In [None]:
class Document:
    
    def __init__(self, entry):
        self.entry = entry
        self.title = 
    

In [74]:
for entry in library:
    print(library[entry].fields['title'])


A systematic literature review of data governance and cloud data governance
Data governance: {A} conceptual framework, structured review, and research agenda
Systematic review of data-centric approaches in artificial intelligence and machine learning - {ScienceDirect}
Data governance: {A} conceptual framework, structured review, and research agenda
Information security management needs more holistic approach: {A} literature review
A systematic literature review of data governance and cloud data governance
Data {Governance}
Data governance: {Organizing} data for trustworthy {Artificial} {Intelligence}
Data governance activities: an analysis of the literature
Designing data governance
The need for data governance: {A} case study
Data governance, data literacy and the management of data quality
A {MORPHOLOGY} {OF} {THE} {ORGANISATION} {OF} {DATA} {GOVERNANCE}
The {CARE} {Principles} for {Indigenous} {Data} {Governance}
Some {Practical} {Experiences} in {Data} {Governance}
A {Model} for {D

# Extracting Text from PDFs

In [10]:
from pdfminer.high_level import extract_text
import re


In [11]:
file = '/Users/paul/Zotero/storage/2UPT8ZFX/Auer und Claessens - 2018 - Regulating cryptocurrencies assessing market reac.pdf'
text = extract_text(file)

In [12]:
print(text)

Raphael Auer 

Stijn Claessens

raphael.auer@bis.org

stijn.claessens@bis.org

Regulating cryptocurrencies: assessing market 
reactions1 

Cryptocurrencies are often thought to operate out of the reach of national regulation, but in fact 
their valuations, transaction volumes and user bases react substantially to news about regulatory 
actions. The impact depends on the specific regulatory category to which the news relates: events 
related to general bans on cryptocurrencies or to their treatment under securities law have the 
greatest adverse effect, followed by news on combating money laundering and the financing of 
terrorism, and on restricting the interoperability of cryptocurrencies with regulated markets. News 
pointing to the establishment of specific legal frameworks tailored to cryptocurrencies and initial 
coin  offerings  coincides  with  strong  market  gains.  These  results  suggest  that  cryptocurrency 
markets rely on regulated financial institutions to operate and t

In [13]:
sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s|(\n){2,}',text)
sentences = [sentence.replace('\n',' ') for sentence in sentences if sentence not in [None,'\n','',' ','  ']]
sentences = [sentence for sentence in sentences if not re.match(r'^[^a-zA-Z]*$', sentence)]
for sentence in sentences:
    print(repr(sentence))


'Raphael Auer '
'Stijn Claessens'
'raphael.auer@bis.org'
'stijn.claessens@bis.org'
'Regulating cryptocurrencies: assessing market  reactions1 '
'Cryptocurrencies are often thought to operate out of the reach of national regulation, but in fact  their valuations, transaction volumes and user bases react substantially to news about regulatory  actions.'
'The impact depends on the specific regulatory category to which the news relates: events  related to general bans on cryptocurrencies or to their treatment under securities law have the  greatest adverse effect, followed by news on combating money laundering and the financing of  terrorism, and on restricting the interoperability of cryptocurrencies with regulated markets.'
'News  pointing to the establishment of specific legal frameworks tailored to cryptocurrencies and initial  coin  offerings  coincides  with  strong  market  gains.'
' These  results  suggest  that  cryptocurrency  markets rely on regulated financial institutions to o

# Preprocessing

In [5]:
import spacy
nlp = spacy.load("en_core_web_trf")

In [45]:
class Sentence:
    def __init__(self, sentence, file):
        self.file = file
        self.sentence = sentence 
        self.tokens = None
        self.inventory = None
        self.contains_noun = None
        self.contains_verb = None
        self.contains_cid = None
        self.valid = None

         #corp sentence to beginning based on first alphabtic character
        for i, char in enumerate(self.sentence):
            if char.isalpha():
                self.sentence = self.sentence[i:]
                break
        
        #replace tailing digits on words. those digits are usually footnotes
        self.sentence = re.sub(r'[A-Za-z]\d+\b', '', self.sentence)

    def tokenize(self):
        self.tokens = [(word.text, word.pos_) for word in nlp(self.sentence)]
    
    def count_tokens(self):
        if self.tokens is None:
            self.tokenize()

        inventory = {}
        for _, value in self.tokens:
            inventory[value] = inventory.get(value, 0) + 1
    
        self.inventory = inventory
    
    def check_validity(self):
        if self.inventory is None:
            self.count_tokens()

        word_types = self.inventory.keys()

        if 'NOUN' in word_types:
            self.contains_noun = True
        else:
            self.contains_verb = False

        if 'VERB' in word_types:
            self.contains_verb = True
        else:
            self.contains_verb = False

        if re.match(r'\(cid:\d{1,4}\)', self.sentence):
            self.contains_cid = True
        
        if self.contains_noun and self.contains_verb:
            self.valid = True
        else:
            self.valid = False
    
    def summarize(self, show_token_details=False):
        print(f'The origin file is: {self.file}')
        print(f'The sentence is:\n{self.sentence}')
        print(f'The inventory holds:\n{self.inventory}')
        if show_token_details:
            print(f'The token details are:\n{self.tokens}')

         
        

In [48]:
valid_sentences = []

for sen in sentences:
    Sen = Sentence(sen, file)
    Sen.check_validity()
    if Sen.valid:
        valid_sentences.append(Sen)

In [49]:
for s in valid_sentences:
    print(s.sentence)

Regulating cryptocurrencies: assessing market  reaction 
Cryptocurrencies are often thought to operate out of the reach of national regulation, but in fact  their valuations, transaction volumes and user bases react substantially to news about regulatory  actions.
The impact depends on the specific regulatory category to which the news relates: events  related to general bans on cryptocurrencies or to their treatment under securities law have the  greatest adverse effect, followed by news on combating money laundering and the financing of  terrorism, and on restricting the interoperability of cryptocurrencies with regulated markets.
News  pointing to the establishment of specific legal frameworks tailored to cryptocurrencies and initial  coin  offerings  coincides  with  strong  market  gains.
These  results  suggest  that  cryptocurrency  markets rely on regulated financial institutions to operate and that these markets are segmented  across jurisdictions, bringing cryptocurrencies wi

In [28]:
valid_sentences[157].summarize(True)


The origin file is: /Users/paul/Zotero/storage/2UPT8ZFX/Auer und Claessens - 2018 - Regulating cryptocurrencies assessing market reac.pdf
The sentence is:
 This index captures how, on a given day, regulatory events would have moved  the price of bitcoin.
The inventory holds:
{'SPACE': 2, 'DET': 3, 'NOUN': 5, 'VERB': 3, 'SCONJ': 1, 'PUNCT': 3, 'ADP': 2, 'ADJ': 1, 'AUX': 2}
The token details are:
[(' ', 'SPACE'), ('This', 'DET'), ('index', 'NOUN'), ('captures', 'VERB'), ('how', 'SCONJ'), (',', 'PUNCT'), ('on', 'ADP'), ('a', 'DET'), ('given', 'VERB'), ('day', 'NOUN'), (',', 'PUNCT'), ('regulatory', 'ADJ'), ('events', 'NOUN'), ('would', 'AUX'), ('have', 'AUX'), ('moved', 'VERB'), (' ', 'SPACE'), ('the', 'DET'), ('price', 'NOUN'), ('of', 'ADP'), ('bitcoin', 'NOUN'), ('.', 'PUNCT')]


In [67]:
#Test
for sentence in sentences:
    doc = nlp(sentence)
    print([(w.text, w.pos_) for w in doc])

[('Raphael', 'PROPN'), ('Auer', 'PROPN')]
[('Stijn', 'PROPN'), ('Claessens', 'PROPN')]
[('raphael.auer@bis.org', 'PROPN')]
[('stijn.claessens@bis.org', 'PROPN')]
[('Regulating', 'VERB'), ('cryptocurrencies', 'NOUN'), (':', 'PUNCT'), ('assessing', 'VERB'), ('market', 'NOUN'), (' ', 'SPACE'), ('reactions1', 'NOUN')]
[('Cryptocurrencies', 'NOUN'), ('are', 'AUX'), ('often', 'ADV'), ('thought', 'VERB'), ('to', 'PART'), ('operate', 'VERB'), ('out', 'ADP'), ('of', 'ADP'), ('the', 'DET'), ('reach', 'NOUN'), ('of', 'ADP'), ('national', 'ADJ'), ('regulation', 'NOUN'), (',', 'PUNCT'), ('but', 'CCONJ'), ('in', 'ADP'), ('fact', 'NOUN'), (' ', 'SPACE'), ('their', 'PRON'), ('valuations', 'NOUN'), (',', 'PUNCT'), ('transaction', 'NOUN'), ('volumes', 'NOUN'), ('and', 'CCONJ'), ('user', 'NOUN'), ('bases', 'NOUN'), ('react', 'VERB'), ('substantially', 'ADV'), ('to', 'ADP'), ('news', 'NOUN'), ('about', 'ADP'), ('regulatory', 'ADJ'), (' ', 'SPACE'), ('actions', 'NOUN'), ('.', 'PUNCT')]
[('The', 'DET'), ('i

## Tokenization

## Lemmatization

## Stemming

# Bag of Words

# Topic Modeling with LDA

# Topic Clustering