# Text Preprocessing

##### Author: Alex Sherman | alsherman@deloitte.com

In [10]:
pwd

'/Users/alex/Desktop/ml_guild/ml_guild/lessons/lesson7_text_preprocessing'

In [14]:
import os
import configparser
from IPython.core.display import display, HTML

config = configparser.ConfigParser()
config.read('../../config.ini')
DB_PATH = config['NLP']['DB_PATH']

In [15]:
# confirm DB_PATH is correct db directory, otherwise the rest of the code will not work
DB_PATH

'sqlite:///C:\\Users\\alsherman\\Desktop\\PycharmProjects\\firm_initiatives\\ml_guild\\raw_data\\annual_report.db'

In [18]:
# read the oracle 10k documents 

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(DB_PATH)

df = pd.read_sql("SELECT * FROM annual_report WHERE COMPANY = 'oracle'", con=engine)
df.head(25)

Unnamed: 0,annual_report_id,company,report_name,report_year,section_name,section_text,section_type
0,211,oracle,oracle-corporation_annual_report_1994.docx,1994,ORACLE SYSTEMS,,bold
1,212,oracle,oracle-corporation_annual_report_1994.docx,1994,FORM 10-K,(Annual Report) Filed 07/27/94 for the...,bold
2,213,oracle,oracle-corporation_annual_report_1994.docx,1994,SECURITIES AND EXCHANGE COMMISSION,"Washington, D.C. 20549",bold
3,214,oracle,oracle-corporation_annual_report_1994.docx,1994,Form 10-K [X] ANNUAL REPORT PURSUANT TO SECTIO...,,bold
4,215,oracle,oracle-corporation_annual_report_1994.docx,1994,"FOR THE FISCAL YEAR ENDED MAY 31, 1994",OR,bold
5,216,oracle,oracle-corporation_annual_report_1994.docx,1994,[ ] TRANSITION REPORT PURSUANT TO SECTION 13 O...,COMMISSION FILE NUMBER 0-14376,heading
6,217,oracle,oracle-corporation_annual_report_1994.docx,1994,Oracle Systems Corporation,(Exact name of registrant as specified in its ...,bold
7,218,oracle,oracle-corporation_annual_report_1994.docx,1994,SECURITIES REGISTERED PURSUANT TO SECTION 12(B...,(Title of class) Indicate by check mark wheth...,heading
8,219,oracle,oracle-corporation_annual_report_1994.docx,1994,ORACLE SYSTEMS CORPORATION 1994 FORM 10-K ANNU...,PART I i,heading
9,220,oracle,oracle-corporation_annual_report_1994.docx,1994,PART I,,heading


In [19]:
df[df.section_text.str.contains('CEO')].section_name

2452          Board of Directors Composition and category
2454    Attendance of each Director at the Board Meeti...
2458    Brief resume of Directors proposed to be appoi...
2467                                  Compensation policy
2468    Details of remuneration paid to the Directors ...
2496                  Chaitanya Kamat\tMakarand  Padalkar
2594                                  Mr. Chaitanya Kamat
2596                                  Mr. Robert K Weiler
Name: section_name, dtype: object

In [20]:
# example text
text = df.section_text[2452]
text

'The composition of the Board of Directors of the Company (“the Board”) as on March 31, 2011, was as under:  * Only the Audit Committee and Shareholders’ Grievances Committee are considered. All Directorships of Mr. William T Comfort, Jr., Mr. Frank Brienzi, Ms. Dorian Daley, Mr. William Corey West and Mr. Derek H Williams are in foreign companies. None of the directors are related inter se. 1   Mr. Frank Brienzi was appointed as a Director in the Annual General Meeting held on August 25, 2010. 2   Mr. Joseph John was appointed as a Director and Whole‑time Director in the Annual General Meeting held on August 25, 2010. He ceased to be a director with effect from March 31, 2011. 3  Mr. Chaitanya Kamat was appointed as an Additional Director and as the Managing Director and CEO with effect from October 25, 2010 subject to the approval of the members of the Company. 4  Mr. S Venkatachalam was appointed as an Additional Director with effect from October 25, 2010. 5   Mr. William Corey West

### SpaCy

#### Installation:
- conda install -c conda-forge spacy
- Download Microsoft Visual C++: http://landinghub.visualstudio.com/visual-cpp-build-tools

spaCy is a free, open-source library for advanced Natural Language Processing (NLP) in Python.

If you're working with a lot of text, you'll eventually want to know more about it. For example, what's it about? What do the words mean in context? Who is doing what to whom? What companies and products are mentioned? Which texts are similar to each other?

spaCy is designed specifically for production use and helps you build applications that process and "understand" large volumes of text. It can be used to build information extraction or natural language understanding systems, or to pre-process text for deep learning.

spaCy is not research software. It's built on the latest research, but it's designed to get things done. This leads to fairly different design decisions than NLTK or CoreNLP, which were created as platforms for teaching and research. The main difference is that spaCy is integrated and opinionated. spaCy tries to avoid asking the user to choose between multiple algorithms that deliver equivalent functionality. Keeping the menu small lets spaCy deliver generally better performance and developer experience.

### SpaCy Features 

NAME |	DESCRIPTION |
:----- |:------|
Tokenization|Segmenting text into words, punctuations marks etc.|
Part-of-speech (POS) Tagging|Assigning word types to tokens, like verb or noun.|
Dependency Parsing|	Assigning syntactic dependency labels, describing the relations between individual tokens, like subject or object.|
Lemmatization|	Assigning the base forms of words. For example, the lemma of "was" is "be", and the lemma of "rats" is "rat".|
Sentence Boundary Detection (SBD)|	Finding and segmenting individual sentences.|
Named Entity Recognition (NER)|	Labelling named "real-world" objects, like persons, companies or locations.|
Similarity|	Comparing words, text spans and documents and how similar they are to each other.|
Text Classification|	Assigning categories or labels to a whole document, or parts of a document.|
Rule-based Matching|	Finding sequences of tokens based on their texts and linguistic annotations, similar to regular expressions.|
Training|	Updating and improving a statistical model's predictions.|
Serialization|	Saving objects to files or byte strings.|

SOURCE: https://spacy.io/usage/spacy-101

In [21]:
import spacy
from spacy import displacy

In [22]:
# read in a English language model
nlp = spacy.load('en')

In [23]:
# instantiate the document text
doc = nlp(text)

In [24]:
# view the text
doc

The composition of the Board of Directors of the Company (“the Board”) as on March 31, 2011, was as under:  * Only the Audit Committee and Shareholders’ Grievances Committee are considered. All Directorships of Mr. William T Comfort, Jr., Mr. Frank Brienzi, Ms. Dorian Daley, Mr. William Corey West and Mr. Derek H Williams are in foreign companies. None of the directors are related inter se. 1   Mr. Frank Brienzi was appointed as a Director in the Annual General Meeting held on August 25, 2010. 2   Mr. Joseph John was appointed as a Director and Whole‑time Director in the Annual General Meeting held on August 25, 2010. He ceased to be a director with effect from March 31, 2011. 3  Mr. Chaitanya Kamat was appointed as an Additional Director and as the Managing Director and CEO with effect from October 25, 2010 subject to the approval of the members of the Company. 4  Mr. S Venkatachalam was appointed as an Additional Director with effect from October 25, 2010. 5   Mr. William Corey West 

In [7]:
string_formatting_url = 'https://spacy.io/assets/img/pipeline.svg'
iframe = '<iframe src={} width=1000 height=200></iframe>'.format(string_formatting_url)
HTML(iframe)

### Tokenization

spaCy first tokenizes the text, i.e. segments it into words, punctuation and so on. This is done by applying rules specific to each language. For example, punctuation at the end of a sentence should be split off – whereas "U.K." should remain one token. 

In [8]:
string_formatting_url = 'https://spacy.io/assets/img/tokenization.svg'
iframe = '<iframe src={} width=650 height=400></iframe>'.format(string_formatting_url)
HTML(iframe)

### Part-of-speech (POS) Tagging

After tokenization, spaCy can parse and tag a given Doc. This is where the statistical model comes in, which enables spaCy to make a prediction of which tag or label most likely applies in this context. A model consists of binary data and is produced by showing a system enough examples for it to make predictions that generalise across the language – for example, a word following "the" in English is most likely a noun.

Annotation | Description
:----- |:------|
Text |The original word text|
Lemma |The base form of the word.|
POS |The simple part-of-speech tag.|
Tag |The detailed part-of-speech tag.|
Dep |Syntactic dependency, i.e. the relation between tokens.|
Shape |The word shape – capitalisation, punctuation, digits.|
Is Alpha |Is the token an alpha character?|
Is Stop |Is the token part of a stop list, i.e. the most common words of the language?|

In [91]:
print('{:13} | {:13} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} | '.format(
    'text', 'lemma_', 'pos_', 'tag_', 'dep_', 'shape_', 'is_alpha', 'is_stop'))
print('_'*100)

for token in doc:
    print('{:13} | {:13} | {:8} | {:8} | {:11} | {:8} | {:8} | {:8} |'.format(
          token.text, token.lemma_, token.pos_, token.tag_, token.dep_
        , token.shape_, token.is_alpha, token.is_stop))

text          | lemma_        | pos_     | tag_     | dep_        | shape_   | is_alpha | is_stop  | 
____________________________________________________________________________________________________
              |               | SPACE    |          |             |          |        0 |        0 |
The           | the           | DET      | DT       | det         | Xxx      |        1 |        0 |
Company       | company       | PROPN    | NNP      | compound    | Xxxxx    |        1 |        0 |
designs       | design        | VERB     | VBZ      | ROOT        | xxxx     |        1 |        0 |
,             | ,             | PUNCT    | ,        | punct       | ,        |        0 |        0 |
develops      | develop       | VERB     | VBZ      | conj        | xxxx     |        1 |        0 |
,             | ,             | PUNCT    | ,        | punct       | ,        |        0 |        0 |
markets       | market        | NOUN     | NNS      | conj        | xxxx     |        1 | 

,             | ,             | PUNCT    | ,        | punct       | ,        |        0 |        0 |
was           | be            | VERB     | VBD      | auxpass     | xxx      |        1 |        1 |
incorporated  | incorporate   | VERB     | VBN      | ROOT        | xxxx     |        1 |        0 |
in            | in            | ADP      | IN       | prep        | xx       |        1 |        1 |
June          | june          | PROPN    | NNP      | pobj        | Xxxx     |        1 |        0 |
1977          | 1977          | NUM      | CD       | nummod      | dddd     |        0 |        0 |
.             | .             | PUNCT    | .        | punct       | .        |        0 |        0 |
Unless        | unless        | ADP      | IN       | mark        | Xxxxx    |        1 |        0 |
the           | the           | DET      | DT       | det         | xxx      |        1 |        1 |
context       | context       | NOUN     | NN       | nsubj       | xxxx     |        1 |  

In [None]:
displacy.serve(doc, style='dep')

### Named Entity Recognition (NER)

A named entity is a "real-world object" that's assigned a name – for example, a person, a country, a product or a book title. spaCy can recognise various types of named entities in a document, by asking the model for a prediction. 

In [9]:
for ent in doc.ents:
    print('{:10} | {:50} '.format(ent.label_, ent.text))

NameError: name 'doc' is not defined

In [None]:
displacy.serve(doc, style='ent')

In [None]:
# observe the named entities tagged as PERSON
for ent in doc.ents:
    if 'PERSON' in ent.label_:
        print(ent)

In [None]:
# observe the named entities tagged as ORG (organization)
for ent in doc.ents:
    if 'ORG' in ent.label_:
        print(ent, ent.label_)

In [None]:
for chunk in doc.noun_chunks:
    print(chunk.text, chunk.root.text, chunk.root.dep_,
          chunk.root.head.text)

### Text Matches

In [105]:
from spacy.matcher import Matcher
matcher = Matcher(nlp.vocab)

matched_sents = [] # collect data of matched sentences to be visualized

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    sent = span.sent # sentence containing matched span
    # append mock entity for match in displaCy style to matched_sents
    # get the match span by ofsetting the start and end of the span with the
    # start and end of the sentence in the doc
    match_ents = [{'start': span.start_char - sent.start_char, 
                   'end': span.end_char - sent.start_char,
                   'label': 'MATCH'}]
    matched_sents.append({'text': sent.text, 'ents': match_ents,'span':span })

pattern = [{'POS':'NOUN', 'OP':'+'},{'LOWER':'services'}]

#pattern = [{'TAG': 'VBN'},{'TAG':'IN','OP': '+'},{'TAG':'DT','OP': '+'},{'TAG': 'NNP', 'OP': '*'}]#,{},{'TAG': 'CD'}]#,{'TAG': 'CD', 'OP': '+'}]#{'POS': 'ADV', 'OP': '*'}]#,{'POS': 'ADJ'}]
matcher.add('FacebookIs', collect_sents, pattern) # add pattern

In [138]:
from spacy.matcher import Matcher
from collections import defaultdict

matcher = Matcher(nlp.vocab)
entities = defaultdict(int)

def collect_sents(matcher, doc, i, matches):
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    entities[span.text] += 1

pattern = [{'POS':'NOUN', 'OP':'+'},{'LOWER':'services'}]

matcher.add('entities', collect_sents, pattern) # add pattern

In [140]:
for section in text[0:30]:
    matcher(nlp(section)) # match on your text
entities

defaultdict(int,
            {'consulting services': 2,
             'customer support services': 2,
             'education services': 9,
             'integration services': 5,
             'maintenance services': 1,
             'multimedia services': 1,
             'support services': 10,
             'systems integration services': 5})

### Exercise 
get all sentences with word risk for topic analysis

In [236]:
matcher = Matcher(nlp.vocab)
pattern = [{'POS':'NOUN', 'OP':'+'},{'LOWER':'risk'}]
risks = defaultdict(int)

def collect_sents(matcher, doc, i, matches):
    print(matches)
    print()
    match_id, start, end = matches[i]
    span = doc[start : end] # matched span
    risks[span.text.lower()] += 1

matcher.add('risk', collect_sents, pattern) # add pattern

In [None]:
groupby_text = df.groupby('report_year')['section_text'].sum()
section_text_by_year = pd.DataFrame(groupby_text).reset_index()

years = {}
for ind, row in section_text_by_year.iterrows():
    print(ind)
    #if ind == 1:
    #    break
    year = row['report_year']
    doc = nlp(row['section_text'])
    matcher(doc) # match on your text
    years[year] = risks.copy()

0
[(14326900376835226264, 15001, 15003)]

1
[(14326900376835226264, 14312, 14314)]

2
[(14326900376835226264, 17458, 17460)]

3
[(14326900376835226264, 18108, 18110)]

4
[(14326900376835226264, 13084, 13087), (14326900376835226264, 13085, 13087), (14326900376835226264, 13093, 13096), (14326900376835226264, 13094, 13096), (14326900376835226264, 13162, 13164), (14326900376835226264, 13236, 13238), (14326900376835226264, 17443, 17445)]

[(14326900376835226264, 13084, 13087), (14326900376835226264, 13085, 13087), (14326900376835226264, 13093, 13096), (14326900376835226264, 13094, 13096), (14326900376835226264, 13162, 13164), (14326900376835226264, 13236, 13238), (14326900376835226264, 17443, 17445)]

[(14326900376835226264, 13084, 13087), (14326900376835226264, 13085, 13087), (14326900376835226264, 13093, 13096), (14326900376835226264, 13094, 13096), (14326900376835226264, 13162, 13164), (14326900376835226264, 13236, 13238), (14326900376835226264, 17443, 17445)]

[(14326900376835226264, 13

[(14326900376835226264, 18267, 18269), (14326900376835226264, 18323, 18325), (14326900376835226264, 18326, 18328), (14326900376835226264, 18334, 18336), (14326900376835226264, 19836, 19842), (14326900376835226264, 19837, 19842), (14326900376835226264, 19838, 19842), (14326900376835226264, 19839, 19842), (14326900376835226264, 19840, 19842), (14326900376835226264, 19870, 19875), (14326900376835226264, 19871, 19875), (14326900376835226264, 19872, 19875), (14326900376835226264, 19873, 19875), (14326900376835226264, 24383, 24385), (14326900376835226264, 25318, 25320), (14326900376835226264, 25374, 25376), (14326900376835226264, 25377, 25379), (14326900376835226264, 26088, 26094), (14326900376835226264, 26089, 26094), (14326900376835226264, 26090, 26094), (14326900376835226264, 26091, 26094), (14326900376835226264, 26092, 26094), (14326900376835226264, 26126, 26131), (14326900376835226264, 26127, 26131), (14326900376835226264, 26128, 26131), (14326900376835226264, 26129, 26131)]

[(14326900

[(14326900376835226264, 20333, 20335), (14326900376835226264, 20356, 20359), (14326900376835226264, 20357, 20359), (14326900376835226264, 20980, 20986), (14326900376835226264, 20981, 20986), (14326900376835226264, 20982, 20986), (14326900376835226264, 20983, 20986), (14326900376835226264, 20984, 20986), (14326900376835226264, 26289, 26291), (14326900376835226264, 27827, 27829), (14326900376835226264, 27883, 27885), (14326900376835226264, 27886, 27888), (14326900376835226264, 29128, 29131), (14326900376835226264, 29129, 29131), (14326900376835226264, 29305, 29311), (14326900376835226264, 29306, 29311), (14326900376835226264, 29307, 29311), (14326900376835226264, 29308, 29311), (14326900376835226264, 29309, 29311)]

[(14326900376835226264, 20333, 20335), (14326900376835226264, 20356, 20359), (14326900376835226264, 20357, 20359), (14326900376835226264, 20980, 20986), (14326900376835226264, 20981, 20986), (14326900376835226264, 20982, 20986), (14326900376835226264, 20983, 20986), (14326900

[(14326900376835226264, 23173, 23175), (14326900376835226264, 23895, 23900), (14326900376835226264, 23896, 23900), (14326900376835226264, 23897, 23900), (14326900376835226264, 23898, 23900), (14326900376835226264, 28332, 28334), (14326900376835226264, 30100, 30102), (14326900376835226264, 30156, 30158), (14326900376835226264, 30159, 30161), (14326900376835226264, 30162, 30164), (14326900376835226264, 30905, 30908), (14326900376835226264, 30906, 30908), (14326900376835226264, 31168, 31174), (14326900376835226264, 31169, 31174), (14326900376835226264, 31170, 31174), (14326900376835226264, 31171, 31174), (14326900376835226264, 31172, 31174)]

[(14326900376835226264, 23173, 23175), (14326900376835226264, 23895, 23900), (14326900376835226264, 23896, 23900), (14326900376835226264, 23897, 23900), (14326900376835226264, 23898, 23900), (14326900376835226264, 28332, 28334), (14326900376835226264, 30100, 30102), (14326900376835226264, 30156, 30158), (14326900376835226264, 30159, 30161), (14326900

[(14326900376835226264, 26826, 26828), (14326900376835226264, 27801, 27806), (14326900376835226264, 27802, 27806), (14326900376835226264, 27803, 27806), (14326900376835226264, 27804, 27806), (14326900376835226264, 34370, 34372), (14326900376835226264, 42924, 42926), (14326900376835226264, 42980, 42982), (14326900376835226264, 42983, 42985), (14326900376835226264, 42986, 42988), (14326900376835226264, 43478, 43481), (14326900376835226264, 43479, 43481), (14326900376835226264, 43716, 43721), (14326900376835226264, 43717, 43721), (14326900376835226264, 43718, 43721), (14326900376835226264, 43719, 43721)]

[(14326900376835226264, 26826, 26828), (14326900376835226264, 27801, 27806), (14326900376835226264, 27802, 27806), (14326900376835226264, 27803, 27806), (14326900376835226264, 27804, 27806), (14326900376835226264, 34370, 34372), (14326900376835226264, 42924, 42926), (14326900376835226264, 42980, 42982), (14326900376835226264, 42983, 42985), (14326900376835226264, 42986, 42988), (14326900

[(14326900376835226264, 4098, 4100), (14326900376835226264, 10193, 10195), (14326900376835226264, 29436, 29441), (14326900376835226264, 29437, 29441), (14326900376835226264, 29438, 29441), (14326900376835226264, 29439, 29441), (14326900376835226264, 35848, 35850)]

[(14326900376835226264, 4098, 4100), (14326900376835226264, 10193, 10195), (14326900376835226264, 29436, 29441), (14326900376835226264, 29437, 29441), (14326900376835226264, 29438, 29441), (14326900376835226264, 29439, 29441), (14326900376835226264, 35848, 35850)]

[(14326900376835226264, 4098, 4100), (14326900376835226264, 10193, 10195), (14326900376835226264, 29436, 29441), (14326900376835226264, 29437, 29441), (14326900376835226264, 29438, 29441), (14326900376835226264, 29439, 29441), (14326900376835226264, 35848, 35850)]

[(14326900376835226264, 4098, 4100), (14326900376835226264, 10193, 10195), (14326900376835226264, 29436, 29441), (14326900376835226264, 29437, 29441), (14326900376835226264, 29438, 29441), (143269003768

In [None]:
years

In [201]:
pd.DataFrame(years).T

Unnamed: 0,credit risk,currency exchange risk,currency risk,customer risk,default risk,enterprise risk,equity hedge minimizes currency risk,equity price risk,exchange risk,hedge minimizes currency risk,...,liquidity risk,litigation risk,market rate risk,market risk,minimizes currency risk,mitigates credit risk,price risk,rate risk,reinvestment risk,yen equity hedge minimizes currency risk
1994,15.0,,,,,,,,,,...,,,,,,,,,,
1995,15.0,,,,,,,,,,...,,,,,,,,,,
1996,15.0,,,,,,,,,,...,,,,,,,,,,
1997,15.0,,,,,,,,,,...,,,,,,,,,,
1998,15.0,,15.0,,,,,,,,...,,,15.0,,,,,30.0,15.0,
1999,15.0,,,,30.0,,,,,,...,,,,30.0,,,,,,
2000,15.0,,,,30.0,,,,,,...,,,,30.0,,,,,,
2001,15.0,,,,30.0,,,,,,...,,,,30.0,,,,,15.0,
2002,15.0,,60.0,,45.0,,60.0,,,60.0,...,,,,60.0,60.0,,,,,30.0
2003,15.0,,45.0,,15.0,,30.0,15.0,,30.0,...,,,,30.0,30.0,,15.0,15.0,,30.0


In [216]:
df[df['section_text'].str.contains('risk')]

Unnamed: 0,annual_report_id,company,report_name,report_year,section_name,section_text,section_type
28,239,oracle,oracle-corporation_annual_report_1994.docx,1994,Additional Customer Information,Revenues from international customers (includ...,heading
68,279,oracle,oracle-corporation_annual_report_1994.docx,1994,Concentration of Credit Risk,Financial instruments which potentially subje...,heading
124,443,oracle,oracle-corporation_annual_report_1995.docx,1995,Additional Customer Information,Revenues from international customers (includ...,bold
165,484,oracle,oracle-corporation_annual_report_1995.docx,1995,Concentration of Credit Risk,Financial instruments which potentially subje...,bold
203,522,oracle,oracle-corporation_annual_report_1996.docx,1996,FORWARD-LOOKING STATEMENTS,"In addition to historical information, this A...",heading
238,557,oracle,oracle-corporation_annual_report_1996.docx,1996,FACTORS THAT MAY AFFECT FUTURE RESULTS AND MAR...,The Company operates in a rapidly changing en...,heading
261,580,oracle,oracle-corporation_annual_report_1996.docx,1996,Concentration of Credit Risk,Financial instruments which potentially subje...,heading
292,1188,oracle,oracle-corporation_annual_report_1997.docx,1997,FORWARD-LOOKING STATEMENTS,"In addition to historical information, this A...",heading
325,1221,oracle,oracle-corporation_annual_report_1997.docx,1997,FACTORS THAT MAY AFFECT FUTURE RESULTS AND MAR...,The Company operates in a rapidly changing en...,heading
350,1246,oracle,oracle-corporation_annual_report_1997.docx,1997,Concentration of Credit Risk,Financial instruments which potentially subje...,bold


In [230]:
displacy.serve(matched_sents, style='ent', manual=True)

TypeError: tuple indices must be integers or slices, not str