# Text analysis

We use the terms text analysis, text analytics and text mining interchangeably for the process of extracting useful information from text data using statistical algorithms and techniques from natural language processing. Before we can start with text analysis, we need to gather text data and prepare it in ways that make it easier to use with text models.

## Text preprocessing

Texts such as financial reports or news are usually available in digital form, either embedded as content on web pages written in hypertext markup language (HTML) or as downloadable files such as PDFs. If we want to collect text from these sources, we either need to gather it manually or make use of techniques like web scraping. For many types of text, third-party providers have already done this and make the text available upon request. However, we often receive texts in raw format with special characters and other peculiarities that make direct further processing difficult. For instance, take a look at the output below, which shows the beginning of the scraped raw text of [this](https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm) company filing. As you can see, in this example some metadata is even included before the actual content of the report begins.

In [1]:
import pandas as pd 
import sqlite3


conn = sqlite3.connect("../data/dlta_texts.db")
df_filings = pd.read_sql("SELECT * FROM filings", conn)
conn.close()

fst_report = df_filings.text[0]
print(fst_report[:1000])

10-K
 1
 aapl-20220924.htm
 10-K
 
 
 
 aapl-20220924 false 2022 FY 0000320193 P1Y P5Y P1Y 64 P1Y 27 P1Y 7 2 http://fasb.org/us-gaap/2022#OtherAssetsNoncurrent http://fasb.org/us-gaap/2022#OtherAssetsNoncurrent http://fasb.org/us-gaap/2022#PropertyPlantAndEquipmentNet http://fasb.org/us-gaap/2022#PropertyPlantAndEquipmentNet http://fasb.org/us-gaap/2022#OtherLiabilitiesCurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesCurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesNoncurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesNoncurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesCurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesCurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesNoncurrent http://fasb.org/us-gaap/2022#OtherLiabilitiesNoncurrent 0000320193 2021-09-26 2022-09-24 0000320193 us-gaap:CommonStockMember 2021-09-26 2022-09-24 0000320193 aapl:A1.000NotesDue2022Member 2021-09-26 2022-09-24 0000320193 aapl:A1.375NotesDue2024Member 2021-09-26 2022-09-24 0000320193 aapl:A0.
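
Raw text like the output above is typically what remains after markup has been stripped from an HTML page. As a minimal sketch, assuming a small made-up HTML snippet rather than the actual filing, the following uses Python's built-in `html.parser` module to keep only the visible text. The names `TextExtractor` and `html_to_text` are hypothetical and chosen for illustration; this is not necessarily how the filing data above was produced.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and ignore the content of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep non-empty text that is not part of a script or style block
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# made-up example page (an assumption, not taken from the filing)
sample_html = (
    "<html><head><style>p {color: red;}</style></head>"
    "<body><h1>Form 10-K</h1><p>Annual &amp; quarterly reports</p></body></html>"
)
print(html_to_text(sample_html))
```

Real filings contain far more markup and metadata than this snippet, so in practice a dedicated scraping library and additional cleaning rules are usually needed.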

This is what a sentence of the report looks like.

In [2]:
import nltk.data

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
filings_sentences = sent_detector.tokenize(df_filings.text[0].strip())
filings_sentences[40]

'Form 10-K Summary 57 This Annual Report on Form 10-K (“Form 10-K”) contains forward-looking statements, within the meaning of the Private Securities Litigation Reform Act of 1995, that involve risks and uncertainties.'

After a simple preprocessing routine, this is what the sentence looks like.

In [3]:
from gensim.utils import simple_preprocess

# tokenize, lowercase and drop numbers and punctuation, then re-join into one string
" ".join(simple_preprocess(filings_sentences[40]))

'form summary this annual report on form form contains forward looking statements within the meaning of the private securities litigation reform act of that involve risks and uncertainties'

This is usually the textual form used for building and training text models. The basic preprocessing steps used here are:

* remove special characters
* remove numbers
* convert all characters to lowercase
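
The three steps listed above can also be sketched by hand with a regular expression. This is a minimal illustration that assumes plain English text; the function name `basic_preprocess` is hypothetical, and the code is not the implementation used by gensim.

```python
import re


def basic_preprocess(text):
    # step 1: convert all characters to lowercase
    text = text.lower()
    # steps 2 and 3: replace everything that is not a letter or whitespace
    # (numbers, punctuation, special characters) by a space
    text = re.sub(r"[^a-z\s]", " ", text)
    # split on whitespace to obtain the list of words
    return text.split()


print(basic_preprocess("Form 10-K Summary 57!"))
```

Unlike gensim's helper, which by default drops tokens shorter than two characters, this sketch keeps single letters such as the k from 10-K.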

As the *simple_preprocess* example above shows, we can use pre-written helper functions to carry out these preprocessing steps. The function comes from the [gensim](https://radimrehurek.com/gensim/) package, one of several great text analysis libraries available for Python. By default, it also discards tokens shorter than two or longer than 15 characters, which is why the k from 10-K does not appear in the output. Depending on the chosen text model, further optional steps can already be carried out during preprocessing. Some popular choices are:

* removal of stopwords
* stemming
* lemmatization


### Stopwords

While some words are rare and thus very informative for a specific text, other words appear frequently in every text. Examples are the, and, this, that and or. Because they are used so frequently, these words reveal little about the specific content of sentences, paragraphs and so on. This is why we may want to delete them from the original text, which simplifies the processing of the remaining information. The cell below shows the first ten entries of the stopword list of the gensim package. Note that stopword lists need to be defined first and can also be compiled manually.

In [4]:
from gensim.parsing.preprocessing import STOPWORDS

# STOPWORDS is a frozenset, so sort it to get a reproducible order
stopword_list = sorted(STOPWORDS)
for word in stopword_list[:10]:
    print(word)

a
about
above
across
after
afterwards
again
against
all
almost
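
Removing stopwords then amounts to filtering the token list against such a list. The sketch below uses a small, manually compiled stopword set (an assumption made to keep the example self-contained; it is much shorter than the gensim list) and applies it to the preprocessed example sentence from above.

```python
# manually compiled stopword set (illustrative, far from complete)
custom_stopwords = {"this", "on", "the", "of", "that", "and", "within"}

# the preprocessed example sentence from above, split into words
tokens = (
    "form summary this annual report on form form contains forward looking "
    "statements within the meaning of the private securities litigation "
    "reform act of that involve risks and uncertainties"
).split()

# keep only the words that are not in the stopword set
filtered_tokens = [token for token in tokens if token not in custom_stopwords]
print(" ".join(filtered_tokens))
```

Using a set rather than a list for the stopwords keeps each membership check fast, which matters for long documents.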


### Stemming and lemmatization

Stemming and lemmatization both map word variations back to a common base form. While lemmatization maps words to their canonical (dictionary) form, stemming reduces words to their word stem. For instance, the words:

* improve
* improving
* improved

are mapped to improve by lemmatization and to improv by stemming. Lemmatization typically relies on a vocabulary and a morphological analysis of words, while stemming is a rule-based system, which makes it a little easier to implement. Note that different models and rule systems exist for both stemming and lemmatization.
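
To make the difference concrete, here is a toy sketch: a stemmer that strips a few hard-coded suffixes and a lemmatizer that looks words up in a hand-written table. Both the suffix rules and the lookup table are assumptions made for this illustration; real stemmers such as Snowball use far more elaborate rules, and real lemmatizers rely on full dictionaries.

```python
# toy suffix rules, checked in this order (NOT the Snowball rule set)
SUFFIX_RULES = ("ing", "ed", "e")

# hand-written lookup table standing in for a real dictionary
LEMMA_LOOKUP = {"improving": "improve", "improved": "improve"}


def toy_stem(word):
    # strip the first matching suffix, but keep at least three characters
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    # return the dictionary form if known, otherwise the word itself
    return LEMMA_LOOKUP.get(word, word)


for word in ["improve", "improving", "improved"]:
    print(word, toy_stem(word), toy_lemmatize(word))
```

The stemmer needs no vocabulary but produces the non-word improv, whereas the lookup returns a proper word and only works for entries it knows.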

This is the list of preprocessed words from a sentence:

In [5]:
sentence = simple_preprocess(filings_sentences[40])
for word in sentence:
    print(word)

form
summary
this
annual
report
on
form
form
contains
forward
looking
statements
within
the
meaning
of
the
private
securities
litigation
reform
act
of
that
involve
risks
and
uncertainties


This is what the example sentence looks like after stemming:

In [6]:
from nltk.stem.snowball import SnowballStemmer

# `sentence` holds the preprocessed tokens from the previous cell
stemmer = SnowballStemmer(language="english")

for word in sentence:
    print(stemmer.stem(word))

form
summari
this
annual
report
on
form
form
contain
forward
look
statement
within
the
mean
of
the
privat
secur
litig
reform
act
of
that
involv
risk
and
uncertainti


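This is what the example sentence looks like after lemmatization. Words such as contains and looking remain unchanged here, presumably because the lemmatizer treats every word as a noun when no part-of-speech tag is supplied: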

In [7]:
from textblob import Word

for word in sentence:
    w = Word(word)
    print(w.lemmatize())

form
summary
this
annual
report
on
form
form
contains
forward
looking
statement
within
the
meaning
of
the
private
security
litigation
reform
act
of
that
involve
risk
and
uncertainty


## Tokenization

While the simple preprocessing example above splits the sentence into individual words, most state-of-the-art language models use algorithms that are trained to split a text sequence into tokens. This usually includes normalization steps such as removing unneeded white space, stripping accents and lowercasing. The Stanford Natural Language Processing group describes a token as *an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing*. Useful tokens are learned from data by algorithms such as Byte-Pair Encoding or WordPiece. Interested readers can find a first introduction to these algorithms in [Hugging Face's documentation](https://huggingface.co/docs/tokenizers/components#pretokenizers). Below, we take a look at how the pretrained tokenizer of the BERT model splits the example sentence from above.

In [17]:
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.encode(filings_sentences[40]).tokens
for token in tokens:
    print(token)

[CLS]
form
10
-
k
summary
57
this
annual
report
on
form
10
-
k
(
“
form
10
-
k
”
)
contains
forward
-
looking
statements
,
within
the
meaning
of
the
private
securities
litigation
reform
act
of
1995
,
that
involve
risks
and
uncertain
##ties
.
[SEP]
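
In the output above, [CLS] and [SEP] are special tokens added by the tokenizer, and the ## prefix marks a piece that continues the preceding one. The following sketch illustrates the greedy longest-match-first lookup that WordPiece-style tokenizers apply to a single word once a vocabulary has been learned. The tiny vocabulary and the function name `wordpiece_tokenize` are assumptions for illustration; the actual BERT vocabulary contains roughly 30,000 entries, and the library implementation differs in its details.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one word into the longest vocabulary pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # try the longest remaining substring first and shrink it step by step
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # no piece fits, so the whole word is mapped to the unknown token
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


# hypothetical toy vocabulary
toy_vocab = {"uncertain", "##ties", "risk", "##s", "form"}

print(wordpiece_tokenize("uncertainties", toy_vocab))
print(wordpiece_tokenize("risks", toy_vocab))
```

Splitting rare words into frequent pieces lets a model represent words it has never seen as a whole while keeping the vocabulary small.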


## Document and Corpus

Let us clarify some terminology before we dig deeper into text analysis. Text analysis usually starts with a set of documents, which is called a corpus. After a document has been processed, the outcome is a list of tokens. Each distinct token or word is called a term, and the set of all terms in a corpus is called the lexicon. A document is a general notion that can refer to different units such as a sentence, a paragraph or a chapter.
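
These notions can be illustrated with a toy corpus of two made-up documents (an assumption for this sketch; simple whitespace splitting stands in for proper tokenization):

```python
# corpus: a set of documents (here, two short sentences)
corpus = [
    "the company reports revenue",
    "the company reports risks",
]

# each processed document is a list of tokens
tokenized_corpus = [document.split() for document in corpus]

# lexicon: the set of all distinct terms in the corpus
lexicon = sorted({token for document in tokenized_corpus for token in document})

print(tokenized_corpus)
print(lexicon)
```

The corpus contains eight tokens in total but only five terms, because the, company and reports occur in both documents.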

## Common text types for financial text analysis

This course focuses on financial text analysis. We are therefore mainly interested in a domain-specific understanding of texts that are relevant for financial market participants. Popular examples are:

* company filings
* earnings call transcripts
* voluntary reports, e.g., corporate social responsibility or environmental, social and corporate governance reports
* financial news

### 10-K, 10-Q and 8-K filings

One popular type of company filing in the United States is the 10-K report. According to federal securities laws, publicly reporting companies must file these reports on a yearly basis. 10-K reports provide an overview of the company's business and financial condition and also include financial statements. Every company must file this report within a certain time period (60, 75 or 90 days) after the end of its fiscal year. These reports follow a fixed structure and must be signed by executive officers of the company. See [this example](https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019322000108/aapl-20220924.htm) to get an impression of these reports. Other important and interesting filing forms are 10-Q filings (quarterly reports) and 8-K filings (current reports). Below is a list of Apple's recent filings of these form types.

In [10]:
import pandas as pd

apple_filings_list = pd.read_csv("../data/apple_filings_list.csv")
apple_filings_list.head(20)

Unnamed: 0,accessionNumber,filingDate,reportDate,acceptanceDateTime,act,form,fileNumber,filmNumber,items,size,isXBRL,isInlineXBRL,primaryDocument,primaryDocDescription,ticker,cik
0,0000320193-23-000077,2023-08-04,2023-07-01,2023-08-03T18:04:43.000Z,34.0,10-Q,001-36743,231141522,,5939898,1,1,aapl-20230701.htm,10-Q,AAPL,320193
1,0000320193-23-000075,2023-08-03,2023-08-03,2023-08-03T16:30:21.000Z,34.0,8-K,001-36743,231140568,"2.02,9.01",452428,1,1,aapl-20230803.htm,8-K,AAPL,320193
2,0001140361-23-023909,2023-05-10,2023-05-08,2023-05-10T16:31:27.000Z,34.0,8-K,001-36743,23907040,"8.01,9.01",932286,1,1,ny20007635x4_8k.htm,8-K,AAPL,320193
3,0000320193-23-000064,2023-05-05,2023-04-01,2023-05-04T18:03:52.000Z,34.0,10-Q,001-36743,23890444,,6314786,1,1,aapl-20230401.htm,10-Q,AAPL,320193
4,0000320193-23-000063,2023-05-04,2023-05-04,2023-05-04T16:30:43.000Z,34.0,8-K,001-36743,23889274,"2.02,9.01",493947,1,1,aapl-20230504.htm,8-K,AAPL,320193
5,0001140361-23-011192,2023-03-10,2023-03-10,2023-03-10T16:30:52.000Z,34.0,8-K,001-36743,23724135,5.07,312760,1,1,brhc10049413_8k.htm,8-K,AAPL,320193
6,0000320193-23-000006,2023-02-03,2022-12-31,2023-02-02T18:01:30.000Z,34.0,10-Q,001-36743,23582662,,5915088,1,1,aapl-20221231.htm,10-Q,AAPL,320193
7,0000320193-23-000005,2023-02-02,2023-02-02,2023-02-02T16:30:33.000Z,34.0,8-K,001-36743,23581333,"2.02,9.01",464814,1,1,aapl-20230202.htm,8-K,AAPL,320193
8,0001193125-22-278435,2022-11-07,2022-11-06,2022-11-07T06:27:07.000Z,34.0,8-K,001-36743,221363624,"7.01,9.01",299395,1,1,d400465d8k.htm,8-K,AAPL,320193
9,0000320193-22-000108,2022-10-28,2022-09-24,2022-10-27T18:01:14.000Z,34.0,10-K,001-36743,221338448,,10332356,1,1,aapl-20220924.htm,10-K,AAPL,320193
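
The columns cik, accessionNumber and primaryDocument appear to be sufficient to locate a filing document. The sketch below rebuilds the document path of the 10-K linked above from row 9 of the table. The URL pattern is inferred from that single example link and is therefore an assumption, and the function name `filing_url` is hypothetical.

```python
def filing_url(cik, accession_number, primary_document):
    # the path in the example link uses the accession number without dashes
    accession_compact = accession_number.replace("-", "")
    return (
        "https://www.sec.gov/Archives/edgar/data/"
        f"{cik}/{accession_compact}/{primary_document}"
    )


# values taken from row 9 of the table above (the 10-K for fiscal year 2022)
url = filing_url(320193, "0000320193-22-000108", "aapl-20220924.htm")
print(url)
```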


### Earnings call transcripts

For a large number of companies, earnings calls are held around the publication of quarterly reports. Earnings calls are webcasts or phone calls between representatives of a company and its investors and analysts. Usually, firm representatives make a safe harbor statement at the beginning of a call. After the initial statements, investors and analysts are allowed to ask questions, which usually relate to the report and the corresponding financial and operational results. See [this example](https://www.fool.com/earnings/call-transcripts/2022/10/27/apple-aapl-q4-2022-earnings-call-transcript/) to get a first impression.

### Voluntary reports

While audited annual reports are usually mandatory, many firms publish further information in voluntary reports. Popular voluntary reports are corporate social responsibility (CSR) reports and environmental, social and corporate governance (ESG) reports. These so-called non-financial factors are becoming more important to investors and companies. A common view is that companies which are not able to manage environmental, social and governance issues properly exhibit higher risks, for which investors want to be compensated. To date, voluntary CSR or ESG reports are an important source of information for investors to understand these risks. Note that the terms CSR and ESG are often used interchangeably. See [this example](https://s2.q4cdn.com/470004039/files/doc_downloads/2022/08/2022_Apple_ESG_Report.pdf) to get a first impression. While ESG reports are not yet mandatory, certain information will have to be disclosed in the near future in the EU and the US.

### Financial news

Financial news is popular for text analysis. While it sounds pretty straightforward at first, extracting useful information from financial news can be challenging, as news tends to be backward looking and repetitive in many cases. Nevertheless, interesting applications of news analysis exist, e.g., to identify systematic market movements.