### Assignment 1

# Annual Reports: Text to Data
---

This notebook walks through a necessary step of any data science project: data cleaning. Data cleaning is time-consuming and rarely enjoyable, yet it is essential. Keep in mind "garbage in, garbage out": feeding dirty data into a model gives meaningless results.

Specifically, we'll be walking through:

1. **Getting the data** - in this case, we'll be extracting text from annual reports (such as EDGAR filings) stored as PDFs
2. **Cleaning the data** - we will walk through popular text pre-processing techniques
3. **Organizing the data** - we will organize the cleaned data in a way that is easy to feed into other algorithms

The output of this notebook will be clean, organized data in two standard text formats:

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

## Data Source

We aim to extract insights from the financial reports of companies we are considering investing in. Sources of corporate reports include:

* SEC's EDGAR: 1994-2015, 15+ million filings, annual and quarterly reports
* Regulatory disclosures: annual and interim filings (10-K and 10-Q), correspondence, IPO registration statements, etc.
* PGGM annual reports

---
## Mining PDFs


In [None]:
# from cbcm scripts
#!pip install textract

In [1]:
# Getting the name of one file
# Write a function that will create a list of the file names in a given folder

In [None]:
import textract

In [None]:
import os
opath = os.getcwd()
print("The current working directory is %s" % opath)

In [None]:
path = '../AnnualReports/'

In [None]:
file_names = []
# iterate through all file names found in that path
# This option gets all files = !ls AnnualReports
for file in os.listdir(path):
    # append all files names to a list
    if file[0]!='.' and file[-3:]=='pdf': #to handle issue with hidden files and non-pdf files
        file_names.append(file)
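
The comment in the earlier cell asks for a function that lists the files in a folder. A minimal sketch of one follows; the name `list_pdf_files` and its `folder` parameter are illustrative assumptions, not part of the original notebook. It is slightly stricter than the loop above: it matches the `.pdf` extension case-insensitively and returns the names sorted.

```python
import os

def list_pdf_files(folder):
    """Return the sorted names of non-hidden PDF files in `folder`."""
    names = []
    for name in os.listdir(folder):
        # skip hidden files (e.g. .DS_Store) and anything that is not a PDF
        if not name.startswith('.') and name.lower().endswith('.pdf'):
            names.append(name)
    return sorted(names)
```

With this helper, `file_names = list_pdf_files(path)` would stand in for the loop.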

In [2]:
#write a function that will read every pdf file and will write to txt

In [None]:
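for file_name in file_names:
    # textract.process returns bytes; str() keeps their repr (b'...' with escaped \n),
    # which is what appears in the outputs further down
    text = str(textract.process(path + file_name))
    # each extracted text is saved as a .txt file with the same file name
    with open(path + file_name + '.txt', 'w') as file:
        file.write(text)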
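
`textract.process` returns a `bytes` object, and calling `str()` on bytes produces its printable representation (a leading `b'`, escaped `\n`, `\xc2\xa0`, and so on) rather than the text itself. A minimal sketch of the difference, using a hard-coded byte string in place of real `textract` output and assuming the bytes are UTF-8 encoded:

```python
# Stand-in for the bytes returned by textract.process (assumed UTF-8)
raw = b"Annual Report 2018\n\nRisk, funding &\xc2\xa0capital"

as_repr = str(raw)             # keeps the b'...' wrapper and escaped sequences
as_text = raw.decode("utf-8")  # real newlines and a real non-breaking space

print(as_repr[:2])             # the first two characters are b and a quote
print("\\n" in as_repr)        # True: backslash followed by n, not a newline
print("\n" in as_text)         # True: an actual newline character
```

Writing `as_text` to the `.txt` file would keep those escape sequences out of the cleaning steps. The outputs recorded in this notebook come from the `str()` version.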

---
## Retrieving files

In [1]:
import pickle
annual_report_files = !ls ../AnnualReports/*.txt

In [2]:
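# Load the extracted text files (plain text, not pickled)
data = {}
for c in annual_report_files:
    with open(c, "r") as file:
        # c[17:-4] strips the '../AnnualReports/' prefix (17 characters) and the '.txt' suffix
        data[c[17:-4]] = file.read()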
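
The slice `c[17:-4]` depends on the exact length of the folder prefix. A sketch of an alternative that derives the key from the path itself with `pathlib` from the standard library; the helper name `report_key` is an assumption made for illustration:

```python
from pathlib import Path

def report_key(path):
    """Return the file name without its directory and final '.txt' suffix."""
    return Path(path).stem

print(report_key('../AnnualReports/Acer_(2018).pdf.txt'))  # Acer_(2018).pdf
```

Only the last suffix is removed, so the keys match the ones produced by the slice.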

In [3]:
len(data.keys())

17

In [4]:
print(data['ABN_AMRO_Group_(2018)'][:1000])

b"ABN AMRO Bank N.V.ABN AMRO Group N.V. Annual Report 2018\x0c   \n\nTable of contents\n\n2Introduction\n\nAbout this report \nKey figures and profile \nABN AMRO shares \n\n2\n3\n4\n\n5Strategy and \n\nperformance\n6\nEconomic environment \n8\nRegulatory environment \n10\nStrategy \n13\nGroup performance \nBusiness performance \n20\nResponsibility statement  34\n\n35Risk, funding & \n\ncapital\nIntroduction to Risk, \nfunding &\xc2\xa0capital \n\nRisk, funding & \n\n123Leadership & \n\ngovernance\nIntroduction to Leadership \n\n36\n\n& governance \n\nExecutive Board and \n\ncapital\xc2\xa0management \n\n37\n\nExecutive\xc2\xa0Committee \n\nRisk, funding & \ncapital\xc2\xa0review \n\nSupervisory Board \nReport of the \n\n65\n\nAdditional risk, funding \n\n&\xc2\xa0capital\xc2\xa0disclosures  116\n\nSupervisory\xc2\xa0Board \nGeneral Meeting and \n\nshareholder\xc2\xa0structure \nCodes and regulations \nLegal structure \nRemuneration report \n\n124\n\n124\n128\n\n132\n\n138\n141\n144\n14

---
## Text data processing

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, this cleaning process can go on forever. There's always an exception to every cleaning step. So, we're going to follow the MVP (minimum viable product) approach - start simple and iterate. Here are a bunch of things you can do to clean your data. We're going to execute just the common cleaning steps here and the rest can be done at a later point to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common non-sensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Part-of-speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
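
The first list can be sketched with the standard library alone. The stop-word set below is a tiny hypothetical sample (real lists are much longer), the function name `basic_clean` is made up for illustration, and whitespace splitting stands in for a proper tokenizer:

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration
STOP_WORDS = {'a', 'an', 'and', 'the', 'of', 'in', 'to', 'is'}

def basic_clean(text):
    """Lowercase, strip punctuation and digits, tokenize, drop stop words."""
    text = text.lower()
    text = text.replace('\n', ' ')  # line breaks become spaces
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\d+', '', text)  # numerical values
    tokens = text.split()  # whitespace tokenizer
    return [t for t in tokens if t not in STOP_WORDS]

print(basic_clean("The Annual Report 2018:\nRisk, funding & capital."))
# ['annual', 'report', 'risk', 'funding', 'capital']
```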

In [5]:
# Let's take a look at our data again
next(iter(data.keys()))

'ABN_AMRO_Group_(2018)'

In [6]:
# We are going to change this to key: report name, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [7]:
#data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
data_combined = {key: [value] for (key, value) in data.items()}

In [8]:
# We can either keep it in dictionary format or put it into a pandas dataframe
import pandas as pd
pd.set_option('display.max_colwidth', 1000)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['report']
data_df = data_df.sort_index()

In [9]:
data_df.head()

Unnamed: 0,report
ABN_AMRO_Group_(2018),"b""ABN AMRO Bank N.V.ABN AMRO Group N.V. Annual Report 2018\x0c \n\nTable of contents\n\n2Introduction\n\nAbout this report \nKey figures and profile \nABN AMRO shares \n\n2\n3\n4\n\n5Strategy and \n\nperformance\n6\nEconomic environment \n8\nRegulatory environment \n10\nStrategy \n13\nGroup performance \nBusiness performance \n20\nResponsibility statement 34\n\n35Risk, funding & \n\ncapital\nIntroduction to Risk, \nfunding &\xc2\xa0capital \n\nRisk, funding & \n\n123Leadership & \n\ngovernance\nIntroduction to Leadership \n\n36\n\n& governance \n\nExecutive Board and \n\ncapital\xc2\xa0management \n\n37\n\nExecutive\xc2\xa0Committee \n\nRisk, funding & \ncapital\xc2\xa0review \n\nSupervisory Board \nReport of the \n\n65\n\nAdditional risk, funding \n\n&\xc2\xa0capital\xc2\xa0disclosures 116\n\nSupervisory\xc2\xa0Board \nGeneral Meeting and \n\nshareholder\xc2\xa0structure \nCodes and regulations \nLegal structure \nRemuneration report \n\n124\n\n124\n128\n\n132\n\n138\n141\n144..."
AGNC_Investment_(2018).pdf,"b'Providing Private Capital to the U.S. Housing Market Through\n\nINVESTMENT EXCELLENCE\n\n2018 ANNUAL REPORT\n\n\x0c\x0cDEAR FELLOW STOCKHOLDERS:\n\n2018 marked a signi\xef\xac\x81cant milestone for AGNC Investment Corp.\xe2\x80\x94our tenth anniversary as a public\ncompany, which we celebrated by ringing the Nasdaq opening bell on May 15, 2018.\n\nPictured (left to right) AGNC\xe2\x80\x99s board of directors as of May 15, 2018: Paul Mullings, Larry Harvey, Gary Kain (Chief Executive\nO\xef\xac\x83cer and Chief Investment O\xef\xac\x83cer), Prue Larocca (Chair), Morris Davis. Not Pictured: Donna Blank, who became a\ndirector on December 3, 2018.\n\nWhile we remain steadfasff\nour stockholders, this\nsigni\xef\xac\x81cant accomplishment provides an opportunity to re\xef\xac\x82ect upon AGNC\xe2\x80\x99s achievements over the past decade:\n\nsed on generating attractive long-term returns forff\n\ntly focuff\n\n\xe2\x80\xa2 Industry-leading total stock and economic returns;\n\n\xe2\x..."
Acer_(2018).pdf,"b'ACER INCORPORTED\n2018 ANNUAL REPORT\n\nPublication Date : April 16, 2019\n\n\x0cAPPENDIX\n\n1. Name, Title and Contact Details of Company\'s Spokespersons:\n\nPrincipal \n\nMeggy Chen\n\nCFO\n\n+886-2-2696-1234\n\nMeggy.Chen@acer.com\n\nDeputy\n\nWayne Chang\n\nManager\n\n+886-2-2719-5000\n\nWayne.Chang@acer.com\n\n2. Address and Telephone Numbers of Company\'s Headquarter and Branches\n\nOffice\n\nAddress\n\nTel\n\nAcer Inc.\nRegistered Address\n\n7F.-5, No.369, Fuxing N. Rd., Songshan Dist., \nTaipei City 105, Taiwan \n\n+886-2-2719-5000\n\nAcer Inc.\n(Xizhi Office)\n\nAcer Inc.\n(Hsinchu Branch)\n\nAcer Inc.\n(Taichung Branch)\n\n8F., No.88, Sec. 1, Xintai 5th Rd., Xizhi Dist., New \nTaipei City 221, Taiwan \n\n+886-2-2696-1234\n\n3F., No.139, Minzu Rd., East Dist., Hsinchu City \n300, Taiwan\n\n+886-3-534-9490\n\n3F., No.371, Sec. 1, Wenxin Rd., Nantun Dist., \nTaichung City 408, Taiwan\n\n+886-4-2250-3355\n\nAcer Inc.\n(Kaohsiung Branch)\n\n4F.-6, No.38, Xinguang Rd., Ling..."
Autohome_(2018).pdf,"b'Table of Contents\n\n \n \n\n \n\n \n\nUNITED STATES\n\nSECURITIES AND EXCHANGE COMMISSION\n\nWashington, D.C. 20549\n\n \n\nForm 20-F\n\n \n\n(Mark One)\n\xe2\x98\x90 REGISTRATION STATEMENT PURSUANT TO SECTION 12(b) OR 12(g) OF THE SECURITIES EXCHANGE\n\nACT OF 1934\n\n \n\nor\n\n\xe2\x98\x92 ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\n\n \n\n\xe2\x98\x90 TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF\n\nFor the fiscal year ended December 31, 2018.\n\nor\n\n1934\n\n \n\nFor the transition period from to \n\nor\n\n\xe2\x98\x90 SHELL COMPANY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT\n\nOF 1934\n\n \n\n \n\n \n\n \n\n*\n\nDate of event requiring this shell company report \n\nFor the transition period from to \n\nCommission file number: 001-36222\n\n \n\nAutohome Inc.\n\n(Exact..."
BAIC_Motor_Corporation_(2018).pdf,"b'(A joint stock company incorporated in the People\xe2\x80\x99s Republic of China with limited liability)\nStock code: 1958\n\n\xef\xbc\x8a\n\nAnnuAl RepoRt\n2018\n\n\xef\xbc\x8a\n\n\xef\xbc\x8a\n For identification purpose only \n\n\x0cSenova Zhidao\n\nONTENTSC\x0c2\n\n4\n\n8\n\nCorporate Information\n\nChairman\xe2\x80\x99s Statement\n\nSummary of Financial and Performance \nInformation\n\n11\n\nCompany Profile and Business Overview \n\n26\n\nManagement Discussion and Analysis\n\n33\n\nReport of the Board of Directors\n\n61\n\nReport of the Board of Supervisors\n\n64\n\nCorporate Governance Report\n\n82\n\nDirectors, Supervisors and Senior Management\n\n93\n\nHuman Resources\n\n94\n\nIndependent Auditor\xe2\x80\x99s Report\n\n102\n\nConsolidated Balance Sheet\n\n104\n\nConsolidated Statement of Comprehensive Income\n\n106\n\nConsolidated Statement of Changes in Equity\n\n108\n\nConsolidated Statement of Cash Flows\n\n109\n\nNotes to the Consolidated Financial Statements\n\n177\..."


In [10]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # text in square brackets
    text = re.sub('[%s]' % re.escape(string.punctuation), '', text)  # punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # words containing digits
    return text

round1 = clean_text_round1

In [11]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.report.apply(round1))

In [12]:
data_clean.head()

Unnamed: 0,report
ABN_AMRO_Group_(2018),babn amro bank nvabn amro group nv annual report nntable of this report nkey figures and profile nabn amro shares and environment environment performance nbusiness performance statement funding nncapitalnintroduction to risk nfunding nnrisk funding nngovernancenintroduction to leadership governance nnexecutive board and nnrisk funding nnsupervisory board nreport of the risk funding ngeneral meeting and ncodes and regulations nlegal structure nremuneration report financial statements income statement nconsolidated statement income nconsolidated statement nconsolidated statement nconsolidated statement nnotes to the consolidated ncompany financial statements nother information statements amro group annual report and performancerisk funding capitalleadershipgovernanceannual financial statementsotherintroductionstrategy and performancerisk funding capitalleadership governanceannual financial about this this reportnnthis is the annual re...
AGNC_Investment_(2018).pdf,bproviding private capital to the us housing market throughnninvestment annual fellow marked a milestone for agnc investment tenth anniversary as a publicncompany which we celebrated by ringing the nasdaq opening bell on may left to right board of directors as of may paul mullings larry harvey gary kain chief and chief investment prue larocca chair morris davis not pictured donna blank who became andirector on december we remain steadfasffnour stockholders accomplishment provides an opportunity to upon achievements over the past decadennsed on generating attractive longterm returns forffnntly industryleading total stock and economic dividends paid to internalization of management remarkable growth in equity capital successful navigation of challenging macroeconomic events economic cycles and interest rate environmentsnna g n c a n n u a l r e p o r total stock and economic returnsnsince inception our primary objective has been to provide our stockholder...
Acer_(2018).pdf,bacer annual reportnnpublication date april name title and contact details of companys spokespersonsnnprincipal nnmeggy address and telephone numbers of companys headquarter and branchesnnofficennaddressnntelnnacer incnregistered fuxing n rd songshan dist ntaipei city taiwan incnxizhi officennacer incnhsinchu branchnnacer incntaichung sec xintai rd xizhi dist new ntaipei city taiwan minzu rd east dist hsinchu city sec wenxin rd nantun dist ntaichung city incnkaohsiung xinguang rd lingya dist nkaohsiung city incnshipping warehouse nmanagement center neixin rd luzhu dist taoyuan city address and contact details of acer shareholders servicesnnaddress fuxing n rd songshan dist taipei city taiwan ntel nemail address and contact details of auditing cpas in the most recent yearnnhueichen chang and tzuchieh tang at kpmgnnname naddress sec xinyi rd xinyi dist taipei city taiwan ntel overseas securities exchangennlisted market for gdrs london stoc...
Autohome_(2018).pdf,btable of contentsnn n nn nn nnunited statesnnsecurities and exchange commissionnnwashington dc nnform nnmark registration statement pursuant to section or of the securities exchangennact of annual report pursuant to section or of the securities exchange act of transition report pursuant to section or of the securities exchange act ofnnfor the fiscal year ended december nnfor the transition period from to shell company report pursuant to section or of the securities exchange actnnof nn nn nn nnnndate of event requiring this shell company report nnfor the transition period from to nncommission file number nnautohome incnnexact name of registrant as specified in its charternn nnnanntranslation of name into englishnncayman islandsnnjurisdiction of incorporation or floor tower b cec dan ling streetnnhaidian district beijing republic of chinanaddress of principa...
BAIC_Motor_Corporation_(2018).pdf,ba joint stock company incorporated in the republic of china with limited liabilitynstock code for identification purpose only statementnnsummary of financial and performance profile and business overview discussion and of the board of of the board of governance supervisors and senior balance statement of comprehensive statement of changes in statement of cash to the consolidated financial motor corporation unit floor building no ncourtyard no shuanghe street nshunyi district beijing headquartersnnno shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong tower two times square matheson street causeway bay nhong floor al...


In [13]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and non-sensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text) 
    text = re.sub('<', '', text)
    text = re.sub('>', '', text)
    return text

round2 = lambda x: clean_text_round2(x)
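
In the cleaned outputs, stray `n` and `nn` are glued onto words (`nntable`, `nkey`). The saved text contains the two characters backslash and `n` rather than real newlines, so `re.sub('\n', ...)` finds nothing to remove, and the punctuation step in round 1 strips the backslash and leaves the `n` behind. A sketch of one possible repair, meant to run before `clean_text_round1`; the helper name `strip_literal_escapes` is hypothetical, and replacing every `\xNN` escape with a space is a simplifying assumption that also discards non-ASCII characters such as curly quotes:

```python
import re

def strip_literal_escapes(text):
    """Replace literal escape sequences (backslash + n, t, r, f or xNN) with a space."""
    return re.sub(r'\\(?:[ntrf]|x[0-9a-fA-F]{2})', ' ', text)

# Raw string: the backslashes below are literal characters, as in the saved files
sample = r"Table of contents\n\n2Introduction\x0c"
print(strip_literal_escapes(sample))
```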

In [14]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.report.apply(round2))

In [15]:
data_clean.head()

Unnamed: 0,report
ABN_AMRO_Group_(2018),babn amro bank nvabn amro group nv annual report nntable of this report nkey figures and profile nabn amro shares and environment environment performance nbusiness performance statement funding nncapitalnintroduction to risk nfunding nnrisk funding nngovernancenintroduction to leadership governance nnexecutive board and nnrisk funding nnsupervisory board nreport of the risk funding ngeneral meeting and ncodes and regulations nlegal structure nremuneration report financial statements income statement nconsolidated statement income nconsolidated statement nconsolidated statement nconsolidated statement nnotes to the consolidated ncompany financial statements nother information statements amro group annual report and performancerisk funding capitalleadershipgovernanceannual financial statementsotherintroductionstrategy and performancerisk funding capitalleadership governanceannual financial about this this reportnnthis is the annual re...
AGNC_Investment_(2018).pdf,bproviding private capital to the us housing market throughnninvestment annual fellow marked a milestone for agnc investment tenth anniversary as a publicncompany which we celebrated by ringing the nasdaq opening bell on may left to right board of directors as of may paul mullings larry harvey gary kain chief and chief investment prue larocca chair morris davis not pictured donna blank who became andirector on december we remain steadfasffnour stockholders accomplishment provides an opportunity to upon achievements over the past decadennsed on generating attractive longterm returns forffnntly industryleading total stock and economic dividends paid to internalization of management remarkable growth in equity capital successful navigation of challenging macroeconomic events economic cycles and interest rate environmentsnna g n c a n n u a l r e p o r total stock and economic returnsnsince inception our primary objective has been to provide our stockholder...
Acer_(2018).pdf,bacer annual reportnnpublication date april name title and contact details of companys spokespersonsnnprincipal nnmeggy address and telephone numbers of companys headquarter and branchesnnofficennaddressnntelnnacer incnregistered fuxing n rd songshan dist ntaipei city taiwan incnxizhi officennacer incnhsinchu branchnnacer incntaichung sec xintai rd xizhi dist new ntaipei city taiwan minzu rd east dist hsinchu city sec wenxin rd nantun dist ntaichung city incnkaohsiung xinguang rd lingya dist nkaohsiung city incnshipping warehouse nmanagement center neixin rd luzhu dist taoyuan city address and contact details of acer shareholders servicesnnaddress fuxing n rd songshan dist taipei city taiwan ntel nemail address and contact details of auditing cpas in the most recent yearnnhueichen chang and tzuchieh tang at kpmgnnname naddress sec xinyi rd xinyi dist taipei city taiwan ntel overseas securities exchangennlisted market for gdrs london stoc...
Autohome_(2018).pdf,btable of contentsnn n nn nn nnunited statesnnsecurities and exchange commissionnnwashington dc nnform nnmark registration statement pursuant to section or of the securities exchangennact of annual report pursuant to section or of the securities exchange act of transition report pursuant to section or of the securities exchange act ofnnfor the fiscal year ended december nnfor the transition period from to shell company report pursuant to section or of the securities exchange actnnof nn nn nn nnnndate of event requiring this shell company report nnfor the transition period from to nncommission file number nnautohome incnnexact name of registrant as specified in its charternn nnnanntranslation of name into englishnncayman islandsnnjurisdiction of incorporation or floor tower b cec dan ling streetnnhaidian district beijing republic of chinanaddress of principa...
BAIC_Motor_Corporation_(2018).pdf,ba joint stock company incorporated in the republic of china with limited liabilitynstock code for identification purpose only statementnnsummary of financial and performance profile and business overview discussion and of the board of of the board of governance supervisors and senior balance statement of comprehensive statement of changes in statement of cash to the consolidated financial motor corporation unit floor building no ncourtyard no shuanghe street nshunyi district beijing headquartersnnno shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong tower two times square matheson street causeway bay nhong floor al...


In [16]:
#data_clean.loc['ABN_AMRO_Group_(2018)','report']

---
## Organizing The Data

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [17]:
# Use the index itself so the names stay aligned with the rows after sort_index()
data_clean['company_name'] = data_clean.index

In [18]:
data_clean.head()

Unnamed: 0,report,company_name
ABN_AMRO_Group_(2018),babn amro bank nvabn amro group nv annual report nntable of this report nkey figures and profile nabn amro shares and environment environment performance nbusiness performance statement funding nncapitalnintroduction to risk nfunding nnrisk funding nngovernancenintroduction to leadership governance nnexecutive board and nnrisk funding nnsupervisory board nreport of the risk funding ngeneral meeting and ncodes and regulations nlegal structure nremuneration report financial statements income statement nconsolidated statement income nconsolidated statement nconsolidated statement nconsolidated statement nnotes to the consolidated ncompany financial statements nother information statements amro group annual report and performancerisk funding capitalleadershipgovernanceannual financial statementsotherintroductionstrategy and performancerisk funding capitalleadership governanceannual financial about this this reportnnthis is the annual re...,ABN_AMRO_Group_(2018)
AGNC_Investment_(2018).pdf,bproviding private capital to the us housing market throughnninvestment annual fellow marked a milestone for agnc investment tenth anniversary as a publicncompany which we celebrated by ringing the nasdaq opening bell on may left to right board of directors as of may paul mullings larry harvey gary kain chief and chief investment prue larocca chair morris davis not pictured donna blank who became andirector on december we remain steadfasffnour stockholders accomplishment provides an opportunity to upon achievements over the past decadennsed on generating attractive longterm returns forffnntly industryleading total stock and economic dividends paid to internalization of management remarkable growth in equity capital successful navigation of challenging macroeconomic events economic cycles and interest rate environmentsnna g n c a n n u a l r e p o r total stock and economic returnsnsince inception our primary objective has been to provide our stockholder...,AGNC_Investment_(2018).pdf
Acer_(2018).pdf,bacer annual reportnnpublication date april name title and contact details of companys spokespersonsnnprincipal nnmeggy address and telephone numbers of companys headquarter and branchesnnofficennaddressnntelnnacer incnregistered fuxing n rd songshan dist ntaipei city taiwan incnxizhi officennacer incnhsinchu branchnnacer incntaichung sec xintai rd xizhi dist new ntaipei city taiwan minzu rd east dist hsinchu city sec wenxin rd nantun dist ntaichung city incnkaohsiung xinguang rd lingya dist nkaohsiung city incnshipping warehouse nmanagement center neixin rd luzhu dist taoyuan city address and contact details of acer shareholders servicesnnaddress fuxing n rd songshan dist taipei city taiwan ntel nemail address and contact details of auditing cpas in the most recent yearnnhueichen chang and tzuchieh tang at kpmgnnname naddress sec xinyi rd xinyi dist taipei city taiwan ntel overseas securities exchangennlisted market for gdrs london stoc...,Acer_(2018).pdf
Autohome_(2018).pdf,btable of contentsnn n nn nn nnunited statesnnsecurities and exchange commissionnnwashington dc nnform nnmark registration statement pursuant to section or of the securities exchangennact of annual report pursuant to section or of the securities exchange act of transition report pursuant to section or of the securities exchange act ofnnfor the fiscal year ended december nnfor the transition period from to shell company report pursuant to section or of the securities exchange actnnof nn nn nn nnnndate of event requiring this shell company report nnfor the transition period from to nncommission file number nnautohome incnnexact name of registrant as specified in its charternn nnnanntranslation of name into englishnncayman islandsnnjurisdiction of incorporation or floor tower b cec dan ling streetnnhaidian district beijing republic of chinanaddress of principa...,Autohome_(2018).pdf
BAIC_Motor_Corporation_(2018).pdf,ba joint stock company incorporated in the republic of china with limited liabilitynstock code for identification purpose only statementnnsummary of financial and performance profile and business overview discussion and of the board of of the board of governance supervisors and senior balance statement of comprehensive statement of changes in statement of cash to the consolidated financial motor corporation unit floor building no ncourtyard no shuanghe street nshunyi district beijing headquartersnnno shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing shuanghe street shunyi district nbeijing tower two times square matheson street causeway bay nhong tower two times square matheson street causeway bay nhong floor al...,BAIC_Motor_Corporation_(2018).pdf


In [19]:
# Let's pickle it for later use
data_clean.to_pickle("pickle/corpus_AnnualR.pkl")

---
### Document-Term Matrix

For many of the techniques, the text must be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break down text into words. We can do this using scikit-learn's CountVectorizer, where every row will represent a different document and every column will represent a different word.

In addition, CountVectorizer can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to the text.
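
To make the row and column layout concrete, here is a small standard-library sketch that builds the same kind of matrix by hand for two made-up documents. It is not how `CountVectorizer` is implemented, and the stop-word set is a hypothetical sample:

```python
from collections import Counter

docs = {
    'report_a': 'risk funding and capital risk',
    'report_b': 'the annual report and risk',
}
stop_words = {'and', 'the'}  # hypothetical sample, not scikit-learn's list

# Count the non-stop-word tokens in each document
counts = {name: Counter(w for w in text.split() if w not in stop_words)
          for name, text in docs.items()}

# Columns are the sorted vocabulary; each row holds one document's counts
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[name][word] for word in vocabulary] for name in docs]

print(vocabulary)
for name, row in zip(docs, matrix):
    print(name, row)
```

Each row is a document and each column a word, which is the shape the DataFrame built from `CountVectorizer` takes in the cells that follow.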

In [20]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

In [21]:
data_cv = cv.fit_transform(data_clean.report)

In [22]:
print(data_clean.report.iloc[0][:800])

babn amro bank nvabn amro group nv annual report    nntable of  this report nkey figures and profile nabn amro shares  and  environment  environment   performance nbusiness performance  statement   funding  nncapitalnintroduction to risk nfunding  nnrisk funding    nngovernancenintroduction to leadership  governance nnexecutive board and   nnrisk funding   nnsupervisory board nreport of the  risk funding    ngeneral meeting and  ncodes and regulations nlegal structure nremuneration report  financial statements  income statement nconsolidated statement  income nconsolidated statement  nconsolidated statement  nconsolidated statement  nnotes to the consolidated  ncompany financial statements    nother information  statements    amro group annual report  and performancerisk funding  capitalle


In [23]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index

In [24]:
#data_dtm.head()

In [25]:
# Let's pickle it for later use
data_dtm.to_pickle("pickle/dtm_AnnualR.pkl")

In [26]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('pickle/data_clean_AnnualR.pkl')
with open("pickle/cv_AnnualR.pkl", "wb") as f:
    pickle.dump(cv, f)