---
## Text data processing

When dealing with numerical data, data cleaning often involves removing null values and duplicate data, dealing with outliers, etc. With text data, there are some common data cleaning techniques, which are also known as text pre-processing techniques.

With text data, cleaning can go on forever, because every cleaning step has exceptions. So we're going to follow the MVP (minimum viable product) approach: start simple and iterate. Below are several things you can do to clean your data. We'll run only the common cleaning steps here; the rest can be done later to improve our results.

**Common data cleaning steps on all text:**
* Make text all lower case
* Remove punctuation
* Remove numerical values
* Remove common nonsensical text (\n)
* Tokenize text
* Remove stop words

**More data cleaning steps after tokenization:**
* Stemming / lemmatization
* Parts of speech tagging
* Create bi-grams or tri-grams
* Deal with typos
* And more...
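
As a rough illustration of the steps listed above, here is a minimal pure-Python sketch on a made-up sentence. The tiny `STOP_WORDS` set and the helper names `basic_clean` and `bigrams` are assumptions for this example only; the notebook itself relies on scikit-learn's built-in English stop word list later on.

```python
import re
import string

# Hypothetical, deliberately tiny stop word list for illustration only
STOP_WORDS = {"a", "an", "the", "of", "in", "and", "to"}

def basic_clean(text):
    """Lowercase, strip punctuation and words containing digits, tokenize, drop stop words."""
    text = text.lower()
    text = re.sub(r"[%s]" % re.escape(string.punctuation), "", text)
    text = re.sub(r"\w*\d\w*", "", text)
    tokens = text.split()  # whitespace tokenization also discards \n and \t
    return [t for t in tokens if t not in STOP_WORDS]

def bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

tokens = basic_clean("The Company filed its 10-K in 2019,\nand revenue grew.")
print(tokens)
print(bigrams(tokens))
```

Splitting on whitespace is the simplest possible tokenizer; stemming, lemmatization and typo handling would need a dedicated NLP library and are left out here.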

### Load

In [7]:
import pandas as pd
import pickle

In [3]:
df = pd.read_csv('table.csv')

In [4]:
len(df)

15

In [5]:
tickers = list(df['ticker'])

In [8]:
# Load the pickled reports (the files are pickles despite the .txt extension)
data = {}
for c in tickers:
    with open("reports/" + c + ".txt", "rb") as file:
        data[c] = pickle.load(file)
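
The loop above assumes every ticker has a matching file under `reports/`. A slightly more defensive variant might skip missing files and report them; the sketch below is one way to do that. The helper name `load_reports` is hypothetical, and the demo writes a throwaway pickle to a temporary directory rather than touching the real `reports/` folder.

```python
import pickle
import tempfile
from pathlib import Path

def load_reports(tickers, folder):
    """Load one pickled report per ticker; return (loaded dict, list of missing tickers)."""
    loaded, missing = {}, []
    for ticker in tickers:
        path = Path(folder) / f"{ticker}.txt"  # pickled content despite the .txt extension
        if not path.exists():
            missing.append(ticker)
            continue
        with open(path, "rb") as fh:
            loaded[ticker] = pickle.load(fh)
    return loaded, missing

# Demo with made-up data in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    with open(Path(tmp) / "AAL.txt", "wb") as fh:
        pickle.dump("sample report text", fh)
    reports, missing = load_reports(["AAL", "ZZZZ"], tmp)

print(sorted(reports), missing)
```

As with any use of `pickle.load`, this should only be pointed at files you created or trust.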

In [9]:
# Double check to make sure data has been loaded properly
data.keys()

dict_keys(['AAC', 'AAL', 'AAME', 'AAOI', 'AAPL', 'ABAX', 'ABCB', 'ABCD', 'ABFS', 'ABII', 'ABK', 'ABMD', 'ABTL', 'ACAD', 'ACAS'])

In [10]:
# More checks
print(data['AAL'][:800])

<SEC-DOCUMENT>0000006201-19-000009.txt : 20190225
<SEC-HEADER>0000006201-19-000009.hdr.sgml : 20190225
<ACCEPTANCE-DATETIME>20190225073134
ACCESSION NUMBER:		0000006201-19-000009
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		140
CONFORMED PERIOD OF REPORT:	20181231
FILED AS OF DATE:		20190225
DATE AS OF CHANGE:		20190225

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			American Airlines Group Inc.
		CENTRAL INDEX KEY:			0000006201
		STANDARD INDUSTRIAL CLASSIFICATION:	AIR TRANSPORTATION, SCHEDULED [4512]
		IRS NUMBER:				751825172
		STATE OF INCORPORATION:			DE
		FISCAL YEAR END:			1231

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-08400
		FILM NUMBER:		19628071

	BUSINESS ADDRESS:	
		STREET 1:		4333 AMON CARTER BLVD
		CITY:			FORT WORTH
		ST


## Start processing

In [21]:
# Let's take a look at our data again
next(iter(data.keys()))

'AAC'

In [22]:
# We are going to change this to key: ticker, value: string format
def combine_text(list_of_text):
    '''Takes a list of text and combines them into one large chunk of text.'''
    combined_text = ' '.join(list_of_text)
    return combined_text

In [23]:
#data_combined = {key: [combine_text(value)] for (key, value) in data.items()}
data_combined = {key: [value] for (key, value) in data.items()}
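
`combine_text` expects a list of strings. Since each pickled report here appears to be a single string already, joining it would presumably insert a space between every character, which may be why the call is commented out. A quick check of that behaviour, with the function repeated so the block runs on its own:

```python
def combine_text(list_of_text):
    """Join a list of text fragments into one string."""
    return ' '.join(list_of_text)

# A list of fragments is joined as intended
joined = combine_text(["FORM TYPE:", "10-K"])

# A plain string is iterated character by character
spaced = combine_text("10-K")

print(repr(joined))
print(repr(spaced))
```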

In [24]:
# We can either keep it in dictionary format or put it into a pandas dataframe
pd.set_option('max_colwidth',1000)

data_df = pd.DataFrame.from_dict(data_combined).transpose()
data_df.columns = ['report']
data_df = data_df.sort_index()
data_df

Unnamed: 0,report
AAC,"<SEC-DOCUMENT>0001144204-17-037280.txt : 20170719\n<SEC-HEADER>0001144204-17-037280.hdr.sgml : 20170719\n<ACCEPTANCE-DATETIME>20170719151214\nACCESSION NUMBER:\t\t0001144204-17-037280\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t123\nCONFORMED PERIOD OF REPORT:\t20160630\nFILED AS OF DATE:\t\t20170719\nDATE AS OF CHANGE:\t\t20170719\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tHongli Clean Energy Technologies Corp.\n\t\tCENTRAL INDEX KEY:\t\t\t0001099290\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tSTEEL WORKS, BLAST FURNACES ROLLING MILLS (COKE OVENS) [3312]\n\t\tIRS NUMBER:\t\t\t\t593404233\n\t\tSTATE OF INCORPORATION:\t\t\tFL\n\t\tFISCAL YEAR END:\t\t\t0630\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-15931\n\t\tFILM NUMBER:\t\t17972216\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\tKUANGGONG RD & TIYU RD 10TH FLR,\n\t\tSTREET 2:\t\tCHENGSHI XIN YONG SHE, TIYU RD, XINHUA\n\t\tCITY:\t\t\tPINGDINGSHA..."
AAL,"<SEC-DOCUMENT>0000006201-19-000009.txt : 20190225\n<SEC-HEADER>0000006201-19-000009.hdr.sgml : 20190225\n<ACCEPTANCE-DATETIME>20190225073134\nACCESSION NUMBER:\t\t0000006201-19-000009\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t140\nCONFORMED PERIOD OF REPORT:\t20181231\nFILED AS OF DATE:\t\t20190225\nDATE AS OF CHANGE:\t\t20190225\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAmerican Airlines Group Inc.\n\t\tCENTRAL INDEX KEY:\t\t\t0000006201\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tAIR TRANSPORTATION, SCHEDULED [4512]\n\t\tIRS NUMBER:\t\t\t\t751825172\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-08400\n\t\tFILM NUMBER:\t\t19628071\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t4333 AMON CARTER BLVD\n\t\tCITY:\t\t\tFORT WORTH\n\t\tSTATE:\t\t\tTX\n\t\tZIP:\t\t\t76155\n\t\tBUSINESS PHONE:\t\t8179631234\n\n\tMAIL ADDRESS:\t\n\t\tST..."
AAME,<SEC-DOCUMENT>0001140361-19-006219.txt : 20190401\n<SEC-HEADER>0001140361-19-006219.hdr.sgml : 20190401\n<ACCEPTANCE-DATETIME>20190401164812\nACCESSION NUMBER:\t\t0001140361-19-006219\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t89\nCONFORMED PERIOD OF REPORT:\t20181231\nFILED AS OF DATE:\t\t20190401\nDATE AS OF CHANGE:\t\t20190401\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tATLANTIC AMERICAN CORP\n\t\tCENTRAL INDEX KEY:\t\t\t0000008177\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tLIFE INSURANCE [6311]\n\t\tIRS NUMBER:\t\t\t\t581027114\n\t\tSTATE OF INCORPORATION:\t\t\tGA\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t000-03722\n\t\tFILM NUMBER:\t\t19721385\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t4370 PEACHTREE RD NE\n\t\tCITY:\t\t\tATLANTA\n\t\tSTATE:\t\t\tGA\n\t\tZIP:\t\t\t30319\n\t\tBUSINESS PHONE:\t\t4042665500\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\t4370 PEACHTREE ...
AAOI,"<SEC-DOCUMENT>0001558370-19-001078.txt : 20190226\n<SEC-HEADER>0001558370-19-001078.hdr.sgml : 20190226\n<ACCEPTANCE-DATETIME>20190226160641\nACCESSION NUMBER:\t\t0001558370-19-001078\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t101\nCONFORMED PERIOD OF REPORT:\t20181231\nFILED AS OF DATE:\t\t20190226\nDATE AS OF CHANGE:\t\t20190226\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAPPLIED OPTOELECTRONICS, INC.\n\t\tCENTRAL INDEX KEY:\t\t\t0001158114\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tSEMICONDUCTORS & RELATED DEVICES [3674]\n\t\tIRS NUMBER:\t\t\t\t760533927\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-36083\n\t\tFILM NUMBER:\t\t19633410\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t13139 JESS PIRTLE BLVD\n\t\tCITY:\t\t\tSUGAR LAND\n\t\tSTATE:\t\t\tTX\n\t\tZIP:\t\t\t77478\n\t\tBUSINESS PHONE:\t\t281-295-1800\n\n\tMAIL ADDRESS:\t\..."
AAPL,<SEC-DOCUMENT>0000320193-19-000119.txt : 20191031\n<SEC-HEADER>0000320193-19-000119.hdr.sgml : 20191031\n<ACCEPTANCE-DATETIME>20191030181236\nACCESSION NUMBER:\t\t0000320193-19-000119\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t96\nCONFORMED PERIOD OF REPORT:\t20190928\nFILED AS OF DATE:\t\t20191031\nDATE AS OF CHANGE:\t\t20191030\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tApple Inc.\n\t\tCENTRAL INDEX KEY:\t\t\t0000320193\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tELECTRONIC COMPUTERS [3571]\n\t\tIRS NUMBER:\t\t\t\t942404110\n\t\tSTATE OF INCORPORATION:\t\t\tCA\n\t\tFISCAL YEAR END:\t\t\t0928\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-36743\n\t\tFILM NUMBER:\t\t191181423\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\tONE APPLE PARK WAY\n\t\tCITY:\t\t\tCUPERTINO\n\t\tSTATE:\t\t\tCA\n\t\tZIP:\t\t\t95014\n\t\tBUSINESS PHONE:\t\t(408) 996-1010\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\tONE APPLE PARK W...
ABAX,"<SEC-DOCUMENT>0001140361-18-034234.txt : 20180730\n<SEC-HEADER>0001140361-18-034234.hdr.sgml : 20180730\n<ACCEPTANCE-DATETIME>20180730171319\nACCESSION NUMBER:\t\t0001140361-18-034234\nCONFORMED SUBMISSION TYPE:\t10-K/A\nPUBLIC DOCUMENT COUNT:\t\t8\nCONFORMED PERIOD OF REPORT:\t20180331\nFILED AS OF DATE:\t\t20180730\nDATE AS OF CHANGE:\t\t20180730\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tABAXIS INC\n\t\tCENTRAL INDEX KEY:\t\t\t0000881890\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tMEASURING & CONTROLLING DEVICES, NEC [3829]\n\t\tIRS NUMBER:\t\t\t\t770213001\n\t\tSTATE OF INCORPORATION:\t\t\tCA\n\t\tFISCAL YEAR END:\t\t\t0331\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K/A\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t000-19720\n\t\tFILM NUMBER:\t\t18978522\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t3240 WHIPPLE\n\t\tSTREET 2:\t\tROAD\n\t\tCITY:\t\t\tUNION CITY\n\t\tSTATE:\t\t\tCA\n\t\tZIP:\t\t\t94587\n\t\tBUSINESS PHONE:\t\t(510) 675-6500\n\n\tMAIL ADDRESS:\..."
ABCB,<SEC-DOCUMENT>0000351569-19-000006.txt : 20190301\n<SEC-HEADER>0000351569-19-000006.hdr.sgml : 20190301\n<ACCEPTANCE-DATETIME>20190301160717\nACCESSION NUMBER:\t\t0000351569-19-000006\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t154\nCONFORMED PERIOD OF REPORT:\t20181231\nFILED AS OF DATE:\t\t20190301\nDATE AS OF CHANGE:\t\t20190301\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAmeris Bancorp\n\t\tCENTRAL INDEX KEY:\t\t\t0000351569\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tSTATE COMMERCIAL BANKS [6022]\n\t\tIRS NUMBER:\t\t\t\t581456434\n\t\tSTATE OF INCORPORATION:\t\t\tGA\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-13901\n\t\tFILM NUMBER:\t\t19649333\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t24 2/ND/ AVENUE\n\t\tCITY:\t\t\tMOULTRIE\n\t\tSTATE:\t\t\tGA\n\t\tZIP:\t\t\t31768\n\t\tBUSINESS PHONE:\t\t9128901111\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\tPO BOX 1500\n\t\tC...
ABCD,"<SEC-DOCUMENT>0001466815-18-000016.txt : 20180307\n<SEC-HEADER>0001466815-18-000016.hdr.sgml : 20180307\n<ACCEPTANCE-DATETIME>20180306195519\nACCESSION NUMBER:\t\t0001466815-18-000016\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t106\nCONFORMED PERIOD OF REPORT:\t20171231\nFILED AS OF DATE:\t\t20180307\nDATE AS OF CHANGE:\t\t20180306\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tCAMBIUM LEARNING GROUP, INC.\n\t\tCENTRAL INDEX KEY:\t\t\t0001466815\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tMISCELLANEOUS PUBLISHING [2741]\n\t\tIRS NUMBER:\t\t\t\t270587428\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t001-34575\n\t\tFILM NUMBER:\t\t18671602\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t17855 NORTH DALLAS PARKWAY, SUITE 400\n\t\tCITY:\t\t\tDALLAS\n\t\tSTATE:\t\t\tTX\n\t\tZIP:\t\t\t75287\n\t\tBUSINESS PHONE:\t\t214-932-9500\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\t17855 NORTH DALLA..."
ABFS,<SEC-DOCUMENT>0001558370-19-001332.txt : 20190228\n<SEC-HEADER>0001558370-19-001332.hdr.sgml : 20190228\n<ACCEPTANCE-DATETIME>20190228164652\nACCESSION NUMBER:\t\t0001558370-19-001332\nCONFORMED SUBMISSION TYPE:\t10-K\nPUBLIC DOCUMENT COUNT:\t\t129\nCONFORMED PERIOD OF REPORT:\t20181231\nFILED AS OF DATE:\t\t20190228\nDATE AS OF CHANGE:\t\t20190228\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tARCBEST CORP /DE/\n\t\tCENTRAL INDEX KEY:\t\t\t0000894405\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tTRUCKING (NO LOCAL) [4213]\n\t\tIRS NUMBER:\t\t\t\t710673405\n\t\tSTATE OF INCORPORATION:\t\t\tDE\n\t\tFISCAL YEAR END:\t\t\t1231\n\n\tFILING VALUES:\n\t\tFORM TYPE:\t\t10-K\n\t\tSEC ACT:\t\t1934 Act\n\t\tSEC FILE NUMBER:\t000-19969\n\t\tFILM NUMBER:\t\t19644149\n\n\tBUSINESS ADDRESS:\t\n\t\tSTREET 1:\t\t8401 MCCLURE DRIVE\n\t\tCITY:\t\t\tFORT SMITH\n\t\tSTATE:\t\t\tAR\n\t\tZIP:\t\t\t72916\n\t\tBUSINESS PHONE:\t\t4797856000\n\n\tMAIL ADDRESS:\t\n\t\tSTREET 1:\t\tP O BOX 10048...
ABII,"-----BEGIN PRIVACY-ENHANCED MESSAGE-----\nProc-Type: 2001,MIC-CLEAR\nOriginator-Name: webmaster@www.sec.gov\nOriginator-Key-Asymmetric:\n MFgwCgYEVQgBAQICAf8DSgAwRwJAW2sNKK9AVtBzYZmr6aGjlWyK3XmZv3dTINen\n TWSM7vrzLADbmYQaionwg5sDW3P6oaM5D3tdezXMm7z1T+B+twIDAQAB\nMIC-Info: RSA-MD5,RSA,\n CHkUjMkstEDpvbShoQauSoyOavtwbHq30cGukjchmSpeTxu1XA5MoFRR+wlBJDTc\n kghpmwLL9rBB1LcOqAJPOw==\n\n<SEC-DOCUMENT>0001193125-10-093431.txt : 20100426\n<SEC-HEADER>0001193125-10-093431.hdr.sgml : 20100426\n<ACCEPTANCE-DATETIME>20100426170124\nACCESSION NUMBER:\t\t0001193125-10-093431\nCONFORMED SUBMISSION TYPE:\t10-K/A\nPUBLIC DOCUMENT COUNT:\t\t3\nCONFORMED PERIOD OF REPORT:\t20091231\nFILED AS OF DATE:\t\t20100426\nDATE AS OF CHANGE:\t\t20100426\n\nFILER:\n\n\tCOMPANY DATA:\t\n\t\tCOMPANY CONFORMED NAME:\t\t\tAbraxis BioScience, Inc.\n\t\tCENTRAL INDEX KEY:\t\t\t0001409012\n\t\tSTANDARD INDUSTRIAL CLASSIFICATION:\tPHARMACEUTICAL PREPARATIONS [2834]\n\t\tIRS NUMBER:\t\t\t\t300431735\n\t\tSTATE OF INCORPO..."


In [25]:
# Let's take a look at the report for AAC
print(data_df.report.loc['AAC'][0:800])

<SEC-DOCUMENT>0001144204-17-037280.txt : 20170719
<SEC-HEADER>0001144204-17-037280.hdr.sgml : 20170719
<ACCEPTANCE-DATETIME>20170719151214
ACCESSION NUMBER:		0001144204-17-037280
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		123
CONFORMED PERIOD OF REPORT:	20160630
FILED AS OF DATE:		20170719
DATE AS OF CHANGE:		20170719

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			Hongli Clean Energy Technologies Corp.
		CENTRAL INDEX KEY:			0001099290
		STANDARD INDUSTRIAL CLASSIFICATION:	STEEL WORKS, BLAST FURNACES  ROLLING MILLS (COKE OVENS) [3312]
		IRS NUMBER:				593404233
		STATE OF INCORPORATION:			FL
		FISCAL YEAR END:			0630

	FILING VALUES:
		FORM TYPE:		10-K
		SEC ACT:		1934 Act
		SEC FILE NUMBER:	001-15931
		FILM NUMBER:		17972216

	BUSINESS ADDRESS:	
		STREET 1:		KUANGGONG R


In [26]:
# Apply a first round of text cleaning techniques
import re
import string

def clean_text_round1(text):
    '''Make text lowercase, remove text in square brackets, remove punctuation and remove words containing numbers.'''
    text = text.lower()
    # Raw strings avoid invalid-escape warnings for \[ , \w and \d
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)
    text = re.sub(r'\w*\d\w*', '', text)
    return text

round1 = clean_text_round1
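
To see what each substitution does, it can help to run the function on single header lines taken from the filing shown earlier. The block repeats the function so it runs on its own. Note that removing punctuation first glues a tag name onto the digits that follow it, so the words-with-digits rule then drops the entire run.

```python
import re
import string

def clean_text_round1(text):
    text = text.lower()
    text = re.sub(r'\[.*?\]', '', text)                               # drop [bracketed] text
    text = re.sub(r'[%s]' % re.escape(string.punctuation), '', text)  # drop punctuation
    text = re.sub(r'\w*\d\w*', '', text)                              # drop words containing digits
    return text

sic_line = clean_text_round1("STANDARD INDUSTRIAL CLASSIFICATION:\tAIR TRANSPORTATION, SCHEDULED [4512]")
form_line = clean_text_round1("FORM TYPE:\t\t10-K")
tag_line = clean_text_round1("<ACCEPTANCE-DATETIME>20190225073134")

print(repr(sic_line))
print(repr(form_line))
print(repr(tag_line))
```

Tabs and trailing spaces survive this round, which matches the `\t` runs visible in the cleaned output.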

In [27]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_df.report.apply(round1))

In [28]:
data_clean.head()

Unnamed: 0,report
AAC,\n \n\naccession number\t\t\nconformed submission type\t\npublic document count\t\t\nconformed period of report\t\nfiled as of date\t\t\ndate as of change\t\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\thongli clean energy technologies corp\n\t\tcentral index key\t\t\t\n\t\tstandard industrial classification\tsteel works blast furnaces rolling mills coke ovens \n\t\tirs number\t\t\t\t\n\t\tstate of incorporation\t\t\tfl\n\t\tfiscal year end\t\t\t\n\n\tfiling values\n\t\tform type\t\t\n\t\tsec act\t\t act\n\t\tsec file number\t\n\t\tfilm number\t\t\n\n\tbusiness address\t\n\t\tstreet \t\tkuanggong rd tiyu rd flr\n\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\n\t\tcity\t\t\tpingdingshan henan province\n\t\tstate\t\t\t\n\t\tzip\t\t\t\n\t\tbusiness phone\t\t\n\n\tmail address\t\n\t\tstreet \t\tkuanggong rd tiyu rd flr\n\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\n\t\tcity\t\t\tpingdingshan henan province\n\t\tstate\t\t\t\n\t\tzip\t\t\t\n\n\tforme...
AAL,\n \n\naccession number\t\t\nconformed submission type\t\npublic document count\t\t\nconformed period of report\t\nfiled as of date\t\t\ndate as of change\t\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\tamerican airlines group inc\n\t\tcentral index key\t\t\t\n\t\tstandard industrial classification\tair transportation scheduled \n\t\tirs number\t\t\t\t\n\t\tstate of incorporation\t\t\tde\n\t\tfiscal year end\t\t\t\n\n\tfiling values\n\t\tform type\t\t\n\t\tsec act\t\t act\n\t\tsec file number\t\n\t\tfilm number\t\t\n\n\tbusiness address\t\n\t\tstreet \t\t amon carter blvd\n\t\tcity\t\t\tfort worth\n\t\tstate\t\t\ttx\n\t\tzip\t\t\t\n\t\tbusiness phone\t\t\n\n\tmail address\t\n\t\tstreet \t\t amon carter blvd\n\t\tcity\t\t\tfort worth\n\t\tstate\t\t\ttx\n\t\tzip\t\t\t\n\n\tformer company\t\n\t\tformer conformed name\tamr corp\n\t\tdate of name change\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\tamerican airlines inc\n\t\tcentral index key\t\t\t...
AAME,\n \n\naccession number\t\t\nconformed submission type\t\npublic document count\t\t\nconformed period of report\t\nfiled as of date\t\t\ndate as of change\t\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\tatlantic american corp\n\t\tcentral index key\t\t\t\n\t\tstandard industrial classification\tlife insurance \n\t\tirs number\t\t\t\t\n\t\tstate of incorporation\t\t\tga\n\t\tfiscal year end\t\t\t\n\n\tfiling values\n\t\tform type\t\t\n\t\tsec act\t\t act\n\t\tsec file number\t\n\t\tfilm number\t\t\n\n\tbusiness address\t\n\t\tstreet \t\t peachtree rd ne\n\t\tcity\t\t\tatlanta\n\t\tstate\t\t\tga\n\t\tzip\t\t\t\n\t\tbusiness phone\t\t\n\n\tmail address\t\n\t\tstreet \t\t peachtree road\n\t\tcity\t\t\tatlanta\n\t\tstate\t\t\tga\n\t\tzip\t\t\t\nsecheader\ndocument\n\n\n\ndescriptionform \ntext\ndoctype html public html \nhtml\nhead\n title title\n meta contenttexthtml \n meta namegeneratedby contentsummit financial printing llc \n licensed to broadridge fina...
AAOI,\n \n\naccession number\t\t\nconformed submission type\t\npublic document count\t\t\nconformed period of report\t\nfiled as of date\t\t\ndate as of change\t\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\tapplied optoelectronics inc\n\t\tcentral index key\t\t\t\n\t\tstandard industrial classification\tsemiconductors related devices \n\t\tirs number\t\t\t\t\n\t\tstate of incorporation\t\t\tde\n\t\tfiscal year end\t\t\t\n\n\tfiling values\n\t\tform type\t\t\n\t\tsec act\t\t act\n\t\tsec file number\t\n\t\tfilm number\t\t\n\n\tbusiness address\t\n\t\tstreet \t\t jess pirtle blvd\n\t\tcity\t\t\tsugar land\n\t\tstate\t\t\ttx\n\t\tzip\t\t\t\n\t\tbusiness phone\t\t\n\n\tmail address\t\n\t\tstreet \t\t jess pirtle blvd\n\t\tcity\t\t\tsugar land\n\t\tstate\t\t\ttx\n\t\tzip\t\t\t\n\n\tformer company\t\n\t\tformer conformed name\tapplied optoelectronics inc\n\t\tdate of name change\t\nsecheader\ndocument\n\n\n\n\ntext\nhtml document created with merrill bridge \ncreated o...
AAPL,\n \n\naccession number\t\t\nconformed submission type\t\npublic document count\t\t\nconformed period of report\t\nfiled as of date\t\t\ndate as of change\t\t\n\nfiler\n\n\tcompany data\t\n\t\tcompany conformed name\t\t\tapple inc\n\t\tcentral index key\t\t\t\n\t\tstandard industrial classification\telectronic computers \n\t\tirs number\t\t\t\t\n\t\tstate of incorporation\t\t\tca\n\t\tfiscal year end\t\t\t\n\n\tfiling values\n\t\tform type\t\t\n\t\tsec act\t\t act\n\t\tsec file number\t\n\t\tfilm number\t\t\n\n\tbusiness address\t\n\t\tstreet \t\tone apple park way\n\t\tcity\t\t\tcupertino\n\t\tstate\t\t\tca\n\t\tzip\t\t\t\n\t\tbusiness phone\t\t \n\n\tmail address\t\n\t\tstreet \t\tone apple park way\n\t\tcity\t\t\tcupertino\n\t\tstate\t\t\tca\n\t\tzip\t\t\t\n\n\tformer company\t\n\t\tformer conformed name\tapple inc\n\t\tdate of name change\t\n\n\tformer company\t\n\t\tformer conformed name\tapple computer inc\n\t\tdate of name change\t\nsecheader\ndocument\n\n\n\n\ntext\nxbrl...


In [29]:
# Apply a second round of cleaning
def clean_text_round2(text):
    '''Get rid of some additional punctuation and nonsensical text that was missed the first time around.'''
    text = re.sub('[‘’“”…]', '', text)
    text = re.sub('\n', '', text)
    text = re.sub('<', '', text)
    text = re.sub('>', '', text)
    return text

round2 = clean_text_round2
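
One side effect worth knowing: replacing `\n` with an empty string joins the last word of one line to the first word of the next, which is where tokens such as `secheaderdocument` in the output come from. If separate words are preferred, a possible alternative (an assumption, not what the notebook does) is to collapse every run of whitespace into a single space:

```python
import re

sample = "secheader\ndocument\n\ttext"

# What round 2 does: newlines vanish, so neighbouring words fuse
fused = re.sub('\n', '', sample)

# Alternative sketch: turn any whitespace run (\n, \t, spaces) into one space
spaced = re.sub(r'\s+', ' ', sample).strip()

print(repr(fused))
print(repr(spaced))
```

Which behaviour is better depends on the analysis; fused tokens inflate the vocabulary with words that never appear in the filings.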

In [30]:
# Let's take a look at the updated text
data_clean = pd.DataFrame(data_clean.report.apply(round2))

In [31]:
data_clean.head()

Unnamed: 0,report
AAC,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\thongli clean energy technologies corp\t\tcentral index key\t\t\t\t\tstandard industrial classification\tsteel works blast furnaces rolling mills coke ovens \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tfl\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\tkuanggong rd tiyu rd flr\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\t\tcity\t\t\tpingdingshan henan province\t\tstate\t\t\t\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\tkuanggong rd tiyu rd flr\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\t\tcity\t\t\tpingdingshan henan province\t\tstate\t\t\t\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tsinocoking coal coke chemical industries in...
AAL,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tamerican airlines group inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tair transportation scheduled \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tde\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t amon carter blvd\t\tcity\t\t\tfort worth\t\tstate\t\t\ttx\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t amon carter blvd\t\tcity\t\t\tfort worth\t\tstate\t\t\ttx\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tamr corp\t\tdate of name change\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tamerican airlines inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tair transportation scheduled \t\tirs number\t\t\t\t\t\...
AAME,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tatlantic american corp\t\tcentral index key\t\t\t\t\tstandard industrial classification\tlife insurance \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tga\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t peachtree rd ne\t\tcity\t\t\tatlanta\t\tstate\t\t\tga\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t peachtree road\t\tcity\t\t\tatlanta\t\tstate\t\t\tga\t\tzip\t\t\tsecheaderdocumentdescriptionform textdoctype html public html htmlhead title title meta contenttexthtml meta namegeneratedby contentsummit financial printing llc licensed to broadridge financial solutions inc document created using edgarfilings profile copyright broadrid...
AAOI,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tapplied optoelectronics inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tsemiconductors related devices \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tde\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t jess pirtle blvd\t\tcity\t\t\tsugar land\t\tstate\t\t\ttx\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t jess pirtle blvd\t\tcity\t\t\tsugar land\t\tstate\t\t\ttx\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tapplied optoelectronics inc\t\tdate of name change\tsecheaderdocumenttexthtml document created with merrill bridge created on pmhtml\thead\t\ttitle\t\t\t\t\ttitle\thead\tbodydiv \t\tp new romantimesseriffontsize \t\t\t...
AAPL,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tapple inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\telectronic computers \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tca\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\tone apple park way\t\tcity\t\t\tcupertino\t\tstate\t\t\tca\t\tzip\t\t\t\t\tbusiness phone\t\t \tmail address\t\t\tstreet \t\tone apple park way\t\tcity\t\t\tcupertino\t\tstate\t\t\tca\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tapple inc\t\tdate of name change\t\tformer company\t\t\tformer conformed name\tapple computer inc\t\tdate of name change\tsecheaderdocumenttextxbrlxml xbrl document created with wdesk from workiva document created using wdesk copyright workiva html ...


---
## Organizing The Data

1. **Corpus** - a collection of text
2. **Document-Term Matrix** - word counts in matrix format

### Corpus

We already created a corpus in an earlier step. The definition of a corpus is a collection of texts, and they are all put together neatly in a pandas dataframe here.

In [32]:
# Align company names on the ticker index rather than relying on row order
data_clean['company_name'] = df.set_index('ticker')['name']

In [33]:
data_clean.head()

Unnamed: 0,report,company_name
AAC,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\thongli clean energy technologies corp\t\tcentral index key\t\t\t\t\tstandard industrial classification\tsteel works blast furnaces rolling mills coke ovens \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tfl\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\tkuanggong rd tiyu rd flr\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\t\tcity\t\t\tpingdingshan henan province\t\tstate\t\t\t\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\tkuanggong rd tiyu rd flr\t\tstreet \t\tchengshi xin yong she tiyu rd xinhua\t\tcity\t\t\tpingdingshan henan province\t\tstate\t\t\t\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tsinocoking coal coke chemical industries in...,Sinocoking Coal & Coke Chemical Industries Inc
AAL,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tamerican airlines group inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tair transportation scheduled \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tde\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t amon carter blvd\t\tcity\t\t\tfort worth\t\tstate\t\t\ttx\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t amon carter blvd\t\tcity\t\t\tfort worth\t\tstate\t\t\ttx\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tamr corp\t\tdate of name change\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tamerican airlines inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tair transportation scheduled \t\tirs number\t\t\t\t\t\...,American Airlines Group Inc
AAME,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tatlantic american corp\t\tcentral index key\t\t\t\t\tstandard industrial classification\tlife insurance \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tga\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t peachtree rd ne\t\tcity\t\t\tatlanta\t\tstate\t\t\tga\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t peachtree road\t\tcity\t\t\tatlanta\t\tstate\t\t\tga\t\tzip\t\t\tsecheaderdocumentdescriptionform textdoctype html public html htmlhead title title meta contenttexthtml meta namegeneratedby contentsummit financial printing llc licensed to broadridge financial solutions inc document created using edgarfilings profile copyright broadrid...,Atlantic American Corp
AAOI,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tapplied optoelectronics inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\tsemiconductors related devices \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tde\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\t jess pirtle blvd\t\tcity\t\t\tsugar land\t\tstate\t\t\ttx\t\tzip\t\t\t\t\tbusiness phone\t\t\tmail address\t\t\tstreet \t\t jess pirtle blvd\t\tcity\t\t\tsugar land\t\tstate\t\t\ttx\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tapplied optoelectronics inc\t\tdate of name change\tsecheaderdocumenttexthtml document created with merrill bridge created on pmhtml\thead\t\ttitle\t\t\t\t\ttitle\thead\tbodydiv \t\tp new romantimesseriffontsize \t\t\t...,Applied Optoelectronics Inc
AAPL,accession number\t\tconformed submission type\tpublic document count\t\tconformed period of report\tfiled as of date\t\tdate as of change\t\tfiler\tcompany data\t\t\tcompany conformed name\t\t\tapple inc\t\tcentral index key\t\t\t\t\tstandard industrial classification\telectronic computers \t\tirs number\t\t\t\t\t\tstate of incorporation\t\t\tca\t\tfiscal year end\t\t\t\tfiling values\t\tform type\t\t\t\tsec act\t\t act\t\tsec file number\t\t\tfilm number\t\t\tbusiness address\t\t\tstreet \t\tone apple park way\t\tcity\t\t\tcupertino\t\tstate\t\t\tca\t\tzip\t\t\t\t\tbusiness phone\t\t \tmail address\t\t\tstreet \t\tone apple park way\t\tcity\t\t\tcupertino\t\tstate\t\t\tca\t\tzip\t\t\t\tformer company\t\t\tformer conformed name\tapple inc\t\tdate of name change\t\tformer company\t\t\tformer conformed name\tapple computer inc\t\tdate of name change\tsecheaderdocumenttextxbrlxml xbrl document created with wdesk from workiva document created using wdesk copyright workiva html ...,Apple Inc


In [34]:
# Let's pickle it for later use
data_clean.to_pickle("pickle/corpus_EDGAR.pkl")

---
### Document-Term Matrix

Many text analysis techniques require the text to be tokenized, meaning broken down into smaller pieces. The most common tokenization technique is to break text down into words. We can do this using scikit-learn's CountVectorizer, where every row represents a different document and every column represents a different word.

In addition, with CountVectorizer, we can remove stop words. Stop words are common words, such as 'a' and 'the', that add little meaning to the text.
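
To make the row-per-document, column-per-word layout concrete, here is a small pure-Python sketch that builds a document-term matrix by hand. It is only an illustration of the idea: the two toy documents and the three-word stop list are made up, and `CountVectorizer` applies its own tokenization rules and a much longer English stop word list.

```python
from collections import Counter

# Made-up documents keyed by ticker, and a tiny hypothetical stop word list
docs = {
    "AAL": "the airline operates scheduled air transportation",
    "AAPL": "the company sells electronic computers and the company sells services",
}
stop_words = {"the", "and", "a"}

# Count the remaining words in each document
counts = {
    ticker: Counter(w for w in text.split() if w not in stop_words)
    for ticker, text in docs.items()
}

# Columns: the sorted vocabulary across all documents
vocab = sorted(set().union(*counts.values()))

# Rows: one list of counts per document, aligned with vocab
dtm = {ticker: [c[word] for word in vocab] for ticker, c in counts.items()}

print(vocab)
for ticker, row in dtm.items():
    print(ticker, row)
```

Most entries are zero even in this toy case, which is why scikit-learn returns a sparse matrix that has to be converted before it is placed in a dataframe.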

In [35]:
# We are going to create a document-term matrix using CountVectorizer, and exclude common English stop words
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(stop_words='english')

In [36]:
data_cv = cv.fit_transform(data_clean.report)

In [37]:
print(data_clean.report.iloc[0][:800])

    accession number		conformed submission type	public document count		conformed period of report	filed as of date		date as of change		filer	company data			company conformed name			hongli clean energy technologies corp		central index key					standard industrial classification	steel works blast furnaces  rolling mills coke ovens 		irs number						state of incorporation			fl		fiscal year end				filing values		form type				sec act		 act		sec file number			film number			business address			street 		kuanggong rd  tiyu rd  flr		street 		chengshi xin yong she tiyu rd xinhua		city			pingdingshan henan province		state					zip					business phone			mail address			street 		kuanggong rd  tiyu rd  flr		street 		chengshi xin yong she tiyu rd xinhua		city			pingdingshan henan province		state					zip				for


In [38]:
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
data_dtm = pd.DataFrame(data_cv.toarray(), columns=cv.get_feature_names_out())
data_dtm.index = data_clean.index

In [None]:
data_dtm.head()

In [40]:
# Let's pickle it for later use
data_dtm.to_pickle("pickle/dtm_EDGAR.pkl")

In [41]:
# Let's also pickle the cleaned data (before we put it in document-term matrix format) and the CountVectorizer object
data_clean.to_pickle('pickle/data_clean_EDGAR.pkl')
with open("pickle/cv_EDGAR.pkl", "wb") as f:
    pickle.dump(cv, f)