### Industry Clustering of Russell 2000 Companies using Latent Dirichlet Allocation

This project employs a latent Dirichlet allocation model (LDA) to cluster the companies comprising the FTSE Russell 2000 Index into industry groups based upon the text used in their annual reports. Using the industry groupings as identified by the model, sentiment analysis is performed to compare the relative levels of optimism amongst companies within one industry and between industries themselves. 

This package will perform the following sequential steps:

1. Gather and clean raw 10-K filings from the SEC online database
2. Preprocess each document
3. Cluster each company into industry groupings
4. Display the results of the LDA cluster model
5. Perform sentiment analysis at the industry level
6. Perform sentiment analysis at the company level

In [1]:
# Ensure that packages are located in environment path
import sys
sys.path.append(r'C:\Users\ian_d\OneDrive\Desktop\Capstone_files\Final_Code')

In [2]:
# import package
import Russell2000_LDA

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ian_d\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Data Acquisition**

This module acquires and cleans the 10-K filing for each company listed in the reference CSV file. Make sure that the CSV file is saved in your working directory.

Each 10-K filing is cleaned and saved as a single string in a text file that is ready for NLP preprocessing. Each file is named after the company (e.g. "Shake Shack Inc.".txt).

The files are saved to a new directory inside the working directory, named after the year passed to the function (e.g. "2018").

The function takes three arguments:

1. start index - should be an integer - indicates the starting index for document iteration
2. end index - should be an integer - indicates the stop index for document iteration
3. year - should be an integer - indicates the target year of analysis

The start and end indices allow the documents to be processed in batches, which avoids memory errors when the code is run on a standard personal laptop.

The function returns nothing, but it prints progress messages as it runs.
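The batching described above can be sketched as a small helper that splits the company list into index ranges. This is only an illustration: `batch_ranges` is a hypothetical helper, not part of the package, and it assumes the end index is exclusive, which the text does not confirm.

```python
from typing import Iterator, Tuple


def batch_ranges(total: int, batch_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) index pairs that cover range(total) in batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, total, batch_size):
        # The last batch may be shorter than batch_size.
        yield start, min(start + batch_size, total)


# Hypothetical usage (assumes an exclusive end index):
# for start, end in batch_ranges(2000, 100):
#     Russell2000_LDA.data_acquisition.get(start, end, 2018)

print(list(batch_ranges(250, 100)))  # [(0, 100), (100, 200), (200, 250)]
```

Running one batch per call keeps only a slice of the filings in memory at a time.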

In [13]:
# Example: collect cleaned data for first 100 companies using 2017 10-Ks
# Russell2000_LDA.data_acquisition.get(start index, end index, year)

Russell2000_LDA.data_acquisition.get(0,100,2017)

Master dataframe created
Russell CIK nums dataframe created
Raw 10-Ks collected
Raw files cleaned
Acquired target sections of 10-Ks
Saved to disk


**Running the LDA Model**

This module runs the LDA model for the target year.

The function takes in seven arguments:

1. year - integer - indicates the target year of analysis
2. number of clusters - integer - indicates the number of clusters
3. random_state - integer (default = 100) - hyperparameter for Gensim LDA implementation
4. update_every - integer (default = 1) - hyperparameter for Gensim LDA implementation
5. chunksize - integer (default = 100) - hyperparameter for Gensim LDA implementation
6. passes - integer (default = 10) - hyperparameter for Gensim LDA implementation
7. alpha - string (default = 'auto') - hyperparameter for Gensim LDA implementation
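As a rough sketch, the defaults listed above can be collected into keyword arguments. The assumption here, not confirmed by the source, is that the package forwards them unchanged to `gensim.models.LdaModel`, whose constructor accepts parameters with these names. `lda_hyperparameters` is a hypothetical helper.

```python
def lda_hyperparameters(num_topics, random_state=100, update_every=1,
                        chunksize=100, passes=10, alpha="auto"):
    """Collect the documented defaults as keyword arguments."""
    return {
        "num_topics": num_topics,      # number of industry clusters
        "random_state": random_state,  # seed for reproducible runs
        "update_every": update_every,  # chunks processed per model update
        "chunksize": chunksize,        # documents per training chunk
        "passes": passes,              # full passes over the corpus
        "alpha": alpha,                # 'auto' learns the document-topic prior
    }


params = lda_hyperparameters(15)

# Assumed forwarding step (needs gensim, a corpus, and a dictionary):
# from gensim.models import LdaModel
# model = LdaModel(corpus=corpus, id2word=id2word, **params)
```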

The function will return a tuple (results), such that:

results[0] = Gensim LDA model instance

results[1] = an unlabeled bag of words for each document

results[2] = a labeled bag of words for each document

results[3] = corpus

results[4] = corpus dictionary

results[5] = dataframe with clustering results

To improve interpretability, the function will print the above to the console.
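Because the output is a positional tuple, it can help to wrap it in a named structure. The sketch below is an optional convenience, not something the package provides; the field names follow the labels the function prints (`corpus_list`, `reference_list`, `corpus`, `id2word`, `classification_df`), and `name_results` is a hypothetical helper.

```python
from collections import namedtuple

LDAResults = namedtuple(
    "LDAResults",
    ["model", "corpus_list", "reference_list", "corpus", "id2word",
     "classification_df"],
)


def name_results(results):
    """Wrap the six-element output tuple so fields can be read by name."""
    if len(results) != 6:
        raise ValueError("expected a tuple of six elements")
    return LDAResults(*results)


# Placeholder values stand in for the real objects in this illustration.
named = name_results(("model", [], [], [], {}, None))
print(named.id2word)  # same object as results[4]
```

The wrapped value is still a tuple, so it should remain usable wherever positional indexing such as `results[5]` is expected.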

In [6]:
# Example: run LDA model for 15 industries (clusters) using 2018 data
# output = Russell2000_LDA.lda.lda(year, number of clusters)

results = Russell2000_LDA.lda.lda(2018,15)

Year: 2018
Output tuple: 
results[0] = Gensim LDA model instance
results[1] = corpus_list
results[2] = reference_list
results[3] = corpus
results[4] = id2word
results[5] = classification_df


**Display Results from LDA Model**

The function below displays the results from the LDA model.

The function takes one tuple as argument: the output from the lda() function.

The function returns a dataframe.

In [7]:
# Example: display results from LDA model for 15 industries (clusters) using 2018 data
# Russell2000_LDA.lda.display(output from LDA model)

Russell2000_LDA.lda.display(results)

| | Industry | Percent_Contribution | Keywords | Company |
|---|---|---|---|---|
| 0 | 7.0 | 0.568 | patient, develop, studi, disea, includ, treatm... | ACORDA THERAPEUTICS INC |
| 1 | 9.0 | 1.000 | product, oper, compani, includ, custom, financ... | ALLEGHENY TECHNOLOGIES INC |
| 2 | 12.0 | 1.000 | compani, financ, insur, bank, includ, rate, re... | AMERICAN EQUITY INVESTMENT LIFE HOLDING CO |
| 3 | 3.0 | 0.812 | bank, financ, loan, compani, requir, includ, s... | AMERISAFE INC |
| 4 | 0.0 | 1.000 | servic, product, provid, includ, busi, custom,... | AMKOR TECHNOLOGY, INC. |
| 5 | 5.0 | 0.969 | product, custom, market, includ, busi, oper, d... | ANI PHARMACEUTICALS INC |
| 6 | 14.0 | 0.959 | product, servic, compani, market, busi, provid... | ANTARES PHARMA, INC. |
| 7 | 6.0 | 1.000 | properti, financ, oper, lea, includ, invest, s... | ANWORTH MORTGAGE ASSET CORP |
| 8 | 6.0 | 0.998 | properti, financ, oper, lea, includ, invest, s... | ARCH COAL INC |
| 9 | 7.0 | 0.951 | patient, develop, studi, disea, includ, treatm... | ARQULE INC |
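The `Industry` and `Percent_Contribution` columns look like a dominant-topic assignment: each company is placed in the topic with the highest probability for its document. The source does not show how the columns are computed, so the sketch below is an assumption. It takes a list of `(topic_id, probability)` pairs, the shape Gensim's `LdaModel.get_document_topics` returns, and the numbers are made up for illustration.

```python
def dominant_topic(topic_distribution):
    """Return the (topic_id, probability) pair with the highest probability."""
    if not topic_distribution:
        raise ValueError("topic distribution is empty")
    return max(topic_distribution, key=lambda pair: pair[1])


# Illustrative distribution for one document (not real model output).
doc_topics = [(3, 0.812), (12, 0.150), (6, 0.038)]
industry, contribution = dominant_topic(doc_topics)
print(industry, contribution)  # 3 0.812
```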


**Run Industry Sentiment Analysis**

This module compares the average sentiment across industry clusters.

The function takes three arguments:

1. year - integer - should be the same year used as an argument in the LDA model
2. number of clusters - integer - should be the same number used as an argument in the LDA model
3. results - tuple - should be the output from the LDA model

The function will return a pandas dataframe that contains the results from the sentiment analysis.
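The industry-level comparison amounts to averaging per-company sentiment scores within each cluster. The source does not say which sentiment scorer is used, so the sketch below starts from scores that are already computed. `average_sentiment_by_industry` is a hypothetical helper and the values are illustrative.

```python
from collections import defaultdict
from statistics import mean


def average_sentiment_by_industry(records):
    """Average sentiment per industry from (industry, score) pairs."""
    grouped = defaultdict(list)
    for industry, score in records:
        grouped[industry].append(score)
    # Sort by industry label so the output order is stable.
    return {industry: mean(scores)
            for industry, scores in sorted(grouped.items())}


# Illustrative scores, one pair per company.
records = [(0, 0.25), (0, 0.75), (1, -0.5)]
print(average_sentiment_by_industry(records))  # {0: 0.5, 1: -0.5}
```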

In [8]:
# Example: compare sentiment across 15 industries using 2018 data
# industry_sentiment = Russell2000_LDA.lda.industry_sentiment(year, number of clusters, output from LDA model)

industry_sentiment = Russell2000_LDA.lda.industry_sentiment(2018,15,results)

In [9]:
industry_sentiment

Unnamed: 0,Industry,Sentiment,Industry_Keywords
0,0,0.0209,"[servic, product, provid, includ, busi, custom, compani, client, market, oper, employ, financ, manag, technolog, offer, industri, develop, base, requir, cost, addit, gener, commun, process, facil, result, care, increa, manufactur, applic, chang, new, support, perform, believ, sale, revenu, plan, solut, packag]"
1,1,0.0478,"[vaccin, trial, rsv, develop, disea, product, immun, rsv_vaccin, influenza, viru, technolog, respon, adjuv, program, includ, clinic_trial, patent, antigen, efficaci, current, older_adult, infect, dose, antibodi, compani, base, phase, potenti, year, result, strain, data, recombin, protein, market, infant, relat, vaccin_candid, popul, protect]"
2,2,0.037,"[includ, product, financ, oper, market, busi, provid, manag, servic, custom, student, develop, offer, industri, lea, sale, univ, addit, design, invest, program, restaur, requir, result, employ, gener, griffin, client, increa, time, technolog, institut, believ, new, continu, base, compani, secur, offic, build]"
3,3,0.0082,"[bank, financ, loan, compani, requir, includ, servic, busi, capit, secur, deposit, provid, regul, institut, manag, activ, oper, gener, insur, bank_hold_compani, rate, subject, market, invest, chang, certain, addit, custom, feder_reserv, asset, commerci, borrow, polici, employ, control, product, act, base, corpor, offer]"
4,4,0.007,"[loan, bank, total, gener, market, decemb, compani, includ, rate, patient, afx, year, financ, borrow, secur, aaa, result, commerci, current, million, provid, unit_state, ovat, base, product, sale, requir, allow_loan_loss, devic, invest, approv, aneurysm, classifi, servic, nellix_eva, addit, origin, allow, time, receiv]"
5,5,0.0175,"[product, custom, market, includ, busi, oper, develop, store, financ, new, sale, manufactur, provid, result, addit, design, technolog, gener, solut, servic, requir, system, compani, continu, manag, increa, test, industri, us, merchandi, base, offer, time, revenu, effect, competitor, adver, believ, chang, process]"
6,6,0.0478,"[properti, financ, oper, lea, includ, invest, servic, manag, loan, tenant, rate, coal, gener, compani, market, increa, secur, asset, real_estat, mortgag, requir, affect, adver, result, addit, provid, cost, sale, acquisit, busi, custom, portfolio, offic, borrow, chang, risk, valu, current, debt, certain]"
7,7,0.0333,"[patient, develop, studi, disea, includ, treatment, cell, activ, addit, commerci, target, product, respon, fda, therapi, program, result, agreement, patent, drug, current, therapeut, licen, data, potenti, water, rate, receiv, approv, dose, clinic, combin, phase, trial, gene, increa, inhibitor, effect, arq, base]"
8,8,-0.0149,"[rei, data, market, client, inform, servic, busi, includ, provid, financ, transact, product, compani, audit, properti, new, report, recoveri_audit, databa, sale, develop, research, profess, vendor, prgx, price, retail, larg, subscrib, base, forrest, industri, commerci_real_estat, cre, manag, revenu, level, custom, oper, support]"
9,9,0.0306,"[product, oper, compani, includ, custom, financ, servic, market, busi, lea, engin, sale, result, provid, materi, increa, gener, price, addit, adver, develop, cost, new, manufactur, contract, requir, facil, relat, satellit, purcha, us, produc, telesat, believ, industri, futur, chang, segment, effect, base]"


**Run Company Sentiment Analysis**

This module compares the average sentiment of the companies within each industry cluster.

The function takes three arguments:

1. year - integer - should be the same year used as an argument in the LDA model
2. number of clusters - integer - should be the same number used as an argument in the LDA model
3. results - tuple - should be the output from the LDA model

The function will return a pandas dataframe that contains the results from the sentiment analysis.
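A within-industry comparison can be sketched as a ranking of companies by sentiment score, keeping the top and bottom entries. This is an assumed reading of the `Most_Postive` and `Most_Negative` columns, not the package's confirmed logic. `rank_companies` and the cutoff of ten are hypothetical, and the scores are illustrative.

```python
def rank_companies(scores, top_n=10):
    """Return (most_positive, most_negative) company lists for one industry.

    scores maps company name to its sentiment score.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    most_positive = ordered[:top_n]        # highest scores first
    most_negative = ordered[::-1][:top_n]  # lowest scores first
    return most_positive, most_negative


# Illustrative scores for a three-company cluster.
scores = {"Company A": 0.3, "Company B": -0.1, "Company C": 0.1}
print(rank_companies(scores, top_n=2))
# (['Company A', 'Company C'], ['Company B', 'Company C'])
```

In a cluster with fewer than `top_n` companies, both lists contain every company, one in reverse order of the other.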

In [10]:
# Example: compare sentiment for all companies within each industry using 2018 data
# company_sentiment = Russell2000_LDA.lda.company_sentiment(year, number of clusters, output from LDA model)

company_sentiment = Russell2000_LDA.lda.company_sentiment(2018,15,results)

In [11]:
company_sentiment

Unnamed: 0,Industry,Most_Postive,Most_Negative,Industry_Keywords
0,0,"[Capital senior living corp, Vishay intertechnology inc, Veeco instruments inc, National healthcare corp, Ttec holdings, inc., Amkor technology, inc., Dxp enterprises inc, Siga technologies inc, Forrester research, inc., Navigant consulting inc]","[Insperity, inc., Nii holdings inc, Heska corp, Navigant consulting inc, Forrester research, inc., Siga technologies inc, Dxp enterprises inc, Amkor technology, inc., Ttec holdings, inc.]","[servic, product, provid, includ, busi, custom, compani, client, market, oper, employ, financ, manag, technolog, offer, industri, develop, base, requir, cost, addit, gener, commun, process, facil, result, care, increa, manufactur, applic, chang, new, support, perform, believ, sale, revenu, plan, solut, packag]"
1,1,[Novavax inc],[Novavax inc],"[vaccin, trial, rsv, develop, disea, product, immun, rsv_vaccin, influenza, viru, technolog, respon, adjuv, program, includ, clinic_trial, patent, antigen, efficaci, current, older_adult, infect, dose, antibodi, compani, base, phase, potenti, year, result, strain, data, recombin, protein, market, infant, relat, vaccin_candid, popul, protect]"
2,2,"[Knoll inc, Griffin industrial realty, inc., Tenneco inc, Silicon laboratories inc, Strayer education inc, Ryman hospitality properties, inc., Career education corp, Sykes enterprises inc, Bjs restaurants inc, Bioscrip, inc.]","[Ladenburg thalmann financial services inc., Vasco data security international inc, Bioscrip, inc., Bjs restaurants inc, Sykes enterprises inc, Career education corp, Ryman hospitality properties, inc., Strayer education inc, Silicon laboratories inc]","[includ, product, financ, oper, market, busi, provid, manag, servic, custom, student, develop, offer, industri, lea, sale, univ, addit, design, invest, program, restaur, requir, result, employ, gener, griffin, client, increa, time, technolog, institut, believ, new, continu, base, compani, secur, offic, build]"
3,3,"[Mercantile bank corp, Norwood financial corp, Enterprise financial services corp, Heritage financial corp wa, Pacific premier bancorp inc, Shore bancshares inc, Enterprise bancorp inc ma, Univest corp of pennsylvania, Amerisafe inc, Smartfinancial inc.]","[Ofg bancorp, Smartfinancial inc., Amerisafe inc, Univest corp of pennsylvania, Enterprise bancorp inc ma, Shore bancshares inc, Pacific premier bancorp inc, Heritage financial corp wa, Enterprise financial services corp]","[bank, financ, loan, compani, requir, includ, servic, busi, capit, secur, deposit, provid, regul, institut, manag, activ, oper, gener, insur, bank_hold_compani, rate, subject, market, invest, chang, certain, addit, custom, feder_reserv, asset, commerci, borrow, polici, employ, control, product, act, base, corpor, offer]"
4,4,"[Dime community bancshares inc, Oceanfirst financial corp, Endologix inc de]","[Endologix inc de, Oceanfirst financial corp, Dime community bancshares inc]","[loan, bank, total, gener, market, decemb, compani, includ, rate, patient, afx, year, financ, borrow, secur, aaa, result, commerci, current, million, provid, unit_state, ovat, base, product, sale, requir, allow_loan_loss, devic, invest, approv, aneurysm, classifi, servic, nellix_eva, addit, origin, allow, time, receiv]"
5,5,"[Hibbett sports inc, Sun hydraulics corp, Childrens place, inc., Formfactor inc, Ani pharmaceuticals inc, Csg systems international inc, Cerus corp]","[Cerus corp, Csg systems international inc, Ani pharmaceuticals inc, Formfactor inc, Childrens place, inc., Sun hydraulics corp, Hibbett sports inc]","[product, custom, market, includ, busi, oper, develop, store, financ, new, sale, manufactur, provid, result, addit, design, technolog, gener, solut, servic, requir, system, compani, continu, manag, increa, test, industri, us, merchandi, base, offer, time, revenu, effect, competitor, adver, believ, chang, process]"
6,6,"[Impac mortgage holdings inc, Washington real estate investment trust, Piedmont office realty trust, inc., Urstadt biddle properties inc, Anworth mortgage asset corp, Rush enterprises inc tx, Franklin street properties corp ma, Arch coal inc]","[Arch coal inc, Franklin street properties corp ma, Rush enterprises inc tx, Anworth mortgage asset corp, Urstadt biddle properties inc, Piedmont office realty trust, inc., Washington real estate investment trust, Impac mortgage holdings inc]","[properti, financ, oper, lea, includ, invest, servic, manag, loan, tenant, rate, coal, gener, compani, market, increa, secur, asset, real_estat, mortgag, requir, affect, adver, result, addit, provid, cost, sale, acquisit, busi, custom, portfolio, offic, borrow, chang, risk, valu, current, debt, certain]"
7,7,"[Arqule inc, Rigel pharmaceuticals inc, Sangamo therapeutics, inc, Acorda therapeutics inc, Cymabay therapeutics, inc., Tg therapeutics, inc., California water service group]","[California water service group, Tg therapeutics, inc., Cymabay therapeutics, inc., Acorda therapeutics inc, Sangamo therapeutics, inc, Rigel pharmaceuticals inc, Arqule inc]","[patient, develop, studi, disea, includ, treatment, cell, activ, addit, commerci, target, product, respon, fda, therapi, program, result, agreement, patent, drug, current, therapeut, licen, data, potenti, water, rate, receiv, approv, dose, clinic, combin, phase, trial, gene, increa, inhibitor, effect, arq, base]"
8,8,"[Reis, inc., Prgx global, inc.]","[Prgx global, inc., Reis, inc.]","[rei, data, market, client, inform, servic, busi, includ, provid, financ, transact, product, compani, audit, properti, new, report, recoveri_audit, databa, sale, develop, research, profess, vendor, prgx, price, retail, larg, subscrib, base, forrest, industri, commerci_real_estat, cre, manag, revenu, level, custom, oper, support]"
9,9,"[Fresh del monte produce inc, Sonic automotive inc, Willis lease finance corp, Loral space & communications inc., Tupperware brands corp, Earthstone energy inc, Northwest pipe co, Schweitzer mauduit international inc, Carbo ceramics inc, Stoneridge inc]","[Allegheny technologies inc, Dril-quip inc, Vse corp, Stoneridge inc, Carbo ceramics inc, Schweitzer mauduit international inc, Northwest pipe co, Earthstone energy inc, Tupperware brands corp]","[product, oper, compani, includ, custom, financ, servic, market, busi, lea, engin, sale, result, provid, materi, increa, gener, price, addit, adver, develop, cost, new, manufactur, contract, requir, facil, relat, satellit, purcha, us, produc, telesat, believ, industri, futur, chang, segment, effect, base]"


**Test the Stability of the Model**

This module tests how stable the clusters are from one year to the next. The adjusted Rand score is used as the measure of stability.

The module only compares clusters built from the set of companies that appear in all target years.

The function takes two arguments:

1. list of years - list of integers - a list of years over which the comparisons will be made
2. number of clusters - integer - the number of industry clusters

The output is a list of floats: the adjusted Rand score for each pair of consecutive years.
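The stability measure can be illustrated with a pure-Python version of the adjusted Rand index, applied to the companies shared by all years. This is a sketch under assumptions: the package may instead call a library routine such as `sklearn.metrics.adjusted_rand_score`, and `adjusted_rand`, `sequential_stability`, and the input layout (a dict of year to `{company: cluster}`) are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0
    # Item pairs that share a cluster in both, in a, and in b.
    both = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    in_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    in_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = in_a * in_b / total_pairs
    maximum = (in_a + in_b) / 2
    if maximum == expected:  # degenerate case, e.g. one cluster in both
        return 1.0
    return (both - expected) / (maximum - expected)


def sequential_stability(labels_by_year):
    """Adjusted Rand index for each pair of consecutive years."""
    years = sorted(labels_by_year)
    # Keep only companies that appear in every year.
    common = sorted(set.intersection(*(set(labels_by_year[y]) for y in years)))
    return [
        adjusted_rand([labels_by_year[prev][c] for c in common],
                      [labels_by_year[curr][c] for c in common])
        for prev, curr in zip(years, years[1:])
    ]


# Illustrative labels: company E is missing in 2018, so it is ignored.
labels = {
    2017: {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2},
    2018: {"A": 1, "B": 1, "C": 0, "D": 0},
}
print(sequential_stability(labels))  # [1.0]
```

The index depends only on which companies are grouped together, so renumbering the clusters from one year to the next does not lower the score.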

In [14]:
# Example: measure the adjusted Rand score between consecutive years for an LDA model with 15 clusters
# rand_scores = Russell2000_LDA.stability.stability(list of years, number of clusters)

rand_scores = Russell2000_LDA.stability.stability([2017,2018],15)

Completed 2017
Completed 2018


In [15]:
rand_scores

[0.3303947325030504]