### Industry Clustering of Russell 2000 Companies using Non-Negative Matrix Factorization

This project employs a non-negative matrix factorization (NMF) model to cluster the companies in the FTSE Russell 2000 Index into industry groups based on the text of their annual reports. Using the industry groupings identified by the model, sentiment analysis is then performed to compare relative levels of optimism among companies within an industry and across industries.

This package will perform the following sequential steps:

1. Gather and clean raw 10-K filings from the SEC online database
2. Preprocess each document
3. Cluster each company into industry groupings
4. Display the results of the NMF cluster model
5. Perform sentiment analysis at the industry level
6. Perform sentiment analysis at the company level

In [1]:
# Ensure that packages are located in environment path
import sys
sys.path.append(r'C:\Users\ian_d\OneDrive\Desktop\Capstone_files\Final_Code')

In [2]:
# import package
import Russell2000_NMF

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ian_d\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


**Data Acquisition**

This module will acquire and clean the 10-K filing for each company included in the reference csv file. Please make sure that the csv file is saved in your working directory.

Each 10-K filing is cleaned and saved as a single string in a text file that is ready for NLP preprocessing. Each file is named after the company (e.g., "Shake Shack Inc.".txt).

These files are saved to a new directory inside the working directory, named after the year passed to the function (e.g., "2018").

The function takes three arguments:

1. start index - should be an integer - indicates the starting index for document iteration
2. end index - should be an integer - indicates the stop index for document iteration
3. year - should be an integer - indicates the target year of analysis

The start and end indices allow the documents to be processed in batches, which avoids memory errors when the module is run on a standard personal laptop.

The function does not have a return statement but will print progress statements as it operates.
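The batch processing described above can be sketched as a small loop. This is only an illustration: `batch_ranges` is a hypothetical helper, not part of the package, and it assumes the end index is exclusive (the text above does not say whether it is). The total of 250 companies is an arbitrary placeholder.

```python
def batch_ranges(start, stop, batch_size):
    """Split [start, stop) into consecutive (start index, end index) pairs."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(lo, min(lo + batch_size, stop))
            for lo in range(start, stop, batch_size)]


# Assumption: 250 companies in total, processed 100 at a time.
for lo, hi in batch_ranges(0, 250, 100):
    # Hypothetical usage; uncomment once the package and reference CSV are available.
    # Russell2000_NMF.data_acquisition.get(lo, hi, 2018)
    print(f"batch {lo}-{hi}")
```

If the end index turns out to be inclusive, the helper would need to return `hi - 1` as the second element of each pair to avoid processing boundary documents twice.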

In [3]:
# Example: collect cleaned data for first 100 companies using 2018 10-Ks
# Russell2000_NMF.data_acquisition.get(start index, end index, year)

Russell2000_NMF.data_acquisition.get(0,100,2018)

Master dataframe created
Russell CIK nums dataframe created
Raw 10-Ks collected
Raw files cleaned
Acquired target sections of 10-Ks
Saved to disk


**Running the NMF Model**

This module runs the NMF for the target year.

The function takes four arguments:

1. year - integer - indicates the target year of analysis
2. number of clusters - integer - indicates the number of clusters
3. number of top words - integer (default = 40) - indicates the number of keywords associated with each cluster that will be displayed, in descending order of importance
4. number of top documents - integer (default = 10) - indicates the number of companies within each cluster that will be displayed, in descending order of fit

The function will return a tuple (results), such that:

results[0] = labeled preprocessed bag of words for each document

results[1] = unlabeled preprocessed bag of words for each document

results[2] = NMF H matrix (a matrix of topics x words)

results[3] = NMF W matrix (a matrix of documents x topics)

results[4] = feature names of TF-IDF matrix

results[5] = dataframe with clustering results

For reference, the function also prints this tuple layout to the console.
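To make the roles of the H and W matrices concrete, here is a minimal sketch of how they might be read. The matrices and vocabulary are made-up toy values, the helper names `top_words` and `assign_clusters` are hypothetical, and assigning each document to its highest-weighted topic is an assumption about how clusters are derived, not a description of the package's internals.

```python
import numpy as np

# Toy stand-ins for results[4], results[2] and results[3]; all values are made up.
feature_names = ["bank", "loan", "drug", "trial"]
H = np.array([[0.9, 0.7, 0.0, 0.1],   # topic 0: weights over words
              [0.0, 0.1, 0.8, 0.6]])  # topic 1: weights over words
W = np.array([[0.8, 0.1],             # document 0: weights over topics
              [0.2, 0.9],             # document 1
              [0.6, 0.3]])            # document 2


def top_words(H, feature_names, n_top):
    """Return the n_top highest-weighted words for each topic (row of H)."""
    order = np.argsort(H, axis=1)[:, ::-1][:, :n_top]
    return [[feature_names[j] for j in row] for row in order]


def assign_clusters(W):
    """Assign each document to the topic with the largest weight in its row of W."""
    return W.argmax(axis=1).tolist()


keywords = top_words(H, feature_names, n_top=2)
labels = assign_clusters(W)
print(keywords)
print(labels)
```

Each row of H ranks the vocabulary for one topic, which is where the per-industry keyword lists come from; each row of W scores one document against every topic.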

In [4]:
# Example: run NMF model for 15 industries (clusters) using 2018 data
# output = Russell2000_NMF.nmf.nmf(year, number of clusters)

results = Russell2000_NMF.nmf.nmf(2018,15)

Created initial dictionary
Created reference corpus
Created corpus list
Stemmed words
Created bigrams
Created trigrams
Created NMF model
Output tuple:
results[0] = labeled preprocessed bag of words for each document
results[1] = unlabeled preprocessed bag of words for each document
results[2] = NMF H matrix
results[3] = NMF W matrix
results[4] = feature names of TF-IDF matrix
results[5] = dataframe with clustering results


**Display Results from NMF Model**

The function below displays the results from the NMF model.

The function takes one argument: the tuple returned by the nmf() function.

The function has no return statement; it prints the results to the console.

In [6]:
# Example: display results from NMF model for 15 industries (clusters) using 2018 data
# Russell2000_NMF.nmf.display(output from NMF model)

Russell2000_NMF.nmf.display(results)

Industry 0:

custom, solut, technolog, manufactur, offer, design, adver, revenu, contract, segment, price, secur, tabl_content, support, facil, integr, engin, consum, supplier, brand, softwar, devic, employ, network, demand, project, water, need, locat, order, data, environ, intern, impact, licen, platform, adver affect, global, invest, enabl

PC CONNECTION INC
SPS COMMERCE INC
QUALYS, INC.
STAMPS.COM INC
DXP ENTERPRISES INC
WEB.COM GROUP, INC.
LIVEPERSON INC
Internap Corp
FORMFACTOR INC
PLUG POWER INC

Industry 1:

bank, loan, deposit, bank_hold_compani, borrow, institut, capit, secur, rate, feder_reserv, lend, fdic, real_estat, invest, regul, financ institut, asset, mortgag, frb, commerci, hold_compani, dodd_frank_act, insur, credit, loan_portfolio, collat, market_area, origin, counti, financ servic, supervi, consum_loan, financ hold_compani, trust, limit, offic, compani bank, allow_loan_loss, fund, polici

DIME COMMUNITY BANCSHARES INC
OCEANFIRST FINANCIAL CORP
EAGLE BANCORP INC
NOR

**Run Industry Sentiment Analysis**

This module compares the average sentiment across industry clusters.

The function takes three arguments:

1. year - integer - should be the same year used as an argument in the NMF model
2. number of clusters - integer - should be the same number used as an argument in the NMF model
3. results - tuple - should be the output from the NMF model

The function will return a pandas dataframe that contains the results from the sentiment analysis.
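The aggregation step can be sketched as a per-cluster average. This sketch assumes a sentiment score has already been computed for each company (the scoring method is not described here) and that each company has one cluster label; `industry_mean_sentiment` and the sample data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def industry_mean_sentiment(company_scores, company_industry):
    """Average per-company sentiment scores within each industry cluster."""
    grouped = defaultdict(list)
    for company, score in company_scores.items():
        grouped[company_industry[company]].append(score)
    return {industry: mean(scores) for industry, scores in sorted(grouped.items())}


# Made-up scores and cluster labels for three companies.
scores = {"A": 0.02, "B": 0.04, "C": 0.01}
industry = {"A": 0, "B": 0, "C": 1}
averages = industry_mean_sentiment(scores, industry)
print(averages)
```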

In [7]:
# Example: compare sentiment across 15 industries using 2018 data
# industry_sentiment = Russell2000_NMF.nmf.industry_sentiment(year, number of clusters, output from NMF model)

industry_sentiment = Russell2000_NMF.nmf.industry_sentiment(2018,15,results)

In [8]:
industry_sentiment

Unnamed: 0,Industry,Sentiment,Industry_Keywords
0,0,0.0115,"[custom, solut, technolog, manufactur, offer, design, adver, revenu, contract, segment, price, secur, tabl_content, support, facil, integr, engin, consum, supplier, brand, softwar, devic, employ, network, demand, project, water, need, locat, order, data, environ, intern, impact, licen, platform, adver affect, global, invest, enabl]"
1,1,0.0112,"[bank, loan, deposit, bank_hold_compani, borrow, institut, capit, secur, rate, feder_reserv, lend, fdic, real_estat, invest, regul, financ institut, asset, mortgag, frb, commerci, hold_compani, dodd_frank_act, insur, credit, loan_portfolio, collat, market_area, origin, counti, financ servic, supervi, consum_loan, financ hold_compani, trust, limit, offic, compani bank, allow_loan_loss, fund, polici]"
2,2,0.0269,"[patient, disea, studi, drug, treatment, therapi, fda, cell, trial, dose, clinic, cancer, clinic_trial, product_candid, immun, manufactur, pharmaceut, patent, compound, approv, inhibitor, commerci, tumor, vaccin, therapeut, licen, collabor, phase, disord, receptor, respon, phase_trial, drug_candid, treat, liver, agreement, phase_studi, nda, protein, adver event]"
3,3,0.0299,"[client, consult, employ, solut, technolog, global, custom_engag, profess, custom_experi, digit, engag, best_practic, help_client, offer, servic offer, analyt, servic client, deliv, research, data, insight, busi process, firm, outsourc, experti, custom_care, client custom, revenu, park, audit, manag servic, strategi, commerc, client engag, transform, implement, recruit, client servic, provid client, practic]"
4,4,0.065,"[lea, properti, tenant, real_estat, rent, invest, rental_rate, space, reit, adver, portfolio, real_estat_invest, contamin, asset, net, joint_ventur, offic, environ, lea expir, wareh, acquir properti, aircraft, connecticut, redevelop, adver affect, build, renew lea, expir, debt, retail, acquisit, approxim_squar_feet, oper properti, land_develop, term, acquir, engin, land, renew, lea term]"
5,5,0.0496,"[user, adverti, app, mobil, analyt, data, content, platform, monet, onlin, video, live, social_network, engag, mobil_applic, enterpri, featur, meet new, revenu, gift, web, audienc, chat, connect, websit, peopl, mobil_devic, email, internet, marketplac, com, vendor, digit, interact, visual, data sourc, digit adverti, tag, network, secur]"
6,6,0.0055,"[subscrib, content, brazil, televi, subscript, network, stream_video, stream, offer, wireless, data, entertain, spectrum, media, video, adver affect, exclu, librari, adver, consciou, access, busi adver, busi adver affect, competitor, repr total, channel, brand, fan, film, support, licen, cre, internet, digit, consum, uniqu, movi, brand_ident, websit, talent]"
7,7,0.0648,"[store, merchandi, fiscal, vendor, retail, net_sale, factori, retail_store, commerc, assort, logist, athlet, shop, omni_channel, brand, new store, furnitur, retail sale, food, wholes, apparel, promot, locat, week, close_store, februari, join_compani, trend, adver, sport, oper store, sourc, mobil_app, remodel, holidai, consum, replenish, distribut_center, real_estat, fashion]"
8,8,0.0308,"[insur, reinsur, policyhold, underwrit, agent, annuiti, life_insur, rate, premium, state, insur compani, polici, reserv, properti_casualti_insur, claim, loss, state insur, regul, florida, statutori, coverag, iowa, worker compen, solvenc, properti_casualti, homeown, syndic, agenc, life, catastroph, worker, treati, liabil, hold_compani, financ strength, guarant, credit_rate, compen insur, group, actuari]"
9,9,0.0481,"[wafer, substrat, semiconductor, manufactur, packag, devic, technolog, laser, die, optic, defect, chip, semiconductor_manufactur, silicon, custom, yield, film, inspect, semiconductor_industri, manufactur_process, semiconductor_devic, process control, deposit, advanc, thermal, surfac, bump, metal, led, stack, equip, clean, interconnect, layer, mem, tool, chemic, probe, process_step, proprietari]"


**Run Company Sentiment Analysis**

This module compares the average sentiment of the companies within each industry cluster.

The function takes three arguments:

1. year - integer - should be the same year used as an argument in the NMF model
2. number of clusters - integer - should be the same number used as an argument in the NMF model
3. results - tuple - should be the output from the NMF model

The function will return a pandas dataframe that contains the results from the sentiment analysis.
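A within-cluster ranking of this kind can be sketched as follows. As before, the per-company scores are assumed to be given, and `rank_companies`, its key names, and the cutoff `n` are hypothetical rather than taken from the package.

```python
def rank_companies(company_scores, company_industry, n=10):
    """List the n most positive and n most negative companies in each industry."""
    by_industry = {}
    for company, score in company_scores.items():
        by_industry.setdefault(company_industry[company], []).append((score, company))

    ranking = {}
    for industry, pairs in sorted(by_industry.items()):
        # Highest score first.
        ordered = [name for _, name in sorted(pairs, reverse=True)]
        ranking[industry] = {
            "most_positive": ordered[:n],
            "most_negative": ordered[::-1][:n],
        }
    return ranking


# Made-up scores and cluster labels for four companies.
scores = {"A": 0.5, "B": -0.2, "C": 0.1, "D": 0.3}
industry = {"A": 0, "B": 0, "C": 0, "D": 1}
ranking = rank_companies(scores, industry, n=2)
print(ranking)
```

In this sketch, a cluster with no more than `n` companies yields two lists containing the same names in opposite order.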

In [9]:
# Example: compare sentiment for all companies within each industry using 2018 data
# company_sentiment = Russell2000_NMF.nmf.company_sentiment(year, number of clusters, output from NMF model)

company_sentiment = Russell2000_NMF.nmf.company_sentiment(2018,15,results)

In [10]:
company_sentiment

Unnamed: 0,Industry,Most_Postive,Most_Negative,Industry_Keywords
0,0,"[Nautilus, inc., Fresh del monte produce inc, Knoll inc, Encore capital group inc, Tenneco inc, Trex co inc, Vishay intertechnology inc, William lyon homes, Silicon laboratories inc, Strayer education inc]","[Luminex corp, Fusion telecommunications international inc, Allegheny technologies inc, Carrizo oil & gas inc, Golden entertainment, inc., Nic inc, Liveperson inc, Sps commerce inc, Dril-quip inc]","[custom, solut, technolog, manufactur, offer, design, adver, revenu, contract, segment, price, secur, tabl_content, support, facil, integr, engin, consum, supplier, brand, softwar, devic, employ, network, demand, project, water, need, locat, order, data, environ, intern, impact, licen, platform, adver affect, global, invest, enabl]"
1,1,"[Eagle bancorp inc, Impac mortgage holdings inc, Mercantile bank corp, Norwood financial corp, Anworth mortgage asset corp, Southern first bancshares inc, Enterprise financial services corp, Heritage financial corp wa, Macatawa bank corp, Dime community bancshares inc]","[Ofg bancorp, Cobiz financial inc, Peoples bancorp of north carolina inc, Umb financial corp, Smartfinancial inc., Heritage commerce corp, Centerstate bank corp, Univest corp of pennsylvania, Brookline bancorp inc]","[bank, loan, deposit, bank_hold_compani, borrow, institut, capit, secur, rate, feder_reserv, lend, fdic, real_estat, invest, regul, financ institut, asset, mortgag, frb, commerci, hold_compani, dodd_frank_act, insur, credit, loan_portfolio, collat, market_area, origin, counti, financ servic, supervi, consum_loan, financ hold_compani, trust, limit, offic, compani bank, allow_loan_loss, fund, polici]"
2,2,"[Arqule inc, Agenus inc, Cytokinetics inc, Antares pharma, inc., Innoviva, inc., Rigel pharmaceuticals inc, Dynavax technologies corp, Lexicon pharmaceuticals, inc., Corcept therapeutics inc, Novavax inc]","[Durect corp, Tactile systems technology inc, Acadia pharmaceuticals inc, Cerus corp, Insmed inc, Endologix inc de, Tg therapeutics, inc., Rti surgical, inc., Ziopharm oncology inc]","[patient, disea, studi, drug, treatment, therapi, fda, cell, trial, dose, clinic, cancer, clinic_trial, product_candid, immun, manufactur, pharmaceut, patent, compound, approv, inhibitor, commerci, tumor, vaccin, therapeut, licen, collabor, phase, disord, receptor, respon, phase_trial, drug_candid, treat, liver, agreement, phase_studi, nda, protein, adver event]"
3,3,"[Hackett group, inc., Pfsweb inc, Heidrick & struggles international inc, Ttec holdings, inc., Sykes enterprises inc, Cra international, inc., Forrester research, inc., Perficient inc, Csg systems international inc, Navigant consulting inc]","[Insperity, inc., Prgx global, inc., Sp plus corp, Neogenomics inc, Convergys corp, Navigant consulting inc, Csg systems international inc, Perficient inc, Forrester research, inc.]","[client, consult, employ, solut, technolog, global, custom_engag, profess, custom_experi, digit, engag, best_practic, help_client, offer, servic offer, analyt, servic client, deliv, research, data, insight, busi process, firm, outsourc, experti, custom_care, client custom, revenu, park, audit, manag servic, strategi, commerc, client engag, transform, implement, recruit, client servic, provid client, practic]"
4,4,"[Willis lease finance corp, Griffin industrial realty, inc., Washington real estate investment trust, Piedmont office realty trust, inc., Urstadt biddle properties inc, Getty realty corp md, Istar inc., Franklin street properties corp ma]","[Franklin street properties corp ma, Istar inc., Getty realty corp md, Urstadt biddle properties inc, Piedmont office realty trust, inc., Washington real estate investment trust, Griffin industrial realty, inc., Willis lease finance corp]","[lea, properti, tenant, real_estat, rent, invest, rental_rate, space, reit, adver, portfolio, real_estat_invest, contamin, asset, net, joint_ventur, offic, environ, lea expir, wareh, acquir properti, aircraft, connecticut, redevelop, adver affect, build, renew lea, expir, debt, retail, acquisit, approxim_squar_feet, oper properti, land_develop, term, acquir, engin, land, renew, lea term]"
5,5,"[Xo group inc., Meet group, inc., Microstrategy inc]","[Microstrategy inc, Meet group, inc., Xo group inc.]","[user, adverti, app, mobil, analyt, data, content, platform, monet, onlin, video, live, social_network, engag, mobil_applic, enterpri, featur, meet new, revenu, gift, web, audienc, chat, connect, websit, peopl, mobil_devic, email, internet, marketplac, com, vendor, digit, interact, visual, data sourc, digit adverti, tag, network, secur]"
6,6,"[World wrestling entertainmentinc, Gaia, inc, Reis, inc., Nii holdings inc]","[Nii holdings inc, Reis, inc., Gaia, inc, World wrestling entertainmentinc]","[subscrib, content, brazil, televi, subscript, network, stream_video, stream, offer, wireless, data, entertain, spectrum, media, video, adver affect, exclu, librari, adver, consciou, access, busi adver, busi adver affect, competitor, repr total, channel, brand, fan, film, support, licen, cre, internet, digit, consum, uniqu, movi, brand_ident, websit, talent]"
7,7,"[Bassett furniture industries inc, Weis markets inc, Hibbett sports inc, Childrens place, inc.]","[Childrens place, inc., Hibbett sports inc, Weis markets inc, Bassett furniture industries inc]","[store, merchandi, fiscal, vendor, retail, net_sale, factori, retail_store, commerc, assort, logist, athlet, shop, omni_channel, brand, new store, furnitur, retail sale, food, wholes, apparel, promot, locat, week, close_store, februari, join_compani, trend, adver, sport, oper store, sourc, mobil_app, remodel, holidai, consum, replenish, distribut_center, real_estat, fashion]"
8,8,"[Federated national holding co, American equity investment life holding co, United fire group inc, Argo group international holdings, ltd., Fbl financial group inc, Amerisafe inc]","[Amerisafe inc, Fbl financial group inc, Argo group international holdings, ltd., United fire group inc, American equity investment life holding co, Federated national holding co]","[insur, reinsur, policyhold, underwrit, agent, annuiti, life_insur, rate, premium, state, insur compani, polici, reserv, properti_casualti_insur, claim, loss, state insur, regul, florida, statutori, coverag, iowa, worker compen, solvenc, properti_casualti, homeown, syndic, agenc, life, catastroph, worker, treati, liabil, hold_compani, financ strength, guarant, credit_rate, compen insur, group, actuari]"
9,9,"[Entegris inc, Veeco instruments inc, Rudolph technologies inc, Amkor technology, inc., Axt inc]","[Axt inc, Amkor technology, inc., Rudolph technologies inc, Veeco instruments inc, Entegris inc]","[wafer, substrat, semiconductor, manufactur, packag, devic, technolog, laser, die, optic, defect, chip, semiconductor_manufactur, silicon, custom, yield, film, inspect, semiconductor_industri, manufactur_process, semiconductor_devic, process control, deposit, advanc, thermal, surfac, bump, metal, led, stack, equip, clean, interconnect, layer, mem, tool, chemic, probe, process_step, proprietari]"


**Test the Stability of the Model**

This module tests how stable the clusters are from one year to the next, using the adjusted Rand score as the measure of stability.

The module compares only the clusters formed from companies that appear in every target year.

The function takes two arguments:

1. list of years - list of integers - a list of years over which the comparisons will be made
2. number of clusters - integer - the number of industry clusters

The output is a list of floats: the adjusted Rand score for each pair of consecutive years.
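The comparison can be sketched in plain Python as below. This is an illustrative reimplementation, not the package's code: `adjusted_rand` and `sequential_stability` are hypothetical names, and the input format (a dict mapping each year to a company-to-cluster dict) is an assumption. In practice, scikit-learn's `sklearn.metrics.adjusted_rand_score` computes the same quantity.

```python
from collections import Counter
from math import comb


def adjusted_rand(labels_a, labels_b):
    """Adjusted Rand score between two labelings of the same items."""
    n = len(labels_a)
    if n < 2:
        return 1.0
    # Pairs of items that share a cluster in both labelings, in A only, in B only.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (maximum - expected)


def sequential_stability(labels_by_year):
    """Score each pair of consecutive years on companies present in every year."""
    years = sorted(labels_by_year)
    common = sorted(set.intersection(*(set(labels_by_year[y]) for y in years)))
    scores = []
    for prev, curr in zip(years, years[1:]):
        a = [labels_by_year[prev][c] for c in common]
        b = [labels_by_year[curr][c] for c in common]
        scores.append(adjusted_rand(a, b))
    return scores


# Made-up labels: company E appears only in 2017 and F only in 2018, so both are dropped.
labels_by_year = {
    2017: {"A": 0, "B": 0, "C": 1, "D": 1, "E": 2},
    2018: {"A": 1, "B": 1, "C": 0, "D": 0, "F": 2},
}
stability_scores = sequential_stability(labels_by_year)
print(stability_scores)
```

The score depends only on which companies are grouped together, not on the cluster numbers themselves, which matters here because NMF topic indices need not line up from one year to the next.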

In [11]:
# Example: measure adjusted rand score for sequential years for a NMF model with 15 clusters
# rand_scores = Russell2000_NMF.stability.stability(list of years, number of clusters)

rand_scores = Russell2000_NMF.stability.stability([2017,2018],15)

Completed 2017
Completed 2018


In [12]:
rand_scores

[0.9352529219527853]