# Introduction to Relative Valuation using Market Comparables

Relative valuation is a popular technique which relies on the market value of similar assets to price a given asset. For example, if you want to price a laundry business, you might want to calculate the price to earnings ratio of other similar businesses, and multiply this ratio by the earnings of the laundry business to obtain a valuation. Sheridan and Martin describe this methodology as a 4-step process:

**Step 1:** Identify similar or comparable investments and recent market prices for each.

**Step 2:** Calculate a valuation metric for use in valuing the asset.

**Step 3:** Calculate an initial estimate of value.

**Step 4:** Refine or tailor your initial valuation estimate to the specific characteristics of the investment.

Some of the most commonly used multiples are:

- Price to earnings (P/E)

- Market to book value of equity

- Enterprise Value (EV) to EBITDA

- Enterprise Value (EV) to revenue

- Enterprise Value (EV) to cash flow

This list is not exhaustive, and it is possible for you to create your own multiple. This is particularly popular for the technology sector where analysts have come up with multiples such as Enterprise Value to unique visitors or website hits. However, in doing so, you must ensure that the components of the multiple are consistent with each other. For example, you might consider using the price to sales ratio as a valuation multiple. However, an implicit assumption behind this multiple is that comparable companies have identical capital structures which is very rarely the case in practice. When this assumption is violated, the multiple becomes inconsistent because the multiple is impacted by the amount of debt that a company has relative to its equity.

Finally, a key step in applying this methodology is to determine which multiple is appropriate for the asset you are trying to value. For example, valuing young firms and startups using the P/E ratio is likely to be unappropriate if those firms have negative or highly volatile earnings. Instead, using the EV to sales ratio would likely give a better estimate. Additionally, it is important to realize that these multiples have different characteristics. While EV to EBITDA ignores the firm's CapEx, depreciation and capital structure, while the P/E ratio takes those into account. As such, using these multiples concurrently allows you to see the big picture and understand what is driving the valuation of an asset.

# Systematic Relative Valuation Using Machine Learning

### Objective

In this notebook, we systematize the methodology introduced above for companies in the S&P 500 using two different Machine Learning approaches. First, we replicate Gael Varoquaux's analysis on scikit-learn which extracts a graphical structure from the correlation of intraday variations, and then applies Affinity Propagation to group together stocks that behave similarly. Second, we analyze the stock's latest 10K using the Doc2Vec implementation in gensim to quantify similarity. We pick the top 3 comparable companies based on these analyses, and normalize the similarity measures as weights to compute the average comparable multiple. We then apply this multiple to each company's financials to obtain a valuation.

### Data

We scrape the list of companies in the S&P 500, their ticker and CIK code from Wikipedia. We then use the CIK code to scrape the latest 10K from EDGAR. There are 505 companies in the Wikipedia list because some companies trade under multiple symbols (for example, Discovery Communications Inc.). A few companies' financial statements are not available on EDGAR for various reasons -- we ignore those companies. We clean the data by removing "Table of Contents" markers when they exist, page numbers, line breaks, punctuations and numbers from the statements. Quality control tests are available in the appendix. We scrape company fundamentals and their historical prices from Yahoo! Finance.

## Required Packages

In [None]:
!pip install numpy
!pip install pandas
!pip install collections
!pip install string
!pip install beautifulsoup4
!pip install requests
!pip install gensim
!pip install nltk
!pip install sklearn
!pip install bokeh
!pip install pandas_datareader
!pip install datetime

## Data Scraping

### Scraping 10Ks from EDGAR

In [None]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen


def table_extractor(soup): # Extract the tables from a soup object
    for table in soup.find_all("table"):
        table.extract()
    return soup


sp_500_wiki_link = 'https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'
soup_wiki = BeautifulSoup(urlopen('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies'), 'html.parser')
table_wiki = soup_wiki.find("table", { "class" : "wikitable sortable" })

# Fail now if the right table hasn't been found
header = table_wiki.findAll('th')
if header[0].string != "Ticker symbol" or header[1].string != "Security":
    raise Exception("Can't parse wikipedia's table!") 
    
# Retreive the values in the table
records = []
rows = table_wiki.findAll('tr')
for row in rows:
    fields = row.findAll('td')
    if fields:
        # Get info and SEC company link
        symbol = fields[0].string
        wiki_link = "https://en.wikipedia.org/wiki" + fields[1].a.get('href')
        CIK = fields[7].string
        sec_company_link = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK=" + CIK + \
                           "&type=10-K&dateb=&owner=include&count=40"
        name = fields[1].a.string
        sector = fields[3].string

        # Get link for the page with latest 10-K related filings
        soup_filings = BeautifulSoup(urlopen(sec_company_link), 'html.parser')
        table_filings = soup_filings.find("table", { "class" : "tableFile2" })
        try:
            filings_page_link = "https://www.sec.gov" + table_filings.a.get('href') # Get the latest filing page
            
            # Get the link for the latest 10K
            soup_filings_page = BeautifulSoup(urlopen(filings_page_link), 'html.parser')
            table_filings_page = soup_filings_page.find("table", { "class" : "tableFile" })
            _10K_link = "https://www.sec.gov" + table_filings_page.a.get('href')
            
            # Extracting the text from the latest 10K
            try: 
                soup_latest_10K = BeautifulSoup(urlopen(_10K_link).read(), 'html.parser')
                soup_latest_10K = table_extractor(soup_latest_10K)
                _latest_10K_txt = soup_latest_10K.get_text()

            except:
                # If the 10K is not available, return N/A
                _latest_10K_txt = np.nan
                
        except:
            # If the filings are not available, return N/A
            _10K_link = np.nan
            _latest_10K_txt = np.nan
        
        # Append results
        records.append([symbol, wiki_link, name, sector, sec_company_link, CIK, _latest_10K_txt])
        
headers = ['Symbol', 'Wikipedia Link', 'Name', 'Sector', 'SEC Filings Link', 'CIK', 'Latest 10K']
data = pd.DataFrame(records, columns=headers)

# Correct ambiguous tickers
ambiguous_tickers = ['BRK.B', 'BF.B']
corrected_tickers = ['BRK-B', 'BF-B']

for i, ticker in enumerate(ambiguous_tickers):
    data['Symbol'] = data['Symbol'].replace(ticker, corrected_tickers[i])

### Scraping Fundamentals from Yahoo! Finance

In [None]:
def unit_converter(data):
    billion = 1_000_000_000
    million = 1_000_000
    if data[-1] == 'B':
        return float(data[:-1])*billion
    elif data[-1] == 'M':
        return float(data[:-1])*million
    else:
        return float(data)

    
items = ['Enterprise Value', 'Enterprise Value/Revenue', 'Enterprise Value/EBITDA',
         'Revenue', 'EBITDA', 'Diluted EPS', 'Trailing P/E']

for item in items:
    data[item] = np.nan
    
for i, ticker in enumerate(data['Symbol']):
    key_stats_link = 'https://finance.yahoo.com/quote/' + ticker + '/key-statistics?p=' + ticker
    key_stats_soup = BeautifulSoup(urlopen(key_stats_link).read(), 'html.parser').findAll('td')
    for j, row in enumerate(key_stats_soup):
        for item in items:
            try: 
                if item == row.span.string:
                    data.loc[i, item] = unit_converter(key_stats_soup[j+1].string) # Dangerous
            except:
                next

### Scraping Historical Prices from Yahoo! Finance

In [13]:
from pandas_datareader.data import DataReader
from datetime import date

start = date(2013, 1, 2)
end = date.today()

data_source = 'yahoo'

# Sometimes fail -- retry if it does
historical_prices_panel = DataReader(data['Symbol'], data_source, start, end, retry_count=5)

# Current price is last close
# last_price = historical_prices_panel['Close'].tail(1).T.iloc[:, 0].rename('Current Price')
# data = data.join(last_price, on='Symbol')

In [27]:
historical_prices_panel['Close'].to_csv('close_price_data.csv')
historical_prices_panel['Open'].to_csv('open_price_data.csv')
# historical_prices_panel = pd.read_csv('price_data.csv')

## Data Cleaning and Preprocessing

In [5]:
#data.to_csv('10K_data.csv')

#data = pd.read_csv('10K_data.csv')

In [6]:
# Remove companies without filings
no_filings_data = data[data['Latest 10K'].isnull()]
data = data[~data['Latest 10K'].isnull()]

# Remove duplicates (keep first)
data = data.drop_duplicates(keep='first')

# Drop Google duplicate
data = data[data['Symbol'] != 'GOOG']

# Drop NA rows (about 60 companies)
data = data.dropna()

# Drop companies with negative EPS or EBITDA (about 30 companies)
data = data[(data[['EBITDA', 'Diluted EPS']] > 0).all(1)]

# Reset index
data = data.reset_index(drop=True)

In [7]:
from collections import namedtuple
from nltk import word_tokenize
from nltk.corpus import stopwords
from string import punctuation, digits


def _10K_string_cleaner(_10K):
    _10K = _10K.lower() # Lowercase the text
    stopchar = punctuation + digits + '’“”'
    for ch in stopchar:
        _10K = _10K.replace(ch, ' ') # Replace stopchar by whitespace
    _10K = word_tokenize(_10K) # Tokenize
    _10K = [word for word in _10K if word not in stopwords.words('english')] # Remove stopwords
    return _10K


corpus = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(data['Latest 10K']):
    corpus.append(analyzedDocument(_10K_string_cleaner(text), [i]))

## Doc2Vec Model

In [8]:
from gensim.models import doc2vec
from gensim.similarities import docsim

model_NLP = doc2vec.Doc2Vec(corpus, size = 500, window = 300, min_count = 1, workers = 4) 

In [9]:
similarity_matrix_NLP = np.empty(0)
length_docvecs = len(model_NLP.docvecs)
for item in itertools.product(range(length_docvecs), range(length_docvecs)):
    similarity_matrix_NLP = np.append(similarity_matrix_NLP, model_NLP.docvecs.similarity(*item))
    
similarity_matrix_NLP = similarity_matrix_NLP.reshape((length_docvecs, length_docvecs))

## Visualization

### Stock Market Structure

We use three different methodologies for vizualizing the structure of the stock market. First, we create simple scatter plot of the companies' EV/EBITDA against their P/E ratio. Second, we use the correlation matrix of the S&P 500 stocks between the start of 2013 and the last price. Third, we use the similarity matrix outputted from the Doc2Vec model. For the last two methods, t-SNE is used for reducing the dimensionality to two.

**FIX THIS**

In [10]:
from bokeh.plotting import figure, show, output_notebook
from bokeh.models import ColumnDataSource, LabelSet, Legend, HoverTool
from bokeh.palettes import all_palettes

output_notebook()

category_items = data['Sector'].unique()
palette = all_palettes['Viridis'][len(category_items)]
colormap = dict(zip(category_items, palette))
data['Color'] = data['Sector'].map(colormap)

TOOLS = "crosshair,pan,wheel_zoom,reset,tap,save,box_select"

source = ColumnDataSource(dict(x = data['Trailing P/E'],
                               y = data['Enterprise Value/EBITDA'],
                               color = data['Color'],
                               label = data['Name'],
                               ticker = data['Symbol'],
                               sector = data['Sector']))

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')

p.scatter(x='x', y='y', color='color', legend='sector', source=source) # Remove "legend='sector'," to remove legend

#p.add_layout(labels) # Comment this line to remove labels

p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'

show(p)

In [11]:
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0, perplexity=5.0)
Y = tsne.fit_transform(similarity_matrix_NLP)

plotting_df = pd.concat([data[['Name', 'Symbol', 'Sector', 'Color']],
                         pd.DataFrame(Y, columns=['x', 'y'])], axis=1)

source = ColumnDataSource(dict(x = plotting_df['x'],
                               y = plotting_df['y'],
                               color = plotting_df['Color'],
                               label = plotting_df['Name'],
                               ticker = plotting_df['Symbol'],
                               sector = plotting_df['Sector']))

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')

p.scatter(x='x', y='y', color='color', legend='sector', source=source)

#p.add_layout(labels) # Comment this line to remove labels

show(p)

In [14]:
from sklearn import cluster, covariance

# This code was adapted from Gael Varoquaux's work (see references)

# Calculate intraday variation
variation_df = (historical_prices_panel['Close'] - historical_prices_panel['Open']).T

# Get name, sector and color from the data dataframe
variation_df = data[['Symbol', 'Name', 'Sector', 'Color']].join(variation_df.reindex(data['Symbol']), on='Symbol')

# Drop rows with NAs
variation_df = variation_df.dropna(axis=0)

# Data for the model
var_data = variation_df.drop(['Symbol', 'Name', 'Sector', 'Color'], axis=1).T

# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()

# Standardize the time series: using correlations rather than covariance is more efficient for structure recovery
var_data /= var_data.std(axis=0)
edge_model.fit(var_data)

# Cluster using affinity propagation
_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

for i in range(n_labels + 1):
    print('Cluster %i: %s' % ((i + 1), ', '.join(variation_df['Name'][labels == i])))

# Find a low-dimension embedding for visualization: find the best position of the nodes (the stocks) on a 2D plane
embedding = tsne.fit_transform(var_data.T)

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.06)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
# a sequence of (*line0*, *line1*, *line2*), where::
#            linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding.T[:, start], embedding.T[:, stop]] for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])

  sign, logdet = _umath_linalg.slogdet(a, signature=signature)
  * coefs)
  * coefs)
  sign, logdet = _umath_linalg.slogdet(a, signature=signature)


Cluster 1: Activision Blizzard, Electronic Arts
Cluster 2: Agilent Technologies Inc, PerkinElmer, Thermo Fisher Scientific, Verisign Inc.
Cluster 3: Adobe Systems Inc, Alphabet Inc Class A, Amazon.com Inc, eBay, Expedia Inc., Facebook, Inc., Microsoft Corp., Netflix Inc., Priceline.com Inc
Cluster 4: Aetna Inc, Anthem Inc., Centene Corporation, CIGNA Corp., Humana Inc., United Health Group Inc.
Cluster 5: Advance Auto Parts, AutoZone Inc, O'Reilly Automotive
Cluster 6: Block H&R
Cluster 7: BorgWarner, Delphi Automotive, Goodyear Tire & Rubber, LKQ Corporation
Cluster 8: Cadence Design Systems, Synopsys Inc.
Cluster 9: AmerisourceBergen Corp, Cardinal Health, CVS Health, Express Scripts, Henry Schein, McKesson Corp., Patterson Companies
Cluster 10: Carnival Corp., Royal Caribbean Cruises Ltd
Cluster 11: Caterpillar Inc., Cummins Inc., Deere & Co., Freeport-McMoRan Inc., Nucor Corp., United Rentals, Inc.
Cluster 12: AbbVie Inc., Alexion Pharmaceuticals, Amgen Inc, Biogen Inc., Bristol-My

In [22]:
plotting_df = pd.concat([variation_df[['Name', 'Symbol', 'Sector', 'Color']].reset_index(drop=True),
                         pd.DataFrame(np.stack(embedding, axis=1).T, columns=['x', 'y'])], axis=1)

source = ColumnDataSource(dict(x = plotting_df['x'],
                               y = plotting_df['y'],
                               color = plotting_df['Color'],
                               label = plotting_df['Name'],
                               ticker = plotting_df['Symbol'],
                               sector = plotting_df['Sector']))

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')

p.scatter(x='x', y='y', color='color', legend='sector', source=source)

#p.segment(*np.reshape(np.array(segments).flatten(), (len(segments),4)).T)

#p.add_layout(labels)

show(p)

In [21]:
partial_correlations[non_zero]

array([-0.06181062, -0.07155183, -0.07445316, -0.0924572 , -0.06091989,
       -0.0644079 , -0.06043566, -0.06067483, -0.06278194, -0.07241528,
       -0.07433179, -0.06005098, -0.06018425, -0.07473074, -0.11138014,
       -0.07684987, -0.09372424, -0.18195367, -0.0930303 , -0.10836526,
       -0.09970982, -0.10590698, -0.12655645, -0.12571089, -0.15757206,
       -0.07505235, -0.14304754, -0.09008764, -0.15649933, -0.11401709,
       -0.09796537, -0.13527524, -0.06956268, -0.11556889, -0.07491265,
       -0.07327088, -0.07952767, -0.06636465, -0.10874069, -0.06425771,
       -0.06266171, -0.1650774 , -0.14371938, -0.06790511, -0.07741382,
       -0.17823096, -0.06067007, -0.06451649, -0.13962553, -0.16893616,
       -0.11197579, -0.07407412, -0.07248271, -0.07490203, -0.08240091,
       -0.1089997 , -0.10944981, -0.07989192, -0.09179934, -0.06156174,
       -0.06240679, -0.10274681, -0.06917404, -0.06198231, -0.06293759,
       -0.0817737 , -0.07718345, -0.07659174, -0.0720536 , -0.07

# Valuation

In [16]:
sim_mat_NLP = similarity_matrix_NLP - np.eye(len(similarity_matrix_NLP)) # FIX THIS
#sim_mat_corr


def valuation_calculator(index, data, multiple, similarity_matrix):
    sorted_similarity_array = -np.sort((-similarity_matrix[0]))
    sorted_similarity_indices = (-similarity_matrix[index]).argsort()
    top_3_comps = sorted_similarity_array[:3]
    baseline_comp = sorted_similarity_array[4]
    normalized_weights = (top_3_comps - baseline_comp) / sum(top_3_comps - baseline_comp)
    top_3_pe = data[multiple][sorted_similarity_indices[:3]]
    weigthed_pe = np.dot(normalized_weights, top_3_pe)
    if multiple == 'Trailing P/E':
        valuation = weigthed_pe * data['Diluted EPS'][index]
    elif multiple == 'Enterprise Value/EBITDA':
        valuation = weigthed_pe * data['EBITDA'][index]
    elif multiple == 'Enterprise Value/Revenue':
        valuation = weigthed_pe * data['Revenue'][index]
    return valuation


valuation_df = data[['Name', 'Symbol', 'Sector', 'Color', 'Current Price', 'Enterprise Value']].copy()

# headers = ['Valuation (P/E)', '% Over/Undervaluation (P/E)',
#            'Valuation (EV/EBITDA)', '% Over/Undervaluation (EV/EBITDA)',
#            'Valuation (EV/Revenue)', '% Over/Undervaluation (EV/Revenue)']

# methods = ['NPL', 'Correlation']

# for header in headers:
#     for method in methods:
#         valuation_df[header + ' ' + method] = np.nan

for i, company in enumerate(data['Name']):
    valuation_df.loc[i, 'Valuation (P/E) NLP'] = valuation_calculator(i, data, 'Trailing P/E',
                                                                      sim_mat_NLP)
    valuation_df.loc[i, 'Valuation (EV/EBITDA) NLP'] = valuation_calculator(i, data, 'Enterprise Value/EBITDA',
                                                                            sim_mat_NLP)
    valuation_df.loc[i, 'Valuation (EV/Revenue) NLP'] = valuation_calculator(i, data, 'Enterprise Value/Revenue',
                                                                             sim_mat_NLP)
#     valuation_df.loc[i, 'Valuation (P/E) Correlation'] = valuation_calculator(i, data, 'Trailing P/E',
#                                                                               sim_mat_corr)
#     valuation_df.loc[i, 'Valuation (EV/EBITDA) Correlation'] = valuation_calculator(i, data,
#                                                                                     'Enterprise Value/EBITDA',
#                                                                                      sim_mat_corr)
#     valuation_df.loc[i, 'Valuation (EV/Revenue) Correlation'] = valuation_calculator(i, data,
#                                                                                      'Enterprise Value/Revenue',
#                                                                                      sim_mat_corr)

valuation_df['% Over/Undervaluation (EV/Revenue) NLP'] = valuation_df['Valuation (EV/Revenue) NLP'] / \
                                                         valuation_df['Enterprise Value'] - 1
valuation_df['% Over/Undervaluation (EV/EBITDA) NLP'] = valuation_df['Valuation (EV/EBITDA) NLP'] / \
                                                        valuation_df['Enterprise Value'] - 1    
valuation_df['% Over/Undervaluation (P/E) NLP'] = valuation_df['Valuation (P/E) NLP'] / \
                                                  valuation_df['Current Price'] - 1
# valuation_df['% Over/Undervaluation (EV/Revenue) Correlation'] = \
# valuation_df['Valuation (EV/Revenue) Corr'] / valuation_df['Enterprise Value'] - 1
# valuation_df['% Over/Undervaluation (EV/EBITDA) Correlation'] = \
# valuation_df['Valuation (EV/EBITDA) Corr'] / valuation_df['Enterprise Value'] - 1    
# valuation_df['% Over/Undervaluation (P/E) Correlation'] =\
# valuation_df['Valuation (P/E) Corr'] / valuation_df['Current Price'] - 1

In [17]:
valuation_df['% Over/Undervaluation (EV/Revenue) NLP'].describe()

count    414.000000
mean       0.849263
std        2.886371
min       -0.890354
25%       -0.302254
50%        0.121113
75%        0.854788
max       40.900952
Name: % Over/Undervaluation (EV/Revenue) NLP, dtype: float64

In [18]:
valuation_df['% Over/Undervaluation (EV/EBITDA) NLP'].describe()

count    414.000000
mean       0.256921
std        1.953579
min       -0.875880
25%       -0.208111
50%        0.037636
75%        0.401249
max       38.134563
Name: % Over/Undervaluation (EV/EBITDA) NLP, dtype: float64

In [19]:
valuation_df['% Over/Undervaluation (P/E) NLP'].describe()

count    414.000000
mean       0.528534
std        1.842327
min       -0.969607
25%       -0.264239
50%        0.051261
75%        0.588994
max       14.776323
Name: % Over/Undervaluation (P/E) NLP, dtype: float64

In [25]:
last_day_2016 = historical_prices_panel['Close'].loc['2016-12-30', :].rename('2016-12-30 Price') #FIX THIS
#valuation_df = valuation_df.join(last_day_2016, on='Symbol')
valuation_df['Actual Change'] = valuation_df['Current Price'] > valuation_df['2016-12-30 Price']
valuation_df['Prediction NLP'] = valuation_df['Valuation (P/E) NLP'] > valuation_df['2016-12-30 Price']
outcome = valuation_df['Prediction NLP'] == valuation_df['Actual Change']
correct_pred_NLP = sum(outcome) / len(outcome)
print('Percentage of correct predictions (NLP): ' + str(correct_pred_NLP))

# valuation_df['Prediction Corr'] = valuation_df['Valuation (P/E) Corr'] > valuation_df['2016-12-30 Price']
# correct_pred_corr = sum(valuation_df[['Prediction Corr', 'Actual Change']].all(1)) / \
#                     len(valuation_df[['Prediction Corr', 'Actual Change']].all(1))
# print('Percentage of correct predictions (Corr): ' + str(correct_pred_corr))

Percentage of correct predictions (NLP): 0.599033816425


In [28]:
valuation_df

Unnamed: 0,Name,Symbol,Sector,Color,Current Price,Enterprise Value,Valuation (P/E) NLP,Valuation (EV/EBITDA) NLP,Valuation (EV/Revenue) NLP,% Over/Undervaluation (EV/Revenue) NLP,% Over/Undervaluation (EV/EBITDA) NLP,% Over/Undervaluation (P/E) NLP,2016-12-30 Price,Actual Change,Prediction NLP
0,3M Company,MMM,Industrials,#440154,234.740005,1.486000e+11,181.283170,1.171161e+11,1.029547e+11,-0.307169,-0.211870,-0.227728,178.570007,True,True
1,Abbott Laboratories,ABT,Health Care,#482374,55.369999,1.104600e+11,11.958059,5.996827e+10,9.336404e+10,-0.154771,-0.457104,-0.784034,38.410000,True,False
2,AbbVie Inc.,ABBV,Health Care,#482374,91.930000,1.768800e+11,77.342773,1.447448e+11,8.601826e+10,-0.513691,-0.181678,-0.158678,62.619999,True,True
3,Accenture plc,ACN,Information Technology,#404387,143.279999,8.437000e+10,178.947997,7.177778e+10,9.039602e+10,0.071424,-0.149250,0.248939,117.129997,True,True
4,Activision Blizzard,ATVI,Information Technology,#404387,64.139999,4.953000e+10,33.390076,2.628960e+10,2.421931e+10,-0.511017,-0.469219,-0.479419,36.110001,True,False
5,Acuity Brands Inc,AYI,Industrials,#440154,158.690002,6.690000e+09,203.142377,1.036741e+10,9.592343e+09,0.433833,0.549688,0.280121,230.860001,False,False
6,Adobe Systems Inc,ADBE,Information Technology,#404387,177.330002,8.393000e+10,82.726787,4.310069e+10,5.804717e+10,-0.308386,-0.486469,-0.533487,102.949997,True,False
7,Advance Auto Parts,AAP,Consumer Discretionary,#345E8D,81.930000,6.840000e+09,126.952827,1.311874e+10,2.285898e+10,2.341957,0.917945,0.549528,169.119995,False,False
8,Aetna Inc,AET,Health Care,#482374,173.119995,6.033000e+10,100.429260,6.761254e+10,1.000120e+11,0.657749,0.120712,-0.419886,124.010002,True,False
9,Affiliated Managers Group Inc,AMG,Financials,#29788E,190.679993,1.223000e+10,252.829473,1.399744e+10,1.498943e+10,0.225628,0.144517,0.325936,145.300003,True,True


In [31]:
source = ColumnDataSource(dict(x = valuation_df['% Over/Undervaluation (P/E) NLP'],
                               y = valuation_df['% Over/Undervaluation (EV/EBITDA) NLP'],
                               color = valuation_df['Color'],
                               label = valuation_df['Name'],
                               ticker = valuation_df['Symbol'],
                               sector = valuation_df['Sector']))

hover = HoverTool(tooltips=[
    ("index", "$index"),
    ("name", "@label"),
    ("sector", "@sector"),
    ("ticker", "@ticker"),
    ("(x,y)", "($x, $y)"),
])

p = figure(tools=[TOOLS, hover], plot_width=800, plot_height=700)

labels = LabelSet(x='x', y='y', text='label', source=source, text_font_size='8pt')

p.scatter(x='x', y='y', color='color', legend='sector', source=source) # Remove "legend='sector'," to remove legend

#p.add_layout(labels) # Comment this line to remove labels

p.xaxis.axis_label = 'Trailing P/E'
p.yaxis.axis_label = 'EV/EBITDA'

show(p)

In [None]:
from __future__ import print_function

# Author: Gael Varoquaux gael.varoquaux@normalesup.org
# License: BSD 3 clause

import sys
from datetime import datetime

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.collections import LineCollection
from six.moves.urllib.request import urlopen
from six.moves.urllib.parse import urlencode
from sklearn import cluster, covariance, manifold

print(__doc__)


def retry(f, n_attempts=3):
    "Wrapper function to retry function calls in case of exceptions"
    def wrapper(*args, **kwargs):
        for i in range(n_attempts):
            try:
                return f(*args, **kwargs)
            except Exception:
                if i == n_attempts - 1:
                    raise
    return wrapper


def quotes_historical_google(symbol, start_date, end_date):
    """Get the historical data from Google finance.

    Parameters
    ----------
    symbol : str
        Ticker symbol to query for, for example ``"DELL"``.
    start_date : datetime.datetime
        Start date.
    end_date : datetime.datetime
        End date.

    Returns
    -------
    X : array
        The columns are ``date`` -- date, ``open``, ``high``,
        ``low``, ``close`` and ``volume`` of type float.
    """
    params = {
        'q': symbol,
        'startdate': start_date.strftime('%Y-%m-%d'),
        'enddate': end_date.strftime('%Y-%m-%d'),
        'output': 'csv',
    }
    url = 'https://finance.google.com/finance/historical?' + urlencode(params)
    response = urlopen(url)
    dtype = {
        'names': ['date', 'open', 'high', 'low', 'close', 'volume'],
        'formats': ['object', 'f4', 'f4', 'f4', 'f4', 'f4']
    }
    converters = {
        0: lambda s: datetime.strptime(s.decode(), '%d-%b-%y').date()}
    data = np.genfromtxt(response, delimiter=',', skip_header=1,
                         dtype=dtype, converters=converters,
                         missing_values='-', filling_values=-1)
    min_date = min(data['date'], default=datetime.min.date())
    max_date = max(data['date'], default=datetime.max.date())
    start_end_diff = (end_date - start_date).days
    min_max_diff = (max_date - min_date).days
    data_is_fine = (
        start_date <= min_date <= end_date and
        start_date <= max_date <= end_date and
        start_end_diff - 7 <= min_max_diff <= start_end_diff)

    if not data_is_fine:
        message = (
            'Data looks wrong for symbol {}, url {}\n'
            '  - start_date: {}, end_date: {}\n'
            '  - min_date:   {}, max_date: {}\n'
            '  - start_end_diff: {}, min_max_diff: {}'.format(
                symbol, url,
                start_date, end_date,
                min_date, max_date,
                start_end_diff, min_max_diff))
        raise RuntimeError(message)
    return data

# #############################################################################
# Retrieve the data from Internet

# Choose a time period reasonably calm (not too long ago so that we get
# high-tech firms, and before the 2008 crash)
start_date = datetime(2003, 1, 1).date()
end_date = datetime(2008, 1, 1).date()

symbol_dict = {
    'NYSE:TOT': 'Total',
    'NYSE:XOM': 'Exxon',
    'NYSE:CVX': 'Chevron',
    'NYSE:COP': 'ConocoPhillips',
    'NYSE:VLO': 'Valero Energy',
    'NASDAQ:MSFT': 'Microsoft',
    'NYSE:IBM': 'IBM',
    'NYSE:TWX': 'Time Warner',
    'NASDAQ:CMCSA': 'Comcast',
    'NYSE:CVC': 'Cablevision',
    'NASDAQ:YHOO': 'Yahoo',
    'NASDAQ:DELL': 'Dell',
    'NYSE:HPQ': 'HP',
    'NASDAQ:AMZN': 'Amazon',
    'NYSE:TM': 'Toyota',
    'NYSE:CAJ': 'Canon',
    'NYSE:SNE': 'Sony',
    'NYSE:F': 'Ford',
    'NYSE:HMC': 'Honda',
    'NYSE:NAV': 'Navistar',
    'NYSE:NOC': 'Northrop Grumman',
    'NYSE:BA': 'Boeing',
    'NYSE:KO': 'Coca Cola',
    'NYSE:MMM': '3M',
    'NYSE:MCD': 'McDonald\'s',
    'NYSE:PEP': 'Pepsi',
    'NYSE:K': 'Kellogg',
    'NYSE:UN': 'Unilever',
    'NASDAQ:MAR': 'Marriott',
    'NYSE:PG': 'Procter Gamble',
    'NYSE:CL': 'Colgate-Palmolive',
    'NYSE:GE': 'General Electrics',
    'NYSE:WFC': 'Wells Fargo',
    'NYSE:JPM': 'JPMorgan Chase',
    'NYSE:AIG': 'AIG',
    'NYSE:AXP': 'American express',
    'NYSE:BAC': 'Bank of America',
    'NYSE:GS': 'Goldman Sachs',
    'NASDAQ:AAPL': 'Apple',
    'NYSE:SAP': 'SAP',
    'NASDAQ:CSCO': 'Cisco',
    'NASDAQ:TXN': 'Texas Instruments',
    'NYSE:XRX': 'Xerox',
    'NYSE:WMT': 'Wal-Mart',
    'NYSE:HD': 'Home Depot',
    'NYSE:GSK': 'GlaxoSmithKline',
    'NYSE:PFE': 'Pfizer',
    'NYSE:SNY': 'Sanofi-Aventis',
    'NYSE:NVS': 'Novartis',
    'NYSE:KMB': 'Kimberly-Clark',
    'NYSE:R': 'Ryder',
    'NYSE:GD': 'General Dynamics',
    'NYSE:RTN': 'Raytheon',
    'NYSE:CVS': 'CVS',
    'NYSE:CAT': 'Caterpillar',
    'NYSE:DD': 'DuPont de Nemours'}


symbols, names = np.array(sorted(symbol_dict.items())).T

# retry is used because quotes_historical_google can temporarily fail
# for various reasons (e.g. empty result from Google API).
quotes = []

for symbol in symbols:
    print('Fetching quote history for %r' % symbol, file=sys.stderr)
    quotes.append(retry(quotes_historical_google)(
        symbol, start_date, end_date))

close_prices = np.vstack([q['close'] for q in quotes])
open_prices = np.vstack([q['open'] for q in quotes])

# The daily variations of the quotes are what carry most information
variation = close_prices - open_prices


# #############################################################################
# Learn a graphical structure from the correlations
edge_model = covariance.GraphLassoCV()

# standardize the time series: using correlations rather than covariance
# is more efficient for structure recovery
X = variation.copy().T
X /= X.std(axis=0)
edge_model.fit(X)

# #############################################################################
# Cluster using affinity propagation

_, labels = cluster.affinity_propagation(edge_model.covariance_)
n_labels = labels.max()

for i in range(n_labels + 1):
    print('Cluster %i: %s' % ((i + 1), ', '.join(names[labels == i])))

# #############################################################################
# Find a low-dimension embedding for visualization: find the best position of
# the nodes (the stocks) on a 2D plane

# We use a dense eigen_solver to achieve reproducibility (arpack is
# initiated with random vectors that we don't control). In addition, we
# use a large number of neighbors to capture the large-scale structure.
node_position_model = manifold.LocallyLinearEmbedding(
    n_components=2, eigen_solver='dense', n_neighbors=6)

embedding = node_position_model.fit_transform(X.T).T

# #############################################################################
# Visualization
plt.figure(1, facecolor='w', figsize=(10, 8))
plt.clf()
ax = plt.axes([0., 0., 1., 1.])
plt.axis('off')

# Display a graph of the partial correlations
partial_correlations = edge_model.precision_.copy()
d = 1 / np.sqrt(np.diag(partial_correlations))
partial_correlations *= d
partial_correlations *= d[:, np.newaxis]
non_zero = (np.abs(np.triu(partial_correlations, k=1)) > 0.02)

# Plot the nodes using the coordinates of our embedding
plt.scatter(embedding[0], embedding[1], s=100 * d ** 2, c=labels,
            cmap=plt.cm.spectral)

# Plot the edges
start_idx, end_idx = np.where(non_zero)
# a sequence of (*line0*, *line1*, *line2*), where::
#            linen = (x0, y0), (x1, y1), ... (xm, ym)
segments = [[embedding[:, start], embedding[:, stop]]
            for start, stop in zip(start_idx, end_idx)]
values = np.abs(partial_correlations[non_zero])
lc = LineCollection(segments,
                    zorder=0, cmap=plt.cm.hot_r,
                    norm=plt.Normalize(0, .7 * values.max()))
lc.set_array(values)
lc.set_linewidths(15 * values)
ax.add_collection(lc)

# Add a label to each node. The challenge here is that we want to
# position the labels to avoid overlap with other labels
for index, (name, label, (x, y)) in enumerate(
        zip(names, labels, embedding.T)):

    dx = x - embedding[0]
    dx[index] = 1
    dy = y - embedding[1]
    dy[index] = 1
    this_dx = dx[np.argmin(np.abs(dy))]
    this_dy = dy[np.argmin(np.abs(dx))]
    if this_dx > 0:
        horizontalalignment = 'left'
        x = x + .002
    else:
        horizontalalignment = 'right'
        x = x - .002
    if this_dy > 0:
        verticalalignment = 'bottom'
        y = y + .002
    else:
        verticalalignment = 'top'
        y = y - .002
    plt.text(x, y, name, size=10,
             horizontalalignment=horizontalalignment,
             verticalalignment=verticalalignment,
             bbox=dict(facecolor='w',
                       edgecolor=plt.cm.spectral(label / float(n_labels)),
                       alpha=.6))

plt.xlim(embedding[0].min() - .15 * embedding[0].ptp(),
         embedding[0].max() + .10 * embedding[0].ptp(),)
plt.ylim(embedding[1].min() - .03 * embedding[1].ptp(),
         embedding[1].max() + .03 * embedding[1].ptp())

plt.show()

**TODO LIST**
- Perform valuation and visualize under/overvalued
- Create additional data quality controls
- Comps strength
- How many undervalued/overvalued companies have increased/decreased in value using P/E?

In [4]:
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
from collections import namedtuple
from gensim.models import doc2vec
from gensim.similarities import docsim
from gensim import corpora
import itertools

# Conclusion

### References

Titman, Sheridan and Martin, John D. _Valuation: the Art and Science of Corporate Investment Decisions_. Prentice Hall, 2015. Print.

Quoc Le and Tomas Mikolov. _Distributed Representations of Sentences and Documents_. http://arxiv.org/pdf/1405.4053v2.pdf

Laurens van der Maaten and Geoffrey Hinton. _Visualizing Data using t-SNE_. http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf 

Gael Varoquaux. _Visualizing the Stock Market Structure_. http://scikit-learn.org/stable/auto_examples/applications/plot_stock_market.html#sphx-glr-auto-examples-applications-plot-stock-market-py

# Appendix

### Data Quality Controls

In [None]:
# Items to check: 
# Number of sentences that are dropped by the table extractor is less than 25% of original
from nose.tools import ok_


def test_na_values(data):
    """Test if there are N/A values in the data"""
    ok(data.isnull().values.any() == False)


def test_identical_values(data):
    """Test if there are identical values in the data"""
    for column in data.columns:
        ok_(len(set(data[column])) == len(data[column]), msg='There are identical elements in ' + column)


def test_name_in_10K(data):
    """Test if the company name is in the 10K"""
    for i, name in enumerate(data['Name'])
        ok_(name in data['Latest 10K'][i])
                  
        
def test_start_string_match(data):
    """Test if a start string was found for all 10Ks"""
    cnt = 0
    start_string = 'PART I     \n \n    Item 1.\xa0\xa0\xa0\xa0Business     \n \n' 
    for item in data['10K Soup'].dropna():
        if re.search(start_string, item, re.IGNORECASE) != None:
            cnt += 1
    ok_(data['10K Soup'].count() == cnt, msg='start_string was not found in every statement')