
## FINANCIAL DATA
MODULE 1 | LESSON 3


---


# **WORKING WITH SECURITIES DATA** 


|  |  |
|:---|:---|
|**Reading Time** |  40 minutes |
|**Prior Knowledge** | Equities, Bonds & Maturity Date, Options & Expiration Date, Bid Price / Offer Price, Quantity Bid/Offered  |
|**Keywords** | Primary Sources, Financial Platforms, Data Vendors, Ticker, CUSIP, SEDOL, Adjusted Close, ISIN |

---

*The previous lesson outlined the types of data for the major asset classes, stocks and bonds, and categorized different types of structured and unstructured data. It concluded by addressing 10 questions you can ask about your data. In this next lesson, we examine some of the real-world issues facing financial data: its sources, identifiers, timestamps and trading days, and other complexities.*

## **1. Data Sources**

In the previous lesson, we discussed the importance of the vendor.
Our first consideration is the quality of the data the vendor provides.
Informally speaking, there are four types of providers of data:

A. Primary Sources

These include exchanges and credit institutions, such as rating agencies, credit card networks, and credit bureaus.
They are where the data originates, based on transactions, trading, payments, or lack thereof. Here are examples of some exchanges and credit institutions:

* Stock Exchanges: NYSE, NASDAQ, Shanghai, Shenzhen, Euronext, Nigerian, Nairobi, Stock Exchange of Mauritius, etc.
* Option Exchanges: Chicago Board Options Exchange
* Commodity Exchanges: Chicago Mercantile Exchange, London Metal Exchange
* Forex Trading Centers: London, New York, Sydney, Tokyo
* Cryptocurrency Exchanges: Coinbase, Binance, Gemini, FTX
* Credit Agencies: S&P, Moody's, Fitch, TransUnion, Equifax, Experian, Visa, Mastercard
    

B. Financial Platforms

These companies provide trading platforms, news, analytics, and communications. While they may provide data, they are typically redistributors of data from exchanges, OTC markets, and other financial sources. The data can be analyzed from inside their platform or through an API.

* Bloomberg
* Thomson Reuters
* FactSet

C. Data Vendors

These companies tend to focus on data and analytics. Their data is usually accessible through APIs, and their revenue typically comes from selling or streaming data to clients.

* BarChart
* RavenPack
* Capital IQ
* CRSP
* Datastream
* Morningstar
* Exante Data
* Xignite
* Refinitiv
* Short Trends
* dxFeed

D. Free Sites

Finally, there are free sites, whether public services, government sites, or user groups, that pool and provide access to free data for individuals or for educational purposes. Some of these limit the amount of data that can be accessed.

* Quandl
* Alpha Vantage
* FRED: Federal Reserve Economic Data
* Yahoo Finance
* Google Finance


The emphasis here is less on knowing which vendor belongs in which category than on how each one offers its data. Exchanges can often provide complete data, including quote and trade data. A free site like Yahoo Finance may only provide end-of-day data based on trades, not quotes. Some vendors clean the data; others present the data as collected and leave it to customers to address errors. Some data may arrive in a format known as FIX: Financial Information eXchange Protocol. Other data may simply be in comma-separated values, tab-delimited formats, etc.
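To make the FIX format more concrete, here is a minimal sketch of splitting a FIX-style message into its fields. FIX messages are sequences of `tag=value` pairs separated by the SOH control character (`\x01`). The message below is a simplified, hypothetical order: a real message carries further required fields such as body length and checksum, and `parse_fix` is an illustrative helper name, not part of any library.

```python
SOH = "\x01"  # FIX field delimiter (ASCII control character 1)


def parse_fix(message: str) -> dict:
    """Split a FIX-style message into a {tag number: value} dictionary."""
    fields = {}
    for pair in message.strip(SOH).split(SOH):
        tag, _, value = pair.partition("=")
        fields[int(tag)] = value
    return fields


# Hypothetical, simplified order message (header and trailer fields omitted).
raw = SOH.join([
    "8=FIX.4.2",  # tag 8:  BeginString (protocol version)
    "35=D",       # tag 35: MsgType, D = New Order Single
    "55=NFLX",    # tag 55: Symbol
    "54=2",       # tag 54: Side, 2 = Sell
    "38=100",     # tag 38: OrderQty
    "44=185.50",  # tag 44: Price
]) + SOH

order = parse_fix(raw)
print("symbol:", order[55], "quantity:", order[38], "price:", order[44])
```

Every value arrives as text, so the consumer still has to convert quantities and prices to numbers and keep track of which tag means what.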

Data itself is a commodity, and exchanges, markets, and vendors make money providing this data, whether historical, daily, or real-time, to clients who need to make quick, if not real-time, decisions with it.


In [None]:
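# Using the `pandas_datareader` package, we'll get Netflix stock information from Yahoo Finance
import pandas_datareader as pdr

# Request daily price history for the NFLX ticker from Yahoo Finance
data = pdr.get_data_yahoo("NFLX")
# Display column names, dtypes, and non-null counts
# (info() prints its summary directly and returns None, so no print() is needed)
data.info()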

In [None]:
# To get the last five days of Bitcoin data and show the two most recent rows, we can use the following
import yfinance as yf

btc = yf.Ticker("BTC-USD")
btc_data = btc.history(period="5d")
btc_data.tail(2)

## **2. Data Descriptors**

How do you refer to data related to specific securities? Most securities have identifiers. In the previous example, we referred to Netflix by its ticker, NFLX, on the stock exchange.  Depending on the asset class, different types of identifiers are used. In previous lessons, we discussed bonds, stocks, cryptocurrencies, mutual funds, ETFs, options, securitized products, and real estate, among others. Let's start with stocks.

Tickers are the most widely known identifiers. You will see tickers on screens during television shows or podcasts and in newspapers and online stories, and a ticker often resembles the name or type of company. Facebook was long known as FB (it now trades as META, after renaming itself Meta), Apple is AAPL, and IBM is, well, IBM! The ticker is not only an identifier but also a way to remember the company. For example, Papa John's is a food company that specializes in pizza. Its ticker is PZZA. ThermoGenesis, a maker of cell-processing and cryogenic storage equipment, has traded under the ticker KOOL. Southwest Airlines' ticker is LUV. WideOpenWest has the ticker WOW. Research suggests that the ticker may even be linked to the stock's performance!

In 2006, researchers at Princeton reported that stocks whose names and tickers are easy to pronounce tended to outperform harder-to-pronounce ones in the period after they began trading [Boutin].
In a 10+ year study, researchers at Pomona College showed that stocks with clever tickers outperformed stocks with plain tickers [Smith]. Who would think that the resemblance of a ticker to a fun or easily remembered word could be linked to the stock's performance? In a future course, we'll cover topics related to behavioral finance and discuss the idea of familiarity bias.

Tickers are important but tend to be applied to stocks and ETFs. Bonds, for example, do not have tickers. Let's look at some other ways securities are identified.

CUSIP: Typically, when you see all uppercase, the identifier is an acronym. CUSIP stands for Committee on Uniform Security Identification Procedures. Used in North America, it is a nine-character alphanumeric code that identifies a security along with its issuer and issue type. The first six characters identify the issuer. The next two characters identify the specific issue. The last character is a check digit that verifies the code was formed correctly. CINS, the CUSIP International Numbering System, extends the scheme to securities issued outside North America. A CINS code is also nine characters long, and its first character is a letter that represents the issuer's country or region.
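The check digit can be verified in a few lines. The sketch below follows the commonly documented CUSIP check-digit rule (a "double every second character" scheme); the function names are illustrative, and the example uses Apple's common-stock CUSIP, 037833100.

```python
def cusip_check_digit(base8: str) -> int:
    """Compute the check digit from the first eight CUSIP characters."""
    special = {"*": 36, "@": 37, "#": 38}
    total = 0
    for position, char in enumerate(base8.upper(), start=1):
        if char.isdigit():
            value = int(char)
        elif "A" <= char <= "Z":
            value = ord(char) - ord("A") + 10  # A=10, B=11, ..., Z=35
        elif char in special:
            value = special[char]
        else:
            raise ValueError(f"invalid CUSIP character: {char!r}")
        if position % 2 == 0:  # double the value at every second position
            value *= 2
        total += value // 10 + value % 10  # add the digits of the value
    return (10 - total % 10) % 10


def is_valid_cusip(cusip: str) -> bool:
    """Return True if the ninth character matches the computed check digit."""
    if len(cusip) != 9 or not cusip[-1].isdigit():
        return False
    try:
        return cusip_check_digit(cusip[:8]) == int(cusip[-1])
    except ValueError:
        return False


print(is_valid_cusip("037833100"))  # Apple common stock
print(is_valid_cusip("037833109"))  # same code with a corrupted last digit
```

A check like this catches most single-character typos before an identifier is used to join or look up data.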

SEDOL is also an acronym: Stock Exchange Daily Official List. SEDOLs are seven-character codes assigned by the London Stock Exchange, and a security's SEDOL is tied to its listing. If the listing changes, the SEDOL can change too. Consider the case of Hertz Rental Cars. Hertz was one of the largest car rental companies in the world. It ran into financial difficulties and suffered further during the COVID-19 pandemic, which affected both business and leisure travel. In May 2020, Hertz filed for Chapter 11 bankruptcy. In October 2020, Hertz was delisted from the NYSE, and the SEDOL changed to reflect that the listing was no longer on the NYSE. The shares then traded over the counter before the company listed on NASDAQ. Any data collected on Hertz must recognize the switch from one trading venue to another.

ISIN is the International Securities Identification Number. ISINs are 12-character identifiers. The first two characters are letters identifying the country. The next nine alphanumeric characters identify the security, and the final character is a check digit. ISINs are more universal. For example, futures tend not to have CUSIPs but have ISINs. Foreign Exchange rates have identifiable tickers (e.g., GBP/USD) but are referred to internationally by their ISINs.
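As an illustration of the ISIN layout, the sketch below validates the final check digit using the Luhn rule applied after letters are converted to numbers (A=10 through Z=35). The function names are illustrative. The example is Apple's ISIN, US0378331005, whose nine middle characters are the company's CUSIP, as is usual for U.S. securities.

```python
def isin_check_digit(base11: str) -> int:
    """Compute the ISIN check digit from the first eleven characters."""
    # Letters become two-digit numbers (A=10 ... Z=35); digits are unchanged.
    digits = "".join(str(int(char, 36)) for char in base11.upper())
    total = 0
    # Luhn rule: from the right, double every other digit, starting with the
    # rightmost digit of the payload, and add the digits of each result.
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 0:
            value *= 2
        total += value // 10 + value % 10
    return (10 - total % 10) % 10


def is_valid_isin(isin: str) -> bool:
    """Check length, country prefix, character set, and check digit."""
    if len(isin) != 12 or not (isin.isascii() and isin.isalnum()):
        return False
    if not isin[:2].isalpha() or not isin[-1].isdigit():
        return False
    return isin_check_digit(isin[:11]) == int(isin[-1])


print(is_valid_isin("US0378331005"))  # Apple
print(is_valid_isin("US0378331006"))  # corrupted check digit
```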

The securities we've described are fungible. Fungible securities are interchangeable. For example, 100 shares of AAPL common stock are like any other 100 shares of AAPL common stock. Most financial securities are fungible. Note that fungible need not mean identical. Gold is considered fungible, so long as it does not have serial numbers or marks that uniquely identify it. (U.S. dollar bills have serial numbers but are still considered fungible.) Likewise, Bitcoin is fungible. You can pay someone a Bitcoin, and they can send you a different one back. They are considered interchangeable, just like paper currency in any country.  

When we get to items that are more specific or unique, we have non-fungible assets. Real estate is not a fungible commodity. It is not interchangeable. Even two houses of the same size and materials in the same neighborhood are not interchangeable, in part because they sit on different plots, with different views, light, neighbors, properties, etc. Similarly, works of art are not fungible. An original painting or sculpture is not like any other. Non-fungible assets are very difficult to identify in a standardized way because they are, by nature, unique.

For example, Zillow may show a property that was \$100,000, and then a year later, it is \$500,000. Did the house really increase five-fold? No. Zillow represents the value of whatever real estate is at the address. Originally, the land was undeveloped; it was a vacant lot. It was bought by a builder for \$100,000. The builder constructed a house that was then sold for \$500,000. The land itself has value ("undeveloped"), and any structures (homes, buildings) add value through "developments" on the land. Zillow relies on a deed or property record, often filed at the local level, to refer to the property. With millions of properties, it is difficult to uniquely identify a property and likely impossible (and unnecessary) to have an internationally standardized system.

Moving to precious jewelry, diamonds are non-fungible. In the U.S., some diamonds are graded by the Gemological Institute of America (GIA) and have an engraved serial number. This makes such diamonds like gold with a serial number: not interchangeable and thus non-fungible. Diamonds without this engraving are more fungible. However, there is a risk that such a diamond has not been properly appraised for clarity, carat weight, or even whether it is man-made or natural.

When you want to analyze the behavior of securities, think about the data you are collecting, and ask yourself if it is a specific, one-of-a-kind asset (home, art, some jewelry) or a fungible, easily interchangeable asset (stock, bond, ETF). Use identifiers if they help.



## **3. Time Zones, Trading Days, and 24/7**

"Get me the closing price" is something you will hear as a financial engineer. But what is "the" closing price?
The answer sounds like a quantum physics one--depending on where you are, the close will be different, based on your location and time zone.

Clearly, if we are in Lagos, Nigeria, trading Dangote Cement, we can expect its official close at the close of the Nigerian stock market: 2:30 PM local time (GMT+1). If we are in Sydney, Australia, trading the Australian stock BHP, then we observe the market at its closing time, 4:00 PM Sydney time (GMT+10).

Suppose we want to correlate the returns of these two companies using closing prices. We have to remember that their official exchange closes are seven and a half hours apart! Here is a table of times the markets open and close:

| GMT | Sydney, Australia | Lagos, Nigeria | Event |
|:---|:---|:---|:---|
| 00:00 | 10:00 | 01:00 | Sydney Stock Exchange opens |
| 06:00 | 16:00 | 07:00 | Sydney Stock Exchange closes |
| 09:00 | 19:00 | 10:00 | Nigeria Stock Exchange opens |
| 13:30 | 23:30 | 14:30 | Nigeria Stock Exchange closes |
  

Depending on the objective of the analysis, one could use

* Plan A: GMT=06:00. Get the data at Sydney's close, but the Nigerian exchange is not open!
* Plan B: GMT=13:30. Get the data at the Nigerian exchange's close, but the Sydney exchange has already closed.

The advantage of Plan B is that the data will reflect the closing prices of both exchanges for the same date. Even though Sydney has closed, its close is the last observable price for that trading day.
Keep in mind that the data is not available until at least GMT=13:30, when the Nigerian exchange closes.
If you were to build a trading strategy using this closing correlation and decided to buy Dangote, the purchase would have to happen on the following trading day. Novices often lose track of the timestamp. A strategy that buys Dangote at the same close used to compute the signal would be acting on information that was not yet available, reversing cause and effect.
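The alignment can be checked by putting both closes on a common clock. The sketch below is a minimal illustration using only the standard library and the fixed offsets from the table (GMT+10 and GMT+1); the trading date is an arbitrary example, and the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets taken from the table above. They ignore daylight saving time.
SYDNEY = timezone(timedelta(hours=10), "GMT+10")
LAGOS = timezone(timedelta(hours=1), "GMT+1")

trading_day = (2022, 7, 1)  # arbitrary example date (year, month, day)

# Local closing times, converted to GMT/UTC.
sydney_close = datetime(*trading_day, 16, 0, tzinfo=SYDNEY).astimezone(timezone.utc)
lagos_close = datetime(*trading_day, 14, 30, tzinfo=LAGOS).astimezone(timezone.utc)

gap_hours = (lagos_close - sydney_close) / timedelta(hours=1)
data_ready = max(sydney_close, lagos_close)  # earliest time both closes are known

print("Sydney close (GMT):", sydney_close.strftime("%H:%M"))
print("Lagos close (GMT): ", lagos_close.strftime("%H:%M"))
print("Hours between closes:", gap_hours)
print("Both closes known from:", data_ready.strftime("%Y-%m-%d %H:%M"), "GMT")
```

For real data, named time zones (for example through Python's `zoneinfo` module) are a safer choice than fixed offsets, since Sydney's offset shifts with daylight saving time.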

The difficulties are more subtle if we look at the same security that trades in different parts of the world.
Consider a credit derivative that was discussed in Financial Markets, say a 5-year Credit Default Swap on IBM.
You can purchase this CDS from a Tokyo-based bank, a London-based bank, or a New York-based bank, among other locations. Each of these banks is geographically located in a different part of the world. Depending on the time of year (thanks to daylight saving time), Tokyo is 13 or 14 hours ahead of New York, and London is usually 5 hours ahead of New York. Each one of these locations has an official close. For securities that are not housed in a specific location (like a 5-year CDS), there can be "different closes" because CDSs are not exchange traded but traded over the counter. As there are OTC markets worldwide, the notion of close is relative to a geographical location.

Reflecting on section 1, think about the quality of your data vendor. Are they providing the timezone, market, currency, etc., of the CDS value? If not, there is uncertainty about where the data originates.

What if we have an exchange, but it never closes? For example, a Bitcoin data provider may use GMT - 5 hours as the official close. But for a market that trades 24/7, is this price really more important than the price each day at midnight GMT?

When data lives in different geographical regions, or simply on exchanges with different closing hours, be sure to align properly so that the trading date reflects the close on that date.


## **4.  Mixing Up Fields**

Following up from the last section, the currency can differ depending on where the data comes from. In our CDS example, you can find data denominated in yen, pounds sterling, or U.S. dollars. Oftentimes, people collect the data and treat it as a bare number, separate from its units. A mix-up is easy to spot between yen-denominated and dollar-denominated values, whose magnitudes differ greatly, but much harder when the currencies have an exchange rate close to 1. Be sure when you compare levels that you are comparing numbers representing the same quantity. One way to fix this problem is to compute returns instead of using levels. Another way is to convert all quoted prices to a common currency, which requires appropriate FX conversions.
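Both fixes can be sketched in a few lines. All quotes and exchange rates below are hypothetical numbers chosen for illustration, and the field names are assumptions rather than any vendor's schema.

```python
# Hypothetical quotes for comparable instruments, each in its local currency.
quotes = [
    {"source": "Tokyo", "currency": "JPY", "previous": 13000.0, "latest": 13130.0},
    {"source": "London", "currency": "GBP", "previous": 80.0, "latest": 80.8},
    {"source": "New York", "currency": "USD", "previous": 100.0, "latest": 101.0},
]

# Hypothetical FX rates: U.S. dollars per one unit of the local currency.
usd_per_unit = {"JPY": 0.0077, "GBP": 1.25, "USD": 1.0}

for quote in quotes:
    # Fix 1: returns are unit-free, so they can be compared directly.
    quote["return"] = quote["latest"] / quote["previous"] - 1
    # Fix 2: convert levels to a common currency before comparing them.
    quote["latest_usd"] = quote["latest"] * usd_per_unit[quote["currency"]]

for quote in quotes:
    print(f'{quote["source"]:<9} raw level {quote["latest"]:>9.2f} {quote["currency"]}  '
          f'USD level {quote["latest_usd"]:>7.2f}  return {quote["return"]:.2%}')
```

The raw levels differ by two orders of magnitude, while the returns and USD levels are directly comparable. Note that a local-currency return still ignores any move in the exchange rate itself over the period.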

Collecting data as numbers without paying attention to the units can lead to disastrous results. Be sure to keep fields correctly labeled.  

For example, imagine if you mixed up price and quantity.
Suppose you were supposed to sell some stock: say 1 share at 610,000 yen.
That is, the quantity is 1, and the price is 610,000.
Suppose there is a mix-up in the data, and you send an order to sell 610,000 shares at 1 yen!

Indeed, this very mistake happened in December 2005 at a well-known global bank. An operational failure in handling the order data led the bank to swap price and quantity. The bank anxiously tried to cancel the order, but the Tokyo Stock Exchange did not allow the cancellation. The stock was sold at a bargain, a fraction of its true worth. Ironically, the amount being sold was more than the amount actually issued! When the damage was done, it had cost the bank US \$225 million, nearly a quarter of a billion dollars, which makes for a very expensive typo. The bank was large and profitable enough in other businesses that the loss did not result in its demise. [CBS]

This is no small matter and emphasizes a distinction between programming and financial engineering.
The programmer will apply test cases and have validation suites, such as asserting that prices must be positive.

The financial engineer, however, will put these checks in the context of financial markets and would likely ensure that the program rejects the trade before it ever gets to the exchange. Indeed, it would have been better to reject the trade outright. What quality checks could have been done on the data? The financial engineer could ask:

1. Should the offer price be 600,000 yen away from the last traded price?
2. Should the offer price be far from the daily volume-weighted average price (VWAP)?
3. Should the quantity offered be more than the bank had on hand?
4. Should the quantity offered be more than the total amount in issuance?

All these questions have the same answer: NO. The financial engineer uses financial common sense to assert that the inputs do not make sense given the financial state of the system and rightfully rejects the order. This would have saved hundreds of millions of dollars.
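The four questions translate naturally into pre-trade checks. The following is a minimal sketch, not a description of any real order-management system: the `SellOrder` class, the `sanity_check` function, the 10% tolerance, and all market numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SellOrder:
    quantity: int
    price: float  # in yen


def sanity_check(order, last_price, vwap, position, shares_issued,
                 max_deviation=0.10):
    """Return a list of reasons to reject the order (empty if it looks sane)."""
    problems = []
    if abs(order.price - last_price) > max_deviation * last_price:
        problems.append("price far from last traded price")
    if abs(order.price - vwap) > max_deviation * vwap:
        problems.append("price far from daily VWAP")
    if order.quantity > position:
        problems.append("quantity exceeds position on hand")
    if order.quantity > shares_issued:
        problems.append("quantity exceeds total shares issued")
    return problems


# Hypothetical market state at the time of the order.
market = dict(last_price=610_000, vwap=608_000, position=100, shares_issued=14_500)

intended = SellOrder(quantity=1, price=610_000)
mistyped = SellOrder(quantity=610_000, price=1)

print("intended:", sanity_check(intended, **market))
print("mistyped:", sanity_check(mistyped, **market))
```

The intended order passes every check, while the order with price and quantity swapped fails all four, so it could be rejected before reaching the exchange.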

We will see in the next lesson that quality checks can be placed throughout the handling of data so that these types of errors are caught before they incur massive costs. The point isn't merely saving money but also protecting the reputation and trustworthiness of these financial institutions' operations.



## **5.  Complexities**

In addition to differing units, timestamps, and currencies, financial data tends to have its own set of complexities.
For example, stocks can split, whereby the number of outstanding shares increases and the share price decreases. For example, a company with 100,000,000 shares at \$200 per share could have a 2-for-1 split so that there are now 200,000,000 shares at \$100 per share. Similarly, a company could have a reverse split--those 100,000,000 shares at \$200 could become 50,000,000 shares at \$400 per share. In either case, the split itself leaves market capitalization unchanged. The idea behind splitting is that it makes shares more affordable. When stocks split, the price histories have to be adjusted so that the stock doesn't appear to have a 50% drop in price.
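A short sketch shows why the adjustment matters. The closing prices below are hypothetical, built around the 2-for-1 example above; pre-split prices are divided by the split ratio so that the whole history is on the post-split basis.

```python
# Hypothetical daily closes around a 2-for-1 split.
raw_closes = [198.0, 200.0, 202.0, 101.5, 102.0]
split_index = 3    # first day quoted on the post-split basis
split_ratio = 2.0  # 2-for-1: each old share becomes two new shares

# Divide every pre-split price by the ratio; leave post-split prices alone.
adjusted_closes = [
    price / split_ratio if i < split_index else price
    for i, price in enumerate(raw_closes)
]


def daily_returns(prices):
    """Simple returns between consecutive closes."""
    return [later / earlier - 1 for earlier, later in zip(prices, prices[1:])]


raw_returns = daily_returns(raw_closes)
adjusted_returns = daily_returns(adjusted_closes)

print("adjusted closes:", adjusted_closes)
print(f"return on split day, raw:      {raw_returns[split_index - 1]:.2%}")
print(f"return on split day, adjusted: {adjusted_returns[split_index - 1]:.2%}")
```

On the unadjusted series, the split day looks like a loss of roughly half the stock's value; on the adjusted series, it is an ordinary small move.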

In addition, equities pay dividends, which affect prices, so equity prices also need to be adjusted for the income shareholders receive as dividends. Often, you will see stock prices adjusted for splits and dividends. For example, when importing data from Yahoo Finance, the most suitable column is usually the "Adjusted Close" column (labeled "Adj Close" in the downloaded data), rather than the "Close" column.

When we upgrade from individual securities to ETFs, we have extra complexities. A sector ETF has members, whose names and weights change over time. So in addition to tracking the price and volume history of ETFs, we may want to track additions, deletions, and re-weightings.  

Keep in mind that ignoring deletions creates a survivorship bias. Suppose you did a 15-year study on an equity index like the S&P 500. You start in 2022, look back 15 years, and only use those companies that existed for the whole 15-year period. Companies that were in the index in 2007 but failed to survive would be excluded from the study. So the study looks only at stocks that survived the 15-year period. This survivorship bias means that our results will appear better than they really were, because the companies that went bankrupt, which would have brought the average return down, are left out.
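A toy calculation makes the size of the bias visible. The five companies and their 15-year returns below are invented purely for illustration; a return of -1.00 represents a company that went bankrupt and lost its entire value.

```python
# Hypothetical 15-year total returns for companies in the index at the start.
universe = {"AAA": 1.50, "BBB": 0.80, "CCC": -1.00, "DDD": 0.30, "EEE": -1.00}

# A survivor-only study silently drops the companies that failed.
survivors = {name: ret for name, ret in universe.items() if ret > -1.00}

mean_all = sum(universe.values()) / len(universe)
mean_survivors = sum(survivors.values()) / len(survivors)

print(f"average return, full starting universe: {mean_all:.1%}")
print(f"average return, survivors only:         {mean_survivors:.1%}")
```

Using the index membership as of the start date, rather than the end date, is what keeps the failed companies in the sample.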

Leaving the world of equities, there's a complexity regarding time. Bonds mature. On-the-run active bonds roll. Options expire. Suppose you have a specific 10-year bond issued in the U.S. If you collect its data for a number of years, you are following a bond whose time to maturity keeps decreasing. After 10 years, the bond matures.
Yet, it is possible to find decades and decades of history of 10-year notes. How? We use a generic 10-year identifier.
What makes this generic? Suppose a new 10-year note is issued every three months. Then, every three months, the generic switches from the old issue to the new issue. So long as 10-year notes are regularly issued, we can continue to roll out of the old issue into the new one. This is what generic series do. This works for futures contracts as well. Since futures expire, we simply use generic versions to track them.
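The rolling logic can be sketched as follows. The note identifiers, issue dates, and yields are all hypothetical, and `on_the_run` is an illustrative helper: for each observation date, the generic series keeps only the value from the most recently issued note.

```python
from datetime import date

# Hypothetical issues: (identifier, issue date), one new note every three months.
issues = [
    ("NOTE_A", date(2021, 2, 15)),
    ("NOTE_B", date(2021, 5, 15)),
    ("NOTE_C", date(2021, 8, 15)),
]


def on_the_run(as_of):
    """Return the identifier of the most recently issued note as of a date."""
    available = [(issued, name) for name, issued in issues if issued <= as_of]
    return max(available)[1] if available else None


# Hypothetical yield observations: (date, identifier, yield in percent).
observations = [
    (date(2021, 3, 1), "NOTE_A", 1.45),
    (date(2021, 6, 1), "NOTE_A", 1.60),
    (date(2021, 6, 1), "NOTE_B", 1.62),
    (date(2021, 9, 1), "NOTE_B", 1.30),
    (date(2021, 9, 1), "NOTE_C", 1.31),
]

# The generic series keeps, for each date, only the on-the-run note's value.
generic_10y = [(d, y) for d, name, y in observations if name == on_the_run(d)]

for d, y in generic_10y:
    print(d, on_the_run(d), y)
```

The resulting series has one value per date and always tracks a note close to ten years from maturity, even though the underlying security changes at each roll.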

Options are more complex because they don't roll like futures contracts. In the next section, we'll see why options in fact are the most difficult.


## **6.  Choices within an Asset Class**
Choices refer to the range of selections available within an asset class. Suppose you would like to trade a security related to the company Apple.

There is one common stock for Apple, and the most liquid place to trade it is on the NASDAQ (though it also trades on some foreign exchanges like Buenos Aires and Berlin).

Yet, if you wanted to buy bonds, there are dozens of choices--different maturities with various coupons.  

Furthermore, if you wanted to buy options, there are hundreds of choices--calls or puts, different strike levels, different expirations, different types of exercise, and even different payoffs (vanilla or exotics). The choices are staggering.

When there are more securities in a class (Apple bonds or Apple calls), the liquidity is spread out across them. Therefore, the data is harder to collect, organize, and analyze.

In general, equity data is the easiest data to manage, in part because of the small number of securities. ETFs are a bit harder because they involve members and weightings that change quite frequently. Bonds are harder still because of the great number of them, their complex features (e.g., callable bonds, convertible bonds), and their maturity dates winding down. Options are probably the most difficult security data to work with because of the vast number of them and their dependence on the underlying securities.

Options can change in value because the underlying changes, volatility changes, interest rates change, or any combination of these. For example, falling option prices could simply reflect the fact that the options are getting closer to their expiration dates, not necessarily a decrease in volatility.
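To illustrate the last point, the sketch below prices the same call option at three times to expiration while holding the underlying price, volatility, and interest rate fixed. It uses the Black-Scholes formula purely as a convenient pricing model, which is an assumption of this example rather than something the lesson prescribes, and all inputs are hypothetical.

```python
from math import erf, exp, log, sqrt


def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def bs_call(spot, strike, rate, vol, years):
    """Black-Scholes price of a European call on a non-dividend-paying stock."""
    d1 = (log(spot / strike) + (rate + 0.5 * vol ** 2) * years) / (vol * sqrt(years))
    d2 = d1 - vol * sqrt(years)
    return spot * norm_cdf(d1) - strike * exp(-rate * years) * norm_cdf(d2)


# Hypothetical inputs; only the time to expiration changes.
spot, strike, rate, vol = 100.0, 100.0, 0.02, 0.25
times_to_expiry = [0.5, 0.25, 1 / 12]  # six months, three months, one month

prices = [bs_call(spot, strike, rate, vol, t) for t in times_to_expiry]

for t, price in zip(times_to_expiry, prices):
    print(f"{t * 12:4.1f} months to expiry: call price {price:.2f}")
```

The price falls as expiration approaches even though volatility never changes, which is why an option price history cannot be read without its expiration date.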



## **7. Conclusion**
Financial data has its own set of idiosyncrasies and complexities: different frequencies, units, time zones, and currencies, with some values representing executable prices and others mere quotes. This lesson gave a broad introduction to these issues. In the next lesson, we'll add unstructured data and see how these data management issues shape the work of getting data ready for machine learning.
 

**References**

* Boutin, Chad. "Stock Performance Tied to Ease of Pronouncing Company's Name." Princeton University, May 29, 2006.
https://www.princeton.edu/news/2006/05/29/study-stock-performance-tied-ease-pronouncing-companys-name

* CBS News. "Stock Trade Typo Costs Firm $225M." December 9, 2005.
https://www.cbsnews.com/news/stock-trade-typo-costs-firm-225m/

* Smith, Gary. "Stocks with Clever Ticker Symbols Outperform Plain Names, New Pomona College Study Confirms." Pomona College, September 27, 2019.
https://www.pomona.edu/news/2019/09/27-stocks-clever-ticker-symbols-outperform-plain-names-new-pomona-college-study-confirms




---
Copyright © 2022 WorldQuant University. This
content is licensed solely for personal use. Redistribution or
publication of this material is strictly prohibited.
