- **Supervised Learning** (regressions, classifications)
- **Unsupervised Learning** (factor analysis, clustering)
- **Deep & Reinforcement Learning** (analyze unstructured data to find patterns) 

- ML methods are often simple extensions of well-known statistical methods
- **Supervised Learning Methods** aim to establish a relationship between 2 datasets and use one dataset to forecast the other
- **Unsupervised Learning Methods** try to understand the structure of data and identify the main drivers behind it
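
The supervised case above (use one dataset to forecast the other) can be sketched as a one-variable least-squares fit. This is only a minimal illustration, not a method taken from the source: the function name `fit_line` and the toy numbers are assumptions, chosen so the arithmetic is easy to check by hand.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x (one regressor)."""
    x_bar, y_bar = mean(x), mean(y)
    # slope = covariance(x, y) / variance(x); the 1/(n-1) factors cancel
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical toy data: x is the "known" dataset, y the one to forecast.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 5.0, 7.0, 9.0]  # constructed as y = 1 + 2x

intercept, slope = fit_line(x, y)
forecast = intercept + slope * 5.0  # forecast y for a new observation x = 5
```

Real signals are far noisier than this constructed example, so the fitted relationship would normally be validated on data not used for the fit.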

- The *mkt will start reacting faster* and will increasingly anticipate traditional or "old" data sources (e.g. quarterly corporate earnings, low frequency macroeconomic data, etc.)
- As the Big Data ecosystem evolves, datasets that have *high Sharpe ratio signals* (viable as standalone funds) will disappear
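
For reference, the Sharpe ratio mentioned above is the mean excess return divided by its standard deviation, usually annualized. The sketch below assumes daily returns, 252 trading days per year, and a zero risk-free rate by default; the function name and the sample returns are illustrative assumptions, not figures from the source.

```python
from math import sqrt
from statistics import mean, stdev

def annualized_sharpe(returns, risk_free_per_period=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a list of periodic returns."""
    excess = [r - risk_free_per_period for r in returns]
    vol = stdev(excess)  # sample standard deviation of excess returns
    if vol == 0:
        raise ValueError("returns have zero volatility; Sharpe ratio undefined")
    return mean(excess) / vol * sqrt(periods_per_year)

# Hypothetical daily strategy returns (illustrative only).
returns = [0.001, -0.002, 0.003, 0.0005, -0.001, 0.002]
sharpe = annualized_sharpe(returns)
```

Six observations are far too few for a meaningful estimate; the sample is kept short only so the calculation is easy to follow.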

- **Machine Learning algorithms cannot entirely replace human intuition**
- Regarding **talent:** employing data scientists who lack specific financial expertise or financial intuition may not lead to the desired investment results
- In implementing Big Data & Machine Learning in Finance, it is more important to understand the economics behind data and signals than to be able to develop complex technological solutions

- For **short term trading** such as high frequency mkt making, humans already play a very small role
- For a **long term horizon**, machines will likely not do well in assessing regime changes (mkt turning points) and forecasts which involve interpreting more complicated human responses such as those of politicians and central bankers, understanding client positioning, or anticipating crowding

- This *industrial revolution of data* seeks to provide **alpha** through informational advantage and the ability to uncover new uncorrelated signals
- The informational advantage of Big Data is **not** related to expert and industry networks, access to corporate management, etc., but rather to the ability to collect large quantities of data and analyze them in real time


1. Exponential increase in amount of data
	* 90% of the data in the world today has been created in the past 2 years alone [^1]
	* Data volume is projected to grow from 4.4 zettabytes (4.4 trillion gigabytes) in late 2015 to 44 zettabytes by 2020[^2]

2. Increases in computing power & storage capacity
	* by 2020, over 1/3 of all data will either live in or pass through the cloud[^3]
	* Technology vendors provide remote access, classified into three categories:
		* Software-as-a-service (SaaS)
		* Platform-as-a-service (PaaS)
		* Infrastructure-as-a-service (IaaS)

3. Machine Learning Methods to analyze large & complex datasets
	* ML techniques enable analysis of large and unstructured datasets and construction of trading strategies
	* **Deep Learning** is an analysis method that relies on multi-layer neural networks
	* **Reinforcement Learning** is an approach that encourages algorithms to explore and find the most profitable strategies
	* Currently, just 0.5% of the data produced is being analyzed

Figure 1

- **Big Data**
	* "Big" stands in for 3 prominent characteristics:
		1. Volume: size of data collected
		2. Velocity: speed with which data is sent or received
		3. Variety: formats

- **Machine Learning (ML)**
	* In finance, one can view ML as an attempt at *uncovering relationships between variables*
	* ML can also be seen as a *model-independent (or statistical or data-driven) way* for *recognizing patterns* in large data sets

	* **Supervised Learning** (regressions and classifications)

	* **Unsupervised Learning** (factor analyses and regime identification)

	* **Deep & Reinforcement Learning**
		* Deep Learning is based on neural network algorithms and is used in processing unstructured data


- **Artificial Intelligence (AI)**

	* ML is another attempt to achieve AI

	* ML and specifically Deep Learning so far represent the most serious attempt at achieving AI

	* While Deep Learning based AI can excel and beat humans in many tasks, it cannot do so in all. It is still struggling with some basic tasks such as the **Winograd Schema Challenge**


- The advantage given by data can be in the form of uncovering new information not contained in traditional sources, or uncovering the same information but at an earlier time

	* Satellite imagery of mines or agricultural land can reveal supply disruptions before they are broadly reported in the news or official reports

1. Data generated by Individuals

	* Often in unstructured textual formats, needs Natural Language Processing

2. Sensor generated data

	* Tends to be unstructured and may require analysis techniques such as counting objects, or removing the impact of weather/clouds from a satellite image

3. Business generated datasets

	* Examples such as credit card transactions and company "exhaust" data often carry legal and privacy considerations
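
As a rough illustration of turning individual-generated text into a numeric signal, the sketch below scores a message by counting words from two small word lists. The word lists, the function name, and the sample messages are all hypothetical; production Natural Language Processing relies on far richer models than word counting.

```python
import re

# Hypothetical, deliberately tiny lexicons (assumptions for illustration).
POSITIVE = {"beat", "strong", "upgrade", "growth"}
NEGATIVE = {"miss", "weak", "downgrade", "recall"}

def sentiment_score(text):
    """Return (positive - negative) / matched words, in [-1, 1]; 0.0 if nothing matches."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    total = pos + neg
    return 0.0 if total == 0 else (pos - neg) / total

# Hypothetical messages, not real data.
messages = [
    "Strong quarter, earnings beat expectations",
    "Analyst downgrade after weak guidance and product recall",
    "Company announces new office location",
]
scores = [sentiment_score(m) for m in messages]
```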

Figure 3


- **Data generated by Business Processes**

	* Data generated by business processes is often highly structured (in contrast to human-generated data) and can act as a leading indicator for corporate metrics, which tend to be reported at a significantly lower frequency

- **Data generated by sensors**

	* The data generated is typically unstructured and its size is often significantly larger than either human or process-generated data streams

	* Perhaps the most promising is the future concept of the Internet of Things (IoT) - the practice of embedding micro-processors and networking technology into all personal and commercial electronic devices
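
The leading-indicator use of business-process data described in this list can be sketched as a simple aggregation: daily records are summed into quarters, giving an estimate before the lower-frequency corporate figure is published. The function name and the transaction amounts below are hypothetical, and the sketch ignores panel bias and seasonality.

```python
from collections import defaultdict
from datetime import date

def quarterly_totals(daily_records):
    """Sum (date, amount) records into {(year, quarter): total}."""
    totals = defaultdict(float)
    for day, amount in daily_records:
        quarter = (day.month - 1) // 3 + 1
        totals[(day.year, quarter)] += amount
    return dict(totals)

# Hypothetical daily transaction totals for one merchant.
records = [
    (date(2017, 1, 15), 100.0),
    (date(2017, 3, 31), 50.0),
    (date(2017, 4, 1), 70.0),
    (date(2017, 6, 30), 30.0),
]
totals = quarterly_totals(records)
# Quarter-on-quarter change in observed spend, known as soon as the quarter ends.
growth = totals[(2017, 2)] / totals[(2017, 1)] - 1
```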



- High frequency quant traders will care about all signals that can be produced on an intraday basis, such as tweets, news releases, etc., but will care less about e.g. credit card data that come with substantial delays and are less broadly followed

Figure 4


1. **Asset class**

	* There is relatively little alternative data on interest rates and currencies, making such data sets more valuable

2. **Investment style** 

	* Most data are sector and stock specific and relevant for equity long-short investors

3. **Alpha Content (most important attribute)**

	* Alpha content has to be analyzed in the context of the price to purchase and implement the dataset

		- Sentiment analysis can be obtained for a few hundred or thousand dollars
		- Comprehensive credit card data can cost up to a few million USD a year

	* Most datasets have a small positive Sharpe ratio that is not sufficiently high for a standalone investment strategy

4. **How well-known is the dataset**

	* **Well-known public datasets** such as financial ratios (P/E, P/B, etc.) likely have fairly low alpha content and are not viable as standalone strategies (they may still be useful in a diversified risk premia portfolio)

5. **Stage of processing**

	* The highest level of data processing happens when data is presented in the form of **research reports**, alerts or trade ideas
		- Semi-processed data still has some outliers and gaps, and is not readily usable as input into a trading model

6. **Quality**

	* Data with longer **history** is often more desirable for the purpose of testing
		- satellite imagery > 3 years
		- sentiment data > 5 years
		- credit card data > 7 years
		- datasets with fewer than 50 points are typically less useful

	* It must be specified whether missing data is missing at random or follows a pattern

	* Data should have a robust support structure for clients

7. **Technical Aspects**

	* Frequency of data
	* Latency
	* Format (preferably .csv)
	* Legal and reputational risk
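
Two of the quality criteria listed under item 6 (enough data points, and whether gaps are random or patterned) can be checked mechanically. The sketch below is one possible approach, not the source's procedure: the function name, the weekday breakdown, and the sample series are assumptions, while the 50-point threshold is the one quoted in the notes.

```python
from collections import Counter
from datetime import date, timedelta

MIN_POINTS = 50  # threshold quoted in the notes above

def quality_report(series):
    """series: dict mapping date -> value, with None marking a missing observation."""
    observed = [d for d, v in series.items() if v is not None]
    missing = [d for d, v in series.items() if v is None]
    # Count gaps per weekday (0 = Monday); a concentration on one day suggests a pattern.
    missing_by_weekday = Counter(d.weekday() for d in missing)
    return {
        "n_observed": len(observed),
        "enough_points": len(observed) >= MIN_POINTS,
        "missing_share": len(missing) / len(series) if series else 0.0,
        "missing_by_weekday": dict(missing_by_weekday),
    }

# Hypothetical series: 70 consecutive days starting on a Monday, every Monday missing.
start = date(2018, 1, 1)
series = {}
for i in range(70):
    day = start + timedelta(days=i)
    series[day] = None if day.weekday() == 0 else float(i)

report = quality_report(series)
```

Here all gaps fall on one weekday, which indicates the data is not missing at random; a formal test would be needed for less obvious cases.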



Figure 5

- Automated analysis of unstructured data such as images, social media, and press releases is not possible with the standard tools of financial analysis




- Automation of tasks **is not** considered Machine Learning

	* We can instruct a computer to sell an asset if the asset price drops by a certain amount (stop loss)

- In **Machine Learning**, the computer is given an input (set of variables and datasets) and output that is a consequence of the input variables. The machine then finds or *learns* a rule that links the input and output

- In **Supervised Learning**, we find a rule, an *equation* that we can use to predict a variable

	* We may want to look for a momentum (trend following) signal that will have the best ability to predict future market performance
		- This may be accomplished by running advanced regression models to assess which one has the highest predictive power

- In **Unsupervised Learning** we are uncovering the structure of data

	* We take market returns and try to identify the main drivers of the market

	* A successful model may find that at one point in time, the market is driven by the momentum factor, energy prices, level of USD, and a new factor that may be related to liquidity

- **Deep Learning** is a Machine Learning method that analyzes data in multiple layers of learning
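
The unsupervised idea of identifying the main drivers of market returns can be sketched with principal component analysis. The notes do not name a specific method, so PCA is an assumption here, as are all function names and return series. The example finds the first component by power iteration on a sample covariance matrix, in plain Python.

```python
from math import sqrt

def covariance_matrix(columns):
    """Sample covariance matrix for a list of equally long return series."""
    n = len(columns[0])
    means = [sum(c) / n for c in columns]
    centered = [[v - m for v in c] for c, m in zip(columns, means)]
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in centered]
            for ci in centered]

def first_principal_component(cov, iterations=200):
    """Leading eigenvector and eigenvalue of a covariance matrix via power iteration."""
    k = len(cov)
    vec = [1.0 / sqrt(k)] * k
    for _ in range(iterations):
        nxt = [sum(cov[i][j] * vec[j] for j in range(k)) for i in range(k)]
        norm = sqrt(sum(x * x for x in nxt))
        vec = [x / norm for x in nxt]
    # Rayleigh quotient v' C v gives the eigenvalue for the unit vector v
    eigenvalue = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(k))
                     for i in range(k))
    return vec, eigenvalue

# Hypothetical returns: assets A and B follow a common driver, C is small unrelated noise.
driver = [0.02, -0.01, 0.03, -0.02, 0.01, -0.03]
asset_a = [1.0 * d for d in driver]
asset_b = [0.8 * d for d in driver]
asset_c = [0.001, -0.001, -0.001, 0.001, 0.001, -0.001]

cov = covariance_matrix([asset_a, asset_b, asset_c])
loadings, variance = first_principal_component(cov)
# Share of total variance explained by the first component
explained = variance / sum(cov[i][i] for i in range(len(cov)))
```

In this constructed example the first component loads mainly on A and B and explains almost all of the variance, which is the sense in which it acts as a "main driver". Interpreting a component economically (momentum, energy prices, USD level) remains a judgment call that the algorithm does not make.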