<a href="https://colab.research.google.com/github/11223548/UTS_ML2019_Main/blob/master/11223548_A2_ReportFinal.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 32513 Assignment 2: Practical Machine Learning Project

**Report Link:** https://github.com/11223548/UTS_ML2019_Main/blob/master/11223548_A2_ReportFinal.ipynb
<br>**Code Link:** https://github.com/11223548/UTS_ML2019_Main/blob/master/11223548_A2_SampleReportCode.ipynb
<br>**ID:** 11223548
<br>**Topic:** *Systematic identification of illegal informed trading in equity markets via machine learning*

## Introduction

<br><center>*“Fairness is achieved when insiders and outsiders are in equal positions. That is, a system is fair if we would not expect one group to envy the position of the other.”*
<br>- **Saul Levmore**
<br></center>
<br>Insider trading often garners a lot of media attention as it infringes upon society’s expectations for equal access to material information within financial markets. However, the definition of insider trading, its prevalence and its benefit or detriment to financial markets are poorly understood. For example, just earlier this week numerous Australian media outlets began commenting on a recent study from the Australian National University which suggested insider trading in Australia was “rife” (ABC News, 2019). Whilst the study sparked some public outrage, the Australian financial market regulator, ASIC, was quick to comment that they disagreed with the report author’s definition of insider trading. Surprisingly, ASIC claimed that the “rife insider trading” described by the study did not conflict with existing laws and regulations against illegal insider trading and suggested that the trading was likely beneficial to the market. This report outlines how a historic inability to detect illegal insider trading has inhibited an understanding of its impact on financial markets. Using a proprietary dataset, this report subsequently shows how machine learning can plausibly be utilised to detect incidences of illegal informed trading and hence facilitate future academic study and financial market regulation.

<br>Conflicting views on the definition of illegal insider trading, and its benefit or detriment to financial markets, are heavily debated in academic literature. However, existing studies in finance have focussed on legal informed trading; i.e. the publicly documented trades of corporate insiders and significant shareholders in compliance with regional regulations. By contrast, illegal informed trading has received less academic attention. This is unsurprising as empirical attempts to identify insider trading in financial markets are hampered by significant data challenges. Illegal informed trading involves intentional obscuring of trading activities that seek to profit from material non-public information; there is no complete available record of illegal informed trading. Therefore, academic research has typically relied on investigation of prosecuted cases of insider trading. This is challenging since the available prosecuted cases of insider trading vary in the extent of details published on the actual trades made by the defendant, which can inhibit the ability to match prosecuted cases to actual trading data. Furthermore, since illegal informed trades are intentionally obscured, it is unclear how common the practice is or how strongly it impacts financial markets.

<br>Despite numerous attempts, the ability to detect incidences of illegal informed trading in equity markets has remained elusive in financial literature. Prior research has arrived at opposing findings regarding the observable features of illegal informed trading. Cornell and Sirri (1992) and Chakravarty and McConnell (1997, 1999) find that bid-ask spreads are generally unaffected by insider trading. By contrast, Fishe and Robe (2004) find that market makers in specialist markets adjust bid-ask spreads in response to the presence of insider trading, suggesting that bid-ask spreads may be an informative indicator of insider trading in some markets. Ahern (2018) tests a variety of standard illiquidity measures and identifies that only absolute order imbalance and the negative autocorrelation of order flows are statistically and economically robust predictors of illegal informed trading. However, the results in the paper only hold for short-lived information. Ahern concludes that standard measures of illiquidity have limited applications for the detection of illegal informed trading. The literature therefore leaves it unclear what market conditions are associated with periods of illegal informed trading.

<br>Without a robust means to detect incidences of insider trading it is challenging to evaluate the effect of regulatory policies aimed at curbing the prevalence of insider trading. It might be natural to assume that illegal informed trading in markets should be decreasing over time due to increasingly stringent reporting requirements being implemented by regulators. Continuous disclosure should reduce the opportunity for long-lived material private information that can be traded upon. By contrast, Acharya and Johnson (2010) find that insider trading becomes more likely with more insiders in spite of the presence of regulation. To the extent that financial market transactions are becoming more syndicated over time, insider trading opportunities could be increasing. Banerjee & Eckard (2001) analyse a wave of US mergers from 1897-1903, a time predating US legislation against insider trading, and find that pre-announcement stock run-ups and post-announcement price jumps closely resembled contemporary markets. In a study of 4,541 acquisitions spanning 52 countries, Bris (2005) finds that insider trading enforcement increases both the incidence and profitability of insider trading. Conversely, Geurcio et al. (2013) review SEC indictments against insiders and discover that the price impact of insiders on insider trading days has decreased over time, suggesting that increasingly aggressive SEC enforcement has deterred illegal insider trading. In order to resolve debates on the efficacy of regulation it is vital that a robust method to detect illegal informed trading in equity markets is first developed.

<br>Recent developments in machine learning represent a more plausible path to detection of illegal informed trading than traditional metrics presented in literature. Financial markets are typically complex, nonlinear and non-stationary in nature. Non-linear machine learning approaches are better suited to financial market data and are capable of incorporating a broader array of input data to capture complex relations. Lopez De Prado (2016) highlights some of the pitfalls of the traditional econometric toolkit used in published finance research and criticises assumptions of linear relations and reliance on in-sample performance. In a review of the recent use of machine learning in quantitative finance research, Emerson et al. (2019) conclude “machine learning offers an opportunity for more complex financial analysis than was previously possible”. Feng et al. (2017) provide an example of this when they apply a double LASSO machine learning method and identify that only a small number of the many risk premia proposed in finance literature in recent decades are significant. This report implements a neural network and shows that illegal informed trading is partially detectable via machine learning.


## Data & Exploration

<center><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Data%20Disclaimer.JPG?raw=true" width="800"/>

<justify>The data used for the analysis in this report was obtained from United States Securities and Exchange Commission (**SEC**) litigation releases and the Wharton Research Data Services (**WRDS**) Trade and Quote (**TAQ**) database. The SEC database was used to identify the trading timestamps of prosecuted cases of insider trading in the US, which were then matched to tick-level trading data obtained from the TAQ database. Further discussion on each of these databases, how data was extracted, the merging of data, data pre-processing, and data issues is provided below.</justify>

<br>***WRDS TAQ Database***
<br>The WRDS TAQ database contains records of each individual trade processed on the secondary market for US equities across major US exchanges, including information on transaction price, volume and trade timestamp. Whilst WRDS provides application programming interface (**API**) access to some of their databases, this particular database does not have an API. As a result, data had to be manually downloaded for each security related to a prosecuted case of insider trading. 

<br>For each prosecuted case of insider trading analysed in this report, the entire day of trading data for the equity affected was downloaded. In addition, one week of trading data for that equity, from the period more than three days before the day of the company announcement, was downloaded as illustrated diagrammatically below.

<br><center>**Data Periods**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/DataWindows.jpg?raw=true" width="800"/>
<br></center>
<br>A study of the proprietary insider trading database revealed that in almost every prosecuted case of insider trading the defendant had submitted the illegal trades on the same day as the company announcement to which their inside information was linked. This is unsurprising as continuous disclosure requirements largely prevent the ability for material information to remain non-public for multiple days. Therefore, the duration between non-public information arising and being disclosed is typically very small. In the sample of prosecuted insider trading cases observed in this report each case involved the defendant trading on the same day as the company announcement. However, in order to allow for the possibility that there may have been other inside traders in the market that influenced trading more than one day out from an announcement but were never caught, trading data from the three days preceding any announcement day has been excluded from the dataset used in this report. 

<br>A calendar week of trading data, more than three days out from an announcement linked to insider trading, was downloaded from TAQ. This provided five days (markets are closed on weekends) of plausibly clean data to train a machine learning model on since the period is too far before a company announcement for an insider to have received material non-public information. This clean data could be fed into a machine learning model to enable it to learn a baseline of normal trading conditions in the absence of inside traders so that it would be able to subsequently detect anomalies.

<br>***SEC “Proprietary” Database***
<br>The SEC provides a list of links to litigation releases on their website concerning civil lawsuits brought by the Commission in federal court. A subset of these litigation releases relates to successfully prosecuted cases of insider trading. An academic at the University of Technology, Sydney (**UTS**) has collated a proprietary database of information about these prosecuted cases of insider trading over several years. Access to the database was granted for my research on insider trading but under strict conditions that the underlying data is not redistributed. As a result, I have anonymised identifying information on the cases and companies involved for data uploaded to GitHub and Colab in this report. Furthermore, I have only used a very small subset of the entire database rather than uploading the entire database in anonymised form. For completeness, further explanation on how information in the proprietary database was originally collated by the UTS academic is presented below.

<br>***Obtaining SEC Data***
<br>The source of the data of known insider trading transactions is the online archive (dating back to 20th September, 1995) of the SEC’s “*litigation releases concerning civil lawsuits brought by the Commission in federal court*”. Each of these releases contains a short summary of the commencement, the resolution, or a significant update of a civil lawsuit that the SEC has pursued, and will usually be accompanied by a copy of the official court complaint, which contains the formal details supporting the SEC’s case. With the aim to ensure that the collection process was as comprehensive as possible, key data from each litigation release was collected manually. This was then supplemented by information included in SEC annual reports, and all relevant court complaints available on Public Access to Court Electronic Records (**PACER**); the online archive of all court records in US courts. PACER data was obtained by algorithmically sourcing a list of all court records associated with each SEC case, and then cross-checking that list against the court documents available directly from the SEC. Any pertinent documents missing from the SEC archive were then included in the data collection process. Although certain cases may also have been pursued by the Department of Justice (in cases that went through the criminal, rather than civil, legal system), criminal court records from the Department of Justice are not available to the same degree. Consequently, the SEC’s civil cases are the only current source of insider trading data. The minimum criterion for inclusion was that a case concerned insider trading in an identifiable public company, over an identified time-period. This includes both cases that went to court, and those that were settled out-of-court. The litigation releases and court complaints were then investigated, and all relevant details about the lawsuit and its associated trading activity were input into a spreadsheet used for initial data-capture.

<br>Prosecuted cases of insider trading range in the level of sophistication of participants. Some cases involved participants trading material non-public information gained from a relationship with a corporate insider. A more sophisticated case involved a company creating a fraudulent account with a media disseminator and then using their corporate account to operate an algorithm that would intercept scheduled information releases from the private accounts of other corporations and immediately trade prior to the imminent news release.

<br>***SEC Data Issues***
<br>The SEC litigation releases contain varying amounts of information on the trades that were classified as illegal and prosecuted. Some litigation releases include a high level of detail including the price and volume of trades along with precise timestamps. Other litigation releases are limited and provide no information on price or volume and only a general timeframe of when the offending trades were placed. Even after this supplementation of SEC data, only a minority of prosecuted insider trading cases contained sufficient detail to facilitate matching to corresponding market intraday trading windows.

<br>***Theoretical Data Issues***
<br>Dependence on prosecuted cases of illegal informed trading also introduces a theoretical challenge to the validity of research. There does not yet exist a published method to identify the actual extent of illegal informed trading in financial markets. Since prosecuted cases only represent a minimum bound of an unobservable quantity, actual illegal informed trading could be many multiples of the extent identified via prosecutions. A model comparison approach is outlined in the methodology section which attempts to identify whether actual inside trading may plausibly be greater than merely the prosecuted cases on announcement days.

<br>***TAQ Data Issues***
<br>Using tick-level trade data from TAQ in this report created significant computational challenges. Even with use of the UTS Cluster to increase processing speed, script execution times were lengthy. This inhibited the use of a larger sample of insider trading cases that may have improved model performance.

<br>***Data-Matching and Input Variables***
<br>The entire proprietary database included 440 prosecuted insider trades with sufficient identifying information on volume and time of trades to match to underlying trade data from TAQ. For this report I selected only a subsample of 28 of these trades for proof-of-concept.

<br><center>**Filtering**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Filtering.JPG?raw=true" width="600"/>
<br></center>
<br>To facilitate data matching and model training, all TAQ data was aggregated into 15-minute windows. Each 15-minute window stored the total trading volume, average stock price and average trade size of the tick-level trading data during the window. Two dummy variables were added to each window; one recording whether any cases of insider trading occurred during that trading window (based on matching from the proprietary database) and another recording whether the company announcement associated with the prosecuted insider case had been released to the market during or before that trading window. The purpose of the second dummy variable was to facilitate exclusion of trading windows after public announcements since trading will be abnormal in reaction to the announcement and not representative of inside trading as material information is now public. 
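
<br>The aggregation step described above can be sketched as follows. This is a minimal illustration and not the code used for the report: the `(timestamp, price, size)` tick layout, the function names and the sample values are all hypothetical, and the "average stock price" is assumed here to be a simple (not volume-weighted) mean of trade prices.

```python
from collections import defaultdict
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=15)


def window_start(ts):
    """Floor a timestamp to the start of its 15-minute window."""
    return ts.replace(minute=ts.minute - ts.minute % 15, second=0, microsecond=0)


def aggregate_ticks(ticks, insider_times, announcement_time):
    """Aggregate (timestamp, price, size) ticks into 15-minute windows.

    Windows that contain no trades produce no row.
    """
    buckets = defaultdict(list)
    for ts, price, size in ticks:
        buckets[window_start(ts)].append((price, size))
    insider_windows = {window_start(t) for t in insider_times}

    rows = []
    for start in sorted(buckets):
        trades = buckets[start]
        total_volume = sum(size for _, size in trades)
        rows.append({
            "window_start": start,
            "total_volume": total_volume,
            # Assumption: simple mean of trade prices in the window.
            "avg_price": sum(price for price, _ in trades) / len(trades),
            "avg_trade_size": total_volume / len(trades),
            # Dummy 1: a prosecuted insider trade falls inside this window.
            "insider": int(start in insider_windows),
            # Dummy 2: the announcement was released during or before this window.
            "announced": int(announcement_time < start + WINDOW),
        })
    return rows


# Hypothetical ticks for one equity on one announcement day.
ticks = [
    (datetime(2019, 1, 2, 10, 1), 10.0, 100),
    (datetime(2019, 1, 2, 10, 14), 11.0, 300),
    (datetime(2019, 1, 2, 10, 16), 12.0, 200),
]
windows = aggregate_ticks(
    ticks,
    insider_times=[datetime(2019, 1, 2, 10, 5)],
    announcement_time=datetime(2019, 1, 2, 10, 20),
)
```

Rows with `announced == 1` can then be dropped before model training, which is the purpose of the second dummy variable.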

<br>For each sliding window, the percentage change in total trading volume, average trading price and average trade size from the prior trading window was then calculated. Percentage change is more appropriate than absolute change as different equities will naturally have different price levels and trading volumes, so the use of percentage change represents a more normalised approach to compare trends. Whilst there are many potential independent variables that could be utilised in a machine learning model, these were considered to be variables that could plausibly identify abnormal trading periods whilst also being feasible to implement given the time constraints of this assignment.
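
<br>As a rough sketch of the percentage-change calculation, the helper below computes the change of each window relative to the prior window. The function name and sample volumes are hypothetical, and the handling of the first window (which has no prior window) and of a zero prior value is an assumption, since the report does not describe it.

```python
def pct_change(values):
    """Percentage change of each value from the previous one.

    Returns None where the change is undefined: the first window has no
    prior window, and a prior value of zero cannot be divided by.
    """
    if not values:
        return []
    changes = [None]
    for prev, curr in zip(values, values[1:]):
        changes.append(None if prev == 0 else (curr - prev) / prev * 100.0)
    return changes


# Hypothetical total trading volumes for three consecutive windows.
volume_changes = pct_change([400, 200, 300])
```

The same helper would be applied separately to the average price and average trade size series of each equity.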

<br>***Restricting Trading Hours***
<br>After creating aggregated trading windows for insider trading days and clean data days, the trading windows were then filtered to omit time periods before 9.45am and after 4pm each day. 9.30am – 4pm are the standard trading hours of the New York Stock Exchange. However, limited trading can still occur outside of this period. Furthermore, the first 15 minutes of ordinary trading (9.30-9.45am) often exhibits exaggerated trading features since it captures the market’s reaction to overnight news flow. These abnormal trading periods can be highly asynchronous across different equities. Therefore, the time-of-day filter removed typically anomalous trading windows that would have inhibited training of machine learning models.
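
<br>A minimal sketch of this filter is shown below, assuming each window is identified by its start time in exchange-local time. The names and the example date are hypothetical.

```python
from datetime import datetime, time, timedelta

SESSION_START = time(9, 45)  # skip the 9.30-9.45am opening window
SESSION_END = time(16, 0)    # standard NYSE close


def in_core_session(start):
    """True if a 15-minute window starting at `start` lies within 9.45am-4pm."""
    return SESSION_START <= start.time() < SESSION_END


# Every possible 15-minute window start on a hypothetical trading day.
day = datetime(2019, 1, 2)
all_starts = [day + timedelta(minutes=15 * i) for i in range(96)]
kept = [s for s in all_starts if in_core_session(s)]
```

Under these assumptions, at most 25 windows per equity per day (9.45am through to the window starting at 3.45pm) survive the filter.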

<br>***Data Insights***
<br>The distribution of input variables used in the model, conditioned against whether they occurred during a prosecuted inside trading period, are presented below.

<br><center>**Conditional Distributions of Input Variables**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Conditional%20Dist%20of%20Independent%20Variables.jpeg?raw=true" width="800"/>
<br></center>
<br>As observed in the above plot, there are minimal differences between the conditional distributions of each variable. The percentage change in price during a trading window seems to carry the most informational content of the indicators, with periods of insider trading being associated with price increases.

## Methodology

A neural network was implemented for anomaly detection throughout this report. Neural networks are well-suited to numerical data where they are able to capture complex, non-linear relationships. As a result, neural networks represent an ideal machine learning technique to implement for anomaly detection; namely the identification of abnormal trading periods that exhibit features representative of the presence of inside traders.

<br>***Data Apportionment***
<br>Data was apportioned into three sub-samples; a training sample, a cross-validation sample, and a testing sample. This facilitated training of the neural networks by allowing inspection of the performance of models trained on the training data against the cross-validation set to ensure that the models were not overfitted to the training set. Finally, the test set represented an opportunity to evaluate the performance of the neural network on unseen data.

<br><center>**Apportionment of Sample Data**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Data%20Apportionment.JPG?raw=true" width="600"/>
<br></center>
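
<br>One way to produce such a three-way split is sketched below. It is illustrative only: the 60/20/20 fractions, the stratification by label and the function name are assumptions, not the apportionment used in this report.

```python
import random


def stratified_three_way_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Return (train, cv, test) index lists that preserve class proportions.

    `fractions` gives the train and cross-validation shares; the test set
    receives whatever remains of each class.
    """
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    train, cv, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(len(indices) * fractions[0])
        n_cv = round(len(indices) * fractions[1])
        train += indices[:n_train]
        cv += indices[n_train:n_train + n_cv]
        test += indices[n_train + n_cv:]
    return train, cv, test


# Labels shaped like the full sample: 28 insider windows out of 772.
labels = [1] * 28 + [0] * 744
train_idx, cv_idx, test_idx = stratified_three_way_split(labels)
```

Stratifying matters with so few positive windows: a purely random split could leave the cross-validation or testing sample with almost no insider trading observations.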

<br>***Re-Sampling Technique***
<br>The full sample included 772 trading windows, of which 28 (approximately 3.6%) represented periods of insider trading prosecuted by the SEC. Since the dataset was clearly imbalanced, the training set (and only the training set) was re-sampled to produce a balanced dataset for model training. Synthetic minority oversampling technique (**SMOTE**) and random oversampling of the minority class were both tested. Since random oversampling delivered better performance for cross-validation sets it was the re-sampling technique used in subsequent sections of this report.
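
<br>Random oversampling of the minority class can be sketched as follows. This is a plain-Python illustration under the assumption that insider trading windows carry label 1 and are the smaller class; the function name and toy data are hypothetical.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal.

    Assumes label 1 is the minority class and has at least one example.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    # Draw minority indices with replacement to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    keep = majority + minority + extra
    rng.shuffle(keep)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy training set: 8 clean windows and 2 insider windows.
rows = list(range(10))
labels = [0] * 8 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
```

Only the training sample should pass through this step; oversampling the cross-validation or testing samples would place duplicated minority rows in the evaluation data and inflate the scores.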

<br>***Model Implementation 1***
<br>The first model implemented involved training a neural network on a training set comprising the clean trading windows (t-10 to t-3) preceding the insider trading event together with only those announcement-day windows in which prosecuted insider trading occurred. In other words, trading windows on the announcement day that did not represent a period of SEC prosecuted insider trading were excluded from this implementation. The logic behind this implementation was to train a neural network on plausibly clean data and certain cases of insider trading. Including other trading windows on announcement days could have biased predictions since it is uncertain whether those windows were truly free of the influence of inside traders that were never caught.

<br><center>**Neural Network Implementation**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/ML%20Model.JPG?raw=true" width="800"/>
<br></center>

<br>Three input metrics, as outlined in the data section of this report, were fed into the neural network which would then learn to make classifications based on these inputs as to whether a trading window represented insider trading or not.

<br>Several configurations of hidden layers and nodes were trialled. A (5,5) network performed best of those trialled and has been utilised in the final version of each neural network.

<br>Both stochastic gradient descent (**SGD**) and Adaptive Moment Estimation (**ADAM**) optimisation were tested. Surprisingly, ADAM resulted in significantly better classification predictions than SGD and was the optimisation approach selected for final versions of each neural network.
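
<br>For illustration, a comparison of this kind could be set up as below. The sketch assumes scikit-learn's `MLPClassifier`, which this report does not confirm as the library used, reads (5,5) as two hidden layers of five nodes, and runs on synthetic data standing in for the three input metrics, so its scores say nothing about the real dataset.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for the three inputs (percentage change in volume,
# price and trade size); the labelling rule is arbitrary.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0.8).astype(int)
X_train, y_train = X[:300], y[:300]
X_cv, y_cv = X[300:], y[300:]

scores = {}
for solver in ("adam", "sgd"):
    clf = MLPClassifier(
        hidden_layer_sizes=(5, 5),  # assumed meaning of the (5,5) network
        solver=solver,
        max_iter=100,               # cap on training iterations (epochs)
        random_state=0,
    )
    clf.fit(X_train, y_train)  # may warn that training has not converged
    scores[solver] = f1_score(y_cv, clf.predict(X_cv), zero_division=0)
```

Repeating the loop over several `max_iter` values and keeping the setting with the highest cross-validation F1-Score is one simple way to choose how long to train.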

<br>***Model Implementation 2***
<br>Model Implementation 2 involved an identical approach to model implementation one but with the inclusion of trading windows from the announcement day which had not been flagged as insider trading periods by the SEC. These additional data points were only included in the cross-validation and testing data samples. The purpose was to compare the classifications of the second model implementation with the classifications of the first model implementation and observe whether the neural network identifies suspicious trading periods in excess of prosecuted insider trading periods. In doing so, the model would be implying that there were more insider traders in the market than just those which were caught and prosecuted by the SEC.

<br>***Evaluation criteria***
<br>In order to evaluate the performance of each model, the F1-Score was used as the primary evaluation tool. Accuracy would be an inappropriate evaluation metric to use since the dataset is highly imbalanced: a model could perform very well on accuracy by simply classifying every data point as the majority class. By contrast, the F1-Score is the harmonic mean of precision and recall, so a model that never predicts insider trading cannot score well on it. In addition, confusion matrixes were inspected to gain a more in-depth understanding of how models performed at classifying cases of insider trading.
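
<br>The weakness of accuracy on imbalanced data can be shown with a small worked example. The sketch assumes the F1-Score is computed for the insider trading class (label 1) and uses a hypothetical classifier that labels every one of the 772 windows as clean.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn), treating label 1 (insider trading) as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + fn + tn)


def f1(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN), defined as 0 when the denominator is 0."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# 28 insider windows among 772, and a classifier that never flags anything.
y_true = [1] * 28 + [0] * 744
always_clean = [0] * 772
acc = accuracy(y_true, always_clean)  # 744 / 772, roughly 0.964
f1_value = f1(y_true, always_clean)   # 0.0, since no insider window is found
```

The trivial classifier is right about 96% of the time yet detects nothing, which is why the F1-Score is the more informative headline metric here.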

## Evaluation

***Implementation 1 – ADAM***
<br><center>**Implementation 1 - Optimisation Results (ADAM)**
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model1_Adam.JPG?raw=true" width="600"/>
<br></center>
<br>The above table shows the results of training the neural network for model implementation one, optimised with ADAM. In this implementation the neural network is optimised against the cross-validation set after 100 iterations of training. This mark is where the cross-validation set achieves the highest F1-Score. Beyond this point the neural network begins to overfit and becomes less applicable to out-of-sample data. 

<br>An F1-Score of 63% is far from perfect. However, given that insider trading observations represented only approximately 3.6% of the full data sample, and there were minimal differences in the conditional distributions of input metrics, this is an understandable result. There is a longstanding history of unsuccessful attempts to systematically identify illegal informed trading. Given this context, the cross-validation F1-Score is a relatively positive indication that with extended time a more sophisticated neural network could be developed with a reasonably strong ability to predict illegal insider trading.

<br>Below are standard and normalised confusion matrixes for performance of the neural network after 100 iterations of training for the cross-validation set. From the confusion matrix on the left we can observe that the neural network rarely made insider trading classifications but had a high success rate (75%) when it did.

<br><center>**Implementation 1 - Cross-Validation Confusion Matrixes (ADAM)**
<br>
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model1_CM_ADAM.JPG?raw=true" width="800"/>
<br></center>

<br>***Implementation 1 – SGD***
<br><center>**Implementation 1 - Optimisation Results (SGD)**
<br>
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model1_SGD.JPG?raw=true" width="600"/>
<br></center>
<br>The above table shows the results of training the neural network for model implementation one, optimised with SGD. In this implementation the neural network is optimised against the cross-validation set after 250 iterations of training; a greater number of iterations than the ADAM optimisation approach. The cross-validation F1-Scores under SGD are consistently lower than ADAM. The confusion matrix below shows that the SGD approach and ADAM approach make the same number of correct insider trading predictions. However, the SGD approach also makes a greater number of false insider trading predictions which is why it produces a lower F1-Score.

<br><center>**Implementation 1 - Cross-Validation Confusion Matrixes (SGD)**
<br>
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model1_CM_SGD.JPG?raw=true" width="800"/>
<br></center>

<br>***Implementation 1 – Testing***
<br>Since the neural network optimised using ADAM outperformed the SGD version, the ADAM version was used against the testing sample of data. Against the testing set, the neural network delivered an accuracy score of 88% and an F1-Score of 47%. However, the left confusion matrix below shows that when insider trading was present the network did a reasonably good job of identifying it (67% success rate, i.e. recall).

<br><center>**Implementation 1 - Testing Confusion Matrixes (ADAM)**
<br>
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model1_CM_ADAM_Testing.JPG?raw=true" width="800"/>
<br></center>

<br>Overall, implementation one has shown us that a neural network based on the input metrics used has some ability to detect incidences of illegal insider trading. However, it is clear that significant further work is required before the model could be relied upon by market regulators to catch illegal insider trading.

<br>***Implementation 2 – Detection of Additional Insider Trading***
<br>Implementation 2 will, by design, deliver a lower F1-Score than Implementation 1 for testing data. The F1-Score for Implementation 2 on the testing data is only 31%. The lower F1-Score arises because the testing data now includes more observations from the announcement day which are labelled as free of insider trading because they do not represent prosecuted SEC cases. However, many of these cases could be incidences of insider trading that were never caught.

<br><center>**Implementation 2 - Testing Confusion Matrixes (ADAM)**
<br>
<br><img src="https://github.com/11223548/UTS_ML2019_Main/blob/master/Model2_CM_ADAM_Testing.JPG?raw=true" width="800"/>
<br></center>

<br>By comparing the left confusion matrix above with the left confusion matrix from *Implementation 1 – Testing*, we can observe that the same number of SEC prosecuted insider trading cases have been identified. However, Implementation 2 is now also predicting more cases of insider trading than before. The additional data from the announcement day contains a number of periods believed by the model to be suspicious.

## Conclusion

The ability to systematically detect illegal informed trading in equity markets has remained elusive in academia despite decades of research by some of the world’s most accomplished researchers into plausible identification metrics. As a result, this report attempted to investigate a topic that was clearly ambitious. Whilst the findings of this report show that neural networks have some potential to identify illegal informed trading in US equity markets, the method implemented still requires significant improvements before it can be relied upon in practice. To develop and implement a sophisticated machine learning algorithm that could plausibly detect illegal informed trading in equity markets would take at least one year of full-time research (i.e. it is well beyond the scope of this course).

<br>There are many possible improvements that could be made to the machine learning approach implemented in this report. The characteristics of each equity are unique as the valuations of each company are subject to idiosyncratic risks. As a result, any algorithm which attempts to aggregate characteristics across many different equities will be less effective at determining what is an anomaly for a particular equity. Given sufficient time, a more plausible approach to systematic detection of illegal informed trading would be to implement a standardised machine learning model for each distinct equity. This standardised model could be trained to recognise what represents normal trading vs. abnormal trading for the specific equity examined and would be less impacted by aggregation induced noise. As a result, where unusual trading was identified preceding a company announcement there would be a reliable basis for further investigation by regulators.

<br>Although equity valuations are influenced by idiosyncratic factors, broad market factors can also have a significant influence on market trading. Whilst this report utilised the change in price, volume and trade size as independent variables, a more robust approach would be to use the abnormal change in these metrics. The abnormal change is an adjusted version of each metric which aims to account for the influences of significant market or industry news upon a specific company. For example, the equity of a particular company may surge if a central bank announces that they are going to reduce interest rates. In this case, there would be a large change in the typical trading conditions for this company which is not subsequent to any company-specific announcement. An algorithm which considers the change in various metrics, rather than the abnormal change in metrics, may falsely flag the period as an insider trading period since the model observed significant changes in trading conditions despite no company announcements.
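
<br>One simple way to express an abnormal change is to subtract a benchmark's change over the same window, as sketched below. This market-adjusted definition, the optional `beta` scaling and all names and values are assumptions for illustration; the report does not specify how the adjustment would be made.

```python
def abnormal_change(stock_pct, benchmark_pct, beta=1.0):
    """Stock percentage change minus beta times the benchmark's change.

    With beta = 1.0 this is a plain market-adjusted change.
    """
    if len(stock_pct) != len(benchmark_pct):
        raise ValueError("series must cover the same trading windows")
    return [s - beta * b for s, b in zip(stock_pct, benchmark_pct)]


# Hypothetical percentage price changes over three windows; the second
# window coincides with a market-wide surge (e.g. an interest rate cut).
stock_pct = [0.5, 4.0, -0.25]
index_pct = [0.25, 3.5, -0.25]
abnormal = abnormal_change(stock_pct, index_pct)
```

In this toy case the 4.0% jump shrinks to an abnormal change of 0.5% once the market-wide move is removed, so the window would look far less suspicious to a model trained on abnormal changes.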

<br>Another significant improvement to the method adopted would be to obtain market tick data for the order book of each equity as a supplementary data source for training the model. For example, a neural network might observe that, in periods covered by prosecuted cases of insider trading, at-market orders are submitted more often than limit orders. This would indicate that traders are trying to enter market positions with a high sense of urgency and could be a strong indicator of illegal informed trading.
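
<br>A feature of this kind could be derived from tick data as the share of submitted orders that are at-market orders. The sketch below is illustrative only: the record layout, the `type` field and its `"market"`/`"limit"` values are assumptions, since no order-book data was available for this report.

```python
def market_order_share(orders):
    """Fraction of submitted orders that are market orders (0.0 if there are none)."""
    if not orders:
        return 0.0
    n_market = sum(1 for order in orders if order["type"] == "market")
    return n_market / len(orders)

# Hypothetical order-book messages for two trading windows of the same equity.
baseline_window = [{"type": "limit"}] * 8 + [{"type": "market"}] * 2
pre_announcement_window = [{"type": "limit"}] * 3 + [{"type": "market"}] * 7

baseline_share = market_order_share(baseline_window)
pre_announcement_share = market_order_share(pre_announcement_window)

# A rise in the share of market orders suggests more urgent order submission.
urgency_change = pre_announcement_share - baseline_share
print(baseline_share, pre_announcement_share)
```

The change in this share between windows could then be supplied to the model alongside the existing price, volume and trade size inputs.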

<br>Due to time constraints, the neural network implemented in this report only used a limited selection of independent variables as inputs, namely the change in trading volume, price and trade size between trading windows. Many more metrics could be calculated and fed into a machine learning model. Some examples include measures of market liquidity, momentum and autocorrelation.
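
<br>The sketch below shows one possible definition for each of the three example metrics: momentum as the cumulative return over a window, lag-1 autocorrelation of returns, and the Amihud illiquidity ratio (mean absolute return per dollar traded) as the liquidity measure. These particular definitions, and the price and dollar-volume figures, are assumptions chosen for illustration.

```python
def simple_returns(prices):
    """Period-on-period simple returns from a price series."""
    return [prices[t] / prices[t - 1] - 1.0 for t in range(1, len(prices))]

def momentum(prices):
    """Cumulative return over the whole window."""
    return prices[-1] / prices[0] - 1.0

def lag1_autocorrelation(returns):
    """Sample autocorrelation of returns at lag 1 (0.0 if returns are constant)."""
    mu = sum(returns) / len(returns)
    num = sum((returns[t] - mu) * (returns[t - 1] - mu)
              for t in range(1, len(returns)))
    den = sum((r - mu) ** 2 for r in returns)
    return num / den if den else 0.0

def amihud_illiquidity(returns, dollar_volumes):
    """Mean absolute return per dollar traded; higher values mean less liquid."""
    return sum(abs(r) / v for r, v in zip(returns, dollar_volumes)) / len(returns)

# Hypothetical closing prices and dollar volumes for one trading window.
prices = [100.0, 102.0, 101.0, 103.0, 104.0]
dollar_volumes = [2.0e6, 1.5e6, 2.5e6, 2.0e6]  # one entry per return

returns = simple_returns(prices)
features = {
    "momentum": momentum(prices),
    "autocorrelation": lag1_autocorrelation(returns),
    "illiquidity": amihud_illiquidity(returns, dollar_volumes),
}
print(features)
```

Each value in `features` could be appended to the input vector for a trading window in the same way as the existing change-based variables.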

<br>This report explored a significant challenge in finance and showed that a simple machine learning model represents a plausible method for detecting illegal informed trading. However, it is clear that the method implemented in this report could be significantly improved given sufficient time. By implementing the extensions recommended in this report, it may finally become possible to robustly detect incidences of illegal informed trading in financial markets, accomplishing an endeavour that has eluded academics and regulators alike for decades.



## Ethical Considerations

In conducting the research presented in this report, a Kantian duty-based approach to ethical considerations was adopted. This ethical framework was seen as more appropriate than a utilitarian framework for the context of this project because confidential data was used. For example, a utilitarian approach would have prompted me to provide public access to the full confidential database of SEC-prosecuted insider trading cases so that future researchers could use the data and more easily attempt to develop superior models for illegal informed trade detection. However, in providing the dataset I would have breached my duty not to publicly redistribute the confidential database. Some of the key ethical considerations in this research report are detailed below.

<br>***Use of Proprietary Data Set***
<br>The SEC database used in the analysis throughout this report is proprietary and is not permitted to be redistributed. As a result, only a small subset of the SEC database was used for analysis in this report, which was deemed sufficient to illustrate proof-of-concept. Furthermore, information in the SEC database was anonymised prior to use in this report so that identifying features such as the names of the inside traders and the affected companies could not be recognised by readers of this report.

<br>***Study of Humans***
<br>The original SEC database used in this report contained the names and details of individuals who had been convicted of insider trading. In many cases, this identifying information is not easily accessible. Anonymisation of the data has mitigated the risk that individuals prosecuted for insider trading will receive a resurgence of media attention and backlash for activities that they may have committed many years in the past.

<br>***Model Misuse and the Risk of False Positives***
<br>The model presented in this report can detect trading anomalies that exhibit highly similar characteristics to the trades in prosecuted insider trading cases. It is important to note that trades exhibiting abnormal traits similar to known examples of insider trading are not guaranteed to be incidences of insider trading that were previously undetected by the SEC. Therefore, this tool should be perceived by any user as a filter for further investigation rather than a foolproof detection model upon which traders could be prosecuted for insider trading. Any anomalous trades identified should still be followed by a thorough manual investigation by market regulators rather than relied upon as stand-alone grounds for conviction. If caution is not used when implementing the model, the risk is that financial market regulators could prosecute traders who were falsely identified by the model.
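
<br>The scale of this risk can be illustrated with a short base-rate calculation. The figures below are purely hypothetical and are not estimates of this model's performance: they assume that 1% of trading windows involve illegal informed trading, that the model flags 90% of those, and that it also flags 5% of legitimate windows.

```python
def precision_of_flags(n_windows, prevalence, sensitivity, false_positive_rate):
    """Share of flagged windows that truly involve illegal informed trading."""
    illegal = n_windows * prevalence
    legal = n_windows - illegal
    true_positives = illegal * sensitivity
    false_positives = legal * false_positive_rate
    return true_positives / (true_positives + false_positives)

# Hypothetical inputs only; none of these rates are measured in this report.
precision = precision_of_flags(
    n_windows=10_000,
    prevalence=0.01,
    sensitivity=0.90,
    false_positive_rate=0.05,
)
print(f"Share of flags that are genuine: {precision:.1%}")
```

Under these assumptions 90 genuine windows and 495 legitimate windows are flagged, so fewer than one in six flags is genuine, which is why flagged trades should only ever trigger a manual investigation.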

<br>***Nefarious Model Use***
<br>It is possible that bad actors who intend to conduct illegal informed trading in the future could download the model source code and the data underlying this report and appropriate the model for their own use. For example, by understanding what the model identifies as likely cases of illegal informed trading, individuals could learn to fine-tune their future insider trading activities to avoid detection by machine learning models. To limit the potential for nefarious model use, only an anonymised subset of the full SEC database, restricted to observations from many years ago, was publicly uploaded and used in constructing this report. This provides less opportunity for bad actors to develop a version of the anomaly detection algorithm that is robust across time periods and varying market conditions.


## References

[1] V. Acharya and T. Johnson, “More Insiders, More Insider Trading: Evidence from Private-Equity Buyouts”, *Journal of Financial Economics*, vol. 98, no. 3, pp. 500-523, 2010.

<br>[2] K. Ahern, “Do Proxies for Informed Trading Measure Informed Trading? Evidence from Illegal Insider Trades”, *Working Paper*, 2018.

<br>[3] K. Back, C. Cao and G. Willard, “Imperfect Competition Among Informed traders”, *The Journal of Finance*, vol. 55, no. 5, pp. 2117-2155, 2000.

<br>[4] A. Banerjee and W. Eckard, “Why Regulate Insider Trading? Evidence from the First Great Merger Wave (1897-1903)”, *American Economic Review*, vol. 91, no. 5, pp. 1329-1349, 2001.

<br>[5] U. Bhattacharya, “Insider Trading Controversies: A Literature Review”, *Annual Review of Financial Economics*, vol. 6, pp. 385-403, 2014.

<br>[6] U. Bhattacharya and H. Daouk, “The World Price of Insider Trading”, *The Journal of Finance*, vol. 57, pp. 75-108, 2002.

<br>[7] A. Bris, “Do Insider Trading Laws Work?”, *European Financial Management*, vol. 11, no. 3, pp. 267-312, 2005.

<br>[8] D. Carlton and D. Fischel, “The Regulation of Insider Trading”, *Stanford Law Review*, vol. 35, no. 5, pp. 857-895, 1983.

<br>[9] S. Chakravarty and J. McConnell, “An analysis of prices, bid/ask spreads, and bid and ask depths surrounding Ivan Boesky’s illegal trading in Carnation’s stock”, *Financial Management*, vol. 26, no. 2, pp. 18-34, 1997.

<br>[10] S. Chakravarty and J. McConnell, “Does Insider Trading Really Move Stock Prices?”, *Journal of Financial and Quantitative Analysis*, vol. 34, no. 2, pp. 191-209, 1999.

<br>[11] Q. Cheng and K. Lo, “Insider Trading and Voluntary Disclosures”, *Journal of Accounting Research*, vol. 44, no. 5, pp. 815-848, 2006.

<br>[12] P. Collin-Dufresne and V. Fos, “Do Prices Reveal the Presence of Informed Trading?”, *The Journal of Finance*, vol. 70, no. 4, pp. 1555-1582, 2015.

<br>[13] J. Du and S. Wei, “Does Insider Trading Raise Market Volatility?”, *The Economic Journal*, vol. 114, no. 498, pp. 916-942, 2004.

<br>[14] A. Ellul and M. Panayides, “Do Financial Analysts Restrain Insiders’ Informational Advantage?”, *Journal of Financial and Quantitative Analysis*, vol. 53, no. 1, pp. 203-241, 2018.

<br>[15] S. Emerson, R. Kennedy, L. O’Shea, and J. O’Brien, “Trends and Applications of Machine Learning in Quantitative Finance”, *8th International Conference on Economics and Finance Research*, 30 May, 2019.

<br>[16] G. Feng, S. Giglio, and D. Xiu, “Taming the Factor Zoo”, *Technical report*, University of Chicago, 2017.

<br>[17] R. Fishe and M. Robe, “The Impact of Illegal Insider Trading in Dealer and Specialist Markets: Evidence from a Natural Experiment”, *Journal of Financial Economics*, vol. 71, pp. 461-488, 2004.

<br>[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Nets”, *Neural Information Processing Systems*, 2014.

<br>[19] D. Guercio, E. Odders-White and M. Ready, “The Deterrence Effect of SEC Enforcement Intensity on Illegal Insider Trading”, *Working Paper*, 2013.

<br>[20] C. Holden and A. Subrahmanyam, “Long-Lived Private Information and Imperfect Competition”, *The Journal of Finance*, vol. 47, no. 1, pp. 247-270, 1992.

<br>[21] M. Kacperczyk and E. Pagnotta, “Becker Meets Kyle: Inside Insider Trading”, *Working Paper*, 2019.

<br>[22] M. Kacperczyk and E. Pagnotta, “Chasing Private Information”, *The Review of Financial Studies* (Forthcoming), 2019.

<br>[23] A. Kyle, “Continuous Auctions and Insider Trading”, *Econometrica*, vol. 53, no. 6, pp. 1315-1335, 1985.

<br>[24] R. Kaniel and H. Liu, “So What Orders do Informed Traders Use?”, *The Journal of Business*, vol. 79, no. 4, pp. 1867-1913, 2006.

<br>[25] M. Lopez De Prado, “Mathematics and Economics: A Reality Check”, *Journal of Portfolio Management*, vol. 43, no. 1, pp. 5-8, 2016.

<br>[26] H. Manne, “Insider Trading and the Stock Market”, *New York: The Free Press*, Collier Macmillan, 1966.

<br>[27] L. Meulbroek, “An Empirical Analysis of Illegal Insider Trading”, *The Journal of Finance*, vol. 47, no. 5, pp. 1661-1699, 1992.

<br>[28] P. Milgrom and N. Stokey, “Information, Trade and Common Knowledge”, *Journal of Economic Theory*, vol. 26, no. 1, pp. 17-27, 1982.

<br>[29] S. Ryan, J. Tucker, and Y. Zhou, “Securitization and Insider Trading”, *The Accounting Review*, vol. 91, no. 2, pp. 649-675, 2016.

<br>[30] D. Taylor, “Insider trading ‘rife’ among company directors on the ASX: ANU study finds”, *ABC News*, website, 23 September 2019, viewed 23 September 2019, <https://www.abc.net.au/news/2019-09-23/insider-trading-rife-among-asx-company-directors-study-finds/11537080/>.