Machine learning, and natural language processing in particular, is becoming increasingly prominent, and the analysis of textual data is becoming correspondingly more valuable. In financial markets, analysts have traditionally extracted information from numerical data and paid little attention to text because of technological limitations. As a result, many signals remain hidden in that text, waiting to be uncovered.
The goal of this project is to use deep learning and natural language processing to capture financial market signals, so that we can predict companies' overall operations and future outlook more accurately. The data source is the quarterly earnings call transcripts of publicly traded companies. Each transcript consists of two parts: prepared remarks from the management team, and a Q&A session in which Wall Street analysts ask questions.
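The two-part structure can be made explicit in code. The following is only a sketch under stated assumptions: the `Transcript` class, the `split_transcript` helper, and the "Questions and Answers" heading used as a boundary marker are all hypothetical names chosen for illustration, not details confirmed by any particular transcript source.

```python
from dataclasses import dataclass, field

# Hypothetical heading that separates the two parts; real transcripts may differ.
QA_MARKER = "questions and answers"


@dataclass
class Transcript:
    """One earnings call, split into its two parts."""
    prepared_remarks: list = field(default_factory=list)
    qa_session: list = field(default_factory=list)


def split_transcript(paragraphs, marker=QA_MARKER):
    """Assign each paragraph to prepared remarks or Q&A, based on the marker line."""
    transcript = Transcript()
    in_qa = False
    for text in paragraphs:
        # The first paragraph equal to the marker (ignoring case and a trailing colon)
        # switches the parser into the Q&A part; the marker itself is not stored.
        if not in_qa and text.strip().lower().rstrip(":") == marker:
            in_qa = True
            continue
        target = transcript.qa_session if in_qa else transcript.prepared_remarks
        target.append(text)
    return transcript


example = split_transcript([
    "Good afternoon and welcome to the call.",
    "Revenue grew compared with the prior quarter.",
    "Questions and Answers:",
    "Could you comment on margins?",
])
```

Keeping the two parts separate makes it possible to analyze management's prepared language and the less scripted analyst exchange independently.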
To obtain the data, I built an automated web scraper that combines HTML parsing with API calls to collect related financial news, financial reports, analyst reports, and earnings call transcripts for every publicly traded company, year by year. The main source is the Motley Fool website, which is publicly accessible. The retrieved transcripts are organized into paragraphs. From them, I also extracted each company's name, ticker, and call date, as well as the participants' names, positions, and affiliations.
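The parsing step can be illustrated with a minimal sketch that uses only the Python standard library. Everything about the page layout here is an assumption: the sample HTML, the "Company (EXCHANGE: TICKER)" header line, and the "Name, Position, Affiliation: text" speaker format are hypothetical and do not describe the actual markup of the Motley Fool site. Fetching the pages and parsing the call date are left out.

```python
import re
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the whitespace-normalized text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None means we are currently outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


# Assumed header format: "Company Name (EXCHANGE: TICKER)"
TICKER_RE = re.compile(r"^(?P<company>.+?)\s*\((?P<exchange>[A-Z]+):\s*(?P<ticker>[A-Z.]+)\)")
# Assumed speaker format: "First Last, Position[, Affiliation]: spoken text"
SPEAKER_RE = re.compile(r"^(?P<name>[A-Z][\w.'-]+(?: [A-Z][\w.'-]+)+), (?P<role>[^:]+): (?P<text>.+)$")


def parse_transcript(html):
    """Return company, ticker, participants, and paragraphs found in the HTML."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    record = {"company": None, "ticker": None, "participants": [],
              "paragraphs": parser.paragraphs}
    seen = set()
    for text in parser.paragraphs:
        header = TICKER_RE.match(text)
        if header and record["ticker"] is None:
            record["company"] = header.group("company")
            record["ticker"] = header.group("ticker")
            continue
        speaker = SPEAKER_RE.match(text)
        if speaker and speaker.group("name") not in seen:
            seen.add(speaker.group("name"))
            # Split the role once: the first part is the position, the rest the affiliation.
            position, _, affiliation = speaker.group("role").partition(", ")
            record["participants"].append({
                "name": speaker.group("name"),
                "position": position,
                "affiliation": affiliation or None,
            })
    return record


SAMPLE_HTML = """
<h1>Example transcript page</h1>
<p>Example Corp (NASDAQ: EXMP)</p>
<p>Jane Doe, Chief Executive Officer: Thank you, and good afternoon everyone.</p>
<p>John Roe, Analyst, Sample Securities: Could you comment on margins?</p>
"""
record = parse_transcript(SAMPLE_HTML)
```

Regular expressions like these are brittle against layout changes, so a real scraper would need per-source patterns and validation of the extracted fields.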
The difficult part, and the direction I propose for this fellowship, is analyzing this data effectively. Unlike common sentiment data sets such as Twitter posts, the language in earnings call transcripts is very measured, with few dramatic or emotional words. Traditional training sets and labels are therefore not enough to produce solid results. I tried labeling with Word2vec, which ultimately did not work. I also tried APIs such as MonkeyLearn and IBM Watson, which return reasonable scores; however, they are black boxes to me, because their approaches and labeling methods are not disclosed. I would very much like to learn, and to receive guidance, on how to improve my current unsupervised approach so that it produces more accurate and reliable results.
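To make the contrast with a black-box API concrete, here is a minimal sketch of a fully transparent, dictionary-based tone score for a single paragraph. It is not the approach described above and is not proposed as a solution; the word lists are hypothetical and deliberately tiny, and the `score_paragraph` function and its tone formula are illustrative assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists, for illustration only.
POSITIVE = {"growth", "improved", "strong", "exceeded", "record"}
NEGATIVE = {"decline", "weak", "headwinds", "impairment", "shortfall"}

TOKEN_RE = re.compile(r"[a-z']+")


def score_paragraph(text):
    """Count lexicon hits and return a tone score in the range [-1, 1]."""
    tokens = TOKEN_RE.findall(text.lower())
    counts = Counter(tokens)
    pos = sum(counts[word] for word in POSITIVE)
    neg = sum(counts[word] for word in NEGATIVE)
    hits = pos + neg
    # Tone is the net share of positive hits; 0.0 when no lexicon word appears.
    tone = (pos - neg) / hits if hits else 0.0
    return {"positive": pos, "negative": neg, "tone": tone, "tokens": len(tokens)}


result = score_paragraph(
    "Revenue growth was strong and exceeded guidance, despite currency headwinds."
)
```

Every score produced this way can be traced back to specific words, which is exactly what the hosted APIs do not offer. The weakness is the one described above: measured corporate language may contain few lexicon hits, so many paragraphs would score 0.0.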