ist-736

This is the final project for a 10-week course in text mining. The code, visualizations, and report were improved after the course ended, then submitted as part of a portfolio requirement for graduation. The project addresses several questions, chief among them: can market sentiment predict the stock market?

Analysis

Several techniques were applied to address this question:

exploratory analysis: topic modeling determines which stock to study
sentiment analysis: each text corpus is normalized into sentiment scores
granger analysis: test which sentiment scores are significant predictors of the stock index
timeseries analysis: compare LSTM and ARIMA models on the sentiment and stock series

While the main focus of the study was the comparison between timeseries models, classification analysis was also performed, with signal analysis as its basis:

signal analysis: identify index points that exceed a threshold
classification analysis: a classifier is trained on the TF-IDF text corpus (X) against the signal results (y)
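
The classification step above can be sketched with scikit-learn as follows. This is a minimal illustration, not the project's actual pipeline: the sample tweets, the labels, the `k=5` feature count, and the choice of `MultinomialNB` as the classifier are all assumptions made for the example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical corpus (X) and signal labels (y): 1 = above the upper
# threshold, -1 = below the lower threshold, 0 = in between.
tweets = [
    "stocks rally as earnings beat expectations",
    "strong buy signal on tech shares today",
    "market surges after upbeat jobs report",
    "shares plunge on weak guidance",
    "selloff deepens as investors flee risk",
    "stocks tumble after disappointing earnings",
    "quiet session with little movement",
    "markets flat ahead of the fed meeting",
]
labels = [1, 1, 1, -1, -1, -1, 0, 0]

# TF-IDF vectorization, chi-squared feature reduction, then a classifier.
model = make_pipeline(
    TfidfVectorizer(),
    SelectKBest(chi2, k=5),  # keep the 5 highest-scoring terms (assumed value)
    MultinomialNB(),         # assumed classifier, chosen for simplicity
)
model.fit(tweets, labels)

prediction = model.predict(["shares plunge after weak earnings"])
print(prediction)
```

The chi-squared selector works here because TF-IDF features are non-negative, which `chi2` requires.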

In general, points exceeding the upper threshold were binned as 1, while points below the lower threshold were binned as -1. These bins provided the target vector (y) for classification against the TF-IDF corpus (X):

Note: the above animation was borrowed, and associated code was adjusted to meet the requirements for this study.
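
The binning described above can be sketched in plain Python. The thresholds here are assumed to be the trailing-window mean plus or minus a multiple of the trailing standard deviation. The function name `bin_signals`, the `window` and `threshold` parameters, and the sample values are hypothetical and are not taken from the project's code.

```python
from statistics import mean, stdev


def bin_signals(values, window=5, threshold=2.0):
    """Label each point 1, -1, or 0 against trailing-window thresholds."""
    labels = [0] * len(values)  # the first `window` points stay 0
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu, sigma = mean(history), stdev(history)
        upper = mu + threshold * sigma
        lower = mu - threshold * sigma
        if values[i] > upper:
            labels[i] = 1       # exceeds the upper threshold
        elif values[i] < lower:
            labels[i] = -1      # falls below the lower threshold
    return labels


# Hypothetical index values with one spike and one dip.
prices = [10.0, 10.1, 9.9, 10.0, 10.1, 12.0, 10.0, 10.1, 8.0, 10.0]
labels = bin_signals(prices)
print(labels)  # [0, 0, 0, 0, 0, 1, 0, 0, -1, 0]
```

Points between the two thresholds are labeled 0 in this sketch. How the study treated such points is described in the write-up rather than here.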

The exact details of the project can be reviewed in the associated write-up.docx, so the remaining sections of this document are kept succinct.

Dependencies

This project requires the following packages:

$ sudo pip install nltk \
    matplotlib \
    twython \
    quandl \
    scikit-learn \
    scikit-plot \
    statsmodels \
    seaborn \
    wordcloud \
    keras \
    numpy \
    h5py

Data

Two datasets were acquired via the Twython and Quandl APIs:

financial analyst tweets
stock market index/volume measures

Due to limitations of the Twitter API, only roughly 3200 tweets could be collected for a given user timeline, whereas the Quandl data has a much larger limit. This mismatch constrained the join between the two datasets: only a subset of the Quandl dataset was used in the analysis.
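
To illustrate why the join discards part of the stock data, the sketch below averages tweet sentiment per day and keeps only the dates present in both datasets. The dates, scores, and prices are made-up placeholder values, and the variable names are hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical per-tweet sentiment scores (short history, as with tweets).
tweet_scores = [
    (date(2019, 3, 1), 0.4),
    (date(2019, 3, 1), -0.2),
    (date(2019, 3, 4), 0.6),
]

# Hypothetical daily closing values (longer history, as with stock data).
stock_close = {
    date(2019, 2, 27): 100.0,
    date(2019, 2, 28): 101.0,
    date(2019, 3, 1): 102.0,
    date(2019, 3, 4): 103.0,
    date(2019, 3, 5): 104.0,
}

# Average the tweet scores for each day.
scores_by_day = defaultdict(list)
for day, score in tweet_scores:
    scores_by_day[day].append(score)
daily_sentiment = {d: sum(s) / len(s) for d, s in scores_by_day.items()}

# Inner join on date: stock rows without matching tweets are dropped.
joined = [
    (day, daily_sentiment[day], stock_close[day])
    for day in sorted(daily_sentiment.keys() & stock_close.keys())
]
for row in joined:
    print(row)
```

Of the five stock rows in this example, only the two with matching tweet dates survive the join.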

Execution

The original aspiration was to complete the codebase using the app-factory. Due to time constraints, the codebase was not expanded into an application. At minimum, the provided config-TEMPLATE.py must be copied to config.py in the same directory.

To study additional Twitter user timelines or Quandl stock indices, the copied config.py must contain registered API tokens for each service provider. To run the codebase with the choices made in this study, no API keys need to be pasted into the configuration. Instead, the unused configurations must be commented out, since only one analysis can be performed at a time. Moreover, the timeseries sentiment models (both ARIMA and LSTM) accept only one stock code at a time:

screen_name = [
    'jimcramer',
    'ReformedBroker',
    'TheStalwart',
    'LizAnnSonders',
    'SJosephBurns'
]
codes = [
    ('BATS', 'BATS_AAPL'),
##    ('BATS', 'BATS_AMZN'),
##    ('BATS', 'BATS_GOOGL'),
##    ('BATS', 'BATS_MMT'),
##    ('BATS', 'BATS_NFLX'),
##    ('CHRIS', 'CBOE_VX1'),
##    ('NASDAQOMX', 'COMP-NASDAQ'),
##    ('FINRA', 'FNYX_MMM'),
##    ('FINRA', 'FNSQ_SPY'),
##    ('FINRA', 'FNYX_QQQ'),
##    ('EIA', 'PET_RWTC_D'),
##    ('WFC', 'PR_CON_15YFIXED_IR'),
##    ('WFC', 'PR_CON_30YFIXED_APR')
]

This is largely due to a rapidly growing memory requirement from keeping multiple trained ARIMA models in memory. Should this codebase be extended into an application, this issue could resolve itself. Additional controls can be adjusted in the same config.py, including the number of epochs, LSTM cells, number of neurons (i.e. lstm_units), signal analysis threshold (i.e. classify_threshold), and TF-IDF feature reduction for classification (i.e. classify_chi2). After the dependencies are installed and the necessary changes are made, the script can be executed in a stepwise fashion:

$ pwd
/path/to/web-projects/ist-736
$ python app.py
