# Exploring the Stock Market with Supervised Machine Learning
### COGS 118A Final Project

## Group members

- David Thai
- Nicholas Nassour
- Daniel Byers
- Yu Pan
- Jordyn Ohashi

# Abstract 
Our goal is to predict whether to buy or sell stock from the S&P 500 based on various factors. To have ample data points, we chose data ranging from 1980 to 2022. The main factors we decided on were daily candlestick data, average news sentiment, gold rates, and oil prices. Since we wanted to predict whether or not to buy stock, we applied several binary classification algorithms. An algorithm was judged by how well it predicted buy or sell (1 and -1). A day was labeled buy if the average of the high and low was greater than zero and sell if it was less than zero; high and low were standardized with sklearn's StandardScaler before the average was calculated. Logistic regression was the best at predicting buy or sell, with an accuracy of 70%, meaning it predicted correctly 70% of the time. Its confusion matrix shows 810 false positives and 42 false negatives, so this model does best at not selling when it is not supposed to. The other models had similar false positive and false negative rates, so they do not provide anything more beneficial than our logistic regression model.



# Background

The stock market is sometimes considered to be a random process with unpredictable rises and falls. The belief is that there is no real way to predict its trends, because the ups and downs resemble a random walk and no accurate prediction can be made. [<sup>1</sup>](#fn1)
However, time and time again, people attempt to develop models that analyze current and past trends in the market to predict what their future trades should be. One such example is commonly referred to as the “Buffett indicator,” which compares the total value of the stock market to GDP to assert whether the market is undervalued or overpriced.[<sup>2</sup>](#fn2) Lacking expertise in the financial or economic sector, we believed that a machine learning model built only to analyze market trends might not be entirely reliable. Therefore, in our own attempt to see if we could game the stock market, we shifted our focus from strictly analyzing internal trends in the market and what the next move might be to studying outside factors, such as daily news sentiment, and the impact they have on the system itself. News is just one factor, but it plays an important role in market shifts because it heavily influences emotional traders, who may purchase stock during good news and dump their stock during bad news.[<sup>3</sup>](#fn3) For these reasons, our goal is to determine how well we can predict possible buy and sell points by estimating the weight that outside factors have on the market and their combined scores.

# Problem Statement

The problem our group is tackling is finding the most influential factor in the way stocks react. We take a number of features into account and use them to classify whether people should buy or sell. Each of the features is put into a vector, and we use feature selection to understand which feature should hold the most weight. The feature that contributes most to getting the correct classification is the one we take as most important to understanding how stocks react.

# Data

First Data Set:
- https://finance.yahoo.com/quote/%5EGSPC/history?period1=315532800&period2=1654300800&interval=1d&filter=history&frequency=1d&includeAdjustedClose=true
- The data set contains around 7 variables and around 10,000 observations (every day the stock market was open) from the beginning of 1980.
- Each observation contains the open price, close price, adjusted close price, high and low prices, and volume of the S&P 500 on any given day besides holidays or weekends.
- The critical values we are looking for are the date and the opening and closing prices, which are given in dollars.
- We copied the data from the site into Excel and exported it as a csv file.
- We reformatted the data from strings to datetime objects and floats.
- Additionally, we used the S&P 500 data to find candlestick patterns.
- Candlestick patterns are typically used by day traders to help them decide whether to buy or sell a particular stock.
- The package we used for this was TA-Lib, an open source package that takes the open, close, high, and low data and compares it to various patterns.
- This helped us quickly and effectively identify the most important trends occurring each day.
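The string-to-datetime and string-to-float conversion described above can be sketched with pandas. This is a minimal sketch, not the project's actual cleaning code: the two sample rows are made-up placeholder values, and the column names and date format are assumptions based on the layout of a Yahoo Finance history export.

```python
import io

import pandas as pd

# Hypothetical sample rows (made-up values) in the assumed export layout.
raw_csv = io.StringIO(
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    '"Jun 03, 2022","4,100.00","4,150.00","4,050.00","4,120.00","4,120.00","3,000,000"\n'
    '"Jun 02, 2022","4,000.00","4,110.00","3,990.00","4,100.00","4,100.00","2,500,000"\n'
)

# Read every column as a string, the way the data arrives after copy and paste.
prices = pd.read_csv(raw_csv, dtype=str)

# Convert the date strings to datetime objects.
prices["Date"] = pd.to_datetime(prices["Date"], format="%b %d, %Y")

# Strip thousands separators and convert the numeric columns to floats.
numeric_cols = ["Open", "High", "Low", "Close", "Adj Close", "Volume"]
for col in numeric_cols:
    prices[col] = prices[col].str.replace(",", "", regex=False).astype(float)

# Sort oldest first so later time-based steps see the rows in order.
prices = prices.sort_values("Date").reset_index(drop=True)
print(prices.dtypes)
```

The candlestick step with TA-Lib is left out here; it would run on the cleaned open, high, low, and close columns.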


Second Data Set:
- https://www.frbsf.org/economic-research/indicators-data/daily-news-sentiment-index/
- The data set starts in 1980 and has only 1 variable, with about 15,000 observations.
- Each observation contains the daily sentiment index from financial news.
- The daily sentiment index is a value that ranges from -1 to 1, with good financial news leading towards 1 and negative financial news leading towards -1.
- We simply exported it from an Excel file to a csv and stripped the time.

Third Data Set:
- https://www.kaggle.com/datasets/hemil26/gold-rates-1985-jan-2022?select=daily_gold_rate.csv
- The data set ranges from 1985 to 2022 and has only 1 variable, with about 9,000 observations.
- Each observation contains the daily gold rate.
- We simply loaded it from the csv file and normalized the values.

Fourth Data Set:
- https://www.eia.gov/dnav/pet/PET_PRI_SPT_S1_D.htm
- The data set ranges from around 1986 to 2022 and has only 1 variable per type, with about 9,000 observations.
- Each observation contains the daily spot prices.
- We combined the Excel sheets for the different prices, dropped the types we did not need, and normalized the values.

The cleaning code for each dataset and the merging of all the datasets can be found in the following folder: https://github.com/COGS118A/Group019-Sp22/tree/main/Cleaning
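As a rough illustration of the normalize-and-merge step, the sketch below joins four small tables on their date. It is not the code in the Cleaning folder: the frames, column names, and values are hypothetical, min-max scaling is an assumed reading of "normalized", and an inner join is assumed so that only days present in every source survive.

```python
import pandas as pd

# Hypothetical toy frames standing in for the four cleaned data sets.
dates = pd.to_datetime(["2022-01-03", "2022-01-04", "2022-01-05"])
sp500 = pd.DataFrame({"Date": dates, "Close": [100.0, 102.0, 101.0]})
sentiment = pd.DataFrame({"Date": dates, "Sentiment": [0.1, -0.2, 0.05]})
gold = pd.DataFrame({"Date": dates[:2], "Gold": [1800.0, 1810.0]})
oil = pd.DataFrame({"Date": dates, "Oil": [75.0, 77.0, 76.0]})


def min_max(series):
    """Scale a column to the range [0, 1] (assumed normalization)."""
    return (series - series.min()) / (series.max() - series.min())


gold["Gold"] = min_max(gold["Gold"])
oil["Oil"] = min_max(oil["Oil"])

# Inner-join on the date so each row has a value from every source.
merged = sp500
for frame in (sentiment, gold, oil):
    merged = merged.merge(frame, on="Date", how="inner")

print(merged)
```

Because the gold table here is missing the last day, the merged result keeps only the first two dates, which is the behavior an inner join would give on the real data whenever one source has a gap.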

# Proposed Solution

Our solution to the proposed problem is to use a binary classification model to determine whether we should buy or sell one unit of stock in a given timeframe. We plan to train a model on day trading flags, sentiment analysis of news headlines, gold data, and past opening and closing values of the S&P 500 index to predict future market trends, thus providing the user a suggestion to purchase or sell.

To make this decision, we will create a specific database such that it consists of a series of vectors. Each vector contains information regarding the daily statistics of the various variables that we will be tracking, as stated in the last paragraph. The vector below is an example of how our daily data will be arranged which we will eventually use for binary classification of a company's stock.

![Cogs118vector.JPG](Images/Cogs118vector.jpg)

Overall, we will predict whether to buy or sell a stock on a daily basis and determine our final outcome over a 6 month period. Whenever a buy order occurs we keep all previously bought stock of the same type; when a sell order occurs, we sell the entirety of the stock before the upcoming dip predicted by our model.

While we believe we could possibly make more money by investing in a single company that is more likely to rise in the future, the data we are analyzing is not specific to any one company; it is general national data, which is not a good indicator for an individual stock.



# Evaluation Metrics

As stated in the solution section, we will be using a binary classification model to determine whether we should buy or sell a stock on a given day with the provided daily statistics. If the data in a single day overall points to a future increase in stock prices, the model will return 1, indicating a buy order, while the model will return -1 if the model predicts that the stock will go down in the future, telling the model to sell any stock it has.
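The abstract describes the buy/sell label as the sign of the average of the standardized high and low. A minimal sketch of that rule, on made-up prices and with hypothetical variable names, might look like this:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up high and low prices for four days.
prices = pd.DataFrame(
    {"High": [10.0, 12.0, 14.0, 16.0], "Low": [8.0, 11.0, 12.0, 15.0]}
)

# Standardize each column to zero mean and unit variance.
scaled = StandardScaler().fit_transform(prices[["High", "Low"]])

# Average the two standardized values per day.
daily_average = scaled.mean(axis=1)

# Label 1 (buy) when the average is above zero, otherwise -1 (sell).
# Treating an average of exactly zero as sell is an assumption.
labels = np.where(daily_average > 0, 1, -1)
print(labels.tolist())
```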

To check whether our model made the correct choice for a stock, we simply use the next day's price as an indicator of whether we were right to buy or sell. If our model predicted a buy order and the stock price went up, it profited and made the correct decision; if it predicted a buy order and the stock price went down, it failed for the given day. Similarly, if our model predicted a sell order and the stock price went down, it avoided a potential loss and made the correct decision; if it predicted a sell order and the stock price went up, it failed for the given day and missed out on profit.
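The next-day check described above can be written as a comparison between each day's prediction and the sign of the following day's price change. The sketch below uses made-up closing prices and predictions; counting a flat day as a sell is an assumption.

```python
import pandas as pd

# Made-up closing prices for five consecutive trading days.
close = pd.Series([100.0, 101.0, 99.0, 102.0, 103.0], name="Close")

# Price change from each day to the next; the last day has no next day.
next_day_change = (close.shift(-1) - close).dropna()

# The "right" action for each day: 1 (buy) if the price rose the next day,
# -1 (sell) otherwise. Flat days count as sell by assumption.
actual = next_day_change.apply(lambda change: 1 if change > 0 else -1)

# Hypothetical model output for the first four days.
predicted = pd.Series([1, 1, 1, -1], index=actual.index)

# Fraction of days on which the model made the correct call.
accuracy = (predicted == actual).mean()
print(f"accuracy: {accuracy:.2f}")
```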

Additionally, we will give our model an arbitrary starting amount of money and compare it with the balance at the end of a 6 month timeline to see whether the model profited or lost money in the process. If our model profits in a majority of the runs, this may indicate that it is able to predict future stock prices.

![Model.JPG](Images/Model.JPG)
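The trading rule from the proposed solution (buy one share on a buy signal, sell everything on a sell signal) combined with an arbitrary starting balance can be simulated in a few lines. This is a sketch under stated assumptions: trades execute at the same day's price, there are no fees, and the function name `simulate` and all values are hypothetical.

```python
def simulate(prices, signals, starting_cash=1000.0):
    """Return the final portfolio value after following the signals.

    A signal of 1 buys one share if there is enough cash;
    a signal of -1 sells every share currently held.
    """
    cash = starting_cash
    shares = 0
    for price, signal in zip(prices, signals):
        if signal == 1 and cash >= price:
            cash -= price
            shares += 1
        elif signal == -1 and shares > 0:
            cash += shares * price
            shares = 0
    # Value any remaining shares at the last observed price.
    return cash + shares * prices[-1]


# Made-up prices and model signals for five days.
prices = [100.0, 105.0, 110.0, 90.0, 95.0]
signals = [1, 1, -1, -1, 1]

final_value = simulate(prices, signals)
print(f"start: 1000.00, end: {final_value:.2f}")
```

Running the same simulation over many 6 month windows and counting how often the final value exceeds the starting cash would give the "majority of the runs" comparison.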

# Results

### Subsection 1

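```python
from sklearn.neighbors import KNeighborsClassifier

# X_train, y_train, X_test, y_test come from the train/test split of the merged data.
neighbor = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
neighbor.score(X_test, y_test)
```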

We started with KNN as an exploratory model to understand what our data looks like and to set a baseline to beat. It scored 0.6338315217391305 on the test set.

### Subsection 2

ADD Learning curve
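A learning curve for this step could be produced along the following lines. This is a minimal sketch on synthetic data from `make_classification`, not on the project's data set; the choice of logistic regression, five folds, and five training sizes are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in for the merged feature matrix and buy/sell labels.
X, y = make_classification(n_samples=400, n_features=8, random_state=0)

# Train on growing fractions of the data and score each with 5-fold CV.
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    scoring="accuracy",
)

# Average over the folds to get one training and one validation score per size.
for size, train_acc, val_acc in zip(
    train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"n={size:4d}  train={train_acc:.3f}  validation={val_acc:.3f}")
```

A gap between the two curves that stays wide as the training size grows would suggest overfitting, while two low curves that meet early would suggest the features carry limited signal.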

### Subsection 3

ADD k-fold cross validation for these models ['Logistic','Random Forest','Decision Tree','SVM','KNN']
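The five-model comparison could be run with `cross_val_score`, as in the sketch below. The data is synthetic and the hyperparameters are defaults or arbitrary choices, so the printed numbers say nothing about the project's actual results.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the merged feature matrix and buy/sell labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# 5-fold cross-validated accuracy for every model.
cv_results = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy")
    for name, model in models.items()
}

for name, scores in cv_results.items():
    print(f"{name:14s} mean={scores.mean():.3f} std={scores.std():.3f}")
```

Because stock data is ordered in time, a splitter such as `TimeSeriesSplit` passed as `cv` may be a safer choice than the default folds, which can train on days that come after the ones being scored.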

### Subsection 4

ADD tuning hyperparameters (feature selection) using best model from above
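One way to tune hyperparameters and feature selection together is a pipeline searched with `GridSearchCV`. The sketch below assumes logistic regression is the best model, uses `SelectKBest` with an ANOVA F-score as the feature selector, and runs on synthetic data; the grid values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged feature matrix and buy/sell labels.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, random_state=0
)

# Scale, keep the k highest-scoring features, then classify.
pipeline = Pipeline(
    [
        ("scale", StandardScaler()),
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]
)

# Search over how many features to keep and the regularization strength C.
param_grid = {
    "select__k": [2, 4, 8],
    "clf__C": [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_.named_steps["select"].get_support()` reports which features were kept, which relates directly to the question of which factor is most influential.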


### Subsection 5

ADD validation curve
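A validation curve varies one hyperparameter and records training and validation scores at each value. The sketch below does this for the regularization strength `C` of logistic regression on synthetic data; the parameter choice and range are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

# Synthetic stand-in for the merged feature matrix and buy/sell labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Six values of C spaced evenly on a log scale from 0.001 to 100.
c_values = np.logspace(-3, 2, 6)

train_scores, val_scores = validation_curve(
    LogisticRegression(max_iter=1000),
    X,
    y,
    param_name="C",
    param_range=c_values,
    cv=5,
    scoring="accuracy",
)

# One row per value of C, averaged over the five folds.
for c, train_acc, val_acc in zip(
    c_values, train_scores.mean(axis=1), val_scores.mean(axis=1)
):
    print(f"C={c:8.3f}  train={train_acc:.3f}  validation={val_acc:.3f}")
```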

### Subsection 6

ADD give our model an arbitrary starting amount of money and compare it with the balance at the end of a 6 month timeline to see whether the model profited or lost money in the process. If our model profits in a majority of the runs, this may indicate that it is able to predict future stock prices.

# Discussion

### Interpreting the result

We first compared across models to see which fits the problem best. All of our models performed similarly in that their accuracy rates were close. We ran logistic regression, KNN, decision tree, random forest, and SVM, which had the following respective accuracies: 71.94%, 65.44%, 67.16%, 70.99%, and 54.71%.
![](Images/output.png)

But accuracy is not the only metric to follow: a model can be accurate overall and still have a high false positive rate or a high false negative rate. Depending on the situation, one may be more acceptable than the other. In medical decisions you would rather tolerate a higher false positive rate, since you do not want to miss a diagnosis that results in someone's death. In other settings, such as criminal convictions, you would rather have a higher false negative rate than false positive rate, since you do not want to sentence people who are innocent. For our model the choice between false negatives and false positives is subjective. A more aggressive investor is willing to be wrong more often (more false positives), since that increases the chance of scoring big. A more conservative investor wants to be wrong less often (more false negatives), since they cannot risk losing money. Since we are trying to guess the market, we are fine with being wrong more often for the chance of scoring big. Therefore we look at the bigger picture: not just pure accuracy, but accuracy together with the false positive and false negative rates.

Logistic regression has the best performance in general, so we proceed with our logistic regression model. After training, we obtain the following confusion matrix. The precision and recall are 0.78 and 0.71, and the F1 score is 0.71.
![](Images/output1.png)
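For reference, the confusion matrix, precision, recall, and F1 score can be computed with `sklearn.metrics` as sketched below. The labels and predictions are a small made-up example, not the model's output, so the numbers differ from those reported above.

```python
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions: 1 is buy, -1 is sell.
y_true = [1, 1, 1, 1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, 1, -1, -1]

# With labels=[-1, 1], the flattened matrix is (TN, FP, FN, TP).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()

# Buy (1) is treated as the positive class.
precision = precision_score(y_true, y_pred, pos_label=1)
recall = recall_score(y_true, y_pred, pos_label=1)
f1 = f1_score(y_true, y_pred, pos_label=1)

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Here a false positive is a day labeled buy that should have been sell, and a false negative is a day labeled sell that should have been buy, which matches the investor trade-off discussed above.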

In order to further understand the performance of our model, we obtained the ROC curve. The area under the curve is 0.762, in comparison to chance level performance of 0.5. 

![](Images/output2.png)
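The area under the ROC curve is computed from the model's scores rather than its hard buy/sell decisions. A minimal sketch on four made-up examples, where the scores stand in for predicted buy probabilities (for example from `predict_proba`), is:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Made-up true labels (1 is buy, -1 is sell) and predicted buy scores.
y_true = [-1, -1, 1, 1]
buy_scores = [0.1, 0.4, 0.35, 0.8]

# AUC is the probability that a random buy day is scored above a random sell day.
auc = roc_auc_score(y_true, buy_scores)

# Points of the ROC curve: false positive rate vs. true positive rate.
fpr, tpr, thresholds = roc_curve(y_true, buy_scores, pos_label=1)

print(f"AUC = {auc:.2f}")
for rate_fp, rate_tp in zip(fpr, tpr):
    print(f"FPR={rate_fp:.2f}  TPR={rate_tp:.2f}")
```

In this toy case three of the four buy/sell pairs are ranked correctly, so the AUC is 0.75; a model that ranks at random would score about 0.5.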


### Limitations

One limitation is that we chose only 5 different variables to look at, while many other factors determine whether a stock will go up. Requiring data points that range from 1980 to 2022 also limited the number of datasets available to us. We wanted more data points because they give our model more to train on, but this requirement constrained the model since only so many variables cover that period. Even with more data points, we would not necessarily have a more accurate model; in fact, it could make it worse. In the end, buying and selling can come down to a gut feeling rather than an objective judgment, which makes it almost impossible to accurately predict whether to buy or sell.

### Conclusion

Since investing in the stock market has become more accessible over the past few decades, movements in the stock market have become relevant not only to hedge fund managers but to anyone with a phone. The problem we set out to explore was identifying the most influential factors that affect the stock market.

To find this out, we trained a logistic regression model that makes binary suggestions according to the factors we theorized to be significant: news sentiment, changes in the S&P 500, energy prices, gold prices, etc. To train the model with enough observations, we collected data ranging from 1980 up to 2022. After extracting and combining the above factors into vectors, we were able to train a binary classification model that reaches 71% accuracy at validation.


# Footnotes

<span id="fn1"> 1. Cootner, Paul H. (1964). The random character of stock market prices. MIT Press. ISBN 978-0-262-03009-0.</span>

<span id="fn2"> 2. Mislinski, Jill (3 March 2020). "Market Cap to GDP: An Updated Look at the Buffett Valuation Indicator". www.advisorperspectives.com </span>

<span id="fn3"> 3. Chan, Wesley S. "Stock price reaction to news and no-news: drift and reversal after headlines." Journal of Financial Economics 70.2 (2003): 223-260. </span>
