# Machine Learning for Stock Price Prediction
## Overview

### What You'll Learn
In this section, you'll learn:
1. How to format data so that it can be used for machine learning
2. How to create, train, and test a model that predicts stock prices
3. How to improve that model

### Prerequisites
Before starting this section, you should have an understanding of
1. [Basic Python (functions, loops, lists)](https://github.com/HackBinghamton/PythonWorkshop)
2. [scikit-learn](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/intro_ml_scikit.ipynb) ([Boston Housing Price example](https://colab.research.google.com/github/HackBinghamton/MachineLearningWorkshopWeek1/blob/master/housing_price_prediction.ipynb) if you'd like extra practice)

### Introduction
Stock price prediction has been a Holy Grail of machine learning for years. If you could predict changes in stock prices, you could buy and sell at just the right times to make tons of money. In this workshop section, we'll discuss how to make data about a given stock fit into an `sklearn` machine learning model, as well as how to train and test that model.

---

## Loading the Data
The first step in starting a project is loading the data. Usually, you have to use one of `sklearn`'s datasets, find a third-party dataset online, or build your own.

In this case, we've prepared datasets on different individual stocks for everyone! These datasets hold three types of data over multiple rows. In each row, you will find:

 1. The date that the row describes
 2. The stock price change on that day
 3. The average sentiment (attitude) of news related to the stock on that day
 
Feel free to tweak `dataset` to fit your needs; it selects which of our datasets you'd like to load. (`train_proportion`, which sets how much of the data is dedicated to training as opposed to testing, comes into play later, when we split the data.)

Once you've set `dataset` to your liking, run the code block and your dataset will be loaded!
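To see what the loading code produces without touching the network, here is a minimal sketch that parses CSV text held in a string. The sample rows are made up for illustration (they only mimic the three-column layout described above), and `sample_csv` is a hypothetical name rather than part of the workshop's datasets.

```python
import csv
import io

# Hypothetical CSV text in the same three-column layout as the workshop files
sample_csv = (
    "Date,Stock Change,Sentiment\n"
    "2019-09-19,27.55,16.98\n"
    "2019-09-22,-8.30,15.40\n"
)

# csv.reader accepts any iterable of lines, so an in-memory file works too
rows = list(csv.reader(io.StringIO(sample_csv)))

# Every value, including the numbers, comes back as a string
for row in rows:
    print("{0:12s} | {1:20s} | {2:14s}".format(*row))
```

Note that the header is returned as an ordinary row and that the numbers are still strings; both matter once we start preparing the data.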

In [None]:
import requests
import csv


# TWEAK THIS VALUE TO USE WHATEVER DATASET YOU'D LIKE
# Options: Facebook, Amazon, Microsoft, Nvidia, Apple
dataset = "Amazon"



#########################
## DO NOT MODIFY BELOW ##
#########################

##### LOAD DATA #####
# Fetch the dataset contents
r = requests.get("https://raw.githubusercontent.com/HackBinghamton/MachineLearningWorkshopWeek2/master/stock_price_prediction/" + dataset + ".csv")

# Stop here with a clear error if the download failed (e.g. a misspelled dataset name)
r.raise_for_status()

# Write to a local file
with open(dataset + ".csv", "w") as datafile:
    datafile.write(r.text)

# Read into rows
data = []
with open(dataset + ".csv", "r") as datafile:
    reader = csv.reader(datafile)
    for row in reader:
        data.append(row)

##### DISPLAY DATA #####
print(dataset, "dataset:")
for row in data:
    print("{0:12s} | {1:20s} | {2:14s}".format(*row))    


## Preparing the Data
Great! So we've now loaded up a dataset into a list that looks something like this:

```python
[[ "Date",       "Stock Change", "Sentiment" ],
 [ "2019-09-19", "27.55",        "16.98"     ],
 [ "2019-09-22", "-8.30",        "15.40"     ],
 ...
 [ "2019-09-29", "-8.92",        "15.06"     ]]
```

However, we can't yet feed this into our machine learning model. Here are a few problems with it:

1. The first row, `["Date", "Stock Change", "Sentiment"]`, is not a valid data point
2. The "Date" column is largely irrelevant, since all of the dates are close together (within a couple of weeks). Unless we have multiple years' worth of data points, this column is likely to confuse our machine learning model
3. All of the data is in string format (whether it's a number or not)

To fix this and make our data compatible with the machine learning model, we'll have to do the following:

1. Remove the first row
2. Remove the date column
3. Convert all of the number data to `float`s
4. Break our data into 4 different lists: training and testing sets of both X (input/news sentiment) and Y (output/stock price change) values
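As a preview, the first three steps can be sketched compactly with list comprehensions. This is only an illustration on a few hypothetical rows (the name `raw` and its values are assumptions); the sections that follow walk through each step one at a time with explicit loops.

```python
# Hypothetical rows in the same layout as the loaded dataset
raw = [
    ["Date", "Stock Change", "Sentiment"],
    ["2019-09-19", "27.55", "16.98"],
    ["2019-09-22", "-8.30", "15.40"],
]

# Skip the header (raw[1:]), drop the date (row[1:]), and convert the rest to floats
cleaned = [[float(value) for value in row[1:]] for row in raw[1:]]

# Separate outputs (stock change) from inputs (sentiment), one single-item list per row
y = [[row[0]] for row in cleaned]
x = [[row[1]] for row in cleaned]

print(cleaned)
print("x:", x)
print("y:", y)
```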

Let's do it!

### 1. Chopping the top row off
Removing the top row of our data with `["Date", "Stock Change", "Sentiment"]` will let us avoid running into errors when we try to feed it into our machine learning model.

In [None]:
# "Slices" in Python (the [:] things) let you chop out parts of lists to your liking
data = data[1:]

# Display the data
for row in data:
    print(row)

### 2. Chopping the Date column off
The Date column is more likely to confuse the machine learning model than help it. It's string data, and machine learning models like the ones we're using only take numerical data. With years' worth of data, the model might be able to find a date-related pattern, but with only a few data points it can't make informed decisions based on the dates.

In [None]:
# Use slices again, but this time on each row (since we're deleting a column)
for row in range(len(data)):
    data[row] = data[row][1:]

# Display the data
for row in data:
    print(row)

### 3. Converting all numerical data to `float`
As we said above, our models only take numerical data. If we try to feed it strings, it'll error out.
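A small sketch of why the order of the steps matters: Python's built-in `float()` converts numeric strings without trouble, but raises a `ValueError` on text such as a header cell, which is one reason the header row had to go first. The `header_converts` flag is a hypothetical name used only for this illustration.

```python
# Numeric strings convert cleanly
print(float("27.55"))

# A header cell such as "Sentiment" is not a number, so float() raises ValueError
try:
    float("Sentiment")
    header_converts = True
except ValueError as err:
    header_converts = False
    print("Could not convert:", err)
```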

In [None]:
# Iterate through each item and convert to float
for row in range(len(data)):
    for col in range(len(data[row])):
        data[row][col] = float(data[row][col])

# Display the data
for row in data:
    print(row)

### 4. Splitting into training and testing data
For our last step in data processing, we must split our data into training and testing groups, as well as organize our data so that it fits into our model.
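The split itself comes down to picking an index and slicing around it. Here is a minimal sketch on ten made-up rows (the values and the names `rows`, `train_rows`, and `test_rows` are assumptions for illustration only):

```python
# Ten hypothetical rows of [stock change, sentiment], already converted to floats
rows = [[float(i), float(i) / 2] for i in range(10)]

train_proportion = 0.7

# int() drops any fractional part, so 0.7 * 10 gives a split index of 7
split = int(train_proportion * len(rows))

train_rows = rows[:split]   # rows before the split index
test_rows = rows[split:]    # rows from the split index onward

print(len(train_rows), "training rows,", len(test_rows), "testing rows")
```

With a dataset this small, the testing set ends up with only three rows, which is worth keeping in mind when interpreting the model's score later.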

In order to make our data fit into the model, we must also make each row within our inputs and outputs into its own list. To illustrate:

**Before:**
```python
#   y        x
[[27.55,  16.98],
 [-8.30,  15.40],
       ...
 [-8.92,  15.06]]
```

**After (without being broken into training and testing):**
```python
#   y
[[27.55 ],
 [-8.30 ],
   ...
 [-8.92 ]]

#   x 
[[16.98],
 [15.40],
   ...
 [15.06]]
```

This may seem confusing, but we need to put each data point's X values and Y values into individual arrays because more often than not, there will be multiple input (X) variables that map to a certain output (Y) value. For example, the Boston Housing dataset from last week had inputs for crime rate, student-teacher ratio, avg. property tax, and more, with an output of the house price.

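Because `sklearn` models are built to handle multiple inputs, they expect X as a list of rows (one list of input values per data point) even when there is only one input, so we must make sure that our data is in that format.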
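If your values start out as a flat list, NumPy can produce the same one-list-per-row layout in a single call. This is an alternative sketch rather than the approach used in this workshop, and the `sentiment` values are hypothetical.

```python
import numpy as np

# Hypothetical sentiment values stored as a flat list
sentiment = [16.98, 15.40, 15.06]

# reshape(-1, 1) turns n values into n rows of 1 column; -1 lets NumPy infer n
x = np.array(sentiment).reshape(-1, 1)

print(x.shape)
print(x.tolist())
```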

In [None]:
# Designate how much of our dataset we'd like to dedicate to training (the rest goes to testing)
train_proportion = 0.7

# Create arrays for each type of data
xtrain = []
ytrain = []
xtest = []
ytest = []

# Index of the first testing row: everything before it is used for training
split = int(train_proportion * len(data))

# Iterate through each row of data dedicated to training
# Wrap each value in its own list before appending it
for row in range(split):
    ytrain.append([data[row][0]])
    xtrain.append([data[row][1]])

# Iterate through each row of data dedicated to testing
# (the values are already floats from step 3, so no conversion is needed)
for row in range(split, len(data)):
    ytest.append([data[row][0]])
    xtest.append([data[row][1]])

# Display our newly-formatted data
print("X Training:", xtrain)
print("Y Training:", ytrain)
print("X Testing:", xtest)
print("Y Testing:", ytest)

## Creating, Training, and Testing Our Model

Now that we have our data properly formatted, we can finally create our model and run the data through it.

We've arbitrarily chosen Ridge regression. Poke around and see what other regressions you can use (`LinearRegression` and `Lasso` are a couple of others) and how they affect the model's score.
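One way to poke around is to loop over several models and compare their scores on the same split. The sketch below uses synthetic, noise-free data that follows y = 2x + 1, purely so that it runs on its own; it is not stock data, and the numbers say nothing about how these models behave on the workshop datasets.

```python
from sklearn.linear_model import Lasso, LinearRegression, Ridge

# Synthetic, noise-free data following y = 2x + 1 (not real stock data)
x = [[float(i)] for i in range(10)]
y = [2.0 * row[0] + 1.0 for row in x]

# First 7 points for training, last 3 for testing
xtrain, xtest = x[:7], x[7:]
ytrain, ytest = y[:7], y[7:]

# Fit each model with its default settings and record its test score
scores = {}
for model in (LinearRegression(), Ridge(), Lasso()):
    model.fit(xtrain, ytrain)
    scores[type(model).__name__] = model.score(xtest, ytest)

for name, score in scores.items():
    print(name, round(score, 4))
```

Swapping in your own `xtrain`, `ytrain`, `xtest`, and `ytest` lists lets you run the same comparison on a real dataset.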

In [None]:
# Grab our model from sklearn
from sklearn.linear_model import Ridge

# Create our model
model = Ridge()

# Train the model
model.fit(xtrain, ytrain)

# Test the model and report its R^2 score
# (1.0 is a perfect fit; 0.0 is no better than always predicting the mean; it can be negative)
print("R^2 score:", model.score(xtest, ytest))


### What You Could Do to Improve This System

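You probably noticed that our model's score is *very* low, and possibly even negative. (For regression models, `model.score` reports R², a measure of fit, rather than a percentage of correct answers.) Don't worry! This is normal. Let's talk about why.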
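To make the number reported by `model.score` easier to interpret: for regressors such as `Ridge`, it is the coefficient of determination (R²). The hand-written `r_squared` function below is a hypothetical helper that follows the standard definition, applied to three made-up sets of predictions.

```python
def r_squared(actual, predicted):
    """R^2 = 1 - (sum of squared errors) / (total sum of squares around the mean)."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]

perfect = r_squared(actual, [1.0, 2.0, 3.0])      # exact predictions
mean_only = r_squared(actual, [2.0, 2.0, 2.0])    # always predicting the mean
backwards = r_squared(actual, [3.0, 2.0, 1.0])    # worse than predicting the mean

print(perfect, mean_only, backwards)  # 1.0 0.0 -3.0
```

A score near zero or below therefore means the model does no better than guessing the average stock change every day.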

#### 1. Not enough data
Machine learning models need as much data as they can get in order to make well-informed estimates. Our datasets contain roughly 10 days' worth of stock data, so after the split the model is trained and tested on only a handful of points. Imagine how much better it would be if we had access to 10 years' worth.

#### 2. Not enough variables
Trying to predict stock prices based on news sentiments is like trying to predict the weather based on the average humidity. Both stock prices and the weather are very chaotic systems -- drastic changes can occur suddenly and unpredictably. In order to get better at predicting stock prices, we need not only more data, but more *types* of data.

In this workshop, we used news sentiment as our only input. We could also gather data on the daily market average, the time of year, the time-proximity to nearby holidays, and so much more. In general, models given more relevant data and more informative inputs tend to perform better.
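Adding another input only means adding another value to each row of X. The sketch below assumes a second, hypothetical "market average change" feature; every number in it is made up for illustration and none of it comes from the workshop datasets.

```python
from sklearn.linear_model import Ridge

# Made-up rows: [news sentiment, market average change] (second feature is hypothetical)
x = [
    [16.98, 0.4],
    [15.40, -0.2],
    [15.06, -0.5],
    [16.10, 0.1],
]

# Made-up stock price changes, one per row of x
y = [27.55, -8.30, -8.92, 5.00]

model = Ridge()
model.fit(x, y)

# The model learns one coefficient per input feature
print(model.coef_)

# Predict for a new, equally made-up day
prediction = model.predict([[16.00, 0.0]])
print(prediction)
```

This is where the one-list-per-row format pays off: the model code stays the same no matter how many inputs each row holds.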

## Appendix: How We Collected Our Datasets
We used the [Alpha Vantage API](https://www.alphavantage.co/documentation/) to collect stock data on a daily basis, and the [News API](https://newsapi.org/) to gather news articles from the past month. To create average news sentiments, we used the [Natural Language Toolkit](https://www.nltk.org/)'s VADER analyzer.

To learn how to do this kind of data collection yourself and interact with websites online, come to our Webscraping and APIs workshop next week!
