# Linear Regression: Features and Labels
## Sources: 
1. <a href="https://pythonprogramming.net/regression-introduction-machine-learning-tutorial/" target="_blank">Python Programming: Regression - Intro and Data</a>
2. <a href="https://pythonprogramming.net/features-labels-machine-learning-tutorial/" target="_blank">Python Programming: Regression - Features and Labels</a>

A fundamental technique in machine learning is <em>linear regression</em>.  The idea is to find a <em>best-fit line</em> that lies as close as possible to as many of the data points as possible, so that given an input value $x$, you can predict the output value $y$.  Recall that the equation of a line is $y = mx + b$, where $m$ is the slope and $b$ is the intercept.

<img src="../Images/linear-regression-line.png" alt="best-fit line">
Image source: [1]
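
As a minimal sketch of what "best-fit line" means, the snippet below fits $y = mx + b$ to a handful of made-up points using `numpy.polyfit`. The points are invented for illustration and are not part of the stock dataset, and the model used later in this series may be fitted differently.

```python
import numpy as np

# Made-up points that lie roughly on the line y = 2x + 1 (illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# A degree-1 polynomial fit returns the least-squares slope m and intercept b
m, b = np.polyfit(x, y, 1)

# Predict y for a new input value using y = mx + b
x_new = 5.0
y_pred = m * x_new + b
print(f"m = {m:.3f}, b = {b:.3f}, prediction at x = {x_new}: {y_pred:.3f}")
```

The fitted line does not pass through every point; it is the line that minimizes the sum of squared vertical distances to the points.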

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

# import quandl, the data source we use
import quandl

In [2]:
# Raise errors instead of giving warnings on chained assignment
pd.set_option("mode.chained_assignment", "raise")

In [3]:
# Define stock_symbol here:
# the stock we want to use for this project
stock_symbol = "GOOGL"

# Use quandl to get stock data
stock_data_full = quandl.get(f"WIKI/{stock_symbol}")

# Preview stock
display(stock_data_full.head())

Date,Open,High,Low,Close,Volume,Ex-Dividend,Split Ratio,Adj. Open,Adj. High,Adj. Low,Adj. Close,Adj. Volume
2004-08-19,100.01,104.06,95.96,100.335,44659000.0,0.0,1.0,50.159839,52.191109,48.128568,50.322842,44659000.0
2004-08-20,101.01,109.08,100.5,108.31,22834300.0,0.0,1.0,50.661387,54.708881,50.405597,54.322689,22834300.0
2004-08-23,110.76,113.48,109.05,109.4,18256100.0,0.0,1.0,55.551482,56.915693,54.693835,54.869377,18256100.0
2004-08-24,111.24,111.6,103.57,104.87,15247300.0,0.0,1.0,55.792225,55.972783,51.94535,52.597363,15247300.0
2004-08-25,104.76,108.0,103.88,106.0,9188600.0,0.0,1.0,52.542193,54.167209,52.10083,53.164113,9188600.0


In [4]:
# Check for missing values
print(stock_data_full.isna().sum())

Open           0
High           0
Low            0
Close          0
Volume         0
Ex-Dividend    0
Split Ratio    0
Adj. Open      0
Adj. High      0
Adj. Low       0
Adj. Close     0
Adj. Volume    0
dtype: int64


Often when working with machine learning, we have more data than we need.  Once we get to the "training" part, this can cause problems, because unnecessary data can confuse or bias the model.

In this case, we don't need both the regular prices and the adjusted prices.  We will proceed with only the adjusted prices.

In [5]:
# Create new dataframe with only relevant data
stock_data = stock_data_full[["Adj. Open",
                              "Adj. Close",
                              "Adj. High",
                              "Adj. Low",
                              "Adj. Volume"]].copy()
# Preview new dataframe with only relevant data
display(stock_data.head())

Date,Adj. Open,Adj. Close,Adj. High,Adj. Low,Adj. Volume
2004-08-19,50.159839,50.322842,52.191109,48.128568,44659000.0
2004-08-20,50.661387,54.322689,54.708881,50.405597,22834300.0
2004-08-23,55.551482,54.869377,56.915693,54.693835,18256100.0
2004-08-24,55.792225,52.597363,55.972783,51.94535,15247300.0
2004-08-25,52.542193,53.164113,54.167209,52.10083,9188600.0


A very important thing to keep in mind when working with data is how meaningful it is.  Having a lot of data is not necessarily a good thing if the data is not meaningful.

In this case, we have the open and close prices, which we will use to calculate the daily percent change, and the high and low prices, which we will use to calculate the daily high-low volatility percent.  These two measures may tell us more than the open, close, high, and low prices do on their own.
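
Written out, the two derived measures are as follows. The symbols $P_{\text{open}}$, $P_{\text{close}}$, $P_{\text{high}}$, and $P_{\text{low}}$ are shorthand introduced here for the adjusted open, close, high, and low prices on a given day.

$$\text{daily percent change} = \frac{P_{\text{close}} - P_{\text{open}}}{P_{\text{open}}} \times 100$$

$$\text{high-low volatility percent} = \frac{P_{\text{high}} - P_{\text{low}}}{P_{\text{low}}} \times 100$$

For the first row of the data (2004-08-19), this gives $\frac{50.322842 - 50.159839}{50.159839} \times 100 \approx 0.325$ and $\frac{52.191109 - 48.128568}{48.128568} \times 100 \approx 8.441$.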

In [6]:
# Calculate daily percent change and store it in a new column
stock_data["Daily_Percent_Change"] = (
    (stock_data["Adj. Close"] - stock_data["Adj. Open"])
    / stock_data["Adj. Open"] * 100
)
# Calculate high-low volatility percent and store it in a new column
stock_data["High_Low_Volatility_Percent"] = (
    (stock_data["Adj. High"] - stock_data["Adj. Low"])
    / stock_data["Adj. Low"] * 100
)

# Define the only columns we will use.
stock_data = stock_data[["Adj. Close", "High_Low_Volatility_Percent", "Daily_Percent_Change", "Adj. Volume"]].copy()
# Preview dataframe now with more meaningful columns
display(stock_data.head())

Date,Adj. Close,High_Low_Volatility_Percent,Daily_Percent_Change,Adj. Volume
2004-08-19,50.322842,8.441017,0.324968,44659000.0
2004-08-20,54.322689,8.537313,7.227007,22834300.0
2004-08-23,54.869377,4.062357,-1.22788,18256100.0
2004-08-24,52.597363,7.75321,-5.726357,15247300.0
2004-08-25,53.164113,3.966115,1.183658,9188600.0


When working on a machine learning project, we must determine what we actually want to predict, then determine whether that is possible or logical.  In this case, we used the Adj. Close value to calculate the Daily_Percent_Change column, so it wouldn't make sense to try to predict Daily_Percent_Change from Adj. Close.

Our label will therefore be Adj. Close, or rather the Adj. Close value $x$ days into the future.

In [7]:
# We define the label here
# In this case, the label is a forecast of the Adj. Close price
forecast_column = "Adj. Close"

In [8]:
# Check for missing data
print(stock_data.isna().sum())

Adj. Close                     0
High_Low_Volatility_Percent    0
Daily_Percent_Change           0
Adj. Volume                    0
dtype: int64


In [9]:
# Define the number of days out we want to forecast
# In this case, we want to forecast out 1% of the length of the dataframe
import math
forecast_out = math.ceil(0.01*len(stock_data))
forecast_out

35

In [10]:
# Add the label: the forecast column
stock_data["Forecast"] = stock_data[forecast_column].shift(-forecast_out)

#### As a reminder,
##### using the .shift() method with a positive number $n$ shifts the values down by $n$ rows; that is, each value moves to a later row, and the first $n$ rows become NaN
##### using the .shift() method with a negative number $-n$ shifts the values up by $n$ rows; that is, each value moves to an earlier row, and the last $n$ rows become NaN.
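
A quick way to confirm the direction of `.shift()` is to try it on a small Series. The sketch below uses made-up values (not the stock data) so the movement of each value is easy to follow.

```python
import pandas as pd

# Toy Series with an obvious ordering (values are illustrative only)
s = pd.Series([10, 20, 30, 40, 50])

# Positive periods: values move down to later rows; the top fills with NaN
down = s.shift(2)

# Negative periods: values move up to earlier rows; the bottom fills with NaN
up = s.shift(-2)

print(pd.DataFrame({"original": s, "shift(2)": down, "shift(-2)": up}))
```

The index stays where it is; only the values move, which is why shifting a price column by a negative number lines up each date with a later price.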

In [11]:
# Create copy of dataframe
stock_data_features_and_label_shifts = stock_data.copy()

# Shift forecast column in positive direction (rows go down)
stock_data_features_and_label_shifts["shift_positive_3"] = \
stock_data_features_and_label_shifts[forecast_column].shift(3)

# Shift forecast column in negative direction (rows go up)
stock_data_features_and_label_shifts["shift_negative_2"] = \
stock_data_features_and_label_shifts[forecast_column].shift(-2)

# Observe the shift_positive_3 column;
# the value that was in the first row has been shifted down three rows,
# and is now in the fourth row

# Observe the shift_negative_2 column;
# the value that was in the third row has been shifted up two rows,
# and is now in the first row
display(stock_data_features_and_label_shifts[[forecast_column, "shift_positive_3", "shift_negative_2"]].head(5))
display(stock_data_features_and_label_shifts[[forecast_column, "shift_positive_3", "shift_negative_2"]].tail(5))

Date,Adj. Close,shift_positive_3,shift_negative_2
2004-08-19,50.322842,,54.869377
2004-08-20,54.322689,,52.597363
2004-08-23,54.869377,,53.164113
2004-08-24,52.597363,50.322842,54.12207
2004-08-25,53.164113,54.322689,53.239345


Date,Adj. Close,shift_positive_3,shift_negative_2
2018-03-21,1094.0,1134.42,1026.55
2018-03-22,1053.15,1100.07,1054.09
2018-03-23,1026.55,1095.8,1006.94
2018-03-26,1054.09,1094.0,
2018-03-27,1006.94,1053.15,


By shifting the Adj. Close column $x$ rows in the negative direction, where $x$ is 1% of the length of the dataframe (that is, 1% of the total number of trading days), the value that was in row $a$ is now in row $b$, where $b = a - x$.  This makes the "label" of each row the price $x$ trading days into the future.

See below for an illustration.

In [12]:
# Check how many rows 1% of the dataframe actually is
print(forecast_out)

35


In [13]:
# 1% of the total number of rows is 35 trading days
# So each value in the Forecast column is the Adj. Close price 35 trading days later
# For all but the last 35 rows, we already know this value

# Our machine learning algorithm will use the features
# to build a model that can predict the stock price
# 35 trading days beyond our last observation
display(stock_data[[forecast_column, "Forecast"]].head())
display(stock_data[[forecast_column, "Forecast"]].tail())

Date,Adj. Close,Forecast
2004-08-19,50.322842,69.078238
2004-08-20,54.322689,67.839414
2004-08-23,54.869377,68.912727
2004-08-24,52.597363,70.668146
2004-08-25,53.164113,71.219849


Date,Adj. Close,Forecast
2018-03-21,1094.0,
2018-03-22,1053.15,
2018-03-23,1026.55,
2018-03-26,1054.09,
2018-03-27,1006.94,
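
The relationship between the price column and its label can also be checked on a toy frame. The sketch below is an illustration only: the frame `toy`, its values, and the 20% horizon are assumptions chosen to keep the example small, not part of the notebook's data.

```python
import math

import pandas as pd

# Toy frame standing in for stock_data (values are illustrative only)
toy = pd.DataFrame({"Adj. Close": [float(v) for v in range(1, 11)]})

# Forecast horizon: 20% of the rows here, so the toy frame gets x = 2
x = math.ceil(0.2 * len(toy))

# The label for row a is the price from row a + x
toy["Forecast"] = toy["Adj. Close"].shift(-x)

# Rows whose label is known pair each price with the price x rows later
known = toy.dropna(subset=["Forecast"])
print(known.head())

# The last x rows have no label yet, because row a + x does not exist
print(toy["Forecast"].isna().sum())
```

The number of rows with a missing label always equals the forecast horizon, which matches the NaN values at the tail of the Forecast column in the output above.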


To limit each notebook to a particular topic, we will export our current dataframe (including all features and the label) as a CSV file and move on to the next step in a new notebook.

In [14]:
# Define export file path
import os
export_file_path = os.path.join("data", "stock_data.csv")

# Make sure the target directory exists before writing
os.makedirs("data", exist_ok=True)

# Export stock_data (features and label) as a CSV file
stock_data.to_csv(export_file_path, index=True, header=True)
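
When the exported file is read back, the Date index needs to be restored explicitly. The following is a minimal sketch of that round trip, assuming a small stand-in frame and a temporary directory rather than the real `stock_data` and `data` folder; the next notebook may load the file differently.

```python
import os
import tempfile

import pandas as pd

# Small stand-in frame with a Date index (values are illustrative only)
df = pd.DataFrame(
    {"Adj. Close": [50.32, 54.32, 54.87], "Forecast": [54.87, None, None]},
    index=pd.to_datetime(["2004-08-19", "2004-08-20", "2004-08-23"]),
)
df.index.name = "Date"

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stock_data.csv")
    df.to_csv(path, index=True, header=True)
    # index_col restores Date as the index; parse_dates converts it back to datetimes
    reloaded = pd.read_csv(path, index_col="Date", parse_dates=True)

print(reloaded)
```

Missing labels are written as empty fields and come back as NaN, so the unlabeled rows at the end of the frame survive the export.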