# Linear Regression: Training and Testing
## Sources: 
1. <a href="https://pythonprogramming.net/training-testing-machine-learning-tutorial/" target="_blank">Python Programming: Regression - Training and Testing</a>

In the previous notebooks, we learned what linear regression is, what features and labels are, and why and how to scale the features.  In this notebook, we will train a model on our data and then test it.

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
# Raise errors instead of warnings on chained assignment
pd.set_option("mode.chained_assignment", "raise")

In [3]:
# Import data
import os
data_file_path = os.path.join("Data","stock_data.csv")

stock_data = pd.read_csv(data_file_path, index_col="Date")
display(stock_data.head())

Date,Adj. Close,High_Low_Volatility_Percent,Daily_Percent_Change,Adj. Volume,Forecast
2004-08-19,50.322842,8.441017,0.324968,44659000.0,69.078238
2004-08-20,54.322689,8.537313,7.227007,22834300.0,67.839414
2004-08-23,54.869377,4.062357,-1.22788,18256100.0,68.912727
2004-08-24,52.597363,7.75321,-5.726357,15247300.0,70.668146
2004-08-25,53.164113,3.966115,1.183658,9188600.0,71.219849


In [4]:
# Define features
X = np.array(stock_data.drop(columns=["Forecast"]))

# Define labels
y = np.array(stock_data["Forecast"])

### Note!

Recall that we engineered the labels column (the Forecast column) by shifting the values in the Adj. Close column up 35 rows.  As a result, the last 35 rows of the Forecast column have no value.  We will show this more clearly by looking at the last 37 rows of the dataframe.
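
The effect of the shift is easy to reproduce on a toy frame before looking at the real data. The sketch below is only an illustration: it assumes the Forecast column was built with something like `shift(-n)`, and the prices and the horizon of 2 rows (`toy_forecast_out`) are made up for the example.

```python
import pandas as pd

# Toy stand-in for the real data: six made-up closing prices
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Hypothetical horizon of 2 rows instead of the notebook's 35
toy_forecast_out = 2

# shift(-n) moves every value up n rows, so each row holds the price n rows later
toy["Forecast"] = toy["Adj. Close"].shift(-toy_forecast_out)
print(toy)

# The last n rows have no later price to borrow, so they are left as NaN
n_missing = int(toy["Forecast"].isna().sum())
print(f"Rows without a label: {n_missing}")
```

The number of empty labels always equals the shift distance, which is why the real dataframe ends with rows that have no Forecast value.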

In [5]:
# Define the number of days out we want to forecast
# In this case, we want to forecast out 1% of the dataframe
import math
forecast_out = math.ceil(0.01*len(stock_data))

# Preview the last forecast_out + 2 rows of data
display(stock_data.tail(forecast_out + 2))

Date,Adj. Close,High_Low_Volatility_Percent,Daily_Percent_Change,Adj. Volume,Forecast
2018-02-02,1119.2,1.811604,-0.729098,5798880.0,1054.09
2018-02-05,1068.76,5.512236,-2.89385,3742469.0,1006.94
2018-02-06,1084.43,5.569849,4.879205,3732527.0,
2018-02-07,1055.41,3.025734,-2.724499,2544683.0,
2018-02-08,1005.6,5.851043,-5.120439,3067173.0,
2018-02-09,1043.43,5.488465,1.710726,4436032.0,
2018-02-12,1054.56,1.920631,-0.199684,2796258.0,
2018-02-13,1054.14,1.365911,0.394286,1574121.0,
2018-02-14,1072.7,2.445228,1.743304,2029979.0,
2018-02-15,1091.36,2.517733,0.730075,1806206.0,


Therefore, we should also adjust our features $X$ and labels $y$ so that they exclude the last 35 values.  That is, we will only go up to the $(\alpha - 35)^{\text{th}}$ value, where $\alpha$ is the length of the array.
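
The arithmetic behind the trim can be checked on placeholder arrays. This is a sketch under assumptions: `n_rows = 3424` is inferred from the counts in this notebook (3389 rows kept plus 35 trimmed), and `X_demo` and `y_demo` are made-up arrays, not the stock data.

```python
import math
import numpy as np

# Assumed row count: 3389 kept rows + 35 trimmed rows
n_rows = 3424
n_trim = math.ceil(0.01 * n_rows)  # 1% of the rows, rounded up

# Placeholder arrays with four feature columns, like the real feature set
X_demo = np.zeros((n_rows, 4))
y_demo = np.arange(n_rows, dtype=float)
y_demo[-n_trim:] = np.nan  # mimic the empty Forecast values at the end

# A negative stop index drops the last n_trim elements
X_trimmed = X_demo[:-n_trim]
y_trimmed = y_demo[:-n_trim]

print(n_trim, len(X_trimmed), len(y_trimmed))
print("Any missing labels left:", bool(np.isnan(y_trimmed).any()))
```

Slicing both arrays with the same stop index keeps features and labels aligned row for row.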

In [6]:
# Define the number of days out we want to forecast
# In this case, we want to forecast out 1% of the dataframe
import math
forecast_out = math.ceil(0.01*len(stock_data))

# Keep only the rows that have a Forecast value,
# i.e. drop the last forecast_out rows (about 1% of the data).
# A slice with a negative stop, [:-n], ends n elements
# before the end of the array.
X = X[:-forecast_out]
y = y[:-forecast_out]

In [7]:
# Confirm features and labels are the same length
length_of_features = len(X)
length_of_labels = len(y)

print(f"Length of features: {length_of_features}")
print(f"Length of labels: {length_of_labels}")

Length of features: 3389
Length of labels: 3389


### Training and Testing

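When building a machine learning model, you will usually split your data into training and testing sets.  The training set is passed to the model so that it can fit the equation used to make predictions.  The testing set, which the model has not seen during training, is then used to measure how well those predictions hold up on new data.  Note that the code below stores the model in a variable named `classifier`, although linear regression is a regression algorithm rather than a classifier.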
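
A small sketch of what the split does, using made-up arrays (`X_demo`, `y_demo`) rather than the stock data. The `random_state=0` argument is an addition for repeatability and is not used in this notebook's own split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten made-up samples with two features each; row i is [2i, 2i + 1]
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# test_size=0.2 holds out 20% of the rows for testing
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(X_tr.shape, X_te.shape)

# Rows are shuffled, but each feature row stays paired with its label
pairs_intact = bool(np.array_equal(X_tr[:, 0], 2 * y_tr))
print("Features and labels still aligned:", pairs_intact)
```

By default `train_test_split` shuffles the rows before splitting, so the test set is a random sample rather than the final 20% of the data.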

In [8]:
# Define training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [9]:
# Define which regression algorithm we are using
classifier = LinearRegression()

In [10]:
# Fit (train) the model on the training set
classifier.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [11]:
# Print the score of the model on the test set
# (for LinearRegression this is R^2, the coefficient of determination)
print(classifier.score(X_test, y_test))

0.9765288432062672
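
For `LinearRegression`, `score` returns the coefficient of determination $R^2 = 1 - SS_{res} / SS_{tot}$, not a classification accuracy. The sketch below checks this on synthetic data; the data, the fixed 80/20 split by position, and all variable names are assumptions made for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal in two features plus noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=100)

# Fit on the first 80 rows, evaluate on the remaining 20
model = LinearRegression().fit(X_demo[:80], y_demo[:80])
X_te, y_te = X_demo[80:], y_demo[80:]

# Compute R^2 by hand from the residual and total sums of squares
pred = model.predict(X_te)
ss_res = np.sum((y_te - pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# Compare with the value reported by score()
r2_sklearn = model.score(X_te, y_te)
print("Manual and score() agree:", bool(np.isclose(r2_manual, r2_sklearn)))
```

A value close to 1 means the model explains most of the variance in the test labels; the score can be negative when the model does worse than predicting the mean.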
