# Linear Regression: Training and Testing
## Sources: 
1. <a href="https://pythonprogramming.net/training-testing-machine-learning-tutorial/" target="_blank">Python Programming: Regression - Training and Testing</a>

In the previous notebooks, we learned what linear regression is, what features and labels are, and why and how to scale the features.  In this notebook, we will split our data into training and testing sets, then train and evaluate a linear regression model.

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [2]:
# Import data
import os
data_file_path = os.path.join("Data","stock_data_features_and_label.csv")

stock_data_features_and_label = pd.read_csv(data_file_path)
print(stock_data_features_and_label.head())

   Adj. Close  High_Low_Volatility_Percent  Daily_Percent_Change  Adj. Volume  \
0   50.322842                     8.441017              0.324968   44659000.0   
1   54.322689                     8.537313              7.227007   22834300.0   
2   54.869377                     4.062357             -1.227880   18256100.0   
3   52.597363                     7.753210             -5.726357   15247300.0   
4   53.164113                     3.966115              1.183658    9188600.0   

    Forecast  
0  69.078238  
1  67.839414  
2  68.912727  
3  70.668146  
4  71.219849  


In [3]:
# Define features
X = np.array(stock_data_features_and_label.drop(columns=["Forecast"]))

# Define label
y = np.array(stock_data_features_and_label["Forecast"])

In [4]:
# Scale features
X = preprocessing.scale(X)
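
As a quick check of what `preprocessing.scale` does with its default settings, the sketch below standardizes a small made-up array and confirms that each column ends up with mean 0 and standard deviation 1.  The array `X_toy` and its values are illustrative assumptions, not taken from the stock data.

```python
import numpy as np
from sklearn import preprocessing

# Toy feature matrix (hypothetical values): two columns on different scales
X_toy = np.array([[50.0, 446.0],
                  [54.0, 228.0],
                  [52.0, 152.0],
                  [53.0,  91.0]])

X_toy_scaled = preprocessing.scale(X_toy)

# Each column is centered on 0 and divided by its standard deviation
col_means = X_toy_scaled.mean(axis=0)
col_stds = X_toy_scaled.std(axis=0)

print(col_means)
print(col_stds)
```

After scaling, both columns are on a comparable scale regardless of their original units.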

### Note!

Recall that we engineered the label column (the Forecast column) by shifting the values in the Adj. Close column up by 35 rows.  As a result, the last 35 rows of the Forecast column have no value.  We can see this more clearly by looking at the last 37 rows of the dataframe.

In [5]:
# Preview the last 37 rows of data
print(stock_data_features_and_label.tail(37))

      Adj. Close  High_Low_Volatility_Percent  Daily_Percent_Change  \
3387     1119.20                     1.811604             -0.729098   
3388     1068.76                     5.512236             -2.893850   
3389     1084.43                     5.569849              4.879205   
3390     1055.41                     3.025734             -2.724499   
3391     1005.60                     5.851043             -5.120439   
3392     1043.43                     5.488465              1.710726   
3393     1054.56                     1.920631             -0.199684   
3394     1054.14                     1.365911              0.394286   
3395     1072.70                     2.445228              1.743304   
3396     1091.36                     2.517733              0.730075   
3397     1095.50                     1.535431              0.193894   
3398     1103.59                     2.411927              0.991068   
3399     1113.75                     2.590496              0.419259   
3400  

Therefore, we should also trim our features $X$ and label $y$ so that they exclude the last 35 values.  That is, we will keep only the first $\alpha - 35$ values, where $\alpha$ is the length of the array.

In [6]:
# Define the number of days out we want to forecast
# In this case, we forecast out 1% of the dataframe's length, rounded up
import math
forecast_out = math.ceil(0.01*len(stock_data_features_and_label))

# Only include values up until the row before the Forecast values become null
# or rather, exclude the last 1% of values
# A slice of the form [:-n] keeps everything
# except the last n elements
X = X[:-forecast_out]
y = y[:-forecast_out]
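
To see why the trim is needed, here is a minimal sketch on a toy dataframe.  It assumes the label was built with pandas `shift(-n)`, which is one common way to do it and may differ from the earlier notebook; the frame `toy`, its values, and `n = 2` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy price series (hypothetical values) standing in for "Adj. Close"
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]})

# Build the label by shifting prices up n rows (assumed approach)
n = 2
toy["Forecast"] = toy["Adj. Close"].shift(-n)

# The last n rows have no future price to draw from, so their label is NaN
n_missing = int(toy["Forecast"].isna().sum())

# Trim the last n rows from both the features and the label
X_toy = np.array(toy.drop(columns=["Forecast"]))[:-n]
y_toy = np.array(toy["Forecast"])[:-n]

print(n_missing, len(X_toy), len(y_toy))
```

After the trim, every remaining feature row has a real label, and the two arrays have the same length.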

In [7]:
# Confirm features and labels are the same length
length_of_features = len(X)
length_of_label = len(y)

print(f"Length of features: {length_of_features}")
print(f"Length of label: {length_of_label}")

Length of features: 3389
Length of label: 3389


### Training and Testing

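When building a machine learning model, you will usually split your data into training and testing sets.  The training set is passed to the model (named `classifier` in the code below, although linear regression is a regression model rather than a classifier) so that it can fit the equation used to make predictions.  The testing set, which the model does not see during training, is then used to measure how well its predictions generalize.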

In [51]:
# Define training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
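
Because `train_test_split` shuffles the rows at random by default, the split (and therefore the score further down) can change from run to run.  The sketch below, on hypothetical toy arrays, shows how `test_size=0.2` divides the data and how the optional `random_state` parameter makes the split repeatable.  The value `0` is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data (hypothetical): 10 samples with 2 features each
X_toy = np.arange(20, dtype=float).reshape(10, 2)
y_toy = np.arange(10, dtype=float)

# random_state fixes the shuffle so the same split is produced on every run
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0
)

# 80% of the rows go to training and 20% to testing
print(X_tr.shape, X_te.shape)
```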

In [52]:
# Define which regression algorithm we are using
classifier = LinearRegression()

In [53]:
# Fit (train) the model
classifier.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [60]:
# Print the model's score on the test set
# (for LinearRegression, this is the coefficient of determination, R^2)
print(classifier.score(X_test, y_test))

0.9788606958521906
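
For a `LinearRegression` model, `score` returns the coefficient of determination $R^2$ rather than a classification accuracy.  The sketch below computes $R^2$ by hand on synthetic data and compares it with `score`.  The data, the coefficients, and the name `model` are assumptions made for illustration and are unrelated to the stock data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (hypothetical): a noisy linear relationship
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = X_toy @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Train on the first 160 rows, hold out the last 40 for testing
model = LinearRegression().fit(X_toy[:160], y_toy[:160])
X_te, y_te = X_toy[160:], y_toy[160:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_pred = model.predict(X_te)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, model.score(X_te, y_te))
```

A value close to 1 means the model explains most of the variance in the test labels.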
