# Linear Regression: Training and Testing
## Sources: 
1. <a href="https://pythonprogramming.net/training-testing-machine-learning-tutorial/" target="_blank">Python Programming: Regression - Training and Testing</a>
2. <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html" target="_blank">Scikit-learn: Importance of Feature Scaling</a>
3. <a href="https://www.geeksforgeeks.org/python-how-and-where-to-apply-feature-scaling/" target="_blank">GeeksforGeeks: Python | How and where to apply Feature Scaling?</a>
4. <a href="https://www.nhlbi.nih.gov/health/educational/lose_wt/BMI/bmicalc.htm" target="_blank">National Heart, Lung, and Blood Institute: Calculate Your Body Mass Index</a>

In the previous notebook, we learned what linear regression is and began preparing a dataset for a linear regression model by defining its features and label.  In this notebook, we will import the CSV file we left off with into a dataframe, and continue building our machine learning model.

### NumPy 

We are ready to move on to the actual machine learning part, but before we do, we must import NumPy.   The scikit-learn library is built on NumPy, and therefore works with NumPy arrays.

In [1]:
# Import dependencies
import pandas as pd
import numpy as np

from sklearn import preprocessing, svm
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LinearRegression

In [2]:
# Define file path to data,
# and import data into dataframe
import os
data_file_path = os.path.join("data", "googl_stock_data_features_and_label.csv")
googl_stock_data_features_and_label = pd.read_csv(data_file_path)

# Preview dataframe
print(googl_stock_data_features_and_label.head())

   Adj. Close  High_Low_Volatility_Percent  Daily_Percent_Change  Adj. Volume  \
0   50.322842                     8.441017              0.324968   44659000.0   
1   54.322689                     8.537313              7.227007   22834300.0   
2   54.869377                     4.062357             -1.227880   18256100.0   
3   52.597363                     7.753210             -5.726357   15247300.0   
4   53.164113                     3.966115              1.183658    9188600.0   

    Forecast  
0  69.078238  
1  67.839414  
2  68.912727  
3  70.668146  
4  71.219849  


In machine learning, features are generally denoted with an uppercase $X$, and labels are denoted with a lowercase $y$.

In [3]:
# Define X as everything except the Forecast column
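# (columns= is used because recent pandas versions no longer accept axis as a positional argument)
X = np.array(googl_stock_data_features_and_label.drop(columns=["Forecast"]))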

# Define y as the Forecast column
y = np.array(googl_stock_data_features_and_label["Forecast"])

### Feature Scaling

While it is not strictly necessary, you will often want to scale your features before training your machine learning model.  One common way to scale features is standardization (also called Z-score normalization).  This rescales each feature so that it has a mean of zero and a standard deviation of one.  It shifts and stretches the values, but it does not change the shape of their distribution.<br>
[2]
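
To make the definition concrete: standardization replaces each value $x$ in a feature column with $z = (x - \mu) / \sigma$, where $\mu$ and $\sigma$ are that column's mean and standard deviation.  The following is a minimal NumPy sketch of that formula.  The array `toy_X` and its values are made up for illustration and are not taken from the stock dataset.

```python
import numpy as np

# Toy feature matrix (illustrative values only, not from the stock dataset):
# column 0 is small in magnitude, column 1 is large.
toy_X = np.array([[1.0, 1000.0],
                  [2.0, 2000.0],
                  [3.0, 3000.0]])

# Per-column mean and (population) standard deviation
column_means = toy_X.mean(axis=0)
column_stds = toy_X.std(axis=0)

# z = (x - mean) / std, applied column by column via broadcasting
manual_scaled = (toy_X - column_means) / column_stds

print(manual_scaled)
print(manual_scaled.mean(axis=0))  # approximately 0 for each column
print(manual_scaled.std(axis=0))   # approximately 1 for each column
```

After scaling, both columns hold the same values even though the originals differed by a factor of 1000, because only each value's position relative to its own column's mean and spread is kept.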

#### Why scale the features?

Datasets often contain features of varying magnitudes, units, or ranges.  Algorithms that use Euclidean distance are sensitive to magnitude, and this can mislead the model during the training phase.<br>
[3]

To better understand this, imagine using age and Body Mass Index (BMI) to predict whether an individual is at risk of having a stroke.  Typical BMIs usually fall somewhere in the range of 18 to 30, giving a statistical range of 12.$^{[4]}$  Typical ages, on the other hand, usually fall somewhere in the range of 0 to 100, giving a statistical range of 100.

These two measurements, age and BMI, have different minimums, maximums, and statistical ranges.  Without scaling, the much larger spread of the age values could mislead the machine learning classifier into believing that age is more important when predicting whether an individual is at risk of having a stroke.

This may or may not be true.  Maybe age does hold more weight than BMI, and it's not a problem, or maybe age does hold more weight, but not as much as the classifier is led into believing, or maybe age does not hold more weight at all.

In any case, it is often recommended to scale the features before training the model.  Putting the features on a smaller scale could also potentially speed up training.
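
The age and BMI example can be sketched with a few lines of NumPy.  The three individuals below are hypothetical, and their values were picked only to make the effect easy to see.  Before scaling, a 20-year age gap looks almost twice as large as an 11-point BMI gap in terms of Euclidean distance; after standardizing each column, the two gaps come out the same size.

```python
import numpy as np

# Hypothetical individuals as (age in years, BMI); values are made up.
people = np.array([
    [25.0, 19.0],  # person A
    [25.0, 30.0],  # person B: same age as A, BMI 11 points higher
    [45.0, 19.0],  # person C: same BMI as A, 20 years older
])

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))

# Distances on the raw features: the age gap dominates.
raw_ab = euclidean(people[0], people[1])  # BMI gap only
raw_ac = euclidean(people[0], people[2])  # age gap only
print("raw:   ", raw_ab, raw_ac)

# Standardize each column (mean 0, standard deviation 1), then repeat.
scaled = (people - people.mean(axis=0)) / people.std(axis=0)
scaled_ab = euclidean(scaled[0], scaled[1])
scaled_ac = euclidean(scaled[0], scaled[2])
print("scaled:", scaled_ab, scaled_ac)
```

Note that the means and standard deviations here come from only three rows, so the exact scaled numbers mean little.  The point is that the units of each feature no longer decide which difference counts as "large".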

In [4]:
# Scale our features
X = preprocessing.scale(X)
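
As a quick sanity check, one can confirm that `preprocessing.scale` leaves every column with a mean of roughly zero and a standard deviation of roughly one.  The sketch below is self-contained and uses only the four feature columns of the five rows previewed earlier, not the full dataset, so the name `preview_X` is an assumption introduced here.  The same two checks could be run on the notebook's own `X`.

```python
import numpy as np
from sklearn import preprocessing

# Feature columns of the five previewed rows:
# Adj. Close, High_Low_Volatility_Percent, Daily_Percent_Change, Adj. Volume
preview_X = np.array([
    [50.322842, 8.441017, 0.324968, 44659000.0],
    [54.322689, 8.537313, 7.227007, 22834300.0],
    [54.869377, 4.062357, -1.227880, 18256100.0],
    [52.597363, 7.753210, -5.726357, 15247300.0],
    [53.164113, 3.966115, 1.183658, 9188600.0],
])

# Standardize each column independently
preview_scaled = preprocessing.scale(preview_X)

# Each column should now have mean ~0 and standard deviation ~1,
# even though Adj. Volume started out millions of times larger
# than the other features.
print(preview_scaled.mean(axis=0).round(6))
print(preview_scaled.std(axis=0).round(6))
```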