# Introduction to Linear Regression
## Sources: 
1. <a href="https://pythonprogramming.net/regression-introduction-machine-learning-tutorial/">Python Programming, Data Analysis, Machine Learning, Practical Machine Learning with Python: Regression - Intro and Data</a>

A fundamental technique in Machine Learning is <em>Linear Regression</em>.  The idea of linear regression is to find a <em>best-fit line</em> that lies as close as possible to the data points, so that given an input value $x$, you can predict the output value $y$.  Recall that the equation of a line is $y=mx+b$, where $m$ is the slope and $b$ is the $y$-intercept.

<img src="../Images/linear-regression-line.png" alt="best-fit line">
Image source: [1]
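
As a minimal sketch of what "best fit" means, the slope $m$ and intercept $b$ can be computed with the ordinary least-squares formulas. The function name `best_fit_line` and the sample points below are illustrative assumptions, not part of the tutorial's data.

```python
import numpy as np


def best_fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    x_dev = xs - xs.mean()
    y_dev = ys - ys.mean()
    # Slope: sum of products of deviations divided by sum of squared x deviations
    m = (x_dev * y_dev).sum() / (x_dev ** 2).sum()
    # Intercept: the least-squares line passes through the point of means
    b = ys.mean() - m * xs.mean()
    return m, b


# Hypothetical sample points that lie exactly on y = 2x + 1
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = best_fit_line(xs, ys)
print(f"m = {m:.2f}, b = {b:.2f}")

# Use the fitted line to predict y for a new input x = 6
print(m * 6 + b)
```

Because the sample points sit exactly on a line, the fit recovers that line's slope and intercept; with noisy real data the line would only approximate the points.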

In [1]:
# import dependencies
import pandas as pd
import numpy as np

# import quandl, the data source we use
import quandl

In [2]:
# Use quandl to get stock data for GOOGL
googl_stock_data = quandl.get("WIKI/GOOGL")

# preview data for GOOGL
print(googl_stock_data.head())

              Open    High     Low    Close      Volume  Ex-Dividend  \
Date                                                                   
2004-08-19  100.01  104.06   95.96  100.335  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.310  22834300.0          0.0   
2004-08-23  110.76  113.48  109.05  109.400  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.870  15247300.0          0.0   
2004-08-25  104.76  108.00  103.88  106.000   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High   Adj. Low  Adj. Close  \
Date                                                                   
2004-08-19          1.0  50.159839  52.191109  48.128568   50.322842   
2004-08-20          1.0  50.661387  54.708881  50.405597   54.322689   
2004-08-23          1.0  55.551482  56.915693  54.693835   54.869377   
2004-08-24          1.0  55.792225  55.972783  51.945350   52.597363   
2004-08-25          1.0  52.542193  54.167209  52.100830   53.1

In [3]:
# Check for missing values
print(googl_stock_data.isna().sum())

Open           0
High           0
Low            0
Close          0
Volume         0
Ex-Dividend    0
Split Ratio    0
Adj. Open      0
Adj. High      0
Adj. Low       0
Adj. Close     0
Adj. Volume    0
dtype: int64
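
Every count above is zero, so nothing needs cleaning here. As a hedged sketch of what the same check looks like when values are missing, the small dataframe below is made up for illustration (its values and the name `sample_prices` are assumptions). Dropping incomplete rows is only one of several ways to handle gaps.

```python
import numpy as np
import pandas as pd

# Hypothetical data with two missing entries
sample_prices = pd.DataFrame({
    "Adj. Open": [50.0, np.nan, 55.0],
    "Adj. Close": [51.0, 54.0, np.nan],
    "Adj. Volume": [100.0, 200.0, 300.0],
})

# Count missing values per column, as in the cell above
missing_counts = sample_prices.isna().sum()
print(missing_counts)

# One simple option: drop every row that contains a missing value
cleaned = sample_prices.dropna()
print(cleaned)
```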


In machine learning, we often have more data than we need. Once we get to the "training" step, this can cause problems, because too much unnecessary data can confuse or bias the model.

In this case, we don't need both regular prices as well as adjusted prices.  We will proceed with only the adjusted prices.

In [4]:
# Create new dataframe with only relevant data
googl_stock_data_meaningful_data = googl_stock_data[[
    "Adj. Open",
    "Adj. Close",
    "Adj. High",
    "Adj. Low",
    "Adj. Volume",
]]
# Preview new dataframe with only relevant data
print(googl_stock_data_meaningful_data.head())

            Adj. Open  Adj. Close  Adj. High   Adj. Low  Adj. Volume
Date                                                                
2004-08-19  50.159839   50.322842  52.191109  48.128568   44659000.0
2004-08-20  50.661387   54.322689  54.708881  50.405597   22834300.0
2004-08-23  55.551482   54.869377  56.915693  54.693835   18256100.0
2004-08-24  55.792225   52.597363  55.972783  51.945350   15247300.0
2004-08-25  52.542193   53.164113  54.167209  52.100830    9188600.0


An important thing to keep in mind when working with data is how meaningful it is.  A large amount of data is not necessarily a good thing if the data is not meaningful.

In this case, we have open and close prices, which we will use to calculate the daily percent change, as well as high and low prices, which we will use to calculate the daily high-low volatility.
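
Written out, the two derived quantities are as follows, where $O$, $C$, $H$, and $L$ are shorthand (introduced here, not used in the code) for a given day's adjusted open, close, high, and low prices:

$$
\text{Daily percent change} = \frac{C - O}{O} \times 100,
\qquad
\text{High-low volatility percent} = \frac{H - L}{L} \times 100
$$

For the first row of the data (2004-08-19), this gives $(50.322842 - 50.159839) / 50.159839 \times 100 \approx 0.325$ and $(52.191109 - 48.128568) / 48.128568 \times 100 \approx 8.441$.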

In [5]:
# Calculate daily percent change and store it in a new column
googl_stock_data_meaningful_data["Daily_Percent_Change"] = (
    (googl_stock_data_meaningful_data["Adj. Close"]
     - googl_stock_data_meaningful_data["Adj. Open"])
    / googl_stock_data_meaningful_data["Adj. Open"]
    * 100
)
# Calculate high-low volatility percent and store it in a new column
googl_stock_data_meaningful_data["High_Low_Volatility_Percent"] = (
    (googl_stock_data_meaningful_data["Adj. High"]
     - googl_stock_data_meaningful_data["Adj. Low"])
    / googl_stock_data_meaningful_data["Adj. Low"]
    * 100
)
# Preview dataframe now with more meaningful columns
print(googl_stock_data_meaningful_data.head())

            Adj. Open  Adj. Close  Adj. High   Adj. Low  Adj. Volume  \
Date                                                                   
2004-08-19  50.159839   50.322842  52.191109  48.128568   44659000.0   
2004-08-20  50.661387   54.322689  54.708881  50.405597   22834300.0   
2004-08-23  55.551482   54.869377  56.915693  54.693835   18256100.0   
2004-08-24  55.792225   52.597363  55.972783  51.945350   15247300.0   
2004-08-25  52.542193   53.164113  54.167209  52.100830    9188600.0   

            Daily_Percent_Change  High_Low_Volatility_Percent  
Date                                                           
2004-08-19              0.324968                     8.441017  
2004-08-20              7.227007                     8.537313  
2004-08-23             -1.227880                     4.062357  
2004-08-24             -5.726357                     7.753210  
2004-08-25              1.183658                     3.966115  


The cell above also makes pandas emit a `SettingWithCopyWarning`, once for each new column, because `googl_stock_data_meaningful_data` was created by selecting columns from `googl_stock_data`:

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
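
One common way to avoid the `SettingWithCopyWarning` above is to take an explicit copy when selecting columns, so that later column assignments clearly target a new, independent dataframe. The sketch below uses a small made-up dataframe (the names `prices` and `subset` and all values are assumptions) rather than the Quandl data.

```python
import pandas as pd

# Hypothetical stand-in for the downloaded stock data
prices = pd.DataFrame({
    "Adj. Open": [50.0, 40.0],
    "Adj. Close": [55.0, 38.0],
    "Volume": [1000.0, 2000.0],
})

# .copy() makes the subset an independent dataframe rather than a slice
subset = prices[["Adj. Open", "Adj. Close"]].copy()

# Adding a column to the explicit copy does not trigger the warning
subset["Daily_Percent_Change"] = (
    (subset["Adj. Close"] - subset["Adj. Open"]) / subset["Adj. Open"] * 100
)
print(subset)
```

The original `prices` dataframe is left unchanged; only `subset` gains the new column.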
