# Simple Linear Regression

Here is a simple example of regression to get started.

The idea behind regression is to take continuous data and find the best-fit line through it. This contrasts with classification, where data points are sorted into discrete classes. Here, each data point is treated on its own, and we try to model the data as a whole. With simple linear regression, that model is a straight line. As with any supervised learning task, we work with features and labels to train and test the model.

The algorithm is based on the equation for a straight line, $y = mx + b$. The whole point of regression is to figure out the right values for the slope $m$ and the intercept $b$.
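To make the search for $m$ and $b$ concrete, here is a minimal sketch of an ordinary least-squares fit on toy data. The helper name `fit_line` and the sample points are hypothetical, chosen only for illustration; this is one standard way to fit a line, not necessarily the method used later in this notebook.

```python
import numpy as np

def fit_line(x, y):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: the fitted line passes through the point of means.
    b = y_mean - m * x_mean
    return m, b

# Hypothetical toy data lying exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)
```

Because the toy points lie exactly on a line, the fit recovers that line's slope and intercept. Real stock data is noisy, so the fitted line will only approximate it.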

Regression shows up often in stock price analysis. It is a natural fit there because the features of a stock, such as prices and volume, are continuous. In this example, we build a simple linear regression model on stock data.

---


## Import libraries

### Why Quandl?

Quandl is a premier publisher of alternative data for institutional investors. A dedicated team of data scientists, quants and engineers combine uncompromising curation, high quality standards and experienced data science application to provide some of the most powerful data available today. Quandl also publishes free data, scraped from the web and delivered via Nasdaq Data Link’s industry-leading data delivery platform. For more information about Quandl, see [this page](https://data.nasdaq.com/publishers/qdl).


In [2]:
import pandas as pd
import quandl


Each dataset on the Nasdaq Data Link site has its own code that lets you access it through the Quandl API. You can also copy the full line of code needed to import the dataset from its page. Here we store the downloaded dataset in a pandas dataframe.


In [3]:
df = quandl.get("WIKI/GOOGL")


### Checking out the dataset

Print out the tail end of the dataframe.


In [None]:
df.tail()


| Date | Open | High | Low | Close | Volume | Ex-Dividend | Split Ratio | Adj. Open | Adj. High | Adj. Low | Adj. Close | Adj. Volume |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2018-03-21 | 1092.57 | 1108.7 | 1087.21 | 1094.0 | 1990515.0 | 0.0 | 1.0 | 1092.57 | 1108.7 | 1087.21 | 1094.0 | 1990515.0 |
| 2018-03-22 | 1080.01 | 1083.92 | 1049.64 | 1053.15 | 3418154.0 | 0.0 | 1.0 | 1080.01 | 1083.92 | 1049.64 | 1053.15 | 3418154.0 |
| 2018-03-23 | 1051.37 | 1066.78 | 1024.87 | 1026.55 | 2413517.0 | 0.0 | 1.0 | 1051.37 | 1066.78 | 1024.87 | 1026.55 | 2413517.0 |
| 2018-03-26 | 1050.6 | 1059.27 | 1010.58 | 1054.09 | 3272409.0 | 0.0 | 1.0 | 1050.6 | 1059.27 | 1010.58 | 1054.09 | 3272409.0 |
| 2018-03-27 | 1063.9 | 1064.54 | 997.62 | 1006.94 | 2940957.0 | 0.0 | 1.0 | 1063.9 | 1064.54 | 997.62 | 1006.94 | 2940957.0 |


Printing out dataframe info.

In [None]:
df.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 3424 entries, 2004-08-19 to 2018-03-27
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Open         3424 non-null   float64
 1   High         3424 non-null   float64
 2   Low          3424 non-null   float64
 3   Close        3424 non-null   float64
 4   Volume       3424 non-null   float64
 5   Ex-Dividend  3424 non-null   float64
 6   Split Ratio  3424 non-null   float64
 7   Adj. Open    3424 non-null   float64
 8   Adj. High    3424 non-null   float64
 9   Adj. Low     3424 non-null   float64
 10  Adj. Close   3424 non-null   float64
 11  Adj. Volume  3424 non-null   float64
dtypes: float64(12)
memory usage: 347.8 KB


Extract the column names of the dataframe into a list.

In [None]:
columns = list(df.columns)
columns


['Open',
 'High',
 'Low',
 'Close',
 'Volume',
 'Ex-Dividend',
 'Split Ratio',
 'Adj. Open',
 'Adj. High',
 'Adj. Low',
 'Adj. Close',
 'Adj. Volume']

Each column in a dataset is a feature, but machine learning models thrive on meaningful features. For highly correlated columns, i.e., columns that have a strong relationship with each other, you can simplify the dataset by keeping one of them as a representative of the rest.
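One way to spot such redundant columns is to look at pairwise correlations. The sketch below uses synthetic data, not the Quandl dataset; the dataframe `toy`, the random seed, and the 0.95 threshold are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic prices: 'High' and 'Low' track 'Open' closely,
# while 'Volume' is generated independently.
rng = np.random.default_rng(0)
open_ = 100 + rng.normal(0, 1, 200).cumsum()
toy = pd.DataFrame({
    "Open": open_,
    "High": open_ + rng.uniform(0, 1, 200),
    "Low": open_ - rng.uniform(0, 1, 200),
    "Volume": rng.uniform(1e6, 5e6, 200),
})

# Pairwise Pearson correlation between all numeric columns.
corr = toy.corr()
print(corr.round(2))

# Column pairs whose absolute correlation exceeds the chosen threshold.
threshold = 0.95
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) > threshold
]
print(pairs)
```

Pairs that land above the threshold are candidates for keeping just one column. The right cutoff depends on the data and the model, so treat 0.95 as a starting point rather than a rule.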

### Updating the dataframe

We create a list of the columns we need; the rest of the dataframe holds redundant data. We then add two more features, the calculated High-Low and Change percentages.

In [None]:
newColumns = [
    'Adj. Open',
    'Adj. High',
    'Adj. Low',
    'Adj. Close',
    'Adj. Volume',
]

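# Take an explicit copy so that new columns can be added later
# without triggering pandas' SettingWithCopyWarning.
df = df[newColumns].copy()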


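Creating those new features. Note that `High-Low %`, as computed here, measures how far the adjusted high sits above the adjusted close, relative to the close, while `% Change` is the move from adjusted open to adjusted close, relative to the open.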

In [None]:
df['High-Low %'] = ((df['Adj. High'] - df['Adj. Close']) / df['Adj. Close']) * 100
df['% Change'] = ((df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open']) * 100


If `df` is a plain slice of the original dataframe, pandas emits a `SettingWithCopyWarning` at this point ("A value is trying to be set on a copy of a slice from a DataFrame"). Taking an explicit copy with `df[newColumns].copy()` when selecting the columns avoids it. See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
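One way to sidestep that warning is to take an explicit copy of the selected columns before adding new ones. The following is a minimal sketch on a hypothetical two-row frame (`raw` and `subset` are illustrative names, and the values are made up), not the stock data itself.

```python
import pandas as pd

# Hypothetical two-row frame standing in for the stock data.
raw = pd.DataFrame({
    "Adj. Open": [50.0, 51.0],
    "Adj. Close": [50.5, 50.0],
    "Split Ratio": [1.0, 1.0],
})

# .copy() makes the subset an independent dataframe, so new columns
# can be added to it without affecting the original frame.
subset = raw[["Adj. Open", "Adj. Close"]].copy()
subset["% Change"] = (
    (subset["Adj. Close"] - subset["Adj. Open"]) / subset["Adj. Open"] * 100
)

print(subset)
print("% Change" in raw.columns)
```

The original frame stays untouched, which makes it clear which object owns the derived column.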


Preview the first rows of the updated dataframe.

In [None]:
df.head()


| Date | Adj. Open | Adj. High | Adj. Low | Adj. Close | Adj. Volume | High-Low % | % Change |
|---|---|---|---|---|---|---|---|
| 2004-08-19 | 50.159839 | 52.191109 | 48.128568 | 50.322842 | 44659000.0 | 3.712563 | 0.324968 |
| 2004-08-20 | 50.661387 | 54.708881 | 50.405597 | 54.322689 | 22834300.0 | 0.710922 | 7.227007 |
| 2004-08-23 | 55.551482 | 56.915693 | 54.693835 | 54.869377 | 18256100.0 | 3.729433 | -1.22788 |
| 2004-08-24 | 55.792225 | 55.972783 | 51.94535 | 52.597363 | 15247300.0 | 6.417469 | -5.726357 |
| 2004-08-25 | 52.542193 | 54.167209 | 52.10083 | 53.164113 | 9188600.0 | 1.886792 | 1.183658 |
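As a sanity check on the two derived columns, the arithmetic can be reproduced by hand for the 2004-08-19 row shown in the output above. The scalar variable names below are illustrative. The inputs are copied from the displayed (rounded) values, so the results should land close to the tabulated 3.712563 and 0.324968 rather than match every digit.

```python
# Values copied from the 2004-08-19 row of the output above.
adj_open = 50.159839
adj_high = 52.191109
adj_close = 50.322842

# Same formulas as the notebook cell, applied to scalars.
high_low_pct = (adj_high - adj_close) / adj_close * 100
pct_change = (adj_close - adj_open) / adj_open * 100

print(round(high_low_pct, 4), round(pct_change, 4))
```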