# Simple Linear Regression

Here is a simple example of regression to get started.

The idea of regression is to take continuous data and find a best-fit line through it. This contrasts with classification, where data points are assigned to discrete classes. In regression, each data point is treated individually, and we try to model the data as a whole. Simple linear regression models the data with a straight line. Like other supervised learning methods, it uses features and labels to train and test the model.

The entire algorithm is based on the equation of a straight line, $y = mx + b$. The goal of regression is to find the values of $m$ (the slope) and $b$ (the intercept) that best fit the data.
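
To make the search for $m$ and $b$ concrete, here is a minimal sketch of an ordinary least-squares fit with NumPy. The helper name `fit_line` and the sample points are made up for illustration and are not part of the original notebook; the points were chosen to lie exactly on $y = 2x + 1$, so the fit should recover a slope of 2 and an intercept of 1.

```python
import numpy as np


def fit_line(x, y):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    m = ((x - x_mean) * (y - y_mean)).sum() / ((x - x_mean) ** 2).sum()
    # Intercept: the least-squares line passes through the point of means.
    b = y_mean - m * x_mean
    return m, b


# Hypothetical points that lie exactly on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
m, b = fit_line(xs, ys)
print(m, b)
```

On real, noisy data the points will not sit on a single line, and the same formulas return the line that minimizes the sum of squared vertical distances to the points.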

Regression comes up often in stock price analysis. It is a natural fit there because most features of a stock, such as price and volume, are continuous. In this example, we build a simple linear regression model on stock data.

---


## Import libraries

### Why Quandl?

Quandl is a premier publisher of alternative data for institutional investors. A dedicated team of data scientists, quants and engineers combine uncompromising curation, high quality standards and experienced data science application to provide some of the most powerful data available today. Quandl also publishes free data, scraped from the web and delivered via Nasdaq Data Link’s industry-leading data delivery platform. For more information about Quandl, see [this page](https://data.nasdaq.com/publishers/qdl).


In [1]:
import pandas as pd
import quandl


Each dataset on the Nasdaq Data Link site has its own code, which you use to access the dataset through the Quandl API. You can also copy the full line of code needed to import the dataset from its page. Here we store the downloaded dataset in a Pandas dataframe.


In [2]:
df = quandl.get("FINRA/FNYX_GOOGL")
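
If the API call is not available (for example, without network access), a small offline stand-in can be useful for following along. The sketch below rebuilds the first five rows of the printed output shown further down as a dataframe with the same column names and a date index. The variable name `sample` is an assumption for illustration, and this is only a fragment of the real dataset.

```python
import pandas as pd

# First five rows copied from the printed output of the real dataset.
sample = pd.DataFrame(
    {
        "ShortVolume": [88639.0, 100488.0, 83192.0, 65126.0, 50757.0],
        "ShortExemptVolume": [0.0, 0.0, 0.0, 0.0, 0.0],
        "TotalVolume": [220457.0, 266648.0, 175611.0, 209328.0, 156089.0],
    },
    # Parse the date strings so the index is a DatetimeIndex, like the real one.
    index=pd.to_datetime(
        ["2014-04-03", "2014-04-04", "2014-04-07", "2014-04-08", "2014-04-09"]
    ),
)
sample.index.name = "Date"
print(sample)
```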


Print out the dataframe.


In [3]:
df


| Date | ShortVolume | ShortExemptVolume | TotalVolume |
|---|---|---|---|
| 2014-04-03 | 88639.0 | 0.0 | 220457.0 |
| 2014-04-04 | 100488.0 | 0.0 | 266648.0 |
| 2014-04-07 | 83192.0 | 0.0 | 175611.0 |
| 2014-04-08 | 65126.0 | 0.0 | 209328.0 |
| 2014-04-09 | 50757.0 | 0.0 | 156089.0 |
| ... | ... | ... | ... |
| 2022-12-23 | 759236.0 | 7882.0 | 1560812.0 |
| 2022-12-27 | 199860.0 | 1485.0 | 1120981.0 |
| 2022-12-28 | 184037.0 | 5633.0 | 895723.0 |
| 2022-12-29 | 220496.0 | 2474.0 | 1130918.0 |


Print out the dataframe info.

In [4]:
df.info()


<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2197 entries, 2014-04-03 to 2022-12-30
Data columns (total 3 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   ShortVolume        2197 non-null   float64
 1   ShortExemptVolume  2197 non-null   float64
 2   TotalVolume        2197 non-null   float64
dtypes: float64(3)
memory usage: 68.7 KB
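
The `info()` summary is meant for reading. When the same checks need to run in code, the non-null counts and dtypes can be tested directly. The sketch below does this on a hypothetical three-row frame named `sample`, copied from the first rows of the printed dataframe; it is an illustration, not a step from the original notebook.

```python
import pandas as pd

# Three rows copied from the printed dataframe, used as a stand-in.
sample = pd.DataFrame(
    {
        "ShortVolume": [88639.0, 100488.0, 83192.0],
        "ShortExemptVolume": [0.0, 0.0, 0.0],
        "TotalVolume": [220457.0, 266648.0, 175611.0],
    },
    index=pd.to_datetime(["2014-04-03", "2014-04-04", "2014-04-07"]),
)

# Number of missing values in each column.
missing_per_column = sample.isna().sum()

# True only if every column holds floating-point values.
all_float = all(pd.api.types.is_float_dtype(t) for t in sample.dtypes)

print(missing_per_column)
print(all_float)
```

Regression needs numeric inputs with no missing values, so these two checks are a reasonable precondition before fitting a model.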


Extract the column names of the dataframe into a list.

In [5]:
columns = list(df.columns)
columns


['ShortVolume', 'ShortExemptVolume', 'TotalVolume']

Print the details of the dataframe's index.

In [6]:
df.index


DatetimeIndex(['2014-04-03', '2014-04-04', '2014-04-07', '2014-04-08',
               '2014-04-09', '2014-04-10', '2014-04-11', '2014-04-14',
               '2014-04-15', '2014-04-16',
               ...
               '2022-12-16', '2022-12-19', '2022-12-20', '2022-12-21',
               '2022-12-22', '2022-12-23', '2022-12-27', '2022-12-28',
               '2022-12-29', '2022-12-30'],
              dtype='datetime64[ns]', name='Date', length=2197, freq=None)
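
A `DatetimeIndex` allows two things that are handy when preparing data for regression: selecting rows by date range, and turning dates into a numeric feature. The sketch below shows one possible way to do both on a hypothetical five-row frame named `sample`, copied from the printed output. It is not necessarily how the rest of this notebook proceeds.

```python
import pandas as pd

idx = pd.DatetimeIndex(
    ["2014-04-03", "2014-04-04", "2014-04-07", "2014-04-08", "2014-04-09"],
    name="Date",
)
sample = pd.DataFrame(
    {"TotalVolume": [220457.0, 266648.0, 175611.0, 209328.0, 156089.0]},
    index=idx,
)

# Label-based slicing accepts date strings, and both endpoints are included.
window = sample.loc["2014-04-04":"2014-04-08"]

# Days elapsed since the first observation, usable as a numeric x value.
days = (sample.index - sample.index[0]).days

print(window)
print(list(days))
```

Weekends and holidays are absent from the index, so the day counts are not evenly spaced.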