<a href="https://colab.research.google.com/github/DinDev3/Python-Machine-Learning/blob/master/Machine-Learning-with-Python/Regression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Linear Regression

Install packages
- pip install scikit-learn
- pip install pandas
- pip install quandl

---
Stock Prices


In [0]:
# !pip --version
# !pip install --upgrade pip
!pip install scikit-learn
!pip install pandas
!pip install quandl

In [0]:
import pandas as pd
import quandl
import math
import numpy as np
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

In [8]:
# storing the tabled data received from quandl in a pandas dataframe
df = quandl.get("WIKI/GOOGL")       # https://www.quandl.com/

print(df.head())

              Open    High     Low  ...   Adj. Low  Adj. Close  Adj. Volume
Date                                ...                                    
2004-08-19  100.01  104.06   95.96  ...  48.128568   50.322842   44659000.0
2004-08-20  101.01  109.08  100.50  ...  50.405597   54.322689   22834300.0
2004-08-23  110.76  113.48  109.05  ...  54.693835   54.869377   18256100.0
2004-08-24  111.24  111.60  103.57  ...  51.945350   52.597363   15247300.0
2004-08-25  104.76  108.00  103.88  ...  52.100830   53.164113    9188600.0

[5 rows x 12 columns]


## Feature Engineering
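Each column is a feature here.

Not all of these columns are needed to find a pattern.

A deep learning model can learn relationships between features on its own, but linear regression only fits a weighted sum of its inputs, so any useful relationship has to be defined explicitly as a new feature.

We want to keep only meaningful features.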
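
As a small worked example of defining a relationship explicitly, the sketch below computes two percentage features on a toy dataframe. The prices are made up for illustration (they are not real GOOGL data), and the `toy` name is hypothetical.

```python
import pandas as pd

# Made-up prices, for illustration only (not real GOOGL data)
toy = pd.DataFrame({
    'Adj. Open':  [100.0, 102.0],
    'Adj. High':  [105.0, 104.0],
    'Adj. Low':   [95.0, 100.0],
    'Adj. Close': [100.0, 103.0],
})

# daily high-low range, as a percentage of the close
toy['HL_PCT'] = (toy['Adj. High'] - toy['Adj. Low']) / toy['Adj. Close'] * 100
# move from open to close, as a percentage of the open
toy['PCT_Change'] = (toy['Adj. Close'] - toy['Adj. Open']) / toy['Adj. Open'] * 100

print(toy[['HL_PCT', 'PCT_Change']])
```

On the first row the range is 10 on a close of 100, so `HL_PCT` is 10%, and the close equals the open, so `PCT_Change` is 0%.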

In [9]:
# listing the columns that we want to keep
# recreating the dataframe to contain only those columns
# .copy() makes an independent dataframe, so adding columns below doesn't raise SettingWithCopyWarning
df = df[['Adj. Open', 'Adj. High','Adj. Low','Adj. Close','Adj. Volume']].copy()

# print(df.head())

# defining special relationships, to use as features
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Close'] * 100
df['PCT_Change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100

# creating a new dataframe with the special features
# .copy() again, because a 'label' column is added to it later
df = df[['Adj. Close','HL_PCT','PCT_Change','Adj. Volume']].copy()

print(df.head())


            Adj. Close    HL_PCT  PCT_Change  Adj. Volume
Date                                                     
2004-08-19   50.322842  8.072956    0.324968   44659000.0
2004-08-20   54.322689  7.921706    7.227007   22834300.0
2004-08-23   54.869377  4.049360   -1.227880   18256100.0
2004-08-24   52.597363  7.657099   -5.726357   15247300.0
2004-08-25   53.164113  3.886792    1.183658    9188600.0


Features are the attributes (inputs) that the model uses to predict the label.
The label is the value we want to predict.

In [27]:
# the column that we want to predict
forecast_col = 'Adj. Close'

# fill missing values (NaN) with an outlier sentinel, so that the rows are kept
df.fillna(-99999, inplace=True)

# math.ceil => round a number up to the next whole number
# forecast horizon: 1% of the number of rows in the dataframe
forecast_out = int(math.ceil(0.01*len(df)))

# each row's features will be used to predict the Adj. Close forecast_out rows (trading days) later

df['label'] = df[forecast_col].shift(-forecast_out)
# shift(-forecast_out) moves the column's values up the table
# this way each row's label is the Adj. Close forecast_out days into the future

print(df.head())

# df.dropna() => Drop the rows where at least one element is missing.
# (inplace=True) => Keep the DataFrame with valid entries in the same variable.
df.dropna(inplace=True)
print(df.tail())

            Adj. Close    HL_PCT  PCT_Change  Adj. Volume      label
Date                                                                
2004-08-19   50.322842  8.072956    0.324968   44659000.0  69.639972
2004-08-20   54.322689  7.921706    7.227007   22834300.0  69.078238
2004-08-23   54.869377  4.049360   -1.227880   18256100.0  67.839414
2004-08-24   52.597363  7.657099   -5.726357   15247300.0  68.912727
2004-08-25   53.164113  3.886792    1.183658    9188600.0  70.668146
            Adj. Close    HL_PCT  PCT_Change  Adj. Volume    label
Date                                                              
2017-08-31      955.24  1.145785    0.944732    1672387.0  1001.84
2017-09-01      951.99  0.845597   -0.572342    1034769.0  1005.07
2017-09-05      941.48  1.676658   -0.568194    1455058.0   985.54
2017-09-06      942.02  1.254750   -0.196002    1375952.0   988.49
2017-09-07      949.89  1.365053    0.597299    1103286.0   991.46


ML algorithms can't work with NA (not available / missing) data.

Missing values need to be replaced with something, or the entire row has to be dropped (not ideal in the real world, because we'd be losing data).
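
The two options can be compared on a toy dataframe. This is a minimal sketch with made-up values; the `toy`, `filled` and `dropped` names are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up values, with one missing entry in each of two rows
toy = pd.DataFrame({
    'price':  [10.0, np.nan, 12.0],
    'volume': [100.0, 200.0, np.nan],
})

# option 1: replace missing values with an outlier sentinel (all rows are kept)
filled = toy.fillna(-99999)

# option 2: drop every row that has at least one missing value
dropped = toy.dropna()

print(len(filled), len(dropped))
```

Filling keeps all three rows, while dropping leaves only the one row that had no missing values.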


---

Each row's features are used to predict the Adj. Close a fixed number of days into the future. That number of days is 1% of the length of the entire dataset.
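
To see what the negative shift does, here is a minimal sketch on a five-row toy dataframe. The values are made up and the horizon of 2 rows is an assumption chosen to keep the example small.

```python
import pandas as pd

# Made-up closing prices, for illustration only
toy = pd.DataFrame({'Adj. Close': [10.0, 11.0, 12.0, 13.0, 14.0]})

forecast_out = 2  # hypothetical horizon, in rows

# each row's label is the Adj. Close from forecast_out rows later
toy['label'] = toy['Adj. Close'].shift(-forecast_out)
print(toy)

# the last forecast_out rows have no future value, so their label is NaN
labelled = toy.dropna()
print(len(labelled))
```

The last `forecast_out` rows end up without a label, which is why `dropna` is called after the shift.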



In [0]:
# defining feature and label
# feature
X = np.array(df.drop(columns=['label']))    # df.drop(columns=['label']) returns a new dataframe without the label column
# label
y = np.array(df['label'])

# standardizing each feature column to zero mean and unit variance
# to help with training and testing
# can be skipped for high frequency (trading), because processing time increases
X = preprocessing.scale(X)

# print(len(X), len(y))     # making sure that X & y have the same length

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


Splitting the data into training and testing sets.

The testing set is 20% of the data; the remaining 80% is used for training.
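
A minimal sketch of scaling and splitting on synthetic data, to show the resulting shapes. The arrays are made up, and `random_state=0` is an assumption added only to make the split reproducible; the notebook's call does not set it.

```python
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

# Synthetic data: 10 samples with 2 features each
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.arange(10, dtype=float)

# standardize each feature column to zero mean and unit variance
X_scaled = preprocessing.scale(X)

# hold out 20% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

With 10 samples and `test_size=0.2`, 8 samples go to training and 2 to testing.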


In [28]:
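# fit the model

# define the model (LinearRegression is a regressor, not a classifier)
clf = LinearRegression()
# training
clf.fit(X_train, y_train)
# testing: for a regressor, score() returns R^2 (coefficient of determination), not a classification accuracy
accuracy = clf.score(X_test, y_test)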

print('accuracy: ',accuracy)

accuracy:  0.9766740741822422
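
The value printed as `accuracy` comes from `score`, which for scikit-learn regressors is the coefficient of determination, R². The sketch below checks this on synthetic data by computing R² by hand. The data, the coefficients and the `model` name are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic data: a linear relationship plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)
pred = model.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
manual_r2 = 1 - ss_res / ss_tot

print(model.score(X, y), r2_score(y, pred), manual_r2)
```

All three printed values should agree. An R² close to 1 means the model explains most of the variance in the label, which is different from the fraction of correct predictions that "accuracy" usually refers to.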
