In [1]:
import quandl
import pandas as pd

quandl.ApiConfig.api_key = "Hzq3sWp4zmh3syzmcNQA"

df = quandl.get('WIKI/TSLA')
df = df[['Adj. Open', 'Adj. High', 'Adj. Low', 'Adj. Close', 'Adj. Volume']]
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100.0
df['PCT_Change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100.0
df = df[['Adj. Close', 'HL_PCT','PCT_Change', 'Adj. Volume']]
df.head()

Date,Adj. Close,HL_PCT,PCT_Change,Adj. Volume
2010-06-29,23.89,42.531357,25.736842,18766300.0
2010-06-30,23.83,30.554506,-7.599845,17187100.0
2010-07-01,21.96,27.873705,-12.16,8218800.0
2010-07-02,19.2,23.463389,-16.521739,5139800.0
2010-07-06,16.11,26.342388,-19.45,6866900.0


Now that we have retrieved the data, picked the columns we care about, and derived a couple of extra features through basic manipulation, we're ready for some ML!

In [18]:
import math
import numpy as np
import pandas as pd
from sklearn import preprocessing, svm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

### Modules used
- **numpy** to convert data to numpy arrays for scikit-learn to use
- **sklearn.preprocessing** is the module used to do some cleaning & scaling of data before ML
- Import **LinearRegression** algorithm & **svm** from sklearn

### With supervised learning, you have features and labels:
- Features: descriptive attributes      <- Data we have
- Labels: what we're trying to predict  <- Future price in our case

Let's go ahead and set up a new label column:
- Define the column we want to predict
- Fill NaN / missing data with -99999
    - With many ML algorithms, an extreme value like this is meant to be treated as an outlier rather than as real data
- Finally, define how far ahead you want to predict. We want to predict not just one day ahead, but a number of days equal to 1% of the length of the dataset!
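To see what these steps do before running them on the real data, here is a minimal sketch on a toy DataFrame. The `toy` frame, its values, and the 25% horizon are made up for illustration; only the `math.ceil` and `shift` pattern follows the steps listed above.

```python
import math
import pandas as pd

# Hypothetical toy data: eight consecutive "closing prices".
toy = pd.DataFrame({"Adj. Close": [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0, 17.0]})

# Look ahead by 25% of the dataset length (the notebook uses 1%).
toy_predict_out = int(math.ceil(0.25 * len(toy)))   # 2 rows ahead

# Each row's label is the price toy_predict_out rows later.
toy["label"] = toy["Adj. Close"].shift(-toy_predict_out)

# The last toy_predict_out rows have no future price yet, so their label is NaN.
print(toy)
```

The rows with a NaN label are the ones we cannot train on, because their "future" price is not known yet.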



In [6]:
predict_col = 'Adj. Close'        # Our label (the value we want to predict) is the Adj. Close price
df.fillna(value=-99999, inplace=True)     # Handle missing data: -99999 is meant to be treated as an outlier by most ML algorithms
predict_out = int(math.ceil(0.01*len(df)))
predict_out    # How many days into the future we want to predict: 1% of the number of data points

20

In [11]:
df['label'] = df[predict_col].shift(-predict_out)    # Add a new column, but leave the last "predict_out" days blank for our prediction
df.tail(30)

Date,Adj. Close,HL_PCT,PCT_Change,Adj. Volume,label
2018-02-13,323.66,3.737314,2.742683,4506915.0,326.63
2018-02-14,322.31,2.401733,0.458172,3930911.0,325.6
2018-02-15,334.065,3.635236,2.947612,5892048.0,321.35
2018-02-16,335.49,3.461585,0.899248,5585810.0,313.56
2018-02-20,334.77,2.817496,0.089694,3996951.0,310.55
2018-02-21,333.3,1.957829,-0.812427,3181755.0,316.53
2018-02-22,346.17,3.790858,3.171102,6940349.0,309.1
2018-02-23,352.05,2.27312,1.213236,5790795.0,301.54
2018-02-26,357.42,1.885882,1.108911,4312871.0,304.18
2018-02-27,350.99,2.851347,-1.476491,4761537.0,279.18


In [12]:
df.dropna(inplace=True)    # We shifted Adj. Close up by predict_out rows to build the label column,
                           # so the last 20 rows of the label column are NaN: drop them
df.head()

Date,Adj. Close,HL_PCT,PCT_Change,Adj. Volume,label
2010-06-29,23.89,42.531357,25.736842,18766300.0,20.72
2010-06-30,23.83,30.554506,-7.599845,17187100.0,20.35
2010-07-01,21.96,27.873705,-12.16,8218800.0,19.94
2010-07-02,19.2,23.463389,-16.521739,5139800.0,20.92
2010-07-06,16.11,26.342388,-19.45,6866900.0,21.95


Currently, we have data that comprises our features & labels.

Next, we will do some **preprocessing** and final steps to actually start running everything.

### Let's preprocess & train!!!
In Machine Learning, it's typical to define:
- X: features  --> Entire df EXCEPT for the label column
- y: labels corresponding to the features  -->  Only the label column

In [13]:
X = np.array(df.drop(['label'], axis=1))
y = np.array(df['label'])

We could leave it at this, and move on to training and testing, but we're going to do some pre-processing. 

#### Preprocess - Scale 

- **Generally, you want your features in machine learning to be on a comparable scale. Scaling may do nothing, but it usually speeds up processing and can also help with accuracy.**

- Because this is so commonly needed, **it is included in the preprocessing module of Scikit-Learn**. Note that preprocessing.scale standardizes each feature to zero mean and unit variance; it does not clamp values to the range -1 to 1. To use it, apply preprocessing.scale to your X variable:

- **WARNING: [Risk of data leak!!](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.scale.html)**
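    - Do not use scale unless you know what you are doing. A common mistake is to apply it to the entire data before splitting into training and test sets.
    - This will bias the model evaluation because information would have leaked from the test set to the training set.
    - The next cell scales all of X before the split to keep things simple, so the scores reported later in this notebook may be somewhat optimistic.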
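One way to avoid the leak is to split first and learn the scaling statistics from the training rows only. The sketch below is not what this notebook does; it uses synthetic data and illustrative names (`X_demo`, `scaler`, and so on), with scikit-learn's `StandardScaler` standing in for `preprocessing.scale`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a feature matrix and its labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 4))
y_demo = rng.normal(size=200)

# 1) Split first, so the test rows stay unseen.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# 2) Learn the mean and standard deviation from the training rows only.
scaler = StandardScaler().fit(X_tr)

# 3) Apply those training statistics to both splits.
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)

print(X_tr_scaled.mean(axis=0).round(6))  # approximately 0 for every column
print(X_te_scaled.mean(axis=0).round(6))  # near 0, but generally not exactly 0
```

Because the test split is transformed with statistics it did not contribute to, nothing about the test rows reaches the model during training.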

In [14]:
X = preprocessing.scale(X)

### Training
- Split the data into a training set & a test set
-  There are many ways to do this, but probably the best way is the built-in **train_test_split**, since it also _shuffles your data for you_.

    - Older tutorials import it from **sklearn.cross_validation**. That module was deprecated in scikit-learn 0.18 and removed in 0.20, and its contents moved to **sklearn.model_selection**. [cross_validate](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) is a separate function in that module, not a replacement for train_test_split
    - [**Use sklearn.model_selection.train_test_split instead!**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html)
    - Also try out [f-strings](https://realpython.com/python-f-strings/#f-strings-a-new-and-improved-way-to-format-strings-in-python), fun!
    
train_test_split returns, in this order, the training set of features, the testing set of features, the training set of labels, and the testing set of labels.
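As a quick sanity check of the return order and the split sizes, here is a small sketch with placeholder arrays. The `_demo` names and the zero-filled data are illustrative only; the row count of 1929 is the sum of the two lengths printed by the notebook cell that follows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with 1929 rows and 4 feature columns.
X_demo = np.zeros((1929, 4))
y_demo = np.zeros(1929)

# Order of the returned values: train features, test features, train labels, test labels.
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.23, random_state=42
)

print(X_train_demo.shape, X_test_demo.shape)   # (1485, 4) (444, 4)
print(y_train_demo.shape, y_test_demo.shape)   # (1485,) (444,)
```

With a float `test_size`, the test set gets the rounded-up share of the rows (0.23 * 1929 rounds up to 444) and the training set gets the rest.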

In [60]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.23)
print(f"len of X_train = {len(X_train)}")
print(f"len of X_test = {len(X_test)}")

len of X_train = 1485
len of X_test = 444


#### Defining our Model
- Let's first try out Support Vector Regression, [svm](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html)

In [61]:
training_model = svm.SVR(kernel='linear') 
training_model_for_comparison = svm.SVR(kernel='sigmoid')

Once you have defined the model (a regressor here, since we are predicting a continuous price), you're ready to train it.

With sklearn, you train with **.fit**:

In [62]:
training_model.fit(X_train, y_train)
training_model_for_comparison.fit(X_train, y_train)

SVR(kernel='sigmoid')

Our model is now trained. 

Wow that was easy! Now we can test it!

In [63]:
confidence = training_model.score(X_test, y_test)
confidence_for_comparison = training_model_for_comparison.score(X_test,y_test)
print("Model: svm.SVR(kernel='linear')")
print(f"confidence = {confidence}")
print("Model: svm.SVR(kernel='sigmoid')")
print(f"confidence_for_comparison = {confidence_for_comparison}")

Model: svm.SVR(kernel='linear')
confidence = 0.9606939988022062
Model: svm.SVR(kernel='sigmoid')
confidence_for_comparison = 0.926700865613112


Let's try another model, this time using **LinearRegression** from sklearn:

In [64]:
training_model = LinearRegression()
training_model.fit(X_train, y_train)

LinearRegression()

In [65]:
confidence = training_model.score(X_test, y_test)
print("Model: LinearRegression()")
print(f"confidence = {confidence}")

Model: LinearRegression()
confidence = 0.9613802968237725
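
For regressors such as `SVR` and `LinearRegression`, `.score` returns the coefficient of determination R², so the "confidence" values printed above are R² scores rather than probabilities. The sketch below, on synthetic data with illustrative `_demo` names, compares `.score` with R² computed by hand:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: a linear signal plus a little noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Train on the first 80 rows, evaluate on the last 20.
model_demo = LinearRegression().fit(X_demo[:80], y_demo[:80])
X_te, y_te = X_demo[80:], y_demo[80:]

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
y_pred = model_demo.predict(X_te)
ss_res = np.sum((y_te - y_pred) ** 2)
ss_tot = np.sum((y_te - y_te.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, model_demo.score(X_te, y_te))  # the two values agree
```

An R² of 1.0 means the predictions match the test labels exactly, while 0.0 means the model does no better than always predicting the mean of the test labels.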


Q) How might we know, as scientists, which algorithm to choose?

After a while, you will get used to what works in most situations and what doesn't. You can also check out: [choosing the right estimator from scikit-learn's website](https://scikit-learn.org/stable/tutorial/machine_learning_map/).
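
Until that intuition develops, one practical option is to try a few candidates under the same cross-validation and compare their scores. This is a rough sketch on synthetic data from `make_regression`, not on the stock data above; the `candidates` dictionary and the choice of 5 folds are assumptions made for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

# Synthetic regression problem with 4 features.
X_demo, y_demo = make_regression(
    n_samples=300, n_features=4, noise=10.0, random_state=0
)

candidates = {
    "LinearRegression": LinearRegression(),
    "SVR(linear)": SVR(kernel="linear"),
    "SVR(sigmoid)": SVR(kernel="sigmoid"),
}

mean_r2 = {}
for name, model in candidates.items():
    # For regressors, the default scoring used by cross_val_score is R^2.
    scores = cross_val_score(model, X_demo, y_demo, cv=5)
    mean_r2[name] = scores.mean()
    print(f"{name}: mean R^2 = {mean_r2[name]:.3f}")
```

The ranking you get on one dataset does not have to carry over to another, so treat a comparison like this as a starting point rather than a verdict.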

#### Other notes on the results
- Some of the algorithms must run serially, others can run in parallel. Do not confuse linear regression with a requirement to run serially, by the way. 
- Some of the ML algorithms will process **one step at a time, with no threading, while others can thread and use all the CPU cores you have available**. You could learn a lot about each algorithm to figure out which ones can thread, or you can visit the documentation and **look for the n_jobs parameter. If it has n_jobs, you have an algorithm that can run in parallel** for high performance.
- There is a parameter to svm.SVR called kernel. What's this? 
    - **Think of a kernel as a transformation applied to your data. It lets the SVM work as if the data had been mapped into a different feature space without computing that mapping explicitly, which keeps processing fast** e.g. kernel='linear', 'sigmoid'
    - As we can see, the linear kernel scored higher than the sigmoid kernel here (about 0.961 vs 0.927)
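
Besides reading the documentation, you can check for `n_jobs` programmatically through `get_params()`, which every scikit-learn estimator provides. A small sketch (the `supports_n_jobs` dictionary is just an illustrative name):

```python
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR

supports_n_jobs = {}
for estimator in (LinearRegression(), SVR()):
    name = type(estimator).__name__
    # get_params() returns the constructor parameters as a dict.
    supports_n_jobs[name] = "n_jobs" in estimator.get_params()
    print(f"{name}: n_jobs available = {supports_n_jobs[name]}")
```

Of the two models used in this notebook, `LinearRegression` exposes `n_jobs` and `SVR` does not.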

### Next Goal:
- Move forward to predict the label