# Practical Machine Learning Tutorial with Python Introduction

https://pythonprogramming.net/machine-learning-tutorial-python-introduction/
This Jupyter Notebook contains student notes taken while following the course. The notes use Python 2.7 (the course author uses Python 3).

## Lesson 2: Regression - Intro and Data

In [154]:
import pandas as pd
import quandl

In [155]:
df_orig = quandl.get('WIKI/GOOGL')
print df_orig.head()

              Open    High     Low   Close      Volume  Ex-Dividend  \
Date                                                                  
2004-08-19  100.00  104.06   95.96  100.34  44659000.0          0.0   
2004-08-20  101.01  109.08  100.50  108.31  22834300.0          0.0   
2004-08-23  110.75  113.48  109.05  109.40  18256100.0          0.0   
2004-08-24  111.24  111.60  103.57  104.87  15247300.0          0.0   
2004-08-25  104.96  108.00  103.88  106.00   9188600.0          0.0   

            Split Ratio  Adj. Open  Adj. High  Adj. Low  Adj. Close  \
Date                                                                  
2004-08-19          1.0     50.000      52.03    47.980      50.170   
2004-08-20          1.0     50.505      54.54    50.250      54.155   
2004-08-23          1.0     55.375      56.74    54.525      54.700   
2004-08-24          1.0     55.620      55.80    51.785      52.435   
2004-08-25          1.0     52.480      54.00    51.940      53.000   

    

In [172]:
# create a new df on which we will do all our modifications
# this keeps df_orig in memory, so we can limit the number of
# queries sent to Quandl
df = df_orig.copy(deep=True)

# create 2 new columns:
#   HL_PCT:     daily high-low range as a percentage of the low
#   PCT_change: open-to-close change as a percentage of the open
df['HL_PCT'] = (df['Adj. High'] - df['Adj. Low']) / df['Adj. Low'] * 100
df['PCT_change'] = (df['Adj. Close'] - df['Adj. Open']) / df['Adj. Open'] * 100

# we are only interested in keeping these 4 columns:
df = df[['Adj. Close','HL_PCT','PCT_change','Adj. Volume']]

print df.head()

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2004-08-19      50.170  8.441017    0.340000   44659000.0
2004-08-20      54.155  8.537313    7.227007   22834300.0
2004-08-23      54.700  4.062357   -1.218962   18256100.0
2004-08-24      52.435  7.753210   -5.726357   15247300.0
2004-08-25      53.000  3.966115    0.990854    9188600.0
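
As a quick sanity check of the two formulas above, the first rows of the printed output can be reproduced by hand from the Adj. columns shown in the `df_orig.head()` output. This is only a sketch: the helper names `hl_pct` and `pct_change` are made up for illustration and are not part of the tutorial. The `print` call is written so that it runs under both Python 2.7 and Python 3.

```python
# Hypothetical helpers (not from the tutorial) mirroring the two column formulas.

def hl_pct(adj_high, adj_low):
    """High-low range as a percentage of the low."""
    return (adj_high - adj_low) / adj_low * 100


def pct_change(adj_close, adj_open):
    """Open-to-close change as a percentage of the open."""
    return (adj_close - adj_open) / adj_open * 100


# Values for 2004-08-19 and 2004-08-20, copied from the df_orig.head() output.
rows = [
    # (Adj. Open, Adj. High, Adj. Low, Adj. Close)
    (50.000, 52.03, 47.980, 50.170),
    (50.505, 54.54, 50.250, 54.155),
]

results = [
    (round(hl_pct(high, low), 6), round(pct_change(close, opn), 6))
    for opn, high, low, close in rows
]
for hl, pct in results:
    print("HL_PCT=%.6f  PCT_change=%.6f" % (hl, pct))
```

The rounded values should agree with the first two rows of the `HL_PCT` and `PCT_change` columns printed above.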


## Lesson 3: Regression - Features and Labels

In [173]:
import quandl
import math
import numpy as np
import pandas as pd
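# note: scikit-learn 0.20+ removed cross_validation; there, train_test_split
# lives in sklearn.model_selection
from sklearn import preprocessing, cross_validation, svm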
from sklearn.linear_model import LinearRegression

In [174]:
forecast_col = 'Adj. Close'
df.fillna(value=-99999, inplace=True)

print df.head()
print df.tail()

# math.ceil rounds up, so forecast_out is 1% of the row count, rounded up
# to a whole number. Here it should be 30, meaning we predict 30 trading
# days (rows) into the future.
# shift(-forecast_out) moves the column up by that many rows, so the label
# at row 1 holds the Adj. Close of row 31, and so on. The last forecast_out
# rows get NaN labels because their future prices are not known yet.
forecast_out = int(math.ceil(0.01 * len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)

print df.head()
print df.tail()

            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2004-08-19      50.170  8.441017    0.340000   44659000.0
2004-08-20      54.155  8.537313    7.227007   22834300.0
2004-08-23      54.700  4.062357   -1.218962   18256100.0
2004-08-24      52.435  7.753210   -5.726357   15247300.0
2004-08-25      53.000  3.966115    0.990854    9188600.0
            Adj. Close    HL_PCT  PCT_change  Adj. Volume
Date                                                     
2016-07-05      704.89  1.285884   -0.017021    1399205.0
2016-07-06      708.97  2.002861    1.304584    1428996.0
2016-07-07      707.26  1.355845   -0.401346    1053642.0
2016-07-08      717.78  1.382554    1.016100    1463884.0
2016-07-11      727.20  1.399971    1.081427    1430124.0
            Adj. Close    HL_PCT  PCT_change  Adj. Volume   label
Date                                                             
2004-08-19      50.170  8.441017    0.340000   44659000.
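
The effect of the negative shift is easier to see on a toy column. A minimal sketch, assuming a made-up 10-row frame and a forecast horizon of 2 rows (the names `toy` and `horizon` are illustrative and do not come from the tutorial):

```python
import pandas as pd

# Toy data, invented for illustration: prices 10.0, 11.0, ..., 19.0.
toy = pd.DataFrame({'price': [float(p) for p in range(10, 20)]})
horizon = 2  # stands in for forecast_out

# Move the column up by `horizon` rows:
# the label at row i is the price at row i + horizon.
toy['label'] = toy['price'].shift(-horizon)

print(toy)

# The last `horizon` rows have no future price, so their label is NaN.
n_missing = int(toy['label'].isnull().sum())
print(n_missing)
```

Those trailing NaN rows are the ones that have to be dropped before training, since they have features but no known label.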

## Lesson 4: Training and testing

In [175]:
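# define features: every column except the label
X = np.array(df.drop(['label'], axis=1))
X = preprocessing.scale(X)  # standardize each column to zero mean, unit variance
X = X[:-forecast_out]  # drop the last rows, whose labels (future prices) are unknown
df.dropna(inplace=True)  # drop the same rows from df so X and y stay aligned

# define labels
y = np.array(df['label'])

# hold out 20% of the samples for testing; train on the other 80%
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.2)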
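
For intuition, the scaling and splitting steps can be approximated with plain NumPy. This is a rough sketch of the idea, not scikit-learn's actual implementation; the array values, the seed, and all variable names are assumptions made for the example.

```python
import numpy as np

# Made-up feature matrix: 10 samples, 2 features on very different scales.
X_demo = np.column_stack([np.arange(10, dtype=float),
                          np.arange(10, dtype=float) * 1000.0])
y_demo = np.arange(10, dtype=float)

# Standardize each column to zero mean and unit variance,
# which is the effect preprocessing.scale has with its default settings.
X_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Shuffle the row indices, then use the first 20% of them as the test set.
rng = np.random.RandomState(0)  # fixed seed so the split is repeatable
idx = rng.permutation(len(X_scaled))
n_test = int(np.ceil(0.2 * len(X_scaled)))
test_idx, train_idx = idx[:n_test], idx[n_test:]

# Index features and labels with the same indices so they stay aligned.
X_tr, X_te = X_scaled[train_idx], X_scaled[test_idx]
y_tr, y_te = y_demo[train_idx], y_demo[test_idx]

print(X_tr.shape)
print(X_te.shape)
```

Because the split is random, the scores reported further down will vary slightly from run to run unless a seed is fixed.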

In [176]:
# Train with Support Vector Regression (SVR)
clf = svm.SVR()
clf.fit(X_train, y_train)

# Score on the held-out test set; for regressors, score() returns R^2
confidence = clf.score(X_test, y_test)
print confidence


0.799987507512


In [177]:
# Train with Linear Regression
clf = LinearRegression()
clf.fit(X_train, y_train)

# Score on the held-out test set; for regressors, score() returns R^2
confidence = clf.score(X_test, y_test)
print confidence

0.962395427799
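
The number printed by `score` is the coefficient of determination, R^2 = 1 - SS_res / SS_tot. A small sketch of that formula, using invented numbers and a hypothetical helper named `r_squared`:

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # made-up labels

perfect = r_squared(y_true, y_true)       # every prediction is exact
baseline = r_squared(y_true, [2.5] * 4)   # always predict the mean of y_true
print(perfect)
print(baseline)
```

A score of 1.0 means perfect predictions on the test set and 0.0 means no better than always predicting the mean, so on this split the 0.96 from linear regression indicates a closer fit than the 0.80 from SVR.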
