# Logistic Regression - Price Only

For a ticker, given a sample period (e.g. 3-1-2009 to 12-31-2015), we train a logistic regression model to predict whether it will become a 10 Bagger by 3-31-2019:  

10_bagger = f(p_t0, p_t1, ..., p_t_end)
 
where:  
* p_t0 - ticker's price on 3-1-2009
* p_t_end - ticker's price on the last date in the sample period

We want to see how much a basic ML model like logistic regression can outperform the Baseline (Linear Interpolation).
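
For reference, the function f fitted by logistic regression has the standard form below. The weight vector w and bias b are notation introduced here for illustration, not symbols from the original notebook; p denotes the price vector (p_t0, ..., p_t_end).

$$
P(\text{10\_bagger} = 1 \mid \mathbf{p}) = \sigma(\mathbf{w}^\top \mathbf{p} + b), \qquad \sigma(z) = \frac{1}{1 + e^{-z}}
$$

A ticker is predicted to be a 10 Bagger when this probability exceeds 0.5.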

In [4]:
import quandl  # Access to Sharadar Core US Equities Bundle
api_key = '7B87ndLPJbCDzpNHosH3'

import math
import platform
import matplotlib
import matplotlib.pyplot as plt
from pylab import rcParams
import numpy as np
from sklearn import linear_model  # package for logistic regression (not using GPU)
import torch
import pandas as pd
from IPython.display import display
import time

from utils import *

from datetime import date, datetime, time, timedelta


print("Python version: ", platform.python_version())
print("Pytorch version: {}".format(torch.__version__))

Python version:  3.6.6
Pytorch version: 1.1.0


## Import Labels

For each sample period (e.g. 3-1-2009 to 12-31-2018), we import a list of valid tickers. A valid ticker is one that has been active for at least 180 days before the end of the sample period. 

For example, if the sample period ends on 12-31-2018, a ticker must have been active since 7-4-2018. Any ticker that IPO'd after 7-4-2018 is not valid, since there is not enough price history to make an educated prediction.
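
The 180-day rule can be sketched as a date comparison. This is a minimal illustration, not the code that produced the label files: it assumes "active" means the ticker's `firstpricedate` falls on or before the cutoff, and the names `MIN_HISTORY_DAYS` and `is_valid_ticker` are hypothetical.

```python
from datetime import date, timedelta

MIN_HISTORY_DAYS = 180  # minimum history required, per the definition above


def is_valid_ticker(first_price_date: date, sample_end: date) -> bool:
    """Return True if the ticker started trading at least MIN_HISTORY_DAYS before sample_end."""
    cutoff = sample_end - timedelta(days=MIN_HISTORY_DAYS)
    return first_price_date <= cutoff


sample_end = date(2018, 12, 31)
cutoff = sample_end - timedelta(days=MIN_HISTORY_DAYS)
print(cutoff)  # 2018-07-04

# AA's firstpricedate in the labels file is 2016-11-01, well before the cutoff.
print(is_valid_ticker(date(2016, 11, 1), sample_end))  # True
# A made-up ticker that listed on 2018-09-01 does not have enough history.
print(is_valid_ticker(date(2018, 9, 1), sample_end))   # False
```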

In [27]:
labels = pd.read_csv("../datasets/sharader/labels_12-31-2018.csv")

y = labels.set_index('ticker')
y['firstpricedate']= pd.to_datetime(y['firstpricedate'])
y['lastpricedate']= pd.to_datetime(y['lastpricedate'])

y.head()

ticker,appreciation,10bagger,table,permaticker,name,exchange,isdelisted,category,cusips,siccode,...,currency,location,lastupdated,firstadded,firstpricedate,lastpricedate,firstquarter,lastquarter,secfilings,companysite
A,6.339117,False,SEP,196290,Agilent Technologies Inc,NYSE,N,Domestic,00846U101,3826.0,...,USD,California; U.S.A,2020-01-14,2014-09-26,1999-11-18,2020-01-14,1997-06-30,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,http://www.agilent.com
AA,1.224348,False,SEP,124392,Alcoa Corp,NYSE,N,Domestic,013872106,3350.0,...,USD,New York; U.S.A,2020-01-14,2016-11-01,2016-11-01,2020-01-14,2014-12-31,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,http://www.alcoa.com
AAAGY,1.275556,False,SEP,120538,Altana Aktiengesellschaft,NYSE,Y,ADR,02143N103,2834.0,...,USD,Jordan,2018-10-16,2018-02-13,2002-05-22,2010-08-12,2000-12-31,2005-12-31,https://www.sec.gov/cgi-bin/browse-edgar?actio...,
AAAP,3.331837,False,SEP,155760,Advanced Accelerator Applications SA,NASDAQ,Y,ADR,00790T100,2834.0,...,USD,France,2018-06-28,2016-05-19,2015-11-11,2018-02-09,2012-12-31,2016-12-31,https://www.sec.gov/cgi-bin/browse-edgar?actio...,
AAC,0.099459,False,SEP,187592,AAC Holdings Inc,NYSE,Y,Domestic,000307108,8093.0,...,USD,Tennessee; U.S.A,2019-10-25,2015-09-11,2014-10-02,2019-10-25,2013-09-30,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,


### A list of active tickers

In [28]:
tickers = list(y.index)
print(len(tickers))

valid_tickers = pd.Series(tickers, name = 'ticker')
valid_tickers.head()

9881


0        A
1       AA
2    AAAGY
3     AAAP
4      AAC
Name: ticker, dtype: object

In [29]:
prices = pd.read_csv("../datasets/sharader/inputs_notfilled_2018-12-31.csv")
X = prices.set_index('date')

X

date,A,AA,AAAGY,AAAP,AAC,AACC,AACG,AACPF,AAGIY,AAI,...,ZUO,ZURVY,ZVO,ZVUE,ZXAIY,ZYME,ZYNE,ZYTO,ZYXI,ZZ
2009-03-02,12.68,,15.75,,,3.29,5.180,,,2.54,...,,12.750,,0.01,,,,0.011,1.21,0.84
2009-03-03,12.68,,15.75,,,3.30,5.320,,,2.46,...,,12.850,,0.01,,,,0.011,1.22,0.76
2009-03-04,13.31,,16.35,,,3.33,5.080,,,2.78,...,,13.740,,0.01,,,,0.011,1.22,0.76
2009-03-05,12.54,,15.59,,,3.30,5.080,,,2.56,...,,11.910,,0.01,,,,0.011,1.17,0.58
2009-03-06,12.65,,15.97,,,3.40,5.250,,,2.89,...,,11.300,,0.01,,,,0.011,1.20,0.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-24,62.67,25.15,,,1.52,,0.970,,32.335,,...,16.36,28.680,6.73,,0.67,11.08,2.94,,2.69,
2018-12-26,65.54,27.14,,,1.67,,0.998,,32.700,,...,17.81,29.050,6.78,,0.67,11.99,2.92,,2.63,
2018-12-27,66.48,27.16,,,1.49,,0.980,,32.330,,...,17.90,28.769,6.83,,0.62,11.70,2.95,,2.63,
2018-12-28,65.96,26.60,,,1.41,,1.000,,32.900,,...,17.65,29.668,6.87,,0.72,13.65,2.92,,2.65,


### Forward Fill and Zero Fill X_t

The logistic regression model expects no NaN values. We fill in the NaN values in the following way:

1. If there are NaN values before the first valid price, the ticker IPO'd during the sample period, so we set these NaN values to zero.
2. If there are NaN values after the last valid price, the ticker was delisted during the sample period, so we set these NaN values to the last valid price.
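
The two rules can be seen on a toy table. The tickers and prices below are made up for illustration; only the forward fill followed by a zero fill mirrors the approach described above. Note that a forward fill also fills any gaps in the middle of a series, not just the tail after delisting.

```python
import numpy as np
import pandas as pd

# Toy price table: rows are dates, columns are tickers (made-up values).
toy = pd.DataFrame(
    {
        "IPO_LATE": [np.nan, np.nan, 10.0, 11.0],  # starts trading partway through
        "DELISTED": [5.0, 4.0, np.nan, np.nan],    # stops trading partway through
    },
    index=pd.to_datetime(["2009-03-02", "2009-03-03", "2009-03-04", "2009-03-05"]),
)

filled = toy.ffill()       # rule 2: carry the last valid price forward
filled = filled.fillna(0)  # rule 1: NaN left before the first valid price become zero

print(filled)
```

The order matters: forward filling first leaves only the pre-IPO NaN values, so the zero fill cannot overwrite a post-delisting gap.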


In [30]:
X_filled = X.ffill(axis=0)  # forward fill along date axis with last valid price
X_filled = X_filled.fillna(0)  # fill all other NaN with zero - remaining NaN before the first valid price
X_train = X_filled.transpose()

print (X_train.shape)
display(X_train)

(9881, 2477)


date,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-17,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31
A,12.680,12.680,13.310,12.540,12.650,12.440,13.430,13.530,13.810,13.990,...,67.85,67.99,66.83,65.19,63.29,62.67,65.54,66.48,65.96,67.46
AA,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,28.11,28.24,27.61,27.33,26.39,25.15,27.14,27.16,26.60,26.58
AAAGY,15.750,15.750,16.350,15.590,15.970,15.960,16.200,16.150,16.500,16.360,...,20.09,20.09,20.09,20.09,20.09,20.09,20.09,20.09,20.09,20.09
AAAP,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,81.63,81.63,81.63,81.63,81.63,81.63,81.63,81.63,81.63,81.63
AAC,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,1.97,1.96,1.78,1.63,1.56,1.52,1.67,1.49,1.41,1.40
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,12.77,12.80,11.70,11.32,10.95,11.08,11.99,11.70,13.65,14.68
ZYNE,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,0.000,...,4.10,3.73,3.58,3.21,2.89,2.94,2.92,2.95,2.92,2.97
ZYTO,0.011,0.011,0.011,0.011,0.011,0.011,0.011,0.011,0.012,0.012,...,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03,0.03
ZYXI,1.210,1.220,1.220,1.170,1.200,1.100,1.000,1.000,1.010,1.050,...,3.04,2.96,3.07,2.89,2.81,2.69,2.63,2.63,2.65,2.94


In [31]:
y_train = y['10bagger']

print (y_train.shape)
print(sum(y_train))

(9881,)
531


## Failure to Converge

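With raw (nominal) ticker prices as features, the solver fails to converge for every value of `C` tried below. Note that in scikit-learn `C` is the inverse of the regularization strength, so smaller values mean stronger regularization.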

In [32]:
C_values = [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 1e1, 1e2, 1e3, 1e4]

for C in C_values:
    
    print("Regularization: {}".format(C))

    model = linear_model.LogisticRegression(C=C, class_weight='balanced', max_iter=10000)
    model.fit(X_train, y_train)

    predicts = model.predict(X_train)
    correct = predicts == y_train
    
    TP,FP,TN,FN = calc_metrics(predicts,y_train)

    print(TP,FP,TN,FN)

    precision, recall, accuracy, TPR, TNR, BER = calc_error_rates(TP, FP, TN, FN)

    print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
    print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))

Regularization: 0.0001


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


487 2444 6906 44
Precision:0.166 Recall:0.917
Accuracy:0.748 TPR:0.917 TNR:0.739 BER:0.172
Regularization: 0.001


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


486 2787 6563 45
Precision:0.148 Recall:0.915
Accuracy:0.713 TPR:0.915 TNR:0.702 BER:0.191
Regularization: 0.01


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


502 2821 6529 29
Precision:0.151 Recall:0.945
Accuracy:0.712 TPR:0.945 TNR:0.698 BER:0.178
Regularization: 0.1


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


509 3256 6094 22
Precision:0.135 Recall:0.959
Accuracy:0.668 TPR:0.959 TNR:0.652 BER:0.195
Regularization: 1.0


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


494 2713 6637 37
Precision:0.154 Recall:0.930
Accuracy:0.722 TPR:0.930 TNR:0.710 BER:0.180
Regularization: 10.0


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


497 2729 6621 34
Precision:0.154 Recall:0.936
Accuracy:0.720 TPR:0.936 TNR:0.708 BER:0.178
Regularization: 100.0


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


485 2559 6791 46
Precision:0.159 Recall:0.913
Accuracy:0.736 TPR:0.913 TNR:0.726 BER:0.180
Regularization: 1000.0


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


489 2680 6670 42
Precision:0.154 Recall:0.921
Accuracy:0.725 TPR:0.921 TNR:0.713 BER:0.183
Regularization: 10000.0
506 3241 6109 25
Precision:0.135 Recall:0.953
Accuracy:0.669 TPR:0.953 TNR:0.653 BER:0.197


ABNORMAL_TERMINATION_IN_LNSRCH.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
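
The counts and rates printed above come from `calc_metrics` and `calc_error_rates` in `utils`, whose source is not shown here. The sketch below is one plausible reading of what they compute, using hypothetical stand-in names (`confusion_counts`, `error_rates`). The formulas are the standard definitions, with BER taken as 1 - (TPR + TNR) / 2, and they reproduce the figures printed for C = 0.0001.

```python
def confusion_counts(predicted, actual):
    """Count true/false positives and negatives for boolean predictions."""
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and not a:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def error_rates(tp, fp, tn, fn):
    """Standard rates; assumes every denominator is non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    tpr = recall                  # true positive rate is the same as recall
    tnr = tn / (tn + fp)          # true negative rate
    ber = 1 - 0.5 * (tpr + tnr)   # balanced error rate
    return precision, recall, accuracy, tpr, tnr, ber


# Counts printed for C = 0.0001 in the run above.
precision, recall, accuracy, tpr, tnr, ber = error_rates(487, 2444, 6906, 44)
print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
# Precision:0.166 Recall:0.917
print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, tpr, tnr, ber))
# Accuracy:0.748 TPR:0.917 TNR:0.739 BER:0.172
```

With only 531 positives among 9881 tickers, accuracy alone is a weak signal, which is why the balanced error rate is reported alongside it.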


## Normalize Ticker Price

The model converges when ticker prices are normalized (first valid price = 1.0).
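
The normalization can be sketched on a toy table as follows. The helper `first_valid_value` is a hypothetical stand-in for `first_valid_idx` from `utils` (whose source is not shown); it assumes that helper returns the first non-NaN price of each ticker. The tickers and prices are made up.

```python
import numpy as np
import pandas as pd


def first_valid_value(series: pd.Series) -> float:
    """Return the first non-NaN value of a series, or NaN if there is none."""
    idx = series.first_valid_index()
    return series.loc[idx] if idx is not None else np.nan


# Toy price table: rows are dates, columns are tickers.
toy = pd.DataFrame({"OLD": [2.0, 3.0, 4.0], "NEW": [np.nan, 50.0, 75.0]})

first_prices = toy.apply(first_valid_value, axis=0)  # one value per ticker

# Fill NaN, put tickers on rows, then divide each row by its first valid price.
normalized = toy.ffill().fillna(0).transpose().div(first_prices, axis=0)
print(normalized)
```

After this step every ticker starts at 1.0 on its first trading day (and sits at 0.0 before it), so a $0.01 stock and a $100 stock are on the same scale.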

In [33]:
first_valid_prices = X.apply(first_valid_idx, axis=0)

first_valid_prices

A        12.680
AA       23.000
AAAGY    15.750
AAAP     24.500
AAC      18.500
          ...  
ZYME     13.000
ZYNE     16.250
ZYTO      0.011
ZYXI      1.210
ZZ        0.840
Length: 9881, dtype: float64

In [42]:
X_filled = X.ffill(axis=0)  # forward fill along date axis with last valid price
X_filled = X_filled.fillna(0)  # fill all other NaN with zero - remaining NaN before the first valid price

# Transpose the DataFrame so that each row is a ticker, then normalize by the first valid price
X_all = X_filled.transpose().div(first_valid_prices, axis=0)

X_all

date,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-17,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31
A,1.0,1.000000,1.049685,0.988959,0.997634,0.981073,1.059148,1.067035,1.089117,1.103312,...,5.350946,5.361987,5.270505,5.141167,4.991325,4.942429,5.168770,5.242902,5.201893,5.320189
AA,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.222174,1.227826,1.200435,1.188261,1.147391,1.093478,1.180000,1.180870,1.156522,1.155652
AAAGY,1.0,1.000000,1.038095,0.989841,1.013968,1.013333,1.028571,1.025397,1.047619,1.038730,...,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556
AAAP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837
AAC,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.106486,0.105946,0.096216,0.088108,0.084324,0.082162,0.090270,0.080541,0.076216,0.075676
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.982308,0.984615,0.900000,0.870769,0.842308,0.852308,0.922308,0.900000,1.050000,1.129231
ZYNE,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.252308,0.229538,0.220308,0.197538,0.177846,0.180923,0.179692,0.181538,0.179692,0.182769
ZYTO,1.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.090909,1.090909,...,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273
ZYXI,1.0,1.008264,1.008264,0.966942,0.991736,0.909091,0.826446,0.826446,0.834711,0.867769,...,2.512397,2.446281,2.537190,2.388430,2.322314,2.223140,2.173554,2.173554,2.190083,2.429752


In [43]:
y_all = y['10bagger']
y_all

ticker
A        False
AA       False
AAAGY    False
AAAP     False
AAC      False
         ...  
ZYME     False
ZYNE     False
ZYTO     False
ZYXI     False
ZZ       False
Name: 10bagger, Length: 9881, dtype: bool

In [49]:
data = X_all.merge(y_all, left_index=True, right_index=True)

data

Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31,10bagger
A,1.0,1.000000,1.049685,0.988959,0.997634,0.981073,1.059148,1.067035,1.089117,1.103312,...,5.361987,5.270505,5.141167,4.991325,4.942429,5.168770,5.242902,5.201893,5.320189,False
AA,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.227826,1.200435,1.188261,1.147391,1.093478,1.180000,1.180870,1.156522,1.155652,False
AAAGY,1.0,1.000000,1.038095,0.989841,1.013968,1.013333,1.028571,1.025397,1.047619,1.038730,...,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,False
AAAP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,False
AAC,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.105946,0.096216,0.088108,0.084324,0.082162,0.090270,0.080541,0.076216,0.075676,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.984615,0.900000,0.870769,0.842308,0.852308,0.922308,0.900000,1.050000,1.129231,False
ZYNE,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.229538,0.220308,0.197538,0.177846,0.180923,0.179692,0.181538,0.179692,0.182769,False
ZYTO,1.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.090909,1.090909,...,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,False
ZYXI,1.0,1.008264,1.008264,0.966942,0.991736,0.909091,0.826446,0.826446,0.834711,0.867769,...,2.446281,2.537190,2.388430,2.322314,2.223140,2.173554,2.173554,2.190083,2.429752,False


In [72]:
# Shuffle the data
shuffled_data = data.sample(frac=1)
display(shuffled_data.head())

# Split 
split = 6000
train_x = shuffled_data.iloc[:split,:-1]
train_y = shuffled_data.iloc[:split, -1]
display(train_x)
display(train_y)

# NOTE: slicing from split+1 skips the row at position `split`; `split:` would include it
valid_x = shuffled_data.iloc[split+1:,:-1]
valid_y = shuffled_data.iloc[split+1:, -1]
display(valid_x)
display(valid_y)


Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31,10bagger
ATLS1,1.0,0.930435,0.878261,0.782609,0.686957,0.686957,0.704348,0.843478,1.173913,1.347826,...,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,True
OPI,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.368146,0.367624,0.353003,0.344648,0.360836,0.376501,0.388512,0.379634,0.358747,False
IAGX,1.0,1.0,1.0,1.0,0.645161,0.516129,1.064516,1.064516,1.129032,0.741935,...,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,False
CMLP,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,False
OCX,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.17,0.161111,0.161111,0.163333,0.161111,0.164444,0.161111,0.163333,0.153333,False


Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-17,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31
ATLS1,1.0,0.930435,0.878261,0.782609,0.686957,0.686957,0.704348,0.843478,1.173913,1.347826,...,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609,27.582609
OPI,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.377023,0.368146,0.367624,0.353003,0.344648,0.360836,0.376501,0.388512,0.379634,0.358747
IAGX,1.0,1.000000,1.000000,1.000000,0.645161,0.516129,1.064516,1.064516,1.129032,0.741935,...,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903,0.012903
CMLP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142,0.350142
OCX,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.161111,0.170000,0.161111,0.161111,0.163333,0.161111,0.164444,0.161111,0.163333,0.153333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CALL1,1.0,0.949153,1.005650,1.005650,1.016949,1.016949,1.016949,1.011299,1.016949,1.073446,...,0.022599,0.022599,0.022599,0.022599,0.022599,0.022599,0.022599,0.022599,0.022599,0.022599
DKILF,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,3.801020,3.801020,3.801020,3.801020,3.644558,3.644558,3.644558,3.644558,3.605442,3.605442
FCB,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.587393,1.561605,1.544890,1.574976,1.553964,1.520057,1.599331,1.580707,1.608405,1.603629
TRTL,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.060628,1.060628,1.060628,1.060628,1.060628,1.060628,1.060628,1.060628,1.060628,1.060628


ATLS1     True
OPI      False
IAGX     False
CMLP     False
OCX      False
         ...  
CALL1    False
DKILF    False
FCB      False
TRTL     False
SII      False
Name: 10bagger, Length: 6000, dtype: bool

Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-17,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31
OBNK,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.975733,0.960267,0.900000,0.871200,0.860533,0.863200,0.890667,0.888800,0.912000,0.908800
MVL,1.0,1.002070,1.015728,0.984272,0.991308,0.961921,0.999586,1.016142,0.993377,1.021109,...,2.237169,2.237169,2.237169,2.237169,2.237169,2.237169,2.237169,2.237169,2.237169,2.237169
T,1.0,0.983941,0.998264,0.979167,0.980035,0.942708,0.999132,1.013455,1.056858,1.053385,...,1.296007,1.291233,1.294271,1.243490,1.228733,1.187500,1.218750,1.221788,1.235243,1.238715
FBIZ,1.0,1.000000,0.993538,0.993538,0.956058,0.956381,0.956381,0.956381,0.956381,0.956381,...,3.235864,3.200323,3.137318,3.260097,3.182553,3.121163,3.108239,3.069467,3.147011,3.151858
RGLXY,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.711250,0.711250,0.716250,0.716250,0.716250,0.716250,0.716250,0.716250,0.716250,0.707500
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
PSA-PP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.973717,0.973717,0.973717,0.973717,0.973717,0.973717,0.973717,0.973717,0.973717,0.973717
ARCL,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.590836,0.590836,0.590836,0.590836,0.590836,0.590836,0.590836,0.590836,0.590836,0.590836
CNAT,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.146316,0.128421,0.129474,0.132632,0.129474,0.123158,0.144211,0.135789,0.140000,0.182105
TPL,1.0,0.971053,1.052632,0.987895,0.960526,0.867895,0.958947,0.972632,1.025263,1.015789,...,27.376316,25.631053,25.574737,24.254211,22.736316,23.216842,25.431053,26.699474,28.826842,28.506842


OBNK      False
MVL       False
T         False
FBIZ      False
RGLXY     False
          ...  
PSA-PP    False
ARCL      False
CNAT      False
TPL        True
KRP       False
Name: 10bagger, Length: 3880, dtype: bool
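
A seeded variant of the shuffle-and-split step could look like the sketch below. It is an illustration on made-up data, not the notebook's code: the function name `shuffle_and_split` and the `seed` parameter are hypothetical. Fixing `random_state` makes the split repeatable between runs, and slicing the validation set from `train_size` onward puts every row in exactly one of the two sets.

```python
import pandas as pd


def shuffle_and_split(data: pd.DataFrame, train_size: int, seed: int = 0):
    """Shuffle rows reproducibly, then split into train and validation frames."""
    shuffled = data.sample(frac=1, random_state=seed)
    train = shuffled.iloc[:train_size]
    valid = shuffled.iloc[train_size:]
    return train, valid


# Toy frame: one feature column followed by the label column, as in `data` above.
toy = pd.DataFrame({"feature": range(10), "10bagger": [False] * 8 + [True] * 2})

train, valid = shuffle_and_split(toy, train_size=7)
train_x, train_y = train.iloc[:, :-1], train.iloc[:, -1]
valid_x, valid_y = valid.iloc[:, :-1], valid.iloc[:, -1]

print(len(train), len(valid))  # 7 3
```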

In [76]:
C_values = [1e-5, 1e-3, 1e-2, 1e-1, 5e-1, 1.0, 5, 10, 1e2]

for C in C_values:
    
    print("Regularization: {}".format(C))

    model = linear_model.LogisticRegression(C=C, class_weight='balanced', max_iter=10000)
    model.fit(train_x, train_y)

    predicts = model.predict(valid_x)
    correct = predicts == valid_y
    
    TP,FP,TN,FN = calc_metrics(predicts,valid_y)

    print(TP,FP,TN,FN)

    precision, recall, accuracy, TPR, TNR, BER = calc_error_rates(TP, FP, TN, FN)

    print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
    print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))
    print('\n')

Regularization: 1e-05
209 124 3543 4
Precision:0.628 Recall:0.981
Accuracy:0.967 TPR:0.981 TNR:0.966 BER:0.026


Regularization: 0.001
211 81 3586 2
Precision:0.723 Recall:0.991
Accuracy:0.979 TPR:0.991 TNR:0.978 BER:0.016


Regularization: 0.01
208 62 3605 5
Precision:0.770 Recall:0.977
Accuracy:0.983 TPR:0.977 TNR:0.983 BER:0.020


Regularization: 0.1
201 43 3624 12
Precision:0.824 Recall:0.944
Accuracy:0.986 TPR:0.944 TNR:0.988 BER:0.034


Regularization: 0.5
201 41 3626 12
Precision:0.831 Recall:0.944
Accuracy:0.986 TPR:0.944 TNR:0.989 BER:0.034


Regularization: 1.0
199 38 3629 14
Precision:0.840 Recall:0.934
Accuracy:0.987 TPR:0.934 TNR:0.990 BER:0.038


Regularization: 5
197 39 3628 16
Precision:0.835 Recall:0.925
Accuracy:0.986 TPR:0.925 TNR:0.989 BER:0.043


Regularization: 10
197 40 3627 16
Precision:0.831 Recall:0.925
Accuracy:0.986 TPR:0.925 TNR:0.989 BER:0.043


Regularization: 100.0
197 41 3626 16
Precision:0.828 Recall:0.925
Accuracy:0.985 TPR:0.925 TNR:0.989 BER:0.043



In [79]:
label_filenames = ['labels_12-31-2010.csv',
              'labels_12-31-2011.csv',
              'labels_12-31-2012.csv',
              'labels_12-31-2013.csv',
              'labels_12-31-2014.csv',
              'labels_12-31-2015.csv',
              'labels_12-31-2016.csv',
              'labels_12-31-2017.csv',
              'labels_12-31-2018.csv'
             ]

prices_filenames = ['inputs_notfilled_2010-12-31.csv',
            'inputs_notfilled_2011-12-31.csv',
            'inputs_notfilled_2012-12-31.csv',
            'inputs_notfilled_2013-12-31.csv',
            'inputs_notfilled_2014-12-31.csv',
            'inputs_notfilled_2015-12-31.csv',
            'inputs_notfilled_2016-12-31.csv',
            'inputs_notfilled_2017-12-31.csv',
            'inputs_notfilled_2018-12-31.csv']

for label_filename, prices_filename in zip(label_filenames, prices_filenames):
    
    print(label_filename)
    
    labels = pd.read_csv("../datasets/sharader/"+label_filename)
    y = labels.set_index('ticker')
    y_all = y['10bagger']
    
    
    prices = pd.read_csv("../datasets/sharader/"+prices_filename)
    X = prices.set_index('date')
    first_valid_prices = X.apply(first_valid_idx, axis=0)
    
    X_filled = X.ffill(axis=0)  # forward fill along date axis with last valid price
    X_filled = X_filled.fillna(0)  # fill all other NaN with zero - remaining NaN before the first valid price

    # Transpose the DataFrame so that each row is a ticker, then normalize by the first valid price
    X_all = X_filled.transpose().div(first_valid_prices, axis=0)
    
    data = X_all.merge(y_all, left_index=True, right_index=True)
    
    # Shuffle the data
    shuffled_data = data.sample(frac=1)

    # Split 
    split = int(len(shuffled_data.index)*0.7)
    train_x = shuffled_data.iloc[:split,:-1]
    train_y = shuffled_data.iloc[:split, -1]

    # NOTE: slicing from split+1 skips the row at position `split`; `split:` would include it
    valid_x = shuffled_data.iloc[split+1:,:-1]
    valid_y = shuffled_data.iloc[split+1:, -1]
    
    C_values = [1e-5, 1e-3, 1e-2, 1e-1, 5e-1, 1.0, 5, 10, 1e2]

    for C in C_values:

        print("Regularization: {}".format(C))

        model = linear_model.LogisticRegression(C=C, class_weight='balanced', max_iter=10000)
        model.fit(train_x, train_y)

        predicts = model.predict(valid_x)
        correct = predicts == valid_y

        TP,FP,TN,FN = calc_metrics(predicts,valid_y)

        print(TP,FP,TN,FN)

        precision, recall, accuracy, TPR, TNR, BER = calc_error_rates(TP, FP, TN, FN)

        print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
        print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))
        print('\n')



labels_12-31-2010.csv
Regularization: 1e-05
96 246 1570 52
Precision:0.281 Recall:0.649
Accuracy:0.848 TPR:0.649 TNR:0.865 BER:0.243


Regularization: 0.001
101 271 1545 47
Precision:0.272 Recall:0.682
Accuracy:0.838 TPR:0.682 TNR:0.851 BER:0.233


Regularization: 0.01
100 268 1548 48
Precision:0.272 Recall:0.676
Accuracy:0.839 TPR:0.676 TNR:0.852 BER:0.236


Regularization: 0.1
92 253 1563 56
Precision:0.267 Recall:0.622
Accuracy:0.843 TPR:0.622 TNR:0.861 BER:0.259


Regularization: 0.5
87 250 1566 61
Precision:0.258 Recall:0.588
Accuracy:0.842 TPR:0.588 TNR:0.862 BER:0.275


Regularization: 1.0
85 256 1560 63
Precision:0.249 Recall:0.574
Accuracy:0.838 TPR:0.574 TNR:0.859 BER:0.283


Regularization: 5
75 265 1551 73
Precision:0.221 Recall:0.507
Accuracy:0.828 TPR:0.507 TNR:0.854 BER:0.320


Regularization: 10
76 267 1549 72
Precision:0.222 Recall:0.514
Accuracy:0.827 TPR:0.514 TNR:0.853 BER:0.317


Regularization: 100.0
77 278 1538 71
Precision:0.217 Recall:0.520
Accuracy:0.822 TPR:0

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


75 178 1780 101
Precision:0.296 Recall:0.426
Accuracy:0.869 TPR:0.426 TNR:0.909 BER:0.332


labels_12-31-2013.csv
Regularization: 1e-05
135 234 1825 32
Precision:0.366 Recall:0.808
Accuracy:0.881 TPR:0.808 TNR:0.886 BER:0.153


Regularization: 0.001
135 220 1839 32
Precision:0.380 Recall:0.808
Accuracy:0.887 TPR:0.808 TNR:0.893 BER:0.149


Regularization: 0.01
133 203 1856 34
Precision:0.396 Recall:0.796
Accuracy:0.894 TPR:0.796 TNR:0.901 BER:0.151


Regularization: 0.1
128 192 1867 39
Precision:0.400 Recall:0.766
Accuracy:0.896 TPR:0.766 TNR:0.907 BER:0.163


Regularization: 0.5
116 190 1869 51
Precision:0.379 Recall:0.695
Accuracy:0.892 TPR:0.695 TNR:0.908 BER:0.199


Regularization: 1.0
110 182 1877 57
Precision:0.377 Recall:0.659
Accuracy:0.893 TPR:0.659 TNR:0.912 BER:0.215


Regularization: 5
97 172 1887 70
Precision:0.361 Recall:0.581
Accuracy:0.891 TPR:0.581 TNR:0.916 BER:0.251


Regularization: 10
92 172 1887 75
Precision:0.348 Recall:0.551
Accuracy:0.889 TPR:0.551 TNR:0.916 BE

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


85 168 1891 82
Precision:0.336 Recall:0.509
Accuracy:0.888 TPR:0.509 TNR:0.918 BER:0.286


labels_12-31-2014.csv
Regularization: 1e-05
122 200 2012 20
Precision:0.379 Recall:0.859
Accuracy:0.907 TPR:0.859 TNR:0.910 BER:0.116


Regularization: 0.001
122 192 2020 20
Precision:0.389 Recall:0.859
Accuracy:0.910 TPR:0.859 TNR:0.913 BER:0.114


Regularization: 0.01
122 193 2019 20
Precision:0.387 Recall:0.859
Accuracy:0.910 TPR:0.859 TNR:0.913 BER:0.114


Regularization: 0.1
114 179 2033 28
Precision:0.389 Recall:0.803
Accuracy:0.912 TPR:0.803 TNR:0.919 BER:0.139


Regularization: 0.5
102 162 2050 40
Precision:0.386 Recall:0.718
Accuracy:0.914 TPR:0.718 TNR:0.927 BER:0.177


Regularization: 1.0
101 155 2057 41
Precision:0.395 Recall:0.711
Accuracy:0.917 TPR:0.711 TNR:0.930 BER:0.179


Regularization: 5
90 141 2071 52
Precision:0.390 Recall:0.634
Accuracy:0.918 TPR:0.634 TNR:0.936 BER:0.215


Regularization: 10
88 138 2074 54
Precision:0.389 Recall:0.620
Accuracy:0.918 TPR:0.620 TNR:0.938 BER

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html.
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression


80 144 2068 62
Precision:0.357 Recall:0.563
Accuracy:0.912 TPR:0.563 TNR:0.935 BER:0.251


labels_12-31-2015.csv
Regularization: 1e-05
136 188 2262 24
Precision:0.420 Recall:0.850
Accuracy:0.919 TPR:0.850 TNR:0.923 BER:0.113


Regularization: 0.001
142 165 2285 18
Precision:0.463 Recall:0.887
Accuracy:0.930 TPR:0.887 TNR:0.933 BER:0.090


Regularization: 0.01
138 144 2306 22
Precision:0.489 Recall:0.863
Accuracy:0.936 TPR:0.863 TNR:0.941 BER:0.098


Regularization: 0.1
128 117 2333 32
Precision:0.522 Recall:0.800
Accuracy:0.943 TPR:0.800 TNR:0.952 BER:0.124


Regularization: 0.5
116 106 2344 44
Precision:0.523 Recall:0.725
Accuracy:0.943 TPR:0.725 TNR:0.957 BER:0.159


Regularization: 1.0
108 102 2348 52
Precision:0.514 Recall:0.675
Accuracy:0.941 TPR:0.675 TNR:0.958 BER:0.183


Regularization: 5
101 98 2352 59
Precision:0.508 Recall:0.631
Accuracy:0.940 TPR:0.631 TNR:0.960 BER:0.204


Regularization: 10
99 98 2352 61
Precision:0.503 Recall:0.619
Accuracy:0.939 TPR:0.619 TNR:0.960 BER:

In [46]:
X = np.array([[1, 7], [1, 2.5], [1, 0.67], [1, 0.5], [1, 0.33], [1, 4], [1, 3], [1, -1]])
print(X.shape)
y = np.array([1, 0, 0, 0, 0, 1, 1, 0])
print(y.shape)

(8, 2)
(8,)


In [47]:
model = linear_model.LogisticRegression()
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [48]:
predicts = model.predict(X)

In [49]:
predicts

array([1, 0, 0, 0, 0, 1, 1, 0])

In [2]:
df = pd.DataFrame([[np.nan, np.nan, 1, 7], [np.nan, 3, 2.5, 5], [1, 0.67, 3], [np.nan, 1, 0.5, 0.2]])
print(df)

print(df.apply(first_valid_idx, axis=1))

     0     1    2    3
0  NaN   NaN  1.0  7.0
1  NaN  3.00  2.5  5.0
2  1.0  0.67  3.0  NaN
3  NaN  1.00  0.5  0.2
0    1.0
1    3.0
2    1.0
3    1.0
dtype: float64


In [None]:
df