# Logistic Regression - Price + Industry

For a ticker, given a sample period (e.g. 3-1-2009 to 12-31-2015), we train a logistic regression model to predict whether it will become a 10 Bagger by 3-31-2019:  

10_bagger = f(p_t0, p_t1, ..., p_t_end, ind_1, ..., ind_178)
 
where:  
* p_t0 - ticker's price on 3-1-2009
* p_t_end - ticker's price on the last date in the sample period
* ind_i - ticker's industry (one hot)

We want to see whether logistic regression can be improved by adding context-specific parameters such as industry grouping.
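To make the shape of the model concrete, here is a minimal sketch on synthetic data. The prices, the three industry names and the toy label are all assumptions made up for illustration; only the layout (price columns followed by one-hot industry columns, fed to scikit-learn's `LogisticRegression`) mirrors the formula above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
n_tickers, n_days = 40, 5
industry_names = ["Gold", "Airlines", "Biotechnology"]  # stand-ins for ind_1..ind_178

# Synthetic normalized prices p_t0 .. p_t_end, one row per (made-up) ticker.
price_cols = ["p_t{}".format(i) for i in range(n_days)]
prices = pd.DataFrame(rng.uniform(0.5, 3.0, size=(n_tickers, n_days)),
                      columns=price_cols)

# One-hot industry columns; a fixed category list keeps the column set stable.
industry = pd.Categorical(rng.choice(industry_names, size=n_tickers),
                          categories=industry_names)
ind_onehot = pd.get_dummies(pd.Series(industry), dtype=float)
ind_onehot.columns = ["ind_" + name for name in industry_names]

features = pd.concat([prices, ind_onehot], axis=1)

# Toy label (NOT the real 10bagger label): final price above the median.
last_price = prices[price_cols[-1]]
labels = (last_price > last_price.median()).astype(int)

model = LogisticRegression(max_iter=1000)
model.fit(features, labels)
proba = model.predict_proba(features)[:, 1]  # estimated P(label = 1) per ticker
print(features.shape, proba[:3])
```

In the notebook itself the feature matrix has one column per trading day plus one per industry, so it is far wider than this toy version.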

In [57]:
import quandl  # Access to Sharadar Core US Equities Bundle
api_key = '7B87ndLPJbCDzpNHosH3'

import math
import platform
import matplotlib
import matplotlib.pyplot as plt
from pylab import rcParams
import numpy as np
from sklearn import linear_model  # package for logistic regression (not using GPU)
import torch
import pandas as pd
from IPython.display import display
import time
import pickle

from utils import *

from datetime import date, datetime, timedelta  # datetime.time is not imported, so it does not shadow the `time` module above


print("Python version: ", platform.python_version())
print("Pytorch version: {}".format(torch.__version__))

Python version:  3.6.6
Pytorch version: 1.1.0


## Import Labels

For each sample period (e.g. 3-1-2009 to 12-31-2018), we want to import a list of valid tickers. A valid ticker is defined as a ticker which is active for at least 180 days before the end of the sample period. 

For example, if the sample period ends on 12-31-2018, a ticker must have been active since 7-4-2018. Any ticker that went public after 7-4-2018 is not a valid ticker, since there is not enough price history to make an educated prediction.
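The 180-day rule can be sketched as a date comparison. The tickers below are hypothetical, and using `firstpricedate` as the "active since" date, with an inclusive cutoff, is an assumption based on the example above rather than the actual label-building code.

```python
import pandas as pd

sample_end = pd.Timestamp("2018-12-31")
cutoff = sample_end - pd.Timedelta(days=180)  # latest allowed first price date

# Hypothetical tickers: one long-listed, one exactly on the cutoff, one too recent.
meta = pd.DataFrame(
    {
        "ticker": ["OLD", "EDGE", "NEW"],
        "firstpricedate": pd.to_datetime(["1999-11-18", "2018-07-04", "2018-09-01"]),
    }
).set_index("ticker")

# Assumption: a ticker first priced on the cutoff date itself still counts as valid.
valid = meta[meta["firstpricedate"] <= cutoff]

print(cutoff.date())      # 2018-07-04
print(list(valid.index))  # ['OLD', 'EDGE']
```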

In [2]:
labels = pd.read_csv("../datasets/sharader/labels_12-31-2018.csv")

y = labels.set_index('ticker')
y['firstpricedate']= pd.to_datetime(y['firstpricedate'])
y['lastpricedate']= pd.to_datetime(y['lastpricedate'])

y.head()

ticker,appreciation,10bagger,table,permaticker,name,exchange,isdelisted,category,cusips,siccode,...,currency,location,lastupdated,firstadded,firstpricedate,lastpricedate,firstquarter,lastquarter,secfilings,companysite
A,6.339117,False,SEP,196290,Agilent Technologies Inc,NYSE,N,Domestic,00846U101,3826.0,...,USD,California; U.S.A,2020-01-14,2014-09-26,1999-11-18,2020-01-14,1997-06-30,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,http://www.agilent.com
AA,1.224348,False,SEP,124392,Alcoa Corp,NYSE,N,Domestic,013872106,3350.0,...,USD,New York; U.S.A,2020-01-14,2016-11-01,2016-11-01,2020-01-14,2014-12-31,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,http://www.alcoa.com
AAAGY,1.275556,False,SEP,120538,Altana Aktiengesellschaft,NYSE,Y,ADR,02143N103,2834.0,...,USD,Jordan,2018-10-16,2018-02-13,2002-05-22,2010-08-12,2000-12-31,2005-12-31,https://www.sec.gov/cgi-bin/browse-edgar?actio...,
AAAP,3.331837,False,SEP,155760,Advanced Accelerator Applications SA,NASDAQ,Y,ADR,00790T100,2834.0,...,USD,France,2018-06-28,2016-05-19,2015-11-11,2018-02-09,2012-12-31,2016-12-31,https://www.sec.gov/cgi-bin/browse-edgar?actio...,
AAC,0.099459,False,SEP,187592,AAC Holdings Inc,NYSE,Y,Domestic,000307108,8093.0,...,USD,Tennessee; U.S.A,2019-10-25,2015-09-11,2014-10-02,2019-10-25,2013-09-30,2019-09-30,https://www.sec.gov/cgi-bin/browse-edgar?actio...,


### A list of active tickers and corresponding industries

In [5]:
tickers = list(y.index)
print(len(tickers))

industries = list(y['industry'])

valid_tickers = pd.Series(tickers, name = 'ticker')
ticker_industries = pd.Series(industries, name = 'industry')

display(valid_tickers.head())
display(ticker_industries.head())

9881


0        A
1       AA
2    AAAGY
3     AAAP
4      AAC
Name: ticker, dtype: object

0    Diagnostics & Research
1                  Aluminum
2             Biotechnology
3             Biotechnology
4              Medical Care
Name: industry, dtype: object

In [7]:
y_train = y['10bagger']

print(y_train.shape)
print(sum(y_train))

(9881,)
531
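The two numbers printed above imply that positive labels are rare. A quick check of the arithmetic, using only those two values:

```python
n_tickers = 9881   # length of y_train printed above
n_positive = 531   # sum of the boolean 10bagger labels printed above

positive_rate = n_positive / n_tickers
print("{:.2%} of tickers are labeled as 10 baggers".format(positive_rate))
```

Roughly one ticker in nineteen carries a positive label.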


In [6]:
prices = pd.read_csv("../datasets/sharader/inputs_notfilled_2018-12-31.csv")
X = prices.set_index('date')

X

date,A,AA,AAAGY,AAAP,AAC,AACC,AACG,AACPF,AAGIY,AAI,...,ZUO,ZURVY,ZVO,ZVUE,ZXAIY,ZYME,ZYNE,ZYTO,ZYXI,ZZ
2009-03-02,12.68,,15.75,,,3.29,5.180,,,2.54,...,,12.750,,0.01,,,,0.011,1.21,0.84
2009-03-03,12.68,,15.75,,,3.30,5.320,,,2.46,...,,12.850,,0.01,,,,0.011,1.22,0.76
2009-03-04,13.31,,16.35,,,3.33,5.080,,,2.78,...,,13.740,,0.01,,,,0.011,1.22,0.76
2009-03-05,12.54,,15.59,,,3.30,5.080,,,2.56,...,,11.910,,0.01,,,,0.011,1.17,0.58
2009-03-06,12.65,,15.97,,,3.40,5.250,,,2.89,...,,11.300,,0.01,,,,0.011,1.20,0.55
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2018-12-24,62.67,25.15,,,1.52,,0.970,,32.335,,...,16.36,28.680,6.73,,0.67,11.08,2.94,,2.69,
2018-12-26,65.54,27.14,,,1.67,,0.998,,32.700,,...,17.81,29.050,6.78,,0.67,11.99,2.92,,2.63,
2018-12-27,66.48,27.16,,,1.49,,0.980,,32.330,,...,17.90,28.769,6.83,,0.62,11.70,2.95,,2.63,
2018-12-28,65.96,26.60,,,1.41,,1.000,,32.900,,...,17.65,29.668,6.87,,0.72,13.65,2.92,,2.65,


### Backfill and Forward Fill X_t

The logistic regression model expects no NaN values. We fill in the NaN values in the following way:

1. If there are NaN values before the first valid price, the ticker has an IPO in the sample period. We thus set these NaN values to zero.
2. If there are NaN values after the last valid price, the ticker has been delisted in the sample period. We thus set these NaN values to the last valid price.
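The two rules can be sketched on a toy price table. The three column names are hypothetical tickers; the point is the order of operations: forward fill first, so that only the NaN values before the first valid price remain to be zeroed.

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(["2009-03-02", "2009-03-03", "2009-03-04", "2009-03-05"])
toy = pd.DataFrame(
    {
        "LISTED": [10.0, 11.0, 12.0, 13.0],       # priced on every date
        "IPO": [np.nan, np.nan, 5.0, 6.0],        # first priced mid-period
        "DELISTED": [2.0, 3.0, np.nan, np.nan],   # stops trading mid-period
    },
    index=dates,
)

# Rule 2: carry the last valid price forward (covers the delisted tail).
# Rule 1: whatever is still NaN lies before the first valid price -> zero.
filled = toy.ffill(axis=0).fillna(0)

print(filled)
```

Reversing the order would zero out the delisted tail instead of carrying the last price forward.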


### Normalize Ticker Price

The model converges when ticker prices are normalized so that each ticker's first valid price equals 1.0.
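The helper `first_valid_idx` used in the next cell comes from `utils` and is not shown in this notebook. The sketch below uses a hypothetical stand-in, `first_valid_price`, to illustrate the normalization on toy data; it is an assumption about what the helper returns, not its actual implementation.

```python
import numpy as np
import pandas as pd

def first_valid_price(column):
    """Hypothetical stand-in: return the first non-NaN price in a column."""
    idx = column.first_valid_index()
    return np.nan if idx is None else column.loc[idx]

dates = pd.to_datetime(["2009-03-02", "2009-03-03", "2009-03-04"])
toy = pd.DataFrame(
    {"LISTED": [10.0, 11.0, 12.0], "IPO": [np.nan, 5.0, 6.0]},
    index=dates,
)

first_prices = toy.apply(first_valid_price, axis=0)  # one price per ticker

# Fill, put tickers on rows, then divide each row by its first valid price.
normalized = toy.ffill().fillna(0).transpose().div(first_prices, axis=0)

print(normalized)
```

After this step every ticker's series starts at 1.0 on its first priced date, and dates before that stay at 0.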

In [8]:
first_valid_prices = X.apply(first_valid_idx, axis=0)

first_valid_prices

A        12.680
AA       23.000
AAAGY    15.750
AAAP     24.500
AAC      18.500
          ...  
ZYME     13.000
ZYNE     16.250
ZYTO      0.011
ZYXI      1.210
ZZ        0.840
Length: 9881, dtype: float64

In [35]:
X_filled = X.ffill(axis=0)  # forward fill along the date axis with the last valid price
X_filled = X_filled.fillna(0)  # remaining NaN values lie before the first valid price; set them to zero

# Transpose so that each row is a ticker, then divide by the ticker's first valid price
X_all = X_filled.transpose().div(first_valid_prices, axis=0)

X_all

date,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,2018-12-17,2018-12-18,2018-12-19,2018-12-20,2018-12-21,2018-12-24,2018-12-26,2018-12-27,2018-12-28,2018-12-31
A,1.0,1.000000,1.049685,0.988959,0.997634,0.981073,1.059148,1.067035,1.089117,1.103312,...,5.350946,5.361987,5.270505,5.141167,4.991325,4.942429,5.168770,5.242902,5.201893,5.320189
AA,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,1.222174,1.227826,1.200435,1.188261,1.147391,1.093478,1.180000,1.180870,1.156522,1.155652
AAAGY,1.0,1.000000,1.038095,0.989841,1.013968,1.013333,1.028571,1.025397,1.047619,1.038730,...,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556,1.275556
AAAP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837,3.331837
AAC,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.106486,0.105946,0.096216,0.088108,0.084324,0.082162,0.090270,0.080541,0.076216,0.075676
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.982308,0.984615,0.900000,0.870769,0.842308,0.852308,0.922308,0.900000,1.050000,1.129231
ZYNE,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,0.252308,0.229538,0.220308,0.197538,0.177846,0.180923,0.179692,0.181538,0.179692,0.182769
ZYTO,1.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.090909,1.090909,...,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273,2.727273
ZYXI,1.0,1.008264,1.008264,0.966942,0.991736,0.909091,0.826446,0.826446,0.834711,0.867769,...,2.512397,2.446281,2.537190,2.388430,2.322314,2.223140,2.173554,2.173554,2.190083,2.429752


### Append industry columns to X

In [36]:
ind_names = pd.read_csv("sectors.csv").set_index('industry')

for industry in list(ind_names.index):
    X_all[industry] = False
    
X_all   

date,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,Staffing & Outsourcing Services,Marketing Services,Home Furnishings & Fixtures,Data Storage,Beverages - Soft Drinks,Integrated Shipping & Logistics,Financial Exchanges,Infrastructure Operations,Banks - Regional - Latin America,Coking Coal
A,1.0,1.000000,1.049685,0.988959,0.997634,0.981073,1.059148,1.067035,1.089117,1.103312,...,False,False,False,False,False,False,False,False,False,False
AA,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
AAAGY,1.0,1.000000,1.038095,0.989841,1.013968,1.013333,1.028571,1.025397,1.047619,1.038730,...,False,False,False,False,False,False,False,False,False,False
AAAP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
AAC,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
ZYNE,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
ZYTO,1.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.090909,1.090909,...,False,False,False,False,False,False,False,False,False,False
ZYXI,1.0,1.008264,1.008264,0.966942,0.991736,0.909091,0.826446,0.826446,0.834711,0.867769,...,False,False,False,False,False,False,False,False,False,False
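The industry columns above start out as all `False` and are filled in per ticker later. As an alternative, the one-hot block can be built in a single step with `pd.get_dummies`. This is a sketch, not the notebook's method: the `NOIND` ticker is hypothetical, and treating a missing industry as an all-`False` row is an assumption.

```python
import pandas as pd

# A few industry names; in the notebook these would come from sectors.csv.
industry_names = ["Diagnostics & Research", "Aluminum", "Biotechnology", "Medical Care"]

ticker_industry = pd.Series(
    ["Diagnostics & Research", "Aluminum", "Biotechnology", "Biotechnology", None],
    index=["A", "AA", "AAAGY", "AAAP", "NOIND"],  # NOIND: hypothetical ticker with no industry
)

# A categorical with an explicit category list yields one column per industry,
# including industries that no ticker in this sample belongs to.
as_category = pd.Series(
    pd.Categorical(ticker_industry, categories=industry_names),
    index=ticker_industry.index,
)
one_hot = pd.get_dummies(as_category, dtype=bool)
one_hot.columns = list(one_hot.columns)  # plain labels instead of a CategoricalIndex

print(one_hot)
```

The resulting frame is indexed by ticker, so it can be joined to the price columns on the index.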


### Append y labels to dataset

In [37]:
y_all = y['10bagger']

data = X_all.merge(y_all, left_index=True, right_index=True)

data

ticker,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,Marketing Services,Home Furnishings & Fixtures,Data Storage,Beverages - Soft Drinks,Integrated Shipping & Logistics,Financial Exchanges,Infrastructure Operations,Banks - Regional - Latin America,Coking Coal,10bagger
A,1.0,1.000000,1.049685,0.988959,0.997634,0.981073,1.059148,1.067035,1.089117,1.103312,...,False,False,False,False,False,False,False,False,False,False
AA,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
AAAGY,1.0,1.000000,1.038095,0.989841,1.013968,1.013333,1.028571,1.025397,1.047619,1.038730,...,False,False,False,False,False,False,False,False,False,False
AAAP,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
AAC,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
ZYME,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
ZYNE,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
ZYTO,1.0,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.000000,1.090909,1.090909,...,False,False,False,False,False,False,False,False,False,False
ZYXI,1.0,1.008264,1.008264,0.966942,0.991736,0.909091,0.826446,0.826446,0.834711,0.867769,...,False,False,False,False,False,False,False,False,False,False


In [43]:
for ticker, industry in zip(list(y.index), list(y['industry'])):
    print(ticker, industry)
    data.loc[ticker,industry] = True

A Diagnostics & Research
AA Aluminum
AAAGY Biotechnology
AAAP Biotechnology
AAC Medical Care
AACC Asset Management
AACG Education & Training Services
AACPF Conglomerates
AAGIY None
AAI Airlines
AAIIQ Aerospace & Defense
AAIR Airports & Air Services
AAL Airlines
AAMC Asset Management
AAME Insurance - Life
AAN Rental & Leasing Services
AANB Banks - Regional
AAOI Semiconductors
AAON Building Products & Equipment
AAP Specialty Retail
AAPC Leisure
AAPL Consumer Electronics
AAT REIT - Diversified
AATI Semiconductors
AAU Gold
AAUKF None
AAV Oil & Gas E&P
AAWW Airports & Air Services
AAXN Aerospace & Defense
AB Asset Management
ABAT Electrical Equipment & Parts
ABAX Diagnostics & Research
ABB Specialty Industrial Machinery
ABBC Banks - Regional
ABBV Drug Manufacturers - General
ABC Medical Distribution
ABCB Banks - Regional
ABCD Education & Training Services
ABCFF None
ABCO Health Information Services
ABCP REIT - Diversified
ABCW Banks - Regional
ABDC Asset Management
ABDS Specialty Chemicals


ALPN Biotechnology
ALPP Communication Equipment
ALR Diagnostics & Research
ALR-PB Diagnostics & Research
ALRM Software - Application
ALRN Biotechnology
ALRN1 None
ALSK Telecom Services
ALSN Auto Parts
ALT Biotechnology
ALTD Staffing & Employment Services
ALTE Insurance - Property & Casualty
ALTH Biotechnology
ALTI Specialty Chemicals
ALTM Conglomerates
ALTR Software - Infrastructure
ALTR1 Semiconductors
ALTUQ Biotechnology
ALTV Telecom Services
ALTX Oil & Gas E&P
ALU Communication Equipment
ALUS1 Medical Devices
ALV Auto Parts
ALVRQ Communication Equipment
ALX REIT - Retail
ALXA Biotechnology
ALXN Biotechnology
ALY Oil & Gas Equipment & Services
AM Oil & Gas Midstream
AM1 Specialty Retail
AM2 Oil & Gas Midstream
AMAC Engineering & Construction
AMAG Biotechnology
AMAP Software - Application
AMARQ Biotechnology
AMAT Semiconductor Equipment & Materials
AMBA Semiconductor Equipment & Materials
AMBC Insurance - Specialty
AMBI Biotechnology
AMBO Education & Training Services
AMBR Software - 

ATEA Software - Application
ATEC Medical Devices
ATEN Software - Infrastructure
ATEX Telecom Services
ATGE Education & Training Services
ATH Insurance - Diversified
ATHE Biotechnology
ATHL Oil & Gas E&P
ATHM Internet Content & Information
ATHN Health Information Services
ATHR Semiconductors
ATHX Biotechnology
ATHYQ Information Technology Services
ATI Metal Fabrication
ATIS Waste Management
ATISZ Biotechnology
ATKR Diversified Industrials
ATLC Credit Services
ATLKY None
ATLO Banks - Regional
ATLS Oil & Gas E&P
ATLS1 Oil & Gas Midstream
ATLS2 Oil & Gas Midstream
ATLT Packaged Foods
ATMI Chemicals
ATML Semiconductors
ATMS Specialty Chemicals
ATN Oil & Gas E&P
ATNI Telecom Services
ATNM Biotechnology
ATNX Biotechnology
ATNY Semiconductors
ATO Utilities - Regulated Gas
ATOM Semiconductor Equipment & Materials
ATOS Biotechnology
ATPG Oil & Gas E&P
ATPL Packaging & Containers
ATR Packaging & Containers
ATRA Biotechnology
ATRC Medical Instruments & Supplies
ATRI Medical Instruments & Supplies


BH.A Restaurants
BHAC Conglomerates
BHB Banks - Regional
BHBCQ Banks - Regional
BHBK Banks - Regional - US
BHC Drug Manufacturers - Specialty & Generic
BHE Electronic Components
BHF Insurance - Life
BHI Oil & Gas Equipment & Services
BHLB Banks - Regional
BHODQ Oil & Gas Midstream
BHP Other Industrial Metals & Mining
BHR REIT - Hotel & Motel
BHR-PB REIT - Hotel & Motel
BHS Residential Construction
BHTG Pollution & Treatment Controls
BHVN Biotechnology
BICX Medical Care Facilities
BID Specialty Retail
BIDU Internet Content & Information
BIDZ Specialty Business Services
BIEI Biotechnology
BIG Discount Stores
BIIB Drug Manufacturers - General
BIKEQ Recreational Vehicles
BILI Internet Content & Information
BIMI Specialty Industrial Machinery
BINDQ Biotechnology
BIO Medical Devices
BIO.B Medical Devices
BIOA Specialty Chemicals
BIOC Diagnostics & Research
BIOC1 Diagnostics & Research
BIOL Medical Devices
BIOS Medical Care Facilities
BIP Utilities - Diversified
BIQI Farm Products
BIR-PA Real

CASH Banks - Regional
CASI Biotechnology
CASLQ Steel
CASM Medical Instruments & Supplies
CASS Specialty Business Services
CASY Grocery Stores
CAT Farm & Heavy Construction Machinery
CATB Biotechnology
CATC Banks - Regional - US
CATM Business Equipment & Supplies
CATO Apparel Retail
CATS Medical Care Facilities
CATT Software - Application
CATY Banks - Regional
CAV Residential Construction
CAVM Semiconductors
CAVO Engineering & Construction
CAW Household & Personal Products
CB Insurance - Property & Casualty
CB1 Insurance - Property & Casualty
CBAI Diagnostics & Research
CBAN Banks - Regional
CBAT Electrical Equipment & Parts
CBAY Biotechnology
CBB Telecom Services
CBB-PB Telecom Services
CBBO Banks - Regional
CBBT Medical Distribution
CBCGQ Banks - Regional
CBCRQ Banks - Regional
CBD Department Stores
CBDS Personal Services
CBE Aerospace & Defense
CBEH Oil & Gas Refining & Marketing
CBEY Telecom Services
CBF Banks - Regional
CBFV Banks - Regional
CBI Engineering & Construction
CBIO Biot

CLCD Biotechnology
CLCGY None
CLCN Education & Training Services
CLCS Biotechnology
CLCT Specialty Business Services
CLDA Biotechnology
CLDB Banks - Regional
CLDC Credit Services
CLDPQ Coal
CLDR Software - Application
CLDT REIT - Hotel & Motel
CLDX Biotechnology
CLF Steel
CLFC Banks - Regional
CLFD Communication Equipment
CLGN Biotechnology
CLGRF Gold
CLGX Information Technology Services
CLH Waste Management
CLHRF Gold
CLI REIT - Office
CLIR Pollution & Treatment Controls
CLLS Biotechnology
CLMS Capital Markets
CLMT Oil & Gas E&P
CLNC REIT - Office
CLNE Oil & Gas Refining & Marketing
CLNH Real Estate Services
CLNS-PA REIT - Diversified
CLNS-PC REIT - Diversified
CLNS-PD REIT - Diversified
CLNS-PF REIT - Diversified
CLNY REIT - Diversified
CLNY-PA REIT - Diversified
CLNY-PB REIT - Diversified
CLNY-PC REIT - Diversified
CLNY-PE REIT - Diversified
CLNY-PG REIT - Diversified
CLNY-PH REIT - Diversified
CLNY-PI REIT - Diversified
CLNY-PJ REIT - Diversified
CLNY1 REIT - Mortgage
CLP REIT - Mo

CSX Railroads
CSXXY None
CTA-PA Chemicals
CTA-PB Chemicals
CTAS Specialty Business Services
CTB Auto Parts
CTBC1 None
CTBI Banks - Regional
CTCLY Real Estate Services
CTCM Broadcasting
CTCT Advertising Agencies
CTDBQ Broadcasting
CTDH Specialty Chemicals
CTEK Information Technology Services
CTESY Oil & Gas E&P
CTFO Information Technology Services
CTG Information Technology Services
CTGO Gold
CTHR Luxury Goods
CTIB Auto Parts
CTIC Biotechnology
CTL Telecom Services
CTLT Drug Manufacturers - Specialty & Generic
CTMX Biotechnology
CTO Real Estate - Development
CTP Staffing & Employment Services
CTQN Specialty Business Services
CTRC Apparel Manufacturing
CTRE REIT - Healthcare Facilities
CTRL Electronic Components
CTRN Apparel Retail
CTRX Insurance Brokers
CTS Electronic Components
CTSH Information Technology Services
CTSO Medical Devices
CTT REIT - Specialty
CTTAY None
CTUY Banks - Regional
CTV1 Communication Equipment
CTWS Utilities - Regulated Water
CTX1 Residential Construction
CTXR Bi

DRAD Medical Devices
DRC Leisure
DRCO Information Technology Services
DRD Gold
DRE REIT - Industrial
DRE-PL REIT - Industrial
DRH REIT - Hotel & Motel
DRI Restaurants
DRII Resorts & Casinos
DRIO Diagnostics & Research
DRIV1 Electronics & Computer Distribution
DRJ Specialty Retail
DRL Banks - Regional
DRNA Biotechnology
DRQ Oil & Gas Equipment & Services
DRRX Drug Manufacturers - Specialty & Generic
DRTX Biotechnology
DRUG Biotechnology
DRWI Telecom Services
DRYS Shipping & Ports
DS Leisure
DS-PB Leisure
DS-PC Leisure
DS-PD Leisure
DSCI Medical Devices
DSCM Pharmaceutical Retailers
DSCSY None
DSFGY None
DSGX Software - Application
DSHL Credit Services
DSIIQ Household & Personal Products
DSKE Trucking
DSKX Household & Personal Products
DSKY Software - Application
DSKYF None
DSNY Software - Application
DSPG Semiconductors
DSS Specialty Business Services
DST Software - Application
DSTI Semiconductors
DSUP Metal Fabrication
DSWL Electronic Components
DSX Marine Shipping
DSX-PB Marine Shippi

ERYP Biotechnology
ES Utilities - Regulated Electric
ES1 Waste Management
ESALY None
ESBA REIT - Diversified
ESBF Banks - Regional
ESBK Banks - Regional - US
ESC Medical Care Facilities
ESCA Leisure
ESCC Electrical Equipment & Parts
ESCI Pollution & Treatment Controls
ESCR Oil & Gas E&P
ESE Scientific & Technical Instruments
ESEA Marine Shipping
ESES Oil & Gas Equipment & Services
ESGR Insurance - Diversified
ESH Auto Parts
ESI Specialty Chemicals
ESI1 Education & Training Services
ESIC Information Technology Services
ESIMF Software - Application
ESIO Semiconductors
ESL Aerospace & Defense
ESLOY None
ESLRQ Semiconductors
ESLT Aerospace & Defense
ESMC Medical Devices
ESNC Diversified Industrials
ESND Business Equipment
ESNT Mortgage Finance
ESOA Engineering & Construction
ESP Electrical Equipment & Parts
ESPGY None
ESPH Oil & Gas Equipment & Services
ESPR Biotechnology
ESQ Banks - Regional - US
ESRT REIT - Diversified
ESRX Health Care Plans
ESS REIT - Residential
ESS-PH REIT - Residenti

FNB-PE Banks - Regional
FNBC Banks - Regional
FNBG Banks - Regional - US
FNCB Banks - Regional
FNCX Internet Content & Information
FND Home Improvement Stores
FNDT Software - Application
FNET Household & Personal Products
FNF Insurance - Specialty
FNFG Banks - Regional
FNFG-PB Banks - Regional
FNGN Asset Management
FNHC Insurance - Property & Casualty
FNJN Software - Infrastructure
FNKO Leisure
FNLC Banks - Regional
FNLYQ Luxury Goods
FNMA Credit Services
FNRG Chemicals
FNRN Banks - Regional
FNSC Banks - Regional
FNSR Communication Equipment
FNV Gold
FNWB Banks - Regional
FOE Specialty Chemicals
FOFN Banks - Regional
FOGO Restaurants
FOHL Apparel Retail
FOJCY None
FOLD Biotechnology
FOMX Biotechnology
FONR Medical Devices
FOOD1 Packaged Foods
FOR Real Estate - Development
FORBQ Medical Devices
FORD Footwear & Accessories
FORK Furnishings, Fixtures & Appliances
FORM Semiconductors
FORR Consulting Services
FORTY Information Technology Services
FOSL Luxury Goods
FOXF Recreational Vehicles

GMTA1 Specialty Retail
GMTC Leisure
GMXRQ Oil & Gas E&P
GNBC Banks - Regional - US
GNC Pharmaceutical Retailers
GNCA Biotechnology
GNCMA Telecom Services
GNE Utilities - Regulated Electric
GNE-PA Utilities - Regulated Electric
GNET Engineering & Construction
GNI Gold
GNK Marine Shipping
GNL REIT - Office
GNL-PA REIT - Office
GNMK Medical Devices
GNMX Biotechnology
GNOM2 Biotechnology
GNOW Medical Care Facilities
GNPX Biotechnology
GNRC Specialty Industrial Machinery
GNRL Tools & Accessories
GNRT Shipping & Ports
GNSS Scientific & Technical Instruments
GNTA Biotechnology
GNTX Auto Parts
GNTY Banks - Regional
GNUS Entertainment
GNVC Biotechnology
GNW Insurance - Life
GNXEE Entertainment
GOGL Marine Shipping
GOGO Telecom Services
GOL Airlines
GOLD Gold
GOLD1 Gold
GOLF Leisure
GOLF1 Specialty Retail
GOMO Software - Application
GOOD REIT - Diversified
GOODM REIT - Diversified
GOODN1 REIT - Diversified
GOODO REIT - Diversified
GOODP REIT - Diversified
GOOG Internet Content & Information
GOOG

HNH Specialty Chemicals
HNHPF None
HNI Business Equipment & Supplies
HNIN Aerospace & Defense
HNNA Asset Management
HNORY None
HNP Utilities - Independent Power Producers
HNR Oil & Gas E&P
HNRG Thermal Coal
HNSN Medical Devices
HNT Healthcare Plans
HNZ Packaged Foods
HOFD Real Estate Services
HOFT Furnishings, Fixtures & Appliances
HOG Recreational Vehicles
HOGS Packaged Foods
HOKU Electrical Equipment & Parts
HOLI Electrical Equipment & Parts
HOLL Specialty Retail
HOLX Medical Instruments & Supplies
HOMB Banks - Regional
HOME Specialty Retail
HOME1 Banks - Regional
HON Specialty Industrial Machinery
HONE Savings & Cooperative Banks
HOPE Banks - Regional
HOS Oil & Gas Equipment & Services
HOT Resorts & Casinos
HOTJ1 Luxury Goods
HOTT Apparel Retail
HOV Residential Construction
HOVNP Residential Construction
HP Oil & Gas Drilling
HPAC Conglomerates
HPCQ Conglomerates
HPE Communication Equipment
HPGP Oil & Gas E&P
HPHW Medical Care Facilities
HPJ Electronic Components
HPOL Consulting Ser

INMD1 Health Information Services
INN REIT - Hotel & Motel
INN-PA REIT - Hotel & Motel
INN-PB REIT - Hotel & Motel
INN-PC REIT - Hotel & Motel
INN-PD REIT - Hotel & Motel
INN-PE REIT - Hotel & Motel
INNL Biotechnology
INNL1 Biotechnology
INNT Biotechnology
INO Biotechnology
INOC Specialty Business Services
INOD Information Technology Services
INOV Health Information Services
INOW Electronics & Computer Distribution
INPHQ Communication Equipment
INPX Software - Application
INQD Specialty Industrial Machinery
INS Software - Application
INSE Electronic Gaming & Multimedia
INSG Communication Equipment
INSGY Software - Application
INSM Biotechnology
INSP Medical Devices
INST Software - Application
INSV Biotechnology
INSW Shipping & Ports
INSW-PA Shipping & Ports
INSY Biotechnology
INT Oil & Gas Refining & Marketing
INTC Semiconductors
INTG Lodging
INTI Biotechnology
INTL Capital Markets
INTT Semiconductors
INTU Software - Application
INTX Business Services
INTZ Communication Equipment
INUV 

KLIC Semiconductor Equipment & Materials
KLR Conglomerates
KLXI Aerospace & Defense
KMB Household & Personal Products
KMDA Biotechnology
KMG Specialty Chemicals
KMI Oil & Gas Midstream
KMI-PA Oil & Gas Midstream
KMLGF Oil & Gas Midstream
KMP Oil & Gas Midstream
KMPH Biotechnology
KMPR Insurance - Property & Casualty
KMR Oil & Gas Midstream
KMT Tools & Accessories
KMX Auto & Truck Dealerships
KN Communication Equipment
KNBA Leisure
KNBWY None
KNCAY None
KND Long-Term Care Facilities
KNDI Auto Manufacturers
KNDL Biotechnology
KNL Business Equipment & Supplies
KNMCY Software - Application
KNMX Tools & Accessories
KNOL Telecom Services
KNOP Marine Shipping
KNSA Biotechnology
KNSL Insurance - Property & Casualty
KNSY Medical Devices
KNTH Biotechnology
KNWN Scientific & Technical Instruments
KNX Trucking
KNX1 Trucking
KNXA Software - Application
KNYJY None
KO Beverages - Non-Alcoholic
KOAN Software - Application
KODK Specialty Business Services
KOF Beverages - Non-Alcoholic
KOG Oil & Gas Equ

MKTX Capital Markets
MKTY Scientific & Technical Instruments
MLAB Scientific & Technical Instruments
MLCO Resorts & Casinos
MLHR Business Equipment & Supplies
MLI Metal Fabrication
MLM Building Materials
MLND Biotechnology
MLNT Biotechnology
MLNX Semiconductors
MLP Real Estate - Development
MLR Auto Parts
MLSS Medical Instruments & Supplies
MLVF Banks - Regional
MM Advertising Agencies
MMAC Mortgage Finance
MMC Insurance Brokers
MMCE Utilities - Diversified
MMDM Conglomerates
MMI Real Estate Services
MMI1 Communication Equipment
MMLP Oil & Gas Midstream
MMM Specialty Industrial Machinery
MMP Oil & Gas Midstream
MMR Oil & Gas E&P
MMS Specialty Business Services
MMSI Medical Instruments & Supplies
MMTIF Information Technology Services
MMUS Software - Application
MMYT Travel Services
MN Asset Management
MNDO Software - Application
MNI Publishing
MNK Drug Manufacturers - Specialty & Generic
MNKD Biotechnology
MNLO Biotechnology
MNOV Biotechnology
MNR REIT - Industrial
MNR-PA REIT - Industr

NEXT Oil & Gas E&P
NFBK Banks - Regional
NFG Oil & Gas Integrated
NFH Conglomerates
NFLDQ Biotechnology
NFLX Entertainment
NFP Insurance Brokers
NFSB Banks - Regional
NFX Oil & Gas E&P
NG Gold
NGAC Specialty Chemicals
NGAS Oil & Gas E&P
NGBF Specialty Chemicals
NGCRF None
NGD Gold
NGEN Diagnostics & Research
NGG Utilities - Diversified
NGHC Insurance - Property & Casualty
NGHCN Insurance - Property & Casualty
NGHCO Insurance - Property & Casualty
NGHCP Insurance - Property & Casualty
NGL Oil & Gas Refining & Marketing
NGL-PB Oil & Gas Refining & Marketing
NGLS Oil & Gas Midstream
NGLS-PA Oil & Gas Midstream
NGS Oil & Gas Equipment & Services
NGSX Biotechnology
NGVC Grocery Stores
NGVT Specialty Chemicals
NH Health Information Services
NHC Medical Care Facilities
NHC-PA Medical Care Facilities
NHI REIT - Healthcare Facilities
NHLD Capital Markets
NHP REIT - Mortgage
NHPR Medical Care Facilities
NHR1 Conglomerates
NHRX Pharmaceutical Retailers
NHTC Internet Retail
NHWK Medical Care Facil

OFC REIT - Office
OFC-PL REIT - Office
OFED Banks - Regional
OFG Banks - Regional
OFG-PA Banks - Regional
OFG-PB Banks - Regional
OFG-PD Banks - Regional
OFI Packaged Foods
OFIX Medical Devices
OFLX Specialty Industrial Machinery
OFS Asset Management
OGCP REIT - Diversified
OGE Utilities - Regulated Electric
OGEN Biotechnology
OGS Utilities - Regulated Gas
OHAI Asset Management
OHBIQ Residential Construction
OHI REIT - Healthcare Facilities
OI Packaging & Containers
OIBR.C Telecom Services
OICO Diagnostics & Research
OII Oil & Gas Equipment & Services
OIIM Electronic Components
OILT Oil & Gas Midstream
OIS Oil & Gas Equipment & Services
OISI Medical Devices
OKE Oil & Gas Midstream
OKS Oil & Gas Midstream
OKSB Banks - Regional
OKTA Software - Infrastructure
OLBK Banks - Regional
OLCB Banks - Regional
OLED Semiconductor Equipment & Materials
OLLI Discount Stores
OLN Specialty Chemicals
OLP REIT - Diversified
OMAB Airports & Air Services
OMC Advertising Agencies
OMCL Health Information Se

PGOL Gold
PGR Insurance - Property & Casualty
PGRE REIT - Office
PGRX Residential Construction
PGTI Building Products & Equipment
PGTK Waste Management
PH Specialty Industrial Machinery
PHC Medical Care Facilities
PHG Diagnostics & Research
PHH Specialty Finance
PHHMQ None
PHI Telecom Services
PHIIK Airports & Air Services
PHIIQ Airports & Air Services
PHIO Biotechnology
PHM Residential Construction
PHUN Software - Application
PHX Oil & Gas E&P
PI Electronic Components
PIAGF None
PICO Utilities - Regulated Water
PIH Insurance - Property & Casualty
PIHPP Insurance - Property & Casualty
PII Recreational Vehicles
PIII Engineering & Construction
PIKE Engineering & Construction
PILLQ Software - Application
PINC Health Information Services
PINN Oil & Gas E&P
PIOIQ Utilities - Regulated Electric
PIPR Asset Management
PIR Specialty Retail
PIRS Biotechnology
PIXY Staffing & Outsourcing Services
PJT Capital Markets
PK REIT - Hotel & Motel
PKBK Banks - Regional
PKDSQ Oil & Gas Equipment & Service

QEPM Oil & Gas Midstream
QES Oil & Gas Equipment & Services
QGEN Diagnostics & Research
QGI Specialty Business Services
QHC Medical Care
QIHU Information Technology Services
QIWI Credit Services
QKLS Grocery Stores
QLGC Communication Equipment
QLIK Software - Application
QLTY Trucking
QLYS Software - Infrastructure
QMCI Specialty Business Services
QMCO Data Storage
QMDC Information Technology Services
QMRK Medical Devices
QNBC Banks - Regional
QNGP Consulting Services
QNST Advertising Agencies
QOBJ Software - Application
QRE Oil & Gas E&P
QRHC Waste Management
QRM Other Industrial Metals & Mining
QRTEA Internet Retail
QRTEB Internet Retail
QRVO Semiconductors
QSEP Oil & Gas Equipment & Services
QSFT Software - Application
QSND Software - Application
QSR Restaurants
QTET Conglomerates
QTNA Semiconductors
QTNT Diagnostics & Research
QTRH Communication Equipment
QTRX Biotechnology
QTS REIT - Industrial
QTS-PA REIT - Industrial
QTS-PB REIT - Industrial
QTWO Software - Application
QTWW Auto

RUE Apparel Retail
RUN Solar
RURL None
RUSHA Auto & Truck Dealerships
RUSHB Auto & Truck Dealerships
RUTH Restaurants
RVBD Communication Equipment
RVEN REIT - Residential
RVHLQ Resorts & Casinos
RVI Real Estate Services
RVI1 Discount Stores
RVLT Electronic Components
RVM Gold
RVNC Biotechnology
RVP Medical Instruments & Supplies
RVR Credit Services
RVSB Banks - Regional
RVSIQ Scientific & Technical Instruments
RVSN Software - Application
RWGE Conglomerates
RWLK Medical Devices
RWT REIT - Mortgage
RX Software - Application
RXDX Biotechnology
RXN Specialty Industrial Machinery
RXN-PA Specialty Industrial Machinery
RXPC Biotechnology
RY Banks - Diversified
RY-PS Banks - Diversified
RY-PT Banks - Diversified
RYAAY Airlines
RYAM Chemicals
RYAM-PA Chemicals
RYB Education & Training Services
RYCE Packaged Foods
RYI Metal Fabrication
RYL Residential Construction
RYMM Other Industrial Metals & Mining
RYN REIT - Specialty
RYTM Biotechnology
RZTIQ Utilities - Regulated Electric
S Telecom Services

SLS Biotechnology
SLSZQ Consumer Electronics
SLTM Medical Devices
SLXP Biotechnology
SM Oil & Gas E&P
SMA Medical Devices
SMAC Conglomerates
SMAR Software - Application
SMBC Banks - Regional
SMBK Banks - Regional
SMC Gold
SMCI Communication Equipment
SMDK Computer Hardware
SMDM None
SMED Waste Management
SMFG Banks - Diversified
SMFTF None
SMG Agricultural Inputs
SMGI Oil & Gas E&P
SMHI Shipping & Ports
SMICY Semiconductors
SMID Building Materials
SMIT Scientific & Technical Instruments
SMLP Oil & Gas Midstream
SMLR Medical Devices
SMME Household & Personal Products
SMMF Banks - Regional
SMMNY None
SMMT Biotechnology
SMMX Software - Application
SMOD Semiconductors
SMP Auto Parts
SMPL Packaged Foods
SMPL1 Banks - Regional
SMRT Apparel Retail
SMSC Semiconductors
SMSEY None
SMSI Software - Application
SMT Computer Hardware
SMTA REIT - Diversified
SMTB Banks - Regional
SMTC Semiconductors
SMTI Medical Devices
SMTL Semiconductor Equipment & Materials
SMTOY None
SMTS Industrial Metals & Mine

SUNW Solar
SUP Auto Parts
SUPN Drug Manufacturers - Specialty & Generic
SUPR Banks - Regional
SUPV Banks - Regional - Latin America
SUPX Semiconductors
SUR Insurance - Specialty
SURF Biotechnology
SURG Medical Devices
SURW Telecom Services
SUSQ Banks - Regional
SUSS None
SUTMQ Publishing
SUWN Biotechnology
SVA Biotechnology
SVBI Banks - Regional
SVBL Other Industrial Metals & Mining
SVC REIT - Hotel & Motel
SVLC Gold
SVLF Real Estate Services
SVM Silver
SVN Resorts & Casinos
SVNTQ Drug Manufacturers - Specialty & Generic
SVR Telecom Services
SVRA Biotechnology
SVT Electrical Equipment & Parts
SVU Grocery Stores
SVUL Consulting Services
SVVC Asset Management
SVVS Specialty Business Services
SWC Uranium
SWCH Information Technology Services
SWGAY None
SWHI1 Real Estate Services
SWI1 Software - Application
SWIM Capital Markets
SWIR Communication Equipment
SWK Tools & Accessories
SWKH Credit Services
SWKS Semiconductors
SWM Paper & Paper Products
SWN Oil & Gas E&P
SWNC Oil & Gas E&P
SWP Too

TNP Oil & Gas Midstream
TNP-PB Oil & Gas Midstream
TNP-PC Oil & Gas Midstream
TNP-PD Oil & Gas Midstream
TNP-PE Oil & Gas Midstream
TNP-PF Oil & Gas Midstream
TNS Specialty Business Services
TNSB Specialty Industrial Machinery
TNTR Software - Application
TNXP Biotechnology
TOBC Banks - Regional
TOCA Biotechnology
TOD Recreational Vehicles
TOFB None
TOFC Banks - Regional
TOL Residential Construction
TOMO Medical Devices
TONEQ Banks - Regional
TOO Oil & Gas Midstream
TOO-PA Oil & Gas Midstream
TOO-PB Oil & Gas Midstream
TOO-PE Oil & Gas Midstream
TOPPY None
TOPS Marine Shipping
TOR Steel
TORC Biotechnology
TORM Chemicals
TOSYY Computer Hardware
TOT Oil & Gas Integrated
TOUR Travel Services
TOUSQ Residential Construction
TOWN Banks - Regional - US
TOWR Auto Parts
TOX Diagnostics & Research
TPB Tobacco
TPC Engineering & Construction
TPCA Resorts & Casinos
TPCG Specialty Chemicals
TPCO Publishing
TPCS Metal Fabrication
TPGI Real Estate Services
TPH Residential Construction
TPHS Real Estate 

USB-PH Banks - Regional
USB-PM Banks - Regional
USB-PN Banks - Regional
USB-PO Banks - Regional
USCR Building Materials
USDP Railroads
USEG Oil & Gas E&P
USEY Utilities - Diversified
USFD Food Distribution
USG Building Materials
USHS Engineering & Construction
USIO Software - Infrastructure
USLM Building Materials
USM Telecom Services
USMD Medical Care Facilities
USNA Household & Personal Products
USNU Medical Care Facilities
USPH Medical Care Facilities
USPI2 Advertising Agencies
USPR Other Industrial Metals & Mining
USRM Biotechnology
USSPQ Oil & Gas Midstream
USWS Diversified Industrials
USX Trucking
UTEK Semiconductor Equipment & Materials
UTGN Insurance - Life
UTHR Biotechnology
UTI Education & Training Services
UTIW Integrated Freight & Logistics
UTL Utilities - Diversified
UTMD Medical Instruments & Supplies
UTRA Travel Services
UTSI Communication Equipment
UTX Aerospace & Defense
UTX-PA Aerospace & Defense
UUU Security & Protection Services
UUUU Uranium
UVE Insurance - Property

WFE-PA Banks - Diversified
WFM Grocery Stores
WFSC Banks - Regional
WFTIQ Oil & Gas Equipment & Services
WG Oil & Gas Equipment & Services
WGATQ Entertainment
WGBS Diagnostics & Research
WGL Utilities - Regulated Gas
WGLF Leisure
WGNB Banks - Regional
WGO Recreational Vehicles
WGW Gold
WH Lodging
WHCI Banks - Regional
WHD Oil & Gas Equipment & Services
WHF Asset Management
WHG Capital Markets
WHGLY None
WHHT Computer Hardware
WHLM Specialty Business Services
WHLR REIT - Retail
WHLRD REIT - Retail
WHLRP REIT - Retail
WHR Furnishings, Fixtures & Appliances
WHRT Medical Devices
WIBC Banks - Regional
WIFI Telecom Services
WILC Food Distribution
WILYY None
WIMHF None
WINA Specialty Retail
WIND Information Technology Services
WING Restaurants
WINMQ Telecom Services
WINN Grocery Stores
WINR Leisure
WINS Asset Management
WINT Biotechnology
WIRE Electrical Equipment & Parts
WIT Information Technology Services
WITM Gold
WIX Software - Infrastructure
WJRYY None
WK Software - Application
WKHS Auto

In [45]:
data.loc['ALLY','Credit Services']

True

In [46]:
data.loc['ZYTO','Biotechnology']

False
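
The two lookups above spot-check the one-hot industry columns. As an aside, here is a minimal sketch of building the same kind of columns with `pd.get_dummies` rather than a Python loop. The three-row `industry` Series is a toy stand-in (ALLY from the lookup above, PHIO and PGOL from the listing), and the names `industry` and `one_hot` are illustrative, not part of the notebook.

```python
import pandas as pd

# Toy ticker -> industry mapping; the real notebook reads this from the labels file.
industry = pd.Series(
    {"ALLY": "Credit Services", "PHIO": "Biotechnology", "PGOL": "Gold"},
    name="industry",
)

# One boolean column per distinct industry; each row has exactly one True.
one_hot = pd.get_dummies(industry).astype(bool)

print(one_hot.loc["ALLY", "Credit Services"])
print(one_hot.loc["ALLY", "Biotechnology"])
```

The resulting frame keeps the ticker index, so it could be joined to the price columns on the index in the same way the label column is merged.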

In [49]:
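# Shuffle the data (no random_state is set, so each run gives a different split)
shuffled_data = data.sample(frac=1)
display(shuffled_data.head())

# Split: the first 7000 rows are for training; the last column is the 10bagger label
split = 7000
train_x = shuffled_data.iloc[:split,:-1]
train_y = shuffled_data.iloc[:split, -1]
display(train_x)
display(train_y)

# Note: slicing from split+1 leaves the row at position `split` out of both sets;
# slicing from `split` would put it in the validation set.
valid_x = shuffled_data.iloc[split+1:,:-1]
valid_y = shuffled_data.iloc[split+1:, -1]
display(valid_x)
display(valid_y)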


Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,Marketing Services,Home Furnishings & Fixtures,Data Storage,Beverages - Soft Drinks,Integrated Shipping & Logistics,Financial Exchanges,Infrastructure Operations,Banks - Regional - Latin America,Coking Coal,10bagger
LKQ,1.0,0.978884,1.0181,0.979638,0.92006,0.934389,0.9819,0.978884,0.9819,0.989442,...,False,False,False,False,False,False,False,False,False,False
DXF,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False
RVR,1.0,1.0,1.0,1.0,0.871632,0.871632,0.935024,0.855784,0.855784,0.855784,...,False,False,False,False,False,False,False,False,False,False
COST,1.0,0.996327,0.999265,0.968658,0.954456,0.941234,0.98286,0.968658,1.00049,1.033301,...,False,False,False,False,False,False,False,False,False,False
PRISY,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,False,False,False,False,False,False,False,False,False,False


Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,Staffing & Outsourcing Services,Marketing Services,Home Furnishings & Fixtures,Data Storage,Beverages - Soft Drinks,Integrated Shipping & Logistics,Financial Exchanges,Infrastructure Operations,Banks - Regional - Latin America,Coking Coal
LKQ,1.0,0.978884,1.018100,0.979638,0.920060,0.934389,0.981900,0.978884,0.981900,0.989442,...,False,False,False,False,False,False,False,False,False,False
DXF,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
RVR,1.0,1.000000,1.000000,1.000000,0.871632,0.871632,0.935024,0.855784,0.855784,0.855784,...,False,False,False,False,False,False,False,False,False,False
COST,1.0,0.996327,0.999265,0.968658,0.954456,0.941234,0.982860,0.968658,1.000490,1.033301,...,False,False,False,False,False,False,False,False,False,False
PRISY,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
NNN-PF,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
PTHN,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
SBLK,1.0,0.922541,1.105610,1.021736,1.227608,1.165318,1.242929,1.258108,1.372481,1.265733,...,False,False,False,False,False,False,False,False,False,False
FUPBY,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False


LKQ       False
DXF       False
RVR       False
COST      False
PRISY     False
          ...  
NNN-PF    False
PTHN      False
SBLK      False
FUPBY     False
FFCH      False
Name: 10bagger, Length: 7000, dtype: bool

Unnamed: 0,2009-03-02,2009-03-03,2009-03-04,2009-03-05,2009-03-06,2009-03-09,2009-03-10,2009-03-11,2009-03-12,2009-03-13,...,Staffing & Outsourcing Services,Marketing Services,Home Furnishings & Fixtures,Data Storage,Beverages - Soft Drinks,Integrated Shipping & Logistics,Financial Exchanges,Infrastructure Operations,Banks - Regional - Latin America,Coking Coal
OTTR,1.0,0.921068,1.030861,0.957864,0.995252,0.967359,1.051039,1.097923,1.162018,1.166172,...,False,False,False,False,False,False,False,False,False,False
MKTX,1.0,0.871353,0.870027,0.812997,0.935013,0.860743,0.907162,0.942971,1.021220,0.992042,...,False,False,False,False,False,False,False,False,False,False
MDCL,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
GS-PN,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
GMCR,1.0,1.038652,1.052246,1.049173,1.033570,0.981324,1.000000,0.993144,1.079551,1.087470,...,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
CUSI,1.0,0.984615,0.984615,1.000000,0.953846,0.953846,0.892308,0.892308,0.876923,0.876923,...,False,False,False,False,False,False,False,False,False,False
GMO,1.0,1.062500,1.203125,1.125000,1.125000,1.109375,1.234375,1.171875,1.296875,1.281250,...,False,False,False,False,False,False,False,False,False,False
MITL,0.0,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,...,False,False,False,False,False,False,False,False,False,False
ADPI,1.0,0.998445,1.018663,1.013997,1.032659,1.024883,0.967341,0.945568,0.936236,0.930016,...,False,False,False,False,False,False,False,False,False,False


OTTR     False
MKTX      True
MDCL     False
GS-PN    False
GMCR      True
         ...  
CUSI     False
GMO      False
MITL     False
ADPI     False
EQNR     False
Name: 10bagger, Length: 2880, dtype: bool
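
Because 10 baggers are rare, a plain shuffle can leave the two sets with different positive rates. The following is a minimal sketch of a reproducible, stratified 70/30 split using scikit-learn's `train_test_split`. It is an alternative to the slicing above, not what the notebook ran, and the `toy` DataFrame with its column names is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for `data`: 100 rows, 3 feature columns, 10% positives.
rng = np.random.RandomState(0)
toy = pd.DataFrame(rng.rand(100, 3), columns=["p0", "p1", "p2"])
toy["10bagger"] = np.arange(100) % 10 == 0

features = toy.drop(columns="10bagger")
target = toy["10bagger"]

# stratify keeps the share of positives similar in both sets;
# random_state makes the split repeatable across runs.
train_x, valid_x, train_y, valid_y = train_test_split(
    features, target, test_size=0.3, stratify=target, random_state=0
)

print(len(train_x), len(valid_x))
print(train_y.mean(), valid_y.mean())
```

Every row lands in exactly one of the two sets, so no sample is dropped at the boundary.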

In [50]:
# C is scikit-learn's inverse regularization strength: smaller C means stronger regularization
C_values = [1e-5, 1e-3, 1e-2, 1e-1, 5e-1, 1.0, 5, 10, 1e2]

for C in C_values:

    print("Regularization: {}".format(C))

    # class_weight='balanced' reweights the classes inversely to their frequency
    model = linear_model.LogisticRegression(C=C, class_weight='balanced', max_iter=10000)
    model.fit(train_x, train_y)

    predicts = model.predict(valid_x)
    correct = predicts == valid_y  # element-wise hits; not used below

    # Confusion-matrix counts on the validation set (helper from utils)
    TP,FP,TN,FN = calc_metrics(predicts,valid_y)

    print(TP,FP,TN,FN)

    precision, recall, accuracy, TPR, TNR, BER = calc_error_rates(TP, FP, TN, FN)

    print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
    print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))
    print('\n')

Regularization: 1e-05
153 86 2641 0
Precision:0.640 Recall:1.000
Accuracy:0.970 TPR:1.000 TNR:0.968 BER:0.016


Regularization: 0.001
152 58 2669 1
Precision:0.724 Recall:0.993
Accuracy:0.980 TPR:0.993 TNR:0.979 BER:0.014


Regularization: 0.01
150 38 2689 3
Precision:0.798 Recall:0.980
Accuracy:0.986 TPR:0.980 TNR:0.986 BER:0.017


Regularization: 0.1
144 27 2700 9
Precision:0.842 Recall:0.941
Accuracy:0.988 TPR:0.941 TNR:0.990 BER:0.034


Regularization: 0.5
143 25 2702 10
Precision:0.851 Recall:0.935
Accuracy:0.988 TPR:0.935 TNR:0.991 BER:0.037


Regularization: 1.0
142 24 2703 11
Precision:0.855 Recall:0.928
Accuracy:0.988 TPR:0.928 TNR:0.991 BER:0.040


Regularization: 5
140 23 2704 13
Precision:0.859 Recall:0.915
Accuracy:0.988 TPR:0.915 TNR:0.992 BER:0.047


Regularization: 10
140 22 2705 13
Precision:0.864 Recall:0.915
Accuracy:0.988 TPR:0.915 TNR:0.992 BER:0.047


Regularization: 100.0
139 20 2707 14
Precision:0.874 Recall:0.908
Accuracy:0.988 TPR:0.908 TNR:0.993 BER:0.049
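
`calc_metrics` and `calc_error_rates` are imported from `utils` and are not shown in this notebook. Below is a minimal sketch of what they might compute, assuming the standard confusion-matrix definitions and BER = 1 - (TPR + TNR) / 2. The `_sketch` names are hypothetical stand-ins, not the actual helpers. These definitions do reproduce the first result printed above (153 86 2641 0).

```python
def calc_metrics_sketch(predicts, labels):
    """Count TP, FP, TN, FN for boolean predictions and boolean labels."""
    pairs = list(zip(predicts, labels))
    TP = sum(1 for p, y in pairs if p and y)
    FP = sum(1 for p, y in pairs if p and not y)
    TN = sum(1 for p, y in pairs if not p and not y)
    FN = sum(1 for p, y in pairs if not p and y)
    return TP, FP, TN, FN


def calc_error_rates_sketch(TP, FP, TN, FN):
    """Derive rates from the counts; empty denominators fall back to 0.0."""
    precision = TP / (TP + FP) if TP + FP else 0.0
    recall = TP / (TP + FN) if TP + FN else 0.0
    accuracy = (TP + TN) / (TP + FP + TN + FN)
    TPR = recall                                   # true positive rate
    TNR = TN / (TN + FP) if TN + FP else 0.0       # true negative rate
    BER = 1 - 0.5 * (TPR + TNR)                    # balanced error rate
    return precision, recall, accuracy, TPR, TNR, BER


# Counts taken from the C = 1e-05 run above.
precision, recall, accuracy, TPR, TNR, BER = calc_error_rates_sketch(153, 86, 2641, 0)
print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))
```

With roughly 5% positives in the validation set, accuracy is dominated by the negatives, so precision, recall and BER are the more informative columns.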




In [55]:
label_filenames = ['labels_12-31-2010.csv',
              'labels_12-31-2011.csv',
              'labels_12-31-2012.csv',
              'labels_12-31-2013.csv',
              'labels_12-31-2014.csv',
              'labels_12-31-2015.csv',
              'labels_12-31-2016.csv',
              'labels_12-31-2017.csv',
              'labels_12-31-2018.csv'
             ]

prices_filenames = ['inputs_notfilled_2010-12-31.csv',
            'inputs_notfilled_2011-12-31.csv',
            'inputs_notfilled_2012-12-31.csv',
            'inputs_notfilled_2013-12-31.csv',
            'inputs_notfilled_2014-12-31.csv',
            'inputs_notfilled_2015-12-31.csv',
            'inputs_notfilled_2016-12-31.csv',
            'inputs_notfilled_2017-12-31.csv',
            'inputs_notfilled_2018-12-31.csv']

top100s = {}

for label_filename, prices_filename in zip(label_filenames, prices_filenames):
    
    print(label_filename)
    top100s[label_filename]={}
    
    labels = pd.read_csv("../datasets/sharader/"+label_filename)
    y = labels.set_index('ticker')
    y_all = y['10bagger']
    
    
    prices = pd.read_csv("../datasets/sharader/"+prices_filename)
    X = prices.set_index('date')
    first_valid_prices = X.apply(first_valid_idx, axis=0)
    
    X_filled = X.fillna(axis=0, method='ffill')  # forward fill along date axis with last valid price
    X_filled = X_filled.fillna(0)  # fill all other NaN with zero - remaining NaN before the first valid price

    # Transpose Dataframe in rows of tickers, and normalize price
    X_all = X_filled.transpose().div(first_valid_prices, axis=0)
    
    # Add industry columns (one-hot encoding of ticker's industry)
    for industry in list(ind_names.index):
        X_all[industry] = False
    
    # Add label column
    data = X_all.merge(y_all, left_index=True, right_index=True)
    
    # Populate ticker's industry
    for ticker, industry in zip(list(y.index), list(y['industry'])):
        data.loc[ticker,industry] = True
    
    # Shuffle the data
    shuffled_data = data.sample(frac=1)

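    # Split 70/30 into training and validation sets; the last column is the 10bagger label
    split = int(len(shuffled_data.index)*0.7)
    train_x = shuffled_data.iloc[:split,:-1]
    train_y = shuffled_data.iloc[:split, -1]

    # Note: slicing from split+1 leaves the row at position `split` out of both sets
    valid_x = shuffled_data.iloc[split+1:,:-1]
    valid_y = shuffled_data.iloc[split+1:, -1]

    valid_tickers = shuffled_data.iloc[split+1:, -1].index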
    
    C_values = [1e-7, 1e-6, 1e-5, 1e-3, 1e-2, 1e-1, 5e-1, 1.0, 10]

    for C in C_values:

        print("Regularization: {}".format(C))

        model = linear_model.LogisticRegression(C=C, class_weight='balanced', max_iter=10000)
        model.fit(train_x, train_y)

        predicts = model.predict(valid_x)
        correct = predicts == valid_y
        
        # This code section is to generate Precision@N stats
        # Sort (score, label, ticker) tuples by score in descending order
        scores = model.decision_function(valid_x)
        scores_labels = list(zip(scores, valid_y, valid_tickers))
        scores_labels.sort(reverse = True)
        sortedlabels = [x[1] for x in scores_labels] # generate sorted labels

        top100s[label_filename][C] = [x[2] for x in scores_labels]  # save every validation ticker, ranked by score

        TP,FP,TN,FN = calc_metrics(predicts,valid_y)

        print(TP,FP,TN,FN)

        precision, recall, accuracy, TPR, TNR, BER = calc_error_rates(TP, FP, TN, FN)

        print("Precision:{:.3f} Recall:{:.3f}".format(precision, recall))
        
        # output Precision@N stats for N = 100, 90, ..., 10
        for n in range(100, 0, -10):
            print("Precision@{}: {}".format(n, sum(sortedlabels[:n]) / n))
        
        print("Accuracy:{:.3f} TPR:{:.3f} TNR:{:.3f} BER:{:.3f}".format(accuracy, TPR, TNR, BER))
        print('\n')



labels_12-31-2010.csv
Regularization: 1e-07
78 175 1644 67
Precision:0.308 Recall:0.538
Precision@100: 0.42
Precision@90: 0.4444444444444444
Precision@80: 0.4625
Precision@70: 0.4857142857142857
Precision@60: 0.48333333333333334
Precision@50: 0.52
Precision@40: 0.55
Precision@30: 0.6
Precision@20: 0.7
Precision@10: 0.7
Accuracy:0.877 TPR:0.538 TNR:0.904 BER:0.279


Regularization: 1e-06
84 245 1574 61
Precision:0.255 Recall:0.579
Precision@100: 0.42
Precision@90: 0.45555555555555555
Precision@80: 0.45
Precision@70: 0.4857142857142857
Precision@60: 0.48333333333333334
Precision@50: 0.52
Precision@40: 0.55
Precision@30: 0.6333333333333333
Precision@20: 0.7
Precision@10: 0.7
Accuracy:0.844 TPR:0.579 TNR:0.865 BER:0.278


Regularization: 1e-05
91 267 1552 54
Precision:0.254 Recall:0.628
Precision@100: 0.44
Precision@90: 0.4444444444444444
Precision@80: 0.45
Precision@70: 0.4714285714285714
Precision@60: 0.5
Precision@50: 0.5
Precision@40: 0.575
Precision@30: 0.6666666666666666
Precision@20

104 203 1780 47
Precision:0.339 Recall:0.689
Precision@100: 0.57
Precision@90: 0.5888888888888889
Precision@80: 0.5875
Precision@70: 0.6142857142857143
Precision@60: 0.6166666666666667
Precision@50: 0.58
Precision@40: 0.625
Precision@30: 0.6333333333333333
Precision@20: 0.6
Precision@10: 0.7
Accuracy:0.883 TPR:0.689 TNR:0.898 BER:0.207


Regularization: 1.0
102 197 1786 49
Precision:0.341 Recall:0.675
Precision@100: 0.58
Precision@90: 0.5777777777777777
Precision@80: 0.575
Precision@70: 0.5857142857142857
Precision@60: 0.6
Precision@50: 0.58
Precision@40: 0.625
Precision@30: 0.6
Precision@20: 0.6
Precision@10: 0.7
Accuracy:0.885 TPR:0.675 TNR:0.901 BER:0.212


Regularization: 10
90 172 1811 61
Precision:0.344 Recall:0.596
Precision@100: 0.54
Precision@90: 0.5555555555555556
Precision@80: 0.5625
Precision@70: 0.5571428571428572
Precision@60: 0.55
Precision@50: 0.56
Precision@40: 0.525
Precision@30: 0.5333333333333333
Precision@20: 0.6
Precision@10: 0.5
Accuracy:0.891 TPR:0.596 TNR:0.913

157 158 2276 19
Precision:0.498 Recall:0.892
Precision@100: 0.86
Precision@90: 0.8777777777777778
Precision@80: 0.8875
Precision@70: 0.9142857142857143
Precision@60: 0.9166666666666666
Precision@50: 0.94
Precision@40: 0.975
Precision@30: 0.9666666666666667
Precision@20: 1.0
Precision@10: 1.0
Accuracy:0.932 TPR:0.892 TNR:0.935 BER:0.086


Regularization: 0.01
157 143 2291 19
Precision:0.523 Recall:0.892
Precision@100: 0.85
Precision@90: 0.8444444444444444
Precision@80: 0.9
Precision@70: 0.9
Precision@60: 0.9
Precision@50: 0.9
Precision@40: 0.925
Precision@30: 0.9
Precision@20: 0.9
Precision@10: 0.9
Accuracy:0.938 TPR:0.892 TNR:0.941 BER:0.083


Regularization: 0.1
147 121 2313 29
Precision:0.549 Recall:0.835
Precision@100: 0.81
Precision@90: 0.8333333333333334
Precision@80: 0.8375
Precision@70: 0.8428571428571429
Precision@60: 0.8666666666666667
Precision@50: 0.9
Precision@40: 0.9
Precision@30: 0.9
Precision@20: 0.9
Precision@10: 0.8
Accuracy:0.943 TPR:0.835 TNR:0.950 BER:0.107


Regula

Regularization: 1e-07
149 126 2681 8
Precision:0.542 Recall:0.949
Precision@100: 0.9
Precision@90: 0.9222222222222223
Precision@80: 0.9375
Precision@70: 0.9428571428571428
Precision@60: 0.95
Precision@50: 0.96
Precision@40: 0.975
Precision@30: 0.9666666666666667
Precision@20: 0.95
Precision@10: 1.0
Accuracy:0.955 TPR:0.949 TNR:0.955 BER:0.048


Regularization: 1e-06
154 120 2687 3
Precision:0.562 Recall:0.981
Precision@100: 0.93
Precision@90: 0.9555555555555556
Precision@80: 0.975
Precision@70: 0.9714285714285714
Precision@60: 0.9666666666666667
Precision@50: 0.96
Precision@40: 1.0
Precision@30: 1.0
Precision@20: 1.0
Precision@10: 1.0
Accuracy:0.959 TPR:0.981 TNR:0.957 BER:0.031


Regularization: 1e-05
156 99 2708 1
Precision:0.612 Recall:0.994
Precision@100: 0.96
Precision@90: 0.9666666666666667
Precision@80: 0.9875
Precision@70: 0.9857142857142858
Precision@60: 0.9833333333333333
Precision@50: 0.98
Precision@40: 1.0
Precision@30: 1.0
Precision@20: 1.0
Precision@10: 1.0
Accuracy:0.966
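
The Precision@N figures above are the fraction of true 10 baggers among the N validation tickers with the highest `decision_function` scores. The sketch below shows the same calculation as a standalone function; `precision_at_n` and the toy scores are illustrative names and values, not part of the notebook. One difference: the notebook sorts `(score, label, ticker)` tuples, so ties on score are broken by label and then ticker, whereas this sketch keeps the input order on ties.

```python
import numpy as np


def precision_at_n(scores, labels, n):
    """Fraction of positives among the n highest-scoring samples."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # Stable sort on negated scores gives descending order.
    order = np.argsort(-scores, kind="mergesort")
    # Divide by n (as the notebook does), even if fewer than n samples exist.
    return labels[order[:n]].sum() / n


scores = [2.0, -1.0, 0.5, 3.0, -0.2]
labels = [True, False, False, True, True]

for n in (2, 4):
    print("Precision@{}: {}".format(n, precision_at_n(scores, labels, n)))
```

Unlike the thresholded precision and recall, this metric depends only on the ranking, which matches the use case of picking a short list of candidate tickers.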

### Save Results to File

In [62]:
with open("sorted_tickers_logreg_priceInd.pkl","wb") as f:
    pickle.dump(top100s,f)

In [63]:
labels = pd.read_csv("../datasets/sharader/labels_12-31-2018.csv")

y = labels.set_index('ticker')

y['appreciation'].loc['CLRB']

1.6339869281045753e-05

In [65]:
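with open("sorted_tickers_logreg_priceInd.pkl", "rb") as f:
    sorted_tickers = pickle.load(f)

# Each line: rank. ticker: normalized price on 2018-12-31 --> appreciation (10bagger label)
# X_all is left over from the last iteration of the loop above (the 2018 files)
for num, ticker in enumerate(sorted_tickers['labels_12-31-2018.csv'][1e-6][:101]):
    print('{}. {}: {:.2f} --> {:.2f} ({})'.format(num, ticker, X_all['2018-12-31'].loc[ticker], y['appreciation'].loc[ticker], 
                    y['10bagger'].loc[ticker]))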

0. GENTY: 160.29 --> 160.29 (True)
1. ATSG: 120.05 --> 121.32 (True)
2. NHTC: 77.04 --> 54.00 (True)
3. CAR: 53.52 --> 83.00 (True)
4. IX: 68.27 --> 68.58 (True)
5. DAN: 41.30 --> 53.76 (True)
6. GMKYY: 54.55 --> 54.55 (True)
7. GGP: 47.96 --> 47.96 (True)
8. NFLX: 54.55 --> 72.66 (True)
9. BBX: 40.93 --> 42.29 (True)
10. TRW: 47.08 --> 47.08 (True)
11. ISDR: 37.83 --> 41.23 (True)
12. ULTA: 44.76 --> 63.75 (True)
13. HBP: 15.00 --> 23.08 (True)
14. IMH: 15.12 --> 15.68 (True)
15. CAMP: 26.02 --> 25.16 (True)
16. SIRI: 40.50 --> 40.21 (True)
17. CNO: 29.18 --> 31.73 (True)
18. KERX: 25.45 --> 25.45 (True)
19. KEM: 44.97 --> 43.51 (True)
20. CBPO: 31.63 --> 38.02 (True)
21. INCY: 27.89 --> 37.72 (True)
22. RLGT: 34.00 --> 50.40 (True)
23. FONR: 30.67 --> 31.02 (True)
24. KOG: 32.80 --> 32.80 (True)
25. BEXP: 32.57 --> 32.57 (True)
26. CPWM: 31.88 --> 31.88 (True)
27. ESCA: 27.26 --> 26.60 (True)
28. ASGN: 26.85 --> 31.28 (True)
29. REGN: 27.58 --> 30.33 (True)
30. BWEBF: 28.12 --> 28.12

In [46]:
X = np.array([[1, 7], [1, 2.5], [1, 0.67], [1, 0.5], [1, 0.33], [1, 4], [1, 3], [1, -1]])
print(X.shape)
y = np.array([1, 0, 0, 0, 0, 1, 1, 0])
print(y.shape)

(8, 2)
(8,)


In [47]:
model = linear_model.LogisticRegression()
model.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [48]:
predicts = model.predict(X)

In [49]:
predicts

array([1, 0, 0, 0, 0, 1, 1, 0])
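
For a binary `LogisticRegression`, `predict` returns the positive class exactly where `decision_function` is above zero, and the predicted probabilities are a monotonic transform of those scores. That is why the Precision@N code earlier can rank tickers by `decision_function`. The following small check runs on the same toy data; the variable names `scores`, `probs`, `same_as_predict` and `same_ranking` are illustrative.

```python
import numpy as np
from sklearn import linear_model

X = np.array([[1, 7], [1, 2.5], [1, 0.67], [1, 0.5], [1, 0.33], [1, 4], [1, 3], [1, -1]])
y = np.array([1, 0, 0, 0, 0, 1, 1, 0])

model = linear_model.LogisticRegression()
model.fit(X, y)

scores = model.decision_function(X)    # signed score; 0 is the decision boundary
probs = model.predict_proba(X)[:, 1]   # estimated probability of class 1

# Thresholding the score at 0 reproduces predict().
same_as_predict = np.array_equal((scores > 0).astype(int), model.predict(X))
# Scores and probabilities order the samples identically.
same_ranking = np.array_equal(np.argsort(scores), np.argsort(probs))

print(same_as_predict, same_ranking)
```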

In [2]:
df = pd.DataFrame([[np.nan, np.nan, 1, 7], [np.nan, 3, 2.5, 5], [1, 0.67, 3], [np.nan, 1, 0.5, 0.2]])
print(df)

print(df.apply(first_valid_idx, axis=1))

     0     1    2    3
0  NaN   NaN  1.0  7.0
1  NaN  3.00  2.5  5.0
2  1.0  0.67  3.0  NaN
3  NaN  1.00  0.5  0.2
0    1.0
1    3.0
2    1.0
3    1.0
dtype: float64
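
`first_valid_idx` also comes from `utils`. Judging by the output above, it returns the first non-NaN value of each row (or column) rather than its position. The sketch below shows one function with that behaviour and how its result can be used to normalize prices, as the training loop does. `first_valid_value` is a hypothetical stand-in, not the actual helper.

```python
import numpy as np
import pandas as pd


def first_valid_value(series):
    """Return the first non-NaN value of a Series, or NaN if there is none."""
    idx = series.first_valid_index()
    return np.nan if idx is None else series.loc[idx]


# Same toy frame as above; the short third row is padded with NaN by pandas.
df = pd.DataFrame([[np.nan, np.nan, 1, 7],
                   [np.nan, 3, 2.5, 5],
                   [1, 0.67, 3],
                   [np.nan, 1, 0.5, 0.2]])

first = df.apply(first_valid_value, axis=1)

# Divide each row by its first valid value, so every series starts at 1.0.
normalized = df.div(first, axis=0)

print(first.tolist())
print(normalized)
```

After this normalization the features measure relative price change since a ticker's first available price, which keeps high-priced and low-priced tickers on a comparable scale.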


In [None]:
df