# Exploratory Data Analysis
_Written by Thomas Niedermayer and Gunnar Sjúrðarson Knudsen, as a joint effort for an interdisciplinary project in Data Science._
* Supervisor: Wolfgang Aussenegg
* Co-Supervisor: Sascha Hunold

The purpose of this notebook is to understand the data quality and the scale of the task.
Because we ran into several data-quality issues, we want a clearer picture of which data are affected and how each case is handled. Example errors:
* Tickers used for multiple companies/ISINs, so we cannot tell which ISIN an insider trade belongs to.
* Missing ISINs, which means that no time series data is available.
* Missing tickers, which means that no insider trades are available.
* Several companies have no registered insider trades. These are also filtered out of the analysis notebook; they would not raise an error, but removing them reduces the runtime.
* Sometimes no return index data was available. **Honestly not sure why this didn't break our analysis. Maybe this was the reason for the `-Inf`?** In any case, these are also removed.

We check which data we have already extracted and define how to handle each situation.

## Load libraries

In [1]:
import pickle
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import time
import datetime
from dateutil.relativedelta import relativedelta
import requests
import locale
from pandas import json_normalize
import os
from os.path import exists
import sys
import io
import math

from IPython.display import clear_output, display
from tqdm import tqdm

# Load custom libraries
import source.read_tickers_and_isins as URTI

## Define which data to be loaded

In [2]:
# Set flags for what will be handled
NAME = "Niedermayer"  # alternative: "Knudsen"
prepare_and_download = True

# Constants:
## Different Parameters depending on the setting
if NAME == "Knudsen":
    no_https = False
    to_date_name = "DATE/TIME (DS End Date)"
    STOCK_EXCHANGE = "Nasdaq"
    n_input_files = 7
    _ticker = '%5EIXIC'  
elif NAME == "Niedermayer":
    no_https = True
    to_date_name = "DATE/TIME (DS End Date)"
    STOCK_EXCHANGE = "NYSE"
    n_input_files = 4
    _ticker = "%5Enya"
else:
    raise NotImplementedError

In [3]:
DATA_LOCATION = f'data/{NAME}/'
_insider_location = DATA_LOCATION + 'processed/insider/'

## Read in input data

In [4]:
INPUT_FILE = f'input_data/{NAME}/{STOCK_EXCHANGE} Composite 16.3.2022 plus dead firms - {NAME}.xlsx'

In [5]:
data = URTI.read_tickers_and_isins(INPUT_FILE)

Reading tickers


In [6]:
data

Unnamed: 0,Type,ISIN CODE,LOC OFF. CODE,NAME,DATASTREAM CODE,CUSIP,TICKER SYMBOL,BASE OR ST DATE,DATE/TIME (DS End Date)
0,756064,US88554D2053,U88554D205,3D SYSTEMS,756064,88554D205,DDD,1988-03-10,2022-03-16 00:00:00
1,902172,US88579Y1010,U88579Y101,3M,902172,88579Y101,MMM,1973-01-02,2022-03-16 00:00:00
2,877226,US2829141009,U282914100,8X8,877226,282914100,EGHT,1997-07-03,2022-03-16 00:00:00
3,2603Y7,US00152K1016,U00152K101,A K A BRANDS HOLDING,2603Y7,00152K101,AKA,2021-09-22,2022-03-16 00:00:00
4,8748C6,US0021211018,U002121101,A10 NETWORKS,8748C6,2121101,ATEN,2014-03-21,2022-03-16 00:00:00
...,...,...,...,...,...,...,...,...,...
3637,94419W,,,ZOOM VIDEO COMM CL B,94419W,98980L101,ZM,2019-04-18,
3638,9924R8,,,ZOOMINFO TECHNOLOGIES INC CL B,9924R8,98980F104,ZI,2020-06-04,
3639,9924TA,,,ZOOMINFO TECHNOLOGIES INC CL C,9924TA,98980F104,ZI,2020-06-04,
3640,9295TV,,,ZUORA INC CL B,9295TV,98983V106,ZUO,2018-04-12,


## Read in scraped data
We need this so that we can also exclude companies for which no data could be scraped.

### Insider trades

In [7]:
# Define dummy placeholders
tickers = []
trade_counts = []
min_filing_date = []
max_filing_date = []
min_trade_date = []
max_trade_date = []
n_distinct_traders = []
n_distinct_trade_types = []

n_p = []
n_s = []
n_s2 = []
n_a = []
n_d = []
n_g = []
n_f = []
n_m = []
n_x = []
n_c = []
n_w = []

# helpers
counter = 0
total_count = len(data['TICKER SYMBOL'])

# Read in scraped files, and do various aggregations
for ticker in data['TICKER SYMBOL']:
    counter = counter +1
    clear_output(wait=True)
    print(f'Handling {counter} of {total_count}. Currently doing: {ticker}')
    
    dat = pd.read_csv(_insider_location + ticker + '.csv', index_col=0, parse_dates=['FilingDate', 'TradeDate'])

    tickers.append(ticker)
    trade_counts.append(dat.shape[0])

    min_filing_date.append(dat['FilingDate'].min())
    max_filing_date.append(dat['FilingDate'].max())

    min_trade_date.append(dat['TradeDate'].min())
    max_trade_date.append(dat['TradeDate'].max())

    n_distinct_traders.append(dat['InsiderName'].nunique())
    n_distinct_trade_types.append(dat['TradeType'].nunique())


    n_p.append(sum(dat['TradeType'] == 'P - Purchase'))
    n_s.append(sum(dat['TradeType'] == 'S - Sale'))
    n_s2.append(sum(dat['TradeType'] == 'S - Sale+OE'))

    n_a.append(sum(dat['TradeType'] == 'A - Grant'))
    n_d.append(sum(dat['TradeType'] == 'D - Sale to Iss') + sum(dat['TradeType'] == 'D - Sale to issuer'))
    n_g.append(sum(dat['TradeType'] == 'G - Gift'))
    n_f.append(sum(dat['TradeType'] == 'F - Tax'))
    n_m.append(sum(dat['TradeType'] == 'M - Option Ex') + sum(dat['TradeType'] == 'M - OptEx'))
    n_x.append(sum(dat['TradeType'] == 'X - Option Ex') + sum(dat['TradeType'] == 'X - OptEx'))
    n_c.append(sum(dat['TradeType'] == 'C - Cnv Deriv') + sum(dat['TradeType'] == 'C - Converted deriv'))
    n_w.append(sum(dat['TradeType'] == 'W - Inherited'))
    
# Collect to a single data frame
scraped_insider_df = pd.DataFrame({'tickers': tickers
                                   , 'trade_count': trade_counts
                                   , 'min_filing_date': min_filing_date
                                   , 'max_filing_date': max_filing_date
                                   , 'min_trade_date': min_trade_date
                                   , 'max_trade_date': max_trade_date
                                   , 'n_distinct_traders': n_distinct_traders
                                   , 'n_distinct_trade_types': n_distinct_trade_types
                                   , 'P - Purchase (count)': n_p
                                   , 'S - Sale (count)': n_s
                                   , 'S - Sale+OE': n_s2
                                   , 'A - Grant (count)': n_a
                                   , 'D - Sale to Iss (count)': n_d
                                   , 'G - Gift (count)': n_g
                                   , 'F - Tax (count)': n_f
                                   , 'M - Option Ex (count)': n_m
                                   , 'X - Option Ex (count)': n_x
                                   , 'C - Cnv Deriv (count)': n_c
                                   , 'W - Inherited (count)': n_w
                   })
scraped_insider_df

Handling 3642 of 3642. Currently doing: ZWRK


Unnamed: 0,tickers,trade_count,min_filing_date,max_filing_date,min_trade_date,max_trade_date,n_distinct_traders,n_distinct_trade_types,P - Purchase (count),S - Sale (count),S - Sale+OE,A - Grant (count),D - Sale to Iss (count),G - Gift (count),F - Tax (count),M - Option Ex (count),X - Option Ex (count),C - Cnv Deriv (count),W - Inherited (count)
0,DDD,844,2003-08-27 20:53:25,2022-03-10 16:37:11,2000-03-16,2022-03-10,52,10,80,202,19,337,4,46,110,32,0,13,1
1,MMM,2325,2003-08-07 16:01:14,2022-02-22 09:39:47,2003-04-29,2022-02-10,88,7,11,156,190,834,0,45,634,455,0,0,0
2,EGHT,931,2003-07-29 20:33:20,2022-03-14 16:29:29,2003-07-29,2022-03-11,38,10,39,177,69,116,3,2,345,174,2,4,0
3,AKA,9,2021-09-24 20:08:40,2022-03-09 17:02:19,2021-09-22,2022-03-07,6,2,8,0,0,1,0,0,0,0,0,0,0
4,ATEN,277,2014-03-24 16:43:13,2022-03-15 18:19:01,2014-03-20,2022-03-14,23,8,10,161,6,72,0,0,1,20,2,5,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3637,ZM,578,2019-04-24 19:59:01,2022-08-16 18:01:08,2019-04-15,2022-08-12,26,10,1,88,166,3,1,19,33,65,0,201,1
3638,ZI,993,2020-08-26 16:30:24,2022-08-17 18:47:05,2020-08-24,2022-08-15,23,7,0,260,326,6,0,3,30,63,0,305,0
3639,ZI,993,2020-08-26 16:30:24,2022-08-17 18:47:05,2020-08-24,2022-08-15,23,7,0,260,326,6,0,3,30,63,0,305,0
3640,ZUO,272,2018-07-17 06:15:15,2022-07-13 19:32:54,2018-07-05,2022-07-11,29,7,0,37,84,28,0,4,6,51,0,62,0
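
The per-type lists in the loop above could be condensed. The following is a minimal sketch, assuming every `TradeType` label starts with a one-letter code followed by `" - "`, as in the labels matched above. The helper name `count_trade_codes` and the `"S+OE"` key are hypothetical and only used for illustration.

```python
import pandas as pd


def count_trade_codes(trades: pd.DataFrame) -> pd.Series:
    """Count trades per one-letter code (hypothetical helper, not part of the notebook)."""
    labels = trades["TradeType"].astype(str)
    # Take the part before " - ", so "M - Option Ex" and "M - OptEx" both map to "M".
    codes = labels.str.split(" - ", n=1).str[0].str.strip()
    # Keep "S - Sale+OE" apart from plain sales, as the notebook does.
    codes = codes.mask(labels == "S - Sale+OE", "S+OE")
    return codes.value_counts()


# Toy input using label spellings that appear in the loop above.
example = pd.DataFrame(
    {"TradeType": ["P - Purchase", "S - Sale", "S - Sale+OE", "M - Option Ex", "M - OptEx"]}
)
counts = count_trade_codes(example)
print(counts)
```

Grouping by the code rather than the full label avoids listing every spelling variant by hand, at the cost of silently merging labels that share a code.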


#### Join the data

In [8]:
data = data.join(scraped_insider_df, rsuffix='_given', lsuffix='_insider', how="left")

### Read in market timeseries

In [9]:
# File location (needs cleansing)
DATA_LOCATION_RI = DATA_LOCATION + 'processed/RI_discard/'
_ri_location = DATA_LOCATION_RI

file_locs_ = os.listdir(_ri_location)
file_locs = [_ri_location + f for f in file_locs_]

# Actually read in the company information
companies = []
print("loading return series...")
for file_loc in tqdm(file_locs):
    with open(file_loc, "rb") as f:
        company = pickle.load(f)
    companies.append(company)

loading return series...


100%|██████████| 2171/2171 [00:13<00:00, 156.78it/s]


In [10]:
isins = []
names = []
tickers = []
start_dates = []
end_dates = []
start_dates_ts = []
end_dates_ts = []
ts_rows = []
ri_ts_errors = []
for company in companies:
    isins.append(company.isin)
    names.append(company.name)
    tickers.append(company.ticker)
    start_dates.append(company.start_date)
    end_dates.append(company.end_date)
    start_dates_ts.append(company.return_index_df.index.min())
    end_dates_ts.append(company.return_index_df.index.max())
    ts_rows.append(company.return_index_df.shape[0])
    
    # In some cases, the RI is the same for all days in a company, followed by missing days.
    # A sum that is inf, -inf or NaN is flagged as an error.
    ts_ri_sum = company.return_index_df[1:].company_return.sum()
    contains_error_in_timeseries = not np.isfinite(ts_ri_sum)
    ri_ts_errors.append(contains_error_in_timeseries)
    
    if (contains_error_in_timeseries):
        print(f'{company.ticker}: {ts_ri_sum}')
    
        
# Collect to a single data frame
scraped_ts_df = pd.DataFrame({'isin': isins
                              , 'ts_rows': ts_rows
                              , 'name': names
                              , 'ticker': tickers
                              , 'start_date': start_dates
                              , 'end_date': end_dates
                              , 'start_date_ts': start_dates_ts
                              , 'end_date_ts': end_dates_ts
                              , 'RI_Errors': ri_ts_errors
                             })
scraped_ts_df

FUBO: inf
IVT: inf


Unnamed: 0,isin,ts_rows,name,ticker,start_date,end_date,start_date_ts,end_date_ts,RI_Errors
0,AN8068571086,1509,SCHLUMBERGER,SLB,1973-01-02,2022-03-16,2016-03-21,2022-03-16,False
1,BE0003816338,1509,EURONAV (NYS),EURN,2015-01-23,2022-03-16,2016-03-21,2022-03-16,False
2,BMG0464B1072,1509,ARGO GP.INTL.HOLDINGS,ARGO,1987-03-25,2022-03-16,2016-03-21,2022-03-16,False
3,BMG053841059,732,ASPEN IN.HDG. DEAD - DELIST.16/02/19,AHL,2003-12-04,2019-02-14,2016-03-21,2019-02-14,False
4,BMG0585R1060,1509,ASSURED GUARANTY,AGO,2004-04-23,2022-03-16,2016-03-21,2022-03-16,False
...,...,...,...,...,...,...,...,...,...
2166,VGG1890L1076,1509,CAPRI HOLDINGS,CPRI,2011-12-15,2022-03-16,2016-03-21,2022-03-16,False
2167,VGG273581030,1130,DESPEGAR COM,DESP,2017-09-20,2022-03-16,2017-09-20,2022-03-16,False
2168,VGG572791041,814,LUXOFT HOLDING DEAD - DELIST.14/06/19,LXFT,2013-06-26,2019-06-13,2016-03-21,2019-06-13,False
2169,VGG639071023,1509,NAM TAI PROPERTY,NTP,1990-11-26,2022-03-16,2016-03-21,2022-03-16,False


#### Join the data

In [11]:
data = data.join(scraped_ts_df, rsuffix='_given', lsuffix='_ts', how="left")
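
Note that `join` without a key aligns the two frames by row index, not by company. The time-series summary is ordered by file name, so the faulty-series output further down pairs the KELLOGG row with the FUBOTV series. A minimal sketch of a keyed alternative follows, assuming `ISIN CODE` in the input file and `isin` in the time-series summary identify the same company. The frames are toy data; the third company name and the KELLOGG `ts_rows` value are made up.

```python
import pandas as pd

# Toy stand-in for the input listing (third row is invented).
listing = pd.DataFrame({
    "ISIN CODE": ["US4878361082", "US35953D1046", "NA"],
    "NAME": ["KELLOGG", "FUBOTV", "SOME CL B"],
})
# Toy stand-in for scraped_ts_df, deliberately in a different row order.
ts_summary = pd.DataFrame({
    "isin": ["US35953D1046", "US4878361082"],
    "ts_rows": [1509, 1509],
    "RI_Errors": [True, False],
})

# Left merge on the ISIN; validate raises if an ISIN occurs twice in ts_summary.
merged = listing.merge(
    ts_summary,
    how="left",
    left_on="ISIN CODE",
    right_on="isin",
    validate="many_to_one",
)
print(merged)
```

With a keyed merge, companies without a time series get missing values in `ts_rows`, so the later `isnull()` filter would still apply.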

### Start filtering

In [12]:
data['reason_to_exclude'] = 'None'

#### Remove Companies without ISINs

In [13]:
mask = data['ISIN CODE'] == 'NA'
#data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask, 'reason_to_exclude'] = 'Missing ISIN'
data.loc[mask].shape

(1402, 38)

#### Remove companies without trades

In [14]:
mask = data['trade_count'] == 0
#data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask, 'reason_to_exclude'] = 'No trades done'
data.loc[mask].shape

(961, 38)

#### Remove companies without timeseries

In [15]:
mask = data['ts_rows'].isnull()
data.loc[mask, 'reason_to_exclude'] = 'No timeseries data'
data.loc[mask].shape

(1471, 38)

#### Remove companies where the time series from the source is wrong

In [16]:
mask = data['RI_Errors'] == True
data.loc[mask, 'reason_to_exclude'] = 'Faulty timeseries data'
display(data.loc[mask])
data.loc[mask].shape

Unnamed: 0,Type,ISIN CODE,LOC OFF. CODE,NAME,DATASTREAM CODE,CUSIP,TICKER SYMBOL,BASE OR ST DATE,DATE/TIME (DS End Date),tickers,...,isin,ts_rows,name,ticker,start_date,end_date,start_date_ts,end_date_ts,RI_Errors,reason_to_exclude
999,905922,US4878361082,U487836108,KELLOGG,905922,487836108,K,1973-01-02,2022-03-16 00:00:00,K,...,US35953D1046,1509.0,FUBOTV,FUBO,2013-04-05,2022-03-16,2016-03-21,2022-03-16,True,Faulty timeseries data
1190,2644WD,BMG637AM1024,UG637AM102,MYOVANT SCIENCES,2644WD,G637AM102,MYOV,2016-10-27,2022-03-16 00:00:00,MYOV,...,US46124J2015,1509.0,INVENTRUST PROPERTIES,IVT,2014-02-21,2022-03-16,2016-03-21,2022-03-16,True,Faulty timeseries data


(2, 38)

#### Find non-unique tickers

In [17]:
duplicate_tickers = data[data.duplicated(subset=['TICKER SYMBOL'], keep=False)]['TICKER SYMBOL']
duplicate_tickers_mask = data['TICKER SYMBOL'].isin(duplicate_tickers)
data.loc[duplicate_tickers_mask, 'reason_to_exclude'] = 'Non-unique-ticker'
data.loc[duplicate_tickers_mask].shape

(672, 38)

#### Find NA tickers

In [18]:
mask = data['TICKER SYMBOL'] == 'NA'
data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask].shape

(7, 38)
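
Each filter above overwrites `reason_to_exclude`, so a company that matches several masks keeps only the last reason assigned. This is why 1402 rows match the missing-ISIN mask, yet only 84 rows carry that label in the counts below. A minimal sketch of assigning the first matching reason in an explicit priority order with `numpy.select` follows. The priority order and the helper name `label_exclusions` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def label_exclusions(df: pd.DataFrame) -> pd.Series:
    """Return the first matching exclusion reason per row (hypothetical helper)."""
    # Order matters: np.select picks the first condition that is True.
    rules = [
        ("NA ticker", df["TICKER SYMBOL"] == "NA"),
        ("Missing ISIN", df["ISIN CODE"] == "NA"),
        ("Non-unique-ticker", df["TICKER SYMBOL"].duplicated(keep=False)),
        ("No trades done", df["trade_count"] == 0),
        ("No timeseries data", df["ts_rows"].isna()),
        ("Faulty timeseries data", df["RI_Errors"].eq(True)),
    ]
    reasons = [reason for reason, _ in rules]
    conditions = [mask.to_numpy() for _, mask in rules]
    return pd.Series(np.select(conditions, reasons, default="None"), index=df.index)


# Toy data covering one case per rule.
toy = pd.DataFrame({
    "TICKER SYMBOL": ["AAA", "BBB", "BBB", "CCC", "NA", "DDD"],
    "ISIN CODE": ["US1", "NA", "US3", "US4", "US5", "US6"],
    "trade_count": [10, 0, 5, 0, 1, 3],
    "ts_rows": [100.0, None, 50.0, 20.0, 10.0, 30.0],
    "RI_Errors": [False, False, False, False, False, True],
})
toy["reason_to_exclude"] = label_exclusions(toy)
print(toy[["TICKER SYMBOL", "reason_to_exclude"]])
```

Keeping all rules in one list makes the precedence visible, and the per-reason counts then add up to the number of companies regardless of how many rules a company matches.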

### Show what we have

In [19]:
data

Unnamed: 0,Type,ISIN CODE,LOC OFF. CODE,NAME,DATASTREAM CODE,CUSIP,TICKER SYMBOL,BASE OR ST DATE,DATE/TIME (DS End Date),tickers,...,isin,ts_rows,name,ticker,start_date,end_date,start_date_ts,end_date_ts,RI_Errors,reason_to_exclude
0,756064,US88554D2053,U88554D205,3D SYSTEMS,756064,88554D205,DDD,1988-03-10,2022-03-16 00:00:00,DDD,...,AN8068571086,1509.0,SCHLUMBERGER,SLB,1973-01-02,2022-03-16,2016-03-21,2022-03-16,False,
1,902172,US88579Y1010,U88579Y101,3M,902172,88579Y101,MMM,1973-01-02,2022-03-16 00:00:00,MMM,...,BE0003816338,1509.0,EURONAV (NYS),EURN,2015-01-23,2022-03-16,2016-03-21,2022-03-16,False,Non-unique-ticker
2,877226,US2829141009,U282914100,8X8,877226,282914100,EGHT,1997-07-03,2022-03-16 00:00:00,EGHT,...,BMG0464B1072,1509.0,ARGO GP.INTL.HOLDINGS,ARGO,1987-03-25,2022-03-16,2016-03-21,2022-03-16,False,
3,2603Y7,US00152K1016,U00152K101,A K A BRANDS HOLDING,2603Y7,00152K101,AKA,2021-09-22,2022-03-16 00:00:00,AKA,...,BMG053841059,732.0,ASPEN IN.HDG. DEAD - DELIST.16/02/19,AHL,2003-12-04,2019-02-14,2016-03-21,2019-02-14,False,
4,8748C6,US0021211018,U002121101,A10 NETWORKS,8748C6,2121101,ATEN,2014-03-21,2022-03-16 00:00:00,ATEN,...,BMG0585R1060,1509.0,ASSURED GUARANTY,AGO,2004-04-23,2022-03-16,2016-03-21,2022-03-16,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3637,94419W,,,ZOOM VIDEO COMM CL B,94419W,98980L101,ZM,2019-04-18,,ZM,...,,,,,NaT,NaT,NaT,NaT,,No timeseries data
3638,9924R8,,,ZOOMINFO TECHNOLOGIES INC CL B,9924R8,98980F104,ZI,2020-06-04,,ZI,...,,,,,NaT,NaT,NaT,NaT,,Non-unique-ticker
3639,9924TA,,,ZOOMINFO TECHNOLOGIES INC CL C,9924TA,98980F104,ZI,2020-06-04,,ZI,...,,,,,NaT,NaT,NaT,NaT,,Non-unique-ticker
3640,9295TV,,,ZUORA INC CL B,9295TV,98983V106,ZUO,2018-04-12,,ZUO,...,,,,,NaT,NaT,NaT,NaT,,Non-unique-ticker


#### Show amounts that are excluded

In [20]:
data.groupby(['reason_to_exclude']).count()[['ISIN CODE', 'Type']]

Unnamed: 0_level_0,ISIN CODE,Type
reason_to_exclude,Unnamed: 1_level_1,Unnamed: 2_level_1
Faulty timeseries data,1,1
Missing ISIN,84,84
NA ticker,7,7
No timeseries data,1118,1118
No trades done,395,395
Non-unique-ticker,665,665
,1372,1372


In [22]:
data.groupby(['reason_to_exclude'])['trade_count'].sum()

reason_to_exclude
Faulty timeseries data        183
Missing ISIN                13046
NA ticker                      35
No timeseries data         158924
No trades done                  0
Non-unique-ticker          268735
None                      1021938
Name: trade_count, dtype: int64

#### Show which ones are excluded:
The summary is also stored as a CSV file for later use.

In [23]:
# Keep every company (excluded or not), restricted to the columns needed downstream
summary_columns = ['ISIN CODE', 'NAME', 'TICKER SYMBOL', 'trade_count',
                   'n_distinct_traders', 'n_distinct_trade_types', 'ts_rows',
                   'reason_to_exclude']
scraping_summary = data[summary_columns].sort_values(by=['reason_to_exclude', 'TICKER SYMBOL', 'ISIN CODE'])
scraping_summary.to_csv(DATA_LOCATION + 'scraping_summary.csv')
scraping_summary

Unnamed: 0,ISIN CODE,NAME,TICKER SYMBOL,trade_count,n_distinct_traders,n_distinct_trade_types,ts_rows,reason_to_exclude
1190,BMG637AM1024,MYOVANT SCIENCES,MYOV,183,21,5,1509.0,Faulty timeseries data
1939,,ABOVENET 'B',ABVT,179,21,10,888.0,Missing ISIN
2093,,ATHENA CONSUMER ACQUISITION CL B,ACAQ,1,1,1,1509.0,Missing ISIN
2062,,ARCTIC CAT 'B' DEAD - 07/03/17,ACAT,446,33,8,1509.0,Missing ISIN
1942,,ACCREDO HEALTH 'B',ACDO,90,16,6,1509.0,Missing ISIN
...,...,...,...,...,...,...,...,...
1914,US98936J1016,ZENDESK,ZEN,2271,40,8,1509.0,
1057,US53228T1016,LIGHTNING EMOTORS,ZEV,16,13,5,1507.0,
1923,US98978V1035,ZOETIS A,ZTS,499,34,8,1509.0,
1926,US98983L1089,ZURN WATER SOLUTIONS,ZWS,526,27,8,1509.0,


## Check how to read in, into the analysis notebook

### This should be done early in cell 3:

In [24]:
# Read in the summary data from "CompaniesToExclude" notebook
summary_data = pd.read_csv(DATA_LOCATION + 'scraping_summary.csv', index_col=0)
# Generate list of which companies to analyse
isins_to_use = summary_data[summary_data['reason_to_exclude'] == 'None']['ISIN CODE'].to_list()
display(summary_data)
print(f'We want to reduce to {len(isins_to_use)} isins')

Unnamed: 0,ISIN CODE,NAME,TICKER SYMBOL,trade_count,n_distinct_traders,n_distinct_trade_types,ts_rows,reason_to_exclude
1190,BMG637AM1024,MYOVANT SCIENCES,MYOV,183,21,5,1509.0,Faulty timeseries data
1939,,ABOVENET 'B',ABVT,179,21,10,888.0,Missing ISIN
2093,,ATHENA CONSUMER ACQUISITION CL B,ACAQ,1,1,1,1509.0,Missing ISIN
2062,,ARCTIC CAT 'B' DEAD - 07/03/17,ACAT,446,33,8,1509.0,Missing ISIN
1942,,ACCREDO HEALTH 'B',ACDO,90,16,6,1509.0,Missing ISIN
...,...,...,...,...,...,...,...,...
1914,US98936J1016,ZENDESK,ZEN,2271,40,8,1509.0,
1057,US53228T1016,LIGHTNING EMOTORS,ZEV,16,13,5,1507.0,
1923,US98978V1035,ZOETIS A,ZTS,499,34,8,1509.0,
1926,US98983L1089,ZURN WATER SOLUTIONS,ZWS,526,27,8,1509.0,


We want to reduce to 1372 isins
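
One caveat with the round trip through CSV: `pd.read_csv` converts a default set of strings to missing values. That set includes `"NA"` and, depending on the pandas version, `"None"`, in which case the comparison with `'None'` above would match nothing. A minimal sketch of a version-independent read follows, assuming only empty cells should count as missing. The two rows are toy data in the layout of the summary file.

```python
import io

import pandas as pd

# Toy CSV in the layout of scraping_summary.csv (values are illustrative).
csv_text = (
    "ISIN CODE,TICKER SYMBOL,ts_rows,reason_to_exclude\n"
    "US98936J1016,ZEN,1509.0,None\n"
    "NA,ABVT,,Missing ISIN\n"
)

# Disable the default NA strings and treat only empty cells as missing.
summary = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[""])

isins = summary.loc[summary["reason_to_exclude"] == "None", "ISIN CODE"].to_list()
print(isins)
```

An alternative would be to store a label such as `'Keep'` instead of the string `'None'`, which avoids the ambiguity altogether.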


### This needs to be changed in the other notebook, in cell 3:

In [25]:
# Data locations
DATA_LOCATION = f'data/{NAME}/'
DATA_LOCATION_INSIDER_PROCESSED = DATA_LOCATION + 'processed/insider/'
DATA_LOCATION_RI = DATA_LOCATION + 'processed/RI_discard/'

## Not sure why we do this - maybe refactor
_ri_location = DATA_LOCATION_RI
_insider_location = DATA_LOCATION_INSIDER_PROCESSED

# Get locations to read in
file_locs_ = os.listdir(_ri_location)
print(f'Found {len(file_locs_)} possible files to analyze')
# Filter files for analysis, and append path:
file_locs = [_ri_location + f for f in file_locs_ if f[:-7] in isins_to_use]
print(f'We are left with {len(file_locs)} to analyze')


Found 2171 possible files to analyze
We are left with 1369 to analyze
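
The slice `f[:-7]` assumes that every file name ends in a seven-character suffix, presumably `.pickle`. The sketch below strips the extension with `os.path.splitext` and also lists the ISINs that have no file, which may help explain the gap between the 1372 ISINs selected and the 1369 files found. The helper name `select_files` and the example file names are assumptions.

```python
import os


def select_files(file_names, isins_to_use):
    """Split file names into those to analyse and the ISINs without a file (hypothetical helper)."""
    wanted = set(isins_to_use)
    # Map the ISIN (file name without extension) to the file name.
    by_isin = {os.path.splitext(name)[0]: name for name in file_names}
    selected = sorted(name for isin, name in by_isin.items() if isin in wanted)
    missing = sorted(wanted.difference(by_isin))
    return selected, missing


# Toy directory listing and ISIN list.
files = ["AN8068571086.pickle", "BE0003816338.pickle", "notes.txt"]
selected, missing = select_files(files, ["AN8068571086", "US88554D2053"])
print(selected)
print(missing)
```

Using a set for the lookup also keeps the filter fast when the ISIN list grows.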


### In the next cell (5):
I'm sorry, but this will have to replace the beautiful `ISINs = [rick[:-7] for rick in pickles]`

In [26]:
#ISINs = [f for f in file_locs_ if f[:-7] in isins_to_use]
#ISINs = isins_to_use
ISINs =  [f[:-7] for f in file_locs_ if f[:-7] in isins_to_use]
print(len(ISINs))

1369
