# Exploratory Data Analysis
_Written by Thomas Niedermayer and Gunnar Sjúrðarson Knudsen as a joint effort for an interdisciplinary project in Data Science._
* Supervisor: Wolfgang Aussenegg
* Co-Supervisor: Sascha Hunold

The purpose of this notebook is to understand the data quality and the scale of the task.
As we ran into several data-quality issues, we want a deeper understanding of how each kind of data is handled. Example errors:
* Tickers used for multiple companies/ISINs, so we cannot tell which ISIN an insider trade belongs to.
* Missing ISINs, which means that no time series data is available.
* Missing tickers, which means that no insider trades are available.
* For several companies, no insider trades are registered. These are also filtered out of the analysis notebook. They would not cause an error, but removing them speeds up the runtime.
* Sometimes no return index data was available. **Honestly not sure why this didn't break our analysis. Maybe this was the reason for the `-Inf`?** In any case, these companies are also removed.

Below, we check which data we have already extracted and define how to handle the various situations.

## Load libraries

In [1]:
import pickle
import pandas as pd
import numpy as np
import os
import math

from IPython.display import clear_output, display
from tqdm import tqdm

# Load custom libraries
import source.read_tickers_and_isins as URTI

## Define which data to load

In [2]:
from tools import load_settings
settings = load_settings()
NAME = settings["NAME"]
STOCK_EXCHANGE = settings["STOCK_EXCHANGE"]

In [3]:
DATA_LOCATION = f'data/{NAME}/'
_insider_location = DATA_LOCATION + 'processed/insider/'

## Read in input data

In [4]:
INPUT_FILE = f'input_data/{NAME}/{STOCK_EXCHANGE} Composite 16.3.2022 plus dead firms - {NAME}.xlsx'

In [5]:
data = URTI.read_tickers_and_isins(INPUT_FILE)

Reading tickers


In [6]:
data

Unnamed: 0,Type,ISIN CODE,LOC OFF. CODE,NAME,DATASTREAM CODE,CUSIP,TICKER SYMBOL,BASE OR ST DATE,DATE/TIME (DS End Date)
0,2606T9,KYG870761080,UG87076108,10X CAPITAL VENTURE ACQUISITION II A,2606T9,G87076108,VCXA,2021-10-05,2022-03-16
1,95118Z,US88025U1097,U88025U109,10X GENOMICS A,95118Z,88025U109,TXG,2019-09-12,2022-03-16
2,9330V3,US68247Q1022,U68247Q102,111 ADR 1:2,9330V3,68247Q102,YI,2018-09-12,2022-03-16
3,99133U,US81807M2052,U81807M205,17 ED.TGP.ADR 1:10,99133U,81807M205,YQ,2020-12-04,2022-03-16
4,9101KM,US68236V1044,U68236V104,180 LIFE SCIENCES,9101KM,68236V104,ATNF,2017-06-27,2022-03-16
...,...,...,...,...,...,...,...,...,...
4073,30358V,US9898171015,U989817101,ZUMIEZ,30358V,989817101,ZUMZ,2005-05-06,2022-03-16
4074,50259R,US98880R1095,U98880R109,ZW DATA ACTION TECHNOLOGIES,50259R,98880R109,CNET,2007-10-15,2022-03-16
4075,2568RT,US98985X1000,U98985X100,ZYMERGEN,2568RT,98985X100,ZY,2021-04-22,2022-03-16
4076,98116P,US98986X1090,U98986X109,ZYNERBA PHARMACEUTICALS,98116P,98986X109,ZYNE,2015-08-05,2022-03-16


## Read in scraped data
We need this so that we can also exclude companies for which the scraping returned no data.

### Insider trades

In [7]:
# Define dummy placeholders
tickers = []
trade_counts = []
min_filing_date = []
max_filing_date = []
min_trade_date = []
max_trade_date = []
n_distinct_traders = []
n_distinct_trade_types = []

n_p = []
n_s = []
n_s2 = []
n_a = []
n_d = []
n_g = []
n_f = []
n_m = []
n_x = []
n_c = []
n_w = []

# helpers
counter = 0
total_count = len(data['TICKER SYMBOL'])

# Read in scraped files, and do various aggregations
for ticker in data['TICKER SYMBOL']:
    counter = counter +1
    clear_output(wait=True)
    print(f'Handling {counter} of {total_count}. Currently doing: {ticker}')
    
    dat = pd.read_csv(_insider_location + ticker + '.csv', index_col=0, parse_dates=['FilingDate', 'TradeDate'])

    tickers.append(ticker)
    trade_counts.append(dat.shape[0])

    min_filing_date.append(dat['FilingDate'].min())
    max_filing_date.append(dat['FilingDate'].max())

    min_trade_date.append(dat['TradeDate'].min())
    max_trade_date.append(dat['TradeDate'].max())

    n_distinct_traders.append(dat['InsiderName'].nunique())
    n_distinct_trade_types.append(dat['TradeType'].nunique())


    n_p.append(sum(dat['TradeType'] == 'P - Purchase'))
    n_s.append(sum(dat['TradeType'] == 'S - Sale'))
    n_s2.append(sum(dat['TradeType'] == 'S - Sale+OE'))

    n_a.append(sum(dat['TradeType'] == 'A - Grant'))
    n_d.append(sum(dat['TradeType'] == 'D - Sale to Iss') + sum(dat['TradeType'] == 'D - Sale to issuer'))
    n_g.append(sum(dat['TradeType'] == 'G - Gift'))
    n_f.append(sum(dat['TradeType'] == 'F - Tax'))
    n_m.append(sum(dat['TradeType'] == 'M - Option Ex') + sum(dat['TradeType'] == 'M - OptEx'))
    n_x.append(sum(dat['TradeType'] == 'X - Option Ex') + sum(dat['TradeType'] == 'X - OptEx'))
    n_c.append(sum(dat['TradeType'] == 'C - Cnv Deriv') + sum(dat['TradeType'] == 'C - Converted deriv'))
    n_w.append(sum(dat['TradeType'] == 'W - Inherited'))
    
# Collect to a single data frame
scraped_insider_df = pd.DataFrame({'tickers': tickers
                                   , 'trade_count': trade_counts
                                   , 'min_filing_date': min_filing_date
                                   , 'max_filing_date': max_filing_date
                                   , 'min_trade_date': min_trade_date
                                   , 'max_trade_date': max_trade_date
                                   , 'n_distinct_traders': n_distinct_traders
                                   , 'n_distinct_trade_types': n_distinct_trade_types
                                   , 'P - Purchase (count)': n_p
                                   , 'S - Sale (count)': n_s
                                   , 'S - Sale+OE': n_s2
                                   , 'A - Grant (count)': n_a
                                   , 'D - Sale to Iss (count)': n_d
                                   , 'G - Gift (count)': n_g
                                   , 'F - Tax (count)': n_f
                                   , 'M - Option Ex (count)': n_m
                                   , 'X - Option Ex (count)': n_x
                                   , 'C - Cnv Deriv (count)': n_c
                                   , 'W - Inherited (count)': n_w
                   })
scraped_insider_df
scraped_insider_df = scraped_insider_df.drop_duplicates()

Handling 4078 of 4078. Currently doing: ZNGA
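
The loop above keeps one list per statistic. A more compact variant, sketched here under the assumption that every CSV has the `TradeDate`, `InsiderName` and `TradeType` columns used above, builds one dictionary per ticker and counts all trade types in a single pass. The helper name `summarise_trades` and the toy data are hypothetical and only cover a subset of the columns.

```python
import pandas as pd

def summarise_trades(ticker, dat):
    """Return one summary row (as a dict) for a single ticker's trades."""
    # Count every trade type once instead of one comparison per type
    type_counts = dat['TradeType'].value_counts()
    return {
        'tickers': ticker,
        'trade_count': len(dat),
        'min_trade_date': dat['TradeDate'].min(),
        'max_trade_date': dat['TradeDate'].max(),
        'n_distinct_traders': dat['InsiderName'].nunique(),
        'n_distinct_trade_types': dat['TradeType'].nunique(),
        'P - Purchase (count)': int(type_counts.get('P - Purchase', 0)),
        'S - Sale (count)': int(type_counts.get('S - Sale', 0)),
    }

# Toy data standing in for one scraped CSV file (hypothetical values)
toy = pd.DataFrame({
    'TradeDate': pd.to_datetime(['2021-01-04', '2021-02-01', '2021-03-15']),
    'InsiderName': ['Doe John', 'Doe John', 'Roe Jane'],
    'TradeType': ['P - Purchase', 'S - Sale', 'S - Sale'],
})
summary_row = summarise_trades('TOY', toy)
summary_df = pd.DataFrame([summary_row])
```

Collecting the dictionaries in a list and passing that list to `pd.DataFrame` gives one row per ticker, similar to `scraped_insider_df`.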


#### Join the data

In [8]:
#data = data.join(scraped_insider_df, rsuffix='_given', lsuffix='_insider', how="left")
data = pd.merge(data, scraped_insider_df, how='left', left_on = 'TICKER SYMBOL', right_on = 'tickers')
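
A left merge silently multiplies rows if the right-hand table repeats a key, and it leaves no trace of rows that found no match. The sketch below, on hypothetical toy data, uses the `validate` and `indicator` parameters of `pd.merge` to make both situations visible. It is an illustration of a possible safety check, not part of the pipeline above.

```python
import pandas as pd

# Toy stand-ins for the input list and the scraped summary (hypothetical values)
left = pd.DataFrame({'TICKER SYMBOL': ['AAA', 'BBB', 'NA', 'NA'],
                     'ISIN CODE': ['X1', 'X2', 'X3', 'X4']})
right = pd.DataFrame({'tickers': ['AAA', 'NA'], 'trade_count': [5, 0]})

merged = pd.merge(
    left, right, how='left',
    left_on='TICKER SYMBOL', right_on='tickers',
    validate='many_to_one',  # raises MergeError if `right` repeats a ticker
    indicator=True,          # adds a `_merge` column: 'both' or 'left_only'
)
# Rows of the input list that have no scraped summary at all
n_unmatched = int((merged['_merge'] == 'left_only').sum())
```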

### Read in market timeseries

In [9]:
# File location (needs cleansing)
DATA_LOCATION_RI = DATA_LOCATION + 'processed/RI_discard/'
_ri_location = DATA_LOCATION_RI

file_locs_ = os.listdir(_ri_location)
file_locs = [_ri_location + f for f in file_locs_]

# Actually read in the company information
companies = []
print("loading return series...")
for file_loc in tqdm(file_locs):
    with open(file_loc, "rb") as f:
        company = pickle.load(f)
    companies.append(company)

loading return series...


100%|██████████| 4077/4077 [00:01<00:00, 2166.71it/s]


In [10]:
isins = []
names = []
tickers = []
start_dates = []
end_dates = []
start_dates_ts = []
end_dates_ts = []
ts_rows = []
ri_ts_errors = []
for company in companies:
    isins.append(company.isin)
    names.append(company.name)
    tickers.append(company.ticker)
    start_dates.append(company.start_date)
    end_dates.append(company.end_date)
    start_dates_ts.append(company.return_index_df.index.min())
    end_dates_ts.append(company.return_index_df.index.max())
    ts_rows.append(company.return_index_df.shape[0])
    
    # In some cases, the RI is the same for all days in a company, followed by missing days.
    ts_ri_sum = company.return_index_df[1:].company_return.sum() 
    
    # Add check to see if there is a change in price at all
    ts_ri_diff = company.return_index_df[1:].company_return.min() - company.return_index_df[1:].company_return.max()
    
    if (ts_ri_sum == np.inf):  # np.inf instead of np.Inf; the alias was removed in NumPy 2.0
        contains_error_in_timeseries = True
    elif (-ts_ri_sum == np.inf):
        contains_error_in_timeseries = True
    elif (math.isnan(ts_ri_sum)):
        contains_error_in_timeseries = True
    elif (math.isnan(ts_ri_diff)):
        contains_error_in_timeseries = True
    elif (ts_ri_diff == 0):
        contains_error_in_timeseries = True
    elif (company.return_index_df[1:].company_return.isnull().any() == True):
        contains_error_in_timeseries = True
    else:
        contains_error_in_timeseries = False
    ri_ts_errors.append(contains_error_in_timeseries)
    
    if (contains_error_in_timeseries):
        print(f'{company.ticker}: {ts_ri_sum} {contains_error_in_timeseries}')
    
        
# Collect to a single data frame
scraped_ts_df = pd.DataFrame({'isin': isins
                              , 'ts_rows': ts_rows
                              , 'name': names
                              , 'ticker': tickers
                              , 'start_date': start_dates
                              , 'end_date': end_dates
                              , 'start_date_ts': start_dates_ts
                              , 'end_date_ts': end_dates_ts
                              , 'RI_Errors': ri_ts_errors
                             })
scraped_ts_df = scraped_ts_df.drop_duplicates()
scraped_ts_df#[scraped_ts_df['ticker'] == 'ABIO']

NA: 0.04859999999999998 True
IPCI: inf True
SPHS: 9.755259204904322 True
IGLDF: -3.9164696927135467 True
ROSGQ: inf True
ESTRF: 9.294235124590656 True
LINUF: 59.66332377001714 True
NA: -0.0009999999999998899 True
RBZHF: 5.506560240880486 True
ABILF: inf True
DRYS: 0.09898764305395646 True
GLBS: inf True
PSHG: -3.6888402702323795 True
SHIP: inf True
TOPS: inf True
ABIO: inf True
ABEO: inf True
ACER: inf True
ACHV: inf True
ADYX: inf True
AIKI: inf True
ANTH: inf True
ARDMQ: inf True
ATIS: inf True
AVGR: inf True
AYRO: inf True
AYTU: inf True
BSPM: 22.563775671395504 True
CDTI: inf True
CTIC: 0.0 True
CLBS: 0.0 True
CLRB: inf True
CTRCQ: 6.6075396825396835 True
CALI: inf True
CNTFY: inf True
CLRD: inf True
CODA: inf True
CYCC: inf True
DESTQ: inf True
DTRM: inf True
DFFN: 0.0 True
ESESQ: 0.0 True
ENVB: inf True
FNJN: 0.0 True
NA: -0.12230000000000008 True
FSNNQ: inf True
GADS: inf True
GEVO: inf True
HMNY: 1.5206380557799282 True
IDEX: inf True
IMNPQ: inf True
INPX: inf True
INAPQ: inf T

Unnamed: 0,isin,ts_rows,name,ticker,start_date,end_date,start_date_ts,end_date_ts,RI_Errors
0,AGP8696W1045,737,SINOVAC BIOTECH,SVA,2003-09-26,2019-02-22,2016-03-21,2019-02-22,False
1,AU000000ITL3,1157,INTEGRATED MEDIA TECH.,IMTE,2017-08-11,2022-03-16,2017-08-11,2022-03-16,False
2,AU0000185993,82,IRIS ENERGY PTY,IREN,2021-11-17,2022-03-16,2021-11-17,2022-03-16,False
3,AU0000198582,942,CENNTRO ELECTRIC GROUP,CENN,2018-06-20,2022-03-16,2018-06-20,2022-03-16,False
4,AU0000205205,242,TRITIUM DCFC,DCFC,2021-04-01,2022-03-16,2021-04-01,2022-03-16,False
...,...,...,...,...,...,...,...,...,...
4072,VGG870841027,1129,TDH HOLDINGS,PETZ,2017-09-21,2022-03-16,2017-09-21,2022-03-16,False
4073,VGG9320Z1099,40,VAHANNA TECH EDGE ACQUISITION I A,VHNA,2022-01-13,2022-03-11,2022-01-13,2022-03-11,False
4074,VGG941841014,727,WAH FU EDUCATION GROUP,WAFU,2019-04-30,2022-03-16,2019-04-30,2022-03-16,False
4075,VGG9604C1077,315,MEIWU TECHNOLOGY COMPANY,WNW,2020-12-15,2022-03-16,2020-12-15,2022-03-16,False
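
The chain of `if`/`elif` checks above flags a return series when it contains `inf`, `-inf` or `NaN`, when it is empty, or when it never changes. A minimal sketch of a roughly equivalent single check follows. The function name `has_faulty_returns` and the toy series are hypothetical; in the notebook the caller would pass `company.return_index_df[1:].company_return`.

```python
import numpy as np
import pandas as pd

def has_faulty_returns(returns):
    """Flag a return series that is empty, non-finite, or constant."""
    values = returns.to_numpy(dtype=float)
    if values.size == 0:
        return True                      # nothing to analyse
    if not np.isfinite(values).all():
        return True                      # contains NaN, inf or -inf
    return bool(values.min() == values.max())  # no movement at all

# Toy series covering each case (hypothetical values)
ok = pd.Series([0.01, -0.02, 0.005])
with_inf = pd.Series([0.01, np.inf, 0.0])
with_nan = pd.Series([0.01, np.nan, 0.0])
flat = pd.Series([0.0, 0.0, 0.0])
flags = [has_faulty_returns(s) for s in (ok, with_inf, with_nan, flat)]
```

Checking the values directly, rather than their sum, also explains why several tickers in the output above are flagged although their printed sum is finite: a `NaN` is skipped by the pandas `sum`, and a constant series has a perfectly ordinary sum.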


#### Join the data

In [11]:
#data = data.join(scraped_ts_df, rsuffix='_given', lsuffix='_ts', how="left", left_on = 'ISIN CODE', right_on = 'isin')
data = pd.merge(data, scraped_ts_df, how="left", left_on = 'ISIN CODE', right_on = 'isin')

### Start filtering

In [12]:
data['reason_to_exclude'] = 'None'

#### Remove Companies without ISINs

In [13]:
mask = data['ISIN CODE'] == 'NA'
#data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask, 'reason_to_exclude'] = 'Missing ISIN'
data.loc[mask].shape

(2, 38)

#### Remove companies without trades

In [14]:
mask = data['trade_count'] == 0
#data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask, 'reason_to_exclude'] = 'No trades done'
data.loc[mask].shape

(922, 38)

#### Remove companies without timeseries

In [15]:
mask = data['ts_rows'].isnull()
data.loc[mask, 'reason_to_exclude'] = 'No timeseries data'
data.loc[mask].shape

(0, 38)

#### Remove companies where the time series from the source is wrong

In [16]:
mask = data['RI_Errors'] == True
data.loc[mask, 'reason_to_exclude'] = 'Faulty timeseries data'
data.loc[mask].shape

(92, 38)

#### Find non-unique tickers

In [17]:
duplicate_tickers = data[data.duplicated(subset=['TICKER SYMBOL'],keep=False)]['TICKER SYMBOL']
duplicate_tickers_mask = data['TICKER SYMBOL'].isin(duplicate_tickers)
data.loc[duplicate_tickers_mask, 'reason_to_exclude'] = 'Non-unique-ticker'
data.loc[duplicate_tickers_mask].shape

(251, 38)
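
The `keep=False` argument matters here: with the default `keep='first'`, only the second and later occurrences of a ticker are marked, so the first company using that ticker would slip through. A small illustration on hypothetical toy data:

```python
import pandas as pd

toy = pd.DataFrame({'TICKER SYMBOL': ['AAA', 'BBB', 'AAA', 'CCC'],
                    'ISIN CODE': ['X1', 'X2', 'X3', 'X4']})

# Default keep='first' marks only the repeat occurrences
later_only = toy.duplicated(subset=['TICKER SYMBOL'])
# keep=False marks every row whose ticker occurs more than once
all_rows = toy.duplicated(subset=['TICKER SYMBOL'], keep=False)
```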

#### Find NA tickers

In [18]:
mask = data['TICKER SYMBOL'] == 'NA'
data.loc[mask, 'reason_to_exclude'] = 'NA ticker'
data.loc[mask].shape

(158, 38)
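
Because each mask is applied in turn, a later rule overwrites the reason set by an earlier one. This is why the per-step counts above (2 missing ISINs, 922 without trades, 92 faulty time series, 251 non-unique tickers) are larger than the final counts in the summary table further down (1, 867, 87 and 93). The sketch below makes that priority explicit with `np.select`, which picks the first matching condition, so the rules are listed in reverse order of the cells above. The toy data is hypothetical.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'TICKER SYMBOL': ['AAA', 'NA', 'BBB', 'BBB', 'CCC'],
    'ISIN CODE':     ['X1',  'X2', 'X3',  'X4',  'NA'],
    'trade_count':   [10,    0,    3,     0,     7],
    'ts_rows':       [100.0, 50.0, 80.0,  80.0,  60.0],
    'RI_Errors':     [False, True, False, False, False],
})

# First match wins, so the highest-priority rule comes first
conditions = [
    toy['TICKER SYMBOL'] == 'NA',
    toy['TICKER SYMBOL'].duplicated(keep=False),
    toy['RI_Errors'],
    toy['ts_rows'].isnull(),
    toy['trade_count'] == 0,
    toy['ISIN CODE'] == 'NA',
]
reasons = ['NA ticker', 'Non-unique-ticker', 'Faulty timeseries data',
           'No timeseries data', 'No trades done', 'Missing ISIN']
toy['reason_to_exclude'] = np.select(conditions, reasons, default='None')
```

In the toy data, the second row has a faulty time series and no trades, but it is reported as `NA ticker` because that rule has the highest priority.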

### Show what we have

In [19]:
data

Unnamed: 0,Type,ISIN CODE,LOC OFF. CODE,NAME,DATASTREAM CODE,CUSIP,TICKER SYMBOL,BASE OR ST DATE,DATE/TIME (DS End Date),tickers,...,isin,ts_rows,name,ticker,start_date,end_date,start_date_ts,end_date_ts,RI_Errors,reason_to_exclude
0,2606T9,KYG870761080,UG87076108,10X CAPITAL VENTURE ACQUISITION II A,2606T9,G87076108,VCXA,2021-10-05,2022-03-16,VCXA,...,KYG870761080,113,10X CAPITAL VENTURE ACQUISITION II A,VCXA,2021-10-05,2022-03-16,2021-10-05,2022-03-16,False,No trades done
1,95118Z,US88025U1097,U88025U109,10X GENOMICS A,95118Z,88025U109,TXG,2019-09-12,2022-03-16,TXG,...,US88025U1097,633,10X GENOMICS A,TXG,2019-09-12,2022-03-16,2019-09-12,2022-03-16,False,
2,9330V3,US68247Q1022,U68247Q102,111 ADR 1:2,9330V3,68247Q102,YI,2018-09-12,2022-03-16,YI,...,US68247Q1022,884,111 ADR 1:2,YI,2018-09-12,2022-03-16,2018-09-12,2022-03-16,False,No trades done
3,99133U,US81807M2052,U81807M205,17 ED.TGP.ADR 1:10,99133U,81807M205,YQ,2020-12-04,2022-03-16,YQ,...,US81807M2052,322,17 ED.TGP.ADR 1:10,YQ,2020-12-04,2022-03-16,2020-12-04,2022-03-16,False,No trades done
4,9101KM,US68236V1044,U68236V104,180 LIFE SCIENCES,9101KM,68236V104,ATNF,2017-06-27,2022-03-16,ATNF,...,US68236V1044,1189,180 LIFE SCIENCES,ATNF,2017-06-27,2022-03-16,2017-06-27,2022-03-16,False,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4073,30358V,US9898171015,U989817101,ZUMIEZ,30358V,989817101,ZUMZ,2005-05-06,2022-03-16,ZUMZ,...,US9898171015,1509,ZUMIEZ,ZUMZ,2005-05-06,2022-03-16,2016-03-21,2022-03-16,False,
4074,50259R,US98880R1095,U98880R109,ZW DATA ACTION TECHNOLOGIES,50259R,98880R109,CNET,2007-10-15,2022-03-16,CNET,...,US98880R1095,1509,ZW DATA ACTION TECHNOLOGIES,CNET,2007-10-15,2022-03-16,2016-03-21,2022-03-16,False,
4075,2568RT,US98985X1000,U98985X100,ZYMERGEN,2568RT,98985X100,ZY,2021-04-22,2022-03-16,ZY,...,US98985X1000,228,ZYMERGEN,ZY,2021-04-22,2022-03-16,2021-04-22,2022-03-16,False,
4076,98116P,US98986X1090,U98986X109,ZYNERBA PHARMACEUTICALS,98116P,98986X109,ZYNE,2015-08-05,2022-03-16,ZYNE,...,US98986X1090,1509,ZYNERBA PHARMACEUTICALS,ZYNE,2015-08-05,2022-03-16,2016-03-21,2022-03-16,False,


#### Show which ones are excluded:
The summary is also stored as CSV for later use.

In [20]:
#scraping_summary = data[data['reason_to_exclude']!='None']
scraping_summary = data
scraping_summary = scraping_summary[[#'Type'
                      'ISIN CODE'
                     #, 'LOC OFF. CODE'
                     , 'NAME'
                     #, 'DATASTREAM CODE'
                     #, 'CUSIP'
                     , 'TICKER SYMBOL'
                     #, 'BASE OR ST DATE'
                     #, 'DATE/TIME (DS End Date)'
                     #, 'tickers'
                     , 'trade_count'
                     #, 'min_filing_date'
                     #, 'max_filing_date'
                     #, 'min_trade_date'
                     #, 'max_trade_date'
                     , 'n_distinct_traders'
                     , 'n_distinct_trade_types'
                     #, 'P - Purchase (count)'
                     #, 'S - Sale (count)'
                     #, 'S - Sale+OE'
                     #, 'A - Grant (count)'
                     #, 'D - Sale to Iss (count)'
                     #, 'G - Gift (count)'
                     #, 'F - Tax (count)'
                     #, 'M - Option Ex (count)'
                     #, 'X - Option Ex (count)'
                     #, 'C - Cnv Deriv (count)'
                     #, 'W - Inherited (count)'
                     #, 'isin'
                     , 'ts_rows'
                     #, 'name'
                     #, 'ticker'
                     #, 'start_date'
                     #, 'end_date'
                     #, 'start_date_ts'
                     #, 'end_date_ts'
                     , 'reason_to_exclude'
                    ]].sort_values(by=['reason_to_exclude', 'TICKER SYMBOL', 'ISIN CODE'])
scraping_summary.to_csv(os.path.join(DATA_LOCATION, 'scraping_summary.csv'))  # avoids the double slash, DATA_LOCATION already ends with '/'
scraping_summary

Unnamed: 0,ISIN CODE,NAME,TICKER SYMBOL,trade_count,n_distinct_traders,n_distinct_trade_types,ts_rows,reason_to_exclude
36,US00289Y1073,ABEONA THERAPEUTICS,ABEO,102,30,7,1509,Faulty timeseries data
38,KYG8789K1242,ABILITY,ABILF,0,0,0,1506,Faulty timeseries data
341,US00211Y5069,ARCA BIOPHARMA,ABIO,109,34,7,1509,Faulty timeseries data
60,US00444P1084,ACER THERAPEUTICS,ACER,110,26,10,1509,Faulty timeseries data
64,US0044685008,ACHIEVE LIFE SCIENCES,ACHV,186,23,7,1509,Faulty timeseries data
...,...,...,...,...,...,...,...,...
4072,US98980G1022,ZSCALER,ZS,440,16,7,1008,
4071,US98979H2022,ZOSANO PHARMA,ZSAN,61,19,6,1509,
4073,US9898171015,ZUMIEZ,ZUMZ,495,26,7,1509,
4075,US98985X1000,ZYMERGEN,ZY,25,9,7,228,


#### Compute basic statistics of what was taken out


In [21]:
agg_scraping_summary = scraping_summary.groupby('reason_to_exclude').agg({'ISIN CODE': 'count'
                                                                          , 'trade_count': 'sum'
                                                                          , 'n_distinct_traders':'sum'
                                                                          , 'ts_rows':'sum'}).reset_index().rename(columns={'reason_to_exclude': 'Exclusion Reason'
                                                                                                                           , 'ISIN CODE': 'N Companies'
                                                                                                                           , 'trade_count': 'N trades'
                                                                                                                           , 'n_distinct_traders': 'N distinct traders'
                                                                                                                           , 'ts_rows': 'N TS rows'})
agg_scraping_summary["N TS rows"] = agg_scraping_summary["N TS rows"].astype(int)
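
The same aggregation can be written with named aggregation, which sets the output column names in the `agg` call and avoids most of the separate `rename`. The sketch below uses hypothetical toy data and only some of the columns. Note that summing `n_distinct_traders` over companies counts a person once per company, so the resulting total is an upper bound on the number of distinct traders rather than an exact count.

```python
import pandas as pd

toy = pd.DataFrame({
    'reason_to_exclude': ['None', 'None', 'No trades done'],
    'ISIN CODE': ['X1', 'X2', 'X3'],
    'trade_count': [10, 5, 0],
    'ts_rows': [100.0, 200.0, 50.0],
})

# Each output column is defined as (input column, aggregation function)
agg = (toy.groupby('reason_to_exclude')
          .agg(**{'N Companies': ('ISIN CODE', 'count'),
                  'N trades': ('trade_count', 'sum'),
                  'N TS rows': ('ts_rows', 'sum')})
          .reset_index()
          .rename(columns={'reason_to_exclude': 'Exclusion Reason'}))
agg['N TS rows'] = agg['N TS rows'].astype(int)
```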

##### And print it as LaTeX code for the report

In [22]:
def bold_rows(x):
    lenx = x.shape[0]-1
    return ['font-weight: bold' if (v == x.loc[lenx]) else '' for v in x]

latex = agg_scraping_summary.style\
.apply(bold_rows)\
.applymap_index(lambda v: "font-weight: bold;", axis="index")\
.applymap_index(lambda v: "font-weight: bold;", axis="columns")\
.format(na_rep="-", precision=1)\
.hide(axis="index")\
.format(thousands=".")\
.format_index(escape="latex", axis=1)\
.to_latex(convert_css=True
          , column_format="lrrrr"
          , position="H"
          , position_float="centering"
          , hrules=True
          , label="table:excluded_companies"
          , caption="Summary of discarded input"
          , multirow_align="t"
          , multicol_align="r"
          #, index = False
          , siunitx = True
)
print(latex)
print("---------------")
latex

\begin{table}[H]
\centering
\caption{Summary of discarded input}
\label{table:excluded_companies}
\begin{tabular}{lrrrr}
\toprule
{\bfseries Exclusion Reason} & {\bfseries N Companies} & {\bfseries N trades} & {\bfseries N distinct traders} & {N TS rows} \\
\midrule
Faulty timeseries data & 87 & 10.740 & 1.301 & 122.284 \\
Missing ISIN & 1 & 17 & 8 & 1.369 \\
NA ticker & 158 & 790 & 632 & 54.796 \\
No trades done & 867 & 0 & 0 & 531.503 \\
Non-unique-ticker & 93 & 30.132 & 1.587 & 102.720 \\
\bfseries None & \bfseries 2.872 & \bfseries 1.106.364 & \bfseries 68.027 & \bfseries 3.263.791 \\
\bottomrule
\end{tabular}
\end{table}

---------------


'\\begin{table}[H]\n\\centering\n\\caption{Summary of discarded input}\n\\label{table:excluded_companies}\n\\begin{tabular}{lrrrr}\n\\toprule\n{\\bfseries Exclusion Reason} & {\\bfseries N Companies} & {\\bfseries N trades} & {\\bfseries N distinct traders} & {N TS rows} \\\\\n\\midrule\nFaulty timeseries data & 87 & 10.740 & 1.301 & 122.284 \\\\\nMissing ISIN & 1 & 17 & 8 & 1.369 \\\\\nNA ticker & 158 & 790 & 632 & 54.796 \\\\\nNo trades done & 867 & 0 & 0 & 531.503 \\\\\nNon-unique-ticker & 93 & 30.132 & 1.587 & 102.720 \\\\\n\\bfseries None & \\bfseries 2.872 & \\bfseries 1.106.364 & \\bfseries 68.027 & \\bfseries 3.263.791 \\\\\n\\bottomrule\n\\end{tabular}\n\\end{table}\n'

## Check how to read the summary into the analysis notebook

### This should be done early in cell 3:

In [23]:
# Read in the summary data from "CompaniesToExclude" notebook
summary_data = pd.read_csv(os.path.join(DATA_LOCATION, 'scraping_summary.csv'), index_col=0)
# Generate list of which companies to analyse
isins_to_use = summary_data[summary_data['reason_to_exclude'] == 'None']['ISIN CODE'].to_list()
display(summary_data)
print(f'We want to reduce to {len(isins_to_use)} isins')

Unnamed: 0,ISIN CODE,NAME,TICKER SYMBOL,trade_count,n_distinct_traders,n_distinct_trade_types,ts_rows,reason_to_exclude
36,US00289Y1073,ABEONA THERAPEUTICS,ABEO,102,30,7,1509,Faulty timeseries data
38,KYG8789K1242,ABILITY,ABILF,0,0,0,1506,Faulty timeseries data
341,US00211Y5069,ARCA BIOPHARMA,ABIO,109,34,7,1509,Faulty timeseries data
60,US00444P1084,ACER THERAPEUTICS,ACER,110,26,10,1509,Faulty timeseries data
64,US0044685008,ACHIEVE LIFE SCIENCES,ACHV,186,23,7,1509,Faulty timeseries data
...,...,...,...,...,...,...,...,...
4072,US98980G1022,ZSCALER,ZS,440,16,7,1008,
4071,US98979H2022,ZOSANO PHARMA,ZSAN,61,19,6,1509,
4073,US9898171015,ZUMIEZ,ZUMZ,495,26,7,1509,
4075,US98985X1000,ZYMERGEN,ZY,25,9,7,228,


We want to reduce to 2872 isins
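
One caveat when reading the summary back: `read_csv` converts certain literal strings to `NaN` by default. `NA` is on that list, so the `NA` tickers come back as missing values, and to our knowledge newer pandas versions (2.0 onward) also treat the string `None` that way, which would make the `== 'None'` filter above match nothing. The run above returned 2872 ISINs, so the environment used here was not affected. The sketch below, on a hypothetical in-memory CSV, shows a way to keep both strings as text regardless of the pandas version.

```python
import io
import pandas as pd

# Hypothetical two-row stand-in for scraping_summary.csv
csv_text = (
    "ISIN CODE,TICKER SYMBOL,reason_to_exclude\n"
    "X1,AAA,None\n"
    "X2,NA,NA ticker\n"
)

# keep_default_na=False disables the built-in NA strings;
# with na_values=[''] only empty fields are treated as missing
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False, na_values=[''])
isins_to_use = kept.loc[kept['reason_to_exclude'] == 'None', 'ISIN CODE'].to_list()
```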


In [27]:
#summary_data[summary_data['TICKER SYMBOL'] == 'CLBS']
#summary_data[summary_data['TICKER SYMBOL'] == 'VIVE']

### This needs to be changed in the other notebook, in cell 3:

In [25]:
# Data locations
DATA_LOCATION = f'data/{NAME}/'
DATA_LOCATION_INSIDER_PROCESSED = DATA_LOCATION + 'processed/insider/'
DATA_LOCATION_RI = DATA_LOCATION + 'processed/RI_discard/'

## Not sure why we do this - maybe refactor
_ri_location = DATA_LOCATION_RI
_insider_location = DATA_LOCATION_INSIDER_PROCESSED

# Get locations to read in
file_locs_ = os.listdir(_ri_location)
print(f'Found {len(file_locs_)} possible files to analyze')
# Filter files for analysis, and append path:
file_locs = [_ri_location + f for f in file_locs_ if f[:-7] in isins_to_use]
print(f'We are left with {len(file_locs)} to analyze')


Found 4077 possible files to analyze
We are left with 2872 to analyze


### In the next cell (5):
I'm sorry, but this will have to replace the beautiful `ISINs = [rick[:-7] for rick in pickles]`

In [26]:
#ISINs = [f for f in file_locs_ if f[:-7] in isins_to_use]
#ISINs = isins_to_use
ISINs =  [f[:-7] for f in file_locs_ if f[:-7] in isins_to_use]
print(len(ISINs))

2872
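
The `f[:-7]` slice removes the last seven characters of each file name, which suggests files named `<ISIN>.pickle`; that extension is an assumption here. A slightly more defensive sketch, with hypothetical file names, splits off the extension explicitly and uses a set so that each membership test does not scan the whole list:

```python
import os

# Hypothetical directory listing and ISIN list
file_names = ['US0000000001.pickle', 'US0000000002.pickle', 'notes.txt']
isins_to_use = ['US0000000002', 'US0000000003']

wanted = set(isins_to_use)  # constant-time membership tests
ISINs = [os.path.splitext(f)[0] for f in file_names
         if f.endswith('.pickle') and os.path.splitext(f)[0] in wanted]
```

Files with another extension are skipped instead of being cut at an arbitrary position.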
