# Preprocess Large Holdings File

#### Converting the raw 50+ GB SAS file with the complete holdings data into a sparse Python matrix that can be loaded into memory and, more importantly, handled more efficiently by different algorithms.
#### The logic behind this process is as follows:

Loading the data and converting it into a CSV file to work with

1. 50+ GB holdings.sas7bdat file containing all the holdings data, downloaded directly from WRDS using an FTP client
2. Converted into CSV using the sas7bdat_to_csv utility (Link)

Two-step process to transform the file into a sparse matrix

The challenge is to convert from rows that each describe one holding to rows that each describe all holdings of one fund at one point in time. It is also crucial to keep track of which row of the sparse matrix is which fund at which date and which columns are which securities.

3. Open file in python 
4. Parse through the file to make two lists: one with all fund/date combinations (using the combination as an ID) and one with all securities.
5. Generate a sparse matrix with the dimensions "number of fund/date combinations" x "number of securities"
6. Parse through the large CSV file again and write each percent_tna value (the percentage of the fund held in that particular security) into the right cell of the sparse matrix, as determined by two maps based on all fund/date combinations and securities
7. Save final sparse matrix and tables containing information about which row is which fund/date and which column is which security.
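
The two passes described above can be sketched on a tiny in-memory file. This is only an illustration under assumptions: the four toy holdings and the names `toy_csv`, `funds`, `securities`, `positions` and `toy_matrix` are hypothetical, and the whole file is read at once instead of in chunks.

```python
import io

import pandas as pd
from scipy import sparse

# Hypothetical toy data: one row per holding (fund, date, security, weight).
toy_csv = io.StringIO(
    "crsp_portno,report_dt,crsp_company_key,percent_tna\n"
    "1,100,11,60.0\n"
    "1,100,22,40.0\n"
    "2,100,22,100.0\n"
    "1,200,11,55.0\n"
)
holdings = pd.read_csv(toy_csv)

# Pass 1: collect the unique fund/date combinations and securities
# and give each of them a row or column position.
funds = (holdings[["crsp_portno", "report_dt"]]
         .drop_duplicates()
         .reset_index(drop=True))
funds["row"] = funds.index
securities = (holdings[["crsp_company_key"]]
              .drop_duplicates()
              .reset_index(drop=True))
securities["col"] = securities.index

# Pass 2: look up the position of every holding and fill the matrix.
positions = (holdings
             .merge(funds, on=["crsp_portno", "report_dt"], how="left")
             .merge(securities, on="crsp_company_key", how="left"))
toy_matrix = sparse.csr_matrix(
    (positions["percent_tna"].values,
     (positions["row"].values, positions["col"].values)),
    shape=(len(funds), len(securities)),
)
print(toy_matrix.toarray())
```

Each row of `toy_matrix` is one fund at one date, each column one security, and `funds` and `securities` are the lookup tables that have to be saved alongside it.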

TODO

Parsing through csv file could be significantly sped up using something like: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file
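
One possible direction for this TODO, not benchmarked here, is to stream the file row by row with the standard-library `csv` module and collect the unique keys in sets. The function name `unique_keys` and the sample rows are hypothetical; whether this beats chunked `pd.read_csv` on the real file would have to be measured.

```python
import csv
import io


def unique_keys(lines):
    """Stream CSV rows and collect unique fund/date pairs and security keys."""
    fund_dates = set()
    securities = set()
    for record in csv.DictReader(lines):
        fund_dates.add((int(record["crsp_portno"]), int(record["report_dt"])))
        securities.add(int(record["crsp_company_key"]))
    return fund_dates, securities


# Hypothetical sample standing in for the real file; any iterable of lines
# (for example an open file handle) works the same way.
sample = io.StringIO(
    "crsp_portno,report_dt,crsp_company_key,percent_tna\n"
    "1,100,11,60.0\n"
    "1,100,22,40.0\n"
    "2,100,22,100.0\n"
)
fund_dates, securities = unique_keys(sample)
print(len(fund_dates), len(securities))
```

Only one row is held in memory at a time, so memory use depends on the number of unique keys rather than on the file size.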

## Import statements

In [2]:
import os
import sys

import feather

import numpy as np
import pandas as pd
from scipy import sparse

## Load File

In [3]:
path = '../data/raw/out.csv'

## Parse complete file to get all unique fund/date combinations and stocks

In [4]:
%%time
chunksize = 10 ** 7
unit = chunksize / 184_578_843

reader = pd.read_csv(path,
                     usecols = ['crsp_portno','report_dt',
                                'crsp_company_key','security_name','cusip'],
                     dtype = {'crsp_portno': np.int64,
                              'report_dt': np.int64,
                              'crsp_company_key': np.int64,
                              'security_name': str,
                              'cusip': str},
                     low_memory=False,
                     chunksize=chunksize)

dfList_1 = []
dfList_2 = []

for i, chunk in enumerate(reader):
    temp_df_1 = chunk.loc[:,['crsp_portno','report_dt']].drop_duplicates()
    temp_df_2 = chunk.loc[:,['crsp_company_key','security_name','cusip']].drop_duplicates()
    dfList_1.append(temp_df_1)
    dfList_2.append(temp_df_2)

    print("{:6.2f}%".format(((i+1) * unit * 100)))        

  5.42%
 10.84%
 16.25%
 21.67%
 27.09%
 32.51%
 37.92%
 43.34%
 48.76%
 54.18%
 59.60%
 65.01%
 70.43%
 75.85%
 81.27%
 86.68%
 92.10%
 97.52%
102.94%
CPU times: user 3min 35s, sys: 58 s, total: 4min 33s
Wall time: 4min 41s


In [5]:
df_1 = pd.concat(dfList_1,sort=False)
df_2 = pd.concat(dfList_2,sort=False)

df_1 = df_1.drop_duplicates()
df_2 = df_2.drop_duplicates()

In [6]:
# Generate a unique ID from the portno and the date of a fund/date combination
df_1 = df_1.assign(port_id = ((df_1['crsp_portno'] * 1000000 + df_1['report_dt'])))
df_1 = df_1.rename(columns = {'report_dt':'report_dt_int'})

df_1 = df_1.assign(report_dt = pd.to_timedelta(df_1['report_dt_int'], unit='D') + pd.Timestamp('1960-1-1'))

df_1 = df_1.reset_index(drop = True)
df_1 = (df_1
        .assign(row = df_1.index)
        .set_index('port_id'))
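
The ID and date logic of the cell above can be checked on a single value. This is a minimal sketch; the helper names `make_port_id`, `split_port_id` and `sas_date_to_timestamp` are hypothetical and not part of the notebook. It assumes `report_dt` is a SAS date (days since 1960-01-01) that is non-negative and below 1,000,000, since otherwise the combined ID would no longer be unique.

```python
import pandas as pd

SAS_EPOCH = pd.Timestamp("1960-01-01")  # SAS dates count days from this date


def make_port_id(crsp_portno, report_dt):
    """Combine portfolio number and SAS date into one integer ID."""
    return crsp_portno * 1_000_000 + report_dt


def split_port_id(port_id):
    """Recover (crsp_portno, report_dt) from a combined ID."""
    return divmod(port_id, 1_000_000)


def sas_date_to_timestamp(report_dt):
    """Convert a SAS day count into a pandas Timestamp."""
    return SAS_EPOCH + pd.Timedelta(days=report_dt)


port_id = make_port_id(1000001, 15795)
print(port_id)
print(split_port_id(port_id))
print(sas_date_to_timestamp(15795).date())
```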

In [7]:
df_2 = df_2.reset_index(drop = True)
df_2 = (df_2
        .assign(col = df_2.index)
        .set_index('crsp_company_key'))

In [8]:
df_1.head(1)

| port_id | crsp_portno | report_dt_int | report_dt | row |
|---|---|---|---|---|
| 1000001015795 | 1000001 | 15795 | 2003-03-31 | 0 |


In [9]:
df_2.head(1)

| crsp_company_key | security_name | cusip | col |
|---|---|---|---|
| 3000328 | ADVANCED NEUROMODULATION SYS INC | 00757T10 | 0 |


## Parse complete file to generate data for sparse matrix

In [10]:
%%time
chunksize = 10 ** 7
unit = chunksize / 184_578_843

reader = pd.read_csv(path,
                     usecols = ['crsp_portno','report_dt','crsp_company_key','percent_tna'],
                     dtype = {'crsp_portno': np.int64,
                              'report_dt': np.int64,
                              'crsp_company_key': np.int64,
                              'percent_tna':np.float64},
                     low_memory=False,
                     chunksize=chunksize)

CPU times: user 2.43 ms, sys: 4.22 ms, total: 6.65 ms
Wall time: 12.9 ms


In [11]:
# TODO pd.merge seems to be faster in this case than df.join
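
As a starting point for this TODO, the sketch below shows that a left `join` on the index and a left `merge` on the key column attach the same `row` values. The toy frames `chunk` and `row_map` are hypothetical stand-ins for one CSV chunk and for `df_1_temp`; any speed difference would have to be timed on real chunks.

```python
import pandas as pd

# Hypothetical stand-ins for one chunk and for the port_id -> row lookup.
chunk = pd.DataFrame({"port_id": [10, 20, 10],
                      "percent_tna": [1.5, 2.5, 3.5]})
row_map = pd.DataFrame({"row": [0, 1]},
                       index=pd.Index([10, 20], name="port_id"))

# Variant used in the notebook: move the key into the index, then join.
joined = chunk.set_index("port_id").join(row_map, how="left").reset_index()

# Alternative: merge directly on the key column against the lookup's index.
merged = chunk.merge(row_map, left_on="port_id", right_index=True, how="left")

# Compare both results independent of row order.
key = ["port_id", "percent_tna"]
cols = ["port_id", "percent_tna", "row"]
joined_sorted = joined.sort_values(key)[cols].reset_index(drop=True)
merged_sorted = merged.sort_values(key)[cols].reset_index(drop=True)
print(joined_sorted.values.tolist() == merged_sorted.values.tolist())
```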

In [12]:
%%time
dfList = []

df_1_temp = df_1.loc[:,['row']]
df_2_temp = df_2.loc[:,['col']]

for i, chunk in enumerate(reader):
    temp_df = chunk.dropna()
    temp_df = temp_df.assign(port_id = ((temp_df['crsp_portno'] * 1000000 + temp_df['report_dt'])))
    temp_df.set_index('port_id',inplace=True)
    temp_df = temp_df.join(df_1_temp, how='left')
    temp_df.set_index('crsp_company_key',inplace=True)
    temp_df = temp_df.join(df_2_temp, how='left')
    temp_df = temp_df[['percent_tna','row','col']]
    dfList.append(temp_df)

    print("{:6.2f}%".format(((i+1) * unit * 100)))

  5.42%
 10.84%
 16.25%
 21.67%
 27.09%
 32.51%
 37.92%
 43.34%
 48.76%
 54.18%
 59.60%
 65.01%
 70.43%
 75.85%
 81.27%
 86.68%
 92.10%
 97.52%
102.94%
CPU times: user 4min, sys: 1min 26s, total: 5min 26s
Wall time: 5min 31s


In [13]:
df_sparse = pd.concat(dfList,sort=False)
df_sparse.reset_index(drop=True,inplace=True)
print(df_sparse.shape)
df_sparse.head(3)

(182439247, 3)


| | percent_tna | row | col |
|---|---|---|---|
| 0 | 0.03 | 339 | 4255 |
| 1 | 0.09 | 442 | 4255 |
| 2 | 0.14 | 443 | 4255 |


## Delete duplicates
All other filters will be applied later, but this one has to be done before the sparse matrix is created.
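
The reason the order matters: when `scipy.sparse.csr_matrix` is built from (value, (row, col)) triples, entries that share the same row and column are summed. The toy arrays below are hypothetical and only demonstrate that behaviour.

```python
import numpy as np
from scipy import sparse

# Two hypothetical holdings that land in the same cell (row 0, col 1).
values = np.array([0.25, 0.5])
rows = np.array([0, 0])
cols = np.array([1, 1])

dup_matrix = sparse.csr_matrix((values, (rows, cols)), shape=(1, 2))

# The duplicates are added up into a single stored value.
print(dup_matrix[0, 1])
print(dup_matrix.nnz)
```

A holding reported twice would therefore show up with the sum of both weights unless the duplicate is removed first.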

In [14]:
duplicates_mask = df_sparse.duplicated(['col','row'],keep='last')
df_sparse = df_sparse[~duplicates_mask]

## Check if holdings data makes sense 

In [15]:
merged_data = pd.merge(df_sparse,df_1[['report_dt','row']],how='left',on='row')

In [16]:
date = pd.to_datetime('2016-09-30')

sum_col = (merged_data
           .query('report_dt == @date')
           .groupby(by = ['col'])
           .sum()
           .sort_values('percent_tna',ascending = False))

sum_col.join(df_2.set_index('col'),how='left').head(10)

| col | percent_tna | row | security_name | cusip |
|---|---|---|---|---|
| 685 | 4154.578687 | 631344159 | APPLE INC | 03783310 |
| 8918 | 4022.548534 | 1041673573 | USD Cash | |
| 591 | 3479.578842 | 618682882 | MICROSOFT CORP | 59491810 |
| 774 | 3282.068953 | 461306389 | AMAZON COM INC | 02313510 |
| 15297 | 2958.859 | 480896666 | FACEBOOK INC | 30303M10 |
| 23776 | 2559.409133 | 506235478 | ALPHABET INC | 02079K30 |
| 594272 | 2490.869217 | 27468843 | Proshares Trust Var Perp | |
| 23993 | 2410.419165 | 462650691 | ALPHABET INC | 02079K10 |
| 630 | 2298.759256 | 492656280 | JPMORGAN CHASE & CO | 46625H10 |
| 926 | 2013.569344 | 491763426 | JOHNSON & JOHNSON | 47816010 |


In [17]:
# Seems to make sense. Interestingly Alphabet appears twice. TODO check if two share classes

## Change fund info and security info dfs for future use

In [18]:
df_1 = df_1[['crsp_portno','report_dt','row']].assign(port_id = df_1.index)
df_1.set_index('row',inplace=True)
df_1.sample()

| row | crsp_portno | report_dt | port_id |
|---|---|---|---|
| 442628 | 1026331 | 2015-12-31 | 1026331020453 |


In [19]:
df_2 = df_2.assign(crsp_company_key = df_2.index)
df_2.set_index('col',inplace=True)
df_2.sample()

| col | security_name | cusip | crsp_company_key |
|---|---|---|---|
| 1713447 | ASP MCS Acquisition Corp FRN 20-May | | 8692984 |


## Create sparse matrix

In [20]:
sparse_matrix = sparse.csr_matrix((df_sparse['percent_tna'].values, (df_sparse['row'].values, df_sparse['col'].values)))
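
In the cell above the matrix shape is inferred from the largest row and column index that actually occurs. A possible safeguard, sketched here on hypothetical toy arrays, is to pass `shape` explicitly so that the matrix always matches the lengths of the row and column info tables, even if the last fund/date combinations or securities have no remaining holdings.

```python
import numpy as np
from scipy import sparse

values = np.array([10.0, 20.0])
rows = np.array([0, 1])
cols = np.array([0, 1])

# Hypothetical lengths of the fund/date table and the security table.
n_rows, n_cols = 3, 4

# Shape inferred from the data: trailing empty rows/columns are lost.
inferred = sparse.csr_matrix((values, (rows, cols)))

# Shape given explicitly: the dimensions match the lookup tables.
explicit = sparse.csr_matrix((values, (rows, cols)), shape=(n_rows, n_cols))

print(inferred.shape, explicit.shape)
```

In the notebook this would correspond to passing `shape=(df_1.shape[0], df_2.shape[0])`, which is an assumption about how the tables are meant to line up and not something the original cell does.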

In [21]:
# Check if all dimensions match

In [22]:
print('Number of fund/date combinations:        {:12,d}'.format(sparse_matrix.shape[0]))
print('Number of unique securities:             {:12,d}'.format(sparse_matrix.shape[1]))
print('Number of values in sparse matrix:       {:12,d}'.format(sparse_matrix.getnnz()))
print()
print('Number of rows in fund info df:          {:12,d}'.format(df_1.shape[0]))
print('Number of rows in security info df:      {:12,d}'.format(df_2.shape[0]))
print()
match_test = (sparse_matrix.shape[0] == df_1.shape[0]) & (sparse_matrix.shape[1] == df_2.shape[0])
print('Everything matches:                              {}'.format(match_test))

Number of fund/date combinations:             738,860
Number of unique securities:                2,382,969
Number of values in sparse matrix:        175,466,107

Number of rows in fund info df:               738,860
Number of rows in security info df:         2,382,969

Everything matches:                              True


## Save data

#### Sparse matrix containing holdings

In [23]:
path = '../data/interim/holdings'
sparse.save_npz(path, sparse_matrix)

#### Fund/date info

In [24]:
path = '../data/interim/row_info.feather'
feather.write_dataframe(df_1,path)

#### Securities info

In [25]:
path = '../data/interim/col_info.feather'
feather.write_dataframe(df_2,path)
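
For later notebooks, reading the saved matrix back and looking up one fund could look roughly like the sketch below. It uses a temporary directory and hypothetical toy data instead of the real `../data/interim` files, and it only covers the `.npz` part; the feather tables would be read back with the feather library in the same spirit. Note that `sparse.save_npz` appends the `.npz` extension when the given path does not already end in it.

```python
import os
import tempfile

import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical toy matrix and row info standing in for the saved artifacts.
matrix = sparse.csr_matrix(np.array([[0.0, 1.5], [2.5, 0.0]]))
row_info = pd.DataFrame({"crsp_portno": [1, 2]},
                        index=pd.Index([0, 1], name="row"))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "holdings")
    sparse.save_npz(path, matrix)               # written as holdings.npz
    loaded = sparse.load_npz(path + ".npz")

# Look up the holdings of the fund/date stored in matrix row 1.
fund_row = 1
weights = loaded[fund_row]          # 1 x n_securities sparse row
held_cols = weights.indices         # column positions of the held securities
portno = row_info.loc[fund_row, "crsp_portno"]
print(portno, held_cols, weights.data)
```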