<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Preprocess-Large-Holdings-File" data-toc-modified-id="Preprocess-Large-Holdings-File-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Preprocess Large Holdings File</a></span><ul class="toc-item"><li><span><a href="#Import-statements" data-toc-modified-id="Import-statements-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Import statements</a></span></li><li><span><a href="#Load-File" data-toc-modified-id="Load-File-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Load File</a></span></li><li><span><a href="#Parse-complete-file-to-get-all-unique-fund/date-combinations-and-stocks" data-toc-modified-id="Parse-complete-file-to-get-all-unique-fund/date-combinations-and-stocks-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Parse complete file to get all unique fund/date combinations and stocks</a></span></li><li><span><a href="#Parse-complete-file-to-generate-data-for-sparse-matrix" data-toc-modified-id="Parse-complete-file-to-generate-data-for-sparse-matrix-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Parse complete file to generate data for sparse matrix</a></span></li><li><span><a href="#Delete-duplicates" data-toc-modified-id="Delete-duplicates-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Delete duplicates</a></span></li><li><span><a href="#Check-if-holdings-data-makes-sense" data-toc-modified-id="Check-if-holdings-data-makes-sense-1.6"><span class="toc-item-num">1.6&nbsp;&nbsp;</span>Check if holdings data makes sense</a></span></li><li><span><a href="#Change-fund-info-and-security-info-dfs-for-future-use" data-toc-modified-id="Change-fund-info-and-security-info-dfs-for-future-use-1.7"><span class="toc-item-num">1.7&nbsp;&nbsp;</span>Change fund info and security info dfs for future use</a></span></li><li><span><a href="#Create-sparse-matrix" data-toc-modified-id="Create-sparse-matrix-1.8"><span class="toc-item-num">1.8&nbsp;&nbsp;</span>Create sparse 
matrix</a></span></li><li><span><a href="#Save-data" data-toc-modified-id="Save-data-1.9"><span class="toc-item-num">1.9&nbsp;&nbsp;</span>Save data</a></span></li></ul></li></ul></div>

# Preprocess Large Holdings File

#### Converting the raw 50+ GB SAS file containing the complete holdings data into a sparse Python matrix that can be loaded into memory and, more importantly, handled more efficiently by different algorithms.
#### The logic behind this process is as follows:

Loading the data and converting it into a CSV file to work with

1. 50+ GB holdings.sas7bdat file containing all the holdings data, downloaded directly from WRDS using an FTP client
2. Converted into CSV using the sas7bdat_to_csv utility (Link)

Two-step process to transform the file into a sparse matrix

The challenge is to convert from rows that each describe one holding to rows that each describe all holdings of one fund at one point in time. It is also crucial to keep track of which row of the sparse matrix is which fund at which date and which columns are which securities.

3. Open file in python 
4. Parse through the file to make two lists: one with all fund/date combinations (using the combination as an ID) and one with all securities.
5. Generate a sparse matrix with the dimensions "number of fund/date combinations" x "number of securities"
6. Parse through the large CSV file again and fill the percent_tna value (percentage of the fund's total net assets held in that particular security) into the right spot of the sparse matrix, as determined by two maps based on all fund/date combinations and securities
7. Save final sparse matrix and tables containing information about which row is which fund/date and which column is which security.
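
As a rough illustration of steps 3 to 7, the sketch below runs the same two-pass idea on a tiny in-memory CSV. The contents of `toy_csv`, the chunk size of 2, and all variable names are made up for this example. It merges on the two key columns directly rather than on a combined ID, and it builds the matrix from the collected (value, row, column) triplets after the second pass.

```python
import io

import pandas as pd
from scipy import sparse

# Hypothetical toy data standing in for the large holdings CSV.
toy_csv = """crsp_portno,report_dt,crsp_company_key,percent_tna
1,20000,10,60.0
1,20000,11,40.0
2,20000,10,100.0
1,20090,11,100.0
"""

# Pass 1: collect unique fund/date combinations and unique securities.
fund_parts, security_parts = [], []
for chunk in pd.read_csv(io.StringIO(toy_csv), chunksize=2):
    fund_parts.append(chunk[['crsp_portno', 'report_dt']].drop_duplicates())
    security_parts.append(chunk[['crsp_company_key']].drop_duplicates())

funds = pd.concat(fund_parts).drop_duplicates().reset_index(drop=True)
securities = pd.concat(security_parts).drop_duplicates().reset_index(drop=True)

# The position in each table becomes the matrix row or column number.
funds['row'] = funds.index
securities['col'] = securities.index

# Pass 2: map every holding to its (row, col) position.
triplet_parts = []
for chunk in pd.read_csv(io.StringIO(toy_csv), chunksize=2):
    chunk = chunk.merge(funds, on=['crsp_portno', 'report_dt'], how='left')
    chunk = chunk.merge(securities, on='crsp_company_key', how='left')
    triplet_parts.append(chunk[['percent_tna', 'row', 'col']])
triplets = pd.concat(triplet_parts, ignore_index=True)

# Build the sparse matrix: one row per fund/date, one column per security.
toy_matrix = sparse.csr_matrix(
    (triplets['percent_tna'].values,
     (triplets['row'].values, triplets['col'].values)),
    shape=(len(funds), len(securities)))

print(toy_matrix.toarray())
```

The `funds` and `securities` tables play the role of the row and column lookup tables that are saved alongside the matrix.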

TODO

Parsing through csv file could be significantly sped up using something like: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file

## Import statements

In [None]:
import os
import sys

import feather

import numpy as np
import pandas as pd
from scipy import sparse

## Load File

In [None]:
path = '../data/raw/holdings.csv'

## Parse complete file to get all unique fund/date combinations and stocks

In [None]:
%%time
chunksize = 10 ** 7
unit = chunksize / 184_578_843

reader = pd.read_csv(path,
                     usecols = ['crsp_portno','report_dt',
                                'crsp_company_key','security_name','cusip'],
                     dtype = {'crsp_portno': np.int64,
                              'report_dt': np.int64,
                              'crsp_company_key': np.int64,
                              'security_name': str,
                              'cusip': str},
                     low_memory=False,
                     chunksize=chunksize)

dfList_1 = []
dfList_2 = []

for i, chunk in enumerate(reader):
    temp_df_1 = chunk.loc[:,['crsp_portno','report_dt']].drop_duplicates()
    temp_df_2 = chunk.loc[:,['crsp_company_key','security_name','cusip']].drop_duplicates()
    dfList_1.append(temp_df_1)
    dfList_2.append(temp_df_2)

    print("{:6.2f}%".format(((i+1) * unit * 100)))        
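
The progress printout above divides by the hardcoded value 184_578_843, which looks like the total number of rows in the CSV file. If that number is not known in advance, it could be obtained with one cheap pass that only counts newline characters. The following is a minimal sketch under the assumptions that the file has one header line, ends with a newline, and contains no line breaks inside quoted fields; `count_data_rows` and the demo file are hypothetical names used only here.

```python
import os
import tempfile


def count_data_rows(csv_path, buffer_size=1024 * 1024):
    """Count newline characters in a file and subtract one for the header."""
    newlines = 0
    with open(csv_path, 'rb') as handle:
        while True:
            block = handle.read(buffer_size)
            if not block:
                break
            newlines += block.count(b'\n')
    return newlines - 1


# Usage on a small temporary file with a header and three data rows.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    with open(demo_path, 'w') as handle:
        handle.write('a,b\n1,2\n3,4\n5,6\n')
    n_rows = count_data_rows(demo_path)

demo_chunksize = 2
demo_unit = demo_chunksize / n_rows  # fraction of the file covered per chunk
print(n_rows, demo_unit)
```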

In [None]:
df_1 = pd.concat(dfList_1,sort=False)
df_2 = pd.concat(dfList_2,sort=False)

df_1 = df_1.drop_duplicates()
df_2 = df_2.drop_duplicates()

In [None]:
# Generate a unique ID from the portno and the date of a fund/date combination
df_1 = df_1.assign(port_id = ((df_1['crsp_portno'] * 1000000 + df_1['report_dt'])))
df_1 = df_1.rename(columns = {'report_dt':'report_dt_int'})

# report_dt_int counts days since 1960-01-01 (SAS date convention); convert it to a timestamp
df_1 = df_1.assign(report_dt = pd.to_timedelta(df_1['report_dt_int'], unit='D') + pd.Timestamp('1960-1-1'))

df_1 = df_1.reset_index(drop = True)
df_1 = (df_1
        .assign(row = df_1.index)
        .set_index('port_id'))
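
The combined ID packs the portfolio number into the high digits and the integer date into the lowest six digits, so both parts can be recovered from it. The sketch below shows the encoding, the decoding, and the date conversion on made-up values. It assumes the integer date is non-negative and smaller than 1,000,000, which holds for day counts since 1960 in any realistic reporting period; the `demo` frame and its values are hypothetical.

```python
import pandas as pd

# Hypothetical portfolio numbers and SAS-style integer dates.
demo = pd.DataFrame({'crsp_portno': [1000001, 1000002],
                     'report_dt': [20727, 20819]})

# Encode: portfolio number in the high digits, date in the lowest six digits.
demo['port_id'] = demo['crsp_portno'] * 1_000_000 + demo['report_dt']

# Decode: integer division and remainder recover both parts.
decoded_portno = demo['port_id'] // 1_000_000
decoded_dt = demo['port_id'] % 1_000_000

# Convert the day count since 1960-01-01 into a timestamp.
demo['report_date'] = (pd.to_timedelta(demo['report_dt'], unit='D')
                       + pd.Timestamp('1960-01-01'))

print(demo)
```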

In [None]:
df_2 = df_2.reset_index(drop = True)
df_2 = (df_2
        .assign(col = df_2.index)
        .set_index('crsp_company_key'))
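
One thing that may be worth checking at this point: the security table was deduplicated on the combination of company key, security name, and CUSIP, so the same `crsp_company_key` could in principle appear more than once in the index. Whether that happens in the actual data is not known here. The sketch below only demonstrates, on made-up data, what pandas does in that case: a left join against a non-unique index returns one output row per match, and `Index.is_unique` reports the problem.

```python
import pandas as pd

# Hypothetical lookup table in which company key 10 appears twice.
security_map = pd.DataFrame(
    {'col': [0, 1, 2]},
    index=pd.Index([10, 10, 11], name='crsp_company_key'))

# Hypothetical chunk with one holding per company key.
holdings_chunk = pd.DataFrame(
    {'crsp_company_key': [10, 11], 'percent_tna': [5.0, 3.0]}
).set_index('crsp_company_key')

# The holding with key 10 is matched twice, so the result has three rows.
joined = holdings_chunk.join(security_map, how='left')

# A quick check that would flag the situation before joining.
key_is_unique = security_map.index.is_unique

print(joined)
print(key_is_unique)
```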

In [None]:
df_1.head(1)

In [None]:
df_2.head(1)

## Parse complete file to generate data for sparse matrix

In [None]:
%%time
chunksize = 10 ** 7
unit = chunksize / 184_578_843

reader = pd.read_csv(path,
                     usecols = ['crsp_portno','report_dt','crsp_company_key','percent_tna'],
                     dtype = {'crsp_portno': np.int64,
                              'report_dt': np.int64,
                              'crsp_company_key': np.int64,
                              'percent_tna':np.float64},
                     low_memory=False,
                     chunksize=chunksize)

In [None]:
# TODO pd.merge seems to be faster in this case than df.join

In [None]:
%%time
dfList = []

df_1_temp = df_1.loc[:,['row']]
df_2_temp = df_2.loc[:,['col']]

for i, chunk in enumerate(reader):
    temp_df = chunk.dropna()
    temp_df = temp_df.assign(port_id = ((temp_df['crsp_portno'] * 1000000 + temp_df['report_dt'])))
    temp_df.set_index('port_id',inplace=True)
    temp_df = temp_df.join(df_1_temp, how='left')
    temp_df.set_index('crsp_company_key',inplace=True)
    temp_df = temp_df.join(df_2_temp, how='left')
    temp_df = temp_df[['percent_tna','row','col']]
    dfList.append(temp_df)

    print("{:6.2f}%".format(((i+1) * unit * 100)))

In [None]:
df_sparse = pd.concat(dfList,sort=False)
df_sparse.reset_index(drop=True,inplace=True)
print(df_sparse.shape)
df_sparse.head(3)

## Delete duplicates
All other filters will be applied later, but this one has to be done before the sparse matrix is created, because duplicate (row, col) entries would otherwise be summed when the matrix is built.

In [None]:
duplicates_mask = df_sparse.duplicated(['col','row'],keep='last')
df_sparse = df_sparse[~duplicates_mask]
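
To illustrate why this step matters, the sketch below builds a small matrix from made-up triplets with and without deduplication. When a `csr_matrix` is built from (value, (row, col)) data, entries that share the same position are added together, so a duplicated holding would be counted twice. The `demo` frame and its values are hypothetical.

```python
import pandas as pd
from scipy import sparse

# Hypothetical triplets: position (0, 2) appears twice.
demo = pd.DataFrame({'percent_tna': [5.0, 7.0, 3.0],
                     'row': [0, 0, 1],
                     'col': [2, 2, 0]})

# Without deduplication the two values at (0, 2) are summed.
summed = sparse.csr_matrix(
    (demo['percent_tna'].values, (demo['row'].values, demo['col'].values)),
    shape=(2, 3))

# With deduplication only the last value per position is kept.
deduped = demo[~demo.duplicated(['col', 'row'], keep='last')]
clean = sparse.csr_matrix(
    (deduped['percent_tna'].values,
     (deduped['row'].values, deduped['col'].values)),
    shape=(2, 3))

print(summed[0, 2], clean[0, 2])
```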

## Check if holdings data makes sense 

In [None]:
merged_data = pd.merge(df_sparse,df_1[['report_dt','row']],how='left',on='row')

In [None]:
date = pd.to_datetime('2016-09-30')

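# Sum only percent_tna; summing the datetime column raises an error in recent pandas versions
sum_col = (merged_data
           .query('report_dt == @date')
           .groupby(by = ['col'])[['percent_tna']]
           .sum()
           .sort_values('percent_tna',ascending = False))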

sum_col.join(df_2.set_index('col'),how='left').head(10)

In [None]:
# Seems to make sense. Interestingly Alphabet appears twice. TODO check if two share classes

## Change fund info and security info dfs for future use

In [None]:
df_1 = df_1[['crsp_portno','report_dt','row']].assign(port_id = df_1.index)
df_1.set_index('row',inplace=True)
df_1.sample()

In [None]:
df_2 = df_2.assign(crsp_company_key = df_2.index)
df_2.set_index('col',inplace=True)
df_2.sample()

## Create sparse matrix

In [None]:
sparse_matrix = sparse.csr_matrix((df_sparse['percent_tna'].values, (df_sparse['row'].values, df_sparse['col'].values)))
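
When no shape is given, `csr_matrix` infers the dimensions from the largest row and column index that actually occurs in the data. If the highest-numbered fund/date or security has no remaining entries, the matrix comes out smaller than the lookup tables. The sketch below shows the difference on made-up arrays; passing the shape explicitly is one possible safeguard, not necessarily what this notebook needs, and the sizes used are hypothetical.

```python
import numpy as np
from scipy import sparse

# Hypothetical triplets that only touch rows 0-1 and columns 0-1.
data = np.array([1.0, 2.0])
rows = np.array([0, 1])
cols = np.array([0, 1])

# Shape inferred from the largest indices present in the data.
inferred = sparse.csr_matrix((data, (rows, cols)))

# Shape fixed up front, e.g. to the lengths of the two lookup tables.
n_fund_dates, n_securities = 3, 4  # hypothetical table sizes
explicit = sparse.csr_matrix((data, (rows, cols)),
                             shape=(n_fund_dates, n_securities))

print(inferred.shape, explicit.shape)
```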

In [None]:
# Check if all dimensions match

In [None]:
print('Number of fund/date combinations:        {:12,d}'.format(sparse_matrix.shape[0]))
print('Number of unique securities:             {:12,d}'.format(sparse_matrix.shape[1]))
print('Number of non-zero values in sparse matrix:       {:12,d}'.format(sparse_matrix.getnnz()))
print()
print('Number of rows in fund info df:          {:12,d}'.format(df_1.shape[0]))
print('Number of rows in security info df:      {:12,d}'.format(df_2.shape[0]))
print()
match_test = (sparse_matrix.shape[0] == df_1.shape[0]) & (sparse_matrix.shape[1] == df_2.shape[0])
print('Everything matches:                              {}'.format(match_test))

## Save data

#### Sparse matrix containing holdings

In [None]:
path = '../data/interim/holdings'
sparse.save_npz(path, sparse_matrix)
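
Note that `save_npz` appends the `.npz` extension when the given file name does not already end with it, so the file has to be loaded with the extension included. The following round-trip sketch uses a temporary directory and a made-up matrix instead of the real paths.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

# Hypothetical small matrix standing in for the holdings matrix.
demo_matrix = sparse.csr_matrix(np.array([[0.0, 1.5], [2.5, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'holdings')   # no extension given
    sparse.save_npz(demo_path, demo_matrix)
    saved_files = os.listdir(tmp_dir)               # extension was appended
    reloaded = sparse.load_npz(demo_path + '.npz')

round_trip_ok = np.array_equal(demo_matrix.toarray(), reloaded.toarray())
print(saved_files, round_trip_ok)
```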

#### Fund/date info

In [None]:
path = '../data/interim/row_info.feather'
feather.write_dataframe(df_1,path)

#### Securities info

In [None]:
path = '../data/interim/col_info.feather'
feather.write_dataframe(df_2,path)
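
Once the three pieces are saved, they are meant to be used together: the row table locates a fund at a date, the matrix row holds its weights, and the column table names the securities. The sketch below shows one way such a lookup could work. It uses small in-memory stand-ins rather than the saved files, and the names `row_info`, `col_info`, and `holdings` as well as all values are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical stand-ins for the saved tables; the index is the row/column number.
row_info = pd.DataFrame({
    'crsp_portno': [1, 2],
    'report_dt': pd.to_datetime(['2016-09-30', '2016-09-30'])})
col_info = pd.DataFrame({'security_name': ['A', 'B', 'C']})
holdings = sparse.csr_matrix(np.array([[60.0, 0.0, 40.0],
                                       [0.0, 100.0, 0.0]]))

# Find the matrix row of fund 1 on the chosen date.
mask = ((row_info['crsp_portno'] == 1)
        & (row_info['report_dt'] == pd.Timestamp('2016-09-30')))
row = int(row_info.loc[mask].index[0])

# Extract the non-zero entries of that row and attach the security info.
fund_row = holdings.getrow(row)
fund_holdings = (col_info
                 .iloc[fund_row.indices]
                 .assign(percent_tna=fund_row.data))

print(fund_holdings)
```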