# Preprocess Large Holdings File

#### Converting the raw 50+ GB SAS file containing the complete holdings data into a sparse Python matrix that fits into memory and, more importantly, can be handled more efficiently by different algorithms.
#### The logic behind this process is as follows:

Loading the data and converting it into a csv file to work with

1. 50+ GB holdings.sas7bdat file containing all the holdings data, downloaded directly from WRDS using an FTP client
2. Converted into csv using sas7bdat_to_csv utility (Link)

Two-step process to transform the file into a sparse matrix
The challenge is to convert from rows that each describe one holding to rows that each describe the holdings of one fund at one point in time. It is also crucial to keep track of which row of the sparse matrix is which fund at which date, and which columns are which securities.

3. Open file in python 
4. Parse through the file to make two lists: one with all fund/date combinations (using the combination as an ID) and one with all securities.
5. Generate a sparse matrix with the dimensions "number of fund/date combinations" x "number of securities"
6. Parse through the large csv file again and write the percentage_tna number (the percentage of the fund held in that particular security) into the right spot of the sparse matrix, as determined by two maps based on all fund/date combinations and securities
7. Save final sparse matrix and tables containing information about which row is which fund/date and which column is which security.
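
The two passes in steps 4 to 6 can be sketched on a tiny in-memory csv. This is only an illustration under assumptions: the column names and the toy values are hypothetical and do not match the layout of the real holdings file.

```python
import csv
import io

from scipy import sparse

# Toy stand-in for the holdings csv; column names and values are hypothetical.
toy_csv = """port_no,date,percent_tna,security
1,100,60.0,A
1,100,40.0,B
2,100,100.0,B
1,200,50.0,C
"""

# Pass 1: give every fund/date pair a row index and every security a column index.
row_index, col_index = {}, {}
for rec in csv.DictReader(io.StringIO(toy_csv)):
    row_index.setdefault((rec["port_no"], rec["date"]), len(row_index))
    col_index.setdefault(rec["security"], len(col_index))

# Pass 2: collect coordinates and values, then build the matrix in one go.
rows, cols, vals = [], [], []
for rec in csv.DictReader(io.StringIO(toy_csv)):
    rows.append(row_index[(rec["port_no"], rec["date"])])
    cols.append(col_index[rec["security"]])
    vals.append(float(rec["percent_tna"]))

toy_matrix = sparse.csr_matrix(
    (vals, (rows, cols)), shape=(len(row_index), len(col_index))
)
```

The two dictionaries play the role of the maps mentioned in step 6: they are the only record of which row is which fund/date and which column is which security, so they have to be saved alongside the matrix.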

TODO

Parsing through csv file could be significantly sped up using something like: https://stackoverflow.com/questions/17444679/reading-a-huge-csv-file

## Import statements

In [1]:
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)
from data.basic_functions import * 

import csv
import collections
import feather
import math

import numpy as np
import pandas as pd
from datetime import datetime
from scipy import sparse
import matplotlib.pyplot as plt

# For multiprocessing
import multiprocessing
from itertools import product

## Find all unique stocks and portfolio/dates

In [2]:
def make_port_id(a,b):
    """
    Generate a unique ID from the portno and the date of a fund/date combination
    
    Input:
    - a: port_no
    - b: date
    
    Output:
    - port_ID
    """
    return int(100_000 * a + b)
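
The ID packs both values into one integer, which only round-trips if the date is a non-negative integer below 100,000. A minimal sketch of that assumption, using hypothetical helper names (`encode_port_id`, `decode_port_id`) that are not part of this notebook:

```python
def encode_port_id(port_no, date):
    # Same arithmetic as make_port_id; assumes 0 <= date < 100_000.
    return 100_000 * port_no + date


def decode_port_id(port_id):
    # Inverse of encode_port_id under the same assumption.
    return divmod(port_id, 100_000)


# Round trip for a date inside the assumed range.
round_trip = decode_port_id(encode_port_id(1234, 21915))

# Outside the range, two different pairs collide on the same ID.
collision = encode_port_id(1, 100_000) == encode_port_id(2, 0)
```

Dates stored as days since 1960 stay below 100,000 for a long time, so the packing is safe for this kind of data, but the function itself does not check the range.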

#### TODO
Run again without break!

In [3]:
def extra_reader(reader_object):
    """
    Loops over rows of holdings csv
    Needed to generate sparse matrix
    
    Input: 
    - a reader object linking to the holdings csv
    
    Output: 
    - port_ID: Collection of all unique fund/date combinations
    - stock_ID: Collection of all unique stocks
    """
    count = 0
    
    next(reader_object)  # skip the header row
    stocks = collections.Counter()
    port_ID = collections.Counter()
    
    for row in reader_object:
        port_ID_str = make_port_id(int(float(row[0])),int(float(row[1])))
        port_ID[port_ID_str] += 1
        stocks[int(float(row[7]))] += 1
        count += 1
 #       if count == 10_000_000:
 #           break

    
    return(port_ID,stocks)  

In [4]:
path = '../data/raw/out.csv'

input_file = open(path)
reader = csv.reader(input_file, delimiter=',')

In [5]:
%%time
port_ID, stocks = extra_reader(reader)

CPU times: user 12min 47s, sys: 10.8 s, total: 12min 58s
Wall time: 13min 1s


### Test

In [6]:
sum(port_ID.values())

184578842

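# Idea for the TODO above: read the csv in chunks with pandas.
# `process` is not defined in this notebook and stands for the per-chunk work.
chunksize = 10 ** 8
for chunk in pd.read_csv(path, chunksize=chunksize):
    process(chunk)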
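
One way the chunked idea could replace the first pass is sketched below on a toy in-memory csv. The column names and the tiny chunk size are assumptions for illustration; whether this is actually faster on the real file has not been measured here.

```python
import collections
import io

import pandas as pd

# Toy stand-in for the holdings csv; column names are hypothetical.
toy_csv = io.StringIO("port_no,date,security\n1,100,10\n1,100,11\n2,100,10\n")

port_counts = collections.Counter()
stock_counts = collections.Counter()

# chunksize is the number of rows per chunk; 2 is only for the toy example.
for chunk in pd.read_csv(toy_csv, chunksize=2):
    ids = 100_000 * chunk["port_no"] + chunk["date"]  # vectorised ID per row
    port_counts.update(ids.tolist())
    stock_counts.update(chunk["security"].tolist())
```

The counters end up with the same kind of content as `port_ID` and `stocks` from `extra_reader`, but the ID arithmetic runs once per chunk instead of once per row.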

## Create stock and port_no map

In [7]:
def make_unique_dict(counter):
    """
    Used to make a dictionary linking each fund/date combination 
    and each stock to a row/col in the sparse matrix
    
    Input:
    - collections.Counter() object
    
    Output:
    - dictionary
    """
    
    unique_keys = list(counter.keys())
    unique_keys_numbers = list(np.arange(len(unique_keys)))
    counter_map = dict(zip(unique_keys, unique_keys_numbers))
    
    return(counter_map)

In [8]:
stock_map = make_unique_dict(stocks)
port_no_map = make_unique_dict(port_ID)

In [9]:
len(port_no_map)

738860

In [10]:
total_number_rows = sum(stocks.values())
print('Total number of rows in file:  {:,}'.format(total_number_rows))
print('Number of unique stocks:       {:,}'.format(len(stocks.keys())))
print('Number of unique portfolios:   {:,}'.format(len(port_no_map.keys())))

Total number of rows in file:  184,578,842
Number of unique stocks:       2,382,968
Number of unique portfolios:   738,860


## Parse through file and create data for sparse matrix

In [11]:
def gen_sparse_data(reader):
    """
    Loop over holdings csv file to collect the data for the sparse matrix
    
    Input:
    - reader: CSV holdings file
    
    Output:
    - sparse_row, sparse_col, sparse_data: three np arrays for the construction of the sparse matrix
    """
    next(reader)
    
    counter = 0
    
    sparse_row = np.zeros(total_number_rows)
    sparse_col = np.zeros(total_number_rows)
    sparse_data = np.zeros(total_number_rows)
    
    for row in reader:
        # Row
        port_ID = make_port_id(int(float(row[0])),int(float(row[1])))
        sparse_row[counter] = port_no_map[port_ID]

        # Col
        stock_num = int(float(row[7]))
        sparse_col[counter] = stock_map[stock_num]

        # Data
        try:
            sparse_data[counter] = float(row[4])
        except ValueError:
            # Missing or non-numeric percentage_tna is stored as 0
            sparse_data[counter] = 0
            
        counter += 1
    
    sparse_row = sparse_row.astype(int)
    sparse_col = sparse_col.astype(int)
    
    return(sparse_row, sparse_col, sparse_data)

In [12]:
path = '../data/raw/out.csv'
input_file = open(path)
reader = csv.reader(input_file, delimiter=',')

In [13]:
%%time
sparse_row, sparse_col, sparse_data = gen_sparse_data(reader)

CPU times: user 13min 21s, sys: 14.7 s, total: 13min 36s
Wall time: 13min 39s


## Set up sparse matrix

It is important to apply the same masks to the sparse matrix and to the row/column info tables so that everything stays correctly aligned.
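
A minimal sketch of what keeping things aligned means, on toy data with hypothetical names: the same boolean array filters both the matrix rows and the info table, so row `i` of one still describes row `i` of the other.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy matrix (3 fund/dates x 2 securities) and its matching info table.
toy_matrix = sparse.csr_matrix(np.array([[60.0, 40.0], [0.0, 0.0], [90.0, 5.0]]))
toy_info = pd.DataFrame({"port_no": [1, 2, 3], "date": [100, 100, 200]})

# One mask, computed once from the matrix ...
row_sums = np.asarray(toy_matrix.sum(axis=1)).ravel()
keep = row_sums > 0

# ... and applied to both objects.
toy_matrix = toy_matrix[keep]
toy_info = toy_info[keep].reset_index(drop=True)
```

Resetting the index is a design choice of this sketch: it makes the positional index of the table equal to the row number in the filtered matrix.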

In [14]:
data = pd.DataFrame({'row': sparse_row, 'col' : sparse_col, 'data' :sparse_data})

In [15]:
print('Shape of rawdata used to generate sparse matrix: {:,}, {:,}'.format(data.shape[0], data.shape[1]))

Shape of rawdata used to generate sparse matrix: 184,578,842, 3


### First mask: drop duplicates

#### TODO 

check how many and why they exist

In [16]:
# Keep only the first occurrence of each (row, col) pair
mask1_duplicates = ~data.duplicated(subset=['row','col'])
data_s = data[mask1_duplicates]

### Second mask: Drop extreme individual values

In [17]:
# Keep only individual holdings where percent_tna lies strictly between -50 and 150 -> mask2
mask2_individual = (data_s['data'] < 150) & (data_s['data'] > -50)
data_s = data_s[mask2_individual]

In [18]:
sparse_row = data_s['row']
sparse_col = data_s['col']
sparse_data = data_s['data']

## Create sparse matrix

In [19]:
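# Pass the shape explicitly so that it does not depend on which rows/cols survived the masks
sparse_matrix = sparse.csr_matrix((sparse_data, (sparse_row, sparse_col)),
                                  shape=(len(port_no_map), len(stock_map)))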
print('Number of fund/date combinations: {:,}'.format(sparse_matrix.shape[0]))
print('Number of securities: {:,}'.format(sparse_matrix.shape[1]))

Number of fund/date combinations: 738,860
Number of securities: 2,382,968
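
The matrix dimensions have to equal the sizes of the two maps, otherwise the later masks and info tables no longer line up. A small sketch with toy values shows why an explicit `shape` can matter: without it, `csr_matrix` infers the shape from the largest index present, which shrinks if the highest-numbered row or column has been filtered out. The sizes used here are hypothetical.

```python
from scipy import sparse

rows, cols, vals = [0, 1], [0, 1], [1.0, 2.0]
n_rows, n_cols = 3, 4  # hypothetical sizes of the fund/date and security maps

# Shape inferred from the largest row and column index in the data.
inferred = sparse.csr_matrix((vals, (rows, cols)))

# Shape fixed by the maps, independent of which entries are present.
explicit = sparse.csr_matrix((vals, (rows, cols)), shape=(n_rows, n_cols))
```

In the run shown above the inferred dimensions happen to match the map sizes, so the issue did not arise there.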


### Third mask: drop extreme portfolios

In [20]:
# Drop portfolios with total percent_tna >= 150 or <= 0
row_sums = np.array(sparse_matrix.sum(1)).flatten()
mask3_portfolios = (row_sums < 150) & (row_sums > 0)
np.sum(mask3_portfolios)

716429

In [21]:
sparse_matrix = sparse_matrix[mask3_portfolios]
print('Number of fund/date combinations after these filters: {:,}'.format(sparse_matrix.shape[0]))

Number of fund/date combinations after these filters: 716,429


## Create sparse info df with Fund_portno and date for every row of the sparse matrix

In [22]:
def split_port_id(num):
    """Invert make_port_id: return (port_no, date) for a given port_ID."""
    port_no, date = divmod(num, 100_000)
    return(port_no, date)

In [23]:
%%time
port_no = []
date = []
keys = list(port_no_map.keys())

for port_IDs in keys:
    left_temp, right_temp = split_port_id(port_IDs)
    port_no.append(left_temp)
    date.append(right_temp)
    
date = pd.to_timedelta(date, unit='D') + pd.Timestamp('1960-1-1')

CPU times: user 3.08 s, sys: 152 ms, total: 3.24 s
Wall time: 3.36 s
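
The conversion in the cell above treats the date part of the ID as a count of days since 1 January 1960, which is how SAS stores dates. A quick check of that arithmetic on two hand-picked values:

```python
import pandas as pd

# 0 is the SAS epoch itself; 21915 days later is 60 years including 15 leap days.
sas_days = [0, 21915]
dates = pd.to_timedelta(sas_days, unit="D") + pd.Timestamp("1960-01-01")
```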


In [24]:
sparse_info = pd.DataFrame(data={'port_no':port_no, 'date':date})

In [25]:
sparse_info = sparse_info[mask3_portfolios]

## Generate table identifying securities

In [33]:
stock_map = pd.DataFrame.from_dict(stock_map.items())
stock_map['crsp_company_key'] = stock_map.index

### Check if dimensions of sparse matrix and fund/date and securities info match 

In [34]:
sparse_info.shape

(716429, 2)

In [35]:
sparse_matrix.shape

(716429, 2382968)

## Save data

#### Sparse matrix containing holdings

In [36]:
path = '../data/interim/sparse_matrix'
sparse.save_npz(path, sparse_matrix)
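
To read the matrix back later, `sparse.load_npz` is the counterpart of `sparse.save_npz`. The sketch below round-trips a toy matrix through a temporary directory; the file name is hypothetical. Note that `save_npz` appends `.npz` to a string path that lacks it, so the file written above should end up as `sparse_matrix.npz`.

```python
import os
import tempfile

import numpy as np
from scipy import sparse

toy = sparse.csr_matrix(np.array([[0.0, 1.5], [2.5, 0.0]]))

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_matrix")  # no extension given
    sparse.save_npz(toy_path, toy)                  # written as toy_matrix.npz
    loaded = sparse.load_npz(toy_path + ".npz")
```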

#### Fund/date info

In [37]:
path = '../data/interim/sparse_info.feather'
feather.write_dataframe(sparse_info,path)

#### Securities info

In [38]:
path = '../data/interim/stock_map.feather'
feather.write_dataframe(stock_map,path)