<p style="text-align: center; font-size: 192%"> Computational Finance </p>
<img src="../img/ABSlogo.svg" alt="LOGO" style="display:block; margin-left: auto; margin-right: auto; width: 90%;">
<p style="text-align: center; font-size: 150%"> Week 4: Asset Pricing </p>
<p style="text-align: center; font-size: 75%"> <a href="#copyrightslide">Copyright</a> </p>

In [None]:
#silence some warnings
import warnings
warnings.filterwarnings('ignore')

# Outline

* More pandas: Hierarchical Indexing
* Merging databases

# More pandas: Hierarchical Indexing

* The MultiIndex object ([user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html)) is part of `pandas`.

* It is an index with multiple levels (row index or column headers).

* This allows you to represent higher-dimensional data in a lower-dimensional object (a Series or DataFrame).

* Flexible: tools for reshaping and aggregation.

*The example is from the Data Science Methods course by Cees Diks and Bram Wouters.*

## Creating a MultiIndex

* When creating a DataFrame, you can create a MultiIndex by using nested lists (for the `index` or `columns`).

**Example:** A DataFrame with row and column indices of MultiIndex type.

Number of ECTS obtained per year and semester (rows) by different students in different programs (columns).

In [None]:
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.array([24, 24, 6, 24, 24, 6, 30, 24, 6, 18, 12, 6, 24, 18, 6, 24, 18, 12]).reshape(6,3),
    index=[['17-18', '17-18', '18-19', '18-19', '19-20', '19-20'], ['I','II','I', 'II', 'I', 'II']],
    columns=[['Robert', 'Esther', 'Esther'],['Finance', 'Finance', 'Ectrics']])

frame.index.names = ['year', 'semester']
frame.columns.names = ['name', 'program']

frame

In [None]:
frame.index

In [None]:
type(frame.index)

* The class `MultiIndex` offers the methods `from_arrays()` and `from_tuples()` to create a MultiIndex:

In [None]:
pd.MultiIndex.from_arrays([['17-18', '17-18', '18-19', '18-19', '19-20', '19-20'], 
                           ['I','II','I', 'II', 'I', 'II']], names = ['year','semester'])

* It can be more convenient to use `from_product()` when all combinations of the elements in each level are included:

In [None]:
pd.MultiIndex.from_product([['17-18', '18-19', '19-20'], ['I','II']], names = ['year','semester'])
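
`from_tuples()` is mentioned above without an example. A minimal sketch, reusing the year and semester labels from this section, suggests that it builds the same index as `from_product()`:

```python
import pandas as pd

# One tuple per row label: (year, semester)
tuples = [('17-18', 'I'), ('17-18', 'II'),
          ('18-19', 'I'), ('18-19', 'II'),
          ('19-20', 'I'), ('19-20', 'II')]
idx_tuples = pd.MultiIndex.from_tuples(tuples, names=['year', 'semester'])

# from_product() generates the same combinations from the two level lists
idx_product = pd.MultiIndex.from_product(
    [['17-18', '18-19', '19-20'], ['I', 'II']], names=['year', 'semester'])

# Compare the two indices element by element
print(idx_tuples.equals(idx_product))
```

`from_tuples()` is the natural choice when only some combinations of the levels occur, since each label is listed explicitly.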

### Indexing and slicing
* Selection works similarly to a DataFrame without a MultiIndex.
* Select the number of ECTS that Esther obtained in the Finance program.

In [None]:
frame['Esther', 'Finance'] # or frame['Esther']['Finance']

* The method `xs()` can be used to slice rows or columns (default is rows). It takes a level argument, for easy selection at any level.

In [None]:
frame.xs(('18-19', 'I'), level=(0,1)) # xs takes level argument (and optional axis argument)
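
The selections above can be extended to columns and to a single inner level. The following sketch rebuilds the example frame so that it runs on its own; the variable names `row`, `finance` and `first_semesters` are only illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame so the snippet is self-contained
frame = pd.DataFrame(
    np.array([24, 24, 6, 24, 24, 6, 30, 24, 6,
              18, 12, 6, 24, 18, 6, 24, 18, 12]).reshape(6, 3),
    index=pd.MultiIndex.from_product(
        [['17-18', '18-19', '19-20'], ['I', 'II']], names=['year', 'semester']),
    columns=pd.MultiIndex.from_arrays(
        [['Robert', 'Esther', 'Esther'], ['Finance', 'Finance', 'Ectrics']],
        names=['name', 'program']))

# Row selection with .loc: a tuple addresses both row levels at once
row = frame.loc[('18-19', 'I'), :]

# Column selection with xs(): every student in the Finance program
finance = frame.xs('Finance', level='program', axis=1)

# Row selection on the inner level only: semester I of every year
first_semesters = frame.xs('I', level='semester')
```

By default `xs()` drops the level that was selected on, so `finance` has only 'name' as column index and `first_semesters` has only 'year' as row index.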

### Swap levels

Swapping the levels of the column MultiIndex.

In [None]:
frame

In [None]:
frame.swaplevel(axis=1)  # more general: reorder_levels([1, 0], axis=1) where you can give a permutation of the levels

* You may want to sort afterwards using `sort_index(axis=..., level=...)`, where `level` specifies the level on which to sort.
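
As a sketch of swapping followed by sorting, the snippet below rebuilds the example frame, swaps the column levels and sorts the columns on the new outer level. The names `swapped` and `swapped_sorted` are only illustrative:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame so the snippet is self-contained
frame = pd.DataFrame(
    np.array([24, 24, 6, 24, 24, 6, 30, 24, 6,
              18, 12, 6, 24, 18, 6, 24, 18, 12]).reshape(6, 3),
    index=pd.MultiIndex.from_product(
        [['17-18', '18-19', '19-20'], ['I', 'II']], names=['year', 'semester']),
    columns=pd.MultiIndex.from_arrays(
        [['Robert', 'Esther', 'Esther'], ['Finance', 'Finance', 'Ectrics']],
        names=['name', 'program']))

# Swap the two column levels: 'program' becomes the outer level
swapped = frame.swaplevel(axis=1)

# Sort the columns on 'program' so that each program forms one block
swapped_sorted = swapped.sort_index(axis=1, level='program')
print(swapped_sorted)
```

Sorting is not required for correctness here, but a sorted MultiIndex is easier to read and generally faster to slice.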

### Reshaping

Using `stack` and `unstack` ([user guide](https://pandas.pydata.org/pandas-docs/stable/user_guide/reshaping.html#reshaping-by-stacking-and-unstacking)) to turn the row index 'year' into a column index and the column index 'program' into a row index.

In [None]:
frame

In [None]:
frame.unstack(level='year').stack(level='program')  # Adds NaN if field is empty
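
The NaN behaviour is easiest to see on a small object. In this sketch the data are hypothetical: one (year, semester) combination is deliberately left out, so `unstack()` has no value to place in that cell:

```python
import pandas as pd

# Hypothetical ECTS totals; there is no record for semester II of 18-19
ects = pd.Series(
    [24, 24, 12],
    index=pd.MultiIndex.from_tuples(
        [('17-18', 'I'), ('17-18', 'II'), ('18-19', 'I')],
        names=['year', 'semester']))

# Move 'semester' from the row index to the columns
wide = ects.unstack(level='semester')

# The missing (18-19, II) combination appears as NaN
print(wide)
```

Because NaN is a floating-point value, the integer data are converted to floats in the reshaped result.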

### Aggregation

* Aggregation at a particular level using methods such as `sum`, `mean`, etc. is easy.

* Calculating the maximum of `frame` for each pair of 'year' and 'name'.

In [None]:
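# The `level` argument of max() was removed in pandas 2.0; group on the index levels instead
frame.groupby(level='year').max().T.groupby(level='name').max().T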

* In a DataFrame *without* a MultiIndex, the alternative is to apply [`groupby()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.groupby.html) to ordinary columns, which is more verbose.
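
For comparison, here is a sketch of the same aggregation on a flat ("long") DataFrame, where year and name are ordinary columns. The rows are a subset of the example data (Finance program, first two years), typed in by hand for illustration:

```python
import pandas as pd

# Long format: one row per (year, semester, name) observation
long = pd.DataFrame({
    'year':     ['17-18', '17-18', '18-19', '18-19',
                 '17-18', '17-18', '18-19', '18-19'],
    'semester': ['I', 'II', 'I', 'II', 'I', 'II', 'I', 'II'],
    'name':     ['Robert'] * 4 + ['Esther'] * 4,
    'program':  ['Finance'] * 8,
    'ects':     [24, 24, 30, 18, 24, 24, 24, 12],
})

# Maximum ECTS per (year, name), then move the names into the columns
max_ects = long.groupby(['year', 'name'])['ects'].max().unstack('name')
print(max_ects)
```

Note that `groupby()` with several keys itself returns an object with a MultiIndex, which is why `unstack()` is used to get a year-by-name table.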

## Merging databases

*This section uses parts of Tomislav Ladika's Python bootcamp from the UvA course Data Analytics (MIF).*



* Information not always stored in a single database $\rightarrow$ need to merge data sets.

* Suppose that we want to combine data from CRSP and Compustat.
    * Here we use a sample (previously downloaded).
* Download and inspect the data sets before merging.

* Are there common variables (columns) that we can use to merge?

* Let's read the Compustat data...

In [None]:
import pandas as pd
## Read the Compustat sample
compustat_data = pd.read_csv("../data/sample_data_compustat.txt", sep="\t")   # read_csv reads any delimited text file; comma is the default separator, here we use tab.

## Create a new 'date' column in Compustat, which is the 'datadate' column in YYYYMM format
compustat_data['datadate'] = compustat_data['datadate'].astype(str)  # Ensure that it is a string
compustat_data['date'] = compustat_data['datadate'].str[0:6]  
compustat_data['date'] = compustat_data['date'].astype(int) 

compustat_data.head(10)
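
An alternative to string slicing is to parse the dates first. The sketch below uses a hypothetical two-row stand-in for the 'datadate' column and assumes it holds integers in YYYYMMDD format:

```python
import pandas as pd

# Hypothetical stand-in for the Compustat 'datadate' column (YYYYMMDD)
sample = pd.DataFrame({'datadate': [20001231, 20010331]})

# Parse into timestamps; an invalid date would raise an error here
parsed = pd.to_datetime(sample['datadate'].astype(str), format='%Y%m%d')

# Rebuild the integer YYYYMM key from the year and month components
sample['date'] = parsed.dt.year * 100 + parsed.dt.month
print(sample)
```

The result is the same YYYYMM key as with slicing; the possible advantage is that malformed dates are caught at parsing time rather than silently truncated.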

* ... and the CRSP data.

In [None]:
crsp_data = pd.read_csv("../data/sample_data_crsp.txt", sep="\t")

## Similarly, modify the 'date' column in CRSP
crsp_data['date'] = crsp_data['date'].astype(str)
crsp_data['date'] = crsp_data['date'].str[0:6]  
crsp_data['date'] = crsp_data['date'].astype(int)

crsp_data.head(10)

### Difference between data sources

* Rows are matched only if the key columns contain *exactly* the same values in both DataFrames.
* Compare:
    * `'AMERICAN INTERNATIONAL GROUP'` of `'company_name'` in Compustat
    * `'AMERICAN INTERNATIONAL GROUP INC'` of `'company_name'` in CRSP
* Compare:        
    * `'datadate'` value of `'20001231'` in Compustat
    * `'date'` value of `'20001229'` in CRSP
    * We need monthly data, but CRSP uses the last trading day of each month, while Compustat uses the last calendar day.
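
A small sketch of why the exact dates matter. The two one-row frames and the `firm_id` column are hypothetical; only the two date values are taken from the comparison above:

```python
import pandas as pd

# Hypothetical one-row extracts; 'firm_id' is a made-up common identifier
compustat = pd.DataFrame({'firm_id': [1], 'datadate': [20001231]})
crsp = pd.DataFrame({'firm_id': [1], 'date': [20001229]})

# Merging on the raw daily dates finds no match: 20001231 != 20001229
raw = pd.merge(compustat, crsp,
               left_on=['firm_id', 'datadate'],
               right_on=['firm_id', 'date'], how='inner')

# Truncating both dates to YYYYMM places them in the same month
compustat['date'] = compustat['datadate'] // 100
crsp['date'] = crsp['date'] // 100
monthly = pd.merge(compustat, crsp, on=['firm_id', 'date'], how='inner')

# Number of matched rows before and after the conversion
print(len(raw), len(monthly))
```

Integer division by 100 is used here instead of string slicing; both give the YYYYMM key when the dates are stored as YYYYMMDD integers.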

### How to proceed?
1. Find some common information that is in both data frames to be merged. 
    - For example, both Compustat and CRSP contain company names (and though not shown, both also have stock tickers and CUSIP numbers)
2. Match the common values in data frame 1 to those of data frame 2
    - For example, match 'AMERICAN INTERNATIONAL GROUP' in Compustat to 'AMERICAN INTERNATIONAL GROUP INC' in CRSP
    - Unfortunately, this often must be done by hand. Smart computer algorithms or regular expressions can help.
3. Create a file that lists each of the matching values across data frames
4. A key challenge: Company names, tickers, etc. can change over time! Each data vendor has its own policy for updating these identifiers.
    - Compustat only lists the most recent company name for all dates, while CRSP lists the company name as of each date
    - Google changed its name to Alphabet Inc. in a 2015 reorganization, but Compustat lists its financials going back to 2004 under 'ALPHABET INC'.
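
As an illustration of how regular expressions can help with step 2, here is a minimal sketch that strips common legal suffixes before comparing names. The function `normalize_name` and the suffix list are assumptions made for this example, not part of any data vendor's tooling:

```python
import re

# Hypothetical list of legal suffixes to ignore; extend it for real data
SUFFIXES = r'\b(INC|CORP|CO|LTD|PLC)\b\.?'

def normalize_name(name: str) -> str:
    """Upper-case a company name, drop legal suffixes and extra whitespace."""
    cleaned = re.sub(SUFFIXES, '', name.upper())
    return re.sub(r'\s+', ' ', cleaned).strip()

compustat_name = 'AMERICAN INTERNATIONAL GROUP'
crsp_name = 'AMERICAN INTERNATIONAL GROUP INC'

# After normalization the two spellings compare as equal
print(normalize_name(compustat_name) == normalize_name(crsp_name))
```

This only handles spelling variants. Genuine name changes, such as Google to Alphabet, cannot be recovered this way and still require a link file or manual matching.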

* To highlight the merge procedure, first open a matching file (previously created).

In [None]:
compustat_crsp_linkfile = pd.read_csv("../data/sample_compustat_crsp_linkfile.txt", sep="\t")
compustat_crsp_linkfile.head(10)

* We will implement the merge in pandas using a two-step process.
* First, merge the Compustat data frame with the link file data frame, to add the 'permco' identifier to Compustat.

In [None]:
## Merge the data frames by 'gvkey' and 'date'
merged_data = pd.merge(compustat_data, compustat_crsp_linkfile, on=['gvkey', 'date'], how='inner')   

merged_data.head()

* Then merge the modified Compustat data frame (which includes the 'permco' identifier) to the CRSP data frame.
* Note that duplicate column names that are not merged on (e.g. 'company_name') are renamed by pandas, which by default appends the suffixes `_x` and `_y`.

In [None]:
## Merge with CRSP by 'permco' and 'date'
merged_data = pd.merge(merged_data, crsp_data, on=['permco', 'date'], how='inner')
merged_data.head()
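
The two-step merge can be reproduced on toy data to see what happens to the column names. All identifier values below are made up for illustration; only the column names and company names follow the text:

```python
import pandas as pd

# Toy data; the gvkey and permco values are invented
compustat = pd.DataFrame({
    'gvkey': [1001], 'date': [200012],
    'company_name': ['AMERICAN INTERNATIONAL GROUP']})
linkfile = pd.DataFrame({'gvkey': [1001], 'date': [200012], 'permco': [5001]})
crsp = pd.DataFrame({
    'permco': [5001], 'date': [200012],
    'company_name': ['AMERICAN INTERNATIONAL GROUP INC']})

# Step 1: attach 'permco' to the Compustat rows via the link file
step1 = pd.merge(compustat, linkfile, on=['gvkey', 'date'], how='inner')

# Step 2: merge with CRSP; the clashing 'company_name' columns get suffixes
step2 = pd.merge(step1, crsp, on=['permco', 'date'], how='inner')
print(step2.columns.tolist())
```

If the default suffixes are not informative enough, `pd.merge()` accepts a `suffixes` argument, for example `suffixes=('_compustat', '_crsp')`.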

### Important decisions when merging

* Which observations to keep?
    * Data merging = data cleaning.
    * Coverage and frequency may differ: 
        * Compustat contains all firms that file financial statements in the United States.
        * CRSP contains stock prices of all firms that trade on U.S. stock exchanges.
    * Depends on ultimate goal of analysis.
* What variables to merge on?
    * Study what variables uniquely define observations!
        * E.g. company name AND date, not just company.
    * Use those in the merge command.
    * Otherwise, the merge produces duplicate rows.
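
Both decisions can be checked in code. The sketch below uses toy data with made-up identifiers, in which the right-hand frame repeats one key on purpose, and shows the `indicator` and `validate` options of `pd.merge()`:

```python
import pandas as pd

# Toy data with made-up values; `right` repeats the key (1, 200012)
left = pd.DataFrame({'permco': [1, 2], 'date': [200012, 200012],
                     'assets': [10.0, 20.0]})
right = pd.DataFrame({'permco': [1, 1, 3], 'date': [200012, 200012, 200012],
                      'price': [50.0, 51.0, 70.0]})
keys = ['permco', 'date']

# Do the keys uniquely identify the observations?
has_duplicates = right.duplicated(subset=keys).any()

# indicator=True adds a '_merge' column showing where each row came from
merged = pd.merge(left, right, on=keys, how='outer', indicator=True)
print(merged['_merge'].value_counts())

# validate= raises an error instead of silently duplicating rows
try:
    pd.merge(left, right, on=keys, validate='one_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
print(has_duplicates, raised)
```

An outer merge with `indicator=True` is a convenient way to inspect which observations would be lost before settling on `how='inner'`.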

## Summary

* **MultiIndex** is an index with multiple levels. It allows you to store multidimensional data in a one- or two-dimensional object.
* **Merging data** is necessary when not all data is in the same database.
* Match identifiers from both databases using a link file.

<section id="copyrightslide">

# Copyright Statement
* Course slides were created by Simon Broda for Python 2.7; Andreas Rapp adapted them to Python 3.6.
* Week 4 slides were created by Bart Keijsers. The hierarchical indexing example is from the UvA course Data Science Methods by Cees Diks and Bram Wouters. The material on merging databases is from Tomislav Ladika's Python bootcamp for the UvA course Data Analytics (MIF).
* This work is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](https://creativecommons.org/licenses/by-sa/4.0/).
* All figures have been produced for this course using Python. Empirical results are based on public data available from [FRED](https://fred.stlouisfed.org/), [Quandl/WIKI](https://www.quandl.com/databases/WIKIP), and [Yahoo Finance](https://finance.yahoo.com/).
* More information on Simon Broda's [Github](https://github.com/s-broda/ComputationalFinance/blob/master/LICENSE.md).