# Welcome back

Welcome back! In our third session, we will do the following:

| What | How long |
|:------------------------------------------------------------------------|--------------:|
| Data frames refresh and fetch triangle data | 10 min |
| Indexing | 10 min |
| Querying a data frame | 15 min |
| Packing/unpacking | 15 min |
| Grouped operations | 20 min | 
| I/O | 5 min |
| Recap on iterators | 15 min |

# Data frames refresh

In [None]:
import pandas as pd

In [None]:
url = 'https://www.casact.org/research/reserve_data/wkcomp_pos.csv'
df_triangle = pd.read_csv(url)

df_triangle.head()

df_triangle['GRNAME']

df_triangle.columns

df_triangle.shape

By the by, we get back a tuple from `shape`.
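Because `shape` is an ordinary tuple, it can be unpacked into separate names. The sketch below uses a small made-up frame (`df_toy` and its values are assumptions for illustration, not the triangle data).

```python
import pandas as pd

# Toy frame standing in for df_triangle (values are made up).
df_toy = pd.DataFrame({'AccidentYear': [1988, 1988, 1989], 'paid': [10, 25, 12]})

# shape is a (rows, columns) tuple, so it unpacks into two names.
n_rows, n_cols = df_toy.shape
print(n_rows, n_cols)
```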

In [None]:
df_triangle.dtypes

Note that the financial amounts are integers. This is great if we're using a Poisson GLM.

In [None]:
df_triangle.describe()

## Renaming columns

In [None]:
new_names = {
  'CumPaidLoss_D': 'cumulative_paid',
  'IncurLoss_D': 'cumulative_incurred',
}
df_triangle = df_triangle.rename(columns = new_names)
df_triangle.columns

## Synthesizing new data

In [None]:
df_triangle['paid_to_incurred'] = df_triangle['cumulative_paid'] / df_triangle['cumulative_incurred']
df_triangle[['cumulative_paid', 'cumulative_incurred', 'paid_to_incurred']]

In [None]:
df_triangle['paid_to_incurred'].describe()

In [None]:
df_triangle['paid_to_incurred'] = df_triangle.cumulative_paid / df_triangle.cumulative_incurred
df_triangle['paid_to_incurred'].describe()

# Indexing

Pandas pays a lot more attention to indices than R's data frames do.

In [None]:
df_triangle.index
df_triangle = df_triangle.set_index('GRNAME')
df_triangle.index

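An index may contain multiple levels, in which case pandas calls it a `MultiIndex`.

In [None]:
# GRNAME is already the index, so it is no longer a column and this raises a KeyError.
df_triangle = df_triangle.set_index(['GRNAME', 'AccidentYear', 'DevelopmentYear'])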

In [None]:
df_triangle.head()

In [None]:
# append=True adds levels to the existing index instead of replacing it.
df_triangle = df_triangle.set_index(['AccidentYear', 'DevelopmentYear'], append = True)
df_triangle.index

Index columns disappear from the data frame unless you explicitly tell pandas not to by passing `drop=False` to `set_index()`.

In [None]:
# Raises a KeyError: both columns now live in the index, not in the column list.
df_triangle['lag'] = df_triangle['DevelopmentYear'] - df_triangle['AccidentYear'] + 1
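The point about index columns disappearing can be checked on a toy frame. This is a minimal sketch with made-up values; `df_toy`, `indexed` and `kept` are hypothetical names, and `drop=False` is the `set_index()` option that keeps the columns in place.

```python
import pandas as pd

df_toy = pd.DataFrame({
    'AccidentYear': [1988, 1988, 1989],
    'DevelopmentYear': [1988, 1989, 1989],
    'paid': [10, 25, 12],
})

# Default behaviour: the index columns leave the column list.
indexed = df_toy.set_index(['AccidentYear', 'DevelopmentYear'])
print(list(indexed.columns))

# drop=False keeps them as ordinary columns as well as index levels.
kept = df_toy.set_index(['AccidentYear', 'DevelopmentYear'], drop=False)
kept['lag'] = kept['DevelopmentYear'] - kept['AccidentYear'] + 1
print(list(kept.columns))
```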

A useful strategy is to carry out all of the non-indexed operations before creating the index.

In [None]:
df_triangle = df_triangle.reset_index()
df_triangle.index

In [None]:
df_triangle['lag'] = df_triangle['DevelopmentYear'] - df_triangle['AccidentYear'] + 1
df_triangle.lag

That's the only derived column we need for now, so we can go ahead and set an index.

In [None]:
df_triangle = df_triangle.set_index(['GRNAME', 'AccidentYear', 'lag'])

# Querying a data frame

## Columnar subsets

In [None]:
df_triangle[['cumulative_paid', 'cumulative_incurred', 'paid_to_incurred']]

Pass in a list with the names of columns to return.

In [None]:
my_cols = ['cumulative_paid', 'cumulative_incurred']
df_triangle[my_cols]

Create the list with a list comprehension.

In [None]:
cumul_cols = [col for col in df_triangle.columns if 'cumul' in col]
cumul_cols

In [None]:
df_triangle[cumul_cols]

Use `iloc[]` to select by integer position.

In [None]:
df_triangle.iloc[:, 0]
df_triangle.iloc[:, :1]
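The two `iloc` calls above look alike but return different types. A small sketch on a made-up frame (`df_toy` is a stand-in, not the triangle data) shows the difference.

```python
import pandas as pd

df_toy = pd.DataFrame({'paid': [10, 25], 'incurred': [20, 30]})

first_as_series = df_toy.iloc[:, 0]   # single integer position -> Series
first_as_frame = df_toy.iloc[:, :1]   # slice -> one-column DataFrame
print(type(first_as_series).__name__, type(first_as_frame).__name__)
```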

## Row-wise subsets

In [None]:
df_triangle[df_triangle['paid_to_incurred'] > 2]

In [None]:
df_triangle[df_triangle['paid_to_incurred'] > 2]['cumulative_paid']

In [None]:
df_triangle[df_triangle['paid_to_incurred'] > 2].cumulative_paid

In [None]:
# Raises a KeyError: several column names must be passed as a list, not a bare tuple.
df_triangle[df_triangle['paid_to_incurred'] > 2]['cumulative_paid', 'cumulative_incurred']

In [None]:
df_triangle.loc[df_triangle['paid_to_incurred'] > 2, ['cumulative_paid', 'cumulative_incurred']]

In [None]:
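# AccidentYear now lives in the index, so build the mask from that index level.
df_triangle.loc[df_triangle.index.get_level_values('AccidentYear') <= 1989]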

We could use `filter()` here, but I'm not wild about that.

In [None]:
df_triangle.query('AccidentYear <= 1989 & lag == 1')['cumulative_paid']
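`query()` accepts index level names in its expression just as it accepts column names. The sketch below, on a made-up frame, compares it with an equivalent boolean mask built from the index levels; all names and values are illustrative assumptions.

```python
import pandas as pd

df_toy = pd.DataFrame({
    'AccidentYear': [1988, 1988, 1989, 1990],
    'lag': [1, 2, 1, 1],
    'cumulative_paid': [10, 25, 12, 14],
}).set_index(['AccidentYear', 'lag'])

# Index level names can be used in the expression like column names.
first_lags = df_toy.query('AccidentYear <= 1989 and lag == 1')
print(first_lags)

# The same filter written as a boolean mask on the index levels.
years = df_toy.index.get_level_values('AccidentYear')
lags = df_toy.index.get_level_values('lag')
same = df_toy.loc[(years <= 1989) & (lags == 1)]
```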

In [None]:
df_lower = df_triangle.query('DevelopmentYear <= 1997')
df_lower.shape
df_triangle.shape

df_lower.shape[0] / 55

You may be tempted by the `filter()` method, but it selects rows or columns by their index labels, not by the values they contain.

In [None]:
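# df_triangle[1996] would look for a column labelled 1996 and raise a KeyError;
# xs() selects rows by the value of an index level instead.
df_triangle.xs(1996, level='AccidentYear').cumulative_paid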
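To see what `filter()` matches against, here is a minimal sketch on a made-up frame with string row labels (the frame, its labels and the `posted` column are assumptions for illustration).

```python
import pandas as pd

df_toy = pd.DataFrame(
    {'cumulative_paid': [10, 25], 'cumulative_incurred': [20, 30], 'posted': [5, 5]},
    index=['1996', '1997'],
)

# Column labels containing a substring.
cumul = df_toy.filter(like='cumul')

# axis=0 matches against the row labels, never the cell values.
row_1996 = df_toy.filter(items=['1996'], axis=0)
print(cumul.columns.tolist(), row_1996.shape)
```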

# Reshaping

## Pivoting

In [None]:
df_allstate = df_lower.query('GRNAME == "Allstate Ins Co Grp"')

`pivot_table(values, index, columns)`

In [None]:
df_allstate.pivot_table('cumulative_paid', 'AccidentYear', 'lag')
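The positional arguments follow the `values, index, columns` order shown above. A toy long-format frame (made-up numbers) makes the resulting shape easier to see. By default each cell holds the mean of the matching rows, so with one row per cell it is simply that row's value.

```python
import pandas as pd

df_long = pd.DataFrame({
    'AccidentYear': [1988, 1988, 1989],
    'lag': [1, 2, 1],
    'cumulative_paid': [10, 25, 12],
})

# values, index, columns: one row per accident year, one column per lag.
df_wide = df_long.pivot_table('cumulative_paid', 'AccidentYear', 'lag')
print(df_wide)
```

The cell for 1989 at lag 2 has no source row, so it comes back as `NaN`, which is what produces the triangle shape.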

In [None]:
df_wide_paid = df_allstate.pivot_table('cumulative_paid', 'AccidentYear', 'lag')
df_wide_paid.shape

In [None]:
df_wide_paid.columns

In [None]:
df_wide_paid.index

Let's also construct an incurred triangle.

In [None]:
df_wide_incurred = df_allstate.pivot_table('cumulative_incurred', 'AccidentYear', 'lag')

## Unstack

`unstack()` behaves much like `pivot_table()`, but it works from the levels of the MultiIndex rather than from named columns.

In [None]:
df_allstate.unstack()

In [None]:
df_allstate[['cumulative_paid', 'cumulative_incurred']].unstack()
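As a rough illustration of how `unstack()` picks its level, the sketch below builds a small Series with a two-level index (values are made up). With no argument the innermost level moves to the columns; naming a level moves that one instead.

```python
import pandas as pd

s_long = pd.Series(
    [10, 25, 12],
    index=pd.MultiIndex.from_tuples(
        [(1988, 1), (1988, 2), (1989, 1)], names=['AccidentYear', 'lag']
    ),
    name='cumulative_paid',
)

# unstack() moves the innermost index level ('lag') into the columns.
df_wide = s_long.unstack()
print(df_wide)

# Naming the level moves a different one.
df_by_year = s_long.unstack('AccidentYear')
```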

## Stack

In [None]:
df_wide_paid.stack()

Notice that we dropped the NA values. We can keep them if we like.

In [None]:
df_wide_paid.stack(dropna = False)

In [None]:
df_long_paid = df_wide_paid.stack()
df_long_paid

In [None]:
df_long_paid = df_wide_paid.stack().to_frame()
df_long_paid.columns = ['cumulative_paid']

In [None]:
df_long_incurred = df_wide_incurred.stack().to_frame()
df_long_incurred.columns = ['cumulative_incurred']

## Merge two data frames

In [None]:
# Raises a MergeError: the two frames share no column names to merge on.
df_new = pd.merge(df_long_paid, df_long_incurred)

In [None]:
df_new = pd.merge(df_long_paid, df_long_incurred, left_index = True, right_index = True)
df_new
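The index-on-index merge can be tried on two tiny frames that share a MultiIndex. The frames and values below are made up; `join()` is shown as a common shorthand for the same operation (it defaults to a left join, which gives the same rows here because the indexes are identical).

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1988, 1), (1988, 2), (1989, 1)], names=['AccidentYear', 'lag']
)
df_paid = pd.DataFrame({'cumulative_paid': [10, 25, 12]}, index=idx)
df_incurred = pd.DataFrame({'cumulative_incurred': [20, 30, 22]}, index=idx)

# No shared columns, so match rows on the index of both frames.
df_both = pd.merge(df_paid, df_incurred, left_index=True, right_index=True)

# join() performs the same index-on-index match.
df_joined = df_paid.join(df_incurred)
print(df_both)
```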

# Group-wise operations

In [None]:
df_allstate['prior_cumulative_paid'] = df_allstate['cumulative_paid'].shift()

In [None]:
df_allstate[['cumulative_paid', 'prior_cumulative_paid']].head(15)

We have a problem. The entry for 1989, lag 1 is not correct: an ungrouped `shift()` carries over the last value from the previous accident year. We need to group by accident year.

In [None]:
# Shift within each accident year so no value crosses from one year into the next.
df_allstate['prior_cumulative_paid'] = df_allstate['cumulative_paid'].groupby(
    level='AccidentYear'
  ).shift(1)

The warning is something we should pay attention to, but it's not anything to worry about in this case.

In [None]:
df_allstate[['cumulative_paid', 'prior_cumulative_paid']].head(15)
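The difference between a plain and a grouped shift is easiest to see on a four-row toy Series (made-up values, two accident years with two lags each).

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(1988, 1), (1988, 2), (1989, 1), (1989, 2)], names=['AccidentYear', 'lag']
)
paid = pd.Series([10, 25, 12, 30], index=idx, name='cumulative_paid')

# Plain shift: the first lag of 1989 wrongly inherits the last value of 1988.
leaky = paid.shift()

# Grouped shift: each accident year starts with a missing prior value.
grouped = paid.groupby(level='AccidentYear').shift()
print(pd.concat({'leaky': leaky, 'grouped': grouped}, axis=1))
```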

Do the same for incurred losses.

In [None]:
df_allstate['prior_cumulative_incurred'] = df_allstate['cumulative_incurred'].groupby(
    level='AccidentYear'
  ).shift(1)

## Make some link ratios

In [None]:
df_allstate['paid_ldf'] = df_allstate['cumulative_paid'] / df_allstate['prior_cumulative_paid']
df_allstate['incurred_ldf'] = df_allstate['cumulative_incurred'] / df_allstate['prior_cumulative_incurred']

In [None]:
df_allstate[['paid_ldf', 'incurred_ldf']]

In [None]:
df_allstate.pivot_table(index = 'AccidentYear', columns = 'lag', values = 'paid_ldf')

## Weighted average link ratios

In [None]:
cumul_cols = [col_name for col_name in df_allstate.columns if 'cumulative' in col_name]
df_links = df_allstate.groupby(level='lag')[cumul_cols].sum()
df_links

In [None]:
df_links['paid_ata'] = df_links.cumulative_paid / df_links.prior_cumulative_paid
df_links['incurred_ata'] = df_links.cumulative_incurred / df_links.prior_cumulative_incurred
df_links = df_links.query('lag > 1')
df_links
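Summing the cumulative and prior columns before dividing gives a volume-weighted link ratio, which generally differs from the simple average of the individual ratios. A rough sketch with two made-up accident years at a single lag:

```python
import pandas as pd

df_toy = pd.DataFrame({
    'lag': [2, 2],
    'cumulative_paid': [150.0, 330.0],
    'prior_cumulative_paid': [100.0, 300.0],
})

# Volume-weighted: ratio of the sums, (150 + 330) / (100 + 300).
sums = df_toy.groupby('lag')[['cumulative_paid', 'prior_cumulative_paid']].sum()
weighted = sums['cumulative_paid'] / sums['prior_cumulative_paid']

# Simple average of the individual link ratios, (1.5 + 1.1) / 2, for comparison.
ratios = df_toy['cumulative_paid'] / df_toy['prior_cumulative_paid']
simple = ratios.groupby(df_toy['lag']).mean()
print(weighted.loc[2], simple.loc[2])
```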

In [None]:
df_links.paid_ata.cumprod()

In [None]:
df_links.paid_ata[::-1].cumprod()

In [None]:
df_links['paid_atu'] = df_links.paid_ata[::-1].cumprod()
df_links['incurred_atu'] = df_links.incurred_ata[::-1].cumprod()
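Reversing before `cumprod()` is what turns age-to-age factors into age-to-ultimate factors: each lag's factor is the product of its own link ratio and all later ones. A minimal sketch with hypothetical factors (the numbers are made up):

```python
import pandas as pd

# Hypothetical age-to-age factors, indexed by the lag they develop to.
ata = pd.Series([2.0, 1.5, 1.1], index=pd.Index([2, 3, 4], name='lag'))

# Reverse so the product accumulates from the last lag backwards.
atu_reversed = ata[::-1].cumprod()

# The index labels travel with the values, so sorting restores lag order.
atu = atu_reversed.sort_index()
print(atu)
```

Assigning the reversed result to a column of the original frame, as in the cell above, relies on the same label alignment, so no explicit re-sorting is needed there.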

In [None]:
df_links

In [None]:
df_links = df_links[['paid_atu', 'incurred_atu']]

In [None]:
df_links = df_links.reset_index() 
df_links.lag = df_links.lag - 1
df_links = df_links.set_index('lag')
df_links

In [None]:
df_ultimate = df_allstate.query('DevelopmentYear == 1997')
df_ultimate

In [None]:
df_ultimate = pd.merge(df_ultimate, df_links, left_index = True, right_index = True)
df_ultimate['ult_paid'] = df_ultimate.cumulative_paid * df_ultimate.paid_atu
df_ultimate['ult_incurred'] = df_ultimate.cumulative_incurred * df_ultimate.incurred_atu
df_ultimate[['ult_paid', 'ult_incurred']]

# Saving our work

In [None]:
import os
os.getcwd()

In [None]:
# Assumes a data/ directory already exists under the working directory.
df_ultimate.to_csv('data/df_ultimate.csv')
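When a frame with a MultiIndex is written to CSV, the index levels become the leading columns, and they need to be named again on the way back in. The sketch below round-trips a made-up frame through an in-memory buffer rather than a file, so it does not depend on any directory existing.

```python
import io
import pandas as pd

df_toy = pd.DataFrame({
    'AccidentYear': [1988, 1989],
    'lag': [2, 1],
    'ult_paid': [27.5, 39.6],
}).set_index(['AccidentYear', 'lag'])

# Write to an in-memory buffer; index levels are written as leading columns.
buffer = io.StringIO()
df_toy.to_csv(buffer)

# Read back, naming the columns that should become the index again.
buffer.seek(0)
df_back = pd.read_csv(buffer, index_col=['AccidentYear', 'lag'])
print(df_back)
```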

# If there's time

## Recap on generator objects and iteration

# Homework

1. Repeat the construction of LDFs for every company. 
2. Which company had the most significant difference between paid and incurred ultimate estimates?
3. Which company had the largest case reserves? In which cell can you find this?