# Pandas = **PAN**el **DA**ta**S**ets


>  *pandas* provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language.

* General purpose data munger and ```numpy``` array wrapper  
* Persist to / read from a variety of data sources including Excel 
* Two core data structures: a ```Series``` for 1d data and a ```DataFrame``` for 2d data
* ```DataFrames``` are indexed by rows and columns and all operations are index-aware
* Joins/merge
* Summarize, transform
* melt, stack/unstack, pivot tables
* Excellent time series support 
* Good integration with Jupyter for viewing data 
* Graceful handling of missing values 
* Nice integration with Python string handling 
* Plotting

See [10 minute intro to pandas](http://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html).
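
As a minimal sketch of what "index-aware" means in the list above (the labels `a`, `b`, `c` and the numbers are made up for illustration): arithmetic between two `Series` matches values by label rather than by position.

```python
import pandas as pd

# two Series with overlapping but differently ordered labels (toy data)
s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0], index=['c', 'a'])

# values are matched on the index label; 'b' has no partner so it becomes NaN
total = s1 + s2
print(total)
```

The same alignment applies to `DataFrame` rows and columns, which is convenient most of the time and occasionally surprising.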

## Functions We Will Discuss

* DataFrame
* head, tail, describe
* unique, value_counts
* read_csv
* loc, slices, xs
* set_index, reset_index 
* MultiIndex 
* query 
* pivot, stack and unstack
* **concat**, append, keys 
* pivot_table (crosstab), pivot 
* **merge** (indicator) and join
* groupby (.groups, .get_group, as_index)
* sum, mean, std etc. 
* aggregate
* transform (same size as input, e.g. whiten)
* apply
* assign 
* plot

## Functions not covered but check out on your own
* map (series), applymap (dataframes) 
* from_dict
* rename 
* melt
* eval 
* str
* dt
* style

# Seaborn Plotting 

```pandas``` + ```seaborn``` $\approx$ ```tibbles``` + ```ggplot```

* **```relplot```**  = relational plots, line, scatter 
* catplot = scatter plot with categorical, box, swarm, bar, count 
* jointplot, pairplot, distplot, kdeplot
* lmplot, regplot, residplot 
* heatmap, clustermap 
* faceting, row/column plots

In [412]:
# the basics 
import numpy as np
import numpy.ma as ma
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings 
import re

# nice printing of dataframes 
from IPython.display import display, HTML

# other setup 
warnings.filterwarnings("ignore")
%matplotlib inline
# %load_ext line_profiler
np.set_printoptions(linewidth =  160)

# handy utility 
import textwrap 
def wdid(ob, ex=False):
    ''' what does object do? ex=True for more information 
    '''
    # public (non-underscore) attribute names, wrapped to 80 columns
    print('\n'.join(textwrap.wrap(' '.join([i for i in dir(ob) if i[0] != '_']), 80)))
    if ex:
        # optional pause for something more advanced: print the docstring of each lowercase attribute
        for m in [i for i in dir(ob) if i[0] >= 'a' and i[0] <= 'z']:
            print(f'\n\n{m}\n{"="*len(m)}\n')
            print(getattr(ob, m).__doc__)

In [414]:
# !pip install seaborn==0.9.0
assert sns.__version__ == '0.9.0'

# Why I love Python...

In [None]:
# parse the emailed list of attendees
attendees = '''Allard, Christopher [Westfield Insurance]	christopherallard@westfieldgrp.com	Forced Subscription
Bogaardt, John [WCF Insurance]	jbogaardt@gmail.com	Forced Subscription
Chiarolanza, Laura [Chubb]	laura.chiarolanza@chubb.com	Forced Subscription
Citarella, Christian [New Hampshire Insurance Department]	christian.citarella@ins.nh.gov	Forced Subscription
Edwards, Vincent [Casualty Actuarial Society]	vedwards@casact.org	Forced Subscription
Faber, Brian [CNA Insurance Companies]	brian.a.faber@gmail.com	Forced Subscription
Fannin, Brian [Casualty Actuarial Society]	bfannin@casact.org	Forced Subscription
Govonlu, David [Homesite Insurance]	dgovonlu@homesite.com	Forced Subscription
Granados, Marcela [EY]	magramore@gmail.com	Forced Subscription
Greenberg, Eric [Rivington Partners]	egreenberg@valeip.com	Forced Subscription
Groeschen, Steven [Demotech, Inc.]	sgroeschen@demotech.com	Forced Subscription
Jablonski, Jeffrey [Erie Insurance Group]	j2195@erieinsurance.com	Forced Subscription
Kamykowski, Theresa [American Association of Insurance Services]	theresa.kamykowski@agcs.allianz.com	Forced Subscription
Keim, Scott [Ameriprise Auto & Home Insurance]	scott.keim@ampf.com	Forced Subscription
Kelch, Michael [Pinnacle Actuarial Resources]	mikekelch21@gmail.com	Forced Subscription
Lee, Joyce [AXIS Capital]	joyce.lee@axiscapital.com	Forced Subscription
Lo, Anson [TD Insurance]	anson.lo@tdinsurance.com	Forced Subscription
Mildenhall, Stephen [St. John's University]	mildenhs@stjohns.edu	Forced Subscription
Perry, Christopher [American Modern Insurance Group]	cperry@amig.com	Forced Subscription
Picard, Mathieu [Genworth Financial]	mathieu.picard@genworth.com	Forced Subscription
Qureshi, Abdul [AXIS Capital]	abdul.qureshi@axiscapital.com	Forced Subscription
Roddy, Matthew [Federated Insurance Companies]	mrroddy@fedins.com	Forced Subscription
Sobel, Scott [Oliver Wyman Actuarial Consulting]	scott.sobel@oliverwyman.com	Forced Subscription
Woodruff, Arlene [MLMIC Services, Inc.]	awoodruff@mlmic.com	Forced Subscription
Yskes, Eric [Amerisure Companies]'''

adf = pd.DataFrame([[j.strip() for j in re.split(r'\[|\]|\t', i) if j!=''][:-1] for i in attendees.split('\n') ], 
                 columns = ['Name', 'Employer', 'Contact'])
adf

In [None]:
# do that again, only spread out
att_split =  attendees.split('\n')
print('\n'.join(att_split[:3]), '\n')

In [None]:
# split each row at [, ] or tab
row_split = re.split(r'\[|\]|\t', att_split[0])
print(row_split, '\n')

In [None]:
# get rid of the empty string and drop last element
row_split = [i for i in row_split[:-1] if len(i)>0]
print(row_split, '\n')

In [None]:
# combine into data frame---actually weirdly hard for a single row...  
temp = pd.DataFrame(row_split) 
display(temp)
pd.DataFrame({ i: j for i, j in zip(['Name', 'Employer', 'Contact'], row_split)}, index=[0])

# iterate with list comprehensions and re-orient... 
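
One way to sidestep the single-row awkwardness, sketched with a hypothetical row (the name, employer and address below are placeholders, not taken from the attendee list): wrap the row in an outer list so pandas treats it as one record with three fields.

```python
import pandas as pd

# a single parsed row (hypothetical values)
row = ['Doe, Jane', 'Example Mutual', 'jane.doe@example.com']

# a list of rows, here just one, gives one row per inner list
one_row = pd.DataFrame([row], columns=['Name', 'Employer', 'Contact'])
print(one_row)
```

This is the same orientation the list comprehension over all attendees produces: one inner list per row.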

# Cake

Read in and explore the CAS Loss Reserve Database.

In [None]:
df = pd.read_csv(r'http://www.mynl.com/RPM/masterdata.csv')

In [None]:
df.head(10)

In [None]:
# add a few obvious columns: df['new col name']; refer to existing columns in one of three ways (: = all rows)
df['LR'] = df.UltIncLoss / df['EarnedPrem']

In [None]:
df.loc[:, 'PdLR'] = df.PaidLoss / df.loc[:, 'EarnedPrem']

In [None]:
df['CaseLR'] = df.CaseIncLoss / df.iloc[:, 10]  # obviously a terrible method; df.columns.get_loc('EarnedPrem')

In [None]:
# some company names for future use
sfm = 'State Farm Mut Grp'
amg = 'American Modern Ins Grp Inc'
eix = 'Erie Ins Exchange Grp'
fmg = 'Federated Mut Grp'
wbi = 'West Bend Mut Ins Grp'

# Exercise

* Pull out the ```PaidLoss``` column
* What do you think ```df.loc[1600, :]``` means? Or ```df.loc[1600, 'EarnedPrem']```? Try them in the space below.

# Graphics

In [None]:
sns.relplot(data=df.query(' EarnedPrem > 50000'), kind='line', x='AY', y="LR", hue='Lag', 
            col='Line', col_wrap=3, palette=sns.color_palette("coolwarm", 10));

In [None]:
sns.relplot(data=df.query(' EarnedPrem > 50000'), kind='line', x='AY', y="CaseLR", hue='Lag', 
            col='Line', col_wrap=3, palette=sns.color_palette("coolwarm", 10));

In [None]:
sns.relplot(data=df.query(' EarnedPrem > 50000'), kind='line', x='AY', y="PdLR", hue='Lag', 
            col='Line', col_wrap=3, palette=sns.color_palette("coolwarm", 10));

In [None]:
# histogram of PP Auto Results
a = sns.distplot(df.query(' Lag == 10 and LR>=0 and LR<=3 and Line=="PP Auto" ')['LR'], kde=False, 
                 hist_kws=dict(edgecolor="w", linewidth=1))
a.set(xlim=[0,3], ylabel='Count', title='Histogram of Loss Ratios');  # kde=False, so bar heights are counts

In [None]:
ax = sns.relplot(data=df.query(' Lag==10 and GRName == "State Farm Mut Grp" '), kind='line', x='AY', y="LR", hue='Line');
ax.set(title=f'{sfm} Loss Ratios by AY')

# Exercise

* Create histogram of paid loss ratio
* ...for commercial auto (Comm Auto)
* ...for companies with premium > 10000 (EarnedPrem)


# Details

In [None]:
df.columns

In [None]:
# find columns with regular expression (string matching)
df.filter(regex='LR').columns

In [None]:
# drop our loss ratio columns to tidy up 
df = df.drop(df.filter(regex='LR').columns, axis=1)

In [None]:
df.tail(10)

In [None]:
df.describe()

In [None]:
df.index

In [None]:
# pull off a column in multiple ways 
df.head()['AY']

In [None]:
df.head()[['AY']]

In [None]:
df.head().AY

In [None]:
df.head().loc[:, ['AY']]

In [None]:
# raises KeyError: .loc with a single label selects rows, and there is no row labelled 'AY'
df.head().loc['AY']

In [None]:
# pull off row with index == 2
df.loc[2]

In [None]:
# rows with indexes 2..4 (inclusive)
df.loc[2:4]

In [None]:
df.AY.unique()

In [None]:
df.DY.unique()

In [None]:
df.Line.unique()

In [None]:
df.Line.value_counts()

In [None]:
len(df.GRName.unique())

In [None]:
df.GRName.unique()[:10] 

In [None]:
[i for i in df.GRName.unique() if i.find('State') >=0]

In [None]:
# access rows by row number 
df.iloc[2:4, :]

# Exercise

* How is ```df.iloc[2:4]``` different to ```df.loc[2:4]```?

In [None]:
# access rows and columns by row/column number 
df.iloc[14:4:-2, 2:4]

In [None]:
# filter by value, note ==
df[df.GRName==sfm].head()

In [None]:
# query by value with SQL like syntax
df.query( 'GRName == "State Farm Mut Grp" ').head()

In [None]:
# refer to variables with @ inside query or f strings
display(df.query( 'GRName == @sfm ').head())
df.query( f'GRName == "{sfm}" ').head()

In [None]:
# can't mix loc and iloc in one call: subset columns by name first, then rows by position
df[['Line', 'GRName', 'AY', 'DY', 'UltIncLoss', 'PaidLoss', 'CaseIncLoss', 'BulkLoss', 'EarnedPrem']].iloc[100:105, :]

# Summaries and ```groupby```

In [None]:
df.head()

In [None]:
# groupby: workhorse grouping...e.g. aggregate
df.query(' DY==1997 ').groupby('Line').sum()

In [None]:
# extract sensible columns
df.query(' DY==1997 ').groupby('Line')[['UltIncLoss', 'PaidLoss', 'CaseIncLoss', 'BulkLoss', 'EarnedPrem', 'PostedReserve97']].sum()

In [None]:
# average loss ratios
line_ay = df.query(' DY==1997 ').groupby(['Line', 'AY'])[['UltIncLoss', 'PaidLoss', 
                                                          'CaseIncLoss', 'BulkLoss', 'EarnedPrem', 'PostedReserve97']].sum()
display(line_ay.head())
line_ay['LR'] = line_ay.UltIncLoss / line_ay.EarnedPrem
display(line_ay.head())

# Exercise

* Add columns to ```line_ay``` for paid and case incurred loss ratios
* Extract just the loss ratio columns using ```[[column names]]```  to produce a report

        Line        AY     UltLR   CaseLR   PaidLR
        Comm Auto   1988   xxx     xxx      xxx
        etc.

In [None]:
# from index have stack and unstack, like tidyverse gather and spread
line_ay[['LR']].unstack(level=1)

In [None]:
line_ay[['LR']].head()

In [None]:
# columns and indexes are often interchangeable (they will be eventually) but not always...
sns.relplot(data=line_ay[['LR']], x='AY', y="LR", kind='line', hue='Line') 

In [None]:
# when indexes and columns are NOT interchangeable,  you need to convert
line_ay[['LR']].reset_index().head()

In [None]:
sns.relplot(data=line_ay[['LR']].reset_index(), x='AY', y="LR", kind='line', hue='Line');
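
The reshaping moves used above can be sketched on a toy frame (the two lines `A` and `B` and the loss ratios are made up): `unstack` moves an index level to the columns, `stack` moves it back, and `reset_index` turns index levels into ordinary columns.

```python
import pandas as pd

toy = pd.DataFrame({'Line': ['A', 'A', 'B', 'B'],
                    'AY': [1988, 1989, 1988, 1989],
                    'LR': [0.6, 0.7, 0.8, 0.9]}).set_index(['Line', 'AY'])

wide = toy['LR'].unstack(level='AY')   # AY moves from the row index to the columns
long = wide.stack()                    # and back to a Series indexed by (Line, AY)
flat = toy.reset_index()               # index levels become ordinary columns

print(wide)
print(flat)
```

Plotting functions that look variables up by column name generally want the `flat` form.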

# The ```groupby``` object

In [None]:
# what is the group by object? 
gb = df.query(' DY==1997 ').groupby(['Line', 'AY'])
gb

In [None]:
wdid(gb)

In [None]:
gb.groups

In [None]:
gb.get_group(('Comm Auto', 1990)).head()

In [None]:
df.query(' DY==1997 ').groupby(['Line', 'AY'])[['UltIncLoss', 'EarnedPrem']].apply(sum).head(10)

In [None]:
# %%timeit
g = df.query(' DY==1997 ').groupby(['Line', 'AY'])
temp = g[['UltIncLoss']].sum()
# display(temp.head(10))
temp['EarnedPrem'] = g[['EarnedPrem']].sum()
temp['LR'] = temp.UltIncLoss / temp.EarnedPrem
temp.head()

In [None]:
# %%timeit
# much slower...
temp = df.query(' DY==1997 ').groupby(['Line', 'AY'])[['UltIncLoss', 'EarnedPrem']].apply(
    lambda x : pd.Series([x.UltIncLoss.sum(), x.EarnedPrem.sum(), x.UltIncLoss.sum() / x.EarnedPrem.sum()],  index=['IL', 'EP', 'LR']))
temp.head()

In [None]:
# %%timeit
temp = df.query(' DY==1997 ').groupby(['Line', 'AY'])[['UltIncLoss', 'EarnedPrem']].apply(
    lambda x : pd.Series([x.UltIncLoss.sum(), x.EarnedPrem.sum()],  index=['IL', 'EP'])).assign(LR = lambda x : x.IL / x.EP)
temp.head()

In [None]:
# when answer is one row per group can use agg[regate] shorthand, very flexible report building
df.query(' DY==1997 ').groupby(['Line', 'AY']).agg({'UltIncLoss' : [np.max, np.min, np.mean], 'EarnedPrem': np.mean})
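
For contrast with `agg`, here is a minimal sketch of `transform` on made-up data: it returns a result the same size as the input, which makes group-wise standardisation ("whitening") a one-liner. The column name `LR_white` is hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({'Line': ['A', 'A', 'A', 'B', 'B', 'B'],
                    'LR': [0.5, 0.6, 0.7, 0.9, 1.0, 1.1]})

g = toy.groupby('Line')['LR']

# agg collapses each group to a single row
summary = g.agg(['mean', 'std'])

# transform broadcasts the group statistic back onto every input row
toy['LR_white'] = (toy.LR - g.transform('mean')) / g.transform('std')

print(summary)
print(toy)
```

Because the transformed values line up with the original rows, they can be assigned straight back as a new column.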

# Indexes

In [None]:
df.head()

In [None]:
# make new data frame with more useful index
df1 = df[['Line', 'GRName', 'AY', 'Lag', 'UltIncLoss', 'PaidLoss', 'CaseIncLoss', 'BulkLoss', 'EarnedPrem']]. \
    set_index(['GRName', 'Line', 'AY', 'Lag'])
df1.head(20)

In [None]:
# access rows by index = loc rather than integer-row number iloc 
df1.loc[sfm, ['UltIncLoss', 'EarnedPrem']].head()

In [None]:
df1.loc[[sfm]].head()

## You cannot mix ```.loc``` and ```.iloc```
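
Two common workarounds, sketched on a toy frame (column names and index labels below are made up): chain the two accessors, or translate row positions into labels and use `.loc` for both axes.

```python
import pandas as pd

toy = pd.DataFrame({'AY': [1988, 1989, 1990, 1991],
                    'PaidLoss': [10, 20, 30, 40],
                    'EarnedPrem': [100, 110, 120, 130]},
                   index=['w', 'x', 'y', 'z'])

# option 1: columns by name with loc, then rows by position with iloc
chained = toy.loc[:, ['AY', 'PaidLoss']].iloc[1:3]

# option 2: convert row positions to labels, then use loc for both axes
via_labels = toy.loc[toy.index[1:3], ['AY', 'PaidLoss']]

print(chained.equals(via_labels))
```

Chaining is fine for reading data; for assignment a single `.loc` call (option 2) is usually the safer choice because it avoids writing to an intermediate copy.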

# We're Actuaries: Let's Make Some Triangles

In [None]:
latest = df1.query('AY + Lag <= 1998')
latest.head(20)

In [None]:
# note query works seamlessly with indexes or columns 
df.query('AY + Lag <= 1998').head()
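
A small check of that claim on made-up data: the same `query` string gives the same rows whether `AY` and `Lag` are columns or index levels.

```python
import pandas as pd

toy = pd.DataFrame({'AY': [1996, 1997, 1997],
                    'Lag': [2, 1, 2],
                    'Paid': [5, 6, 7]})

# AY and Lag as ordinary columns
as_cols = toy.query('AY + Lag <= 1998')

# AY and Lag as named index levels
as_index = toy.set_index(['AY', 'Lag']).query('AY + Lag <= 1998')

print(as_cols)
print(as_index)
```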

In [None]:
latest.pivot_table(index=['Line', 'GRName', 'AY'] , columns='Lag', values='PaidLoss').head(20)

In [None]:
# paid and incurred triangles
triangles = latest.pivot_table(index=['GRName', 'Line', 'AY'] , columns='Lag', values=['PaidLoss', 'CaseIncLoss'])
triangles.head(20)

In [None]:
triangles.loc[sfm, 'PaidLoss'].head(10)

In [None]:
triangles.xs((sfm, 'Comm Auto'))

## Make development factors 

In [None]:
# start by looking at one triangle
trg = triangles.loc[sfm, 'PaidLoss'].head(10)
trg

In [None]:
# link ratios should be devel 2nd:10th col / devel 1st:9th
# don't have to worry about specifying the 10 and 9 and remember zero based arrays
# 1: will be cols 2:10, :-1 cols 1:9
trg.iloc[:, 1:] / trg.iloc[:, :-1]

In [None]:
# indexes usually helpful, but not always!
# can work with values (underlying numpy array) but then lose index
trg.iloc[:, 1:].values  / trg.iloc[:, :-1].values

In [None]:
# best of both worlds: pick up the index from cols 1:9 (the denominator)
trg.iloc[:, 1:].values   / trg.iloc[:, :-1]
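
The three divisions above behave differently because of label alignment. A sketch on a made-up 3 x 3 cumulative triangle shows the effect: dividing two frames matches columns by lag label, while `.values` on the numerator forces a position-by-position division that keeps the denominator's labels.

```python
import pandas as pd

nan = float('nan')
# cumulative paid triangle for three accident years (toy numbers)
tri = pd.DataFrame([[100.0, 150.0, 165.0],
                    [110.0, 176.0, nan],
                    [120.0, nan, nan]],
                   index=[1988, 1989, 1990], columns=[1, 2, 3])

# aligned division: lag 2 is divided by lag 2; lags 1 and 3 have no partner
aligned = tri.iloc[:, 1:] / tri.iloc[:, :-1]

# positional division: lag 2 / lag 1 and lag 3 / lag 2, labelled by the denominator lag
links = tri.iloc[:, 1:].values / tri.iloc[:, :-1]

print(aligned)
print(links)
```

In `links` the column labelled 1 holds the 1-to-2 development factors and the column labelled 2 holds the 2-to-3 factors.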

## Make **all** the development factors 

In [None]:
triangles.head()

In [None]:
# note ['PaidLoss'] to retain the second index level 
triangles.loc[:, 'PaidLoss'].iloc[:, 1:].values / triangles.loc[:, ['PaidLoss']].iloc[:, :-1]

## Stitch together

Same approach used for the loss ratio report.

In [None]:
# pd.concat( list ) combines dataframes, axis=0 stacks vertically and axis=1 horizontally
t2 = pd.concat((triangles, 
                triangles.loc[:, 'CaseIncLoss'].iloc[:, 1:].values / triangles.loc[:, ['CaseIncLoss']].iloc[:, :-1],
                triangles.loc[:, 'PaidLoss'].iloc[:, 1:].values / triangles.loc[:, ['PaidLoss']].iloc[:, :-1]), axis=1)

# need to make a suitable index...which is a tad tricky...go by hand
t2.columns = pd.MultiIndex.from_tuples([(t, l) for t in ['CaseIncLoss', 'PaidLoss'] for l in range(1,11) ] + 
                         [(t, l) for t in ['CaseIncLink', 'PaidLink'] for l in range(1,10)] )
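
As an alternative sketch for building the same column index, `pd.MultiIndex.from_product` can generate each block of labels and the two blocks can then be joined. The variable names below are illustrative only.

```python
import pandas as pd

# the by-hand version: 2 x 10 loss columns followed by 2 x 9 link columns
by_hand = pd.MultiIndex.from_tuples(
    [(t, l) for t in ['CaseIncLoss', 'PaidLoss'] for l in range(1, 11)] +
    [(t, l) for t in ['CaseIncLink', 'PaidLink'] for l in range(1, 10)])

# the same labels from cartesian products, joined end to end
loss_cols = pd.MultiIndex.from_product([['CaseIncLoss', 'PaidLoss'], range(1, 11)])
link_cols = pd.MultiIndex.from_product([['CaseIncLink', 'PaidLink'], range(1, 10)])
by_product = loss_cols.append(link_cols)

print(list(by_hand) == list(by_product))
```

Either way, the order of the labels has to match the order in which the pieces were passed to `pd.concat`.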

In [None]:
t2.loc[sfm, :].filter(regex='Paid').head(10)

In [None]:
t2.loc[sfm, :].filter(regex='Case').head(10)

In [None]:
# to get rid of the empty triangles...count the nas...should be 180 in complete triangles and more in incomplete ones 
t2.groupby(level=['GRName', 'Line']).apply(lambda x : x.isna().sum().sum()).head(10)

In [None]:
# filter out to just the complete triangles with exactly 180 missing entries
complete = t2.groupby(level=['GRName', 'Line']).filter(lambda x : x.isna().sum().sum()==180)
complete.head(20)
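
The count-then-filter pattern can be sketched on a toy frame (group names `X`, `Y` and the values are made up): `apply` returns one number per group, while `filter` returns the original rows of every group that passes the test.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'GRName': ['X', 'X', 'Y', 'Y'],
                    'Paid': [1.0, np.nan, 2.0, 3.0]}).set_index('GRName')

# one missing-value count per group
na_count = toy.groupby(level='GRName').apply(lambda x: x.isna().sum().sum())

# keep only the rows of groups with no missing values
kept = toy.groupby(level='GRName').filter(lambda x: x.isna().sum().sum() == 0)

print(na_count)
print(kept)
```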

## What about **average** link ratios?

In [None]:
def maskex(n, n_ays, kind, tiles):
    """ 
    mask for avg last n years in a n_ays x n_ays triangle
    """
    # size of link ratio triangle is one smaller than number of ays
    nyrs = n_ays - 1
    if kind=='loss_den':
        ans = np.array([[1 if i + j < nyrs and i + j >= nyrs - n else 0 for i in range(n_ays)] for j in range(n_ays)])
    elif kind=='loss_num':
        ans = np.array([[1 if i > 0 and i + j < n_ays and i + j >= n_ays - n else 0 for i in range(n_ays)] for j in range(n_ays)])
    else:
        ans = np.array([[1 if i + j < nyrs and i + j >= nyrs - n else 0 for i in range(nyrs)] for j in range(n_ays)])
    return np.tile(ans, (tiles, 2))

def mask_count(n, size):
    '''
    size of mask for averaging link ratios 
    '''
    n = min(n, size-1)
    return np.tile(np.array([n]*(size-n-1) + list(range(n,0,-1))), 2)

In [None]:
# to compute three year average link ratio want these LDFs
maskex(3, 10, 'link', 1)
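
To see how the mask and the count fit together, here is a sketch for a single small triangle. `link_mask` and `link_count` are hypothetical, untiled copies of the `'link'` branch of `maskex` and of `mask_count`; rows are accident years and columns are development periods.

```python
import numpy as np

def link_mask(n, n_ays):
    """1 where a link ratio lies in the last n diagonals, else 0 (single triangle)."""
    nyrs = n_ays - 1
    return np.array([[1 if nyrs - n <= i + j < nyrs else 0 for i in range(nyrs)]
                     for j in range(n_ays)])

def link_count(n, n_ays):
    """number of link ratios selected in each development column (single triangle)."""
    n = min(n, n_ays - 1)
    return np.array([n] * (n_ays - n - 1) + list(range(n, 0, -1)))

m = link_mask(2, 4)
print(m)
# the column sums of the mask are exactly the divisors used for the straight average
print(m.sum(axis=0), link_count(2, 4))
```

Multiplying a link-ratio triangle by the mask and summing by column gives the numerator of a straight average; dividing by the count gives the average itself.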

In [None]:
# compute straight and weighted average last 3, 5, all years, for paid and incurred loss = 2 x 2 x 3 = 12 methods:
report = pd.concat([(complete.filter(regex='Loss', axis=1) * maskex(i, 10, 'loss_num', 400)).iloc[:, np.r_[1:10, 11:20]].groupby(level=[0,1]).sum().values / \
           (complete.filter(regex='Loss', axis=1) * maskex(i, 10, 'loss_den', 400)).iloc[:, np.r_[0:9, 10:19]].groupby(level=[0,1]).sum() for i in [3, 5, 10]]+
           [(complete.filter(regex='Link', axis=1) * maskex(i, 10, 'link', 400)).groupby(level=[0,1]).sum() / mask_count(i, 10) for i in [3, 5, 10]],
                    axis=1,
                 keys=[(wt, i) for wt in ['Wtd', 'Str'] for i in [3, 5, 10]] ) 
report.columns.names= ['Method', 'NYrs', 'LossType', 'DY']
report = report.stack(level=(0,1,2))

In [None]:
report.loc[sfm].head(12).sort_index(level=[0,1,3,2])

In [None]:
# find large companies within each line 
n_large = 20  # number of cos by line 
large_cos = df.query(' DY == 1998 ').groupby('Line').apply(
    lambda x : x.groupby('GRName')[['EarnedPrem']].sum().sort_values(by='EarnedPrem', ascending=False).head(n_large))

In [None]:
large_cos

In [None]:
# number of lines for each company 
large_cos.reset_index().GRName.value_counts()

In [None]:
# extract ldfs for large companies: merge report with large_cos
# need to reshape report to have index GRName and Line 
report.reset_index(level=[2,3,4], drop=False).head()

In [None]:
# extract just the large cos with an inner merge (keeps only GRName/Line pairs present in both frames)
ldfs = report.reset_index(drop=False).merge(large_cos.reset_index(drop=False), 
                                            how='inner', on=['GRName', 'Line']).reset_index(drop=True)
ldfs.LossType = ldfs.LossType.str[:-4]

In [None]:
ldfs.head()
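
The function list mentions `merge` with `indicator`; a sketch on two made-up frames shows how it helps audit a join. An inner merge keeps only keys found on both sides, while an outer merge with `indicator=True` adds a `_merge` column saying where each row came from.

```python
import pandas as pd

# toy stand-ins for the factor report and the large-company list
factors = pd.DataFrame({'GRName': ['X', 'Y', 'Z'], 'Line': ['A', 'A', 'A'],
                        'ldf': [1.5, 1.4, 1.6]})
large = pd.DataFrame({'GRName': ['X', 'Z', 'W'], 'Line': ['A', 'A', 'A'],
                      'EarnedPrem': [900, 800, 700]})

# only companies present in both frames survive
inner = factors.merge(large, how='inner', on=['GRName', 'Line'])

# keep everything and record the source of each row
audit = factors.merge(large, how='outer', on=['GRName', 'Line'], indicator=True)

print(inner)
print(audit)
```

Rows flagged `left_only` or `right_only` are the ones an inner merge silently drops.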

# More Graphics

In [None]:
# tidy data-like melt function for general re-shaping 
p = pd.melt(ldfs, id_vars=('GRName', 'Line', "Method", 'NYrs', 'LossType', 'EarnedPrem'), value_name='ldf', var_name='Age'). \
    sort_values(['GRName', 'Line', "Method", 'NYrs', 'LossType', 'Age'])
p['logldf'] = np.log(p.ldf-1)
p.head(13)

In [None]:
a = sns.relplot(data=p, kind='line', x='Age', y='ldf', hue='LossType', col='Line', col_wrap=3);
for ax in a.axes:
    ax.set(ylim=[0.5,3.5])

In [None]:
a = sns.relplot(data=p, kind='line', x='Age', y='logldf', style='NYrs', hue='LossType', col='Line', col_wrap=3);

In [None]:
a = sns.relplot(data=p, kind='line', x='Age', y='logldf', style='Method', hue='LossType', col='Line', col_wrap=3);

In [None]:
# compare paid and incurred methods by company 
sns.set(style='whitegrid')
a = sns.relplot(data=p, kind='line', x='Age', y='logldf', hue='GRName', row='Line', col='LossType', legend=None);

# If there is time...let's do some simulation


In [None]:
def pd_inc_plot(df, co_name='', line_name='', bins=201, dd=True, ax=None, legend=False):
    '''
    bootstrap from paid and incurred and create product distribution 
    input is result of running
    
        links = comp.groupby(level=['GRName', 'Line']).apply(make_links)
        links.index.names = ['GRName', 'Line', 'Kind', 'Method']
    
    i.e. df has index GRName, Line, AY and col groups for Paid, CaseInc loss and links  and lag 
    '''

    def shorten(s):
        '''
        name shortening function for labels 
        '''
        if len(s) < 12:
            return s
        else:
            s = re.sub(' (Co|Ins|Grp|Exchange|Of|Inc|of)', '', s)
            s = s.replace('Agricultural', 'Ag').replace('Exchange', 'Ex'). replace('Associated', 'Assoc')
        if len(s) > 12:
            s = ' '.join([i[:4] for i in s.split(' ')][:3])
        return s
    # allows use with groupby
    if co_name == '':
        co_name, line_name, _ = df.index[0]
   
    yrs = list(df.index.get_level_values('AY').unique())
    nyrs = yrs[-1] - yrs[0]
    
    # piece of interest
    bit = df.xs((co_name, line_name), level=('GRName', 'Line'))
    
    if len(bit) < 10:
        return
    
    # make kronecker products for i (kpi) and paid (kpp)
    # pull off most recent year losses 
    kpi = np.array(bit.loc[yrs[-1], ('CaseIncLoss', 1)])
    kpp = np.array(bit.loc[yrs[-1], ('PaidLoss', 1)])
    
    # and complete with link ratios 
    for i in range(0, nyrs):
        kpp = np.kron(kpp, bit.loc[yrs[0]:yrs[0]+i, ('PaidLink', nyrs - i)])
        kpi = np.kron(kpi, bit.loc[yrs[0]:yrs[0]+i, ('CaseIncLink', nyrs - i)])

    ult = pd.DataFrame( {'inc' : kpi, 'pd' : kpp})
    # stats 
    d = ult.describe().iloc[1:, :]
    if dd:
        display(d)
    
    if ax is None:
        f = plt.figure()
        a = f.gca()
    else:
        a = next(ax)
    
    bp = np.linspace(d.loc['min', :].min(), d.loc['max', :].max(), bins)
    mnn = d.loc['mean', :].min()
    mnx = d.loc['mean', :].max()
    sd = d.loc['std', : ].max()
    bp = np.linspace(max(0, mnn - 4*sd), mnx + 4*sd, bins)
    npd,  _, _ = a.hist(kpp, bins=bp, color='b', alpha=0.5, label='Paid')
    ninc, _, _ = a.hist(kpi, bins=bp, color='r', alpha=0.5, label='Incurred')
    bay = ninc*npd / sum(ninc*npd) * sum(npd)
    xs = (bp[1:]+bp[0:-1])/2
    a.plot(xs, bay, '-g', label='Posterior')
    if legend:
        a.legend(frameon=False)
    a.set(title='{:}/{:}\nMLE={:,.1f}, CV(I/Pd)={:.3f}/{:.3f}'.format(shorten(co_name), line_name, xs[bay.argmax()]/1e3, 
                                                                *(d.loc['std']/d.loc['mean']) ))
    return ult
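
The Kronecker-product step in `pd_inc_plot` can be sketched on toy numbers (the loss amount and link ratios below are made up): each call to `np.kron` multiplies every ultimate so far by every observed link ratio for the next development period, so the result enumerates all combinations.

```python
import numpy as np

latest = np.array([100.0])        # latest diagonal loss (toy value)
links_12 = np.array([1.5, 2.0])   # observed 1-to-2 link ratios (toy values)
links_23 = np.array([1.1, 1.2])   # observed 2-to-3 link ratios (toy values)

# every combination of one 1-to-2 factor and one 2-to-3 factor applied to the loss
ults = np.kron(np.kron(latest, links_12), links_23)
print(ults)
```

The length of the result is the product of the number of link ratios at each age, which is why the full calculation grows quickly with triangle size.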

In [None]:
def plot_all(df, line='', co='', threshold=250000):
    '''
    all lines for given co or all cos for given line 
    '''
    if line=='' and co=='':
        return 
    
    if line != '':
        bit = df.query(f' Line=="{line}" ')        
        ncos = len(bit) / 10 
        nr = int(ncos/6)
        if nr < ncos/6: nr += 1
        f, ax = plt.subplots(nr, 6, figsize=(18, 2.4*nr))
        ax = iter(ax.flatten())
        
    elif co != '':
        bit = df.query(f' GRName=="{co}" ')
        f, ax = plt.subplots(2, 3, figsize=(12,6))
        ax = iter(ax.flatten())
    
    g = bit.groupby(['GRName', 'Line'])

    l = True
    for k, v in g.groups.items():
        grp = bit.loc[v]
        if grp.CaseIncLoss.sum().sum() > threshold:
            ult = pd_inc_plot(grp, dd=False, ax=ax, legend=l)
            l = False
        
    # tidy up 
    for a in ax:
        f.delaxes(a)
    plt.tight_layout()

In [None]:
for co in [ wbi, sfm]:
    plot_all(complete, co=co);

In [None]:
plot_all(complete, line='Comm Auto', threshold=100000)

# Set up for John's Session

# Exercise: figure out what this is doing...

In [None]:
# Read in the CAS data
data_url = 'https://www.casact.org/research/reserve_data'
lobs = ['medmal','ppauto','wkcomp', 'othliab', 'comauto', 'prodliab']
data = []
columns = ['GRCODE','GRNAME','AccidentYear','DevelopmentYear','DevelopmentLag'
           ,'IncurLoss', 'CumPaidLoss','BulkLoss','EarnedPremDIR'
           ,'EarnedPremCeded','EarnedPremNet', 'Single','PostedReserve97']
for lob in lobs:
    file_url = f'{data_url}/{lob}_pos.csv'
    subset = pd.read_csv(file_url, names=columns, skiprows=1)
    subset['LOB'] = lob
    data.append(subset)
data1 = pd.concat(data)
data = data1.query(" DevelopmentYear <= 1997 ").reset_index(drop=True)

In [None]:
def make_triangles(data, nlarge=20):
    '''
    make ldf triangles from CAS data for largest companies
    '''
    aggregates2 = data.query(' DevelopmentYear ==  1997 ').groupby(['LOB','GRNAME'])['IncurLoss'].sum() 
    top_by_lob = aggregates2.groupby(level='LOB').apply(lambda x : x.nlargest(nlarge).reset_index(level=0, drop=True))
    
    data_alt2 = data.merge(top_by_lob.to_frame(), how='left', left_on=['LOB','GRNAME'], right_index=True)
    data_alt2.loc[data_alt2.loc[:,'IncurLoss_y'].isna(), 'GRNAME'] = 'Other'
    
    # create triangles 
    triangles = pd.pivot_table(data_alt2, index=['GRNAME','LOB','AccidentYear'],
                           columns='DevelopmentLag', values='CumPaidLoss')
    
    # Determine LDF Weights 
    w = pd.DataFrame(np.array([[1 if i+j<9 else 0 for i in range(9)] for j in range(10)]))
    weight = np.tile(w, (int(triangles.shape[0]/10), 1))
    columns = [f'{triangles.columns[num]}-{triangles.columns[num+1]}'
               for num, item in enumerate(triangles.columns[:-1])]

    # Volume-weighted numerator and denominator; mask for denom only; .values on num so the result keeps the denominator's index
    ldf = (triangles.iloc[:,1:].groupby(level=['GRNAME','LOB']).sum().values / \
           (weight*triangles.iloc[:,:-1]).groupby(level=['GRNAME','LOB']).sum()).fillna(1.0) 
    
    return ldf

In [None]:
john = make_triangles(data, 20)
john.to_csv('trg-for-john.csv')

In [None]:
john.head()

## Should be the same as our ```ldfs``` dataframe

In [None]:
l1 = ldfs.query(' Method=="Wtd" and NYrs==10 and LossType=="Paid" ').drop(['Method', 'NYrs', 'LossType', 'EarnedPrem'], axis=1).\
    set_index(['GRName', 'Line'])
l1.head()