## Working with Triangles

### Getting Started
Welcome! We drafted these tutorials to help you get familiar with some of the common functionalities that most actuaries can use in their day-to-day responsibilities. The package also comes with pre-installed datasets that you can play with, which are also used in the tutorials here.

The tutorials assume that you have a basic understanding of commonly used actuarial terms and can independently perform an actuarial analysis in another tool, such as Microsoft Excel or other actuarial software. They also assume some familiarity with Python and basic experience with common packages that are popular in the Python community, such as `pandas` and `numpy`.

This tutorial is linted using [black](https://github.com/psf/black) via [nb_black](https://github.com/dnanhkhoa/nb_black). This step is optional.

In [1]:
%load_ext lab_black

All tutorials and exercises rely on chainladder v0.8.5 or later. It is highly recommended that you keep your packages up to date.

In [2]:
import pandas as pd
import numpy as np
import chainladder as cl
import os

print("pandas: " + pd.__version__)
print("numpy: " + np.__version__)
print("chainladder: " + cl.__version__)

pandas: 1.3.1
numpy: 1.21.1
chainladder: 0.8.5


Since we will be plotting quite a bit, we use an IPython magic function that sets the matplotlib backend to 'inline'. With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots are then also stored in the notebook document.

In [3]:
%matplotlib inline

### Working with a Triangle
Let's begin by looking at unprocessed triangle data and loading it into a `pandas.DataFrame`. We'll use the `raa` dataset, which is available from the repository.

In [4]:
raa_df = pd.read_csv(
    "https://raw.githubusercontent.com/casact/chainladder-python/master/chainladder/utils/data/raa.csv"
)
raa_df.head(20)

Unnamed: 0,development,origin,values
0,1981,1981,5012.0
1,1982,1982,106.0
2,1983,1983,3410.0
3,1984,1984,5655.0
4,1985,1985,1092.0
5,1986,1986,1513.0
6,1987,1987,557.0
7,1988,1988,1351.0
8,1989,1989,3133.0
9,1990,1990,2063.0


The data has three columns: 
* development: or valuation time, in this case, valuation year
* origin: or accident date, in this case, accident year
* values: the values recorded for the specific accident date at the specific valuation time (such as incurred losses, paid losses, or claim counts)

A loss triangle is a table of loss experience showing total losses for a certain period (origin) at various, regular valuation dates (development), reflecting the change in amounts as claims mature and emerge. Each older period in the table has one more entry than the next youngest period, which gives the data its triangle shape. The same layout works for any other measure that matures over time from an origin date. Loss triangles can be used to determine loss development for a given risk.
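
To make the triangle shape concrete, here is a minimal pandas-only sketch. The numbers and the `long_df`, `age`, and `toy_triangle` names are made up for illustration and are not part of the `raa` data or the `chainladder` API.

```python
import pandas as pd

# Hypothetical long-format data: one row per (origin year, valuation year).
long_df = pd.DataFrame(
    {
        "origin": [2019, 2019, 2019, 2020, 2020, 2021],
        "development": [2019, 2020, 2021, 2020, 2021, 2021],
        "values": [100.0, 150.0, 165.0, 110.0, 170.0, 120.0],
    }
)

# Development age in years: 1 at the first valuation, 2 at the next, and so on.
long_df["age"] = long_df["development"] - long_df["origin"] + 1

# Rows are origin periods, columns are ages; cells not yet observed become NaN.
toy_triangle = long_df.pivot(index="origin", columns="age", values="values")
print(toy_triangle)
```

The oldest origin year has three entries, the next has two, and the youngest has one, which is what produces the triangular layout.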

Let's put our data into the triangle format.

In [None]:
raa = cl.Triangle(raa_df, origin="origin", development="development", columns="values")
raa

You can also load the example data directly, using `load_sample`:

In [None]:
raa = cl.load_sample("raa")
raa

A triangle has more properties than just what is displayed. For example we can see the underlying `link_ratio`s, which represent the multiplicative change in amounts from one development period to the next.

In [None]:
raa.link_ratio
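
As a rough illustration of what a link ratio measures, the sketch below computes age-to-age ratios on a small made-up cumulative triangle with plain `numpy`. It only shows the arithmetic. The array and variable names are hypothetical, and the layout of the result may differ from what `link_ratio` displays.

```python
import numpy as np

# Hypothetical cumulative triangle: rows are origin periods, columns are ages.
cum = np.array(
    [
        [100.0, 150.0, 165.0],
        [110.0, 170.0, np.nan],
        [120.0, np.nan, np.nan],
    ]
)

# Each link ratio is the amount at one age divided by the amount at the prior age.
link_ratios = cum[:, 1:] / cum[:, :-1]
print(link_ratios)
```

Cells that depend on an unobserved amount stay `NaN`, so the youngest origin period contributes no ratios at all.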

We can also view (and manipulate) the `latest_diagonal` of the triangle.

In [None]:
raa.latest_diagonal

In [None]:
raa.latest_diagonal / 1000

The latest diagonal also corresponds to a `valuation_date`. Note that `valuation_date` is a datetime at the terminal timestamp of the period (i.e. the last split second of the year).

In [None]:
raa.valuation_date
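
To see what "terminal timestamp of the period" means, the following pandas sketch looks at the end of a yearly period. The year 1990 is only an illustrative choice, and this is an analogy in pandas rather than a description of how `chainladder` computes the value internally.

```python
import pandas as pd

# A yearly period and the last instant that still belongs to it.
period = pd.Period("1990", freq="Y")
end = period.end_time
print(end)
```

The result falls on December 31 at 23:59:59 plus a fraction of a second, not on midnight of January 1 of the following year.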

We can also tell whether our triangle:
* `is_cumulative`: returns True if the data across the development periods is cumulative, or False if it is incremental.
* `is_ultimate`: returns True if the ultimate values are contained in the triangle.
* `is_val_tri`: returns True if the development period is stated as a valuation date as opposed to an age, i.e. Schedule P style triangle (True) or the more commonly used triangle by development age (False).
* `is_full`: returns True if the triangle has been "squared".

In [None]:
print("Is triangle cumulative?", raa.is_cumulative)
print("Does triangle contain ultimate projections?", raa.is_ultimate)
print("Is this a valuation triangle?", raa.is_val_tri)
print('Has the triangle been "squared"?', raa.is_full)

We can also inspect the triangle to understand its data granularity with `origin_grain` and `development_grain`.

In [None]:
print("Origin grain:", raa.origin_grain)
print("Development grain:", raa.development_grain)

The package supports monthly ("M"), quarterly ("Q") and yearly ("Y") grains for both `origin_grain` and `development_grain`.

### The Triangle Structure
The triangle described so far is a two-dimensional (accident date by valuation date) structure that spans multiple cells of data. This is a useful structure for exploring individual triangles, but becomes more problematic when working with **sets** of triangles. `Pandas` does not have a triangle `dtype`, but if it did, working with sets of triangles would be much more convenient. To facilitate working with more than one triangle at a time the `chainladder.Triangle` acts like a pandas dataframe (with an index and columns) where each cell (row x col) is an individual triangle.  This structure manifests itself as a four-dimensional space. Let's take a look at another dataset `clrd`.

In [None]:
clrd_df = pd.read_csv(
    "https://raw.githubusercontent.com/casact/chainladder-python/master/chainladder/utils/data/clrd.csv"
)
clrd_df

Let's load the data into the sets of triangles.

In [None]:
clrd = cl.Triangle(
    clrd_df,
    origin="AccidentYear",
    development="DevelopmentYear",
    columns=[
        "IncurLoss",
        "CumPaidLoss",
        "BulkLoss",
        "EarnedPremDIR",
        "EarnedPremCeded",
        "EarnedPremNet",
    ],
    index=["GRNAME", "LOB"],
)
clrd

Since 4D structures do not fit nicely on 2D screens, we see a summary view instead that describes the structure rather than the underlying data itself.

We see 5 rows of information:
* Valuation: the valuation date.
* Grain: the granularity of the data, O stands for origin, and D stands for development, `OYDY` represents triangles with accident year by development year.
* Shape: contains 4 numbers, represents the 4-D structure. This sample triangle represents a collection of 775x6 or 4,650 triangles that are themselves 10 accident years by 10 development periods.
    * 775: the number of segments, which is the combination of `index`, that represents the data segments. In this case, it is each of the `GRNAME` and `LOB` unique combination.
    * 6: the number of triangles for each segment, which is also the columns `[IncurLoss, CumPaidLoss, BulkLoss, EarnedPremDIR, EarnedPremCeded, EarnedPremNet]`. They could be paid amounts, incurred amounts, reported counts, loss ratios, closure rates, excess losses, premium, etc.
    * 10: the number of accident periods.
    * 10: the number of development periods.
* Index: the segmentation level of the triangles.
* Columns: the value types recorded in the triangles.
    
To summarize the set of triangles:
* We have a total of 775 segments, which are at the `GRNAME` and `LOB` level.
* Each segment contains 6 triangles, which are `IncurLoss`, `CumPaidLoss`, `BulkLoss`, `EarnedPremDIR`, `EarnedPremCeded`, `EarnedPremNet`.
* Each triangle is 10 accident years x 10 development periods.

Using `index.head()` allows us to see the first 5 segments in the set of triangles. Note that as the data is loaded, the triangles are sorted by `index`, in this case by `GRNAME` first, then by `LOB`.

In [None]:
clrd.index.head()

Under the hood, the data structure is a `numpy.ndarray` with the equivalent shape.  Like pandas, you can directly access the underlying numpy structure with the `values` property. By exposing the underlying `ndarray` you are free to manipulate the underlying data directly with numpy should that be an easier route to solving a problem.

In [None]:
print("Data structure of clrd:", type(clrd.values))
print("Sum of all data values:", np.nansum(clrd.values))

Keep in mind though, the `chainladder.Triangle` has several methods and properties beyond the raw numpy representation and these are kept in sync by using the `chainladder.Triangle` directly.
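
The four-axis layout can be pictured with a small stand-alone `numpy` array. The sketch below uses a toy array with made-up dimensions as a stand-in for `Triangle.values`. The sizes and variable names are assumptions for illustration only.

```python
import numpy as np

# Toy stand-in for a 4D triangle array:
# 2 segments x 3 columns x 4 origin periods x 4 development periods.
values = np.arange(2 * 3 * 4 * 4, dtype=float).reshape(2, 3, 4, 4)

n_segments, n_columns, n_origin, n_development = values.shape

# Each (segment, column) pair holds one 2D triangle.
n_triangles = n_segments * n_columns

# Fixing one segment and one column leaves a single origin x development table.
one_triangle = values[0, 1]
print(n_triangles, one_triangle.shape)
```

With the `clrd` shape described above, the same multiplication gives 775 x 6 = 4,650 triangles.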

### Pandas-Style Slicing
As mentioned, the 4D structure is intended to behave like a pandas `DataFrame`.  Like pandas, we can subset a dataframe by referencing individual columns by name.

In [None]:
clrd[["CumPaidLoss", "IncurLoss", "BulkLoss"]]

We can also boolean-index the rows of the Triangle.

In [None]:
clrd[clrd["LOB"] == "wkcomp"]

We can even use the typical `loc`, `iloc` functionality similar to `pandas` to access subsets of data.  These features can be chained together as much as you want.

In [None]:
clrd.loc["Allstate Ins Co Grp"].iloc[-1]["CumPaidLoss"]

### Pandas-Style Arithmetic
With complete flexibility in the ability to slice subsets of triangles, we can use basic arithmetic to derive new triangles, which is commonly used as diagnostics to explore trends.

In [None]:
clrd["CaseIncurLoss"] = clrd["IncurLoss"] - clrd["BulkLoss"]
clrd["PaidToInc"] = clrd["CumPaidLoss"] / clrd["CaseIncurLoss"]
clrd[["CaseIncurLoss", "PaidToInc"]]

We can also aggregate the values across all triangles into one triangle.

In [None]:
clrd["CumPaidLoss"].sum()

We can also construct a paid loss ratio triangle against `EarnedPremNet`.

In [None]:
clrd["CumPaidLoss"].sum() / clrd["EarnedPremNet"].sum()

Aggregating all segments together is interesting, but it is often more useful to aggregate across segments using `groupby`.  For example, we may want to group the triangles by line of business and get a sum across all companies for each industry.

In [None]:
clrd.groupby("LOB").sum()

The shape is (**6**, 8, 10, 10) because now we have 6 LOBs with 8 triangles for each LOB. We can also note that `index` is now on `[LOB]` only.

In [None]:
np.unique(clrd["LOB"])

The aggregation functions, e.g. `sum`, `mean`, `std`, `min`, `max`, etc., don't have to apply only to the `index` axis. You can apply them to any of the four axes in the triangle object: `index` (axis `0`, the segments), `columns` (axis `1`, the various financial fields), `origin` (axis `2`, across triangle rows), and `development` (axis `3`, across triangle columns). You can refer to an axis by either its name or its number. Let's sum across all of the segments (`[GRNAME, LOB]`) and columns (`[IncurLoss, CumPaidLoss, BulkLoss, EarnedPremDIR, EarnedPremCeded, EarnedPremNet]`), first using axis names and then using axis numbers.

In [None]:
clrd.sum(axis="index").sum(axis="columns")

In [None]:
clrd.sum(axis=0).sum(axis=1)
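
The same idea can be sketched on a bare `numpy` array. The toy array below is an assumption for illustration, and `keepdims=True` is used here so that all four axes stay in place after each aggregation, mirroring a structure where summed axes are kept with a length of 1.

```python
import numpy as np

# Toy 4D array: 2 segments x 3 columns x 4 origins x 4 development periods.
values = np.ones((2, 3, 4, 4))

# Sum over segments (axis 0), then over columns (axis 1).
segments_then_columns = values.sum(axis=0, keepdims=True).sum(axis=1, keepdims=True)

# The order of the two sums does not change the result.
columns_then_segments = values.sum(axis=1, keepdims=True).sum(axis=0, keepdims=True)

print(segments_then_columns.shape)
```

Each remaining cell is the total over all six (segment, column) combinations for that origin and development period.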

### Accessor Methods
`Pandas` has special "accessor" methods for `str` and `dt`.  These allow for the manipulation of data within each cell of data:

```python
# splits lastname from first name by a comma-delimiter
df['Last_First'].str.split(',')

# pulls the year out of each date in a dataframe column
df['Accident Date'].dt.year 
```

`chainladder` also has special "accessor" methods to help us manipulate the `origin`, `development` and `valuation` vectors of a triangle.

We may want to extract only the latest accident period for every triangle.

In [None]:
clrd[clrd.origin == clrd.origin.max()]

Note that this triangle has only one row. All of the development columns still exist, but only the youngest age has values.

We may want to extract particular diagonals from our triangles using their `valuation` vector.

In [None]:
clrd[(clrd.valuation >= "1994") & (clrd.valuation < "1995")]["CumPaidLoss"].sum()

We may even want to slice particular development periods to explore aspects of our data by development age. For example, we can look at the development factors between ages 24 and 36.

In [None]:
clrd[(clrd.development > 12) & (clrd.development <= 36)]["CumPaidLoss"].sum().link_ratio

### Moving Back to Pandas
When the shape of a `Triangle` object can be expressed as a 2D structure or less (i.e. at least two of its four axes have a length of 1), you can use the `to_frame` method to convert your data into a `pandas.DataFrame`. Let's pick only one financial field, `CumPaidLoss`, and only the latest diagonal, with `latest_diagonal`. We are now left with LOBs and origin periods as our 2 axes.

In [None]:
clrd.groupby("LOB").sum().latest_diagonal["CumPaidLoss"]

In [None]:
clrd.groupby("LOB").sum().latest_diagonal["CumPaidLoss"].to_frame().astype(int)

We can also aggregate away 3 dimensions, then use `to_frame`.

In [None]:
clrd[clrd.origin == "1990"].groupby("LOB").sum().latest_diagonal[
    "CumPaidLoss"
].to_frame().astype(int)

### Exercises
Use the `clrd` dataset for all of the exercises.

1. How do we create a new column named "NetPaidLossRatio" in the triangle using the existing columns?

In [None]:
clrd["NetPaidLossRatio"] = clrd["CumPaidLoss"] / clrd["EarnedPremNet"]

2. What is the highest paid loss ratio across all segments for origin 1997 at age 12?

In [None]:
clrd[clrd.origin == "1997"][clrd.development == 12]["NetPaidLossRatio"].max()

3. How do we subset the overall triangle to just include "Alaska Nat Ins Co"?

In [None]:
clrd[clrd["GRNAME"] == "Alaska Nat Ins Co"]

4. How do we create a triangle subset that includes all triangles for companies with names starting with the letter "B"?

In [None]:
clrd[clrd["GRNAME"].str[0] == "B"]

5. Which are the top 5 companies by net premium share in 1990?

In [None]:
clrd.latest_diagonal.groupby("GRNAME").sum()[clrd.origin == "1990"][
    "EarnedPremNet"
].to_frame().sort_values(ascending=False).iloc[0:5]

### Initializing a Triangle With Your Own Data
The `chainladder.Triangle`  class is designed to ingest `pandas.DataFrame` objects. However, you do not need to worry about shaping the dataframe into triangle format yourself. This happens at the time you ingest the data.

Let's look at the initialization signature and its docstring.

In [None]:
cl.Triangle?

Let's use a new dataset `prism` to construct our triangles.

In [None]:
prism_df = pd.read_csv(
    "https://raw.githubusercontent.com/casact/chainladder-python/master/chainladder/utils/data/prism.csv"
)
prism_df.head()

In [None]:
prism_df.dtypes

We must specify the `origin`, `development`, and `columns` to create a triangle object. By limiting our columns to one measure and not specifying an index, we can create a single triangle. For example, suppose we are only interested in the paid triangle.

In [None]:
prism = cl.Triangle(
    data=prism_df, origin="AccidentDate", development="PaymentDate", columns="Paid"
)
prism

Note that the lowest (most-detailed) grain supported is the monthly grain, so the triangle above is aggregated to the OMDM level.

In [None]:
prism.origin_grain

In [None]:
prism.development_grain

If we want to include more columns or indices we can certainly do so. Note that as we do this, we move into the 4D arena changing the display of the overall object.

In [None]:
prism = cl.Triangle(
    data=prism_df,
    origin="AccidentDate",
    development="PaymentDate",
    columns=["Paid", "Incurred"],
)
prism

`Pandas` has wonderful datetime inference functionality that the `Triangle` heavily uses to infer origin and development granularity. Even so, there are occasions where date format inference can fail, so it is usually good practice to tell the triangle the date format explicitly.

In [None]:
prism_df.head()

In [None]:
prism = cl.Triangle(
    data=prism_df,
    origin="AccidentDate",
    origin_format="%Y-%m-%d",
    development="PaymentDate",
    development_format="%Y-%m-%d",
    columns=["Paid", "Incurred"],
)
prism
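
To illustrate why an explicit date format helps, the pandas-only sketch below parses the same strings under two different formats. The strings are made up and deliberately ambiguous. They are not taken from the `prism` data, which uses the `%Y-%m-%d` layout.

```python
import pandas as pd

# Hypothetical date strings that can be read as day/month or month/day.
raw = pd.Series(["01/02/2015", "03/04/2015"])

# The same text yields different dates depending on the stated format.
day_first = pd.to_datetime(raw, format="%d/%m/%Y")
month_first = pd.to_datetime(raw, format="%m/%d/%Y")

print(day_first.dt.month.tolist(), month_first.dt.month.tolist())
```

Stating the format removes this ambiguity, which is the motivation for passing `origin_format` and `development_format`.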

Up until now, we've been playing with symmetric triangles (i.e. `origin` and `development` periods have the same grain). However, nothing precludes us from having different grains. In practice, the `development` axis is often more granular than the `origin` axis. All the functionality available to symmetric triangles works equally well for asymmetric triangles.

In [None]:
prism_df["AccYr"] = prism_df["AccidentDate"].str[:4]

prism = cl.Triangle(
    data=prism_df,
    origin="AccYr",
    origin_format="%Y",
    development="PaymentDate",
    development_format="%Y-%m-%d",
    columns=["Paid", "Incurred"],
)
prism["Paid"]

While exposure triangles make sense for auditable lines such as workers' compensation lines, most other lines of business' exposures can be expressed as a 1D array (along origin period) as exposures do not develop over time. `chainladder` arithmetic requires that operations happen between a triangle and either an `int`, `float`, or another `Triangle`. To create a 1D exposure array, simply omit the `development` argument at initialization.

The `prism` data does not contain exposure data, but we can contrive some. Let's assume that the premium is three times the incurred amount.

In [None]:
prism_df["Premium"] = 3 * prism_df["Incurred"]

prism = cl.Triangle(
    data=prism_df, origin="AccidentDate", origin_format="%Y-%m-%d", columns="Premium"
)

prism

Let's separate our data by segments using `index`:

In [None]:
prism = cl.Triangle(
    data=prism_df,
    origin="AccidentDate",
    origin_format="%Y-%m-%d",
    development="PaymentDate",
    development_format="%Y-%m-%d",
    columns=["Paid", "Incurred"],
    index="Line",
)
prism

We can further `index` by coverages, or sublines:

In [None]:
prism = cl.Triangle(
    data=prism_df,
    origin="AccidentDate",
    origin_format="%Y-%m-%d",
    development="PaymentDate",
    development_format="%Y-%m-%d",
    columns=["Paid", "Incurred"],
    index=["Line", "Type"],
)
prism

### Triangle Methods not Available in Pandas
Up until now, we've kept pretty close to the pandas API for triangle manipulation. However, there are data transformations commonly applied to triangles that don't have a nice `pandas` analogy.

For example, we often want to convert a triangle from an incremental view into a cumulative view and vice versa. This can be accomplished with the `incr_to_cum` and `cum_to_incr` methods.

In [None]:
prism["Paid"].sum()

In [None]:
prism_cum = prism.incr_to_cum()
prism_cum["Paid"].sum()

In [None]:
prism_incr = prism_cum.cum_to_incr()
prism_incr["Paid"].sum()
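
The relationship between the two views can be sketched with plain `numpy` on a small made-up triangle. This only illustrates the arithmetic. The array and variable names are hypothetical and this is not how `incr_to_cum` or `cum_to_incr` are implemented.

```python
import numpy as np

# Hypothetical incremental triangle: rows are origin periods, columns are ages.
incr = np.array(
    [
        [100.0, 50.0, 15.0],
        [110.0, 60.0, np.nan],
        [120.0, np.nan, np.nan],
    ]
)

# Cumulative view: running total along the development axis.
cum = np.cumsum(incr, axis=1)

# Back to incremental: difference between consecutive ages, starting from zero.
back_to_incr = np.diff(cum, axis=1, prepend=0.0)

print(cum)
```

Unobserved cells stay `NaN` in both directions, and the round trip recovers the original incremental amounts.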

By default (and in concert with the `pandas` philosophy), the methods associated with the `Triangle` class strive for immutability. This means that the `incr_to_cum` and `cum_to_incr` methods return a new triangle object that must be assigned, or it is thrown away. Many of the `chainladder.Triangle` methods have an `inplace` argument, or alternatively you can just use variable reassignment to store the transformed triangle object.

In [None]:
# This works
prism.incr_to_cum(inplace=True)
# So does this
prism = prism.incr_to_cum()

When dealing with triangles that have an `origin` axis, `development` axis, or both at a monthly or quarterly grain, the triangle can be summarized to a higher grain using the `grain` method.

The grain to which you want your triangle converted is specified as "OxDy", where "x" and "y" can take on values of "M", "Q", or "Y". For example:
* `grain("OYDY")` for Origin Year x Development Year.
* `grain("OQDM")` for Origin Quarter x Development Month.

In [None]:
prism_OYDY = prism.grain("OYDY")
prism_OYDY["Paid"].sum()
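
Conceptually, moving to a higher grain means bucketing detailed dates into coarser periods and summing. The pandas-only sketch below does this for a few made-up incremental payments. The column names and amounts are hypothetical, and the calendar-year lag used for the development year is a simplifying assumption rather than the `grain` implementation.

```python
import pandas as pd

# Hypothetical incremental payments with detailed accident and payment dates.
payments = pd.DataFrame(
    {
        "origin": pd.to_datetime(["2020-01-15", "2020-05-10", "2020-05-10", "2021-02-01"]),
        "payment": pd.to_datetime(["2020-03-01", "2020-06-20", "2021-01-05", "2021-08-30"]),
        "paid": [10.0, 20.0, 5.0, 40.0],
    }
)

# Bucket both dates into years; development year 1 is the origin year itself.
payments["origin_year"] = payments["origin"].dt.year
payments["dev_year"] = payments["payment"].dt.year - payments["origin_year"] + 1

# Sum the incremental amounts that fall into each (origin year, development year) cell.
oydy = payments.pivot_table(
    index="origin_year", columns="dev_year", values="paid", aggfunc="sum"
)
print(oydy)
```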

Depending on the type of analysis being done, it may be more convenient to look at a triangle with its `development` axis expressed as a valuation rather than an age. This is also what the Schedule Ps look like. To do this, `Triangle` has two methods for toggling between a development triangle and a valuation triangle. The methods are `dev_to_val` and its inverse `val_to_dev`.

In [None]:
prism_OYDY_val = prism_OYDY.dev_to_val()
prism_OYDY_val["Paid"].sum()
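
The difference between the two layouts can be sketched in pandas on a small made-up triangle. The numbers and names below are hypothetical, and the conversion assumes yearly origins with ages in months. It is an illustration of the idea, not the `dev_to_val` implementation.

```python
import pandas as pd

# Hypothetical development triangle: rows are origin years, columns are ages in months.
dev = pd.DataFrame(
    [[100.0, 150.0, 165.0], [110.0, 170.0, None], [120.0, None, None]],
    index=pd.Index([2019, 2020, 2021], name="origin"),
    columns=[12, 24, 36],
)

# Unpivot to long format and keep only observed cells.
long_df = dev.reset_index().melt(id_vars="origin", var_name="age", value_name="value")
long_df = long_df.dropna(subset=["value"])

# Valuation year = origin year + completed development years - 1.
long_df["valuation"] = long_df["origin"] + long_df["age"].astype(int) // 12 - 1

# Pivot again with valuation years as columns.
val = long_df.pivot(index="origin", columns="valuation", values="value")
print(val)
```

In the valuation layout the latest diagonal lines up as the last column, here the 2021 column.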

When working with real-world data, triangles can have holes, such as missing evaluation(s) or no losses in certain origin(s). In these cases, it doesn't make sense to include empty accident periods or development ages. For example, in the `prism` dataset, the "Home" line has its latest accidents through 2016 and has no payments in development age '12'. In such cases, it can be useful to drop the non-applicable fields with the `dropna()` method.

In [None]:
prism_OYDY.loc["Home"]["Paid"]

Let's see what happens if we have no data for 2011.

In [None]:
prism_df_2011 = prism_df.copy()
prism_df_2011.loc[
    (prism_df_2011["AccidentDate"] >= "2011-01-01")
    & (prism_df_2011["AccidentDate"] < "2012-01-01"),
    "Paid",
] = None  # Assume we have no payments for losses that occurred in 2011
prism_2011 = cl.Triangle(
    data=prism_df_2011,
    origin="AccidentDate",
    origin_format="%Y-%m-%d",
    development="PaymentDate",
    development_format="%Y-%m-%d",
    columns=["Paid", "Incurred"],
    index=["Line", "Type"],
)
prism_2011.incr_to_cum(inplace=True)
prism_2011_OYDY = prism_2011.grain("OYDY")
prism_2011_OYDY.loc["Home"]["Paid"]

Note that the `dropna()` method will retain empty periods if they are surrounded by non-empty periods with valid data.

In [None]:
prism_2011_OYDY_droppedna = prism_2011_OYDY.loc["Home"].dropna()
prism_2011_OYDY_droppedna.loc["Home"]["Paid"]

### Commutative Properties of Triangle Methods
Where it makes sense, which is in most cases, the methods described above are commutative and can be applied in any order.

In [None]:
print("Commutative?", prism.sum().latest_diagonal == prism.latest_diagonal.sum())
print("Commutative?", prism.loc["Home"].link_ratio == prism.link_ratio.loc["Home"])

### Triangle Imports and Exports
To the extent the `Triangle` can be expressed as a `pandas.DataFrame`, you can use any of the pandas IO to send the data in and out. Note that converting to pandas is a one-way ticket with no inverse functions.

Another useful function is to copy the triangle and put it in the clipboard so we can paste it elsewhere (such as Excel):

In [None]:
prism_OYDY.loc["Home", "Paid"]

In [None]:
prism_OYDY.loc["Home", "Paid"].to_clipboard()

Try to paste it elsewhere.

Alternatively, if you want to store the triangle elsewhere but be able to reconstitute a triangle out of it later, then you can use:
* `Triangle.to_json` and its inverse `cl.read_json` for json format
* `Triangle.to_pickle` and its inverse `cl.read_pickle` for pickle format

These have the added benefit of working on multi-dimensional triangles that don't fit into a DataFrame.

### Exercises

In [None]:
prism = cl.Triangle(
    data=prism_df,
    origin="AccidentDate",
    development="PaymentDate",
    columns=["Paid", "Incurred"],
    index=["Line", "Type"],  # multiple indices
).incr_to_cum()
prism

1. What is the case incurred activity for calendar periods in 2015Q2 (April, May, and June in 2015) by "Line"?

In [None]:
incr_by_line = prism.groupby("Line").sum().cum_to_incr()["Incurred"].dev_to_val()
incr_by_line[
    (incr_by_line.valuation >= "2015-04-01") & (incr_by_line.valuation < "2015-07-01")
].sum("origin").to_frame().astype(int)

2. For accident year 2015, what proportion of our Paid amounts come from each "Type" of claims?

In [None]:
prism_OYDY = prism.grain("OYDY")
prism_OYDY

In [None]:
by_type = (
    prism_OYDY.latest_diagonal[prism_OYDY.origin == "2015"]["Paid"]
    .groupby("Type")
    .sum()
    .to_frame()
)
by_type / by_type.sum()