Summary: 

The goal is to create a method that correctly calculates the insurance Gini score and also draws the plot.

I think we will need three items:
- a function that takes the actual values, the predictions, and years_at_risk, and builds the table of points
- a function that draws the plot from this table (which should be simple)
- a function that calculates the score; the auc function in sklearn handles this, but I'd like to recalculate it myself

On the __years_at_risk__ parameter: we have to be careful with this. The concrete example I can think of is time exposure. We should obviously order things by the annualised prediction. However, I'm not sure whether the percentages should be by row or by exposure. For example, if we have 3 predictions with 0.5 yr, 0.5 yr, and 1 yr at risk, does the first point represent 1/3 of the policies (by row) or 1/4 (by exposure, 0.5 out of 2 years)? I think this should be configurable.
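The two readings of the x axis in the example above can be compared directly. This is a minimal sketch using only NumPy; the names `x_by_row` and `x_by_exposure` are made up for illustration and are not used elsewhere in this notebook.

```python
import numpy as np

# Exposures from the example: three policies with 0.5, 0.5 and 1 year at risk.
years_at_risk = np.array([0.5, 0.5, 1.0])

# Option 1: every row counts equally, so each policy is 1/3 of the x axis.
x_by_row = np.arange(1, len(years_at_risk) + 1) / len(years_at_risk)

# Option 2: rows are weighted by exposure, so the first policy is 0.5 / 2 = 1/4.
x_by_exposure = np.cumsum(years_at_risk) / years_at_risk.sum()

print(x_by_row)       # first point at 1/3
print(x_by_exposure)  # first point at 1/4
```

Both versions end at 1.0, so either can serve as the x axis of the curve; they only differ in how far each policy moves along it.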

For now, I'll simply create annualised versions first, and do everything with those fields.

Just to reiterate: the actu and pred inputs are currently assumed to be non-annualised!
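To see why the annualising step matters for the ordering, here is a small sketch with hypothetical figures (two made-up policies, not taken from the data below). The policy with the lower raw prediction has the higher annual rate, so the ranking flips once exposure is taken into account.

```python
import numpy as np

# Hypothetical figures: total predicted cost over each policy's exposure period.
pred = np.array([50.0, 80.0])
years_at_risk = np.array([0.5, 1.0])

# Annualise before ranking, so policies with different exposure are comparable.
pred_annual = pred / years_at_risk  # 100 per year and 80 per year

print(np.argsort(pred))         # raw order: policy 0, then policy 1
print(np.argsort(pred_annual))  # annualised order: policy 1, then policy 0
```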

# Data

In [3]:
import numpy as np
import pandas as pd

In [4]:
df_input = pd.DataFrame({
    'actual': [0, 0, 0, 0, 0, 0, 0, 100, 100, 100],
    'prediction': [10, 10, 10, 10, 50, 50, 20, 50, 60, 10],
    'years_at_risk': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
})

# Gini Table

In [5]:
def get_gini_table(pred, actu, years_at_risk):
    """Build the table of curve points from non-annualised predictions and actuals."""
    df = pd.DataFrame()
    df['pred'] = pred
    df['actu'] = actu
    df['years_at_risk'] = years_at_risk

    # annualise the predictions so rows with different exposure are comparable
    df['pred'] = df['pred'] / df['years_at_risk']

    # the next line is debatable: you can make an argument for not annualising
    # the actual claims, which would mean that getting short-exposure claims wrong
    # matters less than getting the other claims wrong. It's a difficult call.
    df['actu'] = df['actu'] / df['years_at_risk']

    df = df.drop(columns='years_at_risk').sort_values('pred')

    # change pred and actu so they represent cumulative shares of the total;
    # pred is simply the rank share here, although it could also be based on the value
    df['pred'] = np.arange(1, len(df) + 1) / len(df)
    df['actu'] = df['actu'].cumsum() / df['actu'].sum()

    # prepend the (0, 0) point; both the score and the plot need it
    df = pd.concat([pd.DataFrame({'pred': [0], 'actu': [0]}), df])

    return df

In [6]:
df = get_gini_table(df_input['prediction'], df_input['actual'], df_input['years_at_risk'])

In [7]:
df

,pred,actu
0,0.0,0.0
0,0.1,0.0
1,0.2,0.0
2,0.3,0.0
3,0.4,0.0
9,0.5,0.333333
6,0.6,0.333333
4,0.7,0.333333
5,0.8,0.333333
7,0.9,0.666667
8,1.0,1.0


# Score

Assuming we have a Gini table in the proper format, what we need is the area under the curve. The score is then probably something like 1 - 2 * area.
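Written out, as a sketch under the assumption that the table rows are the points $(x_0, y_0) = (0, 0), \dots, (x_n, y_n)$ sorted by x and joined by straight lines, the area is a sum of trapezoids and the score follows from it (the symbols $A$ and $G$ are introduced here only for illustration):

$$A = \sum_{i=1}^{n} (x_i - x_{i-1}) \, \frac{y_i + y_{i-1}}{2}, \qquad G = 1 - 2A$$

Each trapezoid can also be split into a rectangle of height $y_{i-1}$ plus a triangle of height $y_i - y_{i-1}$, which gives the same area.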

In [8]:
def get_area_under_curve(x, y):
    # x is the prediction share, y is the cumulative actual share
    df = pd.DataFrame({'x': x, 'y': y})
    df['x_incr'] = df['x'].diff()
    df['y_incr'] = df['y'].diff()
    # trapezoid per step: a triangle on top of a rectangle of height y_prev
    df['area_under_curve'] = (df['x_incr'] * df['y_incr'] / 2
                              + (df['y'] - df['y_incr']) * df['x_incr'])

    # the first row is NaN after diff(); Series.sum() skips it
    return df['area_under_curve'].sum()

In [9]:
get_area_under_curve(df['pred'], df['actu'])

0.24999999999999997

In [10]:
# let's check against sklearn score...
from sklearn.metrics import auc
auc(df['pred'], df['actu'])

0.24999999999999994

For the Gini score, what we need is basically the ratio of this area under the curve to the lower half of the chart (the triangle under the diagonal), whose area is 0.5.

I am going to keep the table-building step separate, because we will need the table for the plot as well.

In [11]:
def get_gini_score(gini_table):
    
    area_under_curve = get_area_under_curve(gini_table['pred'], gini_table['actu'])
    
    return 1 - area_under_curve / 0.5

In [12]:
get_gini_score(df)

0.5
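As a sanity check, the whole calculation can be repeated with NumPy alone and compared against a prediction that ranks the policies exactly like the actuals. This is only a sketch: `gini_from_order` is a hypothetical helper, it assumes every row has one year at risk, and it uses a stable sort so tied predictions keep their input order.

```python
import numpy as np

def gini_from_order(pred, actu):
    """Hypothetical helper: 1 - 2 * area under the cumulative-actual curve,
    assuming every row has one year at risk."""
    pred = np.asarray(pred, dtype=float)
    actu = np.asarray(actu, dtype=float)
    # ascending prediction; ties keep their input order
    order = np.argsort(pred, kind="stable")
    # x: rank share, y: cumulative share of actuals, both starting at (0, 0)
    x = np.concatenate([[0.0], np.arange(1, len(pred) + 1) / len(pred)])
    y = np.concatenate([[0.0], np.cumsum(actu[order]) / actu.sum()])
    # trapezoid rule over consecutive points
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2)
    return 1 - 2 * area

actual = [0, 0, 0, 0, 0, 0, 0, 100, 100, 100]
prediction = [10, 10, 10, 10, 50, 50, 20, 50, 60, 10]

sample_score = gini_from_order(prediction, actual)
# predictions that rank exactly like the actuals
perfect_score = gini_from_order(actual, actual)
print(sample_score, perfect_score)
```

On this data the sample predictions score about 0.5 and the perfect ranking about 0.7, so with this definition the best achievable score depends on how the claims are distributed and is not necessarily 1.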