Top Percentile Fraud

ABC Corp is a mid-sized US insurer. Fraudulent claims in its personal auto insurance portfolio have increased significantly in the recent past. The company has developed an ML-based predictive model that estimates how likely a claim is to be fraudulent, and it now assigns highly experienced claim adjusters to the top 5 percentile of claims identified by the model.
Your objective is to identify the top 5 percentile of claims from each state. Your output should contain the policy number, state, claim cost, and fraud score.

In [1]:
import pandas as pd
import numpy as np

In [3]:
fraud_score = pd.read_csv('../CSV/fraud_score.csv')
fraud_score = fraud_score.iloc[:, :4]
fraud_score.head()

Unnamed: 0,policy_num,state,claim_cost,fraud_score
0,ABCD1001,CA,4113,0.613
1,ABCD1002,CA,3946,0.156
2,ABCD1003,CA,4335,0.014
3,ABCD1004,CA,3967,0.142
4,ABCD1005,CA,1599,0.889


In [7]:
fraud_score["percentile"] = fraud_score.groupby('state')['fraud_score'].rank(pct=True)
fraud_score

Unnamed: 0,policy_num,state,claim_cost,fraud_score,percentile
0,ABCD1001,CA,4113,0.613,0.666667
1,ABCD1002,CA,3946,0.156,0.181818
2,ABCD1003,CA,4335,0.014,0.010101
3,ABCD1004,CA,3967,0.142,0.171717
4,ABCD1005,CA,1599,0.889,0.868687
...,...,...,...,...,...
395,ABCD1396,TX,2535,0.926,0.930693
396,ABCD1397,TX,2358,0.761,0.792079
397,ABCD1398,TX,3191,0.978,0.980198
398,ABCD1399,TX,3107,0.416,0.435644


This code ranks the values of the "fraud_score" column in the `fraud_score` DataFrame within each group defined by the unique values of the 'state' column, then creates a new "percentile" column holding each record's percentile within its group. Step by step:

1. `fraud_score.groupby('state')`: Groups the `fraud_score` DataFrame by the unique values in the 'state' column. Each unique state gets its own subset of rows.

2. `['fraud_score']`: Selects only the 'fraud_score' column, so that ranking is applied to it within each group.

3. `.rank(pct=True)`: Applies `rank()` to rank the values within each group. `pct=True` returns the ranks as percentiles, i.e. each rank is divided by the number of values in its group, giving a fraction greater than 0 and at most 1.

4. `fraud_score["percentile"] = ...`: Creates a new "percentile" column in the `fraud_score` DataFrame and assigns the computed percentiles to it.

As a result, "percentile" contains the percentile of each "fraud_score" value within its 'state' group. This is useful for judging how each fraud score compares with the other scores in the same state.
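To make the per-group ranking concrete, here is a minimal sketch on a toy DataFrame. The values are made up for illustration and are not taken from fraud_score.csv; only the column names match the notebook.

```python
import pandas as pd

# Hypothetical toy data: four CA claims and two NY claims
toy = pd.DataFrame({
    "state": ["CA", "CA", "CA", "CA", "NY", "NY"],
    "fraud_score": [0.10, 0.40, 0.20, 0.90, 0.30, 0.70],
})

# Rank each score within its own state, expressed as rank / group size
toy["percentile"] = toy.groupby("state")["fraud_score"].rank(pct=True)
print(toy)
```

The highest score in each state gets 1.0 regardless of how it compares with scores in the other state: 0.70 is the top NY score and gets 1.0, while the CA score 0.40 is third of four and gets 0.75.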

In [5]:
fraud_score.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 400 entries, 0 to 399
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   policy_num   400 non-null    object 
 1   state        400 non-null    object 
 2   claim_cost   400 non-null    int64  
 3   fraud_score  400 non-null    float64
 4   percentile   400 non-null    float64
dtypes: float64(2), int64(1), object(2)
memory usage: 15.8+ KB


In [8]:
df = fraud_score[fraud_score['percentile'] > 0.95]
df

Unnamed: 0,policy_num,state,claim_cost,fraud_score,percentile
15,ABCD1016,CA,1639,0.964,0.989899
26,ABCD1027,CA,2663,0.988,1.0
68,ABCD1069,CA,1426,0.948,0.959596
78,ABCD1079,CA,4224,0.963,0.979798
80,ABCD1081,CA,1080,0.951,0.969697
116,ABCD1117,NY,4903,0.978,0.99
120,ABCD1121,NY,4009,0.969,0.96
186,ABCD1187,NY,3722,0.976,0.98
188,ABCD1189,NY,3577,0.982,1.0
195,ABCD1196,NY,2994,0.973,0.97


In [9]:
result = df[['policy_num','state','claim_cost','fraud_score']]
result

Unnamed: 0,policy_num,state,claim_cost,fraud_score
15,ABCD1016,CA,1639,0.964
26,ABCD1027,CA,2663,0.988
68,ABCD1069,CA,1426,0.948
78,ABCD1079,CA,4224,0.963
80,ABCD1081,CA,1080,0.951
116,ABCD1117,NY,4903,0.978
120,ABCD1121,NY,4009,0.969
186,ABCD1187,NY,3722,0.976
188,ABCD1189,NY,3577,0.982
195,ABCD1196,NY,2994,0.973


Solution Walkthrough
In this problem, we are given a DataFrame called fraud_score which contains information about policy numbers, states, claim costs, and fraud scores for an insurance company. We are asked to identify the top 5 percentile of claims from each state and output the policy number, state, claim cost, and fraud score for those claims.

To solve this problem, we will use the pandas library in Python. Pandas provides high-performance data manipulation and analysis tools, including functions to group data, rank values within groups, and filter rows based on conditions. The notebook also imports numpy, although the solution does not call it directly.

Let's walk through the solution step by step.

Understanding The Data
Before we dive into the solution, it's important to understand the structure of the fraud_score DataFrame. The DataFrame contains the following columns:

policy_num: The policy number for each claim.
state: The state in which the claim was made.
claim_cost: The cost of the claim.
fraud_score: The fraud score assigned to each claim.

The Problem Statement
We are given a DataFrame with information about claims filed with an insurance company. The company wants to identify the top 5 percentile of claims in each state, ranked by fraud score. We need to write a program that does this and outputs the policy number, state, claim cost, and fraud score for those claims.

Breaking Down The Code
Let's break down the given code step by step:

import pandas as pd
import numpy as np
These lines import pandas as pd and numpy as np. pandas supplies the DataFrame and the grouping and ranking operations used below; numpy is imported by convention, and none of the following steps call it directly.

fraud_score["percentile"] = fraud_score.groupby("state")[
    "fraud_score"
].rank(pct=True)
This line calculates the percentile rank of the fraud_score column within each state. groupby("state") splits the rows by state, and rank with pct=True returns each claim's rank divided by the number of claims in its state: a fraction greater than 0 and at most 1, not a 0 to 100 percentage. The resulting percentile ranks are stored in a new column called "percentile" in the fraud_score DataFrame.
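One detail the line above leaves implicit is how tied scores are ranked. By default rank uses method="average", so tied claims share the mean of their ranks. The sketch below compares that default with method="min" and method="max" on made-up scores for a single state; the column names pct_average, pct_min, and pct_max are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical scores for one state; two claims share the score 0.9
ties = pd.DataFrame({
    "state": ["CA", "CA", "CA"],
    "fraud_score": [0.5, 0.9, 0.9],
})

grouped = ties.groupby("state")["fraud_score"]
ties["pct_average"] = grouped.rank(pct=True)             # default: tied rows share the mean rank
ties["pct_min"] = grouped.rank(pct=True, method="min")   # tied rows share the lowest rank
ties["pct_max"] = grouped.rank(pct=True, method="max")   # tied rows share the highest rank
print(ties)
```

If tied scores sit near the 0.95 boundary, the choice of method can decide whether those claims pass a later threshold filter, so it is worth checking for ties before relying on the default.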

df = fraud_score[fraud_score["percentile"] > 0.95]
This line filters the fraud_score DataFrame to keep only the rows where the "percentile" column is greater than 0.95. This selects the top 5 percentile of claims from each state, as we want.
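Because the filter is a strict comparison on rank divided by group size, the number of claims kept per state depends on how many claims that state has. The following sketch uses synthetic data with distinct scores; the state codes "AA" and "BB" and their sizes are invented for illustration.

```python
import pandas as pd

# Synthetic groups of different sizes, each with distinct scores
sizes = {"AA": 20, "BB": 100}
rows = [
    {"state": state, "fraud_score": i / 1000}
    for state, n in sizes.items()
    for i in range(n)
]
demo = pd.DataFrame(rows)

# Same two steps as in the walkthrough: rank within state, then filter
demo["percentile"] = demo.groupby("state")["fraud_score"].rank(pct=True)
selected = demo[demo["percentile"] > 0.95]

# Number of claims kept in each state
counts = selected.groupby("state").size().to_dict()
print(counts)
```

With 20 claims only the single highest score exceeds 0.95 (19/20 equals 0.95 and is excluded by the strict comparison), while with 100 claims the top five are kept. Using >= 0.95 instead would also keep the boundary claims, which may or may not be what the business rule intends.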

result = df[["policy_num", "state", "claim_cost", "fraud_score"]]
This line selects the specified columns from the filtered DataFrame (df) and assigns the result to a new DataFrame called result. The specified columns are "policy_num", "state", "claim_cost", and "fraud_score".

Bringing It All Together
Putting the individual code snippets together, we have the following solution:

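import pandas as pd
import numpy as np  # imported in the notebook; not used below

# Load the data and keep the first four columns, as in the notebook
fraud_score = pd.read_csv("../CSV/fraud_score.csv")
fraud_score = fraud_score.iloc[:, :4]

# Percentile rank of each fraud score within its state
fraud_score["percentile"] = fraud_score.groupby("state")[
    "fraud_score"
].rank(pct=True)
# Keep claims above the 95th percentile in their state
df = fraud_score[fraud_score["percentile"] > 0.95]
# Output columns requested by the problem
result = df[["policy_num", "state", "claim_cost", "fraud_score"]]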
This solution first calculates the percentile rank of the "fraud_score" column within each state, then filters the DataFrame to keep only the rows with a percentile rank greater than 0.95. Finally, it selects the specified columns and assigns the result to the result DataFrame.
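The same steps can also be wrapped in a small reusable function. This is one possible sketch, not part of the original notebook: the function name top_fraud_claims, the cutoff parameter, and the sample data are all hypothetical. Unlike the notebook code, it does not add a "percentile" column to the input DataFrame.

```python
import pandas as pd

def top_fraud_claims(claims: pd.DataFrame, cutoff: float = 0.95) -> pd.DataFrame:
    """Return claims whose within-state percentile rank exceeds cutoff."""
    # Percentile rank of each fraud score within its state
    pct = claims.groupby("state")["fraud_score"].rank(pct=True)
    cols = ["policy_num", "state", "claim_cost", "fraud_score"]
    # Boolean mask selects rows, cols selects the requested output columns
    return claims.loc[pct > cutoff, cols]

# Made-up sample: 20 claims per state with increasing fraud scores
sample = pd.DataFrame({
    "policy_num": [f"P{i:03d}" for i in range(40)],
    "state": ["CA"] * 20 + ["NY"] * 20,
    "claim_cost": [1000 + 10 * i for i in range(40)],
    "fraud_score": [i / 100 for i in range(40)],
})

top = top_fraud_claims(sample)
print(top)
```

Keeping the percentile in a local Series leaves the caller's DataFrame unchanged, which makes the function safe to call more than once with different cutoffs.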

Conclusion
In this walkthrough, we learned how to identify the top 5 percentile of claims from each state based on the fraud score using pandas. By grouping the data by state and calculating the percentile rank within each group, we were able to filter the data and select the desired claims.