## Expense Claim Patterns and Fraud Analysis (Flag 88)

### Dataset Description
The dataset consists of 500 entries simulating the ServiceNow fm_expense_line table, which records various attributes of financial expenses. Key fields include 'number', 'opened_at', 'amount', 'state', 'short_description', 'ci', 'user', 'department', 'category', 'processed_date', 'source_id', and 'type'. This table documents the flow of financial transactions by detailing the amount, departmental allocation, and the nature of each expense. It provides a comprehensive view of organizational expenditures across different categories, highlighting both the timing and the approval state of each financial entry. Additionally, the dataset offers insights into the efficiency of expense processing based on different states, revealing potential areas for workflow optimization.

### Your Task
**Goal**: To detect and investigate instances of repeated identical expense claims by individual users, determining whether these repetitions are fraudulent or due to misunderstandings of the expense policy.

**Role**: Compliance and Audit Analyst

**Difficulty**: 3 out of 5.

**Category**: Finance Management


### Import Necessary Libraries
This cell imports the libraries used in the analysis: data manipulation, data visualization, and supporting utilities.

In [1]:
import argparse
import pandas as pd
import json
import requests
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

from openai import OpenAI
from pandas import date_range

### Load Dataset
This cell loads the expense dataset to be analyzed. The data is stored as a CSV file and is imported here into a pandas DataFrame. The steps are: specify the path to the dataset, read the file with pandas, and confirm the load by inspecting the first few rows.

In [2]:
dataset_path = "csvs/flag-88.csv"
flag_data = pd.read_csv(dataset_path)
df = flag_data.copy()  # independent copy used by later cells; avoids reading the file twice
flag_data.head()

Unnamed: 0,category,state,closed_at,opened_at,closed_by,number,sys_updated_by,location,assigned_to,caller_id,sys_updated_on,short_description,priority,assignement_group
0,Database,Closed,2023-07-25 03:32:18.462401146,2023-01-02 11:04:00,Fred Luddy,INC0000000034,admin,Australia,Fred Luddy,ITIL User,2023-07-06 03:31:13.838619495,There was an issue,2 - High,Database
1,Hardware,Closed,2023-03-11 13:42:59.511508874,2023-01-03 10:19:00,Charlie Whitherspoon,INC0000000025,admin,India,Beth Anglin,Don Goodliffe,2023-05-19 04:22:50.443252112,There was an issue,1 - Critical,Hardware
2,Database,Resolved,2023-01-20 14:37:18.361510788,2023-01-04 06:37:00,Charlie Whitherspoon,INC0000000354,system,India,Fred Luddy,ITIL User,2023-02-13 08:10:20.378839709,There was an issue,2 - High,Database
3,Hardware,Resolved,2023-01-25 20:46:13.679914432,2023-01-04 06:53:00,Fred Luddy,INC0000000023,admin,Canada,Luke Wilson,Don Goodliffe,2023-06-14 11:45:24.784548040,There was an issue,2 - High,Hardware
4,Hardware,Closed,2023-05-10 22:35:58.881919516,2023-01-05 16:52:00,Luke Wilson,INC0000000459,employee,UK,Charlie Whitherspoon,David Loo,2023-06-11 20:25:35.094482408,There was an issue,2 - High,Hardware
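
The preview above shows incident-style columns rather than the expense fields listed in the dataset description. A minimal sketch of a pre-flight check follows. It assumes the required columns are the ones the later cells refer to; `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names, and the small frame is a stand-in for `flag_data` built from a few columns of the preview.

```python
import pandas as pd

# Columns the later analysis cells rely on (taken from the commented-out code).
REQUIRED_COLUMNS = ["user", "department", "category", "amount", "processing_time_hours"]


def missing_columns(frame: pd.DataFrame, required=REQUIRED_COLUMNS) -> list:
    """Return the required column names that are absent from the frame."""
    return [col for col in required if col not in frame.columns]


# Stand-in frame with a few of the columns shown in the preview above.
preview = pd.DataFrame(
    {
        "category": ["Database", "Hardware"],
        "state": ["Closed", "Closed"],
        "opened_at": ["2023-01-02 11:04:00", "2023-01-03 10:19:00"],
        "number": ["INC0000000034", "INC0000000025"],
    }
)

missing = missing_columns(preview)
print(missing)
```

Running such a check right after loading would surface the schema mismatch before any plotting cell raises a `KeyError`.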


### **Question 1: How many instances of repeated identical expense claims are there, and which users are involved?**

#### Plot expense distribution by department

This bar chart plots the average expense per department across the organization. It helps identify departments that may be overspending or under-utilizing resources.

In [3]:
# import matplotlib.pyplot as plt
# import pandas as pd

# # Assuming flag_data is your DataFrame containing expense data
# # Group data by department and calculate total and average expenses
# department_expenses = flag_data.groupby('department')['amount'].agg(['sum', 'mean']).reset_index()

# # Sort data for better visualization (optional)
# department_expenses.sort_values('sum', ascending=False, inplace=True)

# # Creating the plot
# fig, ax = plt.subplots(figsize=(14, 8))

# # Bar plot for total expenses
# # total_bars = ax.bar(department_expenses['department'], department_expenses['sum'], color='blue', label='Total Expenses')

# # Bar plot for average expenses
# average_bars = ax.bar(department_expenses['department'], department_expenses['mean'], color='green', label='Average Expenses', alpha=0.6, width=0.5)

# # Add some labels, title and custom x-axis tick labels, etc.
# ax.set_xlabel('Department')
# ax.set_ylabel('Expenses ($)')
# ax.set_title('Average Expenses by Department')
# ax.set_xticks(department_expenses['department'])
# ax.set_xticklabels(department_expenses['department'], rotation=45)
# ax.legend()

# # Adding a label above each bar
# def add_labels(bars):
#     for bar in bars:
#         height = bar.get_height()
#         ax.annotate(f'{height:.2f}',
#                     xy=(bar.get_x() + bar.get_width() / 2, height),
#                     xytext=(0, 3),  # 3 points vertical offset
#                     textcoords="offset points",
#                     ha='center', va='bottom')

# # add_labels(total_bars)
# add_labels(average_bars)

# plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)
# plt.show()
print("N/A")

N/A
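
The department-level aggregation intended by the cell above could be sketched as follows, with a guard that returns `None` instead of raising when a column is absent. This assumes `department` and `amount` columns as named in the dataset description; `average_expense_by_department` is a hypothetical helper and the rows are made up for illustration.

```python
import pandas as pd


def average_expense_by_department(frame: pd.DataFrame):
    """Return mean 'amount' per department, or None if a needed column is absent."""
    needed = {"department", "amount"}
    if not needed.issubset(frame.columns):
        return None
    return (
        frame.groupby("department")["amount"]
        .mean()
        .sort_values(ascending=False)
    )


# Made-up rows for illustration only; not taken from flag-88.csv.
toy = pd.DataFrame(
    {
        "department": ["Sales", "Sales", "IT", "IT"],
        "amount": [100.0, 300.0, 50.0, 150.0],
    }
)

dept_means = average_expense_by_department(toy)
print(dept_means)
```

The resulting Series can be passed to a bar plot once the real data contains both columns.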


#### Generate JSON Description for the Insight

In [None]:
{
    "data_type": "frequency",
    "insight": "The analysis could not be completed because the required 'department' column is missing from the dataset (flag_data). This is evidenced by the KeyError in the output indicating that 'department' is not a valid column name.",
    "insight_value": {},
    "plot": {
        "description": "The graph could not be generated due to missing data"
    },
    "question": "How many instances of repeated identical expense claims are there, and which users are involved?",
    "actionable_insight": "No actionable insight could be generated due to missing data"
}

{'data_type': 'frequency',
 'insight': "The analysis could not be completed because the required 'department' column is missing from the dataset (flag_data). This is evidenced by the KeyError in the output indicating that 'department' is not a valid column name.",
 'insight_value': {},
 'plot': {'description': 'The code attempts to create a bar chart showing average expenses by department, but the visualization failed due to missing data. The intended plot would have shown department-wise average expenses with green bars, including numerical labels and a rotated x-axis for better readability.'},
 'question': 'How many instances of repeated identical expense claims are there, and which users are involved?',
 'actionable_insight': "Before analyzing expense claim patterns, the data structure needs to be verified and corrected. Specifically, ensure that the dataset contains the required 'department' and 'amount' columns. Additionally, the code should be modified to address the actual quest

### **Question 2: What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?**

This question compares processing times across expense states. The working hypothesis is that Processed expenses take less time than Declined ones; confirming or refuting it would help identify where the expense processing workflow can be made more efficient.

In [5]:
# # Calculate average processing time for each state
# avg_processing_time_by_state = df.groupby('state')['processing_time_hours'].mean().reset_index()

# # Set the style of the visualization
# sns.set(style="whitegrid")

# # Create a bar plot for average processing time by state
# plt.figure(figsize=(12, 6))
# sns.barplot(x='state', y='processing_time_hours', data=avg_processing_time_by_state)
# plt.title('Average Processing Time by State')
# plt.xlabel('State')
# plt.ylabel('Average Processing Time (hours)')
# plt.xticks(rotation=45)
# plt.show()
print("N/A")

N/A
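
If the source data carried the `opened_at` and `processed_date` timestamps named in the dataset description, `processing_time_hours` could be derived rather than expected as a ready-made column. The sketch below makes that assumption; the loaded CSV does not contain `processed_date`, so the rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from flag-88.csv.
toy = pd.DataFrame(
    {
        "state": ["Processed", "Processed", "Declined"],
        "opened_at": [
            "2023-01-01 08:00:00",
            "2023-01-02 08:00:00",
            "2023-01-03 08:00:00",
        ],
        "processed_date": [
            "2023-01-01 20:00:00",
            "2023-01-03 08:00:00",
            "2023-01-05 08:00:00",
        ],
    }
)

# Parse both timestamp columns and express the difference in hours.
opened = pd.to_datetime(toy["opened_at"])
processed = pd.to_datetime(toy["processed_date"])
toy["processing_time_hours"] = (processed - opened).dt.total_seconds() / 3600

# Average processing time per state, as the commented-out cell intends.
avg_hours_by_state = toy.groupby("state")["processing_time_hours"].mean()
print(avg_hours_by_state)
```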


In [1]:
{
    "data_type": "comparative",
    "insight": "The analysis could not be completed because the column 'processing_time_hours' was not found in the dataset, indicating either missing or incorrectly named data",
    "insight_value": {},
    "plot": {
        "description": "The graph could not be generated due to missing data"
    },
    "question": "What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?",
    "actionable_insight": "No actionable insight could be generated due to missing data"
}

{'data_type': 'comparative',
 'insight': "The analysis could not be completed because the column 'processing_time_hours' was not found in the dataset, indicating either missing or incorrectly named data",
 'insight_value': {},
 'plot': {'description': 'The graph could not be generated due to missing data'},
 'question': 'What are the differences in processing times for expenses in various states such as Processed, Declined, Submitted, and Pending?',
 'actionable_insight': 'No actionable insight could be generated due to missing data'}

### **Question 3: How many instances of any repeated identical expense claims are there?**


#### Frequency Distribution of Repeated Expense Claims

This chart analyzes the frequency of repeated identical expense claims to surface potential anomalies. It focuses on claims submitted by the same user, in the same category, and for the same amount. The histogram shows the distribution of these frequencies, with red bars for the groups that exceed the repetition threshold.



In [7]:
# import matplotlib.pyplot as plt
# import pandas as pd

# # Group by user, category, and amount to count occurrences
# grouped_data = flag_data.groupby(['user', 'category', 'amount']).size().reset_index(name='frequency')

# # Filter out normal entries to focus on potential anomalies
# potential_fraud = grouped_data[grouped_data['frequency'] > 3]  # Arbitrary threshold, adjust based on your data

# # Plot histogram of frequencies
# plt.figure(figsize=(10, 6))
# plt.hist(potential_fraud['frequency'], bins=30, color='red', alpha=0.7)
# plt.title('Distribution of Repeated Claims Frequency')
# plt.xlabel('Frequency of Same Amount Claims by Same User in Same Category')
# plt.ylabel('Count of Such Incidents')
# plt.grid(True)
# plt.show()
print("N/A")

N/A
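
The counting step behind this question can be illustrated without the plot. The sketch below assumes `user`, `category`, and `amount` columns as in the commented-out cell, uses made-up rows, and treats any group with more than one row as a repeat (the cell above uses a stricter, arbitrary threshold of 3).

```python
import pandas as pd

# Made-up rows for illustration only; not taken from flag-88.csv.
toy = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_a", "user_b", "user_b", "user_c"],
        "category": ["Travel", "Travel", "Travel", "Meals", "Meals", "Travel"],
        "amount": [500, 500, 500, 40, 45, 500],
    }
)

keys = ["user", "category", "amount"]

# One row per (user, category, amount) combination with its occurrence count.
frequency = toy.groupby(keys).size().reset_index(name="frequency")

# Keep only combinations that occur more than once.
repeated = frequency[frequency["frequency"] > 1]

n_repeated_groups = len(repeated)
# Claims beyond the first one in each repeated group.
n_extra_claims = int((repeated["frequency"] - 1).sum())
print(n_repeated_groups, n_extra_claims)
```

Reporting both the number of repeated groups and the number of surplus claims keeps the answer unambiguous, since "instances" could mean either.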


#### Generate JSON Description for the Insight

In [2]:
{
    "data_type": "frequency",
    "insight": "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the dataset (flag_data)",
    "insight_value": {},
    "plot": {
        "description": "The code attempted to create a histogram showing the distribution of repeated claims frequency, but failed due to missing data. The intended visualization would have shown the frequency of identical expense claims made by the same user in the same category"
    },
    "question": "How many instances of any repeated identical expense claims are there?",
    "actionable_insight": "No actionable insight could be generated due to missing data"
}

{'data_type': 'frequency',
 'insight': "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the dataset (flag_data)",
 'insight_value': {},
 'plot': {'description': 'The code attempted to create a histogram showing the distribution of repeated claims frequency, but failed due to missing data. The intended visualization would have shown the frequency of identical expense claims made by the same user in the same category'},
 'question': 'How many instances of any repeated identical expense claims are there?',
 'actionable_insight': 'No actionable insight could be generated due to missing data'}

### **Question 4: Which users are involved in the frequent cases?**


#### Plot repeated expense claims by user and category

This plot visualizes repeated expense claims across various categories, highlighting users involved in frequent submissions. Each dot represents a unique combination of user, category, and expense amount, with the size of the dot proportional to the frequency of claims.


In [9]:
# import matplotlib.pyplot as plt

# # Assume flag_data includes 'user', 'amount', 'category' columns
# # Group data by user, category, and amount to count frequencies
# grouped_data = flag_data.groupby(['user', 'category', 'amount']).size().reset_index(name='count')

# # Filter to only include cases with more than one claim (to highlight potential fraud)
# repeated_claims = grouped_data[grouped_data['count'] > 1]

# # Create a scatter plot with sizes proportional to the count of claims
# plt.figure(figsize=(14, 8))
# colors = {'Travel': 'blue', 'Meals': 'green', 'Accommodation': 'red', 'Miscellaneous': 'purple'}  # Add more categories as needed
# for ct in repeated_claims['category'].unique():
#     subset = repeated_claims[repeated_claims['category'] == ct]
#     plt.scatter(subset['user'], subset['amount'], s=subset['count'] * 100,  # Increased size factor for better visibility
#                 color=colors.get(ct, 'gray'), label=f'Category: {ct}', alpha=0.6)

# # Customizing the plot
# plt.title('Repeated Expense Claims by User and Category')
# plt.xlabel('User')
# plt.ylabel('Amount ($)')
# plt.legend(title='Expense Categories')
# plt.xticks(rotation=45)  # Rotate x-axis labels for better readability
# plt.grid(True, which='both', linestyle='--', linewidth=0.5, alpha=0.7)

# # Highlighting significant cases
# # Let's annotate the specific user found in your description
# for i, row in repeated_claims.iterrows():
#     if row['user'] == 'Mamie Mcintee' and row['amount'] == 8000:
#         plt.annotate(f"{row['user']} (${row['amount']})", (row['user'], row['amount']),
#                      textcoords="offset points", xytext=(0,10), ha='center', fontsize=9, color='darkred')

# # Show plot
# plt.show()
print("N/A")

N/A
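
To list the users involved, one option is to flag every row that belongs to a repeated (user, category, amount) combination and count flagged rows per user. This is a sketch under the same column-name assumptions as the cell above, with made-up rows; `DataFrame.duplicated` with `keep=False` marks all members of a duplicate group, including the first.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from flag-88.csv.
toy = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_b", "user_b", "user_b", "user_c"],
        "category": ["Travel", "Travel", "Meals", "Meals", "Meals", "Travel"],
        "amount": [500, 500, 40, 40, 40, 500],
    }
)

# True for every row whose (user, category, amount) occurs more than once.
is_repeat = toy.duplicated(subset=["user", "category", "amount"], keep=False)

# Number of repeated claims per user, most frequent first.
repeats_per_user = (
    toy[is_repeat].groupby("user").size().sort_values(ascending=False)
)
print(repeats_per_user)
```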


#### Generate JSON Description for the Insight

In [10]:
{
    "data_type": "frequency",
    "insight": "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame",
    "insight_value": {},
    "plot": {
        "description": "A scatter plot was attempted to visualize repeated expense claims by user and category, with point sizes representing frequency of claims, but failed due to missing data"
    },
    "question": "Which users are involved in the frequent cases?",
    "actionable_insight": "Before proceeding with the analysis, verify that the flag_data DataFrame contains the required 'user' column and ensure data integrity"
}

{'data_type': 'frequency',
 'insight': "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame",
 'insight_value': {},
 'plot': {'description': 'A scatter plot was attempted to visualize repeated expense claims by user and category, with point sizes representing frequency of claims, but failed due to missing data'},
 'question': 'Which users are involved in the frequent cases?',
 'actionable_insight': "Before proceeding with the analysis, verify that the flag_data DataFrame contains the required 'user' column and ensure data integrity"}

### **Question 5: What department and categories are most commonly involved in these repeated claims?**


#### Plot distribution of expense claims by department and category for Mamie Mcintee

This bar graph displays the distribution of Mamie Mcintee's expense claims across different departments and categories, illustrating the specific areas where repeated claims are most frequent. Each color represents a different expense category, allowing for a clear view of which combinations are most problematic.


In [11]:
# import matplotlib.pyplot as plt
# import pandas as pd

# # Assuming 'flag_data' includes 'user', 'department', 'amount', 'category' columns
# # and it's already loaded with the data

# # Filter for the specific user
# user_data = flag_data[flag_data['user'] == 'Mamie Mcintee']

# # Group data by department and category to count frequencies
# department_category_counts = user_data.groupby(['department', 'category']).size().unstack(fill_value=0)

# # Plotting
# plt.figure(figsize=(12, 7))
# department_category_counts.plot(kind='bar', stacked=True, color=['blue', 'green', 'red', 'purple', 'orange'], alpha=0.7)
# plt.title('Distribution of Expense Claims by Department and Category for Mamie Mcintee')
# plt.xlabel('Department')
# plt.ylabel('Number of Claims')
# plt.xticks(rotation=0)  # Keep the department names horizontal for better readability
# plt.legend(title='Expense Categories')
# plt.grid(True, which='both', linestyle='--', linewidth=0.5)
# plt.show()
print("N/A")

N/A
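
The department and category breakdown can also be computed over all repeated claims rather than a single user. The sketch below assumes `user`, `department`, `category`, and `amount` columns and uses made-up rows; it cross-tabulates only the rows flagged as repeats.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from flag-88.csv.
toy = pd.DataFrame(
    {
        "user": ["user_a", "user_a", "user_a", "user_b", "user_b", "user_c"],
        "department": ["Sales", "Sales", "Sales", "IT", "IT", "IT"],
        "category": ["Travel", "Travel", "Travel", "Assets", "Assets", "Travel"],
        "amount": [500, 500, 500, 900, 900, 120],
    }
)

# Flag rows whose (user, category, amount) combination occurs more than once.
is_repeat = toy.duplicated(subset=["user", "category", "amount"], keep=False)

# Count repeated claims for each department and category pair.
dept_category_counts = pd.crosstab(
    toy.loc[is_repeat, "department"], toy.loc[is_repeat, "category"]
)
print(dept_category_counts)
```

The resulting table has the same shape as the `unstack` result in the cell above, so it could feed the same stacked bar plot.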


#### Generate JSON Description for the Insight

In [1]:
{
    "data_type": "distribution",
    "insight": "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame",
    "insight_value": {},
    "plot": {
        "description": "No plot was generated due to missing data"
    },
    "question": "What department and categories are most commonly involved in these repeated claims?",
    "actionable_insight": "No actionable insight could be generated due to missing data"
}

{'data_type': 'distribution',
 'insight': "The analysis could not be completed due to a KeyError indicating that the 'user' column is missing from the flag_data DataFrame",
 'insight_value': {},
 'plot': {'description': 'No plot was generated due to missing data'},
 'question': 'What department and categories are most commonly involved in these repeated claims?',
 'actionable_insight': 'No actionable insight could be generated due to missing data'}

### Updated Summary of Findings (Flag 88):

1. **Pattern Recognition:** The analysis was meant to identify patterns in expense submissions that may indicate potential fraud or policy abuse. However, the loaded dataset lacks key columns such as 'department', 'user', and 'processing_time_hours', which are essential for conducting the analysis.

2. **Insight into User Behavior:** No analysis could be performed due to the missing 'user' column. This column is crucial for identifying repeated identical expense claims by individual users.

3. **State-Based Processing Time Analysis:** The analysis could not be completed because the 'processing_time_hours' column is missing from the dataset. This column is necessary to compare processing times for expenses in various states such as Processed, Declined, Submitted, and Pending.

4. **Expense Distribution by Department:** The analysis could not be completed because the 'department' column is missing from the dataset. This column is needed to plot the distribution of expenses across different departments.