# Challenge

Another approach to identifying fraudulent transactions is to look for outliers in the data. Standard deviation or quartiles are often used to detect outliers. Using this starter notebook, code two Python functions:

* One that uses standard deviation to identify anomalies for any cardholder.

* Another that uses interquartile range to identify anomalies for any cardholder.

## Identifying Outliers using Standard Deviation

For help with outlier detection, read the following articles:

* How to Calculate Outliers

* Removing Outliers Using Standard Deviation in Python

* How to Use Statistics to Identify Outliers in Data

In [197]:
# Initial imports
import pandas as pd
import numpy as np
import random
from sqlalchemy import create_engine


In [198]:
# Create a connection to the database
# Had to change password from "postgres" to "helloWorld"
engine = create_engine("postgresql://postgres:helloWorld@localhost:5432/fraud_detection3")


# Find anomalous transactions for 3 random card holders

- Isolate transactions for each person
- Look for outliers within those series of transactions for each person.
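
The first step, isolating one person's transactions, can be sketched as a small helper that takes the cardholder id as a bound query parameter instead of editing the SQL text for each person. This is only an illustration under stated assumptions: the in-memory SQLite database, the simplified single-table `transactions` schema, and the `load_transactions` name are hypothetical stand-ins for the real PostgreSQL tables, and the `?` placeholder is SQLite's syntax (other drivers use a different placeholder style).

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real database (hypothetical, simplified schema)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transactions (date TEXT, amount REAL, cardholder_id INTEGER);
INSERT INTO transactions VALUES
    ('2018-01-02 16:14:55', 3.12, 1),
    ('2018-01-07 07:33:17', 7.07, 8),
    ('2018-01-10 13:41:23', 11.50, 1);
""")

def load_transactions(conn, cardholder_id):
    """Return one cardholder's transactions, passing the id as a bound parameter."""
    query = (
        "SELECT date, amount FROM transactions "
        "WHERE cardholder_id = ? ORDER BY date"
    )
    return pd.read_sql(query, conn, params=(cardholder_id,), parse_dates=["date"])

holder_1 = load_transactions(conn, 1)
print(holder_1)
```

Calling the helper with a different id returns that person's rows without duplicating the query string.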

Let's start with the person whose card_holder_table.id equals 1.

In [199]:
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 1;
        """
cc_holder_1_transactions = pd.read_sql(query2, engine)

cc_holder_1_transactions

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2
0,2018-01-02 16:14:55,3.12,3517111172421930,21,1,Robert Johnson,Robertson-Smith,4,pub
1,2018-01-10 13:41:23,11.50,3517111172421930,49,1,Robert Johnson,"Davis, Lowe and Baxter",5,food truck
2,2018-01-11 19:36:21,1.72,4761049645711555811,99,1,Robert Johnson,"Bond, Lewis and Rangel",1,restaurant
3,2018-01-14 13:30:29,10.94,3517111172421930,19,1,Robert Johnson,Santos-Fitzgerald,4,pub
4,2018-01-15 10:27:56,15.51,4761049645711555811,8,1,Robert Johnson,Russell-Thomas,1,restaurant
...,...,...,...,...,...,...,...,...,...
128,2018-12-20 03:22:04,14.25,4761049645711555811,85,1,Robert Johnson,Patton-Rivera,3,bar
129,2018-12-21 06:05:54,18.67,4761049645711555811,150,1,Robert Johnson,Johnson and Sons,2,coffee shop
130,2018-12-23 05:43:37,18.17,3517111172421930,14,1,Robert Johnson,Osborne-Page,2,coffee shop
131,2018-12-30 19:28:21,11.26,4761049645711555811,106,1,Robert Johnson,Carter-Blackwell,4,pub


In [200]:
# Grab only the quantitative columns; the index is set to date in a later cell.
numerical_df = cc_holder_1_transactions[['date', 'amount']]

We will use the transaction amounts to determine outliers. Another way could be to cross-reference the times at which transactions occur with their amounts. For example, suppose the person makes the same large purchase each week at a given time. Although it is larger than average, it may not need to be considered an outlier because, given the date, it is a recurring data point in our dataset.

We can also use the time alone to search for outliers. We might ask, "Does this transaction that happened at 2am on a Saturday make sense? It seems to be an outlier, since most of the CC transactions cut off by 11pm, even on weekends." Given its time, that early-morning transaction can be labeled an outlier.
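
As a rough sketch of the time-only idea, the snippet below flags transactions whose hour falls inside a quiet window. The toy data, the `flag_odd_hours` name, and the 00:00 to 05:59 window are assumptions made for illustration; they are not derived from the database.

```python
import pandas as pd

def flag_odd_hours(dates, start_hour=0, end_hour=6):
    """Return a boolean Series: True where the hour is in [start_hour, end_hour)."""
    hours = pd.to_datetime(dates).dt.hour
    return (hours >= start_hour) & (hours < end_hour)

# Toy transactions (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "date": ["2018-01-02 16:14:55", "2018-01-06 02:05:10", "2018-01-10 13:41:23"],
    "amount": [3.12, 250.00, 11.50],
})
toy["odd_hour"] = flag_odd_hours(toy["date"])
print(toy)
```

A real version would probably learn the quiet window from each cardholder's own history rather than hard-coding it.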

In [201]:
# Using the amount to determine outliers.
# First convert the date column to datetime and set it as the index.
# Building a new frame with assign() avoids pandas' SettingWithCopyWarning.
numerical_df = numerical_df.assign(date=pd.to_datetime(numerical_df['date'])).set_index('date')
numerical_df.head()


date,amount
2018-01-02 16:14:55,3.12
2018-01-10 13:41:23,11.5
2018-01-11 19:36:21,1.72
2018-01-14 13:30:29,10.94
2018-01-15 10:27:56,15.51


In [202]:
# Write a function that locates outliers using standard deviation
# (written with help from ChatGPT). We would refine this function if needed.
def find_outliers(data, threshold=2):
    """
    Find outliers in a dataset using standard deviation.

    Parameters:
    - data: pandas Series or DataFrame
    - threshold: number of standard deviations from the mean to consider as outliers

    Returns:
    - outliers: pandas Series or DataFrame with True for outliers, False otherwise
    """
    mean_val = data.mean()
    std_dev = data.std()
    lower_bound = mean_val - threshold * std_dev
    upper_bound = mean_val + threshold * std_dev

    outliers = (data < lower_bound) | (data > upper_bound)
    return outliers

# Find outliers using the function
outliers = find_outliers(numerical_df, threshold=2)

display(outliers.head())
display(outliers.tail())




date,amount
2018-01-02 16:14:55,False
2018-01-10 13:41:23,False
2018-01-11 19:36:21,False
2018-01-14 13:30:29,False
2018-01-15 10:27:56,False


date,amount
2018-12-20 03:22:04,False
2018-12-21 06:05:54,False
2018-12-23 05:43:37,False
2018-12-30 19:28:21,False
2018-12-30 23:23:09,True
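
To see what the threshold is doing, here is a small worked example on made-up amounts (the numbers are hypothetical, not taken from the database). It repeats the same rule as `find_outliers` and also prints each value's z-score, that is, its distance from the mean in units of standard deviation. Note that pandas' `std()` computes the sample standard deviation (`ddof=1`) by default.

```python
import pandas as pd

def find_outliers(data, threshold=2):
    """Same rule as above: flag values beyond `threshold` standard deviations."""
    mean_val = data.mean()
    std_dev = data.std()  # sample standard deviation (ddof=1) by default
    lower_bound = mean_val - threshold * std_dev
    upper_bound = mean_val + threshold * std_dev
    return (data < lower_bound) | (data > upper_bound)

# Hypothetical amounts: nine small purchases and one large one
amounts = pd.Series([10, 12, 11, 9, 13, 10, 11, 12, 10, 200], dtype=float)

z_scores = (amounts - amounts.mean()) / amounts.std()
flags = find_outliers(amounts, threshold=2)

print(pd.DataFrame({"amount": amounts, "z_score": z_scores.round(2), "outlier": flags}))
```

Only the large purchase lies more than two standard deviations from the mean, so it is the only value flagged.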


In [203]:
# Reset index so that we can merge on 'date' column
outliers.reset_index(inplace=True)

# Now we can reference the original df at the times we believe the outliers coincide with.
final_df = pd.merge(cc_holder_1_transactions, outliers, on=['date'], how='left')


# Capture only rows where outliers exist.
outlier_rows = final_df[final_df['amount_y'] == True]

outlier_rows

Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
6,2018-01-24 13:17:19,1691.0,4761049645711555811,14,1,Robert Johnson,Osborne-Page,2,coffee shop,True
70,2018-07-31 05:15:17,1302.0,4761049645711555811,111,1,Robert Johnson,Padilla-Clements,2,coffee shop,True
79,2018-09-04 01:35:39,1790.0,4761049645711555811,43,1,Robert Johnson,Wallace and Sons,2,coffee shop,True
80,2018-09-06 08:28:55,1017.0,4761049645711555811,135,1,Robert Johnson,"Jacobs, Torres and Walker",3,bar,True
81,2018-09-06 21:55:02,1056.0,4761049645711555811,36,1,Robert Johnson,Hamilton-Mcfarland,1,restaurant,True
91,2018-09-26 08:48:40,1060.0,4761049645711555811,134,1,Robert Johnson,"Jenkins, Peterson and Beck",1,restaurant,True
116,2018-11-27 17:27:34,1660.0,4761049645711555811,29,1,Robert Johnson,Browning-Cantu,4,pub,True
121,2018-12-07 07:22:03,1894.0,4761049645711555811,9,1,Robert Johnson,"Curry, Scott and Richardson",3,bar,True
132,2018-12-30 23:23:09,1033.0,4761049645711555811,57,1,Robert Johnson,Thornton-Williams,4,pub,True


The above dataframe shows the outliers for cardholder_id 1. It is interesting that the 'find_outliers' function flags so many "outliers". This is where we could dig deeper into the data and update our function to take into account both the 'amount' and the 'date/time' at which the outlier occurred. My thinking, based on the above dataframe, is that our cardholder, Robert Johnson, runs some type of business where he must buy inventory every few months or so. If we really wanted to determine whether this set of 'outliers' is truly fraudulent, we could contact the cardholder. To further ensure that Robert Johnson is not lying to us to recoup money spent on business expenses, we could then seek camera footage showing where the credit card purchases were made, and by whom.
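
One way to probe the recurring-purchase hypothesis is to look at how the flagged transactions are spaced in time. The sketch below copies the dates and amounts from the output above into a small DataFrame (the `flagged`, `days_since_prev`, and `per_month` names are introduced here for illustration), then computes the gap in days between consecutive flagged transactions and the count per calendar month.

```python
import pandas as pd

# Dates and amounts of the flagged rows for cardholder 1, copied from the output above
flagged = pd.DataFrame({
    "date": pd.to_datetime([
        "2018-01-24 13:17:19", "2018-07-31 05:15:17", "2018-09-04 01:35:39",
        "2018-09-06 08:28:55", "2018-09-06 21:55:02", "2018-09-26 08:48:40",
        "2018-11-27 17:27:34", "2018-12-07 07:22:03", "2018-12-30 23:23:09",
    ]),
    "amount": [1691.0, 1302.0, 1790.0, 1017.0, 1056.0, 1060.0, 1660.0, 1894.0, 1033.0],
})

# Whole days elapsed since the previous flagged transaction (NaN for the first row)
flagged["days_since_prev"] = flagged["date"].diff().dt.days

# Number of flagged transactions per calendar month
per_month = flagged.groupby(flagged["date"].dt.to_period("M")).size()

print(flagged)
print(per_month)
```

Evenly spaced gaps would support the inventory explanation, while a tight cluster of large charges would deserve a closer look.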


Let's keep looking through cardholders to try to find outliers.
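
Rather than re-running one query per cardholder, the same rule could be applied to every cardholder at once if all transactions were loaded into a single DataFrame. The sketch below assumes that setup: the toy data and the `flag_outliers_by_cardholder` name are hypothetical, and each amount is compared with the mean and sample standard deviation of its own cardholder's amounts.

```python
import pandas as pd

def flag_outliers_by_cardholder(transactions, threshold=2):
    """Flag amounts more than `threshold` sample standard deviations
    away from the mean of the same cardholder's amounts."""
    grouped = transactions.groupby("cardholder_id")["amount"]
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    deviation = (transactions["amount"] - mean).abs()
    return deviation > threshold * std

# Hypothetical data: cardholder 1 has one large purchase, cardholder 2 has none
toy = pd.DataFrame({
    "cardholder_id": [1] * 10 + [2] * 10,
    "amount": [10, 12, 11, 9, 13, 10, 11, 12, 10, 200,
               5, 6, 5, 6, 5, 6, 5, 6, 5, 6],
})
toy["outlier"] = flag_outliers_by_cardholder(toy)
print(toy[toy["outlier"]])
```

Because the statistics are computed per group, a large amount for one cardholder does not affect what counts as unusual for another.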

In [204]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 2
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 2;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y


No outliers for cardholder 2. Let's try cardholder 20.

In [205]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 20
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 20;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
11,2018-01-14 06:19:11,21.11,3535651398328201,74,20,Kevin Spencer,Skinner-Williams,4,pub,True
68,2018-05-11 12:43:50,20.56,4586962917519654607,90,20,Kevin Spencer,Brown-Cunningham,4,pub,True
135,2018-08-26 07:15:18,23.13,4506405265172173,147,20,Kevin Spencer,Marshall-Lopez,5,food truck,True
154,2018-10-07 08:16:54,20.44,4586962917519654607,89,20,Kevin Spencer,Kelley-Roberts,5,food truck,True
173,2018-11-09 19:38:36,20.27,3535651398328201,75,20,Kevin Spencer,Martinez Group,1,restaurant,True


Nothing substantial. Purchases below $25, all at eateries/pubs, spaced out over months. 

In [206]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 5
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 5;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y


None for 5.

In [207]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 10
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 10;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
147,2018-08-28 07:17:14,20.71,4165305432349489280,128,10,Matthew Gutierrez,"Pitts, Salinas and Garcia",2,coffee shop,True


Given that we have only one outlier here, I believe we have evidence that the above transaction is anomalous.

So this is 1 anomalous transaction. Let's look for a few more.

Now cardholder 11.

In [208]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 11
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 11;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
82,2018-05-23 14:58:23,20.7,4027907156459098,61,11,Brandon Pineda,"Richardson, Smith and Jordan",5,food truck,True


Given that we have only one outlier here, I believe we have evidence that the above transaction is anomalous.

Let's try card holder 15.

In [209]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 15
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 15;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y


None. How about 18?

In [210]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 18
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 18;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
18,2018-02-19 22:48:25,1839.0,344119623920892,95,18,Malik Carlson,Baxter-Smith,1,restaurant,True
34,2018-04-03 03:23:37,1077.0,344119623920892,100,18,Malik Carlson,Townsend-Anderson,1,restaurant,True
49,2018-06-03 20:02:28,1814.0,344119623920892,123,18,Malik Carlson,"Boone, Davis and Townsend",4,pub,True
71,2018-07-18 09:19:08,974.0,344119623920892,19,18,Malik Carlson,Santos-Fitzgerald,4,pub,True
90,2018-09-10 22:49:41,1176.0,344119623920892,72,18,Malik Carlson,Lopez-Kelly,1,restaurant,True
117,2018-11-17 05:30:43,1769.0,344119623920892,18,18,Malik Carlson,Romero-Jordan,5,food truck,True
123,2018-12-13 12:09:58,1154.0,344119623920892,8,18,Malik Carlson,Russell-Thomas,1,restaurant,True


A few here, but since there are multiple flagged transactions rather than a single one, we continue searching.

Let's try 17.

In [211]:
# Let's continue down the list of credit card holders and see if we can pick up any evidence of anomalous transactions.
# Cardholder 17
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 17;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

# Copy the slice so the date conversion does not trigger SettingWithCopyWarning
numerical_df = cc_holder_transactions[['date', 'amount']].copy()
numerical_df['date'] = pd.to_datetime(numerical_df['date'])
numerical_df.set_index('date', inplace=True)
outliers = find_outliers(numerical_df, threshold=2)
outliers.reset_index(inplace=True)
final_df = pd.merge(cc_holder_transactions, outliers, on=['date'], how='left')
outlier_rows = final_df[final_df['amount_y'] == True]
outlier_rows


Unnamed: 0,date,amount_x,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,amount_y
8,2018-03-12 05:29:57,22.49,6011987562414062,147,17,Michael Carroll,Marshall-Lopez,5,food truck,True


There we go. 

So, given our function to compute outliers using standard deviation, the following 3 transactions are those for which we found evidence of being anomalous:

2018-05-23 14:58:23 for CC holder 11

2018-08-28 07:17:14 for CC holder 10

2018-03-12 05:29:57 for CC holder 17


We could then dig deeper. There are many ways to improve this approach.

## Identifying Outliers Using Interquartile Range

Let's check cardholders 8, 10, and 11 and see how the results compare when using the IQR rather than the SD to find outliers.

In [216]:
# Outlier function from ChatGPT
def find_outliers_iqr(data, column):
    """
    Find outliers in a specific column using the interquartile range (IQR).

    Parameters:
    - data: pandas DataFrame
    - column: str, the column in which to find outliers

    Returns:
    - outliers: pandas Series with True for outliers, False otherwise
    """
    # Select the specified column
    column_data = data[column]

    # Calculate the first and third quartiles
    Q1 = column_data.quantile(0.25)
    Q3 = column_data.quantile(0.75)

    # Calculate the interquartile range (IQR)
    IQR = Q3 - Q1

    # Identify outliers using the IQR method
    outliers = (column_data < Q1 - 1.5 * IQR) | (column_data > Q3 + 1.5 * IQR)

    return outliers
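
As a worked example of the IQR rule, the sketch below computes the two fences (Q1 - 1.5 * IQR and Q3 + 1.5 * IQR) on made-up amounts and compares the result with the 2 SD rule on the same values. The data, the `iqr_fences` helper, and its `k` parameter are hypothetical and introduced only for illustration; the function above hard-codes the multiplier 1.5.

```python
import pandas as pd

def iqr_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k * IQR and Q3 + k * IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical amounts: mostly small, one moderately large, one very large
amounts = pd.Series([10, 12, 11, 9, 13, 10, 11, 12, 25, 200], dtype=float)

# IQR rule
lower, upper = iqr_fences(amounts)
iqr_flags = (amounts < lower) | (amounts > upper)

# 2 SD rule on the same values, for comparison
mean, std = amounts.mean(), amounts.std()
sd_flags = (amounts < mean - 2 * std) | (amounts > mean + 2 * std)

print(f"IQR fences: [{lower}, {upper}]")
print("Flagged by IQR: ", amounts[iqr_flags].tolist())
print("Flagged by 2 SD:", amounts[sd_flags].tolist())
```

On this toy series the IQR rule flags both 25 and 200, while the 2 SD rule flags only 200, because the very large value inflates the standard deviation. On other data the two rules can disagree in the opposite direction, so neither is strictly the more sensitive one.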



In [237]:
# Cardholder 8
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 8;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

cc_holder_transactions

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2
0,2018-01-07 07:33:17,7.07,4834483169177062,14,8,Michael Floyd,Osborne-Page,2,coffee shop
1,2018-01-11 10:51:08,1.24,30063281385429,7,8,Michael Floyd,Gomez-Kelly,4,pub
2,2018-01-11 12:46:01,3.97,4834483169177062,44,8,Michael Floyd,Little-Floyd,4,pub
3,2018-01-15 03:37:27,1.47,4834483169177062,139,8,Michael Floyd,Kidd-Lopez,5,food truck
4,2018-01-20 03:18:02,3.38,30063281385429,93,8,Michael Floyd,Rodriguez-Parker,5,food truck
...,...,...,...,...,...,...,...,...,...
114,2018-12-12 20:53:52,2.75,30063281385429,81,8,Michael Floyd,Fowler and Sons,5,food truck
115,2018-12-18 13:27:30,10.59,4834483169177062,27,8,Michael Floyd,Horn Ltd,2,coffee shop
116,2018-12-24 18:01:29,10.80,30063281385429,38,8,Michael Floyd,Brown LLC,3,bar
117,2018-12-26 18:02:58,1.20,4834483169177062,65,8,Michael Floyd,"Allen, Ramos and Carroll",4,pub


In [238]:
# Call the find_outliers_iqr function on the above dataset, looking for outliers in the "amount" column
outliers = find_outliers_iqr(cc_holder_transactions, 'amount')

# Save those True/False values to the dataset, then keep only the outlier rows
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers'] == True]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


Based on the IQR method, there are no apparent outliers for cardholder 8.

Let's try cardholder 10.


In [244]:
# Cardholder 10
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 10;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers'] == True]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


None for 10 using IQR method. 

Now 11.

In [245]:
# Cardholder 11
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 11;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers'] == True]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


In [249]:
# Cardholder 12
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 12;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers'] == True]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers
2,2018-01-02 23:27:46,1031.0,501879657465,95,12,Megan Price,Baxter-Smith,1,restaurant,True
18,2018-01-23 06:29:37,1678.0,501879657465,92,12,Megan Price,Garcia-White,4,pub,True
48,2018-03-12 00:44:01,1530.0,501879657465,20,12,Megan Price,Kim-Lopez,2,coffee shop,True
54,2018-03-20 10:19:25,852.0,501879657465,35,12,Megan Price,Jarvis-Turner,4,pub,True
105,2018-06-21 13:16:25,1102.0,501879657465,128,12,Megan Price,"Pitts, Salinas and Garcia",2,coffee shop,True
113,2018-06-27 01:27:09,1592.0,501879657465,136,12,Megan Price,Martinez-Robinson,3,bar,True
114,2018-06-28 21:13:52,1108.0,501879657465,35,12,Megan Price,Jarvis-Turner,4,pub,True
158,2018-09-23 19:20:23,1075.0,501879657465,13,12,Megan Price,Giles and Sons,4,pub,True
195,2018-11-23 09:08:05,233.0,501879657465,47,12,Megan Price,Martin Inc,1,restaurant,True
197,2018-11-25 20:44:07,1123.0,501879657465,59,12,Megan Price,Williams Group,3,bar,True


In [250]:
# Cardholder 13
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 13;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


In [251]:
# Cardholder 14
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 14;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


In [252]:
# Cardholder 15
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 15;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


In [253]:
# Cardholder 16
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 16;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers
7,2018-01-11 13:20:31,229.0,5570600642865857,115,16,Crystal Clark,Williams Inc,4,pub,True
10,2018-01-22 08:07:03,1131.0,5570600642865857,144,16,Crystal Clark,"Walker, Deleon and Wolf",1,restaurant,True
28,2018-02-17 01:27:19,1430.0,5570600642865857,71,16,Crystal Clark,Greene LLC,1,restaurant,True
43,2018-03-05 08:26:08,1617.0,5570600642865857,4,16,Crystal Clark,Mccarty-Thomas,3,bar,True
89,2018-05-29 02:55:08,1203.0,5570600642865857,62,16,Crystal Clark,"Cooper, Carpenter and Jackson",5,food truck,True
99,2018-06-17 15:59:45,1103.0,5570600642865857,23,16,Crystal Clark,"Wilson, Roberts and Davenport",5,food truck,True
109,2018-07-04 17:28:06,89.0,5570600642865857,112,16,Crystal Clark,Greer Inc,3,bar,True
126,2018-07-26 23:02:51,1803.0,5570600642865857,68,16,Crystal Clark,Ramirez-Carr,2,coffee shop,True
172,2018-10-19 12:32:37,178.0,5570600642865857,28,16,Crystal Clark,Hess-Fischer,5,food truck,True
175,2018-10-23 22:47:13,393.0,5570600642865857,148,16,Crystal Clark,"Huerta, Keith and Walters",5,food truck,True


In [254]:
# Cardholder 17
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 17;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers


In [255]:
# Cardholder 18
query2 = """
SELECT transaction_table.date, transaction_table.amount, transaction_table.card, transaction_table.id_merchant,
credit_card_table.cardholder_id, card_holder_table.name, merchant_table.name, merchant_table.id_merchant_category,
merchant_category_table.name
FROM transaction_table
JOIN merchant_table ON transaction_table.id_merchant = merchant_table.id
JOIN merchant_category_table ON merchant_table.id_merchant_category = merchant_category_table.id
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 18;
        """
cc_holder_transactions = pd.read_sql(query2, engine)

outliers = find_outliers_iqr(cc_holder_transactions, 'amount')
cc_holder_transactions['outliers'] = outliers
outlier_rows = cc_holder_transactions[cc_holder_transactions['outliers']]
outlier_rows

Unnamed: 0,date,amount,card,id_merchant,cardholder_id,name,name.1,id_merchant_category,name.2,outliers
2,2018-01-07 01:10:54,175.0,344119623920892,12,18,Malik Carlson,"Bell, Gonzalez and Lowe",4,pub,True
3,2018-01-08 11:15:36,333.0,344119623920892,95,18,Malik Carlson,Baxter-Smith,1,restaurant,True
18,2018-02-19 22:48:25,1839.0,344119623920892,95,18,Malik Carlson,Baxter-Smith,1,restaurant,True
34,2018-04-03 03:23:37,1077.0,344119623920892,100,18,Malik Carlson,Townsend-Anderson,1,restaurant,True
49,2018-06-03 20:02:28,1814.0,344119623920892,123,18,Malik Carlson,"Boone, Davis and Townsend",4,pub,True
62,2018-06-30 01:56:19,121.0,344119623920892,20,18,Malik Carlson,Kim-Lopez,2,coffee shop,True
66,2018-07-06 16:12:08,117.0,344119623920892,62,18,Malik Carlson,"Cooper, Carpenter and Jackson",5,food truck,True
71,2018-07-18 09:19:08,974.0,344119623920892,19,18,Malik Carlson,Santos-Fitzgerald,4,pub,True
87,2018-09-02 11:20:42,458.0,344119623920892,10,18,Malik Carlson,Herrera Group,1,restaurant,True
90,2018-09-10 22:49:41,1176.0,344119623920892,72,18,Malik Carlson,Lopez-Kelly,1,restaurant,True
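
The cells above repeat the same query for each cardholder, changing only the id in the `WHERE` clause. A minimal sketch of how that lookup could be wrapped in a function with a bound parameter is shown here. It is an illustration under stated assumptions, not the notebook's method: it uses an in-memory SQLite database as a stand-in for the PostgreSQL database, a reduced set of columns, made-up rows, and a hypothetical helper named `transactions_for`. The `?` placeholder is SQLite's style; the placeholder syntax for the notebook's SQLAlchemy engine would differ.

```python
import sqlite3

import pandas as pd

# Stand-in database (assumption): in-memory SQLite with a reduced schema
# and made-up rows, used only so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE credit_card_table (card TEXT, cardholder_id INTEGER);
    CREATE TABLE transaction_table (date TEXT, amount REAL, card TEXT);
    INSERT INTO credit_card_table VALUES ('A', 1), ('B', 2);
    INSERT INTO transaction_table VALUES
        ('2018-01-01', 3.5, 'A'),
        ('2018-01-02', 1200.0, 'A'),
        ('2018-01-03', 7.0, 'B');
    """
)

# The cardholder id is passed as a bound parameter instead of being
# edited into the SQL text for every cell.
QUERY = """
SELECT transaction_table.date, transaction_table.amount,
       credit_card_table.cardholder_id
FROM transaction_table
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
WHERE credit_card_table.cardholder_id = ?;
"""


def transactions_for(cardholder_id, connection):
    """Hypothetical helper: fetch one cardholder's transactions."""
    return pd.read_sql(QUERY, connection, params=(cardholder_id,))


# One DataFrame per cardholder, keyed by id.
frames = {cid: transactions_for(cid, conn) for cid in (1, 2)}
for cid, frame in frames.items():
    print(cid, len(frame))
```

With a helper like this, the outlier function could be applied inside the loop rather than in one cell per cardholder.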


Based on the above analysis, cardholders 12, 16, and 18 each appear to have a set of transactions that the IQR method flags as outliers. More analysis should be done before considering any of these transactions truly anomalous.
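
One possible starting point for that further analysis is to look at the bounds themselves. The sketch below computes the lower and upper fences of the common 1.5 × IQR rule and flags values outside them. The notebook's `find_outliers_iqr` is defined earlier and is not reproduced here, so this may differ from it; the names `iqr_fences` and `flag_outliers_iqr_sketch` and the toy amounts are hypothetical.

```python
import pandas as pd


def iqr_fences(values, k=1.5):
    """Return the (lower, upper) fences for a numeric Series.

    Assumes the common rule: Q1 - k * IQR and Q3 + k * IQR.
    """
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers_iqr_sketch(df, column, k=1.5):
    """Return a boolean Series: True where the value lies outside the fences."""
    lower, upper = iqr_fences(df[column], k)
    return (df[column] < lower) | (df[column] > upper)


# Toy amounts (made up for illustration, not taken from the database).
toy = pd.DataFrame({"amount": [2.0, 3.5, 4.0, 5.0, 6.5, 7.0, 8.0, 1500.0]})

lower, upper = iqr_fences(toy["amount"])
toy["outliers"] = flag_outliers_iqr_sketch(toy, "amount")

print(f"fences: [{lower}, {upper}]")
print(toy[toy["outliers"]])
```

Printing the fences next to the flagged rows makes it easier to judge how far each amount sits from the cardholder's typical spending before treating it as suspicious.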