# Challenge

Another way to identify fraudulent transactions is to look for outliers in the data. Standard deviation and quartiles are two common tools for detecting outliers. Using this starter notebook, code two Python functions:

* One that uses standard deviation to identify anomalies for any cardholder.

* Another that uses interquartile range to identify anomalies for any cardholder.
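
For reference, the two rules can be written as follows. Here $x$ is a transaction amount, $\mu$ and $\sigma$ are the mean and standard deviation of a cardholder's amounts, and $Q_1$ and $Q_3$ are the 25th and 75th percentiles. The multipliers $k$ and $1.5$ are common conventions rather than fixed requirements; the code in this notebook sets $k = 2$.

$$
x \text{ is an outlier if } x < \mu - k\sigma \quad \text{or} \quad x > \mu + k\sigma
$$

$$
x \text{ is an outlier if } x < Q_1 - 1.5\,\mathrm{IQR} \quad \text{or} \quad x > Q_3 + 1.5\,\mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1
$$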

## Identifying Outliers Using Standard Deviation

In [1]:
# Initial imports
import pandas as pd
import numpy as np
import plotly.express as px
import panel as pn
from sqlalchemy import create_engine

# Enable Plotly rendering in Panel
pn.extension('plotly')

In [2]:
# Create a connection to the database
engine = create_engine("postgresql://postgres:postgres@localhost:5432/FinTech_Assignment")

In [52]:
# Write function that locates outliers using standard deviation
def calc_standard_deviation(data):
    # Calculate statistics
    data_mean = np.mean(data)
    print(f"The mean is {data_mean}.")
    print()
    
    data_std = np.std(data)
    print(f"The std is {data_std}.")
    print()
    
    cut_off = data_std * 2      
    lower = data_mean - cut_off
    upper = data_mean + cut_off 
    print(f"Cut off: {round(cut_off, 2)};\n"
          f"lower limit: {round(lower, 2)}:\n"
          f"upper limit: {round(upper, 2)}")
    # Apply stats to data 
    outliers=[]    
    final_dataset=[]
    
    # A value is an outlier if it falls outside [lower, upper]; everything else is kept
    for i in data:
        if i < lower or i > upper:
            outliers.append(i)
        else:
            final_dataset.append(i)
    
    print(f"The total dataset has {len(data)} elements")
    print()
    print(f"There are {len(outliers)} outliers, which are {sorted(set(outliers))}.")
    print()
    print(f"After removing the outliers, your final list has {len(final_dataset)} elements.")
    print()
    
    # Plot data
    plot1 = px.box(data,
                   title='Original dataset',
                   width=400,
                   height=700)
    
    plot2 = px.box(final_dataset,
                   title='Dataset after removing outliers',
                   width=400,
                   height=700)
    
    rows = pn.Row(plot1, 
                  plot2)
    
    return rows
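
As a complement to the loop-based function above, the same rule can be applied with a NumPy boolean mask. This is a minimal sketch rather than the notebook's own code: `std_outliers` and its `k` parameter are hypothetical names introduced here, the sample values are made up, and plotting is left out so that the function only returns values.

```python
import numpy as np


def std_outliers(data, k=2):
    """Split data into (outliers, kept) using mean +/- k standard deviations."""
    values = np.asarray(data, dtype=float)
    mean = values.mean()
    std = values.std()  # population standard deviation (ddof=0), the same default as np.std
    lower, upper = mean - k * std, mean + k * std

    # True where a value falls outside the [lower, upper] band
    is_outlier = (values < lower) | (values > upper)
    return values[is_outlier], values[~is_outlier]


# Made-up sample: many small amounts and one very large one
sample = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 500]
outliers, kept = std_outliers(sample, k=2)
print(sorted(outliers.tolist()))
print(len(kept))
```

Raising `k` widens the band, so fewer values are flagged; lowering it does the opposite.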

In [48]:
# Test function
data = [1, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 9, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 3, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 8, 566]

calc_standard_deviation(data)

The mean is 462.0.

The std is 189.80057009992467.

Cut off: 379.6;
lower limit: 82.4:
upper limit: 841.6
The total dataset has 39 elements

There are 5 outliers, which are [1, 3, 8, 9, 20].

After removing the outliers, your final list has 34 elements.



In [49]:
# Find anomalous transactions for 3 random card holders
query_3 = """
SELECT credit_card.cardholder_id, 
	   EXTRACT(month from date) AS Month,
	   EXTRACT(day from date) AS Day,
	   transactions.amount
FROM credit_card
INNER JOIN transactions ON transactions.card=credit_card.card
WHERE cardholder_id=7 OR cardholder_id=15 OR cardholder_id=21
"""

# Load the SQL query results into a DataFrame
query_3_df = pd.read_sql(query_3, engine)

# Check data
query_3_df.columns

#print(query_3_df.dtypes)

Index(['cardholder_id', 'month', 'day', 'amount'], dtype='object')

In [50]:
# Allocate data to Card Holders
cardholder_7 = query_3_df.loc[query_3_df["cardholder_id"]==7]['amount'].astype(int)
cardholder_15 = query_3_df.loc[query_3_df["cardholder_id"]==15]['amount'].astype(int)
cardholder_21 = query_3_df.loc[query_3_df["cardholder_id"]==21]['amount'].astype(int)

print(cardholder_7.dtypes)
print(cardholder_15.dtypes)
print(cardholder_21.dtypes)

int32
int32
int32
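
One thing to keep in mind about the cast above: `astype(int)` on float values drops the fractional part rather than rounding. The sketch below uses made-up amounts (not values from the database) to show the effect; whether the `amount` column actually holds fractional values is an assumption here.

```python
import pandas as pd

# Made-up amounts for illustration only
amounts = pd.Series([1.99, 10.50, 17.84])

truncated = amounts.astype(int)   # fractional part is dropped, not rounded
as_float = amounts.astype(float)  # values are preserved as they are

print(truncated.tolist())
print(as_float.tolist())
```

If cents matter for the thresholds, keeping the amounts as floats avoids shifting every value downward before the statistics are computed.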


In [53]:
# Check outliers for Card Holder 7
calc_standard_deviation(cardholder_7)

The mean is 81.79136690647482.

The std is 313.5183602719752.

Cut off: 627.04;
lower limit: -545.25:
upper limit: 708.83
The total dataset has 139 elements

There are 6 outliers, which are [1072, 1086, 1296, 1449, 1685, 2249].

After removing the outliers, your final list has 133 elements.



In [33]:
# Check outliers for Card Holder 15
calc_standard_deviation(cardholder_15)

The mean is 9.05072463768116.

The std is 5.816421609930408.

Cut off: 11.63;
lower limit: -2.58:
upper limit: 20.68
The total dataset has 138 elements

There are 0 outliers, which are [].

After removing the outliers, your final list has 138 elements.



In [34]:
# Check outliers for Card Holder 21
calc_standard_deviation(cardholder_21)

The mean is 9.044776119402986.

The std is 5.829499955055913.

Cut off: 11.66;
lower limit: -2.61:
upper limit: 20.7
The total dataset has 67 elements

There are 0 outliers, which are [].

After removing the outliers, your final list has 67 elements.



## Identifying Outliers Using Interquartile Range

In [40]:
# Write a function that locates outliers using interquartile range
def calc_iqr(data):
    
    # Calculate statistics
    q25 = np.percentile(data, 25)
    print(f"q25 is {q25}.")
    print()

    q75 = np.percentile(data, 75)    
    print(f"q75 is {q75}.")
    print()
    
    iqr = q75 - q25
    print(f"The IQR is {iqr}.")
    print()

    cut_off = iqr * 1.5
    lower = q25-cut_off
    upper = q75 + cut_off 
    print(f"Cut off: {round(cut_off, 2)};\n"
          f"lower limit: {round(lower, 2)}:\n"
          f"upper limit: {round(upper, 2)}")
    print()
    
    # Apply stats to data 
    outliers=[]    
    final_dataset=[]
    
    # A value is an outlier if it falls outside [lower, upper]; everything else is kept
    for i in data:
        if i < lower or i > upper:
            outliers.append(i)
        else:
            final_dataset.append(i)
    
    print(f"The total dataset has {len(data)} elements")
    print()
    print(f"There are {len(outliers)} outliers, which are {sorted(set(outliers))}.")
    print()
    print(f"After removing the outliers, your final list has {len(final_dataset)} elements.")
    print()
    
    # Plot data
    plot1 = px.box(data,
                   title='Original dataset',
                   width=400,
                   height=700)
    
    plot2 = px.box(final_dataset,
                   title='Dataset after removing outliers',
                   width=400,
                   height=700)
    
    rows = pn.Row(plot1, 
                  plot2)
    
    return rows
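
The IQR rule also lends itself to a vectorized form. The following is a sketch that uses the same 1.5 times IQR cut-off as the function above; `iqr_outliers` and `factor` are hypothetical names, the sample values are made up, and plotting is omitted.

```python
import numpy as np


def iqr_outliers(data, factor=1.5):
    """Split data into (outliers, kept) using the interquartile-range rule."""
    values = np.asarray(data, dtype=float)
    q25, q75 = np.percentile(values, [25, 75])
    cut_off = factor * (q75 - q25)
    lower, upper = q25 - cut_off, q75 + cut_off

    # True where a value falls outside the [lower, upper] band
    is_outlier = (values < lower) | (values > upper)
    return values[is_outlier], values[~is_outlier]


# Made-up sample: nine small amounts and one large one
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
outliers, kept = iqr_outliers(sample)
print(outliers.tolist())
print(len(kept))
```

Because the bounds come from percentiles, a single extreme value has little influence on where the band sits.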

In [41]:
# Test function
data = [1, 386, 479, 627, 20, 523, 482, 483, 542, 699, 535, 9, 617, 577, 471, 615, 583, 441, 562, 563, 527, 453, 530, 433, 541, 585, 3, 704, 443, 569, 430, 637, 331, 511, 552, 496, 484, 8, 566]

calc_iqr(data)

q25 is 442.0.

q75 is 567.5.

The IQR is 125.5.

Cut off: 188.25;
lower limit: 253.75:
upper limit: 755.75

The total dataset has 39 elements

There are 5 outliers, which are [1, 3, 8, 9, 20].

After removing the outliers, your final list has 34 elements.



In [42]:
# Find anomalous transactions for 3 random card holders
query_3 = """
SELECT credit_card.cardholder_id, 
	   EXTRACT(month from date) AS Month,
	   EXTRACT(day from date) AS Day,
	   transactions.amount
FROM credit_card
INNER JOIN transactions ON transactions.card=credit_card.card
WHERE cardholder_id=7 OR cardholder_id=15 OR cardholder_id=21
"""

# Load the SQL query results into a DataFrame
query_3_df = pd.read_sql(query_3, engine)

# Check data
query_3_df.columns

Index(['cardholder_id', 'month', 'day', 'amount'], dtype='object')

In [43]:
# Allocate data to Card Holders
cardholder_7 = query_3_df.loc[query_3_df["cardholder_id"]==7]['amount'].astype(int)
cardholder_15 = query_3_df.loc[query_3_df["cardholder_id"]==15]['amount'].astype(int)
cardholder_21 = query_3_df.loc[query_3_df["cardholder_id"]==21]['amount'].astype(int)

print(cardholder_7.dtypes)
print(cardholder_15.dtypes)
print(cardholder_21.dtypes)

int32
int32
int32


In [44]:
# Check outliers for Card Holder 7
calc_iqr(cardholder_7)

q25 is 3.0.

q75 is 15.5.

The IQR is 12.5.

Cut off: 18.75;
lower limit: -15.75:
upper limit: 34.25

The total dataset has 139 elements

There are 10 outliers, which are [160, 233, 445, 543, 1072, 1086, 1296, 1449, 1685, 2249].

After removing the outliers, your final list has 129 elements.
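
The IQR rule flags 10 amounts for card holder 7, whereas the standard deviation rule flagged 6. One likely reason is that a few very large amounts inflate the standard deviation and widen its band, while quartiles are barely affected by extreme values. The sketch below illustrates the effect on synthetic data (not the card holder's transactions); all variable names are illustrative.

```python
import numpy as np

# Synthetic amounts: 100 small values (1..20, five times each) plus four large ones
values = np.array(list(range(1, 21)) * 5 + [150, 1200, 1800, 2200], dtype=float)

# Standard deviation rule with a 2-sigma band
mean, std = values.mean(), values.std()
std_flagged = values[(values < mean - 2 * std) | (values > mean + 2 * std)]

# IQR rule with a 1.5 * IQR band
q25, q75 = np.percentile(values, [25, 75])
iqr = q75 - q25
iqr_flagged = values[(values < q25 - 1.5 * iqr) | (values > q75 + 1.5 * iqr)]

print(sorted(std_flagged.tolist()))  # 150 stays inside the 2-sigma band
print(sorted(iqr_flagged.tolist()))  # 150 falls outside the IQR band
```

In this synthetic case the three largest values push the 2-sigma upper bound far above 150, so only the IQR rule flags it.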



In [45]:
# Check outliers for Card Holder 15
calc_iqr(cardholder_15)

q25 is 3.0.

q75 is 14.75.

The IQR is 11.75.

Cut off: 17.62;
lower limit: -14.62:
upper limit: 32.38

The total dataset has 138 elements

There are 0 outliers, which are [].

After removing the outliers, your final list has 138 elements.



In [46]:
# Check outliers for Card Holder 21
calc_iqr(cardholder_21)

q25 is 3.0.

q75 is 12.0.

The IQR is 9.0.

Cut off: 13.5;
lower limit: -10.5:
upper limit: 25.5

The total dataset has 67 elements

There are 0 outliers, which are [].

After removing the outliers, your final list has 67 elements.

