Analysis of historical credit card transactions and consumption patterns in efforts to identify possible fraudulent transactions.

Part 2
Your CFO has also requested detailed trends data on specific card holders. Use the starter notebook to query your database and generate visualizations that supply the requested information as follows, then add your visualizations and observations to your markdown report:

The two most important customers of the firm may have been hacked. Verify if there are any fraudulent transactions in their history. For privacy reasons, you only know that their cardholder IDs are 2 and 18.
Using hvPlot, create a line plot representing the time series of transactions over the course of the year for each cardholder separately.

Next, to better compare their patterns, create a single line plot that contains both card holders' trend data.

What difference do you observe between the consumption patterns? Does the difference suggest a fraudulent transaction? Explain your rationale.
The CEO of the biggest customer of the firm suspects that someone has used her corporate credit card without authorization in the first quarter of 2018 to pay quite expensive restaurant bills. Again, for privacy reasons, you know only that the cardholder ID in question is 25.
Using hvPlot, create a box plot, representing the expenditure data from January 2018 to June 2018 for cardholder ID 25.

Are there any outliers for cardholder ID 25? How many outliers are there per month?

Do you notice any anomalies? Describe your observations and conclusions.

Challenge
Another approach to identifying fraudulent transactions is to look for outliers in the data. Standard deviation or quartiles are often used to detect outliers.

Use the challenge starter notebook to code two Python functions:

One that uses standard deviation to identify anomalies for any cardholder.
Another that uses interquartile range to identify anomalies for any cardholder.
For help with outliers detection, read the following articles:

How to Calculate Outliers
Removing Outliers Using Standard Deviation in Python
How to Use Statistics to Identify Outliers in Data
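
As a starting point for the two challenge functions, here is a minimal sketch, not the starter notebook's solution. The function names, the default thresholds, and the sample amounts are all assumptions made for illustration. The standard-deviation check is called with a threshold of 2 because, in a sample this small, no single value can sit 3 sample standard deviations from the mean.

```python
import pandas as pd


def find_outliers_std(amounts: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Return the values lying more than n_std standard deviations from the mean."""
    mean = amounts.mean()
    std = amounts.std()  # sample standard deviation (ddof=1)
    return amounts[(amounts - mean).abs() > n_std * std]


def find_outliers_iqr(amounts: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values lying more than k * IQR outside the first or third quartile."""
    q1 = amounts.quantile(0.25)
    q3 = amounts.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return amounts[(amounts < lower) | (amounts > upper)]


# Hypothetical amounts for one cardholder (illustrative values, not query output)
sample = pd.Series(
    [10.82, 17.29, 10.91, 17.64, 19.36, 10.06, 11.38, 10.20, 1.33, 1177.0]
)

std_outliers = find_outliers_std(sample, n_std=2.0)
iqr_outliers = find_outliers_iqr(sample)
print(std_outliers)
print(iqr_outliers)
```

To apply this to any cardholder, the `amount` column of that cardholder's DataFrame could be passed in place of `sample`.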
Submission
Post a link to your GitHub repository in BootCamp Spot. The following should be included in your repo:

An image file of your ERD.
The .sql file of your table schemata.
The .sql file of your queries.
The Jupyter Notebook containing your visual data analysis.
A ReadME file containing your markdown report.
Optional: The Jupyter Notebook containing the optional challenge assignment.
Hint

For comparing time and dates, take a look at the date/time functions and operators in the PostgreSQL documentation.

# Visual Data Analysis of Fraudulent Transactions

Your CFO has also requested detailed trends data on specific card holders. Use the starter notebook to query your database and generate visualizations that supply the requested information as follows, then add your visualizations and observations to your markdown report.

The CFO of your firm has requested a report to help analyze potential fraudulent transactions. Using your newly created database, generate queries that will discover the information needed to answer the following questions, then use your repository's ReadME file to create a markdown report you can share with the CFO:

Some fraudsters hack a credit card by making several small transactions (generally less than $2.00), which are typically ignored by cardholders.

Take your investigation a step further by considering the time period in which potentially fraudulent transactions are made.
What are the top 100 highest transactions made between 7:00 am and 9:00 am?

Do you see any anomalous transactions that could be fraudulent?

Is there a higher number of fraudulent transactions made during this time frame versus the rest of the day?

If you answered yes to the previous question, explain why you think there might be fraudulent transactions during this time frame.

What are the top 5 merchants prone to being hacked using small transactions?
Create a view for each of your queries.

# LAST TO DO... figure out how to filter to between 7:00 and 9:00 am each day. Do the analysis. Compare the number of "fraudulent" transactions made from 7-9 with the rest of the day. Put the name back on the query for the merchants and see which ones have the most fraudulent transactions.
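
One possible way to handle the 7:00 to 9:00 am filter on the pandas side, rather than in SQL, is `DataFrame.between_time`, which selects rows by time of day on a `DatetimeIndex`. The sketch below uses a small hypothetical DataFrame (the rows are illustrative, not query output), and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical transactions; in the notebook this would come from pd.read_sql
transactions = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2018-01-05 07:19:27",
                "2018-01-06 02:16:41",
                "2018-03-06 07:18:09",
                "2018-06-22 08:59:59",
                "2018-12-18 13:33:37",
                "2018-12-27 18:46:57",
            ]
        ),
        "amount": [1.36, 1.33, 1334.0, 1.70, 1074.0, 1.70],
    }
).set_index("date")

# Rows whose time of day falls between 07:00 and 09:00 (both ends inclusive)
morning = transactions.between_time("07:00", "09:00")
is_morning = transactions.index.isin(morning.index)

# Top 100 highest transactions in the morning window
top_morning = morning.sort_values("amount", ascending=False).head(100)

# Count small (< $2.00) transactions inside and outside the window
small = transactions["amount"] < 2
morning_small = int((small & is_morning).sum())
other_small = int((small & ~is_morning).sum())

# The windows differ in length (2 hours vs 22 hours), so compare hourly rates
morning_rate = morning_small / 2
other_rate = other_small / 22
print(morning_small, other_small, morning_rate, other_rate)
```

Comparing hourly rates instead of raw counts avoids understating the morning window simply because it is much shorter than the rest of the day.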

In [16]:
# Initial imports
import pandas as pd
import calendar
import hvplot.pandas
from sqlalchemy import create_engine

In [20]:
# # Replace 'your_username' and 'your_password' with your actual PostgreSQL username and password.
# username = 'postgres'
# password = 'farl19'
# host = 'localhost'
# port = '5432'
# database_name = 'estate_db'
# # Create the connection URL with username and password.
# db_url = f'postgresql://{username}:{password}@{host}:{port}/{database_name}'
# # Create the database engine.
# engine = create_engine(db_url)
# # Now, you can use the 'engine' to interact with the PostgreSQL database.

In [17]:
# Create a connection to the database
# Had to change password from "postgres" to "helloWorld"
engine = create_engine("postgresql://postgres:helloWorld@localhost:5432/fraud_detection3")

# THIS WORKS WITH fraud_detection2, THE DATABASE WITHOUT THE TABLES LINKED...
# IMPLEMENT THIS WITH fraud_detection3, WITH THE DATABASES LINKED WITH PROPER KEYS.

In [86]:
# loading data for transactions under $2

# Write the query
query2 = """
SELECT merchant_table.id_merchant_category,
merchant_table.name, transaction_table.id_merchant, transaction_table.amount,
transaction_table.date, credit_card_table.card, card_holder_table.id
FROM merchant_category_table
JOIN merchant_table ON merchant_category_table.id = merchant_table.id_merchant_category
JOIN transaction_table ON merchant_table.id = transaction_table.id_merchant
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE amount<2;
        """
small_transactions = pd.read_sql(query2, engine)

small_transactions

Unnamed: 0,id_merchant_category,name,id_merchant,amount,date,card,id
0,5,Rodriguez-Parker,93,1.46,2018-01-02 02:06:21,4319653513507,25
1,1,Townsend-Anderson,100,1.39,2018-01-03 15:23:58,4962915017023706562,10
2,1,Best Inc,108,1.91,2018-01-03 21:04:28,3561072557118696,19
3,3,Atkinson Ltd,30,1.36,2018-01-05 07:19:27,344119623920892,18
4,2,"Williams, Wright and Wagner",127,1.33,2018-01-06 02:16:41,4866761290278198714,2
...,...,...,...,...,...,...,...
345,4,"Allen, Ramos and Carroll",65,1.20,2018-12-26 18:02:58,4834483169177062,8
346,5,Romero-Jordan,18,1.45,2018-12-26 19:55:23,3561072557118696,19
347,1,"Johnson, Rivas and Anderson",55,1.70,2018-12-27 18:46:57,344119623920892,18
348,3,Greer Inc,112,1.32,2018-12-27 18:47:35,4681896441519,24


In [87]:
# Isolate (or group) the transactions of each cardholder?
# Count the transactions that are less than $2.00 per cardholder.

small_transactions.groupby('id')['amount'].count()

id
1     10
2     11
3      3
4     16
5     14
6      6
7     18
8     15
9      3
10    20
11    21
12    26
13    19
14     9
15    12
16    19
17     4
18    19
19    22
20    18
21     4
22     7
23    16
24    22
25    16
Name: amount, dtype: int64

**Is there any evidence to suggest that a credit card has been hacked? Explain your rationale.**

Looking only at the count of transactions under $2 for each customer does not, by itself, provide evidence that a card has been hacked.

There is a considerable number of small transactions for various people in our dataset. Let's manipulate the data further to investigate this claim.
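
One such manipulation is to rank merchants by how many small transactions they received, which speaks to the "top 5 merchants" question. This is a rough sketch on a hypothetical sample that reuses the `name`, `amount`, and `id` column names from `small_transactions`; the rows themselves are made up for illustration.

```python
import pandas as pd

# Hypothetical rows shaped like small_transactions (values are illustrative)
small_sample = pd.DataFrame(
    {
        "name": ["Best Inc", "Greer Inc", "Best Inc", "Atkinson Ltd", "Greer Inc", "Best Inc"],
        "amount": [1.91, 1.32, 1.10, 1.36, 0.95, 1.75],
        "id": [19, 24, 19, 18, 7, 2],
    }
)

# Count small transactions per merchant and keep the five largest counts
top_merchants = (
    small_sample.groupby("name")["amount"]
    .count()
    .sort_values(ascending=False)
    .head(5)
    .rename("small_transaction_count")
)
print(top_merchants)
```

A high count alone does not prove a merchant was hacked; it only identifies where a closer look may be worthwhile.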

In [89]:
# Plot every transaction under $2 against its timestamp to look for clusters in time.
small_transactions.hvplot(
    kind='bar',
    x='date',
    y='amount',
    title='Small Transactions over Time',
    xlabel='Date',
    ylabel='Amount ($)',
)

## Data Analysis Question 1

The two most important customers of the firm may have been hacked. Verify if there are any fraudulent transactions in their history. For privacy reasons, you only know that their cardholder IDs are 2 and 18.

* Using hvPlot, create a line plot representing the time series of transactions over the course of the year for each cardholder separately. 

* Next, to better compare their patterns, create a single line plot that contains both card holders' trend data.  

* What difference do you observe between the consumption patterns? Does the difference suggest a fraudulent transaction? Explain your rationale in the markdown report.

In [38]:
# loading data for card holder 2

# We only know cardholder_id.
# Link cardholder_id in credit_card_table to card.
# Link card in credit_card_table to card in transaction_table

# Write the query
query1 = """
    SELECT transaction_table.date, transaction_table.card, transaction_table.amount
    FROM transaction_table
    JOIN credit_card_table ON transaction_table.card = credit_card_table.card
    JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
    WHERE card_holder_table.id = 2;
        """
# Create a DataFrame from the query result. HINT: Use pd.read_sql(query, engine)

cardholder_2 = pd.read_sql(query1, engine)

In [39]:
cardholder_2

Unnamed: 0,date,card,amount
0,2018-01-06 02:16:41,4866761290278198714,1.33
1,2018-01-06 05:13:20,4866761290278198714,10.82
2,2018-01-07 15:10:27,4866761290278198714,17.29
3,2018-01-10 10:07:20,675911140852,10.91
4,2018-01-16 06:29:35,675911140852,17.64
...,...,...,...
94,2018-12-13 06:21:43,4866761290278198714,19.36
95,2018-12-13 15:28:18,675911140852,10.06
96,2018-12-16 13:44:25,4866761290278198714,11.38
97,2018-12-22 23:29:09,4866761290278198714,10.20


In [40]:
# loading data for card holder 18 from the database
query1 = """
    SELECT transaction_table.date, transaction_table.card, transaction_table.amount
    FROM transaction_table
    JOIN credit_card_table ON transaction_table.card = credit_card_table.card
    JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
    WHERE card_holder_table.id = 18;
        """

cardholder_18 = pd.read_sql(query1, engine)

In [41]:
# Preview the first rows of each cardholder's transactions
print("Card holder 2's cards:")
display(cardholder_2.head(3))

print("Card holder 18's cards:")
display(cardholder_18.head(3))



Card holder 2's cards:


Unnamed: 0,date,card,amount
0,2018-01-06 02:16:41,4866761290278198714,1.33
1,2018-01-06 05:13:20,4866761290278198714,10.82
2,2018-01-07 15:10:27,4866761290278198714,17.29


Card holder 18's cards:


Unnamed: 0,date,card,amount
0,2018-01-01 23:15:10,4498002758300,2.95
1,2018-01-05 07:19:27,344119623920892,1.36
2,2018-01-07 01:10:54,344119623920892,175.0


In [42]:
# Set both dfs to have datetime format
cardholder_2['date'] = pd.to_datetime(cardholder_2['date'])
cardholder_2.set_index('date', inplace=True)

cardholder_18['date'] = pd.to_datetime(cardholder_18['date'])
cardholder_18.set_index('date', inplace=True)

In [43]:
display(cardholder_2.head(3))
display(cardholder_18.head(3))

Unnamed: 0_level_0,card,amount
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-06 02:16:41,4866761290278198714,1.33
2018-01-06 05:13:20,4866761290278198714,10.82
2018-01-07 15:10:27,4866761290278198714,17.29


Unnamed: 0_level_0,card,amount
date,Unnamed: 1_level_1,Unnamed: 2_level_1
2018-01-01 23:15:10,4498002758300,2.95
2018-01-05 07:19:27,344119623920892,1.36
2018-01-07 01:10:54,344119623920892,175.0


* Using hvPlot, create a line plot representing the time series of transactions over the course of the year for each cardholder separately. 


In [44]:
# Plot for cardholder 2
# Plotting amount by date
amount_date_2_plot = cardholder_2.hvplot(
    x='date',
    y='amount',
    title='Cardholder 2 Transactions over Time',
    xlabel='Date',
    ylabel='Amount ($)',
)

amount_date_2_plot

In [65]:
# Plot for cardholder 18
# Plotting amount by date
amount_date_18_plot = cardholder_18.hvplot(
    x='date',
    y='amount',
    title='Cardholder 18 Transactions over Time',
    xlabel='Date',
    ylabel='Amount ($)',
)

amount_date_18_plot

* Next, to better compare their patterns, create a single line plot that contains both card holders' trend data.  


In [46]:

# putting both groups of data onto the same plot
combined_plot = (amount_date_2_plot * amount_date_18_plot).opts(
    title="Transaction Amount over Time for Cardholders 2(blue) and 18(orange)",
    xlabel="Date",
    ylabel="Amount",
    legend_position='top_left',
    width=800,
    height=500
)
combined_plot

**What difference do you observe between the consumption patterns? Does the difference suggest a fraudulent transaction? Explain your rationale in the markdown report.**

Cardholder 2 has a fairly even distribution of transaction activity. At a glance, the amounts on the graph of cardholder 2's purchases stay within a narrow range, with no substantial outliers.

Cardholder 18, however, has many data points that appear to lie far above the mean. There are multiple transactions over $900 that are very far from the typical expenditures on the account. However, more analysis should be done to determine whether the customer can explain this activity before a claim of fraud can be substantiated.

The difference between the graphs does hint at fraudulent transactions because of how many outliers cardholder 18's transactions have.
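
The visual comparison can be backed up with a few summary numbers. The sketch below compares the median and maximum amount for two hypothetical series standing in for the two cardholders; the values are illustrative and the variable names are assumptions, so the real `amount` columns would need to be substituted.

```python
import pandas as pd

# Hypothetical amounts standing in for each cardholder's 'amount' column
amounts_2 = pd.Series([1.33, 10.82, 17.29, 10.91, 17.64, 19.36], name="cardholder_2")
amounts_18 = pd.Series([2.95, 1.36, 175.0, 12.4, 1839.0, 3.2], name="cardholder_18")

# One row per cardholder: typical spend, largest spend, and their ratio
summary = pd.DataFrame(
    {
        s.name: {
            "median": s.median(),
            "max": s.max(),
            "max_to_median": s.max() / s.median(),
        }
        for s in (amounts_2, amounts_18)
    }
).T
print(summary)
```

A very large ratio between the maximum and the median is one simple signal that a cardholder's history contains spikes far outside their usual spending.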

## Data Analysis Question 2

The CEO of the biggest customer of the firm suspects that someone has used her corporate credit card without authorization in the first quarter of 2018 to pay quite expensive restaurant bills. Again, for privacy reasons, you know only that the cardholder ID in question is 25.

* Using hvPlot, create a box plot, representing the expenditure data from January 2018 to June 2018 for cardholder ID 25.

* Are there any outliers for cardholder ID 25? How many outliers are there per month?

* Do you notice any anomalies? Describe your observations and conclusions in your markdown report.

In [54]:
# loading all 2018 transactions for card holder 25 (this query does not restrict the range to Jan-Jun)
# Write the query
query2 = """
SELECT transaction_table.date, transaction_table.card, transaction_table.amount, 
transaction_table.id_merchant, credit_card_table.cardholder_id
FROM transaction_table
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 25;
        """

cardholder_25 = pd.read_sql(query2, engine)

cardholder_25

Unnamed: 0,date,card,amount,id_merchant,cardholder_id
0,2018-01-02 02:06:21,4319653513507,1.46,93,25
1,2018-01-05 06:26:45,372414832802279,10.74,86,25
2,2018-01-07 14:57:23,4319653513507,2.93,137,25
3,2018-01-10 00:25:40,372414832802279,1.39,50,25
4,2018-01-14 05:02:22,372414832802279,17.84,52,25
...,...,...,...,...,...
119,2018-12-15 08:34:15,372414832802279,14.36,83,25
120,2018-12-18 13:33:37,4319653513507,1074.00,67,25
121,2018-12-19 10:41:34,372414832802279,10.14,31,25
122,2018-12-27 17:52:18,372414832802279,3.97,18,25


In [56]:
# Set index to date-time.
cardholder_25['date'] = pd.to_datetime(cardholder_25['date'])
cardholder_25.set_index('date', inplace=True)


# Take a look at the index.
cardholder_25.index

DatetimeIndex(['2018-01-02 02:06:21', '2018-01-05 06:26:45',
               '2018-01-07 14:57:23', '2018-01-10 00:25:40',
               '2018-01-14 05:02:22', '2018-01-16 02:26:16',
               '2018-01-18 12:41:06', '2018-01-21 23:04:02',
               '2018-01-30 18:31:00', '2018-01-31 05:46:43',
               ...
               '2018-12-07 17:10:58', '2018-12-08 05:53:13',
               '2018-12-11 11:42:13', '2018-12-12 16:16:21',
               '2018-12-14 18:31:29', '2018-12-15 08:34:15',
               '2018-12-18 13:33:37', '2018-12-19 10:41:34',
               '2018-12-27 17:52:18', '2018-12-30 11:05:36'],
              dtype='datetime64[ns]', name='date', length=124, freq=None)

In [57]:
# change the numeric month to month names
cardholder_25.index = cardholder_25.index.strftime('%B')  #Chat GPT
cardholder_25.index

Index(['January', 'January', 'January', 'January', 'January', 'January',
       'January', 'January', 'January', 'January',
       ...
       'December', 'December', 'December', 'December', 'December', 'December',
       'December', 'December', 'December', 'December'],
      dtype='object', name='date', length=124)

In [62]:
# Create one box plot per month using hvPlot.
# Note: this iterates over all twelve months, not only January through June.

# Month names in calendar order, matching the '%B' labels now used as the index
month_names = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December"
] #GPT

# For all months in the year...
for name in month_names:
    # select the rows whose index label matches this month
    selected_rows = cardholder_25[cardholder_25.index == name]
    # plot that month's transaction amounts as a box plot
    cool_plot = selected_rows.hvplot.box(
        y='amount',
        xlabel=name,
        ylabel='Amount ($)'
    )
    display(cool_plot)


**Are there any outliers for cardholder ID 25? How many outliers are there per month?**

There is a considerable number of outliers for cardholder ID 25.


The box plots for January, March, April, May, July, August, October and December each show one clear outlier. The outliers in January, March, May, August and December were each above $1000.

The outlier in July was under $25. Therefore, we will not consider this July outlier to be a potentially fraudulent extravagant dinner purchase.

The outlier in October was under $150. Therefore, we will not consider this October outlier to be a potentially fraudulent extravagant dinner purchase.

There were two outliers in June, both over $500. We will consider these to be possible extravagant dinner purchases.
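
Counting outliers by eye from the box plots can be cross-checked numerically. The sketch below applies the common 1.5 x IQR whisker rule per month; whether this matches the exact rule hvPlot uses to draw its whiskers is an assumption. The data is a hypothetical sample and the helper name `count_iqr_outliers` is made up for illustration.

```python
import pandas as pd

# Hypothetical transactions for two months (illustrative values, not query output)
sample = pd.DataFrame(
    {
        "date": pd.to_datetime(
            [
                "2018-01-02", "2018-01-05", "2018-01-07",
                "2018-01-10", "2018-01-14", "2018-01-30",
                "2018-02-03", "2018-02-09", "2018-02-14",
                "2018-02-20", "2018-02-27",
            ]
        ),
        "amount": [1.46, 10.74, 2.93, 1.39, 17.84, 1177.0, 3.5, 10.1, 12.7, 15.2, 18.9],
    }
)


def count_iqr_outliers(amounts: pd.Series, k: float = 1.5) -> int:
    """Count the values lying more than k * IQR outside the quartiles."""
    q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
    iqr = q3 - q1
    is_outlier = (amounts < q1 - k * iqr) | (amounts > q3 + k * iqr)
    return int(is_outlier.sum())


# Group by calendar month number and count outliers within each month
outliers_per_month = sample.groupby(sample["date"].dt.month)["amount"].apply(
    count_iqr_outliers
)
print(outliers_per_month)
```

Because the quartiles are computed separately for each month, a value counted as an outlier in one month might not be one in another month with larger typical spending.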



**Do you notice any anomalies? Describe your observations and conclusions in your markdown report.**


Given the box plots of the transaction data for cardholder ID 25, a case can be made that someone is making fraudulent dinner transactions using the corporate credit card.

To examine this claim further, we will pull the largest transactions for this cardholder, then determine where those transactions were made using the merchant ID and the merchant category.



In [64]:
# Let's get all transactions above $200 and see whether their id_merchant values correspond to restaurants in the merchant_category_table
query2 = """
SELECT merchant_category_table.name, merchant_table.id_merchant_category,
merchant_table.name, transaction_table.id_merchant, transaction_table.amount,
transaction_table.date, credit_card_table.card, card_holder_table.id
FROM merchant_category_table
JOIN merchant_table ON merchant_category_table.id = merchant_table.id_merchant_category
JOIN transaction_table ON merchant_table.id = transaction_table.id_merchant
JOIN credit_card_table ON transaction_table.card = credit_card_table.card
JOIN card_holder_table ON credit_card_table.cardholder_id = card_holder_table.id
WHERE card_holder_table.id = 25 AND amount > 200;
        """

cardholder_25_max_purchases = pd.read_sql(query2, engine)

cardholder_25_max_purchases

Unnamed: 0,name,id_merchant_category,name.1,id_merchant,amount,date,card,id
0,restaurant,1,"Cline, Myers and Strong",64,1177.0,2018-01-30 18:31:00,4319653513507,25
1,bar,3,Griffin-Woodard,87,1334.0,2018-03-06 07:18:09,4319653513507,25
2,pub,4,"Bryant, Thomas and Collins",16,1063.0,2018-04-08 06:03:50,4319653513507,25
3,restaurant,1,Hamilton-Mcfarland,36,269.0,2018-04-09 18:28:25,4319653513507,25
4,food truck,5,Baker Inc,48,1046.0,2018-05-13 06:31:20,4319653513507,25
5,pub,4,Johnson-Fuller,96,1162.0,2018-06-04 03:46:15,4319653513507,25
6,restaurant,1,Hamilton-Mcfarland,36,749.0,2018-06-06 21:50:17,4319653513507,25
7,bar,3,"Cox, Montgomery and Morgan",40,1813.0,2018-06-22 06:16:50,4319653513507,25
8,food truck,5,"Vega, Jones and Castro",120,1001.0,2018-08-16 10:01:00,4319653513507,25
9,coffee shop,2,"Maxwell, Tapia and Villanueva",67,1074.0,2018-12-18 13:33:37,4319653513507,25


There is evidence of some big purchases at restaurants; however, they account for only 3 of the 10 transactions above $200. This is not enough evidence to conclude that these are fraudulent transactions.

Also, these big purchases were not confined to the first quarter of 2018, the period the cardholder is suspicious about. The big purchases are spread throughout the year.

Deeper analysis is needed before making a decision about fraudulent purchases.
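
As one possible next step, the large purchases can be grouped by calendar quarter and merchant category. The sketch below retypes the category, amount, and date columns from the query output above into a small DataFrame; the variable names are assumptions.

```python
import pandas as pd

# Category, amount, and date retyped from the "above $200" query output
big = pd.DataFrame(
    {
        "category": [
            "restaurant", "bar", "pub", "restaurant", "food truck",
            "pub", "restaurant", "bar", "food truck", "coffee shop",
        ],
        "amount": [
            1177.0, 1334.0, 1063.0, 269.0, 1046.0,
            1162.0, 749.0, 1813.0, 1001.0, 1074.0,
        ],
        "date": pd.to_datetime(
            [
                "2018-01-30 18:31:00", "2018-03-06 07:18:09",
                "2018-04-08 06:03:50", "2018-04-09 18:28:25",
                "2018-05-13 06:31:20", "2018-06-04 03:46:15",
                "2018-06-06 21:50:17", "2018-06-22 06:16:50",
                "2018-08-16 10:01:00", "2018-12-18 13:33:37",
            ]
        ),
    }
)

# Label each purchase with its calendar quarter (1 to 4)
big["quarter"] = big["date"].dt.quarter

# Number and total value of large purchases in each quarter
per_quarter = big.groupby("quarter")["amount"].agg(["count", "sum"])

# Large restaurant purchases that fall in the first quarter
restaurant_q1 = big[(big["category"] == "restaurant") & (big["quarter"] == 1)]
print(per_quarter)
print(restaurant_q1)
```

On this data, only one large restaurant purchase falls in the first quarter, while most of the large purchases fall in the second quarter.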