# Visual Data Analysis of Fraudulent Transactions

Your CFO has also requested detailed trend data on specific cardholders. Use the starter notebook to query your database and generate the visualizations described below, then add the visualizations and your observations to your markdown report.

In [1]:
# Initial imports
import pandas as pd
import calendar
import hvplot.pandas
from sqlalchemy import create_engine


In [2]:
# Create a connection to the database
engine = create_engine("postgresql://postgres:bootcamp24@localhost:5432/fraud_detection")

# Test the connection
try:
    connection = engine.connect()
    print("Connection to the database was successful!")
    connection.close()
except Exception as e:
    print("Error connecting to the database:", e)


Connection to the database was successful!
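The connection string above embeds the database password directly in the notebook. A minimal sketch of one alternative is to assemble the URL from environment variables; the variable names (`FRAUD_DB_USER`, `FRAUD_DB_PASSWORD`, and so on) and the helper `build_db_url` are hypothetical and not part of the starter notebook.

```python
import os
from urllib.parse import quote_plus


def build_db_url(env=os.environ):
    """Assemble a PostgreSQL URL from environment variables (hypothetical names)."""
    user = env.get("FRAUD_DB_USER", "postgres")
    password = env.get("FRAUD_DB_PASSWORD", "")
    host = env.get("FRAUD_DB_HOST", "localhost")
    port = env.get("FRAUD_DB_PORT", "5432")
    name = env.get("FRAUD_DB_NAME", "fraud_detection")
    # quote_plus escapes characters such as '@' or ':' that would break the URL
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}/{name}"


# Example with an explicit mapping instead of the real environment
print(build_db_url({"FRAUD_DB_PASSWORD": "p@ss:word"}))
```

The returned string can be passed to `create_engine` in place of the literal URL, which keeps the credential out of the saved notebook.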


## Data Analysis Question 1

The two most important customers of the firm may have been hacked. Verify if there are any fraudulent transactions in their history. For privacy reasons, you only know that their cardholder IDs are 2 and 18.

* Using hvPlot, create a line plot representing the time series of transactions over the course of the year for each cardholder separately. 

* Next, to better compare their patterns, create a single line plot that contains both cardholders' trend data.

* What difference do you observe between the consumption patterns? Does the difference suggest a fraudulent transaction? Explain your rationale in the markdown report.
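To back up the visual comparison with numbers, one option is to summarize each cardholder's amounts side by side. The sketch below uses a small hypothetical sample rather than the real query results, and the `max_to_median` ratio is only an illustrative indicator of how far the largest purchase sits from typical spending, not an established fraud test.

```python
import pandas as pd

# Hypothetical sample; the real rows come from the query on cardholders 2 and 18
sample = pd.DataFrame({
    "cardholder_id": [2, 2, 2, 18, 18, 18],
    "amount": [1.50, 10.25, 18.00, 2.95, 12.40, 1200.00],
})

# Per-cardholder transaction count, typical amount, and largest amount
summary = sample.groupby("cardholder_id")["amount"].agg(["count", "median", "max"])

# A large ratio means at least one purchase is far above the typical one
summary["max_to_median"] = summary["max"] / summary["median"]
print(summary)
```

A cardholder whose ratio is much larger than the other's has isolated spikes worth inspecting individually.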

In [3]:
from sqlalchemy import text

# Query to load data for cardholders 2 and 18
query = text("""
SELECT t.*, cc.cardholder_id
FROM transaction t
JOIN credit_card cc ON t.card = cc.card
WHERE cc.cardholder_id IN (2, 18);
""")

# Execute the SQL query and load the results into a DataFrame
try:
    with engine.connect() as connection:
        cardholders_transactions_df = pd.read_sql_query(query, connection)
    print("Query executed successfully!")
except Exception as e:
    print("Error executing query:", e)

# Verify the data
if 'cardholders_transactions_df' in locals():
    print(cardholders_transactions_df.head())


Query executed successfully!
    id                 date  amount           card  id_merchant  cardholder_id
0  567  2018-01-01 23:15:10    2.95  4498002758300           64             18
1  567  2018-01-01 23:15:10    2.95  4498002758300           64             18
2  567  2018-01-01 23:15:10    2.95  4498002758300           64             18
3  567  2018-01-01 23:15:10    2.95  4498002758300           64             18
4  567  2018-01-01 23:15:10    2.95  4498002758300           64             18
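The preview shows the same transaction (id 567) five times, which suggests the join is multiplying rows. The cause is not visible from this notebook; repeated entries in `credit_card` are one possibility, but that is an assumption. A minimal sketch of removing exact duplicates in pandas before plotting or counting, using a hypothetical frame that mimics the preview:

```python
import pandas as pd

# Hypothetical frame mimicking the repeated rows in the preview
raw = pd.DataFrame({
    "id": [567, 567, 567, 812],
    "date": ["2018-01-01 23:15:10"] * 3 + ["2018-01-05 07:19:27"],
    "amount": [2.95, 2.95, 2.95, 1.36],
    "cardholder_id": [18, 18, 18, 2],
})

# Keep one copy of each fully identical row
deduped = raw.drop_duplicates().reset_index(drop=True)
print(f"{len(raw)} rows before, {len(deduped)} rows after")
```

Duplicates do not change the shape of a line plot much, but they would inflate any per-month counts, so checking row counts before and after is a cheap sanity test.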


In [4]:
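# Parse the dates so the x-axis is a real time axis, then sort chronologically
cardholders_transactions_df['date'] = pd.to_datetime(cardholders_transactions_df['date'])
cardholders_transactions_df = cardholders_transactions_df.sort_values('date')

# Plot for cardholder 2
cardholder_2_cards = cardholders_transactions_df[
    cardholders_transactions_df['cardholder_id'] == 2]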

cardholder_2_plot = cardholder_2_cards.hvplot.line(
    x='date',
    y='amount',
    xlabel='Transaction Date',
    ylabel='Transaction Amount',
    title='Transaction Trend for Cardholder 2'
)

cardholder_2_plot


In [5]:
# Plot for cardholder 18
cardholder_18_cards = cardholders_transactions_df[
    cardholders_transactions_df['cardholder_id'] == 18]

cardholder_18_plot = cardholder_18_cards.hvplot.line(
    x='date',
    y='amount',
    xlabel='Transaction Date',
    ylabel='Transaction Amount',
    title='Transaction Trend for Cardholder 18'
)

cardholder_18_plot


In [6]:
# Combined plot for cardholders 2 and 18
combined_plot = cardholders_transactions_df.hvplot.line(
    x='date',
    y='amount',
    by='cardholder_id',
    xlabel='Transaction Date',
    ylabel='Transaction Amount',
    title='Transaction Trend for Cardholders 2 and 18'
)

combined_plot


## Data Analysis Question 2

The CEO of the biggest customer of the firm suspects that someone has used her corporate credit card without authorization in the first quarter of 2018 to pay quite expensive restaurant bills. Again, for privacy reasons, you know only that the cardholder ID in question is 25.

* Using hvPlot, create a box plot representing the expenditure data from January 2018 to June 2018 for cardholder ID 25.

* Are there any outliers for cardholder ID 25? How many outliers are there per month?

* Do you notice any anomalies? Describe your observations and conclusions in your markdown report.
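Counting outliers per month can be done numerically as well as by reading the box plot. The sketch below assumes the common 1.5 × IQR rule as the definition of an outlier; the helper name `count_outliers` and the sample amounts are hypothetical.

```python
import pandas as pd


def count_outliers(amounts: pd.Series, k: float = 1.5) -> int:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = amounts.quantile(0.25), amounts.quantile(0.75)
    iqr = q3 - q1
    mask = (amounts < q1 - k * iqr) | (amounts > q3 + k * iqr)
    return int(mask.sum())


# Hypothetical sample; the real data comes from the cardholder 25 query
sample = pd.DataFrame({
    "month": ["January"] * 6 + ["February"] * 6,
    "amount": [1.5, 2.0, 3.1, 2.4, 1.9, 1177.0, 2.2, 3.0, 1.1, 2.7, 1.8, 2.5],
})

# sort=False keeps the months in order of first appearance
outliers_per_month = sample.groupby("month", sort=False)["amount"].apply(count_outliers)
print(outliers_per_month)
```

Applied to the real DataFrame, the same `groupby` on the `month` column would give one count per month to report alongside the plot.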

In [7]:
# Reuse `pd`, `text`, and `engine` from the earlier cells

# Query to load data for cardholder 25 from January to June 2018
query = text("""
SELECT t.*, cc.cardholder_id
FROM transaction t
JOIN credit_card cc ON t.card = cc.card
WHERE cc.cardholder_id = 25
AND EXTRACT(YEAR FROM t.date::date) = 2018
AND EXTRACT(MONTH FROM t.date::date) BETWEEN 1 AND 6;
""")

# Execute the SQL query and load the results into a DataFrame
try:
    with engine.connect() as connection:
        daily_transactions_df = pd.read_sql_query(query, connection)
    print("Query executed successfully!")
except Exception as e:
    print("Error executing query:", e)

# Verify the data
if 'daily_transactions_df' in locals():
    print(daily_transactions_df.head())


Query executed successfully!
     id                 date  amount           card  id_merchant  \
0  2083  2018-01-02 02:06:21    1.46  4319653513507           93   
1  2083  2018-01-02 02:06:21    1.46  4319653513507           93   
2  2083  2018-01-02 02:06:21    1.46  4319653513507           93   
3  2083  2018-01-02 02:06:21    1.46  4319653513507           93   
4  2083  2018-01-02 02:06:21    1.46  4319653513507           93   

   cardholder_id  
0             25  
1             25  
2             25  
3             25  
4             25  


In [8]:
# Parse the date column and derive a month-name column from it
daily_transactions_df['date'] = pd.to_datetime(daily_transactions_df['date'])
daily_transactions_df['month'] = daily_transactions_df['date'].dt.month_name()

# Verify the updated DataFrame
print(daily_transactions_df.head())


     id                date  amount           card  id_merchant  \
0  2083 2018-01-02 02:06:21    1.46  4319653513507           93   
1  2083 2018-01-02 02:06:21    1.46  4319653513507           93   
2  2083 2018-01-02 02:06:21    1.46  4319653513507           93   
3  2083 2018-01-02 02:06:21    1.46  4319653513507           93   
4  2083 2018-01-02 02:06:21    1.46  4319653513507           93   

   cardholder_id    month  
0             25  January  
1             25  January  
2             25  January  
3             25  January  
4             25  January  
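Month names stored as plain strings sort alphabetically (April before January). A minimal sketch of keeping them in calendar order with an ordered categorical follows; the sample dates are hypothetical, and whether a given plotting backend honors the category order on its axis is not confirmed here.

```python
import pandas as pd

# Hypothetical dates in arbitrary order
dates = pd.to_datetime(pd.Series(["2018-03-15", "2018-01-02", "2018-06-30", "2018-02-11"]))

# January..June in calendar order, generated from month-start dates
month_order = pd.date_range("2018-01-01", periods=6, freq="MS").month_name().tolist()

# An ordered categorical sorts by calendar position rather than alphabetically
months = pd.Series(pd.Categorical(dates.dt.month_name(), categories=month_order, ordered=True))
print(months.sort_values().tolist())
```

Sorting the DataFrame by such a column before plotting gives the rows a chronological month order.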


In [9]:
# Creating the six box plots using hvPlot
box_plots = daily_transactions_df.hvplot.box(
    y='amount',
    by='month',
    xlabel='Month',
    ylabel='Transaction Amount',
    title='Expenditure Data for Cardholder ID 25 (Jan-Jun 2018)',
    rot=45
)

# Display the box plots
box_plots
