<div style="text-align: center; margin-top: 50px; margin-bottom: 30px;">
    <h1 style="font-size: 3em; color:#00E0FF;">Payment Analysis Test Case</h1>
    <h3 style="font-size: 2em; color: #E3BD00;">Clesson Roberto da Silva Junior</h3>
    
</div>

---

## Description

In this analysis, I use Python to process and analyze the data and extract as much information as possible. I then research and implement a machine learning model suited to predicting fraud.


In my experience, the Google framework for data analysis is the most effective way to structure an analysis. But how does it work?

## Google Data Analysis Steps

<ol style="font-size: 1.2em; color: #F3CF1D; line-height: 1.5;">
  <li><strong>Ask</strong> - Define the questions you want to answer.</li>
  <li><strong>Prepare</strong> - Collect and prepare your data for analysis.</li>
  <li><strong>Process</strong> - Clean and transform the data to ensure accuracy.</li>
  <li><strong>Analyze</strong> - Examine the data to find insights and patterns.</li>
  <li><strong>Share</strong> - Present your findings to stakeholders.</li>
  <li><strong>Act</strong> - Make decisions and take actions based on the analysis.</li>
</ol>

So, let's move on to the first step: asking the questions.

---

## <span style="color:#F7A454">1. Asking the questions</span>


### Common Findings in Fraud Detection

Here are some of the most common findings when identifying fraud in credit card data:

1. **Unusual Transaction Amounts**: Transactions that are significantly higher or lower than the average transaction amounts.
2. **Multiple Transactions in a Short Time Frame**: Numerous transactions occurring within a very short period.
3. **Transactions from Different Geographic Locations**: Transactions made from locations that are significantly different from the cardholder’s usual locations.
4. **Inconsistent Merchant Categories**: Purchases from merchant categories that are unusual for the cardholder.
5. **Odd Hours Transactions**: Transactions made at times that are unusual for the cardholder.
6. **Frequent Chargebacks**: A high number of chargebacks associated with the cardholder's account.

### Key Questions to Ask:

1. **Are there any transactions that significantly deviate from the cardholder’s typical spending patterns?**
2. **Are there clusters of transactions occurring in a very short period?**
3. **Do any transactions originate from geographic locations far from the cardholder’s usual area?**
4. **Are there purchases from merchant categories that are atypical for the cardholder?**
5. **Do any transactions occur at odd hours, compared to the cardholder’s normal activity?**
6. **Is there a high frequency of chargebacks, and what are the reasons for these chargebacks?**

Not every question needs a perfect answer, but I now have a guide. **So, let's prepare the data.**

---

## <span style="color:#F7A454">2. Preparing the data</span>

#### Importing Libraries

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

#### Taking a look at the data

In [9]:
df = pd.read_csv('transactions.csv')
print(df.head())
print(df.info())
print(df.describe())
print(df.isnull().sum())


   transaction_id  merchant_id  user_id       card_number  \
0        21320398        29744    97051  434505******9116   
1        21320399        92895     2708  444456******4210   
2        21320400        47759    14777  425850******7024   
3        21320401        68657    69758  464296******3991   
4        21320402        54075    64367  650487******6116   

             transaction_date  transaction_amount  device_id  has_cbk  
0  2019-12-01T23:16:32.812632              374.56   285475.0    False  
1  2019-12-01T22:45:37.873639              734.87   497105.0     True  
2  2019-12-01T22:22:43.021495              760.36        NaN    False  
3  2019-12-01T21:59:19.797129             2556.13        NaN     True  
4  2019-12-01T21:30:53.347051               55.36   860232.0    False  
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3199 entries, 0 to 3198
Data columns (total 8 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----

### Brief Analysis

The card number is masked with asterisks, which makes it awkward to work with. I will remove the asterisks so that the remaining digits (the first six and the last four) are easier to use in the analysis.
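A minimal sketch of that cleanup, using three card numbers copied from the `df.head()` output above. The column names `card_digits`, `card_bin`, and `card_last4` are my own illustrative choices, not part of the dataset.

```python
import pandas as pd

# Card numbers copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "card_number": [
        "434505******9116",
        "444456******4210",
        "425850******7024",
    ],
})

# Drop the mask characters; regex=False treats "*" as a literal character.
sample["card_digits"] = sample["card_number"].str.replace("*", "", regex=False)

# Hypothetical helper columns: the visible prefix and suffix of each card.
sample["card_bin"] = sample["card_number"].str[:6]
sample["card_last4"] = sample["card_number"].str[-4:]

print(sample)
```

Keeping the prefix and suffix as separate columns makes it possible to group by either one later, which the concatenated digits alone would not allow.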

The null values in the device_id column warrant further investigation. Users with a null device ID could be either legitimate users who value their device privacy or potential fraudsters trying to conceal their identity. Deeper analysis is needed to see which pattern the data supports.
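One way to start that investigation is to compare the chargeback rate of transactions with and without a device ID. The sketch below uses the five rows from the `df.head()` output, so the rates it prints say nothing about the full dataset; `device_missing` is a hypothetical column name.

```python
import numpy as np
import pandas as pd

# device_id and has_cbk values copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "device_id": [285475.0, 497105.0, np.nan, np.nan, 860232.0],
    "has_cbk": [False, True, False, True, False],
})

# Flag rows where the device ID is missing (hypothetical helper column).
sample["device_missing"] = sample["device_id"].isna()

# The mean of a boolean column is the share of True values,
# so this gives the chargeback rate per group.
cbk_rate = sample.groupby("device_missing")["has_cbk"].mean()
print(cbk_rate)
```

On the real data, the same two lines can be run directly on `df` to check whether missing device IDs are associated with more chargebacks.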

The standard deviation of the "transaction_amount" column is higher than its mean, so the amounts vary widely. This raises two questions:
- Are any specific transaction amounts or ranges more or less common than others?
- Are there outliers in the transaction amounts that warrant further investigation?
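Both questions can be approached with a binned count and a simple outlier rule. The sketch below uses the five amounts from the `df.head()` output. The bin edges and the 1.5 × IQR rule are my assumptions, chosen only for illustration.

```python
import pandas as pd

# Amounts copied from the df.head() output shown earlier.
amounts = pd.Series(
    [374.56, 734.87, 760.36, 2556.13, 55.36], name="transaction_amount"
)

# Summary statistics, including the mean and std compared above.
print(amounts.describe())

# Assumed bin edges; adjust them to the real distribution.
bins = [0, 100, 500, 1000, 5000]
range_counts = pd.cut(amounts, bins=bins).value_counts(sort=False)
print(range_counts)

# Assumed outlier rule: values beyond 1.5 * IQR from the quartiles.
q1, q3 = amounts.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = amounts[(amounts < lower) | (amounts > upper)]
print(outliers)
```

The IQR rule is only one option. With heavily skewed amounts, a log transform or a percentile cutoff may separate unusual transactions more cleanly.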

The "transaction_date" column is stored as an object (string) dtype, so I will convert it to a datetime type.
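A short sketch of the conversion with `pd.to_datetime`, using timestamps copied from the `df.head()` output. The derived `hour` and `day_of_week` columns are hypothetical additions that would help with the "odd hours" question later on.

```python
import pandas as pd

# Timestamps copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "transaction_date": [
        "2019-12-01T23:16:32.812632",
        "2019-12-01T22:45:37.873639",
        "2019-12-01T22:22:43.021495",
    ],
})

# Parse the ISO 8601 strings into datetime64 values.
sample["transaction_date"] = pd.to_datetime(sample["transaction_date"])

# Hypothetical helper columns derived from the parsed timestamp.
sample["hour"] = sample["transaction_date"].dt.hour
sample["day_of_week"] = sample["transaction_date"].dt.day_name()

print(sample.dtypes)
print(sample)
```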

A common tactic in carding (the use of fraudulent credit cards) is to try multiple cards with altered details at the same merchant. The aim is to impersonate different individuals while keeping the merchant constant. To detect potential instances of this, I will look for transactions with similar amounts at the same merchant within a short timeframe of approximately 30 minutes.
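The check can be sketched as a self-join on `merchant_id` followed by filters on time gap, amount gap, and card number. The rows below are made up purely for illustration and are not taken from `transactions.csv`. The 5% amount tolerance is my assumption; only the 30-minute window comes from the text.

```python
import pandas as pd

# Made-up rows for illustration only; they are not from transactions.csv.
toy = pd.DataFrame({
    "transaction_id": [1, 2, 3, 4],
    "merchant_id": [100, 100, 100, 200],
    "card_number": [
        "111111******0001",
        "222222******0002",
        "333333******0003",
        "444444******0004",
    ],
    "transaction_date": pd.to_datetime([
        "2019-12-01 10:00:00",
        "2019-12-01 10:10:00",
        "2019-12-01 15:00:00",
        "2019-12-01 10:05:00",
    ]),
    "transaction_amount": [500.00, 505.00, 500.00, 500.00],
})

WINDOW = pd.Timedelta(minutes=30)  # timeframe named in the text
AMOUNT_TOLERANCE = 0.05            # assumption: within 5% counts as "similar"

# Pair every transaction with every other one at the same merchant.
pairs = toy.merge(toy, on="merchant_id", suffixes=("_a", "_b"))
# Keep each unordered pair once and drop self-pairs.
pairs = pairs[pairs["transaction_id_a"] < pairs["transaction_id_b"]]

time_gap = (pairs["transaction_date_b"] - pairs["transaction_date_a"]).abs()
amount_gap = (pairs["transaction_amount_b"] - pairs["transaction_amount_a"]).abs()
larger_amount = pairs[["transaction_amount_a", "transaction_amount_b"]].max(axis=1)
relative_gap = amount_gap / larger_amount

# Different cards, similar amounts, same merchant, short time apart.
suspicious = pairs[
    (time_gap <= WINDOW)
    & (relative_gap <= AMOUNT_TOLERANCE)
    & (pairs["card_number_a"] != pairs["card_number_b"])
]
print(suspicious[["merchant_id", "transaction_id_a", "transaction_id_b"]])
```

A self-join grows quadratically with the number of transactions per merchant. That is acceptable for a few thousand rows, but a sorted rolling window would scale better on larger data.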

Very low amounts may also indicate that a fraudster is testing whether stolen card details still work.
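A rough way to surface such test transactions is to flag amounts below a low percentile. The sketch uses the five rows from the `df.head()` output. The 5th-percentile cutoff is my assumption, and `low_amount` is a hypothetical column name.

```python
import pandas as pd

# IDs and amounts copied from the df.head() output shown earlier.
sample = pd.DataFrame({
    "transaction_id": [21320398, 21320399, 21320400, 21320401, 21320402],
    "transaction_amount": [374.56, 734.87, 760.36, 2556.13, 55.36],
})

LOW_QUANTILE = 0.05  # assumption: the bottom 5% of amounts counts as "very low"

# Cutoff derived from the data rather than a fixed currency value.
threshold = sample["transaction_amount"].quantile(LOW_QUANTILE)

# Flag transactions at or below the cutoff (hypothetical helper column).
sample["low_amount"] = sample["transaction_amount"] <= threshold

print(sample[sample["low_amount"]])
```

A flagged row is only a candidate. Combining this flag with chargebacks or with bursts of activity on the same card would give a stronger signal than the amount alone.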


---

## <span style="color:#F7A454">3. Processing the data</span>