# Data Collection

#### **Dataset:**
The dataset is available at: https://github.com/Fraud-Detection-Handbook/simulated-data-raw


##### Transaction features
The features of the dataset are:

*The transaction ID:* A unique identifier for the transaction.

*The date and time:* Date and time at which the transaction occurs.

*The customer ID:* The identifier for the customer. Each customer has a unique identifier.

*The terminal ID:* The identifier for the merchant (or more precisely the terminal). Each terminal has a unique identifier.

*The transaction amount:* The amount of the transaction.

*The fraud label:* A binary variable, with the value 0 for a legitimate transaction, or the value 1 for a fraudulent transaction.
                
*The Fraud Scenario:* 
Scenario 1: Any transaction whose amount is more than 220 is marked as fraudulent. This scenario is not inspired by a real-world pattern. Rather, it provides an obvious fraud pattern that should be detected by any baseline fraud detector, which is useful to validate the implementation of a fraud detection technique.
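As a minimal sketch of such a baseline, the rule in scenario 1 reduces to a single threshold comparison on TX_AMOUNT. The toy data and the `flag_scenario_1` helper below are hypothetical and only illustrate the rule; they are not part of the dataset or the original notebook.

```python
import pandas as pd

# Hypothetical toy transactions; only TX_AMOUNT matters for this rule.
toy_df = pd.DataFrame({
    "TRANSACTION_ID": [0, 1, 2, 3],
    "TX_AMOUNT": [57.16, 220.00, 220.01, 2628.00],
})


def flag_scenario_1(df, threshold=220):
    """Return 1 where TX_AMOUNT is strictly greater than the threshold, else 0."""
    return (df["TX_AMOUNT"] > threshold).astype(int)


toy_df["PREDICTED_FRAUD"] = flag_scenario_1(toy_df)
print(toy_df)
```

Comparing such a column against TX_FRAUD for rows where TX_FRAUD_SCENARIO equals 1 is one way to check that a detection pipeline is wired correctly.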

Scenario 2: Every day, a list of two terminals is drawn at random. All transactions on these terminals in the next 28 days will be marked as fraudulent. This scenario simulates a criminal use of a terminal, through phishing for example. Detecting this scenario will be possible by adding features that keep track of the number of fraudulent transactions on the terminal. Since the terminal is only compromised for 28 days, additional strategies that involve concept drift will need to be designed to efficiently deal with this scenario.
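The terminal-level feature mentioned for scenario 2 could be sketched as a time-based rolling count of earlier frauds per terminal. The toy data, the 28-day window, and the column name `TERMINAL_NB_FRAUD_28D` are assumptions made for illustration. The sketch also assumes fraud labels are known immediately, which is usually not the case in practice.

```python
import pandas as pd

# Hypothetical toy transactions on two terminals.
toy_df = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime([
        "2018-04-01 10:00", "2018-04-02 10:00", "2018-04-03 10:00",
        "2018-04-01 12:00", "2018-04-05 12:00",
    ]),
    "TERMINAL_ID": [1, 1, 1, 2, 2],
    "TX_FRAUD": [1, 1, 0, 0, 0],
})

# Sort so the row order matches the order of the grouped rolling result.
toy_df = toy_df.sort_values(["TERMINAL_ID", "TX_DATETIME"]).reset_index(drop=True)

# Sum of fraud labels on the same terminal over the previous 28 days.
# closed="left" excludes the current transaction from its own window.
fraud_counts = (
    toy_df.set_index("TX_DATETIME")
    .groupby("TERMINAL_ID")["TX_FRAUD"]
    .rolling("28D", closed="left")
    .sum()
)

# The first transaction of each terminal has an empty window (NaN), so fill with 0.
toy_df["TERMINAL_NB_FRAUD_28D"] = fraud_counts.fillna(0).astype(int).to_numpy()
print(toy_df)
```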

Scenario 3: Every day, a list of 3 customers is drawn at random. In the next 14 days, 1/3 of their transactions have their amounts multiplied by 5 and are marked as fraudulent. This scenario simulates a card-not-present fraud where the credentials of a customer have been leaked. The customer continues to make transactions, and transactions of higher values are made by the fraudster, who tries to maximize their gains. Detecting this scenario will require adding features that keep track of the spending habits of the customer. As with scenario 2, since the card is only temporarily compromised, additional strategies that involve concept drift should also be designed.
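One possible way to track a customer's spending habits, as scenario 3 calls for, is to compare each amount with that customer's recent average. The sketch below uses hypothetical toy data, an assumed 7-day window, and illustrative column names (`CUSTOMER_AVG_AMOUNT_7D`, `AMOUNT_RATIO`); none of these come from the dataset.

```python
import pandas as pd

# Hypothetical toy transactions for two customers.
toy_df = pd.DataFrame({
    "TX_DATETIME": pd.to_datetime([
        "2018-04-01 09:00", "2018-04-02 09:00", "2018-04-03 09:00",
        "2018-04-01 15:00", "2018-04-04 15:00",
    ]),
    "CUSTOMER_ID": [7, 7, 7, 8, 8],
    "TX_AMOUNT": [20.0, 40.0, 150.0, 10.0, 12.0],
})

# Sort so the row order matches the order of the grouped rolling result.
toy_df = toy_df.sort_values(["CUSTOMER_ID", "TX_DATETIME"]).reset_index(drop=True)

# Average amount of the customer's earlier transactions in the previous 7 days.
# closed="left" keeps the current transaction out of its own average.
avg_previous = (
    toy_df.set_index("TX_DATETIME")
    .groupby("CUSTOMER_ID")["TX_AMOUNT"]
    .rolling("7D", closed="left")
    .mean()
)
toy_df["CUSTOMER_AVG_AMOUNT_7D"] = avg_previous.to_numpy()

# Ratio of the current amount to the recent average; NaN when there is no history.
toy_df["AMOUNT_RATIO"] = toy_df["TX_AMOUNT"] / toy_df["CUSTOMER_AVG_AMOUNT_7D"]
print(toy_df)
```

A ratio far above 1 marks a transaction that is unusually large for that customer, which is the kind of signal the multiplied amounts in this scenario would produce.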


These features will be referred to as TRANSACTION_ID, TX_DATETIME, CUSTOMER_ID, TERMINAL_ID, TX_AMOUNT, TX_FRAUD, and TX_FRAUD_SCENARIO.

## Importing the Relevant Libraries

In [1]:
import pandas as pd
import os
import pickle

In [2]:
data_folder_path =  r"C:\Users\kshar\myfiles\projects\Fraud_transaction_detection\Data_files\data"

In [3]:
# List all the pickle files in the data folder
files_data = [file for file in os.listdir(data_folder_path) if file.endswith('.pkl')]
# files_data[:5]

In [4]:
# Initialize an empty list to collect the DataFrame loaded from each file
frames = []

In [5]:
# Loop through all pickle files (sorted for a deterministic order) and load each one
for file in sorted(files_data):
    file_path = os.path.join(data_folder_path, file)
    # Load each pickle file into a DataFrame
    frames.append(pd.read_pickle(file_path))
# Concatenate once at the end instead of growing a DataFrame inside the loop
new_df = pd.concat(frames, ignore_index=True)

In [6]:
new_df.head()

|   | TRANSACTION_ID | TX_DATETIME | CUSTOMER_ID | TERMINAL_ID | TX_AMOUNT | TX_TIME_SECONDS | TX_TIME_DAYS | TX_FRAUD | TX_FRAUD_SCENARIO |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2018-04-01 00:00:31 | 596 | 3156 | 57.16 | 31 | 0 | 0 | 0 |
| 1 | 1 | 2018-04-01 00:02:10 | 4961 | 3412 | 81.51 | 130 | 0 | 0 | 0 |
| 2 | 2 | 2018-04-01 00:07:56 | 2 | 1365 | 146.0 | 476 | 0 | 0 | 0 |
| 3 | 3 | 2018-04-01 00:09:29 | 4128 | 8737 | 64.49 | 569 | 0 | 0 | 0 |
| 4 | 4 | 2018-04-01 00:10:34 | 927 | 9906 | 50.99 | 634 | 0 | 0 | 0 |


In [7]:
new_df.shape

(1754155, 9)

In [8]:
new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1754155 entries, 0 to 1754154
Data columns (total 9 columns):
 #   Column             Dtype         
---  ------             -----         
 0   TRANSACTION_ID     int64         
 1   TX_DATETIME        datetime64[ns]
 2   CUSTOMER_ID        object        
 3   TERMINAL_ID        object        
 4   TX_AMOUNT          float64       
 5   TX_TIME_SECONDS    object        
 6   TX_TIME_DAYS       object        
 7   TX_FRAUD           int64         
 8   TX_FRAUD_SCENARIO  int64         
dtypes: datetime64[ns](1), float64(1), int64(3), object(4)
memory usage: 120.4+ MB
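The `info()` output lists CUSTOMER_ID, TERMINAL_ID, TX_TIME_SECONDS, and TX_TIME_DAYS as `object`, although the previewed rows hold whole numbers. Assuming every value in those columns is numeric (not verified here), a conversion along the following lines would give them a numeric dtype. The toy frame is hypothetical and stands in for `new_df`.

```python
import pandas as pd

# Hypothetical toy frame that mimics the object-typed columns reported by info().
toy_df = pd.DataFrame({
    "CUSTOMER_ID": pd.Series([596, 4961, 2], dtype="object"),
    "TERMINAL_ID": pd.Series([3156, 3412, 1365], dtype="object"),
    "TX_TIME_SECONDS": pd.Series([31, 130, 476], dtype="object"),
    "TX_TIME_DAYS": pd.Series([0, 0, 0], dtype="object"),
})

object_cols = ["CUSTOMER_ID", "TERMINAL_ID", "TX_TIME_SECONDS", "TX_TIME_DAYS"]
for col in object_cols:
    # pd.to_numeric raises an error if a value cannot be parsed as a number.
    toy_df[col] = pd.to_numeric(toy_df[col])

print(toy_df.dtypes)
```

Numeric columns generally use less memory than `object` columns and allow arithmetic and comparisons without further casting.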


In [9]:
new_df.describe()

|   | TRANSACTION_ID | TX_AMOUNT | TX_FRAUD | TX_FRAUD_SCENARIO |
|---|---|---|---|---|
| count | 1754155.0 | 1754155.0 | 1754155.0 | 1754155.0 |
| mean | 877077.0 | 53.6323 | 0.008369272 | 0.01882388 |
| std | 506381.1 | 42.32649 | 0.09110012 | 0.2113263 |
| min | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 438538.5 | 21.01 | 0.0 | 0.0 |
| 50% | 877077.0 | 44.64 | 0.0 | 0.0 |
| 75% | 1315616.0 | 76.95 | 0.0 | 0.0 |
| max | 1754154.0 | 2628.0 | 1.0 | 3.0 |


In [10]:
new_df.isnull().sum()

TRANSACTION_ID       0
TX_DATETIME          0
CUSTOMER_ID          0
TERMINAL_ID          0
TX_AMOUNT            0
TX_TIME_SECONDS      0
TX_TIME_DAYS         0
TX_FRAUD             0
TX_FRAUD_SCENARIO    0
dtype: int64

In [11]:
# Save the combined dataset in pickle format
new_df.to_pickle("Fraud_transations1.pkl")
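After saving, it can be worth confirming that the pickle file loads back unchanged. The sketch below shows such a round-trip check on a hypothetical toy frame written to a temporary directory; the file name `toy_transactions.pkl` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the combined dataset.
toy_df = pd.DataFrame({
    "TRANSACTION_ID": [0, 1],
    "TX_AMOUNT": [57.16, 81.51],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_transactions.pkl")
    toy_df.to_pickle(path)               # write the frame to disk
    restored_df = pd.read_pickle(path)   # read it back

# DataFrame.equals checks that shape, dtypes, and values all match.
round_trip_ok = restored_df.equals(toy_df)
print(round_trip_ok)
```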