# Dataset analysis

# Introduction

In this dataset analysis, we aim to identify potential features that could significantly impact whether a transaction is fraudulent. By thoroughly analyzing the dataset, we hope to uncover patterns and correlations that will guide our feature selection and clean up our (very large) dataset.

The insights gained from this analysis will enable us to clearly and specifically formulate our two sub-questions, which will focus on the role of these features in predicting fraudulent transactions. 

# Specific data set explanation ("Transaction data")

In this section, we will dive deeper into the dataset by importing its features and providing a detailed explanation of each feature. Understanding the dataset's structure and the role of its features is essential for identifying those that may influence whether a transaction is fraudulent.

In [1]:
import pandas as pd

# Load the dataset
df = pd.read_csv('train_transaction.csv')

# Get the column names (features)
features = df.columns
print(features)

Index(['TransactionID', 'isFraud', 'TransactionDT', 'TransactionAmt',
       'ProductCD', 'card1', 'card2', 'card3', 'card4', 'card5',
       ...
       'V330', 'V331', 'V332', 'V333', 'V334', 'V335', 'V336', 'V337', 'V338',
       'V339'],
      dtype='object', length=394)


As the output shows, the transaction dataset contains 394 columns. Two of these are the identifier TransactionID and the target isFraud, which leaves 392 candidate features. We need to comb through these columns and identify which ones are most relevant for our sub-questions. Before selecting features, we first explain the columns group by group below. This will help us understand the context of each feature and decide which ones might be important for predicting fraudulent transactions.

<br>

**1. Transaction Information:**

TransactionID: Unique identifier for each transaction.

isFraud: Binary target variable indicating whether the transaction is fraudulent or not.

TransactionDT: Time delta from a reference datetime, representing the time of the transaction.

TransactionAmt: The amount of money involved in the transaction (in USD).

ProductCD: The product code, which may refer to different types of transactions (e.g., goods or services).
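
Because isFraud is the target variable, a natural first check is how its two classes are distributed. The sketch below is only an illustration: it uses a small made-up frame in place of `train_transaction.csv` (the values are not taken from the dataset), and `class_balance` is a hypothetical helper name.

```python
import pandas as pd


def class_balance(frame: pd.DataFrame, target: str = "isFraud") -> pd.Series:
    """Return the share of each class in the target column."""
    return frame[target].value_counts(normalize=True).sort_index()


# Toy stand-in for the real data (illustrative values only)
toy = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4, 5],
    "isFraud": [0, 0, 0, 1, 0],
    "TransactionAmt": [50.0, 20.0, 117.0, 300.0, 35.5],
})

balance = class_balance(toy)
print(balance)
```

On the real data the same call would be `class_balance(df)`, with `df` being the frame loaded in the first cell.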

<br>

**2. Payment Card Information:**

card1 - card6: Features related to the payment card, such as card type, category, issuing bank, and country.

<br>

**3. Address Information:**

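addr1, addr2: Address information for the purchaser. According to the competition's data description, addr1 is the billing region and addr2 the billing country.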

dist1, dist2: Distance features, likely related to geographic location (e.g., distance between billing and shipping address, IP address, etc.).
<br>
<br>

**4. Email Domain Information:**

P_emaildomain, R_emaildomain: The email domains for the purchaser and recipient. Some transactions may not have a recipient email domain.
<br>
<br>

**5. Counting Features (C1 - C14):**

These columns contain counts, such as how many addresses or phone numbers are linked to a particular payment card or how many devices are associated with the transaction.
<br>
<br>

**6. Time Delta Features (D1 - D15):**

These features represent time differences, such as the number of days since the last transaction for the same user or card.
<br>
<br>

**7. Matching Features (M1 - M9):**

These columns are used to match various aspects of the transaction, such as matching names on the card and the address.
<br>
<br>


**8. Vesta Engineered Features (V1 - V339):**

These are features engineered by the Vesta team, which include various ranking, counting, and entity relations, such as the frequency of transactions associated with a specific card, email, or address in a given time window.
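
The grouping above can also be derived from the column names themselves. The following is a minimal sketch, assuming that a column's family is its name with the trailing number removed; `column_family` and `family_counts` are hypothetical helper names, and the example list is a shortened stand-in for `df.columns`.

```python
import re
from collections import Counter


def column_family(name: str) -> str:
    """Strip a trailing number (and optional underscore): 'V12' -> 'V', 'id_01' -> 'id'."""
    return re.sub(r"_?\d+$", "", name)


def family_counts(columns) -> Counter:
    """Count how many columns fall into each family."""
    return Counter(column_family(col) for col in columns)


# Shortened stand-in for df.columns (illustrative subset only)
example_columns = [
    "TransactionID", "isFraud", "card1", "card2", "addr1",
    "C1", "C2", "D1", "M1", "V1", "V2", "V3",
]

counts = family_counts(example_columns)
print(counts.most_common())
```

Passing `df.columns` instead of the example list gives the size of each feature family in the full dataset, which helps decide where feature selection will matter most.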

## Specific dataset explanation ("Identity" data)

In [2]:
import pandas as pd

# Load the dataset
df = pd.read_csv('train_identity.csv')

# Get the column names (features)
features = df.columns
print(features)

Index(['TransactionID', 'id_01', 'id_02', 'id_03', 'id_04', 'id_05', 'id_06',
       'id_07', 'id_08', 'id_09', 'id_10', 'id_11', 'id_12', 'id_13', 'id_14',
       'id_15', 'id_16', 'id_17', 'id_18', 'id_19', 'id_20', 'id_21', 'id_22',
       'id_23', 'id_24', 'id_25', 'id_26', 'id_27', 'id_28', 'id_29', 'id_30',
       'id_31', 'id_32', 'id_33', 'id_34', 'id_35', 'id_36', 'id_37', 'id_38',
       'DeviceType', 'DeviceInfo'],
      dtype='object')



The identity dataset contains a variety of features related to the identity and digital security of transactions. These features primarily focus on network connection information (IP, ISP, Proxy, etc.) and digital signatures (e.g., browser, operating system, device type), which are collected by Vesta's fraud protection system and their digital security partners. These are not available for all transactions.
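
Since identity information is missing for some transactions, it helps to measure how many transactions actually have a matching identity row. The sketch below joins the two tables on TransactionID with a left join. It assumes that TransactionID appears at most once in each table, and it uses small made-up frames instead of the real files; `identity_coverage` is a hypothetical helper name.

```python
import pandas as pd


def identity_coverage(transactions: pd.DataFrame,
                      identity: pd.DataFrame,
                      key: str = "TransactionID"):
    """Left-join identity onto transactions and return (merged frame, share with identity)."""
    merged = transactions.merge(
        identity,
        on=key,
        how="left",              # keep every transaction, matched or not
        indicator=True,          # adds a '_merge' column: 'both' or 'left_only'
        validate="one_to_one",   # assumption: the key is unique in both tables
    )
    share_with_identity = (merged["_merge"] == "both").mean()
    return merged.drop(columns="_merge"), share_with_identity


# Toy stand-ins for the two tables (illustrative values only)
toy_transactions = pd.DataFrame({
    "TransactionID": [1, 2, 3, 4],
    "isFraud": [0, 1, 0, 0],
})
toy_identity = pd.DataFrame({
    "TransactionID": [2, 4],
    "DeviceType": ["mobile", "desktop"],
})

merged, coverage = identity_coverage(toy_transactions, toy_identity)
print(f"Share of transactions with identity data: {coverage:.0%}")
```

A left join is used here so that transactions without identity data stay in the result with missing values, rather than being dropped silently.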

**Breakdown of the identity dataset:**
<br>
<br>
**1. Transaction Information:**
<br>
<br>

**TransactionID:** 
<br>Unique identifier for each transaction (this is a common feature across both datasets).
<br>
<br>

**2. Identity Features (id_01 to id_11):**
<br>

<br>
These are numerical features associated with identity-related information, such as device ratings, IP domain ratings, proxy ratings, and behavioral fingerprints. Examples of behavioral data might include account login times, failed login attempts, and how long a user stayed on a page. These features are collected by Vesta and its security partners, but due to privacy agreements the exact definitions are not provided. We will therefore decide per column whether to treat it as numerical or categorical, depending on the context.
<br>
<br>

**3. Additional Identity Information (id_12 to id_38):**
<br>
id_12 to id_38: These additional identity-related features may hold more specific information about the identity associated with a transaction, such as timestamps of certain actions or events related to the account and device. However, their exact nature is not disclosed, so further investigation and some assumptions will be needed.
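
Because the meaning of these columns is undisclosed, one way to start the investigation is to summarize each column's data type, number of distinct values, and share of missing values. This is a minimal sketch on a made-up frame (the values are illustrative and not taken from `train_identity.csv`); `summarize_columns` is a hypothetical helper name.

```python
import pandas as pd


def summarize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, distinct values, and missing share."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "n_unique": frame.nunique(dropna=True),
        "missing_share": frame.isna().mean(),
    })


# Toy stand-in for the identity table (illustrative values only)
toy_identity = pd.DataFrame({
    "id_01": [0.0, -5.0, None, -5.0],
    "id_12": ["a", "b", "b", None],
})

summary = summarize_columns(toy_identity)
print(summary)
```

Columns with few distinct values are candidates for categorical treatment, and columns with a very high missing share are candidates for removal during clean-up.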
<br>
<br>

**4. Device Information:**
<br>
DeviceType: Type of device used for the transaction (e.g., mobile, desktop).
<br>
DeviceInfo: A more detailed description of the device used, which could include specific information about the browser, operating system, and version being used during the transaction.
<br>
<br>

**General Observations:**
<br>
These features primarily help to track the behavior and characteristics of the user's device and account, which can be essential in detecting suspicious or fraudulent activities. For instance, frequent failed login attempts, a proxy server used, or a device with an unusual fingerprint might suggest a higher likelihood of fraud.

After the analysis of the identity features, we can formulate a sub-question related to how these features contribute to detecting fraudulent transactions:

**Sub-question: How do specific identity-related features, such as device type, IP address patterns, and behavioral data, contribute to identifying potentially fraudulent transactions?**


## Sub questions


We have chosen to focus on one sub-question from the transaction dataset and one from the identity dataset to provide an analysis of the factors that contribute to fraudulent transactions.
<br>
By addressing the two datasets separately, we aim to understand the role of both transaction details and identity-related features in fraud detection. This approach keeps the investigation of each group of factors clearly separated.

**Sub-question 1 (Transaction Dataset):**
<br>
How do the features in the transaction dataset, such as transaction amount, product code, and card details, influence the likelihood of a transaction being fraudulent?
<br>
This question focuses on analyzing how various transaction-specific characteristics, such as the payment amount, the type of product or service involved, and details about the payment card, contribute to identifying fraudulent activities.
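
One simple way to approach this sub-question for a categorical feature is to compare the fraud rate across its values. The sketch below is a possible starting point rather than the final method: it uses a made-up frame with placeholder product codes, and `fraud_rate_by` is a hypothetical helper name.

```python
import pandas as pd


def fraud_rate_by(frame: pd.DataFrame, feature: str,
                  target: str = "isFraud") -> pd.DataFrame:
    """Return transaction count and fraud rate per value of a categorical feature."""
    rates = frame.groupby(feature, dropna=False)[target].agg(
        n_transactions="count",   # assumes the target has no missing values
        fraud_rate="mean",        # mean of a 0/1 column is the fraud share
    )
    return rates.sort_values("fraud_rate", ascending=False)


# Toy stand-in for the transaction table (placeholder codes, illustrative only)
toy = pd.DataFrame({
    "ProductCD": ["A", "A", "B", "B", "C"],
    "isFraud": [0, 0, 1, 0, 0],
})

rates = fraud_rate_by(toy, "ProductCD")
print(rates)
```

Keeping the transaction count next to the rate matters, because a high fraud rate based on only a handful of transactions is weak evidence.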

**Sub-question 2 (Identity Dataset):**
<br>
How do specific identity-related features, such as device type, IP address patterns, and behavioral data, contribute to identifying potentially fraudulent transactions?
<br>
The second sub-question investigates how identity-related factors, including device information, login behavior, and network-related details, can be used to flag suspicious activities. These features help track the user’s device and account behavior, which can serve as critical indicators of fraud. We will explore whether unusual device fingerprints, failed login attempts, or use of proxies correlate with a higher likelihood of fraudulent transactions.