# EDA

## Task
Xente is an e-commerce and financial services app serving 10,000+ customers in Uganda.
Xente offers smart Visa cards & payment solutions that simplify finance and admin for companies operating in Africa.
This dataset includes a sample of approximately 140,000 transactions that occurred between 15 November 2018 and 15 March 2019.

One of the challenges of fraud detection is that the data is highly imbalanced: fraudulent transactions are far rarer than legitimate ones. See these blog posts for examples of how imbalanced data can be handled:
https://medium.com/coinmonks/handling-imbalanced-datasets-predicting-credit-card-fraud-544f5e74e0fd
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
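
Before picking a resampling or weighting strategy, it helps to put a number on the imbalance. The following is a minimal sketch, assuming a binary "FraudResult" column as described under Variables. The toy frame is made up for illustration and is not the Xente data; the weight formula is the common "balanced" heuristic, not necessarily what this project ends up using.

```python
import pandas as pd

# Hypothetical toy data (NOT the Xente data): 2 fraud cases in 1,000 rows.
toy = pd.DataFrame({"FraudResult": [0] * 998 + [1] * 2})

# Absolute and relative class frequencies.
class_counts = toy["FraudResult"].value_counts()
class_share = toy["FraudResult"].value_counts(normalize=True)

# "Balanced" class weights: n_samples / (n_classes * count_per_class),
# the same heuristic scikit-learn applies for class_weight="balanced".
class_weights = len(toy) / (class_counts.size * class_counts)

print(class_counts)
print(class_share)
print(class_weights)
```

The rare class receives a much larger weight, which is one way to keep a model from ignoring it.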

## Variables
* TransactionId: Unique transaction identifier on platform
* BatchId: Unique number assigned to a batch of transactions for processing
* AccountId: Unique number identifying the customer on platform
* SubscriptionId: Unique number identifying the customer subscription
* CustomerId: Unique identifier attached to Account
* CurrencyCode: Country currency
* CountryCode: Numerical geographical code of country
* ProviderId: Source provider of the item bought
* ProductId: Name of the item being bought
* ProductCategory: ProductIds are organized into these broader product categories.
* ChannelId: Identifies whether the customer used web, Android, iOS, pay later, or checkout
* Amount: Value of the transaction. Positive for debits from the customer account and negative for credits into the customer account
* Value: Absolute value of the amount
* TransactionStartTime: Transaction start time
* PricingStrategy: Category of Xente's pricing structure for merchants
* FraudResult: Fraud status of the transaction (1 = yes, 0 = no)

## Brainstorming
* New feature: sign of "Amount", to separate debit from credit transactions
* |Amount| should equal "Value", but this is not always the case!
* After baseline modeling: feature engineering on "TransactionStartTime"
    * Periodicity (more fraud during/after work hours, on certain days of the week, ...)
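
The ideas in this list can be sketched on a few made-up rows. This is an illustration only, not the Xente data. The new column names ("IsDebit", "ValueMismatch", "Hour", "DayOfWeek") are hypothetical, and the ISO-style timestamp format is an assumption about how "TransactionStartTime" is stored.

```python
import pandas as pd

# Hypothetical rows for illustration only (not taken from the dataset).
toy = pd.DataFrame(
    {
        "Amount": [1000.0, -20.0, 500.0],
        "Value": [1000, 20, 550],
        "TransactionStartTime": [
            "2018-11-15T02:18:49Z",
            "2018-11-15T14:19:08Z",
            "2018-11-17T09:44:21Z",
        ],
    }
)

# Sign of Amount: positive = debit from the customer account, negative = credit.
toy["IsDebit"] = toy["Amount"] > 0

# Flag rows where |Amount| differs from Value.
toy["ValueMismatch"] = toy["Amount"].abs() != toy["Value"]

# Simple calendar features for spotting periodic patterns.
start = pd.to_datetime(toy["TransactionStartTime"], utc=True)
toy["Hour"] = start.dt.hour
toy["DayOfWeek"] = start.dt.dayofweek  # Monday = 0, Sunday = 6

print(toy)
```

Counting the flagged rows with `toy["ValueMismatch"].sum()` gives a quick sense of how often the two columns disagree.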

## Import libraries

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport  # package has since been renamed to ydata-profiling (import ydata_profiling)
import numpy as np

import ipywidgets

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go

## Import data

In [None]:
df = pd.read_csv("../data/training.csv")

df = df.drop(columns=["CurrencyCode", "CountryCode"])  # identical value across all entries
df = df.set_index("TransactionId")

df.head(20)
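
The drop above relies on both columns holding a single value across all rows. A minimal sketch of how that could be checked before dropping follows. `constant_columns` is a hypothetical helper, and the toy frame and its values are placeholders rather than the real data.

```python
import pandas as pd


def constant_columns(frame):
    """Return the names of columns that hold a single distinct value."""
    # dropna=False counts NaN as its own value, so a column mixing
    # one value with missing entries is not reported as constant.
    n_unique = frame.nunique(dropna=False)
    return n_unique[n_unique == 1].index.tolist()


# Hypothetical toy frame (placeholder values, not the real data).
toy = pd.DataFrame(
    {
        "CurrencyCode": ["UGX", "UGX", "UGX"],
        "CountryCode": [256, 256, 256],
        "Amount": [1000.0, -20.0, 500.0],
    }
)

print(constant_columns(toy))
```

Running the helper on the raw frame right after `pd.read_csv` would list every column that carries no information for modeling.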

## Profiling report

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile