# EDA

## Task
Xente is an e-commerce and financial service app serving 10,000+ customers in Uganda.
Xente offers smart Visa cards & payment solutions that simplify finance and admin for companies operating in Africa.
This dataset includes a sample of approximately 140,000 transactions that occurred between 15 November 2018 and 15 March 2019.

One of the challenges of fraud detection is that the data is highly imbalanced: fraudulent transactions are a very small share of all transactions. See these blog posts for examples of how imbalanced data might be handled:
https://medium.com/coinmonks/handling-imbalanced-datasets-predicting-credit-card-fraud-544f5e74e0fd
https://towardsdatascience.com/methods-for-dealing-with-imbalanced-data-5b761be45a18
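
One common option is class weighting. The following is a minimal sketch, assuming scikit-learn is available; the label array `y_demo` is made up for illustration and is not the Xente data.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 98 legitimate (0) and 2 fraudulent (1) transactions
y_demo = np.array([0] * 98 + [1] * 2)

# "balanced" weights are n_samples / (n_classes * count_per_class)
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_demo
)
print(dict(zip([0, 1], weights)))
```

Passing `class_weight="balanced"` to an estimator such as `DecisionTreeClassifier` applies the same weighting during fitting, so errors on the rare class cost more.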

## Variables
Column name: definition
* TransactionId: Unique transaction identifier on platform
* BatchId: Unique number assigned to a batch of transactions for processing
* AccountId: Unique number identifying the customer on platform
* SubscriptionId: Unique number identifying the customer subscription
* CustomerId: Unique identifier attached to Account
* CurrencyCode: Country currency
* CountryCode: Numerical geographical code of country
* ProviderId: Source provider of Item bought.
* ProductId: Item name being bought.
* ProductCategory: ProductIds are organized into these broader product categories.
* ChannelId: Identifies if customer used web, Android, iOS, pay later or checkout.
* Amount: Value of the transaction. Positive for debits from customer account and negative for credit into customer account
* Value: Absolute value of the amount
* TransactionStartTime: Transaction start time
* PricingStrategy: Category of Xente's pricing structure for merchants
* FraudResult: Fraud status of the transaction (1 = yes, 0 = no)

## Brainstorming
* New feature: sign of "Amount" to separate debit from credit transactions
* |Amount| should equal "Value", but this is not always the case!
* After baseline modeling: feature engineering on TransactionStartTime
    * Periodicity (more fraud during/after work hours, on certain days of the week, ...)
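
The ideas above could be sketched as follows. This is only an illustration on a toy dataframe: the values, the ISO 8601 timestamp format, and the new column names (`AmountSign`, `AmountValueMismatch`, `Hour`, `DayOfWeek`) are assumptions, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy transactions; all values are made up for illustration
toy = pd.DataFrame({
    "Amount": [1000.0, -20.0, 500.0, -5000.0],
    "Value": [1000, 20, 600, 5000],
    "TransactionStartTime": [
        "2018-11-15T02:18:49Z", "2018-11-15T02:19:08Z",
        "2018-11-16T14:44:21Z", "2018-11-18T09:03:00Z",
    ],
})

# Sign of Amount: 1 for debits (positive), -1 for credits (negative)
toy["AmountSign"] = np.sign(toy["Amount"]).astype(int)

# Flag rows where |Amount| differs from Value
toy["AmountValueMismatch"] = toy["Amount"].abs() != toy["Value"]

# Time-based features for checking periodicity
ts = pd.to_datetime(toy["TransactionStartTime"])
toy["Hour"] = ts.dt.hour
toy["DayOfWeek"] = ts.dt.dayofweek  # Monday=0 ... Sunday=6

print(toy)
```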

## Import libraries

In [None]:
import pandas as pd
from pandas_profiling import ProfileReport
import numpy as np

import ipywidgets

import matplotlib
#matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.express as px
from plotly.subplots import make_subplots
import plotly.graph_objects as go
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

## Import data

In [None]:
df = pd.read_csv("../data/training.csv")
df = df.drop(["CurrencyCode","CountryCode"], axis=1) # identical value across all entries
df.set_index("TransactionId", inplace=True)
df.head()

## Profiling report

In [None]:
profile = ProfileReport(df, title="Pandas Profiling Report", explorative=True)
profile

# Exploratory Data Analysis

## Plot fraudulent transactions for all features

In [None]:
fraud = df.query("FraudResult == 1")
fraud

In [None]:
sns.pairplot(fraud)

In [None]:
sns.countplot(x="PricingStrategy", data=fraud)
plt.show()

fraud.query("PricingStrategy == 0")["PricingStrategy"].count()

Looking at the fraud-positive transactions, fraud is strongly associated with PricingStrategy = 0 (10% fraud-positive).

## Normalize fraudulent transactions within each categorical feature

In [None]:
feature_names = ['BatchId',
 'AccountId',
 'SubscriptionId',
 'CustomerId',
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'PricingStrategy']

for feat in feature_names:
    counts = (df.groupby(feat)['FraudResult']
                         .value_counts(normalize=True)
                         .rename('percentage_fraud')
                         .mul(100)
                         .reset_index()
                         .sort_values(feat))
    plt.figure(figsize=(20,6), dpi=80)
    p = sns.barplot(x=feat, y="percentage_fraud", hue="FraudResult", data=counts[counts.FraudResult==1])
    plt.xticks(rotation=90)
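
Because FraudResult is coded 0/1, the per-category fraud percentage can also be obtained as a group mean. The sketch below uses a made-up toy dataframe, not the Xente data, and is only one possible alternative to the loop above.

```python
import pandas as pd

# Toy data; values are made up for illustration
toy = pd.DataFrame({
    "PricingStrategy": [0, 0, 2, 2, 2, 4, 4, 4, 4, 4],
    "FraudResult":     [1, 0, 0, 0, 1, 0, 0, 0, 0, 0],
})

# The mean of a 0/1 column is the share of ones, i.e. the fraud rate
fraud_rate = (toy.groupby("PricingStrategy")["FraudResult"]
                 .mean()
                 .mul(100)
                 .rename("percentage_fraud"))
print(fraud_rate)
```

Unlike filtering the `value_counts` result on `FraudResult == 1`, this keeps categories with no fraud at all and reports them as 0%.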

## Get dummies
Use pd.get_dummies(df, columns=cat_columns, drop_first=True) to create a wider dataframe in which each categorical feature is replaced by dummy-variable columns. With drop_first=True, the first level of each feature is dropped and serves as the reference category.
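
To make the effect of `drop_first=True` concrete, here is a small sketch on a toy dataframe; the values are invented for illustration.

```python
import pandas as pd

# Toy data; values are made up for illustration
toy = pd.DataFrame({
    "ChannelId": ["ChannelId_1", "ChannelId_2", "ChannelId_3", "ChannelId_2"],
    "Amount": [1000.0, -20.0, 500.0, 50.0],
})

# One dummy column per level, except the first level (ChannelId_1),
# which is dropped and implied when all remaining dummies are zero
encoded = pd.get_dummies(toy, columns=["ChannelId"], drop_first=True)
print(encoded.columns.tolist())
```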

In [None]:
df.columns.tolist()

In [None]:
cat_columns = [
 'ProviderId',
 'ProductId',
 'ProductCategory',
 'ChannelId',
 'PricingStrategy']

In [None]:
df_dummies = pd.get_dummies(df, columns=cat_columns, drop_first = True)
df_dummies.head()

baseline = df_dummies.drop(["BatchId", "AccountId", "SubscriptionId", "CustomerId", "TransactionStartTime"], axis=1)

### Define features (X) and target variable (y)

In [None]:
X = baseline.loc[:, baseline.columns != 'FraudResult']
y = baseline["FraudResult"]

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

In [None]:
# Baseline decision tree; fixed seed so the results are reproducible
dtree = DecisionTreeClassifier(random_state=0)
dtree.fit(X_train, y_train)
predictions = dtree.predict(X_test)

confusion_matrix(y_test, predictions)

In [None]:
print(classification_report(y_test, predictions))

## Checking baseline model with Naive Bayes

In [None]:
NB = GaussianNB()
# Fit on the training split; the test split is kept for evaluation only
NB.fit(X_train, y_train)

In [None]:
predictions = NB.predict(X_test)
confusion_matrix(y_test, predictions)

In [None]:
print(classification_report(y_test, predictions))