<a href="https://colab.research.google.com/github/LeonMilosevic/fraud_homework_redo/blob/main/fraud_homework_redo.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Concept of the Notebook

This notebook consists of five parts.

1. Analysis
  - Explore the dataset
  - Find insights
  - Draw conclusions
  - Get ideas for feature engineering

2. Base model
  - We will create a base model to serve as a benchmark for performance.

3. Feature Engineering
  - Improve the results by building on the discoveries from the Analysis part.

4. Modeling
  - Model optimization and performance tuning.
  - Draw business conclusions

5. Business presentation and final conclusions.
  - We will sum up all previous chapters, present our solution, and advise on how to improve the business model for detecting fraudulent transactions.

# Imports and libraries 

In [None]:
!pip install pgeocode
!pip install folium
!pip install pycountry
!pip install beautifulsoup4

In [4]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from statistics import mean

import pycountry
import pgeocode
import folium
from folium import Marker

from scipy import stats

from bs4 import BeautifulSoup
import requests

from helper_functions import get_postal_code, get_time_difference, check_missing_values

import warnings
warnings.filterwarnings('ignore')

# Analysis

In [5]:
transaction_data = pd.read_csv('transactions_obf.csv')
labels_data = pd.read_csv('labels_obf.csv')

# adding target value to labels_data and merging the dataframes
labels_data['target'] = 1
data = pd.merge(transaction_data, labels_data, how="left", on="eventId")

data[['target']] = data[['target']].fillna(value=0)

# converting the time column to datetime (it is kept as a regular column, not set as the index)
data['transactionTime'] = pd.to_datetime(data['transactionTime'])

data.shape

(118621, 12)
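
A left merge can silently duplicate transactions if an `eventId` appears more than once in the labels file. The following is a minimal sketch, on toy data with made-up values, of how the `validate` and `indicator` options of `pd.merge` could guard the merge and build the target in one step. The frame names `transactions` and `labels` are placeholders, not the notebook's variables.

```python
import pandas as pd

# Toy stand-ins for transactions_obf.csv and labels_obf.csv (values are made up).
transactions = pd.DataFrame({
    "eventId": ["a", "b", "c"],
    "transactionAmount": [10.0, 25.5, 7.2],
})
labels = pd.DataFrame({"eventId": ["b"]})

# validate="one_to_one" raises a MergeError if eventId is duplicated on either side;
# indicator=True adds a "_merge" column showing where each row was found.
merged = pd.merge(transactions, labels, how="left", on="eventId",
                  validate="one_to_one", indicator=True)

# A transaction is labelled as fraud exactly when it also appears in the labels frame.
merged["target"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

print(merged)
```

With this approach the row count after the merge is guaranteed to equal the number of transactions, and no separate `fillna` step is needed.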

###### Split into train/test data based on time

In [6]:
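# .copy() makes both splits independent DataFrames, so later column assignments are safe;
# >= keeps any transaction stamped exactly at 2018-01-01 00:00:00 in the test set
train_data = data[data['transactionTime'] < "2018-01-01"].copy()
test_data = data[data['transactionTime'] >= "2018-01-01"].copy()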

print(train_data.shape)
print(test_data.shape)

(110090, 12)
(8531, 12)


###### Missing Values, maybe remove?

In [9]:
check_missing_values(train_data)[1:]

| | total | percent |
|---|---|---|
| merchantZip | 21182 | 0.192406 |
| target | 0 | 0.0 |
| availableCash | 0 | 0.0 |
| transactionAmount | 0 | 0.0 |
| posEntryMode | 0 | 0.0 |
| merchantCountry | 0 | 0.0 |
| mcc | 0 | 0.0 |
| merchantId | 0 | 0.0 |
| accountNumber | 0 | 0.0 |
| eventId | 0 | 0.0 |
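
`check_missing_values` lives in `helper_functions`, which is not shown in this notebook. The sketch below is only a guess at what such a helper might do, inferred from the `total` and `percent` columns in the output above. The name `missing_value_summary` and its implementation are assumptions, not the author's code.

```python
import pandas as pd

def missing_value_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for check_missing_values: NaN count and share per column."""
    total = df.isnull().sum()
    percent = total / len(df)  # a fraction of rows, not multiplied by 100
    summary = pd.DataFrame({"total": total, "percent": percent})
    return summary.sort_values("total", ascending=False)

# Toy frame with made-up values.
toy = pd.DataFrame({
    "merchantZip": ["AB1", None, None, "CD2"],
    "mcc": [1, 2, 3, 4],
})
summary = missing_value_summary(toy)
print(summary)
```

Read this way, the `percent` value of 0.192406 for `merchantZip` is a fraction: roughly 19.2% of the training rows have no zip code.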


###### Target Distribution

In [10]:
# proportion of fraud transactions
train_data['target'].value_counts(normalize=True)

0.0    0.992406
1.0    0.007594
Name: target, dtype: float64

- We are dealing with severe class imbalance: only about 0.76% of the training transactions are labelled as fraud.
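
To make the imbalance concrete, here is a small sketch that uses the proportions printed above to work out how many legitimate transactions there are per fraud, and how accurate a trivial model that never predicts fraud would be. The variable names are illustrative only.

```python
import pandas as pd

# Class proportions copied from the value_counts output above.
proportions = pd.Series({0.0: 0.992406, 1.0: 0.007594})

# Ratio of legitimate to fraudulent transactions.
imbalance_ratio = proportions.loc[0.0] / proportions.loc[1.0]
print(f"about {imbalance_ratio:.0f} legitimate transactions per fraud")

# Accuracy of a trivial model that always predicts "not fraud".
baseline_accuracy = proportions.max()
print(f"always-negative baseline accuracy: {baseline_accuracy:.4f}")
```

Because the always-negative baseline already scores above 99% accuracy, accuracy alone is a poor benchmark here; metrics that focus on the fraud class, such as precision and recall, are likely more informative.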

###### Total transactions per country

In [13]:
train_data.groupby(['merchantCountry']).target.count().sort_values(ascending=False).head(5)

merchantCountry
826    88908
442    13819
840     2640
372     1236
250      429
Name: target, dtype: int64

Most of the transactions (about 81% of the training set) come from the United Kingdom,
since ISO 3166-1 numeric country code 826 corresponds to GB.
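
The numeric codes are easier to read once they are mapped to alpha-2 codes. The sketch below does this for the five countries listed above with a hand-written dictionary. The dictionary and the name `ISO_NUMERIC_TO_ALPHA2` are illustrative assumptions; a lookup library such as the imported `pycountry` could supply the same mapping for all countries.

```python
import pandas as pd

# ISO 3166-1 numeric -> alpha-2 codes for the five most frequent merchant countries above.
ISO_NUMERIC_TO_ALPHA2 = {826: "GB", 442: "LU", 840: "US", 372: "IE", 250: "FR"}

# Transaction counts copied from the groupby output above.
top_countries = pd.Series({826: 88908, 442: 13819, 840: 2640, 372: 1236, 250: 429})

# Relabel the index and express each count as a share of all training rows.
labelled = top_countries.rename(index=ISO_NUMERIC_TO_ALPHA2)
share = labelled / 110090  # 110090 = number of rows in train_data
print(share.round(4))
```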

###### Zip Codes per country

In [15]:
# convert to uppercase so the same zip code written in different cases is not counted twice
train_data['merchantZip'] = train_data['merchantZip'].str.upper()

# getting unique country values
unique_countries = train_data['merchantCountry'].unique().tolist()

# number of unique zip codes per country (note: unique() also counts NaN as a value)
unique_zips = [
    len(train_data.loc[train_data['merchantCountry'] == country, 'merchantZip'].unique())
    for country in unique_countries
]

temp_data = {"country_code": unique_countries, "num_of_unique_zip_codes": unique_zips}
zip_codes_per_country_df = pd.DataFrame(temp_data)
zip_codes_per_country_df.head(5)

| | country_code | num_of_unique_zip_codes |
|---|---|---|
| 0 | 826 | 3099 |
| 1 | 442 | 1 |
| 2 | 392 | 1 |
| 3 | 36 | 1 |
| 4 | 372 | 1 |


In [16]:
len(train_data['merchantZip'].unique())

3100

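We can see that all recorded merchant zip codes come from GB: it accounts for 3099 of the 3100 unique values. The single "unique" value reported for each of the other countries is presumably just the missing-value placeholder (NaN), which `unique()` includes in its result, so zip codes from countries other than GB were not registered or are unknown.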
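
The difference between counting with `len(unique())` and with `nunique()` can be seen on a toy example. This is a sketch with made-up values, not the notebook's data: `nunique()` skips NaN by default, so countries without any recorded zip code show a count of zero instead of one.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values: only country 826 has zip codes recorded.
toy = pd.DataFrame({
    "merchantCountry": [826, 826, 826, 442, 442],
    "merchantZip": ["AB1", "AB1", "CD2", np.nan, np.nan],
})

zips_by_country = toy.groupby("merchantCountry")["merchantZip"]

# len(unique()) counts NaN as a value, so country 442 appears to have one zip code.
with_nan = zips_by_country.apply(lambda s: len(s.unique()))

# nunique() drops NaN by default, so country 442 shows zero.
without_nan = zips_by_country.nunique()

print(pd.DataFrame({"len(unique())": with_nan, "nunique()": without_nan}))
```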

###### Taking a closer look at GB

In [36]:
gb_data = train_data.loc[train_data['merchantCountry'] == 826]

# length of each zip code string (NaN where the zip code is missing)
zip_len = gb_data['merchantZip'].str.len()

# number of unique zip code values for each observed length;
# iterating over the observed lengths avoids assuming they run 1, 2, 3, ... without gaps
for length in sorted(zip_len.dropna().unique().astype(int)):
    print(length, gb_data.loc[zip_len == length, 'merchantZip'].nunique())

1 2
2 4
3 1087
4 1263
5 743


Note: based on a web search, we assume that a valid GB zip code has at least 3 characters (zip_code_len >= 3).

###### Invalid Zip codes, based on zip_code_len < 3

In [37]:
gb_data[gb_data['merchantZip'].str.len() < 3]['merchantZip'].unique().tolist()

['0', 'NI', 'F', 'MK', '**', '..']
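
One possible way to handle these values, sketched here on toy data, is to flag them and treat them as missing rather than dropping the whole transaction. The threshold comes from the note above; the column names `zip_invalid` and `merchantZipClean` are hypothetical, and whether to mask or drop is a modelling choice this notebook has not made yet.

```python
import numpy as np
import pandas as pd

MIN_ZIP_LEN = 3  # assumption taken from the note above

# Toy GB rows; the short values are among the invalid ones listed above.
toy = pd.DataFrame({"merchantZip": ["0", "NI", "**", "AB1", "CD12", np.nan]})

# str.len() returns NaN for missing zips; NaN < 3 is False, so they are not flagged.
toy["zip_invalid"] = toy["merchantZip"].str.len() < MIN_ZIP_LEN

# Replace invalid zips with NaN and keep everything else unchanged.
toy["merchantZipClean"] = toy["merchantZip"].mask(toy["zip_invalid"])

print(toy)
```

Keeping the `zip_invalid` flag as its own column also preserves the information that a placeholder such as `**` was entered, which could itself be a useful feature.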