
# Concept of the Notebook

The notebook consists of 5 parts.

1. Analysis
  - Explore the dataset
  - Find insights
  - Draw conclusions
  - Get ideas for feature engineering

2. Base model
  - We will create a base model to serve as a benchmark for performance.

3. Feature Engineering
  - Improve the results by building upon discoveries from the Analysis part.

4. Modeling
  - Model optimization and performance tuning.
  - Draw business conclusions

5. Business presentation and final conclusions.
  - We will sum up all previous chapters, present our solution, and advise on how to improve the business model for detecting fraudulent transactions.

# Imports and libraries 

In [1]:
!pip install pgeocode
!pip install folium
!pip install pycountry
!pip install beautifulsoup4



In [2]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
from statistics import mean

import pycountry
import pgeocode
import folium
from folium import Marker

from scipy import stats

from bs4 import BeautifulSoup
import requests

from helper_functions import get_postal_code, get_time_difference, check_missing_values, split_zip_codes

import warnings
warnings.filterwarnings('ignore')

# Analysis

In [3]:
transaction_data = pd.read_csv('transactions_obf.csv')
labels_data = pd.read_csv('labels_obf.csv')

# adding target value to labels_data and merging the dataframes
labels_data['target'] = 1
data = pd.merge(transaction_data, labels_data, how="left", on="eventId")

data[['target']] = data[['target']].fillna(value=0)

# converting the time column to datetime (it is kept as a regular column, not set as the index)
data['transactionTime'] = pd.to_datetime(data['transactionTime'])

data.shape

(118621, 12)
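
The labelling step above relies on a left merge: every transaction is kept, and only events listed in the labels file receive `target = 1`. A minimal sketch on toy data (the event IDs and amounts are invented for illustration) shows the same idea and adds a row-count check, since duplicate `eventId` values in the labels file would silently duplicate transactions:

```python
import pandas as pd

# Toy stand-ins for transactions_obf.csv and labels_obf.csv (values are invented)
transactions = pd.DataFrame({"eventId": ["a1", "b2", "c3", "d4"],
                             "transactionAmount": [10.0, 25.5, 3.2, 99.9]})
labels = pd.DataFrame({"eventId": ["b2", "d4"]})

# Every event listed in the labels file is a reported fraud
labels["target"] = 1

# Left merge keeps all transactions; unlabelled ones get NaN, which we fill with 0
labelled = transactions.merge(labels, how="left", on="eventId")
labelled["target"] = labelled["target"].fillna(0).astype(int)

# Guard: the merge must not add or drop transactions
assert len(labelled) == len(transactions)
print(labelled)
```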

###### Split into train/test data based on time

In [4]:
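# .copy() avoids chained-assignment warnings when columns are modified later
train_data = data[data['transactionTime'] < "2018-01-01"].copy()
# >= so that a transaction stamped exactly at the cutoff is not dropped
test_data = data[data['transactionTime'] >= "2018-01-01"].copy()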

print(train_data.shape)
print(test_data.shape)

(110090, 12)
(8531, 12)
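
A time-based split should assign every row to exactly one side of the cutoff. The sketch below illustrates this on toy data, assuming timezone-naive timestamps; `split_by_time` is a hypothetical helper, not part of the notebook's `helper_functions` module:

```python
import pandas as pd

def split_by_time(df, time_col, cutoff):
    """Return (train, test): rows strictly before the cutoff, and all remaining rows."""
    cutoff = pd.Timestamp(cutoff)
    is_train = df[time_col] < cutoff
    # Using the complement guarantees that no row is lost between the two sets
    return df[is_train].copy(), df[~is_train].copy()

# Toy data (invented values) with one row exactly at the cutoff
toy = pd.DataFrame({
    "transactionTime": pd.to_datetime(
        ["2017-12-31 23:59:59", "2018-01-01 00:00:00", "2018-01-02 08:30:00"]),
    "transactionAmount": [5.0, 7.5, 12.0],
})

train, test = split_by_time(toy, "transactionTime", "2018-01-01")
print(len(train), len(test))
```

Splitting on time rather than at random keeps future transactions out of the training set, which mirrors how a fraud model would be used in production.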


###### Missing Values, maybe remove?

In [5]:
check_missing_values(train_data)[1:]

| | total | percent |
|---|---|---|
| merchantZip | 21182 | 0.192406 |
| target | 0 | 0.0 |
| availableCash | 0 | 0.0 |
| transactionAmount | 0 | 0.0 |
| posEntryMode | 0 | 0.0 |
| merchantCountry | 0 | 0.0 |
| mcc | 0 | 0.0 |
| merchantId | 0 | 0.0 |
| accountNumber | 0 | 0.0 |
| eventId | 0 | 0.0 |


###### Target Distribution

In [6]:
# proportion of fraud transactions
train_data['target'].value_counts(normalize=True)

0.0    0.992406
1.0    0.007594
Name: target, dtype: float64

- We are dealing with severe class imbalance: only about 0.76% of the training transactions are fraudulent.
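
With this level of imbalance, accuracy is a misleading metric: a model that always predicts "not fraud" would already score about 99.24% on the training data. A minimal sketch on an invented toy target shows how the imbalance ratio and this trivial baseline can be computed; the variable names are illustrative only:

```python
import pandas as pd

# Toy target with a similar kind of skew (1 fraud in 8 rows; values are invented)
target = pd.Series([0, 0, 0, 1, 0, 0, 0, 0], name="target")

counts = target.value_counts()
n_legit, n_fraud = counts.loc[0], counts.loc[1]

# Ratio of negatives to positives; often used as a weight for the positive class
imbalance_ratio = n_legit / n_fraud

# Accuracy of a trivial model that always predicts "not fraud"
majority_baseline_accuracy = n_legit / len(target)

print(imbalance_ratio, majority_baseline_accuracy)
```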

###### Total transactions per country

In [7]:
train_data.groupby(['merchantCountry']).target.count().sort_values(ascending=False).head(5)

merchantCountry
826    88908
442    13819
840     2640
372     1236
250      429
Name: target, dtype: int64

Most transactions come from Great Britain: country code 826 is the ISO 3166-1 numeric code for GB.

###### Zip Codes per country

In [8]:
# convert to uppercase so that values differing only in case are not counted as different zip codes
train_data['merchantZip'] = train_data['merchantZip'].str.upper()

# getting unique countries values
unique_countries = train_data['merchantCountry'].unique().tolist()

# number of unique zip codes per country
unique_zips = []
for i in unique_countries:
  unique_zips.append(len(train_data.loc[train_data['merchantCountry'] == i]['merchantZip'].unique()))

temp_data = {"country_code": unique_countries, "num_of_unique_zip_codes": unique_zips}
zip_codes_per_country_df = pd.DataFrame(temp_data)
zip_codes_per_country_df.head(5)

| | country_code | num_of_unique_zip_codes |
|---|---|---|
| 0 | 826 | 3099 |
| 1 | 442 | 1 |
| 2 | 392 | 1 |
| 3 | 36 | 1 |
| 4 | 372 | 1 |


In [9]:
len(train_data['merchantZip'].unique())

3100

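All known merchant zip codes come from Great Britain. Every other country shows a single "unique" value, which is the missing value (NaN): zip codes for countries other than GB were not registered or are unknown. This matches the 21,182 missing `merchantZip` values found earlier (110,090 training rows minus 88,908 GB rows).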
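
The count of 1 for non-GB countries arises because `len(Series.unique())` treats NaN as a value. A short sketch on invented toy data shows how a grouped `nunique` can separate known zip codes from missing ones; the column names in the result are illustrative:

```python
import numpy as np
import pandas as pd

# Toy data (invented): only country 826 has known zip codes
toy = pd.DataFrame({
    "merchantCountry": [826, 826, 826, 442, 442, 840],
    "merchantZip": ["E12", "SL4", "E12", np.nan, np.nan, np.nan],
})

grouped = toy.groupby("merchantCountry")["merchantZip"]

# nunique() ignores NaN by default; dropna=False counts NaN as one extra value,
# which is what len(Series.unique()) does
zip_counts = pd.DataFrame({
    "known_zip_codes": grouped.nunique(),
    "including_missing": grouped.nunique(dropna=False),
})
print(zip_counts)
```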

###### Taking a closer look at GB

In [10]:
# .copy() so that later changes to gb_data do not trigger chained-assignment warnings
gb_data = train_data.loc[train_data['merchantCountry'] == 826].copy()

# number of distinct zip code string lengths
zip_lengths = len(gb_data['merchantZip'].str.len().value_counts().index.to_list())

# number of unique zip codes for each length (assumes lengths run from 1 to zip_lengths without gaps)
for i in range(zip_lengths):
  print(i+1, len(gb_data.loc[gb_data['merchantZip'].str.len() == i+1]['merchantZip'].unique()))

1 2
2 4
3 1087
4 1263
5 743


**Note:** Zip Code Fact 1.0

* Valid GB zip codes combine letters and digits: they start with one or two letters followed by a digit (e.g. E12, SL4)
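
The letter+digit rule can also be expressed as a regular expression. The sketch below uses a simplified pattern that is an assumption for illustration, not the full official postcode specification, and `classify_zip` is a hypothetical helper that does not exist in the notebook:

```python
import re

# Simplified patterns (an assumption, not the full official specification):
# outward code = 1-2 letters, a digit, then an optional letter or digit;
# inward code  = a digit followed by 2 letters.
OUTWARD = r"[A-Z]{1,2}\d[A-Z\d]?"
INWARD = r"\d[A-Z]{2}"
OUTWARD_RE = re.compile(OUTWARD)
FULL_RE = re.compile(OUTWARD + r"\s?" + INWARD)

def classify_zip(code):
    """Label a code as 'full', 'outward' (district only) or 'invalid'."""
    code = str(code).strip().upper()
    if FULL_RE.fullmatch(code):
        return "full"
    if OUTWARD_RE.fullmatch(code):
        return "outward"
    return "invalid"

for code in ["E12", "SL4", "E83NS", "0", "....", "11111", "F"]:
    print(code, classify_zip(code))
```

A positive pattern like this flags placeholders such as `0` or `....` in one pass, instead of combining several negative checks.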

###### Invalid Zip codes, based on Zip Code Fact 1.0

In [11]:
# finding values that contain non-alphanumeric characters
# (raw strings and no capture groups, so pandas and Python emit no regex warnings)
non_alpha_numeric = gb_data[gb_data['merchantZip'].str.contains(r'\W')]['merchantZip'].unique().tolist()

# finding values that only contain digits (and whitespace)
only_numeric = gb_data[gb_data['merchantZip'].str.contains(r'^[\s\d]+$')]['merchantZip'].unique().tolist()

# finding values that only contain alphabetic chars
only_alphabetic_chars = gb_data[gb_data['merchantZip'].str.contains(r'^[a-zA-Z]+$')]['merchantZip'].unique().tolist()

invalid_zip_values = non_alpha_numeric + only_numeric + only_alphabetic_chars

In [12]:
invalid_zip_values[:5]

['....', '...', '***', '.....', '**']

###### Number of transactions for each invalid zip code

In [13]:
invalid_zip_transactions = []
# zip_code instead of zip, to avoid shadowing the built-in zip()
for zip_code in invalid_zip_values:
  invalid_zip_transactions.append(gb_data.loc[gb_data['merchantZip'] == zip_code].shape[0])

temp_data = {"zip_codes": invalid_zip_values, "transactions": invalid_zip_transactions}
df = pd.DataFrame(temp_data)

df.sort_values(by=['transactions'], ascending=False).head(5)

| | zip_codes | transactions |
|---|---|---|
| 6 | 0 | 13554 |
| 1 | ... | 374 |
| 0 | .... | 361 |
| 7 | 11111 | 35 |
| 19 | F | 27 |


In [14]:
# count of transactions for valid + invalid zip values
gb_data.groupby(['merchantZip']).count().target.sort_values(ascending=False)

merchantZip
0        13554
E12       1063
SL4        614
LS11       570
CO10       512
         ...  
E83NS        1
E83DG        1
E82NS        1
E82JP        1
YO8          1
Name: target, Length: 3099, dtype: int64

Since merchantZip "0" has the highest number of transactions in GB (13,554), we cannot assume that they all come from the same location; the value looks like a placeholder. We will therefore encode it with the generic geolocation of GB.

**Note:** Zip Code Fact 1.1

* All GB codes with 5 characters have a space before the 3rd character (e.g. E83NS becomes E8 3NS)

###### Fixing zip codes based on Zip Code Fact 1.1

In [19]:
gb_data['merchantZip'] = split_zip_codes(gb_data, 5)
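
The source of `split_zip_codes` is not shown in this notebook. As a rough sketch of what such a step could look like, the hypothetical function below inserts a space before the last three characters of codes with exactly 5 characters and leaves all other codes untouched. The function name and behaviour are assumptions based on Zip Code Fact 1.1, not the actual helper:

```python
import pandas as pd

def add_space_to_full_codes(zip_codes, full_length=5):
    """Hypothetical stand-in for split_zip_codes: insert a space before the
    last 3 characters of codes that are exactly `full_length` long."""
    is_full = zip_codes.str.len().eq(full_length)
    spaced = zip_codes.str[:-3] + " " + zip_codes.str[-3:]
    # Keep every other code (districts, placeholders) unchanged
    return zip_codes.where(~is_full, spaced)

codes = pd.Series(["E83NS", "E12", "0"])
print(add_space_to_full_codes(codes).tolist())
```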

###### Adding longitude and latitude to each sample based on Zip address

In [24]:
nomi = pgeocode.Nominatim('gb')

# encoding known zip codes
gb_zip_codes = pd.DataFrame({"zip_codes": gb_data['merchantZip'].unique()})

gb_zip_codes[['Latitude', 'Longitude', 'state_name']] = gb_zip_codes.apply(lambda x: get_postal_code(nomi, x['zip_codes']), axis=1)

print("{}% of addresses were geocoded!".format(
    (1 - sum(np.isnan(gb_zip_codes["Latitude"])) / len(gb_zip_codes)) * 100))

95.70829299774121% of addresses were geocoded!


In [25]:
not_found_zip_codes = gb_zip_codes[gb_zip_codes['Latitude'].isnull()]
not_found_zip_codes['zip_codes'].str.len().value_counts()

4    70
3    41
6    16
2     4
1     2
Name: zip_codes, dtype: int64

After manually examining some codes, most of them are districts with missing street numbers or codes that are no longer in use.

* Some geocoding services available through geopy could potentially find the geolocation of a district. Unfortunately, the free tiers do not work well with districts or with postcodes that are no longer in use.

* We will encode all zip codes that were not found with the geolocation of the UK.

In [29]:
# fill non found states with other
gb_zip_codes['state_name'] = gb_zip_codes.state_name.fillna('other')

# derived from https://developers.google.com/public-data/docs/canonical/countries_csv
# fill nan values with GB coordinates
gb_zip_codes['Latitude'] = gb_zip_codes['Latitude'].fillna(55.378051)
gb_zip_codes['Longitude'] = gb_zip_codes['Longitude'].fillna(-3.435973)

# removing space from strings to match train_data
gb_zip_codes['zip_codes'] = gb_zip_codes['zip_codes'].str.replace(' ', '')

train_data = pd.merge(left=train_data, right=gb_zip_codes, how="outer", left_on='merchantZip', right_on="zip_codes").drop(['zip_codes'], axis=1)
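
When attaching a lookup table to the transactions, it can help to guard against accidental row duplication. The sketch below is one possible variant of the merge above, on invented toy data with made-up coordinates: it uses `how="left"` so that only transaction rows are kept, and `validate="many_to_one"` so that pandas raises an error if a zip code appears more than once in the lookup table:

```python
import pandas as pd

# Toy frames (values invented); lookup plays the role of gb_zip_codes
transactions = pd.DataFrame({"eventId": ["a1", "b2", "c3"],
                             "merchantZip": ["E12", "SL4", None]})
lookup = pd.DataFrame({"zip_codes": ["E12", "SL4"],
                       "Latitude": [51.55, 51.48],
                       "Longitude": [0.05, -0.61]})

# how="left" keeps exactly the transaction rows; validate raises MergeError
# if a zip code appears more than once in the lookup table
enriched = transactions.merge(
    lookup, how="left", left_on="merchantZip", right_on="zip_codes",
    validate="many_to_one",
).drop(columns="zip_codes")

# Transactions without a known zip code keep NaN coordinates
assert len(enriched) == len(transactions)
print(enriched)
```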