<h1>Fraud Transaction Detection (V1)</h1>

<h4>Dataset: https://www.kaggle.com/datasets/kartik2112/fraud-detection</h4>

About the dataset

This is a simulated dataset of credit card transactions containing legitimate and fraudulent transactions from 1 January 2019 to 31 December 2020. It covers the credit cards of 1,000 customers transacting with a pool of 800 merchants.

Simulation source

The data was generated with the Sparkov Data Generation tool (GitHub), created by Brandon Harris. The simulation was run for the period from 1 January 2019 to 31 December 2020, and the resulting files were combined and converted into a standard format.

About the simulator

I do not own the simulator. I used the one built by Brandon Harris and read through a few portions of its code to understand how it works. This is what I gathered:

The simulator has a predefined list of merchants, customers, and transaction categories. Using the Python library "faker" and the numbers of customers and merchants you specify for the simulation, it then builds an intermediate list.

After that, transactions are created according to the chosen profile, for example "adults_2550_female_rural.json" (which models adult women aged 25-50 living in rural areas). For this profile you can look at "Sparkov | Github | adults_2550_female_rural.json", which defines the parameter ranges: the minimum and maximum number of transactions per day, the distribution of transactions across days of the week, and the normal-distribution parameters (mean, standard deviation) for amounts in each category. Transactions are then generated with faker from these distributions.

I generated transactions for all profiles and then merged them to obtain a more realistic representation of simulated transactions.
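
The profile mechanism described above can be sketched roughly as follows. This is only an illustration: the `profile` dictionary, its keys, and every number in it are invented for the example and are not taken from the Sparkov profile files, and NumPy's random generator stands in for whatever sampling the simulator actually performs.

```python
import numpy as np

# Hypothetical profile: every key and number here is invented for illustration.
profile = {
    "min_trans_per_day": 1,
    "max_trans_per_day": 4,
    "categories": {
        "grocery_pos": {"mean": 60.0, "std": 20.0},
        "gas_transport": {"mean": 45.0, "std": 10.0},
    },
}

def simulate_day(profile, rng):
    """Draw one day of (category, amount) pairs from the profile's distributions."""
    n_trans = rng.integers(profile["min_trans_per_day"],
                           profile["max_trans_per_day"] + 1)
    names = list(profile["categories"])
    transactions = []
    for _ in range(n_trans):
        category = names[rng.integers(len(names))]
        params = profile["categories"][category]
        # Amounts follow a normal distribution; clip so they stay positive.
        amount = max(0.01, rng.normal(params["mean"], params["std"]))
        transactions.append((category, round(float(amount), 2)))
    return transactions

rng = np.random.default_rng(0)
day = simulate_day(profile, rng)
print(day)
```

A real profile would also weight the days of the week, which this sketch leaves out.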


In [14]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
sns.set(style='darkgrid')

<h1>Loading the data</h1>

In [15]:
df_train = pd.read_csv('../data/credit_card_fraud_detection/fraudTrain.csv', index_col=0)
df_test = pd.read_csv('../data/credit_card_fraud_detection/fraudTest.csv', index_col=0)
print(f'Rows: {df_train.shape[0]} | Columns: {df_train.shape[1]} (Train)')
print(f'Rows: {df_test.shape[0]} | Columns: {df_test.shape[1]} (Test)')

Rows: 1296675 | Columns: 22 (Train)
Rows: 555719 | Columns: 22 (Test)


In [16]:
df_train.head()

Unnamed: 0,trans_date_trans_time,cc_num,merchant,category,amt,first,last,gender,street,city,...,lat,long,city_pop,job,dob,trans_num,unix_time,merch_lat,merch_long,is_fraud
0,2019-01-01 00:00:18,2703186189652095,"fraud_Rippin, Kub and Mann",misc_net,4.97,Jennifer,Banks,F,561 Perry Cove,Moravian Falls,...,36.0788,-81.1781,3495,"Psychologist, counselling",1988-03-09,0b242abb623afc578575680df30655b9,1325376018,36.011293,-82.048315,0
1,2019-01-01 00:00:44,630423337322,"fraud_Heller, Gutmann and Zieme",grocery_pos,107.23,Stephanie,Gill,F,43039 Riley Greens Suite 393,Orient,...,48.8878,-118.2105,149,Special educational needs teacher,1978-06-21,1f76529f8574734946361c461b024d99,1325376044,49.159047,-118.186462,0
2,2019-01-01 00:00:51,38859492057661,fraud_Lind-Buckridge,entertainment,220.11,Edward,Sanchez,M,594 White Dale Suite 530,Malad City,...,42.1808,-112.262,4154,Nature conservation officer,1962-01-19,a1a22d70485983eac12b5b88dad1cf95,1325376051,43.150704,-112.154481,0
3,2019-01-01 00:01:16,3534093764340240,"fraud_Kutch, Hermiston and Farrell",gas_transport,45.0,Jeremy,White,M,9443 Cynthia Court Apt. 038,Boulder,...,46.2306,-112.1138,1939,Patent attorney,1967-01-12,6b849c168bdad6f867558c3793159a81,1325376076,47.034331,-112.561071,0
4,2019-01-01 00:03:06,375534208663984,fraud_Keeling-Crist,misc_pos,41.96,Tyler,Garcia,M,408 Bradley Rest,Doe Hill,...,38.4207,-79.4629,99,Dance movement psychotherapist,1986-03-28,a41d7549acf90789359a9aa5346dcb46,1325376186,38.674999,-78.632459,0


In [17]:
df = pd.concat([df_train, df_test], axis=0).reset_index(drop=True)

In [18]:
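# Comparing dtypes with np.number is deprecated in NumPy; select_dtypes is the supported way.
# Only float columns count as numerical here; integer columns (cc_num, zip, is_fraud, ...) are left out.
num_cols = df.select_dtypes(include='float').columns.to_list()
cat_cols = df.select_dtypes(include=object).columns.to_list()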
print(f'Numerical cols: {num_cols}')
print(f'Categorical cols: {cat_cols}')

Numerical cols: ['amt', 'lat', 'long', 'merch_lat', 'merch_long']
Categorical cols: ['trans_date_trans_time', 'merchant', 'category', 'first', 'last', 'gender', 'street', 'city', 'state', 'job', 'dob', 'trans_num']


In [19]:
df = df.drop_duplicates()

In [20]:
df['trans_date_trans_time'] = pd.to_datetime(df['trans_date_trans_time'])

In [21]:
df['month'] = df['trans_date_trans_time'].dt.month
df["day_of_week"] = df['trans_date_trans_time'].dt.dayofweek
df["is_weekend"] = (df['trans_date_trans_time'].dt.dayofweek > 4).astype(int)
df['hour'] = (df['trans_date_trans_time'].dt.hour).astype(int)

In [22]:
df['num_of_trans'] = df['cc_num'].map(df.groupby('cc_num')['merchant'].count())
df['num_of_unique_merchant'] = df['cc_num'].map(df.groupby('cc_num')['merchant'].nunique())
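
The `map(df.groupby(...))` pattern used for these per-card features broadcasts a group-level statistic back onto every row. As a sketch on a toy frame (the values are invented), `groupby(...).transform(...)` produces the same result in one step:

```python
import pandas as pd

toy = pd.DataFrame({
    "cc_num": [1, 1, 1, 2, 2],
    "merchant": ["a", "b", "a", "c", "c"],
})

# Pattern from the notebook: aggregate per card, then map back onto the rows.
via_map = toy["cc_num"].map(toy.groupby("cc_num")["merchant"].nunique())

# Equivalent: transform returns a result aligned with the original rows.
via_transform = toy.groupby("cc_num")["merchant"].transform("nunique")

print(via_map.tolist())        # [2, 2, 2, 1, 1]
print(via_transform.tolist())  # [2, 2, 2, 1, 1]
```

Either form works; `transform` avoids building the intermediate lookup series.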

In [23]:
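# lat/long look like the cardholder's home coordinates, which appear to be constant per card,
# so both features come out as 0.0 in the df.head() preview further down.
df['lat_std'] = df['cc_num'].map(df.groupby('cc_num')['lat'].std())
df['long_std'] = df['cc_num'].map(df.groupby('cc_num')['long'].std())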

In [24]:
df['country'] = 'United States'

In [25]:
df['mean_amt_per_category'] = df['category'].map(df.groupby('category')['amt'].mean())

In [26]:
df = df.sort_values(by='trans_date_trans_time')
print(f'Rows: {df.shape[0]} | Columns: {df.shape[1]} (Full)')

Rows: 1852394 | Columns: 32 (Full)


In [27]:
df = df.drop(['first', 'last', 'trans_date_trans_time', 'street'], axis=1)

In [28]:
# Optional: reverse-geocode merchant coordinates to a country name.
# Kept disabled: with one rate-limited request per row it is very slow on this dataset.
# from geopy.geocoders import Nominatim
# from geopy.extra.rate_limiter import RateLimiter

# # Initialize a geolocator with a rate limiter to avoid overloading the service
# geolocator = Nominatim(user_agent="geo_analysis")
# geocode = RateLimiter(geolocator.reverse, min_delay_seconds=1)

# # Function to get country name from coordinates
# def get_country_name(lat, lon):
#     location = geocode((lat, lon), exactly_one=True)
#     if location:
#         return location.raw.get('address', {}).get('country')
#     else:
#         return None

# # Apply the function to each row of the DataFrame (sequentially, not in parallel)
# df['country'] = df.apply(lambda row: get_country_name(row['merch_lat'], row['merch_long']), axis=1)


In [29]:
# Convert latitude/longitude to approximate kilometres on a flat grid centred
# on the median point (equirectangular approximation).
allLat = list(df['lat']) + list(df['merch_lat'])
medianLat = sorted(allLat)[len(allLat) // 2]
latMultiplier = 111.32  # approximate km per degree of latitude

df['lat'] = latMultiplier * (df['lat'] - medianLat)
df['merch_lat'] = latMultiplier * (df['merch_lat'] - medianLat)

allLong = list(df['long']) + list(df['merch_long'])
medianLong = sorted(allLong)[len(allLong) // 2]
# A degree of longitude shrinks with the cosine of the latitude
longMultiplier = np.cos(medianLat * (np.pi / 180.0)) * 111.32

df['long'] = longMultiplier * (df['long'] - medianLong)
df['merch_long'] = longMultiplier * (df['merch_long'] - medianLong)

# Euclidean distance between cardholder and merchant on that grid, in km
df['long_diff'] = df['merch_long'] - df['long']
df['lat_diff'] = df['merch_lat'] - df['lat']
df['distance_km'] = (df['long_diff']**2 + df['lat_diff']**2)**(1/2)

df = df.drop(['long_diff', 'lat_diff'], axis=1)
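
The cell above measures distance on a flat grid with a single longitude scale for the whole country. For comparison, here is a minimal sketch of the haversine (great-circle) distance. The function name `haversine_km` and the Earth radius of 6371 km are choices made for this example, not part of the notebook. It would have to be applied to the raw degree coordinates, before they are converted in place, and its results need not match `distance_km` exactly.

```python
import numpy as np

def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Cardholder and merchant coordinates of the first row in the df_train.head() output
d = haversine_km(36.0788, -81.1781, 36.011293, -82.048315)
print(round(float(d), 1))
```

Because the function uses only NumPy ufuncs, it also accepts whole pandas columns, e.g. `haversine_km(df['lat'], df['long'], df['merch_lat'], df['merch_long'])`.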

In [30]:
df = df.drop(['lat', 'long', 'merch_lat', 'merch_long'], axis=1)

In [31]:
df['age'] = (pd.to_datetime(df['unix_time'], unit='s') - pd.to_datetime(df['dob'])) / pd.Timedelta(days=365.25)
df['age'] = df['age'].round()
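
A quick check on the first row of the `df_train.head()` preview suggests that `unix_time` does not decode to the same calendar date as `trans_date_trans_time`, which shifts `age`. The sketch below only uses the values shown in that row. Whether the offset is intentional in the dataset is not stated in the source, so treat this as something to verify rather than a confirmed defect.

```python
import pandas as pd

# Values copied from the first row of the df_train.head() output
trans_time = pd.Timestamp("2019-01-01 00:00:18")
unix_time = 1325376018
dob = pd.Timestamp("1988-03-09")

decoded = pd.to_datetime(unix_time, unit="s")
year = pd.Timedelta(days=365.25)

age_from_unix = round((decoded - dob) / year)           # what the cell above computes
age_from_trans_time = round((trans_time - dob) / year)  # age at the stated timestamp
print(decoded, age_from_unix, age_from_trans_time)
```

If the age at the stated transaction date is what is wanted, it would need to be computed from `trans_date_trans_time` before that column is dropped in In [27].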

In [32]:
df = df.drop(['unix_time', 'dob'], axis=1)

In [33]:
df.columns

Index(['cc_num', 'merchant', 'category', 'amt', 'gender', 'city', 'state',
       'zip', 'city_pop', 'job', 'trans_num', 'is_fraud', 'month',
       'day_of_week', 'is_weekend', 'hour', 'num_of_trans',
       'num_of_unique_merchant', 'lat_std', 'long_std', 'country',
       'mean_amt_per_category', 'distance_km', 'age'],
      dtype='object')

In [34]:
df['gender'] = df['gender'].map({'M': 0, 'F': 1})

In [35]:
cols_mean_target = ['city', 'state', 'job']

In [36]:
for col in cols_mean_target:
    if df[col].nunique() < 10:
        one_hot = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat((df.drop(col, axis=1), one_hot), axis=1)
        
    else:
        mean_target = df.groupby(col)['is_fraud'].mean()
        df[col] = df[col].map(mean_target)
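
The loop above computes the mean of `is_fraud` per category on the concatenated train and test data, so every row's own label contributes to its encoded value. One common way to limit that leakage is to learn the encoding on the training part only and reuse it elsewhere. The sketch below shows this on toy data. The helper names `fit_target_encoding` and `apply_target_encoding` are hypothetical, and the notebook does not do this.

```python
import pandas as pd

def fit_target_encoding(train, col, target):
    """Return per-category target means and the global mean, learned on train only."""
    return train.groupby(col)[target].mean(), train[target].mean()

def apply_target_encoding(frame, col, means, global_mean):
    """Map categories to the learned means; unseen categories get the global mean."""
    return frame[col].map(means).fillna(global_mean)

# Toy data, invented for illustration
train = pd.DataFrame({"state": ["NY", "NY", "CA", "CA"], "is_fraud": [1, 0, 0, 0]})
test = pd.DataFrame({"state": ["NY", "TX"]})

means, global_mean = fit_target_encoding(train, "state", "is_fraud")
test_encoded = apply_target_encoding(test, "state", means, global_mean)
print(test_encoded.tolist())  # [0.5, 0.25]
```

Here "TX" never appears in the training data, so it falls back to the global fraud rate of the training rows.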

In [37]:
cat_cols = df.loc[:, df.dtypes==object].columns.tolist()

In [38]:
cat_cols

['merchant', 'category', 'trans_num', 'country']

In [39]:
df = df.drop('trans_num', axis=1)

In [40]:
df = df.drop('merchant', axis=1)

In [41]:
df['category'] = df['category'].map(df.groupby('category')['is_fraud'].mean())

In [44]:
# from sklearn.feature_selection import VarianceThreshold
# var_selector = VarianceThreshold(threshold=0.01)
# var_selector.fit(df.select_dtypes(include=np.number))
# print(var_selector.get_feature_names_out())
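
The same low-variance filter can be sketched with pandas alone. The toy frame below is invented for illustration. Its `lat_std` column mimics a feature that is constant, like the `lat_std` and `long_std` values visible in the `df.head()` preview.

```python
import pandas as pd

# Toy frame, invented for illustration
toy = pd.DataFrame({
    "amt": [4.97, 107.23, 220.11, 45.0],
    "lat_std": [0.0, 0.0, 0.0, 0.0],
    "gender": [1, 1, 0, 0],
})

threshold = 0.01
# ddof=0 gives the population variance of each column
variances = toy.var(ddof=0)
kept = variances[variances > threshold].index.tolist()
print(kept)  # ['amt', 'gender']
```

A variance threshold depends on the scale of each column, so it mainly catches constant or near-constant features.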

In [45]:
df.describe(include=np.number).T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
cc_num,1852394.0,4.17386e+17,1.309115e+18,60416210000.0,180042900000000.0,3521417000000000.0,4642255000000000.0,4.992346e+18
category,1852394.0,0.005210015,0.00482746,0.001509551,0.001879711,0.00269737,0.006343752,0.01592713
amt,1852394.0,70.06357,159.254,1.0,9.64,47.45,83.1,28948.9
gender,1852394.0,0.5478041,0.4977097,0.0,0.0,1.0,1.0,1.0
city,1852394.0,0.005210015,0.01976274,0.0,0.002743484,0.004098361,0.005733006,1.0
state,1852394.0,0.005210015,0.002378605,0.003772577,0.004542347,0.005207998,0.005629393,1.0
zip,1852394.0,48813.26,26881.85,1257.0,26237.0,48174.0,72042.0,99921.0
city_pop,1852394.0,88643.67,301487.6,23.0,741.0,2443.0,20328.0,2906700.0
job,1852394.0,0.005210015,0.01159766,0.0,0.003495441,0.004557885,0.006152794,1.0
is_fraud,1852394.0,0.005210015,0.07199217,0.0,0.0,0.0,0.0,1.0


In [46]:
df.describe(include=object).T

Unnamed: 0,count,unique,top,freq
country,1852394,1,United States,1852394


In [47]:
df.isna().sum()

cc_num                    0
category                  0
amt                       0
gender                    0
city                      0
state                     0
zip                       0
city_pop                  0
job                       0
is_fraud                  0
month                     0
day_of_week               0
is_weekend                0
hour                      0
num_of_trans              0
num_of_unique_merchant    0
lat_std                   0
long_std                  0
country                   0
mean_amt_per_category     0
distance_km               0
age                       0
dtype: int64

In [48]:
df.head()

Unnamed: 0,cc_num,category,amt,gender,city,state,zip,city_pop,job,is_fraud,...,is_weekend,hour,num_of_trans,num_of_unique_merchant,lat_std,long_std,country,mean_amt_per_category,distance_km,age
0,2703186189652095,0.013039,4.97,1,0.003758,0.004521,28654,3495,0.00332,0,...,0,0,2927,660,0.0,0.0,United States,80.18137,75.267255,24.0
1,630423337322,0.012645,107.23,1,0.00216,0.00466,99160,149,0.002472,0,...,0,0,4362,681,0.0,0.0,United States,116.640146,30.265999,34.0
2,38859492057661,0.002177,220.11,0,0.010884,0.004107,83252,4154,0.021534,0,...,0,0,735,431,0.0,0.0,United States,64.142968,108.365491,50.0
3,3534093764340240,0.004106,45.0,0,0.020188,0.004106,59632,1939,0.005461,0,...,0,0,743,423,0.0,0.0,United States,63.477271,97.400118,45.0
4,375534208663984,0.002819,41.96,0,0.004449,0.006538,24433,99,0.004449,0,...,0,0,2922,652,0.0,0.0,United States,62.676479,76.870522,26.0


In [50]:
# num_cols was built before lat/long/merch_lat/merch_long were dropped, so it is recomputed here
num_cols = df.select_dtypes(include='float').columns
sns.heatmap(df[num_cols].corr(), annot=True, cmap='inferno')

In [None]:
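# gender was already encoded in In [34]; mapping the 0/1 values again would turn them into NaN
if df['gender'].dtype == object:
    df['gender'] = df['gender'].map({'M': 0, 'F': 1})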

In [None]:
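# unix_time was dropped in In [32]; only build the column if it is still available
if 'unix_time' in df.columns:
    df['time'] = pd.to_datetime(df['unix_time'], unit='s')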

In [51]:
# Most of these columns were already dropped above; errors='ignore' skips the missing ones
df = df.drop(['last', 'first', 'unix_time', 'trans_num', 'lat', 'long', 'cc_num'], axis=1, errors='ignore')

In [None]:
df.head()

In [52]:
# Float columns only; comparing dtypes with np.number is deprecated in NumPy
num_cols = df.select_dtypes(include='float').columns

In [53]:
def get_redundant_pairs(df):
    pairs_to_drop = set()
    cols = df.columns
    for i in range(0, df.shape[1]):
        for j in range(0, i+1):
            pairs_to_drop.add((cols[i], cols[j]))
    return pairs_to_drop

def get_top_abs_correlations(df, n=5):
    au_corr = df.corr().abs().unstack()
    labels_to_drop = get_redundant_pairs(df)
    au_corr = au_corr.drop(labels=labels_to_drop).sort_values(ascending=False)
    return au_corr[0:n]

print("Top Absolute Correlations")
print(get_top_abs_correlations(df[num_cols], 50))

Top Absolute Correlations
category               mean_amt_per_category    0.748424
city                   job                      0.452726
amt                    mean_amt_per_category    0.128317
category               amt                      0.096036
amt                    city                     0.066912
                       job                      0.037239
job                    age                      0.035151
city                   age                      0.032246
state                  age                      0.027366
category               city                     0.022010
city                   state                    0.017938
mean_amt_per_category  age                      0.017807
city                   mean_amt_per_category    0.017256
state                  job                      0.015273
category               job                      0.012073
job                    mean_amt_per_category    0.011316
amt                    age                      0.010740
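
An alternative to building the set of redundant pairs by hand is to mask the correlation matrix down to its strict upper triangle. The sketch below does that on toy data (invented for illustration). The function name `top_abs_correlations` is hypothetical and not part of the notebook.

```python
import numpy as np
import pandas as pd

def top_abs_correlations(frame, n=5):
    """Return the n largest absolute pairwise correlations, each pair listed once."""
    corr = frame.corr().abs()
    # Boolean mask selecting the strict upper triangle (no diagonal, no mirrored pairs)
    upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(upper).stack().dropna()
    return pairs.sort_values(ascending=False).head(n)

# Toy data, invented for illustration
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.5],
    "c": [4.0, 1.0, 3.0, 2.0],
})
top = top_abs_correlations(toy, n=3)
print(top)
```

The explicit `dropna()` keeps the result free of masked entries regardless of how the installed pandas version handles missing values in `stack()`.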
     

In [54]:
for col in df.loc[:, df.dtypes==object].columns:
    if df[col].nunique() < 10:
        one_hot = pd.get_dummies(df[col], prefix=col, drop_first=True)
        df = pd.concat((df.drop(col, axis=1), one_hot), axis=1)
        
    else:
        mean_target = df.groupby(col)['is_fraud'].mean()
        df[col] = df[col].map(mean_target)
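
At this point the only remaining object column appears to be `country`, which has a single unique value, so it takes the one-hot branch. With `drop_first=True` the only level is dropped, which means the column simply disappears. A minimal check on a toy series:

```python
import pandas as pd

country = pd.Series(["United States"] * 3, name="country")
one_hot = pd.get_dummies(country, prefix="country", drop_first=True)
print(one_hot.shape)  # (3, 0)
```

The result keeps the three rows but has no columns, so concatenating it back adds nothing to the frame.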

In [None]:
df.head()

In [None]:
# 'time' only exists if the optional cell above created it, so missing columns are ignored
df = df.drop(['zip', 'time'], axis=1, errors='ignore')

In [None]:
df.to_csv('../data/df_processed.csv')