## DATA PREPROCESSING

*   Data Type Handling
*   Exploring Insight into Attributes
*   Feature Extraction
*   Saving Data

In [2]:
# for data analysis
import pandas as pd
import numpy as np

import json     # for reading the key inside the json formatted file
import pyodbc   # for connecting database

# for data visualization
import seaborn as sns
import matplotlib.pyplot as plt

In [3]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)

#### Connecting to Database

The pyodbc library handles the connection between the Jupyter notebook and MS SQL Server. The SQL Server connection key is kept out of the notebook, inside a JSON file.

In [4]:
with open('log.json') as f:    # the file is closed automatically after reading
    sql_key = json.load(f)     # returns the JSON object as a dictionary

cnxn = pyodbc.connect(sql_key['key'])     # establish a connection
crsr = cnxn.cursor()                      # the cursor is used to send commands

> Sending Queries to Database

Queries follow the decisions made in the analytics phase. G and T type company data are retrieved separately.

>> For G (Şahıs, i.e. individual)

In [5]:
gk_query= """SELECT ID, MUSTERI_ID, SIRKET_TURU,
            CEK_NO, CEK_TUTAR, KULLANDIRIM, SUBE, KESIDECI_ID, ISLEM_TARIHI,
            BK_KURUMSAYISI, BK_LIMIT, BK_RISK, BK_GECIKMEHESAP, BK_GECIKMEBAKIYE, MUSTERI_RISK_SEVIYESI
            FROM dataset
            WHERE SIRKET_TURU='G' """
g_company_type_df = pd.read_sql(gk_query, cnxn)

In [6]:
g_company_type_df.head(10)

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,BK_LIMIT,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI
0,2457932,11800527,G,70331933,20000.0,KY,PENDIK,11535476,2018-11-30,0,0,0,0,0,0
1,2457933,12024009,G,3014103,25000.0,KV,ÇORLU,11753282,2018-11-30,3,2213,226,2,1012,0
2,2457934,11800527,G,7031933,20000.0,KV,PENDIK,11535476,2018-11-30,0,0,0,0,0,0
3,2457936,11724283,G,7198012,23000.0,KY,TOPÇULAR,126553,2018-11-30,2,52100,9481,3,3575,0
4,2457937,11879266,G,9090937,10000.0,KY,ADANA,98511,2018-11-30,7,280101,211301,3,4128,0
5,2457938,11879266,G,9090936,13458.0,KY,ADANA,98511,2018-11-30,7,280101,211301,3,4128,0
6,2457939,11854083,G,4535918,5200.0,KY,IZMIT,12054732,2018-11-30,0,0,0,0,0,0
7,2457942,11654711,G,88624,110000.0,KY,BESEVLER,74198,2018-11-30,4,215261,150619,1,847,0
8,2457943,11723432,G,3331313,5300.0,KY,YILDIRIM,11570468,2018-11-30,4,104384,65599,3,2591,0
9,2457944,11577211,G,4602258,11000.0,KY,BEYLIKDÜZÜ,12054730,2018-11-30,4,215860,85589,4,6379,0


>> For T (Tüzel, i.e. legal entity)

In [129]:
tk_query= """SELECT MUSTERI_ID, ID, CEK_NO, CEK_TUTAR, SIRKET_TURU, KULLANDIRIM, SUBE, KESIDECI_ID, ISLEM_TARIHI,
    TK_NAKDILIMIT, TK_NAKDIRISK, TK_GAYRINAKDILIMIT, TK_GAYRINAKDIRISK, TK_GECIKMEHESAP, TK_GECIKMEBAKIYE,
    TK_KURUMSAYISI, MUSTERI_RISK_SEVIYESI
    FROM dataset WHERE SIRKET_TURU='T' """
t_company_type_df = pd.read_sql(tk_query, cnxn)

In [127]:
t_company_type_df.head(10)

Unnamed: 0,MUSTERI_ID,ID,CEK_NO,CEK_TUTAR,SIRKET_TURU,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,TK_NAKDILIMIT,TK_NAKDIRISK,TK_GAYRINAKDILIMIT,TK_GAYRINAKDIRISK,TK_GECIKMEHESAP,TK_GECIKMEBAKIYE,TK_KURUMSAYISI,MUSTERI_RISK_SEVIYESI
0,11820145,2457923,3062309,8000.0,T,KY,IZMIR SANAYI,12054734,2018-11-30,11000,9490,0,0,0,0,1,0
1,11672216,2457924,80075,35000.0,T,KY,MECIDIYEKÖY,21013,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
2,11672216,2457925,1009838,5000.0,T,KY,MECIDIYEKÖY,12006233,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
3,11672216,2457926,1009837,10000.0,T,KY,MECIDIYEKÖY,12006233,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
4,11672216,2457927,8005059,15000.0,T,KY,MECIDIYEKÖY,12000824,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
5,11672216,2457928,7824,5000.0,T,KY,MECIDIYEKÖY,11830127,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
6,11672216,2457929,5425083,6000.0,T,KY,MECIDIYEKÖY,11789483,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
7,11672216,2457930,84781,77600.0,T,KY,MECIDIYEKÖY,11618513,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
8,11672216,2457931,17971,100000.0,T,KY,MECIDIYEKÖY,KD00186882,2018-11-30,6903025,3095925,1049450,45637,12,134953,14,0
9,11827634,2457935,1012285,22000.0,T,KY,TOPÇULAR,151751,2018-11-30,931624,0,932564,2820,0,0,3,0
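
The two queries above differ mainly in the `SIRKET_TURU` filter. As a minimal sketch of how that filter could be passed as a query parameter instead of being written into the SQL text, the example below uses an in-memory `sqlite3` database as a stand-in for MS SQL Server. The toy rows and the helper name `load_company_type` are assumptions made for illustration; pyodbc also uses `?` placeholders, but the sketch has not been run against the real database.

```python
import sqlite3
import pandas as pd

# In-memory stand-in for the real table (toy rows, not real data)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset (ID INTEGER, MUSTERI_ID TEXT, SIRKET_TURU TEXT, CEK_TUTAR REAL)"
)
conn.executemany(
    "INSERT INTO dataset VALUES (?, ?, ?, ?)",
    [(1, "100", "G", 20000.0), (2, "200", "T", 8000.0), (3, "100", "G", 5000.0)],
)

def load_company_type(connection, company_type):
    """Hypothetical helper: fetch the rows of one SIRKET_TURU value."""
    query = (
        "SELECT ID, MUSTERI_ID, SIRKET_TURU, CEK_TUTAR "
        "FROM dataset WHERE SIRKET_TURU = ?"
    )
    # The value is bound by the driver instead of being formatted into the string
    return pd.read_sql(query, connection, params=(company_type,))

g_df = load_company_type(conn, "G")
t_df = load_company_type(conn, "T")
conn.close()
```

Binding the value keeps the SQL text identical for both company types, so only the column list would still need to differ between the G and T queries.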


>> Save Retrieved Data to Feather

In [7]:
g_company_type_df.to_feather('data/bk_rawData.feather')

In [130]:
t_company_type_df.to_feather('data/tk_rawData.feather')

#### Handle Data Types *for G type Customers*

For G type customers, a new table, **bk_data**, is created as a feather file, and the G type customers' records are written into it using the following script. The code is meant to be run once; after that, the ML work continues in *segmentation.ipynb*.

First, the data types have to be assigned before any ML operations.

In [65]:
g_company_type_df.info()    # to see the data types and missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229538 entries, 0 to 229537
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   ID                     229538 non-null  int64  
 1   MUSTERI_ID             229538 non-null  object 
 2   SIRKET_TURU            229538 non-null  object 
 3   CEK_NO                 229538 non-null  object 
 4   CEK_TUTAR              229538 non-null  float64
 5   KULLANDIRIM            229538 non-null  object 
 6   SUBE                   229538 non-null  object 
 7   KESIDECI_ID            229538 non-null  object 
 8   ISLEM_TARIHI           229538 non-null  object 
 9   BK_KURUMSAYISI         229538 non-null  int64  
 10  BK_LIMIT               229538 non-null  int64  
 11  BK_RISK                229538 non-null  int64  
 12  BK_GECIKMEHESAP        229538 non-null  int64  
 13  BK_GECIKMEBAKIYE       229538 non-null  object 
 14  MUSTERI_RISK_SEVIYESI  229538 non-nu

In [66]:
# ID must be int type
g_company_type_df['ID']=g_company_type_df['ID'].astype('int64')

# converting into float to use on ML model later
g_company_type_df['BK_LIMIT']=g_company_type_df['BK_LIMIT'].astype(float)
g_company_type_df['BK_RISK']=g_company_type_df['BK_RISK'].astype(float)
g_company_type_df['BK_GECIKMEHESAP']=g_company_type_df['BK_GECIKMEHESAP'].astype(float)
g_company_type_df['BK_GECIKMEBAKIYE']=g_company_type_df['BK_GECIKMEBAKIYE'].astype(float)

The non-numeric values in MUSTERI_ID were updated with SQL queries.

In [67]:
# non-numeric values were spotted using this query
"""
SELECT ID,MUSTERI_ID
FROM dataset
WHERE ISNUMERIC(MUSTERI_ID) = 0
"""

In [68]:
# updated via SQL query
"""UPDATE dataset
SET MUSTERI_ID = REPLACE(MUSTERI_ID, 'KD', '3')
WHERE MUSTERI_ID IS NOT NULL;

UPDATE dataset
SET MUSTERI_ID = REPLACE(MUSTERI_ID, 'SC', '')
WHERE MUSTERI_ID IS NOT NULL;

--Goes like this...
"""

In [69]:
# after performing query, can directly convert into int dtype
g_company_type_df['MUSTERI_ID']=g_company_type_df['MUSTERI_ID'].astype('int64')
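
The non-numeric IDs could also be located and cleaned on the pandas side. The sketch below is only an illustration on toy values: the `KD` and `SC` prefixes are taken from the SQL statements above, while the sample IDs themselves are made up.

```python
import pandas as pd

# Toy IDs; the 'KD' and 'SC' prefixes mirror the ones replaced in the SQL UPDATE statements
ids = pd.Series(["11800527", "KD00186882", "SC12345", "124816"], name="MUSTERI_ID")

# errors='coerce' turns anything that cannot be parsed as a number into NaN
parsed = pd.to_numeric(ids, errors="coerce")
non_numeric = ids[parsed.isna()]

# Apply the same replacements as the SQL statements, then convert to int64
cleaned = (
    ids.str.replace("KD", "3", regex=False)
       .str.replace("SC", "", regex=False)
       .astype("int64")
)
```

`non_numeric` holds the rows that would fail a direct `astype('int64')`, which makes it a quick check before and after the cleanup.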


**CEK_NO** was also updated with the same sequence of SQL statements.

In [70]:
# after the SQL updates, the conversion runs without errors
g_company_type_df['CEK_NO']=g_company_type_df['CEK_NO'].astype('int64')

In [71]:
g_company_type_df['KESIDECI_ID'] = g_company_type_df['KESIDECI_ID'].str.extract(r'(\d+)', expand=False).astype('int')   # raw string avoids an invalid escape sequence warning

g_company_type_df['SIRKET_TURU'] = g_company_type_df['SIRKET_TURU'].astype('string')
g_company_type_df['SUBE'] = g_company_type_df['SUBE'].astype('string')
g_company_type_df['KULLANDIRIM'] = g_company_type_df['KULLANDIRIM'].astype('string')

g_company_type_df['CEK_TUTAR'] = g_company_type_df['CEK_TUTAR'].replace(',','.', regex=True).astype('float')
g_company_type_df['BK_KURUMSAYISI'] = g_company_type_df['BK_KURUMSAYISI'].astype('int')

g_company_type_df['ISLEM_TARIHI'] = pd.to_datetime(g_company_type_df['ISLEM_TARIHI'], format='%Y.%m.%d')
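
To make the effect of `str.extract` on KESIDECI_ID concrete, here is a small sketch on three sample values; `KD00186882` appears in the T type output above, the other two in the G type output. It illustrates the call, not the full column.

```python
import pandas as pd

kesideci = pd.Series(["11535476", "KD00186882", "98511"])

# Keep only the first run of digits in each value
digits = kesideci.str.extract(r"(\d+)", expand=False)

# Converting to integer drops the leading zeros of '00186882'
as_int = digits.astype("int64")
```

Unlike the MUSTERI_ID cleanup, where `KD` is replaced by `3`, this approach simply discards the prefix, so prefixed and unprefixed IDs with the same digits would become indistinguishable.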

In [72]:
#to see the data types after the changes
g_company_type_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 229538 entries, 0 to 229537
Data columns (total 15 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   ID                     229538 non-null  int64         
 1   MUSTERI_ID             229538 non-null  int64         
 2   SIRKET_TURU            229538 non-null  string        
 3   CEK_NO                 229538 non-null  int64         
 4   CEK_TUTAR              229538 non-null  float64       
 5   KULLANDIRIM            229538 non-null  string        
 6   SUBE                   229538 non-null  string        
 7   KESIDECI_ID            229538 non-null  int32         
 8   ISLEM_TARIHI           229538 non-null  datetime64[ns]
 9   BK_KURUMSAYISI         229538 non-null  int32         
 10  BK_LIMIT               229538 non-null  float64       
 11  BK_RISK                229538 non-null  float64       
 12  BK_GECIKMEHESAP        229538 non-null  floa

#### Feature Extraction

Deriving new attributes from the existing ones gives us additional features and more insight into the data.


> Limit/Risk Ratio

Exploring the attribute to understand it better

In [73]:
g_company_type_df[(g_company_type_df['BK_RISK'] != 0) & (g_company_type_df['BK_LIMIT'] == 0)]   # obtaining the case of RISK is not 0 but LIMIT is 0
g_company_type_df[(g_company_type_df['BK_RISK'] == 0) & (g_company_type_df['BK_LIMIT'] != 0)]   # obtaining the case of RISK is 0 but LIMIT is not 0

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,BK_LIMIT,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI
12,2457952,11860577,G,1023231,12605.490234,KY,KAYSERI,134618,2018-11-30,1,1000.0,0.0,0.0,0.0,0
87,2458126,11884594,G,381633,15000.000000,KY,ISKENDERUN,30052,2018-11-30,2,5200.0,0.0,1.0,296.0,0
144,2458237,11793964,G,3167451,9500.000000,KV,KARABAGLAR,11992164,2018-11-30,1,750.0,0.0,0.0,0.0,0
168,2458278,11593517,G,271555,50000.000000,KY,DUDULLU,11592311,2018-11-30,1,550.0,0.0,1.0,35.0,0
177,2458290,11506837,G,8060375,23100.000000,KY,ANTALYA,11952661,2018-11-30,1,500.0,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229377,2939494,11620749,G,1614282,30000.000000,KY,MECIDIYEKÖY,104996,2018-01-02,2,3000.0,0.0,0.0,0.0,0
229379,2939497,11855859,G,12927,30000.000000,KV,MERSIN,11908808,2018-01-02,1,750.0,0.0,0.0,0.0,0
229400,2939551,124816,G,694134,5000.000000,KY,ESKISEHIR,69677,2018-01-02,2,1550.0,0.0,3.0,522.0,0
229401,2939552,124816,G,694133,5000.000000,KY,ESKISEHIR,69677,2018-01-02,2,1550.0,0.0,3.0,522.0,0


As seen, there are several records in the dataframe where one of these attributes is zero while the other is not.

We create a new attribute called **BK_ORAN**, which is ***derived by dividing BK_LIMIT by BK_RISK***. A higher ratio is taken to indicate higher reliability of the customer's check payback.

*Meanwhile, we must be aware of the cases where one of those attributes is equal to 0 and the other is not!*

In [74]:
g_company_type_df['BK_ORAN'] = 0.0    # creating a new float column for BK_ORAN and initializing it with 0

# for cases of RISK and LIMIT both are not equal to 0, BK_ORAN is calculated as a ratio of LIMIT to RISK
mask = (g_company_type_df['BK_RISK'] != 0) & (g_company_type_df['BK_LIMIT'] != 0)
g_company_type_df.loc[mask, 'BK_ORAN'] = g_company_type_df.loc[mask, 'BK_LIMIT'] / g_company_type_df.loc[mask, 'BK_RISK']

# for cases of RISK is equal to 0 but LIMIT is not 0, BK_ORAN is calculated as LIMIT / 0.1
mask = (g_company_type_df['BK_RISK'] == 0) & (g_company_type_df['BK_LIMIT'] != 0)
g_company_type_df.loc[mask, 'BK_ORAN'] = g_company_type_df.loc[mask, 'BK_LIMIT'] / 0.1  # assigning the value of RISK as 0.1 instead of 0

# for cases of RISK is not 0 and LIMIT is equal to 0, BK_ORAN is calculated as 0.1 / RISK
mask = (g_company_type_df['BK_LIMIT'] == 0) & (g_company_type_df['BK_RISK'] != 0)
g_company_type_df.loc[mask, 'BK_ORAN'] = 0.1 / g_company_type_df.loc[mask, 'BK_RISK']   # assigning the value of LIMIT as 0.1 instead of 0

mask = (g_company_type_df['BK_LIMIT'] == 0) & (g_company_type_df['BK_RISK'] == 0)
g_company_type_df.loc[mask, 'BK_ORAN'] = 0
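
The four masked assignments can also be written in one pass. The following is a minimal sketch on toy values, assuming the same convention as above (a zero LIMIT or RISK is replaced by 0.1 for the division, and the ratio is 0 when both are zero). The constant name `EPS` is introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "BK_LIMIT": [2213.0, 1000.0, 0.0, 0.0],
    "BK_RISK":  [226.0, 0.0, 50.0, 0.0],
})
EPS = 0.1  # same stand-in value the notebook uses in place of 0

limit, risk = df["BK_LIMIT"], df["BK_RISK"]

# Replace zeros by EPS only for the division
safe_limit = limit.where(limit != 0, EPS)
safe_risk = risk.where(risk != 0, EPS)

# Rows where both values are zero are forced to 0
both_zero = (limit == 0) & (risk == 0)
df["BK_ORAN"] = np.where(both_zero, 0.0, safe_limit / safe_risk)
```

The second row shows why the maximum of BK_ORAN can become very large: a limit of 1000 with zero risk yields a ratio around 10000.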

> Keşideci (Check Drawer) Count for Each Customer

In [75]:
# count the number of unique KESIDECI_ID values for each MUSTERI_ID
g_company_type_df.groupby('MUSTERI_ID')['KESIDECI_ID'].nunique().sort_values(ascending=False)

MUSTERI_ID
11633272     115
11529385     111
11954204     101
74805         97
28987         93
            ... 
11848956       1
11848938       1
11848905       1
11848611       1
300188452      1
Name: KESIDECI_ID, Length: 38374, dtype: int64

In [76]:
count_kesideci_df = g_company_type_df.groupby('MUSTERI_ID')['KESIDECI_ID'].nunique().reset_index()
count_kesideci_df.columns = ['MUSTERI_ID', 'KESIDECI_COUNT']

g_company_type_df = pd.merge(g_company_type_df, count_kesideci_df, on='MUSTERI_ID', how='left')     # left outer join to append KESIDECI_ID count per customers
del count_kesideci_df # delete the temporary dataframe
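
The group, rename and merge steps can be collapsed with `groupby(...).transform('nunique')`, which returns one value per original row. A short sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "MUSTERI_ID":  [1, 1, 1, 2, 2],
    "KESIDECI_ID": [10, 10, 11, 12, 12],
})

# transform broadcasts the per-customer unique count back to every row,
# so no temporary dataframe or merge is needed
df["KESIDECI_COUNT"] = df.groupby("MUSTERI_ID")["KESIDECI_ID"].transform("nunique")
```

Both approaches should give the same column here; the merge variant is mainly useful when the aggregated table is needed on its own.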

>> SUBE Count for Each Customer

In [77]:
# count the number of unique SUBE values for each MUSTERI_ID
g_company_type_df.groupby('MUSTERI_ID')['SUBE'].nunique().sort_values(ascending=False)

MUSTERI_ID
11635461     4
11729006     3
11937686     3
11870207     3
11919651     3
            ..
11720582     1
11720587     1
11720594     1
11720639     1
300188452    1
Name: SUBE, Length: 38374, dtype: int64

In [78]:
count_sube_df = g_company_type_df.groupby('MUSTERI_ID')['SUBE'].nunique().reset_index()
count_sube_df.columns = ['MUSTERI_ID', 'SUBE_COUNT']

g_company_type_df = pd.merge(g_company_type_df, count_sube_df, on='MUSTERI_ID', how='left')     # left outer join to append SUBE count per customers
del count_sube_df                                                                               # delete the temporary dataframe

In [79]:
g_company_type_df

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,BK_LIMIT,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI,BK_ORAN,KESIDECI_COUNT,SUBE_COUNT
0,2457932,11800527,G,70331933,20000.0,KY,PENDIK,11535476,2018-11-30,0,0.0,0.0,0.0,0.0,0,0.000000,3,2
1,2457933,12024009,G,3014103,25000.0,KV,ÇORLU,11753282,2018-11-30,3,2213.0,226.0,2.0,1012.0,0,9.792035,4,1
2,2457934,11800527,G,7031933,20000.0,KV,PENDIK,11535476,2018-11-30,0,0.0,0.0,0.0,0.0,0,0.000000,3,2
3,2457936,11724283,G,7198012,23000.0,KY,TOPÇULAR,126553,2018-11-30,2,52100.0,9481.0,3.0,3575.0,0,5.495201,4,1
4,2457937,11879266,G,9090937,10000.0,KY,ADANA,98511,2018-11-30,7,280101.0,211301.0,3.0,4128.0,0,1.325602,4,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229533,2939791,11708128,G,8005260,25000.0,KY,IKITELLI,11687714,2018-01-02,0,0.0,0.0,0.0,0.0,0,0.000000,31,1
229534,2939792,11708128,G,363682,30000.0,KY,IKITELLI,11687714,2018-01-02,0,0.0,0.0,0.0,0.0,0,0.000000,31,1
229535,2939793,11708128,G,1246339,20000.0,KY,IKITELLI,11676073,2018-01-02,0,0.0,0.0,0.0,0.0,0,0.000000,31,1
229536,2939794,11708128,G,1019468,10000.0,KY,IKITELLI,11549352,2018-01-02,0,0.0,0.0,0.0,0.0,0,0.000000,31,1


>> Incomes from Checks for Each Customer

Customers earn income from checks whose KULLANDIRIM is equal to KV. **Sum the CEK_TUTAR** of these checks for each customer.

In [80]:
# sum the CEK_TUTAR amount for each MUSTERI_ID where KULLANDIRIM is equal to KV
g_company_type_df[g_company_type_df['KULLANDIRIM'] == 'KV'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().sort_values(ascending=False)

MUSTERI_ID
11633272    1.218740e+06
11662188    1.101993e+06
11688660    1.005800e+06
131702      9.089248e+05
28648       8.152050e+05
                ...     
12034143    1.600000e+03
136012      1.509000e+03
11956149    1.500000e+03
11869377    1.500000e+03
11701463    1.422610e+03
Name: CEK_TUTAR, Length: 19219, dtype: float64

In [81]:
sum_cekTutar_df = g_company_type_df[g_company_type_df['KULLANDIRIM'] == 'KV'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().reset_index()
sum_cekTutar_df.columns = ['MUSTERI_ID', 'CEK_INCOME']

g_company_type_df = pd.merge(g_company_type_df, sum_cekTutar_df, on='MUSTERI_ID', how='left')     # left outer join to append the CEK_INCOME sum per customer
del sum_cekTutar_df                                                                               # delete the temporary dataframe

g_company_type_df['CEK_INCOME'] = g_company_type_df['CEK_INCOME'].fillna(0)                       # fill the NaN values with 0

In addition, customers do not earn any income from checks whose KULLANDIRIM is equal to KY. **Sum the CEK_TUTAR** of these checks for each customer as well.

In [83]:
# sum the CEK_TUTAR amount for each MUSTERI_ID where KULLANDIRIM is equal to KY
g_company_type_df[g_company_type_df['KULLANDIRIM'] == 'KY'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().sort_values(ascending=False)

MUSTERI_ID
11616207    3.017795e+07
81500       6.818861e+06
11762944    6.072031e+06
11816079    6.015223e+06
11561408    5.905000e+06
                ...     
11910764    1.350000e+03
11875696    1.180000e+03
11533203    1.000000e+03
11759794    1.000000e+03
12012966    1.000000e+03
Name: CEK_TUTAR, Length: 35019, dtype: float64

In [84]:
# creating a new column on g_company_type_df from the per-customer sum of CEK_TUTAR over KY checks
# note: the filtered frame holds only KY rows, so the transform result is aligned to those rows;
# rows with KULLANDIRIM == 'KV' receive NaN here
g_company_type_df['CEK_LOSS'] = g_company_type_df[g_company_type_df['KULLANDIRIM'] == 'KY'].groupby('MUSTERI_ID')['CEK_TUTAR'].transform('sum')

# then fill the NaN values with 0
g_company_type_df['CEK_LOSS'] = g_company_type_df['CEK_LOSS'].fillna(0)
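
Because `transform` is applied to the KY rows only, the KV rows of the same customer end up with a CEK_LOSS of 0. This is visible in the `head(10)` output further down, where rows 0 and 2 belong to customer 11800527 but show 47160.0 and 0.0. If a single per-customer value is intended (an assumption about the intent, not something the notebook states), one possible sketch is to zero out the non-KY amounts and sum within each customer. The column names ending in `_ROWWISE` and `_CUSTOMER` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "MUSTERI_ID":  [1, 1, 1, 2],
    "KULLANDIRIM": ["KY", "KV", "KY", "KV"],
    "CEK_TUTAR":   [100.0, 50.0, 30.0, 70.0],
})

# Row-wise variant: only KY rows receive the sum, KV rows become NaN and then 0
ky_rows = df[df["KULLANDIRIM"] == "KY"]
df["CEK_LOSS_ROWWISE"] = ky_rows.groupby("MUSTERI_ID")["CEK_TUTAR"].transform("sum")
df["CEK_LOSS_ROWWISE"] = df["CEK_LOSS_ROWWISE"].fillna(0)

# Per-customer variant: zero out non-KY amounts, then sum within each customer
ky_amount = df["CEK_TUTAR"].where(df["KULLANDIRIM"] == "KY", 0.0)
df["CEK_LOSS_CUSTOMER"] = ky_amount.groupby(df["MUSTERI_ID"]).transform("sum")
```

With the per-customer variant every row of customer 1 carries the same KY total, so any ratio derived from it is also constant per customer.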

In [34]:
"""
sum_cekKayıp_df = g_company_type_df[g_company_type_df['KULLANDIRIM'] == 'KY'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().reset_index()
sum_cekKayıp_df.columns = ['MUSTERI_ID', 'CEK_LOSS']

g_company_type_df = pd.merge(g_company_type_df, sum_cekKayıp_df, on='MUSTERI_ID', how='left')     # left outer join to append CEK_LOSS count per customers
del sum_cekKayıp_df                                                                               # delete the temporary dataframe

g_company_type_df['CEK_LOSS'].fillna(0, inplace=True)                                           # fill the NaN values with 0
"""

In [86]:
# round CEK_LOSS values to one decimal place
g_company_type_df['CEK_LOSS'] = g_company_type_df['CEK_LOSS'].round(1)

>> Percentage of CEK_INCOME over All Potential Income

First, look for records where both values are 0.

In [88]:
g_company_type_df[(g_company_type_df['CEK_INCOME'] == 0) & (g_company_type_df['CEK_LOSS'] == 0)]

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,BK_LIMIT,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI,BK_ORAN,KESIDECI_COUNT,SUBE_COUNT,CEK_INCOME,CEK_LOSS


There are no records for this case, so no mask operation is needed.

Finally, calculate the percentage of CEK_INCOME over all potential check incomes a customer may get.

The formula is: **CEK_INCOME / (CEK_LOSS + CEK_INCOME)**

In [89]:
# calculate CEK_INCOME / (CEK_LOSS + CEK_INCOME) in a new column named CEK_GELIR_PERCENTAGE
g_company_type_df['CEK_GELIR_PERCENTAGE'] = g_company_type_df['CEK_INCOME'] / (g_company_type_df['CEK_LOSS'] + g_company_type_df['CEK_INCOME'])
g_company_type_df['CEK_GELIR_PERCENTAGE'] = g_company_type_df['CEK_GELIR_PERCENTAGE'].fillna(0)   # fill the NaN values with 0
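
As a worked check of the formula, the sketch below recomputes the share for a few income and loss pairs. The first three pairs are taken from the `head(10)` output of this notebook; the last pair (both zero) is a hypothetical case added to show why `fillna(0)` is there.

```python
import pandas as pd

df = pd.DataFrame({
    "CEK_INCOME": [50500.0, 168000.0, 0.0, 0.0],
    "CEK_LOSS":   [47160.0, 184000.0, 73588.0, 0.0],
})

total = df["CEK_INCOME"] + df["CEK_LOSS"]

# 0 / 0 yields NaN in pandas; treat "no checks at all" as a share of 0
df["CEK_GELIR_PERCENTAGE"] = (df["CEK_INCOME"] / total).fillna(0)
```

The first two rows come out at roughly 0.5171 and 0.477273, matching the values shown for customers 11800527 and 11724283.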

In [90]:
g_company_type_df.head(10)

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,...,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI,BK_ORAN,KESIDECI_COUNT,SUBE_COUNT,CEK_INCOME,CEK_LOSS,CEK_GELIR_PERCENTAGE
0,2457932,11800527,G,70331933,20000.0,KY,PENDIK,11535476,2018-11-30,0,...,0.0,0.0,0.0,0,0.0,3,2,50500.0,47160.0,0.5171
1,2457933,12024009,G,3014103,25000.0,KV,ÇORLU,11753282,2018-11-30,3,...,226.0,2.0,1012.0,0,9.792035,4,1,50000.0,0.0,1.0
2,2457934,11800527,G,7031933,20000.0,KV,PENDIK,11535476,2018-11-30,0,...,0.0,0.0,0.0,0,0.0,3,2,50500.0,0.0,1.0
3,2457936,11724283,G,7198012,23000.0,KY,TOPÇULAR,126553,2018-11-30,2,...,9481.0,3.0,3575.0,0,5.495201,4,1,168000.0,184000.0,0.477273
4,2457937,11879266,G,9090937,10000.0,KY,ADANA,98511,2018-11-30,7,...,211301.0,3.0,4128.0,0,1.325602,4,1,0.0,73588.0,0.0
5,2457938,11879266,G,9090936,13458.0,KY,ADANA,98511,2018-11-30,7,...,211301.0,3.0,4128.0,0,1.325602,4,1,0.0,73588.0,0.0
6,2457939,11854083,G,4535918,5200.0,KY,IZMIT,12054732,2018-11-30,0,...,0.0,0.0,0.0,0,0.0,20,1,222300.0,740950.0,0.230781
7,2457942,11654711,G,88624,110000.0,KY,BESEVLER,74198,2018-11-30,4,...,150619.0,1.0,847.0,0,1.429176,1,1,0.0,110000.0,0.0
8,2457943,11723432,G,3331313,5300.0,KY,YILDIRIM,11570468,2018-11-30,4,...,65599.0,3.0,2591.0,0,1.591244,4,1,10000.0,35050.0,0.221976
9,2457944,11577211,G,4602258,11000.0,KY,BEYLIKDÜZÜ,12054730,2018-11-30,4,...,85589.0,4.0,6379.0,0,2.522053,18,1,25000.0,347710.4,0.067076


>> Frequency of Customer's Check Transactions

In [91]:
# to see the number of transactions in different dates by each customer
g_company_type_df.groupby('MUSTERI_ID')['ISLEM_TARIHI'].nunique().sort_values(ascending=True)

MUSTERI_ID
300188452      1
11982080       1
11982051       1
11982034       1
11754965       1
            ... 
11769908      81
11609912      82
11529385      84
11954204      86
11633272     100
Name: ISLEM_TARIHI, Length: 38374, dtype: int64

In [92]:
transact_date_sorted = g_company_type_df.sort_values(['MUSTERI_ID', 'ISLEM_TARIHI'])    # sort the data using new dataframe first

# Calculate the time difference between consecutive transactions for each customer
transact_date_sorted['DAYS_BETWEEN_TRANSACTIONS'] = transact_date_sorted.groupby('MUSTERI_ID')['ISLEM_TARIHI'].diff().dt.days
# Calculate the average days between transactions for each customer
transact_avg_days = transact_date_sorted.groupby('MUSTERI_ID')['DAYS_BETWEEN_TRANSACTIONS'].mean().reset_index(name='AVERAGE_DAYS')

transact_avg_days['AVERAGE_DAYS'] = transact_avg_days['AVERAGE_DAYS'].fillna(0)   # customers with a single transaction have no gap, so fill NaN with 0
g_company_type_df= g_company_type_df.merge(transact_avg_days, on='MUSTERI_ID', how='left')     # left outer join to append the feature extraction
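
A small sketch of the same computation on toy dates may help to see where the NaN and 0 values come from. The dates and customer IDs are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "MUSTERI_ID": [1, 1, 1, 2],
    "ISLEM_TARIHI": pd.to_datetime(
        ["2018-01-01", "2018-01-11", "2018-01-31", "2018-03-05"]
    ),
})

ordered = df.sort_values(["MUSTERI_ID", "ISLEM_TARIHI"])

# diff() within each customer: the first transaction has no predecessor, so it is NaN
gaps = ordered.groupby("MUSTERI_ID")["ISLEM_TARIHI"].diff().dt.days

# mean() skips NaN; a customer with a single transaction has only NaN, so the mean is NaN
avg_days = gaps.groupby(ordered["MUSTERI_ID"]).mean().fillna(0)
```

Customer 1 has gaps of 10 and 20 days, giving an average of 15. Several checks on the same day produce gaps of 0, which is why fractional averages well below 1 appear in the real data.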

In [94]:
g_company_type_df['AVERAGE_DAYS'].value_counts().sort_index(ascending=True)

0.000000      18164
0.076923         14
0.090909         12
0.111111         20
0.125000          9
              ...  
315.000000        2
316.000000        4
317.000000        2
325.000000        2
326.000000        2
Name: AVERAGE_DAYS, Length: 4197, dtype: int64

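The averages are computed first; now we convert them into categories. The bins used in the code below are right-inclusive:

    (-1, 1] days->One Timer (includes customers with a single transaction date, whose average is 0)
    (1, 30] days->Very Often Transactions
    (30, 90] days->Often Transactions
    (90, 360] days->Rare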

In [95]:
# create categories by the conditions above for AVERAGE_DAYS
g_company_type_df['TRANSACTIONS_FREQ'] = pd.cut(g_company_type_df['AVERAGE_DAYS'], bins=[-1, 1, 30, 90, 360], labels=['One Timer', 'Very Often', 'Often', 'Rare'])
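
To see how `pd.cut` treats values on the bin edges, the sketch below applies the same bins and labels to a handful of hand-picked averages.

```python
import pandas as pd

avg_days = pd.Series([0.0, 0.5, 1.0, 1.5, 30.0, 30.5, 90.0, 326.0])
labels = ["One Timer", "Very Often", "Often", "Rare"]

# Intervals are right-inclusive by default: (-1, 1], (1, 30], (30, 90], (90, 360]
freq = pd.cut(avg_days, bins=[-1, 1, 30, 90, 360], labels=labels)
```

Averages of 0.5 and 1.0 land in "One Timer" together with 0.0, so that label also covers customers with several transactions at most one day apart on average. Using 0 as the second bin edge would restrict it to an average of exactly 0, if that is the intended meaning.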

We need to apply **one-hot encoding** to make this attribute usable in ML operations.

In [97]:
one_hot_encoded_freq = pd.get_dummies(g_company_type_df['TRANSACTIONS_FREQ'])       # one hot encoding for the new feature
g_company_type_df = pd.concat([g_company_type_df, one_hot_encoded_freq], axis=1)    # append the encodings as new features to the dataframe
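
A minimal sketch of the encoding step on three toy labels follows. Passing `dtype=int` is an addition made here so that the indicator columns are 0/1 integers regardless of the pandas version, since newer versions return booleans by default.

```python
import pandas as pd

labels = ["One Timer", "Very Often", "Often", "Rare"]
freq = pd.Series(
    pd.Categorical(["Often", "One Timer", "Very Often"], categories=labels),
    name="TRANSACTIONS_FREQ",
)

# One indicator column per category, including categories that do not occur
encoded = pd.get_dummies(freq, dtype=int)
```

Each row has exactly one 1, and the "Rare" column is present but all zeros because the categorical dtype carries the full list of categories.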

In [98]:
g_company_type_df['TRANSACTIONS_FREQ'].value_counts().sort_index(ascending=True)

One Timer      19968
Very Often    169228
Often          35423
Rare            4919
Name: TRANSACTIONS_FREQ, dtype: int64

>> Find Customer's Latest Operation Date

Find the number of days between each customer's last transaction and the reference date 2018-12-01, the day after the latest transaction date in the dataset.

In [100]:
# find the last ISLEM_TARIHI for each customer
g_company_type_df['LAST_TRANSACTION_DATE'] = g_company_type_df.groupby('MUSTERI_ID')['ISLEM_TARIHI'].transform('max')
# now find the difference of these dates with 01-12-2018
g_company_type_df['DAYS_SINCE_LAST_TRANSACTION'] = (pd.to_datetime('2018-12-01') - g_company_type_df['LAST_TRANSACTION_DATE']).dt.days

del g_company_type_df['LAST_TRANSACTION_DATE']    # delete the temporary column
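
The recency computation can be checked on a few toy rows. The reference date matches the one in the cell above; the customer IDs are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "MUSTERI_ID": [1, 1, 2],
    "ISLEM_TARIHI": pd.to_datetime(["2018-11-30", "2018-01-02", "2018-01-02"]),
})
reference_date = pd.Timestamp("2018-12-01")

# Latest transaction date per customer, repeated on each of that customer's rows
last_date = df.groupby("MUSTERI_ID")["ISLEM_TARIHI"].transform("max")
df["DAYS_SINCE_LAST_TRANSACTION"] = (reference_date - last_date).dt.days
```

A customer whose last transaction was on 2018-01-02 gets 333 days, which agrees with the maximum of this column in the `describe()` output.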

>> Finally, deleting unnecessary attributes

In [102]:
g_company_type_df.columns

Index(['ID', 'MUSTERI_ID', 'SIRKET_TURU', 'CEK_NO', 'CEK_TUTAR', 'KULLANDIRIM',
       'SUBE', 'KESIDECI_ID', 'ISLEM_TARIHI', 'BK_KURUMSAYISI', 'BK_LIMIT',
       'BK_RISK', 'BK_GECIKMEHESAP', 'BK_GECIKMEBAKIYE',
       'MUSTERI_RISK_SEVIYESI', 'BK_ORAN', 'KESIDECI_COUNT', 'SUBE_COUNT',
       'CEK_INCOME', 'CEK_LOSS', 'CEK_GELIR_PERCENTAGE', 'AVERAGE_DAYS',
       'TRANSACTIONS_FREQ', 'One Timer', 'Very Often', 'Often', 'Rare',
       'DAYS_SINCE_LAST_TRANSACTION'],
      dtype='object')

In [53]:
# drop the columns that are no longer needed after feature extraction
g_company_type_df = g_company_type_df.drop(columns=[
    'KESIDECI_ID', 'SUBE', 'KULLANDIRIM', 'SIRKET_TURU',
    'CEK_TUTAR', 'ISLEM_TARIHI', 'AVERAGE_DAYS', 'CEK_NO',
])

In [54]:
g_company_type_df.describe()

Unnamed: 0,ID,MUSTERI_ID,BK_KURUMSAYISI,BK_LIMIT,BK_RISK,BK_GECIKMEHESAP,BK_GECIKMEBAKIYE,MUSTERI_RISK_SEVIYESI,BK_ORAN,KESIDECI_COUNT,SUBE_COUNT,CEK_INCOME,CEK_LOSS,CEK_GELIR_PERCENTAGE,One Timer,Very Often,Often,Rare,DAYS_SINCE_LAST_TRANSACTION
count,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0,229538.0
mean,2701168.0,10359000.0,3.123295,69961.97,48270.58,1.928814,2203.242,0.039113,991.941698,11.707125,1.017696,98797.01,329166.5,0.425567,0.086992,0.737255,0.154323,0.02143,67.566629
std,139535.5,13786150.0,2.247007,109235.9,83886.69,2.176905,18568.36,0.285967,13452.459457,13.903977,0.134041,117034.6,672117.0,0.406137,0.281824,0.440126,0.361259,0.144813,76.837046
min,2457932.0,105.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
25%,2580292.0,11576140.0,1.0,3797.0,1898.0,0.0,0.0,0.0,1.105593,3.0,1.0,15000.0,0.0,0.074047,0.0,0.0,0.0,0.0,10.0
50%,2702264.0,11763640.0,3.0,29650.0,15996.0,1.0,471.0,0.0,1.381565,7.0,1.0,62000.0,100800.0,0.252173,0.0,1.0,0.0,0.0,37.0
75%,2822914.0,11901180.0,4.0,91348.0,58718.25,3.0,2381.0,0.0,1.936203,15.0,1.0,146039.4,387550.0,1.0,0.0,1.0,0.0,0.0,106.0
max,2939795.0,300188500.0,15.0,2362133.0,1771233.0,31.0,2384798.0,3.0,879000.0,115.0,4.0,1218740.0,30177950.0,1.0,1.0,1.0,1.0,1.0,333.0


In [103]:
g_company_type_df

Unnamed: 0,ID,MUSTERI_ID,SIRKET_TURU,CEK_NO,CEK_TUTAR,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,BK_KURUMSAYISI,...,CEK_INCOME,CEK_LOSS,CEK_GELIR_PERCENTAGE,AVERAGE_DAYS,TRANSACTIONS_FREQ,One Timer,Very Often,Often,Rare,DAYS_SINCE_LAST_TRANSACTION
0,2457932,11800527,G,70331933,20000.0,KY,PENDIK,11535476,2018-11-30,0,...,50500.0,47160.0,0.517100,64.800000,Often,0,0,1,0,1
1,2457933,12024009,G,3014103,25000.0,KV,ÇORLU,11753282,2018-11-30,3,...,50000.0,0.0,1.000000,1.000000,One Timer,1,0,0,0,1
2,2457934,11800527,G,7031933,20000.0,KV,PENDIK,11535476,2018-11-30,0,...,50500.0,0.0,1.000000,64.800000,Often,0,0,1,0,1
3,2457936,11724283,G,7198012,23000.0,KY,TOPÇULAR,126553,2018-11-30,2,...,168000.0,184000.0,0.477273,21.214286,Very Often,0,1,0,0,1
4,2457937,11879266,G,9090937,10000.0,KY,ADANA,98511,2018-11-30,7,...,0.0,73588.0,0.000000,55.000000,Often,0,0,1,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
229533,2939791,11708128,G,8005260,25000.0,KY,IKITELLI,11687714,2018-01-02,0,...,89000.0,781358.0,0.102257,2.200000,Very Often,0,1,0,0,234
229534,2939792,11708128,G,363682,30000.0,KY,IKITELLI,11687714,2018-01-02,0,...,89000.0,781358.0,0.102257,2.200000,Very Often,0,1,0,0,234
229535,2939793,11708128,G,1246339,20000.0,KY,IKITELLI,11676073,2018-01-02,0,...,89000.0,781358.0,0.102257,2.200000,Very Often,0,1,0,0,234
229536,2939794,11708128,G,1019468,10000.0,KY,IKITELLI,11549352,2018-01-02,0,...,89000.0,781358.0,0.102257,2.200000,Very Often,0,1,0,0,234


#### Save Changes on a New Table

Using feather to save changes

In [104]:
# save g_company_type_df to a feather file
g_company_type_df.to_feather('data/bk_data.feather')

In [130]:
del g_company_type_df   # delete the dataframe after saving it into feather file

#### Handle Data Types *for T type Customers*

In [8]:
t_company_type_df = pd.read_feather('data/tk_rawData.feather')    # read the feather file

In [9]:
t_company_type_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252335 entries, 0 to 252334
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   MUSTERI_ID             252335 non-null  object 
 1   ID                     252335 non-null  int64  
 2   CEK_NO                 252335 non-null  object 
 3   CEK_TUTAR              252335 non-null  float64
 4   SIRKET_TURU            252335 non-null  object 
 5   KULLANDIRIM            252335 non-null  object 
 6   SUBE                   252335 non-null  object 
 7   KESIDECI_ID            252335 non-null  object 
 8   ISLEM_TARIHI           252335 non-null  object 
 9   TK_NAKDILIMIT          252335 non-null  int64  
 10  TK_NAKDIRISK           252335 non-null  int64  
 11  TK_GAYRINAKDILIMIT     252335 non-null  int64  
 12  TK_GAYRINAKDIRISK      252335 non-null  int64  
 13  TK_GECIKMEHESAP        252335 non-null  int64  
 14  TK_GECIKMEBAKIYE       252335 non-nu

In [11]:
# converting into float to use on ML model later
t_company_type_df['TK_NAKDILIMIT']=t_company_type_df['TK_NAKDILIMIT'].astype(float)
t_company_type_df['TK_NAKDIRISK']=t_company_type_df['TK_NAKDIRISK'].astype(float)
t_company_type_df['TK_GAYRINAKDIRISK']=t_company_type_df['TK_GAYRINAKDIRISK'].astype(float)
t_company_type_df['TK_GAYRINAKDILIMIT']=t_company_type_df['TK_GAYRINAKDILIMIT'].astype(float)

t_company_type_df['TK_GECIKMEHESAP']=t_company_type_df['TK_GECIKMEHESAP'].astype(float)
t_company_type_df['TK_GECIKMEBAKIYE']=t_company_type_df['TK_GECIKMEBAKIYE'].astype(float)

In [12]:
t_company_type_df['SIRKET_TURU'] = t_company_type_df['SIRKET_TURU'].astype('string')
t_company_type_df['SUBE'] = t_company_type_df['SUBE'].astype('string')
t_company_type_df['KULLANDIRIM'] = t_company_type_df['KULLANDIRIM'].astype('string')

t_company_type_df['KESIDECI_ID'] = t_company_type_df['KESIDECI_ID'].str.extract(r'(\d+)', expand=False).astype('int64')   # raw string avoids an invalid escape sequence warning
t_company_type_df['MUSTERI_ID'] = t_company_type_df['MUSTERI_ID'].astype('int64')

t_company_type_df['ISLEM_TARIHI'] = pd.to_datetime(t_company_type_df['ISLEM_TARIHI'], format='%Y.%m.%d')

In [13]:
t_company_type_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 252335 entries, 0 to 252334
Data columns (total 17 columns):
 #   Column                 Non-Null Count   Dtype         
---  ------                 --------------   -----         
 0   MUSTERI_ID             252335 non-null  int64         
 1   ID                     252335 non-null  int64         
 2   CEK_NO                 252335 non-null  object        
 3   CEK_TUTAR              252335 non-null  float64       
 4   SIRKET_TURU            252335 non-null  string        
 5   KULLANDIRIM            252335 non-null  string        
 6   SUBE                   252335 non-null  string        
 7   KESIDECI_ID            252335 non-null  int64         
 8   ISLEM_TARIHI           252335 non-null  datetime64[ns]
 9   TK_NAKDILIMIT          252335 non-null  float64       
 10  TK_NAKDIRISK           252335 non-null  float64       
 11  TK_GAYRINAKDILIMIT     252335 non-null  float64       
 12  TK_GAYRINAKDIRISK      252335 non-null  floa

#### Feature Extraction

The following features are extracted for T-type customer data.

>> Limit/Risk Ratio for T type Customers

This is handled by summing each customer's cash and non-cash limits, summing their cash and non-cash risks, and then taking the ratio of total limit to total risk.

In [14]:
# create a new column as the sum of TK_GAYRINAKDIRISK and TK_NAKDIRISK
t_company_type_df['TOTAL_RISK'] = t_company_type_df['TK_GAYRINAKDIRISK'] + t_company_type_df['TK_NAKDIRISK']

# create a new column as the sum of TK_GAYRINAKDILIMIT and TK_NAKDILIMIT
t_company_type_df['TOTAL_LIMIT'] = t_company_type_df['TK_GAYRINAKDILIMIT'] + t_company_type_df['TK_NAKDILIMIT']

In [15]:
t_company_type_df['TK_ORAN'] = 0.0    # create the TK_ORAN column, initialized with 0.0 so that it has a float dtype

# when RISK and LIMIT are both non-zero, TK_ORAN is the ratio of LIMIT to RISK
mask = (t_company_type_df['TOTAL_RISK'] != 0) & (t_company_type_df['TOTAL_LIMIT'] != 0)
t_company_type_df.loc[mask, 'TK_ORAN'] = t_company_type_df.loc[mask, 'TOTAL_LIMIT'] / t_company_type_df.loc[mask, 'TOTAL_RISK']

# when RISK is 0 but LIMIT is not, TK_ORAN is calculated as LIMIT / 0.1
mask = (t_company_type_df['TOTAL_RISK'] == 0) & (t_company_type_df['TOTAL_LIMIT'] != 0)
t_company_type_df.loc[mask, 'TK_ORAN'] = t_company_type_df.loc[mask, 'TOTAL_LIMIT'] / 0.1  # use 0.1 in place of the zero RISK

# when LIMIT is 0 but RISK is not, TK_ORAN is calculated as 0.1 / RISK
mask = (t_company_type_df['TOTAL_LIMIT'] == 0) & (t_company_type_df['TOTAL_RISK'] != 0)
t_company_type_df.loc[mask, 'TK_ORAN'] = 0.1 / t_company_type_df.loc[mask, 'TOTAL_RISK']   # use 0.1 in place of the zero LIMIT

# when LIMIT and RISK are both 0, TK_ORAN is set to 0
mask = (t_company_type_df['TOTAL_LIMIT'] == 0) & (t_company_type_df['TOTAL_RISK'] == 0)
t_company_type_df.loc[mask, 'TK_ORAN'] = 0
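
The four cases of the limit/risk ratio can be checked on a toy frame. The sketch below is an assumed alternative formulation, not the notebook's own code: it substitutes 0.1 for any zero and then overrides the both-zero case, which should give the same values as the mask-based assignments. The frame `toy` and its numbers are made up.

```python
import numpy as np
import pandas as pd

# hypothetical totals covering the four cases: both non-zero, RISK zero, LIMIT zero, both zero
toy = pd.DataFrame({
    'TOTAL_LIMIT': [1000.0, 500.0, 0.0, 0.0],
    'TOTAL_RISK':  [250.0, 0.0, 50.0, 0.0],
})

# replace zeros with 0.1 so that the division is always defined
limit = toy['TOTAL_LIMIT'].replace(0, 0.1)
risk = toy['TOTAL_RISK'].replace(0, 0.1)

# when both totals are zero the ratio is defined as 0 rather than 0.1 / 0.1
both_zero = (toy['TOTAL_LIMIT'] == 0) & (toy['TOTAL_RISK'] == 0)
toy['TK_ORAN'] = np.where(both_zero, 0.0, limit / risk)

print(toy)
```

Note that the 0.1 substitution makes the zero-RISK case very large (LIMIT multiplied by 10), which may matter if the feature is later scaled.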


>> Keşideci Count for Each Customer

In [16]:
count_kesideci_df = t_company_type_df.groupby('MUSTERI_ID')['KESIDECI_ID'].nunique().reset_index()
count_kesideci_df.columns = ['MUSTERI_ID', 'KESIDECI_COUNT']

t_company_type_df = pd.merge(t_company_type_df, count_kesideci_df, on='MUSTERI_ID', how='left')     # left outer join to append the distinct KESIDECI_ID count per customer
del count_kesideci_df # delete the temporary dataframe
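
As a small illustration of what KESIDECI_COUNT holds, the sketch below counts distinct drawers per customer on invented data. It uses `transform('nunique')`, an assumed shortcut that broadcasts the count back to every row without a separate merge; the IDs are hypothetical.

```python
import pandas as pd

# hypothetical transactions: customer 1 has two distinct drawers, customer 2 has one
toy = pd.DataFrame({
    'MUSTERI_ID':  [1, 1, 1, 2],
    'KESIDECI_ID': [10, 10, 20, 30],
})

# number of distinct KESIDECI_ID values per customer, repeated on each of that customer's rows
toy['KESIDECI_COUNT'] = toy.groupby('MUSTERI_ID')['KESIDECI_ID'].transform('nunique')

print(toy)
```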

>> Customer's Check Income & its Ratio over All Possible Incomes

In [17]:
sum_cekTutar_df = t_company_type_df[t_company_type_df['KULLANDIRIM'] == 'KV'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().reset_index()
sum_cekTutar_df.columns = ['MUSTERI_ID', 'CEK_INCOME']

t_company_type_df = pd.merge(t_company_type_df, sum_cekTutar_df, on='MUSTERI_ID', how='left')     # left outer join to append CEK_INCOME per customer
del sum_cekTutar_df                                                                               # delete the temporary dataframe

t_company_type_df['CEK_INCOME'] = t_company_type_df['CEK_INCOME'].fillna(0)                       # customers without KV checks get NaN; fill with 0

Next, compute the CEK_LOSS attribute, the loss from possible check incomes, as the sum of each customer's KY check amounts.

In [18]:
sum_cekKayıp_df = t_company_type_df[t_company_type_df['KULLANDIRIM'] == 'KY'].groupby('MUSTERI_ID')['CEK_TUTAR'].sum().reset_index()
sum_cekKayıp_df.columns = ['MUSTERI_ID', 'CEK_LOSS']

t_company_type_df = pd.merge(t_company_type_df, sum_cekKayıp_df, on='MUSTERI_ID', how='left')   # left outer join to append CEK_LOSS per customer
del sum_cekKayıp_df                                                                             # delete the temporary dataframe

t_company_type_df['CEK_LOSS'] = t_company_type_df['CEK_LOSS'].fillna(0)                         # customers without KY checks get NaN; fill with 0

# round CEK_LOSS to one decimal place
t_company_type_df['CEK_LOSS'] = t_company_type_df['CEK_LOSS'].round(1)

>> Calculate the **Ratio of CEK_INCOME over INCOME and LOSS**

In [19]:
# calculate CEK_INCOME / (CEK_INCOME + CEK_LOSS) as a new column
t_company_type_df['CEK_GELIR_PERCENTAGE'] = t_company_type_df['CEK_INCOME'] / (t_company_type_df['CEK_INCOME'] + t_company_type_df['CEK_LOSS'])
t_company_type_df['CEK_GELIR_PERCENTAGE'] = t_company_type_df['CEK_GELIR_PERCENTAGE'].fillna(0)   # 0 / 0 yields NaN; fill with 0
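
The income, loss, and ratio features can be illustrated together on a toy frame. This is a sketch under assumptions: the amounts are invented, and the per-customer sums are computed with `where` plus `transform('sum')` instead of the merge used in the cells above, which should yield the same columns.

```python
import pandas as pd

# hypothetical checks: KV rows count toward income, KY rows toward loss
toy = pd.DataFrame({
    'MUSTERI_ID':  [1, 1, 1, 2, 3, 4],
    'KULLANDIRIM': ['KV', 'KY', 'KV', 'KY', 'KV', 'KV'],
    'CEK_TUTAR':   [100.0, 300.0, 200.0, 50.0, 80.0, 0.0],
})

# zero out the amounts of the other type, then sum per customer and broadcast to every row
is_kv = toy['KULLANDIRIM'] == 'KV'
toy['CEK_INCOME'] = toy['CEK_TUTAR'].where(is_kv, 0.0).groupby(toy['MUSTERI_ID']).transform('sum')
toy['CEK_LOSS'] = toy['CEK_TUTAR'].where(~is_kv, 0.0).groupby(toy['MUSTERI_ID']).transform('sum')

# share of income in income + loss; 0 / 0 gives NaN, which is replaced by 0
total = toy['CEK_INCOME'] + toy['CEK_LOSS']
toy['CEK_GELIR_PERCENTAGE'] = (toy['CEK_INCOME'] / total).fillna(0)

print(toy)
```

Customer 4 shows the edge case where both sums are zero, so the ratio falls back to 0.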

>> Frequency of Customer Transactions

In [20]:
transact_date_sorted = t_company_type_df.sort_values(['MUSTERI_ID', 'ISLEM_TARIHI'])    # sort a copy by customer and transaction date

# calculate the time difference between consecutive transactions for each customer
transact_date_sorted['DAYS_BETWEEN_TRANSACTIONS'] = transact_date_sorted.groupby('MUSTERI_ID')['ISLEM_TARIHI'].diff().dt.days
# calculate the average days between transactions for each customer
transact_avg_days = transact_date_sorted.groupby('MUSTERI_ID')['DAYS_BETWEEN_TRANSACTIONS'].mean().reset_index(name='AVERAGE_DAYS')

transact_avg_days['AVERAGE_DAYS'] = transact_avg_days['AVERAGE_DAYS'].fillna(0)   # a customer with a single transaction has no gap (NaN); fill with 0
t_company_type_df = t_company_type_df.merge(transact_avg_days, on='MUSTERI_ID', how='left')     # left outer join to append AVERAGE_DAYS per customer
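
To make the averaging step concrete, here is a minimal sketch on invented dates. Customer 1 has three transactions and customer 2 has one, so the second customer has no gap to average and ends up with 0.

```python
import pandas as pd

# hypothetical transactions, deliberately unsorted
toy = pd.DataFrame({
    'MUSTERI_ID': [1, 1, 1, 2],
    'ISLEM_TARIHI': pd.to_datetime(['2018-01-31', '2018-01-01', '2018-01-11', '2018-03-05']),
})

# sort by customer and date so that diff() compares consecutive transactions
toy_sorted = toy.sort_values(['MUSTERI_ID', 'ISLEM_TARIHI'])

# days between consecutive transactions; the first row of each customer is NaN
gaps = toy_sorted.groupby('MUSTERI_ID')['ISLEM_TARIHI'].diff().dt.days

# mean gap per customer; a customer with a single transaction has only NaN, filled with 0
avg_days = gaps.groupby(toy_sorted['MUSTERI_ID']).mean().fillna(0)

print(avg_days)
```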

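After assigning values, convert them into categories. The intervals below follow the bin edges used in the next cell; each interval includes its upper bound, and an average above 360 days falls outside every bin and becomes NaN.

    0-1 days->One Timer (includes customers with a single transaction, whose average was filled with 0)
    1-30 days->Very Often
    30-90 days->Often
    90-360 days->Rare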

In [21]:
# create categories by the conditions above for AVERAGE_DAYS
t_company_type_df['TRANSACTIONS_FREQ'] = pd.cut(t_company_type_df['AVERAGE_DAYS'], bins=[-1, 1, 30, 90, 360], labels=['One Timer', 'Very Often', 'Often', 'Rare'])
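
The behaviour of these bin edges can be checked on a few sample averages. The values in `avg_days` are made up to hit the boundaries; the sketch only demonstrates how `pd.cut` treats them with its default right-inclusive intervals.

```python
import pandas as pd

# hypothetical averages, chosen to sit on and around the bin edges
avg_days = pd.Series([0.0, 1.0, 5.7, 30.0, 45.0, 200.0, 400.0])

freq = pd.cut(
    avg_days,
    bins=[-1, 1, 30, 90, 360],
    labels=['One Timer', 'Very Often', 'Often', 'Rare'],
)

# intervals are (-1, 1], (1, 30], (30, 90], (90, 360]; 400 lies outside them and maps to NaN
print(pd.DataFrame({'AVERAGE_DAYS': avg_days, 'TRANSACTIONS_FREQ': freq}))
```

Because an average of exactly 1 day lands in the first bin, a customer who transacts daily is labelled the same as a one-time customer; a different lower edge would be needed to separate the two.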

We need to apply **one-hot encoding** so that this categorical attribute can be used by ML models.

In [22]:
one_hot_encoded_freq = pd.get_dummies(t_company_type_df['TRANSACTIONS_FREQ'])       # one hot encoding for the new feature
t_company_type_df = pd.concat([t_company_type_df, one_hot_encoded_freq], axis=1)    # append the encodings as new features to the dataframe
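
A short sketch of what the encoding produces, on an invented categorical series. The `dtype=int` argument is an assumption added here so that the columns hold 0/1 integers regardless of the pandas version; the notebook cell above does not pass it.

```python
import pandas as pd

labels = ['One Timer', 'Very Often', 'Often', 'Rare']

# hypothetical frequency labels, including one missing value
freq = pd.Series(pd.Categorical(['Very Often', 'Rare', 'One Timer', 'Often', None], categories=labels))

# one indicator column per category; a missing label gets 0 in every column
dummies = pd.get_dummies(freq, dtype=int)

print(dummies)
```

Rows whose category is missing (for example an average above 360 days) end up with all four indicators equal to 0.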

>> Days Since the Latest Transaction of Each Customer

In [23]:
# find the last ISLEM_TARIHI for each customer
t_company_type_df['LAST_TRANSACTION_DATE'] = t_company_type_df.groupby('MUSTERI_ID')['ISLEM_TARIHI'].transform('max')
# then compute the number of days between that date and the reference date 2018-12-01
t_company_type_df['DAYS_SINCE_LAST_TRANSACTION'] = (pd.to_datetime('2018-12-01') - t_company_type_df['LAST_TRANSACTION_DATE']).dt.days

del t_company_type_df['LAST_TRANSACTION_DATE']    # delete the temporary column
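
The recency feature can be verified by hand on a toy frame. The dates below are invented; the reference date 2018-12-01 is the one used in the cell above.

```python
import pandas as pd

# hypothetical transactions for two customers
toy = pd.DataFrame({
    'MUSTERI_ID': [1, 1, 2],
    'ISLEM_TARIHI': pd.to_datetime(['2018-11-01', '2018-11-30', '2018-10-01']),
})

reference_date = pd.Timestamp('2018-12-01')

# latest transaction date per customer, repeated on each of that customer's rows
last_date = toy.groupby('MUSTERI_ID')['ISLEM_TARIHI'].transform('max')

# whole days between the latest transaction and the reference date
toy['DAYS_SINCE_LAST_TRANSACTION'] = (reference_date - last_date).dt.days

print(toy)
```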

In [24]:
t_company_type_df.head(10)

Unnamed: 0,MUSTERI_ID,ID,CEK_NO,CEK_TUTAR,SIRKET_TURU,KULLANDIRIM,SUBE,KESIDECI_ID,ISLEM_TARIHI,TK_NAKDILIMIT,...,CEK_INCOME,CEK_LOSS,CEK_GELIR_PERCENTAGE,AVERAGE_DAYS,TRANSACTIONS_FREQ,One Timer,Very Often,Often,Rare,DAYS_SINCE_LAST_TRANSACTION
0,11820145,2457923,3062309,8000.0,T,KY,IZMIR SANAYI,12054734,2018-11-30,11000.0,...,230234.5,325980.0,0.413931,10.677419,Very Often,0,1,0,0,1
1,11672216,2457924,80075,35000.0,T,KY,MECIDIYEKÖY,21013,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
2,11672216,2457925,1009838,5000.0,T,KY,MECIDIYEKÖY,12006233,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
3,11672216,2457926,1009837,10000.0,T,KY,MECIDIYEKÖY,12006233,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
4,11672216,2457927,8005059,15000.0,T,KY,MECIDIYEKÖY,12000824,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
5,11672216,2457928,7824,5000.0,T,KY,MECIDIYEKÖY,11830127,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
6,11672216,2457929,5425083,6000.0,T,KY,MECIDIYEKÖY,11789483,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
7,11672216,2457930,84781,77600.0,T,KY,MECIDIYEKÖY,11618513,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
8,11672216,2457931,17971,100000.0,T,KY,MECIDIYEKÖY,186882,2018-11-30,6903025.0,...,205000.0,945708.7,0.178151,5.74359,Very Often,0,1,0,0,1
9,11827634,2457935,1012285,22000.0,T,KY,TOPÇULAR,151751,2018-11-30,931624.0,...,104500.0,706879.5,0.128793,9.433333,Very Often,0,1,0,0,1


All feature extraction is finished, so the attributes below are no longer needed and can be deleted.

In [25]:
del t_company_type_df['ISLEM_TARIHI']   # delete the column ISLEM_TARIHI
del t_company_type_df['AVERAGE_DAYS']   # delete the column AVERAGE_DAYS

del t_company_type_df['KESIDECI_ID']    # delete the column KESIDECI_ID
del t_company_type_df['SUBE']           # delete the column SUBE
del t_company_type_df['KULLANDIRIM']    # delete the column KULLANDIRIM
del t_company_type_df['SIRKET_TURU']    # delete the column SIRKET_TURU
del t_company_type_df['CEK_TUTAR']      # delete the column CEK_TUTAR

#### Save Changes

For T-type data, create a separate feather file.

In [26]:
# save t_company_type_df to a feather file
t_company_type_df.to_feather('data/tk_data.feather')

In [27]:
del t_company_type_df