# 03 - Construct transaction-history features at loan creation

## What is this notebook?

This notebook builds the transaction-history features as of the moment each loan is created. For example,
if a user creates a loan at `2022-05-05 23:30:48.986000`, the features are computed from the transactions that happened between `2022-05-01 00:00:00.000000` and `2022-05-05 23:30:48.986000`.
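
The window described above can be sketched in a few lines. This is a minimal illustration, assuming the timestamp is in UTC; the variable names are hypothetical and not part of the notebook.

```python
import pandas as pd

# Hypothetical loan timestamp taken from the example above (UTC is assumed).
loan_created_at = pd.Timestamp("2022-05-05 23:30:48.986000", tz="UTC")

# Start of the window: midnight on the first day of the loan's month.
window_start = loan_created_at.normalize().replace(day=1)

print(window_start)     # 2022-05-01 00:00:00+00:00
print(loan_created_at)  # 2022-05-05 23:30:48.986000+00:00
```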

For that, we use two tables stored in the `database.db` file, located at `database/database.db`:

### Loans Table
- **id (int)**: Unique identifier for the loan.
- **user_id (int)**: Unique identifier for the user who has taken the loan.
- **amount (float)**: The amount of loan disbursed.
- **total_amount (float)**: The amount of loan, including fees.
- **due_amount (int)**: The amount owed on the due date if there are no repayments during the contract period. Useful for deriving interest rates.
- **due_date (object)**: The date by which the loan is due.
- **status (object)**: Current status of the loan (e.g., repaid, debt_collection, ongoing, debt_repaid).
    - **repaid**: A loan that was paid by the due date.
    - **debt_collection**: A loan that was not paid by the due date.
    - **debt_repaid**: A loan that was not paid by the due date, but whose money was later recovered.
    - **cancelled**: A canceled loan.
    - **error**: Operational error.
- **created_at (object)**: Timestamp of when the loan record was created. <u>Treat it as the start of the loan</u>.

### Transactions Table
- **id (int)**: Unique identifier for the transaction.
- **user_id (int)**: The user ID associated with the transaction.
- **amount (float)**: Transaction amount.
- **status (object)**: Status of the transaction.
    - **approved**: A transaction that happened.
    - **denied**: A transaction that didn't happen because it was denied.
- **capture_method (object)**: Method of capturing the transaction.
- **payment_method (object)**: Payment method used (e.g., credit, debit).
- **installments (int)**: Number of installments for the transaction.
- **card_brand (object)**: Brand of the card used for the transaction.
- **created_at (object)**: Timestamp of when the transaction record was created. <u>Treat it as the moment the transaction happened</u>.


## Results

For this dataset, we constructed these features:

- `sum_amt_transactions_at_created_loan`: This calculates the sum amount of transactions between the first day of the month when the loan was created and the moment the loan was created.
- `max_amt_transactions_at_created_loan`: This calculates the maximum value of transactions between the first day of the month when the loan was created and the moment the loan was created.
- `most_frequent_transactions_payment_method_at_created_loan`: This calculates the most frequent payment method of transactions between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_payment_method_credit_at_created_loan`: This calculates the sum amount of transactions that use credit as the payment method between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_payment_method_debit_at_created_loan`: This calculates the sum amount of transactions that use debit as the payment method between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_transactions_in_visa_at_created_loan`: The sum amount of transactions that use the visa card_brand between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_transactions_in_mastercard_at_created_loan`: The sum amount of transactions that use the mastercard card_brand between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_transactions_in_elo_at_created_loan`: The sum amount of transactions that use the elo card_brand between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_transactions_in_hipercard_at_created_loan`: The sum amount of transactions that use the hipercard card_brand between the first day of the month when the loan was created and the moment the loan was created.
- `sum_amt_transactions_in_amex_at_created_loan`: The sum amount of transactions that use the amex card_brand between the first day of the month when the loan was created and the moment the loan was created.
- `max_installments_at_created_loan`: The maximum installments value, considering the transactions that happened between the first day of the month when the loan was created and the moment the loan was created.
- `median_installments_at_created_loan`: The median installments value, considering the transactions that happened between the first day of the month when the loan was created and the moment the loan was created.

Keep in mind that only approved transactions are considered here, restricted to the period between the first day of the month when the loan was created and the moment the loan was created.
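
To make the selection rule concrete, here is a minimal sketch on a toy table for a single hypothetical user. The amounts and timestamps are made up for illustration; only the column names `amount`, `status` and `created_at` come from the Transactions table.

```python
import pandas as pd

# Toy transactions for a single hypothetical user; values are made up.
transactions = pd.DataFrame({
    "amount": [100.0, 250.0, 40.0, 999.0],
    "status": ["approved", "approved", "denied", "approved"],
    "created_at": pd.to_datetime(
        ["2022-05-02 10:00", "2022-05-04 18:30", "2022-05-03 09:00", "2022-04-28 12:00"],
        utc=True,
    ),
})
loan_created_at = pd.Timestamp("2022-05-05 23:30:48.986", tz="UTC")
window_start = loan_created_at.normalize().replace(day=1)

# Keep approved transactions inside [window_start, loan_created_at).
in_window = transactions["created_at"].between(
    window_start, loan_created_at, inclusive="left"
)
approved = transactions["status"] == "approved"
selected = transactions[in_window & approved]

sum_amt = selected["amount"].sum()  # 350.0: the denied and the April rows are excluded
max_amt = selected["amount"].max()  # 250.0
```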


The final dataset is saved in `data/processed` as `df_transactions_history_per_user_at_loan_created.csv`.

## 1 - Imports

In [1]:
import os 
os.chdir("../../")

In [2]:
import sqlalchemy
import pandas as pd 
import numpy as np

from pandas.tseries.offsets import DateOffset
from datetime import datetime,date
from typing import Union

## 2 - Read tables

In [3]:
engine = sqlalchemy.create_engine("sqlite:///./database/database.db", echo=True)

df_loans = pd.read_sql(
    sql="""
    SELECT * FROM loans l
    """,
    con=engine
)
df_loans_repay = pd.read_sql(
    sql="""
    SELECT * FROM loan_repayments lr
    """,
    con=engine
)

df_transactions = pd.read_sql(
    sql="""
    SELECT * FROM transactions tr
    """,
    con=engine
)


## 3 - Preprocessing data

In [4]:
#convert to datetime
df_loans["created_at"] = pd.to_datetime(df_loans["created_at"],utc=True,format="ISO8601")
df_loans_repay["created_at"] = pd.to_datetime(df_loans_repay["created_at"],utc=True,format="ISO8601")
df_transactions["created_at"] = pd.to_datetime(df_transactions["created_at"],utc=True,format="ISO8601")

#convert to date
df_loans["due_date"] = pd.to_datetime(df_loans["due_date"],format="%Y-%m-%d")

#create date_created 
df_loans["date_created"] = df_loans["created_at"].apply(func=lambda d:d.date())
df_loans_repay["date_created"] = df_loans_repay["created_at"].apply(func=lambda d:d.date())
df_transactions["date_created"] = df_transactions["created_at"].apply(func=lambda d:d.date())

# add reference_date in df_loans
df_loans["reference_date"] = [date(year=d.year,month=d.month,day=1)
                              for d in df_loans["date_created"]]
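
The date columns above can also be derived with pandas' vectorized `.dt` accessors instead of Python-level lambdas. This is a minimal sketch, assuming a toy stand-in for `df_loans`; the frame `df` and its values are illustrative and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_loans; the timestamps are illustrative only.
df = pd.DataFrame(
    {"created_at": ["2022-04-26T16:47:20.625Z", "2022-02-04T18:20:58.272Z"]}
)
df["created_at"] = pd.to_datetime(df["created_at"], utc=True)

# Calendar date of each timestamp.
df["date_created"] = df["created_at"].dt.date

# First day of the month: go to midnight, then step back (day - 1) days.
month_start = df["created_at"].dt.normalize() - pd.to_timedelta(
    df["created_at"].dt.day - 1, unit="D"
)
df["reference_date"] = month_start.dt.date
```

Both new columns hold `datetime.date` objects, as in the cell above.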

## 4 - Construct features

### 4.1 - Functions to create features

In [5]:
def filter_amt_transactions_in_month_by_user_id(
        dataframe_loans:pd.DataFrame,
        dataframe_transactions:pd.DataFrame,
)->list:
    """
    Filters the transactions of each user that happened between
    the first day of the month in which a loan was created and the moment the loan was created.

    Args:
        dataframe_loans (pd.DataFrame): Dataframe with the loans made by users and their
        date created, reference date and timestamp of the moment when they were created.
        dataframe_transactions (pd.DataFrame): Dataframe with all transactions made
        by users and their date created and timestamp of the moment when they were created.

    Returns:
        list: A list of dataframes, one per loan with at least one matching transaction,
        holding the transactions made between the first day of the month in which the loan
        was created and the moment the loan was created.
    """    
    dfs_all_transactions_by_user_in_month = []
    for id_user,ref_date,date_crt,e_tmsp in dataframe_loans[["user_id","reference_date","date_created","created_at"]].values:
        b_tmsp = pd.to_datetime(date(year=e_tmsp.date().year,
                                     month=e_tmsp.date().month,day=1),
                                     format="ISO8601",utc=True)
        filter_df =  dataframe_transactions[
                (dataframe_transactions["created_at"].between(left=b_tmsp,right=e_tmsp,inclusive="left")) & 
                (dataframe_transactions["user_id"]==id_user)].copy()
        if len(filter_df)>0:
                filter_df["reference_date"] = ref_date
                filter_df["loan_date_created"] = date_crt
                filter_df["loan_created_at"] = e_tmsp
                dfs_all_transactions_by_user_in_month.append(filter_df)
    
    return dfs_all_transactions_by_user_in_month

def calculate_sum_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_sum:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe in list_dataframes groups their values and 
    calculates the sum per group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the sum values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_sum (Union[str,list]): A column to calculate the sum of values.
        new_col_name (str): New column name for the sum values in the result dataframe.

    Returns:
        pd.DataFrame: A dataframe with all the calculated sum values per group.
    """    
    dfs_sum_in_group_per_user = [
    df.groupby(by=group_by_col)[col_to_sum].sum().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_sum_in_group_per_user = pd.concat(dfs_sum_in_group_per_user).reset_index()
    return df_sum_in_group_per_user

def calculate_max_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_search_max:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe in list_dataframes groups their values and 
    calculates the maximum value per group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the maximum values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_search_max (Union[str,list]): A column to search the maximum value.
        new_col_name (str): New column name for the maximum values in the result dataframe.

    Returns:
        pd.DataFrame: A dataframe with all the calculated maximum values per group.
    """    
    dfs_max_in_group_per_user  = [
    df.groupby(by=group_by_col)[col_to_search_max].max().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_max_in_group_per_user = pd.concat(dfs_max_in_group_per_user).reset_index()
    return df_max_in_group_per_user

def calculate_median_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_search_median:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe in list_dataframes groups their values and 
    calculates the median value per group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the median values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_search_median (Union[str,list]): A column to search the median value.
        new_col_name (str): New column name for the median values in the result dataframe.

    Returns:
        pd.DataFrame: A dataframe with all the calculated median values per group.
    """    
    dfs_median_in_group_per_user  = [
    df.groupby(by=group_by_col)[col_to_search_median].median().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_median_in_group_per_user = pd.concat(dfs_median_in_group_per_user).reset_index()
    return df_median_in_group_per_user

def calculate_most_frequent(
        dataframe:pd.DataFrame,
        col_to_count:Union[str,list],
        col_to_group:Union[str,list],
        return_counts:bool=False
)->pd.DataFrame:
    """
    Finds the most frequent value of a column per group in a dataframe.

    Args:
        dataframe (pd.DataFrame): Initial dataframe whose values are counted.
        col_to_count (str,list): Column whose values are counted.
        col_to_group (str,list): Column used to group the values in the dataframe.
        return_counts (bool, optional): If true, returns the counts of the most frequent value.
        If false, returns only the group id and its most frequent value. Defaults to False.

    Returns:
        pd.DataFrame: A new dataframe with the most frequent value in the column by group. If
        return_counts is true, returns the frequency of the value per group. If return_counts
        is false, then, returns only the group and their most frequent value.
    """    
    df_counts_type = dataframe.groupby(by=col_to_group)[col_to_count]\
                              .value_counts()\
                              .to_frame(name="count_types")\
                              .reset_index()

    idxs_most_frequent = df_counts_type.groupby(by=col_to_group)["count_types"].idxmax()
    final_df = df_counts_type.loc[idxs_most_frequent,:].reset_index(drop=True)

    if return_counts:
        return final_df
    else:
        final_df = final_df.drop(columns="count_types")
    
    return final_df
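
The per-loan loop in `filter_amt_transactions_in_month_by_user_id` can also be expressed as a single merge followed by a group-by. The sketch below is an alternative under stated assumptions, not the method used in this notebook: the toy frames `loans` and `transactions`, the `window_start` column and the output names `sum_amt` and `max_amt` are all hypothetical.

```python
import pandas as pd

# Toy inputs using the notebook's column names; values are made up.
loans = pd.DataFrame({
    "user_id": [1, 1, 2],
    "created_at": pd.to_datetime(
        ["2022-05-05 23:30", "2022-06-10 08:00", "2022-05-20 12:00"], utc=True
    ),
})
transactions = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "amount": [100.0, 50.0, 70.0, 30.0, 20.0],
    "created_at": pd.to_datetime(
        ["2022-05-02 10:00", "2022-05-06 09:00", "2022-06-03 11:00",
         "2022-05-01 00:00", "2022-04-30 23:59"],
        utc=True,
    ),
})

# Window start: midnight on the first day of each loan's month.
loans["window_start"] = loans["created_at"].dt.normalize() - pd.to_timedelta(
    loans["created_at"].dt.day - 1, unit="D"
)

# Pair every loan with all transactions of the same user, then keep the window.
pairs = loans.merge(transactions, on="user_id", suffixes=("_loan", "_trans"))
in_window = (pairs["created_at_trans"] >= pairs["window_start"]) & (
    pairs["created_at_trans"] < pairs["created_at_loan"]
)

# One row per loan that has at least one transaction in its window.
features = (
    pairs[in_window]
    .groupby(["user_id", "created_at_loan"])["amount"]
    .agg(sum_amt="sum", max_amt="max")
    .reset_index()
)
```

The trade-off is memory: the merge materializes every loan-transaction pair of a user before filtering, so on large tables the loop may still be the safer choice.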

### 4.2 - Create sum amount features

- `sum_amt_transactions_at_created_loan`
- `sum_amt_payment_method_credit_at_created_loan`
- `sum_amt_payment_method_debit_at_created_loan`
- `sum_amt_transactions_in_visa_at_created_loan`
- `sum_amt_transactions_in_mastercard_at_created_loan`
- `sum_amt_transactions_in_elo_at_created_loan`
- `sum_amt_transactions_in_hipercard_at_created_loan`
- `sum_amt_transactions_in_amex_at_created_loan`

In [6]:
df_approved_transactions = df_transactions[df_transactions["status"]=="approved"]
dfs_trans_per_user = filter_amt_transactions_in_month_by_user_id(
    dataframe_loans=df_loans,
    dataframe_transactions=df_approved_transactions
)

In [7]:
## sum_amt_transactions_at_created_loan feature
df_sum_amt_transactions_per_user_in_month = calculate_sum_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at"],
    col_to_sum="amount",
    new_col_name="sum_amt_transactions_at_created_loan"
)
df_sum_amt_transactions_per_user_in_month.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,sum_amt_transactions_at_created_loan
2165,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,25745.0
102,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,5250.0
1435,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,14450.0
4019,3,2022-07-01,2022-07-09,2022-07-09 16:23:37.569000+00:00,23700.0
5196,4,2022-08-01,2022-08-26,2022-08-26 11:16:48.724000+00:00,97264.0
...,...,...,...,...,...
909,3153,2022-03-01,2022-03-15,2022-03-15 15:28:53.048000+00:00,10720.0
3402,3153,2022-06-01,2022-06-15,2022-06-15 19:31:51.132000+00:00,13457.0
4951,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,13590.0
7,3153,2022-02-01,2022-02-02,2022-02-02 15:36:59.356000+00:00,2400.0


In [8]:
## sum_amt_transactions per payment_method
df_sum_amt_transactions_per_user_in_month_by_pm = calculate_sum_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at","payment_method"],
    col_to_sum="amount",
    new_col_name="sum_amt_transactions_at_created_loan_by_payment_method"
)
df_sum_amt_transactions_per_user_in_month_by_pm.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,payment_method,sum_amt_transactions_at_created_loan_by_payment_method
3532,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,credit,25745.0
144,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,credit,5250.0
6762,3,2022-07-01,2022-07-09,2022-07-09 16:23:37.569000+00:00,credit,23700.0
2254,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,credit,14450.0
8815,4,2022-08-01,2022-08-26,2022-08-26 11:16:48.724000+00:00,credit,92969.0
...,...,...,...,...,...,...
8379,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,credit,13510.0
4504,3153,2022-05-01,2022-05-04,2022-05-04 11:18:29.811000+00:00,credit,650.0
8380,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,debit,80.0
6487,3153,2022-07-01,2022-07-04,2022-07-04 15:32:00.095000+00:00,credit,5530.0


In [9]:
## sum_amt_transactions per card brand
df_sum_amt_transactions_per_user_in_month_by_crdb = calculate_sum_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at","card_brand"],
    col_to_sum="amount",
    new_col_name="sum_amt_transactions_at_created_loan_by_card_brand"
)
df_sum_amt_transactions_per_user_in_month_by_crdb.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,card_brand,sum_amt_transactions_at_created_loan_by_card_brand
5459,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,mastercard,25745.0
186,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,hipercard,5250.0
3414,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,mastercard,4350.0
3415,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,visa,9550.0
3413,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,elo,550.0
...,...,...,...,...,...,...
10104,3153,2022-07-01,2022-07-04,2022-07-04 15:32:00.095000+00:00,mastercard,130.0
8833,3153,2022-06-01,2022-06-15,2022-06-15 19:31:51.132000+00:00,visa,13007.0
10105,3153,2022-07-01,2022-07-04,2022-07-04 15:32:00.095000+00:00,visa,5600.0
7001,3153,2022-05-01,2022-05-04,2022-05-04 11:18:29.811000+00:00,mastercard,650.0


In [10]:
## create dataframes of sum amount transactions per user by payment_method

df_sum_amt_transactions_per_user_in_month_by_credit = df_sum_amt_transactions_per_user_in_month_by_pm[
    df_sum_amt_transactions_per_user_in_month_by_pm["payment_method"]=="credit"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_payment_method":
          "sum_amt_payment_method_credit_at_created_loan"},axis=1)\
 .drop(columns=["payment_method"])

df_sum_amt_transactions_per_user_in_month_by_debit = df_sum_amt_transactions_per_user_in_month_by_pm[
    df_sum_amt_transactions_per_user_in_month_by_pm["payment_method"]=="debit"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_payment_method":
          "sum_amt_payment_method_debit_at_created_loan"},axis=1)\
 .drop(columns=["payment_method"])

In [11]:
## create dataframes of sum amount transactions per user by card_brand

df_sum_amt_transactions_per_user_in_month_with_visa = df_sum_amt_transactions_per_user_in_month_by_crdb[
    df_sum_amt_transactions_per_user_in_month_by_crdb["card_brand"]=="visa"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_card_brand":
          "sum_amt_transactions_in_visa_at_created_loan"},axis=1)\
 .drop(columns=["card_brand"])

df_sum_amt_transactions_per_user_in_month_with_mastercard = df_sum_amt_transactions_per_user_in_month_by_crdb[
    df_sum_amt_transactions_per_user_in_month_by_crdb["card_brand"]=="mastercard"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_card_brand":
          "sum_amt_transactions_in_mastercard_at_created_loan"},axis=1)\
 .drop(columns=["card_brand"])

df_sum_amt_transactions_per_user_in_month_with_elo = df_sum_amt_transactions_per_user_in_month_by_crdb[
    df_sum_amt_transactions_per_user_in_month_by_crdb["card_brand"]=="elo"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_card_brand":
          "sum_amt_transactions_in_elo_at_created_loan"},axis=1)\
 .drop(columns=["card_brand"])

df_sum_amt_transactions_per_user_in_month_with_hpcrd = df_sum_amt_transactions_per_user_in_month_by_crdb[
    df_sum_amt_transactions_per_user_in_month_by_crdb["card_brand"]=="hipercard"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_card_brand":
          "sum_amt_transactions_in_hipercard_at_created_loan"},axis=1)\
 .drop(columns=["card_brand"])

df_sum_amt_transactions_per_user_in_month_with_amex = df_sum_amt_transactions_per_user_in_month_by_crdb[
    df_sum_amt_transactions_per_user_in_month_by_crdb["card_brand"]=="amex"
].copy()\
 .rename({"sum_amt_transactions_at_created_loan_by_card_brand":
          "sum_amt_transactions_in_amex_at_created_loan"},axis=1)\
 .drop(columns=["card_brand"])
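
The filter-rename-drop pattern repeated above for each card brand could possibly be replaced by one pivot from long to wide format. This is a minimal sketch on a toy frame; `long_df`, `wide_df` and the `sum_amt` column are hypothetical names, and the values are made up.

```python
import pandas as pd

# Toy long-format frame shaped like the per-card-brand sums; values are made up.
long_df = pd.DataFrame({
    "user_id": [3, 3, 3, 7],
    "loan_created_at": pd.to_datetime(
        ["2022-04-18 21:46"] * 3 + ["2022-05-02 10:00"], utc=True
    ),
    "card_brand": ["mastercard", "visa", "elo", "visa"],
    "sum_amt": [4350.0, 9550.0, 550.0, 120.0],
})

# One column per card brand; brands a loan never used become NaN.
wide_df = long_df.pivot(
    index=["user_id", "loan_created_at"], columns="card_brand", values="sum_amt"
)
wide_df.columns = [
    f"sum_amt_transactions_in_{brand}_at_created_loan" for brand in wide_df.columns
]
wide_df = wide_df.reset_index()
```

A pivot only produces columns for brands that actually occur in the data, so a brand with no transactions would need to be added explicitly if a fixed set of columns is required.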

### 4.3 - Create max amount and installments features
- `max_installments_at_created_loan`
- `max_amt_transactions_at_created_loan`

In [12]:
df_max_amt_transactions_per_user_in_month = calculate_max_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at"],
    col_to_search_max="amount",
    new_col_name="max_amt_transactions_at_created_loan"
)
df_max_amt_transactions_per_user_in_month.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,max_amt_transactions_at_created_loan
2165,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,22000.0
102,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,4500.0
1435,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,2700.0
4019,3,2022-07-01,2022-07-09,2022-07-09 16:23:37.569000+00:00,3000.0
5196,4,2022-08-01,2022-08-26,2022-08-26 11:16:48.724000+00:00,71300.0
...,...,...,...,...,...
909,3153,2022-03-01,2022-03-15,2022-03-15 15:28:53.048000+00:00,5000.0
3402,3153,2022-06-01,2022-06-15,2022-06-15 19:31:51.132000+00:00,5080.0
4951,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,2345.0
7,3153,2022-02-01,2022-02-02,2022-02-02 15:36:59.356000+00:00,2400.0


In [13]:
df_max_amt_installments_per_user_in_month = calculate_max_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at"],
    col_to_search_max="installments",
    new_col_name="max_installments_at_created_loan"
)
df_max_amt_installments_per_user_in_month.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,max_installments_at_created_loan
2165,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,12
102,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,10
1435,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,10
4019,3,2022-07-01,2022-07-09,2022-07-09 16:23:37.569000+00:00,10
5196,4,2022-08-01,2022-08-26,2022-08-26 11:16:48.724000+00:00,10
...,...,...,...,...,...
909,3153,2022-03-01,2022-03-15,2022-03-15 15:28:53.048000+00:00,12
3402,3153,2022-06-01,2022-06-15,2022-06-15 19:31:51.132000+00:00,10
4951,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,12
7,3153,2022-02-01,2022-02-02,2022-02-02 15:36:59.356000+00:00,5


### 4.4 - Create median installments features

- `median_installments_at_created_loan`

In [14]:
df_median_installments_per_user_in_month = calculate_median_in_group_per_user(
    list_dataframes=dfs_trans_per_user,
    group_by_col=["user_id","reference_date","loan_date_created","loan_created_at"],
    col_to_search_median="installments",
    new_col_name="median_installments_at_created_loan"
)
df_median_installments_per_user_in_month.sort_values(by="user_id")

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,median_installments_at_created_loan
2165,0,2022-04-01,2022-04-26,2022-04-26 16:47:20.625000+00:00,5.0
102,2,2022-02-01,2022-02-04,2022-02-04 18:20:58.272000+00:00,10.0
1435,3,2022-04-01,2022-04-18,2022-04-18 21:46:00.032000+00:00,5.0
4019,3,2022-07-01,2022-07-09,2022-07-09 16:23:37.569000+00:00,5.5
5196,4,2022-08-01,2022-08-26,2022-08-26 11:16:48.724000+00:00,3.0
...,...,...,...,...,...
909,3153,2022-03-01,2022-03-15,2022-03-15 15:28:53.048000+00:00,5.0
3402,3153,2022-06-01,2022-06-15,2022-06-15 19:31:51.132000+00:00,4.0
4951,3153,2022-08-01,2022-08-20,2022-08-20 12:42:08.411000+00:00,10.0
7,3153,2022-02-01,2022-02-02,2022-02-02 15:36:59.356000+00:00,5.0


### 4.5 - Create most frequent features 

- `most_frequent_transactions_payment_method_at_created_loan`

In [15]:
dfs_most_frequent_payment_per_user_in_month = [
    calculate_most_frequent(
        dataframe=df,
        col_to_count="payment_method",
        col_to_group=["user_id","reference_date","loan_date_created","loan_created_at"]
    )
    for df in dfs_trans_per_user
]
df_most_frequent_payment_per_user_in_month = pd.concat(dfs_most_frequent_payment_per_user_in_month)
df_most_frequent_payment_per_user_in_month = df_most_frequent_payment_per_user_in_month.rename({"payment_method":"most_frequent_transactions_payment_method_at_created_loan"},axis=1)
df_most_frequent_payment_per_user_in_month

Unnamed: 0,user_id,reference_date,loan_date_created,loan_created_at,most_frequent_transactions_payment_method_at_created_loan
0,1989,2022-02-01,2022-02-01,2022-02-01 09:17:21.960000+00:00,credit
0,2737,2022-02-01,2022-02-01,2022-02-01 09:57:31.136000+00:00,debit
0,2009,2022-02-01,2022-02-01,2022-02-01 12:14:33.440000+00:00,credit
0,988,2022-02-01,2022-02-01,2022-02-01 12:21:50.226000+00:00,credit
0,2948,2022-02-01,2022-02-01,2022-02-01 12:33:34.521000+00:00,credit
...,...,...,...,...,...
0,2154,2022-10-01,2022-10-03,2022-10-03 19:16:23.571000+00:00,credit
0,96,2022-10-01,2022-10-03,2022-10-03 19:49:31.969000+00:00,debit
0,1315,2022-10-01,2022-10-03,2022-10-03 20:30:28.453000+00:00,credit
0,2130,2022-10-01,2022-10-03,2022-10-03 20:44:11.967000+00:00,credit
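
Ties deserve a thought when picking a most frequent value. The sketch below shows an alternative based on `Series.mode()`, which returns the tied values in sorted order, so the choice among ties is deterministic. The `loan_id` column and the toy values are hypothetical; this is not the function used above.

```python
import pandas as pd

# Toy transactions for two hypothetical loans; values are made up.
df = pd.DataFrame({
    "loan_id": [1, 1, 1, 2, 2],
    "payment_method": ["credit", "debit", "credit", "debit", "credit"],
})

# Series.mode() returns the most frequent values in sorted order,
# so a tie resolves to the alphabetically first label.
most_frequent = (
    df.groupby("loan_id")["payment_method"]
    .agg(lambda s: s.mode().iloc[0])
    .rename("most_frequent_payment_method")
    .reset_index()
)
```

Loan 1 has a clear winner (`credit`), while loan 2 is a tie that resolves to `credit` because it sorts before `debit`.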


## 5 - Merge tables

In [16]:
df_feats_hist_trans_at_loan_created = df_loans.merge(
    right=df_sum_amt_transactions_per_user_in_month.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left")\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_by_credit.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_by_debit.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_with_visa.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_with_mastercard.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_with_elo.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_with_hpcrd.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_sum_amt_transactions_per_user_in_month_with_amex.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_max_amt_transactions_per_user_in_month.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_max_amt_installments_per_user_in_month.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_median_installments_per_user_in_month.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )\
    .merge(
    right=df_most_frequent_payment_per_user_in_month.set_index(["user_id","reference_date","loan_date_created","loan_created_at"]),
    right_index=True,
    left_on=["user_id","reference_date","date_created","created_at"],
    how="left"
    )
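
Since every merge above uses the same keys and `how="left"`, the chain could be folded over a list of feature tables with `functools.reduce`. This is a minimal sketch with toy frames; `base`, `feature_frames` and the key values are illustrative assumptions.

```python
from functools import reduce

import pandas as pd

keys = ["user_id", "loan_created_at"]

# Toy base table and feature tables; names and values are illustrative only.
base = pd.DataFrame({"user_id": [1, 2, 3], "loan_created_at": ["a", "b", "c"]})
feature_frames = [
    pd.DataFrame(
        {"user_id": [1, 2], "loan_created_at": ["a", "b"], "sum_amt": [10.0, 20.0]}
    ),
    pd.DataFrame({"user_id": [1], "loan_created_at": ["a"], "max_amt": [7.0]}),
]

# Left-merge every feature table onto the base, one after another.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="left"),
    feature_frames,
    base,
)
```

Loans without a matching row in a feature table keep `NaN` in that feature, as in the chained version.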

In [17]:
print("Shape of final dataset:",df_feats_hist_trans_at_loan_created.shape)
print("Columns of final dataset:",df_feats_hist_trans_at_loan_created.columns)
df_feats_hist_trans_at_loan_created.sort_values(by=["user_id","created_at"])

Shape of final dataset: (6746, 22)
Columns of final dataset: Index(['id', 'user_id', 'amount', 'total_amount', 'due_amount', 'due_date',
       'status', 'created_at', 'date_created', 'reference_date',
       'sum_amt_transactions_at_created_loan',
       'sum_amt_payment_method_credit_at_created_loan',
       'sum_amt_payment_method_debit_at_created_loan',
       'sum_amt_transactions_in_visa_at_created_loan',
       'sum_amt_transactions_in_mastercard_at_created_loan',
       'sum_amt_transactions_in_elo_at_created_loan',
       'sum_amt_transactions_in_hipercard_at_created_loan',
       'sum_amt_transactions_in_amex_at_created_loan',
       'max_amt_transactions_at_created_loan',
       'max_installments_at_created_loan',
       'median_installments_at_created_loan',
       'most_frequent_transactions_payment_method_at_created_loan'],
      dtype='object')


Unnamed: 0,id,user_id,amount,total_amount,due_amount,due_date,status,created_at,date_created,reference_date,...,sum_amt_payment_method_debit_at_created_loan,sum_amt_transactions_in_visa_at_created_loan,sum_amt_transactions_in_mastercard_at_created_loan,sum_amt_transactions_in_elo_at_created_loan,sum_amt_transactions_in_hipercard_at_created_loan,sum_amt_transactions_in_amex_at_created_loan,max_amt_transactions_at_created_loan,max_installments_at_created_loan,median_installments_at_created_loan,most_frequent_transactions_payment_method_at_created_loan
2477,2477,0,6000.0,6045.28,6459000000,2022-07-25,error,2022-04-26 16:47:20.625000+00:00,2022-04-26,2022-04-01,...,,,25745.0,,,,22000.0,12.0,5.0,credit
86,86,1,6000.0,6045.28,6459000000,2022-05-03,debt_collection,2022-02-02 15:36:00.574000+00:00,2022-02-02,2022-02-01,...,,,,,,,,,,
223,223,2,6000.0,6045.28,6459000000,2022-05-05,debt_collection,2022-02-04 18:20:58.272000+00:00,2022-02-04,2022-02-01,...,,,,,5250.0,,4500.0,10.0,10.0,credit
1744,1744,3,6000.0,6045.28,6458800000,2022-07-18,repaid,2022-04-18 21:46:00.032000+00:00,2022-04-18,2022-04-01,...,,9550.0,4350.0,550.0,,,2700.0,10.0,5.0,credit
4538,4538,3,6000.0,6045.28,6458800000,2022-10-07,debt_collection,2022-07-09 16:23:37.569000+00:00,2022-07-09,2022-07-01,...,,10000.0,10700.0,3000.0,,,3000.0,10.0,5.5,credit
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1186,1186,3153,6000.0,6045.28,6458800000,2022-06-13,repaid,2022-03-15 15:28:53.048000+00:00,2022-03-15,2022-03-01,...,,5000.0,3820.0,1900.0,,,5000.0,12.0,5.0,credit
3111,3111,3153,6000.0,6045.28,6458800000,2022-08-02,repaid,2022-05-04 11:18:29.811000+00:00,2022-05-04,2022-05-01,...,,,650.0,,,,650.0,6.0,6.0,credit
3856,3856,3153,6000.0,6045.28,6458780000,2022-09-13,repaid,2022-06-15 19:31:51.132000+00:00,2022-06-15,2022-06-01,...,450.0,13007.0,450.0,,,,5080.0,10.0,4.0,credit
4358,4358,3153,6000.0,6045.28,6458800000,2022-10-02,repaid,2022-07-04 15:32:00.095000+00:00,2022-07-04,2022-07-01,...,300.0,5600.0,130.0,100.0,,,2950.0,12.0,2.0,credit


## 6 - Save table

In [18]:
df_feats_hist_trans_at_loan_created.to_csv(
    "./data/processed/df_transactions_history_per_user_at_loan_created.csv",index=False
)
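
`to_csv` does not create missing folders, and timestamps come back as plain text unless they are parsed on read. The sketch below illustrates both points in a temporary directory so it is safe to run; the file name `example_features.csv` and the toy frame are hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical output location inside a temporary directory.
output_dir = Path(tempfile.mkdtemp()) / "data" / "processed"
output_dir.mkdir(parents=True, exist_ok=True)  # create missing folders

output_path = output_dir / "example_features.csv"
df = pd.DataFrame({
    "user_id": [1],
    "created_at": pd.to_datetime(["2022-05-05 23:30:48"], utc=True),
})
df.to_csv(output_path, index=False)

# Timestamps are written as text, so parse them again when reading back.
restored = pd.read_csv(output_path, parse_dates=["created_at"])
```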