# 03 - Construct transaction history features for the last 6 months

This notebook creates transaction history features covering the six months before each loan was created. For example, if a user creates a loan at `2022-05-05 23:30:48.986000`, we calculate the features from the transactions that happened from `2021-11-01 00:00:00.000000` (inclusive) up to `2022-05-01 00:00:00.000000` (exclusive).
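
As a minimal sketch of the window rule described above, the two boundaries can be derived from a loan timestamp with the standard library alone. The helper name `six_month_window` is hypothetical and is not part of the notebook.

```python
from datetime import date, datetime
from typing import Tuple


def six_month_window(created_at: datetime) -> Tuple[date, date]:
    """Return (window_start, reference_date) for a loan timestamp.

    reference_date is the first day of the month the loan was created in;
    window_start is the first day of the month six months earlier.
    """
    reference_date = date(created_at.year, created_at.month, 1)
    # Count months since year 0 so the subtraction can cross a year boundary.
    months = reference_date.year * 12 + (reference_date.month - 1) - 6
    window_start = date(months // 12, months % 12 + 1, 1)
    return window_start, reference_date


start, ref = six_month_window(datetime(2022, 5, 5, 23, 30, 48, 986000))
print(start, ref)  # 2021-11-01 2022-05-01
```

Note that transactions made in the creation month itself (here, 1 to 5 May) fall outside the window.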

For that, we use 2 tables stored in the `database.db` file, located at `database/database.db`:

### Loans Table
- **id (int)**: Unique identifier for the loan.
- **user_id (int)**: Unique identifier for the user who has taken the loan.
- **amount (float)**: The amount of loan disbursed.
- **total_amount (float)**: The amount of loan, including fees.
- **due_amount (int)**: The amount owed at the due date if there are no repayments during the contract period. Useful for deriving interest rates.
- **due_date (object)**: The date by which the loan is due.
- **status (object)**: Current status of the loan (e.g., repaid, debt_collection, ongoing, debt_repaid).
    - **repaid**: A loan that was paid by the due date.
    - **debt_collection**: A loan that was not paid by the due date.
    - **debt_repaid**: A loan that was not paid by the due date but whose money was later recovered.
    - **cancelled**: A cancelled loan.
    - **error**: Operational error.
- **created_at (object)**: Timestamp of when the loan record was created. <u>Treat it as the start of the loan</u>.

### Transactions Table
- **id (int)**: Unique identifier for the transaction.
- **user_id (int)**: The user ID associated with the transaction.
- **amount (float)**: Transaction amount.
- **status (object)**: Status of the transaction.
    - **approved**: A transaction that went through.
    - **denied**: A transaction that did not happen because it was denied.
- **capture_method (object)**: Method of capturing the transaction.
- **payment_method (object)**: Payment method used (e.g., credit, debit).
- **installments (int)**: Number of installments for the transaction.
- **card_brand (object)**: Brand of the card used for the transaction.
- **created_at (object)**: Timestamp of when the transaction record was created. <u>Treat it as the moment the transaction happened</u>.

## Results

For this dataset, we constructed the features below. Every feature is computed over the same window: from the first day of the sixth month before the loan was created (inclusive) to the first day of the month in which the loan was created (exclusive).

- `avg_amt_transactions_in_last_six_months`: Average transaction amount in the window.
- `max_amt_transactions_in_last_six_months`: Maximum transaction amount in the window.
- `most_frequent_transactions_payment_method_in_last_six_months`: Most frequent payment method among the transactions in the window.
- `avg_amt_payment_method_credit_in_last_six_months`: Average amount of the transactions in the window that used credit as the payment method.
- `avg_amt_payment_method_debit_in_last_six_months`: Average amount of the transactions in the window that used debit as the payment method.
- `avg_amt_transactions_in_visa_in_last_six_months`: Average amount of the transactions in the window that used the visa `card_brand`.
- `avg_amt_transactions_in_mastercard_in_last_six_months`: Average amount of the transactions in the window that used the mastercard `card_brand`.
- `avg_amt_transactions_in_elo_in_last_six_months`: Average amount of the transactions in the window that used the elo `card_brand`.
- `avg_amt_transactions_in_hipercard_in_last_six_months`: Average amount of the transactions in the window that used the hipercard `card_brand`.
- `avg_amt_transactions_in_amex_in_last_six_months`: Average amount of the transactions in the window that used the amex `card_brand`.
- `max_installments_in_last_six_months`: Maximum number of installments among the transactions in the window.
- `median_installments_in_last_six_months`: Median number of installments among the transactions in the window.

Keep in mind that only approved transactions are considered.


The final dataset is saved in `data/processed` as `df_transactions_history_per_user_in_last_six_months.csv`.

## 1 - Imports

In [1]:
import os 
os.chdir("../../")

In [2]:
import sqlalchemy
import pandas as pd 
import numpy as np

from pandas.tseries.offsets import DateOffset
from datetime import datetime,date
from typing import Union

## 2 - Read tables

In [3]:
engine = sqlalchemy.create_engine("sqlite:///./database/database.db", echo=True)

df_loans = pd.read_sql(
    sql="""
    SELECT * FROM loans l
    """,
    con=engine
)
df_loans_repay = pd.read_sql(
    sql="""
    SELECT * FROM loan_repayments lr
    """,
    con=engine
)

df_transactions = pd.read_sql(
    sql="""
    SELECT * FROM transactions tr
    """,
    con=engine
)


## 3 - Preprocessing data

In [4]:
#convert to datetime
df_loans["created_at"] = pd.to_datetime(df_loans["created_at"],utc=True,format="ISO8601")
df_loans_repay["created_at"] = pd.to_datetime(df_loans_repay["created_at"],utc=True,format="ISO8601")
df_transactions["created_at"] = pd.to_datetime(df_transactions["created_at"],utc=True,format="ISO8601")

#convert to date
df_loans["due_date"] = pd.to_datetime(df_loans["due_date"],format="%Y-%m-%d")

#create date_created 
df_loans["date_created"] = df_loans["created_at"].apply(func=lambda d:d.date())
df_loans_repay["date_created"] = df_loans_repay["created_at"].apply(func=lambda d:d.date())
df_transactions["date_created"] = df_transactions["created_at"].apply(func=lambda d:d.date())

# add reference_date in df_loans
df_loans["reference_date"] = [date(year=d.year,month=d.month,day=1)
                              for d in df_loans["date_created"]]
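
The preprocessing steps can be checked on a toy frame. The sketch below uses made-up timestamps (not the notebook's data) and the vectorised `.dt.date` accessor in place of the `apply` call; both are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical loans with ISO 8601 timestamps (not the notebook's data).
toy_loans = pd.DataFrame(
    {"created_at": ["2022-05-05T23:30:48.986000", "2022-01-15T08:00:00.000000"]}
)

# Parse the strings as timezone-aware UTC timestamps.
toy_loans["created_at"] = pd.to_datetime(toy_loans["created_at"], utc=True)

# Calendar date of creation, stored as Python date objects.
toy_loans["date_created"] = toy_loans["created_at"].dt.date

# First day of the creation month, used as the reference date.
toy_loans["reference_date"] = [d.replace(day=1) for d in toy_loans["date_created"]]

print(toy_loans[["date_created", "reference_date"]])
```

Keeping `date_created` and `reference_date` as plain dates means that later comparisons ignore the time of day and the timezone offset.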

## 4 - Construct features

### 4.1 - Functions to create features

In [5]:
def filter_amt_transactions_in_last_six_months(
        dataframe_loans:pd.DataFrame,
        dataframe_transactions:pd.DataFrame,
)->list:
    """
    Filters the transactions that happened in the six months before each loan reference date.

    Args:
        dataframe_loans (pd.DataFrame): Dataframe with the loans made by users and their
        date created, reference date and timestamp of the moment when they were created.
        dataframe_transactions (pd.DataFrame): Dataframe with all transactions made
        by users and their date created and timestamp of the moment when they were created.

    Returns:
        list: A list of dataframes, one per distinct reference date in dataframe_loans,
        each holding the transactions made in the six months before that reference date.
    """    
    dfs_all_transactions_per_month = []
    for ref_date in dataframe_loans["reference_date"].drop_duplicates().values:
        b_date_month = (ref_date + DateOffset(months=-6)).date()
        filter_df = dataframe_transactions[dataframe_transactions["date_created"].between(left=b_date_month,right=ref_date,inclusive="left")].copy()
        if len(filter_df)>0:
            filter_df["reference_date"]=ref_date
            dfs_all_transactions_per_month.append(filter_df)
    
    return dfs_all_transactions_per_month

def calculate_avg_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_avg:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe calculates the average value of a column 
    in a group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the average values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_avg (Union[str,list]): A column to calculate the average values.
        new_col_name (str): The new column name for the average values calculated.

    Returns:
        pd.DataFrame: A dataframe with all the calculated average values per group.
    """    
    dfs_avg_in_group_per_user = [
    df.groupby(by=group_by_col)[col_to_avg].mean().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_avg_in_group_per_user = pd.concat(dfs_avg_in_group_per_user).reset_index()
    return df_avg_in_group_per_user

def calculate_max_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_search_max:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe in list_dataframes, groups its values and
    calculates the maximum value per group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the maximum values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_search_max (Union[str,list]): A column to search the maximum value.
        new_col_name (str): New column name for the maximum values in the result dataframe.

    Returns:
        pd.DataFrame: A dataframe with all the calculated maximum values per group.
    """    
    dfs_max_in_group_per_user  = [
    df.groupby(by=group_by_col)[col_to_search_max].max().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_max_in_group_per_user = pd.concat(dfs_max_in_group_per_user).reset_index()
    return df_max_in_group_per_user

def calculate_median_in_group_per_user(
          list_dataframes:list,
          group_by_col:Union[str,list],
          col_to_search_median:Union[str,list],
          new_col_name:str
          )->pd.DataFrame:
    """
    For each dataframe in list_dataframes, groups its values and
    calculates the median value per group.

    Args:
        list_dataframes (list): A list of dataframes to calculate the median values per groups.
        group_by_col (Union[str,list]): A column to group values.
        col_to_search_median (Union[str,list]): A column to search the median value.
        new_col_name (str): New column name for the median values in the result dataframe.

    Returns:
        pd.DataFrame: A dataframe with all the calculated median values per group.
    """    
    dfs_median_in_group_per_user  = [
    df.groupby(by=group_by_col)[col_to_search_median].median().to_frame(name=new_col_name)
    for df in list_dataframes
    ]
    df_median_in_group_per_user = pd.concat(dfs_median_in_group_per_user).reset_index()
    return df_median_in_group_per_user


def calculate_most_frequent(
        dataframe:pd.DataFrame,
        col_to_count:Union[str,list],
        col_to_group:Union[str,list],
        return_counts:bool=False
)->pd.DataFrame:
    """
    Finds the most frequent value of a column within each group of a dataframe.

    Args:
        dataframe (pd.DataFrame): Initial dataframe whose values are counted.
        col_to_count (str,list): Column to count the values.
        col_to_group (str,list): Column to group the values in dataframe.
        return_counts (bool, optional): If true, returns the counts of the most frequent value.
        If false, returns only the group id and their most frequent value. Defaults to False.

    Returns:
        pd.DataFrame: A new dataframe with the most frequent value in the column by group. If
        return_counts is true, returns the frequency of the value per group. If return_counts
        is false, then, returns only the group and their most frequent value.
    """    
    df_counts_type = dataframe.groupby(by=col_to_group)[col_to_count]\
                              .value_counts()\
                              .to_frame(name="count_types")\
                              .reset_index()

    idxs_most_frequent = df_counts_type.groupby(by=col_to_group)["count_types"].idxmax()
    final_df = df_counts_type.loc[idxs_most_frequent,:].reset_index(drop=True)

    if return_counts:
        return final_df

    return final_df.drop(columns="count_types")
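
To make the two steps inside `calculate_most_frequent` concrete, here is a small sketch of the same count-then-`idxmax` pattern on a toy frame. The data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical approved transactions for two users.
toy = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 2, 2, 2],
        "payment_method": ["credit", "credit", "debit", "debit", "debit", "credit"],
    }
)

# Step 1: count each payment method per user.
counts = (
    toy.groupby("user_id")["payment_method"]
    .value_counts()
    .to_frame(name="count_types")
    .reset_index()
)

# Step 2: keep the row holding the largest count within each user.
top_rows = counts.groupby("user_id")["count_types"].idxmax()
most_frequent = counts.loc[top_rows].reset_index(drop=True)

print(most_frequent)
```

`idxmax` returns the first occurrence of the maximum, so when two values are tied the winner is whichever tied row comes first in the counts frame, not a business rule.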

### 4.2 - Create average amount features

- `avg_amt_transactions_in_last_six_months`
- `avg_amt_payment_method_credit_in_last_six_months`
- `avg_amt_payment_method_debit_in_last_six_months`
- `avg_amt_transactions_in_visa_in_last_six_months`
- `avg_amt_transactions_in_mastercard_in_last_six_months`
- `avg_amt_transactions_in_elo_in_last_six_months`
- `avg_amt_transactions_in_hipercard_in_last_six_months`
- `avg_amt_transactions_in_amex_in_last_six_months`

In [6]:
df_approved_transactions = df_transactions[df_transactions["status"]=="approved"]
dfs_trans_in_last_six_mths= filter_amt_transactions_in_last_six_months(
    dataframe_loans=df_loans,
    dataframe_transactions=df_approved_transactions
)
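
The window filter is half-open: the first day of the window is included and the reference date itself is excluded. The sketch below shows this on a toy frame with made-up transactions, using the same `between(..., inclusive="left")` call as the filter function.

```python
from datetime import date

import pandas as pd
from pandas.tseries.offsets import DateOffset

# Hypothetical approved transactions (dates only).
toy_tx = pd.DataFrame(
    {
        "user_id": [1, 1, 1, 1],
        "amount": [10.0, 20.0, 30.0, 40.0],
        "date_created": [
            date(2021, 10, 31),  # before the window
            date(2021, 11, 1),   # first day of the window (included)
            date(2022, 4, 30),   # last day of the window (included)
            date(2022, 5, 1),    # reference date itself (excluded)
        ],
    }
)

ref_date = date(2022, 5, 1)
window_start = (ref_date + DateOffset(months=-6)).date()

in_window = toy_tx[
    toy_tx["date_created"].between(left=window_start, right=ref_date, inclusive="left")
]
print(in_window["amount"].tolist())  # [20.0, 30.0]
```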

In [7]:
## avg_amt_transactions_in_last_six_months feature
df_avg_amt_transactions_in_last_six_months = calculate_avg_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date"],
    col_to_avg="amount",
    new_col_name="avg_amt_transactions_in_last_six_months"
)
df_avg_amt_transactions_in_last_six_months.sort_values(by="user_id")

| | user_id | reference_date | avg_amt_transactions_in_last_six_months |
|---|---|---|---|
| 3178 | 0 | 2022-04-01 | 2406.928571 |
| 5688 | 0 | 2022-05-01 | 3494.315789 |
| 8399 | 0 | 2022-06-01 | 3186.613226 |
| 11296 | 0 | 2022-07-01 | 2373.886591 |
| 14312 | 0 | 2022-08-01 | 2318.080806 |
| ... | ... | ... | ... |
| 20424 | 3153 | 2022-09-01 | 1052.805882 |
| 8398 | 3153 | 2022-05-01 | 981.956728 |
| 5687 | 3153 | 2022-04-01 | 985.757734 |
| 14311 | 3153 | 2022-07-01 | 1057.474697 |
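
The three aggregation helpers in section 4.1 differ only in the statistic they apply. As a possible simplification, they could be folded into one function that takes the statistic by name. The function `aggregate_in_group_per_user`, its parameters and the toy frames below are hypothetical and not part of the notebook.

```python
from typing import List, Union

import pandas as pd


def aggregate_in_group_per_user(
    list_dataframes: List[pd.DataFrame],
    group_by_col: Union[str, list],
    col_to_aggregate: str,
    agg_func: str,
    new_col_name: str,
) -> pd.DataFrame:
    """Apply one named aggregation ("mean", "max", "median", ...) per group."""
    frames = [
        df.groupby(by=group_by_col)[col_to_aggregate]
        .agg(agg_func)
        .to_frame(name=new_col_name)
        for df in list_dataframes
    ]
    return pd.concat(frames).reset_index()


# Hypothetical window frames, one per reference date.
toy_frames = [
    pd.DataFrame(
        {
            "user_id": [1, 1, 2],
            "reference_date": ["2022-05-01"] * 3,
            "amount": [10.0, 30.0, 5.0],
        }
    ),
    pd.DataFrame({"user_id": [1], "reference_date": ["2022-06-01"], "amount": [8.0]}),
]

avg = aggregate_in_group_per_user(
    toy_frames, ["user_id", "reference_date"], "amount", "mean", "avg_amt"
)
print(avg)
```

Passing `"max"` or `"median"` as `agg_func` would then cover the other two helpers.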


In [8]:
## avg_amt_transactions per payment_method
df_avg_amt_transactions_per_user_in_last_six_months_by_pm = calculate_avg_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date","payment_method"],
    col_to_avg="amount",
    new_col_name="avg_amt_transactions_in_last_six_months_by_payment_method"
)
df_avg_amt_transactions_per_user_in_last_six_months_by_pm.sort_values(by="user_id")

| | user_id | reference_date | payment_method | avg_amt_transactions_in_last_six_months_by_payment_method |
|---|---|---|---|---|
| 9859 | 0 | 2022-05-01 | credit | 3688.388889 |
| 1905 | 0 | 2022-03-01 | credit | 600.000000 |
| 31643 | 0 | 2022-09-01 | debit | 634.250000 |
| 31642 | 0 | 2022-09-01 | credit | 2741.406949 |
| 9860 | 0 | 2022-05-01 | debit | 1.000000 |
| ... | ... | ... | ... | ... |
| 20263 | 3153 | 2022-06-01 | credit | 1313.923007 |
| 14853 | 3153 | 2022-05-01 | debit | 224.462963 |
| 14852 | 3153 | 2022-05-01 | credit | 1360.703611 |
| 37288 | 3153 | 2022-09-01 | debit | 217.615385 |


In [9]:
## avg_amt_transactions per card brand
df_avg_amt_transactions_per_user_in_last_six_months_by_crdb = calculate_avg_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date","card_brand"],
    col_to_avg="amount",
    new_col_name="avg_amt_transactions_in_last_six_months_by_card_brand"
)
df_avg_amt_transactions_per_user_in_last_six_months_by_crdb.sort_values(by="user_id")

| | user_id | reference_date | card_brand | avg_amt_transactions_in_last_six_months_by_card_brand |
|---|---|---|---|---|
| 16188 | 0 | 2022-05-01 | mastercard | 3470.166667 |
| 62205 | 0 | 2022-10-01 | elo | 350.000000 |
| 3188 | 0 | 2022-03-01 | mastercard | 600.000000 |
| 52744 | 0 | 2022-09-01 | visa | 2796.133667 |
| 52743 | 0 | 2022-09-01 | mastercard | 2221.578947 |
| ... | ... | ... | ... | ... |
| 3186 | 3153 | 2022-02-01 | mastercard | 540.066667 |
| 3185 | 3153 | 2022-02-01 | elo | 374.500000 |
| 24551 | 3153 | 2022-05-01 | visa | 1331.522388 |
| 33646 | 3153 | 2022-06-01 | visa | 1258.905882 |


In [10]:
## create dataframes of avg amount transactions per user by payment_method

df_avg_amt_transactions_per_user_in_last_six_months_by_credit = df_avg_amt_transactions_per_user_in_last_six_months_by_pm[
    df_avg_amt_transactions_per_user_in_last_six_months_by_pm["payment_method"]=="credit"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_payment_method":
          "avg_amt_payment_method_credit_in_last_six_months"},axis=1)\
 .drop(columns=["payment_method"])

df_avg_amt_transactions_per_user_in_last_six_months_by_debit = df_avg_amt_transactions_per_user_in_last_six_months_by_pm[
    df_avg_amt_transactions_per_user_in_last_six_months_by_pm["payment_method"]=="debit"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_payment_method":
          "avg_amt_payment_method_debit_in_last_six_months"},axis=1)\
 .drop(columns=["payment_method"])

In [11]:
## create dataframes of avg amount transactions per user by card_brand

df_avg_amt_transactions_per_user_in_last_six_months_with_visa = df_avg_amt_transactions_per_user_in_last_six_months_by_crdb[
    df_avg_amt_transactions_per_user_in_last_six_months_by_crdb["card_brand"]=="visa"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_card_brand":
          "avg_amt_transactions_in_visa_in_last_six_months"},axis=1)\
 .drop(columns=["card_brand"])

df_avg_amt_transactions_per_user_in_last_six_months_with_mastercard = df_avg_amt_transactions_per_user_in_last_six_months_by_crdb[
    df_avg_amt_transactions_per_user_in_last_six_months_by_crdb["card_brand"]=="mastercard"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_card_brand":
          "avg_amt_transactions_in_mastercard_in_last_six_months"},axis=1)\
 .drop(columns=["card_brand"])

df_avg_amt_transactions_per_user_in_last_six_months_with_elo = df_avg_amt_transactions_per_user_in_last_six_months_by_crdb[
    df_avg_amt_transactions_per_user_in_last_six_months_by_crdb["card_brand"]=="elo"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_card_brand":
          "avg_amt_transactions_in_elo_in_last_six_months"},axis=1)\
 .drop(columns=["card_brand"])

df_avg_amt_transactions_per_user_in_last_six_months_with_hpcrd = df_avg_amt_transactions_per_user_in_last_six_months_by_crdb[
    df_avg_amt_transactions_per_user_in_last_six_months_by_crdb["card_brand"]=="hipercard"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_card_brand":
          "avg_amt_transactions_in_hipercard_in_last_six_months"},axis=1)\
 .drop(columns=["card_brand"])

df_avg_amt_transactions_per_user_in_last_six_months_with_amex = df_avg_amt_transactions_per_user_in_last_six_months_by_crdb[
    df_avg_amt_transactions_per_user_in_last_six_months_by_crdb["card_brand"]=="amex"
].copy()\
 .rename({"avg_amt_transactions_in_last_six_months_by_card_brand":
          "avg_amt_transactions_in_amex_in_last_six_months"},axis=1)\
 .drop(columns=["card_brand"])
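
The filter, rename and drop pattern above is repeated once per payment method and card brand. One alternative, sketched here on made-up data, is to pivot the long-format averages into one column per brand in a single step. This swaps in `DataFrame.pivot` for the notebook's approach, and the short column name `avg_amt` is hypothetical.

```python
import pandas as pd

# Hypothetical long-format averages, shaped like the per-brand frame above.
toy_long = pd.DataFrame(
    {
        "user_id": [0, 0, 1],
        "reference_date": ["2022-05-01", "2022-05-01", "2022-05-01"],
        "card_brand": ["visa", "mastercard", "visa"],
        "avg_amt": [100.0, 50.0, 70.0],
    }
)

# One column per brand; brands a user never used become NaN.
toy_wide = toy_long.pivot(
    index=["user_id", "reference_date"], columns="card_brand", values="avg_amt"
)
toy_wide.columns = [
    f"avg_amt_transactions_in_{brand}_in_last_six_months" for brand in toy_wide.columns
]
toy_wide = toy_wide.reset_index()
print(toy_wide)
```

Unlike the explicit filters, a pivot creates a column for every brand present in the data, so any brand beyond the five listed would need to be dropped afterwards.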

### 4.3 - Create max amount and installments features
- `max_installments_in_last_six_months`
- `max_amt_transactions_in_last_six_months`

In [12]:
df_max_amt_transactions_per_user_in_last_six_months = calculate_max_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date"],
    col_to_search_max="amount",
    new_col_name="max_amt_transactions_in_last_six_months"
)
df_max_amt_transactions_per_user_in_last_six_months.sort_values(by="user_id")

| | user_id | reference_date | max_amt_transactions_in_last_six_months |
|---|---|---|---|
| 3178 | 0 | 2022-04-01 | 6600.0 |
| 5688 | 0 | 2022-05-01 | 22000.0 |
| 8399 | 0 | 2022-06-01 | 22000.0 |
| 11296 | 0 | 2022-07-01 | 22000.0 |
| 14312 | 0 | 2022-08-01 | 23000.0 |
| ... | ... | ... | ... |
| 20424 | 3153 | 2022-09-01 | 8500.0 |
| 8398 | 3153 | 2022-05-01 | 25000.0 |
| 5687 | 3153 | 2022-04-01 | 25000.0 |
| 14311 | 3153 | 2022-07-01 | 25000.0 |


In [13]:
df_max_installments_per_user_in_last_six_months = calculate_max_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date"],
    col_to_search_max="installments",
    new_col_name="max_installments_in_last_six_months"
)
df_max_installments_per_user_in_last_six_months.sort_values(by="user_id")

| | user_id | reference_date | max_installments_in_last_six_months |
|---|---|---|---|
| 3178 | 0 | 2022-04-01 | 12 |
| 5688 | 0 | 2022-05-01 | 12 |
| 8399 | 0 | 2022-06-01 | 12 |
| 11296 | 0 | 2022-07-01 | 12 |
| 14312 | 0 | 2022-08-01 | 12 |
| ... | ... | ... | ... |
| 20424 | 3153 | 2022-09-01 | 12 |
| 8398 | 3153 | 2022-05-01 | 12 |
| 5687 | 3153 | 2022-04-01 | 12 |
| 14311 | 3153 | 2022-07-01 | 12 |


### 4.4 - Create median installments features

- `median_installments_in_last_six_months`

In [14]:
df_median_installments_per_user_in_last_six_months = calculate_median_in_group_per_user(
    list_dataframes=dfs_trans_in_last_six_mths,
    group_by_col=["user_id","reference_date"],
    col_to_search_median="installments",
    new_col_name="median_installments_in_last_six_months"
)
df_median_installments_per_user_in_last_six_months.sort_values(by="user_id")

| | user_id | reference_date | median_installments_in_last_six_months |
|---|---|---|---|
| 3178 | 0 | 2022-04-01 | 4.0 |
| 5688 | 0 | 2022-05-01 | 5.0 |
| 8399 | 0 | 2022-06-01 | 5.0 |
| 11296 | 0 | 2022-07-01 | 3.5 |
| 14312 | 0 | 2022-08-01 | 3.0 |
| ... | ... | ... | ... |
| 20424 | 3153 | 2022-09-01 | 4.0 |
| 8398 | 3153 | 2022-05-01 | 2.0 |
| 5687 | 3153 | 2022-04-01 | 1.0 |
| 14311 | 3153 | 2022-07-01 | 3.0 |


### 4.5 - Create most frequent features 

- `most_frequent_transactions_payment_method_in_last_six_months`

In [15]:
dfs_most_frequent_payment_per_user_in_last_six_months = [
    calculate_most_frequent(
        dataframe=df,
        col_to_count="payment_method",
        col_to_group=["user_id","reference_date"]
    )
    for df in dfs_trans_in_last_six_mths
]
df_most_frequent_payment_per_user_in_last_six_months = pd.concat(dfs_most_frequent_payment_per_user_in_last_six_months)
df_most_frequent_payment_per_user_in_last_six_months = df_most_frequent_payment_per_user_in_last_six_months.rename({"payment_method":"most_frequent_transactions_payment_method_in_last_six_months"},axis=1)
df_most_frequent_payment_per_user_in_last_six_months

| | user_id | reference_date | most_frequent_transactions_payment_method_in_last_six_months |
|---|---|---|---|
| 0 | 1 | 2022-02-01 | credit |
| 1 | 2 | 2022-02-01 | credit |
| 2 | 6 | 2022-02-01 | credit |
| 3 | 7 | 2022-02-01 | debit |
| 4 | 8 | 2022-02-01 | credit |
| ... | ... | ... | ... |
| 2933 | 3149 | 2022-10-01 | debit |
| 2934 | 3150 | 2022-10-01 | credit |
| 2935 | 3151 | 2022-10-01 | credit |
| 2936 | 3152 | 2022-10-01 | credit |


## 5 - Merge tables

In [16]:
df_feats_hist_trans_in_last_six_months = df_loans.merge(
    right=df_avg_amt_transactions_in_last_six_months.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left")\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_by_credit.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_by_debit.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_with_visa.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_with_mastercard.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_with_elo.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_with_hpcrd.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_avg_amt_transactions_per_user_in_last_six_months_with_amex.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_max_amt_transactions_per_user_in_last_six_months.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_max_installments_per_user_in_last_six_months.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_median_installments_per_user_in_last_six_months.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )\
    .merge(
    right=df_most_frequent_payment_per_user_in_last_six_months.set_index(["user_id","reference_date"]),
    right_index=True,
    left_on=["user_id","reference_date"],
    how="left"
    )
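
Because every feature table is joined on the same two keys with the same `how="left"`, the chain of merges could also be expressed as a fold over a list of frames. The sketch below uses `functools.reduce` on made-up frames; the variable names and the short column names are hypothetical.

```python
from functools import reduce

import pandas as pd

keys = ["user_id", "reference_date"]

# Hypothetical loans and two feature frames keyed like the notebook's tables.
toy_loans = pd.DataFrame(
    {"id": [10, 11], "user_id": [1, 2], "reference_date": ["2022-05-01", "2022-05-01"]}
)
toy_features = [
    pd.DataFrame({"user_id": [1], "reference_date": ["2022-05-01"], "avg_amt": [20.0]}),
    pd.DataFrame(
        {
            "user_id": [1, 2],
            "reference_date": ["2022-05-01", "2022-05-01"],
            "max_amt": [30.0, 5.0],
        }
    ),
]

# Left-join every feature frame onto the loans, one after another.
merged = reduce(
    lambda left, right: left.merge(right, on=keys, how="left"),
    toy_features,
    toy_loans,
)
print(merged)
```

One difference from the chain above: merging column on column with `on=` produces a fresh default index, while the `left_on` plus `right_index=True` form passes the index along. Loans without matching transactions keep their row and receive `NaN` in the feature columns either way.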

In [17]:
print("Shape of final dataset:",df_feats_hist_trans_in_last_six_months.shape)
print("Features in final dataset:",df_feats_hist_trans_in_last_six_months.columns)
df_feats_hist_trans_in_last_six_months.sort_values(by=["user_id","created_at"])

Shape of final dataset: (6746, 22)
Features in final dataset: Index(['id', 'user_id', 'amount', 'total_amount', 'due_amount', 'due_date',
       'status', 'created_at', 'date_created', 'reference_date',
       'avg_amt_transactions_in_last_six_months',
       'avg_amt_payment_method_credit_in_last_six_months',
       'avg_amt_payment_method_debit_in_last_six_months',
       'avg_amt_transactions_in_visa_in_last_six_months',
       'avg_amt_transactions_in_mastercard_in_last_six_months',
       'avg_amt_transactions_in_elo_in_last_six_months',
       'avg_amt_transactions_in_hipercard_in_last_six_months',
       'avg_amt_transactions_in_amex_in_last_six_months',
       'max_amt_transactions_in_last_six_months',
       'max_installments_in_last_six_months',
       'median_installments_in_last_six_months',
       'most_frequent_transactions_payment_method_in_last_six_months'],
      dtype='object')


| | id | user_id | amount | total_amount | due_amount | due_date | status | created_at | date_created | reference_date | ... | avg_amt_payment_method_debit_in_last_six_months | avg_amt_transactions_in_visa_in_last_six_months | avg_amt_transactions_in_mastercard_in_last_six_months | avg_amt_transactions_in_elo_in_last_six_months | avg_amt_transactions_in_hipercard_in_last_six_months | avg_amt_transactions_in_amex_in_last_six_months | max_amt_transactions_in_last_six_months | max_installments_in_last_six_months | median_installments_in_last_six_months | most_frequent_transactions_payment_method_in_last_six_months |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2477 | 2477 | 0 | 6000.0 | 6045.28 | 6459000000 | 2022-07-25 | error | 2022-04-26 16:47:20.625000+00:00 | 2022-04-26 | 2022-04-01 | ... | 1.000000 | 2966.666667 | 1987.125000 | | | | 6600.0 | 12.0 | 4.0 | credit |
| 86 | 86 | 1 | 6000.0 | 6045.28 | 6459000000 | 2022-05-03 | debt_collection | 2022-02-02 15:36:00.574000+00:00 | 2022-02-02 | 2022-02-01 | ... | 250.000000 | 1358.180000 | 7283.333333 | 605.000000 | | | 17100.0 | 12.0 | 6.0 | credit |
| 223 | 223 | 2 | 6000.0 | 6045.28 | 6459000000 | 2022-05-05 | debt_collection | 2022-02-04 18:20:58.272000+00:00 | 2022-02-04 | 2022-02-01 | ... | | 658.750000 | 1528.500000 | 3820.000000 | 2296.153846 | | 8000.0 | 12.0 | 10.0 | credit |
| 1744 | 1744 | 3 | 6000.0 | 6045.28 | 6458800000 | 2022-07-18 | repaid | 2022-04-18 21:46:00.032000+00:00 | 2022-04-18 | 2022-04-01 | ... | | 855.769231 | 690.000000 | 3300.000000 | | | 4500.0 | 10.0 | 5.0 | credit |
| 4538 | 4538 | 3 | 6000.0 | 6045.28 | 6458800000 | 2022-10-07 | debt_collection | 2022-07-09 16:23:37.569000+00:00 | 2022-07-09 | 2022-07-01 | ... | | 1235.000000 | 1139.615385 | 3783.333333 | | | 10000.0 | 10.0 | 5.0 | credit |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1186 | 1186 | 3153 | 6000.0 | 6045.28 | 6458800000 | 2022-06-13 | repaid | 2022-03-15 15:28:53.048000+00:00 | 2022-03-15 | 2022-03-01 | ... | 194.333333 | 1369.291667 | 773.999767 | 310.818182 | | | 25000.0 | 12.0 | 1.0 | credit |
| 3111 | 3111 | 3153 | 6000.0 | 6045.28 | 6458800000 | 2022-08-02 | repaid | 2022-05-04 11:18:29.811000+00:00 | 2022-05-04 | 2022-05-01 | ... | 224.462963 | 1331.522388 | 785.525513 | 505.529412 | | | 25000.0 | 12.0 | 2.0 | credit |
| 3856 | 3856 | 3153 | 6000.0 | 6045.28 | 6458780000 | 2022-09-13 | repaid | 2022-06-15 19:31:51.132000+00:00 | 2022-06-15 | 2022-06-01 | ... | 215.822581 | 1258.905882 | 851.121111 | 476.380952 | | | 25000.0 | 12.0 | 2.0 | credit |
| 4358 | 4358 | 3153 | 6000.0 | 6045.28 | 6458800000 | 2022-10-02 | repaid | 2022-07-04 15:32:00.095000+00:00 | 2022-07-04 | 2022-07-01 | ... | 225.864407 | 1404.695122 | 874.249891 | 573.500000 | | | 25000.0 | 12.0 | 3.0 | credit |


## 6 - Save table

In [18]:
df_feats_hist_trans_in_last_six_months.to_csv(
    "./data/processed/df_transactions_history_per_user_in_last_six_months.csv",index=False
)
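
CSV stores dates as plain text, so a later notebook that reads this file has to parse the date columns again. The sketch below shows the round trip on a made-up frame written to a temporary directory; the file name `toy_features.csv` and the column `avg_amt` are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the final dataset.
toy = pd.DataFrame(
    {
        "user_id": [1, 2],
        "reference_date": ["2022-05-01", "2022-06-01"],
        "avg_amt": [20.0, None],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_features.csv")
    toy.to_csv(path, index=False)
    # parse_dates restores the date column, which CSV stores as plain text.
    reloaded = pd.read_csv(path, parse_dates=["reference_date"])

print(reloaded.dtypes)
```

Missing feature values are written as empty fields and come back as `NaN` on reload.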