Project 5: Customer financial health analysis and banking information.
Part 1: data treatment and creation of synthetic data.

For this project, the dataset used is the Bank Marketing Dataset from UCI. It is complemented with synthetic columns: a customer ID, a customer name, the number of transactions in the last month, and, for each transaction, a transaction ID, a transaction value, and a transaction date.

The dataset is available at the following link: https://archive.ics.uci.edu/dataset/222/bank+marketing

The following script is used for the generation of the synthetic data.

In [None]:
import pandas as pd
import faker
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the raw data from the CSV file
bank_raw = pd.read_csv(r"C:\Users\diego\OneDrive\Documentos\Python\Customer_bank_info.csv")

# Dropping of unnecessary columns. The df is copied to avoid modifying the original data in case
# of error.
bank_processed = bank_raw.copy()
unnecessary_columns = ['contact', 'duration', 'campaign', 'pdays', 'previous', 'poutcome', 'y', 'day', 'month']
# columns= already selects the axis, so axis=1 is not needed.
bank_processed = bank_processed.drop(columns=unnecessary_columns)

# Creation of a customer ID column. This assumes the dataset contains no duplicate rows, so each row is a distinct customer.
bank_processed['customer_id'] = range(1, len(bank_processed) + 1)

# Creation of a name column using the faker library.
# A single Faker instance is reused; creating a new one for every row is much slower.
from faker import Faker
fake = Faker()
num_rows = len(bank_processed)
bank_processed['name'] = [fake.name() for _ in range(num_rows)]

# Creation of a transaction number column using a Poisson distribution, capped at a maximum of 30 transactions.
# This assumes that the balance is a reasonable value to determine the number of transactions.
# The Poisson rate must be non-negative, so the absolute value of the balance is used in case the column contains negative balances.
# Using this method, the number of transactions tends to be higher for larger absolute balances.

bank_processed['transaction_number_last_month'] = np.random.poisson(lam = bank_processed['balance'].abs() / 1000, size=num_rows).clip(0, 30)

# Creation of the transaction columns, simulating multiple transactions per customer.
# Transaction IDs come from a sequential counter, and transaction dates within the last 30 days are generated with the faker library.
# A uniform distribution is used to generate transaction values between -2000 and 2000, simulating deposits and withdrawals.

# The first step is the creation of a separate df for transactions.
# The first loop iterates over each customer of the bank_processed df, and the second loop creates that customer's transactions based on the transaction_number_last_month column.
# _ is used as a placeholder for the inner loop variable, as it is not needed in this case.
# A single Faker instance is reused for all dates; creating a new one per transaction is much slower.
fake = Faker()
transaction_id_counter = 10
transactions = []
for customer_id, n_transactions in zip(bank_processed['customer_id'],
                                       bank_processed['transaction_number_last_month']):
    # int() guards against the count arriving as a float or NumPy scalar.
    for _ in range(int(n_transactions)):
        transactions.append({'customer_id': customer_id,
                             'transaction_id': transaction_id_counter,
                             'transaction_value': np.random.uniform(-2000, 2000),
                             'transaction_date': fake.date_between(start_date='-30d', end_date='today').strftime('%Y-%m-%d')})
        transaction_id_counter += 1

transactions_df = pd.DataFrame(transactions)

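# Joining the transactions df with the bank_processed df using the customer_id column.
# The left join keeps customers with zero transactions (their transaction columns are NaN);
# customers with several transactions appear once per transaction.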
bank_final = pd.merge(bank_processed, transactions_df, on='customer_id', how='left')

# Export the final df to a csv file
bank_final.to_csv(r"C:\Users\diego\OneDrive\Documentos\Python\Customer_bank_info_final.csv", index=False)
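
The nested loops above can also be written in vectorized form. The following is a minimal sketch under stated assumptions, not the method used in the script: the `customers` frame is a small hypothetical stand-in for `bank_processed`, the seed 42 is arbitrary and only there to make the output reproducible, and the dates are drawn as uniform day offsets over the last 30 days with NumPy instead of the faker library.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for bank_processed: only the columns the sketch needs.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "balance": [2143, -350, 0, 12500],
})

# A seeded generator makes the synthetic data reproducible (the seed is arbitrary).
rng = np.random.default_rng(42)

# Poisson rate grows with |balance|; counts are capped at 30 as in the script.
counts = rng.poisson(lam=customers["balance"].abs().to_numpy() / 1000).clip(0, 30)
total = int(counts.sum())

# Random offsets of 0 to 30 days before today, formatted as YYYY-MM-DD strings.
today = pd.Timestamp.today().normalize()
offsets = pd.to_timedelta(rng.integers(0, 31, size=total), unit="D")
dates = (today - offsets).strftime("%Y-%m-%d")

transactions_df = pd.DataFrame({
    # Repeat each customer_id once per transaction instead of looping in Python.
    "customer_id": np.repeat(customers["customer_id"].to_numpy(), counts),
    # Sequential IDs starting at 10, matching the counter in the script.
    "transaction_id": np.arange(10, 10 + total),
    "transaction_value": rng.uniform(-2000, 2000, size=total),
    "transaction_date": dates,
})

print(transactions_df.head())
```

Because every column is built as a whole array, the cost no longer grows with a Python-level loop over each transaction, which may matter for the full dataset. The resulting `transactions_df` has the same columns as the one built by the loops, so the merge step would stay unchanged.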
