# SQL and Data Viz

1. Identify the best month in terms of loan issuance. What was the quantity and amount lent in each month?
2. Which batch had the best overall adherence?
3. Do different interest rates lead to different loan outcomes in terms of default rate?
4. Rank the 10 best and 10 worst clients. Explain your methodology for constructing this ranking.
5. What is the default rate by month and batch?
6. Assess the profitability of this operation. Provide an analysis of the operation's timeline.

> adherence: clients that got loans\
> season: loan issuing month\
> default rate: defaulted/issued loans

## Importing Libraries and Establishing Database Connection

In [1]:
import pandas as pd
from sqlalchemy import create_engine
from dotenv import load_dotenv
import os

In [2]:
# Load environment variables from .env file
load_dotenv()

True

In [3]:
# # Function to establish a connection to the PostgreSQL database
# def create_connection():
#     connection = psycopg2.connect(
#         user=os.getenv("DB_USER"),
#         password=os.getenv("DB_PASSWORD"),
#         host=os.getenv("DB_HOST"),
#         port=os.getenv("DB_PORT"),
#         database=os.getenv("DB_NAME")
#     )
#     return connection

In [4]:
# # Function to execute SQL queries and return results as a pandas DataFrame
# def execute_query(query):
#     connection = create_connection()
#     df = pd.read_sql_query(query, connection)
#     connection.close()
#     return df

In [5]:
# Function to execute SQL queries and return results as a pandas DataFrame
def execute_query(query):
    # Build the connection URL from the environment variables loaded above
    db_url = (
        f"postgresql+psycopg2://{os.getenv('DB_USER')}:{os.getenv('DB_PASSWORD')}"
        f"@{os.getenv('DB_HOST')}:{os.getenv('DB_PORT')}/{os.getenv('DB_NAME')}"
    )
    engine = create_engine(db_url)

    # Execute the query and return the result as a DataFrame
    with engine.connect() as connection:
        df = pd.read_sql_query(query, connection)
    return df
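
The connection URL in `execute_query` interpolates the credentials directly, which can break if the password contains characters such as `@` or `/`. A minimal sketch of one way to guard against that, using only the standard library, is shown here. The helper name `build_db_url`, the fallback host and port, and the example values are assumptions for illustration and are not part of the original notebook.

```python
import os
from urllib.parse import quote


def build_db_url(env=os.environ):
    """Build a PostgreSQL URL, percent-encoding the user and password."""
    # safe="" makes quote() escape every reserved character, including "/" and "@"
    user = quote(env.get("DB_USER", ""), safe="")
    password = quote(env.get("DB_PASSWORD", ""), safe="")
    host = env.get("DB_HOST", "localhost")  # assumed fallback
    port = env.get("DB_PORT", "5432")       # assumed fallback
    name = env.get("DB_NAME", "")
    return f"postgresql+psycopg2://{user}:{password}@{host}:{port}/{name}"


# Hypothetical values, for illustration only
example_env = {
    "DB_USER": "analyst",
    "DB_PASSWORD": "p@ss/word",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_NAME": "loans",
}
example_url = build_db_url(example_env)
print(example_url)
```

The resulting string could be passed to `create_engine` in place of the inline f-string.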

In [6]:
# Query to select all data from the Clients table
clients_query = "SELECT * FROM Clients;"


In [7]:
# Query to select all data from the Loans table
loans_query = "SELECT * FROM Loans;"


In [8]:
# Load data from Clients and Loans tables into pandas DataFrames
clients_df = execute_query(clients_query)
loans_df = execute_query(loans_query)


In [9]:
# Display the first few rows of the Clients DataFrame
"Clients Data:"
clients_df.head()

|  | user_id | created_at | status | batch | credit_limit | interest_rate | denied_reason | denied_at |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 2023-09-18 16:05:36 | approved | 1 | 47500 | 30 |  | NaT |
| 1 | 2 | 2020-07-05 07:00:37 | denied | 1 | 59750 | 20 | money_loundry | 2023-07-29 02:48:33 |
| 2 | 3 | 2023-07-25 03:39:55 | approved | 1 | 73000 | 30 |  | NaT |
| 3 | 4 | 2022-07-01 01:28:58 | approved | 1 | 14250 | 20 |  | NaT |
| 4 | 5 | 2023-06-23 20:17:40 | approved | 1 | 23750 | 20 |  | NaT |


In [10]:
# Basic statistics and information about the Clients DataFrame
print("\nClients Data Statistics:")
clients_df.describe()


Clients Data Statistics:


|  | user_id | created_at | batch | credit_limit | interest_rate | denied_at |
|---|---|---|---|---|---|---|
| count | 90000.0 | 90000 | 90000.0 | 90000.0 | 90000.0 | 18341 |
| mean | 45000.5 | 2022-01-01 23:32:57.941388800 | 1.49 | 50172.752778 | 52.497111 | 2023-02-20 09:45:18.857914112 |
| min | 1.0 | 2020-01-01 00:00:29 | 1.0 | 500.0 | 20.0 | 2020-01-21 01:57:34 |
| 25% | 22500.75 | 2020-12-31 06:06:37.500000 | 1.0 | 25500.0 | 20.0 | 2022-09-03 20:40:23 |
| 50% | 45000.5 | 2022-01-01 19:56:10.500000 | 1.0 | 50000.0 | 30.0 | 2023-05-28 04:49:14 |
| 75% | 67500.25 | 2023-01-02 21:06:10.249999872 | 2.0 | 75000.0 | 90.0 | 2023-10-24 19:48:20 |
| max | 90000.0 | 2024-01-01 23:49:18 | 4.0 | 100000.0 | 90.0 | 2024-01-24 23:55:11 |
| std | 25980.906451 |  | 0.780965 | 28711.436188 | 28.657638 |  |


In [11]:
# Display the first few rows of the Loans DataFrame
print("\nLoans Data:")
loans_df.head()


Loans Data:


|  | user_id | loan_id | created_at | due_at | paid_at | status | loan_amount | tax | due_amount | amount_paid |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46937 | 1 | 2020-01-06 08:58:24 | 2020-04-05 08:58:24 | 2020-02-21 08:58:24 | paid | 16638.0 | 186.01 | 18071.86 | 18071.86 |
| 1 | 29211 | 2 | 2020-01-07 05:12:59 | 2020-04-06 05:12:59 | 2020-03-09 05:12:59 | paid | 1886.0 | 21.09 | 2331.44 | 2331.44 |
| 2 | 62030 | 3 | 2020-01-12 02:06:18 | 2020-04-11 02:06:18 | NaT | default | 39802.0 | 444.99 | 42237.09 | 4147.27 |
| 3 | 14500 | 4 | 2020-01-14 18:09:12 | 2020-04-13 18:09:12 | 2020-01-28 18:09:12 | paid | 5114.0 | 57.17 | 5554.72 | 5554.72 |
| 4 | 73480 | 5 | 2020-01-15 17:28:24 | 2020-04-14 17:28:24 | 2020-03-14 17:28:24 | paid | 22153.0 | 247.67 | 27385.1 | 27385.1 |


In [12]:
# Basic statistics and information about the Loans DataFrame
print("\nLoans Data Statistics:")
loans_df.describe()


Loans Data Statistics:


|  | user_id | loan_id | created_at | due_at | paid_at | loan_amount | tax | due_amount | amount_paid |
|---|---|---|---|---|---|---|---|---|---|
| count | 150708.0 | 150708.0 | 150708 | 150708 | 89595 | 150708.0 | 150708.0 | 150708.0 | 150708.0 |
| mean | 45079.625056 | 75354.5 | 2023-04-11 23:33:29.043341824 | 2023-07-10 23:33:29.043341824 | 2023-01-28 19:17:35.457838080 | 25207.486789 | 281.819697 | 28798.309845 | 22969.777727 |
| min | 1.0 | 1.0 | 2020-01-06 08:58:24 | 2020-04-05 08:58:24 | 2020-01-28 18:09:12 | 250.0 | 2.8 | 265.29 | 0.02 |
| 25% | 22694.75 | 37677.75 | 2022-11-25 11:20:06.249999872 | 2023-02-23 11:20:06.249999872 | 2022-08-22 01:36:23 | 7143.0 | 79.86 | 8121.6175 | 4761.5175 |
| 50% | 45007.0 | 75354.5 | 2023-07-30 23:23:02 | 2023-10-28 23:23:02 | 2023-04-29 20:36:53 | 18929.0 | 211.63 | 21569.82 | 14810.72 |
| 75% | 67620.25 | 113031.25 | 2023-11-22 05:40:47.500000 | 2024-02-20 05:40:47.500000 | 2023-09-17 06:38:13.500000 | 38367.25 | 428.9425 | 43702.255 | 34393.125 |
| max | 90000.0 | 150708.0 | 2024-01-24 23:59:01 | 2024-04-23 23:59:01 | 2024-01-24 22:23:07 | 99776.0 | 1115.5 | 122087.61 | 122066.59 |
| std | 25957.238289 | 43505.796522 |  |  |  | 21914.750955 | 245.00692 | 25147.64987 | 23217.838676 |


## Analysis - Identifying the Best Month for Loan Issuance

In [13]:
# Group loans by month and calculate total quantity and amount lent in each month
loans_df['month'] = loans_df['created_at'].dt.to_period('M')
monthly_loan_stats = loans_df.groupby('month').agg(
    total_quantity=('loan_id', 'count'),
    total_amount=('loan_amount', 'sum')
).reset_index()

# Determine the month with the highest loan issuance
best_month = monthly_loan_stats.loc[monthly_loan_stats['total_amount'].idxmax()]

best_month

month                 2023-12
total_quantity          17351
total_amount      442464966.0
Name: 47, dtype: object


The analysis indicates that December 2023 was the best month for loan issuance by total amount lent, with 17,351 loans issued for a total of $442,464,966.00. The quantity and amount for every month are held in `monthly_loan_stats`. Note that the loan data ends on 2024-01-24, so January 2024 is a partial month and is not directly comparable with the others. The factors behind the December peak are not examined here and would need further analysis.
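
Since the question also asks for the quantity and amount in each month, it can help to view the whole monthly table ranked by amount rather than only the top row. The following is a minimal sketch of that step on hypothetical toy rows; `toy_loans` and its values are made up for illustration and only mirror the column names of `loans_df`.

```python
import pandas as pd

# Hypothetical toy rows standing in for loans_df (values are made up)
toy_loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4, 5],
    "created_at": pd.to_datetime([
        "2023-11-03", "2023-11-20", "2023-12-01", "2023-12-15", "2023-12-28",
    ]),
    "loan_amount": [1000.0, 2500.0, 800.0, 1200.0, 3000.0],
})

# Same aggregation as above, then sorted so the best month comes first
toy_loans["month"] = toy_loans["created_at"].dt.to_period("M")
toy_monthly = (
    toy_loans.groupby("month")
    .agg(total_quantity=("loan_id", "count"), total_amount=("loan_amount", "sum"))
    .reset_index()
    .sort_values("total_amount", ascending=False)
    .reset_index(drop=True)
)
print(toy_monthly)
```

Sorting by `total_quantity` instead would rank months by number of loans, which may or may not agree with the ranking by amount.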

## Analysis - Identifying the Batch with the Best Overall Adherence

In [19]:
# Merge clients_df with loans_df on 'user_id'
merged_df = pd.merge(loans_df, clients_df[['user_id', 'batch']], on='user_id', how='left')

# Group by 'batch' and calculate adherence directly on the 'status' column
batch_adherence = merged_df.groupby('batch')['status'].apply(lambda x: (x == 'paid').sum() / len(x)).reset_index()
batch_adherence.columns = ['batch', 'adherence']

# Identify the batch with the highest adherence rate
best_batch = batch_adherence.loc[batch_adherence['adherence'].idxmax()]

best_batch

batch        2.000000
adherence    0.602913
Name: 1, dtype: float64

The result indicates that batch 2 had the highest share of loans with status `paid`, at approximately 60.29%. Note that this metric is computed per loan rather than per client, and it measures repayment. It therefore differs from the definition of adherence given at the top of this notebook (clients that got loans), so it should be read as a repayment rate by batch.
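
For comparison, adherence as defined at the top of the notebook (clients that got loans) could be computed as the share of clients in each batch with at least one loan. This is a minimal sketch under that reading of the definition, using hypothetical toy frames; whether denied clients should be excluded from the denominator is left open here.

```python
import pandas as pd

# Hypothetical toy frames mirroring the column names of clients_df and loans_df
toy_clients = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5, 6],
    "batch": [1, 1, 1, 2, 2, 2],
})
toy_loans = pd.DataFrame({
    "loan_id": [10, 11, 12, 13],
    "user_id": [1, 1, 4, 5],
})

# Flag clients that appear at least once in the loans table
toy_clients["has_loan"] = toy_clients["user_id"].isin(toy_loans["user_id"].unique())

# Adherence per batch = clients with a loan / all clients in the batch
toy_adherence = (
    toy_clients.groupby("batch")["has_loan"]
    .mean()
    .reset_index(name="adherence")
)
print(toy_adherence)
```

Counting each client once avoids giving extra weight to clients with several loans, which a per-loan merge would do.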

## Analysis - Examining the Relationship Between Interest Rates and Loan Outcomes

## Analysis - Ranking the Best and Worst Clients

## Analysis - Determining Default Rate by Month and Batch
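
The definition given earlier (default rate = defaulted / issued loans) can be sketched as a grouped mean of a boolean flag. The example below uses hypothetical toy frames, not the real data, and assumes that every loan in the table counts as issued and that defaults are marked with the status value `default`, as seen in the loans preview.

```python
import pandas as pd

# Hypothetical toy frames mirroring the column names of clients_df and loans_df
toy_clients = pd.DataFrame({"user_id": [1, 2, 3], "batch": [1, 1, 2]})
toy_loans = pd.DataFrame({
    "loan_id": [1, 2, 3, 4, 5],
    "user_id": [1, 2, 3, 3, 1],
    "created_at": pd.to_datetime([
        "2023-01-05", "2023-01-20", "2023-01-11", "2023-02-02", "2023-02-14",
    ]),
    "status": ["paid", "default", "default", "paid", "paid"],
})

# Attach the batch to each loan and derive the issuing month
toy_merged = toy_loans.merge(toy_clients, on="user_id", how="left")
toy_merged["month"] = toy_merged["created_at"].dt.to_period("M")

# The mean of a boolean flag is defaulted loans / issued loans per group
toy_merged["is_default"] = toy_merged["status"].eq("default")
toy_default_rate = (
    toy_merged.groupby(["month", "batch"])["is_default"]
    .mean()
    .reset_index(name="default_rate")
)
print(toy_default_rate)
```

Loans that are not yet due still count in the denominator under this definition, so recent months may show a lower default rate than they will once those loans mature.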

## Analysis - Assessing the Profitability of the Operation