# 📊 02_clean_afdr_charts.ipynb

## 🎯 Purpose

This notebook prepares the dataset `afdr_charts.csv` 
from the USDA Agricultural Finance Databook (Table A2). 
It includes quarterly statistics from 1977 onward 
on non-real-estate farm loans: volume, interest rates, 
loan sizes, and floating-rate shares.

## Relevance to Our Research

Our project investigates how demographic and financial 
behavior features influence default risk across BNPL and 
traditional loans. This dataset is **not usable for modeling** 
because:

- It contains **aggregated national-level statistics**, not 
  individual user-level data.
- It lacks **demographic features** (age, income, etc.).
- It has **no outcome labels** (like default vs. non-default).
- It focuses on **agricultural finance**, not BNPL-type behavior.

## When Might This Be Useful?

We can use this dataset:

- As a **historical benchmark** to compare changes in financial 
  behavior (loan size, term, interest rate).
- To show the evolution of traditional loan terms over decades.
- In **data exploration or literature review**, not for predictive models.
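
As a rough illustration of the benchmark use, the sketch below collapses quarterly rows into annual averages. It assumes the cleaned column names produced later in this notebook and uses only the five rows shown in the preview. The unweighted mean is a simplification, not a method prescribed by the data source.

```python
import pandas as pd

# First five rows of the cleaned data, copied from the preview in this notebook
sample = pd.DataFrame(
    {
        "Year": [1977, 1977, 1977, 1977, 1978],
        "Quarter": [1, 2, 3, 4, 1],
        "Avg_Loan_Size": [12.35, 11.93, 13.45, 13.18, 12.29],
        "Avg_Interest_Rate": [8.82, 8.74, 8.73, 9.12, 9.16],
    }
)

# Collapse quarters into one row per year (simple unweighted mean)
annual = sample.groupby("Year")[["Avg_Loan_Size", "Avg_Interest_Rate"]].mean()
print(annual.round(2))
```

A volume-weighted average may be more appropriate for rates. The unweighted mean is used here only to keep the sketch short.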

## Output

Cleaned version saved in:
- `../1_datasets/additional_data/afdr_charts_cleaned_historical_match.csv`

In [9]:
# Import pandas for data handling
import pandas as pd

In [10]:
# Load the dataset
df = pd.read_csv("../1_datasets/raw_data/afdr_charts.csv")

# Preview the first few rows
df.head()

Unnamed: 0,Period,Number of non-real-estate farm loans,Average size of non-real-estate farm loans,Volume of non-real-estate farm loans,Average maturity of non-real-estate farm loans,Average effective interest rate on non-real-estate farm loans,Share of farm loans with a floating interest rate
0,1977Q1,3.48,12.35,42.97,10.87,8.82,15.32
1,1977Q2,4.05,11.93,48.26,8.25,8.74,13.22
2,1977Q3,3.38,13.45,45.51,7.03,8.73,22.71
3,1977Q4,2.81,13.18,36.98,9.4,9.12,17.28
4,1978Q1,3.38,12.29,41.56,9.97,9.16,14.98


In [12]:
# Split 'Period' into 'Year' and 'Quarter'
df[["Year", "Quarter"]] = df["Period"].str.extract(r"(\d{4})Q(\d)")
df["Year"] = df["Year"].astype(int)
df["Quarter"] = df["Quarter"].astype(int)

# Move 'Year' and 'Quarter' to the front for clarity
cols = ["Year", "Quarter"] + [
    col for col in df.columns if col not in ["Year", "Quarter"]
]
df = df[cols]

# Drop the original 'Period' column now that 'Year' and 'Quarter' exist
df = df.drop(columns=["Period"])
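
The regex split above could also be cross-checked with pandas' own quarterly period parsing. The following is a minimal sketch on the labels from the preview, not the notebook's method. The names `labels`, `years`, and `quarters` are illustrative only.

```python
import pandas as pd

# Period labels as they appear in the raw preview
labels = ["1977Q1", "1977Q2", "1977Q3", "1977Q4", "1978Q1"]

# Parse the labels as quarterly periods; a malformed label raises an error here
idx = pd.PeriodIndex(labels, freq="Q")

# Year and quarter are exposed as integer attributes
years = idx.year.tolist()
quarters = idx.quarter.tolist()
print(years, quarters)
```

One difference worth noting: `str.extract` yields missing values for labels that do not match the pattern, so the later `astype(int)` call would fail on them, whereas the period parser rejects a bad label immediately.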

In [13]:
df.head()

Unnamed: 0,Year,Quarter,Num_Loans,Avg_Loan_Size,Loan_Volume,Avg_Maturity,Avg_Interest_Rate,Floating_Rate_Share
0,1977,1,3.48,12.35,42.97,10.87,8.82,15.32
1,1977,2,4.05,11.93,48.26,8.25,8.74,13.22
2,1977,3,3.38,13.45,45.51,7.03,8.73,22.71
3,1977,4,2.81,13.18,36.98,9.4,9.12,17.28
4,1978,1,3.38,12.29,41.56,9.97,9.16,14.98


## 🔎 Variable Description

| Column                                                        | Description                            |
|---------------------------------------------------------------|----------------------------------------|
| Year                                                          | Year of the survey                     |
| Quarter                                                       | Quarter of the year                    |
| Number of non-real-estate farm loans                          | National count of loans                |
| Average size of non-real-estate farm loans                    | Average loan amount (in $1,000s)       |
| Volume of non-real-estate farm loans                          | Total dollar volume of loans (millions)|
| Average maturity of non-real-estate farm loans                | Average loan term (in months)          |
| Average effective interest rate on non-real-estate farm loans | Weighted avg. interest rate (%)        |
| Share of farm loans with a floating interest rate             | % of loans with variable interest rate |

In [14]:
# Rename columns to cleaner, shorter names
df.rename(
    columns={
        "Number of non-real-estate farm loans": "Num_Loans",
        "Average size of non-real-estate farm loans": "Avg_Loan_Size",
        "Volume of non-real-estate farm loans": "Loan_Volume",
        "Average maturity of non-real-estate farm loans": "Avg_Maturity",
        "Average effective interest rate on non-real-estate farm loans": "Avg_Interest_Rate",
        "Share of farm loans with a floating interest rate": "Floating_Rate_Share",
    },
    inplace=True,
)

df.head()

Unnamed: 0,Year,Quarter,Num_Loans,Avg_Loan_Size,Loan_Volume,Avg_Maturity,Avg_Interest_Rate,Floating_Rate_Share
0,1977,1,3.48,12.35,42.97,10.87,8.82,15.32
1,1977,2,4.05,11.93,48.26,8.25,8.74,13.22
2,1977,3,3.38,13.45,45.51,7.03,8.73,22.71
3,1977,4,2.81,13.18,36.98,9.4,9.12,17.28
4,1978,1,3.38,12.29,41.56,9.97,9.16,14.98
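
`DataFrame.rename` ignores mapping keys that do not match any column, so a typo in one of the long source names would go unnoticed. A minimal sketch of a stricter variant, using pandas' `errors="raise"` option on a one-row demo frame (`df_demo` and the misspelled key are made up for illustration):

```python
import pandas as pd

df_demo = pd.DataFrame(
    {"Period": ["1977Q1"], "Volume of non-real-estate farm loans": [42.97]}
)

# A correct key renames the column as expected
renamed = df_demo.rename(
    columns={"Volume of non-real-estate farm loans": "Loan_Volume"},
    errors="raise",
)

# A misspelled key raises KeyError instead of being silently skipped
try:
    df_demo.rename(columns={"Volume of farm loans": "Loan_Volume"}, errors="raise")
    typo_caught = False
except KeyError:
    typo_caught = True

print(list(renamed.columns), typo_caught)
```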


In [15]:
# Convert every column after 'Year' and 'Quarter' to a numeric dtype;
# values that cannot be parsed become NaN (errors="coerce")
for col in df.columns[2:]:
    df[col] = pd.to_numeric(df[col], errors="coerce")
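
Because `errors="coerce"` replaces unparseable entries with NaN without warning, it can help to see which values were affected. A small sketch on a made-up series (the `"n.a."` entry is hypothetical and does not come from the dataset):

```python
import pandas as pd

# Hypothetical raw column containing one non-numeric placeholder
raw = pd.Series(["8.82", "n.a.", "9.12"])

converted = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but are NaN after conversion
newly_missing = converted.isna() & raw.notna()
print(raw[newly_missing].tolist())
```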

In [16]:
# Identify rows with missing data
missing = df[df.isnull().any(axis=1)]
print("Number of rows with missing values:", len(missing))

# 🧹 Fill missing values with 0 — assume no value means zero activity
df.fillna(0, inplace=True)

Number of rows with missing values: 0
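
No rows are missing here, so the zero fill changes nothing. If a later version of the file did contain gaps, filling a rate or average column with 0 would pull its statistics down. A sketch with one artificial gap inserted into the 1977 interest rates (the gap is made up for illustration):

```python
import pandas as pd

# 1977 interest rates from the preview, with the second value blanked out
rates = pd.Series([8.82, float("nan"), 8.73, 9.12])

mean_skipping = rates.mean()                # NaN is skipped by default
mean_zero_filled = rates.fillna(0).mean()   # the gap counts as a 0% rate

print(round(mean_skipping, 4), round(mean_zero_filled, 4))
```

Whether zero is a sensible fill value likely depends on the column: it may be defensible for counts or volumes, but probably less so for rates and averages.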


## Save Cleaned Dataset

Although this dataset is not used for modeling, we preserve it 
for potential exploratory analysis or historical comparison.

We store it in the `1_datasets/additional_data/` folder to distinguish 
it from modeling datasets.

In [18]:
# Save to the additional_data folder, separate from the modeling datasets
df.to_csv(
    "../1_datasets/additional_data/afdr_charts_cleaned_historical_match.csv",
    index=False,
)

print(" Saved cleaned reference dataset.")

 Saved cleaned reference dataset.
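
`to_csv` does not create missing folders, so the save step fails if the target directory is absent. A minimal sketch of creating the folder first and reading the file back as a check. It writes to a temporary directory, and the file name `afdr_demo.csv` is a placeholder rather than the notebook's real output path.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Two rows from the cleaned preview, enough to exercise the round trip
cleaned = pd.DataFrame(
    {"Year": [1977, 1977], "Quarter": [1, 2], "Loan_Volume": [42.97, 48.26]}
)

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "additional_data" / "afdr_demo.csv"
    # Create the target folder if it does not exist yet
    out_path.parent.mkdir(parents=True, exist_ok=True)
    cleaned.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises AssertionError if the reloaded data differs from what was written
pd.testing.assert_frame_equal(cleaned, reloaded)
print("Round trip OK:", reloaded.shape)
```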
