# 01 — Data Intake & Initial Validation

**Goal:** Load the public *Loan Prediction* dataset, inspect its structure, impute the initial missing values,  
and save a cleaned copy for further EDA and modelling.

**Input:** `data/raw/loan_data_raw.csv`  
**Output:** `data/interim/loan_data_clean_start.csv`

In [1]:
!mkdir -p Loan_Repayment_Behaviour_Analytics/{data/raw,data/interim,data/processed,notebooks,images,reports,src}
!touch Loan_Repayment_Behaviour_Analytics/{README.md,requirements.txt}


In [2]:
!ls -R Loan_Repayment_Behaviour_Analytics


Loan_Repayment_Behaviour_Analytics:
data  images  notebooks  README.md  reports  requirements.txt  src

Loan_Repayment_Behaviour_Analytics/data:
interim  processed  raw

Loan_Repayment_Behaviour_Analytics/data/interim:

Loan_Repayment_Behaviour_Analytics/data/processed:

Loan_Repayment_Behaviour_Analytics/data/raw:

Loan_Repayment_Behaviour_Analytics/images:

Loan_Repayment_Behaviour_Analytics/notebooks:

Loan_Repayment_Behaviour_Analytics/reports:

Loan_Repayment_Behaviour_Analytics/src:


In [None]:
#from google.colab import drive
#drive.mount('/content/drive')
#create the project folder inside Drive:
#!mkdir -p /content/drive/MyDrive/Loan_Repayment_Behaviour_Analytics/{data/raw,data/interim,data/processed,notebooks,images,reports,src}

In [3]:
'''from pathlib import Path
import pandas as pd

# Set paths relative to project root
#ROOT = Path.cwd().resolve()# --- set explicit project root for Colab ---
if Path("/content/drive/MyDrive/Loan_Repayment_Behaviour_Analytics").exists():
    ROOT = Path("/content/drive/MyDrive/Loan_Repayment_Behaviour_Analytics")
elif Path("/content/Loan_Repayment_Behaviour_Analytics").exists():
    ROOT = Path("/content/Loan_Repayment_Behaviour_Analytics")
else:
    ROOT = Path.cwd()

# folders
DATA_RAW = ROOT / "data" / "raw"
DATA_INTERIM = ROOT / "data" / "interim"
DATA_INTERIM.mkdir(parents=True, exist_ok=True)

print("Project root:", ROOT)
print("Raw data path:", DATA_RAW)'''


In [4]:
# --- Import standard paths ---
import sys
from pathlib import Path

# Add src folder to Python path (so imports work in notebooks)
ROOT = Path.cwd()
if ROOT.name.lower() == "notebooks":
    ROOT = ROOT.parent
sys.path.append(str(ROOT / "src"))

from utils_paths import get_project_paths

paths = get_project_paths()

# Unpack for easy use
env = paths["env"]
DATA_RAW = paths["DATA_RAW"]
DATA_INTERIM = paths["DATA_INTERIM"]
DATA_PROCESSED = paths["DATA_PROCESSED"]
IMAGES = paths["IMAGES"]

print(f"Environment: {env}")
print("RAW:", DATA_RAW)
print("INTERIM:", DATA_INTERIM)
print("PROCESSED:", DATA_PROCESSED)


Environment: Colab
RAW: /content/Loan_Repayment_Behaviour_Analytics/data/raw
INTERIM: /content/Loan_Repayment_Behaviour_Analytics/data/interim
PROCESSED: /content/Loan_Repayment_Behaviour_Analytics/data/processed
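
The `utils_paths` module imported above is not shown in this notebook. Below is a minimal sketch of what `get_project_paths` could look like, assuming it returns a dict with the keys used here (`env`, `DATA_RAW`, `DATA_INTERIM`, `DATA_PROCESSED`, `IMAGES`). The Colab detection, the optional `root` parameter, and the extra `ROOT` key are assumptions for illustration, not the project's actual implementation.

```python
# Hypothetical sketch of src/utils_paths.py (the real helper is not shown here).
import sys
from pathlib import Path
from typing import Optional


def get_project_paths(root: Optional[Path] = None) -> dict:
    """Return the project's standard folders, creating them if needed."""
    # Assumption: Colab preloads the google.colab package into sys.modules.
    in_colab = "google.colab" in sys.modules

    if root is None:
        root = Path.cwd()
        # Mirror the notebook cell: step up one level when run from notebooks/.
        if root.name.lower() == "notebooks":
            root = root.parent

    paths = {
        "env": "Colab" if in_colab else "Local",
        "ROOT": root,
        "DATA_RAW": root / "data" / "raw",
        "DATA_INTERIM": root / "data" / "interim",
        "DATA_PROCESSED": root / "data" / "processed",
        "IMAGES": root / "images",
    }

    # Create every folder entry; the "env" string is skipped.
    for value in paths.values():
        if isinstance(value, Path):
            value.mkdir(parents=True, exist_ok=True)
    return paths
```

Keeping the folder logic in one helper means every notebook in the project resolves the same paths, whether it runs locally or on Colab.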


In [14]:
import pandas as pd
# File expected: data/raw/loan_data_raw.csv
file_path = DATA_RAW / "loan_data_raw.csv"

if not file_path.exists():
    raise FileNotFoundError(
        f"Dataset not found at {file_path}\n"
        "Please upload 'loan_data_raw.csv' to data/raw/ inside your project folder.\n"
        "If you downloaded from Kaggle, rename it exactly as loan_data_raw.csv."
    )

# Load the CSV
df = pd.read_csv(file_path)

# Basic overview
print(f"Loaded dataset: {file_path.name}")
print(f"Rows: {df.shape[0]:,} | Columns: {df.shape[1]}")
display(df.head())

Loaded dataset: loan_data_raw.csv
Rows: 614 | Columns: 13


Unnamed: 0,Loan_ID,Gender,Married,Dependents,Education,Self_Employed,ApplicantIncome,CoapplicantIncome,LoanAmount,Loan_Amount_Term,Credit_History,Property_Area,Loan_Status
0,LP001002,Male,No,0,Graduate,No,5849,0.0,,360.0,1.0,Urban,Y
1,LP001003,Male,Yes,1,Graduate,No,4583,1508.0,128.0,360.0,1.0,Rural,N
2,LP001005,Male,Yes,0,Graduate,Yes,3000,0.0,66.0,360.0,1.0,Urban,Y
3,LP001006,Male,Yes,0,Not Graduate,No,2583,2358.0,120.0,360.0,1.0,Urban,Y
4,LP001008,Male,No,0,Graduate,No,6000,0.0,141.0,360.0,1.0,Urban,Y
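
Since this notebook covers initial validation, a lightweight schema check right after loading can catch a wrong or renamed file early. The sketch below assumes the 13 column names shown in the preview above. The `check_columns` helper and the `EXPECTED_COLUMNS` constant are illustrative and not part of the project.

```python
import pandas as pd

# Column names as they appear in the preview of the loaded dataset.
EXPECTED_COLUMNS = [
    "Loan_ID", "Gender", "Married", "Dependents", "Education",
    "Self_Employed", "ApplicantIncome", "CoapplicantIncome", "LoanAmount",
    "Loan_Amount_Term", "Credit_History", "Property_Area", "Loan_Status",
]


def check_columns(frame: pd.DataFrame, expected: list) -> dict:
    """Report columns that are missing from, or unexpected in, the frame."""
    actual = [str(col).strip() for col in frame.columns]
    return {
        "missing": [col for col in expected if col not in actual],
        "unexpected": [col for col in actual if col not in expected],
    }


# Tiny stand-in frame; in the notebook this would be the loaded `df`.
demo = pd.DataFrame({"Loan_ID": ["LP001002"], "Gender": ["Male"], "Extra": [1]})
report = check_columns(demo, EXPECTED_COLUMNS)
print(report["unexpected"])    # ['Extra']
print(len(report["missing"]))  # 11
```

In the notebook, calling `check_columns(df, EXPECTED_COLUMNS)` and raising an error when either list is non-empty would stop the run before any cleaning is applied to the wrong file.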


In [15]:
# --- Inspect structure and missing data ---

df.info()

# Count and percentage of missing values
missing = df.isna().sum().to_frame("missing_count")
missing["missing_pct"] = (missing["missing_count"] / len(df) * 100).round(2)
display(missing.sort_values("missing_pct", ascending=False).head(10))

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB


Unnamed: 0,missing_count,missing_pct
Credit_History,50,8.14
Self_Employed,32,5.21
LoanAmount,22,3.58
Dependents,15,2.44
Loan_Amount_Term,14,2.28
Gender,13,2.12
Married,3,0.49
Education,0,0.0
Loan_ID,0,0.0
CoapplicantIncome,0,0.0


In [9]:
# --- Basic cleaning & standardisation ---

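# 1) Trim whitespace in column names & string cells
#    Use .str.strip() directly: casting with astype(str) first would turn NaN
#    into the literal string "nan" and hide it from the imputation in step 3.
df.columns = df.columns.str.strip()
for c in df.columns:
    if df[c].dtype == object:
        df[c] = df[c].str.strip()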

# 2) Coerce expected numeric columns
num_cols = ["ApplicantIncome","CoapplicantIncome","LoanAmount","Loan_Amount_Term","Credit_History"]
for c in num_cols:
    if c in df.columns:
        df[c] = pd.to_numeric(df[c], errors="coerce")

# 3) Impute small categorical gaps with mode (most common)
cat_cols = ["Gender","Married","Dependents","Education","Self_Employed","Property_Area"]
for c in cat_cols:
    if c in df.columns and df[c].isna().any():
        df[c]=df[c].fillna(df[c].mode().iloc[0])

# 4) Impute numeric gaps with median (robust)
for c in num_cols:
    if c in df.columns and df[c].isna().any():
        df[c]=df[c].fillna(df[c].median())

# 5) Sanity constraints
if "LoanAmount" in df.columns:
    df["LoanAmount"] = df["LoanAmount"].clip(lower=0)
if "Loan_Amount_Term" in df.columns:
    df["Loan_Amount_Term"] = df["Loan_Amount_Term"].clip(lower=0)

print("Remaining missing values:", int(df.isna().sum().sum()))

Remaining missing values: 0
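
One detail of step 1 is worth checking in isolation: how whitespace trimming treats missing values. The sketch below uses a small made-up Series. On pandas 2.x and earlier, casting with `astype(str)` turns `NaN` into the literal string `"nan"`, which `isna()` no longer detects. The `.str` accessor leaves `NaN` untouched, so mode imputation can still find and fill it. Behaviour of `astype(str)` may differ on newer pandas versions, so the comments only state what is expected on 2.x.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with padding and one missing value.
s = pd.Series([" Male", "Female ", np.nan, "Male"], dtype=object)

via_astype = s.astype(str).str.strip()  # casts every cell, including NaN
via_str = s.str.strip()                 # trims strings, keeps NaN as NaN

# On pandas 2.x the NaN has become the string "nan", so nothing looks missing.
print("missing after astype(str):", via_astype.isna().sum())

# The .str accessor preserves the gap, so it is still counted.
print("missing after .str.strip():", via_str.isna().sum())  # 1

# Mode imputation now fills the real gap with the most common label.
filled = via_str.fillna(via_str.mode().iloc[0])
print(filled.tolist())  # ['Male', 'Female', 'Male', 'Male']
```

A "Remaining missing values: 0" report is therefore only meaningful if no column contains stringified `"nan"` entries. A quick `(df[c] == "nan").sum()` per object column is one way to confirm that.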




In [13]:
# --- Intake summary for audit trail ---

summary = (
    df.isna().sum()
    .to_frame("missing_count")
    .assign(
        missing_pct=lambda x: (x["missing_count"] / len(df) * 100).round(2),
        dtype=df.dtypes.astype(str).values
    )
    .sort_values("missing_pct", ascending=False)
)

display(summary.head(10))

intake_summary_path = DATA_INTERIM / "intake_summary.csv"
summary.to_csv(intake_summary_path)
print("Saved intake summary →", intake_summary_path)

Unnamed: 0,missing_count,missing_pct,dtype
Credit_History,50,8.14,float64
Self_Employed,32,5.21,object
LoanAmount,22,3.58,float64
Dependents,15,2.44,object
Loan_Amount_Term,14,2.28,float64
Gender,13,2.12,object
Married,3,0.49,object
Education,0,0.0,object
Loan_ID,0,0.0,object
CoapplicantIncome,0,0.0,float64


Saved intake summary → /content/Loan_Repayment_Behaviour_Analytics/data/interim/intake_summary.csv


In [16]:
# --- Save the cleansed file for EDA ---

# Keep all columns (dataset is small), but you can subset if desired:
# keep_cols = ["Loan_ID","Gender","Married","Dependents","Education","Self_Employed",
#              "ApplicantIncome","CoapplicantIncome","LoanAmount","Loan_Amount_Term",
#              "Credit_History","Property_Area","Loan_Status"]
# df_clean = df[keep_cols].copy()
df_clean = df.copy()

out_path = DATA_INTERIM / "loan_data_clean_start.csv"
df_clean.to_csv(out_path, index=False)
print(f"Clean subset saved → {out_path}")
print("Rows, Cols:", df_clean.shape)


Clean subset saved → /content/Loan_Repayment_Behaviour_Analytics/data/interim/loan_data_clean_start.csv
Rows, Cols: (614, 13)
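
Because notebook cells can be run out of order, it can be worth reading the saved file back and confirming it matches expectations before moving on. The sketch below uses a hypothetical `verify_saved_csv` helper and a temporary stand-in file. In the notebook it would be called with `out_path` and `df_clean.shape`.

```python
import tempfile
from pathlib import Path

import pandas as pd


def verify_saved_csv(path: Path, expected_shape: tuple) -> int:
    """Reload a saved CSV, check its shape, and return its missing-cell count."""
    reloaded = pd.read_csv(path)
    if reloaded.shape != expected_shape:
        raise ValueError(f"Shape mismatch: {reloaded.shape} != {expected_shape}")
    return int(reloaded.isna().sum().sum())


# Stand-in demo with a temporary file instead of the real interim CSV.
demo = pd.DataFrame({"Loan_ID": ["LP1", "LP2"], "LoanAmount": [128.0, 66.0]})
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.csv"
    demo.to_csv(demo_path, index=False)
    n_missing = verify_saved_csv(demo_path, demo.shape)

print("Missing cells after reload:", n_missing)  # 0
```

A non-zero count on the real file would suggest that the cleaning cell was not applied to the frame that was saved.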


In [17]:
# ---  Quick data dictionary seed to help README/Docs ---

preview_vals = {}
for c in df_clean.columns:
    ex = df_clean[c].dropna().astype(str).unique()[:3]
    preview_vals[c] = ", ".join(map(str, ex))

dict_df = pd.DataFrame({
    "column": df_clean.columns,
    "dtype": [str(t) for t in df_clean.dtypes],
    "example_values": [preview_vals[c] for c in df_clean.columns],
})

dict_path = DATA_INTERIM / "data_dictionary_seed.csv"
dict_df.to_csv(dict_path, index=False)
display(dict_df.head(10))
print("Saved data dictionary seed →", dict_path)


Unnamed: 0,column,dtype,example_values
0,Loan_ID,object,"LP001002, LP001003, LP001005"
1,Gender,object,"Male, Female"
2,Married,object,"No, Yes"
3,Dependents,object,"0, 1, 2"
4,Education,object,"Graduate, Not Graduate"
5,Self_Employed,object,"No, Yes"
6,ApplicantIncome,int64,"5849, 4583, 3000"
7,CoapplicantIncome,float64,"0.0, 1508.0, 2358.0"
8,LoanAmount,float64,"128.0, 66.0, 120.0"
9,Loan_Amount_Term,float64,"360.0, 120.0, 240.0"


Saved data dictionary seed → /content/Loan_Repayment_Behaviour_Analytics/data/interim/data_dictionary_seed.csv


**Intake complete.**

- Saved: `data/interim/loan_data_clean_start.csv`
- Audit: `data/interim/intake_summary.csv`
- Helper: `data/interim/data_dictionary_seed.csv`

Next up → `02_eda.ipynb`: distributions, correlations, and 3–5 hypotheses about what drives late repayment or default (Checkpoint 1).
