# Data Collection Notebook for Approval Predict

## Objectives

* Fetch data from Kaggle and save as raw data.
* Inspect the data and save it under outputs/datasets/collection.


## Inputs
* Kaggle authentication token (JSON file).

## Outputs
* Generated dataset: outputs/datasets/collection/loan_approval.csv

## Additional Comments

* Loan Approval Dataset is a synthetic dataset with 8 columns relevant to loan approval.

* The data was published on Kaggle by user Anish Dev Edward. The dataset contains information used to predict whether a loan application will be approved or rejected, based on applicant and financial details.

* License: MIT

## Change working directory

Set the working directory to the parent of the current directory and confirm the change.

In [1]:
import os
current_dir = os.getcwd()
current_dir

os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

current_dir = os.getcwd()
current_dir


You set a new current directory


'/workspaces/Approval_Predict'
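
Because `os.chdir(os.path.dirname(current_dir))` moves up one level every time the cell runs, re-running it walks out of the project. A minimal sketch of an alternative that gives the same result on every run is shown below. The helper name `find_project_root`, the `README.md` marker file, and the `jupyter_notebooks` folder are assumptions for illustration, not part of this project.

```python
import tempfile
from pathlib import Path


def find_project_root(start: Path, marker: str = "README.md") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No directory containing {marker!r} at or above {start}")


# Demonstration in a throwaway directory tree (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "project"
    notebooks = project / "jupyter_notebooks"
    notebooks.mkdir(parents=True)
    (project / "README.md").write_text("demo")

    # Starting from the notebook folder or from the root gives the same answer.
    root_from_notebooks = find_project_root(notebooks)
    root_from_project = find_project_root(project)
    same_root = root_from_notebooks == root_from_project
```

In a notebook this could be followed by `os.chdir(find_project_root(Path.cwd()))`, which is safe to execute repeatedly.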

## Fetch data from Kaggle

In [None]:
%pip install kaggle==1.5.12

In [2]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
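
Before calling the Kaggle CLI it can help to confirm that the token file is actually in the configured folder. The sketch below assumes the usual `kaggle.json` layout with `username` and `key` fields; the helper `check_kaggle_token` is a hypothetical name and the credentials are placeholders.

```python
import json
import tempfile
from pathlib import Path


def check_kaggle_token(config_dir: Path) -> bool:
    """Return True if kaggle.json exists in config_dir and holds both expected fields."""
    token_path = config_dir / "kaggle.json"
    if not token_path.is_file():
        return False
    try:
        token = json.loads(token_path.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and {"username", "key"} <= token.keys()


# Demonstration with placeholder credentials in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    config_dir = Path(tmp)
    missing_ok = check_kaggle_token(config_dir)  # no file yet
    (config_dir / "kaggle.json").write_text(
        json.dumps({"username": "demo-user", "key": "demo-key"})
    )
    present_ok = check_kaggle_token(config_dir)  # file with both fields
```

In this notebook the check would be run against `os.getcwd()`, since that is the folder assigned to `KAGGLE_CONFIG_DIR`.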

In [3]:
# The -d flag expects the "owner/dataset-slug" identifier, without the URL suffix.
KaggleDatasetPath = "anishdevedward/loan-approval-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading loan-approval-dataset.zip to inputs/datasets/raw
  0%|                                               | 0.00/44.9k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 44.9k/44.9k [00:00<00:00, 6.16MB/s]


In [4]:
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json

Archive:  inputs/datasets/raw/loan-approval-dataset.zip
  inflating: inputs/datasets/raw/loan_approval.csv  
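
The shell commands above depend on `unzip` and `rm` being available. A minimal sketch of the same unzip-then-delete step using only the Python standard library follows; the helper name `extract_and_remove` and the archive contents are made up for the demonstration.

```python
import tempfile
import zipfile
from pathlib import Path
from typing import List


def extract_and_remove(zip_path: Path, destination: Path) -> List[str]:
    """Extract every member of zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(destination)
        names = archive.namelist()
    zip_path.unlink()
    return names


# Demonstration with a tiny placeholder archive in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    raw_dir = Path(tmp) / "inputs" / "datasets" / "raw"
    raw_dir.mkdir(parents=True)
    zip_path = raw_dir / "loan-approval-dataset.zip"
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("loan_approval.csv", "a,b\n1,2\n")

    extracted = extract_and_remove(zip_path, raw_dir)
    csv_exists = (raw_dir / "loan_approval.csv").is_file()
    zip_exists = zip_path.exists()
```

Unlike the shell version, this sketch does not delete `kaggle.json`; that step would still need `Path("kaggle.json").unlink()` or similar.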


In [9]:
import pandas as pd
from pathlib import Path

root = current_dir
file_path = Path(root) / "outputs" / "datasets" / "collection" / "loan_approval.csv"

if not file_path.exists():
    raise FileNotFoundError(f"Dataset not found at: {file_path}")

df = pd.read_csv(file_path).drop(['name'], axis=1)
df.head(3)

Unnamed: 0,city,income,credit_score,loan_amount,years_employed,points,loan_approved
0,East Jill,113810,389,39698,27,50.0,0
1,New Jamesside,44592,729,15446,28,55.0,0
2,Lake Roberto,33278,584,11189,13,45.0,0


## Load and Inspect Kaggle data

In [10]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/loan_approval.csv")
df.head()

Unnamed: 0,name,city,income,credit_score,loan_amount,years_employed,points,loan_approved
0,Allison Hill,East Jill,113810,389,39698,27,50.0,False
1,Brandon Hall,New Jamesside,44592,729,15446,28,55.0,False
2,Rhonda Smith,Lake Roberto,33278,584,11189,13,45.0,False
3,Gabrielle Davis,West Melanieview,127196,344,48823,29,50.0,False
4,Valerie Gray,Mariastad,66048,496,47174,4,25.0,False


Identify the size of the dataset.

In [11]:
df.shape

(2000, 8)

## DataFrame Summary

Summarize the column names, non-null counts, and data types.

In [12]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2000 entries, 0 to 1999
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   name            2000 non-null   object 
 1   city            2000 non-null   object 
 2   income          2000 non-null   int64  
 3   credit_score    2000 non-null   int64  
 4   loan_amount     2000 non-null   int64  
 5   years_employed  2000 non-null   int64  
 6   points          2000 non-null   float64
 7   loan_approved   2000 non-null   bool   
dtypes: bool(1), float64(1), int64(4), object(2)
memory usage: 111.5+ KB


Checking for duplicates in the 'name' column. No duplicates were found.

In [20]:
df[df.duplicated(subset=['name'])]

Unnamed: 0,name,city,income,credit_score,loan_amount,years_employed,points,loan_approved
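
A repeated name does not necessarily mean a repeated record, so it can be useful to compare the name-only check with a full-row check. The sketch below uses a small made-up frame (the values are illustrative, not taken from the dataset) to show how `duplicated` behaves in both cases.

```python
import pandas as pd

# Toy data: the name "Ann Lee" appears twice, but the rows differ elsewhere.
toy = pd.DataFrame(
    {
        "name": ["Ann Lee", "Bo Chan", "Ann Lee"],
        "city": ["Northport", "Eastville", "Westfield"],
        "income": [50000, 62000, 71000],
    }
)

# Rows whose name was already seen earlier (the first occurrence is not flagged).
name_duplicates = toy[toy.duplicated(subset=["name"])]

# Rows that repeat an earlier row in every column.
row_duplicates = toy[toy.duplicated()]

n_name = len(name_duplicates)
n_row = len(row_duplicates)
```

Here the name check flags one row while the full-row check flags none, which is the situation where dropping by name alone would discard a genuine record.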


loan_approved is a boolean variable: True or False. Therefore, we will convert it to an integer, as the ML model requires numeric variables.

In [21]:
df['loan_approved'].unique()

array([False,  True])

Convert loan_approved to an integer and check the resulting data type.

In [22]:
# The column holds booleans (not the strings "True"/"False"), so a direct
# cast is enough: False becomes 0 and True becomes 1.
df['loan_approved'] = df['loan_approved'].astype(int)
df['loan_approved'].dtype

dtype('int64')
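
As a small illustration of the boolean-to-integer conversion, the sketch below casts a made-up boolean series (not the real column) and shows that the resulting 0/1 values can be summed or averaged directly.

```python
import pandas as pd

# Toy boolean column standing in for loan_approved.
flags = pd.Series([False, True, True, False], name="loan_approved")

# Casting booleans to int maps False -> 0 and True -> 1.
as_int = flags.astype(int)

values = as_int.tolist()
approved_count = int(as_int.sum())   # number of True values
approval_rate = float(as_int.mean())  # share of True values
```

With the integer encoding, the mean of the column is the approval rate, which is a quick way to check class balance.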

Check for missing data.

In [17]:
df.isna().sum()

name              0
city              0
income            0
credit_score      0
loan_amount       0
years_employed    0
points            0
loan_approved     0
dtype: int64

Summary of all numeric columns in the dataset. This provides statistics for each variable and helps identify any possible outliers.

In [18]:
df.describe()

Unnamed: 0,income,credit_score,loan_amount,years_employed,points
count,2000.0,2000.0,2000.0,2000.0,2000.0
mean,90585.977,573.946,25308.503,20.441,56.68
std,34487.874907,160.564945,14207.320147,11.777813,18.638033
min,30053.0,300.0,1022.0,0.0,10.0
25%,61296.25,433.0,12748.75,10.0,45.0
50%,90387.5,576.0,25661.5,21.0,55.0
75%,120099.75,715.0,37380.5,31.0,70.0
max,149964.0,850.0,49999.0,40.0,100.0


Table summary:
* The count is 2000 for every column, confirming there is no missing data.
* The standard deviations of income and loan_amount are large relative to their means, suggesting a lot of variability.
* The min and max rows show the lowest and highest values for each variable.
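
One common way to turn the summary table into an outlier check is the 1.5 × IQR rule. The sketch below applies it to the quartiles, minimums, and maximums copied from the `df.describe()` output above; the choice of this rule and of the factor 1.5 is an assumption for illustration, not something the notebook prescribes.

```python
# Values copied from the df.describe() output (min, 25%, 75%, max).
summary = {
    "income": {"min": 30053.0, "q1": 61296.25, "q3": 120099.75, "max": 149964.0},
    "credit_score": {"min": 300.0, "q1": 433.0, "q3": 715.0, "max": 850.0},
    "loan_amount": {"min": 1022.0, "q1": 12748.75, "q3": 37380.5, "max": 49999.0},
    "years_employed": {"min": 0.0, "q1": 10.0, "q3": 31.0, "max": 40.0},
    "points": {"min": 10.0, "q1": 45.0, "q3": 70.0, "max": 100.0},
}


def iqr_fences(q1: float, q3: float, k: float = 1.5):
    """Return the (lower, upper) fences q1 - k*IQR and q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# A column is flagged if its min or max lies outside the fences.
outside = {}
for column, stats in summary.items():
    lower, upper = iqr_fences(stats["q1"], stats["q3"])
    outside[column] = stats["min"] < lower or stats["max"] > upper
    print(f"{column}: fences=({lower}, {upper}), flagged={outside[column]}")
```

With the reported quartiles, every minimum and maximum falls inside its fences, so this particular rule flags no outliers in any of the five columns.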

Save the dataset to the output folder.

In [19]:
import os
try:
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    # On re-runs the folder already exists; report the message and continue.
    print(e)

df.to_csv("outputs/datasets/collection/loan_approval.csv", index=False)

[Errno 17] File exists: 'outputs/datasets/collection'
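
The `File exists` message above is expected on re-runs. A minimal sketch of a variant that creates the folder silently when it is already present, using `pathlib`, is shown below; it writes a small made-up frame to a temporary folder rather than to the project itself.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy frame standing in for the real dataset.
toy = pd.DataFrame({"income": [113810, 44592], "loan_approved": [0, 0]})

with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "outputs" / "datasets" / "collection"

    # exist_ok=True makes the call safe to repeat; parents=True creates
    # the intermediate folders as needed.
    out_dir.mkdir(parents=True, exist_ok=True)
    out_dir.mkdir(parents=True, exist_ok=True)  # second call raises nothing

    out_file = out_dir / "loan_approval.csv"
    toy.to_csv(out_file, index=False)

    # Read the file back to confirm it was written as expected.
    round_trip = pd.read_csv(out_file)
    shapes_match = round_trip.shape == toy.shape
```

Reading the file back is optional, but it is a cheap way to confirm that the saved CSV has the expected rows and columns.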


# Conclusions and Next Steps

This is a small-to-moderate dataset with 2000 rows and 8 columns. There is no missing data, and no names are duplicated. The numerical summary shows high variability in income and loan amount, with relatively large standard deviations, suggesting a wide range of applicant financial profiles. This variability could influence model performance and will be explored further during data visualization and feature analysis.

Next steps:
* Undertake Exploratory Data Analysis (EDA).
* Investigate patterns in the data, particularly correlations between features and the target variable.
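
As a starting point for the correlation step, one possible approach is to rank the numeric features by their Pearson correlation with the target. The sketch below uses made-up values, not the real dataset, so the numbers it produces say nothing about the actual data.

```python
import pandas as pd

# Toy data (illustrative only): higher credit scores are approved here.
toy = pd.DataFrame(
    {
        "credit_score": [350, 420, 510, 600, 690, 780],
        "loan_amount": [40000, 12000, 30000, 8000, 25000, 15000],
        "loan_approved": [0, 0, 0, 1, 1, 1],
    }
)

# Pearson correlation of each feature with the target, strongest positive first.
correlations = (
    toy.corr()["loan_approved"]
    .drop("loan_approved")
    .sort_values(ascending=False)
)
print(correlations)
```

Pearson correlation only captures linear relationships, so the EDA notebook may want to complement it with plots or a rank-based measure.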