## Data Collection Notebook for Approval Predict

## Objectives

Fetch data from Kaggle and save as raw data.
Inspect the data and save it under outputs/datasets/collection.


## Inputs
Kaggle authentication token (JSON file).

## Outputs
Generate dataset: outputs/datasets/collection/loan_approval.csv

## Additional Comments

Loan Approval Dataset is a synthetic dataset with 8 columns relevant to loan approval.

The data was published on Kaggle by user Anish Dev Edward. The dataset contains information used to predict whether a loan application will be approved or rejected, based on applicant and financial details.

License: MIT

## Change working directory

We want to make the parent of the current directory the new current directory and confirm this.

In [3]:
import os
current_dir = os.getcwd()
current_dir

os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

current_dir = os.getcwd()
current_dir


You set a new current directory


'/workspaces/Approval_Predict'
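
Re-running the cell above moves up one more directory each time, so the working directory can drift above the project root. A minimal sketch of a guard, assuming the project root can be recognised by a marker file. The helper name `move_to_project_root` and the marker `README.md` are hypothetical and not part of this notebook:

```python
import os
from pathlib import Path


def move_to_project_root(marker="README.md"):
    """Walk up from the current directory until a folder containing `marker` is found.

    `marker` is an assumed file name; pick one that exists only at the project root.
    """
    start = Path.cwd()
    for candidate in [start, *start.parents]:
        if (candidate / marker).exists():
            os.chdir(candidate)
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains {marker}")
```

Calling it a second time leaves the working directory unchanged, because the search starts at the current directory.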

## Fetch data from Kaggle

In [None]:
%pip install kaggle==1.5.12

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
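
Before calling the Kaggle CLI it can help to confirm that the token file is in place and well formed. The sketch below assumes the standard token layout with `username` and `key` fields; the helper name `kaggle_token_is_valid` is made up for illustration:

```python
import json
from pathlib import Path


def kaggle_token_is_valid(path="kaggle.json"):
    """Return True if the file exists and holds non-empty 'username' and 'key' fields."""
    token_file = Path(path)
    if not token_file.is_file():
        return False
    try:
        token = json.loads(token_file.read_text())
    except json.JSONDecodeError:
        return False
    return isinstance(token, dict) and all(token.get(field) for field in ("username", "key"))
```

This only checks the shape of the file; it does not verify the credentials with Kaggle.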

In [None]:
# The CLI expects "owner/dataset-slug"; the "/data?select=..." suffix from the web URL is not part of it
KaggleDatasetPath = "anishdevedward/loan-approval-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
! unzip -o {DestinationFolder}/*.zip -d {DestinationFolder} \
    && rm {DestinationFolder}/*.zip \
    && rm kaggle.json
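
The shell commands above rely on `unzip` and `rm` being available. A rough pure-Python equivalent of the extract-and-clean-up step, using only the standard library, might look like this. The function name `extract_and_remove_zips` is hypothetical, and unlike the cell above it does not delete `kaggle.json`:

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it has been extracted successfully
        archive.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips("inputs/datasets/raw")`.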

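In [4]:
from pathlib import Path
import pandas as pd

# setup_project_path() is not defined in this notebook. The working directory
# was already set to the project root above, so use it directly.
root = Path.cwd()
file_path = root / "outputs" / "datasets" / "collection" / "loan_approval.csv"

# This file is written by the last cell of this notebook, so that cell must have run first
if not file_path.exists():
    raise FileNotFoundError(f"Dataset not found at: {file_path}")

df = pd.read_csv(file_path).drop(['name'], axis=1)
df.head(3)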

## Load and Inspect Kaggle data

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/loan_approval.csv")
df.head()

DataFrame Summary

In [None]:
df.info()

Checking for duplicates in the 'name' variable. No duplicates were found.

In [None]:
df[df.duplicated(subset=['name'])]
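
To show what this check returns when duplicates do exist, here is a small sketch on made-up data. The `toy` frame and its `income` column are invented for illustration and are not taken from the Kaggle dataset:

```python
import pandas as pd

# Invented rows: the name "Ann" appears twice
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann"],
    "income": [100, 200, 300],
})

# duplicated() marks every repeat after the first occurrence by default
repeats = toy[toy.duplicated(subset=["name"])]

# keep=False marks all rows that share a name, including the first one
all_copies = toy[toy.duplicated(subset=["name"], keep=False)]
```

An empty result, as in the notebook, means every value in the column is unique.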

loan_approved is a boolean variable: True or False. The ML model requires numeric variables, so we will convert it to an integer. First, check its unique values.

In [None]:
df['loan_approved'].unique()

Convert loan_approved to an integer and check the resulting data type.

In [None]:
# read_csv normally parses True/False as bool, so replace() only matters if the values arrived as strings
df['loan_approved'] = df['loan_approved'].replace({"True": 1, "False": 0}).astype(int)
df['loan_approved'].dtype
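
The conversion can also be written so that it fails loudly on unexpected values instead of passing them through. This is only a sketch; the helper name `to_binary` is hypothetical and it assumes the column holds True/False either as booleans or as strings:

```python
import pandas as pd


def to_binary(series):
    """Convert True/False values, stored as bool or as strings, to 1/0 integers."""
    mapping = {True: 1, False: 0, "True": 1, "False": 0}
    converted = series.map(mapping)
    # Values missing from the mapping become NaN, which signals unexpected input
    if converted.isna().any():
        raise ValueError("Column contains values other than True/False")
    return converted.astype(int)
```

It would be applied as `df['loan_approved'] = to_binary(df['loan_approved'])`.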

Save the dataset to the output folder.

In [None]:
import os
# exist_ok=True avoids an error when the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/loan_approval.csv", index=False)
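
As an optional sanity check, the saved file can be read back and compared with the DataFrame in memory. The sketch below is one possible way to do this; `save_and_verify` is a hypothetical helper and it compares only the shape and column names, not every value:

```python
from pathlib import Path

import pandas as pd


def save_and_verify(df, path):
    """Write `df` to CSV without the index, read it back, and compare shape and column names."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    return reloaded.shape == df.shape and list(reloaded.columns) == list(df.columns)
```

Writing with `index=False` matters here: without it, the reloaded file would gain an extra unnamed index column and the check would fail.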