# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/LoanStatusPrediction.csv

## Additional Comments, information about dataset

* This Loan Status Prediction dataset contains data on applicants who previously applied for a property loan.
* The bank decides whether to grant a loan based on factors such as Applicant Income, Loan Amount, previous Credit History, Co-applicant Income, etc.
* Our goal is to build a Machine Learning model that predicts whether an applicant's loan will be approved or rejected.
* It is not known whether the dataset is real or was made up for the purpose of creating ML tasks.

---

# Requirements

* This project uses rather old versions of Python and NumPy, so make sure Python 3.8 is installed.
* The requirements cannot be installed on ARM systems (e.g. MacBook M1 and newer).

* In the next code snippet, correct the path to your working directory.

#### For CodeAnywhere installing requirements would be something like:

In [None]:
%pip install -r /workspaces/LoanerAI/requirements.txt

#### For local IDE installing requirements would be something like:

In [None]:
%pip install -r ../requirements.txt
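
The version and architecture constraints listed under Requirements can be checked from inside the notebook. The following is a minimal sketch, not part of the original project: `check_environment` is a hypothetical helper, and the set of ARM machine names (`arm64`, `aarch64`) is an assumption about what `platform.machine()` reports on such systems.

```python
import platform
import sys


def check_environment(version_info=sys.version_info, machine=None):
    """Return a list of warnings about the interpreter and CPU architecture."""
    machine = machine or platform.machine()
    warnings = []
    # The notebook asks for Python 3.8 specifically.
    if tuple(version_info[:2]) != (3, 8):
        warnings.append(
            f"Python {version_info[0]}.{version_info[1]} found, but 3.8 is expected."
        )
    # Assumption: ARM systems report one of these machine names.
    if machine.lower() in {"arm64", "aarch64"}:
        warnings.append(f"ARM architecture detected ({machine}).")
    return warnings


for warning in check_environment():
    print(warning)
```

An empty result means neither constraint was violated; it does not guarantee that every package in requirements.txt installs cleanly.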

# Change working directory

* Change the working directory from its current folder to its parent folder
* Access the current directory with os.getcwd() first

In [None]:
import os
current_dir = os.getcwd()
current_dir

Make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Fetch data from Kaggle

Install Kaggle package to fetch data

In [None]:
%pip install kaggle==1.5.12

Kaggle authentication is needed to download the data
* You will need **kaggle.json** available
* Once you have copied your **kaggle.json** to the root of this project, run the following code so the token is recognized in the session
* If you are running a local IDE on a Windows machine and developing in a WSL container, make sure the [automount] section of your /etc/wsl.conf contains *options* = *"metadata"* (restart needed), so that permissions set with chmod are preserved
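
Before the token is used, it can help to confirm that the file is present and well formed. The sketch below is an illustration only: `check_kaggle_token` is a hypothetical helper, and it assumes the token is a JSON object with the `username` and `key` fields that Kaggle places in a downloaded kaggle.json.

```python
import json
from pathlib import Path


def check_kaggle_token(path="kaggle.json"):
    """Return a list of problems found with the token file (empty if none)."""
    token = Path(path)
    if not token.is_file():
        return [f"{token} not found"]
    try:
        content = json.loads(token.read_text())
    except json.JSONDecodeError:
        return [f"{token} is not valid JSON"]
    if not isinstance(content, dict):
        return [f"{token} does not contain a JSON object"]
    # Assumption: a valid token carries non-empty "username" and "key" entries.
    return [
        f"{token} is missing the '{field}' field"
        for field in ("username", "key")
        if not content.get(field)
    ]


for problem in check_kaggle_token():
    print(problem)
```

The check only inspects the file's structure; it does not contact Kaggle, so an expired or revoked key would still pass.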

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Get the dataset path from the Kaggle url
* When you are viewing the dataset at Kaggle, the dataset path is the part of the URL after https://www.kaggle.com/datasets/ .
* e.g. https://www.kaggle.com/datasets/bhavikjikadara/loan-status-prediction
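
The path can also be derived from the URL programmatically. This is a small sketch using only the standard library; `kaggle_dataset_path` is a hypothetical helper name, and it assumes the URL follows the `/datasets/<owner>/<dataset>` layout shown in the example above.

```python
from urllib.parse import urlparse


def kaggle_dataset_path(url):
    """Return the 'owner/dataset-name' part of a Kaggle dataset URL."""
    parts = [part for part in urlparse(url).path.split("/") if part]
    # Assumption: dataset URLs look like /datasets/<owner>/<dataset>[/...]
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return "/".join(parts[1:3])


print(kaggle_dataset_path(
    "https://www.kaggle.com/datasets/bhavikjikadara/loan-status-prediction"
))
# prints: bhavikjikadara/loan-status-prediction
```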

Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "bhavikjikadara/loan-status-prediction"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
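
The shell command above relies on `unzip` and `rm` being available. Where they are not (for example in a plain Windows shell), the same extraction step could be sketched with Python's `zipfile` module. `extract_and_remove_zips` is a hypothetical helper; unlike the cell above, it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after it was extracted without errors.
        archive.unlink()
    return extracted


# Example call, assuming the destination folder used earlier in this notebook:
# extract_and_remove_zips("inputs/datasets/raw")
```

The function returns the names of the extracted files, which is a convenient way to see the actual CSV file name before loading it.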

---

# Load and Inspect Kaggle data

In [None]:
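import pandas as pd
# NOTE: this file name looks like a leftover from a notebook template;
# replace it with the name of the CSV unzipped into inputs/datasets/raw.
df = pd.read_csv("inputs/datasets/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()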

DataFrame Summary

In [None]:
df.info()
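
Beyond `df.info()`, a per-column overview of missing values and distinct values is often useful at the inspection stage. The sketch below is one possible way to do this; `summarize` is a hypothetical helper, and the toy DataFrame with its column names is made up for illustration, not taken from the actual dataset.

```python
import pandas as pd


def summarize(df):
    """Return one row per column with dtype, missing counts and unique counts."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "missing_pct": (df.isna().mean() * 100).round(2),
        "unique": df.nunique(),  # NaN is not counted as a distinct value
    })


# Hypothetical toy data; the real column names may differ.
toy = pd.DataFrame({
    "ApplicantIncome": [5000, 3000, None],
    "Credit_History": [1, 1, 0],
})
print(summarize(toy))
```

In the notebook, calling `summarize(df)` on the loaded data would give the same overview for the real columns.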

---

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should be run top-down. Do not create a workflow in which, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content.

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
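import os
try:
    # create the output folder named in the Objectives section
    os.makedirs(name='outputs/datasets/collection')
except Exception as e:
    # e.g. the folder already exists
    print(e)

# save the inspected data to the path listed under Outputs
df.to_csv("outputs/datasets/collection/LoanStatusPrediction.csv", index=False)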
