# **Bank Customer Exit Predictor (CI PP-5)** 

# **Data Collection**

## Objectives

* Fetch and save raw data from Kaggle.
* Verify the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file - the authentication token.

## Outputs

* Dataset Generated: outputs/datasets/collection/BankCustomerData.csv

## Additional Comments

* Data is obtained from [Kaggle](https://www.kaggle.com/datasets/shubhammeshram579/bank-customer-churn-prediction/data) and is a public dataset.
* In a workplace scenario, data would not be pushed to a public repository because of the privacy and security concerns associated with it. For the purpose of this project, however, we host the data in a publicly accessible repository.


---

# Change working directory

* Notebooks are stored in a subfolder, so when running the notebook in the editor we need to change the working directory from its current folder to the parent folder.

1. We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

2. We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

3. Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
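
Re-running the cell in step 2 moves the working directory up one more level each time. The sketch below guards against that by changing directory only while the current folder is still the notebook subfolder. The function name `move_to_parent_once` and the folder name `jupyter_notebooks` are assumptions for illustration; replace the folder name with the one this repository actually uses.

```python
import os
from pathlib import Path


def move_to_parent_once(notebook_folder="jupyter_notebooks"):
    """Change to the parent directory only if we are still inside the notebook subfolder.

    `notebook_folder` is a hypothetical folder name, not taken from this project.
    Returns the working directory after the (possible) change.
    """
    cwd = Path.cwd()
    if cwd.name == notebook_folder:
        # Still inside the notebook subfolder: step up exactly one level
        os.chdir(cwd.parent)
    return Path.cwd()
```

Calling `move_to_parent_once()` a second time leaves the working directory unchanged, because the current folder no longer matches the notebook folder name.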

---

# Fetch data from Kaggle

1. Download the kaggle.json file (authentication token) from Kaggle and add it to the root directory
 * kaggle.json

2. Install Kaggle package in order to fetch the dataset

In [None]:
%pip install kaggle==1.5.12

3. Recognize the Kaggle token in the session

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

4. Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "shubhammeshram579/bank-customer-churn-prediction"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

5. Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
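
The shell command above relies on `unzip` and `rm` being available. On a system without them, the same extraction step could be sketched with the standard library as shown below. The helper name `extract_and_remove_zips` is an assumption for illustration, and unlike the shell command it does not delete kaggle.json.

```python
import zipfile
from pathlib import Path


def extract_and_remove_zips(folder):
    """Extract every .zip archive in `folder` into that folder, then delete the archive.

    Returns the names of the extracted members.
    """
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        # Remove the archive only after extraction has succeeded
        archive.unlink()
    return extracted
```

In this notebook it would be called as `extract_and_remove_zips(DestinationFolder)`, followed by `Path("kaggle.json").unlink()` to remove the token.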

---

# Load and Inspect Kaggle Data

1. Import the pandas library and read the dataset .csv file

In [None]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Churn_Modelling.csv")
df.head()

2. DataFrame Summary

In [None]:
df.info()
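
`df.info()` shows non-null counts per column, so missing values have to be read off by comparing against the row count. A more direct count is sketched below on a small made-up frame; the values are invented for illustration and are not taken from the Kaggle dataset.

```python
import pandas as pd

# Toy data with deliberately missing entries (hypothetical values)
toy = pd.DataFrame({
    "CustomerId": [101, 102, 103],
    "Geography": ["France", None, "Spain"],
    "Age": [42.0, 41.0, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()

# Keep only the columns that actually have gaps
columns_with_gaps = missing_per_column[missing_per_column > 0]
print(columns_with_gaps)
```

Applied to this notebook, the same two lines would run on `df` instead of `toy`.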

3. Check for duplicate CustomerId values: 2 cases found

In [None]:
df[df.duplicated(subset=['CustomerId'], keep=False)]

4. Remove the duplicate rows (2 cases) and verify that none remain.

In [None]:
df.drop_duplicates(subset=['CustomerId'], inplace=True)
df[df.duplicated(subset=['CustomerId'])]
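
The two cells above use different `keep` settings, which changes what is shown. The toy example below illustrates the difference: `keep=False` marks every row that shares a CustomerId, while `drop_duplicates` with its default `keep='first'` retains the first occurrence. The data is made up for illustration.

```python
import pandas as pd

# Toy data: CustomerId 102 appears twice (hypothetical values)
toy = pd.DataFrame({
    "CustomerId": [101, 102, 102, 103],
    "Surname": ["Ada", "Ben", "Ben", "Cy"],
})

# keep=False flags all rows involved in a duplicate, not just the repeats
dupes = toy[toy.duplicated(subset=["CustomerId"], keep=False)]

# Default keep='first' keeps the first row for each CustomerId
deduped = toy.drop_duplicates(subset=["CustomerId"])

# Verification: no duplicate CustomerId should remain
remaining = int(deduped.duplicated(subset=["CustomerId"]).sum())
print(len(dupes), len(deduped), remaining)
```

Returning a new frame instead of using `inplace=True` keeps the original data available for comparison; this is a design choice for the sketch, not a change to the notebook's approach.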

---

# Save Dataset 

* Saving the data file to outputs/datasets/collection/

In [None]:
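import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

# Save the cleaned data without the DataFrame index
df.to_csv("outputs/datasets/collection/BankCustomerData.csv", index=False)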

* Push the changes to the GitHub repo using Git commands (git add, git commit, git push)

---

# Conclusion and Next Steps

## Conclusion:

* Minor issues such as duplicate rows have been removed, and all variables have the required data type. However, the dataset still contains missing data, which we will handle later.

## Next Steps:

* Exited Customer Data Analysis
