# **Data Collection**

## Objectives

* Fetch and save raw data from Kaggle.
* Verify the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file (kaggle.json): the authentication token.

## Outputs

* Dataset Generated: outputs/datasets/collection/BankCustomerData.csv

## Additional Comments

* Data is obtained from [Kaggle](https://www.kaggle.com/datasets/shubhammeshram579/bank-customer-churn-prediction/data) and is a public dataset.
* In a workplace scenario, data would normally not be pushed to a public repository because of privacy and security concerns. For the purpose of this project, however, the data is hosted in a publicly accessible repository.


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from the notebook's folder to its parent folder.


1. We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspace/BankCustomerExitPredictor/jupyter_notebooks'

2. We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

You have set a new current directory


3. Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspace/BankCustomerExitPredictor'
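
Re-running the `os.chdir(os.path.dirname(current_dir))` cell moves the working directory up one more level each time it is executed. The following is a minimal sketch of a guard against that, assuming the notebooks folder is named `jupyter_notebooks` as in the output above. The helper name `project_root` is hypothetical and not part of the original notebook.

```python
import os
from pathlib import Path


def project_root(cwd, notebooks_dirname="jupyter_notebooks"):
    """Return the parent of cwd if cwd is the notebooks folder, else cwd itself."""
    path = Path(cwd)
    return path.parent if path.name == notebooks_dirname else path


# Safe to run more than once: the directory only changes while we are
# still inside the notebooks folder (assumed name: jupyter_notebooks).
os.chdir(project_root(os.getcwd()))
```

Because the function leaves any other directory unchanged, executing the cell a second time has no further effect.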

---

# Fetch data from Kaggle

1. Download the kaggle.json file (authentication token) from Kaggle and add it to the root directory
 * kaggle.json

2. Install Kaggle package in order to fetch the dataset

In [4]:
pip install kaggle==1.5.12

Note: you may need to restart the kernel to use updated packages.


3. Make the Kaggle token available to the session: point KAGGLE_CONFIG_DIR at the current directory and restrict the token file's permissions

In [5]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
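
The `chmod 600` call limits the token file to read and write access for its owner only. As a rough sketch, the same step can be done in pure Python and the resulting mode checked. The helper name `restrict_token_permissions` is hypothetical, and the example assumes a POSIX file system such as the workspace used here.

```python
import os
import stat


def restrict_token_permissions(token_path):
    """Set owner read/write only (mode 600) and return the resulting mode."""
    os.chmod(token_path, 0o600)
    return stat.S_IMODE(os.stat(token_path).st_mode)


# Assumes kaggle.json sits in the current (root) directory, as in step 1.
token_path = os.path.join(os.getcwd(), "kaggle.json")
if os.path.exists(token_path):
    mode = restrict_token_permissions(token_path)
    print(f"kaggle.json permissions: {oct(mode)}")
else:
    print("kaggle.json not found in the current directory")
```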

4. Define the Kaggle dataset and the destination folder, then download the dataset.

In [6]:
KaggleDatasetPath = "shubhammeshram579/bank-customer-churn-prediction"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading bank-customer-churn-prediction.zip to inputs/datasets/raw
  0%|                                                | 0.00/262k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 262k/262k [00:00<00:00, 53.6MB/s]


5. Unzip the downloaded file, then delete the zip file and the kaggle.json file

In [7]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/bank-customer-churn-prediction.zip
  inflating: inputs/datasets/raw/Churn_Modelling.csv  
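
The shell commands above depend on `unzip` and `rm` being available. A possible alternative, sketched here with Python's standard `zipfile` module, extracts the archive and deletes it afterwards. The helper name `extract_and_remove` is hypothetical, and the sketch does not delete kaggle.json.

```python
import os
import zipfile


def extract_and_remove(zip_path, destination):
    """Extract every member of zip_path into destination, then delete the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(destination)
    os.remove(zip_path)
    return names


# Example call, using the paths from the cells above:
# extract_and_remove(
#     "inputs/datasets/raw/bank-customer-churn-prediction.zip",
#     "inputs/datasets/raw",
# )
```

The function returns the list of extracted member names, which makes it easy to confirm that Churn_Modelling.csv was unpacked.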


---

# Load and Inspect Kaggle Data

1. Import the pandas library and read the dataset .csv file

In [8]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Churn_Modelling.csv")
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42.0,2,0.0,1,1.0,1.0,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41.0,1,83807.86,1,0.0,1.0,112542.58,0
2,3,15619304,Onio,502,France,Female,42.0,8,159660.8,3,1.0,0.0,113931.57,1
3,4,15701354,Boni,699,France,Female,39.0,1,0.0,2,0.0,0.0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43.0,2,125510.82,1,,1.0,79084.1,0


2. DataFrame Summary

In [9]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10002 entries, 0 to 10001
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10002 non-null  int64  
 1   CustomerId       10002 non-null  int64  
 2   Surname          10002 non-null  object 
 3   CreditScore      10002 non-null  int64  
 4   Geography        10001 non-null  object 
 5   Gender           10002 non-null  object 
 6   Age              10001 non-null  float64
 7   Tenure           10002 non-null  int64  
 8   Balance          10002 non-null  float64
 9   NumOfProducts    10002 non-null  int64  
 10  HasCrCard        10001 non-null  float64
 11  IsActiveMember   10001 non-null  float64
 12  EstimatedSalary  10002 non-null  float64
 13  Exited           10002 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 1.1+ MB
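
The summary shows 10001 non-null entries out of 10002 for Geography, Age, HasCrCard and IsActiveMember, so each of these columns has one missing value. The sketch below shows one way to list only the columns with missing values. It runs on a small sample built from the first five rows of the `df.head()` output rather than on the full dataset, and the variable names `sample` and `missing` are illustrative.

```python
import pandas as pd

# First five rows of four columns, copied from the df.head() output above
sample = pd.DataFrame({
    "CustomerId": [15634602, 15647311, 15619304, 15701354, 15737888],
    "Geography": ["France", "Spain", "France", "France", "Spain"],
    "Age": [42.0, 41.0, 42.0, 39.0, 43.0],
    "HasCrCard": [1.0, 0.0, 1.0, 0.0, float("nan")],
})

# Count missing values per column and keep only the columns that have any
missing = sample.isna().sum()
missing = missing[missing > 0]
print(missing)
```

On the full DataFrame, the same two lines would be applied to `df` instead of `sample`.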


3. Check for duplicate CustomerId values: 2 customers appear twice (4 rows in total)

In [10]:
df[df.duplicated(subset=['CustomerId'], keep=False)]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9998,9999,15682355,Sabbatini,772,Germany,Male,42.0,3,75075.31,2,1.0,0.0,92888.52,1
9999,9999,15682355,Sabbatini,772,Germany,Male,42.0,3,75075.31,2,1.0,0.0,92888.52,1
10000,10000,15628319,Walker,792,France,Female,28.0,4,130142.79,1,1.0,0.0,38190.78,0
10001,10000,15628319,Walker,792,France,Female,28.0,4,130142.79,1,1.0,0.0,38190.78,0


4. Remove the duplicate rows (2 cases) and verify that none remain.

In [11]:
df.drop_duplicates(subset=['CustomerId'], inplace=True)
df[df.duplicated(subset=['CustomerId'])]

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
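
The two cells above use different `keep` settings, which is why the inspection shows four rows while only two are removed. A small sketch of the difference follows, using CustomerId and Surname values taken from the outputs above. The variable names are illustrative.

```python
import pandas as pd

# CustomerId and Surname values taken from the outputs above
sample = pd.DataFrame({
    "CustomerId": [15634602, 15682355, 15682355, 15628319, 15628319],
    "Surname": ["Hargrave", "Sabbatini", "Sabbatini", "Walker", "Walker"],
})

# keep=False flags every row of a duplicated group (useful for inspection)
all_copies = sample.duplicated(subset=["CustomerId"], keep=False)

# The default keep="first" flags only the later copies, which are the rows
# that drop_duplicates removes
later_copies = sample.duplicated(subset=["CustomerId"])

# Non-inplace variant that also renumbers the index after dropping rows
deduped = sample.drop_duplicates(subset=["CustomerId"]).reset_index(drop=True)

print(all_copies.sum(), later_copies.sum(), len(deduped))
```

Checking `deduped["CustomerId"].is_unique` is a compact way to confirm that no duplicate identifiers remain.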


---

# Save Dataset 

* Save the data file to outputs/datasets/collection/

In [12]:
import os

# Create the output folder; exist_ok=True avoids an error if it already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)

df.to_csv("outputs/datasets/collection/BankCustomerData.csv", index=False)
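
Since one objective of this notebook is to verify the data before saving it, it can help to read the file back and compare shapes. This is a minimal sketch. The helper name `save_and_verify` is hypothetical, and the demonstration writes two rows taken from the outputs above to a temporary folder instead of the project folder.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(df, folder, filename):
    """Write df to folder/filename as CSV and confirm it reads back with the same shape."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != df.shape:
        raise ValueError(f"Shape mismatch: wrote {df.shape}, read {reloaded.shape}")
    return path


# Demonstration on a throwaway folder with two rows taken from the outputs above
demo = pd.DataFrame({"CustomerId": [15634602, 15647311], "Exited": [1, 0]})
with tempfile.TemporaryDirectory() as tmp_dir:
    saved_path = save_and_verify(
        demo, os.path.join(tmp_dir, "collection"), "BankCustomerData.csv"
    )
    saved_name = os.path.basename(saved_path)
```

In the notebook itself, the call would be `save_and_verify(df, "outputs/datasets/collection", "BankCustomerData.csv")`.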

* Push the changes to the GitHub repository using Git commands (git add, git commit, git push)

---

# Conclusion and Next Steps

## Conclusion:

* Minor issues such as duplicate rows have been removed, and all variables have the required data types. However, four columns (Geography, Age, HasCrCard and IsActiveMember) each contain one missing value, which we will handle later.

## Next Steps:

* Exited Customer Data Analysis
* Data Cleaning