# **Data Collection Notebook**

## Objectives

 * Fetch the data from Kaggle
 * Conduct an initial inspection of the dataset
 * Save the dataset as a CSV file under outputs/datasets/collection for use in the other notebooks

## Inputs

 * Kaggle JSON file - the authentication token

## Outputs

 * Generate Dataset: outputs/datasets/collection/BankCustomerChurn.csv 

## Additional Comments

 * The dataset is hosted publicly on Kaggle and is not subject to a Non-Disclosure Agreement (NDA)
   * The dataset is available [here](https://www.kaggle.com/datasets/gauravtopre/bank-customer-churn-dataset/data)


---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running a notebook in the editor you need to change the working directory.

We need to change the working directory from the current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Bank-Customer-Churn-Prediction/jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Bank-Customer-Churn-Prediction'
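
Re-running the `os.chdir` cell moves one more level up each time. A minimal sketch of a guard against that, assuming the notebooks live in a folder named `jupyter_notebooks` (as the printed path suggests). The helper name `resolve_project_root` is hypothetical and not part of the original notebook.

```python
import os


def resolve_project_root(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent of current_dir only when current_dir is the notebook folder.

    Hypothetical helper: the folder name is an assumption taken from the
    path printed earlier in this notebook.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already at (or outside) the project root: leave the path unchanged
    return normalized


# Example built from the folder names shown in this notebook
example = os.path.join(
    os.sep, "workspaces", "Bank-Customer-Churn-Prediction", "jupyter_notebooks"
)
print(resolve_project_root(example))
```

With this guard, `os.chdir(resolve_project_root(os.getcwd()))` can be run more than once without leaving the project folder.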

# Fetch data from Kaggle

Install the Kaggle package to fetch the data

In [4]:
! pip install kaggle==1.5.12



---

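Run the cell below so that the token is recognized in the session. It points KAGGLE_CONFIG_DIR at the current directory, where kaggle.json is expected, and restricts the file's permissions.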

In [6]:
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json
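
Before calling the Kaggle CLI, it can help to confirm that the token file is present and well formed. The sketch below assumes the standard `kaggle.json` layout with `username` and `key` fields. The function name `token_looks_valid` is hypothetical, and the demonstration uses placeholder credentials in a temporary folder rather than a real token.

```python
import json
import os
import tempfile


def token_looks_valid(token_path):
    """Return True if token_path is a JSON file with non-empty 'username' and 'key'."""
    if not os.path.isfile(token_path):
        return False
    try:
        with open(token_path, encoding="utf-8") as fh:
            token = json.load(fh)
    except (OSError, json.JSONDecodeError):
        return False
    return isinstance(token, dict) and all(
        token.get(field) for field in ("username", "key")
    )


# Demonstration with a throwaway file containing placeholder credentials
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "kaggle.json")
    with open(demo_path, "w", encoding="utf-8") as fh:
        json.dump({"username": "example-user", "key": "not-a-real-key"}, fh)
    print(token_looks_valid(demo_path))                          # True
    print(token_looks_valid(os.path.join(tmp, "missing.json")))  # False
```

In the notebook itself the call would be `token_looks_valid("kaggle.json")`, run from the project root. The check only inspects the file's structure and does not confirm that Kaggle accepts the credentials.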


Define the Kaggle dataset and the destination folder, then download the dataset.

In [7]:
# Dataset identifier in owner/dataset-name form (without the trailing "/data" from the page URL)
KaggleDatasetPath = "gauravtopre/bank-customer-churn-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading bank-customer-churn-dataset.zip to inputs/datasets/raw
  0%|                                                | 0.00/187k [00:00<?, ?B/s]
100%|████████████████████████████████████████| 187k/187k [00:00<00:00, 8.73MB/s]


The downloaded data is in a zip file. We unzip it, then delete the zip file and the Kaggle token file.

In [8]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/bank-customer-churn-dataset.zip
  inflating: inputs/datasets/raw/Bank Customer Churn Prediction.csv  
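
The shell command relies on `unzip` and `rm` being available. As an alternative, the same unzip-and-delete step can be sketched with Python's standard `zipfile` module. This is not the notebook's own method. The function name `extract_and_remove_zips` is hypothetical, and the demonstration runs on a small stand-in archive in a temporary folder.

```python
import os
import tempfile
import zipfile


def extract_and_remove_zips(folder):
    """Extract every .zip archive in folder into that folder, then delete the archive.

    Returns the sorted list of extracted member names.
    """
    extracted = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".zip"):
            continue
        archive_path = os.path.join(folder, name)
        with zipfile.ZipFile(archive_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Remove the archive only after it has been closed
        os.remove(archive_path)
    return sorted(extracted)


# Demonstration on a temporary folder with a small stand-in archive
with tempfile.TemporaryDirectory() as tmp:
    with zipfile.ZipFile(os.path.join(tmp, "demo.zip"), "w") as archive:
        archive.writestr("demo.csv", "a,b\n1,2\n")
    print(extract_and_remove_zips(tmp))  # ['demo.csv']
    print(sorted(os.listdir(tmp)))       # ['demo.csv']
```

In the notebook the call would be `extract_and_remove_zips(DestinationFolder)`. The sketch does not delete `kaggle.json`; that would still need a separate `os.remove("kaggle.json")`.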


# Load and Inspect Kaggle data

In [9]:
import pandas as pd
df = pd.read_csv("inputs/datasets/raw/Bank Customer Churn Prediction.csv")
df.head(10)

|   | customer_id | credit_score | country | gender | age | tenure | balance | products_number | credit_card | active_member | estimated_salary | churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15634602 | 619 | France | Female | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 15647311 | 608 | Spain | Female | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 15619304 | 502 | France | Female | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 15701354 | 699 | France | Female | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 15737888 | 850 | Spain | Female | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
| 5 | 15574012 | 645 | Spain | Male | 44 | 8 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 15592531 | 822 | France | Male | 50 | 7 | 0.0 | 2 | 1 | 1 | 10062.8 | 0 |
| 7 | 15656148 | 376 | Germany | Female | 29 | 4 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 15792365 | 501 | France | Male | 44 | 4 | 142051.07 | 2 | 0 | 1 | 74940.5 | 0 |
| 9 | 15592389 | 684 | France | Male | 27 | 2 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customer_id       10000 non-null  int64  
 1   credit_score      10000 non-null  int64  
 2   country           10000 non-null  object 
 3   gender            10000 non-null  object 
 4   age               10000 non-null  int64  
 5   tenure            10000 non-null  int64  
 6   balance           10000 non-null  float64
 7   products_number   10000 non-null  int64  
 8   credit_card       10000 non-null  int64  
 9   active_member     10000 non-null  int64  
 10  estimated_salary  10000 non-null  float64
 11  churn             10000 non-null  int64  
dtypes: float64(2), int64(8), object(2)
memory usage: 937.6+ KB


The output shows 10,000 rows and 12 columns with no null values. Ten columns are numeric (eight int64, two float64) and two are of type object (country and gender).

In [11]:
# Check for missing values
df.isnull().values.any()

False
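
Beyond the missing-value check, a few more quick checks can be gathered in one place. The sketch below is illustrative only: the function name `quick_checks` is hypothetical, and it runs on the first three rows of the `df.head(10)` output as a stand-in for the full dataset.

```python
import io

import pandas as pd

# First three rows of the df.head(10) output, used as a stand-in sample
SAMPLE_CSV = """customer_id,credit_score,country,gender,age,tenure,balance,products_number,credit_card,active_member,estimated_salary,churn
15634602,619,France,Female,42,2,0.0,1,1,1,101348.88,1
15647311,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
15619304,502,France,Female,42,8,159660.8,3,1,0,113931.57,1
"""


def quick_checks(frame, id_column="customer_id", target_column="churn"):
    """Collect a few basic data-quality figures for an initial inspection."""
    return {
        "rows": len(frame),
        "missing_values": int(frame.isnull().sum().sum()),
        "duplicate_ids": int(frame[id_column].duplicated().sum()),
        "target_values": sorted(frame[target_column].unique().tolist()),
    }


sample = pd.read_csv(io.StringIO(SAMPLE_CSV))
print(quick_checks(sample))
```

In the notebook the call would be `quick_checks(df)`. A non-zero `duplicate_ids` count, or target values other than 0 and 1, would be worth investigating before the data is saved.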

---

# Push files to Repo

* We save the dataset as a CSV file and push the file to the repository.

We create the outputs/datasets/collection folder and save the data in it.

In [12]:
import os

# Create outputs/datasets/collection; exist_ok avoids an error if the folder already exists
os.makedirs('outputs/datasets/collection', exist_ok=True)
df.to_csv("outputs/datasets/collection/BankCustomerChurn.csv", index=False)
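
After saving, it can be worth confirming that the file reloads as expected. A minimal sketch, assuming a shape comparison is a sufficient check; the function name `save_and_verify` is hypothetical, and the demonstration uses a tiny stand-in DataFrame in a temporary folder.

```python
import os
import tempfile

import pandas as pd


def save_and_verify(frame, folder, filename):
    """Write frame to folder/filename as CSV and confirm it reloads with the same shape."""
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, filename)
    frame.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    if reloaded.shape != frame.shape:
        raise ValueError(f"Shape changed on reload: {frame.shape} -> {reloaded.shape}")
    return path


# Demonstration in a temporary folder with a tiny stand-in DataFrame
with tempfile.TemporaryDirectory() as tmp:
    demo = pd.DataFrame({"customer_id": [1, 2], "churn": [0, 1]})
    target_folder = os.path.join(tmp, "outputs", "datasets", "collection")
    saved = save_and_verify(demo, target_folder, "demo.csv")
    print(os.path.basename(saved))  # demo.csv
```

In the notebook the call would be `save_and_verify(df, "outputs/datasets/collection", "BankCustomerChurn.csv")`.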
