# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection

## Inputs

* Kaggle JSON file and authentication token. 

## Outputs

* Generate Dataset: outputs/datasets/collection/WA_Fn-UseC_-Telco-Customer-Churn.csv

## Additional Comments

* The data comes from an open, public source and poses no ethical or privacy concerns.

---

# Install Python packages

In [None]:
%pip install -r D://codeacademy_darbai/churn/requirements.txt

---

# Change working directory

Since the Jupyter notebooks are stored in a subfolder, we need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
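
The parent-directory step above can also be expressed with pathlib. This is only a minimal sketch; the helper name `parent_directory` is hypothetical and not part of the notebook.

```python
import os
from pathlib import Path


def parent_directory(path):
    """Return the parent folder of `path`, equivalent to os.path.dirname()."""
    return str(Path(path).parent)


current_dir = os.getcwd()
new_dir = parent_directory(current_dir)
# Both approaches should agree on the parent folder
print(new_dir == os.path.dirname(current_dir))
```

Calling `os.chdir(new_dir)` afterwards would have the same effect as the cell that uses os.path.dirname().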

---

# Data acquisition from Kaggle

Install Kaggle package to fetch data.

In [None]:
%pip install kaggle==1.5.12

* To download the data you need a personal authentication token (a JSON file) that authenticates you with Kaggle. You can acquire one by signing up to Kaggle.

* Once you have your token, drag and drop the file into the directory and then run the following:

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
os.chmod("kaggle.json", 0o600)  # owner read/write only (POSIX); on Windows this just keeps the file writable
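
Before downloading, it can help to confirm that the token file is really in place. The sketch below assumes the usual token layout with a "username" and a "key" entry; the helper name `token_looks_valid` is hypothetical.

```python
import json
import os


def token_looks_valid(config_dir):
    """Check that kaggle.json exists in config_dir and has the expected keys."""
    token_path = os.path.join(config_dir, "kaggle.json")
    if not os.path.isfile(token_path):
        return False
    with open(token_path, encoding="utf-8") as handle:
        try:
            token = json.load(handle)
        except json.JSONDecodeError:
            return False
    # Assumption: a Kaggle API token holds a "username" and a "key" entry
    return isinstance(token, dict) and {"username", "key"} <= token.keys()


print(token_looks_valid(os.getcwd()))
```

The check only inspects the file structure; it does not confirm that Kaggle accepts the credentials.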

* Get the dataset path from the Kaggle URL

* Define the Kaggle dataset and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "blastchar/telco-customer-churn"
DestinationFolder = "inputs/datasets"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}
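
The dataset path used above is the owner/dataset part of the dataset URL. A small sketch of how it could be extracted, assuming the URL follows the `/datasets/<owner>/<dataset>` layout; the helper name `dataset_path_from_url` and the example URL are illustrative assumptions.

```python
from urllib.parse import urlparse


def dataset_path_from_url(url):
    """Return the 'owner/dataset' part of a Kaggle dataset URL."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Assumed layout: /datasets/<owner>/<dataset>
    if len(parts) < 3 or parts[0] != "datasets":
        raise ValueError(f"Not a Kaggle dataset URL: {url}")
    return f"{parts[1]}/{parts[2]}"


print(dataset_path_from_url(
    "https://www.kaggle.com/datasets/blastchar/telco-customer-churn"
))
```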

* Unzip the downloaded file
* Delete the zip file and delete the kaggle.json file

In [None]:
import glob
import os
import zipfile

zip_files = glob.glob(f"{DestinationFolder}/*.zip")  # Get all zip files in the folder

for zip_file in zip_files:  # Extract each ZIP file, then remove it
    with zipfile.ZipFile(zip_file) as archive:
        archive.extractall(DestinationFolder)
    os.remove(zip_file)

if os.path.exists("kaggle.json"):  # Remove Kaggle API key file
    os.remove("kaggle.json")

---

## Load and Inspect Kaggle Data

In [None]:
import pandas as pd
pdf_files = glob.glob(f"{DestinationFolder}/*.csv")  # CSV files extracted from the archive
df = pd.read_csv(pdf_files[0])  # Load the first CSV file found
print(df.shape)
df.head(5)


* DataFrame summary

In [None]:
df.info()

* This is an index containing the meaning of all variables:

    <img src="../static/images/abbreviations.png" alt="abbreviations for the dataset variables" height="500" />

* We want to check for duplicates in the customerID column

In [None]:
df[df.duplicated(subset=['customerID'])]

* We want to check for fully duplicated rows across the whole dataset

In [None]:
duplicate_count = df.duplicated().sum()

if duplicate_count == 0:
    print("There are no duplicates in the dataset")
else:
    print(f"There are {duplicate_count} duplicates in the dataset")
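
The two duplicate checks above answer different questions, which toy data can illustrate. The values below are made up and are not taken from the Telco dataset.

```python
import pandas as pd

# Toy data, not taken from the Telco dataset
toy = pd.DataFrame({
    "customerID": ["A1", "A2", "A2", "A3", "A3"],
    "tenure": [1, 5, 5, 9, 12],
})

id_duplicates = toy.duplicated(subset=["customerID"]).sum()  # repeated IDs
row_duplicates = toy.duplicated().sum()  # rows identical in every column

print(id_duplicates, row_duplicates)
```

A repeated ID with differing values (here "A3") is flagged by the subset check only, so the two counts can disagree.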

* We want to check for NaN values in the dataset

In [None]:
df.isna().sum()

In [None]:
for column in df.columns:
    print(df[column].value_counts())

---

# Push files to Repo

* We will save the DataFrame without its index

In [None]:
import os

file_name = os.path.basename(pdf_files[0])  # Strip file path to just file name
output_folder = "outputs/datasets/collection"
os.makedirs(output_folder, exist_ok=True)  # Create the folder if it does not exist yet

df.to_csv(f"{output_folder}/{file_name}", index=False)
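
To see what index=False changes, the save step can be tried as a round trip on toy data. This is a sketch using a temporary folder and made-up values, not the notebook's real output path.

```python
import os
import tempfile

import pandas as pd

# Toy data, not taken from the Telco dataset
toy = pd.DataFrame({"customerID": ["A1", "A2"], "tenure": [1, 5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, "toy.csv")
    toy.to_csv(csv_path, index=False)  # index=False: no extra index column is written
    reloaded = pd.read_csv(csv_path)

# The reloaded columns match the original ones
print(list(reloaded.columns))
```

Without index=False, the row index would be written as an additional unnamed column and show up again when the file is reloaded.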


---

# Observations and Next Steps

* At first glance the dataset looks well maintained; df.isna() reports no missing values
* We found blank strings as values in the TotalCharges variable, which we will investigate in the next notebook
* TotalCharges is a continuous variable, although it has a data type of object
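
One possible way to follow up on these observations is sketched below on toy values (not real Telco data): a blank string is not counted as NaN, but it keeps a numeric column stored as text until it is converted.

```python
import pandas as pd

# Toy column, not real Telco values: numbers stored as text, one entry blank
charges = pd.Series(["29.85", " ", "108.15"], name="TotalCharges")

print(charges.isna().sum())  # a blank string is not counted as NaN

# errors="coerce" turns anything that cannot be parsed as a number into NaN
numeric = pd.to_numeric(charges, errors="coerce")
print(numeric.isna().sum())
print(numeric.dtype)
```

How the resulting NaN values should be handled is left to the next notebook.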