# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under outputs/datasets/collection.

## Inputs

* Kaggle JSON file: the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/TelcoCustomerChurn.csv


---

# Install Python packages in the notebook

In [None]:
%pip install -r /Users/Endeavour/Code/customer-churn-predictor/customer-churn-predictor/requirements.txt

In [None]:
%pip install matplotlib --no-deps
%pip install streamlit --no-deps

# Change Working Directory

In [None]:
import os
current_dir = os.getcwd()
current_dir

Set the parent of the current directory as the new current directory.
* os.path.dirname() returns the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")
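
As a minimal sketch of what `os.path.dirname()` returns, the cell below applies it to a made-up path. The folder names are hypothetical and only illustrate the behaviour; nothing here changes the working directory.

```python
import os

# Hypothetical notebook location, used only for illustration.
notebook_dir = os.path.join("projects", "customer-churn-predictor", "jupyter_notebooks")

# os.path.dirname() drops the last path component, giving the parent folder.
project_root = os.path.dirname(notebook_dir)
print(project_root)

# Passing project_root to os.chdir() would then make it the working directory.
```

Calling `os.path.dirname()` twice in a row would climb two levels, so the cell above should only be run once per session.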

Confirm the new current directory.

In [None]:
current_dir = os.getcwd()
current_dir

# Download Dataset

In [None]:
%pip install kaggle==1.5.12

In [None]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

Set the Kaggle dataset path and the destination folder, then download the dataset.

In [None]:
KaggleDatasetPath = "gyanshashwat1611/telecom-churn-dataset"
DestinationFolder = "inputs/datasets/raw"
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Unzip the downloaded file, then delete the zip file and the kaggle.json file so the token is not left in the project folder.

In [None]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json
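
The shell commands above assume a Unix-like environment. A possible portable alternative, sketched here with the standard-library `zipfile` module, is shown on a throwaway folder; the helper name `unzip_and_remove` and the demo archive are hypothetical and not part of the original notebook.

```python
import glob
import os
import tempfile
import zipfile


def unzip_and_remove(folder):
    """Extract every .zip archive in `folder` into it, then delete the archive."""
    extracted = []
    for zip_path in glob.glob(os.path.join(folder, "*.zip")):
        with zipfile.ZipFile(zip_path) as archive:
            archive.extractall(folder)
            extracted.extend(archive.namelist())
        # Delete the archive only after it has been closed.
        os.remove(zip_path)
    return extracted


# Demonstration on a temporary folder containing a made-up archive.
demo_folder = tempfile.mkdtemp()
with zipfile.ZipFile(os.path.join(demo_folder, "demo.zip"), "w") as archive:
    archive.writestr("sample.csv", "customerID,Churn\n0001,No\n")

extracted_files = unzip_and_remove(demo_folder)
remaining = sorted(os.listdir(demo_folder))
print(extracted_files, remaining)
```

In the notebook, the same helper could be called with `DestinationFolder`, followed by `os.remove("kaggle.json")` to delete the token.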

---

# Load and Inspect Data

In [None]:
import pandas as pd
# Plain string: no f-string prefix is needed because nothing is interpolated
df = pd.read_csv("/Users/Endeavour/Code/customer-churn-predictor/inputs/datasets/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

**DataFrame Summary:**

In [None]:
df.info()

Check whether any `customerID` values are duplicated. An empty result means every ID is unique:

In [None]:
df[df.duplicated(subset=['customerID'])]
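
To illustrate how this check behaves, here is a small sketch on a made-up DataFrame (the rows are invented for the example). With the default `keep='first'`, `duplicated()` flags only the later occurrences of a repeated ID.

```python
import pandas as pd

# Toy data with one repeated customerID (values are hypothetical).
toy = pd.DataFrame({"customerID": ["A1", "B2", "A1"], "tenure": [1, 5, 1]})

# Rows whose customerID has already appeared earlier in the frame.
dupes = toy[toy.duplicated(subset=["customerID"])]
n_dupes = len(dupes)
print(dupes)
```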

Convert `TotalCharges` to numeric. With `errors='coerce'`, values that cannot be parsed become `NaN`.

In [None]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

Check the `TotalCharges` data type.

In [None]:
df['TotalCharges'].dtype
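
The effect of `errors='coerce'` can be seen on a toy Series. This is only a sketch with invented values; it assumes that unparseable entries, such as a blank string, are what prevented the column from being read as numeric in the first place.

```python
import pandas as pd

# Hypothetical values: two valid numbers and one blank entry.
toy = pd.Series(["29.85", " ", "1889.5"])

# Unparseable entries are replaced with NaN instead of raising an error.
converted = pd.to_numeric(toy, errors="coerce")
n_missing = int(converted.isna().sum())
print(converted.dtype, n_missing)
```

Counting the `NaN` values after the conversion, for example with `df['TotalCharges'].isna().sum()`, shows how many rows were affected.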

Currently, `Churn` is a categorical variable with the values Yes and No. We convert it to an integer because the ML model requires numeric variables.

In [None]:
df['Churn'].unique()

In [None]:
# map() returns an integer column directly; any label missing from the dict would become NaN
df['Churn'] = df['Churn'].map({"Yes": 1, "No": 0})

Check the `Churn` data type.

In [None]:
df['Churn'].dtype
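
As a small sketch of the conversion on invented data, the cell below first checks that only the expected labels are present and then encodes them. The guard is an addition for illustration, not a step from the original notebook.

```python
import pandas as pd

# Toy target column (values are hypothetical).
toy = pd.DataFrame({"Churn": ["Yes", "No", "No"]})

# Guard: stop early if an unexpected label appears.
unexpected = set(toy["Churn"].unique()) - {"Yes", "No"}
assert not unexpected, f"Unexpected Churn labels: {unexpected}"

# Encode Yes as 1 and No as 0.
toy["Churn"] = toy["Churn"].map({"Yes": 1, "No": 0})
print(toy["Churn"].dtype)
```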

# Push files to Repo

In [None]:
import os

output_dir = '/Users/Endeavour/Code/customer-churn-predictor/outputs/datasets/collection'
# Create outputs/datasets/collection; exist_ok=True avoids an error if the folder already exists
os.makedirs(output_dir, exist_ok=True)

df.to_csv(f"{output_dir}/TelcoCustomerChurn.csv", index=False)
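
One way to confirm that the file was written as expected is to read it back. The sketch below does this in a temporary folder with invented rows, so it can run without touching the project outputs; the data and the temporary location are assumptions for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame standing in for the cleaned dataset (values are hypothetical).
toy = pd.DataFrame({"customerID": ["A1", "B2"], "Churn": [1, 0]})

# Mirror the outputs/datasets/collection layout inside a temporary folder.
out_dir = os.path.join(tempfile.mkdtemp(), "outputs", "datasets", "collection")
os.makedirs(out_dir, exist_ok=True)
out_path = os.path.join(out_dir, "TelcoCustomerChurn.csv")

# Save without the index, then reload to check shape and columns.
toy.to_csv(out_path, index=False)
reloaded = pd.read_csv(out_path)
print(reloaded.shape, list(reloaded.columns))
```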