# **Data Collection Notebook**

## Objectives

* Fetch data from Kaggle and save it as raw data.
* Inspect the data and save it under `outputs/datasets/collection`.

## Inputs

*   Kaggle JSON file - the authentication token.

## Outputs

* Generate Dataset: outputs/datasets/collection/TelcoCustomerChurn.csv

## Additional Comments


* In the workplace, **projects are not done using Kaggle data**. Instead, the data comes from multiple sources that may be hosted internally (for example, in a data warehouse) or outside your company. In this learning project, we fetch the data from Kaggle.

* In the workplace, **data is never pushed to a public repository**, for security reasons. Only in this learning project do we host the data in a public repo.


---

# Install Python packages in the notebook

In [22]:
%pip install -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt

Defaulting to user installation because normal site-packages is not writeable
Collecting numpy==1.26.0 (from -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt (line 1))
  Using cached numpy-1.26.0-cp39-cp39-macosx_11_0_arm64.whl.metadata (53 kB)
Collecting pandas==1.4.2 (from -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt (line 2))
  Using cached pandas-1.4.2-cp39-cp39-macosx_11_0_arm64.whl.metadata (12 kB)
Collecting seaborn==0.12.2 (from -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt (line 3))
  Using cached seaborn-0.12.2-py3-none-any.whl.metadata (5.4 kB)
Collecting ydata-profiling==4.10.0 (from -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt (line 4))
  Using cached ydata_profiling-4.10.0-py2.py3-none-any.whl.metadata (20 kB)
Collecting plotly==5.15.0 (from -r /Users/Endeavour/Code/customer-churn-predictor/requirements.txt (line 5))
  Using cached plotly-5.15.0-py2.py3-none-any.whl.metadata (7.0 kB)
Collec

In [23]:
%pip install matplotlib --no-deps
%pip install streamlit --no-deps

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.
Defaulting to user installation because normal site-packages is not writeable
Collecting streamlit
  Using cached streamlit-1.38.0-py2.py3-none-any.whl.metadata (8.5 kB)
Using cached streamlit-1.38.0-py2.py3-none-any.whl (8.7 MB)
Installing collected packages: streamlit
Successfully installed streamlit-1.38.0
Note: you may need to restart the kernel to use updated packages.


# Change working directory

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [24]:
import os
current_dir = os.getcwd()
current_dir

'/Users/Endeavour/Code/customer-churn-predictor'

Set the parent of the current directory as the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` defines the new current directory

In [25]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [26]:
current_dir = os.getcwd()
current_dir

'/Users/Endeavour/Code'
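
Each run of the `os.chdir(os.path.dirname(...))` cell moves one level further up, so re-running it can leave the notebook outside the project folder. The sketch below shows one possible guard. It assumes the project root can be recognised by a marker file such as `requirements.txt`; the helper name `find_project_root` and the folder name `jupyter_notebooks` are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path


def find_project_root(start, marker="requirements.txt"):
    """Return the nearest directory at or above `start` that contains `marker`.

    `marker` is an assumption: any file that exists only in the project root works.
    Returns None when no such directory is found.
    """
    start = Path(start).resolve()
    for candidate in [start, *start.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None


# Demonstration in a throwaway directory tree (not the real project)
with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp).resolve()
    (project / "requirements.txt").touch()
    notebooks = project / "jupyter_notebooks"  # hypothetical subfolder
    notebooks.mkdir()
    found_from_subfolder = find_project_root(notebooks) == project
    found_from_root = find_project_root(project) == project
```

In the notebook, the result of `find_project_root(os.getcwd())` could be passed to `os.chdir()` after checking that it is not `None`. Running that cell repeatedly would then keep the same directory.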

# Fetch data from Kaggle

Install the Kaggle package to fetch the data.

In [12]:
%pip install kaggle==1.5.12

Defaulting to user installation because normal site-packages is not writeable
Collecting kaggle==1.5.12
  Downloading kaggle-1.5.12.tar.gz (58 kB)
  Preparing metadata (setup.py) ... done
Collecting certifi (from kaggle==1.5.12)
  Downloading certifi-2024.8.30-py3-none-any.whl.metadata (2.2 kB)
Collecting requests (from kaggle==1.5.12)
  Downloading requests-2.32.3-py3-none-any.whl.metadata (4.6 kB)
Collecting tqdm (from kaggle==1.5.12)
  Downloading tqdm-4.66.5-py3-none-any.whl.metadata (57 kB)
Collecting python-slugify (from kaggle==1.5.12)
  Downloading python_slugify-8.0.4-py2.py3-none-any.whl.metadata (8.5 kB)
Collecting urllib3 (from kaggle==1.5.12)
  Downloading urllib3-2.2.2-py3-none-any.whl.metadata (6.4 kB)
Collecting text-unidecode>=1.3 (from python-slugify->kaggle==1.5.12)
  Downloading text_unidecode-1.3-py2.py3-none-any.whl.metadata (2.4 kB)
Collecting charset-normalizer<4,>=2 (from requests->kaggle==1.5.12)
  Downloading charset_normalizer-3.3.2-cp39-cp39-macosx_11

In [27]:
import os
os.environ['KAGGLE_CONFIG_DIR'] = os.getcwd()
! chmod 600 kaggle.json

chmod: kaggle.json: No such file or directory
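
The `chmod` message above means that no `kaggle.json` was found in the working directory when the cell ran. As a minimal sketch, the same step can be written in Python so that a missing token fails early with a clear error. The helper name `prepare_kaggle_token` is hypothetical, and the `0o600` check assumes a POSIX system (macOS or Linux).

```python
import os
import stat
import tempfile
from pathlib import Path


def prepare_kaggle_token(config_dir):
    """Point the Kaggle client at `config_dir` and restrict the token to its owner.

    Raises FileNotFoundError if kaggle.json is missing, instead of failing later.
    """
    token = Path(config_dir) / "kaggle.json"
    if not token.is_file():
        raise FileNotFoundError(f"Expected the Kaggle token at {token}")
    os.environ["KAGGLE_CONFIG_DIR"] = str(token.parent)
    os.chmod(token, 0o600)  # same effect as `chmod 600 kaggle.json`
    return token


# Demonstration with a placeholder token in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "kaggle.json").write_text("{}")
    token = prepare_kaggle_token(tmp)
    token_mode = stat.S_IMODE(token.stat().st_mode)
    config_dir_set = os.environ["KAGGLE_CONFIG_DIR"] == str(Path(tmp))
```

In the notebook, the call would be `prepare_kaggle_token(os.getcwd())`, made from the folder that actually contains `kaggle.json`.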


Get the dataset path from the Kaggle URL.

Define the Kaggle dataset and the destination folder for the download.

In [14]:
KaggleDatasetPath = "gyanshashwat1611/telecom-churn-dataset"
DestinationFolder = "inputs/datasets/raw"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

Downloading telecom-churn-dataset.zip to inputs/datasets/raw
100%|█████████████████████████████████████████| 172k/172k [00:00<00:00, 837kB/s]
100%|█████████████████████████████████████████| 172k/172k [00:00<00:00, 824kB/s]


Unzip the downloaded file, delete the zip file and delete the kaggle.json file

In [15]:
! unzip {DestinationFolder}/*.zip -d {DestinationFolder} \
  && rm {DestinationFolder}/*.zip \
  && rm kaggle.json

Archive:  inputs/datasets/raw/telecom-churn-dataset.zip
  inflating: inputs/datasets/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv  
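
The shell command above depends on `unzip` and `rm` being available. A portable alternative can be sketched with the standard-library `zipfile` module. The helper name `extract_and_remove_archives` is hypothetical, and the demonstration uses a small archive built on the fly rather than the Kaggle download.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_and_remove_archives(folder):
    """Extract every .zip in `folder` into that folder, then delete the archive."""
    folder = Path(folder)
    extracted = []
    for archive in sorted(folder.glob("*.zip")):
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(folder)
            extracted.extend(zf.namelist())
        archive.unlink()  # remove the archive only after it has been closed
    return extracted


# Demonstration with a tiny archive in a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp)
    with zipfile.ZipFile(raw / "example.zip", "w") as zf:
        zf.writestr("example.csv", "a,b\n1,2\n")
    names = extract_and_remove_archives(raw)
    csv_exists = (raw / "example.csv").is_file()
    zip_removed = not (raw / "example.zip").exists()
```

In the notebook, this would be called as `extract_and_remove_archives(DestinationFolder)`. Unlike the shell command, this sketch does not delete `kaggle.json`, so that step would still need to be done separately.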


---

# Load and Inspect Kaggle data

In [29]:
import pandas as pd
df = pd.read_csv("/Users/Endeavour/Code/customer-churn-predictor/inputs/datasets/raw/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


DataFrame Summary

In [30]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


Check whether any `customerID` is duplicated. The empty result below shows that there are none.

In [31]:
df[df.duplicated(subset=['customerID'])]

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
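
Reading an empty table as "no duplicates" is easy to get wrong, so the same check can also be reduced to a single number or boolean. The sketch below uses toy data, not the Telco dataset, so the values are assumptions made for illustration.

```python
import pandas as pd

# Toy data, not the Telco dataset: the ID "A-1" appears twice
toy = pd.DataFrame({"customerID": ["A-1", "B-2", "A-1"], "tenure": [1, 34, 2]})

# Number of rows whose customerID already appeared in an earlier row
n_duplicates = int(toy.duplicated(subset=["customerID"]).sum())

# True when at least one customerID occurs more than once
has_duplicates = not toy["customerID"].is_unique
```

Applied to `df`, a count of 0 (or `has_duplicates` being `False`) would confirm the statement above.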


Convert `TotalCharges` to a numeric type. With `errors='coerce'`, values that cannot be parsed as numbers become `NaN`.

In [32]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')

Check `TotalCharges` data type

In [33]:
df['TotalCharges'].dtype

dtype('float64')
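
Because `errors='coerce'` replaces unparseable values with `NaN` without raising an error, it is worth counting how many values were lost in the conversion. The sketch below illustrates this on toy values, which are assumptions and not rows from the dataset.

```python
import pandas as pd

# Toy values, not taken from the dataset: the second entry is a blank string
raw = pd.Series(["29.85", " ", "108.15"])

converted = pd.to_numeric(raw, errors="coerce")

# Entries that could not be parsed are now NaN
n_missing = int(converted.isna().sum())
```

On the real data, `df['TotalCharges'].isna().sum()` after the conversion would report how many entries could not be parsed.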

`Churn` is currently a categorical variable with the values Yes and No. We convert it to an integer because the ML model requires numeric variables.

In [34]:
df['Churn'].unique()

array(['No', 'Yes'], dtype=object)

In [35]:
df['Churn'] = df['Churn'].replace({"Yes":1, "No":0})

Check the `Churn` data type.

In [36]:
df['Churn'].dtype

dtype('int64')
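
An alternative to `replace` is `Series.map`, sketched below on toy values. The difference is that `map` turns any label missing from the mapping into `NaN`, whereas `replace` leaves it unchanged, so checking for unexpected labels first makes the encoding explicit. The variable names here are illustrative.

```python
import pandas as pd

toy_churn = pd.Series(["No", "Yes", "Yes", "No"])  # toy values, not the dataset

mapping = {"Yes": 1, "No": 0}

# Labels present in the data but absent from the mapping (should be empty)
unexpected = set(toy_churn.unique()) - set(mapping)

encoded = toy_churn.map(mapping)
```

If `unexpected` is not empty, the mapping should be extended before encoding the column.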

# Push files to Repo

In [38]:
import os
try:
  os.makedirs(name='/Users/Endeavour/Code/customer-churn-predictor/outputs/datasets/collection') # create outputs/datasets/collection folder
except Exception as e:
  print(e)

df.to_csv("/Users/Endeavour/Code/customer-churn-predictor/outputs/datasets/collection/TelcoCustomerChurn.csv", index=False)

[Errno 17] File exists: '/Users/Endeavour/Code/customer-churn-predictor/outputs/datasets/collection'
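
The `[Errno 17]` message above is expected when the folder already exists; it is printed by the `except` branch and does not stop the CSV from being written. As a minimal sketch, `os.makedirs` with `exist_ok=True` gives the same result without the `try`/`except`. The demonstration runs in a throwaway directory rather than the project folder.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "outputs", "datasets", "collection")
    os.makedirs(target, exist_ok=True)  # creates the nested folders
    os.makedirs(target, exist_ok=True)  # second call is a no-op, no FileExistsError
    folder_created = os.path.isdir(target)
```

With this form, other errors such as a permission problem are still raised instead of being printed and ignored.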
