# Data Cleaning Notebook

## Objectives

*   Evaluate missing data
*   Clean data

## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv

## Outputs

* Generate the cleaned dataset, saved under outputs/datasets/cleaned as telco_customer_churn_cleaned.csv

## Conclusions

 
* Data Cleaning
  * Drop Variables: `['customerID']`
  * Replace the blank `TotalCharges` values with 0 and convert the variable to float

---

# Change working directory

Since the Jupyter notebooks are in a subfolder, we need to change the working directory from the current folder to its parent folder
* We access the current directory with `os.getcwd()`

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
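
One thing to watch with the cells above: running the `os.chdir()` cell a second time moves one more level up. A minimal sketch of a guard against that is shown here. The helper name `move_to_parent_once` and the folder name passed to it are illustrative assumptions, not part of this notebook.

```python
import os
from pathlib import Path


def move_to_parent_once(notebook_folder: str) -> Path:
    """Move to the parent directory only while still inside `notebook_folder`.

    `notebook_folder` is the (assumed) name of the subfolder holding the notebooks.
    Calling this again after the move leaves the working directory unchanged.
    """
    here = Path.cwd()
    if here.name == notebook_folder:
        os.chdir(here.parent)
    return Path.cwd()


# Hypothetical usage, assuming the notebooks live in a folder called "jupyter_notebooks":
# move_to_parent_once("jupyter_notebooks")
```

The check on the folder name makes the cell safe to re-run, at the cost of hard-coding that name.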

---

# Load Collected data

In [None]:
import pandas as pd
import glob
output_folder = "outputs/datasets/collection"
csv_files = glob.glob(f"{output_folder}/*.csv")
print(csv_files)
df = pd.read_csv(csv_files[0]).drop('customerID', axis=1)
df.head(5)

# Data Cleaning

* We are interested in checking the distribution and shape of a variable with missing data.

In [None]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

* We have no variables with missing data. However, from the previous notebook we know that `TotalCharges` has values that are a string containing a single space (i.e. " ")
* We will replace those values with zero and change the data type of the variable to float
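
To illustrate why `isna()` does not flag these values, here is a small sketch on a toy DataFrame (the toy data and the helper name `count_blank_strings` are assumptions for illustration, not the real dataset). It counts entries that are empty or whitespace-only in each column.

```python
import pandas as pd


def count_blank_strings(frame: pd.DataFrame) -> pd.Series:
    """Count empty or whitespace-only entries per column; columns with none are omitted."""
    blank_counts = frame.apply(
        lambda col: col.astype(str).str.strip().eq("").sum()
    )
    return blank_counts[blank_counts > 0]


# Toy data that mimics the issue: a blank string hidden in a text column
toy = pd.DataFrame({
    "tenure": [0, 12, 24],
    "TotalCharges": [" ", "350.5", "1200.0"],
})

print(toy.isna().sum())          # isna() reports no missing values
print(count_blank_strings(toy))  # the blank entry shows up here
```

Applied to `df`, a check like this would be expected to list `TotalCharges`, in line with the finding from the previous notebook.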

In [None]:
# replace the single-space strings with 0, then convert the column to float
df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)
df['TotalCharges'] = df['TotalCharges'].astype(float)
df['TotalCharges']
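
As an alternative sketch, not the approach used in this notebook, `pd.to_numeric` with `errors="coerce"` turns any unparseable string into `NaN` first, which lets us count how many values are affected before filling them with zero. The toy Series below is an assumption for illustration.

```python
import pandas as pd

# Toy column that mimics TotalCharges being read as text
raw = pd.Series([" ", "350.5", "1200.0"], name="TotalCharges")

# Unparseable entries (here the blank string) become NaN
as_number = pd.to_numeric(raw, errors="coerce")
n_blank = int(as_number.isna().sum())

# Fill the NaN values with zero, matching the choice made above
cleaned = as_number.fillna(0.0)

print(f"{n_blank} value(s) could not be parsed and were set to 0")
print(cleaned)
```

Note the difference in behaviour: `errors="coerce"` would also turn any other non-numeric string into `NaN`, whereas `replace(" ", 0)` only touches the single-space value and lets `astype(float)` raise on anything else.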

---

# Push cleaned data to Repo

In [None]:
import os

file_name = "telco_customer_churn_cleaned.csv"
output_folder = "outputs/datasets/cleaned"

# create the outputs/datasets/cleaned folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

df.to_csv(f"{output_folder}/{file_name}", index=False)
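
A quick way to confirm the save step is to read the file back and check its shape and data types. The sketch below does this on a toy DataFrame in a temporary folder, so it does not touch the repository. The toy data and file name are assumptions for illustration.

```python
import os
import tempfile

import pandas as pd

toy = pd.DataFrame({"tenure": [0, 12], "TotalCharges": [0.0, 350.5]})

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = os.path.join(tmp_dir, "cleaned")
    # exist_ok=True makes the call safe to repeat
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)

    path = os.path.join(folder, "toy_cleaned.csv")
    toy.to_csv(path, index=False)

    # read the file back while the temporary folder still exists
    reloaded = pd.read_csv(path)

print(reloaded.shape)
print(reloaded.dtypes)
```

The same read-back check could be run against the file written above to confirm that `TotalCharges` is stored as a float column.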
