# Data Cleaning Notebook

## Objectives

*   Evaluate missing data
*   Clean data

## Inputs

* outputs/datasets/collection/TelcoCustomerChurn.csv

## Outputs

* Generate cleaned Train and Test sets, both saved under outputs/datasets/cleaned

## Conclusions

 
* Data Cleaning Pipeline
  * Drop Variables: `['customerID']`
  * Replace blank `TotalCharges` strings with 0 and cast the variable to float

---

# Change working directory

Since the Jupyter notebooks are stored in a subfolder, we need to change the working directory from the current folder to its parent folder.
* We access the current directory with `os.getcwd()`

In [1]:
import os
current_dir = os.getcwd()
current_dir

'd:\\codeacademy_darbai\\churn\\jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory
* `os.chdir()` sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'd:\\codeacademy_darbai\\churn'
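
Re-running cell In [2] would move the working directory up one more level each time. A minimal sketch of a guard, assuming the notebooks folder is named `jupyter_notebooks` as in the path printed earlier; the helper name `project_root_from` is hypothetical and not part of the original notebook:

```python
import os


def project_root_from(current_dir, notebook_folder="jupyter_notebooks"):
    """Return the parent folder only when current_dir is the notebooks folder.

    Hypothetical helper: the default folder name is an assumption taken
    from the path printed by the notebook.
    """
    normalized = os.path.normpath(current_dir)
    if os.path.basename(normalized) == notebook_folder:
        return os.path.dirname(normalized)
    # Already at the project root (or somewhere else): leave it unchanged.
    return normalized


# Safe to run more than once: the second call changes nothing.
root = project_root_from(os.path.join("churn", "jupyter_notebooks"))
root_again = project_root_from(root)
```

In the notebook this could be used as `os.chdir(project_root_from(os.getcwd()))`.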

---

# Load Collected data

In [4]:
import pandas as pd
import glob
output_folder = "outputs/datasets/collection"
csv_files = glob.glob(f"{output_folder}/*.csv")
print(csv_files)
df = pd.read_csv(csv_files[0]).drop('customerID', axis=1)
df.head(5)

['outputs/datasets/collection\\WA_Fn-UseC_-Telco-Customer-Churn.csv']


| | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | Yes | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | No | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | Yes | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | No | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.3 | 1840.75 | No |
| 4 | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | No | No | No | No | No | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes |
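
`csv_files[0]` relies on the collection folder holding exactly one CSV file. A minimal sketch of a stricter lookup, run here against a temporary folder rather than the real `outputs/` tree; the helper name `single_csv` is hypothetical:

```python
import glob
import os
import tempfile


def single_csv(folder):
    """Return the only CSV path in folder, or raise if there are zero or several."""
    matches = sorted(glob.glob(os.path.join(folder, "*.csv")))
    if len(matches) != 1:
        raise ValueError(f"Expected exactly one CSV in {folder}, found {len(matches)}")
    return matches[0]


# Demonstration on a throwaway folder containing one empty CSV file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "data.csv"), "w").close()
    found = os.path.basename(single_csv(tmp))
```

Failing loudly when the folder holds several files avoids silently loading the wrong dataset.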


# Data Cleaning

* We first list the variables with missing data, so that we can check their distribution and shape.

In [5]:
vars_with_missing_data = df.columns[df.isna().sum() > 0].to_list()
vars_with_missing_data

[]

* No variable has missing data. However, from the previous notebook we know that `TotalCharges` contains values that are a string with a single empty space (i.e. `" "`).
* We will replace those values with zero and change the data type of the variable to float.

In [6]:
df['TotalCharges'] = df['TotalCharges'].replace(" ", 0)
df['TotalCharges'] = df['TotalCharges'].astype(float)
df['TotalCharges']

0         29.85
1       1889.50
2        108.15
3       1840.75
4        151.65
         ...   
7038    1990.50
7039    7362.90
7040     346.45
7041     306.60
7042    6844.50
Name: TotalCharges, Length: 7043, dtype: float64
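
The empty result in cell In [5] and the conversion in cell In [6] can be reproduced on a toy series. This is a sketch on made-up values, not the Telco data. It also shows `pd.to_numeric` with `errors="coerce"` as an alternative that turns any non-numeric string into NaN before filling; that is a swapped-in technique, not what the notebook does:

```python
import pandas as pd

# Toy values (made up): one entry is a single space, as described above.
total_charges = pd.Series(["29.85", " ", "108.15"], dtype=object)

# A blank string is not NaN, so isna() does not flag it.
missing_count = int(total_charges.isna().sum())

# The notebook's approach: replace the blank with 0, then cast to float.
cleaned = total_charges.replace(" ", 0).astype(float)

# Alternative: coerce anything non-numeric to NaN, then fill with 0.
coerced = pd.to_numeric(total_charges, errors="coerce").fillna(0)
```

The coercing variant would also catch other unexpected strings, at the cost of hiding them behind the same fill value.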

---

## Split Train and Test Set

In [16]:
from sklearn.model_selection import train_test_split

target = df['Churn']
df_copy = df.copy()
df_copy.drop(columns='Churn', inplace=True)
x_train, x_test, y_train, y_test = train_test_split(df_copy, target, test_size=0.2, random_state=42)
print(f"Shape of train set: {x_train.shape}")
print(f"Shape of test set: {x_test.shape}")

Shape of train set: (5634, 19)
Shape of test set: (1409, 19)
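
The two row counts follow from `test_size=0.2` applied to the 7043 rows of `df`. The sketch below reproduces that arithmetic with the standard library only, assuming scikit-learn rounds the test share up and gives the remainder to the train set:

```python
import math

n_rows = 7043      # length of df, from the TotalCharges output
test_size = 0.2    # same value passed to train_test_split

# Assumption: the test count is rounded up and the train set gets the rest.
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test

print(n_train, n_test)
```

The 19 columns are the 20 columns of `df` minus the `Churn` target.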


# Push cleaned data to Repo

In [17]:
import os
try:
  os.makedirs(name='outputs/datasets/cleaned') # create outputs/datasets/cleaned folder
except Exception as e:
  print(e)
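
On a second run the folder already exists, so `os.makedirs` raises `FileExistsError` and the cell prints the error message. A minimal sketch of the `exist_ok=True` alternative, demonstrated in a temporary directory instead of the real `outputs/` tree:

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    cleaned_dir = os.path.join(tmp, "outputs", "datasets", "cleaned")

    # exist_ok=True makes the call idempotent: no error if the folder exists.
    os.makedirs(cleaned_dir, exist_ok=True)
    os.makedirs(cleaned_dir, exist_ok=True)  # second call changes nothing

    created = os.path.isdir(cleaned_dir)
```

Unlike a broad `except Exception`, this still surfaces other failures such as permission errors.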


## Train Set

In [18]:
x_train.to_csv("outputs/datasets/cleaned/x_train_cleaned.csv", index=False)
y_train.to_csv("outputs/datasets/cleaned/y_train_cleaned.csv", index=False)

## Test Set

In [19]:
x_test.to_csv("outputs/datasets/cleaned/x_test_cleaned.csv", index=False)
y_test.to_csv("outputs/datasets/cleaned/y_test_cleaned.csv", index=False)
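
One way to confirm the files were written as intended is to read them back. A sketch on a toy frame with made-up values, written to a temporary folder; with `index=False` the row index is not stored, so the reloaded data gets a fresh 0-based index:

```python
import os
import tempfile

import pandas as pd

# Toy stand-ins for x_train / y_train (made-up values, shuffled-looking index).
x_toy = pd.DataFrame({"tenure": [1, 34], "MonthlyCharges": [29.85, 56.95]}, index=[10, 42])
y_toy = pd.Series(["No", "Yes"], name="Churn", index=[10, 42])

with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "x_train_cleaned.csv")
    y_path = os.path.join(tmp, "y_train_cleaned.csv")
    x_toy.to_csv(x_path, index=False)
    y_toy.to_csv(y_path, index=False)

    x_back = pd.read_csv(x_path)
    y_back = pd.read_csv(y_path)  # a one-column DataFrame named "Churn"

# Row order is preserved, but the original index labels (10, 42) are gone.
print(x_back.index.tolist(), list(y_back.columns))
```

Because the features and target are saved separately without an index, they stay aligned only through row order.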