# Data Cleaning

We'll go through the data file provided by our client, cleaning up any artifacts and dropping or imputing missing values.

In [1]:
# import our required libraries
import pandas as pd
import numpy as np

In [2]:
# load our data and output head
df = pd.read_csv("../data/raw/data.csv")
df.head()

,state,account length,area code,phone number,international plan,voice mail plan,number vmail messages,total day minutes,total day calls,total day charge,...,total eve calls,total eve charge,total night minutes,total night calls,total night charge,total intl minutes,total intl calls,total intl charge,customer service calls,churn
0,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
1,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
2,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
3,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
4,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False


In [3]:
# what are the dtypes of our columns
df.dtypes

state                      object
account length              int64
area code                   int64
phone number               object
international plan         object
voice mail plan            object
number vmail messages       int64
total day minutes         float64
total day calls             int64
total day charge          float64
total eve minutes         float64
total eve calls             int64
total eve charge          float64
total night minutes       float64
total night calls           int64
total night charge        float64
total intl minutes        float64
total intl calls            int64
total intl charge         float64
customer service calls      int64
churn                        bool
dtype: object

All of our columns have the correct dtype. The only change we may want is to the column names, to make values easier to access later on.

For every column we'll replace spaces with underscores. We'll also remove the word "total" from the column names, since it's redundant.

In [4]:
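# replace spaces with underscores (literal match, not a regex)
df.columns = df.columns.str.replace(" ", "_", regex=False)
# remove every occurrence of 'total_' (literal match, not a regex)
df.columns = df.columns.str.replace("total_", "", regex=False)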
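One thing to keep in mind with this rename is that a plain replacement of `total_` removes the substring wherever it occurs, not only at the start of a name. That is fine for the columns shown above, where it only ever appears as a prefix. The sketch below uses a toy set of column names to show the difference; `subtotal fee` is a hypothetical column that is not in our data and is included only for illustration.

```python
import pandas as pd

# Toy column names modelled on the dataset; "subtotal fee" is hypothetical
# and exists only to show where the two patterns differ.
cols = pd.Index(["total day minutes", "customer service calls", "subtotal fee"])

# Replace spaces with underscores (literal match).
underscored = cols.str.replace(" ", "_", regex=False)

# Literal replacement removes "total_" wherever it appears in the name.
literal = underscored.str.replace("total_", "", regex=False)

# Anchored regex removes "total_" only when it is at the start of the name.
anchored = underscored.str.replace(r"^total_", "", regex=True)

print(list(literal))
print(list(anchored))
```

If columns with `total_` in the middle of the name were ever added, the anchored version would probably be the safer choice.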

In [5]:
# write cleaned data to data/interim directory
df.to_csv("../data/interim/data.csv", index=False)
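Since one of our stated goals is dropping or imputing missing values, it may be worth counting them per column before relying on the interim file. The following is a minimal sketch on a toy frame, not on the client data; `missing_report` is a hypothetical helper name, and the toy values are made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.Series:
    """Count missing values per column, keeping only columns that have any."""
    counts = frame.isna().sum()
    return counts[counts > 0]


# Toy frame with made-up values: one missing state and one missing minutes value.
toy = pd.DataFrame(
    {
        "state": ["KS", "OH", None],
        "day_minutes": [265.1, np.nan, 243.4],
        "churn": [False, False, True],
    }
)

report = missing_report(toy)
print(report)

# A frame without missing values yields an empty report.
clean_report = missing_report(toy.dropna())
print(clean_report.empty)
```

Calling `missing_report(df)` on our own frame would show which columns, if any, need attention. An empty result would suggest there is nothing to drop or impute.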