<p style="font-size:300%; text-align:center"> Telco Customer Churn modeling</p>
<p style="font-size:150%; text-align:center"> Focused customer retention programs <br> MOD3 Project - 2. Scrub</p>


In [1]:
# import important libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# load and explore the dataset
df = pd.read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")
df.sample(6).T

Unnamed: 0,4468,1312,4342,846,2788,5239
customerID,7083-YNSKY,1661-CZBAU,5792-JALQC,6916-HIJSE,2790-XUYMV,2056-EVGZL
gender,Female,Male,Female,Female,Male,Male
SeniorCitizen,0,0,1,0,0,0
Partner,No,No,No,No,No,Yes
Dependents,No,No,No,No,Yes,Yes
tenure,15,48,52,65,71,68
PhoneService,Yes,Yes,Yes,Yes,Yes,Yes
MultipleLines,Yes,Yes,Yes,No,Yes,No
InternetService,No,DSL,DSL,DSL,Fiber optic,Fiber optic
OnlineSecurity,No internet service,Yes,Yes,Yes,Yes,No


## Any obvious features to eliminate?
customerID needs to be removed before data exploration, analysis, and model building.

In [3]:
# let's remove customerID, as it's not helpful for data analysis and prediction
df = df.drop(['customerID'], axis=1)

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 20 columns):
gender              7043 non-null object
SeniorCitizen       7043 non-null int64
Partner             7043 non-null object
Dependents          7043 non-null object
tenure              7043 non-null int64
PhoneService        7043 non-null object
MultipleLines       7043 non-null object
InternetService     7043 non-null object
OnlineSecurity      7043 non-null object
OnlineBackup        7043 non-null object
DeviceProtection    7043 non-null object
TechSupport         7043 non-null object
StreamingTV         7043 non-null object
StreamingMovies     7043 non-null object
Contract            7043 non-null object
PaperlessBilling    7043 non-null object
PaymentMethod       7043 non-null object
MonthlyCharges      7043 non-null float64
TotalCharges        7043 non-null object
Churn               7043 non-null object
dtypes: float64(1), int64(2), object(17)
memory usage: 1.1+ MB


A first look at the data types shows that TotalCharges is stored as object, although it should be a numerical value. Let's explore this feature first.

In [5]:
df.TotalCharges.value_counts()

20.2       11
           11
19.75       9
19.9        8
20.05       8
           ..
54.5        1
2633.95     1
770.6       1
3765.05     1
4016.3      1
Name: TotalCharges, Length: 6531, dtype: int64

In [6]:
# there appear to be 11 empty values (single-space strings). Let's replace them with NaN; since they are less than 0.2% of the data, we can drop them
df.TotalCharges = df.TotalCharges.replace(" ",np.nan)
df.dropna(inplace=True)
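
The cell above removes the blank entries but leaves TotalCharges as strings (object dtype). The following is a minimal sketch of one way the conversion to a numeric type could be done. It assumes a small made-up frame (`toy` is hypothetical, not the Telco data) and relies on `pd.to_numeric` with `errors="coerce"`, which turns unparseable strings such as `" "` into NaN.

```python
import pandas as pd

# Hypothetical toy frame standing in for the Telco data
toy = pd.DataFrame({"TotalCharges": ["20.2", " ", "2633.95", "770.6"]})

# Coerce to float; non-numeric strings such as " " become NaN
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

# Drop the rows that could not be parsed and renumber the index
toy = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)

print(toy.dtypes)
```

Restricting `dropna` to the one column via `subset` avoids discarding rows for unrelated missing values, which may or may not be what is wanted here.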

In [7]:
# let's check for duplicates <-- this is questionable
#df.duplicated()
#df[df.duplicated()]
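
One possible reason the duplicate check is questionable: once customerID is dropped, two different customers with identical attributes look like duplicate rows. The sketch below illustrates this on a hypothetical toy frame (the IDs and values are invented for illustration) by counting duplicates with and without the identifier column.

```python
import pandas as pd

# Hypothetical toy frame: three distinct customers, two with identical features
toy = pd.DataFrame({
    "customerID": ["A1", "B2", "C3"],
    "tenure": [1, 1, 5],
    "MonthlyCharges": [20.05, 20.05, 70.0],
})

# With the ID column present, no row is an exact duplicate
n_dup_with_id = int(toy.duplicated().sum())

# Without the ID, distinct customers with identical features collide
n_dup_without_id = int(toy.drop(columns=["customerID"]).duplicated().sum())

print(n_dup_with_id, n_dup_without_id)
```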

In [8]:
for col in df.columns:
    display(df[col].value_counts())

Male      3549
Female    3483
Name: gender, dtype: int64

0    5890
1    1142
Name: SeniorCitizen, dtype: int64

No     3639
Yes    3393
Name: Partner, dtype: int64

No     4933
Yes    2099
Name: Dependents, dtype: int64

1     613
72    362
2     238
3     200
4     176
     ... 
38     59
28     57
39     56
44     51
36     50
Name: tenure, Length: 72, dtype: int64

Yes    6352
No      680
Name: PhoneService, dtype: int64

No                  3385
Yes                 2967
No phone service     680
Name: MultipleLines, dtype: int64

Fiber optic    3096
DSL            2416
No             1520
Name: InternetService, dtype: int64

No                     3497
Yes                    2015
No internet service    1520
Name: OnlineSecurity, dtype: int64

No                     3087
Yes                    2425
No internet service    1520
Name: OnlineBackup, dtype: int64

No                     3094
Yes                    2418
No internet service    1520
Name: DeviceProtection, dtype: int64

No                     3472
Yes                    2040
No internet service    1520
Name: TechSupport, dtype: int64

No                     2809
Yes                    2703
No internet service    1520
Name: StreamingTV, dtype: int64

No                     2781
Yes                    2731
No internet service    1520
Name: StreamingMovies, dtype: int64

Month-to-month    3875
Two year          1685
One year          1472
Name: Contract, dtype: int64

Yes    4168
No     2864
Name: PaperlessBilling, dtype: int64

Electronic check             2365
Mailed check                 1604
Bank transfer (automatic)    1542
Credit card (automatic)      1521
Name: PaymentMethod, dtype: int64

20.05     61
19.85     44
19.90     44
19.95     44
19.65     43
          ..
92.35      1
35.60      1
72.85      1
67.70      1
113.30     1
Name: MonthlyCharges, Length: 1584, dtype: int64

20.2       11
19.75       9
20.05       8
19.9        8
19.65       8
           ..
83.3        1
54.5        1
2633.95     1
770.6       1
4016.3      1
Name: TotalCharges, Length: 6530, dtype: int64

No     5163
Yes    1869
Name: Churn, dtype: int64

In [9]:
cols=['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport',
     'StreamingTV', 'StreamingMovies', 'Contract']
for col in cols:
    df[col] = df[col].apply(lambda x: x.replace("service", "").strip().replace(" ","_").replace("-","_"))
    display(df[col].value_counts())

# PaymentMethod: drop the "(automatic)" suffix, then replace spaces with underscores
df['PaymentMethod'] = df['PaymentMethod'].apply(lambda x: x.replace("(automatic)", "").strip().replace(" ", "_"))
display(df['PaymentMethod'].value_counts())

No          3385
Yes         2967
No_phone     680
Name: MultipleLines, dtype: int64

Fiber_optic    3096
DSL            2416
No             1520
Name: InternetService, dtype: int64

No             3497
Yes            2015
No_internet    1520
Name: OnlineSecurity, dtype: int64

No             3087
Yes            2425
No_internet    1520
Name: OnlineBackup, dtype: int64

No             3094
Yes            2418
No_internet    1520
Name: DeviceProtection, dtype: int64

No             3472
Yes            2040
No_internet    1520
Name: TechSupport, dtype: int64

No             2809
Yes            2703
No_internet    1520
Name: StreamingTV, dtype: int64

No             2781
Yes            2731
No_internet    1520
Name: StreamingMovies, dtype: int64

Month_to_month    3875
Two_year          1685
One_year          1472
Name: Contract, dtype: int64

Electronic_check    2365
Mailed_check        1604
Bank_transfer       1542
Credit_card         1521
Name: PaymentMethod, dtype: int64
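
The same label cleanup can also be written with pandas' vectorized string methods instead of `apply` with a lambda. This is a sketch on a hypothetical Series (`payment` is made up for illustration); `regex=False` is passed so the parentheses are treated as literal characters.

```python
import pandas as pd

# Hypothetical sample of PaymentMethod labels
payment = pd.Series([
    "Electronic check",
    "Bank transfer (automatic)",
    "Credit card (automatic)",
])

cleaned = (
    payment.str.replace("(automatic)", "", regex=False)  # drop the literal suffix
           .str.strip()                                   # remove the trailing space left behind
           .str.replace(" ", "_", regex=False)            # spaces to underscores
)
print(cleaned.tolist())
```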

## Scrubbing done
Save the cleaned data to file.


In [10]:
df.to_csv("data/telco_clean.csv", index=False)
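
A quick sanity check after saving is to read the file back and inspect the inferred dtypes. The sketch below assumes a hypothetical two-row frame and uses an in-memory buffer instead of a file on disk. It also shows that numeric strings written to CSV are read back as floats, because `read_csv` infers column types from the text.

```python
import io
import pandas as pd

# Hypothetical toy frame with TotalCharges still stored as strings
toy = pd.DataFrame({"TotalCharges": ["20.2", "2633.95"], "Churn": ["No", "Yes"]})

# Write to an in-memory buffer rather than a real file
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# Read back; pandas infers the column types from the CSV text
reloaded = pd.read_csv(buffer)
print(reloaded.dtypes)
```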