## Data Wrangling

### Read Data

In [1]:
import numpy as np
import pandas as pd
import pandas_profiling
import os

In [2]:
tele = pd.read_csv('Telecom_users.csv')
tele.head()

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


In [3]:
print(tele.dtypes)
tele.shape

customerID           object
gender               object
SeniorCitizen         int64
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges         object
Churn                object
dtype: object


(7043, 21)

In [4]:
tele['SeniorCitizen'] = tele.SeniorCitizen.replace({0:'No', 1:'Yes'})

### Data cleaning

A quick check of the column data types shows that `TotalCharges` was read in as `object` rather than `float64`. This means the column must contain at least one entry that cannot be parsed as a float, which forces the whole column to `object`.

In [5]:
empty = []  # row positions whose TotalCharges cannot be parsed as a float
for i, value in enumerate(tele['TotalCharges']):
    try:
        float(value)
    except ValueError:
        print(str(value)+' at row '+str(i))
        empty.append(i)

  at row 488
  at row 753
  at row 936
  at row 1082
  at row 1340
  at row 3331
  at row 3826
  at row 4380
  at row 5218
  at row 6670
  at row 6754
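
The same check can be written without an explicit loop. The following is a minimal sketch on a small made-up frame `demo` (a stand-in for `tele`, with invented values): `pd.to_numeric` with `errors='coerce'` turns strings that cannot be parsed into `NaN`, so the rows that become `NaN` are the offending ones.

```python
import pandas as pd

# Made-up stand-in for tele; two entries are blank strings
demo = pd.DataFrame({'TotalCharges': ['29.85', ' ', '108.15', ' ']})

# errors='coerce' converts anything that cannot be parsed as a number to NaN
as_number = pd.to_numeric(demo['TotalCharges'], errors='coerce')

# Rows that are NaN after conversion but were not missing before are the bad entries
bad_rows = demo.index[as_number.isna() & demo['TotalCharges'].notna()].tolist()
print(bad_rows)
```

On the real data, `bad_rows` would play the role of the `empty` list built in the loop.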


Eleven rows have a blank `TotalCharges`. Since `MonthlyCharges` and `tenure` are available for these customers, we may be able to get a good estimate of their total charges.

In [6]:
print(tele.iloc[empty, [5,18]])

      tenure  MonthlyCharges
488        0           52.55
753        0           20.25
936        0           80.85
1082       0           25.75
1340       0           56.05
3331       0           19.85
3826       0           25.35
4380       0           20.00
5218       0           19.70
6670       0           73.35
6754       0           61.90


All of these customers have a `tenure` of 0, so it appears that `TotalCharges` is left blank for customers who have been with the company for less than one month. Since new users are usually billed one month in advance, it is reasonable to assume that their `TotalCharges` equals their `MonthlyCharges`.

In [7]:
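# Fill the blank entries with the monthly charge, selecting columns by name
# rather than by position so the code does not depend on column order
tele.loc[empty, 'TotalCharges'] = tele.loc[empty, 'MonthlyCharges']
tele['TotalCharges'] = tele['TotalCharges'].astype('float')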
tele.dtypes

customerID           object
gender               object
SeniorCitizen        object
Partner              object
Dependents           object
tenure                int64
PhoneService         object
MultipleLines        object
InternetService      object
OnlineSecurity       object
OnlineBackup         object
DeviceProtection     object
TechSupport          object
StreamingTV          object
StreamingMovies      object
Contract             object
PaperlessBilling     object
PaymentMethod        object
MonthlyCharges      float64
TotalCharges        float64
Churn                object
dtype: object
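
The imputation rule can also be expressed as a condition on `tenure` instead of a saved list of rows. This is a minimal sketch on a made-up three-row frame `demo`, not the real data, and it relies on the same assumption as the text: a zero-tenure customer has been billed exactly one month.

```python
import pandas as pd

# Made-up rows standing in for tele; the middle customer has tenure 0
demo = pd.DataFrame({
    'tenure': [1, 0, 34],
    'MonthlyCharges': [29.85, 52.55, 56.95],
    'TotalCharges': ['29.85', ' ', '1889.5'],
})

# Blank strings become NaN, everything else becomes float
demo['TotalCharges'] = pd.to_numeric(demo['TotalCharges'], errors='coerce')

# Assumption from the text: a zero-tenure customer has been billed one month
new_customer = demo['tenure'].eq(0) & demo['TotalCharges'].isna()
demo.loc[new_customer, 'TotalCharges'] = demo.loc[new_customer, 'MonthlyCharges']

print(demo)
```

Restricting the fill to rows where `tenure` is 0 means that any other unparseable entry would stay `NaN` and remain visible, rather than being silently overwritten.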

### Collinearity

In [8]:
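# Order matters: label multi-line customers first, then relabel the remaining 'Yes' rows
tele.loc[tele['MultipleLines'] == 'Yes', 'PhoneService'] = 'MultipleLines'
tele.loc[tele['PhoneService'] == 'Yes', 'PhoneService'] = 'OneLine'
# MultipleLines is now fully encoded in PhoneService
tele.drop(columns = 'MultipleLines', inplace = True)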
tele['PhoneService'].unique()

array(['No', 'OneLine', 'MultipleLines'], dtype=object)
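
One way to see why the two columns can be merged is to cross-tabulate them before the merge: in the sample rows shown earlier, `MultipleLines` is `'No phone service'` whenever `PhoneService` is `'No'`, so `PhoneService` adds nothing that `MultipleLines` does not already say. The sketch below uses made-up rows in a hypothetical frame `demo`. It also shows an equivalent one-step merge with `Series.map`, offered as an alternative and not as the method used above.

```python
import pandas as pd

# Made-up rows with the two original columns, as they look before the merge
demo = pd.DataFrame({
    'PhoneService': ['No', 'Yes', 'Yes', 'No', 'Yes'],
    'MultipleLines': ['No phone service', 'No', 'Yes', 'No phone service', 'Yes'],
})

# Each MultipleLines value occurs with only one PhoneService value
table = pd.crosstab(demo['PhoneService'], demo['MultipleLines'])
print(table)

# Alternative merge: MultipleLines alone determines the three-level category
merged = demo['MultipleLines'].map({
    'No phone service': 'No',
    'No': 'OneLine',
    'Yes': 'MultipleLines',
})
print(merged.tolist())
```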

### Report summary

- After filling in the 11 blank `TotalCharges` entries, there are no missing values in our dataset.

- The distribution of our target variable is about 25%/75% on a sample of about 7k users (7,043 rows), so this is an imbalanced dataset. A resampling method will be needed when building a pipeline.

- `PhoneService` and `MultipleLines` were merged into a single column due to collinearity.
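
The class balance can be read off with `value_counts(normalize=True)`, and the simplest form of resampling is random oversampling of the minority class. The sketch below uses only pandas on a made-up frame `demo`, and it assumes the target column is `Churn`. It is one possible approach, not necessarily the one used in the later pipeline, and in practice it should be applied to the training split only.

```python
import pandas as pd

# Made-up target column with a 75% / 25% split
demo = pd.DataFrame({'Churn': ['No'] * 6 + ['Yes'] * 2})

# Share of each class
print(demo['Churn'].value_counts(normalize=True))

# Random oversampling: draw minority rows with replacement until classes match
counts = demo['Churn'].value_counts()
minority = counts.idxmin()
n_extra = int(counts.max() - counts.min())

extra = demo[demo['Churn'] == minority].sample(n=n_extra, replace=True, random_state=0)
balanced = pd.concat([demo, extra], ignore_index=True)

print(balanced['Churn'].value_counts())
```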
   
    



## Writing CSV file

In [9]:
tele.to_csv('tele_clean.csv', index = False, header=True)