<h1>Churn Project for Udacity Data Science Nano Degree program</h1>

This project is the first assignment for the Udacity Data Science Nanodegree program. For this assignment I must identify a dataset to explore, pose at least three questions, and then answer those questions using the data. I have decided to work with some telecom data available on <a href="https://www.kaggle.com/radmirzosimov/telecom-users-dataset">Kaggle</a>. Using this data, I hope to predict whether a customer will "churn", that is, elect to terminate their existing cellular contract with the service provider in favor of a new service provider. As part of this study, I intend to identify any <b>variables</b> that might help us predict churn. Once those are identified, I intend to try multiple prediction methods and determine which work best, why they might work best, and how I might improve upon them.

In [168]:
import pandas as pd
import numpy as np
import seaborn as sns

# This section beautifies the notebook by suppressing warnings that aren't important
import warnings
from pandas.core.common import SettingWithCopyWarning
warnings.simplefilter(action="ignore", category=SettingWithCopyWarning)

# Read the file into the dataframe
# adding index_col=0 to remove the "Unnamed: 0" column from the dataframe. This saves us a drop() later
df = pd.read_csv('./data/telecom_users.csv', index_col=0)

# Convert from Object to Float, force errors to NaN 
df['TotalCharges']=pd.to_numeric(df['TotalCharges'],errors='coerce')

# Replace NaN's with 0.0
df['TotalCharges']=df['TotalCharges'].fillna(0.0)

# examining the data layout, column names, and dtypes
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5986 entries, 1869 to 860
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        5986 non-null   object 
 1   gender            5986 non-null   object 
 2   SeniorCitizen     5986 non-null   int64  
 3   Partner           5986 non-null   object 
 4   Dependents        5986 non-null   object 
 5   tenure            5986 non-null   int64  
 6   PhoneService      5986 non-null   object 
 7   MultipleLines     5986 non-null   object 
 8   InternetService   5986 non-null   object 
 9   OnlineSecurity    5986 non-null   object 
 10  OnlineBackup      5986 non-null   object 
 11  DeviceProtection  5986 non-null   object 
 12  TechSupport       5986 non-null   object 
 13  StreamingTV       5986 non-null   object 
 14  StreamingMovies   5986 non-null   object 
 15  Contract          5986 non-null   object 
 16  PaperlessBilling  5986 non-null   object

In [169]:
# Examining the relationship between tenure, MonthlyCharges, and TotalCharges. 
print(f'shape: {df.shape}')
print(f'Percent of charges exceeding expected: {len(df[df.tenure * df.MonthlyCharges > df.TotalCharges])/df.shape[0]}')
print(f'Percent of charges less than expected: {len(df[df.tenure * df.MonthlyCharges < df.TotalCharges])/df.shape[0]}')
print(f'Percent of charges matching expected: {len(df[df.tenure * df.MonthlyCharges == df.TotalCharges])/df.shape[0]}')

a = (len(df[df.tenure * df.MonthlyCharges > df.TotalCharges])/df.shape[0]) \
    + (len(df[df.tenure * df.MonthlyCharges < df.TotalCharges])/df.shape[0]) \
    + (len(df[df.tenure * df.MonthlyCharges == df.TotalCharges])/df.shape[0])

# Ensuring these all add up to 100%
print(f'Total: {a}')

shape: (5986, 21)
Percent of charges exceeding expected: 0.4567323755429335
Percent of charges less than expected: 0.4562312061476779
Percent of charges matching expected: 0.08703641830938857
Total: 1.0
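
The three-way split above relies on exact floating-point equality, so a tolerance-based comparison may be a more robust way to examine the spread between billed and expected charges. The following is a minimal sketch on a small made-up frame; the toy values, the helper name `charge_gap_summary`, and the one-cent tolerance are illustrative assumptions and not part of the dataset or the notebook.

```python
import numpy as np
import pandas as pd

def charge_gap_summary(frame, tol=0.01):
    """Compare TotalCharges with tenure * MonthlyCharges (hypothetical helper)."""
    expected = frame["tenure"] * frame["MonthlyCharges"]
    gap = frame["TotalCharges"] - expected
    # Treat rows within `tol` of the expected value as matches
    matches = np.isclose(frame["TotalCharges"], expected, rtol=0.0, atol=tol)
    return {
        "share_total_above_expected": float(((gap > 0) & ~matches).mean()),
        "share_total_below_expected": float(((gap < 0) & ~matches).mean()),
        "share_match": float(matches.mean()),
        "mean_abs_gap": float(gap.abs().mean()),
    }

# Made-up rows for illustration only
toy = pd.DataFrame({
    "tenure": [1, 10, 20, 0],
    "MonthlyCharges": [20.0, 50.0, 30.0, 25.0],
    "TotalCharges": [20.0, 520.0, 590.0, 0.0],
})
summary = charge_gap_summary(toy)
print(summary)
```

Calling `charge_gap_summary(df)` on the real dataframe would report how large the gaps are on average as well as how often they occur, which is the information needed to judge whether the difference is worth modeling.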


<h2>Examination of the Data:</h2>
Looking at the columns and the dtypes, we will need to transform some of the columns. Those transformations are outlined here:

- The column customerID is an internal identifier for the customer. For our purposes here, we can safely drop it from our dataframe.

- The column gender is a binary representation of a person's gender. We will need to encode it. Since this is a binary answer, I will encode Male as 1 and Female as 0. This will reduce the number of dummy columns necessary later.

- The column SeniorCitizen should be a Boolean rather than int64. The good news is that the column contains only zeroes and ones, which makes this transform easy. Given that there are 966 "one" values, 1 evidently means True, as we would assume. Because True/False are encoded as 1's and 0's, we will not need to create a dummy column for this column.

- The columns Partner, Dependents, PhoneService, and PaperlessBilling contain only "Yes" and "No" values. These can safely be transformed into Booleans, with "Yes" set to True and "No" set to False. Since these are encoded as 1's and 0's, we will not need to create dummy columns for them.

- The columns MultipleLines, OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV, and StreamingMovies will need to be removed and replaced with <b>Dummy Variables</b>. I had hoped to use simple Booleans here, but each of these columns contains three possible entries: 'Yes', 'No', 'No internet service'

- The column InternetService will need to be dummy encoded as well. It contains three possible entries: 'No', 'Fiber optic', 'DSL'

- The column Contract contains possible values of: 'Two year', 'One year', 'Month-to-month' and will need to be encoded for our purposes here. 

- The column PaymentMethod contains four possible values: 'Credit card (automatic)', 'Bank transfer (automatic)', 'Electronic check', 'Mailed check'

- The column MonthlyCharges is a float value and contains the rate the customer is billed per month.

- The column TotalCharges is a float and appears to be a close approximation of MonthlyCharges * tenure. As ingested, it is represented as an "Object" and will need to be transformed into a float. Also of note, there are 8 rows of empty data. Comparing the empty data to tenure, all 8 of these rows show a tenure of zero, meaning these are quite likely new customers who have not been billed yet, so we can safely impute those values to 0.00. <b>About half of the data is "over charging" and about half is "under charging". I should examine the spread between the over and under to see if this might be a significant factor.</b> Because these transformations are necessary for data exploration, they are made at ingest (and commented in the code above) rather than in the Data Transformations section below.

- The column Churn needs to be converted to a Boolean (True/False).
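
The encodings described in the list above could be sketched roughly as follows. This is an illustration on a tiny made-up frame, not the notebook's actual transformation code; the `toy` data and the chosen column subset are assumptions, and `pd.get_dummies` is only one of several ways to build dummy variables.

```python
import pandas as pd

# Made-up rows that mimic the value sets described above
toy = pd.DataFrame({
    "PhoneService": ["Yes", "No", "Yes"],
    "InternetService": ["DSL", "Fiber optic", "No"],
    "OnlineSecurity": ["Yes", "No", "No internet service"],
    "Churn": ["No", "Yes", "No"],
})

# Two-valued Yes/No columns become Booleans
for col in ["PhoneService", "Churn"]:
    toy[col] = toy[col].map({"Yes": True, "No": False})

# Multi-valued columns become one dummy column per category;
# columns not listed in `columns=` pass through unchanged
encoded = pd.get_dummies(toy, columns=["InternetService", "OnlineSecurity"])
print(sorted(encoded.columns))
```

Passing `drop_first=True` to `pd.get_dummies` would drop one category per column, which may be worth considering for models that are sensitive to collinear inputs.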



<h2>Data Transformations</h2>
In an ideal world, I would refactor all of this code into a function, presumably built around map(). I tried that on the first pass, but I could only make it work by replacing the original columns. Perhaps I will revisit that in the future.
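
One way such a map()-based function might look is sketched below. The helper name `encode_binary` and the toy frame are hypothetical, and this is not the code used in this notebook; it works on a copy so that the original frame is left untouched.

```python
import pandas as pd

def encode_binary(frame, mappings):
    """Return a copy of `frame` with each listed column mapped to 0/1 (hypothetical helper)."""
    out = frame.copy()
    for column, mapping in mappings.items():
        # Unmapped values become NaN, and astype("int64") then raises,
        # so unexpected categories fail loudly instead of passing silently
        out[column] = out[column].map(mapping).astype("int64")
    return out

# Made-up rows for illustration only
toy = pd.DataFrame({
    "gender": ["Male", "Female", "Male"],
    "Partner": ["Yes", "No", "No"],
})
yes_no = {"Yes": 1, "No": 0}
encoded = encode_binary(toy, {"gender": {"Male": 1, "Female": 0}, "Partner": yes_no})
print(encoded.dtypes)
```

Keeping the mappings in a dictionary makes it easy to add further Yes/No columns without repeating the assignment code for each one.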

In [170]:
# Changing gender to zeroes and ones
# (.loc assigns in place and avoids the chained-assignment pattern df.col[mask]=value)
df.loc[df.gender=="Male", "gender"]=1
df.loc[df.gender=="Female", "gender"]=0
df.gender=pd.to_numeric(df.gender,errors='coerce')

# SeniorCitizen is already stored as int64 zeroes and ones (see df.info() above),
# so there are no 'Yes'/'No' strings to remap; only the dtype is confirmed here
df.SeniorCitizen=pd.to_numeric(df.SeniorCitizen,errors='coerce')

# Changing Partner to zeroes and ones
df.loc[df.Partner=='Yes', "Partner"]=1
df.loc[df.Partner=='No', "Partner"]=0
df.Partner=pd.to_numeric(df.Partner,errors='coerce')

# Changing Dependents to zeroes and ones
df.loc[df.Dependents=='Yes', "Dependents"]=1
df.loc[df.Dependents=='No', "Dependents"]=0
df.Dependents=pd.to_numeric(df.Dependents,errors='coerce')


# customerID is kept for now; the drop is left commented out until the remaining columns are transformed
# df.drop(columns=['customerID'], inplace=True)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 5986 entries, 1869 to 860
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        5986 non-null   object 
 1   gender            5986 non-null   int64  
 2   SeniorCitizen     5986 non-null   int64  
 3   Partner           5986 non-null   int64  
 4   Dependents        5986 non-null   int64  
 5   tenure            5986 non-null   int64  
 6   PhoneService      5986 non-null   object 
 7   MultipleLines     5986 non-null   object 
 8   InternetService   5986 non-null   object 
 9   OnlineSecurity    5986 non-null   object 
 10  OnlineBackup      5986 non-null   object 
 11  DeviceProtection  5986 non-null   object 
 12  TechSupport       5986 non-null   object 
 13  StreamingTV       5986 non-null   object 
 14  StreamingMovies   5986 non-null   object 
 15  Contract          5986 non-null   object 
 16  PaperlessBilling  5986 non-null   object

