# Project: Telco Churn Rate

#### Find drivers for customer churn at Telco. Why are customers churning?


###### Hypothesis: Customers who churn are disproportionately those who pay by mailed check

###### Null: Paying by mailed check has no relationship with customer churn

###### Alternative: Customers who pay by mailed check are more likely to churn
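
One way to test these hypotheses is a chi-square test of independence between payment type and churn. The sketch below is illustrative only: the small DataFrame is made-up data standing in for `telco.csv`, the helper name `mailed_check_chi2` and the 0.05 significance level are assumptions, and `scipy` is assumed to be installed. The chi-square test is not directional, so the churn rates are returned as well to show which group churns more.

```python
import pandas as pd
from scipy import stats


def mailed_check_chi2(df, alpha=0.05):
    """Chi-square test of independence: mailed-check payment vs. churn.

    `alpha` is an assumed significance level, not one stated in the notebook.
    """
    is_mailed = df["payment_type"] == "Mailed check"
    churned = df["churn"] == "Yes"

    # 2x2 contingency table of counts (mailed check or not, churned or not)
    observed = pd.crosstab(is_mailed, churned)
    chi2, p_value, dof, expected = stats.chi2_contingency(observed)

    return {
        "p_value": float(p_value),
        "reject_null": bool(p_value < alpha),
        "churn_rate_mailed": float(churned[is_mailed].mean()),
        "churn_rate_other": float(churned[~is_mailed].mean()),
    }


# Made-up rows for illustration only; these are NOT the real Telco numbers.
toy = pd.DataFrame({
    "payment_type": ["Mailed check"] * 40 + ["Electronic check"] * 60,
    "churn": ["Yes"] * 10 + ["No"] * 30 + ["Yes"] * 30 + ["No"] * 30,
})
result = mailed_check_chi2(toy)
print(result)
```

Rejecting the null only says the two variables are related. The alternative above is supported only if the mailed-check churn rate is also the higher of the two.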

#### Construct an ML classification model that accurately predicts customer churn

## Import

In [1]:
import os 
import pandas as pd
import numpy as np


## Acquire

Where and when was the data acquired?
How did I get the data?
When did I get the data?
What is the size of the data?
What does each observation represent?
What does each column represent?

In [4]:
# telco.csv must sit in the same directory (repository) as this notebook
# Load the Telco CSV file into a DataFrame
url = 'telco.csv'

df = pd.read_csv(url)
df.head()

Unnamed: 0.1,Unnamed: 0,payment_type_id,internet_service_type_id,contract_type_id,customer_id,gender,senior_citizen,partner,dependents,tenure,...,tech_support,streaming_tv,streaming_movies,paperless_billing,monthly_charges,total_charges,churn,contract_type,internet_service_type,payment_type
0,0,2,1,2,0002-ORFBO,Female,0,Yes,Yes,9,...,Yes,Yes,No,Yes,65.6,593.3,No,One year,DSL,Mailed check
1,1,2,1,1,0003-MKNFE,Male,0,No,No,9,...,No,No,Yes,No,59.9,542.4,No,Month-to-month,DSL,Mailed check
2,2,1,2,1,0004-TLHLJ,Male,0,No,No,4,...,No,No,No,Yes,73.9,280.85,Yes,Month-to-month,Fiber optic,Electronic check
3,3,1,2,1,0011-IGKFF,Male,1,Yes,No,13,...,No,Yes,Yes,Yes,98.0,1237.85,Yes,Month-to-month,Fiber optic,Electronic check
4,4,2,2,1,0013-EXCHZ,Female,1,Yes,No,3,...,Yes,Yes,No,Yes,83.9,267.4,Yes,Month-to-month,Fiber optic,Mailed check
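
The `Unnamed: 0` column in the output above is typical of a CSV that was written with its index included. A minimal sketch, assuming that is the cause here: passing `index_col=0` to `pd.read_csv` reads that first column as the index instead of as data. The in-memory CSV below is a made-up stand-in for `telco.csv` so the example runs on its own.

```python
import io

import pandas as pd

# Made-up two-row CSV whose first, unnamed column is a saved index.
csv_text = (
    ",customer_id,churn\n"
    "0,0002-ORFBO,No\n"
    "1,0003-MKNFE,No\n"
)

# Default read: the unnamed first column becomes a data column.
default_read = pd.read_csv(io.StringIO(csv_text))

# index_col=0: the first column is used as the row index instead.
indexed_read = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_read.columns.to_list())
print(indexed_read.columns.to_list())
```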


In [9]:
#size of the data
df.shape

(7043, 25)

In [10]:
df.columns

Index(['Unnamed: 0', 'payment_type_id', 'internet_service_type_id',
       'contract_type_id', 'customer_id', 'gender', 'senior_citizen',
       'partner', 'dependents', 'tenure', 'phone_service', 'multiple_lines',
       'online_security', 'online_backup', 'device_protection', 'tech_support',
       'streaming_tv', 'streaming_movies', 'paperless_billing',
       'monthly_charges', 'total_charges', 'churn', 'contract_type',
       'internet_service_type', 'payment_type'],
      dtype='object')

In [13]:
df.columns.to_list()

['Unnamed: 0',
 'payment_type_id',
 'internet_service_type_id',
 'contract_type_id',
 'customer_id',
 'gender',
 'senior_citizen',
 'partner',
 'dependents',
 'tenure',
 'phone_service',
 'multiple_lines',
 'online_security',
 'online_backup',
 'device_protection',
 'tech_support',
 'streaming_tv',
 'streaming_movies',
 'paperless_billing',
 'monthly_charges',
 'total_charges',
 'churn',
 'contract_type',
 'internet_service_type',
 'payment_type']

In [55]:
# Print the number of unique values, and the values themselves, for each categorical (object) column
for col in df.columns:
    if df[col].dtypes == 'object':
        print(f'{col} has {df[col].nunique()} unique values: {df[col].unique()}')


gender has 2 unique values: ['Female' 'Male']
partner has 2 unique values: ['Yes' 'No']
dependents has 2 unique values: ['Yes' 'No']
phone_service has 2 unique values: ['Yes' 'No']
multiple_lines has 3 unique values: ['No' 'Yes' 'No phone service']
online_security has 3 unique values: ['No' 'Yes' 'No internet service']
online_backup has 3 unique values: ['Yes' 'No' 'No internet service']
device_protection has 3 unique values: ['No' 'Yes' 'No internet service']
tech_support has 3 unique values: ['Yes' 'No' 'No internet service']
streaming_tv has 3 unique values: ['Yes' 'No' 'No internet service']
streaming_movies has 3 unique values: ['No' 'Yes' 'No internet service']
paperless_billing has 2 unique values: ['Yes' 'No']
total_charges has 6531 unique values: ['593.3' '542.4' '280.85' ... '742.9' '4627.65' '3707.6']
churn has 2 unique values: ['No' 'Yes']
contract_type has 3 unique values: ['One year' 'Month-to-month' 'Two year']
internet_service_type has 3 unique values: ['DSL' 'Fiber opt

In [56]:
df.isnull().sum()  # no nulls are reported, so nothing needs to be dropped for missing values

gender                   0
senior_citizen           0
partner                  0
dependents               0
tenure                   0
phone_service            0
multiple_lines           0
online_security          0
online_backup            0
device_protection        0
tech_support             0
streaming_tv             0
streaming_movies         0
paperless_billing        0
monthly_charges          0
total_charges            0
churn                    0
contract_type            0
internet_service_type    0
payment_type             0
dtype: int64
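
`isnull()` only counts true missing values (`NaN`/`None`), so blank or whitespace-only strings in text columns still count as zero here. The sketch below counts such cells. The helper name `count_blank_strings` and the toy frame are illustrative assumptions; whether `telco.csv` contains any blanks is not shown in this output.

```python
import pandas as pd


def count_blank_strings(df):
    """Count empty or whitespace-only cells in each column."""
    # Cast to str so the .str accessor works on every column;
    # numbers and NaN never become an empty string this way.
    as_text = df.astype(str)
    return as_text.apply(lambda col: col.str.strip().eq("").sum())


# Made-up rows for illustration; the second total_charges value is a blank.
toy = pd.DataFrame({
    "total_charges": ["593.3", " ", "280.85"],
    "churn": ["No", "No", "Yes"],
})

null_total = toy.isnull().sum().sum()
blank_counts = count_blank_strings(toy)
print(null_total)
print(blank_counts)
```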

In [15]:
# Drop the leftover index column, the customer ID, and the *_type_id columns, which duplicate the text columns contract_type, internet_service_type, and payment_type
df = df.drop(columns=['Unnamed: 0', 'payment_type_id', 'internet_service_type_id', 'contract_type_id', 'customer_id'])

In [60]:
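# Wrap the drop above in a function for reuse during exploration

def drop_columns(df):
    """Return df without the leftover index, customer ID, and *_type_id columns."""
    columns_to_drop = ['Unnamed: 0', 'payment_type_id', 'internet_service_type_id', 'contract_type_id', 'customer_id']
    # Pass the list directly; wrapping it in another list makes the drop fail
    df = df.drop(columns=columns_to_drop)
    return df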
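
Re-running a drop after the columns are already gone raises a `KeyError`, which is easy to hit when notebook cells are executed out of order. As a hedged variant, `errors='ignore'` makes the helper safe to call more than once. The name `drop_columns_safe` and the toy frame are hypothetical.

```python
import pandas as pd

COLUMNS_TO_DROP = [
    "Unnamed: 0",
    "payment_type_id",
    "internet_service_type_id",
    "contract_type_id",
    "customer_id",
]


def drop_columns_safe(df):
    """Drop the listed columns, skipping any that are already absent."""
    return df.drop(columns=COLUMNS_TO_DROP, errors="ignore")


# Made-up frame containing only one of the columns to drop.
toy = pd.DataFrame({"customer_id": ["0002-ORFBO"], "churn": ["No"]})

once = drop_columns_safe(toy)
twice = drop_columns_safe(once)  # second call is a no-op, not an error
print(twice.columns.to_list())
```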



In [61]:
df.dtypes  # confirm the columns were dropped and list the remaining data types

gender                    object
senior_citizen             int64
partner                   object
dependents                object
tenure                     int64
phone_service             object
multiple_lines            object
online_security           object
online_backup             object
device_protection         object
tech_support              object
streaming_tv              object
streaming_movies          object
paperless_billing         object
monthly_charges          float64
total_charges             object
churn                     object
contract_type             object
internet_service_type     object
payment_type              object
dtype: object

In [65]:
unique_values = df['total_charges'].unique()
print(unique_values)
# Nothing looks out of the ordinary in this truncated view, but the column is still stored as object

['593.3' '542.4' '280.85' ... '742.9' '4627.65' '3707.6']


In [63]:
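# total_charges is read in as object (string) rather than a numeric dtype,
# which suggests some values are blank or contain non-numeric characters
# such as commas or dollar signs.

# Strip commas, then coerce anything that still fails to parse to NaN.
# Note: the result is only displayed here, not assigned back to df.
pd.to_numeric(df['total_charges'].str.replace(',', ''), errors='coerce')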

0        593.30
1        542.40
2        280.85
3       1237.85
4        267.40
         ...   
7038     742.90
7039    1873.70
7040      92.75
7041    4627.65
7042    3707.60
Name: total_charges, Length: 7043, dtype: float64
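
The cell above only displays the converted values; `df['total_charges']` stays an object column unless the result is assigned back. A minimal sketch of persisting the conversion and checking how many values `errors='coerce'` turned into `NaN`, using a made-up frame in place of the real data:

```python
import pandas as pd

# Made-up values: one with a thousands separator, one blank.
toy = pd.DataFrame({"total_charges": ["593.3", "1,237.85", " ", "280.85"]})

# Remove commas and surrounding whitespace before parsing.
cleaned = toy["total_charges"].str.replace(",", "", regex=False).str.strip()

# Assign the result back so the column's dtype actually changes.
toy["total_charges"] = pd.to_numeric(cleaned, errors="coerce")

# Values that could not be parsed are now NaN; count them before
# deciding whether to drop or fill those rows.
n_coerced = toy["total_charges"].isna().sum()
print(toy["total_charges"].dtype, n_coerced)
```

Counting the coerced values first makes the later null-handling decision explicit instead of silent.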

In [66]:
# Identify the object columns in case they need to be encoded as numbers for the machine learning phase
df.select_dtypes(include='object').columns.to_list()

['gender',
 'partner',
 'dependents',
 'phone_service',
 'multiple_lines',
 'online_security',
 'online_backup',
 'device_protection',
 'tech_support',
 'streaming_tv',
 'streaming_movies',
 'paperless_billing',
 'total_charges',
 'churn',
 'contract_type',
 'internet_service_type',
 'payment_type']
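
Most machine learning estimators expect numeric inputs, so these object columns will need encoding before modeling. One common option, sketched here on made-up rows, is one-hot encoding with `pd.get_dummies`. The choice of encoder and of `drop_first=True` are assumptions, not something this notebook has settled on.

```python
import pandas as pd

# Made-up rows using two of the categorical columns listed above.
toy = pd.DataFrame({
    "gender": ["Female", "Male", "Male"],
    "contract_type": ["One year", "Month-to-month", "Two year"],
    "tenure": [9, 9, 4],
})

categorical_cols = ["gender", "contract_type"]

# drop_first=True drops one level per column to avoid redundant dummies;
# dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)
print(encoded.columns.to_list())
```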

In [52]:
# Find the numeric columns so their ranges can be computed, to understand the data distribution and for visualization
# Three columns hold numeric values
numeric_columns = []

for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        numeric_columns.append(col)

print(numeric_columns)

['senior_citizen', 'tenure', 'monthly_charges']


In [54]:
# Summary statistics (including min and max, which bound the range) for each numeric column
df[['senior_citizen', 'tenure', 'monthly_charges']].describe().T
# The table below shows the descriptive statistics for each variable

,count,mean,std,min,25%,50%,75%,max
senior_citizen,7043.0,0.162147,0.368612,0.0,0.0,0.0,0.0,1.0
tenure,7043.0,32.371149,24.559481,0.0,9.0,29.0,55.0,72.0
monthly_charges,7043.0,64.761692,30.090047,18.25,35.5,70.35,89.85,118.75
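
`describe()` reports `min` and `max` but not the range itself. A small sketch of computing it explicitly; the toy frame below is made up, and the column name `range` is an illustrative choice.

```python
import pandas as pd

# Made-up values for two of the numeric columns.
toy = pd.DataFrame({
    "tenure": [0, 9, 29, 72],
    "monthly_charges": [18.25, 35.5, 70.35, 118.75],
})

# One row per variable, with min and max as columns.
summary = toy.agg(["min", "max"]).T

# Range is simply max minus min.
summary["range"] = summary["max"] - summary["min"]
print(summary)
```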


## Prepare

List steps taken to clean your data here. In particular call out how you handle null values and outliers in detail. You must do this even if you do not do anything or do not encounter any. Anytime there is potential to make changes to the data you must be upfront about the changes you make or do not make.
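
As one possible way to back up the outlier statement, the sketch below flags values outside 1.5 × IQR (Tukey's fences). The helper name `iqr_outliers`, the multiplier `k=1.5`, and the toy Series are assumptions for illustration; the notebook does not say which outlier rule it uses.

```python
import pandas as pd


def iqr_outliers(series, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return series[(series < lower) | (series > upper)]


# Made-up values with one obvious extreme.
toy = pd.Series([20, 22, 25, 27, 30, 200], name="monthly_charges")
outliers = iqr_outliers(toy)
print(outliers.tolist())
```

Applied to the quartiles in the `describe()` output above, this rule would give fences of roughly -46.0 and 171.4 for `monthly_charges` and -60 and 124 for `tenure`. Both columns' minimum and maximum fall inside those fences, so neither would have outliers under this particular rule.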