# Why Are Customers Churning?

**Quick Reference**

1. Project Plan
2. Imports
3. Acquire and Prep
4. EDA
5. Models
6. Model Evaluation
7. Conclusions

# Project Plan

**Summary**

The Senior Leadership Team (SLT) wants to find out why our customers are churning.

Below is a list of questions they would like answered:

1. Are there clear groupings where a customer is more likely to churn? What if you consider contract type? Is there a tenure that month-to-month customers are most likely to churn? 1-year contract customers? 2-year customers? Do you have any thoughts on what could be going on? (Be sure to state these thoughts not as facts but as untested hypotheses. Unless you test them!). Plot the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers).
2. Are there features that indicate a higher propensity to churn, such as type of Internet service, type of phone service, online security and backup, senior citizens, paying more than x% of customers with the same services, etc.?
3. Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?
4. If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

**Goals**

The goals of this project are to answer the above questions and to deliver our findings in the following formats:

1. Report with detailed analysis in .ipynb format
2. CSV file containing customer_id, probability of churn, and the prediction of churn (1=churn, 0=not_churn)
3. Google Slides explaining model chosen and brief analysis for SLT
4. All files necessary to recreate our findings and models
5. Readme file
6. GitHub repo containing all files

# Imports

Below are all the libraries needed to reproduce this project.

In [1]:
import pandas as pd
import numpy as np
from scipy import stats

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder

from sklearn.model_selection import train_test_split

import acquire
import prepare
import encode
import explore
import features

# Acquire

- A SQL query pulls all columns from the customers table of the telco_churn database, joined with both internet_service_types and payment_types
- pandas reads the query result into a DataFrame
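
The two steps above can be sketched as follows. This is a minimal illustration, not the project's `acquire.py`: it assumes the lookup tables join to `customers` on their `*_type_id` keys, and it uses an in-memory SQLite database with two made-up rows so that it runs without credentials. The `get_telco_data_sketch` name and the reduced column set are hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical query: every customers column plus the two lookup tables.
QUERY = """
SELECT *
FROM customers
JOIN internet_service_types USING (internet_service_type_id)
JOIN payment_types USING (payment_type_id)
"""


def get_telco_data_sketch(connection):
    """Run the join and return the result as a pandas DataFrame."""
    return pd.read_sql(QUERY, connection)


# Tiny stand-in database (made-up rows) so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (customer_id TEXT, internet_service_type_id INT,
                            payment_type_id INT, tenure INT);
    CREATE TABLE internet_service_types (internet_service_type_id INT,
                                         internet_service_type TEXT);
    CREATE TABLE payment_types (payment_type_id INT, payment_type TEXT);
    INSERT INTO customers VALUES ('0001-AAAAA', 1, 2, 12), ('0002-BBBBB', 2, 1, 3);
    INSERT INTO internet_service_types VALUES (1, 'DSL'), (2, 'Fiber optic');
    INSERT INTO payment_types VALUES (1, 'Electronic check'), (2, 'Mailed check');
    """
)
demo = get_telco_data_sketch(conn)
print(demo)
```

Joining with `USING` keeps a single copy of each key column, which is why the type IDs still appear once in the result.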

In [2]:
#bring in the data
telco = acquire.get_telco_data()

#Take a quick peek at telco: check the columns for nulls and look at dtypes
telco.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 24 columns):
payment_type_id             7043 non-null int64
internet_service_type_id    7043 non-null int64
contract_type_id            7043 non-null int64
customer_id                 7043 non-null object
gender                      7043 non-null object
senior_citizen              7043 non-null int64
partner                     7043 non-null object
dependents                  7043 non-null object
tenure                      7043 non-null int64
phone_service               7043 non-null object
multiple_lines              7043 non-null object
online_security             7043 non-null object
online_backup               7043 non-null object
device_protection           7043 non-null object
tech_support                7043 non-null object
streaming_tv                7043 non-null object
streaming_movies            7043 non-null object
paperless_billing           7043 non-null object
monthly_charges 

# Prepare

- Surprisingly, it looks like we don't have any nulls in the data
- We can drop the type IDs, since they were only useful for joining the tables
    - We could have done this in SQL, but I prefer getting to Python as soon as possible

Fields to look at:
* gender: Currently an object, likely needs to be encoded
* senior_citizen: It's an int type, does that mean it's encoded already?
* partner: Currently an object, either it's a bool or should be an int identifying how many
* dependents: Currently an object, either it's a bool or should be an int identifying how many dependents
* phone_service - paperless_billing: Needs to be encoded
* total_charges: Definitely should not be an object, likely needs to be a float
* churn: needs to be encoded


The prepare.py file will handle all of the following:

   * split data into train/test/validate sets, using .8 as the training proportion in both splits and 123 as the random seed
   * Handle Missing Values
   * Handle erroneous data
   * encode variables as needed
   * new feature that represents tenure in years
   * create single variable representing the information from phone_service and multiple_lines
   * do the same using dependents and partner
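
The two-stage split in the first bullet can be sketched with `train_test_split`. This is an assumption about how `prepare.prep_telco` works internally, not its actual code: the first call holds out 20% as `test`, the second holds out 20% of the remainder as `validate`, both with `random_state=123` and no stratification. The placeholder frame stands in for the 7,043-row telco data.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_sketch(df, train_size=0.8, seed=123):
    """Split df into train, test, and validate using the same ratio twice."""
    # First split: 80% train+validate, 20% test.
    train_validate, test = train_test_split(
        df, train_size=train_size, random_state=seed
    )
    # Second split: 80% of the remainder is train, 20% is validate.
    train, validate = train_test_split(
        train_validate, train_size=train_size, random_state=seed
    )
    return train, test, validate


# Placeholder frame with the same number of rows as the telco data.
placeholder = pd.DataFrame({"row_id": range(7043)})
train_demo, test_demo, validate_demo = split_sketch(placeholder)
print(len(train_demo), len(test_demo), len(validate_demo))
```

With 7,043 rows these proportions give 4,507 / 1,409 / 1,127 rows, which matches the shapes printed by the notebook.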


In [3]:
train, test, validate = prepare.prep_telco(telco, train_size=.8, seed=123)

In [4]:
#encode some fields before EDA
train, test, validate = encode.encoded_df(train, test, validate)

In [5]:
#Ensure the shapes of all splits look ok
print(train.shape, test.shape, validate.shape)

(4507, 31) (1409, 31) (1127, 31)


In [6]:
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 4507 entries, 1249 to 6958
Data columns (total 31 columns):
customer_id                      4507 non-null object
gender                           4507 non-null object
senior_citizen                   4507 non-null int64
partner                          4507 non-null object
dependents                       4507 non-null object
tenure                           4507 non-null int64
phone_service                    4507 non-null int64
multiple_lines                   4507 non-null int64
online_security                  4507 non-null object
online_backup                    4507 non-null object
device_protection                4507 non-null object
tech_support                     4507 non-null object
streaming_tv                     4507 non-null object
streaming_movies                 4507 non-null object
paperless_billing                4507 non-null object
monthly_charges                  4507 non-null float64
total_charges                

In [7]:
train.sample(4)

| | customer_id | gender | senior_citizen | partner | dependents | tenure | phone_service | multiple_lines | online_security | online_backup | ... | tenure_years | phone_lines | contract_type_encoded | encoded_internet_service_type | churn_encoded | payment_type_encoded | online_security_encoded | tech_support_encoded | device_protection_encoded | online_backup_encoded |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1297 | 1264-FUHCX | Female | 0 | Yes | No | 49 | 0 | 0 | No | Yes | ... | 4.1 | 0 | 1 | 0 | 0 | 1 | 0 | 2 | 0 | 2 |
| 3174 | 3481-JHUZH | Male | 0 | Yes | No | 41 | 1 | 1 | No | Yes | ... | 3.4 | 2 | 0 | 1 | 0 | 2 | 0 | 0 | 0 | 2 |
| 118 | 1101-SSWAG | Female | 0 | Yes | No | 15 | 1 | 1 | No | No | ... | 1.2 | 2 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 |
| 1711 | 8597-CTXVJ | Male | 0 | No | Yes | 70 | 1 | 1 | Yes | Yes | ... | 5.8 | 2 | 1 | 0 | 0 | 0 | 2 | 2 | 0 | 2 |
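
The derived columns in this sample are consistent with simple transformations, sketched below. The formulas are inferred from the four sample rows (for example, a tenure of 49 months appears as 4.1 `tenure_years`), so treat them as assumptions rather than the contents of `prepare.py`.

```python
import pandas as pd

# Values copied from the four sample rows above.
sample = pd.DataFrame(
    {
        "tenure": [49, 41, 15, 70],
        "phone_service": [0, 1, 1, 1],
        "multiple_lines": [0, 1, 1, 1],
    }
)

# Assumed formula: tenure in months divided by 12, rounded to one decimal.
sample["tenure_years"] = (sample.tenure / 12).round(1)

# Assumed encoding: 0 = no phone, 1 = one line, 2 = multiple lines.
sample["phone_lines"] = sample.phone_service + sample.multiple_lines

print(sample)
```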


# Exploratory Data Analysis (EDA)

**Initial Hypothesis**: Customers who are month to month are most likely to churn

1. Look at the features that describe the consumer portions of our data, this will help us figure out **WHO** is churning
    - Gender
    - Senior or Not
    - Single household (dependents)
2. Run a Chi-squared test of independence to see if there is an association between each of the above and churning
3. Take a look at all other components to look for the **DRIVERS** that are causing our customers to churn 

## $\chi^2$ Testing

The $\chi^2$ test is going to help us determine who is more likely to churn

Our $\alpha$ for each $\chi^2$ test will be set at 0.01 (99% confidence)
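
To make the test concrete: under $H_0$ the expected count in each cell is (row total x column total) / grand total, and the statistic sums (observed - expected)^2 / expected over all cells. The sketch below computes this by hand for a made-up 2x2 table (the counts are illustrative, not from our data) and checks it against `scipy.stats.chi2_contingency`. Passing `correction=False` turns off the Yates continuity correction that SciPy applies by default to 2x2 tables, so the two statistics match.

```python
import numpy as np
from scipy import stats

# Made-up 2x2 contingency table: rows = group, columns = churn No / Yes.
observed = np.array([[300, 100],
                     [60, 40]])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()

# Expected counts if group and churn were independent.
expected_by_hand = row_totals * col_totals / grand_total
chi2_by_hand = ((observed - expected_by_hand) ** 2 / expected_by_hand).sum()

chi2_scipy, p_value, dof, expected_scipy = stats.chi2_contingency(
    observed, correction=False
)

print(expected_by_hand)
print(f"chi2 by hand: {chi2_by_hand:.4f}, scipy: {chi2_scipy:.4f}, p: {p_value:.4f}")
```

The cells below use the default (corrected) call, which is slightly more conservative for 2x2 tables.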

In [None]:
alpha = .01

**Senior citizen VS churning**

$H_0$ - Being a senior citizen is independent of churning

In [None]:
#1 is senior citizen, 0 is not senior citizen
is_senior_citizen_ctab = pd.crosstab(train.senior_citizen, train.churn)
is_senior_citizen_ctab

In [None]:
chi2, p_senior, degf, expected_senior = stats.chi2_contingency(is_senior_citizen_ctab)
print(expected_senior)
print(f"p-val: {p_senior}")

In [None]:
if p_senior < alpha:
    print("We reject the H_0: Being a senior citizen is independent of churning")
else:
    print("We fail to reject H_0")

In [None]:
is_senior_citizen_ctab = pd.crosstab(train.senior_citizen, train.churn, normalize=True)
is_senior_citizen_ctab

**Takeaways**
- Our p-value is less than $\alpha$, so the association between being a senior citizen and churning is statistically significant
- The observed number of senior citizens who churned is roughly 65% higher than the expected number
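
The comparison of expected to observed is a percent difference between the observed count and the count expected under independence. A minimal sketch with made-up counts (the real values come from the senior citizen crosstab and `expected_senior`; the function name is hypothetical):

```python
def pct_over_expected(observed_count, expected_count):
    """Percent by which the observed count exceeds the expected count."""
    return (observed_count - expected_count) / expected_count * 100


# Hypothetical counts for the "senior citizen and churned" cell.
print(round(pct_over_expected(observed_count=330, expected_count=200), 1))
```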

**Gender VS churning**

$H_0$ - Gender is independent of churning

In [None]:
male_and_female_ctab = pd.crosstab(train.gender, train.churn)
male_and_female_ctab

In [None]:
#test the gender crosstab (not the senior citizen one)
chi2, p_gender, degf, expected_gender = stats.chi2_contingency(male_and_female_ctab)
print(expected_gender)
print(f"p-val: {p_gender}")

In [None]:
if p_gender < alpha:
    print("We reject the H_0: Gender is independent of churning")
else:
    print("We fail to reject H_0")

In [None]:
male_and_female_ctab = pd.crosstab(train.gender, train.churn, normalize = True)
male_and_female_ctab

**Takeaways**
- The churn distribution is nearly the same for male and female customers, so gender is not worth exploring further.

**Has Dependents VS churning**

$H_0$ - Having dependents is independent of churning

In [None]:
dependents_ctab = pd.crosstab(train.dependents, train.churn)
dependents_ctab

In [None]:
chi2, p_dependents, degf, expected_dependents = stats.chi2_contingency(dependents_ctab)
print(expected_dependents)
print(f"p-val: {p_dependents}")

In [None]:
if p_dependents < alpha:
    print("We reject the H_0: Having dependents is independent of churning")
else:
    print("We fail to reject H_0")

In [None]:
dependents_ctab = pd.crosstab(train.dependents, train.churn, normalize=True)
dependents_ctab 

**Answers to our who is churning**
- Having dependents does not seem as significant as being a senior citizen, but it would be interesting to explore this further
- Senior citizens are considerably more likely to churn than non-seniors

**Looking at Senior Citizens VS Others**

This analysis will help us get a better insight of what might be causing our senior citizens to churn

In [None]:
train.groupby(["senior_citizen"])[["churn_encoded","monthly_charges","tenure","total_charges"]].mean()

**Takeaways**:

Senior citizens make up approximately 17% of our customer base, and approximately 41% of them are churning. They also have higher mean monthly charges than non-seniors (roughly \$17 more).

They are not our target market, but we may want to consider offering incentives to keep them from churning, such as a discount on monthly charges based on their tenure.

In [None]:
#take a look at correlations
explore.corr_heatmap(train)

**Takeaways**
- mean for tenure is roughly 2 and a half years
- our max tenure is sitting at 6 years
- contract type seems to be our best indicator of whether a customer will churn or not

## Tenure vs Rate of Churn

Here we are plotting the rate of churn on a line chart where x is the tenure and y is the rate of churn (customers churned/total customers).
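
One way to compute that rate is sketched below; it is not the code behind `explore.lineplot_rate_of_churn_to_tenure_months`. It assumes a 0/1 `churn_encoded` column, whose mean within each tenure group equals customers churned / total customers. The toy rows are made up.

```python
import pandas as pd


def churn_rate_by_tenure(df, tenure_col="tenure", churn_col="churn_encoded"):
    """Return churn rate (churned / total customers) for each tenure value."""
    # The mean of a 0/1 column is the share of 1s in each group.
    return df.groupby(tenure_col)[churn_col].mean().sort_index()


# Made-up rows: three customers at 1 month, two at 2 months, one at 3 months.
toy = pd.DataFrame(
    {
        "tenure": [1, 1, 1, 2, 2, 3],
        "churn_encoded": [1, 1, 0, 1, 0, 0],
    }
)
rates = churn_rate_by_tenure(toy)
print(rates)

# To draw the line chart, call rates.plot(); tenure is on x, churn rate on y.
```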

What is our churn rate?

In [None]:
churn = (train.churn == "Yes").sum()
not_churn = (train.churn == "No").sum()
all_customers = len(train)
churn_rate = ((churn/all_customers) * 100).round(2)

print(f"Our current churn rate is approximately {churn_rate}%")

In [None]:
#function to plot rate of churn and tenure in months
ax = explore.lineplot_rate_of_churn_to_tenure_months(train)

In [None]:
train.tenure.value_counts(ascending=False)

In [None]:
#function to plot rate of churn and tenure in years
explore.lineplot_rate_of_churn_to_tenure_years(train)

In [None]:
train.tenure_years.value_counts(ascending=False)

**Takeaways for tenure vs rate of churn:**

We see most customers are churning at lower tenures, with a spike right around 49 months. Overall there is a downward trend as tenure increases. However, this may also be because most of our customers' tenures are below 1 year.

**Categorical exploration to find the DRIVERS**

Here we are looking to see if we can find the driving forces behind what is causing customers to churn

In [None]:
ctab = pd.crosstab(train.churn, train.contract_type)
ctab

Below is a visualization of Churn Rates within each Contract Type

In [None]:
explore.stacked_barplot_for_churn_rates_by_contract(train)

In [None]:
explore.isolated_tenure_distros(train)

In [None]:
explore.tenure_distros_overlayed(train)

**Quick Takeaway**

- Far more customers are on month-to-month contracts than on any other type, and they also have the highest churn rate at approximately 43%. This is much higher than one-year (approx. 12%) and two-year (approx. 3%) contracts
- Tenure for month-to-month contracts is right-skewed, tenure for two-year contracts is left-skewed, and tenure for one-year contracts is, by comparison, roughly normally distributed
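
Skew can also be checked numerically rather than read off the plots. A minimal sketch using pandas' `skew()` on made-up tenures: positive values indicate a right skew, negative values a left skew, and values near zero a roughly symmetric shape. The `contract_type` labels are assumed to match the ones in `train`.

```python
import pandas as pd

# Made-up tenures (months) for each contract type.
toy = pd.DataFrame(
    {
        "contract_type": ["Month-to-month"] * 6 + ["One year"] * 5 + ["Two year"] * 6,
        "tenure": [1, 1, 2, 3, 5, 40] + [20, 30, 36, 42, 52] + [30, 65, 68, 70, 71, 72],
    }
)

# Sample skewness of tenure within each contract type.
tenure_skew = toy.groupby("contract_type")["tenure"].skew()
print(tenure_skew)
```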

**New Question:**

Why are customers choosing the Month to Month option? Are monthly rates lower?

**Explore monthly charges**

In [None]:
df = explore.stats_for_contract_types(train)
df

In [None]:
explore.monthly_charges_distros(train)

**Takeaways:**
- Customers with two-year contracts are least likely to churn, and customers with month-to-month contracts are most likely to churn.
- We should consider running some sort of incentive to convert month-to-month customers who we feel are at risk of churning.
- Surprisingly, monthly charges do not clearly differ across contract types

## Churn Rates at 12 months
Here we will compare each contract type at the 12-month mark

In [None]:
df_tenure_at_one_year = train[train.tenure == 12]
explore.plot_categorical_with_churn_rates(df_tenure_at_one_year, "contract_type")

In [None]:
explore.churn_percentages_at_12_months(train)

**Question:**

If we looked at churn rate for month-to-month customers after the 12th month and that of 1-year contract customers after the 12th month, are those rates comparable?

**Answer:**

- Overall, customers on month-to-month contracts have a slightly higher churn rate.
- The difference between month-to-month contracts and one-year contracts is roughly 8%.
- When we look into the churn rates within each subgroup we see:
    - roughly 45% of month-to-month customers are churning
    - roughly 25% of one-year contract customers are churning.
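
The question asks about customers after the 12th month, while the first cell in this section filters on `tenure == 12`. A sketch of the "after" comparison, assuming `tenure` is in months, `churn_encoded` is 0/1, and the contract labels shown; the rows are made up and `churn_rate_after` is a hypothetical helper:

```python
import pandas as pd


def churn_rate_after(df, months=12):
    """Churn rate per contract type for customers with tenure > months."""
    later = df[df.tenure > months]
    return later.groupby("contract_type")["churn_encoded"].mean()


# Made-up rows; only customers past month 12 should be counted.
toy = pd.DataFrame(
    {
        "contract_type": ["Month-to-month"] * 5 + ["One year"] * 5,
        "tenure": [3, 13, 14, 20, 30, 12, 13, 18, 24, 36],
        "churn_encoded": [1, 1, 1, 0, 0, 1, 1, 0, 0, 0],
    }
)
rates_after_12 = churn_rate_after(toy)
print(rates_after_12)
```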

## Groupings VS Churn Rate

Here we will:
- explore all other factors that were not tested in our $\chi^2$ tests
- make visualizations to help absorb our information
- Answer the question of what the drivers are behind churning 

**Correlations and quick stats**

In [None]:
#take a quick look at stats to see if anything jumps at us
train.describe()

In [None]:
explore.plot_all_categoricals_with_churn_rates(train)

**Question Asked**:

Are there clear groupings where a customer is more likely to churn? What if you consider contract type? Is there a tenure that month-to-month customers are most likely to churn? 1-year contract customers? 2-year customers? Do you have any thoughts on what could be going on?

**Answers**:

Untested Findings:
- As predicted, month-to-month customers seem to be more likely to churn
- Customers not receiving online security seem to be churning at a higher rate
- Customers not receiving device protection seem to be churning at a higher rate
- Customers not receiving online backup seem to be churning at a higher rate
- Customers who do not receive tech support seem to be churning at a higher rate
- Customers who receive paper bills seem to be more likely to churn
- Customers with fiber optic internet seem to be more likely to churn. This is surprising, since it is considered a premium service.
- Customers who pay via electronic check seem to be more likely to churn

Tested Findings from our $\chi^2$ tests
- Senior citizens are significantly more likely to churn than non-seniors
- Gender does not play a significant role in whether or not a customer will churn
- Customers with dependents are less likely to churn
    

## Price Thresholds

In [None]:
explore.price_threshold_internet_services(train)

In [None]:
explore.price_threshold_phone_service(train)

**Question Asked:**

Is there a price threshold for specific services where the likelihood of churn increases once price for those services goes past that point? If so, what is that point for what service(s)?

**Answer**

- At around \$50 customers with a phone line seem more likely to churn
- At around \$80 customers with fiber optic internet service seem more likely to churn
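
A threshold like these can be looked for by binning monthly charges and comparing churn rates across bins. The sketch below is one possible approach, not the code in the `explore.price_threshold_*` functions; the 10-dollar bin width, the 0/1 `churn_encoded` column, and the toy rows are assumptions.

```python
import pandas as pd


def churn_rate_by_price_bin(df, width=10):
    """Churn rate within fixed-width bins of monthly_charges."""
    top = int(df.monthly_charges.max()) + width
    bins = list(range(0, top + 1, width))
    # right=False makes each bin [low, high), e.g. [50, 60).
    price_bin = pd.cut(df.monthly_charges, bins=bins, right=False)
    return df.groupby(price_bin, observed=True)["churn_encoded"].mean()


# Made-up rows in which churn jumps once charges reach 50.
toy = pd.DataFrame(
    {
        "monthly_charges": [25, 35, 45, 48, 52, 55, 58, 61],
        "churn_encoded": [0, 0, 0, 0, 1, 1, 0, 1],
    }
)
price_rates = churn_rate_by_price_bin(toy)
print(price_rates)
```

To study one service at a time, filter the frame first (for example, to customers with a phone line) and then apply the same binning.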