# Telco Churn Report

## Planning
### Goals
The goal of this project is to identify the drivers of customer churn at Telco and to build a machine learning classification model that predicts churn as accurately as possible.

>Deliverables will include:
> - A repo containing: 
>   - This report detailing the process to create this model
>   - Files that hold functions to acquire and prep the data
>   - A Readme.md file detailing project planning and execution, my data dictionary, and instructions for project recreation
> - Final model created to predict if a customer will churn
> - CSV file with customer_id, probability of churn, and prediction of churn

### Some Context
Why is customer loyalty important? What is the cost of churn over time?
According to Patrick Campbell from [ProfitWell](https://www.profitwell.com/customer-churn/analysis),
>"Even seemingly small, single-figure increases in churn rate 
>can quickly have a major negative effect on your company’s ability 
>to grow. What’s more, high churn rates are more likely to compound 
>over time."

## Imports

In [5]:
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# ignore warnings
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import acquire
import explore
import prepare

## Acquire
- Use the acquire.py file to access the SQL database and return a dataframe
- Quick glances at the columns and values to understand the data features
- Plot the features

In [None]:
# Acquire the data, check out the shape and the first five rows

df = acquire.get_telco_data()
print(df.shape)
df.head()

In [None]:
# Inspect values and dtypes

for col in df.columns:
    print(col)
    print(df[col].value_counts(dropna=False))
    print('----------')

In [None]:
# Print a concise summary of our telco df - the column dtypes, non-null values

df.info()

In [None]:
# Plot histograms of selected features in a 3x3 grid

features = ['monthly_charges', 'tenure', 'payment_type',
            'contract_type', 'internet_service_type', 'streaming_movies',
            'dependents', 'device_protection', 'online_security']

# Create the figure and set its size
plt.figure(figsize=(25, 15))

# Draw one histogram per feature, titled with the column name
for i, feature in enumerate(features, start=1):
    plt.subplot(3, 3, i)
    plt.hist(df[feature])
    plt.title(feature)

## Acquire Takeaways

- After examining the values and data types in the acquired data, I plan to do the following in prepare:

    - Convert `online_security`, `online_backup`, `device_protection`, and `tech_support` from `Yes` and `No` values to 1 and 0.
    
    - Drop `total_charges`, because it can be derived from `monthly_charges` and `tenure`.
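
The two planned changes can be sketched as follows. This is a minimal illustration on a hypothetical sample frame, not the actual contents of `prepare.py`; the helper name `encode_yes_no` and the choice to map every value other than `Yes` to 0 are assumptions.

```python
import pandas as pd

# Hypothetical sample that mimics the Yes/No service columns
sample = pd.DataFrame({
    "online_security": ["Yes", "No", "No"],
    "online_backup": ["No", "Yes", "No"],
    "device_protection": ["Yes", "Yes", "No"],
    "tech_support": ["No", "No", "Yes"],
    "monthly_charges": [70.5, 55.0, 20.25],
    "total_charges": [141.0, 550.0, 20.25],
})

yes_no_cols = ["online_security", "online_backup",
               "device_protection", "tech_support"]


def encode_yes_no(df, cols):
    """Return a copy where 'Yes' becomes 1 and any other value becomes 0."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].str.lower().eq("yes").astype(int)
    return out


# Encode the service columns, then drop the redundant total_charges column
encoded = encode_yes_no(sample, yes_no_cols).drop(columns=["total_charges"])
print(encoded)
```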
    

## Prepare
- Adjust values or drop columns as needed
- Make dummy variable where applicable
- Split the dataframe into train, validate and test

In [None]:
# Prepare the data and split it into three sets: train, validate and test

train, validate, test = prepare.telco_prep(df)

In [None]:
# Check out the shapes to confirm correct split

print(train.shape, validate.shape, test.shape)
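
For readers without access to `prepare.py`, a three-way split could look roughly like the sketch below. The proportions (56% train, 24% validate, 20% test), the random seed, and the helper name `split_three_ways` are assumptions made for illustration; the real `prepare.telco_prep` may split differently. Stratifying on `churn` keeps the churn rate similar across the three sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def split_three_ways(df, target, seed=123):
    """Split df into train, validate, and test sets, stratified on target."""
    # Hold out 20% of the rows as the test set
    train_validate, test = train_test_split(
        df, test_size=0.2, random_state=seed, stratify=df[target]
    )
    # Split the remaining 80% into train (70%) and validate (30%)
    train, validate = train_test_split(
        train_validate, test_size=0.3, random_state=seed,
        stratify=train_validate[target]
    )
    return train, validate, test


# Hypothetical data: 1,000 customers with a churn rate near 27%
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "monthly_charges": rng.uniform(20, 120, size=1000),
    "churn": (rng.random(1000) < 0.27).astype(int),
})

demo_train, demo_validate, demo_test = split_three_ways(demo, "churn")
print(demo_train.shape, demo_validate.shape, demo_test.shape)
```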

#### Changes made in prep
- Converted all "No" and "Yes" values to 0 and 1
- Dropped "total_charges" because it was redundant, and dropped "gender" and "senior_citizen" because they were not significant
- Created "tenure_months" and "tenure_years" columns, both calculated from "tenure"
- Created dummy variables from the "internet_service_type_id", "online_security", and "tech_support" columns
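
The tenure columns and dummy variables listed above might be built along these lines. This is a sketch on a hypothetical sample; it assumes `tenure` is recorded in months and that `tenure_years` is a simple division by 12, neither of which is confirmed by the prep code itself.

```python
import pandas as pd

# Hypothetical sample with an id-coded categorical column and tenure in months
sample = pd.DataFrame({
    "internet_service_type_id": [1, 2, 3, 2],
    "tenure": [1, 24, 60, 13],
})

# Assumption: tenure is already in months, so years is tenure / 12
sample["tenure_months"] = sample["tenure"]
sample["tenure_years"] = sample["tenure"] / 12

# One 0/1 indicator column per category value
dummies = pd.get_dummies(
    sample["internet_service_type_id"],
    prefix="internet_service_type_id",
    dtype=int,
)

# Attach the indicator columns to the original frame
prepped = pd.concat([sample, dummies], axis=1)
print(prepped)
```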

In [None]:
# Quick visual to confirm changes made to values in prep

train.head()

## Explore
- Use the train dataset ONLY
- Determine X and Y
- Determine categorical and continuous variables


#### Categorical and Continuous Features:
| Categorical                | Continuous        |                 
|----------------------------|-------------------|
| internet_service_type_id   | tenure_months     |
| contract_type_id           | monthly_charges   |
| payment_type_id            | tenure_years      |
| customer_id                |                   |
| partner                    |                   |
| dependents                 |                   |
| phone_service              |                   |
| multiple_lines             |                   |
| online_security            |                   |
| online_backup              |                   |
| device_protection          |                   |
| tech_support               |                   |
| streaming_tv               |                   |
| streaming_movies           |                   |
| paperless_billing          |                   |
| churn                      |                   |
| payment_type               |                   |
| contract_type              |                   |
| internet_service_type      |                   |                  
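
One way to arrive at a grouping like the table above, and to separate X from y, is to classify columns by dtype and number of unique values. The sketch below is illustrative only: the helper `split_feature_types` and the `max_categories` threshold are assumptions, and the sample frame is hypothetical.

```python
import pandas as pd


def split_feature_types(df, target, id_col=None, max_categories=5):
    """Group feature columns into categorical and continuous lists."""
    categorical, continuous = [], []
    for col in df.columns:
        # Skip the target and the identifier; neither is a model feature
        if col in (target, id_col):
            continue
        is_numeric = pd.api.types.is_numeric_dtype(df[col])
        # Numeric columns with many distinct values are treated as continuous
        if is_numeric and df[col].nunique() > max_categories:
            continuous.append(col)
        else:
            categorical.append(col)
    return categorical, continuous


# Hypothetical sample resembling the prepared train set
sample = pd.DataFrame({
    "customer_id": ["a1", "b2", "c3", "d4", "e5", "f6"],
    "contract_type_id": [1, 2, 3, 1, 1, 2],
    "partner": [0, 1, 1, 0, 0, 1],
    "tenure_months": [1, 24, 60, 13, 7, 48],
    "monthly_charges": [29.85, 56.95, 105.5, 42.3, 70.7, 99.65],
    "churn": [1, 0, 0, 0, 1, 0],
})

categorical, continuous = split_feature_types(
    sample, target="churn", id_col="customer_id"
)

# X holds the features, y holds the target
X = sample[categorical + continuous]
y = sample["churn"]
print(categorical, continuous)
```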


In [None]:
# Correlation heat map for quick clarification

# Compute pairwise correlations between the numeric features
corr = train.select_dtypes(include='number').corr()

# Generate a mask for the upper triangle so that only the lower triangle is drawn
mask = np.triu(np.ones_like(corr, dtype=bool))

# Set up the matplotlib figure, set size
f, ax = plt.subplots(figsize=(13, 10))

# Generate a custom diverging colormap, changing default colors
cmap = sns.diverging_palette(250, 30, l=65, center="dark", as_cmap=True)

# Draw the heatmap with the mask and correct aspect ratio
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5})

## Model and Evaluate
- Establish a baseline to judge model performance against
- Create multiple models
- Determine the best 3 models to run on my validate dataset

- Predicting that no customer will churn is correct 73% of the time. This is the baseline accuracy that a model must beat to be of any value.

In [None]:
# Compute the class proportions; the share of the majority class (no churn) is my baseline accuracy.

train.churn.value_counts(normalize=True)
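
To show how a model is compared against this baseline, here is a minimal sketch using `LogisticRegression` on synthetic data. The data, the two feature names, and the relationship between tenure and churn are invented for illustration and say nothing about the real Telco results.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Synthetic data: churn is made more likely for short tenures (illustration only)
rng = np.random.default_rng(42)
n = 500
tenure = rng.integers(1, 73, size=n)
monthly = rng.uniform(20, 120, size=n)
churn = (rng.random(n) < 1 / (1 + np.exp(0.08 * tenure - 1))).astype(int)
demo = pd.DataFrame({
    "tenure_months": tenure,
    "monthly_charges": monthly,
    "churn": churn,
})

X = demo[["tenure_months", "monthly_charges"]]
y = demo["churn"]

# Baseline: always predict the most common class
majority_class = y.mode()[0]
baseline_acc = (y == majority_class).mean()

# Fit a simple model and measure its accuracy on the same data
model = LogisticRegression(max_iter=1000).fit(X, y)
model_acc = accuracy_score(y, model.predict(X))

print(f"baseline: {baseline_acc:.2f}  model: {model_acc:.2f}")
```

A model is only worth keeping if its accuracy on validate, not just on train, exceeds the baseline.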

## CSV file
- Contains `customer_id`, probability of churn, and prediction of churn
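
A file with these three columns could be produced roughly as follows. This is a sketch under stated assumptions: the data is synthetic, the output column names are hypothetical, and the model is fit on the same rows it scores purely to keep the example short. In the project itself the model would be fit on train and applied to unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a prepared dataset with customer ids
rng = np.random.default_rng(7)
n = 200
scored = pd.DataFrame({
    "customer_id": [f"cust-{i:04d}" for i in range(n)],
    "tenure_months": rng.integers(1, 73, size=n),
    "monthly_charges": rng.uniform(20, 120, size=n),
})
scored["churn"] = (rng.random(n) < 0.27).astype(int)

features = ["tenure_months", "monthly_charges"]
model = LogisticRegression(max_iter=1000).fit(scored[features], scored["churn"])

# Column 1 of predict_proba is the probability of the positive class (churn = 1)
predictions = pd.DataFrame({
    "customer_id": scored["customer_id"],
    "probability_of_churn": model.predict_proba(scored[features])[:, 1],
    "prediction_of_churn": model.predict(scored[features]),
})

# With no path, to_csv returns the CSV text; pass a file path to write to disk
csv_text = predictions.to_csv(index=False)
print(csv_text.splitlines()[0])
```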