# Identifying the Drivers of Churn at Telco
***
## Goals

My goal for this project is to create a model that accurately predicts customer churn from the customer data provided. I will also identify the primary drivers of customer churn.

I will deliver the following: 

- classification_project.ipynb
    - A Jupyter Notebook showing the process and analysis by which the drivers of customer churn are documented.
    
- README.md
    - A Markdown file containing the project description with goals, a data dictionary, project planning, instructions for recreating the project and its findings, and key findings and takeaways.
    
- predictions.csv
    - A CSV file containing customer IDs, probability of churn, and prediction of churn (1 = churn, 0 = no churn)

- acquire.py
    - A Python file containing a function to acquire the customer data
    
- prepare.py
    - A Python file containing functions that prepare the customer data to be worked with

- model.py
    - A Python file containing the functions needed to recreate the model 
    
- A walkthrough-style presentation with a high-level overview of the project
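
As a rough illustration of the predictions.csv deliverable listed above, the sketch below assembles the three described fields from made-up values. The column names, the sample IDs and probabilities, and the 0.5 cutoff are all assumptions for illustration, not the project's actual output.

```python
import pandas as pd

# Made-up customer IDs and churn probabilities standing in for real model output
customer_ids = ["0001-A", "0002-B", "0003-C"]
churn_probability = [0.82, 0.10, 0.50]

predictions = pd.DataFrame({
    "customer_id": customer_ids,
    "probability_of_churn": churn_probability,
})

# Assumed cutoff: a probability of 0.5 or higher is labeled churn (1), otherwise 0
predictions["prediction"] = (predictions["probability_of_churn"] >= 0.5).astype(int)

# Serialize without the index so the file holds only the three described fields
csv_text = predictions.to_csv(index=False)
```

Passing a file name to `to_csv` instead of capturing the text would write the file to disk.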

***
## Acquire
Acquire data from the customers table in the telco_churn database on the Codeup data science database server.

In [3]:
# preparing environment

import pandas as pd
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [10]:
# importing the acquire function; env supplies the user, host and password
from acquire import get_telco_data

# load the customer data into a data frame
df = get_telco_data()
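
The contents of acquire.py are not shown here, so the following is only a sketch of what such a module might look like. It assumes an `env.py` file that defines `user`, `host`, and `password`, a MySQL server reached through the `pymysql` driver, and a hypothetical local CSV cache named `telco_customers.csv`.

```python
import os
import pandas as pd


def get_connection_url(db, user, host, password):
    """Build a SQLAlchemy-style URL (the pymysql driver is an assumption)."""
    return f"mysql+pymysql://{user}:{password}@{host}/{db}"


def get_telco_data(cache_path="telco_customers.csv"):
    """Return the customers table, reading a local CSV cache when one exists."""
    if os.path.exists(cache_path):
        return pd.read_csv(cache_path)

    # env.py is assumed to define user, host, and password
    from env import user, host, password

    url = get_connection_url("telco_churn", user, host, password)
    df = pd.read_sql("SELECT * FROM customers", url)

    # Cache the result so later runs do not need to query the server
    df.to_csv(cache_path, index=False)
    return df
```

Caching the query result is a design choice that keeps the notebook usable without repeated database access; delete the cache file to force a fresh pull.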

***
## Prepare
Prepare, tidy, and clean the data so it can be explored and analyzed

In [32]:
# displaying column names, dtypes, etc.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   customer_id               7043 non-null   object 
 1   gender                    7043 non-null   object 
 2   senior_citizen            7043 non-null   int64  
 3   partner                   7043 non-null   object 
 4   dependents                7043 non-null   object 
 5   tenure                    7043 non-null   int64  
 6   phone_service             7043 non-null   object 
 7   multiple_lines            7043 non-null   object 
 8   internet_service_type_id  7043 non-null   int64  
 9   online_security           7043 non-null   object 
 10  online_backup             7043 non-null   object 
 11  device_protection         7043 non-null   object 
 12  tech_support              7043 non-null   object 
 13  streaming_tv              7043 non-null   object 
 14  streamin

- Categorical object variables such as gender, churn and dependents may need to be converted to 0s and 1s
- total_charges is stored as an object and should be converted to a float
- No missing values
- May need to rename some columns
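
One way to handle the total_charges note above is sketched below on a toy frame. The blank entry is an assumption about why the column might have been read as an object; the real data should be inspected before relying on it.

```python
import pandas as pd

# Toy frame standing in for the customer data; the blank value is an assumption
toy = pd.DataFrame({
    "customer_id": ["0001-A", "0002-B", "0003-C"],
    "tenure": [0, 12, 40],
    "total_charges": [" ", "845.5", "3200.1"],
})

# Strip whitespace, then coerce anything that is not a number to NaN
toy["total_charges"] = pd.to_numeric(
    toy["total_charges"].str.strip(), errors="coerce"
)

# Count the values that failed to parse so they can be handled deliberately
n_unparsed = int(toy["total_charges"].isna().sum())
```

Coercing can introduce missing values into a column that previously reported none, so the count in `n_unparsed` is worth checking before dropping or filling rows.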

In [34]:
# displaying various values for numerical columns (average, quartiles, min, max, etc)
# this allows us to get a rough idea of the values within the columns
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| senior_citizen | 7043.0 | 0.162147 | 0.368612 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| tenure | 7043.0 | 32.371149 | 24.559481 | 0.0 | 9.0 | 29.0 | 55.0 | 72.0 |
| internet_service_type_id | 7043.0 | 1.872923 | 0.737796 | 1.0 | 1.0 | 2.0 | 2.0 | 3.0 |
| contract_type_id | 7043.0 | 1.690473 | 0.833755 | 1.0 | 1.0 | 1.0 | 2.0 | 3.0 |
| payment_type_id | 7043.0 | 2.315633 | 1.148907 | 1.0 | 1.0 | 2.0 | 3.0 | 4.0 |
| monthly_charges | 7043.0 | 64.761692 | 30.090047 | 18.25 | 35.5 | 70.35 | 89.85 | 118.75 |


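- May want to scale the data, since the numeric columns sit on different ranges (tenure runs from 0 to 72, monthly_charges from 18.25 to 118.75)
- senior_citizen: seniors are a minority (about 16% of customers)
- tenure: ranges from 0 to 72 with a mean of about 32 and a median of 29; the range and mean alone do not establish a normal distribution, so the shape needs to be checked with a histogram
- Type IDs for internet_service, contract, and payment are already numerical, so no need to convert
- monthly_charges: the average monthly charge is 64.76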
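
If scaling turns out to be needed, min-max scaling is one option. The sketch below applies it to a small frame built from the min, quartile, and max values in the table above; the helper name `min_max_scale` is hypothetical, and this is not necessarily the scaling used later in the project.

```python
import pandas as pd

# Five-number-summary values for each column, taken from the describe() output
sample = pd.DataFrame({
    "tenure": [0, 9, 29, 55, 72],
    "monthly_charges": [18.25, 35.50, 70.35, 89.85, 118.75],
})


def min_max_scale(frame, columns):
    """Return a copy of frame with the given columns rescaled to [0, 1]."""
    scaled = frame.copy()
    for col in columns:
        col_min, col_max = frame[col].min(), frame[col].max()
        scaled[col] = (frame[col] - col_min) / (col_max - col_min)
    return scaled


scaled = min_max_scale(sample, ["tenure", "monthly_charges"])
```

In a modeling workflow the minimum and maximum would normally be computed on the training split only and then reused for the validation and test splits.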

In [30]:
# displaying unique values per column to identify possible categorical variables
df.nunique()

customer_id                 7043
gender                         2
senior_citizen                 2
partner                        2
dependents                     2
tenure                        73
phone_service                  2
multiple_lines                 3
internet_service_type_id       3
online_security                3
online_backup                  3
device_protection              3
tech_support                   3
streaming_tv                   3
streaming_movies               3
contract_type_id               3
paperless_billing              2
payment_type_id                4
monthly_charges             1585
total_charges               6531
churn                          2
dtype: int64

- columns with 2 values may need to be converted to 0s and 1s
- columns with 3 or 4 values may need to be converted to numerical values
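
The two conversions noted above could look roughly like this. The category labels ("Yes", "No", "No phone service") are assumptions, since the actual values are not displayed here; check them with `value_counts()` before mapping.

```python
import pandas as pd

# Toy frame; the category labels are assumed, not taken from the real data
toy = pd.DataFrame({
    "partner": ["Yes", "No", "Yes"],
    "churn": ["No", "No", "Yes"],
    "multiple_lines": ["No", "Yes", "No phone service"],
})

# Two-valued Yes/No columns map directly to 1/0
for col in ["partner", "churn"]:
    toy[col] = toy[col].map({"Yes": 1, "No": 0})

# Columns with three or more categories become one indicator column per category
encoded = pd.get_dummies(toy, columns=["multiple_lines"], dtype=int)
```

Indicator columns avoid implying an order between categories, which a single integer code would do.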

In [None]:
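# plotting a histogram of each column to inspect its distribution
for col in df.columns:
    # skip high-cardinality text columns (customer_id, and total_charges while it is still an object)
    if df[col].dtype == object and df[col].nunique() > 10:
        continue
    plt.figure(figsize=(4, 3))
    plt.hist(df[col])
    plt.title(col)
    plt.show()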