# Telco Project
***
### Project Description
Predict customer churn at Telco using machine learning classification
algorithms, and identify the drivers of churn.

### Table of Contents:
1. [Planning](#Planning)
2. [Acquisition](#Acquisition)
3. [Preparation and Processing](#Preparation-and-Processing)
4. [Exploration](#Exploration)
5. [Modeling](#Modeling)
6. [Delivery](#Delivery)

## Planning
---
 - [ ] Goal(s)
     - [ ] Find drivers of customer churn
     - [ ] Accurately predict customer churn at Telco.
 - [ ] Measure(s) of success
     - [ ] Hypothesis testing
     - [ ] Baseline accuracy
     - [ ] 3 classification models
         - [ ] Model performance: train, validate, test
         - [ ] Hyperparameter tuning
 - [ ] Plan to achieve goals 1 & 2
 - [ ] Develop hypotheses
     - [ ] Brainstorm questions
 
Brainstorming questions to form hypotheses:
- Do customers of a certain demographic churn more than the rest?
- Does the service package influence churn?
- Does having add-on services influence whether a customer churns?
- Does the payment method influence whether customers churn?


## Acquisition
--------------
- [ ] Instructions to acquire data
- [x] Upload `.csv` file to repository - file named `telecom_data.csv`
- [ ] `acquire.py` file
    1. [x] Write functions to acquire telco dataset
    2. [ ] Write docstring for each function
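
A minimal sketch of what `acquire.py` could contain. The function name `get_telco_data` and the file name `telecom_data.csv` come from this notebook; the `path` parameter, the existence check, and the docstring wording are assumptions, not the project's actual implementation.

```python
from pathlib import Path

import pandas as pd


def get_telco_data(path="telecom_data.csv"):
    """Return the Telco customer data as a pandas DataFrame.

    `path` is a hypothetical parameter that defaults to the CSV file
    uploaded to the repository.
    """
    csv_path = Path(path)
    if not csv_path.exists():
        # Fail early with a readable message instead of a bare pandas error.
        raise FileNotFoundError(f"Expected the Telco CSV at {csv_path.resolve()}")
    return pd.read_csv(csv_path)
```

Keeping the file path as a parameter makes the function easy to exercise against a small sample file.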

In [47]:
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from acquire import get_telco_data
from prepare import telco_data_prep, preprocessed_data, data_target_splitter
from sklearn.model_selection import train_test_split
warnings.filterwarnings('ignore')

In [2]:
# df_telco.nunique()

In [3]:
# object_columns = df_telco.nunique()[df_telco.nunique() <= 4]

In [4]:
# object_columns = object_columns.index.to_list()

In [5]:
# for column in object_columns:
#     print(df_telco[column].value_counts().sort_index())
#     print('')

In [6]:
# sns.distplot(df_telco.tenure);

In [7]:
# sns.distplot(df_telco.monthly_charges);

## Preparation and Processing
---
- [ ] Document Process

In [25]:
df = telco_data_prep()

## Exploration
---
- [ ] Statistical Analysis
    - [ ] Restate hypothesis here
    - [ ] Test hypotheses
    - [ ] Plot distributions
- [ ] Create visuals
- [ ] Present and summarize key findings

Hypotheses
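
One way a brainstormed question such as "Does the payment method influence whether customers churn?" could be tested is a chi-square test of independence. The sketch below uses toy data made up for illustration; the column name `payment_type`, the counts, and the significance level are assumptions and say nothing about the real Telco data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Toy data, made up for illustration only; not drawn from the Telco dataset.
toy = pd.DataFrame({
    "payment_type": ["electronic check"] * 40 + ["credit card"] * 40,
    "churn": [1] * 25 + [0] * 15 + [1] * 8 + [0] * 32,
})

# H0: churn is independent of payment type.
# Ha: churn depends on payment type.
observed = pd.crosstab(toy["payment_type"], toy["churn"])
chi2, p_value, dof, expected = chi2_contingency(observed)

alpha = 0.05  # assumed significance level
reject_null = p_value < alpha
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, reject H0: {reject_null}")
```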


In [26]:
train, validate, test = preprocessed_data()

In [27]:
X_train, y_train = data_target_splitter(train)
X_validate, y_validate = data_target_splitter(validate)
X_test, y_test = data_target_splitter(test)
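
The bodies of `preprocessed_data` and `data_target_splitter` live in `prepare.py` and are not shown here. The sketch below is one plausible shape for them, using hypothetical stand-ins named `three_way_split` and `split_target`; the 60/20/20 proportions, the seed, and the stratification are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(df, target="churn", seed=123):
    """Hypothetical stand-in for the splitting step of `preprocessed_data`."""
    # Hold out 20% for test, then take 25% of the remainder for validate,
    # which gives roughly 60/20/20. Stratifying keeps the churn rate similar
    # across the three sets.
    train_validate, test = train_test_split(
        df, test_size=0.20, random_state=seed, stratify=df[target]
    )
    train, validate = train_test_split(
        train_validate,
        test_size=0.25,
        random_state=seed,
        stratify=train_validate[target],
    )
    return train, validate, test


def split_target(df, target="churn"):
    """Hypothetical stand-in for `data_target_splitter`: return (X, y)."""
    return df.drop(columns=target), df[target]


# Toy frame, made up for illustration only.
toy = pd.DataFrame({"tenure": range(100), "churn": [0, 0, 0, 1] * 25})
toy_train, toy_validate, toy_test = three_way_split(toy)
toy_X_train, toy_y_train = split_target(toy_train)
```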

In [28]:
df_numeric_attributes = train.select_dtypes(exclude='uint8')
df_categorical_attributes = train.select_dtypes(exclude=['float64', 'int64'])

# list.remove() returns None, so build the list without the target instead
numeric_columns = [col for col in df_numeric_attributes.columns if col != 'churn']
categorical_columns = df_categorical_attributes.columns.to_list()

In [29]:
# sns.pairplot(train,
#              x_vars = numeric_columns,
#              y_vars = numeric_columns,
#              hue = 'churn');

## Modeling
---
- [ ] sklearn.domymathhomework.classification_models
    - [ ] Create 3 classification models

In [30]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [46]:
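# Baseline accuracy: the share of the majority class,
# i.e. the accuracy of always predicting the most common outcome
baseline_accuracy = df.churn.value_counts(normalize=True).max()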
print(f"Baseline accuracy: {baseline_accuracy:.2%}")

Baseline accuracy: 73.42%
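
The baseline is the accuracy of always predicting the most common class. As a cross-check, scikit-learn's `DummyClassifier` with `strategy="most_frequent"` should give the same figure as the majority-class share. The labels below are toy values made up for illustration, not Telco data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy labels, made up for illustration: 3 of every 4 customers do not churn.
y_toy = np.array([0, 0, 0, 1] * 25)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by this strategy

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
dummy_accuracy = dummy.score(X_toy, y_toy)

# The same number computed directly as the share of the majority class.
majority_share = np.bincount(y_toy).max() / len(y_toy)
print(f"dummy: {dummy_accuracy:.2%}, majority share: {majority_share:.2%}")
```

Any model worth keeping should beat this number on the validate set.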


### Boilerplate Classifiers: All attributes

In [48]:
# Logistic Regression Model
logreg = LogisticRegression().fit(X_train, y_train)

In [49]:
logreg.score(X_train, y_train)        # train accuracy (not displayed)
logreg.score(X_validate, y_validate)  # validate accuracy (displayed below)

0.7985781990521327

In [50]:
# Decision Tree Classifier
tree = DecisionTreeClassifier().fit(X_train, y_train)

In [51]:
tree.score(X_train, y_train)        # train accuracy (not displayed)
tree.score(X_validate, y_validate)  # validate accuracy (displayed below)

0.740521327014218

In [52]:
# Random Forest Classifier
rforest = RandomForestClassifier().fit(X_train, y_train)

In [53]:
rforest.score(X_train, y_train)        # train accuracy (not displayed)
rforest.score(X_validate, y_validate)  # validate accuracy (displayed below)

0.7902843601895735

In [54]:
# K Nearest Neighbors Classifier
knn = KNeighborsClassifier().fit(X_train, y_train)

In [55]:
knn.score(X_train, y_train)        # train accuracy (not displayed)
knn.score(X_validate, y_validate)  # validate accuracy (displayed below)

0.7677725118483413
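
A notebook cell only echoes its last expression, so the train scores above never appear. A small loop that prints both scores side by side makes overfitting easier to spot. The sketch below runs on synthetic data from `make_classification` as a stand-in for the Telco splits; the data, the seeds, and `max_iter=1000` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the notebook's X_train / X_validate.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_validate, yd_train, yd_validate = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
    "k nearest neighbors": KNeighborsClassifier(),
}

scores = {}
for name, model in models.items():
    model.fit(Xd_train, yd_train)
    scores[name] = (
        model.score(Xd_train, yd_train),
        model.score(Xd_validate, yd_validate),
    )
    train_acc, validate_acc = scores[name]
    print(f"{name:<20} train: {train_acc:.3f}  validate: {validate_acc:.3f}")
```

A large gap between the train and validate columns is a sign that the model is memorizing the training set.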

### Tuning Hyperparameters
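
A minimal sketch of one tuning approach: sweep a single hyperparameter and keep the value that scores best on the validate set. It tunes `max_depth` of a decision tree on synthetic data; the data, the depth range, and the choice of model are assumptions rather than results from this project.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with the notebook's X_train / X_validate.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)
Xd_train, Xd_validate, yd_train, yd_validate = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

results = []
for depth in range(1, 11):
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=0)
    candidate.fit(Xd_train, yd_train)
    results.append({
        "max_depth": depth,
        "train": candidate.score(Xd_train, yd_train),
        "validate": candidate.score(Xd_validate, yd_validate),
    })

# Pick the depth by validate accuracy; the test set stays untouched until the end.
best = max(results, key=lambda row: row["validate"])
print(f"best max_depth: {best['max_depth']}, validate accuracy: {best['validate']:.3f}")
```

Selecting on validate rather than test keeps the final test score an honest estimate of performance on unseen customers.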

## Delivery
---

- [ ] Summarize/Recap key findings
    - [ ] Drivers

[Return to the top](#Telco-Project)