In [None]:
# -- be sure to put the title, name, email and date here in a markdown cell! 

### Load Packages
-------



In [1]:
from IPython.display import display, HTML, clear_output
display(HTML("<style>.container { width:90% }</style>"))
# ------------------------------------------------------------------

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd

# -- sklearn modules 
from sklearn.model_selection import train_test_split # - splits data into training and test sets 
from sklearn.metrics import accuracy_score           # - calculates accuracy 

# -- need this to render charts in notebook -- 
%matplotlib inline

## Project 1
You are a newly minted data scientist working for a telecommunications company like Verizon or AT&T. You have been tasked with identifying which customers are likely to “churn”. Customer churn, also known as customer attrition, occurs when customers stop doing business with a company or stop using a company’s services. Your task is to explore the data and identify some business rules that can help the company spot likely churners. The tasks below are divided into three (3) parts; look at each section's **Todos** for your project's required tasks. If there is a question, add a markdown cell and answer it. As always, feel free to add additional cells and analysis as you dig into the data.  


### Part 1
1. Stage data
2. Clean up column names 
3. Describe data 
4. Explore likely predictors  

### Part 2.
5. Partition into 70/30 split 
6. Write a rule to predict likely targets 
7. Evaluate  

### Part 3.  
8. Write up your thoughts, in a markdown cell. 



# Part 1. 
## 1. Stage 
----- 
Import our dataset into a pandas dataframe


<div class="alert alert-info"> 💡 <strong> TODO </strong>
 
1. Read churn.csv into a dataframe named df 
2. use df.head() to display the first 5 records 
</div>

```python

df = pd.read_csv("./data/churn.csv")
df.head()
```


## 2.  Clean up Column Names

*It's just not fun dealing with ill-formed columns*

<div class="alert alert-info"> 💡 <strong> TODO </strong>
 
1. clean names 
    - remove leading and trailing whitespace
    - replace spaces with underscores _ 
    - change case to lower case
    - remove various special characters
2. print column names 
3. use head to display first 5 records 



</div>

**Todo:**


This is how I clean up column names. 

```python
df.columns = ( df.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_')
    .str.replace('-', '_')
    .str.replace('(', '', regex=False) # regex=False treats '(' as a literal character
    .str.replace(')', '', regex=False)
    .str.replace('?', '', regex=False)
    .str.replace('\'', '') # notice the backslash \ this is an escape character
)
print(df.columns)
df.head()
```
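
If you prefer fewer chained calls, the same cleanup can be sketched with two regular expressions. This is only one possible variant; the helper name `clean_columns` and the toy column names are made up for illustration and are not from churn.csv.

```python
import pandas as pd

def clean_columns(columns: pd.Index) -> pd.Index:
    """Return snake_case column names (hypothetical helper, not part of pandas)."""
    return (
        columns
        .str.strip()                                  # drop leading/trailing whitespace
        .str.lower()                                  # lower case
        .str.replace(r"[\s\-]+", "_", regex=True)     # spaces and hyphens -> underscore
        .str.replace(r"[^0-9a-z_]", "", regex=True)   # drop any remaining special characters
    )

# Made-up column names, only to show the effect
toy = pd.DataFrame(columns=[" Day Mins", "Int'l Plan", "Churn?", "VMail-Message (count)"])
toy.columns = clean_columns(toy.columns)
print(list(toy.columns))
```

The second pattern is deliberately aggressive: anything that is not a lower-case letter, digit or underscore is dropped, so check the printed names before you rely on them.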

In [None]:
df.columns = ( df.columns
    .str.strip()
    .str.lower()
    .str.replace(' ', '_')
    .str.replace('-', '_')
    .str.replace('(', '', regex=False) # regex=False treats '(' as a literal character
    .str.replace(')', '', regex=False)
    .str.replace('?', '', regex=False)
    .str.replace('\'', '') # notice the backslash \ this is an escape character
)
print(df.columns)
df.head()

## 3. Describe data
### Check Target Distribution 

-----
Always start by understanding your **"target"** column; it will determine how you perform your analysis. 


<div class="alert alert-info"> 💡 <strong> TODO </strong>
 
1. use value_counts on churn column to display count 
2. use value_counts but normalize the results so you get percentages 


</div>


```python

print(df['churn'].value_counts())               # - perform counts 
print(df['churn'].value_counts(normalize=True)) # - return percentages 

```
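
Here is a tiny sketch of what the two calls return, using a made-up target column rather than the real data. A notebook cell only displays its last expression, so each result is wrapped in `print`.

```python
import pandas as pd

# Made-up target column: 8 customers who stayed, 2 who churned
toy_churn = pd.Series(["False."] * 8 + ["True."] * 2, name="churn")

counts = toy_churn.value_counts()                # raw counts per class
shares = toy_churn.value_counts(normalize=True)  # same table as proportions (sums to 1)

print(counts)
print(shares)
```

The normalized version is the one to remember: it tells you how unbalanced the target is, which matters later when you judge accuracy.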


### Describe 
---------
Always take a look at your data to see what you are dealing with. I recommend using describe and dtypes to understand what I've just imported. 
<div class="alert alert-info"> 💡 <strong> TODO </strong>
 


1. use describe to print out descriptive statistics. What does T do and what does sort_values do? 
2. use dtypes to output data types. What is an object data type? Are there numbers that should be considered categorical? Think area codes.
</div>


```python
df.describe(include='all').T.sort_values('unique')
df.dtypes
```
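
If you decide a numeric column is really a label, one way to tell pandas is `astype("category")`. The sketch below uses made-up values; the column names are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "area_code": [415, 408, 510, 415],        # numeric labels, not quantities
    "day_mins": [265.1, 161.6, 243.4, 299.4],
})

print(toy.dtypes)                             # area_code is read in as an integer
toy["area_code"] = toy["area_code"].astype("category")
print(toy.dtypes)                             # area_code is now category
print(toy["area_code"].value_counts())        # counts per label instead of a mean
```

After the conversion, `describe(include='all')` reports unique/top/freq for the column instead of mean and standard deviation.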

### Check out Nulls 
----
Null values can be interesting but you have to deal with them when we get to building models. 

**Step 1. is to identify your problem areas.**  

Step 2. figure out if there is any predictive power in the nulls - not necessary here! 

Step 3. handle them. - not necessary yet! 

<div class="alert alert-info"> 💡 <strong> TODO </strong>
 
1. Identify if any columns contain nulls. 

</div>



```python
# -- count nulls by column -- 
df.isnull().sum(axis = 0)
```
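
Steps 1 and 2 can be sketched on a toy frame. The data below is made up (the real columns may have no nulls at all), and the "missing"/"present" flag is just one possible way to check whether missingness lines up with the target.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "vmail_message": [25, np.nan, 0, np.nan, 12, np.nan],
    "churn": ["False.", "True.", "False.", "True.", "False.", "False."],
})

# Step 1: count nulls per column
null_counts = toy.isnull().sum()
print(null_counts)

# Step 2: flag missing rows, then look at the churn rate within each flag value
null_flag = (
    toy["vmail_message"].isnull()
    .map({True: "missing", False: "present"})
    .rename("vmail_message_status")
)
null_vs_churn = pd.crosstab(null_flag, toy["churn"], normalize="index")
print(null_vs_churn)
```

If the two rows of the second table look very different, the fact that a value is missing carries information of its own.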

## 4. Explore likely predictors
### Make Histograms, Crosstabs and Barcharts 

Because you are new to plots and graphs in Python, I'll do the first couple for you. Your job is to do the following.

Todo: 

<div class="alert alert-info"> 💡 <strong> TODO </strong>
 
1. Histogram 1 - Pick a *NUMERIC* column that you think might be useful and make a histogram. 
    - make sure you have one color for churn == True and another for churn == False. 
2. Histogram 2 - Pick a *Second NUMERIC* column that you think might be useful 
    - make sure you have one color for churn == True and another for churn == False. 
3. Barchart 1 - Pick a *CHARACTER* column that you think might be useful 
    - 1a. make sure you have counts and that you can overlay them. 
    - 1b. make sure you use normalize="index" to get percentages and that you can overlay them, and use bottom= 
4. Barchart 2 - Pick a *Second CHARACTER* column that you think might be useful 
    - 2a. make sure you have counts and that you can overlay them. 
    - 2b. make sure you use normalize="index" to get percentages and that you can overlay them, and use bottom= 

</div>



See examples below


----- 

We are looking to identify variables, split points and conditions that are likely useful to predict our target. 

Here is my basic recipe. 
1. Use histograms on NUMERIC variables; play with the number of bins to make a more informative chart. 
    - filter one dataset for churn == True and another for churn == False. 
    - use two plt.hist calls to get them to overlay. 
2. Use the crosstab + barchart recipe to create a table of frequencies for CATEGORICAL variables. You may want to normalize or not; I do both. 
    - first create a crosstab of the column by the target; use reset_index() to turn the crosstab's index into a regular column 
    - second plot using BAR chart(s); I typically use one for each target value 
    - if you want to get fancy you can use the bottom option on one of your bar charts to stack them 
    
```python

# -- simple histogram -- 
plt.figure(figsize=(20,10)) # -- controls the figure size 
plt.hist(df['day_mins'], 50, facecolor='blue', alpha=0.5) # -- makes the histogram, 50 is the number of bins  
plt.title('Title of Chart')
plt.ylabel('Y label')
plt.xlabel('X label')
plt.show() # -- this shows the histogram in the notebook. 

```
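
When you overlay two histograms, it helps if both groups share the same bin edges, and the same edges also give you a numeric table for spotting split points. The sketch below uses randomly generated toy data (not churn.csv) and `np.histogram`; treat the numbers as illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)   # fixed seed so the toy data is repeatable
toy = pd.DataFrame({
    "day_mins": np.concatenate([rng.normal(170, 40, 400), rng.normal(230, 50, 100)]),
    "churn": ["False."] * 400 + ["True."] * 100,
})

# One set of bin edges for both groups, so the bars are comparable
edges = np.histogram_bin_edges(toy["day_mins"], bins=10)
n_true, _ = np.histogram(toy.loc[toy["churn"] == "True.", "day_mins"], bins=edges)
n_false, _ = np.histogram(toy.loc[toy["churn"] == "False.", "day_mins"], bins=edges)

# Churn rate per bin: a jump between neighbouring bins suggests a split point
by_bin = pd.DataFrame({"bin_start": edges[:-1].round(1), "true": n_true, "false": n_false})
by_bin["churn_rate"] = by_bin["true"] / (by_bin["true"] + by_bin["false"]).replace(0, np.nan)
print(by_bin)
```

The same `edges` array can be passed as the bins argument of both `plt.hist` calls so the red and blue bars line up.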


In [None]:
# -- simple histogram, not what we want! 
plt.figure(figsize=(20,10))
plt.hist(df['day_mins'], 50, facecolor='blue', alpha=0.5)
plt.title('Histogram of Day Minutes')
plt.ylabel('Count')
plt.xlabel('day_mins')
plt.show()

In [None]:
# -- this is what we want for our histograms! 
plt.figure(figsize=(20,10))

# -- divide my data into two datasets by target variable 
churn_t = df.loc[df['churn']== "True."]
churn_f = df.loc[df['churn']== "False."]

# -- simply change the bin size to make the chart look better --
plt.hist(churn_t['night_charge'], 25, facecolor='red', alpha=0.5)
plt.hist(churn_f['night_charge'], 25, facecolor='lightblue', alpha=0.5)
plt.title('Histogram of Night Charge')
plt.ylabel('Count')
plt.xlabel('night_charge')
plt.show()


In [None]:
# -- Histogram 1 

In [None]:
# -- Histogram 2 

### OK what about Categorical Data? 

-----

First use a crosstab to create a new table of the column by the target. You can read about crosstabs here: 

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html

Can you see the difference when we look at percentages rather than counts? What does this tell us about identifying likely churners? 

<div class="alert alert-success"> 💡 <strong> Note </strong>
    
I'm going to create an example of what I expect for your bar charts. You essentially create two bar charts, one on the frequency and another on the row percentage. What we are looking for are categories that are useful for separating churn from not churn. 
</div>
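
To see why both tables are worth making, here is a small made-up example (the column name and values are assumptions, not taken from churn.csv). The bigger category holds more churners by count, but the smaller one has the higher churn rate.

```python
import pandas as pd

toy = pd.DataFrame({
    "intl_plan": ["no"] * 20 + ["yes"] * 4,
    "churn": ["False."] * 16 + ["True."] * 4 + ["False."] * 2 + ["True."] * 2,
})

counts = pd.crosstab(toy["intl_plan"], toy["churn"])                      # raw frequencies
row_pct = pd.crosstab(toy["intl_plan"], toy["churn"], normalize="index")  # each row sums to 1

print(counts)
print(row_pct)
```

With `normalize="index"` every row is divided by its own total, so categories of very different sizes become comparable.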


In [None]:
# -- 1st do the Frequencies 
ctab = pd.crosstab(df['state'], df['churn']).reset_index()
ctab.head(10)

In [None]:
# - then plot it 
plt.figure(figsize=(20,10))

plt.bar(ctab['state'], ctab['False.'], facecolor='lightblue', alpha=0.5)
plt.bar(ctab['state'], ctab['True.'], facecolor='red', alpha=0.5)
plt.title('Frequency of Churn by State')
plt.ylabel('Count')
plt.xlabel('state')
plt.show()

In [None]:
#- 2nd Normalize and Sort  
ctab = pd.crosstab(df['state'], df['churn'], normalize="index").reset_index().sort_values('True.',ascending=False )
ctab.head(10)

In [None]:
# - then plot that, but this time use bottom on one of the bar charts 
plt.figure(figsize=(20,10))

# -- use bottom to stack the bars. since it's sorted you get a nice trend. 
plt.bar(ctab['state'], ctab['False.'], facecolor='lightblue', alpha=0.5)
plt.bar(ctab['state'], ctab['True.'], bottom= ctab['False.'], facecolor='red', alpha=0.5)
plt.title('Proportion of Churn by State')
plt.ylabel('Proportion')
plt.xlabel('state')
plt.show()

In [None]:
#- crosstab 1a

In [None]:
#- barchart 1a

In [None]:
#- crosstab 1b with normalization 

In [None]:
#- barchart 1b with bottom 

In [None]:
#- crosstab 2a

In [None]:
#- barchart 2a

In [None]:
#- crosstab 2b with normalization 

In [None]:
#- barchart 2b

# Part 2.
## 5. Partition into 70/30 split
-----
Sklearn is our main package; we imported **train_test_split** from the model_selection module. Why do we need to split the data? We do it so that we make predictions on out-of-sample data, which tells us whether our predictions will generalize to new and unseen data. It isn't fair to evaluate our prediction if it has seen the data before, right? I mean, you wouldn't go to your psychic and tell them exactly what you want to hear before they do your reading.

So what percentage to use? The general rule of thumb is a 70/30 or 75/25 training/test split. With a 70/30 split you'll "train" your model on 70% of the data and evaluate it on 30%. 

<div class="alert alert-info"> 💡 <strong> TODO </strong>
    
1. partition your data into a 70/30 split, using train_test_split  
2. print the percentages 
    
 </div>

```python
train, test = train_test_split(df,test_size=0.30)
print("train pct: {:2.2%}".format(train.shape[0]/df.shape[0]))
print("test  pct: {:2.2%}".format(test.shape[0]/df.shape[0]))
```
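
Two optional arguments of `train_test_split` are worth knowing about: `random_state` makes the split repeatable, and `stratify` keeps the share of churners roughly equal in both parts. A minimal sketch on made-up data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 85 customers who stayed, 15 who churned
toy = pd.DataFrame({
    "day_mins": range(100),
    "churn": ["False."] * 85 + ["True."] * 15,
})

train, test = train_test_split(
    toy,
    test_size=0.30,          # 30% of the rows go to the test set
    random_state=42,         # same rows every run
    stratify=toy["churn"],   # keep the churn rate similar in both parts
)

print(len(train), len(test))
print(train["churn"].value_counts(normalize=True))
print(test["churn"].value_counts(normalize=True))
```

Neither argument is required for this project; without `random_state` your numbers will simply change a little each time you rerun the cell.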

## 6. Write a rule to predict likely targets
-----
Based on our exploratory analysis above, you should have identified one or two rules which "MIGHT" be useful. 

This is the recipe. 

1. Make a new column **churn_pred** default it to False., our majority class 
2. Write rules to update **churn_pred** to equal True. 
3. I like confusion matrices; they help you see how well you are doing at predicting the target. 


<div class="alert alert-info"> 💡 <strong> TODO </strong>
    
1. create a new column on training data "churn_pred" and DEFAULT it to False. 
2. write your rules to set "churn_pred" to True. 
3. Create *"confusion_matrix"* by using pd.crosstab(actual,predicted)  this one has the counts
    - print your confusion matrix 
    - plot it using a sns heatmap
4. Create *"confusion_matrix_pct"* by using pd.crosstab(actual,predicted, normalize="index") this one has the percents 
    - print your confusion matrix 
    - plot it using a sns heatmap
    
5. Repeat steps 1-4 for the **test dataset** !!!! 
 </div>
 
```python 
# 1. -- default the predicted target 
train.loc[:,'churn_pred'] = 'False.'

# 2. -- update where rules are met, use your rules 
train.loc[train['state'].isin(['CA','TX']), 'churn_pred' ] = 'True.'  # -- this is a single rule 

# 3 & 4 - confusion matrices and sns heatmap plots. 
confusion_matrix = pd.crosstab(train['churn'], train['churn_pred'],  rownames=['Actual'], colnames=['Predicted'])
confusion_matrix_pct = pd.crosstab(train['churn'], train['churn_pred'], normalize="index", rownames=['Actual'], colnames=['Predicted'])
print("Training confusion Matrix")
print(confusion_matrix)

plt.figure(figsize=(8,5))
sns.heatmap(confusion_matrix, annot=True, fmt='g')
plt.show()

plt.figure(figsize=(8,5))
sns.heatmap(confusion_matrix_pct, annot=True, fmt='g')
plt.show()
```
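
Rules do not have to be a single `isin`; you can combine a numeric cut-off and a category with `|` (or) and `&` (and). The sketch below uses made-up rows and a made-up rule purely to show the mechanics and how to read the resulting table.

```python
import pandas as pd

toy = pd.DataFrame({
    "day_mins":  [120, 300, 180, 280, 150, 310, 200, 90],
    "intl_plan": ["no", "no", "yes", "no", "no", "yes", "no", "no"],
    "churn":     ["False.", "True.", "True.", "False.", "False.", "True.", "False.", "False."],
})

# 1. default everyone to the majority class
toy["churn_pred"] = "False."

# 2. made-up rule: heavy daytime users OR customers on an international plan
rule = (toy["day_mins"] > 250) | (toy["intl_plan"] == "yes")
toy.loc[rule, "churn_pred"] = "True."

# 3. rows are actual values, columns are predictions
cm = pd.crosstab(toy["churn"], toy["churn_pred"], rownames=["Actual"], colnames=["Predicted"])
print(cm)
```

Read the table row by row: the `True.` row shows how many real churners the rule caught versus missed, and the `False.` row shows how many loyal customers were flagged by mistake.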

In [None]:
# -- do this with the TRAIN dataset 

In [None]:
# -- repeat with the TEST dataset 

## 7. Evaluate

How accurate were we? Accuracy is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. 
Formally, accuracy has the following definition:

    accuracy = number of correct predictions / all predictions 
    
    
We always want to know how accurate we would be if we did nothing, and then compare that with the accuracy of our predictions. 
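
As a small worked example of the formula, the sketch below computes accuracy by hand and with `accuracy_score` on ten made-up customers, next to the do-nothing baseline. The labels are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Ten made-up customers: 8 stayed, 2 churned
actual    = pd.Series(["False."] * 8 + ["True."] * 2)
# A made-up rule: one false alarm, one churner caught, one churner missed
predicted = pd.Series(["False."] * 7 + ["True."] * 1 + ["True."] * 1 + ["False."] * 1)

# Do-nothing baseline: always predict the majority class
baseline = actual.value_counts(normalize=True).iloc[0]

# accuracy = number of correct predictions / all predictions
by_hand = (actual == predicted).mean()
with_sklearn = accuracy_score(actual, predicted)

print(f"baseline: {baseline:.2%}  by hand: {by_hand:.2%}  sklearn: {with_sklearn:.2%}")
```

Keep the baseline in mind when you read your own numbers; a rule is only interesting relative to what doing nothing would score.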

https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html

<div class="alert alert-info"> 💡 <strong> TODO </strong>

1. calculate default accuracy, as if everything were predicted false 
2. calculate training accuracy, based on your business rules applied to the train dataset 
3. calculate testing accuracy, based on your business rules applied to the test dataset 
4. answer these questions. 
    - do you think accuracy is a good measure of prediction? 
    - how would your analysis change if you made $100 for every correct churn==True prediction and lost $20 for every missed churn? 
    
</div>

```python 

### Default Accuracy, i.e. do nothing and predict everyone as False.; this equals the share of False. in the data 
accuracy_default = train['churn'].value_counts(normalize=True).iloc[0]  # share of the majority class
accuracy_train = accuracy_score(train['churn'], train['churn_pred'])
accuracy_test = accuracy_score(test['churn'], test['churn_pred'])
print("Default Accuracy : {:2.2%}".format(accuracy_default))
print("Train Accuracy   : {:2.2%}".format(accuracy_train))
print("Test Accuracy    : {:2.2%}".format(accuracy_test))

```

In [None]:
# -- accuracy here 

In [None]:
# -- answer questions here, change this cell to markdown 

## 8. Writeup 
----- 

<div class="alert alert-info"> 💡 <strong> TODO </strong>
    
I'm not looking for anything long, just a short write-up on what you did, what you thought was interesting about the data, and how your rules performed. What can you infer about accuracy as a measure of performance vs. a confusion matrix? Is accuracy a good measure of classifier performance?
</div>


In [None]:
## - change this cell to markdown and write your response. 