#### Standard imports

In [9]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from pandas import DataFrame
import scipy.stats
from sklearn.impute import SimpleImputer

#### Jupyter display cell-width increase

In [2]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [3]:
churn_df = pd.read_csv('churn_raw_data.csv')
churn_df.describe()

Unnamed: 0.1,Unnamed: 0,CaseOrder,Zip,Lat,Lng,Population,Children,Age,Income,Outage_sec_perweek,...,MonthlyCharge,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,7505.0,7525.0,7510.0,10000.0,...,10000.0,8979.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,5000.5,49153.3196,38.757567,-90.782536,9756.5624,2.095936,53.275748,39936.762226,11.452955,...,174.076305,3398.842752,3.4908,3.5051,3.487,3.4975,3.4929,3.4973,3.5095,3.4956
std,2886.89568,2886.89568,27532.196108,5.437389,15.156142,14432.698671,2.154758,20.753928,28358.469482,7.025921,...,43.335473,2187.396807,1.037797,1.034641,1.027977,1.025816,1.024819,1.033586,1.028502,1.028633
min,1.0,1.0,601.0,17.96612,-171.68815,0.0,0.0,18.0,740.66,-1.348571,...,77.50523,155.506715,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,2500.75,2500.75,26292.5,35.341828,-97.082813,738.0,0.0,35.0,19285.5225,8.054362,...,141.071078,1234.110529,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
50%,5000.5,5000.5,48869.5,39.3958,-87.9188,2910.5,1.0,53.0,33186.785,10.202896,...,169.9154,3382.424,3.0,4.0,3.0,3.0,3.0,3.0,4.0,3.0
75%,7500.25,7500.25,71866.5,42.106908,-80.088745,13168.0,3.0,71.0,53472.395,12.487644,...,203.777441,5587.0965,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
max,10000.0,10000.0,99929.0,70.64066,-65.66785,111850.0,10.0,89.0,258900.7,47.04928,...,315.8786,7158.982,7.0,7.0,8.0,7.0,7.0,8.0,7.0,8.0


In [4]:
churn_df.head()

Unnamed: 0.1,Unnamed: 0,CaseOrder,Customer_id,Interaction,City,State,County,Zip,Lat,Lng,...,MonthlyCharge,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8
0,1,1,K409198,aa90260b-4141-4a24-8e36-b04ce1f4f77b,Point Baker,AK,Prince of Wales-Hyder,99927,56.251,-133.37571,...,171.449762,904.53611,5,5,5,3,4,4,3,4
1,2,2,S120509,fb76459f-c047-4a9d-8af9-e0f7d4ac2524,West Branch,MI,Ogemaw,48661,44.32893,-84.2408,...,242.948015,800.982766,3,4,3,3,4,3,4,4
2,3,3,K191035,344d114c-3736-4be5-98f7-c72c281e2d35,Yamhill,OR,Yamhill,97148,45.35589,-123.24657,...,159.440398,2054.706961,4,4,2,4,4,3,3,3
3,4,4,D90850,abfa2b40-2d43-4994-b15a-989b8c79e311,Del Mar,CA,San Diego,92014,32.96687,-117.24798,...,120.249493,2164.579412,4,4,4,2,5,4,3,3
4,5,5,K662701,68a861fd-0d20-4e51-a587-8a90407ee574,Needville,TX,Fort Bend,77461,29.38012,-95.80673,...,150.761216,271.493436,4,4,4,3,4,4,4,5


## <span style="color:red">Part I: Research Question</span>

### A. Description of realistic organizational need & consequential question to be addressed

### <span style="color:green"><b>Question description</b>:</span>
<br>Which customers are at high risk of churn?  How can I tell when they will leave and what will push them over the edge?  Are there survey responses that predict potential customer churn?

### B. Description of variables & specific type of data being described, <i>with examples</i>

### <span style="color:green"><b>Variables & data description</b>:</span>
<br>The data set contains 10,000 customer records from a popular telecommunications company.  The dependent variable (target) is whether each customer has continued or discontinued service within the last month; this column is titled "Churn".  Independent variables (predictors) that may show a relationship with "Churn" fall into three groups: 1) services that each customer signed up for (for example, multiple phone lines, technical support add-ons or streaming media); 2) customer account information (tenure with the company, payment methods, bandwidth usage, etc.); and 3) customer demographics (gender, marital status, income, etc.).  Finally, eight independent variables (the columns item1 through item8) represent survey responses on the customer-perceived importance of company services and features.
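As a minimal sketch of how the variable types described above could be separated, the cell below splits columns into numeric and text (categorical) groups with `select_dtypes`. The column names come from the data set, but the toy frame `sample`, its values, and the Yes/No coding of "Churn" are assumptions made only for illustration.

```python
import pandas as pd

# Toy stand-in for churn_df; every value here is invented for illustration.
sample = pd.DataFrame({
    "Churn": ["No", "Yes", "No"],        # assumed Yes/No coding of the target
    "State": ["AK", "MI", "OR"],         # categorical demographic field
    "Age": [68.0, 27.0, 50.0],           # numeric, stored as float
    "Income": [28561.99, 21704.77, None],  # numeric with one missing value
    "item1": [5, 3, 4],                  # survey response, stored as integer
})

# Split columns by storage type: numeric versus everything else (text).
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("numeric:", numeric_cols)
print("categorical:", categorical_cols)
```

The same two calls could be run against `churn_df` itself to list which of its 52 columns are numeric and which are text.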

In [5]:
# Find number of records and columns of data set
churn_df.shape

(10000, 52)

In [6]:
# Add an index field
churn_df['index'] = range(len(churn_df))

In [7]:
churn_df.head()

Unnamed: 0.1,Unnamed: 0,CaseOrder,Customer_id,Interaction,City,State,County,Zip,Lat,Lng,...,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8,index
0,1,1,K409198,aa90260b-4141-4a24-8e36-b04ce1f4f77b,Point Baker,AK,Prince of Wales-Hyder,99927,56.251,-133.37571,...,904.53611,5,5,5,3,4,4,3,4,0
1,2,2,S120509,fb76459f-c047-4a9d-8af9-e0f7d4ac2524,West Branch,MI,Ogemaw,48661,44.32893,-84.2408,...,800.982766,3,4,3,3,4,3,4,4,1
2,3,3,K191035,344d114c-3736-4be5-98f7-c72c281e2d35,Yamhill,OR,Yamhill,97148,45.35589,-123.24657,...,2054.706961,4,4,2,4,4,3,3,3,2
3,4,4,D90850,abfa2b40-2d43-4994-b15a-989b8c79e311,Del Mar,CA,San Diego,92014,32.96687,-117.24798,...,2164.579412,4,4,4,2,5,4,3,3,3
4,5,5,K662701,68a861fd-0d20-4e51-a587-8a90407ee574,Needville,TX,Fort Bend,77461,29.38012,-95.80673,...,271.493436,4,4,4,3,4,4,4,5,4


## <span style="color:red">Part II: Data-Cleaning Plan</span>

### C. Explanation of data cleaning plan
1. Plan proposal to identify anomalies, <i>including relevant techniques & specific steps</i> 
2.  Justify your approach for assessing the quality of the data, include:
<br>&ensp; •  characteristics of the data being assessed,
<br>&ensp; •  the approach used to assess the quality.

3.  Justify your selected programming language and any libraries and packages that will support the data-cleaning process.

4.  Provide the code you will use to identify the anomalies in the data.

### <span style="color:green"><b>1. Plan to Find Anomalies</b>:</span>
I will follow the textbook approach of:
<br>&ensp; 1) backing up my data and the process I am following as a copy on my machine and, since this is a manageable data set, to GitHub using the command line and Git Bash;
<br>&ensp; 2) reading the data set into Python using the Pandas read_csv command;
<br>&ensp; 3) naming the data set as the variable "churn_df" and subsequent useful slices as "data";
<br>&ensp; 4) examining the data set for coding errors, including missing data, introduced during collection;
<br>&ensp; 5) finding outliers that may create or hide statistical significance, using histograms;
<br>&ensp; 6) and imputing records with missing data using meaningful measures of central tendency (mean, median or mode), or simply removing outliers that are several standard deviations from the mean.
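The last three steps of the plan above (missing data, outliers, imputation) can be sketched roughly as follows. This is an illustration under stated assumptions rather than the final cleaning code: the frame `toy` and its values are invented, the cut-off `Z_LIMIT` is a hypothetical choice, and the median is only one of the central-tendency measures the plan allows.

```python
import numpy as np
import pandas as pd

# Invented values; only the column names come from the data set.
toy = pd.DataFrame({
    "Children": [0.0, 1.0, np.nan, 3.0, 2.0, 1.0, np.nan, 0.0],
    "Income": [28000.0, 31000.0, 30500.0, np.nan,
               29500.0, 32000.0, 30000.0, 900000.0],
})

# Count missing values per column.
missing_counts = toy.isna().sum()

# Flag values far from the column mean using z-scores.
# pandas skips NaN when computing mean and std, and NaN never compares as > limit.
Z_LIMIT = 2.0  # hypothetical cut-off for this tiny sample; 3.0 is common on larger data
z_scores = (toy - toy.mean()) / toy.std()
outlier_mask = z_scores.abs() > Z_LIMIT

# Impute missing values with the column median, which is less
# sensitive to extreme values than the mean.
imputed = toy.fillna(toy.median())

print(missing_counts)
print(outlier_mask.sum())
print(imputed)
```

The median is used here because the extreme income value would pull a mean-based fill upward; the mode would be the more natural choice for categorical columns.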

### <span style="color:green"><b>2. Justification of Approach</b>:</span>
"The justification includes the characteristics of the data being assessed and references the approach used to assess the quality of the data. The justified approach aligns with the selected data set."

### <span style="color:green"><b>3. Justification of Tools</b>:</span>
I will use the Python programming language because I have some background in it, having studied machine learning independently over the last year before beginning this master's program, and because it can perform many tasks right "out of the box."  Python provides clean, intuitive and readable syntax and has become ubiquitous across the data science industry.  I also find Jupyter notebooks a convenient way to run code visually: the single-document markdown format displays the results of code and graphic visualizations and provides clear running documentation for future reference.  The packages and libraries will include NumPy, Pandas, Matplotlib, Scikit-learn, SciPy and Seaborn, as they contain purpose-built code for complex data science tasks that I would otherwise have to build from scratch. SPECIFIC EXAMPLES . . .

### <span style="color:green"><b>4. Code for anomaly identification</b>:</span>
code here . . . 
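One possible shape for this code is sketched below. The helper `summarize_anomalies` is hypothetical, and the toy rows use invented numeric values (only the column names and customer IDs appear in the data set). It counts duplicate rows, missing values, negative values in columns that should not be negative, and outliers by Tukey's 1.5 × IQR rule. The negative-value check is motivated by the `describe()` output above, where the minimum of `Outage_sec_perweek` is -1.348571.

```python
import numpy as np
import pandas as pd


def summarize_anomalies(df: pd.DataFrame, non_negative: list[str]) -> dict:
    """Return simple anomaly counts for a data frame (hypothetical helper)."""
    numeric = df.select_dtypes(include="number")
    # Tukey's rule: flag values more than 1.5 * IQR beyond the quartiles.
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return {
        "duplicate_rows": int(df.duplicated().sum()),
        "missing": df.isna().sum().to_dict(),
        "negative": {col: int((df[col] < 0).sum()) for col in non_negative},
        "iqr_outliers": outside.sum().to_dict(),
    }


# Toy rows with invented numbers; the last row duplicates the one before it.
toy = pd.DataFrame({
    "Customer_id": ["K409198", "S120509", "K191035", "D90850", "K662701", "K662701"],
    "Outage_sec_perweek": [6.97, 12.01, 10.25, -1.35, 8.22, 8.22],
    "Income": [28000.0, np.nan, 30500.0, 29500.0, 31000.0, 31000.0],
})

report = summarize_anomalies(toy, non_negative=["Outage_sec_perweek"])
for name, value in report.items():
    print(name, value)
```

Calling the same helper on `churn_df` would give a one-page summary to review before deciding how to treat each anomaly; histograms from Matplotlib could then be used to inspect the flagged columns visually.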

## <span style="color:red">Part III: Data Cleaning</span>

#### D.  Summarize the data-cleaning process by doing the following:

1.  Describe the findings, including all anomalies, from the implementation of the data-cleaning plan from part C.

2.  Justify your methods for mitigating each type of discovered anomaly in the data set.

3.  Summarize the outcome from the implementation of each data-cleaning step.

4.  Provide the code used to mitigate anomalies.

5.  Provide a copy of the cleaned data set.

6.  Summarize the limitations of the data-cleaning process.

7.  Discuss how the limitations in part D6 affect the analysis of the question or decision from part A.

### <span style="color:green">Data-cleaning process description</span>

#### E.  Apply principal component analysis (PCA) to identify the significant features of the data set by doing the following:

1.  List the principal components in the data set.

2.  Describe how you identified the principal components of the data set.

3.  Describe how the organization can benefit from the results of the PCA.

### <span style="color:green">PCA description</span>
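As a rough sketch of how principal components might be identified, the cell below runs PCA with NumPy alone on a synthetic matrix `X`. The data, the seed, and all variable names are assumptions for illustration; the steps are standardizing each column, taking the eigen-decomposition of the resulting correlation matrix, and sorting components by explained variance. Scikit-learn's `PCA` class would be the more usual route in practice.

```python
import numpy as np

# Synthetic numeric data: two strongly related columns plus one independent column.
rng = np.random.default_rng(0)
n_rows = 200
base = rng.normal(size=n_rows)
X = np.column_stack([
    base + 0.1 * rng.normal(size=n_rows),
    base + 0.1 * rng.normal(size=n_rows),
    rng.normal(size=n_rows),
])

# Standardize each column so no variable dominates because of its scale.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# The covariance of standardized data is the correlation matrix.
corr = np.cov(Z, rowvar=False)

# eigh suits symmetric matrices; it returns eigenvalues in ascending order,
# so reorder them from largest to smallest.
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
loadings = eigenvectors[:, order]   # one column of loadings per component

# Share of total variance carried by each component, and the running total.
explained_ratio = eigenvalues / eigenvalues.sum()
cumulative_ratio = np.cumsum(explained_ratio)

# Project the standardized rows onto the components.
scores = Z @ loadings

print(explained_ratio)
print(cumulative_ratio)
```

The cumulative ratio is one common way to decide how many components to keep, for example by retaining enough components to cover a chosen share of the variance.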

## <span style="color:red">Part IV: Supporting Documents</span>

#### F.  Provide a Panopto recording that demonstrates the warning- and error-free functionality of the code used to support the discovery of anomalies and the data cleaning process and summarizes the programming environment.


#### G.  Reference the web sources used to acquire segments of third-party code to support the application. Be sure the web sources are reliable.

#### H.  Acknowledge sources, using in-text citations and references, for content that is quoted, paraphrased, or summarized.

#### I.  Demonstrate professional communication in the content and presentation of your submission.