# Performance Assessment | D206 Data Cleaning
&emsp;Ryan L. Buchanan
<br>&emsp;Student ID:  001826691
<br>&emsp;Masters Data Analytics (12/01/2020)
<br>&emsp;Program Mentor:  Dan Estes
<br>&emsp;(385) 432-9281 (MST)
<br>&emsp;rbuch49@wgu.edu

## <span style="color:red">Part  I: Research Question</span>

### <span style="color:green"><b>A. Question or Decision</b>:</span>
Can we determine which individual customers are at high risk of churn?  And, can we determine which features are most significant to churn?

### <span style="color:green"><b>A1. Alternative Question</b>:</span>
Also, are there certain survey responses that correlate with customer churn?

### <span style="color:green"><b>B. Required Variables</b>:</span>
The data set consists of 10,000 customer records from a popular telecommunications company. The dependent variable (target) in question is whether each customer has continued or discontinued service within the last month.  This column is titled "Churn."  
Independent variables or predictors that may lead to identifying a relationship with the dependent variable of "Churn" within the data set include: 
1. Services that each customer signed up for (for example, multiple phone lines, technical support add-ons or streaming media) 
2. Customer account information (customers' tenure with the company, payment methods, bandwidth usage, etc.)
3. Customer demographics (gender, marital status, income, etc.).  
4. Finally, there are eight independent variables that represent survey responses on the customer-perceived importance of company services and features.  

The data is both numerical (as in the yearly GB bandwidth usage; customer annual income) and categorical (a "Yes" or "No" for Churn; customer job).
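
The split between numerical and categorical columns can be checked programmatically. The following is a minimal sketch on a made-up sample; the column names and values are illustrative assumptions and are not taken from the actual data set.

```python
import pandas as pd

# Small made-up sample; column names and values are illustrative assumptions
sample = pd.DataFrame({
    "Churn": ["Yes", "No", "No"],
    "Job": ["Engineer", "Teacher", "Nurse"],
    "Income": [52000.0, 38500.0, 61250.0],
    "Bandwidth_GB_Year": [904.5, 800.9, 2054.7],
})

# Numeric columns hold measured quantities
numeric_cols = sample.select_dtypes(include="number").columns.tolist()

# Every non-numeric column is treated as categorical in this sketch
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```

Knowing which columns fall in each group helps decide later whether a mean or median (numerical) or a mode (categorical) is the sensible fill value.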

## <span style="color:red">Part II: Data-Cleaning Plan</span>

### C. Explanation of data cleaning plan
1. Plan proposal to identify anomalies, <i>including relevant techniques & specific steps</i> 
2.  Justify your approach for assessing the quality of the data, include:
<br>&ensp; •  characteristics of the data being assessed,
<br>&ensp; •  the approach used to assess the quality.

3.  Justify your selected programming language and any libraries and packages that will support the data-cleaning process.

4.  Provide the code you will use to identify the anomalies in the data.

### <span style="color:green"><b>C1. Plan to Find Anomalies</b>:</span>
My approach will include:
<br>&ensp; 1. Backing up my data and the process I am following as a copy to my machine and, since this is a manageable data set, to GitHub using the command line and Git Bash;
<br>&ensp; 2. Reading the data set into Python using the Pandas read_csv command;
<br>&ensp; 3. Naming the data set as the variable "churn_df" and subsequent useful slices of the dataframe as "data"; 
<br>&ensp; 4. Examining coding errors, including missing data, in the collection of the data set;
<br>&ensp; 5. Finding outliers that may create or hide statistical significance using histograms;
<br>&ensp; 6. Imputing missing values with meaningful measures of central tendency (mean, median or mode), or simply removing outliers that are several standard deviations above the mean.
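
Step 6 of the plan can be sketched as follows. This is a minimal illustration on made-up values, not the final cleaning code: the column names, the choice of the median, and the cutoff of 2 standard deviations (lowered because the sample has only eight rows) are all assumptions.

```python
import numpy as np
import pandas as pd

# Made-up values; "Tenure" and "Income" are illustrative column names
data = pd.DataFrame({
    "Tenure": [1.0, 5.0, np.nan, 12.0, 60.0, np.nan, 8.0, 3.0],
    "Income": [30000.0, 42000.0, 38000.0, 41000.0,
               35000.0, 39000.0, 36000.0, 900000.0],
})

# Impute missing tenure with the median, which is less sensitive to skew than the mean
tenure_median = data["Tenure"].median()
data["Tenure"] = data["Tenure"].fillna(tenure_median)

# Z-score: how many standard deviations each income lies from the mean
z = (data["Income"] - data["Income"].mean()) / data["Income"].std()

# Keep only rows whose income is at most 2 standard deviations above the mean
# (assumed cutoff for this tiny sample)
cleaned = data[z <= 2]

print(tenure_median)
print(len(data), "rows before,", len(cleaned), "rows after")
```

Whether to impute or to remove depends on the column: removing rows discards information in every other field of that record, so imputation is usually the gentler first choice.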

### <span style="color:green"><b>C2. Justification of Approach</b>:</span>
Though the data set is inexplicably missing quite a few values in apparently random columns (such as the many NAs in customer tenure with the company), this plan is a reasonable first pass at putting the data in better working order without needing to revisit the initial data collection or query the data gatherers about the reasons for the missing information.

### <span style="color:green"><b>C3. Justification of Tools</b>:</span>
I will use the Python programming language because I have some background in it, having studied machine learning independently over the last year before beginning this master's program, and because it can perform many tasks right "out of the box."  Python provides clean, intuitive and readable syntax that has become ubiquitous across the data science industry.  Also, I find Jupyter notebooks a convenient way to run code visually: the single-document markdown format displays code results and graphic visualizations inline and provides clear running documentation for future reference.  Importing established Python packages and libraries provides specially designed code to perform complex data science tasks, rather than building them from scratch.  These will include: 
<br>&ensp; • NumPy - to work with arrays
<br>&ensp; • Pandas - to load data sets
<br>&ensp; • Matplotlib - to plot charts
<br>&ensp; • Scikit-learn - for machine learning model classes
<br>&ensp; • SciPy - for mathematical problems, specifically linear algebra transformations
<br>&ensp; • Seaborn - for a high-level interface and attractive visualizations

A quick example of loading a data set into a variable is importing the Pandas library and calling its "read_csv" function, which returns a dataframe that we can then manipulate:
<span style="color:coral">
<br>&ensp; import pandas as pd
<br>&ensp; df = pd.read_csv('Data.csv')
</span>

### <span style="color:green"><b>C4. Provide the Code</b>:</span>
Code follows in subsequent cells:

### Standard imports

In [None]:
import numpy as np
import pandas as pd
from pandas import DataFrame

import matplotlib.pyplot as plt
%matplotlib inline

import scipy.stats
from sklearn.impute import SimpleImputer

#### Increase Jupyter display cell-width

In [None]:
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

#### Load data set into Pandas dataframe

In [None]:
churn_df = pd.read_csv('churn_raw_data.csv')

#### Display Churn dataframe

In [None]:
churn_df

In [None]:
# Now just the head of the dataframe
churn_df.head()

#### Describe Churn statistics

In [None]:
churn_df.describe()

#### List of Dataframe Columns

In [None]:
churn_df.columns

In [None]:
churn_df.index

In [None]:
churn_df.loc[0]

In [None]:
churn_df.info()

In [None]:
len(churn_df)

In [None]:
# Find number of records and columns of data set
churn_df.shape

In [None]:
# Add an index field
churn_df['index'] = range(len(churn_df))

In [None]:
churn_df.head()
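
The cells above load and inspect the data. A minimal sketch of how missing values and duplicate records might then be counted is shown here on a made-up sample, so that it runs without the raw file; the column names and values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Made-up records; column names and values are illustrative assumptions
sample = pd.DataFrame({
    "Customer_id": ["A1", "B2", "B2", "C3", "D4"],
    "Age": [34.0, 45.0, 45.0, np.nan, 27.0],
    "Tenure": [np.nan, 1.2, 1.2, np.nan, 15.7],
})

# Count missing values per column, keeping only columns that have any
missing_counts = sample.isnull().sum()
missing_counts = missing_counts[missing_counts > 0]
print(missing_counts)

# Count rows that are exact copies of an earlier row
duplicate_rows = int(sample.duplicated().sum())
print("Duplicate rows:", duplicate_rows)
```

The same two calls could be applied to `churn_df` once it is loaded, which would show which columns need imputation before any modeling.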

## <span style="color:red">Part III: Data Cleaning</span>

### D.  Summarize the data-cleaning process by doing the following:

1.  Describe the findings, including all anomalies, from the implementation of the data-cleaning plan from part C.

2.  Justify your methods for mitigating each type of discovered anomaly in the data set.

3.  Summarize the outcome from the implementation of each data-cleaning step.

4.  Provide the code used to mitigate anomalies.

5.  Provide a copy of the cleaned data set.

6.  Summarize the limitations of the data-cleaning process.

7.  Discuss how the limitations in part D6 affect the analysis of the question or decision from part A.

### <span style="color:green">D. Data Cleaning Summary</span>

D1.<b>Cleaning Findings</b>:

D2.<b>Justification of Mitigation Methods</b>:

D3.<b>Summary of Outcomes</b>:

D4.<b>Mitigation Code</b>:

D5.<b>Clean Data</b>: (see attached file 'churn_data_cleaned.csv')

D6.<b>Limitations</b>: Limitations given the telecom company data set are that the data are not coming from a warehouse.  In this scenario, it is as though I initiated and gathered the data.  So, I am not able to reach out to the people who organized and gathered this information and ask them why certain NAs are there, or why fields such as age or yearly bandwidth usage are missing information that might be relevant to answering questions about customer retention or churn.  In a real-world project, you would be able to go to the department where these people worked and fill in the empty fields or discover why fields were left blank.  
D7.

### E.  Apply principal component analysis (PCA) to identify the significant features of the data set by doing the following:

1.  List the principal components in the data set.

2.  Describe how you identified the principal components of the data set.

3.  Describe how the organization can benefit from the results of the PCA
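
One way the PCA steps above might be approached is sketched here with NumPy on synthetic data. This is not the analysis of the churn data itself: the column names, the synthetic values, and the use of the eigenvalue-greater-than-one (Kaiser) rule as the selection criterion are all assumptions. Scikit-learn's `PCA` class would be an alternative to the manual eigen-decomposition.

```python
import numpy as np
import pandas as pd

# Synthetic numeric data; column names are illustrative assumptions
rng = np.random.default_rng(0)
base = rng.normal(size=200)
numeric = pd.DataFrame({
    "Tenure": base + rng.normal(scale=0.1, size=200),
    "Bandwidth_GB_Year": base + rng.normal(scale=0.1, size=200),
    "Income": rng.normal(size=200),
})

# Standardize each column so no variable dominates because of its units
standardized = (numeric - numeric.mean()) / numeric.std()

# Eigen-decomposition of the correlation matrix yields the principal components
corr = np.corrcoef(standardized.to_numpy(), rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# eigh returns eigenvalues in ascending order; reverse so PC1 comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Loadings: contribution of each original variable to each component
loadings = pd.DataFrame(
    eigenvectors,
    index=numeric.columns,
    columns=[f"PC{i + 1}" for i in range(len(order))],
)

# Assumed criterion: keep components whose eigenvalue exceeds 1
kept = [name for name, value in zip(loadings.columns, eigenvalues) if value > 1]

print(loadings.round(2))
print("Eigenvalues:", eigenvalues.round(3))
print("Components kept:", kept)
```

In this synthetic example two of the three columns are built from the same underlying signal, so the first component absorbs most of their shared variance, which is the kind of redundancy PCA is meant to reveal.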

### <span style="color:green">E. PCA Application</span>

E1. <b>Principal Components</b>:

E2. <b>Criteria Used</b>:

E3. <b>Benefits</b>:

## <span style="color:red">Part IV: Supporting Documents</span>

### F.  Provide a Panopto recording that demonstrates the warning- and error-free functionality of the code used to support the discovery of anomalies and the data cleaning process and summarizes the programming environment.


### <span style="color:green">F. Video</span>

### G.  Reference the web sources used to acquire segments of third-party code to support the application. <span style="color:red">Be sure the web sources are reliable.</span>

### <span style="color:green">G. Sources for Third-Party Code</span>
Larose, C. D. & Larose, D. T. (2019). <i>Data Science: Using Python and R.</i>  John Wiley & Sons, Inc.

VanderPlas, J. (2017). <i>Python Data Science Handbook: Essential Tools for Working with Data.</i>  <br>&emsp;O'Reilly Media, Inc.

### H.  Acknowledge sources, <span style="color:red">using in-text citations</span> 

### <span style="color:green">H. Sources</span>
Ahmad, A. K., Jafar, A., & Aljoumaa, K. (2019, March 20). <i>Customer churn prediction in telecom using machine <br>&emsp;learning in big data platform</i>. Journal of Big Data.  <br>&emsp;https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

Altexsoft. (2019, March 27). <i>Customer Churn Prediction Using Machine Learning: Main Approaches and Models</i>.  <br>&emsp;Altexsoft.  <br>&emsp;https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/

Frohbose, F. (2020, November 24). <i>Machine Learning Case Study: Telco Customer Churn Prediction</i>.  <br>&emsp;Towards Data Science.  <br>&emsp;https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d

Mountain, A. (2014, August 11). <i>Data Cleaning</i>.  Better Evaluation.  <br>&emsp;https://www.betterevaluation.org/en/evaluation-options/data_cleaning