# Performance Assessment | D206 Data Cleaning
&emsp;Ryan L. Buchanan
<br>&emsp;Student ID:  001826691
<br>&emsp;Masters Data Analytics (12/01/2020)
<br>&emsp;Program Mentor:  Dan Estes
<br>&emsp;(385) 432-9281 (MST)
<br>&emsp;rbuch49@wgu.edu

## <span style="color:red">Part  I: Research Question</span>

### <span style="color:green"><b>A. Question or Decision</b>:</span>
Can we determine which individual customers are at high risk of churn?  And, can we determine which features are most significant to churn?

### <span style="color:green"><b>A1. Alternative Question</b>:</span>
Also, are there particular survey responses that correlate with customer churn?

### <span style="color:green"><b>B. Required Variables</b>:</span>
The data set is 10,000 customer records of a popular telecommunications company. The dependent variable (target) in question is whether or not each customer has continued or discontinued service within the last month.  This column is titled "Churn."  
Independent variables or predictors that may lead to identifying a relationship with the dependent variable of "Churn" within the data set include: 
1. Services that each customer signed up for (for example, multiple phone lines, technical support add-ons or streaming media) 
2. Customer account information (customers' tenure with the company, payment methods, bandwidth usage, etc.)
3. Customer demographics (gender, marital status, income, etc.).  
4. Finally, there are eight independent variables that represent survey responses on the customer-perceived importance of company services and features.  

The data is both numerical (as in the yearly GB bandwidth usage; customer annual income) and categorical (a "Yes" or "No" for Churn; customer job).

## <span style="color:red">Part II: Data-Cleaning Plan</span>

### C. Explanation of data cleaning plan
1. Plan proposal to identify anomalies, <i>including relevant techniques & specific steps</i> 
2.  Justify your approach for assessing the quality of the data, include:
<br>&ensp; •  characteristics of the data being assessed,
<br>&ensp; •  the approach used to assess the quality.

3.  Justify your selected programming language and any libraries and packages that will support the data-cleaning process.

4.  Provide the code you will use to identify the anomalies in the data.

### <span style="color:green"><b>C1. Plan to Find Anomalies</b>:</span>
My approach will include:
<br>&ensp; 1. Back up the data and a record of my process, both as a local copy and, since this is a manageable data set, to GitHub using the command line and Git Bash;
<br>&ensp; 2. Read the data set into Python using the Pandas read_csv command;
<br>&ensp; 3. Evaluate the data structure to better understand the input data;
<br>&ensp; 4. Name the data set as the variable "churn_df" and subsequent useful slices of the dataframe as "data";
<br>&ensp; 5. Examine the data set for coding errors, including missing data;
<br>&ensp; 6. Find outliers that may create or hide statistical significance using histograms;
<br>&ensp; 7. Impute missing values with a meaningful measure of central tendency (mean, median or mode), or simply remove outliers that are several standard deviations above the mean.
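
Steps 5 through 7 can be sketched roughly as follows. This is a minimal illustration on a small made-up frame, not the actual churn data. The helper names `flag_outliers` and `impute_median`, the toy `Income` values, and the cutoff of 3 standard deviations are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd


def flag_outliers(series: pd.Series, z_cutoff: float = 3.0) -> pd.Series:
    """Return a boolean mask marking values more than z_cutoff
    standard deviations away from the mean (hypothetical helper)."""
    z_scores = (series - series.mean()) / series.std()
    return z_scores.abs() > z_cutoff


def impute_median(series: pd.Series) -> pd.Series:
    """Fill missing values with the median of the observed values
    (hypothetical helper)."""
    return series.fillna(series.median())


# Toy data (made up): 20 ordinary incomes, one missing value, one extreme value
incomes = [30000.0 + 500 * i for i in range(20)] + [np.nan, 500000.0]
toy = pd.DataFrame({"Income": incomes})

# Step 5: count missing values per column
missing_counts = toy.isna().sum()

# Step 7: impute the missing value with the median
toy["Income"] = impute_median(toy["Income"])

# Step 6: flag candidate outliers by z-score
outlier_mask = flag_outliers(toy["Income"])

print(missing_counts)
print(toy[outlier_mask])
```

The median is used here because it is less sensitive to extreme values than the mean. Whether a flagged row should be removed or kept is a judgment call that depends on the column.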

### <span style="color:green"><b>C2. Justification of Approach</b>:</span>
Several columns, seemingly chosen at random, are missing a substantial share of their values (for example, the many NAs in customer tenure with the company).  Because I cannot revisit the initial data collection or ask the data-gatherers why information is missing, this approach is a reasonable first pass for putting the data in better working order.

### <span style="color:green"><b>C3. Justification of Tools</b>:</span>
I will use the Python programming language because I have some background in it, having studied machine learning independently over the last year before beginning this masters program, and because it can perform many tasks right "out of the box."  Python provides clean, intuitive and readable syntax and has become ubiquitous across the data science industry.  I also find Jupyter notebooks a convenient way to run code visually: the single-document markdown format displays code results and graphic visualizations inline and provides clear running documentation for future reference.  Importing established Python packages and libraries provides purpose-built code for complex data science tasks, so I do not have to build them from scratch.  These will include: 
<br>&ensp; • NumPy - to work with arrays
<br>&ensp; • Pandas - to load data sets
<br>&ensp; • Matplotlib - to plot charts
<br>&ensp; • Scikit-learn - for machine learning model classes
<br>&ensp; • SciPy - for mathematical problems, specifically linear algebra transformations
<br>&ensp; • Seaborn - for a high-level interface and attractive visualizations

A quick example of loading a data set efficiently is to import the Pandas library and call its "read_csv" function, which lets us manipulate the data as a dataframe:
<span style="color:coral">
<br>&ensp; import pandas as pd
<br>&ensp; df = pd.read_csv('Data.csv')
</span>
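
The raw file appears to carry its own row-number column, which shows up as `Unnamed: 0` in the column listing later in this notebook. One possible way to handle that at load time, sketched here as an assumption rather than the approach actually taken, is to pass `index_col=0` to `read_csv`. The inline CSV text is a made-up stand-in for the real file.

```python
import io

import pandas as pd

# Made-up CSV text whose first header cell is blank, mimicking a saved row index
csv_text = ",CaseOrder,Churn\n1,1,No\n2,2,Yes\n3,3,No\n"

# Default load: the blank header becomes a column named "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# Treat the first column as the row index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(df_default.columns.tolist())
print(df_indexed.columns.tolist())
```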

### <span style="color:green"><b>C4. Provide the Code</b>:</span>
Code follows in subsequent cells:

In [1]:
# Standard imports
import numpy as np
import pandas as pd
from pandas import DataFrame
from scipy.stats import norm, skew
from scipy import stats
import statsmodels.api as sm

In [2]:
# Scikit-learn modules for data cleaning
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

In [3]:
# Standard libraries for data visualization
import seaborn as sn
color = sn.color_palette()
import matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
import matplotlib.pylab as pylab
import matplotlib.ticker as mtick
from IPython.display import display
pd.options.display.max_columns = None
from pandas.plotting import scatter_matrix
from sklearn.metrics import roc_curve

In [4]:
# Utility libraries imports
import random
import os
import re
import sys
import timeit
import string
import time  # call time.time() for timestamps; "from time import time" would shadow this module
from datetime import datetime
from dateutil.parser import parse
import joblib

In [5]:
# Increase Jupyter display cell-width
from IPython.display import display, HTML
display(HTML("<style>.container { width:75% !important; }</style>"))

In [8]:
# Load data set into Pandas dataframe
churn_df = pd.read_csv('churn_raw_data.csv')

In [9]:
# Display Churn dataframe
churn_df

Unnamed: 0.1,Unnamed: 0,CaseOrder,Customer_id,Interaction,City,State,County,Zip,Lat,Lng,Population,Area,Timezone,Job,Children,Age,Education,Employment,Income,Marital,Gender,Churn,Outage_sec_perweek,Email,Contacts,Yearly_equip_failure,Techie,Contract,Port_modem,Tablet,InternetService,Phone,Multiple,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,PaymentMethod,Tenure,MonthlyCharge,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8
0,1,1,K409198,aa90260b-4141-4a24-8e36-b04ce1f4f77b,Point Baker,AK,Prince of Wales-Hyder,99927,56.25100,-133.37571,38,Urban,America/Sitka,Environmental health practitioner,,68.0,Master's Degree,Part Time,28561.99,Widowed,Male,No,6.972566,10,0,1,No,One year,Yes,Yes,Fiber Optic,Yes,No,Yes,Yes,No,No,No,Yes,Yes,Credit Card (automatic),6.795513,171.449762,904.536110,5,5,5,3,4,4,3,4
1,2,2,S120509,fb76459f-c047-4a9d-8af9-e0f7d4ac2524,West Branch,MI,Ogemaw,48661,44.32893,-84.24080,10446,Urban,America/Detroit,"Programmer, multimedia",1.0,27.0,Regular High School Diploma,Retired,21704.77,Married,Female,Yes,12.014541,12,0,1,Yes,Month-to-month,No,Yes,Fiber Optic,Yes,Yes,Yes,No,No,No,Yes,Yes,Yes,Bank Transfer(automatic),1.156681,242.948015,800.982766,3,4,3,3,4,3,4,4
2,3,3,K191035,344d114c-3736-4be5-98f7-c72c281e2d35,Yamhill,OR,Yamhill,97148,45.35589,-123.24657,3735,Urban,America/Los_Angeles,Chief Financial Officer,4.0,50.0,Regular High School Diploma,Student,,Widowed,Female,No,10.245616,9,0,1,Yes,Two Year,Yes,No,DSL,Yes,Yes,No,No,No,No,No,Yes,Yes,Credit Card (automatic),15.754144,159.440398,2054.706961,4,4,2,4,4,3,3,3
3,4,4,D90850,abfa2b40-2d43-4994-b15a-989b8c79e311,Del Mar,CA,San Diego,92014,32.96687,-117.24798,13863,Suburban,America/Los_Angeles,Solicitor,1.0,48.0,Doctorate Degree,Retired,18925.23,Married,Male,No,15.206193,15,2,0,Yes,Two Year,No,No,DSL,Yes,No,Yes,No,No,No,Yes,No,Yes,Mailed Check,17.087227,120.249493,2164.579412,4,4,4,2,5,4,3,3
4,5,5,K662701,68a861fd-0d20-4e51-a587-8a90407ee574,Needville,TX,Fort Bend,77461,29.38012,-95.80673,11352,Suburban,America/Chicago,Medical illustrator,0.0,83.0,Master's Degree,Student,40074.19,Separated,Male,Yes,8.960316,16,2,1,No,Month-to-month,Yes,No,Fiber Optic,No,No,No,No,No,Yes,Yes,No,No,Mailed Check,1.670972,150.761216,271.493436,4,4,4,3,4,4,4,5
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,9996,M324793,45deb5a2-ae04-4518-bf0b-c82db8dbe4a4,Mount Holly,VT,Rutland,5758,43.43391,-72.78734,640,Rural,America/New_York,Sport and exercise psychologist,3.0,,"Some College, Less than 1 Year",Retired,55723.74,Married,Male,No,9.265392,12,2,0,,Month-to-month,Yes,Yes,DSL,,Yes,No,Yes,Yes,No,No,No,No,Electronic Check,68.197130,159.828800,6511.253000,3,2,3,3,4,3,2,3
9996,9997,9997,D861732,6e96b921-0c09-4993-bbda-a1ac6411061a,Clarksville,TN,Montgomery,37042,36.56907,-87.41694,77168,Rural,America/Chicago,Consulting civil engineer,4.0,48.0,Regular High School Diploma,Part Time,,Divorced,Male,No,8.115849,15,2,0,,Two Year,No,No,Fiber Optic,,Yes,Yes,Yes,Yes,No,Yes,No,No,Electronic Check,61.040370,208.856400,5695.952000,4,5,5,4,4,5,2,5
9997,9998,9998,I243405,e8307ddf-9a01-4fff-bc59-4742e03fd24f,Mobeetie,TX,Wheeler,79061,35.52039,-100.44180,406,Rural,America/Chicago,IT technical support officer,,,Nursery School to 8th Grade,Full Time,,Never Married,Female,No,4.837696,10,0,0,No,Month-to-month,No,No,Fiber Optic,Yes,Yes,Yes,Yes,No,No,No,No,Yes,Bank Transfer(automatic),,168.220900,4159.306000,4,4,4,4,4,4,4,5
9998,9999,9999,I641617,3775ccfc-0052-4107-81ae-9657f81ecdf3,Carrollton,GA,Carroll,30117,33.58016,-85.13241,35575,Urban,America/New_York,Water engineer,1.0,39.0,Bachelor's Degree,Full Time,16667.58,Separated,Male,No,12.076460,14,1,0,No,Two Year,No,Yes,Fiber Optic,No,Yes,No,No,No,Yes,Yes,Yes,Yes,Credit Card (automatic),71.095600,252.628600,6468.457000,4,4,6,4,3,3,5,4


In [10]:
# Now just the head of the dataframe
churn_df.head()

Unnamed: 0.1,Unnamed: 0,CaseOrder,Customer_id,Interaction,City,State,County,Zip,Lat,Lng,Population,Area,Timezone,Job,Children,Age,Education,Employment,Income,Marital,Gender,Churn,Outage_sec_perweek,Email,Contacts,Yearly_equip_failure,Techie,Contract,Port_modem,Tablet,InternetService,Phone,Multiple,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,PaperlessBilling,PaymentMethod,Tenure,MonthlyCharge,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8
0,1,1,K409198,aa90260b-4141-4a24-8e36-b04ce1f4f77b,Point Baker,AK,Prince of Wales-Hyder,99927,56.251,-133.37571,38,Urban,America/Sitka,Environmental health practitioner,,68.0,Master's Degree,Part Time,28561.99,Widowed,Male,No,6.972566,10,0,1,No,One year,Yes,Yes,Fiber Optic,Yes,No,Yes,Yes,No,No,No,Yes,Yes,Credit Card (automatic),6.795513,171.449762,904.53611,5,5,5,3,4,4,3,4
1,2,2,S120509,fb76459f-c047-4a9d-8af9-e0f7d4ac2524,West Branch,MI,Ogemaw,48661,44.32893,-84.2408,10446,Urban,America/Detroit,"Programmer, multimedia",1.0,27.0,Regular High School Diploma,Retired,21704.77,Married,Female,Yes,12.014541,12,0,1,Yes,Month-to-month,No,Yes,Fiber Optic,Yes,Yes,Yes,No,No,No,Yes,Yes,Yes,Bank Transfer(automatic),1.156681,242.948015,800.982766,3,4,3,3,4,3,4,4
2,3,3,K191035,344d114c-3736-4be5-98f7-c72c281e2d35,Yamhill,OR,Yamhill,97148,45.35589,-123.24657,3735,Urban,America/Los_Angeles,Chief Financial Officer,4.0,50.0,Regular High School Diploma,Student,,Widowed,Female,No,10.245616,9,0,1,Yes,Two Year,Yes,No,DSL,Yes,Yes,No,No,No,No,No,Yes,Yes,Credit Card (automatic),15.754144,159.440398,2054.706961,4,4,2,4,4,3,3,3
3,4,4,D90850,abfa2b40-2d43-4994-b15a-989b8c79e311,Del Mar,CA,San Diego,92014,32.96687,-117.24798,13863,Suburban,America/Los_Angeles,Solicitor,1.0,48.0,Doctorate Degree,Retired,18925.23,Married,Male,No,15.206193,15,2,0,Yes,Two Year,No,No,DSL,Yes,No,Yes,No,No,No,Yes,No,Yes,Mailed Check,17.087227,120.249493,2164.579412,4,4,4,2,5,4,3,3
4,5,5,K662701,68a861fd-0d20-4e51-a587-8a90407ee574,Needville,TX,Fort Bend,77461,29.38012,-95.80673,11352,Suburban,America/Chicago,Medical illustrator,0.0,83.0,Master's Degree,Student,40074.19,Separated,Male,Yes,8.960316,16,2,1,No,Month-to-month,Yes,No,Fiber Optic,No,No,No,No,No,Yes,Yes,No,No,Mailed Check,1.670972,150.761216,271.493436,4,4,4,3,4,4,4,5


In [11]:
# List of Dataframe Columns
churn_df.columns

Index(['Unnamed: 0', 'CaseOrder', 'Customer_id', 'Interaction', 'City',
       'State', 'County', 'Zip', 'Lat', 'Lng', 'Population', 'Area',
       'Timezone', 'Job', 'Children', 'Age', 'Education', 'Employment',
       'Income', 'Marital', 'Gender', 'Churn', 'Outage_sec_perweek', 'Email',
       'Contacts', 'Yearly_equip_failure', 'Techie', 'Contract', 'Port_modem',
       'Tablet', 'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',
       'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
       'StreamingMovies', 'PaperlessBilling', 'PaymentMethod', 'Tenure',
       'MonthlyCharge', 'Bandwidth_GB_Year', 'item1', 'item2', 'item3',
       'item4', 'item5', 'item6', 'item7', 'item8'],
      dtype='object')

In [12]:
# Describe Churn statistics
churn_df.describe()

Unnamed: 0.1,Unnamed: 0,CaseOrder,Zip,Lat,Lng,Population,Children,Age,Income,Outage_sec_perweek,Email,Contacts,Yearly_equip_failure,Tenure,MonthlyCharge,Bandwidth_GB_Year,item1,item2,item3,item4,item5,item6,item7,item8
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,7505.0,7525.0,7510.0,10000.0,10000.0,10000.0,10000.0,9069.0,10000.0,8979.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,5000.5,5000.5,49153.3196,38.757567,-90.782536,9756.5624,2.095936,53.275748,39936.762226,11.452955,12.016,0.9942,0.398,34.498858,174.076305,3398.842752,3.4908,3.5051,3.487,3.4975,3.4929,3.4973,3.5095,3.4956
std,2886.89568,2886.89568,27532.196108,5.437389,15.156142,14432.698671,2.154758,20.753928,28358.469482,7.025921,3.025898,0.988466,0.635953,26.438904,43.335473,2187.396807,1.037797,1.034641,1.027977,1.025816,1.024819,1.033586,1.028502,1.028633
min,1.0,1.0,601.0,17.96612,-171.68815,0.0,0.0,18.0,740.66,-1.348571,1.0,0.0,0.0,1.000259,77.50523,155.506715,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0
25%,2500.75,2500.75,26292.5,35.341828,-97.082813,738.0,0.0,35.0,19285.5225,8.054362,10.0,0.0,0.0,7.890442,141.071078,1234.110529,3.0,3.0,3.0,3.0,3.0,3.0,3.0,3.0
50%,5000.5,5000.5,48869.5,39.3958,-87.9188,2910.5,1.0,53.0,33186.785,10.202896,12.0,1.0,0.0,36.19603,169.9154,3382.424,3.0,4.0,3.0,3.0,3.0,3.0,4.0,3.0
75%,7500.25,7500.25,71866.5,42.106908,-80.088745,13168.0,3.0,71.0,53472.395,12.487644,14.0,2.0,1.0,61.42667,203.777441,5587.0965,4.0,4.0,4.0,4.0,4.0,4.0,4.0,4.0
max,10000.0,10000.0,99929.0,70.64066,-65.66785,111850.0,10.0,89.0,258900.7,47.04928,23.0,7.0,6.0,71.99928,315.8786,7158.982,7.0,7.0,8.0,7.0,7.0,8.0,7.0,8.0
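
The summary above reports a minimum of -1.348571 for `Outage_sec_perweek`, which is presumably not a valid duration. A minimal sketch of how such impossible values could be flagged follows. The toy frame and the names `negative_mask` and `negative_rows` are illustrative assumptions. In the notebook the column would come from `churn_df`.

```python
import pandas as pd

# Toy stand-in for the real column (values are made up)
toy = pd.DataFrame({"Outage_sec_perweek": [6.97, 12.01, -1.35, 10.25, 8.96]})

# A duration cannot be negative, so flag any value below zero
negative_mask = toy["Outage_sec_perweek"] < 0
negative_rows = toy[negative_mask]

print(f"{int(negative_mask.sum())} row(s) with a negative outage value")
print(negative_rows)
```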


In [14]:
# Review data types (numerical and categorical) in data set
churn_df.dtypes

Unnamed: 0                int64
CaseOrder                 int64
Customer_id              object
Interaction              object
City                     object
State                    object
County                   object
Zip                       int64
Lat                     float64
Lng                     float64
Population                int64
Area                     object
Timezone                 object
Job                      object
Children                float64
Age                     float64
Education                object
Employment               object
Income                  float64
Marital                  object
Gender                   object
Churn                    object
Outage_sec_perweek      float64
Email                     int64
Contacts                  int64
Yearly_equip_failure      int64
Techie                   object
Contract                 object
Port_modem               object
Tablet                   object
InternetService          object
Phone   

In [15]:
# Group column names by data type to re-validate numerical vs. categorical columns
churn_df.columns.to_series().groupby(churn_df.dtypes).groups

{dtype('int64'): Index(['Unnamed: 0', 'CaseOrder', 'Zip', 'Population', 'Email', 'Contacts',
        'Yearly_equip_failure', 'item1', 'item2', 'item3', 'item4', 'item5',
        'item6', 'item7', 'item8'],
       dtype='object'),
 dtype('float64'): Index(['Lat', 'Lng', 'Children', 'Age', 'Income', 'Outage_sec_perweek',
        'Tenure', 'MonthlyCharge', 'Bandwidth_GB_Year'],
       dtype='object'),
 dtype('O'): Index(['Customer_id', 'Interaction', 'City', 'State', 'County', 'Area',
        'Timezone', 'Job', 'Education', 'Employment', 'Marital', 'Gender',
        'Churn', 'Techie', 'Contract', 'Port_modem', 'Tablet',
        'InternetService', 'Phone', 'Multiple', 'OnlineSecurity',
        'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV',
        'StreamingMovies', 'PaperlessBilling', 'PaymentMethod'],
       dtype='object')}
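
`Zip` is read as `int64`, so ZIP codes that begin with zero lose their leading digits: the describe output shows a minimum of 601, and one displayed Vermont record has 5758. One possible repair, sketched here on toy values, is to convert the column to a zero-padded string. Treating ZIP codes as five-character strings is an assumption about the intended format.

```python
import pandas as pd

# Toy values taken from the displayed output: two of them lost leading zeros
toy = pd.DataFrame({"Zip": [99927, 5758, 601]})

# Convert to string and left-pad with zeros to five characters
toy["Zip"] = toy["Zip"].astype(str).str.zfill(5)

print(toy["Zip"].tolist())
```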

In [16]:
# Summarize columns, non-null counts and data types
churn_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 52 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Unnamed: 0            10000 non-null  int64  
 1   CaseOrder             10000 non-null  int64  
 2   Customer_id           10000 non-null  object 
 3   Interaction           10000 non-null  object 
 4   City                  10000 non-null  object 
 5   State                 10000 non-null  object 
 6   County                10000 non-null  object 
 7   Zip                   10000 non-null  int64  
 8   Lat                   10000 non-null  float64
 9   Lng                   10000 non-null  float64
 10  Population            10000 non-null  int64  
 11  Area                  10000 non-null  object 
 12  Timezone              10000 non-null  object 
 13  Job                   10000 non-null  object 
 14  Children              7505 non-null   float64
 15  Age                 

In [17]:
# Review missing data from columns
churn_df.isna().any()

Unnamed: 0              False
CaseOrder               False
Customer_id             False
Interaction             False
City                    False
State                   False
County                  False
Zip                     False
Lat                     False
Lng                     False
Population              False
Area                    False
Timezone                False
Job                     False
Children                 True
Age                      True
Education               False
Employment              False
Income                   True
Marital                 False
Gender                  False
Churn                   False
Outage_sec_perweek      False
Email                   False
Contacts                False
Yearly_equip_failure    False
Techie                   True
Contract                False
Port_modem              False
Tablet                  False
InternetService         False
Phone                    True
Multiple                False
OnlineSecu
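
`isna().any()` only reports whether a column has gaps. To size the problem, counts and percentages per column can be more useful. A small sketch, using a toy frame with made-up values (the `summary` name is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; two columns have gaps, one is complete
toy = pd.DataFrame({
    "Children": [1.0, np.nan, 4.0, np.nan],
    "Age": [68.0, 27.0, np.nan, 48.0],
    "Churn": ["No", "Yes", "No", "No"],
})

# Count missing values per column and express them as a percentage of rows
missing = toy.isna().sum()
summary = pd.DataFrame({
    "n_missing": missing,
    "pct_missing": 100 * missing / len(toy),
})

# Keep only columns that actually have gaps, largest first
summary = summary[summary["n_missing"] > 0].sort_values("n_missing", ascending=False)

print(summary)
```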

In [None]:
# Inspect the row index
churn_df.index

In [None]:
# Display the first record
churn_df.loc[0]

In [None]:
# Count the number of records
len(churn_df)

In [None]:
# Find number of records and columns of data set
churn_df.shape

In [None]:
# Add an index field
churn_df['index'] = range(len(churn_df))

In [None]:
churn_df.head()

In [None]:
# Calculate Churn Rate
churn_df.Churn.value_counts() / len(churn_df)
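
An equivalent and slightly more idiomatic form uses `value_counts(normalize=True)`, which returns proportions directly. A sketch on made-up labels:

```python
import pandas as pd

# Made-up churn labels standing in for churn_df["Churn"]
churn = pd.Series(["No", "Yes", "No", "No"], name="Churn")

# Manual proportion: counts divided by the number of records
rates_manual = churn.value_counts() / len(churn)

# Same result using the built-in normalize option
rates_normalized = churn.value_counts(normalize=True)

print(rates_normalized)
```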

## <span style="color:red">Part III: Data Cleaning</span>

### D.  Summarize the data-cleaning process by doing the following:

1.  Describe the findings, including all anomalies, from the implementation of the data-cleaning plan from part C.

2.  Justify your methods for mitigating each type of discovered anomaly in the data set.

3.  Summarize the outcome from the implementation of each data-cleaning step.

4.  Provide the code used to mitigate anomalies.

5.  Provide a copy of the cleaned data set.

6.  Summarize the limitations of the data-cleaning process.

7.  Discuss how the limitations in part D6 affect the analysis of the question or decision from part A.

### <span style="color:green">D. Data Cleaning Summary</span>

D1.<b>Cleaning Findings</b>:

D2.<b>Justification of Mitigation Methods</b>:

D3.<b>Summary of Outcomes</b>:

D4.<b>Mitigation Code</b>:

D5.<b>Clean Data</b>: (see attached file 'churn_data_cleaned.csv')

D6.<b>Limitations</b>: One limitation of the telecom company data set is that the data do not come from a warehouse.  In this scenario, it is as though I initiated and gathered the data myself.  So, I am not able to reach out to the people who organized and gathered this information and ask them why certain NAs are there, or why fields such as age or yearly bandwidth used are missing information that might be relevant to answering questions about customer retention or churn.  In a real-world project, you would be able to go to the department where these people worked and fill in the empty fields or discover why fields were left blank.  
D7.

### E.  Apply principal component analysis (PCA) to identify the significant features of the data set by doing the following:

1.  List the principal components in the data set.

2.  Describe how you identified the principal components of the data set.

3.  Describe how the organization can benefit from the results of the PCA

### <span style="color:green">E. PCA Application</span>

E1. <b>Principal Components</b>:

E2. <b>Criteria Used</b>:

E3. <b>Benefits</b>:
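
As a hedged illustration only, the sketch below shows one common way principal components could be identified: standardize the numeric columns, take the eigen-decomposition of their correlation matrix, and keep components whose eigenvalue exceeds 1 (the Kaiser criterion). The synthetic data, the variable names, and the choice of the Kaiser rule are all assumptions. None of the numbers reflect results from the churn data.

```python
import numpy as np

# Synthetic data (assumption): two hidden factors, each observed twice with noise
rng = np.random.default_rng(0)
n_rows = 500
factor_a = rng.normal(size=n_rows)
factor_b = rng.normal(size=n_rows)
X = np.column_stack([
    factor_a + 0.1 * rng.normal(size=n_rows),
    factor_a + 0.1 * rng.normal(size=n_rows),
    factor_b + 0.1 * rng.normal(size=n_rows),
    factor_b + 0.1 * rng.normal(size=n_rows),
])

# Standardize each column to mean 0 and unit sample standard deviation
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix (eigh returns ascending order)
corr = np.corrcoef(Z, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)

# Reorder so the component with the largest variance comes first
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Share of total variance explained by each component
explained_ratio = eigenvalues / eigenvalues.sum()

# Kaiser criterion: keep components with eigenvalue greater than 1
n_keep = int((eigenvalues > 1).sum())

# Project the standardized data onto the retained components
scores = Z @ eigenvectors[:, :n_keep]

print("Eigenvalues:", np.round(eigenvalues, 3))
print("Components kept:", n_keep)
```

The columns of `eigenvectors` hold the loadings, which indicate how strongly each original variable contributes to each component.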

## <span style="color:red">Part IV: Supporting Documents</span>

### F.  Provide a Panopto recording that demonstrates the warning- and error-free functionality of the code used to support the discovery of anomalies and the data cleaning process and summarizes the programming environment.


### <span style="color:green">F. Video</span>

### G.  Reference the web sources used to acquire segments of third-party code to support the application. <span style="color:red">Be sure the web sources are reliable.</span>

### <span style="color:green">G. Sources for Third-Party Code</span>
Larose, C. D. & Larose, D. T. (2019). <i>Data Science: Using Python and R.</i>  John Wiley & Sons, Inc.

Sree. (2020, October 26). <i>Predict Customer Churn in Python.</i>  Towards Data Science.  <br>&emsp;https://towardsdatascience.com/predict-customer-churn-in-python-e8cd6d3aaa7

VanderPlas, J. (2017). <i>Python Data Science Handbook: Essential Tools for Working with Data.</i>  <br>&emsp;O'Reilly Media, Inc.

### H.  Acknowledge sources, <span style="color:red">using in-text citations</span> 

### <span style="color:green">H. Sources</span>
Ahmad, A. K., Jafar, A & Aljoumaa, K. (2019, March 20). <i>Customer churn prediction in telecom using machine <br>&emsp;learning in big data platform</i>. Journal of Big Data.  <br>&emsp;https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0191-6

Altexsoft. (2019, March 27). <i>Customer Churn Prediction Using Machine Learning: Main Approaches and Models</i>.  <br>&emsp;Altexsoft.  <br>&emsp;https://www.altexsoft.com/blog/business/customer-churn-prediction-for-subscription-businesses-using-machine-learning-main-approaches-and-models/

Frohbose, F. (2020, November 24). <i>Machine Learning Case Study: Telco Customer Churn Prediction</i>.  <br>&emsp;Towards Data Science.  <br>&emsp;https://towardsdatascience.com/machine-learning-case-study-telco-customer-churn-prediction-bc4be03c9e1d

Mountain, A. (2014, August 11). <i>Data Cleaning</i>.  Better Evaluation.  <br>&emsp;https://www.betterevaluation.org/en/evaluation-options/data_cleaning