# Sample Notebook

<p>Let's start here.  In Python, as in most programming languages, we can get very granular in how we interact with the computer, but we can also stand on the shoulders of giants and reuse their work.  That is what we will do as often as possible.  Other people have already figured out the gory details of how to interpret the format of an Excel file, so we don't have to.  I'd rather depend on their work and focus on the real value add for our projects: spending time doing something with the data!</p>
<br/>
So the first thing we are going to do is ask Python to load the code that knows how to read Excel files.  The code is stored in a "module" called pandas (keep in mind that case matters, so Pandas is not the same as pandas).  For simplicity, we are going to ask Python to make the utilities in this module available to us whenever we use the shortcut 'pd', which is the common alias for pandas.  On to the first set of code.

In [29]:
# Notice the '#' symbol?  This symbol tells Python that we don't want it to try to 'run' the code on this line of text
# This syntax is called a comment, and we can use it to clarify difficult code, or provide a hint as to what the next line (or several lines)
#  of code are meant to do

# Bring the Python module that can read Excel files into our scope so that we can use it, along with a few other helpful libraries
import pandas as pd
from scipy import stats
import numpy as np

In [6]:
# The two key things that pandas provides are the Series and the DataFrame.  A Series is like a list of values that has an index, such as a number,
# a date, or a string.  A DataFrame is a collection of Series that all share the same index.  You can think of a DataFrame as a spreadsheet:
# the index plays the role of the row numbers down the left side, and the column names play the role of the headers across the top
df = pd.read_excel('data/Churn_Dirty.xlsx', index_col='ID')
# shape gives us (number of rows, number of columns)
print(df.shape)
# The head function lets us see the first few rows of the data we have loaded from the Excel file
df.head()

(3333, 21)


ID,State,Account Length,Area Code,Phone,Intl Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn
1,KS,128,415,382-4657,no,yes,25,265.1,110,45.07,...,99,16.78,244.7,91,11.01,10.0,3,2.7,1,False
2,OH,107,415,371-7191,no,yes,26,161.6,123,27.47,...,103,16.62,254.4,103,11.45,13.7,3,3.7,1,False
3,NJ,137,415,358-1921,no,no,0,243.4,114,41.38,...,110,10.3,162.6,104,7.32,12.2,5,3.29,0,False
4,OH,84,408,375-9999,yes,no,0,299.4,71,50.9,...,88,5.26,196.9,89,8.86,6.6,7,1.78,2,False
5,OK,75,415,330-6626,yes,no,0,166.7,113,28.34,...,122,12.61,186.9,121,8.41,10.1,3,2.73,3,False
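
To make the Series and DataFrame idea from the cell above a little more concrete, here is a minimal sketch. The toy values and the names `state`, `day_calls` and `toy` are made up for illustration and are not part of the churn data.

```python
import pandas as pd

# Hypothetical toy data, loosely modeled on the columns shown above
ids = [1, 2, 3]
state = pd.Series(['KS', 'OH', 'NJ'], index=ids, name='State')
day_calls = pd.Series([110, 123, 114], index=ids, name='Day Calls')

# A DataFrame can be built from Series that share the same index
toy = pd.DataFrame({'State': state, 'Day Calls': day_calls})
toy.index.name = 'ID'

# Selecting one column gives back a Series
col = toy['Day Calls']

# .loc looks up a row by its index label (here the ID, not the position)
row = toy.loc[2]

print(toy)
```

Each column keeps the same ID labels, which is what lets pandas line the values up row by row.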


So now that we have a sense of what our dataframe looks like, we can see that it has a column called ID, which is the 'index' (the row name), and the header row across the top gives us the names of the columns.  We could look at the entire thing, but that is cumbersome.  What we'd rather have is a summary of the data that will help us understand what we are looking at.

In [4]:
df.describe()

,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.021002,437.182418,8.085209,179.775098,109.405341,30.562307,200.980348,100.026103,17.08354,208.269847,100.053105,9.035161,10.230693,5.678368,2.763057,1.561056
std,39.933132,42.37129,13.696524,54.467389,518.290257,9.259435,50.713844,20.361003,4.310668,428.510605,19.845998,2.292355,2.815939,69.251653,0.759342,1.317627
min,-73.0,408.0,-23.0,0.0,0.0,0.0,0.0,-147.0,0.0,23.2,-91.0,-6.94,-11.0,0.0,-2.54,-3.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.1,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.4,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,30000.0,59.64,363.7,170.0,30.91,24767.0,175.0,17.77,20.0,4000.0,5.4,9.0


Now, that is helpful.  For each numeric column we can see the number of rows (count), the mean, the standard deviation, the min, the max, and the quartiles (25%, 50%, 75%).
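
In the summary above, Day Calls has a mean of about 109.4 and a median (the 50% row) of 101, with a max of 30000. The sketch below uses a handful of made-up values to show how a single extreme entry can pull the mean away from the median. The numbers are hypothetical and only meant to illustrate how to read the output of describe.

```python
import pandas as pd

# Hypothetical values; one large entry stands in for a data-entry error
calls = pd.Series([90, 100, 101, 110, 30000], name='Day Calls')

# On a Series, describe() returns another Series indexed by statistic name
summary = calls.describe()
print(summary)

# The '50%' row is the median; compare it with the mean
print('mean:  ', summary['mean'])
print('median:', summary['50%'])

# The 'std' row is the sample standard deviation, the same value .std() returns
print('std matches calls.std():', summary['std'] == calls.std())
```

When the mean sits far above the 75% row, as it does here, that is usually a hint that a few very large values deserve a closer look.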

Wait a minute... it looks like Account Length has a value that is less than zero.  That doesn't make any sense.  Let's see if we can find out if this is a common issue or a one-off.

In [7]:
# This is a fancy way of saying, show me all the rows where 'Account Length' values are less than zero
df[df['Account Length'] < 0]

ID,State,Account Length,Area Code,Phone,Intl Plan,VMail Plan,VMail Message,Day Mins,Day Calls,Day Charge,...,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls,Churn
1950,WI,-73,415,419-4894,no,no,0,157.1,109,26.71,...,83,22.85,181.5,91,8.17,10.0,8,2.7,0,True
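
The filter in the cell above does two things in one line. The sketch below splits them apart on a small made-up frame. The name `toy` and its values are hypothetical, apart from the -73 that mirrors the row found above.

```python
import pandas as pd

# Hypothetical toy frame; the -73 mirrors the bad value found above
toy = pd.DataFrame(
    {'Account Length': [128, 107, -73, 84]},
    index=pd.Index([1, 2, 1950, 4], name='ID'),
)

# Step 1: the comparison produces a True/False Series, one entry per row
mask = toy['Account Length'] < 0

# Step 2: indexing the frame with that Series keeps only the rows marked True
negative_rows = toy[mask]

# Summing a boolean Series counts the True values
n_negative = int(mask.sum())

print(mask)
print(negative_rows)
print('rows below zero:', n_negative)
```

Counting the True values is a quick way to tell a one-off from a common problem without printing every matching row.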


Whew!  It looks like only one of them was below zero.  I'll bet someone just made a mistake and it should be positive.  Let's change it.  In this case, we'll just take the absolute value of the values (and while we are at it, let's do the same for the other columns with negative values that shouldn't be negative).

In [23]:
# Columns where a negative value is assumed to be a sign error
# Note: 'Intl Charge' also has a negative minimum (-2.54) in the summary above, but it is not in this list, so it stays negative
cols_to_fix = ['Account Length', 'VMail Message', 'Eve Calls', 'Night Calls',
               'Night Charge', 'Intl Mins', 'CustServ Calls']
df[cols_to_fix] = df[cols_to_fix].abs()
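
To check whether any column was missed, one option is to count the negative entries in every numeric column rather than reading the min row by eye. This is only a sketch: the helper name `count_negatives` and the toy frame are hypothetical, and on the real data you would pass `df` instead.

```python
import pandas as pd

def count_negatives(frame):
    """Return the number of negative entries for each numeric column that has any."""
    # Keep only numeric columns so text columns such as State are skipped
    numeric = frame.select_dtypes(include='number')
    counts = (numeric < 0).sum()
    return counts[counts > 0]

# Hypothetical toy frame standing in for df
toy = pd.DataFrame({
    'State': ['KS', 'OH', 'WI'],
    'Account Length': [128, 107, -73],
    'Intl Charge': [2.70, -2.54, 3.29],
    'Day Mins': [265.1, 161.6, 157.1],
})

negatives = count_negatives(toy)
print(negatives)
```

Running the same check again after the correction is a simple way to confirm that nothing negative is left behind.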

In [26]:
df.describe()

,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
count,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0,3333.0
mean,101.064806,437.182418,8.09901,179.775098,109.405341,30.562307,200.980348,100.114311,17.08354,208.269847,100.107711,9.039325,10.237294,5.678368,2.763057,1.562856
std,39.822106,42.37129,13.688365,54.467389,518.290257,9.259435,50.713844,19.922625,4.310668,428.510605,19.568609,2.275873,2.79184,69.251653,0.759342,1.315491
min,1.0,408.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,-2.54,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.6,87.0,14.16,167.1,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.4,100.0,17.12,201.2,100.0,9.05,10.3,4.0,2.78,1.0
75%,127.0,510.0,20.0,216.4,114.0,36.79,235.3,114.0,20.0,235.4,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,51.0,350.8,30000.0,59.64,363.7,170.0,30.91,24767.0,175.0,17.77,20.0,4000.0,5.4,9.0


# Now we should take care of removing the rows where we have outliers
For our purposes, we are going to assume an outlier is any value that is more than 3 standard deviations from the mean.
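
The "3 standard deviations from the mean" rule can be written out by hand as a z-score. Here is a minimal sketch with NumPy on made-up values. It uses the population standard deviation (NumPy's default, `ddof=0`), which is also the default for `scipy.stats.zscore`; pandas' `.std()` uses the sample version (`ddof=1`), so the two can differ slightly on small data.

```python
import numpy as np

# Hypothetical values: twenty ordinary readings plus one extreme entry
values = np.array([100.0] * 10 + [102.0] * 10 + [30000.0])

# z-score: how many standard deviations each value sits from the mean
z = (values - values.mean()) / values.std()

# Keep the values whose z-score is less than 3 in absolute value
keep = np.abs(z) < 3
cleaned = values[keep]

print('largest |z|:', np.abs(z).max())
print('values kept:', cleaned.size, 'of', values.size)
```

Note that the extreme value also inflates the mean and the standard deviation it is measured against, so with very few rows even a wild value may not reach a z-score of 3.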

In [39]:
# So let's keep only the rows where the column's z-score (how many standard deviations a value sits from the column mean) is less than 3 in absolute value (we are using abs because it doesn't matter if it's > 3 or < -3)
def remove_outliers(df_in, col_name):
    return df_in[(np.abs(stats.zscore(df_in[col_name])) < 3)]
    
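# The summary below (count 3318, Day Calls max 160.0, Night Mins max 395.0, Intl Calls max 20.0) suggests these three filters were applied in earlier runs of this cell
#df = remove_outliers(df, 'Day Calls')
#df = remove_outliers(df, 'Night Mins')
#df = remove_outliers(df, 'Intl Calls')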
df = remove_outliers(df, 'VMail Message')
df.describe()

,Account Length,Area Code,VMail Message,Day Mins,Day Calls,Day Charge,Eve Mins,Eve Calls,Eve Charge,Night Mins,Night Calls,Night Charge,Intl Mins,Intl Calls,Intl Charge,CustServ Calls
count,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0,3318.0
mean,101.07384,437.203134,8.053647,179.850271,100.556058,30.575084,201.052471,100.075347,17.089668,200.957866,100.095539,9.043192,10.238969,4.481314,2.763502,1.560579
std,39.857853,42.385653,13.630757,54.284426,19.732879,9.228328,50.708105,19.908648,4.310182,50.597661,19.585167,2.276942,2.792631,2.46289,0.759581,1.314941
min,1.0,408.0,0.0,2.6,42.0,0.44,0.0,0.0,0.0,23.2,33.0,1.04,0.0,0.0,-2.54,0.0
25%,74.0,408.0,0.0,143.7,87.0,24.43,166.625,87.0,14.1625,167.1,87.0,7.52,8.5,3.0,2.3,1.0
50%,101.0,415.0,0.0,179.4,101.0,30.5,201.5,100.0,17.13,201.35,100.0,9.06,10.3,4.0,2.78,1.0
75%,127.0,510.0,19.0,216.375,114.0,36.785,235.3,113.75,20.0,235.4,113.0,10.59,12.1,6.0,3.27,2.0
max,243.0,510.0,49.0,350.8,160.0,59.64,363.7,170.0,30.91,395.0,175.0,17.77,20.0,20.0,5.4,9.0
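
The cell above filters one column per call. If several columns need the same treatment, the calls can be folded into a loop. The sketch below is one way to do that, not the notebook's own method: the helper `remove_outliers_multi`, its `threshold` parameter, and the toy data are all hypothetical, and the z-score is computed with pandas directly (using `ddof=0`) instead of `stats.zscore`.

```python
import numpy as np
import pandas as pd

def remove_outliers_multi(frame, col_names, threshold=3.0):
    """Drop rows whose |z-score| reaches the threshold, one column at a time."""
    for col in col_names:
        # Mean and std are recomputed on the rows that survived earlier passes
        z = (frame[col] - frame[col].mean()) / frame[col].std(ddof=0)
        before = len(frame)
        frame = frame[np.abs(z) < threshold]
        print(f"{col}: dropped {before - len(frame)} rows")
    return frame

# Hypothetical toy data: each column has one extreme value, in different rows
toy = pd.DataFrame({
    'Day Calls': [100.0] * 10 + [102.0] * 10 + [30000.0],
    'Intl Calls': [4000.0] + [3.0] * 10 + [5.0] * 10,
})

cleaned = remove_outliers_multi(toy, ['Day Calls', 'Intl Calls'])
print(cleaned.describe())
```

Because each pass works on the rows left over from the previous one, the order of the columns can change which rows end up being dropped.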


# Looks like Area Code is being treated as a float, which doesn't make any sense.
So, we should likely let Pandas know this is more likely 