# Dealing with Data

### This notebook is attributed to Jarret Petrillo, GADS14

### This is the rickety scaffolding supporting a magnificent, triumphant statue.

## I.

### Let us raise a standard of respite for the wise and weary.  Observe'th the dataframe!

#### * We're importing two packages, numpy and pandas. Since we're importing them AS something, the numpy library will be accessible anywhere in the rest of the code by calling np.somefunction, and pandas by calling pd.somefunction.
#### * Why do we load each library separately, under its own name?  Without that there may be some ambiguity.
#### * For example the built-in function min(1,2) returns 1.  But numpy also has a function named min.  Running the min function from numpy with an input of (1,2) produces an error, since that function expects an array as its first input and treats the second input as an axis.  It would be confusing to write 'min' and not know which one we were going to get.

In [2]:
import numpy as np
import pandas as pd

In [3]:
min(1,2)

In [4]:
#if you uncomment this and run it it will produce an error
#np.min(1,2)

In [5]:
an_array=np.array([3, 5, 7])
print an_array.min()
print np.min(an_array)

#### Python's built-in min and numpy's min are different functions: one belongs to the Python language itself and the other to the numpy library.  Both an_array.min() and np.min(an_array) use the numpy version.
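To make the difference concrete, here is a minimal sketch. The variable names are made up for illustration, and the error from np.min(1, 2) is caught so that the cell keeps running instead of stopping.

```python
import numpy as np

values = np.array([3, 5, 7])  # hypothetical example array

builtin_result = min(1, 2)     # Python's built-in min compares its arguments
numpy_result = np.min(values)  # numpy's min reduces an array to one value

# np.min(1, 2) reads the 2 as an axis number.  A single number has no
# axis 2, so numpy raises an error, which we catch here.
try:
    np.min(1, 2)
    numpy_failed = False
except Exception:
    numpy_failed = True

print(builtin_result)
print(numpy_result)
print(numpy_failed)
```

Writing np.min rather than a bare min makes it obvious which of the two functions is being called.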

In [6]:
"""Attribute Information:
    -- 1. #3  (age)       
    -- 2. #4  (sex)       
    -- 3. #9  (cp)        
    -- 4. #10 (trestbps)  
    -- 5. #12 (chol)      
    -- 6. #16 (fbs)       
    -- 7. #19 (restecg)   
    -- 8. #32 (thalach)   
    -- 9. #38 (exang)     
    -- 10. #40 (oldpeak)   
    -- 11. #41 (slope)     
    -- 12. #44 (ca)        
    -- 13. #51 (thal)      
    -- 14. #58 (num)
"""

#this is a LIST of STRINGS.  Lists are contained in brackets.  Strings are contained in single or double quotes.
#each item in a list is separated by a comma.

header_row = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg', 'thalach', \
              'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num',]

#url is a string object.  We could check that out by running type(url)
url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data'

#a backslash (\) at the end of a line allows us to continue a command onto a new line, as in header_row above

#data frames are the work horse of data analysis.  In this line we're calling the 'read_csv' function 
#from the pandas library and passing it three inputs: url, header, and na_values.
#heart_data = pd.read_csv(url) would run fine
#if we did this header and na_values would be set to their DEFAULT values.  
#By explicitly writing them we override those values in order to get the function to do what we want it to do

heart_data = pd.read_csv(url, header=None, na_values='?')

#by setting header=None above we told the function that, unlike a nicely structured comma separated file (csv),
#this data doesn't have the first row as a list of variable names.  (header=0 would instead use up the first
#row of data as the column names.)  Because the names aren't there we have to set the 
#column names manually ourselves.  heart_data is a data frame object.  'columns' is an attribute of the object.  
#Attributes are accessible with a dot.
#for example .__name__ is the name of a function, and .__repr__ is the method that produces the string representation of an object.  
#These are internal attributes that you won't need to change, but it's good to know that every object, even 
#a function, has attributes.
#This is part of what we mean when we say Python is an object-oriented language.  Everything is an object and each type of object
#has its own set of operations that can be performed on it.  type(obj) gives the type of the object in question.
heart_data.columns = header_row
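The effect of the header input can be checked without downloading anything. The sketch below is only an illustration: it reads a made-up three-row sample from memory (the values and the shortened column list are assumptions, not the real file) and compares header=None with header=0.

```python
import io
import pandas as pd

# A made-up three-row sample with no header row; '?' marks a missing value.
sample = "63.0,1.0,?\n67.0,1.0,3.0\n41.0,0.0,0.0\n"

# header=None: every row is data, and pandas numbers the columns 0, 1, 2.
no_header = pd.read_csv(io.StringIO(sample), header=None, na_values='?')
no_header.columns = ['age', 'sex', 'ca']

# header=0: the first row is used up as the column names.
first_row_as_header = pd.read_csv(io.StringIO(sample), header=0, na_values='?')

print(no_header.shape)
print(first_row_as_header.shape)
print(no_header['ca'].isnull().sum())
```

With header=None all three rows survive as data, and the '?' is read as a missing value rather than as text.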

In [7]:
type(header_row)
type(url) 
#why does only one thing show up from this?  Because unless we specifically write "print", the notebook only displays
#the value of the last line of code in the cell.

In [8]:
type(heart_data)

In [9]:
print heart_data 
#in Python 2, print is a statement rather than a function.  Most functions in python are called as func(inputs) but print is special
#we print with: print 'string to print'

### Dataframes have columns.

In [10]:
columns = heart_data.columns

In [11]:
print columns

### Columns are a list of strings.

In [12]:
columns[0]
#columns behaves like a list (strictly speaking it's a pandas Index).  The way to access what's in a list is by writing a number in brackets.  
#0 is the first entry.
# -1 is the last.  
# -2 is the second to last.  etc.  
#Calling list[14] for a list with only 10 entries produces a well-known oft-cited error "list index out of range"

In [13]:
columns[2]

In [14]:
columns[-2]

In [15]:
#we use the colon to select a subset.  Everything from 0 to 3, including 0 but not including 3.
columns[0:3]

In [16]:
columns[0:2]
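The same indexing rules can be tried on a plain Python list. This is a small sketch with a shortened, made-up list of names; the out-of-range lookup is caught so the error message can be inspected.

```python
names = ['age', 'sex', 'cp', 'trestbps', 'chol']  # shortened example list

first = names[0]          # 0 is the first entry
last = names[-1]          # -1 is the last entry
second_to_last = names[-2]
first_three = names[0:3]  # entries 0, 1 and 2; entry 3 is not included

# Asking for an entry that does not exist raises an IndexError.
try:
    names[14]
except IndexError as err:
    message = str(err)

print(first_three)
print(message)
```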

### Strings are a list of characters.

In [17]:
#individual characters can be accessed in the same way as list entries.  
#Strings behave as if they were lists of characters.
column1 = columns[0]
column1

In [18]:
column1[0]

In [19]:
column1[0:2]

In [20]:
#this is a for loop.  It's used to repeat an action for each item in some list.  
#columns is a list.  In the first line we're taking each value of columns (first columns[0] then columns[1] 
#then columns[2] and so on) and, this is important, NAMING it 'name' so that in the rest of the code 
#(the indented portion is a code block) the entry we're looking at is always named 'name' although the 
#value of 'name' will change as we move through the columns list.
#The next lines say that if the first character (name[0] since name is a string) is equal to the string "t" then print it out.
for name in columns:
    if name[0] == "t":
        print name
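As a variation on the loop above, the matching names can be collected into a new list instead of printed one by one. This sketch uses a plain list of the same column names, and the list name starts_with_t is made up for the example.

```python
header_row = ['age', 'sex', 'cp', 'trestbps', 'chol', 'fbs', 'restecg',
              'thalach', 'exang', 'oldpeak', 'slope', 'ca', 'thal', 'num']

starts_with_t = []
for name in header_row:
    # name[0] is the first character of the current string
    if name[0] == "t":
        starts_with_t.append(name)

print(starts_with_t)
```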

### Characters are... characters.

In [21]:
#This shows a simple boolean expression
'a'=='a'

## II.

### Dataframes are a list of series.

In [22]:
#heart_data is a dataframe and it has different columns of data.  
#Each of these columns has a string name we can access by typing that string name into brackets next to the dataframe.
#Try calling each of the columns individually.  The resulting
#output is called a series; it's a one-dimensional column of data with an index.
heart_data['age']

In [23]:
#or, using a different notation
heart_data.age

In [24]:
type(heart_data['age'])

In [25]:
#pandas is a sophisticated data library and it even lets us retrieve more than one column at a time
#by writing a LIST of column names rather than the string ('age') we used in the block above.
heart_data[['age','num']]

In [26]:
#Here we get tricky.  We're submitting a list with only one item, a string with the column name we want.
#But because we're submitting a list, the output is a dataframe rather than a series.  At this point take a break.
#The purpose of knowing the right type is that some functions only operate on one type of object.
#For example "This" + "that" returns "Thisthat" while 1+1 returns 2.  Clearly the plus sign is doing two different things
#in the first case it's concatenating two strings, and in the second it's adding two numbers.  Integer addition would
#make no sense on a string.  In just the same way there is some analysis that can only be performed on dataframes and
#the functions aren't written to handle one-dimensional data series.  This is something just to keep in mind since
#it may result in rather deep and difficult-to-find errors.
print type(heart_data[['age']])
print type(heart_data['age'])

In [27]:
print heart_data['age'][0:10], "\n\n\n", heart_data[['age']][0:10]

In [29]:
#Note the subtle difference, one is a series and one is a dataframe
print heart_data['age'].shape
print heart_data[['age']].shape

In [30]:
print "This" + "that"
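The series-versus-dataframe distinction can be checked on a tiny frame without loading the heart data. The frame below is made up for illustration.

```python
import pandas as pd

toy = pd.DataFrame({'age': [63.0, 67.0, 41.0], 'num': [0, 2, 0]})

as_series = toy['age']   # a string in the brackets gives a series
as_frame = toy[['age']]  # a list in the brackets gives a dataframe

print(type(as_series))
print(type(as_frame))
print(as_series.shape)   # one dimension: just the number of rows
print(as_frame.shape)    # two dimensions: rows and columns
```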

### Series behave like lists.

In [31]:
#we're making a new variable 'age_column' and assigning it the value of the age column in the heart_data dataframe 
#(df for short)
age_column = heart_data['age']
print age_column.shape

In [32]:
#since series behave like lists we can select elements 0 through 34 the same way we would for lists
age_column[0:35]

## III.

### Dataframes contain different types of variables.

In [33]:
#another for loop
#for every string in columns name it 'col_name' and perform some operation with col_name.
#In this case we're writing its type.  First we select a column with heart_data[col_name] then
#we select the first element (since every element of the column will be the same type) with [0]
#and then we feed this as the only input into our function "type()"
for col_name in columns:
    print col_name, type(heart_data[col_name][0])

In [34]:
#this just prints our dataset.  It only works if it's small enough to show.
heart_data

### Dataframes contain integers.

In [35]:
num = heart_data['num']

In [36]:
type(num)

In [37]:
type(num[0])

### Dataframes can contain floating point numbers

In [38]:
sex = heart_data['sex']

In [39]:
#floating point numbers are numbers with decimal points.  They're different than integers.
type(sex[0])

### ... that can be made into categorical variables

In [40]:
#there is a lot going on here.  The .apply function takes a function as an input.  It then applies that function 
#to every value in the column and returns the result.  
#In this case we're running str(1.0), str(0.0), etc for every value in the sex column of the heart_data dataset.

#why is apply after the column?

#it's operating on the object it comes after.  Apply in this case is not a generic function but one only callable
#on pandas objects such as series and dataframes (the actual code for it is within the pandas code).  Since apply is not a generic function
#but only callable from one of those objects we call it with the period.  Which is the same way we access object
#attributes and in some sense this apply function is an attribute of our column.  

#I'm getting a bit theoretical, but this way of attaching functions to objects is part of why Python is an unbelievably 
#versatile language.
#It's used on raspberry pi's (tiny tiny altoid tin sized computers) and it's used by facebook to 
#distribute server load. Few languages have that range.

heart_data['sex']=heart_data['sex'].apply(str)

In [41]:
type(heart_data['sex'][0])
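Here is a minimal sketch of what apply(str) does, using a short made-up series in place of the real sex column.

```python
import pandas as pd

sex = pd.Series([1.0, 0.0, 1.0])  # made-up stand-in for heart_data['sex']

# apply calls str on each value in turn: str(1.0), str(0.0), str(1.0)
sex_as_text = sex.apply(str)

print(list(sex_as_text))
print(type(sex_as_text[0]))
```

Note that the text keeps the decimal point, so the categories come out as '1.0' and '0.0' rather than '1' and '0'.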

### and converted back to integers!

In [43]:
#now we're calling a function from the np library named unique.  With return_inverse=True, unique takes a series of values and
#returns two things: the sorted unique values, and an array with each original value replaced by a number.
#That number is the same if the two inputs were the same.
#The idea is that the sex column is a list of strings.  But we want to make them integers.

#We're then saving the output of the function into two variables

#Why two variables?  The function returns a tuple.  A tuple is another python object
#which looks like this (a,b,c) that's a tuple with three elements "a" "b" and "c."

#The first line of code is a shortcut (and there are many of these in Python!) to assign each part of the
#output of a function directly to the variable we want, instead of needing to have an 
#intermediate step where we first save the data as a temporary variable and then
#assign it to the variable we really want.

test, inverse = np.unique(heart_data['sex'], return_inverse=True)
heart_data['sex']=inverse

In [44]:
print test

In [45]:
print inverse
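The two outputs of np.unique are easier to see on a handful of values. In this sketch the input array is made up, and the names labels and codes are chosen for the example (the notebook calls them test and inverse).

```python
import numpy as np

sex_as_text = np.array(['1.0', '0.0', '1.0', '1.0'])  # made-up values

# The function returns a tuple of two arrays, unpacked into two variables.
labels, codes = np.unique(sex_as_text, return_inverse=True)

print(labels)  # the sorted unique values
print(codes)   # for each original value, its position in labels

# Looking the codes up in labels rebuilds the original values.
rebuilt = labels[codes]
print(rebuilt)
```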

In [70]:
type(heart_data['sex'][0])

In [48]:
#so let's check
for i in heart_data['sex']:
    print i,

## IV.

### Let knowledge grow and life be enriched!  Observe'th as our triumphal dataframe grow'th.

In [49]:
#In this code we're actually adding a new column to our dataframe
#we're naming it 'num_times_2' and setting it equal to 2 times the num column
heart_data['num_times_2']=2*heart_data['num']

In [50]:
heart_data['num_times_3']=3*heart_data['num']

In [57]:
#head() is another built-in method of dataframes (and of series, as here). It displays the first 5 rows
print heart_data['num'].head()
print heart_data['num_times_2'].head()
print heart_data['num_times_3'].head()

In [60]:
#range(10) returns a list of the numbers [0,1,2,3,..., 9]
#For every number in that list, call it 'i' and then perform some operation on i
#in this case we're making a new variable named label. The "%" operator is a way
#of building nicely formatted strings.  The "%02d" in the string
#is replaced with the variable that follows the string, after the "%", padded with zeros to two digits.
#We are adding a new column with that name and setting it equal to
#0 times num or 1 times num or 2 times num and so on, depending on which number
#of the loop we're on.

for i in range(10):
    label = "num_times_%02d"%i
    print label
    heart_data[label]=heart_data['num']*i
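The string formatting can be tried on its own, without a dataframe. This short sketch builds the same ten labels and collects them in a list; the names labels and unpadded are made up for the example.

```python
labels = []
for i in range(10):
    # %02d formats i as a whole number at least 2 digits wide, padded with zeros
    labels.append("num_times_%02d" % i)

# Without the 02, single digits are not padded.
unpadded = "num_times_%d" % 5

print(labels[0])
print(labels[9])
print(unpadded)
```

Padding with zeros keeps the column names the same length, so they sort in numerical order.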

In [58]:
print range(10)

In [74]:
#printing the attribute heart_data.columns
print heart_data.columns

In [75]:
#for every column name if the first 4 characters are "num_" print it out
for column in heart_data.columns:
    if column[0:4]=="num_":
        print column

### Columns can be described.

In [66]:
#Describe is another function callable on dataframe objects.  In this case it returns a summary of our dataframe.
#Note the need for double square brackets in order to get a 3 column data frame
#With single square brackets pandas looks for one column named ('age','sex','cp'), doesn't find it, and raises a KeyError
#See what happens if you only have single square brackets!
heart_data[['age','sex','cp']].describe()

In [77]:
heart_data[['num','num_times_05','num_times_09']].describe()
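The bracket behaviour can be confirmed on a small made-up frame. The sketch below catches the KeyError from the single-bracket version so both cases can run in one cell.

```python
import pandas as pd

toy = pd.DataFrame({'age': [63.0, 67.0, 41.0], 'cp': [1.0, 4.0, 2.0]})

# Double brackets: a list of names selects a dataframe we can describe.
summary = toy[['age', 'cp']].describe()
print(summary)

# Single brackets: pandas looks for ONE column named ('age', 'cp').
try:
    toy['age', 'cp']
    single_brackets_failed = False
except KeyError:
    single_brackets_failed = True

print(single_brackets_failed)
```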

### Columns can be counted.

In [67]:
#another dataframe function! count returns the number of non null values
heart_data[['age','cp']].count()

### Columns have means.

In [68]:
#another function - mean
heart_data[['chol', 'fbs']].mean()

In [69]:
#The output looks good, but the output of .mean is actually a series object
type(heart_data[['chol', 'fbs']].mean())

In [70]:
#Extracting a single mean value can be done like this, or
heart_data['chol'].mean()

In [71]:
#like this!!
heart_data[['chol']].mean()['chol']

In [73]:
type(heart_data['chol'].mean())

In [72]:
type(heart_data[['chol']].mean()['chol'])
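To see how count and mean treat missing values, here is a small sketch on a made-up frame with one missing cholesterol reading.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'chol': [200.0, np.nan, 300.0],
                    'fbs': [1.0, 0.0, 0.0]})

counts = toy[['chol', 'fbs']].count()  # non-null values per column
means = toy[['chol', 'fbs']].mean()    # a series with one mean per column
chol_mean = toy['chol'].mean()         # a single number

print(counts)
print(means)
print(chol_mean)
```

The missing value is left out of both the count and the mean, so chol is averaged over two rows rather than three.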

### Columns can be grouped.

In [74]:
#groupby is a dataframe function that takes a column name.
heart_data.groupby('sex')

In [84]:
#At first we're grouping a dataframe. Then we're running describe on the resulting grouped object.
heart_data[['chol','fbs','restecg','sex']].groupby('sex').describe()

## V.

In [85]:
#replace restecg with the group mean by gender
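The cell above is left as an exercise. One possible approach, shown here as a sketch on a made-up frame rather than as the intended solution, is to combine groupby with transform, which returns one value per original row.

```python
import pandas as pd

toy = pd.DataFrame({'sex': [0, 0, 1, 1],
                    'restecg': [0.0, 2.0, 2.0, 2.0]})  # made-up values

# For each row, compute the mean restecg of that row's sex group.
group_means = toy.groupby('sex')['restecg'].transform('mean')

# Overwrite the column with the group means.
toy['restecg'] = group_means

print(toy)
```

Because transform keeps the original row order and length, the result can be assigned straight back to the column.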

## VI.

### The Poetry of a Blank Cell

#### Empty dataframes waiting

#### to be filled.

#### The blank canvas

#### of data scientists!

###### - Poet Laureate (GA)

In [86]:
#we're creating an empty dataframe object using the pandas.DataFrame function directly
Empty_Canvas = pd.DataFrame(index=range(20), columns=["A","B"])
Empty_Canvas

In [87]:
#Now we're replacing the NA's in column A with the string "No!"
#and the NA's in column B with the string "Yes!"
#assigning the result back to the column makes sure the dataframe itself is updated
Empty_Canvas['A'] = Empty_Canvas['A'].fillna("No!")
Empty_Canvas['B'] = Empty_Canvas['B'].fillna("Yes!")
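A quick way to confirm that the blanks were filled is to count the missing values before and after. This sketch uses a smaller made-up canvas with three rows.

```python
import pandas as pd

canvas = pd.DataFrame(index=range(3), columns=["A", "B"])
missing_before = int(canvas.isnull().sum().sum())

# fillna returns a new column; assigning it back updates the dataframe.
canvas['A'] = canvas['A'].fillna("No!")
canvas['B'] = canvas['B'].fillna("Yes!")
missing_after = int(canvas.isnull().sum().sum())

print(missing_before)
print(missing_after)
print(canvas)
```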

### Save a dataframe's contents to file

In [88]:
#then exporting it as a csv, readable with the read_csv function we used at the beginning of this notebook
Empty_Canvas.to_csv('TheBlankCanvas.csv')
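To check that a saved csv reads back correctly, the round trip can be sketched in memory. The example below writes to an io.StringIO buffer instead of a file on disk (an assumption made so that it leaves no file behind) and uses a small made-up canvas.

```python
import io
import pandas as pd

canvas = pd.DataFrame({'A': ['No!'] * 3, 'B': ['Yes!'] * 3})

# Write the csv text into an in-memory buffer instead of a file.
buffer = io.StringIO()
canvas.to_csv(buffer)

# Rewind the buffer and read it back.  to_csv wrote the row labels as the
# first column, so index_col=0 turns that column back into the index.
buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)

print(restored)
```

Without index_col=0 the row labels would come back as an extra unnamed column.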