# Using Pandas for data science in Python

### Import libraries
First, we need to import the libraries that we will use during this exercise: bokeh, a Python visualization library built for the web; sys, which we will use to report the version of Python that we are running; and pandas, an external library for doing data science in Python. In some ways, pandas is built to help Python do the things that R is really good at. We also import `ttest_ind` from scipy, which we will use at the end to compare two groups.

In [1]:
import bokeh
import sys
import pandas as pd
from pandas import DataFrame, read_csv
from bokeh.charts import Histogram, Line, BoxPlot, output_notebook, show
from scipy.stats import ttest_ind

Next we are just going to check the versions of the libraries that we are using. This serves as a good sanity check when you are looking at documentation for the API. It helps ensure that you are looking in the right place!

In [2]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Bokeh version ' + bokeh.__version__)

Python version 3.5.1 |Continuum Analytics, Inc.| (default, Jun 15 2016, 16:14:02) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version 0.19.1
Bokeh version 0.12.3


### Load the data
We will load some data to look at in Pandas. Pandas has some nice functions to read in `.csv` (and other) files. Before we do this, let's look at the `.csv` file so we know what the input data look like. As you have learned, you can run a standard Unix command in a Jupyter notebook as long as you prefix it with an `!`.

In [4]:
!cat gender_height.csv

Gender,Height
1,67
1,67
1,67
1,60
1,68
1,64
1,69
1,71
1,67
1,67
1,66
1,63
1,67
1,62
1,66
1,70
1,67
1,61
1,68
1,67
1,68
1,64
1,69
1,67
1,70
1,72
1,61
1,67
1,69
1,68
1,69
1,72
1,66
1,67
1,66
1,67
1,69
1,64
1,64
1,63
1,68
1,66
1,65
1,60
1,70
1,65
1,68
1,66
1,61
1,65
2,72
2,74
2,75
2,71
2,71
2,68
2,75
2,75
2,74
2,70
2,74
2,74
2,79
2,72
2,72
2,75
2,72
2,76
2,76
2,74
2,73
2,72
2,65
2,71
2,74
2,69
2,73
2,69
2,71
2,67
2,74
2,70
2,70
2,70
2,74
2,74
2,74
2,70
2,74
2,73
2,69
2,72
2,69
2,74
2,71
2,72
2,74
2,75
2,77
2,72

What we see is a pretty standard `.csv` file with a header row and two columns, one for gender and one for height. You can read that into a pandas data frame with the `pd.read_csv()` function.

In [5]:
df = pd.read_csv("gender_height.csv")
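If you want to try `read_csv()` without the file on disk, here is a minimal sketch that reads the same kind of data from a string. The variable names `csv_text` and `sample_df` and the four rows are made up for illustration; they only mimic the layout of `gender_height.csv`.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-in for gender_height.csv (a few rows only).
csv_text = "Gender,Height\n1,67\n1,60\n2,72\n2,74\n"

# read_csv() accepts a path or any file-like object, such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)    # (number of rows, number of columns)
print(sample_df.dtypes)   # the type pandas inferred for each column
print(sample_df.head(2))  # the first two rows
```

Checking `shape` and `dtypes` right after loading is a quick way to confirm that the header row was picked up and that the numbers were parsed as numbers.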

We can then take a look at the data frame. Jupyter does a nice job of rendering it as a table.

In [6]:
df

Unnamed: 0,Gender,Height
0,1,67
1,1,67
2,1,67
3,1,60
4,1,68
5,1,64
6,1,69
7,1,71
8,1,67
9,1,67


Like many other objects in Python, data frames are iterable. You can write a for loop over the data frame itself, which will loop through the column names (not the rows).

In [7]:
for i in df:
    print(i)

Gender
Height
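To make the difference between looping over columns and looping over rows concrete, here is a small sketch on a made-up three-row data frame (`sample_df` is an illustrative name, not the data loaded above). It uses `itertuples()`, a pandas method that yields one tuple per row.

```python
import pandas as pd

# Small illustrative data frame with the same two columns.
sample_df = pd.DataFrame({"Gender": [1, 1, 2], "Height": [67, 60, 72]})

# Iterating over the data frame itself yields the column labels.
column_names = [name for name in sample_df]

# itertuples() yields one namedtuple per row instead.
rows = [(row.Gender, row.Height) for row in sample_df.itertuples(index=False)]

print(column_names)
print(rows)
```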


We can also iterate through the values in each column.

In [8]:
for i in df['Height']:
    print(i)

67
67
67
60
68
64
69
71
67
67
66
63
67
62
66
70
67
61
68
67
68
64
69
67
70
72
61
67
69
68
69
72
66
67
66
67
69
64
64
63
68
66
65
60
70
65
68
66
61
65
72
74
75
71
71
68
75
75
74
70
74
74
79
72
72
75
72
76
76
74
73
72
65
71
74
69
73
69
71
67
74
70
70
70
74
74
74
70
74
73
69
72
69
74
71
72
74
75
77
72


In [9]:
for i in df['Gender']:
    print(i)

1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
1
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2
2


One of the nice things about Pandas data frames is that they have many useful methods built in. If you're thinking in a pure Python sense, you might think that you'd have to do something like the following to get the sum or mean of a column in a data frame.

In [10]:
the_sum = 0
count = 0
for i in df['Height']:
    the_sum += i
    count += 1

mean = the_sum/count

This works fine and the mean and the sum are correct:

In [11]:
mean

69.409999999999997

In [12]:
the_sum

6941

You could also do something a little bit cleverer: convert the column into a list, then use built-in Python functions on that list to work toward statistics like the sum or the median.

In [13]:
height_list = list(df['Height'])

In [14]:
height_list

[67,
 67,
 67,
 60,
 68,
 64,
 69,
 71,
 67,
 67,
 66,
 63,
 67,
 62,
 66,
 70,
 67,
 61,
 68,
 67,
 68,
 64,
 69,
 67,
 70,
 72,
 61,
 67,
 69,
 68,
 69,
 72,
 66,
 67,
 66,
 67,
 69,
 64,
 64,
 63,
 68,
 66,
 65,
 60,
 70,
 65,
 68,
 66,
 61,
 65,
 72,
 74,
 75,
 71,
 71,
 68,
 75,
 75,
 74,
 70,
 74,
 74,
 79,
 72,
 72,
 75,
 72,
 76,
 76,
 74,
 73,
 72,
 65,
 71,
 74,
 69,
 73,
 69,
 71,
 67,
 74,
 70,
 70,
 70,
 74,
 74,
 74,
 70,
 74,
 73,
 69,
 72,
 69,
 74,
 71,
 72,
 74,
 75,
 77,
 72]

Built-in functions such as `sum()` work directly on the list:

In [17]:
sum(height_list)

6941

Then you can sort the list to make it easy to identify the median.

In [18]:
sorted_list = sorted(height_list)

In [19]:
print(sorted_list)

[60, 60, 61, 61, 61, 62, 63, 63, 64, 64, 64, 64, 65, 65, 65, 65, 66, 66, 66, 66, 66, 66, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 67, 68, 68, 68, 68, 68, 68, 68, 69, 69, 69, 69, 69, 69, 69, 69, 69, 70, 70, 70, 70, 70, 70, 70, 70, 71, 71, 71, 71, 71, 71, 72, 72, 72, 72, 72, 72, 72, 72, 72, 72, 73, 73, 73, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 74, 75, 75, 75, 75, 75, 76, 76, 77, 79]


In [20]:
len(sorted_list)

100

In [21]:
median = (sorted_list[49]+sorted_list[50])/2

In [22]:
median

69.0
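The indices 49 and 50 above only work because the list has exactly 100 values. As a rough sketch, the same idea can be written so that it handles both odd and even lengths; `median_of` and `sample_heights` are names invented for this example. Python's standard library also ships a `statistics` module that does this for you.

```python
import statistics


def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the middle element is the median.
        return ordered[mid]
    # Even length: average the two middle elements.
    return (ordered[mid - 1] + ordered[mid]) / 2


sample_heights = [67, 60, 68, 64, 72, 74]  # illustrative values only

print(median_of(sample_heights))
print(statistics.median(sample_heights))  # standard-library equivalent
```

For a list of 100 values, `mid` is 50, so the function averages the elements at positions 49 and 50, exactly as in the cell above.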

All of this is great and quite simple, but Pandas makes it even nicer by having built-in statistics for this sort of thing.

In [23]:
df["Height"].mean()

69.409999999999997

In [24]:
df["Height"].sum()

6941

In [25]:
df["Height"].median()

69.0

In [26]:
df["Height"].mode()

0    67
1    74
dtype: int64
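Note that `mode()` returns a Series rather than a single number, because more than one value can be tied for most frequent (here 67 and 74). The sketch below shows this on a tiny made-up Series, and also uses `agg()` to compute several statistics in one call; `sample_heights` is an illustrative name.

```python
import pandas as pd

# Illustrative values in which 67 and 74 are tied for most frequent.
sample_heights = pd.Series([67, 67, 74, 74, 70], name="Height")

# agg() with a list of names computes several statistics at once.
summary = sample_heights.agg(["sum", "mean", "median"])

# mode() returns every value tied for most frequent, in sorted order.
modes = sample_heights.mode()

print(summary)
print(modes.tolist())
```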

In [None]:
df["Gender"].replace(1, "female")

In [None]:
df

In [None]:
df["Gender"] = df["Gender"].replace(1, "female")

In [None]:
df

In [None]:
df["Gender"] = df["Gender"].replace(2, "male")

In [None]:
df
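The cells above show that `replace()` returns a new Series and leaves `df` unchanged until the result is assigned back to the column. As an alternative sketch, both codes can be relabeled in a single step with a dictionary and `map()`. The data frame here is made up for illustration, and the coding (1 for female, 2 for male) is taken from the cells above.

```python
import pandas as pd

# Illustrative data frame using the same numeric gender codes.
sample_df = pd.DataFrame({"Gender": [1, 2, 2, 1], "Height": [67, 72, 74, 60]})

# Coding taken from the notebook: 1 -> "female", 2 -> "male".
gender_labels = {1: "female", 2: "male"}

# map() returns a new Series; assign it back to keep the change.
sample_df["Gender"] = sample_df["Gender"].map(gender_labels)

print(sample_df)
```

One thing to keep in mind with `map()` is that any code missing from the dictionary becomes `NaN`, whereas `replace()` leaves unmatched values as they are.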

In [None]:
des_stats = df['Height'].describe()

In [None]:
des_stats

In [None]:
output_notebook()

In [None]:
line = Line(df, title="line", legend="top_left", ylabel='Height')

In [None]:
show(line)

In [None]:
hist = Histogram(df, values='Height', title="Distribution of Height", plot_width=600)

In [None]:
show(hist)

In [None]:
hist = Histogram(df, values='Height', title="Distribution of Height", plot_width=600, bins=13)

In [None]:
show(hist)

In [None]:
hist2 = Histogram(df, values='Height', label='Gender', color='Gender',
                  title="Height by Gender", plot_width=600)

In [None]:
show(hist2)

In [None]:
hist2 = Histogram(df, values='Height', label='Gender', color='Gender',
                  title="Height by Gender", plot_width=600, bins=13)

In [None]:
show(hist2)

In [None]:
by_gender = df.groupby("Gender")

In [None]:
by_gender

In [None]:
by_gender.mean()

In [None]:
by_gender.median()

In [None]:
by_gender.describe()
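`groupby()` splits the rows by the values in a column, and the method called on the result is then applied to each group separately. Here is a minimal sketch on made-up data that selects the `Height` column and asks for a few statistics per group; `sample_df` and `per_gender` are illustrative names.

```python
import pandas as pd

# Illustrative data: two rows per gender.
sample_df = pd.DataFrame({
    "Gender": ["female", "female", "male", "male"],
    "Height": [60, 68, 72, 74],
})

# One row per gender, one column per requested statistic.
per_gender = sample_df.groupby("Gender")["Height"].agg(["count", "mean", "median"])

print(per_gender)
```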

In [None]:
box = BoxPlot(df, values='Height', label='Gender', title="Heights", plot_width=600)

In [None]:
show(box)

In [None]:
box2 = BoxPlot(df, values='Height', label='Gender', color='Gender',
               title="Height by Gender", plot_width=600)

In [None]:
show(box2)

In [None]:
male_heights = df[df["Gender"]=="male"]

In [None]:
male_heights

In [None]:
female_heights = df[df["Gender"]=="female"]

In [None]:
female_heights
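The expression `df["Gender"]=="male"` is worth a closer look: it produces a Series of `True`/`False` values, one per row, and indexing the data frame with that Series keeps only the rows marked `True`. A small sketch on made-up data, with illustrative names:

```python
import pandas as pd

# Illustrative data frame with already-relabeled genders.
sample_df = pd.DataFrame({
    "Gender": ["female", "male", "male", "female"],
    "Height": [67, 72, 74, 60],
})

is_male = sample_df["Gender"] == "male"  # boolean Series, one value per row
male_rows = sample_df[is_male]           # keeps rows where the mask is True
female_rows = sample_df[~is_male]        # ~ inverts the mask

print(is_male.tolist())
print(male_rows["Height"].tolist())
print(female_rows["Height"].tolist())
```

Inverting the mask with `~` only gives the female rows here because the column contains exactly two labels.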

In [None]:
ttest_ind(male_heights['Height'], female_heights['Height'])
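By default, `ttest_ind` runs an independent two-sample t-test that assumes both groups have the same variance, and it returns the t-statistic together with a p-value. To show roughly where the statistic comes from, here is a sketch that computes the pooled-variance t-statistic by hand using only the standard library. The function name `pooled_t_statistic` and the two small groups are made up for illustration, and the sketch does not compute the p-value.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t-statistic assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (divides by n - 1)
    var_b = statistics.variance(b)
    # Weighted average of the two sample variances.
    pooled_var = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    # Standard error of the difference between the two means.
    std_err = math.sqrt(pooled_var * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


group_a = [72, 74, 75, 71]  # illustrative heights
group_b = [67, 60, 68, 64]  # illustrative heights

t_stat = pooled_t_statistic(group_a, group_b)
print(t_stat)
```

A large absolute value of the statistic means the difference between the group means is large relative to the spread within the groups. If the equal-variance assumption is doubtful, `ttest_ind` also accepts `equal_var=False`, which runs Welch's t-test instead.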