# Using Pandas for data science in Python

### Import libraries
First, we import the libraries used in this exercise: bokeh, a Python visualization library built for the web; sys, which we use to report the version of Python we are running; pandas, a third-party library for data analysis in Python; and scipy, whose `ttest_ind` function we use for a t-test at the end. In some ways, pandas is built to help Python do some of the things that R is really good at.

In [None]:
import bokeh
import sys
import pandas as pd
from pandas import DataFrame, read_csv
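# Note: bokeh.charts was removed from Bokeh in version 0.12.9 (it moved to the
# separate bkcharts package), so this import requires an older Bokeh release.
from bokeh.charts import Histogram, Line, BoxPlot, output_notebook, show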
from scipy.stats import ttest_ind

Next we are just going to check the versions of the libraries that we are using. This serves as a good sanity check when you are looking at documentation for the API. It helps ensure that you are looking in the right place!

In [None]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)
print('Bokeh version ' + bokeh.__version__)

### Load the data
We will load some data to look at in Pandas. Pandas has some nice functions for reading `.csv` (and other) files. Before we do this, let's look at the `.csv` file so we know what the input data look like. As you have learned, you can run a standard Unix command in a Jupyter notebook as long as you prefix it with an `!`.

In [None]:
!cat gender_height.csv

What we see is a pretty standard `.csv` file with two columns, one for gender and one for height. You can read it into a pandas data frame with the `read_csv()` function.

In [None]:
df = pd.read_csv("gender_height.csv")
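
To see what `read_csv()` does without depending on the file on disk, here is a minimal sketch that parses a small in-memory CSV. The three rows and the name `sample_df` are invented for illustration; only the column names `Gender` and `Height` are taken from the tutorial data.

```python
import io

import pandas as pd

# Hypothetical stand-in for gender_height.csv (values invented for illustration).
csv_text = "Gender,Height\n1,64.5\n2,70.1\n1,62.0\n"

# read_csv accepts a path or any file-like object, so StringIO works here.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types inferred from the text
```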

We can then take a look at the data frame. Jupyter does a nice job of rendering it as a table.

In [None]:
df

Like many other objects in Python, data frames are iterable. You can write a for loop on the data frame itself, which will loop through the column names (not the rows).

In [None]:
for i in df:
    print(i)
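
As a small self-contained illustration of that behavior, the sketch below builds a two-row frame (values invented, names `sample_df` and `labels` hypothetical) and shows that plain iteration gives column labels, while `itertuples()` is one way to visit rows.

```python
import pandas as pd

# Hypothetical two-row frame with the same column names as the tutorial data.
sample_df = pd.DataFrame({"Gender": [1, 2], "Height": [64.5, 70.1]})

# Iterating over the frame yields column labels, not rows.
labels = [label for label in sample_df]
print(labels)

# To visit rows instead, itertuples() yields one named tuple per row.
rows = list(sample_df.itertuples(index=False))
for row in rows:
    print(row.Gender, row.Height)
```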

We can also iterate through the values in each column.

In [None]:
for i in df['Height']:
    print(i)

In [None]:
for i in df['Gender']:
    print(i)

One of the nice things about Pandas data frames is that they have many useful methods built in. If you're thinking in a pure Python sense, you might think that you'd have to do something like the following to get the sum or mean of a column in a data frame.

In [None]:
the_sum = 0
count = 0
for i in df['Height']:
    the_sum += i
    count += 1

mean = the_sum/count

This works fine and the mean and the sum are correct:

In [None]:
mean

In [None]:
the_sum

You could also do something a little more clever by converting the column into a list and then using built-in Python functions on it, which gets you the sum directly and makes something like the median easier to find:

In [None]:
height_list = list(df['Height'])

In [None]:
height_list

For example, the built-in `sum()` function gives the total:

In [None]:
sum(height_list)

Then you can sort the list to make it easy to identify the median.

In [None]:
sorted_list = sorted(height_list)

In [None]:
print(sorted_list)

In [None]:
len(sorted_list)

In [None]:
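# The hard-coded indices 49 and 50 assume the list has exactly 100 entries
# (see the len() output above), so the median is the mean of the two middle values.
median = (sorted_list[49]+sorted_list[50])/2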

In [None]:
median
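
The indices above only work for a list of one particular length. As a sketch of the same idea for any non-empty list, the hypothetical helper `median_of` below picks the middle value for odd lengths and averages the two middle values for even lengths. The sample heights are invented, and the standard library's `statistics.median` is used as a cross-check.

```python
import statistics


def median_of(values):
    """Return the median of a non-empty sequence of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd length: the single middle value.
        return ordered[mid]
    # Even length: the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


heights = [70.1, 62.0, 64.5, 68.3]  # invented values
print(median_of(heights))
print(median_of(heights) == statistics.median(heights))
```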

All of this is great and quite simple, but Pandas makes it even nicer by having built-in statistics for this sort of thing.

In [None]:
df["Height"].mean()

In [None]:
df["Height"].sum()

In [None]:
df["Height"].median()

In [None]:
df["Height"].mode()
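
One detail worth knowing about these methods: `mean()` and `median()` return a single number, while `mode()` returns a Series, since more than one value can tie for most frequent. A minimal sketch with invented values:

```python
import pandas as pd

heights = pd.Series([62.0, 64.5, 64.5, 70.1])  # invented values

print(heights.mean())
print(heights.median())

# mode() returns a Series because several values can tie for most frequent.
modes = heights.mode()
print(modes.tolist())
```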

This is pretty useful stuff. Now we can make the output even more readable by replacing the integer codes in the `Gender` column with descriptive labels. We can do this with the `replace()` method.

In [None]:
df["Gender"].replace(1, "female")

Notice that the output above replaced the integer `1` with `female`. This is nice, but let's take a look at the dataframe again:

In [None]:
df

The change didn't stick, because `replace()` returns a new column rather than modifying the dataframe in place. To keep the change, assign the result of `replace()` back to the corresponding column of the dataframe.

In [None]:
df["Gender"] = df["Gender"].replace(1, "female")
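
The same behavior can be seen on a tiny frame. This is a sketch with invented values and hypothetical names (`sample_df`, `replaced`, `before`, `after`): the column is unchanged after calling `replace()` and only changes once the result is assigned back.

```python
import pandas as pd

sample_df = pd.DataFrame({"Gender": [1, 2, 1]})  # invented values

# replace() returns a new Series; the frame itself is left untouched.
replaced = sample_df["Gender"].replace(1, "female")
before = sample_df["Gender"].tolist()
print(replaced.tolist())
print(before)

# Assigning the result back is what makes the change stick.
sample_df["Gender"] = replaced
after = sample_df["Gender"].tolist()
print(after)
```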

Now take another look at the dataframe and you should see that `1` has been replaced with `female`.

In [None]:
df

Let's do the same with `2` and `male`.

In [None]:
df["Gender"] = df["Gender"].replace(2, "male")
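
As an alternative to two separate calls, `replace()` also accepts a dict that maps each old value to its new one. A minimal sketch on invented data (the frame `sample_df` is hypothetical):

```python
import pandas as pd

sample_df = pd.DataFrame({"Gender": [1, 2, 1]})  # invented values

# A dict maps each old value to its replacement in a single call.
sample_df["Gender"] = sample_df["Gender"].replace({1: "female", 2: "male"})
print(sample_df["Gender"].tolist())
```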

Now we have a dataframe with clear values in each column.

In [None]:
df

In [None]:
des_stats = df['Height'].describe()

In [None]:
des_stats

In [None]:
output_notebook()

In [None]:
line = Line(df, title="line", legend="top_left", ylabel='Height')

In [None]:
show(line)

In [None]:
hist = Histogram(df, values='Height', title="Distribution of Height", plot_width=600)

In [None]:
show(hist)

In [None]:
hist = Histogram(df, values='Height', title="Distribution of Height", plot_width=600, bins=13)

In [None]:
show(hist)

In [None]:
hist2 = Histogram(df, values='Height', label='Gender', color='Gender',
                  title="Height by Gender", plot_width=600)

In [None]:
show(hist2)

In [None]:
hist2 = Histogram(df, values='Height', label='Gender', color='Gender',
                  title="Height by Gender", plot_width=600, bins=13)

In [None]:
show(hist2)

In [None]:
by_gender = df.groupby("Gender")

In [None]:
by_gender

In [None]:
by_gender.mean()

In [None]:
by_gender.median()

In [None]:
by_gender.describe()
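
The `groupby()` cells above split the rows by the value in `Gender` and then compute a statistic within each group. A minimal sketch of that split-then-aggregate pattern, using four invented rows and the hypothetical names `sample_df` and `group_means`:

```python
import pandas as pd

sample_df = pd.DataFrame({
    "Gender": ["female", "male", "female", "male"],
    "Height": [62.0, 70.0, 64.0, 72.0],  # invented values
})

# Split rows by Gender, then take the mean Height within each group.
group_means = sample_df.groupby("Gender")["Height"].mean()
print(group_means)
```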

In [None]:
box = BoxPlot(df, values='Height', label='Gender', title="Heights", plot_width=600)

In [None]:
show(box)

In [None]:
box2 = BoxPlot(df, values='Height', label='Gender', color='Gender',
               title="Height by Gender", plot_width=600)

In [None]:
show(box2)

In [None]:
male_heights = df[df["Gender"]=="male"]

In [None]:
male_heights

In [None]:
female_heights = df[df["Gender"]=="female"]

In [None]:
female_heights

In [None]:
ttest_ind(male_heights['Height'], female_heights['Height'])
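
By default, `ttest_ind` performs a two-sample t-test that assumes equal variances in the two groups. As a rough sketch of what the reported statistic measures, the hypothetical helper `pooled_t_statistic` below computes it in pure Python from the group means and a pooled sample variance. The heights are invented, and the p-value that `ttest_ind` also returns is not computed here.

```python
import math
import statistics


def pooled_t_statistic(a, b):
    """Two-sample t statistic assuming equal variances in both groups."""
    na, nb = len(a), len(b)
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    # Pooled variance: the two sample variances weighted by degrees of freedom.
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    standard_error = math.sqrt(pooled * (1 / na + 1 / nb))
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


male = [70.0, 72.0, 68.0, 71.0]    # invented values
female = [62.0, 64.0, 63.0, 65.0]  # invented values
print(pooled_t_statistic(male, female))
```

A positive statistic means the first group's mean is larger than the second's; swapping the arguments flips the sign.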