# 02 Blackbirds

- [Practice of day 01 techniques](#Practice)
- [Setting a column to datetime format](#Setting-a-column-to-datetime-format)
- [Introducing groupby](#Introducing-groupby)
- [Distribution plots with `distplot`](#Distribution-plots)
    -  Includes using `subplots` to get two graphs in one figure
- [Hypothesis testing](#Hypothesis-testing)
    -  Using `scipy.stats` to run a t-test
- [Box plots](#Boxplots)
- [Ordinal data](#Ordinal-data)
    -  The age categories happened to make sense in alphabetical order. What if they didn't?
- [Time series](#Time-series)
    -  Now we've grouped by year we can aggregate by mean to plot a time series

## Practice

![Blackbird](https://www.rspb.org.uk/globalassets/images/birds-and-wildlife/bird-species-illustrations/blackbird_male_1200x675.jpg?preset=landscape_mobile)

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

1. Import the `blackbirds.csv` data into a `pandas` dataframe.
1. How many rows are there in your dataframe? (Try `len()`)
1. Is there a sensible index in the dataframe?
1. What does each of the columns represent? What do you think the age values mean?
1. Find the mean and standard deviation (`std`) of the wing span and weight columns.
1. Use the documentation to check *which* standard deviation you're getting.
1. Use the `quantile` function to find the median and the IQR too.
1. Is there a relationship between wing span and weight? Visualise it and measure it.
1. Use the `hue`, `size`, `style` and `markers` of the `seaborn` [scatterplot function](https://seaborn.pydata.org/generated/seaborn.scatterplot.html) to distinguish between the different kinds of blackbird in your plot.
1. Find the mean and standard deviation of the weight and wing span of adult female and male blackbirds separately.
1. What other questions could you ask of this data set?

### Q1 and Q2

In [None]:
#blackbirds = pd.read_csv("blackbirds.csv")
blackbirds = pd.read_csv("https://raw.githubusercontent.com/adaapp/dav-introductionToPandas/master/blackbirds.csv")
len(blackbirds)

### Q3

There isn't anything unique in the columns to index by.

### Q4

The values in the age column are J (Juvenile), F (First year), A (Adult) and U (Unknown).

### Q5

In [None]:
# Mean and standard deviation of both columns in one table
blackbirds[["Wing","Weight"]].agg(["mean","std"])

### Q6

The default `ddof` argument of `std` is 1, which means the denominator is $n-1$, so `pandas` returns the sample standard deviation by default.
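
To see the difference concretely, here is a small sketch on a made-up series (the numbers are illustrative and not taken from the blackbird data) comparing the default `ddof=1` with `ddof=0`:

```python
import pandas as pd

# Made-up values, not taken from blackbirds.csv
weights = pd.Series([90.0, 100.0, 110.0])

sample_sd = weights.std()            # default ddof=1: divides by n - 1
population_sd = weights.std(ddof=0)  # ddof=0: divides by n

print(sample_sd, population_sd)
```

The two values get closer as the number of rows grows, so the choice matters most for small samples.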

### Q7

In [None]:
print("Weight: The median is {}, with IQR {}".format(blackbirds.Weight.quantile(0.5),
                                                     blackbirds.Weight.quantile(0.75)-blackbirds.Weight.quantile(0.25)))
print("Wing: The median is {}, with IQR {}".format(blackbirds.Wing.quantile(0.5),
                                                     blackbirds.Wing.quantile(0.75)-blackbirds.Wing.quantile(0.25)))
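
As a side note, `quantile` also accepts a list of probabilities, which saves calling it three times. A minimal sketch on made-up numbers (the series name `values` is only for illustration):

```python
import pandas as pd

# Made-up values, not taken from blackbirds.csv
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# One call returns all three quartiles, indexed by probability
quartiles = values.quantile([0.25, 0.5, 0.75])
median = quartiles[0.5]
iqr = quartiles[0.75] - quartiles[0.25]

print("The median is {}, with IQR {}".format(median, iqr))
```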

But actually, `describe` reports the mean, standard deviation and quartiles of every numeric column in one go:

In [None]:
blackbirds.describe()

### Setting a column to datetime format

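The summary above treats `Year` as an ordinary number, so it reports things like a mean and a standard deviation for it, which doesn't really make sense. If you check `blackbirds.dtypes` you'll see why.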

In [None]:
blackbirds.Year = pd.to_datetime(blackbirds.Year,format="%Y")
blackbirds.dtypes

Check `blackbirds.dtypes` again: `Year` should now have a `datetime64` type.
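
Here is the same conversion on a hypothetical toy frame (the values are made up), which also shows how to get the plain year back out with the `.dt` accessor:

```python
import pandas as pd

# Hypothetical toy frame; the real Year column comes from blackbirds.csv
toy = pd.DataFrame({"Year": [1998, 2003, 2010]})

# "%Y" tells pandas that each value is a four-digit year
toy["Year"] = pd.to_datetime(toy["Year"], format="%Y")

print(toy.dtypes)
print(toy["Year"].dt.year.tolist())
```

Each year becomes a timestamp for 1 January of that year, so the column now sorts and plots as dates rather than as plain numbers.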

### Q8

In [None]:
blackbirds.plot.scatter("Wing","Weight")
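# Restrict to the two numeric columns: recent versions of pandas raise an
# error if corr() is called on a dataframe that still contains text columns
blackbirds[["Wing","Weight"]].corr()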
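
If you only want the single number rather than the whole matrix, a `Series` has its own `corr` method. A minimal sketch on made-up measurements (not the blackbird data):

```python
import pandas as pd

# Made-up measurements, not taken from blackbirds.csv
toy = pd.DataFrame({
    "Wing": [120.0, 125.0, 130.0, 135.0],
    "Weight": [85.0, 95.0, 92.0, 110.0],
})

# Pearson's correlation coefficient is the default method
r = toy["Wing"].corr(toy["Weight"])

# The same number appears off the diagonal of the correlation matrix
r_from_matrix = toy.corr().loc["Wing", "Weight"]

print(r, r_from_matrix)
```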

### Q9

In [None]:
plt.figure(figsize=(12,6))

sns.scatterplot(data=blackbirds,x="Wing",y="Weight",hue="Age", style="Sex", palette="hot", alpha=0.6);

### Q10

In [None]:
blackbirds.loc[(blackbirds.Sex=='M')&(blackbirds.Age=='A')].describe()

In [None]:
blackbirds.loc[(blackbirds.Sex=='F')&(blackbirds.Age=='A')].describe()

### Introducing groupby

But this feels like a good opportunity to see the `groupby` method:

In [None]:
blackbirds.groupby(["Sex","Age"])

By itself, `groupby` doesn't do much except make a groupby object. Just like with a pivot table, we need to tell it what to *aggregate* by...

In [None]:
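# numeric_only=True skips text columns, which recent versions of pandas refuse to average
blackbirds.groupby(["Age","Sex"]).mean(numeric_only=True)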
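
Q10 asked for the standard deviation as well as the mean. One way to get both, sketched here on a made-up frame that reuses the column names from this notebook, is to pass a list of aggregations to `agg`:

```python
import pandas as pd

# Made-up rows that reuse the column names from this notebook
toy = pd.DataFrame({
    "Sex": ["F", "F", "M", "M"],
    "Age": ["A", "A", "A", "A"],
    "Wing": [124.0, 126.0, 130.0, 134.0],
    "Weight": [95.0, 105.0, 98.0, 102.0],
})

# Pick the columns to summarise, then apply several aggregations at once
summary = toy.groupby(["Sex", "Age"])[["Wing", "Weight"]].agg(["mean", "std"])

print(summary)
```

The result has one row per (Sex, Age) group and a two-level column index, so a single value can be read with something like `summary.loc[("F", "A"), ("Wing", "mean")]`.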

## Distribution plots

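`seaborn` has a `distplot` function that combines a histogram with an estimate of the continuous distribution shape. Note that `distplot` has been deprecated since `seaborn` 0.11, so recent versions will show a warning.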

In [None]:
# What error message do you get without the dropna?
sns.distplot(blackbirds.Weight.dropna());

In [None]:
# fig is the whole figure, axs is an array of two sets of axes
fig,axs = plt.subplots(1,2)
fig.suptitle("Distribution of weight and wing span")
# I don't care about the numbers on the y-axis
axs[0].get_yaxis().set_visible(False)
axs[1].get_yaxis().set_visible(False)
# Pass the axes to seaborn to tell it where to plot each graph
sns.distplot(blackbirds.Weight.dropna(), bins=10, ax=axs[0])
sns.distplot(blackbirds.Wing.dropna(), bins=10, ax=axs[1]);

Use `distplot` to compare the distributions of weight and wing span for female and male blackbirds.

In [None]:
fig, axs = plt.subplots(1,2)
fig.suptitle("Weight and wingspan distribution by sex")

axs[0].get_yaxis().set_visible(False)
sns.distplot(blackbirds[blackbirds.Sex=='M'].Wing.dropna(),color="goldenrod", ax=axs[0], label='M', bins=10)
sns.distplot(blackbirds[blackbirds.Sex=='F'].Wing.dropna(),color="rebeccapurple", ax=axs[0], label='F', bins=10)

axs[1].get_yaxis().set_visible(False)
sns.distplot(blackbirds[blackbirds.Sex=='M'].Weight.dropna(),color="goldenrod", ax=axs[1], label='M')
sns.distplot(blackbirds[blackbirds.Sex=='F'].Weight.dropna(),color="rebeccapurple", ax=axs[1], label='F')
axs[0].legend(loc="lower left")

What does this suggest?

## Hypothesis testing

It looks like the mean wing span for female blackbirds is different from the mean for males. How should we test that?

The `scipy` package has a function for doing t-tests, [`scipy.stats.ttest_ind`](https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.ttest_ind.html).

In [None]:
from scipy import stats

In [None]:
# Welch's t-test (equal_var=False) comparing the wing spans of males and females
stats.ttest_ind(blackbirds.loc[blackbirds.Sex == 'M',"Wing"].dropna(),
                blackbirds.loc[blackbirds.Sex == 'F',"Wing"].dropna(),
                equal_var=False)

What can we conclude? Was this a one- or a two-tailed test? Does it matter?
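
To explore the one- versus two-tailed question, newer versions of `scipy` (1.6 and later, which is an assumption about your installation) accept an `alternative` argument. The sketch below uses randomly generated samples rather than the blackbird data:

```python
import numpy as np
from scipy import stats

# Randomly generated stand-in samples, not the real measurements
rng = np.random.default_rng(1)
males = rng.normal(130, 4, 50)
females = rng.normal(127, 4, 50)

# Two-tailed (the default): is there a difference in either direction?
two_sided = stats.ttest_ind(males, females, equal_var=False)

# One-tailed: is the male mean greater than the female mean?
one_sided = stats.ttest_ind(males, females, equal_var=False, alternative="greater")

print(two_sided.pvalue, one_sided.pvalue)
```

The t statistic is the same in both cases. When the observed difference is in the hypothesised direction, the one-tailed p-value is half the two-tailed one, which is why the direction should be chosen before looking at the data.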

## Boxplots

We can use grouped boxplots to see how weight and wing span change with age.

In [None]:
# Make a figure with two subplots with a shared y-axis
fig, axs = plt.subplots(1,2, sharey=True)
# axs is an array so we can get the first subplot with axs[0]
sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0], whis=3)
# and the second with axs[1]
sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1], whis=2);

Investigate the optional arguments for boxplots. What definition of outlier is used?
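
As a starting point, `whis` defaults to 1.5, meaning the whiskers reach at most 1.5 times the IQR beyond the quartiles and anything further out is drawn as an individual point. A minimal sketch of that rule on made-up values:

```python
import pandas as pd

# Made-up values with one obviously extreme point
values = pd.Series([10.0, 12.0, 13.0, 14.0, 15.0, 16.0, 18.0, 40.0])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

whis = 1.5  # the default; larger values flag fewer points as outliers
lower_fence = q1 - whis * iqr
upper_fence = q3 + whis * iqr

outliers = values[(values < lower_fence) | (values > upper_fence)]
print(lower_fence, upper_fence, outliers.tolist())
```

Raising `whis` to 2 or 3, as in the cell above, widens the fences.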

## Ordinal data

It so happened that A, F, J and U worked quite well because they're in alphabetical order. But it would be better to tell `pandas` what order we really mean them to come in.

In [None]:
# ordered=True records that the categories have a meaningful order: U < J < F < A
blackbirds.Age = pd.Categorical(blackbirds.Age, categories=["U","J","F","A"], ordered=True)

In [None]:
fig, axs = plt.subplots(2,1,sharex=True)
sns.boxplot(x="Wing",y="Age",data=blackbirds, ax=axs[0])
sns.boxplot(x="Weight",y="Age",data=blackbirds, ax=axs[1]);
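
Beyond plotting, an ordered categorical also changes how sorting and comparisons behave. A small sketch on a made-up series of age codes:

```python
import pandas as pd

# Made-up age codes using the same letters as the Age column
ages = pd.Series(["A", "J", "U", "F", "A"])

ordered = pd.Series(
    pd.Categorical(ages, categories=["U", "J", "F", "A"], ordered=True)
)

# Sorting follows the category order rather than the alphabet
sorted_ages = ordered.sort_values().tolist()

# Comparisons are only allowed because ordered=True
older_than_juvenile = (ordered > "J").tolist()

print(sorted_ages)
print(older_than_juvenile)
```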

## Time series

Let's look at how weight and wing span have varied over time.

In [None]:
# A groupby by itself doesn't do very much
blackbirds.groupby(by="Year")

In [None]:
# numeric_only=True skips the Sex and Age columns, which can't be averaged
blackbirds.groupby(by="Year").mean(numeric_only=True)

In [None]:
# numeric_only=True skips the Sex and Age columns, which can't be averaged
blackbirds.groupby(by="Year").mean(numeric_only=True).plot();
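
The same idea on a made-up frame, selecting the measurement columns explicitly before aggregating. Recent versions of `pandas` raise a `TypeError` if `mean` is asked to average a text column, so choosing the columns (or passing `numeric_only=True`) avoids that:

```python
import pandas as pd

# Made-up rows that reuse the column names from this notebook
toy = pd.DataFrame({
    "Year": pd.to_datetime(["2001", "2001", "2002", "2002"], format="%Y"),
    "Sex": ["M", "F", "M", "F"],
    "Wing": [128.0, 130.0, 130.0, 134.0],
    "Weight": [96.0, 104.0, 99.0, 101.0],
})

# One row per year, holding the mean of each selected column
yearly = toy.groupby("Year")[["Wing", "Weight"]].mean()

# How many birds contribute to each yearly mean
counts = toy.groupby("Year").size()

print(yearly)
print(counts)
```

Checking the group sizes is worthwhile before reading much into a time series, because a mean based on a handful of birds can jump around a lot from year to year.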