# Summary Statistics & Aggregation

In [None]:
import pandas as pd

In [None]:
animals = pd.read_csv("data/animals.csv")
titanic = pd.read_excel("data/titanic.xlsx")

In [None]:
%whos

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Overall Descriptive Statistics

`pandas` has a built-in method for getting basic descriptive statistics: `.describe()`.

In [None]:
titanic.head()

In [None]:
titanic.describe()

In [None]:
titanic["fare"].describe()
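By default `.describe()` only summarises numeric columns when a DataFrame contains any. As a minimal sketch, using a small made-up DataFrame (`toy` is hypothetical and not part of the course data), the `include` parameter can be used to summarise text columns as well:

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
toy = pd.DataFrame({
    "name": ["Ann", "Bob", "Cat", "Bob"],
    "fare": [10.0, 20.0, 30.0, 40.0],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # numeric and text columns

print(numeric_summary)
print(full_summary)
```

For text columns the summary reports `count`, `unique`, `top` (the most frequent value) and `freq` rather than the mean and quartiles.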

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Range

We can also access these summary statistics individually. In most cases the name of the method is the same as the summary statistic.

We can use `.min()` to return the minimum value in a column.

In [None]:
titanic["fare"].min()

This also works for object (text) columns.

In [None]:
titanic["name"].min()

In [None]:
titanic["fare"].max()

In [None]:
titanic["name"].max()

Something important to note here is that text is compared character by character, using each character's Unicode code point.

All upper-case letters (A-Z) come before all lower-case letters (a-z).

So a lower-case "a" is treated as coming **after** a capital "Z" in Python.

In [None]:
titanic["name"].str.lower().max()
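The ordering can be checked directly with Python's built-in `ord()`, which returns a character's code point. This is only a sketch on a hypothetical Series called `names`, not the Titanic data:

```python
import pandas as pd

# Hypothetical names, used only for illustration
names = pd.Series(["alice", "Zed", "Bob"])

code_upper_z = ord("Z")  # code point of capital "Z"
code_lower_a = ord("a")  # code point of lower-case "a"

case_sensitive_min = names.min()               # upper case sorts first
case_insensitive_min = names.str.lower().min() # compare after lower-casing

print(code_upper_z, code_lower_a)
print(case_sensitive_min, case_insensitive_min)
```

Lower-casing (or upper-casing) the column first is one way to get an alphabetical result that ignores case.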

We can use `.quantile()` to find the value at a given quantile of our data.

The parameter `q=` takes a number between 0 and 1, or a list of such numbers.

If we don't specify `q`, it defaults to `0.5`, which is the median.

In [None]:
titanic["fare"].quantile(q=0.25)

In [None]:
titanic["fare"].quantile(q=[0, 0.25, 0.5, 0.75, 1])
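As a quick check of the default behaviour, here is a minimal sketch on a hypothetical Series called `values`. It compares `.quantile()` with no arguments against `.median()`, and shows that passing a list returns a Series indexed by the requested quantiles:

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 5])  # hypothetical toy data

default_q = values.quantile()   # q defaults to 0.5
median = values.median()
quartiles = values.quantile(q=[0.25, 0.5, 0.75])  # Series indexed by q

print(default_q == median)
print(quartiles)
```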

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Averages


In [None]:
titanic["fare"].mean()

In [None]:
titanic["fare"].median()

In [None]:
titanic["fare"].mode()

In [None]:
titanic["name"].mode()
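Unlike `.mean()` and `.median()`, which return a single number, `.mode()` returns a Series, because more than one value can be tied for most frequent. A minimal sketch on a hypothetical Series called `values`:

```python
import pandas as pd

values = pd.Series([1, 1, 2, 2, 3])  # hypothetical toy data

modes = values.mode()        # Series: 1 and 2 both appear twice
mean_value = values.mean()
median_value = values.median()

print(modes.tolist())
print(mean_value, median_value)
```

This is also why the mode of a column where every value is distinct, such as a column of names, can return every value in the column.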

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Spread

In [None]:
titanic["fare"].std()

In [None]:
titanic["fare"].var()
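By default `.std()` and `.var()` in `pandas` calculate the sample statistics, dividing by n - 1 (`ddof=1`). The sketch below, on a hypothetical Series called `values`, shows how `ddof=0` gives the population version instead, and that the variance is the square of the standard deviation:

```python
import pandas as pd

values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical toy data

sample_std = values.std()            # default ddof=1, divides by n - 1
population_std = values.std(ddof=0)  # divides by n
sample_var = values.var()            # default ddof=1

print(sample_std, population_std)
print(sample_var)
```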

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Counting Values

In [None]:
titanic["embarked"].count()

In [None]:
titanic["age"].isnull().tail()

In [None]:
titanic["age"].tail()

### Null value count

In [None]:
titanic["age"].isnull().sum()
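`.count()` only counts non-null values, so together with the null count it should add up to the length of the column. A minimal sketch on a hypothetical Series called `ages`:

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 35.0, np.nan, 58.0])  # hypothetical toy data

non_null = ages.count()       # number of non-null values
nulls = ages.isnull().sum()   # True counts as 1, so this counts nulls
total = len(ages)             # all rows, null or not

print(non_null, nulls, total)
```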

### Value count

In [None]:
titanic["sex"].value_counts()
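`.value_counts()` leaves out missing values by default. The sketch below, on a hypothetical Series called `ports`, shows two of its parameters: `normalize=True` returns proportions rather than counts, and `dropna=False` includes missing values in the result:

```python
import pandas as pd

ports = pd.Series(["S", "C", "S", None, "Q", "S"])  # hypothetical toy data

counts = ports.value_counts()                    # missing values excluded
proportions = ports.value_counts(normalize=True) # share of non-missing values
with_missing = ports.value_counts(dropna=False)  # missing values included

print(counts)
print(proportions)
print(with_missing)
```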

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Sum

In [None]:
titanic["fare"].sum()

## Unique Values

In [None]:
titanic["boat"].unique()

In [None]:
titanic["boat"].nunique()
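The two results can differ by one: `.unique()` includes a missing value in the array it returns, while `.nunique()` ignores missing values unless we pass `dropna=False`. A minimal sketch on a hypothetical Series called `boats`:

```python
import pandas as pd

boats = pd.Series(["2", "11", None, "2", None])  # hypothetical toy data

distinct = boats.unique()                     # includes the missing value
n_distinct = boats.nunique()                  # missing values ignored
n_with_missing = boats.nunique(dropna=False)  # missing counted as a value

print(distinct)
print(n_distinct, n_with_missing)
```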

<hr style="width:70%;height:4px;border-width:0;color:gray;background-color:#003d59; opacity:0.75"> 

## Summary 

In this chapter we’ve explored:
* Overall descriptive statistics
* Range
* Averages
* Spread
* Counting Values
* Sums and unique values

## Exercise 

How old is the oldest passenger in the `titanic` DataFrame?

Find the standard deviation of the `age` column.

How many passengers were in each class?