<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [6]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

# Descriptive Statistics

Our DataFrames currently hold two broad types of data that we want to look at - numerical data (i.e. ints and floats) and text data (i.e. strings, which pandas slightly confusingly stores with the `object` dtype).

In this section, we will explore some basic univariate descriptive statistics.

## 4.1 Describing Numerical Data
Let's start with the numerical data, because that's the easiest to work with. Pandas even has a built-in method called `.describe()` which provides some descriptive statistics for all the numerical columns in a DataFrame.

In [None]:
# Describe the titanic dataframe
titanic.describe()

As it happens, the descriptive statistics output for our titanic dataset is also a DataFrame!

Make sure you understand what each row means in this table:
* **count** - the number (count) of entries in the given column.
* **mean** - the average (arithmetic mean) data value in the given column.
* **std** - the standard deviation (spread) of values in the given column.
* **min** - the smallest value in the given column.
* **25%** - the value of the data at the lower quartile (i.e. after the first 25% of data, ordered from smallest to largest).
* **50%** - the middle value of the data (aka the median), half the values are larger than this value, and half smaller.
* **75%** - the value of the data at the upper quartile (i.e. after the first 75% of data, ordered from smallest to largest).
* **max** - the maximum data value recorded.
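
As a minimal sketch of how these rows behave, the example below runs `.describe()` on a small made-up DataFrame and then picks out individual statistics with `.loc`. The `toy` frame and its values are invented for illustration and are not part of the titanic dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the titanic dataset
toy = pd.DataFrame({'age': [20.0, 30.0, 40.0, None],
                    'fare': [10.0, 20.0, 30.0, 40.0]})

summary = toy.describe()

# The output is itself a DataFrame, so rows and cells can be selected as usual
age_count = summary.loc['count', 'age']    # missing values are not counted
fare_median = summary.loc['50%', 'fare']   # the 50% row is the median

print(summary)
print(age_count, fare_median)
```

Note that **count** for `age` is lower than for `fare`, because the missing age is skipped.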

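We can get a sense of the data from these descriptive statistics. For instance, comparing the **count** row across columns shows which columns have missing values, and comparing **mean** with **50%** hints at whether a column's distribution is skewed.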

## 4.2 Descriptive Statistics for Numerical Data

`.describe()` is great to get an overview, but what if we just wanted particular statistics and not the whole lot?

Well, pandas will let you run a range of statistics individually! Some examples are given in the code below. 

In [None]:
# count() can be defined for all datatypes, so all columns are computed. Note which columns have some missing data.
titanic.count()

In [None]:
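# mean() is only defined for numeric columns, so we restrict the calculation to them.
# (Recent versions of pandas raise an error on text columns otherwise.)
titanic.mean(numeric_only=True)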

In [None]:
# std() is also only defined for numeric columns, so we restrict it in the same way.
titanic.std(numeric_only=True)

In [None]:
# min() has a definition for numeric and text data.
# The minimum value of a text field is the text which is first alphabetically.
titanic.min() 

In [None]:
# max() has a definition for numeric and text data.
# The maximum value of a text field is the text which is last alphabetically.
titanic.max() 

In [None]:
# quantile() allows you to specify quantiles, such as 0.25 (lower quartile), 0.5 (median), and 0.75 (upper quartile).
# For convenience median() also exists. Like mean(), quantile() only makes sense for numeric columns.
titanic.quantile(0.25, numeric_only=True) # 25% - lower quartile.
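
`quantile()` also accepts a list, which is handy when you want several quantiles at once. The sketch below uses a made-up `ages` Series (hypothetical values, not from the titanic dataset) to show this, and to check that the 0.5 quantile agrees with `median()`.

```python
import pandas as pd

# Hypothetical values for illustration only
ages = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])

# Passing a list returns a Series indexed by the requested quantiles
quartiles = ages.quantile([0.25, 0.5, 0.75])
print(quartiles)

# The 0.5 quantile and median() give the same value
print(quartiles.loc[0.5] == ages.median())
```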

In [None]:
# sum() works to concatenate text, producing a curious output.
titanic.sum()

In [None]:
# These methods can also be called on a single selected column.
titanic['fare'].sum()

In [None]:
# As it happens, Python has built-in sum, min and max functions which do a similar job.
# However, pandas' sum copes better with missing data: the built-in sum returns nan if any value is missing.
sum(titanic['fare'])

In [None]:
# Try this instead
sum(titanic[titanic['fare'].notnull()]['fare'])

The cell above introduces a new filter condition for working with missing data: `notnull()`. It returns `True` for rows that have a valid value and `False` for rows where the value is missing. The opposite of `notnull()` is `isnull()`.
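
To make the difference concrete, here is a small sketch using a made-up `fares` Series with one missing value (the numbers are hypothetical, not taken from the titanic dataset). It compares the built-in `sum` with the pandas `.sum()` method and shows the Boolean mask that `notnull()` produces.

```python
import math
import pandas as pd

# Hypothetical fares with one missing value
fares = pd.Series([10.0, None, 30.0])

builtin_total = sum(fares)          # NaN propagates through the built-in sum
pandas_total = fares.sum()          # pandas skips missing values by default
mask = fares.notnull()              # Boolean Series marking the valid rows
filtered_total = sum(fares[mask])   # built-in sum works once NaN rows are removed

print(math.isnan(builtin_total), pandas_total, filtered_total)
print(mask.tolist(), fares.isnull().tolist())
```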

This method of selection is similar to making conditional statements with object methods that return a Boolean, e.g.
```python
if string_variable.islower():
    # Do something
```
The same principle applies in other contexts. For instance, the `Series` object has a large number of string methods collected under `.str`. Calling `titanic['name'].str.contains('Mr.', regex=False)` returns `True` or `False` for each row in the column, depending on whether it contains the substring 'Mr.'.

In [None]:
# Select passengers with title Mr. and get mean fare
titanic[titanic['name'].str.contains('Mr.', regex=False)]['fare'].mean()

In [None]:
# Select passengers with title Mrs. and get mean fare
titanic[titanic['name'].str.contains('Mrs.', regex=False)]['fare'].mean()
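
The `regex=False` argument in the cells above matters here. As a rough sketch, assuming a few made-up names in the same style as the dataset (the `names` Series is hypothetical), the example below shows that treating 'Mr.' as a regular expression would also match 'Mrs', because `.` stands for any character in a regular expression.

```python
import pandas as pd

# Hypothetical names in the same 'Surname, Title. Forename' style
names = pd.Series(['Kelly, Mr. James', 'Smith, Mrs. Anna', 'Jones, Miss. Kate'])

literal = names.str.contains('Mr.', regex=False)  # matches the exact text 'Mr.'
pattern = names.str.contains('Mr.', regex=True)   # '.' matches any character

print(literal.tolist())
print(pattern.tolist())
```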

## 4.3 Describing Text (or Categorical) Data

We can still use `.describe()` to look at text data, but we need to specify that we're looking at object (text) data types.

Really, the descriptive statistics below are for categorical data. They don't work very well if every value in a field is a different piece of text!

In [None]:
# In the describe parameters we're only choosing to include object datatypes, given by 'O'.
# The 'O' is in a list, because we could include other data types in the list if we wanted to.
titanic.describe(include=['O'])

When you describe object columns, you get different summary statistics from those for numerical data:

* **count** as before, a count of the values present in each column.
* **unique** a count of the number of unique values in each column.
* **top** is the most common value - aka the mode.
* **freq** is the frequency of occurrence of the most common value.
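
A small sketch of these four statistics, using a made-up `cities` Series (hypothetical values chosen for illustration). It also compares **top** and **freq** with the first entry of `value_counts()`, a method covered later in this section.

```python
import pandas as pd

# Hypothetical categorical column with one missing value
cities = pd.Series(['Southampton', 'Cherbourg', 'Southampton', None],
                   name='embarked_city')

summary = cities.describe()
counts = cities.value_counts()

print(summary)
# top and freq correspond to the first (most common) entry of value_counts()
print(counts.index[0], counts.iloc[0])
```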

Let's dig a bit deeper into some of these columns.

In [None]:
# Interestingly there are 2 James Kellys, however they don't appear to be duplicates.
titanic[titanic['name'] == 'Kelly, Mr. James']

In [None]:
# One way we could check for other name duplicates is by taking the mode.
# As the mode can be non-unique, it returns a Series.
# Looks like Kate Connolly is another possible duplicate.
titanic['name'].mode()

In [None]:
# Again, these appear to be different people!
titanic[titanic['name'] == 'Connolly, Miss. Kate']

In [None]:
# The unique() method will give us all the unique values in a column.
titanic['embarked_city'].unique()

In [None]:
# The value_counts() method gives a count for each unique value in a chosen column.
titanic['embarked_city'].value_counts()
# Most people embarked in Southampton.

## 4.4 Sorting Data

Sorting data is straightforward in pandas. A simple sort on one column uses the DataFrame method `.sort_values()`:
```python
titanic.sort_values('age')
```
The default is to sort in ascending order, from smallest to largest value. Set the ascending parameter to `False` for a descending sort:
```python
titanic.sort_values('age', ascending = False)
```
This approach sorts and returns the entire DataFrame. If you want to sort a single column on its own, it works similarly:
```python
titanic['age'].sort_values()
```
More complicated sorting behaviours can be managed by passing a list, in the order you would like the sort to occur:
```python
titanic.sort_values(['pclass','age'], ascending = [True, False])
```
In the above code I sort first by 'pclass', then by 'age'. In addition, I pass a list to `ascending`, indicating that 'pclass' is to be sorted in ascending order and 'age' in descending order.
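
The multi-column sort can be tried out on a small made-up DataFrame. This is only a sketch: the `toy` frame and its values are hypothetical, chosen so that the resulting order is easy to check by eye.

```python
import pandas as pd

# Hypothetical passengers for illustration only
toy = pd.DataFrame({'pclass': [2, 1, 1, 2],
                    'age': [30.0, 40.0, 20.0, 50.0]})

# pclass ascending, then age descending within each class
result = toy.sort_values(['pclass', 'age'], ascending=[True, False])
print(result)

# sort_values returns a new DataFrame; the original row order is unchanged
print(toy.index.tolist(), result.index.tolist())
```

The original index labels travel with their rows, which makes it easy to see where each row ended up.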

In [None]:
# sort fare descending
titanic.sort_values('fare', ascending = False).head(8)

In [None]:
# sort by sex ascending, then age descending
titanic.sort_values(['sex','age'], ascending = [True, False]).head()

In [None]:
# sort by sex descending, then age descending
titanic.sort_values(['sex','age'], ascending = False).head()

# Exercise 5

1. How old is the oldest passenger in the dataset?
2. How many men and women are in the dataset?
    * Check the pd.Series.value_counts() docstring and figure out how to get proportions of men and women.
3. Create a new column called 'std_fare' which is the 'fare' minus the mean fare, divided by the standard deviation.
4. Calculate the number of children in second class.

In [4]:
## Question 1

#print("The oldest passenger is {} years old.\n".format(titanic['age'].max()))

## Question 2

#print(titanic['sex'].value_counts(),'\n')

## Question 2b

#print(titanic['sex'].value_counts(normalize = True),'\n')

## Question 3

#titanic['std_fare'] = (titanic['fare'] - titanic['fare'].mean())/titanic['fare'].std()
#print(titanic['std_fare'].head(),'\n')

## Question 4

#print("There are {} children in second class".format(titanic[titanic['pclass'] == 2]['child'].sum()))

toggle_code()