<img src="https://datasciencecampus.ons.gov.uk/wp-content/uploads/sites/10/2017/03/data-science-campus-logo-new.svg"
             alt="ONS Data Science Campus Logo"
             width = "240"
             style="margin: 0px 60px"
             />

In [8]:
# import the helper functions from the parent directory,
# these help with things like graph plotting and notebook layout
import sys
sys.path.append('..')
from helper_functions import *

# set things like fonts etc - comes from helper_functions
set_notebook_preferences()

# add a show/hide code button - also from helper_functions
toggle_code(title = "import functions")

# Selecting and Filtering DataFrames

Over the following sections, we will learn how to select and filter data using pandas DataFrames. This is one of the most useful and powerful features of pandas.  

It is useful for a range of reasons, from simply cutting down a large dataset into the specific sub-sets of data required for analysis, to managing the various components of a model (e.g. dependent and independent variables, training and test data etc.) and conducting specific subgroup analyses.

Selecting and filtering can be done by using the indexing operator. Pandas uses the same indexing operator as lists, tuples, and dictionaries - `[]` (square brackets).

However, the DataFrame indexing operator is more sophisticated than the one used for Python's built-in data structures. As you'll see, its behaviour depends on what you pass to it, so the same operator can accept different kinds of input and return different outputs.
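As a rough preview of that behaviour, the sketch below passes four different kinds of input to the indexing operator. The small DataFrame `df` and its values are made up purely for illustration and are not part of the titanic data.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "age": [31, 45, 27],
})

one_column = df["name"]            # a string returns a Series
two_columns = df[["name", "age"]]  # a list of strings returns a DataFrame
first_rows = df[0:2]               # a slice returns rows, as a DataFrame
older = df[df["age"] > 30]         # a boolean Series returns the rows where it is True

print(type(one_column).__name__)
print(type(two_columns).__name__)
print(len(first_rows), len(older))
```

Each of these forms is covered in more detail in the sections that follow.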

## 2.1 Selecting Columns from DataFrames

The simplest way to select a column from a dataframe is to use the name of that column!

In [None]:
# Select by passing the name of a column as a string.
passengers = titanic['name']
passengers.head()

In the code above, the following things happen:
1. We index the DataFrame `titanic` using the column header.
2. Assuming 'name' is a legitimate column header, a pandas Series is returned.
3. This Series, representing the 'name' column of the `titanic` DataFrame, is assigned to a variable called `passengers`.
4. Finally, we look at the first 5 rows of the Series object `passengers`.

If we want to select multiple columns, we have to first collect the column names together using a list, and then pass that to the DataFrame indexing operator.

In [None]:
# First make a list of column names called cols
cols = ['name','age']
# Use cols to select multiple columns from titanic.
passenger_cols = titanic[cols]
passenger_cols.head()

The code above is very similar to the code for selecting a single column. The main difference is that to select multiple columns we pass a list of string objects, rather than a single string object directly.

However, because we have selected multiple columns, the `passenger_cols` object is actually a DataFrame, rather than a Series object.

Note that we don't have to create the `cols` list first. We can create it on-the-fly inside the indexing operator; you just have to learn to distinguish the list constructor square brackets from the indexing square brackets!

In [None]:
# select multiple columns directly.
passenger_cols = titanic[['name','age']]
passenger_cols.head()

In [None]:
# Just have a look, rather than assigning to a variable
titanic[['name','age']].head()

## Exercise 1

1. Refresh your memory of the titanic data by getting:
    * The number of rows and columns in the DataFrame
    * The datatypes of the columns.
    * A list of the column names for the titanic dataset.
2. Select the 'fare' column from the `titanic` data and show the tail of the data.
3. Select just the last column. Try using the list of column names you made earlier.
4. Select the second, third and fourth columns. Try doing it using the DataFrame `columns` property directly.

In [3]:
## Question 1

## The number of rows and columns
#n_rows, n_cols = titanic.shape
#print(f"There are {n_rows} rows and {n_cols} columns in the data\n")

## The data types of the columns
#print(titanic.dtypes, '\n')

## Get a list of column names
#colnames = list(titanic.columns)
#print(colnames,"\n")

## Question 2
#fares = titanic['fare']
#print(fares.tail(),"\n")

## Question 3
#print(titanic[colnames[-1]].head(),'\n')

## Question 4
#print(titanic[titanic.columns[1:4]].head())

toggle_code()

## 2.2 Filtering rows from Dataframes

Filtering rows from a pandas DataFrame works in a very similar way to selecting columns. Simple filtering can be achieved by passing a slice to the DataFrame indexer, just like slicing a list.

The code below does the same thing as `.head()` and `.tail()` and can be used to show any arbitrary range of rows in a given DataFrame.

Note that this is identical to slicing a list.

However, we can't get individual rows by indexing with a single integer as we would with a list, because a column could be named with an integer. This would make `dataframe[0]` ambiguous: it could refer to the first row, or to a column named 0. To avoid this, pandas always treats a single value as a column label, so `dataframe[0]` only works if you have a column named `0`, which is the default column name produced by some pandas operations.

This means that selecting a single row also requires a slice.
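A minimal sketch of this behaviour is shown here, using a small made-up DataFrame (`df` is hypothetical and not part of the titanic data). Because `df` has no column labelled `0`, the lookup fails with a `KeyError`, while a one-row slice works.

```python
import pandas as pd

# A tiny, made-up DataFrame used only for illustration.
df = pd.DataFrame({"name": ["Ann", "Bob", "Cy"], "age": [31, 45, 27]})

try:
    df[0]  # looks for a column labelled 0, which does not exist here
    found_column = True
except KeyError:
    found_column = False

single_row = df[1:2]  # a one-row slice returns a one-row DataFrame

print(found_column)
print(single_row)
```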

In [None]:
# first 5 rows
titanic[0:5] # or - titanic[:5]

In [None]:
# last 5 rows
titanic[-5:]

In [None]:
# arbitrary slice
titanic[102:109]

In [None]:
# 1 row
titanic[123:124]

## 2.3 Conditional Filtering

Most filtering is a more involved operation where we first specify the condition(s) that must be met in order for rows to be included in or excluded from the output dataframe.

The way that pandas filters rows can be thought of as a two-step process.

1. Create a 'mask' that specifies inclusion and exclusion for each row in the dataframe.
2. Mask the dataframe to return the subset of rows that are included.

This sounds a bit abstract, so let's consider what this might look like in practice.

Imagine you have the following (very simple) dataframe, called 'catdog':

index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
2 | Dog | Chewbarka
3 | Cat | JK Meowling


You want to filter so you just have 'Cat' rows. Therefore you design the following condition:

```python 
mask = catdog['Animal'] == 'Cat'
```

There are a lot of = (equals) in the above statement.
* The first = indicates assignment, we are assigning the outcome of the expression on the right to the variable on the left of the equals sign.
* The second double equals sign, ==, indicates a comparison, in this case it assesses whether each value in the 'Animal' column of catdog is equal to the text 'Cat'. If python finds that the column value and 'Cat' are the same it assigns a True value, and if not a False value.

This produces a 'mask' which is a Series of `True` and `False` values against the DataFrame index.

index | mask
---|---
0 | True
1 | True
2 | False
3 | True

Now, you just have to pass the mask to the original dataframe to complete the filtering process.

```python
catdog2 = catdog[mask]
catdog2
```
This subsets the catdog dataframe based on the True (include) and False (exclude) values, producing:

index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
3 | Cat | JK Meowling

The row that had a 'Dog' value for 'Animal' has been removed. Note, though, that the index values are the same as in the original. Sometimes it is important to reset the index after filtering to restore it to sequential integers starting at 0.

If you want to reset the index, you can do so with this code:

```python
catdog2 = catdog2.reset_index(drop = True)
catdog2
```
index | Animal | Name
---| --- | ---
0 | Cat | Catalie Portman
1 | Cat | Pico de Gato
2 | Cat | JK Meowling

See above that the index has been reset to be sequential.
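To try the whole mask, filter and reset sequence yourself, the `catdog` example can be put together as one runnable sketch. The DataFrame is built by hand from the table above; the variable names `cats` and `cats_reset` are chosen here for illustration.

```python
import pandas as pd

# Build the small catdog DataFrame from the table above.
catdog = pd.DataFrame({
    "Animal": ["Cat", "Cat", "Dog", "Cat"],
    "Name": ["Catalie Portman", "Pico de Gato", "Chewbarka", "JK Meowling"],
})

# Step 1: create the mask, a boolean Series with one value per row.
mask = catdog["Animal"] == "Cat"

# Step 2: apply the mask to keep only the rows marked True.
cats = catdog[mask]
print(list(cats.index))        # [0, 1, 3] - original index labels are kept

# Optional: reset the index so it runs 0, 1, 2 again.
cats_reset = cats.reset_index(drop=True)
print(list(cats_reset.index))  # [0, 1, 2]
```

Using `drop=True` discards the old index; without it, `reset_index()` keeps the old index as a new column.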

## 2.4 Operators to filter data

We can filter by using logical comparison statements:

* `==` 'is equal to'. Notice the double equals sign and watch out! A single one would be assigning to a variable!
* `!=` 'does not equal', the opposite of `==`
* `>` greater than
* `<` less than
* `>=` greater than or equal to
* `<=` less than or equal to

In addition, pandas includes some functions to make particular comparisons easier:

* .isin(list) which we can use for multiple conditions 
* .between() which we can use to specify upper and lower bounds

Finally, the `~` (tilde) allows us to flip or invert a boolean expression. Basically, if an expression returns [True, True, False], the same expression with a `~` in front of it will return [False, False, True].

However, we'll concentrate on the simple operators in the top list for now.
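For reference, the effect of `~` can be sketched on a hand-made boolean Series. The Series `mask` and the small `animals` DataFrame here are made up for illustration only.

```python
import pandas as pd

# A hand-made boolean Series, standing in for the result of a comparison.
mask = pd.Series([True, True, False])

# ~ flips every value in the Series.
inverted = ~mask
print(inverted.tolist())  # [False, False, True]

# The same idea applied to a filter: keep the rows that are NOT cats.
animals = pd.DataFrame({"Animal": ["Cat", "Cat", "Dog"]})
not_cats = animals[~(animals["Animal"] == "Cat")]
print(not_cats["Animal"].tolist())  # ['Dog']
```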

In [None]:
# Filter titanic for 3rd class passengers only

# First make the mask
mask = titanic['pclass'] == 3
# Have a quick look at the mask
mask.sample(5) # 5 randomly chosen rows from the mask.

In [None]:
# Now filter the titanic dataframe with this mask
thirdclass = titanic[mask]
thirdclass.head()

In [None]:
# Use the same approach for other logical statements.
mask = titanic['fare'] > 200
titanic[mask].head(7)

## Exercise 2

1. Show the row for the passenger named: 'Birkeland, Mr. Hans Martin Monsen'
2. How many passengers in the dataset are male?
3. How many passengers are under 18 years of age?
4. What proportion of passengers in the dataset survived?

In [5]:
## Question 1

#print(titanic[titanic['name'] == 'Birkeland, Mr. Hans Martin Monsen'],"\n")

## Question 2

#print(f"{len(titanic[titanic['sex'] == 'male'])} passengers are male.\n")

## Question 3

#print(f"{len(titanic[titanic['age'] < 18])} passengers are children.\n")

## Question 4
#total_rows = len(titanic)
#survive_rows = len(titanic[titanic['survived'] == 1])
#print("The proportion of survivors is {:.2f}".format(survive_rows/total_rows))

toggle_code()

## 2.5 Using Multiple Conditions to Filter

So far, we've only filtered according to individual conditions set on a single column, but there is no reason we can't filter by several conditions and/or columns at once. However, we do need to think about how the conditions relate to each other. We have two options to establish these relationships:

* **and** relationships are given by the **&** (ampersand) symbol. This implies both/all conditions must be met for a row to evaluate to True.
* **or** relationships are given by the **|** (pipe) symbol. This implies that if _any_ of the conditions is met, a given row evaluates to True.

You can think of `.isin()` and `.between()` as being special versions of multiple condition filters.

* isin() is basically just a lot of linked **or** statements - *value1* **or** *value2* **or** *value3* etc.
* between() is an **and** condition - greater than (or equal to) the lower bound **and** less than (or equal to) the upper bound.
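The equivalence described in these two bullets can be checked with a short sketch. The Series `ages` is made up for illustration, and the sketch relies on the default behaviour of `.between()`, which includes both bounds.

```python
import pandas as pd

# A made-up Series of ages, used only for illustration.
ages = pd.Series([3.0, 5.0, 1.0, 7.0, 11.0])

# .isin() behaves like a chain of == comparisons joined with | (or).
isin_mask = ages.isin([3.0, 7.0])
or_mask = (ages == 3.0) | (ages == 7.0)
print(isin_mask.equals(or_mask))

# .between() behaves like >= lower & <= upper (both bounds included by default).
between_mask = ages.between(3.0, 7.0)
and_mask = (ages >= 3.0) & (ages <= 7.0)
print(between_mask.equals(and_mask))
```

Both comparisons should print `True`, since each pair of masks marks exactly the same rows.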

Let's again take a simple example to illustrate this with the `catdog` dataframe:

index | Animal | Name | Age
---| --- | --- | ---
0 | Cat | Catalie Portman | 3.0
1 | Cat | Pico de Gato | 5.0
2 | Dog | Chewbarka | 1.0
3 | Cat | JK Meowling | 7.0
4 | Dog | K-9 | 11.0

If you wanted to select all animals that are cats **and** who are over 4 years old, you could do the following:

```python
mask = (catdog['Animal'] == 'Cat') & (catdog['Age'] > 4.0)

catdog[mask]
```
index | Animal | Name | Age
---| --- | --- | ---
1 | Cat | Pico de Gato | 5.0
3 | Cat | JK Meowling | 7.0

Only Cats over 4 years old have been included in the filter.

However, if you wanted to select all animals that are either cats **or** are over 4 years old, you could instead do:

```python
mask = (catdog['Animal'] == 'Cat') | (catdog['Age'] > 4.0)

catdog[mask]
```
index | Animal | Name | Age
---| --- | --- | ---
0 | Cat | Catalie Portman | 3.0
1 | Cat | Pico de Gato | 5.0
3 | Cat | JK Meowling | 7.0
4 | Dog | K-9 | 11.0
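
Both `catdog` filters can be reproduced with the runnable sketch below, which builds the DataFrame by hand from the table in this section. The names `is_cat` and `over_4` are introduced here for illustration. Note that the parentheses around each condition in the examples above are required: in Python, `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so without parentheses the expression is not evaluated as intended. Storing each condition in its own variable, as sketched here, is one way to avoid the problem.

```python
import pandas as pd

# Build the catdog DataFrame (with ages) from the table in this section.
catdog = pd.DataFrame({
    "Animal": ["Cat", "Cat", "Dog", "Cat", "Dog"],
    "Name": ["Catalie Portman", "Pico de Gato", "Chewbarka", "JK Meowling", "K-9"],
    "Age": [3.0, 5.0, 1.0, 7.0, 11.0],
})

# Name each condition separately, so no extra parentheses are needed later.
is_cat = catdog["Animal"] == "Cat"
over_4 = catdog["Age"] > 4.0

cats_and_over_4 = catdog[is_cat & over_4]  # both conditions must be True
cats_or_over_4 = catdog[is_cat | over_4]   # at least one condition must be True

print(list(cats_and_over_4.index))  # [1, 3]
print(list(cats_or_over_4.index))   # [0, 1, 3, 4]
```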


In [None]:
# Let's try some multiple condition filters with the titanic data
# First class passengers who are women.
mask = (titanic['pclass'] == 1) & (titanic['sex']== 'female')
titanic[mask].head()

In [None]:
# Women or children
mask = (titanic['sex'] == 'female') | (titanic['age'] < 18)
titanic[mask].head()

In [None]:
# Try the special functions for multiple selection. First .isin()
# Passengers from Cherbourg ('C') or Queenstown ('Q')
titanic[titanic['embarked'].isin(['C','Q'])].sample(7)

In [None]:
# Now, .between()
# passengers who paid between 100 and 250
titanic[titanic['fare'].between(100,250)].head()

## Exercise 3

1. Select passengers who are in classes 2 or 3. What percentage of passengers is this?
2. How many passengers do not have siblings or spouses ('sibsp'), or parents or children ('parch') on the boat?
3. What proportion of passengers who 'embarked' in Cherbourg ('C') or Queenstown ('Q') survived? 

In [6]:
## Question 1

#n_c2_3 = len(titanic[titanic['pclass'].isin([2,3])])
#print("{:.1f}% of passengers were in 2nd and 3rd class.\n".format(n_c2_3/len(titanic)*100))

## Question 2

#num_solo = len(titanic[(titanic['sibsp'] == 0)&(titanic['parch'] == 0)])
#print(f"There are {num_solo} solo travelers in the dataset.\n")

## Question 3

#total_em = len(titanic[titanic['embarked'].isin(['C','Q'])])
#survive_em = len(titanic[titanic['embarked'].isin(['C','Q']) & (titanic['survived'] == 1)])
#print(f"The proportion of Cherbourg and Queenstown passengers who survived was: {survive_em/total_em:.2f}")

toggle_code()