# Foundation Data Sciences
## Week 02: Introduction to Jupyter Notebooks and Pandas

**Learning outcomes:** 
In this lab you will learn the very basics of the Python library pandas, which is used for data management. By the end of the lab you should be able to: 
- use Jupyter Notebook,
- load different data file types, 
- display data,
- filter your data for specific values, and 
- apply basic statistical computations on the data.

**Prerequisites**  
- Basic knowledge of `python` is assumed for this course. If you haven't used Python before or need a refresher, we can recommend the following [python tutorial](http://bebi103.caltech.edu/2015/tutorials/t1a_intro_to_python.html) as a starting point. 
- Basic knowledge of `numpy` is assumed for this course. If you haven't used numpy before or need a refresher, we can recommend the following [numpy tutorial](http://cs231n.github.io/python-numpy-tutorial/#python).

We will try to cover a different research question every week. This week we will take the position of a historian and try to answer the following question.

**Research question:** Which passenger group had the worst survival rate on the [*Titanic*](https://en.wikipedia.org/wiki/Titanic)?

**Data information:** We will use a well-known dataset in the machine learning community, often referred to as the [Titanic dataset](https://www.kaggle.com/c/titanic/data). It contains a list of passengers, information about whether they survived or not, and some extra information, such as age, fare, gender and the class they travelled in. On the website, you will find a testing and a training dataset. What training and testing data are will be covered later in this course. In this lab we will not be doing machine learning; therefore the data is combined into a single dataset.

## 1.A IPython / Jupyter environment

Jupyter Notebook is a web-based interactive computational environment, which enables code to be shared and documented. It supports Julia, Python and R (Ju-pyte-r). A notebook is a collection of *code* and *Markdown* (text) **cells**. We will only give a high-level introduction to Jupyter Notebooks, which will be enough to solve the labs for this course. If you are interested in creating your own notebooks, or you just generally want to get a better understanding, we recommend the following [tutorial](http://bebi103.caltech.edu/2015/tutorials/t0b_intro_to_jupyter_notebooks.html). 

Each code cell can be run separately, and the output is given below the cell. A number appears at the side of the *code* cell to indicate the order in which the cells were run. 

**Remarks**
1. Code in one cell can run, even if there are errors in other cells.
2. The order in which these cells are run is important, e.g. if you are calling a function in cell A, which is defined in cell B, cell B needs to be executed before cell A.
3. This means that the state of the variables in the notebook can be difficult to see.

All objects created by running cells are stored in the *kernel* running in the background. You can restart the kernel by using the Kernel menu at the top of the notebook. Because of the issues with the state of the system noted in the remarks, we recommend that when developing notebooks, you periodically select **Kernel->Restart & Run All** to check that your code really is reproducible. If you run into problems, it can also be a good idea to select **Kernel->Restart & Run All**; the problem might be due to the state of the variables.

### 1.A.1 Basic operation and shortcuts

There are two modes of selection when inside a Jupyter Notebook:
1. Command Mode - When you hit up/down arrows you select different cells. Hit `<enter>` to enter edit mode.
2. Edit Mode - You can edit the cell. Hit `<esc>` to enter Command Mode again.

In Command Mode (cell highlighted blue):
```
              <h> - bring up help window (contains full list of shortcuts!)
          <enter> - Enter Edit Mode
              <a> - create new cell above selected
              <b> - create cell below selected
          <d> <d> - delete selected cell (pressing 'd' twice)
```

In Edit Mode (cell highlighted green):
```
            <esc> - Enter Command Mode
<shift> + <enter> - Run cell and move to cell below in Command Mode
 <ctrl> + <enter> - Run cell in place

```

Try running the following code cell:

In [2]:
a = 1
b = 2
a + b

3

You'll notice that the notebook will try to display the last thing in the cell, even if you don't use a `print` statement. However, if you want to print multiple things from one cell, you need to use multiple `print` statements (or multiple cells).

In [3]:
first_name = 'Jane'
last_name = 'Doe'
print(first_name)
print(last_name)

Jane
Doe


**Good Practice** 
- It is good practice to separate code into different cells. One cell should correspond to one task, similarly to functions. For example, use a cell for the `import` statements, one for loading data, one for preprocessing data, and one for each different operation you carry out on the data.
- It's generally good practice to import all your packages at the top of a file. We will do so in future tutorials.

Before we start, we need to import the packages that we will be using later:

In [4]:
import os
import pandas as pd
import numpy as np

`os` is Python's standard operating system module, which we will use to access files. `pd` is an alias for the `pandas` module, to save typing later. Likewise, `np` is an alias for the `numpy` module. `pd` and `np` are the standard aliases for `pandas` and `numpy`. Here is a more [in-depth tutorial on installing, importing and using modules](https://www.digitalocean.com/community/tutorials/how-to-import-modules-in-python-3).

## 1.B Pandas

Pandas is a library for data manipulation and analysis. There are two fundamental data structures in pandas: the **Series** and **DataFrame** structures which are built on top of NumPy arrays. (Again, if you need a refresher, you can check out this [numpy tutorial](http://cs231n.github.io/python-numpy-tutorial/#python).)

Pandas is well documented and you will find good information about all methods and structures in the [API reference](http://pandas.pydata.org/pandas-docs/version/0.23.4/api.html).

### 1.B.1 Series

A **Series** is a one-dimensional object (similar to a list). Each element has a corresponding *index*. By default the indices range from `0` to `N-1`, where `N` is the length of the Series. 

In [5]:
passenger = pd.Series(['Mr. Owen Harris Braund', 22.0, False])
passenger

0    Mr. Owen Harris Braund
1                        22
2                     False
dtype: object

If we want to specify **meaningful labels for the index**, we can do so with the `index` parameter.

In [6]:
passenger = pd.Series(['Mr. Owen Harris Braund', 22.0, False], index=['Name', 'Age', 'Survived'])
passenger

Name        Mr. Owen Harris Braund
Age                             22
Survived                     False
dtype: object

You can **access a Series** entry the same way as you access list entries, either using the assigned index labels, such as `'Name'`, or by using the numeric index, i.e. `0:(N-1)`, where `N` is the length of the Series.

In [7]:
print(passenger[1]) # Careful: indexing starts at 0.
print(passenger['Age']) # Remember to use quotes

22.0
22.0
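Plain square brackets can be ambiguous when a Series has a labelled index, and recent pandas releases warn when an integer position is used in that way. A minimal sketch, rebuilding the `passenger` Series from above, of the explicit accessors `.loc` (by label) and `.iloc` (by position), which work on a Series just as they do on a DataFrame:

```python
import pandas as pd

passenger = pd.Series(['Mr. Owen Harris Braund', 22.0, False],
                      index=['Name', 'Age', 'Survived'])

age_by_label = passenger.loc['Age']    # look up by index label
age_by_position = passenger.iloc[1]    # look up by position (counting starts at 0)

print(age_by_label, age_by_position)
```

Both lookups return the same entry; which one to use depends on whether you know the label or the position.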


### 1.B.2 DataFrame

A DataFrame is a tabular data structure comprised of rows and columns. You can also think of the DataFrame as a collection of Series objects that share an index. 

#### Creating DataFrame structures, adding rows, deleting rows and modifying entries

We can create an empty DataFrame by specifying the column names. Then we can insert data row by row.

In [8]:
passengers = pd.DataFrame(columns=['Gender', 'Age', 'Survived'])
passengers # Careful, the dataframe is called passenger*s*, use meaningful variable names when coding

Unnamed: 0,Gender,Age,Survived


Now, let's start filling the dataframe. To specify the row of a data frame, we use the `.loc` attribute.

In [9]:
passengers.loc[0] = ['Male', 22.0, False]  # Note how we used df.loc[] (square brackets) to specify the index label
passengers 

Unnamed: 0,Gender,Age,Survived
0,Male,22.0,False


Remember, we said that a DataFrame is a collection of Series. Let's double check that.

In [10]:
type(passengers.loc[0])

pandas.core.series.Series

Pandas DataFrames are quite flexible. Just as with Series, we can also use strings as index labels.

In [11]:
passengers.loc['Mrs. John Bradley Cumings'] = ['Female', 38.0, 'Yes']
passengers

Unnamed: 0,Gender,Age,Survived
0,Male,22.0,False
Mrs. John Bradley Cumings,Female,38.0,Yes


**Remark** It is generally bad practice to mix different kinds of indices. So let's remove the first entry.

In [12]:
cleaned_passengers = passengers.drop(0)
print(cleaned_passengers)
print('\n') # Empty line between DataFrames
print(passengers)

                           Gender   Age Survived
Mrs. John Bradley Cumings  Female  38.0      Yes


                           Gender   Age Survived
0                            Male  22.0    False
Mrs. John Bradley Cumings  Female  38.0      Yes


**Remark**: By default, `df.drop(index)` returns a modified copy and leaves the original DataFrame unchanged. To drop a row from `df` itself, either assign the result back (`df = df.drop(0)`) or set the optional `inplace` argument to `True`. You can see the difference between `cleaned_passengers` and `passengers` above.

In [13]:
passengers.drop(0, inplace=True) # Remove the 0th passenger without creating a copy 
passengers

Unnamed: 0,Gender,Age,Survived
Mrs. John Bradley Cumings,Female,38.0,Yes
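The difference between the two ways of dropping a row can be checked by counting rows before and after each call. The sketch below uses a made-up two-row frame with the hypothetical labels `'a'` and `'b'`:

```python
import pandas as pd

df = pd.DataFrame({'Gender': ['Male', 'Female'], 'Age': [22.0, 38.0]},
                  index=['a', 'b'])   # 'a' and 'b' are made-up labels

without_a = df.drop('a')      # returns a new DataFrame
rows_before = len(df)         # df itself still has both rows at this point

df.drop('a', inplace=True)    # modifies df itself and returns None
rows_after = len(df)

print(len(without_a), rows_before, rows_after)
```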


You can also populate a DataFrame using a dictionary which allows you to do things in a nonstandard order. Let's get our first entry back.

In [14]:
passengers.loc['Mr. Owen Harris Braund'] = dict(Survived=False, Age=22.0, Gender='Male') # Note that the attributes are assigned in a different order
passengers

Unnamed: 0,Gender,Age,Survived
Mrs. John Bradley Cumings,Female,38.0,Yes
Mr. Owen Harris Braund,Male,22.0,False


We just made a mess. In the first row, we used a string to denote whether the passenger survived, and in the second a Boolean value. Let's clean this up.

In [15]:
passengers.loc['Mrs. John Bradley Cumings', 'Survived'] = True
passengers

Unnamed: 0,Gender,Age,Survived
Mrs. John Bradley Cumings,Female,38.0,True
Mr. Owen Harris Braund,Male,22.0,False


#### Creating DataFrame from other structures

You can also create a DataFrame from:
* A dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

Let's recreate the same DataFrame:

In [16]:
# Create a DataFrame from a list
passengers_list = [['Male', 22.0, False], ['Female', 38.0, True]]
passengers = pd.DataFrame(passengers_list, index=['Mr. Owen Harris Braund', 'Mrs. John Bradley Cumings'],
                          columns=['Gender', 'Age', 'Survived'])
passengers

Unnamed: 0,Gender,Age,Survived
Mr. Owen Harris Braund,Male,22.0,False
Mrs. John Bradley Cumings,Female,38.0,True


In [17]:
# Create a DataFrame from a dictionary where keys are column names
column_key_dict = {
    'Gender': ['Male', 'Female'],
    'Age': [22.0, 38.0],
    'Survived': [False, True]
}
passengers = pd.DataFrame.from_dict(column_key_dict, orient='columns')
passengers.index = ['Mr. Owen Harris Braund', 'Mrs. John Bradley Cumings']
passengers

Unnamed: 0,Gender,Age,Survived
Mr. Owen Harris Braund,Male,22.0,False
Mrs. John Bradley Cumings,Female,38.0,True


In [18]:
# Create a DataFrame from a dictionary where keys are index labels
index_key_dict = {'Mr. Owen Harris Braund':['Male', 22.0, False],
                  'Mrs. John Bradley Cumings':['Female', 38.0, True]}
passengers = pd.DataFrame.from_dict(index_key_dict, orient='index')
passengers.columns = ['Gender', 'Age', 'Survived']
passengers

Unnamed: 0,Gender,Age,Survived
Mr. Owen Harris Braund,Male,22.0,False
Mrs. John Bradley Cumings,Female,38.0,True
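Two of the other sources listed above, a 2-D NumPy array and a dict of Series, can be sketched in the same way. The values are the two passengers used so far, with `Survived` written as 0.0/1.0 in the array because a NumPy array holds a single dtype:

```python
import numpy as np
import pandas as pd

names = ['Mr. Owen Harris Braund', 'Mrs. John Bradley Cumings']

# From a 2-D NumPy array: one row per passenger, one column per numeric attribute
ages_and_survival = np.array([[22.0, 0.0], [38.0, 1.0]])
from_array = pd.DataFrame(ages_and_survival, index=names, columns=['Age', 'Survived'])

# From a dict of Series: the Series are aligned on their index labels
from_series = pd.DataFrame({
    'Gender': pd.Series(['Male', 'Female'], index=names),
    'Age': pd.Series([38.0, 22.0], index=names[::-1]),  # deliberately in reversed order
})

print(from_array)
print(from_series)
```

Because the Series are matched by label rather than by position, each age still ends up in the correct row even though the second Series was given in reversed order.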


In [19]:
# Using the DataFrame call, keys are assumed to be column headers
passengers = pd.DataFrame({'Mr. Owen Harris Braund':['Male', 22.0, False],
                           'Mrs. John Bradley Cumings':['Female', 38.0, True]}, 
                   index=['Gender', 'Age', 'Survived'])
passengers

Unnamed: 0,Mr. Owen Harris Braund,Mrs. John Bradley Cumings
Gender,Male,Female
Age,22,38
Survived,False,True


However, now the rows have become columns and vice versa. We could rewrite the code above, assigning the passenger names to the `index` argument, and using the Gender, Age and Survived attributes as the dict keys. However, there is a more elegant solution: we can use the transpose attribute `df.T`.

In [20]:
passengers = passengers.T
passengers

Unnamed: 0,Gender,Age,Survived
Mr. Owen Harris Braund,Male,22,False
Mrs. John Bradley Cumings,Female,38,True


**Remark** Again, the transpose method creates a copy. Thus, if you want to actually apply the changes to the DataFrame, you need to save it to the variable, as shown above.

Let's combine a few things we have learned so far.

**Exercise 01 a:** 
- Delete the markdown cell below, which says 'We don't need this cell.'.
- Insert a new code cell below this cell.
- Create a 'passengers' DataFrame with one of the options we have presented above, which contains the following entries: 
    - Miss. Laina Heikkinen, Female, aged 26, survived; 
    - Mrs. Jacques Heath Futrelle, Female, 35 years old, survived;
    - Mr. William Henry Allen, male, 35.0, did not survive;
- Make sure that you use a consistent notation in your DataFrame.

In [21]:
passengers_dict = {
    'Gender': ['Female', 'Female', 'Male'],
    'Age': [26, 35, 35], 
    'Survived': [True, True, False]
}
passengers = pd.DataFrame.from_dict(passengers_dict, orient="columns")
passengers.index = ["Miss. Laina Heikkinen", "Mrs. Jacques Heath Futrelle", "Mr. William Henry Allen"]
passengers

Unnamed: 0,Gender,Age,Survived
Miss. Laina Heikkinen,Female,26,True
Mrs. Jacques Heath Futrelle,Female,35,True
Mr. William Henry Allen,Male,35,False


**Exercise 01 b:**
- Append the two passengers mentioned previously (Mr. Owen Harris Braund and Mrs. John Bradley Cumings) to the DataFrame.

In [22]:
# Your code
passengers.loc["Mr. Owen Harris Braund"] = dict(Gender='Male', Age=22, Survived=False)
passengers.loc["Mrs. John Bradley Cumings"] = dict(Gender='Female', Age=38, Survived=True)
passengers

Unnamed: 0,Gender,Age,Survived
Miss. Laina Heikkinen,Female,26,True
Mrs. Jacques Heath Futrelle,Female,35,True
Mr. William Henry Allen,Male,35,False
Mr. Owen Harris Braund,Male,22,False
Mrs. John Bradley Cumings,Female,38,True


### 1.B.3 Dataset operations in Pandas

Most commonly we create DataFrame structures by reading CSV files. The datasets you need for this lab are stored in the `datasets` folder. The datasets for later labs will be stored in the same way.

In [23]:
passengers_filepath = os.path.join(os.getcwd(), 'datasets', 'titanic.csv')
passengers_filepath

'/Users/leonard/Downloads/week02-introduction-master/datasets/titanic.csv'

In [24]:
passengers = pd.read_csv(passengers_filepath)
passengers.head() # Head shows the first five elements (unless specified otherwise) of the DataFrame

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


These are the five passengers you should have added to your DataFrame in Exercise 01. Now, we have some extra information. Using `df.head()` we can get an impression of what the data in our dataset looks like. Note that the information in the Survived column is stored as 0/1 instead of `False/True`. Pandas reads such a column as integers, not as Booleans. This rarely matters in practice: a column that contains only 0s and 1s behaves like `False` and `True` when it is combined with a Boolean mask using `&`, and it can be converted explicitly with `.astype(bool)` if needed.
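To see how a 0/1 column behaves, here is a small sketch on a made-up three-row frame (not the real dataset). It checks the dtype of the column, converts it with `astype(bool)`, and shows that the mean of either version is the fraction of survivors:

```python
import pandas as pd

toy = pd.DataFrame({'Survived': [0, 1, 1]})   # made-up values, not the real data

as_bool = toy['Survived'].astype(bool)        # 0 -> False, 1 -> True

print(toy['Survived'].dtype)                  # an integer dtype
print(as_bool.dtype)                          # bool
print(toy['Survived'].mean(), as_bool.mean()) # both give the fraction of survivors
```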

However, `df.head()` only shows the first five entries. How many entries are there in total? You can use Python's built-in `len()` function.

In [25]:
len(passengers)

887
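`pd.read_csv` accepts a file-like object as well as a path, and its `sep` parameter handles delimiters other than the comma. A self-contained sketch, using an in-memory string with made-up rows in place of a file on disk:

```python
import io
import pandas as pd

# Made-up, semicolon-separated text standing in for a file on disk
raw_text = "Name;Age;Survived\nPassenger A;30.0;1\nPassenger B;41.0;0\n"

toy = pd.read_csv(io.StringIO(raw_text), sep=';')

print(toy.head())
print(len(toy))     # number of rows
print(toy.shape)    # (number of rows, number of columns)
```

`df.shape` is a convenient companion to `len()`, since it reports the number of columns as well.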

#### Tab completion

Tab completion is a powerful method for viewing object attributes and available methods.

We have just seen the `df.head()` method above. Let's see what other functions we can call on a DataFrame. You can see what methods are available by typing the DataFrame name followed by `.` and then hitting the `<tab>` key. You can then open any method's help documentation by typing the method's name followed by `?`. This opens a 'pager' at the bottom of the screen, which you can close by hitting `<esc>`.

For example, to find the last few entries of the DataFrame, first type `passengers.t`, then hit `<tab>` and then choose `df.tail()`.

In [26]:
passengers.tail() # Tail shows the last five elements (unless specified otherwise) of the DataFrame

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
882,0,2,Rev. Juozas Montvila,male,27.0,0,0,13.0
883,1,1,Miss. Margaret Edith Graham,female,19.0,0,0,30.0
884,0,3,Miss. Catherine Helen Johnston,female,7.0,1,2,23.45
885,1,1,Mr. Karl Howell Behr,male,26.0,0,0,30.0
886,0,3,Mr. Patrick Dooley,male,32.0,0,0,7.75


In [27]:
# Let's get the documentation
passengers.tail?

If you want to get more than five entries you can do so by specifying `N` entries with `df.head(N)` or `df.tail(N)`. Give it a try above.

#### Row selection

As already mentioned, you can think of a DataFrame as a group of Series that share an index (*either* the column headers or the row id). This makes it easy to select specific **observations (i.e. rows)**.

In [28]:
type(passengers.loc[0])

pandas.core.series.Series

We have already talked about `df.loc[label]`, which selects a row based on the label of the index, e.g. 'Mr. Owen Harris Braund'. If we want the N-th row, we can use `df.iloc[N]`. In the loaded Titanic dataset, it so happens that the indices also run from 0 to 886, and thus `df.loc[N]` and `df.iloc[N]` return the same observation.

Technically, there are three options to select rows:
* `df.loc[label]`: works on labels in the index
* `df.iloc[N]`: works on the position in the index (so it only takes integers)
* `df[i:j]`: a slice in plain square brackets selects rows by position, like `df.iloc[i:j]`. A single value, as in `df[N]`, is looked up as a *column* label instead, so it does not return the N-th row.

It is often safest to use the first two methods (rather than just using square brackets) to index into pandas DataFrames. 
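A small sketch, on a made-up frame whose integer labels differ from the row positions, of how label-based and position-based selection diverge. It also shows that a single integer in plain square brackets is looked up among the column labels, so it raises a `KeyError` here:

```python
import pandas as pd

# Made-up frame whose integer labels do not match the row positions
toy = pd.DataFrame({'Age': [14.0, 4.0, 15.0]}, index=[9, 10, 22])

by_label = toy.loc[9]        # row whose label is 9 (here the first row)
by_position = toy.iloc[2]    # third row, whatever its label is

try:
    toy[9]                   # plain brackets look for a *column* named 9
except KeyError:
    print('toy[9] raises KeyError: there is no column labelled 9')

print(by_label['Age'], by_position['Age'])
```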

In [29]:
passengers.iloc[0]

Survived                                        0
Pclass                                          3
Name                       Mr. Owen Harris Braund
Sex                                          male
Age                                            22
Siblings/Spouses Aboard                         1
Parents/Children Aboard                         0
Fare                                         7.25
Name: 0, dtype: object

We can also select several rows; the resulting structure is a DataFrame. This operation is called **slicing**.

**Remark:** Python slicing might seem a bit confusing. When you specify a range to slice with `i:j`, the slice runs from position `i` up to, but not including, position `j`, so it contains `j - i` entries. To get everything from position `k` to the end of an object (e.g. a DataFrame or list) without knowing its length in advance, you can slice with `k:len(object)`, or simply `k:`. Label-based slicing with `df.loc[i:j]` is the exception: it includes both end labels.

In [30]:
type(passengers.iloc[0:3])

pandas.core.frame.DataFrame

In [31]:
passengers.iloc[0:3]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925


In [32]:
# For slices, plain square brackets select rows by position, just like .iloc
passengers[0:3]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
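The half-open `i:j` convention applies to Python lists and to `.iloc`. Label-based slicing with `.loc` behaves differently: it includes both end labels. A small sketch on made-up values:

```python
import pandas as pd

letters = ['a', 'b', 'c', 'd', 'e']
print(letters[1:3])             # ['b', 'c']: positions 1 and 2
print(letters[2:len(letters)])  # ['c', 'd', 'e']: from position 2 to the end

toy = pd.DataFrame({'value': [10, 20, 30, 40]}, index=[0, 1, 2, 3])
by_position = toy.iloc[1:3]     # positions 1 and 2 (end excluded)
by_label = toy.loc[1:3]         # labels 1, 2 and 3 (end included)
print(len(by_position), len(by_label))
```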


#### Column Selection

As already mentioned, you can think of a DataFrame as a group of Series that share an index (*either* the column headers or the row id). This makes it as easy to select specific **columns** as it is to select rows.

In [33]:
type(passengers['Name'])

pandas.core.series.Series

In [34]:
passengers['Name'].head()

0                               Mr. Owen Harris Braund
1    Mrs. John Bradley (Florence Briggs Thayer) Cum...
2                                Miss. Laina Heikkinen
3          Mrs. Jacques Heath (Lily May Peel) Futrelle
4                              Mr. William Henry Allen
Name: Name, dtype: object

To select multiple columns we simply need to pass a list of column names. We said above that we were interested in who survived, so let's check that:

In [35]:
# Note the double brackets: we pass a list of column names
passengers[['Name', 'Survived']].head(7)

Unnamed: 0,Name,Survived
0,Mr. Owen Harris Braund,0
1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,1
2,Miss. Laina Heikkinen,1
3,Mrs. Jacques Heath (Lily May Peel) Futrelle,1
4,Mr. William Henry Allen,0
5,Mr. James Moran,0
6,Mr. Timothy J McCarthy,0


**Exercise 02:** What do you expect the type of `passengers[['Name', 'Survived']].head(7)` to be? Check it! 

In [36]:
# Your Code
type(passengers[['Name', 'Survived']].head(7))

pandas.core.frame.DataFrame

You can combine row and column selection, as we already did above, when selecting a specific entry.

In [37]:
passengers.iloc[0]['Survived']

0
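Chaining two selections, as above, works for reading a value. `.loc` and `.iloc` also accept the row and the column in a single call, which is usually the clearer form. A sketch on a made-up two-row frame:

```python
import pandas as pd

toy = pd.DataFrame({'Name': ['Passenger A', 'Passenger B'],
                    'Survived': [0, 1]})   # made-up rows

chained = toy.iloc[0]['Survived']     # select the row first, then the column
by_label = toy.loc[0, 'Survived']     # row label and column label in one call
by_position = toy.iloc[0, 1]          # row position and column position in one call

print(chained, by_label, by_position)
```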

**Exercise 03:** Look at the dataframe you obtained in Exercise 01. Reconstruct the same dataframe using the newly loaded dataframe, i.e. select the correct rows and columns, such that the output is the same as in Exercise 01.

In [38]:
# Your Code
passengers_selected = passengers[['Sex', 'Age', 'Survived']].head(5)
passengers_selected.index = list(passengers['Name'].head(5))
passengers_selected

Unnamed: 0,Sex,Age,Survived
Mr. Owen Harris Braund,male,22.0,0
Mrs. John Bradley (Florence Briggs Thayer) Cumings,female,38.0,1
Miss. Laina Heikkinen,female,26.0,1
Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1
Mr. William Henry Allen,male,35.0,0


#### Filtering

Now suppose that you want to select all the observations of minors (i.e. people under the age of 18 in the UK). It is easy to do that:

In [39]:
passengers[passengers['Age'] <= 17]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.0750
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708
10,1,3,Miss. Marguerite Rut Sandstrom,female,4.0,1,1,16.7000
14,0,3,Miss. Hulda Amanda Adolfina Vestrom,female,14.0,0,0,7.8542
16,0,3,Master. Eugene Rice,male,2.0,4,1,29.1250
...,...,...,...,...,...,...,...,...
849,1,1,Miss. Mary Conover Lines,female,16.0,0,1,39.4000
859,0,3,Miss. Dorothy Edith Sage,female,14.0,8,2,69.5500
865,1,3,Master. Harold Theodor Johnson,male,4.0,1,1,11.1333
871,1,3,Miss. Adele Kiamie Najib,female,15.0,0,0,7.2250


Or equivalently:

In [40]:
passengers[passengers.Age <= 17]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
7,0,3,Master. Gosta Leonard Palsson,male,2.0,3,1,21.0750
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708
10,1,3,Miss. Marguerite Rut Sandstrom,female,4.0,1,1,16.7000
14,0,3,Miss. Hulda Amanda Adolfina Vestrom,female,14.0,0,0,7.8542
16,0,3,Master. Eugene Rice,male,2.0,4,1,29.1250
...,...,...,...,...,...,...,...,...
849,1,1,Miss. Mary Conover Lines,female,16.0,0,1,39.4000
859,0,3,Miss. Dorothy Edith Sage,female,14.0,8,2,69.5500
865,1,3,Master. Harold Theodor Johnson,male,4.0,1,1,11.1333
871,1,3,Miss. Adele Kiamie Najib,female,15.0,0,0,7.2250


This concept is called *filtering*, and `passengers.Age <= 17` is called a **mask**, which hides/masks all entries that don't fit the criteria; i.e. it only returns the observations for which the mask returns `True`. You can also filter the data by using multiple attributes:

In [78]:
young_passengers = passengers[(passengers.Age <= 17) & (passengers.Survived)] # Note that we could drop '== 1': the 0/1 Survived column works directly as a mask here
young_passengers[8:10]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
77,1,2,Master. Alden Gates Caldwell,male,0.83,0,2,29.0
83,1,2,Miss. Bertha Ilett,female,17.0,0,0,10.5


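**Remark** The rows shown above sit at positions 8 and 9 of `young_passengers`, but their index labels are 77 and 83, and the first row of `young_passengers` has the label 9. As mentioned earlier, the label of the index does not have to coincide with the position of the data entry. Here, `young_passengers.iloc[9]` and `young_passengers.loc[9]` return different rows!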
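A mask is itself a Series of `True`/`False` values, so it can be stored in a variable, inspected, and combined with `&` (and), `|` (or) and `~` (not). When a comparison is written inline, it needs parentheses, because `&` and `|` bind more tightly than `<=` or `==`. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({'Age': [2.0, 30.0, 16.0, 70.0],
                    'Survived': [0, 1, 1, 0]})   # made-up rows

is_minor = toy['Age'] <= 17          # Boolean Series: True, False, True, False
survived = toy['Survived'] == 1

minors_who_survived = toy[is_minor & survived]   # both conditions hold
minor_or_survivor = toy[is_minor | survived]     # at least one condition holds
adults_and_seniors = toy[~is_minor]              # negation of the mask

print(is_minor.tolist())
print(len(minors_who_survived), len(minor_or_survivor), len(adults_and_seniors))
```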

**Exercise 04:** What will `young_passengers.iloc[8:10]` and `young_passengers.loc[8:10]` return? Check your answer.

In [79]:
# Your Code
young_passengers.loc[8:10]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
9,1,2,Mrs. Nicholas (Adele Achem) Nasser,female,14.0,1,0,30.0708
10,1,3,Miss. Marguerite Rut Sandstrom,female,4.0,1,1,16.7


Note that we can also index into columns using `loc`. We just have to specify the second dimension (much as we would do with numpy arrays). Let's give it a try. We want to get the names of the minors (children) who survived the sinking of the *Titanic*:

In [43]:
young_passengers.loc[:, 'Name']

9           Mrs. Nicholas (Adele Achem) Nasser
10              Miss. Marguerite Rut Sandstrom
22                          Miss. Anna McGowan
39                  Miss. Jamila Nicola-Yarred
42     Miss. Simonne Marie Anne Andree Laroche
                        ...                   
826     Mrs. Antoni (Selini Alexander) Yasbeck
827             Master. George Sibley Richards
849                   Miss. Mary Conover Lines
865             Master. Harold Theodor Johnson
871                   Miss. Adele Kiamie Najib
Name: Name, Length: 65, dtype: object

In [44]:
young_passengers.iloc[:, 2] # Select the same column by position. Note: .loc[:, 2] would fail, because .loc expects the column label 'Name'

9           Mrs. Nicholas (Adele Achem) Nasser
10              Miss. Marguerite Rut Sandstrom
22                          Miss. Anna McGowan
39                  Miss. Jamila Nicola-Yarred
42     Miss. Simonne Marie Anne Andree Laroche
                        ...                   
826     Mrs. Antoni (Selini Alexander) Yasbeck
827             Master. George Sibley Richards
849                   Miss. Mary Conover Lines
865             Master. Harold Theodor Johnson
871                   Miss. Adele Kiamie Najib
Name: Name, Length: 65, dtype: object

In [45]:
# The following returns an empty DataFrame, because no rows have a label between 0 and 2 (the labels start at 9).
young_passengers.loc[0:2]

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare


In [46]:
# The result is still a DataFrame
type(young_passengers.loc[0:2])

pandas.core.frame.DataFrame

For more, check out [Advanced Indexing](http://pandas.pydata.org/pandas-docs/version/0.23.4/advanced.html).

#### Basic operations on datasets

We now know how to 
- load a dataset from a csv file `pd.read_csv()`, 
- get the first few entries `df.head()`, 
- select certain rows or columns based on their position `df.iloc[]` or index label `df.loc[]`,
- and finally how to filter the dataset for entries that satisfy certain conditions `df[condition]`.

However, at the beginning of the lab, we wanted to figure out the passenger group worst hit. Using filtering, we can already split the dataset into different subsets. Now let's apply some operations.

First let's get the age spread, and figure out the oldest and youngest passenger. The passengers in the dataset are not sorted by age. Looking at each entry individually is not an option. Thankfully pandas dataframes have a method: `df.sort_values(by='column', ascending={True,False})`.

In [47]:
passengers.sort_values(by='Age').head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
799,1,3,Master. Assad Alexander Thomas,male,0.42,0,1,8.5167
751,1,2,Master. Viljo Hamalainen,male,0.67,1,1,14.5
641,1,3,Miss. Eugenie Baclini,female,0.75,2,1,19.2583
466,1,3,Miss. Helene Barbara Baclini,female,0.75,2,1,19.2583
827,1,2,Master. George Sibley Richards,male,0.83,1,1,18.75


In [48]:
passengers.sort_values(by='Age', ascending=False).head() # Returns the data in descending order of the sorted value

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
627,1,1,Mr. Algernon Henry Wilson Barkworth,male,80.0,0,0,30.0
847,0,3,Mr. Johan Svensson,male,74.0,0,0,7.775
490,0,1,Mr. Ramon Artagaveytia,male,71.0,0,0,49.5042
95,0,1,Mr. George B Goldschmidt,male,71.0,0,0,34.6542
115,0,3,Mr. Patrick Connors,male,70.5,0,0,7.75
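`sort_values` also accepts a list of columns, with one `ascending` flag per column, and it returns a new DataFrame in which every row keeps its original index label. If only the extremes are needed, `min()` and `max()` on the column avoid sorting altogether. A sketch on made-up rows:

```python
import pandas as pd

toy = pd.DataFrame({'Pclass': [3, 1, 3, 1],
                    'Age': [22.0, 38.0, 4.0, 54.0]})   # made-up rows

# Sort by class first, then by age (oldest first) within each class
ordered = toy.sort_values(by=['Pclass', 'Age'], ascending=[True, False])

print(ordered)
print(ordered.index.tolist())              # original row labels travel with the rows
print(toy['Age'].min(), toy['Age'].max())  # youngest and oldest age
```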


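The two results above hint that young passengers had a better survival rate than senior passengers: all five of the youngest passengers survived, but only one of the five oldest did. Ten rows are not enough to draw a conclusion, so let's try, tentatively, to answer the research question.  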

**Discussion:** We will consider the different travel classes, genders, and age groups. Discuss with your lab partner who you would expect had the best and who had the worst survival rate. 

**Exercise 05:**

- Compute the survival rate of: minors (0-17 years old), adults (18-65), and seniors (66+). (Hint: All you need is `len()`.)
- Compute the survival rate of: women and men.
- Compute the survival rate of the travel classes (1, 2, and 3).
- Does the gender have an influence on the survival rates of minors? (What is the survival rate of girls vs boys?)
- Compute the survival rate of all combinations of class, age group, and gender, and print them out as "Age group, gender, class: percentage", e.g. "Minors, Male, First Class: 0.3". (Hint: These are quite a few computations, so think about how to automate them, e.g. `conditions = condition1 & condition2` might help.)

In [77]:
# Your Code
def rate(*conditions):
    # Combine all conditions into a single Boolean mask
    combination = pd.Series(True, index=passengers.index)
    for condition in conditions:
        combination &= condition
    group_size = len(passengers[combination])
    if group_size == 0:
        return float('nan')  # Empty group: the survival rate is undefined
    # Survivors in the group divided by the size of the whole combined group
    return len(passengers[combination & (passengers.Survived == 1)]) / group_size
def print_rate(*labels, rate):
    for label in labels:
        print(label + ", ", end="")
    print("survival rate:", rate)
#Exercise a
minors = passengers.Age <= 17
adults = (passengers.Age <= 65) & (passengers.Age >= 18)
seniors = passengers.Age >= 66
#Exercise b
women = passengers.Sex == "female"
men = passengers.Sex == "male"
#Exercise c
class1 = passengers.Pclass == 1 
class2 = passengers.Pclass == 2 
class3 = passengers.Pclass == 3 
#Print results
print_rate("minors", rate=rate(minors))
print_rate("adults", rate=rate(adults))
print_rate("seniors", rate=rate(seniors))
print_rate("women", rate=rate(women))
print_rate("men", rate=rate(men))
print_rate("class1", rate=rate(class1))
print_rate("class2", rate=rate(class2))
print_rate("class3", rate=rate(class3))
print('--------------------------------')
#Exercise d
#There is a gender influence: women are more likely to survive
#Exercise e
age = [minors, adults, seniors]
sex = [women, men]
Pclass = [class1, class2, class3]
age_label = ['minors', 'adults', 'seniors']
sex_label = ['women', 'men']
Pclass_label = ['class1', 'class2', 'class3']

for i in range(len(age)):
    for j in range(len(sex)):
        for k in range(len(Pclass)):
            print_rate(age_label[i], sex_label[j], Pclass_label[k], rate=rate(age[i], sex[j], Pclass[k]))

minors, survival rate: 0.5
adults, survival rate: 0.36947791164658633
seniors, survival rate: 0.1
women, survival rate: 0.7420382165605095
men, survival rate: 0.19022687609075042
class1, survival rate: 0.6296296296296297
class2, survival rate: 0.47282608695652173
class3, survival rate: 0.24435318275154005
--------------------------------
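As a cross-check on the `len()`-based approach, the survival rate of a group is also the mean of the 0/1 `Survived` column restricted to that group. A sketch on made-up rows (not the real dataset), comparing the two computations for one group:

```python
import pandas as pd

toy = pd.DataFrame({'Sex': ['female', 'male', 'female', 'male'],
                    'Survived': [1, 0, 0, 0]})   # made-up rows

women = toy['Sex'] == 'female'

# Survivors among the women divided by the number of women
rate_from_len = len(toy[women & (toy['Survived'] == 1)]) / len(toy[women])
# Mean of the 0/1 column over the same group
rate_from_mean = toy.loc[women, 'Survived'].mean()

print(rate_from_len, rate_from_mean)
```

In both versions the denominator is the size of the group itself, which is what makes the result a rate for that group.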

**We need your help:** This is a new course. In order for us to improve the labs for the next iterations, and to make sure that the next labs are better, we need your feedback. Please fill out the following [form](https://forms.office.com/Pages/ResponsePage.aspx?id=sAafLmkWiUWHiRCgaTTcYZmGMCx4KxlMjSTITqjdcXpUQlNUTkhPOTk1V0dDTUxTMVoyREdEV1U4SS4u).