# Data Exploration

In our last notebook we saw how to import a dataframe using `pandas` and make a scatter plot from the data using `plotly`. We often want to manipulate our data or extract information from it in some way, and `pandas` gives us many tools to do this.

As always, we need to import our tools:

In [None]:
import pandas as pd
import plotly.express as px
%load_ext google.colab.data_table

We're going to use a new dataset today -- the Titanic dataset, which contains a wealth of information about the passengers who were on the Titanic. Below, we use `pandas` to read the Titanic CSV file into a dataframe and call it `titanic`.

In [None]:
titanic=pd.read_csv('https://raw.githubusercontent.com/SkyIslandsMath/semester-2/master/data/titanic3.csv')

You can see below that the meaning of several of the columns is obvious, but others are less so. Most datasets will have some documentation telling you what everything means. I'll just tell you here:
- `survived` - Survival (0 = No; 1 = Yes)
- `pclass` - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
- `name` - Name
- `sex` - Sex
- `age` - Age
- `sibsp` - Number of Siblings/Spouses Aboard
- `parch` - Number of Parents/Children Aboard
- `ticket` - Ticket Number
- `fare` - Passenger Fare in British pounds.
- `cabin` - Cabin
- `embarked` - Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
- `boat` - Lifeboat (if survived)
- `body` - Body number (if did not survive and body was recovered)

In [None]:
titanic

We might naturally ask some questions of our data. How much did the average ticket cost? How old was the average passenger? Who was the oldest passenger on the Titanic? The youngest?

We can answer many of these questions quickly with pandas. 

Let's say we want to find the average age. First, we need to isolate the `age` column. In pandas, we can pick one column out by putting the name of the column in quotation marks between square brackets. In this case, our age column is:
```
titanic['age']
```
If we want to find the average (also known as the mean) of this column, we just use the `.mean()` method on it like so:
```
titanic['age'].mean()
```
When we type this code in the code cell below and run it, we see that the average age is just under 30.

In [None]:
titanic['age'].mean()

We can use the `.max()` and `.min()` methods to find the oldest and youngest passengers.

In [None]:
titanic['age'].max()

In [None]:
titanic['age'].min()

So the oldest passenger was 80 years old, and the youngest was 0.17 years (or about 2 months) old.
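
Real datasets often have missing entries, and the `age` column here may too. The sketch below uses a small made-up dataframe (the names and ages in `toy` are hypothetical, not taken from the Titanic data) to show how `.mean()`, `.max()` and `.min()` behave when a value is missing, and how `.idxmax()` can point to the row that holds the largest value:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset; names and ages are made up for illustration.
toy = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'age': [40.0, np.nan, 2.0, 18.0],
})

ages = toy['age']          # selecting one column gives a pandas Series

mean_age = ages.mean()     # missing values are skipped: (40 + 2 + 18) / 3
oldest = ages.max()
youngest = ages.min()
n_known = ages.count()     # how many ages are not missing

# .idxmax() gives the row label of the largest value,
# which we can use to look up who the oldest person is.
oldest_name = toy.loc[ages.idxmax(), 'name']

print(mean_age, oldest, youngest, n_known, oldest_name)
```

By default these methods skip missing values instead of counting them as zero, so the average is taken over the known ages only.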

### Visualizing age on the Titanic

We can see how ages are distributed by calling the plotly express histogram function, `px.histogram()`. As with `px.scatter()`, we pass our dataframe as the first argument, then pass the name of the column we want to visualize as a string to the `x` argument:

In [None]:
px.histogram(titanic, x='age')

In the graph above, the height of each bar represents how many passengers were in the given age range. We can see that there were a lot of people aged 20-50, and almost no one over the age of 65. Where were all the old rich people?
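
To make the idea of "how many passengers in each age range" concrete, here is a rough sketch of the counting a histogram does, written with pandas only. The ages and the 20-year bin edges are made up for illustration; plotly chooses its own bin edges, so this is not a reproduction of the plot above.

```python
import pandas as pd

# Hypothetical ages, made up for illustration.
ages = pd.Series([1, 5, 22, 25, 31, 38, 45, 71])

# Sort each age into a 20-year-wide range:
# (0, 20], (20, 40], (40, 60], (60, 80]
age_ranges = pd.cut(ages, bins=[0, 20, 40, 60, 80])

# Count how many ages fall in each range, listed from youngest to oldest.
counts = age_ranges.value_counts().sort_index()

print(counts)
```

Each count corresponds to the height of one bar. If the automatic bars are too wide or too narrow, `px.histogram()` also accepts an `nbins` argument to suggest how many bins to use.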

### Assignment
1. Use the same method we just used, but on the `fare` column, to find out how much the average ticket, the most expensive ticket, and the least expensive ticket cost.

In [None]:
#find the average ticket price here

In [None]:
#find the most expensive ticket here

In [None]:
#find the least expensive ticket here

2. Make a histogram of the `fare` column. How were the prices distributed?

### A shortcut
These sorts of questions are so common that pandas has a handy method to show all the answers at once. This is the easier route when we just need to read off the numbers, but it is less direct than calling `.mean()` or `.max()` when we need a single value in our program. To see all of these summary statistics at once, just type in `titanic.describe()`.

In [None]:
titanic.describe()
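
Although the summary table is mainly meant for reading, the result of `.describe()` is itself a dataframe, so individual values can still be looked up by row and column label with `.loc`. A minimal sketch, using made-up fares in a hypothetical dataframe called `toy`:

```python
import pandas as pd

# Hypothetical fares, made up for illustration.
toy = pd.DataFrame({'fare': [10.0, 20.0, 30.0, 100.0]})

summary = toy.describe()   # a dataframe of summary statistics

# Rows are labelled by statistic, columns by the original column name.
mean_fare = summary.loc['mean', 'fare']
max_fare = summary.loc['max', 'fare']

print(mean_fare, max_fare)
```

For a single statistic, calling the method directly (for example `toy['fare'].mean()`) gives the same number with less typing.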

### Survival

We can see that there are 1309 rows in our dataset, so it records 1309 passengers. (The dataset doesn't include crew, and 4 passengers are missing from it.) How many of them survived? Our `survived` column has a value of `1` if they survived and `0` if they didn't, so if we add up all the ones in the column we will have our answer! Just run the code below:

In [None]:
titanic['survived'].sum()

We see that 500 of the passengers in our dataset survived. We can find the fraction of passengers that survived by taking the mean of the `survived` column:

In [None]:
titanic['survived'].mean()

We can see that about 0.38 -- 38% -- of passengers survived.
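
Why does the mean give the survival rate? The mean is the sum divided by the number of values, and for a column of ones and zeros the sum is just the number of ones. A small sketch with made-up values (the series `survived` below is hypothetical, not the real column):

```python
import pandas as pd

# Hypothetical 0/1 column: 1 = survived, 0 = did not survive.
survived = pd.Series([1, 0, 0, 1, 0])

n_survived = survived.sum()     # number of ones
n_total = survived.count()      # number of values
rate = n_survived / n_total     # fraction that survived

# Multiply by 100 to express the fraction as a percentage.
percent = round(100 * rate, 1)

print(n_survived, n_total, rate, survived.mean(), percent)
```

The value of `rate` computed by hand matches `survived.mean()`.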

## Sub Frames

We often want to look at a specific subset of our data and see what information it contains. The easiest way to do this is to create what is called a 'mask'. It is just a series of `True` or `False` values based on a condition. 

Let's say we want to see all of the infants on board, meaning those passengers younger than 1. We would begin with a logical test of the values in the `age` column: `titanic['age']<1`. We'll need to save it to a variable, which I'll call `mask`, so we can use it.

In [None]:
mask=titanic['age']<1

We can now use our `mask` by using it just like an index in square brackets: `titanic[mask]`. We'll assign this new sub-frame to a variable called `infants`.

In [None]:
infants=titanic[mask]
infants

We can see there were twelve such passengers. We can again use `.sum()` on the survived column to see how many survived, and use `.mean()` to see the rate of survival:

In [None]:
print(infants['survived'].sum())
print(infants['survived'].mean())

We can see that 10 survived, or about 83%, much higher than the overall rate.
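
It can help to look at the mask itself before using it. The sketch below builds one on a small made-up dataframe (the values in `toy` are hypothetical) and shows that the mask is a column of `True`/`False` values, that summing it counts the matching rows, and that a missing age never matches the condition:

```python
import numpy as np
import pandas as pd

# Hypothetical mini-dataset, made up for illustration.
toy = pd.DataFrame({
    'age': [0.5, 30.0, np.nan, 0.8],
    'survived': [1, 0, 1, 0],
})

mask = toy['age'] < 1        # one True/False value per row
n_matching = mask.sum()      # True counts as 1, False as 0

young = toy[mask]            # keep only the rows where mask is True
young_rate = young['survived'].mean()

print(mask.tolist())
print(n_matching, len(young), young_rate)
```

The row with the missing age gets `False`, so it is left out of the sub-frame.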

## Assignment

Find out whether men and women survived at the same rate.

**Step 1:**
Make two masks: `m_mask` and `f_mask`. `m_mask` should be `True` where the `sex` column equals `'male'`, and `f_mask` should be `True` where the `sex` column equals `'female'`.

**Step 2:**
Use the masks to make two new sub-frames `males` and `females`.

**Step 3:**
Use the `.sum()` method on the `survived` column of the two sub-frames to find the total number of male and female passengers who survived.

**Step 4:**
Use the `.mean()` method on the `survived` column of each sub-frame to find their respective rates of survival.