# Lecture 4: NumPy and Pandas
## 10/2/18

### Table Of Contents
1. [NumPy](#section1)  
    1.1 [Basic Operations](#section1.1)  
    1.2 [Broadcasting](#section1.2)  
    1.3 [Aggregation and Axes](#section1.3)  
    1.4 [Conditions](#section1.4)  
    1.5 [A Tiny Introduction to Linear Algebra](#section1.5)  
    1.6 [Useful NumPy Functions](#section1.6)  
    1.7 [Exercises](#section1.7)  
    1.8 [Final Notes on NumPy](#section1.8)  
2. [Pandas](#section2)  




### Hosted by and maintained by the [Statistics Undergraduate Students Association (SUSA)](https://susa.berkeley.edu). Authored by [Ajay Raj](mailto:araj@berkeley.edu) and [Roland Chin](mailto:rond24933chn@berkeley.edu).

<a id='section1'></a>
# NumPy

An informal introduction to your least-worst enemy in the realm of data science. NumPy is an optimized math library for Python. Most of the optimized code is written in C, and NumPy exposes it through a Python interface. The code is vectorized as much as possible, which means there's a heavy focus on operating on whole arrays (treated as n-dimensional vectors) at once instead of looping over elements one at a time.

For example:
If you wanted to compute the dot product of `[1, 2, 3, 4, 5]` and `[5, 4, 3, 2, 1]`, which is `1*5 + 2*4 + 3*3 + 4*2 + 5*1 = 35`, you could loop through two **lists** in Python:

```python
total = 0 # named `total` so we don't shadow Python's built-in sum()
for v1, v2 in zip(arr1, arr2): # iterates through the lists at the same time
    total += v1 * v2
```

Or you could perform all the multiplications at once and then add them together. That's basically what NumPy does behind the scenes, so the dot product of two NumPy arrays `arr1` and `arr2` is very simple:

```python
dot_product = arr1.dot(arr2)
```
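As a quick check, the sketch below runs both approaches on the two example vectors and confirms that they agree. The names `list1`, `list2`, and `loop_total` are ours, chosen only for illustration.

```python
import numpy as np

# The two example vectors, first as plain Python lists
list1 = [1, 2, 3, 4, 5]
list2 = [5, 4, 3, 2, 1]

# Looped version: multiply pairwise and accumulate
loop_total = 0
for v1, v2 in zip(list1, list2):
    loop_total += v1 * v2

# Vectorized version: convert to arrays, then take the dot product
arr1 = np.array(list1)
arr2 = np.array(list2)
dot_product = arr1.dot(arr2)

print(loop_total, dot_product)  # both are 35
```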

Before we begin: a vector is a one-dimensional array, and a matrix is a two-dimensional array. In NumPy, we represent both with `np.array`. There is also `np.matrix`, but it is always 2-dimensional and harder to manipulate, and the NumPy documentation no longer recommends using it.

<a id='section1.1'></a>
## Basic Operations

In [236]:
import numpy as np

NumPy is a Python library that is used to handle linear algebra operations. It does a couple of amazing things under the hood that make certain operations lightning fast, and it makes large-scale data processing possible (Pandas, covered later, is built on top of it).

NumPy holds data in **arrays**.

In [237]:
v = np.array([1, 2, 3, 4, 5]) #creating an array
v

array([1, 2, 3, 4, 5])

You can **index** an element of an array by putting its index number in brackets after the name of the array.

Note: array indices start at **0**, as with Python lists.

In [238]:
v[2]

3

In [239]:
v[5] #why does this error?

IndexError: index 5 is out of bounds for axis 0 with size 5

**Exercise**: How would you sum the 3rd and 4th element of v?

In [240]:
summed = """YOUR CODE HERE"""
summed

'YOUR CODE HERE'

### Indexing 2-D Arrays in Numpy

What is a 2-D array? It's an array of arrays. Also referred to as a matrix. 

This is what a 2D list looks like in vanilla Python.
```python
A = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
    ]
```

Accessing the number 5 is not too easy, however. Python lists have no built-in way to index two layers deep at once, so you have to index into one list at a time:

```python
# getting the number 5 from A
# A[1] = [4, 5, 6]
# A[1][1] = 5
five = A[1][1]
```

When you store an array as an np.array, you are not only gaining a runtime speedup, you're also getting a speedup in writing your code because you now have advanced indexing! 

Now, we'll show how to index in a similar array in numpy's array format.
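As a small sketch of the difference, here is the same 3x3 grid stored both ways. The names `nested` and `grid` are illustrative and not part of the lecture's code.

```python
import numpy as np

nested = [
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
]
grid = np.array(nested)

# Plain lists: index one layer at a time
five_from_list = nested[1][1]

# NumPy: give the row and column indices together
five_from_array = grid[1, 1]

# NumPy can also pull out a whole column in one step
middle_column = grid[:, 1]

print(five_from_list, five_from_array, middle_column)
```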

<a id='section1.1.2'></a>
### Subarrays and Submatrices

In [251]:
random = np.random.randn(3, 4) #the first argument is the number of rows, the 2nd argument is the number of columns
random

array([[-0.09910351, -1.01351285, -0.10447602,  0.33705028],
       [-1.41299493, -1.10024087, -0.81846861, -0.07876869],
       [ 0.65275623,  0.03192873,  0.5751568 , -0.98968038]])

Let's multiply all its values by 10 so it's easier to read.

In [252]:
bigger_random = random * 10
bigger_random

array([[ -0.99103512, -10.13512849,  -1.04476021,   3.37050282],
       [-14.1299493 , -11.0024087 ,  -8.18468607,  -0.78768685],
       [  6.52756229,   0.31928734,   5.75156799,  -9.89680382]])

That's better, but there are still a lot of decimals. Let's go 1 step further and convert all the values to integers with the `astype` function, which casts an array's elements to a specified type. Note that casting to `int` drops the fractional part (it truncates toward zero) rather than rounding to the nearest integer, so -0.99 becomes 0 and 5.75 becomes 5.

This is an example of how easy it is to apply a function to every element in a matrix.

In [253]:
A = bigger_random.astype(int)
A

array([[  0, -10,  -1,   3],
       [-14, -11,  -8,   0],
       [  6,   0,   5,  -9]])
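One thing to keep in mind: `astype(int)` drops the fractional part rather than rounding to the nearest integer, which is why -0.99 became 0 above. Here is a minimal sketch comparing it with `np.round`; the `values` array is made up for illustration.

```python
import numpy as np

values = np.array([-0.99, -10.14, 3.37, 5.75, -9.9])

# Casting drops the fractional part (moves toward zero)
truncated = values.astype(int)

# Rounding first, then casting, gives the nearest integer
rounded = np.round(values).astype(int)

print(truncated)  # values: 0, -10, 3, 5, -9
print(rounded)    # values: -1, -10, 3, 6, -10
```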

Now let's select the element in row index 2, column index 2.

In [254]:
A[2, 2]

5

You can also use the slice operation, which is a colon ```[start:end]``` that allows you to select multiple elements. 
For example:

```[1:5]``` is equivalent to "from element index 1 to 4" (Python doesn't include the last index)

```[1:]``` is equivalent to "from element index 1 to all the way to the end"

```[0:]``` is equivalent to ```[:]```, which selects everything (on a Python list this makes a copy; on a NumPy array it gives a view of the same data)

Omitting the first index defaults to the beginning (0), and leaving the second index blank defaults to the length of whatever is being sliced.
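One difference from Python lists is worth seeing in code: slicing a list gives a new list, while slicing a NumPy array gives a view onto the same data. A minimal sketch follows (all variable names here are illustrative).

```python
import numpy as np

py_list = [1, 2, 3, 4, 5]
list_slice = py_list[:]      # slicing a list makes a new list
list_slice[0] = 99           # the original list is unchanged

arr = np.array([1, 2, 3, 4, 5])
arr_view = arr[:]            # slicing an array gives a view onto the same data
arr_view[0] = 99             # this also changes arr

arr2 = np.array([1, 2, 3, 4, 5])
arr_copy = arr2[:].copy()    # ask for a copy explicitly
arr_copy[0] = 99             # arr2 is unchanged

print(py_list[0], arr[0], arr2[0])  # 1 99 1
```

If you plan to modify a slice without touching the original array, calling `.copy()` is the safe choice.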

In a 2D array there are rows and columns, so there are 2 slices: the first takes a start and end index for the rows, and the second takes a start and end index for the columns: ```[row start:row end, column start:column end]```.

Again, indexing begins at 0.

In [255]:
A[1, :] # this will return the 2nd row of the matrix, with all its column elements from 0 to 3

array([-14, -11,  -8,   0])

In [256]:
A[:, 2] # this will return the 3rd column of the matrix, with all its row elements from 0 to 2

array([-1, -8,  5])

This is how you get the 2x3 matrix at the bottom right hand corner of the matrix.

In [257]:
A[1:, 1:]

array([[-11,  -8,   0],
       [  0,   5,  -9]])

**Exercise:** What if you wanted the 3x3 matrix at the left of A (everything but the rightmost column)?

In [267]:
far_left = """YOUR CODE HERE"""
far_left

array([[  0, -10,  -1],
       [-14, -11,  -8],
       [  6,   0,   5]])

Negative indexing can also be performed in Numpy. This is useful when you don't have the length of the array, and want something starting from the end. The last element has an index of -1, second to last element has an index of -2, and so forth. The same operations from above can be used.

Here's an example with a 1D array.

In [181]:
D = np.arange(1, 10)  #arange is similar to range(), it takes a starting number, end, and a step interval
D

array([1, 2, 3, 4, 5, 6, 7, 8, 9])

In [182]:
D[-3]

7

In [183]:
D[2:-3] #remember, the index stops 1 before the last, so the last element is at index -3 - 1 = -4

array([3, 4, 5, 6])
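To make the rule concrete: a negative index `-k` refers to the same element as index `n - k`, where `n` is the length of the array. The sketch below checks this on an array like `D`; the names `seq`, `n`, and the rest are ours.

```python
import numpy as np

seq = np.arange(1, 10)      # array([1, 2, ..., 9]), length 9
n = len(seq)

# seq[-3] and seq[n - 3] are the same element
same_element = seq[-3] == seq[n - 3]

# The same holds inside slices: seq[2:-3] is seq[2:n - 3]
same_slice = np.array_equal(seq[2:-3], seq[2:n - 3])

# Leaving the end blank runs to the end, so this is the last two elements
last_two = seq[-2:]

print(same_element, same_slice, last_two)
```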

**Exercise:** Let's try to select the 2x2 matrix at the bottom righthand corner of the A matrix from the last exercise.

Hint: leaving the 2nd argument blank in both slices below tells it to run all the way to the end.

In [1]:
bottom_right_corner = """YOUR CODE HERE"""
bottom_right_corner

'YOUR CODE HERE'

This is how you'd do it with positive indices.


In [185]:
A[1:, 2:]

array([[-8,  0],
       [ 5, -9]])

There's also the `shape` attribute, which returns the dimensions of an array and can be useful to tell how many elements there are in a matrix.

In [2]:
A.shape

(3, 4)

The ```reshape``` function, when called on an array, rearranges its elements into a new array of shape (x, y). The old and new shapes must hold the same number of elements.

Observe that it fills the array row by row by going along the rows of the original matrix.

In [187]:
A.reshape((2, 6))

array([[  0, -10,  -1,   3, -14, -11],
       [ -8,   0,   6,   0,   5,  -9]])

In [188]:
A.reshape((6, 2))

array([[  0, -10],
       [ -1,   3],
       [-14, -11],
       [ -8,   0],
       [  6,   0],
       [  5,  -9]])
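The two rules above (elements are read row by row, and the element counts must match) can be checked with a small sketch. The array `flat` and the other names are made up for illustration, and `ravel` is used here only to flatten each result for comparison.

```python
import numpy as np

flat = np.arange(12)             # 12 elements: 0 through 11
as_3x4 = flat.reshape((3, 4))
as_2x6 = as_3x4.reshape((2, 6))  # same 12 elements, read row by row

# Reading either shape row by row gives back the same sequence
same_order = np.array_equal(as_3x4.ravel(), as_2x6.ravel())

# 12 elements cannot fill a 5x3 array (15 slots), so NumPy raises ValueError
try:
    flat.reshape((5, 3))
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

print(same_order, mismatch_raised)
```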

<a id='section1.2'></a>
## Broadcasting

The most important thing NumPy does is **broadcasting**, which means that it allows for arithmetic operations on arrays of different shapes.

It's important because it uses less memory and is more computationally efficient: the smaller operand does not have to be copied out to the full shape before the operation (in the example below, `b` is a scalar rather than an array).

More information can be found here: https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

Here's an example of array multiplication, where both arrays' sizes are equal.

In [189]:
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 2.0])
a * b

array([2., 4., 6.])

And here broadcasting is being used; the scalar ```b``` is being "stretched" into an array with the same shape as ```a```.

In [190]:
a = np.array([1.0, 2.0, 3.0])
b = 2.0
a * b

array([2., 4., 6.])

This **ELIMINATES** (in most cases) the need for `for` loops! Apart from it removing the need to store redundant info, it also makes your code super nice to read.

In [191]:
a

array([1., 2., 3.])

In [192]:
a ** 2

array([1., 4., 9.])

In [193]:
a + 42

array([43., 44., 45.])

The rule of thumb is that NumPy does arithmetic operations pairwise, but if a certain dimension is 1, then it will **broadcast** that effect across the dimension. Broadcasting is when a smaller array is "repeated" across a larger array so they have compatible shapes, and arithmetic can be done between them.

Here's a more complicated example.

In [194]:
a # a 1-D array of length 3, which broadcasts like a 1x3 row

array([1., 2., 3.])

In [195]:
B = np.zeros((3,3)) # a 3x3 matrix of zeros
B

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [196]:
a + B

array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 3.]])

The way this works is that we are adding the `a` row vector to every row of the matrix `B`. In effect, "stretching" `a` across `B`.

In [197]:
a_col = np.array([[1.0], [2.0], [3.0]]) #a_col holds the same values as a, but as a column vector (shape 3x1)
a_col

array([[1.],
       [2.],
       [3.]])

In [77]:
B

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [199]:
a_col + B

array([[1., 1., 1.],
       [2., 2., 2.],
       [3., 3., 3.]])

The way this works is that we are adding the `a_col` column vector to every column of matrix `B`.
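To summarize the row and column cases with explicit shapes, here is a minimal sketch. The names `row`, `col`, and `zeros` are ours, and the last part shows what happens when the shapes cannot be lined up.

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0])   # shape (3,)
col = row.reshape((3, 1))         # shape (3, 1)
zeros = np.zeros((3, 3))          # shape (3, 3)

row_sum = row + zeros   # row is repeated down the rows
col_sum = col + zeros   # col is repeated across the columns

print(row_sum)
print(col_sum)

# Shapes that cannot be lined up raise a ValueError
try:
    np.array([1.0, 2.0]) + zeros  # shape (2,) against (3, 3)
    incompatible = False
except ValueError:
    incompatible = True

print(incompatible)
```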

You can also add a single number to every element of an array like this.

In [203]:
B

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [204]:
3 + B

array([[3., 3., 3.],
       [3., 3., 3.],
       [3., 3., 3.]])

**Exercise:** What if we want to add a constant vector to each row of a matrix? In the following example, the shapes of the two arrays are different, so NumPy broadcasts the smaller one before adding elementwise. We want to create a `row_vector` that, when added to `an_array`, returns the `given` matrix.

In [207]:
given = np.array([[2, 2, 4], [5, 5, 7], [8, 8, 10]])
given

array([[ 2,  2,  4],
       [ 5,  5,  7],
       [ 8,  8, 10]])

In [209]:
an_array = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# row_vector = np.array([_, _, _])
the_sum = an_array + row_vector #Hint: the row vector will be added to each row of an_array
the_sum #does this look like the given matrix?

array([[ 2,  2,  4],
       [ 5,  5,  7],
       [ 8,  8, 10]])

<a id='section1.3'></a>
## Aggregation and Axes

NumPy is also great at **aggregation**, which means combining values in arrays or matrices.

In [3]:
sum_matrix = np.arange(16).reshape((4, 4)) #this creates an array from 0 to 15, and then reshapes it into a 4x4 matrix
sum_matrix

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [215]:
sum_matrix.sum() # sums all of the elements in an array/matrix

120

The `axis` parameter is commonly used in NumPy. It's a little hard to think about, so here's a picture.

![](axes.jpg)

When you pass in `axis=0`, the operation runs down the rows and gives one result per column; `axis=1` runs across the columns and gives one result per row.

Let's go back to our `random` matrix (it is randomly generated, so the values will differ from run to run).

Say, now, instead of the total sum of all the elements, we want to calculate all of the **row-sums**, or the sums of each row.

In [219]:
random.sum(axis=1)

array([ 5.03324575,  0.48999793, -1.69695637])

The **column-sums** are similarly computed.

In [220]:
random.sum(axis=0)

array([-0.60027067,  1.66861993,  2.01270673,  0.7452313 ])
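Because `random` changes on every run, it can help to check the `axis` behavior on a small matrix whose sums are easy to verify by hand. A minimal sketch, with `m` as a made-up example:

```python
import numpy as np

m = np.arange(6).reshape((2, 3))
# m is:
# [[0, 1, 2],
#  [3, 4, 5]]

col_sums = m.sum(axis=0)  # collapse the rows: one sum per column, shape (3,)
row_sums = m.sum(axis=1)  # collapse the columns: one sum per row, shape (2,)

print(col_sums)  # column sums: 3, 5, 7
print(row_sums)  # row sums: 3, 12
```

A handy way to remember it: the axis you pass is the one that disappears from the shape of the result.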

**Exercise:** Let's try an example with another 2D matrix. Uncomment A (delete the #) and erase the underscore blanks, filling in numbers to make the output equal to ```[3, 12, 8]```.

In [221]:
# A = [[2, _, 5], [_, 9, _]]

In [222]:
np.sum(A, axis=0) #test your values by running this and seeing if it equals an array of [3, 12, 8]

array([5, 8])

Here are some other aggregation functions you might see.

In [223]:
a = np.random.rand(100) #randomly creating 100 numbers from the uniform distribution on [0, 1)

In [224]:
a.mean() #finds the mean of an array of numbers

0.473990177297372

In [225]:
np.median(a) #finds the median of an array of numbers

0.4697023959489951

<a id='section1.4'></a>
## Conditions

Now we're going to see how we can select certain elements based on conditions that we specify. Sometimes you don't want all the rows and columns from a matrix you're given. Let's say we want to find the even numbers among the first 25 Fibonacci numbers.

In [4]:
# this code generates the first 25 elements of the Fibonacci sequence
# it's a cool exercise to figure out how this works! Try it out at home

A = np.array([
    [1, 1],
    [1, 0]
])

fib = np.zeros(25)

start = np.array([1, 0])
curr_A = A

fib[0] = 0
fib[1] = 1

for i in np.arange(2, 25):
    fib[i] = (curr_A @ start)[0]
    curr_A = A @ curr_A

fib = fib.astype(int)
fib

array([    0,     1,     1,     2,     3,     5,     8,    13,    21,
          34,    55,    89,   144,   233,   377,   610,   987,  1597,
        2584,  4181,  6765, 10946, 17711, 28657, 46368])

Let's say you only want certain elements from the `fib` array above. One way is to index with a list of booleans, one per element: `True` keeps the element at that position and `False` drops it.

In [227]:
fib[[True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True, False, False, True]]

array([    0,     2,     8,    34,   144,   610,  2584, 10946, 46368])

But wait, these are just the even numbers. How can we tell if a number is even? Let's see what the 2 operations below yield.

In [92]:
4 % 2, 3 % 2 # 4 is even, 3 is not

(0, 1)

It turns out that the modulo operator `x % y` gives the remainder when x is divided by y, so a number is even exactly when `x % 2` is 0.

In [228]:
fib[fib % 2 == 0] # the code inside the brackets returns the same boolean array as above

array([    0,     2,     8,    34,   144,   610,  2584, 10946, 46368])

Notice how we were able to just put the name of the array instead of having to loop through each index of the array! This is another beautiful aspect of NumPy.
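It can help to look at the intermediate boolean array on its own. The sketch below builds the mask, uses it to select, and also counts the matches; the names `nums`, `is_even`, and `num_evens` are illustrative.

```python
import numpy as np

nums = np.array([0, 1, 1, 2, 3, 5, 8, 13, 21, 34])  # first 10 Fibonacci numbers

is_even = nums % 2 == 0    # boolean array, one entry per element
evens = nums[is_even]      # keeps only the entries where is_even is True
num_evens = is_even.sum()  # True counts as 1, so the sum is a count

print(is_even)
print(evens, num_evens)
```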

We can make logical expressions with the `&`, `|`, `~` operators.

In [5]:
fib[~(fib % 2 == 0)] # ~ is negation, so here we have the odd numbers

array([    1,     1,     3,     5,    13,    21,    55,    89,   233,
         377,   987,  1597,  4181,  6765, 17711, 28657])

In [230]:
fib[(fib % 2 == 0) & (fib % 3 == 0)] #here we have numbers that are both even and divisible by 3

array([    0,   144, 46368])
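Here is a short sketch of all three operators on a small range, so the results are easy to verify by hand. Note the parentheses around each comparison: `&` and `|` bind more tightly than `==`, so leaving them out changes the meaning. The array `nums` is a made-up example.

```python
import numpy as np

nums = np.arange(1, 13)  # 1 through 12

even = nums % 2 == 0
div_by_3 = nums % 3 == 0

both = nums[even & div_by_3]        # even AND divisible by 3
either = nums[even | div_by_3]      # even OR divisible by 3
neither = nums[~(even | div_by_3)]  # NOT (even or divisible by 3)

# Written inline, each comparison needs its own parentheses
both_inline = nums[(nums % 2 == 0) & (nums % 3 == 0)]

print(both, either, neither)
```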

Let's try an example on our own. The code below randomly generates a 3x3 matrix.

In [231]:
random_matrix = np.random.rand(3, 3) 
random_matrix

array([[0.25462503, 0.10414092, 0.1346097 ],
       [0.19831264, 0.54016466, 0.11088254],
       [0.18097283, 0.88588816, 0.12454195]])

**Exercise:** What if we only want the numbers that are less than the mean of all the elements of the matrix? Uncomment the following (erase the #) and fill in the blank between the brackets.

In [232]:
#random_matrix[_____]

<a id='section1.5'></a>
## A Tiny Introduction to Linear Algebra (if time)

Linear algebra is a math buzz word that just means math with matrices and vectors.

A vector is an array of numbers: it is the same thing as a NumPy array.

In [37]:
v = np.array([1, 2, 3])
v

array([1, 2, 3])

As we said earlier, a **matrix** is an **array** of **arrays**.

In [38]:
A = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9]
])
A

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

A **matrix-vector** product $Av$, where $A$ is a matrix and $v$ is a vector, is computed one row at a time: for each row of $A$, you multiply its entries by the matching entries of $v$ and add them up. Each row produces one entry of the result.

![Example of matrix vector multiplication](mult.png)

In [39]:
Av = np.array([
    1*1 + 2*2 + 3*3,
    4*1 + 5*2 + 6*3,
    7*1 + 8*2 + 9*3
])
Av

array([14, 32, 50])

You can do this in NumPy with the `@` operator.

In [40]:
A @ v

array([14, 32, 50])
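Put differently, each entry of the result is the dot product of one row of the matrix with the vector. A minimal sketch checking this against `@` (the names `M`, `vec`, and `by_rows` are ours):

```python
import numpy as np

M = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
vec = np.array([1, 2, 3])

# One dot product per row of M
by_rows = np.array([row.dot(vec) for row in M])

# The @ operator does the same thing in one step
direct = M @ vec

print(by_rows, direct)
```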

Why are matrix-vector products important? It's all about **data**: matrices are a great way to store **data**. Say we have the matrix $A$, where each row represents a student, and the columns represent how many meal points that student spent at each dining hall. Column $a_i$ of the matrix $A$ holds the meal points each student spent at dining hall $i$.

In [41]:
A = np.array([
    np.random.dirichlet(np.ones(4),size=1).reshape(4) * 200,
    np.random.dirichlet(np.ones(4),size=1).reshape(4) * 200,
    np.random.dirichlet(np.ones(4),size=1).reshape(4) * 200,
    np.random.dirichlet(np.ones(4),size=1).reshape(4) * 200
])
A

array([[  16.02978008,  117.7643773 ,   41.70347514,   24.50236748],
       [  20.09335071,    8.04458069,   77.45197238,   94.41009621],
       [ 160.81311355,    6.12015784,   23.9279186 ,    9.13881002],
       [  99.05339978,   20.72960095,   12.88599542,   67.33100385]])

For example, let's say that Pat Browns is the last column of our matrix, so the last column represents how much each student spent at Pat Browns.

In [42]:
A[:,3]

array([ 24.50236748,  94.41009621,   9.13881002,  67.33100385])

As you can see, each student spent 200 meal points.

In [43]:
A.sum(axis=1)

array([ 200.,  200.,  200.,  200.])

We can also do the `sum` function with a **matrix-vector** product.

In [44]:
v = np.ones(4)
v

array([ 1.,  1.,  1.,  1.])

In [45]:
A @ v

array([ 200.,  200.,  200.,  200.])

This works because we're multiplying each element in the row by one and then summing them, which is exactly what a row sum does.

Now, say that Pat Browns jacks up its prices by 2x. How many meal points would each student have spent then?

In [46]:
v_jacked = np.array([1, 1, 1, 2])
v_jacked

array([1, 1, 1, 2])

In [47]:
A @ v_jacked

array([ 224.50236748,  294.41009621,  209.13881002,  267.33100385])

<a id='section1.6'></a>
## Useful NumPy Functions

Before you start writing complex code in NumPy, be sure to do a Google Search! Most likely there is a NumPy function already there for you.

Press `shift`-`tab` after placing your cursor after each function name to get a helpful message of what it does!

In [87]:
np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [88]:
np.ones((4, 4))

array([[ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.]])

In [89]:
np.zeros((5, 5))

array([[ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.],
       [ 0.,  0.,  0.,  0.,  0.]])

In [127]:
np.eye(5)

array([[ 1.,  0.,  0.,  0.,  0.],
       [ 0.,  1.,  0.,  0.,  0.],
       [ 0.,  0.,  1.,  0.,  0.],
       [ 0.,  0.,  0.,  1.,  0.],
       [ 0.,  0.,  0.,  0.,  1.]])

In [125]:
np.dot(np.array([1,2]), np.array([3,4]))

11

In [126]:
np.full((2,2),7) 

array([[7, 7],
       [7, 7]])

<a id='section1.7'></a>
## Exercises

<a id='section1.7.1'></a>
### Broadcasting

**Exercise:** Uncomment the x and fill in the 2 blanks so that x + y = z.

In [234]:
# x = np.array([_, _])
y = np.array([[3], [4]])
z = np.array([[4, 5], [5, 6]])
#Hint: y looks like [3] and z looks like [4, 5]
#                   [4]                  [5, 6]
#The row vector will be "stretched" across and added to both elements of the column vector y.

In [235]:
x + y

NameError: name 'x' is not defined

<a id='section1.7.2'></a>
### Linear Algebra (for 54 wizards) !!!Challenging!!!

In [51]:
x = np.arange(1000).reshape(1000, 1)
b = np.ones((1000, 1))

X = np.append(x, b, axis=1)
#all you need to know is that this is the X you will use in the equation below

Y = 2 * x[:, 0] + 4 * b[:, 0] + np.random.random() # b[:, 0] keeps Y one-dimensional, with shape (1000,)
#all you need to know is that this is the Y you will use in the equation below

Use Least Squares Linear Regression to solve for $\hat{\theta}$, the weights on each column of $X$ such that it models $Y$. Remember, the formula for Least Squares Linear Regression is: 

$$X^TX\hat{\theta} = X^TY$$

In [52]:
# theta_hat = np.linalg.solve(___, ___)
theta_hat

NameError: name 'theta_hat' is not defined

Find the loss of your model. The loss equation is as follows:
$$ ||Y - X\hat{\theta}||^2 $$
_Hint: First use np.linalg.norm on the difference of Y and the product of X and theta hat, then square._

In [315]:
# loss = 
loss

Ellipsis

<a id='section1.8'></a>
## Final Notes on NumPy

NumPy makes scientific computing in Python possible. It's pretty fantastic. But there are many tiny details that might trip you up when using it in a practical setting. Sometimes it will have to do with using functions properly; other times it will be low-level mess-ups.

For a common one that often gets me annoyed, see <a href="http://scipy-cookbook.readthedocs.io/items/ViewsVsCopies.html">this link</a>.

Also, remember that whenever you're confused with or forget what parameters a numpy function takes, say with ```np.arange```, you can call ```help(np.arange) ``` in an empty cell.

<a id='section2'></a>
# An Introduction to Pandas and Data Processing

In [95]:
import pandas as pd

Well, we know we can store numbers in matrices in NumPy. But this isn't great: compare and contrast with Microsoft Excel. NumPy seems like Excel without any of its nice aesthetic features, like plotting graphs, etc. **Pandas** is Python's most common answer to this.

Spreadsheet data like this is often saved as a **Comma-Separated-Value** (.csv) file, where each row is its own line and the values within a row are separated by commas.

Today, we'll be diving into the **Titanic** dataset, which has the data for every passenger aboard the Titanic. We've downloaded two .csv files for you to play with in Pandas.

Pandas allows you to convert a .csv file into a Pandas object in the following way.

In [97]:
titanic_train = pd.read_csv('titanic/train.csv')
titanic_test = pd.read_csv('titanic/test.csv')
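If you don't have the Titanic files handy, the same idea can be sketched with a tiny CSV written inline. The data and column names below are made up for illustration; `io.StringIO` just lets `read_csv` treat a string as if it were a file.

```python
import io

import pandas as pd

# Each line is a row; the first line holds the column names
csv_text = """name,hall,points
Ana,Crossroads,12.5
Ben,Cafe 3,8.0
"""

df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # 2 rows, 3 columns
print(list(df.columns))  # the header line becomes the column names
```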

Data is stored in **DataFrame** objects.

In [98]:
type(titanic_train)

pandas.core.frame.DataFrame

First, let's look at the data itself, using the `head` function.

In [99]:
titanic_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **0** | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| **1** | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| **2** | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| **3** | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| **4** | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Each of the columns is a **Series** object, and you can get each of them by indexing the same way as you would a **dictionary** in Python (in brackets).

In [100]:
type(titanic_train['Name'])

pandas.core.series.Series

You can use the `head` function on **Series** too.

In [101]:
titanic_train['Name'].head()

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

Let's get back to the data.

In [102]:
titanic_train.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **0** | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| **1** | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| **2** | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| **3** | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| **4** | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


You'll notice that there is a bolded column and a bolded row: these are the **index column** (which should uniquely define a row) and the **column names**.

You can get any specific value in the **DataFrame** with the `loc` function.

In [106]:
titanic_train.loc[2, 'Name'] # gets the name of the passenger with index 2

'Heikkinen, Miss. Laina'

Now, since each row represents a person aboard, it would make sense that `PassengerId` can be a valid index. It also makes more sense with our `.loc` function calls, e.g. to be getting the name of a `PassengerId`.

We accomplish this with the `.set_index` command.

In [107]:
titanic_train = titanic_train.set_index('PassengerId')

In [108]:
titanic_train.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **1** | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| **2** | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| **3** | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| **4** | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| **5** | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Congratulations, you are now data scientists! You just did one step of what is known as **exploratory data analysis (EDA)**.

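Now, let's look under the hood. A Pandas **DataFrame** is built on top of NumPy arrays, and you can convert one into a NumPy array.

In [110]:
titanic_matrix = titanic_train.to_numpy() # older Pandas versions used .as_matrix(), which has since been removed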
type(titanic_matrix)

numpy.ndarray

In [111]:
titanic_matrix[0]

array([0, 3, 'Braund, Mr. Owen Harris', 'male', 22.0, 1, 0, 'A/5 21171',
       7.25, nan, 'S'], dtype=object)

Most of what we can do in NumPy, we can also do in Pandas. For example, we can index a **DataFrame** by position with the `.iloc` command.

In [114]:
titanic_train.iloc[3, 3] # 4th row, 4th column ==> sex of 4th passenger.

'female'
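The difference between `.loc` and `.iloc` is that `.loc` looks things up by label (index value and column name), while `.iloc` looks them up by integer position. A minimal sketch on a made-up table (the `people` DataFrame and its columns are hypothetical, not from the Titanic data):

```python
import pandas as pd

people = pd.DataFrame({
    "PersonId": [101, 102, 103],
    "Name": ["Ana", "Ben", "Cy"],
    "Age": [30, 41, 25],
}).set_index("PersonId")

# By label: the row whose index is 102, the column called "Name"
by_label = people.loc[102, "Name"]

# By position: the second row, the first column
by_position = people.iloc[1, 0]

print(by_label, by_position)  # both are "Ben"
```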

And, we can aggregate too!

In [121]:
titanic_train['Age'].mean() # the average age of someone aboard; .mean() skips passengers whose age is missing

29.69911764705882
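Missing values matter when averaging. `sum()` skips missing (`NaN`) entries but `shape[0]` counts every row, so dividing one by the other understates the mean whenever some values are missing, while `mean()` leaves the missing entries out of both the total and the count. A sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

ages = pd.Series([20.0, 30.0, np.nan, 40.0])

# 90 / 4: the row with the missing value is still counted
naive_mean = ages.sum() / ages.shape[0]

# 90 / 3: missing values are left out of the count too
skipna_mean = ages.mean()

# Number of non-missing entries
known = ages.count()

print(naive_mean, skipna_mean, known)
```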

Conditions, however, work a little differently. You write the condition on a column of the **DataFrame** and put it inside the brackets, and you get back a **DataFrame** containing only the rows where the condition is true.

In [122]:
survived = titanic_train[titanic_train['Survived'] == 1] # all passengers that survived.
survived.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| **2** | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| **3** | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| **4** | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| **9** | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| **10** | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
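The same filtering pattern works on any DataFrame, and conditions can be combined with `&` just like in NumPy. Below is a minimal sketch on a made-up table; the `orders` DataFrame and its columns are hypothetical and chosen only for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "item": ["tea", "coffee", "tea", "juice"],
    "size": ["S", "L", "L", "S"],
    "price": [2.0, 4.5, 3.0, 3.5],
})

# A condition on a column gives one True/False per row
is_tea = orders["item"] == "tea"

# Indexing with it keeps only the rows where it is True
tea_orders = orders[is_tea]

# Combine conditions with & (each comparison in its own parentheses)
large_tea = orders[is_tea & (orders["size"] == "L")]

# Aggregate a column of the filtered rows
tea_total = tea_orders["price"].sum()

print(tea_orders)
print(tea_total)
```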


**Exercise**: Find the total fare spent by females to be aboard the Titanic.

In [123]:
fare = 0
fare

0