# Pandas: Tabular Data in Python

## Objectives

* Create `Series` and `DataFrame` objects from Python data types. 
* Create `DataFrame` objects from files.
* Index and slice `pandas` objects.
* Aggregate data in `DataFrame`s.
* Join multiple `DataFrame` objects.

## What is Pandas?

Pandas is a Python library that provides data structures and data analysis tools for tabular data of many types. Think of a `DataFrame` as a table in SQL.

We will use this pretty much every day from here on out. So be sure to complete the assignment, and even do it again if you aren't feeling comfortable.

## Benefits

  * Efficient storage and processing of data.
  * Includes many built-in functions for data transformation, aggregations, and plotting.
  * Great for exploratory work.

## Not so greats

  * Does not scale terribly well to VERY large datasets.

## Documentation:

The documentation for pandas is here:

  * http://pandas.pydata.org/pandas-docs/stable/index.html
  
  
But we suggest using this cheat sheet as you go:

  * https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf


Particularly important reads (eventually) are:

  * [Indexing and Selecting](https://pandas.pydata.org/pandas-docs/stable/indexing.html)
  * [Advanced Indexing](http://pandas.pydata.org/pandas-docs/stable/advanced.html#advanced-mi-slicers)
  * [Group-by](https://pandas.pydata.org/pandas-docs/stable/groupby.html)

## Standard Imports

In [4]:
%matplotlib inline

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('ggplot')

## Numpy: A Quick Primer

`pandas` is built on data types from `numpy`, a lower-level library.

The basic object in `numpy` is an `array`.

In [109]:
x = np.array([0, 1, 2, 3, 4, 5])
x

array([0, 1, 2, 3, 4, 5])

Arrays can be processed very efficiently.

In [110]:
x.sum()  # <-- As efficient as possible way to sum these numbers in python.

15
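
As a small sketch of what "processed efficiently" means in practice: arithmetic and reductions apply to the whole array at once, with no explicit Python loop. The variable names below (`doubled`, `squares`, `total`, `average`) are made up for illustration.

```python
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])

# Arithmetic applies to every element at once; no explicit loop is needed.
doubled = x * 2
squares = x ** 2

# Reductions such as sum and mean run in compiled code.
total = x.sum()
average = x.mean()

print(doubled)         # [ 0  2  4  6  8 10]
print(squares)         # [ 0  1  4  9 16 25]
print(total, average)  # 15 2.5
```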

Arrays can be multi-dimensional.  A **two-dimensional array** is called a **matrix**.

In [111]:
M = np.array([
    [0, 1, 2],
    [1, 2, 3],
    [2, 3, 4],
    [5, 6, 7]
])

M

array([[0, 1, 2],
       [1, 2, 3],
       [2, 3, 4],
       [5, 6, 7]])

In [5]:
print(x.shape)
print(M.shape)

(6,)
(4, 3)


## Adding index and a label to an array

The most basic datatype in pandas is the 'series'. It's like a vector array, but with dressing.

In [6]:
heights = np.array([34, 35, 36, 37, 38])
student_heights = pd.Series(heights)
student_heights

0    34
1    35
2    36
3    37
4    38
dtype: int64

The numpy array is still in there. Easily accessible with '.values'.

In [7]:
student_heights.values

array([34, 35, 36, 37, 38])

The '0,1,2,3,4' part is called the index. If you don't set the index it's '0,1,2,3' by default.

You can access it with '.index'

In [113]:
student_heights.index

RangeIndex(start=0, stop=5, step=1)

Notice it stores the index as an [iterable](https://www.programiz.com/python-programming/iterator).

In [8]:
type(student_heights.index)

pandas.core.indexes.range.RangeIndex

In [115]:
list(student_heights.index)

[0, 1, 2, 3, 4]

We could manually change it

In [9]:
student_heights.index = ["Tomas", "Angel", "Stacy", "Michaela", "Haden"]

In [10]:
student_heights

Tomas       34
Angel       35
Stacy       36
Michaela    37
Haden       38
dtype: int64

#### So what?

So, now we can use __words__ to access items in the series. Not unlike a dictionary.

In [11]:
student_heights['Michaela']

37

But unlike a dictionary, the series still has an order.

In [12]:
student_heights[2]

36
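
A hedged note on the two lookups above: plain `[]` has to guess whether you mean a label or a position, and recent pandas versions warn when an integer is used as a position on a string-labeled series. A minimal sketch of the explicit accessors, `.loc` for labels and `.iloc` for positions (the names `by_label` and `by_position` are just for illustration):

```python
import pandas as pd

student_heights = pd.Series(
    [34, 35, 36, 37, 38],
    index=["Tomas", "Angel", "Stacy", "Michaela", "Haden"],
)

by_label = student_heights.loc["Michaela"]   # label-based lookup
by_position = student_heights.iloc[2]        # position-based lookup

print(by_label)     # 37
print(by_position)  # 36
```

Being explicit costs a few characters but removes the ambiguity, which matters most when the index itself contains integers.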

Also unlike a dictionary, we get the numpy methods for free. Plus a few more.

In [13]:
student_heights.mean()

36.0

You can see the function parameters by typing 
```python
pd.Series()
```
putting your cursor in between those '()' and hitting 'shift+tab'.

Try it!

In [14]:
#Type pd.Series() here:
pd.Series()





Series([], dtype: float64)

### Question:

What are the parameters of pd.Series()?

In [17]:
# write here:

# help(pd.Series)       #(OR use the 'shift+tab' command)

# Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

<p>(Please type SOME actual answer before checking the solution. Even if you have NO IDEA. Just guess. It's good for you.)</p>


<details><summary>Solution
</summary>

```python
help(pd.Series)  # or use the 'shift+tab' command

# Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
```
</details>

### Question:

> PLEASE, NEVER EVER EVER COPY-PASTE WHILE DOING THESE QUESTIONS. OR ITS WRITER WILL HAUNT YOU. (AKA you won't learn to code it from memory. And in three weeks you'll realize you don't know it. We've seen it time and time again. It's worth it.) Make a second window if you need to copy code verbatim. Please.

> For variables you've defined or loaded use 'tab completion.' For example:
>
>
> Below type 'bra' + tab and watch the remainder populate. It can be hard to remember but it REALLY HELPS.

Make a series 'brag_vow' with index 'I am a numpy champ' and values 'So Pandas here I come'.

```python 

print(brag_vow)

I            So
am       Pandas
a          here
numpy         I
champ      come
dtype: object
```

In [15]:
# write here:
# pd.series(brag_vow)
# Series('So Pandas here I come', 'i am numpy champ')
brag_vow = pd.Series(data=["So", "Pandas", "Here", "I", "Come"], index=["I", "am", "a",  "numpy", "champ"])
brag_vow

I            So
am       Pandas
a          Here
numpy         I
champ      Come
dtype: object

In [None]:
brag_vow = pd.Series(data=["So", "Pandas", "Here", "I", "Come"], index=["I", "am", "a", "numpy", "champ"])

<p>(Please type SOME actual answer before checking the solution. Even if you have NO IDEA. Just guess. It's good for you.)</p>


<details><summary>Solution
</summary>

```python
brag_vow = pd.Series(data=["So", "Pandas", "here", "I", "come"], index=["I", "am", "a", "numpy", "champ"])
brag_vow
```
</details>

### Question:

Return the last three elements of 'brag_vow'.

```python 
a          here
numpy         I
champ      come
dtype: object
```

In [16]:
brag_vow[2:]


a        Here
numpy       I
champ    Come
dtype: object

In [19]:
# write here:

brag_vow[-3: ]
brag_vow[2: ]

a        Here
numpy       I
champ    Come
dtype: object

<p>(Please type SOME actual answer before checking the solution. Even if you have NO IDEA. Just guess. It's good for you.)</p>


<details><summary>Solution
</summary>

```python
brag_vow[2:]
```
</details>

### Question:

Please return only the elements whose index labels are 'I', 'a', and 'champ'. Do not use a numerical index.

```python 
I          So
a        here
champ    come
dtype: object
```

In [None]:
brag_vow[::]

In [123]:
# write here:
brag_vow[['I', 'a', 'champ']]

I          So
a        Here
champ    Come
dtype: object

<details><summary>Solution
</summary>

```python
brag_vow[['I', 'a', 'champ']]
```
</details>

### Question:

From student_heights, please return the students with even heights. Use boolean indexing.

(if you don't know what 'boolean indexing' is, please refer to the numpy assignment. You'll need it.)

```python 
Tomas    34
Stacy    36
Haden    38
dtype: int64
```

In [17]:
# write here:
student_heights[student_heights % 2==0]



Tomas    34
Stacy    36
Haden    38
dtype: int64

<details><summary>Solution
</summary>

```python
student_heights[student_heights%2==0]
```
</details>

### Question:

Return the 'vow' from brag_vow.

```python 
array(['So', 'Pandas', 'here', 'I', 'come'], dtype=object)
```

In [22]:
# write here:
brag_vow.values


array(['So', 'Pandas', 'Here', 'I', 'Come'], dtype=object)


<details><summary>Solution
</summary>

```python
brag_vow.values
```
</details>

### Question:

Return the 'brag' from brag_vow.

```python 
Index(['I', 'am', 'a', 'numpy', 'champ'], dtype='object')
```

In [125]:
brag_vow.index

Index(['I', 'am', 'a', 'numpy', 'champ'], dtype='object')

In [3]:
# write here:
brag_vow.index



<details><summary>Solution
</summary>

```python
brag_vow.index
```
</details>

## Series names

A Series can also carry an optional label of its own, called its `name`.

In [20]:
student_grades = pd.Series([45,56,78,89,90], 
                           index=['Tomas', 'Angel', 'Stacy', 'Michaela', 'Haden'], 
                           name='grades')

In [25]:
student_grades

Tomas       45
Angel       56
Stacy       78
Michaela    89
Haden       90
Name: grades, dtype: int64

See the name there? Great.

That's all. We'll use it later.
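
As a small preview of where the name shows up later, here is a sketch (using made-up three-student data) of how a series name becomes a column label when series are turned into a dataframe. `to_frame` and `pd.concat` are standard pandas calls; the variable names are illustrative only.

```python
import pandas as pd

names = ["Tomas", "Angel", "Stacy"]
grades = pd.Series([45, 56, 78], index=names, name="grades")
heights = pd.Series([34, 35, 36], index=names, name="heights")

# to_frame() turns one Series into a one-column DataFrame named after it.
grades_df = grades.to_frame()

# concat along axis=1 places the Series side by side, one column per name.
combined = pd.concat([grades, heights], axis=1)

print(list(grades_df.columns))  # ['grades']
print(list(combined.columns))   # ['grades', 'heights']
```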

### Creating DataFrames from Python Objects

So we know how to use series. Awesome.

But perhaps you noticed we have two series: student_heights and student_grades. Each has the same students in the index. Wouldn't it be cool if we could access all the student data at once?

To do this, we use a dataframe.

In [21]:
students = pd.DataFrame({'grade':student_grades, 'height':student_heights})
students

          grade  height
Tomas        45      34
Angel        56      35
Stacy        78      36
Michaela     89      37
Haden        90      38


There are AT LEAST two ways to think of a dataframe.

One is as multiple series with a matching index. The other is as a numpy array with explicitly labeled columns and rows.

The truth is, they're BOTH.

In [27]:
# numpy array:
students.values

array([[45, 34],
       [56, 35],
       [78, 36],
       [89, 37],
       [90, 38]])

In [28]:
# pandas series:
students['grade']

Tomas       45
Angel       56
Stacy       78
Michaela    89
Haden       90
Name: grade, dtype: int64

We could also create DataFrames from numpy arrays or list-of-lists with provided labels and indices. The `columns=` parameter specifies the names for the columns; the `index=` specifies the names for the rows.

In [29]:
pd.DataFrame(
    data = [[1, 2, 3], 
            [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])

     a  b  c
foo  1  2  3
bar  4  5  6


You might notice in pandas that, unlike python, there are MANY ways to do any one thing. This makes learning harder. But here we go anyways.
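
To make the "many ways" point concrete, here is a sketch of three constructions that should produce the same table. The variable names (`from_rows`, `from_dict`, `from_array`) are invented for this example.

```python
import numpy as np
import pandas as pd

index = ["foo", "bar"]
columns = ["a", "b", "c"]

# 1. From a list of rows.
from_rows = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=columns, index=index)

# 2. From a dict mapping column name -> column values.
from_dict = pd.DataFrame({"a": [1, 4], "b": [2, 5], "c": [3, 6]}, index=index)

# 3. From a 2-D numpy array.
from_array = pd.DataFrame(np.arange(1, 7).reshape(2, 3), columns=columns, index=index)

# Compare the underlying values of all three.
same_values = (
    from_rows.to_numpy().tolist()
    == from_dict.to_numpy().tolist()
    == from_array.to_numpy().tolist()
)
print(same_values)  # True
```

Which one to use mostly depends on the shape your data already has: rows, columns, or an array.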

### Question:

Make a dataframe 'classrooms' that contains this

```python 
> print(classrooms)

         chairs  projectors
gym           0           1
history       2           3
math          4           5
english       6           7
```

In [30]:
# write here:
classrooms = pd.DataFrame(
    data = [[0,1], [2,3], [4,5], [6,7]],
    columns=['chairs', 'projectors'],
    index = ['gym','history','math', 'english' ]
)

classrooms

         chairs  projectors
gym           0           1
history       2           3
math          4           5
english       6           7


<details><summary>Solution
</summary>

```python
classrooms = pd.DataFrame(data=np.arange(8).reshape((4,2)),
                          columns=['chairs', 'projectors'],
                          index=['gym','history','math','english'])

# or

classrooms = pd.DataFrame(data=np.array([[0, 1],
                                         [2, 3],
                                         [4, 5],
                                         [6, 7]]),
                          columns=['chairs', 'projectors'],
                          index=['gym','history','math','english'])
```
</details>

### Question:

Make a series 'chairs' like so. 

```python 
gym        0
history    2
math       4
english    6
Name: chairs, dtype: int64
```

In [31]:
# write here:
chairs = pd.Series({'gym': 0, 'history': 2, 'math': 4, 'english': 6}, name='chairs')
chairs

gym        0
history    2
math       4
english    6
Name: chairs, dtype: int64

<details><summary>Solution
</summary>

```python
chairs = pd.Series(data=range(0,8,2), index=['gym','history','math','english'], name='chairs')
chairs
```
</details>

### Question:

Make a series 'projectors' like so. 

```python 
gym        1
history    3
math       5
english    7
Name: projectors, dtype: int64
```

In [32]:
# write here:
projectors = pd.Series(data=range(1, 8, 2), index=['gym', 'history', 'math', 'english'], name='projectors')

projectors

gym        1
history    3
math       5
english    7
Name: projectors, dtype: int64

<details><summary>Solution
</summary>

```python
projectors = pd.Series(data=range(1,8,2), index=['gym','history','math','english'], name='projectors')
projectors
```
</details>

### Question:

Make the 'classrooms' dataframe again, but this time by using the 'chairs' and 'projectors' series.

```python 
> print(classrooms)

         chairs  projectors
gym           0           1
history       2           3
math          4           5
english       6           7
```

In [33]:
# write here:
classrooms = pd.DataFrame({'chairs': chairs, 'projectors': projectors})
classrooms



         chairs  projectors
gym           0           1
history       2           3
math          4           5
english       6           7


<details><summary>Solution
</summary>

```python
classrooms = pd.DataFrame(data={'chairs':chairs, 'projectors':projectors})
classrooms
```
</details>

### Question:

Create a dataframe 'down_up' with two columns: `decreasing` and `increasing`, that have the numbers 1-10 in increasing and decreasing orders. The index can be anything.

```python
> print(down_up)

   decreasing  increasing
0          10           1
1           9           2
2           8           3
3           7           4
4           6           5
5           5           6
6           4           7
7           3           8
8           2           9
9           1          10
```

In [34]:
# write here:
down_up = pd.DataFrame(
    data=[[10, 1], [9, 2], [8, 3], [7, 4], [6, 5], [5, 6], [4, 7], [3, 8], [2, 9], [1, 10]],
    columns=['decreasing', 'increasing'])

down_up

   decreasing  increasing
0          10           1
1           9           2
2           8           3
3           7           4
4           6           5
5           5           6
6           4           7
7           3           8
8           2           9
9           1          10


<details><summary>Solution
</summary>

```python
down_up = pd.DataFrame(data={'decreasing': range(10,0,-1), 'increasing':range(1,11)})
down_up
```
Those ranges are tricky, aren't they?
</details>

You can also put a Series into a DataFrame as long as you have matching index.

## Modifying dataframes

Dataframes are VERY much like numpy arrays with added columns and index.

But unlike arrays (which have a fixed size), we CAN add columns to a dataframe!

In [35]:
students['was_late'] = [True, False, True, False, False]
students

          grade  height  was_late
Tomas        45      34      True
Angel        56      35     False
Stacy        78      36      True
Michaela     89      37     False
Haden        90      38     False


We can also do this by adding an array.

In [36]:
students['was_late'] = np.array([True, False, True, False, False])
students

          grade  height  was_late
Tomas        45      34      True
Angel        56      35     False
Stacy        78      36      True
Michaela     89      37     False
Haden        90      38     False


But the array must have a shape of either (n,) or (n, 1). Not (1, n). Which kinda makes sense, because a (1, n) array is a row, and that's trying to fit a row into a column.

In [37]:
students['was_late'] = np.array([True, False, True, False, False]).reshape(-1,1)
students

          grade  height  was_late
Tomas        45      34      True
Angel        56      35     False
Stacy        78      36      True
Michaela     89      37     False
Haden        90      38     False


```python
# this would raise an error, because the array has shape (1, 5):
students['was_late'] = np.array([True, False, True, False, False]).reshape(1,-1)
students
```
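
Here is a self-contained sketch of the shape rule, using a made-up three-student dataframe. It assumes (as in current pandas releases) that a length mismatch is reported as a `ValueError`; the column names `broken` and `was_late_again` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"grade": [45, 56, 78]}, index=["Tomas", "Angel", "Stacy"])
flags = np.array([True, False, True])

df["was_late"] = flags                 # shape (3,): one value per row

row_shaped = flags.reshape(1, -1)      # shape (1, 3): a single row, not a column
try:
    df["broken"] = row_shaped
except ValueError as err:
    print("rejected:", err)

# ravel() flattens the array back to shape (3,), which fits a column.
df["was_late_again"] = row_shaped.ravel()
print(df)
```

If you are ever unsure what shape an array has, flattening it with `ravel()` before assigning is a simple way to stay out of trouble.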

So let's get going.

### Question:

Add a column 'top_crush' to 'students', holding the name of the person each student has a crush on. (Make it up!)

```python
> print(students)

          grade  height  was_late     top_crush
Tomas        45      34      True         Angel
Angel        56      35     False         Haden
Stacy        78      36      True      Michaela
Michaela     89      37     False         Tomas
Haden        90      38     False  Edgar A. Poe
```

In [38]:
# write here:

students['top_crush'] = np.array(['Angel', 'Haden', 'Michaela', 'Tomas', 'Edgar A. Poe']).reshape(-1,1)


          grade  height  was_late     top_crush
Tomas        45      34      True         Angel
Angel        56      35     False         Haden
Stacy        78      36      True      Michaela
Michaela     89      37     False         Tomas
Haden        90      38     False  Edgar A. Poe


<details><summary>Solution
</summary>

```python
students['top_crush'] = ['Angel', 'Haden', 'Michaela', 'Tomas', 'Edgar A. Poe']
print(students)
```

Alas, love is cruel.
</details>

## Broadcasting

Again, like a numpy array, we can also use broadcasting on a series. Like so.

In [39]:
students['grade']>75

Tomas       False
Angel       False
Stacy        True
Michaela     True
Haden        True
Name: grade, dtype: bool

So to make a new column called 'above_75_percent' we would write

In [40]:
students['above_75_percent'] = students['grade']>75
students

          grade  height  was_late     top_crush  above_75_percent
Tomas        45      34      True         Angel             False
Angel        56      35     False         Haden             False
Stacy        78      36      True      Michaela              True
Michaela     89      37     False         Tomas              True
Haden        90      38     False  Edgar A. Poe              True


This is powerful. You might not feel it yet. But it is.
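
Broadcasting is not limited to comparisons. As a sketch, the example below multiplies a whole series by a scalar; it treats the heights as inches purely for illustration (the notebook does not state a unit), and the names `heights_in`, `heights_cm`, and `is_tall` are made up.

```python
import pandas as pd

heights_in = pd.Series([34, 35, 36], index=["Tomas", "Angel", "Stacy"], name="height")

# A scalar is broadcast across every element of the Series.
heights_cm = heights_in * 2.54

# Comparisons broadcast the same way and produce a boolean Series.
is_tall = heights_in >= 35

print(heights_cm.round(2).tolist())
print(is_tall.tolist())  # [False, True, True]
```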

### Question:

Add a column 'taller_than_35' to students. This can be done using broadcasting on an existing series.

```python
> print(students[['grade', 'height', 'taller_than_35']])

          grade  height  taller_than_35
Tomas        45      34           False
Angel        56      35           False
Stacy        78      36            True
Michaela     89      37            True
Haden        90      38            True
```

In [41]:
# write here:

students['taller_than_35']= students['height'] > 35
students

          grade  height  was_late     top_crush  above_75_percent  taller_than_35
Tomas        45      34      True         Angel             False           False
Angel        56      35     False         Haden             False           False
Stacy        78      36      True      Michaela              True            True
Michaela     89      37     False         Tomas              True            True
Haden        90      38     False  Edgar A. Poe              True            True


<details><summary>Solution
</summary>

```python
students['taller_than_35'] = students['height']>35
print(students[['grade', 'height', 'taller_than_35']])
```
</details>

But buyer beware. The elements in the assignment above are matched **by index**, which is a common pattern in Pandas.

## Index tricks

Let's take a deeper look at this.

In [25]:
df = pd.DataFrame(
    [[1, 2, 3], [4, 5, 6]], 
    columns=['a', 'b', 'c'], 
    index=['foo', 'bar'])
df

     a  b  c
foo  1  2  3
bar  4  5  6


In [43]:
# INDEX IS BACKWARDS
df['d'] = pd.Series([4, 5], index=['bar', 'foo'])
df

     a  b  c  d
foo  1  2  3  5
bar  4  5  6  4


Where an index label has no match, a missing value is filled into the unmatched space.

In [44]:
df['d'] = pd.Series([5, 4], index=['bar', 'baz'])
df

     a  b  c    d
foo  1  2  3  NaN
bar  4  5  6  5.0


We can also put a list/vector into a DataFrame, and here there is no index, so the column is inserted in order.

In [45]:
df['e'] = [1, 2]
df

     a  b  c    d  e
foo  1  2  3  NaN  1
bar  4  5  6  5.0  2


So be careful if you're using a series:

Because a series created without an index gets the default 0, 1, 2, ... labels, the dataframe just says 'there are no matching rows, silly!' and inserts all NaNs. ('NaN' stands for 'Not a Number'.)

In [46]:
students['was_late'] = pd.Series([True, False, True, False, False])
students

          grade  height  was_late     top_crush  above_75_percent  taller_than_35
Tomas        45      34       NaN         Angel             False           False
Angel        56      35       NaN         Haden             False           False
Stacy        78      36       NaN      Michaela              True            True
Michaela     89      37       NaN         Tomas              True            True
Haden        90      38       NaN  Edgar A. Poe              True            True


Adding the index:

In [47]:
students['was_late'] = pd.Series([True, False, True, False, False], index=students.index)
students

          grade  height  was_late     top_crush  above_75_percent  taller_than_35
Tomas        45      34      True         Angel             False           False
Angel        56      35     False         Haden             False           False
Stacy        78      36      True      Michaela              True            True
Michaela     89      37     False         Tomas              True            True
Haden        90      38     False  Edgar A. Poe              True            True


Column insertion, both with and without specifying the index, is super useful.
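
If you already have a series but want it inserted in order, ignoring its labels, one option is to hand pandas the underlying array instead. A minimal sketch with a made-up two-row dataframe (the column names `aligned` and `positional` are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 4]}, index=["foo", "bar"])
s = pd.Series([10, 20])          # default index 0, 1: shares no labels with df

df["aligned"] = s                # matched by label -> every value is NaN
df["positional"] = s.to_numpy()  # plain array -> inserted in order

print(df)
```

The trade-off: positional insertion is convenient, but it silently trusts that the values are already in the right row order.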

### Question:

Create a dataframe 'numbers' that has one column `increasing`. The `increasing` column contains the numbers 0-49 in increasing order. Then insert a column called `evens`, which has the even numbers in increasing order at the same locations as in `increasing`, but with missing values in the other locations.

```python
> print(numbers)

    increasing  evens
0            0    0.0
1            1    NaN
2            2    2.0
3            3    NaN
4            4    4.0
5            5    NaN
6            6    6.0
7            7    NaN
8            8    8.0
9            9    NaN
10          10   10.0
.            .      .
.            .      .
.            .      .

```

In [48]:
# write here:

numbers = pd.DataFrame([num for num in range(50)], columns=['increasing'])
numbers['evens'] = pd.Series([num for num in range(50) if num % 2 == 0], index=[num for num in range(50) if num % 2 == 0])

numbers

    increasing  evens
0            0    0.0
1            1    NaN
2            2    2.0
3            3    NaN
4            4    4.0
5            5    NaN
6            6    6.0
7            7    NaN
8            8    8.0
9            9    NaN


<details><summary>Solution
</summary>

```python
numbers = pd.DataFrame([[n] for n in range(50)], columns=['increasing'])
numbers['evens'] = pd.Series([n for n in range(50) if n%2==0], index=[n for n in range(50) if n%2==0])
print(numbers)

```

</details>

### Question:

We only have information for some students. Please add the following information into a column named 'pickup_time'.  The rest can be NaNs.

Haden = 4,
Tomas = 5,
Stacy = 3


```python
> print(students[['grade', 'height', 'pickup_time']])

          grade  height  pickup_time
Tomas        45      34          5.0
Angel        56      35          NaN
Stacy        78      36          3.0
Michaela     89      37          NaN
Haden        90      38          4.0

```

In [49]:
# write here:

students['pickup_time'] = pd.Series([4, 5, 3], ['Haden', 'Tomas', 'Stacy'])
students


          grade  height  was_late     top_crush  above_75_percent  taller_than_35  pickup_time
Tomas        45      34      True         Angel             False           False          5.0
Angel        56      35     False         Haden             False           False          NaN
Stacy        78      36      True      Michaela              True            True          3.0
Michaela     89      37     False         Tomas              True            True          NaN
Haden        90      38     False  Edgar A. Poe              True            True          4.0


<details><summary>Solution
</summary>

```python
students['pickup_time'] = pd.Series([4,5,3], ['Haden', 'Tomas', 'Stacy'])
print(students[['grade', 'height', 'pickup_time']])
```

</details>

### Question:

We only know about the first three students. Please add the following information into a column named 'pickup_time'.  The rest can be NaNs.

4, 5, 2


```python
> print(students[['grade', 'height', 'pickup_time']])

          grade  height  pickup_time
Tomas        45      34          4.0
Angel        56      35          5.0
Stacy        78      36          2.0
Michaela     89      37          NaN
Haden        90      38          NaN

```

(NaN is input as `np.nan`. The older alias `np.NaN` was removed in NumPy 2.0.)

In [50]:
# write here:
students['pickup_time'] = pd.Series([4.0,5.0, 2.0], ['Tomas', 'Angel', 'Stacy'])
students['pickup_time']

Tomas       4.0
Angel       5.0
Stacy       2.0
Michaela    NaN
Haden       NaN
Name: pickup_time, dtype: float64

<details><summary>Solution
</summary>

```python
students['pickup_time'] = [4, 5, 2, np.nan, np.nan]
print(students[['grade', 'height', 'pickup_time']])
```

</details>

## Loading Data

To get external data into pandas, we first need to know where it is.

We can use 'ls' to do this

In [51]:
ls

README.md                data/                    pandas_assignment.ipynb


Ah, the data is probably kept in 'data/'

In [52]:
ls data

chess_games.csv*       hospital-costs.csv     winequality-red.csv
exo_planet.csv*        playgolf.csv           winequality-white.csv


Pandas can load a csv, but csvs often are separated by things that ARE NOT COMMAS. So let's look at the first line INSIDE playgolf.csv

In [53]:
# '!' means we're running a bash command
!head -1 data/playgolf.csv

7/14/14,rain,71,80,TRUE,Don't Play


### Load data from csv

A csv (comma separated values) is a file format used to store data separated by a **delimiter**.

A delimiter is the **single character** that divides the data elements in a file.  A comma is the traditional choice of delimiter but a relatively poor one, because commas often appear inside the data elements themselves.  Better choices are pipe (`|`) and tab (`\t`).

In a bizarre twist of history, comma separated files are often separated by different characters than commas.  There is no consistent convention of using a different file extension, but some people use `.psv` or `.tsv`.

Pandas has a `read_csv` function that loads a delimited file into a `DataFrame`.  
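
To see why the delimiter matters, here is a small sketch that reads the same text twice. The pipe-delimited string is hypothetical sample data standing in for a file on disk; `sep` is the `read_csv` parameter for the delimiter, and `delimiter` is an alias for it.

```python
import io
import pandas as pd

# Hypothetical pipe-delimited text standing in for a file on disk.
raw = "Date|Outlook|Temperature\n7/1/14|sunny|85\n7/2/14|overcast|80\n"

# With the default comma separator, each line is read as one wide column.
wrong = pd.read_csv(io.StringIO(raw))

# Passing sep="|" splits the fields as intended.
right = pd.read_csv(io.StringIO(raw), sep="|")

print(wrong.shape)  # (2, 1)
print(right.shape)  # (2, 3)
```

A dataframe with a single suspiciously long column name is usually the sign that the wrong delimiter was used.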

In [55]:
golf_df = pd.read_csv('data/playgolf.csv', delimiter=',')

In [56]:
golf_df

      Date   Outlook  Temperature  Humidity  Windy      Result
0   7/1/14     sunny           85        85  False  Don't Play
1   7/2/14     sunny           80        90   True  Don't Play
2   7/3/14  overcast           83        78  False        Play
3   7/4/14      rain           70        96  False        Play
4   7/5/14      rain           68        80  False        Play
5   7/6/14      rain           65        70   True  Don't Play
6   7/7/14  overcast           64        65   True        Play
7   7/8/14     sunny           72        95  False  Don't Play
8   7/9/14     sunny           69        70  False        Play
9  7/10/14      rain           75        80  False        Play


### Question:

Load in hospital-costs.csv as 'hospital_costs'.


In [57]:
# write here:
# !head -1 data/hospital-costs.csv
hospital_costs = pd.read_csv('data/hospital-costs.csv', delimiter = ',')
hospital_costs

Unnamed: 0,Year,Facility Id,Facility Name,APR DRG Code,APR Severity of Illness Code,APR DRG Description,APR Severity of Illness Description,APR Medical Surgical Code,APR Medical Surgical Description,Discharges,Mean Charge,Median Charge,Mean Cost,Median Cost
0,2011,324,Adirondack Medical Center-Saranac Lake Site,4,4,Tracheostomy W MV 96+ Hours W Extensive Proced...,Extreme,P,Surgical,3,361289.0,210882.0,196080.0,123347.0
1,2011,324,Adirondack Medical Center-Saranac Lake Site,5,4,Tracheostomy W MV 96+ Hours W/O Extensive Proc...,Extreme,P,Surgical,1,102190.0,102190.0,59641.0,59641.0
2,2011,324,Adirondack Medical Center-Saranac Lake Site,24,2,Extracranial Vascular Procedures,Moderate,P,Surgical,6,14172.0,13506.0,6888.0,6445.0
3,2011,324,Adirondack Medical Center-Saranac Lake Site,26,1,Other Nervous System & Related Procedures,Minor,P,Surgical,1,8833.0,8833.0,4259.0,4259.0
4,2011,324,Adirondack Medical Center-Saranac Lake Site,41,1,Nervous System Malignancy,Minor,M,Medical,1,5264.0,5264.0,1727.0,1727.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
383488,2009,1153,Wyoming County Community Hospital,951,3,Moderately Extensive Procedure Unrelated To Pr...,Major,P,Surgical,5,13572.0,12615.0,14433.0,15835.0
383489,2009,1153,Wyoming County Community Hospital,952,3,Nonextensive Procedure Unrelated To Principal ...,Major,P,Surgical,4,8323.0,8179.0,9520.0,8674.0
383490,2009,1153,Wyoming County Community Hospital,952,2,Nonextensive Procedure Unrelated To Principal ...,Moderate,P,Surgical,5,7746.0,5120.0,7257.0,5321.0
383491,2009,1153,Wyoming County Community Hospital,952,1,Nonextensive Procedure Unrelated To Principal ...,Minor,P,Surgical,1,7892.0,7892.0,6528.0,6528.0


<details><summary>Solution
</summary>

```python
!head -1 data/hospital-costs.csv

hospital_costs = pd.read_csv('data/hospital-costs.csv')
hospital_costs
```

</details>

### Question:

Load in winequality-white.csv as 'wine_quality_white'.

In [58]:
ls data

chess_games.csv*       hospital-costs.csv     winequality-red.csv
exo_planet.csv*        playgolf.csv           winequality-white.csv


In [59]:
# write here:

# !head -1 data/winequality-white.csv
wine_quality_white = pd.read_csv('data/winequality-white.csv', delimiter = ';')
wine_quality_white

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.00100,3.00,0.45,8.8,6
1,6.3,0.30,0.34,1.6,0.049,14.0,132.0,0.99400,3.30,0.49,9.5,6
2,8.1,0.28,0.40,6.9,0.050,30.0,97.0,0.99510,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.99560,3.19,0.40,9.9,6
...,...,...,...,...,...,...,...,...,...,...,...,...
4893,6.2,0.21,0.29,1.6,0.039,24.0,92.0,0.99114,3.27,0.50,11.2,6
4894,6.6,0.32,0.36,8.0,0.047,57.0,168.0,0.99490,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.041,30.0,111.0,0.99254,2.99,0.46,9.4,6
4896,5.5,0.29,0.30,1.1,0.022,20.0,110.0,0.98869,3.34,0.38,12.8,7


<details><summary>Solution
</summary>

```python
!head -5 data/winequality-white.csv # to see what delimiter

wine_quality_white = pd.read_csv('data/winequality-white.csv', delimiter=';')
wine_quality_white
```

</details>

## Extracting information from DataFrames

#### Basic Row and Column Indexing

As we have seen, individual columns may be extracted from a `DataFrame` as a `Series` using the usual `__getitem__` style indexing using the name of the column.  

This is similar to how we index a dictionary.

In [60]:
golf_df['Temperature']

0     85
1     80
2     83
3     70
4     68
5     65
6     64
7     72
8     69
9     75
10    75
11    72
12    81
13    71
Name: Temperature, dtype: int64

We can extract individual values by taking the series out of the dataframe, then indexing it like a list.

In [61]:
golf_df['Temperature'][0]

85
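
The lookup above works in two steps: column first, then row. As a sketch of an alternative, `.loc` and `.at` take the row label and the column label together. The three-row `golf` dataframe below is a hypothetical stand-in for `golf_df`.

```python
import pandas as pd

# Hypothetical three-row stand-in for golf_df.
golf = pd.DataFrame({"Temperature": [85, 80, 83], "Humidity": [85, 90, 78]})

chained = golf["Temperature"][0]       # column first, then row label 0
direct = golf.loc[0, "Temperature"]    # row label and column label together
scalar = golf.at[0, "Temperature"]     # accessor meant for a single cell

print(chained, direct, scalar)  # 85 85 85
```

All three read the same value; the single-step forms tend to be the safer habit once you start assigning values rather than just reading them.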

We can extract multiple columns at once.

In [62]:
golf_df[['Temperature', 'Humidity']]

   Temperature  Humidity
0           85        85
1           80        90
2           83        78
3           70        96
4           68        80
5           65        70
6           64        65
7           72        95
8           69        70
9           75        80


If you try to index with a slice, however, it will only operate on the rows.

In [63]:
short_df = golf_df[0:5]
short_df

     Date   Outlook  Temperature  Humidity  Windy      Result
0  7/1/14     sunny           85        85  False  Don't Play
1  7/2/14     sunny           80        90   True  Don't Play
2  7/3/14  overcast           83        78  False        Play
3  7/4/14      rain           70        96  False        Play
4  7/5/14      rain           68        80  False        Play


### Question:

Look at hospital_costs Facility Id, Year, Discharges, and Mean Charge columns side by side in a dataframe.

```python
        Facility Id  Year  Discharges  Mean Charge
0               324  2011           3     361289.0
1               324  2011           1     102190.0
2               324  2011           6      14172.0
3               324  2011           1       8833.0
4               324  2011           1       5264.0
...             ...   ...         ...          ...
383488         1153  2009           5      13572.0
383489         1153  2009           4       8323.0
383490         1153  2009           5       7746.0
383491         1153  2009           1       7892.0
383492         1153  2009           3       1069.0
```

In [64]:
# write here:
hospital_costs[['Facility Id', 'Year', 'Discharges', 'Mean Charge']]


        Facility Id  Year  Discharges  Mean Charge
0               324  2011           3     361289.0
1               324  2011           1     102190.0
2               324  2011           6      14172.0
3               324  2011           1       8833.0
4               324  2011           1       5264.0
...             ...   ...         ...          ...
383488         1153  2009           5      13572.0
383489         1153  2009           4       8323.0
383490         1153  2009           5       7746.0
383491         1153  2009           1       7892.0


<details><summary>Solution
</summary>

```python
print(hospital_costs[['Facility Id', 'Year', 'Discharges', 'Mean Charge']])
```

</details>

## Boolean / Logical Indexing

We can also index into a `DataFrame` using a list of **booleans** (i.e. `True` and `False` values). This will also operate on the rows.

In [65]:
# Takes rows 0, 2, and 4.
short_df[[True, False, True, False, True]]

     Date   Outlook  Temperature  Humidity  Windy      Result
0  7/1/14     sunny           85        85  False  Don't Play
2  7/3/14  overcast           83        78  False        Play
4  7/5/14      rain           68        80  False        Play


Which doesn't seem that useful...except we can create a boolean `Series` by using comparisons on a Series

In [66]:
# A series of booleans.
golf_df['Temperature'] > 70

0      True
1      True
2      True
3     False
4     False
5     False
6     False
7      True
8     False
9      True
10     True
11     True
12     True
13     True
Name: Temperature, dtype: bool

And then use the result to grab rows of the dataframe.

In [67]:
golf_df[golf_df['Temperature'] > 70][["Date", "Windy"]]

       Date  Windy
0    7/1/14  False
1    7/2/14   True
2    7/3/14  False
7    7/8/14  False
9   7/10/14  False
10  7/11/14   True
11  7/12/14   True
12  7/13/14  False
13  7/14/14   True


This is essentially applying a logical condition to select rows from a `DataFrame`.  This is one of the most common patterns in Pandas.

To review: if you index a `DataFrame` with a **single value** or a **list of values**, it selects the **columns**.

If you use a **slice** or **sequence of booleans**, it selects the **rows**. 
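
The review above can be sketched in a few lines. The three-row `golf` dataframe is a hypothetical stand-in for `golf_df`; the last line uses `.loc`, which takes a row selector and a column selector in one call.

```python
import pandas as pd

# Hypothetical stand-in for golf_df with three rows.
golf = pd.DataFrame({
    "Outlook": ["sunny", "rain", "overcast"],
    "Temperature": [85, 70, 83],
    "Windy": [False, False, True],
})

cols = golf[["Outlook", "Windy"]]        # list of labels -> columns
rows = golf[0:2]                         # slice -> rows
mask = golf[golf["Temperature"] > 80]    # booleans -> rows

# .loc takes a row selector and a column selector in one call.
both = golf.loc[golf["Temperature"] > 80, ["Outlook", "Windy"]]

print(both)
```

Using `.loc` here does the same job as chaining two sets of brackets, but states the row and column choices in a single step.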

### Question:

Select all of the days in which the humidity is larger than 90 from 'golf_df'.

```python
     Date Outlook  Temperature  Humidity  Windy      Result
3  7/4/14    rain           70        96  False        Play
7  7/8/14   sunny           72        95  False  Don't Play
```

In [68]:
# write here:
golf_df[golf_df['Humidity'] > 90]


Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
3,7/4/14,rain,70,96,False,Play
7,7/8/14,sunny,72,95,False,Don't Play


<details><summary>Solution
</summary>

```python
golf_df[golf_df['Humidity']>90]
```

</details>

## Logical Aside

As a reminder, logical operators on arrays and Series look like this:

In [69]:
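# '|' is 'OR'. Note: "Sunny" (capitalized) matches no rows because the data uses "sunny",
# so only the humidity condition contributes to the result below.
(golf_df['Humidity']>90) | (golf_df['Outlook']=="Sunny")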

0     False
1     False
2     False
3      True
4     False
5     False
6     False
7      True
8     False
9     False
10    False
11    False
12    False
13    False
dtype: bool

In [70]:
(golf_df['Result']=="Don't Play") & golf_df['Windy'] # '&' is 'AND'

0     False
1      True
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13     True
dtype: bool
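
A short sketch of the element-wise operators on a made-up frame (the `toy` data is hypothetical). It also shows why the parentheses matter, and that Python's own `and` / `or` do not work on a `Series`.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "outlook": ["sunny", "rain", "rain", "overcast"],
    "humidity": [85, 96, 70, 65],
})

rainy = toy["outlook"] == "rain"
humid = toy["humidity"] > 90

both = rainy & humid        # AND
either = rainy | humid      # OR
not_rainy = ~rainy          # NOT

# Parentheses are required when the comparisons are written inline,
# because & and | bind more tightly than == and >.
inline = (toy["outlook"] == "rain") & (toy["humidity"] > 90)

# Python's `and` / `or` ask for a single truth value, which a Series
# with several elements cannot provide, so they raise ValueError.
try:
    rainy and humid
    raised = False
except ValueError:
    raised = True
```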

### Question:

Select all of the rainy days in which the temperature is less than 70 degrees.

In [71]:
# write here:
golf_df[(golf_df['Outlook']=='rain') & (golf_df['Temperature'] < 70)]


Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result
4,7/5/14,rain,68,80,False,Play
5,7/6/14,rain,65,70,True,Don't Play


<details><summary>Solution
</summary>

```python
golf_df[(golf_df['Temperature'] <70) & (golf_df['Outlook'] == 'rain')]
                                     
```

</details>

### Question:

Select the Date and Result of the rainy days on which the humidity is larger than 90 from `golf_df`.

```python
     Date Result
3  7/4/14   Play
```

In [72]:
# write here:
golf_df[(golf_df['Outlook']=='rain') & (golf_df['Humidity'] > 90)] [['Date', 'Result']]

Unnamed: 0,Date,Result
3,7/4/14,Play


<details><summary>Solution
</summary>

```python
golf_df[(golf_df['Outlook']=="rain") & (golf_df['Humidity']>90)][['Date','Result']]
```

</details>

## Double Indexing

Suppose we want to set the value of the `Windy` column where `Temperature > 70` to True (because, um, science).

In [73]:
golf_df[golf_df['Temperature'] > 70]["Windy"] = True

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  golf_df[golf_df['Temperature'] > 70]["Windy"] = True


What?

In [74]:
golf_df[golf_df['Temperature'] > 70]["Windy"]

0     False
1      True
2     False
7     False
9     False
10     True
11     True
12    False
13     True
Name: Windy, dtype: bool

Huh, the `False` values are still there. Apparently that warning actually meant something.

This pattern is called double indexing, and it is an antipattern! The first index may return a copy, so pandas cannot guarantee that an assignment made through the second index reaches the original `DataFrame`.

To fix these issues, we need to study the other indexing options that Pandas provides.

#### Other Indexers: .loc and .iloc

There are two other indexing objects in pandas, both of which take a value to choose rows and a value to choose columns.

  - `df.iloc` is **positionally based**.  This indexer accepts integers and integer slices, and essentially treats the data frame as if it were a simple matrix.
  - `df.loc` is **label based**.  This indexer works with row and column indices / labels.

In [75]:
df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)
df

Unnamed: 0,some_integers,some_strings,some_booleans
a,0,x,0
b,0,y,0
c,1,z,1
d,1,x,0
e,2,y,1
f,2,z,1


In [76]:
df.iloc[2:4, 0:2]

Unnamed: 0,some_integers,some_strings
c,1,z
d,1,x


In [77]:
df.loc['b':'e', ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
b,0,0
c,1,1
d,1,0
e,2,1
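
With `.loc` available, the assignment that failed in the Double Indexing section can be written as one indexing operation: the boolean mask picks the rows and the label picks the column. A minimal sketch on a hypothetical stand-in for `golf_df` (the `toy` frame below is made up):

```python
import pandas as pd

# Hypothetical stand-in for golf_df, only for illustration.
toy = pd.DataFrame({
    "Temperature": [85, 70, 68, 72],
    "Windy": [False, False, True, False],
})

# One indexing call: rows chosen by the mask, column chosen by label.
toy.loc[toy["Temperature"] > 70, "Windy"] = True

print(toy)
```

Because there is only one indexing step, the assignment acts on `toy` itself rather than on a temporary copy.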


### Question:

Select rows a,b,c,f of df.

```python
   some_integers some_strings  some_booleans
a              0            x              0
b              0            y              0
c              1            z              1
f              2            z              1
```

In [78]:
# write here:
df.loc[['a', 'b', 'c', 'f']]


Unnamed: 0,some_integers,some_strings,some_booleans
a,0,x,0
b,0,y,0
c,1,z,1
f,2,z,1


<details><summary>Solution
</summary>

```python
df.loc[['a', 'b', 'c', 'f']]
```

</details>

### Question:

Select the 0th, 1st, and 5th rows of df.

```python
   some_integers some_strings  some_booleans
a              0            x              0
b              0            y              0
f              2            z              1
```

In [79]:
# write here:
df.iloc[[0,1,5]]


Unnamed: 0,some_integers,some_strings,some_booleans
a,0,x,0
b,0,y,0
f,2,z,1


<details><summary>Solution
</summary>

```python
df.iloc[[0, 1, 5]]
```

</details>

### Question:

Select the 0th and 4th students.

```python
       grade  height was_late     top_crush   ...
Tomas     45      34      NaN         Angel
Haden     90      38      NaN  Edgar A. Poe
```

In [80]:
# write here:

students.iloc[[0,4]]

Unnamed: 0,grade,height,was_late,top_crush,above_75_percent,taller_than_35,pickup_time
Tomas,45,34,True,Angel,False,False,4.0
Haden,90,38,False,Edgar A. Poe,True,True,


<details><summary>Solution
</summary>

```python
students.iloc[[0,4]]
```

</details>

### Question:

Select Haden, Angel and Michaela, in that order.

```python
          grade  height was_late     top_crush
Haden        90      38      NaN  Edgar A. Poe
Angel        56      35      NaN         Haden
Michaela     89      37      NaN         Tomas
```

In [81]:
# write here:
students.loc[['Haden', 'Angel', 'Michaela']]


Unnamed: 0,grade,height,was_late,top_crush,above_75_percent,taller_than_35,pickup_time
Haden,90,38,False,Edgar A. Poe,True,True,
Angel,56,35,False,Haden,False,False,5.0
Michaela,89,37,False,Tomas,True,True,


<details><summary>Solution
</summary>

```python
students.loc[['Haden', 'Angel', 'Michaela']]
```

</details>

## Mixed Indexing

So what do we do if we want to get the rows by position and the columns by label, i.e. if we have a use for **mixed indexing**?

```python
# Mixed indexing with iloc: will not work.
>>> df.iloc[2:4, ['some_integers', 'some_booleans']]

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-337-dcc694afee25> in <module>
      1 # Mixed indexing with iloc: will not work.
----> 2 df.iloc[2:4, ['some_integers', 'some_booleans']]
```

Doing mixed indexing in modern pandas is more explicit, less magic.  You need to use the `df.index` and `df.columns` attributes to explicitly turn positions into labels.

#### Rows by position, Columns by name

In [82]:
df.index[2:4]

Index(['c', 'd'], dtype='object')

In [83]:
df.loc[df.index[2:4], ['some_integers', 'some_booleans']]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0


#### Rows by name, Columns by position

In [84]:
df.columns[[0, 2]]

Index(['some_integers', 'some_booleans'], dtype='object')

In [85]:
df.loc[['c', 'd'], df.columns[[0, 2]]]

Unnamed: 0,some_integers,some_booleans
c,1,1
d,1,0
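
The conversion can also go the other way: turn labels into positions and hand those to `.iloc`. The sketch below rebuilds the same `df` and uses `Index.get_indexer`, which returns the integer positions of a list of labels. It is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame({
    'some_integers': [0, 0, 1, 1, 2, 2],
    'some_strings': ['x', 'y', 'z', 'x', 'y', 'z'],
    'some_booleans': [0, 0, 1, 0, 1, 1]},
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

# Labels -> positions for the columns, then purely positional indexing.
col_positions = df.columns.get_indexer(['some_integers', 'some_booleans'])
by_position = df.iloc[2:4, col_positions]

# Positions -> labels for the rows, then purely label-based indexing.
by_label = df.loc[df.index[2:4], ['some_integers', 'some_booleans']]

print(by_position)
```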


### Question:

Use mixed indexing to get the 0th, 2nd, and 4th rows of `students`, showing only top_crush and pickup_time. (don't use the row names explicitly)

```python
          top_crush  pickup_time
Tomas         Angel          4.0
Stacy      Michaela          2.0
Haden  Edgar A. Poe          NaN
```

In [86]:
# write here:
students.loc[students.index[[0,2,4]], ['top_crush', 'pickup_time']]


Unnamed: 0,top_crush,pickup_time
Tomas,Angel,4.0
Stacy,Michaela,2.0
Haden,Edgar A. Poe,


<details><summary>Solution
</summary>

```python
students.loc[students.index[[0,2,4]], ['top_crush', 'pickup_time']]
```

</details>

### Question:

Use mixed indexing to get the 0th, 2nd, 3rd, and 0th columns of Stacy and Haden. (don't use the column names explicitly)

```python
       grade  was_late     top_crush  grade
Stacy     78      True      Michaela     78
Haden     90     False  Edgar A. Poe     90
```

In [87]:
# write here:
students.loc[['Stacy', 'Haden'], students.columns[[0,2,3,0]]]


Unnamed: 0,grade,was_late,top_crush,grade.1
Stacy,78,True,Michaela,78
Haden,90,False,Edgar A. Poe,90


<details><summary>Solution
</summary>

```python
students.loc[['Stacy', 'Haden'], students.columns[[0, 2, 3, 0]]]
```

</details>

## Transforming data

Arithmetic operations apply to `Series` element by element (like arrays).

In [88]:
# Yes, this makes no sense.
golf_df["TempHumid"] = golf_df['Temperature'] + golf_df['Humidity']

In [89]:
golf_df.head()

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid
0,7/1/14,sunny,85,85,False,Don't Play,170
1,7/2/14,sunny,80,90,True,Don't Play,170
2,7/3/14,overcast,83,78,False,Play,161
3,7/4/14,rain,70,96,False,Play,166
4,7/5/14,rain,68,80,False,Play,148


In [90]:
# More Usefully

# Heat index formula taken from wikipedia: 
#    https://en.wikipedia.org/wiki/Heat_index
temp = golf_df['Temperature']
humid = golf_df['Humidity']
golf_df['HeatIndex'] = (-42.37 + 2.05*temp + 10.14*humid
                        - 0.225*temp*humid
                        - 6.84e-3*temp**2 
                        - 5.482e-2*humid**2
                        + 1.23e-3*temp**2*humid
                        + 8.53e-4*temp*humid**2
                        - 1.99e-6*temp**2*humid**2
)
golf_df[['Temperature', 'Humidity', 'HeatIndex']].head()

Unnamed: 0,Temperature,Humidity,HeatIndex
0,85,85,98.004631
1,80,90,84.4744
2,83,78,89.669911
3,70,96,62.847024
4,68,80,69.089776
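
The same element-by-element behavior holds for scalars and comparisons. A minimal sketch on a made-up `Series` (the names `temps_f`, `temps_c`, and `is_hot` are hypothetical):

```python
import pandas as pd

# Hypothetical temperatures in Fahrenheit, only for illustration.
temps_f = pd.Series([85, 80, 83], name="temp_f")

# Scalar arithmetic is applied to every element.
temps_c = (temps_f - 32) * 5 / 9

# Comparisons are element-wise too and yield a boolean Series.
is_hot = temps_f > 82

print(temps_c.round(1))
```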


### Question:

Make a new column 'smarty_pants' that returns true if a student's grade is greater than 2.2 times their height. (tough school)

```python
>>> print(students[['grade','height', 'smarty_pants']])
          grade  height  smarty_pants
Tomas        45      34         False
Angel        56      35         False
Stacy        78      36         False
Michaela     89      37          True
Haden        90      38          True
```

In [91]:
# write here:
g = students['grade']
h = students['height']
students['smarty_pants'] = (g > (2.2 * h))
students

Unnamed: 0,grade,height,was_late,top_crush,above_75_percent,taller_than_35,pickup_time,smarty_pants
Tomas,45,34,True,Angel,False,False,4.0,False
Angel,56,35,False,Haden,False,False,5.0,False
Stacy,78,36,True,Michaela,True,True,2.0,False
Michaela,89,37,False,Tomas,True,True,,True
Haden,90,38,False,Edgar A. Poe,True,True,,True


<details><summary>Solution
</summary>

```python
students['smarty_pants'] = students['grade']>students['height']*2.2
print(students[['grade','height', 'smarty_pants']])
```

</details>

In [92]:
wine_quality_white.head(2)


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6


### Question:

Make a new column 'percent fixed acidity' that finds how much of the total acidity is fixed in wine_quality_white.

```python
>>> print(wine_quality_white[['fixed acidity', 'volatile acidity', 'citric acid', 'percent fixed acidity']])
      fixed acidity  volatile acidity  citric acid  percent fixed acidity
0               7.0              0.27         0.36               0.917431
1               6.3              0.30         0.34               0.907781
2               8.1              0.28         0.40               0.922551
3               7.2              0.23         0.32               0.929032
4               7.2              0.23         0.32               0.929032
...             ...               ...          ...                    ...
4893            6.2              0.21         0.29               0.925373
4894            6.6              0.32         0.36               0.906593
4895            6.5              0.24         0.19               0.937951
4896            5.5              0.29         0.30               0.903120
4897            6.0              0.21         0.38               0.910470
```

In [93]:
# write here:
wqw = wine_quality_white
wine_quality_white['percent fixed acidity'] = (wqw['fixed acidity'] / (wqw['fixed acidity'] + wqw['volatile acidity'] + wqw['citric acid']))

wine_quality_white[['fixed acidity', 'volatile acidity', 'citric acid', 'percent fixed acidity']]


Unnamed: 0,fixed acidity,volatile acidity,citric acid,percent fixed acidity
0,7.0,0.27,0.36,0.917431
1,6.3,0.30,0.34,0.907781
2,8.1,0.28,0.40,0.922551
3,7.2,0.23,0.32,0.929032
4,7.2,0.23,0.32,0.929032
...,...,...,...,...
4893,6.2,0.21,0.29,0.925373
4894,6.6,0.32,0.36,0.906593
4895,6.5,0.24,0.19,0.937951
4896,5.5,0.29,0.30,0.903120


<details><summary>Solution
</summary>

```python
# The outer parentheses let the expression continue across lines.
wine_quality_white['percent fixed acidity'] = (
    wine_quality_white['fixed acidity'] /
    (wine_quality_white['fixed acidity'] +
     wine_quality_white['volatile acidity'] +
     wine_quality_white['citric acid'])
)
print(wine_quality_white[['fixed acidity', 'volatile acidity', 'citric acid', 'percent fixed acidity']])
```
(you still got it right if you didn't include citric acid)
</details>

## Apply

We can create a new Series by applying functions to an existing Series.

Let's say we want to get the day of the month out of the 'Date' column.

In [94]:
golf_df

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex
0,7/1/14,sunny,85,85,False,Don't Play,170,98.004631
1,7/2/14,sunny,80,90,True,Don't Play,170,84.4744
2,7/3/14,overcast,83,78,False,Play,161,89.669911
3,7/4/14,rain,70,96,False,Play,166,62.847024
4,7/5/14,rain,68,80,False,Play,148,69.089776
5,7/6/14,rain,65,70,True,Don't Play,135,73.668025
6,7/7/14,overcast,64,65,True,Play,129,75.987116
7,7/8/14,sunny,72,95,False,Don't Play,167,66.247396
8,7/9/14,sunny,69,70,False,Play,139,72.843649
9,7/10/14,rain,75,80,False,Play,155,74.557


In [95]:
golf_df['Date'].apply(lambda x: x.split('/')[1])

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
Name: Date, dtype: object

`apply` takes each item from the 'Date' Series and *applies* the function to it. The outputs are arranged in a new Series with the same length and index.

So we can save it like so:

In [96]:
golf_df['day'] = golf_df['Date'].apply(lambda x: x.split('/')[1])
golf_df

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,day
0,7/1/14,sunny,85,85,False,Don't Play,170,98.004631,1
1,7/2/14,sunny,80,90,True,Don't Play,170,84.4744,2
2,7/3/14,overcast,83,78,False,Play,161,89.669911,3
3,7/4/14,rain,70,96,False,Play,166,62.847024,4
4,7/5/14,rain,68,80,False,Play,148,69.089776,5
5,7/6/14,rain,65,70,True,Don't Play,135,73.668025,6
6,7/7/14,overcast,64,65,True,Play,129,75.987116,7
7,7/8/14,sunny,72,95,False,Don't Play,167,66.247396,8
8,7/9/14,sunny,69,70,False,Play,139,72.843649,9
9,7/10/14,rain,75,80,False,Play,155,74.557,10


`lambda` is just a way of making a function. We can also apply a function we've already made.

In [97]:
def get_day(x):
    return x.split('/')[1]

golf_df['Date'].apply(get_day)

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
Name: Date, dtype: object

We can also apply a function to each row of the DataFrame by calling `apply` on the DataFrame itself with `axis=1`; the function then receives one row at a time.

In [98]:
golf_df.apply(lambda x: x['Temperature'] + x['Humidity'], axis=1)

0     170
1     170
2     161
3     166
4     148
5     135
6     129
7     167
8     139
9     155
10    145
11    162
12    156
13    151
dtype: int64

(This method is slower than using plain ol' arithmetic, though.)

In general, `.apply` is useful for mapping complex functions across your data.
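
For simple string work like the day-of-month example, the `.str` accessor gives the same result without a Python-level function call per element. A sketch comparing the two on a few made-up date strings (the `dates` Series is hypothetical):

```python
import pandas as pd

# Hypothetical date strings in the same m/d/yy style as the lesson data.
dates = pd.Series(["7/1/14", "7/2/14", "7/10/14"], name="Date")

# Element by element with apply.
day_apply = dates.apply(lambda x: x.split("/")[1])

# Vectorized string methods: split each string, then take piece 1.
day_str = dates.str.split("/").str[1]

# Both results hold strings; convert when a number is needed.
day_int = day_str.astype(int)

print(day_int)
```

Keeping `apply` for logic that has no built-in equivalent tends to make the code both shorter and faster.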

### Question:

Make a new column 'report_card' that returns 'FAILURE' if a student's grade is below 60%, and the student's grade if the grade is not below 60%. (tough school)

```python
>>> print(students[['grade','height', 'report_card']])
          grade  height report_card
Tomas        45      34     FAILURE
Angel        56      35     FAILURE
Stacy        78      36          78
Michaela     89      37          89
Haden        90      38          90
```

In [99]:
# write here:
students['report_card'] = students['grade'].apply(lambda g: g if g >= 60 else 'FAILURE' )

students

Unnamed: 0,grade,height,was_late,top_crush,above_75_percent,taller_than_35,pickup_time,smarty_pants,report_card
Tomas,45,34,True,Angel,False,False,4.0,False,FAILURE
Angel,56,35,False,Haden,False,False,5.0,False,FAILURE
Stacy,78,36,True,Michaela,True,True,2.0,False,78
Michaela,89,37,False,Tomas,True,True,,True,89
Haden,90,38,False,Edgar A. Poe,True,True,,True,90


<details><summary>Solution
</summary>

```python
students['report_card'] = students['grade'].apply(lambda x: x if x>=60 else 'FAILURE')
print(students[['grade','height', 'report_card']])
```

</details>

### Question:

Make a new column 'even_height' that returns 'Even' if their height is even, and 'Odd' if it isn't.

```python
>>> print(students[['grade','height', 'even_height']])
          grade  height even_height
Tomas        45      34        Even
Angel        56      35         Odd
Stacy        78      36        Even
Michaela     89      37         Odd
Haden        90      38        Even
```

In [100]:
# write here:
students['even_height'] = students['height'].apply(lambda x: 'Even' if x% 2 ==0 else 'Odd')


students[['grade', 'height', 'even_height']]

Unnamed: 0,grade,height,even_height
Tomas,45,34,Even
Angel,56,35,Odd
Stacy,78,36,Even
Michaela,89,37,Odd
Haden,90,38,Even


<details><summary>Solution
</summary>

```python
students['even_height'] = students['height'].apply(lambda x: 'Even' if x % 2 == 0 else 'Odd')
print(students[['grade','height', 'even_height']])

#OR

# height % 2 is 0 or 1, which picks "Even" or "Odd" out of the array.
students['even_height'] = np.array(["Even", "Odd"])[students['height'] % 2]
print(students[['grade','height', 'even_height']])
# this doesn't use apply, but it is a faster method to consider for later (using our NUMPY skills!)
```

</details>

## Aggregating data

There are MANY methods available to us on each dataframe. Some of them are aggregators, just like in numpy.

In [101]:
golf_df.count() # this counts the non-null items in each column

Date           14
Outlook        14
Temperature    14
Humidity       14
Windy          14
Result         14
TempHumid      14
HeatIndex      14
day            14
dtype: int64

In [102]:
golf_df.count(axis=1)

0     9
1     9
2     9
3     9
4     9
5     9
6     9
7     9
8     9
9     9
10    9
11    9
12    9
13    9
dtype: int64

In [103]:
golf_df.mean()

Temperature    7.357143e+01
Humidity       8.028571e+01
Windy          4.285714e-01
TempHumid      1.538571e+02
HeatIndex      7.616314e+01
day            8.818342e+16
dtype: float64

In [104]:
golf_df.sum()

Date           7/1/147/2/147/3/147/4/147/5/147/6/147/7/147/8/...
Outlook        sunnysunnyovercastrainrainrainovercastsunnysun...
Temperature                                                 1030
Humidity                                                    1124
Windy                                                          6
Result         Don't PlayDon't PlayPlayPlayPlayDon't PlayPlay...
TempHumid                                                   2154
HeatIndex                                            1066.283922
day                                          1234567891011121314
dtype: object

In [105]:
golf_df.aggregate(min)

Date               7/1/14
Outlook          overcast
Temperature            64
Humidity               65
Windy               False
Result         Don't Play
TempHumid             129
HeatIndex       62.847024
day                     1
dtype: object

But what if we want to figure out the minimum temperature on overcast vs rainy vs sunny days?

We use a groupby statement! Like in SQL!

In [106]:
golf_df.groupby('Outlook').aggregate(min)

Unnamed: 0_level_0,Date,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,day
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
overcast,7/12/14,64,65,False,Play,129,68.106944,12
rain,7/10/14,65,70,False,Don't Play,135,62.847024,10
sunny,7/1/14,69,70,False,Don't Play,139,66.247396,1


This took the minimum of *every* other column within each __Outlook__ group. And notice that __Outlook__ is now the index!

Let's just look at what we care about, the temperature.

In [107]:
golf_df.groupby('Outlook').aggregate(min)[['Temperature']]

Unnamed: 0_level_0,Temperature
Outlook,Unnamed: 1_level_1
overcast,64
rain,65
sunny,69

### Question:

Return the average weather data for each Outlook.

```python
          Temperature  Humidity  Windy  TempHumid  HeatIndex
Outlook                                                     
overcast         75.0      77.0    0.5      152.0  79.571853
rain             69.8      81.2    0.4      151.0  70.129762
sunny            76.2      82.0    0.4      158.2  79.469540
```

In [251]:
# write here:
golf_df.groupby('Outlook').mean()


Unnamed: 0_level_0,Temperature,Humidity,Windy,HeatIndex
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
overcast,75.0,77.0,0.5,79.571853
rain,69.8,81.2,0.4,70.129762
sunny,76.2,82.0,0.4,79.46954


<details><summary>Solution
</summary>

```python
print(golf_df.groupby('Outlook').mean())
```

</details>

### Question:

Return the max temperature on windy and non-windy days.

```python
Windy
False    85
True     80
```

In [253]:
# write here:
golf_df.groupby('Windy').aggregate(max)[['Temperature']]


Unnamed: 0_level_0,Temperature
Windy,Unnamed: 1_level_1
False,85
True,80


<details><summary>Solution
</summary>

```python
print(golf_df.groupby('Windy').max()['Temperature'])
```

</details>

Note that groupby is a big topic; more documentation is at http://pandas.pydata.org/pandas-docs/stable/groupby.html
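
As a small taste of what else groupby offers, the sketch below selects a column before aggregating and computes several aggregations at once. The `toy` frame is made up and only mimics the shape of the golf data.

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Outlook": ["sunny", "rain", "rain", "sunny", "overcast"],
    "Temperature": [85, 70, 68, 72, 64],
    "Humidity": [85, 96, 80, 95, 65],
})

# Select the column first, then aggregate: only Temperature is computed.
min_temp = toy.groupby("Outlook")["Temperature"].min()

# Several aggregations at once, named by string.
summary = toy.groupby("Outlook")["Temperature"].agg(["min", "max", "mean"])

print(summary)
```

Selecting the column before aggregating avoids computing the minimum of columns that are then thrown away.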



# Some Extra, Useful Stuff

## Various Summaries

The `info` method is useful for checking column types and quickly seeing if you have NaN in the data.

In [254]:
golf_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14 entries, 0 to 13
Data columns (total 7 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Date         14 non-null     object 
 1   Outlook      14 non-null     object 
 2   Temperature  14 non-null     int64  
 3   Humidity     14 non-null     int64  
 4   Windy        14 non-null     bool   
 5   Result       14 non-null     object 
 6   HeatIndex    14 non-null     float64
dtypes: bool(1), float64(1), int64(2), object(3)
memory usage: 814.0+ bytes


The `describe` method will give you a quick sense of the quartiles and distribution.

In [255]:
golf_df.describe()

Unnamed: 0,Temperature,Humidity,HeatIndex
count,14.0,14.0,14.0
mean,73.571429,80.285714,76.163137
std,6.571667,9.840486,9.77144
min,64.0,65.0,62.847024
25%,69.25,71.25,69.439078
50%,72.0,80.0,74.112513
75%,78.75,88.75,82.352579
max,85.0,96.0,98.004631
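
Because `describe` returns an ordinary `DataFrame`, its rows can be indexed by label instead of read by eye. A minimal sketch on a made-up frame (the names `toy`, `widest`, `lowest_q3`, and `int_cols` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [10.0, 50.0, 20.0, 90.0],
    "n": [0, 1, 2, 3],
})

stats = toy.describe()

# Rows of the summary are labeled 'std', '75%', and so on.
widest = stats.loc["std"].idxmax()       # column with the largest std
lowest_q3 = stats.loc["75%"].idxmin()    # column with the smallest 75th percentile

# Integer columns, found by dtype rather than by reading info().
int_cols = toy.select_dtypes(include=np.integer).columns.tolist()

# Total memory in bytes, including the index.
total_bytes = toy.memory_usage(deep=True).sum()
```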


### Question:

Which column in wine_quality_white has the biggest standard deviation? 

In [257]:
# write here:
wine_quality_white.describe()
#total sulfur dioxide

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,percent fixed acidity
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909,0.917699
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639,0.017822
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0,0.799136
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0,0.908357
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0,0.920635
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0,0.930005
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0,0.960163


<details><summary>Solution
</summary>

```python
wine_quality_white.describe()

# then look visually to see it's 'total sulfur dioxide'
```

</details>

### Question:

How much memory does wine_quality_white take in pandas?


In [366]:
# write here:

wine_quality_white.info()
# 497.6 KB

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   fixed acidity          4898 non-null   float64
 1   volatile acidity       4898 non-null   float64
 2   citric acid            4898 non-null   float64
 3   residual sugar         4898 non-null   float64
 4   chlorides              4898 non-null   float64
 5   free sulfur dioxide    4898 non-null   float64
 6   total sulfur dioxide   4898 non-null   float64
 7   density                4898 non-null   float64
 8   pH                     4898 non-null   float64
 9   sulphates              4898 non-null   float64
 10  alcohol                4898 non-null   float64
 11  quality                4898 non-null   int64  
 12  percent fixed acidity  4898 non-null   float64
dtypes: float64(12), int64(1)
memory usage: 497.6 KB


<details><summary>Solution
</summary>

```python
wine_quality_white.info()

# then look at the verrrry bottom. Answers may vary. Mine says 459.3 KB.
```

</details>

### Question:

Which columns are integers in wine_quality_white?

In [367]:
# write here:
# quality (the only int64 column)


<details><summary>Solution
</summary>

```python
wine_quality_white.info()

# then look at the Dtype column and you'll see that quality is the only int64 column
```

</details>

### Question:

Which column has the lowest 75th percentile? 

(The 75th percentile is the datapoint that is higher than 75% of the data in that group)

In [260]:
# write here:
wine_quality_white.describe()

# chlorides


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,percent fixed acidity
count,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0,4898.0
mean,6.854788,0.278241,0.334192,6.391415,0.045772,35.308085,138.360657,0.994027,3.188267,0.489847,10.514267,5.877909,0.917699
std,0.843868,0.100795,0.12102,5.072058,0.021848,17.007137,42.498065,0.002991,0.151001,0.114126,1.230621,0.885639,0.017822
min,3.8,0.08,0.0,0.6,0.009,2.0,9.0,0.98711,2.72,0.22,8.0,3.0,0.799136
25%,6.3,0.21,0.27,1.7,0.036,23.0,108.0,0.991723,3.09,0.41,9.5,5.0,0.908357
50%,6.8,0.26,0.32,5.2,0.043,34.0,134.0,0.99374,3.18,0.47,10.4,6.0,0.920635
75%,7.3,0.32,0.39,9.9,0.05,46.0,167.0,0.9961,3.28,0.55,11.4,6.0,0.930005
max,14.2,1.1,1.66,65.8,0.346,289.0,440.0,1.03898,3.82,1.08,14.2,9.0,0.960163


<details><summary>Solution
</summary>

```python
wine_quality_white.describe()

# then look at the 75% row. Chlorides have the lowest value.
```

</details>

## Frequency Tables

The `crosstab` function will allow us to quickly take a look at the frequency count between two columns.

In [368]:
pd.crosstab(golf_df['Outlook'], golf_df['Result'])

Result,Don't Play,Play
Outlook,Unnamed: 1_level_1,Unnamed: 2_level_1
overcast,0,4
rain,2,3
sunny,3,2
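
`crosstab` can also add totals or report proportions instead of counts, through its `margins` and `normalize` parameters. A sketch on made-up data (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical toy data, only for illustration.
toy = pd.DataFrame({
    "Outlook": ["sunny", "sunny", "rain", "rain", "overcast"],
    "Result": ["Play", "Don't Play", "Play", "Play", "Play"],
})

# Raw counts, with row and column totals added under the label 'All'.
counts = pd.crosstab(toy["Outlook"], toy["Result"], margins=True)

# Share of each Result within each Outlook (every row sums to 1).
shares = pd.crosstab(toy["Outlook"], toy["Result"], normalize="index")

print(shares)
```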


## DateTimes

We can turn strings of dates into datetime types by using Pandas' `to_datetime` function.

In [262]:
golf_df['DateTime'] = pd.to_datetime(golf_df['Date'])
golf_df['DateTime']

0    2014-07-01
1    2014-07-02
2    2014-07-03
3    2014-07-04
4    2014-07-05
5    2014-07-06
6    2014-07-07
7    2014-07-08
8    2014-07-09
9    2014-07-10
10   2014-07-11
11   2014-07-12
12   2014-07-13
13   2014-07-14
Name: DateTime, dtype: datetime64[ns]

Though `day` and `dayofweek` look similar, they are NOT the same: `day` is the day of the month, while `dayofweek` numbers the days of the week from 0 (Monday) to 6 (Sunday).

In [263]:
golf_df['DateTime'].dt.day

0      1
1      2
2      3
3      4
4      5
5      6
6      7
7      8
8      9
9     10
10    11
11    12
12    13
13    14
Name: DateTime, dtype: int64

In [264]:
golf_df['DateTime'].dt.dayofweek

0     1
1     2
2     3
3     4
4     5
5     6
6     0
7     1
8     2
9     3
10    4
11    5
12    6
13    0
Name: DateTime, dtype: int64

In [265]:
golf_df['DateTime'].dt.month

0     7
1     7
2     7
3     7
4     7
5     7
6     7
7     7
8     7
9     7
10    7
11    7
12    7
13    7
Name: DateTime, dtype: int64

In [266]:
golf_df['DateTime'][0] - golf_df['DateTime'][5]

Timedelta('-5 days +00:00:00')

In [127]:
golf_df['DateTime'] - golf_df['DateTime']

0    0 days
1    0 days
2    0 days
3    0 days
4    0 days
5    0 days
6    0 days
7    0 days
8    0 days
9    0 days
10   0 days
11   0 days
12   0 days
13   0 days
Name: DateTime, dtype: timedelta64[ns]

There is a HOST of stuff you can do with datetimes. 
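
A few of those operations, sketched on made-up date strings (the `dates` Series is hypothetical). Passing an explicit `format` is an optional choice here; it removes any guessing about whether `7/1` means July 1 or January 7.

```python
import pandas as pd

# Hypothetical date strings in the same m/d/yy style as the lesson data.
dates = pd.Series(["7/1/14", "7/6/14", "7/7/14"])

# Month, then day, then two-digit year.
parsed = pd.to_datetime(dates, format="%m/%d/%y")

day_of_month = parsed.dt.day         # day within the month
day_of_week = parsed.dt.dayofweek    # Monday=0 ... Sunday=6
names = parsed.dt.day_name()         # weekday names as strings

# Subtracting two datetimes gives a Timedelta.
gap = parsed.iloc[2] - parsed.iloc[0]

print(names)
```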

### Question:

Return the latest day of the week on which we played, and the latest on which we didn't play.
(6 means Sunday, the end of the week)

```python
Result
Don't Play    6
Play          6
Name: dayofweek, dtype: int64
```

In [269]:
# write here:
golf_df['Datetime'] = pd.to_datetime(golf_df['Date'])

golf_df['day_of_week']=golf_df['Datetime'].dt.dayofweek

golf_df.groupby('Result').max()['day_of_week']

Result
Don't Play    6
Play          6
Name: day_of_week, dtype: int64

<details><summary>Solution
</summary>

```python
golf_df['Datetime'] = pd.to_datetime(golf_df['Date'])
golf_df["dayofweek"] = golf_df['Datetime'].dt.dayofweek
print(golf_df.groupby('Result').max()["dayofweek"])
```

</details>

## Creating a New Row Index

We can also set the index to be one or more existing columns.

In [271]:
date_df = golf_df.set_index('DateTime')
date_df

Unnamed: 0_level_0,Date,Outlook,Temperature,Humidity,Windy,Result,HeatIndex,Datetime,day_of_week
DateTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
2014-07-01,7/1/14,sunny,85,85,False,Don't Play,98.004631,2014-07-01,1
2014-07-02,7/2/14,sunny,80,90,True,Don't Play,84.4744,2014-07-02,2
2014-07-03,7/3/14,overcast,83,78,False,Play,89.669911,2014-07-03,3
2014-07-04,7/4/14,rain,70,96,False,Play,62.847024,2014-07-04,4
2014-07-05,7/5/14,rain,68,80,False,Play,69.089776,2014-07-05,5
2014-07-06,7/6/14,rain,65,70,True,Don't Play,73.668025,2014-07-06,6
2014-07-07,7/7/14,overcast,64,65,True,Play,75.987116,2014-07-07,0
2014-07-08,7/8/14,sunny,72,95,False,Don't Play,66.247396,2014-07-08,1
2014-07-09,7/9/14,sunny,69,70,False,Play,72.843649,2014-07-09,2
2014-07-10,7/10/14,rain,75,80,False,Play,74.557,2014-07-10,3


In [272]:
date_df.index

DatetimeIndex(['2014-07-01', '2014-07-02', '2014-07-03', '2014-07-04',
               '2014-07-05', '2014-07-06', '2014-07-07', '2014-07-08',
               '2014-07-09', '2014-07-10', '2014-07-11', '2014-07-12',
               '2014-07-13', '2014-07-14'],
              dtype='datetime64[ns]', name='DateTime', freq=None)

If we have an index of datetime types, we can use the `resample` method to quickly look at time based aggregations.

In [273]:
# Weekly means.
date_df.resample('W').mean()

Unnamed: 0_level_0,Temperature,Humidity,Windy,HeatIndex,day_of_week
DateTime,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
2014-07-06,75.166667,83.166667,0.333333,79.625628,3.5
2014-07-13,72.571429,77.857143,0.428571,74.006167,3.0
2014-07-20,71.0,80.0,1.0,70.486984,0.0


This will be especially useful when we work with time series.
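
To see how the weekly bins are formed, here is a sketch on made-up daily readings (the `readings` frame is hypothetical). With the `'W'` rule, each bin ends on a Sunday and is labeled with that end date, which is why the last label can fall after the final observation.

```python
import pandas as pd

# Hypothetical daily readings for two weeks, indexed by date.
idx = pd.date_range("2014-07-01", periods=14, freq="D")
readings = pd.DataFrame({"temp": range(14)}, index=idx)

# Weekly means: bins end on Sunday and carry the bin's end date as label.
weekly = readings.resample("W").mean()

# Other aggregations work the same way.
weekly_max = readings.resample("W")["temp"].max()

print(weekly)
```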

## Writing Data

We can write data into a CSV file.

```python 
>>> golf_df.to_csv('new_playgolf.csv', index=False)

>>> !cat new_playgolf.csv 

Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,day,DateTime
7/1/14,sunny,85,85,False,Don't Play,170,98.0046312500001,1,2014-07-01
7/2/14,sunny,80,90,True,Don't Play,170,84.47439999999996,2,2014-07-02
7/3/14,overcast,83,78,False,Play,161,89.66991075999985,3,2014-07-03
7/4/14,rain,70,96,False,Play,166,62.84702400000019,4,2014-07-04
7/5/14,rain,68,80,False,Play,148,69.08977600000006,5,2014-07-05
7/6/14,rain,65,70,True,Don't Play,135,73.66802499999999,6,2014-07-06
7/7/14,overcast,64,65,True,Play,129,75.9871160000001,7,2014-07-07
7/8/14,sunny,72,95,False,Don't Play,167,66.2473960000001,8,2014-07-08
7/9/14,sunny,69,70,False,Play,139,72.84364900000003,9,2014-07-09
7/10/14,rain,75,80,False,Play,155,74.55700000000012,10,2014-07-10
7/11/14,sunny,75,70,True,Play,145,75.777625,11,2014-07-11
7/12/14,overcast,72,90,True,Play,162,68.10694399999996,12,2014-07-12
7/13/14,overcast,81,75,False,Play,156,84.52344124999973,13,2014-07-13
7/14/14,rain,71,80,True,Don't Play,151,70.48698400000002,14,2014-07-14
```
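
One thing to watch for when reading a written file back in: CSV is plain text, so a datetime column does not come back as a datetime unless we ask for it. The sketch below uses a tiny made-up frame and an in-memory buffer (`io.StringIO`) in place of a real file, purely so the example runs anywhere.

```python
import io
import pandas as pd

# Hypothetical two-row frame with a real datetime column.
df = pd.DataFrame({
    'Outlook': ['sunny', 'rain'],
    'Temperature': [85, 70],
    'DateTime': pd.to_datetime(['2014-07-01', '2014-07-04']),
})

# Write to an in-memory text buffer instead of a file on disk.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Reading back without hints leaves DateTime as text.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates converts the named columns while reading.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['DateTime'])

print(plain.dtypes)
print(parsed.dtypes)
```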

## Exploratory Data Analysis

### Question:

Now let's put these skills together in the way we would typically explore data. This is called EDA: Exploratory Data Analysis.

Open chess_games.csv and answer these questions (use markdown headers to denote which question you are answering) 

1. How many rows of data do you have?
1. What does each row represent?
1. Who won more, white or black?
1. How many moves per game on average?
1. What was the most likely 'first move'?
1. How many games end in checkmate? (the alternative is to surrender or timeout)
1. What percent of games is that?
1. How long was the average game?
1. How long was the average game that white won? How long was the average game that was a draw?
1. What is an increment code?
1. Ask three of your own questions *in writing*. Then answer them. (we write our questions because if you come back later you'll forget what question you were answering)


write your question after the '####':

#### 1: [YOUR QUESTION TITLE HERE]

In [275]:
#  how many rows of data do you have?
''' rows are 20058'''
# write code here:
chess = pd.read_csv('data/chess_games.csv')
chess.count()



id                20058
rated             20058
created_at        20058
last_move_at      20058
turns             20058
victory_status    20058
winner            20058
increment_code    20058
white_id          20058
white_rating      20058
black_id          20058
black_rating      20058
moves             20058
opening_eco       20058
opening_name      20058
opening_ply       20058
dtype: int64

<details><summary>Solution 1
</summary>

```python
chess = pd.read_csv('data/chess_games.csv')
chess.count()
```

</details>
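
A side note on counting rows: `count()` reports the number of non-null values in each column, which only equals the row count when nothing is missing. The sketch below uses a small made-up frame (not the chess data) with one missing value to show the difference from `len` and `shape`.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one missing value in 'turns'.
games = pd.DataFrame({
    'id': ['a', 'b', 'c'],
    'turns': [13, np.nan, 61],
})

n_rows = len(games)       # number of rows, regardless of missing values
shape = games.shape       # (rows, columns)
non_null = games.count()  # non-null values per column

print(n_rows, shape)
print(non_null)
```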

write your question after the '####':

#### 

In [277]:
# What does each row represent?
# write code here:
chess.T


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,20048,20049,20050,20051,20052,20053,20054,20055,20056,20057
id,TZJHLljE,l1NXvwaE,mIICvQHh,kWKvrqYL,9tXo1AUZ,MsoDV9wj,qwU9rasv,RVN0N3VK,dwF3DJHO,afoMwnLg,...,dnexZDsv,7IENcPg3,nYOvevdh,uMzb0TPC,EopEqqAa,EfqH7VVH,WSJDhbPl,yrAas0Kj,b0v4tRyF,N8G2JHGG
rated,False,True,True,True,True,False,True,False,True,True,...,True,True,True,True,True,True,True,True,True,True
created_at,1504210000000.0,1504130000000.0,1504130000000.0,1504110000000.0,1504030000000.0,1504240000000.0,1504230000000.0,1503680000000.0,1503510000000.0,1503440000000.0,...,1499869487435.0,1499814832472.0,1499814002224.0,1499812466451.0,1499811847779.0,1499790914342.0,1499698089760.0,1499697877493.0,1499696127019.0,1499643152649.0
last_move_at,1504210000000.0,1504130000000.0,1504130000000.0,1504110000000.0,1504030000000.0,1504240000000.0,1504230000000.0,1503680000000.0,1503510000000.0,1503440000000.0,...,1499869745122.0,1499815248818.0,1499814052544.0,1499813212945.0,1499812436546.0,1499791236076.0,1499698833979.0,1499698050327.0,1499697073718.0,1499643889348.0
turns,13,16,61,61,95,5,33,9,66,119,...,25,43,9,58,37,24,82,35,109,78
victory_status,outoftime,resign,mate,mate,mate,draw,resign,resign,resign,mate,...,resign,mate,outoftime,mate,resign,resign,mate,mate,resign,mate
winner,white,black,white,white,white,draw,white,black,black,white,...,white,white,white,black,white,white,black,white,white,black
increment_code,15+2,5+10,5+10,20+0,30+3,10+0,10+0,15+30,15+0,10+0,...,10+10,10+0,10+0,10+10,10+10,10+10,10+0,10+0,10+0,10+0
white_id,bourgris,a-00,ischia,daniamurashov,nik221107,trelynn17,capa_jr,daniel_likes_chess,ehabfanri,daniel_likes_chess,...,mateuslichess,jkubb29,jamboger,samael88,jamboger,belcolt,jamboger,jamboger,marcodisogno,jamboger
white_rating,1500,1322,1496,1439,1523,1250,1520,1413,1439,1381,...,1252,1328,1243,1237,1219,1691,1233,1219,1360,1235


<details><summary>Solution 2
</summary>

```chess['moves']``` shows that we have multiple moves per row. So probably games? Ah, and there's a winner in ```chess['winner']```. Games.
    
HINT: ```chess.T``` switches rows and columns. (it stands for transpose). Using it allows us to see more of the columns in one look.
   
```python

```

</details>

write your question after the '####':

#### 

In [134]:
# Who won more, white or black?
# write code here:



<details><summary>Solution 3
</summary>

   
```python
chess['winner'].value_counts()

```

</details>
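
If proportions are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch on a made-up `winner` column (the values below are hypothetical):

```python
import pandas as pd

# Hypothetical outcomes for six games.
winner = pd.Series(['white', 'black', 'white', 'draw', 'white', 'black'],
                   name='winner')

counts = winner.value_counts()                # raw counts, largest first
shares = winner.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```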

write your question after the '####':

####

In [135]:
# How many moves per game on average?
# write code here:



<details><summary>Solution 4
</summary>

Each entry in ```chess['moves']``` is one game's moves stored as a single space-separated string, so splitting on spaces and taking the length gives the number of moves in that game.
    
```python

>>> chess.moves.apply(lambda x: x.split(' ')) # shows the items per row in a list

0        [d4, d5, c4, c6, cxd5, e6, dxe6, fxe6, Nf3, Bb...
1        [d4, Nc6, e4, e5, f4, f6, dxe5, fxe5, fxe5, Nx...
2        [e4, e5, d3, d6, Be3, c6, Be2, b5, Nd2, a5, a4...
3        [d4, d5, Nf3, Bf5, Nc3, Nf6, Bf4, Ng4, e3, Nc6...
4        [e4, e5, Nf3, d6, d4, Nc6, d5, Nb4, a3, Na6, N...
                               ...                        

>>> chess.moves.apply(lambda x: len(x.split(' '))) # shows the length per list

0         13
1         16
2         61
3         61
4         95

>>> chess.moves.apply(lambda x: len(x.split(' '))).mean() # take the mean
60.46599860404826
```

</details>
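
The same count can be written with the vectorized `.str` accessor instead of `apply` with a lambda. This is a sketch on a made-up `moves` column, not the real file:

```python
import pandas as pd

# Hypothetical games, each stored as a space-separated string of moves.
moves = pd.Series(['d4 d5 c4', 'e4 e5', 'e4 c5 Nf3 d6'], name='moves')

# .str.split() splits each string on whitespace; .str.len() counts the pieces.
n_moves = moves.str.split().str.len()
avg_moves = n_moves.mean()

print(n_moves.tolist(), avg_moves)
```

Note that the first five lengths in the solution above (13, 16, 61, 61, 95) match the `turns` column shown in the transposed output earlier, so `chess['turns'].mean()` is a reasonable cross-check.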

write your question after the '####':

#### 

In [136]:
# What is the most common opening move?
# write code here:



<details><summary>Solution 5
</summary>

```python
chess.moves.apply(lambda x: x.split(' ')[0]).value_counts()
```

</details>
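
To pull out just the single most common first move rather than the whole table of counts, `idxmax` on the counts works. A sketch on made-up data:

```python
import pandas as pd

# Hypothetical games; only the first token of each string matters here.
moves = pd.Series(['e4 e5', 'd4 d5', 'e4 c5', 'Nf3 d5'], name='moves')

# .str[0] takes the first element of each split list.
first_move = moves.str.split().str[0]
most_common = first_move.value_counts().idxmax()

print(most_common)
```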

write your question after the '####':

#### 

In [137]:
# How many games end in checkmate?
# write code here:



<details><summary>Solution 6
</summary>


```python
>>> chess.moves.apply(lambda x: x.split(' ')[-1]).value_counts()

Qxf7#    176
Qg7#     170
Qg2#     157
Qxg2#    141
Qxg7#    137
        ... 
Qhh4#      1
Bxg6#      1
Rxa2+      1
Qac8#      1
Ngxe3      1

# Hmm. I had to look up what '#' means: it means checkmate. So now I grab the last character of each final move.

>>> chess.moves.apply(lambda x: x.split(' ')[-1][-1]).value_counts()

#    6325
+    2764
5    1890
4    1806
6    1599
3    1466
7    1185
2    1101
1     879
8     844
Q     130
O      62
R       4
N       2
B       1

```

</details>

In [138]:
# What percent of games end in checkmate?
# write code here:



<details><summary>Solution 7
</summary>


```python
>>> chess.moves.apply(lambda x: x.split(' ')[-1][-1]).value_counts()/len(chess)

#    0.315336
+    0.137800
5    0.094227
4    0.090039
6    0.079719
3    0.073088
7    0.059079
2    0.054891
1    0.043823
8    0.042078
Q    0.006481
O    0.003091
R    0.000199
N    0.000100
B    0.000050
Name: moves, dtype: float64
```
> About 32% of games ended with checkmate, and about 14% ended on a move that gave an ordinary 'check'.
</details>
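
A more direct route to both the count and the percentage is a boolean column: `True` counts as 1, so `sum()` gives the number of checkmates and `mean()` gives the fraction. The sketch below uses made-up games and assumes, as in the solution above, that a trailing `#` marks checkmate.

```python
import pandas as pd

# Hypothetical games: two end in checkmate ('#'), two do not.
moves = pd.Series([
    'e4 e5 Qh5 Nc6 Bc4 Nf6 Qxf7#',
    'd4 d5 c4',
    'e4 e5 Bb5+',
    'f3 e5 g4 Qh4#',
], name='moves')

is_mate = moves.str.endswith('#')  # boolean Series, one value per game
n_mate = is_mate.sum()             # number of games ending in checkmate
share_mate = is_mate.mean()        # fraction of games ending in checkmate

print(n_mate, share_mate)
```

On the real data, comparing this against `chess['victory_status'] == 'mate'` would be a sensible sanity check.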

In [139]:
# How long was the average game?
# write code here:



<details><summary>Solution 8
</summary>

```python
>>> chess['last_move_at'] - chess['created_at']

0             0.0
1             0.0
2             0.0
3             0.0
4             0.0
           ...   
20053    321734.0
20054    744219.0
20055    172834.0
20056    946699.0
20057    736699.0
Length: 20058, dtype: float64

# Hmm. This gives a bunch of... nanoseconds? Milliseconds?

# I'd better google 'how to go from milliseconds to pandas datetime'.
# I click the Stack Overflow result, the first suggestion (and always a good bet).
# The first answer says pd.to_datetime(df['UNIXTIME'], unit='ms').
# Let's try that.


>>> pd.to_datetime(chess['created_at'], unit='ms')

0       2017-08-31 20:06:40.000000000
1       2017-08-30 21:53:20.000000000
2       2017-08-30 21:53:20.000000000
3       2017-08-30 16:20:00.000000000
4       2017-08-29 18:06:40.000000000
                     ...             
20053   2017-07-11 16:35:14.342000128
20054   2017-07-10 14:48:09.760000000
20055   2017-07-10 14:44:37.492999936
20056   2017-07-10 14:15:27.019000064
20057   2017-07-09 23:32:32.648999936
Name: created_at, Length: 20058, dtype: datetime64[ns]

# That looks reasonable.

>>> chess['datetime_last_move'] = pd.to_datetime(chess['last_move_at'], unit='ms')
>>> chess['datetime_created_at'] = pd.to_datetime(chess['created_at'], unit='ms')
>>> chess['datetime_last_move'] - chess['datetime_created_at']

0              00:00:00
1              00:00:00
2              00:00:00
3              00:00:00
4              00:00:00
              ...      
20053   00:05:21.734000
20054   00:12:24.219000
20055   00:02:52.834000
20056   00:15:46.699000
20057   00:12:16.699000
Length: 20058, dtype: timedelta64[ns]


>>> (chess['datetime_last_move'] - chess['datetime_created_at']).mean()

Timedelta('0 days 00:14:29.707049')

# looks like the average game is '14 minutes, 30 seconds'

```

</details>
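
Since we only need the length of each game and not the wall-clock timestamps, a shorter path is to subtract the raw millisecond columns and convert the difference straight to a timedelta. A sketch on made-up timestamps (the column names mirror the chess data, the values are hypothetical):

```python
import pandas as pd

# Hypothetical start and end times in milliseconds.
times = pd.DataFrame({
    'created_at':   [0.0, 1000.0, 60000.0],
    'last_move_at': [0.0, 61000.0, 180000.0],
})

# Differences are durations, so pd.to_timedelta is the matching converter.
duration = pd.to_timedelta(times['last_move_at'] - times['created_at'], unit='ms')
avg_duration = duration.mean()

print(duration)
print(avg_duration)
```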

In [140]:
# How long was the average game that white won? How long was the average game that was a draw?
# write code here:



<details><summary>Solution 9
</summary>


```python
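# This one was hard for me because a plain grouped mean only covers the
# numeric columns, so it doesn't return our duration column. Bleh!
chess.groupby('winner').mean(numeric_only=True)

# Filtering to one outcome works, with the same limitation.
chess[chess['winner'] == 'white'].mean(numeric_only=True)

# Better: store the duration in milliseconds, then group on it.
# A duration is a timedelta, not a datetime, so convert with pd.to_timedelta.
chess['game_ms'] = chess['last_move_at'] - chess['created_at']
pd.to_timedelta(chess.groupby('winner')['game_ms'].mean(), unit='ms')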
```
    
</details>
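
When comparing group averages it helps to see how many rows sit behind each one. The sketch below, on made-up data with a hypothetical `game_ms` column, asks `groupby` for the mean and the count together.

```python
import pandas as pd

# Hypothetical game durations in milliseconds.
games = pd.DataFrame({
    'winner':  ['white', 'white', 'black', 'draw'],
    'game_ms': [60000.0, 180000.0, 30000.0, 90000.0],
})

# Select the column first, then ask for several aggregates at once.
summary = games.groupby('winner')['game_ms'].agg(['mean', 'count'])

# Convert the mean from milliseconds to a readable duration.
summary['mean'] = pd.to_timedelta(summary['mean'], unit='ms')

print(summary)
```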

In [141]:
# write code here:



In [142]:
# What is the increment code?

<details><summary> Solution 10
</summary>

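Google 'increment code chess':

"increment (in seconds) is the amount added after each move."

So a code such as `15+2` reads as a base time of 15 minutes per player, plus 2 seconds added after each move.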

</details>
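
Once we know what the code means, it can be split into two numeric columns. This is only a sketch: it assumes every code has the form `base+increment`, and the output column names `base_minutes` and `increment_seconds` are made up for illustration.

```python
import pandas as pd

# A few codes in the same format as the increment_code column.
codes = pd.Series(['15+2', '5+10', '10+0'], name='increment_code')

# expand=True returns one column per piece; astype(int) makes them numeric.
parts = codes.str.split('+', expand=True).astype(int)
parts.columns = ['base_minutes', 'increment_seconds']  # hypothetical names

print(parts)
```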

Now do your own! Make them headers, in markdown, in writing!

### Question:

Now let's put these skills together AGAIN in the way we would typically explore data. This is called EDA. Exploratory Data Analysis.

Open exo.csv, skim https://exoplanetarchive.ipac.caltech.edu/docs/API_kepcandidate_columns.html and answer these questions.


This time write your own markdown headers before you answer each question!

1. How many rows of data do you have?
1. What does each row represent?
1. How many planets are confirmed?
1. What is the maximum temperature of the planets?
1. What is the maximum temperature of the stars?
1. What is the distance between the planet and the star?
1. How many planets are in each system?
1. How much do the planet temperatures vary?
1. How much do the star temperatures vary?
1. How many NaNs are in each category?
1. Ask three of your own questions *in writing*. Then answer them. (we write our questions because if you come back later you'll forget what question you were answering)


No solutions provided. You got this.

## Joining DataFrames

We can join DataFrames in much the same way that we join tables in SQL.  In fact, left, right, outer, and inner joins work the same way here.

Let's create a fake DataFrame to join with first.

In [143]:
mood_df = pd.DataFrame([['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
                       columns=['Weather', 'Mood'])

mood_df

Unnamed: 0,Weather,Mood
0,overcast,sad
1,rainy,sad
2,sunny,happy


We can do joins using the `merge` method.

In [144]:
golf_df.merge(mood_df, how='inner', left_on='Outlook', right_on='Weather')

Unnamed: 0,Date,Outlook,Temperature,Humidity,Windy,Result,TempHumid,HeatIndex,day,DateTime,Weather,Mood
0,7/1/14,sunny,85,85,False,Don't Play,170,98.004631,1,2014-07-01,sunny,happy
1,7/2/14,sunny,80,90,True,Don't Play,170,84.4744,2,2014-07-02,sunny,happy
2,7/8/14,sunny,72,95,False,Don't Play,167,66.247396,8,2014-07-08,sunny,happy
3,7/9/14,sunny,69,70,False,Play,139,72.843649,9,2014-07-09,sunny,happy
4,7/11/14,sunny,75,70,True,Play,145,75.777625,11,2014-07-11,sunny,happy
5,7/3/14,overcast,83,78,False,Play,161,89.669911,3,2014-07-03,overcast,sad
6,7/7/14,overcast,64,65,True,Play,129,75.987116,7,2014-07-07,overcast,sad
7,7/12/14,overcast,72,90,True,Play,162,68.106944,12,2014-07-12,overcast,sad
8,7/13/14,overcast,81,75,False,Play,156,84.523441,13,2014-07-13,overcast,sad


There are, of course, other options besides `inner`, which you can find in the documentation.
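
Notice that the `rain` rows are missing from the inner join above: `mood_df` spells it `rainy`, so those keys never match. A left join keeps them, and `indicator=True` adds a `_merge` column saying where each row came from. The sketch below rebuilds the situation with a small made-up `weather` frame standing in for `golf_df`.

```python
import pandas as pd

# Hypothetical stand-in for golf_df, plus the same mood table as above.
weather = pd.DataFrame({'Outlook': ['sunny', 'rain', 'overcast', 'rain']})
mood = pd.DataFrame([['overcast', 'sad'], ['rainy', 'sad'], ['sunny', 'happy']],
                    columns=['Weather', 'Mood'])

# Inner join: only rows whose key appears in both frames survive.
inner = weather.merge(mood, how='inner', left_on='Outlook', right_on='Weather')

# Left join: every row of `weather` survives; unmatched rows get NaN for Mood.
left = weather.merge(mood, how='left', left_on='Outlook', right_on='Weather',
                     indicator=True)

print(inner)
print(left)
```

Rows marked `left_only` in `_merge` are a quick way to spot keys that failed to match, which is often a sign of a spelling mismatch like this one.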

## Concatenating dataframes

This is the equivalent of Unions in SQL, but a little more flexible.

In [145]:
df1 = pd.DataFrame(
    {'Col3': range(5), 'Col2': range(5), 'Col1': range(5)},
    index=range(0, 5))
df2 = pd.DataFrame(
    {'Col1': range(5), 'Col2': range(5), 'Col4': range(5)},
    index=range(3, 8))

In [146]:
df1

Unnamed: 0,Col3,Col2,Col1
0,0,0,0
1,1,1,1
2,2,2,2
3,3,3,3
4,4,4,4


In [147]:
df2

Unnamed: 0,Col1,Col2,Col4
3,0,0,0
4,1,1,1
5,2,2,2
6,3,3,3
7,4,4,4


#### Vertically

This is like a Union All. The `sort` parameter controls the order of the columns in the output.

In [148]:
pd.concat([df1, df2], axis=0, join='outer', sort=True)

Unnamed: 0,Col1,Col2,Col3,Col4
0,0,0,0.0,
1,1,1,1.0,
2,2,2,2.0,
3,3,3,3.0,
4,4,4,4.0,
3,0,0,,0.0
4,1,1,,1.0
5,2,2,,2.0
6,3,3,,3.0
7,4,4,,4.0


Passing `join='inner'` keeps only the columns that are present in every input.

In [149]:
pd.concat([df1, df2], axis=0, join='inner', sort=True)

Unnamed: 0,Col1,Col2
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
3,0,0
4,1,1
5,2,2
6,3,3
7,4,4
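
In both outputs above the index labels 3 and 4 appear twice, because `concat` keeps each input's original index. If the labels carry no meaning, `ignore_index=True` renumbers the result. A sketch with two small made-up frames that have overlapping indices:

```python
import pandas as pd

# Two hypothetical frames whose indices overlap at label 2.
df1 = pd.DataFrame({'Col1': range(3), 'Col2': range(3)}, index=range(0, 3))
df2 = pd.DataFrame({'Col1': range(3), 'Col2': range(3)}, index=range(2, 5))

stacked = pd.concat([df1, df2], axis=0)                        # keeps labels
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)  # 0..n-1

print(stacked.index.tolist())
print(renumbered.index.tolist())
```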


#### Horizontally

This is pretty much a simple join on indices.  While `concat` is capable of doing joins, it is far less flexible than `merge`.

In [150]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,Col3,Col2,Col1,Col1.1,Col2.1,Col4
0,0.0,0.0,0.0,,,
1,1.0,1.0,1.0,,,
2,2.0,2.0,2.0,,,
3,3.0,3.0,3.0,0.0,0.0,0.0
4,4.0,4.0,4.0,1.0,1.0,1.0
5,,,,2.0,2.0,2.0
6,,,,3.0,3.0,3.0
7,,,,4.0,4.0,4.0


**Question:** Why do some numbers show up as floats? Why do some not?

For more on joining DataFrames, read https://pandas.pydata.org/pandas-docs/stable/merging.html