In [1]:
import numpy as np
import pandas as pd

# Pandas DataFrames

DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object. Like Series, DataFrame accepts many different kinds of input:

* Dict of 1D ndarrays, lists, dicts, or Series
* 2-D numpy.ndarray
* Structured or record ndarray
* A Series
* Another DataFrame

<img src="img/df1.jpg">

Along with the data, you can optionally pass ``index`` (row labels) and ``columns`` (column labels) arguments. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame. If axis labels are not passed, they will be constructed from the input data based on common sense rules.
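As a minimal sketch of how the ``index`` and ``columns`` arguments fix the labels of the result, the cell below builds a frame from a dict of Series and requests a specific row order and column set. The names `d` and `df_sel` are illustrative only; a requested label that is absent from the input (`'three'` here) should come back filled with NaN.

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

# Ask for rows d, b, a (in that order) and for columns 'two' and 'three'.
# 'one' is dropped because it is not requested; 'three' does not exist
# in the input, so it is created and filled with NaN.
df_sel = pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
print(df_sel)
```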

Here's an example where we have set the Dates column to be the index and label for the rows. 

<img src="img/df2.jpg">

## Creation of DataFrames

### Using Dictionaries

In [11]:
d = {'one' : pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two' : pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}

d

{'one': a    1.0
 b    2.0
 c    3.0
 dtype: float64, 'two': a    1.0
 b    2.0
 c    3.0
 d    4.0
 dtype: float64}

In [12]:
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1.0
b,2.0,2.0
c,3.0,3.0
d,,4.0
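The empty cell for row `d` in column `one` is a NaN: when the Series in the dict have different indexes, the resulting row index is the union of them, and positions with no value are filled with NaN. A small sketch of how one might check this (the name `df_union` is illustrative):

```python
import pandas as pd

d = {'one': pd.Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': pd.Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df_union = pd.DataFrame(d)

# The row index is the union of both Series indexes.
print(df_union.index.tolist())   # ['a', 'b', 'c', 'd']

# Count the missing values per column: 'one' has no entry for 'd'.
print(df_union.isna().sum())
```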


In [13]:
df.loc['a']

one    1.0
two    1.0
Name: a, dtype: float64

### From a Dictionary of ndarrays / lists

The ndarrays must all be the same length. If an index is passed, it must also be the same length as the arrays. If no index is passed, the index will be ``range(n)``, where n is the array length.

In [5]:
d = {'first' : [1., 2., 3., 4.], 'second' : [4., 3., 2., 1.], 'third':np.random.randint(10,20,4)}

In [6]:
df = pd.DataFrame(d)
df

Unnamed: 0,first,second,third
0,1.0,4.0,11
1,2.0,3.0,12
2,3.0,2.0,15
3,4.0,1.0,19
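The length rule can be checked with a short sketch. The variable names below are illustrative, and the exact wording of the error message may vary between pandas versions.

```python
import numpy as np
import pandas as pd

same_len = {'first': [1., 2., 3., 4.], 'second': np.arange(4)}

# No index passed: the rows are labelled 0..n-1.
df_ok = pd.DataFrame(same_len)
print(df_ok.index.tolist())   # [0, 1, 2, 3]

# An explicit index must have one label per row.
df_labelled = pd.DataFrame(same_len, index=['w', 'x', 'y', 'z'])

# Arrays of different lengths are rejected with a ValueError.
mismatch_raised = False
try:
    pd.DataFrame({'first': [1., 2., 3.], 'second': [1., 2.]})
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```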


### From a Numpy Array

In [33]:
data = np.random.randint(0,2,size=(5,3))
data

array([[0, 1, 1],
       [1, 1, 0],
       [1, 0, 1],
       [1, 0, 0],
       [0, 1, 1]])

In [34]:
col_names = ['simulation1','simulation2','simulation3']
index_list = list('abcde')

In [35]:
df = pd.DataFrame(data, index=index_list, columns=col_names)
df

Unnamed: 0,simulation1,simulation2,simulation3
a,0,1,1
b,1,1,0
c,1,0,1
d,1,0,0
e,0,1,1
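Because the array is random, re-running the cells gives different values each time. One way to make the example reproducible, sketched here under the assumption that NumPy 1.17 or newer is available, is to draw from a seeded generator. The same sketch also shows the default labels pandas assigns when ``index`` and ``columns`` are omitted.

```python
import numpy as np
import pandas as pd

# A seeded generator returns the same draws on every run.
rng = np.random.default_rng(seed=0)
data = rng.integers(0, 2, size=(5, 3))   # values are 0 or 1

# Without labels, rows and columns are both numbered from 0.
df_default = pd.DataFrame(data)
print(df_default.columns.tolist())   # [0, 1, 2]

# With labels, the same data gets named rows and columns.
df_labelled = pd.DataFrame(data, index=list('abcde'),
                           columns=['simulation1', 'simulation2', 'simulation3'])
print(df_labelled)
```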


### From a file

In [15]:
titanic = pd.read_csv('../datasets/titanic/titanic.csv')

In [16]:
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1.0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0000,0.0,0.0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1.0,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1.0,0.0,"Allison, Miss. Helen Loraine",female,2.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1.0,0.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1.0,2.0,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1.0,0.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1.0,2.0,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1.0,1.0,"Anderson, Mr. Harry",male,48.0000,0.0,0.0,19952,26.5500,E12,S,3,,"New York, NY"
6,1.0,1.0,"Andrews, Miss. Kornelia Theodosia",female,63.0000,1.0,0.0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1.0,0.0,"Andrews, Mr. Thomas Jr",male,39.0000,0.0,0.0,112050,0.0000,A36,S,,,"Belfast, NI"
8,1.0,1.0,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0000,2.0,0.0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1.0,0.0,"Artagaveytia, Mr. Ramon",male,71.0000,0.0,0.0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"
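Displaying a whole file is rarely practical, so it often helps to look at the shape, the first rows, and the column types. The sketch below is self-contained: instead of the `titanic.csv` file it reads a three-row excerpt (copied from the output above) through `io.StringIO`, which is an assumption made only so the example runs without the dataset.

```python
import io
import pandas as pd

# A tiny stand-in for the real file; names contain commas, so they are quoted.
csv_text = '''pclass,survived,name,sex,age
1,1,"Allen, Miss. Elisabeth Walton",female,29.0
1,1,"Allison, Master. Hudson Trevor",male,0.9167
1,0,"Allison, Miss. Helen Loraine",female,2.0
'''

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)     # (rows, columns)
print(sample.head(2))   # first two rows only
print(sample.dtypes)    # inferred type of each column
```

With the real dataset, `titanic.head()` and `titanic.shape` would be used the same way.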


### Anatomy of a DataFrame

A DataFrame consists of three parts:
* Row Index
* Column Names (Column Index)
* Data

The row and column labels can be accessed respectively by accessing the ``index`` and ``columns`` attributes:

In [17]:
df.index

Index([u'a', u'b', u'c', u'd'], dtype='object')

In [18]:
df.columns

Index([u'one', u'two'], dtype='object')

In [19]:
df.values

array([[  1.,   1.],
       [  2.,   2.],
       [  3.,   3.],
       [ nan,   4.]])
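The three parts can be pulled out of any frame, as in this minimal sketch on a small illustrative frame (`df_parts` is a made-up name). It uses `to_numpy()`, which the pandas documentation recommends over the older `values` attribute for getting the underlying array.

```python
import pandas as pd

df_parts = pd.DataFrame({'one': [1., 2.], 'two': [3., 4.]}, index=['a', 'b'])

row_labels = df_parts.index.tolist()      # row index
col_labels = df_parts.columns.tolist()    # column index
arr = df_parts.to_numpy()                 # data as a 2-D ndarray

print(row_labels, col_labels)
print(arr.shape)        # matches df_parts.shape
```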

## Column selection, addition, deletion

You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting columns works with the same syntax as the analogous dict operations:

In [20]:
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'ticket', u'fare', u'cabin', u'embarked', u'boat', u'body',
       u'home.dest'],
      dtype='object')

In [25]:
titanic['name']

0                           Allen, Miss. Elisabeth Walton
1                          Allison, Master. Hudson Trevor
2                            Allison, Miss. Helen Loraine
3                    Allison, Mr. Hudson Joshua Creighton
4         Allison, Mrs. Hudson J C (Bessie Waldo Daniels)
5                                     Anderson, Mr. Harry
6                       Andrews, Miss. Kornelia Theodosia
7                                  Andrews, Mr. Thomas Jr
8           Appleton, Mrs. Edward Dale (Charlotte Lamson)
9                                 Artagaveytia, Mr. Ramon
10                                 Astor, Col. John Jacob
11      Astor, Mrs. John Jacob (Madeleine Talmadge Force)
12                          Aubart, Mme. Leontine Pauline
13                           Barber, Miss. Ellen "Nellie"
14                   Barkworth, Mr. Algernon Henry Wilson
15                                    Baumann, Mr. John D
16                               Baxter, Mr. Quigg Edmond
17        Baxt

In [26]:
del titanic['ticket']
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'fare', u'cabin', u'embarked', u'boat', u'body', u'home.dest'],
      dtype='object')

In [27]:
titanic['age_in_months'] = 12*titanic['age']

In [28]:
titanic.columns

Index([u'pclass', u'survived', u'name', u'sex', u'age', u'sibsp', u'parch',
       u'fare', u'cabin', u'embarked', u'boat', u'body', u'home.dest',
       u'age_in_months'],
      dtype='object')
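The dict analogy extends a little further than `[]` and `del`. The sketch below, on a small illustrative frame rather than the Titanic data, shows membership tests, `pop`, and the non-mutating `drop(columns=...)` alternative to `del`.

```python
import pandas as pd

df_cols = pd.DataFrame({'age': [29.0, 2.0], 'fare': [211.3375, 151.55]})

# `in` tests column labels, as it tests keys for a dict.
print('age' in df_cols)

# Assigning to a new key adds a column.
df_cols['age_in_months'] = 12 * df_cols['age']

# pop removes the column in place and returns it as a Series.
fare = df_cols.pop('fare')

# drop returns a new frame and leaves df_cols untouched.
trimmed = df_cols.drop(columns=['age_in_months'])
print(df_cols.columns.tolist(), trimmed.columns.tolist())
```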

When inserting a scalar value, it will naturally be propagated to fill the column:

In [77]:
titanic['year'] = 1909
titanic['year']

0       1909
1       1909
2       1909
3       1909
4       1909
5       1909
6       1909
7       1909
8       1909
9       1909
10      1909
11      1909
12      1909
13      1909
14      1909
15      1909
16      1909
17      1909
18      1909
19      1909
20      1909
21      1909
22      1909
23      1909
24      1909
25      1909
26      1909
27      1909
28      1909
29      1909
        ... 
1280    1909
1281    1909
1282    1909
1283    1909
1284    1909
1285    1909
1286    1909
1287    1909
1288    1909
1289    1909
1290    1909
1291    1909
1292    1909
1293    1909
1294    1909
1295    1909
1296    1909
1297    1909
1298    1909
1299    1909
1300    1909
1301    1909
1302    1909
1303    1909
1304    1909
1305    1909
1306    1909
1307    1909
1308    1909
1309    1909
Name: year, dtype: int64

You can insert raw ndarrays but their length must match the length of the DataFrame’s index.

In [30]:
len(titanic)

1310

In [29]:
titanic['rand_integer'] = np.random.randint(0,10,size=len(titanic))
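A quick sketch of the length rule on a small illustrative frame: an ndarray of the wrong length is rejected, whereas a Series is aligned on the index instead, with missing labels becoming NaN. The frame and column names here are made up for the example.

```python
import numpy as np
import pandas as pd

df_len = pd.DataFrame({'x': [1, 2, 3]}, index=['a', 'b', 'c'])

# Same length as the index: accepted.
df_len['ok'] = np.array([10, 20, 30])

# Wrong length: pandas raises a ValueError.
length_error = False
try:
    df_len['bad'] = np.array([1, 2])
except ValueError as err:
    length_error = True
    print('ValueError:', err)

# A Series is matched by label, so a shorter one is allowed;
# row 'b' has no matching label and becomes NaN.
df_len['aligned'] = pd.Series([100, 300], index=['a', 'c'])
print(df_len)
```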

You can select many columns by passing a list of column names:

In [31]:
titanic[['name','survived','sex','age']]

Unnamed: 0,name,survived,sex,age
0,"Allen, Miss. Elisabeth Walton",1.0,female,29.0000
1,"Allison, Master. Hudson Trevor",1.0,male,0.9167
2,"Allison, Miss. Helen Loraine",0.0,female,2.0000
3,"Allison, Mr. Hudson Joshua Creighton",0.0,male,30.0000
4,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",0.0,female,25.0000
5,"Anderson, Mr. Harry",1.0,male,48.0000
6,"Andrews, Miss. Kornelia Theodosia",1.0,female,63.0000
7,"Andrews, Mr. Thomas Jr",0.0,male,39.0000
8,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",1.0,female,53.0000
9,"Artagaveytia, Mr. Ramon",0.0,male,71.0000
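Note the difference between a single label and a list of labels: the former returns a Series, the latter a DataFrame, even when the list has only one entry. A short sketch on an illustrative frame (`people` is a made-up name):

```python
import pandas as pd

people = pd.DataFrame({'name': ['Allen', 'Allison'],
                       'survived': [1, 1],
                       'age': [29.0, 0.9167]})

one_col = people['name']             # single label -> Series
one_col_frame = people[['name']]     # list with one label -> DataFrame
reordered = people[['age', 'name']]  # the list also sets the column order

print(type(one_col).__name__, type(one_col_frame).__name__)
print(reordered.columns.tolist())
```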


### Go to Exercises

## The basics of indexing / row selection

<table border="1" class="docutils">
<colgroup>
<col width="50%">
<col width="33%">
<col width="17%">
</colgroup>
<thead valign="bottom">
<tr class="row-odd"><th class="head">Operation</th>
<th class="head">Syntax</th>
<th class="head">Result</th>
</tr>
</thead>
<tbody valign="top">
<tr class="row-even"><td>Select column</td>
<td><tt class="docutils literal"><span class="pre">df[col]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-odd"><td>Select row by label</td>
<td><tt class="docutils literal"><span class="pre">df.loc[label]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-even"><td>Select row by integer location</td>
<td><tt class="docutils literal"><span class="pre">df.iloc[loc]</span></tt></td>
<td>Series</td>
</tr>
<tr class="row-odd"><td>Slice rows</td>
<td><tt class="docutils literal"><span class="pre">df[5:10]</span></tt></td>
<td>DataFrame</td>
</tr>
<tr class="row-even"><td>Select rows by boolean vector</td>
<td><tt class="docutils literal"><span class="pre">df[bool_vec]</span></tt></td>
<td>DataFrame</td>
</tr>
</tbody>
</table>

In [36]:
df

Unnamed: 0,simulation1,simulation2,simulation3
a,0,1,1
b,1,1,0
c,1,0,1
d,1,0,0
e,0,1,1


In [37]:
df.loc['b']

simulation1    1
simulation2    1
simulation3    0
Name: b, dtype: int32

In [38]:
df.iloc[1]

simulation1    1
simulation2    1
simulation3    0
Name: b, dtype: int32

In [40]:
df[1:3]

Unnamed: 0,simulation1,simulation2,simulation3
b,1,1,0
c,1,0,1


In [41]:
df.loc['b']

simulation1    1
simulation2    1
simulation3    0
Name: b, dtype: int32

In [42]:
df.loc['b','simulation2']

1

In [17]:
df.index

Index([u'a', u'b', u'c', u'd', u'e'], dtype='object')

In [18]:
df.columns

Index([u'simulation1', u'simulation2', u'simulation3'], dtype='object')

In [19]:
df.values

array([[1, 0, 1],
       [0, 1, 1],
       [1, 0, 0],
       [0, 1, 1],
       [0, 0, 0]])

In [83]:
df.loc[['a','d']]

Unnamed: 0,simulation1,simulation2,simulation3
a,1,0,1
d,1,1,1


In [84]:
df.iloc[4]

simulation1    1
simulation2    0
simulation3    1
Name: e, dtype: int32

In [85]:
df['simulation3'] == 1

a     True
b     True
c    False
d     True
e     True
Name: simulation3, dtype: bool

In [43]:
df[df['simulation3']==1]

Unnamed: 0,simulation1,simulation2,simulation3
a,0,1,1
c,1,0,1
e,0,1,1
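Boolean vectors can be combined before being used as a filter. The sketch below rebuilds the frame with fixed values (taken from the output above, so it does not depend on the random draw) and uses `&`, `|` and `~`; each comparison needs its own parentheses because these operators bind more tightly than `==`.

```python
import pandas as pd

df_sim = pd.DataFrame([[0, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1]],
                      index=list('abcde'),
                      columns=['simulation1', 'simulation2', 'simulation3'])

# Rows where both simulation1 and simulation3 are 1.
both = df_sim[(df_sim['simulation1'] == 1) & (df_sim['simulation3'] == 1)]

# Rows where simulation2 or simulation3 is 0.
either_zero = df_sim[(df_sim['simulation2'] == 0) | (df_sim['simulation3'] == 0)]

# Rows where simulation3 is not 1.
not_one = df_sim[~(df_sim['simulation3'] == 1)]

print(both.index.tolist(), either_zero.index.tolist(), not_one.index.tolist())
```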


In [46]:
df

Unnamed: 0,simulation1,simulation2,simulation3
a,0,1,1
b,1,1,0
c,1,0,1
d,1,0,0
e,0,1,1


In [48]:
df.iloc[1,1]

1
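One detail worth keeping in mind when slicing: label slices with `loc` include the end label, while position slices with `iloc` exclude the end position, like ordinary Python slices. A minimal sketch, again on a frame with fixed values copied from the output above:

```python
import pandas as pd

df_sim = pd.DataFrame([[0, 1, 1], [1, 1, 0], [1, 0, 1], [1, 0, 0], [0, 1, 1]],
                      index=list('abcde'),
                      columns=['simulation1', 'simulation2', 'simulation3'])

by_label = df_sim.loc['b':'d']    # rows b, c and d (end label included)
by_pos = df_sim.iloc[1:3]         # rows at positions 1 and 2 (end excluded)

# Both indexers accept a row and a column selector.
cell_by_label = df_sim.loc['b', 'simulation2']
cell_by_pos = df_sim.iloc[1, 1]

print(by_label.index.tolist(), by_pos.index.tolist())
```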

In [55]:
dict1 = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
s4 = pd.Series(dict1)
s4

Fri    11
Mon    33
Sat    -5
Sun     9
Thu    89
Tue    19
Wed    15
dtype: int64
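The output above lists the labels alphabetically, which is how older pandas releases ordered a Series built from a dict. Recent versions are expected to keep the dict's insertion order instead (this assumes pandas 0.23 or newer on Python 3.6 or newer); `sort_index()` reproduces the alphabetical view when it is wanted.

```python
import pandas as pd

dict1 = {'Mon': 33, 'Tue': 19, 'Wed': 15, 'Thu': 89, 'Fri': 11, 'Sat': -5, 'Sun': 9}
s4 = pd.Series(dict1)

# On recent pandas the labels follow the order of the dict keys.
print(s4.index.tolist())

# Sorting by label gives the alphabetical order shown above.
s4_sorted = s4.sort_index()
print(s4_sorted)
```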