# Using Pandas DataFrames

A **DataFrame** in pandas is a tabular, spreadsheet-like structure with an **ordered** collection of columns, each of which can have a different type.

DataFrames have **both** column and row indexes; a DataFrame can be thought of as a "dict of Series" that share the same row index.
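To make the "dict of Series" picture concrete, here is a minimal sketch. The Series names, the labels `a`, `b`, `c` and the values are made up for illustration.

```python
import pandas as pd

# Two Series sharing the same row labels; each one has its own dtype.
region = pd.Series(["Europe", "USA", "LATAM"], index=["a", "b", "c"])
volume = pd.Series([100, 200, 80], index=["a", "b", "c"])

# Building a DataFrame from a dict of Series: the keys become column names.
frame = pd.DataFrame({"region": region, "volume": volume})

print(frame.columns.tolist())   # the column index
print(frame.index.tolist())     # the row index
print(frame.dtypes)             # one dtype per column

# Selecting a column gives back a Series.
print(type(frame["volume"]))
```

Each column keeps its own type, while the row index is common to all of them.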

## Creating a DataFrame from a dict

In [1]:
import pandas as pd
import numpy as np
sales = {
    'region': ["Europe", "Europe", "Europe",
               "USA", "USA", "USA", "LATAM", "LATAM"],
    'volume': [100, 120, 140, 200, 190, 180, 80, 90],
    'year': [2011, 2012, 2013, 2011, 2012, 2013, 2012, 2013]
}
data = pd.DataFrame(sales)
data

   region  volume  year
0  Europe     100  2011
1  Europe     120  2012
2  Europe     140  2013
3     USA     200  2011
4     USA     190  2012
5     USA     180  2013
6   LATAM      80  2012
7   LATAM      90  2013


Now you can apply any DataFrame method to it, for example <code>tail</code> to see the last rows.

In [2]:
data.tail(2)

  region  volume  year
6  LATAM      80  2012
7  LATAM      90  2013


## Taking Series from DataFrames

You can get each column using indexing or dot notation.

In [5]:
print(data["region"])
print(data.year)
print(data[["year", "volume"]])  # getting more than one column
print(type(data.year))

0    Europe
1    Europe
2    Europe
3       USA
4       USA
5       USA
6     LATAM
7     LATAM
Name: region, dtype: object
0    2011
1    2012
2    2013
3    2011
4    2012
5    2013
6    2012
7    2013
Name: year, dtype: int64
   year  volume
0  2011     100
1  2012     120
2  2013     140
3  2011     200
4  2012     190
5  2013     180
6  2012      80
7  2013      90
<class 'pandas.core.series.Series'>


You can also apply operations to whole columns; they propagate to all elements, as with NumPy arrays.

In [6]:
data["volume"] += 1
data.head(3)

   region  volume  year
0  Europe     101  2011
1  Europe     121  2012
2  Europe     141  2013
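Other element-wise operations work the same way. The sketch below uses a small made-up frame (not the `data` frame above) to show arithmetic, a comparison used as a row filter, and an operation between two columns.

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["Europe", "USA", "LATAM"],
    "volume": [100, 200, 80],
    "year": [2011, 2011, 2012],
})

# Arithmetic on a column is applied element by element.
doubled = data["volume"] * 2

# A comparison yields a boolean Series, which can be used to filter rows.
large = data[data["volume"] > 90]

# Two columns can be combined too; values are aligned on the row index.
ratio = data["volume"] / data["year"]

print(doubled.tolist())
print(large)
```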


Note that <code>[]</code> behaves differently when it is given a slice: it then selects rows instead of columns. This can be seen as a bit **inconsistent**, so it is better to use the specific indexing operators (see below).

In [83]:
data[2:5][["region", "volume"]]

   region  volume
2  Europe     141
3     USA     201
4     USA     191
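The difference between the two uses of `[]` can be sketched as follows, next to the explicit `iloc` (position-based) and `loc` (label-based) accessors that state the intent. The frame here is a small made-up one with the default integer index.

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["Europe", "Europe", "USA", "USA", "LATAM"],
    "volume": [100, 120, 200, 190, 80],
})

by_column = data["volume"]   # a label selects a column
by_rows = data[1:3]          # a slice selects rows (positions 1 and 2)

# Explicit equivalents:
rows_by_position = data.iloc[1:3]                     # end position excluded
rows_and_cols = data.loc[1:2, ["region", "volume"]]   # labels, end included

print(by_rows)
print(rows_and_cols)
```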


## Adding columns

You can add columns as in a Python dict, using indexing and assignment. However, the size of the array matters: it must match the length of the index. You can delete a column with <code>del</code>, as with dicts.

In [84]:
data["complains"] = np.zeros(data.index.size)
data.head(3)

   region  volume  year  complains
0  Europe     101  2011          0
1  Europe     121  2012          0
2  Europe     141  2013          0
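A short sketch of the length rule and of `del`, using a small made-up frame. The `currency` and `bad` column names are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"region": ["Europe", "USA", "LATAM"],
                     "volume": [100, 200, 80]})

# An array with one value per row is accepted.
data["complains"] = np.zeros(len(data.index))

# A scalar is broadcast to every row.
data["currency"] = "EUR"

# An array of the wrong length is rejected with a ValueError.
try:
    data["bad"] = np.zeros(len(data.index) + 1)
except ValueError as err:
    print("rejected:", err)

# Columns are removed with del, as with dict keys.
del data["currency"]
print(data.columns.tolist())
```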


## Getting NumPy arrays

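You can use **values** to get an ndarray with the values of one or several columns. If the columns all have the same type, the result is a normal 2D array of that type. If the column types differ, you still get a 2D array, but pandas picks a dtype that can hold every column, which is <code>object</code> when strings and numbers are mixed.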

In [85]:
a = data[["year", "volume"]].values
print(type(a))
print(a)
print(a.dtype)

<type 'numpy.ndarray'>
[[2011  101]
 [2012  121]
 [2013  141]
 [2011  201]
 [2012  191]
 [2013  181]
 [2012   81]
 [2013   91]]
int64
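For comparison, this sketch shows what happens when the selected columns do not share a type. It uses a small made-up frame.

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["Europe", "USA", "LATAM"],
    "volume": [101, 201, 81],
    "year": [2011, 2011, 2012],
})

# Two integer columns: a regular 2D integer array.
homogeneous = data[["year", "volume"]].values
print(homogeneous.dtype, homogeneous.shape)

# A string column and an integer column: still 2D, but with dtype object.
mixed = data[["region", "volume"]].values
print(mixed.dtype, mixed.shape)
```

An `object` array stores references to Python objects, so the fast numeric operations of NumPy are not available on it.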


## Indexing

While you can use the normal Python indexing syntax <code>[]</code>, it is recommended for performance to use the special indexing operators in pandas. They are also needed when you select by labels on both rows and columns.

In [54]:
# Sales by city and in different years.
example = {
    "MAD": [100, 120, 140, 130],
    "BCN": [90, 80, 100, 150],
    "VAL": [90, 80, 70, 80]
}
years = ["2011", "2012", "2013", "2014"]
mysales = pd.DataFrame(example, index=years)
mysales

      BCN  MAD  VAL
2011   90  100   90
2012   80  120   80
2013  100  140   70
2014  150  130   80


In [56]:
mysales.loc[:, "BCN"]

2011     90
2012     80
2013    100
2014    150
Name: BCN, dtype: int64

In [57]:
mysales.loc["2012"]

BCN     80
MAD    120
VAL     80
Name: 2012, dtype: int64

In [116]:
# NOTE: label slices include both endpoints, unlike slices of Python lists.
mysales.loc["2011":"2013", ["BCN", "VAL"]]
# Different from a label slice over the columns, which selects every column
# from BCN through VAL in column order:
# mysales.loc["2011":"2013", "BCN":"VAL"]

      BCN  VAL
2011   90   90
2012   80   80
2013  100   70
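The inclusive behaviour of label slices is easier to see next to a position-based slice. This sketch rebuilds the same frame and compares `loc` with `iloc`, which follows the usual Python convention of excluding the end.

```python
import pandas as pd

mysales = pd.DataFrame(
    {"MAD": [100, 120, 140, 130],
     "BCN": [90, 80, 100, 150],
     "VAL": [90, 80, 70, 80]},
    index=["2011", "2012", "2013", "2014"],
)

# Label slice: both "2011" and "2013" are included (three rows).
by_label = mysales.loc["2011":"2013"]

# Position slice: position 2 is excluded (two rows).
by_position = mysales.iloc[0:2]

print(by_label.index.tolist())
print(by_position.index.tolist())
```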


## Changing the index

Most of the indexes up to now have been created by default and are numerical sequences. But you can use any other data type as row index.

Index objects are immutable, and can be of different types, including numbers, strings and dates or date periods.

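An index may contain duplicate labels. Because an index is immutable, <code>append</code> and <code>insert</code> return a new index with the added elements. Index objects also support operations similar to those of sets (<code>difference</code>, <code>intersection</code>, <code>union</code>).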
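These properties of Index objects can be checked with a few lines. The labels below are made up for illustration.

```python
import pandas as pd

idx = pd.Index(["2011", "2012", "2013"])

# Index objects cannot be modified in place.
try:
    idx[0] = "2010"
except TypeError as err:
    print("immutable:", err)

# append and insert return new Index objects; duplicates are allowed.
longer = idx.append(pd.Index(["2013", "2014"]))
with_first = idx.insert(0, "2010")
print(longer.tolist(), longer.is_unique)

# Set-like operations.
other = pd.Index(["2013", "2014"])
print(idx.union(other).tolist())
print(idx.intersection(other).tolist())
print(idx.difference(other).tolist())
```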

Reindexing is possible both for rows and columns, and lets you specify how to fill in labels that have no data yet. It is very common to use some of the columns as index.

In [98]:
byregion = data.set_index(["region", "year"])
byregion

             volume  complains
region year
Europe 2011     101          0
Europe 2012     121          0
Europe 2013     141          0
USA    2011     201          0
USA    2012     191          0
USA    2013     181          0
LATAM  2012      81          0
LATAM  2013      91          0


This is a multilevel index.

In [103]:
europe = byregion.loc["Europe"]
europe

      volume  complains
year
2011     101          0
2012     121          0
2013     141          0
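Selecting on the outer level is only one of the options with a multilevel index. The sketch below, on a reduced made-up version of the frame, also selects a single row with a tuple and takes a cross-section on the inner level with `xs`.

```python
import pandas as pd

data = pd.DataFrame({
    "region": ["Europe", "Europe", "USA", "USA"],
    "year": [2011, 2012, 2011, 2012],
    "volume": [101, 121, 201, 191],
})
byregion = data.set_index(["region", "year"])

# Outer level: all rows of one region, indexed by year.
europe = byregion.loc["Europe"]

# Both levels: a tuple identifies a single row.
one_row = byregion.loc[("Europe", 2012), :]

# Inner level only: a cross-section over all regions.
all_2012 = byregion.xs(2012, level="year")

print(europe)
print(one_row["volume"])
print(all_2012)
```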


In [114]:
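# The year labels are integers, so index with 2012 rather than "2012".
# The row mixes an int column and a float column, so it comes back as float64.
print(europe.loc[2012])
print(type(europe.loc[2012]))
print(europe.loc[2012].volume)
print(type(europe.index[0]))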

volume       121
complains      0
Name: 2012, dtype: float64
<class 'pandas.core.series.Series'>
121.0
<type 'numpy.int64'>
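The reindexing mentioned at the start of this section can be sketched as follows. The frame is a small made-up one with a missing year, and the three calls show different ways of dealing with the new label: leaving it empty, filling with a constant, and carrying the previous row forward.

```python
import pandas as pd

mysales = pd.DataFrame({"MAD": [100, 120], "BCN": [90, 80]},
                       index=[2011, 2013])
years = [2011, 2012, 2013]

# Default: the new row 2012 is filled with NaN.
with_nan = mysales.reindex(years)

# Fill new rows with a constant.
with_zero = mysales.reindex(years, fill_value=0)

# Fill new rows from the previous existing label (forward fill).
with_ffill = mysales.reindex(years, method="ffill")

# Columns can be reindexed as well; VAL did not exist, so it is all NaN.
more_cols = mysales.reindex(columns=["BCN", "MAD", "VAL"])

print(with_nan)
print(with_zero)
print(with_ffill)
print(more_cols)
```

Note that `reindex` returns a new DataFrame and leaves the original untouched.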
