The `DataFrame` can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

In [10]:
import numpy as np
import pandas as pd

## DataFrame as a generalized NumPy array

If a `Series` is an analog of a one-dimensional array with flexible indices, a `DataFrame` is an analog of a **two-dimensional array with both flexible row indices and flexible column names.**

Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a `DataFrame` as a sequence of aligned `Series` objects. Here, "aligned" means that they share the same index.

In [2]:
area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
             'Florida': 170312, 'Illinois': 149995}
area = pd.Series(area_dict)

In [3]:
area

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
dtype: int64

In [4]:
population_dict = {
    'California': 38332521,
    'Texas': 26448193,
    'New York': 19651127,
    'Florida': 19552860,
    'Illinois': 12882135
}

population = pd.Series(population_dict)

In [5]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

Use a dictionary to construct a single two-dimensional object containing this information:

Syntax:
`pd.DataFrame({'column_label': column_series, ...})`

In [12]:
states = pd.DataFrame({'population': population,
                       'area': area})

states

            population    area
California    38332521  423967
Texas         26448193  695662
New York      19651127  141297
Florida       19552860  170312
Illinois      12882135  149995
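
To make the idea of alignment concrete, here is a minimal sketch using two small, made-up `Series` (the names `s1` and `s2` and their values are illustrative assumptions, not data from this section). Their labels appear in different orders and only partly overlap; the constructor matches values by label rather than by position:

```python
import pandas as pd

# Hypothetical Series whose labels differ in order and coverage
s1 = pd.Series({'x': 1, 'y': 2, 'z': 3})
s2 = pd.Series({'z': 30, 'x': 10})

# Values are matched by index label, not by position
aligned = pd.DataFrame({'s1': s1, 's2': s2})

print(aligned)
# Row 'y' has no entry in s2, so that cell is filled with NaN
```

The `area` and `population` objects used above already share identical labels, so no gaps appear in `states`.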


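Like the `Series` object, the `DataFrame` has an `index` attribute that gives access to the row labels:

In [8]: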
# Access index
states.index

Index(['California', 'Texas', 'New York', 'Florida', 'Illinois'], dtype='object')

Additionally, the `DataFrame` has a `columns` attribute, which is an `Index` object holding the column labels:

In [9]:
# Access columns
states.columns

Index(['population', 'area'], dtype='object')

Thus the `DataFrame` can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.
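
As a rough illustration of this view, the following sketch rebuilds a two-row version of `states` (shortened only so that the block runs on its own) and reads the same cell once through the underlying array and once through the row and column labels:

```python
import pandas as pd

# Two-row version of the states data, rebuilt so this runs standalone
states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# The underlying data behaves like a two-dimensional NumPy array
values = states.to_numpy()
print(values.shape)  # (rows, columns)

# Position-based access on the array and label-based access on the
# DataFrame reach the same cell
by_position = values[1, 1]
by_label = states.loc['Texas', 'area']
print(by_position == by_label)
```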

## DataFrame as specialized dictionary

Where a dictionary maps a key to a value, a `DataFrame` maps a column name to a `Series` of column data. 

For example, indexing with the `'area'` key returns the `Series` object containing the areas we saw earlier:

In [13]:
states['area']

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

In [14]:
states['population']

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
Name: population, dtype: int64
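
The dictionary analogy also explains what does not work. The sketch below, again using a shortened two-row copy of `states` as an assumption for brevity, shows that the keys are the column names, so a row label is not a valid key:

```python
import pandas as pd

# Two-row version of the states data, rebuilt so this runs standalone
states = pd.DataFrame(
    {'population': [38332521, 26448193], 'area': [423967, 695662]},
    index=['California', 'Texas'],
)

# Like a dictionary, membership tests and keys() refer to the column names
print('area' in states)        # a column name, so it is a key
print('California' in states)  # a row label, so it is not a key
print(list(states.keys()))

# Indexing with a row label therefore fails, unlike data[0] on a 2-D array,
# which would return the first row
try:
    states['California']
except KeyError:
    print("no column named 'California'")
```

To select a row by its label, use `states.loc['California']` instead.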

## Construct DataFrame Objects

- From a single `Series` object

In [17]:
population

California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
dtype: int64

In [18]:
pd.DataFrame(population, columns=['Population'])

            Population
California    38332521
Texas         26448193
New York      19651127
Florida       19552860
Illinois      12882135
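
An alternative worth knowing is `Series.to_frame()`, which performs the same conversion as a method call. This is a sketch with a shortened two-state copy of `population` (an assumption made so the block is self-contained), not a replacement for the constructor call above:

```python
import pandas as pd

# Shortened copy of the population Series, so this runs standalone
population = pd.Series({'California': 38332521, 'Texas': 26448193})

# to_frame() builds a one-column DataFrame; `name` sets the column label
frame = population.to_frame(name='Population')
print(frame)
```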


- From a list of dicts

In [20]:
data = [{'a': i, 'b': 2 * i} for i in range(3)]
data

[{'a': 0, 'b': 0}, {'a': 1, 'b': 2}, {'a': 2, 'b': 4}]

In [21]:
pd.DataFrame(data)

   a  b
0  0  0
1  1  2
2  2  4


Even if some keys are missing from some of the dictionaries, pandas will fill the gaps with `NaN` (i.e., "not a number") values:

In [22]:
pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}, {'c': 5, 'd': 6}])

     a    b    c    d
0  1.0  2.0  NaN  NaN
1  NaN  3.0  4.0  NaN
2  NaN  NaN  5.0  6.0
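
Note that the integers in the output above are displayed as floats. A small sketch, using a two-record variant of the same input, suggests why: a column with a missing entry is upcast to a floating-point dtype so that it can hold `NaN`, while a fully populated column keeps its integer dtype.

```python
import pandas as pd

# Two records with partly different keys
df = pd.DataFrame([{'a': 1, 'b': 2}, {'b': 3, 'c': 4}])

# Columns 'a' and 'c' each have one missing entry and become float64;
# column 'b' is complete and stays int64
print(df.dtypes)

# isna() marks the cells that were filled in
print(df.isna())
```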


- From a two-dimensional NumPy array

In [28]:
pd.DataFrame(np.random.randint(5, size=(3, 2)), 
             columns=['foo', 'bar'])

   foo  bar
0    0    3
1    1    0
2    4    1


In [29]:
pd.DataFrame(np.random.randint(5, size=(3, 2)), 
             columns=['foo', 'bar'],
             index=['a', 'b', 'c'])

   foo  bar
a    3    4
b    1    1
c    1    4
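
Because the arrays above are random, their values change on every run. The following sketch uses a deterministic array instead (`np.arange`, chosen here purely for reproducibility) to show what happens when the `columns` and `index` arguments are omitted, and how labels map back to array positions:

```python
import numpy as np
import pandas as pd

# A deterministic array, used here instead of random values
arr = np.arange(6).reshape(3, 2)

# Without index/columns arguments, both axes get default integer labels
default_labels = pd.DataFrame(arr)
print(default_labels.index)
print(default_labels.columns)

# With explicit labels, the same cell can be reached by name
labeled = pd.DataFrame(arr, columns=['foo', 'bar'], index=['a', 'b', 'c'])
print(labeled.loc['b', 'bar'])  # same value as arr[1, 1]
```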


- From a NumPy structured array

In [33]:
A = np.zeros(3, dtype=[('A', 'i8'), ('B', 'f8')])
A

array([(0, 0.), (0, 0.), (0, 0.)], dtype=[('A', '<i8'), ('B', '<f8')])

In [34]:
pd.DataFrame(A)

   A    B
0  0  0.0
1  0  0.0
2  0  0.0
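
With an all-zero array it is hard to see what was carried over. As a small sketch with made-up, non-zero records (the name `records` and its values are assumptions for illustration), the field names become the column labels and each column keeps the dtype of its field:

```python
import numpy as np
import pandas as pd

# Hypothetical records: field 'A' is an 8-byte integer, 'B' an 8-byte float
records = np.array([(1, 2.5), (2, 3.5)], dtype=[('A', 'i8'), ('B', 'f8')])

df = pd.DataFrame(records)

# Field names become column labels, and each column keeps its field's dtype
print(df.columns)
print(df.dtypes)
```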
