# The pandas DataFrame Object
The pandas DataFrame object extends the capabilities of the Series object into
two dimensions. A Series object adds an index to a NumPy array but can only
associate a single data item with each index label, whereas a DataFrame integrates
multiple Series objects by aligning them along common index labels. This automatic
alignment by index label provides a seamless view across all the Series, so that the
values at each index label have the appearance of a row in a table.<br><br>
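
The alignment described above can be sketched with two small Series whose index labels only partly overlap. The names `temps`, `rain`, and `weather`, and the values in them, are made up for illustration; this is a minimal sketch rather than one of the chapter's own examples.

```python
import pandas as pd

# Two hypothetical Series whose index labels only partly overlap
temps = pd.Series([20.5, 22.1, 19.8], index=['mon', 'tue', 'wed'])
rain = pd.Series([1.2, 0.0], index=['tue', 'thu'])

# Building a DataFrame from them aligns the values by index label;
# a label missing from one Series yields NaN in that column
weather = pd.DataFrame({'temp': temps, 'rain': rain})
print(weather)
```

Each row of `weather` gathers the values that share a label, which is why only the `tue` row has both columns filled in.
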
A DataFrame object can be thought of as a dictionary-like container of one or more
Series objects, or as a spreadsheet. For those new to pandas, probably the best
description is to compare a DataFrame object to a relational database table. However,
even that comparison is limited, as a DataFrame object has distinct qualities
(such as automatic data alignment of Series) that make it much more capable for
exploratory data analysis than either a spreadsheet or a relational database table.<br><br>
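
To make the dictionary-like view concrete, the following sketch treats column names as keys and the columns themselves as values. The frame and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two named columns
frame = pd.DataFrame(np.array([[0, 1], [2, 3]]), columns=['c1', 'c2'])

# Column names behave like dictionary keys
has_c1 = 'c1' in frame
names = list(frame.keys())

# Each "value" is a Series that shares the frame's index
kinds = {name: type(column).__name__ for name, column in frame.items()}
print(has_c1, names, kinds)
```
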
Because of the increased dimensionality of the DataFrame object, it becomes necessary
to provide a means to select both rows and columns. Carrying over from a Series,
the DataFrame uses the [] operator for selection, but it is now applied to the selection
of columns of data. This means that another construct must be used to select specific
rows of a DataFrame object. For those operations, a DataFrame object provides several
methods and attributes that can be used in various fashions to select data by rows.<br><br>
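
As a brief sketch of the distinction drawn above, the example below selects a column with `[]` and rows with the `.loc` (by label) and `.iloc` (by position) attributes. The text has not named these attributes at this point, so treat their use here as an assumption about which row selectors are meant; the frame itself is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with labeled rows and columns
frame = pd.DataFrame(np.array([[0, 1], [2, 3]]),
                     columns=['c1', 'c2'],
                     index=['r1', 'r2'])

col = frame['c1']                # [] selects a column, returned as a Series
row_by_label = frame.loc['r1']   # .loc selects a row by its index label
row_by_pos = frame.iloc[1]       # .iloc selects a row by integer position

print(col.tolist(), row_by_label.tolist(), row_by_pos.tolist())
```
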
A DataFrame also introduces the concept of multiple axes, specifically the horizontal
and vertical axes. Functions from pandas can then be applied along either axis, in essence
stating that the operation be applied horizontally to all the values in each row, or
vertically down each column.<br><br>
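
A minimal sketch of applying one operation along each axis, using `sum()` on a made-up frame: `axis=0` runs down each column and `axis=1` runs across each row. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical 2x2 frame
frame = pd.DataFrame(np.array([[0, 1], [2, 3]]),
                     columns=['c1', 'c2'],
                     index=['r1', 'r2'])

col_totals = frame.sum(axis=0)   # one result per column (down the rows)
row_totals = frame.sum(axis=1)   # one result per row (across the columns)

print(col_totals)
print(row_totals)
```
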
In this chapter, we will examine the pandas DataFrame and how we can manipulate
both the DataFrame and the data it represents to build a basis for performing
interactive data analysis.
## We will cover:


## Creating a DataFrame from scratch
To use a DataFrame, we first need to import pandas and set some options for output.

In [1]:
# reference NumPy and pandas
import numpy as np
import pandas as pd
# Set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

In [2]:
pd.DataFrame(np.array([[10, 11], [20, 21]]))

    0   1
0  10  11
1  20  21

In [3]:
# create a DataFrame for a list of Series objects
df1 = pd.DataFrame([pd.Series(np.arange(10, 15)),
                    pd.Series(np.arange(15, 20))])
df1

    0   1   2   3   4
0  10  11  12  13  14
1  15  16  17  18  19

In [4]:
# what's the shape of this DataFrame
df1.shape # it is two rows by 5 columns

(2, 5)

In [5]:
# specify column names
df = pd.DataFrame(np.array([[10, 11], [20, 21]]), columns=['a', 'b'])
df

    a   b
0  10  11
1  20  21

In [6]:
# what are the names of the columns?
df.columns

Index(['a', 'b'], dtype='object')

In [7]:
# retrieve just the names of the columns by position
"{0}, {1}".format(df.columns[0], df.columns[1])

'a, b'

In [8]:
# rename the columns
df.columns = ['c1', 'c2']
df

   c1  c2
0  10  11
1  20  21

In [9]:
# create a DataFrame with named columns and rows
df = pd.DataFrame(np.array([[0, 1], [2, 3]]),
                  columns=['c1', 'c2'],
                  index=['r1', 'r2'])
df

    c1  c2
r1   0   1
r2   2   3

In [10]:
# retrieve the index of the DataFrame
df.index


Index(['r1', 'r2'], dtype='object')

In [11]:
# create a DataFrame with two Series objects
# and a dictionary
s1 = pd.Series(np.arange(1, 6, 1))
s2 = pd.Series(np.arange(6, 11, 1))
pd.DataFrame({'c1': s1, 'c2': s2})

   c1  c2
0   1   6
1   2   7
2   3   8
3   4   9
4   5  10

In [12]:
# demonstrate alignment during creation
s3 = pd.Series(np.arange(12, 14), index=[1, 2])
df = pd.DataFrame({'c1': s1, 'c2': s2, 'c3': s3})
df

   c1  c2    c3
0   1   6   NaN
1   2   7  12.0
2   3   8  13.0
3   4   9   NaN
4   5  10   NaN

## Example data
Where possible, the examples in this chapter will utilize several datasets provided
with the code in the download for the text. These datasets make the examples a little
less academic in nature. These datasets will be read from files using the <b>pd.read_csv()</b>
function, which loads the sample data from the file into a DataFrame object.
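
For readers without the download at hand, the sketch below shows the same kind of `pd.read_csv()` call against in-memory text instead of a file. The CSV content, the symbol `AAA`, and the variable names are invented for illustration and are not taken from the chapter's datasets.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a file on disk
csv_text = """date,symbol,open,close
2016-01-05,AAA,10.0,10.5
2016-01-06,AAA,10.5,10.2
"""

# index_col names the column to use as the row index;
# usecols keeps only the columns at the given positions
prices = pd.read_csv(io.StringIO(csv_text),
                     index_col='date',
                     usecols=[0, 2, 3])
print(prices)
```

Here the `symbol` column (position 1) is skipped, and `date` becomes the index rather than an ordinary column.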

In [13]:
# show the first three lines of the file
!head -n 3 data/prices.csv  # on Mac or Linux
# !type data\prices.csv  # on Windows, but will show the entire file

'head' is not recognized as an internal or external command,
operable program or batch file.


In [19]:
# read in the data and print the first five rows
# use the date column as the index, and
# only read in columns in positions 0, 2, 3, 6
sp500 = pd.read_csv("data/prices.csv",
                    index_col='date',
                    usecols=[0, 2, 3, 6])
sp500.head()

                           open       close     volume
date                                                  
2016-01-05 00:00:00  123.430000  125.839996  2163600.0
2016-01-06 00:00:00  125.239998  119.980003  2386400.0
2016-01-07 00:00:00  116.379997  114.949997  2489500.0
2016-01-08 00:00:00  115.480003  116.620003  2006300.0
2016-01-11 00:00:00  117.010002  114.970001  1408600.0