In [1]:
# import numpy and pandas, and DataFrame / Series
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

## Chapter 1: A Tour of pandas

- pandas handles the data manipulation part of the equation, leaving statistical,
financial, and other types of analyses to other Python libraries.
- Simple and effective data analysis requires the ability to index, retrieve,
tidy, reshape, combine, slice, and perform various analyses on both single and
multidimensional data, including heterogeneously typed data that is automatically
aligned along index labels.

In [2]:
# import numpy and pandas, and DataFrame / Series
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
# Set some pandas options
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)
# And some items for matplotlib
%matplotlib inline
import matplotlib.pyplot as plt
# plt.style.use is a function; call it rather than assigning to it
plt.style.use('default')

## Primary pandas objects
The two primary objects in the pandas framework are Series and DataFrame. DataFrame
objects are the overall workhorse of pandas and the most frequently used, as they
provide the means to manipulate tabular and heterogeneous data.
## The pandas Series object
The base data structure of pandas is the Series object, which is designed to operate
similarly to a NumPy array but also adds index capabilities. A simple way to create a
Series object is to initialize it with a Python array or Python list.

In [3]:
# create a four item Series
s = Series([1, 2, 3, 4])
s

0    1
1    2
2    3
3    4
dtype: int64

- The first column in the output is not a column of the Series object; it shows
the index labels.
- The second column holds the values of the Series object.
- This Series was created without specifying an index, so pandas automatically
created an integer index starting at zero and increasing by one.
- <b>Elements of a Series object can be accessed through the index using []. This tells
the Series which values to return given one or more index values (referred to in
pandas as labels). The following code retrieves the items in the series with labels
1 and 3.</b>

In [4]:
# return a Series with the rows with labels 1 and 3
s[[1, 3]]

1    2
3    4
dtype: int64

A Series object can be created with a user-defined index by specifying the labels for
the index using the index parameter.

In [5]:
# create a series using an explicit index
s = Series([1, 2, 3, 4],
           index = ['a', 'b', 'c', 'd'])
s

a    1
b    2
c    3
d    4
dtype: int64

In [6]:
# look up the items in the series having index labels 'a' and 'd'
s[['a', 'd']]

a    1
d    4
dtype: int64

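It is still possible to refer to the elements of the Series object by their
numerical position. Newer pandas releases deprecate this implicit fallback for [];
`s.iloc[[1, 2]]` requests positional lookup explicitly.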

In [7]:
# passing a list of integers to a Series that has
# non-integer index labels will look up based upon
# 0-based position, like an array
s[[1, 2]]

b    2
c    3
dtype: int64
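
The two ways of addressing a Series can also be spelled out explicitly. The following is a minimal sketch, assuming the same four-item Series as above; the names `by_label` and `by_position` are introduced here only for illustration.

```python
import pandas as pd

# the same four-item Series with string labels used above
s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

# .loc always interprets the keys as index labels
by_label = s.loc[['b', 'c']]

# .iloc always interprets the keys as 0-based positions
by_position = s.iloc[[1, 2]]

# both forms select the same two items
print(by_label.equals(by_position))
```

Using `.loc` and `.iloc` removes any ambiguity about whether an integer is meant as a label or as a position.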

The s.index property allows direct access to the index of the Series object.

In [8]:
# get only the index of the Series
s.index

Index(['a', 'b', 'c', 'd'], dtype='object')

In [9]:
# create a DatetimeIndex covering six consecutive days
dates = pd.date_range('2014-07-01', '2014-07-06')
dates

DatetimeIndex(['2014-07-01', '2014-07-02', '2014-07-03', '2014-07-04',
               '2014-07-05', '2014-07-06'],
              dtype='datetime64[ns]', freq='D')

In [10]:
# create a Series with values (representing temperatures)
# for each date in the index
temps1 = Series([80, 82, 85, 90, 83, 87],
                index = dates)
temps1

2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, dtype: int64

In [11]:
# calculate the mean of the values in the Series
temps1.mean()

84.5

In [12]:
# create a second series of values using the same index
temps2 = Series([70, 75, 69, 83, 79, 77],
                index = dates)
# the following aligns the two by their index values
# and calculates the difference at those matching labels
temp_diffs = temps1 - temps2
temp_diffs

2014-07-01    10
2014-07-02     7
2014-07-03    16
2014-07-04     7
2014-07-05     4
2014-07-06    10
Freq: D, dtype: int64
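
Alignment matters most when the two indexes do not match exactly. The following sketch uses two short, made-up Series (`a` and `b`, not part of the original notebook) whose dates only partly overlap, to show what the subtraction does with labels that appear on one side only.

```python
import pandas as pd

# two hypothetical Series whose date ranges overlap on two days only
a = pd.Series([80, 82, 85], index=pd.date_range('2014-07-01', periods=3))
b = pd.Series([75, 69, 83], index=pd.date_range('2014-07-02', periods=3))

# values are matched by label, not by position;
# labels present in only one Series produce NaN in the result
diff = a - b
print(diff)
```

The result covers the union of both indexes, and only the two shared dates carry a numeric difference.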

Time series data such as that shown here can also be accessed via the index or by an
offset into the Series object.

In [13]:
# lookup a value by date using the index
temp_diffs['2014-07-03']

16

In [14]:
# and also possible by integer position as if the
# series was an array
temp_diffs[2]

16

## The pandas DataFrame object
A pandas Series represents a single array of values, with an index label for each
value. If you want to have more than one Series of data that is aligned by a common
index, then a pandas DataFrame is used.

The following code creates a DataFrame object with two columns representing the
temperatures from the Series objects used earlier.

In [15]:
# create a DataFrame from the two series objects temps1 and temps2
# and give them column names
temps_df = DataFrame({'Missoula': temps1,
                      'Philadelphia': temps2})
temps_df

            Missoula  Philadelphia
2014-07-01        80            70
2014-07-02        82            75
2014-07-03        85            69
2014-07-04        90            83
2014-07-05        83            79
2014-07-06        87            77

In [16]:
# get the column with the name Missoula
temps_df['Missoula']

2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, Name: Missoula, dtype: int64

In [17]:
# likewise we can get just the Philadelphia column
temps_df['Philadelphia']

2014-07-01    70
2014-07-02    75
2014-07-03    69
2014-07-04    83
2014-07-05    79
2014-07-06    77
Freq: D, Name: Philadelphia, dtype: int64

The following code returns both columns, but in reversed order.

In [18]:
# return both columns in a different order
temps_df[['Philadelphia', 'Missoula']]

            Philadelphia  Missoula
2014-07-01            70        80
2014-07-02            75        82
2014-07-03            69        85
2014-07-04            83        90
2014-07-05            79        83
2014-07-06            77        87

You can also use property-style names to access the columns in a DataFrame,
provided the column name is a valid Python identifier.

In [19]:
# retrieve the Missoula column through property syntax
temps_df.Missoula

2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, Name: Missoula, dtype: int64

In [20]:
# calculate the temperature difference between the two cities
temps_df.Missoula - temps_df.Philadelphia

2014-07-01    10
2014-07-02     7
2014-07-03    16
2014-07-04     7
2014-07-05     4
2014-07-06    10
Freq: D, dtype: int64

In [21]:
# add a column to temps_df that contains the difference in temps
temps_df['Difference'] = temp_diffs
temps_df

            Missoula  Philadelphia  Difference
2014-07-01        80            70          10
2014-07-02        82            75           7
2014-07-03        85            69          16
2014-07-04        90            83           7
2014-07-05        83            79           4
2014-07-06        87            77          10

The names of the columns in a DataFrame are accessible via the DataFrame
object's .columns property, which itself is a pandas Index object.

In [22]:
# get the columns, which is also an Index object
temps_df.columns

Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')

The DataFrame (and Series) objects can be sliced to retrieve specific rows. A simple
example here shows how to select the second through fourth rows of temperature
difference values.

In [23]:
# slice the temp differences column for the rows at
# positions 1 up to (but not including) 4, as though it is an array
temps_df.Difference[1:4]

2014-07-02     7
2014-07-03    16
2014-07-04     7
Freq: D, Name: Difference, dtype: int64
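
Slicing by position and slicing by label treat the end of the range differently. The following sketch rebuilds the difference values as a standalone Series (the name `diffs` is introduced here for illustration) and selects the same three rows both ways.

```python
import pandas as pd

# the temperature differences from above, rebuilt as a standalone Series
dates = pd.date_range('2014-07-01', '2014-07-06')
diffs = pd.Series([10, 7, 16, 7, 4, 10], index=dates)

# positional slice: the end position (4) is excluded
by_position = diffs.iloc[1:4]

# label slice: both end labels are included
by_label = diffs.loc['2014-07-02':'2014-07-04']

print(by_position.equals(by_label))
```

Both slices return the second through fourth rows; the label-based form needs the last wanted date as its end point, while the positional form needs one past the last wanted position.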

Entire rows from a DataFrame can be retrieved using its `.loc` and `.iloc` properties.
The following code uses the .iloc property to return a Series object representing the
second row of temps_df, selected by the zero-based position of the row:

In [24]:
# get the row at array position 1
temps_df.iloc[1]

Missoula        82
Philadelphia    75
Difference       7
Name: 2014-07-02 00:00:00, dtype: int64

This has converted the row into a Series, with the column names of the DataFrame
pivoted into the index labels of the resulting Series.


In [25]:
# the names of the columns have become the index
# they have been 'pivoted'
temps_df.iloc[1].index

Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')

Rows can be explicitly accessed via index label using the .loc property. The following
code retrieves a row by the index label:

In [26]:
# retrieve row by index label using .loc
temps_df.loc['2014-07-03']

Missoula        85
Philadelphia    69
Difference      16
Name: 2014-07-03 00:00:00, dtype: int64

Specific rows in a DataFrame object can be selected using a list of integer positions.
The following code selects the values from the Difference column in rows at
locations 1, 3, and 5.

In [27]:
# get the values in the Difference column in rows 1, 3, and 5
# using 0-based location
temps_df.iloc[[1, 3, 5]].Difference

2014-07-02     7
2014-07-04     7
2014-07-06    10
Freq: 2D, Name: Difference, dtype: int64
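
Rows and a column can also be selected in a single indexing step. The following is one possible sketch, rebuilding temps_df from plain lists so that it runs on its own; the names `picked_by_label` and `picked_by_position` are illustrative.

```python
import pandas as pd

# rebuild the temperatures DataFrame from plain lists
dates = pd.date_range('2014-07-01', '2014-07-06')
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]},
                        index=dates)
temps_df['Difference'] = temps_df['Missoula'] - temps_df['Philadelphia']

# .loc takes row labels and a column label,
# so the positions are first translated into index labels
row_labels = temps_df.index[[1, 3, 5]]
picked_by_label = temps_df.loc[row_labels, 'Difference']

# .iloc takes row positions and a column position
col_position = temps_df.columns.get_loc('Difference')
picked_by_position = temps_df.iloc[[1, 3, 5], col_position]

print(picked_by_label)
```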

In [28]:
# which values in the Missoula column are > 82?
temps_df.Missoula > 82

2014-07-01    False
2014-07-02    False
2014-07-03     True
2014-07-04     True
2014-07-05     True
2014-07-06     True
Freq: D, Name: Missoula, dtype: bool

In [29]:
# return the rows where the temps for Missoula > 82
temps_df[temps_df.Missoula > 82]

            Missoula  Philadelphia  Difference
2014-07-03        85            69          16
2014-07-04        90            83           7
2014-07-05        83            79           4
2014-07-06        87            77          10

This technique of selection in pandas terminology is referred to as a Boolean
selection, and will form the basis of selecting data based upon its values.
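
Boolean selections can be combined. The following sketch, assuming the same temperature values as above, selects the days on which Missoula was above 82 and Philadelphia was below 80; the threshold of 80 and the name `warm_and_cool` are chosen here only for illustration.

```python
import pandas as pd

# rebuild the two temperature columns from plain lists
dates = pd.date_range('2014-07-01', '2014-07-06')
temps_df = pd.DataFrame({'Missoula': [80, 82, 85, 90, 83, 87],
                         'Philadelphia': [70, 75, 69, 83, 79, 77]},
                        index=dates)

# each comparison yields a Boolean Series; & combines them element-wise
# (the parentheses are required because & binds tighter than > and <)
warm_and_cool = (temps_df['Missoula'] > 82) & (temps_df['Philadelphia'] < 80)

# keep only the rows where the combined condition is True
selected = temps_df[warm_and_cool]
print(selected)
```

Use `|` for "or" and `~` for "not" in the same way; Python's `and`, `or`, and `not` keywords do not work element-wise on a Series.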

## Loading data from files and the Web
The data used in analyses is typically provided from other systems via files that are
created and updated at various intervals, dynamically via access over the Web, or
from various types of databases. The pandas library provides powerful facilities for
easy retrieval of data from a variety of data sources and converting it into pandas
objects. Here, we will briefly demonstrate this ease of use by loading data from files
and from financial web services.

## Loading CSV data from files
The pandas library provides built-in support for loading data in .csv format, a
common means of storing structured data in text files. Provided with the code from
this book is a file, data/test1.csv, in the CSV format, which represents some time
series information. The specific content isn't important right now, as we just want to
demonstrate the ease of loading data into a DataFrame.

The following listing shows the content of this file, as displayed from IPython
through an operating system command (the command to use differs by operating system).

```
date,0,1,2
2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115
2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949
2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532
2000-01-04 00:00:00,0.782130444001,-1.72066965573,-1.10824167327
2000-01-05 00:00:00,-1.86701699823,-0.528368292754,-2.48830894087
2000-01-06 00:00:00,2.56928022646,-0.471901478927,-0.835033249865
2000-01-07 00:00:00,-0.39932258251,-0.676426550985,-0.0112559158931
2000-01-08 00:00:00,1.64299299394,1.01341997845,1.43566709724
2000-01-09 00:00:00,1.14730764657,2.13799951538,0.554171306191
2000-01-10 00:00:00,0.933765825769,1.38715526486,-0.560142729978
```

This information can be easily imported into a DataFrame using the pd.read_csv()
function.

In [30]:
# read the contents of the file into a DataFrame
df = pd.read_csv('data/test1.csv')
df

                  date         0         1         2
0  2000-01-01 00:00:00  1.103763 -1.909979 -0.808956
1  2000-01-02 00:00:00  1.188917  0.581120  0.861597
2  2000-01-03 00:00:00 -0.964200  0.779764  1.829062
3  2000-01-04 00:00:00  0.782130 -1.720670 -1.108242
4  2000-01-05 00:00:00 -1.867017 -0.528368 -2.488309
5  2000-01-06 00:00:00  2.569280 -0.471901 -0.835033
6  2000-01-07 00:00:00 -0.399323 -0.676427 -0.011256
7  2000-01-08 00:00:00  1.642993  1.013420  1.435667
8  2000-01-09 00:00:00  1.147308  2.138000  0.554171
9  2000-01-10 00:00:00  0.933766  1.387155 -0.560143

In [31]:
# the contents of the date column
df.date

0    2000-01-01 00:00:00
1    2000-01-02 00:00:00
2    2000-01-03 00:00:00
3    2000-01-04 00:00:00
4    2000-01-05 00:00:00
5    2000-01-06 00:00:00
6    2000-01-07 00:00:00
7    2000-01-08 00:00:00
8    2000-01-09 00:00:00
9    2000-01-10 00:00:00
Name: date, dtype: object

In [32]:
# we can get the first value in the date column
df.date[0]

'2000-01-01 00:00:00'

In [33]:
type(df.date[0])


str

The values in the date column were loaded as strings. To tell pandas to convert
this data directly into a pandas date object, we can use the parse_dates parameter
of the pd.read_csv() function. The following code tells pandas to convert the
content of the 'date' column into actual Timestamp objects.

In [34]:
# read the data and tell pandas the date column should be
 # a date in the resulting DataFrame
df = pd.read_csv('data/test1.csv', parse_dates=['date'])
df

        date         0         1         2
0 2000-01-01  1.103763 -1.909979 -0.808956
1 2000-01-02  1.188917  0.581120  0.861597
2 2000-01-03 -0.964200  0.779764  1.829062
3 2000-01-04  0.782130 -1.720670 -1.108242
4 2000-01-05 -1.867017 -0.528368 -2.488309
5 2000-01-06  2.569280 -0.471901 -0.835033
6 2000-01-07 -0.399323 -0.676427 -0.011256
7 2000-01-08  1.642993  1.013420  1.435667
8 2000-01-09  1.147308  2.138000  0.554171
9 2000-01-10  0.933766  1.387155 -0.560143

In [35]:
# verify the type now is date
# in pandas, this is actually a Timestamp
type(df.date[0])

pandas._libs.tslib.Timestamp
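
If a DataFrame has already been loaded with its dates as strings, the column can be converted afterwards. The following is a small sketch using `pd.to_datetime()` on a hypothetical two-row frame (the frame `df_strings` is made up here and is not the book's file).

```python
import pandas as pd

# a hypothetical frame whose date column holds plain strings
df_strings = pd.DataFrame({'date': ['2000-01-01 00:00:00',
                                    '2000-01-02 00:00:00'],
                           'value': [1.103763, 1.188917]})

# before the conversion, each entry is a Python str
was_str = isinstance(df_strings['date'].iloc[0], str)

# convert the whole column in one call
df_strings['date'] = pd.to_datetime(df_strings['date'])

# afterwards, each entry is a pandas Timestamp
first = df_strings['date'].iloc[0]
print(type(first))
```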

In [36]:
# unfortunately the index is numeric, which makes
# accessing data by date more complicated
df.index


RangeIndex(start=0, stop=10, step=1)

This can be rectified using the index_col parameter of the pd.read_csv() function
to specify which column in the file should be used as the index.

In [37]:
# read in again, now specify the date column as being the
# index of the resulting DataFrame
df = pd.read_csv('data/test1.csv',
                 parse_dates=['date'],
                 index_col='date')
df

                   0         1         2
date                                    
2000-01-01  1.103763 -1.909979 -0.808956
2000-01-02  1.188917  0.581120  0.861597
2000-01-03 -0.964200  0.779764  1.829062
2000-01-04  0.782130 -1.720670 -1.108242
2000-01-05 -1.867017 -0.528368 -2.488309
2000-01-06  2.569280 -0.471901 -0.835033
2000-01-07 -0.399323 -0.676427 -0.011256
2000-01-08  1.642993  1.013420  1.435667
2000-01-09  1.147308  2.138000  0.554171
2000-01-10  0.933766  1.387155 -0.560143

In [38]:
df.index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10'],
              dtype='datetime64[ns]', name='date', freq=None)
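
With the dates as the index, rows can be looked up by date. The following sketch reads the first three lines of the file from an in-memory string instead of from disk (an assumption made so that the example runs on its own) and then selects by date label.

```python
import io
import pandas as pd

# the first three data rows of data/test1.csv, held in a string
csv_text = """date,0,1,2
2000-01-01 00:00:00,1.10376250134,-1.90997889703,-0.808955536115
2000-01-02 00:00:00,1.18891664768,0.581119740849,0.86159734949
2000-01-03 00:00:00,-0.964200042412,0.779764393246,1.82906224532
"""

# parse the date column and use it as the index, as in the cell above
df_small = pd.read_csv(io.StringIO(csv_text),
                       parse_dates=['date'],
                       index_col='date')

# a single row by its date label (returned as a Series)
row = df_small.loc['2000-01-02']

# a range of rows by date labels (both ends included)
first_two = df_small.loc['2000-01-01':'2000-01-02']
print(first_two)
```

Note that the column names read from the header are the strings '0', '1', and '2', so a value is addressed as `row['0']` rather than `row[0]`.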

## Loading data from the Web
Data from the Web can also be easily read via pandas. To demonstrate this, we
will perform a simple load of actual stock data. The example here uses the
`DataReader` function, which is able to read data from various web sources,
one of which is stock data from Yahoo! Finance. It was originally part of
`pandas.io.data` and has since moved to the separate pandas-datareader package.

The following reads the data of the previous three months for GOOG (based on the
current date), and prints the five most recent days of stock data:

In [39]:
# imports for reading data from Yahoo!
# DataReader now lives in the separate pandas-datareader package
from pandas_datareader.data import DataReader
from datetime import date
from dateutil.relativedelta import relativedelta
# read the last three months of data for GOOG
goog = DataReader("GOOG", "yahoo",
                  date.today() +
                  relativedelta(months=-3))
# the result is a DataFrame
# and this gives us the 5 most recent prices
goog.tail()

In [None]:
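# read a fixed date range for Ford (F) through pandas-datareader;
# the 'google' source may be unavailable in recent pandas-datareader releases
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
f = web.DataReader('F', 'google', start, end)
# .ix is deprecated; use .loc for label-based lookup
f.loc['2010-01-04']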