# Imports, etc.

This is how pretty much every data analysis session with pandas starts.

In [1]:
# Import numpy, pandas, DataFrame, and Series.
import numpy as np
import pandas as pd
from pandas import DataFrame, Series

# Set some pandas options.
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

# Some items for matplotlib.
%matplotlib inline
import matplotlib.pyplot as plt
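# Note: pd.options.display.mpl_style was removed from later pandas releases;
# matplotlib's own style mechanism is the suggested replacement.
plt.style.use('ggplot')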

# The pandas `Series` Object

Working with pandas means spending most of your time with two kinds of objects it provides: `DataFrame` and `Series`.

In [4]:
# Create a four-item Series.
s = Series([1, 2, 3, 4])
print(type(s))
print(s)

<class 'pandas.core.series.Series'>
0    1
1    2
2    3
3    4
dtype: int64


The left column in the printed Series is the index label of each element. The right column holds the elements themselves. All elements in a Series must be of the same type. Values of different types will be coerced to a common type.
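
To make the coercion concrete, here is a minimal sketch. The two Series and their variable names are made up for illustration; they are not part of the notebook above.

```python
import pandas as pd

# Mixing integers with a float upcasts every element to float64.
mixed_numbers = pd.Series([1, 2, 3.5])
print(mixed_numbers.dtype)

# Mixing numbers with a string falls back to the generic object dtype.
mixed_kinds = pd.Series([1, 'two', 3.0])
print(mixed_kinds.dtype)
```

The object dtype keeps each value as an ordinary Python object, so numeric operations on such a Series are slower and can fail on the non-numeric entries.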

Accessing a single element by index works much like it does for a list, but selecting several elements at once requires double brackets (a list inside the indexer). It's important to remember that the bracket notation below refers to the values of the indices as labels, rather than as zero-based positions.

In [7]:
# Return a Series with the rows having the index labels 1 and 3.
print(s[[1, 3]])
print(type(s[[1, 3]]))

1    2
3    4
dtype: int64
<class 'pandas.core.series.Series'>
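
With the default index, labels and positions happen to coincide, which hides the difference. The sketch below uses a hypothetical Series (`shuffled`, not from the notebook) whose integer labels are out of order, and uses the explicit `.loc` (label) and `.iloc` (position) indexers to show the two readings side by side.

```python
import pandas as pd

# Hypothetical Series whose integer labels are deliberately out of order.
shuffled = pd.Series([10, 20, 30], index=[3, 1, 2])

by_label = shuffled.loc[[1, 3]]      # rows labelled 1 and 3
by_position = shuffled.iloc[[1, 2]]  # second and third rows

print(by_label)
print(by_position)
```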


The elements in a Series object can be given custom labels. The lengths of the lists used for elements and labels must be identical.

In [8]:
s = Series(
    [1, 2, 3, 4],
    index = ['a', 'b', 'c', 'd']
)

print(s)

a    1
b    2
c    3
d    4
dtype: int64
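
As a quick check of the length rule, this sketch passes four elements but only three labels and catches the resulting error. The flag name `mismatch_raised` is just an illustrative helper.

```python
import pandas as pd

try:
    # Four elements, three labels: pandas refuses to build the Series.
    pd.Series([1, 2, 3, 4], index=['a', 'b', 'c'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)
```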


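The new labels can be used to return elements. Numeric positions can still be used, though; in recent pandas versions the explicit `.iloc` indexer is the reliable way to do so, since positional lookups through plain brackets are deprecated when the index holds non-integer labels.

In [9]:
print(s[['a', 'd']])
print(s.iloc[[0, 3]])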

a    1
d    4
dtype: int64
a    1
d    4
dtype: int64


It's possible to directly access the index and the underlying `numpy` array of a `Series` object.

In [13]:
print(s.index)
print(type(s.index))
print(s.values)
print(type(s.values))

Index(['a', 'b', 'c', 'd'], dtype='object')
<class 'pandas.core.index.Index'>
[1 2 3 4]
<class 'numpy.ndarray'>
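
A small sketch of the same idea using `Index.tolist()` and `Series.to_numpy()`, which newer pandas documentation tends to favor over `.values`. The Series is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])

labels = s.index.tolist()  # plain Python list of the labels
values = s.to_numpy()      # NumPy array of the elements

print(labels)
print(values.sum())
```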


pandas can create special time series indexes in which dates serve as labels, using the `pd.date_range()` function. Dates are given in ISO YYYY-MM-DD format.

The start and end dates defined are inclusive.

In [21]:
# Create a DatetimeIndex covering a range of dates, to be used as labels.
dates = pd.date_range('2014-07-01', '2014-07-06')
print(type(dates))
for d in dates:
    print(d)

<class 'pandas.tseries.index.DatetimeIndex'>
2014-07-01 00:00:00
2014-07-02 00:00:00
2014-07-03 00:00:00
2014-07-04 00:00:00
2014-07-05 00:00:00
2014-07-06 00:00:00
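
Instead of an end date, `pd.date_range()` also accepts a number of periods and a frequency. A minimal sketch, assuming daily frequency as in the example above (the names `by_end` and `by_count` are illustrative):

```python
import pandas as pd

# Same six days, specified two ways.
by_end = pd.date_range('2014-07-01', '2014-07-06')
by_count = pd.date_range('2014-07-01', periods=6, freq='D')

print(by_end.equals(by_count))
print(len(by_end))
```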


Such an index is of little use on its own, but it can be assigned to a Series, where its dates serve as a special kind of label.

In [18]:
# Create a Series of integers that represent temperatures.
temps1 = Series([80, 82, 85, 90, 83, 87], index = dates)

print(type(temps1))
print(temps1)

<class 'pandas.core.series.Series'>
2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, dtype: int64
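
With dates as labels, a single day can be looked up by its date. The sketch below rebuilds `temps1` so it runs on its own and shows two equivalent lookups through `.loc`, one with a date string and one with an explicit `pd.Timestamp`.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
temps1 = pd.Series([80, 82, 85, 90, 83, 87], index=dates)

# Look up one day by a date string or by an explicit Timestamp.
july_4_from_string = temps1.loc['2014-07-04']
july_4_from_timestamp = temps1.loc[pd.Timestamp('2014-07-04')]

print(july_4_from_string)
print(july_4_from_timestamp)
```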


Touching on the realm of statistics, `pandas` offers some convenience functions for descriptive summaries, thanks to being built on top of `numpy`, but it should by no means be mistaken for a full statistics package.

In [20]:
print(temps1.mean())

84.5
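
A few more of these convenience functions, sketched on the same temperature values (rebuilt here without the date index, which the summaries do not need):

```python
import pandas as pd

temps1 = pd.Series([80, 82, 85, 90, 83, 87])

print(temps1.min(), temps1.max(), temps1.median())

# describe() bundles count, mean, std, min, quartiles, and max.
summary = temps1.describe()
print(summary)
```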


pandas can do element-wise arithmetic, as one would expect.

In [22]:
temps2 = Series([70, 75, 69, 83, 79, 77], index = dates)
print(temps2)

temp_diffs = temps1 - temps2
print(temp_diffs)

2014-07-01    70
2014-07-02    75
2014-07-03    69
2014-07-04    83
2014-07-05    79
2014-07-06    77
Freq: D, dtype: int64
2014-07-01    10
2014-07-02     7
2014-07-03    16
2014-07-04     7
2014-07-05     4
2014-07-06    10
Freq: D, dtype: int64
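
The subtraction above lines up cleanly because both Series share the same index. Arithmetic matches elements by label, not by position, so labels present in only one operand produce `NaN`. A minimal sketch with two hypothetical Series (`left` and `right`) whose indexes only partly overlap:

```python
import pandas as pd

left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# 'a' and 'd' appear on one side only, so their sums are NaN.
total = left + right
print(total)
```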


A Series with a time series index can still be accessed by position, most reliably through `.iloc`.

In [25]:
print(temp_diffs.iloc[2])

16


# The pandas `DataFrame` Object

Here's where things get cool and interesting. A `DataFrame` object in pandas is very much like the kind found in R. It's analogous to a database table, with labels for rows (observations) and columns (variables). Each column is itself a `Series` object.

A `DataFrame` can be created by putting together multiple `Series`.

In [26]:
# Combine temps1 and temps2 into a two-dimensional DataFrame
# and assign them some names.
temps_df = DataFrame(
    {'Missoula': temps1, 'Philadelphia': temps2}
)

print(type(temps_df))
print(temps_df)

<class 'pandas.core.frame.DataFrame'>
            Missoula  Philadelphia
2014-07-01        80            70
2014-07-02        82            75
2014-07-03        85            69
2014-07-04        90            83
2014-07-05        83            79
2014-07-06        87            77
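
The same frame can also be built without intermediate Series, by passing plain lists plus a shared index. This is only a sketch of an alternative construction; the values are copied from the example above.

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')

# Each dict key becomes a column label; the index supplies the row labels.
temps_df = pd.DataFrame(
    {
        'Missoula': [80, 82, 85, 90, 83, 87],
        'Philadelphia': [70, 75, 69, 83, 79, 77],
    },
    index=dates,
)

print(temps_df.shape)
```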


Passing a column label returns the relevant `Series`.

In [30]:
print(type(temps_df['Missoula']))
print(temps_df['Missoula'])
print()
print(type(temps_df['Philadelphia']))
print(temps_df['Philadelphia'])

<class 'pandas.core.series.Series'>
2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, Name: Missoula, dtype: int64

<class 'pandas.core.series.Series'>
2014-07-01    70
2014-07-02    75
2014-07-03    69
2014-07-04    83
2014-07-05    79
2014-07-06    77
Freq: D, Name: Philadelphia, dtype: int64


A list of labels can be passed, returning another `DataFrame`. The list elements can be in any order, and the columns will be returned in that order. Notice the extra brackets.

In [31]:
print(type(temps_df[['Philadelphia', 'Missoula']]))
print(temps_df[['Philadelphia', 'Missoula']])

<class 'pandas.core.frame.DataFrame'>
            Philadelphia  Missoula
2014-07-01            70        80
2014-07-02            75        82
2014-07-03            69        85
2014-07-04            83        90
2014-07-05            79        83
2014-07-06            77        87


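Conveniently, if a column label is a valid Python identifier (no spaces, for one) and does not clash with an existing `DataFrame` attribute or method, it can be accessed like an object property.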

In [33]:
print(type(temps_df.Missoula))
print(temps_df.Missoula)

<class 'pandas.core.series.Series'>
2014-07-01    80
2014-07-02    82
2014-07-03    85
2014-07-04    90
2014-07-05    83
2014-07-06    87
Freq: D, Name: Missoula, dtype: int64
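
Attribute access has limits, and bracket notation always works. The frame below is hypothetical: one label contains a space, and the other (`count`) shares its name with a `DataFrame` method.

```python
import pandas as pd

# Hypothetical frame: one label has a space, another shadows a method.
df = pd.DataFrame({'max temp': [90, 83], 'count': [3, 4]})

print(df['max temp'])      # a space rules out attribute access
print(df['count'])         # brackets return the column
print(callable(df.count))  # the attribute is still the DataFrame.count method
```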


We can do some arithmetic operations, like so.

In [35]:
# Calculate the temperature differences between the two cities.
print(type(temps_df.Missoula - temps_df.Philadelphia))
print(temps_df.Missoula - temps_df.Philadelphia)

<class 'pandas.core.series.Series'>
2014-07-01    10
2014-07-02     7
2014-07-03    16
2014-07-04     7
2014-07-05     4
2014-07-06    10
Freq: D, dtype: int64


New columns can be added to a `DataFrame` by using array indexer bracket notation.

In [38]:
temps_df['Difference'] = temp_diffs
print(temps_df)

            Missoula  Philadelphia  Difference
2014-07-01        80            70          10
2014-07-02        82            75           7
2014-07-03        85            69          16
2014-07-04        90            83           7
2014-07-05        83            79           4
2014-07-06        87            77          10
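
Bracket assignment modifies the frame in place. When the original should stay untouched, `DataFrame.assign()` returns a new frame with the extra column instead. A short sketch on a three-row version of the data (`with_diff` is an illustrative name):

```python
import pandas as pd

temps_df = pd.DataFrame({
    'Missoula': [80, 82, 85],
    'Philadelphia': [70, 75, 69],
})

# assign() leaves temps_df alone and returns a copy with the new column.
with_diff = temps_df.assign(
    Difference=temps_df['Missoula'] - temps_df['Philadelphia']
)

print(with_diff)
print(list(temps_df.columns))
```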


`DataFrame` objects all have a `columns` attribute.

In [40]:
print(type(temps_df.columns))
print(temps_df.columns)

<class 'pandas.core.index.Index'>
Index(['Missoula', 'Philadelphia', 'Difference'], dtype='object')


A column, being a `Series`, can be sliced by position. As with Python lists, the end of the slice is exclusive.

In [41]:
print(temps_df.Difference[1:4])

2014-07-02     7
2014-07-03    16
2014-07-04     7
Freq: D, Name: Difference, dtype: int64
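
The same rows can be selected by date labels through `.loc`. The difference is that label slices include both endpoints, whereas positional slices exclude the end. A sketch, with the `Difference` column rebuilt as a standalone Series named `diffs`:

```python
import pandas as pd

dates = pd.date_range('2014-07-01', '2014-07-06')
diffs = pd.Series([10, 7, 16, 7, 4, 10], index=dates, name='Difference')

by_position = diffs.iloc[1:4]                    # end position excluded
by_label = diffs.loc['2014-07-02':'2014-07-04']  # both end dates included

print(by_position.equals(by_label))
```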
