
# The pandas Series Object
The two primary data structures in pandas are the Series and the DataFrame
objects. In this chapter, we will examine the Series object and how it builds on the
features of a NumPy ndarray to provide operations such as indexing, axis labeling,
alignment, handling of missing data, and merging across multiple series of data.

Specifically, this chapter covers the following topics:

- Creating and initializing a Series and its index
- Determining the shape of a Series object
- Heads, tails, uniqueness, and counts of values
- Looking up values in a Series object
- Boolean selection
- Alignment via index labels
- Arithmetic operations on a Series object
- Reindexing a Series object
- Modifying a Series in-place
- The special case of Not-A-Number (NaN)
- Slicing Series objects

## The Series object
The Series is the primary building block of pandas. A Series represents a
one-dimensional labeled indexed array based on the NumPy ndarray. Like
an array, a Series can hold zero or more values of any single data type.


In [8]:
# bring in NumPy and pandas
import numpy as np
import pandas as pd

In [9]:
# set some pandas options for controlling output display
pd.set_option('display.notebook_repr_html', False)
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 10)

In [10]:
# create one item Series
s1 = pd.Series(2)
s1

0    2
dtype: int64

In [11]:
# get value with label 0
s1[0]


2

In [12]:
# create a series of multiple items from a list
s2 = pd.Series([1, 2, 3, 4, 5])
s2

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [13]:
# get the values in the Series
s2.values

array([1, 2, 3, 4, 5], dtype=int64)

In [14]:
# get the index of the Series
s2.index

RangeIndex(start=0, stop=5, step=1)

In [15]:
# explicitly create an index
# index is alpha, not integer
s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s3

a    1
b    2
c    3
dtype: int64

In [16]:
s3.index

Index(['a', 'b', 'c'], dtype='object')

In [17]:
# lookup by label value, not integer position
s3['c']

3

In [18]:
# create Series from an existing index
# the scalar value will be copied to each index label
s4 = pd.Series(2, index=s2.index)
s4

0    2
1    2
2    2
3    2
4    2
dtype: int64

In [19]:
# generate a Series from 5 normal random numbers
np.random.seed(123456)
pd.Series(np.random.randn(5))

0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
dtype: float64

In [20]:
# 0 through 9
pd.Series(np.linspace(0, 9, 10))

0    0.0
1    1.0
2    2.0
3    3.0
4    4.0
5    5.0
6    6.0
7    7.0
8    8.0
9    9.0
dtype: float64

In [21]:
# 0 through 8
pd.Series(np.arange(0, 9))

0    0
1    1
2    2
3    3
4    4
5    5
6    6
7    7
8    8
dtype: int32

In [22]:
# create Series from dict
s6 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 4})
s6

a    1
b    2
c    3
d    4
dtype: int64

## Size, shape, uniqueness, and counts of values

In [23]:
# example series, which also contains a NaN
s = pd.Series([0, 1, 1, 2, 3, 4, 5, 6, 7, np.nan])
s

0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [24]:
# length of the Series
len(s)

10

In [25]:
# .size is also the # of items in the Series
s.size

10

In [26]:
# .shape is a tuple with one value
s.shape

(10,)

The number of values that are not NaN can be found by using the
`.count()` method:

In [27]:
# count() returns the number of non-NaN values
s.count()


9

To determine all of the unique values in a Series, pandas provides the
`.unique()` method:

In [28]:
# all unique values
s.unique()

array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7., nan])

Also, the count of each of the unique items in a Series can be obtained using
`.value_counts()`:


In [29]:
# count of each non-NaN value, in descending order of count
s.value_counts()


1.0    2
7.0    1
6.0    1
5.0    1
4.0    1
3.0    1
2.0    1
0.0    1
dtype: int64

In [30]:
# another example; the NaN is again excluded from the counts
abc = pd.Series([1, 2, 3, 3, 3, 3, 3, 4, 4, 4, np.nan])
abc.value_counts()

3.0    5
4.0    3
2.0    1
1.0    1
dtype: int64
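
As a small aside on the counting methods above, the following sketch shows how NaN can be included in the tally. It assumes the `dropna` parameter of `.value_counts()` and the `.nunique()` method; the `demo` Series is made up for illustration and is not part of the chapter's data.

```python
import numpy as np
import pandas as pd

# illustrative Series with repeated values and one NaN (hypothetical data)
demo = pd.Series([1, 2, 3, 3, 3, np.nan])

# default behavior: NaN is excluded from the tally
counts = demo.value_counts()

# dropna=False adds a row that counts the NaN values
counts_with_nan = demo.value_counts(dropna=False)

# number of distinct non-NaN values
n_unique = demo.nunique()

print(counts)
print(counts_with_nan)
print(n_unique)
```

The two tallies differ only by the extra NaN row, so their sums match `.count()` and `len()` respectively.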

## Peeking at data with heads, tails, and take
pandas provides the .head() and .tail() methods to examine just the first few,
or last, records in a Series. By default, these return the first or last five rows,
respectively, but you can use the n parameter or just pass an integer to specify
the number of rows:

In [32]:
# first five
s.head()


0    0.0
1    1.0
2    1.0
3    2.0
4    3.0
dtype: float64

In [33]:
# first three
s.head(n=3)  # s.head(3) is equivalent

0    0.0
1    1.0
2    1.0
dtype: float64

In [34]:
# last five
s.tail()

5    4.0
6    5.0
7    6.0
8    7.0
9    NaN
dtype: float64

In [35]:
# last 3
s.tail(n=3)  # equivalent to s.tail(3)

7    6.0
8    7.0
9    NaN
dtype: float64

The `.take()` method will return the rows in a series that correspond to the
zero-based positions specified in a list:

In [36]:
# only take specific items
s.take([0, 3, 9])


0    0.0
3    2.0
9    NaN
dtype: float64

## Looking up values in Series
Values in a Series object can be retrieved using the [] operator and passing either
a single index label or a list of index labels. The following code retrieves the value
associated with the index label 'a' of the s3 series defined earlier:


In [39]:
s3

a    1
b    2
c    3
dtype: int64

In [40]:
# single item lookup
s3['a']


1

In [41]:
# lookup by position since the index is not an integer
# (newer pandas releases deprecate this fallback; s3.iloc[1] is explicit)
s3[1]

2

In [42]:
# multiple items
s3[['a', 'c']]

a    1
c    3
dtype: int64

In [43]:
# series with an integer index, but not starting with 0
s5 = pd.Series([1, 2, 3], index=[10, 11, 12])
s5

10    1
11    2
12    3
dtype: int64

The following code looks up the value at the index label 11. The lookup is
label-based because both the index and the value passed to the [] operator
are integers:

In [44]:
# by value as value passed and index are both integer
s5[11]

2

To alleviate the potential confusion in determining label-based lookup versus
position-based lookup, index label based lookup can be enforced using the
`.loc[]` accessor:

In [45]:
# force lookup by index label
s5.loc[12]

3

Lookup by position can be enforced using the .iloc[] accessor:

In [46]:
# forced lookup by location / position
s5.iloc[1]

2

In [47]:
# multiple items by label (loc)
s5.loc[[12, 10]]

12    3
10    1
dtype: int64

In [48]:
# multiple items by location / position (iloc)
s5.iloc[[0, 2]]


10    1
12    3
dtype: int64

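If a position passed to .iloc[] in a list is out of bounds, an exception is raised.
pandas 1.0 and later also raise a KeyError when .loc[] is passed a list containing
a label that does not exist (use .reindex() there to get NaN for missing labels).
In the pandas version used here, .loc[] instead returns NaN as the value for that label: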

In [49]:
# -1 and 15 will be NaN
s5.loc[[12, -1, 15]]

 12    3.0
-1     NaN
 15    NaN
dtype: float64
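
The following is a minimal sketch of a lookup that tolerates missing labels across pandas versions. It assumes `.reindex()` as the replacement for passing absent labels to `.loc[]`, and it wraps the `.loc[]` call in a `try` block because the outcome (NaN or `KeyError`) depends on the installed release.

```python
import pandas as pd

s5 = pd.Series([1, 2, 3], index=[10, 11, 12])

# .reindex() accepts labels that are absent and fills them with NaN
safe = s5.reindex([12, -1, 15])
print(safe)

# .loc[] with a list containing absent labels:
# NaN in the older release used in this chapter,
# KeyError in pandas 1.0 and later
try:
    print(s5.loc[[12, -1, 15]])
except KeyError as err:
    print("KeyError:", err)
```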

A Series also has a property .ix that can be used to look up items either by label or
by zero-based array position. It is deprecated in the pandas version used here and was
removed in pandas 1.0. To demonstrate how it behaves, let's revisit the s3 series:

In [50]:
# reminder of the contents of s3
s3

a    1
b    2
c    3
dtype: int64

In [51]:
# label based lookup
s3.ix[['a', 'c']]


.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated
  


a    1
c    3
dtype: int64

In [52]:
# position based lookup
s3.ix[[1, 2]]

b    2
c    3
dtype: int64

In [53]:
# this looks up by label and not position
# note that 1,2 have NaN as those labels do not exist
# in the index
s5.ix[[1, 2, 10, 11]]

.ix is deprecated. Please use
.loc for label based indexing or
.iloc for positional indexing

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#ix-indexer-is-deprecated


1     NaN
2     NaN
10    1.0
11    2.0
dtype: float64

Because of this ambiguity between labels and positions, use of .ix is
discouraged. Use .loc[] for label-based lookup and .iloc[] for position-based
lookup instead. They are explicit about intent and also perform better than .ix.
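
As a sketch of how the `.ix` lookups in this section translate to the explicit accessors, the following code repeats them with `.loc[]` and `.iloc[]`. The result names (`by_label`, `by_position`, and so on) are illustrative only.

```python
import pandas as pd

s3 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s5 = pd.Series([1, 2, 3], index=[10, 11, 12])

# label-based: replaces s3.ix[['a', 'c']]
by_label = s3.loc[['a', 'c']]

# position-based: replaces s3.ix[[1, 2]]
by_position = s3.iloc[[1, 2]]

# with an integer index the intent is now unambiguous
int_labels = s5.loc[[10, 11]]       # labels 10 and 11
int_positions = s5.iloc[[0, 1]]     # positions 0 and 1

print(by_label)
print(by_position)
print(int_labels)
print(int_positions)
```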

## Alignment via index labels
A fundamental difference between a NumPy ndarray and a pandas Series is the
ability of a Series to automatically align data from another Series based on label
values before performing an operation.
We will examine alignment using the following two Series objects:

In [54]:
s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s6

a    1
b    2
c    3
d    4
dtype: int64

In [55]:
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])
s7

d    4
c    3
b    2
a    1
dtype: int64

In [56]:
# add them
s6 + s7


a    2
b    4
c    6
d    8
dtype: int64

In [57]:
# see how different from adding numpy arrays
a1 = np.array([1, 2, 3, 4])
a2 = np.array([4, 3, 2, 1])
a1 + a2

array([5, 5, 5, 5])
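
To make the contrast concrete, the sketch below adds the same two Series with and without alignment. It assumes that dropping down to the underlying arrays through `.values` is an acceptable way to opt out of alignment; the names `aligned` and `positional` are illustrative.

```python
import pandas as pd

s6 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s7 = pd.Series([4, 3, 2, 1], index=['d', 'c', 'b', 'a'])

# Series addition matches values by label first
aligned = s6 + s7

# the underlying NumPy arrays are added purely by position
positional = s6.values + s7.values

print(aligned)
print(positional)
```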

## Arithmetic operations
Arithmetic operations `(+, -, /, *, and so on)` can be applied either to a Series
or between two Series objects. When applied to a single Series, the operation
is applied to all of the values in that Series. The following code demonstrates
arithmetic operations applied to a Series object by multiplying the values in
s3 by 2. The result is a new Series with the new values (`s3` is unchanged):

In [58]:
# multiply all values in s3 by 2
s3 * 2

a    2
b    4
c    6
dtype: int64

In [59]:
# scalar series using s3's index
t = pd.Series(2, s3.index)
s3 * t

a    2
b    4
c    6
dtype: int64

To reinforce the point that alignment is being performed when applying arithmetic
operations across two Series objects, look at the following two Series as examples:

In [60]:
# we will add this to s9
s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s8

a    1
b    2
c    3
d    5
dtype: int64

In [63]:
# going to add this to s8
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})
s9

b     6
c     7
d     9
e    10
dtype: int64

These two Series objects share only the index labels 'b', 'c', and 'd'.
Adding them gives the following result:

In [64]:
# NaN results for a and e
# demonstrates alignment
s8 + s9

a     NaN
b     8.0
c    10.0
d    14.0
e     NaN
dtype: float64
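
When NaN for unmatched labels is not wanted, one option is the method form of the operator. The sketch below assumes the `fill_value` parameter of `Series.add()`, which substitutes a value for a label that is missing from one of the two operands before the addition takes place.

```python
import pandas as pd

s8 = pd.Series({'a': 1, 'b': 2, 'c': 3, 'd': 5})
s9 = pd.Series({'b': 6, 'c': 7, 'd': 9, 'e': 10})

# the operator form leaves NaN where a label is missing on one side
plain = s8 + s9

# the method form treats the missing side as 0 instead
filled = s8.add(s9, fill_value=0)

print(plain)
print(filled)
```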

In [65]:
# going to add this to s11
s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s10


a    1.0
a    2.0
b    3.0
dtype: float64

In [66]:
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])

s11

a    4.0
a    5.0
c    6.0
dtype: float64

In [67]:
# will result in four 'a' index labels
s10 + s11

a    5.0
a    6.0
a    6.0
a    7.0
b    NaN
c    NaN
dtype: float64

The reason for this is that when labels are duplicated, alignment pairs every
occurrence of a label in one Series with every occurrence of the same label in the
other, which is a Cartesian product of the matching rows. s10 contains two 'a' labels,
and s11 also contains two 'a' labels, so there are four combinations, and the operation
is applied to each of them: `1+4, 1+5, 2+4 and 2+5`. There is one 'b' label from s10
and one 'c' label from s11. Since there is no matching label for either in the other
Series object, each results in a single row with a NaN value.

So, remember that an index can have duplicate labels, and that during alignment each
duplicated label produces a number of rows equal to the product of the number of times
it occurs in each Series.
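
The row count for a duplicated label can be checked with a few lines of code. This is only a sketch: it counts the occurrences of 'a' by comparing the index with the label, and it assumes `Index.is_unique` as a quick test for whether duplicates are present at all.

```python
import pandas as pd

s10 = pd.Series([1.0, 2.0, 3.0], index=['a', 'a', 'b'])
s11 = pd.Series([4.0, 5.0, 6.0], index=['a', 'a', 'c'])

result = s10 + s11

# occurrences of 'a' in each operand
left_a = (s10.index == 'a').sum()
right_a = (s11.index == 'a').sum()

# expected number of 'a' rows after alignment
expected_a = left_a * right_a
actual_a = (result.index == 'a').sum()

print(expected_a, actual_a)

# a cheap guard before doing arithmetic on two Series
print(s10.index.is_unique, s11.index.is_unique)
```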

## The special case of Not-A-Number (NaN)

pandas mathematical operators and functions handle NaN differently from NumPy, in a
way that does not break the computations. pandas is lenient with missing data,
assuming that it is a common situation.
To demonstrate the difference, the following code calculates the mean of a NumPy
array, first without and then with a NaN value:

In [69]:
# mean of numpy array values
nda = np.array([1, 2, 3, 4, 5])
nda.mean()


3.0

In [70]:
# mean of numpy array values with a NaN
nda = np.array([1, 2, 3, 4, np.nan])
nda.mean()

nan

In [71]:
# ignores NaN values
s = pd.Series(nda)
s.mean()

2.5

In [72]:
# handle NaN values like NumPy
s.mean(skipna=False)

nan
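
For comparison, NumPy does offer NaN-aware functions, but they have to be requested explicitly. The sketch below assumes `np.nanmean()` as the NumPy counterpart of the default `Series.mean()` behavior and uses `.isna()` to count the missing values.

```python
import numpy as np
import pandas as pd

nda = np.array([1, 2, 3, 4, np.nan])

numpy_mean = nda.mean()           # NaN propagates
numpy_nanmean = np.nanmean(nda)   # NaN is skipped explicitly

s = pd.Series(nda)
pandas_mean = s.mean()            # NaN is skipped by default
n_missing = s.isna().sum()        # how many values are missing

print(numpy_mean, numpy_nanmean, pandas_mean, n_missing)
```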

## Boolean selection
Items in a Series can be selected based on their values, instead of index labels,
using a Boolean selection. A Boolean selection applies a logical expression to
the values of the Series and returns a new Series of Boolean values representing the
result for each value. Passing that Boolean Series to the [] operator returns only the
items where it is True. The following code demonstrates identifying items in a Series
where the values are greater than 5:

In [73]:
# which rows have values that are > 5?
s = pd.Series(np.arange(0, 10))
s > 5


0    False
1    False
2    False
3    False
4    False
5    False
6     True
7     True
8     True
9     True
dtype: bool

In [74]:
# select rows where values are > 5
logicalResults = s > 5
s[logicalResults]

6    6
7    7
8    8
9    9
dtype: int32

In [76]:
# a little shorter version
s[s > 5]

6    6
7    7
8    8
9    9
dtype: int32

In [77]:
# commented out because it raises a ValueError:
# Python's `and` cannot combine two Series
# s[s > 5 and s < 8]

In [78]:
# correct syntax
s[(s > 5) & (s < 8)]

6    6
7    7
dtype: int32
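
The following sketch spells out why the `and` form fails and shows the element-wise operators that work instead. It assumes the usual `&`, `|`, and `~` operators on Boolean Series; the parentheses are required because these operators bind more tightly than comparisons.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 10))

# `and` asks for a single truth value of a whole Series, which is ambiguous
message = ""
try:
    s[s > 5 and s < 8]
except ValueError as err:
    message = str(err)
    print("ValueError:", message)

between = s[(s > 5) & (s < 8)]   # element-wise AND
outside = s[(s < 2) | (s > 7)]   # element-wise OR
not_gt5 = s[~(s > 5)]            # element-wise NOT

print(between)
print(outside)
print(not_gt5)
```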

In [80]:
# are all items >= 0?
(s >= 0).all()

True

In [81]:
# any items < 2?
(s < 2).any()

True

There is something important going on here that is worth mentioning. The result of
these logical expressions is a Boolean selection, a Series of True and False values.
The `.sum()` method of a Series, when given a series of Boolean values, will treat
True as 1 and False as 0. The following demonstrates using this to determine the
number of items in a Series that satisfy a given expression:

In [82]:
# how many values < 2?
(s < 2).sum()


2

## Reindexing a Series
Reindexing in pandas is a process that makes the data in a Series or DataFrame
match a given set of labels. This is core to the functionality of pandas as it enables
label alignment across multiple objects, which may originally have different
indexing schemes.

In [83]:
# sample series of five items
s = pd.Series(np.random.randn(5))
s

0   -0.173215
1    0.119209
2   -1.044236
3   -0.861849
4   -2.104569
dtype: float64

In [84]:
# change the index
s.index = ['a', 'b', 'c', 'd', 'e']
s

a   -0.173215
b    0.119209
c   -1.044236
d   -0.861849
e   -2.104569
dtype: float64

In [85]:
# concat copies index values verbatim,
# potentially making duplicates
np.random.seed(123456)
s1 = pd.Series(np.random.randn(3))
s2 = pd.Series(np.random.randn(3))
combined = pd.concat([s1, s2])
combined

0    0.469112
1   -0.282863
2   -1.509059
0   -1.135632
1    1.212112
2   -0.173215
dtype: float64

In [86]:
# reset the index
combined.index = np.arange(0, len(combined))
combined


0    0.469112
1   -0.282863
2   -1.509059
3   -1.135632
4    1.212112
5   -0.173215
dtype: float64

In [87]:
np.random.seed(123456)
s1 = pd.Series(np.random.randn(4), ['a', 'b', 'c', 'd'])
# reindex with different number of labels
# results in dropped rows and/or NaN's
s2 = s1.reindex(['a', 'c', 'g'])
s2

a    0.469112
c   -1.509059
g         NaN
dtype: float64

In [88]:
# s2 is a different Series than s1
s2['a'] = 0
s2

a    0.000000
c   -1.509059
g         NaN
dtype: float64

In [89]:
s1

a    0.469112
b   -0.282863
c   -1.509059
d   -1.135632
dtype: float64

In [90]:
# different types for the same values of labels
# causes big trouble
s1 = pd.Series([0, 1, 2], index=[0, 1, 2])
s2 = pd.Series([3, 4, 5], index=['0', '1', '2'])
s1 + s2


0   NaN
1   NaN
2   NaN
0   NaN
1   NaN
2   NaN
dtype: float64

In [91]:
# reindex by casting the label types
# and we will get the desired result
s2.index = s2.index.values.astype(int)
s1 + s2

0    3
1    5
2    7
dtype: int64

In [92]:
# fill with 0 instead of NaN
s2 = s.copy()
s2.reindex(['a', 'f'], fill_value=0)

a   -0.173215
f    0.000000
dtype: float64

In [93]:
# create example to demonstrate fills
s3 = pd.Series(['red', 'green', 'blue'], index=[0, 3, 5])
s3

0      red
3    green
5     blue
dtype: object

The following example demonstrates forward filling, often referred to as "last known
value." The Series is reindexed to create a contiguous integer index, and with the
method='ffill' parameter, each new index label is assigned the value of the closest
preceding label in the original index:

In [94]:
# forward fill example
s3.reindex(np.arange(0,7), method='ffill')

0      red
1      red
2      red
3    green
4    green
5     blue
6     blue
dtype: object

The following example fills backward using method='bfill':

In [95]:
# backwards fill example
s3.reindex(np.arange(0,7), method='bfill')

0      red
1    green
2    green
3    green
4     blue
5     blue
6      NaN
dtype: object
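
One subtlety is worth a sketch: filling during a reindex is driven by the index labels, not by the values. The example below uses a made-up Series that already contains a NaN and compares `reindex(..., method='ffill')` with reindexing first and then calling `.ffill()` on the result; the names `during` and `after` are illustrative.

```python
import numpy as np
import pandas as pd

# illustrative Series whose middle value is already NaN
s = pd.Series([1.0, np.nan, 3.0], index=[0, 2, 4])

# fill while reindexing: each new label copies the value at the
# closest preceding original label, even if that value is NaN
during = s.reindex(np.arange(5), method='ffill')

# reindex first, then forward fill the values: every NaN is filled
after = s.reindex(np.arange(5)).ffill()

print(during)
print(after)
```

In the first result, labels 2 and 3 stay NaN because the original value at label 2 is NaN; the second result has no gaps.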

## Modifying a Series in-place
There are several ways that an existing Series can be modified in-place, having
either its values changed or having rows added or deleted. In-place modification of
a Series is a slightly controversial topic. When possible, it is preferred to perform
operations that return a new Series with the modifications represented in the new
Series. However, it is possible to change values and add/remove rows in-place, and
these techniques are explained here briefly.

A new item can be added to a Series by assigning a value to an index label that
does not already exist. The following code creates a Series object and adds a new
item to the series:


In [97]:
# generate a Series to play with
np.random.seed(123456)
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s

a    0.469112
b   -0.282863
c   -1.509059
dtype: float64

In [98]:
# add a new item by assigning to a label that does not exist yet
# this is done in-place
# a new Series is not returned
s['d'] = 100
s

a      0.469112
b     -0.282863
c     -1.509059
d    100.000000
dtype: float64

In [99]:
# modify the value at 'd' in-place
s['d'] = -100
s

a      0.469112
b     -0.282863
c     -1.509059
d   -100.000000
dtype: float64

In [100]:
# remove a row / item
del s['a']
s

b     -0.282863
c     -1.509059
d   -100.000000
dtype: float64

<b>Note:</b> To add or remove items without modifying the original Series, use
`pd.concat()` to add items and a Boolean selection to remove them.
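
A minimal sketch of the out-of-place approach follows. The data is made up for illustration, and `.drop()` is shown as an assumed alternative to the Boolean selection; in each case a new Series is returned and the original is left untouched.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# add an item: concatenate with a one-item Series
added = pd.concat([s, pd.Series([100.0], index=['d'])])

# remove an item: keep every row whose label is not 'a'
removed = added[added.index != 'a']

# the same removal using .drop()
dropped = added.drop('a')

print(added)
print(removed)
print(s)  # the original still has only a, b, c
```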

## Slicing a Series
In Chapter 3, NumPy for pandas, we covered techniques for NumPy array slicing.
pandas Series objects also support slicing and override the slicing operators to
perform their magic on Series data. Just like NumPy arrays, you can pass a slice
object to the [] operator of the Series to get the specified values. Slices also work
with the .loc[], .iloc[], and .ix properties and accessors.
To demonstrate slicing, we will use the following Series:

In [101]:
# a Series to use for slicing
# using index labels not starting at 0 to demonstrate
# position based slicing
s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
s

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [102]:
# items at position 0, 2, 4
s[0:6:2]


10    100
12    102
14    104
dtype: int32

In [103]:
# equivalent to
s.iloc[[0, 2, 4]]

10    100
12    102
14    104
dtype: int32

In [104]:
# first five by slicing, same as .head(5)
s[:5]

10    100
11    101
12    102
13    103
14    104
dtype: int32

In [105]:
# from position 4 (the fifth item) to the end
s[4:]

14    104
15    105
16    106
17    107
18    108
19    109
dtype: int32

In [106]:
# every other item in the first five positions
s[:5:2]

10    100
12    102
14    104
dtype: int32

In [107]:
# every other item starting at position 4
s[4::2]

14    104
16    106
18    108
dtype: int32

In [108]:
# reverse the Series
s[::-1]

19    109
18    108
17    107
16    106
15    105
14    104
13    103
12    102
11    101
10    100
dtype: int32

In [109]:
# every other starting at position 4, in reverse
s[4::-2]

14    104
12    102
10    100
dtype: int32

Negative values for the start and end of a slice have special meaning. If the series has
n elements, a negative start or end is interpreted as n + start or n + end, so the slice
covers positions n + start up to, but not including, n + end. This sounds a little
confusing, but can be understood simply with the following example:

In [110]:
# :-2 means positions 0 up to, but not including, 10 - 2 = 8
s[:-2]

10    100
11    101
12    102
13    103
14    104
15    105
16    106
17    107
dtype: int32

In [111]:
# equivalent to s.tail(4).head(3)
s[-4:-1]

16    106
17    107
18    108
dtype: int32
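
The n + start and n + end rule can be verified directly. The sketch below uses `.iloc[]` so that the slices are unambiguously positional, and compares each negative slice with its positive equivalent.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))
n = len(s)  # 10

# -4:-1 is the same as (n - 4):(n - 1), that is 6:9
neg = s.iloc[-4:-1]
pos = s.iloc[n - 4:n - 1]

# :-2 is the same as :(n - 2), that is :8
neg_head = s.iloc[:-2]
pos_head = s.iloc[:n - 2]

print(neg.equals(pos), neg_head.equals(pos_head))
```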

In [112]:
copy = s.copy() # preserve s
slice = copy[:2] # slice with first two rows
slice

10    100
11    101
dtype: int32

In [113]:
# change item with label 11 to 1000
slice[11] = 1000
# the slice is a view (in the pandas version used here),
# so the change also appears in the source
copy

10     100
11    1000
12     102
13     103
14     104
15     105
16     106
17     107
18     108
19     109
dtype: int32
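
If a change to a slice should never reach the source Series, one approach is to copy the slice explicitly. This is a sketch under that assumption; `part` is an illustrative name.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(100, 110), index=np.arange(10, 20))

# an explicit copy of the first two rows is independent of s
part = s.iloc[:2].copy()
part.loc[11] = 1000

print(part)
print(s.loc[11])  # still 101
```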

In [114]:
# used to demonstrate the next two slices
s = pd.Series(np.arange(0, 5),
              index=['a', 'b', 'c', 'd', 'e'])
s

a    0
b    1
c    2
d    3
e    4
dtype: int32

In [115]:
# slices by position as the index is characters
s[1:3]

b    1
c    2
dtype: int32

In [116]:
# this slices by the strings in the index
s['b':'d']


b    1
c    2
d    3
dtype: int32
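
The two slices above differ in how they treat the end point: a position slice excludes it, while a label slice includes it. The following sketch repeats both with the explicit accessors so that the intent is clear regardless of the index type.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(0, 5), index=['a', 'b', 'c', 'd', 'e'])

# positions 1 and 2; position 3 is excluded
by_position = s.iloc[1:3]

# labels 'b' through 'd'; the end label is included
by_label = s.loc['b':'d']

print(by_position)
print(by_label)
```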

## Summary
In this chapter, you learned about the pandas Series object and how it provides
capabilities beyond that of the NumPy array. We examined how to create and
initialize a Series and its associated index. Using a Series, we then looked at how
to manipulate the data in one or more Series objects, including alignment by
labels, various means of rearranging and changing data, and applying arithmetical
operations. We closed by examining how to reindex and perform slicing.

In the next chapter, you will learn how the DataFrame is used to represent multiple
Series of data that are automatically aligned to a DataFrame level index, providing
a uniform and automatic ability to represent multiple values for each index label.