## 2.3 THE SERIES

In Section 1.3.2.1, we saw how the slicing method affects the type of the result. If we use the loc attribute to subset a single row of our scientists dataframe, we will get a Series object back.

First, let’s re-create our example dataframe.

In [2]:
import pandas as pd

In [3]:
from collections import OrderedDict

In [4]:
# create our example dataframe
# with a row index label

scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],
          'Born': ['1920-07-25', '1876-06-13'],
          'Died': ['1958-04-16', '1937-10-16'],
          'Age': [37, 61]},
    index=['Rosaline Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])

print(scientists)

                     Occupation        Born        Died  Age
Rosaline Franklin       Chemist  1920-07-25  1958-04-16   37
William Gosset     Statistician  1876-06-13  1937-10-16   61


Now we select a scientist by the row index label.

In [5]:
# select by row index label
first_row = scientists.loc['William Gosset']

print(type(first_row))

<class 'pandas.core.series.Series'>


In [6]:
print(first_row)

Occupation    Statistician
Born            1876-06-13
Died            1937-10-16
Age                     61
Name: William Gosset, dtype: object


When a series is printed (i.e., when its string representation is shown), the index is printed as the first “column,” and the values are printed as the second “column.” There are many attributes and methods associated with a Series object. Two examples of attributes are index and values.

In [7]:
print(first_row.index)

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')


In [8]:
print(first_row.values)

['Statistician' '1876-06-13' '1937-10-16' 61]


An example of a Series method is keys, which is an alias for the index attribute.

In [9]:
print(first_row.keys())

Index(['Occupation', 'Born', 'Died', 'Age'], dtype='object')


By now, you might have questions about the syntax for index, values, and keys. More information about attributes and methods is found in Appendix S on classes. Attributes can be thought of as properties of an object (in this example, our object is a Series). Methods can be thought of as some calculation or operation that is performed. The subsetting tools loc, iloc, and ix (from Section 1.3.2) are all attributes. This is why their syntax does not rely on a set of round parentheses, (), but rather a set of square brackets, [ ], for subsetting. Since keys is a method, if we wanted to get the first key (which is also the first index), we would use the square brackets after the method call. Some attributes for the series are listed in Table 2.1.

In [10]:
# get the first index using an attribute
print(first_row.index[0])

Occupation
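
For comparison, here is the same lookup through the keys method. This is a minimal sketch that rebuilds the first_row Series by hand so it can run on its own; the name first_key is only for illustration.

```python
import pandas as pd

# rebuild the Series shown above so this snippet is self-contained
first_row = pd.Series(
    ['Statistician', '1876-06-13', '1937-10-16', 61],
    index=['Occupation', 'Born', 'Died', 'Age'],
    name='William Gosset')

# keys is a method: call it with (), then subset the returned Index with []
first_key = first_row.keys()[0]

print(first_key)
```

The round parentheses run the method, and the square brackets then subset whatever the method returned.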


Table 2.1 Some of the Attributes Within a Series

Series Attributes  Description

loc                Subset using index value
iloc               Subset using index position
ix                 Subset using index value and/or position (removed in Pandas 1.0; use loc or iloc)
dtype or dtypes    The type of the Series contents
T                  Transpose of the series
shape              Dimensions of the data
size               Number of elements in the Series
values             ndarray or ndarray-like of the Series
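
A few of these attributes can be tried on a small Series. The sketch below assumes a two-element Series of ages built by hand (the same values as the example dataframe); note that none of the attribute names is followed by round parentheses.

```python
import pandas as pd

# a small hand-built Series, used only for illustration
ages = pd.Series([37, 61],
                 index=['Rosaline Franklin', 'William Gosset'],
                 name='Age')

print(ages.shape)    # a one-entry tuple, since a Series is one-dimensional
print(ages.size)     # number of elements
print(ages.dtype)    # type of the stored values
print(ages.values)   # the underlying array of values

# loc and iloc are attributes too, so they are subset with []
print(ages.loc['William Gosset'])  # by index label
print(ages.iloc[0])                # by index position
```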

2.3.1 The Series Is ndarray-like

The Pandas data structure known as Series is very similar to the numpy.ndarray (Appendix R). As a result, many methods and functions that operate on an ndarray will also operate on a Series. A Series may sometimes be referred to as a “vector.”
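
As a rough illustration of this overlap, NumPy functions can be handed a Series directly. The values and variable names below are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# illustrative values only
ages = pd.Series([37, 61], name='Age')

# a NumPy reduction accepts a Series directly
mean_age = np.mean(ages)

# an element-wise NumPy function returns a Series and keeps the index
log_ages = np.log(ages)

print(mean_age)
print(type(log_ages))
```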

2.3.1.1 Series Methods

Let’s first get a series of the “Age” column from our scientists dataframe.

In [11]:
# get the 'Age' column
ages = scientists['Age']

print(ages)

Rosaline Franklin    37
William Gosset       61
Name: Age, dtype: int64


NumPy is a scientific computing library that typically deals with numeric vectors. Since a Series can be thought of as an extension to the numpy.ndarray, there is an overlap of attributes and methods. When we have a vector of numbers, there are common descriptive calculations we can perform (see the descriptive statistics section of the Pandas documentation: http://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics).

In [12]:
print(ages.mean())

49.0


In [13]:
print(ages.min())

37


In [14]:
print(ages.max())

61


In [15]:
print(ages.std())

16.97056274847714


The mean, min, max, and std are also methods in the numpy.ndarray. Some Series methods are listed in Table 2.2.
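
One caveat when moving between the two libraries: matching method names do not guarantee matching defaults. Series.std computes the sample standard deviation (ddof=1) by default, whereas numpy.ndarray.std computes the population standard deviation (ddof=0). The sketch below uses the same two ages as above; the variable names are illustrative.

```python
import pandas as pd

ages = pd.Series([37, 61], name='Age')

# Pandas default: sample standard deviation (ddof=1)
pandas_std = ages.std()

# NumPy default: population standard deviation (ddof=0)
numpy_std = ages.to_numpy().std()

# asking NumPy for ddof=1 reproduces the Pandas result
numpy_std_sample = ages.to_numpy().std(ddof=1)

print(pandas_std, numpy_std, numpy_std_sample)
```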

Table 2.2 Some of the Methods That Can Be Performed on a Series

Series Methods    Description

append            Concatenates two or more Series (removed in Pandas 2.0; use pd.concat)
corr              Calculate a correlation with another Series*
cov               Calculate a covariance with another Series*
describe          Calculate summary statistics*
drop_duplicates   Returns a Series without duplicates
equals            Determines whether two Series have the same elements
get_values        Get values of the Series; same as the values attribute (removed in Pandas 1.0)
hist              Draw a histogram
isin              Checks whether values are contained in a Series
min               Returns the minimum value
max               Returns the maximum value
mean              Returns the arithmetic mean
median            Returns the median
mode              Returns the mode(s)
quantile          Returns the value at a given quantile
replace           Replaces values in the Series with a specified value
sample            Returns a random sample of values from the Series
sort_values       Sorts values
to_frame          Converts a Series to a DataFrame
transpose         Returns the transpose
unique            Returns a numpy.ndarray of unique values
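
To give a feel for a few of these methods, here is a short sketch. The ages are typed in by hand for illustration, and the variable names are not from the text.

```python
import pandas as pd

# hand-typed ages, used only for illustration
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

sorted_ages = ages.sort_values()   # ascending order by default
median_age = ages.median()         # middle value of the sorted ages
is_member = ages.isin([37, 90])    # boolean Series: True where the age is 37 or 90
ages_df = ages.to_frame()          # one-column DataFrame named 'Age'
unique_ages = ages.unique()        # numpy.ndarray of distinct values

print(median_age)
print(is_member.sum())
print(ages_df.shape)
```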

2.3.2 Boolean Subsetting: Series

Chapter 1 showed how we can use specific indices to subset our data. Only rarely, however, will we know the exact row or column index to subset the data. Typically you are looking for values that meet (or don’t meet) a particular condition.

To explore this process, let’s use a larger data set.

In [16]:
scientists = pd.read_csv('data/scientists.csv')

We just saw how we can calculate basic descriptive metrics of vectors. The describe method will calculate multiple descriptive statistics in a single method call.

In [17]:
ages = scientists['Age']

print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


In [20]:
scientists.head(n=8)

Unnamed: 0  Name                  Born        Died        Age  Occupation
0           Rosaline Franklin     1920-07-25  1958-04-16  37   Chemist
1           William Gosset        1876-06-13  1937-10-16  61   Statistician
2           Florence Nightingale  1820-05-12  1910-08-13  90   Nurse
3           Marie Curie           1867-11-07  1934-07-04  66   Chemist
4           Rachel Carson         1907-05-27  1964-04-14  56   Biologist
5           John Snow             1813-03-15  1858-06-16  45   Physician
6           Alan Turing           1912-06-23  1954-06-07  41   Computer Scientist
7           Johann Gauss          1777-04-30  1855-02-23  77   Mathematician


In [21]:
# get basic stats
print(ages.describe())

count     8.000000
mean     59.125000
std      18.325918
min      37.000000
25%      44.000000
50%      58.500000
75%      68.750000
max      90.000000
Name: Age, dtype: float64


In [22]:
# mean of all ages
print(ages.mean())

59.125


What if we wanted to subset our ages by identifying those above the mean?

In [23]:
print(ages[ages > ages.mean()])

1    61
2    90
3    66
7    77
Name: Age, dtype: int64


Let’s tease apart this statement and look at what ages > ages.mean() returns.

In [24]:
print(ages > ages.mean())

0    False
1     True
2     True
3     True
4    False
5    False
6    False
7     True
Name: Age, dtype: bool


In [25]:
print(type(ages > ages.mean()))

<class 'pandas.core.series.Series'>


This statement returns a Series with a dtype of bool. In other words, we can not only subset values using labels and indices, but also supply a vector of boolean values. Python has many functions and methods. Depending on how each one is implemented, it may return labels, indices, or booleans. Keep this point in mind as you learn new methods and seek to piece together various parts for your work.

If we liked, we could manually supply a vector of bools to subset our data.

In [26]:
# get index 0, 1, 4, 5, and 7

manual_bool_values = [True, True, False, False, True, True, False, True]

print(ages[manual_bool_values])

0    37
1    61
4    56
5    45
7    77
Name: Age, dtype: int64
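
A boolean vector can also be passed to loc, and it can be flipped with the ~ operator to select the values that do not meet the condition. The following is a minimal sketch with the ages typed in by hand; above_mean is an illustrative name.

```python
import pandas as pd

# hand-typed ages, used only for illustration
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

# boolean Series: True where the age is above the mean
above_mean = ages > ages.mean()

# loc accepts the boolean Series, just like plain square brackets
print(ages.loc[above_mean])

# ~ flips every True/False, selecting ages at or below the mean
print(ages[~above_mean])
```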


2.3.3 Operations Are Automatically Aligned and Vectorized (Broadcasting)

If you’re familiar with programming, you might find it strange that ages > ages.mean() returns a vector without any for loops (Appendix M). Many of the methods that work on Series (and also DataFrames) are vectorized, meaning that they work on the entire vector at once. This approach makes the code easier to read, and the underlying implementations are typically optimized to make calculations faster.
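
To make the contrast concrete, the sketch below writes the comparison both ways, with an explicit for loop and as a single vectorized expression, and checks that they agree. The ages are typed in by hand, and the names looped and vectorized are illustrative.

```python
import pandas as pd

# hand-typed ages, used only for illustration
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')
mean_age = ages.mean()

# explicit loop: compare one element at a time
looped = []
for age in ages:
    looped.append(bool(age > mean_age))

# vectorized: one expression over the whole Series
vectorized = ages > mean_age

# both approaches give the same True/False values
print(looped == vectorized.tolist())
```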

2.3.3.1 Vectors of the Same Length

If you perform an operation between two vectors of the same length, the resulting vector will be an element-by-element calculation of the vectors.

In [27]:
print(ages + ages)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64


In [28]:
print(ages * ages)

0    1369
1    3721
2    8100
3    4356
4    3136
5    2025
6    1681
7    5929
Name: Age, dtype: int64


In [29]:
print(type(ages))

<class 'pandas.core.series.Series'>


2.3.3.2 Vectors With Integers (Scalars)

When you perform an operation on a vector using a scalar, the scalar will be recycled across all the elements in the vector.

In [30]:
print(ages + 100)

0    137
1    161
2    190
3    166
4    156
5    145
6    141
7    177
Name: Age, dtype: int64


In [31]:
print(ages * 2)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64


2.3.3.3 Vectors With Different Lengths

When you are working with vectors of different lengths, the behavior will depend on the type of the vectors. With a Series, the operation is matched by index label: elements whose labels appear in both vectors are combined, and the rest of the resulting vector is filled with a “missing” value, denoted with NaN, signifying “not a number.”

This type of behavior, which is called broadcasting, differs between languages. Broadcasting in Pandas refers to how operations are calculated between arrays with different shapes.

In [32]:
print(ages + pd.Series([1, 100]))

0     38.0
1    161.0
2      NaN
3      NaN
4      NaN
5      NaN
6      NaN
7      NaN
dtype: float64


In [33]:
print(pd.Series([1, 100]))

0      1
1    100
dtype: int64
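
If the NaN values are not wanted, one option is the add method, which accepts a fill_value to stand in for a label that is missing from one of the two Series. This is a sketch under the assumption that treating the missing side as 0 makes sense for your data; the ages are typed in by hand.

```python
import pandas as pd

# hand-typed ages, used only for illustration
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')
short = pd.Series([1, 100])

# the + operator leaves NaN wherever a label exists in only one Series
with_nan = ages + short

# add() with fill_value=0 treats the missing side as 0 instead
filled = ages.add(short, fill_value=0)

print(filled)
```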


With other types, the shapes must match.

In [34]:
import numpy as np

In [35]:
# this will cause an error

print(ages + np.array([1, 100]))

ValueError: operands could not be broadcast together with shapes (8,) (2,) 
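
If you want a script to keep running past this kind of mismatch, the error can be caught. The sketch below, with hand-typed ages and illustrative variable names, catches the ValueError and then shows that an array of matching length works, combining elements by position.

```python
import numpy as np
import pandas as pd

# hand-typed ages, used only for illustration
ages = pd.Series([37, 61, 90, 66, 56, 45, 41, 77], name='Age')

try:
    ages + np.array([1, 100])   # lengths 8 and 2 do not match
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)

# an array with the same length as the Series is combined by position
same_length = ages + np.arange(8)
print(same_length)
```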

2.3.3.4 Vectors With Common Index Labels (Automatic Alignment)

What’s cool about Pandas is that data alignment is almost always automatic. Whenever possible, values are aligned by index label before an operation is performed.

In [36]:
# ages as they appear in the data

print(ages)

0    37
1    61
2    90
3    66
4    56
5    45
6    41
7    77
Name: Age, dtype: int64


In [37]:
rev_ages = ages.sort_index(ascending=False)

print(rev_ages)

7    77
6    41
5    45
4    56
3    66
2    90
1    61
0    37
Name: Age, dtype: int64


If we perform an operation using ages and rev_ages, it will still be conducted on an element-by-element basis, but the vectors will be aligned first before the operation is carried out.

In [38]:
# reference output to show index label alignment

print(ages * 2)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64


In [39]:
# note how we get the same values
# even though the vector is reversed

print(ages + rev_ages)

0     74
1    122
2    180
3    132
4    112
5     90
6     82
7    154
Name: Age, dtype: int64
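
Alignment works with any index labels, not only integers, and it combines with the missing-value behavior from Section 2.3.3.3 when the labels only partly overlap. The Series below are hypothetical and exist only to illustrate this.

```python
import pandas as pd

# hypothetical labeled values, used only to illustrate alignment
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['z', 'y', 'w'])

# values are matched by label, not by position;
# labels found in only one Series ('x' and 'w') produce NaN
total = a + b

print(total)
```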
