# Working with Pandas Series

The *Pandas* package offers two key data structures that are optimised for data analysis and manipulation: *Series* and *DataFrame*. In this notebook we will start off by looking at the *Series*, which is a one-dimensional structure holding data of any type.

To start off, we import the Pandas package. We can import it as *pd* for shorthand.

In [1]:
import pandas as pd

## Creating Pandas Series

To create a new series, we can use the *Series()* function. The simplest (but least useful) approach is to pass in a Python list. By default, the Series will have a numeric index, counting from 0.

In [2]:
values = [2, 101, 45, 232, 45, 67]
# create the Series
s1 = pd.Series(values)
s1

0      2
1    101
2     45
3    232
4     45
5     67
dtype: int64

In [3]:
# how many values in the series?
len(s1)

6

We can also explicitly pass a list of index labels to the *Series()* function to use a more useful index (in this case strings containing country names). Note the number of values and labels must match.

In [4]:
life_exp_values = [75.77, 82.09, 73.12, 80.99]
countries = ["Argentina", "Australia", "Brazil", "Canada"]
# create the Series
s2 = pd.Series(life_exp_values, index=countries)
s2

Argentina    75.77
Australia    82.09
Brazil       73.12
Canada       80.99
dtype: float64

In [5]:
# how many values in the series?
len(s2)

4
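
To illustrate the requirement that the number of values and labels must match, here is a minimal sketch in which the list of values is deliberately one item short. The variable names are illustrative only; the point is that Pandas raises a *ValueError* rather than silently padding the data.

```python
import pandas as pd

values = [75.77, 82.09, 73.12]                           # three values
labels = ["Argentina", "Australia", "Brazil", "Canada"]  # four labels

mismatch_error = None
try:
    pd.Series(values, index=labels)
except ValueError as err:
    # the lengths differ, so no Series is created
    mismatch_error = err
    print("Could not create Series:", err)
```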

This use of index labels is very similar to a Python dictionary. In fact, we can create a Pandas Series directly from a Python dictionary:

In [6]:
d_life_exp = {"Argentina": 75.77, "Australia": 82.09, "Brazil": 73.12, "Canada": 80.99}
s3 = pd.Series(d_life_exp)
s3

Argentina    75.77
Australia    82.09
Brazil       73.12
Canada       80.99
dtype: float64
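
When building a Series from a dictionary, we can also pass a list of index labels to select and order the keys we want. The sketch below assumes a label ("France") that is not present in the dictionary, to show that Pandas fills the gap with a missing value (NaN) instead of raising an error.

```python
import pandas as pd

d_life_exp = {"Argentina": 75.77, "Australia": 82.09, "Brazil": 73.12, "Canada": 80.99}

# choose which keys to keep, and in what order; "France" is not in the dictionary
wanted = ["Canada", "Brazil", "France"]
s_subset = pd.Series(d_life_exp, index=wanted)
print(s_subset)

# check which labels ended up with a missing value
print(s_subset.isna())
```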

Let's create a Series with a larger number of values: 

In [7]:
values = [75.77, 82.09, 73.12, 80.99, 49.81, 74.87, 70.48, 80.24, 80.15, 
          84.36, 75.05, 80.67, 55.13, 51.3, 76.99, 80.68, 83.23, 83.49, 82.5, 80.09, 78.51]
labels = ["Argentina", "Australia", "Brazil", "Canada", "Chad", "China", "Egypt", "Germany", "Ireland", 
          "Japan", "Mexico", "New Zealand", "Niger", "Nigeria", "Paraguay", "Portugal", "South Korea", 
          "Spain", "Switzerland", "United Kingdom", "United States"]

In [8]:
life_exp = pd.Series(values, index=labels)
len(life_exp)

21

We can display the first *n* values in the Series by calling the associated *head()* function:

In [9]:
# show the first 10 values
life_exp.head(10)

Argentina    75.77
Australia    82.09
Brazil       73.12
Canada       80.99
Chad         49.81
China        74.87
Egypt        70.48
Germany      80.24
Ireland      80.15
Japan        84.36
dtype: float64
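
As a small companion sketch, calling *head()* with no argument returns the first 5 values, and the matching *tail()* function returns values from the end of the Series. The short Series below reuses a few of the values above purely for illustration.

```python
import pandas as pd

s = pd.Series(
    [75.77, 82.09, 73.12, 80.99, 49.81, 74.87],
    index=["Argentina", "Australia", "Brazil", "Canada", "Chad", "China"],
)

# with no argument, head() returns the first 5 values
first_five = s.head()
print(first_five)

# tail(n) returns the last n values
last_two = s.tail(2)
print(last_two)
```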

A Series has an associated *index* attribute, which allows us to access the index values alone:

In [10]:
life_exp.index

Index(['Argentina', 'Australia', 'Brazil', 'Canada', 'Chad', 'China', 'Egypt',
       'Germany', 'Ireland', 'Japan', 'Mexico', 'New Zealand', 'Niger',
       'Nigeria', 'Paraguay', 'Portugal', 'South Korea', 'Spain',
       'Switzerland', 'United Kingdom', 'United States'],
      dtype='object')

We can use the Python *in* operator to check whether or not a particular index label exists in a Series:

In [11]:
"Canada" in life_exp.index

True

In [12]:
"France" in life_exp.index

False
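
One point that is easy to miss: applying *in* directly to a Series also tests the index labels, not the values, much like a dictionary tests its keys. The following sketch shows the difference; to search the values we can test against the *values* attribute instead.

```python
import pandas as pd

s = pd.Series(
    [75.77, 82.09, 73.12, 80.99],
    index=["Argentina", "Australia", "Brazil", "Canada"],
)

label_found = "Canada" in s          # tests the index labels
value_as_label = 80.99 in s          # 80.99 is a value, not a label
value_found = 80.99 in s.values      # tests the underlying values

print(label_found, value_as_label, value_found)
```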

## Accessing Values by Position

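A Pandas Series offers a number of different ways to access values. We can use simple position numbers, as with standard Python lists, counting from 0. Note that when the index holds non-numeric labels, as it does here, recent versions of Pandas (2.x) issue a warning that using plain square brackets with an integer position is deprecated; the *iloc[]* operator described later in this section is the more robust choice.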

In [13]:
life_exp[0]

75.77

In [14]:
life_exp[4]

49.81

We can use negative indexing to count from the last position backwards:

In [15]:
# get the last value in the Series
life_exp[-1]

78.51

In [16]:
# get the third last value
life_exp[-3]

82.5

Just like lists, we can also use slicing via the *i:j* operator. Remember this includes the elements from position *i* up to but not including position *j*:

In [17]:
# start at position 0, end before position 2
life_exp[0:2]

Argentina    75.77
Australia    82.09
dtype: float64

In [18]:
# start at position 3, end before position 7
life_exp[3:7]

Canada    80.99
Chad      49.81
China     74.87
Egypt     70.48
dtype: float64

In [19]:
# start at the beginning of the Series, end before position 5
life_exp[:5]

Argentina    75.77
Australia    82.09
Brazil       73.12
Canada       80.99
Chad         49.81
dtype: float64

In [20]:
# start at position 8, go to the end of the Series
life_exp[8:]

Ireland           80.15
Japan             84.36
Mexico            75.05
New Zealand       80.67
Niger             55.13
Nigeria           51.30
Paraguay          76.99
Portugal          80.68
South Korea       83.23
Spain             83.49
Switzerland       82.50
United Kingdom    80.09
United States     78.51
dtype: float64

To access values by position in a Series, we can also use the *iloc[]* operator. This can be useful when we want to explicitly distinguish between positions and numeric index labels.

In [21]:
# get the value at position 3
life_exp.iloc[3]

80.99

In [22]:
# start at position 3, end before position 7
life_exp.iloc[3:7]

Canada    80.99
Chad      49.81
China     74.87
Egypt     70.48
dtype: float64
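
To see why the distinction between positions and numeric index labels matters, consider a sketch of a Series whose index labels are themselves integers. The labels 10, 20 and 30 are hypothetical and chosen so that they do not coincide with any position.

```python
import pandas as pd

# integer labels that differ from the positions 0, 1, 2
s_int = pd.Series([2, 101, 45], index=[10, 20, 30])

by_label = s_int.loc[10]       # the value stored under label 10
by_position = s_int.iloc[0]    # the value at position 0 (the same element)
last_value = s_int.iloc[-1]    # the value at the last position

print(by_label, by_position, last_value)

# there is no label 0, so a label-based lookup of 0 fails
try:
    s_int.loc[0]
except KeyError:
    print("0 is a valid position, but not a valid label")
```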

We can return multiple specific values by passing in a list of numeric positions to *iloc[]*:

In [23]:
life_exp.iloc[[1, 3, 5, 7]]

Australia    82.09
Canada       80.99
China        74.87
Germany      80.24
dtype: float64

In [24]:
life_exp.iloc[[8, 11, 3, 14, 0]]

Ireland        80.15
New Zealand    80.67
Canada         80.99
Paraguay       76.99
Argentina      75.77
dtype: float64

## Accessing Values by Index Label

We can also access values by their associated index labels, defined at creation, using the *loc[]* operator:

In [25]:
life_exp.loc["Ireland"]

80.15

In [26]:
life_exp.loc["Japan"]

84.36

We can return multiple values by passing in a list of index labels to *loc[]*:

In [27]:
life_exp.loc[["Ireland", "Germany", "United Kingdom"]]

Ireland           80.15
Germany           80.24
United Kingdom    80.09
dtype: float64

In [28]:
life_exp.loc[["Japan", "United Kingdom", "China", "Australia"]]

Japan             84.36
United Kingdom    80.09
China             74.87
Australia         82.09
dtype: float64
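
Slicing also works with index labels via *loc[]*, but with one difference from position-based slicing: the end label is included in the result. The sketch below, using a shortened version of the data, contrasts the two.

```python
import pandas as pd

s = pd.Series(
    [75.77, 82.09, 73.12, 80.99, 49.81],
    index=["Argentina", "Australia", "Brazil", "Canada", "Chad"],
)

# label-based slice: includes both "Brazil" and "Chad"
by_label = s.loc["Brazil":"Chad"]
print(by_label)

# position-based slice: starts at position 2, ends before position 4
by_position = s.iloc[2:4]
print(by_position)
```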

## Applying Conditions to Series

We might want to filter the values in a Pandas Series, to reduce it to a subset of the original values based on some condition applied to the values. We can do this by indexing with a boolean expression.

In [29]:
# check which values match the specified condition
life_exp > 80

Argentina         False
Australia          True
Brazil            False
Canada             True
Chad              False
China             False
Egypt             False
Germany            True
Ireland            True
Japan              True
Mexico            False
New Zealand        True
Niger             False
Nigeria           False
Paraguay          False
Portugal           True
South Korea        True
Spain              True
Switzerland        True
United Kingdom     True
United States     False
dtype: bool

In [30]:
# create a new series, with only values > 80
life_exp[life_exp > 80]

Australia         82.09
Canada            80.99
Germany           80.24
Ireland           80.15
Japan             84.36
New Zealand       80.67
Portugal          80.68
South Korea       83.23
Spain             83.49
Switzerland       82.50
United Kingdom    80.09
dtype: float64

In [31]:
# create a new series, with only values <= 80
life_exp[life_exp <= 80]

Argentina        75.77
Brazil           73.12
Chad             49.81
China            74.87
Egypt            70.48
Mexico           75.05
Niger            55.13
Nigeria          51.30
Paraguay         76.99
United States    78.51
dtype: float64

We can combine several different conditions using a boolean operator like AND (&) or OR (|). Note that each condition must be enclosed in parentheses, because & and | take precedence over comparison operators such as > and < in Python:

In [32]:
# values > 75 AND < 80
life_exp[(life_exp > 75) & (life_exp < 80)]

Argentina        75.77
Mexico           75.05
Paraguay         76.99
United States    78.51
dtype: float64

In [33]:
# values < 70 OR values > 83
life_exp[(life_exp < 70) | (life_exp > 83)]

Chad           49.81
Japan          84.36
Niger          55.13
Nigeria        51.30
South Korea    83.23
Spain          83.49
dtype: float64
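
Alongside AND and OR, a condition can be inverted with the NOT operator (~). As a sketch of two related options, the code below inverts a condition and also uses the *between()* function, which by default includes both end points. The shortened Series is for illustration only.

```python
import pandas as pd

life = pd.Series(
    [75.77, 82.09, 73.12, 80.99, 49.81, 74.87],
    index=["Argentina", "Australia", "Brazil", "Canada", "Chad", "China"],
)

# NOT (values > 80) selects the same rows as values <= 80
not_above_80 = life[~(life > 80)]
at_most_80 = life[life <= 80]
print(not_above_80.equals(at_most_80))

# between(75, 80) is equivalent to (life >= 75) & (life <= 80)
in_band = life[life.between(75, 80)]
print(in_band)
```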

## Modifying a Series

The easiest way to change elements in an existing Pandas Series is to use the index label. We can do this in two different ways (both have the same effect):

In [34]:
life_exp["Chad"] = 50.2

In [35]:
life_exp.loc["Chad"] = 50.2

In [36]:
# check the value has changed
life_exp["Chad"]

50.2

We can also use position numbers to modify elements, via the *iloc[]* operator:

In [37]:
life_exp.iloc[0] = 75.90
life_exp.head()

Argentina    75.90
Australia    82.09
Brazil       73.12
Canada       80.99
Chad         50.20
dtype: float64

We can also add an additional element to the Series, by just assigning a value to a label:

In [38]:
life_exp["Norway"] = 82.91
life_exp

Argentina         75.90
Australia         82.09
Brazil            73.12
Canada            80.99
Chad              50.20
China             74.87
Egypt             70.48
Germany           80.24
Ireland           80.15
Japan             84.36
Mexico            75.05
New Zealand       80.67
Niger             55.13
Nigeria           51.30
Paraguay          76.99
Portugal          80.68
South Korea       83.23
Spain             83.49
Switzerland       82.50
United Kingdom    80.09
United States     78.51
Norway            82.91
dtype: float64
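
Adding an element this way only works with labels. As a sketch of the contrast, assigning to a new label via *loc[]* enlarges the Series, whereas assigning to a position beyond the end via *iloc[]* raises an *IndexError*. The two-element Series here is illustrative.

```python
import pandas as pd

s = pd.Series([75.77, 82.09], index=["Argentina", "Australia"])

# a new label enlarges the Series
s.loc["Norway"] = 82.91
print(len(s))

# a position past the end cannot be assigned to
iloc_error = None
try:
    s.iloc[3] = 80.0
except IndexError as err:
    iloc_error = err
    print("iloc cannot add elements:", err)
```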

## Series Statistics

A Series has associated functions for a range of simple statistical analyses.

In [39]:
# average of the values
life_exp.mean()

75.58863636363637

In [40]:
# median of the values (the middle value)
life_exp.median()

80.12

In [41]:
# the standard deviation of the values (the spread)
life_exp.std()

10.192637069615518
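
It is worth knowing that *std()* computes the sample standard deviation by default (dividing by *n - 1*). If the population standard deviation is wanted, we can pass *ddof=0*. The sketch below checks both against Python's built-in *statistics* module, using a few of the values above.

```python
import statistics

import pandas as pd

vals = [75.77, 82.09, 73.12, 80.99]
s = pd.Series(vals)

sample_std = s.std()              # default: divides by n - 1
population_std = s.std(ddof=0)    # divides by n

# compare with the standard library equivalents
print(sample_std, statistics.stdev(vals))
print(population_std, statistics.pstdev(vals))
```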

In [42]:
# range of the values (the minimum and maximum)
life_exp.min(), life_exp.max()

(50.2, 84.36)

The associated *describe()* function gives a useful statistical summary of a Series:

In [43]:
life_exp.describe()

count    22.000000
mean     75.588636
std      10.192637
min      50.200000
25%      74.915000
50%      80.120000
75%      81.815000
max      84.360000
dtype: float64
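
The result of *describe()* is itself a Series, so individual statistics can be pulled out by label. As a brief sketch, the percentiles it reports can also be computed directly with the *quantile()* function.

```python
import pandas as pd

s = pd.Series([75.77, 82.09, 73.12, 80.99, 49.81])

summary = s.describe()

# access individual statistics by their label
print(summary["count"])
print(summary["50%"])

# the same figures computed directly
print(s.median())
print(s.quantile(0.25))
```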

## Sorting Series

To sort a Series, we call its associated *sort_values()* function. Note this returns a sorted copy and leaves the original Series unchanged.

In [44]:
# sort lowest to highest
life_exp.sort_values()

Chad              50.20
Nigeria           51.30
Niger             55.13
Egypt             70.48
Brazil            73.12
China             74.87
Mexico            75.05
Argentina         75.90
Paraguay          76.99
United States     78.51
United Kingdom    80.09
Ireland           80.15
Germany           80.24
New Zealand       80.67
Portugal          80.68
Canada            80.99
Australia         82.09
Switzerland       82.50
Norway            82.91
South Korea       83.23
Spain             83.49
Japan             84.36
dtype: float64
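
Because sorting produces a copy, the original order is kept unless we store the result. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series([75.77, 82.09, 73.12], index=["Argentina", "Australia", "Brazil"])

sorted_s = s.sort_values()

# the original Series keeps its order
print(list(s.index))

# the sorted copy has the new order
print(list(sorted_s.index))

# to keep the sorted order, assign the result back to a variable
s_kept = s.sort_values()
```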

By default, values are sorted in ascending order. We can sort in descending order by specifying the argument *ascending=False*:

In [45]:
# sort highest to lowest
life_exp.sort_values(ascending=False)

Japan             84.36
Spain             83.49
South Korea       83.23
Norway            82.91
Switzerland       82.50
Australia         82.09
Canada            80.99
Portugal          80.68
New Zealand       80.67
Germany           80.24
Ireland           80.15
United Kingdom    80.09
United States     78.51
Paraguay          76.99
Argentina         75.90
Mexico            75.05
China             74.87
Brazil            73.12
Egypt             70.48
Niger             55.13
Nigeria           51.30
Chad              50.20
dtype: float64
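
A common use of descending sorts is to pick out the top few values. One way to sketch this is to chain *head()* onto the sorted result; Pandas also provides an *nlargest()* function that gives the same answer here, since no values are tied.

```python
import pandas as pd

s = pd.Series(
    [84.36, 49.81, 83.49, 73.12, 83.23],
    index=["Japan", "Chad", "Spain", "Brazil", "South Korea"],
)

# sort highest to lowest, then keep the first 3
top3 = s.sort_values(ascending=False).head(3)
print(top3)

# the same selection using nlargest()
top3_alt = s.nlargest(3)
print(top3_alt)
```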

We can also sort a Series based on its index labels, by calling *sort_index()*:

In [46]:
life_exp.sort_index()

Argentina         75.90
Australia         82.09
Brazil            73.12
Canada            80.99
Chad              50.20
China             74.87
Egypt             70.48
Germany           80.24
Ireland           80.15
Japan             84.36
Mexico            75.05
New Zealand       80.67
Niger             55.13
Nigeria           51.30
Norway            82.91
Paraguay          76.99
Portugal          80.68
South Korea       83.23
Spain             83.49
Switzerland       82.50
United Kingdom    80.09
United States     78.51
dtype: float64