# Pandas Series

At a basic level, a Pandas Series is an enhanced version of a NumPy array in which we can identify the elements by labels rather than only by integer positions.

In [1]:
import numpy as np
import pandas as pd

A Pandas Series is a one-dimensional array of indexed data.

Create an empty Series:

In [2]:
s = pd.Series(dtype='int')
print(s)

Series([], dtype: int32)


Create a Series from ndarray:

In [3]:
data = np.array([10, 20, 30, 40])
s = pd.Series(data)
print(s)

0    10
1    20
2    30
3    40
dtype: int32


We did not pass an index, so Pandas assigned the default one: integers ranging from 0 to len(data)-1, i.e., 0 to 3.

Index labels can be letters, numbers, or any other hashable type.

In [4]:
s = pd.Series(data, index=['a','b','c','d'])
print(s)

a    10
b    20
c    30
d    40
dtype: int32


In [5]:
s = pd.Series(data, index=['blue','red','pink','green'])
print(s)

blue     10
red      20
pink     30
green    40
dtype: int32
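
The labels passed through `index` have to match the data in length. A minimal sketch of what happens otherwise, assuming the same `data` array as above (the two-label list is only illustrative):

```python
import numpy as np
import pandas as pd

data = np.array([10, 20, 30, 40])

# Four values but only two labels: Pandas refuses to build the Series.
try:
    pd.Series(data, index=['a', 'b'])
    length_error = None
except ValueError as exc:
    length_error = exc

print(type(length_error).__name__)   # ValueError
```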


Creating a Series from a dictionary: dictionary keys will be used as index.

In [6]:
d1 = {'a':5, 'b':10, 'c':15, 'd':20}
s = pd.Series(d1)
print(s) 

a     5
b    10
c    15
d    20
dtype: int64


**Defining an index that is not in the dictionary**

Observe that the order given in `index` is preserved and the missing element is filled with `NaN`
(`NaN` means "not a number").

In [7]:
s = pd.Series(d1,index=['b','c','d','f'])
print(s)

b    10.0
c    15.0
d    20.0
f     NaN
dtype: float64


We do not have the entry `a` because we did not include `a` in the index. Instead, we specified `f`. As there is no `f` in the dictionary `d1`, Pandas creates an entry with the index `f` and assigns the value `NaN` to it. Because `NaN` is a float, the whole Series becomes `float64`.

Create a Series from a scalar: the value is repeated for every index label.
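
The effect of the `index` argument on a dictionary can be checked directly. The sketch below reuses `d1` from the cell above; the names `full` and `subset` are only illustrative.

```python
import pandas as pd

d1 = {'a': 5, 'b': 10, 'c': 15, 'd': 20}

full = pd.Series(d1)                                 # every key becomes a label
subset = pd.Series(d1, index=['b', 'c', 'd', 'f'])   # 'a' is dropped, 'f' has no value

print(full.dtype)                # int64
print(subset.dtype)              # float64, because NaN is a float
print(subset.isna().tolist())    # [False, False, False, True]
print('a' in subset.index)       # False
```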

In [8]:
s = pd.Series(20, index=[1,2,3,4,5])
print(s)

1    20
2    20
3    20
4    20
5    20
dtype: int64


## Data types

When constructing a Series, Pandas infers an appropriate data type from the values. We can override it with the `dtype` parameter, as long as the values can actually be represented in the requested type.

In [9]:
pd.Series([1,2,3,4,5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [10]:
pd.Series([1,2,3,4,5],dtype='float')

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [11]:
# replacing 3 with 3.14
s2 = pd.Series([1,2,3.14,4,5])
s2

0    1.00
1    2.00
2    3.14
3    4.00
4    5.00
dtype: float64

The Series type is float because one of the values (3.14) is a float.

You can convert the Series from float to integer. Take into account that the fractional part is truncated, not rounded.

In [12]:
s3 = s2.astype('int')
s3

0    1
1    2
2    3
3    4
4    5
dtype: int32
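
To make the truncation visible, here is a small sketch with hypothetical values (they are not part of the notebook above). It compares a plain `astype` with rounding first.

```python
import pandas as pd

values = pd.Series([1.9, -1.9, 3.14])

truncated = values.astype('int64')          # fractional part dropped (toward zero)
rounded = values.round().astype('int64')    # round to nearest, then convert

print(truncated.tolist())   # [1, -1, 3]
print(rounded.tolist())     # [2, -2, 3]
```

If the nearest integer is what you want, round before converting.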

In [13]:
s3 = s3.astype('object')
s3

0    1
1    2
2    3
3    4
4    5
dtype: object

In [14]:
s4 = pd.Series(['a','b','c','d','e'])
s4

0    a
1    b
2    c
3    d
4    e
dtype: object

We have two different Pandas series with object type: `s3` and `s4`.
- We can convert the type of `s3` to integer because `s3` has numbers.
- We cannot convert the type of `s4` to integer because `s4` has letters.

In [15]:
s3.astype('int')

0    1
1    2
2    3
3    4
4    5
dtype: int32

In [16]:
s4.astype('int')            # This will raise an error

ValueError: invalid literal for int() with base 10: 'a'
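
When a Series of strings may contain non-numeric entries, one option is `pd.to_numeric` with `errors='coerce'`, which replaces values it cannot parse with `NaN` instead of raising. A minimal sketch, where `mixed` is hypothetical data:

```python
import pandas as pd

s4 = pd.Series(['a', 'b', 'c', 'd', 'e'])
mixed = pd.Series(['1', '2', 'x'])          # hypothetical example data

# astype stops at the first value it cannot convert.
try:
    s4.astype('int')
    failed = False
except ValueError:
    failed = True

# to_numeric with errors='coerce' turns unparseable values into NaN.
converted = pd.to_numeric(mixed, errors='coerce')

print(failed)                 # True
print(converted.tolist())     # [1.0, 2.0, nan]
```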

In [17]:
pd.Series([True, True, False])

0     True
1     True
2    False
dtype: bool

## Accessing data

Create a Series to work with

In [18]:
s = pd.Series([10,20,30,40,50],index = ['a','b','c','d','e'])
s

a    10
b    20
c    30
d    40
e    50
dtype: int64

Retrieve the first element by position

In [19]:
s.iloc[0]           # plain s[0] on a labeled Series is deprecated in recent Pandas versions

10

Retrieve the first element using its label

In [20]:
s['a']

10

Retrieve the first three elements by position

In [21]:
s[:3]

a    10
b    20
c    30
dtype: int64

Retrieve the first three elements using labels (unlike a positional slice, a label slice includes its end label)

In [22]:
s[:'c']

a    10
b    20
c    30
dtype: int64
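
The difference between the two kinds of slice is easier to see with the explicit accessors `iloc` (position) and `loc` (label). A short sketch using the same Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

by_position = s.iloc[:2]    # positions 0 and 1; the stop position is excluded
by_label = s.loc[:'b']      # labels 'a' through 'b'; the stop label is included

print(by_position.tolist())               # [10, 20]
print(by_label.tolist())                  # [10, 20]
print(s.iloc[:3].equals(s.loc[:'c']))     # True
```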

Retrieve the last three elements by position

In [23]:
s[-3:]

c    30
d    40
e    50
dtype: int64

Retrieve the last three elements using labels

In [24]:
s['c':]

c    30
d    40
e    50
dtype: int64

Retrieve multiple elements by position

In [25]:
s.iloc[[0,2,3]]

a    10
c    30
d    40
dtype: int64

Retrieve multiple elements using labels

In [26]:
s[['a','c','d']]

a    10
c    30
d    40
dtype: int64

If a label is not in the index, a `KeyError` is raised.

In [27]:
s['f']              # This will raise an error

KeyError: 'f'
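
To avoid the exception, you can test for the label first or use `get`, which returns a default value when the label is missing. A small sketch with the same Series:

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40, 50], index=['a', 'b', 'c', 'd', 'e'])

has_f = 'f' in s              # membership checks the labels, not the values
missing = s.get('f')          # None when the label is absent
fallback = s.get('f', 0)      # explicit default instead of None
present = s.get('a', 0)       # existing labels behave like s['a']

print(has_f, missing, fallback, present)   # False None 0 10
```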

## Other useful attributes and methods

In [28]:
s = pd.Series([1, 2, 3, 1, 3, np.nan])

Getting the indexes

In [29]:
s.index

RangeIndex(start=0, stop=6, step=1)

Getting the values

In [30]:
s.values

array([ 1.,  2.,  3.,  1.,  3., nan])

Getting the number of items in the series

In [31]:
# getting the number of items in the series
print('size  =', s.size)
print('shape =', s.shape)

size  = 6
shape = (6,)


In [32]:
print('Number of unique elements =', s.nunique())
print('Unique elements =', s.unique())

Number of unique elements = 3
Unique elements = [ 1.  2.  3. nan]
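
Note that the two results disagree: `nunique()` ignores `NaN` by default, while `unique()` includes it. A brief sketch of how to make the count include missing values, using the `dropna` parameter:

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1, 3, np.nan])

without_nan = s.nunique()               # NaN is skipped by default
with_nan = s.nunique(dropna=False)      # NaN counted as one more distinct value
n_listed = len(s.unique())              # unique() always keeps NaN

print(without_nan, with_nan, n_listed)  # 3 4 4
```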


In [33]:
print('Min    =', s.min())
print('Max    =', s.max())
print('Mean   =', s.mean())
print('Median =', s.median())
print('Sum    =', s.sum())
print('Non-missing values =', s.count())

Min    = 1.0
Max    = 3.0
Mean   = 2.0
Median = 2.0
Sum    = 10.0
Non-missing values = 5


The same statistics can be computed in a single call with `agg`

In [34]:
s.agg(['min','max','mean','median','sum','count'])

min        1.0
max        3.0
mean       2.0
median     2.0
sum       10.0
count      5.0
dtype: float64

2 smallest elements

In [35]:
# 2 smallest elements
s.nsmallest(2)

0    1.0
3    1.0
dtype: float64

2 largest elements

In [36]:
# 2 largest elements
s.nlargest(2)

2    3.0
4    3.0
dtype: float64

Number of times each element occurs

In [37]:
# number of times each element occurs
s.value_counts()

1.0    2
3.0    2
2.0    1
dtype: int64
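
By default `value_counts` leaves out `NaN`. The sketch below shows two parameters that change what is reported: `dropna=False` to count missing values and `normalize=True` to return proportions instead of counts.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, 1, 3, np.nan])

counts_with_nan = s.value_counts(dropna=False)   # adds a row for NaN
shares = s.value_counts(normalize=True)          # fractions of the non-missing values

print(counts_with_nan.sum())    # 6
print(shares[1.0])              # 0.4
print(shares[2.0])              # 0.2
```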

## References

- VanderPlas, J. (2017). Python Data Science Handbook: Essential Tools for Working with Data. USA: O’Reilly Media, Inc. Chapter 3.