In [73]:
%reload_ext postcell
%postcell register

PostCell loaded
Loading config file from /Users/amt/Documents/GitHub/lectures/postcell.conf
Registered user Aman_Monolbaev at https://postcell.io/post_cell


# Pandas Series - an overview

The Pandas library provides a fantastic interface to tabular data, made up of rows and columns. Tabular data is usually used to represent "business" datasets such as customer or product information. Generally, tables are organized so that rows represent observations and columns represent features. Such "tabular" datasets are different from other datasets used in machine learning, such as images, videos, graphs, etc.

We will study three main objects within the Pandas library: Dataframes and two objects which make up dataframes, Series and Indexes.

![](images/dataframes.jpg)

*Hint* Review earlier lectures which provide quickstart introductions to Numpy and Pandas. These notes take a more systematic approach to describing Pandas.

In [74]:
import numpy as np
import pandas as pd

In [75]:
pd.__version__

'2.2.3'

# Series

Pandas series are similar to Python's built-in lists and numpy arrays. Here is an example:

In [76]:
#Python list
["Bart", "Homer", "Lisa", "Maggie", "Marge"]

['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge']

In [77]:
pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [78]:
#Numpy array
np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"])

array(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype='<U6')

#### Series, like Numpy arrays, have types

In [79]:
type([1, 2, 3])

list

In [80]:
type([1.2, 2, 3])

list

In [81]:
type([1.2, 2, "Home"])

list

In [82]:
pd.Series([1, 2, 3])

0    1
1    2
2    3
dtype: int64

In [83]:
pd.Series([1.2, 2, 3])

0    1.2
1    2.0
2    3.0
dtype: float64

In [84]:
pd.Series([1.2, 2, "Homer"])

0      1.2
1        2
2    Homer
dtype: object

In [85]:
pd.Series(["Marge", "Lisa", "Homer"])

0    Marge
1     Lisa
2    Homer
dtype: object

Notice that, unlike Python lists, Series objects have a type associated with them. One of the reasons Pandas and Numpy are so much faster than native Python is that all elements are expected to be the same type, which lets operations be optimized for performance.

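**String type** Since Pandas 1.0, string series can use a dedicated `string` data type, which you request explicitly (for example with `dtype="string"`). By default, as the outputs above show, strings are still stored as `object`, the generic, catch-all type.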
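
As a small sketch of how types are chosen, assuming the Pandas 2.x version used in this lecture: the dedicated string type is requested explicitly, mixed values fall back to `object`, and `pd.to_numeric` with `errors="coerce"` is one way to get a numeric type back. The variable names are illustrative only.

```python
import pandas as pd

# Opt in to the dedicated string dtype (available since pandas 1.0)
names = pd.Series(["Bart", "Homer", "Lisa"], dtype="string")
print(names.dtype)

# Mixed values fall back to the generic object dtype
mixed = pd.Series([1.2, 2, "Homer"])
print(mixed.dtype)

# Coerce non-numeric entries to NaN to get a numeric dtype back
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.dtype)
```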

**Exercise** Create a Pandas series, containing numbers from zero to 10

In [31]:
%%postcell exercise_030_110_a

#type your answer here

Cell posted for evaluation


#### Explicit data types and categorical data

There are times when you want to control the type explicitly:

In [86]:
pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

In [87]:
pd.Series([1, 2, 3, 4, 5], dtype="float64")

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

In [88]:
pd.Series([1, 2, 3, 4, 5]).astype('float')

0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
dtype: float64

A much more useful use case is forcing a column to be categorical:

In [89]:
pd.Series(["Homer", "Marge", "Homer", "Homer", "Maggie", "Lisa"], dtype='category')

0     Homer
1     Marge
2     Homer
3     Homer
4    Maggie
5      Lisa
dtype: category
Categories (4, object): ['Homer', 'Lisa', 'Maggie', 'Marge']

In [90]:
pd.Series(["Homer", "Marge", "Homer", "Homer", "Maggie", "Lisa"]).astype('category')

0     Homer
1     Marge
2     Homer
3     Homer
4    Maggie
5      Lisa
dtype: category
Categories (4, object): ['Homer', 'Lisa', 'Maggie', 'Marge']

In [91]:
pd.Categorical(["Homer", "Marge", "Homer", "Homer", "Maggie", "Lisa", ]
           , categories=["Homer", "Marge", "Maggie", "Lisa"], ordered=False)

['Homer', 'Marge', 'Homer', 'Homer', 'Maggie', 'Lisa']
Categories (4, object): ['Homer', 'Marge', 'Maggie', 'Lisa']
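
The cells above use unordered categories. As a hedged sketch of what the `categories` and `ordered` arguments buy you, the example below uses hypothetical T-shirt sizes (not part of the lecture data) and `pd.CategoricalDtype` to declare an order, which then allows comparisons and `max()`.

```python
import pandas as pd

# Hypothetical example data: T-shirt sizes with a meaningful order
size_type = pd.CategoricalDtype(categories=["small", "medium", "large"], ordered=True)
sizes = pd.Series(["small", "large", "medium", "small"], dtype=size_type)

# Each value is stored as a small integer code pointing into the categories
codes = sizes.cat.codes.tolist()
print(codes)  # [0, 2, 1, 0]

# Ordered categories support comparisons and min/max
bigger_than_small = (sizes > "small").tolist()
print(bigger_than_small)  # [False, True, True, False]
print(sizes.max())        # large
```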

#### More complex keys

Notice that Pandas series are nothing more than a wrapper around Numpy arrays (at least in Pandas 1.0):

In [92]:
ss = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [93]:
ss.values

array(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype=object)

In [94]:
ss.index

RangeIndex(start=0, stop=5, step=1)

In [95]:
type(ss.values)

numpy.ndarray

In [96]:
type(ss.index)

pandas.core.indexes.range.RangeIndex

In [97]:
type(np.array(["Bart", "Homer", "Lisa", "Maggie", "Marge"]))

numpy.ndarray

Recall that Numpy was created to be a library for numeric matrix manipulation. It makes sense to ask, "what is the value at index zero?" It makes no sense, in matrix math, to ask, "what is the value at index 'Homer'?"

Pandas, however, is designed to work with datasets which may have categories and texts. If a matrix contains ages of people, it is perfectly reasonable to ask, what is the 'age' value at index 'Homer.'

In [98]:
ss

0      Bart
1     Homer
2      Lisa
3    Maggie
4     Marge
dtype: object

In [99]:
ss[1]

'Homer'

In [100]:
pd.Series([12, 38, 10, 2, 36])

0    12
1    38
2    10
3     2
4    36
dtype: int64

In [101]:
ss2 = pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

Notice that Series provide the ability to add a custom `index`:

In [102]:
ss2["Homer"]

np.int64(38)

**Exercise** Create a series of ages and names of characters as the index:
Ned 41, Daenerys 16, Tyrion 32, Jon 16

In [103]:
%%postcell exercise_030_110_
name_series = pd.Series([41, 16, 32, 16], index = ["Ned", "Daenerys", "Tyrion", "Jon"])
name_series

Cell posted for evaluation


Ned         41
Daenerys    16
Tyrion      32
Jon         16
dtype: int64

#### Series combine the properties of lists and dictionaries

In [104]:
pd.Series([41, 16, 32, 16])

0    41
1    16
2    32
3    16
dtype: int64

In [105]:
pd.Series({'Ned': 41, 'Daenerys': 16, 'Tyrion': 32, 'Jon': 16})

Ned         41
Daenerys    16
Tyrion      32
Jon         16
dtype: int64

Recall that Python's core lists are accessed by zero based integer values. Elements in a list are ordered, which means we can use the _slice_ notation to access multiple items:

In [106]:
["Bart", "Homer", "Lisa", "Maggie", "Marge"][2:4]

['Lisa', 'Maggie']

Dictionaries let us provide our own keys. However, they cannot be sliced:

In [107]:
{"Homer":38, "Marge":36, "Bart":12, "Lisa":10, "Maggie":6}["Bart":"Maggie"] #<= dictionaries don't understand slicing

KeyError: slice('Bart', 'Maggie', None)

Pandas Series combine properties of lists and dictionaries, along with the performance of Numpy:

In [108]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [109]:
ss2["Bart"] #as dictionary

np.int64(12)

In [110]:
ss2[0] # as list

np.int64(12)

In [111]:
ss2[2:4] # as list

Lisa      10
Maggie     2
dtype: int64

In [112]:
ss2["Lisa":"Marge"] # as ??

Lisa      10
Maggie     2
Marge     36
dtype: int64

**Caveat** Notice that when we sliced with the implicit (built-in) index values, `2:4`, the element at position 4 wasn't included in the result (as expected). However, when we sliced with explicit index values (the index we provided), `"Lisa":"Marge"`, the last value _is_ included. Series provides a way around this confusion via `.loc` and `.iloc`.
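
A minimal sketch of the difference, rebuilding the same data under the illustrative name `ages`: `.iloc` makes the position-based slice explicit and `.loc` makes the label-based slice explicit, so the reader can tell which end rule applies.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

by_position = ages.iloc[2:4]          # positions 2 and 3; position 4 is excluded
by_label = ages.loc["Lisa":"Marge"]   # labels Lisa through Marge; Marge is included

print(by_position.index.tolist())  # ['Lisa', 'Maggie']
print(by_label.index.tolist())     # ['Lisa', 'Maggie', 'Marge']
```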

![](images/series.jpg)

In [113]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [114]:
ss2.values

array([12, 38, 10,  2, 36])

In [115]:
ss2.index

Index(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype='object')

### Caveat: `s.columnA` vs `s['columnA']`

So far, we have accessed elements of Series using the syntax `s['columnA']`. However, the following syntax is also allowed `s.columnA`. For example

In [116]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [117]:
ss2['Marge']

np.int64(36)

In [118]:
ss2.Marge

np.int64(36)

In [119]:
ss2.shape

(5,)

In [121]:
ss2.Mr. Burns

AttributeError: 'Series' object has no attribute 'Mr'

The 'dot' syntax is very convenient, since it looks like accessing a normal attribute of an object. It is slightly shorter to type and provides better IDE help. In a cell, type `ss2.` then press the [TAB] key. You will notice a dropdown list of suggestions. Type "M" and you will see Marge's and Maggie's names pop up. You are using autocomplete to find elements of a series!

Also notice that you can call actual operations on a series:

In [123]:
ss2.sum()

np.int64(98)

If you type `series.XYZ`, does that refer to an element indexed by the key "XYZ", or to the function "XYZ()"? This confusion is the reason why, unless we are sure that there is no conflict, the safer option is to use the syntax `s['columnA']`. The dot syntax also cannot express keys that are not valid Python names, such as keys containing a space:

In [124]:
ss2['Homer J']
ss2.Homer J #<= syntax error!

SyntaxError: invalid syntax (3628551129.py, line 2)
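
To make the name conflict concrete, here is a small sketch with a hypothetical series (`totals`, not part of the lecture data) whose labels happen to match existing Series attributes. Bracket access returns the stored element, while dot access finds the built-in attribute first.

```python
import pandas as pd

# Hypothetical series whose labels collide with Series attributes
totals = pd.Series([10, 20], index=["sum", "shape"])

print(totals["sum"])         # 10, the element stored under the label "sum"
print(callable(totals.sum))  # True, attribute access finds the method instead
print(totals.shape)          # (2,), the real shape attribute, not the element 20
```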

### Indexing Series

Similar to Numpy, values in series can be retrieved in several ways:
1. Implicit index (similar to lists)
2. Explicit index or label (similar to dictionaries)
3. Slicing
4. Boolean Masking
5. Fancy indexing

#### Implicit index

In [125]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

Given the series above, where the names of the Simpson family are the index and their ages are the values, we can get Marge's age, which is in the fifth row, by directly requesting the value at position 4 (remember that Python is zero based):

In [126]:
ss2[4]

np.int64(36)

**Exercise** Given the series `ss2`, get the second element

In [130]:
%%postcell exercise_030_110_c
ss2[1]
#type your answer here

Cell posted for evaluation


np.int64(38)

#### Explicit index or labels

Much like dictionaries, the key associated with Marge's name will return her age:

In [131]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [132]:
ss2["Marge"]

np.int64(36)

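**Caveat** Notice that the same syntax, `series[index]`, is used to access data both implicitly and explicitly. If `index` is an integer and the labels are not integers, Pandas treats it as an implicit index (a position); otherwise Pandas treats it as an explicit index (a label). Recent Pandas versions, including the 2.2.3 used here, emit a `FutureWarning` for this positional fallback and recommend `.iloc` instead. What if our explicit index were also made of integers? See `.loc` and `.iloc`, later in the lecture.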

**Exercise** Given the series `ss2`, get the element corresponding to "Maggie"

In [133]:
%%postcell exercise_030_110_d
ss2["Maggie"]

#type your answer here

Cell posted for evaluation


np.int64(2)

#### Slicing

Similar to Numpy arrays and Python lists, Series can be sliced. Note that keys are sliced in terms of their location in the series, **not alphabetically**

In [134]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [135]:
ss2[2:4]

Lisa      10
Maggie     2
dtype: int64

In [136]:
ss2["Lisa":"Marge"]

Lisa      10
Maggie     2
Marge     36
dtype: int64

Notice that slicing with explicit values includes the last item, while slicing with implicit indexes does not include the last value.

Remember that negative indexes can be used, just like normal Python lists. Below we select items which start at "second to last" and end at the last item:

In [137]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [None]:
ss2[-2:]

**Exercise** Given the series `ss2`, get the second, third and fourth elements (use slicing)

In [140]:
%%postcell exercise_030_110_e
ss2[1:4]

#type your answer here

Cell posted for evaluation


Homer     38
Lisa      10
Maggie     2
dtype: int64

#### Boolean (Masking)

Providing a `True` or `False` value for each element filters out the elements corresponding to `False`.

In [None]:
ss2

In [141]:
mask = [True, False, True, True, True]
ss2[mask]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [142]:
ss2[[True, False, True, True, True]]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

Btw, this does not work with Python lists

In [143]:
['B', 'H', 'L', 'M', 'M'][mask]

TypeError: list indices must be integers or slices, not list

Filter on the index

In [144]:
ss2.index != "Homer"

array([ True, False,  True,  True,  True])

In [145]:
ss2[ss2.index != "Homer"]

Bart      12
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [146]:
ss2[ss2 != "Homer"]

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

Filter on the values

In [147]:
ss2.values <30

array([ True, False,  True,  True, False])

In [149]:
ss2[ss2.values < 30]

Bart      12
Lisa      10
Maggie     2
dtype: int64

In [150]:
ss2 < 30

Bart       True
Homer     False
Lisa       True
Maggie     True
Marge     False
dtype: bool

Notice that you don't have to call `.values` for cleaner code

In [151]:
ss2[ss2 < 30]

Bart      12
Lisa      10
Maggie     2
dtype: int64
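
Masks can also be combined. The sketch below, which rebuilds the data under the illustrative name `ages`, uses the element-wise operators `&` (and) and `~` (not); the age thresholds are arbitrary choices for the example.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Parentheses are required: & and | bind tighter than comparisons
school_age = ages[(ages >= 5) & (ages < 18)]
print(school_age.index.tolist())  # ['Bart', 'Lisa']

# ~ negates a mask; here it is combined with a condition on the index
adults_except_homer = ages[(ages >= 18) & ~(ages.index == "Homer")]
print(adults_except_homer.index.tolist())  # ['Marge']
```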

**Exercise** Find everyone older than 2 years

In [153]:
%%postcell exercise_030_110_f
ss2[ss2 > 2]

#type your answer here

Cell posted for evaluation


Bart     12
Homer    38
Lisa     10
Marge    36
dtype: int64

**Exercise** Find everyone, except "Maggie"

In [156]:
%%postcell exercise_030_110_g
ss2[ss2.index != "Maggie"]
#type your answer here

Cell posted for evaluation


Bart     12
Homer    38
Lisa     10
Marge    36
dtype: int64

#### Fancy indexing

If you know the specific list of items, you can ask the Series to return them directly:

In [157]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [159]:
ss2[[1,3,4]]

Homer     38
Maggie     2
Marge     36
dtype: int64

Notice that the Series returns elements in the order in which they appear in the list of indexes:

In [160]:
ss2[[4,3,1]]

Marge     36
Maggie     2
Homer     38
dtype: int64

In [161]:
ss2[[4,3,2, 2, 2, 2]]

Marge     36
Maggie     2
Lisa      10
Lisa      10
Lisa      10
Lisa      10
dtype: int64

You can also use the explicit index, instead of the implicit index

In [162]:
ss2[["Marge", "Marge", "Lisa"]]

Marge    36
Marge    36
Lisa     10
dtype: int64
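
One thing to watch for with label-based fancy indexing is a label that does not exist. The sketch below assumes a hypothetical missing label, "Ned": requesting it through `.loc` raises a `KeyError`, while `reindex()` is one way to get `NaN` for the missing label instead.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# "Ned" is not in the index, so label-based fancy indexing raises KeyError
try:
    ages.loc[["Marge", "Ned"]]
    raised = False
except KeyError:
    raised = True
print(raised)  # True

# reindex() keeps the requested labels and fills the missing one with NaN
looked_up = ages.reindex(["Marge", "Ned"])
print(looked_up.isna().tolist())  # [False, True]
```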

**Exercise** Given the series `ss2`, get the second, third and fifth elements (using fancy indexing)

In [166]:
%%postcell exercise_030_110_h
ss2[[1,2,4]]


#type your answer here

Cell posted for evaluation


Homer    38
Lisa     10
Marge    36
dtype: int64

### `.loc` and `.iloc` or confusion when explicit indexes are integers

Notice that the series we have been working with has strings as the explicit index and integers as the values:

In [167]:
pd.Series([12, 38, 10, 2, 36], index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

What if we flip that around, so that the ages become the index and the names become the values?

In [168]:
ss3 = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"], index=[12, 38, 10, 2, 36])
ss3

12      Bart
38     Homer
10      Lisa
2     Maggie
36     Marge
dtype: object

**Caveat** If I ask for `ss3[2]`, am I asking for the element at location 2 or for the character whose age is 2?

In [169]:
ss3[2]

'Maggie'

In order to avoid this confusion, Pandas provides the indexers `.loc`, which expects explicit index values (aka labels), and `.iloc`, which expects implicit index values (positions):

In [170]:
ss3.iloc[2] #implicit index location

'Lisa'

In [171]:
ss3.loc[2]

'Maggie'
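
The same distinction applies to slices. As a sketch, the series is rebuilt below under the illustrative name `names`: `.loc` slices by label, following the order of the index and including the end label, while `.iloc` slices by position and excludes the end.

```python
import pandas as pd

names = pd.Series(["Bart", "Homer", "Lisa", "Maggie", "Marge"],
                  index=[12, 38, 10, 2, 36])

print(names.loc[12])    # Bart, the element labelled 12
print(names.iloc[0])    # Bart, the element at position 0

# Label slices follow the order of the index and include the end label
label_slice = names.loc[38:2].tolist()
print(label_slice)      # ['Homer', 'Lisa', 'Maggie']

# Position slices exclude the end position
position_slice = names.iloc[1:3].tolist()
print(position_slice)   # ['Homer', 'Lisa']
```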

#### Special indexer: `.at[]`

1. Implicit index (similar to lists)
2. Explicit index (similar to dictionaries)
3. Slicing
4. Masking
5. Fancy indexing

Notice that we have several ways of accessing elements: index, labels, slicing, masking and fancy indexing. The same syntax `s[XXX]`, `s.loc[XXX]` or `s.iloc[XXX]` can be used with any of the methods above. There are times when you want exactly a single, scalar value to be returned. In such cases, you use the syntax `s.at[XXX]`:

In [172]:
ss3.at[2]

'Maggie'

In [None]:
ss2.at['Maggie']

In [None]:
ss3.at[1:2] # <= slicing is prohibited, since it would return multiple values
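
As a short sketch of scalar access, the example below uses `.at` for labels and also `.iat`, its position-based counterpart in Pandas (not covered above). The data is rebuilt under the illustrative name `ages`, and the new value assigned to Maggie is made up for the example.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

by_label = ages.at["Maggie"]   # scalar lookup by label
by_position = ages.iat[3]      # scalar lookup by position
print(by_label, by_position)   # 2 2

# .at can also set a single value in place
ages.at["Maggie"] = 3
print(ages.at["Maggie"])  # 3
```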

### Indexing summary

In [None]:
ss2

```python
#implicit (like lists)
ss2[integer] # get value at location number

#explicit labels (like dictionaries)
ss2[key]     # get value at index key 

#slicing
ss2[integer_start:integer_end] # get values between locations integer_start and integer_end (integer_end not included)
ss2[key_start:key_end]         # get values between keys key_start and key_end (key_end INCLUDED, by locations of keys, not alphabetical)

#fancy indexing
ss2[[integer1, integer3]]  # get values at locations integer1 and integer3
ss2[[key1, key3]] # get values at keys key1 and key3

ss2.loc[...]  # always operate in terms of keys

ss2.iloc[...] # always operate in terms of location

ss2.at[...]   # always return a single value
```

### Creating Series

We have already seen series being created with a list and an index: `pd.Series(data, index=None)`.

Another common way of creating a series is via a dictionary:

In [173]:
pd.Series({"Homer":38, "Marge":36})

Homer    38
Marge    36
dtype: int64

Create it from a list

In [174]:
pd.Series(['Homer', 'Marge', "Maggie"])

0     Homer
1     Marge
2    Maggie
dtype: object

There are less common methods of creating series, such as:

In [None]:
pd.Series(44, index=["Homer", "Marge", "Maggie"])

You can always combine what you learned for lists with series:
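
For instance, the following sketch builds series from a list comprehension, from two lists zipped into a dictionary, and from a single scalar repeated over an index. The variable names are illustrative.

```python
import pandas as pd

# A list comprehension builds the values, then Series wraps them
squares = pd.Series([n ** 2 for n in range(5)])
print(squares.tolist())  # [0, 1, 4, 9, 16]

# zip() two lists into a dictionary to get labels and values together
names = ["Homer", "Marge", "Maggie"]
ages = [38, 36, 2]
family = pd.Series(dict(zip(names, ages)))
print(family["Marge"])  # 36

# A scalar is repeated for every label in the index
same_for_all = pd.Series(44, index=names)
print(same_for_all.tolist())  # [44, 44, 44]
```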

### Converting to other formats

Series can be converted to a dictionary very easily:

In [175]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [176]:
ss2.to_dict()

{'Bart': 12, 'Homer': 38, 'Lisa': 10, 'Maggie': 2, 'Marge': 36}

Other, similar methods:
1. ss2.to_excel
2. ss2.to_frame (convert to dataframe)
3. ss2.to_json
4. ss2.to_sql

And many others

You often need to convert a series to a DataFrame:

In [177]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [178]:
ss2.to_frame()

         0
Bart    12
Homer   38
Lisa    10
Maggie   2
Marge   36


In [179]:
pd.DataFrame(ss2)

         0
Bart    12
Homer   38
Lisa    10
Maggie   2
Marge   36


In [180]:
pd.DataFrame(ss2).shape

(5, 1)

One of the most useful patterns is to convert a series and its index to a dataframe with two columns:

In [181]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [182]:
ss2.reset_index(name="age")

    index  age
0    Bart   12
1   Homer   38
2    Lisa   10
3  Maggie    2
4   Marge   36


In [None]:
ss2.reset_index(name="age").rename(columns={"index":"name"})
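
A possible alternative to renaming the column afterwards, sketched here with the data rebuilt under the illustrative name `ages`, is to name the index first with `rename_axis()`. The same sketch shows that `to_json()` returns a string, which the standard library's `json.loads()` can read back.

```python
import json
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

# Name the index first, so reset_index() produces the final column names directly
people = ages.rename_axis("name").reset_index(name="age")
print(people.columns.tolist())  # ['name', 'age']
print(people.shape)             # (5, 2)

# to_json() returns a string; json.loads() turns it back into a dictionary
as_dict = json.loads(ages.to_json())
print(as_dict["Homer"])  # 38
```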

### Operating on series

Much like Numpy arrays, several operations can be done on a series:

In [183]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [184]:
ss2.sum(), ss2.count(), ss2.min(), ss2.max(), ss2.mean()

(np.int64(98), np.int64(5), 2, 38, np.float64(19.6))

In [187]:
# For each VALUE in the series, is it in this list?
ss2.isin([10, 13, 12, 46, 38])

Bart       True
Homer      True
Lisa       True
Maggie    False
Marge     False
dtype: bool

In [188]:
ss2.isin(["Homer", "Mr. Burns", "Barney", "Maggie", "Dr. Hibert"])

Bart      False
Homer     False
Lisa      False
Maggie    False
Marge     False
dtype: bool

In [189]:
ss2.index.isin(["Homer", "Mr. Burns", "Barney", "Maggie", "Dr. Hibert"])

array([False,  True, False,  True, False])

In [190]:
ss2.index # be aware using values versus index 

Index(['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge'], dtype='object')

**Exercise** Explain the previous two cells

In [191]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

In [193]:
ss2.sample(2)

Marge    36
Lisa     10
dtype: int64

In [194]:
ss2.sample(10, replace=True)

Homer    38
Homer    38
Bart     12
Lisa     10
Homer    38
Bart     12
Lisa     10
Bart     12
Bart     12
Homer    38
dtype: int64

In [195]:
ss2.nlargest(2)

Homer    38
Marge    36
dtype: int64

In [196]:
ss2.astype('float')

Bart      12.0
Homer     38.0
Lisa      10.0
Maggie     2.0
Marge     36.0
dtype: float64

In [197]:
ss2.sort_values()

Maggie     2
Lisa      10
Bart      12
Marge     36
Homer     38
dtype: int64

In [198]:
ss2.sort_index()

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64
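
A few related calls, shown as a sketch on the original ages rebuilt under the illustrative name `ages`: sorting in descending order, picking the smallest values, and finding the label of the largest value. None of them modify the series they are called on.

```python
import pandas as pd

ages = pd.Series([12, 38, 10, 2, 36],
                 index=["Bart", "Homer", "Lisa", "Maggie", "Marge"])

oldest_first = ages.sort_values(ascending=False)
print(oldest_first.index.tolist())  # ['Homer', 'Marge', 'Bart', 'Lisa', 'Maggie']

youngest_two = ages.nsmallest(2)
print(youngest_two.index.tolist())  # ['Maggie', 'Lisa']
print(ages.idxmax())                # Homer

# Sorting returns a new series; the original order is unchanged
print(ages.index.tolist())  # ['Bart', 'Homer', 'Lisa', 'Maggie', 'Marge']
```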

In [199]:
ss2

Bart      12
Homer     38
Lisa      10
Maggie     2
Marge     36
dtype: int64

**Exercise** Add two years to everyone's age (recall the Numpy lecture) in series `ss2`

In [215]:
%%postcell exercise_030_110_i
ss2 = ss2+2

#type your answer here

Cell posted for evaluation


### Combining series and intelligently handling missing data

The ability to intelligently combine series is one of the most powerful features of Pandas. Note the series we have already been working on, which contains Simpson family ages:

In [216]:
ss2

Bart      14
Homer     40
Lisa      12
Maggie     4
Marge     38
dtype: int64

In [217]:
weights = pd.Series([240, 85], index=["Homer", "Bart"])
weights

Homer    240
Bart      85
dtype: int64

Notice that we don't have everyone's weight. What happens when we combine these two series:

In [218]:
pd.DataFrame({'ages':ss2, 'weights':weights})

        ages  weights
Bart      14     85.0
Homer     40    240.0
Lisa      12      NaN
Maggie     4      NaN
Marge     38      NaN


Series are designed for single dimensional data. DataFrames, which we will study soon, contain multiple series. Notice that the two series have been intelligently combined!

Let's add two series. Notice that we can add them as if they were scalars or numpy arrays. Further notice that Pandas intelligently inserts `NaN` in appropriate places

In [221]:
ss2 + weights

Bart       99.0
Homer     280.0
Lisa        NaN
Maggie      NaN
Marge       NaN
dtype: float64

In [220]:
ss2.add(weights, fill_value=0)

Bart       99.0
Homer     280.0
Lisa       12.0
Maggie      4.0
Marge      38.0
dtype: float64
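
The key point is that values are matched by label, not by position. The sketch below uses a smaller, illustrative version of the two series (three ages, two weights in a different order) and then shows two common follow-ups for the resulting `NaN`: `dropna()` and `fillna()`.

```python
import pandas as pd

ages = pd.Series([14, 40, 12], index=["Bart", "Homer", "Lisa"])
weights = pd.Series([240, 85], index=["Homer", "Bart"])  # different order, no Lisa

combined = ages + weights
print(combined["Homer"])         # 280.0, matched by label, not by position
print(combined.isna().tolist())  # [False, False, True]

# Two common follow-ups: drop the missing rows or replace them
print(combined.dropna().index.tolist())  # ['Bart', 'Homer']
print(combined.fillna(0).tolist())       # [99.0, 280.0, 0.0]
```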

**Intelligent handling of missing values is one of the reasons to use Pandas**

Note that aggregate functions _know_ to ignore missing values when adding series:

In [222]:
(ss2 + weights).sum()

np.float64(379.0)

In [224]:
99+280

379
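
Skipping missing values is the default, not the only option. In this sketch, with the combined totals typed in by hand under the illustrative name `totals`, `count()` reports only the non-missing entries and `skipna=False` makes a missing value propagate into the sum.

```python
import math
import pandas as pd

totals = pd.Series([99.0, 280.0, None], index=["Bart", "Homer", "Lisa"])

print(totals.sum())    # 379.0, NaN is skipped by default
print(totals.count())  # 2, only non-missing values are counted
print(len(totals))     # 3, the length still includes the missing entry

# skipna=False makes the missing value propagate into the result
strict_total = totals.sum(skipna=False)
print(math.isnan(strict_total))  # True
```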