# Pandas Series (`pd.Series`)

In [None]:
Most EDA (exploratory data analysis) notebooks start with the following two lines, which import the pandas and NumPy libraries. By convention, pandas is imported as `pd` and NumPy as `np`.

In [3]:
import pandas as pd
import numpy as np

## Pandas Series

The building block of pandas is the `pd.Series` object: a one-dimensional, indexed sequence of data. We will use crypto market caps as the running example:

In [4]:
market_caps = pd.Series([954.7, 514.4, 95.8, 76.3, 57.9, 45.7, 41.0, 38.7, 28.8, 25.8])

In [5]:
market_caps

0    954.7
1    514.4
2     95.8
3     76.3
4     57.9
5     45.7
6     41.0
7     38.7
8     28.8
9     25.8
dtype: float64

To make this data clearer, we can add a `name` to the Series:

In [None]:
market_caps.name = 'Market caps of top 10 cryptocurrencies in billions USD'

In [None]:
market_caps

We can see that the data itself is typed:

In [None]:
market_caps.dtype

And we can see that the values of the Series are stored as a plain NumPy array:

In [None]:
type(market_caps.values)
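
As a small sketch of the relationship between a Series and its underlying array, the snippet below uses `.to_numpy()`, which returns the same kind of array as `.values` and is the accessor the pandas documentation recommends. The name `caps` and the three-value sample are only for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative sample: the first three market caps from above
caps = pd.Series([954.7, 514.4, 95.8])

arr = caps.to_numpy()   # plain NumPy array, no index attached

print(type(arr))        # <class 'numpy.ndarray'>
print(arr.dtype)        # float64
print(arr.shape)        # (3,)
```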

One key difference between `np.ndarray` and `pd.Series` is that a `pd.Series` is indexed. In the example above, the index happens to be sequential from 0 to 9, but we can easily change it:

In [6]:
market_caps.index = ['BTC', 'ETH', 'BNB', 'USDT', 'SOL', 'ADA', 'USDC', 'XRP', 'DOT', 'LUNA']

In [7]:
market_caps

BTC     954.7
ETH     514.4
BNB      95.8
USDT     76.3
SOL      57.9
ADA      45.7
USDC     41.0
XRP      38.7
DOT      28.8
LUNA     25.8
dtype: float64

Doing this step by step is a bit of a pain, but we can do it all at once by passing a dict to the `pd.Series` constructor:

In [None]:
pd.Series({
    'BTC': 954.7, 
    'ETH': 514.4, 
    'BNB': 95.8, 
    'USDT': 76.3, 
    'SOL': 57.9, 
    'ADA': 45.7, 
    'USDC': 41.0, 
    'XRP': 38.7, 
    'DOT': 28.8, 
    'LUNA': 25.8
}, name='Market caps of top 10 cryptocurrencies in billions USD')

or using lists:

In [None]:
pd.Series(
    [954.7, 514.4, 95.8, 76.3, 57.9, 45.7, 41.0, 38.7, 28.8, 25.8],
    index=['BTC', 'ETH', 'BNB', 'USDT', 'SOL', 'ADA', 'USDC', 'XRP', 'DOT', 'LUNA'],
    name='Market caps of top 10 cryptocurrencies in billions USD'
)

tl;dr: Series construction is pretty intuitive and **very** flexible.
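
As a quick sanity check, here is a minimal sketch showing that the dict form and the list-plus-index form build the same Series. The variable names `from_dict` and `from_lists` are made up for this example, and only three coins are used to keep it short.

```python
import pandas as pd

# Same data, two construction styles
from_dict = pd.Series({'BTC': 954.7, 'ETH': 514.4, 'BNB': 95.8})
from_lists = pd.Series([954.7, 514.4, 95.8], index=['BTC', 'ETH', 'BNB'])

# .equals compares both the values and the index labels
print(from_dict.equals(from_lists))  # True
```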

## Series Indexes

Indices are very important to understand for both `pd.DataFrame` and `pd.Series`, and we'll start looking at them with Series here. When you access an item from the series, you can do it by **index** value or by **location**. To access by index you can simply do:

In [None]:
market_caps['BTC']

In [None]:
market_caps['USDC']

---
**note**: one thing that many people mess up at the beginning is thinking that the `[*]` syntax is for positional access. This mistake happens because the **default** index for both `pd.Series` and `pd.DataFrame` is a sequential integer index, so index access and location access give the same result. **HOWEVER**, this is absolutely not true once the order of the data changes (e.g. it's sorted).

Here's an example:

In [None]:
series = pd.Series([10, 2, 3])

In [None]:
series[0] # we expect 10

In [None]:
series[2] # we expect 3

In [None]:
series.sort_values(inplace=True)

In [None]:
series

In [None]:
series[0]

As you can see, when we access **index** `0` of the sorted series, we still get 10, even though it is now the 3rd item in the series. This is a very common error when working with `pandas`.

To properly access an element in a series **by position**, use `.iloc[*]`:

In [None]:
series[0]

In [None]:
series.iloc[0]
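
To make the distinction explicit, the sketch below puts label access and positional access side by side on the same sorted data. It uses `.loc[*]`, the explicit label-based counterpart of `.iloc[*]`, which the text above does not cover; the names `s`, `by_label` and `by_position` are illustrative.

```python
import pandas as pd

# After sorting, the values are 2, 3, 10 and the labels are 1, 2, 0
s = pd.Series([10, 2, 3]).sort_values()

by_label = s.loc[0]       # the row whose label is 0
by_position = s.iloc[0]   # the first row, whatever its label

print(by_label)     # 10
print(by_position)  # 2
```

Writing `.loc` or `.iloc` every time makes the intent visible to whoever reads the code later, so it is a reasonable habit even when the index is still the default one.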

---

Selecting multiple items from the Series can be done by passing in a list:

In [None]:
market_caps[['BTC', 'DOT']]

This also works with `.iloc[*]`:

In [None]:
market_caps.iloc[[1, 5]]

Finally, slicing can be done with both index and positional references:

In [None]:
market_caps['BTC':'BNB']

In [None]:
market_caps.iloc[0:2]

**note**: slicing by index label includes the end value (i.e. BNB is included in the first result above), whereas slicing by position excludes the end value, which is consistent with slicing a Python list.
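
The difference between the two slicing styles is easy to check on a small sample. This is only a sketch: `caps` is a four-coin subset of the data above, and `.loc` is used to make the label-based slice explicit.

```python
import pandas as pd

caps = pd.Series([954.7, 514.4, 95.8, 76.3],
                 index=['BTC', 'ETH', 'BNB', 'USDT'])

label_slice = caps.loc['BTC':'BNB']   # end label BNB is included
position_slice = caps.iloc[0:2]       # end position 2 is excluded

print(list(label_slice.index))     # ['BTC', 'ETH', 'BNB']
print(list(position_slice.index))  # ['BTC', 'ETH']
```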

## Series Filtering

If we want to select a subset of a Series by a condition against its value, we can first create a boolean series and then select on that:

In [None]:
market_caps < 50

In [None]:
market_caps[market_caps < 50]

We can also combine conditions easily, e.g.:

In [None]:
market_caps[(market_caps < 50) & (market_caps.index.str.len() == 4)]

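**note**: you need parentheses around each condition when you combine them with `&` or `|` (or negate one with `~`), because in Python these operators bind more tightly than comparison operators such as `<` and `==`. Without the parentheses, `market_caps < 50 & ...` is evaluated as `market_caps < (50 & ...)`.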
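
The precedence issue can be seen with plain Python integers, where `&` is applied before `<`. The sketch below also shows one way to sidestep it: give each mask its own name before combining them. The names `small`, `four_letters` and `result` are illustrative, and `caps` is a four-coin subset of the data above.

```python
import pandas as pd

caps = pd.Series([57.9, 45.7, 41.0, 38.7],
                 index=['SOL', 'ADA', 'USDC', 'XRP'])

# Plain-Python illustration: & is evaluated before <
print(1 < 2 & 2)     # parsed as 1 < (2 & 2) -> True
print((1 < 2) & 2)   # True & 2 -> 0

# Naming each mask first avoids the precedence trap entirely
small = caps < 50
four_letters = caps.index.str.len() == 4
result = caps[small & four_letters]

print(list(result.index))  # ['USDC']
```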

## Operations and Aggregations

Because a `pd.Series` is backed by NumPy, operations and aggregations are vectorized under the hood.

For example, multiplying by a scalar multiplies every item in the series:

In [None]:
market_caps * 1_000_000_000

Aggregations are also standard, e.g.:

In [None]:
market_caps.mean()

In [None]:
market_caps.sum()

In [None]:
market_caps.std()

In addition, NumPy functions generally work with a numeric Series, since its values are stored as an `np.ndarray` under the hood:

In [None]:
np.log(market_caps)

In [None]:
np.sum(market_caps)
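
One detail worth seeing in a small sketch: an element-wise NumPy function applied to a Series returns a Series with the index preserved, not a bare array. The labels `A`, `B`, `C` and the values here are made up so the results are easy to verify by eye.

```python
import numpy as np
import pandas as pd

s = pd.Series([4.0, 9.0, 16.0], index=['A', 'B', 'C'])

roots = np.sqrt(s)   # element-wise, result is still a pd.Series

print(type(roots).__name__)  # Series
print(list(roots.index))     # ['A', 'B', 'C']
print(roots.tolist())        # [2.0, 3.0, 4.0]
```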

## Mutating the Series

While most Series functions in pandas will not modify the original Series (e.g. `*`, filtering and slicing will all create new `pd.Series`), we can modify elements of a series in place:

In [None]:
market_caps

In [None]:
market_caps['BTC'] = 1000
market_caps

In [None]:
market_caps.iloc[3] = 0
market_caps

In [None]:
market_caps[market_caps < 50] = 0
market_caps
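
Since these assignments overwrite data, it can help to keep a copy of the original values around. The following is a minimal sketch, assuming a three-coin sample; `backup` and `scaled` are illustrative names.

```python
import pandas as pd

caps = pd.Series([954.7, 45.7, 25.8], index=['BTC', 'ADA', 'LUNA'])

scaled = caps * 2      # arithmetic returns a new Series; caps is untouched
backup = caps.copy()   # independent copy that keeps the original values

caps[caps < 50] = 0    # in-place change, affects caps only

print(caps.tolist())    # [954.7, 0.0, 0.0]
print(backup.tolist())  # [954.7, 45.7, 25.8]
```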