# Python Explainer - Essence of Pandas series and data frames

Pieter Overdevest  
2022-12-29

For suggestions/questions regarding this notebook, please contact
[Pieter Overdevest](https://www.linkedin.com/in/pieteroverdevest/)
(pieter@innovatewithdata.nl).

#### Aim

To explain the essence of Pandas series and data frames. Pandas series
are one-dimensional data objects that can be combined into Pandas data
frames, which are two-dimensional data objects.

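#### Prefixes in object names

Object names in this notebook start with a prefix that indicates the
object type: `ps_` for a Pandas series, `df_` for a data frame, `dc_`
for a dictionary, and `ar_` for an array.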

#### Initialization

We start by importing the Pandas package.

In [178]:
import pandas as pd

#### Pandas series out of the box

We start with a Pandas series containing three numeric values.

In [179]:
ps_score = pd.Series([

    43, 56, 78
    
])

ps_score

0    43
1    56
2    78
dtype: int64

Basic functions can be applied straightforwardly to a Pandas series.

In [180]:
print(f"Sum:     {ps_score.sum()}")
print(f"Minimum: {ps_score.min()}")
print(f"Maximum: {ps_score.max()}")
print(f"Mean:    {ps_score.mean()}")

Sum:     177
Minimum: 43
Maximum: 78
Mean:    59.0

After printing `ps_score` you can see on the left that each element is
accompanied by an index: 0, 1, and 2. We can pull them out of `ps_score`
using the `index` attribute. To get the values from the `RangeIndex`
object we append the `values` attribute.

In [181]:
print(f"Index (RangeIndex object): {ps_score.index}")
print(f"Index (values):            {ps_score.index.values}")

Index (RangeIndex object): RangeIndex(start=0, stop=3, step=1)
Index (values):            [0 1 2]

By default, the index is a sequential number range, starting with 0. We
can also assign an index ourselves, for example, the name of the person
who obtained each score.

In [182]:
ps_scores_assigned_index = pd.Series(

    [43, 56, 78],
    index = ['Tom', 'Bill', 'Jill']
)

ps_scores_assigned_index

Tom     43
Bill    56
Jill    78
dtype: int64

A Pandas series has attributes that we can access as follows, e.g.:

In [183]:
print(f"Index (object): {ps_scores_assigned_index.index}")
print(f"Index (values): {ps_scores_assigned_index.index.values}")
print(f"Values:         {ps_scores_assigned_index.values}")
print(f"Dimension:      {ps_scores_assigned_index.shape}")
print(f"Data type:      {ps_scores_assigned_index.dtype}")

Index (object): Index(['Tom', 'Bill', 'Jill'], dtype='object')
Index (values): ['Tom' 'Bill' 'Jill']
Values:         [43 56 78]
Dimension:      (3,)
Data type:      int64

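To obtain particular values from a Pandas Series we can use the index
name or the position to show, e.g., Tom’s score. For positional access
we use `.iloc`, because plain integer keys on a labelled index are
deprecated in recent Pandas versions.

In [184]:
print(f"Using the index name:   {ps_scores_assigned_index['Tom']}")
print(f"Using the index number: {ps_scores_assigned_index.iloc[0]}")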

Using the index name:   43
Using the index number: 43
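
As a minimal sketch of the two access styles, the explicit `.loc` (label-based) and `.iloc` (position-based) accessors also accept a list to pull out several elements at once. The name `ps_demo` is a hypothetical stand-in for `ps_scores_assigned_index`.

```python
import pandas as pd

# Hypothetical stand-in for ps_scores_assigned_index.
ps_demo = pd.Series([43, 56, 78], index=['Tom', 'Bill', 'Jill'])

# .loc selects by index label, .iloc by integer position.
ps_by_label = ps_demo.loc[['Tom', 'Jill']]
ps_by_position = ps_demo.iloc[[0, 2]]

print(f"By label:\n{ps_by_label}")
print(f"By position:\n{ps_by_position}")
```

Both selections return the scores of Tom and Jill; the difference is only in how the elements are addressed.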

In the same way, we can extract a part of a Pandas Series, e.g., Bill
and onwards. Note, `\n` introduces a line feed. It functions like the
metal arm on a manual typewriter.

In [185]:
print(f"Using the index name:\n{ps_scores_assigned_index['Bill':]}")
print(f"Using the index number:\n{ps_scores_assigned_index[1:]}")

Using the index name:
Bill    56
Jill    78
dtype: int64
Using the index number:
Bill    56
Jill    78
dtype: int64
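
One detail worth keeping in mind when slicing: a slice by label includes both endpoints, whereas a slice by position excludes the stop position. The sketch below illustrates this on a hypothetical series `ps_demo` with the same content as `ps_scores_assigned_index`.

```python
import pandas as pd

# Hypothetical stand-in for ps_scores_assigned_index.
ps_demo = pd.Series([43, 56, 78], index=['Tom', 'Bill', 'Jill'])

# Label-based slice: 'Tom' and 'Bill' are both included.
ps_label_slice = ps_demo.loc['Tom':'Bill']

# Position-based slice: position 1 is excluded, so only 'Tom' remains.
ps_position_slice = ps_demo.iloc[0:1]

print(f"Label slice:\n{ps_label_slice}")
print(f"Position slice:\n{ps_position_slice}")
```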

A negative start position counts from the end, which is useful in case
you are interested in the last two elements of a Pandas Series.

In [186]:
ps_scores_assigned_index[-2:]

Bill    56
Jill    78
dtype: int64

#### Pandas series via dictionary

A Pandas Series can also be obtained by converting a dictionary using
the `pd.Series()` function.

In [187]:
dc_time = {'Bill': 10, 'Jill': 9}
ps_time = pd.Series(dc_time)

ps_time

Bill    10
Jill     9
dtype: int64

As a side note, the `keys()` and `values()` methods extract the keys and
values of a dictionary object. The `list()` function converts the
resulting `dict_keys` and `dict_values` objects into lists.

In [188]:
print(f"Keys:   {list(dc_time.keys())}")
print(f"Values: {list(dc_time.values())}")

Keys:   ['Bill', 'Jill']
Values: [10, 9]

With the `in` operator we can easily check whether a name is present as
a key in `dc_time`, or as an index label in `ps_time`.

In [189]:
print(f"Is Tom in dc_time?:  {'Tom' in dc_time}")
print(f"Is Bill in dc_time?: {'Bill' in dc_time}")

Is Tom in dc_time?:  False
Is Bill in dc_time?: True

In [190]:
print(f"Is Tom in ps_time?:  {'Tom' in ps_time}")
print(f"Is Bill in ps_time?: {'Bill' in ps_time}")

Is Tom in ps_time?:  False
Is Bill in ps_time?: True
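
Note that `in` applied to a Pandas series looks at the index labels, not at the values. The sketch below, using a hypothetical series `ps_demo` with the same content as `ps_time`, shows the difference and one way to test the values instead.

```python
import pandas as pd

# Hypothetical stand-in for ps_time.
ps_demo = pd.Series({'Bill': 10, 'Jill': 9})

label_found = 'Bill' in ps_demo        # True: 'Bill' is an index label
value_as_label = 10 in ps_demo         # False: 10 is a value, not a label
value_found = 10 in ps_demo.values     # True: checks the underlying values

print(f"Is 'Bill' an index label?: {label_found}")
print(f"Is 10 an index label?:     {value_as_label}")
print(f"Is 10 among the values?:   {value_found}")
```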

#### Pandas Data frames

The two Pandas Series are combined into a Pandas data frame using the
Pandas `DataFrame()` function. The indices of both Pandas Series are
used to match the values in each of the rows in `df_score`. Why is the
‘time’ value empty for Tom?

In [191]:
df_score = pd.DataFrame({
    'time': ps_time,
    'score': ps_scores_assigned_index,
    
})

df_score
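
The alignment on index labels can be sketched as follows. The names `ps_a`, `ps_b` and `df_demo` are hypothetical and mirror `ps_time`, `ps_scores_assigned_index` and `df_score`. A label that is missing from one of the series gets `NaN` in the corresponding column, which `isna()` makes visible.

```python
import pandas as pd

# Hypothetical stand-ins for ps_time and ps_scores_assigned_index.
ps_a = pd.Series({'Bill': 10, 'Jill': 9})
ps_b = pd.Series([43, 56, 78], index=['Tom', 'Bill', 'Jill'])

# Rows are matched on index label, not on position.
df_demo = pd.DataFrame({'time': ps_a, 'score': ps_b})

# True for every row whose label does not occur in ps_a.
ps_missing_time = df_demo['time'].isna()

print(df_demo)
print(ps_missing_time)
```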

A Pandas data frame has attributes that we can access as follows, e.g.:

1.  The index (row) and column names:

In [192]:
print(f"Both in one list using 'axes':         {df_score.axes}")
print(f"Just the index names using 'index':    {df_score.index}")
print(f"Just the column names using 'columns': {df_score.columns}")

Both in one list using 'axes':         [Index(['Bill', 'Jill', 'Tom'], dtype='object'), Index(['time', 'score'], dtype='object')]
Just the index names using 'index':    Index(['Bill', 'Jill', 'Tom'], dtype='object')
Just the column names using 'columns': Index(['time', 'score'], dtype='object')

2.  The dimensions (shape):

In [193]:
print(f"Dimensions using 'shape':                  {df_score.shape}")
print(f"The first value is the number of rows:     {df_score.shape[0]}")
print(f"The second value is the number of columns: {df_score.shape[1]}")

Dimensions using 'shape':                  (3, 2)
The first value is the number of rows:     3
The second value is the number of columns: 2

3.  The column types:

In [194]:
print(f"The column types:\n{df_score.dtypes}")

The column types:
time     float64
score      int64
dtype: object
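
The ‘time’ column is `float64` rather than `int64` because it contains a missing value, and `NaN` is a floating point value. The sketch below shows this upcasting on hypothetical series. The nullable `Int64` dtype in the last line is an alternative offered by Pandas that this notebook does not use; it is shown only for comparison.

```python
import pandas as pd

# Hypothetical series: without and with a missing value.
ps_complete = pd.Series([10, 9])
ps_with_gap = pd.Series([10, 9, None])

# Alternative (not used in this notebook): nullable integer dtype.
ps_nullable = pd.Series([10, 9, None], dtype='Int64')

print(f"Complete:          {ps_complete.dtype}")
print(f"With a gap:        {ps_with_gap.dtype}")
print(f"Nullable integers: {ps_nullable.dtype}")
```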

We can take subsets of the data frame using `.loc` and `.iloc`. With
`.loc` we need to specify the index and/or column names. A bare `:`
selects everything along that dimension, while a slice such as `'Jill':`
selects ‘Jill’ and everything after it.

In [195]:
df_score.loc[:,'score']

Bill    56
Jill    78
Tom     43
Name: score, dtype: int64

In [196]:
df_score.loc['Jill':,:]

`.loc` can also be used in combination with a Pandas series of Boolean
values (`True`/`False`). This is helpful when you want to extract
particular rows from the data frame. For the filtering to work, the
index of `ps_true_false` must match the one in `df_score`.

In [197]:
ps_true_false = pd.Series([True, False, True], index = df_score.index)

df_score.loc[ps_true_false, 'time']

Bill    10.0
Tom      NaN
Name: time, dtype: float64
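
In practice, such a Boolean series is usually derived from a condition on the data frame itself, so its index matches automatically. A minimal sketch, assuming a hypothetical data frame `df_demo` with the same content as `df_score`:

```python
import pandas as pd

# Hypothetical stand-in for df_score.
df_demo = pd.DataFrame(
    {'time': [10.0, 9.0, None], 'score': [56, 78, 43]},
    index=['Bill', 'Jill', 'Tom'],
)

# The comparison returns a Boolean series with the same index as df_demo.
ps_high_score = df_demo['score'] > 50

# Keep only the rows for which the condition holds.
df_high_score = df_demo.loc[ps_high_score, :]

print(df_high_score)
```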

With `.iloc` you need to specify the row and/or column position
(a zero-based integer).

In [198]:
df_score.iloc[1,1]

78

The notation `0:2` is a slice covering positions 0 and 1, so not including 2.

In [199]:
df_score.iloc[0:2,1]

Bill    56
Jill    78
Name: score, dtype: int64

#### Joining Pandas Data frames

In case our data set consists of two or more tables (data frames), it is
highly likely that at some point we want to combine them. In this
section we will discuss the `concat()` and `merge()` functions that can
be used for this purpose. The `concat()` function relies solely on the
index or column names. The `merge()` function allows a combination of
index and column names. The examples below will further clarify this.

First, we create three additional data frames.

In [205]:
df_score_extra = pd.DataFrame({

        "time": [9.5, 14],
        "score": [67, 95],
        
    },
    index = ["Tim", "Pete"]
)

df_address = pd.DataFrame({

    "streetname": ["Main street", "Temple Road", "Ocean drive"]
    },
    index = ["Tom", "Bill", "Jill"]
)

df_age  = pd.DataFrame({

    "name": ["Tom", "Tim", "Jill", "Bill"],
    "age": [23, 54, 12, 67]
    }
)

We start out with the `concat()` function, see below. The `axis`
parameter defaults to 0, which combines the two data frames vertically,
putting one on top of the other. I still added `axis=0` for clarity;
`axis=1` combines the two data frames horizontally (see further below).
The `sort` parameter lets us sort the index/column names in the other
dimension. In other words, when we concatenate the data frames
vertically, `sort` determines whether the columns are sorted, and vice
versa. Both `df_score` and `df_score_extra` list the ‘time’ column
before the ‘score’ column; `sort=True` results in sorted columns, with
‘score’ before ‘time’.

In [204]:
df_score_plus_extra = pd.concat([df_score, df_score_extra], sort=True, axis=0)

df_score_plus_extra
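
When stacking vertically, `concat()` matches columns by name, not by position. The sketch below illustrates this with two hypothetical data frames, `df_top` and `df_bottom`, whose columns are deliberately listed in a different order.

```python
import pandas as pd

# Hypothetical data frames with the same columns in a different order.
df_top = pd.DataFrame(
    {'time': [10.0, 9.0], 'score': [56, 78]}, index=['Bill', 'Jill']
)
df_bottom = pd.DataFrame(
    {'score': [67, 95], 'time': [9.5, 14.0]}, index=['Tim', 'Pete']
)

# Values end up in the column with the matching name.
df_stacked = pd.concat([df_top, df_bottom], axis=0)

# With sort=True the column names are put in sorted order.
df_stacked_sorted = pd.concat([df_top, df_bottom], axis=0, sort=True)

print(df_stacked)
print(df_stacked_sorted)
```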

With `axis=1` we combine the two data frames horizontally, and now
`sort=True` sorts the index. The argument `join="inner"` results in a
data frame holding only the indices that the two data frames have in
common.

In [207]:
df_score_address = pd.concat([df_score_plus_extra, df_address], sort=True, axis=1, join="inner")

df_score_address
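
To make the effect of `join` concrete, the sketch below compares `join="inner"` with the default `join="outer"` on two small hypothetical data frames, `df_left` and `df_right`.

```python
import pandas as pd

# Hypothetical data frames; 'Tim' only occurs in df_left.
df_left = pd.DataFrame({'score': [43, 56, 67]}, index=['Tom', 'Bill', 'Tim'])
df_right = pd.DataFrame(
    {'streetname': ['Main street', 'Temple Road']}, index=['Tom', 'Bill']
)

# Inner join: only index labels present in both data frames are kept.
df_inner = pd.concat([df_left, df_right], axis=1, join='inner')

# Outer join (the default): all labels are kept, gaps are filled with NaN.
df_outer = pd.concat([df_left, df_right], axis=1, join='outer')

print(df_inner)
print(df_outer)
```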

Now, suppose we want to add ‘age’, stored in `df_age`, to
`df_score_plus_extra` using the ‘name’ information. The situation with
`df_age` is slightly different from that of the other data frames, where
‘name’ is used for the index; in `df_age`, ‘name’ is one of the columns.
By default, `merge()` combines the two data frames using an inner join,
i.e., it only keeps the rows where the joining feature occurs in both
data frames.

From the first data frame passed to `merge()`, referred to as the left
data frame (in this case `df_score_plus_extra`), we use the index, as
that is where ‘name’ is stored. From the second data frame, referred to
as the right data frame, we use the column named ‘name’.

In [208]:
pd.merge(df_score_plus_extra, df_age, left_index=True, right_on='name')
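
An alternative, sketched below under the assumption that we do not mind turning the index into a regular column first, is to name the index ‘name’, reset it, and then merge on the shared column with `on='name'`. The data frames `df_left` and `df_right` are hypothetical and much smaller than the ones above.

```python
import pandas as pd

# Hypothetical data frames: names in the index (left) and in a column (right).
df_left = pd.DataFrame({'score': [56, 95]}, index=['Bill', 'Pete'])
df_right = pd.DataFrame({'name': ['Tom', 'Bill'], 'age': [23, 67]})

# Approach used in this notebook: left index against right column.
df_via_index = pd.merge(df_left, df_right, left_index=True, right_on='name')

# Alternative: move the index into a 'name' column, then merge on it.
df_left_flat = df_left.rename_axis('name').reset_index()
df_via_column = pd.merge(df_left_flat, df_right, on='name')

print(df_via_index)
print(df_via_column)
```

Both results hold a single row for Bill, the only name occurring in both data frames.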

Besides the default inner join, we can also make use of [other types of
joins](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html),
like the left join.

Now, all rows in `df_score_plus_extra` remain in the resulting data
frame, including the row for Pete that was omitted in the inner join. As
expected, the value for Pete’s age is empty as it was not in `df_age`.

In [214]:
df_score_plus_extra_age = pd.merge(df_score_plus_extra, df_age, left_index=True, right_on='name', how='left')
df_score_plus_extra_age
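
To see which rows found a match, `merge()` accepts `indicator=True`, which adds a `_merge` column. The sketch below uses hypothetical data frames `df_left` and `df_right` that both store the names in a column.

```python
import pandas as pd

# Hypothetical data frames; 'Pete' has no counterpart in df_right.
df_left = pd.DataFrame({'name': ['Bill', 'Pete'], 'score': [56, 95]})
df_right = pd.DataFrame({'name': ['Tom', 'Bill'], 'age': [23, 67]})

# Left join: every row of df_left is kept; '_merge' reports the match status.
df_left_join = pd.merge(
    df_left, df_right, on='name', how='left', indicator=True
)

print(df_left_join)
```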

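Finally, we store the names from the index of `df_score_plus_extra` in
an array and reset the index of `df_score_plus_extra_age` to a default
range index.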

In [221]:
ar_names = df_score_plus_extra.index.values
ar_names

array(['Bill', 'Jill', 'Tom', 'Tim', 'Pete'], dtype=object)

In [224]:
df_score_plus_extra_age.reset_index(inplace=True)
df_score_plus_extra_age
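
As a small sketch of what `reset_index()` does, the example below uses a hypothetical data frame `df_demo`. Without `inplace=True` the method returns a new data frame; the old index is moved into a column unless `drop=True` is given.

```python
import pandas as pd

# Hypothetical data frame with names in the index.
df_demo = pd.DataFrame({'score': [56, 78]}, index=['Bill', 'Jill'])

# The old index becomes a column named 'index'.
df_index_kept = df_demo.reset_index()

# The old index is discarded.
df_index_dropped = df_demo.reset_index(drop=True)

print(df_index_kept)
print(df_index_dropped)
```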