# pandas library

The name pandas is derived from *panel data* and is often also read as the **P**ython **an**d **D**ata **A**naly**s**is Library. The library is built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

Most Index instances are numeric by default, that is zero-based integers in steps of one.

The Index, similar to a tuple, list or 1darray, has a single dimension, which can be represented either as a row:

|index|0|1|2|3|
|---|---|---|---|---|

Or as a column when convenient:

|index|
|---|
|0|
|1|
|2|
|3|


The Series class has a value at each index and a name. It is essentially a numpy 1darray that has a name. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame instance is essentially a grouping of Series instances that share the same index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|
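The relationship between the three structures can be sketched in code. This is a minimal illustration, assuming pandas is installed and imported under the conventional alias `pd`; the variable names `x`, `y` and `df` are only for illustration:

```python
import pandas as pd

# Two Series instances that share the same default index (0, 1, 2, 3)
x = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
y = pd.Series([1.2, 2.2, 3.2, 4.2], name='y')

# Grouping the Series instances by name gives a DataFrame
df = pd.DataFrame({'x': x, 'y': y})

print(type(x.index).__name__)    # the kind of Index associated with the Series
print(df.index.equals(x.index))  # the DataFrame shares that Index
print(list(df.columns))          # one column per Series name
```

Each column of `df` can be read back as a Series with the same index as `x` and `y`.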

## Importing Libraries

To use the data science libraries they need to be imported:

In [1]:
import numpy as np 
import pandas as pd
from helper_module import print_identifier_group

Once imported the identifiers can be viewed:

In [2]:
print('datamodel attribute:', end=' ')
print_identifier_group(pd, 'datamodel_attribute')
print('datamodel method:', end=' ')
print_identifier_group(pd, 'datamodel_method')
print('attribute:', end=' ')
print_identifier_group(pd, 'attribute')
print('function:', end=' ')
print_identifier_group(pd, 'function')
print('class:', end=' ')
print_identifier_group(pd, 'upper_class')


datamodel attribute: ['__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']
datamodel method: []
attribute: ['annotations', 'api', 'arrays', 'compat', 'core', 'errors', 'io', 'offsets', 'options', 'pandas', 'plotting', 'testing', 'tseries', 'util']
function: ['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'describe_option', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'period_range', 'pivot', 'pivot_table', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 

The datamodel attributes \_\_name\_\_ (dunder name), \_\_version\_\_ (dunder version) and \_\_file\_\_ (dunder file) can be used to get details about the library:

In [3]:
pd.__name__

'pandas'

In [4]:
pd.__version__

'2.1.4'

In [5]:
pd.__file__

'c:\\Users\\phili\\Anaconda3\\envs\\vscode-env\\Lib\\site-packages\\pandas\\__init__.py'

The classes are all in CamelCase. The main classes are:

* Index
* Series
* DataFrame
 
There are some variations of Index such as RangeIndex, MultiIndex, DatetimeIndex and TimedeltaIndex.

In general pandas uses object-oriented programming (OOP) as opposed to functional programming. This means methods are normally called on Index, Series and DataFrame instances to analyse or manipulate the data in the instance. Most of the functions within the pandas library are used to read in data from a file and output a DataFrame instance.

## Series

The initialisation signature for a pandas ```Series``` class can be examined:

In [6]:
pd.Series?

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool' = False,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray hav

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied an empty series instance with no index, no name and a generic object datatype is instantiated:

In [7]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a numpy 1darray:

In [8]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

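Since an ndarray is itself usually initialised from a list, the list can be supplied directly. Notice that the inferred integer dtype can differ between the two (```int32``` from the ndarray above and ```int64``` from the list in the outputs shown here):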

In [9]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64

When ```dtype=None```, the data type will be inferred from the data:

In [10]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [11]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2023-12-27 12:33:18.172161
1   2023-12-28 12:33:18.172161
2   2023-12-29 12:33:18.172161
dtype: datetime64[ns]

In [12]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a string in it is classed as non-numeric and has the generic dtype ```object``` (meaning it can be any Python object).
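The effect of a single string on the inferred dtype can be sketched as follows. This is a minimal illustration; the names `numeric` and `mixed` are only for illustration:

```python
import pandas as pd

numeric = pd.Series([1, 2, 3])
mixed = pd.Series([1, 'b', 3.0])  # one str makes the Series non-numeric

print(numeric.dtype)  # an integer dtype (the exact width depends on the platform)
print(mixed.dtype)    # object

# With the object dtype each element keeps its original Python type
print([type(value).__name__ for value in mixed])
```

Because the elements of an ```object``` Series are ordinary Python objects, numeric operations on such a Series are generally slower than on a Series with a numeric dtype.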

The dtype can be manually overridden when supplying the numpy 1darray by using the ```np.array``` input argument ```dtype```:

In [13]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or by alternatively using the ```Series``` keyword input argument ```dtype```:

In [14]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32

Notice that the index is zero-ordered numeric in integer steps of 1 by default. This can be manually changed by use of the keyword input argument ```index``` and providing an ```Index``` instance, ```ndarray``` instance or ```list``` instance of index values:

In [15]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32
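The three ways of supplying the ```index``` can be compared side by side. This is a minimal sketch; the names `from_list`, `from_array` and `from_index` are only for illustration:

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
data = [1.0, 2.0, 3.0]

from_list = pd.Series(data, index=labels)             # list of labels
from_array = pd.Series(data, index=np.array(labels))  # ndarray of labels
from_index = pd.Series(data, index=pd.Index(labels))  # Index instance

# All three Series carry an equivalent Index
print(from_list.index.equals(from_array.index))
print(from_list.index.equals(from_index.index))
```

Whichever form is supplied, the labels are stored on the Series as an ```Index``` instance.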

A ```Series``` usually also has a ```name```:

In [16]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32

Normally the ```data``` and ```name``` are supplied and the ```index``` and ```dtype``` are inferred:

In [17]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

## DataFrame

The initialisation signature for a pandas DataFrame can be examined:

In [18]:
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
d

The keyword input arguments for a ```DataFrame``` instance are similar to those found for a ```Series``` instance. Because a ```DataFrame``` is a collection of ```Series``` instances, ```data``` and ```columns``` are plural, while ```index``` and ```dtype``` are singular:

* data (plural)
* index (singular)
* columns (plural of name)
* dtype (singular, one dtype applied to every column)

In [19]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The ```dtype``` keyword input argument accepts only a single ```dtype```, which is applied to every ```Series``` instance in the ```DataFrame``` instance. If it is supplied as a list of dtypes a ```TypeError``` is raised.
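One way to give each column its own dtype, sketched below, is to construct the ```DataFrame``` first and then call the ```astype``` method with a mapping of column name to dtype. The column names and dtypes here are only for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1.0, 1.2],
                        [2.0, 2.2],
                        [3.0, 3.2]],
                  columns=['x', 'y'])

# astype accepts a mapping of column name to dtype and returns a new DataFrame
converted = df.astype({'x': np.int32, 'y': np.float64})

print(converted.dtypes)
print(df.dtypes)  # the original DataFrame is unchanged
```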

Once again normally the ```dtype``` and ```index``` are inferred:

In [20]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


It is common to supply ```columns``` and ```data``` together in the form of a mapping. Each item in the mapping is a key: value pair. The key should be a string, which becomes the column name, and the value should be a 1darray or list, which becomes the data in that column:

In [21]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


## Series Identifiers

If the following ```NDArray``` (1D) and ```Series``` instances are created:

In [22]:
xarray = np.array([1.1, 2.1, 3.1, 4.1])

In [23]:
xarray

array([1.1, 2.1, 3.1, 4.1])

In [24]:
xseries = pd.Series(xarray, name='x')

In [25]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

The ```Series``` is more commonly instantiated directly from a ```list```:

In [26]:
yseries = pd.Series([1.1, -2.1, -3.1, None], name='y', dtype=float)

In [27]:
yseries

0    1.1
1   -2.1
2   -3.1
3    NaN
Name: y, dtype: float64

Its identifiers can be viewed. Notice that the following are consistent with a ```1darray``` instance because a ```Series``` instance is based on a ```NDArray``` (1D). This means the previous knowledge from the ```numpy``` tutorial is applicable to the ```Series```:

In [28]:
print('datamodel attribute:', end=' ')
print_identifier_group(xarray, kind='datamodel_attribute', second=xseries, show_only_intersection_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xarray, kind='datamodel_method', second=xseries, show_only_intersection_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xarray, kind='attribute', second=xseries, show_only_intersection_identifiers=True)
print('method:', end=' ')
print_identifier_group(xarray, kind='function', second=xseries, show_only_intersection_identifiers=True)

datamodel attribute: ['__array_priority__', '__doc__', '__hash__']
datamodel method: ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', 

Some of the identifiers in a ```NDArray``` (1D) are not applicable to a ```Series``` such as the functions which work over multiple dimensions:

In [29]:
print('datamodel attribute:', end=' ')
print_identifier_group(xarray, kind='datamodel_attribute', second=xseries, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xarray, kind='datamodel_method', second=xseries, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xarray, kind='attribute', second=xseries, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xarray, kind='function', second=xseries, show_unique_identifiers=True)

datamodel attribute: ['__array_interface__', '__array_struct__']
datamodel method: ['__array_finalize__', '__array_function__', '__array_prepare__', '__array_wrap__', '__buffer__', '__class_getitem__', '__complex__', '__dlpack__', '__dlpack_device__', '__ilshift__', '__imatmul__', '__index__', '__irshift__', '__lshift__', '__rlshift__', '__rrshift__', '__rshift__']
attribute: ['base', 'ctypes', 'data', 'flat', 'imag', 'itemsize', 'real', 'strides']
method: ['argpartition', 'byteswap', 'choose', 'compress', 'conj', 'conjugate', 'diagonal', 'dump', 'dumps', 'fill', 'flatten', 'getfield', 'itemset', 'newbyteorder', 'nonzero', 'partition', 'ptp', 'put', 'reshape', 'resize', 'setfield', 'setflags', 'sort', 'tobytes', 'tofile', 'tolist', 'tostring', 'trace']


There are also additional identifiers in the ```Series``` class that are not available in the ```NDArray``` (1D):

In [30]:
print('datamodel attribute:', end=' ')
print_identifier_group(xseries, kind='datamodel_attribute', second=xarray, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xseries, kind='datamodel_method', second=xarray, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xseries, kind='attribute', second=xarray, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xseries, kind='function', second=xarray, show_unique_identifiers=True)

datamodel attribute: ['__annotations__', '__dict__', '__module__', '__pandas_priority__']
datamodel method: ['__column_consortium_standard__', '__finalize__', '__getattr__', '__nonzero__', '__round__', '__weakref__']
attribute: ['array', 'at', 'attrs', 'axes', 'dtypes', 'empty', 'hasnans', 'iat', 'index', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'values']
method: ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'apply', 'asfreq', 'asof', 'at_time', 'autocorr', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'corr', 'count', 'cov', 'cummax', 'cummin', 'describe', 'diff', 'div', 'divide', 'divmod', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'iloc', 'infe

Many of the additional attributes return the supplied value in the initialisation signature:

In [31]:
xseries.array

<NumpyExtensionArray>
[1.1, 2.1, 3.1, 4.1]
Length: 4, dtype: float64

In [32]:
xseries.name

'x'

In [33]:
xseries.index

RangeIndex(start=0, stop=4, step=1)

In [34]:
xseries.values

array([1.1, 2.1, 3.1, 4.1])

In [35]:
xseries.dtypes

dtype('float64')

Many of the additional methods duplicate the behaviour of an equivalent datamodel method, for example ```abs``` and ```__abs__``` (dunder abs). Recall that the datamodel identifier ```__abs__``` defines the way the ```builtins``` function ```abs``` operates on an instance of the ```Series``` class. Use of the ```builtins``` function is generally preferred:

In [36]:
yseries

0    1.1
1   -2.1
2   -3.1
3    NaN
Name: y, dtype: float64

In [37]:
abs(yseries)

0    1.1
1    2.1
2    3.1
3    NaN
Name: y, dtype: float64

In [38]:
yseries.abs()

0    1.1
1    2.1
2    3.1
3    NaN
Name: y, dtype: float64

The supplementary method ```add``` largely duplicates the behaviour of the datamodel method ```__add__``` (dunder add). Recall that ```__add__``` defines the behaviour of the ```+``` operator, and use of the operator is generally preferred:

In [39]:
xseries.__add__(yseries)

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

In [40]:
xseries + yseries

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

The ```add``` method however includes additional options via keyword input arguments such as ```fill_value``` which can be used for an addition involving a missing value: 

In [41]:
xseries.add(yseries)

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

In [42]:
xseries.add(yseries, fill_value=0)

0    2.2
1    0.0
2    0.0
3    4.1
dtype: float64
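The ```fill_value``` keyword is also available on the other arithmetic methods such as ```mul```. The following minimal sketch uses two hypothetical ```Series``` instances, `a` and `b`, to show that a value missing from only one side is replaced before the operation, whereas a value missing from both sides stays missing:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, None])
b = pd.Series([10.0, None, None])

plain = a + b                      # NaN wherever either value is missing
filled_add = a.add(b, fill_value=0)  # a value missing on one side is treated as 0
filled_mul = a.mul(b, fill_value=1)  # or as 1, a neutral value for multiplication

print(plain)
print(filled_add)
print(filled_mul)
```

The ```fill_value``` is normally chosen so that it is neutral for the operation, ```0``` for addition and ```1``` for multiplication.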

## DataFrame Identifiers

If the following dataframe is constructed:

In [43]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [44]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


A large number of identifiers can be seen to be consistent between a ```DataFrame``` and a ```Series``` instance such as almost all of the datamodel identifiers. These identifiers operate across 2 dimensions across a ```DataFrame``` instance instead of 1 dimension along a ```Series```:

In [45]:
print('datamodel attribute:', end=' ')
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_only_intersection_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_only_intersection_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_only_intersection_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_only_intersection_identifiers=True)

datamodel attribute: ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__']
datamodel method: ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub_

The ```Series``` has ```Series``` specific attributes which are not available for a ```DataFrame``` instance. The datamodel methods in a ```Series``` not present in a ```DataFrame``` are for type-casting:

In [46]:
print('datamodel attribute:', end=' ')
print_identifier_group(xseries, kind='datamodel_attribute', second=df, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xseries, kind='datamodel_method', second=df, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xseries, kind='attribute', second=df, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xseries, kind='function', second=df, show_unique_identifiers=True)

datamodel attribute: []
datamodel method: ['__column_consortium_standard__', '__float__', '__int__']
attribute: ['array', 'dtype', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes']
method: ['argmax', 'argmin', 'argsort', 'autocorr', 'between', 'divmod', 'factorize', 'item', 'ravel', 'rdivmod', 'repeat', 'searchsorted', 'to_frame', 'to_list', 'unique', 'view']


The ```DataFrame``` instead has ```DataFrame``` specific attributes such as the name of each ```Series``` in the ```DataFrame```. The ```DataFrame``` also has supplementary methods such as ```insert```, which is used to insert a ```Series``` instance into a ```DataFrame``` instance, or ```join``` and ```merge```, which are used to join or merge ```DataFrame``` instances respectively. The datamodel methods in a ```DataFrame``` not present in a ```Series``` relate to exchanging a ```DataFrame``` with other DataFrame libraries:

In [47]:
print('datamodel attribute:', end=' ')
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_unique_identifiers=True)

datamodel attribute: []
datamodel method: ['__dataframe__', '__dataframe_consortium_standard__']
attribute: ['columns', 'style', 'x', 'y']
method: ['applymap', 'assign', 'boxplot', 'corrwith', 'eval', 'from_dict', 'from_records', 'insert', 'isetitem', 'iterrows', 'itertuples', 'join', 'melt', 'merge', 'pivot', 'pivot_table', 'query', 'select_dtypes', 'set_index', 'stack', 'to_feather', 'to_gbq', 'to_html', 'to_orc', 'to_parquet', 'to_records', 'to_stata', 'to_xml']


Notice the ```columns``` attribute returns an ```Index``` containing the name of each ```Series``` in the ```DataFrame```:

In [48]:
df.columns

Index(['x', 'y'], dtype='object')

Since the following conditions are satisfied:

In [49]:
'x'.isidentifier()

True

In [50]:
'y'.isidentifier()

True

And these identifier names don't clash with any of the other ```DataFrame``` identifiers, the following become ```DataFrame``` attributes and correspond to each ```Series``` in the ```DataFrame```:

In [51]:
df.x

0    1.1
1    2.1
2    3.1
3    3.1
Name: x, dtype: float64

In [52]:
df.y

0    1.2
1    2.2
2    3.2
3    4.2
Name: y, dtype: float64
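When either condition is not met, the column has to be selected using square brackets instead. The following is a minimal sketch with hypothetical column names, one that clashes with the ```DataFrame``` method ```count``` and one that is not a valid identifier:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1],
                   'count': [10, 20],      # clashes with the DataFrame method count
                   'my col': [0.1, 0.2]})  # not a valid identifier

print('my col'.isidentifier())  # False
print(callable(df.count))       # True, df.count is the method and not the column

# Square brackets work for every column name
print(df['count'])
print(df['my col'])
```

For this reason square brackets are the more general way to select a column, and attribute access is best seen as a convenience.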

## Mutability

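The ```Series``` and ```DataFrame``` classes are mutable collections, meaning that alongside the datamodel identifier ```__getitem__``` (dunder getitem), which reads a value, they have the mutable identifiers ```__setitem__``` (dunder setitem) and ```__delitem__``` (dunder delitem). An ```Index```, by contrast, is immutable once created. The identifiers can be checked on the ```Series``` class: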

In [53]:
'__getitem__' in dir(pd.Series)

True

In [54]:
'__setitem__' in dir(pd.Series)

True

In [55]:
'__delitem__' in dir(pd.Series)

True
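The difference between a mutable ```Series``` and an immutable ```Index``` can be sketched as follows. This is a minimal illustration; the names `series` and `index` are only for illustration:

```python
import pandas as pd

series = pd.Series([1.1, 2.1, 3.1], name='x')
series[0] = 9.9  # a Series supports item assignment
print(series)

index = pd.Index([0, 1, 2])
try:
    index[0] = 5  # an Index does not
except TypeError as error:
    print('TypeError:', error)

print(index)  # unchanged
```

To change the labels of a ```Series```, a new ```Index``` is assigned to it rather than the existing ```Index``` being modified.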

This means the following ```Series``` can be indexed into:

In [56]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

In [57]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Recall the datamodel ```__getitem__``` (dunder getitem) defines how a ```Collection``` responds to indexing using square brackets:

In [58]:
xseries[0]

1.1

Recall that the mutable method ```__setitem__``` (dunder setitem) defines how a ```MutableCollection``` responds to indexing using square brackets followed by assignment to a new value:

In [59]:
xseries[0] = None

In [60]:
xseries

0    NaN
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Recall that the mutable method ```__delitem__``` (dunder delitem) defines how a ```MutableCollection``` responds to a ```del``` statement applied to an element indexed using square brackets:

In [61]:
del xseries[2]

In [62]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

Despite the ```NDArray```, ```Series``` and ```DataFrame``` being mutable datatypes, most of their methods are immutable by default, that is they return a new instance and leave the original unchanged. If the docstring of the method ```dropna``` is examined:

In [63]:
xseries.dropna?

Signature:
xseries.dropna(
    *,
    axis: 'Axis' = 0,
    inplace: 'bool' = False,
    how: 'AnyAll | None' = None,
    ignore_index: 'bool' = False,
) -> 'Series | None'
Docstring:
Return a new Series with missing values removed.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index'}
    Unused. Parameter needed for compatibility with DataFrame.
inplace : bool, default False
    If True, do operation inplace and return

Notice it has the keyword input argument ```inplace```. ```inplace``` has the default value of ```False```, making the method immutable by default, so it returns a new ```Series```:

In [64]:
xseries.dropna() # Return value

1    2.1
3    4.1
Name: x, dtype: float64

In [65]:
xseries # Unchanged

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

When ```inplace``` is set to ```True``` the method becomes mutable:

In [66]:
xseries.dropna(inplace=True) # No return value

In [67]:
xseries # Modified inplace

1    2.1
3    4.1
Name: x, dtype: float64

The same behaviour can be seen on the method ```reset_index```:

In [68]:
xseries.reset_index?

Signature:
xseries.reset_index(
    level: 'IndexLabel | None' = None,
    *,
    drop: 'bool' = False,
    name: 'Level' = <no_default>,
    inplace: 'bool' = False,
    allow_duplicates: 'bool' = False,
) -> 'DataFrame | Series | None'
Docstring:
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before a

With the default values this method is immutable and returns a ```DataFrame```, since the old index is added as the first ```Series```:

In [69]:
xseries.reset_index() # Return value

| |index|x|
|---|---|---|
|0|1|2.1|
|1|3|4.1|


If the ```drop``` keyword input argument is set to ```True```, a ```Series``` will instead be returned:

In [70]:
xseries.reset_index(drop=True) # Return value

0    2.1
1    4.1
Name: x, dtype: float64

Once again the ```inplace``` keyword input argument can be assigned to ```True``` making the method mutable:

In [71]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [72]:
xseries # Modified inplace

0    2.1
1    4.1
Name: x, dtype: float64

The following ```Series``` methods have the parameter ```inplace``` and are therefore immutable by default but are mutable when this parameter is assigned to ```True```:

In [73]:
print_identifier_group(xseries, kind='function', has_parameter='inplace')

['backfill', 'bfill', 'clip', 'drop', 'drop_duplicates', 'dropna', 'ffill', 'fillna', 'interpolate', 'mask', 'pad', 'rename', 'rename_axis', 'replace', 'reset_index', 'sort_index', 'sort_values', 'where']


Notice that most of these are used to fill, interpolate or drop values along a ```Series``` in response to missing data. 

```sort_values```, for example, can be used to sort the values along a ```Series```. By default ```inplace=False``` and the method is immutable:

In [74]:
xseries.sort_values(ascending=False) # Return value

1    4.1
0    2.1
Name: x, dtype: float64

Recall when an immutable method is used with assignment, the new value returned on the right of the assignment operator is assigned to the instance name or label on the left of the assignment operator. If the instance name is conceptualised as a label, then a reassignment peels the label from the original instance and places it on the new instance created:

In [75]:
xseries = xseries.sort_values(ascending=False)

In [76]:
xseries

1    4.1
0    2.1
Name: x, dtype: float64

On the other hand when a method is mutable, there is no return value and the ```Series``` is updated inplace:

In [77]:
xseries.sort_values(ascending=True, inplace=True) # No return value

In [78]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

If assignment is used with a mutable method, the return value of the method is ```None``` and therefore ```None``` will be assigned to the ```new_label```:

In [79]:
new_label = xseries.sort_values(ascending=True, inplace=True) 

In [80]:
new_label

Reassignment with the ```inplace``` parameter set to ```True``` should therefore be avoided, as the value being reassigned will be ```None```:

In [81]:
xseries = xseries.sort_values(ascending=True, inplace=True) 

In [82]:
xseries
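The two safe patterns and the pitfall can be summarised in one place. This is a minimal sketch; the names `values`, `ordered` and `result` are only for illustration:

```python
import pandas as pd

values = pd.Series([4.1, 2.1, 3.1], name='x')

# Option 1: immutable call, assign the returned Series to a name
ordered = values.sort_values()

# Option 2: mutable call, no assignment
values.sort_values(inplace=True)

# Mixing the two captures None, because the in-place call has no return value
result = values.sort_values(inplace=True)
print(result)
```

Either option works on its own; the data is only lost when the result of an in-place call is assigned back to the original name.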

By convention immutable methods have a ```return``` value and mutable methods have no ```return``` value. An exception to this is the mutable method ```pop``` which returns the popped value and mutates the ```Series``` in place:

In [90]:
xseries = pd.Series([4.1, 2.1, 3.1, 1.1], name='x')

In [91]:
xseries

0    4.1
1    2.1
2    3.1
3    1.1
Name: x, dtype: float64

In [92]:
xseries.pop(item=1) # Return value

2.1

In [93]:
xseries # Mutated

0    4.1
2    3.1
3    1.1
Name: x, dtype: float64

Most of the other methods are immutable and have a ```return``` value.

## Axis

Another common keyword is ```axis```:

In [94]:
print_identifier_group(xseries, kind='function', has_parameter='axis')

['add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'argmax', 'argmin', 'argsort', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'cummax', 'cummin', 'cumprod', 'cumsum', 'div', 'divide', 'divmod', 'drop', 'droplevel', 'dropna', 'eq', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'floordiv', 'ge', 'groupby', 'gt', 'idxmax', 'idxmin', 'iloc', 'interpolate', 'kurt', 'kurtosis', 'le', 'loc', 'lt', 'mask', 'max', 'mean', 'median', 'min', 'mod', 'mul', 'multiply', 'ne', 'pad', 'pow', 'prod', 'product', 'radd', 'rank', 'rdiv', 'rdivmod', 'reindex', 'rename', 'rename_axis', 'repeat', 'resample', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'take', 'transform', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'var', 'where', 'xs']


A ```Series``` is a column and only has a single ```axis``` available, ```0```. The operation can be conceptualised as sorting the rows by the values in the ```Series```, and therefore ```axis``` can also be assigned to the ```str``` instance ```'rows'``` (or ```'index'```, the alias shown in the docstrings):

In [100]:
xseries.sort_values(ascending=True, axis=0)

3    1.1
2    3.1
0    4.1
Name: x, dtype: float64

In [125]:
xseries.sort_values(ascending=True, axis='rows')

3    1.1
2    3.1
0    4.1
Name: x, dtype: float64
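The alias ```'index'``` that appears in the docstrings behaves in the same way. This is a minimal sketch; the names `s`, `by_number` and `by_name` are only for illustration:

```python
import pandas as pd

s = pd.Series([4.1, 3.1, 1.1], name='x')

by_number = s.sort_values(axis=0)
by_name = s.sort_values(axis='index')  # 'index' is the alias used in the docstrings

print(by_number.equals(by_name))
```

For a ```Series``` the ```axis``` parameter exists mainly for consistency with the ```DataFrame``` method of the same name.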

For a ```DataFrame``` there are two values for ```axis```, ```0``` which is the default and ```1```:

In [162]:
df = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                   'y': np.array([6.2, 7.0, 2.1, 1.2])},
                   index=['a', 'b', 'c', 'd'])

In [163]:
df

| |x|y|
|---|---|---|
|a|5.1|6.2|
|b|2.1|7.0|
|c|2.1|2.1|
|d|4.1|1.2|


The default ```axis``` is ```0``` which is equivalent to the ```str``` instance ```'rows'```. This is an instruction to sort the data in the rows ```by``` the ordering of the data in the columns:

In [164]:
df.sort_values(by=['x', 'y'], axis='rows')

| |x|y|
|---|---|---|
|c|2.1|2.1|
|b|2.1|7.0|
|d|4.1|1.2|
|a|5.1|6.2|


Notice that the data is sorted in ascending order by ```'x'```, and where ```'x'``` has duplicate values (the two values of 2.1) those rows are sorted by ```'y'```. The original ```DataFrame``` is unchanged because the method is immutable by default:

In [156]:
df

| |x|y|
|---|---|---|
|a|5.1|6.2|
|b|2.1|7.0|
|c|2.1|2.1|
|d|4.1|1.2|


The ```axis``` can be changed to ```1``` which is equivalent to the ```str``` instance ```'columns'```. This is an instruction to reorder the columns ```by``` the values in the rows with the specified index labels:

In [159]:
df.sort_values(by=['c', 'd'], axis='columns')

Unnamed: 0,y,x
a,6.2,5.1
b,7.0,2.1
c,2.1,2.1
d,1.2,4.1


The columns are compared first on row ```'c'```, but the two ```Series``` instances ```'x'``` and ```'y'``` both have the value 2.1 there, so this row cannot decide their order. The next index label ```'d'``` is then used: the value in the ```Series``` instance ```'y'``` is 1.2 and the value in the ```Series``` instance ```'x'``` is 4.1, therefore ```'y'``` is ordered before ```'x'```.
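
The two sort directions can be checked side by side. The following is a minimal sketch that rebuilds the same data under the hypothetical name ```df_demo``` (so that ```df``` is left alone) and uses the integer forms of ```axis```:

```python
import numpy as np
import pandas as pd

# Same data as the example above; df_demo is a hypothetical name
df_demo = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                        'y': np.array([6.2, 7.0, 2.1, 1.2])},
                       index=['a', 'b', 'c', 'd'])

# axis=0: reorder the rows, comparing values in columns 'x' then 'y'
sorted_rows = df_demo.sort_values(by=['x', 'y'], axis=0)
row_order = list(sorted_rows.index)

# axis=1: reorder the columns, comparing values in rows 'c' then 'd'
sorted_cols = df_demo.sort_values(by=['c', 'd'], axis=1)
col_order = list(sorted_cols.columns)

print(row_order)
print(col_order)
```

With ```axis=0``` the keys in ```by``` are column names, with ```axis=1``` they are index labels.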

In the ```NDArray``` negative indexes are quite commonly used to select an ```axis```. These are not used for the ```Series``` (1D) and ```DataFrame``` (2D) instances, which have a fixed number of dimensions.

## Indexing and Slicing

Supposing the following dictionary instance is instantiated:

In [165]:
mapping = {'x': np.array([1.1, 2.1, 3.1, 4.1]),
           'y': np.array([1.2, 2.2, 3.2, 4.2])}

In [166]:
mapping

{'x': array([1.1, 2.1, 3.1, 4.1]), 'y': array([1.2, 2.2, 3.2, 4.2])}

A ```DataFrame``` instance can be instantiated by assigning the ```mapping``` to the keyword input argument ```data```:

In [167]:
df = pd.DataFrame(data=mapping)

In [168]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


A ```mapping``` can be indexed with a ```key```. This returns the ```value``` the ```key``` references, in this case the ```NDArray```:

In [169]:
mapping['x']

array([1.1, 2.1, 3.1, 4.1])

Analogously, when a ```DataFrame``` is indexed using the ```name``` of a ```Series```, the ```Series``` is returned:

In [170]:
df['x']

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value in the ```NDArray``` instance can be indexed by use of a second set of square brackets to enclose the numeric index:

In [171]:
mapping['x'][1]

2.1

Analogously, a ```value``` in the ```Series``` can be indexed by use of a second set of square brackets to enclose the numeric index:

In [172]:
df['x'][1]

2.1

If the DataFrame instance is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

The first set of brackets select the Series:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

And the second set of brackets selects the index retrieving the value:

2.1

If the DataFrame is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

Sometimes the value of each ```Series``` at a single label within the ```Index``` instance is desired:

|index|'x'|'y'|
|---|---|---|
|1|2.1|2.2|

This is done by use of the property location ```loc```. Note that ```loc``` returns the above *row* as a ```Series``` which is displayed by default as a *column*:

|index|1|
|---|---|
|'x'|2.1|
|'y'|2.2|

```loc``` is callable and has a docstring:

In [173]:
callable(df.loc)

True

In [174]:
df.loc?

Type:        property
String form: <property object at 0x0000018D9D875670>
Docstring:  
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See mo

However unlike most callables it is not normally called using parentheses. Calling it simply returns another indexer:

In [175]:
df.loc

<pandas.core.indexing._LocIndexer at 0x18d9fd0d5e0>

In [176]:
df.loc()

<pandas.core.indexing._LocIndexer at 0x18d9fcd68f0>

Instead ```loc``` is a property that returns an indexer object. The indexer implements the datamodel method ```__getitem__```, so it is used with square brackets, and it switches the order of indexing from the default ```[column, index]``` to ```[index, column]```:

In [177]:
df.loc[1]

x    2.1
y    2.2
Name: 1, dtype: float64

In [178]:
df.loc[1]['x']

2.1

```loc``` can also use a list of index values:

In [179]:
df.loc[[0, 2]]

Unnamed: 0,x,y
0,1.1,1.2
2,3.1,3.2


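The related property integer location ```iloc``` always uses integer positions, regardless of the labels in the ```Index``` instance. Positions can be supplied individually, as a list, or as a slice which, as in standard Python slicing, includes the start and excludes the stop: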

In [180]:
df.iloc[[0, 2]]

Unnamed: 0,x,y
0,1.1,1.2
2,3.1,3.2


In [181]:
df.iloc[0:2]

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2


If the following DataFrame instance is created with index labels i.e. a non-numeric index:

|index|'x'|'y'|
|---|---|---|
|'a'|1.1|1.2|
|'b'|2.1|2.2|
|'c'|3.1|3.2|
|'d'|4.1|4.2|

In [223]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data=mapping)

In [224]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2


The difference between ```loc``` and ```iloc``` can be seen more clearly. For ```loc``` the index label is used:

In [225]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

Despite the labels being non-numeric ```iloc``` handles the index values numerically:

In [226]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64
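
The two properties also differ when slicing. The sketch below, which uses the hypothetical name ```df_demo``` for a copy of the data above, shows that a label slice with ```loc``` includes the stop label (as stated in the ```loc``` docstring), whereas a position slice with ```iloc``` excludes the stop position:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

# Label slice: both the start 'a' and the stop 'c' are included
label_slice = df_demo.loc['a':'c']

# Position slice: position 0 is included, position 2 is excluded
position_slice = df_demo.iloc[0:2]

print(list(label_slice.index))
print(list(position_slice.index))
```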

Conceptually ```iloc``` behaves as though the index of the ```DataFrame``` instance had been reset to the default numeric index:

In [227]:
df.reset_index(drop=True)

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


In [228]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

When ```loc``` and ```iloc``` are used to select a single index, the data for each ```Series``` at this index is itself displayed as a ```Series```:

In [229]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

In [230]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

Because each of the above is a ```Series``` instance, it can in turn be indexed into:

In [231]:
df.loc['b']['y']

2.2

In [232]:
df.iloc[1]['y']

2.2

When ```iloc``` and ```loc``` are instead used to select data from multiple indexes a ```DataFrame``` instance is output:

In [233]:
df.loc[['a', 'b']]

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2


In [234]:
df.iloc[0:2]

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2


And because each of these is a ```DataFrame``` instance, the ```Series``` within the ```DataFrame``` instance can then be indexed using the ```Series``` name:

In [235]:
df.loc[['a', 'b']]['x']

a    1.1
b    2.1
Name: x, dtype: float64

In [236]:
df.iloc[0:2]['x']

a    1.1
b    2.1
Name: x, dtype: float64

```at``` is used for a scalar selector and requires both the index and the ```Series``` name: 

In [237]:
df.at['a', 'y']

1.2

The related integer at ```iat``` is also a scalar selector and requires both the index and column to be specified as integers:

In [238]:
df.iat[0, 1]

1.2

Conceptually, this is the same as casting the ```DataFrame``` to an ```NDArray``` (2D) and indexing a value from it:

In [239]:
df.to_numpy()

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2]])

In [240]:
df.to_numpy()[0, 1]

1.2

To recap, for a ```DataFrame``` instance:

* ```__getitem__``` selects a ```Series``` by default
* ```loc``` and ```iloc``` change the behaviour to select an observation, ```loc``` by its label in the ```Index``` instance and ```iloc``` by its integer position
* ```at``` and ```iat``` select a scalar element
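
The chained form ```df.loc['b']['y']``` used above takes two indexing steps. As a hedged alternative, ```loc``` and ```iloc``` also accept the row and column selectors together in a single set of square brackets, in the same ```[index, column]``` order as ```at``` and ```iat```. A minimal sketch, with ```df_demo``` as a hypothetical name:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

# One indexing step, [index, column], by label or by position
value_loc = df_demo.loc['b', 'y']
value_iloc = df_demo.iloc[1, 1]

# Scalar selectors use the same order
value_at = df_demo.at['b', 'y']
value_iat = df_demo.iat[1, 1]

# Several labels and one Series name give a Series
subset = df_demo.loc[['a', 'b'], 'x']
```

All four scalar selections refer to the same element, so which one to use is largely a matter of whether labels or positions are known.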


```loc``` can also be used to add a new observation to the ```DataFrame``` instance:

In [241]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2


In [242]:
df.loc['f'] = {'x': 6.1, 'y': 6.2}

In [243]:
df.loc['e'] = {'x': 5.1, 'y': 5.2}

The ordering of rows (also known as observations) follows the insertion order: 

In [244]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
f,6.1,6.2
e,5.1,5.2


The ```DataFrame``` method ```sort_index``` can be used to reorder the index: 

In [245]:
df.sort_index(inplace=True)

In [246]:
df # modified inplace

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2


The ```Index``` instance can also be reset to a numeric index using the ```DataFrame``` method ```reset_index```:

In [247]:
df.reset_index(drop=True, inplace=True)

In [248]:
df # modified inplace

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2


The length of the ```DataFrame``` gives the number of rows (observations):

In [249]:
len(df)

6

Python uses zero-based indexing, so the default ```Index``` starts at ```0``` (inclusive) and stops at ```len(df)``` (exclusive).

```iloc``` cannot be used to index into an index value that doesn't exist and cannot be used to add a new observation. However ```loc``` can be used to add a numeric index using the ```len``` of the ```DataFrame``` instance:

In [250]:
df.loc[len(df)] = {'x': 7.1, 'y': 7.2}

In [251]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2
6,7.1,7.2
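
The difference between the two properties when the position or label does not yet exist can be sketched as follows. The name ```df_demo``` is hypothetical, and a ```list``` is used for the new observation instead of a ```dict``` purely to keep the sketch short:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1]),
                        'y': np.array([1.2, 2.2])})

next_label = len(df_demo)  # 2, one past the last default label

# iloc: position 2 does not exist, so an IndexError is raised
try:
    df_demo.iloc[next_label]
    iloc_failed = False
except IndexError:
    iloc_failed = True

# loc: the missing label is created and the observation is appended
df_demo.loc[next_label] = [3.1, 3.2]

print(iloc_failed)
print(list(df_demo.index))
```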


## DataFrame Properties

Supposing the following ```DataFrame``` is instantiated to ```df```:

In [252]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

In [253]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


The ```DataFrame``` instance has the following dimension related properties. The attribute ```empty``` returns a boolean that is ```True``` only with an empty DataFrame:

In [254]:
df.empty

False

In [255]:
pd.DataFrame(None).empty

True

A ```DataFrame``` instance has a length, which is returned by the ```builtins``` function ```len```. This was seen previously to correspond to the number of rows (number of observations):

In [256]:
len(df)

7

A ```DataFrame``` instance has the attribute ```shape``` which is a ```tuple``` of dimensions. The 1st dimension is the number of rows (observations in the index) and the 2nd value is the number of ```Series``` (columns):

In [257]:
df.shape

(7, 2)

A ```DataFrame``` instance has the attribute ```ndim``` which gives the number of dimensions and is always ```2```:

In [258]:
df.ndim

2

Recall this is equivalent to the length of the ```shape``` ```tuple```:

In [259]:
len(df.shape)

2

The ```DataFrame``` instance has a ```size``` attribute which is the product of the elements in the ```shape``` ```tuple```:

In [260]:
df.size

14

The index attribute is an Index instance. An Index instance has a single dimension that can either be depicted as a row or a column. The output below displays this as a row although the index itself is conventionally depicted as a column when incorporated as part of a DataFrame:

In [None]:
df.index

When no index is specified during instantiation a RangeIndex is shown:

In [None]:
df2 = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                         'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df2.index

The attribute columns is also an instance of the class Index that contains the names used for each Series in the DataFrame:

In [None]:
df.columns

The attribute axes returns a 2 element list, where the first element is the index and the second element is the columns:

In [None]:
df.axes

The attribute values returns the values in the DataFrame in the form of a 2darray:

In [None]:
df.values

The attribute dtypes returns a Series containing the data type of each Series in the DataFrame:

In [None]:
df.dtypes

The Series instances x and y are each of the data type float64. The dtype: object shown on the last line of the output is the data type of the Series returned by dtypes, which holds data type objects. It is not a data type of the DataFrame instance df itself.

Each existing Series is accessible as an attribute:

In [None]:
df.x

In [None]:
df.y

The formal representation of the DataFrame instance df can be examined in a cell:

In [None]:
df

The attribute style will instead display the DataFrame instance using default styling:

In [None]:
df.style

This attribute can be used with a number of methods to apply custom formatting:

In [None]:
for identifier in dir(df.style):
    if not identifier.startswith('_') and callable(getattr(df.style, identifier)):
        print(identifier, end=' ')

In [None]:
df_styled = df.style.format(precision=3).set_caption('DataFrame Instance')

This gives a Styler instance:

In [None]:
type(df_styled)

The Styler instance applies the formatting to the data in the DataFrame when output in a cell:

In [None]:
df_styled

The associated attributes give information about an existing Styler instance:

In [None]:
for identifier in dir(df_styled):
    if not identifier.startswith('_') and not callable(getattr(df_styled, identifier)):
        print(identifier, end=' ')

In [None]:
df_styled.caption

In [None]:
df_styled.hidden_rows

The attribute attrs is an empty dictionary by default and is designed to store metadata associated with the DataFrame:

In [None]:
df.attrs

This metadata can include, for example, a text description giving information about how the data was collected or a link to a scientific publication. The pandas documentation warns that this is an experimental feature and is subject to change:

In [None]:
df.attrs = {'description': 'this DataFrame was instantiated from a dict',
            'scientific paper': r'https://www.sciencedirect.com/'}

flags is another experimental feature and is used to configure properties of the DataFrame instance. At present only one flag can be set, allows_duplicate_labels, which controls whether duplicate labels are allowed:

In [None]:
df.flags

In [None]:
df

This flag is enabled by default:

In [None]:
df.flags.allows_duplicate_labels

In [None]:
df_duplicated = pd.concat([df, df])

In [None]:
df_duplicated

If set to False:

In [None]:
df.flags.allows_duplicate_labels = False

Then any operation involving that DataFrame that could lead to a DataFrame with duplicate labels will give a DuplicateLabelError:

In [None]:
# pd.concat([df, df])

<span style='color:red'>DuplicateLabelError</span>:
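
The behaviour can be sketched without modifying df by using the DataFrame method set_flags, which returns a new instance. The names df_demo and df_strict are hypothetical, and the sketch assumes a pandas version that provides flags (1.2 or later):

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b'],
                       data={'x': np.array([1.1, 2.1])})

# set_flags returns a new DataFrame instance and leaves df_demo unchanged
df_strict = df_demo.set_flags(allows_duplicate_labels=False)

# Concatenating would repeat the labels 'a' and 'b' in the index
try:
    pd.concat([df_strict, df_strict])
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True

print(raised)
```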

The methods info, describe, head and tail are commonly used to inspect a DataFrame instance.

The DataFrame method info summarises the DataFrame instance, bringing together several attributes such as the Series names, the number of non-null values and the data type of each Series:

In [None]:
df.info()

The describe method gives descriptive statistics on each numeric Series:

In [None]:
df.describe()

The DataFrame methods head and tail give the first 5 and last 5 observations by default and are usually used to preview a very large DataFrame instance:

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

The number of observations n can be changed:

In [None]:
df.head(n=3)

The method nunique gives the number of unique values in each Series:

In [None]:
df.nunique()

## Attribute Access - Dictionary Syntax vs Dot Syntax

If the following DataFrame instance is created:

In [None]:
df = pd.DataFrame(index = np.array(['a', 'b', 'c', 'd']),
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

Each Series can be accessed from the DataFrame instance df by indexing into df using a string of the Series name enclosed in square brackets. This style of Series access is analogous to retrieving a value from a dictionary by use of its key:

In [None]:
df['x']

Since the following is True:

In [None]:
'x'.isidentifier()

x becomes an attribute and can also be accessed using:

In [None]:
df.x

In [None]:
df.x is df['x']

One major advantage of this attribute access is that code completion often works better:

In [None]:
# df.x.

In [None]:
# df['x'].

A Series only becomes an identifier of a DataFrame **after** it is instantiated and if it is a valid identifier:

In [None]:
df['z1'] = np.array([1.3, 2.3, 2.3, 2.4])

In [None]:
df

In [None]:
df.z1

A UserWarning will display if the dot syntax is used in an attempt to create a new attribute:

In [None]:
# df.z2 = np.array([1.3, 2.3, 2.3, 2.4])

<span style='color:red'>UserWarning</span>: Pandas doesn't allow columns to be created via a new attribute name

The DataFrame instance df will be unchanged:

In [None]:
df

If an invalid identifier name is used:

In [None]:
'1'.isidentifier()

In [None]:
df['1'] = np.array([1.3, 2.3, 2.3, 2.4])

In [None]:
df

Then the new Series will not be accessible as an attribute and a SyntaxError will be displayed if attribute access is attempted:

In [None]:
# df.1

<span style='color:red'>SyntaxError</span>: invalid syntax

For this reason Series names should follow the naming conventions of Python identifiers (object names).

Although the dot attribute access from the DataFrame instance df is unavailable for the Series instance '1', it can still be accessed by indexing into df using the Series name '1':

In [None]:
df['1']

Accessing a Series via dictionary-style indexing is therefore more powerful and this syntax is generally preferred.
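
A related case, offered here as a hedged aside, is a Series name that is a valid identifier but matches an existing DataFrame attribute. In the sketch below the Series name 'size' and the instance name df_demo are hypothetical choices made for illustration:

```python
import numpy as np
import pandas as pd

# 'size' is a valid identifier but is already a DataFrame attribute
df_demo = pd.DataFrame({'size': np.array([1.1, 2.1, 3.1]),
                        'y': np.array([1.2, 2.2, 3.2])})

by_key = df_demo['size']   # the Series named 'size'
by_dot = df_demo.size      # the existing attribute: number of elements

print(type(by_key).__name__)
print(by_dot)
```

Dictionary-style indexing always returns the Series, whereas the dot syntax returns the existing attribute.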

The major drawback of this syntax is with code-completion. Notice if the following is input, no identifiers display:

In [None]:
# df['x'].

However if the following is input, identifiers display:

In [None]:
# df.x.

The ? operator cannot find the docstring of the Series method info using dictionary-style indexing:

In [None]:
? df['x'].info

But can find the docstring using the attribute-style indexing:

In [None]:
? df.x.info

The method info gives the same result when called from the Series in both cases:

In [None]:
df['x'].info()

In [None]:
df.x.info()

When the name used for each index is also a valid identifier:

In [None]:
'a'.isidentifier()

It will display as an attribute for each Series:

In [None]:
df.x.a

Once again the object, in this case a float, has its own identifiers and code completion can access them when the dot access is used:

In [None]:
# df.x.a.

In [None]:
# df['x'].a.

But not when square brackets are used:

In [None]:
# df['x']['a'].

The default index is numeric integer steps which are invalid identifiers:

In [None]:
'0'.isidentifier()

And therefore when the default index is used:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

The dot attribute access cannot be used to access the value in the Index:

In [None]:
# df.x.1

<span style='color:red'>SyntaxError</span>: invalid syntax

But indexing with square brackets works:

In [None]:
df.x[1]

## Combining DataFrames

DataFrame methods are generally set up to work with Series. For example if the following DataFrame instance is examined:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

A Series is typically appended to the end of the DataFrame by use of:

In [None]:
df['z'] = np.array([1.3, 2.3, 3.3, 4.3])

In [None]:
df

Alternatively it can be inserted at a specified column position using the method insert, which modifies the DataFrame instance in place:

In [None]:
? df.insert

In [None]:
df.insert(loc=0, column='w', value=np.array([1.0, 2.0, 3.0, 4.0]))

In [None]:
df

To add an observation to the end of a DataFrame, loc is typically used with a dictionary where the keys are the column names and the values are the associated values at that observation:

In [None]:
df.loc[len(df)] = {'w': 5.0, 'x': 5.1, 'y': 5.2, 'z': 5.3}

In [None]:
df

When multiple observations are to be added to a DataFrame they are normally in the form of a DataFrame:

In [None]:
df2 = pd.DataFrame(index=np.array([5, 6]),
                   data={'w': np.array([6.0, 7.0]),
                         'x': np.array([6.1, 7.1]),
                         'y': np.array([6.2, 7.2]),
                         'z': np.array([6.3, 7.3])})

In [None]:
df

In [None]:
df2

The function pd.concat is normally used to combine the DataFrame instances:

In [None]:
? pd.concat

For example, df and df2 can be concatenated along axis 0 (the index):

In [None]:
pd.concat(objs=[df, df2], axis=0) #'index'

If these DataFrame instances are created with the default indexes:

In [None]:
df.reset_index(drop=True, inplace=True)
df

In [None]:
df2.reset_index(drop=True, inplace=True)
df2

Notice the index now has duplicate entries:

In [None]:
pd.concat(objs=[df, df2])

In such a scenario it is common to assign ignore_index to True which will recreate a numeric index:

In [None]:
pd.concat([df, df2], ignore_index=True)
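
The effect of ignore_index can be checked on a small pair of DataFrame instances. The names top and bottom below are hypothetical and the data is made up for illustration:

```python
import numpy as np
import pandas as pd

top = pd.DataFrame({'x': np.array([1.1, 2.1]),
                    'y': np.array([1.2, 2.2])})
bottom = pd.DataFrame({'x': np.array([3.1, 4.1]),
                       'y': np.array([3.2, 4.2])})

# Both inputs have the default index 0, 1 so the labels repeat
stacked = pd.concat([top, bottom])

# ignore_index=True discards the input labels and renumbers from 0
renumbered = pd.concat([top, bottom], ignore_index=True)

print(list(stacked.index))
print(list(renumbered.index))
```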

When two DataFrames are concatenated with Series not in common:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df2 = pd.DataFrame(data = {'w': np.array([6.0, 7.0]),
                           'z': np.array([6.3, 7.3])})

In [None]:
df

In [None]:
df2

They can be outer joined (the default). This will lead to NaN values where no data was supplied:

In [None]:
pd.concat(objs=[df, df2], axis=1, join='outer') #'columns'

Alternatively they can be inner joined, which will drop the observations that are missing the data:

In [None]:
pd.concat([df, df2], axis=1, join='inner') #'columns'
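
The difference between the two joins shows up in the shape of the result. A minimal sketch, using the hypothetical names left and right for data equivalent to df and df2:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                     'y': np.array([1.2, 2.2, 3.2, 4.2])})
right = pd.DataFrame({'w': np.array([6.0, 7.0]),
                      'z': np.array([6.3, 7.3])})

# outer: keep every index label, filling the gaps with NaN
outer = pd.concat([left, right], axis=1, join='outer')

# inner: keep only the index labels present in both inputs
inner = pd.concat([left, right], axis=1, join='inner')

print(outer.shape, inner.shape)
print(outer['w'].isna().sum())
```

Here right only has the index labels 0 and 1, so the outer join has two NaN values in each of w and z while the inner join keeps two observations.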

The DataFrame method align can be used to align the data of a DataFrame with another DataFrame instance for the purpose of comparison:

In [None]:
df3 = pd.concat([df, df2], axis=1, join='inner') #'columns'

In [None]:
df.align(other=df3)
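
The method align returns a tuple of two DataFrame instances that share the same index and columns. The sketch below uses hypothetical names and data, and relies on the default outer join over both axes:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                     'y': np.array([1.2, 2.2, 3.2, 4.2])})
other = pd.DataFrame({'x': np.array([1.1, 2.1]),
                      'w': np.array([6.0, 7.0])})

# Both outputs are reindexed to the union of the index labels and columns
left_aligned, other_aligned = left.align(other)

print(left_aligned.shape, other_aligned.shape)
```

Values that did not exist in one of the inputs, such as w in left, appear as NaN in the aligned output.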

## Not Available Values

If the following DataFrame is instantiated with None values:

In [None]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, None, 3.1, None, 5.1, None, 7.1]),
                        'y': np.array([1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2])})

The information of the DataFrame instance can be examined. There are 7 entries (observations). 4 observations have available (non-null) values in Series instance x. 6 observations have available (non-null) values in Series instance y. Also notice the data type is now object instead of float64, meaning each Series holds generic Python objects (here a mixture of float and None instances) and is not treated as numeric:

In [None]:
df.info()

If describe is used, because None values are present and the data type is an object the descriptive statistics change:

In [None]:
df.describe()

The data type of each Series in the DataFrame can be changed using the method astype:

In [None]:
df

In [None]:
df.astype(float)

Notice the difference between the two DataFrame instances, the one which has each Series with the data type object has None whereas the one which has each Series as numeric has NaN (not a number).

In [None]:
None == np.nan

NaN is a float value that marks missing data, playing the role for floats that None plays for Python objects. Series that only have numeric data and NaN can therefore have the data type float:

In [None]:
type(np.nan)

Series with only numeric data and None therefore contain multiple different data types and therefore the Series has the data type object:

In [None]:
type(None)
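
The relationship between None, NaN and the data type of a Series can be sketched as follows. The names object_series and float_series are hypothetical:

```python
import numpy as np
import pandas as pd

# An object NDArray containing None gives a Series with the data type object
object_series = pd.Series(np.array([1.1, None, 3.1]))

# Casting to float replaces None with NaN
float_series = object_series.astype(float)

# NaN never compares equal, not even to itself
nan_equals_itself = np.nan == np.nan

# pandas treats both None and NaN as not available
both_missing = pd.isna(None) and pd.isna(np.nan)

print(object_series.dtype, float_series.dtype)
print(nan_equals_itself, both_missing)
```

Because equality cannot be used to find NaN, a function such as pd.isna is used to test for not available values.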

If the method describe is used on the DataFrame instance that has the float data type with NaN values instead of None values, the numeric descriptive statistics display:

In [None]:
df.astype(float).describe()

The method dropna (drop not available) can be used to drop these values, outputting a new DataFrame instance. Both None and NaN are classified as not available and are also known collectively as null values. Notice the number of observations is now reduced to 4:

In [None]:
df.dropna()

In [None]:
df.astype(float).dropna()

If the DataFrame method info is used on this new DataFrame instance, notice the data type of each Series is still object and not float64:

In [None]:
df.dropna().info()

The astype method can be used to change the data type of each Series in the DataFrame to a float once again outputting a new DataFrame instance. If the info method is examined for this DataFrame instance, each Series now has a float64 data type:

In [None]:
df.dropna().astype(float).info()

And describe can be used on this instance to give descriptive statistics:

In [None]:
df.dropna().astype(float).describe()

dropna can be used when a DataFrame instance has a large number of observations and only a small number of these observations have not available values. If the docstring is examined:

In [None]:
? df.dropna

Notice there are the keyword input arguments inplace and axis. These two keywords are present in many of the DataFrame identifiers.

The following identifiers have the keyword inplace which recall toggles the method from being immutable (default when inplace=False) to mutable (when inplace=True). Notice that many of these other identifiers are used to drop not available data or to fill not available data.

In [None]:
import inspect  # provides inspect.signature used below
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' in inspect.signature(getattr(df, identifier)).parameters):
            print(identifier, end=' ')

The following identifiers have the keyword axis:

In [None]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('axis' in inspect.signature(getattr(df, identifier)).parameters):
            print(identifier, end=' ')

The keyword axis can be examined in more detail using the DataFrame instance df:

In [None]:
df

df has a shape tuple which has 7 observations or rows in the index and 2 Series or columns:

In [None]:
df.shape

Because a DataFrame is always 2 dimensions, positive indexes can be considered. Notice the 7 is at index 0 and the 2 is at index 1 of the shape tuple:

In [None]:
nrows = df.shape[0]
nrows

In [None]:
ncols = df.shape[1]
ncols

The default value is axis 0 or 'index' and drops any observations along the index that have null entries:

In [None]:
df.dropna(axis=0)

In [None]:
df.dropna(axis='index')

This can be changed to an axis of 1 or 'columns' that will instead drop any Series that has not available values. In this case all the Series have not available values:

In [None]:
df.dropna(axis=1)

In [None]:
df.dropna(axis='columns')
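
The two settings of axis can be compared by looking at what remains. A minimal sketch, assuming a float version of the data above under the hypothetical name df_demo:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    data={'x': np.array([1.1, np.nan, 3.1, np.nan, 5.1, np.nan, 7.1]),
          'y': np.array([1.2, np.nan, 3.2, 4.2, 5.2, 6.2, 7.2])})

# axis=0: drop every observation that has at least one null value
rows_kept = df_demo.dropna(axis=0)

# axis=1: drop every Series that has at least one null value
cols_kept = df_demo.dropna(axis=1)

print(list(rows_kept.index))
print(cols_kept.shape)
```

Since both Series contain null values, dropping along axis 1 leaves a DataFrame with 7 observations and no Series.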

The method fillna can be used to fill in not available values:

In [None]:
? df.fillna

These can be filled with a constant value:

In [None]:
df.fillna(0)

In [None]:
df.fillna(np.inf)

Alternatively a method can be used to forward fill or back fill missing data. When using the forward fill, the previous available value is used to replace the not available value:

In [None]:
df

In [None]:
df.fillna(method='ffill')

When using the back fill the next available value is used to replace the not available value:

In [None]:
df

In [None]:
df.fillna(method='bfill')

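These also have the equivalent methods ffill and bfill, which are preferred in recent pandas releases where the method keyword of fillna is deprecated: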

In [None]:
df.bfill()

In [None]:
df.ffill()
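
The direction of each fill can be seen on a short Series. The name x and its data below are hypothetical:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.1, np.nan, 3.1, np.nan, 5.1],
              index=['a', 'b', 'c', 'd', 'e'])

# Forward fill: each gap takes the previous available value
forward = x.ffill()

# Back fill: each gap takes the next available value
backward = x.bfill()

print(forward.tolist())
print(backward.tolist())
```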

The interpolate method can use neighbouring datapoints to interpolate a missing value:

In [None]:
? df.interpolate

The interpolate method has the keyword input argument method. If method is set to 'linear' numeric interpolation will use the two nearest non-null data points.

If the data type of the Series is object, the data will not be recognised as numeric and a TypeError will display:

In [None]:
# df.interpolate(method='linear')

<span style='color:red'>TypeError</span>: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype.

In [None]:
df.astype(float).interpolate(method='linear')

This is the same as a 1st order polynomial (two nearest data points). If a polynomial method is specified however the index needs to be numeric otherwise there is a ValueError:

In [None]:
# df.astype(float).interpolate(method='polynomial', order=1)

<span style='color:red'>ValueError</span>: Index column must be numeric or datetime type when using polynomial method other than linear. Try setting a numeric or datetime index column before interpolating.

If reset_index is used to make the index numeric, polynomial interpolation can be used:

In [None]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=1) # 2 nearest data points

In [None]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=2) # 3 nearest data points

In [None]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=3) # 4 nearest data points
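
For the 'linear' method, the interpolated value of an isolated gap can be checked by hand as the midpoint of its two neighbours. The sketch below uses a hypothetical Series x and only the 'linear' method, so that no optional dependency is assumed:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.1, np.nan, 3.1, np.nan, 5.1])

interpolated = x.interpolate(method='linear')

# By hand: the gap at position 1 lies midway between positions 0 and 2
manual = (x.iloc[0] + x.iloc[2]) / 2

print(interpolated.tolist())
print(manual)
```

The comparison should be made with a tolerance, for example using np.isclose, as the results are floats.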

The isna DataFrame method returns a boolean DataFrame instance which is True for not available values and False otherwise:

In [None]:
df.isna()

The opposite method notna returns a boolean DataFrame of inverse values:

In [None]:
df.notna()

These two methods have the aliases isnull and notnull respectively. These aliases are used for consistency with the R programming language.

The boolean mask above can be used to index into the DataFrame instance:

In [None]:
bool_mask = df.notna()

Notice indexing using the boolean mask updates None to NaN:

In [None]:
df

In [None]:
df[bool_mask]

## String Series and String Methods

Supposing the following list of words is instantiated:

In [None]:
words = 'the quick brown fox jumped over the lazy dog'.split()

In [None]:
words

Using len of words will return the number of words and not the length of each word:

In [None]:
len(words)

To instead get a list of the length of each word i.e. use len on each individual str, list comprehension can be used:

In [None]:
[len(word) for word in words]

This can also be done using map:

In [None]:
map(len, words)

In [None]:
list(map(len, words))

If an analogous DataFrame is instantiated with a Series words:

In [None]:
df = pd.DataFrame({'words': 'the quick brown fox jumped over the lazy dog'.split()})

In [None]:
df

Using len on the DataFrame will return the number of observations in the index:

In [None]:
len(df)

The DataFrame method applymap is similar to map and can be used to apply the len function element by element throughout the DataFrame. Note that in recent pandas releases (2.1 onwards) applymap is deprecated in favour of the DataFrame method map:

In [None]:
df.applymap(func=len)

Since every element in the DataFrame is a str, a str method can be applied to each element using applymap and a lambda expression:

In [None]:
df.applymap(func=lambda word: word.upper())  # word avoids shadowing the builtins class str

A Series has a similar method map:

In [None]:
df['words'].map(lambda word: word.upper())

Notice the difference in the return values. The method applymap called from the DataFrame returns another DataFrame instance. In contrast the method map when called from a Series returns another Series instance.

The Series instance returned can be assigned to a new Series of the DataFrame:

In [None]:
df['upperwords'] = df['words'].map(lambda word: word.upper())

In [None]:
df

The DataFrame instance also has the method apply, which applies a function, for example the builtins function max, along an axis. By default it operates along axis 0, the 'index', so the function is applied to each Series in turn:

In [None]:
df.apply(max)

In [None]:
df.apply(min)

Since the str methods are commonly invoked, a Series has the attribute str which can be used to invoke the most common string methods:

In [None]:
df['words'].str.zfill(20)

And includes some additions such as len:

In [None]:
df['words'].str.len()
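
The str attribute and the map method give the same result for simple cases, which can be sketched with a short hypothetical Series named words:

```python
import pandas as pd

words = pd.Series(['the', 'lazy', 'dog'], name='words')

# Element-by-element application of the builtins function len
lengths_map = words.map(len)

# The equivalent vectorised string methods from the str attribute
lengths_str = words.str.len()
upper = words.str.upper()

print(lengths_str.tolist())
print(upper.tolist())
```

The str attribute is usually the more readable choice when a suitable string method exists.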

## Numeric Series

If a DataFrame with numeric Series x, y and z is instantiated:

In [None]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [-2, -4, 6, 8, 10],
                   'z': [12, 24, 48, -63, -999]})

In [None]:
df

The apply method can be used to apply the builtins function max along axis 0 'index' (default), giving one value per Series, or along axis 1 'columns', giving one value per observation:

In [None]:
df.apply(max) #'index'

In [None]:
df.apply(max, axis=1) #'columns'

Note however that most of the common aggregation functions from builtins or numpy are implemented directly as DataFrame methods:

In [None]:
df.max(axis=0)

In [None]:
df.min(axis=0)

In [None]:
df.mean(axis=0)

In [None]:
df.var(axis=0)

In [None]:
df.std(axis=0)
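As a quick check that the two approaches agree, the sketch below rebuilds the same numeric data under the hypothetical name demo and compares apply with the direct method along both axes:

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [-2, -4, 6, 8, 10],
                     'z': [12, 24, 48, -63, -999]})

col_max_apply = demo.apply(max)           # axis 0: one value per column
col_max_method = demo.max(axis=0)

row_max_apply = demo.apply(max, axis=1)   # axis 1: one value per row
row_max_method = demo.max(axis=1)

print(col_max_method.tolist())
print(row_max_method.tolist())
```

The direct methods are generally the better choice as they are vectorised, whereas apply calls the function once per column or row.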

And the data model identifiers are configured for numeric operation:

In [None]:
df['x'] + df['y']

In [None]:
df['x'] + 5

The apply method can also be used with a tuple of these functions, outputting a DataFrame instance as opposed to a Series:

In [None]:
df.apply((len, max, min, np.mean, np.var, np.std))
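An alternative sketch, which is not the approach used above, is the DataFrame method agg with the method names supplied as strings. This assumes the same numeric data, rebuilt here under the hypothetical name demo:

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                     'y': [-2, -4, 6, 8, 10],
                     'z': [12, 24, 48, -63, -999]})

# Each string names a DataFrame method; the result has one row per statistic.
summary = demo.agg(['max', 'min', 'mean', 'var', 'std'])

print(summary)
```

Using the string names means the pandas implementations are called, so var and std are the sample statistics that the DataFrame methods var and std return.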

## Categorical Series

Another common type of Series is a category Series:

In [None]:
df = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin', 'Harry', 'Daniel', 'Valeria', 'Julia'],
                   'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']})

When instantiated, the categories will normally be recognised as strings:

In [None]:
df

And the data type of each Series will therefore be object:

In [None]:
df.dtypes

The data type of a Series can be changed using the method astype. To change to category use the input argument 'category':

In [None]:
oldidentifiers = dir(df['grade'])

In [None]:
df['grade'].astype('category')

The original Series can be reassigned to the new Series, which is now categorical:

In [None]:
df['grade'] = df['grade'].astype('category')

If the DataFrame instance is examined, it looks the same:

In [None]:
df

However its data type is updated:

In [None]:
df.dtypes

A categorical Series also has the attribute cat which groups together methods and attributes commonly used for categorical Series:

In [None]:
newidentifiers = dir(df['grade'])

In [None]:
for identifier in newidentifiers:
    if identifier not in oldidentifiers:
        print(identifier, end=' ')
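A minimal sketch of what the cat attribute exposes, using a standalone Series with the same grades as above (the name grade is used here only for illustration):

```python
import pandas as pd

grade = pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A'],
                  name='grade').astype('category')

print(grade.cat.categories.tolist())   # the unique categories
print(grade.cat.codes.tolist())        # integer code of each value
print(grade.cat.ordered)               # whether the categories have an order
```

Each value is stored as a small integer code that refers to a position in categories, which is why a categorical Series is usually more memory efficient than the equivalent Series of strings.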

Categories are often used for boolean selectors:

In [None]:
df[df['grade'] == 'A']

In [None]:
df[df['grade'] == 'B']

In [None]:
df[(df['grade'] == 'A') | (df['grade'] == 'B')]
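When several categories are to be selected, chaining == comparisons with | becomes verbose. As an alternative sketch (not used in the cells above), the Series method isin accepts a list of values. The DataFrame is rebuilt here under the hypothetical name demo:

```python
import pandas as pd

demo = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin',
                                 'Harry', 'Daniel', 'Valeria', 'Julia'],
                     'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']})
demo['grade'] = demo['grade'].astype('category')

either = demo[(demo['grade'] == 'A') | (demo['grade'] == 'B')]
with_isin = demo[demo['grade'].isin(['A', 'B'])]   # same selection

print(with_isin['student'].tolist())
```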

Only the equal to == and not equal to != operators are defined for unordered categoricals. A TypeError is raised if any of the other comparison operators is used:

In [None]:
# df[df['grade'] >= 'B']

<span style='color:red'>TypeError</span>: Unordered Categoricals can only compare equality or not

The as_ordered method can be used to ordinally order categories:

In [None]:
df['grade'].cat.as_ordered()

In this case, the desired order is the reverse of the ordinal values because 'A' corresponds to a higher grade than 'F':

In [None]:
df['grade'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                   ordered=True)

The original Series 'grade' can be reassigned:

In [None]:
df['grade'] = df['grade'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                                 ordered=True)

In [None]:
df[df['grade'] >= 'B']
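The effect of ordering can be sketched on a standalone Series with the same grades (the names grade and below_b are used here only for illustration). Once ordered, the remaining comparison operators and the methods min and max follow the category order rather than alphabetical order:

```python
import pandas as pd

grade = pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A'],
                  name='grade').astype('category')

# Lowest category first, highest category last.
grade = grade.cat.reorder_categories(['F', 'C', 'B', 'A'], ordered=True)

below_b = grade < 'B'   # True for 'F' and 'C'

print(below_b.tolist())
print(grade.min(), grade.max())
```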

When sorting data in DataFrames, ordinal Series are quite often used:

In [None]:
df.sort_values(['grade'])

In [None]:
df.sort_values(['grade', 'student'])

A GroupBy instance can be created from the categories:

In [None]:
df.groupby(df['grade'])

In [None]:
gbo = df.groupby(df['grade'])

Statistical methods can then be called from this GroupBy instance applying them to every Series in the DataFrame. For example the statistical method count returns a DataFrame which counts the number of students for each grade:

In [None]:
gbo.count()

A Series can be selected from the GroupBy instance so that the statistical method is called only on this Series:

In [None]:
gbo['student'].count()

Notice the difference in output: the return value is a Series and not a DataFrame because the method was called from a Series and not a DataFrame.
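A self-contained sketch of the two return types, with the data rebuilt under the hypothetical name demo. The keyword input argument observed=True is an addition made here, not something used in the cells above. It restricts the groups to categories that actually occur in the data and avoids a warning about the changing default in recent pandas versions:

```python
import pandas as pd

demo = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin',
                                 'Harry', 'Daniel', 'Valeria', 'Julia'],
                     'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']})
demo['grade'] = demo['grade'].astype('category')

groups = demo.groupby('grade', observed=True)

table = groups.count()               # DataFrame: one column per remaining Series
counts = groups['student'].count()   # Series: only the selected Series

print(counts.to_dict())
```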

Some methods like describe will however output a DataFrame: 

In [None]:
gbo['student'].describe()

In [None]:
gbo.describe()

Notice the slight difference with the multi-index column above being used to give statistical information (count, unique, top and freq) for each Series in the latter case.

In [None]:
df

The difference can be seen more clearly if a second category is added to the DataFrame:

In [None]:
df['sex'] = pd.Series(['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F'])

In [None]:
df['sex'] = df['sex'].astype('category')

In [None]:
df.groupby('grade')['student'].describe()

In [None]:
df.groupby('grade').describe()

If the DataFrame instance df is examined:

In [None]:
df

The Series grade gives the ordinal grade, which is normally determined by an examination score. The results of the exam can be added as a numeric Series:

In [None]:
df['score'] = np.array([35, 20, 99, 55, 75, 58, 68, 90])

In [None]:
df

The pandas function pd.cut can be used to cut this Series of numeric values into bins which correspond to each grade:

In [None]:
? pd.cut

For example:

0:50 'F'

50:60 'C'

60:70 'B'

70:100 'A'

By default each bin is exclusive of the lower bound and inclusive of the upper bound.

In [None]:
pd.cut(x=df['score'], bins=[0, 50, 60, 70, 101])

In the output above the ( means exclusive of the lower boundary and the ] means inclusive of the upper boundary. These intervals can be relabelled as the grades using the keyword input argument labels. Notice that there are 5 values for bins and 4 labels; this is because each bin lies between two values:

In [None]:
pd.cut(x=df['score'], bins=[0, 50, 60, 70, 101], labels=['F', 'C', 'B', 'A'])
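The treatment of the boundaries is easiest to see with scores that sit exactly on a bin edge. The sketch below uses hypothetical boundary scores rather than the scores in df, and compares the default with the keyword input argument right=False, which makes each bin inclusive of the lower bound and exclusive of the upper bound instead:

```python
import pandas as pd

bins = [0, 50, 60, 70, 101]
labels = ['F', 'C', 'B', 'A']
boundary_scores = [50, 60, 70, 100]   # hypothetical scores on the bin edges

# Default right=True: bins are (0, 50], (50, 60], (60, 70], (70, 101]
right_closed = pd.cut(boundary_scores, bins=bins, labels=labels)

# right=False: bins are [0, 50), [50, 60), [60, 70), [70, 101)
left_closed = pd.cut(boundary_scores, bins=bins, labels=labels, right=False)

print(list(right_closed))
print(list(left_closed))
```

With the default a score of exactly 50 is graded 'F', whereas with right=False it is graded 'C', so the choice should match the marking scheme intended.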

## DateTime

In pandas, dates and time intervals are based upon the numpy data types datetime64 and timedelta64 respectively:

In [None]:
? np.datetime64

The datetime64 class is normally initialised using a timestamp string in ISO 8601 format, where the date and the optional time are separated by a T, for example:

In [None]:
np.datetime64('2023-07-25')

In [None]:
np.datetime64('2023-07-25T14:30:15.123456')

The timedelta64 class is normally initialised using an integer value and a unit string such as 'D' (day), 'h' (hour), 's' (second) or 'ms' (millisecond):

In [None]:
? np.timedelta64

In [None]:
np.timedelta64(1, 'D')

In [None]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h')

In [None]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's')

In [None]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')
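The two classes are designed to be combined. A minimal sketch, using hypothetical variable names: subtracting two datetime64 instances gives a timedelta64, and adding a timedelta64 to a datetime64 gives another datetime64:

```python
import numpy as np

start = np.datetime64('2023-07-25')
end = np.datetime64('2023-07-26')

gap = end - start                        # datetime64 - datetime64 is a timedelta64
later = start + np.timedelta64(36, 'h')  # datetime64 + timedelta64 is a datetime64
day_and_hour = np.timedelta64(1, 'D') + np.timedelta64(1, 'h')

print(gap, later, day_and_hour)
```

When the units differ, numpy casts the result to the finer of the two units, which is why one day plus one hour is reported in hours.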

These can be used to make an Index or Series respectively, normally using the np.arange function:

In [None]:
starttime = np.datetime64('2023-07-25')
endtime = np.datetime64('2023-07-26')
timeinterval = np.timedelta64(1, 'h')

In [None]:
times = np.arange(start=starttime, #inclusive
                  stop=endtime, #exclusive
                  step=timeinterval)

In [None]:
times

These times can be cast into an Index or Series:

In [None]:
pd.Index(times)

In [None]:
pd.Series(data=times, name='times')
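pandas also has its own function date_range (listed among the pandas functions at the start of this document) which builds the Index directly. The following sketch is an alternative to np.arange rather than the approach used above. It specifies the number of periods instead of an exclusive stop, and passes the step as a pd.Timedelta:

```python
import numpy as np
import pandas as pd

from_arange = pd.Index(np.arange(np.datetime64('2023-07-25'),
                                 np.datetime64('2023-07-26'),
                                 np.timedelta64(1, 'h')))

from_date_range = pd.date_range(start='2023-07-25',
                                periods=24,
                                freq=pd.Timedelta(hours=1))

print(len(from_arange), len(from_date_range))
```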

This datetime64 Index instance can be used as a time index alongside measurement Series, for example emulated temperature, pH and humidity data:

In [None]:
import numpy.random as random
random.seed(0)

In [None]:
df = pd.DataFrame(index=pd.Index(times),
                  data={'temperature': 25 + random.randn(24),
                        'ph': 7 + random.randn(24) / 10,
                        'humidity': 100 - random.randint(0, 100, 24)})

In [None]:
df

loc can be used to retrieve the data at a specified datetime64:

In [None]:
df.loc['2023-07-25T01:00:00']

iloc can also be used with an integer:

In [None]:
df.iloc[1]
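loc also accepts a slice of timestamp strings. A minimal sketch follows, using a hypothetical DataFrame named demo with a single reading Series rather than the df above. Note that, unlike integer slicing with iloc, a label slice with loc is inclusive of both the start and the stop:

```python
import numpy as np
import pandas as pd

times = pd.date_range('2023-07-25', periods=24, freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({'reading': np.arange(24)}, index=times)

# Label slice: both 01:00 and 03:00 are included.
window = demo.loc['2023-07-25 01:00':'2023-07-25 03:00']

# Boolean selection on the Index: everything before 06:00.
morning = demo[demo.index < pd.Timestamp('2023-07-25 06:00')]

print(window['reading'].tolist())
print(len(morning))
```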

A comparison between two times can be made:

In [None]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']

In addition to the datetime64 Index, the Series instance times can be added:

In [None]:
df['times'] = times

In [None]:
df

When loc is now used to calculate the difference between the measurements at two different times, the time difference, i.e. a timedelta64, is also calculated for the Series times:

In [None]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']

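datetime64 values do not store a timezone and are usually taken to be in UTC. The tz_localize method can be used to specify a timezone using the input argument tz. Note that when tz_localize is called directly from a Series or DataFrame it localizes the datetime64 Index; the values of a datetime64 Series are localized through the dt attribute instead, i.e. df['times'].dt.tz_localize: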

In [None]:
? df.times.tz_localize

For example in the UK:

In [None]:
df['times'].tz_localize(tz='Europe/London')

And in the Czech Republic:

In [None]:
df['times'].tz_localize(tz='Europe/Prague')
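The difference between attaching a timezone and converting between timezones can be sketched with a small hypothetical Series named naive. The dt attribute is used so that the values of the Series, rather than its Index, are localized:

```python
import pandas as pd

# Hypothetical timezone-naive timestamps.
naive = pd.Series(pd.to_datetime(['2023-07-25 12:00', '2023-07-25 18:00']),
                  name='times')

# tz_localize attaches a timezone without changing the clock time.
as_london = naive.dt.tz_localize('Europe/London')

# tz_convert changes the clock time to the equivalent time in another timezone.
as_utc = naive.dt.tz_localize('UTC')
utc_in_london = as_utc.dt.tz_convert('Europe/London')

print(as_london[0])
print(utc_in_london[0])
```

In July the UK is on British Summer Time, one hour ahead of UTC, so 12:00 UTC is displayed as 13:00 in the converted Series.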

Care needs to be taken with non-UTC timezones as the clocks change, leading to ambiguous times. For example in the UK the clocks went back on the 29th of October 2023, so local times between 01:00 and 02:00 on that date occur twice:

In [None]:
starttime = np.datetime64('2023-10-28T11:00:00')
endtime = np.datetime64('2023-10-29T03:00:00')
timeinterval = np.timedelta64(30, 'm')

In [None]:
utc_times = np.arange(start=starttime, #inclusive
                      stop=endtime, #exclusive
                      step=timeinterval)

In [None]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous=True)

In [None]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous='NaT')
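The options can be compared on a smaller hypothetical Index that contains one time before, one time during and one time after the repeated hour. In this sketch the ambiguous input argument is given either the string 'NaT' or a boolean array with one flag per time, where True selects the summer time (DST) reading:

```python
import datetime
import numpy as np
import pandas as pd

naive = pd.DatetimeIndex(['2023-10-29 00:30',
                          '2023-10-29 01:30',   # occurs twice in the UK
                          '2023-10-29 02:30'])

with_nat = naive.tz_localize('Europe/London', ambiguous='NaT')
as_bst = naive.tz_localize('Europe/London',
                           ambiguous=np.array([True, True, True]))
as_gmt = naive.tz_localize('Europe/London',
                           ambiguous=np.array([False, False, False]))

print(with_nat.isna().tolist())
print(as_bst[1], as_gmt[1])
```

Only the ambiguous time is affected by the flags; the times before and after the repeated hour are localized in the same way in every case.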

## Reading Data from Files

The Series and DataFrames previously examined were created using builtins datatypes. pandas has a number of functions for reading in data from external files:

In [None]:
for identifier in dir(pd):
    if identifier.startswith('read_'):
        print(identifier, end=' ')

### Comma Separated Values File

CSV is an abbreviation for comma separated values. The file format has a similar structure to a tuple, where each element is separated by a comma. In a CSV file, each column is separated by a comma and the newline character is an instruction to move onto the next row:

When opened in a program such as Microsoft Excel, these display as a grid:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Notice that the comma in twinkle, twinkle is not a delimiter but part of the string. For this reason "twinkle, twinkle" was displayed enclosed in quotations.

The CSV file has a file name, in this case Book1.csv.

Because it is in the same folder as the interactive Python notebook, the file path can be specified as the following string:

<img src='./images/img_002.png' alt='img_002' width='800'/>

In [None]:
file_path = r'.\Book1.csv'

In [None]:
file_path

* r means raw string. In a raw string \ represents a literal backslash instead of the start of an escape sequence.
* .\ means in the same folder as the interactive Python notebook. The forward slash form ./ is equivalent and is also accepted on Windows.

If the file is moved into a sub folder called files:

<img src='./images/img_003.png' alt='img_003' width='800'/>

Then the file path becomes:

In [None]:
file_path = r'.\files\Book1.csv'

In [None]:
file_path

If the file is placed up a level from the interactive Python notebook, the file path becomes:

<img src='./images/img_005.png' alt='img_005' width='800'/>

In [None]:
file_path = r'..\Book1.csv'

In [None]:
file_path

And if a subfolder (that is in the folder up a level from the interactive Python notebook file) is made called files:

<img src='./images/img_006.png' alt='img_006' width='800'/>

In [None]:
file_path = r'..\files\Book1.csv'

In [None]:
file_path
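The relative paths above use the Windows \ separator. As an alternative sketch, not used elsewhere in this notebook, the standard library class pathlib.Path builds the same four relative paths in an operating system independent way using the / operator:

```python
from pathlib import Path

same_folder = Path('Book1.csv')
sub_folder = Path('files') / 'Book1.csv'
parent_folder = Path('..') / 'Book1.csv'
parent_sub_folder = Path('..') / 'files' / 'Book1.csv'

# as_posix displays the path with forward slashes on every operating system.
print(sub_folder.as_posix())
print(parent_sub_folder.as_posix())
```

A Path instance can be supplied to the pandas read functions in place of a string.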

The function read_csv is used to read in a CSV file as a dataframe:

In [None]:
? pd.read_csv

The function read_csv has a large number of input arguments; however, only the first one is mandatory when the file is in the expected format:

In [None]:
df = pd.read_csv(filepath_or_buffer = 'Book1.csv')

In [None]:
df

The first input argument is normally used positionally:

In [None]:
df = pd.read_csv(r'./files/Book1.csv')

In [None]:
df

Notice the Series names are as expected and a numeric index is added.

In [None]:
df.axes
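The handling of a quoted field can be tried without a file on disk. In this sketch the CSV text is a short hypothetical stand-in for a file, wrapped in io.StringIO so that read_csv treats it as a file-like buffer:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for a file on disk.
csv_text = ('string,integer\n'
            'the fat black cat,4\n'
            '"twinkle, twinkle",2\n')

demo = pd.read_csv(StringIO(csv_text))

print(demo.shape)
print(demo.loc[1, 'string'])
```

The comma inside the quotation marks is kept as part of the string, so the second row still has two columns.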

### Tab Delimited Text File

A tab delimited text file is very similar to a CSV file but uses \t instead of , as a delimiter:

The same function read_csv is used to read in text data. By default the function looks for a , as the delimiter between columns and, as none is present, the data is all shown in a single column:

In [None]:
df = pd.read_csv(r'./files/Book2.txt')

In [None]:
df

If the delimiter is specified as '\t' the data will be read in properly:

In [None]:
df = pd.read_csv(r'./files/Book2.txt', delimiter='\t')

In [None]:
df
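The effect of the delimiter can be reproduced with a short hypothetical piece of tab delimited text, again wrapped in io.StringIO instead of being read from a file. The keyword input argument sep is used here; delimiter is an alias for it:

```python
from io import StringIO

import pandas as pd

# Hypothetical tab delimited text standing in for a file on disk.
tsv_text = ('string\tinteger\n'
            'sat on the mat\t4\n'
            'little star\t2\n')

one_column = pd.read_csv(StringIO(tsv_text))              # default delimiter ','
two_columns = pd.read_csv(StringIO(tsv_text), sep='\t')   # tab delimiter

print(one_column.shape, two_columns.shape)
```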

### Microsoft Excel File

A Microsoft Excel File is a collection of sheets, where each individual sheet is similar to a csv file. The Excel file can also be modified in Microsoft Excel to visually format the data:

<img src='./images/img_007.png' alt='img_007' width='800'/>

This formatting capability makes the raw Excel File less human readable than the more basic CSV file:

This formatting is not important and only the data is read in by Python. The related function read_excel is used to read in the data from an Excel File. The delimiter is predefined in an Excel File; however, the Excel File can have multiple sheets, so the keyword input argument sheet_name is used to select one (by default the first sheet is read):

In [None]:
? pd.read_excel

In [None]:
df = pd.read_excel(r'./files/Book3.xlsx', sheet_name='Sheet1')

In [None]:
df

The sheet can also be selected by its zero-based integer position:

In [None]:
df = pd.read_excel(r'./files/Book3.xlsx', sheet_name=0)

In [None]:
df

Dates are normally parsed more successfully from an Excel File than from a CSV file:

In [None]:
df = pd.read_excel(r'./files/Book3.xlsx', sheet_name='Sheet1', parse_dates=True)

In [None]:
df

The date column is parsed as a date but the time column isn't:

In [None]:
df.dtypes

Sometimes the date and time have to be cast into strings and manipulated:

In [None]:
df['date'].astype('str')

In [None]:
df['date'].astype('str').str[:10]

In [None]:
df['time'].astype('str')

Once they are both strings they can be concatenated:

In [None]:
df['date'].astype('str').str[:10] + 'T' + df['time'].astype('str')

In [None]:
df['datetime'] = (df['date'].astype('str').str[:10] + 'T' + df['time'].astype('str')).astype('datetime64[ns]')

In [None]:
df
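The same combination can be sketched on a small hypothetical DataFrame that does not depend on the Excel File. This variation formats the date with dt.strftime rather than slicing the string, and uses the pandas function to_datetime for the final conversion. It assumes the time Series holds datetime.time instances:

```python
import datetime

import pandas as pd

# Hypothetical data: a parsed date Series and an unparsed time Series.
demo = pd.DataFrame({'date': pd.to_datetime(['2023-07-24', '2023-07-25']),
                     'time': [datetime.time(11, 36), datetime.time(12, 36)]})

# Build ISO 8601 strings such as '2023-07-24T11:36:00' and parse them.
stamp_text = demo['date'].dt.strftime('%Y-%m-%d') + 'T' + demo['time'].astype(str)
demo['datetime'] = pd.to_datetime(stamp_text)

print(demo['datetime'].tolist())
```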

The old Series can be deleted:

In [None]:
del df['date']
del df['time']

In [None]:
df

The Series 'category' can be cast to the category data type:

In [None]:
df['category'].astype('category')

In [None]:
df['category'] = df['category'].astype('category')

Then to reorder the columns, indexing is quite commonly used:

In [None]:
df

In [None]:
df[['string', 'integer', 'boolean', 'floatingpoint', 'datetime', 'category']]

The instance name df can be reassigned to this output:

In [None]:
df = df[['string', 'integer', 'boolean', 'floatingpoint', 'datetime', 'category']]

## Writing DataFrames to Objects and Files

The pandas library contains a number of functions for reading in files to DataFrames. The DataFrame class has analogous to_ methods for writing to objects and files:

In [None]:
for identifier in dir(df):
    if identifier.startswith('to_'):
        print(identifier, end=' ')

### Python Dictionary

A DataFrame can be written to a dictionary using:

In [None]:
df.to_dict()

This is the same form that can be used to instantiate a DataFrame:

In [None]:
pd.DataFrame(data=df.to_dict())
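The layout of the dictionary can be changed with the keyword input argument orient. A minimal sketch on a small hypothetical DataFrame named demo, comparing the default with the 'list' and 'records' layouts:

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})

by_column = demo.to_dict()                   # {column: {index: value}}
as_lists = demo.to_dict(orient='list')       # {column: [values]}
as_records = demo.to_dict(orient='records')  # [{column: value}, ...] one per row

print(by_column)
print(as_lists)
print(as_records)
```

The 'list' layout matches the form used to instantiate the DataFrame instances earlier, while 'records' is convenient when each row is to be processed as a separate dictionary.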

### JSON

JavaScript Object Notation (JSON), as the name suggests, originates from JavaScript but has become a commonly used data interchange format that is similar to a Python dictionary. It is common to retrieve data from a website in JSON form and convert it to a DataFrame:

In [None]:
df.to_json()

In [None]:
pd.read_json(df.to_json())
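Recent pandas versions deprecate passing a literal JSON string directly to read_json. A round trip that works across versions can be sketched by wrapping the string in io.StringIO, shown here on a small hypothetical DataFrame named demo:

```python
from io import StringIO

import pandas as pd

demo = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})

json_text = demo.to_json()                    # DataFrame to a JSON string
restored = pd.read_json(StringIO(json_text))  # JSON string back to a DataFrame

print(json_text)
print(restored)
```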

### Markdown

A DataFrame can be written to markdown using:

In [None]:
df.to_markdown()

If this is printed:

In [None]:
print(df.to_markdown())

And the cell output copied to a markdown cell:

|    | string            |   integer | boolean   |   floatingpoint | datetime            | category   |
|---:|:------------------|----------:|:----------|----------------:|:--------------------|:-----------|
|  0 | the fat black cat |         4 | True      |            0.86 | 2023-07-24 11:36:00 | A          |
|  1 | sat on the mat    |         4 | True      |            0.86 | 2023-07-25 12:36:00 | A          |
|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 2023-07-26 13:36:00 | B          |
|  3 | little star       |         2 | True      |           -1.14 | 2023-07-27 14:36:00 | B          |
|  4 | how I wonder      |         3 | False     |           -0.14 | 2023-07-28 15:36:00 | B          |
|  5 | what you are      |         4 | True      |            0.86 | 2023-07-29 16:36:00 | B          |

There is no analogous read_markdown.

### CSV and Text Files

The DataFrame can be written to a CSV file using the method to_csv:

In [None]:
df.to_csv(r'./files/Book4.csv')

This has the raw form:

Notice that the Index was written as the first column. If this file is read into a DataFrame instance using the defaults, an Unnamed Series corresponding to the Index is read in:

In [None]:
pd.read_csv(r'./files/Book4.csv')

This can be assigned to the Index using the keyword input argument index_col:

In [None]:
pd.read_csv(r'./files/Book4.csv', index_col=0)

Alternatively the DataFrame can be exported without the Index:

In [None]:
? df.to_csv

In [None]:
df.to_csv(r'./files/Book5.csv', index=False)

To save to a tab delimited text file, the separator needs to be specified:

In [None]:
df.to_csv(r'./files/Book6.txt', sep='\t', index=False)
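The effect of the index and sep input arguments can be inspected without writing a file, because to_csv returns the text as a string when no file path is supplied. A minimal sketch on a small hypothetical DataFrame named demo:

```python
import pandas as pd

demo = pd.DataFrame({'x': [1, 2], 'y': ['a', 'b']})

with_index = demo.to_csv()                          # Index written as first column
without_index = demo.to_csv(index=False)            # Index omitted
tab_delimited = demo.to_csv(sep='\t', index=False)  # tab instead of comma

print(with_index.splitlines())
print(without_index.splitlines())
print(tab_delimited.splitlines())
```

The header line of the first string begins with a comma because the Index has no name, which is why it is read back in as an Unnamed Series.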

### Excel File

Suppose there are three DataFrame instances:

In [None]:
df

In [None]:
df2 = df[['string', 'integer', 'boolean']]

In [None]:
df2

In [None]:
df3 = df[['string', 'datetime', 'category']]

In [None]:
df3

The method to_excel allows the writing of multiple DataFrame instances to individual sheets within an Excel File:

In [None]:
? df.to_excel

To write DataFrame instances to multiple sheets an ExcelWriter instance has to be instantiated and given the instruction to create a blank Excel File:

In [None]:
writer = pd.ExcelWriter(path=r'./files/Book7.xlsx')

The to_excel DataFrame method can then be used to instruct the ExcelWriter instance to write the DataFrame instance to a specified sheet:

In [None]:
df.to_excel(excel_writer=writer, sheet_name='df')
df2.to_excel(excel_writer=writer, sheet_name='df2')
df3.to_excel(excel_writer=writer, sheet_name='df3')

Details about the sheets can be seen using the writer's sheets attribute:

In [None]:
writer.sheets

Finally the ExcelWriter instance can be closed. This saves the file and releases the Excel spreadsheet from Python:

In [None]:
writer.close()

<img src='./images/img_008.png' alt='img_008' width='800'/>

<img src='./images/img_009.png' alt='img_009' width='800'/>

<img src='./images/img_010.png' alt='img_010' width='800'/>

The identifiers of the ExcelWriter class can be examined. Notice it has the data model identifiers \_\_enter\_\_ and \_\_exit\_\_, which means it can be used within a with code block. The with code block automatically closes the ExcelWriter instance when the block ends and is the preferred way to write multiple objects to a file:

In [None]:
for identifier in dir(writer):
    print(identifier, end=' ')

In [None]:
with pd.ExcelWriter('./files/Book8.xlsx') as writer:  
    df.to_excel(writer, sheet_name='df1', index=False)
    df2.to_excel(writer, sheet_name='df2', index=False)
    df3.to_excel(writer, sheet_name='df3', index=False)

<img src='./images/img_011.png' alt='img_011' width='800'/>

<img src='./images/img_012.png' alt='img_012' width='800'/>

<img src='./images/img_013.png' alt='img_013' width='800'/>