# pandas library

The name pandas is derived from "panel data", an econometrics term for multidimensional structured data sets, and is also a play on the phrase "Python data analysis". It is a library built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

Most Index instances are numeric by default, starting at zero and increasing in integer steps of one.

The Index, similar to a tuple, list or 1darray, has a single dimension which can be represented either as a row:

|index|0|1|2|3|
|---|---|---|---|---|

Or as a column when convenient:

|index|
|---|
|0|
|1|
|2|
|3|


The Series class has a value at each index and a name. It is essentially a numpy 1darray that has a name. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame is essentially a grouping of Series instances that share the same index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

## Importing Libraries

To use the data science libraries they need to be imported:

In [1]:
import numpy as np 
import pandas as pd
from helper_module import print_identifier_group

Once imported the identifiers can be viewed:

In [2]:
print('datamodel attribute:', end=' ')
print_identifier_group(pd, 'datamodel_attribute')
print('datamodel method:', end=' ')
print_identifier_group(pd, 'datamodel_method')
print('attribute:', end=' ')
print_identifier_group(pd, 'attribute')
print('function:', end=' ')
print_identifier_group(pd, 'function')
print('class:', end=' ')
print_identifier_group(pd, 'upper_class')


datamodel attribute: ['__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__']
datamodel method: []
attribute: ['annotations', 'api', 'arrays', 'compat', 'core', 'errors', 'io', 'offsets', 'options', 'pandas', 'plotting', 'testing', 'tseries', 'util']
function: ['array', 'bdate_range', 'concat', 'crosstab', 'cut', 'date_range', 'describe_option', 'eval', 'factorize', 'from_dummies', 'get_dummies', 'get_option', 'infer_freq', 'interval_range', 'isna', 'isnull', 'json_normalize', 'lreshape', 'melt', 'merge', 'merge_asof', 'merge_ordered', 'notna', 'notnull', 'period_range', 'pivot', 'pivot_table', 'qcut', 'read_clipboard', 'read_csv', 'read_excel', 'read_feather', 'read_fwf', 'read_gbq', 'read_hdf', 'read_html', 'read_json', 'read_orc', 'read_parquet', 'read_pickle', 'read_sas', 'read_spss', 'read_sql', 'read_sql_query', 'read_sql_table', 'read_stata', 'read_table', 

The datamodel attributes \_\_name\_\_ (dunder name), \_\_version\_\_ (dunder version) and \_\_file\_\_ (dunder file) can be used to get details about the library:

In [3]:
pd.__name__

'pandas'

In [4]:
pd.__version__

'2.1.4'

In [5]:
pd.__file__

'c:\\Users\\phili\\Anaconda3\\envs\\vscode-env\\Lib\\site-packages\\pandas\\__init__.py'

The classes are all in CamelCase. The main classes are:

* Index
* Series
* DataFrame
 
There are some variations of Index such as RangeIndex, MultiIndex, DatetimeIndex and TimedeltaIndex.

In general pandas uses object-oriented programming (OOP) as opposed to functional programming. This means methods are normally called on Index, Series and DataFrame instances to analyse or manipulate the data in the instance. Most of the functions within the pandas library are used to read in data from a file and output a DataFrame instance.
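The split between functions and methods can be sketched as follows. This is a minimal illustration that assumes an in-memory CSV string (the hypothetical `csv_text`) standing in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical in-memory CSV text standing in for a file on disk
csv_text = "x,y\n1.0,1.5\n2.0,2.5\n3.0,3.5\n"

# Function from the pandas namespace: reads the data and returns a DataFrame
df = pd.read_csv(io.StringIO(csv_text))

# Method called on the DataFrame instance: analyses the data it holds
column_means = df.mean()

print(type(df).__name__)
print(column_means)
```

The function creates the instance and the method then works on the data stored in that instance.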

## Series

The initialisation signature for a pandas ```Series``` class can be examined:

In [6]:
pd.Series?

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool' = False,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray hav

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied an empty series instance with no index, no name and a generic object datatype is instantiated:

In [7]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a numpy 1darray:

In [8]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

Since an ndarray is itself normally initialised from a list, this can be abbreviated to:

In [9]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64

When ```dtype=None```, the data type will be inferred from the data:

In [10]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [11]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2023-12-29 23:48:41.968987
1   2023-12-30 23:48:41.968987
2   2023-12-31 23:48:41.968987
dtype: datetime64[ns]

In [12]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a string in it is classed as non-numeric and has the generic dtype ```object``` (meaning it can be any Python object).
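A small sketch of dtype inference with mixed values. The variable names are illustrative only:

```python
import numpy as np
import pandas as pd

# Integers mixed with a float are upcast to a floating point dtype
ints_and_floats = pd.Series([1, 2.5, 3])

# Numbers mixed with a string fall back to the generic object dtype
mixed = pd.Series([1, 'a', 2.5])

print(ints_and_floats.dtype)
print(mixed.dtype)
```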

The dtype can be manually overridden when supplying the numpy 1darray by using the ```np.array``` input argument ```dtype```:

In [13]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or alternatively by using the ```Series``` keyword input argument ```dtype```:

In [14]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32

Notice that the index is zero-ordered numeric in integer steps of 1 by default. This can be manually changed by use of the keyword input argument ```index``` and providing an ```Index``` instance, ```ndarray``` instance or ```list``` instance of index values:

In [15]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32
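The three ways of supplying the ```index``` can be compared side by side. This is a minimal sketch with illustrative labels and values; in each case the labels end up stored in an ```Index``` instance:

```python
import numpy as np
import pandas as pd

labels = ['a', 'b', 'c']
values = [1, 2, 3]

# The same labels supplied as a list, an ndarray and an Index
from_list = pd.Series(values, index=labels)
from_array = pd.Series(values, index=np.array(labels))
from_index = pd.Series(values, index=pd.Index(labels))

for series in (from_list, from_array, from_index):
    print(type(series.index).__name__, list(series.index))
```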

A ```Series``` usually also has a ```name```:

In [16]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32

Normally the ```data``` and ```name``` are supplied and the ```index``` and ```dtype``` are inferred:

In [17]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

## DataFrame

The initialisation signature for a pandas DataFrame can be examined:

In [18]:
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
d

The keyword input arguments for a ```DataFrame``` instance are similar to those found for a ```Series``` instance. Because a ```DataFrame``` is a collection of ```Series```, ```data``` and ```columns``` are plural, while ```index``` and ```dtype``` are each a single value shared by every ```Series```:

* data (plural)
* index (singular)
* columns (plural of name)
* dtype (singular, a single dtype applied to every column)

In [19]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The ```dtype``` keyword input argument accepts only a single dtype, which is applied to every ```Series``` instance in the ```DataFrame``` instance. If it is supplied as a list of dtypes a ```TypeError``` will display. To give each ```Series``` its own dtype, the ```astype``` method can be used with a mapping once the ```DataFrame``` has been constructed.
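A sketch of giving each column its own dtype after construction, assuming the ```astype``` method is called with a mapping of column name to dtype. The column names and values are illustrative:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 1.2],
                        [2, 2.2],
                        [3, 3.2]],
                  columns=['x', 'y'])

# astype returns a new DataFrame; each key names a column to convert
converted = df.astype({'x': np.int32, 'y': np.float32})

print(converted.dtypes)
```

The original ```df``` is left unchanged because ```astype``` returns a new instance.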

Once again normally the ```dtype``` and ```index``` are inferred:

In [20]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


It is common to supply ```columns``` and ```data``` together in the form of a mapping. The mapping consists of key: value pairs. Each key should be a string, which becomes the column name, and each value should be a 1darray or list, which becomes the data in that column:

In [21]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|
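The values in the mapping can also be ```Series``` instances. The sketch below uses illustrative data to check that every column of the resulting ```DataFrame``` shares the index of the ```DataFrame``` itself:

```python
import pandas as pd

xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
yseries = pd.Series([1.2, 2.2, 3.2, 4.2], name='y')

# Each key becomes a column name, each Series becomes a column
df = pd.DataFrame({'x': xseries, 'y': yseries})

# Every column is a Series that carries the same index as the DataFrame
same_index = df['x'].index.equals(df.index) and df['y'].index.equals(df.index)

print(same_index)
print(list(df.columns))
```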


## Series Identifiers

Suppose the following ```ndarray``` (1D) and ```Series``` instances are created:

In [22]:
xarray = np.array([1.1, 2.1, 3.1, 4.1])

In [23]:
xarray

array([1.1, 2.1, 3.1, 4.1])

In [24]:
xseries = pd.Series(xarray, name='x')

In [25]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

The ```Series``` is more commonly instantiated directly from a ```list```:

In [26]:
yseries = pd.Series([1.1, -2.1, -3.1, None], name='y', dtype=float)

In [27]:
yseries

0    1.1
1   -2.1
2   -3.1
3    NaN
Name: y, dtype: float64

Its identifiers can be viewed. Notice that the following are consistent with a ```1darray``` instance because a ```Series``` instance is based on an ```ndarray``` (1D). This means the previous knowledge from the ```numpy``` tutorial is applicable to the ```Series```:

In [28]:
print('datamodel attribute:', end=' ')
print_identifier_group(xarray, kind='datamodel_attribute', second=xseries, show_only_intersection_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xarray, kind='datamodel_method', second=xseries, show_only_intersection_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xarray, kind='attribute', second=xseries, show_only_intersection_identifiers=True)
print('method:', end=' ')
print_identifier_group(xarray, kind='function', second=xseries, show_only_intersection_identifiers=True)

datamodel attribute: ['__array_priority__', '__doc__', '__hash__']
datamodel method: ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__rpow__', '__rsub__', '__rtruediv__', '__rxor__', '__setattr__', '__setitem__', '__setstate__', '__sizeof__', '__str__', 

Some of the identifiers in an ```ndarray``` (1D), such as the functions which work over multiple dimensions, are not applicable to a ```Series```:

In [29]:
print('datamodel attribute:', end=' ')
print_identifier_group(xarray, kind='datamodel_attribute', second=xseries, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xarray, kind='datamodel_method', second=xseries, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xarray, kind='attribute', second=xseries, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xarray, kind='function', second=xseries, show_unique_identifiers=True)

datamodel attribute: ['__array_interface__', '__array_struct__']
datamodel method: ['__array_finalize__', '__array_function__', '__array_prepare__', '__array_wrap__', '__buffer__', '__class_getitem__', '__complex__', '__dlpack__', '__dlpack_device__', '__ilshift__', '__imatmul__', '__index__', '__irshift__', '__lshift__', '__rlshift__', '__rrshift__', '__rshift__']
attribute: ['base', 'ctypes', 'data', 'flat', 'imag', 'itemsize', 'real', 'strides']
method: ['argpartition', 'byteswap', 'choose', 'compress', 'conj', 'conjugate', 'diagonal', 'dump', 'dumps', 'fill', 'flatten', 'getfield', 'itemset', 'newbyteorder', 'nonzero', 'partition', 'ptp', 'put', 'reshape', 'resize', 'setfield', 'setflags', 'sort', 'tobytes', 'tofile', 'tolist', 'tostring', 'trace']


There are also additional identifiers in the ```Series``` class that are not available in the ```ndarray``` (1D):

In [30]:
print('datamodel attribute:', end=' ')
print_identifier_group(xseries, kind='datamodel_attribute', second=xarray, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xseries, kind='datamodel_method', second=xarray, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xseries, kind='attribute', second=xarray, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xseries, kind='function', second=xarray, show_unique_identifiers=True)

datamodel attribute: ['__annotations__', '__dict__', '__module__', '__pandas_priority__']
datamodel method: ['__column_consortium_standard__', '__finalize__', '__getattr__', '__nonzero__', '__round__', '__weakref__']
attribute: ['array', 'at', 'attrs', 'axes', 'dtypes', 'empty', 'hasnans', 'iat', 'index', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'values']
method: ['abs', 'add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'apply', 'asfreq', 'asof', 'at_time', 'autocorr', 'backfill', 'between', 'between_time', 'bfill', 'bool', 'combine', 'combine_first', 'compare', 'convert_dtypes', 'corr', 'count', 'cov', 'cummax', 'cummin', 'describe', 'diff', 'div', 'divide', 'divmod', 'drop', 'drop_duplicates', 'droplevel', 'dropna', 'duplicated', 'eq', 'equals', 'ewm', 'expanding', 'explode', 'factorize', 'ffill', 'fillna', 'filter', 'first', 'first_valid_index', 'floordiv', 'ge', 'get', 'groupby', 'gt', 'head', 'hist', 'idxmax', 'idxmin', 'iloc', 'infe

Many of the additional attributes return the values that were supplied to the initialisation signature:

In [31]:
xseries.array

<NumpyExtensionArray>
[1.1, 2.1, 3.1, 4.1]
Length: 4, dtype: float64

In [32]:
xseries.name

'x'

In [33]:
xseries.index

RangeIndex(start=0, stop=4, step=1)

In [34]:
xseries.values

array([1.1, 2.1, 3.1, 4.1])

In [35]:
xseries.dtypes

dtype('float64')

Many of the additional methods duplicate the behaviour of an equivalent datamodel method, for example ```abs``` and ```__abs__``` (dunder abs). Recall that the datamodel identifier ```__abs__``` defines the way the ```builtins``` function ```abs``` operates on an instance of the ```Series``` class, and use of the ```builtins``` function is generally preferred:

In [36]:
yseries

0    1.1
1   -2.1
2   -3.1
3    NaN
Name: y, dtype: float64

In [37]:
abs(yseries)

0    1.1
1    2.1
2    3.1
3    NaN
Name: y, dtype: float64

In [38]:
yseries.abs()

0    1.1
1    2.1
2    3.1
3    NaN
Name: y, dtype: float64

The supplementary method ```add``` largely duplicates the behaviour of the datamodel method ```__add__``` (dunder add). Recall that ```__add__``` defines the behaviour of the ```+``` operator, and use of the operator is generally preferred:

In [39]:
xseries.__add__(yseries)

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

In [40]:
xseries + yseries

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

The ```add``` method however includes additional options via keyword input arguments such as ```fill_value``` which can be used for an addition involving a missing value: 

In [41]:
xseries.add(yseries)

0    2.2
1    0.0
2    0.0
3    NaN
dtype: float64

In [42]:
xseries.add(yseries, fill_value=0)

0    2.2
1    0.0
2    0.0
3    4.1
dtype: float64
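The other arithmetic methods, such as ```sub``` and ```mul```, accept ```fill_value``` in the same way. A minimal sketch with illustrative data, where the fill value is chosen to be the neutral element of each operation:

```python
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0], name='x')
y = pd.Series([1.0, None, 2.0], name='y')  # None is stored as NaN

# The missing value in y is treated as 0 for subtraction and 1 for multiplication
difference = x.sub(y, fill_value=0)
product = x.mul(y, fill_value=1)

print(difference)
print(product)
```

With the plain operators ```-``` and ```*``` the result at index 1 would instead be ```NaN```.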

## DataFrame Identifiers

Suppose the following ```DataFrame``` is constructed:

In [43]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [44]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


A large number of identifiers can be seen to be consistent between a ```DataFrame``` and a ```Series``` instance such as almost all of the datamodel identifiers. These identifiers operate across 2 dimensions across a ```DataFrame``` instance instead of 1 dimension along a ```Series```:

In [45]:
print('datamodel attribute:', end=' ')
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_only_intersection_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_only_intersection_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_only_intersection_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_only_intersection_identifiers=True)

datamodel attribute: ['__annotations__', '__array_priority__', '__dict__', '__doc__', '__hash__', '__module__', '__pandas_priority__']
datamodel method: ['__abs__', '__add__', '__and__', '__array__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__round__', '__rpow__', '__rsub_

The ```Series``` has ```Series``` specific attributes which are not available for a ```DataFrame``` instance. The datamodel methods in a ```Series``` not present in a ```DataFrame``` are for type-casting:

In [46]:
print('datamodel attribute:', end=' ')
print_identifier_group(xseries, kind='datamodel_attribute', second=df, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xseries, kind='datamodel_method', second=df, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xseries, kind='attribute', second=df, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xseries, kind='function', second=df, show_unique_identifiers=True)

datamodel attribute: []
datamodel method: ['__column_consortium_standard__', '__float__', '__int__']
attribute: ['array', 'dtype', 'hasnans', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes']
method: ['argmax', 'argmin', 'argsort', 'autocorr', 'between', 'divmod', 'factorize', 'item', 'ravel', 'rdivmod', 'repeat', 'searchsorted', 'to_frame', 'to_list', 'unique', 'view']


The ```DataFrame``` instead has ```DataFrame``` specific attributes such as the name of each ```Series``` in the ```DataFrame```. The ```DataFrame``` also has supplementary methods such as ```insert```, which is used to insert a ```Series``` instance into a ```DataFrame``` instance, and ```join``` and ```merge```, which are used to join or merge ```DataFrame``` instances respectively. The datamodel methods in a ```DataFrame``` not present in a ```Series``` are used to exchange a ```DataFrame``` with other ```DataFrame``` libraries:

In [47]:
print('datamodel attribute:', end=' ')
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_unique_identifiers=True)

datamodel attribute: []
datamodel method: ['__dataframe__', '__dataframe_consortium_standard__']
attribute: ['columns', 'style', 'x', 'y']
method: ['applymap', 'assign', 'boxplot', 'corrwith', 'eval', 'from_dict', 'from_records', 'insert', 'isetitem', 'iterrows', 'itertuples', 'join', 'melt', 'merge', 'pivot', 'pivot_table', 'query', 'select_dtypes', 'set_index', 'stack', 'to_feather', 'to_gbq', 'to_html', 'to_orc', 'to_parquet', 'to_records', 'to_stata', 'to_xml']


Notice the ```columns``` attribute returns an ```Index``` containing the name of each ```Series``` in the ```DataFrame```:

In [48]:
df.columns

Index(['x', 'y'], dtype='object')

Since the following conditions are satisfied:

In [49]:
'x'.isidentifier()

True

In [50]:
'y'.isidentifier()

True

And these identifier names don't clash with any of the other ```DataFrame``` identifiers, the following become ```DataFrame``` attributes and correspond to each ```Series``` in the ```DataFrame```:

In [51]:
df.x

0    1.1
1    2.1
2    3.1
3    3.1
Name: x, dtype: float64

In [52]:
df.y

0    1.2
1    2.2
2    3.2
3    4.2
Name: y, dtype: float64
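When either condition fails, attribute access is not available and square brackets are needed instead. A sketch using two hypothetical column names, ```'count'``` (which clashes with an existing ```DataFrame``` method) and ```'my col'``` (which is not a valid identifier):

```python
import pandas as pd

df = pd.DataFrame({'count': [1, 2], 'my col': [3, 4]})

# 'count' clashes with the DataFrame method count, so the attribute is the method
print(callable(df.count))

# 'my col' contains a space and cannot be used as an attribute name
print('my col'.isidentifier())

# Indexing with square brackets returns the Series in both cases
count_column = df['count']
spaced_column = df['my col']
print(count_column.tolist(), spaced_column.tolist())
```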

## Mutability

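The ```Series``` and ```DataFrame``` classes are mutable collections, meaning they have the datamodel identifier ```__getitem__``` (dunder getitem), which reads a value, as well as the mutating identifiers ```__setitem__``` (dunder setitem) and ```__delitem__``` (dunder delitem). An ```Index``` instance, by contrast, is immutable and raises a ```TypeError``` when an element is assigned. For the ```Series``` class: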

In [53]:
'__getitem__' in dir(pd.Series)

True

In [54]:
'__setitem__' in dir(pd.Series)

True

In [55]:
'__delitem__' in dir(pd.Series)

True

This means the following ```Series``` can be indexed into:

In [56]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

In [57]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Recall the datamodel ```__getitem__``` (dunder getitem) defines how a ```Collection``` responds to indexing using square brackets:

In [58]:
xseries[0]

1.1

Recall that the mutable method ```__setitem__``` (dunder setitem) defines how a ```MutableCollection``` responds to indexing using square brackets followed by assignment to a new value:

In [59]:
xseries[0] = None

In [60]:
xseries

0    NaN
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Recall that the mutable method ```__delitem__``` (dunder delitem) defines how a ```MutableCollection``` responds to a ```del``` statement applied to an element indexed using square brackets:

In [61]:
del xseries[2]

In [62]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64
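An ```Index``` instance behaves differently when an element is assigned. The sketch below, with illustrative labels, shows that the assignment raises a ```TypeError``` and that the labels of a ```Series``` are instead changed by replacing the whole ```Index```:

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])

# Assigning to a single element of an Index is not supported
try:
    index[0] = 'z'
    index_error = None
except TypeError as error:
    index_error = error

print(type(index_error).__name__)

# The labels of a Series are changed by assigning a new Index as a whole
series = pd.Series([1.1, 2.1, 3.1], index=index)
series.index = ['p', 'q', 'r']
print(list(series.index))
```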

Despite the ```ndarray```, ```Series``` and ```DataFrame``` being mutable datatypes, most of their methods are immutable by default, meaning they return a new instance instead of modifying the original. If the docstring of the method ```dropna``` is examined:

In [63]:
xseries.dropna?

Signature:
xseries.dropna(
    *,
    axis: 'Axis' = 0,
    inplace: 'bool' = False,
    how: 'AnyAll | None' = None,
    ignore_index: 'bool' = False,
) -> 'Series | None'
Docstring:
Return a new Series with missing values removed.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index'}
    Unused. Parameter needed for compatibility with DataFrame.
inplace : bool, default False
    If True, do operation inplace and return

Notice it has the keyword input argument ```inplace```. ```inplace``` has a default value of ```False```, making the method immutable by default, so it returns a new ```Series```:

In [64]:
xseries.dropna() # Return value

1    2.1
3    4.1
Name: x, dtype: float64

In [65]:
xseries # Unchanged

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

When ```inplace``` is set to ```True``` the method becomes mutable:

In [66]:
xseries.dropna(inplace=True) # No return value

In [67]:
xseries # Modified inplace

1    2.1
3    4.1
Name: x, dtype: float64

The same behaviour can be seen on the method ```reset_index```:

In [68]:
xseries.reset_index?

Signature:
xseries.reset_index(
    level: 'IndexLabel | None' = None,
    *,
    drop: 'bool' = False,
    name: 'Level' = <no_default>,
    inplace: 'bool' = False,
    allow_duplicates: 'bool' = False,
) -> 'DataFrame | Series | None'
Docstring:
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before a

With the default values this method is immutable and returns a ```DataFrame```, since the old index is added as the first ```Series```:

In [69]:
xseries.reset_index() # Return value

| |index|x|
|---|---|---|
|0|1|2.1|
|1|3|4.1|


If the ```drop``` keyword input argument is set to ```True```, a ```Series``` will instead be returned:

In [70]:
xseries.reset_index(drop=True) # Return value

0    2.1
1    4.1
Name: x, dtype: float64

Once again the ```inplace``` keyword input argument can be assigned to ```True``` making the method mutable:

In [71]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [72]:
xseries # Modified inplace

0    2.1
1    4.1
Name: x, dtype: float64
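Because the immutable form of each method returns a new ```Series```, the calls can be chained instead of using ```inplace=True``` twice. A minimal sketch with illustrative data:

```python
import pandas as pd

original = pd.Series([None, 2.1, 3.1, 4.1], name='x')

# Each method returns a new Series, so the second call works on the result of the first
cleaned = original.dropna().reset_index(drop=True)

print(cleaned)
print(len(original))  # the original still holds all four values
```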

The following ```Series``` methods have the parameter ```inplace``` and are therefore immutable by default but are mutable when this parameter is assigned to ```True```:

In [73]:
print_identifier_group(xseries, kind='function', has_parameter='inplace')

['backfill', 'bfill', 'clip', 'drop', 'drop_duplicates', 'dropna', 'ffill', 'fillna', 'interpolate', 'mask', 'pad', 'rename', 'rename_axis', 'replace', 'reset_index', 'sort_index', 'sort_values', 'where']


Notice that most of these are used to fill, interpolate or drop values along a ```Series``` in response to missing data. 

```sort_values```, for example, can be used to sort the values along a ```Series```. By default ```inplace=False``` and the method is immutable:

In [74]:
xseries.sort_values(ascending=False) # Return value

1    4.1
0    2.1
Name: x, dtype: float64

Recall when an immutable method is used with assignment, the new value returned on the right of the assignment operator is assigned to the instance name or label on the left of the assignment operator. If the instance name is conceptualised as a label, then a reassignment peels the label from the original instance and places it on the new instance created:

In [75]:
xseries = xseries.sort_values(ascending=False)

In [76]:
xseries

1    4.1
0    2.1
Name: x, dtype: float64

On the other hand, when a method is mutable, there is no return value and the ```Series``` is updated inplace:

In [77]:
xseries.sort_values(ascending=True, inplace=True) # No return value

In [78]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

If assignment is used with a mutable method, the return value of the method is ```None``` and therefore ```None``` will be assigned to the ```new_label```:

In [79]:
new_label = xseries.sort_values(ascending=True, inplace=True) 

In [80]:
new_label

Reassignment with the ```inplace``` parameter set to ```True``` should therefore be avoided, as the value being reassigned will be ```None```:

In [81]:
xseries = xseries.sort_values(ascending=True, inplace=True) 

In [82]:
xseries

By convention immutable methods have a ```return``` value and mutable methods have no ```return``` value. An exception to this is the mutable method ```pop``` which returns the popped value and mutates the ```Series``` in place:

In [83]:
xseries = pd.Series([4.1, 2.1, 3.1, 1.1], name='x')

In [84]:
xseries

0    4.1
1    2.1
2    3.1
3    1.1
Name: x, dtype: float64

In [85]:
xseries.pop(item=1) # Return value

2.1

In [86]:
xseries # Mutated

0    4.1
2    3.1
3    1.1
Name: x, dtype: float64

Methods whose names match the mutable methods of a ```list``` are generally also mutable. Most of the other methods are immutable and have a ```return``` value.

## Axis

Another common keyword is ```axis```:

In [87]:
print_identifier_group(xseries, kind='function', has_parameter='axis')

['add', 'add_prefix', 'add_suffix', 'agg', 'aggregate', 'align', 'all', 'any', 'argmax', 'argmin', 'argsort', 'at_time', 'backfill', 'between_time', 'bfill', 'clip', 'cummax', 'cummin', 'cumprod', 'cumsum', 'div', 'divide', 'divmod', 'drop', 'droplevel', 'dropna', 'eq', 'ewm', 'expanding', 'ffill', 'fillna', 'filter', 'floordiv', 'ge', 'groupby', 'gt', 'idxmax', 'idxmin', 'iloc', 'interpolate', 'kurt', 'kurtosis', 'le', 'loc', 'lt', 'mask', 'max', 'mean', 'median', 'min', 'mod', 'mul', 'multiply', 'ne', 'pad', 'pow', 'prod', 'product', 'radd', 'rank', 'rdiv', 'rdivmod', 'reindex', 'rename', 'rename_axis', 'repeat', 'resample', 'rfloordiv', 'rmod', 'rmul', 'rolling', 'rpow', 'rsub', 'rtruediv', 'sample', 'sem', 'set_axis', 'shift', 'skew', 'sort_index', 'sort_values', 'squeeze', 'std', 'sub', 'subtract', 'sum', 'take', 'transform', 'truediv', 'truncate', 'tz_convert', 'tz_localize', 'var', 'where', 'xs']


A ```Series``` is a column and has only a single ```axis``` available, ```0```. The operation can be conceptualised as sorting the data in the rows by use of the ```Series``` name, and therefore ```axis``` can also be assigned to the ```str``` instance ```'rows'``` (the alias listed in the docstring is ```'index'```):

In [88]:
xseries.sort_values(ascending=True, axis=0)

3    1.1
2    3.1
0    4.1
Name: x, dtype: float64

In [89]:
xseries.sort_values(ascending=True, axis='rows')

3    1.1
2    3.1
0    4.1
Name: x, dtype: float64

For a ```DataFrame``` there are two values for ```axis```, ```0``` which is the default and ```1```:

In [90]:
df = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                   'y': np.array([6.2, 7.0, 2.1, 1.2])},
                   index=['a', 'b', 'c', 'd'])

In [91]:
df

| |x|y|
|---|---|---|
|a|5.1|6.2|
|b|2.1|7.0|
|c|2.1|2.1|
|d|4.1|1.2|


The default ```axis``` is ```0``` which is equivalent to the ```str``` instance ```'rows'```. This is an instruction to sort the data in the rows ```by``` the ordering of the data in the columns:

In [92]:
df.sort_values(by=['x', 'y'], axis='rows')

| |x|y|
|---|---|---|
|c|2.1|2.1|
|b|2.1|7.0|
|d|4.1|1.2|
|a|5.1|6.2|


Notice that the data is sorted in ascending order by ```'x'``` and, where values in ```'x'``` are duplicated, by ```'y'```. The original ```DataFrame``` is unchanged because ```inplace``` is ```False``` by default:

In [93]:
df

Unnamed: 0,x,y
a,5.1,6.2
b,2.1,7.0
c,2.1,2.1
d,4.1,1.2
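
As a minimal sketch of the tie-breaking behaviour, the example below rebuilds the same data under the illustrative name ```df_sort``` and passes a list to ```ascending``` so that each entry in ```by``` gets its own direction. The variable names are assumptions made for this example only:

```python
import numpy as np
import pandas as pd

# Same data as above, rebuilt so the example is self-contained
df_sort = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                        'y': np.array([6.2, 7.0, 2.1, 1.2])},
                       index=['a', 'b', 'c', 'd'])

# Ascending 'x', ties in 'x' broken by ascending 'y'
both_ascending = df_sort.sort_values(by=['x', 'y'])

# Ascending 'x', ties in 'x' broken by descending 'y'
mixed = df_sort.sort_values(by=['x', 'y'], ascending=[True, False])

print(list(both_ascending.index))
print(list(mixed.index))
print(list(df_sort.index))  # sort_values returned new instances
```

Only the relative order of ```'b'``` and ```'c'``` differs between the two results, because these are the only rows that tie on ```'x'```.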


The ```axis``` can be changed to ```1``` which is equivalent to the ```str``` instance ```'columns'```. This is an instruction to sort the data in the columns ```by``` the ordering of the data in the index:

In [94]:
df.sort_values(by=['c', 'd'], axis='columns')

Unnamed: 0,y,x
a,6.2,5.1
b,7.0,2.1
c,2.1,2.1
d,1.2,4.1


The data is sorted in ascending order first by the values at index ```'c'```, but the ```Series``` instances ```'x'``` and ```'y'``` both have the value 2.1 there, so this index value alone does not determine the order of the ```Series```. The next index value ```'d'``` is then used: the value in the ```Series``` instance ```'y'``` is 1.2 and the value in the ```Series``` instance ```'x'``` is 4.1, therefore ```'y'``` is ordered before ```'x'```.
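
The effect of the chosen index label can be checked with a short sketch. It rebuilds the same data under the illustrative name ```df_cols``` and sorts the columns by a single index label at a time:

```python
import numpy as np
import pandas as pd

df_cols = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                        'y': np.array([6.2, 7.0, 2.1, 1.2])},
                       index=['a', 'b', 'c', 'd'])

# At index 'd' the value in 'y' (1.2) is smaller than in 'x' (4.1)
by_d = df_cols.sort_values(by='d', axis='columns')

# At index 'a' the value in 'x' (5.1) is smaller than in 'y' (6.2)
by_a = df_cols.sort_values(by='a', axis='columns')

print(list(by_d.columns))
print(list(by_a.columns))
```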

In the ```NDArray``` negative indexes are quite commonly used to select an ```axis```. These are not used for the ```Series``` (1D) and ```DataFrame``` (2D) instances, which are of fixed dimensions.

## Indexing and Slicing

Supposing the following dictionary instance is instantiated:

In [95]:
mapping = {'x': np.array([1.1, 2.1, 3.1, 4.1]),
           'y': np.array([1.2, 2.2, 3.2, 4.2])}

In [96]:
mapping

{'x': array([1.1, 2.1, 3.1, 4.1]), 'y': array([1.2, 2.2, 3.2, 4.2])}

A ```DataFrame``` instance can be instantiated by assigning the ```mapping``` to the keyword input argument ```data```:

In [97]:
df = pd.DataFrame(data=mapping)

In [98]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


A ```mapping``` can be indexed with a ```key```. This returns the ```value``` the ```key``` references, in this case the ```NDArray```:

In [99]:
mapping['x']

array([1.1, 2.1, 3.1, 4.1])

Analogously, when a ```DataFrame``` is indexed using the ```name``` of a ```Series```, the ```Series``` is returned:

In [100]:
df['x']

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value in the ```NDArray``` instance can be indexed by use of a second set of square brackets to enclose the numeric index:

In [101]:
mapping['x'][1]

2.1

Analogously, a ```value``` in the ```Series``` can be indexed by use of a second set of square brackets to enclose the numeric index:

In [102]:
df['x'][1]

2.1

If the DataFrame instance is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

The first set of brackets select the Series:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

And the second set of brackets selects the index retrieving the value:

2.1

If the DataFrame is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

Sometimes the value for each ```Series``` at a value within the ```Index``` instance is desired:

|index|'x'|'y'|
|---|---|---|
|1|2.1|2.2|

This is done by use of the property location ```loc```. Note that ```loc``` returns the above *row* as a ```Series``` which is displayed by default as a *column*:

|index|1|
|---|---|
|'x'|2.1|
|'y'|2.2|

```loc``` is callable and has a docstring:

In [103]:
callable(df.loc)

True

In [104]:
df.loc?

Type:        property
String form: <property object at 0x00000218B8169580>
Docstring:  
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See mo

However, unlike most callables, it is not called using parentheses:

In [105]:
df.loc

<pandas.core.indexing._LocIndexer at 0x218ba5a9360>

In [106]:
df.loc()

<pandas.core.indexing._LocIndexer at 0x218ba5768f0>

Instead ```loc``` is a property that returns an indexer object, and this object is indexed with square brackets. Under the hood this is syntactic sugar around the indexer's datamodel method ```__getitem__```, which switches the order of indexing from the default ```[column, index]``` to ```[index, column]```:

In [107]:
df.loc[1]

x    2.1
y    2.2
Name: 1, dtype: float64

In [108]:
df.loc[1]['x']

2.1
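
As a small sketch of the ```[index, column]``` ordering, ```loc``` also accepts the index label and the ```Series``` name together in a single pair of square brackets. The name ```df_demo``` is illustrative and the data matches the ```DataFrame``` above:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

# Index label only: the row is returned as a Series named after the label
row = df_demo.loc[1]

# Index label first, Series name second, in one indexing operation
value = df_demo.loc[1, 'x']

print(row.name, list(row.index))
print(value)
```

The single-step form avoids creating the intermediate row ```Series``` and gives the same value as ```df_demo.loc[1]['x']```.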

```loc``` can also use a list of index values:

In [109]:
df.loc[[0, 2]]

Unnamed: 0,x,y
0,1.1,1.2
2,3.1,3.2


The related property integer location ```iloc``` always uses integer positions rather than index labels. It accepts a single position, a list of positions or a slice. As with other Python slices, the stop position of an ```iloc``` slice is excluded:

In [110]:
df.iloc[[0, 2]]

Unnamed: 0,x,y
0,1.1,1.2
2,3.1,3.2


In [111]:
df.iloc[0:2]

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
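
The ```loc``` docstring above states that the start and the stop of a label slice are included, whereas an ```iloc``` slice follows the usual Python convention. A minimal sketch of the difference, using the illustrative name ```df_demo``` for the same data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

# Label slice: the labels 0, 1 and 2 are all included
label_slice = df_demo.loc[0:2]

# Position slice: positions 0 and 1, the stop position 2 is excluded
position_slice = df_demo.iloc[0:2]

print(list(label_slice.index))
print(list(position_slice.index))
```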


If the following DataFrame instance is created with index labels i.e. a non-numeric index:

|index|'x'|'y'|
|---|---|---|
|'a'|1.1|1.2|
|'b'|2.1|2.2|
|'c'|3.1|3.2|
|'d'|4.1|4.2|

In [112]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data=mapping)

In [113]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2


The difference between ```loc``` and ```iloc``` can be seen more clearly. For ```loc``` the index label is used:

In [114]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

Despite the labels being non-numeric ```iloc``` handles the index values numerically:

In [115]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

Conceptually, ```iloc``` behaves as though it indexes into the ```DataFrame``` instance's reset index:

In [116]:
df.reset_index(drop=True)

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


In [117]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64
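
The relationship between the two properties can be sketched by looking up the label stored at a given position. The name ```df_lab``` is illustrative, and the ```try``` block shows that an integer is treated by ```loc``` as a label, which does not exist in this ```Index```:

```python
import numpy as np
import pandas as pd

df_lab = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                      data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                            'y': np.array([1.2, 2.2, 3.2, 4.2])})

# Position 1 holds the label 'b', so both selections give the same row
by_position = df_lab.iloc[1]
by_label = df_lab.loc[df_lab.index[1]]
print(by_position.equals(by_label))

# loc interprets 1 as a label, never as a position
try:
    df_lab.loc[1]
    label_found = True
except KeyError:
    label_found = False
print(label_found)
```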

When ```loc``` and ```iloc``` are used to select a single index, the data for each ```Series``` at this index is itself displayed as a ```Series```:

In [118]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

In [119]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

Because each of the above are a ```Series``` instance, they can in turn be indexed into:

In [120]:
df.loc['b']['y']

2.2

In [121]:
df.iloc[1]['y']

2.2

When ```iloc``` and ```loc``` are instead used to select data from multiple indexes a ```DataFrame``` instance is output:

In [122]:
df.loc[['a', 'b']]

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2


In [123]:
df.iloc[0:2]

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2


And because each of these is a ```DataFrame``` instance, the ```Series``` within the ```DataFrame``` instance can then be indexed using the ```Series``` name:

In [124]:
df.loc[['a', 'b']]['x']

a    1.1
b    2.1
Name: x, dtype: float64

In [125]:
df.iloc[0:2]['x']

a    1.1
b    2.1
Name: x, dtype: float64

```at``` is used for a scalar selector and requires both the index and the ```Series``` name: 

In [126]:
df.at['a', 'y']

1.2

The related integer at ```iat``` is also a scalar selector and requires both the index and column to be specified as integers:

In [127]:
df.iat[0, 1]

1.2

Conceptually, this is like casting the ```DataFrame``` to an ```NDArray``` (2D) and indexing a value from it:

In [128]:
df.to_numpy()

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2]])

In [129]:
df.to_numpy()[0, 1]

1.2
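
A short sketch of how the label-based and integer-based scalar selectors relate. It uses the ```Index``` method ```get_loc``` to translate each label into its integer position; the names ```df_lab```, ```row_pos``` and ```col_pos``` are illustrative:

```python
import numpy as np
import pandas as pd

df_lab = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                      data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                            'y': np.array([1.2, 2.2, 3.2, 4.2])})

# Translate the labels used by at into the positions used by iat
row_pos = df_lab.index.get_loc('a')
col_pos = df_lab.columns.get_loc('y')

print(row_pos, col_pos)
print(df_lab.at['a', 'y'], df_lab.iat[row_pos, col_pos])
```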

To recap, for a ```DataFrame``` instance:

* ```__getitem__``` selects a ```Series``` by default
* ```loc``` and ```iloc``` change the behaviour to select an observation, ```loc``` by its ```Index``` label and ```iloc``` by its integer position
* ```at``` and ```iat``` select a scalar element


```loc``` can also be used to add a new observation to the ```DataFrame``` instance:

In [130]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2


In [131]:
df.loc['f'] = {'x': 6.1, 'y': 6.2}

In [132]:
df.loc['e'] = {'x': 5.1, 'y': 5.2}

The ordering of rows (also known as observations) follows the insertion order: 

In [133]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
f,6.1,6.2
e,5.1,5.2


The ```DataFrame``` method ```sort_index``` can be used to reorder the index: 

In [134]:
df.sort_index(inplace=True)

In [135]:
df # modified inplace

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2


The ```Index``` instance can also be reset to a numeric index using the ```DataFrame``` method ```reset_index```:

In [136]:
df.reset_index(drop=True, inplace=True)

In [137]:
df # modified inplace

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2


The length of the ```DataFrame``` gives the number of rows (observations):

In [138]:
len(df)

6

Python uses zero-based indexing, so the ```Index``` starts at ```0``` (inclusive) and stops at ```len(df)``` (exclusive).

```iloc``` cannot be used to index into an index value that doesn't exist and cannot be used to add a new observation. However ```loc``` can be used to add a numeric index using the ```len``` of the ```DataFrame``` instance:

In [139]:
df.loc[len(df)] = {'x': 7.1, 'y': 7.2}

In [140]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2
6,7.1,7.2
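
The difference between the two properties when a position or label does not yet exist can be sketched as follows. The name ```df_num``` and the flag variables are illustrative; the example assumes that out-of-bounds ```iloc``` access raises an ```IndexError```:

```python
import numpy as np
import pandas as pd

df_num = pd.DataFrame(data={'x': np.array([1.1, 2.1]),
                            'y': np.array([1.2, 2.2])})

# Reading a position that does not exist fails
try:
    df_num.iloc[len(df_num)]
    read_failed = False
except IndexError:
    read_failed = True

# Assigning to a position that does not exist also fails
try:
    df_num.iloc[len(df_num)] = [3.1, 3.2]
    write_failed = False
except IndexError:
    write_failed = True

# loc treats len(df_num) as a new label and enlarges the DataFrame
df_num.loc[len(df_num)] = {'x': 3.1, 'y': 3.2}

print(read_failed, write_failed)
print(list(df_num.index))
```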


## DataFrame Properties

Supposing the following ```DataFrame``` is instantiated to ```df```:

In [141]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

In [142]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


The ```DataFrame``` instance has the following dimension related properties. The attribute ```empty``` returns a boolean that is ```True``` only with an empty DataFrame:

In [143]:
df.empty

False

In [144]:
pd.DataFrame(None).empty

True

A ```DataFrame``` instance has a length, which is returned by the ```builtins``` function ```len```. This was seen previously to correspond to the number of rows (number of observations):

In [145]:
len(df)

7

A ```DataFrame``` instance has the attribute ```shape``` which is a ```tuple``` of dimensions. The 1st dimension is the number of rows (observations in the index) and the 2nd value is the number of ```Series``` (columns):

In [146]:
df.shape

(7, 2)

A ```DataFrame``` instance has the attribute ```ndim``` which gives the number of dimensions and is always ```2```:

In [147]:
df.ndim

2

Recall this is equivalent to the length of the ```shape``` ```tuple```:

In [148]:
len(df.shape)

2

The ```DataFrame``` instance has a ```size``` attribute which is the product of the elements in the ```shape``` ```tuple```:

In [149]:
df.size

14
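
The relationships between these dimension-related attributes can be checked directly. This is a minimal sketch using the illustrative name ```df_dim``` and the standard library function ```math.prod```:

```python
import math

import numpy as np
import pandas as pd

df_dim = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                      data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                            'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

n_rows, n_columns = df_dim.shape

print(len(df_dim) == n_rows)                   # length is the number of rows
print(df_dim.ndim == len(df_dim.shape))        # ndim is the length of shape
print(df_dim.size == math.prod(df_dim.shape))  # size is rows * columns
```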

The ```index``` attribute returns the ```Index``` instance associated with the ```DataFrame```. An ```Index``` instance has a single dimension that can either be depicted as a row or a column. The output below displays this as a row although the index itself is conventionally depicted as a column when incorporated as part of a ```DataFrame```:

In [150]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

When no ```index``` is supplied during ```DataFrame``` instantiation a ```RangeIndex``` is automatically generated:

In [151]:
df2 = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                         'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [152]:
df2.index

RangeIndex(start=0, stop=4, step=1)

The ```columns``` attribute also returns an ```Index``` instance corresponding to the names of each ```Series``` in the ```DataFrame```:

In [153]:
df.columns

Index(['x', 'y'], dtype='object')

The attribute ```axes``` returns a 2 element list, where the first element is the ```index``` attribute and the second element is the ```columns``` attribute:

In [154]:
df.axes

[Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object'),
 Index(['x', 'y'], dtype='object')]

The attribute ```values``` returns the values in the ```DataFrame``` in the form of a ```NDArray``` (2D):

In [155]:
df.values

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2],
       [5.1, 5.2],
       [6.1, 6.2],
       [7.1, 7.2]])

The attribute ```dtypes``` returns a ```Series``` that holds the datatype of each ```Series``` in the ```DataFrame```:

In [156]:
df.dtypes

x    float64
y    float64
dtype: object

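The ```Series``` instances ```x``` and ```y``` are each of the datatype ```float64```. The final line ```dtype: object``` describes the ```Series``` returned by ```dtypes```, which stores datatype objects as its values. It is not a datatype of the ```DataFrame``` instance ```df```, which has one datatype per ```Series``` rather than a single datatype of its own.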
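
As a quick check of what ```dtypes``` returns, the sketch below uses an illustrative ```DataFrame``` named ```df_types``` with two differently typed ```Series```. The integer datatype is set explicitly so the result does not depend on the platform:

```python
import numpy as np
import pandas as pd

df_types = pd.DataFrame(data={'x': np.array([1.1, 2.1]),
                              'n': np.array([1, 2], dtype=np.int64)})

column_dtypes = df_types.dtypes

print(type(column_dtypes))   # a Series, indexed by the column names
print(column_dtypes['x'])    # datatype of the Series 'x'
print(column_dtypes['n'])    # datatype of the Series 'n'
print(column_dtypes.dtype)   # datatype of the returned Series itself
```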

Each existing ```Series``` is accessible as an attribute:

In [157]:
df.x

a    1.1
b    2.1
c    3.1
d    4.1
e    5.1
f    6.1
g    7.1
Name: x, dtype: float64

In [158]:
df.y

a    1.2
b    2.2
c    3.2
d    4.2
e    5.2
f    6.2
g    7.2
Name: y, dtype: float64

The formal representation of the ```DataFrame``` instance ```df``` can be examined in a cell:

In [159]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


The attribute ```style``` can instead be used to display a ```DataFrame``` instance using a specific style. The default style is shown:

In [160]:
df.style

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


This ```style``` attribute has a number of methods which each return a modified ```style``` and can therefore be stacked to apply custom formatting:

In [161]:
print_identifier_group(df.style, kind='function')

['apply', 'apply_index', 'applymap', 'applymap_index', 'background_gradient', 'bar', 'clear', 'concat', 'export', 'format', 'format_index', 'from_custom_template', 'hide', 'highlight_between', 'highlight_max', 'highlight_min', 'highlight_null', 'highlight_quantile', 'map', 'map_index', 'pipe', 'relabel_index', 'set_caption', 'set_properties', 'set_sticky', 'set_table_attributes', 'set_table_styles', 'set_td_classes', 'set_tooltips', 'set_uuid', 'text_gradient', 'to_excel', 'to_html', 'to_latex', 'to_string', 'use']


In [162]:
df.style.format(precision=3).set_caption('DataFrame Instance')

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


The attribute ```attrs``` is an empty dictionary by default and is designed to store metadata associated with the ```DataFrame``` instance:

In [163]:
df.attrs

{}

This metadata can include a text description giving information about how the data was collected, or a link to where the data was sourced from:

In [164]:
df.attrs = {'description': 'this DataFrame was instantiated from a dict',
            'documentation': r'https://pandas.pydata.org/docs/getting_started/index.html'}

The ```DataFrame``` method ```info``` gives information about the ```DataFrame```:

In [165]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       7 non-null      float64
 1   y       7 non-null      float64
dtypes: float64(2)
memory usage: 468.0+ bytes


The ```DataFrame``` method ```describe``` gives supplementary descriptive statistics on each numeric ```Series```:

In [166]:
df.describe()

Unnamed: 0,x,y
count,7.0,7.0
mean,4.1,4.2
std,2.160247,2.160247
min,1.1,1.2
25%,2.6,2.7
50%,4.1,4.2
75%,5.6,5.7
max,7.1,7.2
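
The rows of this summary can be cross-checked against the individual ```Series``` methods. The sketch below uses the illustrative names ```df_stats``` and ```summary```, and compares floating point results with ```math.isclose```:

```python
import math

import numpy as np
import pandas as pd

df_stats = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                        data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                              'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

summary = df_stats.describe()

# The summary is itself a DataFrame, so loc selects a statistic and a Series name
print(summary.loc['mean', 'x'], df_stats['x'].mean())
print(summary.loc['50%', 'x'], df_stats['x'].median())

# 'std' is the sample standard deviation (one delta degree of freedom)
sample_std = np.std(df_stats['x'].to_numpy(), ddof=1)
print(summary.loc['std', 'x'], sample_std)
```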


The ```DataFrame``` methods ```head``` and ```tail``` give the first 5 and last 5 observations by default and are usually used to preview a large ```DataFrame``` instance:

In [167]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


In [168]:
df.head()

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2


In [169]:
df.tail()

Unnamed: 0,x,y
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


The number of observations ```n``` defaults to ```5``` and can be changed:

In [170]:
df.head(n=3)

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2


The ```DataFrame``` method ```nunique``` gives the number of unique values in each ```Series```:

In [171]:
df.nunique()

x    7
y    7
dtype: int64
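
Every value above is unique, so each count equals the number of rows. A small sketch with repeated values, using the illustrative name ```df_dup```, makes the behaviour easier to see:

```python
import numpy as np
import pandas as pd

df_dup = pd.DataFrame(data={'x': np.array([1.1, 2.1, 2.1, 4.1]),
                            'y': np.array([1.2, 1.2, 1.2, 1.2])})

counts = df_dup.nunique()

print(counts['x'])  # 1.1, 2.1 and 4.1
print(counts['y'])  # only 1.2
```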

## Attribute Access - Dictionary Syntax vs Dot Syntax

If the following ```DataFrame``` is instantiated to ```df```:

In [172]:
df = pd.DataFrame(index = np.array(['a', 'b', 'c', 'd']),
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [173]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2


Each ```Series``` can be accessed from the ```DataFrame``` instance ```df``` by indexing into ```df``` using the corresponding ```Series``` ```name``` enclosed in square brackets. This style of ```Series``` access is analogous to retrieving a value from a ```dict``` by use of its ```key```:

In [174]:
df['x']

a    1.1
b    2.1
c    3.1
d    4.1
Name: x, dtype: float64

Since the following is ```True```:

In [175]:
'x'.isidentifier()

True

```x``` becomes an attribute and can also be accessed using:

In [176]:
df.x

a    1.1
b    2.1
c    3.1
d    4.1
Name: x, dtype: float64

In [177]:
df.x is df['x']

True

A ```Series``` ```name``` only becomes an attribute of a DataFrame **after** it is instantiated and if it is a valid identifier:

In [178]:
df['z1'] = np.array([1.3, 2.3, 2.3, 2.4])

In [179]:
df

Unnamed: 0,x,y,z1
a,1.1,1.2,1.3
b,2.1,2.2,2.3
c,3.1,3.2,2.3
d,4.1,4.2,2.4


In [180]:
df.z1

a    1.3
b    2.3
c    2.3
d    2.4
Name: z1, dtype: float64

A ```UserWarning``` displays if the dot syntax is used in an attempt to create a new ```Series```. The value is stored as an ordinary Python attribute and the columns of the ```df``` instance are left unchanged.
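
This behaviour can be sketched by recording the warning with the standard library module ```warnings```. The names ```df_attr```, ```z2``` and ```z3``` are illustrative:

```python
import warnings

import numpy as np
import pandas as pd

df_attr = pd.DataFrame(data={'x': np.array([1.1, 2.1])})

# Dot syntax with a new name: a warning is issued and no column is created
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df_attr.z2 = np.array([1.3, 2.3])

warned = any(issubclass(w.category, UserWarning) for w in caught)
print(warned)
print(list(df_attr.columns))

# Dictionary-style syntax with a new name creates the column
df_attr['z3'] = np.array([1.4, 2.4])
print(list(df_attr.columns))
```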

Notice what happens when an invalid identifier is used as the ```name``` for a new ```Series```:

In [181]:
'1'.isidentifier()

False

In [182]:
df['1'] = np.array([1.3, 2.3, 2.3, 2.4])

The new ```Series``` does not show as an attribute:

In [183]:
print_identifier_group(df, kind='attribute')

['at', 'attrs', 'axes', 'columns', 'dtypes', 'empty', 'flags', 'iat', 'index', 'ndim', 'shape', 'size', 'style', 'values', 'x', 'y', 'z1']


For this reason ```Series``` names should generally follow the naming conventions of Python identifiers.

Although the dot attribute access from the ```DataFrame``` instance ```df``` is unavailable for the ```Series``` instance ```'1'```, it can still be accessed by indexing into the ```DataFrame``` instance ```df``` using the ```Series``` ```name``` ```'1'```:

In [184]:
df['1']

a    1.3
b    2.3
c    2.3
d    2.4
Name: 1, dtype: float64

Accessing a ```Series``` via dictionary-style indexing is therefore more powerful and this syntax is generally preferred.
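
Two further cases where only the dictionary-style syntax reaches the ```Series``` are a name containing a space and a name that matches an existing ```DataFrame``` method. The sketch below uses the illustrative names ```df_names```, ```'count'``` and ```'unit price'```:

```python
import numpy as np
import pandas as pd

df_names = pd.DataFrame(data={'count': np.array([3, 5]),
                              'unit price': np.array([1.5, 2.5])})

# The dot syntax finds the DataFrame method count, not the Series 'count'
print(callable(df_names.count))

# The dictionary-style syntax always returns the Series
print(type(df_names['count']))
print(df_names['unit price'].sum())
```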

The major drawback of the dictionary-style indexing syntax is with code-completion. Notice no docstring displays when ```?``` is used:

In [185]:
df['x'].info?

Object `info` not found.


In [186]:
df['x'].info()

<class 'pandas.core.series.Series'>
Index: 4 entries, a to d
Series name: x
Non-Null Count  Dtype  
--------------  -----  
4 non-null      float64
dtypes: float64(1)
memory usage: 236.0+ bytes


However if the attribute is used, the docstring displays:

In [187]:
df.x.info?

Signature:
df.x.info(
    verbose: 'bool | None' = None,
    buf: 'IO[str] | None' = None,
    max_cols: 'int | None' = None,
    memory_usage: 'bool | str | None' = None,
    show_counts: 'bool' = True,
) -> 'None'
Docstring:
Print a concise summary of a Series.

This method prints information about a Series including
the index dtype, non-null values and memory usage.

.. versionadded:: 1.4.0

Parameters
----------
verbose : bool, optional
    Whether to print the full summar

In [188]:
df.x.info()

<class 'pandas.core.series.Series'>
Index: 4 entries, a to d
Series name: x
Non-Null Count  Dtype  
--------------  -----  
4 non-null      float64
dtypes: float64(1)
memory usage: 236.0+ bytes


The ```Series``` method ```info``` gives the same result in both cases.

When a label used in the ```Index``` is also a valid identifier:

In [189]:
'a'.isidentifier()

True

The label is available as an attribute of each ```Series```:

In [190]:
df.x.a

1.1

The default ```Index``` is a numeric ```RangeIndex``` of integer steps, and these integer labels are invalid identifiers:

In [191]:
'0'.isidentifier()

False

And therefore when the default ```RangeIndex``` is used:

In [192]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [193]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


Because these are invalid identifiers they do not show as attributes for any of the ```Series``` belonging to the ```DataFrame``` instance:

In [194]:
print_identifier_group(df['x'], kind='attribute')

['array', 'at', 'attrs', 'axes', 'dtype', 'dtypes', 'empty', 'flags', 'hasnans', 'iat', 'index', 'is_monotonic_decreasing', 'is_monotonic_increasing', 'is_unique', 'name', 'nbytes', 'ndim', 'shape', 'size', 'values']


However a value in a row can still be selected by enclosing the numeric index in square brackets:

In [195]:
df['x'][1]

2.1

## Combining DataFrames

```DataFrame``` methods are generally set up to work with ```Series```. For example if the following ```DataFrame``` instance is examined:

In [196]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [197]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


A ```Series``` is typically appended to the end of the ```DataFrame``` by use of:

In [198]:
df['z'] = np.array([1.3, 2.3, 3.3, 4.3])

In [199]:
df

Unnamed: 0,x,y,z
0,1.1,1.2,1.3
1,2.1,2.2,2.3
2,3.1,3.2,3.3
3,4.1,4.2,4.3


Alternatively a ```Series``` can be inserted at a specified column position using the method ```insert```, which modifies the ```DataFrame``` instance in place:

In [200]:
df.insert?

Signature:
df.insert(
    loc: 'int',
    column: 'Hashable',
    value: 'Scalar | AnyArrayLike',
    allow_duplicates: 'bool | lib.NoDefault' = <no_default>,
) -> 'None'
Docstring:
Insert column into DataFrame at specified location.

Raises a ValueError if `column` is already contained in the DataFrame,
unless `allow_duplicates` is set to True.

Parameters
----------
loc : int
    Insertion index. Must verify 0 <= loc <= len(columns).
column : str, number, or hashable object
    Label of the inserted column.
value : Scalar, Series, or array-like
allow_duplicates : bool, optional, default lib.no_default

See Also


In [201]:
df.insert(loc=0, column='w', value=np.array([1.0, 2.0, 3.0, 4.0]))

In [202]:
df

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
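
The docstring above states that a ```ValueError``` is raised when the column already exists. A minimal sketch of inserting at a middle position and of the duplicate check, using the illustrative names ```df_ins``` and ```'m'```:

```python
import numpy as np
import pandas as pd

df_ins = pd.DataFrame(data={'x': np.array([1.1, 2.1]),
                            'y': np.array([1.2, 2.2])})

# loc=1 places the new Series between 'x' and 'y'; df_ins is modified in place
df_ins.insert(loc=1, column='m', value=np.array([1.15, 2.15]))
print(list(df_ins.columns))

# Inserting the same name again is rejected by default
try:
    df_ins.insert(loc=0, column='m', value=np.array([0.0, 0.0]))
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```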


Recall to append an observation to a ```DataFrame```, ```loc``` is typically used and assigned to a mapping where the keys are the column names and the values are the associated values at that observation:

In [203]:
df.loc[len(df)] = {'w': 5.0, 'x': 5.1, 'y': 5.2, 'z': 5.3}

In [204]:
df

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3


When multiple observations are to be appended to a ```DataFrame``` they are normally in the form of a ```DataFrame```:

In [205]:
df2 = pd.DataFrame(index=np.array([5, 6]),
                   data={'w': np.array([6.0, 7.0]),
                         'x': np.array([6.1, 7.1]),
                         'y': np.array([6.2, 7.2]),
                         'z': np.array([6.3, 7.3])})

In [206]:
df

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3


In [207]:
df2

Unnamed: 0,w,x,y,z
5,6.0,6.1,6.2,6.3
6,7.0,7.1,7.2,7.3


The function ```pd.concat``` can be used to concatenate these two ```DataFrame``` instances:

In [208]:
pd.concat?

Signature:
pd.concat(
    objs: 'Iterable[Series | DataFrame] | Mapping[HashableT, Series | DataFrame]',
    *,
    axis: 'Axis' = 0,
    join: 'str' = 'outer',
    ignore_index: 'bool' = False,
    keys=None,
    levels=None,
    names: 'list[HashableT] | None' = None,
    verify_integrity: 'bool' = False,
    sort: 'bool' = False

For example, ```df``` and ```df2``` can be concatenated along ```axis``` ```0``` which recall is ```'rows'``` (the index):

In [209]:
pd.concat(objs=[df, df2], axis='rows') #'index'

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3
5,6.0,6.1,6.2,6.3
6,7.0,7.1,7.2,7.3


If the index of each of these ```DataFrame``` instances is reset to the default index:

In [210]:
df.reset_index(drop=True, inplace=True)
df

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3


In [211]:
df2.reset_index(drop=True, inplace=True)
df2

Unnamed: 0,w,x,y,z
0,6.0,6.1,6.2,6.3
1,7.0,7.1,7.2,7.3


Notice the index of the concatenated ```DataFrame``` now has duplicate entries:

In [212]:
pd.concat(objs=[df, df2])

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3
0,6.0,6.1,6.2,6.3
1,7.0,7.1,7.2,7.3


In such a scenario it is common to assign ```ignore_index``` to ```True``` which will recreate a numeric ```RangeIndex```:

In [213]:
pd.concat([df, df2], ignore_index=True)

Unnamed: 0,w,x,y,z
0,1.0,1.1,1.2,1.3
1,2.0,2.1,2.2,2.3
2,3.0,3.1,3.2,3.3
3,4.0,4.1,4.2,4.3
4,5.0,5.1,5.2,5.3
5,6.0,6.1,6.2,6.3
6,7.0,7.1,7.2,7.3
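
The signature of ```pd.concat``` also lists the parameter ```verify_integrity```. As a hedged sketch, assigning it to ```True``` can be used to detect overlapping index values instead of silently keeping them. The names ```top``` and ```bottom``` are illustrative:

```python
import numpy as np
import pandas as pd

top = pd.DataFrame(data={'x': np.array([1.1, 2.1]),
                         'y': np.array([1.2, 2.2])})
bottom = pd.DataFrame(data={'x': np.array([3.1, 4.1]),
                            'y': np.array([3.2, 4.2])})

# Both instances use the index values 0 and 1, which is reported as an error
try:
    pd.concat([top, bottom], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)

# ignore_index=True discards both indexes and builds a fresh numeric index
combined = pd.concat([top, bottom], ignore_index=True)
print(list(combined.index))
```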


When a ```DataFrame``` instance has a ```Series``` not in common with the second ```DataFrame``` instance being concatenated:

In [214]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [215]:
df2 = pd.DataFrame(data = {'w': np.array([6.0, 7.0]),
                           'z': np.array([6.3, 7.3])})

In [216]:
df

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2


In [217]:
df2

Unnamed: 0,w,z
0,6.0,6.3
1,7.0,7.3


The ```DataFrame``` instances can be ```'outer'``` joined (the default). This will lead to ```NaN``` values where no data was supplied:

In [218]:
pd.concat(objs=[df, df2], axis='columns', join='outer') #'columns'

Unnamed: 0,x,y,w,z
0,1.1,1.2,6.0,6.3
1,2.1,2.2,7.0,7.3
2,3.1,3.2,,
3,4.1,4.2,,


Alternatively the two ```DataFrame``` instances can be ```'inner'``` joined, which will drop the observations that are missing data:

In [219]:
pd.concat([df, df2], axis='columns', join='inner') #'columns'

Unnamed: 0,x,y,w,z
0,1.1,1.2,6.0,6.3
1,2.1,2.2,7.0,7.3
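
The effect of the two join types can be summarised by comparing the shapes and counting the not available values. This is a minimal sketch using the illustrative names ```left``` and ```right``` for the same data:

```python
import numpy as np
import pandas as pd

left = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                          'y': np.array([1.2, 2.2, 3.2, 4.2])})
right = pd.DataFrame(data={'w': np.array([6.0, 7.0]),
                           'z': np.array([6.3, 7.3])})

# 'outer' keeps the union of the index values and pads with NaN
outer = pd.concat([left, right], axis='columns', join='outer')

# 'inner' keeps only the index values present in both instances
inner = pd.concat([left, right], axis='columns', join='inner')

print(outer.shape, int(outer.isna().sum().sum()))
print(inner.shape, int(inner.isna().sum().sum()))
```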


The ```DataFrame``` method ```align``` can be used to align the data of a ```DataFrame``` with another ```DataFrame``` instance for the purpose of comparison:

In [220]:
df3 = pd.concat([df, df2], axis=1, join='inner') #'columns'

In [221]:
df.align(other=df3)

(    w    x    y   z
 0 NaN  1.1  1.2 NaN
 1 NaN  2.1  2.2 NaN
 2 NaN  3.1  3.2 NaN
 3 NaN  4.1  4.2 NaN,
      w    x    y    z
 0  6.0  1.1  1.2  6.3
 1  7.0  2.1  2.2  7.3
 2  NaN  NaN  NaN  NaN
 3  NaN  NaN  NaN  NaN)

## Not Available Values

The following ```DataFrame``` can be instantiated to ```df``` with multiple ```None``` values:

In [222]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, None, 3.1, None, 5.1, None, 7.1]),
                        'y': np.array([1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2])})

The ```DataFrame``` instance ```df``` information can be examined. Notice that there are 7 rows; 4 rows have available (non-null) values in ```Series``` instance ```x``` and ```6``` rows have available (non-null) values in ```Series``` instance ```y```:

In [223]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       4 non-null      object
 1   y       6 non-null      object
dtypes: object(2)
memory usage: 168.0+ bytes


The ```DataFrame``` instance ```df``` descriptive statistics can be viewed:

In [224]:
df.describe()

Unnamed: 0,x,y
count,4.0,6.0
unique,4.0,6.0
top,1.1,1.2
freq,1.0,1.0


Notice in the information that the datatype is now ```object``` instead of ```float64```. The ```DataFrame``` instance uses the datatype ```object``` for each ```Series``` because each of these ```Series``` contain the value ```None``` which is a generic Python ```object```:

In [225]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


The datatype of each ```Series``` in the ```DataFrame``` can be changed to a ```float``` using the method ```astype```:

In [226]:
df.astype(float)

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


Notice the values that were previously ```None``` are cast into ```NaN``` (Not a Number). ```NaN``` conceptually is similar to ```None``` however it has a datatype of ```float``` and is therefore designated as being numeric. ```Series``` instances that only have numeric data inclusive of ```NaN``` therefore have the datatype ```float```. Both ```None``` and ```NaN``` are classified as ```not available``` values and are known collectively as ```null``` values.
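
The relationship between ```None``` and ```NaN``` can be sketched with a few checks. The names ```object_array``` and ```float_series``` are illustrative. The example also shows that a ```NDArray``` created from values including ```None``` has the datatype ```object```, whereas a ```Series``` created from a ```list``` of the same values is cast to ```float64```:

```python
import numpy as np
import pandas as pd

# Both None and NaN are recognised as not available values
print(pd.isna(None), pd.isna(np.nan))

# NaN is a float and, unlike most values, is not equal to itself
print(isinstance(np.nan, float), np.nan == np.nan)

# numpy keeps None as a generic Python object
object_array = np.array([1.1, None, 3.1])
print(object_array.dtype)

# pandas converts None to NaN when given a list of floats
float_series = pd.Series([1.1, None, 3.1])
print(float_series.dtype, int(float_series.isna().sum()))
```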

The changes can be seen when the ```DataFrame``` method ```info``` is used on the returned ```DataFrame``` instance:

In [227]:
df.astype(float).info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       4 non-null      float64
 1   y       6 non-null      float64
dtypes: float64(2)
memory usage: 168.0+ bytes


Notice the changes when the ```DataFrame``` method ```describe``` is used on the returned ```DataFrame``` instance. Because each ```Series``` is numeric, additional statistics can be calculated:

In [228]:
df.astype(float).describe()

Unnamed: 0,x,y
count,4.0,6.0
mean,4.1,4.533333
std,2.581989,2.160247
min,1.1,1.2
25%,2.6,3.45
50%,4.1,4.7
75%,5.6,5.95
max,7.1,7.2


The ```DataFrame``` method drop not available ```dropna``` can be used to drop not available values (```None``` or ```NaN``` values) outputting a new ```DataFrame``` instance. Notice the number of rows is now reduced to 4: 

In [229]:
df.dropna()

Unnamed: 0,x,y
a,1.1,1.2
c,3.1,3.2
e,5.1,5.2
g,7.1,7.2
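
By default a row is dropped when any of its values is not available. As a sketch of two documented alternatives, ```how='all'``` only drops rows in which every value is not available and ```subset``` restricts the check to the listed ```Series```. The name ```df_na``` is illustrative and the data is supplied as lists so that each ```Series``` is ```float64```:

```python
import pandas as pd

df_na = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                     data={'x': [1.1, None, 3.1, None, 5.1, None, 7.1],
                           'y': [1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2]})

any_missing = df_na.dropna()              # drops 'b', 'd' and 'f'
all_missing = df_na.dropna(how='all')     # drops only 'b'
y_missing = df_na.dropna(subset=['y'])    # checks only the Series 'y'

print(list(any_missing.index))
print(list(all_missing.index))
print(list(y_missing.index))
```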


If the ```DataFrame``` method ```info``` is used on this new ```DataFrame``` instance, notice the datatype of each ```Series``` is still ```object``` and not ```float64```:

In [230]:
df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       4 non-null      object
 1   y       4 non-null      object
dtypes: object(2)
memory usage: 96.0+ bytes


The ```DataFrame``` method ```astype``` can once again be used to change the datatype of each ```Series``` in the returned ```DataFrame``` to ```float```, outputting a new ```DataFrame``` instance. If the ```DataFrame``` method ```info``` is examined for this new ```DataFrame``` instance, each ```Series``` now has a ```float64``` datatype:

In [231]:
df.dropna().astype(float).info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       4 non-null      float64
 1   y       4 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes


Note ```DataFrame``` methods are often chained for convenience:

* df # Original DataFrame instance 1
* df.dropna() # returns a DataFrame instance 2
* df.dropna().astype(float) # returns a DataFrame instance 3

A ```Series``` can also be selected from ```DataFrame``` instance 3:

In [232]:
df.dropna().astype(float)['x']

a    1.1
c    3.1
e    5.1
g    7.1
Name: x, dtype: float64

And a ```Series``` method ```astype``` can be used on this ```Series``` returning another ```Series```:

In [233]:
df.dropna().astype(float)['x'].astype(int)

a    1
c    3
e    5
g    7
Name: x, dtype: int32

The ```DataFrame``` method ```dropna``` was demonstrated on a ```DataFrame``` instance with a small number of rows. This method is however typically only employed on a ```DataFrame``` that contains a large number of rows, where enough data remains for further analysis and the result isn't influenced too much by the missing values. For a sparse dataset, it is common to attempt to fill in the missing values in some way. The ```DataFrame``` method ```fillna``` can be used to fill in not available values. These can be filled with a constant value:

In [234]:
df.fillna(0)

Unnamed: 0,x,y
a,1.1,1.2
b,0.0,0.0
c,3.1,3.2
d,0.0,4.2
e,5.1,5.2
f,0.0,6.2
g,7.1,7.2


In [235]:
df.fillna(np.inf)

Unnamed: 0,x,y
a,1.1,1.2
b,inf,inf
c,3.1,3.2
d,inf,4.2
e,5.1,5.2
f,inf,6.2
g,7.1,7.2


Alternatively the ```DataFrame``` method ```ffill``` can be used to forward fill missing data. When using the forward fill, the previous available value is used to replace the not available value:

In [236]:
df.ffill()

Unnamed: 0,x,y
a,1.1,1.2
b,1.1,1.2
c,3.1,3.2
d,3.1,4.2
e,5.1,5.2
f,5.1,6.2
g,7.1,7.2


This can be aligned with the original ```DataFrame``` instance for comparison:

In [237]:
df.ffill().align(df)

(     x    y
 a  1.1  1.2
 b  1.1  1.2
 c  3.1  3.2
 d  3.1  4.2
 e  5.1  5.2
 f  5.1  6.2
 g  7.1  7.2,
       x     y
 a   1.1   1.2
 b  None  None
 c   3.1   3.2
 d  None   4.2
 e   5.1   5.2
 f  None   6.2
 g   7.1   7.2)

The related ```DataFrame``` method ```bfill``` can be used to backward fill missing data. When using the backward fill, the subsequent available value is used to replace the not available value:

In [238]:
df.bfill().align(df)

(     x    y
 a  1.1  1.2
 b  3.1  3.2
 c  3.1  3.2
 d  5.1  4.2
 e  5.1  5.2
 f  7.1  6.2
 g  7.1  7.2,
       x     y
 a   1.1   1.2
 b  None  None
 c   3.1   3.2
 d  None   4.2
 e   5.1   5.2
 f  None   6.2
 g   7.1   7.2)
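
Two edge cases are worth noting: a forward fill cannot fill a leading missing value because there is no previous value, and the keyword argument ```limit``` caps how many consecutive missing values are filled. The ```Series``` below is a hypothetical example chosen to show both:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with a leading gap and a two-value gap
s = pd.Series([np.nan, 1.0, np.nan, np.nan, 4.0])

forward = s.ffill()          # the leading NaN has no previous value and stays NaN
backward = s.bfill()         # every gap has a subsequent value here
limited = s.ffill(limit=1)   # fills at most one consecutive NaN per gap
```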

The ```DataFrame``` method ```interpolate``` can use neighbouring datapoints to interpolate a missing value. It has a keyword input argument ```method``` which can be used to specify an interpolation method. Note that all the data in the ```DataFrame``` must be cast to numeric for the ```interpolate``` method to be used:

In [239]:
df.astype(float).interpolate(method='linear')

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


Sometimes it is preferable to use the consistent ```Series``` method ```interpolate``` on a single ```Series```:

In [240]:
df['x'].astype(float).interpolate(method='linear')

a    1.1
b    2.1
c    3.1
d    4.1
e    5.1
f    6.1
g    7.1
Name: x, dtype: float64

Many of the interpolation methods need a numeric index to work properly:

In [241]:
df['x'].astype(float).reset_index(drop=True).interpolate(method='linear')

0    1.1
1    2.1
2    3.1
3    4.1
4    5.1
5    6.1
6    7.1
Name: x, dtype: float64
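
The role of the index can be seen with an unevenly spaced numeric index. In this hypothetical sketch, ```method='linear'``` treats the values as equally spaced whereas ```method='index'``` uses the numeric index values:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with an unevenly spaced numeric index
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

by_position = s.interpolate(method='linear')   # ignores the index spacing
by_index = s.interpolate(method='index')       # weights by the index values
```

With ```'linear'``` the missing value is the midpoint ```5.0```; with ```'index'``` it is ```1.0``` because the label ```1``` lies one tenth of the way from ```0``` to ```10```.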

A new ```DataFrame``` instance can be created where each ```Series``` is the result of a different interpolation method applied to the original ```Series``` ```'x'```. The ```Series``` methods ```astype``` and ```reset_index``` will be used to cast the ```Series``` ```'x'``` to a ```float``` with a numeric index. The original index from the ```Series``` ```'x'``` can be assigned to the new ```DataFrame``` instance after the interpolation:

In [242]:
df2 = pd.DataFrame({'x': df['x'].astype(float).reset_index(drop=True),
                    'x_1': df['x'].astype(float).reset_index(drop=True).interpolate(method='linear'),
                    'x_2': df['x'].astype(float).reset_index(drop=True).interpolate(method='polynomial', order=2),
                    'x_3': df['x'].astype(float).reset_index(drop=True).interpolate(method='polynomial', order=3)})
df2.index = df['x'].index

In [243]:
df2

Unnamed: 0,x,x_1,x_2,x_3
a,1.1,1.1,1.1,1.1
b,,2.1,2.1,2.1
c,3.1,3.1,3.1,3.1
d,,4.1,4.1,4.1
e,5.1,5.1,5.1,5.1
f,,6.1,6.1,6.1
g,7.1,7.1,7.1,7.1


The ```DataFrame``` method ```isna``` returns a boolean ```DataFrame``` instance which is ```True``` for not available values and ```False``` otherwise:

In [244]:
df.isna()

Unnamed: 0,x,y
a,False,False
b,True,True
c,False,False
d,True,False
e,False,False
f,True,False
g,False,False


The opposite method ```notna``` returns a boolean ```DataFrame``` of inverse values:

In [245]:
df.notna().align(df.isna())

(       x      y
 a   True   True
 b  False  False
 c   True   True
 d  False   True
 e   True   True
 f  False   True
 g   True   True,
        x      y
 a  False  False
 b   True   True
 c  False  False
 d   True  False
 e  False  False
 f   True  False
 g  False  False)

These two methods have the aliases ```isnull``` and ```notnull``` respectively, which are used for consistency with the R programming language.

The boolean mask above can be used to index into the ```DataFrame``` instance:

In [246]:
bool_mask = df.notna()

Notice indexing using the boolean mask updates ```None``` to ```NaN```:

In [247]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


In [248]:
df[bool_mask]

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2
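
Indexing with a boolean ```DataFrame``` masks individual cells, whereas indexing with a boolean ```Series``` selects whole rows. A minimal sketch of the difference, using a ```df``` reconstructed from the outputs above (an assumption):

```python
import pandas as pd

# df reconstructed from the outputs shown in this section (an assumption)
df = pd.DataFrame({'x': [1.1, None, 3.1, None, 5.1, None, 7.1],
                   'y': [1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2]},
                  index=list('abcdefg'), dtype=object)

elementwise = df[df.notna()]          # boolean DataFrame: shape is unchanged
same_as_where = df.where(df.notna())  # equivalent spelling of the line above
rows_with_x = df[df['x'].notna()]     # boolean Series: keeps rows a, c, e and g
```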


## String Series and String Methods

Supposing the following list of words is instantiated:

In [249]:
words = 'the quick brown for jumped over the lazy dog'.split()

In [250]:
words

['the', 'quick', 'brown', 'for', 'jumped', 'over', 'the', 'lazy', 'dog']

Using ```len``` of words will return the number of words in the outer collection which is the ```list``` and not the length of each inner collection which is the word:

In [251]:
len(words)

9

To instead get a ```list``` of the length of each word, a ```list``` comprehension can be used:

In [252]:
[len(word) for word in words]

[3, 5, 5, 3, 6, 4, 3, 4, 3]

This can also be done using ```map```:

In [253]:
map(len, words)

<map at 0x218ba9306d0>

In [254]:
list(map(len, words))

[3, 5, 5, 3, 6, 4, 3, 4, 3]

If an analogous ```Series``` is instantiated to ```words```:

In [255]:
words = pd.Series(data='the quick brown for jumped over the lazy dog'.split())

In [256]:
words

0       the
1     quick
2     brown
3       for
4    jumped
5      over
6       the
7      lazy
8       dog
dtype: object

Using ```len``` on the ```Series``` will return the number of rows:

In [257]:
len(words)

9

The ```Series``` method ```map``` is similar to the ```builtins``` function ```map``` and can be used to individually ```map``` a ```function``` to the ```Series```:

In [258]:
words.map(len)

0    3
1    5
2    5
3    3
4    6
5    4
6    3
7    4
8    3
dtype: int64

Since every element in the ```Series``` is a ```str``` instance, a ```str``` method can be applied to each element using ```map``` and a ```lambda``` expression:

In [259]:
words.map(lambda word: word.upper())

0       THE
1     QUICK
2     BROWN
3       FOR
4    JUMPED
5      OVER
6       THE
7      LAZY
8       DOG
dtype: object

Recall that ```str``` instances are ordinal and the ```builtins``` universal ```max``` function can be mapped to get a ```Series``` that has the letter corresponding to the highest ordinal value in the word:

In [260]:
words.map(max)

0    t
1    u
2    w
3    r
4    u
5    v
6    t
7    z
8    o
dtype: object

If a ```DataFrame``` is instantiated and assigned to ```df```:

In [261]:
df = pd.DataFrame({'words': words,
                   'maximum': words.map(max)})

In [262]:
df

Unnamed: 0,words,maximum
0,the,t
1,quick,u
2,brown,w
3,for,r
4,jumped,u
5,over,v
6,the,t
7,lazy,z
8,dog,o


The ```DataFrame``` has a consistent method ```map``` which operates element by element:

In [263]:
df.map(max)

Unnamed: 0,words,maximum
0,t,t
1,u,u
2,w,w
3,r,r
4,u,u
5,v,v
6,t,t
7,z,z
8,o,o


The ```DataFrame``` also has the method ```apply``` which operates along an ```axis```:

In [264]:
df.apply(max, axis='rows')

words      the
maximum      z
dtype: object

In [265]:
df.apply(max, axis='columns')

0    the
1      u
2      w
3      r
4      u
5      v
6    the
7      z
8      o
dtype: object

Most universal functions are implemented as ```Series``` and ```DataFrame``` methods:

In [266]:
words.max(axis='rows')

'the'

In [267]:
words.min(axis='rows')

'brown'

A ```Series``` has the attribute ```str``` which can be used to quickly access ```str``` methods:

In [268]:
words.str.zfill(20)

0    00000000000000000the
1    000000000000000quick
2    000000000000000brown
3    00000000000000000for
4    00000000000000jumped
5    0000000000000000over
6    00000000000000000the
7    0000000000000000lazy
8    00000000000000000dog
dtype: object

The ```str``` datamodel method ```__len__``` is available under the ```str``` attribute as ```len```:

In [269]:
words.str.len()

0    3
1    5
2    5
3    3
4    6
5    4
6    3
7    4
8    3
dtype: int64
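
A few more ```str``` methods accessed this way are sketched below. The ```Series``` is recreated so the block is self-contained, and the variable names are illustrative:

```python
import pandas as pd

words = pd.Series('the quick brown for jumped over the lazy dog'.split())

upper = words.str.upper()              # same result as mapping str.upper
contains_o = words.str.contains('o')   # boolean Series, usable as a mask
first_letter = words.str[0]            # indexing is applied to each element
words_with_o = words[contains_o]       # boolean selection with the mask
```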

If the length of each string is examined:

In [270]:
lengths = words.map(len)

In [271]:
lengths

0    3
1    5
2    5
3    3
4    6
5    4
6    3
7    4
8    3
dtype: int64

A function can be created to cast a numeric length into a ```str``` for example ```3``` to ```'three'```:

In [272]:
def get_length_str(length):
    match length:
        case 3:
            return 'three'
        case 4:
            return 'four'
        case 5:
            return 'five'
        case 6:
            return 'six' 

This function can be applied to the ```lengths``` ```Series``` using ```map```:

In [273]:
lengths.map(get_length_str)

0    three
1     five
2     five
3    three
4      six
5     four
6    three
7     four
8    three
dtype: object
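
The ```Series``` method ```map``` also accepts a ```dict```, which is a compact alternative when the function is a simple lookup. In this sketch the ```dict``` name ```length_names``` is illustrative; keys that are missing from the mapping produce ```NaN```:

```python
import pandas as pd

lengths = pd.Series([3, 5, 5, 3, 6, 4, 3, 4, 3])

# Illustrative lookup table equivalent to get_length_str
length_names = {3: 'three', 4: 'four', 5: 'five', 6: 'six'}

named = lengths.map(length_names)
unmatched = pd.Series([3, 7]).map(length_names)   # 7 has no key, so NaN
```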

## Numeric Series

If a ```DataFrame``` with numeric ```Series``` ```x```, ```y``` and ```z``` is instantiated to ```df```:

In [274]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [-2, -4, 6, 8, 10],
                   'z': [12, 24, 48, -63, -999]})

In [275]:
df

Unnamed: 0,x,y,z
0,1,-2,12
1,2,-4,24
2,3,6,48
3,4,8,-63
4,5,10,-999


The ```DataFrame``` method ```apply``` can be used to apply the builtins universal function ```max```:

In [276]:
df.apply(max, axis='rows')

x     5
y    10
z    48
dtype: int64

In [277]:
df.apply(max, axis='columns') 

0    12
1    24
2    48
3     8
4    10
dtype: int64

Note however that most of the universal functions from ```builtins``` or ```numpy``` are implemented directly as ```DataFrame``` methods:

In [278]:
df.max(axis='rows')

x     5
y    10
z    48
dtype: int64

In [279]:
df.min(axis='rows')

x      1
y     -4
z   -999
dtype: int64

In [280]:
df.mean(axis='rows')

x      3.0
y      3.6
z   -195.6
dtype: float64

In [281]:
df.var(axis='rows')

x         2.5
y        38.8
z    203424.3
dtype: float64

In [282]:
df.std(axis='rows')

x      1.581139
y      6.228965
z    451.025831
dtype: float64

And the datamodel identifiers for a numeric ```Series``` are configured for numeric operations:

In [283]:
df['x'] + df['y']

0    -1
1    -2
2     9
3    12
4    15
dtype: int64

In [284]:
df['x'] + 5

0     6
1     7
2     8
3     9
4    10
Name: x, dtype: int64

The ```apply``` method can also be used with a ```tuple``` of universal functions, outputting a ```DataFrame``` instance as opposed to a ```Series```:

In [285]:
df.apply((len, max, min, np.mean, np.var, np.std))

Unnamed: 0,x,y,z
len,5.0,5.0,5.0
max,5.0,10.0,48.0
min,1.0,-4.0,-999.0
mean,3.0,3.6,-195.6
var,2.0,31.04,162739.44
std,1.414214,5.571355,403.409767
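
The ```var``` and ```std``` rows above differ from the earlier results of the ```DataFrame``` methods ```var``` and ```std``` (for example ```2.0``` against ```2.5``` for ```x```). A likely explanation, stated here as an assumption about how the functions were dispatched, is the default delta degrees of freedom: the ```pandas``` methods default to ```ddof=1``` (sample statistics) and the ```numpy``` functions to ```ddof=0``` (population statistics). A minimal sketch:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5], name='x')

sample_var = x.var()               # pandas default, ddof=1
population_var = x.var(ddof=0)     # matches the numpy default
numpy_var = np.var(x.to_numpy())   # numpy default, ddof=0
```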


## Categorical Series

Another common type of ```Series``` is a category ```Series```:

In [286]:
df = pd.DataFrame({'student_names': ['student' + str(num) for num in range(1, 9)],
                   'grades': ['b', 'F', 'A', 'C', 'a', 'C', 'B', 'A']})

When instantiated, the categories will normally be recognised as ```str``` instances:

In [287]:
df

Unnamed: 0,student_names,grades
0,student1,b
1,student2,F
2,student3,A
3,student4,C
4,student5,a
5,student6,C
6,student7,B
7,student8,A


And the datatypes will therefore be ```object``` for each ```Series```:

In [288]:
df.dtypes

student_names    object
grades           object
dtype: object

The datatype of a ```Series``` can be changed using the method ```astype```. To change to category use the input argument ```'category'```:

In [289]:
df['grades'].astype('category')

0    b
1    F
2    A
3    C
4    a
5    C
6    B
7    A
Name: grades, dtype: category
Categories (6, object): ['A', 'B', 'C', 'F', 'a', 'b']

The original ```Series``` can be reassigned to the new ```Series``` that is now categorical:

In [290]:
df['grades'] = df['grades'].astype('category')

If the ```DataFrame``` instance is examined, it looks the same:

In [291]:
df

Unnamed: 0,student_names,grades
0,student1,b
1,student2,F
2,student3,A
3,student4,C
4,student5,a
5,student6,C
6,student7,B
7,student8,A


However its datatype is updated:

In [292]:
df.dtypes

student_names      object
grades           category
dtype: object

A categorical ```Series``` also has the attribute ```cat``` which groups together attributes and methods commonly used for categorical ```Series```:

In [293]:
print('attributes', end=' ')
print_identifier_group(df['grades'].cat, kind='attribute')
print('methods', end=' ')
print_identifier_group(df['grades'].cat, kind='function')

attributes ['categories', 'codes', 'ordered']
methods ['add_categories', 'as_ordered', 'as_unordered', 'remove_categories', 'remove_unused_categories', 'rename_categories', 'reorder_categories', 'set_categories']


The ```Series.cat``` attribute ```categories``` can be used to get the names of the existing categories:

In [294]:
df['grades'].cat.categories

Index(['A', 'B', 'C', 'F', 'a', 'b'], dtype='object')

Notice that these categories have uppercase and lowercase variants. A ```list``` comprehension can be used with a ```str``` method to change the lowercase grades to uppercase:

In [295]:
old_grade_categories = df['grades'].cat.categories
new_grade_categories = [grade.upper() for grade in old_grade_categories]
new_grade_categories

['A', 'B', 'C', 'F', 'A', 'B']

A category mapping can then be created:

In [296]:
category_mapping = dict(zip(old_grade_categories, new_grade_categories))
category_mapping

{'A': 'A', 'B': 'B', 'C': 'C', 'F': 'F', 'a': 'A', 'b': 'B'}

This is the type of ```mapping``` that can be used with the ```Series.cat``` method ```rename_categories```. However, at present this method does not support merging of categories and raises a ```ValueError``` because some of the ```values``` are the same:

In [297]:
# df['grades'].cat.rename_categories(category_mapping)

It is therefore easier to manipulate the ```str``` datatype ```Series``` and then cast it to a ```category``` datatype ```Series```:

In [298]:
df['grades'] = df['grades'].str.lower().astype('category')
df['grades']

0    b
1    f
2    a
3    c
4    a
5    c
6    b
7    a
Name: grades, dtype: category
Categories (4, object): ['a', 'b', 'c', 'f']

When all the values in the mapping are unique, the ```Series.cat``` method ```rename_categories``` works as expected:

In [299]:
old_grade_categories = df['grades'].cat.categories
new_grade_categories = [grade.upper() for grade in old_grade_categories]
category_mapping = dict(zip(old_grade_categories, new_grade_categories))
category_mapping

{'a': 'A', 'b': 'B', 'c': 'C', 'f': 'F'}

In [300]:
df['grades'] = df['grades'].cat.rename_categories(category_mapping)
df['grades']

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grades, dtype: category
Categories (4, object): ['A', 'B', 'C', 'F']

Categories are often used for boolean selectors:

In [301]:
df[df['grades'] == 'A']

Unnamed: 0,student_names,grades
2,student3,A
4,student5,A
7,student8,A


In [302]:
df[df['grades'] == 'B']

Unnamed: 0,student_names,grades
0,student1,B
6,student7,B


In [303]:
df[(df['grades'] == 'A') | (df['grades'] == 'B')]

Unnamed: 0,student_names,grades
0,student1,B
2,student3,A
4,student5,A
6,student7,B
7,student8,A


Only the equal to ```==``` and not equal to ```!=``` operators are defined for unordered categoricals. A ```TypeError``` is raised if one of the other comparison operators is used.
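
A minimal sketch of this behaviour, using a small hypothetical ```Series``` of unordered grades and catching the error so the block runs to completion:

```python
import pandas as pd

# Hypothetical unordered categorical Series
grades = pd.Series(['B', 'F', 'A', 'C'], dtype='category')

is_a = grades == 'A'   # equality is defined for unordered categoricals

try:
    grades >= 'B'      # ordering comparisons are not
    raised_type_error = False
except TypeError:
    raised_type_error = True
```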

The ```Series.cat``` method ```as_ordered``` can be used to ordinally order categories:

In [304]:
df['grades'].cat.as_ordered()

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grades, dtype: category
Categories (4, object): ['A' < 'B' < 'C' < 'F']

In this case, the desired order is the reverse of the ordinal values because ```'A'``` corresponds to a higher grade than ```'F'```:

In [305]:
df['grades'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                    ordered=True)

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grades, dtype: category
Categories (4, object): ['F' < 'C' < 'B' < 'A']

The original ```Series``` can be reassigned:

In [306]:
df['grades'] = df['grades'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                                   ordered=True)

In [307]:
df[df['grades'] >= 'B']

Unnamed: 0,student_names,grades
0,student1,B
2,student3,A
4,student5,A
6,student7,B
7,student8,A
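
An ordered categorical can also be created in one step by casting to a ```pd.CategoricalDtype``` instance. This is an alternative to the ```reorder_categories``` approach above, sketched with a standalone ```Series``` (the name ```grade_dtype``` is illustrative):

```python
import pandas as pd

# Categories listed from lowest to highest grade
grade_dtype = pd.CategoricalDtype(categories=['F', 'C', 'B', 'A'], ordered=True)
grades = pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']).astype(grade_dtype)

codes = grades.cat.codes      # integer position of each value in the category order
best = grades.max()           # max and min follow the category order
worst = grades.min()
at_least_b = grades >= 'B'    # ordering comparisons are now defined
```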


When sorting data in a ```DataFrame```, ordinal ```Series``` are quite often used:

In [308]:
df.sort_values(['grades', 'student_names'])

Unnamed: 0,student_names,grades
1,student2,F
3,student4,C
5,student6,C
0,student1,B
6,student7,B
2,student3,A
4,student5,A
7,student8,A


If the ```DataFrame``` method ```count``` is used, it will return the number of non-null values in each ```Series``` and return this information as a new ```Series```:

In [309]:
df.count()

student_names    8
grades           8
dtype: int64

A ```GroupBy``` instance can be created:

In [310]:
gbo = df.groupby(df['grades'], observed=True)

This ```gbo``` instance is essentially a ```DataFrame``` with an additional groupby instruction that is applied when a ```DataFrame``` method such as ```count``` is used. Notice that a ```DataFrame``` instance is returned:

In [311]:
gbo.count()

Unnamed: 0_level_0,student_names
grades,Unnamed: 1_level_1
F,1
C,2
B,2
A,3


A ```GroupBy``` instance can be created from a ```Series```:

In [312]:
gbo = df['grades'].groupby(df['grades'], observed=True)

This ```gbo``` instance is essentially a ```Series``` with an additional groupby instruction that is applied when a ```Series``` method such as ```count``` is used. A ```Series``` instance is returned:

In [313]:
gbo.count()

grades
F    1
C    2
B    2
A    3
Name: grades, dtype: int64
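
The keyword argument ```observed``` only matters when a category has no rows. In this sketch a hypothetical unused category ```'D'``` is added to show the difference; ```size``` is used to count the rows in each group:

```python
import pandas as pd

# 'D' is a hypothetical category that no student has
grade_dtype = pd.CategoricalDtype(['F', 'D', 'C', 'B', 'A'], ordered=True)
df = pd.DataFrame({'student_names': ['student' + str(num) for num in range(1, 9)],
                   'grades': pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']).astype(grade_dtype)})

observed_only = df.groupby('grades', observed=True).size()     # skips 'D'
all_categories = df.groupby('grades', observed=False).size()   # 'D' appears with 0
```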

Some ```Series``` methods like ```describe``` provide information spanning over multiple ```Series``` and will therefore return a ```DataFrame``` instance: 

In [314]:
gbo.describe()

Unnamed: 0_level_0,count,unique,top,freq
grades,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,1,1,F,1
C,2,1,C,2
B,2,1,B,2
A,3,1,A,3


Notice the slight difference when the ```GroupBy``` instance is created from a ```DataFrame``` instead: the column name ```student_names``` is then used to group the statistical information (```count```, ```unique```, ```top``` and ```freq```) for that specific ```Series```.

In [315]:
df

Unnamed: 0,student_names,grades
0,student1,B
1,student2,F
2,student3,A
3,student4,C
4,student5,A
5,student6,C
6,student7,B
7,student8,A


The difference can be seen more clearly if a second category is added to the ```DataFrame```:

In [316]:
import random
random.seed(0)
df['sex'] = pd.Series([random.choice(['F', 'M']) for num in range(8)])
df['sex'] = df['sex'].astype('category')

In [317]:
df

Unnamed: 0,student_names,grades,sex
0,student1,B,M
1,student2,F,M
2,student3,A,F
3,student4,C,M
4,student5,A,M
5,student6,C,M
6,student7,B,M
7,student8,A,M


Now when the ```DataFrame``` instance ```df``` is grouped by the ```Series``` ```grades``` and the ```DataFrame``` method ```describe``` is used, descriptive statistics are shown for each ```Series``` and each of these statistics is grouped using a multilevel ```Index```:

In [318]:
df.groupby('grades', observed=True).describe()

Unnamed: 0_level_0,student_names,student_names,student_names,student_names,sex,sex,sex,sex
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq
grades,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
F,1,1,student2,1,1,1,M,1
C,2,2,student4,1,2,1,M,2
B,2,2,student1,1,2,1,M,2
A,3,3,student3,1,3,2,M,2


If this ```DataFrame``` is indexed into using the top level column name:

In [319]:
df.groupby('grades', observed=True).describe()['sex']

Unnamed: 0_level_0,count,unique,top,freq
grades,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,1,1,M,1
C,2,1,M,2
B,2,1,M,2
A,3,2,M,2


This returns a single-index ```DataFrame``` which can also be indexed into for example by using the column name ```count```:

In [320]:
df.groupby('grades', observed=True).describe()['sex']['count']

grades
F    1
C    2
B    2
A    3
Name: count, dtype: object

The ```DataFrame``` can be grouped by a list of multiple ```Series``` names. This gives a multi-level index for both the columns and the ```Index```:

In [321]:
df.groupby(['grades', 'sex'], observed=True).describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,student_names,student_names,student_names,student_names
Unnamed: 0_level_1,Unnamed: 1_level_1,count,unique,top,freq
grades,sex,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
F,M,1,1,student2,1
C,M,2,2,student4,1
B,M,2,2,student1,1
A,F,1,1,student3,1
A,M,2,2,student5,1


Returning to:

In [322]:
df

Unnamed: 0,student_names,grades,sex
0,student1,B,M
1,student2,F,M
2,student3,A,F
3,student4,C,M
4,student5,A,M
5,student6,C,M
6,student7,B,M
7,student8,A,M


A function can be made to generate a random ```score``` in response to a ```grade```:

In [323]:
def generate_score(grade):
    random.seed(0)
    match grade:
        case 'A':
            return random.randint(70, 101)
        case 'B':
            return random.randint(60, 70)  
        case 'C':
            return random.randint(50, 60)
        case 'F':
            return random.randint(0, 50)

This custom function can be applied to the ```grades``` ```Series``` to generate a ```Series``` of marks for each student. Because the seed is reset to ```0``` on every call, students with the same grade receive the same score:

In [324]:
df['scores'] = df['grades'].map(generate_score)

In [325]:
df

Unnamed: 0,student_names,grades,sex,scores
0,student1,B,M,66
1,student2,F,M,24
2,student3,A,F,94
3,student4,C,M,56
4,student5,A,M,94
5,student6,C,M,56
6,student7,B,M,66
7,student8,A,M,94
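
If different scores per student are wanted, one option is to seed a generator once and pass it into the function. The sketch below is a variation on ```generate_score``` and not the function used above; the bounds in ```SCORE_BOUNDS``` are hypothetical and chosen so that they do not overlap (```random.randint``` includes both endpoints):

```python
import random

# Hypothetical non-overlapping inclusive bounds, chosen for illustration
SCORE_BOUNDS = {'A': (71, 100), 'B': (61, 70), 'C': (51, 60), 'F': (0, 50)}

def generate_score(grade, rng):
    """Return a random score within the inclusive bounds for the grade."""
    low, high = SCORE_BOUNDS[grade]
    return rng.randint(low, high)

rng = random.Random(0)   # seeded once, so successive calls draw new values
grades = ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']
scores = [generate_score(grade, rng) for grade in grades]
```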


Categories can be created from ordinal values using the pandas function ```pd.cut```:

In [326]:
pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101])

0     (60, 70]
1      (0, 50]
2    (70, 101]
3     (50, 60]
4    (70, 101]
5     (50, 60]
6     (60, 70]
7    (70, 101]
Name: scores, dtype: category
Categories (4, interval[int64, right]): [(0, 50] < (50, 60] < (60, 70] < (70, 101]]

In the output above the ```(``` means exclusive of the lower boundary and the ```]``` means inclusive of the upper boundary. For convenience this will be inserted into the ```DataFrame``` at column index 3. Recall ```insert``` is a mutable method and occurs in place:

In [327]:
df.insert(3, 'score_cats', pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101]))

In [328]:
df

Unnamed: 0,student_names,grades,sex,score_cats,scores
0,student1,B,M,"(60, 70]",66
1,student2,F,M,"(0, 50]",24
2,student3,A,F,"(70, 101]",94
3,student4,C,M,"(50, 60]",56
4,student5,A,M,"(70, 101]",94
5,student6,C,M,"(50, 60]",56
6,student7,B,M,"(60, 70]",66
7,student8,A,M,"(70, 101]",94


```cut``` can be used with the keyword labels. Notice that there are ```5``` values for ```bins``` and ```4``` values for ```labels```, this is because each bin is between two values:

In [329]:
pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101], labels=['F', 'C', 'B', 'A'])

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: scores, dtype: category
Categories (4, object): ['F' < 'C' < 'B' < 'A']

Notice that these categories are also automatically ordinal. They can be inserted into the ```DataFrame``` at column index 4:

In [330]:
df.insert(4, 'calculated_grades', pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101], labels=['F', 'C', 'B', 'A']))

In [331]:
df

Unnamed: 0,student_names,grades,sex,score_cats,calculated_grades,scores
0,student1,B,M,"(60, 70]",B,66
1,student2,F,M,"(0, 50]",F,24
2,student3,A,F,"(70, 101]",A,94
3,student4,C,M,"(50, 60]",C,56
4,student5,A,M,"(70, 101]",A,94
5,student6,C,M,"(50, 60]",C,56
6,student7,B,M,"(60, 70]",B,66
7,student8,A,M,"(70, 101]",A,94
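
The boundary behaviour of ```pd.cut``` can be checked with scores that sit exactly on the bin edges. The scores below are hypothetical; the keyword arguments ```right``` and ```include_lowest``` control which edge of each bin is included:

```python
import pandas as pd

scores = pd.Series([0, 50, 60, 70, 100])   # hypothetical scores on the bin edges
bins = [0, 50, 60, 70, 101]
labels = ['F', 'C', 'B', 'A']

right_closed = pd.cut(scores, bins=bins, labels=labels)               # (low, high]
left_closed = pd.cut(scores, bins=bins, labels=labels, right=False)   # [low, high)
with_lowest = pd.cut(scores, bins=bins, labels=labels, include_lowest=True)
```

With the default settings a score of ```0``` falls outside the first bin and becomes ```NaN```, which ```include_lowest=True``` avoids.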


## DateTime

In ```pandas``` dates and time intervals are based upon the datatypes ```datetime64``` and ```timedelta64``` from the ```numpy``` library.

The ```datetime64``` class is normally initialised using a timestamp string of the following format:

```python
np.datetime64('YYYY-MM-DD')
np.datetime64('YYYY-MM-DDThh:mm:ss.μμμμμμ')
```

For example:

In [332]:
np.datetime64('2023-07-25')

numpy.datetime64('2023-07-25')

In [333]:
np.datetime64('2023-07-25T14:30:15.123456')

numpy.datetime64('2023-07-25T14:30:15.123456')

The ```timedelta64``` class is normally initialised using a quantity ```X``` followed by a unit ```'U'```:

```python
np.timedelta64(X, 'U')
```

These are usually combined using addition:

In [334]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')

numpy.timedelta64(90001001,'ms')
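
A ```timedelta64``` can be added to a ```datetime64```, and the difference between two ```datetime64``` instances is a ```timedelta64```. A minimal sketch with illustrative values:

```python
import numpy as np

start = np.datetime64('2023-07-25T00:00')
delta = np.timedelta64(1, 'D') + np.timedelta64(12, 'h')   # 36 hours

end = start + delta                         # datetime64 + timedelta64 -> datetime64
difference = end - start                    # datetime64 - datetime64 -> timedelta64
hours = difference / np.timedelta64(1, 'h') # dividing by a unit gives a float
```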

These can be used to make an ```Index``` or ```Series``` respectively, using the ```np.arange``` function:

In [335]:
start_time = np.datetime64('2023-07-25')
end_time = np.datetime64('2023-07-26')
time_interval = np.timedelta64(1, 'h')

In [336]:
times = np.arange(start=start_time, #inclusive
                  stop=end_time, #exclusive
                  step=time_interval)

In [337]:
times

array(['2023-07-25T00', '2023-07-25T01', '2023-07-25T02', '2023-07-25T03',
       '2023-07-25T04', '2023-07-25T05', '2023-07-25T06', '2023-07-25T07',
       '2023-07-25T08', '2023-07-25T09', '2023-07-25T10', '2023-07-25T11',
       '2023-07-25T12', '2023-07-25T13', '2023-07-25T14', '2023-07-25T15',
       '2023-07-25T16', '2023-07-25T17', '2023-07-25T18', '2023-07-25T19',
       '2023-07-25T20', '2023-07-25T21', '2023-07-25T22', '2023-07-25T23'],
      dtype='datetime64[h]')

These times can be cast into an ```Index``` or ```Series```:

In [338]:
pd.Index(times)

DatetimeIndex(['2023-07-25 00:00:00', '2023-07-25 01:00:00',
               '2023-07-25 02:00:00', '2023-07-25 03:00:00',
               '2023-07-25 04:00:00', '2023-07-25 05:00:00',
               '2023-07-25 06:00:00', '2023-07-25 07:00:00',
               '2023-07-25 08:00:00', '2023-07-25 09:00:00',
               '2023-07-25 10:00:00', '2023-07-25 11:00:00',
               '2023-07-25 12:00:00', '2023-07-25 13:00:00',
               '2023-07-25 14:00:00', '2023-07-25 15:00:00',
               '2023-07-25 16:00:00', '2023-07-25 17:00:00',
               '2023-07-25 18:00:00', '2023-07-25 19:00:00',
               '2023-07-25 20:00:00', '2023-07-25 21:00:00',
               '2023-07-25 22:00:00', '2023-07-25 23:00:00'],
              dtype='datetime64[s]', freq=None)

In [339]:
pd.Series(data=times, name='times')

0    2023-07-25 00:00:00
1    2023-07-25 01:00:00
2    2023-07-25 02:00:00
3    2023-07-25 03:00:00
4    2023-07-25 04:00:00
5    2023-07-25 05:00:00
6    2023-07-25 06:00:00
7    2023-07-25 07:00:00
8    2023-07-25 08:00:00
9    2023-07-25 09:00:00
10   2023-07-25 10:00:00
11   2023-07-25 11:00:00
12   2023-07-25 12:00:00
13   2023-07-25 13:00:00
14   2023-07-25 14:00:00
15   2023-07-25 15:00:00
16   2023-07-25 16:00:00
17   2023-07-25 17:00:00
18   2023-07-25 18:00:00
19   2023-07-25 19:00:00
20   2023-07-25 20:00:00
21   2023-07-25 21:00:00
22   2023-07-25 22:00:00
23   2023-07-25 23:00:00
Name: times, dtype: datetime64[s]
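
The ```pandas``` function ```pd.date_range``` offers an alternative way to build a comparable hourly ```DatetimeIndex``` without going through ```np.arange```. In this sketch the alias ```'h'``` for an hourly frequency is assumed to be accepted by the installed ```pandas``` version:

```python
import pandas as pd

# 24 hourly timestamps starting at midnight
index = pd.date_range(start='2023-07-25', periods=24, freq='h')

first = index[0]
last = index[-1]
as_series = pd.Series(index, name='times')
```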

The ```Index``` of the datatype ```datetime64``` can be used as a time index alongside measurement ```Series```, for example emulated temperature, pH and humidity data:

In [340]:
np.random.seed(0)

In [341]:
df = pd.DataFrame(index=pd.Index(times),
                  data={'temperature': 25 + np.random.randn(24),
                        'ph': 7 + np.random.randn(24) / 10,
                        'humidity': 100 - np.random.randint(0, 100, 24)})

In [342]:
df

Unnamed: 0,temperature,ph,humidity
2023-07-25 00:00:00,26.764052,7.226975,85
2023-07-25 01:00:00,25.400157,6.854563,80
2023-07-25 02:00:00,25.978738,7.004576,1
2023-07-25 03:00:00,27.240893,6.981282,42
2023-07-25 04:00:00,26.867558,7.153278,77
2023-07-25 05:00:00,24.022722,7.146936,21
2023-07-25 06:00:00,25.950088,7.015495,87
2023-07-25 07:00:00,24.848643,7.037816,15
2023-07-25 08:00:00,24.896781,6.911221,52
2023-07-25 09:00:00,25.410599,6.80192,51


```loc``` can be used to retrieve the data at a specified ```datetime64```:

In [343]:
df.loc['2023-07-25T01:00:00']

temperature    25.400157
ph              6.854563
humidity       80.000000
Name: 2023-07-25 01:00:00, dtype: float64

```iloc``` can also be used with the ```int``` which would correspond to the ```RangeIndex``` if the ```Index``` is reset:

In [344]:
df.iloc[1]

temperature    25.400157
ph              6.854563
humidity       80.000000
Name: 2023-07-25 01:00:00, dtype: float64

A comparison between two times can be made:

In [345]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']

temperature     1.093922
ph              0.040581
humidity      -74.000000
dtype: float64

In addition to the ```Index``` of datatype ```datetime64```, the array ```times``` can be added to the ```DataFrame``` instance ```df``` as a ```Series```:

In [346]:
df['times'] = times

In [347]:
df

Unnamed: 0,temperature,ph,humidity,times
2023-07-25 00:00:00,26.764052,7.226975,85,2023-07-25 00:00:00
2023-07-25 01:00:00,25.400157,6.854563,80,2023-07-25 01:00:00
2023-07-25 02:00:00,25.978738,7.004576,1,2023-07-25 02:00:00
2023-07-25 03:00:00,27.240893,6.981282,42,2023-07-25 03:00:00
2023-07-25 04:00:00,26.867558,7.153278,77,2023-07-25 04:00:00
2023-07-25 05:00:00,24.022722,7.146936,21,2023-07-25 05:00:00
2023-07-25 06:00:00,25.950088,7.015495,87,2023-07-25 06:00:00
2023-07-25 07:00:00,24.848643,7.037816,15,2023-07-25 07:00:00
2023-07-25 08:00:00,24.896781,6.911221,52,2023-07-25 08:00:00
2023-07-25 09:00:00,25.410599,6.80192,51,2023-07-25 09:00:00


When ```loc``` is used to calculate the difference between two measurements at the two different times, the time difference, i.e. ```timedelta64``` will be calculated:

In [348]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']

temperature           1.093922
ph                    0.040581
humidity                   -74
times          0 days 15:00:00
dtype: object

The ```Series``` method ```tz_localize``` can be used to specify a ```timezone``` using the input argument ```tz```. Notice that it localizes the ```Index``` of the ```Series``` and leaves the values unchanged. For example in the UK:

In [349]:
df['times'].tz_localize(tz='Europe/London')

2023-07-25 00:00:00+01:00   2023-07-25 00:00:00
2023-07-25 01:00:00+01:00   2023-07-25 01:00:00
2023-07-25 02:00:00+01:00   2023-07-25 02:00:00
2023-07-25 03:00:00+01:00   2023-07-25 03:00:00
2023-07-25 04:00:00+01:00   2023-07-25 04:00:00
2023-07-25 05:00:00+01:00   2023-07-25 05:00:00
2023-07-25 06:00:00+01:00   2023-07-25 06:00:00
2023-07-25 07:00:00+01:00   2023-07-25 07:00:00
2023-07-25 08:00:00+01:00   2023-07-25 08:00:00
2023-07-25 09:00:00+01:00   2023-07-25 09:00:00
2023-07-25 10:00:00+01:00   2023-07-25 10:00:00
2023-07-25 11:00:00+01:00   2023-07-25 11:00:00
2023-07-25 12:00:00+01:00   2023-07-25 12:00:00
2023-07-25 13:00:00+01:00   2023-07-25 13:00:00
2023-07-25 14:00:00+01:00   2023-07-25 14:00:00
2023-07-25 15:00:00+01:00   2023-07-25 15:00:00
2023-07-25 16:00:00+01:00   2023-07-25 16:00:00
2023-07-25 17:00:00+01:00   2023-07-25 17:00:00
2023-07-25 18:00:00+01:00   2023-07-25 18:00:00
2023-07-25 19:00:00+01:00   2023-07-25 19:00:00
2023-07-25 20:00:00+01:00   2023-07-25 2

And in the Czech Republic:

In [350]:
df['times'].tz_localize(tz='Europe/Prague')

2023-07-25 00:00:00+02:00   2023-07-25 00:00:00
2023-07-25 01:00:00+02:00   2023-07-25 01:00:00
2023-07-25 02:00:00+02:00   2023-07-25 02:00:00
2023-07-25 03:00:00+02:00   2023-07-25 03:00:00
2023-07-25 04:00:00+02:00   2023-07-25 04:00:00
2023-07-25 05:00:00+02:00   2023-07-25 05:00:00
2023-07-25 06:00:00+02:00   2023-07-25 06:00:00
2023-07-25 07:00:00+02:00   2023-07-25 07:00:00
2023-07-25 08:00:00+02:00   2023-07-25 08:00:00
2023-07-25 09:00:00+02:00   2023-07-25 09:00:00
2023-07-25 10:00:00+02:00   2023-07-25 10:00:00
2023-07-25 11:00:00+02:00   2023-07-25 11:00:00
2023-07-25 12:00:00+02:00   2023-07-25 12:00:00
2023-07-25 13:00:00+02:00   2023-07-25 13:00:00
2023-07-25 14:00:00+02:00   2023-07-25 14:00:00
2023-07-25 15:00:00+02:00   2023-07-25 15:00:00
2023-07-25 16:00:00+02:00   2023-07-25 16:00:00
2023-07-25 17:00:00+02:00   2023-07-25 17:00:00
2023-07-25 18:00:00+02:00   2023-07-25 18:00:00
2023-07-25 19:00:00+02:00   2023-07-25 19:00:00
2023-07-25 20:00:00+02:00   2023-07-25 2
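
In the outputs above only the ```Index``` gains a UTC offset and the values remain timezone-unaware. The following sketch, which assumes a small stand-in ```Series``` called ```times```, contrasts this with the ```dt``` accessor, which localizes the values, and with ```tz_convert```, which expresses the same instants in another timezone:

```python
import pandas as pd

# Timezone-unaware timestamps used as both the Index and the values.
index = pd.to_datetime(['2023-07-25 00:00:00', '2023-07-25 01:00:00'])
times = pd.Series(index, index=index, name='times')

# Series.tz_localize attaches the timezone to the Index only.
index_localized = times.tz_localize(tz='Europe/London')

# The dt accessor attaches the timezone to the values instead.
values_localized = times.dt.tz_localize(tz='Europe/London')

# tz_convert keeps the instant and changes the timezone it is displayed in.
values_in_prague = values_localized.dt.tz_convert(tz='Europe/Prague')

print(values_in_prague)
```

Localizing assigns a timezone to a wall-clock time, whereas converting changes the wall-clock time so that the instant stays the same.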

Care needs to be taken with non-UTC timezones as clock changes lead to ambiguous times. In the UK one of the biannual clock changes can be examined:

In [351]:
start_time = np.datetime64('2023-10-28T11:00:00')
end_time = np.datetime64('2023-10-29T03:00:00')
time_interval = np.timedelta64(30, 'm')

In [352]:
utc_times = np.arange(start=start_time, #inclusive
                      stop=end_time, #exclusive
                      step=time_interval)

In [353]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous=True)

DatetimeIndex(['2023-10-28 11:00:00+01:00', '2023-10-28 11:30:00+01:00',
               '2023-10-28 12:00:00+01:00', '2023-10-28 12:30:00+01:00',
               '2023-10-28 13:00:00+01:00', '2023-10-28 13:30:00+01:00',
               '2023-10-28 14:00:00+01:00', '2023-10-28 14:30:00+01:00',
               '2023-10-28 15:00:00+01:00', '2023-10-28 15:30:00+01:00',
               '2023-10-28 16:00:00+01:00', '2023-10-28 16:30:00+01:00',
               '2023-10-28 17:00:00+01:00', '2023-10-28 17:30:00+01:00',
               '2023-10-28 18:00:00+01:00', '2023-10-28 18:30:00+01:00',
               '2023-10-28 19:00:00+01:00', '2023-10-28 19:30:00+01:00',
               '2023-10-28 20:00:00+01:00', '2023-10-28 20:30:00+01:00',
               '2023-10-28 21:00:00+01:00', '2023-10-28 21:30:00+01:00',
               '2023-10-28 22:00:00+01:00', '2023-10-28 22:30:00+01:00',
               '2023-10-28 23:00:00+01:00', '2023-10-28 23:30:00+01:00',
               '2023-10-29 00:00:00+01:00', '2023-1

In [354]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous='NaT')

DatetimeIndex(['2023-10-28 11:00:00+01:00', '2023-10-28 11:30:00+01:00',
               '2023-10-28 12:00:00+01:00', '2023-10-28 12:30:00+01:00',
               '2023-10-28 13:00:00+01:00', '2023-10-28 13:30:00+01:00',
               '2023-10-28 14:00:00+01:00', '2023-10-28 14:30:00+01:00',
               '2023-10-28 15:00:00+01:00', '2023-10-28 15:30:00+01:00',
               '2023-10-28 16:00:00+01:00', '2023-10-28 16:30:00+01:00',
               '2023-10-28 17:00:00+01:00', '2023-10-28 17:30:00+01:00',
               '2023-10-28 18:00:00+01:00', '2023-10-28 18:30:00+01:00',
               '2023-10-28 19:00:00+01:00', '2023-10-28 19:30:00+01:00',
               '2023-10-28 20:00:00+01:00', '2023-10-28 20:30:00+01:00',
               '2023-10-28 21:00:00+01:00', '2023-10-28 21:30:00+01:00',
               '2023-10-28 22:00:00+01:00', '2023-10-28 22:30:00+01:00',
               '2023-10-28 23:00:00+01:00', '2023-10-28 23:30:00+01:00',
               '2023-10-29 00:00:00+01:00', '2023-1
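
The ambiguity can be sketched on a shorter range spanning the clock change. The instance name ```naive_times``` is an assumption for illustration. If the values are wall-clock times in London, the times between 01:00 and 02:00 occur twice and cannot be resolved without more information. If the values are genuinely UTC, localizing to UTC and then converting avoids the ambiguity:

```python
import numpy as np
import pandas as pd

# Half-hourly timezone-unaware times spanning the UK clock change.
naive_times = pd.Index(np.arange(np.datetime64('2023-10-29T00:00:00'),
                                 np.datetime64('2023-10-29T03:00:00'),
                                 np.timedelta64(30, 'm')))

# Wall-clock interpretation: the ambiguous times 01:00 and 01:30 become NaT.
as_nat = naive_times.tz_localize(tz='Europe/London', ambiguous='NaT')

# UTC interpretation: localize to UTC first, then convert to local time.
from_utc = naive_times.tz_localize(tz='UTC').tz_convert(tz='Europe/London')

print(as_nat)
print(from_utc)
```

In the converted ```Index``` the local time 01:00 appears twice, once with the offset ```+01:00``` and once with the offset ```+00:00```.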

## Reading Data from Files

The ```Series``` and ```DataFrames``` previously examined were created from scratch using ```builtins``` datatypes. It is also common to read data in from another source. The ```pandas``` library therefore has a number of functions for reading in data from external files. The function names all have a ```read_``` prefix followed by the file type:

In [355]:
for identifier in dir(pd):
    if identifier.startswith('read_'):
        print(identifier, end=' ')

read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml 

Some of the most common formats will be explored.

### Comma Separated Values File

CSV is an abbreviation for Comma Separated Values although lower case csv is typically used for the file format. The file format has a similar structure to a ```tuple```, where each element is separated by a comma. In the case of a csv file, each column is separated by a comma and the newline character ```\n``` is an instruction to move onto the next row:

When opened in a program such as Microsoft Excel, these display as a grid:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Notice that the comma in ```twinkle, twinkle``` is not a delimiter but part of the ```str```. For this reason ```"twinkle, twinkle"``` is enclosed in quotation marks.
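
The effect of the quotation marks can be sketched without a file by wrapping csv text in a ```StringIO``` instance. The instance names ```csv_text``` and ```df_example``` and the two rows are assumptions chosen for illustration:

```python
from io import StringIO

import pandas as pd

# Two columns; the first field of the first row contains a comma
# and is therefore enclosed in quotation marks.
csv_text = 'string,integer\n"twinkle, twinkle",2\nlittle star,2\n'

df_example = pd.read_csv(StringIO(csv_text))

print(df_example)
```

The quoted field is read in as a single value and the quotation marks themselves are not part of the ```str```.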

The csv file has a file name, in this case ```Book1.csv```.

Because it is in the same folder as the interactive Python notebook, the file path can be specified as the following string:

<img src='./images/img_002.png' alt='img_002' width='800'/>

In [356]:
file_path = r'.\Book1.csv'

In [357]:
file_path

'.\\Book1.csv'

* ```r``` means raw string. In a raw string ```\``` is a literal backslash instead of the start of an escape sequence.
* ```.\``` means in the same folder as the interactive Python notebook.

If the file is moved into a sub folder called files:

<img src='./images/img_003.png' alt='img_003' width='800'/>

Then the file path becomes:

In [358]:
file_path = r'.\files\Book1.csv'

In [359]:
file_path

'.\\files\\Book1.csv'

If the file is placed up a level from the interactive notebook, the file path becomes:

<img src='./images/img_005.png' alt='img_005' width='800'/>

In [360]:
file_path = r'..\Book1.csv'

In [361]:
file_path

'..\\Book1.csv'

And if a subfolder (that is in the folder up a level from the interactive Python notebook file) is made called files:

<img src='./images/img_006.png' alt='img_006' width='800'/>

In [362]:
file_path = r'..\files\Book1.csv'

In [363]:
file_path

'..\\files\\Book1.csv'
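
The file paths above use the Windows separator ```\```. As a hedged alternative, the four locations can be sketched with ```pathlib.Path``` from the standard library, which builds the path with the separator of the operating system in use. The instance names are assumptions for illustration:

```python
from pathlib import Path

# Same folder as the notebook.
same_folder = Path('.') / 'Book1.csv'

# Sub folder called files.
sub_folder = Path('.') / 'files' / 'Book1.csv'

# Up a level from the notebook.
up_a_level = Path('..') / 'Book1.csv'

# Sub folder called files in the folder up a level.
up_and_sub = Path('..') / 'files' / 'Book1.csv'

print(same_folder, sub_folder, up_a_level, up_and_sub)
```

A ```Path``` instance can be supplied to ```pd.read_csv``` in place of a ```str```.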

The function ```read_csv``` is used to read in a csv file returning a ```DataFrame``` instance:

In [364]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols:

The ```read_csv``` function has a large number of input arguments. Most of these are keyword input arguments and are assigned a default value that is consistent with the usual layout of a csv file. When the file is in the expected format only the ```filepath_or_buffer``` needs to be specified:

In [365]:
df = pd.read_csv(filepath_or_buffer='Book1.csv')

In [366]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B


The first input argument is normally used positionally:

In [367]:
df = pd.read_csv(r'.\files\Book1.csv')

In [368]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B


Notice the ```Series``` names in the file are in the expected format and taken from the first line. A csv does not have an index by default and so a ```RangeIndex``` is automatically generated:

In [369]:
df.axes

[RangeIndex(start=0, stop=6, step=1),
 Index(['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time',
        'category'],
       dtype='object')]

### Tab Delimited Text File

A tab delimited text file has the file extension txt and is very similar to a csv file but uses the tab character ```\t``` instead of ```,``` as a delimiter.

The same function ```read_csv``` is used to read in a txt file. However this function by default looks for a ```,``` as a delimiter to move onto the next column and as it is not present, the data is all shown in a single column:

In [370]:
df = pd.read_csv(r'.\files\Book2.txt')

In [371]:
df

Unnamed: 0,string\tinteger\tboolean\tfloatingpoint\tdate\ttime\tcategory
0,the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t...
1,sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:...
2,"twinkle, twinkle\t2\tTRUE\t-1.14\t26/07/2023\t..."
3,little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36...
4,how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:...
5,what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36...


If the delimiter is specified as ```'\t'``` the data will instead be read in properly:

In [372]:
df = pd.read_csv(r'.\files\Book2.txt', delimiter='\t')

In [373]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B
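
The behaviour can be sketched without a file using tab delimited text in a ```StringIO``` instance. The instance names and the two rows are assumptions for illustration. In ```read_csv``` the keyword input argument ```delimiter``` is an alias for ```sep```, so either can be used:

```python
from io import StringIO

import pandas as pd

tab_text = 'string\tinteger\nlittle star\t2\nhow I wonder\t3\n'

# Default delimiter ',' is not found, so a single column is returned.
with_default = pd.read_csv(StringIO(tab_text))

# sep and delimiter are interchangeable.
with_sep = pd.read_csv(StringIO(tab_text), sep='\t')
with_delimiter = pd.read_csv(StringIO(tab_text), delimiter='\t')

print(with_default.shape, with_sep.shape)
```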


### Microsoft Excel File

A Microsoft Excel File, with the file extension .xlsx (or .xls for older files), is a collection of sheets. The data in each individual sheet is similar to a csv file. The Excel file can also be modified in Microsoft Excel to visually format the data:

<img src='./images/img_007.png' alt='img_007' width='800'/>

This formatting capability makes the raw Excel File less human readable than the more basic csv file when examined in a text editor:

However the additional formatting is not important and only the data is read in by Python. The related function ```read_excel``` is used to read in the data from an Excel File. The delimiter is predefined in an Excel File, however the Excel File can have multiple sheets so the input argument ```sheet_name``` is available. This defaults to ```0```, the first sheet, which is named ```'Sheet1'``` by default in a new Excel spreadsheet:

In [374]:
pd.read_excel?

Signature:
pd.read_excel(
    io,
    sheet_name: 'str | int | list[IntStrT] | None' = 0,
    *,
    header: 'int | Sequence[int] | None' = 0,
    names: 'list[str] | None' = None,
    index_col: 'int | Sequence[int] | None' = None,
    usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None,
    dtype: 'DtypeArg | None' = None,
    engine: "Literal['xlrd', 'openpyxl', 

In [375]:
df = pd.read_excel(r'.\files\Book3.xlsx', sheet_name='Sheet1')

In [376]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


Sheets in an Excel File are ordered and the ordering of the sheets is analogous to a ```RangeIndex``` (```'Sheet1'``` corresponds to an index of ```0``` because of zero-order indexing):

In [377]:
df = pd.read_excel(r'.\files\Book3.xlsx', sheet_name=0)

In [378]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


In an .xlsx file, a column containing dates is normally given a datetime format and the file stores this formatting information. The ```read_excel``` function uses it, so dates are usually parsed more reliably from an .xlsx file than from a csv file:

In [379]:
df = pd.read_excel(r'./files/Book3.xlsx', sheet_name='Sheet1', parse_dates=True)

In [380]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


The date column is parsed as a date but the time column isn't:

In [381]:
df.dtypes

string                   object
integer                   int64
boolean                    bool
floatingpoint           float64
date             datetime64[ns]
time                     object
category                 object
dtype: object

When reading from a csv, the date and time columns are usually read in as ```str``` and have to be manipulated to get the correct format. If the ```date``` ```Series``` of datatype ```datetime64``` is cast to a ```str```:

In [382]:
df['date'].astype('str')

0    2023-07-24 11:30:00
1    2023-07-25 00:00:00
2    2023-07-26 00:00:00
3    2023-07-27 00:00:00
4    2023-07-28 00:00:00
5    2023-07-29 00:00:00
Name: date, dtype: object

Then the ```str``` can be indexed to only get the date component:

In [383]:
df['date'].astype('str').str[:10]

0    2023-07-24
1    2023-07-25
2    2023-07-26
3    2023-07-27
4    2023-07-28
5    2023-07-29
Name: date, dtype: object

And time component:

In [384]:
df['date'].astype('str').str[11:]

0    11:30:00
1    00:00:00
2    00:00:00
3    00:00:00
4    00:00:00
5    00:00:00
Name: date, dtype: object

These components can be used to create a timestamp:

In [385]:
df['date'].astype('str').str[:10] + ' T' + df['date'].astype('str').str[11:]

0    2023-07-24 T11:30:00
1    2023-07-25 T00:00:00
2    2023-07-26 T00:00:00
3    2023-07-27 T00:00:00
4    2023-07-28 T00:00:00
5    2023-07-29 T00:00:00
Name: date, dtype: object

In [386]:
df['datetime'] = df['date'].astype('str').str[:10] + ' T' + df['date'].astype('str').str[11:]

In [387]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category,datetime
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A,2023-07-24 T11:30:00
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A,2023-07-25 T00:00:00
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B,2023-07-26 T00:00:00
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B,2023-07-27 T00:00:00
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B,2023-07-28 T00:00:00
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B,2023-07-29 T00:00:00


And this can be cast into ```'datetime64[ns]'```:

In [388]:
df['datetime'].astype('datetime64[ns]')

0   2023-07-24 11:30:00
1   2023-07-25 00:00:00
2   2023-07-26 00:00:00
3   2023-07-27 00:00:00
4   2023-07-28 00:00:00
5   2023-07-29 00:00:00
Name: datetime, dtype: datetime64[ns]

In [389]:
df['datetime'] = df['datetime'].astype('datetime64[ns]')
del df['date']
del df['time']
df

Unnamed: 0,string,integer,boolean,floatingpoint,category,datetime
0,the fat black cat,4,True,0.86,A,2023-07-24 11:30:00
1,sat on the mat,4,True,0.86,A,2023-07-25 00:00:00
2,"twinkle, twinkle",2,True,-1.14,B,2023-07-26 00:00:00
3,little star,2,True,-1.14,B,2023-07-27 00:00:00
4,how I wonder,3,False,-0.14,B,2023-07-28 00:00:00
5,what you are,4,True,0.86,B,2023-07-29 00:00:00
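
As a hedged alternative to ```str``` slicing, date and time columns read from a csv as ```str``` can be concatenated and parsed with ```pd.to_datetime```. This sketch assumes day-first dates in the layout of the csv shown earlier. The instance names ```csv_text``` and ```df_example``` are assumptions for illustration:

```python
from io import StringIO

import pandas as pd

csv_text = ('date,time\n'
            '24/07/2023,11:36:00\n'
            '25/07/2023,12:36:00\n')
df_example = pd.read_csv(StringIO(csv_text))

# Join the two str columns and state the expected format explicitly,
# so that 24/07/2023 is read as day/month/year.
df_example['datetime'] = pd.to_datetime(
    df_example['date'] + ' ' + df_example['time'],
    format='%d/%m/%Y %H:%M:%S')

print(df_example['datetime'])
```

Stating the ```format``` removes any ambiguity between day-first and month-first dates.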


The ```category``` ```Series``` can be cast to the categorical datatype:

In [390]:
df['category'].astype('category')

0    A
1    A
2    B
3    B
4    B
5    B
Name: category, dtype: category
Categories (2, object): ['A', 'B']

In [391]:
df['category'] = df['category'].astype('category')
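
A categorical ```Series``` stores each unique category once and represents every value by an integer code. A minimal sketch, using a stand-in ```Series``` called ```category``` (an assumption for illustration), examines both through the ```cat``` accessor:

```python
import pandas as pd

category = pd.Series(['A', 'A', 'B', 'B', 'B', 'B'],
                     name='category').astype('category')

# The unique categories and the integer code of each value.
categories = list(category.cat.categories)
codes = list(category.cat.codes)

print(categories)
print(codes)
```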

Then to reorder the columns, indexing is quite commonly used:

In [392]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,category,datetime
0,the fat black cat,4,True,0.86,A,2023-07-24 11:30:00
1,sat on the mat,4,True,0.86,A,2023-07-25 00:00:00
2,"twinkle, twinkle",2,True,-1.14,B,2023-07-26 00:00:00
3,little star,2,True,-1.14,B,2023-07-27 00:00:00
4,how I wonder,3,False,-0.14,B,2023-07-28 00:00:00
5,what you are,4,True,0.86,B,2023-07-29 00:00:00


In [393]:
df[['string', 'integer', 'boolean', 'floatingpoint', 'datetime', 'category']]

Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,B


The instance name ```df``` can be reassigned to this output:

In [394]:
df = df[['string', 'integer', 'boolean', 'floatingpoint', 'datetime', 'category']]
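
Indexing with a ```list``` of ```Series``` names is one way to reorder columns. The ```DataFrame``` method ```reindex``` with the keyword input argument ```columns``` is another, sketched here on a one row stand-in ```DataFrame``` (```df_example``` and ```new_order``` are assumptions for illustration):

```python
import pandas as pd

df_example = pd.DataFrame({'string': ['little star'],
                           'category': ['B'],
                           'integer': [2]})

new_order = ['string', 'integer', 'category']

# Both approaches return a new DataFrame with the columns reordered.
by_indexing = df_example[new_order]
by_reindex = df_example.reindex(columns=new_order)

print(by_reindex)
```

One difference is that indexing raises a ```KeyError``` when a name is missing, whereas ```reindex``` creates the missing column filled with missing values.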

## Writing DataFrames to Objects and Files

The ```DataFrame``` class has a number of methods for writing to files:

In [395]:
for identifier in dir(df):
    if identifier.startswith('to_'):
        print(identifier, end=' ')

to_clipboard to_csv to_dict to_excel to_feather to_gbq to_hdf to_html to_json to_latex to_markdown to_numpy to_orc to_parquet to_period to_pickle to_records to_sql to_stata to_string to_timestamp to_xarray to_xml 

### Python Dictionary

A ```DataFrame``` instance can be written to a dictionary using:

In [396]:
df.to_dict()

{'string': {0: 'the fat black cat',
  1: 'sat on the mat',
  2: 'twinkle, twinkle',
  3: 'little star',
  4: 'how I wonder',
  5: 'what you are'},
 'integer': {0: 4, 1: 4, 2: 2, 3: 2, 4: 3, 5: 4},
 'boolean': {0: True, 1: True, 2: True, 3: True, 4: False, 5: True},
 'floatingpoint': {0: 0.8599999999999999,
  1: 0.8599999999999999,
  2: -1.1400000000000001,
  3: -1.1400000000000001,
  4: -0.14000000000000012,
  5: 0.8599999999999999},
 'datetime': {0: Timestamp('2023-07-24 11:30:00'),
  1: Timestamp('2023-07-25 00:00:00'),
  2: Timestamp('2023-07-26 00:00:00'),
  3: Timestamp('2023-07-27 00:00:00'),
  4: Timestamp('2023-07-28 00:00:00'),
  5: Timestamp('2023-07-29 00:00:00')},
 'category': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'B'}}

Note that this is the same form that can be used to instantiate a ```DataFrame```:

In [397]:
pd.DataFrame(data=df.to_dict())

Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,B
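
The nested form above is the default. The keyword input argument ```orient``` selects other layouts, two of which are sketched here on a small stand-in ```DataFrame``` (```df_example``` is an assumption for illustration):

```python
import pandas as pd

df_example = pd.DataFrame({'string': ['little star', 'how I wonder'],
                           'integer': [2, 3]})

# Default: {column: {index: value}}
nested = df_example.to_dict()

# {column: [values]}
as_lists = df_example.to_dict(orient='list')

# [{column: value}, ...] with one dict per row
as_records = df_example.to_dict(orient='records')

print(as_lists)
print(as_records)
```

The ```'list'``` and ```'records'``` layouts discard the ```Index```, so they are most useful when the ```Index``` is a plain ```RangeIndex```.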


### JSON

JavaScript Object Notation (JSON) has become a commonly used standard for exchanging data. It is common to retrieve data from a website in JSON form and convert it to a ```DataFrame```. In the other direction, the ```DataFrame``` method ```to_json``` returns a JSON ```str```:

In [398]:
df.to_json()

'{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"datetime":{"0":1690198200000,"1":1690243200000,"2":1690329600000,"3":1690416000000,"4":1690502400000,"5":1690588800000},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'

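Notice the above is similar to the ```str``` of a Python ```dict```, however there are some subtle differences between the two formats: the keys are always ```str```, the boolean values are lower case and each timestamp is stored as the number of milliseconds since the Unix epoch. The JSON ```str``` can be assigned to an instance name: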

In [399]:
dataframe_string = df.to_json()

The ```StringIO``` class can prepare a ```str``` for input output operations:

In [400]:
from io import StringIO
dataframe_string_io = StringIO(dataframe_string)

This ```StringIO``` instance can be read in using the complementary ```read_json``` function, returning a ```DataFrame```:

In [401]:
pd.read_json(dataframe_string_io)

Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,B
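
The relationship between a JSON ```str``` and a Python ```dict``` can be sketched with the ```json``` module from the standard library. The stand-in ```DataFrame``` ```df_example``` is an assumption for illustration:

```python
import json

import pandas as pd

df_example = pd.DataFrame({'integer': [4, 2], 'boolean': [True, False]})

# JSON str written by pandas and the dict obtained by parsing it.
json_string = df_example.to_json()
parsed = json.loads(json_string)

# dict written directly by pandas for comparison.
python_dict = df_example.to_dict()

print(json_string)
print(parsed)
print(python_dict)
```

After parsing, the JSON values ```true``` and ```false``` become ```True``` and ```False``` but the index keys remain ```str``` instances, unlike the ```int``` keys returned by ```to_dict```.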


### Markdown

A DataFrame can be written to markdown using:

In [402]:
df.to_markdown()

'|    | string            |   integer | boolean   |   floatingpoint | datetime            | category   |\n|---:|:------------------|----------:|:----------|----------------:|:--------------------|:-----------|\n|  0 | the fat black cat |         4 | True      |            0.86 | 2023-07-24 11:30:00 | A          |\n|  1 | sat on the mat    |         4 | True      |            0.86 | 2023-07-25 00:00:00 | A          |\n|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 2023-07-26 00:00:00 | B          |\n|  3 | little star       |         2 | True      |           -1.14 | 2023-07-27 00:00:00 | B          |\n|  4 | how I wonder      |         3 | False     |           -0.14 | 2023-07-28 00:00:00 | B          |\n|  5 | what you are      |         4 | True      |            0.86 | 2023-07-29 00:00:00 | B          |'

If this is printed:

In [403]:
print(df.to_markdown())

|    | string            |   integer | boolean   |   floatingpoint | datetime            | category   |
|---:|:------------------|----------:|:----------|----------------:|:--------------------|:-----------|
|  0 | the fat black cat |         4 | True      |            0.86 | 2023-07-24 11:30:00 | A          |
|  1 | sat on the mat    |         4 | True      |            0.86 | 2023-07-25 00:00:00 | A          |
|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 2023-07-26 00:00:00 | B          |
|  3 | little star       |         2 | True      |           -1.14 | 2023-07-27 00:00:00 | B          |
|  4 | how I wonder      |         3 | False     |           -0.14 | 2023-07-28 00:00:00 | B          |
|  5 | what you are      |         4 | True      |            0.86 | 2023-07-29 00:00:00 | B          |


When the cell output is copied to a markdown cell it displays:

|    | string            |   integer | boolean   |   floatingpoint | datetime            | category   |
|---:|:------------------|----------:|:----------|----------------:|:--------------------|:-----------|
|  0 | the fat black cat |         4 | True      |            0.86 | 2023-07-24 11:30:00 | A          |
|  1 | sat on the mat    |         4 | True      |            0.86 | 2023-07-25 00:00:00 | A          |
|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 2023-07-26 00:00:00 | B          |
|  3 | little star       |         2 | True      |           -1.14 | 2023-07-27 00:00:00 | B          |
|  4 | how I wonder      |         3 | False     |           -0.14 | 2023-07-28 00:00:00 | B          |
|  5 | what you are      |         4 | True      |            0.86 | 2023-07-29 00:00:00 | B          |

There is no analogous function to read in data from this format.

### CSV and Text Files

The ```DataFrame``` can be written to a csv file using the method ```to_csv```:

In [404]:
df.to_csv(r'.\files\Book4.csv')

This has the raw form:

Notice that the ```Index``` was written to the file as the first column. If the file is read into a ```DataFrame``` instance using the defaults, this column is read in as an additional ```Series``` named ```Unnamed: 0```:

In [405]:
pd.read_csv(r'.\files\Book4.csv')

Unnamed: 0.1,Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,5,what you are,4,True,0.86,2023-07-29 00:00:00,B


This can be assigned to the ```Index``` using the keyword input argument ```index_col```:

In [406]:
pd.read_csv(r'.\files\Book4.csv', index_col=0)

Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,B


Alternatively the ```DataFrame``` can be exported without the ```Index```:

In [407]:
df.to_csv?

Signature:
df.to_csv(
    path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | Callable | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None' = None,

In [408]:
df.to_csv(r'.\files\Book5.csv', index=False)

To save to a text file, the separator needs to be specified:

In [409]:
df.to_csv(r'.\files\Book6.txt', sep='\t', index=False)
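
When ```path_or_buf``` is left at its default of ```None```, ```to_csv``` returns the csv text as a ```str``` instead of writing a file, which makes the effect of ```index``` and ```sep``` easy to inspect. A minimal sketch on a stand-in ```DataFrame``` (```df_example``` is an assumption for illustration):

```python
from io import StringIO

import pandas as pd

df_example = pd.DataFrame({'string': ['twinkle, twinkle', 'little star'],
                           'integer': [2, 2]})

# The Index is written as an unnamed first column by default.
with_index = df_example.to_csv()

# index=False omits it; sep='\t' switches to a tab delimiter.
without_index = df_example.to_csv(index=False)
tab_separated = df_example.to_csv(sep='\t', index=False)

# Reading the text back with index_col=0 restores the Index.
round_trip = pd.read_csv(StringIO(with_index), index_col=0)

print(with_index)
print(without_index)
print(tab_separated)
```

Notice that the field containing a comma is only enclosed in quotation marks when the comma is also the delimiter.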

### Excel File

Supposing there are three DataFrame instances:

In [410]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,datetime,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,B


In [411]:
df2 = df[['string', 'integer', 'boolean']]

In [412]:
df2

Unnamed: 0,string,integer,boolean
0,the fat black cat,4,True
1,sat on the mat,4,True
2,"twinkle, twinkle",2,True
3,little star,2,True
4,how I wonder,3,False
5,what you are,4,True


In [413]:
df3 = df[['string', 'datetime', 'category']]

In [414]:
df3

Unnamed: 0,string,datetime,category
0,the fat black cat,2023-07-24 11:30:00,A
1,sat on the mat,2023-07-25 00:00:00,A
2,"twinkle, twinkle",2023-07-26 00:00:00,B
3,little star,2023-07-27 00:00:00,B
4,how I wonder,2023-07-28 00:00:00,B
5,what you are,2023-07-29 00:00:00,B


The method ```to_excel``` allows the writing of multiple ```DataFrame``` instances to individual sheets within an Excel File:

In [415]:
df.to_excel?

Signature:
df.to_excel(
    excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter',
    sheet_name: 'str' = 'Sheet1',
    na_rep: 'str' = '',
    float_format: 'str | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'Sequence[Hashable] | bool_t' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None' = None,
    startr

To write ```DataFrame``` instances to multiple sheets an ```ExcelWriter``` instance has to be instantiated and given the instruction to create a blank Excel File:

In [416]:
writer = pd.ExcelWriter(path=r'.\files\Book7.xlsx')

The ```DataFrame``` method ```to_excel``` can then be used to instruct the ```ExcelWriter``` instance to write the ```DataFrame``` instance to a specified sheet:

In [417]:
df.to_excel(excel_writer=writer, sheet_name='df')
df2.to_excel(excel_writer=writer, sheet_name='df2')
df3.to_excel(excel_writer=writer, sheet_name='df3')

Details about the sheets being written can be seen using the ```ExcelWriter``` attribute ```sheets``` which is a mapping where the key is the sheet name and the value is the sheet being written:

In [418]:
writer.sheets

{'df': <xlsxwriter.worksheet.Worksheet at 0x218bb2a9160>,
 'df2': <xlsxwriter.worksheet.Worksheet at 0x218bb3c4290>,
 'df3': <xlsxwriter.worksheet.Worksheet at 0x218b87450d0>}

Finally the ```ExcelWriter``` instance can be closed. This will release the Excel SpreadSheet from Python:

In [419]:
writer.close()

<img src='./images/img_008.png' alt='img_008' width='800'/>

<img src='./images/img_009.png' alt='img_009' width='800'/>

<img src='./images/img_010.png' alt='img_010' width='800'/>

The identifiers of the ```ExcelWriter``` class can be examined:

In [420]:
print('datamodel attribute:', end=' ')
print_identifier_group(pd.ExcelWriter, kind='datamodel_attribute')
print('datamodel method:', end=' ')
print_identifier_group(pd.ExcelWriter, kind='datamodel_method')
print('attribute:', end=' ')
print_identifier_group(pd.ExcelWriter, kind='attribute')
print('method:', end=' ')
print_identifier_group(pd.ExcelWriter, kind='function')

datamodel attribute: ['__abstractmethods__', '__annotations__', '__dict__', '__doc__', '__module__', '__orig_bases__', '__parameters__', '__weakref__']
datamodel method: ['__class__', '__class_getitem__', '__delattr__', '__dir__', '__enter__', '__eq__', '__exit__', '__format__', '__fspath__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']
attribute: ['book', 'date_format', 'datetime_format', 'engine', 'if_sheet_exists', 'sheets', 'supported_extensions']
method: ['check_extension', 'close']


Notice it has the datamodel identifiers ```__enter__``` and ```__exit__``` which means it can be used within a ```with``` code block. The ```with``` code block will automatically close the ```ExcelWriter``` instance when the block ends and is the safest way to create the file, write multiple sheets to it and close the file:

In [421]:
with pd.ExcelWriter(r'.\files\Book8.xlsx') as writer:  
    df.to_excel(writer, sheet_name='df1', index=False)
    df2.to_excel(writer, sheet_name='df2', index=False)
    df3.to_excel(writer, sheet_name='df3', index=False)

<img src='./images/img_011.png' alt='img_011' width='800'/>

<img src='./images/img_012.png' alt='img_012' width='800'/>

<img src='./images/img_013.png' alt='img_013' width='800'/>
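
The way a ```with``` code block uses ```__enter__``` and ```__exit__``` can be sketched with a small stand-in class. ```DemoWriter``` is hypothetical, is not part of ```pandas``` and does not write any file. It only records whether ```close``` was called:

```python
class DemoWriter:
    """Hypothetical stand-in for a writer; not part of pandas."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True

    def __enter__(self):
        # The returned value is bound to the name after 'as'.
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block ends, with or without an error.
        self.close()
        return False  # do not suppress any exception


with DemoWriter() as demo:
    closed_inside_block = demo.closed

try:
    with DemoWriter() as failing:
        raise ValueError('simulated failure while writing')
except ValueError:
    pass

print(closed_inside_block, demo.closed, failing.closed)
```

The writer is closed when the block ends, even when an error is raised inside the block, which is why the ```with``` code block is preferred over calling ```close``` manually.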