# Pandas Library

The name pandas is derived from "panel data" and is also commonly read as **P**ython **an**d **D**ata **A**naly**s**is. The library is built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

Most Index instances are numeric: by default they start at zero and increase in integer steps of one.

The Index, like a tuple, list or 1darray, has a single dimension, which can be represented either as a row:

|index|0|1|2|3|
|---|---|---|---|---|

Or as a column when convenient:

|index|
|---|
|0|
|1|
|2|
|3|


The Series class has a value at each index and a name. It is essentially a numpy 1darray that has a name. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame instance is essentially a grouping of Series instances that share the same index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|
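
The three tables above can be reproduced with a minimal sketch. The variable names `x`, `y` and `df` are chosen for illustration only, and `pd.concat` is just one of several ways to group Series instances into a DataFrame:

```python
import pandas as pd

# Two Series named 'x' and 'y' that share the default zero-based integer index
x = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
y = pd.Series([1.2, 2.2, 3.2, 4.2], name='y')

# Group the Series side by side; each Series name becomes a column label
df = pd.concat([x, y], axis=1)

print(x.index)
print(df)
```

Both Series and the resulting DataFrame share one index, which is why the index is only displayed once in the DataFrame table.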

## Categorize_Identifiers Module

This notebook uses the functions ```dir2```, ```variables``` and ```view``` from the custom module ```categorize_identifiers```, which is found in the same directory as this notebook file. ```dir2``` is a variant of ```dir``` that groups identifiers into a ```dict``` under categories, ```variables``` is an IPython-based variable inspector and ```view``` is used to view a ```Collection``` in more detail:

In [1]:
from categorize_identifiers import dir2, variables, view
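
The source of ```categorize_identifiers``` is not shown in this notebook. The following is only a rough sketch of the idea behind grouping the output of ```dir``` into a ```dict```; the function name ```group_identifiers``` and its categories are hypothetical and much simpler than ```dir2```:

```python
def group_identifiers(obj):
    """Group the names returned by dir(obj) into four broad categories."""
    groups = {'attribute': [], 'method': [],
              'datamodel_attribute': [], 'datamodel_method': []}
    for name in dir(obj):
        value = getattr(obj, name)
        # Datamodel identifiers begin and end with a double underscore
        is_datamodel = name.startswith('__') and name.endswith('__')
        kind = 'method' if callable(value) else 'attribute'
        key = f'datamodel_{kind}' if is_datamodel else kind
        groups[key].append(name)
    return groups


grouped = group_identifiers(str)
print(grouped['method'][:5])
```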

The following module is imported and the following list is defined in order to simplify the output of ```dir2```:

In [2]:
import operator
reverse_operators = ['radd', 'rmod', 'rmul', 'rpow', 'rsub', 'rtruediv']

## Importing Libraries

To use the data science libraries they need to be imported:

In [3]:
import numpy as np 
import pandas as pd

Once imported the identifiers can be viewed:

In [8]:
dir2(pd, exclude_external_modules=True, drop_internal=True)

{'attribute': ['annotations', 'options'],
 'constant': ['IndexSlice', 'NA', 'NaT'],
 'module': ['api',
            'arrays',
            'compat',
            'core',
            'errors',
            'io',
            'offsets',
            'pandas',
            'plotting',
            'testing',
            'tseries',
            'util'],
 'method': ['array',
            'bdate_range',
            'concat',
            'crosstab',
            'cut',
            'date_range',
            'describe_option',
            'eval',
            'factorize',
            'from_dummies',
            'get_dummies',
            'get_option',
            'infer_freq',
            'interval_range',
            'isna',
            'isnull',
            'json_normalize',
            'lreshape',
            'melt',
            'merge',
            'merge_asof',
            'merge_ordered',
            'notna',
            'notnull',
            'period_range',
            'pivot',
            'pivot_t

In general ```pd``` uses object-oriented programming (OOP) as opposed to functional programming. Most of the functions within ```pd``` are used to read in data from a file and output a ```pd.DataFrame``` instance.

The classes are all in PascalCase. The main classes are:

* pd.Index
* pd.Series
* pd.DataFrame
 
The OOP approach means methods are normally called from instances to analyse or manipulate the data in the instance. The relationship between the identifiers in the three main classes will be examined first, before the classes are examined in detail.

The ```pd.Series``` can be conceptualised as a ```np.ndarray``` with a ```name``` and therefore has consistent attributes, methods and numerical datamodel methods:

In [9]:
dir2(pd.Series, np.ndarray, consistent_only=True, exclude_external_modules=True, drop_internal=True)

{'attribute': ['dtype', 'flags', 'nbytes', 'ndim', 'shape', 'size'],
 'constant': ['T'],
 'method': ['all',
            'any',
            'argmax',
            'argmin',
            'argsort',
            'astype',
            'clip',
            'copy',
            'cumprod',
            'cumsum',
            'dot',
            'item',
            'max',
            'mean',
            'min',
            'prod',
            'ravel',
            'repeat',
            'round',
            'searchsorted',
            'squeeze',
            'std',
            'sum',
            'swapaxes',
            'take',
            'tolist',
            'transpose',
            'var',
            'view'],
 'datamodel_attribute': ['__array_priority__', '__doc__', '__hash__'],
 'datamodel_method': ['__abs__',
                      '__add__',
                      '__and__',
                      '__array__',
                      '__array_ufunc__',
                      '__bool__',
                  

A number of the ```__builtins__``` functions are also available as ```pd.Series``` methods:

In [16]:
dir2(pd.Series, __builtins__, consistent_only=True)

{'method': ['abs',
            'all',
            'any',
            'bool',
            'divmod',
            'filter',
            'map',
            'max',
            'min',
            'pow',
            'round',
            'sum'],
 'lower_class': ['list', 'str'],
 'datamodel_attribute': ['__doc__']}
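
As a small sketch of this overlap, a ```builtins``` function and the ```pd.Series``` method of the same name can be compared directly. The example values are made up for illustration:

```python
import pandas as pd

s = pd.Series([-1.5, 2.0, -3.5], name='x')

# The builtins function and the Series method of the same name
abs_builtin = abs(s)      # uses the __abs__ datamodel method
abs_method = s.abs()

total_builtin = sum(s)    # iterates over the values
total_method = s.sum()    # vectorised method

print(abs_method)
print(total_builtin, total_method)
```

The method form is usually preferred because it operates on the whole array at once instead of iterating over each value.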


Many of the datamodel operators defined for numeric operation are also available as a method:

In [18]:
dir2(pd.Series, operator, consistent_only=True)

{'attribute': ['index'],
 'method': ['abs',
            'add',
            'eq',
            'floordiv',
            'ge',
            'gt',
            'le',
            'lt',
            'mod',
            'mul',
            'ne',
            'pow',
            'sub',
            'truediv'],
 'datamodel_attribute': ['__doc__'],
 'datamodel_method': ['__abs__',
                      '__add__',
                      '__and__',
                      '__contains__',
                      '__delitem__',
                      '__eq__',
                      '__floordiv__',
                      '__ge__',
                      '__getitem__',
                      '__gt__',
                      '__iadd__',
                      '__iand__',
                      '__ifloordiv__',
                      '__imod__',
                      '__imul__',
                      '__invert__',
                      '__ior__',
                      '__ipow__',
                      '__isub__',
           

The ```pd.Series``` has additional attributes such as:

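* ```array``` which returns the underlying data wrapped in a pandas extension array; the ```values``` attribute or the ```to_numpy``` method returns the ```np.ndarray``` (1darray) instance.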
* ```name``` which is a ```str``` instance corresponding to the series name.
* ```index``` which returns the ```pd.Index``` instance.
* ```at```, ```iat```, ```loc``` and ```iloc``` are used for additional indexing purposes.
* ```attrs``` is used to store optional metadata.
* ```hasnans```, ```empty``` and ```is_unique``` are ```bool``` instances that are ```True``` if the series has values that aren't available, is empty or consists of only unique values respectively.
* ```is_monotonic_decreasing``` and ```is_monotonic_increasing``` are also ```bool``` instances that are ```True``` if the series is reverse ordered or ordered respectively.
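
The attributes listed above can be explored with a short sketch. The Series ```s``` and ```s_missing``` are illustrative examples, not data used elsewhere in this notebook:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

print(s.name)                  # the Series name
print(s.index)                 # the associated Index
print(type(s.to_numpy()))      # the data as a NumPy ndarray
print(s.iloc[0], s.loc[0])     # positional and label-based indexing

# Boolean summary attributes
print(s.hasnans, s.empty, s.is_unique)
print(s.is_monotonic_increasing, s.is_monotonic_decreasing)

# A missing value makes hasnans True
s_missing = pd.Series([1.1, np.nan, 3.1])
print(s_missing.hasnans)
```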

The ```pd.Series``` has additional methods. Many of these are statistical methods and have a similar name to functions found in the statistics module. There are also a large number of ```to``` methods which convert the ```pd.Series``` to another datatype or file format.

There are also lowercase classes:

* ```cat``` which groups together categorical identifiers
* ```dt``` which groups together datetime (datetime.datetime or datetime.timedelta) identifiers
* ```list``` which groups together list identifiers
* ```str``` which groups together str identifiers

The lowercase classes are appropriate when the ```pd.Series``` datatype matches one of the class types above.
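
A minimal sketch of three of these lowercase classes, using small made-up Series whose datatypes match each class:

```python
import pandas as pd

# str: string methods applied to every value
words = pd.Series(['the fat black cat', 'sat on the mat'], name='string')
upper = words.str.upper()

# dt: datetime components extracted from every value
dates = pd.Series(pd.to_datetime(['2023-07-24', '2023-07-25']), name='date')
days = dates.dt.day

# cat: information about a categorical Series
grades = pd.Series(['A', 'A', 'B'], dtype='category', name='category')
categories = grades.cat.categories

print(upper)
print(days)
print(categories)
```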

In [19]:
dir2(pd.Series, [np.ndarray, operator], unique_only=True, 
     exclude_external_modules=True, exclude_identifier_list=reverse_operators, drop_internal=True)

{'attribute': ['array',
               'at',
               'attrs',
               'axes',
               'dtypes',
               'empty',
               'hasnans',
               'iat',
               'iloc',
               'is_monotonic_decreasing',
               'is_monotonic_increasing',
               'is_unique',
               'loc',
               'name',
               'values'],
 'method': ['add_prefix',
            'add_suffix',
            'agg',
            'aggregate',
            'align',
            'apply',
            'asfreq',
            'asof',
            'at_time',
            'autocorr',
            'backfill',
            'between',
            'between_time',
            'bfill',
            'bool',
            'case_when',
            'combine',
            'combine_first',
            'compare',
            'convert_dtypes',
            'corr',
            'count',
            'cov',
            'cummax',
            'cummin',
            'describe',
    

A ```pd.DataFrame``` is essentially a grouping of ```pd.Series``` instances and therefore has many identifiers in common.

The attributes ```dtypes``` and ```values``` are used as plural equivalents of ```dtype``` and ```array``` seen in the ```pd.Series```. ```dtypes``` reflects the fact that each ```pd.Series``` in the ```pd.DataFrame``` instance can have a different datatype. When every ```pd.Series``` is numeric, ```pd.DataFrame.values``` is a numeric 2d ```ndarray``` instance. Most of the statistical methods will also only work when every ```pd.Series``` in the ```pd.DataFrame``` instance is numeric. It is more common to use statistical methods on a ```pd.Series```.
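
The difference between ```dtypes``` and ```values``` can be sketched with a small DataFrame that has one float column and one integer column. The data are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                   'y': [1, 2, 3, 4]})

# One datatype per column
print(df.dtypes)

# A single 2d ndarray; the integer column is upcast to a common float datatype
values = df.values
print(values.shape, values.dtype)
```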

In [21]:
dir2(pd.DataFrame, pd.Series, consistent_only=True, drop_internal=True)

{'attribute': ['at',
               'attrs',
               'axes',
               'dtypes',
               'empty',
               'flags',
               'iat',
               'iloc',
               'index',
               'loc',
               'ndim',
               'shape',
               'size',
               'values'],
 'constant': ['T'],
 'method': ['abs',
            'add',
            'add_prefix',
            'add_suffix',
            'agg',
            'aggregate',
            'align',
            'all',
            'any',
            'apply',
            'asfreq',
            'asof',
            'astype',
            'at_time',
            'backfill',
            'between_time',
            'bfill',
            'bool',
            'clip',
            'combine',
            'combine_first',
            'compare',
            'convert_dtypes',
            'copy',
            'corr',
            'count',
            'cov',
            'cummax',
            'cummin',
         

The ```pd.DataFrame``` class only has a small number of additional identifiers, such as the attribute ```columns```, which is a ```pd.Index``` of labels (usually ```str``` instances), each corresponding to the ```name``` of a ```pd.Series``` in the ```pd.DataFrame```. The ```pd.DataFrame``` instance is mutable and the ```insert``` method can be used to insert another ```pd.Series``` instance in place. ```join``` and ```merge``` are used to join and merge ```pd.DataFrame``` instances.
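
A brief sketch of ```columns```, ```insert``` and ```join```. The column names ```'z'``` and ```'w'``` and their values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                   'y': [1.2, 2.2, 3.2, 4.2]})
print(type(df.columns))

# insert mutates df in place: new column 'z' at position 1
df.insert(1, 'z', pd.Series([1.3, 2.3, 3.3, 4.3]))

# join returns a new DataFrame, aligning the rows on the shared index
other = pd.DataFrame({'w': [10, 20, 30, 40]})
joined = df.join(other)
print(joined)
```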

The ```style``` attribute is an additional attribute which can be used to control the style of the ```pd.DataFrame``` instance when displayed in a cell output.

In [22]:
dir2(pd.DataFrame, pd.Series, unique_only=True, drop_internal=True)

{'attribute': ['columns', 'style'],
 'method': ['applymap',
            'assign',
            'boxplot',
            'corrwith',
            'eval',
            'from_dict',
            'from_records',
            'insert',
            'isetitem',
            'iterrows',
            'itertuples',
            'join',
            'melt',
            'merge',
            'pivot',
            'pivot_table',
            'query',
            'select_dtypes',
            'set_index',
            'stack',
            'to_feather',
            'to_gbq',
            'to_html',
            'to_orc',
            'to_parquet',
            'to_records',
            'to_stata',
            'to_xml'],
 'datamodel_method': ['__arrow_c_stream__',
                      '__dataframe__',
                      '__dataframe_consortium_standard__']}


The identifiers unique to a ```pd.Series``` are those that only make sense for a single column of data. These include the lowercase classes used to work with a specific datatype across a ```pd.Series```:

In [23]:
dir2(pd.Series, pd.DataFrame, unique_only=True, drop_internal=True)

{'attribute': ['array',
               'dtype',
               'hasnans',
               'is_monotonic_decreasing',
               'is_monotonic_increasing',
               'is_unique',
               'name',
               'nbytes'],
 'method': ['argmax',
            'argmin',
            'argsort',
            'autocorr',
            'between',
            'case_when',
            'divmod',
            'factorize',
            'item',
            'ravel',
            'rdivmod',
            'repeat',
            'searchsorted',
            'to_frame',
            'to_list',
            'tolist',
            'unique',
            'view'],
 'lower_class': ['cat', 'dt', 'list', 'str', 'struct'],
 'datamodel_method': ['__column_consortium_standard__', '__float__', '__int__']}


Each ```pd.Series``` is associated with a ```pd.Index``` and every ```pd.Series``` in a ```pd.DataFrame``` uses the same ```pd.Index```. Many identifiers for the ```pd.Index``` can be seen to be consistent with a ```pd.Series``` as it is also based on an ```np.ndarray```:

In [24]:
dir2(pd.Index, pd.Series, consistent_only=True, exclude_external_modules=True, drop_internal=True)

{'attribute': ['array',
               'dtype',
               'empty',
               'hasnans',
               'is_monotonic_decreasing',
               'is_monotonic_increasing',
               'is_unique',
               'name',
               'nbytes',
               'ndim',
               'shape',
               'size',
               'values'],
 'constant': ['T'],
 'method': ['all',
            'any',
            'argmax',
            'argmin',
            'argsort',
            'asof',
            'astype',
            'copy',
            'diff',
            'drop',
            'drop_duplicates',
            'droplevel',
            'dropna',
            'duplicated',
            'equals',
            'factorize',
            'fillna',
            'groupby',
            'infer_objects',
            'isin',
            'isna',
            'isnull',
            'item',
            'map',
            'max',
            'memory_usage',
            'min',
            'notna',
      

The ```pd.Index``` class only has a small number of identifiers not found in a ```pd.Series``` instance. The ```pd.Index``` is typically numeric, but each value in the ```pd.Series``` can instead be given a label. The index itself can also have a ```name```, and the attribute ```names``` retrieves the name of each level of the index. Typically each label should be unique for the purpose of indexing; the attribute ```has_duplicates``` is used to check for duplicate labels, and the ```pd.Series``` or ```pd.DataFrame``` method ```reset_index``` can be used in this case to return a unique index of numeric values. More complicated indexes can be multilevel and the ```nlevels``` attribute gives the number of levels in the index:

In [25]:
dir2(pd.Index, pd.Series, unique_only=True, exclude_external_modules=True, drop_internal=True)

{'attribute': ['has_duplicates', 'inferred_type', 'names', 'nlevels'],
 'method': ['append',
            'asof_locs',
            'delete',
            'difference',
            'format',
            'get_indexer',
            'get_indexer_for',
            'get_indexer_non_unique',
            'get_level_values',
            'get_loc',
            'get_slice_bound',
            'holds_integer',
            'identical',
            'insert',
            'intersection',
            'is_',
            'is_boolean',
            'is_categorical',
            'is_floating',
            'is_integer',
            'is_interval',
            'is_numeric',
            'is_object',
            'join',
            'putmask',
            'set_names',
            'slice_indexer',
            'slice_locs',
            'sort',
            'sortlevel',
            'symmetric_difference',
            'to_flat_index',
            'to_series',
            'union'],
 'datamodel_method': ['__array_wrap__']}
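
A minimal sketch of these index attributes, using a made-up Series whose labels contain a duplicate. The index name ```'label'``` is an assumption for illustration:

```python
import pandas as pd

# 'b' appears twice, so the labels are not unique
index = pd.Index(['a', 'b', 'b'], name='label')
s = pd.Series([1.1, 2.1, 3.1], index=index, name='x')

print(s.index.names)
print(s.index.has_duplicates)
print(s.index.nlevels)

# reset_index moves the labels into a column and
# returns a DataFrame with a new unique numeric index
df_reset = s.reset_index()
print(df_reset)
```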

## Series

The initialisation signature for the ```pd.Series``` class can be examined:

In [26]:
pd.Series?

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool | lib.NoDefault' = <no_default>,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the i

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied, an empty ```Series``` instance with no index values, no name and the generic ```object``` datatype is instantiated:

In [27]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a numpy 1darray:

In [28]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

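Since an ndarray itself is initialised from a list, this can be abbreviated to the following. Note that pandas then infers ```int64``` from the Python ```int``` values, whereas the NumPy array above used the platform default, here ```int32```: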

In [29]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64

When ```dtype=None```, the data type will be inferred from the data:

In [30]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [31]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2024-02-13 13:44:22.235566
1   2024-02-14 13:44:22.235566
2   2024-02-15 13:44:22.235566
dtype: datetime64[ns]

In [32]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a ```str``` in it is classed as non-numeric and has the generic dtype ```object``` (meaning it can be any Python object).

The dtype can be manually overridden when supplying the numpy 1darray by using the ```np.array``` input argument ```dtype```:

In [33]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or alternatively by using the ```Series``` keyword input argument ```dtype```:

In [34]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32

Notice that the index is zero-ordered numeric in integer steps of 1 by default. This can be manually changed by use of the keyword input argument ```index``` and providing an ```Index``` instance, ```ndarray``` instance or ```list``` instance of index values:

In [35]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32

A ```Series``` usually also has a ```name```:

In [36]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32

Normally the ```data``` and ```name``` are supplied and the ```index``` and ```dtype``` are inferred:

In [37]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

## DataFrame

The initialisation signature for the ```pd.DataFrame``` class can be examined:

In [38]:
pd.DataFrame?

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------
d

The keyword input arguments for a ```DataFrame``` instance are similar to those for a ```Series``` instance. Because a ```DataFrame``` is a collection of ```Series```, ```data``` and ```columns``` are plural. The ```index``` is shared by every ```Series``` and ```dtype``` accepts only a single datatype, which is applied to every ```Series```:

* data (plural)
* index (singular)
* columns (plural of name)
* dtype (singular)

In [45]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The ```dtype``` has to be supplied as a single datatype. A ```tuple``` such as ```(np.float64, np.float64)``` may be accepted, but it is not read as one datatype per ```Series``` instance, and a ```list``` of datatypes raises a ```TypeError```. To give each ```Series``` instance a different datatype, the ```astype``` method can be used with a mapping of column name to datatype.
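
As far as the pandas documentation describes it, the ```DataFrame``` keyword input argument ```dtype``` accepts a single datatype. A sketch of assigning a different datatype to each column afterwards with ```astype``` follows; the choice of ```np.float32``` for ```'x'``` is arbitrary:

```python
import numpy as np
import pandas as pd

# A single dtype applies to every column
df = pd.DataFrame(data=[[1.1, 1.2],
                        [2.1, 2.2]],
                  columns=('x', 'y'),
                  dtype=np.float64)

# astype with a mapping casts each named column separately
converted = df.astype({'x': np.float32, 'y': np.float64})
print(converted.dtypes)
```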

Once again normally the ```dtype``` and ```index``` are inferred:

In [46]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


It is common to supply ```columns``` and ```data``` in the form of a mapping. The ```key``` should be a ```str``` instance which will become the column name and the ```value``` should be a ```np.ndarray``` (1d) or ```list``` instance which corresponds to the data for that ```pd.Series```:

In [47]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|
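
The row-based form and the mapping form describe the same table. A small sketch comparing the two, with ```list``` instances as the mapping values:

```python
import pandas as pd

# Row-by-row data with separate column labels
from_rows = pd.DataFrame(data=[[1.1, 1.2],
                               [2.1, 2.2],
                               [3.1, 3.2],
                               [4.1, 4.2]],
                         columns=['x', 'y'])

# Column-by-column data supplied as a mapping
from_mapping = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                             'y': [1.2, 2.2, 3.2, 4.2]})

same = from_rows.equals(from_mapping)
print(same)
```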


## Reading Data from Files

The ```Series``` and ```DataFrame``` instances previously examined were created from scratch using ```builtins``` datatypes and numpy arrays. It is also common to read data in from another source. The ```pandas``` library therefore has a number of functions for reading in data from external files. The function names all have a ```read_``` prefix followed by the file type:

In [51]:
for identifier in dir(pd):
    if identifier.startswith('read_'):
        print(identifier)

read_clipboard
read_csv
read_excel
read_feather
read_fwf
read_gbq
read_hdf
read_html
read_json
read_orc
read_parquet
read_pickle
read_sas
read_spss
read_sql
read_sql_query
read_sql_table
read_stata
read_table
read_xml


Some of the more common formats will be explored.

### Comma Separated Values File

CSV is an abbreviation for Comma Separated Values. 

The file format uses:
* ```,``` as a column separator, which is where the name comma separated values comes from
* ```\n``` as a row separator

A csv with escape characters explicitly shown looks as follows:

```powershell
string,integer,bool,float,date,time,category\n
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A\n
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A\n
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B\n
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B\n
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B\n
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B\n
```

This can be written to a csv file:

In [66]:
%%writefile ./files/Book1.csv
string,integer,bool,float,date,time,category
the fat black cat,4,TRUE,0.86,24/07/2023,11:36:00,A
sat on the mat,4,TRUE,0.86,25/07/2023,12:36:00,A
"twinkle, twinkle",2,TRUE,-1.14,26/07/2023,13:36:00,B
little star,2,TRUE,-1.14,27/07/2023,14:36:00,B
how I wonder,3,FALSE,-0.14,28/07/2023,15:36:00,B
what you are,4,TRUE,0.86,29/07/2023,16:36:00,B

Overwriting ./files/Book1.csv


When opened in a program such as Microsoft Excel, these delimiters display as a grid:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Notice that the comma in ```twinkle, twinkle``` is not a delimiter but part of the ```str```. For this reason ```"twinkle, twinkle"``` is enclosed in double quotation marks.

The ```pd.read_csv``` function is used to read in a csv file returning a ```DataFrame``` instance:

In [74]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols:

The ```pd.read_csv``` function has a large number of input parameters. Note that most of these are keyword-only input parameters with a default value that is consistent with the usual layout of a csv file. When the file is in the expected format, only ```filepath_or_buffer``` needs to be specified and this is normally provided positionally:

In [77]:
df = pd.read_csv('./files/Book1.csv')

In [76]:
df

| |string|integer|bool|float|date|time|category|
|---|---|---|---|---|---|---|---|
|0|the fat black cat|4|True|0.86|24/07/2023|11:36:00|A|
|1|sat on the mat|4|True|0.86|25/07/2023|12:36:00|A|
|2|twinkle, twinkle|2|True|-1.14|26/07/2023|13:36:00|B|
|3|little star|2|True|-1.14|27/07/2023|14:36:00|B|
|4|how I wonder|3|False|-0.14|28/07/2023|15:36:00|B|
|5|what you are|4|True|0.86|29/07/2023|16:36:00|B|


Notice the ```Series``` names in the file are in the expected format and taken from the first line. A csv does not have an index by default and so a ```RangeIndex``` has automatically been generated.

In [78]:
df.axes

[RangeIndex(start=0, stop=6, step=1),
 Index(['string', 'integer', 'bool', 'float', 'date', 'time', 'category'], dtype='object')]
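
The handling of the quoted comma can also be checked without writing a file. This sketch reads a shortened, made-up csv ```str``` through an ```io.StringIO``` instance, which ```pd.read_csv``` accepts in place of a file path:

```python
import io

import pandas as pd

csv_text = (
    'string,integer,bool\n'
    'the fat black cat,4,TRUE\n'
    '"twinkle, twinkle",2,FALSE\n'
)

df = pd.read_csv(io.StringIO(csv_text))

# The quoted comma stays inside the value instead of starting a new column
print(df.shape)
print(df.loc[1, 'string'])
print(df['bool'].tolist())
```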

### Tab Delimited Text File

A tab delimited text file normally has the file extension txt and is very similar to a csv file, but uses ```\t``` instead of ```,``` as a delimiter:

```powershell
string\tinteger\tboolean\tfloatingpoint\tdate\ttime\tcategory\n
the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA\n
sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA\n
"twinkle, twinkle"\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB\n
little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB\n
how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB\n
what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB\n
```

In [79]:
%%writefile ./files/Book2.txt
string	integer	boolean	floatingpoint	date	time	category
the fat black cat	4	TRUE	0.86	24/07/2023	11:36:00	A
sat on the mat	4	TRUE	0.86	25/07/2023	12:36:00	A
"twinkle, twinkle"	2	TRUE	-1.14	26/07/2023	13:36:00	B
little star	2	TRUE	-1.14	27/07/2023	14:36:00	B
how I wonder	3	FALSE	-0.14	28/07/2023	15:36:00	B
what you are	4	TRUE	0.86	29/07/2023	16:36:00	B

Overwriting ./files/Book2.txt


The same ```pd.read_csv``` function is used to read in a txt file. However this function by default looks for a ```,``` as a delimiter to move onto the next column and as it is not present, the data is all shown in a single column:

In [80]:
df = pd.read_csv('./files/Book2.txt')

In [81]:
df

| |string\tinteger\tboolean\tfloatingpoint\tdate\ttime\tcategory|
|---|---|
|0|the fat black cat\t4\tTRUE\t0.86\t24/07/2023\t11:36:00\tA|
|1|sat on the mat\t4\tTRUE\t0.86\t25/07/2023\t12:36:00\tA|
|2|twinkle, twinkle\t2\tTRUE\t-1.14\t26/07/2023\t13:36:00\tB|
|3|little star\t2\tTRUE\t-1.14\t27/07/2023\t14:36:00\tB|
|4|how I wonder\t3\tFALSE\t-0.14\t28/07/2023\t15:36:00\tB|
|5|what you are\t4\tTRUE\t0.86\t29/07/2023\t16:36:00\tB|


If the delimiter is specified as ```'\t'``` the data will instead be read in properly:

In [139]:
df = pd.read_csv('./files/Book2.txt', delimiter='\t')

In [142]:
df

| |string|integer|boolean|floatingpoint|date|time|category|
|---|---|---|---|---|---|---|---|
|0|the fat black cat|4|True|0.86|24/07/2023|11:36:00|A|
|1|sat on the mat|4|True|0.86|25/07/2023|12:36:00|A|
|2|twinkle, twinkle|2|True|-1.14|26/07/2023|13:36:00|B|
|3|little star|2|True|-1.14|27/07/2023|14:36:00|B|
|4|how I wonder|3|False|-0.14|28/07/2023|15:36:00|B|
|5|what you are|4|True|0.86|29/07/2023|16:36:00|B|
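
The effect of the delimiter can be sketched in memory with a shortened, made-up tab delimited ```str```. ```pd.read_table```, listed among the ```read_``` functions earlier, uses ```'\t'``` as its default separator, and ```sep``` is an alternative name for ```delimiter```:

```python
import io

import pandas as pd

tsv_text = (
    'string\tinteger\n'
    'the fat black cat\t4\n'
    '"twinkle, twinkle"\t2\n'
)

# Default comma delimiter: every line collapses into a single column
one_column = pd.read_csv(io.StringIO(tsv_text))

# Tab delimiter: the two columns are separated as intended
split = pd.read_csv(io.StringIO(tsv_text), sep='\t')

# read_table assumes a tab delimiter by default
table = pd.read_table(io.StringIO(tsv_text))

print(one_column.shape, split.shape)
```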


### JavaScript Object Notation

If the data is in the form of a JSON ```str``` instance:

In [144]:
json_string = '{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":"24\\/07\\/2023","1":"25\\/07\\/2023","2":"26\\/07\\/2023","3":"27\\/07\\/2023","4":"28\\/07\\/2023","5":"29\\/07\\/2023"},"time":{"0":"11:36:00","1":"12:36:00","2":"13:36:00","3":"14:36:00","4":"15:36:00","5":"16:36:00"},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'

It can be wrapped in a ```StringIO``` instance:

In [147]:
import io

In [149]:
io.StringIO(json_string)

<_io.StringIO at 0x221d995bb80>

And then read in using the ```pd.read_json``` function:

In [148]:
pd.read_json(io.StringIO(json_string))

| |string|integer|boolean|floatingpoint|date|time|category|
|---|---|---|---|---|---|---|---|
|0|the fat black cat|4|True|0.86|2023-07-24|11:36:00|A|
|1|sat on the mat|4|True|0.86|2023-07-25|12:36:00|A|
|2|twinkle, twinkle|2|True|-1.14|2023-07-26|13:36:00|B|
|3|little star|2|True|-1.14|2023-07-27|14:36:00|B|
|4|how I wonder|3|False|-0.14|2023-07-28|15:36:00|B|
|5|what you are|4|True|0.86|2023-07-29|16:36:00|B|
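
A JSON ```str``` in the layout used above can be produced with the ```to_json``` method, so a round trip can be sketched as follows. The two-row DataFrame is a made-up example:

```python
import io

import pandas as pd

df = pd.DataFrame({'string': ['little star', 'how I wonder'],
                   'integer': [2, 3]})

# Column name -> {index label -> value}, the same layout as json_string above
json_string = df.to_json()
print(json_string)

# Wrap the str in StringIO and read it back
restored = pd.read_json(io.StringIO(json_string))
print(restored)
```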


### Microsoft Excel File

A Microsoft Excel file, with the file extension .xlsx (or .xls for older files), is a collection of sheets. The data in each individual sheet is similar to an individual csv file, although obscured because Excel applies formatting options to the data:

<img src='./images/img_002.png' alt='img_002' width='800'/>

The related function ```read_excel``` is used to read in the data from an Excel file. The delimiter is predefined in an Excel file. However, the file can have multiple sheets, so the keyword input argument ```sheet_name``` is available. It defaults to ```0```, the first sheet, which is named ```'Sheet1'``` by default in a new Excel workbook:

In [84]:
pd.read_excel?

Signature:
pd.read_excel(
    io,
    sheet_name: 'str | int | list[IntStrT] | None' = 0,
    *,
    header: 'int | Sequence[int] | None' = 0,
    names: 'SequenceNotStr[Hashable] | range | None' = None,
    index_col: 'int | str | Sequence[int] | None' = None,
    usecols: 'int | str | Sequence[int] | Sequence[str] | Callable[[str], bool] | None' = None,
    dtype: 'DtypeArg | None' = None,
    engine:

In [85]:
df = pd.read_excel('./files/Book3.xlsx', sheet_name='Sheet1')

In [86]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


Sheets in an Excel file are ordered, and the ordering of the sheets is analogous to a ```RangeIndex``` (```'Sheet1'``` corresponds to an index of ```0``` because of zero-based indexing). The ```sheet_name``` parameter is often supplied positionally:

In [87]:
df = pd.read_excel('./files/Book3.xlsx', 0)

In [88]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


In a .xlsx file, a column is normally formatted for the appropriate datatype, for example a datetime format, and the .xlsx file contains this formatting information. This information can be used by the ```read_excel``` function:

In [94]:
df = pd.read_excel('./files/Book3.xlsx', 0, parse_dates=True)

In [95]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,16:36:00,B


The date column is parsed correctly as a datetime, but the time column is not parsed as a time-based datatype and is left as ```object```:

In [96]:
df.dtypes

string                   object
integer                   int64
boolean                    bool
floatingpoint           float64
date             datetime64[ns]
time                     object
category                 object
dtype: object

The ```df['time']``` ```pd.Series``` instance can be cast into a ```pd.Series``` of ```str``` instances:

In [122]:
df['time'].astype('str')

0    11:36:00
1    12:36:00
2    13:36:00
3    14:36:00
4    15:36:00
5    16:36:00
Name: time, dtype: object

The ```pd.Series``` method ```astype``` has limited support for converting to a ```timedelta64``` and instead the ```pd``` function ```to_timedelta``` needs to be used:

In [123]:
pd.to_timedelta(df['time'].astype('str'))

0   0 days 11:36:00
1   0 days 12:36:00
2   0 days 13:36:00
3   0 days 14:36:00
4   0 days 15:36:00
5   0 days 16:36:00
Name: time, dtype: timedelta64[ns]
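
As a minimal, self-contained sketch of this conversion (the two sample times and dates below are made up for illustration and are not read from the Excel file), ```datetime.time``` values can be cast to ```str``` and then converted with ```pd.to_timedelta```. The resulting values can be added to a date to give a full timestamp:

```python
import datetime
import pandas as pd

# Illustrative stand-in for the 'time' column, which holds datetime.time objects
times = pd.Series([datetime.time(11, 36), datetime.time(12, 36)], name='time')

# Cast to str first, then convert the 'HH:MM:SS' text to timedeltas
time_strings = times.astype('str')
deltas = pd.to_timedelta(time_strings)

# Adding a timedelta to a date gives a timestamp
dates = pd.Series(pd.to_datetime(['2023-07-24', '2023-07-25']))
timestamps = dates + deltas
print(timestamps)
```

Storing the time of day as a ```timedelta64``` is a design choice that suits this kind of arithmetic, since the value is an offset from midnight.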

This can be reassigned to the ```pd.Series``` instance name ```df['time']```. This will update the ```df``` instance in place:

In [124]:
df['time'] = pd.to_timedelta(df['time'].astype('str'))

In [125]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,0 days 11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,0 days 12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,0 days 13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,0 days 14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,0 days 15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,0 days 16:36:00,B


In [126]:
df.dtypes

string                    object
integer                    int64
boolean                     bool
floatingpoint            float64
date              datetime64[ns]
time             timedelta64[ns]
category                  object
dtype: object

The ```df['category']``` ```pd.Series``` instance can be cast to the ```category``` datatype, which is backed by a ```pd.Categorical``` instance:

In [131]:
df['category'].astype('category')

0    A
1    A
2    B
3    B
4    B
5    B
Name: category, dtype: category
Categories (2, object): ['A', 'B']

In [132]:
df['category'] = df['category'].astype('category')

In [133]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,0 days 11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,0 days 12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,0 days 13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,0 days 14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,0 days 15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,0 days 16:36:00,B


In [134]:
df.dtypes

string                    object
integer                    int64
boolean                     bool
floatingpoint            float64
date              datetime64[ns]
time             timedelta64[ns]
category                category
dtype: object
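
To illustrate what the ```category``` datatype stores, the following sketch uses a small stand-alone ```pd.Series``` (its values mirror the category column above but it is constructed here for illustration). Each value is held as an integer code that points into the list of categories:

```python
import pandas as pd

labels = pd.Series(['A', 'A', 'B', 'B', 'B', 'B'], name='category')
categorical = labels.astype('category')

# The distinct values are stored once as the categories
categories = categorical.cat.categories.tolist()

# Each element is stored as an integer code pointing into the categories
codes = categorical.cat.codes.tolist()

print(categories)
print(codes)
```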

To reorder the columns, indexing is quite commonly used:

In [136]:
df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,0 days 11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,0 days 12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,0 days 13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,0 days 14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,0 days 15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,0 days 16:36:00,B


The instance name ```df``` can be reassigned to this output:

In [137]:
df = df[['string', 'integer', 'boolean', 'floatingpoint', 'date', 'time', 'category']]

In [138]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24 11:30:00,0 days 11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25 00:00:00,0 days 12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26 00:00:00,0 days 13:36:00,B
3,little star,2,True,-1.14,2023-07-27 00:00:00,0 days 14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28 00:00:00,0 days 15:36:00,B
5,what you are,4,True,0.86,2023-07-29 00:00:00,0 days 16:36:00,B
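
A smaller sketch of the same indexing, assuming a hypothetical three-column ```pd.DataFrame``` named ```frame```, shows that a ```list``` of names both reorders and selects columns, and that the original instance is unchanged until its name is reassigned:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.1, 2.1], 'y': [1.2, 2.2], 'z': [1.3, 2.3]})

# Indexing with a list of column names returns a new DataFrame in that order
reordered = frame[['z', 'x', 'y']]

# The same form of indexing can select a subset of the columns
subset = frame[['y', 'x']]

print(reordered.columns.tolist())
print(subset.columns.tolist())
print(frame.columns.tolist())
```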


## Writing DataFrames to Objects and Files

The ```DataFrame``` class has a number of methods beginning with ```to_```, most of which write to an object or a file:

In [152]:
for identifier in dir(df):
    if identifier.startswith('to_'):
        print(identifier)

to_clipboard
to_csv
to_dict
to_excel
to_feather
to_gbq
to_hdf
to_html
to_json
to_latex
to_markdown
to_numpy
to_orc
to_parquet
to_period
to_pickle
to_records
to_sql
to_stata
to_string
to_timestamp
to_xarray
to_xml


### Python Dictionary

A ```pd.DataFrame``` instance can be written to a ```dict``` instance using:

In [153]:
df.to_dict()

{'string': {0: 'the fat black cat',
  1: 'sat on the mat',
  2: 'twinkle, twinkle',
  3: 'little star',
  4: 'how I wonder',
  5: 'what you are'},
 'integer': {0: 4, 1: 4, 2: 2, 3: 2, 4: 3, 5: 4},
 'boolean': {0: True, 1: True, 2: True, 3: True, 4: False, 5: True},
 'floatingpoint': {0: 0.86, 1: 0.86, 2: -1.14, 3: -1.14, 4: -0.14, 5: 0.86},
 'date': {0: '24/07/2023',
  1: '25/07/2023',
  2: '26/07/2023',
  3: '27/07/2023',
  4: '28/07/2023',
  5: '29/07/2023'},
 'time': {0: '11:36:00',
  1: '12:36:00',
  2: '13:36:00',
  3: '14:36:00',
  4: '15:36:00',
  5: '16:36:00'},
 'category': {0: 'A', 1: 'A', 2: 'B', 3: 'B', 4: 'B', 5: 'B'}}

This is read into a ```pd.DataFrame``` instance using:

In [154]:
pd.DataFrame(data=df.to_dict())

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B
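
The ```to_dict``` method also accepts an ```orient``` keyword that changes the layout of the ```dict```. The following sketch, using a small illustrative ```pd.DataFrame``` named ```frame```, compares the default layout with ```orient='list'``` and ```orient='records'```:

```python
import pandas as pd

frame = pd.DataFrame({'x': [1.1, 2.1], 'y': [1.2, 2.2]})

# Default layout: {column name: {index label: value}}
by_column = frame.to_dict()

# orient='list': {column name: [values]}
as_lists = frame.to_dict(orient='list')

# orient='records': one dict per row
as_records = frame.to_dict(orient='records')

# The default layout can be supplied back to the DataFrame initialiser
restored = pd.DataFrame(data=by_column)
print(restored)
```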


### Comma Separated Values File

A ```pd.DataFrame``` can be written to a file using:

In [159]:
df.to_csv('./files/Book4.csv')

Notice that the index was written to the file as an unnamed first column. If the file is read into a ```pd.DataFrame``` instance using the defaults, this column is read in as a ```pd.Series``` named ```'Unnamed: 0'``` and a new index is added:

In [161]:
pd.read_csv('./files/Book4.csv')

Unnamed: 0.1,Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,5,what you are,4,True,0.86,29/07/2023,16:36:00,B


This can be assigned to the ```pd.Index``` instance associated with the ```pd.DataFrame``` using the keyword input argument ```index_col```:

In [162]:
pd.read_csv('./files/Book4.csv', index_col=0)

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B


Alternatively the ```pd.DataFrame``` instance can be exported without the ```pd.Index``` information:

In [163]:
df.to_csv?

Signature:
df.to_csv(
    path_or_buf: 'FilePath | WriteBuffer[bytes] | WriteBuffer[str] | None' = None,
    *,
    sep: 'str' = ',',
    na_rep: 'str' = '',
    float_format: 'str | Callable | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'bool_t | list[str]' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None'

In [164]:
df.to_csv('./files/Book5.csv', index=False)
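
The effect of the ```index``` keyword can be checked without writing a file. In this sketch, which uses a small illustrative ```pd.DataFrame``` named ```frame```, ```path_or_buf``` is left as ```None``` so that ```to_csv``` returns the csv text as a ```str```, which is then read back through an ```io.StringIO``` instance:

```python
import io
import pandas as pd

frame = pd.DataFrame({'x': [1.1, 2.1], 'y': [1.2, 2.2]})

# With path_or_buf left as None, to_csv returns the csv text as a str
with_index = frame.to_csv()
without_index = frame.to_csv(index=False)

# Reading the text back shows how the exported index is treated
columns_default = pd.read_csv(io.StringIO(with_index)).columns.tolist()
columns_index_col = pd.read_csv(io.StringIO(with_index), index_col=0).columns.tolist()
columns_no_index = pd.read_csv(io.StringIO(without_index)).columns.tolist()

print(columns_default)
print(columns_index_col)
print(columns_no_index)
```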

### Tab Delimited Text File

To save to a text file, the separator needs to be specified:

In [165]:
df.to_csv('./files/Book6.txt', sep='\t', index=False)

This is read into a ```pd.DataFrame``` instance using:

In [166]:
pd.read_csv('./files/Book6.txt', sep='\t')

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B


### JavaScript Object Notation

A ```pd.DataFrame``` can be written to a JSON ```str``` using:

In [155]:
df.to_json()

'{"string":{"0":"the fat black cat","1":"sat on the mat","2":"twinkle, twinkle","3":"little star","4":"how I wonder","5":"what you are"},"integer":{"0":4,"1":4,"2":2,"3":2,"4":3,"5":4},"boolean":{"0":true,"1":true,"2":true,"3":true,"4":false,"5":true},"floatingpoint":{"0":0.86,"1":0.86,"2":-1.14,"3":-1.14,"4":-0.14,"5":0.86},"date":{"0":"24\\/07\\/2023","1":"25\\/07\\/2023","2":"26\\/07\\/2023","3":"27\\/07\\/2023","4":"28\\/07\\/2023","5":"29\\/07\\/2023"},"time":{"0":"11:36:00","1":"12:36:00","2":"13:36:00","3":"14:36:00","4":"15:36:00","5":"16:36:00"},"category":{"0":"A","1":"A","2":"B","3":"B","4":"B","5":"B"}}'

Note that this is the same form that can be used to instantiate a ```pd.DataFrame``` instance:

In [158]:
pd.read_json(io.StringIO(df.to_json()))

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,2023-07-24,11:36:00,A
1,sat on the mat,4,True,0.86,2023-07-25,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,2023-07-26,13:36:00,B
3,little star,2,True,-1.14,2023-07-27,14:36:00,B
4,how I wonder,3,False,-0.14,2023-07-28,15:36:00,B
5,what you are,4,True,0.86,2023-07-29,16:36:00,B
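
A shorter round trip makes the structure of the JSON ```str``` easier to see. This sketch assumes a small illustrative ```pd.DataFrame``` named ```frame```. Note that the index labels are written as JSON ```str``` keys, because JSON object keys are always strings:

```python
import io
import pandas as pd

frame = pd.DataFrame({'string': ['a', 'b'], 'integer': [1, 2]})

# Each column becomes a JSON object keyed by the index label
json_string = frame.to_json()
print(json_string)

# Wrapping the str in StringIO lets read_json treat it like a file
restored = pd.read_json(io.StringIO(json_string))
print(restored)
```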


### Microsoft Excel File

Recall that an Excel file has sheets. Suppose there are three ```pd.DataFrame``` instances to be written to separate sheets. For convenience these can be created by indexing into the ```pd.DataFrame``` instance ```df``` using a ```list``` of ```str``` instances, where each ```str``` instance corresponds to the name of a ```pd.Series```:

In [167]:
df

Unnamed: 0,string,integer,boolean,floatingpoint,date,time,category
0,the fat black cat,4,True,0.86,24/07/2023,11:36:00,A
1,sat on the mat,4,True,0.86,25/07/2023,12:36:00,A
2,"twinkle, twinkle",2,True,-1.14,26/07/2023,13:36:00,B
3,little star,2,True,-1.14,27/07/2023,14:36:00,B
4,how I wonder,3,False,-0.14,28/07/2023,15:36:00,B
5,what you are,4,True,0.86,29/07/2023,16:36:00,B


In [168]:
df1 = df[['string', 'integer', 'time']]

In [177]:
df1

Unnamed: 0,string,integer,time
0,the fat black cat,4,11:36:00
1,sat on the mat,4,12:36:00
2,"twinkle, twinkle",2,13:36:00
3,little star,2,14:36:00
4,how I wonder,3,15:36:00
5,what you are,4,16:36:00


In [170]:
df2 = df[['string', 'integer', 'boolean']]

In [171]:
df2

Unnamed: 0,string,integer,boolean
0,the fat black cat,4,True
1,sat on the mat,4,True
2,"twinkle, twinkle",2,True
3,little star,2,True
4,how I wonder,3,False
5,what you are,4,True


In [174]:
df3 = df[['string', 'date', 'category']]

In [175]:
df3

Unnamed: 0,string,date,category
0,the fat black cat,24/07/2023,A
1,sat on the mat,25/07/2023,A
2,"twinkle, twinkle",26/07/2023,B
3,little star,27/07/2023,B
4,how I wonder,28/07/2023,B
5,what you are,29/07/2023,B


The ```pd.DataFrame``` method ```to_excel``` allows the writing of multiple ```pd.DataFrame``` instances to individual sheets within an Excel File:

In [178]:
df.to_excel?

Signature:
df.to_excel(
    excel_writer: 'FilePath | WriteExcelBuffer | ExcelWriter',
    *,
    sheet_name: 'str' = 'Sheet1',
    na_rep: 'str' = '',
    float_format: 'str | None' = None,
    columns: 'Sequence[Hashable] | None' = None,
    header: 'Sequence[Hashable] | bool_t' = True,
    index: 'bool_t' = True,
    index_label: 'IndexLabel | None' = None

To write ```pd.DataFrame``` instances to multiple sheets, a ```pd.ExcelWriter``` instance has to be instantiated with the path of the Excel file to create:

In [179]:
writer = pd.ExcelWriter(path='./files/Book7.xlsx')

The ```DataFrame``` method ```to_excel``` can then be used to instruct the ```ExcelWriter``` instance to write each ```pd.DataFrame``` instance to a specified sheet:

In [180]:
df1.to_excel(excel_writer=writer, sheet_name='sheet1')
df2.to_excel(excel_writer=writer, sheet_name='sheet2')
df3.to_excel(excel_writer=writer, sheet_name='sheet3')

Details about the sheets being written can be seen using the ```ExcelWriter``` attribute ```sheets``` which is a mapping where the key is the sheet name and the value is the sheet being written:

In [181]:
writer.sheets

{'sheet1': <xlsxwriter.worksheet.Worksheet at 0x221db0e2030>,
 'sheet2': <xlsxwriter.worksheet.Worksheet at 0x221d9cce450>,
 'sheet3': <xlsxwriter.worksheet.Worksheet at 0x221d774fe60>}

Finally the ```ExcelWriter``` instance can be closed. This saves the file and releases the Excel spreadsheet from Python:

In [182]:
writer.close()

<img src='./images/img_003.png' alt='img_003' width='800'/>

<img src='./images/img_004.png' alt='img_004' width='800'/>

<img src='./images/img_005.png' alt='img_005' width='800'/>

The identifiers of the ```ExcelWriter``` class can be examined:

In [184]:
dir2(pd.ExcelWriter, drop_internal=True)

{'attribute': ['book',
               'date_format',
               'datetime_format',
               'engine',
               'if_sheet_exists',
               'sheets',
               'supported_extensions'],
 'method': ['check_extension', 'close'],
 'datamodel_attribute': ['__annotations__',
                         '__dict__',
                         '__doc__',
                         '__module__',
                         '__orig_bases__',
                         '__parameters__',
                         '__weakref__'],
 'datamodel_method': ['__class__',
                      '__class_getitem__',
                      '__delattr__',
                      '__dir__',
                      '__enter__',
                      '__eq__',
                      '__exit__',
                      '__format__',
                      '__fspath__',
                      '__ge__',
                      '__getattribute__',
                      '__getstate__',
                      '__gt__',


Notice it has the datamodel identifiers ```__enter__``` and ```__exit__``` which means it can be used within a ```with``` code block. The ```with``` code block will automatically close the ```ExcelWriter``` instance when the block ends and is the safest way to create the file, write multiple sheets to it and close the file:

In [185]:
with pd.ExcelWriter('./files/Book8.xlsx') as writer:  
    df1.to_excel(writer, sheet_name='df1', index=False)
    df2.to_excel(writer, sheet_name='df2', index=False)
    df3.to_excel(writer, sheet_name='df3', index=False)
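
The behaviour of ```__enter__``` and ```__exit__``` can be sketched with a small stand-in class. ```DemoWriter``` below is hypothetical and is not part of ```pandas```; it only mimics the ```close``` method and ```sheets``` attribute to show when each datamodel method is called:

```python
class DemoWriter:
    """Hypothetical stand-in for a writer object; not part of pandas."""

    def __init__(self):
        self.closed = False
        self.sheets = {}

    def write(self, sheet_name, data):
        self.sheets[sheet_name] = data

    def close(self):
        self.closed = True

    def __enter__(self):
        # The value returned here is bound to the name after 'as'
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Called when the block ends, even if an exception was raised
        self.close()
        return False  # do not suppress exceptions


with DemoWriter() as writer:
    writer.write('df1', [1, 2, 3])
    closed_inside_block = writer.closed

closed_after_block = writer.closed
print(closed_inside_block, closed_after_block)
```

Because ```__exit__``` runs even when the block raises an exception, the ```with``` form avoids leaving a file open if one of the writes fails.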

### Markdown

A ```pd.DataFrame``` instance can be exported into a markdown ```str``` using:

In [191]:
df.to_markdown()

'|    | string            |   integer | boolean   |   floatingpoint | date       | time     | category   |\n|---:|:------------------|----------:|:----------|----------------:|:-----------|:---------|:-----------|\n|  0 | the fat black cat |         4 | True      |            0.86 | 24/07/2023 | 11:36:00 | A          |\n|  1 | sat on the mat    |         4 | True      |            0.86 | 25/07/2023 | 12:36:00 | A          |\n|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 26/07/2023 | 13:36:00 | B          |\n|  3 | little star       |         2 | True      |           -1.14 | 27/07/2023 | 14:36:00 | B          |\n|  4 | how I wonder      |         3 | False     |           -0.14 | 28/07/2023 | 15:36:00 | B          |\n|  5 | what you are      |         4 | True      |            0.86 | 29/07/2023 | 16:36:00 | B          |'

When this is printed, it has the following form:

In [190]:
print(df.to_markdown())

|    | string            |   integer | boolean   |   floatingpoint | date       | time     | category   |
|---:|:------------------|----------:|:----------|----------------:|:-----------|:---------|:-----------|
|  0 | the fat black cat |         4 | True      |            0.86 | 24/07/2023 | 11:36:00 | A          |
|  1 | sat on the mat    |         4 | True      |            0.86 | 25/07/2023 | 12:36:00 | A          |
|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 26/07/2023 | 13:36:00 | B          |
|  3 | little star       |         2 | True      |           -1.14 | 27/07/2023 | 14:36:00 | B          |
|  4 | how I wonder      |         3 | False     |           -0.14 | 28/07/2023 | 15:36:00 | B          |
|  5 | what you are      |         4 | True      |            0.86 | 29/07/2023 | 16:36:00 | B          |


And when this printed format is copied into a markdown cell, it looks like:

|    | string            |   integer | boolean   |   floatingpoint | date       | time     | category   |
|---:|:------------------|----------:|:----------|----------------:|:-----------|:---------|:-----------|
|  0 | the fat black cat |         4 | True      |            0.86 | 24/07/2023 | 11:36:00 | A          |
|  1 | sat on the mat    |         4 | True      |            0.86 | 25/07/2023 | 12:36:00 | A          |
|  2 | twinkle, twinkle  |         2 | True      |           -1.14 | 26/07/2023 | 13:36:00 | B          |
|  3 | little star       |         2 | True      |           -1.14 | 27/07/2023 | 14:36:00 | B          |
|  4 | how I wonder      |         3 | False     |           -0.14 | 28/07/2023 | 15:36:00 | B          |
|  5 | what you are      |         4 | True      |            0.86 | 29/07/2023 | 16:36:00 | B          |

## Series Identifiers

The identifiers for a ```pd.Series``` are:

In [216]:
dir2(pd.Series, drop_internal=True)

{'attribute': ['array',
               'at',
               'attrs',
               'axes',
               'dtype',
               'dtypes',
               'empty',
               'flags',
               'hasnans',
               'iat',
               'iloc',
               'index',
               'is_monotonic_decreasing',
               'is_monotonic_increasing',
               'is_unique',
               'loc',
               'name',
               'nbytes',
               'ndim',
               'shape',
               'size',
               'values'],
 'constant': ['T'],
 'method': ['abs',
            'add',
            'add_prefix',
            'add_suffix',
            'agg',
            'aggregate',
            'align',
            'all',
            'any',
            'apply',
            'argmax',
            'argmin',
            'argsort',
            'asfreq',
            'asof',
            'astype',
            'at_time',
            'autocorr',
            'backfill',
  

The initialisation signature for a ```pd.Series``` can be examined:

In [228]:
pd.Series?

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool | lib.NoDefault' = <no_default>,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the i

A ```pd.Series``` can be instantiated by supplying a ```list``` of values as ```data``` and providing a ```name```. An ```index``` can optionally be supplied, otherwise a ```RangeIndex``` is used:

In [478]:
t = pd.Series(name='t')
u = pd.Series([[], [], [], []], name='u')
v = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')
w = pd.Series([-1.1, 2.2, -3.1, 4.1], name='w')
x = pd.Series([1.1, -2.1, -2.1, None], name='x', dtype=float)
y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], name='y')
z = pd.Series(['A', 'B', 'C', None], index=['a', 'b', 'c', 'd'], name='z')

### Attributes

In [479]:
t, u, v, w, x, y, z

(Series([], Name: t, dtype: object),
 0    []
 1    []
 2    []
 3    []
 Name: u, dtype: object,
 0    1.1
 1    2.2
 2    3.1
 3    4.1
 Name: v, dtype: float64,
 0   -1.1
 1    2.2
 2   -3.1
 3    4.1
 Name: w, dtype: float64,
 0    1.1
 1   -2.1
 2   -2.1
 3    NaN
 Name: x, dtype: float64,
 a    A
 b    A
 c    B
 d    B
 Name: y, dtype: object,
 a       A
 b       B
 c       C
 d    None
 Name: z, dtype: object)

In [468]:
u.array

<NumpyExtensionArray>
[[], [], [], []]
Length: 4, dtype: object

In [471]:
u.values

array([list([]), list([]), list([]), list([])], dtype=object)

In [469]:
u.dtype

dtype('O')

In [470]:
u.name

'u'

In [472]:
u.dtypes

dtype('O')

In [473]:
u.shape

(4,)
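
The attributes above can be summarised for a numeric ```pd.Series``` in a short sketch. The instance ```series``` is constructed here for illustration and has the same values as ```v```:

```python
import pandas as pd

series = pd.Series([1.1, 2.2, 3.1, 4.1], name='v')

name = series.name                    # label of the Series
shape = series.shape                  # 1-element tuple, a Series has one dimension
size = series.size                    # number of elements
ndim = series.ndim                    # always 1 for a Series
values = series.values                # underlying numpy array
index_labels = series.index.tolist()  # default RangeIndex labels

print(name, shape, size, ndim, series.dtype)
print(values)
print(index_labels)
```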

### Boolean Attributes

In [474]:
v, x

(0    1.1
 1    2.2
 2    3.1
 3    4.1
 Name: v, dtype: float64,
 0    1.1
 1   -2.1
 2   -2.1
 3    NaN
 Name: x, dtype: float64)

In [475]:
v.hasnans

False

In [476]:
x.hasnans

True

In [477]:
u.empty

False

In [450]:
v.empty, w.empty, x.empty, y.empty, z.empty

(False, False, False, False, False)

In [451]:
v.is_unique, w.is_unique, x.is_unique, y.is_unique, z.is_unique

(True, True, False, False, True)

In [452]:
v.is_monotonic_increasing, w.is_monotonic_increasing, x.is_monotonic_increasing, y.is_monotonic_increasing, z.is_monotonic_increasing

(True, False, False, True, False)

In [453]:
v.is_monotonic_decreasing, w.is_monotonic_decreasing, x.is_monotonic_decreasing, y.is_monotonic_decreasing, z.is_monotonic_decreasing

(False, False, False, False, False)
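
The meaning of each boolean attribute can be sketched with three illustrative instances (constructed here, with values chosen to trigger each case):

```python
import pandas as pd

increasing = pd.Series([1.1, 2.2, 3.1, 4.1])
repeated_with_nan = pd.Series([1.1, -2.1, -2.1, None], dtype=float)
no_values = pd.Series([], dtype=float)

# hasnans: True when at least one value is missing
print(increasing.hasnans, repeated_with_nan.hasnans)

# is_unique: False when any value occurs more than once
print(increasing.is_unique, repeated_with_nan.is_unique)

# is_monotonic_increasing: True when no value is smaller than the one before it
print(increasing.is_monotonic_increasing, repeated_with_nan.is_monotonic_increasing)

# empty: True only when there are no elements at all
print(increasing.empty, no_values.empty)
```

Note that a ```pd.Series``` whose elements are themselves empty lists, such as ```u```, is not empty because it still has four elements.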

### Indexing

In [371]:
x

0    1.1
1   -2.1
2   -2.1
3    NaN
Name: x, dtype: float64

In [372]:
x.loc[0]

1.1

In [373]:
x.iloc[0]

1.1

In [374]:
x.at[0]

1.1

In [375]:
x.iat[0]

1.1

In [387]:
z

a    A
b    B
c    C
d    D
Name: z, dtype: object

In [388]:
z.loc['a']

'A'

In [389]:
z.iloc[0]

'A'

In [390]:
z.at['a']

'A'

In [391]:
z.iat[0]

'A'
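
With the default ```RangeIndex``` the label and the position of an element coincide, which hides the difference between the indexers. The following sketch uses an illustrative ```pd.Series``` with the integer labels ```10```, ```20``` and ```30``` to separate the two: ```loc``` and ```at``` select by label, while ```iloc``` and ```iat``` select by position:

```python
import pandas as pd

series = pd.Series(['A', 'B', 'C'], index=[10, 20, 30], name='letters')

by_label = series.loc[10]      # label 10
by_position = series.iloc[0]   # position 0
scalar_by_label = series.at[20]
scalar_by_position = series.iat[1]

# Slices differ: loc includes the stop label, iloc excludes the stop position
label_slice = series.loc[10:20].tolist()
position_slice = series.iloc[0:1].tolist()

print(by_label, by_position, scalar_by_label, scalar_by_position)
print(label_slice, position_slice)
```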

### String Methods

In [392]:
z

a    A
b    B
c    C
d    D
Name: z, dtype: object

In [393]:
dir2(z.str, drop_internal=True)

{'method': ['capitalize',
            'casefold',
            'cat',
            'center',
            'contains',
            'count',
            'decode',
            'encode',
            'endswith',
            'extract',
            'extractall',
            'find',
            'findall',
            'fullmatch',
            'get',
            'get_dummies',
            'index',
            'isalnum',
            'isalpha',
            'isdecimal',
            'isdigit',
            'islower',
            'isnumeric',
            'isspace',
            'istitle',
            'isupper',
            'join',
            'len',
            'ljust',
            'lower',
            'lstrip',
            'match',
            'normalize',
            'pad',
            'partition',
            'removeprefix',
            'removesuffix',
            'repeat',
            'replace',
            'rfind',
            'rindex',
            'rjust',
            'rpartition',
            'rspl

In [396]:
z.str.lower()

a    a
b    b
c    c
d    d
Name: z, dtype: object

In [395]:
z.apply(lambda string: 5 * string)

a    AAAAA
b    BBBBB
c    CCCCC
d    DDDDD
Name: z, dtype: object

In [412]:
5 * z

a    AAAAA
b    BBBBB
c    CCCCC
d    DDDDD
Name: z, dtype: object
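
The ```str``` accessor applies a string method to every element. As a short sketch with an illustrative ```pd.Series``` named ```letters```, the repetition above can also be written with the accessor method ```repeat```, and accessor calls can be chained:

```python
import pandas as pd

letters = pd.Series(['A', 'B', 'C'], index=['a', 'b', 'c'], name='z')

lowered = letters.str.lower()      # element-wise str.lower
repeated = letters.str.repeat(5)   # element-wise repetition
lengths = repeated.str.len()       # element-wise len

print(lowered.tolist())
print(repeated.tolist())
print(lengths.tolist())
```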

### Numeric Methods

In [414]:
w

0    1.1
1    2.2
2    3.1
3    4.1
Name: w, dtype: float64

In [415]:
5 * w

0     5.5
1    11.0
2    15.5
3    20.5
Name: w, dtype: float64

In [455]:
w

0   -1.1
1    2.2
2   -3.1
3    4.1
Name: w, dtype: float64

In [456]:
abs(w)

0    1.1
1    2.2
2    3.1
3    4.1
Name: w, dtype: float64

In [457]:
w.abs()

0    1.1
1    2.2
2    3.1
3    4.1
Name: w, dtype: float64

In [462]:
import statistics

In [463]:
statistics.mean(w)

0.5249999999999999

In [464]:
w.mean()

0.5249999999999999
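
One practical difference between the ```pd.Series``` methods and general-purpose functions is the treatment of missing values. In this sketch, using an illustrative ```pd.Series``` named ```values```, element-wise operations propagate ```NaN``` while ```mean``` skips it unless ```skipna=False``` is supplied:

```python
import pandas as pd

values = pd.Series([1.0, -2.0, None], dtype=float)

scaled = 5 * values        # element-wise, NaN stays NaN
absolute = values.abs()    # element-wise, NaN stays NaN

mean_skipping_nan = values.mean()            # NaN ignored by default
mean_with_nan = values.mean(skipna=False)    # NaN propagates

print(scaled.tolist())
print(absolute.tolist())
print(mean_skipping_nan, mean_with_nan)
```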

### Categorical Methods

In [397]:
y

a    A
b    A
c    B
d    B
Name: y, dtype: object

In [402]:
y = y.astype('category')

In [403]:
y

a    A
b    A
c    B
d    B
Name: y, dtype: category
Categories (2, object): ['A', 'B']

In [404]:
dir2(y.cat, drop_internal=True)

{'attribute': ['categories', 'codes', 'ordered'],
 'method': ['add_categories',
            'as_ordered',
            'as_unordered',
            'remove_categories',
            'remove_unused_categories',
            'rename_categories',
            'reorder_categories',
            'set_categories'],
 'datamodel_attribute': ['__annotations__',
                         '__dict__',
                         '__doc__',
                         '__frozen',
                         '__module__',
                         '__weakref__'],
 'datamodel_method': ['__class__',
                      '__delattr__',
                      '__dir__',
                      '__eq__',
                      '__format__',
                      '__ge__',
                      '__getattribute__',
                      '__getstate__',
                      '__gt__',
                      '__hash__',
                      '__init__',
                      '__init_subclass__',
                      '__le__',
 

In [406]:
y = y.cat.as_ordered()

In [407]:
y

a    A
b    A
c    B
d    B
Name: y, dtype: category
Categories (2, object): ['A' < 'B']

In [410]:
y == 'A'

a     True
b     True
c    False
d    False
Name: y, dtype: bool

In [411]:
y[y == 'A']

a    A
b    A
Name: y, dtype: category
Categories (2, object): ['A' < 'B']
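
Ordering the categories is what allows comparisons other than equality. A minimal sketch, which reconstructs ```y``` under the illustrative name ```ordered```:

```python
import pandas as pd

y = pd.Series(['A', 'A', 'B', 'B'], index=['a', 'b', 'c', 'd'], name='y')
ordered = y.astype('category').cat.as_ordered()

is_ordered = ordered.cat.ordered

# Less-than comparisons use the category order 'A' < 'B'
below_b = ordered < 'B'

# min and max are also defined for ordered categories
smallest, largest = ordered.min(), ordered.max()

print(is_ordered)
print(below_b.tolist())
print(smallest, largest)
```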

In [458]:
v + w

0    0.0
1    4.4
2    0.0
3    8.2
dtype: float64

In [461]:
z

a       A
b       B
c       C
d    None
Name: z, dtype: object

In [460]:
z + z

a     AA
b     BB
c     CC
d    NaN
Name: z, dtype: object

## DataFrame Identifiers

If the following dataframe is constructed:

In [None]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

A large number of identifiers are common to a ```DataFrame``` and a ```Series``` instance, including almost all of the datamodel identifiers. These identifiers operate across 2 dimensions in a ```DataFrame``` instance instead of along the 1 dimension of a ```Series```:

In [None]:
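# xseries is the Series compared against df; defined here so the cell runs in order
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
print('datamodel attribute:', end=' ')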
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_only_intersection_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_only_intersection_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_only_intersection_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_only_intersection_identifiers=True)

The ```Series``` has ```Series``` specific attributes which are not available for a ```DataFrame``` instance. The datamodel methods in a ```Series``` not present in a ```DataFrame``` are for type-casting:

In [None]:
print('datamodel attribute:', end=' ')
print_identifier_group(xseries, kind='datamodel_attribute', second=df, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(xseries, kind='datamodel_method', second=df, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(xseries, kind='attribute', second=df, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(xseries, kind='function', second=df, show_unique_identifiers=True)

The ```DataFrame``` instead has ```DataFrame``` specific attributes such as the name of each ```Series``` in the ```DataFrame```. The ```DataFrame``` also has supplementary methods such as ```insert``` which is used to insert a ```Series``` instance into a ```DataFrame``` instance or ```join``` and ```merge``` used to join or merge ```DataFrame``` instances respectively. The datamodel methods in a ```DataFrame``` not present in a ```Series``` are for type-casting (to a ```DataFrame```):

In [None]:
print('datamodel attribute:', end=' ')
print_identifier_group(df, kind='datamodel_attribute', second=xseries, show_unique_identifiers=True)
print('datamodel method:', end=' ')
print_identifier_group(df, kind='datamodel_method', second=xseries, show_unique_identifiers=True)
print('attribute:', end=' ')
print_identifier_group(df, kind='attribute', second=xseries, show_unique_identifiers=True)
print('method:', end=' ')
print_identifier_group(df, kind='function', second=xseries, show_unique_identifiers=True)

Notice the ```columns``` attribute returns an ```Index``` of the names of each ```Series``` in the ```DataFrame```:

In [None]:
df.columns

Since the following conditions are satisfied:

In [None]:
'x'.isidentifier()

In [None]:
'y'.isidentifier()

And these identifier names don't clash with any of the other ```DataFrame``` identifiers, the following become ```DataFrame``` attributes and correspond to each ```Series``` in the ```DataFrame```:

In [None]:
df.x

In [None]:
df.y
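The clash case can be sketched as follows. This is a minimal illustration, assuming a hypothetical ```DataFrame``` named ```clash_df``` with a ```Series``` called ```'size'```, which is also the name of an existing ```DataFrame``` attribute:

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame: 'size' is both a Series name and an existing attribute
clash_df = pd.DataFrame({'x': np.array([1.1, 2.1]),
                         'size': np.array([10, 20])})

x_series = clash_df.x             # no clash: returns the Series 'x'
size_attribute = clash_df.size    # clash: returns the number of elements
size_series = clash_df['size']    # square brackets always return the Series
```

The existing attribute takes priority, so a ```Series``` with a clashing name is only reachable by indexing with square brackets.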

## Mutability

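The ```Series``` and ```DataFrame``` classes are mutable Collections, meaning they have the datamodel identifier ```__getitem__``` (dunder getitem), which reads an item, as well as the mutating identifiers ```__setitem__``` (dunder setitem) and ```__delitem__``` (dunder delitem). An ```Index``` is immutable and item assignment on it raises a ```TypeError```: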

In [None]:
'__getitem__' in dir(pd.Series)

In [None]:
'__setitem__' in dir(pd.Series)

In [None]:
'__delitem__' in dir(pd.Series)

This means the following ```Series``` can be indexed into:

In [None]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

In [None]:
xseries

Recall the datamodel ```__getitem__``` (dunder getitem) defines how a ```Collection``` responds to indexing using square brackets:

In [None]:
xseries[0]

Recall that the mutable method ```__setitem__``` (dunder setitem) defines how a ```MutableCollection``` responds to indexing using square brackets followed by assignment to a new value:

In [None]:
xseries[0] = None

In [None]:
xseries

Recall that the mutable method ```__delitem__``` (dunder delitem) defines how a ```MutableCollection``` responds to a ```del``` statement of an element indexing using square brackets:

In [None]:
del xseries[2]

In [None]:
xseries
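The same operations behave differently on an ```Index```. The sketch below, using hypothetical instances ```idx``` and ```ser```, attempts item assignment on both; an ```Index``` rejects it with a ```TypeError``` while a ```Series``` accepts it:

```python
import pandas as pd

idx = pd.Index([0, 1, 2, 3])
try:
    idx[0] = 10                       # an Index does not support item assignment
    index_assignment_allowed = True
except TypeError:
    index_assignment_allowed = False

ser = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
ser[0] = 9.9                          # __setitem__: value at label 0 is replaced
del ser[2]                            # __delitem__: label 2 is removed
remaining_labels = list(ser.index)
```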

Despite the ```NDArray```, ```Series``` and ```DataFrame``` being mutable datatypes, most of their methods are immutable by default. If the docstring of the method ```dropna``` is examined:

In [None]:
xseries.dropna?

Notice it has the keyword input argument ```inplace```. ```inplace``` has the default value of ```False```, making the method immutable by default, and it therefore returns a new ```Series```:

In [None]:
xseries.dropna() # Return value

In [None]:
xseries # Unchanged

When ```inplace``` is set to ```True``` the method becomes mutable:

In [None]:
xseries.dropna(inplace=True) # No return value

In [None]:
xseries # Modified inplace

The same behaviour can be seen on the method ```reset_index```:

In [None]:
xseries.reset_index?

With the default values this method is immutable and returns a ```DataFrame```, since the old index is added as the first ```Series```:

In [None]:
xseries.reset_index() # Return value

If the ```drop``` keyword input argument is set to ```True```, a ```Series``` will instead be returned:

In [None]:
xseries.reset_index(drop=True) # Return value

Once again the ```inplace``` keyword input argument can be assigned to ```True``` making the method mutable:

In [None]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [None]:
xseries # Modified inplace

The following ```Series``` methods have the parameter ```inplace``` and are therefore immutable by default but are mutable when this parameter is assigned to ```True```:

In [None]:
print_identifier_group(xseries, kind='function', has_parameter='inplace')

Notice that most of these are used to fill, interpolate or drop values along a ```Series``` in response to missing data. 

```sort_values``` for example can be used to sort the values along a ```Series```, by default ```inplace=False``` and the method is immutable:

In [None]:
xseries.sort_values(ascending=False) # Return value

Recall when an immutable method is used with assignment, the new value returned on the right of the assignment operator is assigned to the instance name or label on the left of the assignment operator. If the instance name is conceptualised as a label, then a reassignment peels the label from the original instance and places it on the new instance created:

In [None]:
xseries = xseries.sort_values(ascending=False)

In [None]:
xseries

On the other hand when a method is mutable, there is no return value and the ```Series``` is updated inplace:

In [None]:
xseries.sort_values(ascending=True, inplace=True) # No return value

In [None]:
xseries

If assignment is used with a mutable method, the return value of the method is ```None``` and therefore ```None``` will be assigned to the ```new_label```:

In [None]:
new_label = xseries.sort_values(ascending=True, inplace=True) 

In [None]:
new_label

Reassignment with the ```inplace``` parameter set to ```True``` should therefore be avoided, as the value being reassigned will be ```None```:

In [None]:
xseries = xseries.sort_values(ascending=True, inplace=True) 

In [None]:
xseries

By convention immutable methods have a ```return``` value and mutable methods have no ```return``` value. An exception to this is the mutable method ```pop``` which returns the popped value and mutates the ```Series``` in place:

In [None]:
xseries = pd.Series([4.1, 2.1, 3.1, 1.1], name='x')

In [None]:
xseries

In [None]:
xseries.pop(item=1) # Return value

In [None]:
xseries # Mutated

The methods that have consistent names to the mutable methods in a ```list``` will also be mutable with no ```return``` value. Most of the other methods are immutable and have a ```return``` value.

## Axis

Another common keyword is ```axis```:

In [None]:
print_identifier_group(xseries, kind='function', has_parameter='axis')

A ```Series``` is a column and only has a single ```axis``` available, ```0```. The operation can be conceptualised as sorting the data in the rows by use of the ```Series``` name and therefore ```axis``` can also be assigned to the ```str``` instance ```'rows'```:

In [None]:
xseries.sort_values(ascending=True, axis=0)

In [None]:
xseries.sort_values(ascending=True, axis='rows')

For a ```DataFrame``` there are two values for ```axis```, ```0``` which is the default and ```1```:

In [None]:
df = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                   'y': np.array([6.2, 7.0, 2.1, 1.2])},
                   index=['a', 'b', 'c', 'd'])

In [None]:
df

The default ```axis``` is ```0``` which is equivalent to the ```str``` instance ```'rows'```. This is an instruction to sort the data in the rows ```by``` the ordering of the data in the columns:

In [None]:
df.sort_values(by=['x', 'y'], axis='rows')

Notice that the data is sorted in ascending order by ```'x'``` and, where two rows have the same value in ```'x'```, these rows are sorted by ```'y'```. The method is immutable by default, so ```df``` itself is unchanged:

In [None]:
df

The ```axis``` can be changed to ```1``` which is equivalent to the ```str``` instance ```'columns'```. This is an instruction to sort the data in the columns ```by``` the ordering of the data in the index:

In [None]:
df.sort_values(by=['c', 'd'], axis='columns')

The data is sorted in ascending order first by ```'c'``` but the data in the two ```Series``` instances ```'x'``` and ```'y'``` have the same value 2.1 so there is no instruction to specify the order of the ```Series```. The next index value ```'d'``` is used and the value in the ```Series``` instance ```y``` is 1.2 and the ```Series``` instance ```'x'``` is 4.1, therefore ```'y'``` is ordered before ```'x'```.
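The two sort orders described above can be checked with a short sketch. The names ```df_sort```, ```row_order``` and ```column_order``` are introduced here purely for illustration, and the integer forms ```0``` and ```1``` of ```axis``` are used:

```python
import numpy as np
import pandas as pd

df_sort = pd.DataFrame({'x': np.array([5.1, 2.1, 2.1, 4.1]),
                        'y': np.array([6.2, 7.0, 2.1, 1.2])},
                       index=['a', 'b', 'c', 'd'])

# axis=0: reorder the rows by the values in Series 'x', ties broken by 'y'
row_order = list(df_sort.sort_values(by=['x', 'y'], axis=0).index)

# axis=1: reorder the columns by the values in row 'c', ties broken by row 'd'
column_order = list(df_sort.sort_values(by=['c', 'd'], axis=1).columns)
```

Here ```row_order``` is ```['c', 'b', 'd', 'a']``` and ```column_order``` is ```['y', 'x']```.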

In the ```NDArray``` negative indexes are quite commonly used to select an ```axis```. These are not used for the ```Series``` (1D) and ```DataFrame``` (2D) instances, which are of fixed dimensions.

## Indexing and Slicing

Supposing the following dictionary instance is instantiated:

In [None]:
mapping = {'x': np.array([1.1, 2.1, 3.1, 4.1]),
           'y': np.array([1.2, 2.2, 3.2, 4.2])}

In [None]:
mapping

A ```DataFrame``` instance can be instantiated by assigning the ```mapping``` to the keyword input argument ```data```:

In [None]:
df = pd.DataFrame(data=mapping)

In [None]:
df

A ```mapping``` can be indexed with a ```key```. This returns the ```value``` the ```key``` references, in this case the ```NDArray```:

In [None]:
mapping['x']

Analogously, when a ```DataFrame``` is indexed using the ```name``` of a ```Series```, the ```Series``` is returned:

In [None]:
df['x']

A value in the ```NDArray``` instance can be indexed by use of a second set of square brackets to enclose the numeric index:

In [None]:
mapping['x'][1]

Analogously, a ```value``` in the ```Series``` can be indexed by use of a second set of square brackets to enclose the numeric index:

In [None]:
df['x'][1]

If the DataFrame instance is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

The first set of brackets select the Series:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

And the second set of brackets selects the index retrieving the value:

2.1

If the DataFrame is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

Sometimes the value for each ```Series``` at a value within the ```Index``` instance is desired:

|index|'x'|'y'|
|---|---|---|
|1|2.1|2.2|

This is done by use of the property location ```loc```. Note that ```loc``` returns the above *row* as a ```Series``` which is displayed by default as a *column*:

|index|1|
|---|---|
|'x'|2.1|
|'y'|2.2|

```loc``` is callable and has a docstring:

In [None]:
callable(df.loc)

In [None]:
df.loc?

However unlike most callables it is not normally called using parentheses:

In [None]:
df.loc

In [None]:
df.loc()

Instead ```loc``` is a property. Under the hood it uses syntactic sugar around the datamodel method ```__getitem__``` that switches the order of indexing from the default ```[column, index]``` to ```[index, column]```:

In [None]:
df.loc[1]

In [None]:
df.loc[1]['x']
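The ```[index, column]``` ordering also allows both selectors to be supplied in a single set of square brackets. A minimal sketch, assuming a hypothetical ```DataFrame``` named ```df_demo``` with the default numeric index:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

column_first = df_demo['x'][1]     # [column][index]
chained = df_demo.loc[1]['x']      # loc[index] then [column]
single = df_demo.loc[1, 'x']       # loc[index, column] in one step
```

All three expressions retrieve the same value. The single-step form avoids creating the intermediate ```Series```.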

```loc``` can also use a list of index values:

In [None]:
df.loc[[0, 2]]

The related property integer location ```iloc``` always uses integer positions rather than index labels. Because the positions are numeric, positional operations such as slicing behave as they do for a ```list```:

In [None]:
df.iloc[[0, 2]]

In [None]:
df.iloc[0:2]
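One difference worth checking is how the two properties treat the stop value of a slice. In the sketch below (```df_demo``` is a hypothetical instance with the default numeric index), a ```loc``` slice is label based and includes the stop label, while an ```iloc``` slice is position based and excludes the stop position:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

label_slice = df_demo.loc[0:2]       # labels 0, 1 and 2 (stop label included)
position_slice = df_demo.iloc[0:2]   # positions 0 and 1 (stop position excluded)
```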

If the following DataFrame instance is created with index labels i.e. a non-numeric index:

|index|'x'|'y'|
|---|---|---|
|'a'|1.1|1.2|
|'b'|2.1|2.2|
|'c'|3.1|3.2|
|'d'|4.1|4.2|

In [None]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data=mapping)

In [None]:
df

The difference between ```loc``` and ```iloc``` can be seen more clearly. For ```loc``` the index label is used:

In [None]:
df.loc['b']

Despite the labels being non-numeric ```iloc``` handles the index values numerically:

In [None]:
df.iloc[1]

Conceptually, ```iloc``` behaves as if it were indexing into the ```DataFrame``` instance with its index reset:

In [None]:
df.reset_index(drop=True)

In [None]:
df.iloc[1]
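The distinction matters most when the index labels are themselves integers that do not start at zero. A minimal sketch, assuming a hypothetical ```DataFrame``` named ```df_labels``` with the labels 10, 20 and 30:

```python
import numpy as np
import pandas as pd

df_labels = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1])},
                         index=[10, 20, 30])

by_label = df_labels.loc[20, 'x']       # row with the label 20
by_position = df_labels.iloc[0, 0]      # first row, first column

try:
    df_labels.loc[0]                    # there is no row with the label 0
    label_zero_found = True
except KeyError:
    label_zero_found = False
```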

When ```loc``` and ```iloc``` are used to select a single index, the data for each ```Series``` at this index is itself displayed as a ```Series```:

In [None]:
df.loc['b']

In [None]:
df.iloc[1]

Because each of the above are a ```Series``` instance, they can in turn be indexed into:

In [None]:
df.loc['b']['y']

In [None]:
df.iloc[1]['y']

When ```iloc``` and ```loc``` are instead used to select data from multiple indexes a ```DataFrame``` instance is output:

In [None]:
df.loc[['a', 'b']]

In [None]:
df.iloc[0:2]

And because each of these is a ```DataFrame``` instance, the ```Series``` within the ```DataFrame``` instance can then be indexed using the ```Series``` name:

In [None]:
df.loc[['a', 'b']]['x']

In [None]:
df.iloc[0:2]['x']

```at``` is used for a scalar selector and requires both the index and the ```Series``` name: 

In [None]:
df.at['a', 'y']

The related integer at ```iat``` is also a scalar selector and requires both the index and column to be specified as integers:

In [None]:
df.iat[0, 1]

Conceptually, this is like casting the ```DataFrame``` to an ```NDArray``` (2D) and indexing a value from it:

In [None]:
df.to_numpy()

In [None]:
df.to_numpy()[0, 1]

To recap, for a ```DataFrame``` instance:

* ```__getitem__``` selects a ```Series``` by default
* ```loc``` and ```iloc``` change the behaviour to select an observation, ```loc``` by its ```Index``` label and ```iloc``` by its integer position
* ```at``` and ```iat``` select a scalar element


```loc``` can also be used to add a new observation to the ```DataFrame``` instance:

In [None]:
df

In [None]:
df.loc['f'] = {'x': 6.1, 'y': 6.2}

In [None]:
df.loc['e'] = {'x': 5.1, 'y': 5.2}

The ordering of rows (also known as observations) follows the insertion order: 

In [None]:
df

The ```DataFrame``` method ```sort_index``` can be used to reorder the index: 

In [None]:
df.sort_index(inplace=True)

In [None]:
df # modified inplace

The ```Index``` instance can also be reset to a numeric index using the ```DataFrame``` method ```reset_index```:

In [None]:
df.reset_index(drop=True, inplace=True)

In [None]:
df # modified inplace

The length of the ```DataFrame``` gives the number of rows (observations):

In [None]:
len(df)

Python uses zero-based indexing, so the numeric ```Index``` starts at ```0``` (inclusive) and stops at ```len(df)``` (exclusive).

```iloc``` cannot be used to index into an index value that doesn't exist and cannot be used to add a new observation. However ```loc``` can be used to add a numeric index using the ```len``` of the ```DataFrame``` instance:

In [None]:
df.loc[len(df)] = {'x': 7.1, 'y': 7.2}

In [None]:
df
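The difference between the two properties when enlarging can be sketched as follows. ```df_demo``` and ```iloc_enlarged``` are names introduced for illustration only:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1]),
                        'y': np.array([1.2, 2.2])})

try:
    # position 2 does not exist yet, so iloc refuses the assignment
    df_demo.iloc[len(df_demo)] = {'x': 3.1, 'y': 3.2}
    iloc_enlarged = True
except IndexError:
    iloc_enlarged = False

# loc creates the new label 2 and appends the observation
df_demo.loc[len(df_demo)] = {'x': 3.1, 'y': 3.2}
```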

## DataFrame Properties

Supposing the following ```DataFrame``` is instantiated to ```df```:

In [None]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

In [None]:
df

The ```DataFrame``` instance has the following dimension related properties. The attribute ```empty``` returns a boolean that is ```True``` only with an empty DataFrame:

In [None]:
df.empty

In [None]:
pd.DataFrame(None).empty

A ```DataFrame``` instance has a length, which is returned by the ```builtins``` function ```len```. This was seen previously to correspond to the number of rows (number of observations):

In [None]:
len(df)

A ```DataFrame``` instance has the attribute ```shape``` which is a ```tuple``` of dimensions. The 1st dimension is the number of rows (observations in the index) and the 2nd value is the number of ```Series``` (columns):

In [None]:
df.shape

A ```DataFrame``` instance has the attribute ```ndim``` which gives the number of dimensions and is always ```2```:

In [None]:
df.ndim

Recall this is equivalent to the length of the ```shape``` ```tuple```:

In [None]:
len(df.shape)

The ```DataFrame``` instance has a ```size``` attribute which is the product of the elements in the ```shape``` ```tuple```:

In [None]:
df.size
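The relationships between these dimension properties can be verified with a short sketch, using a hypothetical instance ```df_demo``` with 7 rows and 2 columns:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

nrows, ncols = df_demo.shape                      # (7, 2)
length_matches = len(df_demo) == nrows            # len gives the number of rows
ndim_matches = df_demo.ndim == len(df_demo.shape) # always 2 for a DataFrame
size_matches = df_demo.size == nrows * ncols      # 7 * 2 = 14 elements
```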

The ```index``` attribute returns the ```Index``` instance associated with the ```DataFrame```. An ```Index``` instance has a single dimension that can either be depicted as a row or a column. The output below displays this as a row although the index itself is conventionally depicted as a column when incorporated as part of a ```DataFrame```:

In [None]:
df.index

When no ```index``` is supplied during ```DataFrame``` instantiation a ```RangeIndex``` is automatically generated:

In [None]:
df2 = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                         'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df2.index

The ```columns``` attribute also returns an ```Index``` instance corresponding to the names of each ```Series``` in the ```DataFrame```:

In [None]:
df.columns

The attribute ```axes``` returns a 2 element list, where the first element is the ```index``` attribute and the second element is the ```columns``` attribute:

In [None]:
df.axes

The attribute ```values``` returns the values in the ```DataFrame``` in the form of a ```NDArray``` (2D):

In [None]:
df.values

The attribute ```dtypes``` returns a ```Series``` containing the datatype of each ```Series``` in the ```DataFrame```:

In [None]:
df.dtypes

The ```Series``` instances ```x``` and ```y``` are each of the datatype ```float64```. The ```dtype: object``` shown at the bottom of the output is the datatype of the returned ```Series```, which holds datatype objects; it is not a datatype of the ```DataFrame``` instance ```df``` itself.
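A minimal sketch of what ```dtypes``` returns, assuming a hypothetical ```DataFrame``` named ```df_demo``` with one ```float64``` and one ```int64``` ```Series```:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1], dtype=np.float64),
                        'n': np.array([1, 2], dtype=np.int64)})

dtypes = df_demo.dtypes        # a Series indexed by the column names
x_dtype = dtypes['x']          # datatype of the Series 'x'
n_dtype = dtypes['n']          # datatype of the Series 'n'
container_dtype = dtypes.dtype # datatype of the returned Series itself
```

Each entry is the datatype of one ```Series```, while ```container_dtype``` is ```object``` because the returned ```Series``` stores datatype objects.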

Each existing ```Series``` is accessible as an attribute:

In [None]:
df.x

In [None]:
df.y

The formal representation of the ```DataFrame``` instance ```df``` can be examined in a cell:

In [None]:
df

The attribute ```style``` can instead be used to display a ```DataFrame``` instance using a specific style. The default style is shown:

In [None]:
df.style

This ```style``` attribute has a number of stackable methods which return a modified ```style``` and can therefore be stacked to apply custom formatting:

In [None]:
print_identifier_group(df.style, kind='function')

In [None]:
df.style.format(precision=3).set_caption('DataFrame Instance')

The attribute ```attrs``` is an empty dictionary by default and is designed to store metadata associated with the ```DataFrame``` instance:

In [None]:
df.attrs

This metadata can include a text description giving information about how the data was collected or contain a link to where the data was sourced from:

In [None]:
df.attrs = {'description': 'this DataFrame was instantiated from a dict',
            'documentation': r'https://pandas.pydata.org/docs/getting_started/index.html'}

The ```DataFrame``` method ```info``` gives information about the ```DataFrame```:

In [None]:
df.info()

The ```DataFrame``` method ```describe``` gives supplementary descriptive statistics on each numeric ```Series```:

In [None]:
df.describe()

The ```DataFrame``` methods ```head``` and ```tail``` give the first 5 and last 5 observations by default and are usually used to preview a large ```DataFrame``` instance:

In [None]:
df

In [None]:
df.head()

In [None]:
df.tail()

The number of observations ```n``` defaults to ```5``` and can be changed:

In [None]:
df.head(n=3)

The ```DataFrame``` method ```nunique``` gives the number of unique values in each ```Series```:

In [None]:
df.nunique()

## Attribute Access - Dictionary Syntax vs Dot Syntax

If the following ```DataFrame``` is instantiated to ```df```:

In [None]:
df = pd.DataFrame(index = np.array(['a', 'b', 'c', 'd']),
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

Each ```Series``` can be accessed from the ```DataFrame``` instance ```df``` by indexing into ```df``` using the corresponding ```Series``` ```name``` enclosed in square brackets. This style of ```Series``` access is analogous to retrieving a value from a ```dict``` by use of its ```key```:

In [None]:
df['x']

Since the following is ```True```:

In [None]:
'x'.isidentifier()

```x``` becomes an attribute and can also be accessed using:

In [None]:
df.x

In [None]:
df.x is df['x']

A ```Series``` ```name``` only becomes an attribute of a DataFrame **after** it is instantiated and if it is a valid identifier:

In [None]:
df['z1'] = np.array([1.3, 2.3, 2.3, 2.4])

In [None]:
df

In [None]:
df.z1

A ```UserWarning``` displays if the dot syntax is used in an attempt to create a new ```Series```, and no new ```Series``` is added to the ```df``` instance.
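This behaviour can be sketched by recording warnings during the assignment. The names ```df_demo``` and ```z2``` are hypothetical, and the standard library ```warnings``` module is used to capture the warning instead of displaying it:

```python
import warnings

import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': np.array([1.1, 2.1])})

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    df_demo.z2 = np.array([1.3, 2.3])   # dot syntax: does not create a Series

warned = any(issubclass(w.category, UserWarning) for w in caught)
columns_after = list(df_demo.columns)    # still only 'x'

df_demo['z2'] = np.array([1.3, 2.3])     # square brackets: creates the Series
columns_final = list(df_demo.columns)
```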

Notice when an invalid identifier is used as the ```name``` for a new ```Series```:

In [None]:
'1'.isidentifier()

In [None]:
df['1'] = np.array([1.3, 2.3, 2.3, 2.4])

It does not show as an attribute:

In [None]:
print_identifier_group(df, kind='attribute')

For this reason ```Series``` names should generally follow the naming conventions of Python identifiers.

Although dot attribute access from the ```DataFrame``` instance ```df``` is unavailable for the ```Series``` instance ```'1'```, it can still be accessed by indexing into ```df``` using the ```Series``` ```name``` ```'1'```:

In [None]:
df['1']

Accessing a ```Series``` via dictionary-style indexing is therefore more powerful and this syntax is generally preferred.

The major drawback of the dictionary-style indexing syntax is with code-completion. Notice no docstring displays when ```?``` is used:

In [None]:
df['x'].info?

In [None]:
df['x'].info()

However if the attribute is used, the docstring displays:

In [None]:
df.x.info?

In [None]:
df.x.info()

The ```Series``` method ```info``` gives the same result in both cases.

When a label in the ```Index``` is also a valid identifier:

In [None]:
'a'.isidentifier()

It will be available as an attribute for each ```Series```:

In [None]:
df.x.a

The default ```Index``` is a numeric ```RangeIndex``` of integer steps, and integers are invalid identifiers:

In [None]:
'0'.isidentifier()

And therefore when the default ```RangeIndex``` is used:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

Because these are invalid identifiers they do not show as attributes for any of the ```Series``` belonging to the ```DataFrame``` instance:

In [None]:
print_identifier_group(df['x'], kind='attribute')

However a row can still be selected by enclosing its numeric index value in square brackets:

In [None]:
df['x'][1]

## Combining DataFrames

```DataFrame``` methods are generally set up to work with ```Series```. For example if the following ```DataFrame``` instance is examined:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df

A ```Series``` is typically appended to the end of the ```DataFrame``` by use of:

In [None]:
df['z'] = np.array([1.3, 2.3, 3.3, 4.3])

In [None]:
df

Alternatively a ```Series``` can be inserted at a specified index using the mutable method ```insert```:

In [None]:
df.insert?

In [None]:
df.insert(loc=0, column='w', value=np.array([1.0, 2.0, 3.0, 4.0]))

In [None]:
df

Recall to append an observation to a ```DataFrame```, ```loc``` is typically used and assigned to a mapping where the keys are the column names and the values are the associated values at that observation:

In [None]:
df.loc[len(df)] = {'w': 5.0, 'x': 5.1, 'y': 5.2, 'z': 5.3}

In [None]:
df

When multiple observations are to be appended to a ```DataFrame``` they are normally in the form of a ```DataFrame```:

In [None]:
df2 = pd.DataFrame(index=np.array([5, 6]),
                   data={'w': np.array([6.0, 7.0]),
                         'x': np.array([6.1, 7.1]),
                         'y': np.array([6.2, 7.2]),
                         'z': np.array([6.3, 7.3])})

In [None]:
df

In [None]:
df2

The function ```pd.concat``` can be used to concatenate these two ```DataFrame``` instances:

In [None]:
pd.concat?

For example, ```df``` and ```df2``` can be concatenated along ```axis``` ```0``` which recall is ```'rows'``` (the index):

In [None]:
pd.concat(objs=[df, df2], axis='rows') #'index'

If these ```DataFrame``` instances are created with the default indexes:

In [None]:
df.reset_index(drop=True, inplace=True)
df

In [None]:
df2.reset_index(drop=True, inplace=True)
df2

Notice the index now has duplicate entries:

In [None]:
pd.concat(objs=[df, df2])

In such a scenario it is common to assign ```ignore_index``` to ```True``` which will recreate a numeric ```RangeIndex```:

In [None]:
pd.concat([df, df2], ignore_index=True)

When a ```DataFrame``` instance has a ```Series``` not in common with the second ```DataFrame``` instance being concatenated:

In [None]:
df = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [None]:
df2 = pd.DataFrame(data = {'w': np.array([6.0, 7.0]),
                           'z': np.array([6.3, 7.3])})

In [None]:
df

In [None]:
df2

The ```DataFrame``` instances can be ```'outer'``` joined (the default). This will lead to ```NaN``` values where no data was supplied:

In [None]:
pd.concat(objs=[df, df2], axis='columns', join='outer') # or axis=1

Alternatively the two ```DataFrame``` instances can be ```'inner'``` joined, which will drop the observations that are missing data:

In [None]:
pd.concat([df, df2], axis='columns', join='inner') # or axis=1
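The effect of the two join types on the shape of the result can be checked with a sketch. The names ```df_left``` and ```df_right``` are hypothetical stand-ins for a 4 row and a 2 row ```DataFrame```:

```python
import numpy as np
import pandas as pd

df_left = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 4.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})
df_right = pd.DataFrame({'w': np.array([6.0, 7.0]),
                         'z': np.array([6.3, 7.3])})

# outer: keep every index label, filling the gaps with NaN
outer = pd.concat([df_left, df_right], axis='columns', join='outer')
missing_in_w = int(outer['w'].isna().sum())

# inner: keep only the index labels present in both instances
inner = pd.concat([df_left, df_right], axis='columns', join='inner')
```

The outer join has the shape ```(4, 4)``` with 2 missing values in each of ```'w'``` and ```'z'```, whereas the inner join has the shape ```(2, 4)``` and no missing values.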

The ```DataFrame``` method ```align``` can be used to align the data of a ```DataFrame``` with another ```DataFrame``` instance for the purpose of comparison:

In [None]:
df3 = pd.concat([df, df2], axis=1, join='inner') #'columns'

In [None]:
df.align(other=df3)

## Not Available Values

The following ```DataFrame``` can be instantiated to ```df``` with multiple ```None``` values:

In [None]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, None, 3.1, None, 5.1, None, 7.1]),
                        'y': np.array([1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2])})

The ```DataFrame``` instance ```df``` information can be examined. Notice that there are 7 rows; 4 rows have available (non-null) values in ```Series``` instance ```x``` and ```6``` rows have available (non-null) values in ```Series``` instance ```y```:

In [None]:
df.info()

The ```DataFrame``` instance ```df``` descriptive statistics can be viewed:

In [None]:
df.describe()

Notice in the information that the datatype is now ```object``` instead of ```float64```. The ```DataFrame``` instance uses the datatype ```object``` for each ```Series``` because each of these ```Series``` contain the value ```None``` which is a generic Python ```object```:

In [None]:
df

The datatype of each ```Series``` in the ```DataFrame``` can be changed to a ```float``` using the method ```astype```:

In [None]:
df.astype(float)

Notice the values that were previously ```None``` are cast into ```NaN``` (Not a Number). ```NaN``` conceptually is similar to ```None``` however it has a datatype of ```float``` and is therefore designated as being numeric. ```Series``` instances that only have numeric data inclusive of ```NaN``` therefore have the datatype ```float```. Both ```None``` and ```NaN``` are classified as ```not available``` values and are known collectively as ```null``` values.
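The similarities and differences between ```None``` and ```NaN``` can be sketched using the function ```pd.isna```, which treats both as not available:

```python
import numpy as np
import pandas as pd

none_is_na = pd.isna(None)             # None counts as not available
nan_is_na = pd.isna(np.nan)            # NaN counts as not available
nan_is_float = isinstance(np.nan, float)
nan_equals_itself = np.nan == np.nan   # NaN never compares equal, even to itself

# casting an object Series containing None to float converts None to NaN
casted = pd.Series([1.1, None, 3.1], dtype=object).astype(float)
```

Because ```NaN``` does not compare equal to itself, ```isna``` should be used to test for missing values instead of the ```==``` operator.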

The changes can be seen when the ```DataFrame``` method ```info``` is used on the returned ```DataFrame``` instance:

In [None]:
df.astype(float).info()

Notice the changes when the ```DataFrame``` method ```describe``` is used on the returned ```DataFrame``` instance. Because each ```Series``` is numeric, additional statistics can be calculated:

In [None]:
df.astype(float).describe()

The ```DataFrame``` method drop not available ```dropna``` can be used to drop not available values (```None``` or ```NaN``` values) outputting a new ```DataFrame``` instance. Notice the number of rows is now reduced to 4: 

In [None]:
df.dropna()

If the ```DataFrame``` method ```info``` is used on this new ```DataFrame``` instance, notice the datatype of each ```Series``` is still ```object``` and not ```float64```:

In [None]:
df.dropna().info()

The ```DataFrame``` method ```astype``` can be used to change the datatype of each ```Series``` in the returned ```DataFrame``` to ```float``` once again and this once again outputs a new ```DataFrame``` instance. If the ```DataFrame``` method ```info``` is examined for this new ```DataFrame``` instance, each ```Series``` now has a ```float64``` datatype:

In [None]:
df.dropna().astype(float).info()

Note ```DataFrame``` methods are often stacked for convenience:

* df # Original DataFrame instance 1
* df.dropna() # returns a DataFrame instance 2
* df.dropna().astype(float) # returns a DataFrame instance 3

A ```Series``` can also be selected from ```DataFrame``` instance 3:

In [None]:
df.dropna().astype(float)['x']

And a ```Series``` method ```astype``` can be used on this ```Series``` returning another ```Series```:

In [None]:
df.dropna().astype(float)['x'].astype(int)
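A long chain is often easier to read when it is wrapped in parentheses with one step per line. A minimal sketch, using a smaller hypothetical ```DataFrame``` named ```df_na```:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame(index=['a', 'b', 'c'],
                     data={'x': np.array([1.1, None, 3.1]),
                           'y': np.array([1.2, 2.2, 3.2])})

x_int = (
    df_na
    .dropna()        # DataFrame without the row 'b'
    .astype(float)   # DataFrame where each Series is float64
    ['x']            # Series 'x'
    .astype(int)     # Series 'x' truncated to integers
)
```

None of the steps modify ```df_na```; each returns a new instance which is passed to the next step.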

The ```DataFrame``` method ```dropna``` was demonstrated on a ```DataFrame``` instance with a small number of rows. This method is however typically employed on a ```DataFrame``` that contains a large number of rows, so that enough data remains for further analysis and the results are not influenced too much by the missing values. For a sparse dataset, it is common to attempt to fill in the missing values in some way. The ```DataFrame``` method ```fillna``` can be used to fill in not available values. These can be filled with a constant value:

In [None]:
df.fillna(0)

In [None]:
df.fillna(np.inf)

Alternatively the ```DataFrame``` method ```ffill``` can be used to forward fill missing data. When using the forward fill, the previous available value is used to replace the not available value:

In [None]:
df.ffill()

This can be aligned with the original ```DataFrame``` instance for comparison:

In [None]:
df.ffill().align(df)

The related ```DataFrame``` method ```bfill``` can be used to backward fill missing data. When using the backward fill, the subsequent available value is used to replace the not available value:

In [None]:
df.bfill().align(df)
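The direction of each fill can be seen on a small hypothetical ```Series``` named ```ser``` that has missing values at both ends:

```python
import numpy as np
import pandas as pd

ser = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

forward = ser.ffill()    # NaN, 1.0, 1.0, 3.0, 3.0
backward = ser.bfill()   # 1.0, 1.0, 3.0, 3.0, NaN
```

A forward fill cannot replace a leading missing value because there is no previous value, and a backward fill cannot replace a trailing missing value because there is no subsequent value.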

The ```DataFrame``` method ```interpolate``` can use neighbouring datapoints to interpolate a missing value. It has a keyword input argument ```method``` which can be used to specify an interpolation method. Note that all the data in the ```DataFrame``` must be cast to numeric for the ```interpolate``` method to be used:

In [None]:
df.astype(float).interpolate(method='linear')

Sometimes it is preferable to use the ```Series``` method ```interpolate``` which is consistent:

In [None]:
df['x'].astype(float).interpolate(method='linear')

Many of the interpolation methods need a numeric index to work properly:

In [None]:
df['x'].astype(float).reset_index(drop=True).interpolate(method='linear')
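As a worked example of the linear method, each missing value between two available neighbours is replaced by their midpoint. A minimal sketch on a hypothetical ```Series``` named ```ser``` with a default numeric index:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.1, np.nan, 3.1, np.nan, 5.1])

# (1.1 + 3.1) / 2 = 2.1 and (3.1 + 5.1) / 2 = 4.1
filled = ser.interpolate(method='linear')
expected = np.array([1.1, 2.1, 3.1, 4.1, 5.1])
matches = np.allclose(filled.to_numpy(), expected)
```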

A new ```DataFrame``` instance can be created where each ```Series``` is the result of applying a different interpolation method to the original ```Series``` ```'x'```. The ```Series``` methods ```astype``` and ```reset_index``` will be used to cast the ```Series``` ```'x'``` to a ```float``` with a numeric index. The original index from the ```Series``` ```'x'``` can be assigned to the new ```DataFrame``` instance after the interpolation:

In [None]:
df2 = pd.DataFrame({'x': df['x'].astype(float).reset_index(drop=True),
                    'x_1': df['x'].astype(float).reset_index(drop=True).interpolate(method='linear'),
                    'x_2': df['x'].astype(float).reset_index(drop=True).interpolate(method='polynomial', order=2),
                    'x_3': df['x'].astype(float).reset_index(drop=True).interpolate(method='polynomial', order=3)})
df2.index = df['x'].index

In [None]:
df2

The ```DataFrame``` method ```isna``` returns a boolean ```DataFrame``` instance which is ```True``` for not available values and ```False``` otherwise:

In [None]:
df.isna()

The opposite method ```notna``` returns a boolean ```DataFrame``` of inverse values:

In [None]:
df.notna().align(df.isna())

These two methods have the aliases ```isnull``` and ```notnull``` respectively, which are used for consistency with the R programming language.

The boolean mask above can be used to index into the ```DataFrame``` instance:

In [None]:
bool_mask = df.notna()

Notice indexing using the boolean mask updates ```None``` to ```NaN```:

In [None]:
df

In [None]:
df[bool_mask]
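A boolean mask made from a single ```Series``` behaves differently: it selects whole rows instead of preserving the shape. A minimal sketch, assuming a hypothetical ```DataFrame``` named ```df_na```:

```python
import numpy as np
import pandas as pd

df_na = pd.DataFrame(index=['a', 'b', 'c'],
                     data={'x': np.array([1.1, None, 3.1]),
                           'y': np.array([1.2, 2.2, 3.2])})

# DataFrame mask: same shape as df_na, unselected values shown as NaN
frame_masked = df_na[df_na.notna()]

# Series mask: keeps only the rows where 'x' is available
rows_with_x = df_na[df_na['x'].notna()]
```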

## String Series and String Methods

Supposing the following list of words is instantiated:

In [None]:
words = 'the quick brown fox jumped over the lazy dog'.split()

In [None]:
words

Using ```len``` of words will return the number of words in the outer collection which is the ```list``` and not the length of each inner collection which is the word:

In [None]:
len(words)

To instead get a ```list``` of the length of each word, ```list``` comprehension can be used:

In [None]:
[len(word) for word in words]

This can also be done using ```map```:

In [None]:
map(len, words)

In [None]:
list(map(len, words))

If an analogous ```Series``` is instantiated to ```words```:

In [None]:
words = pd.Series(data='the quick brown fox jumped over the lazy dog'.split())

In [None]:
words

Using ```len``` on the ```Series``` will return the number of rows:

In [None]:
len(words)

The ```Series``` method ```map``` is similar to the ```builtins``` function ```map``` and can be used to apply a ```function``` to each element of the ```Series``` individually:

In [None]:
words.map(len)

Since every element in the ```Series``` is a ```str``` instance, a ```str``` method can be applied to each element using ```map``` and a ```lambda``` expression:

In [None]:
words.map(lambda word: word.upper())  # avoid shadowing the builtins class str

Recall that ```str``` instances are ordinal and the ```builtins``` universal ```max``` function can be mapped to get a ```Series``` that has the letter corresponding to the highest ordinal value in the word:

In [None]:
words.map(max)
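The correspondence between the ```builtins``` function ```map``` and the ```Series``` method ```map``` can be sketched side by side. The shortened word list and the variable names below are chosen only for illustration:

```python
import pandas as pd

word_list = ['the', 'quick', 'fox']
word_series = pd.Series(word_list)

# builtins map returns a lazy iterator, so it is cast to a list
builtin_lengths = list(map(len, word_list))

# Series.map returns a new Series directly
series_lengths = word_series.map(len)

# An unbound str method can be passed instead of a lambda expression
upper = word_series.map(str.upper)

# max returns the character with the highest ordinal value in each word
highest = word_series.map(max)

print(series_lengths.tolist(), upper.tolist(), highest.tolist())
```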

If a ```DataFrame``` is instantiated to ```df```:

In [None]:
df = pd.DataFrame({'words': words,
                   'maximum': words.map(max)})

In [None]:
df

The ```DataFrame``` has a consistent method ```map``` which operates element by element:

In [None]:
df.map(max)

The ```DataFrame``` also has the method ```apply``` which operates along an ```axis```:

In [None]:
df.apply(max, axis='rows')

In [None]:
df.apply(max, axis='columns')

Most universal functions are implemented as ```Series``` and ```DataFrame``` methods:

In [None]:
words.max(axis='rows')

In [None]:
words.min(axis='rows')

A ```Series``` has the attribute ```str``` which can be used to quickly access ```str``` methods:

In [None]:
words.str.zfill(20)

The ```str``` datamodel method ```__len__``` is available under the ```str``` attribute as ```len```:

In [None]:
words.str.len()

If the length of each string is examined:

In [None]:
lengths = words.map(len)

In [None]:
lengths

A function can be created to cast a numeric length into a ```str``` for example ```3``` to ```'three'```:

In [None]:
def get_length_str(length):
    match length:
        case 3:
            return 'three'
        case 4:
            return 'four'
        case 5:
            return 'five'
        case 6:
            return 'six' 

This function can be applied to the ```lengths``` ```Series``` using ```map```:

In [None]:
lengths.map(get_length_str)
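As an alternative sketch, ```Series.map``` also accepts a ```dict```, which may be shorter than a function when the lookup is a fixed table. The name ```length_names``` is hypothetical and the lengths are typed in by hand rather than taken from ```words```:

```python
import pandas as pd

lengths = pd.Series([3, 5, 5, 3, 6, 4, 3, 4, 3])

# Hypothetical lookup table from numeric length to its name
length_names = {3: 'three', 4: 'four', 5: 'five', 6: 'six'}

named = lengths.map(length_names)

# A value with no key in the dict maps to a missing value
unmatched = pd.Series([3, 7]).map(length_names)

print(named.tolist())
```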

## Numeric Series

If a ```DataFrame``` with numeric ```Series``` ```x```, ```y``` and ```z``` is instantiated to ```df```:

In [None]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [-2, -4, 6, 8, 10],
                   'z': [12, 24, 48, -63, -999]})

In [None]:
df

The ```DataFrame``` method ```apply``` can be used to apply the builtins universal function ```max```:

In [None]:
df.apply(max, axis='rows')

In [None]:
df.apply(max, axis='columns') 

Note however that most of the universal functions from ```builtins``` or ```numpy``` are implemented directly as ```DataFrame``` methods:

In [None]:
df.max(axis='rows')

In [None]:
df.min(axis='rows')

In [None]:
df.mean(axis='rows')

In [None]:
df.var(axis='rows')

In [None]:
df.std(axis='rows')

And the datamodel identifiers for a numeric ```Series``` are configured for numeric operations:

In [None]:
df['x'] + df['y']

In [None]:
df['x'] + 5

The ```apply``` method can also be used with a ```tuple``` of universal functions, outputting a ```DataFrame``` instance as opposed to a ```Series```:

In [None]:
df.apply((len, max, min, np.mean, np.var, np.std))
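A related sketch uses the ```DataFrame``` method ```agg``` with a ```list``` of method names given as ```str``` instances, which also returns one row per statistic. The name ```df_num``` is an assumption used to keep the example standalone; the values repeat those of ```df``` above:

```python
import pandas as pd

df_num = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                       'y': [-2, -4, 6, 8, 10],
                       'z': [12, 24, 48, -63, -999]})

# One row per named statistic, one column per Series
summary = df_num.agg(['max', 'min', 'mean'])

# The same statistic calculated across each row instead
row_max = df_num.max(axis='columns')

print(summary)
```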

## Categorical Series

Another common type of ```Series``` is a category ```Series```:

In [None]:
df = pd.DataFrame({'student_names': ['student' + str(num) for num in range(1, 9)],
                   'grades': ['b', 'F', 'A', 'C', 'a', 'C', 'B', 'A']})

When instantiated, the categories will normally be recognised as ```str``` instances:

In [None]:
df

And the datatypes will therefore be ```object``` for each ```Series```:

In [None]:
df.dtypes

The datatype of a ```Series``` can be changed using the method ```astype```. To change to category use the input argument ```'category'```:

In [None]:
df['grades'].astype('category')

The original ```Series``` can be reassigned to the new ```Series``` that is now categorical:

In [None]:
df['grades'] = df['grades'].astype('category')

If the ```DataFrame``` instance is examined, it looks the same:

In [None]:
df

However its datatype is updated:

In [None]:
df.dtypes

A categorical ```Series``` also has the attribute ```cat``` which groups together attributes and methods commonly used for categorical ```Series```:

In [None]:
dir2(df['grades'].cat, drop_internal=True)

The ```Series.cat``` attribute ```categories``` can be used to get the names of the existing categories:

In [None]:
df['grades'].cat.categories
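As a minimal standalone sketch, a categorical ```Series``` stores each distinct value once in ```categories``` and refers to it by an integer code, available from the ```Series.cat``` attribute ```codes```. The grades below repeat those used above:

```python
import pandas as pd

grades = pd.Series(['b', 'F', 'A', 'C', 'a', 'C', 'B', 'A']).astype('category')

# Categories are sorted, with uppercase letters before lowercase letters
categories = list(grades.cat.categories)

# Each value is stored as the integer position of its category
codes = grades.cat.codes.tolist()

print(categories)
print(codes)
```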

Notice that these categories have uppercase and lowercase variants. A ```list``` comprehension can be used with a ```str``` method to change the lowercase grades to uppercase:

In [None]:
old_grade_categories = df['grades'].cat.categories
new_grade_categories = [grade.upper() for grade in old_grade_categories]
new_grade_categories

A category mapping can then be created:

In [None]:
category_mapping = dict(zip(old_grade_categories, new_grade_categories))
category_mapping

This is the type of ```mapping``` that can be used with the ```Series.cat``` method ```rename_categories```. However, at present this method does not support merging of categories and raises a ```ValueError``` because some of the ```values``` in the mapping are the same:

In [None]:
# df['grades'].cat.rename_categories(category_mapping)
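The error can be observed safely inside a ```try``` block. This is a sketch on a shortened, made-up ```Series``` rather than on ```df```, and the names ```mapping``` and ```merged``` are assumptions:

```python
import pandas as pd

grades = pd.Series(['b', 'B', 'A']).astype('category')

# Renaming 'b' to 'B' would duplicate the existing category 'B'
mapping = {'b': 'B'}

try:
    grades.cat.rename_categories(mapping)
    error_message = None
except ValueError as error:
    error_message = str(error)

print(error_message)

# Workaround: edit the values as str, then cast back to category
merged = grades.astype(str).str.upper().astype('category')
print(list(merged.cat.categories))
```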

It is therefore easier to manipulate the ```str``` datatype ```Series``` and then cast it to a ```category``` datatype ```Series```:

In [None]:
df['grades'] = df['grades'].str.lower().astype('category')
df['grades']

When all the values in the mapping are unique, the ```Series.cat``` method ```rename_categories``` works as expected:

In [None]:
old_grade_categories = df['grades'].cat.categories
new_grade_categories = [grade.upper() for grade in old_grade_categories]
category_mapping = dict(zip(old_grade_categories, new_grade_categories))
category_mapping

In [None]:
df['grades'] = df['grades'].cat.rename_categories(category_mapping)
df['grades']

Categories are often used for boolean selectors:

In [None]:
df[df['grades'] == 'A']

In [None]:
df[df['grades'] == 'B']

In [None]:
df[(df['grades'] == 'A') | (df['grades'] == 'B')]

Only the equal to ```==``` and not equal to ```!=``` operators are defined for unordered categoricals. A ```TypeError``` is raised if any other comparison operator is used.
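A short sketch of this behaviour, using a small made-up unordered categorical ```Series```:

```python
import pandas as pd

grades = pd.Series(['A', 'C', 'B']).astype('category')

# Equality is defined for unordered categories
equal_to_b = grades == 'B'

# An ordering comparison is not, so the error is caught here
try:
    grades >= 'B'
    comparison_error = None
except TypeError as error:
    comparison_error = str(error)

print(equal_to_b.tolist())
print(comparison_error)
```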

The ```Series.cat``` method ```as_ordered``` can be used to ordinally order categories:

In [None]:
df['grades'].cat.as_ordered()

In this case, the desired order is the reverse of the ordinal order because ```'A'``` corresponds to a higher grade than ```'F'```:

In [None]:
df['grades'].cat.reorder_categories(new_categories=('F', 'C', 'B', 'A'),
                                    ordered=True)

The original ```Series``` can be reassigned:

In [None]:
df['grades'] = df['grades'].cat.reorder_categories(new_categories=('F', 'C', 'B', 'A'),
                                                   ordered=True)

In [None]:
df[df['grades'] >= 'B']

When sorting out data in ```DataFrames```, ordinal ```Series``` are quite often used:

In [None]:
df.sort_values(['grades', 'student_names'])
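The effect of the category order on comparison and sorting can be sketched on a small standalone ```DataFrame```. The name ```df_demo``` and its four rows are made up, and ```pd.Categorical``` is used here as a compact way to set the order at creation:

```python
import pandas as pd

df_demo = pd.DataFrame({'student_names': ['student1', 'student2', 'student3', 'student4'],
                        'grades': ['B', 'F', 'A', 'C']})

# Ordered categories from the lowest grade to the highest grade
df_demo['grades'] = pd.Categorical(df_demo['grades'],
                                   categories=['F', 'C', 'B', 'A'],
                                   ordered=True)

# Sorting follows the category order, not the alphabetical order
sorted_demo = df_demo.sort_values('grades')

# Ordering comparisons are now defined
at_least_b = df_demo[df_demo['grades'] >= 'B']

print(sorted_demo)
```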

If the ```DataFrame``` method ```count``` is used, it will return the number of non-missing values in each ```Series``` and return this information as a new ```Series```:

In [None]:
df.count()

A ```GroupBy``` instance can be created:

In [None]:
gbo = df.groupby(df['grades'], observed=True)

This ```gbo``` instance is essentially a ```DataFrame``` with an additional groupby instruction that is applied when a ```DataFrame``` method such as ```count``` is used. Notice that a ```DataFrame``` instance is returned:

In [None]:
gbo.count()

A ```GroupBy``` instance can be created from a ```Series```:

In [None]:
gbo = df['grades'].groupby(df['grades'], observed=True)

This ```gbo``` instance is essentially a ```Series``` with an additional groupby instruction that is applied when a ```Series``` method such as ```count``` is used. A ```Series``` instance is returned:

In [None]:
gbo.count()
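The two kinds of ```GroupBy``` instance can be compared in a standalone sketch. The ```DataFrame``` ```df_demo``` and its values are assumptions made for illustration:

```python
import pandas as pd

df_demo = pd.DataFrame({'student_names': ['s1', 's2', 's3', 's4'],
                        'grades': pd.Categorical(['A', 'B', 'A', 'C'])})

# Grouping the DataFrame returns a DataFrame of counts per group
frame_counts = df_demo.groupby('grades', observed=True).count()

# Grouping a single Series returns a Series of counts per group
series_counts = df_demo['grades'].groupby(df_demo['grades'], observed=True).count()

print(type(frame_counts).__name__, type(series_counts).__name__)
```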

Some ```Series``` methods like ```describe``` return several statistics, each of which forms its own ```Series```, and will therefore return a ```DataFrame``` instance:

In [None]:
gbo.describe()

Notice the slight difference with the column grouping ```student_names``` being used to group the statistical information (```count```, ```unique```, ```top``` and ```freq```), which is for that specific ```Series```.

In [None]:
df

The difference can be seen more clearly if a second category is added to the ```DataFrame```:

In [None]:
import random
random.seed(0)
df['sex'] = pd.Series([random.choice(['F', 'M']) for num in range(8)])
df['sex'] = df['sex'].astype('category')

In [None]:
df

Now the ```DataFrame``` instance ```df``` is grouped by the ```Series``` ```grades``` followed by the ```DataFrame``` method ```describe```. Descriptive statistics are shown for each ```Series``` and each of these statistics is grouped using a multilevel ```Index```:

In [None]:
df.groupby('grades', observed=True).describe()

If this ```DataFrame``` is indexed into using the top level column name:

In [None]:
df.groupby('grades', observed=True).describe()['sex']

This returns a single-index ```DataFrame``` which can also be indexed into for example by using the column name ```count```:

In [None]:
df.groupby('grades', observed=True).describe()['sex']['count']

The ```DataFrame``` can be grouped by a list of multiple ```Series``` names. This gives a multi-level index for both the columns and the ```Index```:

In [None]:
df.groupby(['grades', 'sex'], observed=True).describe()
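The multi-level structure can be inspected in a standalone sketch. It assumes a hypothetical numeric column ```score```, so the statistics are the numeric ones (```count```, ```mean``` and so on) rather than those shown above for categorical data:

```python
import pandas as pd

df_demo = pd.DataFrame({'grades': ['A', 'A', 'B', 'B'],
                        'sex': ['F', 'M', 'F', 'F'],
                        'score': [80, 90, 65, 61]})

# Grouping by one name gives two column levels: (column name, statistic)
by_grade = df_demo.groupby('grades')[['score']].describe()
mean_a = by_grade['score']['mean'].loc['A']

# Grouping by two names additionally gives two row levels: (grades, sex)
by_both = df_demo.groupby(['grades', 'sex'])[['score']].describe()
count_bf = by_both.loc[('B', 'F'), ('score', 'count')]

print(by_grade.columns.nlevels, by_both.index.nlevels)
```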

Returning to:

In [None]:
df

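A function can be made to generate a random ```score``` in response to a ```grade```. The seed is set once, outside the function, so that successive calls return different values, and the bounds are chosen so that each score falls within its grade band (```random.randint``` includes both endpoints):

In [None]:
random.seed(0)

def generate_score(grade):
    match grade:
        case 'A':
            return random.randint(71, 100)
        case 'B':
            return random.randint(61, 70)
        case 'C':
            return random.randint(51, 60)
        case 'F':
            return random.randint(1, 50)

This custom function can be applied to the ```grades``` ```Series``` to generate a ```Series``` of random marks for each student. Because ```map``` on a categorical ```Series``` is applied once per category rather than once per row, the ```Series``` is first cast to ```str```:

In [None]:
df['scores'] = df['grades'].astype(str).map(generate_score)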

In [None]:
df

Categories can be created from ordinal values using the pandas function ```pd.cut```:

In [None]:
pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101])

In the output above the ```(``` means exclusive of the lower boundary and the ```]``` means inclusive of the upper boundary. For convenience this will be inserted into the ```DataFrame``` at column index 3. Recall ```insert``` is a mutable method and occurs in place:

In [None]:
df.insert(3, 'score_cats', pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101]))

In [None]:
df

```cut``` can be used with the keyword labels. Notice that there are ```5``` values for ```bins``` and ```4``` values for ```labels```, this is because each bin is between two values:

In [None]:
pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101], labels=['F', 'C', 'B', 'A'])

Notice that these categories are also automatically ordered:

In [None]:
df.insert(4, 'calculated_grades', pd.cut(x=df['scores'], bins=[0, 50, 60, 70, 101], labels=['F', 'C', 'B', 'A']))

In [None]:
df
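The treatment of scores that sit exactly on a bin boundary can be sketched with a few hand-picked values. The ```right``` keyword of ```pd.cut``` controls which side of each bin is closed; the variable names are assumptions:

```python
import pandas as pd

scores = [50, 51, 60, 70, 71]
bins = [0, 50, 60, 70, 101]
labels = ['F', 'C', 'B', 'A']

# Default right=True: bins are (0, 50], (50, 60], (60, 70], (70, 101]
right_closed = pd.cut(scores, bins=bins, labels=labels)

# right=False: bins are [0, 50), [50, 60), [60, 70), [70, 101)
left_closed = pd.cut(scores, bins=bins, labels=labels, right=False)

print(list(right_closed))
print(list(left_closed))
```

With the default setting a score of ```0``` falls outside the first bin and is returned as a missing value, unless ```include_lowest=True``` is also supplied.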

## DateTime

In ```pandas``` dates and time intervals are based upon the datatypes ```datetime64``` and ```timedelta64``` respectively from the ```numpy``` library.

The ```datetime64``` class is normally initialised using a timestamp string of the following format:

```python
np.datetime64('YYYY-MM-DD')
np.datetime64('YYYY-MM-DDThh:mm:ss.μμμμμμ')
```

For example:

In [None]:
np.datetime64('2023-07-25')

In [None]:
np.datetime64('2023-07-25T14:30:15.123456')

The ```timedelta64``` class is normally initialised using a quantity ```X``` followed by a unit ```'U'```:

```python
np.timedelta64(X, 'U')
```

These are usually combined using addition:

In [None]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')
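A brief sketch of the arithmetic between the two classes: adding ```timedelta64``` instances of different units gives a result in the finer unit, and a ```timedelta64``` can be added to or obtained from ```datetime64``` instances. The variable names are chosen for illustration:

```python
import numpy as np

# One day plus one hour is expressed in the finer unit, hours
total = np.timedelta64(1, 'D') + np.timedelta64(1, 'h')

# datetime64 + timedelta64 gives a datetime64
next_day = np.datetime64('2023-07-25') + np.timedelta64(1, 'D')

# datetime64 - datetime64 gives a timedelta64
elapsed = np.datetime64('2023-07-26') - np.datetime64('2023-07-25')

print(total, next_day, elapsed)
```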

These can be used to make an array of equally spaced times using the ```np.arange``` function:

In [None]:
start_time = np.datetime64('2023-07-25')
end_time = np.datetime64('2023-07-26')
time_interval = np.timedelta64(1, 'h')

In [None]:
times = np.arange(start=start_time,  # inclusive
                  stop=end_time,  # exclusive
                  step=time_interval)

In [None]:
times

These times can be cast into an ```Index``` or ```Series```:

In [None]:
pd.Index(times)

In [None]:
pd.Series(data=times, name='times')
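As an alternative sketch, ```pandas``` has its own function ```pd.date_range``` which builds a comparable ```DatetimeIndex``` directly. Here the number of ```periods``` is given instead of an exclusive stop value, and the name ```hourly``` is an assumption:

```python
import pandas as pd

# 24 hourly timestamps starting at midnight on 2023-07-25
hourly = pd.date_range(start='2023-07-25', periods=24, freq='h')

print(hourly[0], hourly[-1], len(hourly))
```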

The ```Index``` of the datatype ```datetime64``` can be used as a time index alongside measurement ```Series```, for example emulated temperature, pH and humidity data:

In [None]:
np.random.seed(0)

In [None]:
df = pd.DataFrame(index=pd.Index(times),
                  data={'temperature': 25 + np.random.randn(24),
                        'ph': 7 + np.random.randn(24) / 10,
                        'humidity': 100 - np.random.randint(0, 100, 24)})

In [None]:
df

```loc``` can be used to retrieve the data at a specified ```datetime64```:

In [None]:
df.loc['2023-07-25T01:00:00']

```iloc``` can also be used with the ```int``` position of the row, which is the value the row would have in the ```RangeIndex``` if the ```Index``` were reset:

In [None]:
df.iloc[1]

The difference between the measurements at two times can be calculated:

In [None]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']
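Selection by timestamp string also supports slicing, where both ends of the slice are included. The sketch below uses a made-up ```DataFrame``` ```df_demo``` whose single column simply counts the hours:

```python
import pandas as pd

index = pd.date_range(start='2023-07-25', periods=24, freq='h')
df_demo = pd.DataFrame({'temperature': range(24)}, index=index)

# A full timestamp string selects a single row as a Series
one_row = df_demo.loc['2023-07-25T01:00:00']

# A slice of timestamp strings includes both endpoints
window = df_demo.loc['2023-07-25 01:00':'2023-07-25 03:00']

print(window)
```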

In addition to the ```Index``` of datatype ```datetime64```, the array ```times``` can be added to the ```DataFrame``` instance ```df``` as a ```Series```:

In [None]:
df['times'] = times

In [None]:
df

When ```loc``` is used to calculate the difference between the measurements at two different times, the difference for the ```times``` ```Series``` is a time difference, i.e. a ```timedelta64```:

In [None]:
df.loc['2023-07-25 16:00:00'] - df.loc['2023-07-25T01:00:00']

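The ```Series``` method ```tz_localize``` localizes the timezone-naive ```Index``` of the ```Series``` to the ```timezone``` specified by the input argument ```tz``` (the values themselves are localized through the ```dt``` accessor, i.e. ```df['times'].dt.tz_localize```). For example in the UK: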

In [None]:
df['times'].tz_localize(tz='Europe/London')

And in the Czech Republic:

In [None]:
df['times'].tz_localize(tz='Europe/Prague')
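The distinction between attaching a timezone and converting between timezones can be sketched on the values of a standalone ```Series``` using the ```dt``` accessor. The names ```naive```, ```london``` and ```prague``` are assumptions:

```python
import pandas as pd

# Timezone-naive datetime values with a default RangeIndex
naive = pd.Series(pd.date_range(start='2023-07-25', periods=3, freq='h'))

# tz_localize attaches a timezone without changing the wall clock time
london = naive.dt.tz_localize('Europe/London')

# tz_convert expresses the same instants in another timezone
prague = london.dt.tz_convert('Europe/Prague')

print(london.dt.hour.tolist(), prague.dt.hour.tolist())
```

In July Prague is one hour ahead of London, so each converted value reads one hour later.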

Care needs to be taken with non-UTC timezones as clock changes lead to ambiguous times. In the UK one of the biannual clock changes can be examined:

In [None]:
start_time = np.datetime64('2023-10-28T11:00:00')
end_time = np.datetime64('2023-10-29T03:00:00')
time_interval = np.timedelta64(30, 'm')

In [None]:
utc_times = np.arange(start=start_time,  # inclusive
                      stop=end_time,  # exclusive
                      step=time_interval)

In [None]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous=True)

In [None]:
pd.Index(utc_times).tz_localize(tz='Europe/London', ambiguous='NaT')
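The meaning of the ```ambiguous``` argument can be sketched on a single ```pd.Timestamp```. The wall clock time 01:30 on 29 October 2023 occurs twice in the UK, once before and once after the clocks go back. The variable names are chosen for illustration:

```python
from datetime import timedelta

import pandas as pd

ambiguous_time = pd.Timestamp('2023-10-29 01:30:00')

# ambiguous=True selects the first occurrence, in daylight saving time (UTC+1)
as_dst = ambiguous_time.tz_localize('Europe/London', ambiguous=True)

# ambiguous=False selects the second occurrence, in standard time (UTC+0)
as_standard = ambiguous_time.tz_localize('Europe/London', ambiguous=False)

# ambiguous='NaT' replaces the ambiguous time with a missing value
as_nat = ambiguous_time.tz_localize('Europe/London', ambiguous='NaT')

print(as_dst, as_standard, as_nat)
```

The two localized timestamps show the same wall clock time but are one hour apart as instants.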