# pandas library

The name pandas is derived from "panel data" and also doubles as an abbreviation for the Python Data Analysis library. The library is built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

By default an Index is numeric: it starts at zero and increases in integer steps of one.

The Index, similar to a tuple, list or 1darray, has a single dimension which can be represented either as a row:

|index|0|1|2|3|
|---|---|---|---|---|

Or as a column when convenient:

|index|
|---|
|0|
|1|
|2|
|3|


The Series class has a value at each index and a name. It is essentially a numpy 1darray that has a name. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame class is essentially a grouping of series that have the same index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|
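
As a minimal sketch of how the three structures relate, the tables above can be reproduced as follows. The variable names are illustrative only, and pd.concat is just one of several ways to group Series into a DataFrame:

```python
import numpy as np
import pandas as pd

# A default-style index: zero-based integers in steps of one
index = pd.RangeIndex(start=0, stop=4, step=1)

# Two Series that share the same index, each with its own name
x = pd.Series(np.array([1.1, 2.1, 3.1, 4.1]), index=index, name='x')
y = pd.Series(np.array([1.2, 2.2, 3.2, 4.2]), index=index, name='y')

# Grouping the Series side by side gives a DataFrame with columns 'x' and 'y'
frame = pd.concat([x, y], axis=1)
print(frame)
```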

## Importing Libraries

To use the data science libraries, they first need to be imported:

In [2]:
import numpy as np 
import pandas as pd

Once imported, the identifiers of the library can be viewed:

In [3]:
print(dir(pd), sep=' ')

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_config', '_is_numpy_dev', '_libs', '_testing', '_typing', '_version', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', '

These can be grouped. The identifiers beginning and ending with a double underscore are the data model identifiers, which mainly give details about the library:

In [4]:
for identifier in dir(pd):
    isdatamodel = identifier[0:2] == '__'
    if (isdatamodel):
        print(identifier, end=' ')

__all__ __builtins__ __cached__ __doc__ __docformat__ __file__ __git_version__ __loader__ __name__ __package__ __path__ __spec__ __version__ 

For example the name, version and file:

In [5]:
pd.__name__

'pandas'

In [6]:
pd.__version__

'2.0.2'

In [7]:
pd.__file__

'c:\\Users\\pyip\\AppData\\Local\\mambaforge\\envs\\jupyterlab\\Lib\\site-packages\\pandas\\__init__.py'

The identifiers beginning with a single underscore are for internal use only:

In [8]:
for identifier in dir(pd):
    isdatamodel = (identifier[0:1] == '_') & (identifier[0:2] != '__')
    if (isdatamodel):
        print(identifier, end=' ')

_config _is_numpy_dev _libs _testing _typing _version 

The classes are all in CamelCase:

In [9]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and isupper and not isdatamodel):
        print(identifier, end=' ')

ArrowDtype BooleanDtype Categorical CategoricalDtype CategoricalIndex DataFrame DateOffset DatetimeIndex DatetimeTZDtype ExcelFile ExcelWriter Flags Float32Dtype Float64Dtype Grouper HDFStore Index Int16Dtype Int32Dtype Int64Dtype Int8Dtype Interval IntervalDtype IntervalIndex MultiIndex NamedAgg Period PeriodDtype PeriodIndex RangeIndex Series SparseDtype StringDtype Timedelta TimedeltaIndex Timestamp UInt16Dtype UInt32Dtype UInt64Dtype UInt8Dtype 

The main classes are:

* Index
* Series
* DataFrame
 
There are some variations of Index such as RangeIndex, MultiIndex, DatetimeIndex and TimedeltaIndex. 
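
The Index variations can be sketched briefly. The example below is illustrative only: the start dates, labels and variable names are arbitrary assumptions, and the constructors shown are just one way of creating each kind of Index:

```python
import pandas as pd

# Zero-based integer index in steps of one
range_index = pd.RangeIndex(start=0, stop=4, step=1)

# Index of timestamps, one per day
datetime_index = pd.date_range(start='2023-07-25', periods=3, freq='D')

# Index of durations, one day apart
timedelta_index = pd.timedelta_range(start='1 day', periods=3, freq='D')

# Index with two levels
multi_index = pd.MultiIndex.from_tuples(
    [('a', 0), ('a', 1), ('b', 0)], names=['letter', 'number'])

for index in (range_index, datetime_index, timedelta_index, multi_index):
    print(type(index).__name__, len(index))
```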

In general pandas uses object-oriented programming (OOP) as opposed to a purely functional style. This means methods are normally called on Index, Series and DataFrame instances to analyse or manipulate the data in the instance. Many of the functions within the pandas library read in data from a file and return a DataFrame instance:

In [10]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array bdate_range concat crosstab cut date_range describe_option eval factorize from_dummies get_dummies get_option infer_freq interval_range isna isnull json_normalize lreshape melt merge merge_asof merge_ordered notna notnull option_context period_range pivot pivot_table qcut read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml reset_option set_eng_float_format set_option show_versions test timedelta_range to_datetime to_numeric to_pickle to_timedelta unique value_counts wide_to_long 
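
As a small sketch of one of the read functions, read_csv can be tried without a file on disk by wrapping text in io.StringIO. The CSV text here is made up for illustration:

```python
import io

import pandas as pd

# Hypothetical CSV content; in practice this would normally be a file path
csv_text = "x,y\n1.1,1.2\n2.1,2.2\n3.1,3.2\n4.1,4.2\n"

# read_csv accepts a path or a file-like object and returns a DataFrame
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```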

pandas modules are also in lower case:

In [11]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

annotations api arrays compat core errors io offsets options pandas plotting testing tseries util 

The modules are not normally used directly by the user but internally called when constructing an Index, Series or DataFrame:

In [12]:
print(dir(pd.arrays), end=' ')

['ArrowExtensionArray', 'ArrowStringArray', 'BooleanArray', 'Categorical', 'DatetimeArray', 'FloatingArray', 'IntegerArray', 'IntervalArray', 'PandasArray', 'PeriodArray', 'SparseArray', 'StringArray', 'TimedeltaArray', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 

## Series

The initialisation signature for a pandas Series can be examined:

In [13]:
? pd.Series

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool' = False,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray ha

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied, an empty Series with no index, no name and the generic object datatype is instantiated:

In [14]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a numpy 1darray:

In [15]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

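Since an ndarray is itself initialised from a list, the list can be supplied directly. Note that the inferred integer dtype may differ: the ndarray above carries the platform's default NumPy integer (int32 in this output), whereas pandas infers int64 from a list of Python integers: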

In [16]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64

When dtype=None, the data type will be inferred from the data:

In [17]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [18]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2023-07-25 11:01:12.764351
1   2023-07-26 11:01:12.764351
2   2023-07-27 11:01:12.764351
dtype: datetime64[ns]

In [19]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a string in it is classed as non-numeric and has the generic dtype object.

The dtype can be manually overridden when supplying the numpy 1darray by using the np.array input argument dtype:

In [20]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or alternatively by using the Series keyword input argument dtype:

In [21]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32

Notice that by default the index is numeric, starting at zero and increasing in integer steps of 1. This can be changed manually by using the keyword input argument index and providing an Index, ndarray or list of index values:

In [22]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32

A Series usually also has a name:

In [23]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32

Normally the data and name are supplied and the index and dtype are inferred:

In [24]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64
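
The data can also be supplied as a dictionary, in which case the keys become the index. The following is a minimal sketch with made-up labels and values:

```python
import pandas as pd

# The dictionary keys become the index labels, the values become the data
labelled = pd.Series({'a': 1.1, 'b': 2.1, 'c': 3.1}, name='x')
print(labelled)
```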

## DataFrame

The initialisation signature for a pandas DataFrame can be examined:

In [25]:
? pd.DataFrame

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------


The keyword input arguments for a DataFrame are similar to those of a Series. However, as a DataFrame is a collection of Series, some of them take multiple values:

* data (plural, one set of values per column)
* index (singular, shared by every column)
* columns (plural of name)
* dtype (singular, one dtype applied to every column)

In [26]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The dtype keyword input argument accepts only a single dtype, which is applied to every column. If a list of dtypes is supplied, a TypeError will display.
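
Where each column needs its own dtype, one option is to convert after construction using the astype method with a dictionary of column names to dtypes. The following is a minimal sketch; the column names and chosen dtypes are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# A single dtype passed to the constructor applies to every column
single = pd.DataFrame(data=[[1, 2], [3, 4]],
                      columns=['x', 'y'],
                      dtype=np.float64)

# astype with a dict converts each named column to its own dtype
per_column = single.astype({'x': np.int32, 'y': np.float64})
print(per_column.dtypes)
```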

Normally the dtype and index are inferred:

In [27]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


It is more common to supply the data in the form of a dictionary. Each item in the dictionary is a key: value pair. The key is normally a string and becomes the column name in the DataFrame instance; the value should be a 1darray or list, which becomes the data in that column:

In [28]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|
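
The dictionary values can also be Series. In that case the rows are aligned on the index labels rather than on position, and labels missing from one Series are filled with NaN. A minimal sketch with made-up labels:

```python
import pandas as pd

# Two Series whose indexes only partly overlap
x = pd.Series([1.1, 2.1, 3.1], index=['a', 'b', 'c'])
y = pd.Series([2.2, 3.2, 4.2], index=['b', 'c', 'd'])

# Rows are matched by label; gaps become NaN
aligned = pd.DataFrame({'x': x, 'y': y})
print(aligned)
```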


## Series Identifiers

If the following ndarray and Series are created:

In [29]:
xarray = np.array([1.1, 2.1, 3.1, 4.1])

In [30]:
xarray

array([1.1, 2.1, 3.1, 4.1])

In [31]:
xseries = pd.Series(xarray, name='x')

In [32]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

The identifiers of the Series can be viewed:

In [33]:
print(dir(xseries), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror_

The above may seem overwhelming at first glance, however these identifiers can be split into separate groupings. The behaviour of many of them, particularly the most common ones, has already been examined when looking at numeric datatypes and ndarrays; in a Series they are broadcast across the data.

The identifiers can be split into attributes:

In [34]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array at attrs axes dtype dtypes empty flags hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name nbytes ndim shape size values 

Methods:

In [35]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply argmax argmin argsort asfreq asof astype at_time autocorr backfill between between_time bfill bool clip combine combine_first compare convert_dtypes copy corr count cov cummax cummin cumprod cumsum describe diff div divide divmod dot drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull item items keys kurt kurtosis last last_valid_index le loc lt map mask max mean median memory_usage min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow prod product quantile radd rank ravel rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels repeat replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample searchsorted sem set_axis set_flags shift skew sort_index 

Data model attributes:

In [36]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [37]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __float__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __int__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

There are also a number of internal attributes:

In [38]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = (identifier[0:1] == '_') & (identifier[0:2] != '__') 
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

_AXIS_LEN _AXIS_ORDERS _AXIS_TO_AXIS_NUMBER _HANDLED_TYPES _accessors _agg_examples_doc _agg_see_also_doc _attrs _can_hold_na _data _flags _hidden_attrs _info_axis _info_axis_name _info_axis_number _internal_names _internal_names_set _is_cached _is_copy _is_mixed_type _is_view _item_cache _metadata _mgr _name _references _stat_axis _stat_axis_name _stat_axis_number _typ _values 

And internal methods:

In [39]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = (identifier[0:1] == '_') & (identifier[0:2] != '__') 
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

_accum_func _add_numeric_operations _align_frame _align_series _append _arith_method _as_manager _binop _check_inplace_and_allows_duplicate_labels _check_inplace_setting _check_is_chained_assignment_possible _check_label_or_level_ambiguity _check_setitem_copy _clear_item_cache _clip_with_one_bound _clip_with_scalar _cmp_method _consolidate _consolidate_inplace _construct_axes_dict _construct_result _constructor _constructor_expanddim _convert_dtypes _dir_additions _dir_deletions _drop_axis _drop_labels_or_levels _duplicated _find_valid_index _get_axis _get_axis_name _get_axis_number _get_axis_resolvers _get_block_manager_axis _get_bool_data _get_cacher _get_cleaned_column_resolvers _get_index_resolvers _get_label_or_level_values _get_numeric_data _get_value _get_values _get_values_tuple _get_with _gotitem _indexed_same _init_dict _init_mgr _inplace_method _is_label_or_level_reference _is_label_reference _is_level_reference _ixs _logical_func _logical_method _map_values _maybe_update_ca

Since the Series is based on an ndarray, most of the data model attributes and methods behave analogously. The following data model identifiers are found in the Series but not in the ndarray:

In [40]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__annotations__ __dict__ __module__ 

In [41]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__finalize__ __getattr__ __nonzero__ __round__ __weakref__ 

The main supplementary functionality is with the attributes:

In [42]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

array at attrs axes dtypes empty hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name values 

Many of these attributes return the value supplied in the initialisation signature:

In [43]:
xseries.array

<PandasArray>
[1.1, 2.1, 3.1, 4.1]
Length: 4, dtype: float64

In [44]:
xseries.name

'x'

In [45]:
xseries.index

RangeIndex(start=0, stop=4, step=1)

In [46]:
xseries.values

array([1.1, 2.1, 3.1, 4.1])

In [47]:
xseries.dtypes

dtype('float64')

The main functionality is in the added methods:

In [48]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill bool combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide divmod drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index le loc lt map mask median memory_usage mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow product quantile radd rank rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling rpow rsub rtruediv sample sem set_axis set_flags shift skew sort_index sort_values sub subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_

Note in the above there are method equivalents to the data model identifiers:

In [49]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinarrayasdatamodel):
        print(identifier, end=' ')

abs add bool divmod eq floordiv ge gt le lt mod mul ne pow radd rdivmod rfloordiv rmod rmul rpow rsub rtruediv sub truediv 

It is more common to use the equivalent data model operator:

In [50]:
xseries + 4

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [51]:
xseries.__add__(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [52]:
xseries.add(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64
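
One reason to use the named method instead of the operator is that the method accepts additional keyword input arguments. As a hedged sketch with made-up data, fill_value substitutes a value where a label is missing from one of the two Series:

```python
import pandas as pd

left = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
right = pd.Series([10.0, 20.0], index=['a', 'b'])

# The operator aligns on labels; 'c' is missing from right so the result is NaN
with_operator = left + right

# The method can treat the missing value as 0 before adding
with_method = left.add(right, fill_value=0)
print(with_operator)
print(with_method)
```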

Some of these methods share a name with a function in builtins. abs, divmod and pow behave analogously to their builtins counterparts, applied elementwise. Others differ in detail, for example filter selects by index label rather than by value:

In [53]:
import builtins

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinbuiltins = identifier in dir(builtins)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinbuiltins):
        print(identifier, end=' ')

abs bool divmod filter map pow 
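
A short sketch, using made-up data, of how the Series methods filter and map compare with the builtins of the same name: map applies a function to each value, whereas filter subsets by index label. Selecting by value is normally done with a boolean condition instead:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'], name='x')

# filter keeps the rows whose index labels are listed
by_label = s.filter(items=['a', 'c'])

# Selecting by value uses a boolean condition inside square brackets
by_value = s[s > 1.5]

# map applies the function to every value, much like the builtin map
doubled = s.map(lambda value: value * 2)
```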

In [54]:
import builtins

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    isinbuiltins = identifier in dir(builtins)
    if (isfunction and not isdatamodel and not isinarray and not isinarrayasdatamodel and not isinbuiltins):
        print(identifier, end=' ')

add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide drop drop_duplicates droplevel dropna duplicated equals ewm expanding explode factorize ffill fillna first first_valid_index get groupby head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index loc mask median memory_usage mode multiply nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop product quantile rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rolling sample sem set_axis set_flags shift skew sort_index sort_values subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_markdown to_numpy to_period to_pickle to_sql to_string to_timestamp to_xarray transform truncate tz_convert tz_localize unique

## DataFrame Identifiers

If the following dataframe is constructed:

In [55]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [56]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


In [57]:
print(dir(df), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__

These can be split into attributes:

In [58]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

at attrs axes columns dtypes empty flags iat index ndim shape size style values x y 

Methods:

In [59]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply applymap asfreq asof assign astype at_time backfill between_time bfill bool boxplot clip combine combine_first compare convert_dtypes copy corr corrwith count cov cummax cummin cumprod cumsum describe diff div divide dot drop drop_duplicates droplevel dropna duplicated eq equals eval ewm expanding explode ffill fillna filter first first_valid_index floordiv from_dict from_records ge get groupby gt head hist idxmax idxmin iloc infer_objects info insert interpolate isetitem isin isna isnull items iterrows itertuples join keys kurt kurtosis last last_valid_index le loc lt mask max mean median melt memory_usage merge min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe pivot pivot_table plot pop pow prod product quantile query radd rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample select_

Data model attributes:

In [60]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [61]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __dataframe__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

Most of these behave analogously to their counterparts in Series, broadcasting across the entire DataFrame instance instead of just along a Series. 

There are no data model attributes in the DataFrame class not found in the Series class:

In [62]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

The only data model method in the DataFrame class not in the Series class is \_\_dataframe\_\_:

In [63]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

__dataframe__ 

If the attributes are examined:

In [64]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

columns style x y 

Notice the columns attribute returns an Index containing the name of each Series:

In [65]:
df.columns

Index(['x', 'y'], dtype='object')

Since the following condition is satisfied:

In [66]:
'x'.isidentifier()

True

In [67]:
'y'.isidentifier()

True

And the identifier name doesn't clash with any of the other DataFrame identifiers, the following are also attributes:

In [68]:
df.x

0    1.1
1    2.1
2    3.1
3    3.1
Name: x, dtype: float64

In [69]:
df.y

0    1.2
1    2.2
2    3.2
3    4.2
Name: y, dtype: float64
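
When a column name is not a valid identifier, or clashes with an existing DataFrame identifier, attribute access does not return the column and square brackets are needed instead. A minimal sketch, with column names chosen purely to illustrate the two cases:

```python
import pandas as pd

clash = pd.DataFrame({'x': [1.1, 2.1],
                      'size': [10, 20],      # clashes with the size attribute
                      'my col': [1, 2]})     # not a valid identifier

# Attribute access works for 'x'
x_column = clash.x

# clash.size is the existing attribute (number of elements), not the column
number_of_elements = clash.size

# Square brackets always select the column
size_column = clash['size']
spaced_column = clash['my col']
```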

The following methods are also supplementary for a DataFrame:

In [70]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

applymap assign boxplot corrwith eval from_dict from_records insert isetitem iterrows itertuples join melt merge pivot pivot_table query select_dtypes set_index stack to_feather to_gbq to_html to_orc to_parquet to_records to_stata to_xml 
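
A few of these DataFrame-only methods can be sketched briefly. The example below uses made-up data and an illustrative new column name, total. Each call returns a new DataFrame and leaves the original unchanged:

```python
import pandas as pd

table = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                      'y': [1.2, 2.2, 3.2, 4.2]})

# assign returns a copy with an additional column
with_total = table.assign(total=table['x'] + table['y'])

# query selects rows using an expression written as a string
large = table.query('x > 2')

# set_index moves a column into the index
indexed = table.set_index('x')
```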

## Mutability

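The Series and DataFrame classes are mutable collections, meaning they have the data model method \_\_getitem\_\_ for reading a value as well as the mutating methods \_\_setitem\_\_ and \_\_delitem\_\_. (An Index instance, by contrast, is immutable: its values cannot be reassigned individually.) For the Series class: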

In [71]:
'__getitem__' in dir(pd.Series)

True

In [72]:
'__setitem__' in dir(pd.Series)

True

In [73]:
'__delitem__' in dir(pd.Series)

True

This means the following Series can be indexed into:

In [74]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

Indexing is carried out using \_\_getitem\_\_. The shorthand notation uses square brackets to enclose the index value:

In [75]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

For example:

In [76]:
xseries[0]

1.1

A value can be reassigned using the mutating method \_\_setitem\_\_. Because the Series has the data type float64, None is stored as NaN:

In [77]:
xseries[0] = None

In [78]:
xseries

0    NaN
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value can be deleted using the mutating method \_\_delitem\_\_:

In [79]:
del xseries[2]

In [80]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64
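The three data model methods can be tried side by side. The following is a minimal sketch on a hypothetical Series named series_demo; it also attempts item assignment on an Index, which pandas rejects with a TypeError:

```python
import pandas as pd

series_demo = pd.Series([1.1, 2.1, 3.1], name='x')

first_value = series_demo[0]   # __getitem__ with the label 0
series_demo[0] = 9.9           # __setitem__ replaces the value in place
del series_demo[1]             # __delitem__ removes the label 1

index_demo = pd.Index([0, 1, 2])
try:
    index_demo[0] = 5          # an Index does not accept item assignment
    index_accepts_assignment = True
except TypeError:
    index_accepts_assignment = False

print(first_value, series_demo.tolist(), index_accepts_assignment)
```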

Despite the ndarray, Series and DataFrame being mutable data types, most of their methods are immutable by default, returning a new instance rather than modifying the original. If the docstring of the method dropna is examined:

In [81]:
? xseries.dropna

Signature:
xseries.dropna(
    *,
    axis: 'Axis' = 0,
    inplace: 'bool' = False,
    how: 'AnyAll | None' = None,
    ignore_index: 'bool' = False,
) -> 'Series | None'
Docstring:
Return a new Series with missing values removed.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index'}
    Unused. Parameter needed for compatibility with DataFrame.
inplace : bool, default False
    If True, do operation inplace and retur

Notice it has a number of keyword input arguments such as axis and inplace which have default values. inplace has the default value of False, so the method leaves the original Series untouched and returns a new Series:

In [82]:
xseries.dropna() # Return value

1    2.1
3    4.1
Name: x, dtype: float64

In [83]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

When inplace is set to True it becomes a mutable method, modifying the Series inplace:

In [84]:
xseries.dropna(inplace=True) # No return value

In [85]:
xseries

1    2.1
3    4.1
Name: x, dtype: float64

The same behaviour can be seen on the method reset_index:

In [86]:
? xseries.reset_index

Signature:
xseries.reset_index(
    level: 'IndexLabel' = None,
    *,
    drop: 'bool' = False,
    name: 'Level' = <no_default>,
    inplace: 'bool' = False,
    allow_duplicates: 'bool' = False,
) -> 'DataFrame | Series | None'
Docstring:
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before another

With default values this returns a DataFrame since the old index is now added as a Series:

In [87]:
xseries.reset_index() # Return value

| |index|x|
|---|---|---|
|0|1|2.1|
|1|3|4.1|


If drop is set to True, a Series will instead be returned:

In [88]:
xseries.reset_index(drop=True) # Return value

0    2.1
1    4.1
Name: x, dtype: float64

Once again the inplace keyword input argument can be assigned to True making the method mutable:

In [89]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [90]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

The inspect module can be used to group the Series methods that have inplace as a keyword argument. All of these are configured to be immutable by default but can be made mutable by assigning inplace to True:

In [91]:
import inspect

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' in inspect.signature(getattr(xseries, identifier)).parameters):
            print(identifier, end=' ')

backfill bfill clip drop drop_duplicates dropna ffill fillna interpolate mask pad rename rename_axis replace reset_index sort_index sort_values where 

Notice that most of these are used to fill or drop missing values.
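The loop above can be wrapped in a reusable helper. This is a minimal sketch, not part of pandas: the function name methods_with_parameter is hypothetical, and it skips any attribute whose signature cannot be inspected:

```python
import inspect

import pandas as pd


def methods_with_parameter(obj, parameter):
    """Return the public callables of obj whose signature has the given parameter."""
    names = []
    for identifier in dir(obj):
        if identifier.startswith('_'):
            continue
        try:
            attribute = getattr(obj, identifier)
        except Exception:
            continue  # some attributes are unavailable for this instance
        if not callable(attribute):
            continue
        try:
            parameters = inspect.signature(attribute).parameters
        except (TypeError, ValueError):
            continue  # no introspectable signature
        if parameter in parameters:
            names.append(identifier)
    return names


inplace_methods = methods_with_parameter(pd.Series([1.0]), 'inplace')
print(inplace_methods)
```

The same helper can be called with 'axis' or with a DataFrame instance to produce the other listings in this section.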

When the above methods are immutable, they have a return value:

In [92]:
xseries.sort_values(ascending=False) # Return value

1    4.1
0    2.1
Name: x, dtype: float64

The return value of an immutable method can be kept by assignment, or in this case reassignment:

In [93]:
xseries = xseries.sort_values(ascending=False)

In [94]:
xseries

1    4.1
0    2.1
Name: x, dtype: float64

On the other hand when they are mutable, they have no return value and the Series is updated inplace:

In [95]:
xseries.sort_values(ascending=True, inplace=True) # No return value

In [96]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

If assignment or reassignment is used together with inplace=True, the return value of the function is None, so None replaces the original Series:

In [97]:
xseries = xseries.sort_values(ascending=True, inplace=True) 

In [98]:
xseries

Notice no cell output because:

In [99]:
xseries is None

True
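The two calling styles can be compared on a fresh Series. A minimal sketch, using a hypothetical Series named series_demo, that records what each style returns and what it does to the original:

```python
import pandas as pd

series_demo = pd.Series([3.1, None, 1.1], name='x')

# Immutable style: a new Series is returned, the original keeps its NaN.
cleaned = series_demo.dropna()
length_after_copy = len(series_demo)

# Mutable style: the original is modified and the return value is None.
result = series_demo.dropna(inplace=True)
length_after_inplace = len(series_demo)

print(len(cleaned), length_after_copy, result, length_after_inplace)
```

Keeping to one style per statement, either reassignment or inplace=True but never both, avoids accidentally overwriting a Series with None.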

By convention immutable methods have a return value and mutable methods have no return value. An exception to this is the mutable method pop which returns the popped value and mutates the Series in place:

In [100]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

In [101]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

In [102]:
xseries.pop(item=1) # Return value

2.1

In [103]:
xseries # Mutated

0    1.1
2    3.1
3    4.1
Name: x, dtype: float64

Most of the other methods are immutable and have a return value:

In [104]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' not in inspect.signature(getattr(xseries, identifier)).parameters):
            if identifier not in ['pop']:
                print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply argmax argmin argsort asfreq asof astype at_time autocorr between between_time bool combine combine_first compare convert_dtypes copy corr count cov cummax cummin cumprod cumsum describe diff div divide divmod dot droplevel duplicated eq equals ewm expanding explode factorize filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info isin isna isnull item items keys kurt kurtosis last last_valid_index le loc lt map max mean median memory_usage min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pct_change pipe plot pow prod product quantile radd rank ravel rdiv rdivmod reindex reindex_like reorder_levels repeat resample rfloordiv rmod rmul rolling round rpow rsub rtruediv sample searchsorted sem set_axis set_flags shift skew squeeze std sub subtract sum swapaxes swaplevel tail take to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list 

## Indexing and Slicing

Supposing the following dictionary instance is instantiated:

In [105]:
mapping = {'x': np.array([1.1, 2.1, 3.1, 4.1]),
           'y': np.array([1.2, 2.2, 3.2, 4.2])}

In [106]:
mapping

{'x': array([1.1, 2.1, 3.1, 4.1]), 'y': array([1.2, 2.2, 3.2, 4.2])}

A DataFrame instance can be instantiated by assigning the mapping to the keyword input argument data:

In [107]:
df = pd.DataFrame(data=mapping)

In [108]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


A mapping can be indexed with a key. This returns the value the key references, in this case the numpy array:

In [109]:
mapping['x']

array([1.1, 2.1, 3.1, 4.1])

Analogously, when a DataFrame is indexed using the name of a column, the Series is returned:

In [110]:
df['x']

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value in the ndarray can be indexed by use of a second set of square brackets to enclose the numeric index:

In [111]:
mapping['x'][1]

2.1

Analogously, a value in the Series can be indexed by use of a second set of square brackets to enclose the numeric index:

In [112]:
df['x'][1]

2.1

If the DataFrame instance is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

The first set of brackets selects the Series:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

And the second set of brackets selects the index retrieving the value:

2.1

If the DataFrame is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

Sometimes the value for each Series at an index is desired:

|index|'x'|'y'|
|---|---|---|
|1|2.1|2.2|

This is done by use of the property location loc. Note that loc returns the above "row" as a Series which is displayed by default as a "column":

|index|1|
|---|---|
|'x'|2.1|
|'y'|2.2|

loc is callable and has a docstring:

In [113]:
callable(df.loc)

True

In [114]:
? df.loc

Type:        property
String form: <property object at 0x000001B695B45030>
Docstring:
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See mo

However it is not used like a function. Calling it with parentheses just returns another indexer:

In [115]:
df.loc

<pandas.core.indexing._LocIndexer at 0x1b6971a08c0>

In [116]:
df.loc()

<pandas.core.indexing._LocIndexer at 0x1b6971a0870>

Instead loc is a property, think of it as syntactic sugar around the data model method \_\_getitem\_\_ that switches the order of indexing from Series, index to index, Series:

In [117]:
df.loc[1]

x    2.1
y    2.2
Name: 1, dtype: float64

In [118]:
df.loc[1]['x']

2.1
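loc also accepts the row label and the column label together, separated by a comma, which avoids the second set of brackets and can be used for assignment too. A minimal sketch, using a hypothetical copy of the data named df_demo:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

chained = df_demo.loc[1]['x']   # row first, then the Series name
single = df_demo.loc[1, 'x']    # row label and column label in one call

df_demo.loc[1, 'x'] = 9.9       # the single-call form also assigns in place
updated = df_demo.loc[1, 'x']

print(chained, single, updated)
```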

loc uses index values:

In [119]:
df.loc[[0, 2]]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|2|3.1|3.2|


The related property integer location iloc uses integer positions instead of labels. Here the index labels are themselves the integers 0 to 3, so a list of positions gives the same result as loc, and a slice excludes its stop position:

In [120]:
df.iloc[[0, 2]]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|2|3.1|3.2|


In [121]:
df.iloc[0:2]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|


If the following DataFrame instance is created with index labels i.e. a non-numeric index:

|index|'x'|'y'|
|---|---|---|
|'a'|1.1|1.2|
|'b'|2.1|2.2|
|'c'|3.1|3.2|
|'d'|4.1|4.2|

In [122]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data=mapping)

In [123]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The difference between loc and iloc can be seen more clearly. For loc the index label is used:

In [124]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

Despite the labels being non-numeric iloc handles the index values numerically:

In [125]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

iloc essentially analyses a dataframe with a reset index:

In [126]:
df.reset_index(drop=True)

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


In [127]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

When loc and iloc are used to select a single index, the data for each Series at this index is itself displayed as a Series:

In [128]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

In [129]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

Because each of the above are a Series instance, they can in turn be indexed into:

In [130]:
df.loc['b']['y']

2.2

In [131]:
df.iloc[1]['y']

2.2

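Each element in a Series can also be accessed numerically. Newer pandas versions deprecate this positional fallback on a Series with a non-numeric index, so iloc is the safer choice for positions: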

In [132]:
df.loc['b'][1]

2.2

In [133]:
df.iloc[1][1]

2.2
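Positions can be selected explicitly with iloc on the row Series, which does not depend on how an integer inside plain square brackets is interpreted. A minimal sketch with a hypothetical frame named df_demo:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

row = df_demo.loc['b']       # Series indexed by the Series names 'x' and 'y'
by_label = row['y']          # label-based access
by_position = row.iloc[1]    # position-based access, explicit about intent

print(by_label, by_position)
```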

When iloc and loc are instead used to select data from multiple indexes a DataFrame instance is output:

In [134]:
df.loc[['a', 'b']]

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|


In [135]:
df.iloc[0:2]

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|


And because each of these is a DataFrame instance, the Series within the DataFrame instance can then be indexed using the Series name:

In [136]:
df.loc[['a', 'b']]['x']

a    1.1
b    2.1
Name: x, dtype: float64

In [137]:
df.iloc[0:2]['x']

a    1.1
b    2.1
Name: x, dtype: float64
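Slices behave differently for the two properties: as the loc docstring notes, a label slice includes both the start and the stop, whereas an iloc slice follows the usual Python convention and excludes the stop. A minimal sketch with a hypothetical frame named df_demo:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

label_slice = df_demo.loc['a':'c']     # 'c' is included
position_slice = df_demo.iloc[0:2]     # position 2 is excluded

print(label_slice.index.tolist())
print(position_slice.index.tolist())
```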

at is used for a scalar selector and requires both the index and the Series name: 

In [138]:
df.at['a', 'y']

1.2

The related integer at, iat, is also a scalar selector and requires both the index and column to be specified as integer positions:

In [139]:
df.iat[0, 1]

1.2

Conceptually, this is like casting the DataFrame to a 2darray and indexing a value from it:

In [140]:
df.to_numpy()

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2]])

In [141]:
df.to_numpy()[0, 1]

1.2

To recap, for a DataFrame instance:

* \_\_getitem\_\_ selects a Series
* loc and iloc select an observation from the Index
* at and iat select a scalar element
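The recap can be checked in a few lines. A minimal sketch with a hypothetical frame named df_demo, showing what each selector returns:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

column = df_demo['x']                  # __getitem__: a Series named 'x'
row_by_label = df_demo.loc['b']        # loc: the observation labelled 'b'
row_by_position = df_demo.iloc[1]      # iloc: the observation at position 1
value_by_label = df_demo.at['b', 'y']  # at: a scalar by labels
value_by_position = df_demo.iat[1, 1]  # iat: a scalar by positions

print(column.name, row_by_label.name, value_by_label, value_by_position)
```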


loc can also be used to add a new observation to the DataFrame:

In [142]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


In [143]:
df.loc['e'] = {'x': 5.1, 'y': 5.2}

In [144]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|


The length of the DataFrame gives the number of observations:

In [145]:
len(df)

5

iloc isn't as powerful as loc and cannot be used to enlarge the DataFrame:

In [146]:
# df.iloc[len(df)] = {'x': 6.1, 'y': 6.2}

<span style='color:red'>IndexError</span>: iloc cannot enlarge its target object

However loc can be used to add a numeric label this way:

In [147]:
df.loc[len(df)] = {'x': 6.1, 'y': 6.2}

In [148]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|5|6.1|6.2|
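The contrast between the two properties can be captured without leaving a commented-out cell by catching the exception. A minimal sketch with a hypothetical frame named df_demo:

```python
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b'],
                       data={'x': [1.1, 2.1], 'y': [1.2, 2.2]})

# loc creates the label 'c' because it does not exist yet.
df_demo.loc['c'] = {'x': 3.1, 'y': 3.2}

# iloc only addresses existing positions, so this raises IndexError.
try:
    df_demo.iloc[len(df_demo)] = [4.1, 4.2]
    enlarged_with_iloc = True
except IndexError:
    enlarged_with_iloc = False

print(df_demo.index.tolist(), enlarged_with_iloc)
```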


## DataFrame Properties

Supposing the following DataFrame is instantiated:

In [149]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, 2.1, 3.1, 4.1, 5.1, 6.1, 7.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2, 5.2, 6.2, 7.2])})

In [150]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


The DataFrame has the following dimension related properties. The attribute empty returns a boolean that is True only with an empty DataFrame:

In [151]:
df.empty

False

In [152]:
pd.DataFrame(None).empty

True

A DataFrame has a length, which is the number of observations or rows i.e. number of values in the Index:

In [153]:
len(df)

7

It has a shape tuple, the 1st value in the shape tuple is the number of rows (observations in the index) and 2nd value is the number of Series (columns):

In [154]:
df.shape

(7, 2)

It has 2 dimensions:

In [155]:
df.ndim

2

Recall this is the length of the shape tuple:

In [156]:
len(df.shape)

2

And it has a size which is the product of the elements in the shape tuple:

In [157]:
df.size

14
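The relationships between these dimension properties can be verified directly. A minimal sketch with a smaller hypothetical frame named df_demo:

```python
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c'],
                       data={'x': [1.1, 2.1, 3.1], 'y': [1.2, 2.2, 3.2]})

n_rows, n_columns = df_demo.shape

print(len(df_demo) == n_rows)               # length is the number of rows
print(df_demo.ndim == len(df_demo.shape))   # ndim is the length of shape
print(df_demo.size == n_rows * n_columns)   # size is the product of shape
print(pd.DataFrame().empty)                 # a frame without data is empty
```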

The index attribute is an Index instance. An Index instance has a single dimension that can either be depicted as a row or a column. The output below displays this as a row although the index itself is conventionally depicted as a column when incorporated as part of a DataFrame:

In [158]:
df.index

Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object')

When no index is specified during instantiation a RangeIndex is shown:

In [159]:
df2 = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                         'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [160]:
df2.index

RangeIndex(start=0, stop=4, step=1)

The attribute columns is also an instance of the class Index that contains the names used for each Series in the DataFrame:

In [161]:
df.columns

Index(['x', 'y'], dtype='object')

The attribute axes returns a 2 element list, where the first element is the index and the second element is the columns:

In [162]:
df.axes

[Index(['a', 'b', 'c', 'd', 'e', 'f', 'g'], dtype='object'),
 Index(['x', 'y'], dtype='object')]

The attribute values returns the values in the DataFrame in the form of a 2darray:

In [163]:
df.values

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2],
       [5.1, 5.2],
       [6.1, 6.2],
       [7.1, 7.2]])

The attribute dtypes returns a Series holding the data type of each Series in the DataFrame:

In [164]:
df.dtypes

x    float64
y    float64
dtype: object

The Series instances x and y are each of the data type float64. The final line, dtype: object, describes the Series returned by dtypes itself, whose values are dtype objects; it is not a data type of the DataFrame, which has no single dtype of its own.
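The result of dtypes is an ordinary Series keyed by the Series names, so it can be indexed like any other. A minimal sketch with a hypothetical frame named df_demo that mixes float and integer data:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1], 'n': [1, 2]})

dtypes = df_demo.dtypes          # a Series with one entry per column

print(type(dtypes).__name__)     # Series
print(dtypes['x'], dtypes['n'])  # float64 int64
print(dtypes.dtype)              # object: the entries are dtype objects
```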

Each existing Series is accessible as an attribute:

In [165]:
df.x

a    1.1
b    2.1
c    3.1
d    4.1
e    5.1
f    6.1
g    7.1
Name: x, dtype: float64

In [166]:
df.y

a    1.2
b    2.2
c    3.2
d    4.2
e    5.2
f    6.2
g    7.2
Name: y, dtype: float64

The formal representation of the DataFrame instance df can be examined in a cell:

In [167]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


The attribute style will instead display the DataFrame instance using default styling:

In [168]:
df.style

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


This attribute can be used with a number of methods to apply custom formatting:

In [169]:
for identifier in dir(df.style):
    if not identifier.startswith('_') and callable(getattr(df.style, identifier)):
        print(identifier, end=' ')

apply apply_index applymap applymap_index background_gradient bar clear concat export format format_index from_custom_template hide highlight_between highlight_max highlight_min highlight_null highlight_quantile pipe relabel_index set_caption set_properties set_sticky set_table_attributes set_table_styles set_td_classes set_tooltips set_uuid text_gradient to_excel to_html to_latex to_string use 

In [170]:
df_styled = df.style.format(precision=3).set_caption('DataFrame Instance')

This gives a Styler instance:

In [171]:
type(df_styled)

pandas.io.formats.style.Styler

The Styler instance applies the formatting to the data in the DataFrame when output in a cell:

In [172]:
df_styled

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


The associated attributes give information about an existing Styler instance:

In [173]:
for identifier in dir(df_styled):
    if not identifier.startswith('_') and not callable(getattr(df_styled, identifier)):
        print(identifier, end=' ')

caption cell_context cell_ids columns concatenated css ctx ctx_columns ctx_index data env hidden_columns hidden_rows hide_column_names hide_columns_ hide_index_ hide_index_names index loader table_attributes table_styles template_html template_html_style template_html_table template_latex template_string tooltips uuid uuid_len 

In [174]:
df_styled.caption

'DataFrame Instance'

In [175]:
df_styled.hidden_rows

[]

The attribute attrs is an empty dictionary by default and is designed to store metadata associated with the DataFrame:

In [176]:
df.attrs

{}

This metadata can include a text description giving information about how the data was collected or contain a link to a scientific publication for example. The pandas documentation warns that this is an experimental feature and is subject to change:

In [177]:
df.attrs = {'description': 'this DataFrame was instantiated from a dict',
            'scientific paper': r'https://www.sciencedirect.com/'}

flags is another experimental feature and is used to change some flags. At present there is only one flag that can be set, the flag which allows duplicate labels:

In [178]:
df.flags

<Flags(allows_duplicate_labels=True)>

In [179]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


This flag is enabled by default:

In [180]:
df.flags.allows_duplicate_labels

True

In [181]:
df_duplicated = pd.concat([df, df])

In [182]:
df_duplicated

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


If set to False:

In [183]:
df.flags.allows_duplicate_labels = False

Then any operation involving that DataFrame that could lead to a DataFrame with duplicate labels will give a DuplicateLabelError:

In [184]:
# pd.concat([df, df])

<span style='color:red'>DuplicateLabelError</span>:
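The method set_flags returns a new DataFrame with the flag changed, leaving the original untouched, and it refuses data that already contains duplicate labels. A minimal sketch with a hypothetical frame named df_demo, catching pandas.errors.DuplicateLabelError rather than leaving the failing line commented out:

```python
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b'],
                       data={'x': [1.1, 2.1], 'y': [1.2, 2.2]})

# A new frame that disallows duplicates; df_demo keeps the default flag.
df_strict = df_demo.set_flags(allows_duplicate_labels=False)

# Stacking the frame on itself gives the labels a, b, a, b.
df_duplicated = pd.concat([df_demo, df_demo])

try:
    df_duplicated.set_flags(allows_duplicate_labels=False)
    duplicates_rejected = False
except pd.errors.DuplicateLabelError:
    duplicates_rejected = True

print(df_demo.flags.allows_duplicate_labels,
      df_strict.flags.allows_duplicate_labels,
      duplicates_rejected)
```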


The dataframe method info gives information about the dataframe putting several attributes together such as the Series names, number of non-null values and the data types of each Series:

In [185]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       7 non-null      float64
 1   y       7 non-null      float64
dtypes: float64(2)
memory usage: 468.0+ bytes


The describe method gives descriptive statistics on each numeric Series:

In [186]:
df.describe()

| |x|y|
|---|---|---|
|count|7.0|7.0|
|mean|4.1|4.2|
|std|2.160247|2.160247|
|min|1.1|1.2|
|25%|2.6|2.7|
|50%|4.1|4.2|
|75%|5.6|5.7|
|max|7.1|7.2|


The DataFrame methods head and tail give the first 5 and last 5 observations by default and are usually used to preview a very large DataFrame instance:

In [187]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


In [188]:
df.head()

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|


In [189]:
df.tail()

| |x|y|
|---|---|---|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|f|6.1|6.2|
|g|7.1|7.2|


The number of observations n can be changed:

In [190]:
df.head(n=3)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|


The method nunique gives the number of unique values in each Series:

In [191]:
df.nunique()

x    7
y    7
dtype: int64

## Not Available Values

If the following DataFrame is instantiated with None values:

In [192]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
                  data={'x': np.array([1.1, None, 3.1, None, 5.1, None, 7.1]),
                        'y': np.array([1.2, None, 3.2, 4.2, 5.2, 6.2, 7.2])})

The information of the DataFrame instance can be examined. There are still 7 entries (observations). 4 observations have available (non-null) values in Series instance x and 6 observations have available (non-null) values in Series instance y. Also notice the data type is now object instead of float64, meaning each Series stores arbitrary Python objects (here a mixture of float and None) rather than a numeric array:

In [193]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       4 non-null      object
 1   y       6 non-null      object
dtypes: object(2)
memory usage: 168.0+ bytes


If describe is used, because None values are present and the data type is an object the descriptive statistics change:

In [194]:
df.describe()

| |x|y|
|---|---|---|
|count|4.0|6.0|
|unique|4.0|6.0|
|top|1.1|1.2|
|freq|1.0|1.0|


The data type of each Series in the DataFrame can be changed using the method astype:

In [295]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|None|None|
|c|3.1|3.2|
|d|None|4.2|
|e|5.1|5.2|
|f|None|6.2|
|g|7.1|7.2|


In [294]:
df.astype(float)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|NaN|NaN|
|c|3.1|3.2|
|d|NaN|4.2|
|e|5.1|5.2|
|f|NaN|6.2|
|g|7.1|7.2|


Notice the difference between the two DataFrame instances, the one which has each Series with the data type object has None whereas the one which has each Series as numeric has NaN (not a number).

In [305]:
None == np.nan

False

NaN is a float value that plays the same role for numeric data that None plays for Python objects. A Series that only has numeric data and NaN can therefore have the data type float:

In [302]:
type(np.nan)

float

Series with only numeric data and None therefore contain multiple different data types and therefore the Series has the data type object:

In [303]:
type(None)

NoneType
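Because NaN does not even compare equal to itself, missing values are best detected with pd.isna, which treats None and NaN alike. A minimal sketch with a hypothetical object Series named series_object and its float conversion:

```python
import numpy as np
import pandas as pd

nan_equals_itself = np.nan == np.nan    # False: equality cannot detect NaN

series_object = pd.Series(np.array([1.1, None, 3.1], dtype=object))
series_float = series_object.astype(float)   # None becomes NaN

print(nan_equals_itself)
print(pd.isna(None), pd.isna(np.nan))
print(series_object.dtype, series_float.dtype)
print(series_object.isna().sum(), series_float.isna().sum())
```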

If the method describe is used on the DataFrame instance that has the float data type with NaN values instead of None values, the numeric descriptive statistics display:

In [300]:
df.astype(float).describe()

| |x|y|
|---|---|---|
|count|4.0|6.0|
|mean|4.1|4.533333|
|std|2.581989|2.160247|
|min|1.1|1.2|
|25%|2.6|3.45|
|50%|4.1|4.7|
|75%|5.6|5.95|
|max|7.1|7.2|


The method dropna (drop not available) can be used to drop these values, outputting a new DataFrame instance. Both None and NaN are classified as not available and are also known collectively as null values. Notice the number of observations is now reduced to 4:

In [195]:
df.dropna()

| |x|y|
|---|---|---|
|a|1.1|1.2|
|c|3.1|3.2|
|e|5.1|5.2|
|g|7.1|7.2|


In [304]:
df.astype(float).dropna()

| |x|y|
|---|---|---|
|a|1.1|1.2|
|c|3.1|3.2|
|e|5.1|5.2|
|g|7.1|7.2|


If the DataFrame method info is used on this new DataFrame instance, notice the data type of each Series is still object and not float64:

In [196]:
df.dropna().info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       4 non-null      object
 1   y       4 non-null      object
dtypes: object(2)
memory usage: 96.0+ bytes


The astype method can be used to change the data type of each Series in the DataFrame to a float once again outputting a new DataFrame instance. If the info method is examined for this DataFrame instance, each Series now has a float64 data type:

In [222]:
df.dropna().astype(float).info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       4 non-null      float64
 1   y       4 non-null      float64
dtypes: float64(2)
memory usage: 96.0+ bytes


And describe can be used on this instance to give descriptive statistics:

In [223]:
df.dropna().astype(float).describe()

| |x|y|
|---|---|---|
|count|4.0|4.0|
|mean|4.1|4.2|
|std|2.581989|2.581989|
|min|1.1|1.2|
|25%|2.6|2.7|
|50%|4.1|4.2|
|75%|5.6|5.7|
|max|7.1|7.2|


dropna can be used when a DataFrame instance has a large number of observations and only a small number of these observations have not available values. If the docstring is examined:

In [224]:
? df.dropna

Signature:
df.dropna(
    *,
    axis: 'Axis' = 0,
    how: 'AnyAll | NoDefault' = <no_default>,
    thresh: 'int | NoDefault' = <no_default>,
    subset: 'IndexLabel' = None,
    inplace: 'bool' = False,
    ignore_index: 'bool' = False,
) -> 'DataFrame | None'
Docstring:
Remove missing values.

See the :ref:`User Guide <missing_data>` for more on whi

Notice there are the keyword input arguments inplace and axis. These two keywords are present in many of the DataFrame identifiers.

The following identifiers have the keyword inplace which recall toggles the method from being immutable (default when inplace=False) to mutable (when inplace=True). Notice that many of these other identifiers are used to drop not available data or to fill not available data.

In [228]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' in inspect.signature(getattr(df, identifier)).parameters):
            print(identifier, end=' ')

backfill bfill clip drop drop_duplicates dropna eval ffill fillna interpolate mask pad query rename rename_axis replace reset_index set_index sort_index sort_values where 

The following identifiers have the keyword axis:

In [227]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('axis' in inspect.signature(getattr(df, identifier)).parameters):
            print(identifier, end=' ')

add add_prefix add_suffix agg aggregate align all any apply at_time backfill between_time bfill clip corrwith count cummax cummin cumprod cumsum diff div divide drop droplevel dropna eq ewm expanding ffill fillna filter floordiv ge groupby gt idxmax idxmin iloc interpolate kurt kurtosis le loc lt mask max mean median min mod mode mul multiply ne nunique pad pow prod product quantile radd rank rdiv reindex rename rename_axis reorder_levels resample rfloordiv rmod rmul rolling rpow rsub rtruediv sample sem set_axis shift skew sort_index sort_values squeeze std sub subtract sum swaplevel take to_period to_timestamp transform truediv truncate tz_convert tz_localize var where xs 

The keyword axis can be examined in more detail using the DataFrame instance df:

In [235]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|None|None|
|c|3.1|3.2|
|d|None|4.2|
|e|5.1|5.2|
|f|None|6.2|
|g|7.1|7.2|


df has a shape tuple which has 7 observations or rows in the index and 2 Series or columns:

In [237]:
df.shape

(7, 2)

Because a DataFrame always has 2 dimensions, only the indexes 0 and 1 of the shape tuple need to be considered. Notice the 7 is at index 0 and the 2 is at index 1 of the shape tuple:

In [238]:
nrows = df.shape[0]
nrows

7

In [239]:
ncols = df.shape[1]
ncols

2

The default value is axis 0 or 'index' and drops any observations along the index that have null entries:

In [233]:
df.dropna(axis=0)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|c|3.1|3.2|
|e|5.1|5.2|
|g|7.1|7.2|


In [240]:
df.dropna(axis='index')

| |x|y|
|---|---|---|
|a|1.1|1.2|
|c|3.1|3.2|
|e|5.1|5.2|
|g|7.1|7.2|


This can be changed to an axis of 1 or 'columns' that will instead drop any Series that has not available values. In this case all the Series have not available values:

In [234]:
df.dropna(axis=1)

| |
|---|
|a|
|b|
|c|
|d|
|e|
|f|
|g|


In [241]:
df.dropna(axis='columns')

| |
|---|
|a|
|b|
|c|
|d|
|e|
|f|
|g|
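The other keyword input arguments in the dropna signature refine which observations are dropped. A minimal sketch, using a hypothetical float frame named df_demo with the same pattern of missing values, comparing axis with how and subset:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(
    index=['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    data={'x': [1.1, np.nan, 3.1, np.nan, 5.1, np.nan, 7.1],
          'y': [1.2, np.nan, 3.2, 4.2, 5.2, 6.2, 7.2]})

rows_any = df_demo.dropna(axis=0)          # drop rows with any missing value
rows_all = df_demo.dropna(how='all')       # drop rows where every value is missing
rows_y = df_demo.dropna(subset=['y'])      # only consider the Series y
columns_any = df_demo.dropna(axis=1)       # drop columns with any missing value

print(rows_any.index.tolist())
print(rows_all.index.tolist())
print(rows_y.index.tolist())
print(columns_any.shape)
```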


The method fillna can be used to fill in not available values:

In [246]:
? df.fillna

Signature:
df.fillna(
    value: 'Hashable | Mapping | Series | DataFrame' = None,
    *,
    method: 'FillnaOptions | None' = None,
    axis: 'Axis | None' = None,
    inplace: 'bool' = False,
    limit: 'int | None' = None,
    downcast: 'dict | None' = None,
) -> 'DataFrame | None'
Docstring:
Fill NA/NaN values using the specified method.

Parameters
----------
value : scalar, dict, Ser

These can be filled with a constant value:

In [247]:
df.fillna(0)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|0.0|0.0|
|c|3.1|3.2|
|d|0.0|4.2|
|e|5.1|5.2|
|f|0.0|6.2|
|g|7.1|7.2|


In [250]:
df.fillna(np.inf)

Unnamed: 0,x,y
a,1.1,1.2
b,inf,inf
c,3.1,3.2
d,inf,4.2
e,5.1,5.2
f,inf,6.2
g,7.1,7.2
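
The fill value does not have to be a single constant. The value parameter also accepts a mapping, which gives each Series its own fill value. A minimal sketch, assuming a reconstructed version of the example DataFrame (the construction with np.nan is an assumption):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame
df = pd.DataFrame({'x': [1.1, np.nan, 3.1, np.nan, 5.1, np.nan, 7.1],
                   'y': [1.2, np.nan, 3.2, 4.2, 5.2, 6.2, 7.2]},
                  index=list('abcdefg'))

# Keys are Series names, values are the replacement for that Series
filled = df.fillna({'x': 0.0, 'y': df['y'].mean()})
print(filled)
```

Here 'x' is filled with a constant, while 'y' is filled with the mean of its available values.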


Alternatively a method can be used to fill missing data from neighbouring values. When using the forward fill, the previous available value is used to replace the not available value:

In [254]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


In [252]:
df.fillna(method='ffill')

Unnamed: 0,x,y
a,1.1,1.2
b,1.1,1.2
c,3.1,3.2
d,3.1,4.2
e,5.1,5.2
f,5.1,6.2
g,7.1,7.2


When using the back fill the next available value is used to replace the not available value:

In [255]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


In [251]:
df.fillna(method='bfill')

Unnamed: 0,x,y
a,1.1,1.2
b,3.1,3.2
c,3.1,3.2
d,5.1,4.2
e,5.1,5.2
f,7.1,6.2
g,7.1,7.2


These fill methods are also available as the dedicated DataFrame methods ffill and bfill:

In [243]:
df.bfill()

Unnamed: 0,x,y
a,1.1,1.2
b,3.1,3.2
c,3.1,3.2
d,5.1,4.2
e,5.1,5.2
f,7.1,6.2
g,7.1,7.2


In [244]:
df.ffill()

Unnamed: 0,x,y
a,1.1,1.2
b,1.1,1.2
c,3.1,3.2
d,3.1,4.2
e,5.1,5.2
f,5.1,6.2
g,7.1,7.2
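
Both ffill and bfill accept a limit argument, which caps how many consecutive not available values are filled. In more recent pandas releases the method argument of fillna is deprecated in favour of these dedicated methods, so they are the safer choice for new code. A minimal sketch using a hypothetical Series with two consecutive gaps:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two consecutive missing values
s = pd.Series([1.0, np.nan, np.nan, 4.0])

print(s.ffill().tolist())         # both gaps take the previous value
print(s.ffill(limit=1).tolist())  # only the first gap is filled
print(s.bfill(limit=1).tolist())  # only the gap next to 4.0 is filled
```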


The interpolate method can use neighbouring datapoints to interpolate a missing value:

In [257]:
? df.interpolate

Signature:
df.interpolate(
    method: 'str' = 'linear',
    *,
    axis: 'Axis' = 0,
    limit: 'int | None' = None,
    inplace: 'bool' = False,
    limit_direction: 'str | None' = None,
    limit_area: 'str | None' = None,
    downcast: 'str | None' = None,
    **kwargs,
) -> 'DataFrame | None'

The interpolate method has the keyword input argument method. If method is set to 'linear' numeric interpolation will use the two nearest non-null data points.

If the data type of the Series is object, the data will not be recognised as numeric and a TypeError is raised:

In [261]:
# df.interpolate(method='linear')

<span style='color:red'>TypeError</span>: Cannot interpolate with all object-dtype columns in the DataFrame. Try setting at least one column to a numeric dtype.

In [262]:
df.astype(float).interpolate(method='linear')

Unnamed: 0,x,y
a,1.1,1.2
b,2.1,2.2
c,3.1,3.2
d,4.1,4.2
e,5.1,5.2
f,6.1,6.2
g,7.1,7.2


This is the same as a 1st order polynomial (two nearest data points). If a polynomial method is specified, however, the index needs to be numeric or datetime, otherwise a ValueError is raised:

In [266]:
# df.astype(float).interpolate(method='polynomial', order=1)

<span style='color:red'>ValueError</span>: Index column must be numeric or datetime type when using polynomial method other than linear. Try setting a numeric or datetime index column before interpolating.

If reset_index is used to make the index numeric, polynomial interpolation can be used:

In [268]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=1) # 2 nearest data points

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2
6,7.1,7.2


In [269]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=2) # 3 nearest data points

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2
6,7.1,7.2


In [270]:
df.reset_index(drop=True).astype(float).interpolate(method='polynomial', order=3) # 4 nearest data points

Unnamed: 0,x,y
0,1.1,1.2
1,2.1,2.2
2,3.1,3.2
3,4.1,4.2
4,5.1,5.2
5,6.1,6.2
6,7.1,7.2
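
The interpolated values above are identical for every order because the example data lies on a straight line with an evenly spaced index. The role of the index becomes visible when the spacing is uneven. The sketch below uses a hypothetical Series and compares method='linear', which ignores the index, with method='index', which weights by the numeric index values (the polynomial methods additionally rely on SciPy being installed, so they are not used here):

```python
import numpy as np
import pandas as pd

# Hypothetical Series whose index is deliberately unevenly spaced
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# 'linear' ignores the index and treats the values as equally spaced
equally_spaced = s.interpolate(method='linear')

# 'index' weights the interpolation by the numeric index values
index_weighted = s.interpolate(method='index')

print(equally_spaced.loc[1])  # halfway between 0.0 and 10.0
print(index_weighted.loc[1])  # one tenth of the way from 0.0 to 10.0
```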


The isna DataFrame method returns a boolean DataFrame instance which is True for not available values and False otherwise:

In [281]:
df.isna()

Unnamed: 0,x,y
a,False,False
b,True,True
c,False,False
d,True,False
e,False,False
f,True,False
g,False,False


The opposite method notna returns a boolean DataFrame of inverse values:

In [306]:
df.notna()

Unnamed: 0,x,y
a,True,True
b,False,False
c,True,True
d,False,True
e,True,True
f,False,True
g,True,True


These two methods have the aliases isnull and notnull respectively. These aliases are used for consistency with the R programming language.

The boolean mask above can be used to index into the DataFrame instance:

In [307]:
bool_mask = df.notna()

Notice indexing using the boolean mask updates None to NaN:

In [308]:
df

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


In [292]:
df[bool_mask]

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2
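
A boolean DataFrame is also handy for counting not available values, and a boolean Series can be used to select whole rows. A minimal sketch, assuming a reconstructed version of the example DataFrame (the construction with np.nan is an assumption):

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the example DataFrame
df = pd.DataFrame({'x': [1.1, np.nan, 3.1, np.nan, 5.1, np.nan, 7.1],
                   'y': [1.2, np.nan, 3.2, 4.2, 5.2, 6.2, 7.2]},
                  index=list('abcdefg'))

# True counts as 1, so summing the mask counts missing values per Series
missing_per_column = df.isna().sum()

# any(axis=1) collapses the mask to one boolean per row
rows_with_gap = df[df.isna().any(axis=1)]

# A boolean Series selects the rows where it is True
x_present = df[df['x'].notna()]

print(missing_per_column)
print(list(rows_with_gap.index))
print(list(x_present.index))
```

Indexing with a boolean DataFrame keeps the original shape, whereas indexing with a boolean Series filters out rows.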


## String Series and String Methods

Supposing the following list of words is instantiated:

In [499]:
words = 'the quick brown for jumped over the lazy dog'.split()

In [500]:
words

['the', 'quick', 'brown', 'for', 'jumped', 'over', 'the', 'lazy', 'dog']

Using len of words will return the number of words and not the length of each word:

In [501]:
len(words)

9

To instead get a list of the length of each word, i.e. use len on each individual str, a list comprehension can be used:

In [502]:
[len(word) for word in words]

[3, 5, 5, 3, 6, 4, 3, 4, 3]

This can also be done using map:

In [503]:
map(len, words)

<map at 0x1b6a5814b80>

In [504]:
list(map(len, words))

[3, 5, 5, 3, 6, 4, 3, 4, 3]

If an analogous DataFrame is instantiated with a Series words:

In [505]:
df = pd.DataFrame({'words': 'the quick brown for jumped over the lazy dog'.split()})

In [506]:
df

Unnamed: 0,words
0,the
1,quick
2,brown
3,for
4,jumped
5,over
6,the
7,lazy
8,dog


Using len on the DataFrame will return the number of observations in the index:

In [507]:
len(df)

9

The DataFrame method applymap is similar to map and can be used to individually apply the len function element by element throughout the DataFrame:

In [508]:
df.applymap(func=len)

Unnamed: 0,words
0,3
1,5
2,5
3,3
4,6
5,4
6,3
7,4
8,3


Since every element in the DataFrame is a str, a str method can be applied to each element using applymap and a lambda expression:

In [509]:
df.applymap(func=lambda word: word.upper())

Unnamed: 0,words
0,THE
1,QUICK
2,BROWN
3,FOR
4,JUMPED
5,OVER
6,THE
7,LAZY
8,DOG


A Series has a similar method map:

In [510]:
df['words'].map(lambda word: word.upper())

0       THE
1     QUICK
2     BROWN
3       FOR
4    JUMPED
5      OVER
6       THE
7      LAZY
8       DOG
Name: words, dtype: object

Notice the difference in the return values. The method applymap called from the DataFrame returns another DataFrame instance. In contrast the method map when called from a Series returns another Series instance.
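
In pandas 2.1 and later, applymap is deprecated and the same element by element behaviour is available as the DataFrame method map. The sketch below is one way to write code that works with either, by picking whichever method the installed version provides (the variable name elementwise is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({'words': 'the quick brown for jumped over the lazy dog'.split()})

# Use DataFrame.map where it exists, otherwise fall back to applymap
elementwise = df.map if hasattr(df, 'map') else df.applymap

upper_df = elementwise(lambda word: word.upper())  # DataFrame in, DataFrame out
upper_series = df['words'].map(str.upper)          # Series in, Series out

print(type(upper_df).__name__, type(upper_series).__name__)
```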

The Series instance returned can be assigned to a new Series of the DataFrame:

In [511]:
df['upperwords'] = df['words'].map(lambda word: word.upper())

In [512]:
df

Unnamed: 0,words,upperwords
0,the,THE
1,quick,QUICK
2,brown,BROWN
3,for,FOR
4,jumped,JUMPED
5,over,OVER
6,the,THE
7,lazy,LAZY
8,dog,DOG


The DataFrame instance also has the method apply, which can be used to apply a function, for example the builtins function max, along an axis. By default it operates along axis 0, which is the 'index', so the function is applied to each Series:

In [513]:
df.apply(max)

words         the
upperwords    THE
dtype: object

In [514]:
df.apply(min)

words         brown
upperwords    BROWN
dtype: object

Since the str methods are commonly invoked, a Series has the attribute str which can be used to invoke the most common string methods:

In [515]:
df['words'].str.zfill(20)

0    00000000000000000the
1    000000000000000quick
2    000000000000000brown
3    00000000000000000for
4    00000000000000jumped
5    0000000000000000over
6    00000000000000000the
7    0000000000000000lazy
8    00000000000000000dog
Name: words, dtype: object

And includes some additions such as len:

In [517]:
df['words'].str.len()

0    3
1    5
2    5
3    3
4    6
5    4
6    3
7    4
8    3
Name: words, dtype: int64
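
The results from the str attribute are ordinary Series, so they can be combined with boolean selection. A minimal sketch using the same words:

```python
import pandas as pd

words = pd.Series('the quick brown for jumped over the lazy dog'.split(),
                  name='words')

upper = words.str.upper()          # same result as map with str.upper
has_o = words.str.contains('o')    # boolean Series, True where 'o' occurs
lengths = words.str.len()          # integer Series of word lengths

# The integer Series can drive a boolean selection of the original words
long_words = words[lengths > 4]

print(upper.tolist())
print(has_o.tolist())
print(long_words.tolist())
```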

## Numeric Series

If a DataFrame with numeric Series x, y and z is instantiated:

In [393]:
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [-2, -4, 6, 8, 10],
                   'z': [12, 24, 48, -63, -999]})

In [394]:
df

Unnamed: 0,x,y,z
0,1,-2,12
1,2,-4,24
2,3,6,48
3,4,8,-63
4,5,10,-999


The apply method can be used to apply the builtins function max along axis 0 'index' (default) or axis 1 'columns':

In [398]:
df.apply(max) #'index'

x     5
y    10
z    48
dtype: int64

In [399]:
df.apply(max, axis=1) #'columns'

0    12
1    24
2    48
3     8
4    10
dtype: int64

Note however that most of the common aggregation functions from builtins or numpy are implemented directly as DataFrame methods:

In [402]:
df.max(axis=0)

x     5
y    10
z    48
dtype: int64

In [401]:
df.min(axis=0)

x      1
y     -4
z   -999
dtype: int64

In [403]:
df.mean(axis=0)

x      3.0
y      3.6
z   -195.6
dtype: float64

In [405]:
df.var(axis=0)

x         2.5
y        38.8
z    203424.3
dtype: float64

In [404]:
df.std(axis=0)

x      1.581139
y      6.228965
z    451.025831
dtype: float64

And the data model methods are defined so that the arithmetic operators work element by element:

In [407]:
df['x'] + df['y']

0    -1
1    -2
2     9
3    12
4    15
dtype: int64

In [408]:
df['x'] + 5

0     6
1     7
2     8
3     9
4    10
Name: x, dtype: int64

The apply method can also be used with a tuple of these functions, outputting a DataFrame instance as opposed to a Series:

In [412]:
df.apply((len, max, min, np.mean, np.var, np.std))

Unnamed: 0,x,y,z
len,5.0,5.0,5.0
max,5.0,10.0,48.0
min,1.0,-4.0,-999.0
mean,3.0,3.6,-195.6
var,2.5,38.8,203424.3
std,1.581139,6.228965,451.025831
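
The var row above is 2.5 for 'x' rather than 2.0 because the pandas var and std methods divide by n - 1 by default (ddof=1), whereas NumPy applied to a plain array divides by n (ddof=0). A minimal sketch showing the difference, together with the agg method, which accepts the method names as strings:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [-2, -4, 6, 8, 10],
                   'z': [12, 24, 48, -63, -999]})

# agg accepts method names as strings, giving one row per statistic
summary = df.agg(['min', 'max', 'mean', 'var'])
print(summary)

print(df['x'].var())               # sample variance, ddof=1
print(df['x'].var(ddof=0))         # population variance, ddof=0
print(np.var(df['x'].to_numpy()))  # NumPy default on an array is ddof=0
```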


## Categorical Series

Another common type of Series is a category Series:

In [521]:
df = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin', 'Harry', 'Daniel', 'Valeria', 'Julia'],
                   'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']})

When instantiated, the categories will normally be recognised as strings:

In [522]:
df

Unnamed: 0,student,grade
0,Lucie,B
1,Petra,F
2,Pavel,A
3,Martin,C
4,Harry,A
5,Daniel,C
6,Valeria,B
7,Julia,A


And the data types will therefore be objects:

In [523]:
df.dtypes

student    object
grade      object
dtype: object

The data type of a Series can be changed using the method astype. To change to category use the input argument 'category':

In [524]:
oldidentifiers = dir(df['grade'])

In [525]:
df['grade'].astype('category')

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grade, dtype: category
Categories (4, object): ['A', 'B', 'C', 'F']

The original Series can be reassigned to the new Series, which is now categorical:

In [526]:
df['grade'] = df['grade'].astype('category')

If the DataFrame instance is examined, it looks the same:

In [527]:
df

Unnamed: 0,student,grade
0,Lucie,B
1,Petra,F
2,Pavel,A
3,Martin,C
4,Harry,A
5,Daniel,C
6,Valeria,B
7,Julia,A


However its data type is updated:

In [528]:
df.dtypes

student      object
grade      category
dtype: object

A categorical Series also has the attribute cat which groups together methods and attributes commonly used for categorical Series:

In [529]:
newidentifiers = dir(df['grade'])

In [530]:
for identifier in newidentifiers:
    if identifier not in oldidentifiers:
        print(identifier, end=' ')

cat 
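
A few of the attributes grouped under cat are shown in the minimal sketch below, using the same grades as a standalone Series:

```python
import pandas as pd

grades = pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A'],
                   name='grade').astype('category')

print(list(grades.cat.categories))  # the unique categories, sorted
print(grades.cat.codes.tolist())    # position of each value in categories
print(grades.cat.ordered)           # False until an order is assigned
```

Internally each value is stored as a small integer code that points into categories, which is why a categorical Series looks the same as a string Series when displayed.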

Categories are often used for boolean selectors:

In [531]:
df[df['grade'] == 'A']

Unnamed: 0,student,grade
2,Pavel,A
4,Harry,A
7,Julia,A


In [532]:
df[df['grade'] == 'B']

Unnamed: 0,student,grade
0,Lucie,B
6,Valeria,B


In [533]:
df[(df['grade'] == 'A') | (df['grade'] == 'B')]

Unnamed: 0,student,grade
0,Lucie,B
2,Pavel,A
4,Harry,A
6,Valeria,B
7,Julia,A


Only the equal to == and not equal to != operators are defined for unordered categoricals. A TypeError is raised if one of the other comparison operators is used:

In [534]:
# df[df['grade'] >= 'B']

<span style='color:red'>TypeError</span>: Unordered Categoricals can only compare equality or not

The as_ordered method can be used to ordinally order categories:

In [535]:
df['grade'].cat.as_ordered()

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grade, dtype: category
Categories (4, object): ['A' < 'B' < 'C' < 'F']

In this case, the order desired is the reverse of the alphabetical order because 'A' corresponds to a higher grade than 'F':

In [536]:
df['grade'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                   ordered=True)

0    B
1    F
2    A
3    C
4    A
5    C
6    B
7    A
Name: grade, dtype: category
Categories (4, object): ['F' < 'C' < 'B' < 'A']

The original Series 'grade' can be reassigned:

In [537]:
df['grade'] = df['grade'].cat.reorder_categories(new_categories = ('F', 'C', 'B', 'A'),
                                                 ordered=True)

In [538]:
df[df['grade'] >= 'B']

Unnamed: 0,student,grade
0,Lucie,B
2,Pavel,A
4,Harry,A
6,Valeria,B
7,Julia,A
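
The categories and their order can also be declared up front with pd.CategoricalDtype and passed to astype, which combines the conversion and the ordering in a single step. A minimal sketch (the variable name grade_dtype is hypothetical):

```python
import pandas as pd

# Categories listed from lowest to highest
grade_dtype = pd.CategoricalDtype(categories=['F', 'C', 'B', 'A'], ordered=True)

grades = pd.Series(['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A'],
                   name='grade').astype(grade_dtype)

print((grades >= 'B').tolist())    # comparison follows the declared order
print(grades.min(), grades.max())  # lowest and highest grade present
```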


When sorting out data in DataFrames, ordinal Series are quite often used:

In [539]:
df.sort_values(['grade'])

Unnamed: 0,student,grade
1,Petra,F
3,Martin,C
5,Daniel,C
0,Lucie,B
6,Valeria,B
2,Pavel,A
4,Harry,A
7,Julia,A


In [540]:
df.sort_values(['grade', 'student'])

Unnamed: 0,student,grade
1,Petra,F
5,Daniel,C
3,Martin,C
0,Lucie,B
6,Valeria,B
4,Harry,A
7,Julia,A
2,Pavel,A


A GroupBy instance can be created from the categories:

In [541]:
df.groupby(df['grade'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001B6A5247010>

In [542]:
gbo = df.groupby(df['grade'])

Statistical methods can then be called from this GroupBy instance applying them to every Series in the DataFrame. For example the statistical method count returns a DataFrame which counts the number of students for each grade:

In [543]:
gbo.count()

Unnamed: 0_level_0,student
grade,Unnamed: 1_level_1
F,1
C,2
B,2
A,3


A Series can be selected from the GroupBy instance so that the statistical method is called only on this Series:

In [547]:
gbo['student'].count()

grade
F    1
C    2
B    2
A    3
Name: student, dtype: int64

Notice the difference in output, the return value is a Series and not a DataFrame because the method was called from a Series and not a DataFrame.
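
Other aggregations can be called on the selected Series in the same way. The sketch below rebuilds the student DataFrame and collects the students in each grade into a list. Passing observed=True is a choice made here so that only categories present in the data appear, regardless of the default in the installed pandas version:

```python
import pandas as pd

df = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin',
                               'Harry', 'Daniel', 'Valeria', 'Julia'],
                   'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A']})
grade_dtype = pd.CategoricalDtype(categories=['F', 'C', 'B', 'A'], ordered=True)
df['grade'] = df['grade'].astype(grade_dtype)

# observed=True limits the result to categories that occur in the data
grouped = df.groupby('grade', observed=True)['student']

print(grouped.count())    # number of students per grade
print(grouped.agg(list))  # the students that fall in each grade
```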

Some methods like describe will however output a DataFrame: 

In [549]:
gbo['student'].describe()

Unnamed: 0_level_0,count,unique,top,freq
grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,1,1,Petra,1
C,2,2,Martin,1
B,2,2,Lucie,1
A,3,3,Pavel,1


In [550]:
gbo.describe()

Unnamed: 0_level_0,student,student,student,student
Unnamed: 0_level_1,count,unique,top,freq
grade,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
F,1,1,Petra,1
C,2,2,Martin,1
B,2,2,Lucie,1
A,3,3,Pavel,1


Notice the slight difference: in the latter case a multi-index is used for the columns, giving the statistical information (count, unique, top and freq) for each Series.

In [551]:
df

Unnamed: 0,student,grade
0,Lucie,B
1,Petra,F
2,Pavel,A
3,Martin,C
4,Harry,A
5,Daniel,C
6,Valeria,B
7,Julia,A


The difference can be seen more clearly if a second category is added to the DataFrame:

In [552]:
df['sex'] = pd.Series(['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F'])

In [555]:
df['sex'] = df['sex'].astype('category')

In [561]:
df.groupby('grade')['student'].describe()

Unnamed: 0_level_0,count,unique,top,freq
grade,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
F,1,1,Petra,1
C,2,2,Martin,1
B,2,2,Lucie,1
A,3,3,Pavel,1


In [562]:
df.groupby('grade').describe()

Unnamed: 0_level_0,student,student,student,student,sex,sex,sex,sex
Unnamed: 0_level_1,count,unique,top,freq,count,unique,top,freq
grade,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
F,1,1,Petra,1,1,1,F,1
C,2,2,Martin,1,2,1,M,2
B,2,2,Lucie,1,2,1,F,2
A,3,3,Pavel,1,3,2,M,2
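
With two categorical Series available, a GroupBy instance can also be created from both at once by supplying a list of Series names. A minimal sketch that counts the students in each combination of grade and sex (observed=True is again a choice made here to show only combinations that occur):

```python
import pandas as pd

df = pd.DataFrame({'student': ['Lucie', 'Petra', 'Pavel', 'Martin',
                               'Harry', 'Daniel', 'Valeria', 'Julia'],
                   'grade': ['B', 'F', 'A', 'C', 'A', 'C', 'B', 'A'],
                   'sex': ['F', 'F', 'M', 'M', 'M', 'M', 'F', 'F']})
grade_dtype = pd.CategoricalDtype(categories=['F', 'C', 'B', 'A'], ordered=True)
df['grade'] = df['grade'].astype(grade_dtype)
df['sex'] = df['sex'].astype('category')

# size counts the rows in each (grade, sex) group
counts = df.groupby(['grade', 'sex'], observed=True).size()
print(counts)
```

The result is a Series with a two-level index, one level per grouping Series.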


## DateTime

In pandas, dates and time intervals are based upon the data types datetime64 and timedelta64 respectively:

In [565]:
? np.datetime64

Init signature: np.datetime64(self, /, *args, **kwargs)
Docstring:
If created from a 64-bit integer, it represents an offset from
``1970-01-01T00:00:00``.
If created from string, the string can be in ISO 8601 date
or datetime format.

>>> np.datetime64(10, 'Y')
numpy.datetime64('1980')
>>> np.datetime64('1980', 'Y')
numpy.datetime64('1980')
>>> np.datetime64(10, 'D')
numpy.datetime64('1970-01-11')

See :ref:`arrays.datetime` for more information.

:Character code: ``'M'``
File:           c:\users\pyip\appdata\local\mambaforge\envs\jupyterlab\lib\site-packages\numpy\__init__.py
Type:           type
Subclasses:

The datetime64 class is normally initialised using a timestamp string in ISO 8601 format, either a date on its own or a date and a time separated by T. For example:

In [582]:
np.datetime64('2023-07-25')

numpy.datetime64('2023-07-25')

In [581]:
np.datetime64('2023-07-25T14:30:15.123456')

numpy.datetime64('2023-07-25T14:30:15.123456')

The timedelta64 class is normally initialised using an integer and a unit string:

In [564]:
? np.timedelta64

Init signature: np.timedelta64(self, /, *args, **kwargs)
Docstring:
A timedelta stored as a 64-bit integer.

See :ref:`arrays.datetime` for more information.

:Character code: ``'m'``
File:           c:\users\pyip\appdata\local\mambaforge\envs\jupyterlab\lib\site-packages\numpy\__init__.py
Type:           type
Subclasses:

In [579]:
np.timedelta64(1, 'D')

numpy.timedelta64(1,'D')

In [577]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h')

numpy.timedelta64(25,'h')

In [586]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's')

numpy.timedelta64(90001,'s')

In [585]:
np.timedelta64(1, 'D') + np.timedelta64(1, 'h') + np.timedelta64(1, 's') + np.timedelta64(1, 'ms')

numpy.timedelta64(90001001,'ms')
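
The two classes are designed to work together in arithmetic. A minimal sketch, using a hypothetical 36 hour interval:

```python
import numpy as np

start = np.datetime64('2023-07-25')

# Adding a timedelta64 to a datetime64 gives a new datetime64
later = start + np.timedelta64(36, 'h')

# Subtracting two datetime64 values gives a timedelta64
elapsed = later - start

# Dividing by a unit timedelta converts the interval to a plain number
hours = elapsed / np.timedelta64(1, 'h')

print(later, elapsed, hours)
```

As with the sums above, the result takes the finer of the two units involved, in this case hours.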

These can be used to make a datetime Index or Series, normally by generating an array of evenly spaced times with the np.arange function:

In [590]:
starttime = np.datetime64('2023-07-25')
endtime = np.datetime64('2023-07-26')
timeinterval = np.timedelta64(1, 'h')

In [591]:
times = np.arange(start=starttime, #inclusive
                  stop=endtime, #exclusive
                  step=timeinterval)

In [592]:
times

array(['2023-07-25T00', '2023-07-25T01', '2023-07-25T02', '2023-07-25T03',
       '2023-07-25T04', '2023-07-25T05', '2023-07-25T06', '2023-07-25T07',
       '2023-07-25T08', '2023-07-25T09', '2023-07-25T10', '2023-07-25T11',
       '2023-07-25T12', '2023-07-25T13', '2023-07-25T14', '2023-07-25T15',
       '2023-07-25T16', '2023-07-25T17', '2023-07-25T18', '2023-07-25T19',
       '2023-07-25T20', '2023-07-25T21', '2023-07-25T22', '2023-07-25T23'],
      dtype='datetime64[h]')

These times can be cast into an Index or Series:

In [593]:
pd.Index(times)

DatetimeIndex(['2023-07-25 00:00:00', '2023-07-25 01:00:00',
               '2023-07-25 02:00:00', '2023-07-25 03:00:00',
               '2023-07-25 04:00:00', '2023-07-25 05:00:00',
               '2023-07-25 06:00:00', '2023-07-25 07:00:00',
               '2023-07-25 08:00:00', '2023-07-25 09:00:00',
               '2023-07-25 10:00:00', '2023-07-25 11:00:00',
               '2023-07-25 12:00:00', '2023-07-25 13:00:00',
               '2023-07-25 14:00:00', '2023-07-25 15:00:00',
               '2023-07-25 16:00:00', '2023-07-25 17:00:00',
               '2023-07-25 18:00:00', '2023-07-25 19:00:00',
               '2023-07-25 20:00:00', '2023-07-25 21:00:00',
               '2023-07-25 22:00:00', '2023-07-25 23:00:00'],
              dtype='datetime64[s]', freq=None)

In [595]:
pd.Series(data=times, name='times')

0    2023-07-25 00:00:00
1    2023-07-25 01:00:00
2    2023-07-25 02:00:00
3    2023-07-25 03:00:00
4    2023-07-25 04:00:00
5    2023-07-25 05:00:00
6    2023-07-25 06:00:00
7    2023-07-25 07:00:00
8    2023-07-25 08:00:00
9    2023-07-25 09:00:00
10   2023-07-25 10:00:00
11   2023-07-25 11:00:00
12   2023-07-25 12:00:00
13   2023-07-25 13:00:00
14   2023-07-25 14:00:00
15   2023-07-25 15:00:00
16   2023-07-25 16:00:00
17   2023-07-25 17:00:00
18   2023-07-25 18:00:00
19   2023-07-25 19:00:00
20   2023-07-25 20:00:00
21   2023-07-25 21:00:00
22   2023-07-25 22:00:00
23   2023-07-25 23:00:00
Name: times, dtype: datetime64[s]

In [604]:
df = pd.DataFrame(index=pd.Index(times),
                  data={'measurement': np.arange(start=0, stop=24, step=1)})

In [605]:
df

Unnamed: 0,measurement
2023-07-25 00:00:00,0
2023-07-25 01:00:00,1
2023-07-25 02:00:00,2
2023-07-25 03:00:00,3
2023-07-25 04:00:00,4
2023-07-25 05:00:00,5
2023-07-25 06:00:00,6
2023-07-25 07:00:00,7
2023-07-25 08:00:00,8
2023-07-25 09:00:00,9


In [607]:
df.loc['2023-07-25T01:00:00']

measurement    1
Name: 2023-07-25 01:00:00, dtype: int32

In [608]:
df.iloc[1]

measurement    1
Name: 2023-07-25 01:00:00, dtype: int32
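
The same hourly index can also be generated directly in pandas with pd.date_range, and timestamp strings can be used to slice a range of times with loc. A minimal sketch (passing a pd.Timedelta as freq is a choice made here to avoid frequency alias strings, whose spelling differs between pandas versions):

```python
import numpy as np
import pandas as pd

# 24 hourly timestamps starting at midnight
index = pd.date_range(start='2023-07-25', periods=24,
                      freq=pd.Timedelta(hours=1))
df = pd.DataFrame({'measurement': np.arange(24)}, index=index)

# Label slices on a datetime index include both end points
morning = df.loc['2023-07-25 06:00':'2023-07-25 08:00']
print(morning)
```

Unlike integer slicing with iloc, the stop label of a loc slice is inclusive.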

In [587]:
? pd.bdate_range

Signature:
pd.bdate_range(
    start=None,
    end=None,
    periods: 'int | None' = None,
    freq: 'Frequency' = 'B',
    tz=None,
    normalize: 'bool' = True,
    name: 'Hashable' = None,
    weekmask=None,
    holidays=None,
    inclusive: 'IntervalClosedType' = 'both',
    **kwargs

In [588]:
? pd.timedelta_range

Signature:
pd.timedelta_range(
    start=None,
    end=None,
    periods: 'int | None' = None,
    freq=None,
    name=None,
    closed=None,
    *,
    unit: 'str | None' = None,
) -> 'TimedeltaIndex'
Docstring:
Return a fixed frequency TimedeltaIndex with day as the default.

Parameters
----------
start : str or timedelta-like, default None
    Left bound for generating timedeltas.
end : str or timedelta-like, default None

In [589]:
? pd.tseries

Type:        module
String form: <module 'pandas.tseries' from 'c:\\Users\\pyip\\AppData\\Local\\mambaforge\\envs\\jupyterlab\\Lib\\site-packages\\pandas\\tseries\\__init__.py'>
File:        c:\users\pyip\appdata\local\mambaforge\envs\jupyterlab\lib\site-packages\pandas\tseries\__init__.py
Docstring:   <no docstring>

In [230]:
import inspect

# List the public callables of df that take an axis argument and are not ndarray methods
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('axis' in inspect.signature(getattr(df, identifier)).parameters):
            if identifier not in dir(np.ndarray):
                print(identifier, end=' ')

add add_prefix add_suffix agg aggregate align apply at_time backfill between_time bfill corrwith count cummax cummin diff div divide drop droplevel dropna eq ewm expanding ffill fillna filter floordiv ge groupby gt idxmax idxmin iloc interpolate kurt kurtosis le loc lt mask median mod mode mul multiply ne nunique pad pow product quantile radd rank rdiv reindex rename rename_axis reorder_levels resample rfloordiv rmod rmul rolling rpow rsub rtruediv sample sem set_axis shift skew sort_index sort_values sub subtract swaplevel to_period to_timestamp transform truediv truncate tz_convert tz_localize where xs 

In [232]:
print(dir(np.ndarray), end=' ')

['T', '__abs__', '__add__', '__and__', '__array__', '__array_finalize__', '__array_function__', '__array_interface__', '__array_prepare__', '__array_priority__', '__array_struct__', '__array_ufunc__', '__array_wrap__', '__bool__', '__class__', '__class_getitem__', '__complex__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dir__', '__divmod__', '__dlpack__', '__dlpack_device__', '__doc__', '__eq__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__ilshift__', '__imatmul__', '__imod__', '__imul__', '__index__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__irshift__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lshift__', '__lt__', '__matmul__', '__mod__', '__mul__', '__ne__', '__neg__', '__new__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__',

In [199]:
for identifier in dir(pd.DataFrame):
    if not identifier.startswith('_'):
        print(identifier, end=' ')

T abs add add_prefix add_suffix agg aggregate align all any apply applymap asfreq asof assign astype at at_time attrs axes backfill between_time bfill bool boxplot clip columns combine combine_first compare convert_dtypes copy corr corrwith count cov cummax cummin cumprod cumsum describe diff div divide dot drop drop_duplicates droplevel dropna dtypes duplicated empty eq equals eval ewm expanding explode ffill fillna filter first first_valid_index flags floordiv from_dict from_records ge get groupby gt head hist iat idxmax idxmin iloc index infer_objects info insert interpolate isetitem isin isna isnull items iterrows itertuples join keys kurt kurtosis last last_valid_index le loc lt mask max mean median melt memory_usage merge min mod mode mul multiply ndim ne nlargest notna notnull nsmallest nunique pad pct_change pipe pivot pivot_table plot pop pow prod product quantile query radd rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv

In [200]:
for identifier in dir(pd.Series):
    if identifier in dir(str):
        print(identifier, end=' ')

__add__ __class__ __contains__ __delattr__ __dir__ __doc__ __eq__ __format__ __ge__ __getattribute__ __getitem__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __iter__ __le__ __len__ __lt__ __mod__ __mul__ __ne__ __new__ __reduce__ __reduce_ex__ __repr__ __rmod__ __rmul__ __setattr__ __sizeof__ __str__ __subclasshook__ count index replace 

In [201]:
for identifier in dir(pd.Series):
    if identifier in dir(int):
        print(identifier, end=' ')

__abs__ __add__ __and__ __bool__ __class__ __delattr__ __dir__ __divmod__ __doc__ __eq__ __float__ __floordiv__ __format__ __ge__ __getattribute__ __getstate__ __gt__ __hash__ __init__ __init_subclass__ __int__ __invert__ __le__ __lt__ __mod__ __mul__ __ne__ __neg__ __new__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __xor__ 

In [202]:
print(dir(df), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__

In [203]:
df.style

Unnamed: 0,x,y
a,1.1,1.2
b,,
c,3.1,3.2
d,,4.2
e,5.1,5.2
f,,6.2
g,7.1,7.2


In [204]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 7 entries, a to g
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       4 non-null      object
 1   y       6 non-null      object
dtypes: object(2)
memory usage: 168.0+ bytes


## Reading and Writing to Files

The Series and DataFrames previously examined were created using builtins datatypes. pandas has a number of functions for reading in data from external files:

In [206]:
for identifier in dir(pd):
    if identifier.startswith('read_'):
        print(identifier, end=' ')

read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml 

## CSV File

CSV is an abbreviation for comma separated values. The file format has a similar structure to a tuple, where each element is separated by a comma. In the case of a CSV file, each column is separated by a comma and the newline character is an instruction to move onto the next row.

When opened in a program such as Microsoft Excel, these display as a grid:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Notice that the comma in twinkle, twinkle is not a delimiter but part of the string. For this reason "twinkle, twinkle" was displayed enclosed in quotations.
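
The effect of the quotations can be checked without a file on disk by reading CSV text from memory. The sketch below uses hypothetical contents (the column names rhyme and number and the rows are assumptions, not the contents of the file pictured) and the read_csv function that is covered in more detail further on:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text: the first field of the first row contains a comma
csv_text = 'rhyme,number\n"twinkle, twinkle",1\nlittle star,2\n'

# StringIO lets read_csv treat the string as if it were a file
df = pd.read_csv(StringIO(csv_text))
print(df)
```

The quoted field is read as a single value, so the DataFrame has two columns and not three.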

The CSV file has a file name, in this case Book1.csv.

Because it is in the same folder as the interactive Python notebook, the file path can be specified as the following string:

<img src='./images/img_002.png' alt='img_002' width='800'/>

In [207]:
file_path = r'.\Book1.csv'

In [208]:
file_path

'.\\Book1.csv'

* r means raw string. In a raw string \ is used to indicate a \ instead of an instruction to insert an escape character.
* .\ means in the same folder as the interactive Python notebook. The backslash is the Windows path separator and ./ with a forward slash can be used equivalently.

If the file is moved into a sub folder called files:

<img src='./images/img_003.png' alt='img_003' width='800'/>

Then the file path becomes:

In [209]:
file_path = r'.\files\Book1.csv'

In [210]:
file_path

'.\\files\\Book1.csv'

If the file is placed up a level from the interactive Python notebook, the file path becomes:

<img src='./images/img_005.png' alt='img_005' width='800'/>

In [211]:
file_path = r'..\Book1.csv'

In [212]:
file_path

'..\\Book1.csv'

And if a subfolder (that is in the folder up a level from the interactive Python notebook file) is made called files:

<img src='./images/img_006.png' alt='img_006' width='800'/>

In [213]:
file_path = r'..\files\Book1.csv'

In [214]:
file_path

'..\\files\\Book1.csv'
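
The four relative paths above can also be built with pathlib from the standard library, which avoids having to choose between \ and / by hand. This is a minimal sketch of an alternative, not the approach used in the cells above, and the variable names are hypothetical:

```python
from pathlib import Path

notebook_folder = Path('.')  # the folder containing the notebook

same_folder = notebook_folder / 'Book1.csv'
sub_folder = notebook_folder / 'files' / 'Book1.csv'
parent_folder = notebook_folder / '..' / 'Book1.csv'
parent_sub_folder = notebook_folder / '..' / 'files' / 'Book1.csv'

# as_posix shows each path with forward slashes on any operating system
for path in (same_folder, sub_folder, parent_folder, parent_sub_folder):
    print(path.as_posix())
```

A Path instance can be supplied to read_csv in place of a string.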

The function read_csv is used to read in a CSV file as a DataFrame:

In [215]:
? pd.read_csv

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols=

read_csv has a large number of input arguments; however, only the first one is mandatory when the file is in the expected format:

In [216]:
df = pd.read_csv(filepath_or_buffer='Book1.csv')

In [None]:
df

The first input argument is normally used positionally:

In [None]:
df = pd.read_csv(r'./files/Book1.csv')

In [None]:
df
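
The remaining input arguments are keyword-only and are supplied by name when the file is not in the default format. A minimal sketch, using hypothetical semicolon-delimited text held in memory instead of Book1.csv:

```python
import io
import pandas as pd

# Hypothetical data that uses a semicolon as the delimiter
text = 'id;x;y\na;1.1;1.2\nb;2.1;2.2\n'

# sep sets the delimiter and index_col selects the column used as the Index
df = pd.read_csv(io.StringIO(text), sep=';', index_col='id')

print(df)
```

The first input argument accepts a text buffer such as io.StringIO as well as a file path, which is convenient for small experiments.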

Notice the Series names are taken from the first row of the file and a numeric index is added.

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.axes
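
These three attributes can be tried on a small DataFrame. The sketch below reads hypothetical in-memory text, so the values differ from those in Book1.csv:

```python
import io
import pandas as pd

# Hypothetical data with a header row and three data rows
text = 'x,y\n1.1,1.2\n2.1,2.2\n3.1,3.2\n'
df = pd.read_csv(io.StringIO(text))

print(df.index)    # the row labels, a RangeIndex added by read_csv
print(df.columns)  # the Series names, taken from the header row
print(df.axes)     # a list holding the index followed by the columns
```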

If the file is very large, the DataFrame methods head and tail can be used to preview the first 5 or last 5 rows:

In [None]:
df.head()

In [None]:
df.tail()

A custom number of rows n can be specified for preview:

In [None]:
df.head(n=3)
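
A short sketch of head and tail on a hypothetical ten-row DataFrame, constructed directly instead of being read from a file:

```python
import pandas as pd

# Hypothetical DataFrame with ten rows
df = pd.DataFrame({'x': range(10), 'y': range(10, 20)})

first = df.head()     # the first 5 rows by default
last = df.tail(n=2)   # the last 2 rows

print(first)
print(last)
```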

The attribute dtypes can be used to view the datatype for each Series:

In [None]:
df.dtypes

In [None]:
df.info()

Notice the integer, boolean and floating point Series are int64, bool and float64, meaning their datatypes have been inferred automatically. pandas gives non-numeric data such as strings the object datatype. Notice that the date, time and category Series are all of the datatype object, as they are effectively read in as strings.
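
Series that are read in as strings can be converted after loading. The following is a minimal sketch with hypothetical column names (n, flag, ratio, date, grade), which are not the columns of Book1.csv:

```python
import io
import pandas as pd

# Hypothetical data with integer, boolean, float, date and category columns
text = ('n,flag,ratio,date,grade\n'
        '1,True,0.5,2024-01-01,a\n'
        '2,False,1.5,2024-01-02,b\n'
        '3,True,2.5,2024-01-03,a\n')
df = pd.read_csv(io.StringIO(text))
print(df.dtypes)

# Convert the date strings to timestamps and the grades to a categorical
df['date'] = pd.to_datetime(df['date'])
df['grade'] = df['grade'].astype('category')
print(df.dtypes)
```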

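The DataFrame method describe returns summary statistics for the numeric Series and the method count returns the number of non-missing values in each Series: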

In [None]:
df.describe()

In [None]:
df.count()
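
A small sketch of both methods on a hypothetical DataFrame that contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data; the Series x has one missing value
df = pd.DataFrame({'x': [1.0, 2.0, np.nan, 4.0],
                   'y': [10, 20, 30, 40]})

counts = df.count()      # non-missing values in each Series
summary = df.describe()  # count, mean, std, min, quartiles and max

print(counts)
print(summary)
```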

In [None]:
import inspect

# List the public Series methods that accept an axis input argument
for identifier in dir(xseries):
    attribute = getattr(xseries, identifier)
    is_function = callable(attribute)
    is_datamodel = identifier.startswith('_')
    if is_function and not is_datamodel:
        if 'axis' in inspect.signature(attribute).parameters:
            print(identifier, end=' ')

In [None]:
? pd.concat