# pandas library

The name pandas is derived from "panel data" and is also a play on "Python data analysis". The library is built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

By default an Index is numeric: zero-based integers increasing in steps of one.

The Index, like a tuple, list or 1darray, has a single dimension, which can be represented either as a row:

|index|0|1|2|3|
|---|---|---|---|---|

Or as a column when convenient:

|index|
|---|
|0|
|1|
|2|
|3|


The Series class has a value at each index and a name. It is essentially a numpy 1darray that has a name. A Series is normally represented as a column (notice the Index associated with the Series is also displayed as a column):

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame is essentially a grouping of Series that share the same Index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

## Importing Libraries

To use the data science libraries they need to be imported:

In [623]:
import numpy as np 
import pandas as pd

Once imported, the identifiers of the library can be viewed:

In [624]:
print(dir(pd), sep=' ')

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_config', '_is_numpy_dev', '_libs', '_testing', '_typing', '_version', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', '

These can be grouped. The identifiers beginning and ending with a double underscore are the data model identifiers, which mainly give details about the library:

In [625]:
for identifier in dir(pd):
    isdatamodel = identifier[0:2] == '__'
    if (isdatamodel):
        print(identifier, end=' ')

__all__ __builtins__ __cached__ __doc__ __docformat__ __file__ __git_version__ __loader__ __name__ __package__ __path__ __spec__ __version__ 

For example the name, version and file:

In [626]:
pd.__name__

'pandas'

In [627]:
pd.__version__

'2.0.2'

In [628]:
pd.__file__

'c:\\Users\\pyip\\AppData\\Local\\mambaforge\\envs\\jupyterlab\\Lib\\site-packages\\pandas\\__init__.py'

The identifiers beginning with a single underscore are for internal use only:

In [629]:
for identifier in dir(pd):
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (isinternal):
        print(identifier, end=' ')

_config _is_numpy_dev _libs _testing _typing _version 

The classes are all in CamelCase:

In [630]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and isupper and not isdatamodel):
        print(identifier, end=' ')

ArrowDtype BooleanDtype Categorical CategoricalDtype CategoricalIndex DataFrame DateOffset DatetimeIndex DatetimeTZDtype ExcelFile ExcelWriter Flags Float32Dtype Float64Dtype Grouper HDFStore Index Int16Dtype Int32Dtype Int64Dtype Int8Dtype Interval IntervalDtype IntervalIndex MultiIndex NamedAgg Period PeriodDtype PeriodIndex RangeIndex Series SparseDtype StringDtype Timedelta TimedeltaIndex Timestamp UInt16Dtype UInt32Dtype UInt64Dtype UInt8Dtype 

The main classes are:

* Index
* Series
* DataFrame
 
There are some variations of Index such as RangeIndex, MultiIndex, DatetimeIndex and TimedeltaIndex.
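The Index variations listed above can each be constructed directly. The following is a minimal sketch; the dates, labels and level names are made up for illustration:

```python
import pandas as pd

# RangeIndex: the default zero-based index, stored as start/stop/step
range_index = pd.RangeIndex(start=0, stop=4, step=1)

# DatetimeIndex: evenly spaced timestamps (illustrative start date)
datetime_index = pd.date_range(start='2023-01-01', periods=4, freq='D')

# TimedeltaIndex: evenly spaced durations
timedelta_index = pd.timedelta_range(start='0 days', periods=4, freq='D')

# MultiIndex: hierarchical labels built from tuples (illustrative names)
multi_index = pd.MultiIndex.from_tuples(
    [('a', 0), ('a', 1), ('b', 0), ('b', 1)],
    names=['letter', 'number'])

for index in (range_index, datetime_index, timedelta_index, multi_index):
    print(type(index).__name__, len(index))
```

Each of these is a subclass of Index, so they can all be supplied to the index keyword argument of a Series or DataFrame.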

In general pandas uses object-oriented programming (OOP) as opposed to functional programming. This means methods are normally called on Index, Series and DataFrame instances to analyse or manipulate their data. Many of the functions within the pandas library read in data from a file and output a DataFrame instance:

In [631]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array bdate_range concat crosstab cut date_range describe_option eval factorize from_dummies get_dummies get_option infer_freq interval_range isna isnull json_normalize lreshape melt merge merge_asof merge_ordered notna notnull option_context period_range pivot pivot_table qcut read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml reset_option set_eng_float_format set_option show_versions test timedelta_range to_datetime to_numeric to_pickle to_timedelta unique value_counts wide_to_long 
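As a small illustration of a read function returning a DataFrame instance, the sketch below uses read_csv. To keep it self-contained it reads from an in-memory string (an assumption for this example) rather than a file on disk:

```python
import io

import pandas as pd

# CSV text standing in for the contents of a file (illustrative data)
csv_text = "x,y\n1.1,1.2\n2.1,2.2\n3.1,3.2\n4.1,4.2\n"

# read_csv accepts a path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(type(df).__name__)
print(df)
```

The header row supplies the column names, and the index and dtypes are inferred.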

pandas modules are also in lower case:

In [632]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

annotations api arrays compat core errors io offsets options pandas plotting testing tseries util 

The modules are not normally used directly by the user but internally called when constructing an Index, Series or DataFrame:

In [633]:
print(dir(pd.arrays), end=' ')

['ArrowExtensionArray', 'ArrowStringArray', 'BooleanArray', 'Categorical', 'DatetimeArray', 'FloatingArray', 'IntegerArray', 'IntervalArray', 'PandasArray', 'PeriodArray', 'SparseArray', 'StringArray', 'TimedeltaArray', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 
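The grouping loops used above all follow the same pattern, so they can be collected into a single helper. The sketch below is one possible arrangement; group_identifiers and its group names are hypothetical and not part of pandas:

```python
import pandas as pd


def group_identifiers(obj):
    """Sort the identifiers of obj into named groups (hypothetical helper)."""
    groups = {'datamodel': [], 'internal': [], 'classes': [],
              'functions': [], 'other': []}
    for identifier in dir(obj):
        attribute = getattr(obj, identifier, None)
        if identifier.startswith('__') and identifier.endswith('__'):
            groups['datamodel'].append(identifier)
        elif identifier.startswith('_'):
            groups['internal'].append(identifier)
        elif callable(attribute) and identifier[0].isupper():
            groups['classes'].append(identifier)
        elif callable(attribute):
            groups['functions'].append(identifier)
        else:
            # modules and constants end up here
            groups['other'].append(identifier)
    return groups


groups = group_identifiers(pd)
for name, identifiers in groups.items():
    print(name, len(identifiers))
```

The same helper can be applied to a Series or DataFrame instance, although for an instance the uppercase test is a naming heuristic rather than a true check for classes.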

## Series

The initialisation signature for a pandas Series can be examined:

In [634]:
? pd.Series

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool' = False,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray ha

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied, an empty Series with no index, no name and the generic object datatype is instantiated:

In [635]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a numpy 1darray:

In [636]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

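Since an ndarray can itself be initialised from a list, the list can be supplied directly. Notice that the inferred integer dtype differs: the int32 above is NumPy's platform-dependent default integer on this machine, whereas pandas infers int64 from a list of integers: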

In [637]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64

When dtype=None, the data type will be inferred from the data:

In [638]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [639]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2023-07-24 17:47:51.601748
1   2023-07-25 17:47:51.601748
2   2023-07-26 17:47:51.601748
dtype: datetime64[ns]

In [640]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a string in it is classed as non-numeric and has the generic dtype object.

The dtype can be manually overridden when supplying the numpy 1darray by using the np.array input argument dtype:

In [641]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or by alternatively using the Series keyword input argument dtype:

In [642]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32

Notice that the index is numeric by default, starting at zero and increasing in integer steps of 1. This can be manually changed by use of the keyword input argument index and providing an Index, ndarray or list of index values:

In [643]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32

A Series usually also has a name:

In [644]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32

Normally the data and name are supplied and the index and dtype are inferred:

In [645]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64
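The name of a Series matters most when Series are combined, because it becomes the column name. The sketch below shows this with to_frame and concat; the second Series, yseries, is made up for illustration:

```python
import pandas as pd

xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
yseries = pd.Series([1.2, 2.2, 3.2, 4.2], name='y')

# to_frame wraps a single Series; its name becomes the column name
xframe = xseries.to_frame()

# concat along axis=1 places the Series side by side,
# aligned on their shared index
df = pd.concat([xseries, yseries], axis=1)

print(xframe.columns)
print(df)
```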

## DataFrame

The initialisation signature for a pandas DataFrame can be examined:

In [646]:
? pd.DataFrame

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------


The keyword input arguments for a DataFrame are similar to those of a Series; however, as a DataFrame is a collection of Series, some of these are plural:

* data (plural)
* index (singular)
* columns (plural of name)
* dtype (singular, a single dtype applied to every column)

In [647]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|

The dtype parameter accepts only a single dtype, which is applied to every column. Supplying a list of dtypes raises a TypeError.
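The DataFrame constructor is documented as accepting a single dtype. When each column needs its own dtype, one option is to convert after construction using astype with a dictionary that maps column names to dtypes. A minimal sketch, with illustrative data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1.0, 1.2],
                        [2.0, 2.2],
                        [3.0, 3.2],
                        [4.0, 4.2]],
                  index=['a', 'b', 'c', 'd'],
                  columns=['x', 'y'])

# astype returns a new DataFrame; the dictionary gives a dtype per column
typed = df.astype({'x': np.int32, 'y': np.float64})

print(typed.dtypes)
```

The original df is left unchanged, since astype returns a new instance.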

Normally the dtype and index are inferred:

In [648]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


It is more common to supply the data in the form of a dictionary of key: value pairs. Each key should be a string, which becomes the column name in the DataFrame instance, and each value should be a 1darray or list, which becomes the data in that column:

In [649]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


## Series Identifiers

If the following ndarray and Series are created:

In [650]:
xarray = np.array([1.1, 2.1, 3.1, 4.1])

In [651]:
xarray

array([1.1, 2.1, 3.1, 4.1])

In [652]:
xseries = pd.Series(xarray, name='x')

In [653]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Its attributes can be viewed:

In [654]:
print(dir(xseries), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror_

The above may seem overwhelming at first glance; however, the identifiers can be split into separate groupings. The behaviour of many of them, particularly the most common ones, has already been examined when looking at numeric datatypes and ndarrays, and is broadcast across the data in the Series.

Attributes:

In [655]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array at attrs axes dtype dtypes empty flags hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name nbytes ndim shape size values 

Methods:

In [656]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply argmax argmin argsort asfreq asof astype at_time autocorr backfill between between_time bfill bool clip combine combine_first compare convert_dtypes copy corr count cov cummax cummin cumprod cumsum describe diff div divide divmod dot drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull item items keys kurt kurtosis last last_valid_index le loc lt map mask max mean median memory_usage min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow prod product quantile radd rank ravel rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels repeat replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample searchsorted sem set_axis set_flags shift skew sort_index 

Data model attributes:

In [657]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [658]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __float__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __int__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

There are also a number of internal attributes:

In [659]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (not isfunction and not isupper and isinternal):
        print(identifier, end=' ')

_AXIS_LEN _AXIS_ORDERS _AXIS_TO_AXIS_NUMBER _HANDLED_TYPES _accessors _agg_examples_doc _agg_see_also_doc _attrs _can_hold_na _data _flags _hidden_attrs _info_axis _info_axis_name _info_axis_number _internal_names _internal_names_set _is_cached _is_copy _is_mixed_type _is_view _item_cache _metadata _mgr _name _references _stat_axis _stat_axis_name _stat_axis_number _typ _values 

And internal methods:

In [660]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (isfunction and not isupper and isinternal):
        print(identifier, end=' ')

_accum_func _add_numeric_operations _align_frame _align_series _append _arith_method _as_manager _binop _check_inplace_and_allows_duplicate_labels _check_inplace_setting _check_is_chained_assignment_possible _check_label_or_level_ambiguity _check_setitem_copy _clear_item_cache _clip_with_one_bound _clip_with_scalar _cmp_method _consolidate _consolidate_inplace _construct_axes_dict _construct_result _constructor _constructor_expanddim _convert_dtypes _dir_additions _dir_deletions _drop_axis _drop_labels_or_levels _duplicated _find_valid_index _get_axis _get_axis_name _get_axis_number _get_axis_resolvers _get_block_manager_axis _get_bool_data _get_cacher _get_cleaned_column_resolvers _get_index_resolvers _get_label_or_level_values _get_numeric_data _get_value _get_values _get_values_tuple _get_with _gotitem _indexed_same _init_dict _init_mgr _inplace_method _is_label_or_level_reference _is_label_reference _is_level_reference _ixs _logical_func _logical_method _map_values _maybe_update_ca

Since the Series is based on an ndarray, most of the data model attributes and methods behave analogously. Only a few data model identifiers are found in a Series but not in an ndarray:

In [661]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__annotations__ __dict__ __module__ 

In [662]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__finalize__ __getattr__ __nonzero__ __round__ __weakref__ 

The main supplementary functionality is with the attributes:

In [663]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

array at attrs axes dtypes empty hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name values 

Many of these attributes return a value that was supplied to, or inferred by, the initialisation signature:

In [664]:
xseries.array

<PandasArray>
[1.1, 2.1, 3.1, 4.1]
Length: 4, dtype: float64

In [665]:
xseries.name

'x'

In [666]:
xseries.index

RangeIndex(start=0, stop=4, step=1)

In [667]:
xseries.values

array([1.1, 2.1, 3.1, 4.1])

In [668]:
xseries.dtypes

dtype('float64')
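Other attributes in the list above describe properties of the data or the index rather than echoing an initialisation argument. The sketch below reads a few of them; the second Series, containing a missing value, is made up for comparison:

```python
import numpy as np
import pandas as pd

xseries = pd.Series(np.array([1.1, 2.1, 3.1, 4.1]), name='x')
with_nan = pd.Series([1.1, np.nan, 1.1])

print(xseries.empty)                     # whether there are no values
print(xseries.hasnans)                   # whether any value is missing
print(xseries.is_unique)                 # whether all values are distinct
print(xseries.is_monotonic_increasing)   # whether values never decrease

print(with_nan.hasnans)
print(with_nan.is_unique)
```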

The main functionality is in the added methods:

In [669]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill bool combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide divmod drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index le loc lt map mask median memory_usage mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow product quantile radd rank rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling rpow rsub rtruediv sample sem set_axis set_flags shift skew sort_index sort_values sub subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_

Note in the above there are method equivalents to the data model identifiers:

In [670]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinarrayasdatamodel):
        print(identifier, end=' ')

abs add bool divmod eq floordiv ge gt le lt mod mul ne pow radd rdivmod rfloordiv rmod rmul rpow rsub rtruediv sub truediv 

It is more common to use the equivalent data model operator:

In [671]:
xseries + 4

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [672]:
xseries.__add__(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [673]:
xseries.add(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64
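One reason the named methods exist alongside the operators is that they accept extra keyword arguments. The sketch below uses the fill_value argument of add with two Series whose indexes only partly overlap; the data are illustrative:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])
b = pd.Series([10.0, 20.0], index=['a', 'b'])

# The operator aligns on the index; 'c' has no partner in b, giving NaN
operator_sum = a + b

# The method can substitute a value for the missing partner
method_sum = a.add(b, fill_value=0)

print(operator_sum)
print(method_sum)
```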

Some of these methods share a name with a function in builtins. Methods such as abs, divmod and pow behave like their builtin counterparts, whereas filter and map differ in detail (filter selects by index label and map returns a new Series):

In [674]:
import builtins

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinbuiltins = identifier in dir(builtins)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinbuiltins):
        print(identifier, end=' ')

abs bool divmod filter map pow 
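The relationship between these methods and the builtins of the same name can be checked directly. The sketch below compares abs, pow, map and filter on an illustrative Series:

```python
import pandas as pd

xseries = pd.Series([-1.5, 2.0, -3.5], name='x')

# abs and pow: the builtin delegates to the data model method,
# so builtin and method give the same Series
same_abs = abs(xseries).equals(xseries.abs())
same_pow = pow(xseries, 2).equals(xseries.pow(2))

# map: the method returns a Series, the builtin returns a lazy iterator
doubled_series = xseries.map(lambda value: value * 2)
doubled_list = list(map(lambda value: value * 2, xseries))

# filter: the method selects by index label,
# the builtin selects values using a predicate
by_label = xseries.filter(items=[0, 2])
positive = list(filter(lambda value: value > 0, xseries))

print(same_abs, same_pow)
print(doubled_series.tolist(), doubled_list)
print(by_label.index.tolist(), positive)
```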

In [675]:
import builtins

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    isinbuiltins = identifier in dir(builtins)
    if (isfunction and not isdatamodel and not isinarray and not isinarrayasdatamodel and not isinbuiltins):
        print(identifier, end=' ')

add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide drop drop_duplicates droplevel dropna duplicated equals ewm expanding explode factorize ffill fillna first first_valid_index get groupby head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index loc mask median memory_usage mode multiply nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop product quantile rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rolling sample sem set_axis set_flags shift skew sort_index sort_values subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_markdown to_numpy to_period to_pickle to_sql to_string to_timestamp to_xarray transform truncate tz_convert tz_localize unique

## DataFrame Identifiers

If the following DataFrame is constructed:

In [676]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [677]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


In [678]:
print(dir(df), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__

These can be split into attributes:

In [679]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

at attrs axes columns dtypes empty flags iat index ndim shape size style values x y 

Methods:

In [680]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply applymap asfreq asof assign astype at_time backfill between_time bfill bool boxplot clip combine combine_first compare convert_dtypes copy corr corrwith count cov cummax cummin cumprod cumsum describe diff div divide dot drop drop_duplicates droplevel dropna duplicated eq equals eval ewm expanding explode ffill fillna filter first first_valid_index floordiv from_dict from_records ge get groupby gt head hist idxmax idxmin iloc infer_objects info insert interpolate isetitem isin isna isnull items iterrows itertuples join keys kurt kurtosis last last_valid_index le loc lt mask max mean median melt memory_usage merge min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe pivot pivot_table plot pop pow prod product quantile query radd rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample select_

Data model attributes:

In [681]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [682]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __dataframe__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

Most of these behave analogously to their counterparts in Series, broadcasting across the entire DataFrame instance instead of just along a single Series.

There are no data model attributes in the DataFrame class not found in the Series class:

In [683]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

The only data model method in the DataFrame class not in the Series class is \_\_dataframe\_\_:

In [684]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

__dataframe__ 

If the attributes are examined:

In [685]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

columns style x y 

Notice the columns attribute returns an Index containing the name of each Series:

In [686]:
df.columns

Index(['x', 'y'], dtype='object')

Since the following condition is satisfied:

In [687]:
'x'.isidentifier()

True

In [688]:
'y'.isidentifier()

True

And the identifier name doesn't clash with any of the other DataFrame identifiers, the following are also attributes:

In [689]:
df.x

0    1.1
1    2.1
2    3.1
3    3.1
Name: x, dtype: float64

In [690]:
df.y

0    1.2
1    2.2
2    3.2
3    4.2
Name: y, dtype: float64
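When a column name is not a valid identifier, or clashes with an existing DataFrame identifier, attribute access is not available and square brackets are needed instead. A minimal sketch, using made-up column names chosen to trigger both cases:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1],
                   'count': [1, 2],     # clashes with the method DataFrame.count
                   'my col': [3, 4]})   # not a valid identifier

# 'x' is a valid identifier with no clash, so both forms agree
same = df['x'].equals(df.x)

# df.count is the method, not the column
count_is_method = callable(df.count)
count_column = df['count']

# a name containing a space can only be reached with square brackets
space_is_identifier = 'my col'.isidentifier()
space_column = df['my col']

print(same, count_is_method, space_is_identifier)
```

Square brackets work for every column name, which makes them the safer choice in general code.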

The following methods are also supplementary for a DataFrame:

In [691]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

applymap assign boxplot corrwith eval from_dict from_records insert isetitem iterrows itertuples join melt merge pivot pivot_table query select_dtypes set_index stack to_feather to_gbq to_html to_orc to_parquet to_records to_stata to_xml 
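A few of the DataFrame-only methods listed above are sketched below: assign, query and set_index. Each returns a new DataFrame and leaves the original unchanged; the column name z is made up for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

# assign adds a new column computed from existing ones
with_sum = df.assign(z=df['x'] + df['y'])

# query selects rows using an expression written as a string
filtered = df.query('x > 3')

# set_index moves a column into the index
reindexed = df.set_index('y')

print(with_sum)
print(filtered)
print(reindexed)
```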

## Mutability

The Series and DataFrame classes are mutable collections, meaning they have the data model method \_\_getitem\_\_ for reading items as well as the mutating methods \_\_setitem\_\_ and \_\_delitem\_\_. An Index, by contrast, is immutable and raises a TypeError on item assignment:

In [692]:
'__getitem__' in dir(pd.Series)

True

In [693]:
'__setitem__' in dir(pd.Series)

True

In [694]:
'__delitem__' in dir(pd.Series)

True
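An Index behaves differently from a Series here: items can be read from it, but item assignment raises a TypeError. The sketch below demonstrates this and shows that the index of a Series can still be replaced as a whole; the replacement labels are illustrative:

```python
import pandas as pd

index = pd.Index([0, 1, 2, 3])

# Reading an item works
first = index[0]

# Assigning to an item is rejected
try:
    index[0] = 10
    assignment_failed = False
except TypeError as error:
    assignment_failed = True
    print(error)

# The index attribute of a Series can be replaced with a new Index
series = pd.Series([1.1, 2.1, 3.1, 4.1], index=index)
series.index = pd.Index(['a', 'b', 'c', 'd'])
print(series)
```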

This means the following Series can be indexed into:

In [695]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

Indexing can be carried out using \_\_getitem\_\_, typically the shorthand notation uses square brackets to enclose the index value:

In [696]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

For example:

In [697]:
xseries[0]

1.1

A value can be reassigned using the mutable method \_\_setitem\_\_:

In [698]:
xseries[0] = None

In [699]:
xseries

0    NaN
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value can be deleted using the mutable method \_\_delitem\_\_:

In [700]:
del xseries[2]

In [701]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64
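The square bracket notation used above is shorthand for the data model methods. The following is a minimal sketch, assuming a fresh hypothetical Series s, that calls the three methods explicitly to show the correspondence. In practice the square bracket and del forms are preferred:

```python
import pandas as pd

s = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

first = s.__getitem__(0)   # equivalent to s[0]
s.__setitem__(0, None)     # equivalent to s[0] = None, stored as NaN in a float Series
s.__delitem__(2)           # equivalent to del s[2]

# The index labels that remain after the deletion
remaining_labels = list(s.index)
```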

Despite the ndarray, Series and DataFrame being mutable data types, most of their methods are immutable by default. If the docstring of the method dropna is examined:

In [702]:
? xseries.dropna

Signature:
xseries.dropna(
    *,
    axis: 'Axis' = 0,
    inplace: 'bool' = False,
    how: 'AnyAll | None' = None,
    ignore_index: 'bool' = False,
) -> 'Series | None'
Docstring:
Return a new Series with missing values removed.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index'}
    Unused. Parameter needed for compatibility with DataFrame.
inplace : bool, default False
    If True, do operation inplace and retur

Notice it has a number of keyword input arguments such as axis and inplace which have default values. inplace has the default value of False making the method immutable and therefore returning a new Series:

In [703]:
xseries.dropna() # Return value

1    2.1
3    4.1
Name: x, dtype: float64

In [704]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

When inplace is set to True it becomes a mutable method, modifying the Series inplace:

In [705]:
xseries.dropna(inplace=True) # No return value

In [706]:
xseries

1    2.1
3    4.1
Name: x, dtype: float64
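The difference between the two modes can be checked directly. The following is a minimal sketch, assuming a hypothetical Series s_missing that contains one missing value. It records the length of the Series and the return value of dropna in each mode:

```python
import numpy as np
import pandas as pd

s_missing = pd.Series([np.nan, 2.1, 4.1], name='x')

# Default (inplace=False): a new Series is returned, the original is unchanged
returned = s_missing.dropna()
length_after_default = len(s_missing)

# inplace=True: the original is modified and the return value is None
result_inplace = s_missing.dropna(inplace=True)
length_after_inplace = len(s_missing)
```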

The same behaviour can be seen on the method reset_index:

In [707]:
? xseries.reset_index

Signature:
xseries.reset_index(
    level: 'IndexLabel' = None,
    *,
    drop: 'bool' = False,
    name: 'Level' = <no_default>,
    inplace: 'bool' = False,
    allow_duplicates: 'bool' = False,
) -> 'DataFrame | Series | None'
Docstring:
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before another

With default values this returns a DataFrame since the old index is now added as a Series:

In [708]:
xseries.reset_index() # Return value

| |index|x|
|---|---|---|
|0|1|2.1|
|1|3|4.1|


If drop is set to True, a Series will instead be returned:

In [709]:
xseries.reset_index(drop=True) # Return value

0    2.1
1    4.1
Name: x, dtype: float64

Once again the inplace keyword input argument can be assigned to True, making the method mutable:

In [710]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [711]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

The inspect module can be used to group the Series methods that have inplace as a keyword argument. All of these are configured to be immutable by default but can be made mutable by assigning inplace to True:

In [712]:
import inspect

for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' in inspect.signature(getattr(xseries, identifier)).parameters):
            print(identifier, end=' ')

backfill bfill clip drop drop_duplicates dropna ffill fillna interpolate mask pad rename rename_axis replace reset_index sort_index sort_values where 

Notice that most of these are used to fill or drop missing values.

When the above methods are immutable, they have a return value:

In [713]:
xseries.sort_values(ascending=False) # Return value

1    4.1
0    2.1
Name: x, dtype: float64

To keep the result of an immutable method, assignment or in this case reassignment can be used:

In [714]:
xseries = xseries.sort_values(ascending=False)

In [715]:
xseries

1    4.1
0    2.1
Name: x, dtype: float64
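Because each immutable method returns a new Series, several of them can be chained and the final result reassigned once. The following is a minimal sketch of that pattern, assuming a hypothetical Series s with one missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 2.1, 4.1], name='x')

# Each method returns a new Series, so the calls can be chained
s = (s.dropna()
      .sort_values(ascending=False)
      .reset_index(drop=True))
```

This style is not possible with inplace=True, since each call would return None.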

On the other hand when they are mutable, they have no return value and the Series is updated inplace:

In [716]:
xseries.sort_values(ascending=True, inplace=True) # No return value

In [717]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64

If assignment or reassignment is used together with inplace=True, the return value of the function will be None and None will be assigned to the name of the original Series:

In [718]:
xseries = xseries.sort_values(ascending=True, inplace=True) 

In [719]:
xseries

Notice there is no cell output because xseries is now None:

In [720]:
xseries is None

True

By convention immutable methods have a return value and mutable methods have no return value. An exception to this is the mutable method pop which returns the popped value and mutates the Series in place:

In [721]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

In [722]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

In [723]:
xseries.pop(item=1) # Return value

2.1

In [724]:
xseries # Mutated

0    1.1
2    3.1
3    4.1
Name: x, dtype: float64

Most of the other methods are immutable and have a return value:

In [725]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('inplace' not in inspect.signature(getattr(xseries, identifier)).parameters):
            if identifier not in ['pop']:
                print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply argmax argmin argsort asfreq asof astype at_time autocorr between between_time bool combine combine_first compare convert_dtypes copy corr count cov cummax cummin cumprod cumsum describe diff div divide divmod dot droplevel duplicated eq equals ewm expanding explode factorize filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info isin isna isnull item items keys kurt kurtosis last last_valid_index le loc lt map max mean median memory_usage min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pct_change pipe plot pow prod product quantile radd rank ravel rdiv rdivmod reindex reindex_like reorder_levels repeat resample rfloordiv rmod rmul rolling round rpow rsub rtruediv sample searchsorted sem set_axis set_flags shift skew squeeze std sub subtract sum swapaxes swaplevel tail take to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list 

## Indexing and Slicing

Supposing the following dictionary instance is instantiated:

In [726]:
mapping = {'x': np.array([1.1, 2.1, 3.1, 4.1]),
           'y': np.array([1.2, 2.2, 3.2, 4.2])}

In [727]:
mapping

{'x': array([1.1, 2.1, 3.1, 4.1]), 'y': array([1.2, 2.2, 3.2, 4.2])}

A DataFrame instance can be instantiated by assigning the mapping to the keyword input argument data:

In [728]:
df = pd.DataFrame(data=mapping)

In [729]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


A mapping can be indexed with a key. This returns the value the key references, in this case the numpy array:

In [730]:
mapping['x']

array([1.1, 2.1, 3.1, 4.1])

Analogously, when a DataFrame is indexed using the name of a column, the Series is returned:

In [731]:
df['x']

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

A value in the ndarray can be indexed by use of a second set of square brackets to enclose the numeric index:

In [732]:
mapping['x'][1]

2.1

Analogously, a value in the Series can be indexed by use of a second set of square brackets to enclose the numeric index:

In [733]:
df['x'][1]

2.1

If the DataFrame instance is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

The first set of brackets select the Series:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

And the second set of brackets selects the index retrieving the value:

2.1

If the DataFrame is examined:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|

Sometimes the value for each Series at an index is desired:

|index|'x'|'y'|
|---|---|---|
|1|2.1|2.2|

This is done by use of the property location loc. Note that loc returns the above "row" as a Series which is displayed by default as a "column":

|index|1|
|---|---|
|'x'|2.1|
|'y'|2.2|

loc is callable and has a docstring:

In [734]:
callable(df.loc)

True

In [735]:
? df.loc

Type:        property
String form: <property object at 0x000001E815C77150>
Docstring:
Access a group of rows and columns by label(s) or a boolean array.

``.loc[]`` is primarily label based, but may also be used with a
boolean array.

Allowed inputs are:

- A single label, e.g. ``5`` or ``'a'``, (note that ``5`` is
  interpreted as a *label* of the index, and **never** as an
  integer position along the index).
- A list or array of labels, e.g. ``['a', 'b', 'c']``.
- A slice object with labels, e.g. ``'a':'f'``.

      start and the stop are included

- A boolean array of the same length as the axis being sliced,
  e.g. ``[True, False, True]``.
- An alignable boolean Series. The index of the key will be aligned before
  masking.
- An alignable Index. The Index of the returned selection will be the input.
- A ``callable`` function with one argument (the calling Series or
  DataFrame) and that returns valid output for indexing (one of the above)

See mo

However it isn't used like a function. Referencing it, or calling it with parentheses, just returns an indexer object:

In [736]:
df.loc

<pandas.core.indexing._LocIndexer at 0x1e818d55680>

In [737]:
df.loc()

<pandas.core.indexing._LocIndexer at 0x1e819b0fd40>

Instead loc is a property that is indexed using square brackets. Think of it as syntactic sugar around the data model method \_\_getitem\_\_ that switches the order of indexing from Series, index to index, Series:

In [738]:
df.loc[1]

x    2.1
y    2.2
Name: 1, dtype: float64

In [739]:
df.loc[1]['x']

2.1
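loc also accepts the index value and the Series name together, separated by a comma, which avoids the second set of square brackets. The following is a minimal sketch, assuming a hypothetical DataFrame df_demo with the same values as df:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                        'y': [1.2, 2.2, 3.2, 4.2]})

chained = df_demo.loc[1]['x']          # two steps: select the row, then the value
single = df_demo.loc[1, 'x']           # one step: index value, then Series name
row_subset = df_demo.loc[[0, 2], 'y']  # several index values, one Series
```

The single step form is generally the better choice when a value is to be assigned, since the chained form assigns to an intermediate object.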

loc uses index values:

In [740]:
df.loc[[0, 2]]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|2|3.1|3.2|


The related property integer location iloc uses integer positions rather than index values. With the default numeric index the two give the same result when a list is supplied, and iloc also supports positional slicing, which excludes the stop value:

In [741]:
df.iloc[[0, 2]]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|2|3.1|3.2|


In [742]:
df.iloc[0:2]

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
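The loc docstring above states that for a slice the start and the stop are included, whereas an iloc slice follows the usual Python convention and excludes the stop. The following is a minimal sketch, assuming a hypothetical DataFrame df_demo with the default numeric index, that compares the two:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1, 3.1, 4.1],
                        'y': [1.2, 2.2, 3.2, 4.2]})

by_label = df_demo.loc[0:2]      # index values 0, 1 and 2 (stop included)
by_position = df_demo.iloc[0:2]  # positions 0 and 1 (stop excluded)

label_rows = list(by_label.index)
position_rows = list(by_position.index)
```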


If the following DataFrame instance is created with index labels i.e. a non-numeric index:

|index|'x'|'y'|
|---|---|---|
|'a'|1.1|1.2|
|'b'|2.1|2.2|
|'c'|3.1|3.2|
|'d'|4.1|4.2|

In [743]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data=mapping)

In [744]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


The difference between loc and iloc can now be seen more clearly. For loc the index label is used:

In [745]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

Despite the labels being non-numeric iloc handles the index values numerically:

In [746]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

iloc essentially treats the DataFrame as though its index had been reset:

In [747]:
df.reset_index(drop=True)

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|


In [748]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

When loc and iloc are used to select a single index, the data for each Series at this index is itself displayed as a Series:

In [749]:
df.loc['b']

x    2.1
y    2.2
Name: b, dtype: float64

In [750]:
df.iloc[1]

x    2.1
y    2.2
Name: b, dtype: float64

Because each of the above are a Series instance, they can in turn be indexed into:

In [751]:
df.loc['b']['y']

2.2

In [752]:
df.iloc[1]['y']

2.2

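Each element in a Series can also be accessed by integer position. Recent versions of pandas warn when plain square brackets are used this way on a Series that has labels, and recommend iloc for positional access: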

In [753]:
df.loc['b'][1]

2.2

In [754]:
df.iloc[1][1]

2.2

When iloc and loc are instead used to select data from multiple indexes a DataFrame instance is output:

In [755]:
df.loc[['a', 'b']]

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|


In [756]:
df.iloc[0:2]

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|


And because each of these is a DataFrame instance, the Series within the DataFrame instance can then be indexed using the Series name:

In [757]:
df.loc[['a', 'b']]['x']

a    1.1
b    2.1
Name: x, dtype: float64

In [758]:
df.iloc[0:2]['x']

a    1.1
b    2.1
Name: x, dtype: float64

at is used for a scalar selector and requires both the index and the Series name: 

In [759]:
df.at['a', 'y']

1.2

The related property integer at, iat, is also a scalar selector and requires both the index and column to be specified as integer positions:

In [760]:
df.iat[0, 1]

1.2

Conceptually, this is the same as casting the DataFrame to a 2darray and indexing a value from it:

In [761]:
df.to_numpy()

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [4.1, 4.2]])

In [762]:
df.to_numpy()[0, 1]

1.2

To recap, for a DataFrame instance:

* \_\_getitem\_\_ selects a Series
* loc and iloc select an observation from an Index
* at and iat select a scalar element
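The recap above can be sketched in a single cell. The following is a minimal example, assuming a hypothetical DataFrame df_demo with the same labels and values as df, that uses each of the five selectors once:

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                       data={'x': np.array([1.1, 2.1, 3.1, 4.1]),
                             'y': np.array([1.2, 2.2, 3.2, 4.2])})

column = df_demo['x']                   # __getitem__ selects a Series by name
row_by_label = df_demo.loc['b']         # loc selects an observation by index label
row_by_position = df_demo.iloc[1]       # iloc selects an observation by position
scalar_by_label = df_demo.at['b', 'y']  # at selects a scalar by label and name
scalar_by_position = df_demo.iat[1, 1]  # iat selects a scalar by integer positions
```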


loc can also be used to add a new observation to the DataFrame:

In [763]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|


In [764]:
df.loc['e'] = {'x': 5.1, 'y': 5.2}

In [765]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|


The length of the DataFrame gives the number of observations:

In [766]:
len(df)

5

iloc isn't as powerful as loc and cannot be used to enlarge the DataFrame:

In [767]:
# df.iloc[len(df)] = {'x': 6.1, 'y': 6.2}

<span style='color:red'>IndexError</span>: iloc cannot enlarge its target object

However loc can be used to add a numeric index this way:

In [768]:
df.loc[len(df)] = {'x': 6.1, 'y': 6.2}

In [769]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|
|e|5.1|5.2|
|5|6.1|6.2|
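The contrast between loc and iloc when enlarging can be reproduced without leaving an error in the notebook. The following is a minimal sketch, assuming a small hypothetical DataFrame df_demo, that adds one observation with loc and catches the IndexError raised by iloc:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1], 'y': [1.2, 2.2]}, index=['a', 'b'])

# loc creates the label 'c' because it does not exist yet
df_demo.loc['c'] = {'x': 3.1, 'y': 3.2}

# iloc refers to positions that already exist, so it cannot enlarge
try:
    df_demo.iloc[len(df_demo)] = {'x': 4.1, 'y': 4.2}
    iloc_error = None
except IndexError as error:
    iloc_error = str(error)
```

Adding observations one at a time like this copies data on each step, so for many rows it is usually better to collect the data first and build the DataFrame once.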


## DataFrame Properties

Supposing the following DataFrame is instantiated:

In [770]:
df = pd.DataFrame(index=['a', 'b', 'c', 'd'],
                  data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                        'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [771]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


The DataFrame has the following dimension related properties. The attribute empty returns a boolean that is True only with an empty DataFrame:

In [772]:
df.empty

False

In [773]:
pd.DataFrame(None).empty

True

A DataFrame has a length, which is the number of observations or rows i.e. number of values in the Index:

In [774]:
len(df)

4

It has a shape tuple, the 1st value in the shape tuple is the number of rows (observations in the index) and 2nd value is the number of Series (columns):

In [775]:
df.shape

(4, 2)

It has 2 dimensions:

In [776]:
df.ndim

2

Recall this is the length of the shape tuple:

In [777]:
len(df.shape)

2

And it has a size which is the product of the elements in the shape tuple:

In [778]:
df.size

8
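The dimension related properties are all derived from the shape tuple. The following is a minimal sketch, assuming a hypothetical DataFrame df_demo with 4 rows and 2 Series, that checks each relationship described above:

```python
import math
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1, 3.1, 3.1],
                        'y': [1.2, 2.2, 3.2, 4.2]})

n_rows, n_columns = df_demo.shape

length_is_rows = len(df_demo) == n_rows                       # len gives the number of rows
ndim_is_shape_length = df_demo.ndim == len(df_demo.shape)     # ndim is the length of shape
size_is_product = df_demo.size == math.prod(df_demo.shape)    # size is rows * columns
```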

The index attribute is an Index instance. An Index instance has a single dimension that can either be depicted as a row or a column. The output below displays this as a row although the index itself is conventionally depicted as a column when incorporated as part of a DataFrame:

In [779]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

When no index is specified during instantiation a RangeIndex is shown:

In [780]:
df2 = pd.DataFrame(data={'x': np.array([1.1, 2.1, 3.1, 3.1]),
                         'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [781]:
df2.index

RangeIndex(start=0, stop=4, step=1)

The attribute columns is also an instance of the class Index that contains the names used for each Series in the DataFrame:

In [782]:
df.columns

Index(['x', 'y'], dtype='object')

The attribute axes returns a 2 element list, where the first element is the index and the second element is the columns:

In [783]:
df.axes

[Index(['a', 'b', 'c', 'd'], dtype='object'),
 Index(['x', 'y'], dtype='object')]

The attribute values returns the values in the DataFrame in the form of a 2darray:

In [784]:
df.values

array([[1.1, 1.2],
       [2.1, 2.2],
       [3.1, 3.2],
       [3.1, 4.2]])

The attribute dtypes returns a Series containing the data type of each Series in the DataFrame:

In [785]:
df.dtypes

x    float64
y    float64
dtype: object

The Series instances x and y are each of the data type float64. The dtype: object shown on the last line describes the Series returned by dtypes itself, which holds dtype objects. It is not a data type of the DataFrame, which has no single data type of its own.
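The return value of dtypes can be inspected like any other Series. The following is a minimal sketch, assuming a hypothetical DataFrame df_mixed with one float Series and one integer Series. It shows that dtypes is indexed by Series name and is itself of the data type object:

```python
import pandas as pd

df_mixed = pd.DataFrame({'x': [1.1, 2.1], 'n': [1, 2]})

dtypes = df_mixed.dtypes

is_series = isinstance(dtypes, pd.Series)  # dtypes is a Series
x_dtype = str(dtypes['x'])                 # data type of the Series 'x'
n_dtype = str(dtypes['n'])                 # data type of the Series 'n'
container_dtype = dtypes.dtype             # data type of the dtypes Series itself
```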

Each existing Series is accessible as an attribute:

In [786]:
df.x

a    1.1
b    2.1
c    3.1
d    3.1
Name: x, dtype: float64

In [787]:
df.y

a    1.2
b    2.2
c    3.2
d    4.2
Name: y, dtype: float64

The formal representation of the DataFrame instance df can be examined in a cell:

In [788]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


The attribute style will instead display the DataFrame instance using default styling:

In [789]:
df.style

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


This attribute can be used with a number of methods to apply custom formatting:

In [790]:
for identifier in dir(df.style):
    if not identifier.startswith('_') and callable(getattr(df.style, identifier)):
        print(identifier, end=' ')

apply apply_index applymap applymap_index background_gradient bar clear concat export format format_index from_custom_template hide highlight_between highlight_max highlight_min highlight_null highlight_quantile pipe relabel_index set_caption set_properties set_sticky set_table_attributes set_table_styles set_td_classes set_tooltips set_uuid text_gradient to_excel to_html to_latex to_string use 

In [791]:
df_styled = df.style.format(precision=3).set_caption('DataFrame Instance')

This gives a Styler instance:

In [792]:
type(df_styled)

pandas.io.formats.style.Styler

The Styler instance applies the formatting to the data in the DataFrame when output in a cell:

In [793]:
df_styled

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


The associated attributes give information about an existing Styler instance:

In [794]:
for identifier in dir(df_styled):
    if not identifier.startswith('_') and not callable(getattr(df_styled, identifier)):
        print(identifier, end=' ')

caption cell_context cell_ids columns concatenated css ctx ctx_columns ctx_index data env hidden_columns hidden_rows hide_column_names hide_columns_ hide_index_ hide_index_names index loader table_attributes table_styles template_html template_html_style template_html_table template_latex template_string tooltips uuid uuid_len 

In [795]:
df_styled.caption

'DataFrame Instance'

In [796]:
df_styled.hidden_rows

[]

The attribute attrs is an empty dictionary by default and is designed to store metadata associated with the DataFrame:

In [797]:
df.attrs

{}

This metadata can include a text description giving information about how the data was collected, or contain a link to a scientific publication for example. The pandas documentation warns that this is an experimental feature and is subject to change:

In [798]:
df.attrs = {'description': 'this DataFrame was instantiated from a dict',
            'scientific paper': r'https://www.sciencedirect.com/'}

flags is another experimental feature and is used to view or change the flags of a DataFrame. At present there is only one flag that can be set, the flag which allows duplicate labels:

In [817]:
df.flags

<Flags(allows_duplicate_labels=True)>

In [800]:
df

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


This flag is enabled by default:

In [834]:
df.flags.allows_duplicate_labels

True

In [835]:
df_duplicated = pd.concat([df, df])

In [836]:
df_duplicated

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|3.1|4.2|


If the flag is set to False:

In [801]:
df.flags.allows_duplicate_labels = False

Then any operation involving that DataFrame that could lead to a DataFrame with duplicate labels will give a DuplicateLabelError:

In [839]:
# pd.concat([df, df])

<span style='color:red'>DuplicateLabelError</span>:
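The error can be demonstrated without modifying the original DataFrame. The following is a minimal sketch, assuming a small hypothetical DataFrame df_demo, that uses the method set_flags to return a copy with the flag disabled and catches the resulting DuplicateLabelError from pd.errors:

```python
import pandas as pd

df_demo = pd.DataFrame({'x': [1.1, 2.1]}, index=['a', 'b'])

# set_flags returns a new DataFrame and leaves df_demo unchanged
df_strict = df_demo.set_flags(allows_duplicate_labels=False)

# Concatenating the strict DataFrame with itself would duplicate 'a' and 'b'
try:
    pd.concat([df_strict, df_strict])
    raised = False
except pd.errors.DuplicateLabelError:
    raised = True

# The original DataFrame still allows duplicate labels
df_duplicated = pd.concat([df_demo, df_demo])
```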

In [802]:
print(dir(df), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__

In [804]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, a to d
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       4 non-null      float64
 1   y       4 non-null      float64
dtypes: float64(2)
memory usage: 268.0+ bytes


## Reading and Writing to Files

The Series and DataFrames previously examined were created from data already in memory, such as lists, dictionaries and ndarrays. pandas has a number of functions for reading in data from external files:

In [806]:
for identifier in dir(pd):
    if identifier.startswith('read_'):
        print(identifier, end=' ')

read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml 

## CSV File

CSV is an abbreviation for comma separated values. The file format has a similar structure to a tuple, where each element is separated by a comma. In a CSV file, each column is separated by a comma and the newline character is an instruction to move on to the next row.

When opened in a program such as Microsoft Excel, these display as a grid:

<img src='./images/img_001.png' alt='img_001' width='800'/>

Notice that the comma in twinkle, twinkle is not a delimiter but part of the string. For this reason "twinkle, twinkle" was displayed enclosed in quotations.
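The effect of the quotations can be checked without a file on disk. The following is a minimal sketch, assuming hypothetical CSV text with the column names 'id' and 'text' (the contents of the file shown in the image are not reproduced here). The text is wrapped in io.StringIO so that read_csv can treat it like a file:

```python
import io
import pandas as pd

# Hypothetical CSV text; the quoted field in the first data row contains a comma
csv_text = 'id,text\n1,"twinkle, twinkle"\n2,little star\n'

df_csv = pd.read_csv(io.StringIO(csv_text))

# The quoted comma is kept as part of the string and does not create a third column
first_text = df_csv.loc[0, 'text']
```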

The CSV file has a file name, in this case Book1.csv.

Because it is in the same folder as the interactive Python notebook, the file path can be specified as the following string:

<img src='./images/img_002.png' alt='img_002' width='800'/>

In [807]:
file_path = r'.\Book1.csv'

In [808]:
file_path

'.\\Book1.csv'

* r means raw string. In a raw string \ is used to indicate a \ instead of an instruction to insert an escape character.
* .\ means in the same folder as the interactive Python notebook

If the file is moved into a sub folder called files:

<img src='./images/img_003.png' alt='img_003' width='800'/>

Then the file path becomes:

In [809]:
file_path = r'.\files\Book1.csv'

In [810]:
file_path

'.\\files\\Book1.csv'

If the file is placed up a level from the interactive notebook, the file path becomes:

<img src='./images/img_005.png' alt='img_005' width='800'/>

In [811]:
file_path = r'..\Book1.csv'

In [812]:
file_path

'..\\Book1.csv'

And if a subfolder (that is in the folder up a level from the interactive Python notebook file) is made called files:

<img src='./images/img_006.png' alt='img_006' width='800'/>

In [813]:
file_path = r'..\files\Book1.csv'

In [814]:
file_path

'..\\files\\Book1.csv'
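The backslash is the Windows path separator, so the strings above are specific to Windows. The following is a minimal sketch, using the standard library module pathlib as an alternative that is not used in the notebook itself, which builds the same four relative paths with the separator chosen by the operating system:

```python
from pathlib import Path

same_folder = Path('.') / 'Book1.csv'                  # next to the notebook
sub_folder = Path('.') / 'files' / 'Book1.csv'         # in the subfolder files
parent_folder = Path('..') / 'Book1.csv'               # up a level
parent_sub_folder = Path('..') / 'files' / 'Book1.csv' # up a level, then into files
```

A Path instance can be supplied to read_csv in place of a string.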

The function read_csv is used to read in a CSV file as a dataframe:

In [815]:
? pd.read_csv

Signature:
pd.read_csv(
    filepath_or_buffer: 'FilePath | ReadCsvBuffer[bytes] | ReadCsvBuffer[str]',
    *,
    sep: 'str | None | lib.NoDefault' = <no_default>,
    delimiter: 'str | None | lib.NoDefault' = None,
    header: "int | Sequence[int] | None | Literal['infer']" = 'infer',
    names: 'Sequence[Hashable] | None | lib.NoDefault' = <no_default>,
    index_col: 'IndexLabel | Literal[False] | None' = None,
    usecols=

read_csv has a large number of input arguments; however, only the first one is mandatory when the file is in the expected format:

In [816]:
df = pd.read_csv(filepath_or_buffer='Book1.csv')

In [None]:
df

The first input argument is normally used positionally:

In [None]:
df = pd.read_csv(r'./files/Book1.csv')

In [None]:
df

Notice the Series names are as expected and a numeric index is added.

In [None]:
df.index

In [None]:
df.columns

In [None]:
df.axes

If the file is very large, the DataFrame methods head and tail can be used to preview the first 5 or last 5 rows:

In [None]:
df.head()

In [None]:
df.tail()

A custom number of rows n can be specified for preview:

In [None]:
df.head(n=3)

The attribute dtypes can be used to view the datatype for each Series:

In [None]:
df.dtypes

In [None]:
df.info()

Notice the integer, boolean and floating point Series are int64, bool and float64, meaning their data types have been inferred automatically. pandas gives non-numeric data such as strings the object data type. Notice that the date, time and category Series are all of the data type object, having effectively been read in as strings.
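The inference can be reproduced with a small example. The following is a minimal sketch, assuming hypothetical CSV text with the column names 'count', 'flag', 'value' and 'date' (these are not the columns of Book1.csv). It also uses the read_csv keyword argument parse_dates, which asks pandas to convert the named column to a datetime Series instead of leaving it as strings:

```python
import io
import pandas as pd

# Hypothetical CSV text used only for illustration
csv_text = ('count,flag,value,date\n'
            '1,True,1.5,2024-01-01\n'
            '2,False,2.5,2024-01-02\n')

# Default behaviour: numeric and boolean columns are inferred, dates stay as strings
df_types = pd.read_csv(io.StringIO(csv_text))
date_is_datetime = pd.api.types.is_datetime64_any_dtype(df_types['date'])

# With parse_dates the 'date' column is converted while reading
df_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'])
parsed_is_datetime = pd.api.types.is_datetime64_any_dtype(df_dates['date'])
```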

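The describe method gives summary statistics for the numeric Series and the count method gives the number of non-missing values in each Series: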

In [None]:
df.describe()

In [None]:
df.count()

In [None]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel):
        if ('axis' in inspect.signature(getattr(xseries, identifier)).parameters):
            print(identifier, end=' ')

In [None]:
? pd.concat