# pandas library

The name pandas is derived from the term "panel data" and is also commonly read as an abbreviation of the **P**ython **an**d **D**ata **A**naly**s**is Library. It is a library built around three main data structures:

* the Index class
* the Series class
* the DataFrame class

By default an Index is numeric, starting at zero and increasing in integer steps of one:

|index|
|---|
|0|
|1|
|2|
|3|

The Series class has a value at each index and a name. It is essentially a one-dimensional numpy ndarray that has an index and a name. A Series is normally represented as a column:

|index|'x'|
|---|---|
|0|1.1|
|1|2.1|
|2|3.1|
|3|4.1|

A DataFrame is essentially a grouping of Series that share the same index:

|index|'x'|'y'|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|
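
The relationship between the three structures can be sketched in code. The following is a minimal illustration rather than the only way to build these objects; the variable names `index`, `x`, `y` and `df` are chosen here for illustration, and `pd.concat` is used simply as one convenient way to group the Series as columns:

```python
import pandas as pd

# A default index: 0, 1, 2, 3
index = pd.RangeIndex(start=0, stop=4, step=1)

# Two named Series that share the same index
x = pd.Series([1.1, 2.1, 3.1, 4.1], index=index, name='x')
y = pd.Series([1.2, 2.2, 3.2, 4.2], index=index, name='y')

# Grouping the Series side by side gives a DataFrame,
# with each Series name becoming a column name
df = pd.concat([x, y], axis=1)
print(df)
```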

## Importing Libraries

To use the data science libraries they need to be imported:

In [1]:
import numpy as np 
import pandas as pd

Once imported, the identifiers can be listed:

In [2]:
print(dir(pd), sep=' ')

['ArrowDtype', 'BooleanDtype', 'Categorical', 'CategoricalDtype', 'CategoricalIndex', 'DataFrame', 'DateOffset', 'DatetimeIndex', 'DatetimeTZDtype', 'ExcelFile', 'ExcelWriter', 'Flags', 'Float32Dtype', 'Float64Dtype', 'Grouper', 'HDFStore', 'Index', 'IndexSlice', 'Int16Dtype', 'Int32Dtype', 'Int64Dtype', 'Int8Dtype', 'Interval', 'IntervalDtype', 'IntervalIndex', 'MultiIndex', 'NA', 'NaT', 'NamedAgg', 'Period', 'PeriodDtype', 'PeriodIndex', 'RangeIndex', 'Series', 'SparseDtype', 'StringDtype', 'Timedelta', 'TimedeltaIndex', 'Timestamp', 'UInt16Dtype', 'UInt32Dtype', 'UInt64Dtype', 'UInt8Dtype', '__all__', '__builtins__', '__cached__', '__doc__', '__docformat__', '__file__', '__git_version__', '__loader__', '__name__', '__package__', '__path__', '__spec__', '__version__', '_config', '_is_numpy_dev', '_libs', '_testing', '_typing', '_version', 'annotations', 'api', 'array', 'arrays', 'bdate_range', 'compat', 'concat', 'core', 'crosstab', 'cut', 'date_range', 'describe_option', 'errors', '

These can be grouped. The identifiers beginning and ending with a double underscore are the data model identifiers, which mainly give details about the library:

In [3]:
for identifier in dir(pd):
    isdatamodel = identifier[0:2] == '__'
    if (isdatamodel):
        print(identifier, end=' ')

__all__ __builtins__ __cached__ __doc__ __docformat__ __file__ __git_version__ __loader__ __name__ __package__ __path__ __spec__ __version__ 

For example the name, version and file:

In [4]:
pd.__name__

'pandas'

In [5]:
pd.__version__

'2.0.2'

In [6]:
pd.__file__

'c:\\Users\\pyip\\AppData\\Local\\mambaforge\\envs\\jupyterlab\\Lib\\site-packages\\pandas\\__init__.py'

The identifiers beginning with a single underscore are for internal use only:

In [7]:
for identifier in dir(pd):
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (isinternal):
        print(identifier, end=' ')

_config _is_numpy_dev _libs _testing _typing _version 

The classes are all in CamelCase:

In [8]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and isupper and not isdatamodel):
        print(identifier, end=' ')

ArrowDtype BooleanDtype Categorical CategoricalDtype CategoricalIndex DataFrame DateOffset DatetimeIndex DatetimeTZDtype ExcelFile ExcelWriter Flags Float32Dtype Float64Dtype Grouper HDFStore Index Int16Dtype Int32Dtype Int64Dtype Int8Dtype Interval IntervalDtype IntervalIndex MultiIndex NamedAgg Period PeriodDtype PeriodIndex RangeIndex Series SparseDtype StringDtype Timedelta TimedeltaIndex Timestamp UInt16Dtype UInt32Dtype UInt64Dtype UInt8Dtype 

The main classes are:

* Index
* Series
* DataFrame
 
There are some variations of Index such as RangeIndex, MultiIndex, DatetimeIndex and TimedeltaIndex.
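
As a rough illustration of these variations, the sketch below builds one instance of each using constructors and helper functions that appear in the listings on this page (`pd.RangeIndex`, `pd.date_range`, `pd.timedelta_range` and `pd.MultiIndex.from_tuples`). The dates, labels and variable names are arbitrary example values:

```python
import pandas as pd

range_index = pd.RangeIndex(start=0, stop=4, step=1)
datetime_index = pd.date_range(start='2023-07-20', periods=3, freq='D')
timedelta_index = pd.timedelta_range(start='1 day', periods=3, freq='D')
multi_index = pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1)],
                                        names=['letter', 'number'])

# Each variation is a subclass of Index
for idx in (range_index, datetime_index, timedelta_index, multi_index):
    print(type(idx).__name__, len(idx), isinstance(idx, pd.Index))
```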

In general pandas uses object-oriented programming (OOP) as opposed to functional programming. This means methods are normally called on Index, Series and DataFrame instances to analyse or manipulate their data. Many of the functions within the pandas library read in data from a file and return a DataFrame instance:

In [9]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array bdate_range concat crosstab cut date_range describe_option eval factorize from_dummies get_dummies get_option infer_freq interval_range isna isnull json_normalize lreshape melt merge merge_asof merge_ordered notna notnull option_context period_range pivot pivot_table qcut read_clipboard read_csv read_excel read_feather read_fwf read_gbq read_hdf read_html read_json read_orc read_parquet read_pickle read_sas read_spss read_sql read_sql_query read_sql_table read_stata read_table read_xml reset_option set_eng_float_format set_option show_versions test timedelta_range to_datetime to_numeric to_pickle to_timedelta unique value_counts wide_to_long 
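
To give a feel for the read functions, here is a small sketch using `pd.read_csv`. To keep it self-contained, the "file" is an in-memory string wrapped in `io.StringIO` instead of a path on disk; the contents of `csv_text` are made up for the example:

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk
csv_text = "x,y\n1.1,1.2\n2.1,2.2\n3.1,3.2\n4.1,4.2\n"

# read_csv accepts a path or any file-like object and returns a DataFrame
df = pd.read_csv(io.StringIO(csv_text))
print(type(df).__name__)
print(df)
```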

pandas modules are also in lower case:

In [10]:
for identifier in dir(pd):
    isfunction = callable(getattr(pd, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

annotations api arrays compat core errors io offsets options pandas plotting testing tseries util 
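
The groupings above rely on naming conventions (upper or lower case first letter) together with `callable`. An alternative sketch, assuming the standard library `inspect` module is acceptable here, classifies each public identifier by what it actually is. The dictionary `groups` and its keys are names invented for this example:

```python
import inspect
import pandas as pd

groups = {'classes': [], 'functions': [], 'modules': [], 'other': []}

for identifier in dir(pd):
    if identifier.startswith('_'):
        continue  # skip data model and internal identifiers
    obj = getattr(pd, identifier)
    if inspect.isclass(obj):
        groups['classes'].append(identifier)
    elif inspect.ismodule(obj):
        groups['modules'].append(identifier)
    elif callable(obj):
        groups['functions'].append(identifier)
    else:
        groups['other'].append(identifier)

for name, members in groups.items():
    print(name, len(members))
```

The exact counts depend on the installed pandas version, so they are not shown here.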

The modules are not normally used directly by the user but internally called when constructing an Index, Series or DataFrame:

In [11]:
print(dir(pd.arrays), end=' ')

['ArrowExtensionArray', 'ArrowStringArray', 'BooleanArray', 'Categorical', 'DatetimeArray', 'FloatingArray', 'IntegerArray', 'IntervalArray', 'PandasArray', 'PeriodArray', 'SparseArray', 'StringArray', 'TimedeltaArray', '__all__', '__builtins__', '__cached__', '__doc__', '__file__', '__loader__', '__name__', '__package__', '__path__', '__spec__'] 
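
Although these array classes are rarely instantiated by hand, they do surface when a nullable dtype is requested. The sketch below uses the function `pd.array` (listed among the functions above) with the dtype string `'Int64'`; the example values are arbitrary:

```python
import pandas as pd

# pd.array chooses an extension array class to match the requested dtype
int_array = pd.array([1, 2, None], dtype='Int64')
print(type(int_array).__name__)
print(isinstance(int_array, pd.arrays.IntegerArray))

# A Series built from it keeps the nullable integer dtype
s = pd.Series(int_array, name='x')
print(s.dtype)
print(s.isna().tolist())
```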

## Series

The initialisation signature for a pandas Series can be examined:

In [12]:
? pd.Series

Init signature:
pd.Series(
    data=None,
    index=None,
    dtype: 'Dtype | None' = None,
    name=None,
    copy: 'bool | None' = None,
    fastpath: 'bool' = False,
) -> 'None'
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray ha

The main keyword input arguments are:

* data
* index
* dtype
* name


If these are not supplied, an empty Series with no index, no name and the generic object datatype is instantiated:

In [13]:
pd.Series()

Series([], dtype: object)

Normally data is supplied in the form of a one-dimensional numpy ndarray:

In [14]:
pd.Series(data=np.array([1, 2, 3]))

0    1
1    2
2    3
dtype: int32

Since an ndarray is itself initialised from a list, this can be abbreviated to:

In [15]:
pd.Series(data=[1, 2, 3])

0    1
1    2
2    3
dtype: int64
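
The two outputs above differ in dtype (int32 from the ndarray, int64 from the list). This most likely reflects the platform the notebook was run on: the file path shown earlier suggests Windows, where NumPy releases before 2.0 are understood to default to a 32-bit integer, whereas pandas infers int64 from a list of Python integers. A minimal sketch of one way to avoid relying on the platform default is to state the dtype explicitly; the variable names are illustrative:

```python
import numpy as np
import pandas as pd

# Stating the dtype removes any dependence on the platform default
from_array = pd.Series(np.array([1, 2, 3], dtype=np.int64))
from_list = pd.Series([1, 2, 3], dtype=np.int64)

print(from_array.dtype, from_list.dtype)
print(from_array.equals(from_list))
```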

When dtype=None, the data type will be inferred from the data:

In [16]:
pd.Series(data=[1., 2., 3.])

0    1.0
1    2.0
2    3.0
dtype: float64

In [17]:
from datetime import datetime, timedelta
pd.Series(data=[datetime.now(), 
                datetime.now() + timedelta(days=1),
                datetime.now() + timedelta(days=2)])

0   2023-07-20 17:55:11.781803
1   2023-07-21 17:55:11.781803
2   2023-07-22 17:55:11.781803
dtype: datetime64[ns]

In [18]:
pd.Series(data=['a', 'b', 'c'])

0    a
1    b
2    c
dtype: object

Anything with a string in it is classed as non-numeric and has the generic dtype object.
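
If a dedicated text dtype is preferred over the generic object dtype, one option is to request the `'string'` dtype, which corresponds to the `StringDtype` class in the class listing above. This is a sketch of that option, not a recommendation from the original notebook:

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c'], dtype='string')
print(letters.dtype)

# String methods are available through the str accessor
print(letters.str.upper().tolist())
```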

The dtype can be manually overridden when supplying the numpy ndarray by using the np.array input argument dtype:

In [19]:
pd.Series(data=np.array([1., 2., 3.], dtype=np.int32))

0    1
1    2
2    3
dtype: int32

Or alternatively by using the Series keyword input argument dtype:

In [20]:
pd.Series(data=[1., 2., 3.], dtype=np.int32)

0    1
1    2
2    3
dtype: int32
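
The dtype of an existing Series can also be changed after construction. A minimal sketch using the `astype` method (which appears in the Series method listing further down) is shown here; `floats` and `ints` are example names:

```python
import numpy as np
import pandas as pd

floats = pd.Series([1., 2., 3.])

# astype returns a new Series and leaves the original unchanged
ints = floats.astype(np.int32)

print(floats.dtype, ints.dtype)
print(ints.tolist())
```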

Notice that by default the index is numeric, starting at zero and increasing in integer steps of 1. This can be changed using the keyword input argument index and providing an Index, ndarray or list of index values:

In [21]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32)

a    1
b    2
c    3
dtype: int32

A Series usually also has a name:

In [22]:
pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], dtype=np.int32, name='x')

a    1
b    2
c    3
Name: x, dtype: int32
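
Once a Series has custom index labels, values can be looked up either by label or by position. The sketch below uses the `loc` and `iloc` indexers for this; the Series `s` mirrors the one above but is rebuilt so the example is self-contained:

```python
import pandas as pd

s = pd.Series(index=['a', 'b', 'c'], data=[1., 2., 3.], name='x')

print(s.loc['b'])          # lookup by label
print(s.iloc[1])           # lookup by position
print(s.loc[['a', 'c']])   # several labels at once return a Series
```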

Normally the data and name are supplied and the index and dtype are inferred:

In [23]:
pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

## DataFrame

The initialisation signature for a pandas DataFrame can be examined:

In [24]:
? pd.DataFrame

Init signature:
pd.DataFrame(
    data=None,
    index: 'Axes | None' = None,
    columns: 'Axes | None' = None,
    dtype: 'Dtype | None' = None,
    copy: 'bool | None' = None,
) -> 'None'
Docstring:
Two-dimensional, size-mutable, potentially heterogeneous tabular data.

Data structure also contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be
thought of as a dict-like container for Series objects. The primary
pandas data structure.

Parameters
----------


The keyword input arguments for a DataFrame are similar to those of a Series. However, as a DataFrame is a collection of Series, some of them are plural:

* data (plural; one set of values per column)
* index (singular; shared by every column)
* columns (plural of name)
* dtype (singular; a single dtype applied to every column)

In [25]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             index=['a', 'b', 'c', 'd'],
             columns=('x', 'y'),
             dtype=np.float64)

| |x|y|
|---|---|---|
|a|1.1|1.2|
|b|2.1|2.2|
|c|3.1|3.2|
|d|4.1|4.2|

The dtype keyword only accepts a single dtype, which is applied to every column. To give each column its own dtype, the astype method can be used with a dictionary once the DataFrame has been constructed.

Normally the dtype and index are inferred:

In [26]:
pd.DataFrame(data=[[1.1, 1.2], 
                   [2.1, 2.2],
                   [3.1, 3.2],
                   [4.1, 4.2]],
             columns=('x', 'y'))

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|4.1|4.2|
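
The DataFrame constructor's dtype argument takes a single dtype for the whole table. Where columns need different dtypes, one approach is to convert after construction by passing a dictionary to `astype`. The following is a sketch with made-up integer values in the first column so that the conversion is visible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[1, 1.2],
                        [2, 2.2],
                        [3, 3.2],
                        [4, 4.2]],
                  columns=('x', 'y'))

# Map each column name to the dtype it should have
df = df.astype({'x': np.int32, 'y': np.float64})
print(df.dtypes)
```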


It is more common to supply the data in the form of a dictionary of key: value pairs. Each key should be a string, which becomes the column name in the DataFrame instance, and each value should be a one-dimensional ndarray or list, which becomes the data in that column:

In [27]:
pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
              'y': np.array([1.2, 2.2, 3.2, 4.2])})

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|
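
The dictionary values can also be Series. In that case the Series are aligned on their index labels, and any label missing from one of them is filled with NaN. A small sketch with deliberately mismatched example indexes:

```python
import pandas as pd

x = pd.Series([1.1, 2.1, 3.1], index=['a', 'b', 'c'], name='x')
y = pd.Series([2.2, 3.2, 4.2], index=['b', 'c', 'd'], name='y')

# The resulting index is the union of both indexes
df = pd.DataFrame({'x': x, 'y': y})
print(df)
```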


## Series Identifiers

If the following ndarray and Series are created:

In [28]:
xarray = np.array([1.1, 2.1, 3.1, 4.1])

In [29]:
xarray

array([1.1, 2.1, 3.1, 4.1])

In [30]:
xseries = pd.Series(xarray, name='x')

In [31]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

Its identifiers can be viewed:

In [32]:
print(dir(xseries), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__float__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__int__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror_

The above may seem overwhelming at first glance; however, the identifiers can be split into separate groupings. The behaviour of many of them, particularly the most common ones, has already been examined when looking at numeric datatypes and ndarrays; in a Series they are broadcast across the data.

Attributes:

In [33]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

array at attrs axes dtype dtypes empty flags hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name nbytes ndim shape size values 

Methods:

In [34]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply argmax argmin argsort asfreq asof astype at_time autocorr backfill between between_time bfill bool clip combine combine_first compare convert_dtypes copy corr count cov cummax cummin cumprod cumsum describe diff div divide divmod dot drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull item items keys kurt kurtosis last last_valid_index le loc lt map mask max mean median memory_usage min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow prod product quantile radd rank ravel rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels repeat replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample searchsorted sem set_axis set_flags shift skew sort_index 

Data model attributes:

In [35]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [36]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __float__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __int__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

There are also a number of internal attributes:

In [37]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (not isfunction and not isupper and isinternal):
        print(identifier, end=' ')

_AXIS_LEN _AXIS_ORDERS _AXIS_TO_AXIS_NUMBER _HANDLED_TYPES _accessors _agg_examples_doc _agg_see_also_doc _attrs _can_hold_na _data _flags _hidden_attrs _info_axis _info_axis_name _info_axis_number _internal_names _internal_names_set _is_cached _is_copy _is_mixed_type _is_view _item_cache _metadata _mgr _name _references _stat_axis _stat_axis_name _stat_axis_number _typ _values 

And internal methods:

In [38]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isupper = identifier[0].isupper()
    isinternal = identifier.startswith('_') and not identifier.startswith('__')
    if (isfunction and not isupper and isinternal):
        print(identifier, end=' ')

_accum_func _add_numeric_operations _align_frame _align_series _append _arith_method _as_manager _binop _check_inplace_and_allows_duplicate_labels _check_inplace_setting _check_is_chained_assignment_possible _check_label_or_level_ambiguity _check_setitem_copy _clear_item_cache _clip_with_one_bound _clip_with_scalar _cmp_method _consolidate _consolidate_inplace _construct_axes_dict _construct_result _constructor _constructor_expanddim _convert_dtypes _dir_additions _dir_deletions _drop_axis _drop_labels_or_levels _duplicated _find_valid_index _get_axis _get_axis_name _get_axis_number _get_axis_resolvers _get_block_manager_axis _get_bool_data _get_cacher _get_cleaned_column_resolvers _get_index_resolvers _get_label_or_level_values _get_numeric_data _get_value _get_values _get_values_tuple _get_with _gotitem _indexed_same _init_dict _init_mgr _inplace_method _is_label_or_level_reference _is_label_reference _is_level_reference _ixs _logical_func _logical_method _map_values _maybe_update_ca

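Since the Series is based on an ndarray, most of the data model attributes and methods behave analogously. Only a few data model attributes and methods are found in a Series but not in an ndarray: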

In [39]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__annotations__ __dict__ __module__ 

In [40]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinarray):
        print(identifier, end=' ')

__finalize__ __getattr__ __nonzero__ __round__ __weakref__ 

Some supplementary functionality is provided by additional attributes:

In [41]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

array at attrs axes dtypes empty hasnans iat index is_monotonic_decreasing is_monotonic_increasing is_unique name values 

Many of these attributes return the value supplied in the initialisation signature:

In [42]:
xseries.array

<PandasArray>
[1.1, 2.1, 3.1, 4.1]
Length: 4, dtype: float64

In [43]:
xseries.name

'x'

In [44]:
xseries.index

RangeIndex(start=0, stop=4, step=1)

In [45]:
xseries.values

array([1.1, 2.1, 3.1, 4.1])

In [46]:
xseries.dtypes

dtype('float64')
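
A few of the other attributes can be sketched in the same way. The example below also uses the `to_numpy` method, which the pandas documentation generally recommends over the `values` attribute when an ndarray is needed; `as_array` is a name chosen for the example:

```python
import numpy as np
import pandas as pd

xseries = pd.Series(np.array([1.1, 2.1, 3.1, 4.1]), name='x')

# Descriptive attributes shared with ndarray
print(xseries.shape, xseries.size, xseries.ndim)

# Attributes specific to a Series
print(xseries.is_unique, xseries.hasnans, xseries.is_monotonic_increasing)

# to_numpy() returns the data as an ndarray
as_array = xseries.to_numpy()
print(type(as_array).__name__)
```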

The main functionality is in the added methods:

In [47]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill bool combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide divmod drop drop_duplicates droplevel dropna duplicated eq equals ewm expanding explode factorize ffill fillna filter first first_valid_index floordiv ge get groupby gt head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index le loc lt map mask median memory_usage mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop pow product quantile radd rank rdiv rdivmod reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling rpow rsub rtruediv sample sem set_axis set_flags shift skew sort_index sort_values sub subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_

Note in the above there are method equivalents to the data model identifiers:

In [48]:
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinarrayasdatamodel):
        print(identifier, end=' ')

abs add bool divmod eq floordiv ge gt le lt mod mul ne pow radd rdivmod rfloordiv rmod rmul rpow rsub rtruediv sub truediv 

It is more common to use the equivalent data model operator:

In [49]:
xseries + 4

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [50]:
xseries.__add__(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64

In [51]:
xseries.add(4)

0    5.1
1    6.1
2    7.1
3    8.1
Name: x, dtype: float64
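
One reason to reach for the named method instead of the operator is that the method accepts extra keyword arguments. The sketch below uses the `fill_value` argument of `add` on two Series whose indexes only partly overlap; the Series `a` and `b` are made-up examples:

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['p', 'q', 'r'])
b = pd.Series([10.0, 20.0], index=['q', 'r'])

# The operator aligns on the index; 'p' has no partner so the result is NaN
with_operator = a + b
print(with_operator)

# The method can substitute a value for the missing partner
with_method = a.add(b, fill_value=0)
print(with_method)
```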

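Some of these methods share a name with a function in builtins. Most behave similarly, although filter and map differ in detail (Series.filter, for example, selects by index label rather than by value):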

In [52]:
import builtins
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinbuiltins = identifier in dir(builtins)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinarray and isinbuiltins):
        print(identifier, end=' ')

abs bool divmod filter map pow 
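
A short sketch comparing a few of these methods with their builtins namesakes follows. The Series `s` and the lambda function are invented for the example:

```python
import pandas as pd

s = pd.Series([-1.5, 2.5, -3.5], index=['a', 'b', 'c'], name='x')

# The builtin abs and the abs method give the same result
print(abs(s).equals(s.abs()))

# map applies a function to every value and returns a new Series
doubled = s.map(lambda value: value * 2)
print(doubled.tolist())

# filter selects by index label, unlike the builtin filter
subset = s.filter(items=['a', 'c'])
print(subset)
```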

The remaining methods are specific to pandas:

In [53]:
import builtins
for identifier in dir(xseries):
    isfunction = callable(getattr(xseries, identifier))
    isinarray = identifier in dir(xarray)
    isinarrayasdatamodel = ('__' + identifier + '__') in dir(xarray)
    isdatamodel = identifier[0] == '_'
    isinbuiltins = identifier in dir(builtins)
    if (isfunction and not isdatamodel and not isinarray and not isinarrayasdatamodel and not isinbuiltins):
        print(identifier, end=' ')

add_prefix add_suffix agg aggregate align apply asfreq asof at_time autocorr backfill between between_time bfill combine combine_first compare convert_dtypes corr count cov cummax cummin describe diff div divide drop drop_duplicates droplevel dropna duplicated equals ewm expanding explode factorize ffill fillna first first_valid_index get groupby head hist idxmax idxmin iloc infer_objects info interpolate isin isna isnull items keys kurt kurtosis last last_valid_index loc mask median memory_usage mode multiply nlargest notna notnull nsmallest nunique pad pct_change pipe plot pop product quantile rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rolling sample sem set_axis set_flags shift skew sort_index sort_values subtract swaplevel tail to_clipboard to_csv to_dict to_excel to_frame to_hdf to_json to_latex to_list to_markdown to_numpy to_period to_pickle to_sql to_string to_timestamp to_xarray transform truncate tz_convert tz_localize unique

## DataFrame Identifiers

If the following DataFrame is constructed:

In [54]:
df = pd.DataFrame({'x': np.array([1.1, 2.1, 3.1, 3.1]),
                   'y': np.array([1.2, 2.2, 3.2, 4.2])})

In [55]:
df

| |x|y|
|---|---|---|
|0|1.1|1.2|
|1|2.1|2.2|
|2|3.1|3.2|
|3|3.1|4.2|


In [56]:
print(dir(df), end=' ')

['T', '_AXIS_LEN', '_AXIS_ORDERS', '_AXIS_TO_AXIS_NUMBER', '_HANDLED_TYPES', '__abs__', '__add__', '__and__', '__annotations__', '__array__', '__array_priority__', '__array_ufunc__', '__bool__', '__class__', '__contains__', '__copy__', '__dataframe__', '__deepcopy__', '__delattr__', '__delitem__', '__dict__', '__dir__', '__divmod__', '__doc__', '__eq__', '__finalize__', '__floordiv__', '__format__', '__ge__', '__getattr__', '__getattribute__', '__getitem__', '__getstate__', '__gt__', '__hash__', '__iadd__', '__iand__', '__ifloordiv__', '__imod__', '__imul__', '__init__', '__init_subclass__', '__invert__', '__ior__', '__ipow__', '__isub__', '__iter__', '__itruediv__', '__ixor__', '__le__', '__len__', '__lt__', '__matmul__', '__mod__', '__module__', '__mul__', '__ne__', '__neg__', '__new__', '__nonzero__', '__or__', '__pos__', '__pow__', '__radd__', '__rand__', '__rdivmod__', '__reduce__', '__reduce_ex__', '__repr__', '__rfloordiv__', '__rmatmul__', '__rmod__', '__rmul__', '__ror__', '__

These can be split into attributes:

In [57]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

at attrs axes columns dtypes empty flags iat index ndim shape size style values x y 

Methods:

In [58]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isupper and not isdatamodel):
        print(identifier, end=' ')

abs add add_prefix add_suffix agg aggregate align all any apply applymap asfreq asof assign astype at_time backfill between_time bfill bool boxplot clip combine combine_first compare convert_dtypes copy corr corrwith count cov cummax cummin cumprod cumsum describe diff div divide dot drop drop_duplicates droplevel dropna duplicated eq equals eval ewm expanding explode ffill fillna filter first first_valid_index floordiv from_dict from_records ge get groupby gt head hist idxmax idxmin iloc infer_objects info insert interpolate isetitem isin isna isnull items iterrows itertuples join keys kurt kurtosis last last_valid_index le loc lt mask max mean median melt memory_usage merge min mod mode mul multiply ne nlargest notna notnull nsmallest nunique pad pct_change pipe pivot pivot_table plot pop pow prod product quantile query radd rank rdiv reindex reindex_like rename rename_axis reorder_levels replace resample reset_index rfloordiv rmod rmul rolling round rpow rsub rtruediv sample select_

Data model attributes:

In [59]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__annotations__ __array_priority__ __dict__ __doc__ __hash__ __module__ 

Data model methods:

In [60]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isupper = identifier[0].isupper()
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and not isupper and isdatamodel):
        print(identifier, end=' ')

__abs__ __add__ __and__ __array__ __array_ufunc__ __bool__ __class__ __contains__ __copy__ __dataframe__ __deepcopy__ __delattr__ __delitem__ __dir__ __divmod__ __eq__ __finalize__ __floordiv__ __format__ __ge__ __getattr__ __getattribute__ __getitem__ __getstate__ __gt__ __iadd__ __iand__ __ifloordiv__ __imod__ __imul__ __init__ __init_subclass__ __invert__ __ior__ __ipow__ __isub__ __iter__ __itruediv__ __ixor__ __le__ __len__ __lt__ __matmul__ __mod__ __mul__ __ne__ __neg__ __new__ __nonzero__ __or__ __pos__ __pow__ __radd__ __rand__ __rdivmod__ __reduce__ __reduce_ex__ __repr__ __rfloordiv__ __rmatmul__ __rmod__ __rmul__ __ror__ __round__ __rpow__ __rsub__ __rtruediv__ __rxor__ __setattr__ __setitem__ __setstate__ __sizeof__ __str__ __sub__ __subclasshook__ __truediv__ __weakref__ __xor__ 

Most of these behave analogously to their counterparts in Series, broadcasting across the entire DataFrame instance instead of along a single Series.

There are no data model attributes in the DataFrame class not found in the Series class:

In [61]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (not isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

The only data model method in the DataFrame class not in the Series class is \_\_dataframe\_\_:

In [62]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0:2] == '__'
    if (isfunction and isdatamodel and not isinxseries):
        print(identifier, end=' ')

__dataframe__ 

If the attributes are examined:

In [63]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (not isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

columns style x y 

Notice the columns attribute returns an Index containing the name of each Series:

In [64]:
df.columns

Index(['x', 'y'], dtype='object')

Both column names are valid Python identifiers:

In [65]:
'x'.isidentifier()

True

In [66]:
'y'.isidentifier()

True

As the column names also don't clash with any of the other DataFrame identifiers, each column can be accessed as an attribute:

In [67]:
df.x

0    1.1
1    2.1
2    3.1
3    3.1
Name: x, dtype: float64

In [68]:
df.y

0    1.2
1    2.2
2    3.2
3    4.2
Name: y, dtype: float64
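
When a column name is not a valid identifier, or clashes with an existing DataFrame identifier, square bracket access still works. A minimal sketch with two deliberately awkward, made-up column names:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1],
                   'my col': [1.2, 2.2],
                   'count': [1, 2]})

print('my col'.isidentifier())   # contains a space, so not an identifier
print(df['my col'].tolist())

# 'count' clashes with the DataFrame method of the same name,
# so attribute access finds the method and not the column
print(callable(df.count))
print(df['count'].tolist())
```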

The following methods are also supplementary for a DataFrame:

In [69]:
for identifier in dir(df):
    isfunction = callable(getattr(df, identifier))
    isinxseries = identifier in dir(xseries)
    isdatamodel = identifier[0] == '_'
    if (isfunction and not isdatamodel and not isinxseries):
        print(identifier, end=' ')

applymap assign boxplot corrwith eval from_dict from_records insert isetitem iterrows itertuples join melt merge pivot pivot_table query select_dtypes set_index stack to_feather to_gbq to_html to_orc to_parquet to_records to_stata to_xml 
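
Three of these DataFrame-only methods are sketched below: `assign` adds a column, `query` filters rows with an expression string, and `set_index` promotes a column to the index. Each returns a new DataFrame. The column name `total` and the threshold are arbitrary choices for the example:

```python
import pandas as pd

df = pd.DataFrame({'x': [1.1, 2.1, 3.1, 3.1],
                   'y': [1.2, 2.2, 3.2, 4.2]})

# Add a derived column
with_total = df.assign(total=df['x'] + df['y'])

# Keep only the rows where the expression is true
large = with_total.query('total > 4')

# Use an existing column as the index
by_y = df.set_index('y')

print(with_total)
print(large)
print(by_y)
```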

## Mutability

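The Series and DataFrame classes are mutable collections, meaning they define the data model methods \_\_setitem\_\_ and \_\_delitem\_\_, which modify the instance, in addition to \_\_getitem\_\_, which only reads from it. The Index class is immutable: it defines \_\_setitem\_\_ but calling it raises a TypeError. For the Series class: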

In [70]:
'__getitem__' in dir(pd.Series)

True

In [71]:
'__setitem__' in dir(pd.Series)

True

In [72]:
'__delitem__' in dir(pd.Series)

True
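
The Index class behaves differently here: an individual Index value cannot be reassigned. A minimal sketch of what happens when this is attempted, and of replacing the whole index instead, follows; the variable names are illustrative:

```python
import pandas as pd

idx = pd.Index([0, 1, 2, 3])

message = None
try:
    idx[0] = 10   # an Index does not support item assignment
except TypeError as error:
    message = str(error)
    print('TypeError:', message)

# The index of a Series can still be replaced as a whole
s = pd.Series([1.1, 2.1, 3.1, 4.1], index=idx, name='x')
s.index = ['a', 'b', 'c', 'd']
print(s)
```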

This means the following Series can be indexed into:

In [73]:
xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')

Indexing is carried out using \_\_getitem\_\_; the shorthand notation uses square brackets to enclose the index value:

In [74]:
xseries

0    1.1
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64

For example:

In [75]:
xseries[0]

1.1

A value can be reassigned using \_\_setitem\_\_, which modifies the Series in place. Assigning None to a float Series stores NaN:

In [76]:
xseries[0] = None

In [77]:
xseries

0    NaN
1    2.1
2    3.1
3    4.1
Name: x, dtype: float64
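
The missing value can be detected afterwards. This sketch repeats the assignment on a fresh Series and then uses the `isna` and `count` methods from the method listing above:

```python
import numpy as np
import pandas as pd

xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
xseries[0] = None   # stored as NaN in a float Series

print(xseries.isna().tolist())   # True marks the missing value
print(xseries.dtype)             # the dtype is still float64
print(xseries.count())           # count ignores missing values
```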

A value can be deleted using \_\_delitem\_\_, which also modifies the Series in place:

In [78]:
del xseries[2]

In [79]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64
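
After the deletion the index labels (0, 1, 3) no longer match the positions (0, 1, 2). The sketch below, which rebuilds the Series so it can run on its own, shows how `loc` (label-based) and `iloc` (position-based) make the distinction explicit:

```python
import pandas as pd

xseries = pd.Series([1.1, 2.1, 3.1, 4.1], name='x')
del xseries[2]

print(list(xseries.index))   # the label 2 is gone
print(2 in xseries.index)

print(xseries.loc[3])        # label 3
print(xseries.iloc[2])       # third position, the same value
```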

Although ndarray, Series and DataFrame are mutable data types, most of their methods do not modify the instance by default and instead return a new instance. If the docstring of the method dropna is examined:

In [80]:
? xseries.dropna

Signature:
xseries.dropna(
    *,
    axis: 'Axis' = 0,
    inplace: 'bool' = False,
    how: 'AnyAll | None' = None,
    ignore_index: 'bool' = False,
) -> 'Series | None'
Docstring:
Return a new Series with missing values removed.

See the :ref:`User Guide <missing_data>` for more on which values are
considered missing, and how to work with missing data.

Parameters
----------
axis : {0 or 'index'}
    Unused. Parameter needed for compatibility with DataFrame.
inplace : bool, default False
    If True, do operation inplace and retur

Notice it has a number of keyword input arguments such as axis and inplace which have default values. inplace has the default value of False, so the method leaves the original Series unchanged and returns a new Series:

In [81]:
xseries.dropna() # Return value

1    2.1
3    4.1
Name: x, dtype: float64

In [88]:
xseries

0    NaN
1    2.1
3    4.1
Name: x, dtype: float64

When inplace is set to True, the method instead modifies the Series in place and has no return value:

In [89]:
xseries.dropna(inplace=True) # No return value

In [90]:
xseries

1    2.1
3    4.1
Name: x, dtype: float64
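
A common alternative to inplace=True is to assign the returned Series to a name. The sketch below shows both approaches side by side on copies of the same made-up data, so the two results can be compared:

```python
import numpy as np
import pandas as pd

original = pd.Series([np.nan, 2.1, 4.1], index=[0, 1, 3], name='x')

# Approach 1: keep the default and bind the returned Series to a name
cleaned = original.dropna()

# Approach 2: modify a copy in place
target = original.copy()
target.dropna(inplace=True)

print(cleaned.equals(target))
print(len(original))   # the original still has all three rows
```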

The same behaviour can be seen on the method reset_index:

In [82]:
? xseries.reset_index

Signature:
xseries.reset_index(
    level: 'IndexLabel' = None,
    *,
    drop: 'bool' = False,
    name: 'Level' = <no_default>,
    inplace: 'bool' = False,
    allow_duplicates: 'bool' = False,
) -> 'DataFrame | Series | None'
Docstring:
Generate a new DataFrame or Series with the index reset.

This is useful when the index needs to be treated as a column, or
when the index is meaningless and needs to be reset to the default
before another

With default values this returns a DataFrame, since the old index is added as a column:

In [91]:
xseries.reset_index() # Return value

| |index|x|
|---|---|---|
|0|1|2.1|
|1|3|4.1|


If drop is set to True, a Series will instead be returned:

In [92]:
xseries.reset_index(drop=True) # Return value

0    2.1
1    4.1
Name: x, dtype: float64

Once again the inplace keyword input argument can be set to True so that the method modifies the Series in place:

In [84]:
xseries.reset_index(drop=True, inplace=True) # No return value

In [93]:
xseries

0    2.1
1    4.1
Name: x, dtype: float64
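
Because both methods return a new Series by default, they can be chained into a single expression. This is a sketch of that style on made-up data; `tidy` is a name chosen for the example:

```python
import numpy as np
import pandas as pd

xseries = pd.Series([np.nan, 2.1, 3.1, 4.1], name='x')

# Drop the missing value, then renumber the index from zero
tidy = xseries.dropna().reset_index(drop=True)

print(tidy)
print(len(xseries))   # the original Series is unchanged
```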
