[back](./12-dataframe-modification.ipynb)

---
## `DataFrame Operations`

These are similar to operations on Pandas Series, but we have multiple columns to handle.

- [Arithmetic operations on DataFrames](#arithmetic-operations-on-dataframes)
- [Applying functions to DataFrames _(to change values)_](#applying-functions-to-dataframes)
- [Transposing DataFrames _(flips rows and columns)_](#transposing-dataframes)
- [Converting DataFrames _(to different types)_](#converting-dataframes)
- [Sorting DataFrames](#sorting-dataframes)

### `Initial Setup`

In [1]:
# Importing Pandas

import pandas as pd
import numpy as np

In [2]:
# Data set-up

df = pd.DataFrame({
  'col1': {'row1':1, 'row2':1, 'row3':3},
  'col2': {'row1':4, 'row3':9, 'row4':6},
  'col3': {'row1':10, 'row2':8, 'row4':6}
  })

def reset_df():
  global df
  df = pd.DataFrame({
      'col1': {'row1': 1, 'row2': 1, 'row3': 3},
      'col2': {'row1': 4, 'row3': 9, 'row4': 6},
      'col3': {'row1': 10, 'row2': 8, 'row4': 6}
  })
  print_df()


def print_df():
  print('Original DataFrame:')
  print(df)
  divider()

def divider():
  print('-'*80)

print_df()


Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------



---

### `Arithmetic operations on DataFrames`

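- Arithmetic operators such as `+`, `-`, `*`, `/`, `**`, `//` and `%` can be used and are applied element-wise (note that `^` is bitwise XOR in Python, not exponentiation, so it is not an arithmetic operator)
- Or we can use the equivalent DataFrame methods that Pandas provides, such as `add`, `sub`, `mul`, `div`, `pow`, `floordiv` and `mod`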
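
As a quick illustration of the two styles listed above, the following sketch compares each operator with its method counterpart. It rebuilds the same `df` used in this notebook; the `pairs` dictionary is only a helper name introduced for this example.

```python
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Operator form on the left, method form on the right
pairs = {
    'add': (df + 2, df.add(2)),
    'sub': (df - 2, df.sub(2)),
    'mul': (df * 2, df.mul(2)),
    'div': (df / 2, df.div(2)),
    'pow': (df ** 2, df.pow(2)),
    'floordiv': (df // 2, df.floordiv(2)),
    'mod': (df % 2, df.mod(2)),
}

# DataFrame.equals treats NaN values in the same position as equal
for name, (with_operator, with_method) in pairs.items():
    print(name, with_operator.equals(with_method))
```

The method form is mainly useful when extra options such as `fill_value` or `axis` are needed.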

In [3]:
print_df()
# Simple operators
# This and almost all of the operators will return a new DataFrame, instead of modifying the original one
print('Add 5 to DataFrame:')
print(df + 5)
divider()

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Add 5 to DataFrame:
      col1  col2  col3
row1   6.0   9.0  15.0
row2   6.0   NaN  13.0
row3   8.0  14.0   NaN
row4   NaN  11.0  11.0
--------------------------------------------------------------------------------


In [4]:
print_df()
# Simple operators
# This and almost all of the operators will return a new DataFrame, instead of modifying the original one
print('Square all the values in the DataFrame:')
print(df ** 2)
divider()


Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Square all the values in the DataFrame:
      col1  col2   col3
row1   1.0  16.0  100.0
row2   1.0   NaN   64.0
row3   9.0  81.0    NaN
row4   NaN  36.0   36.0
--------------------------------------------------------------------------------


In [5]:
print_df()

# An alternative is to use the built-in arithmetic methods, which accept extra options
print('Convert all NaN to 0 and then add 5 to DataFrame:')
# axis controls whether a Series operand is matched against the index or the columns
# fill_value replaces NaN with the given value before the operation is applied

# Convert all the NaN to 0 and then add 5
print(df.add(5, fill_value=0))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Convert all NaN to 0 and then add 5 to DataFrame:
      col1  col2  col3
row1   6.0   9.0  15.0
row2   6.0   5.0  13.0
row3   8.0  14.0   5.0
row4   5.0  11.0  11.0
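
The `axis` argument only matters when the other operand is a Series. A minimal sketch, assuming two made-up Series (`row_offsets` and `col_offsets` are names introduced here, not part of the notebook):

```python
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Hypothetical offsets, one per row label and one per column label
row_offsets = pd.Series({'row1': 10, 'row2': 20, 'row3': 30, 'row4': 40})
col_offsets = pd.Series({'col1': 100, 'col2': 200, 'col3': 300})

# axis='index': match the Series labels against the row labels
by_row = df.add(row_offsets, axis='index')
print(by_row)

# axis='columns': match the Series labels against the column labels
by_col = df.add(col_offsets, axis='columns')
print(by_col)
```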


### `Applying functions to DataFrames`

In [6]:
print_df()

# Custom function
def func(x):
  return (2*x) + 5

# Applying custom function to DataFrame
print(df.apply(func))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
      col1  col2  col3
row1   7.0  13.0  25.0
row2   7.0   NaN  21.0
row3  11.0  23.0   NaN
row4   NaN  17.0  17.0



---
We can also use the `transform` function. It accepts NumPy functions, function names given as strings, and custom functions, and it returns a result with the same index as the input.

For a single custom function, `apply` and `transform` give the same result, so there is no noticeable benefit to choosing one over the other.
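
One difference that is easy to check: `apply` also accepts functions that reduce each column to a single value, while `transform` rejects them. The sketch below rebuilds the notebook's `df`; the flag names `same` and `transform_rejected` are only used for this example.

```python
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

def func(x):
    return (2 * x) + 5

# For an element-wise function both methods agree
same = df.apply(func).equals(df.transform(func))
print('apply and transform agree:', same)

# apply may reduce each column to one value (NaN is skipped by sum)
column_sums = df.apply(lambda col: col.sum())
print(column_sums)

# transform must keep the original index, so a reduction is rejected
try:
    df.transform(lambda col: col.sum())
    transform_rejected = False
except ValueError as err:
    transform_rejected = True
    print('transform raised ValueError:', err)
```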

In [7]:
print_df()

# Transform function

print('Transform DataFrame with numpy square function:')
print(df.transform(np.square))


Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Transform DataFrame with numpy square function:
      col1  col2   col3
row1   1.0  16.0  100.0
row2   1.0   NaN   64.0
row3   9.0  81.0    NaN
row4   NaN  36.0   36.0


In [8]:
print_df()

# Custom function


def func(x):
  return (2*x) + 5

# Transform function

print('Transform DataFrame with custom function:')
print(df.transform(func))


Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Transform DataFrame with custom function:
      col1  col2  col3
row1   7.0  13.0  25.0
row2   7.0   NaN  21.0
row3  11.0  23.0   NaN
row4   NaN  17.0  17.0


In [9]:
print_df()

# We can also pass the name of a function as a string
print('Transform DataFrame using string function name:')
print(df.transform('square'))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Transform DataFrame using string function name:
      col1  col2   col3
row1   1.0  16.0  100.0
row2   1.0   NaN   64.0
row3   9.0  81.0    NaN
row4   NaN  36.0   36.0


In [10]:
print_df()

# Alternatively, we can pass in a lambda function to transform function
print('Square the values in a DataFrame using lambda function:')
print(df.transform(lambda x: x**2))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Square the values in a DataFrame using lambda function:
      col1  col2   col3
row1   1.0  16.0  100.0
row2   1.0   NaN   64.0
row3   9.0  81.0    NaN
row4   NaN  36.0   36.0


---

The main advantage of using `transform` is that we can apply multiple functions at once.

In [11]:
print_df()

# Example where we wanted to square everything and then we wanted to take the log of everything
print('Perform square and log the values in a DataFrame:')
print(df.transform([np.square, np.log]))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Perform square and log the values in a DataFrame:
       col1             col2             col3          
     square       log square       log square       log
row1    1.0  0.000000   16.0  1.386294  100.0  2.302585
row2    1.0  0.000000    NaN       NaN   64.0  2.079442
row3    9.0  1.098612   81.0  2.197225    NaN       NaN
row4    NaN       NaN   36.0  1.791759   36.0  1.791759


The result above shows that a sub-column is created to hold the result of each operation performed by the `transform` function.

The `transform` function did not square the values and then take the log of the squared values;<br>
instead, it applied each operation independently to the original values and created a new sub-column for each operation.
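
If the goal is to chain the operations (square first, then log of the squared values), one option is to call `transform` twice or to compose the functions in a lambda. This is a sketch of that idea, not the only way to do it; `chained` and `composed` are names introduced here.

```python
import numpy as np
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Option 1: two transform calls, the second works on the squared values
chained = df.transform(np.square).transform(np.log)

# Option 2: compose both steps inside a single function
composed = df.transform(lambda x: np.log(np.square(x)))

print(chained)
print(composed)
```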

---

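We can apply the `transform` function to specific columns using a dictionary that maps column labels to functions. Columns that are not in the dictionary (`col3` below) are left out of the result.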

In [12]:
print_df()

# Applying transform function to specific column
print(df.transform({'col1': np.square, 'col2': np.log}))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
      col1      col2
row1   1.0  1.386294
row2   1.0       NaN
row3   9.0  2.197225
row4   NaN  1.791759
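
Note that `col3` is missing from the result above. If the untouched columns should be kept, one possible approach (an alternative to `transform`, shown here as a sketch) is `DataFrame.assign`, which returns a copy with only the named columns replaced:

```python
import numpy as np
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Replace col1 and col2 with transformed values; col3 is carried over unchanged
kept = df.assign(
    col1=np.square(df['col1']),
    col2=np.log(df['col2']),
)
print(kept)
```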


### `Transposing DataFrames`

In [13]:
print_df()

# Transpose the DataFrame
print('DataFrame Transposed:')
print(df.transpose())

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
DataFrame Transposed:
      row1  row2  row3  row4
col1   1.0   1.0   3.0   NaN
col2   4.0   NaN   9.0   6.0
col3  10.0   8.0   NaN   6.0
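
Pandas also exposes the transpose as the `.T` attribute. The short sketch below checks that it matches `transpose()` and that transposing twice gives back the original frame (which holds here because every column has the same float type).

```python
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# .T is shorthand for .transpose()
print(df.T.equals(df.transpose()))

# Row labels become column labels and vice versa
print(list(df.T.columns))

# Transposing twice restores the original layout
print(df.T.T.equals(df))
```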


### `Converting DataFrames`

In [14]:
print_df()

# This is similar to Series
print('Convert DataFrame elements to objects:')
print(df.astype('object'))

Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Convert DataFrame elements to objects:
     col1 col2  col3
row1  1.0  4.0  10.0
row2  1.0  NaN   8.0
row3  3.0  9.0   NaN
row4  NaN  6.0   6.0


In [15]:
print_df()

# This is similar to Series
print('Convert DataFrame elements to np.float16:')
print(df.astype(np.float16))


Original DataFrame:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Convert DataFrame elements to np.float16:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
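
`astype` also accepts a dictionary to convert individual columns. The sketch below additionally shows a common pitfall with this data: columns containing NaN cannot be cast to a plain NumPy integer type, while the nullable `'Int64'` type can hold the missing values. The names `mixed` and `int_failed` are only used for this example.

```python
import pandas as pd

# Same data as the notebook's set-up cell
df = pd.DataFrame({
    'col1': {'row1': 1, 'row2': 1, 'row3': 3},
    'col2': {'row1': 4, 'row3': 9, 'row4': 6},
    'col3': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Convert only some columns by passing {column label: dtype}
mixed = df.astype({'col1': 'float32', 'col2': 'Int64'})
print(mixed.dtypes)
print(mixed)

# NaN has no representation in a NumPy integer dtype, so this fails
try:
    df.astype('int64')
    int_failed = False
except ValueError as err:
    int_failed = True
    print('astype to int64 raised:', err)
```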


### `Sorting DataFrames`

We can sort by index or by values

##### `01 - By index`

> axis : {0 or 'index', 1 or 'columns'}, default 0
>
>    The axis along which to sort. The value 0 identifies the rows, and 1 identifies the columns.
>
> ascending : bool or list-like of bools, default True
>
>    Sort ascending vs. descending. When the index is a MultiIndex the sort direction can be controlled for each level individually.

In [16]:
new_df = pd.DataFrame({
    'col2': {'row1': 1, 'row2': 1, 'row3': 3},
    'col3': {'row1': 4, 'row3': 9, 'row4': 6},
    'col1': {'row1': 10, 'row2': 8, 'row4': 6}
})

print('Original DataFrame:')
print(new_df)
divider()

print('Sort by rows:')
print(new_df.sort_index(axis=0, ascending=False))

print('Sort by columns:')
print(new_df.sort_index(axis=1, ascending=True))


Original DataFrame:
      col2  col3  col1
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Sort by rows:
      col2  col3  col1
row4   NaN   6.0   6.0
row3   3.0   9.0   NaN
row2   1.0   NaN   8.0
row1   1.0   4.0  10.0
Sort by columns:
      col1  col2  col3
row1  10.0   1.0   4.0
row2   8.0   1.0   NaN
row3   NaN   3.0   9.0
row4   6.0   NaN   6.0


##### `02 - By values`

This works a bit differently, as we need to specify which column (or row) to sort by.

The reason is that each column (or row) generally puts the data in a different order: the largest and smallest values of different columns can sit in completely different rows.

> by : str or list of str
>
>    Name or list of names to sort by.
>
>    - if `axis` is 0 or `'index'` then `by` may contain index levels and/or column labels.
>    - if `axis` is 1 or `'columns'` then `by` may contain column levels and/or index labels.
>
> axis : {0 or 'index', 1 or 'columns'}, default 0
>
>    Axis to be sorted.
>
> ascending : bool or list of bool, default True
>
>    Sort ascending vs. descending. Specify list for multiple sort orders. If this is a list of bools, must match the length of the by.
>
> inplace : bool, default False
>
>    If True, perform operation in-place.
>
> kind : {'quicksort', 'mergesort', 'heapsort', 'stable'}, default 'quicksort'
>
>    Choice of sorting algorithm. See also `numpy.sort` for more information. `mergesort` and `stable` are the only stable algorithms. For DataFrames, this option is only applied when sorting on a single column or label.
>
> na_position : {'first', 'last'}, default 'last'
>
>    Puts NaNs at the beginning if `first`; `last` puts NaNs at the end.
>
> ignore_index : bool, default False
>
>    If True, the resulting axis will be labeled 0, 1, …, n - 1.
>
> key : callable, optional
>
>    Apply the key function to the values before sorting. This is similar to the `key` argument in the builtin `sorted` function, with the notable difference that this `key` function should be *vectorized*. It should expect a `Series` and return a Series with the same shape as the input. It will be applied to each column in `by` independently.


In [17]:
new_df = pd.DataFrame({
    'col2': {'row1': 1, 'row2': 1, 'row3': 3},
    'col3': {'row1': 4, 'row3': 9, 'row4': 6},
    'col1': {'row1': 10, 'row2': 8, 'row4': 6}
})

print('Original DataFrame:')
print(new_df)
divider()


print('Sort DataFrame using column values:')
# Sort the rows (axis=0) by the values in column 'col3'
print(new_df.sort_values('col3', axis=0, ascending=False))
divider()

print('Sort DataFrame using row values:')
# Sort the columns (axis=1) by the values in row 'row2'
# The rows keep their order; only the columns are reordered
print(new_df.sort_values('row2', axis=1, ascending=True))

Original DataFrame:
      col2  col3  col1
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Sort DataFrame using column values:
      col2  col3  col1
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
--------------------------------------------------------------------------------
Sort DataFrame using row values:
      col2  col1  col3
row1   1.0  10.0   4.0
row2   1.0   8.0   NaN
row3   3.0   NaN   9.0
row4   NaN   6.0   6.0
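
The `by`, `ascending` and `na_position` parameters can be combined. The sketch below sorts the same `new_df` by two columns with a different direction for each and places NaN values first; the variable name `multi_sorted` is introduced only for this example.

```python
import pandas as pd

# Same data as the sorting cells above
new_df = pd.DataFrame({
    'col2': {'row1': 1, 'row2': 1, 'row3': 3},
    'col3': {'row1': 4, 'row3': 9, 'row4': 6},
    'col1': {'row1': 10, 'row2': 8, 'row4': 6},
})

# Sort by col2 ascending; break ties with col3 descending.
# na_position='first' puts NaN before all other values for every sort key.
multi_sorted = new_df.sort_values(
    ['col2', 'col3'],
    ascending=[True, False],
    na_position='first',
)
print(multi_sorted)

# sort_values returns a new DataFrame; new_df keeps its original row order
print(list(new_df.index))
```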



---
[next]()