[back](./11-dataframe-properties.ipynb)

---
## `Modifying DataFrames`

- [Modifying individual elements from DataFrame](#modifying-dataframe-elements)
  - [Adding elements to DataFrame](#adding-elements-to-dataframe)
  - [Combining DataFrames](#combining-dataframe)
  - [Remove operations on a DataFrame](#remove-operations-on-dataframe)
- [Modifying entire rows and columns in DataFrame](#modifying-dataframe-rows-and-columns)
- [Shuffling and relabelling DataFrames](#shuffling-and-relabelling-dataframes)

### `Initial Setup`

In [1]:
# Import pandas

import pandas as pd

In [2]:
# Data set-up

df1 = pd.DataFrame({
  'col1': {'row1':1, 'row2':1, 'row3':3},
  'col2': {'row1':4, 'row3':9, 'row4':6},
  'col3': {'row1':10, 'row2':8, 'row4':6}
  })

df2 = pd.DataFrame({
  'col1': {'row1':10, 'row4':6, 'row3':9},
  'col2': {'row1':2, 'row3':1, 'row2':6},
  'col3': {'row3':7, 'row2':6, 'row4':0}
  })

def reset_df1():
  global df1
  df1 = pd.DataFrame({
      'col1': {'row1': 1, 'row2': 1, 'row3': 3},
      'col2': {'row1': 4, 'row3': 9, 'row4': 6},
      'col3': {'row1': 10, 'row2': 8, 'row4': 6}
  })
  print_df1()

def reset_df2():
  global df2
  df2 = pd.DataFrame({
      'col1': {'row1': 10, 'row4': 6, 'row3': 9},
      'col2': {'row1': 2, 'row3': 1, 'row2': 6},
      'col3': {'row3': 7, 'row2': 6, 'row4': 0}
  })
  print_df2()

def print_df1():
  print('Original DataFrame 1:')
  print(df1)
  divider()

def print_df2():
  print('Original DataFrame 2:')
  print(df2)
  divider()

def divider():
  print('-'*80)

print_df1()
print_df2()


Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Original DataFrame 2:
      col1  col2  col3
row1  10.0   2.0   NaN
row4   6.0   NaN   0.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0
--------------------------------------------------------------------------------


### `Modifying DataFrame elements`

#### `Adding elements to DataFrame`

- Assign a row or column that doesn't exist already
- Use the `append` function (deprecated in pandas 1.4 and removed in 2.0, where `pd.concat` takes its place)

In [3]:
"""
Assigning a new row / column
works much like assigning a key in a dictionary
"""

print_df1()

"""
It is better to match the length (rows / columns) of the existing DataFrame
A Series is aligned on its labels: extra labels are cut off and missing ones are filled with NaN
"""
# This would not give the intended result: the Series labels ('a', 'b') match none of the existing row labels, so the new column would be all NaN
# df1['col4'] = pd.Series({'a':1, 'b':2})

# Adding a column to DataFrame 1
print('Updating / Modifying DataFrame 1:')
df1['col4'] = [1, 2, 3, 4]

print_df1()

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Updating / Modifying DataFrame 1:
Original DataFrame 1:
      col1  col2  col3  col4
row1   1.0   4.0  10.0     1
row2   1.0   NaN   8.0     2
row3   3.0   9.0   NaN     3
row4   NaN   6.0   6.0     4
--------------------------------------------------------------------------------
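The label alignment mentioned in the comments above can be seen on a small, separate frame. This is a minimal sketch: `demo` and its column names are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Hypothetical small frame, separate from df1 in the notebook
demo = pd.DataFrame({'col1': [1, 2, 3]}, index=['row1', 'row2', 'row3'])

# Labels 'a' and 'b' match no row label, so every value becomes NaN
demo['no_match'] = pd.Series({'a': 1, 'b': 2})

# Only 'row1' and 'row3' match; 'row2' is filled with NaN
demo['partial'] = pd.Series({'row1': 10, 'row3': 30})

# A plain list is assigned by position and needs one value per row
demo['by_position'] = [7, 8, 9]

print(demo)
```

A Series is matched by label, while a list is matched by position, which is why the list form in the cell above needs exactly four values.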


In [4]:
"""
Assigning a new row / column
works much like assigning a key in a dictionary
"""

print_df1()

"""
It is better to match the length (rows / columns) of the existing DataFrame
Otherwise values may be cut off or filled with NaN
"""
# This would not give the intended result either: the Series labels ('a', 'b') do not match the existing column labels
# df1.loc['row5'] = pd.Series({'a':1, 'b':2})

# Adding a row to DataFrame 1
print('Updating / Modifying DataFrame 1:')
df1.loc['row5'] = [20, 30, 40, 50]

print_df1()


Original DataFrame 1:
      col1  col2  col3  col4
row1   1.0   4.0  10.0     1
row2   1.0   NaN   8.0     2
row3   3.0   9.0   NaN     3
row4   NaN   6.0   6.0     4
--------------------------------------------------------------------------------
Updating / Modifying DataFrame 1:
Original DataFrame 1:
      col1  col2  col3  col4
row1   1.0   4.0  10.0     1
row2   1.0   NaN   8.0     2
row3   3.0   9.0   NaN     3
row4   NaN   6.0   6.0     4
row5  20.0  30.0  40.0    50
--------------------------------------------------------------------------------


In [5]:
# Adding a new row with pd.concat
# (DataFrame.append was deprecated in pandas 1.4 and removed in 2.0)

print_df1()

"""
  pd.concat accepts DataFrames or Series
  Adding a new row, so we need as many elements as there are columns
  ignore_index=True discards the existing row labels

  ignore_index : bool, default False
    If True, the resulting axis will be labeled 0, 1, …, n - 1.
"""
new_row = pd.Series([11, 12, 13, 14], index=['col1', 'col2', 'col3', 'col4'])
new_df1 = pd.concat([df1, new_row.to_frame().T], ignore_index=True)
print('New DataFrame 1:')
print(new_df1)
divider()


Original DataFrame 1:
      col1  col2  col3  col4
row1   1.0   4.0  10.0     1
row2   1.0   NaN   8.0     2
row3   3.0   9.0   NaN     3
row4   NaN   6.0   6.0     4
row5  20.0  30.0  40.0    50
--------------------------------------------------------------------------------
New DataFrame 1:
   col1  col2  col3  col4
0   1.0   4.0  10.0     1
1   1.0   NaN   8.0     2
2   3.0   9.0   NaN     3
3   NaN   6.0   6.0     4
4  20.0  30.0  40.0    50
5  11.0  12.0  13.0    14
--------------------------------------------------------------------------------
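The output above loses the `row1` ... `row5` labels because of `ignore_index=True`. If the labels should survive, one option is to give the new row a name and concatenate without renumbering. The sketch below uses a made-up frame `demo` and a made-up label `row3`; it is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical frame, separate from df1 in the notebook
demo = pd.DataFrame({'col1': [1.0, 2.0], 'col2': [3.0, 4.0]},
                    index=['row1', 'row2'])

# The Series name becomes the row label once the Series is
# turned into a one-row DataFrame with to_frame().T
new_row = pd.Series({'col1': 5.0, 'col2': 6.0}, name='row3')

# No ignore_index here, so the existing labels are kept
labelled = pd.concat([demo, new_row.to_frame().T])
print(labelled)
```

As with the other operations in this section, `demo` itself is left unchanged and a new DataFrame is returned.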


#### `Combining DataFrame`

There are two ways to combine DataFrames here. The **concat** operation takes one DataFrame and sticks it onto the end of another.<br>
The **combine_first** function instead merges the two by label: it keeps the values of the calling DataFrame and fills any **NaN** with the value at the same position in the other DataFrame.

Both of these operations return a new DataFrame and leave the originals intact.


In [6]:
reset_df1()
print_df2()

# Combine DataFrame 2 into DataFrame 1 (fills the NaN values of DataFrame 1)
combined = df1.combine_first(df2)
print('Result of combining DataFrame 2 with DataFrame 1:')
print(combined)

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Original DataFrame 2:
      col1  col2  col3
row1  10.0   2.0   NaN
row4   6.0   NaN   0.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0
--------------------------------------------------------------------------------
Result of combining DataFrame 2 with DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   6.0   8.0
row3   3.0   9.0   7.0
row4   6.0   6.0   6.0


This result comes about because `combine_first` keeps every value of DataFrame 1 and takes a value from DataFrame 2 only where DataFrame 1 has **NaN**.
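The matching is done by row and column label, not by position. A minimal sketch with two made-up frames (`left` and `right` are illustrative names) whose rows are in a different order:

```python
import numpy as np
import pandas as pd

# Hypothetical frames; 'right' lists its rows in the opposite order
left = pd.DataFrame({'col1': [1.0, np.nan], 'col2': [np.nan, 4.0]},
                    index=['row1', 'row2'])
right = pd.DataFrame({'col1': [10.0, 20.0], 'col2': [30.0, 40.0]},
                     index=['row2', 'row1'])

combined = left.combine_first(right)
print(combined)

# Non-NaN values of 'left' win; each gap is filled from the cell of
# 'right' that carries the same row and column label
print(combined.loc['row1', 'col2'])  # taken from right.loc['row1', 'col2']
print(combined.loc['row2', 'col1'])  # taken from right.loc['row2', 'col1']
```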

By contrast, **concat** in the example below takes one DataFrame and sticks it onto the end of another.

In [7]:
print_df1()
print_df2()

concat_df = pd.concat([df1, df2])
print('Concat DataFrame 1 and DataFrame 2:')
print(concat_df)

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Original DataFrame 2:
      col1  col2  col3
row1  10.0   2.0   NaN
row4   6.0   NaN   0.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0
--------------------------------------------------------------------------------
Concat DataFrame 1 and DataFrame 2:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
row1  10.0   2.0   NaN
row4   6.0   NaN   0.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0


**Things to NOTE**

The result has duplicate row labels. This is expected behavior: pandas allows duplicate labels, and `concat` keeps them by default.

As a consequence, fetching a row by label can return more than one row, for example:

In [8]:
print(concat_df.loc['row2'])

      col1  col2  col3
row2   1.0   NaN   8.0
row2   NaN   6.0   6.0
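When duplicate labels are not wanted, `pd.concat` offers a few parameters that deal with them. The sketch below uses two made-up frames `a` and `b` that share the label `row2`; which option fits depends on whether the original labels still matter.

```python
import pandas as pd

# Hypothetical frames that share the row label 'row2'
a = pd.DataFrame({'col1': [1, 2]}, index=['row1', 'row2'])
b = pd.DataFrame({'col1': [3, 4]}, index=['row2', 'row3'])

# Option 1: discard the labels and renumber the rows 0..n-1
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered)

# Option 2: add an outer index level that records the source frame
keyed = pd.concat([a, b], keys=['a', 'b'])
print(keyed.loc[('b', 'row2')])

# Option 3: refuse to build a result with overlapping labels
try:
    pd.concat([a, b], verify_integrity=True)
    overlap_detected = False
except ValueError:
    overlap_detected = True
print(overlap_detected)
```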


We can also concatenate **DataFrame slices**,

for example the first two rows of DataFrame 1 and the last two rows of DataFrame 2:

In [9]:
print_df1()
print_df2()

combined_df = pd.concat([df1[:2], df2[-2:]])
print(combined_df)

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
Original DataFrame 2:
      col1  col2  col3
row1  10.0   2.0   NaN
row4   6.0   NaN   0.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0
--------------------------------------------------------------------------------
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   9.0   1.0   7.0
row2   NaN   6.0   6.0


#### `Remove operations on DataFrame`

This is very similar to remove operations on **Pandas Series**, just that we have more axes to work with

In [10]:
# dropna function

"""
dropna drops either the rows or the columns that contain NaN values
It also returns a new DataFrame and leaves the original unchanged
"""
print_df1()

dropped = df1.dropna()
print('After dropna(), rows:')
print(dropped)

# with axis set to drop columns
dropped = df1.dropna(axis=1)
print('After dropna(), columns:')
print(dropped)


Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
After dropna(), rows:
      col1  col2  col3
row1   1.0   4.0  10.0
After dropna(), columns:
Empty DataFrame
Columns: []
Index: [row1, row2, row3, row4]
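By default `dropna` removes a row or column as soon as it holds a single NaN, which is why the column-wise call above leaves nothing. The `how`, `thresh` and `subset` parameters relax this. A minimal sketch on a made-up frame (`demo` is not the notebook's `df1`):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with a different NaN pattern from df1
demo = pd.DataFrame({
    'col1': [1.0, 1.0, 3.0, np.nan],
    'col2': [4.0, np.nan, 9.0, 6.0],
    'col3': [10.0, 8.0, np.nan, np.nan],
}, index=['row1', 'row2', 'row3', 'row4'])

# how='all' drops a row only when every value in it is NaN
none_dropped = demo.dropna(how='all')

# thresh=2 keeps rows that have at least two non-NaN values
at_least_two = demo.dropna(thresh=2)

# subset restricts the NaN check to the listed columns
col2_present = demo.dropna(subset=['col2'])

print(at_least_two)
print(col2_present)
```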


If we want to drop specific columns or indexes (rows) from DataFrames, we need to use .drop()

In [11]:
print_df1()

# We want to drop 'row4' and 'col3'
new_df1 = df1.drop(index=['row4'], columns=['col3'])
print('After dropping row4 and col3 from DataFrame 1:')
print(new_df1)

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
After dropping row4 and col3 from DataFrame 1:
      col1  col2
row1   1.0   4.0
row2   1.0   NaN
row3   3.0   9.0


In [12]:
print_df1()

# Alternatively, we can use the del statement
# This modifies the original DataFrame in place and works only on columns
del df1['col2']
print('After dropping col2 from DataFrame 1:')
print_df1()

Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
After dropping col2 from DataFrame 1:
Original DataFrame 1:
      col1  col3
row1   1.0  10.0
row2   1.0   8.0
row3   3.0   NaN
row4   NaN   6.0
--------------------------------------------------------------------------------


In [13]:
reset_df1()

# We can also use the pop method
# This modifies the original DataFrame in place, works only on columns, and returns the popped column
popped = df1.pop('col2')
print('After popping col2 from DataFrame 1:')
print_df1()

print('Popped column:')
print(popped)


Original DataFrame 1:
      col1  col2  col3
row1   1.0   4.0  10.0
row2   1.0   NaN   8.0
row3   3.0   9.0   NaN
row4   NaN   6.0   6.0
--------------------------------------------------------------------------------
After popping col2 from DataFrame 1:
Original DataFrame 1:
      col1  col3
row1   1.0  10.0
row2   1.0   8.0
row3   3.0   NaN
row4   NaN   6.0
--------------------------------------------------------------------------------
Popped column:
row1    4.0
row2    NaN
row3    9.0
row4    6.0
Name: col2, dtype: float64


### `Modifying DataFrame Rows and Columns`

### `Shuffling and Relabelling DataFrames`


---
[next]()