# 02.02 - Pandas DataFrames

## Introduction

This notebook covers the basics of working with pandas DataFrames in Python. A DataFrame is a two-dimensional, labeled data structure whose columns can hold different types, similar to a spreadsheet or an SQL table. It is the central data structure in pandas and the basis for most data manipulation tasks. The notebook covers the following topics:

1. **Creating a DataFrame** - This includes methods of creating a DataFrame using lists and dictionaries.
2. **Inspecting a DataFrame** - This covers various methods to inspect a DataFrame, such as `head()`, `tail()`, `info()`, and `describe()`.
3. **Indexing and Selecting Data in DataFrame** - This involves selecting rows and columns, using `loc` and `iloc`, and Boolean indexing.
4. **Modifying a DataFrame** - This includes adding and deleting rows and columns, and renaming columns.
5. **Handling Missing Data** - This covers methods to handle missing data, such as `isnull()`, `notnull()`, `dropna()`, and `fillna()`.
6. **DataFrame Operations** - This covers mathematical operations on DataFrames, applying functions to a DataFrame, and grouping and aggregating data.

By understanding these topics, you will be able to manipulate and analyze data using Pandas DataFrames effectively.

## Section 1: Creating a DataFrame

### 1.1 - Using Lists

You can create a DataFrame from list-like objects by passing them to the `pd.DataFrame()` constructor.

**Example 1: Creating a DataFrame from a Single List**

In [1]:
import pandas as pd

data = ['a', 'b', 'c', 'd', 'e']
df = pd.DataFrame(data)
print(df)

   0
0  a
1  b
2  c
3  d
4  e


**Example 2: Creating a DataFrame from a List of Lists**

In [2]:
import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13


**Example 3: Creating a DataFrame from a List of Dictionaries**

In [3]:
import pandas as pd

data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

   a   b     c
0  1   2   NaN
1  5  10  20.0


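**Example 4: Creating a DataFrame with a Custom Index**

The `index` argument replaces the default integer row labels with labels of your choice. Note that the data here is a dictionary of lists, one list per column.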

In [5]:
import pandas as pd

data = {'Name':['Tom', 'nick', 'krish', 'jack'],
        'Age':[20, 21, 19, 18]}

df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

        Name  Age
rank1    Tom   20
rank2   nick   21
rank3  krish   19
rank4   jack   18


### 1.2 - Using Dictionaries

You can create a DataFrame from a dictionary by passing it to the `pd.DataFrame()` constructor. The dictionary keys become the column labels and the values become the column data.

**Example 1: Creating a DataFrame from a Dictionary**

In [6]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)
print(df)

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 2: Creating a DataFrame from a Dictionary with specified indexes**

In [7]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])
print(df)

       Name  Age
rank1   Tom   20
rank2  Nick   21
rank3  John   19
rank4  Alex   18


**Example 3: Creating a DataFrame from a Dictionary of Series**

When the dictionary values are Series, pandas aligns them on their index labels. If no `index` argument is passed, the resulting index is the union of the Series' indexes, and any label missing from a Series is filled with `NaN`.

In [8]:
import pandas as pd

data = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
        'two': pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(data)
print(df)

   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4


**Example 4: Creating a DataFrame from a Dictionary of Lists with different lengths**

If you try to create a DataFrame from a dictionary of lists with different lengths, pandas raises a `ValueError`.

In [9]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)  # This will raise a ValueError

ValueError: All arrays must be of the same length
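
If padding the shorter column with `NaN` is acceptable, one possible workaround is to wrap each list in a `pd.Series` before building the DataFrame. This is only a sketch of one option, and the name `padded` is illustrative.

```python
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19, 18]}

# Wrapping each list in a Series lets pandas align the values on the default
# integer index and fill the missing position with NaN instead of raising.
padded = pd.DataFrame({key: pd.Series(values) for key, values in data.items()})
print(padded)
```

Whether silent padding is appropriate depends on the data; a length mismatch is often a sign of an upstream error that should be fixed instead.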

**Example 5: Creating a DataFrame from a Dictionary while specifying column names**

If you specify the column names while creating a DataFrame from a dictionary, the order of the columns in the resulting DataFrame will match the order of the names in your list. Any key in the dictionary not in your list will be excluded from the DataFrame.

In [10]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, columns=['Name'])
print(df)

   Name
0   Tom
1  Nick
2  John
3  Alex
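
The `columns` argument can also reorder the keys, and a name that is not a key in the dictionary produces a column filled with `NaN`. The short sketch below illustrates both effects; the `'City'` column is a made-up name used only for the demonstration.

```python
import pandas as pd

data = {'Name': ['Tom', 'Nick'], 'Age': [20, 21]}

# 'Age' comes first because it is listed first. 'City' is not a key in
# `data`, so pandas creates it as a column of missing values.
df = pd.DataFrame(data, columns=['Age', 'Name', 'City'])
print(df)
```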


## Section 2: Inspecting a DataFrame

### 2.1 - `head()`

The `head()` method returns the first `n` rows of a DataFrame. By default, `n` is 5; if the DataFrame has fewer rows, all of them are returned.

**Example 1: Getting the first five rows**

In [11]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.head())

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 2: Getting the first three rows**

In [12]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.head(3))

   Name  Age
0   Tom   20
1  Nick   21
2  John   19


**Example 3: Using `head()` with a DataFrame resulting from a group operation**

In [13]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex', 'Tom'], 'Age': [20, 21, 19, 18, 22]}
df = pd.DataFrame(data)

print(df.groupby('Name').head(1))

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 4: Using `head()` with a DataFrame resulting from a sort operation**

In [14]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.sort_values('Age').head(2))

   Name  Age
3  Alex   18
2  John   19


**Example 5: Using `head()` with a DataFrame resulting from a filter operation**

In [15]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[df['Age'] > 19].head(2))

   Name  Age
0   Tom   20
1  Nick   21


### 2.2 - `tail()`

The `tail()` method returns the last `n` rows of a DataFrame. By default, `n` is 5; if the DataFrame has fewer rows, all of them are returned.

**Example 1: Getting the last five rows**

In [16]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex', 'Matt'], 'Age': [20, 21, 19, 18, 22]}
df = pd.DataFrame(data)

print(df.tail())

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18
4  Matt   22


**Example 2: Getting the last three rows**

In [17]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex', 'Matt'], 'Age': [20, 21, 19, 18, 22]}
df = pd.DataFrame(data)

print(df.tail(3))

   Name  Age
2  John   19
3  Alex   18
4  Matt   22
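
Both `head()` and `tail()` also accept a negative `n`, which means "everything except" rather than "only". A brief sketch, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'John', 'Alex', 'Matt'],
                   'Age': [20, 21, 19, 18, 22]})

all_but_last_two = df.head(-2)   # same rows as df.iloc[:-2]
all_but_first_two = df.tail(-2)  # same rows as df.iloc[2:]

print(all_but_last_two)
print(all_but_first_two)
```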


**Example 3: Using `tail()` with a DataFrame resulting from a group operation**

In [18]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex', 'Tom'], 'Age': [20, 21, 19, 18, 22]}
df = pd.DataFrame(data)

print(df.groupby('Name').tail(1))

   Name  Age
1  Nick   21
2  John   19
3  Alex   18
4   Tom   22


**Example 4: Using `tail()` with a DataFrame resulting from a sort operation**

In [19]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.sort_values('Age').tail(2))

   Name  Age
0   Tom   20
1  Nick   21


**Example 5: Using `tail()` with a DataFrame resulting from a filter operation**

In [20]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[df['Age'] > 19].tail(2))

   Name  Age
0   Tom   20
1  Nick   21


### 2.3 - `info()`

The `info()` method prints a concise summary of a DataFrame: the index range, the number of columns, each column's non-null count and dtype, and the approximate memory usage.

**Example 1: Getting information about a DataFrame**

In [21]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    4 non-null      object
 1   Age     4 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 192.0+ bytes
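
The trailing `+` in the memory usage line signals a lower bound: for `object` columns, pandas counts only the references and not the Python strings they point to. Passing `memory_usage='deep'` asks pandas to measure the actual contents. The sketch below also uses the `buf` argument to capture the report as a string; the names `buffer` and `report` are illustrative.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]})

# `buf` redirects the report to a text buffer instead of printing it directly.
buffer = io.StringIO()
df.info(buf=buffer, memory_usage='deep')
report = buffer.getvalue()
print(report)
```

The exact byte count depends on the pandas version and platform, so no expected figure is shown here.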


**Example 2: Getting information about a DataFrame resulting from a filter operation**

In [23]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df[df['Age'] > 19].info()

<class 'pandas.core.frame.DataFrame'>
Index: 2 entries, 0 to 1
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    2 non-null      object
 1   Age     2 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 48.0+ bytes


**Example 3: Getting information about a DataFrame with missing values**

In [24]:
import pandas as pd
import numpy as np

data = {'Name': ['Tom', 'Nick', 'John', np.nan], 'Age': [20, 21, np.nan, 18]}
df = pd.DataFrame(data)

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Name    3 non-null      object 
 1   Age     3 non-null      float64
dtypes: float64(1), object(1)
memory usage: 192.0+ bytes


**Example 4: Getting information about a large DataFrame**

In [25]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(1000, 50), columns=['Col' + str(i) for i in range(50)])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 50 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Col0    1000 non-null   float64
 1   Col1    1000 non-null   float64
 2   Col2    1000 non-null   float64
 3   Col3    1000 non-null   float64
 4   Col4    1000 non-null   float64
 5   Col5    1000 non-null   float64
 6   Col6    1000 non-null   float64
 7   Col7    1000 non-null   float64
 8   Col8    1000 non-null   float64
 9   Col9    1000 non-null   float64
 10  Col10   1000 non-null   float64
 11  Col11   1000 non-null   float64
 12  Col12   1000 non-null   float64
 13  Col13   1000 non-null   float64
 14  Col14   1000 non-null   float64
 15  Col15   1000 non-null   float64
 16  Col16   1000 non-null   float64
 17  Col17   1000 non-null   float64
 18  Col18   1000 non-null   float64
 19  Col19   1000 non-null   float64
 20  Col20   1000 non-null   float64
 21  Col21   1000 non-null   float64
[output truncated]

### 2.4 - `describe()`

The `describe()` function is used to generate descriptive statistics of a DataFrame or a Series. It provides a quick overview of the central tendencies, dispersion and shape of the dataset’s distribution.

**Example 1: Basic usage of `describe()`**

In [26]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df.describe()

|       | Age      |
|-------|----------|
| count | 4.0      |
| mean  | 19.5     |
| std   | 1.290994 |
| min   | 18.0     |
| 25%   | 18.75    |
| 50%   | 19.5     |
| 75%   | 20.25    |
| max   | 21.0     |
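
Each row of this summary corresponds to an ordinary Series method, so the values can be reproduced one at a time. The following sketch does that for the `Age` column; the dictionary name `summary` is illustrative.

```python
import pandas as pd

ages = pd.Series([20, 21, 19, 18], name='Age')

summary = {
    'count': ages.count(),       # number of non-missing values
    'mean': ages.mean(),
    'std': ages.std(),           # sample standard deviation (ddof=1)
    'min': ages.min(),
    '25%': ages.quantile(0.25),  # linear interpolation between sorted values
    '50%': ages.quantile(0.50),
    '75%': ages.quantile(0.75),
    'max': ages.max(),
}
print(summary)
```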


**Example 2: Using `describe()` on string data**

When `describe()` is used on a column of non-numeric data, it returns `count`, `unique` (the number of distinct values), `top` (the most frequent value), and `freq` (how often `top` occurs).

In [27]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df['Name'].describe()

count       4
unique      4
top       Tom
freq        1
Name: Name, dtype: object

**Example 3: Including all data in `describe()` summary**

By default, `describe()` provides a summary of only the numerical columns. If you want a summary of all columns, include the `include='all'` argument.

In [28]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df.describe(include='all')

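|        | Name | Age      |
|--------|------|----------|
| count  | 4    | 4.0      |
| unique | 4    | NaN      |
| top    | Tom  | NaN      |
| freq   | 1    | NaN      |
| mean   | NaN  | 19.5     |
| std    | NaN  | 1.290994 |
| min    | NaN  | 18.0     |
| 25%    | NaN  | 18.75    |
| 50%    | NaN  | 19.5     |
| 75%    | NaN  | 20.25    |
| max    | NaN  | 21.0     |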


**Example 4: Using `describe()` on a DataFrame with missing values**

If a DataFrame includes missing values, `describe()` will handle them as NaN and exclude them from the summary statistics.

In [29]:
import pandas as pd
import numpy as np

data = {'Name': ['Tom', 'Nick', 'John', np.nan], 'Age': [20, 21, np.nan, 18]}
df = pd.DataFrame(data)

df.describe()

|       | Age       |
|-------|-----------|
| count | 3.0       |
| mean  | 19.666667 |
| std   | 1.527525  |
| min   | 18.0      |
| 25%   | 19.0      |
| 50%   | 20.0      |
| 75%   | 20.5      |
| max   | 21.0      |


**Example 5: Changing percentiles in `describe()` summary**

By default, `describe()` provides the 25th, 50th, and 75th percentiles. You can specify your own percentiles using the `percentiles` argument.

In [30]:
import pandas as pd

data = {'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df.describe(percentiles=[.10, .20, .30, .40, .50, .60, .70, .80, .90])

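|       | Age      |
|-------|----------|
| count | 4.0      |
| mean  | 19.5     |
| std   | 1.290994 |
| min   | 18.0     |
| 10%   | 18.3     |
| 20%   | 18.6     |
| 30%   | 18.9     |
| 40%   | 19.2     |
| 50%   | 19.5     |
| 60%   | 19.8     |
| 70%   | 20.1     |
| 80%   | 20.4     |
| 90%   | 20.7     |
| max   | 21.0     |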
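
The percentile values follow from linear interpolation between the sorted values, which is the default in pandas. The helper `linear_percentile` below is hypothetical, written only to make the arithmetic visible; it is not a pandas function.

```python
import math

def linear_percentile(values, q):
    """Return the q-th quantile (0 <= q <= 1) using linear interpolation."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)         # fractional index into the sorted data
    lower = math.floor(position)
    upper = min(lower + 1, len(ordered) - 1)  # stay inside the list when q == 1
    fraction = position - lower
    return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

ages = [20, 21, 19, 18]
print(round(linear_percentile(ages, 0.10), 2))  # 10th percentile
print(round(linear_percentile(ages, 0.40), 2))  # 40th percentile
```

For the 10th percentile, the position is 0.1 × 3 = 0.3, so the result lies 30% of the way from 18 to 19, which gives 18.3.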


## Section 3: Indexing and Selecting Data in DataFrame

### 3.1 - Selecting rows

Rows in a DataFrame can be selected by passing a row label to the `loc` indexer.

**Example 1: Selecting a single row by row label**

In [31]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc['rank1'])

Name    Tom
Age      20
Name: rank1, dtype: object


**Example 2: Selecting rows by label range**

In [32]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc['rank1':'rank3'])

       Name  Age
rank1   Tom   20
rank2  Nick   21
rank3  John   19


**Example 3: Selecting rows by list of labels**

In [33]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc[['rank1', 'rank3']])

       Name  Age
rank1   Tom   20
rank3  John   19


**Example 4: Selecting rows by condition**

In [34]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.loc[df['Age'] > 19])

   Name  Age
0   Tom   20
1  Nick   21


**Example 5: Selecting rows by multiple conditions**

In [35]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.loc[(df['Age'] > 19) & (df['Name'] == 'Tom')])

  Name  Age
0  Tom   20


### 3.2 - Selecting columns

Columns in a DataFrame can be selected directly by name with `df['column']`, or by passing column labels to the `loc` indexer.

**Example 1: Selecting a single column by column label**

In [36]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df['Name'])

0     Tom
1    Nick
2    John
3    Alex
Name: Name, dtype: object


**Example 2: Selecting multiple columns by column label**

In [37]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[['Name', 'Age']])

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 3: Selecting columns by label range**

In [38]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18], 'Rank': [1, 2, 3, 4]}
df = pd.DataFrame(data)

print(df.loc[:, 'Name':'Age'])

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


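**Example 4: Filtering rows by a condition on a column**

A Boolean mask passed to `df[...]` filters rows, not columns: the rows whose `col1` value exceeds 0.5 are kept, with all of their columns. Because the data is random, the output differs from run to run.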

In [39]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

print(df[df['col1'] > 0.5])

       col1      col2      col3
3  0.898332 -0.439848 -0.427518


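**Example 5: Filtering rows by conditions on multiple columns**

As in the previous example, the mask selects rows. With random data it may match no rows at all, in which case pandas prints an empty DataFrame.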

In [40]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=['col1', 'col2', 'col3'])

print(df[(df['col1'] > 0.5) & (df['col2'] < 0.5)])

Empty DataFrame
Columns: [col1, col2, col3]
Index: []
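
The two examples above filter rows. To select columns by a condition, one option is to build a Boolean Series indexed by column name and pass it to `loc` on the column axis. The sketch below uses fixed data, so the result is reproducible; the name `positive_mean_columns` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [-7, -8, -9]})

# df.mean() returns one value per column. Comparing it to 0 gives a Boolean
# Series indexed by column name, which `loc` accepts on the column axis.
positive_mean_columns = df.loc[:, df.mean() > 0]
print(positive_mean_columns)
```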


### 3.3 - Using `loc`

`loc` is a label-based indexer: you pass the labels of the rows or columns you want to select. Unlike `iloc`, a `loc` slice includes its end label. Besides reading data, `loc` can also be used to modify it.

**Example 1: Selecting a single row by row label**

In [41]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc['rank1'])

Name    Tom
Age      20
Name: rank1, dtype: object


**Example 2: Selecting multiple rows by label range**

In [42]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc['rank1':'rank3'])

       Name  Age
rank1   Tom   20
rank2  Nick   21
rank3  John   19


**Example 3: Selecting rows and columns by label**

In [43]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc[['rank1', 'rank3'], ['Name', 'Age']])

       Name  Age
rank1   Tom   20
rank3  John   19


**Example 4: Modifying data using `loc`**

In [44]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

df.loc['rank1', 'Name'] = 'Jerry'

print(df)

        Name  Age
rank1  Jerry   20
rank2   Nick   21
rank3   John   19
rank4   Alex   18


**Example 5: Complex data selection using `loc`**

In [45]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18], 'Rank': [1, 2, 3, 4]}
df = pd.DataFrame(data, index=['rank1', 'rank2', 'rank3', 'rank4'])

print(df.loc[df['Age'] > 19, ['Name', 'Rank']])

       Name  Rank
rank1   Tom     1
rank2  Nick     2


### 3.4 - Using `iloc`

`iloc` is an integer-position-based indexer: you pass integer positions to select specific rows and columns. Unlike `loc`, an `iloc` slice excludes its end position.
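
The difference in slice endpoints is easiest to see on a DataFrame with the default integer index, where the same slice `0:2` means different things to the two indexers. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]})

by_label = df.loc[0:2]      # labels 0, 1 and 2: the end label is included
by_position = df.iloc[0:2]  # positions 0 and 1: the end position is excluded

print(len(by_label), len(by_position))
```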

**Example 1: Selecting a single row by integer index**

In [46]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.iloc[0])

Name    Tom
Age      20
Name: 0, dtype: object


**Example 2: Selecting multiple rows by integer range**

In [47]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.iloc[0:2])

   Name  Age
0   Tom   20
1  Nick   21


**Example 3: Selecting rows and columns by integer index**

In [49]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.iloc[0:2, 0:1])

   Name
0   Tom
1  Nick


**Example 4: Modifying data using `iloc`**

In [50]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

df.iloc[0, 0] = 'Jerry'

print(df)

    Name  Age
0  Jerry   20
1   Nick   21
2   John   19
3   Alex   18


**Example 5: Complex data selection using `iloc`**

In [51]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18], 'Rank': [1, 2, 3, 4]}
df = pd.DataFrame(data)

print(df.iloc[df['Age'].values > 19, [0, 2]])

   Name  Rank
0   Tom     1
1  Nick     2


### 3.5 - Boolean indexing

Boolean indexing selects the rows of a DataFrame for which a condition you specify is `True`.

**Example 1: Basic Boolean Indexing**

In [52]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[df['Age'] > 19])

   Name  Age
0   Tom   20
1  Nick   21


In this example, `df['Age'] > 19` is a Boolean condition that checks which rows have 'Age' greater than 19. The result is a series of True/False values, which is used to select the rows.

**Example 2: Using Multiple Conditions**

You can combine multiple conditions with the `&` (and) and `|` (or) operators. Wrap each condition in parentheses.

In [53]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[(df['Age'] > 19) & (df['Name'] == 'Tom')])

  Name  Age
0  Tom   20


In this example, rows where 'Age' is greater than 19 and 'Name' is 'Tom' are selected.
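
The `|` operator works the same way and keeps a row if either condition holds. A short sketch, with the illustrative name `either`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]})

# Each comparison is wrapped in parentheses because & and | bind more
# tightly than comparison operators such as > and ==.
either = df[(df['Age'] > 20) | (df['Name'] == 'Alex')]
print(either)
```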

**Example 3: Using `isin` Function for Filtering**

The `isin()` method keeps the rows whose value matches any entry in a given list.

In [54]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[df['Name'].isin(['Tom', 'John'])])

   Name  Age
0   Tom   20
2  John   19


In this example, rows where 'Name' is either 'Tom' or 'John' are selected.

**Example 4: Using `~` Operator for Filtering**

The `~` operator is used when you want to select the rows that do not satisfy the condition.

In [55]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df[~(df['Age'] < 20)])

   Name  Age
0   Tom   20
1  Nick   21


In this example, rows where 'Age' is not less than 20 are selected.
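
The `~` operator combines naturally with `isin()` to exclude a set of values. A brief sketch, with the illustrative name `others`:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]})

# isin() marks the rows to exclude; ~ inverts the mask so the rest are kept.
others = df[~df['Name'].isin(['Tom', 'John'])]
print(others)
```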

**Example 5: Using Boolean Indexing with `loc` Function**

The `loc` indexer also accepts a Boolean condition, optionally combined with a column selection.

In [56]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John', 'Alex'], 'Age': [20, 21, 19, 18]}
df = pd.DataFrame(data)

print(df.loc[df['Age'] > 19, 'Name'])

0     Tom
1    Nick
Name: Name, dtype: object


In this example, the 'Name' values of the rows where 'Age' is greater than 19 are selected.

## Section 4: Modifying a DataFrame

### 4.1 - Adding rows

You can add rows to a DataFrame using the `pd.concat()` function, which concatenates pandas objects along a particular axis (rows by default).

**Example 1: Adding a single row to a DataFrame**

In [60]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)

new_row = pd.DataFrame({'Name': ['Alex'], 'Age': [18]}, index=[3])
df = pd.concat([df, new_row])

print(df)

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 2: Adding multiple rows to a DataFrame**

In [61]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)

new_rows = pd.DataFrame({'Name': ['Alex', 'Matt'], 'Age': [18, 22]}, index=[3, 4])
df = pd.concat([df, new_rows])

print(df)

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18
4  Matt   22


**Example 3: Adding rows from another DataFrame**

In [62]:
import pandas as pd

data1 = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]}
df1 = pd.DataFrame(data1)

data2 = {'Name': ['Alex', 'Matt'], 'Age': [18, 22]}
df2 = pd.DataFrame(data2)

df1 = pd.concat([df1, df2])

print(df1)

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
0  Alex   18
1  Matt   22
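
Note that the combined DataFrame keeps each input's original labels, so the index contains 0 and 1 twice. If the original labels carry no meaning, passing `ignore_index=True` renumbers the rows. A minimal sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]})
df2 = pd.DataFrame({'Name': ['Alex', 'Matt'], 'Age': [18, 22]})

# ignore_index=True discards the original labels and numbers the rows 0..n-1.
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)
```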


**Example 4: Adding rows using a dictionary**

In [63]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)

new_row = pd.DataFrame([{'Name': 'Alex', 'Age': 18}], index=[3])  # one dictionary per row
df = pd.concat([df, new_row])

print(df)

   Name  Age
0   Tom   20
1  Nick   21
2  John   19
3  Alex   18


**Example 5: Adding rows with different columns**

In [64]:
import pandas as pd

data = {'Name': ['Tom', 'Nick', 'John'], 'Age': [20, 21, 19]}
df = pd.DataFrame(data)

new_row = pd.DataFrame({'Name': ['Alex'], 'Age': [18], 'Rank': [4]}, index=[3])
df = pd.concat([df, new_row])

print(df)

   Name  Age  Rank
0   Tom   20   NaN
1  Nick   21   NaN
2  John   19   NaN
3  Alex   18   4.0


In this example, the new row includes a new column 'Rank'. The value in this column for the existing rows is filled with `NaN`.

### 4.2 - Adding columns

You can add a column to a DataFrame by assigning values to a new column label, or add several columns at once with the `assign()` method.

**Example 1: Adding a Single Column**

In [65]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df['C'] = [5, 6]

print(df)

   A  B  C
0  1  3  5
1  2  4  6


**Example 2: Adding Multiple Columns**

In [66]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df = df.assign(C=[5, 6], D=[7, 8])

print(df)

   A  B  C  D
0  1  3  5  7
1  2  4  6  8


**Example 3: Adding a Column Based on Existing Columns**

In [67]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df['C'] = df['A'] + df['B']

print(df)

   A  B  C
0  1  3  4
1  2  4  6


**Example 4: Adding a Column Using a Function**

In [68]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df['C'] = df['A'].apply(lambda x: x ** 2)

print(df)

   A  B  C
0  1  3  1
1  2  4  4


**Example 5: Adding a Column Conditionally**

In [69]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df['C'] = df['A'].apply(lambda x: 'High' if x > 1 else 'Low')

print(df)

   A  B     C
0  1  3   Low
1  2  4  High


In this example, a new column `C` is added to the DataFrame. The values in this column are determined based on the values in column `A`. If the value in `A` is greater than 1, `C` is 'High'. Otherwise, `C` is 'Low'.
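
For a simple two-way condition like this, NumPy's `np.where` is a common vectorized alternative to `apply`. The sketch below produces the same column and is offered as one option, not a replacement for the example above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# np.where evaluates the condition for the whole column at once and picks
# 'High' where it is True and 'Low' where it is False.
df['C'] = np.where(df['A'] > 1, 'High', 'Low')
print(df)
```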

### 4.3 - Deleting rows

In pandas, you can delete one or more rows from a DataFrame using the `drop()` method. Its main argument is the index label, or list of labels, of the rows to delete.

**Example 1: Deleting a Single Row by Index**

In [70]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.drop(1)

print(df)

   A  B
0  1  4
2  3  6


**Example 2: Deleting Multiple Rows by Index**

In [71]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.drop([0, 2])

print(df)

   A  B
1  2  5


**Example 3: Deleting Rows Based on Condition**

In [72]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.drop(df[df['A'] < 3].index)

print(df)

   A  B
2  3  6
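
Dropping rows by condition is equivalent to keeping the rows that satisfy the opposite condition, which is often shorter to write. Because the surviving rows keep their old labels, `reset_index(drop=True)` can be used to renumber them. A sketch with the illustrative name `kept`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Keep the rows where A >= 3 (the complement of the rows dropped above),
# then renumber the index from 0 and discard the old labels.
kept = df[df['A'] >= 3].reset_index(drop=True)
print(kept)
```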


**Example 4: Deleting Rows with Missing Values**

In [73]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df = df.dropna()

print(df)

     A    B
0  1.0  4.0


**Example 5: Deleting Rows in Large DataFrames**

In [74]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(1000, 50), columns=['Col' + str(i) for i in range(50)])
data = data.drop(data[data['Col1'] < 0.5].index)

print(data)

         Col0      Col1      Col2      Col3      Col4      Col5      Col6  \
5    0.783560  0.922735 -1.514587  1.363061 -0.283216 -0.732328  0.120786   
9   -0.556428  1.234734 -1.382040 -0.892821 -0.364225 -1.094022 -0.480513   
10   0.198524  0.566077  0.741394 -0.469477  1.613670 -1.045715  1.438147   
13   0.989406  1.763224 -0.460602  0.358639  0.080909  0.838941 -0.065463   
15   0.519660  1.331177  0.058357  0.486299  0.311575  0.746012  0.072102   
..        ...       ...       ...       ...       ...       ...       ...   
972  2.334944  0.592943  0.721719 -0.081290 -1.278230  0.090313 -0.153647   
976  0.562071  0.556007  0.603591  0.166400 -0.673643  0.683556  1.010934   
977  1.599622  0.563105 -0.511225 -1.340288  2.379503  1.223057 -0.064570   
983  1.188907  1.293782  0.445693  0.524671 -0.726381  0.784087 -0.933569   
999  0.006981  2.089618  1.664403  2.026878 -0.236650 -0.881655  0.218536   

[output truncated]

### 4.4 - Deleting columns

Columns can be deleted from a DataFrame by calling the `drop()` method with `axis=1`.

**Example 1: Deleting a Single Column by Label**

In [75]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.drop('A', axis=1)

print(df)

   B  C
0  4  7
1  5  8
2  6  9


**Example 2: Deleting Multiple Columns by Label**

In [76]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.drop(['A', 'B'], axis=1)

print(df)

   C
0  7
1  8
2  9
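
The `drop()` method also accepts a `columns` keyword, which expresses the same operation without an `axis` argument. A short sketch, with the illustrative name `same_result`:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# columns=[...] is equivalent to passing the labels together with axis=1.
same_result = df.drop(columns=['A', 'B'])
print(same_result)
```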


**Example 3: Deleting Columns Based on Condition**

In [77]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df = df.drop(df.columns[df.apply(lambda col: col.sum() < 10)], axis=1)

print(df)

   B  C
0  4  7
1  5  8
2  6  9


**Example 4: Deleting Columns with Missing Values**

In [78]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6], 'C': [7, np.nan, 9]})
df = df.dropna(axis=1)

print(df)

   B
0  4
1  5
2  6


**Example 5: Deleting Columns in Large DataFrames**

In [79]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(1000, 50), columns=['Col' + str(i) for i in range(50)])
data = data.drop(data.columns[data.apply(lambda col: col.mean() < 0)], axis=1)

print(data)

         Col0      Col2      Col4      Col6      Col7     Col10     Col14  \
0   -0.398428  0.585864 -0.047487  0.194210  0.118103 -0.106484  0.835472   
1    0.913807 -0.463994 -0.435905  1.763789 -0.672060 -1.189803  0.098013   
2   -0.897861  0.182388  1.863928  0.132625 -0.602466  2.581540 -0.173454   
3    1.122628 -1.414626  0.103776  0.810942  0.791066  1.285727  0.436057   
4   -1.561162  1.238574  1.323453 -0.697975 -0.311414  0.357093  1.070905   
..        ...       ...       ...       ...       ...       ...       ...   
995 -0.024788 -0.521992 -0.590010  1.672357  1.754726  0.707214  0.607003   
996 -0.236039  0.340864  0.031249  0.305286  1.243318 -0.172771  0.111305   
997 -0.999715 -1.561974 -0.762783  0.280265  1.052498  0.297266 -1.164675   
998  0.812245  0.591240  1.116369 -0.872340  0.086809  1.107168  0.320752   
999  0.037886  1.614442  1.712559 -0.177169  0.854634 -1.344601  0.048011   

[output truncated]

### 4.5 - Renaming columns

Columns in a pandas DataFrame can be renamed with the `rename()` method or by assigning a new list of names to the DataFrame's `columns` attribute.

**Example 1: Renaming a Single Column**

In [80]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.rename(columns={'A': 'X'})

print(df)

   X  B
0  1  4
1  2  5
2  3  6


**Example 2: Renaming Multiple Columns**

In [81]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.rename(columns={'A': 'X', 'B': 'Y'})

print(df)

   X  Y
0  1  4
1  2  5
2  3  6
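
One thing to watch for: by default, `rename()` silently ignores mapping keys that match no column, so a typo goes unnoticed. Passing `errors='raise'` turns that into a `KeyError`. The sketch below uses a deliberately wrong key, `'Z'`, to show both behaviors.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# By default, a key that matches no column is ignored without any warning.
unchanged = df.rename(columns={'Z': 'X'})
print(list(unchanged.columns))

# errors='raise' turns the same mistake into a KeyError.
try:
    df.rename(columns={'Z': 'X'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```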


**Example 3: Renaming All Columns**

In [82]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df.columns = ['X', 'Y']

print(df)

   X  Y
0  1  4
1  2  5
2  3  6


**Example 4: Renaming Columns Using a Function**

In [83]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.rename(columns=str.lower)

print(df)

   a  b
0  1  4
1  2  5
2  3  6


**Example 5: Renaming Columns in a Large DataFrame**

In [84]:
import pandas as pd
import numpy as np

data = pd.DataFrame(np.random.randn(1000, 50), columns=['Col' + str(i) for i in range(50)])
data = data.rename(columns=lambda x: x.replace('Col', 'Column'))

print(data)

      Column0   Column1   Column2   Column3   Column4   Column5   Column6  \
0    1.708102  1.140212 -0.234168  0.468273 -0.878713 -0.717516 -0.560662   
1   -0.864273 -0.560668  0.275754  0.067009 -0.264682  0.245055 -0.385277   
2   -0.785442  0.582638 -0.298446  0.345213 -0.114231  0.288188 -0.781287   
3   -0.382307 -0.895223 -0.381456 -0.641675  0.629782  0.737742  0.685549   
4    0.667738  1.246094  0.950891  1.359284 -0.059760  0.069623 -0.325293   
..        ...       ...       ...       ...       ...       ...       ...   
995 -0.993203  1.536368 -1.132079 -1.200345 -0.533931 -0.723136 -0.544457   
996 -0.773440  0.401375  0.408672  0.927838  1.087591  0.972114 -1.122101   
997  2.129209  2.366911  1.047558 -2.405874  1.895077 -2.508289  0.979335   
998  1.276694 -1.645011  0.641682 -0.394741  0.396671 -0.372716 -0.183743   
999  0.104942 -0.449575 -0.086573  0.106359 -0.286234  0.053536  0.434288   

      Column7   Column8   Column9  ...  Column40  Column41  Column42  \
0  

## Section 5: Handling Missing Data

### 5.1 - `isnull()`

`isnull()` is a pandas method that detects missing (NaN) values. It returns a DataFrame or Series of boolean values indicating whether each element is missing.

**Example 1: Detecting Null Values in a Series**

In [85]:
import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, 4, 5])
print(s.isnull())

0    False
1    False
2     True
3    False
4    False
dtype: bool


**Example 2: Detecting Null Values in a DataFrame**

In [86]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df.isnull())

       A      B
0  False  False
1  False   True
2   True  False


**Example 3: Counting Null Values in a DataFrame**

In [87]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df.isnull().sum())

A    1
B    1
dtype: int64


**Example 4: Filtering Out Null Values**

In [88]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df[df['A'].notnull()])

     A    B
0  1.0  4.0
1  2.0  NaN


**Example 5: Replacing Null Values**

In [89]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df = df.fillna(0)
print(df)

     A    B
0  1.0  4.0
1  2.0  0.0
2  0.0  6.0


In this example, `fillna()` method replaces all the NaN values in the DataFrame with 0.
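
The boolean mask returned by `isnull()` can also be reduced along rows. As a small sketch using the same sample data, the code below selects every row that has at least one missing value and counts the missing values in the whole DataFrame; the variable names are illustrative only.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})

# True for each row that contains at least one missing value.
rows_with_missing = df.isnull().any(axis=1)
print(df[rows_with_missing])

# Sum the per-column counts to get the total number of missing values.
total_missing = int(df.isnull().sum().sum())
print(total_missing)
```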

### 5.2 - `notnull()`

`notnull()` is a pandas method that detects existing (non-missing) values. It is the counterpart of `isnull()` and returns a DataFrame or Series of boolean values indicating whether each element is present.

**Example 1: Detecting Non-Missing Values in a Series**

In [90]:
import pandas as pd
import numpy as np

s = pd.Series([1, 2, np.nan, 4, 5])
print(s.notnull())

0     True
1     True
2    False
3     True
4     True
dtype: bool


**Example 2: Detecting Non-Missing Values in a DataFrame**

In [91]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df.notnull())

       A      B
0   True   True
1   True  False
2  False   True


**Example 3: Counting Non-Missing Values in a DataFrame**

In [92]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df.notnull().sum())

A    2
B    2
dtype: int64


**Example 4: Filtering Out Missing Values**

In [93]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
print(df[df['A'].notnull()])

     A    B
0  1.0  4.0
1  2.0  NaN


**Example 5: Replacing Missing Values**

In [94]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df = df.fillna(0)
print(df)

     A    B
0  1.0  4.0
1  2.0  0.0
2  0.0  6.0


In this example, `fillna()` method replaces all the NaN values in the DataFrame with 0.
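
Counting non-missing values, as in Example 3, is common enough that pandas also offers `count()`. The following sketch, on the same sample data, checks that both approaches give the same per-column counts.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})

# count() returns the number of non-missing values in each column.
counts = df.count()
print(counts)

# Compare against summing the boolean mask from notnull().
same = bool((counts == df.notnull().sum()).all())
print(same)
```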

### 5.3 - `dropna()`

`dropna()` is a pandas method that removes missing values. It drops rows or columns that contain missing values, depending on the `axis`, `how`, `subset`, and `thresh` arguments.

**Example 1: Dropping Rows with Missing Values**

In [95]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan]})
df = df.dropna()

print(df)

     A    B
0  1.0  4.0


**Example 2: Dropping Columns with Missing Values**

In [96]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})
df = df.dropna(axis=1)

print(df)

   B
0  4
1  5
2  6


**Example 3: Dropping Rows with All Missing Values**

In [98]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan], 'B': [4, 1, np.nan]})
df = df.dropna(how='all')

print(df)

     A    B
0  1.0  4.0
1  NaN  1.0


**Example 4: Dropping Rows with Any Missing Values in Specific Columns**

In [99]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, np.nan, 6]})
df = df.dropna(subset=['A'])

print(df)

     A    B
0  1.0  4.0
1  2.0  NaN


**Example 5: Dropping Rows with Missing Values using a Threshold**

In [100]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, np.nan], 'B': [4, np.nan, 6, np.nan], 'C': [7, 8, np.nan, np.nan]})
df = df.dropna(thresh=2)

print(df)

     A    B    C
0  1.0  4.0  7.0
1  2.0  NaN  8.0


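In this example, `dropna(thresh=2)` keeps only the rows that have at least 2 non-NA values and removes the rest. Rows 2 and 3 have one and zero non-NA values respectively, so they are dropped.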
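
To make the effect of `thresh` easier to see, the sketch below counts the non-NA values in each row by hand and compares a manual filter with `dropna(thresh=2)`. The variable names are illustrative only.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'A': [1, 2, np.nan, np.nan],
    'B': [4, np.nan, 6, np.nan],
    'C': [7, 8, np.nan, np.nan]
})

# Number of non-NA values in each row.
non_na_per_row = df.notna().sum(axis=1)
print(non_na_per_row.tolist())

# Keeping rows with at least 2 non-NA values mirrors dropna(thresh=2).
kept = df[non_na_per_row >= 2]
same = kept.equals(df.dropna(thresh=2))
print(same)
```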

### 5.4 - `fillna()`

The `fillna()` function in pandas is used to fill NA/NaN values using the specified method.

**Example 1: Filling NaN Values with Zero**

In [101]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, 6]})
df = df.fillna(0)

print(df)

     A    B
0  1.0  4.0
1  0.0  0.0
2  3.0  6.0


**Example 2: Using Forward Fill Method to Fill NaN Values**

In [102]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, 6]})
df = df.ffill()  # forward fill; fillna(method='ffill') raises a FutureWarning in pandas 2.1+

print(df)

     A    B
0  1.0  4.0
1  1.0  4.0
2  3.0  6.0


**Example 3: Using Backward Fill Method to Fill NaN Values**

In [103]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, 6]})
df = df.bfill()  # backward fill; fillna(method='bfill') raises a FutureWarning in pandas 2.1+

print(df)

     A    B
0  1.0  4.0
1  3.0  6.0
2  3.0  6.0


**Example 4: Filling NaN Values by Mean of the Column**

In [104]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, np.nan, 6]})
df['A'] = df['A'].fillna(df['A'].mean())

print(df)

     A    B
0  1.0  4.0
1  2.0  NaN
2  3.0  6.0


**Example 5: Filling NaN Values by Interpolation**

In [105]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, 3, np.nan]})
df['A'] = df['A'].interpolate()

print(df)

     A
0  1.0
1  2.0
2  3.0
3  3.0
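
Row 3 in the output above is filled with 3.0 even though no later value exists. By default, `interpolate()` works in the forward direction and carries the last valid value into trailing NaNs. As a minimal sketch, assuming you would rather leave such trailing gaps empty, the `limit_area='inside'` argument restricts filling to NaNs that are surrounded by valid values.

```python
import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3, np.nan])

# Default behaviour: the trailing NaN is filled with the last valid value.
filled_default = s.interpolate()
print(filled_default.tolist())

# limit_area='inside' fills only gaps that lie between two valid values.
filled_inside = s.interpolate(limit_area='inside')
print(filled_inside.tolist())
```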


## Section 6: DataFrame Operations

### 6.1 - Mathematical operations on DataFrame

Pandas DataFrame allows us to perform various mathematical operations on the data. We can perform operations on an entire DataFrame, individual series, or between two series. Here are some examples:

**Example 1: Addition**

In [106]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['A'] + df['B']

print(df)

   A  B  C
0  1  4  5
1  2  5  7
2  3  6  9


**Example 2: Subtraction**

In [107]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df['C'] = df['B'] - df['A']

print(df)

   A  B  C
0  1  4  3
1  2  5  3
2  3  6  3


**Example 3: Addition Between DataFrames**

In [108]:
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [7, 8, 9], 'B': [10, 11, 12]})
df3 = df1 + df2

print(df3)

    A   B
0   8  14
1  10  16
2  12  18


**Example 4: Division by a Scalar**

In [109]:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df / 2

print(df)

     A    B
0  0.5  2.0
1  1.0  2.5
2  1.5  3.0


**Example 5: Applying a Function**

In [110]:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.apply(np.square)

print(df)

   A   B
0  1  16
1  4  25
2  9  36


In this example, `apply()` passes each column of the DataFrame to `np.square`, which squares the values element-wise. The result is a DataFrame in which every element has been squared.
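
For simple arithmetic such as squaring, the same result can be obtained without `apply()` by using an operator directly on the DataFrame. The sketch below compares the two approaches on the same sample data.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Square every element by applying np.square to each column.
squared_apply = df.apply(np.square)

# Square every element with the power operator.
squared_operator = df ** 2

# Check that both approaches produce the same values.
same = bool((squared_apply == squared_operator).all().all())
print(same)
```

Operator-based arithmetic is usually the simpler choice when a built-in operation already does what you need.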

### 6.2 - Applying functions to a DataFrame

Pandas lets us transform the data in a DataFrame by applying custom functions or built-in Python functions to each element, each column, or each row.

**Example 1: Applying a Function to Each Element**

In [111]:
import pandas as pd

def square(x):
    return x**2

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.map(square)  # DataFrame.map replaces applymap, which is deprecated since pandas 2.1

print(df)

   A   B
0  1  16
1  4  25
2  9  36


**Example 2: Applying a Function to Each Column**

In [112]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
df = df.apply(sum, axis=0)

print(df)

A     6
B    15
C    24
dtype: int64


**Example 3: Applying a Function to Each Row**

In [113]:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2, 3],
    'B': [4, 5, 6],
    'C': [7, 8, 9]
})
df = df.apply(sum, axis=1)

print(df)

0    12
1    15
2    18
dtype: int64


**Example 4: Applying a Function Conditionally**

In [114]:
import pandas as pd

def check(x):
    return 'High' if x > 5 else 'Low'

df = pd.DataFrame({'A': [1, 2, 8], 'B': [4, 6, 9]})
df['A'] = df['A'].apply(check)

print(df)

      A  B
0   Low  4
1   Low  6
2  High  9


**Example 5: Applying a Function That Returns Multiple Values**

In [115]:
import pandas as pd

def calculate(x):
    return pd.Series([x.min(), x.max(), x.mean()], index=['min', 'max', 'mean'])

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df = df.apply(calculate)

print(df)

        A    B
min   1.0  4.0
max   3.0  6.0
mean  2.0  5.0


In this example, the `calculate` function is applied to each column in the DataFrame and it returns a Series with multiple values. The result is a DataFrame where each column represents the results of the applied function for the corresponding column in the original DataFrame.
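
When the values you want are standard statistics, a similar table can be produced without writing a custom function. As a sketch, `agg()` accepts a list of aggregation names and returns one row per statistic.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# One row per aggregation name, one column per original column.
summary = df.agg(['min', 'max', 'mean'])
print(summary)
```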

### 6.3 - Grouping and aggregating data

Grouping and aggregating data in a DataFrame is a fundamental task in data analysis. Grouping splits the rows into groups that share a key, and aggregating reduces each group to a single summary row.

**Example 1: Grouping Data by a Single Column**

In [116]:
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby('A')

print(grouped.sum())

               B   C    D
A                        
bar  onethreetwo  12  120
foo    onetwotwo   9   90
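
In the output above, column `B` holds joined strings such as `onethreetwo`, because summing a text column concatenates its values. If only the numeric totals are wanted, one option is to pass `numeric_only=True`; another is to select the numeric columns before aggregating. A sketch using the same data:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

# Skip non-numeric columns when summing each group.
totals = df.groupby('A').sum(numeric_only=True)
print(totals)

# Alternatively, pick the columns to aggregate explicitly.
selected = df.groupby('A')[['C', 'D']].sum()
print(selected)
```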


**Example 2: Grouping Data by Multiple Columns**

In [117]:
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby(['A', 'B'])

print(grouped.mean())

             C     D
A   B               
bar one    2.0  20.0
    three  4.0  40.0
    two    6.0  60.0
foo one    1.0  10.0
    two    4.0  40.0


**Example 3: Grouping with a Function**

In [118]:
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby(lambda x: x % 2 == 0)

print(grouped.sum())

               A            B   C    D
False  barbarbar  onethreetwo  12  120
True   foofoofoo    onetwotwo   9   90


**Example 4: Grouping by Index Levels**

In [119]:
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

df = df.set_index(['A', 'B'])

grouped = df.groupby(level='A')  # group by the 'A' level of the MultiIndex

print(grouped.sum())

      C    D
A           
bar  12  120
foo   9   90


**Example 5: Aggregating Data**

In [120]:
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

grouped = df.groupby('A')

print(grouped.agg({
    'C': ['sum', 'min', 'max', 'mean'],
    'D': ['mean']
}))

      C                  D
    sum min max mean  mean
A                         
bar  12   2   6  4.0  40.0
foo   9   1   5  3.0  30.0


In this example, the `agg()` function is used to apply different aggregations to different columns in the DataFrame.
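
The dictionary form above produces two-level column headers. As an alternative sketch, named aggregation gives each result a flat column name of your choosing; the names `c_total` and `d_mean` below are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
    'B': ['one', 'one', 'two', 'three', 'two', 'two'],
    'C': [1, 2, 3, 4, 5, 6],
    'D': [10, 20, 30, 40, 50, 60]
})

# Each keyword becomes an output column: name=(source column, aggregation).
summary = df.groupby('A').agg(
    c_total=('C', 'sum'),
    d_mean=('D', 'mean')
)
print(summary)
```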

## Challenge

Create a `DataFrameManipulator` class that has the following methods:

- An `add_column` method that takes a DataFrame, column name, and data (list or array) and adds the data as a new column to the DataFrame.
- A `rename_column` method that takes a DataFrame, old column name, and new column name and renames the specified column.
- A `drop_column` method that takes a DataFrame and a column name and removes the specified column from the DataFrame.

### Output Format

- The `add_column`, `rename_column` and `drop_column` methods must return the modified DataFrame.

### Explanation

Consider the following code:

```python
import pandas as pd

# Create a DataFrame
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Create a DataFrameManipulator object
manipulator = DataFrameManipulator()

# Add a column
df = manipulator.add_column(df, 'C', [7, 8, 9])
print(df)

# Rename a column
df = manipulator.rename_column(df, 'A', 'X')
print(df)

# Drop a column
df = manipulator.drop_column(df, 'B')
print(df)

```

When executed with a properly implemented `DataFrameManipulator` class, this code should print:

```
   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

```

```
   X  B  C
0  1  4  7
1  2  5  8
2  3  6  9

```

```
   X  C
0  1  7
1  2  8
2  3  9

```

In [None]:
### WRITE YOUR CODE BELOW THIS LINE ###


### WRITE YOUR CODE ABOVE THIS LINE ###