# Data Analysis with Pandas: Core Concepts 4

This guide builds on the foundational concepts of Pandas to explore more advanced data manipulation techniques. We will cover how to filter, sort, and group data, as well as how to combine multiple DataFrames through merging and concatenation. Finally, we'll look at how to add and remove columns and rows. All examples use a consistent dataset to demonstrate practical applications of these concepts.

In [1]:
import os
os.getcwd()   # returns the current working directory; takes no arguments
os.listdir()  # lists the files in the current directory when called without arguments
# Optional: switch to the data folder by passing its full (absolute) path
# os.chdir('C:\\Users\\mjayant\\Documents\\NIT\\31_day_11_08_2025\\')

In [16]:
os.listdir()

['.ipynb_checkpoints',
 'Data_Analysis_with_Pandas_Core_Concepts_04.ipynb',
 'Data_Analysis_with_Pandas_Core_Concepts_04.pdf',
 'new_transactions.csv',
 'sales_transactions_50.csv',
 'staff_details.csv']

## Introduction to Pandas Data Manipulation

#### Concept Introduction: Data Manipulation

In real-world scenarios, raw data is rarely ready for analysis. Data manipulation involves a series of steps to clean, transform, and structure data to make it suitable for a specific task. Pandas provides a rich set of tools to perform these operations efficiently and intuitively.

### 1. DataFrame - Filtering

#### Concept Introduction: Filtering Data

Filtering is the process of selecting a subset of data from a DataFrame based on specific conditions. This is one of the most common and powerful operations in data analysis, allowing you to focus on the information that is most relevant to your task.

#### Example 1: Filtering with a Single Condition

Let's find all the sales transactions where the product_type is 'Frozen'. We use a boolean mask to achieve this.

In [13]:
import pandas as pd

# Load the sales data
# df = pd.read_csv('absolutePath') ## 'C:\\Users\\mjayant\\Documents\\NIT\\31_day_11_08_2025\\'
df = pd.read_csv('sales_transactions_50.csv')

# Create a boolean mask where the product_type is 'Frozen'
is_frozen_mask = df['product_type'] == 'Frozen'


In [15]:
frozen_items = df[is_frozen_mask]
len(frozen_items)

28

In [6]:
frozen_items.head(6)

    transaction_id product_type     flavor  quantity  price_per_unit_inr staff_id
0              101       Frozen  Chocolate         3              114.41   EMP002
4              105       Frozen      Kulfi         4              130.38   EMP001
5              106       Frozen        Tea         5               96.49   EMP004
6              107       Frozen     Coffee         3              115.39   EMP004
10             111       Frozen     Coffee         4              104.81   EMP004
13             114       Frozen    Vanilla         3              187.66   EMP003


In [7]:
import pandas as pd

# Load the sales data
df = pd.read_csv('sales_transactions_50.csv')

# Create a boolean mask where the product_type is 'Frozen'
is_frozen_mask = df['product_type'] == 'Frozen'

# Use the mask to filter the DataFrame and display the first few results
frozen_items = df[is_frozen_mask]

print("Transactions for 'Frozen' items:")
print(frozen_items.head())
print("...")
print(f"Total number of 'Frozen' transactions: {len(frozen_items)}")

Transactions for 'Frozen' items:
    transaction_id product_type     flavor  quantity  price_per_unit_inr  \
0              101       Frozen  Chocolate         3              114.41   
4              105       Frozen      Kulfi         4              130.38   
5              106       Frozen        Tea         5               96.49   
6              107       Frozen     Coffee         3              115.39   
10             111       Frozen     Coffee         4              104.81   

   staff_id  
0    EMP002  
4    EMP001  
5    EMP004  
6    EMP004  
10   EMP004  
...
Total number of 'Frozen' transactions: 28


This code first creates a Series of True/False values (`is_frozen_mask`) by checking if each row's `product_type` is 'Frozen'. Passing this mask to the DataFrame `df[...]` returns only the rows where the condition is `True`.
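To make the mask mechanics concrete, here is a minimal sketch on a small hypothetical DataFrame (the `toy` data is invented for illustration; only the column names mirror the sales file). It shows the boolean Series itself, the filtered result, and a quick way to count matches.

```python
import pandas as pd

# Hypothetical mini dataset; not taken from sales_transactions_50.csv
toy = pd.DataFrame({
    "product_type": ["Frozen", "Beverage", "Frozen", "Dessert"],
    "quantity": [3, 1, 5, 2],
})

# Comparing a column to a value yields a boolean Series (the mask)
mask = toy["product_type"] == "Frozen"
print(mask.tolist())   # [True, False, True, False]

# Indexing with the mask keeps only the rows marked True
frozen_only = toy[mask]
print(frozen_only)

# Summing a boolean mask counts the matching rows
print(mask.sum())      # 2
```

Note that the filtered rows keep their original index labels (0 and 2 here), which is why the filtered outputs in this guide show gaps in the index.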

#### Example 2: Filtering with Multiple Conditions

Next, we combine multiple conditions using logical operators (`&` for AND, `|` for OR, `~` for NOT). We start with an OR filter that keeps 'Beverage' items or any transaction with a quantity of exactly 5, and then find all sales of 'Dessert' items with a quantity of more than 3.

In [8]:
# Filtering for 'Beverage' items or a quantity equal to 5
beverage_high_qty_items = df[(df['product_type'] == 'Beverage') | (df['quantity'] == 5)]

print("Beverage items or quantity == 5:")
print(beverage_high_qty_items)

Beverage items or quantity == 5:
    transaction_id product_type     flavor  quantity  price_per_unit_inr  \
1              102     Beverage  Chocolate         1              106.83   
5              106       Frozen        Tea         5               96.49   
7              108     Beverage      Pista         1              114.69   
9              110      Dessert      Pista         5              161.23   
11             112     Beverage        Tea         5              189.77   
12             113     Beverage        Tea         5              116.68   
14             115       Frozen        Tea         5              131.10   
16             117       Frozen        Tea         5              124.51   
22             123       Frozen    Vanilla         5              149.27   
24             125       Frozen      Mango         5              126.97   
33             134     Beverage  Chocolate         1              140.82   
34             135     Beverage    Vanilla         1     

In [9]:
# Filtering for 'Dessert' items with a quantity greater than 3
dessert_high_qty_items = df[(df['product_type'] == 'Dessert') & (df['quantity'] > 3)]

print("Dessert items with quantity > 3:")
print(dessert_high_qty_items)

Dessert items with quantity > 3:
    transaction_id product_type     flavor  quantity  price_per_unit_inr  \
8              109      Dessert  Chocolate         4              159.63   
9              110      Dessert      Pista         5              161.23   
25             126      Dessert      Mango         4              114.22   
27             128      Dessert      Pista         4              137.03   
28             129      Dessert      Kulfi         4              188.34   
45             146      Dessert      Pista         5              173.77   

   staff_id  
8    EMP002  
9    EMP001  
25   EMP004  
27   EMP001  
28   EMP003  
45   EMP001  


By enclosing each condition in parentheses and joining them with `&`, we create a combined boolean mask. The filtered DataFrame contains only the rows that satisfy both conditions simultaneously. The parentheses are required because `&` and `|` bind more tightly than comparison operators such as `==` and `>`, so without them Python would try to evaluate the `&` before the comparisons.
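The three operators can be compared side by side in a short sketch. The `toy` DataFrame below is hypothetical and exists only to keep the example self-contained; storing each condition in a named variable is one way to keep combined filters readable.

```python
import pandas as pd

# Hypothetical mini dataset for illustration
toy = pd.DataFrame({
    "product_type": ["Frozen", "Dessert", "Dessert", "Beverage"],
    "quantity": [5, 4, 2, 1],
})

# Named masks avoid repeating the parenthesised conditions
is_dessert = toy["product_type"] == "Dessert"
high_qty = toy["quantity"] > 3

both = toy[is_dessert & high_qty]      # AND: Dessert rows with quantity > 3
either = toy[is_dessert | high_qty]    # OR: Dessert rows, or any row with quantity > 3
not_dessert = toy[~is_dessert]         # NOT: everything except Dessert

print(both.index.tolist())         # [1]
print(either.index.tolist())       # [0, 1, 2]
print(not_dessert.index.tolist())  # [0, 3]
```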

#### Real-time Usage

A business manager might need to quickly identify all the sales of a particular product line to analyze its performance. For instance, they could filter for all transactions of a 'Frozen' item to see which flavors are most popular or if any are selling poorly. This helps in making informed decisions about inventory and marketing strategies.

### 2. DataFrame - Sorting

#### Concept Introduction: Sorting Data

Sorting a DataFrame by its values is essential for organizing data and making it easier to read and analyze. The `sort_values()` method allows you to sort the DataFrame based on one or more columns, in either ascending or descending order.

#### Example 1: Sorting by a Single Column

Let's sort our sales transactions by quantity in ascending order. Because the smallest values come first, `head()` would show the items sold least, while `tail()` (used below) shows the rows with the highest quantities.

In [10]:
# Load the sales data
df = pd.read_csv('sales_transactions_50.csv')

# Sort the DataFrame by the 'quantity' column in ascending order (default)
df_sorted_by_quantity = df.sort_values(by='quantity')

print("DataFrame sorted by quantity (ascending, last 5 records):")
print(df_sorted_by_quantity.tail())

DataFrame sorted by quantity (ascending, last 5 records):
    transaction_id product_type  flavor  quantity  price_per_unit_inr staff_id
9              110      Dessert   Pista         5              161.23   EMP001
12             113     Beverage     Tea         5              116.68   EMP005
14             115       Frozen     Tea         5              131.10   EMP001
45             146      Dessert   Pista         5              173.77   EMP001
40             141       Frozen  Coffee         5              124.87   EMP001


The `sort_values()` method, when given a column name, returns a new DataFrame sorted by that column. The default order is ascending, which means the lowest values appear first.
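A small sketch, using a hypothetical three-row DataFrame, illustrates both sort directions and confirms that the original DataFrame is left untouched.

```python
import pandas as pd

# Hypothetical mini dataset for illustration
toy = pd.DataFrame({"flavor": ["Tea", "Kulfi", "Mango"], "quantity": [5, 1, 3]})

asc = toy.sort_values(by="quantity")                    # ascending is the default
desc = toy.sort_values(by="quantity", ascending=False)  # largest values first

print(asc["flavor"].tolist())   # ['Kulfi', 'Mango', 'Tea']
print(desc["flavor"].tolist())  # ['Tea', 'Mango', 'Kulfi']

# The original order is unchanged because sort_values returns a new DataFrame
print(toy["flavor"].tolist())   # ['Tea', 'Kulfi', 'Mango']

# Sorted rows carry their original index labels along with them
print(asc.index.tolist())       # [1, 2, 0]
```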

#### Example 2: Sorting by Multiple Columns (Descending)

To find the most expensive items that were sold in high quantities, we can sort by `price_per_unit_inr` and then by `quantity`, both in descending order.

In [11]:
# Sort by 'price_per_unit_inr' and then by 'quantity', both in descending order
df_sorted_multi = df.sort_values(by=['price_per_unit_inr', 'quantity'], ascending=[False, False])

print("DataFrame sorted by price (descending) and then quantity (descending, first 5 records):")
print(df_sorted_multi.head())

DataFrame sorted by price (descending) and then quantity (descending, first 5 records):
    transaction_id product_type   flavor  quantity  price_per_unit_inr  \
11             112     Beverage      Tea         5              189.77   
28             129      Dessert    Kulfi         4              188.34   
13             114       Frozen  Vanilla         3              187.66   
42             143       Frozen    Mango         3              184.05   
35             136     Beverage    Kulfi         1              175.84   

   staff_id  
11   EMP005  
28   EMP003  
13   EMP003  
42   EMP004  
35   EMP005  


By passing a list of column names to the `by` parameter and a list of boolean values to the `ascending` parameter, we can define a multi-level sort. The data is first sorted by the first column in the list, and then any rows with equal values are sorted by the second column, and so on.

Typical uses of this price-then-quantity descending sort:

- Prioritizing high-ticket items for sales campaigns.
- Identifying high-price, high-stock products for warehouse optimization.

In [12]:
import pandas as pd

data = {
    'product_id': [101, 102, 103, 104, 105, 106],
    'product_name': ['Vanilla', 'Mango', 'Kesar Pista', 'Chocolate', 'Strawberry','ChocoChips'],
    'price_per_unit_inr': [45000, 800, 1200, 15000, 2500,2500],
    'quantity': [10, 50, 30, 5, 20,30]
}

df = pd.DataFrame(data)

# Multi-column sort: price (descending) → quantity (descending)
df_sorted = df.sort_values( by=['price_per_unit_inr', 'quantity'], ascending=[False, False])

print("Top 5 products (sorted by price high -> low, then quantity high -> low):")
print(df_sorted.head())

Top 5 products (sorted by price high -> low, then quantity high -> low):
   product_id product_name  price_per_unit_inr  quantity
0         101      Vanilla               45000        10
3         104    Chocolate               15000         5
5         106   ChocoChips                2500        30
4         105   Strawberry                2500        20
2         103  Kesar Pista                1200        30


- Sort by Price (High→Low)[desc] → Quantity (Low→High)[asc] <br>
**Prioritize expensive items with low stock for inventory clearance**

In [13]:
df_sorted = df.sort_values(
    by=['price_per_unit_inr', 'quantity'], 
    ascending=[False, True]  # Price DESC, Quantity ASC
)

print("Top 5 products (sorted by price high -> low, then quantity low -> high):")
print(df_sorted.head())

Top 5 products (sorted by price high -> low, then quantity low -> high):
   product_id product_name  price_per_unit_inr  quantity
0         101      Vanilla               45000        10
3         104    Chocolate               15000         5
4         105   Strawberry                2500        20
5         106   ChocoChips                2500        30
2         103  Kesar Pista                1200        30


- Sort by Quantity (High→Low)[desc] → Price (High→Low)[desc] <br>
**Identify high-demand products with premium pricing**

In [20]:
df_sorted = df.sort_values(
    by=['quantity', 'price_per_unit_inr'], 
    ascending=[False, False]  # Quantity DESC, Price DESC
)
print("Top 5 high-demand products with premium pricing (sorted by quantity high -> low, then price high -> low):")

print(df_sorted.head())

Top 5 high-demand products with premium pricing (sorted by quantity high -> low, then price high -> low):
   product_id product_name  price_per_unit_inr  quantity
1         102        Mango                 800        50
5         106   ChocoChips                2500        30
2         103  Kesar Pista                1200        30
4         105   Strawberry                2500        20
0         101      Vanilla               45000        10


- Sort by Price (Low→High)[asc] → Quantity (High→Low)[desc] <br>
**Find budget-friendly products with high availability**

In [21]:
df_sorted = df.sort_values(
    by=['price_per_unit_inr', 'quantity'], 
    ascending=[True, False]  # Price ASC, Quantity DESC
)

print("Top 5 budget-friendly products with high availability (sorted by price low -> high,  quantity high -> low):")

print(df_sorted.head())

Top 5 budget-friendly products with high availability (sorted by price low -> high,  quantity high -> low):
   product_id product_name  price_per_unit_inr  quantity
1         102        Mango                 800        50
2         103  Kesar Pista                1200        30
5         106   ChocoChips                2500        30
4         105   Strawberry                2500        20
3         104    Chocolate               15000         5


#### Real-time Usage

A data analyst might sort the data to generate a sales leaderboard for staff or to identify the most expensive products sold in a day. Sorting helps in quickly identifying patterns, outliers, and trends, which is a key part of any exploratory data analysis process.

### 3. DataFrame - GroupBy

#### Concept Introduction: Grouping Data

The `groupby()` method is one of the most powerful features of Pandas. It allows you to split a DataFrame into groups based on some criteria, apply a function (like `sum()`, `mean()`, or `count()`) to each group, and then combine the results. This is similar to the GROUP BY clause in SQL.

#### Example 1: Grouping by a Single Column and Aggregating

Let's find the total quantity of each product flavor sold. We'll group the DataFrame by the `flavor` column and then calculate the sum of the `quantity` for each group.

In [1]:
import pandas as pd

# Load the sales data
df = pd.read_csv('sales_transactions_50.csv')

# Group by 'flavor' and calculate the sum of 'quantity'
flavor_sales = df.groupby('flavor')['quantity'].sum()

print("Total quantity sold per flavor:")
print(flavor_sales)

Total quantity sold per flavor:
flavor
Brownie        2
Chocolate      9
Coffee        31
Kulfi         21
Mango         20
Pista         18
Strawberry     1
Tea           31
Vanilla       22
Name: quantity, dtype: int64


First, we group the DataFrame by `flavor`. Then, we select the `quantity` column and apply the `sum()` aggregation function. This returns a new Series with the unique flavors as the index and their corresponding total quantities as the values.
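The same split, apply, and combine steps can be traced on a hypothetical five-row DataFrame where the totals are easy to verify by hand. Sorting the resulting Series is a common follow-up for finding the best seller.

```python
import pandas as pd

# Hypothetical mini dataset for illustration
toy = pd.DataFrame({
    "flavor": ["Tea", "Kulfi", "Tea", "Mango", "Kulfi"],
    "quantity": [5, 4, 2, 3, 1],
})

# Split by flavor, select the quantity column, and sum within each group
totals = toy.groupby("flavor")["quantity"].sum()
print(totals.to_dict())   # {'Kulfi': 5, 'Mango': 3, 'Tea': 7}

# The result is a Series indexed by flavor, so it can be sorted like any other
top = totals.sort_values(ascending=False)
print(top.index[0])       # Tea
```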

#### Example 2: Grouping by Multiple Columns and Applying Multiple Aggregations

To get a more detailed view, let's group by both `staff_id` and `product_type`, and then find the total `quantity` sold and the average `price_per_unit_inr` for each group. We'll use the `.agg()` method for multiple aggregations.

In [3]:
# Group by 'staff_id' and 'product_type', then aggregate
staff_sales_summary = df.groupby(['staff_id', 'product_type']).agg(
    total_quantity=('quantity', 'sum'),
    average_price=('price_per_unit_inr', 'mean')
)

print("Sales summary by staff and product type:")
print(staff_sales_summary)

Sales summary by staff and product type:
                       total_quantity  average_price
staff_id product_type                               
EMP001   Beverage                   1     114.690000
         Dessert                   17     153.584000
         Frozen                    25     121.501250
EMP002   Beverage                   1     140.820000
         Dessert                    6     152.720000
         Frozen                     9     129.770000
EMP003   Beverage                   5     132.056667
         Dessert                   12     154.052500
         Frozen                    20     157.964000
EMP004   Dessert                    6     113.050000
         Frozen                    34     135.663000
EMP005   Beverage                  11     160.763333
         Dessert                    3     171.640000
         Frozen                     5     121.880000


The `groupby()` method is used with a list of columns. The `.agg()` method is called here with keyword arguments (named aggregation): each keyword becomes a new column name, and each value is a tuple of the original column and the aggregation function to apply. This creates a powerful summary table with a multi-index.
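A multi-index summary can be awkward to filter or export. The sketch below, on a hypothetical four-row DataFrame, shows one way to look up a single group with a tuple key and how `reset_index()` turns the index levels back into ordinary columns.

```python
import pandas as pd

# Hypothetical mini dataset for illustration
toy = pd.DataFrame({
    "staff_id": ["EMP001", "EMP001", "EMP002", "EMP002"],
    "product_type": ["Frozen", "Frozen", "Frozen", "Dessert"],
    "quantity": [3, 5, 2, 4],
    "price_per_unit_inr": [100.0, 120.0, 90.0, 150.0],
})

# Named aggregation: keyword = new column, value = (source column, function)
summary = toy.groupby(["staff_id", "product_type"]).agg(
    total_quantity=("quantity", "sum"),
    average_price=("price_per_unit_inr", "mean"),
)

# A tuple selects one (staff_id, product_type) combination from the multi-index
print(summary.loc[("EMP001", "Frozen"), "total_quantity"])  # 8

# reset_index() converts the index levels into regular columns
flat = summary.reset_index()
print(flat.columns.tolist())
# ['staff_id', 'product_type', 'total_quantity', 'average_price']
```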

#### Real-time Usage

Grouping data is crucial for generating business intelligence reports. A manager might use this to calculate total revenue per staff member, average sales per hour, or to find the most popular product types on a monthly basis. This aggregated data provides the key insights needed to measure performance and make strategic decisions.

### 4. Merging or Joining

#### Concept Introduction: Merging DataFrames

Merging or joining is the process of combining two or more DataFrames based on a common column or index. This is a fundamental operation when your data is spread across multiple tables or files, a common scenario in real-world databases. Pandas provides the `merge()` function for this purpose.

#### Example 1: Merging DataFrames with an Inner Join

Let's combine our `sales_transactions_50` and `staff_details` DataFrames to see which employee made each transaction. We will join them on the common column, `staff_id`.

In [7]:
df_sales = pd.read_csv('sales_transactions_50.csv')  # load the sales data before inspecting it
df_sales['staff_id'].value_counts()

staff_id
EMP001    14
EMP004    12
EMP003    12
EMP002     6
EMP005     6
Name: count, dtype: int64

In [8]:
import pandas as pd

# Load the two DataFrames
df_sales = pd.read_csv('sales_transactions_50.csv')
df_staff = pd.read_csv('staff_details.csv')

# Perform an inner merge on the 'staff_id' column
merged_df = pd.merge(df_sales, df_staff, on='staff_id', how='inner')

print("Merged DataFrame (sales with staff names, first 5 records):")
print(merged_df.head())

print(len(df_sales))
print(len(df_staff))
print(len(merged_df))

Merged DataFrame (sales with staff names, first 5 records):
   transaction_id product_type     flavor  quantity  price_per_unit_inr  \
0             101       Frozen  Chocolate         3              114.41   
1             102     Beverage  Chocolate         1              106.83   
2             103      Dessert    Brownie         2              111.88   
3             104      Dessert     Coffee         2              145.81   
4             105       Frozen      Kulfi         4              130.38   

  staff_id staff_name joining_date shift_type  
0   EMP002      Priya   2023-02-20    Evening  
1   EMP003      Rohan   2023-03-10    Morning  
2   EMP004      Sneha   2023-04-01    Evening  
3   EMP002      Priya   2023-02-20    Evening  
4   EMP001      Aarav   2023-01-15    Morning  
50
5
50


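The `pd.merge()` function is used here. We specify the two DataFrames to merge, the column to join on (`on='staff_id'`), and the type of join (`how='inner'`, which is also the default). An inner join only keeps rows where the `staff_id` exists in both DataFrames. The three lengths printed at the end (50, 5, 50) confirm that every sales row found a matching staff record, so no transactions were dropped.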
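Because every `staff_id` in the sales file has a match, the example above never shows an inner join discarding rows. The sketch below uses two small hypothetical DataFrames, including an invented `EMP999` id with no staff record, to show that effect.

```python
import pandas as pd

# Hypothetical data: EMP999 is an invented id with no matching staff record
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "staff_id": ["EMP001", "EMP002", "EMP999"],
})
staff = pd.DataFrame({
    "staff_id": ["EMP001", "EMP002", "EMP003"],
    "staff_name": ["Aarav", "Priya", "Rohan"],
})

# Inner join: only staff_id values present in BOTH frames survive
inner = pd.merge(sales, staff, on="staff_id", how="inner")
print(inner)

# Transaction 3 (EMP999) and staff member EMP003 are both absent from the result
print(len(sales), len(inner))   # 3 2
```

Comparing row counts before and after a merge, as done here, is a quick sanity check for silently dropped records.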

#### Example 2: Merging with a Left Join

What if we wanted to keep all sales transactions, even if a `staff_id` in the sales data didn't have a match in the staff details? A left join would be appropriate. In this case, since all `staff_id`s in our sales data have a match, the output will be the same as the inner join, but the principle is important.

In [10]:
# Perform a left merge on the 'staff_id' column
left_merged_df = pd.merge(df_sales, df_staff, on='staff_id', how='left')

print("Left merged DataFrame:")
print(left_merged_df.head())

Left merged DataFrame:
   transaction_id product_type     flavor  quantity  price_per_unit_inr  \
0             101       Frozen  Chocolate         3              114.41   
1             102     Beverage  Chocolate         1              106.83   
2             103      Dessert    Brownie         2              111.88   
3             104      Dessert     Coffee         2              145.81   
4             105       Frozen      Kulfi         4              130.38   

  staff_id staff_name joining_date shift_type  
0   EMP002      Priya   2023-02-20    Evening  
1   EMP003      Rohan   2023-03-10    Morning  
2   EMP004      Sneha   2023-04-01    Evening  
3   EMP002      Priya   2023-02-20    Evening  
4   EMP001      Aarav   2023-01-15    Morning  


A left merge (`how='left'`) returns all rows from the "left" DataFrame (`df_sales`) and the matching rows from the "right" DataFrame (`df_staff`). If there's no match, the columns from the right DataFrame are filled with `NaN` for that row. This is useful for preserving all records from a primary dataset.
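The `NaN` fill is easier to see with data that actually contains an unmatched key. This sketch reuses the idea of a hypothetical `EMP999` id and adds `indicator=True`, a `pd.merge()` option that records where each row came from in a `_merge` column.

```python
import pandas as pd

# Hypothetical data: EMP999 is an invented id with no matching staff record
sales = pd.DataFrame({
    "transaction_id": [1, 2, 3],
    "staff_id": ["EMP001", "EMP002", "EMP999"],
})
staff = pd.DataFrame({
    "staff_id": ["EMP001", "EMP002", "EMP003"],
    "staff_name": ["Aarav", "Priya", "Rohan"],
})

# Left join keeps every sales row; indicator=True adds a '_merge' column
left = pd.merge(sales, staff, on="staff_id", how="left", indicator=True)
print(left)

# The unmatched row has NaN in the columns that come from the right frame
print(left["staff_name"].isna().tolist())      # [False, False, True]
print(left["_merge"].astype(str).tolist())     # ['both', 'both', 'left_only']
```

Filtering on `_merge == 'left_only'` is one way to list the sales rows that lack a staff record.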

#### Pros and Cons of Merging

* Pros:
  * Versatility: Allows for powerful combinations of data using different join types (inner, left, right, outer).
  * Clarity: The `on` parameter makes the join condition explicit and easy to understand.
* Cons:
  * Key Dependency: Requires a common key or column between the DataFrames to be effective.
  * Memory Usage: Creating a new, combined DataFrame can be memory-intensive with very large datasets.

#### Real-time Usage

Merging is essential for getting a complete picture of a business's operations. A sales manager might merge sales transactions with staff details to analyze sales performance per employee. They could also merge sales data with a separate product catalog to see how many units of a specific item were sold, regardless of the flavor.

### 5. DataFrame - Concat

#### Concept Introduction: Concatenating DataFrames

Concatenation is the process of stacking DataFrames on top of each other (vertically) or side by side (horizontally). It is used when you have multiple DataFrames with the same or similar structure that you want to combine into a single, larger DataFrame. The `pd.concat()` function is used for this.

#### Example 1: Concatenating Vertically

Let's add the new sales transactions from `new_transactions.csv` to our main `sales_transactions_50.csv` DataFrame. We will stack the new data below the old data.

In [None]:
# Conceptual sketch of vertical stacking (one DataFrame placed below the other):
# df1 --> col1, col2, col3 (50 rows)
# df2 --> col1, col2, col3 (10 rows)
# df3 = pd.concat([df1, df2])   # the DataFrames are passed together as a list
# df3 --> col1, col2, col3 (60 rows)

In [62]:
import pandas as pd

# Load the two DataFrames
df_sales = pd.read_csv('sales_transactions_50.csv')
df_new_sales = pd.read_csv('new_transactions.csv')

# Vertically concatenate the two DataFrames
combined_sales = pd.concat([df_sales, df_new_sales], ignore_index=True)

print("Original sales DataFrame tail:")
print(df_sales.tail(3))
print("\nNew transactions DataFrame:")
print(df_new_sales)
print("\nCombined DataFrame after concatenation (tail):")
print(combined_sales.tail(10))

Original sales DataFrame tail:
    transaction_id product_type   flavor  quantity  price_per_unit_inr  \
47             148      Dessert  Vanilla         3              163.52   
48             149      Dessert    Pista         2              110.91   
49             150       Frozen    Kulfi         4              144.14   

   staff_id  
47   EMP003  
48   EMP003  
49   EMP005  

New transactions DataFrame:
   transaction_id product_type      flavor  quantity  price_per_unit_inr  \
0             151       Frozen       Kulfi         2               180.0   
1             152      Dessert        Cake         1               100.0   
2             153       Frozen  Strawberry         3               140.0   
3             154     Beverage    Iced Tea         2               110.0   
4             155      Dessert     Brownie         1               120.0   

  staff_id  
0   EMP001  
1   EMP002  
2   EMP004  
3   EMP003  
4   EMP005  

Combined DataFrame after concatenation (tail):
    

The `pd.concat()` function takes a list of DataFrames to combine. By default, it concatenates vertically along the row axis (`axis=0`). We use `ignore_index=True` to reset the index of the final DataFrame, creating a clean, continuous index from 0 upwards.
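The effect of `ignore_index=True` is clearest when both variants are compared. The sketch below uses two tiny hypothetical DataFrames rather than the CSV files.

```python
import pandas as pd

# Hypothetical stand-ins for the existing and the new transactions
old = pd.DataFrame({"transaction_id": [101, 102], "quantity": [3, 1]})
new = pd.DataFrame({"transaction_id": [151], "quantity": [2]})

kept = pd.concat([old, new])                       # original index labels are kept
reset = pd.concat([old, new], ignore_index=True)   # index is rebuilt from 0

print(kept.index.tolist())    # [0, 1, 0]
print(reset.index.tolist())   # [0, 1, 2]

# With the labels kept, label 0 now refers to two different rows
print(kept.index.is_unique)   # False
```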

#### Pros and Cons of Concatenation

* Pros:
  * Simplicity: It's a straightforward way to stack or join DataFrames with similar structures.
  * Performance: Generally faster than other join methods when simply stacking data.
* Cons:
  * Schema Dependency: Works best when the DataFrames have the same columns and data types.
  * Index Duplication: Without `ignore_index=True`, the final DataFrame can have duplicate index values, which can cause issues.
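The schema dependency can be illustrated with a short sketch. The two DataFrames below are hypothetical, and the `discount` column is invented purely to create a mismatch; with the default settings, `pd.concat()` keeps the union of the columns and fills the gaps with `NaN`.

```python
import pandas as pd

# Hypothetical frames whose columns only partly overlap
a = pd.DataFrame({"transaction_id": [101], "quantity": [3]})
b = pd.DataFrame({"transaction_id": [151], "discount": [0.1]})  # 'discount' is invented

stacked = pd.concat([a, b], ignore_index=True)
print(stacked)

# Each frame is missing one column, so each of those columns has one NaN
print(stacked.isna().sum().to_dict())
# {'transaction_id': 0, 'quantity': 1, 'discount': 1}

# The integer 'quantity' column becomes float because NaN is a float value
print(stacked["quantity"].dtype)   # float64
```

Checking `isna().sum()` after a concatenation is a cheap way to detect files whose columns have drifted apart.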

#### Real-time Usage

Concatenation is often used to combine daily, weekly, or monthly sales reports into a single, comprehensive dataset for long-term analysis. For example, a business might receive a new CSV file of transactions every day, and a script could use `pd.concat()` to append this new data to a master sales file.

### 6. DataFrame - Adding, Dropping Columns & Rows

#### Concept Introduction: Modifying DataFrame Structure

As part of data cleaning and preparation, you will often need to add, remove, or modify columns and rows in a DataFrame. These operations are essential for feature engineering, data normalization, and preparing a dataset for a specific analysis task.

#### Example 1: Adding a New Column

Let's add a new column, `total_price_inr`, to our sales DataFrame. This new column will be the result of a simple calculation: `quantity` multiplied by `price_per_unit_inr`.

In [3]:
import pandas as pd
df = pd.read_csv('sales_transactions_50.csv')

# Add a new column by multiplying existing columns
df['total_price_inr'] = df['quantity'] * df['price_per_unit_inr']

print("DataFrame with the new 'total_price_inr' column (first 5 records):")
print(df.head())

DataFrame with the new 'total_price_inr' column (first 5 records):
   transaction_id product_type     flavor  quantity  price_per_unit_inr  \
0             101       Frozen  Chocolate         3              114.41   
1             102     Beverage  Chocolate         1              106.83   
2             103      Dessert    Brownie         2              111.88   
3             104      Dessert     Coffee         2              145.81   
4             105       Frozen      Kulfi         4              130.38   

  staff_id  total_price_inr  
0   EMP002           343.23  
1   EMP003           106.83  
2   EMP004           223.76  
3   EMP002           291.62  
4   EMP001           521.52  


Adding a new column is as simple as assigning a new `Series` of values to a new column name. This new column is created and populated based on the calculation of the two existing columns. This is a very common task for feature creation.
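Direct assignment modifies the DataFrame in place. As a complementary sketch, `DataFrame.assign()` returns a new DataFrame with the extra column and leaves the original as it was. The `toy` data reuses two price values from the sales file, and the `is_bulk` flag and its threshold are invented for illustration.

```python
import pandas as pd

# Small illustrative frame; 'is_bulk' below is a hypothetical derived feature
toy = pd.DataFrame({"quantity": [3, 1], "price_per_unit_inr": [114.41, 106.83]})

# In-place: the new column is added to 'toy' itself
toy["total_price_inr"] = toy["quantity"] * toy["price_per_unit_inr"]
print(toy["total_price_inr"].round(2).tolist())

# assign() returns a copy with the new column; 'toy' is not modified
with_flag = toy.assign(is_bulk=toy["quantity"] >= 3)
print(with_flag["is_bulk"].tolist())    # [True, False]
print("is_bulk" in toy.columns)         # False
```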

#### Example 2: Dropping Columns

Sometimes, columns are no longer needed. We can use the `drop()` method to remove one or more columns from a DataFrame. Let's drop the `product_type` column.

In [5]:
# Drop the 'product_type' column
df_dropped_col = df.drop(columns='product_type')

print("DataFrame after dropping the 'product_type' column (first 5 records):")
print(df_dropped_col.head())

DataFrame after dropping the 'product_type' column (first 5 records):
   transaction_id     flavor  quantity  price_per_unit_inr staff_id  \
0             101  Chocolate         3              114.41   EMP002   
1             102  Chocolate         1              106.83   EMP003   
2             103    Brownie         2              111.88   EMP004   
3             104     Coffee         2              145.81   EMP002   
4             105      Kulfi         4              130.38   EMP001   

   total_price_inr  
0           343.23  
1           106.83  
2           223.76  
3           291.62  
4           521.52  


The `drop()` method returns a new DataFrame with the specified column(s) removed, leaving the original unchanged. You must either pass `axis=1` or use the `columns=` keyword to tell Pandas you are dropping a column and not a row.
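The two spellings can be checked against each other in a brief sketch on a hypothetical DataFrame. It also shows `errors="ignore"`, which lets `drop()` skip labels that do not exist instead of raising an error.

```python
import pandas as pd

# Hypothetical mini dataset for illustration
toy = pd.DataFrame({
    "transaction_id": [101, 102],
    "product_type": ["Frozen", "Beverage"],
    "quantity": [3, 1],
})

slim = toy.drop(columns=["product_type"])   # keyword form
same = toy.drop("product_type", axis=1)     # equivalent axis form

print(slim.columns.tolist())   # ['transaction_id', 'quantity']
print(slim.equals(same))       # True
print(toy.columns.tolist())    # original still has all three columns

# A missing label normally raises KeyError; errors="ignore" skips it instead
safe = toy.drop(columns=["not_a_column"], errors="ignore")
print(safe.equals(toy))        # True
```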

#### Example 3: Dropping Rows

We can also drop rows based on their index. Let's drop the first and the last rows of our DataFrame, which correspond to index labels 0 and 49.

In [7]:
df.head()

   transaction_id product_type     flavor  quantity  price_per_unit_inr staff_id  total_price_inr
0             101       Frozen  Chocolate         3              114.41   EMP002           343.23
1             102     Beverage  Chocolate         1              106.83   EMP003           106.83
2             103      Dessert    Brownie         2              111.88   EMP004           223.76
3             104      Dessert     Coffee         2              145.81   EMP002           291.62
4             105       Frozen      Kulfi         4              130.38   EMP001           521.52


In [11]:
# Drop rows at index 0 and 49
df_dropped_rows = df.drop(index=[0, 49])

print("DataFrame after dropping the first and last rows (head and tail):")
print(df_dropped_rows.head(2))
print("...")
print(df_dropped_rows.tail(2))

DataFrame after dropping the first and last rows (head and tail):
   transaction_id product_type     flavor  quantity  price_per_unit_inr  \
1             102     Beverage  Chocolate         1              106.83   
2             103      Dessert    Brownie         2              111.88   

  staff_id  total_price_inr  
1   EMP003           106.83  
2   EMP004           223.76  
...
    transaction_id product_type   flavor  quantity  price_per_unit_inr  \
47             148      Dessert  Vanilla         3              163.52   
48             149      Dessert    Pista         2              110.91   

   staff_id  total_price_inr  
47   EMP003           490.56  
48   EMP003           221.82  


To drop rows, we use the `index` parameter and pass a list of index labels. This method is useful when you want to remove specific records from the dataset, for example, removing a transaction that was found to be fraudulent or duplicated.
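In practice the rows to remove are often identified by their content rather than by a known label. The sketch below, on a hypothetical DataFrame containing one repeated record, compares dropping by label with `drop_duplicates()` and with a boolean filter.

```python
import pandas as pd

# Hypothetical data: the rows at labels 1 and 2 are identical
toy = pd.DataFrame({
    "transaction_id": [101, 102, 102, 103],
    "quantity": [3, 1, 1, 2],
})

by_label = toy.drop(index=[0, 3])    # remove rows by index label
deduped = toy.drop_duplicates()      # keep the first copy of each repeated row
kept = toy[toy["quantity"] > 1]      # keep only rows that meet a condition

print(by_label.index.tolist())   # [1, 2]
print(deduped.index.tolist())    # [0, 1, 3]
print(kept.index.tolist())       # [0, 3]

# All three return new DataFrames; the original still has four rows
print(len(toy))                  # 4
```

After removing rows, `reset_index(drop=True)` can be used if a continuous index is needed again.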

#### Real-time Usage

Data modification is a daily task for an analyst. They might add a new column for 'Profit_Margin' to a sales DataFrame to perform a new analysis. They might also drop a column like 'transaction_id' before training a machine learning model, as it is unlikely to have predictive value. Dropping rows is useful for removing outliers or duplicate entries from a dataset.