# Pandas Fundamentals I - Part 2: Basic DataFrame Operations

## Week 2, Day 2 (Thursday) - April 17th, 2025

### Overview
This is the second part of our introduction to Pandas, focusing on basic operations for exploring and manipulating DataFrames. We'll learn the equivalent Pandas operations for common SQL tasks.

### Learning Objectives
- Inspect and understand DataFrame structure
- Access and manipulate DataFrame elements
- Handle missing data
- Perform basic column operations

### Prerequisites
- Pandas Fundamentals I - Part 1

In [2]:
# Import libraries
import pandas as pd
import numpy as np

# Create a sample DataFrame to work with
data = {
    'product_id': ['P001', 'P002', 'P003', 'P004', 'P005'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones', 'Monitor'],
    'category': ['Electronics', 'Electronics', 'Electronics', 'Accessories', 'Electronics'],
    'price': [1200, 800, 450, 150, 300],
    'stock_quantity': [15, 25, 0, 30, 10],
    'rating': [4.5, 4.8, 4.2, 4.6, np.nan]  # Note the missing value (np.nan)
}

products_df = pd.DataFrame(data)
print(products_df)

  product_id product_name     category  price  stock_quantity  rating
0       P001       Laptop  Electronics   1200              15     4.5
1       P002   Smartphone  Electronics    800              25     4.8
2       P003       Tablet  Electronics    450               0     4.2
3       P004   Headphones  Accessories    150              30     4.6
4       P005      Monitor  Electronics    300              10     NaN


## 1. DataFrame Inspection

Before diving into data analysis, it's important to inspect and understand your data. Pandas provides several methods for this purpose.

In [2]:
# View the first n rows (default is 5)
print("First 3 rows:")
print(products_df.head(3))

# View the last n rows (default is 5)
print("\nLast 2 rows:")
print(products_df.tail(2))

First 3 rows:
  product_id product_name     category  price  stock_quantity  rating
0       P001       Laptop  Electronics   1200              15     4.5
1       P002   Smartphone  Electronics    800              25     4.8
2       P003       Tablet  Electronics    450               0     4.2

Last 2 rows:
  product_id product_name     category  price  stock_quantity  rating
3       P004   Headphones  Accessories    150              30     4.6
4       P005      Monitor  Electronics    300              10     NaN


In [3]:
# Get basic information about the DataFrame
print("DataFrame info:")
products_df.info()

DataFrame info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 6 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   product_id      5 non-null      object 
 1   product_name    5 non-null      object 
 2   category        5 non-null      object 
 3   price           5 non-null      int64  
 4   stock_quantity  5 non-null      int64  
 5   rating          4 non-null      float64
dtypes: float64(1), int64(2), object(3)
memory usage: 372.0+ bytes


The `info()` method provides key information about the DataFrame:
- The number of rows and columns
- The column names and data types
- The non-null count for each column (here `rating` has 4 of 5, so one value is missing)
- Memory usage

This is similar to getting table schema information in SQL.

In [4]:
# Get summary statistics for numeric columns
print("Summary statistics:")
print(products_df.describe())

Summary statistics:
             price  stock_quantity  rating
count     5.000000        5.000000   4.000
mean    580.000000       16.000000   4.525
std     422.196637       11.937336   0.250
min     150.000000        0.000000   4.200
25%     300.000000       10.000000   4.425
50%     450.000000       15.000000   4.550
75%     800.000000       25.000000   4.650
max    1200.000000       30.000000   4.800


The `describe()` method provides descriptive statistics for numeric columns:
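- count: number of non-missing values (4 for `rating`, because one value is `NaN`)
- mean, std: the average and the sample standard deviation, measuring central tendency and dispersion
- min, 25%, 50% (median), 75%, max: the minimum, the quartiles, and the maximum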

This is similar to aggregate functions in SQL like COUNT(), AVG(), MIN(), MAX(), etc.
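
To make the SQL comparison concrete, the sketch below computes a few of the same aggregates directly with `agg()`. The `demo_prices` Series is a stand-in that repeats the `price` values from the sample data; the name is only for illustration.

```python
import pandas as pd

# Stand-in for products_df['price'] (same values as the sample data)
demo_prices = pd.Series([1200, 800, 450, 150, 300], name='price')

# Roughly: SELECT COUNT(price), AVG(price), MIN(price), MAX(price) FROM products;
price_summary = demo_prices.agg(['count', 'mean', 'min', 'max'])
print(price_summary)
```

The result is a Series indexed by the aggregate names, so a single value can be read with, for example, `price_summary['mean']`.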

In [5]:
# Get descriptive statistics for categorical columns
print("Category counts:")
print(products_df['category'].value_counts())

# You can also use describe() with include/exclude parameters
print("\nDescribe categorical columns:")
print(products_df.describe(include=['object']))  # 'object' is pandas' string type

Category counts:
category
Electronics    4
Accessories    1
Name: count, dtype: int64

Describe categorical columns:
       product_id product_name     category
count           5            5            5
unique          5            5            2
top          P001       Laptop  Electronics
freq            1            1            4


### Additional inspection methods

In [6]:
# Column names
print("Column names:", products_df.columns.tolist())

# Row index
print("\nRow index:", products_df.index.tolist())

# DataFrame dimensions (rows, columns)
print("\nDataFrame shape:", products_df.shape)

# Unique values in a column
print("\nUnique categories:", products_df['category'].unique())

Column names: ['product_id', 'product_name', 'category', 'price', 'stock_quantity', 'rating']

Row index: [0, 1, 2, 3, 4]

DataFrame shape: (5, 6)

Unique categories: ['Electronics' 'Accessories']


## 2. Accessing Columns and Rows

Now let's look at different ways to access data within a DataFrame.

In [7]:
# Accessing a single column (returns a Series)
prices = products_df['price']
print("Prices:\n", prices)
print("Type:", type(prices))

Prices:
 0    1200
1     800
2     450
3     150
4     300
Name: price, dtype: int64
Type: <class 'pandas.core.series.Series'>


In [8]:
# Alternative column access using dot notation
# Note: This only works for column names that could be valid Python variable names
# and don't conflict with DataFrame method names
prices_alt = products_df.price
print("Prices (dot notation):\n", prices_alt)

Prices (dot notation):
 0    1200
1     800
2     450
3     150
4     300
Name: price, dtype: int64


In [3]:
# Accessing multiple columns (returns a DataFrame)
product_info = products_df[['product_name', 'price', 'rating']]
print("Product info:\n", product_info)
print("Type:", type(product_info))

Product info:
   product_name  price  rating
0       Laptop   1200     4.5
1   Smartphone    800     4.8
2       Tablet    450     4.2
3   Headphones    150     4.6
4      Monitor    300     NaN
Type: <class 'pandas.core.frame.DataFrame'>


### Accessing rows

In [4]:
# Accessing a row by position using iloc
# iloc uses integer-based indexing [row, column]
first_row = products_df.iloc[0]
print("First row:\n", first_row)
print("Type:", type(first_row))

First row:
 product_id               P001
product_name           Laptop
category          Electronics
price                    1200
stock_quantity             15
rating                    4.5
Name: 0, dtype: object
Type: <class 'pandas.core.series.Series'>


In [5]:
# Accessing multiple rows with iloc
first_three_rows = products_df.iloc[0:3]
print("First three rows:\n", first_three_rows)

First three rows:
   product_id product_name     category  price  stock_quantity  rating
0       P001       Laptop  Electronics   1200              15     4.5
1       P002   Smartphone  Electronics    800              25     4.8
2       P003       Tablet  Electronics    450               0     4.2


In [6]:
# Accessing specific rows and columns with iloc
# iloc[row_selector, column_selector]
subset = products_df.iloc[1:4, [0, 1, 3]]  # Rows 1-3, columns 0, 1, and 3
print("Subset of rows and columns:\n", subset)

Subset of rows and columns:
   product_id product_name  price
1       P002   Smartphone    800
2       P003       Tablet    450
3       P004   Headphones    150


In [7]:
# Accessing rows and columns by label using loc
# loc uses label-based indexing [row_label, column_label]
# Since our index is numeric (0-4), it looks similar to iloc in this case
second_row = products_df.loc[1, ['product_id', 'product_name', 'price']]
print("Second row selected fields:\n", second_row)

Second row selected fields:
 product_id            P002
product_name    Smartphone
price                  800
Name: 1, dtype: object
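
One difference between the two indexers is easy to miss when the index is numeric: a slice in `iloc` excludes its end position, while a slice in `loc` includes its end label. The sketch below illustrates this on a small stand-in frame (`demo_df` is a name made up for the example).

```python
import pandas as pd

# Small stand-in frame with the default integer index (0, 1, 2, 3)
demo_df = pd.DataFrame({
    'product_name': ['Laptop', 'Smartphone', 'Tablet', 'Headphones'],
    'price': [1200, 800, 450, 150],
})

# iloc slices by position and excludes the end position: rows 0 and 1
by_position = demo_df.iloc[0:2]

# loc slices by label and includes the end label: rows 0, 1, and 2
by_label = demo_df.loc[0:2]

print(len(by_position), len(by_label))
```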


### Label-based indexing with custom index

Let's set the `product_id` as the index to see how label-based indexing works:

In [None]:
# Set product_id as index
products_indexed = products_df.set_index('product_id')
print(products_indexed)

In [None]:
# Now we can access rows by product_id
laptop_row = products_indexed.loc['P001']
print("Laptop details:\n", laptop_row)

# Get specific fields for specific products
headphones_info = products_indexed.loc['P004', ['price', 'stock_quantity']]
print("\nHeadphones price and stock:\n", headphones_info)

### SQL equivalent for row and column selection

In SQL, selecting specific columns and rows would look like:

```sql
SELECT product_name, price, rating
FROM products
WHERE product_id = 'P001';
```

In Pandas, this is equivalent to:

In [None]:
# SQL to Pandas translation
result = products_df.loc[products_df['product_id'] == 'P001', ['product_name', 'price', 'rating']]
print(result)
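
As an alternative spelling of the same filter, `query()` accepts the condition as a string, which some readers find closer to a SQL `WHERE` clause. The sketch below compares both forms on a small stand-in frame (`demo_df` is illustrative, not the lesson's DataFrame).

```python
import pandas as pd

demo_df = pd.DataFrame({
    'product_id': ['P001', 'P002', 'P003'],
    'product_name': ['Laptop', 'Smartphone', 'Tablet'],
    'price': [1200, 800, 450],
    'rating': [4.5, 4.8, 4.2],
})

wanted_columns = ['product_name', 'price', 'rating']

# Boolean mask inside loc: the mask acts as WHERE, the column list as SELECT
with_loc = demo_df.loc[demo_df['product_id'] == 'P001', wanted_columns]

# query() takes the filter as a string expression
with_query = demo_df.query("product_id == 'P001'")[wanted_columns]

print(with_loc.equals(with_query))
```

Both forms return a DataFrame, even when only one row matches.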

## 3. Basic Data Types and Conversions

Let's look at data types in Pandas and how to convert between them.

In [4]:
# Check data types
print(products_df.dtypes)

product_id         object
product_name       object
category           object
price               int64
stock_quantity      int64
rating            float64
dtype: object


Common pandas data types include:
- `object`: String or mixed types (similar to VARCHAR in SQL)
- `int64`: Integer (similar to INT in SQL)
- `float64`: Floating-point (similar to FLOAT or DOUBLE in SQL)
- `bool`: Boolean (similar to BOOLEAN in SQL)
- `datetime64`: Date and time (similar to DATE, DATETIME in SQL)
- `category`: Categorical data (similar to ENUM in SQL, but more powerful)

Let's convert some columns to different types:

In [9]:
# Convert category to categorical type (more memory efficient for repeated values)
products_df['category'] = products_df['category'].astype('category')

# Create a date column and convert to datetime
products_df['last_updated'] = ['2025-01-15', '2025-01-20', '2025-01-10', '2025-01-25', '2025-01-18']
products_df['last_updated'] = pd.to_datetime(products_df['last_updated'])

# Check data types again
print(products_df.dtypes)

# Display the DataFrame with the new column
print("\nUpdated DataFrame:")
print(products_df)

product_id                object
product_name              object
category                category
price                      int64
stock_quantity             int64
rating                   float64
last_updated      datetime64[ns]
dtype: object

Updated DataFrame:
  product_id product_name     category  price  stock_quantity  rating  \
0       P001       Laptop  Electronics   1200              15     4.5   
1       P002   Smartphone  Electronics    800              25     4.8   
2       P003       Tablet  Electronics    450               0     4.2   
3       P004   Headphones  Accessories    150              30     4.6   
4       P005      Monitor  Electronics    300              10     NaN   

  last_updated  
0   2025-01-15  
1   2025-01-20  
2   2025-01-10  
3   2025-01-25  
4   2025-01-18  
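
Conversions are less tidy when the raw data contains values that cannot be parsed. The sketch below assumes a hypothetical column of price strings with one invalid entry and uses `pd.to_numeric` with `errors='coerce'`, which replaces unparseable values with `NaN` instead of raising an error.

```python
import pandas as pd

# Hypothetical raw column where one entry is not a valid number
raw_prices = pd.Series(['1200', '800', 'n/a', '150'])

# errors='coerce' turns unparseable entries into NaN instead of raising
clean_prices = pd.to_numeric(raw_prices, errors='coerce')

print(clean_prices.dtype)         # float64, because NaN forces a float column
print(clean_prices.isna().sum())  # 1
```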


## 4. Handling Missing Data

Missing data is a common issue in real-world datasets. In Pandas, missing values in numeric columns are represented by `NaN` (Not a Number); `None` and `NaT` (for datetimes) are also treated as missing by methods such as `isna()`. Let's see how to detect and handle missing values.

In [16]:
# Check for missing values
print("Missing values per column:")
print(products_df.isna().sum())

# Check if any value in a row is missing
print("\nRows with any missing value:")
print(products_df[products_df.isna().any(axis=1)])

Missing values per column:
product_id        0
product_name      0
category          0
price             0
stock_quantity    0
rating            1
last_updated      0
dtype: int64

Rows with any missing value:
  product_id product_name     category  price  stock_quantity  rating  \
4       P005      Monitor  Electronics    300              10     NaN   

  last_updated  
4   2025-01-18  


### Handling missing values

There are several ways to handle missing values:

In [None]:
# 1. Remove rows with missing values
print("DataFrame after dropping rows with NaN:")
print(products_df.dropna())

# Note: The above operation doesn't modify the original DataFrame unless inplace=True
# Check that our original DataFrame still has the missing value
print("\nOriginal DataFrame (unchanged):")
print(products_df)

In [None]:
# 2. Fill missing values
# With a constant value
print("DataFrame after filling NaN with 0:")
print(products_df.fillna(0))

# With column-specific values
print("\nDataFrame after filling NaN with column-specific values:")
print(products_df.fillna({'rating': 3.0}))

# With the mean of the column
mean_rating = products_df['rating'].mean()
print(f"\nMean rating: {mean_rating:.2f}")
print("DataFrame after filling NaN with column mean:")
print(products_df.fillna({'rating': mean_rating}))
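
Dropping every row that has any missing value can discard more data than intended. A narrower option is the `subset` parameter of `dropna()`, sketched below on a made-up frame (`demo_df` and its `notes` column are assumptions for the example).

```python
import numpy as np
import pandas as pd

demo_df = pd.DataFrame({
    'product_name': ['Laptop', 'Monitor', 'Cable'],
    'rating': [4.5, np.nan, 4.0],
    'notes': [None, 'new', 'clearance'],
})

# Default: drop a row if ANY column is missing, so only 'Cable' survives
strict = demo_df.dropna()

# subset: only look at the listed columns, so 'Laptop' and 'Cable' survive
rating_only = demo_df.dropna(subset=['rating'])

print(len(strict), len(rating_only))
```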

In [30]:
# 3. Update the original DataFrame
# Fill the missing rating with the column mean and assign the result back.
# The mean is recomputed here so this cell also runs on its own.
mean_rating = products_df['rating'].mean()
products_df['rating'] = products_df['rating'].fillna(mean_rating)
print("Updated DataFrame:")
print(products_df)

# Verify there are no more missing values
print("\nMissing values per column:")
print(products_df.isna().sum())

## 5. Basic Column Operations

Now let's look at some basic operations on DataFrame columns.

In [29]:
# Adding a new column
# Calculate inventory value (price * stock_quantity)
products_df['inventory_value'] = products_df['price'] * products_df['stock_quantity']
print("DataFrame with inventory value:")
print(products_df)

DataFrame with inventory value:
  product_id product_name     category  price  stock_quantity  rating  \
0       P001       Laptop  Electronics   1200              15   4.500   
1       P002   Smartphone  Electronics    800              25   4.800   
2       P003       Tablet  Electronics    450               0   4.200   
3       P004   Headphones  Accessories    150              30   4.600   
4       P005      Monitor  Electronics    300              10   4.525   

  last_updated  inventory_value  
0   2025-01-15            18000  
1   2025-01-20            20000  
2   2025-01-10                0  
3   2025-01-25             4500  
4   2025-01-18             3000  


In [None]:
# Using apply() to create a column with a function
def stock_status(quantity):
    if quantity == 0:
        return 'Out of Stock'
    elif quantity < 15:
        return 'Low Stock'
    else:
        return 'In Stock'

products_df['stock_status'] = products_df['stock_quantity'].apply(stock_status)
print("DataFrame with stock status:")
print(products_df)
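
For simple rules like this one, a vectorized alternative to `apply()` is `np.select`, which evaluates all conditions on the whole column at once. The sketch below is one possible equivalent of `stock_status`, using a stand-in Series with the sample `stock_quantity` values.

```python
import numpy as np
import pandas as pd

quantities = pd.Series([15, 25, 0, 30, 10], name='stock_quantity')

# Conditions are checked in order; the first one that matches wins
conditions = [quantities == 0, quantities < 15]
labels = ['Out of Stock', 'Low Stock']

status = pd.Series(
    np.select(conditions, labels, default='In Stock'),
    index=quantities.index,
    name='stock_status',
)
print(status.tolist())
```

The order of `conditions` matters here: a quantity of 0 also satisfies `quantities < 15`, so the out-of-stock check has to come first, just as in the `if`/`elif` chain.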

In [None]:
# Using lambda functions for simple operations
# Calculate a 10% discount price
products_df['discount_price'] = products_df['price'].apply(lambda x: x * 0.9)
print("DataFrame with discount price:")
print(products_df)
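
For plain arithmetic, the lambda is optional: operators work on a whole column at once. The short comparison below uses a stand-in Series with the sample prices and checks that both spellings give the same values.

```python
import pandas as pd

prices = pd.Series([1200, 800, 450, 150, 300], name='price')

with_apply = prices.apply(lambda x: x * 0.9)  # calls the lambda once per element
vectorized = prices * 0.9                     # one operation on the whole column

print((with_apply == vectorized).all())
```

The vectorized form is usually shorter and faster; `apply()` remains useful when the logic does not map onto column-wise operators.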

In [None]:
# Renaming columns
products_df = products_df.rename(columns={
    'stock_quantity': 'quantity_in_stock',
    'discount_price': 'sale_price'
})
print("DataFrame with renamed columns:")
print(products_df)

In [None]:
# Dropping columns
products_df_simplified = products_df.drop(columns=['inventory_value', 'stock_status'])
print("Simplified DataFrame:")
print(products_df_simplified)

## 6. Practice Exercises

Now let's practice with some exercises using what we've learned.

### Exercise 1: DataFrame Inspection

Create a new DataFrame with sales data and answer the following questions:
1. How many rows and columns are in the DataFrame?
2. What is the data type of each column?
3. Are there any missing values?
4. What is the average sales amount?

In [5]:
# Create a sales DataFrame
sales_data = {
    'sale_id': ['S001', 'S002', 'S003', 'S004', 'S005', 'S006'],
    'date': ['2025-01-05', '2025-01-10', '2025-01-15', '2025-01-20', '2025-01-25', '2025-01-30'],
    'product_id': ['P001', 'P002', 'P001', 'P003', 'P002', 'P004'],
    'quantity': [1, 2, 1, 1, 3, 2],
    'amount': [1200, 1600, 1200, 450, 2400, 300],
    'customer_id': ['C001', 'C002', 'C003', 'C001', None, 'C002']
}

sales_df = pd.DataFrame(sales_data)
sales_df['date'] = pd.to_datetime(sales_df['date'])
print(sales_df)

# Your code here to answer the questions
# 1. Number of rows and columns
print("\nNumber of rows and columns:")
rows, columns = sales_df.shape
print("Number of rows:", rows)
print("Number of columns:", columns)

# 2. Data type of each column
print("\nData type of each column:")
print(sales_df.dtypes)

# 3. Total number of missing values
print("\nMissing values:")
print(sales_df.isna().sum().sum())

# 4. Average sales amount
average_sales = sales_df["amount"].mean()
print("\nAverage sales amount:", round(average_sales, 2))

  sale_id       date product_id  quantity  amount customer_id
0    S001 2025-01-05       P001         1    1200        C001
1    S002 2025-01-10       P002         2    1600        C002
2    S003 2025-01-15       P001         1    1200        C003
3    S004 2025-01-20       P003         1     450        C001
4    S005 2025-01-25       P002         3    2400        None
5    S006 2025-01-30       P004         2     300        C002

Number of rows and columns:
Number of rows: 6
Number of columns: 6

Data type of each column:
sale_id                object
date           datetime64[ns]
product_id             object
quantity                int64
amount                  int64
customer_id            object
dtype: object

Missing values:
1

Average sales amount: 1191.67


### Exercise 2: Column Operations

Using the sales DataFrame from Exercise 1:
1. Add a column 'unit_price' that calculates the price per unit (amount / quantity)
2. Add a column 'month' that extracts the month from the date
3. Add a column 'high_value' that is True if the amount is greater than 1000, False otherwise
4. Calculate the total sales amount

In [6]:
# 1. Add a unit_price column (amount per unit sold)
sales_df["unit_price"] = sales_df["amount"] / sales_df["quantity"]
print("DataFrame with new column 'unit_price':")
print(sales_df)

# 2. Add a month column extracted from the date
sales_df["month"] = sales_df["date"].dt.month
print("\nDataFrame with new column 'month':")
print(sales_df)

DataFrame with new column 'unit_price':
  sale_id       date product_id  quantity  amount customer_id  unit_price
0    S001 2025-01-05       P001         1    1200        C001      1200.0
1    S002 2025-01-10       P002         2    1600        C002       800.0
2    S003 2025-01-15       P001         1    1200        C003      1200.0
3    S004 2025-01-20       P003         1     450        C001       450.0
4    S005 2025-01-25       P002         3    2400        None       800.0
5    S006 2025-01-30       P004         2     300        C002       150.0

DataFrame with new column 'month':
  sale_id       date product_id  quantity  amount customer_id  unit_price  \
0    S001 2025-01-05       P001         1    1200        C001      1200.0   
1    S002 2025-01-10       P002         2    1600        C002       800.0   
2    S003 2025-01-15       P001         1    1200        C003      1200.0   
3    S004 2025-01-20       P003         1     450        C001       450.0   
4    S005 2025-01-25       P002         3    2400        None       800.0   
5    S006 2025-01-30       P004         2     300        C002       150.0   

   month  
0      1  
1      1  
2      1  
3      1  
4      1  
5      1  
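
The cell above covers tasks 1 and 2. For tasks 3 and 4, one possible approach is sketched below. The `amounts` Series is a stand-in that repeats the `amount` values from Exercise 1, so the block runs on its own.

```python
import pandas as pd

# Same 'amount' values as the sales data in Exercise 1
amounts = pd.Series([1200, 1600, 1200, 450, 2400, 300], name='amount')

# Task 3: comparing a column with a number yields a boolean Series
high_value = amounts > 1000

# Task 4: total sales amount
total_sales = amounts.sum()

print(high_value.tolist())
print("Total sales amount:", total_sales)
```

In the notebook itself, the same idea would be written as `sales_df['high_value'] = sales_df['amount'] > 1000` and `sales_df['amount'].sum()`.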

### Exercise 3: Handling Missing Values

Using the sales DataFrame from Exercise 1:
1. Identify which rows have missing values
2. Fill missing customer_id values with 'Unknown'
3. Create a new DataFrame that drops rows with any missing values

In [None]:
# Your code here
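
If you want something to compare your answer against, here is one possible approach, not the only valid one. It uses `demo_sales`, a stand-in with just the two columns needed from the Exercise 1 data.

```python
import pandas as pd

# Stand-in for the relevant sales_df columns from Exercise 1
demo_sales = pd.DataFrame({
    'sale_id': ['S001', 'S002', 'S003', 'S004', 'S005', 'S006'],
    'customer_id': ['C001', 'C002', 'C003', 'C001', None, 'C002'],
})

# 1. Rows with at least one missing value
rows_with_missing = demo_sales[demo_sales.isna().any(axis=1)]

# 2. Fill missing customer_id values (returns a new DataFrame)
filled = demo_sales.fillna({'customer_id': 'Unknown'})

# 3. New DataFrame without the incomplete rows
complete_only = demo_sales.dropna()

print(rows_with_missing['sale_id'].tolist())
print(filled.loc[4, 'customer_id'])
print(len(complete_only))
```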


## Next Steps

In the next part, we'll focus on data selection and filtering operations, including how to translate SQL WHERE clauses to Pandas.

Continue to [Part 3: Selection and Filtering](02_Pandas_Fundamentals_I_part3.ipynb)