# Data Manipulation with Pandas

This notebook covers essential data manipulation techniques using the pandas library in Python. Pandas provides labeled one- and two-dimensional data structures along with tools for loading, cleaning, reshaping, and summarizing data.

## Topics Covered:
1. Introduction to pandas and basic data structures
2. Reading and writing data
3. Data inspection and cleaning
4. Filtering, selecting, and indexing data
5. Data transformation and aggregation
6. Handling missing data
7. Merging, joining, and concatenating dataframes
8. Working with time series data
9. Practical exercises

Each section includes real-life use cases to demonstrate practical applications of these concepts.

## 1. Introduction to Pandas and Basic Data Structures

Pandas provides two primary data structures:
- **Series**: One-dimensional array-like object containing a sequence of values with associated labels (index)
- **DataFrame**: Two-dimensional tabular data structure with labeled axes (rows and columns)

**Real-Life Use Case:** Customer Analytics — Use DataFrames to store and analyze customer data, and Series for individual metrics.

**What's next:** We'll import pandas and numpy, and set display options for better readability.

In [None]:
# Import pandas and numpy for data manipulation
import pandas as pd
import numpy as np

In [None]:
# Set display options for better readability
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 20)

### Creating a Series
A Series is a one-dimensional labeled array. It is useful for storing a single column of data or a single metric.

**Real-life use:** Storing a list of customer ages, product prices, or daily sales.

**What's next:** We'll create a Series and inspect its index and values.

In [None]:
# Create a Series with custom index
s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
print("Pandas Series:")
print(s)

In [None]:
# Inspect the index and values of the Series
print("\nIndex:", s.index)
print("Values:", s.values)
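
Because every value in a Series carries a label, arithmetic between two Series matches values by label rather than by position. The following is a minimal sketch of that behavior; the `prices` and `discounts` names and their values are illustrative and not part of the notebook's dataset.

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative values)
prices = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
discounts = pd.Series([1, 2], index=['b', 'c'])

# Arithmetic aligns on labels; 'a' has no partner, so the result there is NaN
net = prices - discounts
print(net)

# fill_value treats a missing label as 0 instead of producing NaN
net_filled = prices.sub(discounts, fill_value=0)
print(net_filled)
```

Alignment is convenient, but it also means that mismatched labels silently produce NaN, so it is worth checking the result for missing values after combining Series from different sources.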

### Creating a DataFrame
A DataFrame is a two-dimensional table of data, like a spreadsheet.

**Real-life use:** Storing customer records, sales transactions, or survey responses.

**What's next:** We'll create a DataFrame from a dictionary and display it.

In [None]:
# Create a DataFrame from a dictionary
data = {
    'Name': ['John', 'Anna', 'Peter', 'Linda'],
    'Age': [28, 34, 29, 42],
    'City': ['New York', 'Paris', 'Berlin', 'London'],
    'Salary': [65000, 70000, 62000, 85000]
}

df = pd.DataFrame(data)
print("DataFrame:")
df

## 2. Reading and Writing Data

Pandas can read and write data from/to various file formats including CSV, Excel, SQL, and JSON.

**Real-life use:** Importing sales data from CSV, exporting cleaned data to Excel, or loading JSON from an API.

**What's next:** We'll create a sample DataFrame and save it to a CSV file, then read it back.

In [None]:
# Create a sample DataFrame to save
sample_df = pd.DataFrame({
    'A': np.random.rand(5),          # Random float values
    'B': np.random.randint(0, 10, 5), # Random integers between 0-9
    'C': ['foo', 'bar', 'baz', 'qux', 'quux'],  # Text values
    'D': pd.date_range('2023-01-01', periods=5)  # Date range
})

In [None]:
# Save the DataFrame to a CSV file
sample_df.to_csv('sample_data.csv', index=False)
print("Data saved to CSV file")

In [None]:
# Read the data back from the CSV file
df_from_csv = pd.read_csv('sample_data.csv')
print("Data loaded from CSV:")
df_from_csv
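
The same write-then-read pattern applies to other formats. As a rough sketch, the example below round-trips a small DataFrame through a JSON string held in memory instead of a file; the `orders` data is made up for illustration.

```python
import io
import pandas as pd

# Small illustrative table (not the notebook's dataset)
orders = pd.DataFrame({'product': ['Widget', 'Gadget'], 'units': [3, 5]})

# Serialize to a JSON string with one object per row
json_text = orders.to_json(orient='records')
print(json_text)

# read_json accepts a path or a file-like object; StringIO wraps the string
orders_back = pd.read_json(io.StringIO(json_text), orient='records')
print(orders_back)
```

Reading from a file works the same way: pass the file path to `pd.read_json` in place of the `StringIO` object.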

### Creating a Realistic Sample Dataset
Let's create a larger, more realistic dataset for use in later examples.

**Real-life use:** Simulating sales data for a retail business.

**What's next:** We'll generate a DataFrame with random sales data and save it for future use.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)
dates = pd.date_range('2023-01-01', periods=100)

sales_data = pd.DataFrame({
    'Date': dates,
    'Store': np.random.choice(['A', 'B', 'C', 'D'], 100),
    'Product': np.random.choice(['Widget', 'Gadget', 'Tool', 'Device'], 100),
    'Units_Sold': np.random.randint(1, 50, 100),
    'Revenue': np.random.randint(100, 5000, 100),
    'Customer_Satisfaction': np.random.randint(1, 6, 100)
})

# Save this dataset for future use
sales_data.to_csv('sales_data.csv', index=False)
sales_data.head()

## 3. Data Inspection and Cleaning

Before analyzing data, it's important to inspect and clean it by checking its structure, identifying missing values, and handling duplicates.

**Real-Life Use Case:** Healthcare Data Management — Ensuring patient data accuracy by cleaning electronic health records.

**What's next:** We'll load the sales data and explore its structure and summary statistics.

In [None]:
# Load the sales data
df = pd.read_csv('sales_data.csv')

# Basic information about the DataFrame
print("Shape (rows, columns):", df.shape)
print("\nColumn names:", df.columns.tolist())
print("\nData types:")
print(df.dtypes)

# Summary statistics
print("\nSummary statistics:")
df.describe(include='all')

### Checking for Missing Values
Identifying missing values is crucial for data cleaning.

**Real-life use:** Detecting missing patient records, incomplete survey responses, or gaps in time series data.

**What's next:** We'll check the sales data for missing values and introduce some for demonstration.

In [None]:
# Checking for missing values
print("Missing values:")
print(df.isna().sum())

In [None]:
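# Let's introduce some missing values for demonstration
# astype returns a copy; float columns are needed because integer columns cannot hold NaN
df_with_missing = df.astype({'Revenue': 'float64', 'Customer_Satisfaction': 'float64'})
# replace=False picks distinct rows, so exactly 10 and 5 values go missing
df_with_missing.loc[np.random.choice(df.index, 10, replace=False), 'Revenue'] = np.nan
df_with_missing.loc[np.random.choice(df.index, 5, replace=False), 'Customer_Satisfaction'] = np.nan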

print("\nMissing values in the modified dataset:")
print(df_with_missing.isna().sum())
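
Raw counts are harder to compare across datasets of different sizes, so it can help to express missingness as a percentage and to look at the affected rows directly. This is a small sketch on a made-up `readings` table, not the sales data.

```python
import numpy as np
import pandas as pd

# Illustrative table with one missing revenue value
readings = pd.DataFrame({
    'revenue': [120.0, np.nan, 310.0, 95.0],
    'rating': [5, 4, 3, 4],
})

# isna() yields booleans; their mean is the fraction of missing values per column
missing_pct = readings.isna().mean() * 100
print(missing_pct)

# Boolean mask that keeps rows containing at least one missing value
rows_with_gaps = readings[readings.isna().any(axis=1)]
print(rows_with_gaps)
```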

### Handling Missing Values
There are various strategies to handle missing data, including imputation and deletion.

**Real-life use:** Filling missing lab results with the average value, or dropping records that are too incomplete to use.

**What's next:** We'll explore different methods to handle missing values in the sales data.

In [None]:
# Handling missing values
# Method 1: Fill missing values with mean/median/mode
df_filled = df_with_missing.copy()
df_filled['Revenue'] = df_filled['Revenue'].fillna(df_filled['Revenue'].mean())
df_filled['Customer_Satisfaction'] = df_filled['Customer_Satisfaction'].fillna(
    df_filled['Customer_Satisfaction'].median())

# Method 2: Drop rows with missing values
df_dropped = df_with_missing.dropna()

print(f"Original shape: {df_with_missing.shape}")
print(f"After filling: {df_filled.shape}")
print(f"After dropping: {df_dropped.shape}")
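
A single global mean can distort the data when groups differ a lot. One alternative, sketched here on a tiny made-up table, is to fill each gap with a statistic from its own group using `groupby(...).transform(...)`. The `sales_small` name and its values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Two stores with very different revenue levels (illustrative values)
sales_small = pd.DataFrame({
    'Store': ['A', 'A', 'A', 'B', 'B', 'B'],
    'Revenue': [100.0, np.nan, 300.0, 1000.0, 3000.0, np.nan],
})

# transform('median') returns one value per row: the median of that row's store
store_median = sales_small.groupby('Store')['Revenue'].transform('median')

# Fill each gap with its own store's median rather than a global statistic
sales_small['Revenue_filled'] = sales_small['Revenue'].fillna(store_median)
print(sales_small)
```

Here the gap in store A is filled from store A's values and the gap in store B from store B's, so the large difference between the two stores is preserved.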

### Detecting and Handling Duplicates
Duplicates can skew your analysis and should be handled appropriately.

**Real-life use:** Removing duplicate entries in a customer database or transaction log.

**What's next:** We'll check the sales data for duplicates and remove them.

In [None]:
# Detecting and handling duplicates
# Let's introduce some duplicates
df_with_duplicates = pd.concat([df, df.iloc[:5]], ignore_index=True)

# Check for duplicates
duplicate_count = df_with_duplicates.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count}")

# Remove duplicates
df_unique = df_with_duplicates.drop_duplicates()
print(f"Shape after removing duplicates: {df_unique.shape}")
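
Real duplicates often differ in one or two columns, such as a timestamp, so comparing whole rows can miss them. The sketch below uses a hypothetical `customers` table to show the `subset` and `keep` parameters of `duplicated` and `drop_duplicates`.

```python
import pandas as pd

# Hypothetical customer records; customer 2 appears twice with different dates
customers = pd.DataFrame({
    'customer_id': [1, 2, 2, 3],
    'email': ['a@example.com', 'b@example.com', 'b@example.com', 'c@example.com'],
    'updated': ['2023-01-01', '2023-01-05', '2023-02-10', '2023-01-07'],
})

# No row is a full duplicate because the 'updated' values differ
full_duplicates = customers.duplicated().sum()
print(full_duplicates)

# Restrict the comparison to the key column; keep=False marks every copy
repeated = customers[customers.duplicated(subset='customer_id', keep=False)]
print(repeated)

# Keep only the most recent record per customer (ISO dates sort correctly as text)
latest = (customers.sort_values('updated')
                   .drop_duplicates(subset='customer_id', keep='last'))
print(latest)
```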

## 4. Filtering, Selecting, and Indexing Data

Pandas provides powerful ways to select, filter, and access data within DataFrames.

**Real-Life Use Case:** Marketing Campaign Analysis — Segmenting customers for targeted campaigns.

**What's next:** We'll explore different methods to select and filter data in pandas.

In [None]:
# Basic column selection
print("Selecting a single column as Series:")
print(df['Revenue'].head())

print("\nSelecting multiple columns as DataFrame:")
print(df[['Store', 'Product', 'Revenue']].head())

In [None]:
# Row selection using iloc (position-based; slice ends are exclusive)
print("First 5 rows using iloc:")
print(df.iloc[:5])

print("\nRows 10-15:")
print(df.iloc[10:16])

print("\nSpecific rows and columns using iloc:")
print(df.iloc[0:3, [1, 3, 4]])
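
For comparison, `loc` selects by label instead of position. The following sketch uses a small made-up `stores` table with string labels so that the difference from `iloc` is visible.

```python
import pandas as pd

# Illustrative table indexed by region name
stores = pd.DataFrame(
    {'Revenue': [1200, 4500, 800], 'Units_Sold': [10, 40, 5]},
    index=['north', 'south', 'west'],
)

# loc selects by label, and label slices include both endpoints
by_label = stores.loc['north':'south', 'Revenue']
print(by_label)

# iloc selects by position, and position slices exclude the end
by_position = stores.iloc[0:1]
print(by_position)

# loc also accepts a boolean mask together with a list of column labels
high_revenue_units = stores.loc[stores['Revenue'] > 1000, ['Units_Sold']]
print(high_revenue_units)
```

The inclusive end of a label slice is the detail that most often surprises people moving between `iloc` and `loc`.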

In [None]:
# Filtering data based on conditions
# Filter transactions with high revenue
high_revenue = df[df['Revenue'] > 4000]
print("High revenue transactions:")
print(high_revenue)

# Filter specific store and product combinations
store_a_widgets = df[(df['Store'] == 'A') & (df['Product'] == 'Widget')]
print("\nStore A Widget sales:")
print(store_a_widgets)

In [None]:
# Using the query method for more readable filtering
high_satisfaction = df.query('Customer_Satisfaction >= 4 and Units_Sold > 30')
print("High satisfaction and high volume sales:")
print(high_satisfaction)

## 5. Data Transformation and Aggregation

Pandas provides various methods for transforming, grouping, and aggregating data.

**Real-Life Use Case:** Retail Sales Analysis — Summarizing sales performance across stores and products.

**What's next:** We'll learn how to add new columns, group data, and create pivot tables in pandas.

In [None]:
# Adding new columns
df['Revenue_per_Unit'] = df['Revenue'] / df['Units_Sold']
df['Is_High_Value'] = df['Revenue'] > df['Revenue'].median()

# Apply a custom function to a column
def categorize_satisfaction(score):
    if score >= 4:
        return 'High'
    elif score >= 2:
        return 'Medium'
    else:
        return 'Low'

df['Satisfaction_Category'] = df['Customer_Satisfaction'].apply(categorize_satisfaction)
df.head()

In [None]:
# Group by operations
store_summary = df.groupby('Store').agg({
    'Revenue': ['sum', 'mean'], 
    'Units_Sold': 'sum',
    'Customer_Satisfaction': 'mean'
})

print("Store summary:")
store_summary
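
Passing a dictionary with lists of functions produces two-level column labels, which can be awkward to work with later. Named aggregation is one way to get flat, descriptive column names instead. This is a minimal sketch on a made-up `sales_small` table; the output column names are arbitrary choices.

```python
import pandas as pd

# Illustrative transactions for two stores
sales_small = pd.DataFrame({
    'Store': ['A', 'A', 'B', 'B'],
    'Revenue': [100, 300, 1000, 2000],
    'Units_Sold': [1, 3, 10, 20],
})

# Named aggregation: output_column=(input_column, function)
summary = sales_small.groupby('Store').agg(
    total_revenue=('Revenue', 'sum'),
    avg_revenue=('Revenue', 'mean'),
    total_units=('Units_Sold', 'sum'),
)
print(summary)
```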

In [None]:
# Multi-level groupby
product_store_summary = df.groupby(['Store', 'Product']).agg({
    'Revenue': 'sum',
    'Units_Sold': 'sum',
    'Customer_Satisfaction': 'mean'
}).reset_index()

print("Product and store summary:")
product_store_summary

In [None]:
# Pivot tables
pivot_table = df.pivot_table(
    values=['Revenue', 'Units_Sold'],
    index='Store',
    columns='Product',
    aggfunc={'Revenue': 'sum', 'Units_Sold': 'sum'}
)

print("Pivot table of revenue and units sold by store and product:")
pivot_table

## 6. Handling Missing Data

Let's explore more advanced techniques for handling missing data.

**Real-Life Use Case:** Environmental Sensor Data — Handling gaps in sensor data for accurate modeling.

**What's next:** We'll learn different imputation methods and how to interpolate missing values.

In [None]:
# Create a dataset with various missing patterns
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4, 5],
    'B': [np.nan, 2, 3, 4, 5],
    'C': [1, 2, 3, np.nan, np.nan],
    'D': [1, 2, 3, 4, 5]
})

print("Original dataset with missing values:")
print(df_missing)
print("\nMissing value count by column:")
print(df_missing.isna().sum())

In [None]:
# Different imputation strategies

# 1. Fill with a constant value
df_fill_const = df_missing.fillna(0)
print("Filled with constant value:")
print(df_fill_const)

# 2. Fill with different values for each column
df_fill_dict = df_missing.fillna({'A': 0, 'B': 10, 'C': -1})
print("\nFilled with different values per column:")
print(df_fill_dict)

# 3. Forward fill (propagate last valid observation forward)
# ffill() replaces fillna(method='ffill'), which is deprecated in pandas 2.1+
df_ffill = df_missing.ffill()
print("\nForward fill:")
print(df_ffill)

# 4. Backward fill (use next valid observation to fill gap)
df_bfill = df_missing.bfill()
print("\nBackward fill:")
print(df_bfill)

In [None]:
# Interpolation methods
df_interp = df_missing.interpolate(method='linear')
print("Linear interpolation:")
print(df_interp)
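
Linear interpolation treats rows as evenly spaced. When readings arrive at irregular times, as sensor data often does, `method='time'` weights the estimate by elapsed time instead. The sketch below uses hypothetical readings to contrast the two methods.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings at irregular times (0, 10, and 40 minutes)
timestamps = pd.to_datetime(
    ['2023-01-01 00:00', '2023-01-01 00:10', '2023-01-01 00:40'])
temps = pd.Series([0.0, np.nan, 40.0], index=timestamps)

# 'linear' ignores the index and places the gap halfway between its neighbors
linear_fill = temps.interpolate(method='linear')
print(linear_fill)

# 'time' uses the timestamps: 10 of the 40 minutes have elapsed at the gap
time_fill = temps.interpolate(method='time')
print(time_fill)
```

The gap is estimated as 20.0 by the linear method and 10.0 by the time-based method, which shows how much the choice matters when sampling is uneven.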

## 7. Merging, Joining, and Concatenating DataFrames

Pandas combines DataFrames in two main ways: `merge` matches rows on key columns (like a SQL join), and `concat` stacks DataFrames along rows or columns.

**Real-Life Use Case:** Supply Chain Management — Combining data from different systems for end-to-end visibility.

**What's next:** We'll learn how to merge, join, and concatenate DataFrames in pandas.

In [None]:
# Create sample DataFrames
df1 = pd.DataFrame({
    'ID': ['A1', 'A2', 'A3', 'A4'],
    'Name': ['John', 'Emily', 'Martha', 'Samuel'],
    'Department': ['HR', 'Marketing', 'Finance', 'IT']
})

df2 = pd.DataFrame({
    'ID': ['A2', 'A3', 'A4', 'A5'],
    'Salary': [60000, 80000, 70000, 90000],
    'Years_Employed': [3, 7, 4, 2]
})

df3 = pd.DataFrame({
    'Department': ['HR', 'Marketing', 'Finance', 'IT', 'Operations'],
    'Budget': [100000, 200000, 300000, 250000, 150000]
})

print("DataFrame 1:")
print(df1)
print("\nDataFrame 2:")
print(df2)
print("\nDataFrame 3:")
print(df3)

In [None]:
# Different join types

# Inner join
inner_join = pd.merge(df1, df2, on='ID', how='inner')
print("Inner join:")
print(inner_join)

# Left join
left_join = pd.merge(df1, df2, on='ID', how='left')
print("\nLeft join:")
print(left_join)

# Right join
right_join = pd.merge(df1, df2, on='ID', how='right')
print("\nRight join:")
print(right_join)

# Outer join
outer_join = pd.merge(df1, df2, on='ID', how='outer')
print("\nOuter join:")
print(outer_join)
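
When joins drop or add rows unexpectedly, two `merge` parameters help with diagnosis: `indicator` records which side each row came from, and `validate` checks the assumed key relationship. This sketch uses small made-up `employees` and `salaries` tables.

```python
import pandas as pd

# Illustrative tables with partially overlapping IDs
employees = pd.DataFrame({'ID': ['A1', 'A2', 'A3'],
                          'Name': ['John', 'Emily', 'Martha']})
salaries = pd.DataFrame({'ID': ['A2', 'A3', 'A5'],
                         'Salary': [60000, 80000, 90000]})

# indicator=True adds a '_merge' column: 'left_only', 'right_only', or 'both'
# validate raises MergeError if ID is not unique on either side
audit = pd.merge(employees, salaries, on='ID', how='outer',
                 indicator=True, validate='one_to_one')
print(audit)

# IDs that appear in only one of the two tables
unmatched = audit.loc[audit['_merge'] != 'both', 'ID'].tolist()
print(unmatched)
```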

In [None]:
# Join on a different key column (Department instead of ID)
dept_budget = pd.merge(df1, df3, on='Department', how='left')
print("Joining on Department column:")
print(dept_budget)

In [None]:
# Concatenating DataFrames
df4 = pd.DataFrame({
    'ID': ['A6', 'A7'],
    'Name': ['Mark', 'Lisa'],
    'Department': ['Legal', 'HR']
})

# Vertical concatenation (row-wise)
df_concat_rows = pd.concat([df1, df4], ignore_index=True)
print("Concatenated rows:")
print(df_concat_rows)

# Horizontal concatenation (column-wise)
df5 = pd.DataFrame({
    'Performance_Score': [4.5, 3.9, 4.2, 4.7],
    'Bonus': [2000, 1500, 1800, 2200]
}, index=['A1', 'A2', 'A3', 'A4'])

df1_with_index = df1.copy()
df1_with_index.set_index('ID', inplace=True)

df_concat_cols = pd.concat([df1_with_index, df5], axis=1)
print("\nConcatenated columns:")
print(df_concat_cols)

## 8. Working with Time Series Data

Pandas supports time series analysis through datetime indexes, date component accessors, resampling, and date-based slicing.

**Real-Life Use Case:** Energy Consumption Forecasting — Analyzing and forecasting energy demand.

**What's next:** We'll learn how to work with date and time data, and perform time series analysis in pandas.

In [None]:
# Convert Date column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Set Date as index
ts_df = df.set_index('Date')
print("Time series data (first few rows):")
print(ts_df.head())

In [None]:
# Extracting date components
df['Year'] = df['Date'].dt.year
df['Month'] = df['Date'].dt.month
df['Day'] = df['Date'].dt.day
df['Weekday'] = df['Date'].dt.day_name()

print("Date components:")
df[['Date', 'Year', 'Month', 'Day', 'Weekday']].head()

In [None]:
# Time series resampling
# Daily to weekly resampling
weekly_sales = ts_df.resample('W')['Revenue'].sum()
print("Weekly sales:")
print(weekly_sales.head())

# Daily to monthly resampling
# 'ME' (month end) is the alias in pandas 2.2+; older versions use 'M'
monthly_sales = ts_df.resample('ME').agg({
    'Revenue': 'sum',
    'Units_Sold': 'sum',
    'Customer_Satisfaction': 'mean'
})
print("\nMonthly aggregated data:")
print(monthly_sales)

In [None]:
# Time-based filtering
# Filter data for a specific month (.loc makes the label-based row slice explicit)
jan_data = ts_df.loc['2023-01-01':'2023-01-31']
print(f"January data shape: {jan_data.shape}")

# Filter data between two dates (both endpoints are included)
q1_data = ts_df.loc['2023-01-01':'2023-03-31']
print(f"Q1 data shape: {q1_data.shape}")
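
Another common time series task is comparing each period with the previous one. The sketch below shows `shift`, `diff`, and `pct_change` on a three-day made-up revenue series; the values are chosen only to make the arithmetic easy to follow.

```python
import pandas as pd

# Illustrative daily revenue
daily_revenue = pd.Series(
    [100.0, 110.0, 99.0],
    index=pd.date_range('2023-01-01', periods=3),
)

# shift(1) moves each value one row down, so a row sees the previous day's value
previous_day = daily_revenue.shift(1)

# diff() is the absolute change; pct_change() is the relative change
change = pd.DataFrame({
    'revenue': daily_revenue,
    'previous_day': previous_day,
    'abs_change': daily_revenue.diff(),
    'pct_change': daily_revenue.pct_change(),
})
print(change)
```

The first row has no previous day, so its change columns are NaN.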

## 9. Practical Exercises

Let's consolidate our knowledge with some practical exercises.

**Real-Life Use Case:** Business Intelligence Dashboard — Preparing data for interactive executive dashboards.

**What's next:** Complete the exercises to reinforce your learning and apply pandas skills to real-world scenarios.

### Exercise 1: Basic Data Analysis

1. Load the sales data
2. Find the top 5 days with highest revenue
3. Calculate the average satisfaction score by product
4. Identify which store has the highest average revenue per transaction

In [None]:
# Exercise 1 Solution

# 1. Load the sales data
sales = pd.read_csv('sales_data.csv')
sales['Date'] = pd.to_datetime(sales['Date'])

# 2. Find the top 5 days with highest revenue
top_revenue_days = sales.groupby('Date')['Revenue'].sum().sort_values(ascending=False).head(5)
print("Top 5 days with highest revenue:")
print(top_revenue_days)

# 3. Calculate the average satisfaction score by product
product_satisfaction = sales.groupby('Product')['Customer_Satisfaction'].mean().sort_values(ascending=False)
print("\nAverage satisfaction score by product:")
print(product_satisfaction)

# 4. Identify which store has the highest average revenue per transaction
store_avg_revenue = sales.groupby('Store')['Revenue'].mean().sort_values(ascending=False)
print("\nAverage revenue per transaction by store:")
print(store_avg_revenue)

### Exercise 2: Data Transformation Challenge

1. Create a new column that categorizes revenue into 'Low', 'Medium', 'High' based on percentiles
2. Calculate a 7-day moving average of revenue
3. Find the correlation between units sold and customer satisfaction

In [None]:
# Exercise 2 Solution

# 1. Create a new column that categorizes revenue into 'Low', 'Medium', 'High' based on percentiles
sales['Revenue_Category'] = pd.qcut(
    sales['Revenue'], 
    q=[0, 0.33, 0.67, 1], 
    labels=['Low', 'Medium', 'High']
)

print("Revenue categories count:")
print(sales['Revenue_Category'].value_counts())

# 2. Calculate a 7-day moving average of revenue
ts_sales = sales.set_index('Date').sort_index()
revenue_ma = ts_sales['Revenue'].rolling(window=7).mean()

print("\n7-day moving average of revenue (first 10 days):")
print(revenue_ma.head(10))

# 3. Find the correlation between units sold and customer satisfaction
correlation = sales[['Units_Sold', 'Customer_Satisfaction', 'Revenue']].corr()
print("\nCorrelation matrix:")
print(correlation)

### Exercise 3: Advanced Analysis

1. Identify which product-store combination generates the highest total revenue
2. Calculate the day of week effect on sales
3. Create a pivot table showing total revenue by store and product

In [None]:
# Exercise 3 Solution

# 1. Identify which product-store combination generates the highest total revenue
product_store_revenue = sales.groupby(['Store', 'Product'])['Revenue'].sum().reset_index()
top_combinations = product_store_revenue.sort_values(by='Revenue', ascending=False).head(5)
print("Top 5 product-store combinations by revenue:")
print(top_combinations)

# 2. Calculate the day of week effect on sales
sales['Weekday'] = sales['Date'].dt.day_name()
weekday_sales = sales.groupby('Weekday')['Revenue'].agg(['sum', 'mean'])
# Reorder days of week
weekday_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
weekday_sales = weekday_sales.reindex(weekday_order)
print("\nSales by day of week:")
print(weekday_sales)

# 3. Create a pivot table showing total revenue by store and product
pivot = pd.pivot_table(sales, values='Revenue', index='Store', columns='Product', 
                       aggfunc='sum', fill_value=0)
print("\nTotal revenue by store and product:")
print(pivot)

## Summary

In this notebook, we've covered key pandas functionalities for data manipulation:

- **Creating and working with pandas data structures** (Series, DataFrame) - *essential for organizing any data analysis project*
- **Reading and writing data from/to various formats** - *critical for working with diverse data sources in business environments*
- **Data inspection and cleaning techniques** - *fundamental for ensuring data quality in any analytics workflow*
- **Different ways to select, filter, and index data** - *powerful capabilities for focusing on relevant subsets of your data*
- **Data transformation and aggregation methods** - *key for summarizing information and extracting insights*
- **Handling missing data effectively** - *essential for maintaining data integrity despite incomplete information*
- **Combining datasets through merging, joining, and concatenation** - *important for creating comprehensive unified datasets*
- **Working with time series data** - *crucial for analyzing patterns over time and making forecasts*
- **Practical exercises to apply these concepts** - *reinforcing learning through hands-on problem-solving*

These pandas skills form the backbone of modern data analysis in Python, applicable across industries from finance and healthcare to marketing and environmental science. Mastering these techniques will enable you to efficiently transform raw data into actionable insights in virtually any data-driven role.

## Additional Real-World Case Studies

### 1. Fraud Detection in Banking

Banks use pandas to analyze transaction data for suspicious patterns. Data scientists create features like transaction frequency, amount deviations from historical patterns, and geographic anomalies using pandas groupby, rolling window functions, and filtering operations. These derived features feed into machine learning models that flag potentially fraudulent transactions for further investigation.

```python
# Example of transaction frequency analysis
# Assumes `df` is a transactions DataFrame with a datetime 'transaction_date'
# column plus 'customer_id' and 'amount' columns
hourly_transactions = (
    df.groupby([df['transaction_date'].dt.date.rename('date'),
                'customer_id',
                df['transaction_date'].dt.hour.rename('hour')])['amount']
      .count()
      .reset_index(name='txn_count')  # distinct key names avoid a reset_index clash
)
# Flag unusual transaction frequency
customer_hourly_stats = hourly_transactions.groupby('customer_id')['txn_count'].agg(['mean', 'std'])
merged_data = hourly_transactions.merge(customer_hourly_stats, on='customer_id')
merged_data['z_score'] = (merged_data['txn_count'] - merged_data['mean']) / merged_data['std']
suspicious = merged_data[merged_data['z_score'] > 3]  # Customer-hours with unusually high frequency
```

### 2. Product Recommendation Systems

E-commerce companies use pandas to analyze purchase history and build recommendation engines. Data scientists use merge operations to combine user profiles with purchase history, pivot tables to create user-item matrices, and groupby operations to identify frequently co-purchased items. These transformations prepare the data for collaborative filtering algorithms that power "customers who bought this also bought" recommendations.
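
As a rough sketch of the user-item matrix and co-purchase counting described above, the example below uses a tiny hypothetical purchase log. The `purchases` table and its column names are assumptions, and a production system would typically use sparse matrices rather than a dense DataFrame.

```python
import numpy as np
import pandas as pd

# Hypothetical purchase log: one row per (user, item) purchase
purchases = pd.DataFrame({
    'user': ['u1', 'u1', 'u2', 'u2', 'u2', 'u3', 'u3'],
    'item': ['A', 'B', 'A', 'B', 'C', 'A', 'C'],
})

# User-item matrix: 1 if the user bought the item at least once, else 0
user_item = pd.crosstab(purchases['user'], purchases['item']).clip(upper=1)

# Item-item co-purchase counts: entry (i, j) = number of users who bought both
co_purchase = user_item.T @ user_item

# Zero the diagonal, which only counts how many users bought each item
counts = co_purchase.to_numpy().copy()
np.fill_diagonal(counts, 0)
co_purchase = pd.DataFrame(counts, index=co_purchase.index,
                           columns=co_purchase.columns)
print(co_purchase)
```

For a given item, sorting its row in `co_purchase` in descending order gives a simple "also bought" ranking.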

### 3. Urban Transportation Planning

City planners use pandas to analyze public transit data, including bus/train ridership, traffic patterns, and infrastructure usage. Using time series capabilities, they can identify peak travel times, resampling minute-by-minute data to hourly or daily aggregates. With groupby operations, they can analyze differences between weekdays and weekends, or examine transit usage by neighborhood. These insights help optimize bus routes, traffic light timing, and infrastructure investments.
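
The weekday versus weekend comparison mentioned above can be sketched as follows. The `ridership` table is hypothetical, with one row per day and invented rider counts.

```python
import pandas as pd

# Hypothetical daily ridership for one week; 2023-01-02 is a Monday
ridership = pd.DataFrame({
    'date': pd.date_range('2023-01-02', periods=7),
    'riders': [100, 100, 100, 100, 100, 40, 40],
})

# dayofweek runs from 0 (Monday) to 6 (Sunday), so 5 and 6 are the weekend
ridership['is_weekend'] = ridership['date'].dt.dayofweek >= 5

# Average riders per day for weekdays (False) and weekends (True)
avg_riders = ridership.groupby('is_weekend')['riders'].mean()
print(avg_riders)
```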

### 4. Clinical Trial Data Analysis

Pharmaceutical researchers use pandas to analyze clinical trial results, comparing treatment and control groups across multiple metrics. They use pivot tables to restructure patient visit data from long to wide format, filtering operations to handle inclusion/exclusion criteria, and statistical functions to calculate effect sizes and confidence intervals. Pandas' ability to handle missing values is especially important, as patient data often contains gaps due to missed visits or incomplete tests.
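
The long-to-wide restructuring described above can be sketched with `DataFrame.pivot`. The `visits` table below is hypothetical: one row per patient visit, with one patient missing a visit so that the resulting gap is visible.

```python
import pandas as pd

# Hypothetical long-format visit data; patient P2 missed visit 2
visits = pd.DataFrame({
    'patient': ['P1', 'P1', 'P1', 'P2', 'P2'],
    'visit': [1, 2, 3, 1, 3],
    'score': [50.0, 45.0, 40.0, 60.0, 52.0],
})

# Long to wide: one row per patient, one column per visit
wide = visits.pivot(index='patient', columns='visit', values='score')
print(wide)

# The missed visit shows up as NaN; count gaps per patient
gaps_per_patient = wide.isna().sum(axis=1)
print(gaps_per_patient)

# Change from the first to the last visit
wide['change'] = wide[3] - wide[1]
print(wide['change'])
```

Note that `pivot` requires each patient and visit pair to appear at most once; `pivot_table` with an aggregation function is the usual alternative when repeated measurements exist.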