# Pandas Comprehensive Tutorial

This notebook provides a complete introduction to pandas, including:
- What is pandas and how it works
- Series and DataFrame concepts
- Common operations and indexing
- Conditional indexing and data wrangling
- Interactive exercises for students
- Complex exercises to test understanding

In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

# SECTION 1: WHAT IS PANDAS?

Pandas is a powerful Python library for data manipulation and analysis. 
It provides data structures for efficiently storing and manipulating 
large datasets, along with tools for reading and writing data in 
various formats.

Key Features:
- DataFrame: 2D labeled data structure (like a spreadsheet)
- Series: 1D labeled array (like a column in a spreadsheet)
- Powerful indexing and selection capabilities
- Built-in data cleaning and preparation tools
- Integration with other data science libraries
- Fast performance for large datasets

Think of pandas as 'Excel on steroids' for Python!

## Datatypes in Python vs Pandas vs Numpy

![](../images/dtypes.png)

# SECTION 2: SERIES - THE BUILDING BLOCK

A Series is a one-dimensional labeled array that can hold any data type.

It's like a column in a spreadsheet with an index.
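
To make the "column with an index" picture concrete, the sketch below builds a small Series and inspects its parts. The name `temps` and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data: three readings labeled by day
temps = pd.Series([21.5, 23.0, 19.8], index=['Mon', 'Tue', 'Wed'])

print(temps.index.tolist())   # the labels
print(temps.values)           # the underlying NumPy array
print(temps.dtype)            # one dtype for the whole Series

by_label = temps['Tue']       # label-based access
by_position = temps.iloc[1]   # position-based access
print(by_label == by_position)
```

Both lookups reach the same element: once through its label, once through its position.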

In [None]:
# From a list
numbers = [10, 20, 30, 40, 50.3]
series_from_list = pd.Series(numbers)
print("Series from list:")
series_from_list

In [None]:
# From a list with custom index
custom_index = ['A', 'B', 'C', 'D', 'E']
series_custom_index = pd.Series(numbers, index=custom_index)
print("Series with custom index:")

series_custom_index

In [None]:
# From a dictionary
data_dict = {'Jan': 100, 'Feb': 150, 'Mar': 200, 'Apr': 175}
series_from_dict = pd.Series(data_dict)
print("Series from dictionary:")
series_from_dict

In [None]:
series_from_dict['Jan']

In [None]:
# Series operations
print("--- Series Operations ---")
print("Original series:", series_from_list)
print("Sum:", series_from_list.sum())
print("Mean:", series_from_list.mean())
print("Max:", series_from_list.max())
print("Min:", series_from_list.min())
print()

# SECTION 3: DATAFRAME - THE POWERHOUSE

A DataFrame is a 2D labeled data structure with columns that can be 
different types (numeric, string, boolean, etc.). Think of it as a 
collection of Series objects, or a spreadsheet with multiple columns.
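
The "collection of Series" view can be checked directly. This is a minimal sketch on made-up data (`names`, `ages`, and `people` are illustrative names): two Series that share an index are combined into a DataFrame, and each column comes back out as a Series.

```python
import pandas as pd

# Hypothetical columns built as separate Series sharing one index
names = pd.Series(['Ana', 'Ben'], index=[0, 1])
ages = pd.Series([31, 27], index=[0, 1])

people = pd.DataFrame({'Name': names, 'Age': ages})

# Each column is returned as a Series, and each keeps its own dtype
age_column = people['Age']
print(type(age_column).__name__)
print(people.dtypes)
```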

In [None]:
# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")

df

In [None]:
# From a list of lists
data_list = [
    ['Apple', 'Fruit', 0.5, 100],
    ['Carrot', 'Vegetable', 0.3, 50],
    ['Chicken', 'Meat', 8.0, 200],
    ['Rice', 'Grain', 2.0, 150],
    ['Pizza', 'Meat', 10.0, 250],
    ['Potato', 'Vegetable', 0.5, 100]
]

columns = ['Food', 'Category', 'Price', 'Calories']
df_food = pd.DataFrame(data_list, columns=columns)
print("DataFrame from list of lists:")

df_food


In [None]:
# Basic DataFrame information
print("--- DataFrame Information ---")
print("Shape (rows, columns):", df_food.shape)
print("Data types:")
print(df_food.dtypes)
print("\nColumn names:", list(df_food.columns))
print("Index:", list(df_food.index))
print()

In [None]:
# Column names can be replaced by assigning a list of the same length to .columns
df_food.columns = ['Food', 'Category', 'Price', 'Calories']
df_food

# SECTION 4: INDEXING AND SELECTION

Pandas provides powerful ways to select and filter data. Understanding 
indexing is crucial for effective data manipulation.

In [None]:
# From a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle'],
    'Salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)
print("DataFrame from dictionary:")

df

In [None]:
# Column selection
print("--- Column Selection ---")
print("Select single column (returns Series):")
print(df['Name'])
print(type(df['Name']))


In [None]:
print("Select multiple columns (returns DataFrame):")
df[['Name', 'Age', 'Salary']]
# type(df[['Name', 'Age', 'Salary']])

In [None]:
# Row selection
print("--- Row Selection ---")
print("Select first 3 rows:")
print(df.head(3))
print()

print("Select last 2 rows:")
print(df.tail(2))
print()

In [None]:
df.loc[[2, 4], ['Age', 'Salary']]

In [None]:
df_food

In [None]:
df_food['Category'] == 'Vegetable'

In [None]:
df_food.loc[df_food['Category'] == 'Vegetable', 'Price']

## Indexing: loc vs iloc

In [None]:
df

In [None]:
# Position-based indexing with .iloc
print("--- Position-based Indexing (.iloc) ---")
print("First row (index 0):")
print(df.iloc[0])
print(type(df.iloc[0]))

print("First 2 rows, first 2 columns:")
print(df.iloc[0:2, 0:2])
print(type(df.iloc[0:2, 0:2]))

print("Specific row and column (row 1, column 'Age'):")
print(df.iloc[1]['Age'])
print()

In [None]:
# Label-based indexing with .loc
print("--- Label-based Indexing (.loc) ---")
print("Select rows by index label:")
print(df.loc[0:3])
print()

print("Select specific row and column:")
print(df.loc[0, 'Name'])
print()

In [None]:
df.iloc[0, 0]
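
With the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between `.loc` and `.iloc`. The sketch below uses a made-up frame (`scores`) whose labels differ from the row positions. Note that label slices include the end label, while position slices exclude the end position.

```python
import pandas as pd

# Hypothetical frame whose index labels differ from row positions
scores = pd.DataFrame(
    {'score': [88, 92, 75, 64]},
    index=[10, 20, 30, 40],
)

by_label = scores.loc[10:30]      # labels 10 through 30, end label included
by_position = scores.iloc[0:2]    # positions 0 and 1, end position excluded

print(len(by_label), len(by_position))
```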

# SECTION 5: CONDITIONAL INDEXING

Conditional indexing (boolean indexing) allows you to filter data 
based on conditions, similar to WHERE clauses in SQL.
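
As a rough illustration of the SQL analogy, the sketch below filters a made-up table (`staff` is a hypothetical name) with a boolean mask, and then with `DataFrame.query`, which accepts the same condition as a string.

```python
import pandas as pd

# Hypothetical table standing in for a SQL table named `staff`
staff = pd.DataFrame({
    'name': ['Ana', 'Ben', 'Cy'],
    'age': [31, 27, 45],
})

# SQL: SELECT * FROM staff WHERE age > 30
mask = staff['age'] > 30          # boolean Series, one value per row
older = staff[mask]               # keeps the rows where the mask is True

# query() expresses the same filter as a string
older_via_query = staff.query('age > 30')
```

The mask is an ordinary Series, so it can be stored, inspected, or combined with other masks before it is used to filter.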

In [None]:
df['Age'] > 30

In [None]:
# Simple conditions
print("--- Simple Conditions ---")
print("People older than 30:")
print(df[df['Age'] > 30])
print()

print("People with salary >= 60000:")
df[df['Salary'] >= 60000]

In [None]:
df

In [None]:
# Multiple conditions
# & and 
# | or
# ~ not

print("--- Multiple Conditions ---")
print("People aged 25-30 AND salary < 60000:")
condition = ((df['Age'] >= 25) & (df['Age'] <= 30)) & ~(df['Salary'] >= 60000)

df[condition]

In [None]:
df[~(df['Salary'] >= 60000)]
# df['Salary'] >= 60000

In [None]:
~(df['Salary'] >= 60000)

In [None]:
df

In [None]:
# String conditions
print("--- String Conditions ---")
print("People from cities starting with 'B':")
print(df[df['City'].str.startswith('B')])
print()

print("People with names containing 'a' (case insensitive):")
df[df['Name'].str.contains('a', case=False)]

In [None]:
# isin() method
print("--- Using isin() ---")
print("People outside NYC, LA, and Boston with salary > 65000:")
list_of_cities = ['NYC', 'LA', "Boston"]
df[~(df['City'].isin(list_of_cities)) & (df['Salary'] > 65000)]

# SECTION 6: DATA WRANGLING

Data wrangling involves cleaning, transforming, and preparing data 
for analysis. Pandas provides many tools for this.

In [None]:
df

In [None]:
# Adding/removing columns
print("--- Adding/Removing Columns ---")
df['Experience'] = df['Age'] - 22  # Assuming they started working at 22
print("Added Experience column:")
df

In [None]:
# Remove a column
df = df.drop('Experience', axis=1)
print("After dropping Experience column:")
df

In [None]:
df

In [None]:
# Sorting
print("--- Sorting ---")
print("Sort by Age (ascending):")
print(df.sort_values('Age'))
print()

print("Sort by Salary (descending):")
print(df.sort_values('Salary', ascending=False))
print()

In [None]:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'George', 'Hannah', 'Ivy', 'Jack'],
    'Age': [25, 30, 35, 28, 32, np.nan, 31, 27, 26, 33],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle', 'Chicago', 'NYC', 'LA', 'Chicago', 'Boston'],
    'Salary': [50000, 60000, np.nan, 55000, 65000, 55000, 65000, 55000, None, 55000]
}

df = pd.DataFrame(data)
df

In [None]:
# Grouping and aggregation
print("--- Grouping and Aggregation ---")
print("Salary statistics by city:")
city_salary = df.groupby('City')['Salary'].agg(['mean', 'count', 'min', 'max', 'sum']).reset_index()
city_salary

In [None]:
type(np.nan)
# None == np.nan
# np.nan == np.nan
np.nan == None
np.nan == False
np.nan == True
np.nan == 0
np.nan == 1
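
Every comparison in the cell above evaluates to `False`, so equality tests cannot be used to find missing values. A minimal sketch of the practical consequence, on a made-up Series (`values`):

```python
import numpy as np
import pandas as pd

# NaN is a float that never compares equal to anything, including itself
print(np.nan == np.nan)      # False

# None is stored as NaN when the Series is numeric
values = pd.Series([1.0, np.nan, None])

wrong_mask = values == np.nan    # all False: equality cannot find NaN
right_mask = values.isna()       # True where the value is missing

print(wrong_mask.tolist())
print(right_mask.tolist())
```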

In [None]:
# Handling missing data
print("--- Handling Missing Data ---")
df_with_nulls = df.copy()
df_with_nulls.loc[2, 'Salary'] = np.nan # Not a Number NaN
df_with_nulls.loc[1, 'Age'] = np.nan
df_with_nulls

In [None]:
df_with_nulls.isna()

In [None]:
df_filled = df_with_nulls.fillna({'Salary': df_with_nulls['Salary'].mean(), 'Age': df_with_nulls['Age'].median()})
df_filled

# SECTION 7: ADVANCED DATAFRAME MANIPULATION

Advanced DataFrame manipulation includes merging/joining data (similar to SQL JOINs)
and sophisticated grouping operations. These are essential skills for working
with multiple datasets and complex data analysis.

## Merge and Join Operations

In [None]:
# Customers DataFrame
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'email': ['alice@email.com', 'bob@email.com', 'charlie@email.com', 'diana@email.com', 'eve@email.com'],
    'city': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle']
})
customers

In [None]:
# Orders DataFrame
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104, 105, 106, 107],
    'customer_id': [1, 2, 1, 3, 4, 2, 10],
    'product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones', 'Tablet', 'Phone'],
    'amount': [999, 25, 75, 299, 150, 399, 100],
    'order_date': ['2024-01-15', '2024-01-16', '2024-01-17', '2024-01-18', '2024-01-19', '2024-01-20', '2024-01-21']
})

print("Orders DataFrame:")
orders

In [None]:
# Inner Join (default merge type)
print("--- Inner Join (default) ---")
print("Combines rows where customer_id exists in both DataFrames:")
inner_merge = pd.merge(customers, orders, on='customer_id', how='inner')
inner_merge

In [None]:
# Left Join
print("--- Left Join ---")
print("Keeps all rows from left DataFrame (customers):")
left_merge = pd.merge(customers, orders, on='customer_id', how='left')
left_merge

In [None]:
left_merge[left_merge["order_id"].isna()]

In [None]:
# Right Join
print("--- Right Join ---")
print("Keeps all rows from right DataFrame (orders):")
right_merge = pd.merge(customers, orders, on='customer_id', how='right')
right_merge

In [None]:
# Outer Join
print("--- Outer Join ---")
print("Keeps all rows from both DataFrames:")
outer_merge = pd.merge(customers, orders, on='customer_id', how='outer')
outer_merge
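
When comparing join types it can help to see which side each row came from. The sketch below uses the `indicator` parameter of `pd.merge` on two made-up miniature tables (`left` and `right` are illustrative, not the tutorial's data).

```python
import pandas as pd

# Hypothetical miniature versions of a customers and an orders table
left = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ana', 'Ben', 'Cy']})
right = pd.DataFrame({'customer_id': [1, 1, 4], 'amount': [50, 20, 99]})

# indicator=True adds a `_merge` column: 'both', 'left_only' or 'right_only'
merged = pd.merge(left, right, on='customer_id', how='outer', indicator=True)
print(merged)

# Customers with no matching order
unmatched_left = merged.loc[merged['_merge'] == 'left_only', 'customer_id']
```

Filtering on `_merge` is one way to audit a join before deciding between `inner`, `left`, `right`, and `outer`.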

In [None]:
# Merge with different column names
print("--- Merge with Different Column Names ---")
print("When joining columns have different names:")

customers_alt = customers.copy()
customers_alt.rename(columns={'customer_id': 'cust_id'}, inplace=True)

print("Customers with renamed column:")
customers_alt

In [None]:
print("Merge using left_on and right_on:")
merge_different_names = pd.merge(customers_alt, orders, 
                                left_on='cust_id', right_on='customer_id', 
                                how='inner')
merge_different_names

In [None]:
# Multiple key merge
print("--- Multiple Key Merge ---")
print("Merging on multiple columns:")

# Create DataFrames with multiple keys
employees = pd.DataFrame({
    'dept_id': [1, 1, 2, 2, 3],
    'emp_id': [101, 102, 201, 202, 301],
    'name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'salary': [50000, 55000, 60000, 65000, 70000]
})

departments = pd.DataFrame({
    'dept_id': [1, 1, 2, 2, 3],
    'emp_id': [101, 102, 201, 202, 301],
    'dept_name': ['IT', 'IT', 'HR', 'HR', 'Finance'],
    'location': ['Floor 1', 'Floor 1', 'Floor 2', 'Floor 2', 'Floor 3']
})

employees

In [None]:
departments


In [None]:
print("Merge on multiple keys (dept_id and emp_id):")
multi_key_merge = pd.merge(employees, departments, on=['dept_id', 'emp_id'])
multi_key_merge

## Advanced GroupBy Operations

In [None]:
# Advanced GroupBy Operations
print("--- Advanced GroupBy Operations ---")
print("GroupBy is one of pandas' most powerful features for data analysis.")

# Create a more complex dataset for grouping
sales_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=20, freq='D'),
    'region': ['North', 'South', 'East', 'West'] * 5,
    'product': ['A', 'B', 'C', 'D'] * 5,
    'sales_amount': np.random.randint(100, 1000, 20),
    'units_sold': np.random.randint(10, 100, 20),
    'salesperson': ['John', 'Jane', 'Bob', 'Alice'] * 5
})

print("Sales Data:")
sales_data

In [None]:
# Basic grouping
print("--- Basic Grouping ---")
print("Group by region and calculate total sales:")
region_sales = sales_data.groupby('region')['sales_amount'].sum().reset_index()
region_sales

In [None]:
# Multiple aggregations
print("--- Multiple Aggregations ---")
print("Group by region and calculate multiple statistics:")
region_stats = sales_data.groupby('region').agg({
    'sales_amount': ['sum', 'count', 'std'],
    'units_sold': ['sum', 'mean', 'max', 'min']
})
region_stats.reset_index()

In [None]:
# Group by multiple columns
print("--- Group by Multiple Columns ---")
print("Group by region and product:")
region_product = sales_data.groupby(['region', 'product']).agg({
    'sales_amount': 'sum',
    'units_sold': 'sum'
}).reset_index()
region_product

In [None]:
# Custom aggregation functions
print("--- Custom Aggregation Functions ---")
print("Group by salesperson and calculate custom metrics:")

def custom_agg(x):
    return pd.Series({
        'total_sales': x['sales_amount'].sum(),
        'avg_order_value': x['sales_amount'].mean(),
        'num_orders': len(x),
        'best_day': x.loc[x['sales_amount'].idxmax(), 'date']
    })

salesperson_analysis = sales_data.groupby('salesperson').apply(custom_agg)
salesperson_analysis
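
When every metric is a standard reduction, a similar summary can often be written without a custom function by using named aggregation. This is a minimal sketch on made-up data (`sales` is hypothetical). The `best_day` metric from the cell above is left out because it needs a row lookup rather than a plain reduction.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'salesperson': ['John', 'Jane', 'John', 'Jane'],
    'sales_amount': [100, 300, 200, 500],
})

# Named aggregation: new_column=(source_column, function)
summary = sales.groupby('salesperson').agg(
    total_sales=('sales_amount', 'sum'),
    avg_order_value=('sales_amount', 'mean'),
    num_orders=('sales_amount', 'count'),
)
print(summary)
```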

In [None]:
# Transform operations
print("--- Transform Operations ---")
print("Add columns with group-level calculations:")

# Calculate percentage of total sales by region
sales_data['pct_of_region'] = sales_data.groupby('region')['sales_amount'].transform(
    lambda x: x / x.sum() * 100
)

sales_data[sales_data["region"] == "North"]
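
The key property of `transform` is that its result has one value per original row, while an aggregation returns one value per group. The sketch below shows the difference on a small made-up table (`sales`), using the built-in `'sum'` instead of a lambda.

```python
import pandas as pd

# Hypothetical sales records
sales = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'sales_amount': [100, 300, 50, 150],
})

grouped = sales.groupby('region')['sales_amount']

totals = grouped.sum()                    # one row per group
row_totals = grouped.transform('sum')     # one row per original row

# Because row_totals lines up with the original rows, it can be used directly
sales['pct_of_region'] = sales['sales_amount'] / row_totals * 100
print(sales)
```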

In [None]:
round(sales_data['pct_of_region'], 2)

In [None]:
sales_data

In [None]:
# Pivot tables
print("--- Pivot Tables ---")
print("Create a pivot table of units sold by region and product, split by salesperson:")
pivot_table = sales_data.pivot_table(
    values='units_sold',
    index=['region', 'product'],
    columns='salesperson',
    aggfunc='sum',
    fill_value=0
)
pivot_table

In [None]:
# Filter within groups
print("--- Filter Within Groups ---")
print("Keep only top 2 sales days per region:")

def top_n_sales(group, n=2):
    return group.nlargest(n, 'sales_amount')

top_sales_by_region = sales_data.groupby('region').apply(top_n_sales)
print("Top 2 sales days per region:")

top_sales_by_region[['region', 'date', 'sales_amount']]

In [192]:
# Apply with groupby
print("--- Apply with GroupBy ---")
print("Apply functions to groups of data:")

# Create sales data for grouping
sales_group_data = pd.DataFrame({
    'region': ['North', 'South', 'East', 'West'] * 3,
    'product': ['A', 'B', 'C'] * 4,
    'sales': [100, 150, 200, 120, 180, 220, 90, 160, 190, 110, 170, 210],
    'cost': [60, 90, 120, 70, 100, 130, 50, 85, 110, 65, 95, 125]
})

print("Sales data for grouping:")
sales_group_data

--- Apply with GroupBy ---
Apply functions to groups of data:
Sales data for grouping:


Unnamed: 0,region,product,sales,cost
0,North,A,100,60
1,South,B,150,90
2,East,C,200,120
3,West,A,120,70
4,North,B,180,100
5,South,C,220,130
6,East,A,90,50
7,West,B,160,85
8,North,C,190,110
9,South,A,110,65


In [193]:
# Apply custom function to each group
def analyze_group(group):
    return pd.Series({
        'total_sales': group['sales'].sum(),
        'total_cost': group['cost'].sum(),
        'profit': group['sales'].sum() - group['cost'].sum(),
        'profit_margin': (group['sales'].sum() - group['cost'].sum()) / group['sales'].sum() * 100,
        'avg_sale': group['sales'].mean(),
        'num_transactions': len(group)
    })

# Apply to groups
group_analysis = sales_group_data.groupby('region').apply(analyze_group)
print("Group analysis by region:")
group_analysis

Group analysis by region:


region,total_sales,total_cost,profit,profit_margin,avg_sale,num_transactions
East,460.0,265.0,195.0,42.391304,153.333333,3.0
North,470.0,270.0,200.0,42.553191,156.666667,3.0
South,480.0,285.0,195.0,40.625,160.0,3.0
West,490.0,280.0,210.0,42.857143,163.333333,3.0


In [None]:
# Cross-tabulation
print("--- Cross-Tabulation ---")
print("Cross-tab of region vs product:")
cross_tab = pd.crosstab(sales_data['region'], sales_data['product'], 
                        values=sales_data['sales_amount'], aggfunc='sum')
cross_tab

# SECTION 8: THE POWERFUL APPLY METHOD

The apply method is one of pandas' most versatile tools, allowing you to apply
custom functions to DataFrames, Series, or groups of data. It's like having
a Swiss Army knife for data transformation!

## Apply on Series

In [None]:
# Create a sample Series
sample_series = pd.Series([1, 4, 9, 16, 25, 36, 49, 64, 81, 100])
print("Original Series:")
sample_series

In [None]:
# Apply mathematical functions
print("Square root of each number:")
sqrt_series = sample_series.apply(np.sqrt)
sqrt_series

In [None]:
# Apply custom function
def categorize_number(x):
    if x < 10:
        return 'Small'
    elif x < 50:
        return 'Medium'
    else:
        return 'Large'

print("Categorize numbers by size:")
categorized = sample_series.apply(categorize_number)
categorized

In [None]:
# Apply with lambda functions
print("Double each number:")
doubled = sample_series.apply(lambda x: x * 2)
doubled


## Apply on DataFrame columns

In [None]:

# Apply on DataFrame columns
print("--- Apply on DataFrame Columns ---")
print("Apply functions to entire columns:")

In [None]:
sample_df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
    'department': ['IT', 'HR', 'IT', 'Finance', 'Marketing']
})

print("Sample DataFrame:")
sample_df

In [None]:
# Apply to string columns
print("Convert names to uppercase:")
sample_df['name'] = sample_df['name'].apply(lambda x: x.upper())
sample_df


In [None]:
# Apply to numeric columns
print("Calculate salary with 10% bonus:")
sample_df['bonus'] = sample_df['salary'].apply(lambda x: x * 0.10)
sample_df

In [None]:
# Apply to multiple columns
print("Calculate total compensation (salary + bonus):")
sample_df['total_comp'] = sample_df.apply(
    lambda row: row['salary'] + row['bonus'], axis=1
)
sample_df

## Apply on DataFrame rows (axis=1)

In [None]:
# Create a more complex DataFrame
employees = pd.DataFrame({
    'first_name': ['John', 'Jane', 'Bob', 'Alice', 'Charlie'],
    'last_name': ['Smith', 'Doe', 'Johnson', 'Brown', 'Wilson'],
    'hours_worked': [40, 35, 45, 38, 42],
    'hourly_rate': [25, 30, 22, 28, 26],
    'overtime_hours': [5, 0, 10, 2, 8]
})

print("Employees DataFrame:")
employees

In [None]:
# Calculate weekly pay for each employee
def calculate_weekly_pay(row):
    regular_pay = row['hours_worked'] * row['hourly_rate']
    overtime_pay = row['overtime_hours'] * row['hourly_rate'] * 1.5
    return regular_pay + overtime_pay

employees['weekly_pay'] = employees.apply(calculate_weekly_pay, axis=1)
print("Employees with weekly pay calculation:")
employees


In [None]:
# Apply with conditional logic
def categorize_employee(row):
    if row['weekly_pay'] > 1500:
        return 'High Earner'
    elif row['weekly_pay'] > 1000:
        return 'Medium Earner'
    else:
        return 'Low Earner'

employees['earner_category'] = employees.apply(categorize_employee, axis=1)
print("Employees with earning categories:")
employees

In [None]:
# Apply with multiple return values
def analyze_employee(row):
    efficiency = row['hours_worked'] / (row['hours_worked'] + row['overtime_hours'])
    cost_per_hour = row['weekly_pay'] / (row['hours_worked'] + row['overtime_hours'])
    return pd.Series({
        'efficiency': round(efficiency, 2),
        'cost_per_hour': round(cost_per_hour, 2)
    })

# Apply and expand results
analysis_results = employees.apply(analyze_employee, axis=1)
analysis_results

In [None]:
employees = pd.concat([employees, analysis_results], axis=1)
employees

## Vectorized operations vs apply

In [198]:
print("--- Vectorized Operations vs Apply ---")
print("When to use apply vs vectorized operations:")

# Create a large dataset for comparison
large_data = pd.DataFrame({
    'x': np.random.randn(10000000),
    'y': np.random.randn(10000000)
})

print("Large dataset shape:", large_data.shape)
print()

# Vectorized operation (faster)
print("Vectorized operation (recommended):")
vectorized_result = large_data['x'] ** 2 + large_data['y'] ** 2
print("Result shape:", vectorized_result.shape)
print("First 5 values:", vectorized_result.head().values)
print()

--- Vectorized Operations vs Apply ---
When to use apply vs vectorized operations:
Large dataset shape: (10000000, 2)

Vectorized operation (recommended):
Result shape: (10000000,)
First 5 values: [0.17059513 1.68615053 8.75483424 2.66142221 1.43857283]



In [195]:
large_data

Unnamed: 0,x,y
0,2.473697,1.314079
1,0.431847,0.695701
2,1.167145,-0.294532
3,-1.751994,-0.432979
4,-1.910059,-0.115126
...,...,...
9995,-0.771564,-2.139121
9996,0.437799,-0.454341
9997,-0.228791,1.136136
9998,-1.456464,-0.143452


In [199]:
# Apply operation (slower for large datasets)
print("Apply operation (use sparingly for large datasets):")
def vector_operation(row):
    return row['x'] ** 2 + row['y'] ** 2

apply_result = large_data.apply(vector_operation, axis=1)
print("Result shape:", apply_result.shape)
print("First 5 values:", apply_result.head().values)
print()

print("Note: Vectorized operations are much faster for large datasets!")
print("Use apply when you need custom logic that can't be vectorized.")
print()


Apply operation (use sparingly for large datasets):
Result shape: (10000000,)
First 5 values: [0.17059513 1.68615053 8.75483424 2.66142221 1.43857283]

Note: Vectorized operations are much faster for large datasets!
Use apply when you need custom logic that can't be vectorized.



# SECTION 9: STUDENT EXERCISES

This document contains all the exercises from the pandas comprehensive tutorial, organized by difficulty level and concept area.

### Exercise 1: Basic Creation
**Objective:** Create Series and DataFrames from scratch

**Tasks:**
- Create a Series with the numbers 1, 4, 9, 16, 25 and labels 'a', 'b', 'c', 'd', 'e'
- Create a DataFrame with columns: 'Product', 'Price', 'Stock' and at least 3 rows of data

**Expected Output:** Series with square numbers, DataFrame with product information

In [None]:
# Sample data for reference
sample_series_data = [1, 4, 9, 16, 25]
sample_series_labels = ['a', 'b', 'c', 'd', 'e']

sample_products = ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones']
sample_prices = [999, 25, 75, 299, 150]
sample_stock = [10, 50, 30, 15, 25]

### Exercise 2: Indexing and Selection
**Objective:** Practice various indexing and selection methods

**Tasks:** Using the DataFrame 'df' defined in the cell below:
- Select only the 'Name' and 'City' columns
- Select the first 2 rows
- Select people older than 28
- Select people with salary between 50000 and 70000

In [None]:
# Sample DataFrame for this exercise
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 28, 32, 29, 27],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle', 'Miami', 'Denver'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000, 52000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'Sales', 'IT']
})
df

### Exercise 3: Data Manipulation
**Objective:** Learn basic data manipulation techniques

**Tasks:** From the DataFrame 'df':
- Add a new column 'Bonus' that is 10% of salary
- Sort the data by age in descending order
- Calculate the average salary by city
- Find the person with the highest salary

In [None]:
# Use the same DataFrame from Exercise 2
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 28, 32, 29, 27],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle', 'Miami', 'Denver'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000, 52000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'Sales', 'IT']
})
df

### Exercise 4: Conditional Operations
**Objective:** Master conditional filtering and boolean indexing

**Tasks:** Create a new DataFrame that contains:
- People from cities with more than 3 letters
- People whose names start with a vowel (A, E, I, O, U)
- People with age + salary > 80000

In [None]:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, 35, 28, 32, 29, 27],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle', 'Miami', 'Denver'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000, 52000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'Sales', 'IT']
})
df

### Exercise 5: Data Cleaning
**Objective:** Practice handling missing data and data cleaning

**Tasks:** 
- Identify all missing values
- Fill numeric missing values with median
- Fill string missing values with 'Unknown'
- Remove any rows that still have missing values

In [None]:
# Sample DataFrame with missing values for this exercise
df_missing = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'Age': [25, 30, np.nan, 28, 32, 29, np.nan],
    'City': ['NYC', 'LA', 'Chicago', 'Boston', 'Seattle', np.nan, 'Denver'],
    'Salary': [50000, 60000, 70000, 55000, 65000, 58000, 52000],
    'Department': ['IT', 'HR', 'IT', 'Finance', 'Marketing', 'Sales', 'IT'],
    'Experience': [2, 5, np.nan, 3, 7, 4, 1]
})
df_missing

### Exercise 6: Merge and Join Operations
**Objective:** Master SQL-like JOIN operations in pandas

**Required Operations:**
- Perform an inner join to see all students with their courses
- Perform a left join to see all students (even those without courses)
- Perform a right join to see all courses (even those without students)
- Calculate average GPA by major for students who have taken courses


In [None]:
# DataFrame 1: Students
students = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5, 6, 7],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace'],
    'major': ['CS', 'Math', 'CS', 'Physics', 'Math', 'Engineering', 'Biology'],
    'gpa': [3.8, 3.5, 3.9, 3.7, 3.6, 3.4, 3.8]
})

# DataFrame 2: Courses
courses = pd.DataFrame({
    'course_id': [101, 102, 103, 104, 105, 106, 107, 108],
    'student_id': [1, 1, 2, 3, 4, 5, 6, 7],
    'course_name': ['Python', 'Data Structures', 'Calculus', 'Algorithms', 'Thermodynamics', 'Linear Algebra', 'Mechanics', 'Genetics'],
    'grade': ['A', 'A-', 'B+', 'A', 'B', 'A-', 'B+', 'A']
})

### Exercise 7: Advanced GroupBy Operations
**Objective:** Master sophisticated grouping and aggregation

**Required Operations:**
- Group by region and calculate total sales, average order value, and count
- Group by region and product to see sales breakdown
- Find the top 3 salespeople by total sales
- Calculate the percentage of total sales each product contributes
- Create a pivot table showing sales by region and product


In [None]:
# Create a 30-day date range starting on 2024-01-01
dates = pd.date_range('2024-01-01', periods=30, freq='D')

# Generate sample sales data
np.random.seed(42)  # For reproducible results
sales_data = pd.DataFrame({
    'date': np.random.choice(dates, 100),
    'product': np.random.choice(['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Headphones'], 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'salesperson': np.random.choice(['John', 'Jane', 'Bob', 'Alice', 'Charlie'], 100),
    'amount': np.random.randint(100, 2000, 100)
})
sales_data

### Exercise 8: The Apply Method
**Objective:** Master the versatile apply method for custom operations

**Required Operations:**
- Use apply to calculate the average score for each student
- Use apply to categorize students as 'Excellent' (above 90), 'Good' (above 80, up to 90), 'Average' (70 to 80), 'Below Average' (below 70)
- Use apply to create a 'grade_point' column (A=4.0, B=3.0, C=2.0, D=1.0, F=0.0)
- Use apply with axis=1 to calculate a 'performance_index' (weighted average: math*0.4, science*0.35, english*0.25)
- Use apply to find the subject with the highest score for each student

In [None]:
# Sample student dataset for this exercise
students_scores = pd.DataFrame({
    'student_id': [1, 2, 3, 4, 5, 6, 7, 8],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'math_score': [95, 87, 92, 78, 88, 91, 85, 79],
    'science_score': [88, 91, 85, 82, 90, 87, 89, 84],
    'english_score': [92, 85, 89, 75, 87, 90, 83, 88]
})
students_scores

## Complex Exercises (Advanced Projects)

### Complex Exercise 1: Sales Analysis
**Objective:** Business intelligence and sales analytics

**Tasks:**
1. Calculate daily, weekly, and monthly sales totals
2. Find top 5 products by revenue and units sold
3. Analyze sales performance by region and salesperson
4. Identify customer segments with highest average order value
5. Create a pivot table showing sales by category and region

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create a 30-day date range starting on 2024-01-01
dates = pd.date_range('2024-01-01', periods=30, freq='D')

# Generate sample data
n_records = 200
sales_comprehensive = pd.DataFrame({
    'Date': np.random.choice(dates, n_records),
    'Product_ID': np.random.randint(1001, 1011, n_records),
    'Product_Name': np.random.choice(['Laptop Pro', 'Wireless Mouse', 'Mechanical Keyboard', '4K Monitor', 
                                     'Noise-Canceling Headphones', 'Webcam HD', 'USB-C Hub', 'SSD 1TB', 
                                     'RAM 16GB', 'Graphics Card'], n_records),
    'Category': np.random.choice(['Electronics', 'Accessories', 'Components'], n_records),
    'Sales_Amount': np.random.randint(50, 2500, n_records),
    'Units_Sold': np.random.randint(1, 10, n_records),
    'Customer_ID': np.random.randint(10001, 10101, n_records),
    'Customer_Segment': np.random.choice(['Premium', 'Standard', 'Budget'], n_records),
    'Region': np.random.choice(['North', 'South', 'East', 'West', 'Central'], n_records),
    'Salesperson_ID': np.random.randint(2001, 2011, n_records)
})
sales_comprehensive

### Complex Exercise 2: Customer Churn Analysis
**Objective:** Customer analytics and retention analysis

**Tasks:**
1. Calculate customer lifetime value for each customer
2. Identify factors correlated with churn using conditional indexing
3. Create customer segments based on usage patterns
4. Analyze churn rates by demographic and subscription factors
5. Build a summary report with key insights and recommendations


In [None]:
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample customer data
n_customers = 1000
customers = pd.DataFrame({
    'customer_id': range(1, n_customers + 1),
    'age': np.random.normal(35, 12, n_customers).clip(18, 80).astype(int),
    'gender': np.random.choice(['Male', 'Female'], n_customers),
    'location': np.random.choice(['Urban', 'Suburban', 'Rural'], n_customers),
    'subscription_plan': np.random.choice(['Basic', 'Premium', 'Enterprise'], n_customers, p=[0.5, 0.3, 0.2]),
    'start_date': [datetime.now() - timedelta(days=np.random.randint(30, 1000)) for _ in range(n_customers)],
    'monthly_fee': np.random.choice([29, 49, 99], n_customers),
    'monthly_calls': np.random.poisson(50, n_customers),
    'monthly_data_usage_gb': np.random.exponential(5, n_customers),
    'support_tickets_last_6m': np.random.poisson(2, n_customers),
    'payment_method': np.random.choice(['Credit Card', 'Debit Card', 'Bank Transfer'], n_customers),
    'churned': np.random.choice([0, 1], n_customers, p=[0.8, 0.2])  # 20% churn rate
})

# Calculate tenure in months
customers['tenure_months'] = ((datetime.now() - customers['start_date']).dt.days / 30).astype(int)
customers

### Complex Exercise 3: E-commerce Analytics
**Objective:** Online business metrics and customer journey analysis

**Tasks:**
1. Calculate conversion rates from browsing to purchase
2. Analyze customer journey and identify drop-off points



In [None]:
from datetime import datetime, timedelta

# Set random seed for reproducibility
np.random.seed(42)

# Generate sample e-commerce data
n_customers = 500
n_products = 50
n_orders = 2000

# Customer behavior data
customer_behavior = pd.DataFrame({
    'customer_id': np.random.randint(1, n_customers + 1, n_orders),
    'session_id': np.random.randint(1, 1001, n_orders),
    'browsing_time_minutes': np.random.exponential(15, n_orders),
    'pages_viewed': np.random.poisson(8, n_orders),
    'cart_adds': np.random.poisson(2, n_orders),
    'purchases': np.random.choice([0, 1], n_orders, p=[0.7, 0.3]),  # 30% conversion rate
    'campaign_source': np.random.choice(['Google Ads', 'Facebook', 'Email', 'Organic', 'Direct'], n_orders),
    'discount_applied': np.random.choice([0, 0.1, 0.15, 0.2], n_orders, p=[0.6, 0.2, 0.15, 0.05])
})

# Product catalog
products = pd.DataFrame({
    'product_id': range(1, n_products + 1),
    'name': [f'Product_{i}' for i in range(1, n_products + 1)],
    'category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Books', 'Sports'], n_products),
    'price': np.random.uniform(10, 500, n_products),
    'inventory': np.random.randint(10, 100, n_products)
})

# Order details
orders = pd.DataFrame({
    'order_id': range(1, n_orders + 1),
    'customer_id': np.random.randint(1, n_customers + 1, n_orders),
    'order_date': [datetime.now() - timedelta(days=np.random.randint(1, 90)) for _ in range(n_orders)],
    'total_amount': np.random.uniform(25, 1000, n_orders),
    'shipping_cost': np.random.choice([0, 5, 10, 15], n_orders),
    'payment_method': np.random.choice(['Credit Card', 'PayPal', 'Apple Pay'], n_orders)
})