import pandas as pd
import numpy as np

# Set a seed for reproducibility
np.random.seed(42)

data = {
    'Region': np.random.choice(['North', 'South', 'East', 'West'], size=100),
    'Product_Category': np.random.choice(['Electronics', 'Clothing', 'Home Goods', 'Food'], size=100),
    'Sales_USD': np.random.randint(50, 500, size=100),
    'Units_Sold': np.random.randint(1, 10, size=100),
    'Customer_Rating': np.random.randint(1, 6, size=100), # 1 to 5 stars
    'Employee_Performance_Score': np.random.uniform(60, 100, size=100),
    'Has_Warranty': np.random.choice([True, False], size=100, p=[0.7, 0.3]),
    'Customer_Segment': np.random.choice(['New', 'Returning', 'Loyal'], size=100, p=[0.2, 0.5, 0.3]),
    'Delivery_Time_Days': np.random.randint(1, 15, size=100)
}
df = pd.DataFrame(data)

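# Introduce some NaN values so the exercises also cover missing data
for col in ['Sales_USD', 'Customer_Rating', 'Employee_Performance_Score']:
    # Cast to float first: integer columns cannot hold NaN
    df[col] = df[col].astype(float)
    df.loc[np.random.choice(df.index, 5, replace=False), col] = np.nan

print("Initial DataFrame Head:")
print(df.head())
print("\nInitial DataFrame Info:")
df.info()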

Initial DataFrame Head:
  Region Product_Category  Sales_USD  Units_Sold  Customer_Rating  \
0   East       Home Goods      112.0           7              1.0   
1   West         Clothing      401.0           8              1.0   
2  North         Clothing      280.0           1              NaN   
3   East             Food      290.0           6              4.0   
4   East         Clothing      101.0           8              5.0   

   Employee_Performance_Score  Has_Warranty Customer_Segment  \
0                   86.407895          True            Loyal   
1                   71.197356          True            Loyal   
2                   98.194611          True            Loyal   
3                   89.515877          True        Returning   
4                   82.174162          True        Returning   

   Delivery_Time_Days  
0                   8  
1                   6  
2                  14  
3                  10  
4                  12  

Initial DataFrame Info:
<class 

## Exercise 1: Basic Descriptive Statistics

**Task:** Calculate the basic descriptive statistics (mean, median, min, max, std, count, quartiles) for all numerical columns in the DataFrame.
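One possible approach, sketched on a small hand-made frame rather than the random `df` above. The name `sample` and its values are hypothetical, chosen only so the results are easy to check by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df: same column names, far fewer rows
sample = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West'],
    'Sales_USD': [100.0, 200.0, np.nan, 400.0],
    'Units_Sold': [1, 2, 3, 4],
})

# describe() covers numeric columns only by default and skips NaN,
# so 'count' for Sales_USD is lower than the number of rows
stats = sample.describe()
print(stats)

# The median appears in describe() as the '50%' row;
# median() returns it directly
medians = sample.median(numeric_only=True)
print(medians)
```

The same two calls should work unchanged on `df`.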

## Exercise 2: Correlation Analysis

**Scenario:** You want to understand the linear relationships between different numerical variables in your sales data.

**Tasks:**

1. Calculate the pairwise correlation matrix for all numerical columns.

2. Specifically, find the correlation between 'Sales_USD' and 'Units_Sold'.

3. Find the correlation between 'Sales_USD' and 'Employee_Performance_Score'.
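The three tasks above could be approached as follows. This is a sketch on a hypothetical toy frame named `sample`, with values picked so the correlations come out as exactly 1 and -1. It assumes pandas 1.5 or later, where `corr()` accepts `numeric_only`.

```python
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Region': ['North', 'South', 'East', 'West', 'East'],
    'Sales_USD': [100.0, 200.0, 300.0, 400.0, 500.0],
    'Units_Sold': [1, 2, 3, 4, 5],
    'Employee_Performance_Score': [90.0, 80.0, 70.0, 60.0, 50.0],
})

# numeric_only=True leaves out text columns such as Region;
# in pandas 2.x, corr() raises on them instead of dropping them silently
corr_matrix = sample.corr(numeric_only=True)

# A single pair: either call Series.corr() or look it up in the matrix
sales_units = sample['Sales_USD'].corr(sample['Units_Sold'])
sales_score = corr_matrix.loc['Sales_USD', 'Employee_Performance_Score']

print(corr_matrix)
print(sales_units, sales_score)
```

Both routes use Pearson correlation by default and ignore rows where either value is missing.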

## Exercise 3: Covariance Analysis

**Scenario:** You want to understand how two numerical variables in your sales data change together.

**Tasks:**

1. Calculate the pairwise covariance matrix for all numerical columns.

2. Specifically, find the covariance between 'Sales_USD' and 'Units_Sold'.
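A minimal sketch of both covariance tasks, again on a hypothetical toy frame (`sample` is not part of the exercise data). With three rows the sample covariance can be verified by hand: ((-100)(-1) + 0 + (100)(1)) / (3 - 1) = 100.

```python
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Region': ['North', 'South', 'East'],
    'Sales_USD': [100.0, 200.0, 300.0],
    'Units_Sold': [1, 2, 3],
})

# Pairwise covariance of the numeric columns (sample covariance, ddof=1)
cov_matrix = sample.cov(numeric_only=True)

# Covariance of one specific pair
sales_units_cov = sample['Sales_USD'].cov(sample['Units_Sold'])

print(cov_matrix)
print(sales_units_cov)
```

Unlike correlation, covariance depends on the units of both columns, so its magnitude is not comparable across pairs.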

## Exercise 4: Unique Values and Counts - Categorical Columns

**Scenario:** You need to explore the distinct categories and their frequencies within your categorical data.

**Tasks:**

1. Find all unique values in the 'Region' column.

2. Count the occurrences of each unique value in the 'Product_Category' column.

3. Count the number of unique customer segments.
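The three tasks map onto `unique()`, `value_counts()`, and `nunique()`. The sketch below uses a hypothetical four-row frame called `sample` in place of `df`.

```python
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Region': ['North', 'South', 'North', 'East'],
    'Product_Category': ['Food', 'Food', 'Clothing', 'Food'],
    'Customer_Segment': ['New', 'Loyal', 'Loyal', 'New'],
})

# Distinct values, in order of first appearance
regions = sample['Region'].unique()

# Frequency of each value, most common first
category_counts = sample['Product_Category'].value_counts()

# Number of distinct values
n_segments = sample['Customer_Segment'].nunique()

print(regions)
print(category_counts)
print(n_segments)
```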

## Exercise 5: Unique Values and Counts - Numerical/Boolean Columns

**Scenario:** You want to examine the distinct entries and their frequencies in numerical and boolean columns, including columns that contain missing values.

**Tasks:**

1. Find all unique values in the 'Customer_Rating' column.

2. Count the occurrences of each unique value in the 'Has_Warranty' column.

3. Count the number of unique 'Delivery_Time_Days' values.
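The same methods apply to numerical and boolean columns; the main difference is how missing values are treated. A sketch on a hypothetical toy frame (`sample` and its values are assumptions made for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df, with one missing rating
sample = pd.DataFrame({
    'Customer_Rating': [5.0, np.nan, 4.0, 5.0],
    'Has_Warranty': [True, False, True, True],
    'Delivery_Time_Days': [3, 7, 3, 10],
})

# unique() keeps NaN as one of the distinct values
ratings = sample['Customer_Rating'].unique()

# value_counts() drops NaN unless dropna=False is passed
warranty_counts = sample['Has_Warranty'].value_counts()
rating_counts = sample['Customer_Rating'].value_counts(dropna=False)

# nunique() excludes NaN by default
n_delivery = sample['Delivery_Time_Days'].nunique()

print(ratings)
print(warranty_counts)
print(rating_counts)
print(n_delivery)
```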

## Exercise 6: Membership Check - isin() for Filtering

**Scenario:** You need to select rows where specific columns contain values from a predefined set.

**Tasks:**

1. Filter the DataFrame to show only rows where 'Product_Category' is either 'Electronics' or 'Food'.

2. Filter the DataFrame to show rows where 'Customer_Rating' is among [4, 5] (meaning 4 or 5 stars).
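Both filters can be written as a boolean mask built with `isin()`. The frame `sample` below is a hypothetical stand-in for `df`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Product_Category': ['Electronics', 'Clothing', 'Food', 'Home Goods'],
    'Customer_Rating': [5.0, 4.0, np.nan, 2.0],
})

# isin() returns a boolean Series; indexing with it keeps the True rows
food_or_electronics = sample[sample['Product_Category'].isin(['Electronics', 'Food'])]

# Integer targets match the float ratings (5 == 5.0); NaN never matches
top_rated = sample[sample['Customer_Rating'].isin([4, 5])]

print(food_or_electronics)
print(top_rated)
```

Prefixing the mask with `~` selects the complement, that is, the rows whose value is not in the list.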

## Exercise 7: Membership Check - isin() for Creating a New Column

**Scenario:** You want to flag rows that belong to a predefined set of values by storing the result of a membership check in a new boolean column.

**Tasks:**

1. Create a new boolean column named 'Is_High_Value_Product' which is True if 'Product_Category' is 'Electronics' or 'Home Goods', and False otherwise.

2. Create a new boolean column named 'Is_Top_Rated_Customer' which is True if 'Customer_Rating' is 5, and False otherwise. Decide explicitly how missing ratings are handled: either treat NaN as False or keep it as a missing value.
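A sketch of both columns on a hypothetical toy frame named `sample`. It shows the two NaN policies side by side. The name `top_rated_nullable` is introduced here only for illustration, and the nullable `'boolean'` dtype is one way to keep missing ratings missing, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Product_Category': ['Electronics', 'Clothing', 'Food', 'Home Goods'],
    'Customer_Rating': [5.0, 4.0, np.nan, 2.0],
})

# Membership check stored as a new column
sample['Is_High_Value_Product'] = sample['Product_Category'].isin(
    ['Electronics', 'Home Goods']
)

# Policy A: a missing rating counts as "not top rated" (NaN == 5 is False)
sample['Is_Top_Rated_Customer'] = sample['Customer_Rating'].eq(5)

# Policy B: keep missing ratings as <NA> using the nullable boolean dtype
top_rated_nullable = (
    sample['Customer_Rating'].eq(5)
    .astype('boolean')
    .mask(sample['Customer_Rating'].isna())
)

print(sample)
print(top_rated_nullable)
```

Policy A is simpler to filter on. Policy B preserves the difference between "rated below 5" and "not rated".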

## Exercise 8: Grouped Descriptive Statistics

**Scenario:** You want to summarize numerical data for specific subgroups within your dataset.

**Tasks:**

1. Calculate the average 'Sales_USD' and average 'Units_Sold' for each 'Region' in df.

2. Find the maximum 'Employee_Performance_Score' for each 'Product_Category' in df.

3. Count the number of customers (i.e., rows) in each 'Customer_Segment' in df.
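Each task is a `groupby()` followed by a different aggregation. The sketch uses a hypothetical toy frame, `sample`, with one missing sale to show that `mean()` skips NaN within a group.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Region': ['North', 'North', 'South', 'South'],
    'Product_Category': ['Food', 'Clothing', 'Food', 'Food'],
    'Customer_Segment': ['New', 'Loyal', 'Loyal', 'Loyal'],
    'Sales_USD': [100.0, 300.0, 200.0, np.nan],
    'Units_Sold': [2, 4, 6, 8],
    'Employee_Performance_Score': [70.0, 90.0, 80.0, 85.0],
})

# Mean of two columns per region (NaN is ignored within each group)
region_means = sample.groupby('Region')[['Sales_USD', 'Units_Sold']].mean()

# Maximum score per product category
max_score = sample.groupby('Product_Category')['Employee_Performance_Score'].max()

# size() counts rows per group, including rows that contain NaN
segment_sizes = sample.groupby('Customer_Segment').size()

print(region_means)
print(max_score)
print(segment_sizes)
```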

## Exercise 9: Grouped Correlation

**Scenario:** You need to analyze the relationship between two variables, but this relationship might vary across different segments of your data.

**Task:**

1. Calculate the correlation between 'Sales_USD' and 'Units_Sold' separately for each 'Region' in df.
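One way to get a per-group correlation is to apply `Series.corr()` inside each group. In the hypothetical toy frame below (`sample` is an assumption, not the exercise data), the relationship is deliberately positive in one region and negative in the other.

```python
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Region': ['North'] * 3 + ['South'] * 3,
    'Sales_USD': [100.0, 200.0, 300.0, 100.0, 200.0, 300.0],
    'Units_Sold': [1, 2, 3, 3, 2, 1],
})

# Select the two columns first, then correlate them within each region
corr_by_region = (
    sample.groupby('Region')[['Sales_USD', 'Units_Sold']]
    .apply(lambda g: g['Sales_USD'].corr(g['Units_Sold']))
)

print(corr_by_region)
```

Computed over all six rows at once, the correlation would be 0, which is why the per-region view matters here.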

## Exercise 10: Combined Unique Counts and Aggregation

**Scenario:** You want to understand the diversity of values within certain categories and identify the most common characteristics of groups.

**Tasks:**

1. For each 'Product_Category' in df, find the number of unique 'Customer_Segment' values associated with it.

2. For each 'Customer_Segment' in df, find the most frequent 'Product_Category' purchased by that segment.
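Both tasks combine `groupby()` with a per-group summary of a categorical column. The sketch below runs on a hypothetical toy frame named `sample`. Taking the first entry of `mode()` is one possible tie-breaking rule, assumed here for simplicity.

```python
import pandas as pd

# Hypothetical stand-in for df
sample = pd.DataFrame({
    'Product_Category': ['Food', 'Food', 'Clothing', 'Clothing', 'Food'],
    'Customer_Segment': ['New', 'New', 'Loyal', 'Loyal', 'Loyal'],
})

# Number of distinct segments seen in each product category
segments_per_category = (
    sample.groupby('Product_Category')['Customer_Segment'].nunique()
)

# Most frequent category per segment; mode() returns sorted values,
# so iloc[0] picks the alphabetically first one when there is a tie
top_category = (
    sample.groupby('Customer_Segment')['Product_Category']
    .agg(lambda s: s.mode().iloc[0])
)

print(segments_per_category)
print(top_category)
```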