## Pandas: Data Manipulation II

## Task 1: Reshaping DataFrames (Superstore Dataset)

### 1. Pivot Tables

#### a) Pivot total sales by Region and Category

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/Users/DELL/Downloads/Sample - Superstore.csv.zip', encoding='cp1252')

# Create a pivot table showing total sales by Region (rows) and Category (columns)
sales_pivot = df.pivot_table(
    values='Sales', 
    index='Region', 
    columns='Category', 
    aggfunc='sum'
)

print("Total Sales by Region and Category:")
print(sales_pivot)

Total Sales by Region and Category:
Category    Furniture  Office Supplies  Technology
Region                                            
Central   163797.1638       167026.415  170416.312
East      208291.2040       205516.055  264973.981
South     117298.6840       125651.313  148771.908
West      252612.7435       220853.249  251991.832


#### b) Pivot average profit by Segment and Region

In [None]:
# Create a pivot table showing average profit by Segment (rows) and Region (columns)
profit_pivot = df.pivot_table(
    values='Profit', 
    index='Segment', 
    columns='Region', 
    aggfunc='mean'
)

print("\nAverage Profit by Segment and Region:")
print(profit_pivot)


Average Profit by Segment and Region:
Region         Central       East      South       West
Segment                                                
Consumer      7.066046  28.040153  32.116435  34.360409
Corporate    27.791831  26.935666  29.833771  35.872323
Home Office  28.398202  53.205611  16.987626  28.949939
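
A pivot table with an `aggfunc` behaves much like a `groupby` followed by `unstack`. The sketch below illustrates that equivalence on a small made-up frame; the `toy` DataFrame and its values are assumptions for illustration, not Superstore data.

```python
import pandas as pd

# Toy data standing in for the Superstore columns (values are made up)
toy = pd.DataFrame({
    'Region':   ['East', 'East', 'West', 'West', 'West'],
    'Category': ['Furniture', 'Technology', 'Furniture', 'Furniture', 'Technology'],
    'Sales':    [100.0, 250.0, 80.0, 20.0, 300.0],
})

# Same aggregation expressed two ways
via_pivot = toy.pivot_table(values='Sales', index='Region', columns='Category', aggfunc='sum')
via_groupby = toy.groupby(['Region', 'Category'])['Sales'].sum().unstack('Category')

print(via_pivot)
print(via_groupby)
```

`pivot_table` is usually the more readable choice for a two-way summary, while the `groupby` form is easier to extend with several aggregations at once.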


### 2. Using melt() to unpivot data

In [None]:
# First create a wide format dataframe (sales by region and category)
wide_df = df.pivot_table(
    values='Sales', 
    index='Region', 
    columns='Category', 
    aggfunc='sum'
).reset_index()

print("\nWide Format DataFrame:")
print(wide_df)

# Now melt it back to long format
long_df = wide_df.melt(
    id_vars='Region',
    value_vars=['Furniture', 'Office Supplies', 'Technology'],
    var_name='Category',
    value_name='Total Sales'
)

print("\nLong Format DataFrame:")
print(long_df)


Wide Format DataFrame:
Category   Region    Furniture  Office Supplies  Technology
0         Central  163797.1638       167026.415  170416.312
1            East  208291.2040       205516.055  264973.981
2           South  117298.6840       125651.313  148771.908
3            West  252612.7435       220853.249  251991.832

Long Format DataFrame:
     Region         Category  Total Sales
0   Central        Furniture  163797.1638
1      East        Furniture  208291.2040
2     South        Furniture  117298.6840
3      West        Furniture  252612.7435
4   Central  Office Supplies  167026.4150
5      East  Office Supplies  205516.0550
6     South  Office Supplies  125651.3130
7      West  Office Supplies  220853.2490
8   Central       Technology  170416.3120
9      East       Technology  264973.9810
10    South       Technology  148771.9080
11     West       Technology  251991.8320
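
As a quick check that `melt()` loses no information, the long frame can be pivoted back to wide form. This is a minimal sketch on made-up numbers (the `wide` frame here is hypothetical); it also shows where the stray `Category` label in the wide output above comes from: it is the name of the columns axis, which `reset_index()` does not remove.

```python
import pandas as pd

# Small wide-format frame with made-up totals
wide = pd.DataFrame({
    'Region': ['East', 'West'],
    'Furniture': [100.0, 100.0],
    'Technology': [250.0, 300.0],
})

# Wide -> long: every non-id column becomes a (Category, Total Sales) pair
long = wide.melt(id_vars='Region', var_name='Category', value_name='Total Sales')

# Long -> wide: pivot() reverses melt() when each (Region, Category) pair is unique
restored = long.pivot(index='Region', columns='Category', values='Total Sales').reset_index()
restored.columns.name = None  # drop the leftover 'Category' axis label

print(long)
print(restored)
```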


## Task 2: Apply Custom Functions

### 1. Classifying Profit Margins

In [None]:
# Calculate profit margin (Profit/Sales)
df['Profit Margin'] = df['Profit'] / df['Sales']

# Define a function to classify profit margins
def classify_margin(margin):
    if margin > 0.2:  # 20% margin
        return 'High'
    elif margin > 0.1:  # 10-20% margin
        return 'Medium'
    elif margin >= 0:
        return 'Low'
    else:
        return 'Loss'  # Negative margin

# Apply the classification using lambda
df['Margin Class'] = df['Profit Margin'].apply(lambda x: classify_margin(x))

# Show distribution
print("\nProfit Margin Classification Distribution:")
print(df['Margin Class'].value_counts())


Profit Margin Classification Distribution:
Margin Class
High      5898
Loss      1871
Low       1291
Medium     934
Name: count, dtype: int64
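
The same thresholds can also be applied without a row-by-row function call. The sketch below uses `np.select` on a made-up Series (`margins` is a hypothetical stand-in for `df['Profit Margin']`); conditions are checked in order, so the first match wins, mirroring the `if`/`elif` chain.

```python
import numpy as np
import pandas as pd

# Made-up margins covering each branch, including the 0.10 and 0.0 boundaries
margins = pd.Series([0.35, 0.15, 0.10, 0.0, -0.05], name='Profit Margin')

# Conditions are evaluated in order; the first True one decides the label
conditions = [margins > 0.2, margins > 0.1, margins >= 0]
labels = ['High', 'Medium', 'Low']

margin_class = pd.Series(
    np.select(conditions, labels, default='Loss'),  # anything else is a loss
    index=margins.index,
    name='Margin Class',
)
print(margin_class.tolist())
```

On a frame of this size either approach is fast enough; the vectorized form mainly pays off on much larger data.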


### 2. Flagging Rows Based on Discount Percentage

In [None]:
# Flag rows whose discount exceeds 0.9 (Discount is stored as a fraction, so 0.9 = 90%)
df['Discount Flag'] = df['Discount'].apply(lambda x: 'High Discount' if x > 0.9 else 'Normal')

print("\nDiscount Flag Distribution:")
print(df['Discount Flag'].value_counts())


Discount Flag Distribution:
Discount Flag
Normal    8041
Name: count, dtype: int64
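
Every row lands in `Normal` above, which suggests no discount in the data exceeds the 0.9 cutoff. Before settling on a threshold it can help to look at the column's range. The sketch below does that on made-up values (`discounts` is a hypothetical stand-in for `df['Discount']`) and applies the flag with `np.where`.

```python
import numpy as np
import pandas as pd

# Made-up discount fractions, not taken from the dataset
discounts = pd.Series([0.0, 0.2, 0.5, 0.8], name='Discount')

threshold = 0.9  # same cutoff as the cell above
print("Max discount:", discounts.max())
print("Rows above threshold:", int((discounts > threshold).sum()))

# Vectorized equivalent of the lambda-based flag
flags = pd.Series(
    np.where(discounts > threshold, 'High Discount', 'Normal'),
    index=discounts.index,
    name='Discount Flag',
)
print(flags.value_counts())
```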


## Task 3: Mapping & Replacing Values

### 1. Mapping Full State Names to Codes

In [None]:
# Create a mapping dictionary for US states
state_to_abbr = {
    'Alabama': 'AL', 'Alaska': 'AK', 'Arizona': 'AZ', 'Arkansas': 'AR',
    'California': 'CA', 'Colorado': 'CO', 'Connecticut': 'CT', 'Delaware': 'DE',
    'Florida': 'FL', 'Georgia': 'GA', 'Hawaii': 'HI', 'Idaho': 'ID',
    'Illinois': 'IL', 'Indiana': 'IN', 'Iowa': 'IA', 'Kansas': 'KS',
    'Kentucky': 'KY', 'Louisiana': 'LA', 'Maine': 'ME', 'Maryland': 'MD',
    'Massachusetts': 'MA', 'Michigan': 'MI', 'Minnesota': 'MN', 'Mississippi': 'MS',
    'Missouri': 'MO', 'Montana': 'MT', 'Nebraska': 'NE', 'Nevada': 'NV',
    'New Hampshire': 'NH', 'New Jersey': 'NJ', 'New Mexico': 'NM', 'New York': 'NY',
    'North Carolina': 'NC', 'North Dakota': 'ND', 'Ohio': 'OH', 'Oklahoma': 'OK',
    'Oregon': 'OR', 'Pennsylvania': 'PA', 'Rhode Island': 'RI', 'South Carolina': 'SC',
    'South Dakota': 'SD', 'Tennessee': 'TN', 'Texas': 'TX', 'Utah': 'UT',
    'Vermont': 'VT', 'Virginia': 'VA', 'Washington': 'WA', 'West Virginia': 'WV',
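    'Wisconsin': 'WI', 'Wyoming': 'WY',
    'District of Columbia': 'DC'  # federal district; included so it is not left unmapped (NaN) if present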
}

# Create a new column with state abbreviations
df['State Abbr'] = df['State'].map(state_to_abbr)

# Show the first few rows to verify
print("\nState abbreviations added:")
print(df[['State', 'State Abbr']].head())


State abbreviations added:
        State State Abbr
0    Kentucky         KY
1    Kentucky         KY
2  California         CA
3     Florida         FL
4     Florida         FL
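
`Series.map()` with a dictionary returns `NaN` for any value that has no key, so `head()` alone does not prove the mapping is complete. A minimal sketch of a completeness check, using a deliberately incomplete toy dictionary (the names and values below are illustrative assumptions):

```python
import pandas as pd

states = pd.Series(['Kentucky', 'California', 'District of Columbia'], name='State')
state_to_abbr = {'Kentucky': 'KY', 'California': 'CA'}  # deliberately incomplete

abbr = states.map(state_to_abbr)

# Values that produced NaN are the ones missing from the dictionary
unmapped = states[abbr.isna()].unique().tolist()
print("Unmapped states:", unmapped)
```

Running the same check against `df['State']` would list any names the full dictionary does not cover.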


### 2. Replacing "Consumer" with "Retail" in Segment Column

In [None]:
# Replace values in the Segment column
df['Segment'] = df['Segment'].replace({'Consumer': 'Retail'})

# Verify the replacement
print("\nSegment values after replacement:")
print(df['Segment'].value_counts())


Segment values after replacement:
Segment
Retail         5191
Corporate      3020
Home Office    1783
Name: count, dtype: int64


## Task 4: Combining DataFrames

### 1. Vertical Concatenation (axis=0)

In [None]:
# Split the DataFrame into two parts
df1 = df.iloc[:5000]  # First 5000 rows
df2 = df.iloc[5000:]  # Remaining rows

# Vertical concatenation (stacking rows)
vertical_combined = pd.concat([df1, df2], axis=0, ignore_index=True)

print(f"Original shape: {df.shape}")
print(f"Combined shape: {vertical_combined.shape}")
print("Original and combined row counts match:", len(df) == len(vertical_combined))

Original shape: (9994, 25)
Combined shape: (9994, 25)
Original and combined row counts match: True


### 2. Horizontal Concatenation (axis=1)

In [None]:
# Split columns into two DataFrames
df_cols1 = df[['Row ID', 'Order ID', 'Order Date', 'Ship Date']]
df_cols2 = df[['Ship Mode', 'Customer ID', 'Customer Name', 'Segment']]

# Horizontal concatenation (side-by-side)
horizontal_combined = pd.concat([df_cols1, df_cols2], axis=1)

print("\nOriginal columns:", df.shape[1])
print("Combined columns:", horizontal_combined.shape[1])


Original columns: 25
Combined columns: 8


### 3. Combining DataFrames with Mismatched Columns

In [None]:
# Create two DataFrames with some overlapping and some unique columns
df_a = df[['Order ID', 'Customer ID', 'Sales']].sample(n=1000)
df_b = df[['Order ID', 'Product ID', 'Profit']].sample(n=1000)

# Concatenate with mismatched columns
combined_mismatched = pd.concat([df_a, df_b], axis=0, ignore_index=True)

print("\nDataFrame A columns:", df_a.columns.tolist())
print("DataFrame B columns:", df_b.columns.tolist())
print("Combined DataFrame columns:", combined_mismatched.columns.tolist())
print("\nCombined DataFrame sample:")
print(combined_mismatched.head())


DataFrame A columns: ['Order ID', 'Customer ID', 'Sales']
DataFrame B columns: ['Order ID', 'Product ID', 'Profit']
Combined DataFrame columns: ['Order ID', 'Customer ID', 'Sales', 'Product ID', 'Profit']

Combined DataFrame sample:
         Order ID Customer ID     Sales Product ID  Profit
0  CA-2014-158337    KA-16525  108.9200        NaN     NaN
1  CA-2014-124688    CC-12610   29.1200        NaN     NaN
2  US-2016-146794    SH-19975  424.9575        NaN     NaN
3  US-2017-118087    SP-20620    4.6720        NaN     NaN
4  CA-2015-152891    TB-21625  253.1760        NaN     NaN
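
The `NaN` values appear because `pd.concat` defaults to `join='outer'`, which keeps the union of the columns. With `join='inner'` only the shared columns survive. A small sketch on made-up frames (`a` and `b` are hypothetical):

```python
import pandas as pd

a = pd.DataFrame({'Order ID': ['A-1', 'A-2'], 'Sales': [10.0, 20.0]})
b = pd.DataFrame({'Order ID': ['B-1'], 'Profit': [3.0]})

# Union of columns: missing cells are filled with NaN
outer = pd.concat([a, b], axis=0, ignore_index=True)

# Intersection of columns: only 'Order ID' is common to both frames
inner = pd.concat([a, b], axis=0, ignore_index=True, join='inner')

print(outer)
print(inner)
```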


### 4. Handling Overlapping Data with Different Indices

In [None]:
# Create two DataFrames with overlapping Order IDs but different data
df_overlap1 = df[['Order ID', 'Sales']].drop_duplicates(subset=['Order ID']).sample(n=1000)
df_overlap2 = df[['Order ID', 'Discount']].drop_duplicates(subset=['Order ID']).sample(n=1000)

# Combine them horizontally
combined_overlap = pd.concat(
    [df_overlap1.set_index('Order ID'), df_overlap2.set_index('Order ID')],
    axis=1
)

print("\nCombined overlapping DataFrames (sample):")
print(combined_overlap.head())


Combined overlapping DataFrames (sample):
                  Sales  Discount
Order ID                         
CA-2017-161851   15.570       NaN
CA-2015-157434    7.968       NaN
CA-2017-135909  209.940       NaN
US-2016-125402   10.900       NaN
CA-2017-111269  479.952       NaN
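
Because the two samples are drawn independently, many Order IDs appear in only one of them, which is why `Discount` is `NaN` in the rows shown. When the goal is to combine on a key column, `merge` is a common alternative that makes the matching rule explicit. The sketch below uses made-up frames (`sales` and `discount` are hypothetical); `indicator=True` adds a `_merge` column showing which side each row came from.

```python
import pandas as pd

sales = pd.DataFrame({'Order ID': ['A-1', 'A-2', 'A-3'], 'Sales': [10.0, 20.0, 30.0]})
discount = pd.DataFrame({'Order ID': ['A-2', 'A-3', 'A-4'], 'Discount': [0.2, 0.0, 0.5]})

# Keep only Order IDs present in both frames
both = sales.merge(discount, on='Order ID', how='inner')

# Keep every Order ID and record where each row came from
either = sales.merge(discount, on='Order ID', how='outer', indicator=True)

print(both)
print(either)
```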


## Task 5: Superstore Data Pipeline

### Step 1: Load raw dataset

In [None]:
import pandas as pd

# Load the dataset
df = pd.read_csv('/Users/DELL/Downloads/Sample - Superstore.csv.zip', encoding='cp1252')

print(f"Raw data shape: {df.shape}")
print("First few rows:")
print(df.head(3))

Raw data shape: (9994, 21)
First few rows:
   Row ID        Order ID Order Date   Ship Date     Ship Mode Customer ID  \
0       1  CA-2016-152156  11/8/2016  11/11/2016  Second Class    CG-12520   
1       2  CA-2016-152156  11/8/2016  11/11/2016  Second Class    CG-12520   
2       3  CA-2016-138688  6/12/2016   6/16/2016  Second Class    DV-13045   

     Customer Name    Segment        Country         City  ... Postal Code  \
0      Claire Gute   Consumer  United States    Henderson  ...       42420   
1      Claire Gute   Consumer  United States    Henderson  ...       42420   
2  Darrin Van Huff  Corporate  United States  Los Angeles  ...       90036   

   Region       Product ID         Category Sub-Category  \
0   South  FUR-BO-10001798        Furniture    Bookcases   
1   South  FUR-CH-10000454        Furniture       Chairs   
2    West  OFF-LA-10000240  Office Supplies       Labels   

                                        Product Name   Sales  Quantity  \
0               

### Step 2: Clean and filter data

In [None]:
print("\nStep 2: Cleaning and filtering data...")

# Data cleaning
# Convert date columns to datetime
df['Order Date'] = pd.to_datetime(df['Order Date'])
df['Ship Date'] = pd.to_datetime(df['Ship Date'])

# Handle missing values (if any)
df = df.dropna()

# Filter data - remove negative profits and outliers
df = df[df['Profit'] >= 0]
df = df[df['Sales'] < df['Sales'].quantile(0.99)]  # Remove top 1% sales outliers

# Standardize categorical values
df['Segment'] = df['Segment'].replace({'Consumer': 'Retail'})
df['Category'] = df['Category'].str.title()

print(f"Cleaned data shape: {df.shape}")


Step 2: Cleaning and filtering data...
Cleaned data shape: (8041, 21)


### Step 3: Group and Reshape Data

In [None]:
print("\nStep 3: Grouping and reshaping data...")

# Create a grouped summary
grouped = df.groupby(['Region', 'Category', 'Ship Mode']).agg({
    'Sales': 'sum',
    'Profit': 'sum',
    'Quantity': 'mean'
}).reset_index()

# Add a profit margin column
grouped['Profit Margin'] = grouped['Profit'] / grouped['Sales']

print("Grouped data sample:")
print(grouped.head(3))


Step 3: Grouping and reshaping data...
Grouped data sample:
    Region   Category     Ship Mode     Sales     Profit  Quantity  \
0  Central  Furniture   First Class   4531.26   918.5455  3.750000   
1  Central  Furniture      Same Day   3812.52   900.6535  3.583333   
2  Central  Furniture  Second Class  10626.98  2510.4912  3.640000   

   Profit Margin  
0       0.202713  
1       0.236236  
2       0.236238  


### Step 4: Add Derived Column Using .apply()

In [None]:
print("\nStep 4: Adding derived columns...")

# Categorize shipping efficiency (days between order and ship)
def shipping_efficiency(row):
    days = (row['Ship Date'] - row['Order Date']).days
    if days == 0:
        return 'Same Day'
    elif days <= 2:
        return 'Fast'
    elif days <= 5:
        return 'Standard'
    else:
        return 'Slow'

df['Shipping Efficiency'] = df.apply(shipping_efficiency, axis=1)

# Performance tier based on profit margin
def performance_tier(margin):
    if margin > 0.25:
        return 'High Performance'
    elif margin > 0.15:
        return 'Medium Performance'
    elif margin > 0:
        return 'Low Performance'
    else:
        return 'Negative'

grouped['Performance Tier'] = grouped['Profit Margin'].apply(performance_tier)

print("Derived columns added:")
print(grouped[['Region', 'Category', 'Profit Margin', 'Performance Tier']].head())


Step 4: Adding derived columns...
Derived columns added:
    Region         Category  Profit Margin    Performance Tier
0  Central        Furniture       0.202713  Medium Performance
1  Central        Furniture       0.236236  Medium Performance
2  Central        Furniture       0.236238  Medium Performance
3  Central        Furniture       0.207703  Medium Performance
4  Central  Office Supplies       0.280611    High Performance
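
The row-wise `shipping_efficiency` function can also be expressed with `.dt.days` and `pd.cut`, avoiding `apply(axis=1)`. This is a sketch on made-up dates (the `orders` frame is hypothetical) and assumes the ship date is never earlier than the order date; a negative day count would become `NaN` here, whereas the function above would label it `Fast`.

```python
import numpy as np
import pandas as pd

orders = pd.DataFrame({
    'Order Date': pd.to_datetime(['2016-11-08', '2016-11-08', '2016-06-12', '2016-06-12']),
    'Ship Date':  pd.to_datetime(['2016-11-08', '2016-11-10', '2016-06-16', '2016-06-19']),
})

# Whole days between order and shipment, computed for the entire column at once
days = (orders['Ship Date'] - orders['Order Date']).dt.days

# Right-closed bins: 0 -> Same Day, 1-2 -> Fast, 3-5 -> Standard, 6+ -> Slow
orders['Shipping Efficiency'] = pd.cut(
    days,
    bins=[-1, 0, 2, 5, np.inf],
    labels=['Same Day', 'Fast', 'Standard', 'Slow'],
)
print(orders)
```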


### Step 5: Pivot for a summary dashboard

In [None]:
print("\nStep 5: Creating pivot tables for dashboard...")

# Pivot 1: Profit by Category & Region
profit_pivot = pd.pivot_table(
    data=grouped,
    values='Profit',
    index='Category',
    columns='Region',
    aggfunc='sum',
    margins=True,
    margins_name='Total'
)

# Pivot 2: Shipping Efficiency by Region
shipping_pivot = pd.pivot_table(
    data=df,
    values='Sales',
    index='Shipping Efficiency',
    columns='Region',
    aggfunc='count',
    fill_value=0
)

# Pivot 3: Performance Tier Summary
performance_pivot = grouped.pivot_table(
    values='Sales',
    index='Performance Tier',
    columns='Category',
    aggfunc='sum',
    margins=True
)

print("\nProfit by Category & Region:")
print(profit_pivot)

print("\nShipping Efficiency Distribution:")
print(shipping_pivot)

print("\nPerformance Tier Summary:")
print(performance_pivot)


Step 5: Creating pivot tables for dashboard...

Profit by Category & Region:
Region              Central         East       South         West        Total
Category                                                                      
Furniture        14296.9977   18779.2915  15946.8229   22568.2872   71591.3993
Office Supplies  28293.3123   44644.0475  23897.9050   50157.0671  146992.3319
Technology       24373.8656   36760.4717  18743.2121   34151.7574  114029.3068
Total            66964.1756  100183.8107  58587.9400  106877.1117  332613.0380

Shipping Efficiency Distribution:
Region               Central  East  South  West
Shipping Efficiency                            
Fast                     221   386    232   542
Same Day                  82   120     71   147
Slow                     311   417    234   524
Standard                 954  1345    811  1644

Performance Tier Summary:
Category              Furniture  Office Supplies  Technology           All
Performance Tier       