## Pandas: Data Manipulation I

## Task 1: Indexing and Slicing in Pandas (Superstore Dataset)

### 1. Load the Dataset

In [36]:
import pandas as pd

# Load the dataset (pandas reads a single-file .zip archive directly)
df = pd.read_csv('/Users/DELL/Downloads/Sample - Superstore.csv.zip', encoding='cp1252')

# Display first 5 rows
print(df.head())

   Row ID        Order ID  Order Date   Ship Date       Ship Mode Customer ID  \
0       1  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
1       2  CA-2016-152156   11/8/2016  11/11/2016    Second Class    CG-12520   
2       3  CA-2016-138688   6/12/2016   6/16/2016    Second Class    DV-13045   
3       4  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   
4       5  US-2015-108966  10/11/2015  10/18/2015  Standard Class    SO-20335   

     Customer Name    Segment        Country             City  ...  \
0      Claire Gute   Consumer  United States        Henderson  ...   
1      Claire Gute   Consumer  United States        Henderson  ...   
2  Darrin Van Huff  Corporate  United States      Los Angeles  ...   
3   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   
4   Sean O'Donnell   Consumer  United States  Fort Lauderdale  ...   

  Postal Code  Region       Product ID         Category Sub-Category  \
0       42420   Sout
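The preview above shows `Order Date` and `Ship Date` as text such as `11/8/2016`. As a minimal sketch, assuming a hypothetical three-row extract held in memory instead of the real file, the snippet below shows one way to check the shape and column types right after loading and to ask `read_csv` to parse a date column. The name `toy` and the sample text are illustrative only.

```python
import io
import pandas as pd

# Hypothetical three-row extract standing in for the real CSV file
csv_text = """Order ID,Order Date,Sales
CA-2016-152156,11/8/2016,261.96
CA-2016-138688,6/12/2016,14.62
US-2015-108966,10/11/2015,957.5775
"""

# parse_dates converts the named column from text to datetime while loading
toy = pd.read_csv(io.StringIO(csv_text), parse_dates=["Order Date"])

print(toy.shape)    # (number of rows, number of columns)
print(toy.dtypes)   # one dtype per column
```

Checking `shape` and `dtypes` once after loading makes later filters and sorts easier to reason about, because it confirms which columns are numeric, text, or dates.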

### 2. Basic Indexing with .loc[] and .iloc[]

#### A. Select rows & columns using .loc[] (label-based)
- Select a single row (by index label):

In [25]:
df.loc[0]  # Returns the first row as a Series

Row ID                                           1
Order ID                            CA-2016-152156
Order Date                               11/8/2016
Ship Date                               11/11/2016
Ship Mode                             Second Class
Customer ID                               CG-12520
Customer Name                          Claire Gute
Segment                                   Consumer
Country                              United States
City                                     Henderson
State                                     Kentucky
Postal Code                                  42420
Region                                       South
Product ID                         FUR-BO-10001798
Category                                 Furniture
Sub-Category                             Bookcases
Product Name     Bush Somerset Collection Bookcase
Sales                                       261.96
Quantity                                         2
Discount                       

- Select multiple rows and specific columns:

In [30]:
df.loc[[0, 1, 2], ['Ship Mode', 'Segment', 'Sales']]  # Rows 0-2, 3 columns

| | Ship Mode | Segment | Sales |
|---|---|---|---|
| 0 | Second Class | Consumer | 261.96 |
| 1 | Second Class | Consumer | 731.94 |
| 2 | Second Class | Corporate | 14.62 |


- Select all rows but specific columns:

In [33]:
df.loc[:, ['City', 'State', 'Region']]  # All rows, only 3 columns

| | City | State | Region |
|---|---|---|---|
| 0 | Henderson | Kentucky | South |
| 1 | Henderson | Kentucky | South |
| 2 | Los Angeles | California | West |
| 3 | Fort Lauderdale | Florida | South |
| 4 | Fort Lauderdale | Florida | South |
| ... | ... | ... | ... |
| 9989 | Miami | Florida | South |
| 9990 | Costa Mesa | California | West |
| 9991 | Costa Mesa | California | West |
| 9992 | Costa Mesa | California | West |


#### B. Select rows & columns using .iloc[] (position-based)

- Select the first row (index 0):

In [42]:
df.iloc[0]  # Returns first row

Row ID                                           1
Order ID                            CA-2016-152156
Order Date                               11/8/2016
Ship Date                               11/11/2016
Ship Mode                             Second Class
Customer ID                               CG-12520
Customer Name                          Claire Gute
Segment                                   Consumer
Country                              United States
City                                     Henderson
State                                     Kentucky
Postal Code                                  42420
Region                                       South
Product ID                         FUR-BO-10001798
Category                                 Furniture
Sub-Category                             Bookcases
Product Name     Bush Somerset Collection Bookcase
Sales                                       261.96
Quantity                                         2
Discount                       

- Select rows 0 to 4 and columns 1 to 3:

In [45]:
df.iloc[0:5, 1:4]  # End positions (5 and 4) are excluded, as in Python list slicing

| | Order ID | Order Date | Ship Date |
|---|---|---|---|
| 0 | CA-2016-152156 | 11/8/2016 | 11/11/2016 |
| 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 |
| 2 | CA-2016-138688 | 6/12/2016 | 6/16/2016 |
| 3 | US-2015-108966 | 10/11/2015 | 10/18/2015 |
| 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 |


- Select every alternate row and specific column positions:

In [48]:
df.iloc[::2, [0, 2, 4]]  # Rows: 0, 2, 4,... | Columns: 0, 2, 4

| | Row ID | Order Date | Ship Mode |
|---|---|---|---|
| 0 | 1 | 11/8/2016 | Second Class |
| 2 | 3 | 6/12/2016 | Second Class |
| 4 | 5 | 10/11/2015 | Standard Class |
| 6 | 7 | 6/9/2014 | Standard Class |
| 8 | 9 | 6/9/2014 | Standard Class |
| ... | ... | ... | ... |
| 9984 | 9985 | 5/17/2015 | Standard Class |
| 9986 | 9987 | 9/29/2016 | Standard Class |
| 9988 | 9989 | 11/17/2017 | Standard Class |
| 9990 | 9991 | 2/26/2017 | Standard Class |
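With the default index (0, 1, 2, ...), `.loc[0]` and `.iloc[0]` return the same row, which hides the difference between labels and positions. The sketch below uses a hypothetical toy frame whose index labels start at 100 to make the difference visible, including the fact that a `.loc` slice includes its end label while an `.iloc` slice excludes its end position.

```python
import pandas as pd

# Hypothetical toy frame; the index labels deliberately differ from positions
toy = pd.DataFrame(
    {"Sales": [10.0, 20.0, 30.0, 40.0],
     "Region": ["West", "East", "South", "West"]},
    index=[100, 101, 102, 103],
)

by_label = toy.loc[100:102, "Sales"]   # label slice: end label 102 is included
by_position = toy.iloc[0:2, 0]         # position slice: end position 2 is excluded

print(len(by_label))       # 3 rows: labels 100, 101, 102
print(len(by_position))    # 2 rows: positions 0 and 1
```

After a filter or sort the index labels are usually no longer 0..n-1, so this distinction matters as soon as results are chained.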


### 3. Conditional Selections

- Select rows where Sales > 1000:

In [57]:
high_sales = df[df['Sales'] > 1000]  
print(high_sales.head())

    Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
10      11  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
24      25  CA-2015-106320   9/25/2015   9/30/2015  Standard Class   
27      28  US-2015-150630   9/17/2015   9/21/2015  Standard Class   
35      36  CA-2016-117590   12/8/2016  12/10/2016     First Class   
54      55  CA-2016-105816  12/11/2016  12/17/2016  Standard Class   

   Customer ID    Customer Name    Segment        Country           City  ...  \
10    BH-11710  Brosina Hoffman   Consumer  United States    Los Angeles  ...   
24    EB-13870      Emily Burns   Consumer  United States           Orem  ...   
27    TB-21520  Tracy Blumstein   Consumer  United States   Philadelphia  ...   
35    GH-14485        Gene Hale  Corporate  United States     Richardson  ...   
54    JM-15265   Janet Molinari  Corporate  United States  New York City  ...   

   Postal Code   Region       Product ID    Category Sub-Category  \
10       90032     West

- Select rows where Profit is negative (loss):

In [60]:
loss_making = df[df['Profit'] < 0]  
print(loss_making.head())

    Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
3        4  US-2015-108966  10/11/2015  10/18/2015  Standard Class   
14      15  US-2015-118983  11/22/2015  11/26/2015  Standard Class   
15      16  US-2015-118983  11/22/2015  11/26/2015  Standard Class   
23      24  US-2017-156909   7/16/2017   7/18/2017    Second Class   
27      28  US-2015-150630   9/17/2015   9/21/2015  Standard Class   

   Customer ID    Customer Name      Segment        Country             City  \
3     SO-20335   Sean O'Donnell     Consumer  United States  Fort Lauderdale   
14    HP-14815    Harold Pawlan  Home Office  United States       Fort Worth   
15    HP-14815    Harold Pawlan  Home Office  United States       Fort Worth   
23    SF-20065  Sandra Flanagan     Consumer  United States     Philadelphia   
27    TB-21520  Tracy Blumstein     Consumer  United States     Philadelphia   

    ... Postal Code   Region       Product ID         Category Sub-Category  \
3   ...       33311
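The expression inside the brackets, such as `df['Sales'] > 1000`, is itself a Series of `True`/`False` values, one per row. A minimal sketch on a hypothetical three-row frame (the names `toy` and `mask` are illustrative) shows the mask on its own and how `~` inverts it:

```python
import pandas as pd

# Hypothetical toy data standing in for the Superstore columns
toy = pd.DataFrame({"Sales": [1200.0, 80.0, 450.0],
                    "Profit": [-50.0, 12.0, 90.0]})

mask = toy["Sales"] > 1000     # boolean Series aligned with toy's index
print(mask.tolist())           # [True, False, False]

high_sales = toy[mask]         # rows where the mask is True
not_high = toy[~mask]          # ~ inverts the mask
```

Storing the mask in a variable is optional, but it makes a condition easy to inspect and reuse.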

### 4. Combining .loc[] with Conditions

- Select only 'Sales' and 'Profit' columns for high-sales orders:

In [67]:
df.loc[df['Sales'] > 1000, ['Sales', 'Profit']]

| | Sales | Profit |
|---|---|---|
| 10 | 1706.184 | 85.3092 |
| 24 | 1044.630 | 240.2649 |
| 27 | 3083.430 | -1665.0522 |
| 35 | 1097.544 | 123.4737 |
| 54 | 1029.950 | 298.6855 |
| ... | ... | ... |
| 9866 | 1085.420 | 282.2092 |
| 9925 | 1087.936 | 353.5792 |
| 9929 | 2799.960 | 944.9865 |
| 9947 | 1925.880 | 539.2464 |


- Select rows where State is 'California' and only 'City', 'Sales', 'Profit' columns:

In [70]:
df.loc[df['State'] == 'California', ['City', 'Sales', 'Profit']]

| | City | Sales | Profit |
|---|---|---|---|
| 2 | Los Angeles | 14.620 | 6.8714 |
| 5 | Los Angeles | 48.860 | 14.1694 |
| 6 | Los Angeles | 7.280 | 1.9656 |
| 7 | Los Angeles | 907.152 | 90.7152 |
| 8 | Los Angeles | 18.504 | 5.7825 |
| ... | ... | ... | ... |
| 9986 | Los Angeles | 36.240 | 15.2208 |
| 9990 | Costa Mesa | 91.960 | 15.6332 |
| 9991 | Costa Mesa | 258.576 | 19.3932 |
| 9992 | Costa Mesa | 29.600 | 13.3200 |
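As a small sketch on hypothetical toy data, the snippet below compares the one-step `.loc[mask, columns]` form used above with the two-step form that filters rows first and then picks columns. For reading data the two give the same result; the one-step form is generally the safer habit because it is also the form that works reliably when assigning values.

```python
import pandas as pd

# Hypothetical toy data; names and values are illustrative only
toy = pd.DataFrame({
    "State": ["California", "Texas", "California"],
    "City": ["Los Angeles", "Houston", "Costa Mesa"],
    "Sales": [14.62, 300.0, 91.96],
})

is_ca = toy["State"] == "California"

one_step = toy.loc[is_ca, ["City", "Sales"]]   # rows and columns in one call
two_step = toy[is_ca][["City", "Sales"]]       # filter rows, then pick columns

print(one_step.equals(two_step))
```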


## Task 2: Filtering & Sorting

### 1. Filtering Data

#### A. Basic Boolean Filtering

Filter rows where sales exceed $500:

In [78]:
high_sales = df[df['Sales'] > 500]
print(high_sales.head())

    Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
1        2  CA-2016-152156   11/8/2016  11/11/2016    Second Class   
3        4  US-2015-108966  10/11/2015  10/18/2015  Standard Class   
7        8  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
10      11  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
11      12  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   

   Customer ID    Customer Name   Segment        Country             City  \
1     CG-12520      Claire Gute  Consumer  United States        Henderson   
3     SO-20335   Sean O'Donnell  Consumer  United States  Fort Lauderdale   
7     BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   
10    BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   
11    BH-11710  Brosina Hoffman  Consumer  United States      Los Angeles   

    ... Postal Code  Region       Product ID    Category Sub-Category  \
1   ...       42420   South  FUR-CH-1000045

#### B. Multiple Conditions (AND/OR)

- AND (&) → Both conditions must be true:

In [110]:
tech_high_sales = df[(df['Category'] == 'Technology') & (df['Sales'] > 1000)]
print(tech_high_sales)

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
35        36  CA-2016-117590   12/8/2016  12/10/2016     First Class   
54        55  CA-2016-105816  12/11/2016  12/17/2016  Standard Class   
165      166  CA-2014-139892    9/8/2014   9/12/2014  Standard Class   
215      216  CA-2015-146262    1/2/2015    1/9/2015  Standard Class   
251      252  CA-2016-145625   9/11/2016   9/17/2016  Standard Class   
...      ...             ...         ...         ...             ...   
9513    9514  CA-2016-125220  10/14/2016  10/19/2016  Standard Class   
9578    9579  CA-2017-152975   9/14/2017   9/16/2017     First Class   
9660    9661  CA-2016-160717    6/6/2016   6/11/2016  Standard Class   
9673    9674  CA-2016-114867  12/23/2016  12/28/2016  Standard Class   
9929    9930  CA-2016-129630    9/4/2016    9/4/2016        Same Day   

     Customer ID    Customer Name      Segment        Country           City  \
35      GH-14485        Gene Hale    Corporate  United 

- OR (|) → Either condition can be true:

In [112]:
tech_or_furniture = df[(df['Category'] == 'Technology') | (df['Category'] == 'Furniture')]
print(tech_or_furniture)

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
0          1  CA-2016-152156   11/8/2016  11/11/2016    Second Class   
1          2  CA-2016-152156   11/8/2016  11/11/2016    Second Class   
3          4  US-2015-108966  10/11/2015  10/18/2015  Standard Class   
5          6  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
7          8  CA-2014-115812    6/9/2014   6/14/2014  Standard Class   
...      ...             ...         ...         ...             ...   
9987    9988  CA-2017-163629  11/17/2017  11/21/2017  Standard Class   
9988    9989  CA-2017-163629  11/17/2017  11/21/2017  Standard Class   
9989    9990  CA-2014-110422   1/21/2014   1/23/2014    Second Class   
9990    9991  CA-2017-121258   2/26/2017    3/3/2017  Standard Class   
9991    9992  CA-2017-121258   2/26/2017    3/3/2017  Standard Class   

     Customer ID     Customer Name    Segment        Country             City  \
0       CG-12520       Claire Gute   Consumer  United 
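Each condition above is wrapped in parentheses because `&` and `|` bind more tightly than comparison operators such as `==` and `>` in Python. The sketch below, on a hypothetical toy frame, repeats both patterns and shows `isin()` as a shorter alternative when an OR chain tests the same column against several values.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Office Supplies", "Technology"],
    "Sales": [1500.0, 200.0, 50.0, 300.0],
})

# AND: both conditions must hold for a row to be kept
both = toy[(toy["Category"] == "Technology") & (toy["Sales"] > 1000)]

# OR on the same column, written two equivalent ways
either = toy[(toy["Category"] == "Technology") | (toy["Category"] == "Furniture")]
same_with_isin = toy[toy["Category"].isin(["Technology", "Furniture"])]

print(either.equals(same_with_isin))
```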

### 2. Sorting Data

#### A. Sort by a Single Column

Sort by Profit (highest first):

In [95]:
df_sorted = df.sort_values('Profit', ascending=False)
print(df_sorted.head())

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
6826    6827  CA-2016-118689   10/2/2016   10/9/2016  Standard Class   
8153    8154  CA-2017-140151   3/23/2017   3/25/2017     First Class   
4190    4191  CA-2017-166709  11/17/2017  11/22/2017  Standard Class   
9039    9040  CA-2016-117121  12/17/2016  12/21/2016  Standard Class   
4098    4099  CA-2014-116904   9/23/2014   9/28/2014  Standard Class   

     Customer ID  Customer Name    Segment        Country         City  ...  \
6826    TC-20980   Tamara Chand  Corporate  United States    Lafayette  ...   
8153    RB-19360   Raymond Buch   Consumer  United States      Seattle  ...   
4190    HL-15040   Hunter Lopez   Consumer  United States       Newark  ...   
9039    AB-10105  Adrian Barton   Consumer  United States      Detroit  ...   
4098    SC-20095   Sanjit Chand   Consumer  United States  Minneapolis  ...   

     Postal Code   Region       Product ID         Category Sub-Category  \
6826       47905

#### B. Sort by Multiple Columns

Sort by Category (A-Z), then Sales (highest first):

In [99]:
df_sorted = df.sort_values(by=['Category', 'Sales'], ascending=[True, False])
print(df_sorted.head())

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
7243    7244  CA-2017-118892   8/17/2017   8/22/2017    Second Class   
9741    9742  CA-2015-117086   11/8/2015  11/12/2015  Standard Class   
9639    9640  CA-2015-116638   1/28/2015   1/31/2015    Second Class   
5917    5918  US-2015-126977   9/17/2015   9/23/2015  Standard Class   
6535    6536  CA-2014-128209  11/17/2014  11/22/2014  Standard Class   

     Customer ID Customer Name    Segment        Country           City  ...  \
7243    TP-21415  Tom Prescott   Consumer  United States   Philadelphia  ...   
9741    QJ-19255  Quincy Jones  Corporate  United States     Burlington  ...   
9639    JH-15985   Joseph Holt   Consumer  United States        Concord  ...   
5917    PF-19120  Peter Fuller   Consumer  United States  New York City  ...   
6535    GT-14710     Greg Tran   Consumer  United States        Buffalo  ...   

     Postal Code  Region       Product ID   Category Sub-Category  \
7243       19134 
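To make the effect of `ascending=[True, False]` easy to verify by eye, here is a minimal sketch on a hypothetical four-row frame. Each entry in the `ascending` list applies to the column at the same position in `by`.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Category": ["Technology", "Furniture", "Furniture", "Technology"],
    "Sales": [300.0, 200.0, 900.0, 1500.0],
})

# Category A-Z first; within each category, largest Sales first
ordered = toy.sort_values(by=["Category", "Sales"], ascending=[True, False])
print(ordered)
```

The original index labels travel with the rows, which is why sorted output shows them out of order.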

### 3. Combining Filtering & Sorting

Find high-sales orders in the 'Technology' category and sort by Profit:

In [114]:
tech_high_sales = df[(df['Category'] == 'Technology') & (df['Sales'] > 1000)]
tech_high_sales_sorted = tech_high_sales.sort_values('Profit', ascending=False)
print(tech_high_sales_sorted)

      Row ID        Order ID  Order Date   Ship Date       Ship Mode  \
6826    6827  CA-2016-118689   10/2/2016   10/9/2016  Standard Class   
8153    8154  CA-2017-140151   3/23/2017   3/25/2017     First Class   
4190    4191  CA-2017-166709  11/17/2017  11/22/2017  Standard Class   
2623    2624  CA-2017-127180  10/22/2017  10/24/2017     First Class   
8488    8489  CA-2016-158841    2/2/2016    2/4/2016    Second Class   
...      ...             ...         ...         ...             ...   
2697    2698  CA-2014-145317   3/18/2014   3/23/2014  Standard Class   
3151    3152  CA-2015-147830  12/15/2015  12/18/2015     First Class   
3011    3012  CA-2017-134845   4/17/2017   4/23/2017  Standard Class   
683      684  US-2017-168116   11/4/2017   11/4/2017        Same Day   
7772    7773  CA-2016-108196  11/25/2016   12/2/2016  Standard Class   

     Customer ID     Customer Name      Segment        Country           City  \
6826    TC-20980      Tamara Chand    Corporate  Unite
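The two steps above can also be written as a single chained expression. This is a sketch on hypothetical toy data, not a change to the approach; the final `reset_index(drop=True)` is an optional extra that renumbers the rows after sorting.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Category": ["Technology", "Technology", "Furniture", "Technology"],
    "Sales": [1500.0, 1200.0, 2000.0, 300.0],
    "Profit": [100.0, 400.0, 50.0, 20.0],
})

top_tech = (
    toy[(toy["Category"] == "Technology") & (toy["Sales"] > 1000)]  # filter
    .sort_values("Profit", ascending=False)                         # then sort
    .reset_index(drop=True)                                         # optional renumbering
)
print(top_tech)
```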

## Task 3: Handling Missing Data

### 1. Identifying Missing Data

#### A. Check for Null Values

In [184]:
# Count nulls per column
null_counts = df.isnull().sum()
print("Null values per column:\n", null_counts)

Null values per column:
 Row ID           0
Order ID         0
Order Date       0
Ship Date        0
Ship Mode        0
Customer ID      0
Customer Name    0
Segment          0
Country          0
City             0
State            0
Postal Code      0
Region           0
Product ID       0
Category         0
Sub-Category     0
Product Name     0
Sales            0
Quantity         0
Discount         0
Profit           0
dtype: int64


#### B. Check Rows with Nulls

In [123]:
# Show rows where any column is null
rows_with_nulls = df[df.isnull().any(axis=1)]
print("Rows with missing values:\n", rows_with_nulls)

Rows with missing values:
 Empty DataFrame
Columns: [Row ID, Order ID, Order Date, Ship Date, Ship Mode, Customer ID, Customer Name, Segment, Country, City, State, Postal Code, Region, Product ID, Category, Sub-Category, Product Name, Sales, Quantity, Discount, Profit]
Index: []

[0 rows x 21 columns]
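Because the Superstore data has no nulls, both checks above come back empty. The sketch below runs the same two checks on a hypothetical frame that does contain missing values, so the non-empty results can be seen.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one null in each column
toy = pd.DataFrame({
    "Product": ["Chair", None, "Monitor"],
    "Sales": [120.0, 340.0, np.nan],
})

null_counts = toy.isnull().sum()                   # nulls per column
rows_with_nulls = toy[toy.isnull().any(axis=1)]    # rows with at least one null

print(null_counts)
print(rows_with_nulls)
```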


### 2. Handling Missing Data

#### A. Dropping Null Values

In [127]:
# Drop rows with ANY null values (careful - can lose data!)
df_no_nulls = df.dropna()

# Drop rows where ALL values are null (returns a new DataFrame; df is unchanged)
df.dropna(how='all')

# Drop columns that contain any nulls (also returns a new DataFrame)
df.dropna(axis=1)

| | Row ID | Order ID | Order Date | Ship Date | Ship Mode | Customer ID | Customer Name | Segment | Country | City | ... | Postal Code | Region | Product ID | Category | Sub-Category | Product Name | Sales | Quantity | Discount | Profit |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-BO-10001798 | Furniture | Bookcases | Bush Somerset Collection Bookcase | 261.9600 | 2 | 0.00 | 41.9136 |
| 1 | 2 | CA-2016-152156 | 11/8/2016 | 11/11/2016 | Second Class | CG-12520 | Claire Gute | Consumer | United States | Henderson | ... | 42420 | South | FUR-CH-10000454 | Furniture | Chairs | Hon Deluxe Fabric Upholstered Stacking Chairs,... | 731.9400 | 3 | 0.00 | 219.5820 |
| 2 | 3 | CA-2016-138688 | 6/12/2016 | 6/16/2016 | Second Class | DV-13045 | Darrin Van Huff | Corporate | United States | Los Angeles | ... | 90036 | West | OFF-LA-10000240 | Office Supplies | Labels | Self-Adhesive Address Labels for Typewriters b... | 14.6200 | 2 | 0.00 | 6.8714 |
| 3 | 4 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | FUR-TA-10000577 | Furniture | Tables | Bretford CR4500 Series Slim Rectangular Table | 957.5775 | 5 | 0.45 | -383.0310 |
| 4 | 5 | US-2015-108966 | 10/11/2015 | 10/18/2015 | Standard Class | SO-20335 | Sean O'Donnell | Consumer | United States | Fort Lauderdale | ... | 33311 | South | OFF-ST-10000760 | Office Supplies | Storage | Eldon Fold 'N Roll Cart System | 22.3680 | 2 | 0.20 | 2.5164 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 9989 | 9990 | CA-2014-110422 | 1/21/2014 | 1/23/2014 | Second Class | TB-21400 | Tom Boeckenhauer | Consumer | United States | Miami | ... | 33180 | South | FUR-FU-10001889 | Furniture | Furnishings | Ultra Door Pull Handle | 25.2480 | 3 | 0.20 | 4.1028 |
| 9990 | 9991 | CA-2017-121258 | 2/26/2017 | 3/3/2017 | Standard Class | DB-13060 | Dave Brooks | Consumer | United States | Costa Mesa | ... | 92627 | West | FUR-FU-10000747 | Furniture | Furnishings | Tenex B1-RE Series Chair Mats for Low Pile Car... | 91.9600 | 2 | 0.00 | 15.6332 |
| 9991 | 9992 | CA-2017-121258 | 2/26/2017 | 3/3/2017 | Standard Class | DB-13060 | Dave Brooks | Consumer | United States | Costa Mesa | ... | 92627 | West | TEC-PH-10003645 | Technology | Phones | Aastra 57i VoIP phone | 258.5760 | 2 | 0.20 | 19.3932 |
| 9992 | 9993 | CA-2017-121258 | 2/26/2017 | 3/3/2017 | Standard Class | DB-13060 | Dave Brooks | Consumer | United States | Costa Mesa | ... | 92627 | West | OFF-PA-10004041 | Office Supplies | Paper | It's Hot Message Books with Stickers, 2 3/4" x 5" | 29.6000 | 4 | 0.00 | 13.3200 |
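Since the Superstore data has no nulls, every `dropna()` call above returns the full table. The following sketch applies the same variants to a hypothetical frame with one complete row, one partly missing row, and one fully missing row, so their different effects show up in the surviving index labels. The `subset` argument is an addition not used above; it restricts the check to the listed columns.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 0 complete, row 1 partly null, row 2 all null
toy = pd.DataFrame({
    "Product": ["Chair", None, None],
    "Sales": [120.0, 340.0, np.nan],
})

any_null = toy.dropna()                     # drops rows 1 and 2
all_null = toy.dropna(how="all")            # drops only row 2
sales_known = toy.dropna(subset=["Sales"])  # only checks the Sales column
no_null_cols = toy.dropna(axis=1)           # drops every column that has a null

print(any_null.shape, all_null.shape, sales_known.shape, no_null_cols.shape)
```

None of these calls modifies `toy`; each returns a new DataFrame, so the result has to be assigned to a name to be kept.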


#### B. Filling Null Values

In [134]:
# Fill with a specific value
df_filled = df.fillna(0)  # Replace all nulls with 0

# Column-specific filling
df['Postal Code'] = df['Postal Code'].fillna('Unknown')

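# Forward fill (use previous value)
df['Sales'] = df['Sales'].ffill()

# Fill remaining nulls with the mean (use .median() for the median);
# after the forward fill above, only leading nulls could still be missing
mean_sales = df['Sales'].mean()
df['Sales'] = df['Sales'].fillna(mean_sales)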
print(df['Sales'])

0       261.9600
1       731.9400
2        14.6200
3       957.5775
4        22.3680
          ...   
9989     25.2480
9990     91.9600
9991    258.5760
9992     29.6000
9993    243.1600
Name: Sales, Length: 9994, dtype: float64
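The fill strategies above are alternatives, and applying them one after another to the same column means only the first one that finds a null has any effect. A minimal sketch on a hypothetical Series with two gaps compares the strategies side by side:

```python
import numpy as np
import pandas as pd

# Hypothetical Series with two missing values
s = pd.Series([100.0, np.nan, 300.0, np.nan])

filled_zero = s.fillna(0)              # constant value
filled_ffill = s.ffill()               # previous valid value
filled_mean = s.fillna(s.mean())       # mean of the non-null values (200.0)
filled_median = s.fillna(s.median())   # median of the non-null values (200.0)

# Chaining: once ffill has filled the gaps, the mean fill has nothing left to do
chained = s.ffill().fillna(s.mean())

print(filled_ffill.tolist())   # [100.0, 100.0, 300.0, 300.0]
print(filled_mean.tolist())    # [100.0, 200.0, 300.0, 200.0]
```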


### 3. Creating a Test DataFrame with Nulls

In [137]:
import numpy as np
import pandas as pd

# Create DataFrame with intentional nulls
test_data = {
    'Product': ['Chair', 'Desk', np.nan, 'Monitor', 'Keyboard'],
    'Sales': [120, np.nan, 340, 210, np.nan],
    'Category': ['Furniture', 'Furniture', 'Technology', np.nan, 'Technology']
}

test_df = pd.DataFrame(test_data)
print("Original test DataFrame:\n", test_df)

# Now practice handling these nulls!
test_df_filled = test_df.fillna({'Product': 'Unknown', 'Sales': 0, 'Category': 'Uncategorized'})
print("\nAfter filling nulls:\n", test_df_filled)

Original test DataFrame:
     Product  Sales    Category
0     Chair  120.0   Furniture
1      Desk    NaN   Furniture
2       NaN  340.0  Technology
3   Monitor  210.0         NaN
4  Keyboard    NaN  Technology

After filling nulls:
     Product  Sales       Category
0     Chair  120.0      Furniture
1      Desk    0.0      Furniture
2   Unknown  340.0     Technology
3   Monitor  210.0  Uncategorized
4  Keyboard    0.0     Technology


## Task 4: GroupBy Operations

### 1. Basic GroupBy Operations

#### A. Simple Aggregation

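Calculate the average of every numeric column per category (`Row ID` and `Postal Code` are identifiers, so their averages in the output are not meaningful):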

In [144]:
category_stats = df.groupby('Category').mean(numeric_only=True)
print(category_stats)

                      Row ID   Postal Code       Sales  Quantity  Discount  \
Category                                                                     
Furniture        5041.643564  55726.556341  349.834887  3.785007  0.173923   
Office Supplies  4980.175075  54890.951211  119.324101  3.801195  0.157285   
Technology       5003.331890  55551.572279  452.709276  3.756903  0.132323   

                    Profit  
Category                    
Furniture         8.699327  
Office Supplies  20.327050  
Technology       78.752002  


#### B. Grouping with Column Selection

Get total sales per region:

In [148]:
region_sales = df.groupby('Region')['Sales'].sum()
print(region_sales)

Region
Central    501239.8908
East       678781.2400
South      391721.9050
West       725457.8245
Name: Sales, dtype: float64


#### C. Multiple Aggregations

Calculate sum, average, and count of sales per region:

In [152]:
region_stats = df.groupby('Region')['Sales'].agg(['sum', 'mean', 'count'])
print(region_stats)

                 sum        mean  count
Region                                 
Central  501239.8908  215.772661   2323
East     678781.2400  238.336110   2848
South    391721.9050  241.803645   1620
West     725457.8245  226.493233   3203
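To check by hand what `agg(['sum', 'mean', 'count'])` computes, here is a small sketch on a hypothetical five-row frame where the group totals are easy to work out mentally.

```python
import pandas as pd

# Hypothetical toy data: East has 2 rows, West has 3
toy = pd.DataFrame({
    "Region": ["West", "East", "West", "East", "West"],
    "Sales": [10.0, 20.0, 30.0, 40.0, 50.0],
})

stats = toy.groupby("Region")["Sales"].agg(["sum", "mean", "count"])
print(stats)
# East: sum 60.0, mean 30.0, count 2
# West: sum 90.0, mean 30.0, count 3
```

The result is indexed by `Region`, so a single value can be read with `stats.loc["West", "sum"]`.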


### 2. Cleaning Grouped Results

#### A. Reset Index to Convert to DataFrame

In [159]:
# Before reset_index(): a Series indexed by Region
region_stats = df.groupby('Region')['Sales'].sum()

# After reset_index(): a DataFrame with Region as a regular column
region_stats_df = df.groupby('Region')['Sales'].sum().reset_index()
print(region_stats_df.head())

    Region        Sales
0  Central  501239.8908
1     East  678781.2400
2    South  391721.9050
3     West  725457.8245
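The sketch below, on hypothetical toy data, prints the type of the result before and after `reset_index()`. It also shows `as_index=False`, an option not used above, which asks `groupby` to keep the group key as a regular column from the start.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    "Region": ["West", "East", "West"],
    "Sales": [10.0, 20.0, 30.0],
})

totals = toy.groupby("Region")["Sales"].sum()
print(type(totals).__name__)       # Series, with Region as the index

totals_df = totals.reset_index()
print(type(totals_df).__name__)    # DataFrame, with Region as a column

# Alternative: keep Region as a column without a separate reset_index() call
alt = toy.groupby("Region", as_index=False)["Sales"].sum()
```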


#### B. Rename Columns After Aggregation

In [165]:
sales_summary = df.groupby('Region').agg(
    total_sales=('Sales', 'sum'),
    avg_discount=('Discount', 'mean')
).reset_index()
print(sales_summary)

    Region  total_sales  avg_discount
0  Central  501239.8908      0.240353
1     East  678781.2400      0.145365
2    South  391721.9050      0.147253
3     West  725457.8245      0.109335


## Task 5: Merging DataFrames

### 1. Prepare Sample Data

In [169]:
# Create orders table
orders = df[['Order ID', 'Order Date', 'Customer ID', 'Sales']].copy()

# Create customers table (City and State are included, so a customer with
# orders in several locations appears on several rows)
customers = df[['Customer ID', 'Customer Name', 'Segment', 'City', 'State']].drop_duplicates()

print("Orders table shape:", orders.shape)
print("Customers table shape:", customers.shape)

Orders table shape: (9994, 4)
Customers table shape: (4688, 5)


### 2. Basic Merges

#### A. Inner Join

In [173]:
# Merge on Customer ID (only matching records)
inner_merged = pd.merge(orders, customers, on='Customer ID', how='inner')
print("Inner join result:", inner_merged.shape)

Inner join result: (68140, 8)

Note: the merged table has 68,140 rows, far more than the 9,994 rows in `orders`. A merge can only add rows when the join key repeats in the other table, so `Customer ID` is not unique in `customers` (4,688 rows): each order is paired with every `customers` row that shares its `Customer ID`.


#### B. Left Join (Keep All Orders)

In [176]:
# Keep all orders even if customer info is missing
left_merged = pd.merge(orders, customers, on='Customer ID', how='left')
print("Left join result:", left_merged.shape)

Left join result: (68140, 8)


#### C. Outer Join

In [180]:
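# Include all records from both tables (both tables come from df, so every
# Customer ID occurs in both and the shape matches the inner join)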
outer_merged = pd.merge(orders, customers, on='Customer ID', how='outer')
print("Outer join result:", outer_merged.shape)

Outer join result: (68140, 8)
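All three joins return 68,140 rows from 9,994 orders, which indicates that `Customer ID` repeats in the lookup table and rows are being multiplied. The sketch below reproduces the effect on hypothetical toy tables and shows one possible guard: deduplicating the lookup on the key and passing `validate="many_to_one"`, which makes `pd.merge` raise an error if the right-hand key is not unique. Keeping the first row per customer is an assumption made for illustration; which row to keep depends on the analysis.

```python
import pandas as pd

# Hypothetical toy tables
orders = pd.DataFrame({"Customer ID": ["A", "A", "B"],
                       "Sales": [10.0, 20.0, 30.0]})

# Lookup with a repeated key: customer A is listed under two cities
customers_dup = pd.DataFrame({"Customer ID": ["A", "A", "B"],
                              "City": ["X", "Y", "Z"]})

# Each of A's 2 orders matches 2 lookup rows, so 3 orders become 5 rows
exploded = pd.merge(orders, customers_dup, on="Customer ID", how="left")
print(len(exploded))   # 5

# One row per customer (keeps the first occurrence; an illustrative choice)
customers_unique = customers_dup.drop_duplicates(subset="Customer ID")

# validate raises MergeError if the right-hand key is not unique
clean = pd.merge(orders, customers_unique, on="Customer ID",
                 how="left", validate="many_to_one")
print(len(clean))      # 3, one row per order
```

Comparing the row count before and after a merge is a quick sanity check whenever the right-hand table is meant to be a lookup.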
