**Introduction:** Loading the pandas library and importing the raw Superstore dataset for initial inspection.

In [1]:
# 1. Import necessary libraries
import pandas as pd
import numpy as np

# 2. Read the dataset (Make sure filename matches your upload!)
df = pd.read_csv('SampleSuperstore.csv')

# 3. Check the shape and inspect the first few rows
print("Dataset Shape:", df.shape)
df.head()

Dataset Shape: (9994, 13)


,Ship_Mode,Segment,Country,City,State,Postal_Code,Region,Category,Sub_Category,Sales,Quantity,Discount,Profit
0,Second Class,Consumer,United States,Henderson,Kentucky,42420,South,Furniture,Bookcases,261.96,2,0.0,41.9136
1,Second Class,Consumer,United States,Henderson,Kentucky,42420,South,Furniture,Chairs,731.94,3,0.0,219.582
2,Second Class,Corporate,United States,Los Angeles,California,90036,West,Office Supplies,Labels,14.62,2,0.0,6.8714
3,Standard Class,Consumer,United States,Fort Lauderdale,Florida,33311,South,Furniture,Tables,957.5775,5,0.45,-383.031
4,Standard Class,Consumer,United States,Fort Lauderdale,Florida,33311,South,Office Supplies,Storage,22.368,2,0.2,2.5164


**Exploration:** Using .info() to review the table structure, column data types, and non-null counts, and .duplicated().sum() to count repeated rows before cleaning.

In [2]:
# Check column types and look for null values
df.info()

# Check for duplicates before cleaning
print("\nDuplicate rows found:", df.duplicated().sum())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9994 entries, 0 to 9993
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Ship_Mode     9994 non-null   object 
 1   Segment       9994 non-null   object 
 2   Country       9994 non-null   object 
 3   City          9994 non-null   object 
 4   State         9994 non-null   object 
 5   Postal_Code   9994 non-null   int64  
 6   Region        9994 non-null   object 
 7   Category      9994 non-null   object 
 8   Sub_Category  9994 non-null   object 
 9   Sales         9994 non-null   float64
 10  Quantity      9994 non-null   int64  
 11  Discount      9994 non-null   float64
 12  Profit        9994 non-null   float64
dtypes: float64(3), int64(2), object(8)
memory usage: 1015.1+ KB

Duplicate rows found: 17
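
Before dropping duplicates it can help to look at them. The following is a minimal sketch on a toy frame (the `sample` values are hypothetical, not rows from the Superstore file) showing how `keep=False` lists every member of a duplicate group, while the default counts only the extra copies.

```python
import pandas as pd

# Toy stand-in for the Superstore frame (hypothetical values, not real rows).
sample = pd.DataFrame({
    "City": ["Henderson", "Henderson", "Los Angeles", "Henderson"],
    "Sales": [261.96, 261.96, 14.62, 731.94],
    "Quantity": [2, 2, 2, 3],
})

# keep=False marks every member of a duplicate group, not just the repeats,
# so the matching rows can be read side by side.
dupes = sample[sample.duplicated(keep=False)]
print(dupes)

# The default keep="first" flags only the extra copies, which is the
# number that .duplicated().sum() reports.
n_extra = int(sample.duplicated().sum())
print("Extra copies:", n_extra)
```

Applied to the real data, the same pattern would be `df[df.duplicated(keep=False)]`.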


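**Duplicate Removal:** .duplicated().sum() flagged 17 duplicate rows, which are dropped in the cleaning step to avoid skewing the analysis. First, the exact column names are printed so that later cells reference them correctly.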

In [6]:
# Print all column names to find the correct spelling
print(df.columns.tolist())

['Ship_Mode', 'Segment', 'Country', 'City', 'State', 'Postal_Code', 'Region', 'Category', 'Sub_Category', 'Sales', 'Quantity', 'Discount', 'Profit']
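
The column names here already use underscores. If a different copy of the file had headers with spaces or hyphens, they could be normalized in one step. This is a sketch under that assumption; the `raw` frame and its headers are hypothetical.

```python
import pandas as pd

# Hypothetical raw headers with spaces, a hyphen, and stray whitespace.
raw = pd.DataFrame(columns=["Ship Mode", "Sub-Category", " Postal Code "])

# Trim outer whitespace, then collapse runs of spaces or hyphens
# into a single underscore.
raw.columns = raw.columns.str.strip().str.replace(r"[\s-]+", "_", regex=True)
print(raw.columns.tolist())
```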


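**Handling Nulls:** Dropped the duplicate rows, then checked for missing values. The .info() output above shows no nulls in any column, so filling missing Postal Codes with 0 is a defensive step that changes nothing in this dataset.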

In [7]:
# 1. Remove duplicates; .copy() makes df_clean an independent DataFrame,
#    so the column assignments below do not raise SettingWithCopyWarning
df_clean = df.drop_duplicates().copy()

# 2. Handle missing values (defensive: Superstore is usually clean)
# If a postal code were missing, it would be filled with 0
df_clean['Postal_Code'] = df_clean['Postal_Code'].fillna(0)

# Verify cleaning
print("Duplicates remaining:", df_clean.duplicated().sum())
print("Missing values remaining:\n", df_clean.isnull().sum())

Duplicates remaining: 0
Missing values remaining:
 Ship_Mode       0
Segment         0
Country         0
City            0
State           0
Postal_Code     0
Region          0
Category        0
Sub_Category    0
Sales           0
Quantity        0
Discount        0
Profit          0
dtype: int64
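
One caveat if a file did contain missing postal codes: pandas stores an integer column with NaN as float64, so a later string conversion would yield values like "42420.0". The sketch below uses a toy frame with hypothetical values to show the fill followed by a cast back to integer.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing postal code (hypothetical values).
toy = pd.DataFrame({
    "City": ["Henderson", "Burlington", "Henderson"],
    "Postal_Code": [42420, np.nan, 42420],
})

# .copy() gives an independent frame, so later column assignments
# are unambiguous.
toy_clean = toy.drop_duplicates().copy()

# A column holding NaN is stored as float64, so 42420 is held as 42420.0.
print(toy_clean["Postal_Code"].dtype)

# Fill the gap, then cast back to int so no ".0" suffix survives
# a later conversion to string.
toy_clean["Postal_Code"] = toy_clean["Postal_Code"].fillna(0).astype(int)
print(toy_clean)
```

Whether 0 is a sensible placeholder depends on the downstream tool; it is used here only because the notebook uses it.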


**Type Conversion:** Converted Postal Code to string format because zip codes are categorical identifiers, not mathematical values.

In [8]:
# Convert 'Postal_Code' from int64 to string (object dtype)
df_clean['Postal_Code'] = df_clean['Postal_Code'].astype(str)

# Verify the change
df_clean.dtypes

Ship_Mode        object
Segment          object
Country          object
City             object
State            object
Postal_Code      object
Region           object
Category         object
Sub_Category     object
Sales           float64
Quantity          int64
Discount        float64
Profit          float64
dtype: object
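
Because the column was parsed as int64, any postal code that begins with zero has already lost that digit by this point. Assuming five-digit US ZIP codes, the string form can be padded back out. The values below are hypothetical.

```python
import pandas as pd

# Hypothetical integer postal codes; 5452 stands in for a code such as 05452
# whose leading zero was lost when the CSV was parsed as int64.
codes = pd.Series([42420, 90036, 5452], name="Postal_Code")

# astype(str) alone yields "5452"; zfill(5) pads it back to five characters.
codes_str = codes.astype(str).str.zfill(5)
print(codes_str.tolist())
```

In the notebook this would amount to appending `.str.zfill(5)` to the `astype(str)` call.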


**Feature Engineering:** Calculated Profit_Margin (Profit / Sales) to normalize profitability across different order sizes, and Price_Per_Unit (Sales / Quantity) to make per-item prices comparable between orders.

In [9]:
# Create a 'Profit_Margin' column (Profit / Sales)
df_clean['Profit_Margin'] = df_clean['Profit'] / df_clean['Sales']

# Create a 'Price_Per_Unit' column (Sales / Quantity), rounded to 2 decimals
df_clean['Price_Per_Unit'] = (df_clean['Sales'] / df_clean['Quantity']).round(2)

# Show the new columns
df_clean[['Category', 'Sales', 'Profit', 'Profit_Margin', 'Price_Per_Unit']].head()

,Category,Sales,Profit,Profit_Margin,Price_Per_Unit
0,Furniture,261.96,41.9136,0.16,130.98
1,Furniture,731.94,219.582,0.3,243.98
2,Office Supplies,14.62,6.8714,0.47,7.31
3,Furniture,957.5775,-383.031,-0.4,191.52
4,Office Supplies,22.368,2.5164,0.1125,11.18
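
Two points are worth guarding when using a ratio like this: a row with zero Sales would produce an undefined margin, and averaging row-level margins gives small orders the same weight as large ones. The sketch below, on toy data with hypothetical values, shows one way to handle both.

```python
import numpy as np
import pandas as pd

# Toy orders (hypothetical values); the last row has zero sales.
orders = pd.DataFrame({
    "Category": ["Furniture", "Furniture", "Office Supplies", "Office Supplies"],
    "Sales": [200.0, 800.0, 50.0, 0.0],
    "Profit": [20.0, 240.0, 10.0, 0.0],
})

# Replace zero sales with NaN before dividing so the margin is NaN
# rather than inf or an undefined 0/0.
orders["Profit_Margin"] = orders["Profit"] / orders["Sales"].replace(0, np.nan)

# Category-level margin: total profit over total sales, which weights
# each order by its size (unlike a plain mean of row margins).
totals = orders.groupby("Category")[["Profit", "Sales"]].sum()
category_margin = totals["Profit"] / totals["Sales"]
print(category_margin)
```

For the toy Furniture rows the weighted margin is 260 / 1000 = 0.26, whereas the mean of the two row margins (0.10 and 0.30) would be 0.20.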


**Final Export:** Saved the processed DataFrame as cleaned_data.csv, ready for visualization tools such as Tableau.

In [10]:
# Export to CSV (without the index numbers)
df_clean.to_csv('cleaned_data.csv', index=False)

print("File saved successfully! Refresh the files tab on the left to download.")

File saved successfully! Refresh the files tab on the left to download.
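
A CSV file carries no type information, so the string Postal_Code column may be re-inferred as a number by whatever reads the file next. As a sketch of how to check the round trip in pandas, the example below writes a toy frame (hypothetical values) to an in-memory buffer and reads it back with an explicit dtype.

```python
import io
import pandas as pd

# Toy cleaned frame with string postal codes (hypothetical values).
cleaned = pd.DataFrame({
    "Postal_Code": ["42420", "05452"],
    "Sales": [261.96, 14.62],
})

# Write to an in-memory buffer instead of disk to keep the sketch self-contained.
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
buffer.seek(0)

# Without dtype=..., read_csv would infer int64 and drop the leading zero.
restored = pd.read_csv(buffer, dtype={"Postal_Code": str})
print(restored["Postal_Code"].tolist())
```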
