# Pandas Data Cleaning

## Objective
Prepare raw sales data for analysis by cleaning, standardizing, and validating it.
The goal is to ensure the dataset is reliable, consistent, and analysis-ready.

## Dataset
**Source:** Superstore Sales Dataset (Kaggle)  
**Rows:** ~9,800
**Description:**  
Each row represents a single sales transaction from a retail superstore.

### Key Use Cases
- Analyze sales performance and revenue trends
- Understand customer behavior and segmentation
- Identify regional and product-level insights
- Support business intelligence and reporting use cases

This dataset is commonly used in analytics and visualization projects using tools
such as Python, Power BI, Excel, and Tableau.

In [14]:
import pandas as pd
import numpy as np

In [15]:
df=pd.read_csv("data/superstore_data.csv",encoding="latin1")

In [16]:
df.head()

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Product ID,Category,Sub-Category,Product Name,Sales
0,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,CA-2017-138688,12-06-2017,16-06-2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


**Every row in the data set corresponds to a customer order transaction.**

The data set records the following:
- Order information, including the date of the order and the date of shipment
- Customer identifiers to enable the tracking of each order
- Geographical information, including the city and state from which the order was placed
- Product information, including category, sub-category, and product name
- Sales value, which corresponds to the revenue generated from each order

In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9800 entries, 0 to 9799
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Order ID       9800 non-null   object 
 1   Order Date     9799 non-null   object 
 2   Ship Date      9799 non-null   object 
 3   Ship Mode      9797 non-null   object 
 4   Customer ID    9800 non-null   object 
 5   Customer Name  9794 non-null   object 
 6   Segment        9798 non-null   object 
 7   Country        9794 non-null   object 
 8   City           9789 non-null   object 
 9   State          9794 non-null   object 
 10  Product ID     9799 non-null   object 
 11  Category       9793 non-null   object 
 12  Sub-Category   9792 non-null   object 
 13  Product Name   9779 non-null   object 
 14  Sales          9796 non-null   float64
dtypes: float64(1), object(14)
memory usage: 1.1+ MB


In [18]:
df.describe()

Unnamed: 0,Sales
count,9796.0
mean,230.823883
std,626.772102
min,0.444
25%,17.248
50%,54.432
75%,210.605
max,22638.48


In [19]:
df.shape

(9800, 15)

In [20]:
df.isnull()

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Product ID,Category,Sub-Category,Product Name,Sales
0,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9795,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9796,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False
9797,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
9798,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False


In [21]:
df.dropna()

Unnamed: 0,Order ID,Order Date,Ship Date,Ship Mode,Customer ID,Customer Name,Segment,Country,City,State,Product ID,Category,Sub-Category,Product Name,Sales
0,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600
1,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400
2,CA-2017-138688,12-06-2017,16-06-2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200
3,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9794,CA-2015-127166,21-05-2015,23-05-2015,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,OFF-BI-10000977,Office Supplies,Binders,Ibico Plastic Spiral Binding Combs,18.2400
9795,CA-2017-125920,21-05-2017,28-05-2017,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,OFF-BI-10003429,Office Supplies,Binders,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.7980
9797,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.1880
9798,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.3760


**Observations from Initial Data Analysis:**
- There are missing values in some of the columns, which will need to be addressed before any analysis can be done.
- The columns that are related to dates are currently of object data type and need to be changed to datetime format.
- There are no extreme values that were noticed during the initial analysis, but these will be checked during the cleaning process.
- There could be some duplicate data that needs to be checked to avoid double counting.
- The structure of the data seems to be appropriate for analysis after the cleaning and standardization process.

In [22]:
df.rename(columns={"Sales":"Sales-per-order","Order Date":"Ordered Date","Ship Date":"Shipped Date","Ship Mode":"Shipped Mode"})

Unnamed: 0,Order ID,Ordered Date,Shipped Date,Shipped Mode,Customer ID,Customer Name,Segment,Country,City,State,Product ID,Category,Sub-Category,Product Name,Sales-per-order
0,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600
1,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400
2,CA-2017-138688,12-06-2017,16-06-2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200
3,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9795,CA-2017-125920,21-05-2017,28-05-2017,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,OFF-BI-10003429,Office Supplies,Binders,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.7980
9796,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,OFF-AR-10001374,Office Supplies,Art,,10.3680
9797,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.1880
9798,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.3760


In [23]:
df.columns = (
    df.columns
      .str.strip()
      .str.lower()
      .str.replace(" ", "_")
)

In [24]:
df

Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,product_id,category,sub-category,product_name,sales
0,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600
1,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400
2,CA-2017-138688,12-06-2017,16-06-2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200
3,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9795,CA-2017-125920,21-05-2017,28-05-2017,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,OFF-BI-10003429,Office Supplies,Binders,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.7980
9796,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,OFF-AR-10001374,Office Supplies,Art,,10.3680
9797,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.1880
9798,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.3760


**Column Standardization:**
- Column names were renamed to be descriptive, lowercase, and consistent across the dataset.
- Using `snake_case` improves readability and avoids issues when writing queries, transformations, and visualizations.
- Standardized naming conventions support easier collaboration and reduce the risk of errors during analysis.

In [25]:
df.isna().sum()

order_id          0
order_date        1
ship_date         1
ship_mode         3
customer_id       0
customer_name     6
segment           2
country           6
city             11
state             6
product_id        1
category          7
sub-category      8
product_name     21
sales             4
dtype: int64

In [26]:
(df.isna().mean() * 100).round(2)

order_id         0.00
order_date       0.01
ship_date        0.01
ship_mode        0.03
customer_id      0.00
customer_name    0.06
segment          0.02
country          0.06
city             0.11
state            0.06
product_id       0.01
category         0.07
sub-category     0.08
product_name     0.21
sales            0.04
dtype: float64

In [27]:
df.dropna(inplace=True)

In [29]:
(df.isna().mean() * 100).round(2)

order_id         0.0
order_date       0.0
ship_date        0.0
ship_mode        0.0
customer_id      0.0
customer_name    0.0
segment          0.0
country          0.0
city             0.0
state            0.0
product_id       0.0
category         0.0
sub-category     0.0
product_name     0.0
sales            0.0
dtype: float64

**Dealing with Missing Values:**
- There are some columns with missing values, but the percentage of missing values in each column is less than 1%.
- Since the effect on the amount of data is minimal, the rows with missing values are removed.
- Removing the rows will help avoid creating bias, which may arise from the imputation process.

In [38]:
df.duplicated().sum()

np.int64(1)

In [39]:
df.drop_duplicates()

Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,product_id,category,sub-category,product_name,sales
0,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.9600
1,CA-2017-152156,08-11-2017,11-11-2017,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.9400
2,CA-2017-138688,12-06-2017,16-06-2017,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.6200
3,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,11-10-2016,18-10-2016,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.3680
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9794,CA-2015-127166,21-05-2015,23-05-2015,Second Class,KH-16360,Katherine Hughes,Consumer,United States,Houston,Texas,OFF-BI-10000977,Office Supplies,Binders,Ibico Plastic Spiral Binding Combs,18.2400
9795,CA-2017-125920,21-05-2017,28-05-2017,Standard Class,SH-19975,Sally Hughsby,Corporate,United States,Chicago,Illinois,OFF-BI-10003429,Office Supplies,Binders,"Cardinal HOLDit! Binder Insert Strips,Extra St...",3.7980
9797,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10004977,Technology,Phones,GE 30524EE4,235.1880
9798,CA-2016-128608,12-01-2016,17-01-2016,Standard Class,CS-12490,Cindy Schnelling,Corporate,United States,Toledo,Ohio,TEC-PH-10000912,Technology,Phones,Anker 24W Portable Micro USB Car Charger,26.3760


**Dealing with Duplicate Records:**
- After eliminating the rows containing missing values, there was one duplicate record found in the dataset.
- The duplicate record was eliminated to avoid double counting.

In [45]:
df["order_date"] = pd.to_datetime(
    df["order_date"],
    format="%d-%m-%Y"
)
df["ship_date"] = pd.to_datetime(
    df["ship_date"],
    format="%d-%m-%Y"
)

In [52]:
df["sales"] = df["sales"].astype(float)

In [53]:
df.dtypes

order_id                 object
order_date       datetime64[ns]
ship_date        datetime64[ns]
ship_mode                object
customer_id              object
customer_name            object
segment                  object
country                  object
city                     object
state                    object
product_id               object
category                 object
sub-category             object
product_name             object
sales                   float64
dtype: object

**Handling Data Types:**
- Certain columns were of object data types, which were unnecessary since the columns contained dates and numeric values.
- Using incorrect data types may result in errors or restrictions when carrying out date-related analysis and mathematical computations.
- The date columns were altered to datetime data types, while the numeric columns were altered to their respective numeric data types.
- This is essential for carrying out mathematical computations and efficient analysis.

In [54]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 9738 entries, 0 to 9799
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype         
---  ------         --------------  -----         
 0   order_id       9738 non-null   object        
 1   order_date     9738 non-null   datetime64[ns]
 2   ship_date      9738 non-null   datetime64[ns]
 3   ship_mode      9738 non-null   object        
 4   customer_id    9738 non-null   object        
 5   customer_name  9738 non-null   object        
 6   segment        9738 non-null   object        
 7   country        9738 non-null   object        
 8   city           9738 non-null   object        
 9   state          9738 non-null   object        
 10  product_id     9738 non-null   object        
 11  category       9738 non-null   object        
 12  sub-category   9738 non-null   object        
 13  product_name   9738 non-null   object        
 14  sales          9738 non-null   float64       
dtypes: datetime64[ns](2), floa

In [57]:
df.isnull().sum()

order_id         0
order_date       0
ship_date        0
ship_mode        0
customer_id      0
customer_name    0
segment          0
country          0
city             0
state            0
product_id       0
category         0
sub-category     0
product_name     0
sales            0
dtype: int64

In [58]:
df.head()

Unnamed: 0,order_id,order_date,ship_date,ship_mode,customer_id,customer_name,segment,country,city,state,product_id,category,sub-category,product_name,sales
0,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-BO-10001798,Furniture,Bookcases,Bush Somerset Collection Bookcase,261.96
1,CA-2017-152156,2017-11-08,2017-11-11,Second Class,CG-12520,Claire Gute,Consumer,United States,Henderson,Kentucky,FUR-CH-10000454,Furniture,Chairs,"Hon Deluxe Fabric Upholstered Stacking Chairs,...",731.94
2,CA-2017-138688,2017-06-12,2017-06-16,Second Class,DV-13045,Darrin Van Huff,Corporate,United States,Los Angeles,California,OFF-LA-10000240,Office Supplies,Labels,Self-Adhesive Address Labels for Typewriters b...,14.62
3,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,FUR-TA-10000577,Furniture,Tables,Bretford CR4500 Series Slim Rectangular Table,957.5775
4,US-2016-108966,2016-10-11,2016-10-18,Standard Class,SO-20335,Sean O'Donnell,Consumer,United States,Fort Lauderdale,Florida,OFF-ST-10000760,Office Supplies,Storage,Eldon Fold 'N Roll Cart System,22.368


After the completion of the data cleaning process, the dataset does not have any missing values or duplicate entries, and all columns have proper data types.
The cleaned dataset is consistent and reliable, offering a solid foundation for precise transformations and analysis.

In [61]:
df.to_csv(
    "data/superstore_cleaned.csv",
    index=False
)

## Key Takeaways
- Raw sales data was successfully cleaned by handling missing values, removing duplicates, and correcting data types.
- Column names were standardized to improve readability and ensure consistency across the analysis workflow.
- Data quality issues that could impact aggregations and time-based analysis were resolved.
- The cleaned dataset was exported and preserved for reproducible downstream transformations and analysis.