## DATA UNDERSTANDING AND DATA CLEANING


### Goal
Open and inspect a dataset the way a working analyst would.


### Objective
Prepare raw data for analysis by inspecting structure, fixing quality issues, and standardizing columns so the dataset is analysis-ready.

### Dataset Overview
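- The dataset was loaded into pandas from a CSV file using `latin1` encoding.
- It contains 541,909 rows and 8 columns of invoice-level transactions: invoice number, stock code, description, quantity, invoice date, unit price, customer ID, and country.
- Initial checks cover structure, size, and data types.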

### Load Dataset

In [9]:
#importing pandas
import pandas as pd

#load dataset
data = pd.read_csv("E:/Vidya Career/IT JOB/Data Analyst Path/Python for Data Analysis/Pandas practice/superstore_data.csv", encoding = "latin1")
print(data)


       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

             InvoiceDate  UnitPrice  CustomerID         Country  
0       1

**Result**
- Loading and printing the dataset first shows what kind of data it holds and confirms that the import worked.
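
A minimal sketch of a lighter first look after loading. It uses a tiny in-memory CSV as a stand-in for the real file; the sample rows and the names `sample_csv` and `raw` are illustrative assumptions, not part of the dataset.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for superstore_data.csv
sample_csv = io.StringIO(
    "InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country\n"
    "536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12-01-2010 08:26,2.55,17850,United Kingdom\n"
    "536365,71053,WHITE METAL LANTERN,6,12-01-2010 08:26,3.39,17850,United Kingdom\n"
    "C536379,D,Discount,-1,12-01-2010 09:41,27.5,,United Kingdom\n"
)

raw = pd.read_csv(sample_csv)

# head() shows only the first rows, which is easier to scan than printing the whole frame
print(raw.head())

# shape gives (rows, columns) so the import can be checked against expectations
print(raw.shape)

# a quick count of empty cells per column flags gaps right away
print(raw.isna().sum())
```

On the full file, `raw.head()` and `raw.shape` answer the "did it import correctly" question without printing hundreds of thousands of rows.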

In [10]:
#information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


In [4]:
#read_csv already returns a DataFrame, so no conversion is needed
#work on a copy so the raw data stays untouched
df = data.copy()
print(df)

       InvoiceNo StockCode                          Description  Quantity  \
0         536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6   
1         536365     71053                  WHITE METAL LANTERN         6   
2         536365    84406B       CREAM CUPID HEARTS COAT HANGER         8   
3         536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6   
4         536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6   
...          ...       ...                                  ...       ...   
541904    581587     22613          PACK OF 20 SPACEBOY NAPKINS        12   
541905    581587     22899         CHILDREN'S APRON DOLLY GIRL          6   
541906    581587     23254        CHILDRENS CUTLERY DOLLY GIRL          4   
541907    581587     23255      CHILDRENS CUTLERY CIRCUS PARADE         4   
541908    581587     22138        BAKING SET 9 PIECE RETROSPOT          3   

             InvoiceDate  UnitPrice  CustomerID         Country  
0       1

### Understanding the structure, size and shape
The following are checked:
- Column names and data types
- Missing values
- Overall size of the dataset
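
The three checks can also be gathered into one per-column summary. This is a sketch on a small made-up frame; the `sample` data and the `summary` layout are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one missing customer ID
sample = pd.DataFrame({
    "InvoiceNo": ["536365", "536365", "536366"],
    "Quantity": [6, 6, 8],
    "CustomerID": [17850.0, np.nan, 17850.0],
})

# One row per column: type, filled cells, share of missing cells, distinct values
summary = pd.DataFrame({
    "dtype": sample.dtypes.astype(str),
    "non_null": sample.notna().sum(),
    "missing_pct": (sample.isna().mean() * 100).round(1),
    "n_unique": sample.nunique(),
})

print(f"rows={sample.shape[0]}, columns={sample.shape[1]}")
print(summary)
```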
  

In [11]:
#get the information about the dataframe
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    406829 non-null  object 
 1   StockCode    406829 non-null  object 
 2   Description  406829 non-null  object 
 3   Quantity     406829 non-null  int64  
 4   InvoiceDate  406829 non-null  object 
 5   UnitPrice    406829 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      406829 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


In [12]:
#to get the statistical summary of the dataframe
df.describe()

| | Quantity | UnitPrice | CustomerID |
|---|---|---|---|
| count | 406829.0 | 406829.0 | 406829.0 |
| mean | 12.061303 | 3.460471 | 15287.69057 |
| std | 248.69337 | 69.315162 | 1713.600303 |
| min | -80995.0 | 0.0 | 12346.0 |
| 25% | 2.0 | 1.25 | 13953.0 |
| 50% | 5.0 | 1.95 | 15152.0 |
| 75% | 12.0 | 3.75 | 16791.0 |
| max | 80995.0 | 38970.0 | 18287.0 |
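
The summary shows a minimum quantity of -80995 and a minimum unit price of 0.0, both worth inspecting before any totals are computed. A minimal sketch of how such rows could be isolated, using made-up rows (the `sample` frame is an assumption; only the extreme values echo the summary):

```python
import pandas as pd

# Hypothetical rows; the extreme quantities and the zero price mirror the describe() output
sample = pd.DataFrame({
    "Quantity": [6, -80995, 12, 80995],
    "UnitPrice": [2.55, 1.04, 0.0, 1.04],
})

# Negative quantities: whether these are returns or entry errors needs confirming
negative_qty = sample[sample["Quantity"] < 0]

# Zero prices: could be free items or missing prices, so inspect rather than assume
zero_price = sample[sample["UnitPrice"] == 0]

print(len(negative_qty), "rows with negative quantity")
print(len(zero_price), "rows with zero unit price")
```

What to do with these rows is a business decision; the sketch only surfaces them.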


In [13]:
#to return the number of rows and columns
df.shape

(406829, 8)

### Rename columns

In [14]:
#to return the column names
df.columns

Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

**Reason for renaming columns**
- Column names are renamed to avoid errors and improve readability during analysis.
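
One way to standardize every name in a single step is a small conversion function passed to `rename`. The helper `to_snake_case` below is a hypothetical illustration, not something the notebook defines.

```python
import re
import pandas as pd

def to_snake_case(name: str) -> str:
    """Convert a CamelCase column name to snake_case (illustrative helper)."""
    # Insert an underscore wherever a lowercase letter or digit is followed by a capital
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    return name.lower()

cols = ['InvoiceNo', 'StockCode', 'Description', 'Quantity',
        'InvoiceDate', 'UnitPrice', 'CustomerID', 'Country']

# An empty frame is enough to demonstrate the renaming
sample = pd.DataFrame(columns=cols)
sample = sample.rename(columns=to_snake_case)

print(list(sample.columns))
```

Names with underscores remain valid Python identifiers, so they work with attribute access (`sample.unit_price`) and inside `query()` strings without backticks. Hyphenated names require bracket access everywhere.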

In [15]:
#renaming a few columns
print("RENAMING A FEW COLUMNS:\n")
df.rename(columns = {
    'InvoiceNo': "invoice_num"
}, inplace = True)

#printing the columns
df.columns


RENAMING A FEW COLUMNS:



Index(['invoice_num', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
       'UnitPrice', 'CustomerID', 'Country'],
      dtype='object')

In [16]:
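#renaming all columns
#note: 'unit_date' receives the UnitPrice values here; 'unit_price' would describe the column more accurately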
print("RENAMING ALL COLUMNS:\n")
df.columns = ['invoice_id', 'stock_code', 'description', 'qty',
              'invoice_date', 'unit_date', 'cust_id', 'country']
df.columns


RENAMING ALL COLUMNS:



Index(['invoice_id', 'stock_code', 'description', 'qty', 'invoice_date',
       'unit_date', 'cust_id', 'country'],
      dtype='object')

In [17]:
#rename columns in the messy data
#convert the column names to lowercase and then replace _ with -
print("COLUMN NAMES:\n")
df.columns = df.columns.str.lower().str.replace('_', '-')
df.columns

COLUMN NAMES:



Index(['invoice-id', 'stock-code', 'description', 'qty', 'invoice-date',
       'unit-date', 'cust-id', 'country'],
      dtype='object')


### Initial Observations
- Missing values were present in some columns.
- Certain numerical and date columns were stored as text.
- Column naming was inconsistent.
- Duplicate records were found.

### Handling missing values
- Missing values can distort analysis results.
- The strategy depends on the column's business meaning: numerical values may be filled, while critical identifiers should not be filled.
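
A minimal sketch of a per-column strategy along these lines. The sample data, the `"No data"` placeholder, and the choice of the median are illustrative assumptions; other fills may suit the business context better.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with gaps in an identifier, a text field, and a numeric field
sample = pd.DataFrame({
    "cust_id": [17850.0, np.nan, 12680.0, 12680.0],
    "description": ["WHITE METAL LANTERN", None, "Discount", None],
    "qty": [6.0, 2.0, np.nan, 4.0],
})

# 1. Identifier: a row without a customer cannot be attributed, so drop it
cleaned = sample.dropna(subset=["cust_id"]).copy()

# 2. Descriptive text: keep the row and mark the gap explicitly
cleaned["description"] = cleaned["description"].fillna("No data")

# 3. Numeric measure: fill with the median of the remaining values
cleaned["qty"] = cleaned["qty"].fillna(cleaned["qty"].median())

print(cleaned)
print(cleaned.isna().sum())
```

Each step assigns its result back, because `dropna` and `fillna` return new objects rather than modifying the frame in place.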
  

In [18]:
#check every cell for a missing (null) value
#returns True where the value is null, otherwise False

df.isnull()

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | False | False | False | False | False | False | False | False |
| 541905 | False | False | False | False | False | False | False | False |
| 541906 | False | False | False | False | False | False | False | False |
| 541907 | False | False | False | False | False | False | False | False |


In [19]:
#to find out the sum of the null values
df.isnull().sum()

invoice-id      0
stock-code      0
description     0
qty             0
invoice-date    0
unit-date       0
cust-id         0
country         0
dtype: int64

In [20]:
#replace null values with placeholder text
#this returns a new DataFrame; df itself is unchanged because the result is not assigned
import numpy as np
df.replace(np.nan, "No data")

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12-01-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12-01-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12-09-2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12-09-2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |


In [22]:
#removing rows with a missing customer ID (cust-id)
df = df.dropna(subset=['cust-id'])

df.shape

(406829, 8)

In [23]:
#removing rows with a missing invoice number (invoice-id)

df = df.dropna(subset=['invoice-id'])
df.shape

(406829, 8)

In [24]:
#removing rows with a missing description
df = df.dropna(subset=['description'])
df.shape

(406829, 8)

In [25]:
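#filling the missing values
#forward fill: each gap takes the last valid value above it
#the result is only displayed, not assigned, so df is unchanged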
df.ffill()

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12-01-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12-01-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12-09-2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12-09-2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |


In [26]:
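#backward fill: each gap takes the next valid value below it
#the result is only displayed, not assigned, so df is unchanged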
df.bfill()

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12-01-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12-01-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12-09-2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12-09-2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |



**Result**
- Rows with a missing `cust-id` were dropped, leaving 406,829 rows with no nulls in any column. The `replace`, `ffill`, and `bfill` calls only display their results, since they are not assigned back to `df`.


### Handling duplicate values


**Why duplicates matter**
- Duplicate records can inflate counts and totals, leading to incorrect business insights.
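
A minimal sketch of counting and removing duplicates on a made-up frame. The sample rows and the choice of `invoice_id` plus `stock_code` as the key are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample: the second row repeats the first exactly
sample = pd.DataFrame({
    "invoice_id": ["536365", "536365", "536365", "536366"],
    "stock_code": ["71053", "71053", "84406B", "71053"],
    "qty": [6, 6, 8, 6],
})

# Count rows that exactly repeat an earlier row
n_dupes = int(sample.duplicated().sum())

# drop_duplicates() returns a new frame, so the result must be assigned
deduped = sample.drop_duplicates()

# A subset defines duplicates by business key instead of by every column
by_line = sample.drop_duplicates(subset=["invoice_id", "stock_code"], keep="first")

print(n_dupes, len(sample), len(deduped), len(by_line))
```

Deduplicating on a single column such as the invoice number alone would keep only one line per invoice and discard legitimate line items, so the key should be chosen with care.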
  

In [27]:
#flag rows that exactly repeat an earlier row
#returns True for a duplicate row, otherwise False

df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
541904    False
541905    False
541906    False
541907    False
541908    False
Length: 406829, dtype: bool

In [28]:
#flag repeated values within a single column (here, description)

df['description'].duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
541904     True
541905     True
541906     True
541907     True
541908     True
Name: description, Length: 406829, dtype: bool

In [29]:
#count the repeated descriptions
#repeats are expected here, since the same product appears on many invoices

df['description'].duplicated().sum()

402933

In [30]:
#drop exact duplicate rows
#this returns a new DataFrame; assign it back (df = df.drop_duplicates()) to make the removal permanent

df.drop_duplicates()

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12-01-2010 08:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12-01-2010 08:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12-01-2010 08:26 | 3.39 | 17850.0 | United Kingdom |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | 581587 | 22613 | PACK OF 20 SPACEBOY NAPKINS | 12 | 12-09-2011 12:50 | 0.85 | 12680.0 | France |
| 541905 | 581587 | 22899 | CHILDREN'S APRON DOLLY GIRL | 6 | 12-09-2011 12:50 | 2.10 | 12680.0 | France |
| 541906 | 581587 | 23254 | CHILDRENS CUTLERY DOLLY GIRL | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |
| 541907 | 581587 | 23255 | CHILDRENS CUTLERY CIRCUS PARADE | 4 | 12-09-2011 12:50 | 4.15 | 12680.0 | France |


In [31]:
#flag and drop rows that repeat an invoice-id, keeping the first row of each invoice
#neither result is assigned, so df keeps all of its rows
df.duplicated('invoice-id')
df.drop_duplicates('invoice-id')
df.shape

(406829, 8)

In [33]:
df.duplicated()

0         False
1         False
2         False
3         False
4         False
          ...  
541904    False
541905    False
541906    False
541907    False
541908    False
Length: 406829, dtype: bool

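**Outcome**
- Duplicate rows were checked with `duplicated()`. Because the `drop_duplicates()` results were not assigned back, `df` still has 406,829 rows. Assigning the result (`df = df.drop_duplicates()`) makes the removal permanent, so that each transaction is counted only once.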

### Fixing data types

**Why fixing data types is critical**
- Correct data types are required for accurate calculations, comparisons, and analysis.
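
A minimal sketch of type conversion that keeps the values intact. The sample rows are made up, and the `%m-%d-%Y %H:%M` format is an assumption about how the date strings are laid out; it should be verified against the real file.

```python
import pandas as pd

# Hypothetical sample; the last date is deliberately invalid
sample = pd.DataFrame({
    "invoice_date": ["12-01-2010 08:26", "12-13-2010 09:02", "not a date"],
    "unit_price": [2.55, 3.39, 0.85],
    "cust_id": [17850.0, 12680.0, 13047.0],
})

# An explicit format avoids guessing between day-first and month-first
sample["invoice_date"] = pd.to_datetime(
    sample["invoice_date"], format="%m-%d-%Y %H:%M", errors="coerce"
)

# Always check how many values coercion turned into NaT
n_unparsed = int(sample["invoice_date"].isna().sum())

# Identifiers are labels, not quantities: drop the ".0" and store them as text
sample["cust_id"] = sample["cust_id"].astype("int64").astype(str)

# Prices stay float, because an integer cast would discard the decimals
print(sample.dtypes)
print("unparsed dates:", n_unparsed)
```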
  

In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   invoice-id    406829 non-null  object 
 1   stock-code    406829 non-null  object 
 2   description   406829 non-null  object 
 3   qty           406829 non-null  int64  
 4   invoice-date  406829 non-null  object 
 5   unit-date     406829 non-null  float64
 6   cust-id       406829 non-null  float64
 7   country       406829 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 27.9+ MB


In [36]:
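#changing the data type to numeric
#errors="coerce" turns every non-numeric invoice-id into NaN (397924 of 406829 values remain in the output below)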

df['invoice-id'] = pd.to_numeric(df['invoice-id'], errors = "coerce")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype  
---  ------        --------------   -----  
 0   invoice-id    397924 non-null  float64
 1   stock-code    406829 non-null  object 
 2   description   406829 non-null  object 
 3   qty           406829 non-null  int64  
 4   invoice-date  406829 non-null  object 
 5   unit-date     406829 non-null  float64
 6   cust-id       406829 non-null  float64
 7   country       406829 non-null  object 
dtypes: float64(3), int64(1), object(4)
memory usage: 27.9+ MB


In [37]:
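#changing the data type to datetime
#values that cannot be parsed become NaT (only 172782 of 406829 values parse in the output below)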

df['invoice-date'] = pd.to_datetime(df['invoice-date'], errors = "coerce")
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   invoice-id    397924 non-null  float64       
 1   stock-code    406829 non-null  object        
 2   description   406829 non-null  object        
 3   qty           406829 non-null  int64         
 4   invoice-date  172782 non-null  datetime64[ns]
 5   unit-date     406829 non-null  float64       
 6   cust-id       406829 non-null  float64       
 7   country       406829 non-null  object        
dtypes: datetime64[ns](1), float64(3), int64(1), object(3)
memory usage: 27.9+ MB


In [39]:
#converting float to int
#note: 'unit-date' holds the unit prices, so this cast truncates the decimals (2.55 becomes 2)

df['unit-date'] = df['unit-date'].astype(int)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   invoice-id    397924 non-null  float64       
 1   stock-code    406829 non-null  object        
 2   description   406829 non-null  object        
 3   qty           406829 non-null  int64         
 4   invoice-date  172782 non-null  datetime64[ns]
 5   unit-date     406829 non-null  int32         
 6   cust-id       406829 non-null  float64       
 7   country       406829 non-null  object        
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(3)
memory usage: 26.4+ MB


In [40]:
#ensure description is stored as text (the dtype is still reported as object)
df['description'] = df['description'].astype(str)
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 406829 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column        Non-Null Count   Dtype         
---  ------        --------------   -----         
 0   invoice-id    397924 non-null  float64       
 1   stock-code    406829 non-null  object        
 2   description   406829 non-null  object        
 3   qty           406829 non-null  int64         
 4   invoice-date  172782 non-null  datetime64[ns]
 5   unit-date     406829 non-null  int32         
 6   cust-id       406829 non-null  float64       
 7   country       406829 non-null  object        
dtypes: datetime64[ns](1), float64(2), int32(1), int64(1), object(3)
memory usage: 26.4+ MB



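**Result**
- `invoice-id` is now float64, `invoice-date` is datetime64[ns], and `unit-date` is int32. The conversions came at a cost: coercion left 8,905 nulls in `invoice-id` and 234,047 in `invoice-date`, and the integer cast removed the decimals from the unit prices. These columns need review before analysis.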
  

In [41]:
df.isnull()

| | invoice-id | stock-code | description | qty | invoice-date | unit-date | cust-id | country |
|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False |
| 3 | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 541904 | False | False | False | False | False | False | False | False |
| 541905 | False | False | False | False | False | False | False | False |
| 541906 | False | False | False | False | False | False | False | False |
| 541907 | False | False | False | False | False | False | False | False |


In [42]:
df.isnull().sum()

invoice-id        8905
stock-code           0
description          0
qty                  0
invoice-date    234047
unit-date            0
cust-id              0
country              0
dtype: int64
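
The null counts above appear only after the type conversions, which means `errors="coerce"` silently replaced values it could not convert. A minimal sketch of how the affected raw values could be inspected before converting; the sample IDs and the `C` prefix are assumptions about what non-numeric invoice numbers might look like.

```python
import pandas as pd

# Hypothetical invoice numbers, two of which carry a letter prefix
raw_ids = pd.Series(["536365", "C536379", "536366", "C536383"], name="invoice_id")

converted = pd.to_numeric(raw_ids, errors="coerce")

# Values that were present before conversion but are NaN afterwards
failed = raw_ids[converted.isna() & raw_ids.notna()]
print(failed.tolist())

# Alternative: keep the identifier as text and derive a flag from the prefix
is_prefixed = raw_ids.str.startswith("C")
print(int(is_prefixed.sum()), "prefixed invoice numbers")
```

Looking at `failed` before committing to the conversion shows whether the lost values carry meaning that should be preserved in a separate column.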

### Saving the cleaned CSV file

In [43]:
df.to_csv("cleaned_data_superstore.csv", index=False)
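
CSV files do not store data types, so a quick round trip helps confirm that the saved file reloads as intended. A minimal sketch using a temporary directory and a made-up frame; the file name `cleaned_sample.csv` and the column names are illustrative.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned sample
sample = pd.DataFrame({
    "invoice_id": ["536365", "536366"],
    "invoice_date": pd.to_datetime(["2010-12-01 08:26", "2010-12-01 08:28"]),
    "unit_price": [2.55, 1.85],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cleaned_sample.csv")
    sample.to_csv(path, index=False)

    # Types must be restated on reload: IDs as text, dates parsed again
    reloaded = pd.read_csv(path, dtype={"invoice_id": str}, parse_dates=["invoice_date"])

same_shape = reloaded.shape == sample.shape
dates_restored = pd.api.types.is_datetime64_any_dtype(reloaded["invoice_date"])
print(same_shape, dates_restored)
```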

### Data Cleaning Outcome
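The dataset now has standardized column names, no rows without a customer ID, and converted data types. Three issues remain before it is fully analysis-ready: the nulls introduced by coercion in `invoice-id` (8,905) and `invoice-date` (234,047), the unit prices truncated by the integer cast, and the duplicate removal that was not assigned back to `df`.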