# Retail Sales and Customer Shopping Trends

## Introduction

This project analyzes Retail Sales and Customer Shopping Trends to understand key factors that impact sales performance and customer behavior. By merging transaction data with customer demographics and shopping preferences, the analysis explores seasonal and holiday sales trends, popular products, and customer responses to promotions. The goal is to uncover insights that retailers can use to tailor their offerings, optimize marketing strategies, and adapt to seasonal demand patterns. Ultimately, this project provides a data-driven foundation for enhancing retail strategies, maximizing customer satisfaction, and boosting revenue.

## Questions
* What are the top-selling products and categories, and how do they perform across different seasons and customer demographics?
* How do seasonal trends, holidays, and promotions impact sales, and are there specific times of the year when sales peak or dip?
* Who are the top customer segments, and what are their purchasing preferences based on demographics like age, gender, and location?
* How do pricing and discounting influence sales volume, and what is the average purchase amount for discounted vs. non-discounted items?
* What are the trends in customer purchase frequency, loyalty, and preferred payment methods, and do these factors affect purchasing behavior?
* How does customer satisfaction vary by product category, and are subscription models effective in retaining customers with higher purchase frequency?

The datasets are from Kaggle.com.

## Results

#### Before beginning, we need to import the libraries used to read and analyze the two CSV files.

In [2323]:
# Importing the pandas library for data manipulation and analysis
import pandas as pd

# Importing the os module to perform operating system related tasks
import os

# Importing the numpy library for numerical operations
import numpy as np

# Importing the matplotlib library for creating static, animated, and interactive visualizations
import matplotlib.pyplot as plt

# Importing the seaborn library for making statistical graphics, built on top of matplotlib
import seaborn as sns

# Importing the sqlite3 module to work with SQLite databases
import sqlite3

# Importing the time module to work with time-related functions
import time

# Importing the express module from the plotly library for interactive visualizations
import plotly.express as px

## Data Loading

In [2324]:
# Reading the CSV files from the 'data' folder
customer_df = pd.read_csv("../data/Customer_Shopping_Trends.csv")
retail_df = pd.read_csv("../data/Retail_Sales.csv")

## Inspecting The Data
### Exploring the data to understand its structure

In [2325]:
# Displaying the first 5 rows of the DataFrame
customer_df.head(5)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Clothing,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Sandals,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Blouse,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


In [2326]:
# Displaying the last 5 rows of the DataFrame
customer_df.tail(5)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
3895,3896,40,Female,Hoodie,Clothing,28,Virginia,L,Turquoise,Summer,4.2,No,2-Day Shipping,No,No,32,Venmo,Weekly
3896,3897,52,Female,Backpack,Accessories,49,Iowa,L,White,Spring,4.5,No,Store Pickup,No,No,41,Bank Transfer,Bi-Weekly
3897,3898,46,Female,Belt,Accessories,33,New Jersey,L,Green,Spring,2.9,No,Standard,No,No,24,Venmo,Quarterly
3898,3899,44,Female,Shoes,Footwear,77,Minnesota,S,Brown,Summer,3.8,No,Express,No,No,24,Venmo,Weekly
3899,3900,52,Female,Handbag,Accessories,81,California,M,Beige,Spring,3.1,No,Store Pickup,No,No,33,Venmo,Quarterly


In [2327]:
# Displaying the first 5 rows of the DataFrame
retail_df.head(5)

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,50,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,37,Clothing,1,500,500
4,5,2023-05-06,CUST005,Male,30,Beauty,2,50,100


In [2328]:
# Displaying the last 5 rows of the DataFrame
retail_df.tail(5)

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
995,996,2023-05-16,CUST996,Male,62,Clothing,1,50,50
996,997,2023-11-17,CUST997,Male,52,Beauty,3,30,90
997,998,2023-10-29,CUST998,Female,23,Beauty,4,25,100
998,999,2023-12-05,CUST999,Female,36,Electronics,3,50,150
999,1000,2023-04-12,CUST1000,Male,47,Electronics,4,30,120


## Data Quality Assessment

In [2329]:
# Checking column data types and non-null counts to spot missing values
customer_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3900 entries, 0 to 3899
Data columns (total 18 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Customer ID             3900 non-null   int64  
 1   Age                     3900 non-null   int64  
 2   Gender                  3900 non-null   object 
 3   Item Purchased          3900 non-null   object 
 4   Category                3900 non-null   object 
 5   Purchase Amount (USD)   3900 non-null   int64  
 6   Location                3900 non-null   object 
 7   Size                    3900 non-null   object 
 8   Color                   3900 non-null   object 
 9   Season                  3900 non-null   object 
 10  Review Rating           3900 non-null   float64
 11  Subscription Status     3900 non-null   object 
 12  Shipping Type           3900 non-null   object 
 13  Discount Applied        3900 non-null   object 
 14  Promo Code Used         3900 non-null   

In [2330]:
# Checking column data types and non-null counts to spot missing values
retail_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   Transaction ID    1000 non-null   int64 
 1   Date              1000 non-null   object
 2   Customer ID       1000 non-null   object
 3   Gender            1000 non-null   object
 4   Age               1000 non-null   int64 
 5   Product Category  1000 non-null   object
 6   Quantity          1000 non-null   int64 
 7   Price per Unit    1000 non-null   int64 
 8   Total Amount      1000 non-null   int64 
dtypes: int64(5), object(4)
memory usage: 70.4+ KB


### Since the datasets are clean, I will now introduce NaN values, duplicates, and whitespace-only values into the datasets to simulate real-world data issues and practice data cleaning techniques.


In [2331]:
# Adding NaN values to customer_df
customer_df.loc[3:4, 'Item Purchased'] = np.nan
# Duplicating the first row of customer_df
customer_df = pd.concat([customer_df, customer_df.iloc[[0]]], ignore_index=True)
# Replacing the 'Category' value in the second row of customer_df with whitespace
customer_df.loc[1, 'Category'] = '   '
customer_df.head(10)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually
5,6,46,Male,Sneakers,Footwear,20,Wyoming,M,White,Summer,2.9,Yes,Standard,Yes,Yes,14,Venmo,Weekly
6,7,63,Male,Shirt,Clothing,85,Montana,M,Gray,Fall,3.2,Yes,Free Shipping,Yes,Yes,49,Cash,Quarterly
7,8,27,Male,Shorts,Clothing,34,Louisiana,L,Charcoal,Winter,3.2,Yes,Free Shipping,Yes,Yes,19,Credit Card,Weekly
8,9,26,Male,Coat,Outerwear,97,West Virginia,L,Silver,Summer,2.6,Yes,Express,Yes,Yes,8,Venmo,Annually
9,10,57,Male,Handbag,Accessories,31,Missouri,M,Pink,Spring,4.8,Yes,2-Day Shipping,Yes,Yes,4,Cash,Quarterly


In [2332]:
# Adding NaN values to retail_df
retail_df.loc[2:4, 'Age'] = np.nan
# Duplicating the first row of retail_df
retail_df = pd.concat([retail_df, retail_df.iloc[[0]]], ignore_index=True)
# Replacing the 'Product Category' value in the fourth row of retail_df with whitespace
retail_df.loc[3, 'Product Category'] = '   '
retail_df.head(5)

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34.0,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26.0,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,,,1,500,500
4,5,2023-05-06,CUST005,Male,,Beauty,2,50,100


## Data Cleaning 
#### Checking for missing values, duplicates and any issues in the datasets.

In [2333]:
# Checking for missing values
customer_df.isna()
# Checking for duplicate rows
duplicates = customer_df.duplicated()
# Checking if each cell contains only whitespace
only_whitespace = customer_df.map(lambda x: isinstance(x, str) and x.isspace())
customer_df.head(5)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually
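The checks in the cell above are computed but not displayed. A minimal sketch of how they could be condensed into counts is shown below; `toy_df` and `quality_summary` are hypothetical names used only for illustration, and the toy data is made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one NaN, one whitespace-only cell, and one duplicated row
toy_df = pd.DataFrame({
    "Item": ["Blouse", np.nan, "Jeans", "Blouse"],
    "Category": ["Clothing", "   ", "Clothing", "Clothing"],
})

def quality_summary(df):
    """Return counts of missing cells, duplicated rows, and whitespace-only cells."""
    # Element-wise test for strings that contain only whitespace
    whitespace_mask = df.apply(
        lambda col: col.map(lambda x: isinstance(x, str) and x.isspace())
    )
    return {
        "missing": int(df.isna().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "whitespace_only": int(whitespace_mask.sum().sum()),
    }

summary = quality_summary(toy_df)
print(summary)
```

The same function could be called on `customer_df` or `retail_df` to get one number per issue type instead of a full boolean table.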


#### I'm going to handle missing values, drop duplicated rows, and replace whitespace-only values in customer_df

In [2334]:
# Filling missing values for string columns (object type) with 'Unknown'
customer_df[customer_df.select_dtypes(include='object').columns] = customer_df.select_dtypes(include='object').fillna('Unknown')
# Dropping duplicated rows
customer_df.drop_duplicates(inplace=True)
# Replacing whitespace-only cells in any column of customer_df with 'Unknown'
fill_value = "Unknown"
customer_df = customer_df.map(lambda x: fill_value if isinstance(x, str) and x.isspace() else x)
customer_df.head(5)

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Size,Color,Season,Review Rating,Subscription Status,Shipping Type,Discount Applied,Promo Code Used,Previous Purchases,Payment Method,Frequency of Purchases
0,1,55,Male,Blouse,Clothing,53,Kentucky,L,Gray,Winter,3.1,Yes,Express,Yes,Yes,14,Venmo,Fortnightly
1,2,19,Male,Sweater,Unknown,64,Maine,L,Maroon,Winter,3.1,Yes,Express,Yes,Yes,2,Cash,Fortnightly
2,3,50,Male,Jeans,Clothing,73,Massachusetts,S,Maroon,Spring,3.1,Yes,Free Shipping,Yes,Yes,23,Credit Card,Weekly
3,4,21,Male,Unknown,Footwear,90,Rhode Island,M,Maroon,Spring,3.5,Yes,Next Day Air,Yes,Yes,49,PayPal,Weekly
4,5,45,Male,Unknown,Clothing,49,Oregon,M,Turquoise,Spring,2.7,Yes,Free Shipping,Yes,Yes,31,PayPal,Annually


In [2335]:
# Checking for missing values
retail_df.isna()
# Checking for duplicate rows
duplicates = retail_df.duplicated()
# Checking if each cell contains only whitespace
only_whitespace = retail_df.map(lambda x: isinstance(x, str) and x.isspace())
retail_df.head(5)

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34.0,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26.0,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,,,1,500,500
4,5,2023-05-06,CUST005,Male,,Beauty,2,50,100


#### I'm going to handle missing values, drop duplicated rows, and replace whitespace-only values in retail_df

In [2336]:
# Filling missing values for numeric columns with 0
retail_df[retail_df.select_dtypes(include='number').columns] = retail_df.select_dtypes(include='number').fillna(0)
# Dropping duplicated rows
retail_df.drop_duplicates(inplace=True)
# Replacing whitespace-only cells in any column of retail_df with 'Unknown'
fill_value = "Unknown"
retail_df = retail_df.map(lambda x: fill_value if isinstance(x, str) and x.isspace() else x)
retail_df.head(5)

Unnamed: 0,Transaction ID,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount
0,1,2023-11-24,CUST001,Male,34.0,Beauty,3,50,150
1,2,2023-02-27,CUST002,Female,26.0,Clothing,2,500,1000
2,3,2023-01-13,CUST003,Male,0.0,Electronics,1,30,30
3,4,2023-05-21,CUST004,Male,0.0,Unknown,1,500,500
4,5,2023-05-06,CUST005,Male,0.0,Beauty,2,50,100


#### Now let's drop unnecessary columns from both datasets

In [2337]:
# Columns to drop from customer_df
columns_to_drop = ['Size', 'Color']
customer_df.drop(columns=columns_to_drop, inplace=True)
# Columns to drop from retail_df
columns_to_drop = ['Transaction ID']
retail_df.drop(columns=columns_to_drop, inplace=True)

#### Adding columns to calculate "Discount" and "Profit" in retail_df

In [2338]:
# 'Discount' is the ratio of the amount paid to the list price (Quantity * Price per Unit); 1.0 means no discount
retail_df['Discount'] = retail_df['Total Amount'] / (retail_df['Quantity'] * retail_df['Price per Unit'])
# 'Profit' is the amount paid minus the list price scaled by that ratio
retail_df['Profit'] = retail_df['Total Amount'] - (retail_df['Quantity'] * retail_df['Price per Unit'] * retail_df['Discount'])
# Converting date columns to datetime format
retail_df['Date'] = pd.to_datetime(retail_df['Date'])
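As a quick worked check of the two formulas above: substituting the 'Discount' definition into 'Profit' gives Total Amount - (Quantity * Price per Unit) * (Total Amount / (Quantity * Price per Unit)), which simplifies to zero for any row. The sketch below illustrates this with two rows copied from the retail_df preview shown earlier; the frame name `sample` is hypothetical.

```python
import pandas as pd

# Two rows taken from the retail_df preview (CUST001 and CUST002)
sample = pd.DataFrame({
    "Quantity": [3, 2],
    "Price per Unit": [50, 500],
    "Total Amount": [150, 1000],
})

# List price before any discount
list_price = sample["Quantity"] * sample["Price per Unit"]

# Same formulas as the notebook cell
sample["Discount"] = sample["Total Amount"] / list_price
sample["Profit"] = sample["Total Amount"] - list_price * sample["Discount"]

print(sample[["Discount", "Profit"]])
```

Both rows give a ratio of 1.0 and a profit of 0.0, matching the values that appear in later outputs. A meaningful profit figure would need a cost column, which the previewed retail data does not appear to contain.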

#### Checking the data type of both datasets

In [2339]:
# Check data types of each column in retail_df
retail_df.dtypes

Date                datetime64[ns]
Customer ID                 object
Gender                      object
Age                        float64
Product Category            object
Quantity                     int64
Price per Unit               int64
Total Amount                 int64
Discount                   float64
Profit                     float64
dtype: object

In [2340]:
# Check the data types of each column in customer_df
customer_df.dtypes

Customer ID                 int64
Age                         int64
Gender                     object
Item Purchased             object
Category                   object
Purchase Amount (USD)       int64
Location                   object
Season                     object
Review Rating             float64
Subscription Status        object
Shipping Type              object
Discount Applied           object
Promo Code Used            object
Previous Purchases          int64
Payment Method             object
Frequency of Purchases     object
dtype: object

In [2341]:
# Converting columns in both datasets to appropriate types
customer_df['Purchase Amount (USD)'] = customer_df['Purchase Amount (USD)'].astype(float)
customer_df['Discount Applied'] = customer_df['Discount Applied'].map({'Yes': True, 'No': False})
# Note: the right-hand side below reads customer_df['Age'], so retail_df receives customer ages aligned by row index
retail_df['Age'] = customer_df['Age'].astype(int)
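The last line of the cell above assigns a column taken from customer_df to retail_df. pandas aligns such assignments on the row index, which would explain why the retail ages in later outputs (55, 19, 50, ...) match customer_df rather than the retail values seen earlier. A minimal sketch of the difference, using made-up frames named `customers` and `retail`:

```python
import pandas as pd

# Hypothetical frames standing in for customer_df and retail_df
customers = pd.DataFrame({"Age": [55, 19, 50]})
retail = pd.DataFrame({"Age": [34.0, 26.0, 0.0]})

# Assigning a column from another DataFrame aligns on the row index,
# so this copy ends up with the customers' ages
aligned = retail.copy()
aligned["Age"] = customers["Age"].astype(int)

# Casting the frame's own column keeps its original values
own = retail.copy()
own["Age"] = own["Age"].astype(int)

print(aligned["Age"].tolist())
print(own["Age"].tolist())
```

Which behavior is intended depends on the analysis. Note that casting retail_df's own 'Age' column would change which rows match in the later merge on 'Age'.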

#### Creating a new column in the retail_df to hold the names of holidays based on the dates of 2023. The holidays will help analyze the impact of seasonal sales and customer purchasing behavior during these key dates.

In [2342]:
# Create a dictionary of holidays with dates
holidays = {
    '2023-01-01': 'New Year\'s Day',
    '2023-02-14': 'Valentine\'s Day',
    '2023-02-20': 'Presidents\' Day',
    '2023-04-09': 'Easter Sunday',
    '2023-05-14': 'Mother\'s Day',
    '2023-05-29': 'Memorial Day',
    '2023-06-18': 'Father\'s Day',
    '2023-07-04': 'Independence Day',
    '2023-09-04': 'Labor Day',
    '2023-10-31': 'Halloween',
    '2023-11-11': 'Veterans Day',
    '2023-11-23': 'Thanksgiving',
    '2023-11-24': 'Black Friday',
    '2023-11-27': 'Cyber Monday',
    '2023-12-24': 'Christmas Eve',
    '2023-12-25': 'Christmas Day',
    '2023-12-31': 'New Year\'s Eve'
}

# Converting the dictionary to a DataFrame for easy merging
holiday_df = pd.DataFrame(list(holidays.items()), columns=['Date', 'Holiday'])
holiday_df['Date'] = pd.to_datetime(holiday_df['Date'])
# Merging retail_df with holiday_df on 'Date' to add the 'Holiday' column
retail_df = retail_df.merge(holiday_df, on='Date', how='left')
# Filling NaN values in the 'Holiday' column with 'Non-Holiday'
retail_df['Holiday'] = retail_df['Holiday'].fillna('Non-Holiday')
retail_df.head()

Unnamed: 0,Date,Customer ID,Gender,Age,Product Category,Quantity,Price per Unit,Total Amount,Discount,Profit,Holiday
0,2023-11-24,CUST001,Male,55,Beauty,3,50,150,1.0,0.0,Black Friday
1,2023-02-27,CUST002,Female,19,Clothing,2,500,1000,1.0,0.0,Non-Holiday
2,2023-01-13,CUST003,Male,50,Electronics,1,30,30,1.0,0.0,Non-Holiday
3,2023-05-21,CUST004,Male,21,Unknown,1,500,500,1.0,0.0,Non-Holiday
4,2023-05-06,CUST005,Male,45,Beauty,2,50,100,1.0,0.0,Non-Holiday
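The holiday labeling step can be sketched on a small scale as follows. The first approach mirrors the left merge used above. The second, which maps formatted dates through the dictionary, is an alternative that this notebook does not use. The frame `sales` and its two rows are made up for illustration.

```python
import pandas as pd

# Hypothetical two-row stand-in for retail_df and a one-entry holiday dictionary
sales = pd.DataFrame({
    "Date": pd.to_datetime(["2023-11-24", "2023-02-27"]),
    "Total Amount": [150, 1000],
})
holidays = {"2023-11-24": "Black Friday"}

# Approach used in the notebook: build a lookup frame and left-merge on Date
holiday_lookup = pd.DataFrame(list(holidays.items()), columns=["Date", "Holiday"])
holiday_lookup["Date"] = pd.to_datetime(holiday_lookup["Date"])
labeled = sales.merge(holiday_lookup, on="Date", how="left")
labeled["Holiday"] = labeled["Holiday"].fillna("Non-Holiday")

# Alternative (not used in the notebook): map formatted dates through the dictionary
mapped = sales["Date"].dt.strftime("%Y-%m-%d").map(holidays).fillna("Non-Holiday")

print(labeled["Holiday"].tolist())
print(mapped.tolist())
```

Both approaches label the first row as a holiday and the second as 'Non-Holiday'. The merge keeps the lookup as a DataFrame, which is convenient if more holiday attributes are added later.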


#### Reordering the columns in both the customer_df and the retail_df to enhance readability and organization. This will help streamline the merging process and make analysis easier.

In [2343]:
# Reordering the columns in customer_df
customer_df = customer_df[['Customer ID', 'Age', 'Gender', 'Item Purchased', 'Category',
                            'Purchase Amount (USD)', 'Location', 'Season', 
                            'Review Rating', 'Subscription Status', 
                            'Shipping Type', 'Discount Applied', 
                            'Promo Code Used', 'Previous Purchases', 
                            'Payment Method', 'Frequency of Purchases']]

# Reordering the columns for retail_df
retail_df = retail_df[[
    'Customer ID', 'Gender', 'Age', 'Date', 'Product Category',
    'Quantity', 'Price per Unit', 'Total Amount', 'Discount', 
    'Profit', 'Holiday'
]]

### Merging the two datasets using an outer join on the common keys 'Customer ID', 'Age' and 'Gender'

Before merging, I format the 'Customer ID' in customer_df to match the 'Customer ID' format in retail_df. This involves converting the numeric IDs to strings of the form 'CUST00X': the prefix 'CUST' followed by the ID zero-padded to at least three digits.

In [2344]:
# Formatting Customer IDs in customer_df
customer_df['Customer ID'] = 'CUST' + customer_df['Customer ID'].astype(str).str.zfill(3)
# Merging datasets using an outer join on the common keys
merged_df = pd.merge(customer_df, retail_df, on=['Customer ID', 'Age', 'Gender'], how='outer')
# Sorting the DataFrame by 'Date' in ascending order
merged_df = merged_df.sort_values(by='Date', ascending=True)
# Re-sorting by Customer ID, Season, Age, Review Rating, and Total Amount (this ordering replaces the date sort)
merged_df = merged_df.sort_values(by=['Customer ID', 'Season', 'Age', 'Review Rating', 'Total Amount'], ascending=[True, True, True, True, True])
merged_df.head()

Unnamed: 0,Customer ID,Age,Gender,Item Purchased,Category,Purchase Amount (USD),Location,Season,Review Rating,Subscription Status,...,Payment Method,Frequency of Purchases,Date,Product Category,Quantity,Price per Unit,Total Amount,Discount,Profit,Holiday
0,CUST001,55,Male,Blouse,Clothing,53.0,Kentucky,Winter,3.1,Yes,...,Venmo,Fortnightly,2023-11-24,Beauty,3.0,50.0,150.0,1.0,0.0,Black Friday
2,CUST002,19,Male,Sweater,Unknown,64.0,Maine,Winter,3.1,Yes,...,Cash,Fortnightly,NaT,,,,,,,
1,CUST002,19,Female,,,,,,,,...,,,2023-02-27,Clothing,2.0,500.0,1000.0,1.0,0.0,Non-Holiday
3,CUST003,50,Male,Jeans,Clothing,73.0,Massachusetts,Spring,3.1,Yes,...,Credit Card,Weekly,2023-01-13,Electronics,1.0,30.0,30.0,1.0,0.0,Non-Holiday
4,CUST004,21,Male,Unknown,Footwear,90.0,Rhode Island,Spring,3.5,Yes,...,PayPal,Weekly,2023-05-21,Unknown,1.0,500.0,500.0,1.0,0.0,Non-Holiday
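The Customer ID formatting in the cell above can be checked in isolation. The sketch below uses a few made-up numeric IDs to show that zero-padding to three digits yields strings such as 'CUST001' while leaving longer IDs such as 1000 unpadded, consistent with the 'CUST1000' value seen in retail_df.

```python
import pandas as pd

# Hypothetical numeric IDs covering one-, two-, and four-digit cases
ids = pd.Series([1, 42, 1000])

# Same expression as the notebook: prefix plus zero-padding to at least 3 characters
formatted = "CUST" + ids.astype(str).str.zfill(3)

print(formatted.tolist())
```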


In [2345]:
merged_df.shape

(4410, 24)

In [2346]:
merged_df.columns

Index(['Customer ID', 'Age', 'Gender', 'Item Purchased', 'Category',
       'Purchase Amount (USD)', 'Location', 'Season', 'Review Rating',
       'Subscription Status', 'Shipping Type', 'Discount Applied',
       'Promo Code Used', 'Previous Purchases', 'Payment Method',
       'Frequency of Purchases', 'Date', 'Product Category', 'Quantity',
       'Price per Unit', 'Total Amount', 'Discount', 'Profit', 'Holiday'],
      dtype='object')

In [2347]:
# #Saving the merged dataframe to a CSV file('concatenated_dataset.csv')
# concatenated_df.to_csv('concatenated_dataset.csv', index=False)

In [2348]:
# Checking for missing values in the merged DataFrame
merged_df.isnull().sum()

Customer ID                  0
Age                          0
Gender                       0
Item Purchased             510
Category                   510
Purchase Amount (USD)      510
Location                   510
Season                     510
Review Rating              510
Subscription Status        510
Shipping Type              510
Discount Applied           510
Promo Code Used            510
Previous Purchases         510
Payment Method             510
Frequency of Purchases     510
Date                      3410
Product Category          3410
Quantity                  3410
Price per Unit            3410
Total Amount              3410
Discount                  3410
Profit                    3410
Holiday                   3410
dtype: int64

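The output shows that the merge keys ('Customer ID', 'Age', 'Gender') have no missing values. The columns that come from customer_df are missing in 510 rows, and the columns that come from retail_df are missing in 3,410 rows. This is expected from an outer join: of the 4,410 merged rows, 510 are retail transactions with no matching customer record and 3,410 are customer records with no matching transaction, which partly reflects the smaller size of retail_df. Additional cleaning of the merged data may therefore be necessary before analysis.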
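One way to see where these gaps come from is the `indicator` parameter of `pd.merge`, which adds a `_merge` column marking each row as `both`, `left_only`, or `right_only`. The sketch below uses made-up frames named `left` and `right` with a reduced set of keys; it illustrates the idea and is not a rerun of the notebook's merge.

```python
import pandas as pd

# Hypothetical stand-ins for customer_df (left) and retail_df (right)
left = pd.DataFrame({
    "Customer ID": ["CUST001", "CUST002", "CUST003"],
    "Gender": ["Male", "Male", "Male"],
    "Item Purchased": ["Blouse", "Sweater", "Jeans"],
})
right = pd.DataFrame({
    "Customer ID": ["CUST001", "CUST002"],
    "Gender": ["Male", "Female"],
    "Total Amount": [150, 1000],
})

# Outer join keeps unmatched rows from both sides; indicator labels their origin
merged = pd.merge(left, right, on=["Customer ID", "Gender"], how="outer", indicator=True)
counts = merged["_merge"].value_counts().to_dict()

print(counts)
print(merged.isnull().sum())
```

Rows marked `left_only` have missing values in the right-hand columns and vice versa, so the `_merge` counts line up with the null counts per column group. Switching to `how='inner'` would keep only the rows marked `both`.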

In [2349]:
# # Filling missing values for string columns (object type) with 'Unknown'
# concatenated_df[concatenated_df.select_dtypes(include='object').columns] = concatenated_df.select_dtypes(include='object').fillna('Unknown')
# # Filling missing values for numeric columns with 0
# concatenated_df[concatenated_df.select_dtypes(include='number').columns] = concatenated_df.select_dtypes(include='number').fillna(0)
# # Filling missing values for 'Retail_Order_Date' with a common date
# concatenated_df['Retail_Order_Date'] = concatenated_df['Retail_Order_Date'].fillna(pd.Timestamp('2024-01-01')) 
# concatenated_df.head()

In [2350]:
# # Check for any duplicated in the merged dataframe and as you will see above there is no duplicated.
# duplicates = concatenated_df.duplicated()
# duplicates

### Data Exploration

In [2351]:
# concatenated_df.describe()