<a href="https://colab.research.google.com/github/SheilaMumbi/PROJECTS/blob/main/Supermarket_Data_Analysis_and_Visualization_Case_Study_.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Introduction**


Supermarkets play a vital role in the retail sector by providing consumers with a wide range of products, from groceries to household items. In a highly competitive market, understanding the factors that influence store performance is crucial for driving sales, increasing customer satisfaction, and maintaining profitability. This case study focuses on analyzing data from various supermarket stores to uncover trends, patterns, and insights that can inform strategic decisions.

***About the Data Set***


The data set contains the following columns:

Store ID: (Index) ID of the particular store.

Store_Area: Physical area of the store in square yards.

Items_Available: Number of different items available in the corresponding store.

Daily_Customer_Count: Average daily number of customers who visited the store over a month.

Store_Sales: Sales (in US $) that the store made.
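
Before any analysis, it can help to confirm that a loaded frame actually has the columns listed above. The snippet below is a minimal sketch: `check_schema` and `EXPECTED_COLUMNS` are hypothetical names, the exact spelling of the column headers in the CSV is an assumption, and the two sample rows are invented purely to exercise the helper.

```python
import pandas as pd

# Column names taken from the description above; their exact spelling in the
# CSV (including any stray whitespace) is an assumption.
EXPECTED_COLUMNS = [
    "Store ID",
    "Store_Area",
    "Items_Available",
    "Daily_Customer_Count",
    "Store_Sales",
]


def check_schema(frame: pd.DataFrame) -> list:
    """Return the expected columns missing from `frame`, ignoring surrounding whitespace."""
    present = {str(col).strip() for col in frame.columns}
    return [col for col in EXPECTED_COLUMNS if col not in present]


# Invented sample rows, not taken from the real data set
sample = pd.DataFrame({
    "Store ID ": [1, 2],  # trailing space on purpose, to show it is tolerated
    "Store_Area": [1600, 1450],
    "Items_Available": [1900, 1750],
    "Daily_Customer_Count": [500, 210],
    "Store_Sales": [66000, 39000],
})

missing = check_schema(sample)
print("Missing columns:", missing)
```

In the notebook, the same helper could be called on `df` right after `pd.read_csv`.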

# **Data Exploration and Preprocessing**

Loading the data set


In [None]:
%pip install ydata-profiling


In [None]:
# importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from ydata_profiling import ProfileReport

# importing data
data = '/content/Stores.csv'
df = pd.read_csv(data)

# First five rows
df.head()

In [None]:
# Structure of the data set
df.info()

# Summary of the data set
df.describe()

In [None]:
df_report = ProfileReport(df)
df_report

***Data Cleaning***

In [None]:
# Checking for missing values
df.isnull().sum()

In [None]:
# Identifying outliers
# Plot box plots of all numeric columns to spot values beyond the whiskers
plt.figure(figsize=(15, 8))
sns.boxplot(data=df)
plt.title('Box Plot for All Numeric Columns')
plt.show()

Store_Sales: Outliers are present on the upper end, indicating that some stores have significantly higher sales compared to the rest.
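
To put a number on what the box plot shows, the 1.5 × IQR rule (the default whisker rule in seaborn's box plot) can be applied directly. This is a minimal sketch: `iqr_outlier_mask` is a hypothetical helper, and the sales figures are invented for illustration rather than taken from the data set.

```python
import pandas as pd


def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values lying more than k * IQR below Q1 or above Q3."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Invented sales figures; the last one sits far above the rest
sales = pd.Series(
    [40000, 50000, 55000, 60000, 65000, 70000, 116000],
    name="Store_Sales",
)

mask = iqr_outlier_mask(sales)
print("Number of outliers:", int(mask.sum()))
print(sales[mask])
```

Applied to the real data, `iqr_outlier_mask(df['Store_Sales'])` would mark the stores that fall outside the whiskers.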

In [None]:
# Count rows whose Store_Sales value repeats an earlier row
# (duplicates are judged on this column only, not on the whole row)
df.duplicated(subset='Store_Sales').sum()

In [None]:
# Remove rows whose Store_Sales value repeats an earlier row (the first occurrence is kept)
df_cleaned = df.drop_duplicates(subset='Store_Sales')

# Reset the index
df_cleaned = df_cleaned.reset_index(drop=True)

df_cleaned


In [None]:
df_cleaned.describe()
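
Checking duplicates on a single column is stricter than checking whole rows: two different stores that happen to report the same sales figure count as duplicates. The toy frame below (invented values) sketches the difference between the two checks.

```python
import pandas as pd

# Invented rows: all three share a sales figure, but only rows 0 and 2 are identical
toy = pd.DataFrame({
    "Store_Area": [1500, 1700, 1500],
    "Store_Sales": [60000, 60000, 60000],
})

# Rows identical in every column to an earlier row
full_row_dupes = int(toy.duplicated().sum())

# Rows whose Store_Sales value repeats an earlier row
sales_only_dupes = int(toy.duplicated(subset="Store_Sales").sum())

print("Full-row duplicates:", full_row_dupes)
print("Store_Sales-only duplicates:", sales_only_dupes)
```

If only exact repeated records should be dropped, `df.drop_duplicates()` without a `subset` argument would be the more conservative choice.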

***Initial Exploration***

In [None]:
# Plot histograms for each key variable
fig, axes = plt.subplots(1, 3, figsize=(18, 6))

sns.histplot(df_cleaned['Store_Area'], bins=20, ax=axes[0], kde=True)
axes[0].set_title('Distribution of Store Area')

sns.histplot(df_cleaned['Daily_Customer_Count'], bins=20, ax=axes[1], kde=True)
axes[1].set_title('Distribution of Customer Count')

sns.histplot(df_cleaned['Store_Sales'], bins=20, ax=axes[2], kde=True)
axes[2].set_title('Distribution of Store Sales')

plt.tight_layout()
plt.show()


# **Data Analysis**

In [None]:
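# Correlation analysis between Store_Area and Items_Available
# (computed on df_cleaned so the value matches the data plotted below)
correlation_store_size_items = df_cleaned['Store_Area'].corr(df_cleaned['Items_Available'])
print(f"Correlation between Store Area and Items Available: {correlation_store_size_items:.3f}")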

# Scatter plot for Store Size vs. Items Available
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Store_Area', y='Items_Available', data=df_cleaned)
plt.title(f'Store Size vs. Items Available (Correlation: {correlation_store_size_items:.3f})')
plt.xlabel('Store Area')
plt.ylabel('Items Available')
plt.show()


In [None]:
# Correlation analysis between Daily Customer Count and Store Sales
correlation_customer_sales = df['Daily_Customer_Count'].corr(df['Store_Sales'])
print(f"Correlation between Daily Customer Count and Store Sales: {correlation_customer_sales:.3f}")

# Scatter plot for Daily Customer Count vs. Store Sales
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Daily_Customer_Count', y='Store_Sales', data=df)
plt.title(f'Daily Customer Count vs. Store Sales (Correlation: {correlation_customer_sales:.3f})')
plt.xlabel('Daily Customer Count')
plt.ylabel('Store Sales')
plt.show()
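
By default, `Series.corr` returns the Pearson correlation coefficient. As a rough sketch of what that number measures, the example below computes it by hand from the deviations about the mean and compares the result with pandas. The customer and sales values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented values, not taken from the data set
customers = pd.Series([500.0, 650.0, 800.0, 900.0, 1100.0])
sales = pd.Series([52000.0, 48000.0, 61000.0, 58000.0, 66000.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
dx = customers - customers.mean()
dy = sales - sales.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# The same quantity via pandas (method='pearson' is the default)
r_pandas = customers.corr(sales)

print(f"manual: {r_manual:.6f}  pandas: {r_pandas:.6f}")
```

A value near 0 suggests little linear relationship, while values near +1 or -1 suggest a strong one.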


In [None]:
# Creating size categories based on Store_Area
bins = [0, 1500, 2500, 3500, float('inf')]  # Define thresholds for store size
labels = ['Small', 'Medium', 'Large', 'Extra Large']  # Define labels for each category

# Segment stores into size categories
df['Store_Size_Category'] = pd.cut(df['Store_Area'], bins=bins, labels=labels, include_lowest=True)

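# Calculate average sales performance for each size category
# (observed=False keeps empty categories and states the grouping behaviour explicitly)
avg_sales_by_size = df.groupby('Store_Size_Category', observed=False)['Store_Sales'].mean().reset_index()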

# Plot a bar chart to visualize the average sales performance by store size category
plt.figure(figsize=(10, 6))
sns.barplot(x='Store_Size_Category', y='Store_Sales', data=avg_sales_by_size, palette='viridis')
plt.title('Average Sales Performance by Store Size Category')
plt.xlabel('Store Size Category')
plt.ylabel('Average Sales')
plt.show()
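
With `pd.cut`, each bin is closed on the right by default, so a store whose area is exactly 1500 falls in 'Small' and one at 1501 falls in 'Medium'. The sketch below reuses the thresholds and labels from the cell above on a few invented areas chosen to sit on or near the boundaries.

```python
import pandas as pd

# Same thresholds and labels as in the analysis cell
bins = [0, 1500, 2500, 3500, float("inf")]
labels = ["Small", "Medium", "Large", "Extra Large"]

# Invented areas placed on or near the bin edges
areas = pd.Series([1500, 1501, 2500, 3600])

categories = pd.cut(areas, bins=bins, labels=labels, include_lowest=True)
print(categories.tolist())
# → ['Small', 'Medium', 'Medium', 'Extra Large']
```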


In [None]:

# Rank stores based on Store Sales and Customer Count
df['Sales_Rank'] = df['Store_Sales'].rank(ascending=False)
df['Customer_Rank'] = df['Daily_Customer_Count'].rank(ascending=False)

# Scatter plot for Sales Rank vs. Customer Rank
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Sales_Rank', y='Customer_Rank', data=df, s=100, hue='Store_Sales', palette='viridis', legend=False)
plt.title('Scatter Plot of Sales Rank vs. Customer Rank')
plt.xlabel('Sales Rank (1 = Best)')
plt.ylabel('Customer Rank (1 = Best)')
plt.show()

# Scatter plot for Store Sales vs. Daily Customer Count
plt.figure(figsize=(10, 6))
sns.scatterplot(x='Store_Sales', y='Daily_Customer_Count', data=df, s=100, hue='Store_Sales', palette='coolwarm', legend=False)
plt.title('Store Sales vs. Daily Customer Count')
plt.xlabel('Store Sales')
plt.ylabel('Daily Customer Count')
plt.show()
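
When two stores share the same value, `rank` assigns them the average of the positions they occupy (the default `method='average'`), which is why fractional ranks can appear on the axes. A small sketch with invented sales values:

```python
import pandas as pd

# Invented values with one tie at the top
sales = pd.Series([100, 300, 300, 50])

# Default tie handling: tied values share the average of their positions
avg_rank = sales.rank(ascending=False)

# Alternative: tied values all receive the lowest position in the group
min_rank = sales.rank(ascending=False, method="min")

print(avg_rank.tolist())  # → [3.0, 1.5, 1.5, 4.0]
print(min_rank.tolist())  # → [3.0, 1.0, 1.0, 4.0]
```

Using `method='min'` would give whole-number ranks if that reads better on a chart.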


# **Data Visualization**

In [None]:
# Scatter plots with regression lines

# Store Area vs. Store Sales with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Store_Area', y='Store_Sales', data=df, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.title('Store Area vs. Store Sales with Regression Line')
plt.xlabel('Store Area')
plt.ylabel('Store Sales')
plt.show()

# Customer Count vs. Store Sales with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Daily_Customer_Count', y='Store_Sales', data=df, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.title('Customer Count vs. Store Sales with Regression Line')
plt.xlabel('Daily Customer Count')
plt.ylabel('Store Sales')
plt.show()

# Store Area vs. Customer Count with regression line
plt.figure(figsize=(10, 6))
sns.regplot(x='Store_Area', y='Daily_Customer_Count', data=df, scatter_kws={'s':10}, line_kws={'color':'red'})
plt.title('Store Area vs. Customer Count with Regression Line')
plt.xlabel('Store Area')
plt.ylabel('Daily Customer Count')
plt.show()
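
`sns.regplot` draws an ordinary least-squares line by default but does not report its coefficients. One way to recover a slope and intercept is a degree-1 `np.polyfit`, sketched below on invented points that lie exactly on a line so the result is easy to check.

```python
import numpy as np

# Invented points lying exactly on sales = 20 * area + 30000
area = np.array([1000.0, 1500.0, 2000.0, 2500.0])
sales = np.array([50000.0, 60000.0, 70000.0, 80000.0])

# A degree-1 fit returns the coefficients highest power first: [slope, intercept]
slope, intercept = np.polyfit(area, sales, deg=1)

print(f"slope: {slope:.2f}, intercept: {intercept:.2f}")
```

On the real data, the same call with `df['Store_Area']` and `df['Store_Sales']` would give an estimate of how much sales change per additional unit of store area.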

In [None]:
# Sales distribution

# Histogram of Store Sales
plt.figure(figsize=(10, 6))
sns.histplot(df_cleaned['Store_Sales'], bins=20, kde=True)
plt.title('Distribution of Store Sales')
plt.xlabel('Store Sales')
plt.ylabel('Frequency')
plt.show()