# Exploratory Data Analysis Project - Retail

## Introduction

A supermarket is a self-service shop offering a wide variety of food, beverages and household products, organized into sections. This kind of store is larger and has a wider selection than earlier grocery stores, but is smaller and more limited in its range of merchandise than a hypermarket or big-box market. In everyday usage, however, *"grocery store"* is synonymous with supermarket, and is not used to refer to other types of stores that sell groceries.

<a href="https://imgbb.com/"><img src="https://i.ibb.co/kMwMyk6/super.jpg" alt="super" border="0"></a><br />

----
### IMPORTING LIBRARIES
The following Python libraries are required:

- `pandas` for working with data in tabular form.
- `numpy` for numerical operations such as rounding the correlation matrix.
- `warnings` for suppressing warning messages.
- `matplotlib`, `seaborn` and `plotly` for data visualization.

---

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np
import pandas as pd
import seaborn as sns
import plotly.express as px
import matplotlib.pyplot as plt
import matplotlib
%matplotlib inline

import warnings
warnings.simplefilter(action="ignore")

# Input data files are available in the read-only "../input/" directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
df = pd.read_csv('/kaggle/input/tsf-datasets/SampleSuperstore.csv')
print('Data loaded successfully.')

----
Now that the dataset is loaded, let's examine its structure.

----

In [None]:
df.head()

In [None]:
print(f'Shape of our dataframe: {df.shape}')

In [None]:
df.info()

In [None]:
df.isna().sum()

----
The dataframe has no NaN values. Let's look at its summary statistics using `describe()`.
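
To make the missing-value check concrete, here is a minimal sketch on a small hypothetical frame (the `toy` data is invented for illustration and is not the Superstore file). It only shows how `isna().sum()` reports missing cells per column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Sales": [10.0, np.nan, 30.0],
    "Region": ["West", "East", "West"],
})

# isna() marks each missing cell as True; sum() counts them per column
missing_per_column = toy.isna().sum()
print(missing_per_column)
```

A column with a count of zero, as in every column of our dataframe, needs no imputation or row removal.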

----

In [None]:
df.describe()

In [None]:
df.describe(include='object')

In [None]:
df.duplicated().sum()

----
The dataframe has 17 duplicate rows. Let's remove them first using `drop_duplicates()`.

----

In [None]:
df.drop_duplicates(inplace=True)

In [None]:
df.shape

----
The shape has been reduced to `(9977, 13)`.
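
The duplicate check and removal can be sketched on a tiny hypothetical frame (the `toy` rows below are invented). The point is only to show what `duplicated()` counts and what `drop_duplicates()` keeps.

```python
import pandas as pd

# Hypothetical toy data with one fully repeated row
toy = pd.DataFrame({
    "City": ["Seattle", "Austin", "Seattle"],
    "Sales": [20.0, 15.0, 20.0],
})

# duplicated() flags every repeat after the first occurrence
n_duplicates = toy.duplicated().sum()

# drop_duplicates() keeps the first occurrence of each row
deduped = toy.drop_duplicates()
print(n_duplicates, deduped.shape)
```

The row count drops by exactly the number of flagged duplicates, which is the same relationship we see between the duplicate count and the new shape of our dataframe.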

Next, we will count the unique values in every column using `value_counts()`.

---

In [None]:
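# Avoid naming the variable `dict`, which would shadow the Python built-in
unique_counts = {}
for x in list(df.columns):
    unique_counts[x] = df[x].value_counts().shape[0]

pd.DataFrame(unique_counts, index=["Unique Counts"]).transpose()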
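
As a possible shortcut, pandas also offers `nunique()`, which should give the same per-column counts as the loop. The sketch below compares both approaches on a hypothetical `toy` frame (invented data), so treat it as an illustration rather than a replacement for the cell above.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Segment": ["Consumer", "Corporate", "Consumer"],
    "Quantity": [2, 3, 2],
})

# Loop-style approach: count the distinct entries returned by value_counts()
via_value_counts = {col: toy[col].value_counts().shape[0] for col in toy.columns}

# Built-in shortcut: nunique() returns the distinct count per column
via_nunique = toy.nunique()

print(via_nunique.to_frame("Unique Counts"))
```

Both approaches ignore missing values by default, so they agree whenever the defaults are left unchanged.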

In [None]:
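# Correlate numeric columns only; pandas 2.0+ raises an error if string columns are included
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True, cmap='Blues', linewidths=8, linecolor='white');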

----
#### **Analysis:**
We can see that `Sales` is positively correlated with `Profit`, while `Discount` is negatively correlated with `Profit`.

**Interpreted as:** higher sales and smaller discounts tend to go with `more profit`.

It is also noticeable that `Postal Code` has essentially no relationship with overall `Profit`.
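
To show how the sign of a correlation is read, here is a minimal sketch on hypothetical data (the `toy` values are invented so that profit rises with sales and falls with discount). It restricts the calculation to numeric columns with `select_dtypes`, which is one way to keep text columns out of `corr()`.

```python
import pandas as pd

# Hypothetical toy data: profit rises with sales and falls with discount
toy = pd.DataFrame({
    "Sales": [100.0, 200.0, 300.0, 400.0],
    "Discount": [0.4, 0.3, 0.1, 0.0],
    "Profit": [-10.0, 5.0, 40.0, 80.0],
    "Region": ["West", "East", "South", "West"],  # non-numeric, excluded below
})

# Keep only numeric columns so corr() does not have to handle strings
corr = toy.select_dtypes(include="number").corr()

# Pearson coefficients lie in [-1, 1]; the sign gives the direction
print(corr["Profit"].round(2))
```

A coefficient describes how two columns move together; on its own it does not show that discounts cause lower profit.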

---

## **Univariate Analysis**

Since univariate analysis deals with one variable at a time, we will plot the frequency of values in most of the columns using `subplots`.

In [None]:
# Set the theme before creating the figure so it applies to these axes
sns.set_theme(style="darkgrid")
fig, axes = plt.subplots(3, 2, figsize=(12, 18))

# (column, palette) pairs, one per subplot, filled row by row
plots = [
    ('Ship Mode', 'copper'),
    ('Segment', 'copper_r'),
    ('Region', 'winter'),
    ('Category', 'winter_r'),
    ('Sub-Category', 'cividis'),
    ('Quantity', 'cividis_r'),
]

# Passing only `x` gives vertical bars, so no `orient` argument is needed
for ax, (column, palette) in zip(axes.flat, plots):
    sns.countplot(x=df[column], palette=palette, ax=ax)
    ax.set_title(column)

# Rotate the long sub-category labels so they do not overlap
axes[2, 0].tick_params(axis='x', rotation=90)

plt.tight_layout(pad=2);

---
#### **Analysis:**

* `Standard Class` is the most frequently used ship mode.
* `Consumer` is the largest segment.
* Most orders come from the `West` region and the fewest from the `South`.
* By category, `Office Supplies` accounts for the most orders.
* The most frequently ordered sub-categories are `Binders` and `Paper`.
* People mostly buy a quantity of 2 or 3.
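
The count plots show which value is most common, and `value_counts(normalize=True)` can put a number on it. The sketch below uses a hypothetical `toy` frame, so the printed shares are not the real Superstore proportions.

```python
import pandas as pd

# Hypothetical toy data; not the real Superstore proportions
toy = pd.DataFrame({
    "Ship Mode": ["Standard Class", "Standard Class", "Second Class", "Standard Class"],
})

# normalize=True returns each category's share of rows instead of a raw count
shares = toy["Ship Mode"].value_counts(normalize=True)
print(shares)
```

The result is sorted from most to least frequent, so the first index entry is the dominant category.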

---

## **Bivariate Analysis**

We will compare the other features against `Profit`, `Sales` and `Quantity` to get a visual idea of what affects profit most.
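
Every comparison in this section starts from the same grouped totals. As a sketch, that step could be wrapped in a small helper. The name `summarize_by` and the `toy` data are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_by(frame, column):
    """Total Profit, Sales and Quantity for each value of `column` (hypothetical helper)."""
    return frame.groupby(column)[["Profit", "Sales", "Quantity"]].sum()

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Ship Mode": ["First Class", "Standard Class", "Standard Class"],
    "Profit": [5.0, 10.0, -2.0],
    "Sales": [50.0, 80.0, 20.0],
    "Quantity": [1, 3, 2],
})

summary = summarize_by(toy, "Ship Mode")
print(summary)
```

Calling it with `"Segment"`, `"Region"` or `"Category"` would produce the same kind of table as the per-feature cells below.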

## 1. Ship-Mode

In [None]:
ship_df = pd.DataFrame(df.groupby(['Ship Mode'])[['Profit','Sales', 'Quantity']].sum())
ship_df

In [None]:
fig, axes = plt.subplots(3,1, figsize=(8,14))
sns.set_theme(style="darkgrid")
axes[0].set_title("Ship mode to Profit")
axes[1].set_title("Ship mode to Sales")
axes[2].set_title("Ship mode to Quantity")

sns.barplot(x=ship_df.index,
           y=ship_df['Profit'],
           data= ship_df,
            palette = 'Spectral_r',
           ax = axes[0]);

sns.barplot(x=ship_df.index,
           y=ship_df['Sales'],
           data= ship_df,
            palette = 'Spectral_r',
           ax = axes[1]);

sns.barplot(x=ship_df.index,
           y=ship_df['Quantity'],
           data= ship_df,
            palette = 'Spectral_r',
           ax = axes[2])

plt.tight_layout(pad=2);

---
**Analysis:**

`Standard Class` is the most preferred ship mode, perhaps because it is cheap and efficient.
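
The same three-panel figure is drawn for each feature in this section. A hypothetical helper like `plot_totals` below could cut that repetition. Note that this sketch uses plain matplotlib `ax.bar` instead of `sns.barplot`, and the `toy_summary` numbers are invented.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_totals(summary, label):
    """Draw one bar chart per column of a grouped-totals frame (hypothetical helper)."""
    n = len(summary.columns)
    # squeeze=False always returns a 2-D array of axes, even when n == 1
    fig, axes = plt.subplots(n, 1, figsize=(8, 4 * n), squeeze=False)
    axes = axes[:, 0]
    for ax, metric in zip(axes, summary.columns):
        ax.bar(summary.index.astype(str).tolist(), summary[metric].to_numpy())
        ax.set_title(f"{label} vs {metric}")
    fig.tight_layout(pad=2)
    return fig, axes

# Toy grouped totals; the numbers are invented for illustration
toy_summary = pd.DataFrame(
    {"Profit": [5.0, 8.0], "Sales": [50.0, 100.0], "Quantity": [1, 5]},
    index=pd.Index(["First Class", "Standard Class"], name="Ship Mode"),
)
fig, axes = plot_totals(toy_summary, "Ship Mode")
```

The helper takes any grouped table with the feature as its index, so it would work for segment, region or category totals as well.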

---

## 2. Segment

In [None]:
segment_df = pd.DataFrame(df.groupby(['Segment'])[['Profit', 'Sales', 'Quantity']].sum())
segment_df

In [None]:
fig, axes = plt.subplots(1,3, figsize=(12,6))

sns.set_theme(style="darkgrid")
axes[0].set_title("Segment to Profit")
axes[1].set_title("Segment to Sales")
axes[2].set_title("Segment to Quantity")

sns.barplot(x=segment_df.index,
           y=segment_df['Profit'],
           data= segment_df,
            palette = 'Reds_r',
           ax= axes[0])

sns.barplot(x=segment_df.index,
           y=segment_df['Sales'],
           data= segment_df,
            palette = 'Reds_r',
           ax= axes[1])

sns.barplot(x=segment_df.index,
           y=segment_df['Quantity'],
           data= segment_df,
            palette = 'Reds_r',
           ax= axes[2])

plt.tight_layout(pad=2);

---
**Analysis:**

The `Consumer` segment is the most profitable, followed by the Corporate and Home Office segments. Hence, the marketing strategy should focus on retaining `Consumer` customers.
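
Totals favor segments with more rows, so it can help to look at the per-row average next to the sum. The sketch below is an optional cross-check on hypothetical data. The `toy` values are invented and say nothing about the real ranking.

```python
import pandas as pd

# Hypothetical toy data: one segment has many small rows, the other one large row
toy = pd.DataFrame({
    "Segment": ["Consumer", "Consumer", "Consumer", "Home Office"],
    "Profit": [10.0, 20.0, 30.0, 40.0],
})

# Named aggregation: total, per-row average and row count for each segment
per_segment = toy.groupby("Segment")["Profit"].agg(
    total="sum", mean="mean", rows="count"
)
print(per_segment)
```

In this toy case the segment with the highest total is not the one with the highest average, which is why both columns are worth a look.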

---

## 3. Region

In [None]:
region_df = pd.DataFrame(df.groupby(['Region'])[['Profit', 'Sales', 'Quantity']].sum())
region_df

In [None]:
fig, axes = plt.subplots(1,3, figsize=(14,5))

sns.set_theme(style="darkgrid")
axes[0].set_title("Region vs Profit")
axes[1].set_title("Region vs Sales")
axes[2].set_title("Region vs Quantity")

sns.barplot(x=region_df.index,
           y='Profit',
           data=region_df,
           palette='Paired',
           ax=axes[0])

sns.barplot(x=region_df.index,
           y='Sales',
           data=region_df,
           palette='Paired',
           ax=axes[1])

sns.barplot(x=region_df.index,
           y='Quantity',
           data=region_df,
           palette='Paired',
           ax=axes[2])

plt.tight_layout(pad=1);

---
**Analysis:**

Of all the regions, the West and East recorded the most profit, so the strategy should focus more on the `East` and `West` regions.

---

## 4. Category

In [None]:
category_df = pd.DataFrame(df.groupby(['Category'])[['Profit', 'Sales', 'Quantity']].sum())
category_df

In [None]:
fig, axes = plt.subplots(1,3, figsize=(14,5))

sns.set_theme(style="darkgrid")
axes[0].set_title("Category vs Profit")
axes[1].set_title("Category vs Sales")
axes[2].set_title("Category vs Quantity")

sns.barplot(x=category_df.index,
           y='Profit',
           data=category_df,
           palette='Pastel2',
           ax=axes[0])

sns.barplot(x=category_df.index,
           y='Sales',
           data=category_df,
           palette='Pastel2',
           ax=axes[1])

sns.barplot(x=category_df.index,
           y='Quantity',
           data=category_df,
           palette='Pastel2',
           ax=axes[2])

plt.tight_layout(pad=1);

---
**Analysis:**

Although its quantity sold is lower, `Technology` has the highest sales as well as the highest profit. To increase profit, the focus should be more on Technology.
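
One way to compare categories independently of their size is profit per unit of sales. The `Margin` column in the sketch below is an assumption added for illustration. It is not computed in the original notebook, and the `toy` numbers are invented.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "Category": ["Furniture", "Technology", "Technology", "Office Supplies"],
    "Profit": [5.0, 30.0, 20.0, 9.0],
    "Sales": [100.0, 150.0, 100.0, 60.0],
})

totals = toy.groupby("Category")[["Profit", "Sales"]].sum()

# Margin = total profit divided by total sales for the category
totals["Margin"] = totals["Profit"] / totals["Sales"]
print(totals.sort_values("Margin", ascending=False))
```

A category can have high sales and still a thin margin, so this ratio complements the bar charts above.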

---

## 5. Sub-Category

In [None]:
sub_category_df = pd.DataFrame(df.groupby(['Sub-Category'])[['Profit', 'Sales', 'Quantity']].sum())
sub_category_df

In [None]:
fig, axes = plt.subplots(3,1, figsize=(10,18))

sns.set_theme(style="darkgrid")
axes[0].set_title("Sub-Category vs Profit")
axes[1].set_title("Sub-Category vs Sales")
axes[2].set_title("Sub-Category vs Quantity")

sns.barplot(y=sub_category_df.index,
           x='Profit',
           data=sub_category_df,
           palette='icefire',
           ax=axes[0])

sns.barplot(y=sub_category_df.index,
           x='Sales',
           data=sub_category_df,
           palette='icefire',
           ax=axes[1])

sns.barplot(y=sub_category_df.index,
           x='Quantity',
           data=sub_category_df,
           palette='icefire',
           ax=axes[2])

plt.tight_layout(pad=3);

---
**Analysis:**

From these graphs, we can say that `Copiers`, `Accessories` and `Phones` have higher sales and profit.
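
Reading the leaders and the loss-makers off a chart can be backed up by sorting the grouped table. The sketch below assumes a table shaped like `sub_category_df`. The `toy_totals` numbers are invented and do not reflect the real totals.

```python
import pandas as pd

# Hypothetical grouped totals; the numbers are invented for illustration
toy_totals = pd.DataFrame(
    {"Profit": [55.0, -17.0, 44.0, -3.0], "Sales": [150.0, 200.0, 330.0, 46.0]},
    index=pd.Index(["Copiers", "Tables", "Phones", "Supplies"], name="Sub-Category"),
)

# Rank by profit, highest first
ranked = toy_totals.sort_values("Profit", ascending=False)

# Sub-categories whose total profit is below zero
loss_makers = ranked[ranked["Profit"] < 0]

print(ranked.head(2).index.tolist())
print(loss_makers.index.tolist())
```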

---

## 6. Cities

In [None]:
cities_df = pd.DataFrame(df.groupby(['City'])[['Profit', 'Sales', 'Quantity']].sum().sort_values('Profit',ascending = False))
top10 = cities_df.head(10)
last10 = cities_df.tail(10)

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the supported replacement
high_low = pd.concat([top10, last10])
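
Stacking the best and worst rows of a sorted table can be sketched with `pd.concat`, which works across pandas versions (`DataFrame.append` was removed in pandas 2.0). The city names and profits below are placeholders, not real results.

```python
import pandas as pd

# Hypothetical sorted totals; "A" to "E" are placeholder city names
toy = pd.DataFrame(
    {"Profit": [50.0, 30.0, 10.0, -5.0, -20.0]},
    index=pd.Index(["A", "B", "C", "D", "E"], name="City"),
).sort_values("Profit", ascending=False)

top2 = toy.head(2)
last2 = toy.tail(2)

# pd.concat stacks the frames row-wise and keeps the City index
high_low_toy = pd.concat([top2, last2])
print(high_low_toy)
```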

In [None]:
fig, axes = plt.subplots(3,1, figsize=(12, 19))

axes[0].set_title("Profit of top 10 and bottom 10")
axes[1].set_title("Sales of top 10 and bottom 10")
axes[2].set_title("Quantity of top 10 and bottom 10")

sns.barplot(y=high_low.index,
           x='Profit',
           data=high_low,
           palette='muted',
           ax=axes[0])

sns.barplot(y=high_low.index,
           x='Sales',
           data=high_low,
           palette='muted',
           ax=axes[1])

sns.barplot(y=high_low.index,
           x='Quantity',
           data=high_low,
           palette='muted',
           ax=axes[2])

plt.tight_layout(pad=4);

---
**Analysis:**

* `New York` has the most sales and profit.
* Despite fairly high quantity and sales in `Philadelphia`, `Houston` and `Chicago`, profit in these cities is negative.
* There is a huge disparity between the cities with the highest and the lowest sales. The marketing strategy should target the `top 10 cities`.
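
Cities that sell a lot but still lose money can be picked out with a boolean filter. In the sketch below, the "above-median sales" threshold is an assumption chosen for illustration, and the city names and numbers are placeholders.

```python
import pandas as pd

# Hypothetical city totals; names and values are placeholders
toy = pd.DataFrame(
    {"Profit": [60.0, -14.0, -10.0, 2.0], "Sales": [250.0, 110.0, 65.0, 5.0]},
    index=pd.Index(["City A", "City B", "City C", "City D"], name="City"),
)

# Above-median sales combined with a negative total profit
high_sales = toy["Sales"] > toy["Sales"].median()
losing = toy["Profit"] < 0
flagged = toy[high_sales & losing]
print(flagged.index.tolist())
```

Applied to `cities_df`, a filter like this would list candidates for a closer look at discounts and product mix.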

---

# Final Conclusions:

Consider the following recommendations for higher profit:
* Focus on the Technology category, for example Phones, as it is the highest selling and most profitable. Bundle these products with less profitable ones, such as tables and supplies, to offset the losses.
* Selling bookcases, tables and supplies results in losses, so the Superstore should consider bundling them with high-selling or profitable sub-categories such as Machines, Copiers and Phones.
* Home Office customers might be busy with work and less likely to spend time selecting individual products, so creating a Home Office catalog of office products such as paper, chairs, phones, copiers, storage and machines could result in better profits.
* Target Consumer customers in the East and West regions, in the top 10 most profitable cities, with special promotions and advertisements for copiers, phones, accessories etc.

___

## Thank you :)
