##### Supermarket store branches sales analysis

<img src='https://static.vecteezy.com/system/resources/previews/002/035/088/original/supermarket-grocery-store-in-city-flat-illustration-vector.jpg' width='500'/>

Supermarkets are stores that offer “a wide variety of food and household products such as meat, fresh produce, dairy, canned and packaged goods, household cleaners, pharmacy products and pet supplies.”

Here we want to examine how well our company's stores are selling so we can give managers useful feedback for gauging sales performance.

###### Importing the libraries

In [None]:
# data operation libraries

import numpy as np
import pandas as pd

# visualization libraries

import matplotlib.pyplot as plt
import seaborn as sns

# for Q-Q plot

import scipy.stats as stats

###### Importing the dataset

In [None]:
dataset = pd.read_csv('../input/stores-area-and-sales-data/Stores.csv')

dataset.shape

In [None]:
dataset.head()

In [None]:
print('There are {} observations in our dataset'.format(len(dataset)))
print('There are {} unique store IDs in the dataset'.format(dataset['Store ID '].nunique()))

We can see that there are 896 unique store IDs, meaning our supermarket company has 896 branches situated in different parts of the US.

Let's first understand the attributes we have been provided.

**Store ID**: (Index) ID of the particular store.

**Store_Area**: Physical area of the store in square yards.

**Items_Available**: Number of different items available in the corresponding store.

**Daily_Customer_Count**: Number of customers who visited the store on average over a month.

**Store_Sales**: Sales (in US $) that the store made.
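Note that the ID column is referenced as `'Store ID '` with a trailing space throughout this notebook. A minimal sketch of how such header whitespace could be stripped is shown below. The small frame is a hypothetical stand-in for the real CSV, and the rest of the notebook keeps the original column name, so this step is optional.

```python
import pandas as pd

# Hypothetical miniature frame that mimics the header of the dataset,
# including the trailing space in 'Store ID '. The values are made up.
sample = pd.DataFrame({
    'Store ID ': [1, 2, 3],
    'Store_Area': [1500, 1400, 1300],
    'Items_Available': [1800, 1700, 1600],
    'Daily_Customer_Count': [500, 200, 700],
    'Store_Sales': [60000, 40000, 50000],
})

# Strip surrounding whitespace from every column name.
cleaned = sample.rename(columns=str.strip)

print(list(sample.columns))
print(list(cleaned.columns))
```

If this cleaning were applied to `dataset`, every later reference to `'Store ID '` would need to drop the trailing space as well.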

###### Data Analysis

<img src='https://miro.medium.com/max/875/1*7w7c8yS70eHR74qgBJJu8Q.gif' width='300'/>

Through data analysis we can understand the data:

> Maximize insight into a dataset

> Uncover underlying structure

> Extract important variables

> Detect outliers and anomalies

> Test underlying assumptions

> Determine optimal factor settings

In [None]:
# first understand the data structure

dataset.info()

In [None]:
dataset.describe()

In [None]:
sns.pairplot(dataset, diag_kind='kde')

###### Indications from above chart:

From the diagonal plots we can see that our variables look roughly bell-shaped and concentrated around the centre, which suggests they are approximately normally distributed. We can also see that Store_Area and Items_Available have a strong linear relationship.
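The visual impressions above can be backed up with numbers. The sketch below uses synthetic stand-ins for Store_Area and Items_Available (the generated values and the 1.2 slope are assumptions for illustration only) and computes the skewness, a Shapiro-Wilk test and the Pearson correlation with `scipy.stats`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two columns; not the real data.
area = rng.normal(1500, 250, size=500)
items = 1.2 * area + rng.normal(0, 20, size=500)

# Skewness close to 0 is consistent with a symmetric, bell-shaped variable.
skewness = stats.skew(area)

# Shapiro-Wilk: a small p-value would be evidence against normality.
shapiro_stat, shapiro_p = stats.shapiro(area)

# Pearson r close to 1 indicates a strong positive linear relationship.
r, r_p = stats.pearsonr(area, items)

print('skewness:', skewness)
print('Shapiro-Wilk p-value:', shapiro_p)
print('Pearson r:', r)
```

On the real data the same calls could be applied to `dataset['Store_Area']` and `dataset['Items_Available']`.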

In [None]:
# let's examine them closely

sns.regplot(x='Store_Area',y='Items_Available', data=dataset)
plt.show()

We generally don't want features that are strongly correlated with each other; we will address this in our feature engineering part.

In [None]:
# now, let's check the distribution of our target

sns.displot(data=dataset, x='Store_Sales')
plt.show()

The histogram suggests that our target is approximately normally distributed.

In [None]:

fig, ax = plt.subplots(figsize=(8, 6))

tmp_df = dataset.copy()

tmp_df = tmp_df.sort_values(by='Store_Sales', ascending=False)[:10]
sns.barplot(x=tmp_df['Store_Sales'], y=tmp_df['Store ID '], orient='h', data= tmp_df,order=[650, 869, 433, 409, 759, 558, 867, 167, 693, 872],
           capsize=0.2, ax=ax)
lbs = ['Sale: 116320 \n Daily Customers: 860','Sale: 105150 \n Daily Customers: 980','Sale: 102920 \n Daily Customers: 680','Sale: 102310 \n Daily Customers: 1310','Sale: 101820 \n Daily Customers: 820','Sale: 101780 \n Daily Customers: 700','Sale: 100900 \n Daily Customers: 900','Sale: 99570 \n Daily Customers: 680',
        'Sale: 99480 \n Daily Customers: 480','Sale: 98260 \n Daily Customers: 1100']
ax.bar_label(ax.containers[-1], labels=lbs,label_type='center')

# the bars are horizontal: sales run along the x-axis, store IDs along the y-axis
plt.xlabel('Store Sales')
plt.ylabel('Store Ids')
plt.title('Stores with highest sales')

plt.show()

###### Indications from above chart:

We can see that stores 409 and 872 have a comparatively higher number of customers coming in than the other stores, yet their sales are only the 4th and 10th highest, respectively.

We can say that these stores either have a low conversion rate or the products their customers buy are low-priced.

As analysts, we can use this information to schedule the right number of sales associates for busy and slow times. If we had further information on dates or months, we could track conversion rates against sales data and ensure that each store is designed in a way that facilitates customer engagement.
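One rough way to quantify this observation is to divide sales by the daily customer count. The sketch below rebuilds the ten stores from the bar labels above; the `Sales_per_Customer` column is a hypothetical helper and only a crude proxy, since the dataset does not say what period the sales figure covers.

```python
import pandas as pd

# Store IDs, sales and daily customer counts copied from the chart labels above.
top = pd.DataFrame({
    'Store ID ': [650, 869, 433, 409, 759, 558, 867, 167, 693, 872],
    'Store_Sales': [116320, 105150, 102920, 102310, 101820,
                    101780, 100900, 99570, 99480, 98260],
    'Daily_Customer_Count': [860, 980, 680, 1310, 820, 700, 900, 680, 480, 1100],
})

# Hypothetical proxy: sales generated per daily customer.
top['Sales_per_Customer'] = top['Store_Sales'] / top['Daily_Customer_Count']

# Stores with the lowest ratio come first.
ranked = top.sort_values('Sales_per_Customer')
print(ranked[['Store ID ', 'Sales_per_Customer']].round(1))
```

Among these ten stores, 409 and 872 come out with the lowest ratio, which matches the reading of the chart.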


In [None]:

fig, ax = plt.subplots(figsize=(8, 6))

tmp_df = dataset.copy()

tmp_df = tmp_df.sort_values(by='Store_Area', ascending=False)[:10]
sns.barplot(x=tmp_df['Store_Area'], y=tmp_df['Store ID '], orient='h', data= tmp_df,order=[467, 541,  92, 850, 399, 551, 629, 164, 568, 312],
           capsize=0.2, ax=ax)
lbs = ['2229 square yard \n Items available: 2667','2214 square yard \n Items available: 2647','2169 square yard \n Items available: 2617','2067 square yard \n Items available: 2492','2063 square yard \n Items available: 2493','2049 square yard \n Items available: 2465','2044 square yard \n Items available: 2408','2044 square yard \n Items available: 2474',
        '2026 square yard \n Items available: 2400','2019 square yard \n Items available: 2396']
ax.bar_label(ax.containers[-1], labels=lbs,label_type='center')

# the bars are horizontal: area runs along the x-axis, store IDs along the y-axis
plt.xlabel('Store Area')
plt.ylabel('Store Ids')
plt.title('Stores with highest Area')

plt.show()

###### Indication from above chart:

We can see that as the area of the store decreases, the number of available items in the store decreases as well. Store 467 has the largest area of all branches, yet it is not in our top 10 sales list.
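The `order` list and the bar labels in the two charts above are typed in by hand, which is easy to get out of sync with the data. A minimal sketch of deriving both from the frame itself is shown below; the `stores` frame is synthetic and only mimics the column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic frame with the same column names as the dataset; values are made up.
stores = pd.DataFrame({
    'Store ID ': np.arange(1, 51),
    'Store_Area': rng.integers(800, 2300, size=50),
    'Items_Available': rng.integers(900, 2700, size=50),
})

# The ten largest stores, already sorted from largest to smallest area.
top10 = stores.nlargest(10, 'Store_Area')

# Bar order and labels built from the rows instead of being hard-coded.
order = top10['Store ID '].tolist()
labels = [
    f'{area} square yards \n Items available: {items}'
    for area, items in zip(top10['Store_Area'], top10['Items_Available'])
]

print(order)
print(labels[0])
```

These two lists could then be passed as `order=order` to `sns.barplot` and `labels=labels` to `ax.bar_label`.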

###### Identifying Patterns with Hierarchical Clustering

<img src='https://miro.medium.com/max/800/0*eub1-NMmFZ1WlJVQ.jpg' width='300'/>

In [None]:
dataset.head()

In [None]:
# I will only use two variables: Daily_Customer_Count and Store_Sales

x = dataset[['Daily_Customer_Count','Store_Sales']].values

In [None]:
import scipy.cluster.hierarchy as sch

In [None]:
plt.figure(figsize=(8,5))
dendrogram = sch.dendrogram(sch.linkage(x,method='ward'))
ax = plt.gca()
ax.axes.xaxis.set_ticks([])
plt.title('Dendrogram')
plt.xlabel('Data Points')
plt.ylabel('Euclidean Distance')
plt.show()

In [None]:
from sklearn.cluster import AgglomerativeClustering

In [None]:
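# 'ward' linkage always uses Euclidean distance; the `affinity` argument was
# removed in recent scikit-learn releases, so it is omitted here
hc = AgglomerativeClustering(n_clusters=2, linkage='ward')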

y_hc = hc.fit_predict(x)

In [None]:
y_hc

In [None]:
plt.scatter(x[y_hc==0,0],x[y_hc==0,1], c='green',s=100,label='cluster 1')
plt.scatter(x[y_hc==1,0],x[y_hc==1,1], c='orange',s=100,label='cluster 2')
plt.legend(loc='upper right', bbox_to_anchor=(1.26,1))
plt.axhline(y=60000, color='red')
plt.axvline(x=800, color='red')

plt.title('Store Segmentation')
plt.xlabel('Daily customers')
plt.ylabel('Store Sales (US $)')
plt.show()

###### Indication from above chart

To interpret the clusters better, I decided to use quadrants, which divide our clusters into four parts.

**bottom left quadrant**: consists of store IDs which have a low count of daily customers and low sales.

**upper left quadrant**: consists of store IDs which have a low count of daily customers and still manage to have high sales.

**bottom right quadrant**: consists of store IDs which have a high count of daily customers and low sales. As business strategists, we can draw the management's attention to these branches so that they can investigate the reason behind the low sales and address it appropriately.

**upper right quadrant**: consists of store IDs which have a high count of walk-in customers and a high amount of sales.

The stores that require our attention are the ones that fall in the bottom right quadrant. One reason can be that the company is not marketing to the right audience. While it may seem best to market to a broad audience to reach more people, it’s best to narrow your focus on a precise demographic.

If the company hasn’t identified which products will connect with the audience, it risks feeling generic and losing the chance to connect emotionally with potential customers. In order to attract the right audience, we need to consider these initial questions, posed by Volusion:

> What is the company selling?

> Who will the product serve, connect with, and appeal to?

> How will the product be used?

**Bottom line**: just because the products aren’t selling doesn’t mean the products are to blame. The company's lack of transactions could also be the result of backend oversights that, once detected, are easy to solve.
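The quadrant reading above can also be turned into an explicit label per store. The sketch below uses the same thresholds as the red lines in the chart (800 daily customers, 60000 in sales) on a tiny made-up frame; the `Quadrant` column name is a hypothetical helper.

```python
import numpy as np
import pandas as pd

# Four made-up stores, one per quadrant.
stores = pd.DataFrame({
    'Daily_Customer_Count': [300, 500, 1000, 1200],
    'Store_Sales': [40000, 80000, 45000, 90000],
})

# Same cut-offs as the red lines drawn on the scatter plot.
high_customers = stores['Daily_Customer_Count'] >= 800
high_sales = stores['Store_Sales'] >= 60000

stores['Quadrant'] = np.select(
    [
        ~high_customers & ~high_sales,
        ~high_customers & high_sales,
        high_customers & ~high_sales,
    ],
    ['bottom left', 'upper left', 'bottom right'],
    default='upper right',
)

# Stores with many visitors but low sales are the ones to investigate.
needs_attention = stores[stores['Quadrant'] == 'bottom right']
print(stores)
```

Applied to `dataset`, filtering on `'bottom right'` would give the list of branches to hand over to management.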

In [None]:
y_hc = pd.Series(y_hc,name='Cluster')
x = dataset[['Daily_Customer_Count','Store_Sales']]
tmp_df = pd.concat([x,y_hc], axis=1)

# NOTE: cluster numbers are arbitrary; this mapping assumes cluster 0 is the low-sales group
tmp_df['Outcome'] = np.where(tmp_df['Cluster']==0,'Low Sale','High Sale')

In [None]:
tmp_df['Outcome'].value_counts()

In [None]:
tmp = tmp_df['Outcome'].value_counts()

In [None]:
plt.figure(figsize=(6,6))
tmp.plot.pie(autopct = "%1.0f%%", colors = sns.color_palette('winter_r'), shadow = True)
c = plt.Circle((0, 0), 0.4, color = 'white')
plt.gca().add_artist(c)
plt.title('Cluster outcome: Low Sale vs High Sale')
plt.show()
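The cluster numbers returned by `fit_predict` carry no meaning of their own, so treating cluster 0 as 'Low Sale' is an assumption that should be checked. A minimal sketch of deriving the mapping from the cluster means is shown below, on synthetic sales values that are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)

# Two clearly separated synthetic groups of sales values.
low = rng.normal(40000, 3000, size=30)
high = rng.normal(80000, 3000, size=30)
frame = pd.DataFrame({'Store_Sales': np.concatenate([low, high])})

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
frame['Cluster'] = model.fit_predict(frame[['Store_Sales']])

# Find out which cluster number has the higher average sales.
cluster_means = frame.groupby('Cluster')['Store_Sales'].mean()
high_cluster = cluster_means.idxmax()

# Label by comparison with that cluster instead of a hard-coded 0 or 1.
frame['Outcome'] = np.where(frame['Cluster'] == high_cluster, 'High Sale', 'Low Sale')
print(frame['Outcome'].value_counts())
```

With this approach the labels stay correct even if a different run or library version swaps the cluster numbers.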

###### Predicting store sales with a regression algorithm

###### splitting the dataset

In [None]:
!pip install feature_engine

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
x_train, x_test, y_train, y_test = train_test_split(dataset.drop('Store_Sales', axis=1),
                                                   dataset['Store_Sales'],
                                                   test_size=0.2,
                                                   random_state=0)

x_train.shape, x_test.shape

We saw that the store ID is unique for each observation, so it carries no predictive information and we don't want to include it in our model.

In [None]:
x_train.drop('Store ID ', axis=1, inplace=True)
x_test.drop('Store ID ', axis=1, inplace=True)

###### understanding the variable nature

In [None]:
from feature_engine.selection import SmartCorrelatedSelection

In [None]:
correlated = SmartCorrelatedSelection(selection_method='variance')
correlated.fit(x_train)

In [None]:
correlated.correlated_feature_sets_

In [None]:
x_train['Items_Available'].var()

In [None]:
x_train['Store_Area'].var()

We want features that are correlated with the target and not correlated to each other. So I will drop the one with low variance and keep the one which has high variance.
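The idea behind this step can be sketched by hand with pandas. The code below is only an illustration of the principle, not feature_engine's actual implementation: the data is synthetic and the 0.8 correlation threshold is an assumption chosen for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)

# Synthetic features: Items_Available is built to track Store_Area closely.
area = rng.normal(1500, 250, size=300)
features = pd.DataFrame({
    'Store_Area': area,
    'Items_Available': 1.2 * area + rng.normal(0, 20, size=300),
    'Daily_Customer_Count': rng.normal(800, 250, size=300),
})

threshold = 0.8  # assumed cut-off for "highly correlated"
corr = features.corr().abs()
cols = list(features.columns)
to_drop = set()

# For every highly correlated pair, mark the lower-variance feature for removal.
for i, first in enumerate(cols):
    for second in cols[i + 1:]:
        if corr.loc[first, second] > threshold:
            weaker = first if features[first].var() < features[second].var() else second
            to_drop.add(weaker)

selected = features.drop(columns=sorted(to_drop))
print('dropped:', sorted(to_drop))
print('kept:', list(selected.columns))
```

In this synthetic setup Store_Area has the lower variance of the correlated pair, so it is the one that gets dropped.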

In [None]:
x_train = correlated.transform(x_train)
x_test = correlated.transform(x_test)

In [None]:
x_train.shape, x_test.shape

In [None]:
x_train.columns

In [None]:
# let's make a list of numeric variables

discrete_var = [var for var in x_train.columns if x_train[var].nunique() < 15]
continuous_var = [var for var in x_train.columns if var not in discrete_var]

print('There are {} discrete variables'.format(len(discrete_var)))
print('There are {} continuous variables'.format(len(continuous_var)))

We have two variables and both of them are continuous. Let's check their distribution and if they have outliers.

In [None]:
def diagnostic_plot(df, var):
    fig = plt.figure(figsize=(15,4))
    
    plt.subplot(1,3,1)
    df[var].hist(bins=50)
    plt.title('Distribution of {}'.format(var))
    
    plt.subplot(1,3,2)
    stats.probplot(df[var], dist='norm', plot=plt)
    plt.ylabel('{} quantiles'.format(var))
    
    plt.subplot(1,3,3)
    sns.boxplot(y=df[var])
    plt.title('Boxplot')
    
    plt.show()

In [None]:
for var in continuous_var:
    diagnostic_plot(x_train, var)

###### Indication from the above chart:

From the histogram and Q-Q plot we can see that our variables approximately follow a Gaussian distribution.

From the boxplot we can see that our variables have a few outliers.

###### Outliers

In [None]:
def find_gaussian_boundries(df,var):
    
    lower_boundry = df[var].mean() - 3 * df[var].std()
    upper_boundry = df[var].mean() + 3 * df[var].std()
    
    return lower_boundry, upper_boundry

In [None]:
print('There are {} observations in our trainset'.format(len(x_train)))
print()
for var in continuous_var:

    lower_boundary, upper_boundary = find_gaussian_boundries(x_train, var)

    # count the observations that fall outside each boundary
    n_lower = (x_train[var] < lower_boundary).sum()
    n_upper = (x_train[var] > upper_boundary).sum()

    print('Number of observations in {} having values lower than lower boundary: {}'.format(var, n_lower))
    print('% of observations in {} having values lower than lower boundary: {:.2%}'.format(var, n_lower / len(x_train)))
    print()

    print('Number of observations in {} having values more than upper boundary: {}'.format(var, n_upper))
    print('% of observations in {} having values more than upper boundary: {:.2%}'.format(var, n_upper / len(x_train)))
    print()

In [None]:
from feature_engine.outliers import Winsorizer

In [None]:
outlier_capper = Winsorizer(capping_method='gaussian',
                            tail='both',
                            fold=3,
                            variables=continuous_var)

outlier_capper.fit(x_train)

In [None]:
x_train = outlier_capper.transform(x_train)
x_test = outlier_capper.transform(x_test)
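The capping step above can be pictured as follows: the boundaries (mean plus or minus three standard deviations) are learned on the training set only and then applied to both sets. The sketch below reproduces that idea with plain pandas on synthetic values; it illustrates the principle and is not a claim about the library's internals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)

# Synthetic training values with one artificial extreme outlier.
train = pd.Series(rng.normal(800, 250, size=500), name='Daily_Customer_Count')
train.iloc[0] = 5000
test = pd.Series([100.0, 800.0, 9000.0], name='Daily_Customer_Count')

# Boundaries are computed from the training data only.
lower = train.mean() - 3 * train.std()
upper = train.mean() + 3 * train.std()

# Values outside the boundaries are replaced by the boundary itself.
train_capped = train.clip(lower=lower, upper=upper)
test_capped = test.clip(lower=lower, upper=upper)

print('boundaries:', lower, upper)
print(test_capped.tolist())
```

Learning the boundaries on the training set alone avoids leaking information from the test set into the preprocessing.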

###### Model Building

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

In [None]:
regressor = LinearRegression()

regressor.fit(x_train,y_train)

y_pred = regressor.predict(x_test)

In [None]:
print('Mean absolute error: ',mean_absolute_error(y_test,y_pred))
print()
print('Root mean squared error: ', np.sqrt(mean_squared_error(y_test,y_pred)))
print()
print('R2 score: ',r2_score(y_test,y_pred))

From our R2 score we can see that our variables are not strong predictors of the target.
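To put these metrics in context, it can help to compare them with a naive baseline that always predicts the mean of the training target. The sketch below does this with scikit-learn's `DummyRegressor` on synthetic data; all numbers are made up, and the variable names are chosen so they do not clash with the notebook's own.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.default_rng(5)

# Synthetic features and target; the target is unrelated to the features.
X_tr, X_te = rng.normal(size=(200, 2)), rng.normal(size=(50, 2))
y_tr = rng.normal(60000, 17000, size=200)
y_te = rng.normal(60000, 17000, size=50)

# Baseline that ignores the features and predicts the training mean.
baseline = DummyRegressor(strategy='mean').fit(X_tr, y_tr)
baseline_pred = baseline.predict(X_te)

baseline_rmse = np.sqrt(mean_squared_error(y_te, baseline_pred))
baseline_r2 = r2_score(y_te, baseline_pred)

print('Baseline RMSE:', baseline_rmse)
print('Baseline R2:', baseline_r2)
```

A constant prediction can never score above 0 on R2, so a model whose R2 is close to 0 is doing little better than this baseline.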

In [None]:
regressor.intercept_

In [None]:
regressor.coef_

In [None]:
# use the training columns as the index so the names always match the fitted features
pd.DataFrame(regressor.coef_, index=x_train.columns, columns=['Coefficient'])

###### Interpreting the coefficient:

Keeping all the other features fixed, a unit increase in Items_Available is associated with an increase of $4.662953 in store sales.

Keeping all the other features fixed, a unit increase in Daily_Customer_Count is associated with an increase of $2.110929 in store sales.
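This reading of a coefficient can be verified directly: raising one feature by one unit while holding the other fixed changes the prediction by exactly that feature's coefficient. The sketch below shows this on synthetic data; the true slopes of 5.0 and 2.0 are assumptions for the example and are unrelated to the values reported above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)

# Synthetic features and a target built from known slopes plus noise.
X = pd.DataFrame({
    'Items_Available': rng.normal(1800, 300, size=200),
    'Daily_Customer_Count': rng.normal(800, 250, size=200),
})
y = 5.0 * X['Items_Available'] + 2.0 * X['Daily_Customer_Count'] + rng.normal(0, 1000, size=200)

model = LinearRegression().fit(X, y)

# Take one store and raise Items_Available by one unit, keeping the rest fixed.
store = X.iloc[[0]].copy()
bumped = store.copy()
bumped['Items_Available'] += 1

difference = model.predict(bumped)[0] - model.predict(store)[0]

print('Coefficient:', model.coef_[0])
print('Change in prediction:', difference)
```

The two printed numbers agree up to floating-point rounding, which is what "keeping all the other features fixed" means in practice.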

###### Conclusion:

Given the current state of the economy, having a well-defined target market is more important than ever. No one can afford to target everyone. Small businesses can effectively compete with large companies by targeting a niche market.

Targeting a specific market does not mean that you are excluding people who do not fit your criteria. Rather, target marketing allows you to focus your marketing dollars and brand message on a specific market that is more likely to buy from you than other markets. This is a much more affordable, efficient, and effective way to reach potential clients and generate business.
