# EXPLORATORY DATA ANALYSIS - SUPERMARKET CASE - DATASET STORES 
## Step 1 - Purely focusing on stores data

The first step in EDA is to look for patterns in each data file in isolation. We have the following raw data files:
1- Sales data (full as well as aggregated)
2- Holiday data
3- Items data
4- Oil data
5- Transaction data
6- Stores data

In this notebook we will dive deeper into the stores data.

Last edit: 27-05-2024 by Sebastiaan de Bruin

In [None]:
import pandas as pd
import altair as alt
import numpy as np

file_path_stores = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\raw\stores.parquet'
df_stores = pd.read_parquet(file_path_stores)

file_path_citiescoordinates = r'C:\Users\sebas\OneDrive\Documenten\GitHub\Supermarketcasegroupproject\Group4B\data\external\ecuadorcities.csv'
df_citiescoordinates = pd.read_csv(file_path_citiescoordinates)

df_citiescoordinates['city'] = df_citiescoordinates['city'].replace("Santo Domingo de los Colorados", "Santo Domingo")

# Display the first few rows of the DataFrame
df_stores.head()

# Display the first few rows of the DataFrame
df_citiescoordinates.head()

merged_df = df_stores.merge(df_citiescoordinates[['city', 'lat', 'lng']], on='city', how='inner')

df_stores.head()
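The inner merge above silently drops every store whose city has no match in the coordinates file. As a minimal sketch of how that could be checked, the example below uses a left merge with `indicator=True` on two toy frames. The frame names `stores` and `coords`, the city "Puyo" being unmatched, and the coordinate values are placeholders for illustration only, not values taken from the real files.

```python
import pandas as pd

# Toy stand-ins for df_stores and df_citiescoordinates (made-up rows,
# placeholder coordinates; not the real data)
stores = pd.DataFrame({"store_nbr": [1, 2, 3],
                       "city": ["Quito", "Santo Domingo", "Puyo"]})
coords = pd.DataFrame({"city": ["Quito", "Santo Domingo"],
                       "lat": [-0.2, -0.3],
                       "lng": [-78.5, -79.2]})

# A left merge keeps every store; the _merge column records whether a match was found
check = stores.merge(coords, on="city", how="left", indicator=True)

# Cities that occur in the stores table but not in the coordinates table
missing = check.loc[check["_merge"] == "left_only", "city"].unique().tolist()
print(missing)
```

If the list is empty for the real data, the inner merge keeps all stores and the map shows every location.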


The dataset contains 54 stores (identified by store_nbr) that are labeled by city, state, type and cluster.

In [None]:
df_stores.info()

First observations from the data: 1) the data is complete (no null values anywhere), and 2) the text columns (city, state, type) are reported with the object dtype, even though they contain only string values. This is expected rather than a data problem: pandas builds on NumPy's ndarray, which has no variable-length string type, so strings are stored as generic Python objects by default.
Strictly speaking this is suboptimal (although it hardly matters for such a small table). For the sake of tidiness we will convert these columns to a dedicated string dtype.


In [None]:
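# 'string' selects the dedicated pandas string dtype
# (plain str typically keeps the generic object dtype)
df_stores_dtypedic = {'city': 'string',
                      'state': 'string',
                      'type': 'string'}

# astype returns a new DataFrame, so the result has to be assigned back
df_stores = df_stores.astype(df_stores_dtypedic)

df_stores.dtypes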
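To illustrate the dtype discussion above, here is a small sketch on a toy frame (the rows are made up and only stand in for `df_stores`). It compares casting with plain `str` against casting with the `"string"` alias. In pandas versions before 3.0 the first cast is expected to still report `object`; newer versions may behave differently, so the printed result depends on the installed version.

```python
import pandas as pd

# Toy frame with made-up values, standing in for df_stores
toy = pd.DataFrame({"city": ["Quito", "Guayaquil"], "cluster": [13, 8]})

# Cast with the Python str type
as_str = toy["city"].astype(str)

# Cast with the dedicated pandas string extension dtype
as_string = toy["city"].astype("string")

print("astype(str)      ->", as_str.dtype)
print("astype('string') ->", as_string.dtype)
```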

## Step 1.2 - Approximately 50% of all stores are located in just 2 cities (Quito and Guayaquil)

We first looked at the number of stores per city. Here, we found that Quito in particular is overrepresented in terms of the number of supermarkets located there.

In [None]:
# Make a groupby on city and give the count of stores per city
df_stores_groupedcountstores = df_stores.groupby(['city']).count().reset_index()

# Rename the result of the count column to an actual count column
# (sorting happens further down, once the percentage has been added)
df_stores_groupedcountstores = df_stores_groupedcountstores.rename(columns={'store_nbr':'Storecount'})

# Calculate the percentage of stores of the city in relation to the total stores in Ecuador
df_stores_groupedcountstores['Percentage'] = (df_stores_groupedcountstores['Storecount'] / df_stores_groupedcountstores['Storecount'].sum())*100

# Sort values by count
df_stores_groupedcountstores = df_stores_groupedcountstores.sort_values(by='Storecount', ascending= False)

# Round the percentage to 2 decimals
df_stores_groupedcountstores['Percentage'] = df_stores_groupedcountstores['Percentage'].round(2)

# Make a cumulative sum of the amount of stores
df_stores_groupedcountstores['Cum_sum'] = df_stores_groupedcountstores['Storecount'].cumsum()

# Make a cumulative percentage column
df_stores_groupedcountstores['Cumulative Percentage'] = round(100*df_stores_groupedcountstores.Cum_sum/df_stores_groupedcountstores['Storecount'].sum(),2)

# Select only the columns that give us information
df_stores_groupedcountstores = df_stores_groupedcountstores[['city','Storecount','Percentage','Cum_sum','Cumulative Percentage']]

# Format the table, just to make it look nice and use the gradient to show the effect of having a lot of stores in certain cities
df_stores_groupedcountstoresinstyle = df_stores_groupedcountstores.style.background_gradient(subset=['Storecount', 'Percentage'], cmap='Blues')\
                                                                        .format({"Percentage": "{:20,.1f}",
                                                                                 "Cumulative Percentage" : "{:20,.1f}"})
# Display the styled table: a Styler renders as HTML when it is the last
# expression in a cell, whereas print() would only show the object's repr
df_stores_groupedcountstoresinstyle

In [None]:
df_stores_groupedcountstores.info()
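The count, percentage and cumulative percentage per city can also be derived more compactly with `value_counts`, which already returns the counts sorted from largest to smallest. The sketch below runs on a made-up toy frame (five hypothetical stores, not the real 54), and the name `summary` is an assumption used only here.

```python
import pandas as pd

# Made-up rows standing in for df_stores
toy = pd.DataFrame({"store_nbr": [1, 2, 3, 4, 5],
                    "city": ["Quito", "Quito", "Quito", "Guayaquil", "Cuenca"]})

# value_counts sorts descending by count
counts = toy["city"].value_counts()

summary = counts.to_frame("Storecount")
summary["Percentage"] = (100 * counts / counts.sum()).round(2)
summary["Cum_sum"] = counts.cumsum()
summary["Cumulative Percentage"] = (100 * counts.cumsum() / counts.sum()).round(2)

print(summary)
```

The cumulative column makes it easy to read off how many cities are needed to cover a given share of all stores.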

In [None]:
barchartstorecitycount = alt.Chart(df_stores_groupedcountstores).mark_bar().encode(
    x='Storecount',
    y='city',
)
barchartstorecitycountlabels = barchartstorecitycount.mark_text(
    align= 'left',
    baseline='middle',
    dx=3
).encode(
    text='Storecount'
)

barchartstorecitycount + barchartstorecitycountlabels



## Step 1.3 - Store type D might be most interesting as it is represented all over Ecuador (might be better for generalization)

Next, we look at how the store types are distributed over the cities, using a stacked bar chart of the number of stores per type per city.

In [None]:
# Stacked bargraph for each city per type of store

#Group the data per city and type
df_storesgroupedstorestype = df_stores.groupby(['city','type']).count().reset_index()

#Add the storecount to each row as it might be needed for sorting the chart
df_storesgroupedstorestypestorecount = df_storesgroupedstorestype.merge(df_stores_groupedcountstores, how = 'inner', on = 'city')

#Select only the relevant columns
df_storesgroupedstorestypestorecount = df_storesgroupedstorestypestorecount[['city','type','store_nbr','Storecount']]

#Make the stacked bargraph on the amount of stores per type per city
chartstorespercitytype = alt.Chart(df_storesgroupedstorestypestorecount).mark_bar().encode(
    x=alt.X('store_nbr'),
    y=alt.Y('city', sort='-x'),
    color='type')

chartstorespercitytype 

We can see that store types C and D seem to be the most widely spread across Ecuador. Also, looking at the distribution of store counts across the store types, C and D are dominant (without considering units sold). Thus, type D is potentially a good candidate to focus our forecasting model on: it is present in the five largest cities (in terms of store count) and represents 18 stores that most likely share some similarity for Corporacion Favorita.

In [None]:
df_storestypecountandpercentage =  df_stores.groupby(['type']).count().reset_index()
df_storestypecountandpercentage = df_storestypecountandpercentage[['type' , 'city']]
df_storestypecountandpercentage = df_storestypecountandpercentage.rename(columns={'city':'count'})
df_storestypecountandpercentage['percentage'] = round((df_storestypecountandpercentage['count']/df_storestypecountandpercentage['count'].sum())*100,1)


df_storestypecountandpercentage
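How widely a store type is spread can also be quantified directly, by counting the number of distinct cities in which each type occurs. The following is a sketch on made-up rows (the real figures come from `df_stores`); the name `spread` is an assumption for this example.

```python
import pandas as pd

# Made-up rows standing in for df_stores
toy = pd.DataFrame({
    "city": ["Quito", "Quito", "Guayaquil", "Cuenca", "Guayaquil", "Quito"],
    "type": ["D", "A", "D", "D", "C", "C"],
})

# Number of distinct cities per store type, largest spread first
spread = toy.groupby("type")["city"].nunique().sort_values(ascending=False)
print(spread)
```

Applied to the real table, a type with a high store count but a low number of distinct cities would be concentrated rather than spread out.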

## Step 1.4 - Looking at clusters per type of store - Most likely a data issue with cluster 10, as normally each cluster belongs to 1 store type (thus it is a hierarchy)

Delving into store type and cluster, we found that, with the exception of cluster 10, every cluster belongs to exactly 1 store type.

In [None]:
import plotly.express as px

df_storestypeclustercount = df_stores.groupby(['type','cluster']).count().reset_index()
df_storestypeclustercount = df_storestypeclustercount.rename(columns={'city':'count'})


Storessunburst1 = px.sunburst(df_storestypeclustercount, path=['type','cluster'], values='count',
                               color = 'type',
                                color_continuous_scale='RdBu',
                                color_continuous_midpoint=np.average(df_storestypeclustercount['count'], weights=df_storestypeclustercount['count']))

Storessunburst1.show()

df_storestypeclustercount

In [None]:
heatmap = alt.Chart(df_storestypeclustercount).mark_rect().encode(
    x=alt.X('cluster:O', title='Cluster Number'),
    y=alt.Y('type:O', title='Type of Store'),
    color=alt.Color('count:Q', title='Count')
).properties(
    title='Count of stores per Type and Cluster',
    width = 1000,
    height = 400
)


text = heatmap.mark_text(baseline='middle').encode(
    text='count:Q',
    color=alt.condition(
        alt.datum.count > 100,
        alt.value('black'),
        alt.value('white')
    )
)

Chartstoretypesandcluster = alt.layer(heatmap, text).configure_title(
    fontSize=30).configure_axis(
    titleFontSize = 20, 
    labelFontSize=20).configure_legend(
    titleFontSize=20)

Chartstoretypesandcluster

In the above chart, we find that most store types consist of multiple clusters (3-4 or 7). Only type E consists of a single cluster, cluster 10, but as stated above, something seems to be off with cluster 10, as it is the only cluster that is linked to multiple store types.
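Whether clusters form a strict hierarchy under store type can be verified programmatically instead of by reading the chart: count the distinct types per cluster and list every cluster with more than one. The sketch below uses made-up rows that merely mimic the pattern described above; the names `types_per_cluster` and `violations` are assumptions for this example.

```python
import pandas as pd

# Made-up rows standing in for df_stores
toy = pd.DataFrame({
    "type":    ["A", "A", "B", "E", "B"],
    "cluster": [1,   1,   2,   10,  10],
})

# Number of distinct store types each cluster is linked to
types_per_cluster = toy.groupby("cluster")["type"].nunique()

# Clusters that break the "one cluster, one type" hierarchy
violations = types_per_cluster[types_per_cluster > 1]
print(violations)
```

Running the same two lines on `df_stores` would list every cluster that is linked to more than one store type.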

In [None]:
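import plotly.express as px

# Treemap of the number of stores per type and cluster.
# No `values` argument is passed, so every store (row) counts as 1;
# values='store_nbr' would sum the store ID numbers instead of counting stores.
fig = px.treemap(df_stores, path=['type', 'cluster'])
fig.show()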

In [None]:
import folium

# Create a map centered around Ecuador
map_ecuador = folium.Map(location=[-1.8312, -78.1834], zoom_start=7)

# Add markers for each coordinate
# The for loop iterates over the rows of the DataFrame
# The iterrows() method returns an iterator that yields pairs of index and row data as Series
# The row data is a Series that contains the data of the row
# The index is the index of the row
for index, row in merged_df.iterrows():
    folium.Marker([row['lat'], row['lng']], icon = folium.Icon(icon='leaf'), popup=row['city']).add_to(map_ecuador)


# Display the map
map_ecuador
