

# Exploratory Data Analysis (EDA) for Kaggle's Time Series Forecasting Competition Data
### [Store Sales - Time Series Forecasting](https://www.kaggle.com/competitions/store-sales-time-series-forecasting)

This notebook conducts an Exploratory Data Analysis (EDA) of the dataset provided for Kaggle's Store Sales competition. The competition asks participants to use time-series forecasting techniques to predict store sales from data supplied by Corporación Favorita, a large Ecuador-based grocery retailer.

**Goal of the Competition:**
The primary goal of this competition is to build a robust model that can accurately predict the unit sales of thousands of items sold across different Favorita stores. By analyzing the training dataset, which includes information such as dates, store details, item specifics, promotions, and unit sales, participants are encouraged to apply their machine learning skills to create precise forecasts.

**Context:**
Forecasting plays a crucial role not only in meteorology but also in various other domains. Governments rely on forecasts to predict economic growth, while scientists attempt to anticipate future population trends. For businesses, particularly brick-and-mortar grocery stores, accurate forecasting is essential to manage inventory effectively. An overestimation could lead to surplus perishable goods, while underestimation may result in popular items quickly running out, leading to revenue loss and dissatisfied customers. Machine learning, with its ability to provide more accurate predictions, offers a solution to this challenge faced by retailers.

Traditional forecasting methods in the retail sector often lack sufficient data to support their accuracy and are usually not automated. As retailers expand to new locations with distinct requirements, introduce new products, adapt to changing seasonal preferences, and implement unpredictable marketing strategies, the forecasting problem becomes even more complex.
For grocery stores, improved accuracy in forecasting can significantly reduce food waste related to overstocking and enhance overall customer satisfaction. As a result of this competition's findings, local stores may eventually be better equipped to provide precisely what customers need during their shopping experiences.

Let's proceed with the exploratory data analysis to gain valuable insights and prepare for the time series forecasting challenge ahead!


### train.csv
The training data comprises time series of the features store_nbr, family, and onpromotion, as well as the target sales.
- **store_nbr** identifies the store at which the products are sold.
- **family** identifies the type of product sold.
- **sales** gives the total sales for a product family at a particular store at a given date. Fractional values are possible since products can be sold in fractional units (1.5 kg of cheese, for instance, as opposed to 1 bag of chips).
- **onpromotion** gives the total number of items in a product family that were being promoted at a store at a given date.
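
As a rough illustration of this layout, the sketch below builds a few made-up rows with the columns described above and reshapes them so that each (store_nbr, family) pair becomes its own time series. The values, the `toy` and `wide` names, and the presence of a parsed `date` column are assumptions for illustration; they are not taken from the competition data.

```python
import pandas as pd

# Made-up rows in the layout described above (illustration only)
toy = pd.DataFrame(
    {
        "date": pd.to_datetime(
            ["2017-01-01", "2017-01-01", "2017-01-02", "2017-01-02"]
        ),
        "store_nbr": [1, 1, 1, 1],
        "family": ["DAIRY", "GROCERY I", "DAIRY", "GROCERY I"],
        "sales": [1.5, 10.0, 2.0, 12.0],  # fractional units are allowed
        "onpromotion": [0, 1, 0, 2],  # number of promoted items
    }
)

# One row per date, one column per (store_nbr, family) pair
wide = toy.pivot(index="date", columns=["store_nbr", "family"], values="sales")
n_series = wide.shape[1]
```

Each column of `wide` is then one series to forecast, which is why the long table grows with the number of stores times the number of product families.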

In [3]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.neighbors import LocalOutlierFactor


# Set the plot style
plt.style.use("bmh")
# Set the color palette
sns.set_palette("crest")

# Set the default figure size
plt.rcParams["figure.figsize"] = (15, 6)
pd.options.display.float_format = '{:.2f}'.format

In [None]:
# Load the training data and parse the date column.
# The path is an assumption: adjust it to wherever train.csv is stored.
train_df = pd.read_csv("train.csv", parse_dates=["date"])

In [None]:
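# Derive calendar features from the (datetime) date column
train_df["year"] = train_df["date"].dt.year
train_df["date_no_year"] = train_df["date"].dt.strftime("%m-%d")
train_df["day_name"] = train_df["date"].dt.day_name()
# Monday=0 ... Sunday=6; this column was missing from the original assignment
train_df["day_of_the_week"] = train_df["date"].dt.dayofweek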

In [None]:
display(
    train_df[["store_nbr", "sales", "onpromotion"]].describe(
        percentiles=[0.25, 0.5, 0.75, 0.99]
    )
)
print("The data has {} points.".format(train_df.shape[0]))
print(
    "There are {} different product families, sold in {} different stores.".format(
        train_df["family"].nunique(), train_df["store_nbr"].nunique()
    )
)

In [None]:
sns.boxplot(data=train_df, x="family", y="sales")
plt.xticks(rotation=90)
plt.show()

In [None]:
# Count outlier candidates using IQR Method (Interquartile Range):
q1, q3 = train_df["sales"].quantile(0.25), train_df["sales"].quantile(0.75)
outlier_percentage = (
    train_df[
        (train_df["sales"] < (q1 - 1.5 * (q3 - q1)))
        | (train_df["sales"] > (q3 + 1.5 * (q3 - q1)))
    ].shape[0]
    / train_df.shape[0]
) * 100
print(
    f"There are {outlier_percentage:.2f}% points that are considered outliers according to IQR method."
)

The sales figures show apparent irregularities, so we need to distinguish genuine anomalies from legitimate extreme values. To do so, we apply Local Outlier Factor (LOF), which compares the local density around each point with that of its nearest neighbors, to check whether these points are truly atypical within their own subgroup (store, product family, and calendar features).
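
One way to make the subgroup idea concrete is to fit LOF separately inside each subgroup, so that a value is only compared with its own peers. The following is a minimal sketch on made-up data, not the approach used in the next cell; the helper `flag_outliers_within_groups`, the grouping by store and family, and `n_neighbors=10` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor


def flag_outliers_within_groups(df, group_cols, value_col="sales", n_neighbors=10):
    """Return a boolean Series that is True where LOF marks a row as an
    outlier relative to the other rows of its own group."""
    flags = pd.Series(False, index=df.index)
    for _, group in df.groupby(group_cols):
        if len(group) <= n_neighbors:
            continue  # too few rows to compare against n_neighbors peers
        lof = LocalOutlierFactor(n_neighbors=n_neighbors)
        labels = lof.fit_predict(group[[value_col]].to_numpy())
        flags.loc[group.index] = labels == -1  # -1 means outlier
    return flags


# Made-up data: store 1 usually sells about 100 units, store 2 about 1000.
toy = pd.DataFrame(
    {
        "store_nbr": [1] * 31 + [2] * 30,
        "family": ["GROCERY I"] * 61,
        "sales": np.concatenate(
            [np.linspace(100, 110, 30), [1000.0], np.linspace(1000, 1010, 30)]
        ),
    }
)
toy["is_outlier"] = flag_outliers_within_groups(toy, ["store_nbr", "family"])
```

In this toy setup a sale of 1000 units is unusual for store 1 but routine for store 2, which is the kind of distinction a single global threshold such as the IQR rule cannot make.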

In [None]:
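# Keep the categorical and calendar features together with the target
data_prepared = train_df[
    ["store_nbr", "family", "onpromotion", "year", "date_no_year", "day_name", "sales"]
]
# One-hot encode the categorical columns so LOF can compute distances
data_prepared = pd.get_dummies(
    data_prepared, columns=["store_nbr", "family", "year", "date_no_year", "day_name"]
)

# contamination=0.1 flags 10% of the points as outliers by construction
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred = clf.fit_predict(data_prepared)  # -1 = outlier, 1 = inlier
# More negative scores indicate stronger outliers
X_scores = clf.negative_outlier_factor_
X_scores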

In [9]:
ax = train_df.query("sales != 0")["family"].value_counts().plot.bar()
plt.title("Frequency of purchases")
plt.xlabel("Product family")
plt.ylabel("Number of days with a purchase")
plt.show()

In [None]:
# Which product family sells the most? (yearly totals, in millions)
yearly_sum_sales = train_df.groupby(["family", "year"])["sales"].sum().reset_index()
sns.heatmap(
    data=yearly_sum_sales.pivot(index="family", columns="year", values="sales")
    .div(1000000)
    .round(2)
    .sort_values(by=2017, ascending=False),
    annot=True,
    cmap="bone_r",
    fmt="g",
)
plt.show()

In [None]:
sum_sales = (
    yearly_sum_sales.groupby("family")["sales"].sum().sort_values(ascending=False)
)
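cumulative_sum_percentage = (sum_sales.cumsum() / sum_sales.sum()) * 100
# Families that together account for the first 95% of sales. This name is used
# by the promotion analysis below but is not defined anywhere else in the
# notebook, so this definition is an assumption.
majority_of_sales = cumulative_sum_percentage[
    cumulative_sum_percentage < 95
].index.tolist()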

plt.plot(cumulative_sum_percentage, marker="o", linestyle="-")
plt.plot(
    cumulative_sum_percentage[cumulative_sum_percentage < 95],
    marker="o",
    linestyle="-",
    color="purple",
    label="product families that carry 95% of sales",
)

plt.title("Cumulative share of total sales by product family")
plt.ylabel("Cumulative share of sales (%)")
plt.xticks(rotation=90)
plt.legend()
plt.show()

### Sales

In [None]:
avg_sales = (
    train_df.query("family in @majority_of_sales & onpromotion != 0")
    .groupby(["family", "onpromotion"])["sales"]
    .mean()
    .sort_values(ascending=False)
    .reset_index()
)
sns.jointplot(
    data=avg_sales, x="onpromotion", y="sales", alpha=0.5, hue="family", palette="bone_r"
)

In [None]:
sns.lmplot(
    data=avg_sales,
    x="onpromotion",
    y="sales",
    col="family",
    col_wrap=4,
    lowess=True,
    line_kws={"color": "purple"},
    # Let each family panel use its own axis ranges
    facet_kws={"sharex": False, "sharey": False},
)
plt.show()

In [None]:
sns.lineplot(
    data=train_df.query("family == 'GROCERY I'")
    .groupby(["year", "date_no_year"])["sales"]
    .mean()
    .reset_index(),
    x="date_no_year",
    y="sales",
    hue="year",
)
plt.xticks(rotation=90)
plt.show()

In [None]:
weekly_sale_average = (
    train_df.query("family == 'GROCERY I'")
    .groupby(["year", "day_of_the_week", "day_name"])["sales"]
    .median()
    .reset_index()
)
ax = sns.lineplot(
    data=weekly_sale_average,
    x="day_of_the_week",
    y="sales",
    hue="year",
)
ax.xaxis.set(
    ticks=weekly_sale_average["day_of_the_week"].unique(),
    ticklabels=weekly_sale_average["day_name"].unique(),
)
plt.show()

In [None]:
ax = sns.boxplot(
    data=train_df.query("family == 'GROCERY I'"),
    x="day_of_the_week",
    y="sales",
    hue="year",
)
ax.xaxis.set(
    ticks=train_df["day_of_the_week"].unique(), ticklabels=train_df["day_name"].unique()
)
plt.show()