<h1 align="center">🛍️ US Stores Sales 🛍️</h1>

<center><i>US Stores Sales Between 2010 and 2011</i></center>
    
**Dataset: [US Stores Sales](https://www.kaggle.com/datasets/dsfelix/us-stores-sales)**

----

<h1>📝 Description</h1>

<p>You're provided with daily historical sales data between January 1st, 2010 and December 31st, 2011. The task is to forecast the Total Value of Sales in Dollars given some info about the Stores, Products and Accountability.</p>

<p>Try applying different Machine Learning and Deep Learning Models. Good Luck!! 🍀🍀</p>

----

<h1>📁 File Descriptions</h1>

<br/>

> **sales.csv** - the training and testing set. Daily historical data from January 2010 to December 2011 (one of this challenge's goals is to split the dataset into training and testing sets).

----

<h1>❓ Variables</h1>

<br/>

> **Area Code** - Store's Code;

> **State** - Store's State;

> **Market** - Store's Region;

> **Market Size** - Store's Size;

> **Profit** - Profits in Dollars;

> **Margin** - Profit + Total Expenses OR Sales - COGS;

> **🌟 Sales 🌟** - Values Acquired in Sales (this is your target);

> **COGS** - Cost of Goods Sold;

> **Total Expenses** - Total Expenses to get the Product ready to Sell;

> **Marketing** - Expenses in Marketing;

> **Inventory** - Inventory Value of the Product at the Moment of the Sale;

> **Budget Profit** - Expected Profit;

> **Budget COGS** - Expected COGS;

> **Budget Margin** - Expected Profit + Expected Total Expenses OR Expected Sales - Expected COGS;

> **Budget Sales** - Expected Value Acquired in Sales;

> **ProductID** - Product ID;

> **Date** - Sale Date;

> **Product Type** - Product Category;

> **Product** - Product Description;

> **Type** - Type.

----

<h1>🎯 Goals</h1>

<br/>

> **Goal 1:** predict the Total Acquired Value on Sales in Dollars;

> **Goal 2:** get an RMSE lower than 15 dollars;

> **Bonus Goal:** apply XGBoost;

<br/>

----

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

np.random.seed(2004)
sns.set_style('whitegrid')

In [None]:
full_sales = pd.read_csv('../input/us-stores-sales/sales.csv')
full_sales.head()

In [None]:
# Since the 'ProductId' Feature already identifies each product,
# there's no need to keep the 'Product' Feature
#
# Also, since the 'Profit' and 'Margin' Features are 'Future Features',
# that is, their values are calculated with our target ('Sales') already known,
# we have to drop them too in order to avoid Target Leakage
features_to_drop = ['Product', 'Profit', 'Margin']
sales = full_sales.drop(features_to_drop, axis=1).copy()
sales.head()
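
To see why **Margin** leaks the target, here is a minimal sketch on made-up numbers (the `toy` frame and its values are assumptions for illustration, not rows from `sales.csv`). It relies only on the definition given in the Variables section, Margin = Sales - COGS:

```python
import pandas as pd

# Hypothetical rows, invented for illustration only
toy = pd.DataFrame({
    "Sales": [100.0, 250.0, 80.0],
    "COGS": [40.0, 110.0, 35.0],
})

# Margin is defined from the target: Margin = Sales - COGS
toy["Margin"] = toy["Sales"] - toy["COGS"]

# Any model that sees both Margin and COGS can rebuild Sales exactly
reconstructed_sales = toy["Margin"] + toy["COGS"]
print((reconstructed_sales == toy["Sales"]).all())

# Dropping the leaking column removes that shortcut
leak_free = toy.drop(columns=["Margin"])
print(leak_free.columns.tolist())
```

In other words, a model trained with such a column would look great on paper while learning nothing it could use before the sale happens.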

----

<h1>🔍 Exploratory Analysis</h1>

In the Exploratory Analysis, let's check out the variables' data types, see their distributions and correlations and plot some charts with seaborn!

<b>Data Types</b>

In [None]:
sales.dtypes

Well, it seems that almost all features have a proper data type, except **Date**, which is *object* instead of a date type.

At the moment, all Date Values are in the format **DD/MM/YY HH:mm:ss**, so let's transform them into **DD/MM/YY**.

In [None]:
import datetime

# Getting just the date part of 'Date' Feature in DD/MM/YY format
# and converting the string into datetime
sales['Date'] = sales['Date'].apply(lambda row:row[0:8])
sales['Date'] = pd.to_datetime(sales['Date'], format='%d/%m/%y')
sales['Date'].head()

In [None]:
sales.dtypes

🎉🎉 Niceee!! We have successfully converted the **Date** Feature to the **datetime64[ns]** data type!!
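
As a side note, the same conversion can probably be done without slicing the string, by parsing the full timestamp and then dropping the time part. This is a minimal sketch on made-up strings that assume the **DD/MM/YY HH:mm:ss** layout described above; `raw_dates` is a hypothetical stand-in for the **Date** column:

```python
import pandas as pd

# Hypothetical values in the DD/MM/YY HH:mm:ss layout
raw_dates = pd.Series(["01/04/10 00:00:00", "15/12/11 00:00:00"])

# Parse the full timestamp with an explicit format (day first, two-digit year)
parsed = pd.to_datetime(raw_dates, format="%d/%m/%y %H:%M:%S")

# normalize() sets the time part to midnight, keeping only the date information
dates_only = parsed.dt.normalize()

print(dates_only.dtype)
print(dates_only.dt.year.tolist())
```

Passing an explicit `format` avoids any ambiguity between day-first and month-first dates.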

Now, let's move forward to some Statistical Overview!!

----

<b>Statistical Overview</b>

In order to analyse the dataset better, let's split up the statistical overview in two: *Numbers Stats* and *Categorical Stats*.

In [None]:
# Number Stats
sales.describe()

Looking at the stats, we can notice:

> **Inventory** minimum values are negative: this shouldn't be possible, because once the inventory hits zero, the store doesn't physically have the product to sell. So, let's make a small Data Transformation to convert all negative Inventories to zero;

> **High Standard Deviation**: take a look at the *75%* and *max* stats of the **Sales** Feature. The difference between the two is *930.00 dollars*, quite a big gap! Because of this long tail, the Standard Deviation is very high (*148.89 dollars*), so it may become a problem later if we don't do something about it;

> **Data Standardization**: the maximum value of the **ProductId** Feature is *13*, whereas the minimum value of the **Area Code** one is *203.00*. So, to keep the ML Model from learning that Area Code is more important than ProductId just because its values are larger, we will have to apply Data Standardization during the Pipelines Step. This puts all numeric features on a comparable scale, which also helps with the **High Standard Deviation**. Two birds with one stone!! 🐦🐦🥌
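
A minimal sketch of what standardization does, on made-up values whose ranges mimic **ProductId** (1 to 13) and **Area Code** (203 and up). The numbers are assumptions for illustration, and `StandardScaler` is used here only to show the idea (it is not necessarily the scaler chosen later in the Pipelines Step):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical rows: first column looks like ProductId, second like Area Code
toy = np.array([
    [1.0, 203.0],
    [5.0, 512.0],
    [9.0, 719.0],
    [13.0, 985.0],
])

# Each column is rescaled to (value - column mean) / column standard deviation
scaled = StandardScaler().fit_transform(toy)

print(scaled.mean(axis=0).round(6))  # column means after scaling
print(scaled.std(axis=0).round(6))   # column standard deviations after scaling
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates just because of its units.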

For now, let's focus on converting all negative inventories to zero.

In [None]:
# Converting all negative values of inventory to zero
sales['Inventory'] = sales['Inventory'].apply(lambda x: x if x >= 0 else 0)
sales.describe()

All right, now the minimum value of **Inventory** is zero.
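
The same clean-up can also be written with `Series.clip`, which is vectorized. A minimal sketch on made-up inventory values (the `inventory` series is hypothetical):

```python
import pandas as pd

# Hypothetical inventory values, including negatives
inventory = pd.Series([-120, 0, 35, -4, 810])

# clip(lower=0) replaces every value below 0 with 0 and leaves the rest untouched
cleaned = inventory.clip(lower=0)

print(cleaned.tolist())
```

One difference worth knowing: `clip` leaves missing values as `NaN`, whereas the lambda above would turn them into 0.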

Let's take a peek at the Categorical Features' Stats!!

In [None]:
# Categorical Stats
sales.describe(include=['object'])

Hmm, it seems okay!! Just note that:

> There are **four Markets** spread across **twenty States**;

> Also, we are working with two Market Sizes: **Small** and **Big**;

<br>

❓ Now, there's a question: is there any Linear Relationship between the numerical values? Let's find it out!!

----

<b>Correlations</b>

In [None]:
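# numeric_only=True keeps the text columns out of the calculation
# (recent pandas versions raise an error otherwise)
sales.corr(numeric_only=True)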

Well, we have a bunch of Linear Relationships (LR) here (it's quite obvious: if you read the dataset description, you'll notice that most of the features were calculated using simple linear equations, so it would be kind of weird if the dataset didn't have any Linear Relationships 😂😂).

Among all the LRs, let's dig deeper into these two:

> **Marketing - Inventory (53.41%)**: the more marketing a store does for a specific product, the higher that product's Inventory tends to be. One likely reason is the belief that more clients will be interested in buying the advertised products, so the store will need more units to sell;

> **Total Expenses - Marketing (96.66%)**: the higher the Total Expenses, the higher the Marketing costs. One possible reading is that stores that spend more overall also earn more and, consequently, can invest more money in Marketing;
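
The percentages above are Pearson correlation coefficients, which is what `DataFrame.corr()` computes by default. A minimal sketch on synthetic data (the columns and their relationship are made up, not taken from `sales.csv`) showing that the pandas value matches the textbook formula:

```python
import numpy as np
import pandas as pd

# Synthetic data: expenses grow roughly linearly with marketing, plus noise
rng = np.random.default_rng(2004)
marketing = rng.uniform(0, 100, size=200)
total_expenses = 20 + 1.1 * marketing + rng.normal(0, 5, size=200)
toy = pd.DataFrame({"Marketing": marketing, "Total Expenses": total_expenses})

# Pearson correlation as computed by pandas
r_pandas = toy["Marketing"].corr(toy["Total Expenses"])

# Same coefficient by hand: sum of co-deviations over the product of deviation norms
x = toy["Marketing"] - toy["Marketing"].mean()
y = toy["Total Expenses"] - toy["Total Expenses"].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(round(r_pandas, 4), round(r_manual, 4))
```

Keep in mind that a high coefficient only says the two columns move together along a line; it doesn't say which one drives the other.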

<br>

Let's plot these relationships!!

In [None]:
# Marketing - Inventory (Pearson Correlation: 53.41%)

plt.figure(figsize=(15,7))
plt.title('Marketing Costs x Inventory')

sns.regplot(data=sales, x='Marketing', y='Inventory')

plt.xlabel('Marketing (U$)')
plt.ylabel('Inventory')
plt.show()

In [None]:
# Total Expenses - Marketing (Pearson Correlation: 96.66%)

plt.figure(figsize=(15,7))
plt.title('Total Expenses x Marketing Costs')

sns.regplot(data=sales, x='Total Expenses', y='Marketing')

plt.xlabel('Total Expenses (U$)')
plt.ylabel('Marketing (U$)')
plt.show()

----

Okay, now, let's see if any of the Numeric Features have a Normal Distribution.

In [None]:
# Since some features don't have a Linear Relationship
# at the moment, let's see how the data is spread
sales.hist(bins=15, figsize=(15, 10));

Woow, we have some interesting histograms here:

> **Budget Profit and Budget Margin** look approximately Normally Distributed;

> **Sales, COGS, Total Expenses, Marketing and Budget Sales** are right-skewed (they have a long tail to the right);

> **Date** spans from 2010 to 2011 (the histogram was plotted with 15 bins, so each bar covers one fifteenth of the date range).
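
To put a number on a "tail to the right", one option is the sample skewness from `Series.skew()`: values well above zero indicate a long right tail. A minimal sketch on synthetic, log-normally distributed values (made up for illustration, not the real **Sales** column), which also shows how a `log1p` transform pulls the tail in:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed values standing in for a column like Sales
rng = np.random.default_rng(2004)
toy_sales = pd.Series(rng.lognormal(mean=5.0, sigma=1.0, size=5000))

# Skewness before and after a log transform
raw_skew = toy_sales.skew()
log_skew = np.log1p(toy_sales).skew()

print(round(raw_skew, 2), round(log_skew, 2))
```

Whether such a transform is worth applying here is a judgment call; tree-based models in particular tend to be fairly insensitive to skewed inputs.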

----

Now, to finish this section, let's create a new Feature called **Budget Total Expenses**, using the following equation:

$$\text{Budget Total Expenses} = \text{Budget Margin} - \text{Budget Profit}$$

In [None]:
# Creating 'Budget Total Expenses' Feature
sales.insert(loc=12, column='Budget Total Expenses', value=sales['Budget Margin'] - sales['Budget Profit'])
sales.head()

----

<b>Categorical Features Data Entry</b>

Now, let's convert all Categorical Features to lower case and check whether any of them has inconsistent data.

In [None]:
# Converting all Categorical Features to lower case
sales['State']          =  sales['State'].str.lower()
sales['Market']         =  sales['Market'].str.lower()
sales['Market Size']    =  sales['Market Size'].str.lower()
sales['Product Type']   =  sales['Product Type'].str.lower()
sales['Type']           =  sales['Type'].str.lower()

sales.head()

In [None]:
# Now, let's list the unique values of each Categorical Feature
# in order to spot any inconsistent data
print('States:',        sales['State'].unique(),         '\n')
print('Market:',        sales['Market'].unique(),        '\n')
print('Market Size:',   sales['Market Size'].unique(),   '\n')
print('Product Type:',  sales['Product Type'].unique(),  '\n')
print('Type:',          sales['Type'].unique(),          '\n')

🎉🎉 Hoorray!! There are no duplicated or inconsistent categorical values, so let's move on to splitting the dataset into Features and Target!!

----

<h1>📦 Splitting Dataset into Features and Target</h1>

Since we have to predict **Sales**, let's separate this Feature to be our Target!!

In [None]:
X = sales.copy()
y = X.pop('Sales')

----

<h1>⚙️ Feature Engineering</h1>

Before we start creating an ML Model, let's do some Feature Engineering and Analysis.

Kicking off with the Engineering, we will:

> check out which numeric features are the most important with **Mutual Information (MI)**;

> try to group similar data points using **K-Means Clusters**.

C'mon, we're almost there, just a few steps before we kick off our model building!!

----

<b>Mutual Information (MI)</b>

In [None]:
from sklearn.feature_selection import mutual_info_regression

In [None]:
# Getting the Discrete and Continuous Features' Names
X_discrete_features = [col for col in X.columns
                      if X[col].dtype == 'int64']
X_continuous_features = [col for col in X.columns
                        if X[col].dtype =='float64']

# Identifying which number features are discrete and which ones are continuous
discrete_features = X[X_discrete_features].dtypes == int
continuous_features = X[X_continuous_features].dtypes == float

In [None]:
# Function to Plot MI Scores
def plot_mi_scores(scores):
    """
    Plots Mutual Information Scores in Ascending Order
    """
    scores = scores.sort_values(ascending=True)
    width = np.arange(len(scores))
    ticks = list(scores.index)
    plt.barh(width, scores)
    plt.yticks(width, ticks)
    plt.title("Mutual Information Scores")

In [None]:
# Calculating and Plotting Discrete Features MI Scores
mi_scores_discrete_features = mutual_info_regression(X[X_discrete_features], y, discrete_features=discrete_features, random_state=2004)
mi_scores_discrete_features = pd.Series(mi_scores_discrete_features, name='MI Scores 1', index=X_discrete_features)
mi_scores_discrete_features = mi_scores_discrete_features.sort_values(ascending=False)

plt.figure(dpi=100, figsize=(8,5))
plot_mi_scores(mi_scores_discrete_features)

Okay, looking at the plot, **Area Code** shares more information with **Sales** than **ProductId** does. Keep in mind that MI measures how much a feature tells us about the target, not whether it pushes Sales up or down.

In [None]:
# Calculating and Plotting Continuous Features MI Scores
# (discrete_features=False: passing the all-True 'continuous_features' mask here
# would tell scikit-learn to treat every continuous column as discrete)
mi_scores_continuous_features = mutual_info_regression(X[X_continuous_features], y, discrete_features=False, random_state=2004)
mi_scores_continuous_features = pd.Series(mi_scores_continuous_features, name='MI Scores 1', index=X_continuous_features)
mi_scores_continuous_features = mi_scores_continuous_features.sort_values(ascending=False)

plt.figure(dpi=100, figsize=(8,5))
plot_mi_scores(mi_scores_continuous_features)

As for the continuous features, **Inventory**, **COGS** and **Marketing** are the *most informative features* according to the plot.
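
Mutual information picks up any kind of dependence between a feature and the target, not only linear ones, and a score near zero suggests the feature carries little information about the target. A minimal sketch on synthetic data (both features and the target are made up) where a feature with almost no linear correlation still gets a clearly higher MI score than pure noise:

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_regression

# Synthetic data: the target depends on 'informative' through a square,
# while 'noise' is unrelated to it
rng = np.random.default_rng(2004)
n = 1000
toy_X = pd.DataFrame({
    "informative": rng.uniform(-3, 3, size=n),
    "noise": rng.uniform(-3, 3, size=n),
})
toy_y = toy_X["informative"] ** 2 + rng.normal(0, 0.1, size=n)

# Both columns are continuous, so discrete_features=False
mi = mutual_info_regression(toy_X, toy_y, discrete_features=False, random_state=2004)
mi_scores = pd.Series(mi, index=toy_X.columns).sort_values(ascending=False)

# Linear correlation misses the relationship that MI detects
linear_corr = toy_X["informative"].corr(toy_y)

print(mi_scores)
print(round(linear_corr, 3))
```

This is also why it matters to flag each column correctly as discrete or continuous: the estimator uses a different method for each case.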

----

<b>K-Means Cluster</b>

In [None]:
from sklearn.cluster import KMeans

In [None]:
# First, let's find out a good number of clusters by applying
# the Elbow Method to the WCSS (Within-Cluster Sum of Squares)

number_features = [col for col in X.columns
                    if X[col].dtype in ['int64', 'float64']]

wcss = []

for i in range(1,7):
    kmeans = KMeans(n_clusters=i
                   , init='k-means++'
                   , max_iter=300
                   , n_init=10
                   , random_state=2004)
    kmeans.fit(X[number_features])
    wcss.append(kmeans.inertia_)

In [None]:
# Plotting the results
plt.plot(range(1,7), wcss)
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.show()

We have three elbows here: **2, 3 and 4 clusters**. Let's use **4 clusters** in our analysis.
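
For reference, this is what a clear elbow looks like. A minimal sketch on synthetic blobs (three made-up groups generated with `make_blobs`, not the sales data): the WCSS drops sharply until the true number of groups is reached and flattens afterwards.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three well-separated synthetic groups
toy_X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=2004,
)

# WCSS (inertia) for 1 to 6 clusters
toy_wcss = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=2004)
    km.fit(toy_X)
    toy_wcss.append(km.inertia_)

# How much the WCSS improves each time one more cluster is added
drops = [toy_wcss[i] - toy_wcss[i + 1] for i in range(len(toy_wcss) - 1)]

print([round(w, 1) for w in toy_wcss])
print([round(d, 1) for d in drops])
```

The elbow sits where the improvements suddenly become small. On real data the curve is usually smoother, which is why more than one candidate can look reasonable.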

In [None]:
# Calculating K-Means with 4 clusters
kmeans = KMeans(n_clusters=4
                   , init='k-means++'
                   , max_iter=300
                   , n_init=10
                   , random_state=2004)

X_temp = X.copy()
X_temp.insert(loc=0, column='Cluster', value=kmeans.fit_predict(X_temp[number_features]))
X_temp

In [None]:
# Checking out if the Cluster followed the ProductId pattern
sns.relplot(
    x='ProductId', y='Area Code', hue='Cluster', data=X_temp, height=6, palette='colorblind'
)

Since the Cluster didn't follow the ProductId pattern, let's add this new Feature to the main dataset (to understand what I'm talking about, check out this notebook [🛍️ Predicting Future Sales 🛍️](https://www.kaggle.com/code/dsfelix/predicting-future-sales) and read the **K-Means Cluster** section).

In [None]:
# Adding Cluster Feature into the main dataset, deleting X_temp in the memory
# and seeing the result
X.insert(loc=0, column='Cluster', value=X_temp['Cluster'])

del X_temp

X.head()

----

<h1>🧺 Pipelines and Data Transformations</h1>

First things first, let's kick off by looking for missing values.

In [None]:
X.isnull().sum()

![OMG](https://steamuserimages-a.akamaihd.net/ugc/273974515706513039/441EA3D383846BA5EDCD6710B650ECD7704E6077/?interpolation=lanczos-none&output-format=jpeg&output-quality=95&fit=inside%7C637%3A358&composite-to=*,*%7C637%3A358&background-color=black)

Oh My Gooood! There isn't a single missing value, ooh yeah!! So the only steps in our Pipelines are scaling for the Numeric Features and encoding for the Categorical ones.

Now, let's split the dataset into train and validation and search for any bad labels, that is, categories that show up in the validation set but not in the training set.

In [None]:
# Splitting the data
from sklearn.model_selection import train_test_split

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=2004
                                                    , train_size=0.70
                                                    , test_size=0.30)

In [None]:
# Searching for any bad labels

cat_ord_features = [col for col in X_train.columns
                  if X_train[col].dtype == 'object']

good_labels_ord_features = [col for col in cat_ord_features
                          if set(X_valid[col]).issubset(set(X_train[col]))]

bad_labels_ord_features = list(set(cat_ord_features) - set(good_labels_ord_features))

print('Good Labels:', good_labels_ord_features, '\n')
print('Bad Labels:',  bad_labels_ord_features,  '\n')
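
A "bad label" here means a categorical column whose validation set contains a category never seen in the training set, which would make a fitted encoder fail at prediction time. A minimal sketch on made-up frames (`toy_train` and `toy_valid` are hypothetical) showing the check, plus one possible way to make `OrdinalEncoder` tolerate unseen categories:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical split where 'iowa' only appears in the validation set
toy_train = pd.DataFrame({
    "State": ["texas", "ohio", "utah"],
    "Type": ["regular", "decaf", "regular"],
})
toy_valid = pd.DataFrame({
    "State": ["texas", "iowa"],
    "Type": ["decaf", "regular"],
})

# Columns whose validation categories are not all present in training
bad = [col for col in toy_train.columns
       if not set(toy_valid[col]).issubset(set(toy_train[col]))]
print(bad)

# Unseen categories are mapped to -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(toy_train)
encoded_valid = encoder.transform(toy_valid)
print(encoded_valid)
```

Dropping the affected columns is the other common option; which one fits better depends on how much information those columns carry.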

![OMG](https://steamuserimages-a.akamaihd.net/ugc/273974515706513039/441EA3D383846BA5EDCD6710B650ECD7704E6077/?interpolation=lanczos-none&output-format=jpeg&output-quality=95&fit=inside%7C637%3A358&composite-to=*,*%7C637%3A358&background-color=black)

Double OMG!! No Bad Labels here!! Now it's time to create Pipelines!!

----

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import RobustScaler
from sklearn.preprocessing import Normalizer
from sklearn.compose import ColumnTransformer

In [None]:
# Creating Numerical Transformer
numerical_transformer = Pipeline(steps=[
    ('scaler', RobustScaler())
])

In [None]:
# Creating Categorical Transformer
categorical_transformer = Pipeline(steps=[
    ('encoder', OrdinalEncoder()),
    ('scaler', MinMaxScaler())
])

In [None]:
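# Bundling Preprocessors
# Note: ColumnTransformer defaults to remainder='drop', so columns that are not
# listed below (for example 'Date' and 'Cluster') are not passed on to the model
preprocessor = ColumnTransformer(
    transformers=[
        ('numerical', numerical_transformer, number_features),
        ('categorical', categorical_transformer, good_labels_ord_features)
    ]
)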
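
A minimal sketch of how a `ColumnTransformer` routes columns, on a made-up frame (column names and values are hypothetical). One detail worth keeping in mind is that, with the default `remainder='drop'`, any column not listed in a transformer is left out of the output:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder, RobustScaler

# Hypothetical frame: two numeric columns, one categorical, one unlisted
toy = pd.DataFrame({
    "COGS": [40.0, 110.0, 35.0, 60.0],
    "Marketing": [10.0, 30.0, 8.0, 15.0],
    "Market": ["east", "west", "east", "south"],
    "Unlisted": [1, 2, 3, 4],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Numeric columns: scaled with the median and interquartile range
    ("numerical", RobustScaler(), ["COGS", "Marketing"]),
    # Categorical column: integer codes, then squeezed into [0, 1]
    ("categorical",
     Pipeline(steps=[("encoder", OrdinalEncoder()), ("scaler", MinMaxScaler())]),
     ["Market"]),
])

transformed = toy_preprocessor.fit_transform(toy)

print(transformed.shape)   # 'Unlisted' is not part of the output
print(transformed[:, 2])   # encoded and scaled 'Market' column
```

Passing `remainder='passthrough'` instead would keep the unlisted columns, as long as they are already numeric.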

----

<h1>🤖 Baseline ML Models</h1>

Let's create three baseline models:

> **Linear Regressor**;

> **Random Forest Regressor**;

> **XGBoost**.

In [None]:
from sklearn.metrics import mean_squared_error # add 'np.sqrt()' to calculate RMSE
import statsmodels.api as sm

In [None]:
# Function to calculate RMSE
def rmse(predictions, real_values):
    return np.sqrt(mean_squared_error(real_values, predictions))
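
As a quick sanity check of the metric, here is a minimal worked example on made-up numbers: with errors of 0, 0 and 3, the mean squared error is 9 / 3 = 3, so the RMSE is the square root of 3, about 1.73.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up values: only the last prediction is off, by 3
real_values = np.array([1.0, 2.0, 3.0])
predictions = np.array([1.0, 2.0, 6.0])

# RMSE through scikit-learn and by hand
toy_rmse = np.sqrt(mean_squared_error(real_values, predictions))
manual_rmse = np.sqrt(np.mean((real_values - predictions) ** 2))

print(round(toy_rmse, 2), round(manual_rmse, 2))
```

Because the errors are squared before averaging, a single large miss weighs much more than several small ones. The RMSE is expressed in the same unit as the target, dollars in our case.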

----

<b>Linear Regressor</b>

In [None]:
from sklearn.linear_model import LinearRegression

In [None]:
# Creating Model
linear_model = LinearRegression(
    n_jobs=4
)

linear_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', linear_model)
])

In [None]:
# Training the Model and Making Predictions
linear_pipeline.fit(X_train, y_train)
print('Training Done!')
linear_predictions = linear_pipeline.predict(X_valid)
print('Predictions Done!')

In [None]:
# RMSE
linear_rmse = rmse(linear_predictions, y_valid)
print('Linear Regression RMSE: ', linear_rmse)

In [None]:
# Train and Validation Scores (R², as returned by .score() for regressors)
print('Train Score: %.2f%%' % (linear_pipeline.score(X_train, y_train) * 100))
print('Validation Score: %.2f%%' % (linear_pipeline.score(X_valid, y_valid) * 100))

In [None]:
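# OLS Summary (statsmodels)
# Note: this fits a separate single-feature regression (Sales ~ ProductId),
# so the summary does not describe 'linear_pipeline' above
model = sm.OLS(y, sm.add_constant(X['ProductId'])).fit()
print(model.summary())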

----

<b>Random Forest Regressor</b>

In [None]:
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Creating Model
rfg_model = RandomForestRegressor(
    n_estimators=250,
    criterion='squared_error',
    random_state=2004,
)

rfg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', rfg_model)
])

In [None]:
# Training the Model and Making Predictions
rfg_pipeline.fit(X_train, y_train)
print('Training Done!')
rfg_predictions = rfg_pipeline.predict(X_valid)
print('Predictions Done!')

In [None]:
# RMSE
rfg_rmse = rmse(rfg_predictions, y_valid)
print('Random Forest Regressor RMSE: ', rfg_rmse)

In [None]:
# Train and Validation Scores (R², as returned by .score() for regressors)
print('Train Score: %.2f%%' % (rfg_pipeline.score(X_train, y_train) * 100))
print('Validation Score: %.2f%%' % (rfg_pipeline.score(X_valid, y_valid) * 100))

-----

<b>XGBoost</b>

In [None]:
from xgboost import XGBRegressor

In [None]:
# Creating the Model
# Note: since XGBoost 1.6, 'early_stopping_rounds' is set on the estimator;
# passing it to fit() was removed in XGBoost 2.0
xgb_model = XGBRegressor(
    n_estimators=500
    , learning_rate=0.05
    , n_jobs=4
    , early_stopping_rounds=5
)

xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor)
])

In [None]:
# Training the Model and Making Predictions

xgb_X_train = xgb_pipeline.fit_transform(X_train)
xgb_X_valid = xgb_pipeline.transform(X_valid)

# Training stops once the validation error hasn't improved for 5 rounds
xgb_model.fit(
    xgb_X_train, y_train
    , eval_set=[(xgb_X_valid, y_valid)]
    , verbose=False
)

print('Training Done!')

xgb_predictions = xgb_model.predict(xgb_X_valid)
print('Prediction Done!')

In [None]:
# RMSE
xgb_rmse = rmse(xgb_predictions, y_valid)
print('XGBoost Regressor RMSE: ', xgb_rmse)

In [None]:
# Train and Validation Scores (R², as returned by .score() for regressors)
print('Train Score: %.2f%%' % (xgb_model.score(xgb_X_train, y_train) * 100))
print('Validation Score: %.2f%%' % (xgb_model.score(xgb_X_valid, y_valid) * 100))

Analysing the three models, we got impressive results here:

> RMSE lower than 15 dollars;

> R² Scores higher than 99%;

> Also, we can discard the **overfitting** assumption, since the gap between train and validation scores is not big and because we avoided **Target Leakage** by dropping the **Margin** and **Profit** Features at the beginning of the analysis.
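
A single train/validation split gives only one estimate of that gap. One common way to double-check it is k-fold cross-validation. The sketch below uses synthetic regression data from `make_regression` (made up, not the sales data) and a plain `LinearRegression`, so the numbers it prints say nothing about the models above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression problem with known noise level
toy_X, toy_y = make_regression(
    n_samples=500, n_features=8, noise=10.0, random_state=2004
)

# 5 shuffled folds; scikit-learn reports RMSE as a negative score
cv = KFold(n_splits=5, shuffle=True, random_state=2004)
neg_rmse = cross_val_score(
    LinearRegression(), toy_X, toy_y,
    cv=cv, scoring="neg_root_mean_squared_error",
)
fold_rmse = -neg_rmse

print(fold_rmse.round(2))
print(round(fold_rmse.mean(), 2), round(fold_rmse.std(), 2))
```

If the RMSE stays similar across folds, the validation result is less likely to be a lucky split. The same call should also accept a full pipeline in place of the bare estimator.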

----

<h1>🥒 Saving and Loading the Model</h1>

In [None]:
import pickle

In [None]:
# Saving the Model
#pickle.dump(rfg_model, open('rfg_model.pkl', 'wb'))

In [None]:
# Loading the Model
#pickled_model = pickle.load(open('rfg_model.pkl', 'rb'))
#pickled_model.predict(test)
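
If the goal is to reuse the model on raw data later, one option is to pickle the whole fitted pipeline rather than only the estimator, so the preprocessing travels with it. A minimal sketch on made-up data (the toy pipeline, the file name `toy_pipeline.pkl` and the temporary directory are assumptions for illustration):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Synthetic data and a small pipeline standing in for the real one
rng = np.random.default_rng(2004)
toy_X = rng.normal(size=(50, 3))
toy_y = toy_X @ np.array([2.0, -1.0, 0.5]) + 3.0

toy_pipeline = Pipeline(steps=[
    ("scaler", RobustScaler()),
    ("model", LinearRegression()),
])
toy_pipeline.fit(toy_X, toy_y)

# Save and reload inside a temporary directory; 'with' closes the files for us
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_pipeline.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy_pipeline, f)
    with open(path, "rb") as f:
        loaded_pipeline = pickle.load(f)

# The reloaded pipeline should give the same predictions as the original
same = np.allclose(loaded_pipeline.predict(toy_X), toy_pipeline.predict(toy_X))
print(same)
```

Pickled models are generally only safe to load with the same library versions they were saved with, and only from sources you trust.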

----

Hoorray!!! This is one more finished tutorial! See you next time 👋👋

----

<br/>
<h1>📫 Reach Me 📫</h1>
<br/>

> **Email:** **[csfelix08@gmail.com](mailto:csfelix08@gmail.com?)**

> **Linkedin:** **[linkedin.com/in/csfelix/](https://www.linkedin.com/in/csfelix/)**

> **Instagram:** **[instagram.com/c0deplus/](https://www.instagram.com/c0deplus/)**

> **Portfolio:** **[CSFelix.io](https://csfelix.github.io/)**

> **Kaggle:** **[DSFelix](https://www.kaggle.com/dsfelix)**