# Machine Learning Development for Sales Forecasting

Another possible solution is to use machine learning algorithms to build predictive models from sales data and identify the market characteristics that most influence the amount of sales. Once we identify these characteristics, monitoring efforts can be focused on products with a higher likelihood of improving sales.

The BigMart sales case can be treated as a regression problem: the goal is to forecast sales. The target variable is therefore a continuous numeric variable, the sales amount in a year (`Item_Outlet_Sales`).

A wide range of machine learning algorithms can be used to solve this problem. We will use:

+ Linear regression techniques, such as `LinearRegression`, which use a linear function of the input features to predict the amount of sales.
+ Decision tree algorithms, which can capture more complex interactions between the input features and the target variable. `RandomForestRegressor` is a model built from a collection of decision trees, each trained on a different bootstrap sample of the training data. The model averages the predictions of all the trees to produce the final prediction.
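
The averaging step described above can be checked directly. The following is a minimal sketch on synthetic data (generated with `make_regression`, not the BigMart dataset); the names `X_demo`, `y_demo`, and `forest` are illustrative only. It compares the forest's prediction with the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data, used only to illustrate the idea
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Fit a small forest; each tree sees a different bootstrap sample
forest = RandomForestRegressor(n_estimators=25, random_state=0).fit(X_demo, y_demo)

# Predict the first five rows with every individual tree, then average
per_tree = np.stack([tree.predict(X_demo[:5]) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

# The forest's own prediction for the same rows
forest_prediction = forest.predict(X_demo[:5])

print(np.allclose(manual_average, forest_prediction))
```

If the two arrays agree, the forest's output is the plain average of its trees, which is why adding more trees tends to smooth the prediction rather than change its scale.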

To evaluate the performance of the machine learning models, we will use several metrics: `mean absolute error (MAE)`, `mean squared error (MSE)`, `root mean squared error (RMSE)`, `R-Squared`, and `mean absolute percentage error (MAPE)`.
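
As a rough sketch of how these metrics could be computed together with scikit-learn, the helper below returns them in one dictionary. The function name `regression_report` and the three sample values are hypothetical and are not part of the notebook. It assumes scikit-learn 0.24 or later, where `mean_absolute_percentage_error` is available.

```python
import numpy as np
from sklearn.metrics import (
    mean_absolute_error,
    mean_absolute_percentage_error,
    mean_squared_error,
    r2_score,
)

def regression_report(y_true, y_pred):
    """Return the five regression metrics named above as a dictionary."""
    mse = mean_squared_error(y_true, y_pred)
    return {
        "MAE": mean_absolute_error(y_true, y_pred),
        "MSE": mse,
        "RMSE": float(np.sqrt(mse)),  # square root of MSE, in the units of the target
        "R2": r2_score(y_true, y_pred),
        # scikit-learn returns MAPE as a fraction (0.08 means 8%), not a percentage
        "MAPE": mean_absolute_percentage_error(y_true, y_pred),
    }

# Hypothetical sales values, for illustration only
report = regression_report([100, 200, 300], [110, 190, 330])
print(report)
```

MAE and RMSE are expressed in the same units as the sales amount, while R-Squared and MAPE are unit-free, which makes them easier to compare across targets of different scale.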



In [31]:
#@title importing necessary libraries
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import pingouin as pg
import pickle
#import scikitplot as skplt

from typing import List, Tuple

from pandas.api.types import CategoricalDtype
from statsmodels.stats.contingency_tables import Table2x2
from scipy.stats import randint

from sklearn.compose import make_column_selector,make_column_transformer, ColumnTransformer
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, PowerTransformer, OrdinalEncoder,  OneHotEncoder

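# enable_halving_search_cv must be imported before HalvingRandomSearchCV
from sklearn.experimental import enable_halving_search_cv
from sklearn.metrics import accuracy_score, log_loss, mean_absolute_error, mean_squared_error, r2_score,\
                            mean_absolute_percentage_error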
                                                        
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score,\
                                    learning_curve, validation_curve, RandomizedSearchCV, HalvingRandomSearchCV,\
                                    KFold
                                    
from sklearn.pipeline import Pipeline,make_pipeline

from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor,RandomForestRegressor, ExtraTreesRegressor

import warnings
warnings.filterwarnings('ignore')

We performed some feature engineering in the data exploration phase, and some features have therefore become obsolete, so we will have to drop those columns. We will also use different encoders for the categorical columns, depending on their characteristics (ordinal or nominal).
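
To illustrate the ordinal versus nominal distinction, here is a minimal sketch on a toy frame. The column names match the dataset, but the rows are illustrative, and the ordering `Small < Medium < High` for `Outlet_Size` is an assumption, not something the notebook states. Casting a column to an ordered `CategoricalDtype` is one way to make a `dtype_include='category'` selector pick it up, since text columns read from a CSV file are not of `category` dtype.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype
from sklearn.compose import make_column_selector
from sklearn.preprocessing import OrdinalEncoder

# Toy frame: illustrative rows only
toy = pd.DataFrame({
    "Outlet_Size": ["Medium", "Small", "High", "Medium"],
    "Item_Type": ["Dairy", "Meat", "Household", "Dairy"],
})

# Assumed ordering of the outlet sizes (hypothetical)
size_order = ["Small", "Medium", "High"]
toy["Outlet_Size"] = toy["Outlet_Size"].astype(CategoricalDtype(size_order, ordered=True))

# Only the column cast to a category dtype is selected as ordinal
ordinal_cols = make_column_selector(dtype_include="category")(toy)

# Passing the same order to the encoder keeps the integer codes aligned with it
encoder = OrdinalEncoder(categories=[size_order])
encoded = encoder.fit_transform(toy[ordinal_cols])

print(ordinal_cols)
print(encoded.ravel())
```

Without an explicit `categories` argument, `OrdinalEncoder` sorts the labels alphabetically, which would place `High` before `Medium` and `Small`.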

In [32]:
# load the cleaned dataset
bigmart = pd.read_csv('bigmart_cleaned.csv')

bigmart_copy = bigmart.copy() # create a copy of the data
bigmart_copy.head()

| Unnamed: 0 | Item_Weight | Item_Fat_Content | Item_Visibility (%) | Item_Type | Item_MRP | Outlet_Identifier | Outlet_Size | Outlet_Location_Type | Outlet_Type | Item_Outlet_Sales | Outlet_Age | Item_Type_Category | Item_MRP_Category |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 9.3 | Low Fat | 2.0 | Dairy | 249.8092 | OUT049 | Medium | Tier 1 | Supermarket Type1 | 3735.138 | 14 | Food | Very High |
| 1 | 5.92 | Regular | 2.0 | Soft Drinks | 48.2692 | OUT018 | Medium | Tier 3 | Supermarket Type2 | 443.4228 | 4 | Drink | Low |
| 2 | 17.5 | Low Fat | 2.0 | Meat | 141.618 | OUT049 | Medium | Tier 1 | Supermarket Type1 | 2097.27 | 14 | Food | High |
| 3 | 19.2 | Regular | 12.0 | Fruits and Vegetables | 182.095 | OUT010 | Small | Tier 3 | Grocery Store | 732.38 | 15 | Food | High |
| 4 | 8.93 | No Fat | 6.0 | Household | 53.8614 | OUT013 | High | Tier 3 | Supermarket Type1 | 994.7052 | 26 | Non-Consumable | Low |


In [34]:
# let's start with data preprocessing

# set the seed
seed = 200

# set the feature and target variable
# 'Unnamed: 0' is the row index saved with the CSV, not a predictor, so drop it too
X = bigmart_copy.drop(columns=['Item_Outlet_Sales', 'Unnamed: 0'], errors='ignore')
y = bigmart_copy.Item_Outlet_Sales

# split data into train (70%), validation (15%) and test (15%) sets
X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.3, shuffle = True, random_state=seed)
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=seed)

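# create a selector for the numerical columns
num_selector = make_column_selector(dtype_include='number')

# create two sets of category selectors, one for ordinal columns and
# the other for nominal columns
# note: read_csv loads text columns as 'object', so the 'category' selector
# matches nothing unless the ordinal columns are first cast to a category dtype
cat_selector_nom = make_column_selector(dtype_include='object')
cat_selector_ord = make_column_selector(dtype_include='category')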

# select these columns from the data
num_cols = num_selector(X)
cat_cols_ord = cat_selector_ord(X)
cat_cols_nom = cat_selector_nom(X)

# initialise the preprocessor for each selector
num_preprocessor = RobustScaler()
# handle_unknown='ignore' tolerates categories that were not seen during fit
cat_selector_nom_preprocessor = OneHotEncoder(handle_unknown='ignore')
cat_selector_ord_preprocessor = OrdinalEncoder()

# set the preprocessor
preprocesor = ColumnTransformer([
    ('RobustScaler', num_preprocessor, num_cols),
    ('OneHotEncoder', cat_selector_nom_preprocessor, cat_cols_nom),
    ('OrdinalEncoder', cat_selector_ord_preprocessor, cat_cols_ord)
])

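# create the machine learning model pipelines
# each step must be an estimator instance, so the model classes are instantiated;
# random_state=seed makes the tree-based models reproducible
pipelines = {
    'Linear Regression': make_pipeline(preprocesor, LinearRegression()),
    'Random Forest Regressor': make_pipeline(preprocesor, RandomForestRegressor(random_state=seed)),
    'Gradient Boost Regression': make_pipeline(preprocesor, GradientBoostingRegressor(random_state=seed)),
    'Extra Tree Regressor': make_pipeline(preprocesor, ExtraTreesRegressor(random_state=seed))
}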
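
The preprocessing-plus-model pattern used in this cell can be tried end to end on a small toy frame. This is a sketch under stated assumptions: the two column names are borrowed from the dataset, but the rows and every `toy_*` name are illustrative, and the non-numeric columns are selected with `dtype_exclude='number'` for simplicity. Note that `make_pipeline` expects estimator instances, such as `LinearRegression()`, rather than the classes themselves.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

# Toy data: illustrative values only, not the real dataset
toy_X = pd.DataFrame({
    "Item_MRP": [250.0, 48.0, 142.0, 182.0, 54.0, 120.0],
    "Outlet_Type": ["Supermarket Type1", "Supermarket Type2", "Supermarket Type1",
                    "Grocery Store", "Supermarket Type1", "Grocery Store"],
})
toy_y = pd.Series([3700.0, 440.0, 2100.0, 730.0, 990.0, 500.0])

# Route numeric columns to the scaler and all other columns to the one-hot encoder
toy_preprocessor = ColumnTransformer([
    ("scale", RobustScaler(), make_column_selector(dtype_include="number")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"), make_column_selector(dtype_exclude="number")),
])

# The model step is an instance: LinearRegression(), not LinearRegression
toy_pipeline = make_pipeline(toy_preprocessor, LinearRegression())
toy_pipeline.fit(toy_X, toy_y)
toy_predictions = toy_pipeline.predict(toy_X)

print(list(toy_pipeline.named_steps))
print(toy_predictions.shape)
```

Because the preprocessing lives inside the pipeline, the scaler and encoder are fitted only on the data passed to `fit`, which avoids leaking statistics from the validation or test sets.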
