# Lack of Data

In this file, we consider the case where data is available for a single year only. The best-performing methods from the previous file are applied to see how prediction accuracy changes.

In [10]:
import numpy as np
import pandas as pd
sales=pd.read_csv("E:/GitHubActivities/rep1-Analysis on retail demand prediction/data/data_processed.csv")
NW = 52 #number of weeks to be considered in the dataset

# Centralized and decentralized approaches

In [11]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from statsmodels.regression.linear_model import OLS
from sklearn.metrics import r2_score, mean_squared_error

res=pd.DataFrame(index=['R2'])

## Structuring the dataset

Before proceeding to the approaches mentioned above, we need to split our data into train and test datasets for each specific product.

In [12]:
skuSet = list(sales.sku.unique())
skuData = {}
colnames = [i for i in sales.columns if i not in ["week","weekly_sales","sku"]]
for i in skuSet:
  df_i = sales[sales.sku == i]
  skuData[i] = {'X': df_i[:NW][colnames].values,
                'y': df_i[:NW]['weekly_sales'].values}
    
X_dict = {}
y_dict = {}

y_test = []
y_train = []

for i in skuSet:
  
  X_train_i,X_test_i = train_test_split(skuData[i]["X"], shuffle=False, train_size=0.7) #split for X
  y_train_i,y_test_i = train_test_split(skuData[i]["y"], shuffle=False, train_size=0.7) #split for y 

  X_dict[i] = {'train': X_train_i, 'test': X_test_i} #filling dictionary
  y_dict[i] = {'train': y_train_i, 'test': y_test_i}

  y_test += list(y_test_i) 
  y_train += list(y_train_i) 
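
Because `shuffle=False` is passed, the split is chronological: the earliest 70% of each product's weeks go to training and the remaining weeks to testing. A minimal sketch on a hypothetical 52-week index (`weeks` is an illustrative stand-in, not a column of the dataset):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for one product's 52 weekly observations.
weeks = np.arange(52)

# shuffle=False keeps chronological order, so no later week ends up in training.
train_weeks, test_weeks = train_test_split(weeks, shuffle=False, train_size=0.7)

print(len(train_weeks), len(test_weeks))   # sizes of the two parts
print(train_weeks[-1], test_weeks[0])      # last training week, first test week
```

With 52 weeks per product, this leaves 36 weeks for training and 16 for testing.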

## Centralized method

Once the train and test datasets are created for each product, we stack them into a single dataset, fit the centralized model on it, and evaluate it.

In [13]:
X_cen_train = X_dict[skuSet[0]]['train'] #initialization with item 0
X_cen_test = X_dict[skuSet[0]]['test']

for i in skuSet[1:]: #Iteration over items
    X_cen_train = np.concatenate((X_cen_train, X_dict[i]['train']), axis = 0) #Bringing together the training set
    X_cen_test = np.concatenate((X_cen_test, X_dict[i]['test']), axis = 0)

model_cen = LinearRegression().fit(X_cen_train, y_train)

print('Centralized method with linear regression R2:',
      round(r2_score(y_test, model_cen.predict(X_cen_test)),3))  
print('Centralized method with linear regression MSE:',
      round(mean_squared_error(y_test, model_cen.predict(X_cen_test)),3))

res['Centralized(LR)']=[r2_score(y_test, model_cen.predict(X_cen_test))]

Centralized method with linear regression R2: 0.095
Centralized method with linear regression MSE: 107286.05


## Decentralized method

In this subsection, a separate prediction model is fitted for each product and stored in a dictionary. The overall accuracy is then calculated on the pooled predictions of all products.


In [14]:
y_pred = []
skuModels = {}

for i in skuSet:
    #one model for each item, fitted on training set
    #note: statsmodels OLS does not add an intercept unless a constant column is included in X
    model_i = OLS(y_dict[i]['train'], X_dict[i]['train'])
    skuModels[i] = model_i.fit()

    #compute and concatenate prediction of the model i on item i
    y_pred += list(skuModels[i].predict(X_dict[i]['test']))


#computing overall performance metrics on y_pred and y_test:
print('Decentralized method with linear regression R2:',round(r2_score(y_test, np.array(y_pred)),3))
print('Decentralized method with linear regression MSE:', round(mean_squared_error(y_test, np.array(y_pred)),3))

res['Decentralized(LR)']=[r2_score(y_test, np.array(y_pred))]

Decentralized method with linear regression R2: 0.174
Decentralized method with linear regression MSE: 97852.415
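
To make the difference between the two approaches concrete, here is a small sketch on synthetic data. The two products, their slopes, and the noise level are all made up for illustration; only the pooling logic mirrors the cells above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

# Two hypothetical products whose sales react to the feature in opposite ways.
slopes = {"A": 2.0, "B": -2.0}
data = {}
for sku, slope in slopes.items():
    x = rng.uniform(0, 10, size=(52, 1))
    y = slope * x[:, 0] + rng.normal(0, 0.1, size=52)
    data[sku] = {"X_train": x[:36], "y_train": y[:36],
                 "X_test": x[36:], "y_test": y[36:]}

y_test_all = np.concatenate([d["y_test"] for d in data.values()])

# Centralized: one model fitted on the stacked training rows of all products.
X_cen_train = np.vstack([d["X_train"] for d in data.values()])
y_cen_train = np.concatenate([d["y_train"] for d in data.values()])
X_cen_test = np.vstack([d["X_test"] for d in data.values()])
cen = LinearRegression().fit(X_cen_train, y_cen_train)
r2_cen = r2_score(y_test_all, cen.predict(X_cen_test))

# Decentralized: one model per product, predictions pooled before scoring.
y_pred_all = np.concatenate([
    LinearRegression().fit(d["X_train"], d["y_train"]).predict(d["X_test"])
    for d in data.values()
])
r2_dec = r2_score(y_test_all, y_pred_all)

print("centralized R2:", round(r2_cen, 3))
print("decentralized R2:", round(r2_dec, 3))
```

In this constructed case the single pooled slope cannot fit both products, so the decentralized models score much higher. Real data need not behave this way, because per-product models also have far fewer rows to learn from.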


# Prediction with aggregated seasonality

A common approach in retail is to estimate separate coefficients for each product. In this section, every feature except seasonality is considered at the product level in the dataset, while the seasonality (month) features are shared by all products. This method is called feature-fixed effect.
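
The construction can be sketched on a toy table: each product-level feature is multiplied by the product's dummy column, so it gets its own coefficient per product, while the month column stays shared. The table `toy` and its values are hypothetical; the column naming follows the `_fixed_effect_` pattern used in this notebook.

```python
import pandas as pd

# Hypothetical toy data: two products, one product-level feature, one month dummy.
toy = pd.DataFrame({
    "sku": [1, 1, 2, 2],
    "price": [10.0, 12.0, 5.0, 6.0],
    "month_2": [0, 1, 0, 1],
})

# One 0/1 column per product.
dummies = pd.get_dummies(toy["sku"], prefix="sku", dtype=int)

# The shared seasonality column and the product dummies go in unchanged.
design = pd.concat([toy[["month_2"]], dummies], axis=1)

# price * sku_k is non-zero only on the rows of product k.
for sku_col in dummies.columns:
    sku_id = sku_col.split("_")[1]
    design["price_fixed_effect_" + sku_id] = toy["price"] * dummies[sku_col]

print(design)
```

The number of columns grows with the number of products times the number of product-level features, which matters when only 52 weeks per product are available.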

## Structuring the dataset

In this part, the product prices are also stored separately in the dictionary of datasets, so that they can be used by the clustering technique in Section 5.

In [25]:
sales_fe_sku = sales.copy()
sales_fe_sku = pd.get_dummies(data=sales_fe_sku, columns=['sku'])
sales_fe_sku["sku"] = sales["sku"] 


colnames_to_fix = [i for i in sales.columns if i not in ["week","weekly_sales","sku",
                                                         'month_2', 'month_3', 'month_4', 'month_5', 'month_6',
                                                         'month_7', 'month_8', 'month_9', 'month_10', 'month_11', 
                                                         'month_12']]

sales_seasonality = sales_fe_sku.copy()

for feature in colnames_to_fix:
  for i in range(1,45):
    sales_seasonality[str(feature)+"_fixed_effect_"+str(i)] = sales_seasonality[feature]*sales_seasonality["sku_"+str(i)]

skuSet = list(sales.sku.unique()) #the SKU numbers do not change
skuData = {}
colnames = [i for i in sales_seasonality.columns if i not in ["week","weekly_sales","sku"] and i not in colnames_to_fix]
for i in skuSet:
  df_i = sales_seasonality[sales_seasonality.sku == i]
  skuData[i] = {'X': df_i[:NW][colnames].values,
                'y': df_i[:NW]['weekly_sales'].values,
                'price': df_i.price[:NW].values}
    


X_dict = {}
y_dict = {}

y_test = []
y_train = []

for i in skuSet:
  
  X_train_i,X_test_i = train_test_split(skuData[i]["X"], shuffle=False, train_size=0.7) #split for X
  y_train_i,y_test_i = train_test_split(skuData[i]["y"], shuffle=False, train_size=0.7) #split for y 
  p_train, p_test = train_test_split(skuData[i]['price'], shuffle=False, train_size=0.7)

  X_dict[i] = {'train': X_train_i, 'test': X_test_i, 'price_train': p_train} #filling dictionary
  y_dict[i] = {'train': y_train_i, 'test': y_test_i}

  y_test += list(y_test_i) 
  y_train += list(y_train_i) 





## Centralized method

In [16]:
X_cen_train = X_dict[skuSet[0]]['train'] #initialization with item 0
X_cen_test = X_dict[skuSet[0]]['test']

for i in skuSet[1:]: #Iteration over items
    X_cen_train = np.concatenate((X_cen_train, X_dict[i]['train']), axis = 0) #Bringing together the training set
    X_cen_test = np.concatenate((X_cen_test, X_dict[i]['test']), axis = 0)

model_cen = LinearRegression(fit_intercept=True).fit(X_cen_train, y_train)
print('Seasonality aggregated (LR) R2:', round(r2_score(y_test, model_cen.predict(X_cen_test)),3))  
print('Seasonality aggregated (LR) MSE:', round(mean_squared_error(y_test, model_cen.predict(X_cen_test)),3))

res['Centralized(SA-LR)']=[r2_score(y_test, model_cen.predict(X_cen_test))]

Seasonality aggregated (LR) R2: -1.7368363591631102e+22
Seasonality aggregated (LR) MSE: 2.0578989492657866e+27


## Decentralized method

In [17]:
y_pred = []
skuModels = {}

for i in skuSet:
    skuModels[i] = LinearRegression(fit_intercept=True).fit(X_dict[i]["train"],y_dict[i]["train"])
    y_pred += list(skuModels[i].predict(X_dict[i]['test']))

#computing overall performance metrics on y_pred and y_test:
print('OOS R2:',round(r2_score(y_test, np.array(y_pred)),3))
print('OOS MSE:', round(mean_squared_error(y_test, np.array(y_pred)),3))

res['Decentralized(SA-LR)']=[r2_score(y_test, np.array(y_pred))]

OOS R2: -7.907706583312968e+24
OOS MSE: 9.369484340334244e+29
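
One plausible reading of these extreme negative R2 values is that the design matrix has more columns than a product has training rows, so ordinary least squares has no unique solution. This was not verified on the notebook's data. The sketch below uses made-up sizes and a minimum-norm least-squares solve, and shows the milder version of the effect: a perfect training fit that does not carry over to the test rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_cols = 36, 16, 100   # hypothetical: more columns than training rows

X_train = rng.normal(size=(n_train, n_cols))
X_test = rng.normal(size=(n_test, n_cols))
true_coef = np.zeros(n_cols)
true_coef[:3] = [1.0, -2.0, 0.5]        # only three columns carry signal
y_train = X_train @ true_coef + rng.normal(0, 0.1, size=n_train)
y_test = X_test @ true_coef + rng.normal(0, 0.1, size=n_test)

def r2(y, y_hat):
    """Coefficient of determination."""
    return 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Minimum-norm least-squares solution of the underdetermined system.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

rank = np.linalg.matrix_rank(X_train)
train_r2 = r2(y_train, X_train @ coef)
test_r2 = r2(y_test, X_test @ coef)
print("rank:", rank, "columns:", n_cols)
print("train R2:", train_r2)
print("test R2:", test_r2)
```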


# Regularization

The dataset built in the previous section contains a very large number of columns. This can lead to overfitting or multicollinearity, which reduces prediction accuracy and makes the coefficients less reliable. In response, we deploy the Elasticnet method, which combines the penalties of Lasso and Ridge regression. A hyperparameter tuning algorithm is used to determine the best values of the Elasticnet parameters.

In [18]:
from sklearn.linear_model import ElasticNet
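
In scikit-learn's `ElasticNet`, `alpha` scales the overall penalty and `l1_ratio` sets the mix: `l1_ratio=1` gives a pure L1 (Lasso) penalty, and values between 0 and 1 blend L1 and L2. The following sketch, on made-up data, checks the Lasso equivalence and fits a mixed penalty for comparison.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))            # hypothetical feature matrix
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(0, 0.1, size=60)

# l1_ratio=1 makes the penalty purely L1, i.e. the Lasso.
enet_l1 = ElasticNet(alpha=0.4, l1_ratio=1).fit(X, y)
lasso = Lasso(alpha=0.4).fit(X, y)
same_as_lasso = np.allclose(enet_l1.coef_, lasso.coef_)
print("ElasticNet(l1_ratio=1) matches Lasso:", same_as_lasso)

# A mixed penalty blends L1 sparsity with L2 shrinkage.
enet_mix = ElasticNet(alpha=0.4, l1_ratio=0.5).fit(X, y)
print("pure L1 coefficients:", enet_l1.coef_.round(2))
print("mixed coefficients:  ", enet_mix.coef_.round(2))
```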

## Centralized method

In this subsection, we consider the seasonality aggregated approach (Section 3.2) for the centralized prediction method.


In [None]:
'''
# Hyperparameter tuning
BestR2 = -1
BestPar = [-1,-1]

idx1 = [0.1*i for i in range(1,11)] # Set of values for alpha
idx2 = [0.1*i for i in range(11)]   # Set of values for l1_ratio

#hyperparameter tuning for centralized
for i in idx1:
    for j in idx2:
        model_cen = ElasticNet(alpha= i,l1_ratio=j)
        model_cen.fit(X_cen_train, y_train)
        R2 = r2_score(y_test, model_cen.predict(X_cen_test))
        if R2>BestR2:
            BestR2 = R2
            BestPar[0] = i
            BestPar[1] = j
print('The best value of alpha:',BestPar[0])
print('The best value of l1_ratio:',BestPar[1])
'''
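
The grid search above scores every candidate on the test set, so the test R2 reported afterwards is likely to be optimistic. A possible alternative, sketched here on made-up data and not the procedure used in this notebook, is to choose the parameters with time-ordered cross-validation inside the training set. The `l1_ratio` grid starts at 0.1 rather than 0, which is an assumption made to avoid the pure-L2 case.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV, TimeSeriesSplit

rng = np.random.default_rng(0)
# Hypothetical training features, assumed to be sorted in time order.
X_train = rng.normal(size=(60, 5))
y_train = X_train @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(0, 0.5, size=60)

param_grid = {
    "alpha": [0.1 * i for i in range(1, 11)],
    "l1_ratio": [0.1 * j for j in range(1, 11)],
}

# Each fold validates on rows that come after the rows it was fitted on.
search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    scoring="r2",
    cv=TimeSeriesSplit(n_splits=3),
)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("cross-validated R2:", round(search.best_score_, 3))
```

The test set is then used once, to report the accuracy of the selected model.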

In [19]:
#model_cen = ElasticNet(alpha=BestPar[0],l1_ratio=BestPar[1])
model_cen = ElasticNet(alpha=0.4,l1_ratio=1)
model_cen.fit(X_cen_train, y_train)
print('Seasonality aggregated (EN) R2:', round(r2_score(y_test, model_cen.predict(X_cen_test)),3))  
print('Seasonality aggregated (EN) MSE:', round(mean_squared_error(y_test, model_cen.predict(X_cen_test)),3))

res['Centralized(SA-EN)']=[r2_score(y_test, model_cen.predict(X_cen_test))]

Seasonality aggregated (EN) R2: 0.33
Seasonality aggregated (EN) MSE: 79389.941




## Decentralized method

In [None]:
'''
# Hyperparameter tuning
BestR2 = -1
BestPar = [-1,-1]

idx1 = [0.1*i for i in range(1,11)] # Set of values for alpha
idx2 = [0.1*i for i in range(11)]   # Set of values for l1_ratio

for i in idx1:
    for j in idx2:
        y_pred = []
        y_test = []
        for k in skuSet:
            elastic = ElasticNet(alpha= i,l1_ratio=j)
            ModelsElastic = elastic.fit(X_dict[k]["train"],y_dict[k]["train"])
            y_pred += list(ModelsElastic.predict(X_dict[k]['test']))
            y_test += list(y_dict[k]["test"])
        R2 = r2_score(np.array(y_test), np.array(y_pred))
        if R2>BestR2:
            BestR2 = R2
            BestPar[0] = i
            BestPar[1] = j
            
print('The best value of alpha:',BestPar[0])
print('The best value of l1_ratio:',BestPar[1])
'''

In [20]:
y_pred = []
skuModels = {}

for i in skuSet:
#   elastic = ElasticNet(alpha=BestPar[0],l1_ratio=BestPar[1])
    elastic = ElasticNet(alpha=0.9,l1_ratio=0.5)
    skuModels[i] = elastic.fit(X_dict[i]["train"],y_dict[i]["train"])
    y_pred += list(skuModels[i].predict(X_dict[i]['test']))

#computing overall performance metrics on y_pred and y_test:
print('Seasonality aggregated (EN) R2:',round(r2_score(y_test, np.array(y_pred)),3))
print('Seasonality aggregated (EN) MSE:', round(mean_squared_error(y_test, np.array(y_pred)),3))

res['Decentralized(SA-EN)']=[r2_score(y_test, np.array(y_pred))]

Seasonality aggregated (EN) R2: 0.467
Seasonality aggregated (EN) MSE: 63140.873


# Clustering

In this section, we deploy a clustering algorithm on top of the method proposed in Section 4.2. Each product is summarized by its mean 'price' and mean 'weekly_sales' over the training period, and the products are clustered on these two values:

In [29]:
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
    
z = 8 #number of clusters
X_clus = np.zeros((len(skuSet), 2))
for count, sku in enumerate(skuSet):
    # one row per SKU: mean training price and mean training weekly sales
    X_clus[count, :] = [np.mean(X_dict[sku]['price_train']),
                        np.mean(y_dict[sku]['train'])]

np.size(X_clus)
X_clus = scaler.fit_transform(X_clus)
kmeans = KMeans(n_clusters=z, random_state=0).fit(X_clus)
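
The clustering step can be illustrated on a handful of made-up product summaries. The numbers below are hypothetical and chosen so that two groups are clearly separated; the pipeline (min-max scaling followed by k-means) is the same as in the cell above.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical per-product summaries: [mean price, mean weekly sales].
sku_summary = np.array([
    [2.0, 900.0], [2.5, 950.0], [3.0, 1000.0],   # cheap, high-volume products
    [40.0, 30.0], [45.0, 25.0], [50.0, 20.0],    # expensive, low-volume products
])

# Rescale both columns to [0, 1] so neither dominates the Euclidean distance.
scaled = MinMaxScaler().fit_transform(sku_summary)

labels = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(scaled)
print(labels)
```

Without the scaling step, the weekly sales column would dominate the distances simply because its values are larger.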


Now that the products are clustered, we fit one Elasticnet predictor per cluster on the pooled data of the products in that cluster:

In [34]:
y_clus_pred = []
y_clus_test = []
for j in range(z):
  ##Get indices of items in cluster j 
  clus_items = list(np.where(kmeans.labels_ == j)[0])
  ##Initialization 
  #X
  X_clus_j_train = X_dict[skuSet[clus_items[0]]]['train'] #initialization with first item of the cluster
  X_clus_j_test = X_dict[skuSet[clus_items[0]]]['test']
  #y
  y_clus_j_train = list(y_dict[skuSet[clus_items[0]]]['train']) #initialization with first item of the cluster
  y_clus_j_test = list(y_dict[skuSet[clus_items[0]]]['test'])
  ##Loop 
  for idx in clus_items[1:]: #Iteration over items
    sku=skuSet[idx]
    #X
    X_clus_j_train = np.concatenate((X_clus_j_train, X_dict[sku]['train']), axis = 0) #Bringing together the training set for the cluster
    X_clus_j_test = np.concatenate((X_clus_j_test, X_dict[sku]['test']), axis = 0)
    #y
    y_clus_j_train += list(y_dict[sku]['train'])
    y_clus_j_test += list(y_dict[sku]['test'])
  ##Model
  elastic = ElasticNet(alpha= 0.4,l1_ratio=0)
  model_clus_j = elastic.fit(X_clus_j_train, y_clus_j_train)
  y_clus_pred += list(model_clus_j.predict(X_clus_j_test))
  y_clus_test += y_clus_j_test

#Results
print('Clustered (SA-EN) R2:',r2_score(y_clus_test, y_clus_pred))
print('Clustered (SA-EN) MSE:', mean_squared_error(y_clus_test, y_clus_pred))

res['Clustered (SA-EN)']=[r2_score(y_clus_test, y_clus_pred)]



Clustered (SA-EN) R2: 0.4515569728626547
Clustered (SA-EN) MSE: 64982.536974405724




# Results

In [35]:
res

| | Centralized(LR) | Decentralized(LR) | Centralized(SA-LR) | Decentralized(SA-LR) | Centralized(SA-EN) | Decentralized(SA-EN) | Clustered (SA-EN) |
|---|---|---|---|---|---|---|---|
| R2 | 0.094522 | 0.17414 | -1.736836e+22 | -7.907707e+24 | 0.329961 | 0.4671 | 0.451557 |


Compared to the results in the previous file, we see a sharp decline in the prediction accuracy of the methods. The primary approaches without seasonality aggregation do not seem effective at all, because their R2 values are below 0.2. The seasonality aggregated methods with Linear Regression predictors are severely overfitted, with hugely negative out-of-sample R2 values.
Finally, the Elasticnet method makes a significant improvement. The best method is to use the seasonality aggregated data and deploy the Elasticnet predictor in a decentralized approach (R2 of 0.467). Moreover, the clustering technique does not improve on the decentralized models (R2 of 0.452).
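
The ranking behind these conclusions can be reproduced with a short sketch. The R2 values are copied from the results table in this section, and the series name `r2_by_method` is only illustrative.

```python
import pandas as pd

# R2 values copied from the results table of this notebook.
r2_by_method = pd.Series({
    "Centralized(LR)": 0.094522,
    "Decentralized(LR)": 0.17414,
    "Centralized(SA-LR)": -1.736836e+22,
    "Decentralized(SA-LR)": -7.907707e+24,
    "Centralized(SA-EN)": 0.329961,
    "Decentralized(SA-EN)": 0.4671,
    "Clustered (SA-EN)": 0.451557,
})

# Sort from best to worst out-of-sample R2.
ranked = r2_by_method.sort_values(ascending=False)
print(ranked)
```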