# Final Model
---
This model will cover less of the overall dataset. I believe a decently accurate model could be constructed from the complete 2016 data while dropping all products that are not ordered more than a set number of times. This removes many products from the set that can be predicted, but should ideally give higher accuracy on the recently demanded items.

In [1]:
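import pandas as pd #DataFrame
df = pd.read_csv('../data/processed/Usable_Historical_Data.csv')
df = df[df.Year == 2016] #Keep only the 2016 orders
df = df.drop(columns=['Unnamed: 0','Year','Warehouse']) #Drop the columns that are not used as features
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date') #Display only; the result is not assigned, so df keeps Date as a column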

Date,Product_Code,Product_Category,Order_Demand,Month,Day
2016-01-28,Product_1178,Category_024,10,1.0,28.0
2016-01-04,Product_1502,Category_019,100000,1.0,4.0
2016-01-06,Product_0190,Category_007,320,1.0,6.0
2016-01-06,Product_0337,Category_021,2,1.0,6.0
2016-01-06,Product_1053,Category_024,10,1.0,6.0
...,...,...,...,...,...
2016-04-27,Product_1791,Category_006,1000,4.0,27.0
2016-04-27,Product_1974,Category_006,1,4.0,27.0
2016-04-28,Product_1787,Category_006,2500,4.0,28.0
2016-10-07,Product_0901,Category_023,50,10.0,7.0


Next I'll one-hot encode Product_Category and Month, and bin the days of the month into weekly segments that serve as one-hot columns as well.

In [2]:
df['Week1'] = (df['Day']<8) & (df['Day']>0)   #Days 1-7
df['Week2'] = (df['Day']<15) & (df['Day']>7)  #Days 8-14
df['Week3'] = (df['Day']<22) & (df['Day']>14) #Days 15-21
df['Week4'] = (df['Day']>21)                  #Days 22 through the end of the month
df = df.join(pd.get_dummies(df['Product_Category'])).drop(columns=['Product_Category','Day'])
df = df.join(pd.get_dummies(df['Month'],prefix='Month')).drop(columns='Month')
df.head()

Unnamed: 0,Product_Code,Date,Order_Demand,Week1,Week2,Week3,Week4,Category_001,Category_003,Category_005,...,Month_3.0,Month_4.0,Month_5.0,Month_6.0,Month_7.0,Month_8.0,Month_9.0,Month_10.0,Month_11.0,Month_12.0
690943,Product_1178,2016-01-28,10,False,False,False,True,0,0,0,...,0,0,0,0,0,0,0,0,0,0
699219,Product_1502,2016-01-04,100000,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768552,Product_0190,2016-01-06,320,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768635,Product_0337,2016-01-06,2,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768656,Product_1053,2016-01-06,10,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
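
As a side note, the same weekly binning can be written with `pd.cut` instead of four boolean masks. This is only a minimal sketch on made-up day values; the `days` series and the bin labels are assumptions for illustration, not part of the notebook's pipeline.

```python
import pandas as pd

# Hypothetical sample of day-of-month values (not taken from the dataset).
days = pd.Series([1, 7, 8, 14, 15, 21, 22, 31], name="Day")

# Same boundaries as the boolean masks: 1-7, 8-14, 15-21, 22-31.
week = pd.cut(
    days,
    bins=[0, 7, 14, 21, 31],  # right-inclusive edges: (0, 7], (7, 14], ...
    labels=["Week1", "Week2", "Week3", "Week4"],
)

# One indicator column per week bin.
week_dummies = pd.get_dummies(week)
print(week_dummies.sum().to_dict())
```

One difference worth knowing: `pd.cut` leaves values outside the bin edges as missing, whereas the `Week4` mask above accepts any day greater than 21.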


Now that everything except Product_Code is one-hot encoded the way I want, I'll look at how often each product occurs in the dataset and decide from there which products aren't useful and should be dropped.

In [15]:
df.groupby('Product_Code')['Date'].count().describe()

count    2119.000000
mean       89.025484
std       173.849107
min         1.000000
25%        20.000000
50%        37.000000
75%        83.000000
max      2875.000000
Name: Date, dtype: float64

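  The standard deviation (about 174) is nearly twice the mean (about 89). If the per-product order counts were normally distributed, roughly 68% of products would fall between 89-174 and 89+174, and since a product can't appear a negative number of times that range becomes 0 to 263. A standard deviation this large relative to the mean, together with a median (37) well below the mean, leads me to believe the distribution has a right skew.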
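
A quick way to check this kind of reasoning is to compare the mean with the median and look at the sample skewness. The sketch below uses made-up counts (the `counts` values are hypothetical, not the real per-product counts) and only illustrates the check.

```python
import pandas as pd

# Hypothetical per-product order counts (illustrative values only).
counts = pd.Series([1, 5, 12, 20, 37, 40, 83, 150, 900, 2875])

summary = {
    "mean": counts.mean(),
    "median": counts.median(),
    "std": counts.std(),
    "skew": counts.skew(),  # sample skewness; a positive value means a longer right tail
}
print(summary)
```

A mean well above the median and a positive skewness both point the same way as the summary statistics above.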
  
  A right skew isn't necessarily bad: the further right a product sits, the more orders were placed for it, which is useful for predictions. Since demand is seasonal over the year, it would be useful to use 12 orders as the cut-off for dropping products, since a product with fewer orders than that could not have been ordered even once a month.

  The counts are most likely not normally distributed, so trimming outliers based on the mean and standard deviation is probably not suited to this dataset. Dropping products that appear 12 times or fewer (the filter below keeps counts strictly greater than 12) still leaves more than 80% of the products, and the ones that remain are most likely the ones whose demand can be predicted.

In [29]:
#Keep products with more than 12 rows; products with exactly 12 rows are dropped as well
df = df.groupby('Product_Code').filter(lambda x: x['Date'].count() > 12)
df.head()

Unnamed: 0,Product_Code,Date,Order_Demand,Week1,Week2,Week3,Week4,Category_001,Category_003,Category_005,...,Month_3.0,Month_4.0,Month_5.0,Month_6.0,Month_7.0,Month_8.0,Month_9.0,Month_10.0,Month_11.0,Month_12.0
690943,Product_1178,2016-01-28,10,False,False,False,True,0,0,0,...,0,0,0,0,0,0,0,0,0,0
699219,Product_1502,2016-01-04,100000,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768552,Product_0190,2016-01-06,320,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768635,Product_0337,2016-01-06,2,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0
768656,Product_1053,2016-01-06,10,True,False,False,False,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [31]:
df['Product_Code'].unique().size

1863
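
That is 1863 of the original 2119 products, or about 88%. To see how the cut-off affects how many products survive, the filter can also be written with `value_counts` and `isin`. This is a sketch on a toy order log; the `orders` frame and the `min_orders` name are hypothetical.

```python
import pandas as pd

# Hypothetical toy order log: one row per order line.
orders = pd.DataFrame({
    "Product_Code": ["A"] * 15 + ["B"] * 12 + ["C"] * 3 + ["D"] * 20,
})

min_orders = 12  # same cut-off as the filter above (strictly greater than)

# Rows per product, then keep the products whose count exceeds the cut-off.
counts = orders["Product_Code"].value_counts()
kept_codes = counts[counts > min_orders].index
filtered = orders[orders["Product_Code"].isin(kept_codes)]

# Share of distinct products that survive the filter.
kept_share = len(kept_codes) / counts.size
print(sorted(kept_codes), kept_share)
```

In this toy data product `B`, with exactly 12 rows, is dropped because the comparison is strict.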

This data looks much cleaner than the data used in the initial model. I'll go with Decision Tree Regression again: the dataset is 20% smaller than the one used for the initial model, and it will be interesting to see whether that makes a significant difference in the R2 score.

In [32]:
target = df['Order_Demand']
df = df.join(pd.get_dummies(df['Product_Code'])).drop(columns=['Product_Code','Date','Order_Demand']) #Dropping Date as its month and week are one-hot encoded already

from sklearn.model_selection import train_test_split #Train Test Split
x_train, x_test, y_train, y_test = train_test_split(df.values, target.values, test_size=0.20, random_state=0)

from sklearn.tree import DecisionTreeRegressor #Decision Tree Regressor for modeling
model_dt = DecisionTreeRegressor(random_state=0)
model_dt.fit(x_train, y_train)

from sklearn.metrics import r2_score #R2_Score function
predicted_x_train = model_dt.predict(x_train) #Predictions on the training split; x_test/y_test are not used here
r2_score(y_train, predicted_x_train) #R2 on the training data

0.3204339931587371

Admittedly an R2 of .32 is pretty bad, and since it is computed on the training split it is likely an optimistic estimate. But considering the initial model gave a best result of .18, I would consider this at least a significant step in the right direction.
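
Because the score above comes from the training split, it would be worth comparing it with the held-out split as well. The sketch below shows that comparison on synthetic data; the arrays, seed, and variable names are assumptions for illustration, and its numbers say nothing about this dataset.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (an assumption for illustration): a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = X[:, 0] * 3.0 + rng.normal(scale=2.0, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

model = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)

# An unpruned tree memorizes continuous training data, so the training R2
# is close to 1; the held-out R2 is the number that reflects generalization.
r2_train = r2_score(y_train, model.predict(X_train))
r2_test = r2_score(y_test, model.predict(X_test))
print(f"train R2: {r2_train:.3f}  test R2: {r2_test:.3f}")
```

With the one-hot features used in this notebook many rows share identical inputs, so the tree cannot memorize the training targets the way it does here. The same two calls on `x_test` and `y_test` would still give the number to report.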

In [33]:
import pickle #To serialize the model
with open('../models/Feature_Decision_Tree_2.sav', 'wb') as f: #Context manager closes the file after writing
    pickle.dump(model_dt, f)
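
For completeness, here is a minimal sketch of reading a pickled model back and checking that it predicts the same values. The stand-in model and the temporary path are hypothetical, not the notebook's model or the file saved above.

```python
import os
import pickle
import tempfile

from sklearn.tree import DecisionTreeRegressor

# Tiny stand-in model (hypothetical data, not the notebook's model).
model = DecisionTreeRegressor(random_state=0).fit([[0], [1], [2]], [0.0, 1.0, 2.0])

# Hypothetical path inside a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.sav")

    with open(path, "wb") as f:
        pickle.dump(model, f)

    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored model should reproduce the original predictions.
print(restored.predict([[1]]), model.predict([[1]]))
```

Pickle files should only be loaded from trusted sources, and a pickled scikit-learn model is best loaded with the same library version that saved it.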

# Final Thoughts
---
The model that I generated wasn't great; far from it, actually, with an R2 score of 0.32. But it is still quite fulfilling to see the model get more accurate after some data cleaning compared to the initial model.
  
* Some shortcomings:
  * The use of the product codes as a one-hot encoded feature. I feel this large dimensionality took away from the accuracy of the model overall, and the product code may be better suited to a different encoding scheme.
  * The choice of model. A model designed for data with seasonality, such as ARIMA, has a more robust set of parameters for predicting time series.
  * The single model for all products. The number of products made it difficult to do one sinusoidal fit for the prediction, so it may have been better to iterate through the dataset and generate a sinusoidal fit for each product to predict its future demand. This would of course result in 2000+ fit equations, which couldn't be used like other models by simply passing in values, but would be more useful overall.
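
The per-product sinusoidal fit from the last point could be sketched as follows. This is only an illustration under stated assumptions: the period is fixed at 12 months (which turns the fit into ordinary least squares), the toy data is synthetic and noise-free, and the helper names `fit_yearly_sinusoid` and `predict_yearly_sinusoid` are hypothetical.

```python
import numpy as np
import pandas as pd

W = 2.0 * np.pi / 12.0  # angular frequency for an assumed 12-month period


def fit_yearly_sinusoid(months, demand):
    """Least-squares fit of demand ~ a + b*sin(W*t) + c*cos(W*t)."""
    t = np.asarray(months, dtype=float)
    design = np.column_stack([np.ones_like(t), np.sin(W * t), np.cos(W * t)])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(demand, dtype=float), rcond=None)
    return coeffs  # array [a, b, c]


def predict_yearly_sinusoid(coeffs, months):
    """Evaluate the fitted curve at the given month numbers."""
    t = np.asarray(months, dtype=float)
    return coeffs[0] + coeffs[1] * np.sin(W * t) + coeffs[2] * np.cos(W * t)


# Hypothetical monthly demand for two products (synthetic, noise-free).
months = np.arange(1, 13)
toy = pd.DataFrame({
    "Product_Code": ["P1"] * 12 + ["P2"] * 12,
    "Month": np.tile(months, 2),
    "Order_Demand": np.concatenate([
        100 + 20 * np.sin(W * months),
        50 + 10 * np.cos(W * months),
    ]),
})

# One small coefficient vector per product instead of one global model.
fits = {
    code: fit_yearly_sinusoid(group["Month"], group["Order_Demand"])
    for code, group in toy.groupby("Product_Code")
}

print(predict_yearly_sinusoid(fits["P1"], [3]))
```

Storing the fits in a dictionary keyed by product code keeps prediction close to "passing in values": look up the product's coefficients, then evaluate the curve for the requested month.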