# Goals

5. Build one prediction model using the ML algorithms of this course
6. Evaluate your prediction model
7. Try different ways to improve your model and show the improvements.
8. Submit code and results in Jupyter and HTML formats on canvas

### You are provided with daily historical sales data. The task is to forecast the total amount of products sold in every shop for the test set. Note that the list of shops and products slightly changes every month. Creating a robust model that can handle such situations is part of the challenge.

##### Imports

In [2]:
import numpy as np
import pandas as pd

##### Read in Data

In [3]:
item_cat = pd.read_csv("data/item_categories.csv")
items = pd.read_csv("data/items.csv")
sales = pd.read_csv("data/sales_train.csv")
shops = pd.read_csv("data/shops.csv")

In [4]:
item_cat.head()

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2
3,Аксессуары - PS4,3
4,Аксессуары - PSP,4


In [5]:
items.head()

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40
3,***ГОЛУБАЯ ВОЛНА (Univ) D,3,40
4,***КОРОБКА (СТЕКЛО) D,4,40


In [6]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0
3,06.01.2013,0,25,2554,1709.05,1.0
4,15.01.2013,0,25,2555,1099.0,1.0


In [7]:
shops.head()

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2
3,"Балашиха ТРК ""Октябрь-Киномир""",3
4,"Волжский ТЦ ""Волга Молл""",4


##### Notes from viewing the data
* Our target is based on `item_cnt_day`, the number of products sold per day, but the forecast is needed per month. So I will group the sales by month and also by `shop_id`, because grouping by month alone would combine the sales of all shops.
* We can use `item_category_id` as another feature to train the model, because it captures the kind of item being sold.

##### Checking for null values
* We can see that there are no null values in our datasets

In [8]:
item_cat.isnull().any()

item_category_name    False
item_category_id      False
dtype: bool

In [9]:
items.isnull().any()

item_name           False
item_id             False
item_category_id    False
dtype: bool

In [10]:
shops.isnull().any()

shop_name    False
shop_id      False
dtype: bool

In [11]:
sales.isnull().any()

date              False
date_block_num    False
shop_id           False
item_id           False
item_price        False
item_cnt_day      False
dtype: bool
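
The four checks above can also be written as one loop. The following is a minimal sketch on small stand-in frames (the toy values are hypothetical, not the real data). It uses `isnull().sum()` instead of `any()`, so the number of missing values per column is visible as well:

```python
import pandas as pd

# Hypothetical stand-ins for the real tables; only the column names match.
frames = {
    "items": pd.DataFrame({"item_id": [0, 1], "item_category_id": [40, 76]}),
    "shops": pd.DataFrame({"shop_id": [0, 1], "shop_name": ["a", None]}),
}

# Count missing values per column for every table in one pass.
null_counts = {name: frame.isnull().sum() for name, frame in frames.items()}

for name, counts in null_counts.items():
    print(name, counts.to_dict())
```

In the notebook, the dictionary would hold `item_cat`, `items`, `sales` and `shops` instead of the toy frames.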

###### Assumption
* If `item_cnt_day` is the number of sales made that day, then I will assume that any value less than zero should be zero, because a sale can only add a positive count, not a negative one.

##### Setting target values that are less than 0 to 0

In [12]:
np.sort(sales["item_cnt_day"].unique().astype(int))

array([ -22,  -16,   -9,   -6,   -5,   -4,   -3,   -2,   -1,    1,    2,
          3,    4,    5,    6,    7,    8,    9,   10,   11,   12,   13,
         14,   15,   16,   17,   18,   19,   20,   21,   22,   23,   24,
         25,   26,   27,   28,   29,   30,   31,   32,   33,   34,   35,
         36,   37,   38,   39,   40,   41,   42,   43,   44,   45,   46,
         47,   48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
         58,   59,   60,   61,   62,   63,   64,   65,   66,   67,   68,
         69,   70,   71,   72,   73,   74,   75,   76,   77,   78,   79,
         80,   81,   82,   83,   84,   85,   86,   87,   88,   89,   90,
         91,   92,   93,   95,   96,   97,   98,   99,  100,  101,  102,
        103,  104,  105,  106,  107,  108,  109,  110,  111,  112,  113,
        114,  115,  116,  117,  118,  121,  124,  126,  127,  128,  129,
        130,  131,  132,  133,  134,  135,  138,  139,  140,  142,  145,
        146,  147,  148,  149,  150,  151,  153,  1

In [13]:
day = np.array(sales["item_cnt_day"])
day[day < 0] = 0
sales["item_cnt_day"] = day
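
The same clamping can be done without the intermediate array. This is a minimal sketch using `Series.clip` on hypothetical counts, under the assumption stated above that negative values should be treated as zero:

```python
import pandas as pd

# Hypothetical daily counts, including some negative entries.
counts = pd.Series([1.0, -1.0, 3.0, -22.0, 0.0])

# Raise everything below zero up to zero; non-negative values are unchanged.
clipped = counts.clip(lower=0)
print(clipped.tolist())
```

On the real data this would presumably read `sales["item_cnt_day"] = sales["item_cnt_day"].clip(lower=0)`.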

##### Adding another feature to the dataset which is the item category

In [14]:
# Look up each sale's category by item_id
# (assumes every item_id appears exactly once in items)
category_by_item = items.set_index("item_id")["item_category_id"]
sales["item_category"] = sales["item_id"].map(category_by_item)

##### Reformatting the date in the dataset so I can use the groupby function

In [15]:
sales['date'] = pd.to_datetime(sales["date"])
sales['month'] = sales['date'].dt.month

In [16]:
sales.head()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_category,month
0,2013-02-01,0,59,22154,999.0,1.0,37,2
1,2013-03-01,0,25,2552,899.0,1.0,58,3
2,2013-05-01,0,25,2552,899.0,0.0,58,5
3,2013-06-01,0,25,2554,1709.05,1.0,58,6
4,2013-01-15,0,25,2555,1099.0,1.0,56,1
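
In the output above, `02.01.2013` comes out as `2013-02-01` while `15.01.2013` comes out as `2013-01-15`. This suggests the day-first strings were read month-first wherever that was possible. Below is a minimal sketch of parsing with an explicit format, using a few of the raw strings from the first `sales.head()` output. Applying it to the full dataset would change the `month` column and everything computed from it:

```python
import pandas as pd

# Raw date strings in day.month.year order, as they appear in the sales file.
raw = pd.Series(["02.01.2013", "03.01.2013", "15.01.2013"])

# An explicit format removes the day/month ambiguity.
parsed = pd.to_datetime(raw, format="%d.%m.%Y")
months = parsed.dt.month
print(months.tolist())
```

All three rows now land in January, which agrees with their `date_block_num` of 0.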


##### Building the dataframe of total products sold per month and shop

In [17]:
# Sum the daily counts once per (month, shop_id) pair and reuse the result
monthly = sales.groupby(['month', "shop_id"]).agg({'item_cnt_day': 'sum'})
shop_ident = np.array(monthly.index.get_level_values(1))
mons = np.array(monthly.index.get_level_values(0))
cum = np.array(monthly).astype(int)

In [18]:
# sales.groupby(['month',"shop_id","item_category","item_id"]).agg({'item_cnt_day':'sum'})

In [19]:
shop_ident.shape

(698,)

In [20]:
mons.shape

(698,)

In [21]:
cum = cum.reshape(-1)

In [22]:
df = pd.DataFrame({"month":mons,"shop_id":shop_ident,"total_products_sold":cum})

In [23]:
df

Unnamed: 0,month,shop_id,total_products_sold
0,1,0,3626
1,1,1,1955
2,1,2,2217
3,1,3,2068
4,1,4,3627
...,...,...,...
693,12,55,5321
694,12,56,7374
695,12,57,16788
696,12,58,10978
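
Note that `dt.month` only takes the values 1 to 12, so if the data spans more than one year, the grouping above pools the same calendar month from different years into one row. If one row per shop per actual month is wanted, grouping on `date_block_num` is one alternative, assuming that column is a consecutive month index as its name suggests. A minimal sketch on hypothetical toy rows:

```python
import pandas as pd

# Hypothetical rows: the same shop in January of two different years.
toy = pd.DataFrame({
    "date_block_num": [0, 0, 12, 12],
    "month": [1, 1, 1, 1],
    "shop_id": [25, 25, 25, 25],
    "item_cnt_day": [1.0, 2.0, 4.0, 3.0],
})

# Grouping by calendar month merges both Januaries into a single row.
by_calendar_month = toy.groupby(["month", "shop_id"], as_index=False)["item_cnt_day"].sum()

# Grouping by the month index keeps the two years apart.
by_month_index = toy.groupby(["date_block_num", "shop_id"], as_index=False)["item_cnt_day"].sum()

print(len(by_calendar_month), len(by_month_index))
```

Which grouping is right depends on whether the model should learn a seasonal average or a month-by-month history.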


##### Training and evaluating the model

In [56]:
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Features are month and shop_id; the target is the total sold per month and shop
x = df.drop("total_products_sold", axis=1).values
y = df["total_products_sold"]

# Hold out 30% of the rows for evaluation
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)

# Scale the features, then fit a random forest regressor
pipe = make_pipeline(StandardScaler(), RandomForestRegressor())
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)

# Evaluate on the held-out rows
mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = metrics.r2_score(y_test, y_pred)
print('Mean Absolute Error:', mae)
print('Mean Squared Error:', mse)
print('Root Mean Squared Error:', rmse)
print('R-squared:', r2)

Mean Absolute Error: 685.5908571428572
Mean Squared Error: 1458897.0754209524
Root Mean Squared Error: 1207.8481176956614
R-squared: 0.9459493403104712
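
The last figure is the coefficient of determination (R²), a score where 1.0 means a perfect fit, rather than an error. The following is a minimal sketch of how the four numbers relate, computed by hand with NumPy on hypothetical values (not the notebook's data):

```python
import numpy as np

# Hypothetical true and predicted monthly totals.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 320.0, 380.0])

errors = y_true - y_hat
mae = np.mean(np.abs(errors))                    # mean absolute error
mse = np.mean(errors ** 2)                       # mean squared error
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1.0 - ss_res / ss_tot                       # coefficient of determination

print(mae, mse, rmse, r2)
```

MAE and RMSE are in the same units as `total_products_sold`, so they are the ones to compare against typical monthly totals.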
