In this competition we are given daily sales records spanning 2013 to 2015, and the goal is to predict future sales.

The data covers January 2013 to October 2015. We will predict the sales for November 2015 by analysing it.

Kernels that helped me and that I took inspiration from:

- [kernel 1](https://www.kaggle.com/sanjayar/step-by-step-guide-for-sales-data-prediction-lstm)
- [kernel 2](https://github.com/sharmaroshan/Predict-Future-Sales/blob/master/Predicting_Future_Sales.ipynb)
- [kernel 3](https://www.kaggle.com/homiarafarhana/predict-future-sales)
- [kernel 4](https://www.kaggle.com/stefanschulmeister87/extensive-eda-and-data-preparation)


Contents:

- [Imports](#imports)
    - [data description](#data-desc)
- [EDA](#eda)
    - [checking for nulls](#nulls)
    - [monthly sale](#month)
    - [correlation](#correlation)
    - [item sold per category](#soldpcat)
    - [item sold per month](#soldpmonth)
    - [wordcloud](#wordcloud)
    - [busiest days/months/years for the shops](#busy)
    - [dealing with outliers](#outlier)
- [Data Preprocessing](#datap)
    - [Checking for missing columns](#miss)
    - [processing shop data](#shopp)
    - [processing item category data](#catdatap)
- [Preparing final DF for modeling](#final)
- [Model Creation](#model)
- [Prediction](#prediction)
- [Submission](#submission)

# <a name="imports"></a>Imports


In [None]:
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings("ignore", module="lightgbm")

import itertools
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from lightgbm import LGBMRegressor
import matplotlib.pyplot as plt
import datetime 
import lightgbm as lgbm

pd.set_option('display.max_colwidth',None)

##### <a name="data-desc"></a>Provided data description

- sales_train.csv - the training set. Daily historical data from January 2013 to October 2015.
- test.csv - the test set. You need to forecast the sales for these shops and products for November 2015.
- sample_submission.csv - a sample submission file in the correct format.
- items.csv - supplemental information about the items/products.
- item_categories.csv - supplemental information about the item categories.
- shops.csv - supplemental information about the shops.

In [None]:
items = pd.read_csv('../input/competitive-data-science-predict-future-sales/items.csv')
item_cat = pd.read_csv('../input/competitive-data-science-predict-future-sales/item_categories.csv')
shops = pd.read_csv('../input/competitive-data-science-predict-future-sales/shops.csv')

train = pd.read_csv('../input/competitive-data-science-predict-future-sales/sales_train.csv')
test_dataset = pd.read_csv('../input/competitive-data-science-predict-future-sales/test.csv')

In [None]:
train.head()

In [None]:
train.shape

# <a name="eda"></a>EDA | Exploratory Data Analysis

- ID - an Id that represents a (Shop, Item) tuple within the test set
- shop_id - unique identifier of a shop
- item_id - unique identifier of a product
- item_category_id - unique identifier of item category
- item_cnt_day - number of products sold. You are predicting a monthly amount of this measure
- item_price - current price of an item
- date - date in format dd.mm.yyyy
- date_block_num - a consecutive month number, used for convenience. January 2013 is 0, February 2013 is 1,..., October 2015 is 33
- item_name - name of item
- shop_name - name of shop
- item_category_name - name of item category

In [None]:
train.isnull().sum()

##### <a name="nulls"></a>There are no nulls.

We make a copy of `train` called `train_dataset`. This is good practice because we do not want to alter the original data while exploring it.

In [None]:
train_dataset = train.copy()

In [None]:
train_dataset

**2169** pieces of item ID **1173** were sold on **28/10/2015**

In [None]:
train_dataset[train_dataset['item_cnt_day'] == 2169.0]

##### <a name="month"></a>Monthly sales

In [None]:
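# Select several columns with a list (double brackets); a bare tuple of names is rejected by recent pandas
monthly_sales=train_dataset.groupby(["date_block_num","shop_id","item_id"])[
    ["date","item_price","item_cnt_day"]].agg({"date":["min",'max'],"item_price":"mean","item_cnt_day":"sum"})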

In [None]:
monthly_sales

In [None]:
monthly_sales.columns
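
Aggregating with a dict of lists produces two-level column names. As a small sketch on made-up rows (the frame `toy` and the output names `mean_price` and `item_cnt_month` are assumptions for illustration), named aggregation gives flat column names instead:

```python
import pandas as pd

# Toy daily sales with the same column names as sales_train.csv (values are made up)
toy = pd.DataFrame({
    "date_block_num": [0, 0, 0, 1],
    "shop_id": [5, 5, 5, 5],
    "item_id": [10, 10, 11, 10],
    "item_price": [100.0, 120.0, 50.0, 110.0],
    "item_cnt_day": [1.0, 2.0, 1.0, 3.0],
})

# Named aggregation: output_name=(input_column, function)
toy_monthly = (
    toy.groupby(["date_block_num", "shop_id", "item_id"])
    .agg(
        mean_price=("item_price", "mean"),
        item_cnt_month=("item_cnt_day", "sum"),
    )
    .reset_index()
)
print(toy_monthly)
```

Flat names make later merges and column selections a little easier to read than tuples such as `("item_cnt_day", "sum")`.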

Plotting the monthly sales

In [None]:
sales_by_month = train_dataset.groupby(['date_block_num'])['item_cnt_day'].sum()
sales_by_month.plot()

The plot above shows that total sales decline over the months, with pronounced peaks at `date_block_num` 11 and 23, i.e. December of 2013 and 2014.

### <a name="correlation"></a>checking for correlation

There is no noticeably strong positive or negative correlation between the numeric columns.

In [None]:
# Correlate numeric columns only; the string `date` column cannot be correlated
corr = train_dataset.corr(numeric_only=True)
fig = plt.figure()
ax = fig.add_subplot(111)
cax = ax.matshow(corr,cmap='coolwarm', vmin=-1, vmax=1)
fig.colorbar(cax)
# Tick positions and labels must follow the columns of `corr`, not of the full frame
ticks = np.arange(0,len(corr.columns),1)
ax.set_xticks(ticks)
plt.xticks(rotation=90)
ax.set_yticks(ticks)
ax.set_xticklabels(corr.columns)
ax.set_yticklabels(corr.columns)
plt.show()

### <a name="soldpcat"></a>Checking how many items there are per category

In [None]:
items.head()

In [None]:
plt.rcParams['figure.figsize'] = (24, 9)
# items.csv has one row per item, so counting rows per category gives the number of items in it
sns.countplot(x=items['item_category_id'], palette = 'colorblind')
plt.title('Number of Items Per Category', fontsize = 30)
plt.xlabel('Item Categories', fontsize = 15)
plt.ylabel('Items', fontsize = 15)
plt.show()

### <a name="soldpmonth"></a>Checking how many items sold per month, i.e. Jan 2013 ~ Oct 2015

In [None]:
plt.rcParams['figure.figsize'] = (24, 9)
# Pass the data by keyword; recent seaborn versions no longer accept it positionally
sns.countplot(x=train_dataset['date_block_num'], palette = 'colorblind')
plt.title('Number of Item Sold Per Month Over 2013 - 2015', fontsize = 30)
plt.xlabel('Month', fontsize = 15)
plt.ylabel('Items Count', fontsize = 15)
plt.show()

Checking the number of unique shop names and item category names

In [None]:
# item_cat['item_category_name'].count()
print(item_cat['item_category_name'].nunique())
print(shops['shop_name'].nunique())

#### <a name="wordcloud"></a>WordCloud for shop name

In [None]:
from wordcloud import WordCloud
from wordcloud import STOPWORDS

plt.rcParams['figure.figsize'] = (15, 12)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color = 'pink',
                      max_words = 200, 
                      stopwords = stopwords,
                     width = 1200,
                     height = 800,
                     random_state = 42).generate(' '.join(shops['shop_name']))


plt.title('Wordcloud for Shop Names', fontsize = 25)
plt.axis('off')
plt.imshow(wordcloud, interpolation = 'bilinear')

wordcloud for item categories

In [None]:
plt.rcParams['figure.figsize'] = (15, 12)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(background_color = 'lightyellow',
                      max_words = 200, 
                      stopwords = stopwords,
                     width = 1200,
                     height = 800,
                     random_state = 42).generate(' '.join(item_cat['item_category_name']))


plt.title('Wordcloud for Item Category Names', fontsize = 24)
plt.axis('off')
plt.imshow(wordcloud, interpolation = 'bilinear')

#### <a name="busy"></a>Busiest days for the shop

Converting the `date` column to a datetime type

i.e. 01.02.2013    ==>    2013-02-01

In [None]:
# Dates are stored day-first (dd.mm.yyyy), so state the format explicitly
train_dataset['date'] = pd.to_datetime(train_dataset['date'], format='%d.%m.%Y', errors='coerce')
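
Because the raw dates are written day-first, it is worth checking the parse on a couple of strings. The following is a minimal sketch on two example dates (the variable names `raw` and `parsed` are only for illustration):

```python
import pandas as pd

raw = pd.Series(["01.02.2013", "28.10.2015"])

# Explicit format: day, then month, then four-digit year
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.day.tolist())    # day component of each date
print(parsed.dt.month.tolist())  # month component of each date
```

Without a format (or `dayfirst=True`), pandas reads an ambiguous string such as `01.02.2013` month-first, which would turn 1 February into 2 January and distort any month-level analysis.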

In [None]:
# Vectorised extraction of the date parts
days = train_dataset['date'].dt.day
months = train_dataset['date'].dt.month
years = train_dataset['date'].dt.year

In [None]:
plt.rcParams['figure.figsize'] = (15, 7)
sns.countplot(x=days, palette= 'pastel')
plt.title('The busiest days for the shops', fontsize = 24)
plt.xlabel('Days', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)

plt.show()

Busiest months and years for shops

In [None]:
# busy month
plt.rcParams['figure.figsize'] = (15, 7)
sns.countplot(x=months, palette= 'rocket')
plt.title('The busiest months for the shops', fontsize = 24)
plt.xlabel('Months', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)

plt.show()

# busy year
plt.rcParams['figure.figsize'] = (15, 7)
sns.countplot(x=years, palette= 'cubehelix')
plt.title('The busiest years for the shops', fontsize = 24)
plt.xlabel('Years', fontsize = 12)
plt.ylabel('Frequency', fontsize = 12)

plt.show()

In [None]:
train_dataset['day'] = days
train_dataset['month'] = months
train_dataset['year'] = years

In [None]:
train_dataset

Note that the list of shops and products slightly changes every month.

we can see that in Feb of 2013 `shop_id` `31` has the highest number of sales

In [None]:
sns.countplot(x=train_dataset[(train_dataset.month == 2) & (train_dataset.year == 2013)]['shop_id'], palette='pastel')

### <a name="outlier"></a>Outliers

In [None]:
train_dataset.describe()

From the description above, the max and min values show that `item_price` and `item_cnt_day` contain outliers.

### Below we plot the outliers

[This kernel](https://www.kaggle.com/homiarafarhana/predict-future-sales#Exploratory-Data-Analysis) helped me understand this concept.

In [None]:
plt.figure(figsize=(10,4))
plt.xlim(train_dataset.item_price.min(), train_dataset.item_price.max()*1.1)
sns.boxplot(x=train_dataset.item_price)

In [None]:
plt.figure(figsize=(10,4))
plt.xlim(train_dataset.item_cnt_day.min(), train_dataset.item_cnt_day.max()*1.1)
sns.boxplot(x=train_dataset.item_cnt_day)

Judging from the box plots above, one price point lies far beyond the rest, so we can get rid of it.

`item_cnt_day` also has a point far beyond the others, and we will remove that one too.

The filtering is shown below:


In [None]:
# Drop only the extreme high values here; negative prices and counts are handled in the cells below
train_dataset = train_dataset[train_dataset["item_price"] < 50000]
train_dataset = train_dataset[train_dataset["item_cnt_day"] < 1000]

In [None]:
train_dataset.shape

Also, from the box plot we can see a price below zero. We will replace that price with the median price of the same item in the same shop and month.


In [None]:
train_dataset[train_dataset['item_price'] < 0]

In [None]:
median = train_dataset[(train_dataset.shop_id==32)&(train_dataset.item_id==2973)&(train_dataset.date_block_num==4)&(train_dataset.item_price>0)].item_price.median()
median

🤗 

We replace negative prices with this median, after which no record with negative pricing remains.

In [None]:
train_dataset["item_price"] = train_dataset["item_price"].map(lambda x: median if x<0 else x)

No `item_price` less than 0 remaining

In [None]:
train_dataset[train_dataset['item_price'] < 0]

We can also see from the second box plot (`item_cnt_day`) that there are some negative values.

In [None]:
train_dataset[train_dataset['item_cnt_day'] < 0]

`item_cnt_day` < 0 (typically -1) probably means that those items were returned. A return involves no sale, so we set the negative values to 0.

In [None]:
train_dataset["item_cnt_day"] = train_dataset["item_cnt_day"].map(lambda x: 0 if x<0 else x)

no < 0 `item_cnt_day` value remaining

In [None]:
train_dataset[train_dataset['item_cnt_day'] < 0]

# <a name="datap"></a>Data Preprocessing

In [None]:
train_dataset.head(2)

##### <a name="miss"></a>Checking whether all `shop_id` and `item_id` values from the test dataset are also present in the train dataset

In [None]:
print("total unique items: ", items['item_id'].nunique())
print("total unique items in train dataset: ", train_dataset['item_id'].nunique())
print("total unique items in test dataset: ", test_dataset['item_id'].nunique())

print("total unique shops: ", shops['shop_id'].nunique())
print("total unique shops in train dataset: ", train_dataset['shop_id'].nunique())
print("total unique shops in test dataset: ", test_dataset['shop_id'].nunique())

We can see that the number of unique items differs between the test and train sets, so making predictions for the missing items is going to be difficult.

Let's find out which `item_id` values are in the test set but not in the train set.

363 items are not found in `train_dataset`, so predicting sales for them is not easy since we do not have prices for them.

In [None]:
test_item_list = np.unique(test_dataset['item_id'])
train_item_list = np.unique(train_dataset['item_id'])

# Items that appear in the test set but never in the training data
missing_item_ids_ = np.setdiff1d(test_item_list, train_item_list)
len(missing_item_ids_)
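
The same check can be written with a set difference or a boolean mask. This is a small sketch on made-up IDs (`toy_train` and `toy_test` are hypothetical frames, not the competition data):

```python
import pandas as pd

toy_train = pd.DataFrame({"item_id": [1, 2, 2, 3]})
toy_test = pd.DataFrame({"item_id": [2, 3, 4, 5]})

# Items requested in the test set that never occur in training
missing = sorted(set(toy_test["item_id"]) - set(toy_train["item_id"]))

# Boolean mask version, handy for filtering or flagging test rows
is_new = ~toy_test["item_id"].isin(toy_train["item_id"])

print(missing, int(is_new.sum()))
```

The mask form can also be kept as a feature (for example a "new item" flag), since items without history tend to need different handling from items with lag features.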

### <a name="shopp"></a>Processing shop data

Let's look at all the shops now. Every `shop_name` is structured like this:
shop_name = city + kind of shop.

We extract the city from `shop_name` to add another feature to our dataset.

The first two rows show a city name starting with '!', so we will get rid of the '!'.

Index 46 gives us the shop name **Сергиев Посад ТЦ "7Я"**

I do not know Russian, but judging from other kernels the city should be treated as a single token, **СергиевПосад** rather than **Сергиев Посад**, because the city is extracted by splitting on the first space. So we will remove that extra space too.

In [None]:
shops

In [None]:
# getting rid of "!" before shop_names
shops['shop_name'] = shops['shop_name'].map(lambda x: x.split('!')[1] if x.startswith('!') else x)
shops['shop_name'] = shops["shop_name"].map(lambda x: 'СергиевПосад ТЦ "7Я"' if x == 'Сергиев Посад ТЦ "7Я"' else x)

Extracting the city names

With `city_code` we assign a unique integer label to each `city`

In [None]:
shops['city'] = shops['shop_name'].map(lambda x: x.split(" ")[0])
# lets assign code to these city names too
shops['city_code'] = shops['city'].factorize()[0]
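
To make the two steps concrete, here is a small sketch on made-up shop names that follow the same "city + shop type" pattern (the names and the frame `toy_shops` are assumptions for illustration):

```python
import pandas as pd

toy_shops = pd.DataFrame({"shop_name": ["Москва ТЦ A", "Москва ТЦ B", "Казань ТЦ C"]})

# The city is the first space-separated token of the shop name
toy_shops["city"] = toy_shops["shop_name"].str.split(" ").str[0]

# factorize assigns integer codes in order of first appearance
toy_shops["city_code"] = toy_shops["city"].factorize()[0]

print(toy_shops)
```

The codes are arbitrary labels, not an ordering, so they are best treated as categorical by the model.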

In [None]:
shops.head(2)

Let's add a few more features to our shop dataset:

``` 
"num_products"
"min_price"
"max_price"
"mean_price"
```

In [None]:
for shop_id in shops['shop_id'].unique():
    shops.loc[shop_id, 'num_products'] = train_dataset[train_dataset['shop_id'] == shop_id]['item_id'].nunique()
    shops.loc[shop_id, 'min_price'] = train_dataset[train_dataset['shop_id'] == shop_id]['item_price'].min()
    shops.loc[shop_id, 'max_price'] = train_dataset[train_dataset['shop_id'] == shop_id]['item_price'].max()
    shops.loc[shop_id, 'mean_price'] = train_dataset[train_dataset['shop_id'] == shop_id]['item_price'].mean()
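
The loop above filters the full sales table once per shop and statistic. As a hedged alternative, the same four statistics can be computed in one pass with `groupby` and merged back. The sketch below uses made-up rows (`toy_sales` and `toy_shops` are hypothetical):

```python
import pandas as pd

toy_sales = pd.DataFrame({
    "shop_id": [0, 0, 0, 1],
    "item_id": [10, 10, 11, 10],
    "item_price": [100.0, 120.0, 50.0, 80.0],
})
toy_shops = pd.DataFrame({"shop_id": [0, 1, 2]})

# One pass over the sales table: all statistics per shop
shop_stats = (
    toy_sales.groupby("shop_id")
    .agg(
        num_products=("item_id", "nunique"),
        min_price=("item_price", "min"),
        max_price=("item_price", "max"),
        mean_price=("item_price", "mean"),
    )
    .reset_index()
)

# Left merge keeps shops without any sales; their statistics become NaN
toy_shops = toy_shops.merge(shop_stats, on="shop_id", how="left")
print(toy_shops)
```

One difference to keep in mind: for a shop with no sales, the loop writes `num_products = 0`, while this version leaves NaN, so a `fillna` would be needed to match it exactly.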

In [None]:
shops.head(2)

### <a name="catdatap"></a>Processing Item Category data
The item category name is structured as follows:


- Item category name = type of the category + subtype

For example, the item category name **Служебные - Билеты** is translated as **Service - Tickets**,

where the `type` of this category is **Service** and the `subtype` is **Tickets** (what kind of service).

We will now add these new features to our dataset.

In [None]:
item_cat

In [None]:
cat_list = []
for name in item_cat['item_category_name']:
    cat_list.append(name.split('-'))

Creating a `split` column by splitting `item_category_name` at '-'

In [None]:
item_cat['split'] = (cat_list)
item_cat['cat_type'] = item_cat['split'].map(lambda x: x[0])
item_cat['cat_type_code'] = item_cat['cat_type'].factorize()[0]
item_cat['sub_cat_type'] = item_cat['split'].map(lambda x: x[1] if len(x)>1 else x[0])
item_cat['sub_cat_type_code'] = item_cat['sub_cat_type'].factorize()[0]

In [None]:
item_cat.head(2)

In [None]:
item_cat.drop('split', axis = 1, inplace=True)
item_cat.head(2)
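
Splitting on a bare '-' keeps the surrounding spaces and would also cut a subtype that itself contains a hyphen. A possible variant, shown on a few example names (the frame `toy_cat` is hypothetical, and whether this changes the real category codes is not checked here), splits once on `" - "` and strips whitespace:

```python
import pandas as pd

toy_cat = pd.DataFrame({
    "item_category_name": ["Служебные - Билеты", "Кино - Blu-Ray", "Доставка товара"]
})

# Split only at the first " - " so hyphens inside the subtype survive
parts = toy_cat["item_category_name"].map(lambda s: s.split(" - ", 1))

toy_cat["cat_type"] = parts.map(lambda p: p[0].strip())
# Names without a separator reuse the type as the subtype
toy_cat["sub_cat_type"] = parts.map(lambda p: p[1].strip() if len(p) > 1 else p[0].strip())

print(toy_cat)
```

With a plain `split('-')`, the second name would yield the subtype `' Blu'` instead of `'Blu-Ray'`.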

# <a name="final"></a>Preparing our final DF

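We will now prepare our `train` and `test` datasets.

First we aggregate the daily records into monthly totals per (`date_block_num`, `shop_id`, `item_id`) and merge in the item, shop, and category features. The split into training and validation sets happens later, at modeling time.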


In [None]:
train_dataset = train_dataset[train_dataset["item_cnt_day"]>0]
train_dataset = train_dataset[["month", "date_block_num", "shop_id", "item_id", "item_price", "item_cnt_day"]].groupby(
    ["date_block_num", "shop_id", "item_id"]).agg(
    {"item_price": "mean","item_cnt_day": "sum", "month": "min"}).reset_index()
train_dataset.rename(columns={"item_cnt_day": "item_cnt_month"}, inplace=True)
train_dataset = pd.merge(train_dataset, items, on="item_id", how="inner")
train_dataset = pd.merge(train_dataset, shops, on="shop_id", how="inner")
train_dataset = pd.merge(train_dataset, item_cat, on="item_category_id", how="inner")

In [None]:
train_dataset.head(2)

Let's drop the following columns, since they may not be necessary while building models.

In [None]:
train_dataset.drop(['item_name', 'shop_name', 'city', 'item_category_name', 'cat_type', 'sub_cat_type'], axis = 1, inplace=True)

In [None]:
train_dataset.head(1)

Let's take a look at our `test_dataset`.

In [None]:
test_dataset.head()

In [None]:
test_dataset.shape

As shown above, the shop and item IDs in `test` and `train_dataset` do not fully overlap, so let's keep only the training rows whose IDs also appear in `test_dataset`.

In [None]:
train_dataset.shape

In [None]:
train_dataset = train_dataset[train_dataset['shop_id'].isin(test_dataset['shop_id'].unique())]
train_dataset = train_dataset[train_dataset['item_id'].isin(test_dataset['item_id'].unique())]

The shape is reduced:

In [None]:
train_dataset.shape

In [None]:
train_dataset.head(2)

Since we want to predict sales for November 2015, i.e. `date_block_num = 34`, we have to add that month to our `test_dataset` to be able to make the prediction.

First let's create copies of the `train` and `test` datasets.

In [None]:
final_train_dataset = train_dataset.copy()
final_test_dataset = test_dataset.copy()

In [None]:
def data_preprocess(sales_train, test=None):
    indexlist = []
    for i in sales_train.date_block_num.unique():
        x = itertools.product(
            [i],
            sales_train.loc[sales_train.date_block_num == i].shop_id.unique(),
            sales_train.loc[sales_train.date_block_num == i].item_id.unique(),
        )
        indexlist.append(np.array(list(x)))
    df = pd.DataFrame(
        data=np.concatenate(indexlist, axis=0),
        columns=["date_block_num", "shop_id", "item_id"],
    )

    # Adding new revenue column
    sales_train["item_revenue_day"] = sales_train["item_price"] * sales_train["item_cnt_month"]
    # Aggregate item_id / shop_id item_cnts and revenue at the month level
    sales_train_grouped = sales_train.groupby(["date_block_num", "shop_id", "item_id"]).agg(
        item_cnt_month=pd.NamedAgg(column="item_cnt_month", aggfunc="sum"),
        item_revenue_month=pd.NamedAgg(column="item_revenue_day", aggfunc="sum"),
    )
    #print(sales_train_grouped)
    # Merge the grouped data with the index
    df = df.merge(
        sales_train_grouped, how="left", on=["date_block_num", "shop_id", "item_id"],
    )

    if test is not None:
        test["date_block_num"] = 34
        test["date_block_num"] = test["date_block_num"].astype(np.int8)
        test["shop_id"] = test.shop_id.astype(np.int8)
        test["item_id"] = test.item_id.astype(np.int16)
        test = test.drop(columns="ID")

        df = pd.concat([df, test[["date_block_num", "shop_id", "item_id"]]])

    # Fill empty item_cnt entries with 0
    df.item_cnt_month = df.item_cnt_month.fillna(0)
    df.item_revenue_month = df.item_revenue_month.fillna(0)

    return df

dataset_final = data_preprocess(final_train_dataset, final_test_dataset)
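
The core idea of `data_preprocess` is to build, for every month, the full grid of shops active that month times items sold that month, and to fill pairs without sales with 0. A minimal sketch of that step on made-up monthly totals (`toy_monthly` is hypothetical):

```python
import itertools
import pandas as pd

toy_monthly = pd.DataFrame({
    "date_block_num": [0, 0, 1],
    "shop_id": [1, 2, 1],
    "item_id": [10, 11, 10],
    "item_cnt_month": [3.0, 1.0, 2.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

rows = []
for month, block in toy_monthly.groupby("date_block_num"):
    # Every shop active this month paired with every item sold this month
    rows.extend(itertools.product([month], block["shop_id"].unique(), block["item_id"].unique()))

grid = pd.DataFrame(rows, columns=keys)

# Attach the observed totals; pairs with no sales get an explicit 0
grid = grid.merge(toy_monthly, on=keys, how="left")
grid["item_cnt_month"] = grid["item_cnt_month"].fillna(0)

print(grid)
```

The explicit zeros matter because the test set also asks about shop and item pairs that sold nothing, so a model trained only on observed sales would never see a zero target.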

In [None]:
dataset_final = pd.merge(dataset_final, items, on="item_id", how="inner")
dataset_final = pd.merge(dataset_final, shops, on="shop_id", how="inner")
dataset_final = pd.merge(dataset_final, item_cat, on="item_category_id", how="inner")
dataset_final.head(3)

In [None]:
dataset_final.drop(['item_name', 'shop_name', 'city', 'item_category_name', 'cat_type', 'sub_cat_type'], axis = 1, inplace=True)
dataset_final.head(2)

In [None]:
dataset_final.shape

Adding the lag feature for column names `item_cnt_month` and `item_revenue_month`

In [None]:
def lag_feature(matrix, lag_feature, lags):
    """Add columns holding `lag_feature` from `lag` months earlier for each (shop, item)."""
    for lag in lags:
        newname = lag_feature + f"_lag_{lag}"
        print(f"Adding feature {newname}")
        # Copy the slice so that shifting the month does not touch `matrix`
        targetseries = matrix.loc[:, ["date_block_num", "item_id", "shop_id"] + [lag_feature]].copy()
        # Shifting the month forward lines up month t's value with month t + lag
        targetseries["date_block_num"] += lag
        targetseries = targetseries.rename(columns={lag_feature: newname})
        matrix = matrix.merge(
            targetseries, on=["date_block_num", "item_id", "shop_id"], how="left"
        )
    return matrix

dataset_final = lag_feature(dataset_final, 'item_cnt_month', lags=[1,2,3])
dataset_final = lag_feature(dataset_final, 'item_revenue_month', lags=[1])
print("Lag features created..")
print(dataset_final.columns)
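
To see what the shift-and-merge in `lag_feature` does, here is a small sketch for a single shop and item over three months (the frame `toy` and its values are made up):

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 1, 2],
    "shop_id": [1, 1, 1],
    "item_id": [10, 10, 10],
    "item_cnt_month": [5.0, 7.0, 4.0],
})
keys = ["date_block_num", "shop_id", "item_id"]

shifted = toy[keys + ["item_cnt_month"]].copy()
shifted["date_block_num"] += 1  # month t's value becomes visible at month t + 1
shifted = shifted.rename(columns={"item_cnt_month": "item_cnt_month_lag_1"})

toy_lagged = toy.merge(shifted, on=keys, how="left")
print(toy_lagged)
```

The first month has no predecessor, so its lag is NaN; the notebook later fills such gaps with 0.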

Adding lagged mean features to the `DF`

In [None]:
mean = dataset_final.groupby(['date_block_num']).agg({'item_cnt_month': ['mean']})
mean.columns = ['avg_by_month_item_cnt']
mean = mean.reset_index()
dataset_final = pd.merge(dataset_final, mean, on=['date_block_num'], how='left')
del(mean)
#Lagging
dataset_final = lag_feature(dataset_final, "avg_by_month_item_cnt", [1,2,3,6])
dataset_final.drop(columns = ['avg_by_month_item_cnt'], axis = 1, inplace = True)
print("_____________________________________________________________________________")

#Let's make a mean by month / id
mean = dataset_final.groupby(['date_block_num', 'item_id']).agg({'item_cnt_month': ['mean']})
mean.columns = ['avg_by_month_item_id_item_cnt']
mean = mean.reset_index()
dataset_final = pd.merge(dataset_final, mean, on=['date_block_num', 'item_id'], how='left')
del(mean)

#Lagging
dataset_final = lag_feature( dataset_final, "avg_by_month_item_id_item_cnt" ,[1,2,3] )
dataset_final.drop(columns = ['avg_by_month_item_id_item_cnt'], axis = 1, inplace = True)
print("_____________________________________________________________________________")

#Now a mean by month / shop
mean = dataset_final.groupby(['date_block_num', 'shop_id']).agg({'item_cnt_month': ['mean']})
mean.columns = ['avg_by_month_shop_item_cnt']
mean = mean.reset_index()
dataset_final = pd.merge(dataset_final, mean, on=['date_block_num', 'shop_id'], how='left')
del(mean)

#Lagging
dataset_final = lag_feature( dataset_final, "avg_by_month_shop_item_cnt", [1,2,3])
dataset_final.drop(columns = ['avg_by_month_shop_item_cnt'], axis = 1, inplace = True)
print("_____________________________________________________________________________")

#Now a mean by month / city
mean = dataset_final.groupby(['date_block_num', 'city_code']).agg({'item_cnt_month': ['mean']})
mean.columns = ['avg_by_month_city_item_cnt']
mean = mean.reset_index()
dataset_final = pd.merge(dataset_final, mean, on=['date_block_num', 'city_code'], how='left')
del(mean)

#Lagging
dataset_final = lag_feature( dataset_final, "avg_by_month_city_item_cnt", [1])
dataset_final.drop(columns = ['avg_by_month_city_item_cnt'], axis = 1, inplace = True)
print("_____________________________________________________________________________")

#Now a mean by month / category
mean = dataset_final.groupby(['date_block_num', 'item_category_id']).agg({'item_cnt_month': ['mean']})
mean.columns = ['avg_by_month_cat_item_cnt']
mean = mean.reset_index()
dataset_final = pd.merge(dataset_final, mean, on=['date_block_num', 'item_category_id'], how='left')
del(mean)

#Lagging
dataset_final = lag_feature( dataset_final, "avg_by_month_cat_item_cnt" ,[1])
dataset_final.drop(columns = ['avg_by_month_cat_item_cnt'], axis = 1, inplace = True)
print("_____________________________________________________________________________")
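
Each block above computes a group mean of the target for the same month and then keeps only its lagged versions, since the same-month mean would contain the value being predicted. A compact sketch of that pattern on made-up data follows; it shifts the aggregated table directly instead of calling `lag_feature`, and the column name `avg_item_cnt_lag_1` is an assumption for illustration:

```python
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],
    "shop_id": [1, 2, 1, 2],
    "item_id": [10, 10, 10, 10],
    "item_cnt_month": [2.0, 4.0, 6.0, 2.0],
})

# Mean sales per (month, item) across shops
item_mean = (
    toy.groupby(["date_block_num", "item_id"])["item_cnt_month"]
    .mean()
    .rename("avg_item_cnt_lag_1")
    .reset_index()
)

# Shift by one month so that month t only sees the mean of month t - 1
item_mean["date_block_num"] += 1

toy = toy.merge(item_mean, on=["date_block_num", "item_id"], how="left")
print(toy)
```

Rows of month 1 receive the month 0 mean (3.0), not their own month's mean (4.0), which keeps the target out of its own features.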

In [None]:
dataset_final.fillna(0, inplace= True)
dataset_final.head(2)

-----------------------------------------------------------------------------------------------


Leaving out the sales records from 2013 (`date_block_num` < 12)

In [None]:
# Chain reset_index so that `matrix` is a fresh frame rather than a modified slice
matrix = dataset_final[dataset_final.date_block_num>=12].reset_index(drop=True)

In [None]:
matrix.head(2)

In [None]:
matrix.columns

# <a name="model"></a>Model Creation

[Reference kernel](https://github.com/angliu-bu/Kaggle-Predict-Future-Sales/blob/main/predict-future-sales.ipynb)

In [None]:
def fit_booster(
    X_train,
    y_train,
    X_test=None,
    y_test=None,
    params=None,
    test_run=False,
    categoricals=None,
    dropcols=None,
    early_stopping=True,
):
    if params is None:
        params = {"learning_rate": 0.1, "subsample_for_bin": 300000, "n_estimators": 50}

    # Log the metric every 100 rounds; optionally stop after 50 rounds without improvement.
    # Callbacks replace the `verbose` / `early_stopping_rounds` fit arguments,
    # which newer LightGBM releases no longer accept.
    callbacks = [lgbm.log_evaluation(period=100)]
    if early_stopping:
        callbacks.append(lgbm.early_stopping(stopping_rounds=50))

    if test_run:
        eval_set = [(X_train, y_train)]
    else:
        eval_set = [(X_train, y_train), (X_test, y_test)]

    booster = lgbm.LGBMRegressor(**params)

    # Only keep categorical features that are actually present in the data
    categoricals = [c for c in (categoricals or []) if c in X_train.columns]

    booster.fit(
        X_train,
        y_train,
        eval_set=eval_set,
        eval_metric=["rmse"],
        categorical_feature=categoricals,
        callbacks=callbacks,
    )

    return booster



keep_from_month = 2  # The first couple of months are dropped because of distortions to their features (e.g. wrong item age)
test_month = 33
# dropping this will reduce overfitting
dropcols = [
    "shop_id",
    "item_id"
] 
valid = matrix.drop(columns=dropcols).loc[matrix.date_block_num == test_month, :]
train__ = matrix.drop(columns=dropcols).loc[matrix.date_block_num < test_month, :]
train__ = train__[train__.date_block_num >= keep_from_month]
X_train = train__.drop(columns="item_cnt_month")
y_train = train__.item_cnt_month
X_valid = valid.drop(columns="item_cnt_month")
y_valid = valid.item_cnt_month



params = {
    "num_leaves": 966,
    "cat_smooth": 45.01680827234465,
    "min_child_samples": 27,
    "min_child_weight": 0.021144950289224463,
    "max_bin": 214,
    "learning_rate": 0.01,
    "subsample_for_bin": 300000,
    "min_data_in_bin": 7,
    "colsample_bytree": 0.8,
    "subsample": 0.6,
    "subsample_freq": 5,
    "n_estimators": 8000,
}

categoricals = [
    "item_category_id",
    "month",
]  # These features will be set as categorical features by LightGBM and handled differently

lgbooster = fit_booster(
    X_train,
    y_train,
    X_valid,
    y_valid,
    params=params,
    test_run=False,
    categoricals=categoricals,
)
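
The validation scheme holds out the last known month and scores with RMSE. As a rough sketch of that split and metric on made-up rows, using a naive "same as last month" baseline in place of the booster (the frame `toy` and the baseline are assumptions for illustration, not the model used here):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date_block_num": [31, 31, 32, 32, 33, 33],
    "item_cnt_month_lag_1": [0.0, 2.0, 1.0, 3.0, 2.0, 25.0],
    "item_cnt_month": [1.0, 3.0, 2.0, 25.0, 4.0, 30.0],
})

# Time-based split: train on earlier months, validate on the last one
valid_month = 33
train_part = toy[toy["date_block_num"] < valid_month]
valid_part = toy[toy["date_block_num"] == valid_month]

# Baseline prediction and target, both clipped to the [0, 20] range
pred = valid_part["item_cnt_month_lag_1"].clip(0, 20).to_numpy()
truth = valid_part["item_cnt_month"].clip(0, 20).to_numpy()

rmse = float(np.sqrt(np.mean((pred - truth) ** 2)))
print(round(rmse, 4))
```

A baseline like this gives a reference score: a fitted model should beat it on the held-out month before its predictions are worth submitting.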

# <a name="prediction"></a>Prediction

In [None]:
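# Clip the target to [0, 20]. The model above was already fit on the unclipped target,
# so this only affects `matrix` from here on.
matrix['item_cnt_month'] = matrix['item_cnt_month'].clip(0,20)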
keep_from_month = 2
test_month = 34
test__ = matrix.loc[matrix.date_block_num==test_month, :]
X_test = test__.drop(columns="item_cnt_month")
y_test = test__.item_cnt_month

X_test["item_cnt_month"] = lgbooster.predict(X_test.drop(columns=dropcols)).clip(0, 20)

In [None]:
X_test["item_cnt_month"] 

# <a name="submission"></a>Submission

In [None]:
testing = test_dataset.merge(
    X_test[["shop_id", "item_id", "item_cnt_month"]],
    on=["shop_id", "item_id"],
    how="inner",
    copy=True,
)
# Verify that the indices of the submission match the original
assert test_dataset.equals(testing[["ID", "shop_id", "item_id"]])
testing[["ID", "item_cnt_month"]].to_csv("./submission.csv", index=False)