### **Initialization**

I put these three lines at the top of every notebook. The first two enable the autoreload extension, so edited modules are reloaded automatically while I rework a project or problem, and the third renders plots inline in the notebook.

In [1]:
# Initialization.
# I put these three lines at the top of every notebook.
%reload_ext autoreload
%autoreload 2
%matplotlib inline

### **Importing the Dependencies**

In [2]:
# Importing all the necessary Libraries and Dependencies
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import re, math, graphviz, scipy
import seaborn as sns

# I will use XGboost in this Project because the Dataset has Timeseries Data.
from sklearn.linear_model import LogisticRegression
import xgboost as xgb
from xgboost import XGBRegressor
from xgboost import plot_importance

# I will also use the Fastai API in this Project for Data Preprocessing and Data Preparation
from pandas.api.types import is_string_dtype, is_numeric_dtype
from IPython.display import display
from sklearn.ensemble import forest
from sklearn_pandas import DataFrameMapper
from sklearn.preprocessing import LabelEncoder, StandardScaler
from scipy.cluster import hierarchy as hc
from plotnine import *
from sklearn import metrics
from concurrent.futures import ProcessPoolExecutor


### **Getting the Data**

I downloaded the Data from the **Kaggle** competition **Predict Future Sales**. I am working in Google Colab, so the way the Data is read may differ on other platforms.

In [3]:
# Loading the Data
# I am using Colab for this Project so accessing the Data might be different in different platforms.
path = "/content/drive/My Drive/Predict Future Sales"

# Creating the DataFrames using Pandas
transactions = pd.read_csv(os.path.join(path, "sales_train.csv.zip"))
items = pd.read_csv(os.path.join(path, "items.csv.zip"))
item_categories = pd.read_csv(os.path.join(path, "item_categories.csv"))
shops = pd.read_csv(os.path.join(path, "shops.csv"))
test = pd.read_csv(os.path.join(path, "test.csv.zip"))

### **Inspecting the Data**

Now I will take a quick look at each DataFrame defined above and walk through each step so you can see what the Data contains.

In [4]:
# Looking and Inspecting the Data
## Transactions DataFrame 
display(transactions.head(3)); 
transactions.shape

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day
0,02.01.2013,0,59,22154,999.0,1.0
1,03.01.2013,0,25,2552,899.0,1.0
2,05.01.2013,0,25,2552,899.0,-1.0


(2935849, 6)

The Transactions DataFrame is the training Dataset. It contains several columns, or features, and **item_cnt_day** is the target. The competition asks for monthly sales, so the daily counts have to be aggregated per month. The **date** column is stored as plain strings rather than datetime values, so it should be converted to datetime objects before working with it as **Time Series** Data.
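
To make the per-month conversion concrete, here is a minimal sketch on a toy DataFrame. The column names follow `sales_train.csv`, but the values are made up, and grouping only by `date_block_num`, `shop_id` and `item_id` is my assumption about what a monthly target looks like, not necessarily the exact grouping used later in this notebook.

```python
import pandas as pd

# Toy daily sales; values are invented for illustration only.
daily = pd.DataFrame({
    "date_block_num": [0, 0, 1, 1],        # consecutive month index
    "shop_id":        [25, 25, 25, 25],
    "item_id":        [2552, 2552, 2552, 2552],
    "item_cnt_day":   [1.0, -1.0, 2.0, 30.0],
})

# One row per (month, shop, item): sum the daily counts.
monthly = (
    daily.groupby(["date_block_num", "shop_id", "item_id"])["item_cnt_day"]
         .sum()
         .reset_index(name="item_cnt_month")
)

# Clip the monthly totals into the [0, 20] range.
monthly["item_cnt_month"] = monthly["item_cnt_month"].clip(0, 20)
print(monthly)
```

Month 0 nets out to zero because of the return (-1.0), and month 1 is capped at 20.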

In [5]:
## Items DataFrame
display(items.head(3)); 
items.shape

Unnamed: 0,item_name,item_id,item_category_id
0,! ВО ВЛАСТИ НАВАЖДЕНИЯ (ПЛАСТ.) D,0,40
1,!ABBYY FineReader 12 Professional Edition Full...,1,76
2,***В ЛУЧАХ СЛАВЫ (UNV) D,2,40


(22170, 3)

Similarly, the Items DataFrame contains each item's name, item id and item category id.

In [6]:
## Item Categories DataFrame
display(item_categories.head(3));
item_categories.shape

Unnamed: 0,item_category_name,item_category_id
0,PC - Гарнитуры/Наушники,0
1,Аксессуары - PS2,1
2,Аксессуары - PS3,2


(84, 2)

In [7]:
## Shops DataFrame
display(shops.head(3));
shops.shape

Unnamed: 0,shop_name,shop_id
0,"!Якутск Орджоникидзе, 56 фран",0
1,"!Якутск ТЦ ""Центральный"" фран",1
2,"Адыгея ТЦ ""Мега""",2


(60, 2)

In [8]:
# Test DataFrame
display(test.head());
test.shape

Unnamed: 0,ID,shop_id,item_id
0,0,5,5037
1,1,5,5320
2,2,5,5233
3,3,5,5232
4,4,5,5268


(214200, 3)

### **Preparing the DataFrame**

First, we should create one common DataFrame for training the Model. We can build it by merging all the DataFrames defined above except the Test DataFrame. While merging, I go through several Feature Engineering and Preprocessing steps that also make the Exploratory Data Analysis (EDA) of the Data easier.

In [9]:
# Merging the Transactions and Items DataFrame on "Item Id" column 
train = pd.merge(transactions, items, on="item_id", how="left")
train.tail()

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id
2935844,10.10.2015,33,25,7409,299.0,1.0,V/A Nu Jazz Selection (digipack),55
2935845,09.10.2015,33,25,7460,299.0,1.0,V/A The Golden Jazz Collection 1 2CD,55
2935846,14.10.2015,33,25,7459,349.0,1.0,V/A The Best Of The 3 Tenors,55
2935847,22.10.2015,33,25,7440,299.0,1.0,V/A Relax Collection Planet MP3 (mp3-CD) (jewel),57
2935848,03.10.2015,33,25,7460,299.0,1.0,V/A The Golden Jazz Collection 1 2CD,55


We could also use the join method to combine two DataFrames, but I prefer merge because it is the more general form in **Pandas**: it matches rows on named columns, so the shared key column appears only once and needs no suffix.
Two DataFrames are merged on their common column. Here I have merged Transactions and Items on the **item_id** column, and the remaining DataFrames are merged the same way.
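
The difference between merge and join can be seen on two toy DataFrames. This is only a sketch: the frames `left` and `right` and their values are hypothetical, and only the column names are borrowed from the competition Data.

```python
import pandas as pd

left = pd.DataFrame({"item_id": [1, 2, 3], "item_price": [10.0, 20.0, 30.0]})
right = pd.DataFrame({"item_id": [1, 2], "item_category_id": [40, 76]})

# merge matches rows on the named column; the key appears once in the result.
merged = pd.merge(left, right, on="item_id", how="left")
print(merged)

# join aligns on the index by default, so the shared "item_id" column
# collides unless a suffix is supplied.
join_failed = False
try:
    left.join(right)
except ValueError as err:
    join_failed = True
    print("join without suffixes failed:", err)

# join can give the same result once the right frame is indexed by the key.
joined = left.join(right.set_index("item_id"), on="item_id")
```

With `how="left"` every row of `left` is kept, and item 3, which has no match in `right`, gets a missing `item_category_id`.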

In [10]:
# Merging the Train, Item Categories and Shops DataFrame as well.
# Merging Train and Item Categories on "Item Category Id" column.
train_df = pd.merge(train, item_categories, on="item_category_id", how="left")
# Merging Train and Shops DataFrame on "Shop Id" column.
train_df = pd.merge(train_df, shops, on="shop_id", how="left")
train_df.head(10)

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name
0,02.01.2013,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир"""
1,03.01.2013,0,25,2552,899.0,1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум"""
2,05.01.2013,0,25,2552,899.0,-1.0,DEEP PURPLE The House Of Blue Light LP,58,Музыка - Винил,"Москва ТРК ""Атриум"""
3,06.01.2013,0,25,2554,1709.05,1.0,DEEP PURPLE Who Do You Think We Are LP,58,Музыка - Винил,"Москва ТРК ""Атриум"""
4,15.01.2013,0,25,2555,1099.0,1.0,DEEP PURPLE 30 Very Best Of 2CD (Фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум"""
5,10.01.2013,0,25,2564,349.0,1.0,DEEP PURPLE Perihelion: Live In Concert DVD (К...,59,Музыка - Музыкальное видео,"Москва ТРК ""Атриум"""
6,02.01.2013,0,25,2565,549.0,1.0,DEEP PURPLE Stormbringer (фирм.),56,Музыка - CD фирменного производства,"Москва ТРК ""Атриум"""
7,04.01.2013,0,25,2572,239.0,1.0,DEFTONES Koi No Yokan,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""
8,11.01.2013,0,25,2572,299.0,1.0,DEFTONES Koi No Yokan,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""
9,03.01.2013,0,25,2573,299.0,3.0,DEL REY LANA Born To Die,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""


**Preprocessing and Feature Engineering**

Now I convert the date column into datetime objects. The dates are stored as day.month.year strings (for example 02.01.2013), so I pass an explicit format argument; without it, Pandas may misread the day and month or raise a parsing error.

In [11]:
# Converting the date column to datetime objects
train_df["date"] = pd.to_datetime(train_df["date"], format="%d.%m.%Y")
train_df["date"].head()

0   2013-01-02
1   2013-01-03
2   2013-01-05
3   2013-01-06
4   2013-01-15
Name: date, dtype: datetime64[ns]
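
As a small check of what the format string does, the sketch below parses two made-up day.month.year strings and prints the day and month that Pandas extracts. The Series `raw` is hypothetical.

```python
import pandas as pd

# Two day.month.year strings in the same layout as the date column.
raw = pd.Series(["02.01.2013", "15.01.2013"])

# %d = day, %m = month, %Y = four-digit year, separated by dots.
parsed = pd.to_datetime(raw, format="%d.%m.%Y")

print(parsed.dt.day.tolist())    # days of the month
print(parsed.dt.month.tolist())  # month numbers
```

Both values land in January, which confirms that the first field is read as the day.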

In [12]:
# Working on Data Leakages
# Collecting the shops and items that appear in the Test DataFrame
test_shops = test["shop_id"].unique()
test_items = test["item_id"].unique()
# Keeping only the training rows for those shops and items
train_df = train_df[train_df["shop_id"].isin(test_shops)]
train_df = train_df[train_df["item_id"].isin(test_items)]
display(train_df.head()); train_df.shape

Unnamed: 0,date,date_block_num,shop_id,item_id,item_price,item_cnt_day,item_name,item_category_id,item_category_name,shop_name
0,2013-01-02,0,59,22154,999.0,1.0,ЯВЛЕНИЕ 2012 (BD),37,Кино - Blu-Ray,"Ярославль ТЦ ""Альтаир"""
10,2013-01-03,0,25,2574,399.0,2.0,DEL REY LANA Born To Die The Paradise Editio...,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""
11,2013-01-05,0,25,2574,399.0,1.0,DEL REY LANA Born To Die The Paradise Editio...,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""
12,2013-01-07,0,25,2574,399.0,1.0,DEL REY LANA Born To Die The Paradise Editio...,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""
13,2013-01-08,0,25,2574,399.0,2.0,DEL REY LANA Born To Die The Paradise Editio...,55,Музыка - CD локального производства,"Москва ТРК ""Атриум"""


(1224439, 10)

In [13]:
# Keeping only the Items whose price is greater than 0
train_df = train_df.query("item_price > 0")

In [14]:
# Creating the new feature which contains the Items sold in a particular month
# Item_cnt_day contains the number of Items sold
# Note: "date" is one of the grouping keys below, so the sums are taken per date, shop and item
train_df["item_cnt_day"] = train_df["item_cnt_day"].clip(0, 20)
train_df = train_df.groupby(["date", "item_category_id", "shop_id", "item_id", "date_block_num"])
train_df = train_df.agg({'item_cnt_day':"sum", 'item_price':"mean"}).reset_index()
train_df = train_df.rename(columns={"item_cnt_day":'item_cnt_month'})
# Using clip(0, 20) to meet the requirements of the Competition
train_df["item_cnt_month"] = train_df["item_cnt_month"].clip(0, 20)
train_df.head()

Unnamed: 0,date,item_category_id,shop_id,item_id,date_block_num,item_cnt_month,item_price
0,2013-01-01,2,15,5643,0,1.0,2390.0
1,2013-01-01,2,19,5572,0,1.0,1590.0
2,2013-01-01,2,28,5572,0,1.0,1590.0
3,2013-01-01,2,46,5643,0,1.0,2390.0
4,2013-01-01,5,7,5605,0,1.0,489.3


### **Working on DataFrame using Fastai API**

**Fastai Library or API**
- [Fast.ai](https://www.fast.ai/about/) is the first deep learning library to provide a single consistent interface to all the most commonly used deep learning applications for vision, text, tabular data, time series, and collaborative filtering.
- [Fast.ai](https://www.fast.ai/about/) is a deep learning library which provides practitioners with high-level components that can quickly and easily provide state-of-the-art results in standard deep learning domains, and provides researchers with low-level components that can be mixed and matched to build new approaches.

**Preparing the Model**
- I have used the [Fastai](https://www.fast.ai/about/) API to prepare the Data for the Model. The code can be hard to follow if you have never worked with the Fast.ai API before.
If you are new to Fastai, I recommend going through the [Fastai Documentation](https://docs.fast.ai/) first. If you are using Fastai in a Jupyter Notebook, you can call doc(function_name) to see the documentation instantly.

**Writing and Downloading the Dependencies**


*   These Functions are already defined by Fastai; I have copied and pasted them from the Fastai source. Fastai is Open Source, so anybody who understands the implementation can use them.



In [15]:
def proc_df(df, y_fld=None, skip_flds=None, ignore_flds=None, do_scale=False, na_dict=None, prepoc_fn=None, max_n_cat=None,
           subset=None, mapper=None):
    if not ignore_flds: ignore_flds=[]
    if not skip_flds: skip_flds=[]
    if subset:
        df = get_sample(df, subset)
    else:
        df = df.copy()
    ignored_flds = df.loc[:, ignore_flds]
    df.drop(ignore_flds, axis=1, inplace=True)
    if prepoc_fn: prepoc_fn(df)
    if y_fld is None: y=None
    else:
        if not is_numeric_dtype(df[y_fld]): df[y_fld] = pd.Categorical(df[y_fld]).codes
        y = df[y_fld].values
        skip_flds += [y_fld]
    df.drop(skip_flds, axis=1, inplace=True)
    
    if na_dict is None: na_dict = {}
    else: na_dict = na_dict.copy()
    na_dict_initial = na_dict.copy()
    for n, c in df.items(): na_dict = fix_missing(df, c, n, na_dict)
    if len(na_dict_initial.keys()) > 0:
        df.drop([a + '_na' for a in list(set(na_dict.keys()) - set(na_dict_initial.keys()))], axis=1, inplace=True)
    if do_scale: mapper = scale_vars(df, mapper)
    for n, c in df.items(): numericalize(df, c, n, max_n_cat)
    df = pd.get_dummies(df, dummy_na=True)
    df = pd.concat([ignored_flds, df], axis=1)
    res = [df, y, na_dict]
    if do_scale: res = res + [mapper]
    return res

In [16]:
def fix_missing(df, col, name, na_dict):
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict): 
            df[name + '_na'] = pd.isnull(col)
            filler = na_dict[name] if name in na_dict else col.median()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return na_dict
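
To see what fix_missing does, here is a minimal sketch on a toy DataFrame. The function body is repeated so the example runs on its own, and the frame `toy` with its values is hypothetical.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def fix_missing(df, col, name, na_dict):
    # Same logic as the cell above: fill numeric gaps with the median
    # and record which rows were originally missing.
    if is_numeric_dtype(col):
        if pd.isnull(col).sum() or (name in na_dict):
            df[name + '_na'] = pd.isnull(col)
            filler = na_dict[name] if name in na_dict else col.median()
            df[name] = col.fillna(filler)
            na_dict[name] = filler
    return na_dict

toy = pd.DataFrame({"item_price": [100.0, np.nan, 300.0], "shop_id": [1, 2, 3]})

na_dict = {}
# Iterate over a fixed list of names because fix_missing adds columns.
for name in list(toy.columns):
    na_dict = fix_missing(toy, toy[name], name, na_dict)

print(toy)
print(na_dict)
```

The missing price is replaced by the median of the other two values, a boolean `item_price_na` column marks the filled row, and `shop_id` is left alone because it has no gaps. The returned dictionary can be passed back in later so a validation or test set is filled with the same values.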

In [17]:
def numericalize(df, col, name, max_n_cat):
    if not is_numeric_dtype(col) and (max_n_cat is None or col.nunique()>max_n_cat):
        df[name] = col.cat.codes+1

def get_sample(df, n):
    idxs = sorted(np.random.permutation(len(df))[:n])
    return df.iloc[idxs].copy()

def set_rf_samples(n):
    # Draw n bootstrap indices per tree instead of n_samples
    forest._generate_sample_indices = (lambda rs, n_samples:
                                      forest.check_random_state(rs).randint(0, n_samples, n))

def reset_rf_samples():
    # Restore the default behaviour: n_samples bootstrap indices per tree
    forest._generate_sample_indices = (lambda rs, n_samples:
                                      forest.check_random_state(rs).randint(0, n_samples, n_samples))

In [18]:
def split_vals(a, n):
    return a[:n].copy(), a[n:].copy()

def train_cats(df):
    for n,c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()

def apply_cats(df, trn):
    for n, c in df.items():
        if trn[n].dtype.name == "category":
            df[n] = pd.Categorical(c, categories = trn[n].cat.categories, ordered = True)
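
The pairing of train_cats and apply_cats is easier to see on a toy example. This is a sketch only: the two functions are repeated so the block runs on its own, and the frames `trn` and `val` are hypothetical.

```python
import pandas as pd
from pandas.api.types import is_string_dtype

def train_cats(df):
    # Turn every string column into an ordered categorical.
    for n, c in df.items():
        if is_string_dtype(c): df[n] = c.astype('category').cat.as_ordered()

def apply_cats(df, trn):
    # Reuse the categories learned on the training frame.
    for n, c in df.items():
        if trn[n].dtype.name == "category":
            df[n] = pd.Categorical(c, categories=trn[n].cat.categories, ordered=True)

trn = pd.DataFrame({"shop_name": ["B", "A", "B"], "sales": [1, 2, 3]})
val = pd.DataFrame({"shop_name": ["A", "C"], "sales": [4, 5]})

train_cats(trn)       # learns the categories from the training frame
apply_cats(val, trn)  # applies the same mapping to the validation frame

print(list(trn["shop_name"].cat.categories))
print(val["shop_name"].cat.codes.tolist())
```

A value that never appeared in training ("C" here) becomes missing and gets the code -1, which numericalize then shifts to 0 by adding 1 to every code.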

In [19]:
def add_datepart(df, fldnames, drop=True, time=False, errors="raise"):
    if isinstance(fldnames, str):
        fldnames = [fldnames]
    for fldname in fldnames:
        fld = df[fldname]
        fld_dtype = fld.dtype
        if isinstance(fld_dtype, pd.core.dtypes.dtypes.DatetimeTZDtype):
            fld_dtype = np.datetime64
            
        if not np.issubdtype(fld_dtype, np.datetime64):
            df[fldname] = fld = pd.to_datetime(fld, infer_datetime_format=True, errors=errors)
        targ_pre = re.sub("[Dd]ate$", '', fldname)
        attr = ['Year', 'Month', 'Week', 'Day', 'Dayofweek', 'Dayofyear',
                'Is_month_end', 'Is_month_start', 'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start']
        if time: attr = attr + ['Hour', 'Minute', 'Second']
        for n in attr: df[targ_pre + n] = getattr(fld.dt, n.lower())
        df[targ_pre + 'Elasped'] = fld.astype(np.int64) // 10**9
        if drop: df.drop(fldname, axis=1, inplace=True)
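
One caveat when rerunning add_datepart today: the `.dt.week` accessor it relies on was removed in pandas 2.0. The sketch below is not the Fastai function, only an illustration, assuming a recent pandas, of how a few of the same date parts can be computed with `.dt.isocalendar()`. The Series `dates` and the frame `parts` are hypothetical.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(["02.01.2013", "31.01.2013"], format="%d.%m.%Y"))

parts = pd.DataFrame({
    "Year": dates.dt.year,
    "Month": dates.dt.month,
    # ISO week number; replaces the removed .dt.week accessor
    "Week": dates.dt.isocalendar().week.astype(int),
    "Dayofweek": dates.dt.dayofweek,          # Monday = 0
    "Is_month_end": dates.dt.is_month_end,
    # Whole seconds since 1970-01-01, independent of the datetime resolution
    "Elapsed": (dates - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1),
})
print(parts)
```

Dividing the time difference by a one-second Timedelta avoids assumptions about the underlying integer unit of the datetime column.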

In [20]:
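import warnings
import sklearn.exceptions

def scale_vars(df, mapper):
    # Silence sklearn's dtype conversion warnings while scaling
    warnings.filterwarnings("ignore", category = sklearn.exceptions.DataConversionWarning)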
    if mapper is None:
        map_f = [([n], StandardScaler()) for n in df.columns if is_numeric_dtype(df[n])]
        mapper = DataFrameMapper(map_f).fit(df)
    df[mapper.transformed_names_] = mapper.transform(df)
    return mapper

In [21]:
def rmse(x, y):
    return math.sqrt(((x-y)**2).mean())

def print_score(m):
    res = [rmse(m.predict(X_train), y_train),
          rmse(m.predict(X_valid), y_valid),
          m.score(X_train, y_train),
          m.score(X_valid, y_valid)]
    if hasattr(m, 'oob_score_'):
        res.append(m.oob_score_)
    print(res)
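
As a quick sanity check of the rmse helper, the sketch below evaluates it on two made-up arrays where the answer can be worked out by hand. The arrays `preds` and `actual` are hypothetical.

```python
import math
import numpy as np

def rmse(x, y):
    # Root of the mean squared difference, same as the cell above.
    return math.sqrt(((x - y) ** 2).mean())

preds = np.array([1.0, 2.0, 5.0])
actual = np.array([1.0, 2.0, 3.0])

# Squared errors are 0, 0 and 4, so the mean is 4/3.
score = rmse(preds, actual)
print(score)
```

Only one prediction is off, by 2, so the result is the square root of 4/3.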

In [22]:
# Using add_datepart function 
# This function is very useful while working on Time-Series Data
add_datepart(train_df, "date")
train_df.columns

Index(['item_category_id', 'shop_id', 'item_id', 'date_block_num',
       'item_cnt_month', 'item_price', 'Year', 'Month', 'Week', 'Day',
       'Dayofweek', 'Dayofyear', 'Is_month_end', 'Is_month_start',
       'Is_quarter_end', 'Is_quarter_start', 'Is_year_end', 'Is_year_start',
       'Elasped'],
      dtype='object')

In [23]:
# Observing the DataFrame again after applying API
train_df.head()

Unnamed: 0,item_category_id,shop_id,item_id,date_block_num,item_cnt_month,item_price,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start,Elasped
0,2,15,5643,0,1.0,2390.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
1,2,19,5572,0,1.0,1590.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
2,2,28,5572,0,1.0,1590.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
3,2,46,5643,0,1.0,2390.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
4,5,7,5605,0,1.0,489.3,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400


In [24]:
# Dealing with Categorical Features
train_cats(train_df)

In [25]:
# Checking for Null Values in DataFrame
train_df.isnull().sum().sort_index() / len(train_df)

Day                 0.0
Dayofweek           0.0
Dayofyear           0.0
Elasped             0.0
Is_month_end        0.0
Is_month_start      0.0
Is_quarter_end      0.0
Is_quarter_start    0.0
Is_year_end         0.0
Is_year_start       0.0
Month               0.0
Week                0.0
Year                0.0
date_block_num      0.0
item_category_id    0.0
item_cnt_month      0.0
item_id             0.0
item_price          0.0
shop_id             0.0
dtype: float64

In [26]:
os.makedirs("tmp", exist_ok=True)
train_df.to_feather("tmp/new")

### **Preparing the Model: XGBoost**

**Processing**

In [41]:
# Loading the Data and Going through simple Exploratory Data Analysis
data = pd.read_feather("tmp/new")
display(data.head(3));
data.shape

Unnamed: 0,item_category_id,shop_id,item_id,date_block_num,item_cnt_month,item_price,Year,Month,Week,Day,Dayofweek,Dayofyear,Is_month_end,Is_month_start,Is_quarter_end,Is_quarter_start,Is_year_end,Is_year_start,Elasped
0,2,15,5643,0,1.0,2390.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
1,2,19,5572,0,1.0,1590.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400
2,2,28,5572,0,1.0,1590.0,2013,1,1,1,1,1,False,True,False,True,False,True,1356998400


(1224427, 19)

In [42]:
data.describe()

Unnamed: 0,item_category_id,shop_id,item_id,date_block_num,item_cnt_month,item_price,Year,Month,Week,Day,Dayofweek,Dayofyear,Elasped
count,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0,1224427.0
mean,40.55885,32.15116,9614.839,19.35474,1.2876,1030.671,2014.144,6.628653,26.46565,16.12158,3.316003,186.5283,1409100000.0
std,18.60689,16.46563,6299.849,9.110718,1.360843,1827.392,0.7686042,3.470039,15.2283,8.912855,2.002864,106.7442,23957280.0
min,2.0,2.0,30.0,0.0,0.0,0.5,2013.0,1.0,1.0,1.0,0.0,1.0,1356998000.0
25%,25.0,19.0,4181.0,12.0,1.0,299.0,2014.0,4.0,13.0,8.0,2.0,92.0,1390003000.0
50%,38.0,31.0,7856.0,21.0,1.0,549.0,2014.0,7.0,27.0,16.0,4.0,191.0,1413072000.0
75%,55.0,46.0,15229.0,27.0,1.0,1199.0,2015.0,10.0,39.0,24.0,5.0,276.0,1428624000.0
max,83.0,59.0,22167.0,33.0,20.0,59200.0,2015.0,12.0,52.0,31.0,6.0,365.0,1446250000.0


In [43]:
new_df, y, nas = proc_df(data, "item_cnt_month")

In [44]:
# Preparing the Validation Data
n_valid = 200000
n_trn = len(data) - n_valid
raw_train, raw_valid = split_vals(data, n_trn)
X_train, X_valid = split_vals(new_df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

# Checking the Shape of Training and Validation Data
X_train.shape, X_valid.shape, y_train.shape, y_valid.shape

((1024427, 18), (200000, 18), (1024427,), (200000,))
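
split_vals simply cuts the rows at position n, so the validation set is the last `n_valid` rows. If the rows are ordered by date, as I assume here, that means validating on the most recent sales. A minimal sketch with a made-up array of month indices:

```python
import numpy as np

def split_vals(a, n):
    # First n rows for training, the rest for validation.
    return a[:n].copy(), a[n:].copy()

# Hypothetical month index per row, already sorted in time order.
months = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])

n_valid = 3
n_trn = len(months) - n_valid
trn, val = split_vals(months, n_trn)

print(trn)
print(val)
```

No validation row comes from an earlier month than any training row, which is the property a time-based holdout is meant to have.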

In [45]:
# Creating the Regressor Model
model = XGBRegressor(
    max_depth=8,
    n_estimators=1000,
    min_child_weight=300,
    colsample_bytree=0.8,
    subsample=0.8,
    eta=0.3, 
    seed=42
)

# Fitting the Model
model.fit(
    X_train,
    y_train,
    eval_metric="rmse",
    eval_set=[(X_train, y_train), (X_valid, y_valid)],
    verbose=True,
    early_stopping_rounds=10
)

[0]	validation_0-rmse:1.54233	validation_1-rmse:1.17585
Multiple eval metrics have been passed: 'validation_1-rmse' will be used for early stopping.

Will train until validation_1-rmse hasn't improved in 10 rounds.
[1]	validation_0-rmse:1.47928	validation_1-rmse:1.12619
[2]	validation_0-rmse:1.42583	validation_1-rmse:1.08627
[3]	validation_0-rmse:1.38284	validation_1-rmse:1.05557
[4]	validation_0-rmse:1.32813	validation_1-rmse:1.01093
[5]	validation_0-rmse:1.29566	validation_1-rmse:0.98887
[6]	validation_0-rmse:1.26779	validation_1-rmse:0.9724
[7]	validation_0-rmse:1.23144	validation_1-rmse:0.943823
[8]	validation_0-rmse:1.21181	validation_1-rmse:0.932509
[9]	validation_0-rmse:1.19429	validation_1-rmse:0.925067
[10]	validation_0-rmse:1.16802	validation_1-rmse:0.905811
[11]	validation_0-rmse:1.14662	validation_1-rmse:0.890634
[12]	validation_0-rmse:1.13671	validation_1-rmse:0.887368
[13]	validation_0-rmse:1.11955	validation_1-rmse:0.876297
[14]	validation_0-rmse:1.10431	validation_1-rms

XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
             colsample_bynode=1, colsample_bytree=0.8, eta=0.3, gamma=0,
             importance_type='gain', learning_rate=0.1, max_delta_step=0,
             max_depth=8, min_child_weight=300, missing=None, n_estimators=1000,
             n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
             reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=42,
             silent=None, subsample=0.8, verbosity=1)

**Preparing the Submission**

In [46]:
X_test = data[data["date_block_num"] == 33].drop(["item_cnt_month"], axis=1)

In [58]:
Y_test = model.predict(X_test)

In [56]:
submission = pd.DataFrame({
    "ID": test["ID"].iloc[:49531], 
    "item_cnt_month": Y_test.clip(0, 20)
})
submission.to_csv('xgb_submission.csv', index=False)