# **Introduction**

This is notebook of competition **"Kaggle Competition 2024 for Ukrainians" by Google.** [URL](https://www.kaggle.com/competitions/ml-competition-2024-for-ukrainians/overview)

The **goal** of the competition is to predict the value of sales by product item in different outlets.

The **approach** used here is an ensemble of Gradient Boosting (LightGBM) and Neural Networks with embeddings.

The evaluation metric of the model is **RMSLE:** 0.6988



# **Import libraries**

In [37]:
!pip install lightgbm



In [38]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Dense, Flatten, Concatenate
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from sklearn.metrics import mean_squared_error, r2_score, mean_squared_log_error
from sklearn.preprocessing import LabelEncoder, StandardScaler, RobustScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
import lightgbm as lgb
from google.colab import files
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#**1. Data Analysis**

In [39]:
df_train = pd.read_csv ('/content/drive/MyDrive/train1.csv')
df_test = pd.read_csv ('/content/drive/MyDrive/test.csv')

In [40]:
df_train.head()

Unnamed: 0,id,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,0,NCU06,17.6,Low Fat,0.024795,Household,231.101,OUT017,2007,Medium,Tier 2,Supermarket Type1,1760.43266
1,1,FDY26,20.5,Regular,0.102226,Dairy,212.6244,OUT017,2007,Medium,Tier 2,Supermarket Type1,101.2016
2,2,FDK21,18.35,Low Fat,0.092238,Snack Foods,250.1092,OUT013,1987,High,Tier 3,Supermarket Type1,2042.6155
3,3,NCN05,12.15,Low Fat,0.043942,Health and Hygiene,182.295,OUT049,1999,Medium,Tier 1,Supermarket Type1,3103.9596
4,4,FDA47,10.5,Regular,0.042967,Baking Goods,162.421,OUT035,2004,Small,Tier 2,Supermarket Type1,442.757


##**1.1 Data types checking and Error correction**

1. Checkind data types

In [41]:
df_train.dtypes

id                             int64
Item_Identifier               object
Item_Weight                  float64
Item_Fat_Content              object
Item_Visibility              float64
Item_Type                     object
Item_MRP                     float64
Outlet_Identifier             object
Outlet_Establishment_Year      int64
Outlet_Size                   object
Outlet_Location_Type          object
Outlet_Type                   object
Item_Outlet_Sales            float64
dtype: object

**Conclusion:** All data types are correct

2. Checking the categorical data

In [42]:
metadata_dict = {
    'Item_Fat_Content': [df_train['Item_Fat_Content'].unique()],
    'Item_Type': [df_train['Item_Type'].unique()],
    'Outlet_Identifier': [df_train['Outlet_Identifier'].unique()],
    'Outlet_Size': [df_train['Outlet_Size'].unique()],
    'Outlet_Location_Type': [df_train['Outlet_Location_Type'].unique()],
    'Outlet_Type': [df_train['Outlet_Type'].unique()]
}

df_metadata = pd.DataFrame(metadata_dict)

df_metadata_transposed = df_metadata.T

df_metadata_transposed.columns = ['Unique_Values']

df_metadata_transposed.head()

Unnamed: 0,Unique_Values
Item_Fat_Content,"[Low Fat, Regular, reg, LF, low fat]"
Item_Type,"[Household, Dairy, Snack Foods, Health and Hyg..."
Outlet_Identifier,"[OUT017, OUT013, OUT049, OUT035, OUT045, OUT01..."
Outlet_Size,"[Medium, High, Small]"
Outlet_Location_Type,"[Tier 2, Tier 3, Tier 1]"


There are some **mistakes in the dataset** that's why the types of data in **Item_Fat_Content** will be changed to:

*1. 'reg' to 'Regular';*

*2. 'LF' to 'Low Fat';*

*3. 'low fat' to 'Low Fat'*

In [43]:
df_train.Item_Fat_Content = df_train.Item_Fat_Content.replace ({'reg': 'Regular',
                                                                'LF': 'Low Fat',
                                                                'low fat': 'Low Fat'
                                                                })

In [44]:
metadata_dict = {
    'Item_Fat_Content': [df_train['Item_Fat_Content'].unique()],
    'Item_Type': [df_train['Item_Type'].unique()],
    'Outlet_Identifier': [df_train['Outlet_Identifier'].unique()],
    'Outlet_Size': [df_train['Outlet_Size'].unique()],
    'Outlet_Location_Type': [df_train['Outlet_Location_Type'].unique()],
    'Outlet_Type': [df_train['Outlet_Type'].unique()]
}

df_metadata = pd.DataFrame(metadata_dict)

df_metadata_transposed = df_metadata.T

df_metadata_transposed.columns = ['Unique_Values']

df_metadata_transposed.head()

Unnamed: 0,Unique_Values
Item_Fat_Content,"[Low Fat, Regular]"
Item_Type,"[Household, Dairy, Snack Foods, Health and Hyg..."
Outlet_Identifier,"[OUT017, OUT013, OUT049, OUT035, OUT045, OUT01..."
Outlet_Size,"[Medium, High, Small]"
Outlet_Location_Type,"[Tier 2, Tier 3, Tier 1]"


3. Cheking the missing data

In [45]:
df_train.isnull().sum()

id                           0
Item_Identifier              0
Item_Weight                  0
Item_Fat_Content             0
Item_Visibility              0
Item_Type                    0
Item_MRP                     0
Outlet_Identifier            0
Outlet_Establishment_Year    0
Outlet_Size                  0
Outlet_Location_Type         0
Outlet_Type                  0
Item_Outlet_Sales            0
dtype: int64

There are no missing data

##**1.2 Statistical analysis and preprocessing**

###**1.2.1 Descriptive Statistics**

In [46]:
# Descreptive statistics
describe_table = df_train [['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].describe()

# Calculation of coefficient of variation
cv = df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].std() / df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].mean()

# Calculation of Skewness
skewness = df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].skew()

# Calculation of Kurtosis
kurtosis = df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].kurtosis()

# Calculation of Moda
mode = df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].mode().iloc[0]

# Calculation of Range
range = df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].max() - df_train[['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Outlet_Sales']].min()

In [47]:
descriptive_stat = {
    "mode": mode,
    "range": range,
    "cv": cv,
    "skewness": skewness,
    "kurtosis": kurtosis\
}
descriptive_stat_table = pd.DataFrame (descriptive_stat).T
descriptive_stat_table = pd.concat([describe_table, descriptive_stat_table])
descriptive_stat_table.head (len(descriptive_stat_table))

Unnamed: 0,Item_Weight,Item_Visibility,Item_MRP,Item_Outlet_Sales
count,378428.0,378428.0,378428.0,378428.0
mean,12.800922,0.054546,137.761605,2125.058867
std,4.618353,0.046882,60.978569,1667.612362
min,4.555,0.0,31.29,33.29
25%,8.775,0.017434,92.9462,965.36448
50%,12.5,0.044917,131.0626,1751.82718
75%,16.75,0.081419,182.2634,2877.30125
max,30.0,0.328391,266.8884,31224.72695
mode,17.6,0.0,178.237,1342.2528
range,25.445,0.328391,235.5984,31191.43695


**Conclusion**:
1. ***'Item_Weight', 'Item_MRP'*** have normal disribution, therefore, the normal approach is to use StandardScaler or RobustScaler. Model evaluation showed that the approach of RobustScaler give us better results
2. ***Item_Visibility.*** Logarithm of data is often used to normalize data with a highly skewed distribution. This can also reduce the impact of emissions.

###**1.2.2 Data preprocessing**

In [48]:
scaler = RobustScaler()
df_train [['Item_Weight', 'Item_MRP']] = scaler.fit_transform (df_train [['Item_Weight', 'Item_MRP']])
df_train [['Item_Visibility']] = np.log1p (df_train [['Item_Visibility']])

In [49]:
df_train.Item_Fat_Content = df_train.Item_Fat_Content.replace ({'Low Fat': 0,
                                                                'Regular': 1
                                                                })
df_train.Outlet_Size = df_train.Outlet_Size.replace ({'Small': 1,
                                                                'Medium': 2,
                                                                'High': 3
                                                                })
df_train = df_train.drop ('id', axis =1)

In [50]:
df_train.head()

Unnamed: 0,Item_Identifier,Item_Weight,Item_Fat_Content,Item_Visibility,Item_Type,Item_MRP,Outlet_Identifier,Outlet_Establishment_Year,Outlet_Size,Outlet_Location_Type,Outlet_Type,Item_Outlet_Sales
0,NCU06,0.639498,0,0.024493,Household,1.120035,OUT017,2007,2,Tier 2,Supermarket Type1,1760.43266
1,FDY26,1.003135,1,0.097332,Dairy,0.91317,OUT017,2007,2,Tier 2,Supermarket Type1,101.2016
2,FDK21,0.733542,0,0.088229,Snack Foods,1.332852,OUT013,1987,3,Tier 3,Supermarket Type1,2042.6155
3,NCN05,-0.043887,0,0.043004,Health and Hygiene,0.573601,OUT049,1999,2,Tier 1,Supermarket Type1,3103.9596
4,FDA47,-0.250784,1,0.042069,Baking Goods,0.35109,OUT035,2004,1,Tier 2,Supermarket Type1,442.757


In [51]:
df_train.shape

(378428, 12)

In [52]:
df_train_a = df_train
df_train_b = df_train

#**2. Modeling**

The approach presented here is creating the ensemble of embedding and gradient boosting (LightGBM)

##**2.1 Gradient boosting (LightGBM)**

In [17]:
n_splits = 20
rs = ShuffleSplit(n_splits=n_splits, test_size=0.2, random_state=42)

In [18]:
# Split data for train and validation datasets
X_train_lgb, X_test_lgb, y_train_lgb, y_test_lgb = train_test_split(df_train.drop(['Item_Outlet_Sales'], axis=1), df_train.Item_Outlet_Sales, test_size=0.2, random_state=42)
categorical_features = ['Item_Identifier', 'Item_Type', 'Outlet_Identifier', 'Outlet_Location_Type', 'Outlet_Type']
for feature in categorical_features:
    X_train_lgb[feature] = X_train_lgb[feature].astype('category')
    X_test_lgb[feature] = X_test_lgb[feature].astype('category')

In [19]:
params = {
    'boosting_type': 'dart',
    'objective': 'regression',
    'metric': 'rmse',
    'num_leaves': 140,
    'learning_rate': 0.05,
    'feature_fraction': 1,
    'min_data_in_leaf': 150,
    'min_gain_to_split': 50,
    'max_drop': 50
}

In [20]:
models = []
predictions = np.zeros(len(X_test_lgb))

In [21]:
for train_index, test_index in rs.split(X_train_lgb):
    X_train_split, X_valid_split = X_train_lgb.iloc[train_index], X_train_lgb.iloc[test_index]
    y_train_split, y_valid_split = y_train_lgb.iloc[train_index], y_train_lgb.iloc[test_index]

    train_data_split = lgb.Dataset(X_train_split, label=y_train_split, categorical_feature=categorical_features)
    valid_data_split = lgb.Dataset(X_valid_split, label=y_valid_split, categorical_feature=categorical_features)

    model_split = lgb.train(
        params,
        train_data_split,
        num_boost_round=146,
        valid_sets=[train_data_split, valid_data_split],
    )

    models.append(model_split)
    predictions += model_split.predict(X_test_lgb)

predictions /= n_splits

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.070811 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2325
[LightGBM] [Info] Number of data points in the train set: 242193, number of used features: 11
[LightGBM] [Info] Start training from score 2125.978066
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.015433 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 2323
[LightGBM] [Info] Number of data points in the train set: 242193, number of used features: 11
[LightGBM] [Info] Start training from score 2127.040573
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.090700 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] 

In [22]:
rmse = mean_squared_error(y_test_lgb, predictions, squared=False)
print(f'RMSE: {rmse}')
rmsle = np.sqrt(mean_squared_log_error(y_test_lgb, predictions))
print(f'RMSLE: {rmsle}')

RMSE: 1499.7825120579714
RMSLE: 0.7082043157019053


##**2.2 Neural network and embeddings**


In [23]:
numerical_features = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Fat_Content', 'Outlet_Establishment_Year', 'Outlet_Size']
categorical_features = ['Item_Identifier', 'Outlet_Identifier', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Type']

In [24]:
def rmsle(y_true, y_pred):
    y_true = K.clip(y_true, K.epsilon(), None)
    y_pred = K.clip(y_pred, K.epsilon(), None)
    return K.sqrt(K.mean(K.square(K.log(y_pred + 1) - K.log(y_true + 1))))

In [25]:
# Splitting the data
X = df_train.drop(columns=['Item_Outlet_Sales'])
y = df_train['Item_Outlet_Sales']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

Embedding

In [26]:
# Label encoding categorical features
label_encoders = {}
for cat_col in categorical_features:
    le = LabelEncoder()
    X_train[cat_col] = le.fit_transform(X_train[cat_col])
    X_val[cat_col] = le.transform(X_val[cat_col])  # Transform validation set with training set encoders
    label_encoders[cat_col] = le

# Model inputs and embeddings for categorical features
inputs = []
embeddings = []
for cat_col in categorical_features:
    input_cat = Input(shape=(1,))
    vocab_size = X_train[cat_col].nunique()
    emb_dim = min(50, vocab_size // 2)
    embedding = Embedding(input_dim=vocab_size, output_dim=emb_dim)(input_cat)
    embedding = Flatten()(embedding)
    inputs.append(input_cat)
    embeddings.append(embedding)

# Input for numerical features
input_num = Input(shape=(len(numerical_features),))
inputs.append(input_num)
embeddings.append(input_num)

# Preparing the training and validation data
X_train_list = [X_train[cat_col].values for cat_col in categorical_features] + [X_train[numerical_features].values]
y_train = y_train.values
X_val_list = [X_val[cat_col].values for cat_col in categorical_features] + [X_val[numerical_features].values]
y_val = y_val.values

Callbacks

In [27]:
checkpoint_callback = ModelCheckpoint(
    'best_model.h5',
    monitor='val_loss',
    save_best_only=True,
    mode='min',
    verbose=1
)
early_stopping_callback = EarlyStopping(
    monitor='val_loss',
    patience=7,
    mode='min',
    verbose=1,
    restore_best_weights=True
)

In [28]:
# Model
x = Concatenate()(embeddings)
x = Dense(128, activation='relu')(x)
x = Dense(64, activation='relu')(x)
x = Dense(32, activation='relu')(x)
output = Dense(1)(x)
model = Model(inputs, output)

In [29]:
model.compile(optimizer='adam', loss=rmsle)

In [30]:
history = model.fit(
    X_train_list, y_train,
    validation_data=(X_val_list, y_val),
    epochs=30,
    batch_size=32,
    callbacks=[checkpoint_callback, early_stopping_callback]
)

Epoch 1/30
Epoch 1: val_loss improved from inf to 0.71133, saving model to best_model.h5
Epoch 2/30
  24/9461 [..............................] - ETA: 42s - loss: 0.7574

  saving_api.save_model(


Epoch 2: val_loss did not improve from 0.71133
Epoch 3/30
Epoch 3: val_loss improved from 0.71133 to 0.70348, saving model to best_model.h5
Epoch 4/30
Epoch 4: val_loss improved from 0.70348 to 0.69760, saving model to best_model.h5
Epoch 5/30
Epoch 5: val_loss did not improve from 0.69760
Epoch 6/30
Epoch 6: val_loss did not improve from 0.69760
Epoch 7/30
Epoch 7: val_loss improved from 0.69760 to 0.69692, saving model to best_model.h5
Epoch 8/30
Epoch 8: val_loss did not improve from 0.69692
Epoch 9/30
Epoch 9: val_loss did not improve from 0.69692
Epoch 10/30
Epoch 10: val_loss did not improve from 0.69692
Epoch 11/30
Epoch 11: val_loss improved from 0.69692 to 0.69370, saving model to best_model.h5
Epoch 12/30
Epoch 12: val_loss did not improve from 0.69370
Epoch 13/30
Epoch 13: val_loss did not improve from 0.69370
Epoch 14/30
Epoch 14: val_loss did not improve from 0.69370
Epoch 15/30
Epoch 15: val_loss did not improve from 0.69370
Epoch 16/30
Epoch 16: val_loss did not improve 

##**2.3 Ensemble models: Gradient Boosting (LightGBM) and Neural Network**


###**2.3.1 Prediction for Gradient Boosting (LightGBM)**


In [31]:
# data preparation
y_boost = df_train_a.Item_Outlet_Sales

df_train_a = df_train_a.drop (['Item_Outlet_Sales'] , axis =1)
categorical_features = ['Item_Identifier', 'Item_Type', 'Outlet_Identifier', 'Outlet_Location_Type', 'Outlet_Type']
for feature in categorical_features:
    X_train_lgb[feature] = X_train_lgb[feature].astype('category')
    X_test_lgb[feature] = X_test_lgb[feature].astype('category')

categorical_features = ['Item_Identifier', 'Item_Type', 'Outlet_Identifier', 'Outlet_Location_Type', 'Outlet_Type']
test_data_lgb = lgb.Dataset(df_train_a, label=y_boost, categorical_feature=categorical_features)
for feature in categorical_features:
    df_train_a[feature] = df_train_a[feature].astype('category')

In [32]:
def predict_with_ensemble(models, X):
    predictions = np.zeros(len(X))
    for model in models:
        predictions += model.predict(X)
    predictions /= len(models)
    return predictions

In [33]:
y_pred_lgb_ful = predict_with_ensemble(models, df_train_a)

In [34]:
rmsle = np.sqrt(mean_squared_log_error(y_boost, np.clip(y_pred_lgb_ful, 0, None)))
print(f'RMSLE: {rmsle}')

RMSLE: 0.7006820998639279


###**2.3.2 Prediction for Neural Network**


In [61]:
# data preparation
df_train_b = df_train_b.drop (['Item_Outlet_Sales'] , axis =1)

numerical_features = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Fat_Content', 'Outlet_Establishment_Year', 'Outlet_Size']
categorical_features = ['Item_Identifier', 'Outlet_Identifier', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Type']
label_encoders = {}
for cat_col in categorical_features:
    le = LabelEncoder()
    df_train_b[cat_col] = le.fit_transform(df_train_b[cat_col])
    label_encoders[cat_col] = le
inputs = []
embeddings = []
for cat_col in categorical_features:
    input_cat = Input(shape=(1,))
    vocab_size = df_train_b[cat_col].nunique()
    emb_dim = min(50, vocab_size // 2)
    embedding = Embedding(input_dim=vocab_size, output_dim=emb_dim)(input_cat)
    embedding = Flatten()(embedding)
    inputs.append(input_cat)
    embeddings.append(embedding)
input_num = Input(shape=(len(numerical_features),))
inputs.append(input_num)
embeddings.append(input_num)
df_train_b = [df_train_b[cat_col].values for cat_col in categorical_features] + [df_train_b[numerical_features].values]

In [63]:
predictions = model.predict(df_train_b)



###**2.3.3 Ensemble models**


In [64]:
predictions = predictions.flatten()
y_pred_lgb_ful1 = y_pred_lgb_ful.flatten()
df_lin = pd.DataFrame({
    'nn_prediction': predictions,
    'lgb_prediction': y_pred_lgb_ful1,
    'y': y_boost
})
df_lin.head()

Unnamed: 0,nn_prediction,lgb_prediction,y
0,3172.796143,2802.70175,1760.43266
1,2741.27002,2558.737798,101.2016
2,3130.75415,3505.348206,2042.6155
3,2222.503174,2333.273818,3103.9596
4,2097.512451,1845.751032,442.757


Determining the best ratio of models (best_a)

In [65]:
X = df_lin[['nn_prediction', 'lgb_prediction']]
y = df_lin['y']
a_values = np.arange(0.05, 1.05, 0.05)

best_a = None
best_rmsle = float('inf')

for a in a_values:
    y_pred_train = df_lin.lgb_prediction * a + df_lin.nn_prediction * (1 - a)
    rmsle = np.sqrt(mean_squared_log_error(y_boost, np.clip(y_pred_train, 0, None)))
    print(f'a: {a}, RMSLE: {rmsle}')

    if rmsle < best_rmsle:
        best_rmsle = rmsle
        best_a = a

print(f'Best a: {best_a}')
print(f'Min RMSLE: {best_rmsle}')

a: 0.05, RMSLE: 0.7023379366008731
a: 0.1, RMSLE: 0.7017141516650867
a: 0.15000000000000002, RMSLE: 0.7011551399175346
a: 0.2, RMSLE: 0.7006592266538485
a: 0.25, RMSLE: 0.7002249940506852
a: 0.3, RMSLE: 0.6998512436998849
a: 0.35000000000000003, RMSLE: 0.6995369735075875
a: 0.4, RMSLE: 0.699281359097182
a: 0.45, RMSLE: 0.6990837341121027
a: 0.5, RMSLE: 0.6989435839364452
a: 0.55, RMSLE: 0.6988605319310459
a: 0.6000000000000001, RMSLE: 0.6988343329016921
a: 0.6500000000000001, RMSLE: 0.6988648661376343
a: 0.7000000000000001, RMSLE: 0.6989521333527153
a: 0.7500000000000001, RMSLE: 0.6990962539296508
a: 0.8, RMSLE: 0.6992974650808835
a: 0.8500000000000001, RMSLE: 0.6995561210730935
a: 0.9000000000000001, RMSLE: 0.6998726949361709
a: 0.9500000000000001, RMSLE: 0.7002477811143505
a: 1.0, RMSLE: 0.7006820998639279
Best a: 0.6000000000000001
Min RMSLE: 0.6988343329016921


#**3. Prediction**

In [95]:
df_test = pd.read_csv ('/content/drive/MyDrive/test.csv')
df_test_b = df_test.id

In [84]:
# data preprocessing for Gradient Boosting (LightGBM) model
df_test.Item_Fat_Content = df_test.Item_Fat_Content.replace ({'reg': 'Regular',
                                                                'LF': 'Low Fat',
                                                                'low fat': 'Low Fat'
                                                                })
metadata_dict = {
    'Item_Fat_Content': [df_test['Item_Fat_Content'].unique()],
    'Item_Type': [df_test['Item_Type'].unique()],
    'Outlet_Identifier': [df_test['Outlet_Identifier'].unique()],
    'Outlet_Size': [df_test['Outlet_Size'].unique()],
    'Outlet_Location_Type': [df_test['Outlet_Location_Type'].unique()],
    'Outlet_Type': [df_test['Outlet_Type'].unique()]
}

df_metadata = pd.DataFrame(metadata_dict)
df_metadata_transposed = df_metadata.T
df_metadata_transposed.columns = ['Unique_Values']
df_metadata_transposed.head()

scaler = RobustScaler()
df_test [['Item_Weight', 'Item_MRP']] = scaler.fit_transform (df_test [['Item_Weight', 'Item_MRP']])
df_test [['Item_Visibility']] = np.log1p (df_test [['Item_Visibility']])

df_test.Item_Fat_Content = df_test.Item_Fat_Content.replace ({'Low Fat': 0,
                                                                'Regular': 1
                                                                })
df_test.Outlet_Size = df_test.Outlet_Size.replace ({'Small': 1,
                                                                'Medium': 2,
                                                                'High': 3
                                                                })
df_test = df_test.drop (['id'] , axis =1)
df_test_a = df_test

df_test['Item_Identifier'] = df_test['Item_Identifier'].astype('category')
df_test['Item_Type'] = df_test['Item_Type'].astype('category')
df_test['Outlet_Identifier'] = df_test['Outlet_Identifier'].astype('category')
df_test['Outlet_Location_Type'] = df_test['Outlet_Location_Type'].astype('category')
df_test['Outlet_Type'] = df_test['Outlet_Type'].astype('category')

categorical_features = ['Item_Identifier', 'Item_Type', 'Outlet_Identifier', 'Outlet_Location_Type', 'Outlet_Type']

In [85]:
# Prediction for Gradient Boosting (LightGBM) model
y_pred_lgb_ful = predict_with_ensemble(models, df_test)

In [88]:
# data preprocessing for for Neural Network
numerical_features = ['Item_Weight', 'Item_Visibility', 'Item_MRP', 'Item_Fat_Content', 'Outlet_Establishment_Year', 'Outlet_Size']
categorical_features = ['Item_Identifier', 'Outlet_Identifier', 'Item_Type', 'Outlet_Location_Type', 'Outlet_Type']
label_encoders = {}
for cat_col in categorical_features:
    le = LabelEncoder()
    df_test_a[cat_col] = le.fit_transform(df_test_a[cat_col])
    label_encoders[cat_col] = le
inputs = []
embeddings = []
for cat_col in categorical_features:
    input_cat = Input(shape=(1,))
    vocab_size = df_test_a[cat_col].nunique()
    emb_dim = min(50, vocab_size // 2)
    embedding = Embedding(input_dim=vocab_size, output_dim=emb_dim)(input_cat)
    embedding = Flatten()(embedding)
    inputs.append(input_cat)
    embeddings.append(embedding)
input_num = Input(shape=(len(numerical_features),))
inputs.append(input_num)
embeddings.append(input_num)
df_test_a = [df_test_a[cat_col].values for cat_col in categorical_features] + [df_test_a[numerical_features].values]

In [89]:
# Prediction for for Neural Network
predictions_nn = model.predict(df_test_a)



In [90]:
predictions = predictions_nn.flatten()
y_pred_lgb_ful = y_pred_lgb_ful.flatten()
df_lin = pd.DataFrame({
    'nn_prediction': predictions,
    'lgb_prediction': y_pred_lgb_ful,
})
df_lin.head()

Unnamed: 0,nn_prediction,lgb_prediction
0,3121.76123,3143.814175
1,2764.2146,2735.618332
2,1794.382324,1882.765695
3,1720.179199,1574.057362
4,1540.211304,1686.447238


In [91]:
predictions = df_lin.lgb_prediction*best_a + df_lin.nn_prediction * (1 - best_a)
predictions

0         3134.992973
1         2747.056912
2         1847.412347
3         1632.506133
4         1627.952876
             ...     
252281     986.022908
252282    1293.068066
252283    1595.113728
252284    1231.955693
252285    1695.752550
Length: 252286, dtype: float64

In [96]:
df_output = pd.DataFrame({
    'id': df_test_b,
    'Item_Outlet_Sales': predictions
})

In [97]:
df_output.head()

Unnamed: 0,id,Item_Outlet_Sales
0,378428,3134.992973
1,378429,2747.056912
2,378430,1847.412347
3,378431,1632.506133
4,378432,1627.952876


In [None]:
df_output.to_csv('predictions.csv', index=False)

In [None]:
files.download('predictions.csv')