**Merchandise Popularity Prediction Challenge**  --- **`AMEYA PATIL`**

Overview
Big brands spend a significant amount on popularizing a product. Nevertheless, those efforts often go in vain when establishing the merchandise in the hyperlocal market. Depending on geographical conditions, the same attributes can convey very different information about the customer. Hence, such insights are a must for any brand owner.

This competition uses data gathered from one of the top apparel brands in India. Given details concerning category, score, and presence in the store, participants are challenged to predict the popularity level of the merchandise.

The popularity class indicates how popular a product is, given attributes that a store owner can control.

**Dataset Description:**

Train.csv - 18208 rows x 12 columns (Includes popularity Column as Target variable)\
Test.csv - 12140 rows x 11 columns\
Sample Submission.csv - Please check the Evaluation section for more details on how to generate a valid submission\
 
**Attributes:**
* store_ratio
* basket_ratio
* category_1 
* store_score
* category_2
* store_presence
* score_1
* score_2 
* score_3
* score_4
* time
* popularity - Class of popularity (Target Column)

# Import Libraries and Name Submission File

In [1]:
# Import Necessary Libraries
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, VotingClassifier
from catboost import CatBoostClassifier
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# Ignore Warnings 
import warnings
warnings.filterwarnings('ignore')

# Random State
seed = 42
np.random.seed(seed)

# Create Submission File -- Name for CSV File 
subfile_name = 'A_7feb_E03.csv'

# Read & Data Preprocessing

In [2]:
train  = pd.read_csv('train.csv')
test   = pd.read_csv('test.csv')

In [3]:
train  = train.drop_duplicates().reset_index(drop=True) # Drop Duplicates 

In [4]:
X = train.drop(['popularity'], axis = 1)
y = train['popularity']

## One Hot Encode and Rescale

In [5]:
# cols = ['Category_1','Category_2']  # Score improved when Category_1 was left without one-hot encoding
cols = ['Category_2']
X_scaled = pd.get_dummies(X, columns=cols)
test_scaled = pd.get_dummies(test , columns=cols)
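
Encoding train and test separately only yields matching columns when both files contain the same category levels. The following is a minimal sketch, using made-up toy frames rather than the competition data, of one way to guard against a mismatch: reindex the encoded test frame to the encoded training columns.

```python
import pandas as pd

# Toy frames with hypothetical values; the test frame lacks level 'c'.
train_toy = pd.DataFrame({"Category_2": ["a", "b", "c"], "score_1": [0.1, 0.2, 0.3]})
test_toy = pd.DataFrame({"Category_2": ["a", "b"], "score_1": [0.4, 0.5]})

train_enc = pd.get_dummies(train_toy, columns=["Category_2"])
test_enc = pd.get_dummies(test_toy, columns=["Category_2"])

# Give the test frame exactly the training columns, in the same order;
# levels absent from the test data become all-zero columns.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

With identical level sets in both files the reindex step changes nothing, so it is cheap to keep as a safety check.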

In [6]:
# SCALING
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X_scaled)
test_scaled = scaler.transform(test_scaled)

# Display
display(X_scaled.shape, test_scaled.shape)

(15285, 12)

(12140, 12)
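
Min-max scaling maps each column to (x - min) / (max - min), where min and max come from the training data only. The sketch below, on made-up numbers, shows the arithmetic for one column and why a test value above the training maximum lands outside the [0, 1] range.

```python
import numpy as np

# Made-up values for a single feature column.
train_col = np.array([2.0, 4.0, 6.0])
test_col = np.array([3.0, 8.0])

# Statistics are taken from the training column only.
lo, hi = train_col.min(), train_col.max()

train_col_scaled = (train_col - lo) / (hi - lo)
test_col_scaled = (test_col - lo) / (hi - lo)  # reuses the training min and max

print(train_col_scaled, test_col_scaled)
```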

# Voting Classifier

In [7]:
%%time


# Instantiate CatBOOST
model_cat = CatBoostClassifier(metric_period=400,
                          random_state=seed,
                          learning_rate=0.01,
                          loss_function = 'MultiClass',
                          border_count=1500)   # Catboost

# Instantiate ExtraTreesClassifier
model_etr=ExtraTreesClassifier(
    n_estimators = 6000, max_depth = None, n_jobs = -1, random_state = seed, verbose = 1,bootstrap=True)     # ExtraTreesClassifier

# Instantiate RandomForestClassifier
model_rf=RandomForestClassifier(
    n_estimators = 9000,  max_depth = None, n_jobs = -1, random_state = seed, verbose = 1)    # RandomForestClassifier


# Combine all models(estimators) and use 'soft voting' -- Voting Classifier
vote=VotingClassifier(estimators = [(
    'CatBoost', model_cat), ('ETR', model_etr), ('RF', model_rf)], voting = 'soft', weights = [8, 6, 8])



# Fit the Model
vote.fit(X_scaled, y)

# Make Predictions and convert Predictions to Dataframe
pred = pd.DataFrame(vote.predict_proba(test_scaled))
display(pred)

0:	learn: 1.5825669	total: 267ms	remaining: 4m 26s
400:	learn: 0.4635001	total: 30.9s	remaining: 46.1s
800:	learn: 0.4377283	total: 1m 1s	remaining: 15.2s
999:	learn: 0.4295014	total: 1m 16s	remaining: 0us


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    5.6s
[Parallel(n_jobs=-1)]: Done 536 tasks      | elapsed:    9.0s
[Parallel(n_jobs=-1)]: Done 1536 tasks      | elapsed:   15.7s
[Parallel(n_jobs=-1)]: Done 2936 tasks      | elapsed:   24.6s
[Parallel(n_jobs=-1)]: Done 4736 tasks      | elapsed:   36.5s
[Parallel(n_jobs=-1)]: Done 6000 out of 6000 | elapsed:   45.9s finished
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  52 tasks      | elapsed:    0.9s
[Parallel(n_jobs=-1)]: Done 352 tasks      | elapsed:    5.4s
[Parallel(n_jobs=-1)]: Done 852 tasks      | elapsed:   12.9s
[Parallel(n_jobs=-1)]: Done 1552 tasks      | elapsed:   23.0s
[Parallel(n_jobs=-1)]: Done 2452 tasks      | elapsed:   36.7s
[Parallel(n_jobs=-1)]: Done 3552 tasks      | elapsed:   53.2s
[Parallel(n_jobs=-1)]: Done 4852 tasks      | elapsed:  1.2min
[Parallel(n_jobs=-1)]: D

              0         1         2         3         4
0      0.002403  0.064266  0.598503  0.261213  0.073615
1      0.000045  0.005574  0.031738  0.955256  0.007386
2      0.000065  0.004153  0.055776  0.925023  0.014984
3      0.000066  0.005355  0.046603  0.933864  0.014112
4      0.000074  0.001058  0.008534  0.986190  0.004144
...         ...       ...       ...       ...       ...
12135  0.000246  0.029402  0.155123  0.784889  0.030340
12136  0.000049  0.000910  0.009918  0.986506  0.002616
12137  0.000048  0.002572  0.020809  0.973723  0.002849
12138  0.000058  0.003624  0.014762  0.978950  0.002607


Wall time: 5min 54s
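
Soft voting combines the models by taking a weighted average of their per-class probabilities; the predicted class is the one with the largest average. The sketch below reproduces that arithmetic for a single sample with the weights `[8, 6, 8]` used in the cell above. The probability vectors are made up for illustration and are not model output.

```python
import numpy as np

# Made-up class probabilities for one sample from three models (5 classes).
p_cat = np.array([0.0, 0.1, 0.6, 0.2, 0.1])
p_etr = np.array([0.0, 0.2, 0.3, 0.4, 0.1])
p_rf = np.array([0.0, 0.1, 0.5, 0.3, 0.1])

weights = [8, 6, 8]  # CatBoost, ExtraTrees, RandomForest

# Weighted average of the per-model probability vectors.
avg = np.average(np.vstack([p_cat, p_etr, p_rf]), axis=0, weights=weights)

print(avg.round(4))
print(int(avg.argmax()))
```

Because the weights are normalized by their sum (22 here), only their ratios matter: `[8, 6, 8]` and `[4, 3, 4]` would give the same averaged probabilities.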


# Prediction Adjuster 

Steps:
1. Import the 'common_grounds.csv' file
2. Loop through each prediction listed in the file
3. Overwrite each listed prediction with the class given in the file
4. Create the final submission file

In [8]:
# Adjusting Predictions -- Using previously created common_grounds file

# Read file 
cg = pd.read_csv('common_grounds.csv', index_col=0)
cg = cg.replace({3:2,4:3,5:4}) # Map class labels 3, 4, 5 to column positions 2, 3, 4 of the prediction file
#display(cg.head(10))

# Loop through each listed prediction row-wise
for i in cg.index:
    pos = cg.loc[i]  # target column for row i
    for n in [0,1,2,3,4]: # Looping over (0,1,3,4,5) or (0,1,2,3,4) gives the same predictions and a proper submission file
        pred.loc[i,n] = 0
    pred.loc[i,pos] = 1 
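
The loop above turns every listed row into a one-hot vector. A vectorized sketch of the same idea follows, on toy data. The column name `label` in the override table is an assumption, since the layout of 'common_grounds.csv' is not shown, and the sketch assumes the prediction frame has a default 0..n-1 index so that labels and positions coincide.

```python
import numpy as np
import pandas as pd

# Toy probability table (3 rows, 5 class columns) with made-up values.
pred_toy = pd.DataFrame(np.full((3, 5), 0.2))

# Hypothetical override table: row index -> column position to force.
cg_toy = pd.DataFrame({"label": [3, 0]}, index=[0, 2])

rows = cg_toy.index.to_numpy()
cols = cg_toy["label"].to_numpy()

arr = pred_toy.to_numpy(copy=True)
arr[rows] = 0.0        # clear every class column of the overridden rows
arr[rows, cols] = 1.0  # set the forced class to 1

adjusted = pd.DataFrame(arr, index=pred_toy.index, columns=pred_toy.columns)
print(adjusted)
```

Rows that do not appear in the override table keep their original probabilities.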

# Submission File Creation

In [9]:
# Create Submission file 

pred.to_csv(subfile_name,index=False)
print(f'Submission File {subfile_name} generated and saved as CSV')

Submission File A_7feb_E03.csv generated and saved as CSV

