<div>
    <h1 align="center">Tabular Playground Series - Jun 2021</h1>
    <h2 align="center">Ensembling & Comparative Method for Classifier</h2>
    <h4 align="center">By: Somayyeh Gholami & Mehran Kazeminia</h4>
</div>

<div class="alert alert-success">  
</div>

## Description:

#### In Kaggle challenges, as well as in real projects, we often take a weighted average of the predictions from several different models to get a better score. This technique is called "Ensembling" and is very common.

#### **What does it mean if the new results get a better score? Did all the hundreds of thousands of rows get better results? Certainly not.**

#### In fact, only some rows get better predictions while others get worse. The overall score still improves because the gains outweigh the losses.
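
A toy illustration of this point, using made-up probabilities rather than competition data, and assuming the score is a multi-class log loss where lower is better. The names `model_a`, `model_b`, and `row_log_loss` are hypothetical:

```python
import numpy as np

# Hypothetical probabilities for 3 rows and 2 classes (made-up values).
y_true = np.array([0, 1, 0])
model_a = np.array([[0.9, 0.1], [0.8, 0.2], [0.6, 0.4]])
model_b = np.array([[0.5, 0.5], [0.3, 0.7], [0.6, 0.4]])
blend = 0.5 * model_a + 0.5 * model_b

def row_log_loss(probs, labels):
    # Negative log of the probability assigned to the true class, per row.
    return -np.log(probs[np.arange(len(labels)), labels])

loss_a = row_log_loss(model_a, y_true)
loss_blend = row_log_loss(blend, y_true)

print("mean loss, model A:", loss_a.mean())
print("mean loss, blend:  ", loss_blend.mean())
print("rows where the blend is worse than model A:", np.where(loss_blend > loss_a)[0])
```

In this toy case the blend has the lower mean loss, yet the first row is scored worse than it was by model A alone.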

#### Here we try to identify the rows whose ensembled predictions are probably not better. For these rows we restore the pre-"Ensembling" predictions, and we will see that doing so improves the score even further.

#### We call this the "Comparative Method", and it is our own idea. It can be applied to both regression and classification, although the details differ between the two kinds of problem.

#### For more information, see the following discussion thread:

#### https://www.kaggle.com/c/tabular-playground-series-jun-2021/discussion/246031


<div class="alert alert-success">  
</div>

## Import & Data Set

In [1]:
import numpy as np 
import pandas as pd
import seaborn as sns

import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission2.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission4.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission3.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/__results__.html
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission6.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission1.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission5.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/submission.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/__resultx__.html
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/__notebook__.ipynb
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/__output__.json
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/val_X.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/val_y.csv
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/custom.css
/kaggle/input/1-tps-jun-21-histgradient-catboost-nn/__results

In [3]:
df1 = pd.read_csv('../input/tabular-playground-series-jun-2021/train.csv')
df2 = pd.read_csv('../input/tabular-playground-series-jun-2021/test.csv')
sam = pd.read_csv('../input/tabular-playground-series-jun-2021/sample_submission.csv')

In [4]:
sub1 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission1.csv')
# Public Score: 1.75770
# HistGradientBoostingClassifier
# display(sub1)

In [5]:
sub2 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission2.csv')
# Public Score: 1.75011
# CatBoostClassifier
# display(sub2)

In [6]:
sub3 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission3.csv')
# Public Score: 1.74587
# Neural Networks
# display(sub3)

Thanks to: @oxzplvifi https://www.kaggle.com/oxzplvifi/tabular-residual-network

In [7]:
sub4 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission4.csv')
# Public Score: 1.74522
# tabular residual network
# display(sub4)

Thanks to: @bhavikjain https://www.kaggle.com/bhavikjain/tps-june-21-eda-models

In [8]:
sub5 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission5.csv')
# Public Score: 1.74456
# EDA + Models
# display(sub5)

Thanks to: @fusioncenter https://www.kaggle.com/fusioncenter/residual-network-for-tabular-data

In [9]:
sub6 = pd.read_csv('../input/1-tps-jun-21-histgradient-catboost-nn/submission6.csv')
# Public Score: 1.74442
# Residual network
# display(sub6)

<div class="alert alert-success">  
</div>

## Ensembling

In [10]:
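def generate(main, support, coeff):
    # Blend two submissions column by column:
    #   result = coeff * main + (1 - coeff) * support
    # Both frames are expected to share the same row order and columns.
    g = main.copy()
    for col in main.columns[1:]:  # skip the 'id' column
        # .values keeps the blend positional, as in the original list-based loop
        g[col] = main[col].values * coeff + support[col].values * (1. - coeff)

    return g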

In [11]:
sub = generate(sub2, sub1, 0.85)

sub = generate(sub3, sub , 0.85)

sub = generate(sub4, sub , 0.85)

sub = generate(sub5, sub , 0.85)

sub = generate(sub6, sub , 0.55)

sub_ens = sub
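
Because each call blends the newest submission with the running blend, the chain above amounts to a single weighted average of the six submissions. A small sketch of that arithmetic follows; the names `steps` and `effective` are ours and are not used elsewhere in the notebook:

```python
# Coefficients from the chained generate() calls as (main submission, coeff),
# starting with sub1 as the initial "support".
steps = [("sub2", 0.85), ("sub3", 0.85), ("sub4", 0.85), ("sub5", 0.85), ("sub6", 0.55)]

effective = {"sub1": 1.0}
for name, coeff in steps:
    # Everything blended so far is scaled by (1 - coeff) ...
    effective = {k: w * (1.0 - coeff) for k, w in effective.items()}
    # ... and the new main submission enters with weight coeff.
    effective[name] = coeff

for name, w in effective.items():
    print(f"{name}: {w:.6f}")
```

The weights sum to 1. sub6 carries about 0.55 and sub5 about 0.38, while sub1, sub2, and sub3 together contribute roughly 0.01.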

In [12]:
display(sub_ens, sub_ens.describe().transpose())

Unnamed: 0,id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
0,200000,0.065233,0.412874,0.163801,0.026352,0.013123,0.136873,0.020597,0.043014,0.118133
1,200001,0.043917,0.073774,0.055350,0.019417,0.013772,0.271951,0.086928,0.304924,0.129967
2,200002,0.018246,0.027091,0.017714,0.010576,0.005988,0.747704,0.023868,0.105481,0.043331
3,200003,0.049925,0.127541,0.089373,0.034033,0.018137,0.230037,0.075726,0.212955,0.162274
4,200004,0.043782,0.111239,0.082003,0.025667,0.015519,0.272872,0.072847,0.226407,0.149663
...,...,...,...,...,...,...,...,...,...,...
99995,299995,0.079182,0.368577,0.145113,0.031257,0.016672,0.103340,0.035974,0.077667,0.142218
99996,299996,0.048118,0.242324,0.132491,0.028037,0.016411,0.162394,0.053110,0.153880,0.163236
99997,299997,0.059417,0.236502,0.123658,0.029569,0.015028,0.195043,0.049292,0.138666,0.152826
99998,299998,0.036098,0.021789,0.020296,0.011968,0.011675,0.353187,0.074146,0.382750,0.088091


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,100000.0,249999.5,28867.657797,200000.0,224999.75,249999.5,274999.25,299999.0
Class_1,100000.0,0.04502,0.011955,0.001416,0.039578,0.045647,0.051179,0.108009
Class_2,100000.0,0.121891,0.112132,0.000838,0.040982,0.079853,0.164986,0.539838
Class_3,100000.0,0.073825,0.047938,0.000387,0.034761,0.061801,0.10738,0.243158
Class_4,100000.0,0.023488,0.007988,0.000603,0.018004,0.024601,0.029549,0.066554
Class_5,100000.0,0.015086,0.004946,0.000289,0.012249,0.015244,0.018134,0.046656
Class_6,100000.0,0.259086,0.163228,0.004277,0.158346,0.244064,0.309323,0.981633
Class_7,100000.0,0.07424,0.026847,0.001602,0.056577,0.078983,0.094701,0.146041
Class_8,100000.0,0.259697,0.117738,0.009315,0.169077,0.267883,0.351282,0.666928
Class_9,100000.0,0.127666,0.037018,0.00218,0.106086,0.133517,0.156131,0.211176


In [13]:
sub_ens.to_csv("submission_ens.csv",index=False)
# Public Score: 1.74415
!ls

__notebook__.ipynb  submission_ens.csv


<div class="alert alert-success">  
</div>

## Comparative Method for Classifier

In [14]:
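def improve(sub1, sub2, sub3, sub4, sub5, sub6, sub_ens, majority, m_majority):  
    # For each row, count the classes that hold the same sorted position in
    # sub6 and in at least `majority` of sub1..sub5. If at least `m_majority`
    # classes qualify and sub6's top probability is not lower than the
    # ensemble's, the ensembled row is replaced by sub6's row.
    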
    sub1v = sub1.values
    sub2v = sub2.values
    sub3v = sub3.values
    sub4v = sub4.values
    sub5v = sub5.values 
    sub6v = sub6.values
    
    imp = sub_ens.copy()
    impv = imp.values
    NCLASS = 9
    number = 0

    for i in range (len(sub_ens)):
        c_count = 0  
        row = impv[i,1:]
        row_sort = np.sort(row)        
        
        row1 = sub1v[i,1:]
        row2 = sub2v[i,1:]
        row3 = sub3v[i,1:]
        row4 = sub4v[i,1:]
        row5 = sub5v[i,1:]    
        row6 = sub6v[i,1:] 
        row1_sort = np.sort(row1)
        row2_sort = np.sort(row2)
        row3_sort = np.sort(row3)
        row4_sort = np.sort(row4)
        row5_sort = np.sort(row5)
        row6_sort = np.sort(row6)
                      
        for j in range (NCLASS): 
            count = 0
            
            for k in range (NCLASS):                
                if (row6[j] == row6_sort[k]): 
                    
                    if (row1[j] == row1_sort[k]):
                        count = count + 1
                    if (row2[j] == row2_sort[k]):
                        count = count + 1                   
                    if (row3[j] == row3_sort[k]):
                        count = count + 1   
                    if (row4[j] == row4_sort[k]):
                        count = count + 1 
                    if (row5[j] == row5_sort[k]):
                        count = count + 1 
                        
            if (count >= majority):
                c_count = c_count + 1
        
        # Compare the largest probability of sub6 with that of the ensemble
        if ((c_count >= m_majority) and (row6_sort[NCLASS - 1] >= row_sort[NCLASS - 1])): 
            impv[i, 1:] = row6            
            number = number + 1            
#           print (number, i, c_count)                         
                    
    imp.iloc[:, 1:] = impv[:, 1:]
    p_number = round(((number / len(sub_ens)) * 100),2)  # share of rows restored, in percent
    print('>>>  R  E  T  U  R  N  S  <<<')
    print(30 * '=')
    print(f'Number of changes: {number}\n')
    print(f'Percentage of changes: {p_number} %')
    print(30 * '=')
    return imp      
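
The row-selection rule can also be expressed with array operations. The following is only a sketch of the same idea, not a drop-in replacement for `improve`: it compares ranks obtained from `argsort`, so rows containing tied probabilities may be treated differently from the value comparison used above. The function name `rank_agreement_mask` and the toy arrays are hypothetical:

```python
import numpy as np

def rank_agreement_mask(supports, reference, ensemble, majority, m_majority):
    # supports: list of (n_rows, n_classes) arrays (sub1..sub5 in the notebook)
    # reference: (n_rows, n_classes) array to fall back to (sub6 in the notebook)
    # ensemble: (n_rows, n_classes) array of blended probabilities
    ref_rank = reference.argsort(axis=1).argsort(axis=1)  # rank of each class per row
    agree = np.zeros(reference.shape, dtype=int)
    for p in supports:
        agree += (p.argsort(axis=1).argsort(axis=1) == ref_rank)
    classes_agreed = (agree >= majority).sum(axis=1)
    confident = reference.max(axis=1) >= ensemble.max(axis=1)
    return (classes_agreed >= m_majority) & confident

# Made-up example: 2 rows, 3 classes, one supporting model.
reference = np.array([[0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])
supports = [np.array([[0.2, 0.3, 0.5], [0.1, 0.2, 0.7]])]
ensemble = np.array([[0.15, 0.25, 0.6], [0.3, 0.25, 0.45]])

mask = rank_agreement_mask(supports, reference, ensemble, majority=1, m_majority=3)
restored = np.where(mask[:, None], reference, ensemble)
print(mask)
print(restored)
```

Here the first row is restored to the reference model's probabilities because all three class ranks agree, while the second row keeps the ensembled values.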

In [15]:
sub_imp = improve(sub1, sub2, sub3, sub4, sub5, sub6, sub_ens, 5, 7)

>>>  R  E  T  U  R  N  S  <<<
Number of changes: 24122

Percentage of changes: 24.12 %


In [16]:
display(sub_imp, sub_imp.describe().transpose())

Unnamed: 0,id,Class_1,Class_2,Class_3,Class_4,Class_5,Class_6,Class_7,Class_8,Class_9
0,200000,0.065233,0.412874,0.163801,0.026352,0.013123,0.136873,0.020597,0.043014,0.118133
1,200001,0.043917,0.073774,0.055350,0.019417,0.013772,0.271951,0.086928,0.304924,0.129967
2,200002,0.018246,0.027091,0.017714,0.010576,0.005988,0.747704,0.023868,0.105481,0.043331
3,200003,0.050316,0.126482,0.086695,0.035330,0.016767,0.234146,0.075299,0.211286,0.163680
4,200004,0.043782,0.111239,0.082003,0.025667,0.015519,0.272872,0.072847,0.226407,0.149663
...,...,...,...,...,...,...,...,...,...,...
99995,299995,0.079182,0.368577,0.145113,0.031257,0.016672,0.103340,0.035974,0.077667,0.142218
99996,299996,0.048118,0.242324,0.132491,0.028037,0.016411,0.162394,0.053110,0.153880,0.163236
99997,299997,0.060779,0.238580,0.125162,0.028374,0.014319,0.195207,0.048074,0.136210,0.153295
99998,299998,0.036098,0.021789,0.020296,0.011968,0.011675,0.353187,0.074146,0.382750,0.088091


Unnamed: 0,count,mean,std,min,25%,50%,75%,max
id,100000.0,249999.5,28867.657797,200000.0,224999.75,249999.5,274999.25,299999.0
Class_1,100000.0,0.044979,0.012036,0.001416,0.039437,0.045593,0.051202,0.108009
Class_2,100000.0,0.121698,0.112299,0.000838,0.040797,0.079348,0.164806,0.539838
Class_3,100000.0,0.073653,0.048035,0.000387,0.034556,0.06143,0.107342,0.243158
Class_4,100000.0,0.023456,0.00802,0.000603,0.017926,0.02456,0.029539,0.066554
Class_5,100000.0,0.015019,0.004973,0.000289,0.012125,0.015163,0.01809,0.046656
Class_6,100000.0,0.259358,0.163943,0.004277,0.157469,0.243782,0.310066,0.981633
Class_7,100000.0,0.074209,0.026934,0.001602,0.056377,0.078909,0.094819,0.146041
Class_8,100000.0,0.260179,0.118313,0.009315,0.168812,0.268047,0.352425,0.666928
Class_9,100000.0,0.127451,0.037191,0.00218,0.105715,0.133299,0.15611,0.211176


In [17]:
sub_imp.to_csv("submission_imp.csv",index=False)
# Public Score: 1.74408
!ls

__notebook__.ipynb  submission_ens.csv	submission_imp.csv


<div class="alert alert-success">
    <h1 align="center">If you find this work useful, please don't forget to upvote :)</h1>
</div>