<p style="border-bottom: 20px solid darkcyan"></p>

<div style="padding: 20px; background-color: darkcyan; border-radius: 10px; box-shadow: 0 4px 8px rgba(0, 0, 0, 0.1);">
    <div style="border: 2px solid darkcyan; padding: 20px; text-align: center; border-radius: 10px; background-color: lightgreen;">
        <h1 style="color: #000000; font-size: 32px; text-transform: uppercase; letter-spacing: 2px; margin-bottom: 20px;">🍂Ensembling & Correlation</h1>
        <div><em>
       By: Somayyeh Gholami & Mehran Kazeminia
    </em></div>
</div>
    
## <div style="color:darkcyan;background-color:white;padding:1.0%;border-radius:10px 10px;font-size:1em;text-align:center">Ensembling with Correlation Guidance - ECG</div>    

In [None]:
from IPython.display import HTML, display
import time

handle = display(HTML("""<marquee>👌</marquee>"""), display_id='html_marquee1')
time.sleep(2)
handle = display(HTML("""<marquee>🪟The goal of this notebook is to get the most out of all results... even the ones that don't seem to score well.</marquee>"""), display_id='html_marquee1', update=True)

<p style="border-bottom: 10px solid gray"></p>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:lightgray;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:navy;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>🪶 DESCRIPTION :</p></div>

- Validation data is normally used for tuning hyperparameters, determining Ensembling coefficients, and so on. But in many cases (such as Kaggle challenges) only the results and their public scores are known, and the Ensembling coefficients are practically determined **in the dark**.

- To get the best score, challenge participants ensemble the results of their own notebooks or those of others, and they usually determine the best coefficients by trial and error.

- But Ensembling the best-scoring results is not always successful. In most cases, the score improves only when certain results are combined, and for most participants trial and error is the only way to choose the right results for Ensembling.

- When there is more than one answer column, the darkness increases, because a coefficient that works well for Ensembling the first column is not necessarily optimal for Ensembling the next columns.

- In other words, in many cases we are looking for one optimal coefficient for the linear combination of two lists (a pair of lists). But to combine two pairs of lists, we should logically look for two coefficients. So for thousands of pairs of lists, we should no longer expect a single coefficient to be optimal for all of them, although in practice it is possible to use only two or three coefficients to combine thousands of pairs of lists.

- Recently, the [Open Problems - Single-Cell Perturbations challenge](https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/overview) was held on Kaggle, which requires predicting more than eighteen thousand columns. Ensembling becomes very complicated in such cases: if a coefficient is chosen for Ensembling the first column, we cannot be sure that the same coefficient is optimal for the more than eighteen thousand other columns.

- In this notebook, we want to share our experience of **using Correlation** with you. When you want to ensemble two prediction columns, their correlation can help you reach your destination, like a candle in the dark.

- For example, when the correlation between column A and column B is negative, Ensembling should not be performed on these two columns. That is, if the public score of result A is better than the public score of result B, the first choice is result A alone and the second choice is result B alone, and no linear combination of these two columns can be good.

- When the correlation of column A and column B is positive, we still need to see how far the correlation value is from zero and how close it is to one. We should also see how much better the public score of result A is than the public score of result B. For example, if the correlation is very close to one and the public score of result A is much better, result A alone is a good option and Ensembling cannot help improve the score.

- In this notebook, we try to clarify the issue with some examples. You will see in the examples that even within a single notebook, some prediction columns can be very good and some can be really bad, while the public score we see is the score of all the good and bad columns together. These good and bad columns are found by calculating the correlation of the matching columns in two separate notebooks.

- It should be noted that Ensembling should not increase the difference between the public score and the private score. In other words, the model should not become unstable and overfitting should not occur. For example, using the results of notebooks that are themselves products of Ensembling usually increases this risk. We will also explore and explain this issue in the notebook.

- Kaggle splits the test samples (that is, there is a public score and a private score), and we do not know how the test samples are ordered. For this reason, choosing a different coefficient for different rows of a column can be completely misleading, and you should definitely expect overfitting.

- Please note that in many cases Ensembling is complex and there is no single answer for all challenges. Of course, when there are thousands of prediction columns, using correlation seems necessary, but we have also used other methods in different challenges, which have been effective, and we will mention them below:

- We previously used Comparative Method and Snap to Grid in the Indoor Location & Navigation challenge:

> https://www.kaggle.com/code/mehrankazeminia/1-3-indoor-navigation-cost-minimization-floor/notebook

> https://www.kaggle.com/code/mehrankazeminia/2-3-indoor-navigation-comparative-method

> https://www.kaggle.com/code/mehrankazeminia/3-3-g6-snap-to-grid-fix-the-timestamps

- In the Tabular Playground Series - Jul 2021 challenge, we used the Smart Ensembling method:

> https://www.kaggle.com/code/mehrankazeminia/2-tps-jul-21-smart-ensembling

- In the Tabular Playground Series - Jul 2022 challenge, we used the Clustering-Ensembling method:

> https://www.kaggle.com/code/mehrankazeminia/3-3-tps22jul-clustering-ensembling

- If these notebooks appeal to you, please don't forget to **upvote** them.
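
To make the role of correlation in the description above concrete, here is a minimal sketch on toy numbers. The arrays and the helper names `pearson` and `blend` are illustrative assumptions, not part of the challenge data or of this notebook's functions. It shows that averaging two perfectly anti-correlated columns cancels their signal, which is the intuition behind not Ensembling columns with negative correlation.

```python
import numpy as np

def pearson(u, v):
    """Pearson correlation of two 1-D arrays."""
    return float(np.corrcoef(u, v)[0, 1])

def blend(u, v, coeff):
    """Linear combination: coeff * u + (1 - coeff) * v."""
    return coeff * np.asarray(u, dtype=float) + (1.0 - coeff) * np.asarray(v, dtype=float)

# Toy prediction columns (made-up values)
col_a = np.array([1.0, 2.0, 3.0, 4.0])
col_b = np.array([1.5, 2.5, 2.5, 4.5])  # moves with col_a
col_c = np.array([4.0, 3.0, 2.0, 1.0])  # moves against col_a

corr_ab = pearson(col_a, col_b)
corr_ac = pearson(col_a, col_c)
blend_ab = blend(col_a, col_b, 0.5)
blend_ac = blend(col_a, col_c, 0.5)

print('corr(A, B) =', round(corr_ab, 3), '| blend:', blend_ab)
print('corr(A, C) =', round(corr_ac, 3), '| blend:', blend_ac)
```

In this toy case the blend of A and C is the same value in every row: the two columns disagree on direction, so their average no longer distinguishes one row from another.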


<p style="border-bottom: 10px solid gray"></p>

In [None]:
#:::::::::::::::::::::::::::::::::::
import warnings # suppress warnings
warnings.filterwarnings('ignore')
#:::::::::::::::::::::::::::::::::::
import os
import gc
import glob
import random
import numpy as np 
import pandas as pd
import seaborn as sns
from tqdm import tqdm
from scipy import stats
from pathlib import Path
from itertools import groupby
#:::::::::::::::::::::::::::::::::::
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px
%matplotlib inline
#:::::::::::::::::::::::::::::::::::
!ls ../input/*

In [None]:
sample_submission = pd.read_csv('../input/open-problems-single-cell-perturbations/sample_submission.csv', index_col='id')

## <span style="color:navy;">Auxiliary Functions</span>

<p style="border-bottom: 5px solid navy"></p>

In [None]:
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 1
def df_creator(dfx, dfy, n):
    df_sub = pd.DataFrame(columns=['dfx','dfy'])
    
    df_sub['dfx'] = dfx.iloc[:, n].copy()
    df_sub['dfy'] = dfy.iloc[:, n].copy()
        
    return df_sub  

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 2 
def df_corr(df_sub):   
    corr = df_sub.corr(numeric_only=True).round(3)  
    
    corr_list = list(corr.iloc[0])[1:]
    return corr_list

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 3
def heatmap(dfx, dfy):
    N = random.randrange(dfx.shape[1])
    
    print('\n\nHeatmap (For a random column)')
    print(':' *40)
    print('Column Name :', list(dfx.columns)[N])
    print('Column Number :', N)
    print(':' *40, '\n')
    
    df_sub = df_creator(dfx, dfy, N)
    corr_matrix = df_sub.corr()
    fig = plt.figure(figsize=(4,3));

    cmap=sns.diverging_palette(240, 10, s=75, l=50, sep=1, n=6, center='light', as_cmap=False);
    sns.heatmap(corr_matrix, center=0, annot=True, cmap=cmap, linewidths=2);
    plt.suptitle(f'Heatmap (N={N})', y=0.95, fontsize=12, c='darkred');
    plt.show()

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 4
def ensembling_histograms(submission, dfx, dfy):
    N = random.randrange(dfx.shape[1])
    
    print('\n\nEnsembling Histograms (For a random column)')
    print(':' *50)
    print('Column Name :', list(dfx.columns)[N])
    print('Column Number :', N)
    print(':' *50)
    
    hist_data = [submission.iloc[:, N], dfx.iloc[:, N], dfy.iloc[:, N]]
    group_labels = ['Generated', 'First Results (Main)', 'Second Results (Support)']
    
    fig = ff.create_distplot(hist_data, group_labels, bin_size=.2, show_hist=False, show_rug=False)
    fig.show()    

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 5
def info_corr(corr_limit, counter_x, counter_y, dfx, dfy):
    counter_z = dfx.shape[1] - (counter_x + counter_y)
    
    print('\nCorrelation information (For all columns)')
    print(':' *70)
    print('A Percent =', round(counter_y / dfx.shape[1], 3)) 
    print('A: Correlation less than Zero')
    print('The number of columns evaluated with only the second result =', counter_y) 
    print('-  ' *24)
    print('B Percent =', round(counter_z / dfx.shape[1], 3)) 
    print('B: Correlation more than Zero and less than corr_limit')
    print('The number of columns evaluated by Ensembling =', counter_z)
    print('-  ' *24)
    print('C Percent =', round(counter_x / dfx.shape[1], 3))
    print('C: Correlation more than corr_limit')
    print('The number of columns evaluated with only the first result =', counter_x)
    print('-  ' *24)   
    print('The correlation limit that was considered as the basis =', corr_limit)
    print(':' *70 ,'\n\n')
    
    columns = ['A: Correlation less than Zero','B: Correlation more than Zero and less than corr_limit','C: Correlation more than corr_limit']
    data = [[ round(counter_y / dfx.shape[1], 3), round(counter_z / dfx.shape[1], 3), round(counter_x / dfx.shape[1], 3)]]
    de_data = pd.DataFrame(data=data , columns=columns)
    
    sns.set()
    de_data.plot(kind='barh', stacked=True, figsize=(10,1), color=['pink','violet','purple'])
    plt.gca().set_facecolor('lightyellow')
    plt.legend(fontsize=10, loc=3, bbox_to_anchor=(0, 1))
    plt.show() 

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 6
def ensembling_scatter(submission, dfx, dfy):   
    
    X  = dfx  # main
    Y1 = dfy  # support
    Y2 = submission
    
    sns.set()
    sns.set_style('whitegrid')  # replaces plt.style.use('seaborn-whitegrid'), which newer matplotlib releases no longer provide
    plt.figure(figsize=(12, 6), facecolor='lightyellow')
    plt.title('Scatter Graph (For all columns)\n', fontsize=12)   

    plt.scatter(X, Y1, s=2.5, label='dfy - Support', c='darkcyan')    
    plt.scatter(X, Y2, s=2.5, label='Generated', c='red')
    plt.scatter(X, X , s=4.0, label='dfx - Main(X=Y)', c='orange')
     
    plt.gca().set_facecolor('lightgreen')
    plt.legend(fontsize=10, loc=4)
    # plt.savefig('scatter101.png')
    plt.show()     

# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 7
def plot_corr(tcorr):  
    display(tcorr.iloc[:, :4].style.background_gradient(cmap='Pastel1', axis=None, vmin=0, vmax=1.0))

    sns.set()
    sns.set_style('whitegrid')  # replaces plt.style.use('seaborn-whitegrid'), which newer matplotlib releases no longer provide
    plt.figure(figsize=(12, 5), facecolor='lightyellow')
    plt.title('Percentage of States\n', fontsize=12)   
    plt.xlabel('Correlation Limit')
    plt.xticks(range(11), round(tcorr.iloc[:,0], 1))

    plt.plot(tcorr.iloc[:,1], color='orange', lw=2, label='A: Correlation less than Zero')
    plt.plot(tcorr.iloc[:,2], color='darkcyan', lw=2, label='B: Correlation more than Zero and less than corr_limit')
    plt.plot(tcorr.iloc[:,3], color='red', lw=2, label='C: Correlation more than corr_limit')
     
    plt.gca().set_facecolor('lightgreen')
    plt.legend(fontsize=10, loc=0)
    # plt.savefig('plot101.png')
    plt.show()  
    
# ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

## <span style="color:navy;">Ensembling & Tuning Functions</span>

<p style="border-bottom: 5px solid navy"></p>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:pink;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:darkred;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>(1) generate_corr_coeff (dfx, dfy, corr_limit, coeff)</p></div>

- This function takes the regression predictions of two notebooks; each of these results can have thousands of columns. The first result (dfx) is called "Main" and the second result (dfy) is called "Support".

- When the correlation of two matching columns in "Main" and "Support" is negative, this function ignores the "Main" column and keeps only the "Support" column. We can also set "corr_limit" to a value less than one (and greater than zero). When the correlation of two matching columns in "Main" and "Support" is greater than "corr_limit", this function ignores the "Support" column and keeps only the "Main" column.

- Only for columns whose correlation is between zero and "corr_limit" does this function perform Ensembling, with a coefficient that we define as "coeff": "coeff" is the weight of "Main" and (1 - "coeff") is the weight of "Support".

- Obviously, if we swap "Main" and "Support", the result of Ensembling will change, and the score may be better with a new setting of "corr_limit" and "coeff".

In [None]:
# Ensemble for two results by determining "corr_limit @ coeff"
def generate_corr_coeff(dfx, dfy, corr_limit, coeff):
    submission = sample_submission.copy()
    
    counter_x = 0
    counter_y = 0
    for n in range(dfx.shape[1]):   
        df_sub = df_creator(dfx, dfy, n)
        corr_list = df_corr(df_sub)
        
        submission.iloc[:, n] = (dfx.iloc[:, n] * coeff) + (dfy.iloc[:, n] * (1.- coeff))
        
        if (corr_list[0] > corr_limit):  
            submission.iloc[:, n] = dfx.iloc[:, n]
            counter_x += 1  
            
        if (corr_list[0] < 0):
            submission.iloc[:, n] = dfy.iloc[:, n]
            counter_y += 1
            
    print('\n\n', ':. ' *12, 'Ensembling (Different coefficients for different columns)', '.: ' *12)
    
    # heatmap(dfx, dfy)
    # ensembling_histograms(submission, dfx, dfy)
    
    info_corr(corr_limit, counter_x, counter_y, dfx, dfy)
    ensembling_scatter(submission, dfx, dfy)
    
    # display(submission)
    return submission
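
As a compact illustration of the rule implemented above, the following sketch applies it to a three-column toy example. The helper `ecg_blend` and the toy frames are hypothetical and exist only for this illustration. Unlike `generate_corr_coeff`, the sketch uses `DataFrame.corrwith` to get all column correlations in one call, does not round them, does not start from `sample_submission`, and draws no plots.

```python
import pandas as pd

def ecg_blend(main, support, corr_limit, coeff):
    """Blend two prediction frames column by column, guided by correlation.

    Hypothetical helper for illustration only.
    """
    corr = main.corrwith(support)                 # Pearson correlation of matching columns
    out = main * coeff + support * (1.0 - coeff)  # default: linear combination

    for name in corr[corr > corr_limit].index:    # nearly identical columns: keep Main
        out[name] = main[name]
    for name in corr[corr < 0].index:             # negative correlation: keep Support
        out[name] = support[name]
    return out, corr

# Toy frames (made-up values)
main = pd.DataFrame({'g1': [1.0, 2.0, 3.0, 4.0],
                     'g2': [1.0, 2.0, 3.0, 4.0],
                     'g3': [1.0, 2.0, 3.0, 5.0]})
support = pd.DataFrame({'g1': [4.0, 3.0, 2.0, 1.0],   # anti-correlated with Main
                        'g2': [1.0, 2.0, 3.0, 4.0],   # identical to Main
                        'g3': [2.0, 1.0, 4.0, 3.0]})  # moderately correlated with Main

blended, corr = ecg_blend(main, support, corr_limit=0.90, coeff=0.50)
print(corr.round(3))
print(blended)
```

In this toy case column `g1` comes from "Support" alone, column `g2` from "Main" alone, and only column `g3` is a real 50/50 combination.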

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:pink;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:darkred;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>(2) generate_corr (dfx, dfy, corr_limit)</p></div>

- This function is very similar to the previous one, except that you do not need to specify "coeff". For each pair of columns whose correlation is between zero and "corr_limit", it uses the correlation value itself as "coeff".

- The results of this function are usually not as good as those of the first function, but trying it can give a good picture for choosing "coeff" and "corr_limit".

In [None]:
# Ensemble for two results by determining "corr_limit"
def generate_corr(dfx, dfy, corr_limit):
    submission = sample_submission.copy()
    
    counter_x = 0
    counter_y = 0
    for n in range(dfx.shape[1]):   
        df_sub = df_creator(dfx, dfy, n)
        corr_list = df_corr(df_sub)
        
        submission.iloc[:, n] = (dfx.iloc[:, n] * corr_list[0]) + (dfy.iloc[:, n] * (1.- corr_list[0]))
        
        if (corr_list[0] > corr_limit):  
            submission.iloc[:, n] = dfx.iloc[:, n]
            counter_x += 1  
            
        if (corr_list[0] < 0):
            submission.iloc[:, n] = dfy.iloc[:, n]
            counter_y += 1
            
    print('\n\n', ':. ' *12, 'Ensembling (Different coefficients for different columns)', '.: ' *12)
    
    # heatmap(dfx, dfy)
    # ensembling_histograms(submission, dfx, dfy)
    
    info_corr(corr_limit, counter_x, counter_y, dfx, dfy)
    ensembling_scatter(submission, dfx, dfy)
    
    # display(submission)
    return submission
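
To see what this rule means for a single pair of columns, here is a small sketch that returns the weights given to "Main" and "Support" for a given correlation. The function name `weights_from_corr` is an assumption made for this illustration; it restates the branches of `generate_corr` and is not called anywhere in the notebook.

```python
def weights_from_corr(corr, corr_limit):
    """Return (weight of Main, weight of Support) under the generate_corr rule."""
    if corr < 0:
        return 0.0, 1.0          # negative correlation: Support alone
    if corr > corr_limit:
        return 1.0, 0.0          # correlation above the limit: Main alone
    return corr, 1.0 - corr      # otherwise the correlation is the weight of Main

for corr in (-0.20, 0.05, 0.50, 0.84, 0.86):
    w_main, w_support = weights_from_corr(corr, corr_limit=0.85)
    print(f'corr={corr:+.2f} -> Main {w_main:.2f}, Support {w_support:.2f}')
```

One consequence worth noticing: a pair with a correlation just above zero is weighted almost entirely toward "Support", so the rule passes smoothly into the negative-correlation case, while at "corr_limit" the weight of "Main" jumps from about "corr_limit" to one.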

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:pink;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:darkred;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>(3) tuning_corr_limit (dfx, dfy)</p></div>

- Before setting the parameters of the above functions, you can use the following function to see the correlation status of "Main" and "Support" in a table and a graph.

- If there are thousands of prediction columns, this function has to perform a lot of calculations and takes time to run.

In [None]:
def tuning_corr_limit(dfx, dfy): 
    '''
    For each corr_limit in 0.0, 0.1, ..., 1.0, report the share of columns in each state:
    A : Less than Zero
    B : Between Zero and corr_limit (the columns that would be ensembled)
    C : More than corr_limit
    '''
    cname = ['corr_limit','Percentage of A','Percentage of B','Percentage of C','The number of A','The number of B','The number of C']
    
    clist = np.arange(0, 1.1, 0.1)
    total = dfx.shape[1]
    
    # The correlations do not depend on corr_limit, so compute them only once
    corr_values = [df_corr(df_creator(dfx, dfy, n))[0] for n in range(total)]
    
    rows = []
    for limit in clist:
        counter_a = sum(1 for corr in corr_values if corr < 0)
        counter_c = sum(1 for corr in corr_values if corr > limit)
        # Everything else is ensembled, which matches counter_z in info_corr
        counter_b = total - (counter_a + counter_c)
        
        rows.append([limit,
                     round(counter_a / total, 3), round(counter_b / total, 3), round(counter_c / total, 3),
                     counter_a, counter_b, counter_c])
        
    # Building the frame from rows keeps the percentages as floats and the counts as integers
    tcorr = pd.DataFrame(data=rows, columns=cname)
    
    plot_corr(tcorr)
    display(tcorr)
    return tcorr
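
The table produced by this function can be understood from a handful of made-up correlations. The sketch below is only an illustration: `state_shares` is a hypothetical helper and the four correlation values are invented. With real results, a Series of column correlations could be obtained with `dfx.corrwith(dfy)`.

```python
import pandas as pd

def state_shares(corr, limits):
    """Share of columns in states A, B and C for each candidate corr_limit."""
    total = len(corr)
    rows = []
    for limit in limits:
        share_a = (corr < 0).sum() / total          # A: Support alone
        share_c = (corr > limit).sum() / total      # C: Main alone
        rows.append({'corr_limit': limit,
                     'A': share_a,
                     'B': 1.0 - share_a - share_c,  # B: everything else is ensembled
                     'C': share_c})
    return pd.DataFrame(rows)

# Toy correlations for four columns (made-up values)
corr = pd.Series([-0.50, 0.20, 0.60, 0.95], index=['g1', 'g2', 'g3', 'g4'])
table = state_shares(corr, limits=[0.5, 0.9])
print(table)
```

Raising "corr_limit" moves columns from state C to state B, while the share of state A does not depend on "corr_limit" at all.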

## <span style="color:navy;">Ensembling with Correlation Guidance - ECG</span>

<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:cyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:navy;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>EXAMPLE : 1 - Chain Ensembling</p></div>

<p style="border-bottom: 5px solid gray"></p>

- For the first example, we ensemble the results of four notebooks one after another, creating a so-called "Ensembling Chain". The first one is our own notebook, where we have done "Feature Augmentation". The following results are before any Ensembling.

- https://www.kaggle.com/code/mehrankazeminia/80-3-op2-feature-augmentation/notebook

-----

![](https://cdn-images-1.medium.com/max/1000/1*lcpSEsDv1JvWRLkx3t5khg.png)

In [None]:
# Feature Augmentation
sub_606 = pd.read_csv('../input/80-3-op2-feature-augmentation/prediction.csv', index_col='id')
# ............................................................................................
# Public Score: 0.606
# Private Score: 0.809

<p style="border-bottom: 5px solid gray"></p>

- The next notebook makes effective use of a "Neural Network".

- https://www.kaggle.com/code/kishanvavdara/neural-network-regression

-----

![](https://cdn-images-1.medium.com/max/1000/1*qWP-PAuonBldoJD6qgE1Gw.png)


In [None]:
# Thanks to: @kishanvavdara - Neural Network
sub_604 = pd.read_csv('../input/op2-604/submission_df.csv', index_col='id')
# ..........................................................................
# Public Score: 0.604
# Private Score: 0.824

<p style="border-bottom: 5px solid gray"></p>

- The next notebook makes effective use of NLP (SMILES Embedding).

- https://www.kaggle.com/code/kishanvavdara/nlp-regression?scriptVersionId=146928847

-----

![](https://cdn-images-1.medium.com/max/1000/1*DmqWkqcShGKxUXSImIiLEw.png)

In [None]:
# Thanks to: @kishanvavdara - NLP(SMILES Embedding)
sub_607 = pd.read_csv('../input/op2-607/OP2_607.csv', index_col='id')
# ....................................................................
# Public Score: 0.607
# Private Score: 0.813

<p style="border-bottom: 5px solid gray"></p>

- Before starting any Ensembling, we can compare the correlation values of all the columns of the two results and see the details in a table and a graph, which helps with parameter tuning. (This function takes a long time.)

In [None]:
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 1a
tuning_corr_limit(sub_604, sub_607)
gen_1a = generate_corr_coeff(sub_604, sub_607, 0.95, 0.50)
gen_1a.to_csv('gen_1a.csv')
# Public Score: 0.595
# Private Score: 0.808

# Classic method of Ensembling - 1a
ens_1a = (sub_604 *0.50) + (sub_607 *0.50)
ens_1a.to_csv('ens_1a.csv')
# Public Score: 0.596
# Private Score: 0.808
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 1b
tuning_corr_limit(gen_1a, sub_606)
gen_1b = generate_corr_coeff(gen_1a, sub_606, 0.85, 0.60)
gen_1b.to_csv('gen_1b.csv')
# Public Score: 0.589
# Private Score: 0.796

# Classic method of Ensembling - 1b
ens_1b = (gen_1a *0.60) + (sub_606 *0.40)
ens_1b.to_csv('ens_1b.csv')
# Public Score: 0.590
# Private Score: 0.797
# ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

<p style="border-bottom: 5px solid gray"></p>

### Explanations for Example : 1

- Only 3.7% of the sub_604 and sub_607 columns have a negative correlation, perhaps because the two scores are close to each other. Still, if these columns are left out of Ensembling, the Ensembling score gets better. Note that the values of these 3.7% of the columns must be taken from only one of the results. In the function we wrote, they are taken from the second result passed to the function, here sub_607. When you see negative correlations, you should definitely try both orders to find out which result should be passed first and which second. If the number of columns with negative correlation is high, removing them from Ensembling is very important and will have a great impact on the final score. As mentioned earlier, **two columns with negative correlation have no chance for Ensembling**.

- For gen_1a, 5% of the columns (those with a correlation close to one) were left out of Ensembling, and for this reason corr_limit was set to 0.95. These columns are taken from the sub_604 result. For gen_1b, 15% of the columns with a correlation close to one were left out, and for this reason corr_limit was set to 0.85. These columns are taken from the gen_1a result.

- The coeff value is 0.50 for gen_1a and 0.60 for gen_1b. For comparison, Ensembling was also done separately in the classical way with exactly the same coefficients, under the names ens_1a and ens_1b.
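
It can help to work out how much each original result finally contributes after a chain of blends. The sketch below is a small worked calculation under an assumption: it considers a column that is ensembled at both steps (correlation between zero and "corr_limit" each time), so that both coefficients apply. The helper name `chain_weights` is hypothetical.

```python
def chain_weights(first, steps):
    """Effective weight of every result after a chain of linear blends.

    first : name of the result that starts the chain
    steps : list of (coeff, new_name); each step computes
            coeff * current_blend + (1 - coeff) * new_result
    """
    weights = {first: 1.0}
    for coeff, new_name in steps:
        # Everything already in the blend is scaled by coeff ...
        weights = {name: w * coeff for name, w in weights.items()}
        # ... and the newly added result receives the remaining weight
        weights[new_name] = weights.get(new_name, 0.0) + (1.0 - coeff)
    return weights

# Step 1a: 0.50 * sub_604 + 0.50 * sub_607, step 1b: 0.60 * (step 1a) + 0.40 * sub_606
weights = chain_weights('sub_604', [(0.50, 'sub_607'), (0.60, 'sub_606')])
for name, w in weights.items():
    print(name, round(w, 2))
```

For such a column the chain is equivalent to a single combination with weights 0.30, 0.30 and 0.40 for sub_604, sub_607 and sub_606. Columns that fall outside the ensembling range at either step follow the "Main" alone or "Support" alone branch instead.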


<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:cyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:navy;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>EXAMPLE : 2 - Ensembling for challenge winners</p></div>

<p style="border-bottom: 5px solid gray"></p>

- The ["Open Problems - Single-Cell Perturbations"](https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/overview) challenge has ended and the winners of this challenge were determined based on the best private score. For this reason, for the second example, we use the results of the competition winners and try to optimize the private score by Ensembling.

- https://www.kaggle.com/competitions/open-problems-single-cell-perturbations/leaderboard

-----

![](https://cdn-images-1.medium.com/max/1000/1*M0FyDUCXjGAY6_XtK4_DUg.png)

<p style="border-bottom: 5px solid gray"></p>

- The results of the first and second teams have not been published or we could not find them. So we will check the results of the third and fourth teams.

- https://www.kaggle.com/code/jankowalski2000/3rd-place-solution

-----

![](https://cdn-images-1.medium.com/max/1000/1*ZNW2NFb1oZCYzdettDprSg.png)

In [None]:
# Thanks to: @jankowalski2000 (3rd place)
sub_3rd = pd.read_csv('/kaggle/input/op2-548-3rd/OP2_548_3rd.csv', index_col='id')
# ................................................................................
# Private Score: 0.732

<p style="border-bottom: 5px solid gray"></p>

- We noticed that after the contest ended, the 4th-place author published a better version, with a better private score than their Leaderboard score. So we use that final version.

- https://www.kaggle.com/code/raki21/4th-place-magic-postprocessing?scriptVersionId=153284875

-----

![](https://cdn-images-1.medium.com/max/1000/1*qtwa8lVSTQbuFU0WlksvLw.png)

In [None]:
# Thanks to: @raki21 (4th Place)
sub_4th = pd.read_csv('../input/op2-565-4th/OP2_565_4th.csv', index_col='id')
# ............................................................................
# Private Score: 0.712

<p style="border-bottom: 5px solid gray"></p>

- Before doing anything on the above two results, we can compare the correlation value of all their columns. (This function takes a long time)

In [None]:
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 2
tuning_corr_limit(sub_4th, sub_3rd)
gen_2 = generate_corr_coeff(sub_4th, sub_3rd, 0.90, 0.50)
gen_2.to_csv('gen_2.csv')
# Private Score: 0.707

# Classic method of Ensembling - 2
ens_2 = (sub_4th *0.50) + (sub_3rd *0.50)
ens_2.to_csv('ens_2.csv')
# Private Score: 0.713
# ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

<p style="border-bottom: 5px solid gray"></p>

### Explanations for Example : 2

- Here the correlation of two matching columns is never negative. In addition, if we set "corr_limit" to 0.90, Ensembling is done for almost forty percent of the columns. Because the scores of the two results are close to each other, we set "coeff" to 0.50 so that the two results can correct each other where possible.

- The private score for this calculation is 0.707, which is better than the scores of both the sub_3rd and sub_4th results. In addition, it is much better than the challenge champion's private score, which is 0.729.

- But as you can see above, if the Ensembling is done with the classical method and a coefficient of 0.50, the private score becomes 0.713, which is even worse than the sub_4th result. That is, at first glance and with simple Ensembling, we would think that these two results cannot help each other. But with correlation guidance, the score improves.

<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>

# <div style="color:yellow;display:inline-block;border-radius:5px;background-color:cyan;font-block:Nexa;overflow:hidden"><p style="padding:15px;color:navy;overflow:hidden;font-size:70%;letter-spacing:0.5px;margin:0"><b> </b>EXAMPLE : 3 - "Results of impure golden"</p></div>

<p style="border-bottom: 5px solid gray"></p>

- For the third example, let's look at a specific notebook to make some things clearer. The public score of the following notebook is 0.720 and its private score is 0.960. These scores are not good at all, because they are even worse than the "sample_submission" scores. If you submit "sample_submission" (i.e. when all answers are zero), the public score is 0.666 and the private score is 0.902.

- But at the beginning of the challenge, we realized that the solution of this notebook is very good. For this reason, we ensembled the results of this notebook with the results of our own notebooks, and our score became much better. We named this type of results "Results of impure golden". We have also examined this type of results in [another challenge](https://www.kaggle.com/code/mehrankazeminia/2-tps22nov-results-of-impure-golden-eda).

- Next, we will try Ensembling our own notebook, i.e. the sub_606 result, with this particular notebook, so that you can see the improvement in the score.

- https://www.kaggle.com/code/vendekagonlabs/jax-autoencoder-quickstart?scriptVersionId=144671651

-----

![](https://cdn-images-1.medium.com/max/1000/1*ltVebrPQPyWSRyh9e50y0g.png)

In [None]:
# Thanks to: @vendekagonlabs - "Results of impure golden"
sub_720 = pd.read_csv('../input/op2-720/op2_720.csv', index_col='id')
# ....................................................................
# Public Score: 0.720
# Private Score: 0.960

<p style="border-bottom: 5px solid gray"></p>

- Before doing anything on the above two results, we can compare the correlation value of all their columns. (This function takes a long time)

In [None]:
# :::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::: 3
tuning_corr_limit(sub_720, sub_606)
gen_3 = generate_corr_coeff(sub_720, sub_606, 1.00, 0.25)
gen_3.to_csv('gen_3.csv')
# Public Score: 0.601
# Private Score: 0.792
# ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
!ls

<p style="border-bottom: 5px solid gray"></p>

### Explanations for Example : 3

- Here the correlation of 8.4 percent of the columns is negative. These columns cannot be useful for Ensembling. We ignored them and ensembled the rest of the sub_720 columns, with a coefficient of 0.25, with the columns of our own notebook. The public score of our notebook changed from 0.606 to 0.601 and the private score changed from 0.809 to 0.792. This is a significant improvement, even though the sub_720 scores were not good at all.

- Please note that in all the above examples, you still need to use trial and error to get the optimal values for "coeff" and "corr_limit", but knowing the correlation of the columns can make the decision somewhat clearer. It means that we **no longer need to work in absolute darkness**.
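
One way to organise the remaining trial and error is to list the settings to try together with a file name for each candidate submission. The sketch below only builds that list. The helper `candidate_settings`, the file-name pattern and the example grid are assumptions made for illustration; the commented line shows how each candidate could be generated with `generate_corr_coeff` once the data is loaded.

```python
from itertools import product

def candidate_settings(corr_limits, coeffs, prefix='gen'):
    """List (corr_limit, coeff, file name) for every combination to try."""
    settings = []
    for corr_limit, coeff in product(corr_limits, coeffs):
        # Encode both parameters in the file name, e.g. 0.85 -> "085"
        name = f'{prefix}_cl{int(round(corr_limit * 100)):03d}_co{int(round(coeff * 100)):03d}.csv'
        settings.append((corr_limit, coeff, name))
    return settings

for corr_limit, coeff, name in candidate_settings([0.85, 0.95], [0.40, 0.50, 0.60]):
    print(corr_limit, coeff, name)
    # With the real data loaded, each candidate could be written with, e.g.:
    # generate_corr_coeff(dfx, dfy, corr_limit, coeff).to_csv(name)
```

It is worth keeping such a grid small: every candidate has to be scored by a submission, and choosing the best of many public scores adds to the overfitting risk mentioned in the description.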


<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>
<p style="border-bottom: 5px solid navy"></p>