<h1 style="color:#189AB4;font-size:50px;"><strong>Simple Weighted Average Ensemble<strong style="color:black"> TPS November 2021</strong></strong></h1>

<a><img src="https://i.ibb.co/PWvpT9F/header.png" alt="header" border="0" width=800 height=400></a>

From [Medium](https://medium.com/analytics-vidhya/simple-weighted-average-ensemble-machine-learning-777824852426)

<p style="font-size:120%"><strong>The simple math behind it</strong></p>

Suppose you have a set of five classifiers, each with an error rate of 0.2, and assume their errors are independent. With majority voting, the ensemble misclassifies an instance only when at least 3 of the 5 classifiers are wrong at the same time. The probability of a wrong prediction is therefore calculated as follows:

> There are C(5,3) = 10 ways for exactly 3 of the 5 classifiers to be wrong;

> there are C(5,4) = 5 ways for exactly 4 of the 5 to be wrong;

> and there is C(5,5) = 1 way for all 5 to be wrong.

![formula](https://miro.medium.com/max/622/1*wV2ohNiZsn-o0dCULvWXag.png)

As you can see, the total error decreases from 0.2 to about 0.058. However, as you add more classifiers to the ensemble, expect diminishing returns: the error reduction will eventually plateau.
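
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch under the same assumptions as the text (independent classifiers, identical error rates, majority voting); the function name `majority_vote_error` is made up for this illustration.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that a majority of independent classifiers are wrong.

    Assumes every classifier has the same error rate and that errors are
    independent. For an even number of classifiers, ties are not counted
    as errors (an assumption of this sketch).
    """
    k_min = n_classifiers // 2 + 1  # smallest number of wrong votes that forms a majority
    return sum(
        comb(n_classifiers, k)
        * error_rate**k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


ensemble_error = majority_vote_error(5, 0.2)
print(f"{ensemble_error:.5f}")  # 0.05792

# The reduction flattens out as more classifiers are added
for n in (1, 5, 11, 21):
    print(n, majority_vote_error(n, 0.2))
```

The three terms in the sum for five classifiers are 0.0512, 0.0064 and 0.00032, which add up to the 0.058 quoted above.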

![memes](https://miro.medium.com/max/630/1*E4_pTJctmAofSRpZCZbv-g.jpeg)

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.svm import LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import roc_auc_score, accuracy_score
import sys
import os
import gc

<h1 style="color:#189AB4;font-size:20px;"><strong>Read in <strong style="color:black"> submission files</strong></strong></h1>

<center>

<h1 style="color:#189AB4;font-size:20px;"><strong><strong style="color:black">Credits:</strong></strong></h1>

</center>    
    
<a href="https://www.kaggle.com/adityasharma01/simple-nn-tps-nov-21"><p style="text-align:center">ADITYA SHARMA</p></a>

<a href="https://www.kaggle.com/jmcslk/tps-nov-21-dnn-cnn-model-extras"><p style="text-align:center">MARCIN PIETRZYCKI</p></a>

<a href="https://www.kaggle.com/javiervallejos/simple-nn-with-good-results-tps-nov-21"><p style="text-align:center">JAVIER VALLEJOS</p></a>

<a href="https://www.kaggle.com/chaudharypriyanshu/understanding-neural-net"><p style="text-align:center">PRIYANSHU CHAUDHARY</p></a>

<a href="https://www.kaggle.com/dlaststark/tps-1121-dnn-v4"><p style="text-align:center">DLASTSTARK</p></a>

<a href="https://www.kaggle.com/edrickkesuma/power-averaging-is-your-friend"><p style="text-align:center">Edrick Kesuma</p></a>

<a href="https://www.kaggle.com/ambrosm/tpsnov21-007-postprocessing"><p style="text-align:center">AmbrosM</p></a>


In [None]:
%%time

# Avoid shadowing the built-in dir()
data_dir = '../input/tabular-playground-series-nov-2021/'
ext = '.csv'

train = pd.read_csv('../input/november21/train.csv')
test = pd.read_csv(data_dir + 'test' + ext)

def seed_everything(seed=42):
    # Seed Python's built-in RNG, the hash seed, and NumPy for reproducibility
    import random
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)

    
TARGET = 'target'
FOLD = 5
SEED = 42
N_ESTIMATORS=15000
DEVICE = 'CPU'

LOSS = 'BinaryCrossEntropy'
EVAL_METRIC = "AUC"

STUDY_TIME = 60*60*8
seed_everything(SEED)

In [None]:
def postprocess_separate(sub, test_df=test, pure_df=train):
    """Update sub so that the predictions for the two sides of the hyperplane don't overlap.
    
    Parameters
    ----------
    sub : pandas DataFrame with columns 'id' and 'target'
    test_df : the competition's test data
    pure_df : the competition's original training data
    
    From https://www.kaggle.com/ambrosm/tpsnov21-007-postprocessing
    """
    if pure_df is None: pure_df = pd.read_csv('../input/november21/train.csv')
    if pure_df.shape != (600000, 102): raise ValueError("pure_df has the wrong shape")
    if test_df is None: test_df = pd.read_csv('../input/tabular-playground-series-nov-2021/test.csv')
    if test_df.shape[0] != sub.shape[0] or test_df.shape[1] != 101: raise ValueError("test_df has the wrong shape")

    # Find the separating hyperplane for pure_df, step 1
    # Use an SVM with almost no regularization
    model1 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model1.fit(pure_df.drop(columns=['id', 'target']), pure_df.target)
    pure_pred = model1.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 1 599999
    # model1 is not perfect: it predicts the wrong class for 1 of 600000 samples

    # Find the separating hyperplane for pure_df, step 2
    # Fit a second SVM to a subset of the points which contains the support vectors
    pure_pred = model1.decision_function(pure_df.drop(columns=['id', 'target']))
    subset_df = pure_df[(pure_pred > -5) & (pure_pred < 0.9)]
    model2 = make_pipeline(StandardScaler(), LinearSVC(C=1e5, tol=1e-7, penalty='l2', dual=False, max_iter=2000, random_state=1))
    model2.fit(subset_df.drop(columns=['id', 'target']), subset_df.target)
    pure_pred = model2.predict(pure_df.drop(columns=['id', 'target']))
    print((pure_pred != pure_df.target).sum(), (pure_pred == pure_df.target).sum()) # 0 600000
    # model2 is perfect: it predicts the correct class for all 600000 training samples
    
    pure_test_pred = model2.predict(test_df.drop(columns=['id', 'target'], errors='ignore'))
    lmax, rmin = sub[pure_test_pred == 0].target.max(), sub[pure_test_pred == 1].target.min()
    if lmax < rmin:
        print("There is no overlap. No postprocessing needed.")
        return sub  # return the unchanged submission so callers can safely reassign it
    # There is overlap. Remove this overlap
    sub.loc[pure_test_pred == 0, 'target'] -= lmax + 1
    sub.loc[pure_test_pred == 1, 'target'] -= rmin - 1
    print(sub[pure_test_pred == 0].target.min(), sub[pure_test_pred == 0].target.max(),
          sub[pure_test_pred == 1].target.min(), sub[pure_test_pred == 1].target.max())
    
    del model1, model2
    gc.collect()
    return sub


# Reduce memory usage by downcasting numeric columns to the smallest fitting dtype
def reduce_mem_usage(df, verbose=True):
    numerics = ['int16', 'int32', 'int64', 'float16', 'float32', 'float64']
    start_mem = df.memory_usage().sum() / 1024**2    
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type in numerics:
            c_min = df[col].min()
            c_max = df[col].max()
            if str(col_type)[:3] == 'int':
                if c_min > np.iinfo(np.int8).min and c_max < np.iinfo(np.int8).max:
                    df[col] = df[col].astype(np.int8)
                elif c_min > np.iinfo(np.int16).min and c_max < np.iinfo(np.int16).max:
                    df[col] = df[col].astype(np.int16)
                elif c_min > np.iinfo(np.int32).min and c_max < np.iinfo(np.int32).max:
                    df[col] = df[col].astype(np.int32)
                elif c_min > np.iinfo(np.int64).min and c_max < np.iinfo(np.int64).max:
                    df[col] = df[col].astype(np.int64)  
            else:
                if c_min > np.finfo(np.float32).min and c_max < np.finfo(np.float32).max:
                    df[col] = df[col].astype(np.float32)
                else:
                    df[col] = df[col].astype(np.float64)    
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose: print('Mem. usage decreased to {:5.2f} Mb ({:.1f}% reduction)'.format(end_mem, 100 * (start_mem - end_mem) / start_mem))
    return df

In [None]:
# Reduce memory by changing each column's datatype
train = reduce_mem_usage(train)
test = reduce_mem_usage(test)

In [None]:
dir_ = '../input/tps-nov-2021/wmean/'

sub1 = pd.read_csv(dir_ + 'sub1 0.74988.csv')
sub2 = pd.read_csv(dir_ + 'sub2 0.74966.csv')
sub3 = pd.read_csv(dir_ + 'sub3 0.74951.csv')
sub4 = pd.read_csv(dir_ + 'sub4 0.74940.csv')
sub5 = pd.read_csv(dir_ + 'sub5 0.74935.csv')
sub6 = pd.read_csv(dir_ + 'sub6 0.74918.csv')

sub1 = postprocess_separate(sub1)
sub2 = postprocess_separate(sub2)
sub3 = postprocess_separate(sub3)
sub4 = postprocess_separate(sub4)
sub5 = postprocess_separate(sub5)
sub6 = postprocess_separate(sub6)

<h1 style="color:#189AB4;font-size:20px;"><strong>Check for <strong style="color:black">correlations</strong></strong></h1>

In [None]:
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

hist_data = [sub1.target, sub2.target, sub3.target,sub4.target, sub5.target, sub6.target]
group_labels = ['sub1','sub2','sub3','sub4','sub5','sub6']
fig = ff.create_distplot(hist_data, group_labels, bin_size=0.3, show_hist=False, show_rug=False)
fig.show()

In [None]:
# High correlation between all models ~0.994+
data = np.corrcoef([sub1.target, sub2.target, sub3.target,sub4.target, sub5.target, sub6.target])
fig=px.imshow(data,x=group_labels, y=group_labels)

fig.show()

<h1 style="color:#189AB4;font-size:20px;"><strong>How to implement one <strong style="color:black">intuitively</strong></strong></h1>

Usually, <mark>there are two ways to do it.</mark>

First, you train the same classifier (e.g., Decision Tree) over multiple different subsets of training data, which leads to multiple different models (DT1, DT2, DT3,…). Then, you predict the test data with those models and average the results.

![b](https://miro.medium.com/max/700/1*oJKi_Xdle5qXjv_XgIKExA.png)

Second, you can train multiple different classifiers (the more diverse, the better) on the whole training set and average the results.

![n](https://miro.medium.com/max/700/1*UrmG9r6dc2I_ouwtadrWnQ.png)
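
The two approaches above can be sketched roughly as follows. This is only an illustration on synthetic data from `make_classification`, not the pipeline used in this notebook; the model choices (`DecisionTreeClassifier`, `LogisticRegression`, `GaussianNB`), the number of subsets, and all variable names are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (stand-in for a real dataset)
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

rng = np.random.default_rng(42)

# Way 1: the same classifier trained on different random subsets of the training data
subset_preds = []
for i in range(10):
    idx = rng.choice(len(X_train), size=len(X_train) // 2, replace=False)
    tree = DecisionTreeClassifier(max_depth=5, random_state=i)
    tree.fit(X_train[idx], y_train[idx])
    subset_preds.append(tree.predict_proba(X_test)[:, 1])
avg_same_model = np.mean(subset_preds, axis=0)

# Way 2: different classifiers trained on the whole training set
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=5, random_state=0),
    GaussianNB(),
]
diverse_preds = [
    model.fit(X_train, y_train).predict_proba(X_test)[:, 1] for model in models
]
avg_diverse = np.mean(diverse_preds, axis=0)

print("AUC, same model on subsets:", roc_auc_score(y_test, avg_same_model))
print("AUC, diverse models:       ", roc_auc_score(y_test, avg_diverse))
```

Both variants average predicted probabilities with equal weights; replacing `np.mean` with a weighted mean gives the weighted ensemble used later in this notebook.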

<h1 style="color:#189AB4;font-size:20px;"><strong>Submission file<strong style="color:black"></strong></strong></h1>


In [None]:
submission_df = pd.read_csv('../input/tabular-playground-series-nov-2021/sample_submission.csv')

submission0 = submission_df.copy()
submission0.loc[:,'target'] = sub1.target

submission1 = submission_df.copy()  # currently 33
submission1.loc[:,'target'] = (sub1.target*170 + sub2.target*15 + sub3.target*13 + sub4.target*12 + sub5.target*6 + sub6.target*4) /220

submission2 = submission_df.copy()
submission2.loc[:,'target'] = (sub1.target*170 + sub2.target*20 + sub3.target*18 + sub4.target*16 + sub5.target*14 + sub6.target*12) /250

submission3 = submission_df.copy()
submission3.loc[:,'target'] = (sub1.target*190  + sub2.target*23 + sub3.target*14 + sub4.target*13) /240

submission4 = submission_df.copy()
submission4.loc[:,'target'] = (sub1.target*250  + sub2.target*33 + sub3.target*17) /300

submission5 = submission_df.copy()
submission5.loc[:,'target'] = (sub1.target*93  + sub2.target*7) /100
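
The hand-written blends above all follow the same pattern: multiply each submission by a weight, sum, and divide by the total weight. A small helper can make that pattern explicit. This is a hedged sketch, not part of the original notebook; `weighted_average` and the toy Series are hypothetical names used only for illustration.

```python
import numpy as np
import pandas as pd


def weighted_average(predictions, weights):
    """Weighted mean of several prediction vectors (hypothetical helper).

    np.average divides by the sum of the weights, so the denominator
    never has to be written out by hand.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction vector")
    # One column per model, one row per test sample
    stacked = np.column_stack([np.asarray(p, dtype=float) for p in predictions])
    return np.average(stacked, axis=1, weights=weights)


# Toy stand-ins for two submissions' 'target' columns
toy_sub1 = pd.Series([0.9, 0.2, 0.6])
toy_sub2 = pd.Series([0.7, 0.4, 0.5])

# Same weighting scheme as submission5: 93 parts sub1, 7 parts sub2
blend = weighted_average([toy_sub1, toy_sub2], weights=[93, 7])
print(blend)
```

With the real data, a call such as `weighted_average([sub1.target, sub2.target], [93, 7])` would be expected to match the `submission5` formula, assuming all submissions share the same row order.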

<img src="https://i.postimg.cc/L5L1LCHN/241426370-4989674537714337-7340156749542474389-n.jpg" 
     width="400" 
     height="500" />

In [None]:
submission0.to_csv('Wmean0.csv', index=False)
submission1.to_csv('Wmean1.csv', index=False)
submission2.to_csv('Wmean2.csv', index=False)
submission3.to_csv('Wmean3.csv', index=False)
submission4.to_csv('Wmean4.csv', index=False)
submission5.to_csv('Wmean5.csv', index=False)

In [None]:
import matplotlib.pyplot as plt
import plotly.figure_factory as ff
import plotly.express as px

hist_data = [sub1.target, sub2.target, sub3.target,sub4.target, sub5.target, sub6.target, submission1.target]
group_labels = ['sub1','sub2','sub3','sub4','sub5','sub6','submission1']
fig = ff.create_distplot(hist_data, group_labels, bin_size=0.3, show_hist=False, show_rug=False)
fig.show()