# Version
* `v4`: **2-cls filter**
* `v5`: **2-cls filter** + [**1x1 bbox trick** 🔥](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/211971)

# 🌟2 Class Filter🌟
Previously I trained `YOLOv5` on `14`-class data. Because that model produces false positives (`FP`), we can tackle them simply by using a `2 class filter`. Here I use the predictions of a 2-class model (`AUC`: `0.98`) to filter out the `FP` predictions. I used `EfficientNetB6` to generate these 2-class predictions.
This should increase the score, since the number of `FP` would be reduced significantly.

**Notebooks**
* [14 class train](https://www.kaggle.com/awsaf49/vinbigdata-cxr-ad-yolov5-14-class-train)
* [14 class infer](https://www.kaggle.com/awsaf49/vinbigdata-cxr-ad-yolov5-14-class-infer)

**Dataset:**
* [YOLOv5 Labels](https://www.kaggle.com/awsaf49/vinbigdata-yolo-labels-dataset)
* [1024x1024 Dataset](https://www.kaggle.com/awsaf49/vinbigdata-1024-image-dataset)
* [512x512 Dataset](https://www.kaggle.com/awsaf49/vinbigdata-512-image-dataset)
* [256x256 Dataset](https://www.kaggle.com/awsaf49/vinbigdata-512-image-dataset)
* [Original Size '.jpg'](https://www.kaggle.com/awsaf49/vinbigdata-original-image-dataset)


# Loading Package

In [None]:
import pandas as pd
import numpy as np
from glob import glob
import shutil

In [None]:
raw_pred_2cls = pd.read_csv('../input/vinbigdata-2class-prediction/2-cls test pred.csv')
other_pred_2cls = pd.read_csv('../input/temp-submission/2-cls test pred.csv')

In [None]:
# Note: this is an alias, not a copy, so the inversion below also modifies other_pred_2cls
tmp_pred_2cls=other_pred_2cls
tmp_pred_2cls['target']=1-other_pred_2cls['target']
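
Plain assignment binds a second name to the same DataFrame, so the inversion in this cell changes `other_pred_2cls` as well. If an independent inverted frame is wanted instead, a minimal sketch on toy data could look like this (the frame `other` and its values are made up for illustration):

```python
import pandas as pd

# Toy stand-in for other_pred_2cls; ids and values are illustrative only.
other = pd.DataFrame({"image_id": ["a", "b", "c"], "target": [0.1, 0.5, 0.9]})

# .copy() gives an independent frame, so inverting it leaves `other` untouched.
inverted = other.copy()
inverted["target"] = 1 - inverted["target"]

alias = other          # plain assignment: same object, no data copied
print(alias is other)  # True
print(inverted["target"].round(2).tolist())
print(other["target"].tolist())
```

Whether the copy or the alias is the right choice depends on whether the later plots and the comparison are meant to use the inverted values.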

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the distribution of the predicted probabilities
sns.distplot(raw_pred_2cls["target"].values, color='green', label='raw_pred_2cls pred')
# sns.distplot(other_pred_2cls["target"].values, color='orange', label='other_pred_2cls pred')
plt.title("Prediction results histogram")
plt.xlim([0., 1.])
plt.legend()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the distribution of the predicted probabilities
# sns.distplot(raw_pred_2cls["target"].values, color='green', label='raw_pred_2cls pred')
sns.distplot(other_pred_2cls["target"].values, color='orange', label='other_pred_2cls pred')
plt.title("Prediction results histogram")
plt.xlim([0., 1.])
plt.legend()

In [None]:
import seaborn as sns
import matplotlib.pyplot as plt
# Plot the distributions of both sets of predicted probabilities
sns.distplot(raw_pred_2cls["target"].values, color='green', label='raw_pred_2cls pred')
sns.distplot(other_pred_2cls["target"].values, color='orange', label='other_pred_2cls pred')
plt.title("Prediction results histogram")
plt.xlim([0., 1.])
plt.legend()

In [None]:
compare = pd.merge(raw_pred_2cls, other_pred_2cls, on = 'image_id', how = 'left')
compare.head()

In [None]:
compare['dif']=compare['target_x']-compare['target_y']

In [None]:
compare.to_csv('compare.csv',index = False)

In [None]:
# The difference of two probabilities lies in [-1, 1], so show the full range
sns.distplot(compare["dif"].values, color='green', label='target_x - target_y')
plt.title("Prediction difference histogram")
plt.xlim([-1., 1.])
plt.legend()
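
Besides the histogram, a few summary numbers can make the comparison between the two sets of 2-class predictions easier to read. The sketch below uses a toy frame with the same column names as `compare`. The values and the `0.5` cut are assumptions chosen only for illustration.

```python
import pandas as pd

# Toy stand-in for the merged `compare` frame; values are illustrative only.
toy = pd.DataFrame({
    "image_id": ["a", "b", "c", "d"],
    "target_x": [0.02, 0.40, 0.97, 0.60],
    "target_y": [0.05, 0.70, 0.99, 0.10],
})
toy["dif"] = toy["target_x"] - toy["target_y"]

mean_abs_dif = toy["dif"].abs().mean()
pearson = toy["target_x"].corr(toy["target_y"])

# Fraction of images where the two models fall on opposite sides of 0.5
# (0.5 is an arbitrary cut chosen for this sketch).
disagree = ((toy["target_x"] >= 0.5) != (toy["target_y"] >= 0.5)).mean()

print(f"mean |dif| = {mean_abs_dif:.3f}, corr = {pearson:.3f}, disagree = {disagree:.2f}")
```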

# Loading csv

In [None]:
pred_14cls = pd.read_csv('../input/vinbigdata-14-class-submission-lb0154/submission.csv')
pred_2cls = pd.read_csv('../input/vinbigdata-2class-prediction/2-cls test pred.csv')
# pred_2cls = pd.read_csv('../input/temp-submission/2-cls test pred.csv')

In [None]:
pred_2cls.to_csv('pred_2cls.csv',index = False)

In [None]:
pred_14cls.head()

In [None]:
pred_raw = pd.merge(pred_14cls, pred_2cls, on = 'image_id', how = 'left')
pred_raw.head()

In [None]:
pred = pd.merge(pred_14cls, tmp_pred_2cls, on = 'image_id', how = 'left')
pred.head()
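
Because the merges use `how = 'left'`, any `image_id` missing from the 2-class file ends up with a `NaN` `target`, and `NaN` fails every threshold comparison in the filter functions further down. A hedged sketch of a sanity check on toy frames (`det` and `cls` are hypothetical stand-ins for `pred_14cls` and a 2-class prediction frame):

```python
import pandas as pd

# Toy frames standing in for pred_14cls and pred_2cls (illustrative values).
det = pd.DataFrame({
    "image_id": ["a", "b", "c"],
    "PredictionString": ["0 0.9 10 10 50 50", "14 1 0 0 1 1", "3 0.4 5 5 20 20"],
})
cls = pd.DataFrame({"image_id": ["a", "b"], "target": [0.97, 0.01]})

# validate="one_to_one" raises if either side has duplicate image_id values;
# indicator=True adds a `_merge` column showing where each row was found.
merged = pd.merge(det, cls, on="image_id", how="left",
                  validate="one_to_one", indicator=True)

missing = merged.loc[merged["_merge"] == "left_only", "image_id"].tolist()
print(missing)  # ['c']
```

An empty `missing` list on the real frames would mean that every detection row received a 2-class probability.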

# Number of `No Finding` Before 2 Class Filter

In [None]:
pred['PredictionString'].value_counts().iloc[[0]]

# 2 Class Filter + [**1x1 bbox trick** 🔥](https://www.kaggle.com/c/vinbigdata-chest-xray-abnormalities-detection/discussion/211971)

In [None]:
count_low=0
count_mid=0
count_high=0
def filter_2cls_raw(row, low_thr=0.08, high_thr=0.95):
    global count_low
    global count_mid
    global count_high
    prob = row['target']
    if prob<low_thr:
        ## Low chance of any disease: replace everything with a single `No Finding` 1x1 box
        row['PredictionString'] = '14 1 0 0 1 1'
        count_low+=1
    elif low_thr<=prob<high_thr:
        ## Uncertain: keep the detections and append a `No Finding` 1x1 box scored with prob
        row['PredictionString']+=f' 14 {prob} 0 0 1 1'
        count_mid+=1
    elif high_thr<=prob:
        ## High chance of disease, so trust the object detection model as-is
        row['PredictionString'] = row['PredictionString']
        count_high+=1
    else:
        ## Reached when none of the comparisons hold, e.g. prob is NaN
        raise ValueError('Prediction must be from [0-1]')
    return row

In [None]:
sub_raw = pred_raw.apply(filter_2cls_raw, axis=1)
# Fraction of images in each band (the test set has 3000 images)
print(count_low/len(pred_raw),count_mid/len(pred_raw),count_high/len(pred_raw))
sub_raw[60:63]
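
The global counters keep growing if the `apply` cell is re-run without re-running the cell that resets them. As an alternative, the band fractions can be computed directly from the probabilities. This is only a sketch on toy values, using the same thresholds as `filter_2cls_raw`:

```python
import pandas as pd

# Toy probabilities standing in for pred_raw['target'] (illustrative values).
probs = pd.Series([0.01, 0.07, 0.08, 0.50, 0.94, 0.95, 0.99])
low_thr, high_thr = 0.08, 0.95

# right=False makes each bin closed on the left: [0, low), [low, high), [high, inf),
# matching the `<` / `<=` comparisons in filter_2cls_raw.
bands = pd.cut(probs, bins=[0.0, low_thr, high_thr, float("inf")],
               labels=["low", "mid", "high"], right=False)

fractions = bands.value_counts(normalize=True).reindex(["low", "mid", "high"])
print(fractions)
```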

In [None]:
count_low=0
count_mid=0
count_high=0
def filter_2cls(row, low_thr=0.05, high_thr=0.99):
    global count_low
    global count_mid
    global count_high
    prob = row['target']
    if prob<low_thr:
        ## Low chance of any disease: replace everything with a single `No Finding` 1x1 box
        row['PredictionString'] = '14 1 0 0 1 1'
        count_low+=1
    elif low_thr<=prob<high_thr:
        ## Uncertain: keep the detections and append a `No Finding` 1x1 box scored with prob
        row['PredictionString']+=f' 14 {prob} 0 0 1 1'
        count_mid+=1
    elif high_thr<=prob:
        ## High chance of disease; unlike filter_2cls_raw, this variant still appends the `No Finding` box
        row['PredictionString']+=f' 14 {prob} 0 0 1 1'
        count_high+=1
    else:
        ## Reached when none of the comparisons hold, e.g. prob is NaN
        raise ValueError('Prediction must be from [0-1]')
    return row

In [None]:
sub = pred.apply(filter_2cls, axis=1)
# Fraction of images in each band (the test set has 3000 images)
print(count_low/len(pred),count_mid/len(pred),count_high/len(pred))
sub[60:63]

In [None]:
merge_sub = pd.merge(sub_raw, sub, on = 'image_id', how = 'left')

In [None]:
merge_sub[150:200]

In [None]:
merge_sub.to_csv('merge_sub.csv',index = False)

# Number of `No Finding` After 2 Class Filter

In [None]:
sub['PredictionString'].value_counts().iloc[[0]]

As shown above, applying the `2 class filter` significantly increases the number of `'No Finding'` predictions: **[549->1912]**. We can also see that the `1x1 bbox trick` improves the result.
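
The counts above rely on `value_counts().iloc[[0]]`, which returns the most frequent string and assumes that this string is the `No Finding` one. A more explicit count could look like the sketch below. The toy strings are made up, and `class_ids` is a hypothetical helper that assumes each box occupies six space-separated tokens (class id, confidence, xmin, ymin, xmax, ymax).

```python
import pandas as pd

NO_FINDING = "14 1 0 0 1 1"  # class 14 ("No finding") with a 1x1 box

# Toy submission column (illustrative values).
toy_sub = pd.Series([
    "14 1 0 0 1 1",
    "0 0.9 10 10 50 50 14 0.3 0 0 1 1",
    "14 1 0 0 1 1",
    "3 0.4 5 5 20 20",
])

def class_ids(pred_string):
    # Each box is 6 tokens: class_id, confidence, xmin, ymin, xmax, ymax.
    return pred_string.split()[0::6]

# Rows that contain nothing but the "No finding" box.
only_no_finding = int((toy_sub == NO_FINDING).sum())
# Rows that keep detections but also carry an appended "No finding" box.
with_appended = int(sum("14" in class_ids(s) and s != NO_FINDING for s in toy_sub))

print(only_no_finding, with_appended)
```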

In [None]:
sub_raw[['image_id', 'PredictionString']].to_csv('submission_raw.csv',index = False)

In [None]:
sub[['image_id', 'PredictionString']].to_csv('submission.csv',index = False)

# Result
As we can see, applying the `2 class filter` improves the result significantly, from `0.154` to `0.201`. But bear in mind that choosing the `threshold` values can be a bit tricky.
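
One way to get a feel for the thresholds before submitting is to check how many images each candidate `low_thr` would force to `No Finding`. The sketch below uses randomly generated probabilities purely as a placeholder. In the notebook the real values would come from the `target` column, and the candidate thresholds are arbitrary.

```python
import numpy as np

# Made-up probabilities with mass near 0 and 1; not real model output.
rng = np.random.default_rng(0)
probs = rng.beta(0.3, 0.6, size=3000)

fractions = {}
for low_thr in (0.01, 0.05, 0.08, 0.15):
    # Share of images whose probability falls below the candidate threshold.
    fractions[low_thr] = float(np.mean(probs < low_thr))
    print(f"low_thr={low_thr:.2f}: {fractions[low_thr]:.1%} of images forced to 'No Finding'")
```

This only shows how sensitive the band sizes are to the threshold. It says nothing about the leaderboard score, which still has to be checked by submission or on a labelled validation split.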


# Please Upvote If You Have Found This Notebook Useful 😃