<a id="overview"></a>
# Overview 🧐
[Pulmonary fibrosis is a lung disease that occurs when lung tissue becomes damaged and scarred. This thickened, stiff tissue makes it more difficult for your lungs to work properly. As pulmonary fibrosis worsens, you become progressively more short of breath.](https://www.mayoclinic.org/diseases-conditions/pulmonary-fibrosis/symptoms-causes/syc-20353690)<br>
<img src='https://i.imgur.com/edKPRik.png' width="600">
In "OSIC Pulmonary Fibrosis Progression", we need to predict the severity of a patient's decline in lung function from a CT scan of their lungs using machine learning. Specifically, we must predict both a forced vital capacity (FVC) value and a confidence measure for each patient.

# Table of contents 📖
* [Overview 🧐](#overview)
* [Acknowledgements 🙇](#acknowledgements)
* [Setup 💻](#setup)
* [Load the data 📃](#load)
* [Explore CSV data 📊](#explore)
    * [Distribution of unique patients data 😷 (Age, Sex, SmokingStatus)](#unique)
    * [Weeks distribution 📅](#weeks)
    * [FVC & Percent distribution 💨](#percent)
    * [Relationships between FVC and other variables 🤝](#fvc)
* [Linear Decay (based on EfficientNets) 📷](#efficient)
* [Multiple Quantile Regression 🌒](#quantile)
    * [Data preprocessing for Multiple Quantile Regression 🧹](#quantile_d)
    * [Build the model 🧠](#quantile_m)
    * [Cross validation 💭](#quantile_c)
* [Ensemble & Submit 📝](#submit)

<a id="acknowledgements"></a>
# Acknowledgements 🙇

- Ulrich GOUE's [Osic-Multiple-Quantile-Regression-Starter](https://www.kaggle.com/ulrich07/osic-multiple-quantile-regression-starter)
- Michael Kazachok's [Linear Decay (based on ResNet CNN)](https://www.kaggle.com/miklgr500/linear-decay-based-on-resnet-cnn)
- Wei Hao Khoong's [EfficientNets + Quantile Regression (Inference)](https://www.kaggle.com/khoongweihao/efficientnets-quantile-regression-inference)

<a id="setup"></a>
# Setup 💻
All random seeds are fixed at 42.

In [None]:
!pip install ../input/kerasapplications/keras-team-keras-applications-3b180cb -f ./ --no-index
!pip install ../input/efficientnet/efficientnet-1.1.0/ -f ./ --no-index

import efficientnet.tfkeras as efn

In [None]:
import os
import random
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, train_test_split 
import tensorflow as tf
from tensorflow.keras import Model, backend
import tensorflow.keras.layers as L
import tensorflow.keras.models as M
from tensorflow.keras.utils import Sequence
from tensorflow.keras.utils import plot_model

import pydicom
import cv2

def seed_everything(seed=2020):
    random.seed(seed)
    os.environ['PYTHONHASHSEED'] = str(seed)
    np.random.seed(seed)
    tf.random.set_seed(seed)
    
seed_everything(42)

config = tf.compat.v1.ConfigProto()
config.gpu_options.allow_growth = True
session = tf.compat.v1.Session(config=config)

palette_ro = ["#ee2f35", "#fa7211", "#fbd600", "#75c731", "#1fb86e", "#0488cf", "#7b44ab"]

ROOT = "../input/osic-pulmonary-fibrosis-progression/"

<a id="load"></a>
# Load the data 📃

In [None]:
train = pd.read_csv(ROOT + "train.csv")
test = pd.read_csv(ROOT + "test.csv")
sub = pd.read_csv(ROOT + "sample_submission.csv")

print("Training data shape: ", train.shape)
print("Test data shape: ", test.shape)

train.head(10)

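* `Patient` - a unique ID for each patient (also the name of the patient's DICOM folder)
* `Weeks` - the relative number of weeks before or after the baseline CT (may be negative)
* `FVC` - the recorded forced vital capacity (mL)
* `Percent` - percent predicted FVC: the measured FVC as a percentage of the FVC predicted from age, height, and sex
* `Age` - the patient's age
* `Sex` - the patient's sex (`Male` / `Female`)
* `SmokingStatus` - the patient's smoking status (`Never smoked` / `Ex-smoker` / `Currently smokes`)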

<a id="explore"></a>
# Explore CSV data 📊

In [None]:
train.isnull().sum()

In [None]:
test.isnull().sum()

There are no missing values in either `train` or `test`.<br>
Before EDA, we look for duplicate rows in `train`, that is, rows that share the same `Patient` and `Weeks` values.

In [None]:
dupRows_train = train[train.duplicated(subset=['Patient', 'Weeks'], keep=False)]

print("There are {} duplicate rows here ({:.2f} percent of the total).".format(len(dupRows_train), len(dupRows_train)/len(train)*100))
dupRows_train

There are not many of them, so we remove these duplicate rows (every copy, since `keep=False`).

In [None]:
train.drop_duplicates(subset=['Patient', 'Weeks'], keep=False, inplace=True)
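
To make the effect of `keep=False` concrete, here is a minimal sketch on a toy frame. The patient IDs and values are made up for illustration. With `keep=False`, every row of a duplicated `(Patient, Weeks)` pair is dropped, whereas `keep="first"` would retain one copy of each pair.

```python
import pandas as pd

# Toy data: patient "A" has two conflicting measurements at week 3.
toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B"],
    "Weeks":   [3, 3, 10, 3],
    "FVC":     [2500, 2480, 2400, 3100],
})

# keep=False drops every row that belongs to a duplicated (Patient, Weeks) pair.
dropped_all = toy.drop_duplicates(subset=["Patient", "Weeks"], keep=False)

# keep="first" would retain the first row of each duplicated pair instead.
kept_first = toy.drop_duplicates(subset=["Patient", "Weeks"], keep="first")

print(len(dropped_all), len(kept_first))
```

Dropping every copy avoids having to decide which of the conflicting measurements to trust, at the cost of losing a few rows.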

The following summary table is adapted from [Is this Malware? [EDA, FE and lgb][updated]](https://www.kaggle.com/artgor/is-this-malware-eda-fe-and-lgb-updated).<br>
For each column it shows the name, the number of unique values, the most frequent value and its count, the percentage of missing values, the percentage of rows in the largest category, and the dtype.<br>
`train` appears to contain 176 unique `Patient` IDs.

In [None]:
stats = []
for col in train.columns:
    stats.append((col,
                  train[col].nunique(),
                  train[col].value_counts().index[0],
                  train[col].value_counts().values[0],
                  train[col].isnull().sum() * 100 / train.shape[0],
                  train[col].value_counts(normalize=True, dropna=False).values[0] * 100,
                  train[col].dtype))
stats_df = pd.DataFrame(stats, columns=['Feature', 'Unique values', 'Most frequent item', 'Frequency of most frequent item', 'Percentage of missing values', 'Percentage of values in the biggest category', 'Type'])
stats_df.sort_values('Percentage of missing values', ascending=False)

<a id="unique"></a>
## Distribution of unique patients data 😷 (Age, Sex, SmokingStatus)
Let's start with the distribution of age, sex, and smoking status across the unique `Patient` IDs in `train`.

In [None]:
data = train.groupby("Patient").first().reset_index(drop=True)
data.head()

In [None]:
fig, ((ax1, ax2, ax3), (ax4, ax5, ax6)) = plt.subplots(2, 3, figsize=(16, 12))

sns.distplot(data["Age"], ax=ax1, bins=data["Age"].max()-data["Age"].min()+1, color=palette_ro[1])
ax1.annotate("Min: {:,}".format(data["Age"].min()), xy=(data["Age"].min(), 0.005), 
             xytext=(data["Age"].min()-8, 0.02),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:,}".format(data["Age"].max()), xy=(data["Age"].max(), 0.005), 
             xytext=(data["Age"].max()-2, 0.02),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3, rad=-0.2"))
ax1.axvline(x=data["Age"].median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax1.annotate("Med: {:.0f}".format(data["Age"].median()), xy=(data["Age"].median(), 0.056), 
             xytext=(data["Age"].median()-15, 0.065),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3, rad=0.25"))

sns.countplot(x="Sex", ax=ax2, data=data, palette=palette_ro[-2::-4])
sns.countplot(x="SmokingStatus", ax=ax3, data=data,
              order=["Never smoked", "Ex-smoker", "Currently smokes"], palette=palette_ro[-3::-2])

sns.distplot(data[data["Sex"]=="Male"].Age, label="Male", ax=ax4, hist=False, color=palette_ro[5])
sns.distplot(data[data["Sex"]=="Female"].Age, label="Female", ax=ax4, hist=False, color=palette_ro[1])

sns.distplot(data[data["SmokingStatus"]=="Never smoked"].Age, label="Never smoked", ax=ax5, hist=False, color=palette_ro[4])
sns.distplot(data[data["SmokingStatus"]=="Ex-smoker"].Age, label="Ex-smoker", ax=ax5, hist=False, color=palette_ro[2])
sns.distplot(data[data["SmokingStatus"]=="Currently smokes"].Age, label="Currently smokes", ax=ax5, hist=False, color=palette_ro[0])

sns.countplot(x="SmokingStatus", ax=ax6, data=data, hue="Sex",
              order=["Never smoked", "Ex-smoker", "Currently smokes"], palette=palette_ro[-2::-4])

fig.suptitle("Distribution of unique patients data", fontsize=18);

Compared with `Male` patients, `Female` patients seem to span a wider age range and are less likely to smoke. `Never smoked` patients also tend to be younger than `Ex-smoker` patients.

<a id="weeks"></a>
## Weeks distribution 📅

In [None]:
fig, ax = plt.subplots(figsize=(16, 6))

sns.distplot(train["Weeks"], ax=ax, color=palette_ro[1], bins=train["Weeks"].max()-train["Weeks"].min()+1)
ax.annotate("Min: {:,}".format(train["Weeks"].min()), xy=(train["Weeks"].min(), 0.005), 
            xytext=(train["Weeks"].min()-8, 0.008),
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3, rad=0.2"))
ax.annotate("Max: {:,}".format(train["Weeks"].max()), xy=(train["Weeks"].max(), 0.005), 
            xytext=(train["Weeks"].max()-2, 0.008),
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3, rad=-0.2"))
ax.axvline(x=0, color=palette_ro[5], linestyle="--", alpha=0.5)
ax.annotate("CT Scan", xy=(0, 0.013), 
            xytext=(-12, 0.016),
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3, rad=0.2"))
ax.axvline(x=train["Weeks"].median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax.annotate("Med: {:.0f}".format(train["Weeks"].median()), xy=(train["Weeks"].median(), 0.020), 
            xytext=(train["Weeks"].median()+2, 0.024),
            bbox=dict(boxstyle="round", fc="none", ec="gray"),
            arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.2"))

ax.set_title("Weeks Distribution", fontsize=18);

<a id="percent"></a>
## FVC & Percent distribution 💨

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.distplot(train["FVC"], ax=ax1, color=palette_ro[5], hist=False)
ax1.annotate("Min: {:,}".format(train["FVC"].min()), xy=(train["FVC"].min(), 0.00005), 
             xytext=(train["FVC"].min()-300, 0.0001),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3, rad=0.2"))
ax1.annotate("Max: {:,}".format(train["FVC"].max()), xy=(train["FVC"].max(), 0.00005), 
             xytext=(train["FVC"].max()-200, 0.0001),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3, rad=-0.2"))
ax1.axvline(x=train["FVC"].median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax1.annotate("Med: {:,.0f}".format(train["FVC"].median()), xy=(train["FVC"].median(), 0.00005), 
             xytext=(train["FVC"].median()-750, 0.0001),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))

ax1.set_title("FVC Distribution", fontsize=16);

sns.distplot(train["Percent"], ax=ax2, color=palette_ro[3], hist=False)
ax2.annotate("Min: {:.2f}".format(train["Percent"].min()), xy=(train["Percent"].min(), 0.0015), 
             xytext=(train["Percent"].min()-8, 0.0040),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3, rad=0.2"))
ax2.annotate("Max: {:.2f}".format(train["Percent"].max()), xy=(train["Percent"].max(), 0.0015), 
             xytext=(train["Percent"].max()-4, 0.0040),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                             connectionstyle="arc3, rad=-0.2"))
ax2.axvline(x=train["Percent"].median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax2.annotate("Med: {:.2f}".format(train["Percent"].median()), xy=(train["Percent"].median(), 0.0015), 
             xytext=(train["Percent"].median()-17, 0.0040),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))

ax2.set_title("Percent Distribution", fontsize=16);

<a id="fvc"></a>
## Relationships between FVC and other variables 🤝

Let's look at the relationships between the target variable, `FVC`, and the other variables.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.distplot(train[train["Sex"]=="Male"].FVC, label="Male", ax=ax1, hist=False, color=palette_ro[5])
ax1.axvline(x=train[train["Sex"]=="Male"].FVC.median(), color=palette_ro[5], linestyle="--", alpha=0.5)
ax1.annotate("Male\nMed: {:,.0f}".format(train[train["Sex"]=="Male"].FVC.median()), xy=(train[train["Sex"]=="Male"].FVC.median(), 0.0006), 
             xytext=(train[train["Sex"]=="Male"].FVC.median()+100, 0.00065),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.2"))
sns.distplot(train[train["Sex"]=="Female"].FVC, label="Female", ax=ax1, hist=False, color=palette_ro[1])
ax1.axvline(x=train[train["Sex"]=="Female"].FVC.median(), color=palette_ro[1], linestyle="--", alpha=0.5)
ax1.annotate("Female\nMed: {:,.0f}".format(train[train["Sex"]=="Female"].FVC.median()), xy=(train[train["Sex"]=="Female"].FVC.median(), 0.0008), 
             xytext=(train[train["Sex"]=="Female"].FVC.median()+100, 0.00085),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.2"))

sns.distplot(train[train["SmokingStatus"]=="Never smoked"].FVC, label="Never smoked", ax=ax2, hist=False, color=palette_ro[4])
ax2.axvline(x=train[train["SmokingStatus"]=="Never smoked"].FVC.median(), color=palette_ro[4], linestyle="--", alpha=0.5)
ax2.annotate("Never smoked\nMed: {:.0f}".format(train[train["SmokingStatus"]=="Never smoked"].FVC.median()), xy=(train[train["SmokingStatus"]=="Never smoked"].FVC.median(), 0.0005), 
             xytext=(train[train["SmokingStatus"]=="Never smoked"].FVC.median()-1000, 0.00055),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))
sns.distplot(train[train["SmokingStatus"]=="Ex-smoker"].FVC, label="Ex-smoker", ax=ax2, hist=False, color=palette_ro[2])
ax2.axvline(x=train[train["SmokingStatus"]=="Ex-smoker"].FVC.median(), color=palette_ro[2], linestyle="--", alpha=0.75)
ax2.annotate("Ex-smoker\nMed: {:.0f}".format(train[train["SmokingStatus"]=="Ex-smoker"].FVC.median()), xy=(train[train["SmokingStatus"]=="Ex-smoker"].FVC.median(), 0.00058), 
             xytext=(train[train["SmokingStatus"]=="Ex-smoker"].FVC.median()-1200, 0.0007),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.25"))
sns.distplot(train[train["SmokingStatus"]=="Currently smokes"].FVC, label="Currently smokes", ax=ax2, hist=False, color=palette_ro[0])
ax2.axvline(x=train[train["SmokingStatus"]=="Currently smokes"].FVC.median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax2.annotate("Currently smokes\nMed: {:.0f}".format(train[train["SmokingStatus"]=="Currently smokes"].FVC.median()), xy=(train[train["SmokingStatus"]=="Currently smokes"].FVC.median(), 0.0009), 
             xytext=(train[train["SmokingStatus"]=="Currently smokes"].FVC.median()+400, 0.00095),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))

ax1.set_title("Relationship between FVC and Sex", fontsize=16)
ax2.set_title("Relationship between FVC and SmokingStatus", fontsize=16);

`FVC` tends to be much lower for `Female` patients than for `Male` patients. There is also a difference in `FVC` by `SmokingStatus`, but this may be because `Female` patients are more common in the `Never smoked` group. Let's check.

In [None]:
train_m = train[train["Sex"]=="Male"].reset_index(drop=True)
train_f = train[train["Sex"]=="Female"].reset_index(drop=True)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.distplot(train_m[train_m["SmokingStatus"]=="Never smoked"].FVC, label="Never smoked", ax=ax1, hist=False, color=palette_ro[4])
ax1.axvline(x=train_m[train_m["SmokingStatus"]=="Never smoked"].FVC.median(), color=palette_ro[4], linestyle="--", alpha=0.5)
ax1.annotate("Never smoked\nMed: {:.0f}".format(train_m[train_m["SmokingStatus"]=="Never smoked"].FVC.median()), xy=(train_m[train_m["SmokingStatus"]=="Never smoked"].FVC.median(), 0.0005), 
             xytext=(train_m[train_m["SmokingStatus"]=="Never smoked"].FVC.median()-1400, 0.0006),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))
sns.distplot(train_m[train_m["SmokingStatus"]=="Ex-smoker"].FVC, label="Ex-smoker", ax=ax1, hist=False, color=palette_ro[2])
ax1.axvline(x=train_m[train_m["SmokingStatus"]=="Ex-smoker"].FVC.median(), color=palette_ro[2], linestyle="--", alpha=0.75)
ax1.annotate("Ex-smoker\nMed: {:.0f}".format(train_m[train_m["SmokingStatus"]=="Ex-smoker"].FVC.median()), xy=(train_m[train_m["SmokingStatus"]=="Ex-smoker"].FVC.median(), 0.00063), 
             xytext=(train_m[train_m["SmokingStatus"]=="Ex-smoker"].FVC.median()-1400, 0.00045),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.2"))
sns.distplot(train_m[train_m["SmokingStatus"]=="Currently smokes"].FVC, label="Currently smokes", ax=ax1, hist=False, color=palette_ro[0])
ax1.axvline(x=train_m[train_m["SmokingStatus"]=="Currently smokes"].FVC.median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax1.annotate("Currently smokes\nMed: {:.0f}".format(train_m[train_m["SmokingStatus"]=="Currently smokes"].FVC.median()), xy=(train_m[train_m["SmokingStatus"]=="Currently smokes"].FVC.median(), 0.00066), 
             xytext=(train_m[train_m["SmokingStatus"]=="Currently smokes"].FVC.median()+400, 0.00055),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))

sns.distplot(train_f[train_f["SmokingStatus"]=="Never smoked"].FVC, label="Never smoked", ax=ax2, hist=False, color=palette_ro[4])
ax2.axvline(x=train_f[train_f["SmokingStatus"]=="Never smoked"].FVC.median(), color=palette_ro[4], linestyle="--", alpha=0.5)
ax2.annotate("Never smoked\nMed: {:.0f}".format(train_f[train_f["SmokingStatus"]=="Never smoked"].FVC.median()), xy=(train_f[train_f["SmokingStatus"]=="Never smoked"].FVC.median(), 0.001), 
             xytext=(train_f[train_f["SmokingStatus"]=="Never smoked"].FVC.median()-600, 0.0015),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.2"))
sns.distplot(train_f[train_f["SmokingStatus"]=="Ex-smoker"].FVC, label="Ex-smoker", ax=ax2, hist=False, color=palette_ro[2])
ax2.axvline(x=train_f[train_f["SmokingStatus"]=="Ex-smoker"].FVC.median(), color=palette_ro[2], linestyle="--", alpha=0.75)
ax2.annotate("Ex-smoker\nMed: {:.0f}".format(train_f[train_f["SmokingStatus"]=="Ex-smoker"].FVC.median()), xy=(train_f[train_f["SmokingStatus"]=="Ex-smoker"].FVC.median(), 0.0013), 
             xytext=(train_f[train_f["SmokingStatus"]=="Ex-smoker"].FVC.median()+100, 0.0018),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=-0.25"))
sns.distplot(train_f[train_f["SmokingStatus"]=="Currently smokes"].FVC, label="Currently smokes", ax=ax2, hist=False, color=palette_ro[0])
ax2.axvline(x=train_f[train_f["SmokingStatus"]=="Currently smokes"].FVC.median(), color=palette_ro[0], linestyle="--", alpha=0.5)
ax2.annotate("Currently smokes\nMed: {:.0f}".format(train_f[train_f["SmokingStatus"]=="Currently smokes"].FVC.median()), xy=(train_f[train_f["SmokingStatus"]=="Currently smokes"].FVC.median(), 0.0035), 
             xytext=(train_f[train_f["SmokingStatus"]=="Currently smokes"].FVC.median()+200, 0.004),
             bbox=dict(boxstyle="round", fc="none", ec="gray"),
             arrowprops=dict(arrowstyle="->",
                           connectionstyle="arc3, rad=0.25"))

ax1.set_title("Relationship between FVC and SmokingStatus in Male", fontsize=16)
ax2.set_title("Relationship between FVC and SmokingStatus in Female", fontsize=16);

When restricted to `Male` patients, `FVC` does not seem to vary much with `SmokingStatus`. For `Female` patients there is a difference, but this is probably due to the small sample size (especially for `Currently smokes`). It may also be worth considering that patients who currently smoke are likely to be less severely affected.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.scatterplot(x=train["FVC"], y=train["Age"], ax=ax1,
                palette=[palette_ro[5], palette_ro[1]], hue=train["Sex"], style=train["Sex"])
sns.scatterplot(x=train["FVC"], y=train["Age"], ax=ax2,
                palette=[palette_ro[2], palette_ro[4], palette_ro[0]], hue=train["SmokingStatus"], style=train["SmokingStatus"])

fig.suptitle("Correlation between FVC and Age (Pearson Corr: {:.4f})".format(train["FVC"].corr(train["Age"])), fontsize=16);

There is almost no correlation between `FVC` and `Age`.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.scatterplot(x=train["FVC"], y=train["Weeks"], ax=ax1,
                palette=[palette_ro[5], palette_ro[1]], hue=train["Sex"], style=train["Sex"])
sns.scatterplot(x=train["FVC"], y=train["Weeks"], ax=ax2,
                palette=[palette_ro[2], palette_ro[4], palette_ro[0]], hue=train["SmokingStatus"], style=train["SmokingStatus"])

fig.suptitle("Correlation between FVC and Weeks (Pearson Corr: {:.4f})".format(train["FVC"].corr(train["Weeks"])), fontsize=16);

There is almost no correlation between `FVC` and `Weeks` either.

In [None]:
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

sns.scatterplot(x=train["FVC"], y=train["Percent"], ax=ax1,
                palette=[palette_ro[5], palette_ro[1]], hue=train["Sex"], style=train["Sex"])
sns.scatterplot(x=train["FVC"], y=train["Percent"], ax=ax2,
                palette=[palette_ro[2], palette_ro[4], palette_ro[0]], hue=train["SmokingStatus"], style=train["SmokingStatus"])

fig.suptitle("Correlation between FVC and Percent (Pearson Corr: {:.4f})".format(train["FVC"].corr(train["Percent"])), fontsize=16);

There is a positive correlation between `FVC` and `Percent`, which is not surprising since `Percent` is calculated from `FVC` and other data.

<a id="efficient"></a>
# Linear Decay (based on EfficientNets) 📷
First, let's make predictions from the DICOM images together with the tabular data. We write a function that encodes `Age`, `Sex`, and `SmokingStatus` into a NumPy array.

In [None]:
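def get_tab(df):
    # Age is shifted and scaled by 30.
    vector = [(df.Age.values[0] - 30) / 30]

    # NOTE: the CSV stores Sex as 'Male' / 'Female', so this lowercase comparison
    # never matches and every patient is encoded as 1. It is left unchanged because
    # the pretrained weights loaded later may depend on this encoding.
    if df.Sex.values[0] == 'male':
        vector.append(0)
    else:
        vector.append(1)

    # SmokingStatus is encoded as two binary values.
    if df.SmokingStatus.values[0] == 'Never smoked':
        vector.extend([0,0])
    elif df.SmokingStatus.values[0] == 'Ex-smoker':
        vector.extend([1,1])
    elif df.SmokingStatus.values[0] == 'Currently smokes':
        vector.extend([0,1])
    else:
        vector.extend([1,0])
    return np.array(vector)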

Next, we organize the data by each unique `Patient` ID.<br>
Inside the `for` loop, we extract `FVC` into `fvc` and `Weeks` into `weeks` as NumPy arrays.
We stack `weeks` with an array of ones and transpose the result to obtain the design matrix `c`.
We then solve the least-squares problem for `c` and `fvc`, and store the slope in `A`.
Finally, we store the output of `get_tab` in `TAB` and append the `Patient` ID to `P`.

In [None]:
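A = {}    # per-patient slope of FVC over weeks
TAB = {}  # per-patient tabular feature vector
P = []    # patient IDs
for p in tqdm(train.Patient.unique()):
    sub = train.loc[train.Patient == p, :]
    fvc = sub.FVC.values
    weeks = sub.Weeks.values
    # Design matrix [weeks, 1] for the straight-line fit fvc = a * weeks + b
    c = np.vstack([weeks, np.ones(len(weeks))]).T
    a, b = np.linalg.lstsq(c, fvc, rcond=None)[0]

    A[p] = a
    TAB[p] = get_tab(sub)
    P.append(p)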

fig, ax = plt.subplots(1, 1, figsize=(16, 6))
sns.distplot(list(A.values()), ax=ax, color=palette_ro[1]);
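
The per-patient fit above can be checked on a toy series. This is a minimal sketch with made-up measurements in which FVC falls by exactly 5 mL per week, so the recovered slope and intercept are known in advance. The names `design`, `slope`, and `intercept` are illustrative and not part of the notebook.

```python
import numpy as np

# Hypothetical measurements: FVC falls by exactly 5 mL per week from 3000 mL.
weeks = np.array([0.0, 10.0, 20.0, 30.0])
fvc = 3000.0 - 5.0 * weeks

# Design matrix with a column of weeks and a column of ones (for the intercept).
design = np.vstack([weeks, np.ones(len(weeks))]).T

# lstsq returns (solution, residuals, rank, singular values);
# the solution holds the slope and the intercept, in that order.
slope, intercept = np.linalg.lstsq(design, fvc, rcond=None)[0]

print(round(float(slope), 3), round(float(intercept), 3))
```

A negative slope means that FVC declines over time, which is the quantity the image model is asked to predict.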

We create a function that reads a DICOM file from the given path.
To make the data easier to handle in deep learning, we divide the pixel values by 2048 (`2**11`), then crop and resize the image.

In [None]:
# def get_img(path):
#     d = pydicom.dcmread(path)
#     return cv2.resize(d.pixel_array / 2**11, (528, 528))    # changed from 512

# https://www.kaggle.com/allunia/pulmonary-dicom-preprocessing
def get_img(path, new_shape=(528, 528)):
    d = pydicom.dcmread(path)
    scan = d.pixel_array / 2**11

    # Center-crop to 512 x 512: rows come from shape[0], columns from shape[1].
    top = int((scan.shape[0]-512)/2)
    bottom = int((scan.shape[0]+512)/2)
    left = int((scan.shape[1]-512)/2)
    right = int((scan.shape[1]+512)/2)

    img = scan[top:bottom, left:right]
    cropped_resized_scan = cv2.resize(img, new_shape, interpolation=cv2.INTER_LANCZOS4)
    return cropped_resized_scan

# get_img("../input/osic-pulmonary-fibrosis-progression/train/ID00007637202177411956430/1.dcm")
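
The cropping step can be tried without any DICOM file. The sketch below uses a synthetic, deliberately non-square array in place of `d.pixel_array`, and `center_crop` is a hypothetical helper rather than part of the notebook. It assumes the scan is at least 512 pixels in each dimension.

```python
import numpy as np

def center_crop(scan, size=512):
    """Return the central size x size window of a 2-D array (hypothetical helper)."""
    rows, cols = scan.shape
    top = (rows - size) // 2
    left = (cols - size) // 2
    return scan[top:top + size, left:left + size]

# Synthetic stand-in for d.pixel_array, deliberately non-square.
raw = np.arange(600 * 768, dtype=np.float64).reshape(600, 768)

# Same scaling as in get_img: divide by 2**11.
scaled = raw / 2**11

cropped = center_crop(scaled)
print(cropped.shape)
```

Taking the row offsets from `shape[0]` and the column offsets from `shape[1]` keeps the window centered even when the scan is not square.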

In [None]:
class IGenerator(Sequence):
    BAD_ID = ['ID00011637202177653955184', 'ID00052637202186188008618']
    def __init__(self, keys, a, tab, batch_size=32):
        self.keys = [k for k in keys if k not in self.BAD_ID]
        self.a = a
        self.tab = tab
        self.batch_size = batch_size
        
        self.train_data = {}
        for p in train.Patient.unique():
            self.train_data[p] = os.listdir(f'../input/osic-pulmonary-fibrosis-progression/train/{p}/')
    
    def __len__(self):
        return 1000
    
    def __getitem__(self, idx):
        x = []
        a, tab = [], [] 
        keys = np.random.choice(self.keys, size = self.batch_size)
        for k in keys:
            try:
                i = np.random.choice(self.train_data[k], size=1)[0]
                img = get_img(f'../input/osic-pulmonary-fibrosis-progression/train/{k}/{i}')
                x.append(img)
                a.append(self.a[k])
                tab.append(self.tab[k])
            except Exception as e:
                # Skip unreadable files, but report which patient failed and why.
                print(k, e)
       
        x,a,tab = np.array(x), np.array(a), np.array(tab)
        x = np.expand_dims(x, axis=-1)
        return [x, tab] , a

Let's create the model.
The image input is passed through `EfficientNetB6` and `GlobalAveragePooling2D`, and the preprocessed CSV features are passed through a Gaussian noise layer. The two tensors are concatenated, and a dropout layer followed by a single-unit dense layer produces the output.
We load pretrained model weights rather than training from scratch.

In [None]:
%%time
def get_efficientnet(model, shape):
    # Map each name to its constructor so that only the requested backbone is built.
    models_dict = {
        'b0': efn.EfficientNetB0,
        'b1': efn.EfficientNetB1,
        'b2': efn.EfficientNetB2,
        'b3': efn.EfficientNetB3,
        'b4': efn.EfficientNetB4,
        'b5': efn.EfficientNetB5,
        'b6': efn.EfficientNetB6,
        'b7': efn.EfficientNetB7
    }
    return models_dict[model](input_shape=shape, weights=None, include_top=False)

def build_model(shape=(528, 528, 1), model_class=None):    # changed from 512
    inp = L.Input(shape=shape)
    base = get_efficientnet(model_class, shape)
    x = base(inp)
    x = L.GlobalAveragePooling2D()(x)
    inp2 = L.Input(shape=(4,))
    x2 = L.GaussianNoise(0.2)(inp2)
    x = L.Concatenate()([x, x2]) 
    x = L.Dropout(0.32)(x)    # changed from 0.4
    x = L.Dense(1)(x)
    model = Model([inp, inp2] , x)
    
    weights = [w for w in os.listdir('../input/osic-model-weights') if model_class in w][0]
    model.load_weights('../input/osic-model-weights/' + weights)
    return model

model_classes = ["b6"] #['b0','b1','b2','b3','b4','b5','b6','b7']    # changed from b5
models = [build_model(model_class=m, shape=(528, 528, 1)) for m in model_classes]    # changed from 512
print('Number of models: ' + str(len(models)))

In [None]:
plot_model(models[0])

We define a modified version of the Laplace Log Likelihood, which is the evaluation metric for this competition.

In [None]:
def score(fvc_true, fvc_pred, sigma):
    sigma_clip = np.maximum(sigma, 70)
    delta = np.abs(fvc_true - fvc_pred)
    delta = np.minimum(delta, 1000)
    sq2 = np.sqrt(2)
    metric = (delta / sigma_clip)*sq2 + np.log(sigma_clip * sq2)
    return np.mean(metric)
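
The two clipping steps in the function above are easy to verify on single values. The sketch below repeats the same computation under the hypothetical name `laplace_score` so that it runs on its own, and uses made-up FVC values. As written, the function computes `sqrt(2) * delta / sigma + log(sqrt(2) * sigma)`, which is smaller for better predictions.

```python
import numpy as np

def laplace_score(fvc_true, fvc_pred, sigma):
    """Standalone copy of the notebook's score function (hypothetical name)."""
    sigma_clip = np.maximum(sigma, 70)       # confidence is floored at 70
    delta = np.abs(fvc_true - fvc_pred)
    delta = np.minimum(delta, 1000)          # absolute error is capped at 1000
    sq2 = np.sqrt(2)
    metric = (delta / sigma_clip) * sq2 + np.log(sigma_clip * sq2)
    return np.mean(metric)

# A confidence below 70 is treated exactly like 70.
low_sigma = laplace_score(3000, 2900, 10)
floor_sigma = laplace_score(3000, 2900, 70)

# An error above 1000 is treated exactly like 1000.
huge_error = laplace_score(3000, 8000, 100)
capped_error = laplace_score(3000, 2000, 100)

print(np.isclose(low_sigma, floor_sigma), np.isclose(huge_error, capped_error))
```

Because of the floor, reporting a confidence smaller than 70 brings no benefit, and because of the cap, a single very bad prediction cannot dominate the average.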

Now we use the model to make predictions.

In [None]:
tr_p, vl_p = train_test_split(P, 
                              shuffle=True, 
                              train_size=0.8)

subs = []
for model in models:
    metric = []
    for q in tqdm(range(1, 10)):
        m = []
        for p in vl_p:
            x = [] 
            tab = [] 

            if p in ['ID00011637202177653955184', 'ID00052637202186188008618']:
                continue

            ldir = os.listdir(f'../input/osic-pulmonary-fibrosis-progression/train/{p}/')
            for i in ldir:
                if int(i[:-4]) / len(ldir) < 0.8 and int(i[:-4]) / len(ldir) > 0.15:
                    x.append(get_img(f'../input/osic-pulmonary-fibrosis-progression/train/{p}/{i}')) 
                    tab.append(get_tab(train.loc[train.Patient == p, :])) 
            if len(x) < 1:
                continue
            tab = np.array(tab) 

            x = np.expand_dims(x, axis=-1) 
            _a = model.predict([x, tab]) 
            a = np.quantile(_a, q / 10)

            percent_true = train.Percent.values[train.Patient == p]
            fvc_true = train.FVC.values[train.Patient == p]
            weeks_true = train.Weeks.values[train.Patient == p]

            fvc = a * (weeks_true - weeks_true[0]) + fvc_true[0]
            percent = percent_true[0] - a * abs(weeks_true - weeks_true[0])
            m.append(score(fvc_true, fvc, percent))
        print(np.mean(m))
        metric.append(np.mean(m))

    q = (np.argmin(metric) + 1)/ 10

    sub = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/sample_submission.csv') 
    test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv') 
    A_test, B_test, P_test, W, FVC = {}, {}, {}, {}, {} 
    STD, WEEK = {}, {} 
    for p in test.Patient.unique():
        x = [] 
        tab = [] 
        ldir = os.listdir(f'../input/osic-pulmonary-fibrosis-progression/test/{p}/')
        for i in ldir:
            if int(i[:-4]) / len(ldir) < 0.8 and int(i[:-4]) / len(ldir) > 0.15:
                x.append(get_img(f'../input/osic-pulmonary-fibrosis-progression/test/{p}/{i}')) 
                tab.append(get_tab(test.loc[test.Patient == p, :])) 
        if len(x) <= 1:
            continue
        tab = np.array(tab) 

        x = np.expand_dims(x, axis=-1) 
        _a = model.predict([x, tab]) 
        a = np.quantile(_a, q)
        A_test[p] = a
        B_test[p] = test.FVC.values[test.Patient == p] - a*test.Weeks.values[test.Patient == p]
        P_test[p] = test.Percent.values[test.Patient == p] 
        WEEK[p] = test.Weeks.values[test.Patient == p]

    for k in sub.Patient_Week.values:
        p, w = k.split('_')
        w = int(w) 

        fvc = A_test[p] * w + B_test[p]
        sub.loc[sub.Patient_Week == k, "FVC"] = fvc
        sub.loc[sub.Patient_Week == k, "Confidence"] = (
            P_test[p] - A_test[p] * abs(WEEK[p] - w)
        )

    _sub = sub[["Patient_Week","FVC","Confidence"]].copy()
    subs.append(_sub)
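
The prediction step above extrapolates each patient's FVC along a straight line whose slope `a` comes from the image model, and derives the confidence from the baseline `Percent`. A minimal sketch of that arithmetic, using a hypothetical helper `linear_decay` and made-up numbers (not values from the dataset):

```python
import numpy as np

def linear_decay(a, base_week, base_fvc, base_percent, weeks):
    """Extrapolate FVC linearly from the baseline visit (illustrative helper)."""
    weeks = np.asarray(weeks, dtype=float)
    fvc = a * (weeks - base_week) + base_fvc
    # Same form as the notebook: Percent minus slope times distance in weeks.
    confidence = base_percent - a * np.abs(weeks - base_week)
    return fvc, confidence

# Hypothetical patient: slope of -5 per week, baseline FVC 3000, Percent 80.
fvc, confidence = linear_decay(
    a=-5.0, base_week=0, base_fvc=3000.0, base_percent=80.0, weeks=[0, 10, 20]
)
print(fvc)
print(confidence)
```

With a negative slope, the confidence value grows with the distance from the baseline week, which expresses larger uncertainty for predictions further from the known measurement.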

In [None]:
N = len(subs)
sub = subs[0].copy() # ref
sub["FVC"] = 0
sub["Confidence"] = 0
for i in range(N):
    # Average over all models: index with i, not 0
    sub["FVC"] += subs[i]["FVC"] * (1/N)
    sub["Confidence"] += subs[i]["Confidence"] * (1/N)

In [None]:
sub.head()

In [None]:
sub[["Patient_Week", "FVC", "Confidence"]].to_csv("submission_img.csv", index=False)

In [None]:
img_sub = sub[["Patient_Week","FVC","Confidence"]].copy()

Linear Decay predictions have been made!<br>
<font color="RoyalBlue">Linear Decay での予測が作成できました！</font>

<a id="quantile"></a>
# Multiple Quantile Regression 🌒
<a id="quantile_d"></a>
## Data preprocessing for Multiple Quantile Regression 🧹

Next, let's pre-process the data for multiple quantile regression. First, check the format of the `sub` (sample_submission.csv).<br>
<font color="RoyalBlue">次に、重分位点回帰のためのデータの前処理を行っていきましょう。まず、sub (sample_submission.csv) の形式をチェックします。</font>

In [None]:
sub = pd.read_csv(ROOT + "sample_submission.csv")
sub.head()

Split `Patient_Week` in `sub` into `Patient` and `Weeks` to match the format of `train` and `test`. Then merge `test` into `sub` on `Patient`, so that every row to be predicted carries that patient's columns from `test`. This makes the prediction step easier to handle.<br>
<font color="RoyalBlue">sub の Patient_Week を Patient と Weeks に分割し、train や test の形式に合わせます。そして、sub に test を Patient に紐づけて結合します。こうすれば、予測時に簡単に処理を行えるようになります。</font>

In [None]:
sub['Patient'] = sub['Patient_Week'].apply(lambda x:x.split('_')[0])
sub['Weeks'] = sub['Patient_Week'].apply(lambda x: int(x.split('_')[-1]))
sub =  sub[['Patient', 'Weeks', 'Confidence', 'Patient_Week']]
sub = sub.merge(test.drop('Weeks', axis=1), on="Patient")
sub.head()
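
The split-and-merge step can be tried on toy data. The frames below are made-up stand-ins for `sample_submission.csv` and `test.csv`, and `rsplit` with `n=1` is used here as an assumed alternative that keeps negative week numbers such as `-12` intact:

```python
import pandas as pd

# Toy stand-ins (made-up values, not from the dataset).
toy_sub = pd.DataFrame({
    "Patient_Week": ["ID001_-12", "ID001_0", "ID002_5"],
    "FVC": [2000, 2000, 2000],
    "Confidence": [100, 100, 100],
})
toy_test = pd.DataFrame({
    "Patient": ["ID001", "ID002"],
    "Weeks": [6, 15],
    "FVC": [3020, 2739],
    "Age": [73, 68],
})

# Split on the last underscore only: "ID001_-12" -> ("ID001", "-12").
parts = toy_sub["Patient_Week"].str.rsplit("_", n=1, expand=True)
toy_sub["Patient"] = parts[0]
toy_sub["Weeks"] = parts[1].astype(int)

# Drop the placeholder FVC, then attach each patient's row from the test frame.
toy_sub = toy_sub[["Patient", "Weeks", "Confidence", "Patient_Week"]]
merged = toy_sub.merge(toy_test.drop("Weeks", axis=1), on="Patient")
print(merged)
```

After the merge, `Weeks` is the week to predict, while `FVC` and `Age` are the patient's known values from the test frame.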

Add a `WHERE` column to all the dataframes.<br>
<font color="RoyalBlue">すべてのデータフレームに Where 列を追加します。</font><br>
Then, in order to process the `train`, `test` and `sub` at the same time, these three are concatenated vertically into `data`.<br>
<font color="RoyalBlue">そして、train, test, sub を同時にデータ処理するために、縦方向に結合して data としておきます。</font>

In [None]:
train['WHERE'] = 'train'
test['WHERE'] = 'val'
sub['WHERE'] = 'test'
data = pd.concat([train, test, sub])    # DataFrame.append was removed in pandas 2.0

print(train.shape, test.shape, sub.shape, data.shape)
print(train.Patient.nunique(), test.Patient.nunique(), sub.Patient.nunique(), data.Patient.nunique())

data.head(10)

Add a `min_week` column holding the earliest week per `Patient`. Rows from `sub` are set to NaN first, so only weeks from `train` and `test` are considered.<br>
<font color="RoyalBlue">Patient ごとの最小の週数を示す min_week 列を追加します。</font>

In [None]:
data['min_week'] = data['Weeks']
data.loc[data.WHERE=='test', 'min_week'] = np.nan
data['min_week'] = data.groupby('Patient')['min_week'].transform('min')

data.head(10)
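
To see how masking and `transform('min')` interact, here is a small sketch on made-up rows. The frame `toy` is hypothetical and only mirrors the `Patient`, `Weeks` and `WHERE` columns:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Patient": ["A", "A", "A", "B", "B"],
    "Weeks":   [4, 10, -12, 7, 0],
    "WHERE":   ["train", "train", "test", "val", "test"],
})

# Start from Weeks, but hide the rows that only exist in the submission grid.
toy["min_week"] = toy["Weeks"].astype(float)
toy.loc[toy["WHERE"] == "test", "min_week"] = np.nan

# NaN is skipped by min, and transform broadcasts the result to every row.
toy["min_week"] = toy.groupby("Patient")["min_week"].transform("min")
print(toy)
```

Patient A gets `min_week = 4` on all rows, including the masked week `-12` row, because that week was never actually measured.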

From here, calculate `base_FVC` (= the `FVC` of `Patient` at `min_week`) and the `base_week` (= how many weeks have passed since `min_week`).<br>
<font color="RoyalBlue">ここから、base_FVC（＝Patient の min_week 時の FVC）と base_week（＝min_week から何週経ったときのデータか）を算出していきます。</font><br>


First, extract the rows of `data` where `Weeks` equals `min_week` and call the result `base`.
Keep only the `Patient` and `FVC` columns of `base`, and rename `FVC` to `base_FVC`.
Then create a new `nb` column with every value set to 1.
Group `base` by `Patient` and replace `nb` with its cumulative sum within each group.
Keep only the rows of `base` where `nb` is 1.
This removes duplicate `Patient` rows, leaving exactly one `base_FVC` per patient.
Finally, drop the `nb` column.<br>
<font color="RoyalBlue">まず、data の Weeks が min_week である行を抽出し、base とします。<br>
base から Patient, FVC 列だけを抜き出します。<br>
列名を FVC から base_FVC に変更します。<br>
そして新たに nb 列を作り、値をすべて1とします。<br>
base を Patient でグループ化し、nb 列を指定して累積和を計算します。<br>
base から nb 列が1の行のみを抽出し、base を置き換えます。<br>
これにより、base_FVC が載っているデータフレーム base から Patient の重複を無くすことができます。<br>
nb 列は削除しておきましょう。</font>

In [None]:
base = data.loc[data.Weeks == data.min_week]
base = base[['Patient', 'FVC']].copy()
base.columns = ['Patient', 'base_FVC']
base['nb'] = 1
base['nb'] = base.groupby('Patient')['nb'].transform('cumsum')
base = base[base.nb==1]
base.drop('nb', axis=1, inplace=True)

base.head()
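
The cumulative-count trick keeps the first matching row per patient. As a sketch of the same idea on made-up data, `drop_duplicates(keep="first")` is used below as an assumed equivalent; the frame `toy` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "Patient":  ["A", "A", "A", "B", "B"],
    "Weeks":    [4, 4, 10, 7, 9],
    "min_week": [4, 4, 4, 7, 7],
    "FVC":      [3000, 3000, 2900, 2500, 2450],
})

# Rows measured at the patient's earliest week (patient A has two of them).
toy_base = toy.loc[toy["Weeks"] == toy["min_week"], ["Patient", "FVC"]]
toy_base = toy_base.rename(columns={"FVC": "base_FVC"})

# Keep one baseline row per patient.
toy_base = toy_base.drop_duplicates(subset="Patient", keep="first")
print(toy_base)
```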

Next, merge `base` into `data` on `Patient`.
Then create a `base_week` column in `data`, defined as `Weeks` minus `min_week`.
`data` now has both the `base_FVC` and `base_week` columns.
Finally, delete the `base` dataframe, which is no longer needed.<br>
<font color="RoyalBlue">次に、data に base を Patient に紐づけて結合します。<br>
data に base_week 列を作成し、Weeks から min_week を引いた値とします。<br>
これで、data に base_FVC 列と base_week 列が追加されます。<br>
base は削除しておきましょう。</font>

In [None]:
data = data.merge(base, on='Patient', how='left')
data['base_week'] = data['Weeks'] - data['min_week']
del base

data.head(10)

Perform one-hot-encoding of `Sex` and `SmokingStatus`.<br>
<font color="RoyalBlue">Sex と SmokingStatus のワンホットエンコーディングを行います。</font>

In [None]:
categorical_features = ['Sex', 'SmokingStatus']
features_nn = []
for col in categorical_features:
    for mod in data[col].unique():
        features_nn.append(mod)
        data[mod] = (data[col] == mod).astype(int)

data.head(10)
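
The loop above creates one 0/1 column per category level and names it after the level itself. A small sketch on made-up rows (the frame `toy` and the list `feature_names` are hypothetical):

```python
import pandas as pd

toy = pd.DataFrame({
    "Sex": ["Male", "Female", "Male"],
    "SmokingStatus": ["Ex-smoker", "Never smoked", "Currently smokes"],
})

feature_names = []
for col in ["Sex", "SmokingStatus"]:
    # unique() returns levels in order of first appearance.
    for level in toy[col].unique():
        feature_names.append(level)
        toy[level] = (toy[col] == level).astype(int)

print(feature_names)
print(toy[feature_names])
```

Since the column order follows first appearance in the data, the feature order can change if the rows are reordered, which is worth keeping in mind when reusing saved weights.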

Normalize `Percent`, `Age`, `base_FVC`, and `base_week`.<br>
<font color="RoyalBlue">Percent, Age, base_FVC, and base_week の正規化を行います。</font>

In [None]:
data['Percent_n'] = (data['Percent'] - data['Percent'].min() ) / ( data['Percent'].max() - data['Percent'].min() )
data['Age_n'] = (data['Age'] - data['Age'].min() ) / ( data['Age'].max() - data['Age'].min() )
data['base_FVC_n'] = (data['base_FVC'] - data['base_FVC'].min() ) / ( data['base_FVC'].max() - data['base_FVC'].min() )
data['base_week_n'] = (data['base_week'] - data['base_week'].min() ) / ( data['base_week'].max() - data['base_week'].min() )
features_nn += ['Age_n', 'Percent_n', 'base_week_n', 'base_FVC_n']

print(features_nn)
data.head(10)
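
Each of the four lines above applies the same min-max formula, `(x - min) / (max - min)`, which maps a column onto the range 0 to 1. A minimal sketch with a hypothetical helper `min_max` and made-up ages:

```python
import pandas as pd

def min_max(series):
    """Scale a numeric Series to [0, 1] (illustrative helper)."""
    return (series - series.min()) / (series.max() - series.min())

toy = pd.DataFrame({"Age": [50, 65, 80]})
toy["Age_n"] = min_max(toy["Age"])
print(toy)
```

In the notebook the minimum and maximum are computed over `train`, `test` and `sub` combined, so all three share one scale. A constant column would divide zero by zero and produce NaN.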

Now that the process is done, let's split data into `train`, `test` and `sub` using the `WHERE` column, and remove `data`.<br>
<font color="RoyalBlue">処理が終わったので、WHERE 列を使って data を train, test, sub に分割し直しましょう。data は削除しておきます。</font>

In [None]:
train = data.loc[data.WHERE=='train']
test = data.loc[data.WHERE=='val']
sub = data.loc[data.WHERE=='test']
del data

train.shape, test.shape, sub.shape

<a id="quantile_m"></a>
## Build the model 🧠

This competition is evaluated on a modified version of the Laplace Log Likelihood. For each true FVC measurement, you will predict both an FVC and a confidence measure (standard deviation σ). The metric is computed as:<br>
<font color="RoyalBlue">このコンペでは、ラプラス対数尤度の修正版で評価されます。真の各 FVC 測定について、FVC と信頼度（標準偏差 σ）の両方を予測します。メトリックは次のように計算されます。</font><br><br>

$\large \sigma_{clipped} = max(\sigma, 70),$<br>
$\large \Delta = min ( |FVC_{true} - FVC_{predicted}|, 1000 ),$<br>
$\Large metric = -   \frac{\sqrt{2} \Delta}{\sigma_{clipped}} - \ln ( \sqrt{2} \sigma_{clipped} ).$<br>

In the following code, `C1` is the lower bound (70) applied to the confidence σ and `C2` is the upper bound (1000) applied to the absolute error, both as defined by the modified Laplace Log Likelihood.<br>
<font color="RoyalBlue">下記のコードにおいて、C1 は評価指標である修正版ラプラス対数尤度における信頼度のクリッピングの値、C2 は誤差の閾値です。</font><br>

The `score` function implements this evaluation metric: it takes the true and predicted values of the target variable and returns a score based on the modified Laplace Log Likelihood. Its sign is flipped relative to the formula above, so lower is better and it can be minimized as a loss.
The `qloss` function is the pinball loss, the loss used to train a multiple quantile regression model.
The `mloss` function takes a weight `_lambda` and returns a loss function that computes the weighted sum `_lambda * qloss + (1 - _lambda) * score`.<br>
<font color="RoyalBlue">score 関数はこの評価メトリックです。コンペの目的変数の真の値と予測値を受け取り、修正版ラプラス対数尤度に基づいたスコアを返します。<br>
qloss 関数は、重分位点回帰予測が学習するときに使用する損失関数であるピンボールロス関数です。<br>
mloss 関数は割合 _lambda を受け取り、その割合に応じて score 関数と qloss 関数の戻り値を合計する関数を返します。</font><br><br>
Here, we define confidence as the difference between the predicted values at the 0.2 and 0.8 quantiles.<br>
<font color="RoyalBlue">ここで、信頼度を0.2分位点と0.8分位点における予測値の差として定義しています。</font>

In [None]:
C1, C2 = tf.constant(70, dtype="float32"), tf.constant(1000, dtype="float32")
#=============================#
def score(y_true, y_pred):
    y_true = tf.dtypes.cast(y_true, tf.float32)    # assign the result; cast is not in-place
    y_pred = tf.dtypes.cast(y_pred, tf.float32)
    sigma = y_pred[:, 2] - y_pred[:, 0]
    fvc_pred = y_pred[:, 1]

    sigma_clip = tf.maximum(sigma, C1)
    delta = tf.abs(y_true[:, 0] - fvc_pred)
    delta = tf.minimum(delta, C2)
    sq2 = tf.sqrt( tf.dtypes.cast(2, dtype=tf.float32) )
    metric = (delta / sigma_clip)*sq2 + tf.math.log(sigma_clip * sq2)
    return backend.mean(metric)
#============================#
def qloss(y_true, y_pred):
    # Pinball loss for multiple quantiles
    qs = [0.2, 0.5, 0.8]
    q = tf.constant(np.array([qs]), dtype=tf.float32)
    e = y_true - y_pred
    v = tf.maximum(q*e, (q-1)*e)
    return backend.mean(v)
#=============================#
def mloss(_lambda):
    def loss(y_true, y_pred):
        return _lambda * qloss(y_true, y_pred) + (1 - _lambda)*score(y_true, y_pred)
    return loss
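
To make the pinball loss concrete, here is a NumPy sketch of the same formula as `qloss`, under the hypothetical name `pinball`, evaluated on made-up numbers:

```python
import numpy as np

def pinball(y_true, y_pred, qs=(0.2, 0.5, 0.8)):
    """Mean pinball loss; y_pred has one column per quantile in qs."""
    q = np.asarray(qs, dtype=float)
    # Broadcast the targets against the quantile columns.
    e = np.asarray(y_true, dtype=float)[:, None] - np.asarray(y_pred, dtype=float)
    return np.mean(np.maximum(q * e, (q - 1) * e))

# One sample with true value 100 and predicted quantiles 90 / 100 / 110.
loss = pinball([100.0], [[90.0, 100.0, 110.0]])

# Asymmetry at q = 0.8: predicting too low costs more than predicting too high.
too_low = pinball([100.0], [[90.0]], qs=(0.8,))
too_high = pinball([100.0], [[110.0]], qs=(0.8,))
print(loss, too_low, too_high)
```

For the 0.8 quantile, an under-prediction by 10 costs 8 while an over-prediction by 10 costs 2. That asymmetry is what pushes each output toward its own quantile.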

In [None]:
def make_model():
    inp = L.Input((len(features_nn),), name="Patient")    # shape passed as a tuple
    x = L.Dense(100, activation="relu", name="d1")(inp)
    x = L.Dense(100, activation="relu", name="d2")(x)
    p1 = L.Dense(3, activation="linear", name="p1")(x)
    p2 = L.Dense(3, activation="relu", name="p2")(x)
    preds = L.Lambda(lambda x: x[0] + tf.cumsum(x[1], axis=1), 
                     name="preds")([p1, p2])
    
    model = M.Model(inp, preds, name="NeuralNet")
    model.compile(loss=mloss(0.64),    # changed from 0.8
                  optimizer=tf.keras.optimizers.Adam(learning_rate=0.1, decay=0.01),
                  metrics=[score])
    return model

model = make_model()
model.summary()

In [None]:
plot_model(model)
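
The `preds` layer adds the cumulative sum of the non-negative `p2` outputs to the `p1` outputs. The NumPy sketch below reproduces that arithmetic with a hypothetical helper `combine` and made-up activations. Because `p1` has three free components, the cumulative sum encourages ordered quantiles but does not guarantee them:

```python
import numpy as np

def combine(p1, p2):
    """Mimic the output head: p1 + cumsum(relu(p2)) along the quantile axis."""
    p2 = np.maximum(p2, 0.0)              # ReLU keeps the increments non-negative
    return p1 + np.cumsum(p2, axis=1)

# Equal p1 components: the three quantiles come out ordered.
ordered = combine(np.array([[1000.0, 1000.0, 1000.0]]),
                  np.array([[0.0, 50.0, 30.0]]))

# Decreasing p1 components: the quantiles cross and q80 - q20 is negative.
crossed = combine(np.array([[1100.0, 1000.0, 900.0]]),
                  np.array([[0.0, 10.0, 10.0]]))

print(ordered, crossed, crossed[0, 2] - crossed[0, 0])
```

This is presumably why the share of non-negative widths, `(sigma_unc>=0).mean()`, is checked after cross validation.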

<a id="quantile_c"></a>
## Cross validation 💭

Use the model we created to cross-validate.<br>
<font color="RoyalBlue">作成したモデルを使ってクロスバリデーションを行いましょう。</font>

In [None]:
X_train = train[features_nn].values
X_test = sub[features_nn].values

y_train = train['FVC'].values

oof_train = np.zeros((X_train.shape[0], 3))
y_preds = np.zeros((X_test.shape[0], 3))

In [None]:
BATCH_SIZE = 128
EPOCHS = 804    # changed from 800
NFOLD = 5

kf = KFold(n_splits=NFOLD)

In [None]:
%%time
for fold_id, (tr_idx, va_idx) in enumerate(kf.split(X_train)):
    print(f"FOLD {fold_id+1}")
    model = make_model()
    model.fit(X_train[tr_idx], y_train[tr_idx], batch_size=BATCH_SIZE, epochs=EPOCHS, 
              validation_data=(X_train[va_idx], y_train[va_idx]), verbose=0)
    print("train", model.evaluate(X_train[tr_idx], y_train[tr_idx], verbose=0, batch_size=BATCH_SIZE))
    print("val", model.evaluate(X_train[va_idx], y_train[va_idx], verbose=0, batch_size=BATCH_SIZE))
    oof_train[va_idx] = model.predict(X_train[va_idx], batch_size=BATCH_SIZE, verbose=0)
    y_preds += model.predict(X_test, batch_size=BATCH_SIZE, verbose=0) / NFOLD
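
The loop above fills `oof_train` with out-of-fold predictions and averages the test predictions over the folds. The bookkeeping can be sketched with a trivial stand-in "model" that predicts the training mean; all arrays and names below are made up for illustration:

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(20, dtype=float).reshape(10, 2)   # 10 toy samples
y = np.arange(10, dtype=float)
X_new = np.zeros((3, 2))                        # toy "test" rows

n_fold = 5
oof = np.full(len(y), np.nan)                   # one out-of-fold prediction per row
new_pred = np.zeros(len(X_new))

for tr_idx, va_idx in KFold(n_splits=n_fold).split(X):
    fold_mean = y[tr_idx].mean()                # stand-in for model.fit(...)
    oof[va_idx] = fold_mean                     # predicted only by a model that never saw the row
    new_pred += np.full(len(X_new), fold_mean) / n_fold

print(oof)
print(new_pred)
```

Every row receives exactly one out-of-fold prediction, so metrics computed on `oof` estimate performance on unseen data. Note that splitting by row, as here and in the notebook, can place measurements from the same patient in both the training and validation parts.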

Let's plot the ground truth against the out-of-fold predictions for 100 randomly sampled rows.<br>
<font color="RoyalBlue">正解値と予測値を図示してみましょう。</font>

In [None]:
fig, ax = plt.subplots(figsize=(12, 12))

idxs = np.random.randint(0, y_train.shape[0], 100)
ax.plot(y_train[idxs], label="ground truth", color=palette_ro[0])
ax.plot(oof_train[idxs, 0], label="q20", color=palette_ro[3], ls=':', alpha=0.5)
ax.plot(oof_train[idxs, 1], label="q50", color=palette_ro[4], ls=':', alpha=0.5)
ax.plot(oof_train[idxs, 2], label="q80", color=palette_ro[5], ls=':', alpha=0.5)
ax.legend(loc="best");

Next, we estimate 𝜎 (the confidence) from `oof_train`. `sigma_opt` is the mean absolute error between the ground truth and the out-of-fold median prediction, `sigma_unc` is the 0.8-quantile prediction minus the 0.2-quantile prediction for each row, and `sigma_mean` is the mean of `sigma_unc`.<br>
<font color="RoyalBlue">では、oof_train から最適化された 𝜎（標準偏差）を計算しましょう。各フォールドの正解値と予測値（中央値）との平均絶対誤差を sigma_opt, 予測値（0.2分位数）と予測値（0.8分位数）との差を sigma_unc, その平均値を sigma_mean とします。</font>

In [None]:
sigma_opt = mean_absolute_error(y_train, oof_train[:, 1])
sigma_unc = oof_train[:, 2] - oof_train[:, 0]
sigma_mean = np.mean(sigma_unc)
print(sigma_opt, sigma_mean)

In [None]:
print(sigma_unc.min(), sigma_unc.mean(), sigma_unc.max(), (sigma_unc>=0).mean())

In [None]:
print(np.mean(y_train / oof_train[:, 1]))

In [None]:
fig, ax = plt.subplots(figsize=(16, 6))

sns.distplot(sigma_unc, ax=ax, color=palette_ro[1])
ax.set_title("uncertainty in prediction", fontsize=18);

<a id="submit"></a>
# Ensemble & Submit 📝

In [None]:
sub.head(10)

Prepare a submission file from the predictions of the neural network.<br>
<font color="RoyalBlue">ニューラルネットワークの予測結果から提出ファイルを作成する準備をします。</font>

In [None]:
sub['FVC1'] = y_preds[:, 1]
sub['Confidence1'] = y_preds[:, 2] - y_preds[:, 0]

sub.head(10)

`subm` is built by extracting the required columns from `sub`; wherever `FVC1` is not null, `FVC` is overwritten with `FVC1`.<br>
<font color="RoyalBlue">`sub` から必要な列を抜き出し、`FVC1` のデータが `null` でない行だけにしたものを `subm` とします。</font>

In [None]:
subm = sub[['Patient_Week', 'FVC', 'Confidence', 'FVC1', 'Confidence1']].copy()
subm.loc[~subm.FVC1.isnull(),'FVC'] = subm.loc[~subm.FVC1.isnull(),'FVC1']

In [None]:
if sigma_mean<70:
    subm['Confidence'] = sigma_opt
else:
    subm.loc[~subm.FVC1.isnull(),'Confidence'] = subm.loc[~subm.FVC1.isnull(),'Confidence1']

subm.head(10)

In [None]:
subm.describe().T

Read the original `test.csv` and overwrite the `FVC` and `Confidence` in the predicted data to be submitted if they are known in the `test.csv`.<br>
<font color="RoyalBlue">オリジナルの test.csv を読み込み、投稿予定の予測データの中に test.csv で既知のデータがあれば FVC と Confidence を上書きします。</font>

In [None]:
org_test = pd.read_csv('../input/osic-pulmonary-fibrosis-progression/test.csv')
for i in range(len(org_test)):
    subm.loc[subm['Patient_Week']==org_test.Patient[i]+'_'+str(org_test.Weeks[i]), 'FVC'] = org_test.FVC[i]
    subm.loc[subm['Patient_Week']==org_test.Patient[i]+'_'+str(org_test.Weeks[i]), 'Confidence'] = 70

subm[["Patient_Week","FVC","Confidence"]].to_csv("submission_regression.csv", index=False)
reg_sub = subm[["Patient_Week","FVC","Confidence"]].copy()

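Ensemble the two models: `FVC` is a weighted average (0.2 for Linear Decay, 0.8 for Multiple Quantile Regression), while `Confidence` is taken entirely from the Multiple Quantile Regression.<br>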
<font color="RoyalBlue">２つのモデルをアンサンブルします。</font>

In [None]:
df1 = img_sub.sort_values(by=['Patient_Week'], ascending=True).reset_index(drop=True)
df2 = reg_sub.sort_values(by=['Patient_Week'], ascending=True).reset_index(drop=True)

df = df1[['Patient_Week']].copy()
df['FVC'] = 0.2*df1['FVC'] + 0.8*df2['FVC']    # changed from 0.25, 0.75
df['Confidence'] = 0.0*df1['Confidence'] + 1.0*df2['Confidence']    # changed from 0.26, 0.74
df.head()
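
The blend relies on both frames being in the same row order after sorting by `Patient_Week`. The sketch below shows the same weighting on made-up predictions and adds an explicit alignment check; the helper `blend` and its argument names are hypothetical:

```python
import pandas as pd

# Made-up predictions from two models, deliberately in different row order.
img = pd.DataFrame({"Patient_Week": ["B_1", "A_0"],
                    "FVC": [2000.0, 3000.0],
                    "Confidence": [200.0, 300.0]})
reg = pd.DataFrame({"Patient_Week": ["A_0", "B_1"],
                    "FVC": [3100.0, 2100.0],
                    "Confidence": [250.0, 150.0]})

def blend(df_a, df_b, w_fvc, w_conf):
    """Weighted average of two submissions; weights apply to df_a."""
    a = df_a.sort_values("Patient_Week").reset_index(drop=True)
    b = df_b.sort_values("Patient_Week").reset_index(drop=True)
    if not a["Patient_Week"].equals(b["Patient_Week"]):
        raise ValueError("submissions do not cover the same Patient_Week keys")
    out = a[["Patient_Week"]].copy()
    out["FVC"] = w_fvc * a["FVC"] + (1 - w_fvc) * b["FVC"]
    out["Confidence"] = w_conf * a["Confidence"] + (1 - w_conf) * b["Confidence"]
    return out

blended = blend(img, reg, w_fvc=0.2, w_conf=0.0)
print(blended)
```

With `w_conf=0.0`, the confidence comes entirely from the second frame, matching the weights used in the notebook.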

In [None]:
df.to_csv('submission.csv', index=False)

`submission.csv` has been created. Thank you so much for reading!<br>
<font color="RoyalBlue">submission.csv を作成できました。読んでくださりありがとうございました！</font>