## 1st level. Statoil/C-CORE Iceberg Classifier Challenge

- [Reference 1](https://www.kaggle.com/devm2024/keras-model-for-beginners-0-210-on-lb-eda-r-d), [Reference 4](https://www.kaggle.com/wvadim/keras-tf-lb-0-18)
- [Summary notebook](https://www.kaggle.com/sh0wmaker/binary-image-classification)

CNN: each element of a filter, which is represented as a matrix, is learned automatically so that the filter suits the data being processed.
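To make "a filter represented as a matrix" concrete, here is a minimal NumPy sketch of what one filter does to one image. The function name `conv2d_valid`, the toy image, and the hand-written kernel are all assumptions for illustration; in a CNN the nine kernel values would be learned rather than written by hand.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and sum the products.

    This is the cross-correlation that deep learning libraries call convolution.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy 5x5 image: dark on the left, bright on the right.
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=np.float32)

# Hand-written vertical-edge filter; a CNN would learn these 9 numbers instead.
kernel = np.array([[-1, 0, 1],
                   [-1, 0, 1],
                   [-1, 0, 1]], dtype=np.float32)

feature_map = conv2d_valid(image, kernel)
print(feature_map)
```

The response is large where the window straddles the dark-to-bright edge and zero where the window is uniformly bright, which is the sense in which a filter "detects" a pattern.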

In [None]:
# raw data
import numpy as np
import pandas as pd

# modeling
from sklearn.model_selection import train_test_split

# visualization
import matplotlib.pyplot as plt
from plotly import tools
import plotly.graph_objs as go
import plotly.offline as py

# setting
import warnings
import os

warnings.filterwarnings("ignore")
%matplotlib inline
py.init_notebook_mode(connected=True)

In [None]:
!pip install py7zr
import py7zr

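### Loading the data

The radar signal emitted from the satellite hits an object, bounces off, and returns; the data stores that return as an image.

backscatter

- The more solid an object is, the more radar energy it reflects, so it appears brighter in the image. This reflected signal is called backscatter.
- It is strongly affected by the surroundings: the stronger the wind, the brighter the image, and the weaker the wind, the darker the image. Presumably this is because strong wind roughens the sea surface, which scatters more radar energy back to the sensor.

inc_angle

- Provided so that the angle at which each image was taken can be used
- The (incidence) angle from which band_1 and band_2 were observed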

In [None]:
path = "../input/statoil-iceberg-classifier-challenge/"
with py7zr.SevenZipFile(path + "train.json.7z", 'r') as z:
    z.extractall(path="/kaggle")
with py7zr.SevenZipFile(path + "test.json.7z", 'r') as z:
    z.extractall(path="/kaggle")

In [None]:
for dirname, _, filenames in os.walk('/kaggle'): 
    for filename in filenames: 
        print(os.path.join(dirname, filename))

In [None]:
train = pd.read_json("../data/processed/train.json")
test = pd.read_json("../data/processed/test.json")

In [None]:
train.head()

In [None]:
train.info()

band_1 and band_2 seem to represent regions of the image, but what is inc_angle?<br />
is_iceberg is obviously whether the object is an iceberg or not (a ship).

In [None]:
train.inc_angle.value_counts()

Wow, roughly 1/8 of the values are missing.

In [None]:
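# Assign the result back instead of calling replace(inplace=True) on the
# attribute, which may act on a temporary copy in recent pandas versions.
train["inc_angle"] = train["inc_angle"].replace("na", 0)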
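Replacing `"na"` with 0 keeps the column numeric, but it also hides which rows were missing. As a hedged alternative, the sketch below converts the column with `pd.to_numeric`, counts the missing entries, and keeps a flag before filling. The values in `toy` and the column name `inc_angle_missing` are made up for illustration and are not taken from the competition data.

```python
import pandas as pd

# Made-up angles for illustration only; the real column comes from train.json.
toy = pd.DataFrame({"inc_angle": [43.9, "na", 38.2, "na", 45.3]})

# errors="coerce" turns anything non-numeric (here the string "na") into NaN.
angle = pd.to_numeric(toy["inc_angle"], errors="coerce")

n_missing = int(angle.isna().sum())
missing_ratio = n_missing / len(angle)
print(f"missing: {n_missing} of {len(angle)} ({missing_ratio:.1%})")

# Remember which rows were missing, then fill with 0 as the notebook does.
toy["inc_angle_missing"] = angle.isna()
toy["inc_angle"] = angle.fillna(0.0)
```

Whether 0 is a good fill value depends on how the angle is used later; the notebook's CNN does not use `inc_angle` as an input at all.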

#### band: each band is currently a flat list of length 5625 --> convert to (75, 75, 3)

In [None]:
len(train.band_1[0])

In [None]:
XBand1 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train.band_1])
XBand2 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in train.band_2])
Xtrain = np.concatenate([XBand1[:, :, :, np.newaxis], XBand2[:, :, :, np.newaxis],
                         ((XBand1 + XBand2) / 2)[:, :, :, np.newaxis]], axis=-1)

In [None]:
XBand3 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test.band_1])
XBand4 = np.array([np.array(band).astype(np.float32).reshape(75, 75) for band in test.band_2])
Xtest = np.concatenate([XBand3[:, :, :, np.newaxis], XBand4[:, :, :, np.newaxis],
                        ((XBand3 + XBand4) / 2)[:, :, :, np.newaxis]], axis=-1)
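The train and test cells above repeat the same three steps: reshape each band to 75 x 75, average the two bands into a third channel, and stack the channels last. A small helper could express this once. The name `bands_to_tensor` and the synthetic inputs below are assumptions for illustration, not part of the original notebooks.

```python
import numpy as np

def bands_to_tensor(band_1, band_2, size=75):
    """Stack two flattened bands and their mean into an (N, size, size, 3) array."""
    b1 = np.asarray(list(band_1), dtype=np.float32).reshape(-1, size, size)
    b2 = np.asarray(list(band_2), dtype=np.float32).reshape(-1, size, size)
    # Channels: band_1, band_2, and their element-wise average.
    return np.stack([b1, b2, (b1 + b2) / 2], axis=-1)

# Synthetic stand-in for two rows of train.band_1 / train.band_2.
rng = np.random.default_rng(0)
fake_band_1 = [rng.normal(size=75 * 75).tolist() for _ in range(2)]
fake_band_2 = [rng.normal(size=75 * 75).tolist() for _ in range(2)]

X = bands_to_tensor(fake_band_1, fake_band_2)
print(X.shape)  # (2, 75, 75, 3)
```

With the real data this would be called as `bands_to_tensor(train.band_1, train.band_2)` and again for `test`, which should give the same arrays as `Xtrain` and `Xtest`.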

#### Let's visualize sample 14.

In [None]:
train.loc[14, "is_iceberg"]

In [None]:
fig = tools.make_subplots(rows=1, cols=2, specs=[[{"is_3d": True}, {"is_3d": True}]])
data = go.Surface(z=XBand1[14, :, :], colorscale="RdBu_r", scene="scene", showscale=True)
data1 = go.Surface(z=XBand2[14, :, :], colorscale="RdBu_r", scene="scene", showscale=True)

fig["layout"].update(title='3D surface plot for "{}"', height=800, width=1200,
                     titlefont=dict(size=30))
fig.append_trace(data, 1, 1)
fig.append_trace(data1, 1, 2)
py.iplot(fig)

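Why did the author leave an empty placeholder in the title like that? It bothers me.<br />
If it was meant to be reused for the next plot, it should have been defined as a function. What is this?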

In [None]:
plt.imshow(XBand1[14, :, :])
plt.show()

In [None]:
label = "ship"

fig = tools.make_subplots(rows=1, cols=2, specs=[[{"is_3d": True}, {"is_3d": True}]])

fig.append_trace(dict(type="surface", z=XBand1[14, :, :], colorscale="RdBu_r",
                      scene="scene", showscale=False), 1, 1)
fig.append_trace(dict(type="surface", z=XBand2[14, :, :], colorscale="RdBu_r",
                      scene="scene", showscale=False), 1, 2)

fig["layout"].update(
    title=f'3D surface plot for "{label}" (left is from band1, right is from band2)',
    height=800, width=1200, titlefont=dict(size=30))

py.iplot(fig)

#### Wrapping it in a function really is the best.

In [None]:
def plot_contour_2d(band1, band2, label):
    fig = tools.make_subplots(rows=1, cols=2, specs=[[{"is_3d": True}, {"is_3d": True}]])
    fig.append_trace(dict(type="surface", z=band1, colorscale="RdBu_r",
                          scene="scene", showscale=False), 1, 1)
    fig.append_trace(dict(type="surface", z=band2, colorscale="RdBu_r",
                          scene="scene", showscale=False), 1, 2)
    
    fig["layout"].update(
        title=f'3D surface plot for "{label}" (left is from band1, right is from band2)',
        titlefont=dict(size=30), height=800, width=1200)
    
    py.iplot(fig)
    
    fig, ax = plt.subplots(1, 2, figsize=(16, 10))
    ax[0].imshow(band1)
    ax[0].set_title("Image from band_1", fontsize=15)
    
    ax[1].imshow(band2)
    ax[1].set_title("Image from band_2", fontsize=15)
    
    plt.show()

In [None]:
num = 0
label = "iceberg" if (train["is_iceberg"].values[num] == 1) else "ship"
plot_contour_2d(XBand1[num, :, :], XBand2[num, :, :], label)

In [None]:
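num = 100
# Recompute the label for this sample; otherwise the label of sample 0 is reused.
label = "iceberg" if (train["is_iceberg"].values[num] == 1) else "ship"
plot_contour_2d(XBand1[num, :, :], XBand2[num, :, :], label)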

### Smart Deep Learning

In [None]:
from keras.preprocessing.image import ImageDataGenerator

from keras.models import Sequential, Model
from keras.layers import Input
from keras.layers import Conv2D, MaxPooling2D, GlobalMaxPooling2D
from keras.layers import Dense
from keras.layers import Dropout, BatchNormalization
from keras.layers import Flatten, Concatenate, Activation

from keras import initializers
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint, Callback, EarlyStopping

In [None]:
model = Sequential()

# Conv 1
model.add(Conv2D(64, input_shape=(75, 75, 3), kernel_size=3, activation="relu"))
model.add(MaxPooling2D(pool_size=2, strides=2))
model.add(Dropout(0.2))

# Conv 2
model.add(Conv2D(128, kernel_size=3, activation="relu"))
model.add(MaxPooling2D(pool_size=2, strides=2))
model.add(Dropout(0.2))

# Conv 3
model.add(Conv2D(128, kernel_size=3, activation="relu"))
model.add(MaxPooling2D(pool_size=2, strides=2))
model.add(Dropout(0.2))

# Conv 4
model.add(Conv2D(64, kernel_size=3, activation="relu"))
model.add(MaxPooling2D(pool_size=2, strides=2))
model.add(Dropout(0.2))

model.add(Flatten())

# Dense 1 (5)
model.add(Dense(512, activation="relu"))
model.add(Dropout(0.2))

# Dense 2 (6)
model.add(Dense(256, activation="relu"))
model.add(Dropout(0.2))

# final
model.add(Dense(1, activation="sigmoid"))

model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])

In [None]:
model.summary()

In [None]:
def get_callbacks(filepath, patience=2):
    es = EarlyStopping("val_loss", patience=patience, mode="min")
    msave = ModelCheckpoint(filepath, save_best_only=True)
    return [es, msave]

In [None]:
file_path = ".model_weights.hdf5"
callbacks = get_callbacks(file_path, 5)

The original author included explanations of Conv2D, ReLU, MaxPooling2D, and Dropout. Perhaps the author is also still learning?
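As a quick check on how Conv2D and MaxPooling2D change the image size in the model above, the sketch below redoes the shape arithmetic by hand. It assumes the Keras defaults used in that model (no padding, convolution stride 1); ReLU and Dropout do not change the shape. The helper names `conv_out` and `pool_out` are made up for this example.

```python
def conv_out(size, kernel=3):
    """Spatial size after an unpadded ('valid') convolution with stride 1."""
    return size - kernel + 1

def pool_out(size, pool=2, stride=2):
    """Spatial size after unpadded max pooling."""
    return (size - pool) // stride + 1

size = 75
filters_per_block = [64, 128, 128, 64]  # filter counts of Conv 1 to Conv 4 above
shapes = []
for filters in filters_per_block:
    size = pool_out(conv_out(size))
    shapes.append((size, size, filters))

# Number of values handed to the first Dense layer after Flatten.
flattened = shapes[-1][0] * shapes[-1][1] * shapes[-1][2]
print(shapes)     # [(36, 36, 64), (17, 17, 128), (7, 7, 128), (2, 2, 64)]
print(flattened)  # 256
```

These numbers can be compared against the output shapes printed by `model.summary()`.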

#### Actually training the model

In [None]:
target = train.is_iceberg
trainX, validX, trainY, validY = train_test_split(Xtrain, target, random_state=1, train_size=0.8)
# Alternatively, passing validation_split=0.2 to fit would also work

In [None]:
history = model.fit(trainX, trainY, validation_data=(validX, validY),
                    batch_size=24, epochs=10, callbacks=callbacks)

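I tend to prefer the model as it is when EarlyStopping halts training, since it has trained a little longer on the training set, over the last model saved by ModelCheckpoint, so that is what I use here. If you prefer the checkpoint instead, do the following.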

```Python
model.load_weights(file_path)  # `file_path` is the checkpoint path defined above
score = model.evaluate(validX, validY)
print(f"Validation accuracy: {score[1]}", f"Validation loss: {score[0]}", sep="\n")
```
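The difference between the two choices can be seen without training anything. The sketch below is a simplified simulation of the bookkeeping, not Keras itself: it ignores options such as `min_delta`, and the validation losses are made up rather than taken from this notebook. `best_epoch` is where a `save_best_only` checkpoint would last be written, and `stop_epoch` is the last epoch trained before patience runs out.

```python
def simulate_early_stopping(val_losses, patience):
    """Return (best_epoch, stop_epoch), both 0-indexed."""
    best_loss = float("inf")
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: a save_best_only checkpoint would be written here.
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    # Patience never ran out; training ends at the final epoch.
    return best_epoch, len(val_losses) - 1

# Made-up validation losses for illustration only.
val_losses = [0.60, 0.45, 0.38, 0.40, 0.41, 0.39, 0.42, 0.43]
best_epoch, stop_epoch = simulate_early_stopping(val_losses, patience=5)
print(best_epoch, stop_epoch)  # 2 7
```

In this toy run the in-memory model has trained five epochs past the checkpointed one. Keras's `EarlyStopping` also accepts `restore_best_weights=True`, which is another way to end up with the best-epoch weights without reloading the file.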

In [None]:
print("Validation accuracy:", model.evaluate(validX, validY)[1])

In [None]:
tloss = history.history["loss"]
vloss = history.history["val_loss"]
xband = np.arange(len(tloss))

plt.plot(xband, tloss, label="train loss", color="lightblue")
plt.plot(xband, vloss, label="validation loss", color="forestgreen")
plt.legend(loc="best")
plt.show()

```Python
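# The test array is named Xtest above; predict() returns the sigmoid
# probabilities (predict_proba is not available in recent Keras versions).
predicted_test = model.predict(Xtest)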
```

The submission file is then created like this:

```Python
submission = pd.DataFrame()
submission["id"] = test["id"]
submission["is_iceberg"] = predicted_test.reshape((predicted_test.shape[0]))
submission.to_csv("sub.csv", index=False)
```