# Aerial Project

<img src="img/logo.jpg" width=150 ALIGN="left" border="20">

## Starting Kit for raw data (images)

Created by Aerial Team

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE.

## Introduction

Aerial imagery has long been a primary source of geographic data. As technology has progressed, it has become practical for remote sensing: the science of obtaining information about an object, area, or phenomenon from a distance.
Today, image recognition has many uses, ranging from robot and drone vision to autonomous vehicles and face detection.

In this challenge, we use pre-processed data derived from landscape images. The goal is to learn to distinguish common and uncommon landscapes, such as a beach, a lake, or a meadow.
The data is a subset of the NWPU-RESISC45 data set, originally used in the paper [*Remote Sensing Image Scene Classification*](https://arxiv.org/pdf/1703.00121.pdf). The original data set contains 45 categories, of which we kept 13.

**Challenge website:** https://codalab.lisn.upsaclay.fr/competitions/8854#participate-submit_results

References and credits:

Yuliya Tarabalka, Guillaume Charpiat, Nicolas Girard for the data sets presentation.<br>
Gong Cheng, Junwei Han, and Xiaoqiang Lu, for the original article on the chosen data set.

### Requirements / Installation

```bash
conda create -n torch python=3
conda activate torch
conda install pytorch torchvision torchaudio -c pytorch
conda install ipykernel pyyaml pandas matplotlib scipy scikit-learn
```

Code tested with:

```
python=3.9.7
pytorch=1.10.0
pyyaml=6.0
pandas=1.3.4
matplotlib=3.5.0
scipy=1.7.1
scikit-learn=1.0.1
```

In [1]:
import csv
import platform
import shutil
from datetime import datetime
from pathlib import Path

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import sklearn
from sklearn import decomposition, metrics, model_selection, naive_bayes, pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.utils.estimator_checks import check_estimator

from submission_code import cnn_model


In [None]:
if platform.system() == "Darwin":
    %config InlineBackend.figure_format="retina"  # For high DPI display

print(sklearn.__version__)  # Tested with scikit-learn 1.0.1

%load_ext autoreload
%autoreload 2

In [None]:
DATA_PATH = Path("../public_data")      # Full dataset (fetched from the challenge website)
# DATA_PATH = Path("sample_data")         # Uncomment to use the sample dataset instead

SUBM_PATH = Path("submissions")
MODEL_PATH = Path("submission_code")
SCORE_PATH = Path("scoring_output")

RESULTS_PATH = SUBM_PATH / "submission_results"

DATA_NAME = "Areal"

DATA_SETS = ["train", "valid", "test"]
ALL_SETS = ["train", "valid-lab", "valid", "test"]

TORCH_MODEL = Path("torch/model")

## Step 1: Exploratory data analysis

We provide `sample_data` with the starting kit, but to prepare your submission, you must fetch `public_data` from the challenge website and point `DATA_PATH` to it.

**<span style="color:red">Warning</span>**

*In case you want to load the full data*

The files are large, so your computer needs enough free RAM: loading takes about 3-4 GB at its peak and about 1.5 GB once it completes.
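
To get a feel for where such memory figures come from, the following sketch estimates the size of a feature matrix holding one flattened image per row. It assumes 128 x 128 images with 3 channels stored as `float64` (the default dtype of `np.empty`); the image count is a hypothetical value chosen for illustration, not the size of the challenge data.

```python
import numpy as np

# Assumptions (not read from the challenge files): 128 x 128 images with
# 3 channels, stored one per row as float64, the default dtype of np.empty.
HEIGHT, WIDTH, CHANNELS = 128, 128, 3
N_FEATURES = HEIGHT * WIDTH * CHANNELS  # values per image


def array_gigabytes(n_images, dtype=np.float64):
    """Return the size in GB (10**9 bytes) of an (n_images, N_FEATURES) array."""
    return n_images * N_FEATURES * np.dtype(dtype).itemsize / 1e9


n_images = 1000  # hypothetical count, for illustration only
print(f"float64: {array_gigabytes(n_images):.2f} GB")
print(f"uint8:   {array_gigabytes(n_images, np.uint8):.2f} GB")
```

Under these assumptions, storing pixel values as `uint8` would take one eighth of the memory of `float64`, at the cost of converting back to floats before scaling or training.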

### Load data

In [None]:
def number_lines(fname):
    """Count the lines in a text file without loading it into memory."""
    with open(fname) as f:
        return sum(1 for _ in f)

In [None]:
def fast_import(arr, fpath):
    with open(fpath) as f:
        for i, row in enumerate(csv.reader(f, delimiter=" ")):
            arr[i] = row

In [None]:
num_fts = number_lines(DATA_PATH / f"{DATA_NAME}_feat.name")

num = {
    data_set: number_lines(DATA_PATH / f"{DATA_NAME}_{data_set}.data")
    for data_set in DATA_SETS
}

xs_raw = {
    data_set: np.empty((num[data_set], num_fts))
    for data_set in DATA_SETS
}

for data_set in DATA_SETS:
    fast_import(
        xs_raw[data_set], 
        fpath=DATA_PATH / f"{DATA_NAME}_{data_set}.data"
    )

In [None]:
labels_df = pd.read_csv(
    DATA_PATH / f"{DATA_NAME}_label.name", header=None, names=["name"]
)

labels = labels_df.name.to_list()

ys_df = pd.read_csv(
    DATA_PATH / f"{DATA_NAME}_train.solution", header=None, names=["value"]
)

ys_raw = ys_df.values.squeeze()

ys_df["label"] = ys_df.value.map(labels_df.name)

ys_df

### Visualize dataset sample

In [None]:
NUM_TO_SHOW = 6

fig, axs_ = plt.subplots(nrows=2, ncols=3, figsize=(10, 10))
fig.subplots_adjust(hspace=0.3)
axs = axs_.flatten()

for i in range(NUM_TO_SHOW):
    img = xs_raw["train"][i].reshape(128, 128, -1)
    label = ys_df.label[i]
    axs[i].set_title(f"Example of {label}")
    axs[i].imshow(img / 255)

plt.show()
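
The reshape used in the plotting cell can be checked in isolation. This is a minimal sketch on synthetic data, assuming each row is a 128 x 128 image flattened in row-major order with the channels last (3 channels is an assumption here); it is not the loader used for the challenge files.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for one image: 128 x 128 pixels, 3 channels (assumed RGB),
# integer intensities in [0, 255].
image = rng.integers(0, 256, size=(128, 128, 3))

# Flatten to a single row of floats, the layout assumed for the feature matrix.
row = image.reshape(-1).astype(float)

# Restore the image; -1 lets NumPy infer the channel count from the row length.
restored = row.reshape(128, 128, -1)

# Scale to [0, 1], the range imshow expects for float RGB data.
scaled = restored / 255

print(row.shape, restored.shape)
```

If the restored image looked scrambled on real data, that would suggest the rows use a different layout (for example channels first) than the one assumed here.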

## Step 2: Building a predictive model

### Baseline model

In [None]:
print(xs_raw["train"].shape, ys_raw.shape, "\n")


xs, ys = {}, {}
(
    xs["train"],
    xs["valid-lab"],
    ys["train"],
    ys["valid-lab"],
) = model_selection.train_test_split(
    xs_raw["train"], ys_raw, test_size=0.2, random_state=123
)

xs["test"], xs["valid"] = xs_raw["test"], xs_raw["valid"]


print(xs["train"].shape, ys["train"].shape)
print(xs["valid-lab"].shape, ys["valid-lab"].shape)
print(xs["valid"].shape)
print(xs["test"].shape)
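
If the classes are unevenly represented, a stratified split may be worth considering. The sketch below, on toy labels that are not taken from the challenge data, shows how the `stratify` argument of `train_test_split` keeps the class ratio the same in both parts. Whether the challenge classes are actually imbalanced is not stated here, so treat this as an optional variation.

```python
import numpy as np
from sklearn import model_selection

# Toy imbalanced labels: 80 samples of class 0 and 20 of class 1.
y = np.array([0] * 80 + [1] * 20)
X = np.arange(len(y)).reshape(-1, 1)

X_tr, X_va, y_tr, y_va = model_selection.train_test_split(
    X, y, test_size=0.2, random_state=123, stratify=y
)

# With stratify=y, both parts keep the 80/20 class ratio.
print(np.bincount(y_tr), np.bincount(y_va))
```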

In [None]:
N_COMP = 40  # 300 1000

scaler = StandardScaler()
pca = decomposition.PCA(n_components=N_COMP)

preproc_pipe = pipeline.Pipeline(steps=[("scaler", scaler), ("pca", pca)])  

preproc_pipe.fit(xs["train"])

xps = {
    data_set: preproc_pipe.transform(xs[data_set]) 
    for data_set in ALL_SETS
}


print(xps["train"].shape, ys["train"].shape)
print(xps["valid-lab"].shape, ys["valid-lab"].shape)
print(xps["test"].shape)

In [None]:
print(pca.explained_variance_ratio_.shape)

print(f"{pca.explained_variance_ratio_.cumsum()[-1]:.3f}")

# with np.printoptions(precision=3):
#     print(pca.explained_variance_ratio_.cumsum())
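
The cumulative explained variance can also guide the choice of the number of components. The following sketch uses synthetic data and a hypothetical 95% threshold (both are assumptions for illustration) to pick the smallest number of components that reaches the threshold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 50 features driven by 5 latent factors plus noise.
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

X_scaled = StandardScaler().fit_transform(X)
pca_full = PCA().fit(X_scaled)  # keep all components to inspect the full spectrum

cumulative = pca_full.explained_variance_ratio_.cumsum()

TARGET = 0.95  # hypothetical threshold, chosen for illustration
# Smallest number of components whose cumulative ratio reaches TARGET.
n_components = int(np.searchsorted(cumulative, TARGET) + 1)

print(
    f"{n_components} components explain "
    f"{cumulative[n_components - 1]:.3f} of the variance"
)
```

scikit-learn also accepts a float between 0 and 1 as `n_components` when `svd_solver="full"`, which selects the number of components from a variance threshold directly.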

In [None]:
# model = naive_bayes.GaussianNB()
# model.fit(xps["train"], ys["train"])                                    # WITH PCA

In [None]:
# Train PyTorch model:

time_stamp = datetime.now().strftime("%Y-%m-%dT%H-%M")
torch_fpath = TORCH_MODEL / f"cifar_net_{time_stamp}.pth"

model = cnn_model.BasicCNN(nb_epoch=30, n_batches=200, model_fpath=torch_fpath)
print(model.hyperparameters)

model.fit(xs["train"], ys["train"], xs["valid-lab"], ys["valid-lab"])   # Without PCA

In [None]:
pred = {
    data_set: model.predict(xs[data_set])           # Without PCA
    # data_set: model.predict(xps[data_set])          # WITH PCA
    for data_set in ALL_SETS
}

In [None]:
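# Write predictions for both the valid set (phase 1) and the test set (final phase),
# so that results can be migrated between phases

RESULTS_PATH.mkdir(parents=True, exist_ok=True)  # Make sure the output folder exists

for data_set in ["valid", "test"]:
    with open(RESULTS_PATH / f"{DATA_NAME}_{data_set}.predict", "w") as f:
        print(*pred[data_set], sep="\n", file=f)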
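
Before zipping the results, it can help to confirm that a prediction file holds one label per line and reads back unchanged. This is a minimal sketch using a temporary directory and toy predictions; the file name `example.predict` is hypothetical.

```python
import tempfile
from pathlib import Path

import numpy as np

toy_pred = np.array([3, 0, 12, 7])  # toy predictions, for illustration only

with tempfile.TemporaryDirectory() as tmp_dir:
    fpath = Path(tmp_dir) / "example.predict"  # hypothetical file name

    # Same writing pattern as in the cell above: one predicted label per line.
    with open(fpath, "w") as f:
        print(*toy_pred, sep="\n", file=f)

    # Read the file back as integers, one value per line.
    loaded = np.loadtxt(fpath, dtype=int)

print(loaded)
```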

## Scoring the result

In [None]:
for data_set in ["train", "valid-lab"]:
    print(
        f"{data_set} set accuracy =",
        f"{metrics.accuracy_score(ys[data_set], pred[data_set]):5.4f}",
    )

### Confusion matrix

In [None]:
data_set = "valid-lab"

disp = metrics.ConfusionMatrixDisplay.from_predictions(
    ys[data_set], pred[data_set], cmap=plt.cm.Blues
)
disp.figure_.suptitle("Confusion Matrix")
plt.show()
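
Beyond the plot, the confusion matrix can be turned into per-class numbers. The sketch below computes the recall of each class on toy labels (not challenge data): the diagonal holds the correct predictions and each row sums to the number of true samples of that class.

```python
import numpy as np
from sklearn import metrics

# Toy labels for three classes, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 0, 2])

# Rows are true classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true, y_pred)

# Recall per class: correct predictions (diagonal) over true samples (row sums).
recall = cm.diagonal() / cm.sum(axis=1)

for cls, value in enumerate(recall):
    print(f"class {cls}: recall = {value:.2f}")
```

Classes with low recall are the ones most often mistaken for another landscape, and the off-diagonal entries of their row show which one.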

## Cross-validation

In [None]:
model = naive_bayes.GaussianNB()

scores = model_selection.cross_val_score(
    model,
    X=np.vstack([xps["train"], xps["valid-lab"]]),  # WITH PCA: xps, Without PCA: xs
    y=np.hstack([ys["train"], ys["valid-lab"]]),  # TEMP HACK: Merge previous splits
    cv=4,
    # scoring="accuracy",
    verbose=3,
)
print(
    f"\nCV score (95 perc. CI): "
    f"{scores.mean():0.2f} "
    f"(+/- {2 * scores.std():0.2f})"
)
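
One possible refinement, sketched here on synthetic data, is to put the scaler and PCA inside the cross-validated pipeline so that each fold fits the preprocessing on its own training part only. The data, the number of components, and the step names are assumptions for illustration and not part of the starting kit.

```python
import numpy as np
from sklearn import decomposition, model_selection, naive_bayes, pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic two-class data: 120 samples, 20 features, class means shifted by 2.
y = np.repeat([0, 1], 60)
X = rng.normal(size=(120, 20)) + 2.0 * y[:, None]

# Scaling and PCA sit inside the pipeline, so they are refit within each fold.
cv_model = pipeline.Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("pca", decomposition.PCA(n_components=5)),
        ("clf", naive_bayes.GaussianNB()),
    ]
)

cv_scores = model_selection.cross_val_score(cv_model, X, y, cv=4)
print(f"CV accuracy: {cv_scores.mean():.2f} (+/- {2 * cv_scores.std():.2f})")
```

With this layout the raw features are passed to `cross_val_score`, and no statistics from a held-out fold leak into the preprocessing.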

## Submission

### Prepare submission file

In [None]:
time_stamp = datetime.now().strftime("%Y-%m-%dT%H-%M")

subm_results_name = f"submission_results_{time_stamp}"

shutil.make_archive(SUBM_PATH / subm_results_name, "zip", RESULTS_PATH)

print(f"The prediction to submit is ready: {subm_results_name}.zip")
print(f"Submit this file: {SUBM_PATH / subm_results_name}.zip")

---