# Areal Project

<div>
<img src="logo.jpg" width=150 ALIGN="left" border="20">
<h1> Starting Kit for preprocessed data</h1>
<br>This code was tested with <br>
Python 3.6.7 <br>
Created by Areal Team <br><br>
ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". The CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL, 
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. 
</div>

<div>
    <h2>Introduction </h2>
     <br>
Aerial imagery has long been a primary source of geographic data. As technology has progressed, it has become practical for remote sensing: the science of obtaining information about an object, area, or phenomenon from a distance.
Today, image recognition has many uses, from robot and drone vision to autonomous vehicles and face detection.
<br>
In this challenge, we use preprocessed data derived from landscape images. The goal is to learn to tell apart common and uncommon landscapes such as a beach, a lake, or a meadow.
    The data is a subset of the NWPU-RESISC45 data set, originally used in <a href="https://arxiv.org/pdf/1703.00121.pdf?fbclid=IwAR16qo-EX_Z05ZpxvWG8F-oBU0SlnY-3BPCWBVVOGPyJcVy7BBqCKjnsvJo">Remote Sensing Image Scene Classification</a>. The original data set contains 45 categories, of which we kept 13.

References and credits: 
Yuliya Tarabalka, Guillaume Charpiat, Nicolas Girard for the data sets presentation.<br>
Gong Cheng, Junwei Han, and Xiaoqiang Lu, for the original article on the chosen data set.
</div>

### Requirements 

Our code uses several libraries. Once uncommented, the next cell installs the required Python dependencies (this will probably only work on your personal computer). If you don't want to install them, or you are running inside the competition's Docker image, leave the cell commented out.

In [1]:
#!pip install --user -r requirements.txt

In [2]:
import numpy as np
import random
import re

In [3]:
model_dir = "sample_code_submission/"
result_dir = 'sample_result_submission/' 
problem_dir = 'ingestion_program/'  
score_dir = 'scoring_program/'

In [4]:
import sys
sys.path.extend([model_dir, problem_dir, score_dir])  # make the kit's modules importable

<div>
    <h1> Step 1: Exploratory data analysis </h1>
<p>
We provide sample_data with the starting kit. To prepare your submission, however, you must download public_data from the challenge website and point data_dir to it.
</p>
</div>

<div>
<img src="CNN.png" width=800 align="center" border="20">
We created the new data with a Convolutional Neural Network (CNN) that was already trained to recognize images. Rather than running each image through the whole network, we kept the form the data takes three-quarters of the way through the CNN. <br>
This new form closely resembles the weights seen in a classic neural network.
</div>
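
The idea of stopping partway through a network can be sketched with a toy example. The actual CNN and the layer that was kept are not specified in this notebook, so everything below is an assumption made for illustration: a small dense network with random weights stands in for the pretrained CNN, and `make_toy_network` and `truncated_forward` are hypothetical helper names.

```python
import numpy as np

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(x, 0.0)

def make_toy_network(sizes, seed=0):
    """Build random dense layers; a hypothetical stand-in for a pretrained CNN."""
    rng = np.random.RandomState(seed)
    return [rng.normal(scale=0.5, size=(n_in, n_out))
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def truncated_forward(x, weights, fraction=0.75):
    """Run only the first `fraction` of the layers and return that output."""
    n_layers = int(len(weights) * fraction)
    for w in weights[:n_layers]:
        x = relu(x @ w)
    return x

weights = make_toy_network([64, 48, 32, 16, 4])  # four layers in total
image = np.random.RandomState(1).rand(1, 64)      # one flattened toy "image"
features = truncated_forward(image, weights)      # output of layer 3 of 4
print(features.shape)
```

The returned array plays the role of one row of the preprocessed data: a fixed-length feature vector rather than raw pixels.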

In [None]:
data_dir = 'public_data'
data_name = 'Areal'

In [None]:
from ingestion_program.data_io import read_as_df
data = read_as_df(data_dir  + '/' + data_name)

Reading public_data/Areal_train from AutoML format


In [None]:
#data.head()

In [None]:
#data.describe()

In [None]:
print(data.iloc[:, -1:])
X = data.iloc[:, :-1]
y = data.iloc[:, -1:]
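
Before modeling, it can help to check how the examples are spread across classes. The sketch below assumes, as in the cell above, that the last column holds the label. The toy DataFrame, its column names, and the `class_counts` helper are hypothetical stand-ins for `data`.

```python
import pandas as pd

# Hypothetical stand-in for `data`: feature columns followed by a label column.
toy = pd.DataFrame({
    "f0": [0.1, 2.3, 0.0, 4.2, 1.1, 0.7],
    "f1": [1.5, 0.2, 3.3, 0.9, 2.8, 0.4],
    "target": ["beach", "lake", "beach", "meadow", "lake", "beach"],
})

def class_counts(df):
    """Count examples per class, assuming the label is the last column."""
    return df.iloc[:, -1].value_counts().sort_index()

counts = class_counts(toy)
print(counts)
```

On the real data, `class_counts(data)` should give the same kind of summary.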

### Visualization of values

Most values are in the range (0, 5).

In [None]:
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(30,15))
# Plot values of first 10 features
data_plot=data.iloc[:,:10]
data_plot.boxplot()
plt.show()
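
The claim about the value range can also be checked numerically. This is a minimal sketch on synthetic data; `fraction_in_range` is a hypothetical helper, and the exponential sample only stands in for the real feature matrix.

```python
import numpy as np

def fraction_in_range(values, low=0.0, high=5.0):
    """Return the share of entries lying strictly between `low` and `high`."""
    values = np.asarray(values, dtype=float)
    inside = (values > low) & (values < high)
    return inside.mean()

# Synthetic example; on the real data, pass X.values instead.
rng = np.random.RandomState(0)
synthetic = rng.exponential(scale=1.0, size=(100, 10))
share = fraction_in_range(synthetic)
print("Share of values in (0, 5): {:.3f}".format(share))
```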

### Visualization as images

A human cannot make sense of these "images", but they make it much simpler and faster for a computer to assign the correct class.

In [None]:
import matplotlib.image as mpimg

num_toshow = 6
fig, _axs = plt.subplots(nrows=2, ncols=3, figsize=(10,10))
fig.subplots_adjust(hspace=0.3)
axs = _axs.flatten()

for i in range(num_toshow):
    img = data.iloc[i].values[:-1].reshape(32, 32)
    label = data.iloc[i, -1]  # scalar label, so the title shows the class name rather than an array
    axs[i].set_title('Example of {}'.format(label))
    axs[i].imshow(img.astype(float))

plt.show()
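
The `reshape(32, 32)` call above only works when each row has exactly 1024 features. A slightly more defensive version, sketched below with a hypothetical helper name, derives the side length from the number of features and fails with a clear message otherwise.

```python
import math
import numpy as np

def to_square_image(features):
    """Reshape a flat feature vector into a square 2-D array."""
    flat = np.asarray(features, dtype=float).ravel()
    side = int(round(math.sqrt(flat.size)))
    if side * side != flat.size:
        raise ValueError(
            "Cannot reshape {} features into a square image".format(flat.size))
    return flat.reshape(side, side)

img = to_square_image(np.arange(1024))
print(img.shape)
```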

# Step 2: Building a predictive model

Use DataManager to split the data into training, validation, and test sets.

In [None]:
from data_manager import DataManager
D = DataManager(data_name, data_dir)
print(D)

Access the data and labels through the D.data dictionary (DataManager.data).

In [None]:
X_train = D.data['X_train']
Y_train = D.data['Y_train']

The model is a simplified version of sklearn's decision tree algorithm.

The only parameter you can change is max_depth, which defaults to 5.
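
To get a feel for what `max_depth` controls, here is a minimal sketch that uses sklearn's own `DecisionTreeClassifier` as a stand-in, since the implementation of `UltimateClassifier` is not shown here. The synthetic data is an assumption made purely for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data; a hypothetical stand-in for X_train / Y_train.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(300, 6))
y_toy = (X_toy[:, 0] * X_toy[:, 1] > 0).astype(int)

depths = {}
for max_depth in (2, 5):
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0)
    tree.fit(X_toy, y_toy)
    # get_depth() reports the depth the fitted tree actually reached.
    depths[max_depth] = tree.get_depth()
    print("max_depth={} -> fitted depth {}, training accuracy {:.2f}".format(
        max_depth, depths[max_depth], tree.score(X_toy, y_toy)))
```

A deeper tree can carve the training set into smaller regions, which tends to raise the training score without necessarily improving results on unseen data.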

In [None]:
from model import UltimateClassifier

In [None]:
M = UltimateClassifier()

#### Fit the model

Pass the data as the first argument and the labels as the second. Calling .reshape(-1) on the labels makes sure the label array is flat, with a single dimension.

In [None]:
M.fit(X_train, Y_train.reshape(-1))
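
The effect of `.reshape(-1)` is easy to see on a small array. The labels below are made up; the sketch assumes labels are loaded as a single column, which is what makes the flattening necessary.

```python
import numpy as np

# Hypothetical labels loaded as one column: shape (4, 1).
labels_column = np.array([[3], [0], [7], [3]])

# reshape(-1) lets NumPy infer the single remaining dimension: shape (4,).
labels_flat = labels_column.reshape(-1)

print(labels_column.shape, "->", labels_flat.shape)
```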

In [None]:
Y_hat_train = M.predict(D.data['X_train'])
Y_hat_valid = M.predict(D.data['X_valid'])
Y_hat_test = M.predict(D.data['X_test'])

In [None]:
#M.save(trained_model_name)            

"""
import seaborn as sns; sns.set()
sns.pairplot(data, hue="target"), 
corr_mat = data.corr(method='spearman')
sns.heatmap(corr_mat, annot=True, center=0)
"""

result_name = result_dir + data_name
from data_io import write
from data_io import mkdir
mkdir(result_dir)

write(result_name + '_train.predict', Y_hat_train)
write(result_name + '_valid.predict', Y_hat_valid)
write(result_name + '_test.predict', Y_hat_test)
!ls $result_name*

# Scoring predictions

In [None]:
from libscores import get_metric
metric_name, scoring_function = get_metric()
print('Using scoring metric:', metric_name)

In [None]:
"""
print('Ideal score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_train), "\n")

print("Scores with BaselineModel")
print('Training score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_hat_train))
if len(D.data['Y_valid']) > 0 and len(D.data['Y_test']) > 0:
    print('Validation score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_valid'], Y_hat_valid))
    print('Test score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_test'], Y_hat_test))
"""
from model import compute_accuracy
rep = compute_accuracy(M, D, "Ultimate Classifier")

Keep in mind that the provided model overfits heavily, so don't read too much into the training score.

Cross-validation (see below) gives more meaningful results.
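
Why the training score can mislead is easiest to see on labels that carry no signal at all. The sketch below again uses sklearn's `DecisionTreeClassifier` as a stand-in for the provided model, with random labels as a deliberately extreme, hypothetical case.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(400, 10))
y_toy = rng.randint(0, 3, size=400)  # labels are pure noise

# Hold out the last 100 rows; the tree never sees them during fitting.
X_fit, y_fit = X_toy[:300], y_toy[:300]
X_held, y_held = X_toy[300:], y_toy[300:]

tree = DecisionTreeClassifier(random_state=0)  # no depth limit
tree.fit(X_fit, y_fit)

train_accuracy = tree.score(X_fit, y_fit)
held_out_accuracy = tree.score(X_held, y_held)
print("Training accuracy: {:.2f}".format(train_accuracy))
print("Held-out accuracy: {:.2f}".format(held_out_accuracy))
```

An unconstrained tree memorizes the rows it was fitted on, so only the held-out score says anything about how the model generalizes.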

## Confusion matrix

The confusion matrix is not very informative on the training set, where accuracy is 100%, but it is worth examining if you change your model.

In [None]:
import itertools
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

cnf_matrix = confusion_matrix(Y_train, Y_hat_train)

# Tick labels, in the sorted order that confusion_matrix uses for rows and columns.
# This assumes the model only predicts labels that appear in Y_train.
class_names = np.unique(np.ravel(Y_train))

def plot_confusion_matrix(cm, classes,
                          normalize=False,
                          title='Confusion matrix',
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if normalize:
        # Convert counts to per-row fractions (share of each true class)
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()

np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names,
                      title='Confusion matrix, without normalization')

plt.show()
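
Once a model starts making mistakes, the confusion matrix shows which classes suffer most. The sketch below computes per-class recall from a made-up matrix, following sklearn's convention that rows are true labels and columns are predicted labels. `per_class_recall` is a hypothetical helper.

```python
import numpy as np

def per_class_recall(cm):
    """Share of each true class that was predicted correctly (diagonal / row sum)."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Avoid dividing by zero for classes that have no true examples.
    safe_totals = np.where(row_totals == 0, 1.0, row_totals)
    return np.diag(cm) / safe_totals

# Made-up 3-class confusion matrix: rows are true labels, columns are predictions.
toy_cm = np.array([[8, 2, 0],
                   [1, 9, 0],
                   [0, 5, 5]])
recall = per_class_recall(toy_cm)
print(recall)
```

In this toy matrix the third class is confused with the second half of the time, which a single accuracy figure would hide.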

# Cross-validation

Because the validation and test labels are not supposed to be available at first, we use cross-validation to estimate our model's quality.

In [None]:
from sklearn.metrics import make_scorer
from sklearn.model_selection import cross_val_score

scores = cross_val_score(M, X_train, Y_train.ravel(), cv=5, scoring=make_scorer(scoring_function))
print('\nCV score (95 perc. CI): %0.2f (+/- %0.2f)' % (scores.mean(), scores.std() * 2))
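
What happens inside cross-validation can be sketched by hand. The version below is a simplified, shuffled k-fold split written with NumPy only; it is not sklearn's implementation (for classifiers, `cross_val_score` with an integer `cv` uses stratified folds), and `k_fold_indices` and `summarize` are hypothetical helper names.

```python
import numpy as np

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs; each sample is validated exactly once."""
    rng = np.random.RandomState(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)
    for i in range(k):
        valid_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, valid_idx

def summarize(scores):
    """Return the mean score and the two-standard-deviation spread printed above."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean(), 2 * scores.std()

splits = list(k_fold_indices(23, k=5))
for train_idx, valid_idx in splits:
    print("train on {} rows, validate on {}".format(len(train_idx), len(valid_idx)))

mean_score, spread = summarize([0.5, 0.7])
```

Each fold's score comes from rows the model did not see during fitting, which is why the averaged result is a fairer estimate than the training score.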

# Submission

## Example

This example requires python3 to be installed.

The following test checks whether a submission works with the ingestion program.

In [None]:
!python3 $problem_dir/ingestion.py $data_dir $result_dir $problem_dir $model_dir

### Test scoring program

In [None]:
scoring_output_dir = 'scoring_output'
!python3 $score_dir/score.py $data_dir $result_dir $scoring_output_dir

# Prepare the submission

In [None]:
import datetime 
from data_io import zipdir
the_date = datetime.datetime.now().strftime("%y-%m-%d-%H-%M")
sample_code_submission = './sample_code_submission_prep_' + the_date + '.zip'
sample_result_submission = './sample_result_submission_prep_' + the_date + '.zip'
zipdir(sample_code_submission, model_dir)
zipdir(sample_result_submission, result_dir)
print("Submit one of these files:\n" + sample_code_submission + "\n" + sample_result_submission)
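
Before uploading, it may be worth checking what an archive actually contains. The implementation of `zipdir` is not shown in this notebook, so the sketch below uses only the standard library: `zip_directory` is a hypothetical equivalent, and the file names are placeholders rather than the files a real submission needs.

```python
import os
import tempfile
import zipfile

def zip_directory(archive_path, source_dir):
    """Write every file under `source_dir` into the archive, with relative paths."""
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for root, _dirs, files in os.walk(source_dir):
            for name in files:
                full_path = os.path.join(root, name)
                archive.write(full_path, os.path.relpath(full_path, source_dir))

def list_archive(archive_path):
    """Return the sorted list of entries stored in a zip archive."""
    with zipfile.ZipFile(archive_path) as archive:
        return sorted(archive.namelist())

with tempfile.TemporaryDirectory() as workspace:
    # Placeholder submission folder with two dummy files.
    source = os.path.join(workspace, "submission")
    os.makedirs(source)
    for name in ("model.py", "metadata"):
        with open(os.path.join(source, name), "w") as handle:
            handle.write("# placeholder\n")

    # Keep the archive outside the folder being zipped.
    archive_path = os.path.join(workspace, "submission.zip")
    zip_directory(archive_path, source)
    archive_names = list_archive(archive_path)

print(archive_names)
```

Calling `list_archive` on the two files produced above is a quick way to confirm that nothing is missing.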