# Areal Project

<div>
<img src="logo.jpg" width="150" align="left" border="20">

ALL INFORMATION, SOFTWARE, DOCUMENTATION, AND DATA ARE PROVIDED "AS-IS". THE CDS, CHALEARN, AND/OR OTHER ORGANIZERS OR CODE AUTHORS DISCLAIM ANY EXPRESSED OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR ANY PARTICULAR PURPOSE, AND THE WARRANTY OF NON-INFRINGEMENT OF ANY THIRD PARTY'S INTELLECTUAL PROPERTY RIGHTS. IN NO EVENT SHALL AUTHORS AND ORGANIZERS BE LIABLE FOR ANY SPECIAL,
INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF SOFTWARE, DOCUMENTS, MATERIALS, PUBLICATIONS, OR INFORMATION MADE AVAILABLE FOR THE CHALLENGE. 
</div>

<div>
    <h2>Introduction </h2>
    <p>
     <br>
Aerial imagery has long been a primary source of geographic data. As technology has progressed, it has become a practical tool for remote sensing: the science of obtaining information about an object, area, or phenomenon from a distance.
New challenges in remote sensing call for pixel classification methods that, once trained on a given dataset, generalize to other areas of the Earth.
In this challenge, we will therefore design pixel classification methods for such areas. The goal is to find urban areas in the Areal dataset, a small dataset created from the <a href="https://project.inria.fr/aerialimagelabeling/">Inria Aerial Image Labeling Dataset</a>. The dataset covers a wide range of urban settlement appearances from five cities in different geographic locations. It is divided into three parts: a training set, a validation set, and a test set.

References and credits: 
Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat, Pierre Alliez.
</div>

In [1]:
import numpy as np
import random
import re

In [2]:
model_dir = 'sample_code_submission'         # holds model.py
result_dir = 'sample_result_submission_pp/'  # predictions are written here
problem_dir = 'ingestion_program/'           # data_io and data_manager
score_dir = 'scoring_program/'               # libscores

In [3]:
import sys

# Make the model, ingestion, and scoring modules importable.
sys.path.extend([model_dir, problem_dir, score_dir])

In [4]:
from model import BaselineModel
from model import BaselineModel2

<div>
    <h1> Step 1: Exploratory data analysis </h1>
<p>
The starting kit ships with sample_data only. To prepare your submission, download public_data from the challenge website and point data_dir to it.
</p>
</div>
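
One way to switch between the two datasets is sketched below. It assumes the downloaded data sits in a folder named `public_data` next to this notebook; that location is an assumption, and `pick_data_dir` is a hypothetical helper, not part of the starting kit.

```python
import os

def pick_data_dir(candidates):
    """Return the first directory in `candidates` that exists, else None."""
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    return None

# 'public_data' is an assumed location for the downloaded challenge data;
# 'sample_preprocessed_data' is the sample folder used in this notebook.
chosen_dir = pick_data_dir(['public_data', 'sample_preprocessed_data'])
print('Using data from:', chosen_dir)
```

If neither folder exists, `chosen_dir` is `None`, which makes a missing download easy to spot before any data is read.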

In [5]:
data_dir = 'sample_preprocessed_data'
data_name = 'Areal'

In [6]:
from ingestion_program.data_io import read_as_df
data = read_as_df(data_dir  + '/' + data_name)

Reading sample_preprocessed_data/Areal_train from AutoML format
Number of examples = 300
Number of features = 4096
        Class
0       beach
1   chaparral
2       cloud
3      desert
4      forest
5      island
6        lake
7      meadow
8    mountain
9       river
10    sea_ice
11   snowberg
12    wetland
Number of classes = 13


In [7]:
data.head()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_4088,feature_4089,feature_4090,feature_4091,feature_4092,feature_4093,feature_4094,feature_4095,feature_4096,target
0,0.0,0.0,0.0,0.0,0.086625,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.014147,0.0,0.95539,0.0,0.0,0.0,forest
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.312153,0.0,0.0,0.0,2.953564,0.0,0.0,1.072329,chaparral
2,0.0,0.0,0.0,2.515129,0.095439,0.0,0.045738,0.148969,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,chaparral
3,0.0,0.0,0.0,0.0,1.188573,0.0,0.0,0.0,0.0,0.0,...,2.228957,1.464646,0.0,0.0,0.0,1.62498,0.0,0.0,0.0,beach
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,2.063259,0.0,0.0,0.0,1.344442,0.0,0.0,0.552447,river


In [8]:
data.describe()

Unnamed: 0,feature_1,feature_2,feature_3,feature_4,feature_5,feature_6,feature_7,feature_8,feature_9,feature_10,...,feature_4087,feature_4088,feature_4089,feature_4090,feature_4091,feature_4092,feature_4093,feature_4094,feature_4095,feature_4096
count,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,...,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0,300.0
mean,0.006075,0.000996,0.035626,0.198109,0.306302,0.145392,0.029904,0.015574,0.017993,0.288413,...,0.004769,0.60952,0.59716,0.32359,0.020528,0.031902,1.481293,0.0,0.004534,1.239524
std,0.105224,0.014681,0.23613,0.668374,0.436969,0.522102,0.198194,0.179693,0.155665,0.591321,...,0.082608,1.323339,0.917379,1.01901,0.174347,0.250427,1.776533,0.0,0.078524,1.317253
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,0.0,0.0,0.0,0.063618,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.872153,0.0,0.0,0.898127
75%,0.0,0.0,0.0,0.0,0.523254,0.0,0.0,0.0,0.0,0.366293,...,0.0,0.559731,0.934315,0.0,0.0,0.0,2.437103,0.0,0.0,2.187367
max,1.822537,0.249662,2.510623,4.527941,2.389151,3.870526,1.803066,2.875917,2.048211,4.265208,...,1.430818,9.752917,4.911635,7.399162,2.014147,3.326735,9.211884,0.0,1.360076,5.854851
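
In the summary above, feature_4094 has a mean, standard deviation, and maximum of 0.0, so it is constant on this sample and carries no information. The sketch below shows one way to list such columns. It uses plain Python lists and toy values rather than the real data, and `constant_columns` is a hypothetical helper.

```python
def constant_columns(rows):
    """Return indices of columns whose value is identical in every row."""
    if not rows:
        return []
    columns = zip(*rows)  # transpose: one tuple per feature
    return [i for i, col in enumerate(columns) if len(set(col)) == 1]

# Toy values for illustration only (not taken from the dataset).
toy_rows = [
    [0.0, 1.5, 0.0],
    [0.2, 0.3, 0.0],
    [0.0, 2.1, 0.0],
]
print(constant_columns(toy_rows))  # -> [2]
```

On the DataFrame itself, a comparable check would probably be `data.loc[:, data.nunique() == 1].columns`.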


In [9]:
print(data.iloc[:, -1:])
X = data.iloc[:, :-1]
y = data.iloc[:, -1:]

        target
0       forest
1    chaparral
2    chaparral
3        beach
4        river
5        beach
6    chaparral
7        cloud
8       forest
9    chaparral
10      island
11     sea_ice
12      meadow
13       river
14      forest
15      desert
16       cloud
17       river
18      island
19      meadow
20       river
21       beach
22       beach
23       beach
24    mountain
25       river
26     wetland
27       cloud
28       cloud
29        lake
..         ...
270    wetland
271  chaparral
272     meadow
273   mountain
274    wetland
275   mountain
276    sea_ice
277  chaparral
278   snowberg
279    sea_ice
280      cloud
281    wetland
282   mountain
283     desert
284      river
285     island
286    sea_ice
287      cloud
288     forest
289   snowberg
290   mountain
291      river
292    sea_ice
293      cloud
294   snowberg
295      cloud
296      cloud
297     island
298       lake
299     forest

[300 rows x 1 columns]
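
With 13 classes and only 300 examples, it is worth checking how evenly the classes are represented. The sketch below counts labels with the standard library. It uses only the first ten targets printed above, and `class_counts` is a hypothetical helper.

```python
from collections import Counter

def class_counts(labels):
    """Count how many examples carry each class label, most frequent first."""
    return Counter(labels).most_common()

# First ten targets printed above, copied by hand for illustration.
first_ten = ['forest', 'chaparral', 'chaparral', 'beach', 'river',
             'beach', 'chaparral', 'cloud', 'forest', 'chaparral']
for label, count in class_counts(first_ten):
    print(f'{label:<10} {count}')
```

On the full DataFrame, `data['target'].value_counts()` gives the same kind of summary directly.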


# Step 2: Building a predictive model

In [10]:
from data_manager import DataManager
D = DataManager(data_name, data_dir)
print(D)

Info file found : /home/samuel/Documents/Cours/M2_AIC/Remote-Sensing-Image/starting_kit/sample_preprocessed_data/Areal_public.info
DataManager : Areal
info:
	usage = Sample dataset Areal preprocessed data
	name = areal
	task = multiclass.classification
	target_type = Categorical
	feat_type = Numerical
	metric = accuracy
	time_budget = 12000
	feat_num = 4096
	target_num = 13
	label_num = 13
	train_num = 300
	valid_num = 100
	test_num = 0
	has_categorical = 0
	has_missing = 0
	is_sparse = 0
	format = dense
data:
	X_train = array(300, 4096)
	Y_train = array(300, 1)
	X_valid = array(100, 4096)
	Y_valid = array(100, 1)
	X_test = array(100, 4096)
	Y_test = array(100, 1)
feat_type:	array(4096,)
feat_idx:	array(0,)



In [11]:
X_train = D.data['X_train']
Y_train = D.data['Y_train']

In [12]:
M = BaselineModel()
M2 = BaselineModel2()

In [13]:
M.fit(X_train, Y_train.reshape(-1))
M2.fit(X_train, Y_train.reshape(-1))

FIT: dim(X)= [300, 4096]
FIT: dim(y)= [300, 1]
FIT: dim(X)= [300, 4096]
FIT: dim(y)= [300, 1]
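
BaselineModel and BaselineModel2 are defined in model.py, which is not reproduced here. The calls above only show that a model exposes `fit(X, y)` and `predict(X)`. As a rough illustration of that interface, here is a hypothetical majority-class model. It is not the kit's baseline, and the toy data is invented.

```python
from collections import Counter

class MajorityClassModel:
    """Hypothetical stand-in: always predicts the most frequent training label."""

    def __init__(self):
        self.majority_label = None

    def fit(self, X, y):
        # X is ignored; only the label distribution matters here.
        self.majority_label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        if self.majority_label is None:
            raise RuntimeError('call fit() before predict()')
        # One prediction per input row.
        return [self.majority_label for _ in X]

# Toy data for illustration only.
toy_X = [[0.0, 1.0], [0.5, 0.2], [0.1, 0.9]]
toy_y = ['forest', 'beach', 'forest']
model = MajorityClassModel().fit(toy_X, toy_y)
print(model.predict([[0.3, 0.3], [0.7, 0.1]]))  # -> ['forest', 'forest']
```

A model like this gives a floor against which the validation scores of real models can be compared.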


In [14]:
Y_hat_train = M.predict(D.data['X_train'])
Y_hat_valid = M.predict(D.data['X_valid'])
Y_hat_test = M.predict(D.data['X_test'])

Y_hat_train2 = M2.predict(D.data['X_train'])
Y_hat_valid2 = M2.predict(D.data['X_valid'])
Y_hat_test2 = M2.predict(D.data['X_test'])

PREDICT: dim(X)= [300, 4096]
PREDICT: dim(y)= [300, 1]
PREDICT: dim(X)= [100, 4096]
PREDICT: dim(y)= [100, 1]
PREDICT: dim(X)= [100, 4096]
PREDICT: dim(y)= [100, 1]
PREDICT: dim(X)= [300, 4096]
PREDICT: dim(y)= [300, 1]
PREDICT: dim(X)= [100, 4096]
PREDICT: dim(y)= [100, 1]
PREDICT: dim(X)= [100, 4096]
PREDICT: dim(y)= [100, 1]


In [15]:
from data_io import mkdir, write

# m.save(trained_model_name)
# Write the predictions of the first baseline (M) to the result directory.
result_name = result_dir + data_name
mkdir(result_dir)

write(result_name + '_train.predict', Y_hat_train)
write(result_name + '_valid.predict', Y_hat_valid)
write(result_name + '_test.predict', Y_hat_test)
!ls $result_name*

sample_result_submission_pp/Areal_test.predict
sample_result_submission_pp/Areal_train.predict
sample_result_submission_pp/Areal_valid.predict


# Scoring predictions

In [16]:
from libscores import get_metric
metric_name, scoring_function = get_metric()
print('Using scoring metric:', metric_name)

Using scoring metric: accuracy
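
Accuracy is the fraction of examples whose predicted label matches the true label. The function returned by `get_metric()` lives in libscores and is not shown here, so the sketch below is a minimal stand-alone version for intuition only. The `accuracy` function and the toy labels are assumptions, not the scoring program's code.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError('y_true and y_pred must have the same length')
    if len(y_true) == 0:
        raise ValueError('cannot score an empty set of labels')
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Toy labels for illustration only.
truth = ['beach', 'lake', 'river', 'cloud']
guess = ['beach', 'lake', 'forest', 'desert']
print('%5.4f' % accuracy(truth, guess))  # -> 0.5000
```

Scoring a label list against itself returns 1.0.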


In [17]:
# Sanity check: scoring the labels against themselves gives the ideal score.
print('Ideal score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_train), "\n")

# Scores for the first baseline (M).
print('Training score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_hat_train))
print('Validation score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_valid'], Y_hat_valid))
print('Test score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_test'], Y_hat_test))
print()
# Scores for the second baseline (M2).
print('Training score for the', metric_name, 'metric = %5.4f' % scoring_function(Y_train, Y_hat_train2))
print('Validation score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_valid'], Y_hat_valid2))
print('Test score for the', metric_name, 'metric = %5.4f' % scoring_function(D.data['Y_test'], Y_hat_test2))

Ideal score for the accuracy metric = 1.0000 

Training score for the accuracy metric = 1.0000
Validation score for the accuracy metric = 0.3500
Test score for the accuracy metric = 0.3000

Training score for the accuracy metric = 1.0000
Validation score for the accuracy metric = 0.5200
Test score for the accuracy metric = 0.4700
