# Iceberg Classification Step 1: Preprocessing the data

This notebook is the data preprocessing step of the end-to-end pipeline.

It performs the following operations:
- Reads `train.json` as the input file
- Engineers a new feature, `band_avg`, as the element-wise average of the two radar bands
- Saves the preprocessed dataset to a Hopsworks dataset

In [1]:
import os

import hops
from hops import hdfs
from hops import pandas_helper as pd

# SparkSession available as 'spark'
print(
    "-----------------------------------------------\n"
    "This notebook is tested with:\n"
    f"  - Hopsworks {hops.__version__}.\n"
    f"  - Spark {spark.version}.\n"
)

Starting Spark application


| ID | YARN Application ID | Kind | State | Spark UI | Driver log |
| --- | --- | --- | --- | --- | --- |
| 61 | application_1623657485747_0001 | pyspark | idle | Link | Link |


SparkSession available as 'spark'.
-----------------------------------------------
This notebook is tested with:
  - Hopsworks 2.2.0.1.
  - Spark 2.4.3.2.

## Define relevant paths

In [2]:
# 'DATA_FOLDER' refers to the folder name under the Data Sets page in the Hopsworks UI.
# In this case, the training dataset is stored in the 'eodata' folder.
DATA_FOLDER = 'eodata'

# Get the paths to read the original dataset and save the preprocessed dataset.
train_ds_path = os.path.join(hdfs.project_path(), DATA_FOLDER, 'train.json')
train_preprocessed_all_ds_path = os.path.join(hdfs.project_path(), DATA_FOLDER, 'train_preprocessed_all.json')

print(f"train_ds_path: {train_ds_path}")
print(f"train_preprocessed_all_ds_path: {train_preprocessed_all_ds_path}")

train_ds_path: hdfs://rpc.namenode.service.consul:8020/Projects/demo_ml_meb10180/eodata/train.json
train_preprocessed_all_ds_path: hdfs://rpc.namenode.service.consul:8020/Projects/demo_ml_meb10180/eodata/train_preprocessed_all.json

## Read the raw data

The [iceberg dataset](https://www.kaggle.com/c/statoil-iceberg-classifier-challenge/data) (`train.json`) is in JSON format. The file consists of a list of images, and each image has the following fields:

- ``id``: the id of the image
- ``band_1``, ``band_2``: the flattened image data. Each band is a 75x75 image, so each list has 5625 elements. Note that these values are not the usual non-negative integers found in image files, because they have a physical meaning: they are floating-point numbers in dB. Band 1 and Band 2 are signals characterized by radar backscatter produced from different polarizations at a particular incidence angle. The polarizations correspond to HH (transmit and receive horizontally) and HV (transmit horizontally and receive vertically). More background on the satellite imagery can be found on the dataset page linked above.
- ``inc_angle``: the incidence angle at which the image was taken. Note that this field has missing data marked as "na", and the images with "na" incidence angles are all in the training data to prevent leakage.
- ``is_iceberg``: the target variable, set to 1 if it is an iceberg, and 0 if it is a ship. This field only exists in train.json.
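
Since each band is stored as a flat list of 5625 values, it usually has to be reshaped into a 75x75 array before it can be viewed or fed to an image model. The following is a minimal sketch of that reshaping, not part of the original notebook. It assumes the values are stored in row-major order (which the field description does not state), and `band_to_image` and `flat_band` are hypothetical names with made-up data.

```python
import numpy as np

IMAGE_SIDE = 75

# Hypothetical stand-in for one band of one record: 5625 made-up float values.
flat_band = [float(i) for i in range(IMAGE_SIDE * IMAGE_SIDE)]

def band_to_image(flat_values, side=IMAGE_SIDE):
    """Reshape a flattened band into a (side, side) array.

    Assumption: the flat list is in row-major order.
    """
    values = np.asarray(flat_values, dtype=np.float32)
    if values.size != side * side:
        raise ValueError(f"expected {side * side} values, got {values.size}")
    return values.reshape(side, side)

image = band_to_image(flat_band)
print(image.shape)
```

The size check is there so that a truncated or malformed record fails loudly instead of producing a silently wrong image shape.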

In [3]:
# Read the raw data into a pandas DataFrame using the pandas helper provided by Hopsworks.
# Note that the read_json function provided by Hopsworks is needed since we are reading over HDFS.
raw_train_df = pd.read_json(train_ds_path)

In [4]:
raw_train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1604 entries, 0 to 1603
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          1604 non-null   object
 1   band_1      1604 non-null   object
 2   band_2      1604 non-null   object
 3   inc_angle   1604 non-null   object
 4   is_iceberg  1604 non-null   int64 
dtypes: int64(1), object(4)
memory usage: 62.8+ KB
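
The `object` dtype reported for `inc_angle` is consistent with the "na" markers described earlier: a column that mixes floats and strings cannot be stored as a numeric dtype. This notebook leaves the column as it is. If a later step needs numeric angles, one possible approach is sketched below on a toy frame with made-up values, using standard pandas (imported here as `pandas`, to avoid confusion with the Hopsworks helper aliased as `pd` above).

```python
import pandas

# Toy frame mimicking the 'inc_angle' column; the values are made up.
toy_df = pandas.DataFrame({"inc_angle": [43.92, "na", 38.16]})

# errors="coerce" turns anything non-numeric (here the string "na") into NaN.
toy_df["inc_angle"] = pandas.to_numeric(toy_df["inc_angle"], errors="coerce")

print(toy_df["inc_angle"].dtype)
print(toy_df["inc_angle"].isna().sum())
```

How the resulting NaN values should be handled (dropping the rows, imputing, or keeping them) is a modelling decision that this sketch does not make.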

In [5]:
raw_train_df

            id  ... is_iceberg
0     dfd5f913  ...          0
1     e25388fd  ...          0
2     58b2aaa0  ...          1
3     4cfc3a18  ...          0
4     271f93f4  ...          0
...        ...  ...        ...
1599  04e11240  ...          0
1600  c7d6f6f8  ...          0
1601  bba1a0f1  ...          0
1602  7f66bb44  ...          0
1603  9d8f326c  ...          0

[1604 rows x 5 columns]

## Create new feature band_avg

In [6]:
# Return the element-wise average of the 'band_1' and 'band_2' lists of a row.
def list_avg(row):
    return [(b1 + b2) / 2 for b1, b2 in zip(row['band_1'], row['band_2'])]

# Construct a new feature called 'band_avg' as the element-wise average of 'band_1' and 'band_2'.
raw_train_df['band_avg'] = raw_train_df.apply(list_avg, axis=1)
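
As an illustration of the same element-wise average, the sketch below computes it with NumPy arrays on a toy frame. It is an alternative formulation, not the method used in this notebook. The data is made up and `toy_df` is a hypothetical name; standard pandas is imported as `pandas`.

```python
import numpy as np
import pandas

# Toy frame with two rows and three "pixels" per band; the values are made up.
toy_df = pandas.DataFrame({
    "band_1": [[-20.0, -22.0, -24.0], [-10.0, -12.0, -14.0]],
    "band_2": [[-30.0, -28.0, -26.0], [-20.0, -18.0, -16.0]],
})

# Average the two bands position by position, one row at a time,
# and store the result as a plain Python list so it serializes to JSON.
toy_df["band_avg"] = [
    ((np.asarray(b1) + np.asarray(b2)) / 2).tolist()
    for b1, b2 in zip(toy_df["band_1"], toy_df["band_2"])
]

print(toy_df["band_avg"].tolist())
```

Converting back to a list with `tolist()` keeps the column in the same form as `band_1` and `band_2`, so it can be written out the same way.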

In [7]:
raw_train_df

            id  ...                                           band_avg
0     dfd5f913  ...  [-27.516239499999998, -28.346024, -29.84960749...
1     e25388fd  ...  [-21.874347999999998, -21.4524295, -20.7830205...
2     58b2aaa0  ...  [-24.737316, -24.348173, -22.762496, -21.28190...
3     4cfc3a18  ...  [-25.172013999999997, -25.301306500000003, -25...
4     271f93f4  ...  [-26.6069355, -26.712035999999998, -26.7120359...
...        ...  ...                                                ...
1599  04e11240  ...  [-29.4237985, -29.105365, -26.472991999999998,...
1600  c7d6f6f8  ...  [-27.437631500000002, -27.400965, -27.76694599...
1601  bba1a0f1  ...  [-21.723625, -23.7647725, -23.9906165, -22.930...
1602  7f66bb44  ...  [-24.262994499999998, -23.944199, -24.2661145,...
1603  9d8f326c  ...  [-22.1770305, -22.817203499999998, -23.9654685...

[1604 rows x 6 columns]

In [8]:
# Write the preprocessed dataframe to a local JSON file.
raw_train_df.to_json(path_or_buf='train_preprocessed_all.json', orient='records')

# Copy the local file into 'DATA_FOLDER' on HDFS.
# Note that you need to be the owner of 'DATA_FOLDER' to write to it;
# otherwise, a permission error may occur.
hdfs.copy_to_hdfs('train_preprocessed_all.json', DATA_FOLDER, overwrite=True)

Started copying local path train_preprocessed_all.json to hdfs path hdfs://rpc.namenode.service.consul:8020/Projects/demo_ml_meb10180/eodata/train_preprocessed_all.json

Finished copying
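
To make the layout of the saved file concrete, the sketch below shows what `orient='records'` produces, using a toy frame and standard pandas (imported as `pandas`). It serializes to an in-memory string rather than to HDFS, and the column values are made up.

```python
import io
import json
import pandas

# Toy frame with the same kinds of columns as the preprocessed data.
toy_df = pandas.DataFrame({
    "id": ["a1", "b2"],
    "band_avg": [[-25.0, -24.5], [-15.0, -14.5]],
    "is_iceberg": [0, 1],
})

# With no path given, to_json returns the JSON text instead of writing a file.
json_text = toy_df.to_json(orient="records")

# orient="records" yields a JSON list with one object per row.
records = json.loads(json_text)
print(records[0])

# The same text can be read back into a dataframe.
restored = pandas.read_json(io.StringIO(json_text), orient="records")
print(restored.shape)
```

A list of per-row objects matches the structure of the original `train.json`, so the preprocessed file can be read the same way as the raw one.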

# End of Step 1