# Aalto Pro/Diploma in Artificial Intelligence: Project work

# Recommending Virtual Assistant for Sustainable Waste Management Onboard a Cruise Ship

## Background

In this work, a data product is built with data science tools and machine learning (ML) algorithms to help a cruise ship operator plan future waste management operations in the most sustainable and feasible way. The ML product is constructed from real shipboard operational data and external web sources. The purpose is to create an ML system that gives a reasonable picture and prediction of the best option for the operations.

The main goal is to produce a working tool, not a product that gives a comprehensive picture of the whole situation. The work is done in JupyterLab with the Anaconda Python package management system, so the code is written in Python.

There are two aspects in this work: sustainability and feasibility. Evac Oy supplies integrated cleantech solutions, including waste management systems, to all types of ships. Sustainability is a key value for all major cruise ship owners, not only because of regulations and guidelines but also because of the acceptability of the cruise business as a whole. The sustainability part is done as a prediction using the mass balance of recyclable materials and the carbon balance derived from it. In addition to being sustainable, the operations should preferably be as economical as possible. This part is covered by market price data for recyclable materials and by the carbon price.
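
The relationship between the mass balance, the carbon balance and the economic value can be sketched as below. This is only a minimal illustration: the class `MaterialStream`, both functions, and every number (masses, emission factors, prices) are hypothetical placeholders, not measured shipboard data or real market values.

```python
from dataclasses import dataclass


@dataclass
class MaterialStream:
    """One recyclable material over a reporting period (all values are placeholders)."""
    name: str
    mass_kg: float              # recycled mass from the mass balance
    co2e_avoided_per_kg: float  # assumed emission factor, kg CO2e avoided per kg recycled
    market_price_per_kg: float  # assumed market price, EUR per kg


def carbon_balance_kg(streams):
    """Carbon balance derived from the mass balance: sum of mass times emission factor."""
    return sum(s.mass_kg * s.co2e_avoided_per_kg for s in streams)


def economic_value_eur(streams, carbon_price_per_tonne):
    """Material sales value plus the value of the avoided emissions at a given carbon price."""
    material_value = sum(s.mass_kg * s.market_price_per_kg for s in streams)
    carbon_value = carbon_balance_kg(streams) / 1000.0 * carbon_price_per_tonne
    return material_value + carbon_value


# Placeholder inputs for illustration only
streams = [
    MaterialStream("glass", mass_kg=100.0, co2e_avoided_per_kg=0.5, market_price_per_kg=0.02),
    MaterialStream("aluminium", mass_kg=10.0, co2e_avoided_per_kg=8.0, market_price_per_kg=1.0),
]

total_co2e = carbon_balance_kg(streams)
total_value = economic_value_eur(streams, carbon_price_per_tonne=100.0)
print(total_co2e, total_value)
```

In the actual project, the masses would come from the IoT waste data, and the prices from the external market data sources.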

### Goals and risks

The goal is to study how to extract usable data from the existing sources and how to clean and explore it, so that usable predictive ML models can be produced and later retrained with fresh data from the IoT system. If this goal is achieved, it may be worth extending and further developing the work into a usable data science product.

The IoT system to be used for this purpose is installed on and connected to our recent new-build cruise ship project. The first risk is that the data may become available later than planned. In that case, we need to somehow "construct" the data based on our best knowledge. I am planning to parse a part of the data from the web. The second risk is that the data is too complex to be parsed and cleaned into a usable form. The third risk is that the available relevant data may not be sufficient for training the constructed ML model.

### Data Perspective

Both internal and external data sources are used. The internal data is dry and wet waste production data, derived daily from the ship's IoT system as CSV files: one file for dry waste production and one for wet waste production. The external data is the historical trend of market values of recycled materials and the value of carbon in terms of greenhouse gas emission abatement.

Concerning data availability, I have already contacted the ship owner for the waste data. As mentioned, the raw data will be in the form of CSV files. Some data wrangling with pandas is needed, as well as calculations for deriving the actual values. The challenge with the shipboard data is the timing of its availability. We are working on that matter. As for market value information, some data has already been obtained. However, I would like to automate this activity as a continuous retrieval of the information, which would update the output signal of the ML model each time new data becomes available. The challenge here may be to find a data source that produces a single value. For the purpose of this project work, we may need to use a source that is fairly easily available.
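
The pandas wrangling of the two CSV files could look roughly like the sketch below. The real file layout is not known yet, so the column names (`timestamp`, `dry_waste_kg`, `wet_waste_kg`), the helper `load_daily` and the in-memory sample rows are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical file contents standing in for the two IoT CSV exports
dry_csv = io.StringIO(
    "timestamp,dry_waste_kg\n"
    "2020-01-01 08:00,120.0\n"
    "2020-01-01 20:00,80.0\n"
    "2020-01-02 08:00,150.0\n"
)
wet_csv = io.StringIO(
    "timestamp,wet_waste_kg\n"
    "2020-01-01 09:00,300.0\n"
    "2020-01-02 09:00,\n"
    "2020-01-02 21:00,250.0\n"
)


def load_daily(csv_source, value_column):
    """Read one CSV, drop rows with a missing reading, and sum the readings per day."""
    frame = pd.read_csv(csv_source, parse_dates=["timestamp"])
    frame = frame.dropna(subset=[value_column])
    return frame.set_index("timestamp")[value_column].resample("D").sum()


# One row per day, one column per waste type
daily = pd.concat(
    [load_daily(dry_csv, "dry_waste_kg"), load_daily(wet_csv, "wet_waste_kg")],
    axis=1,
)
print(daily)
```

With real files, `csv_source` would simply be the file path. Whether missing readings should be dropped, as here, or interpolated is a design choice that depends on how the IoT system reports gaps.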



# Data sources and methods

## The workflow for the data science and for producing the machine learning model

The text will be added here. Before that, I include a scikit-learn rehearsal below:

In [4]:
import seaborn as sns
iris = sns.load_dataset('iris')
iris.head(8)

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa
5,5.4,3.9,1.7,0.4,setosa
6,4.6,3.4,1.4,0.3,setosa
7,5.0,3.4,1.5,0.2,setosa


or:

In [7]:
from sklearn.datasets import load_iris
iris = load_iris()
type(iris)

sklearn.utils.Bunch

In [11]:
print(iris.data)

[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.2]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.6 1.4 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1 4.

In [10]:
print(iris.feature_names)

['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']


In [12]:
# print integers representing the species of each observation
print(iris.target)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]


In [13]:
# print the encoding scheme for species: 0 = setosa, 1 = versicolor, 2 = virginica
print(iris.target_names)

['setosa' 'versicolor' 'virginica']


# Requirements for working with data in scikit-learn

1. Features and response are **separate objects**

2. Features and response should be **numeric**

3. Features and response should be **NumPy arrays**

4. Features and response should have **specific shapes**
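
As a small illustration of these four requirements, the sketch below converts a table with a text species column (like the seaborn version of the dataset shown earlier) into separate numeric NumPy arrays. The names `frame`, `X_demo` and `y_demo` are illustrative, and using `LabelEncoder` for the response is one possible choice, not the only one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Four rows taken from the iris outputs shown earlier in this notebook
frame = pd.DataFrame(
    {
        "sepal_length": [5.1, 4.9, 7.0, 6.4],
        "sepal_width": [3.5, 3.0, 3.2, 3.2],
        "petal_length": [1.4, 1.4, 4.7, 4.5],
        "petal_width": [0.2, 0.2, 1.4, 1.5],
        "species": ["setosa", "setosa", "versicolor", "versicolor"],
    }
)

feature_columns = ["sepal_length", "sepal_width", "petal_length", "petal_width"]

# Requirements 1-3: features and response are separate, numeric NumPy arrays
X_demo = frame[feature_columns].to_numpy(dtype=float)
encoder = LabelEncoder()
y_demo = encoder.fit_transform(frame["species"])  # text labels become integers

# Requirement 4: (n_observations, n_features) and (n_observations,)
print(X_demo.shape, y_demo.shape)
print(encoder.classes_)
```

The bundled `load_iris()` dataset used in this notebook already satisfies all four requirements, as the checks below confirm.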

In [19]:
# check the types of the features and response

print(type(iris.data))
print(type(iris.target))

<class 'numpy.ndarray'>
<class 'numpy.ndarray'>


In [20]:
# check the shape of the features (1st dim = number of observations, 2nd dim = number of features)
print(iris.data.shape)

(150, 4)


In [21]:
# check the shape of the response (single dimension matching the number of observations)
print(iris.target.shape)

(150,)


In [22]:
# store the feature matrix in "x"
x = iris.data

# store the response vector in "y"
y = iris.target

# Training a machine learning model with scikit-learn

- What is the **K-nearest neighbors** classification model?

- What are the four steps for **model training and prediction** in scikit-learn?

- How can I apply this pattern to **other machine learning models?**

# K-nearest neighbors (KNN) classification

1. Pick a value for K (such as 5)

2. Search for the K observations in the training data that are "nearest" to the measurements of the unknown iris.

3. Use the most popular response value from the K nearest neighbors as the predicted response value for the unknown iris.
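
The three steps above can be sketched from scratch with NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `knn_predict` is made up, "nearest" is assumed to mean Euclidean distance, and ties are broken by whichever label is counted first.

```python
from collections import Counter

import numpy as np


def knn_predict(train_X, train_y, query, k=5):
    """Predict the label of one query point by majority vote of its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every training observation
    distances = np.linalg.norm(train_X - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: most popular response value among the k nearest neighbors
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# A few iris measurements from the outputs above: 0 = setosa, 1 = versicolor
train_X = np.array(
    [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [7.0, 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
    ]
)
train_y = np.array([0, 0, 0, 1, 1])

# Step 1: pick a value for K (3 here because the toy training set is tiny)
unknown = np.array([5.0, 3.4, 1.5, 0.2])
predicted = knn_predict(train_X, train_y, unknown, k=3)
print(predicted)
```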


In [23]:
# Verify that x and y have appropriate shapes:
print(x.shape)
print(y.shape)

(150, 4)
(150,)


# scikit-learn 4-step modeling pattern

**Step 1:** Import the class you plan to use

In [24]:
from sklearn.neighbors import KNeighborsClassifier

**Step 2:** "Instantiate" the "estimator"

- "Estimator" is scikit-learn's term for model

- "Instantiate" means "make an instance of"

In [25]:
knn = KNeighborsClassifier(n_neighbors=1)

- Name of the object does not matter

- Can specify tuning parameters (aka "hyperparameters"; in this case, the n_neighbors parameter) during this step

- All parameters not specified are set to their defaults

In [26]:
print(knn)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')


**Step 3:** Fit the model with data (aka "model training")

- Model is learning the relationship between x and y

- Occurs in-place

In [27]:
knn.fit(x, y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=None, n_neighbors=1, p=2,
           weights='uniform')

**Step 4:** Predict the response for a new observation

- New observations are called "out-of-sample" data

- Uses the information it learned during the model training process

In [29]:
knn.predict([[3, 5, 4, 2]])

array([2])

- Returns a NumPy array that is the prediction of the target (species of iris)

- Can predict for multiple observations at once

In [30]:
X_new = [[3, 5, 4, 2], [5, 4, 3, 2]]
knn.predict(X_new)

array([2, 1])

# Using a different value for K

In [31]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(x, y)

# predict the response for new observations
knn.predict(X_new)

array([1, 1])
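
The prediction for the first observation changes between K=1 and K=5. One way to see why is to look at the labels of the nearest neighbors and the resulting vote shares. The sketch below uses the `kneighbors` and `predict_proba` methods of `KNeighborsClassifier`; the variable names `iris_demo`, `query` and `results` are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
query = [[3, 5, 4, 2]]

results = {}
for k in (1, 5):
    model = KNeighborsClassifier(n_neighbors=k).fit(iris_demo.data, iris_demo.target)
    # indices of the k training observations closest to the query
    distances, indices = model.kneighbors(query)
    neighbor_labels = iris_demo.target[indices[0]]
    # share of neighbors voting for each class (0, 1, 2)
    vote_shares = model.predict_proba(query)[0]
    results[k] = (neighbor_labels, vote_shares)
    print(k, neighbor_labels, vote_shares)
```

With K=1 the single closest observation decides the outcome alone, while with K=5 it can be outvoted by the other four neighbors.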

# Using a different classification model

In [32]:
# import the class
from sklearn.linear_model import LogisticRegression

# instantiate the model (using the default parameters)
logreg = LogisticRegression()

# fit the model with data
logreg.fit(x, y)

# predict the response for new observations
logreg.predict(X_new)



array([2, 0])
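
Because every estimator exposes the same `fit` and `predict` methods, the four-step pattern can be applied to several models in one loop. This is a minimal sketch: the dictionary keys are arbitrary names, and `max_iter=1000` is an assumption added so that logistic regression converges on recent scikit-learn versions. The predicted values may differ from the outputs above depending on the installed version and its default solver.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

iris_demo = load_iris()
new_observations = [[3, 5, 4, 2], [5, 4, 3, 2]]

# Steps 1 and 2: import the classes and instantiate the estimators
estimators = {
    "knn_k1": KNeighborsClassifier(n_neighbors=1),
    "knn_k5": KNeighborsClassifier(n_neighbors=5),
    "logreg": LogisticRegression(max_iter=1000),
}

predictions = {}
for name, estimator in estimators.items():
    # Step 3: fit the model with data
    estimator.fit(iris_demo.data, iris_demo.target)
    # Step 4: predict the response for new observations
    predictions[name] = estimator.predict(new_observations)
    print(name, predictions[name])
```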