# Cruise ship crew member prediction - Linear regression

The task is to provide accurate estimates of how many crew members new ships under construction will require.

Data details:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.

    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96

In [4]:
import findspark
findspark.init('/home/matt/spark-3.1.1-bin-hadoop2.7')

In [5]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cruise').getOrCreate()

In [6]:
df = spark.read.csv('cruise_ship_info.csv',inferSchema=True,header=True)

In [7]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [8]:
df.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 5 rows



## Feature engineering
Encode the cruise line, which is a string column, as a numeric category index.

In [7]:
# which cruise lines are present, and how many ships does each have?
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [11]:
# use StringIndexer to add a column that encodes the cruise line as a numeric category index
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(df).transform(df)

In [16]:
# check indexing
indexed.select('Cruise_line','cruise_cat').head(5)

[Row(Cruise_line='Azamara', cruise_cat=16.0),
 Row(Cruise_line='Azamara', cruise_cat=16.0),
 Row(Cruise_line='Carnival', cruise_cat=1.0),
 Row(Cruise_line='Carnival', cruise_cat=1.0),
 Row(Cruise_line='Carnival', cruise_cat=1.0)]
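
The indices above are consistent with `StringIndexer`'s default ordering, which ranks labels by descending frequency so that the most frequent label gets index 0. The plain-Python sketch below rebuilds that ranking from the counts printed earlier. The alphabetical tie-break is an assumption about Spark 3.x behaviour, and `index_by_frequency` is a hypothetical helper, not part of PySpark.

```python
# Ship counts per cruise line, copied from the groupBy output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def index_by_frequency(label_counts):
    """Rank labels by descending count; ties are broken alphabetically (assumed)."""
    ordered = sorted(label_counts, key=lambda name: (-label_counts[name], name))
    return {name: float(position) for position, name in enumerate(ordered)}


cruise_index = index_by_frequency(counts)
for line in ("Royal_Caribbean", "Carnival", "Azamara"):
    print(line, cruise_index[line])
```

Under these assumptions the sketch assigns 1.0 to Carnival and 16.0 to Azamara, the same values the notebook shows. The index is therefore a popularity rank, not a measured quantity.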

## Feature Vectors and Labels

In [18]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [19]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [20]:
assembler = VectorAssembler(
  inputCols=['Age',
             'Tonnage',
             'passengers',
             'length',
             'cabins',
             'passenger_density',
             'cruise_cat'],
    outputCol="features")

In [21]:
output = assembler.transform(indexed)

In [24]:
# view the assembled feature vectors next to the label (crew)
output.select("features", "crew").show(5)

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
+--------------------+----+
only showing top 5 rows



In [25]:
# keep only the feature vector and the label for modelling
final_data = output.select("features", "crew")

In [26]:
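# split into roughly 70% training and 30% test data
# (no seed is set, so the split and the results below vary between runs)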
train_data,test_data = final_data.randomSplit([0.7,0.3])

## Train model

In [29]:
from pyspark.ml.regression import LinearRegression

# Create a Linear Regression Model object
lr = LinearRegression(labelCol='crew')

In [30]:
# Fit the model to the training data and call the fitted model lrModel
lrModel = lr.fit(train_data)

In [31]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [-0.019690012348180897,0.012769261606903858,-0.1597609944478677,0.38786608202047035,0.8614709298473009,-0.007933313502932004,0.0556933926775416] Intercept: -0.5513901088843238
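
To make these numbers concrete: a linear regression prediction is the intercept plus the dot product of the coefficients and the feature vector. The sketch below applies the coefficients printed above to the first ship shown earlier (Journey). It is plain Python, not a PySpark call, and `predict_crew` is a hypothetical helper. The coefficients come from this particular run, so a different random split would give different values.

```python
# Coefficients and intercept copied from the output above, in VectorAssembler order.
feature_names = ["Age", "Tonnage", "passengers", "length",
                 "cabins", "passenger_density", "cruise_cat"]
coefficients = [-0.019690012348180897, 0.012769261606903858,
                -0.1597609944478677, 0.38786608202047035,
                0.8614709298473009, -0.007933313502932004,
                0.0556933926775416]
intercept = -0.5513901088843238


def predict_crew(features):
    """Return intercept + sum(coefficient * feature) for one ship."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))


# Feature values for "Journey" (Azamara), taken from df.show(5) and the indexer output.
journey = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]
prediction = predict_crew(journey)
print("Predicted crew (100s): {:.2f}".format(prediction))

# Per-feature contributions show which terms drive the prediction.
for name, c, x in zip(feature_names, coefficients, journey):
    print("{:>18}: {:+.3f}".format(name, c * x))
```

Because `cruise_cat` enters the model as an ordinary number, its contribution is the coefficient times the index, so the model treats the frequency rank of a cruise line as if it were a quantity.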


## Evaluate model

In [32]:
test_results = lrModel.evaluate(test_data)

In [33]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))

RMSE: 0.5880268605193681
MSE: 0.34577558869226444
R2: 0.9793254200224946
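
For reference, the three metrics follow from the standard definitions sketched below in plain Python. The toy values are invented for illustration and are not taken from the cruise data. The function names are hypothetical helpers, not the PySpark API.

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, in the same units as the label."""
    return math.sqrt(mse(y_true, y_pred))


def r2(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values (illustration only).
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]
print("MSE:", mse(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("R2:", r2(y_true, y_pred))

# Consistency check on the reported numbers: RMSE should equal sqrt(MSE).
reported_mse = 0.34577558869226444
print("sqrt(reported MSE):", math.sqrt(reported_mse))
```

Since `crew` is measured in hundreds, an RMSE of about 0.59 corresponds to a typical error of roughly 59 crew members per ship on this test split.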


## Data exploration - check correlation

In [38]:
# R2 of about 0.98 is very good, so let's check the data a little closer
from pyspark.sql.functions import corr

In [39]:
# check the correlation between crew and the passengers feature
df.select(corr('crew','passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [40]:
# check the correlation between crew and the cabins feature
df.select(corr('crew','cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+



These strong correlations (about 0.92 and 0.95) indicate that passenger count and cabin count are closely tied to crew size, which helps explain the high R2.
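
The Pearson coefficient that `corr` reports can be sketched in plain Python as the covariance divided by the product of the standard deviations. The sample below uses only the five rows printed by `df.show(5)`, so it will not reproduce the full-data values above. `pearson` is a hypothetical helper, not a PySpark function.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance / (std_x * std_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First five rows from df.show(5): cabins and crew, both in hundreds.
cabins = [3.55, 3.55, 7.43, 14.88, 13.21]
crew = [3.55, 3.55, 6.7, 19.1, 10.0]

sample_r = pearson(cabins, crew)
print("corr(crew, cabins) on 5 rows: {:.3f}".format(sample_r))
```

A value near 1 means the two columns rise together almost linearly. It does not by itself show that one causes the other.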