# Predicting the crew size from cruise ship attributes

A large ship manufacturer has asked us for accurate estimates of how many crew members a ship will require.

The manufacturer will pass these estimates on to the customers buying its cruise ships, to help with their purchase decisions.

We will build a regression model that predicts how many crew members future ships will need, based on the ships' attributes.

### Importing Libraries

In [1]:
import os
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.ml.regression import LinearRegression

### Creating the `SparkSession` and importing the dataset

In [2]:
spark = SparkSession.builder.appName('ship_crews').getOrCreate()

In [3]:
os.chdir('..')

In [4]:
DATA_FILE = os.path.join(os.getcwd(), 'data', 'cruise_ship_info.csv')
df = spark.read.csv(DATA_FILE, inferSchema=True, header=True)
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



### EDA and summaries

In [5]:
print("Rows ->", df.count())

Rows -> 158


In [6]:
stringCols = [item[0] for item in df.dtypes if item[1].startswith('string')]
numCols = [item[0] for item in df.dtypes if item[0] not in stringCols]
print(stringCols)
print(numCols)

['Ship_name', 'Cruise_line']
['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'crew']


In [7]:
df.select(numCols).describe().show(truncate=False, vertical=True)

-RECORD 0-------------------------------
 summary           | count              
 Age               | 158                
 Tonnage           | 158                
 passengers        | 158                
 length            | 158                
 cabins            | 158                
 passenger_density | 158                
 crew              | 158                
-RECORD 1-------------------------------
 summary           | mean               
 Age               | 15.689873417721518 
 Tonnage           | 71.28467088607599  
 passengers        | 18.45740506329114  
 length            | 8.130632911392404  
 cabins            | 8.830000000000005  
 passenger_density | 39.90094936708861  
 crew              | 7.794177215189873  
-RECORD 2-------------------------------
 summary           | stddev             
 Age               | 7.615691058751413  
 Tonnage           | 37.229540025907866 
 passengers        | 9.677094775143416  
 length            | 1.793473548054825  
 cabins         

We have data for **158 ships**.

The **independent variables** are:

1. `Ship_name` (**<font color=steelblue>string</font>**) : The name of the ship.
2. `Cruise_line` (**<font color=steelblue>string</font>**) : The cruise line that owns the ship (recall that these are ships that have already been sold).
3. `Age` (**<font color=darkgreen>numeric</font>**) : The age of the ship.
4. `Tonnage` (**<font color=darkgreen>numeric</font>**) : The weight the ship can carry.
5. `passengers` (**<font color=darkgreen>numeric</font>**) : The number of passengers the ship can carry.
6. `length` (**<font color=darkgreen>numeric</font>**) : The length of the ship.
7. `cabins` (**<font color=darkgreen>numeric</font>**) : The number of cabins on the ship.
8. `passenger_density` (**<font color=darkgreen>numeric</font>**) : How many passengers the ship can sustain in a pre-determined area.

The **dependent variable** that we are trying to predict is:

   -  `crew` (**<font color=darkgreen>numeric</font>**) : The number of crew needed to service the ship.

### Data Transformations

To feed the data to `pyspark.ml`, the features must be assembled into a single vector column (a `DenseVector` per row). Before that, the string columns have to be encoded as their "numerical equivalents"; we use `StringIndexer` together with a PySpark `Pipeline` for this.

In [8]:
indexers = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(df) for column in stringCols]

In [9]:
pipeline = Pipeline(stages=indexers)
new_df = pipeline.fit(df).transform(df)
new_df.head(1)[0].asDict()

{'Ship_name': 'Journey',
 'Cruise_line': 'Azamara',
 'Age': 6,
 'Tonnage': 30.276999999999997,
 'passengers': 6.94,
 'length': 5.94,
 'cabins': 3.55,
 'passenger_density': 42.64,
 'crew': 3.55,
 'Ship_name_index': 32.0,
 'Cruise_line_index': 16.0}
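
To make the indices above less opaque: by default `StringIndexer` orders labels by descending frequency, so the most common label gets index `0.0`. Under that default, the `Cruise_line_index` of `16.0` for `Azamara` would suggest that sixteen cruise lines rank ahead of it. The following is a plain-Python sketch of that idea, not Spark's implementation; the function name `string_index`, the toy column, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def string_index(values):
    """Map each distinct string to a float index, most frequent label first.

    Ties are broken alphabetically here; that is an assumption of this
    sketch, not a statement about every Spark version.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(rank) for rank, label in enumerate(ordered)}


# Hypothetical toy column, not taken from the cruise ship dataset.
cruise_lines = ["Carnival", "Azamara", "Carnival", "Royal", "Carnival", "Royal"]

mapping = string_index(cruise_lines)
encoded = [mapping[label] for label in cruise_lines]

print(mapping)
print(encoded)
```

Note that the resulting numbers are frequency ranks, not physical quantities, which is worth keeping in mind when they are fed to a linear model.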

Using `Pipeline` and `StringIndexer`, we have encoded the string columns as numeric ones, with the suffix `_index` appended to the column names. We can now use `VectorAssembler` to combine the numerical features into a `DenseVector` column to build our model.

In [10]:
new_df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Ship_name_index: double (nullable = false)
 |-- Cruise_line_index: double (nullable = false)



In [11]:
stringCols = [item[0] for item in new_df.dtypes if item[1].startswith('string')]
numCols = [item[0] for item in new_df.dtypes if item[0] not in stringCols]
new_df = new_df.select(numCols)
new_df.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Ship_name_index: double (nullable = false)
 |-- Cruise_line_index: double (nullable = false)



`VectorAssembler` takes our **independent variables**, i.e. the **features**, as input.

In [12]:
indep = list(set(numCols) - set(['crew']))
indep

['length',
 'Ship_name_index',
 'Age',
 'cabins',
 'Tonnage',
 'passenger_density',
 'passengers',
 'Cruise_line_index']
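
The order printed above comes from a `set`, and Python does not guarantee the iteration order of a set of strings across sessions. Since the position of each column in the assembled vector follows this list, a reproducible order can be convenient. A minimal sketch of an order-preserving alternative follows; the names `num_cols` and `label_col` are introduced here for illustration, and the column list is copied from the schema printed earlier.

```python
# Column names as printed by the notebook's schema after indexing.
num_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
            'passenger_density', 'crew', 'Ship_name_index', 'Cruise_line_index']
label_col = 'crew'

# A list comprehension keeps the schema order, unlike set subtraction.
indep = [name for name in num_cols if name != label_col]

print(indep)
```

With this variant the feature vector would always start with `Age` and end with `Cruise_line_index`, so the outputs shown in this notebook (which used the set-based order) would list the same values in a different order.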

In [13]:
assembler = VectorAssembler(inputCols = indep, outputCol = 'features')
output = assembler.transform(new_df)
output.printSchema()

root
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Ship_name_index: double (nullable = false)
 |-- Cruise_line_index: double (nullable = false)
 |-- features: vector (nullable = true)



In [14]:
output.head(1)[0].asDict()

{'Age': 6,
 'Tonnage': 30.276999999999997,
 'passengers': 6.94,
 'length': 5.94,
 'cabins': 3.55,
 'passenger_density': 42.64,
 'crew': 3.55,
 'Ship_name_index': 32.0,
 'Cruise_line_index': 16.0,
 'features': DenseVector([5.94, 32.0, 6.0, 3.55, 30.277, 42.64, 6.94, 16.0])}

The `features` column contains all the numeric **independent variables** in `DenseVector` representation. The next step is to build our final dataset containing only `features` and the **dependent variable**.

In [15]:
final_df = output.select(['features', 'crew'])
final_df.show(n = 5, truncate=False, vertical=True)

-RECORD 0-----------------------------------------------------------
 features | [5.94,32.0,6.0,3.55,30.276999999999997,42.64,6.94,16.0] 
 crew     | 3.55                                                    
-RECORD 1-----------------------------------------------------------
 features | [5.94,46.0,6.0,3.55,30.276999999999997,42.64,6.94,16.0] 
 crew     | 3.55                                                    
-RECORD 2-----------------------------------------------------------
 features | [7.22,134.0,26.0,7.43,47.262,31.8,14.86,1.0]            
 crew     | 6.7                                                     
-RECORD 3-----------------------------------------------------------
 features | [9.53,78.0,11.0,14.88,110.0,36.99,29.74,1.0]            
 crew     | 19.1                                                    
-RECORD 4-----------------------------------------------------------
 features | [8.92,36.0,17.0,13.21,101.353,38.36,26.42,1.0]          
 crew     | 10.0                  

### Train-Test Split

In [16]:
train, test = final_df.randomSplit([0.75, 0.25])
train.describe().show()
test.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               117|
|   mean| 7.718632478632479|
| stddev|3.3463940721584815|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                41|
|   mean| 8.009756097560976|
| stddev|3.9544326317468586|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+
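
Two things are visible in the summaries above: the split is 117/41 rather than an exact 75/25 division of 158 rows, and the test set contains a larger maximum `crew` (21.0) than the training set (13.6). Both are consistent with each row being assigned by an independent random draw. `randomSplit` also accepts an optional `seed` argument, which the cell above does not set, so the split changes between runs. The sketch below imitates the idea in plain Python; it is not Spark's sampler, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in train with probability train_fraction.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships

# The same seed yields the same split; sizes are only approximately 75/25.
train_a, test_a = random_split(rows, 0.75, seed=42)
train_b, test_b = random_split(rows, 0.75, seed=42)

print(len(train_a), len(test_a))
```

Fixing a seed in this way makes the downstream metrics repeatable, at the cost of tying the reported numbers to one particular split.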



### Fit the model on the training data

In [17]:
lr = LinearRegression(labelCol='crew')
lr_model = lr.fit(train)
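
Only `labelCol` is passed above because `featuresCol` defaults to `'features'`, and with the default `regParam` of 0 the fit is ordinary least squares. As a rough illustration of what least squares does, here is the closed-form solution for a single feature in plain Python. This is a swapped-in, one-variable simplification, not Spark's solver and not the eight-feature model fitted above; `fit_line` and the toy data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) minimising the squared error of y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical toy data lying exactly on y = 2x + 1.
cabins = [1.0, 2.0, 3.0, 4.0]
crew = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(cabins, crew)
print(slope, intercept)
```

On the fitted Spark model, the corresponding quantities are exposed as `lr_model.coefficients` and `lr_model.intercept`.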

### Evaluate the model on the testing data

In [18]:
eval_results = lr_model.evaluate(test)
eval_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| 0.40799862966573974|
|-0.01683710237391...|
|-0.28242513892663323|
| 0.17819088776259528|
| -1.1231155209062678|
| -0.2626515886317424|
|  0.5056704680439683|
| 0.20295223393832895|
|-0.20932167757645193|
| -0.1960801408600501|
| -0.2529811089349874|
| -1.1741491571023408|
| -0.4924854858555996|
| -0.8287568100385494|
|-0.48053199193710494|
|  0.6565729701602869|
|-0.49101431798548134|
| -0.6319717525705517|
|  0.6800326949235149|
|  0.7063075106945469|
+--------------------+
only showing top 20 rows



### Evaluation Metrics

In [19]:
print("RMSE ::", round(eval_results.rootMeanSquaredError, 4))
print("R2 ::", round(eval_results.r2, 4))

RMSE :: 1.2873
R2 :: 0.8914
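
For reference, both metrics can be computed by hand from the labels and predictions. The sketch below uses the usual definitions, with residuals taken as label minus prediction; the function names and the toy numbers are illustrative and are not the notebook's data.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: square root of the mean squared residual."""
    residuals = [y - p for y, p in zip(labels, predictions)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def r_squared(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot


# Hypothetical labels and predictions.
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.0]

print("RMSE ::", round(rmse(labels, predictions), 4))
print("R2 ::", round(r_squared(labels, predictions), 4))
```

RMSE is expressed in the same units as `crew`, while R2 compares the model's squared error with that of always predicting the mean label.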


In [20]:
final_df.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



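We fit a linear regression model on the training data and, on the test data, obtained an `r-squared` of $0.8914$ and an `RMSE` of $1.2873$. Given that `mean(crew)` is about $7.8$ and its standard deviation about $3.5$, the RMSE is roughly 16.5% of the mean and well below the spread of the target, which indicates that the model is fairly accurate. Because the train-test split is not seeded, these figures will vary somewhat from run to run.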