# Predicting the number of crew members required on a cruise liner using Linear Regression 

[Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

The model built here predicts how many crew members a newly constructed ship will need.

The data contains:

    Measurements of ship size, capacity, crew, and age for 158 cruise ships:
    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
Cruise lines differ in the minimum crew count they consider acceptable, so the cruise line is likely an important feature and is included in the analysis.

### Importing all the important Spark and MLlib libraries

In [64]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import corr

In [2]:
spark = SparkSession.builder.appName('ship').getOrCreate()

### Importing the data from cruise_ship_info.csv into a Spark DataFrame and printing the schema

In [24]:
ship_data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)
ship_data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



### This is what the first few rows of the DataFrame look like:

In [29]:
for ship in ship_data.head(4):
    print(ship)
    print("\n")

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)


Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1)




### Using the Spark MLlib StringIndexer to encode the categorical Cruise_line column as numeric indices so that it can be used for linear regression. The resulting Cruise_category column is added to the DataFrame.

In [55]:
indexer = StringIndexer(inputCol = 'Cruise_line', outputCol='Cruise_category')
indexed = indexer.fit(ship_data).transform(ship_data)
indexed.head(2)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_category=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_category=16.0)]
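
The index values above (16.0 for Azamara, 1.0 for Carnival in a later output) follow from how `StringIndexer` orders labels: by default it ranks them by descending frequency, so the most common cruise line receives index 0.0. The pure-Python sketch below imitates that ordering on a made-up list of cruise lines. The sample list and the helper name `frequency_index` are assumptions for illustration and are not part of Spark; the alphabetical tie-break is also an assumption of this sketch.

```python
from collections import Counter

def frequency_index(values):
    """Map each distinct string to a float index, most frequent first.

    Hypothetical helper that imitates a frequency-descending ordering;
    ties are broken alphabetically here, which is an assumption.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Made-up sample, not the real dataset
sample_lines = ["Carnival", "Azamara", "Royal_Caribbean", "Carnival",
                "Royal_Caribbean", "Royal_Caribbean"]
mapping = frequency_index(sample_lines)
print(mapping)
```

Because the index is only a frequency rank, a linear model will treat it as an ordered quantity. One-hot encoding the index (for example with `OneHotEncoder` from `pyspark.ml.feature`) is a common alternative worth considering.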

### All the important and relevant features that will be used for the regression are combined into a single vector column so that MLlib can use it to make predictions.

In [56]:
indexed.columns
assembler = VectorAssembler(inputCols = ['Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',  # NOTE: crew is also the label column, so this leaks the target into the features
 'Cruise_category'], outputCol = 'features')
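
Note that the input list above includes `crew`, which is also the column the model is asked to predict. A minimal sketch of building the input list so that the label and the raw string columns are excluded, using the column names from the schema printed earlier. The variable names `label_col`, `excluded`, and `feature_cols` are hypothetical.

```python
# Column names as they appear in the indexed DataFrame
all_columns = ['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers',
               'length', 'cabins', 'passenger_density', 'crew',
               'Cruise_category']

label_col = 'crew'                                   # the value to predict
excluded = {'Ship_name', 'Cruise_line', label_col}   # raw strings and the label

# Keep every remaining numeric column as a model input
feature_cols = [c for c in all_columns if c not in excluded]
print(feature_cols)
```

A list built this way could be passed as `inputCols` to `VectorAssembler`. The metrics reported in this notebook were produced with `crew` included, so they would change if the model were refit on this list.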

In [57]:
data = assembler.transform(indexed)

In [58]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+---------------+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_category|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+---------------+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|           16.0|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|           16.0|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|            1.0|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|            1.0|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8

In [59]:
final_data = data.select('features', 'crew')

In [60]:
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



### Initializing the linear regression model with crew as the label, and randomly splitting the dataset 70:30 into training and test sets

In [62]:
lr = LinearRegression(labelCol='crew')
training_data, test_data = final_data.randomSplit([0.7,0.3])
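
`randomSplit` draws a different partition on each run unless a `seed` is passed (for example `final_data.randomSplit([0.7, 0.3], seed=42)`), and the resulting sizes are only approximately 70:30. The pure-Python sketch below illustrates the idea of assigning each row independently under a fixed seed. The function `random_split` is a hypothetical stand-in, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)        # fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(158))              # stand-ins for the 158 ships
train, test = random_split(rows)
train_again, test_again = random_split(rows)
print(len(train), len(test))
```

Calling the function twice with the same seed returns the same partition, which makes reported test metrics reproducible from run to run.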

### The training data is then used to fit the model and the built model is evaluated on the test data

In [63]:
lr_model = lr.fit(training_data)
test_results = lr_model.evaluate(test_data)

In [43]:
test_results.meanAbsoluteError

1.752735072918864e-15

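### The prediction column contains the predictions made by our linear regression model and the crew column contains the actual values. The predictions are almost perfect, but crew was also passed to the VectorAssembler as an input feature, so the model can read the label directly from the feature vector. The near-zero error reflects that leakage rather than real predictive skill.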

In [44]:
test_results.predictions.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[5.0,86.0,21.04,9...|  8.0|7.9999999999999964|
|[6.0,110.23899999...| 11.5|11.500000000000004|
|[6.0,113.0,37.82,...| 12.0|12.000000000000002|
|[6.0,158.0,43.7,1...| 13.6|              13.6|
|[7.0,116.0,31.0,9...| 12.0|11.999999999999998|
|[9.0,81.0,21.44,9...| 10.0|10.000000000000004|
|[9.0,90.09,25.01,...| 8.69| 8.689999999999998|
|[10.0,58.825,15.6...|  7.0| 6.999999999999999|
|[10.0,90.09,25.01...| 8.58|              8.58|
|[11.0,90.09,25.01...| 8.48|              8.48|
|[11.0,108.977,26....| 12.0|12.000000000000004|
|[11.0,138.0,31.14...|11.85|11.849999999999996|
|[12.0,25.0,3.88,5...| 2.87| 2.869999999999999|
|[12.0,58.6,15.66,...|  7.0| 6.999999999999998|
|[12.0,77.104,20.0...| 9.59| 9.590000000000003|
|[12.0,88.5,21.24,...|  9.3| 9.299999999999997|
|[12.0,90.09,25.01...| 8.68|              8.68|
|[12.0,108.865,27....| 11.0|11.000000000

In [45]:
test_results.rootMeanSquaredError

2.1883243225969897e-15

In [46]:
test_results.r2

1.0
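
For reference, the three metrics reported above can be computed by hand from (actual, prediction) pairs. The sketch below applies the standard textbook definitions of MAE, RMSE, and R² to four rows copied from the predictions table. The helper name `regression_metrics` is hypothetical, and this is not a description of how Spark computes the values internally.

```python
import math

def regression_metrics(actual, predicted):
    """Return (MAE, RMSE, R^2) using the standard definitions."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

# First four rows of the predictions table
crew = [8.0, 11.5, 12.0, 13.6]
prediction = [7.9999999999999964, 11.500000000000004,
              12.000000000000002, 13.6]
mae, rmse, r2 = regression_metrics(crew, prediction)
print(mae, rmse, r2)
```

Errors on the order of 1e-15 are at the scale of floating-point rounding, which is what one would expect when the label itself is among the features.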

In [50]:
ship_data.select(corr('crew', 'passenger_density')).show()

+-----------------------------+
|corr(crew, passenger_density)|
+-----------------------------+
|         -0.15550928421699717|
+-----------------------------+
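
The `corr` function computes the Pearson correlation coefficient between two columns. A value near -0.16 indicates only a weak negative linear relationship between crew size and passenger density. Below is a minimal pure-Python sketch of the same formula, applied to the four rows printed near the top of this notebook, so its result will differ from the full-dataset value. `pearson` is a hypothetical helper name.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# crew and passenger_density for Journey, Quest, Celebration, Conquest
crew = [3.55, 3.55, 6.7, 19.1]
density = [42.64, 42.64, 31.8, 36.99]
r = pearson(crew, density)
print(r)
```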

