# Ships Crew Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    Passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [1]:
from pyspark.sql import SparkSession

In [2]:
spark = SparkSession.builder.appName('cruise crew').getOrCreate()

In [3]:
df = spark.read.csv("cruise_ship_info.csv",header=True,inferSchema=True)

In [4]:
df.head(2)[0]

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)

In [5]:
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [6]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [7]:
from IPython.display import display, HTML
# Widen the notebook container so the wide summary table is not wrapped
display(HTML("<style>.container { width:100% !important; }</style>"))
df.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       null|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|      NaN|       null| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

# Cruise line is a categorical variable
Group the data by cruise line and count the ships in each group.

In [8]:
from pyspark.sql.functions import count, corr  # corr is used in the correlation cells later

# Count ships per cruise line
df.groupBy("Cruise_line").agg(count('Cruise_line')).show()

# or simply
df.groupBy('Cruise_line').count().show()

+-----------------+------------------+
|      Cruise_line|count(Cruise_line)|
+-----------------+------------------+
|            Costa|                11|
|              P&O|                 6|
|           Cunard|                 3|
|Regent_Seven_Seas|                 5|
|              MSC|                 8|
|         Carnival|                22|
|          Crystal|                 2|
|           Orient|                 1|
|         Princess|                17|
|        Silversea|                 4|
|         Seabourn|                 3|
| Holland_American|                14|
|         Windstar|                 3|
|           Disney|                 2|
|        Norwegian|                13|
|          Oceania|                 3|
|          Azamara|                 2|
|        Celebrity|                10|
|             Star|                 6|
|  Royal_Caribbean|                23|
+-----------------+------------------+

+-----------------+-----+
|      Cruise_line|count|
+----------

In [9]:
# To use Cruise_line in a model, we need to convert each cruise line
# to a numeric category index.

In [10]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol = 'Cruise_line',outputCol='Cruise_index')
indexed = indexer.fit(df).transform(df)
indexed.head(4)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_index=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_index=16.0),
 Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7, Cruise_index=1.0),
 Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1, Cruise_index=1.0)]
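The index values above follow from how `StringIndexer` orders labels. As a minimal sketch in plain Python (not Spark's implementation), assuming the default `frequencyDesc` ordering with ties broken alphabetically as documented for Spark 3.x, the counts from the `groupBy` output reproduce the values shown for Carnival and Azamara. The helper name `frequency_desc_index` is hypothetical.

```python
# Ship counts per cruise line, copied from the groupBy output above.
counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}


def frequency_desc_index(label_counts):
    """Map each label to a float index; the most frequent label gets 0.0.

    Assumption: labels with equal counts are ordered alphabetically.
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_index = frequency_desc_index(counts)
print(cruise_index["Carnival"], cruise_index["Azamara"])  # 1.0 16.0
```

The index is only a label, not a quantity: Carnival (1.0) is not "less than" Azamara (16.0) in any meaningful sense, which matters if the index is fed straight into a linear model.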

In [11]:
from pyspark.ml.feature import VectorAssembler

In [12]:
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|         1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|         1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|         1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|         1.0|
|    Elati

In [13]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_index']

Spark ML models expect all input features in a single vector column, which `VectorAssembler` builds.

In [14]:
# Note: 'Cruise_index' is not in inputCols, so the model below ignores the cruise line.
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density'],
    outputCol='Features')
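The assembler above leaves out `Cruise_index`, although the client flagged the cruise line as important. One possible way to bring it in is to one-hot encode the index and append it to the numeric features. The following is a minimal plain-Python sketch of that idea, not the notebook's method; the helpers `one_hot` and `build_features` are hypothetical, and in Spark this step would normally be done with the ML feature transformers rather than by hand.

```python
def one_hot(index, num_categories):
    """Return a list with 1.0 at position `index` and 0.0 elsewhere."""
    if not 0 <= index < num_categories:
        raise ValueError("index out of range")
    vec = [0.0] * num_categories
    vec[int(index)] = 1.0
    return vec


def build_features(numeric, cruise_index, num_categories=20):
    """Concatenate the numeric columns with the one-hot encoded cruise line.

    num_categories=20 matches the 20 cruise lines in the groupBy output.
    """
    return list(numeric) + one_hot(cruise_index, num_categories)


# First row of the data ('Journey', Azamara, Cruise_index=16.0):
# Age, Tonnage, passengers, length, cabins, passenger_density
journey = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64]
features = build_features(journey, 16.0)
print(len(features))  # 6 numeric values + 20 one-hot slots = 26
```

One-hot encoding avoids implying an order among cruise lines. Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), so its encoded vector would have 19 slots rather than the 20 used in this sketch.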

In [15]:
output = assembler.transform(indexed)

In [16]:
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_index|            Features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|         1.0|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|         1.0|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|          

# Let's list the features and crew

In [17]:
# Spark resolves column names case-insensitively by default, so 'features' matches 'Features'.
final_data = output.select('features','crew')
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [18]:
# Train and Test

In [19]:
train_data, test_data = final_data.randomSplit([0.7,0.3])

In [20]:
final_data.count()

158

In [21]:
train_data.count()

116

In [22]:
test_data.count()

42
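The split comes out as 116/42 rather than an exact 70/30 because `randomSplit` assigns rows at random, so the weights are only expected proportions, and without a `seed` argument the counts can change from run to run. A minimal plain-Python sketch of the idea follows; it is not Spark's actual sampler, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row independently to the train or test list.

    Each row draws a uniform number in [0, 1); rows below `train_weight`
    go to the training set, so the split sizes are only approximately 70/30.
    """
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships
train, test = random_split(rows, seed=42)
print(len(train), len(test))  # sizes depend on the seed but always sum to 158
```

In PySpark, passing a seed, for example `final_data.randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable for the same data and partitioning.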

In [23]:
# Now we build the model

In [24]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol = 'crew')


In [38]:
lrModel = lr.fit(train_data)

pred = lrModel.transform(test_data)


pred.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[6.0,30.276999999...| 3.55|  3.98441877015964|
|[6.0,110.23899999...| 11.5|10.947660835831364|
|[6.0,112.0,38.0,9...| 10.9|10.931620615182448|
|[7.0,89.6,25.5,9....| 9.87|10.695984678013696|
|[7.0,116.0,31.0,9...| 12.0|12.396585077850252|
|[8.0,91.0,22.44,9...| 11.0|10.021650292785994|
|[8.0,110.0,29.74,...| 11.6|11.947114170197956|
|[9.0,85.0,19.68,9...| 8.69| 9.153229070153612|
|[9.0,90.09,25.01,...| 8.69|  9.40897310145532|
|[9.0,110.0,29.74,...| 11.6|11.940488672561512|
|[10.0,58.825,15.6...|  7.0| 7.190660131704547|
|[10.0,86.0,21.14,...|  9.2| 9.591810577147529|
|[10.0,105.0,27.2,...|10.68| 10.97065802129362|
|[11.0,91.0,20.32,...| 9.99| 9.198458939972962|
|[11.0,110.0,29.74...| 19.1|11.930421817022879|
|[11.0,138.0,31.14...|11.85|  12.9790638873799|
|[12.0,2.329,0.94,...|  0.6|0.7143326424508706|
|[12.0,25.0,3.88,5...| 2.87| 3.121000132

In [26]:
# We can evaluate the model on the test data with several metrics

In [41]:
testeval = lrModel.evaluate(test_data)

testeval.predictions.show()
testeval.predictionCol

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[6.0,30.276999999...| 3.55|  3.98441877015964|
|[6.0,110.23899999...| 11.5|10.947660835831364|
|[6.0,112.0,38.0,9...| 10.9|10.931620615182448|
|[7.0,89.6,25.5,9....| 9.87|10.695984678013696|
|[7.0,116.0,31.0,9...| 12.0|12.396585077850252|
|[8.0,91.0,22.44,9...| 11.0|10.021650292785994|
|[8.0,110.0,29.74,...| 11.6|11.947114170197956|
|[9.0,85.0,19.68,9...| 8.69| 9.153229070153612|
|[9.0,90.09,25.01,...| 8.69|  9.40897310145532|
|[9.0,110.0,29.74,...| 11.6|11.940488672561512|
|[10.0,58.825,15.6...|  7.0| 7.190660131704547|
|[10.0,86.0,21.14,...|  9.2| 9.591810577147529|
|[10.0,105.0,27.2,...|10.68| 10.97065802129362|
|[11.0,91.0,20.32,...| 9.99| 9.198458939972962|
|[11.0,110.0,29.74...| 19.1|11.930421817022879|
|[11.0,138.0,31.14...|11.85|  12.9790638873799|
|[12.0,2.329,0.94,...|  0.6|0.7143326424508706|
|[12.0,25.0,3.88,5...| 2.87| 3.121000132

In [28]:
testeval.rootMeanSquaredError

1.4054167263260222

In [29]:
testeval.meanSquaredError 

1.9751961746369535

In [30]:
testeval.r2

0.871968969084351
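The three metrics above are closely related: RMSE is the square root of MSE, and R² compares the squared residuals with the variance of the labels. As a minimal sketch in plain Python, assuming the usual definitions, the function below computes all three. It uses only three (crew, prediction) pairs copied from the prediction table, so its numbers will not match the full test-set metrics; `regression_metrics` is a hypothetical helper.

```python
import math


def regression_metrics(labels, predictions):
    """Return (mse, rmse, r2) for paired labels and predictions."""
    n = len(labels)
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Three (crew, prediction) pairs copied from the prediction table above.
labels = [3.55, 11.5, 19.1]
predictions = [3.98441877015964, 10.947660835831364, 11.930421817022879]

mse, rmse, r2 = regression_metrics(labels, predictions)
print(mse, rmse, r2)
```

The reported values are consistent with this relationship: 1.4054 squared is about 1.9752. Because crew is recorded in hundreds, an RMSE of about 1.41 corresponds to roughly 140 crew members. The third pair (19.1 actual against about 11.93 predicted) is the largest miss among the visible rows.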

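# Correlations
Checking how strongly individual features correlate with crew helps explain why the model performs so well.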

In [31]:
output.select(corr('crew','cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+



In [32]:
# A correlation of about 0.95 shows a strong linear relationship between crew and cabins

In [33]:
# The number of passengers is a little less correlated with crew, but the relationship is still strong

In [42]:
output.select(corr('crew','passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+
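The `corr` function computes the Pearson correlation coefficient. A minimal plain-Python sketch of that formula is shown below, using five (cabins, crew) pairs copied from the first rows of `df.show()`; on so few rows the result will differ from the full-data value of about 0.95. The helper name `pearson` is hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Journey, Celebration, Conquest, Destiny, Ecstasy (values in hundreds)
cabins = [3.55, 7.43, 14.88, 13.21, 10.2]
crew = [3.55, 6.7, 19.1, 10.0, 9.2]
print(pearson(cabins, crew))
```

With single features this strongly correlated with crew, it is unsurprising that a linear model fits the data well.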

