# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [81]:
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.sql.functions import corr

In [2]:
spark = SparkSession.builder.appName('lr_consulting_project').getOrCreate()
spark

In [3]:
df = spark.read.csv('cruise_ship_info.csv',inferSchema=True,header=True)
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [4]:
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [5]:
df.count()

158

In [6]:
df.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|
|   mean| Infinity|       NULL|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|
| stddev|     NULL|       NULL| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473548054825|4.4714172221480615| 8.63921711391542|3.503486564627034|
|    min|Adventure|    Azamara|   

In [79]:
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [10]:
indexer = StringIndexer(inputCol='Cruise_line',outputCol='Cruise_line_indexed')
indexer.setHandleInvalid("error")

StringIndexer_52cd8c22dcb4

In [11]:
indexer_model = indexer.fit(df)
indexer_model.setHandleInvalid("error")

StringIndexerModel: uid=StringIndexer_52cd8c22dcb4, handleInvalid=error

In [15]:
output = indexer_model.transform(df)
output

DataFrame[Ship_name: string, Cruise_line: string, Age: int, Tonnage: double, passengers: double, length: double, cabins: double, passenger_density: double, crew: double, Cruise_line_indexed: double]

In [16]:
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_indexed|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|                1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|                1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|                1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.5
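
To see where index values such as 16.0 for Azamara come from: by default `StringIndexer` orders labels by descending frequency (`stringOrderType='frequencyDesc'`), so the most common cruise line gets index 0. The plain-Python sketch below mimics that ordering using the `groupBy` counts shown earlier. It assumes ties are broken alphabetically, which is the documented behaviour in Spark 3.x but is worth checking against your version. `line_counts` and `label_index` are names introduced here for illustration.

```python
# Counts copied from the df.groupBy('Cruise_line').count() output above.
line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}

# Most frequent label first; ties broken alphabetically (assumption, see above).
ordered = sorted(line_counts, key=lambda name: (-line_counts[name], name))
label_index = {name: float(i) for i, name in enumerate(ordered)}

print(label_index["Carnival"], label_index["Azamara"])

# Mean of the indexed column, weighted by how many ships each line has.
total_ships = sum(line_counts.values())
mean_index = sum(label_index[n] * c for n, c in line_counts.items()) / total_ships
print(total_ships, mean_index)
```

Under these assumptions Carnival maps to 1.0 and Azamara to 16.0, matching the table, and the weighted mean comes out at about 5.063, the same as the `Cruise_line_indexed` mean in the `describe()` output. Keep in mind that the index is only a frequency rank: using it as a single numeric feature treats the cruise lines as if they were ordered, which is a modelling shortcut. One-hot encoding is a common alternative.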

In [17]:
output.describe().show()

+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-------------------+
|summary|Ship_name|Cruise_line|               Age|           Tonnage|       passengers|           length|            cabins|passenger_density|             crew|Cruise_line_indexed|
+-------+---------+-----------+------------------+------------------+-----------------+-----------------+------------------+-----------------+-----------------+-------------------+
|  count|      158|        158|               158|               158|              158|              158|               158|              158|              158|                158|
|   mean| Infinity|       NULL|15.689873417721518| 71.28467088607599|18.45740506329114|8.130632911392404| 8.830000000000005|39.90094936708861|7.794177215189873|  5.063291139240507|
| stddev|     NULL|       NULL| 7.615691058751413|37.229540025907866|9.677094775143416|1.793473

In [18]:
output.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_indexed']

In [19]:
assembler = VectorAssembler(inputCols=[
     'Age',
     'Tonnage',
     'passengers',
     'length',
     'cabins',
     'passenger_density',
     'Cruise_line_indexed'
],outputCol='features')
assembler

VectorAssembler_5dd44feb4001

In [20]:
assembled_df = assembler.transform(output)
assembled_df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_indexed|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|                1.0|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|                1.0|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|     

In [21]:
assembled_df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_indexed: double (nullable = false)
 |-- features: vector (nullable = true)



In [23]:
assembled_df.head(1)[0]

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_line_indexed=16.0, features=DenseVector([6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]))

In [24]:
final_df = assembled_df.select('features','crew')
final_df.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [25]:
final_df.count()

158

In [26]:
final_df.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



In [27]:
train_df, test_df = final_df.randomSplit([0.7,0.3])
train_df.show()

+--------------------+-----+
|            features| crew|
+--------------------+-----+
|[5.0,86.0,21.04,9...|  8.0|
|[5.0,115.0,35.74,...| 12.2|
|[5.0,122.0,28.5,1...|  6.7|
|[5.0,133.5,39.59,...|13.13|
|[5.0,160.0,36.34,...| 13.6|
|[6.0,30.276999999...| 3.55|
|[6.0,90.0,20.0,9....|  9.0|
|[6.0,93.0,23.94,9...|11.09|
|[6.0,112.0,38.0,9...| 10.9|
|[6.0,113.0,37.82,...| 12.0|
|[6.0,158.0,43.7,1...| 13.6|
|[7.0,89.6,25.5,9....| 9.87|
|[7.0,158.0,43.7,1...| 13.6|
|[8.0,77.499,19.5,...|  9.0|
|[8.0,91.0,22.44,9...| 11.0|
|[8.0,110.0,29.74,...| 11.6|
|[9.0,59.058,17.0,...|  7.4|
|[9.0,81.0,21.44,9...| 10.0|
|[9.0,85.0,19.68,9...| 8.69|
|[9.0,88.5,21.24,9...| 10.3|
+--------------------+-----+
only showing top 20 rows



In [28]:
test_df.show()

+--------------------+-----+
|            features| crew|
+--------------------+-----+
|[4.0,220.0,54.0,1...| 21.0|
|[6.0,30.276999999...| 3.55|
|[6.0,110.23899999...| 11.5|
|[7.0,116.0,31.0,9...| 12.0|
|[10.0,77.0,20.16,...|  9.0|
|[10.0,81.76899999...| 8.42|
|[10.0,110.0,29.74...| 11.6|
|[11.0,138.0,31.14...|11.85|
|[12.0,77.104,20.0...| 9.59|
|[13.0,25.0,3.82,5...| 2.95|
|[13.0,91.0,20.32,...| 9.99|
|[14.0,33.0,4.9,5....| 3.24|
|[14.0,76.8,19.6,8...| 12.0|
|[14.0,83.0,17.5,9...| 9.45|
|[15.0,30.27699999...|  4.0|
|[17.0,70.0,20.76,...|  7.2|
|[18.0,70.367,20.5...|  9.2|
|[18.0,70.60600000...| 8.58|
|[18.0,77.499,19.5...|  9.0|
|[19.0,16.8,2.96,5...|  2.1|
+--------------------+-----+
only showing top 20 rows



In [29]:
train_df.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               119|
|   mean| 7.974789915966397|
| stddev|3.3580045650475965|
|    min|              0.59|
|    max|              19.1|
+-------+------------------+



In [30]:
test_df.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                39|
|   mean| 7.243076923076919|
| stddev|3.9093882640625868|
|    min|              0.88|
|    max|              21.0|
+-------+------------------+
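
`randomSplit([0.7,0.3])` does not cut the data at exactly 70/30: each row is assigned to a split by an independent random draw, which is why this run ended up with 119 training rows and 39 test rows (about 75/25) out of 158. Because no `seed` was passed, rerunning the cell gives a different split and slightly different metrics. The sketch below imitates that behaviour in plain Python; `bernoulli_split` is a hypothetical helper for illustration, not Spark's implementation.

```python
import random

def bernoulli_split(rows, train_fraction, seed):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)  # fixed seed -> the same split on every run
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(158))  # stand-ins for the 158 ships
train_a, test_a = bernoulli_split(rows, 0.7, seed=42)
train_b, test_b = bernoulli_split(rows, 0.7, seed=42)

print(len(train_a), len(test_a))  # sizes are near, not exactly, 70/30
print(train_a == train_b)         # same seed gives the same split
```

In PySpark the analogous step is to pass the `seed` argument, for example `final_df.randomSplit([0.7,0.3], seed=42)`, so that the split and the metrics derived from it can be reproduced.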



In [31]:
lr = LinearRegression(featuresCol='features',labelCol='crew',predictionCol='prediction')
lr

LinearRegression_e9926ff89384

In [32]:
lr_model = lr.fit(train_df)
lr_model.summary

<pyspark.ml.regression.LinearRegressionTrainingSummary at 0x1fd255fc2d0>

In [52]:
lr_model.coefficients

DenseVector([-0.0209, 0.0154, -0.1562, 0.3422, 0.8343, -0.0094, 0.0528])

In [53]:
lr_model.intercept

-0.1612990321896567

In [56]:
lr_model.numFeatures

7

In [62]:
assembler.getInputCols()

['Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'Cruise_line_indexed']

In [75]:
feature_imps = dict(zip(assembler.getInputCols(),lr_model.coefficients))
feature_imps

{'Age': -0.02089828517729866,
 'Tonnage': 0.015443430983301597,
 'passengers': -0.1561866947500571,
 'length': 0.3421716648426525,
 'cabins': 0.8343037280140753,
 'passenger_density': -0.009443717243607725,
 'Cruise_line_indexed': 0.052776167963893186}

In [76]:
# Sort by absolute value so that large negative coefficients are not ranked last
feature_imps = sorted(feature_imps.items(),key=lambda kv: (abs(kv[1]),kv[0]),reverse=True)
feature_imps

[('cabins', 0.8343037280140753),
 ('length', 0.3421716648426525),
 ('passengers', -0.1561866947500571),
 ('Cruise_line_indexed', 0.052776167963893186),
 ('Age', -0.02089828517729866),
 ('Tonnage', 0.015443430983301597),
 ('passenger_density', -0.009443717243607725)]

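The largest coefficient belongs to `cabins` (about 0.83), which suggests that the number of cabins is the strongest predictor of crew size in this model. Treat this ranking as a rough guide: raw coefficients depend on each feature's units and scale, so coefficient size alone does not establish statistical significance.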
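
Raw coefficients are expressed per unit of each feature, and the features here are on very different scales (tonnage has a standard deviation of about 37, length about 1.8). One rough way to compare them is to multiply each coefficient by its feature's standard deviation, which gives the change in predicted crew for a one-standard-deviation change in that feature. The sketch below does this with the coefficients above and the standard deviations from `df.describe()`. Two caveats: these are full-dataset standard deviations rather than training-split ones, and `Cruise_line_indexed` is left out because its standard deviation is cut off in the output shown earlier.

```python
# Coefficients from lr_model.coefficients (rounded).
coefficients = {
    "Age": -0.0209,
    "Tonnage": 0.0154,
    "passengers": -0.1562,
    "length": 0.3422,
    "cabins": 0.8343,
    "passenger_density": -0.0094,
}

# Standard deviations from df.describe() on the full dataset (rounded).
std_devs = {
    "Age": 7.6157,
    "Tonnage": 37.2295,
    "passengers": 9.6771,
    "length": 1.7935,
    "cabins": 4.4714,
    "passenger_density": 8.6392,
}

# Effect on predicted crew (in 100s) of a one-standard-deviation change.
scaled_effects = {name: coefficients[name] * std_devs[name] for name in coefficients}

# Rank by magnitude so large negative effects are not pushed to the bottom.
ranking = sorted(scaled_effects.items(), key=lambda kv: abs(kv[1]), reverse=True)
for name, effect in ranking:
    print(f"{name:>18}: {effect:+.3f}")
```

On this scale `cabins` still comes out on top, with `passengers` second. Since cabins, passengers, tonnage and length all measure ship size and are likely to be correlated with one another, individual coefficients should still be read with caution.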

In [33]:
test_results = lr_model.evaluate(test_df)
test_results

<pyspark.ml.regression.LinearRegressionSummary at 0x1fd256609d0>

In [34]:
test_results.r2

0.9603810093776035

In [35]:
test_results.r2adj

0.9514347856886752
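
Adjusted R² penalises R² for the number of features. With n rows and p features, the usual formula is 1 - (1 - R²)(n - 1)/(n - p - 1). The quick check below plugs in the numbers from this notebook (39 test rows, as reported by `test_df.describe()`, and 7 features) and assumes the model was fitted with an intercept, which is the `LinearRegression` default.

```python
r2 = 0.9603810093776035   # test_results.r2
n = 39                    # rows in test_df
p = 7                     # lr_model.numFeatures

# Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2_adj = 1 - (1 - r2) * (n - 1) / (n - p - 1)
print(r2_adj)

# n - p - 1 is also the residual degrees of freedom.
print(n - p - 1)
```

This reproduces the reported `r2adj` of about 0.9514, and `n - p - 1 = 31` matches the value of `test_results.degreesOfFreedom`.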

In [36]:
test_results.predictions.show()

+--------------------+-----+------------------+
|            features| crew|        prediction|
+--------------------+-----+------------------+
|[4.0,220.0,54.0,1...| 21.0|20.904513821240023|
|[6.0,30.276999999...| 3.55| 4.532972862832758|
|[6.0,110.23899999...| 11.5|11.068468738662364|
|[7.0,116.0,31.0,9...| 12.0| 12.63839344288523|
|[10.0,77.0,20.16,...|  9.0| 8.795143116958535|
|[10.0,81.76899999...| 8.42|  8.73701863368302|
|[10.0,110.0,29.74...| 11.6|12.047105257679602|
|[11.0,138.0,31.14...|11.85|12.938074110376412|
|[12.0,77.104,20.0...| 9.59|  8.76934595865606|
|[13.0,25.0,3.82,5...| 2.95| 2.980236486609423|
|[13.0,91.0,20.32,...| 9.99|  9.12884710614818|
|[14.0,33.0,4.9,5....| 3.24|3.1423761735371247|
|[14.0,76.8,19.6,8...| 12.0| 8.851307911678116|
|[14.0,83.0,17.5,9...| 9.45| 9.195410572508287|
|[15.0,30.27699999...|  4.0| 4.078420155243922|
|[17.0,70.0,20.76,...|  7.2| 7.495640321034496|
|[18.0,70.367,20.5...|  9.2|  8.50870563017861|
|[18.0,70.60600000...| 8.58| 7.817230041
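
A linear regression prediction is just the intercept plus the dot product of the coefficients and the feature vector. As a sanity check, the sketch below recomputes by hand the prediction for the test row whose features are `[6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]` (the Azamara feature vector shown in `assembled_df.head(1)`), using the coefficients and intercept printed earlier.

```python
# Order matches assembler.getInputCols():
# Age, Tonnage, passengers, length, cabins, passenger_density, Cruise_line_indexed
coefficients = [
    -0.02089828517729866,
    0.015443430983301597,
    -0.1561866947500571,
    0.3421716648426525,
    0.8343037280140753,
    -0.009443717243607725,
    0.052776167963893186,
]
intercept = -0.1612990321896567

features = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]

# prediction = intercept + sum(coefficient_i * feature_i)
prediction = intercept + sum(c * x for c, x in zip(coefficients, features))
print(prediction)
```

The result is about 4.53, in line with the second row of the predictions table, where the actual crew value is 3.55.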

In [37]:
test_results.explainedVariance

13.298997786135132

In [38]:
test_results.rootMeanSquaredError

0.7681039580890837

In [39]:
test_results.meanAbsoluteError

0.5489336583992321

In [40]:
test_results.meanSquaredError

0.589983690432117
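
These error metrics are tied together. RMSE is the square root of MSE, and R² can be recovered as 1 - SS_res/SS_tot, where SS_res = n × MSE and SS_tot = (n - 1) × s², with s the sample standard deviation of `crew` in the test set. A small check using the values printed in this notebook:

```python
import math

mse = 0.589983690432117          # test_results.meanSquaredError
n = 39                           # rows in test_df
crew_std = 3.9093882640625868    # sample stddev of crew from test_df.describe()

rmse = math.sqrt(mse)
ss_res = n * mse                   # residual sum of squares
ss_tot = (n - 1) * crew_std ** 2   # total sum of squares around the test mean
r2 = 1 - ss_res / ss_tot

print(rmse)
print(r2)

# crew is recorded in hundreds, so convert the typical error to people.
print(round(rmse * 100))
```

The computed values agree with the reported `rootMeanSquaredError` (about 0.768) and `r2` (about 0.960).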

In [42]:
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| 0.09548617875997678|
| -0.9829728628327583|
| 0.43153126133763564|
| -0.6383934428852296|
|  0.2048568830414652|
|-0.31701863368301986|
|-0.44710525767960263|
| -1.0880741103764127|
|  0.8206540413439392|
|-0.03023648660942...|
|  0.8611528938518198|
| 0.09762382646287548|
|  3.1486920883218836|
| 0.25458942749171243|
| -0.0784201552439221|
|-0.29564032103449556|
|  0.6912943698213887|
|  0.7627699589234265|
|  0.5925484433751009|
|-0.17681490510868247|
+--------------------+
only showing top 20 rows



In [44]:
test_results.degreesOfFreedom

31

In [45]:
test_results.devianceResiduals

[-1.1222501704262848, 3.1486920883218836]

In [47]:
test_results.numInstances

39

In [48]:
unlabeled_df = test_df.select('features')
unlabeled_df.show()

+--------------------+
|            features|
+--------------------+
|[4.0,220.0,54.0,1...|
|[6.0,30.276999999...|
|[6.0,110.23899999...|
|[7.0,116.0,31.0,9...|
|[10.0,77.0,20.16,...|
|[10.0,81.76899999...|
|[10.0,110.0,29.74...|
|[11.0,138.0,31.14...|
|[12.0,77.104,20.0...|
|[13.0,25.0,3.82,5...|
|[13.0,91.0,20.32,...|
|[14.0,33.0,4.9,5....|
|[14.0,76.8,19.6,8...|
|[14.0,83.0,17.5,9...|
|[15.0,30.27699999...|
|[17.0,70.0,20.76,...|
|[18.0,70.367,20.5...|
|[18.0,70.60600000...|
|[18.0,77.499,19.5...|
|[19.0,16.8,2.96,5...|
+--------------------+
only showing top 20 rows



In [49]:
test_predictions = lr_model.transform(unlabeled_df)
test_predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[4.0,220.0,54.0,1...|20.904513821240023|
|[6.0,30.276999999...| 4.532972862832758|
|[6.0,110.23899999...|11.068468738662364|
|[7.0,116.0,31.0,9...| 12.63839344288523|
|[10.0,77.0,20.16,...| 8.795143116958535|
|[10.0,81.76899999...|  8.73701863368302|
|[10.0,110.0,29.74...|12.047105257679602|
|[11.0,138.0,31.14...|12.938074110376412|
|[12.0,77.104,20.0...|  8.76934595865606|
|[13.0,25.0,3.82,5...| 2.980236486609423|
|[13.0,91.0,20.32,...|  9.12884710614818|
|[14.0,33.0,4.9,5....|3.1423761735371247|
|[14.0,76.8,19.6,8...| 8.851307911678116|
|[14.0,83.0,17.5,9...| 9.195410572508287|
|[15.0,30.27699999...| 4.078420155243922|
|[17.0,70.0,20.76,...| 7.495640321034496|
|[18.0,70.367,20.5...|  8.50870563017861|
|[18.0,70.60600000...| 7.817230041076574|
|[18.0,77.499,19.5...| 8.407451556624899|
|[19.0,16.8,2.96,5...|2.2768149051086826|
+--------------------+------------

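The linear regression model performs well: on the held-out test set it achieves an R² of about 0.96 (adjusted R² about 0.95) and an RMSE of about 0.77, which is roughly 77 crew members since `crew` is recorded in hundreds. The correlations below show why: crew size is very strongly correlated with ship-size features such as `cabins` and `passengers`.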

In [83]:
df.select(corr(col1='cabins',col2='crew')).show()

+------------------+
|corr(cabins, crew)|
+------------------+
|0.9508226063578497|
+------------------+



In [84]:
df.select(corr(col1='passengers',col2='crew')).show()

+----------------------+
|corr(passengers, crew)|
+----------------------+
|    0.9152341306065384|
+----------------------+
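
`corr` computes the Pearson correlation coefficient: the covariance of the two columns divided by the product of their standard deviations. For illustration, the sketch below applies the formula to only the first seven rows shown in `df.show()`, so its result will differ from the full-dataset value of 0.95. `pearson` is a helper defined here, not part of PySpark.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# cabins and crew for the first seven ships listed by df.show()
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]

print(pearson(cabins, crew))
```

Values close to 1 mean the two columns rise and fall together almost linearly, which is why a linear model built on ship-size features predicts crew so well here.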

