# Linear Regression Consulting Project

## (Henri's Solution + Additional Comments)

You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships.

You've been flown out to their HQ in Ulsan, South Korea!

Hyundai is one of the world's largest manufacturers of large ships, including cruise liners!

They need your help to give them accurate estimates of how many crew members a ship will require.

They are currently selling ships to some new customers and want you to create a model and use it to predict how many crew members the ships will need.

They have provided you with data containing these features:
- Ship Name
- Cruise Line
- Age (as of 2013)
- Tonnage (1000s of tons)
- Passengers (100s)
- Length (100s of feet)
- Cabins (100s)
- Passenger Density
- Crew (100s)

Your job is to create a regression model that will help predict how many crew members will be needed for future ships.

In other words, use the features you think will be useful to predict the value in the Crew column.

The client also mentioned that they have found that particular cruise lines will differ in acceptable crew counts, so it is most likely an important feature to include in your analysis!

The cruise line value is a string, however!

We have not covered exactly how to convert strings to numbers with Python and Spark (yet).

Try to see if you can discover how to use **StringIndexer** from the documentation!

As in any real world project, there are no "100% correct" answers here.

Just try your best to build the model!

You can optionally see if you can figure out **StringIndexer** on your own (we'll cover it more formally in future lectures).

There can be more than one way to create this model.

All of the necessary information for the project can be found in the files:
- ```Linear_Regression_Consulting_Project.ipynb```
- ```cruise_ship_info.csv```

In [1]:
from pyspark.sql import SparkSession

# Start a SparkSession
spark = SparkSession.builder.appName("hyundai").getOrCreate()

In [2]:
from pyspark.ml.regression import LinearRegression

In [3]:
# Get the data.
data = spark.read.csv("cruise_ship_info.csv", inferSchema=True, header=True)

In [4]:
data.printSchema()

# The actual value that we are trying to predict is "crew".

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [27]:
# Just some Exploratory Data Analysis.
for ship in data.head(5):
    print(ship)
    print("\n")

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)


Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7)


Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1)


Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0)




In [5]:
data.show()
print("The total number of data samples is: {:.0f}.".format(data.count()))

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [28]:
data.groupBy("Cruise_line").count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [6]:
from pyspark.ml.feature import StringIndexer
# Henri's question: why not use one-hot encoding here?
# A plain index imposes an artificial ordering on the cruise lines, and the instructor
# admits that this is not the optimal way to deal with this column.
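
To make the one-hot question concrete, here is a minimal plain-Python sketch of what one-hot encoding does to a categorical column. It is an illustration only, not the notebook's method: the helper name `one_hot` is hypothetical, and in Spark itself the usual route would be the `OneHotEncoder` transformer from `pyspark.ml.feature`, applied to the output of `StringIndexer`.

```python
def one_hot(labels):
    """Map each label to a 0/1 indicator vector with one slot per distinct label."""
    categories = sorted(set(labels))  # fixed, reproducible column order
    position = {label: i for i, label in enumerate(categories)}
    vectors = []
    for label in labels:
        row = [0] * len(categories)
        row[position[label]] = 1  # exactly one slot is switched on
        vectors.append(row)
    return categories, vectors


# Three cruise lines taken from the dataset above.
categories, vectors = one_hot(["Azamara", "Carnival", "Carnival", "Disney"])
print(categories)
for row in vectors:
    print(row)
```

Unlike a single numeric index, the indicator vectors do not suggest that one cruise line is "larger" than another, which is why one-hot encoding is usually preferred for linear models.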

In [7]:
indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_index", handleInvalid="skip")
# A label indexer that maps a string column of labels to an ML column of label indices.
# The indices span the range [0, numLabels).  By default, this is ordered by label frequencies so the most
# frequent label gets index 0.

final_data = indexer.fit(data).transform(data)
final_data.show()
print(final_data.count())

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|cruise_index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|        16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|         1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|         1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|         1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|         1.0|
|    Elati
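
The frequency-based ordering described in the comments can be sketched in plain Python using the counts from the `groupBy("Cruise_line")` output above. The helper `frequency_index` is hypothetical, and the alphabetical tie-break between equally frequent labels is an assumption about Spark's behaviour, although it is consistent with Azamara receiving index 16.0 and Carnival receiving index 1.0 in the table.

```python
from collections import Counter


def frequency_index(labels):
    """Assign 0.0 to the most frequent label, 1.0 to the next, and so on."""
    counts = Counter(labels)
    # Assumption: ties are broken alphabetically by label.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Counts copied from the groupBy output shown earlier.
line_counts = {
    "Costa": 11, "P&O": 6, "Cunard": 3, "Regent_Seven_Seas": 5, "MSC": 8,
    "Carnival": 22, "Crystal": 2, "Orient": 1, "Princess": 17, "Silversea": 4,
    "Seabourn": 3, "Holland_American": 14, "Windstar": 3, "Disney": 2,
    "Norwegian": 13, "Oceania": 3, "Azamara": 2, "Celebrity": 10, "Star": 6,
    "Royal_Caribbean": 23,
}
labels = [line for line, n in line_counts.items() for _ in range(n)]
index = frequency_index(labels)
print(index["Royal_Caribbean"], index["Carnival"], index["Azamara"])
```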

In [8]:
final_data.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_index']

In [9]:
# Firstly, we need to import the vector assembler:
from pyspark.ml.feature import VectorAssembler

In [10]:
# Proceed by creating an assembler object:
assembler = VectorAssembler(inputCols=['Age',
                                       'Tonnage',
                                       'passengers',
                                       'length',
                                       'cabins',
                                       'passenger_density',
                                       'cruise_index'],
                           outputCol="features")

# A feature transformer that merges multiple columns into a vector column.

In [11]:
# Now, we need to transform the data:
output = assembler.transform(final_data)
output.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- cruise_index: double (nullable = false)
 |-- features: vector (nullable = true)



In [12]:
output.head(1)[0][-1]

DenseVector([6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0])

In [13]:
# Next, we need to create our cleaned data
cleaned_data = output.select(["features", "crew"])

cleaned_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [14]:
# Next, we need to split cleaned_data into training and test sets:
train_data, test_data = cleaned_data.randomSplit([0.7, 0.3], seed=123)

In [15]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               107|
|   mean| 7.650560747663561|
| stddev|3.5746763494546427|
|    min|               0.6|
|    max|              21.0|
+-------+------------------+



In [16]:
test_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               51|
|   mean|8.095490196078426|
| stddev| 3.36376414971412|
|    min|             0.59|
|    max|             13.6|
+-------+-----------------+
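
The counts above (107 training rows and 51 test rows out of 158) are roughly 68/32 rather than exactly 70/30. A rough plain-Python sketch of why this can happen follows, under the assumption that each row is assigned to a split independently with the given probability. The helper `bernoulli_split` is hypothetical, and it will not reproduce Spark's exact assignment for `seed=123`.

```python
import random


def bernoulli_split(rows, train_fraction, seed):
    """Send each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)  # seeded generator makes the split reproducible
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships
train, test = bernoulli_split(rows, 0.7, seed=123)
print(len(train), len(test))  # sizes are close to, but usually not exactly, 70/30
```

Fixing the seed keeps the split stable between runs, which is what makes the reported metrics reproducible.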



**Now, let us focus on creating our Linear Regression Model:**

In [17]:
# Instantiate a LinearRegression model object:
lr = LinearRegression(labelCol="crew")

In [18]:
lr_model = lr.fit(dataset=train_data)

**Let us see how well our model actually performed.**

In [19]:
test_results = lr_model.evaluate(dataset=test_data)  # Evaluates the model on a test dataset.

In [20]:
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
| -1.2436453515613266|
| 0.39555679536734445|
|   0.779325033202305|
|  -1.114539935828752|
|  0.5293319265632839|
| -0.7364868100479445|
|-0.21066095651902295|
| 0.38350662903607535|
| -0.5080006970434567|
|  0.9417292283530454|
| -0.2762836267021136|
| -0.2190190001522314|
| -0.4508245338709269|
| -0.6320049581569318|
|  -1.306349795830993|
| 0.25182903824420055|
|  0.9684292315668621|
|  0.8036265196989678|
|  -0.238710224099826|
|  0.8347557051448913|
+--------------------+
only showing top 20 rows



In [21]:
print("The RMSE is:    {:8.4f}.".format(test_results.rootMeanSquaredError))
print("The r^2 is:     {:8.4f}.".format(test_results.r2))  # How much of the variance is explained by the model.
print("The r^2_adj is: {:8.4f}.".format(test_results.r2adj))
print("\n")

cleaned_data.describe().show()

# An RMSE that is small relative to the standard deviation of "crew", together with a high r^2,
# implies that some of our columns are highly correlated with the "crew" column.

The RMSE is:      0.8232.
The r^2 is:       0.9389.
The r^2_adj is:   0.9290.


+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+
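
To show what these three metrics measure, here is a small plain-Python sketch of the standard formulas. The function names are illustrative rather than Spark's API. The adjusted r^2 uses 1 - (1 - r^2)(n - 1)/(n - p - 1), with n = 51 test rows and p = 7 features taken from this notebook.

```python
import math


def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(labels))


def r2(labels, predictions):
    """Share of the label variance explained by the predictions."""
    mean = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2_value, n_samples, n_features):
    """Penalise r^2 for the number of features used."""
    return 1.0 - (1.0 - r2_value) * (n_samples - 1) / (n_samples - n_features - 1)


# r^2 = 0.9389 on 51 test rows with 7 features, as reported above.
print(round(adjusted_r2(0.9389, 51, 7), 3))
```

This reproduces the reported adjusted r^2 of about 0.929. Since crew is recorded in hundreds, an RMSE of 0.8232 corresponds to a typical error of roughly 82 crew members, against a standard deviation of about 350.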



In [29]:
from pyspark.sql.functions import corr

In [31]:
data.select([corr("crew", "passengers"), corr("crew", "cabins")]).show()

# corr() returns the Pearson correlation coefficient r (not r^2).
# Values above 0.9 indicate a very strong positive linear relationship.

+----------------------+------------------+
|corr(crew, passengers)|corr(crew, cabins)|
+----------------------+------------------+
|    0.9152341306065384|0.9508226063578497|
+----------------------+------------------+
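
For reference, the Pearson coefficient can be computed by hand as the covariance divided by the product of the standard deviations. The sketch below is plain Python with a hypothetical helper `pearson_r`. It uses only five (passengers, crew) pairs copied from the rows shown earlier, so its result will differ from the full-dataset values in the table.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled into the range [-1, 1]."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Journey, Celebration, Conquest, Destiny and Ecstasy from the table above.
passengers = [6.94, 14.86, 29.74, 26.42, 20.52]
crew = [3.55, 6.7, 19.1, 10.0, 9.2]
print(pearson_r(passengers, crew))

# Squaring the full-dataset r gives the share of variance one feature explains alone.
print(0.9152341306065384 ** 2)
```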



**Let us review how we can apply this model to data for which we only have features (and no labels).**

In [22]:
# We only mimic the process here by dropping the label column from the test data:
unlabeled_data = test_data.select(["features"])
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[5.0,86.0,21.04,9...|
|[5.0,115.0,35.74,...|
|[5.0,122.0,28.5,1...|
|[6.0,90.0,20.0,9....|
|[6.0,110.23899999...|
|[7.0,116.0,31.0,9...|
|[7.0,158.0,43.7,1...|
|[8.0,77.499,19.5,...|
|[9.0,90.09,25.01,...|
|[9.0,113.0,26.74,...|
|[9.0,116.0,26.0,9...|
|[10.0,68.0,10.8,7...|
|[10.0,86.0,21.14,...|
|[10.0,105.0,27.2,...|
|[10.0,138.0,31.14...|
|[11.0,58.6,15.66,...|
|[11.0,90.0,22.4,9...|
|[11.0,91.0,20.32,...|
|[11.0,91.62700000...|
|[11.0,108.977,26....|
+--------------------+
only showing top 20 rows



In [23]:
predictions = lr_model.transform(dataset=unlabeled_data)  # Appends a "prediction" column to the input dataset.

In [24]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[5.0,86.0,21.04,9...| 9.243645351561327|
|[5.0,115.0,35.74,...|11.804443204632655|
|[5.0,122.0,28.5,1...| 5.920674966797695|
|[6.0,90.0,20.0,9....|10.114539935828752|
|[6.0,110.23899999...|10.970668073436716|
|[7.0,116.0,31.0,9...|12.736486810047944|
|[7.0,158.0,43.7,1...|13.810660956519023|
|[8.0,77.499,19.5,...| 8.616493370963925|
|[9.0,90.09,25.01,...| 9.198000697043456|
|[9.0,113.0,26.74,...|11.438270771646955|
|[9.0,116.0,26.0,9...|11.276283626702114|
|[10.0,68.0,10.8,7...| 6.579019000152232|
|[10.0,86.0,21.14,...| 9.650824533870926|
|[10.0,105.0,27.2,...|11.312004958156932|
|[10.0,138.0,31.14...|13.156349795830993|
|[11.0,58.6,15.66,...| 7.348170961755799|
|[11.0,90.0,22.4,9...|10.031570768433138|
|[11.0,91.0,20.32,...| 9.186373480301032|
|[11.0,91.62700000...| 9.238710224099826|
|[11.0,108.977,26....|11.165244294855109|
+--------------------+------------
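
As a quick consistency check, the residuals shown earlier should equal label minus prediction, so adding each residual to its prediction should recover the true crew value. The sketch below assumes that the residual rows and the prediction rows are listed in the same order, and uses the first three rows of each table.

```python
# First three predictions and residuals, copied from the outputs above.
predictions = [9.243645351561327, 11.804443204632655, 5.920674966797695]
residuals = [-1.2436453515613266, 0.39555679536734445, 0.779325033202305]

# residual = label - prediction, so label = prediction + residual.
labels = [round(p + r, 6) for p, r in zip(predictions, residuals)]
print(labels)
```

The sums come out as clean values (8.0, 12.2 and 6.7), which supports the assumption about row order.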