# Linear Regression Exercise

A ship manufacturing company that builds cruise liners is currently building new ships for several customers and wants you, as a data scientist, to help estimate how many crew members each ship will require. They have stored all the previous data related to their cruise ships in a CSV file for your reference.

Create a regression model that predicts how many crew members the new ships will require. The design team has also found that acceptable crew counts differ between cruise lines, so the cruise line appears to be a critical feature to include in your linear regression model. Finally, evaluate the performance of your model.


In [1]:
# First, import all the necessary packages
import os
import sys
import pyspark

# Machine-specific paths: adjust these to your own Java and Spark installations.
# Raw strings keep the Windows backslashes from being read as escape sequences.
os.environ["JAVA_HOME"] = r"C:\Program Files\Java\jre1.8.0_301"
os.environ["SPARK_HOME"] = r"C:\spark\spark-3.1.2-bin-hadoop3.2"
os.environ["PYLIB"] = os.environ["SPARK_HOME"] + r"\python\lib"

from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

In [2]:
spark = SparkSession.builder.appName('cruise').getOrCreate()

In [3]:
df = spark.read.csv('cruise_ship_info.csv',inferSchema = True, header = True)

In [4]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
df.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 5 rows
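
The task statement calls the cruise line a critical feature, but Cruise_line is a string column, so it has to be converted to numbers before a linear model can use it. The notebook imports PySpark's StringIndexer, whose default ordering gives the most frequent label index 0. The plain-Python sketch below only illustrates that idea on the five Cruise_line values shown above; the helper name `index_labels` is hypothetical and not part of PySpark.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to an index, most frequent label first.

    Ties are broken alphabetically. Hypothetical helper for illustration only.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Cruise_line values copied from the five rows shown above
cruise_lines = ["Azamara", "Azamara", "Carnival", "Carnival", "Carnival"]
mapping = index_labels(cruise_lines)
indexed = [mapping[label] for label in cruise_lines]
print(mapping)
print(indexed)
```

In PySpark the corresponding step would presumably be a StringIndexer with inputCol='Cruise_line', with its output column added to the VectorAssembler inputs. An index imposes an arbitrary ordering on the categories, so one-hot encoding the index is a common alternative for linear models.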



In [8]:
df.count()

158

In [9]:
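# Assemble the numeric columns into a single feature vector
# (the string column Cruise_line is not included here)
vecAssembler = VectorAssembler(inputCols=["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density"],
                               outputCol="features")
vec_df = vecAssembler.transform(df)
vec_df.show()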

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|[17.0,101.353,26....|
|    Ecstasy|   Carnival| 22|            70.367|     20.

In [21]:
# 80/20 train/test split with a fixed seed for reproducibility
train_df, test_df = vec_df.randomSplit([0.8, 0.2], seed=1)

In [22]:
train_df.count()

131

In [23]:
test_df.count()

27
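
The counts above (131 and 27) are not an exact 80/20 split of 158 rows. That is expected: randomSplit assigns each row to a split at random according to the normalized weights, so the split sizes only approximate the requested proportions. A minimal plain-Python sketch of that idea follows; it does not reproduce Spark's random number stream, and `bernoulli_split` is a hypothetical name.

```python
import random

def bernoulli_split(rows, train_weight=0.8, seed=1):
    """Assign each row independently to train or test (illustrative helper)."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(158))  # stand-in for the 158 ships
train, test = bernoulli_split(rows)
# Sizes depend on the seed and only approximate an 80/20 split
print(len(train), len(test))
```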

In [24]:
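# Elastic-net regularized linear regression: regParam sets the overall penalty strength,
# elasticNetParam mixes the L1 (1.0) and L2 (0.0) penalties
lr = LinearRegression(featuresCol='features', labelCol='crew', maxIter=10, regParam=0.3, elasticNetParam=0.8)
LinRegModel = lr.fit(train_df)
# Score the held-out rows; transform() appends a 'prediction' column
pred_df = LinRegModel.transform(test_df)
pred_df.select(['Age','Tonnage','passengers','length','cabins','passenger_density','crew','prediction']).show(100)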

+---+------------------+----------+------+------+-----------------+-----+------------------+
|Age|           Tonnage|passengers|length|cabins|passenger_density| crew|        prediction|
+---+------------------+----------+------+------+-----------------+-----+------------------+
| 22|             3.341|      0.66|   2.8|  0.33|            50.62| 0.59|1.2773877021825064|
| 22|            52.926|     13.02|  7.18|  6.54|            40.65| 6.17| 6.213243878471165|
| 19|              16.8|      2.96|  5.14|  1.48|            56.76|  2.1|2.9065385959547583|
| 11|             110.0|     29.74|  9.53| 14.88|            36.99| 19.1|11.194009447429984|
| 11| 91.62700000000001|     19.74|  9.64|  9.87|            46.42|  9.0|  9.04558031532133|
| 16|            74.137|      19.5|  9.16|  9.75|            38.02|  7.6| 8.590749492135874|
| 13|             138.0|     31.14|  10.2| 15.57|            44.32|11.76|12.076845669962097|
| 27|              12.5|      3.94|  4.36|  0.88|            31.73| 1.

In [25]:
LinRegModel.coefficients

DenseVector([0.0, 0.0113, -0.0, 0.4361, 0.3969, 0.0])
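
Several coefficients are zero to the four decimals shown, which is the usual effect of the L1 part of the elastic-net penalty. Reading regParam as λ = 0.3 and elasticNetParam as α = 0.8, the objective documented for Spark's linear regression can be written as follows, where n is the number of training rows, x_i a feature vector, y_i the crew value, w the coefficients and b the intercept:

```latex
\min_{w,\,b}\; \frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i + b - y_i\right)^2
\;+\; \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right)
```

With α = 0.8 most of the penalty is L1, which tends to push weak coefficients (here Age, passengers and passenger_density) to zero and leaves Tonnage, length and cabins as the active predictors.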

In [26]:
LinRegModel.intercept

-0.11241710183403018
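
The fitted model is a weighted sum of the features plus the intercept. As a check, the sketch below recomputes the prediction for the test row with Tonnage 110.0 (the same values as the Conquest row shown earlier) from the coefficients printed above. Because those coefficients are rounded to four decimals, the result only approximately matches the 11.194 in the prediction column.

```python
# Coefficients as printed by LinRegModel.coefficients (rounded to 4 decimals),
# in the same order as the VectorAssembler inputCols
feature_names = ["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density"]
coefficients = [0.0, 0.0113, -0.0, 0.4361, 0.3969, 0.0]
intercept = -0.11241710183403018

# Feature values of one test row (Age 11, Tonnage 110.0) from the prediction table
row = [11, 110.0, 29.74, 9.53, 14.88, 36.99]

# prediction = intercept + sum of coefficient * feature value
prediction = intercept + sum(c * x for c, x in zip(coefficients, row))
print(round(prediction, 2))
```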

In [27]:
LinRegModel.summary.rootMeanSquaredError

0.9030693429583632

In [28]:
# LinRegModel.summary reports metrics on the training data, not on test_df
trainingSummary = LinRegModel.summary
print('RMSE: %f' % trainingSummary.rootMeanSquaredError)
print('r2: %f' % trainingSummary.r2)

RMSE: 0.903069
r2: 0.917466
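
The RMSE and r2 above come from LinRegModel.summary, which describes the fit on the training data. To judge how the model generalizes, the same metrics should be computed on test_df; in PySpark, LinRegModel.evaluate(test_df) returns a summary exposing rootMeanSquaredError and r2 for that dataset. The plain-Python sketch below shows the underlying arithmetic on the seven complete (crew, prediction) pairs visible in the truncated prediction table, so its numbers describe only those seven rows, not the full 27-row test set.

```python
import math

# (crew, prediction) pairs copied from the seven complete rows of the prediction table
pairs = [
    (0.59, 1.2773877021825064),
    (6.17, 6.213243878471165),
    (2.1, 2.9065385959547583),
    (19.1, 11.194009447429984),
    (9.0, 9.04558031532133),
    (7.6, 8.590749492135874),
    (11.76, 12.076845669962097),
]

actual = [y for y, _ in pairs]
mean_actual = sum(actual) / len(actual)

ss_res = sum((y - p) ** 2 for y, p in pairs)          # residual sum of squares
ss_tot = sum((y - mean_actual) ** 2 for y in actual)  # total sum of squares

rmse = math.sqrt(ss_res / len(pairs))
r2 = 1 - ss_res / ss_tot
print('RMSE: %f' % rmse)
print('r2: %f' % r2)
```

On these rows the error is dominated by the single ship with crew 19.1, which the model predicts at about 11.2.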
