## **Linear Regression**

Hyundai Heavy Industries is currently building new ships for some customers and wants you to build a model that predicts how many crew members each ship will need.

Here is what the data looks like so far:

Description: Measurements of ship size, capacity, crew, and age for 158 cruise ships.


Variables/Columns (with each field's character positions in the original fixed-width data file):

> Ship Name   1-20

> Cruise Line   21-40

> Age (as of 2013)   46-48

> Tonnage (1000s of tons)   50-56

> Passengers (100s)   58-64

> Length (100s of feet)   66-72

> Cabins (100s)   74-80

> Passenger Density   82-88

> Crew (100s)   90-96

In [2]:
!pip install pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("lr_model").getOrCreate()

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=45ac65e5887a6d1dca66eb1a30561a5e089bcadfe7f410520c58ac46bfb8fc47
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [4]:
# Import the data; header=True takes column names from the first row, inferSchema=True infers column types
data = spark.read.csv("cruise_ship_info.csv",inferSchema=True,header=True)
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [9]:
data.head()

Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55)
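
Most columns are stored in scaled units (thousands of tons, hundreds of passengers, and so on), so it can help to convert a row back to raw counts before interpreting it. Below is a minimal sketch in plain Python using the values from the `Journey` row above. The `SCALE` mapping and the `to_raw_units` helper are illustrative names introduced here, not part of the dataset or of PySpark.

```python
# Scale factors taken from the column descriptions, e.g. "Crew (100s)".
# Age and passenger_density are not scaled, so they are left out.
SCALE = {
    "Tonnage": 1000,    # 1000s of tons
    "passengers": 100,  # 100s of passengers
    "length": 100,      # 100s of feet
    "cabins": 100,      # 100s of cabins
    "crew": 100,        # 100s of crew members
}


def to_raw_units(row):
    """Return a copy of `row` with scaled columns converted to raw counts."""
    return {
        name: round(value * SCALE[name]) if name in SCALE else value
        for name, value in row.items()
    }


# Values copied from the first row of the data (ship "Journey").
journey = {
    "Age": 6,
    "Tonnage": 30.276999999999997,
    "passengers": 6.94,
    "length": 5.94,
    "cabins": 3.55,
    "passenger_density": 42.64,
    "crew": 3.55,
}

raw = to_raw_units(journey)
print(raw)
```

Read this way, a `crew` value of 3.55 means roughly 355 crew members, which is the quantity the model is ultimately meant to predict.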

In [11]:
# Iterating over a Row yields its field values in column order
for item in data.head():
    print(item)

Journey
Azamara
6
30.276999999999997
6.94
5.94
3.55
42.64
3.55


### **Model building**

In [12]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [13]:
data.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']

In [15]:
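# Combine the numeric predictor columns into a single vector column named 'features'.
# The string columns (Ship_name, Cruise_line) and the label (crew) are left out.
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density'],
    outputCol='features',
)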

In [16]:
output = assembler.transform(data)

In [18]:
output.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|[17.0,101.353,26....|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+--------------------+
only showing top 5 rows
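
Conceptually, the assembler takes the listed input columns of each row, in order, and packs them into one numeric vector. The plain-Python sketch below mimics that step for a single row. It only illustrates the idea and is not how `VectorAssembler` is implemented; `assemble` is a hypothetical helper name.

```python
# Same column order as the inputCols passed to VectorAssembler above.
INPUT_COLS = ["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density"]


def assemble(row, input_cols):
    """Pack the selected columns of one row into a list of floats, in order."""
    return [float(row[col]) for col in input_cols]


# First row of the data (ship "Journey"), written as a plain dict.
journey = {
    "Ship_name": "Journey",
    "Cruise_line": "Azamara",
    "Age": 6,
    "Tonnage": 30.276999999999997,
    "passengers": 6.94,
    "length": 5.94,
    "cabins": 3.55,
    "passenger_density": 42.64,
    "crew": 3.55,
}

features = assemble(journey, INPUT_COLS)
print(features)
```

The integer `Age` becomes `6.0`, which matches the start of the truncated `features` value shown for `Journey` in the output above.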

In [19]:
# Keep only the feature vector and the label
final_data = output.select('features', 'crew')

In [20]:
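# Split roughly 70/30 into training and test sets.
# No seed is passed, so the split (and every number that follows) changes from run to run.
train_data, test_data = final_data.randomSplit([0.7, 0.3])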

In [21]:
train_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              105|
|   mean|7.527809523809534|
| stddev|3.362858782961186|
|    min|             0.59|
|    max|             13.6|
+-------+-----------------+



In [22]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                53|
|   mean|  8.32188679245283|
| stddev|3.7436027775617906|
|    min|              1.97|
|    max|              21.0|
+-------+------------------+
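
The two summaries show that the split is only approximately 70/30: 105 training rows and 53 test rows out of 158, or about 66% and 34%. The weights act as probabilities rather than exact quotas. The plain-Python sketch below illustrates that idea by assigning each row independently with a fixed seed. It is an illustration under that assumption, not Spark's implementation, and `random_split` is a hypothetical helper.

```python
import random


def random_split(rows, train_weight=0.7, seed=42):
    """Send each row to train with probability `train_weight`, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships
train, test = random_split(rows)
train_again, test_again = random_split(rows)

print(len(train), len(test))
print(train == train_again)  # same seed, same split
```

In PySpark the same reproducibility is available by passing `seed=` to `randomSplit`. Note also that the largest `crew` value in the test set (21.0) is well above the largest in the training set (13.6), so the model is evaluated partly on ships larger than any it was fitted on.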



In [24]:
# Create a LinearRegression estimator; featuresCol defaults to 'features', so only the label column is set
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(labelCol='crew')

In [29]:
lrModel = lr.fit(train_data)
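
For reference, the fitted model can be written as follows. This is a sketch assuming the default `LinearRegression` settings (no regularization), where $x_1, \dots, x_6$ are the six assembled features, $y$ is `crew`, and $n$ is the number of training rows:

$$
\hat{y} = \beta_0 + \sum_{j=1}^{6} \beta_j x_j,
\qquad
(\hat{\beta}_0, \hat{\beta}) = \arg\min_{\beta_0,\,\beta} \; \frac{1}{2n} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{6} \beta_j x_{ij} \Bigr)^2
$$

Here $\beta_0$ corresponds to `lrModel.intercept` and $\beta_1, \dots, \beta_6$ to `lrModel.coefficients`, in the same order as the assembler's input columns.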

In [30]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))

Coefficients: [-0.011748454955341037,0.013770001630854995,-0.11594311285679082,0.42569951882706913,0.6983053526538295,0.0005109114943578635] Intercept: -0.6088847787685222
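
The coefficients are printed as a bare vector, in the same order as the assembler's `inputCols`. Pairing them with the column names makes them easier to read. A small plain-Python sketch using the values printed above:

```python
# Column order used when the feature vector was assembled.
feature_names = ["Age", "Tonnage", "passengers", "length", "cabins", "passenger_density"]

# Coefficients copied from the printed output of the fitted model.
coefficients = [
    -0.011748454955341037,
    0.013770001630854995,
    -0.11594311285679082,
    0.42569951882706913,
    0.6983053526538295,
    0.0005109114943578635,
]

by_name = dict(zip(feature_names, coefficients))
for name, coef in by_name.items():
    print(f"{name:>18}: {coef:+.4f}")
```

Each coefficient is the change in predicted `crew` (in hundreds) per one-unit change in that feature, with the other features held fixed. For example, the `cabins` coefficient of about 0.70 means that 100 additional cabins are associated with roughly 70 additional crew members. Because the features are on different scales, the raw magnitudes are not directly comparable as measures of importance.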


In [31]:
test_results = lrModel.evaluate(test_data)

In [32]:
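# Residuals are label minus prediction, in units of 100 crew members.
# Most of the rows shown are within about 1.5 of zero; one is a clear outlier at about 7.3.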
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|  0.9805789655990012|
| -1.2192591453193344|
| -1.1666828263494615|
|  0.5720510521408162|
|-0.20666121154220285|
|  1.0878618479564341|
|-0.20728975903893954|
|-0.01734726095948602|
| 0.16632221026245375|
|  0.5315155784349237|
| -0.3943688871362685|
|-0.16329565644219102|
|  1.1463946342450306|
|-0.25460281211436175|
|   7.304967102156933|
|-0.20358106378192042|
|-0.23599996453155025|
| -0.3567745621511591|
|-0.11931032532223185|
| -1.4841507626901782|
+--------------------+
only showing top 20 rows
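
One quick way to scan residuals like these is to flag any whose absolute value exceeds a chosen cutoff. The sketch below does this in plain Python for the 20 residuals shown above. The cutoff of 3.0 (that is, 300 crew members) is an arbitrary assumption made for illustration, not a value from the notebook.

```python
# The 20 residuals displayed above (label minus prediction, in 100s of crew).
residuals = [
    0.9805789655990012, -1.2192591453193344, -1.1666828263494615,
    0.5720510521408162, -0.20666121154220285, 1.0878618479564341,
    -0.20728975903893954, -0.01734726095948602, 0.16632221026245375,
    0.5315155784349237, -0.3943688871362685, -0.16329565644219102,
    1.1463946342450306, -0.25460281211436175, 7.304967102156933,
    -0.20358106378192042, -0.23599996453155025, -0.3567745621511591,
    -0.11931032532223185, -1.4841507626901782,
]

CUTOFF = 3.0  # hypothetical threshold, chosen only for this illustration

# Keep (row index, residual) pairs whose magnitude exceeds the cutoff.
outliers = [(i, r) for i, r in enumerate(residuals) if abs(r) > CUTOFF]
print(outliers)
```

Only the fifteenth row (index 14) is flagged: its actual crew is about 7.3 hundred higher than the model predicted.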



In [33]:
unlabeled_data = test_data.select('features')

In [34]:
predictions = lrModel.transform(unlabeled_data)

In [35]:
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[4.0,220.0,54.0,1...|   20.019421034401|
|[5.0,86.0,21.04,9...| 9.219259145319334|
|[5.0,160.0,36.34,...|14.766682826349461|
|[6.0,113.0,37.82,...|11.427948947859184|
|[6.0,158.0,43.7,1...|13.806661211542202|
|[8.0,91.0,22.44,9...| 9.912138152043566|
|[9.0,110.0,29.74,...| 11.80728975903894|
|[9.0,116.0,26.0,9...|11.017347260959486|
|[10.0,68.0,10.8,7...| 6.193677789737547|
|[10.0,77.0,20.16,...| 8.468484421565076|
|[10.0,81.76899999...| 8.814368887136268|
|[10.0,105.0,27.2,...| 10.84329565644219|
|[11.0,90.0,22.4,9...|  9.85360536575497|
|[11.0,91.62700000...| 9.254602812114362|
|[11.0,110.0,29.74...|11.795032897843068|
|[12.0,25.0,3.88,5...|3.0735810637819205|
|[12.0,58.6,15.66,...|  7.23599996453155|
|[12.0,90.09,25.01...| 9.036774562151159|
|[13.0,25.0,3.82,5...| 3.069310325322232|
|[13.0,63.0,14.4,7...| 6.794150762690178|
+--------------------+------------------+
only showing top 20 rows
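
A prediction is just the intercept plus the dot product of the coefficients with the feature vector, so any row can be checked by hand. The sketch below does this for the row whose features begin `[11.0,110.0,29.74...`. It assumes this is the `Conquest` row from the data shown earlier (Age 11, Tonnage 110.0, passengers 29.74, crew 19.1), since the visible prefix matches.

```python
# Coefficients and intercept copied from the fitted model's printed output.
coefficients = [
    -0.011748454955341037,
    0.013770001630854995,
    -0.11594311285679082,
    0.42569951882706913,
    0.6983053526538295,
    0.0005109114943578635,
]
intercept = -0.6088847787685222

# Features for "Conquest" in assembler order:
# Age, Tonnage, passengers, length, cabins, passenger_density.
conquest_features = [11.0, 110.0, 29.74, 9.53, 14.88, 36.99]
conquest_crew = 19.1  # actual label from the data

# prediction = intercept + sum(coefficient * feature)
prediction = intercept + sum(c * x for c, x in zip(coefficients, conquest_features))
residual = conquest_crew - prediction  # label minus prediction

print(f"prediction: {prediction:.6f}")
print(f"residual:   {residual:.6f}")
```

The result agrees with the 11.795 in the predictions table, and the residual of about 7.305 matches the large value in the residuals output. The model predicts a crew near 1,180 for this ship, while the recorded value is 1,910.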

In [36]:
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))

RMSE: 1.413087078740729
MSE: 1.9968150921040073
