# Linear Regression: Predicting Crew Size
## 1.) Brief
* We must accurately estimate the number of crew per ship
* Based on knowledge of existing ships, predict crew number for new ships
* Different cruise lines have significantly different distributions of crew members per ship
    * This variable is therefore important and should be used in our analysis/model
    * It is a string in the raw data, so we must use **StringIndexer()** to process it (more below)
    
## 2.) Data Load and Pre-Processing
* We'll load our data in from a CSV
* Then we'll investigate for any cleaning steps required (missing data etc.)
* Finally, we'll transform the data into a Spark-friendly input (i.e. 1 label, 1 feature column)

In [1]:
### setup pyspark ###
# load libs
import findspark

# store location of spark files
findspark.init('/home/matt/spark-3.0.2-bin-hadoop3.2')

# load libs
import pyspark
from pyspark.sql import SparkSession

# start new session
spark = SparkSession.builder.appName('crew').getOrCreate()

### load data ###
# read in data
df = spark.read.csv('Data/cruise_ship_info.csv', inferSchema=True, header=True)

# show schema
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [2]:
# peek at data
df.show(3)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
only showing top 3 rows



In [3]:
# load null check libs
from pyspark.sql.functions import isnan, when, count, col

# count null or NaN values by column
# (isnan alone flags NaN but not null, so both are tested)
df.select([count(when(col(c).isNull() | isnan(c), c)).alias(c) for c in df.columns]).show()

+---------+-----------+---+-------+----------+------+------+-----------------+----+
|Ship_name|Cruise_line|Age|Tonnage|passengers|length|cabins|passenger_density|crew|
+---------+-----------+---+-------+----------+------+------+-----------------+----+
|        0|          0|  0|      0|         0|     0|     0|                0|   0|
+---------+-----------+---+-------+----------+------+------+-----------------+----+
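Null and NaN are different things: a null is a missing value, while NaN is a floating-point value that is "not a number". A check built on isnan alone would therefore not count nulls. The plain-Python sketch below illustrates the distinction on a hypothetical toy column (the values are made up for illustration and do not come from the cruise data).

```python
import math

# hypothetical toy column: one missing value (None) and one NaN
values = [3.55, None, float("nan"), 6.7]

# counting NaN only misses the None entry
nan_only = sum(1 for v in values if v is not None and math.isnan(v))

# counting None or NaN catches both kinds of bad value
null_or_nan = sum(1 for v in values if v is None or math.isnan(v))

print("NaN only:", nan_only)
print("null or NaN:", null_or_nan)
```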



## 3.) Encode Text Variables
* Ship name and cruise line are both text variables
* Ship name only identifies an individual ship, so we leave it out of the model
* Cruise line must be converted into a numeric variable using categorical encoding
* Here, we will use **StringIndexer()**

### StringIndexer()
* Converts string labels into indexed values
* 4 ordering options (set via **stringOrderType**):
    * Descending by frequency ('frequencyDesc', default)
    * Ascending by frequency ('frequencyAsc')
    * Descending alphabetically ('alphabetDesc')
    * Ascending alphabetically ('alphabetAsc')
* If two labels occur with the same frequency, alphabetical sorting is used to distinguish them
* Indices start at 0
* If a model is trained with x labels and new data contains labels it has not seen, you can choose (via **handleInvalid**) to:
    * Drop rows containing new labels ('skip')
    * Keep new labels and assign them to one extra index ('keep'; all new labels share the same index)
    * Throw an exception ('error', default)
* [StringIndexer() Spark Docs](https://spark.apache.org/docs/latest/ml-features#stringindexer)
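To make the ordering and unseen-label rules above concrete, here is a minimal plain-Python sketch of the default behaviour (descending frequency, alphabetical tie-break, indices starting at 0). It is an illustration of the rules only, not Spark's implementation; the helper names `index_labels` and `encode` and the toy label counts are hypothetical.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to an index, most frequent label first.

    Ties in frequency are broken alphabetically. Indices are floats
    because the indexed column in this notebook holds doubles.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def encode(label, mapping, handle_invalid="error"):
    """Encode one label, mimicking the three unseen-label options."""
    if label in mapping:
        return mapping[label]
    if handle_invalid == "keep":
        # every unseen label shares one extra index after the known ones
        return float(len(mapping))
    if handle_invalid == "skip":
        # signal that the caller should drop this row
        return None
    raise ValueError(f"Unseen label: {label}")

# hypothetical toy data (counts are made up)
cruise_lines = ["Carnival", "Carnival", "Carnival", "Azamara", "Costa", "Costa"]
mapping = index_labels(cruise_lines)

print(mapping)
print(encode("Unknown_line", mapping, handle_invalid="keep"))
```

In this toy data the most frequent label gets index 0.0, and an unseen label under the 'keep' option gets the next free index.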

In [6]:
# load string indexer lib
from pyspark.ml.feature import StringIndexer

# create indexer instance
indexer = StringIndexer(inputCol='Cruise_line', outputCol='Cruise_line_idx')

# fit indexer to data and create index column
df_idx = indexer.fit(df).transform(df)

# check output
df_idx.show(3)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+---------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_idx|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+---------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|           16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|           16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|            1.0|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+---------------+
only showing top 3 rows



## 4.) Vectorize Features
* We must convert our features into a Spark-friendly format
* Here, we simply use a VectorAssembler to convert multiple feature columns into a single vector column

In [10]:
# load vector libs
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

# store instructions for feature transformation
assembler = VectorAssembler(inputCols=['Age', 'Tonnage', 'passengers', 
                                       'length', 'cabins', 'passenger_density',
                                       'Cruise_line_idx'],
                            outputCol='features')

# transform features into single features column
df_vect = assembler.transform(df_idx)

# extract features and labels only
df_final = df_vect.select('crew', 'features')

# check output
df_final.show(3)

+----+--------------------+
|crew|            features|
+----+--------------------+
|3.55|[6.0,30.276999999...|
|3.55|[6.0,30.276999999...|
| 6.7|[26.0,47.262,14.8...|
+----+--------------------+
only showing top 3 rows
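Conceptually, the assembler just places the chosen columns side by side, in the order given, to form one vector per row. The plain-Python sketch below reproduces this for the first row of the indexed data shown earlier; the helper name `assemble` is hypothetical and this is not how Spark implements it internally.

```python
# column order passed to the assembler
input_cols = ['Age', 'Tonnage', 'passengers',
              'length', 'cabins', 'passenger_density',
              'Cruise_line_idx']

# first row of the indexed data, copied from the output in section 3
row = {'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
       'Tonnage': 30.276999999999997, 'passengers': 6.94, 'length': 5.94,
       'cabins': 3.55, 'passenger_density': 42.64, 'crew': 3.55,
       'Cruise_line_idx': 16.0}

def assemble(row, cols):
    """Collect the chosen columns into one list of floats, in order."""
    return [float(row[c]) for c in cols]

features = assemble(row, input_cols)
label = row['crew']

print(label, features)
```

The resulting list starts with 6.0 and 30.276999999999997, matching the truncated features value displayed for the first row.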



## 5.) Build Regression Model
* Split data 70:30 train:test
* Fit model to train data
* Evaluate test data (act vs pred)
* Make predictions on test data
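Two things are worth knowing about a random 70:30 split. First, without a fixed seed the rows that land in train and test change on every run, so the metrics will change too; **randomSplit()** accepts an optional seed argument for this. Second, the proportions should be treated as approximate rather than exact. The plain-Python sketch below illustrates both points under the assumption that each row is assigned independently at random; the helper `random_split` and the seed value 42 are hypothetical, and Spark's own sampler differs in its details.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train with probability train_weight, else test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test

# stand-in for the 158 rows of the dataset (row ids only)
rows = list(range(158))

# the same seed gives the same split on every run
train_a, test_a = random_split(rows, seed=42)
train_b, test_b = random_split(rows, seed=42)

print(len(train_a), len(test_a))
print(train_a == train_b)
```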

In [13]:
# load linear regression libs
from pyspark.ml.regression import LinearRegression

# split data into train/test 70/30
train, test = df_final.randomSplit([0.7, 0.3])

# create linear regression model instance
lr = LinearRegression(featuresCol='features',
                      labelCol='crew',
                      predictionCol='predictions')

# fit model to train data
lr_model = lr.fit(train)

# evaluate test results
# i.e. act vs. pred
test_results = lr_model.evaluate(test)

# check residuals (i.e. actual minus predicted)
test_results.residuals.show(5)

+--------------------+
|           residuals|
+--------------------+
|  0.3948335210756415|
| -0.8393980014055366|
| -0.0954473737173771|
|0.005568388964199755|
| -1.1989041805894969|
+--------------------+
only showing top 5 rows



## 6.) Evaluate Model
* Our r2 value is quite high at ~0.97, i.e. the model explains about 97% of the variance in crew size on the test data
* Our RMSE is 0.69, which is low compared to the mean (7.8) and standard deviation (3.5) of crew size across the full dataset
* Overall this looks like a good fit for our data
* The residuals above are also mostly small, suggesting our predictions sit close to the actual crew sizes
* We will also output our predicted crew sizes for our test data and take a look to see if these seem sensible
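To put the RMSE in context, it helps to express it as a fraction of the mean and standard deviation of crew size. The sketch below does that arithmetic with the values printed in this section. Note one assumption: the mean and standard deviation come from the full dataset, not from the test split alone, so the ratios are a rough guide only.

```python
# values copied from the outputs in this section
rmse = 0.6867053882992723
mean_crew = 7.794177215189873
std_crew = 3.503486564627034

# RMSE as a fraction of the typical crew size
rmse_vs_mean = rmse / mean_crew

# RMSE as a fraction of the natural spread in crew size
rmse_vs_std = rmse / std_crew

print(round(rmse_vs_mean, 3))  # 0.088
print(round(rmse_vs_std, 3))   # 0.196
```

So the typical prediction error is roughly 9% of the average crew size and about a fifth of the spread in crew size.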

In [16]:
# evaluation metrics
print("r2: ", test_results.r2)
print("RMSE: ", test_results.rootMeanSquaredError)

r2:  0.9671694464669325
RMSE:  0.6867053882992723


In [15]:
# compare to actual data
df_final.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



## 7.) Predictions
* Below we make predictions of crew size based off our test features
* You can see that our predictions don't match the actual values exactly, but they track the actual crew sizes fairly well

In [17]:
# create unlabelled test data
unlabelled_data = test.select('features')

# make predictions
predictions = lr_model.transform(unlabelled_data)

# peek at data
predictions.show(5)

+--------------------+-------------------+
|            features|        predictions|
+--------------------+-------------------+
|[22.0,3.341,0.66,...|0.19516647892435846|
|[27.0,5.35,1.67,4...| 1.7193980014055366|
|[27.0,10.0,2.08,4...| 1.6954473737173772|
|[19.0,16.8,2.96,5...| 2.0944316110358003|
|[48.0,22.08,8.26,...|  4.698904180589497|
+--------------------+-------------------+
only showing top 5 rows



In [18]:
# compare to actual results
test.show(5)

+----+--------------------+
|crew|            features|
+----+--------------------+
|0.59|[22.0,3.341,0.66,...|
|0.88|[27.0,5.35,1.67,4...|
| 1.6|[27.0,10.0,2.08,4...|
| 2.1|[19.0,16.8,2.96,5...|
| 3.5|[48.0,22.08,8.26,...|
+----+--------------------+
only showing top 5 rows
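As a sanity check, the five predictions and actual values shown above can be compared by hand. The sketch below recomputes the residuals as actual minus predicted, which reproduces the residuals printed in section 5, and then computes an RMSE over just these five rows. This is a plain-Python illustration; the five-row RMSE is not the same figure as the test-set RMSE, which covers every test row.

```python
import math

# actual crew sizes and predictions, copied from the two outputs above
actual = [0.59, 0.88, 1.6, 2.1, 3.5]
predicted = [0.19516647892435846, 1.7193980014055366, 1.6954473737173772,
             2.0944316110358003, 4.698904180589497]

# residual = actual - predicted
residuals = [a - p for a, p in zip(actual, predicted)]

# root mean squared error over these five rows only
rmse_five_rows = math.sqrt(sum(r ** 2 for r in residuals) / len(residuals))

for a, p, r in zip(actual, predicted, residuals):
    print(f"actual={a:5.2f}  predicted={p:7.4f}  residual={r:8.4f}")
print("RMSE (five rows):", rmse_five_rows)
```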

