
## Predict how many Crew Members will be needed for future ships

Hyundai Heavy Industries is one of the world's largest ship manufacturing companies and builds cruise liners. The company, headquartered in South Korea, wants help producing accurate estimates of how many crew members a ship will require.
They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.
Here is what the data looks like so far:

Description: Measurements of ship size, capacity, crew, and age for 158 cruise ships.


Variables/Columns

| Variable | Columns |
|---|---|
| Ship Name | 1-20 |
| Cruise Line | 21-40 |
| Age (as of 2013) | 46-48 |
| Tonnage (1000s of tons) | 50-56 |
| Passengers (100s) | 58-64 |
| Length (100s of feet) | 66-72 |
| Cabins (100s) | 74-80 |
| Passenger Density | 82-88 |
| Crew (100s) | 90-96 |

In [None]:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.regression import LinearRegression

In [None]:

spark = SparkSession.builder.appName('cruise').getOrCreate()

In [None]:
crew_ships_data = spark.read.csv("DataSets/cruise_ship_info.csv",inferSchema=True,header=True)
crew_ships_data


## Exploratory Data Analysis (EDA)

In [None]:
crew_ships_data.printSchema()
crew_ships_data.describe().show()

In [None]:
crew_ships_data.show()

In [None]:
"""
Dealing with Cruise Line Data. 
StringIndexer encodes a string column of labels to a column of label indices
"""

crew_ships_data.select('Cruise_line').show()
crew_ships_data.groupBy('Cruise_line').count().show()

indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(crew_ships_data).transform(crew_ships_data)
indexed.head(5)
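
To make the indexing step concrete, here is a minimal pure-Python sketch of the kind of mapping `StringIndexer` produces under its default frequency-descending ordering, where the most frequent label receives index 0.0. The helper `index_labels` and the sample values are hypothetical, and the alphabetical tie-break is an assumption based on recent Spark versions.

```python
from collections import Counter

def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending frequency; ties are broken alphabetically (assumed).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical sample of cruise-line labels, not taken from the dataset.
sample = ["Royal_Caribbean", "Carnival", "Carnival", "Princess", "Carnival", "Princess"]
mapping = index_labels(sample)
indexed_sample = [mapping[label] for label in sample]
print(mapping)
print(indexed_sample)
```

These indices reflect only how often each cruise line appears, yet the regression below treats `cruise_cat` as an ordinary numeric feature. One-hot encoding the indexed column is a common alternative worth comparing.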


In [None]:
indexed.columns

In [None]:
assembler = VectorAssembler(
    inputCols=['Age',
               'Tonnage',
               'passengers',
               'length',
               'cabins',
               'passenger_density',
               'cruise_cat'],
    outputCol="features")

output = assembler.transform(indexed)
output.select("features", "crew").show()


In [None]:
final_data = output.select("features", "crew")
final_data

In [None]:
train_data,test_data = final_data.randomSplit([0.7,0.3])

In [None]:
lr = LinearRegression(labelCol='crew')
lrModel = lr.fit(train_data)

In [None]:
# Print the coefficients and intercept for linear regression
print("Coefficients: {} Intercept: {}".format(lrModel.coefficients,lrModel.intercept))
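
The coefficients and intercept define the fitted linear model: a prediction is the dot product of the coefficient vector with a ship's feature vector, plus the intercept. The sketch below shows that arithmetic in plain Python. All numbers are hypothetical placeholders rather than values from the fitted model, and `predict_crew` is an illustrative helper, not part of PySpark.

```python
def predict_crew(coefficients, intercept, features):
    """Linear-model prediction: intercept + sum(coef_i * feature_i)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical values, listed in the assembler's feature order:
# Age, Tonnage, passengers, length, cabins, passenger_density, cruise_cat
coefficients = [-0.01, 0.01, -0.15, 0.40, 0.85, 0.0, 0.05]
intercept = -1.0
ship = [10.0, 100.0, 25.0, 9.0, 12.0, 40.0, 1.0]

predicted = predict_crew(coefficients, intercept, ship)
print("Predicted crew (100s): {:.2f}".format(predicted))
```

The feature order matters: each coefficient lines up with the column at the same position in the `inputCols` list passed to `VectorAssembler`.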

In [None]:
test_results = lrModel.evaluate(test_data)
print("RMSE: {}".format(test_results.rootMeanSquaredError))
print("MSE: {}".format(test_results.meanSquaredError))
print("R2: {}".format(test_results.r2))
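
For reference, the three metrics above can be reproduced by hand. The following is a minimal sketch using the standard definitions (MSE as the mean squared residual, RMSE as its square root, and R2 as one minus the ratio of residual to total sum of squares). The helper `regression_metrics` and the sample values are hypothetical and are not results from this dataset.

```python
import math

def regression_metrics(actual, predicted):
    """Return (rmse, mse, r2) for two equal-length sequences of numbers."""
    n = len(actual)
    # Sum of squared differences between observed and predicted values.
    residual_ss = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    # Sum of squared differences between observed values and their mean.
    mean_actual = sum(actual) / n
    total_ss = sum((a - mean_actual) ** 2 for a in actual)
    mse = residual_ss / n
    rmse = math.sqrt(mse)
    r2 = 1.0 - residual_ss / total_ss
    return rmse, mse, r2

# Hypothetical crew values in hundreds, for illustration only.
actual_crew = [3.0, 5.0, 7.0, 9.0]
predicted_crew = [2.5, 5.5, 7.5, 8.5]
rmse, mse, r2 = regression_metrics(actual_crew, predicted_crew)
print("RMSE: {} MSE: {} R2: {}".format(rmse, mse, r2))
```

Because `crew` is recorded in hundreds, RMSE is in the same unit: a value of 0.5 would correspond to a typical error of about 50 crew members.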