# Linear Regression

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
Your job is to create a regression model that predicts how many crew members future ships will need. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [0]:
# Import Spark Session
from pyspark.sql import SparkSession

In [0]:
# Create (or reuse) the Spark session
spark = SparkSession.builder.appName('lr_consulting').getOrCreate()

In [0]:
# Load the data from the cruise_ship_info table (assumed to be registered already)
data = spark.sql('SELECT * FROM cruise_ship_info')

In [0]:
# Print the schema (column names and types)
data.printSchema()

In [0]:
# Print dataset 
data.show()

In [0]:
# Count the distinct cruise lines
data.select('Cruise_line').distinct().count()

In [0]:
# Convert the Cruise_line strings to numeric indices
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_indexed")
indexed_data = indexer.fit(data).transform(data)
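
To make the indexing step less of a black box, here is a minimal plain-Python sketch of the idea behind `StringIndexer`: by default it orders labels by descending frequency, so the most common cruise line gets index 0. The function name, the sample labels, and the alphabetical tie-break are assumptions made for this illustration, not Spark's implementation.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically here; that is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical cruise line labels, for illustration only
cruise_lines = ["LineB", "LineA", "LineB", "LineC", "LineA", "LineB"]

mapping = fit_string_index(cruise_lines)
indexed = [mapping[label] for label in cruise_lines]

print(mapping)
print(indexed)
```

Note that the resulting index is only a code for the category. A linear model will still treat it as an ordinary number, which is worth keeping in mind when interpreting the coefficient for `Cruise_line_indexed`.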

In [0]:
# Show indexed dataset
indexed_data.show()

In [0]:
# Show the columns from index 2 onward (skips the first two columns)
indexed_data.columns[2:]

In [0]:
# Build the final dataset from those columns
final_df = indexed_data.select(indexed_data.columns[2:])

In [0]:
final_df.show()

In [0]:
final_df.columns

In [0]:
# Import Vectors and VectorAssembler
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [0]:
# Combine the feature columns into a single vector column named 'features'
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins',
               'passenger_density', 'Cruise_line_indexed'],
    outputCol='features')

In [0]:
# Add the assembled 'features' column to the dataset
final_vec_data = assembler.transform(final_df) 
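
Conceptually, the assembler just takes the listed columns from each row, in order, and packs them into one numeric vector. The plain-Python sketch below shows that idea; the `assemble` helper and the ship values are hypothetical and only mirror the column names used above.

```python
feature_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
                'passenger_density', 'Cruise_line_indexed']

def assemble(row, input_cols):
    """Return the row's values for input_cols, in order, as a list of floats."""
    return [float(row[col]) for col in input_cols]

# Hypothetical ship; the numbers are made up for illustration
row = {
    'Age': 6, 'Tonnage': 30.0, 'passengers': 7.0, 'length': 6.0,
    'cabins': 3.5, 'passenger_density': 42.0, 'Cruise_line_indexed': 2.0,
    'crew': 3.5,
}

features = assemble(row, feature_cols)
print(features)
```

The label column (`crew`) is deliberately left out of the vector, since it is what the model is meant to predict.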

In [0]:
# Print final dataset
final_vec_data.show(truncate=False)

In [0]:
# Keep only the feature vector (X) and the label, crew (Y)
filtered_data = final_vec_data.select('features', 'crew')

In [0]:
filtered_data.show()

In [0]:
# Split the data into train and test sets (roughly 70% train, 30% test)
train_data, test_data = filtered_data.randomSplit([0.7,0.3])
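
A random split of this kind assigns each row independently, so the 70/30 proportions are approximate rather than exact, and the result changes between runs unless a seed is fixed (`randomSplit` accepts an optional `seed` argument). The following plain-Python sketch illustrates both points; `random_split` is a hypothetical helper, not Spark's implementation.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

# 158 stand-in rows, matching the number of ships in the description
rows = list(range(158))

train_a, test_a = random_split(rows, 0.7, seed=42)
train_b, test_b = random_split(rows, 0.7, seed=42)

# The sizes are close to 70/30 but usually not exactly that
print(len(train_a), len(test_a))
# The same seed gives the same split
print(train_a == train_b)
```

With only 158 ships, the evaluation metrics can shift noticeably from one split to another, so fixing a seed makes results easier to compare.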

In [0]:
# Describe train
train_data.describe().show()

In [0]:
# Describe test
test_data.describe().show()

In [0]:
# Import linear regression
from pyspark.ml.regression import LinearRegression

In [0]:
# Build model
lr = LinearRegression(featuresCol='features', labelCol='crew',predictionCol='prediction')

In [0]:
# Train model
lrModel = lr.fit(train_data)

In [0]:
# Evaluate the model on the held-out test data
model_results = lrModel.evaluate(test_data)

In [0]:
# MSE
model_results.meanSquaredError

In [0]:
# RMSE
model_results.rootMeanSquaredError

In [0]:
# R-squared
model_results.r2
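
For reference, the three metrics above can be written out by hand. The sketch below computes MSE, RMSE, and R-squared in plain Python on made-up labels and predictions (not results from the cruise ship data), so the arithmetic behind the summary values is visible.

```python
import math

def regression_metrics(labels, predictions):
    """Return (mse, rmse, r2) for paired labels and predictions."""
    n = len(labels)
    # Sum of squared errors between labels and predictions
    sse = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    mse = sse / n
    rmse = math.sqrt(mse)
    # Total sum of squares around the mean label
    mean_label = sum(labels) / n
    sst = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1 - sse / sst
    return mse, rmse, r2

# Hypothetical crew values (in 100s) and predictions
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.5]

mse, rmse, r2 = regression_metrics(labels, predictions)
print(mse, rmse, r2)
```

Because `crew` is recorded in hundreds, an RMSE should be read in the same unit: a value of 0.43, for example, would correspond to a typical error of about 43 crew members.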

In [0]:
# "Manual" prediction on unlabeled data (features only)
predict_data = lrModel.transform(test_data.select('features'))
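
For a linear regression model, each prediction is the intercept plus the dot product of the coefficients and the feature vector; the fitted values are available as `lrModel.coefficients` and `lrModel.intercept`. A minimal sketch with made-up numbers (two features instead of seven, and a hypothetical `predict` helper):

```python
def predict(coefficients, intercept, features):
    """Linear prediction: intercept + sum of coefficient * feature."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

# Hypothetical model parameters and feature values, for illustration only
coefficients = [0.5, 2.0]
intercept = 1.0
features = [4.0, 3.0]

prediction = predict(coefficients, intercept, features)
print(prediction)  # 1.0 + 0.5*4.0 + 2.0*3.0 = 9.0
```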

In [0]:
test_data.select('crew').show()

In [0]:
predict_data.show()
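
The introduction asks why the model performs so well. One way to look into that is to check how strongly `crew` is correlated with individual features such as `passengers` or `cabins`; if one of them is almost linearly related to crew size, a high R-squared is not surprising. In Spark this can be done with `pyspark.sql.functions.corr`. The sketch below computes the same Pearson correlation in plain Python on made-up values, so the numbers are illustrative and not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical passenger and crew counts (both in 100s)
passengers = [10.0, 20.0, 30.0, 40.0]
crew = [4.0, 8.5, 11.5, 16.0]

r = pearson(passengers, crew)
print(r)
```

A coefficient close to 1 would indicate that the feature alone already explains most of the variation in crew size.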