# Linear Regression Consulting Project

## Project objective:
Predict how many crew members a ship will need

## Project detail:

[Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners. The company wants a predictive model for the ships it builds.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

#### Hint: use StringIndexer and OneHotEncoder

##### StringIndexer:
StringIndexer (roughly comparable to LabelEncoder in sklearn) maps each distinct string to an index in [0, numLabels). By default the indices are ordered by label frequency, so the most frequent label gets index 0.
The problem is that the resulting numbers look ordinal: a linear model will treat index 16 as "larger" than index 2, even though cruise lines have no natural order. To overcome this problem, we use a one-hot encoder.
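To make the indexing step concrete, here is a minimal pure-Python sketch of frequency-ordered indexing. It is an illustration under assumptions, not Spark's implementation: the helper `fit_string_index` and the toy column are hypothetical, and ties are broken alphabetically here as an assumption about the default behaviour.

```python
from collections import Counter

def fit_string_index(values):
    """Map each distinct string to a float index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency; break ties alphabetically (assumed tie rule).
    ordered = sorted(counts, key=lambda s: (-counts[s], s))
    return {label: float(i) for i, label in enumerate(ordered)}

# Hypothetical toy column, not taken from cruise_ship_info.csv.
cruise_lines = ["Carnival", "Azamara", "Carnival", "Costa", "Carnival", "Costa"]

mapping = fit_string_index(cruise_lines)
indexed = [mapping[v] for v in cruise_lines]

print(mapping)  # {'Carnival': 0.0, 'Costa': 1.0, 'Azamara': 2.0}
print(indexed)  # [0.0, 2.0, 0.0, 1.0, 0.0, 1.0]
```

Note that the alphabetically first label does not get index 0 unless it is also the most frequent one, which is why the indices carry no meaningful order for a regression model.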

##### OneHotEncoder:
**Note:** pyspark OneHotEncoder is different from scikit-learn’s OneHotEncoder.

The pyspark OneHotEncoder maps a column of **category indices** to a column of binary vectors. So we first encode the categorical feature to category indices using **StringIndexer**, then apply the pyspark OneHotEncoder.

**Note:** The last category is not included by default (configurable via dropLast), because including it would make the vector entries sum to one and hence be linearly dependent. So with 5 categories, an input value of 2.0 maps to [0.0, 0.0, 1.0, 0.0], while an input value of 4.0 maps to [0.0, 0.0, 0.0, 0.0].
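The effect of dropLast can be sketched in a few lines of plain Python. The function `one_hot` below is a hypothetical stand-in that returns a dense list, whereas Spark produces a sparse vector; it only illustrates the mapping described in the note.

```python
def one_hot(index, num_categories, drop_last=True):
    """Return a dense one-hot list for a category index.

    With drop_last=True the last category is encoded as all zeros,
    so the vector has num_categories - 1 entries.
    """
    i = int(index)
    if not 0 <= i < num_categories:
        raise ValueError(f"index {index} outside [0, {num_categories})")
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if i < size:  # the dropped last category leaves the vector all zeros
        vec[i] = 1.0
    return vec

print(one_hot(2.0, 5))                   # [0.0, 0.0, 1.0, 0.0]
print(one_hot(4.0, 5))                   # [0.0, 0.0, 0.0, 0.0]
print(one_hot(4.0, 5, drop_last=False))  # [0.0, 0.0, 0.0, 0.0, 1.0]
```

By the same logic, a one-hot vector of length 19 would be consistent with 20 distinct category values when the last one is dropped.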


In [2]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('MLlibLRProject').getOrCreate()


In [3]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator

In [4]:
data = spark.read.csv('cruise_ship_info.csv',header=True, inferSchema=True)

In [5]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [6]:
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder

In [7]:
indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_lineIndex")
indexed = indexer.fit(data).transform(data)
indexed.head(2)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_lineIndex=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_lineIndex=16.0)]

In [8]:
onehotencoder = OneHotEncoder(inputCol="Cruise_lineIndex",
                        outputCol="Cruise_lineVec")
data_encoded = onehotencoder.fit(indexed).transform(indexed)
data_encoded.head(2)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_lineIndex=16.0, Cruise_lineVec=SparseVector(19, {16: 1.0})),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, Cruise_lineIndex=16.0, Cruise_lineVec=SparseVector(19, {16: 1.0}))]

In [9]:
data_encoded.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_lineIndex: double (nullable = false)
 |-- Cruise_lineVec: vector (nullable = true)



In [10]:
# VectorAssembler to bring the data to the MLlib format
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(inputCols=['Age','Tonnage','passengers','length','cabins','passenger_density',
                                       'Cruise_lineVec'],
                            outputCol = 'features')
final_data = assembler.transform(data_encoded)
final_data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_lineIndex: double (nullable = false)
 |-- Cruise_lineVec: vector (nullable = true)
 |-- features: vector (nullable = true)



In [11]:
# Train/test split (70% / 30%); randomSplit is random, so results vary between runs unless a seed is passed
train_data , test_data = final_data.randomSplit([0.7,0.3])

In [12]:
# Linear regression model
lr = LinearRegression()

#### We use a ParamGridBuilder to construct a grid of parameters to search over.

In [13]:
paramGrid = ParamGridBuilder()\
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .addGrid(lr.fitIntercept, [False, True])\
    .addGrid(lr.elasticNetParam, [0.0, 0.5, 1.0])\
    .build()


#### A CrossValidator is defined to choose the best parameters.

CrossValidator requires an Estimator, a set of Estimator ParamMaps, and an Evaluator.

The estimator can be an algorithm (e.g., lr) or a Pipeline.

In [14]:
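regEvaluator = RegressionEvaluator()
cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=paramGrid,
                    evaluator=regEvaluator,
                    numFolds=2)
# LinearRegression and RegressionEvaluator both look for a 'label' column by default,
# so rename 'crew' and keep only the columns the model needs
cvmodel = cv.fit(train_data.withColumnRenamed('crew','label').select('label','features'))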

After identifying the best ParamMap, CrossValidator finally **re-fits the Estimator using the best ParamMap and the entire dataset** passed to `fit` (here, the training split, not the held-out test data).
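To get a feel for how much work this search involves, the following sketch counts the model fits implied by the grid and fold count used in this notebook. It is plain Python arithmetic, not a Spark call; the variable names are illustrative.

```python
from itertools import product

# Values copied from the ParamGridBuilder and CrossValidator cells.
reg_params = [0.1, 0.01]
fit_intercepts = [False, True]
elastic_net_params = [0.0, 0.5, 1.0]
num_folds = 2

# The grid is the Cartesian product of all candidate values.
grid = [
    {"regParam": r, "fitIntercept": f, "elasticNetParam": e}
    for r, f, e in product(reg_params, fit_intercepts, elastic_net_params)
]

cv_fits = len(grid) * num_folds  # one fit per ParamMap per fold
total_fits = cv_fits + 1         # plus the final re-fit with the best ParamMap

print(len(grid), cv_fits, total_fits)  # 12 24 25
```

The cost grows multiplicatively with every parameter added to the grid, which is worth keeping in mind before widening the search or raising numFolds.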

#### Make predictions on test set:

In [15]:
test_results = cvmodel.transform(test_data.withColumnRenamed('crew','label').select('label','features'))

In [16]:
test_results.show()

+-----+--------------------+------------------+
|label|            features|        prediction|
+-----+--------------------+------------------+
|  6.0|(25,[0,1,2,3,4,5,...|6.5994171777344235|
|  7.0|(25,[0,1,2,3,4,5,...| 7.362197774546054|
|  4.7|(25,[0,1,2,3,4,5,...| 5.067769575890635|
|  9.0|(25,[0,1,2,3,4,5,...| 9.073485780976117|
| 6.14|(25,[0,1,2,3,4,5,...| 7.668591138876581|
|  9.2|(25,[0,1,2,3,4,5,...| 8.920178819107162|
|  8.0|(25,[0,1,2,3,4,5,...| 9.272748444462165|
|11.76|(25,[0,1,2,3,4,5,...|12.070807073568638|
|  9.2|(25,[0,1,2,3,4,5,...| 8.920195166619415|
|10.68|(25,[0,1,2,3,4,5,...|10.889680734430822|
|  4.7|(25,[0,1,2,3,4,5,...| 4.431157605028788|
|  6.6|(25,[0,1,2,3,4,5,...| 6.589839964906625|
| 13.6|(25,[0,1,2,3,4,5,...| 14.04047648559747|
|  9.2|(25,[0,1,2,3,4,5,...| 8.920206064960919|
|  9.0|(25,[0,1,2,3,4,5,...| 9.342592150078772|
|  6.8|(25,[0,1,2,3,4,5,...| 7.208845904257517|
| 5.57|(25,[0,1,2,3,4,5,...| 5.983319466127483|
| 4.38|(25,[0,1,2,3,4,5,...| 4.893535609

#### Evaluate by various metrics, e.g., 'rmse', 'mse'

RegressionEvaluator uses 'rmse' by default. To select a metric explicitly, call setMetricName on the evaluator instance:

In [17]:
regEvaluator.setMetricName('rmse') # or pass metricName="rmse" when constructing regEvaluator

RegressionEvaluator_d1395552505c

In [18]:
# Reuse the predictions computed in In [15] instead of transforming the test set again
regEvaluator.evaluate(test_results)


0.6001965838001554
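Since crew is measured in hundreds, an RMSE of about 0.60 corresponds to a typical error of roughly 60 crew members. As a sanity check on what the metric computes, here is a minimal sketch of RMSE in plain Python, applied to the first four label/prediction pairs printed by `test_results.show()`. It will not reproduce the value above, which is computed over the whole test set; the helper `rmse` is illustrative and not part of pyspark.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((label - prediction)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# First four rows of the test_results table (crew in hundreds).
labels = [6.0, 7.0, 4.7, 9.0]
predictions = [6.5994171777344235, 7.362197774546054,
               5.067769575890635, 9.073485780976117]

subset_rmse = rmse(labels, predictions)
print(subset_rmse)        # RMSE in hundreds of crew members
print(subset_rmse * 100)  # the same error expressed in crew members
```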