# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name                  1-20
    Cruise Line                21-40
    Age (as of 2013)           46-48
    Tonnage (1000s of tons)    50-56
    Passengers (100s)          58-64
    Length (100s of feet)      66-72
    Cabins (100s)              74-80
    Passenger Density          82-88
    Crew (100s)                90-96
    
It is saved in a CSV file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [2]:
# Import the Spark library and start a session
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('Hyundai Project').getOrCreate()

In [3]:
# Load the dataset
df = spark.read.csv("cruise_ship_info.csv" , inferSchema=True , header=True)

In [4]:
# Take a quick look at the dataset
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [5]:
# To begin with, I want to leave the ship name out of the df; preview the rows first
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [12]:
# Index the cruise line column. Import StringIndexer
from pyspark.ml.feature import StringIndexer
import numpy as np


In [15]:
# Create the indexer, fit it on the df, then transform the df
indexer = StringIndexer(inputCol='Cruise_line' , outputCol='Cruise_line_indexed')
indexed = indexer.fit(df).transform(df)
indexed.show(truncate=False)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|Ship_name  |Cruise_line|Age|Tonnage           |passengers|length|cabins|passenger_density|crew|Cruise_line_indexed|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|Journey    |Azamara    |6  |30.276999999999997|6.94      |5.94  |3.55  |42.64            |3.55|16.0               |
|Quest      |Azamara    |6  |30.276999999999997|6.94      |5.94  |3.55  |42.64            |3.55|16.0               |
|Celebration|Carnival   |26 |47.262            |14.86     |7.22  |7.43  |31.8             |6.7 |1.0                |
|Conquest   |Carnival   |11 |110.0             |29.74     |9.53  |14.88 |36.99            |19.1|1.0                |
|Destiny    |Carnival   |17 |101.353           |26.42     |8.92  |13.21 |38.36            |10.0|1.0                |
|Ecstasy    |Carnival   |22 |70.367            |20.52     |8.55 
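
The values in `Cruise_line_indexed` are not alphabetical: by default `StringIndexer` orders labels by descending frequency, so the most common cruise line receives 0.0. The block below is a minimal pure-Python sketch of that ordering rule, not Spark's implementation. `index_labels` is a hypothetical helper, the alphabetical tie-break is an assumption about recent Spark versions, and the toy list holds only the eight rows visible in the `df.show()` output, so its indices differ from those of the full 158-ship dataset.

```python
from collections import Counter


def index_labels(labels):
    """Map each distinct label to a float index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count; break ties alphabetically (assumption).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Toy sample: cruise lines of the eight rows shown in the df.show() output.
cruise_lines = ["Azamara", "Azamara"] + ["Carnival"] * 6
mapping = index_labels(cruise_lines)
print(mapping)  # {'Carnival': 0.0, 'Azamara': 1.0}
```

Because these indices encode only frequency rank, passing them to a linear model as a single numeric feature treats the cruise lines as if they were ordered. One-hot encoding the index is a common alternative worth trying.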

In [16]:
# To feed an ML model we need all the features in a single vector
# That is what VectorAssembler is for
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler



In [17]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_indexed']

In [18]:
assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins',
               'passenger_density', 'Cruise_line_indexed'],
    outputCol='features',
)

assembled = assembler.transform(indexed)
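
To make the assembling step concrete, here is a small pure-Python sketch of what it does to one row: it gathers the chosen columns, in the order given by `inputCols`, into a single numeric vector. `assemble` is a hypothetical helper used only for illustration; the row values are those of the first ship ("Journey") in the `indexed.show()` output.

```python
# Feature order mirrors the inputCols list passed to VectorAssembler.
feature_cols = ["Age", "Tonnage", "passengers", "length", "cabins",
                "passenger_density", "Cruise_line_indexed"]

# First row of indexed.show() (the ship "Journey"), written as a plain dict.
row = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_line_indexed": 16.0,
}


def assemble(row, cols):
    """Collect the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in cols]


features = assemble(row, feature_cols)
print(features)
```

The ship name, the raw cruise line string, and the label `crew` are deliberately left out of the vector.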

In [22]:
final_data =  assembled.select('features' , 'crew')

In [23]:
# final_data is the df we want to feed to the model.
# To evaluate the model afterwards, we split the dataset in two
train_data , test_data = final_data.randomSplit([0.7,0.3] , seed=0)
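
Note that `randomSplit([0.7, 0.3])` does not guarantee an exact 70/30 split; the weights act as per-row probabilities, so the sizes are only approximately proportional. The sketch below illustrates the idea in plain Python under that assumption. `random_split` is a hypothetical helper, and because Python's generator differs from Spark's, the rows it selects will not match those of the real split even with the same seed.

```python
import random


def random_split(rows, train_weight, seed):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # One uniform draw per row decides which side it lands on.
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(158))  # stand-ins for the 158 ships
train, test = random_split(rows, 0.7, seed=0)
print(len(train), len(test))  # sizes sum to 158 but are only roughly 70/30
```

With only 158 rows, the test set holds a few dozen ships, so the evaluation metrics can shift noticeably from one seed to another.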


In [28]:
# Now we build the model. Import the model class
from pyspark.ml.regression import LinearRegression
# Create the linear regression object and set the label column to crew
lr = LinearRegression(labelCol='crew')

In [31]:
# Now fit the model on the training data and keep the fitted model
lr_model = lr.fit(train_data)

In [33]:
lr_model.coefficients

DenseVector([-0.014, 0.002, -0.181, 0.2809, 1.0317, 0.0054, 0.0392])
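
The coefficient vector follows the order of the assembler's `inputCols`, so it is easier to read once each value is paired with its feature name. The sketch below does that in plain Python using the rounded values printed above; the variable names are chosen here for illustration.

```python
# Same order as the inputCols passed to VectorAssembler.
feature_cols = ["Age", "Tonnage", "passengers", "length", "cabins",
                "passenger_density", "Cruise_line_indexed"]

# Rounded coefficients copied from the lr_model.coefficients output.
coefficients = [-0.014, 0.002, -0.181, 0.2809, 1.0317, 0.0054, 0.0392]

by_feature = dict(zip(feature_cols, coefficients))

# List the features from largest to smallest absolute coefficient.
for name, coef in sorted(by_feature.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{name:>20}: {coef:+.4f}")
```

`cabins` has the largest coefficient by far, but the features are on different scales (tonnage in thousands of tons, passengers in hundreds), so raw magnitudes are not directly comparable without standardizing.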

In [35]:
test_result = lr_model.evaluate(test_data)

In [41]:
print('The R2 is {}'.format(test_result.r2))
print('The MSE is {}'.format(test_result.meanSquaredError))
print("RMSE: {}".format(test_result.rootMeanSquaredError))


The R2 is 0.9610363931946619
The MSE is 0.6354155506121196
RMSE: 0.7971295695256322
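
For reference, here is a minimal sketch of how these three metrics are defined, written in plain Python rather than through Spark's summary object. `regression_metrics` is a hypothetical helper and the sample values are made up for illustration; only `reported_mse` and `reported_rmse` come from the output above.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE and R2 for paired true and predicted values."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total sum of squares
    mse = ss_res / n
    return {"mse": mse, "rmse": math.sqrt(mse), "r2": 1 - ss_res / ss_tot}


# Made-up crew values (in 100s) and predictions, for illustration only.
y_true = [3.55, 6.7, 19.1, 10.0, 9.2]
y_pred = [4.0, 6.5, 18.0, 10.5, 9.0]
metrics = regression_metrics(y_true, y_pred)
print(metrics)

# Consistency check on the reported values: RMSE is the square root of MSE.
reported_mse = 0.6354155506121196
reported_rmse = 0.7971295695256322
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

Since `crew` is recorded in hundreds, an RMSE of about 0.80 corresponds to a typical error of roughly 80 crew members per ship.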


In [45]:
# We can check the correlation between each variable and crew
from pyspark.sql.functions import corr
df.select(corr('Age' , 'crew')).show()
df.select(corr('Tonnage' , 'crew')).show()
df.select(corr('passengers' , 'crew')).show()
df.select(corr('length' , 'crew')).show()
df.select(corr('cabins', 'crew')).show()

+-------------------+
|    corr(Age, crew)|
+-------------------+
|-0.5306565039638852|
+-------------------+

+-------------------+
|corr(Tonnage, crew)|
+-------------------+
| 0.9275688115449388|
+-------------------+

+----------------------+
|corr(passengers, crew)|
+----------------------+
|    0.9152341306065384|
+----------------------+

+------------------+
|corr(length, crew)|
+------------------+
| 0.895856627101658|
+------------------+

+------------------+
|corr(cabins, crew)|
+------------------+
|0.9508226063578497|
+------------------+
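
These correlations go a long way toward explaining the high R2: tonnage, passengers, length and cabins each correlate with crew at roughly 0.9 or above, so even a single one of them carries most of the signal. As a reminder of what `corr` computes, here is a plain-Python sketch of the Pearson coefficient. `pearson` is a hypothetical helper, and the sample uses only the first seven rows shown in the `df.show()` output, so its result will differ from the full-dataset value of 0.95.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# cabins and crew (both in 100s) for the first seven ships shown earlier.
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]

r = pearson(cabins, crew)
print(f"corr(cabins, crew) on the sample: {r:.3f}")
```

The predictors are also strongly related to one another (bigger ships have more cabins, passengers and tonnage), which is worth keeping in mind when reading individual coefficients.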



In [44]:
df.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew']