# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and **use it to predict how many crew members the ships will need.**

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved for you in a CSV file called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that acceptable crew counts differ between cruise lines, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

In [1]:
import findspark, pyspark
findspark.find()

'C:\\bigdata\\spark-2.4.5-bin-hadoop2.7'

In [2]:
from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession
# Local Spark session with two worker threads
conf = SparkConf().setAppName('appName').setMaster('local[2]')
sc = SparkContext(conf=conf)
spark = SparkSession(sc)

In [70]:
df = spark.read.csv('./cruise_ship_info.csv',header=True)

In [71]:
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [72]:
import pyspark.sql.types as typ
labels = [
    ('Ship_name', typ.StringType()),
    ('Cruise_line', typ.StringType()),
    ('Age', typ.IntegerType()),
    ('Tonnage', typ.DoubleType()),
    ('passengers', typ.DoubleType()),
    ('length', typ.DoubleType()),
    ('cabins', typ.DoubleType()),
    ('passenger_density', typ.DoubleType()),
    ('crew', typ.DoubleType())
]
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])

In [73]:
data = spark.read.csv('./cruise_ship_info.csv',header=True,schema=schema)

In [74]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [75]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [76]:
data.describe().toPandas()

| | summary | Ship_name | Cruise_line | Age | Tonnage | passengers | length | cabins | passenger_density | crew |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 158 | 158 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| 1 | mean | Infinity | | 15.689873417721518 | 71.28467088607599 | 18.45740506329114 | 8.130632911392404 | 8.830000000000005 | 39.90094936708861 | 7.794177215189873 |
| 2 | stddev | | | 7.615691058751413 | 37.229540025907866 | 9.677094775143416 | 1.793473548054825 | 4.4714172221480615 | 8.63921711391542 | 3.503486564627034 |
| 3 | min | Adventure | Azamara | 4.0 | 2.329 | 0.66 | 2.79 | 0.33 | 17.7 | 0.59 |
| 4 | max | Zuiderdam | Windstar | 48.0 | 220.0 | 54.0 | 11.82 | 27.0 | 71.43 | 21.0 |


from pyspark.ml.feature import StringIndexer, OneHotEncoder
from pyspark.ml.feature import VectorAssembler
# Index Cruise_line, one-hot encode the index, then assemble it into a feature vector
varIdxer = StringIndexer(inputCol='Cruise_line', outputCol='change_Cruise_line').fit(df)
df = varIdxer.transform(df)
# In Spark 2.4 OneHotEncoder is a plain transformer; in Spark 3.x it must be fitted first
df = OneHotEncoder(inputCol="change_Cruise_line", outputCol="change_Cruise_line_f").transform(df)
assembler = VectorAssembler(inputCols=["change_Cruise_line_f"], outputCol="features")
df = assembler.transform(df)

In [77]:
#df = data.select(['Age','Tonnage','passengers','length','cabins','passenger_density','crew'])
len(data.columns)

9

In [78]:
# Pairwise Pearson correlation between all non-string columns
numeric_cols = [name for name, dtype in data.dtypes if dtype != 'string']
for i, col_i in enumerate(numeric_cols):
    for col_j in numeric_cols[i + 1:]:
        print("Correlation to ", col_i, " for ", col_j, data.stat.corr(col_i, col_j))

Correlation to  Age  for  Tonnage -0.6066460870567227
Correlation to  Age  for  passengers -0.5155422760201276
Correlation to  Age  for  length -0.5322858870729916
Correlation to  Age  for  cabins -0.5100190265901992
Correlation to  Age  for  passenger_density -0.2788302011720371
Correlation to  Age  for  crew -0.5306565039638852
Correlation to  Tonnage  for  passengers 0.9450614045989862
Correlation to  Tonnage  for  length 0.9223683220426181
Correlation to  Tonnage  for  cabins 0.9487635739004593
Correlation to  Tonnage  for  passenger_density -0.040846239040556634
Correlation to  Tonnage  for  crew 0.9275688115449388
Correlation to  passengers  for  length 0.8835347869399605
Correlation to  passengers  for  cabins 0.9763413679845944
Correlation to  passengers  for  passenger_density -0.29486708165841
Correlation to  passengers  for  crew 0.9152341306065384
Correlation to  length  for  cabins 0.889798209935357
Correlation to  length  for  passenger_density -0.0904884688873216
Correla
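
`data.stat.corr` computes the Pearson correlation coefficient. As a rough illustration of what that number measures, here is a plain-Python sketch applied to the Tonnage and crew values of the seven complete rows visible in the `data.show()` output earlier in the notebook. The helper name `pearson` is hypothetical, and because only seven of the 158 ships are used, the result will not match the full-data values printed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Tonnage and crew for the seven complete rows visible in data.show()
tonnage = [30.277, 30.277, 47.262, 110.0, 101.353, 70.367, 70.367]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]

print(pearson(tonnage, crew))
```

The size-related columns (Tonnage, passengers, length, cabins) are all strongly correlated with each other as well as with crew, which is worth keeping in mind when interpreting individual coefficients later on.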

In [79]:
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Ship_name", outputCol="Ship_name_indexed")
indexer_model = indexer.fit(data)
indexed_data= indexer_model.transform(data)
# to view the data
indexed_data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Ship_name_indexed|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             32.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             46.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|            134.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|             78.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|             36.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

In [80]:
indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_indexed")
indexer_model = indexer.fit(indexed_data)
indexed_data= indexer_model.transform(indexed_data)
# to view the data
indexed_data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+-------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Ship_name_indexed|Cruise_line_indexed|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+-------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             32.0|               16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             46.0|               16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|            134.0|                1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|             78.0|                1.0|
|    Destiny|   Carnival| 17|           101.353|     26
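
By default `StringIndexer` assigns indices by descending label frequency, so `Cruise_line_indexed` is an arbitrary code rather than a quantity (Carnival is 1.0 and Azamara is 16.0 only because of how often each line appears). Feeding such a code into a linear model as a single numeric column implies an ordering that does not exist, which is why a one-hot encoding is usually preferred for categorical features. The plain-Python sketch below illustrates both steps. The helper names `frequency_index` and `one_hot` are hypothetical, the alphabetical tie-break is an assumption of this sketch, and the sample labels are illustrative rather than the real distribution in the data.

```python
from collections import Counter

def frequency_index(labels):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a 0/1 list of length `size` with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

# Illustrative labels only, not the real distribution in cruise_ship_info.csv
lines = ["Carnival", "Azamara", "Carnival", "Costa", "Carnival", "Costa"]
mapping = frequency_index(lines)
encoded = [one_hot(mapping[line], len(mapping)) for line in lines]

print(mapping)     # {'Carnival': 0, 'Costa': 1, 'Azamara': 2}
print(encoded[1])  # [0.0, 0.0, 1.0]
```

Spark's `OneHotEncoder` additionally drops the last category by default (`dropLast=True`) so that the encoded columns are not linearly dependent; this sketch keeps every category for readability.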

In [103]:
indexed_data.describe().toPandas()

| | summary | Ship_name | Cruise_line | Age | Tonnage | passengers | length | cabins | passenger_density | crew | Ship_name_indexed | Cruise_line_indexed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | count | 158 | 158 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 | 158.0 |
| 1 | mean | Infinity | | 15.689873417721518 | 71.28467088607599 | 18.45740506329114 | 8.130632911392404 | 8.830000000000005 | 39.90094936708861 | 7.794177215189873 | 60.607594936708864 | 5.063291139240507 |
| 2 | stddev | | | 7.615691058751413 | 37.229540025907866 | 9.677094775143416 | 1.793473548054825 | 4.4714172221480615 | 8.63921711391542 | 3.503486564627034 | 42.78457895932363 | 4.758744608182735 |
| 3 | min | Adventure | Azamara | 4.0 | 2.329 | 0.66 | 2.79 | 0.33 | 17.7 | 0.59 | 0.0 | 0.0 |
| 4 | max | Zuiderdam | Windstar | 48.0 | 220.0 | 54.0 | 11.82 | 27.0 | 71.43 | 21.0 | 137.0 | 19.0 |


In [99]:
indexed_data.groupby('Cruise_line').min().toPandas()

| | Cruise_line | min(Age) | min(Tonnage) | min(passengers) | min(length) | min(cabins) | min(passenger_density) | min(crew) | min(Ship_name_indexed) | min(Cruise_line_indexed) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Costa | 6 | 25.0 | 7.76 | 6.16 | 3.86 | 29.47 | 3.85 | 17.0 | 5.0 |
| 1 | P&O | 5 | 45.0 | 11.78 | 7.54 | 5.3 | 32.18 | 5.2 | 28.0 | 9.0 |
| 2 | Cunard | 6 | 70.327 | 17.91 | 9.63 | 9.5 | 39.27 | 9.0 | 35.0 | 15.0 |
| 3 | Regent_Seven_Seas | 10 | 12.5 | 3.2 | 4.36 | 0.88 | 31.73 | 1.46 | 10.0 | 10.0 |
| 4 | MSC | 5 | 16.852 | 9.52 | 5.41 | 3.83 | 17.7 | 2.97 | 11.0 | 7.0 |
| 5 | Carnival | 6 | 46.052 | 14.52 | 7.22 | 7.26 | 29.79 | 6.6 | 0.0 | 1.0 |
| 6 | Crystal | 10 | 51.004 | 9.4 | 7.81 | 4.8 | 54.26 | 5.45 | 20.0 | 18.0 |
| 7 | Orient | 48 | 22.08 | 8.26 | 5.78 | 4.25 | 26.73 | 3.5 | 106.0 | 19.0 |
| 8 | Princess | 6 | 30.277 | 6.86 | 5.93 | 3.44 | 29.88 | 3.73 | 2.0 | 2.0 |
| 9 | Silversea | 12 | 16.8 | 2.96 | 5.14 | 1.48 | 56.76 | 1.97 | 4.0 | 11.0 |


In [100]:
indexed_data.groupby('Cruise_line').max().toPandas()

| | Cruise_line | max(Age) | max(Tonnage) | max(passengers) | max(length) | max(cabins) | max(passenger_density) | max(crew) | max(Ship_name_indexed) | max(Cruise_line_indexed) |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Costa | 27 | 112.0 | 38.0 | 9.6 | 15.0 | 40.68 | 10.9 | 123.0 | 5.0 |
| 1 | P&O | 29 | 115.0 | 35.74 | 9.35 | 15.32 | 43.19 | 12.2 | 128.0 | 9.0 |
| 2 | Cunard | 44 | 151.4 | 26.2 | 11.32 | 11.34 | 57.79 | 12.53 | 124.0 | 15.0 |
| 3 | Regent_Seven_Seas | 27 | 50.0 | 7.0 | 7.09 | 3.54 | 71.43 | 4.47 | 75.0 | 10.0 |
| 4 | MSC | 36 | 133.5 | 39.59 | 10.93 | 16.37 | 37.71 | 13.13 | 77.0 | 7.0 |
| 5 | Carnival | 28 | 110.239 | 37.0 | 9.63 | 14.88 | 41.67 | 19.1 | 134.0 | 1.0 |
| 6 | Crystal | 18 | 68.0 | 10.8 | 7.9 | 5.5 | 62.96 | 6.36 | 88.0 | 18.0 |
| 7 | Orient | 48 | 22.08 | 8.26 | 5.78 | 4.25 | 26.73 | 3.5 | 106.0 | 19.0 |
| 8 | Princess | 29 | 116.0 | 37.82 | 9.64 | 15.57 | 46.42 | 12.38 | 133.0 | 2.0 |
| 9 | Silversea | 19 | 25.0 | 3.88 | 5.97 | 1.94 | 65.45 | 2.95 | 127.0 | 11.0 |


In [98]:
indexed_data.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Ship_name_indexed',
 'Cruise_line_indexed']

In [82]:
from pyspark.ml.feature import VectorAssembler
# Combine the numeric columns and the indexed cruise line into one feature vector
vectorAssembler = VectorAssembler(inputCols=[
    'Age',
    'Tonnage',
    'passengers',
    'length',
    'cabins',
    'passenger_density',
    'Cruise_line_indexed',
], outputCol='features')
vhouse_df = vectorAssembler.transform(indexed_data)
vhouse_df = vhouse_df.select(['features','crew'])
vhouse_df.show(3)

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
+--------------------+----+
only showing top 3 rows



In [83]:
data_train, data_test= vhouse_df.randomSplit([0.7, 0.3], seed=333)

In [84]:
data_train.show()

+--------------------+-----+
|            features| crew|
+--------------------+-----+
|[4.0,220.0,54.0,1...| 21.0|
|[5.0,86.0,21.04,9...|  8.0|
|[5.0,115.0,35.74,...| 12.2|
|[5.0,122.0,28.5,1...|  6.7|
|[5.0,133.5,39.59,...|13.13|
|[5.0,160.0,36.34,...| 13.6|
|[6.0,30.276999999...| 3.55|
|[6.0,30.276999999...| 3.55|
|[6.0,110.23899999...| 11.5|
|[6.0,112.0,38.0,9...| 10.9|
|[6.0,113.0,37.82,...| 12.0|
|[6.0,158.0,43.7,1...| 13.6|
|[7.0,89.6,25.5,9....| 9.87|
|[7.0,116.0,31.0,9...| 12.0|
|[8.0,91.0,22.44,9...| 11.0|
|[8.0,110.0,29.74,...| 11.6|
|[9.0,59.058,17.0,...|  7.4|
|[9.0,81.0,21.44,9...| 10.0|
|[9.0,85.0,19.68,9...| 8.69|
|[9.0,88.5,21.24,9...| 10.3|
+--------------------+-----+
only showing top 20 rows
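
`randomSplit` does not cut the data at exact 70/30 boundaries; it samples rows at random, so the weights are expected proportions rather than guaranteed counts. A minimal plain-Python sketch of that idea follows, assuming one independent uniform draw per row. The function name `random_split` is hypothetical, and Python's `random` module will not reproduce the rows Spark selects for `seed=333`.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row to a split by one uniform draw against cumulative weights."""
    total = sum(weights)
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split also catches any floating-point shortfall in the bounds
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

# 158 placeholder row ids standing in for the 158 ships
train, test = random_split(list(range(158)), [0.7, 0.3], seed=333)
print(len(train), len(test))  # sizes vary with the seed but always sum to 158
```

Fixing the seed makes the split repeatable, which is what allows the metrics further down to be compared across runs.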



In [85]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='crew', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(data_train)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

Coefficients: [0.0,0.013133109205978953,0.0009731670854856729,0.2280492031841632,0.48298102206648047,0.0,0.0]
Intercept: 0.7111890857249724
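
With `elasticNetParam=0.8` the penalty is mostly L1, which is consistent with three coefficients (Age, passenger_density and Cruise_line_indexed) being shrunk to exactly 0.0. A fitted linear model predicts `intercept + coefficients · features`; the sketch below applies the printed coefficients by hand to the first row of the data (Journey, Azamara). The function names `predict` and `elastic_net_penalty` are hypothetical, and the penalty follows the general elastic-net form `regParam * (alpha * L1 + (1 - alpha) / 2 * L2²)`.

```python
# Coefficients and intercept copied from the output above
coefficients = [0.0, 0.013133109205978953, 0.0009731670854856729,
                0.2280492031841632, 0.48298102206648047, 0.0, 0.0]
intercept = 0.7111890857249724

def predict(features):
    """Linear prediction: intercept plus the dot product with the coefficients."""
    return intercept + sum(c * x for c, x in zip(coefficients, features))

def elastic_net_penalty(coefs, reg_param, alpha):
    """Elastic-net penalty: reg_param * (alpha * L1 + (1 - alpha) / 2 * squared L2)."""
    l1 = sum(abs(c) for c in coefs)
    l2_squared = sum(c * c for c in coefs)
    return reg_param * (alpha * l1 + (1 - alpha) / 2 * l2_squared)

# Age, Tonnage, passengers, length, cabins, passenger_density, Cruise_line_indexed
journey = [6.0, 30.277, 6.94, 5.94, 3.55, 42.64, 16.0]
print(round(predict(journey), 2))  # about 4.18, versus an actual crew value of 3.55
print(elastic_net_penalty(coefficients, reg_param=0.3, alpha=0.8))
```

Spark standardizes features before applying the penalty by default, so the penalty value printed here is illustrative rather than the exact term Spark minimized.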


In [86]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 1.125229
r2: 0.903228
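
These two numbers come from `lr_model.summary`, so they describe the fit on `data_train` rather than performance on unseen ships. For reference, RMSE and R² can be computed from labels and predictions as in the plain-Python sketch below; the function names and the toy values are hypothetical and not taken from the dataset.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared) / len(squared))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Toy values for illustration only
labels = [3.0, 5.0, 7.0, 9.0]
predictions = [2.5, 5.5, 7.0, 9.0]
print(round(rmse(labels, predictions), 4))  # 0.3536
print(round(r2(labels, predictions), 3))    # 0.975
```

RMSE is in the same units as the label, so an RMSE of about 1.13 here corresponds to roughly 113 crew members, since crew is recorded in hundreds.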


In [104]:
print("adj r2: %f" % trainingSummary.r2adj)

adj r2: 0.896777
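
Adjusted R² penalizes R² for the number of features: `1 - (1 - R²) * (n - 1) / (n - p - 1)` for a model with an intercept, `n` rows and `p` features. The notebook does not print the training-set size, so `n = 113` below is an assumption, chosen because it reproduces the reported value with the `p = 7` assembled features. The function name `adjusted_r2` is hypothetical.

```python
def adjusted_r2(r2, n_rows, n_features):
    """Adjusted R² for a linear model fitted with an intercept."""
    return 1.0 - (1.0 - r2) * (n_rows - 1) / (n_rows - n_features - 1)

# r2 from the training summary; n_rows=113 is an assumed training-set size
print(round(adjusted_r2(0.903228, n_rows=113, n_features=7), 6))  # 0.896777
```

The gap between R² and adjusted R² is small here because the number of rows is large relative to the seven features.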
