# Assignment 3

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea, to help them accurately estimate how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
The data is saved in a CSV file named "cruise_ship_info.csv". Your job is to create a regression model that predicts how many crew members future ships will need. Before you do so, apply the following feature generation techniques:

**(a)** Use StringIndexer to convert the *Cruise_line* categorical variable into a numeric variable

**(b)** Use VectorAssembler to create a feature vector from the following variables: 'Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'cruise_cat' (the indexed cruise line column from step (a), named `Cruise_line_Indexed` in the code below). Name the output column "features". Your final output will look something like the following: **output.select("features", "crew").show()**
        
**(c)** Use a model-based approach to determine which features should/shouldn't be selected

# Loading Data

In [1]:
# Loading PySpark

import findspark
findspark.init()   # make the local Spark installation importable
import pyspark
findspark.find()   # report the Spark home directory in use


'C:\\Spark\\spark-3.1.2-bin-hadoop3.2'

In [2]:
# Loading Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import SparkSession
# approx_count_distinct replaces the deprecated approxCountDistinct alias
from pyspark.sql.functions import approx_count_distinct, countDistinct, count, when, isnan, col, isnull
from pyspark.ml.feature import StringIndexer, VectorAssembler, PCA
from pyspark.ml.linalg import Vectors
from pyspark.ml.functions import vector_to_array
from pyspark.ml.classification import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score, cross_val_predict
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Create (or reuse) the Spark session once all imports are in place
spark = SparkSession.builder.appName("PySpark_Basics").getOrCreate()


In [3]:
#Loading Data

file_location = "cruise_ship_info.csv"
file_type = "csv"
infer_schema = "true"
first_row_is_header = "true"
delimiter = ","

df = spark.read.format(file_type)\
.option("InferSchema", infer_schema)\
.option("header", first_row_is_header)\
.option("sep", delimiter)\
.load(file_location)

df.printSchema()

# na.drop returns a new DataFrame, so keep the result before displaying it
df = df.na.drop(subset=["crew"])
df.show()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|  

# Feature Engineering

In [4]:
# Converting String to Numerical Values

indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_Indexed")
indexed = indexer.fit(df).transform(df)


indexed.show(5)

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_Indexed|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|                1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|                1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|                1.0|
+-----------+-----------+---+------------------+----------+-----
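The indices above (Carnival at 1.0, Azamara at 16.0) follow from StringIndexer's default ordering, `frequencyDesc`: the most frequent label gets index 0.0, the next most frequent 1.0, and so on. The following is a minimal plain-Python sketch of that idea, not Spark's implementation. The helper name `frequency_desc_index`, the toy column, and the alphabetical tie-break are assumptions made for illustration.

```python
from collections import Counter


def frequency_desc_index(values):
    """Map each distinct label to a float index, most frequent label first.

    Ties are broken alphabetically; that tie-break is an assumption of this sketch.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical toy column, not the real cruise_ship_info.csv data.
cruise_lines = ["Carnival", "Azamara", "Royal_Caribbean", "Carnival",
                "Royal_Caribbean", "Royal_Caribbean"]

mapping = frequency_desc_index(cruise_lines)
indexed_values = [mapping[line] for line in cruise_lines]

print(mapping)
print(indexed_values)
```

Because the codes only reflect how often each cruise line appears, a linear model would read them as ordered magnitudes. That is worth keeping in mind when the indexed column is passed to a regression.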

In [5]:
# Assembler

assembler = VectorAssembler(
    inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'Cruise_line_Indexed'],
    outputCol="features")

output = assembler.transform(indexed)
output.select("features", "crew").show(truncate=False)
#output = output.select("features", "crew")

+--------------------------------------------------+----+
|features                                          |crew|
+--------------------------------------------------+----+
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|3.55|
|[6.0,30.276999999999997,6.94,5.94,3.55,42.64,16.0]|3.55|
|[26.0,47.262,14.86,7.22,7.43,31.8,1.0]            |6.7 |
|[11.0,110.0,29.74,9.53,14.88,36.99,1.0]           |19.1|
|[17.0,101.353,26.42,8.92,13.21,38.36,1.0]         |10.0|
|[22.0,70.367,20.52,8.55,10.2,34.29,1.0]           |9.2 |
|[15.0,70.367,20.52,8.55,10.2,34.29,1.0]           |9.2 |
|[23.0,70.367,20.56,8.55,10.22,34.23,1.0]          |9.2 |
|[19.0,70.367,20.52,8.55,10.2,34.29,1.0]           |9.2 |
|[6.0,110.23899999999999,37.0,9.51,14.87,29.79,1.0]|11.5|
|[10.0,110.0,29.74,9.51,14.87,36.99,1.0]           |11.6|
|[28.0,46.052,14.52,7.27,7.26,31.72,1.0]           |6.6 |
|[18.0,70.367,20.52,8.55,10.2,34.29,1.0]           |9.2 |
|[17.0,70.367,20.52,8.55,10.2,34.29,1.0]           |9.2 |
|[11.0,86.0,21

In [6]:
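from pyspark.ml.classification import DecisionTreeClassifier
import pyspark.sql.functions as func


# Read feature names from the metadata that VectorAssembler attaches to "features".
# Note: features_df is overwritten on every iteration, so only the last
# attribute group (here the nominal Cruise_line_Indexed) is kept.
for k, v in output.schema["features"].metadata["ml_attr"]["attrs"].items():
    features_df = pd.DataFrame(v)

print(features_df)

# DecisionTreeClassifier needs discrete labels, so round crew to whole numbers.
# This overwrites "crew", so later cells also see the rounded values.
output = output.withColumn("crew", func.round(func.col("crew")).cast('integer'))


dt = DecisionTreeClassifier(featuresCol="features", labelCol='crew')
dt_model = dt.fit(output)

# featureImportances is typically a sparse vector; indices missing from it have importance 0
dt_output = dt_model.featureImportances
features_df['Decision_Tree'] = features_df['idx'].apply(lambda x: dt_output[x] if x in dt_output.indices  else 0)

features_df.sort_values("Decision_Tree", ascending=False, inplace=True)

features_df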

                                                vals  idx                 name
0  [Royal_Caribbean, Carnival, Princess, Holland_...    6  Cruise_line_Indexed


                                                vals  idx                 name  Decision_Tree
0  [Royal_Caribbean, Carnival, Princess, Holland_...    6  Cruise_line_Indexed       0.141808
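Only one feature appears in the table above because the loop keeps just the last attribute group from the metadata. A minimal sketch of collecting every group before ranking might look like the following. The `attrs` dictionary only mimics the shape of `output.schema["features"].metadata["ml_attr"]["attrs"]`, the `importances` values are made-up placeholders rather than model output, and `flatten_ml_attrs` is a hypothetical helper.

```python
def flatten_ml_attrs(attrs):
    """Collect attributes from all groups (e.g. 'numeric', 'nominal'), sorted by vector index."""
    rows = [attr for group in attrs.values() for attr in group]
    return sorted(rows, key=lambda attr: attr["idx"])


# Hypothetical metadata shaped like the VectorAssembler "ml_attr" attrs.
attrs = {
    "numeric": [
        {"idx": 0, "name": "Age"},
        {"idx": 1, "name": "Tonnage"},
        {"idx": 2, "name": "passengers"},
        {"idx": 3, "name": "length"},
        {"idx": 4, "name": "cabins"},
        {"idx": 5, "name": "passenger_density"},
    ],
    "nominal": [
        {"idx": 6, "name": "Cruise_line_Indexed",
         "vals": ["Royal_Caribbean", "Carnival"]},
    ],
}

# Placeholder importances, one per vector position; NOT results from the fitted tree.
importances = [0.05, 0.30, 0.10, 0.05, 0.35, 0.05, 0.10]

all_features = flatten_ml_attrs(attrs)

# Pair each feature name with its importance and rank from most to least important.
ranked = sorted(
    ((attr["name"], importances[attr["idx"]]) for attr in all_features),
    key=lambda pair: pair[1],
    reverse=True,
)

for name, score in ranked:
    print(f"{name:>20}: {score:.2f}")
```

With a fitted model, the placeholder list could presumably be replaced by `dt_model.featureImportances.toArray()`, which returns one value per feature position.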


In [7]:
# Model-based feature extraction

pca = PCA(k=3, inputCol="features", outputCol="PCAfeatures")
model = pca.fit(output)

result = model.transform(output).select("PCAfeatures","crew")
result.show(truncate=False)


+---------------------------------------------------------+----+
|PCAfeatures                                              |crew|
+---------------------------------------------------------+----+
|[28.90696689209417,37.662466754564434,24.200940329862544]|4   |
|[28.90696689209417,37.662466754564434,24.200940329862544]|4   |
|[46.16719592570902,16.643252999294056,39.664719487674994]|7   |
|[112.2379125870115,25.402813490740897,35.41568793937]    |19  |
|[102.24963951777274,24.983148264782145,40.45721128374385]|10  |
|[70.37482833639419,19.7773431120852,39.67143902891689]   |9   |
|[71.21630037265649,22.283002603021608,33.20319746753755] |9   |
|[70.26702247257334,19.355294765619043,40.57129392581935] |9   |
|[70.73545920907803,20.851197179629374,36.899335502611464]|9   |
|[114.87257321266344,19.1422815488352,27.69404085261306]  |12  |
|[112.35617864186317,25.761901533596003,34.49173927942568]|12  |
|[44.67562241141424,15.894295631606047,41.34157153832137] |7   |
|[70.85566949997265,21.20
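Spark's PCA does not standardize its inputs, so a feature measured on a large scale can dominate the leading component. The first PCA value for each ship above sits close to its Tonnage, which hints at that effect. The sketch below checks the per-feature variance on the first five feature rows printed earlier, using only the standard library. It is a rough diagnostic on five rows, not a substitute for inspecting the fitted model.

```python
from statistics import pvariance

feature_names = ["Age", "Tonnage", "passengers", "length", "cabins",
                 "passenger_density", "Cruise_line_Indexed"]

# First five rows of the "features" column shown in the VectorAssembler output.
rows = [
    [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0],
    [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0],
    [26.0, 47.262, 14.86, 7.22, 7.43, 31.8, 1.0],
    [11.0, 110.0, 29.74, 9.53, 14.88, 36.99, 1.0],
    [17.0, 101.353, 26.42, 8.92, 13.21, 38.36, 1.0],
]

# Transpose rows into columns and compute the population variance of each feature.
columns = list(zip(*rows))
variances = {name: pvariance(column) for name, column in zip(feature_names, columns)}

# Share of the total variance contributed by each unscaled feature.
total = sum(variances.values())
shares = {name: variance / total for name, variance in variances.items()}

for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:>20}: {share:.2f}")
```

If one feature accounts for most of the variance, scaling the inputs first (for example with `StandardScaler` from `pyspark.ml.feature`) is one option to consider before fitting PCA.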

# Formatting Data

In [8]:
# Splitting Vectorized Column into Columns

final = result.withColumn("PCAfeatures", vector_to_array("PCAfeatures")).select(["crew"] + [col("PCAfeatures")[i] for i in range(3)])
final.show(5)


+----+------------------+------------------+------------------+
|crew|    PCAfeatures[0]|    PCAfeatures[1]|    PCAfeatures[2]|
+----+------------------+------------------+------------------+
|   4| 28.90696689209417|37.662466754564434|24.200940329862544|
|   4| 28.90696689209417|37.662466754564434|24.200940329862544|
|   7| 46.16719592570902|16.643252999294056|39.664719487674994|
|  19| 112.2379125870115|25.402813490740897|    35.41568793937|
|  10|102.24963951777274|24.983148264782145| 40.45721128374385|
+----+------------------+------------------+------------------+
only showing top 5 rows



In [9]:
# Converting to Pandas

df_p = final.toPandas()
df_p.head(5)


   crew  PCAfeatures[0]  PCAfeatures[1]  PCAfeatures[2]
0     4       28.906967       37.662467        24.20094
1     4       28.906967       37.662467        24.20094
2     7       46.167196       16.643253       39.664719
3    19      112.237913       25.402813       35.415688
4    10       102.24964       24.983148       40.457211


# Multiple Linear Regression Analysis

In [10]:
# Defining X and Y

features_X = ['PCAfeatures[0]','PCAfeatures[1]','PCAfeatures[2]']
X = df_p[features_X].values

# 'crew' is the target (response) variable, not a predictor
target = ['crew']
Y = df_p[target].values

# Hold out 20% of the ships for testing; random_state fixes the split for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(X,Y, test_size=0.2, random_state=123)



# Fitting and Predicting Multiple Linear Regression Model

reg = LinearRegression()
reg.fit(X_train, Y_train)

pred = reg.predict(X_test)



# Results
# For regressors, score() returns R^2, so the "accuracy" values printed below are R^2 scores

MLR_Training = reg.score(X_train, Y_train)
print('Multiple Linear Regression Training Accuracy: ', MLR_Training)

MLR_Testing = reg.score(X_test, Y_test)
print('Multiple Linear Regression Test Accuracy: ', MLR_Testing)

MLR_r2 = r2_score(Y_test, pred)
print('r2 value: ', MLR_r2)

MLR_MSE = mean_squared_error(Y_test, pred)
print('MSE: ', MLR_MSE)

Multiple Linear Regression Training Accuracy:  0.8783603919150549
Multiple Linear Regression Test Accuracy:  0.899192461530934
r2 value:  0.899192461530934
MSE:  1.0139820763978313
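The test "accuracy" and the r2 value above are identical because `score` on a scikit-learn regressor returns the coefficient of determination, R² = 1 − SS_res / SS_tot. A minimal sketch of both metrics computed from their definitions follows. The true values are the first five rounded crew figures shown earlier; the predictions are hypothetical and are not output from the fitted model.

```python
def mse_manual(y_true, y_pred):
    """Mean squared error: the average squared residual."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_manual(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Rounded crew values from the first five rows (in hundreds of crew members).
y_true = [4.0, 7.0, 19.0, 10.0, 9.0]
# Hypothetical predictions, chosen only to illustrate the arithmetic.
y_pred = [5.0, 7.0, 17.0, 11.0, 9.0]

print("MSE:", mse_manual(y_true, y_pred))
print("R^2:", r2_manual(y_true, y_pred))
```

Since crew is recorded in hundreds, the MSE is in squared hundreds of crew members; taking its square root gives an error on the same scale as the crew column.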


In [11]:
spark.stop()