
# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a CSV file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines differ in acceptable crew counts, so the cruise line is most likely an important feature to include in your analysis!

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

## First install PySpark and download the data file

In [90]:
# Install pyspark
!pip install pyspark



In [91]:
# Download the data file
!wget https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/cruise_ship_info.csv

--2023-10-03 16:16:45--  https://raw.githubusercontent.com/Ricardo-Jaramillo/PySpark/main/datasets/cruise_ship_info.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 8734 (8.5K) [text/plain]
Saving to: ‘cruise_ship_info.csv.1’


2023-10-03 16:16:45 (76.2 MB/s) - ‘cruise_ship_info.csv.1’ saved [8734/8734]



## Create the Spark session

In [92]:
# Import libraries
from pyspark.sql import SparkSession
from pyspark.ml.regression import LinearRegression

In [93]:
# Create a session
spark = SparkSession.builder.appName('lr_project').getOrCreate()

In [94]:
# Read in the file
df = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [95]:
# Print out the schema
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [96]:
# Show the data
df.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [97]:
# Count the number of ships per Cruise_line
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



## Indexing strings into numeric values

In [98]:
# Import the function
from pyspark.ml.feature import StringIndexer

In [99]:
# Create an indexer object, fit it on df, and use it to transform df
indexer = StringIndexer(inputCol="Cruise_line", outputCol="cruise_cat")
indexed = indexer.fit(df).transform(df)

In [100]:
# Show the indexed data
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|cruise_cat|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|       1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|       1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|       1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|       1.0|
|    Elation|   Carnival| 15

In [101]:
# Get each index with its label
indexed.select(['Cruise_line', 'cruise_cat']).distinct().orderBy('cruise_cat').show()

+-----------------+----------+
|      Cruise_line|cruise_cat|
+-----------------+----------+
|  Royal_Caribbean|       0.0|
|         Carnival|       1.0|
|         Princess|       2.0|
| Holland_American|       3.0|
|        Norwegian|       4.0|
|            Costa|       5.0|
|        Celebrity|       6.0|
|              MSC|       7.0|
|              P&O|       8.0|
|             Star|       9.0|
|Regent_Seven_Seas|      10.0|
|        Silversea|      11.0|
|           Cunard|      12.0|
|          Oceania|      13.0|
|         Seabourn|      14.0|
|         Windstar|      15.0|
|          Azamara|      16.0|
|          Crystal|      17.0|
|           Disney|      18.0|
|           Orient|      19.0|
+-----------------+----------+
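The mapping above follows `StringIndexer`'s default ordering (`stringOrderType='frequencyDesc'`): the most frequent label gets index 0.0. The pure-Python sketch below reproduces the table from the ship counts shown earlier. The alphabetical tie-break and the helper name `frequency_desc_index` are assumptions made for illustration, chosen because they match the printed output.

```python
from collections import Counter

# Ship counts per cruise line, copied from the groupBy output above.
line_counts = Counter({
    "Royal_Caribbean": 23, "Carnival": 22, "Princess": 17,
    "Holland_American": 14, "Norwegian": 13, "Costa": 11,
    "Celebrity": 10, "MSC": 8, "P&O": 6, "Star": 6,
    "Regent_Seven_Seas": 5, "Silversea": 4, "Cunard": 3,
    "Oceania": 3, "Seabourn": 3, "Windstar": 3,
    "Azamara": 2, "Crystal": 2, "Disney": 2, "Orient": 1,
})


def frequency_desc_index(counts):
    """Map each label to a float index: most frequent first, ties alphabetical."""
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_index = frequency_desc_index(line_counts)
# prints: 0.0 16.0 19.0
print(cruise_index["Royal_Caribbean"], cruise_index["Azamara"], cruise_index["Orient"])
```

Note that a single numeric index puts the cruise lines on an ordered scale, which a linear model will read as "Orient is 19 times Carnival". One-hot encoding the index (for example with `pyspark.ml.feature.OneHotEncoder`) is a common alternative, though it is not what this notebook does.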



## Assemble the features into a single vector column

In [102]:
# Import functions
from pyspark.ml.linalg import Vector
from pyspark.ml.feature import VectorAssembler

In [103]:
# Show all the column names
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [104]:
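# Create assembler object
# NOTE: 'crew' is also the label (labelCol='crew' below), so listing it here
# leaks the target into the features; the outputs below reflect that leak.
assembler = VectorAssembler(inputCols=['Age', 'Tonnage', 'passengers',
                                       'length', 'cabins', 'passenger_density',
                                       'crew', 'cruise_cat'],
                            outputCol='features')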
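Because `crew` is the column the model is asked to predict, it should not also appear among the input features. A minimal sketch of one way to build the feature list so that the label is excluded by construction is shown below. The names `LABEL_COL`, `NON_FEATURE_COLS`, and `build_assembler` are hypothetical helpers, not part of the original notebook, and the PySpark import is deferred so the list logic can be checked without a Spark installation.

```python
# Column names as printed by indexed.columns above.
all_columns = ['Ship_name', 'Cruise_line', 'Age', 'Tonnage', 'passengers',
               'length', 'cabins', 'passenger_density', 'crew', 'cruise_cat']

LABEL_COL = 'crew'                               # the target to predict
NON_FEATURE_COLS = {'Ship_name', 'Cruise_line'}  # raw strings, not numeric inputs

# Keep every column that is neither the label nor a raw string column.
feature_cols = [c for c in all_columns
                if c != LABEL_COL and c not in NON_FEATURE_COLS]


def build_assembler(input_cols, output_col='features'):
    """Return a VectorAssembler for the given columns (requires pyspark)."""
    from pyspark.ml.feature import VectorAssembler  # imported lazily on purpose
    return VectorAssembler(inputCols=input_cols, outputCol=output_col)


print(feature_cols)
```

With Spark available, `build_assembler(feature_cols).transform(indexed)` would produce a `features` column without the label. Refitting on that data would give different coefficients and metrics from the ones printed in this notebook; they are not reproduced here.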

In [105]:
# Transform the data with the assembler we just created
output = assembler.transform(indexed)

Check the new data

In [106]:
# Show the data
output.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+--------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|cruise_cat|            features|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+----------+--------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|[6.0,30.276999999...|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|      16.0|[6.0,30.276999999...|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|       1.0|[26.0,47.262,14.8...|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|       1.0|[11.0,110.0,29.74...|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0| 

In [107]:
# Select the features and label columns and save them in a variable
final_data = output.select(['features', 'crew'])
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



## Split into train and test data

In [108]:
# Split data
train_data, test_data = final_data.randomSplit([0.7, 0.3])
train_data.show(1)

+--------------------+----+
|            features|crew|
+--------------------+----+
|[4.0,220.0,54.0,1...|21.0|
+--------------------+----+
only showing top 1 row



In [109]:
# Describe train and test data
train_data.describe().show()

test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               119|
|   mean| 7.867563025210095|
| stddev|3.6633732177951197|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|               39|
|   mean|7.570256410256411|
| stddev|2.995134200648118|
|    min|             0.88|
|    max|             13.6|
+-------+-----------------+
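The split is 119/39 rather than the roughly 111/47 that an exact 70/30 cut of 158 rows would give. That is expected: `randomSplit` treats the weights as per-row sampling probabilities, so the resulting sizes vary from run to run. The pure-Python sketch below illustrates the idea only; `bernoulli_split` is a hypothetical helper and is not Spark's implementation.

```python
import random


def bernoulli_split(rows, train_weight=0.7, seed=42):
    """Assign each row to train with probability train_weight, else to test."""
    rng = random.Random(seed)  # a fixed seed makes the split repeatable
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_weight else test).append(row)
    return train, test


rows = list(range(158))  # stand-in for the 158 ships
train_rows, test_rows = bernoulli_split(rows)
print(len(train_rows), len(test_rows))
```

In PySpark itself, passing a seed, as in `final_data.randomSplit([0.7, 0.3], seed=42)`, makes the split repeatable across runs on the same data and partitioning. The outputs in this notebook were produced without a seed.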



## Create the Linear Regression model and evaluate

In [110]:
# Create the model
lr = LinearRegression(labelCol='crew')

In [111]:
# Fit the model with our train data
lr_model = lr.fit(train_data)

In [112]:
# Print the coefficients and intercept for linear regression
print(f"Coefficients: {lr_model.coefficients}\nIntercept: {lr_model.intercept}")

Coefficients: [-1.0393128803583915e-15,1.1222543310495738e-15,-2.4388796259229762e-14,3.72256605236998e-14,3.5938411983451785e-14,-3.769956854357219e-15,0.999999999999988,1.5160874729464377e-15]
Intercept: 0.0
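The coefficients are listed in the same order as the assembler's `inputCols`, so the seventh entry (about 1.0) belongs to `crew` and every other entry is on the order of 1e-14 or smaller. The sketch below applies the printed coefficients by hand to the ship "Journey" to show that the prediction is essentially just the `crew` value copied through. The helper `predict` is a hypothetical stand-in for what a linear model computes.

```python
# Coefficients and intercept as printed above, in assembler input order:
# Age, Tonnage, passengers, length, cabins, passenger_density, crew, cruise_cat
coefficients = [-1.0393128803583915e-15, 1.1222543310495738e-15,
                -2.4388796259229762e-14, 3.72256605236998e-14,
                3.5938411983451785e-14, -3.769956854357219e-15,
                0.999999999999988, 1.5160874729464377e-15]
intercept = 0.0

# Feature values for the ship "Journey", taken from the indexed table above.
journey = [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 3.55, 16.0]


def predict(features, coefs, bias):
    """Linear model: bias plus the dot product of coefficients and features."""
    return bias + sum(w * x for w, x in zip(coefs, features))


prediction = predict(journey, coefficients, intercept)
crew_term = coefficients[6] * journey[6]  # contribution of the crew feature alone

print(prediction, crew_term)
```

Both printed values agree with the true crew figure of 3.55 to many decimal places, which is the signature of the label having been included among the features.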


In [113]:
# Evaluate the model on the test_data
test_results = lr_model.evaluate(test_data)

In [114]:
# Show the errors
test_results.residuals.show()

+--------------------+
|           residuals|
+--------------------+
|-4.79616346638067...|
|1.332267629550187...|
|-2.84217094304040...|
|-7.63833440942107...|
|1.456612608308205...|
|1.474376176702208...|
|1.421085471520200...|
|1.278976924368180...|
|-2.30926389122032...|
|5.329070518200751...|
|-1.06581410364015...|
|-4.08562073062057...|
|-1.24344978758017...|
|-5.15143483426072...|
|-1.42108547152020...|
|-7.99360577730112...|
|-2.48689957516035...|
|-2.30926389122032...|
|-1.24344978758017...|
|                 0.0|
+--------------------+
only showing top 20 rows



In [115]:
# Show the MSE and RMSE
test_results.meanSquaredError, test_results.rootMeanSquaredError

(3.49866890986463e-27, 5.914954699627572e-14)

In [116]:
# Show R squared
test_results.r2

1.0
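For reference, the three metrics above can be computed by hand from the labels and predictions. The sketch below uses the standard definitions (residual = label minus prediction); the function name `regression_metrics` and the toy numbers are illustrative assumptions, not data from the notebook.

```python
import math


def regression_metrics(actual, predicted):
    """Return (MSE, RMSE, R squared) for paired labels and predictions."""
    n = len(actual)
    residuals = [y - y_hat for y, y_hat in zip(actual, predicted)]
    ss_res = sum(r * r for r in residuals)           # residual sum of squares
    mean_y = sum(actual) / n
    ss_tot = sum((y - mean_y) ** 2 for y in actual)  # total sum of squares
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Toy example: every prediction is off by 0.5.
mse, rmse, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 6.5, 9.5])
print(mse, rmse, r2)

# The RMSE reported above is the square root of the reported MSE.
rmse_from_reported_mse = math.sqrt(3.49866890986463e-27)
print(rmse_from_reported_mse)
```

An R squared of exactly 1.0 means the residual sum of squares is negligible next to the variance of the label, which matches the residuals of order 1e-14 shown above.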

In [117]:
# Compare with the whole data summary
final_data.describe().show()

+-------+-----------------+
|summary|             crew|
+-------+-----------------+
|  count|              158|
|   mean|7.794177215189873|
| stddev|3.503486564627034|
|    min|             0.59|
|    max|             21.0|
+-------+-----------------+



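The metrics look perfect: the MSE is effectively zero (about 3.5e-27) and R² is 1.0. That is too good to be true.

The `features` vector includes `crew`, the very column the model is asked to predict, so the regression simply learns `prediction ≈ crew`: the coefficient on `crew` is 0.999999999999988 and every other coefficient is on the order of 1e-14 or smaller. These numbers therefore say nothing about how the model would perform on a new ship whose crew size is unknown. For an honest estimate, drop `crew` from the assembler's `inputCols` and refit.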

## Check the correlation of some variables

In [118]:
# Import corr function
from pyspark.sql.functions import corr

In [119]:
# corr crew - passengers
df.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [120]:
# corr crew - cabins
df.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
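`corr` returns the Pearson correlation coefficient between two columns. A minimal pure-Python sketch of that formula follows, applied to the first seven rows of the table shown earlier. It is only an illustration: the helper `pearson_corr` is hypothetical, and seven rows will not reproduce the full-data values above.

```python
import math


def pearson_corr(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# passengers and crew for the first seven ships in df.show() above.
passengers = [6.94, 6.94, 14.86, 29.74, 26.42, 20.52, 20.52]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]

sample_corr = pearson_corr(passengers, crew)
print(sample_corr)
```

Correlations of 0.92 and 0.95 with `crew` suggest that `passengers` and `cabins` carry real predictive signal on their own, independent of the leaked label.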



## Now let's make some predictions

In [121]:
# Drop the label column to simulate unlabeled data
# (note: the 'features' vector itself still contains the crew value)
unlabeled_data = test_data.select('features')

In [122]:
# Show the unlabeled data
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[5.0,86.0,21.04,9...|
|[5.0,122.0,28.5,1...|
|[6.0,30.276999999...|
|[6.0,90.0,20.0,9....|
|[6.0,110.23899999...|
|[6.0,112.0,38.0,9...|
|[6.0,113.0,37.82,...|
|[6.0,158.0,43.7,1...|
|[9.0,59.058,17.0,...|
|[9.0,90.09,25.01,...|
|[9.0,116.0,26.0,9...|
|[10.0,58.825,15.6...|
|[10.0,77.0,20.16,...|
|[10.0,86.0,21.14,...|
|[11.0,138.0,31.14...|
|[12.0,88.5,21.24,...|
|[12.0,91.0,20.32,...|
|[13.0,91.0,20.32,...|
|[14.0,138.0,31.14...|
|[15.0,30.27699999...|
+--------------------+
only showing top 20 rows



In [123]:
# Get the predictions
predictions = lr_model.transform(unlabeled_data)

In [124]:
# Show predictions
predictions.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[5.0,86.0,21.04,9...| 8.000000000000048|
|[5.0,122.0,28.5,1...| 6.699999999999867|
|[6.0,30.276999999...|3.5500000000000282|
|[6.0,90.0,20.0,9....| 9.000000000000076|
|[6.0,110.23899999...|11.499999999999854|
|[6.0,112.0,38.0,9...|10.899999999999853|
|[6.0,113.0,37.82,...|11.999999999999858|
|[6.0,158.0,43.7,1...|13.599999999999872|
|[9.0,59.058,17.0,...|7.4000000000000234|
|[9.0,90.09,25.01,...| 8.689999999999994|
|[9.0,116.0,26.0,9...| 11.00000000000001|
|[10.0,58.825,15.6...| 7.000000000000041|
|[10.0,77.0,20.16,...| 9.000000000000012|
|[10.0,86.0,21.14,...|  9.20000000000005|
|[11.0,138.0,31.14...|11.850000000000014|
|[12.0,88.5,21.24,...|  9.30000000000008|
|[12.0,91.0,20.32,...| 9.990000000000025|
|[13.0,91.0,20.32,...| 9.990000000000023|
|[14.0,138.0,31.14...|11.760000000000012|
|[15.0,30.27699999...|               4.0|
+--------------------+------------