# Linear Regression Consulting Project

Congratulations! You've been contracted by Hyundai Heavy Industries to help them build a predictive model for some ships. [Hyundai Heavy Industries](http://www.hyundai.eu/en) is one of the world's largest ship manufacturing companies and builds cruise liners.

You've been flown to their headquarters in Ulsan, South Korea to help them give accurate estimates of how many crew members a ship will require.

They are currently building new ships for some customers and want you to create a model and use it to predict how many crew members the ships will need.

Here is what the data looks like so far:

    Description: Measurements of ship size, capacity, crew, and age for 158 cruise
    ships.


    Variables/Columns
    Ship Name     1-20
    Cruise Line   21-40
    Age (as of 2013)   46-48
    Tonnage (1000s of tons)   50-56
    passengers (100s)   58-64
    Length (100s of feet)  66-72
    Cabins  (100s)   74-80
    Passenger Density   82-88
    Crew  (100s)   90-96
    
It is saved in a csv file for you called "cruise_ship_info.csv". Your job is to create a regression model that will help predict how many crew members will be needed for future ships. The client also mentioned that they have found that particular cruise lines will differ in acceptable crew counts, so it is most likely an important feature to include in your analysis! 

Once you've created the model and tested it for a quick check on how well you can expect it to perform, make sure you take a look at why it performs so well!

# Requirements

In [64]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("Lr_project").getOrCreate()
# Import LinearRegression from MLlib
from pyspark.ml.regression import LinearRegression

### Load the data

In [65]:
data = spark.read.csv("../../Data/cruise_ship_info.csv", inferSchema=True, header=True)

In [66]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [67]:
for value, item in zip(data.head(1)[0],data.columns):
    print(item)
    print(value)
    print("\n")

Ship_name
Journey


Cruise_line
Azamara


Age
6


Tonnage
30.276999999999997


passengers
6.94


length
5.94


cabins
3.55


passenger_density
42.64


crew
3.55




# Let's convert Cruise_line to numeric data
`Cruise_line` is categorical, and the client told us it is important, so we encode it with `StringIndexer`. It maps each distinct value to a numeric index ordered by frequency: the most frequent value gets 0, the next most frequent gets 1, and so on.
- Example: [a,a,a,b,b,c] ==> [0,0,0,1,1,2]
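
To make the indexing rule concrete, here is a minimal pure-Python sketch of frequency-ordered indexing. It only imitates the idea and is not Spark's implementation; the helper name `frequency_index` is hypothetical, and breaking ties alphabetically is an assumption made here for determinism.

```python
from collections import Counter


def frequency_index(values):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending count; ties are broken alphabetically
    # (an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    mapping = {label: float(i) for i, label in enumerate(ordered)}
    return [mapping[v] for v in values], mapping


indexed, mapping = frequency_index(["a", "a", "a", "b", "b", "c"])
print(indexed)  # [0.0, 0.0, 0.0, 1.0, 1.0, 2.0]
print(mapping)  # {'a': 0.0, 'b': 1.0, 'c': 2.0}
```

This matches the counts shown further down, where Royal_Caribbean (23 ships) is indexed as 0.0 and Carnival (22 ships) as 1.0.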

In [88]:
# One-hot (dummy) variables would be an alternative encoding
from pyspark.ml.feature import StringIndexer
indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_numeric")
new_data = indexer.fit(data).transform(data)
new_data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_numeric|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-------------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|               16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|                1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|                1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|                1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.5

In [89]:
data.groupBy("Cruise_line").count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [90]:
new_data.groupBy("Cruise_line_numeric").count().show()

+-------------------+-----+
|Cruise_line_numeric|count|
+-------------------+-----+
|                8.0|    6|
|                0.0|   23|
|                7.0|    8|
|               18.0|    2|
|                1.0|   22|
|                4.0|   13|
|               11.0|    4|
|               14.0|    3|
|                3.0|   14|
|               19.0|    1|
|                2.0|   17|
|               17.0|    2|
|               10.0|    5|
|               13.0|    3|
|                6.0|   10|
|                5.0|   11|
|               15.0|    3|
|                9.0|    6|
|               16.0|    2|
|               12.0|    3|
+-------------------+-----+
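
A linear model reads `Cruise_line_numeric` as an ordered quantity, so index 16.0 is treated as "larger" than index 1.0 even though the cruise lines have no natural order. The dummy-variable alternative mentioned in the code comment avoids this. The sketch below shows the idea in plain Python; the helper `one_hot` is hypothetical. In PySpark, `OneHotEncoder` from `pyspark.ml.feature` plays this role, and dropping the last category (as done here by default) mirrors its default behaviour.

```python
def one_hot(indices, num_categories, drop_last=True):
    """Turn category indices into 0/1 indicator rows."""
    # With drop_last=True the last category is encoded as all zeros,
    # which avoids a redundant (perfectly collinear) column.
    width = num_categories - 1 if drop_last else num_categories
    rows = []
    for idx in indices:
        row = [0.0] * width
        if int(idx) < width:
            row[int(idx)] = 1.0
        rows.append(row)
    return rows


print(one_hot([0.0, 1.0, 2.0], num_categories=3))
# [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
```

With 20 cruise lines this would add 19 indicator columns, which is a lot for 158 rows, so whether it helps here is something to test rather than assume.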



# Now let's continue the process
## Select features and label using VectorAssembler

In [84]:
new_data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_numeric: double (nullable = false)



In [85]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.linalg import Vectors 

In [97]:
# Grab the features and the label.
# VectorAssembler takes inputCols (the columns to combine) and
# outputCol (the name of the single vector column it produces).

# Use the numeric columns, including the indexed cruise line
assembler = VectorAssembler(inputCols=["Age","Tonnage","passengers","length","cabins","passenger_density",
                                       "Cruise_line_numeric"], outputCol = "features")
output_df = assembler.transform(new_data)
output_df = output_df.select(["features","crew"])
output_df.show()


+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows
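
Conceptually, the assembler just concatenates the chosen columns of each row, in order, into one numeric vector. A rough plain-Python picture of that step, using the first ship shown earlier (the helper `assemble` and the dict-based row are illustrative assumptions, not Spark objects):

```python
FEATURE_COLS = ["Age", "Tonnage", "passengers", "length", "cabins",
                "passenger_density", "Cruise_line_numeric"]


def assemble(row, input_cols):
    """Concatenate the selected columns of one row into a feature vector."""
    return [float(row[col]) for col in input_cols]


# First row of the dataset as printed above (Journey, Azamara).
first_ship = {
    "Ship_name": "Journey", "Cruise_line": "Azamara", "Age": 6,
    "Tonnage": 30.276999999999997, "passengers": 6.94, "length": 5.94,
    "cabins": 3.55, "passenger_density": 42.64, "crew": 3.55,
    "Cruise_line_numeric": 16.0,
}

features = assemble(first_ship, FEATURE_COLS)
print(features)  # [6.0, 30.276999999999997, 6.94, 5.94, 3.55, 42.64, 16.0]
```

The label `crew` stays out of the vector, which is why the output keeps it as a separate column.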



# Now let's split our data into train_data and test_data

In [99]:
train_data, test_data = output_df.randomSplit([0.7,.3])

In [102]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               115|
|   mean|7.7569565217391405|
| stddev| 3.607465498470425|
|    min|              0.59|
|    max|              19.1|
+-------+------------------+



In [103]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                43|
|   mean| 7.893720930232557|
| stddev|3.2474319694885465|
|    min|              1.97|
|    max|              21.0|
+-------+------------------+
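
The split came out as 115/43 rather than an exact 70/30 cut of 158 rows because `randomSplit` assigns rows at random, so the weights are only honoured approximately. Without a seed, the split (and every number below) changes from run to run; `randomSplit` accepts an optional `seed` argument for that. The sketch below imitates the behaviour in plain Python; `random_split` is a hypothetical helper, not Spark's algorithm.

```python
import random


def random_split(rows, weights, seed):
    """Randomly assign each row to one of two parts, roughly by weight."""
    cut = weights[0] / sum(weights)  # normalise, e.g. 0.7 / 1.0
    rng = random.Random(seed)        # fixed seed gives a repeatable split
    first, second = [], []
    for row in rows:
        (first if rng.random() < cut else second).append(row)
    return first, second


train, test = random_split(list(range(158)), [0.7, 0.3], seed=42)
print(len(train), len(test))  # sizes sum to 158 but are only near 70/30
```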



# Let's create our linear regression model to predict the crew size of a ship

In [107]:
lr = LinearRegression(labelCol="crew", featuresCol="features")
lr_model = lr.fit(train_data)
test_result = lr_model.evaluate(test_data)

# Evaluation of the model

In [108]:
test_result.rootMeanSquaredError

0.5858384628053303

The RMSE gives the typical prediction error in the units of `crew` (hundreds of crew members). In the training data, crew has a mean of 7.76, a standard deviation of 3.61, a minimum of 0.59 and a maximum of 19.1. An RMSE of about 0.59 is small compared with those values.

- So, at first sight, the model fits the data well.
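
For reference, this is how the RMSE is computed and how it compares with the spread of the label. The toy `actual`/`predicted` values are made up for illustration; the last two numbers are the ones reported in this notebook.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy values (illustrative only): errors are 1, 0 and -2.
actual = [3.0, 5.0, 7.0]
predicted = [2.0, 5.0, 9.0]
print(rmse(actual, predicted))  # sqrt(5/3), about 1.29

# Numbers reported above: test RMSE and training standard deviation of crew.
reported_rmse = 0.5858384628053303
train_std = 3.607465498470425
print(reported_rmse / train_std)  # about 0.16
```

Since `crew` is stored in hundreds, an RMSE of 0.59 corresponds to an error of roughly 59 crew members, about one sixth of the label's standard deviation.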

In [109]:
test_result.r2

0.9666807816357342

The model explains about 97% of the variance in the test data, which is a good sign, so by this metric too the model looks good.
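
As a reminder of what R2 measures, the sketch below computes it as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The toy values are made up for illustration and are not taken from the dataset.

```python
def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Toy values: SS_res = 4 * 0.25 = 1.0, SS_tot = 9 + 1 + 1 + 9 = 20.
actual = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
print(r_squared(actual, predicted))  # about 0.95
```

A value near 1 means the predictions leave little variance unexplained compared with always predicting the mean.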

# Now let's apply it to unlabeled data as a reality check, especially since we got good RMSE and R2 results


In [119]:
# Use test_data without the crew column as an example of unlabeled data
unlabeled_data = test_data.select("features")
unlabeled_data.show()

+--------------------+
|            features|
+--------------------+
|[4.0,220.0,54.0,1...|
|[5.0,115.0,35.74,...|
|[6.0,90.0,20.0,9....|
|[9.0,90.09,25.01,...|
|[10.0,68.0,10.8,7...|
|[10.0,77.0,20.16,...|
|[11.0,58.6,15.66,...|
|[11.0,85.0,18.48,...|
|[11.0,90.0,22.4,9...|
|[11.0,91.62700000...|
|[11.0,108.977,26....|
|[12.0,50.0,7.0,7....|
|[12.0,90.09,25.01...|
|[12.0,91.0,20.32,...|
|[12.0,138.0,31.14...|
|[13.0,25.0,3.82,5...|
|[13.0,30.27699999...|
|[13.0,85.619,21.1...|
|[14.0,30.27699999...|
|[14.0,77.104,20.0...|
+--------------------+
only showing top 20 rows



In [122]:
result_prediction = lr_model.transform(unlabeled_data)
result_prediction.show()

+--------------------+------------------+
|            features|        prediction|
+--------------------+------------------+
|[4.0,220.0,54.0,1...|20.781330177300596|
|[5.0,115.0,35.74,...|11.817778101766423|
|[6.0,90.0,20.0,9....|10.100451388007452|
|[9.0,90.09,25.01,...|  9.34116848699818|
|[10.0,68.0,10.8,7...| 6.607846888487899|
|[10.0,77.0,20.16,...|  8.80137085079257|
|[11.0,58.6,15.66,...| 7.476099383394816|
|[11.0,85.0,18.48,...| 8.868851552136396|
|[11.0,90.0,22.4,9...|10.109552531612168|
|[11.0,91.62700000...|  9.28352955485062|
|[11.0,108.977,26....|11.103839558952421|
|[12.0,50.0,7.0,7....| 4.645795896023072|
|[12.0,90.09,25.01...| 8.944092321375132|
|[12.0,91.0,20.32,...| 9.278281504970256|
|[12.0,138.0,31.14...|12.966815286092508|
|[13.0,25.0,3.82,5...| 3.080248369091348|
|[13.0,30.27699999...| 4.007984465527718|
|[13.0,85.619,21.1...| 9.708142544991308|
|[14.0,30.27699999...|3.4939247643319353|
|[14.0,77.104,20.0...| 8.802374897781784|
+--------------------+------------------+

# Let's check whether any columns are strongly correlated with the target; that may be why our model performs so well

In [126]:
from pyspark.sql.functions import corr

In [129]:
data.select(corr("crew","passengers")).show()
data.select(corr("crew","cabins")).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+



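A high correlation does not by itself prove a causal link, but it is a significant finding: crew is almost linearly related to passengers and cabins, which likely explains why the linear model performs so well.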
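
For readers who want to see what `corr` computes, here is a minimal sketch of the Pearson correlation coefficient in plain Python. The helper `pearson` is hypothetical, and the sample uses only four ships visible in the output above, so its result will differ from the full-dataset values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Four ships from the table above (Journey, Celebration, Conquest, Destiny).
passengers = [6.94, 14.86, 29.74, 26.42]
crew = [3.55, 6.7, 19.1, 10.0]
print(pearson(passengers, crew))  # between -1 and 1; a tiny sample only
```

A coefficient close to 1 means the two columns rise together almost linearly, which is exactly the situation a linear regression handles well.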