## Linear Regression Consulting Project

Import the libraries needed for the project (SparkSession and LinearRegression).

In [70]:
import findspark

In [71]:
findspark.init('/home/ubuntu/spark-2.1.1-bin-hadoop2.7')

In [72]:
from pyspark.sql import SparkSession

In [73]:
from pyspark.ml.regression import LinearRegression

Before starting the analysis, the data must first be prepared appropriately. For this we will use the VectorAssembler class, which merges the columns containing features into a single column of feature vectors.

In [74]:
from pyspark.ml.linalg import Vectors

In [75]:
from pyspark.ml.feature import VectorAssembler

Create a new Spark session and load the data from the file.

In [76]:
spark = SparkSession.builder.appName('lr_project').getOrCreate()

In [77]:
data = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [78]:
data.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Elation|   Carnival| 15|            70.367|     20.52|  8.55|  10.2|            34.29| 9.2|
|    Fantasy|   Carnival| 23| 

In [79]:
data.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



At this point we need to decide which factors affect crew size. Intuitively, the following features should matter: 'Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density'. From the task description we also know that crew size depends on the type of ship, given in the 'Cruise_line' column. Unfortunately, this is a string column, so we will need StringIndexer (a categorical transformer in the same family as OneHotEncoder; it turns distinct strings into categories and assigns each one a numeric id).

Let's first check how many categories there are:

In [80]:
data.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [81]:
from pyspark.ml.feature import StringIndexer

In [82]:
indexer = StringIndexer(inputCol="Cruise_line", outputCol="Cruise_line_Index")

In [83]:
indexed = indexer.fit(data).transform(data)

This creates a new column named Cruise_line_Index, which holds the numeric index assigned to each ship's value in the Cruise_line column.
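
As a rough illustration of how such an index can arise, the sketch below rebuilds the mapping in plain Python from the category counts printed earlier. It assumes labels are ordered by descending frequency (the documented default of StringIndexer) and that ties are broken alphabetically. The tie-breaking rule and the helper name `frequency_index` are assumptions of this sketch, not something the notebook confirms.

```python
# Category counts copied from the groupBy('Cruise_line').count() output above.
counts = {
    'Costa': 11, 'P&O': 6, 'Cunard': 3, 'Regent_Seven_Seas': 5, 'MSC': 8,
    'Carnival': 22, 'Crystal': 2, 'Orient': 1, 'Princess': 17, 'Silversea': 4,
    'Seabourn': 3, 'Holland_American': 14, 'Windstar': 3, 'Disney': 2,
    'Norwegian': 13, 'Oceania': 3, 'Azamara': 2, 'Celebrity': 10, 'Star': 6,
    'Royal_Caribbean': 23,
}


def frequency_index(label_counts):
    """Map each label to a float index; the most frequent label gets 0.0.

    Ties are broken alphabetically (an assumption of this sketch).
    """
    ordered = sorted(label_counts, key=lambda name: (-label_counts[name], name))
    return {name: float(i) for i, name in enumerate(ordered)}


cruise_line_index = frequency_index(counts)
print(cruise_line_index['Carnival'], cruise_line_index['Azamara'])
```

Under these assumptions, 'Carnival' and 'Azamara' receive 1.0 and 16.0, the same values that appear in the Cruise_line_Index column of `indexed.show()`.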

In [84]:
indexed.show()

+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|  Ship_name|Cruise_line|Age|           Tonnage|passengers|length|cabins|passenger_density|crew|Cruise_line_Index|
+-----------+-----------+---+------------------+----------+------+------+-----------------+----+-----------------+
|    Journey|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|      Quest|    Azamara|  6|30.276999999999997|      6.94|  5.94|  3.55|            42.64|3.55|             16.0|
|Celebration|   Carnival| 26|            47.262|     14.86|  7.22|  7.43|             31.8| 6.7|              1.0|
|   Conquest|   Carnival| 11|             110.0|     29.74|  9.53| 14.88|            36.99|19.1|              1.0|
|    Destiny|   Carnival| 17|           101.353|     26.42|  8.92| 13.21|            38.36|10.0|              1.0|
|    Ecstasy|   Carnival| 22|            70.367|     20.52|  8.55|  10.2|       

In [85]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'Cruise_line_Index']

When selecting the features of interest, I skip the following columns: 'Ship_name', because it has no effect on crew size; 'Cruise_line', because it has been replaced by the numeric 'Cruise_line_Index'; and 'crew', because it is the label, not a feature.
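
To make the selection concrete, here is a minimal plain-Python sketch of what assembling one row amounts to: the chosen columns are concatenated, in order, into a single numeric vector, while 'crew' is kept aside as the label. The helper name `assemble` is hypothetical and only mimics the idea behind VectorAssembler; it is not the library's implementation.

```python
# One row of the indexed DataFrame, copied from the indexed.show() output above.
row = {
    'Ship_name': 'Celebration', 'Cruise_line': 'Carnival', 'Age': 26,
    'Tonnage': 47.262, 'passengers': 14.86, 'length': 7.22, 'cabins': 7.43,
    'passenger_density': 31.8, 'crew': 6.7, 'Cruise_line_Index': 1.0,
}

# Feature columns in the order they will appear in the vector.
feature_cols = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
                'passenger_density', 'Cruise_line_Index']


def assemble(record, input_cols):
    """Concatenate the selected columns, in order, into one list of floats."""
    return [float(record[col]) for col in input_cols]


features = assemble(row, feature_cols)
label = row['crew']
print(features, label)
```

The column order matters: the position of each value in the vector determines which model coefficient it is later paired with.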

In [86]:
assembler = VectorAssembler(inputCols=['Age', 'Tonnage', 'passengers', 'length', 'cabins', 'passenger_density', 'Cruise_line_Index'],
                           outputCol='features')

In [87]:
output = assembler.transform(indexed)

In [88]:
output.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)
 |-- Cruise_line_Index: double (nullable = true)
 |-- features: vector (nullable = true)



In [89]:
final_data = output.select('features', 'crew')

In [90]:
final_data.show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows



In [91]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [92]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               113|
|   mean| 7.695840707964614|
| stddev|3.4991901424430947|
|    min|              0.59|
|    max|              21.0|
+-------+------------------+



In [93]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                45|
|   mean| 8.041111111111112|
| stddev|3.5415772181217777|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+



In [94]:
lr = LinearRegression(labelCol='crew')

In [95]:
lr_model = lr.fit(train_data)

In [96]:
test_results = lr_model.evaluate(test_data)

In [97]:
test_results.rootMeanSquaredError

0.8902799099871659

In [98]:
test_results.r2

0.9353721730933033
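
For reference, the two metrics reported above can be sketched in plain Python using their standard definitions: RMSE is the square root of the mean squared error, and r2 is one minus the ratio of the residual sum of squares to the total sum of squares. The `y_true` and `y_pred` values below are toy numbers chosen for illustration; they are not taken from the model or the dataset.

```python
import math

# Toy values for illustration only (not model output).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]


def rmse(actual, predicted):
    """Root mean squared error of the predictions."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


print(rmse(y_true, y_pred), r2(y_true, y_pred))
```

RMSE is expressed in the same units as the label, so it is easiest to judge next to the spread of 'crew' shown by `describe()`.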

Because the r2 score is very high, we should treat the model critically and check whether such good results are really plausible. Let's check whether any of the columns is strongly correlated with the target.
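
The quantity used for this check is the Pearson correlation coefficient, which is what `corr` from `pyspark.sql.functions` computes. A minimal plain-Python sketch of the formula follows; it uses only the first seven rows visible in `data.show()`, so its result will differ from the value obtained on the full dataset. The helper name `pearson` is an assumption of this sketch.

```python
import math

# 'cabins' and 'crew' for the first seven rows shown by data.show() above.
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2, 9.2]


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of std devs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


print(pearson(cabins, crew))
```

A coefficient close to 1 means the two columns move together almost linearly, which would explain why a linear model fits the label so well.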

In [99]:
from pyspark.sql.functions import corr

In [101]:
data.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [102]:
data.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+

