# Linear Regression Example

__In this simple example, you are given a dataset of cruise ships from a ship manufacturer. Your task is to build a linear regression model that predicts how many crew members a given ship needs.
The data schema is as follows:__
        Variable                       Columns
        Ship Name                      1-20
        Cruise Line                    21-40
        Age (as of 2013)               46-48
        Tonnage (1000s of tons)        50-56
        Passengers (100s)              58-64
        Length (100s of feet)          66-72
        Cabins (100s)                  74-80
        Passenger Density              82-88
        Crew (100s)                    90-96


In [10]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cruise').getOrCreate()

In [11]:
df = spark.read.csv('cruise_ship_info.csv', inferSchema=True, header=True)

In [13]:
df.printSchema()

root
 |-- Ship_name: string (nullable = true)
 |-- Cruise_line: string (nullable = true)
 |-- Age: integer (nullable = true)
 |-- Tonnage: double (nullable = true)
 |-- passengers: double (nullable = true)
 |-- length: double (nullable = true)
 |-- cabins: double (nullable = true)
 |-- passenger_density: double (nullable = true)
 |-- crew: double (nullable = true)



In [14]:
df.head(10)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55),
 Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7),
 Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1),
 Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0),
 Row(Ship_name='Ecstasy', Cruise_line='Carnival', Age=22, Tonnage=70.367, passengers=20.52, length=8.55, cabins=10.2, passenger_density=34.29, crew=9.2),
 Row(Ship_name='Elation', Cruise_line='Carnival', 

**Take a look at all the features. Do you notice anything that may make linear regression difficult?**

**Yes: the problem is the attribute "Cruise_line". The number of crew members on a ship varies with the company (cruise line) that owns it, so it is a factor we should consider in the regression. However, it is a string rather than a numeric value the model can process, so we need a way to turn it into a numeric value.**

In [15]:
df.groupBy('Cruise_line').count().show()

+-----------------+-----+
|      Cruise_line|count|
+-----------------+-----+
|            Costa|   11|
|              P&O|    6|
|           Cunard|    3|
|Regent_Seven_Seas|    5|
|              MSC|    8|
|         Carnival|   22|
|          Crystal|    2|
|           Orient|    1|
|         Princess|   17|
|        Silversea|    4|
|         Seabourn|    3|
| Holland_American|   14|
|         Windstar|    3|
|           Disney|    2|
|        Norwegian|   13|
|          Oceania|    3|
|          Azamara|    2|
|        Celebrity|   10|
|             Star|    6|
|  Royal_Caribbean|   23|
+-----------------+-----+



In [16]:
from pyspark.ml.feature import StringIndexer

In [80]:
indexer = StringIndexer(inputCol='Cruise_line', outputCol='cruise_cat')
indexed = indexer.fit(df).transform(df)
indexed.head(5)

[Row(Ship_name='Journey', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0),
 Row(Ship_name='Quest', Cruise_line='Azamara', Age=6, Tonnage=30.276999999999997, passengers=6.94, length=5.94, cabins=3.55, passenger_density=42.64, crew=3.55, cruise_cat=16.0),
 Row(Ship_name='Celebration', Cruise_line='Carnival', Age=26, Tonnage=47.262, passengers=14.86, length=7.22, cabins=7.43, passenger_density=31.8, crew=6.7, cruise_cat=1.0),
 Row(Ship_name='Conquest', Cruise_line='Carnival', Age=11, Tonnage=110.0, passengers=29.74, length=9.53, cabins=14.88, passenger_density=36.99, crew=19.1, cruise_cat=1.0),
 Row(Ship_name='Destiny', Cruise_line='Carnival', Age=17, Tonnage=101.353, passengers=26.42, length=8.92, cabins=13.21, passenger_density=38.36, crew=10.0, cruise_cat=1.0)]
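
To make the `cruise_cat` values above less opaque, here is a minimal pure-Python sketch of frequency-based label indexing, which is how `StringIndexer` orders labels by default. The helper `frequency_index` is hypothetical, and the alphabetical tie-break between equally frequent labels is an assumption about Spark's behavior. With the counts from the `groupBy` output, it reproduces the `1.0` for Carnival and `16.0` for Azamara seen above. Note that the resulting index is only an identifier: its numeric order reflects label frequency, not any property of the cruise line.

```python
# Counts copied from the groupBy('Cruise_line').count() output above.
counts = {
    'Costa': 11, 'P&O': 6, 'Cunard': 3, 'Regent_Seven_Seas': 5, 'MSC': 8,
    'Carnival': 22, 'Crystal': 2, 'Orient': 1, 'Princess': 17, 'Silversea': 4,
    'Seabourn': 3, 'Holland_American': 14, 'Windstar': 3, 'Disney': 2,
    'Norwegian': 13, 'Oceania': 3, 'Azamara': 2, 'Celebrity': 10, 'Star': 6,
    'Royal_Caribbean': 23,
}


def frequency_index(label_counts):
    """Map each label to a float index, most frequent label first.

    Ties are broken alphabetically (an assumption, see the text above).
    """
    ordered = sorted(label_counts, key=lambda label: (-label_counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


cruise_cat = frequency_index(counts)
print(cruise_cat['Royal_Caribbean'], cruise_cat['Carnival'], cruise_cat['Azamara'])
```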

In [18]:
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler

In [20]:
indexed.columns

['Ship_name',
 'Cruise_line',
 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'crew',
 'cruise_cat']

In [22]:
assembler = VectorAssembler(inputCols=[ 'Age',
 'Tonnage',
 'passengers',
 'length',
 'cabins',
 'passenger_density',
 'cruise_cat'
], outputCol='features')

In [23]:
output = assembler.transform(indexed)

In [24]:
output.select('features', 'crew').show()

+--------------------+----+
|            features|crew|
+--------------------+----+
|[6.0,30.276999999...|3.55|
|[6.0,30.276999999...|3.55|
|[26.0,47.262,14.8...| 6.7|
|[11.0,110.0,29.74...|19.1|
|[17.0,101.353,26....|10.0|
|[22.0,70.367,20.5...| 9.2|
|[15.0,70.367,20.5...| 9.2|
|[23.0,70.367,20.5...| 9.2|
|[19.0,70.367,20.5...| 9.2|
|[6.0,110.23899999...|11.5|
|[10.0,110.0,29.74...|11.6|
|[28.0,46.052,14.5...| 6.6|
|[18.0,70.367,20.5...| 9.2|
|[17.0,70.367,20.5...| 9.2|
|[11.0,86.0,21.24,...| 9.3|
|[8.0,110.0,29.74,...|11.6|
|[9.0,88.5,21.24,9...|10.3|
|[15.0,70.367,20.5...| 9.2|
|[12.0,88.5,21.24,...| 9.3|
|[20.0,70.367,20.5...| 9.2|
+--------------------+----+
only showing top 20 rows
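
Conceptually, the assembler concatenates the chosen input columns of each row into one feature vector, in the order given by `inputCols`. The following pure-Python sketch illustrates that idea for a single row; `assemble` is a hypothetical helper, not part of PySpark, and the row is the first ship from the output above with Tonnage rounded for brevity.

```python
FEATURE_COLS = ['Age', 'Tonnage', 'passengers', 'length', 'cabins',
                'passenger_density', 'cruise_cat']


def assemble(row, input_cols=FEATURE_COLS):
    """Concatenate the selected columns of one row into a list of floats."""
    return [float(row[col]) for col in input_cols]


# First row of the indexed data shown earlier (Tonnage rounded for brevity).
journey = {
    'Ship_name': 'Journey', 'Cruise_line': 'Azamara', 'Age': 6,
    'Tonnage': 30.277, 'passengers': 6.94, 'length': 5.94, 'cabins': 3.55,
    'passenger_density': 42.64, 'crew': 3.55, 'cruise_cat': 16.0,
}

features = assemble(journey)
print(features)  # the label 'crew' and the string columns are left out
```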



In [26]:
final_data = output.select(['features', 'crew'])

In [27]:
train_data, test_data = final_data.randomSplit([0.7, 0.3])

In [28]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               111|
|   mean| 7.837567567567579|
| stddev|3.4591010961115884|
|    min|              0.88|
|    max|              21.0|
+-------+------------------+



In [29]:
test_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|                47|
|   mean| 7.691702127659576|
| stddev|3.6421125732601562|
|    min|              0.59|
|    max|              13.6|
+-------+------------------+
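
The split above produced 111 training rows and 47 test rows, which is close to but not exactly 70/30. That is expected if each row is assigned to a split independently at random, so the weights are target proportions rather than exact sizes. The sketch below illustrates this with a hypothetical pure-Python `random_split`; it is not Spark's implementation. For a reproducible split in PySpark itself, `randomSplit` accepts a `seed` argument, for example `final_data.randomSplit([0.7, 0.3], seed=42)`.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split independently, with probability
    proportional to the weights, so split sizes are only approximate."""
    rng = random.Random(seed)
    total = sum(weights)
    cumulative = []
    running = 0.0
    for weight in weights:
        running += weight / total
        cumulative.append(running)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for i, bound in enumerate(cumulative):
            # The last split also catches any floating-point shortfall.
            if draw < bound or i == len(weights) - 1:
                splits[i].append(row)
                break
    return splits


rows = list(range(158))  # 158 ships in total: 111 + 47 in the run above
train, test = random_split(rows, [0.7, 0.3], seed=42)
print(len(train), len(test))
```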



In [30]:
from pyspark.ml.regression import LinearRegression

In [31]:
ship_lr = LinearRegression(labelCol='crew')

In [32]:
trained_ship_model = ship_lr.fit(train_data)

In [33]:
ship_results = trained_ship_model.evaluate(test_data)

In [35]:
ship_results.rootMeanSquaredError

0.6696859333344719

In [36]:
train_data.describe().show()

+-------+------------------+
|summary|              crew|
+-------+------------------+
|  count|               111|
|   mean| 7.837567567567579|
| stddev|3.4591010961115884|
|    min|              0.88|
|    max|              21.0|
+-------+------------------+



In [37]:
ship_results.r2

0.9654557594391486

In [38]:
ship_results.meanSquaredError

0.4484792493060628
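
For reference, the three metrics reported above are related as follows: the mean squared error is the average squared residual, the RMSE is its square root, and R² is one minus the ratio of the residual sum of squares to the total sum of squares. An RMSE of about 0.67 is small compared with the standard deviation of `crew` of about 3.46 in the training data, which is consistent with the high R². The sketch below is a minimal pure-Python version; `regression_metrics` is a hypothetical helper and the toy values are not taken from the cruise data.

```python
import math


def regression_metrics(labels, predictions):
    """Return (mse, rmse, r2) for paired labels and predictions."""
    n = len(labels)
    mean_label = sum(labels) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    mse = ss_res / n
    return mse, math.sqrt(mse), 1.0 - ss_res / ss_tot


# Hypothetical toy values, used only to exercise the function.
mse, rmse, r2 = regression_metrics([3.0, 5.0, 7.0, 9.0], [2.5, 5.5, 7.0, 9.5])
print(mse, rmse, r2)

# Consistency check on the values reported above: RMSE is sqrt(MSE).
reported_mse = 0.4484792493060628
reported_rmse = 0.6696859333344719
print(math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```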

In [39]:
from pyspark.sql.functions import corr

In [40]:
df.select(corr('crew', 'passengers')).show()

+----------------------+
|corr(crew, passengers)|
+----------------------+
|    0.9152341306065384|
+----------------------+



In [42]:
df.select(corr('crew', 'cabins')).show()

+------------------+
|corr(crew, cabins)|
+------------------+
|0.9508226063578497|
+------------------+
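
The `corr` function computes the Pearson correlation coefficient. The high values above (about 0.92 for passengers and 0.95 for cabins) suggest that crew size scales almost linearly with ship capacity, which helps explain why a linear model fits so well. As a rough illustration, the sketch below computes the same statistic in pure Python on only the first six ships printed earlier, so its result will differ from the full-dataset value; `pearson` is a hypothetical helper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# cabins and crew for the first six ships printed by df.head(10) above.
cabins = [3.55, 3.55, 7.43, 14.88, 13.21, 10.2]
crew = [3.55, 3.55, 6.7, 19.1, 10.0, 9.2]
r = pearson(cabins, crew)
print(round(r, 3))
```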



# Logistic Regression Example

**For businesses, especially service-based ones, it is important to keep the customer base stable; that is, you do not want to see your customers go to your competitors. The measure to monitor is the customer churn rate.
If we can use historical customer data to predict how likely a customer is to leave, we can hand such cases over for damage control, such as offering more incentives to stay.
In this example, we use logistic regression to classify customers into "who will remain" and "who will leave".**

The dataset contains the following attributes:

Names:                     Name of the latest contact at Company
Age:                       Customer Age
Total_Purchase:            Total Ads Purchased
Account_Manager:           Binary 0=No manager, 1=Account manager assigned
Years:                     Total Years as a customer
Num_Sites:                 Number of websites that use the service
Onboard_date:              Date that the name of the latest contact was onboarded
Location:                  Client HQ Address
Company:                   Name of Client Company
Churn:                     Binary target label indicating whether the customer churned

In [43]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('logregchurn').getOrCreate()

In [44]:
data = spark.read.csv('customer_churn.csv', inferSchema=True, header=True)

In [45]:
data.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- Churn: integer (nullable = true)



In [46]:
data.describe().show()

+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|summary|        Names|              Age|   Total_Purchase|   Account_Manager|            Years|         Num_Sites|            Location|             Company|              Churn|
+-------+-------------+-----------------+-----------------+------------------+-----------------+------------------+--------------------+--------------------+-------------------+
|  count|          900|              900|              900|               900|              900|               900|                 900|                 900|                900|
|   mean|         null|41.81666666666667|10062.82403333334|0.4811111111111111| 5.27315555555555| 8.587777777777777|                null|                null|0.16666666666666666|
| stddev|         null|6.127560416916251|2408.644531858096|0.4999208935073339|1.274449013194616|1.764835592035

In [47]:
data.columns

['Names',
 'Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites',
 'Onboard_date',
 'Location',
 'Company',
 'Churn']

In [50]:
from pyspark.ml.feature import VectorAssembler

In [51]:
assembler = VectorAssembler(inputCols=['Age',
 'Total_Purchase',
 'Account_Manager',
 'Years',
 'Num_Sites'
 ], outputCol='features')

In [53]:
output = assembler.transform(data)

In [54]:
final_data = output.select('features', 'churn')

In [56]:
train_churn, test_churn = final_data.randomSplit([0.7, 0.3])

In [57]:
from pyspark.ml.classification import LogisticRegression

In [59]:
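# Note: this is LinearRegression, not the LogisticRegression imported above, so the 'prediction' values in the outputs that follow are unbounded scores; LogisticRegression(labelCol='churn') would fit a logistic model.
lr_churn = LinearRegression(labelCol='churn')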

In [60]:
fitted_churn_model = lr_churn.fit(train_churn)

In [61]:
training_summary = fitted_churn_model.summary

In [62]:
training_summary.predictions.describe().show()

+-------+-------------------+-------------------+
|summary|              churn|         prediction|
+-------+-------------------+-------------------+
|  count|                617|                617|
|   mean|0.17017828200972449|0.17017828200972449|
| stddev|0.37609424849144185| 0.2141348557444161|
|    min|                0.0|-0.4706819255204475|
|    max|                1.0| 0.8824614231124981|
+-------+-------------------+-------------------+
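
The `prediction` column above ranges from about -0.47 to 0.88, so it cannot be read as a probability. A logistic regression model instead passes a linear score through the logistic (sigmoid) function, which always yields a value strictly between 0 and 1. The sketch below shows that mapping in pure Python. Feeding it the minimum and maximum from the summary is purely for illustration; it does not reproduce what a fitted logistic model would output for this data.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# Minimum and maximum of the 'prediction' column in the summary above,
# treated here as raw linear scores purely for illustration.
for score in (-0.4706819255204475, 0.0, 0.8824614231124981):
    print(score, sigmoid(score))
```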



In [63]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [64]:
pred_and_labels = fitted_churn_model.evaluate(test_churn)

In [66]:
pred_and_labels.predictions.show()

+--------------------+-----+--------------------+
|            features|churn|          prediction|
+--------------------+-----+--------------------+
|[25.0,9672.03,0.0...|    0|-0.01256752500887...|
|[26.0,8787.39,1.0...|    1|   0.354829789448059|
|[26.0,8939.61,0.0...|    0|-0.16494842363168116|
|[28.0,9090.43,1.0...|    0|  0.2765091740220522|
|[28.0,11245.38,0....|    0| 0.07330141185841521|
|[29.0,12711.15,0....|    0|-0.07607958060088249|
|[29.0,13240.01,1....|    0|-0.20127102129352137|
|[29.0,13255.05,1....|    0|0.031199649579349398|
|[30.0,6744.87,0.0...|    0| 0.10343547596810199|
|[30.0,7960.64,1.0...|    1|  0.1310854518374831|
|[30.0,12788.37,0....|    0| 0.18132861894755514|
|[31.0,7073.61,0.0...|    0| 0.14094237445023117|
|[31.0,10182.6,1.0...|    0|-0.02019453283686623|
|[31.0,12264.68,1....|    0| 0.09299916167446742|
|[32.0,9036.27,0.0...|    0|  0.4405877880976443|
|[32.0,11540.86,0....|    0| -0.2007578118592166|
|[32.0,12403.6,0.0...|    0|-0.09627948894102123|


In [68]:
churn_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction', labelCol='churn')

In [69]:
auc = churn_eval.evaluate(pred_and_labels.predictions)

In [70]:
auc

0.9186741363211953
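
By default, `BinaryClassificationEvaluator` reports the area under the ROC curve. That metric depends only on how the scores rank the examples, not on their scale, which is why a continuous `prediction` column can serve as `rawPredictionCol` here. One way to see this is the pairwise definition of AUC: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The sketch below implements that definition in pure Python; `auc` is a hypothetical helper and the scores and labels are made up, not taken from the churn data.

```python
def auc(scores, labels):
    """Area under the ROC curve as the fraction of (positive, negative)
    pairs where the positive scores higher; ties count as half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and labels, used only to exercise the function.
scores = [-0.2, 0.1, 0.35, 0.8, 0.05]
labels = [0, 0, 1, 1, 0]
print(auc(scores, labels))
```

Because only the ordering matters, rescaling or shifting every score by the same amount leaves the result unchanged.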

## Apply the model to new data

In [71]:
final_lr_model = lr_churn.fit(final_data)

In [72]:
new_customers = spark.read.csv('new_customers.csv', inferSchema=True, header=True)

In [74]:
new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)



In [75]:
test_new_customers = assembler.transform(new_customers)

In [76]:
test_new_customers.printSchema()

root
 |-- Names: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- Total_Purchase: double (nullable = true)
 |-- Account_Manager: integer (nullable = true)
 |-- Years: double (nullable = true)
 |-- Num_Sites: double (nullable = true)
 |-- Onboard_date: timestamp (nullable = true)
 |-- Location: string (nullable = true)
 |-- Company: string (nullable = true)
 |-- features: vector (nullable = true)



In [77]:
final_results = final_lr_model.transform(test_new_customers)

In [78]:
final_results.show()

+--------------+----+--------------+---------------+-----+---------+--------------------+--------------------+----------------+--------------------+-------------------+
|         Names| Age|Total_Purchase|Account_Manager|Years|Num_Sites|        Onboard_date|            Location|         Company|            features|         prediction|
+--------------+----+--------------+---------------+-----+---------+--------------------+--------------------+----------------+--------------------+-------------------+
| Andrew Mccall|37.0|       9935.53|              1| 7.71|      8.0|2011-08-29 18:37:...|38612 Johnny Stra...|        King Ltd|[37.0,9935.53,1.0...|0.22798365244237284|
|Michele Wright|23.0|       7526.94|              1| 9.28|     15.0|2013-07-22 18:19:...|21083 Nicole Junc...|   Cannon-Benson|[23.0,7526.94,1.0...| 0.9873799841785502|
|  Jeremy Chang|65.0|         100.0|              1|  1.0|     15.0|2006-12-11 07:48:...|085 Austin Views ...|Barron-Robertson|[65.0,100.0,1.0,1...| 0.7319

In [79]:
final_results.select('Company', 'prediction').show()

+----------------+-------------------+
|         Company|         prediction|
+----------------+-------------------+
|        King Ltd|0.22798365244237284|
|   Cannon-Benson| 0.9873799841785502|
|Barron-Robertson| 0.7319758617888905|
|   Sexton-Golden| 0.8923338370243068|
|        Wood LLC| 0.3400542608316848|
|   Parks-Robbins| 0.5601580752724327|
+----------------+-------------------+
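
To turn these scores into a decision about which accounts to follow up on, one option is to apply a cutoff. The sketch below uses a hypothetical threshold of 0.5, which is an assumption for illustration and not something the original analysis specifies; in practice the cutoff would be chosen from the trade-off between missed churners and unnecessary interventions.

```python
# Scores copied from the final_results output above.
predictions = {
    'King Ltd': 0.22798365244237284,
    'Cannon-Benson': 0.9873799841785502,
    'Barron-Robertson': 0.7319758617888905,
    'Sexton-Golden': 0.8923338370243068,
    'Wood LLC': 0.3400542608316848,
    'Parks-Robbins': 0.5601580752724327,
}

THRESHOLD = 0.5  # hypothetical cutoff, chosen only for illustration

# Companies at or above the cutoff, highest score first.
at_risk = sorted(
    (company for company, score in predictions.items() if score >= THRESHOLD),
    key=lambda company: predictions[company],
    reverse=True,
)
print(at_risk)
```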

