# Big Data: Tools and Methods

# Assignment 4 - Xin Zhou

<font color=blue>
Titanic Dataset<br>
Source: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic3info.txt<br>
The Titanic data set describes the survival status of individual passengers on the Titanic. It does not contain information for the crew, but it does contain actual and estimated ages for almost 80% of the passengers.

|Feature|Explain|
|------|------|
| pclass | Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)   |
|survival|Survival (0 = No; 1 = Yes)|
|name|Name|
|sex|Sex|
|age|Age|
|sibsp|Number of Siblings/Spouses Aboard|
|parch|Number of Parents/Children Aboard|
|ticket|Ticket Number|
|fare|Passenger Fare|
|cabin|Cabin|
|embarked|Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)|
|boat|Lifeboat|
|body|Body Identification Number|
|home.dest|Home/Destination|

<font color=blue>Notes:<br>
Pclass is a proxy for socio-economic status: 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower.<br>
Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.

<font color=green> #1. Did age have any effect on the survival of the passengers? Divide the passengers into age groups spanning 5 years each - [0, 5), [5, 10), [10, 15), ... . For each group compute the number of passengers in the group. Then compute the percent of survivors in each group.<br>


In [1]:
import org.apache.spark.sql.types.{StructField, StructType, StringType, DoubleType, IntegerType}
val schema = new StructType(Array(
  new StructField("survived", StringType, true),
  new StructField("sex", StringType, true),
  new StructField("age", DoubleType, true),
  new StructField("pclass", StringType, true),
  new StructField("name", StringType, true),
  new StructField("sibsp", IntegerType, true),
  new StructField("parch", IntegerType, true),
  new StructField("fare", DoubleType, true),
  new StructField("cabin", StringType, true),
  new StructField("embarked", StringType, true),
  new StructField("boat", StringType, true),
  new StructField("body", IntegerType, true),
  new StructField("homeDest", StringType, true)))

// val titanicDf = spark.read.option("header", true).option("inferSchema",true).option("sep", "\t").csv("titanic.tsv")
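// The home.dest header in the original TSV file was renamed to homeDest to avoid a compile error.
// NOTE: judging from the show() output below, the file appears to contain a ticket column between
// parch and fare that this schema omits, so values from fare onward are shifted by one column.
// The analyses below use only survived, sex, age, and pclass, which are unaffected.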
val titanicDf = spark.read.format("csv").schema(schema).option("header",true).option("sep", "\t").load("titanic.tsv")

titanicDf.printSchema
titanicDf.show(3)

root
 |-- survived: string (nullable = true)
 |-- sex: string (nullable = true)
 |-- age: double (nullable = true)
 |-- pclass: string (nullable = true)
 |-- name: string (nullable = true)
 |-- sibsp: integer (nullable = true)
 |-- parch: integer (nullable = true)
 |-- fare: double (nullable = true)
 |-- cabin: string (nullable = true)
 |-- embarked: string (nullable = true)
 |-- boat: string (nullable = true)
 |-- body: integer (nullable = true)
 |-- homeDest: string (nullable = true)

+--------+------+----+------+--------------------+-----+-----+--------+--------+--------+----+----+--------+
|survived|   sex| age|pclass|                name|sibsp|parch|    fare|   cabin|embarked|boat|body|homeDest|
+--------+------+----+------+--------------------+-----+-----+--------+--------+--------+----+----+--------+
|       y|female|29.0| first|Allen, Miss. Elis...|    0|    0| 24160.0|211.3375|      B5|   S|   2|    null|
|       y|  male|null| first|Allison, Master. ...|    1|    2|113781.0| 

In [2]:
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.functions._

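// Bucket boundaries 0, 5, 10, ..., 100, giving the 5-year groups [0, 5), [5, 10), ...
val splits = (0 to 20).map(_ * 5.0).toArray
// age_above holds the bucket index; multiplying it by 5 below gives the lower bound of each age group
val bucketizer = new Bucketizer().setInputCol("age").setOutputCol("age_above").setSplits(splits)
val bucketed = bucketizer.transform(titanicDf)
// Passengers with no recorded age end up in the null group
bucketed.groupBy("age_above").agg(count("survived").alias("count"),
                                  round((count(when(col("survived")==="y",1))/count("survived")),3).
                                  alias("survival_rate")).sort("age_above").
                                    select(col("age_above")*5, col("count"), col("survival_rate")).show()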

+---------------+-----+-------------+
|(age_above * 5)|count|survival_rate|
+---------------+-----+-------------+
|           null|  264|         0.28|
|            0.0|   50|         0.64|
|            5.0|   31|        0.548|
|           10.0|   27|        0.407|
|           15.0|  116|        0.388|
|           20.0|  184|        0.386|
|           25.0|  160|         0.35|
|           30.0|  132|        0.409|
|           35.0|  100|         0.44|
|           40.0|   69|         0.29|
|           45.0|   66|        0.485|
|           50.0|   43|        0.488|
|           55.0|   27|        0.407|
|           60.0|   27|         0.37|
|           65.0|    5|          0.0|
|           70.0|    6|          0.0|
|           75.0|    1|          1.0|
|           80.0|    1|          1.0|
+---------------+-----+-------------+
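The grouping above can also be sketched in plain Scala, without Spark, to make the bucket arithmetic explicit. This is only an illustrative sketch: the `AgeBuckets` object, its method names, and the `sample` rows are hypothetical and are not part of the assignment code or the dataset. It maps each age directly to the lower bound of its 5-year group, whereas the notebook gets the same value by multiplying the `Bucketizer` index by 5.

```scala
object AgeBuckets {
  // Lower bound of the bucket [lo, lo + width) that contains `age`.
  def bucketStart(age: Double, width: Double = 5.0): Double =
    math.floor(age / width) * width

  // For each bucket: (number of passengers, fraction who survived).
  def survivalByBucket(rows: Seq[(Double, Boolean)]): Map[Double, (Int, Double)] =
    rows.groupBy { case (age, _) => bucketStart(age) }
      .map { case (lo, group) =>
        val survivors = group.count(_._2)
        lo -> ((group.size, survivors.toDouble / group.size))
      }

  // Hypothetical (age, survived) rows, not taken from the dataset.
  val sample: Seq[(Double, Boolean)] =
    Seq((0.75, true), (4.0, false), (5.0, true), (29.0, true), (29.5, false))
}
```

Calling `AgeBuckets.survivalByBucket(AgeBuckets.sample)` groups the five hypothetical rows into the buckets starting at 0, 5, and 25. Rows without an age would have to be filtered out first, which corresponds to the `null` group in the Spark output.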



<font color=blue>For the following problems divide the data into a training set and a test set. After you have created your models in problems 2-4 compute the percent false positives and false negatives you get from your model on the test set.


<font color=green>#2. Logistic on age. Using logistic regression with independent variable age and dependent variable survived create a model to classify passengers as survivors.

In [3]:
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.RFormula

val ageCleaned = titanicDf.select("age","survived").na.drop("any", Seq("age", "survived")) // drop rows with a null age or survived value
val formula = new RFormula().setFormula("survived ~ age")
val preparedDF = formula.fit(ageCleaned).transform(ageCleaned)
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3)) 
val lr =  new LogisticRegression()
val lrModel = lr.fit(train)
val ageLrPredict = lrModel.evaluate(test).predictions
ageLrPredict.show(5)

val label_predict_AgeLR = ageLrPredict.select("label","prediction")//for question 5

+------+--------+--------+-----+--------------------+--------------------+----------+
|   age|survived|features|label|       rawPrediction|         probability|prediction|
+------+--------+--------+-----+--------------------+--------------------+----------+
|0.1667|       y|[0.1667]|  1.0|[0.29849381055224...|[0.57407427506285...|       0.0|
|0.3333|       n|[0.3333]|  0.0|[0.29913280095508...|[0.57423050912405...|       0.0|
|0.4167|       y|[0.4167]|  1.0|[0.29945267970416...|[0.57430871436789...|       0.0|
|0.6667|       y|[0.6667]|  1.0|[0.30041154885608...|[0.57454312026661...|       0.0|
|0.8333|       y|[0.8333]|  1.0|[0.30105053925892...|[0.57469930975920...|       0.0|
+------+--------+--------+-----+--------------------+--------------------+----------+
only showing top 5 rows
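To see how the `probability` and `prediction` columns relate to `rawPrediction`, the small sketch below applies the logistic function to the first raw score in the table. The `LogisticCheck` object and its method names are hypothetical helpers for illustration, and the default 0.5 cut-off is assumed to match the model's default threshold.

```scala
object LogisticCheck {
  // Standard logistic (sigmoid) function.
  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Predict class 1 only when P(label = 1) exceeds the threshold.
  def predict(pClass1: Double, threshold: Double = 0.5): Double =
    if (pClass1 > threshold) 1.0 else 0.0

  // First rawPrediction entry of the first row shown above (digits as displayed).
  val rawClass0: Double = 0.29849381055224

  // P(label = 0) and P(label = 1) for that row.
  val pClass0: Double = sigmoid(rawClass0)
  val pClass1: Double = 1.0 - pClass0
}
```

Here `LogisticCheck.pClass0` comes out at about 0.574, matching the first entry of the `probability` column, so the probability of survival is below 0.5 and `predict` returns 0.0, as in the table.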



<font color=green>#3. Logistic on age, sex and pclass. Same as problem two but use independent variables sex, age, and pclass. Since sex and pclass are categorical they need special treatment.

In [4]:
val dataCleaned = titanicDf.select("age","sex","pclass","survived").na.drop("any",Seq("age")).
                            na.drop("any",Seq("sex")).na.drop("any",Seq("pclass"))
val formula = new RFormula().setFormula("survived ~ age + sex + pclass")
val preparedDF = formula.fit(dataCleaned).transform(dataCleaned)
val Array(train, test) = preparedDF.randomSplit(Array(0.7, 0.3)) 
val lr =  new LogisticRegression()
val lrModel = lr.fit(train)
val lrPredict = lrModel.evaluate(test).predictions
lrPredict.show(5)

val label_predict_LR = lrPredict.select("label", "prediction")//for question 5

+----+------+------+--------+------------------+-----+--------------------+--------------------+----------+
| age|   sex|pclass|survived|          features|label|       rawPrediction|         probability|prediction|
+----+------+------+--------+------------------+-----+--------------------+--------------------+----------+
|0.75|female| third|       y|[0.75,0.0,1.0,0.0]|  1.0|[-1.3494368343811...|[0.20596245773852...|       1.0|
|0.75|  male| third|       n|[0.75,1.0,1.0,0.0]|  0.0|[1.08192737601279...|[0.74685854804200...|       0.0|
| 1.0|female| third|       n| [1.0,0.0,1.0,0.0]|  0.0|[-1.3387414995218...|[0.20771709468526...|       1.0|
| 1.0|  male| third|       y| [1.0,1.0,1.0,0.0]|  1.0|[1.09262271087211...|[0.74887527336629...|       0.0|
| 2.0|female|second|       y|     (4,[0],[2.0])|  1.0|[-2.4116583150749...|[0.08228800089114...|       1.0|
+----+------+------+--------+------------------+-----+--------------------+--------------------+----------+
only showing top 5 rows
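The `features` column shows how the categorical variables were turned into numbers. The sketch below reproduces the layout that the displayed rows suggest; it is an inference from those rows, not a description of `RFormula` internals. In particular, the position of the `first` indicator is an assumption, because no first-class row appears in the sample output. The `FeatureEncoding` object is a hypothetical helper.

```scala
object FeatureEncoding {
  // Assumed layout, inferred from the rows shown above:
  //   [age, sex == male, pclass == third, pclass == first]
  // "female" and "second" act as reference levels (all indicators zero).
  def encode(age: Double, sex: String, pclass: String): Array[Double] =
    Array(
      age,
      if (sex == "male") 1.0 else 0.0,
      if (pclass == "third") 1.0 else 0.0,
      if (pclass == "first") 1.0 else 0.0 // assumption: not visible in the sample rows
    )
}
```

For example, `FeatureEncoding.encode(2.0, "female", "second")` gives `[2.0, 0.0, 0.0, 0.0]`, which is the dense form of the sparse vector `(4,[0],[2.0])` in the last row.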



<font color=green>#4. Decision tree. Instead of using logistic regression, use a decision tree with the independent variables sex, age, and pclass.

In [5]:
import org.apache.spark.ml.regression.DecisionTreeRegressionModel
import org.apache.spark.ml.regression.DecisionTreeRegressor

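// Reuse the train/test split from problem 3 so the decision tree and logistic regression see the same data.
// A regression tree is used, so each prediction is a value between 0 and 1 that is thresholded at 0.5 below.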

val dt =  new DecisionTreeRegressor().setLabelCol("label").setFeaturesCol("features")
val dtModel = dt.fit(train)

val predictionsDT = dtModel.transform(test)
predictionsDT.show(5)

import org.apache.spark.ml.evaluation.RegressionEvaluator
val evaluator = new RegressionEvaluator().setLabelCol("label").setPredictionCol("prediction").setMetricName("rmse")

val rmse = evaluator.evaluate(predictionsDT)
// println("Root Mean Squared Error (RMSE) on test data = " + rmse)

// println("Learned regression tree model:\n" + dtModel.toDebugString)

val label_predict_DT = predictionsDT.
        select("label", "prediction").
        withColumn("prediction", 
        when(col("prediction") > 0.5, 1).otherwise(0)) // for question 5; threshold at 0.5

+----+------+------+--------+------------------+-----+------------------+
| age|   sex|pclass|survived|          features|label|        prediction|
+----+------+------+--------+------------------+-----+------------------+
|0.75|female| third|       y|[0.75,0.0,1.0,0.0]|  1.0|0.8571428571428571|
|0.75|  male| third|       n|[0.75,1.0,1.0,0.0]|  0.0|0.3333333333333333|
| 1.0|female| third|       n| [1.0,0.0,1.0,0.0]|  0.0|0.8571428571428571|
| 1.0|  male| third|       y| [1.0,1.0,1.0,0.0]|  1.0|0.3333333333333333|
| 2.0|female|second|       y|     (4,[0],[2.0])|  1.0|               1.0|
+----+------+------+--------+------------------+-----+------------------+
only showing top 5 rows



<font color=green>#5. How do the models created in problems 2-4 compare based on the false positives & false negatives they produce on your test data?

In [6]:
println("70% data for training, 30% data for testing")
println("------------------------------------")
val FP2 = label_predict_AgeLR.filter("label=0").filter("prediction=1").count()
val FN2 = label_predict_AgeLR.filter("label=1").filter("prediction=0").count()
val total2 = label_predict_AgeLR.count()
println("Age feature logistic regression:")
println("Total samples: "+ total2)
println("False positive: "+ FP2)
println("False negative: "+ FN2)
println("Misclassification rate: " + math.round((FP2+FN2)*1000.0/total2.toDouble)/10.0+"%")
println("------------------------------------")

val FP3 = label_predict_LR.filter("label=0").filter("prediction=1").count()
val FN3 = label_predict_LR.filter("label=1").filter("prediction=0").count()
val total3 = label_predict_LR.count()
println("Logistic regression:")
println("Total samples: "+ total3)
println("False positive: "+ FP3)
println("False negative: "+ FN3)
println("Misclassification rate: " + math.round((FP3+FN3)*1000.0/total3.toDouble)/10.0+"%")
println("------------------------------------")

val FP4 = label_predict_DT.filter("label=0").filter("prediction=1").count()
val FN4 = label_predict_DT.filter("label=1").filter("prediction=0").count()
val total4 = label_predict_DT.count()
println("Decision tree:")
println("Total samples: "+ total4)
println("False positive: "+ FP4)
println("False negative: "+ FN4)
println("Misclassification rate: " + math.round((FP4+FN4)*1000.0/total4.toDouble)/10.0+"%")

70% data for training, 30% data for testing
------------------------------------
Age feature logistic regression:
Total samples: 339
False positive: 0
False negative: 145
Misclassification rate: 42.8%
------------------------------------
Logistic regression:
Total samples: 312
False positive: 34
False negative: 35
Misclassification rate: 22.1%
------------------------------------
Decision tree:
Total samples: 312
False positive: 18
False negative: 54
Misclassification rate: 23.1%
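The reported rates can be rechecked from the counts alone. The sketch below is a plain-Scala version of the same arithmetic used in the cell above; the `ErrorCounts` object and the `Counts` case class are hypothetical names introduced for illustration.

```scala
object ErrorCounts {
  final case class Counts(fp: Long, fn: Long, total: Long) {
    // Misclassification rate in percent, rounded to one decimal place.
    def misclassificationPct: Double =
      math.round((fp + fn) * 1000.0 / total) / 10.0
  }

  // Count errors from (label, prediction) pairs, each value being 0.0 or 1.0.
  def count(pairs: Seq[(Double, Double)]): Counts =
    Counts(
      fp = pairs.count { case (label, pred) => label == 0.0 && pred == 1.0 }.toLong,
      fn = pairs.count { case (label, pred) => label == 1.0 && pred == 0.0 }.toLong,
      total = pairs.size.toLong
    )

  // Counts copied from the output above.
  val ageOnly = Counts(fp = 0, fn = 145, total = 339)
  val logistic = Counts(fp = 34, fn = 35, total = 312)
  val tree = Counts(fp = 18, fn = 54, total = 312)
}
```

Evaluating `misclassificationPct` on the three sets of counts gives 42.8, 22.1, and 23.1, in agreement with the printed results.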


## Comment:<br>
Logistic regression on age alone has a very high misclassification rate (42.8%). It produces 0 false positives and 145 false negatives, which suggests that the model predicts that no passenger survived. A single feature is not enough for the logistic regression model to perform well.<br>
Using the three features sex, age, and pclass gives a much lower misclassification rate than the one-feature model, for both logistic regression (22.1%) and the decision tree (23.1%). Trained on the same data, the two models have almost the same misclassification rate. However, the decision tree (with a prediction threshold of 0.5) has many more false negatives than logistic regression (54 vs. 35), while logistic regression has more false positives than the decision tree (34 vs. 18).<br>
Further tests with the threshold: if I raise the decision tree prediction threshold above 0.5, FN increases while FP and the misclassification rate both decrease. If I lower the threshold below 0.5, FP and the misclassification rate both increase while FN decreases. For the current random split, with the threshold set to 0.45 the decision tree has almost the same FN, FP, and misclassification rate as logistic regression.
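
The trade-off between FP and FN as the threshold moves can be illustrated with a small plain-Scala sketch. The `ThresholdSweep` object and its `sample` data are hypothetical and not taken from the test set; the sketch only shows the general mechanism that raising the threshold can never add false positives and can never remove false negatives.

```scala
object ThresholdSweep {
  // (false positives, false negatives) when scores above `threshold` are classified as 1.
  def errorsAt(threshold: Double, scored: Seq[(Double, Double)]): (Int, Int) = {
    val fp = scored.count { case (label, score) => label == 0.0 && score > threshold }
    val fn = scored.count { case (label, score) => label == 1.0 && score <= threshold }
    (fp, fn)
  }

  // Hypothetical (label, score) pairs, not taken from the test set.
  val sample: Seq[(Double, Double)] = Seq(
    (1.0, 0.86), (0.0, 0.33), (0.0, 0.86),
    (1.0, 0.33), (1.0, 0.48), (0.0, 0.48)
  )
}
```

On this hypothetical sample, lowering the threshold from 0.5 to 0.45 moves the two rows with score 0.48 to the positive class, which trades one false negative for one additional false positive.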