## In which we explore Disasters, Trees, Classification & the Kaggle Competition

1. Read Titanic Data
2. Transform and select features
3. Create a simple model & Predict
4. Decision Tree Model, Predict
5. Random Forest Model, Predict
6. Discussion

Read Titanic Data
-----
### The Data is part of the Kaggle Competition "Titanic: Machine Learning from Disaster"
### Download data from http://www.kaggle.com/c/titanic-gettingStarted

In [None]:
titanic = spark.read.option("header", True).csv("/data/training/titanic/train.csv")
titanic.createOrReplaceTempView("titanic")

In [None]:
spark.sql("SELECT * from titanic LIMIT 10").show()

## Question: How do we transform the data into something we can use with Spark MLlib?

In [None]:
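from pyspark.mllib.regression import LabeledPoint
def num(s):
    # convert a string to int, else float; return 0 when it is missing or not numeric
    try:
        return int(s)
    except (ValueError, TypeError):
        try:
            return float(s)
        except (ValueError, TypeError):
            return 0
#
def parse_train_data(x):
    # train.csv columns: PassengerId, Survived, Pclass, Name, Sex, Age,
    # SibSp, Parch, Ticket, Fare, Cabin, Embarked
    survived = num(x[1])
    pclass = num(x[2])
    # encode sex: male = 1, female = 0
    sex = 1 if x[4] == 'male' else 0
    age = num(x[5])  # a missing age becomes 0
    sibsp = num(x[6])
    parch = num(x[7])
    fare = num(x[9])
    # PassengerId (x[0]) and Cabin (x[10], categorical) are not used for now
    # return labelled point: label = survived, features = the six numeric values
    return LabeledPoint(survived, [pclass, sex, age, sibsp, parch, fare])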
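
To see what the `num` helper does to raw CSV strings, here is a minimal standalone sketch. The sample inputs are illustrative, not rows read from the file, and it assumes the CSV reader hands back strings, or `None` for empty fields.

```python
def num(s):
    """Same helper as above: int if possible, else float, else 0."""
    try:
        return int(s)
    except (ValueError, TypeError):
        try:
            return float(s)
        except (ValueError, TypeError):
            return 0

# Illustrative inputs (assumed, not taken from train.csv)
samples = ["3", "7.25", None, "C85"]
converted = [num(s) for s in samples]
print(converted)  # [3, 7.25, 0, 0]
```

Note that a missing value and a non-numeric value both collapse to 0, so a model trained on these features cannot tell a missing age apart from an age of 0.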

In [None]:
df_train = spark.table("titanic").rdd.map(lambda row: parse_train_data(row))

In [None]:
df_train.count()

In [None]:
df_train.take(3)

In [None]:
for x in df_train.take(3):
    print(x)
# each LabeledPoint prints as (survived,[pclass,sex,age,sibsp,parch,fare])

### Dick, The butcher to Jack Cade
### Dick: The first thing we do, let's kill all the men.
### Cade: Nay, that I mean to do.
#### Ref : http://www.enotes.com/shakespeare-quotes/lets-kill-all-lawyers

In [None]:
spark.sql("SELECT COUNT(*) FROM titanic").show()

In [None]:
spark.sql("SELECT COUNT(*) FROM titanic WHERE Survived = 1").show()

In [None]:
spark.sql("SELECT COUNT(*) FROM titanic WHERE Survived = 1 AND Sex = 'female'").show()

In [None]:
# 233 of the 342 survivors were female (counts from the queries above)
print("Survived = Female : %5.2f%%" % (100.0*233/342))

In [None]:
spark.sql("SELECT COUNT(*) FROM titanic WHERE Sex = 'female'").show()

In [None]:
# 233 of the 314 female passengers survived
print("Female = Survived : %5.2f%%" % (100.0*233/314))
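
The two percentages answer different questions, which is easy to miss when the counts are hard-coded. A small sketch of the arithmetic, using only the counts quoted in the cells above (the `pct` helper and variable names are introduced here for illustration):

```python
def pct(part, whole):
    """Return part/whole as a percentage."""
    return 100.0 * part / whole

# Counts used in the print statements above
survivors = 342          # passengers with Survived = 1
female_survivors = 233   # Survived = 1 AND Sex = 'female'
females = 314            # Sex = 'female'

share_of_survivors = pct(female_survivors, survivors)
female_survival_rate = pct(female_survivors, females)
print("Survivors who were female : %5.2f%%" % share_of_survivors)    # 68.13%
print("Females who survived      : %5.2f%%" % female_survival_rate)  # 74.20%
```

The second figure is the one that motivates the "Female = Survived" strategy: roughly three out of four women in the training set survived.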

## Strategy #1 : Female = Survived

### Read and Predict the test data

In [None]:
#spark.sql("DROP TABLE titanic_test")

In [None]:
titanic_test = spark.read.option("header", True).csv("/data/training/titanic/test.csv")
titanic_test.createOrReplaceTempView("titanic_test")

In [None]:
spark.sql("SELECT * from titanic_test LIMIT 10").show()

In [None]:
def parse_solution(x):
    # test.csv has no Survived column, so Sex is at index 3 (not 4 as in train.csv)
    pass_id = x[0]
    survived = 0
    if x[3]=='female':
        survived = 1
    # return the solution
    return (pass_id,survived)
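
The rule can be checked without Spark, since a Row is indexed by position just like a tuple. A minimal sketch on two made-up rows (the IDs and names below are hypothetical, not taken from test.csv):

```python
def parse_solution(x):
    """Gender-based rule: predict survival for females only."""
    pass_id = x[0]
    survived = 1 if x[3] == 'female' else 0
    return (pass_id, survived)

# Hypothetical rows in test.csv column order: PassengerId, Pclass, Name, Sex
rows = [
    ("9001", "3", "Doe, Mr. John", "male"),
    ("9002", "1", "Doe, Mrs. Jane", "female"),
]
predictions = [parse_solution(r) for r in rows]
print(predictions)  # [('9001', 0), ('9002', 1)]
```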

In [None]:
solution_one = spark.table("titanic_test").rdd.map(lambda row:parse_solution(row))

In [None]:
solution_one.count()

In [None]:
solution_one.take(3)

In [None]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("PassengerId", StringType(), False),
    StructField("Survived", IntegerType(), False)])
s_df = spark.createDataFrame(solution_one, schema)
s_df.createOrReplaceTempView("SolutionOne")

In [None]:
spark.sql("SELECT * FROM SolutionOne").show()

## If you were to submit this to Kaggle
## You'd get a score of ~0.7655 (rank 1276) with the gender-based model
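
A submission for this competition is a CSV file with a `PassengerId` and a `Survived` column. One possible way to produce that text from a list of `(PassengerId, Survived)` pairs is sketched below with the standard `csv` module; the `to_submission_csv` helper and the sample pairs are hypothetical, and in the notebook the pairs could come from something like `solution_one.collect()`.

```python
import csv
import io

def to_submission_csv(predictions):
    """Render (PassengerId, Survived) pairs as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(predictions)
    return buf.getvalue()

# Hypothetical predictions, not real passengers
sample = [("9001", 0), ("9002", 1)]
csv_text = to_submission_csv(sample)
print(csv_text)
```

Collecting to the driver is reasonable here only because the test set is small; for larger outputs, writing directly from the DataFrame would be the more natural choice.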