# John Salmon: Kaggle.com Titanic Survival Challenge

Information about the data and this specific challenge can be found on kaggle.com [here](https://www.kaggle.com/competitions/titanic/data?select=test.csv).

In short, the goal of this challenge is to design a binary classification model that predicts whether each Titanic passenger survived, based on factors such as economic status, gender, and location on the ship.

For my analysis, I will be using PySpark for the machine learning.

In [16]:
# Imports
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Initialize Spark session
spark = SparkSession.builder.appName('BinaryClassification').getOrCreate()

In [18]:


# Load data
training = spark.read.option('inferSchema', 'true').option('header', 'true').csv('/Users/mainuser/Desktop/Kaggle-Challenges/Titanic_Survival/Challenge_Data/train.csv')
test = spark.read.option('inferSchema', 'true').option('header', 'true').csv("/Users/mainuser/Desktop/Kaggle-Challenges/Titanic_Survival/Challenge_Data/test.csv")

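# Fill nulls: fillna(0) only affects numeric columns (a missing Age or Fare becomes 0, a crude imputation),
# so string columns such as Cabin and Embarked get an explicit 'Unknown' value
training = training.fillna(0).fillna('Unknown')
test = test.fillna(0).fillna('Unknown')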

# Add a placeholder Survived column to the test data; Kaggle's test.csv has no labels
test = test.withColumn('Survived', lit(-1))

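# Data preparation
# Only the low-cardinality categorical columns are encoded; Name, Ticket and Cabin are
# high-cardinality text columns, so one-hot encoding them would mostly memorize the training rows
catcols = ['Sex', 'Embarked']
# handleInvalid='keep' sends nulls and categories unseen during fitting to an extra bucket
indexers = [StringIndexer(inputCol=column, outputCol=column + 'Index', handleInvalid='keep') for column in catcols]
encoder = OneHotEncoder(inputCols=[column + 'Index' for column in catcols], outputCols=[column + 'Vec' for column in catcols])
# PassengerId is a row identifier rather than a predictor, so it is left out of the features;
# the encoded categorical vectors are now included so the model can actually use them
vectorAssembler = VectorAssembler(inputCols=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare'] + [column + 'Vec' for column in catcols], outputCol='Features')

# Train model
lr = LogisticRegression(featuresCol='Features', labelCol='Survived', maxIter=10)
# Stage order matters: index, then encode, then assemble, then fit
pipeline = Pipeline(stages=indexers + [encoder, vectorAssembler, lr])
model = pipeline.fit(training)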

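# Evaluation
predictions = model.transform(test)
# BinaryClassificationEvaluator reports area under the ROC curve by default, not accuracy.
# The test set's Survived column is only the -1 placeholder added above, so this score
# does not measure model quality.
evaluator = BinaryClassificationEvaluator(labelCol="Survived", metricName="areaUnderROC")
auc = evaluator.evaluate(predictions)
print('Test AUC (placeholder labels):', auc)

Test AUC (placeholder labels): 0.0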
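
The 0.0 above is scored against placeholder labels, and the evaluator's default metric is area under the ROC curve rather than accuracy. As a rough illustration of how the two metrics differ, here is a minimal pure-Python sketch. The helper names `accuracy` and `roc_auc`, and the labels and scores, are made up for this example and are not output from the Titanic model. It also shows why a label column with no positive examples cannot produce a usable AUC.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of rows whose thresholded score matches the label."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    correct = sum(1 for y, p in zip(labels, predicted) if y == p)
    return correct / len(labels)


def roc_auc(labels, scores):
    """Chance that a random positive row outscores a random negative row (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        return None  # AUC is undefined unless both classes are present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted survival probabilities
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.2, 0.4, 0.6, 0.1, 0.8]

acc = accuracy(labels, scores)
auc = roc_auc(labels, scores)

# Every label set to the -1 placeholder: there are no positives to rank
placeholder_auc = roc_auc([-1] * len(scores), scores)

print('accuracy:', acc)
print('AUC:', auc)
print('AUC with placeholder labels:', placeholder_auc)
```

To get a meaningful score for this challenge, one option is to hold out part of `train.csv`, which does carry real `Survived` labels, and evaluate on that split while keeping `test.csv` for the Kaggle submission only.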
