# John Salmon: Kaggle.com Titanic Survival Challenge

Information about the data and this specific challenge can be found on kaggle.com [here](https://www.kaggle.com/competitions/titanic/data?select=test.csv).

In short, the goal of this challenge is to build a binary classification model that predicts whether each Titanic passenger survived, based on factors such as economic status, gender, and location on the ship.

For this analysis I will be using PySpark's machine learning library.
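
The binary classification task described above is scored on Kaggle by accuracy, the fraction of passengers whose outcome is predicted correctly. As a rough illustration, here is a minimal plain-Python sketch of that metric. The `accuracy` helper and the labels are made up for this example and are not part of the challenge data or of PySpark.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Made-up labels for illustration only: 1 = survived, 0 = did not survive
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

score = accuracy(y_true, y_pred)  # 4 of the 5 predictions match
print(score)
```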

In [38]:
#imports
from pyspark.sql import SparkSession
from pyspark.ml.feature import Imputer, StringIndexer, OneHotEncoder, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.sql.functions import lit, col
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# Initialize Spark session
spark = SparkSession.builder.appName('BinaryClassification').getOrCreate()

In [39]:
# Load data
training = spark.read.option('inferSchema', 'true').option('header', 'true').csv(r'/Volumes/T7 Shield APFS/Non School Projects/Kaggle-Challenges/Titanic_Survival/Challenge_Data/train.csv')
test = spark.read.option('inferSchema', 'true').option('header', 'true').csv(r'/Volumes/T7 Shield APFS/Non School Projects/Kaggle-Challenges/Titanic_Survival/Challenge_Data/test.csv')


In [42]:
# Handle missing values
# Fit the imputer on the training data only, then reuse the fitted model on the test set
imputer = Imputer(inputCols=['Age', 'Fare'], outputCols=['Age', 'Fare']).setStrategy('mean')  # fill with mean
imputer_model = imputer.fit(training)
training = imputer_model.transform(training)
test = imputer_model.transform(test)

training = training.fillna({'Cabin': 'Unknown'})
test = test.fillna({'Cabin': 'Unknown'})

# Add a placeholder Survived column to the test set (Kaggle does not provide the true labels)
test = test.withColumn('Survived', lit(-1))
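
Mean imputation replaces each missing value with the column average. A common practice is to compute that average on the training data and reuse it for the test data, so both sets are filled consistently. Below is a minimal plain-Python sketch with made-up ages. `fit_mean` and `fill_missing` are hypothetical helpers written for this illustration, not PySpark calls.

```python
def fit_mean(values):
    """Return the mean of the non-missing values (None marks a missing entry)."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)

def fill_missing(values, fill_value):
    """Replace every None with fill_value, leaving other entries untouched."""
    return [fill_value if v is None else v for v in values]

# Made-up ages for illustration only
train_ages = [22.0, None, 38.0, 30.0]
test_ages = [None, 40.0]

# Learn the fill value from the training data, then apply it to both sets
train_mean = fit_mean(train_ages)
train_filled = fill_missing(train_ages, train_mean)
test_filled = fill_missing(test_ages, train_mean)
```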

In [1]:
#Data preparation

catcols = ['Sex', 'Cabin', 'SibSp', 'Parch', 'Embarked']

# handleInvalid='keep' assigns unseen or missing categories to an extra index instead of dropping rows
indexers = [StringIndexer(inputCol=c, outputCol=c + 'Index', handleInvalid='keep') for c in catcols]
encoders = [OneHotEncoder(inputCol=c + 'Index', outputCol=c + 'Vec') for c in catcols]
# Leave the label (Survived) and the row identifier (PassengerId) out of the feature vector
assembler = VectorAssembler(inputCols=['Pclass', 'SexVec', 'Age', 'SibSpVec', 'ParchVec', 'Fare', 'CabinVec', 'EmbarkedVec'], outputCol='Features')

rf = RandomForestClassifier(featuresCol = 'Features', labelCol = 'Survived', numTrees = 10)

pipeline = Pipeline(stages=[*indexers, *encoders, assembler, rf])



# Train Models


# Define parameter grid
paramGrid = ParamGridBuilder().addGrid(rf.numTrees, [10, 20, 50]).addGrid(rf.maxDepth, [5, 10, 20]).build()

# Set up cross-validator
crossval = CrossValidator(estimator = pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(labelCol = "Survived"),
                          numFolds=5)  # Use 5-fold cross-validation

# Fit model using CrossValidator
cvModel = crossval.fit(training)

# Apply the best model to the test data
test_transformed = cvModel.transform(test)

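# Evaluation
# The Kaggle test set has no true labels, so report the cross-validated score instead.
# BinaryClassificationEvaluator defaults to area under the ROC curve, not accuracy.
best_auc = max(cvModel.avgMetrics)
print('Best cross-validated AUC:', best_auc)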

# Select PassengerId and the prediction, renamed to Survived and cast to an integer (0 or 1)
output = test_transformed.select('PassengerId', col('prediction').cast('int').alias('Survived'))

# Save the DataFrame as CSV (Spark writes a directory at this path; coalesce(1) yields a single part file)
output.coalesce(1).write.csv('/path/to/save/output.csv', header=True)



NameError: name 'StringIndexer' is not defined
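
Before uploading, it can help to sanity-check the submission file. Kaggle expects a header row with `PassengerId` and `Survived`, one row per test passenger (418 for this challenge), and a `Survived` value of 0 or 1. Here is a minimal sketch using only the standard library. `check_submission` is a hypothetical helper, and the sample rows are made up.

```python
import csv
import io

def check_submission(csv_text, expected_rows):
    """Return True if the CSV text has the expected header, row count, and 0/1 labels."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames != ['PassengerId', 'Survived']:
        return False
    rows = list(reader)
    if len(rows) != expected_rows:
        return False
    # Labels must be the integer strings '0' or '1', not floats such as '1.0'
    return all(row['Survived'] in ('0', '1') for row in rows)

# Made-up two-row sample; a real submission would use expected_rows=418
sample = "PassengerId,Survived\n892,0\n893,1\n"
ok = check_submission(sample, expected_rows=2)

# A float-formatted label fails the check
bad = check_submission("PassengerId,Survived\n892,1.0\n", expected_rows=1)
```

To check a real file, read the part file that Spark wrote into a string and pass it in with `expected_rows=418`.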