# John Salmon: Kaggle.com Titanic Survival Challenge

Information about the data and this specific challenge can be found on kaggle.com [here](https://www.kaggle.com/competitions/titanic/data?select=test.csv).

In short, the objective of this challenge is to design a model that performs binary classification to predict whether each Titanic passenger survived, based on factors such as economic status, gender, and placement on the ship.

For my analysis I will be using PySpark's machine learning library, pyspark.ml.
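pyspark.ml imports NumPy internally, so the imports fail with a ModuleNotFoundError if NumPy is not installed in the same environment. Below is a minimal sketch of a pre-flight check using only the standard library. The helper name `missing_modules` is hypothetical and not part of PySpark.

```python
import importlib.util

def missing_modules(names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Modules this notebook relies on; numpy is needed by pyspark.ml
required = ['pyspark', 'numpy']
missing = missing_modules(required)
if missing:
    print('Install before running the notebook:', ', '.join(missing))
else:
    print('All required modules are available.')
```

Running this in the notebook's kernel before the imports cell shows which packages still need to be installed, for example with `pip install pyspark numpy`.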

In [2]:
#imports
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, VectorAssembler, OneHotEncoder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

ModuleNotFoundError: No module named 'numpy'

In [1]:


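#init spark session
spark = SparkSession.builder.appName('BinaryClassification').getOrCreate()

#load data
training = spark.read.option('inferSchema', 'true').option('header', 'true').csv('Titanic Survival/titanic/train.csv')
#test.csv has no Survived column, so it can only be scored, not evaluated
test = spark.read.option('inferSchema', 'true').option('header', 'true').csv('Titanic Survival/titanic/test.csv')

#hold out part of the labelled data for evaluation
train, validation = training.randomSplit([0.8, 0.2], seed=42)

#Data prep
#index the categorical columns, then one-hot encode the indices
#('keep' puts null or unseen values in their own bucket instead of raising an error)
catcols = ['Sex', 'Embarked']
indexers = [StringIndexer(inputCol=column, outputCol=column + 'Index', handleInvalid='keep') for column in catcols]
encoder = OneHotEncoder(inputCols=[column + 'Index' for column in catcols], outputCols=[column + 'Vec' for column in catcols])
#PassengerId is a row identifier, and Name, Ticket and Cabin are free-text strings that
#OneHotEncoder cannot consume directly, so they are left out of the feature vector
#('skip' drops rows with missing values, e.g. a null Age)
vectorAssembler = VectorAssembler(inputCols=['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'SexVec', 'EmbarkedVec'], outputCol='Features', handleInvalid='skip')

#Train Models
lr = LogisticRegression(featuresCol = 'Features', labelCol = 'Survived', maxIter = 10)
#stages must be a flat list, ordered so each stage's input columns already exist
pipeline = Pipeline(stages = indexers + [encoder, vectorAssembler, lr])
model = pipeline.fit(train)

#eval
predictions = model.transform(validation)
#BinaryClassificationEvaluator reports area under the ROC curve by default, not accuracy
evaluator = BinaryClassificationEvaluator(labelCol='Survived', metricName='areaUnderROC')
auc = evaluator.evaluate(predictions)
print('Validation AUC:', auc)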



NameError: name 'SparkSession' is not defined
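
Note that BinaryClassificationEvaluator reports the area under the ROC curve by default rather than accuracy, so the two numbers should not be confused. The difference can be sketched in plain Python. The labels and scores below are made up for illustration, and the helpers `accuracy` and `auc` are hypothetical names, not PySpark functions.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = survived) and model scores
labels = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.3, 0.2]

print('accuracy:', accuracy(labels, scores))  # 0.75: the survivor scored 0.4 falls below the 0.5 threshold
print('AUC:', auc(labels, scores))            # 1.0: every survivor is ranked above every non-survivor
```

AUC only measures how well the scores rank positives above negatives, while accuracy depends on the chosen threshold. If accuracy is the metric of interest, it has to be computed separately from the thresholded predictions.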