# Logistic Regression Project



## Task #1: Understand the problem

####  Predicting the Survival of Titanic Passengers
The Titanic survivor dataset records details of the passengers on board and whether each one survived. Using this data, you will build a model that predicts the probability of a passenger's survival from features such as sex, age, and passenger class. It is a binary classification problem.
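
As a minimal sketch of what predicting a probability means in logistic regression: the model forms a weighted sum of the features and passes it through the sigmoid function, which maps any real number into the interval (0, 1). The weights, intercept, and feature values below are made up purely for illustration; they are not fitted to the Titanic data.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_probability(features, weights, intercept):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    score = intercept + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

# Hypothetical weights for [is_female, pclass]; illustrative values only.
example_weights = [2.5, -1.0]
example_intercept = 0.5

p_female_first = predict_probability([1, 1], example_weights, example_intercept)
p_male_third = predict_probability([0, 3], example_weights, example_intercept)

# A probability above 0.5 is classified as "survived" (label 1).
label_female_first = 1 if p_female_first > 0.5 else 0
label_male_third = 1 if p_male_third > 0.5 else 0

print(round(p_female_first, 3), label_female_first)
print(round(p_male_third, 3), label_male_third)
```

The 0.5 cut-off used here matches the default threshold of Spark's LogisticRegression.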

## Task #2: Import Libraries





In [4]:
from pyspark.sql import SparkSession

In [5]:
spark = SparkSession.builder.appName('myproj').getOrCreate()

## Task #3: Preprocessing
The first part of the analysis is to load the data and pre-process it so that it is suitable for machine learning.



## Task #3.1 Loading CSV data




In [6]:

data = spark.read.csv('titanic.csv',inferSchema=True,header=True)

In [8]:
data.printSchema()

root
 |-- PassengerId: integer (nullable = true)
 |-- Survived: integer (nullable = true)
 |-- Pclass: integer (nullable = true)
 |-- Name: string (nullable = true)
 |-- Sex: string (nullable = true)
 |-- Age: double (nullable = true)
 |-- SibSp: integer (nullable = true)
 |-- Parch: integer (nullable = true)
 |-- Ticket: string (nullable = true)
 |-- Fare: double (nullable = true)
 |-- Cabin: string (nullable = true)
 |-- Embarked: string (nullable = true)



In [11]:
data.columns
# The Survived column is the target column.



['PassengerId',
 'Survived',
 'Pclass',
 'Name',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Ticket',
 'Fare',
 'Cabin',
 'Embarked']

In [12]:
my_cols = data.select(['Survived',
 'Pclass',
 'Sex',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'Embarked'])

## Task #3.2 Handling Missing Values
It is important to handle missing values before modeling. These are values that were not observed or were not recorded because of problems in the data-capture process.


In [13]:
# Drop every row that has a missing value in any of the selected columns
my_final_data = my_cols.na.drop()
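
Dropping rows is simple but can discard a lot of data, so it helps to know how many values are missing per column first. The sketch below shows the idea in plain Python on a few made-up rows (not the real dataset), so it runs without a Spark session. It mirrors the default behaviour of na.drop(), which removes a row if any of its columns is null.

```python
# Made-up rows standing in for the selected columns; None marks a missing value.
rows = [
    {"Survived": 0, "Age": 22.0, "Embarked": "S"},
    {"Survived": 1, "Age": None, "Embarked": "C"},
    {"Survived": 1, "Age": 35.0, "Embarked": None},
    {"Survived": 0, "Age": 54.0, "Embarked": "S"},
]

# Count missing values per column.
missing_per_column = {
    col: sum(1 for row in rows if row[col] is None)
    for col in rows[0]
}

# Keep only rows with no missing value in any column (the "any" rule).
complete_rows = [row for row in rows if all(v is not None for v in row.values())]

print(missing_per_column)
print(len(rows), "rows before,", len(complete_rows), "rows after dropping")
```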

## Task #3.3: Handling Categorical Columns
In this project, columns such as Sex and Embarked are categorical variables, so we one-hot encode them using the Spark ML pipeline APIs. In this example, StringIndexer maps each category to a numeric index and OneHotEncoder turns that index into a binary vector.




In [14]:
# VectorIndexer is not used in this notebook, so it is not imported
from pyspark.ml.feature import (VectorAssembler,
                                OneHotEncoder,StringIndexer)

In [15]:
gender_indexer = StringIndexer(inputCol='Sex',outputCol='SexIndex')
gender_encoder = OneHotEncoder(inputCol='SexIndex',outputCol='SexVec')

In [16]:
embark_indexer = StringIndexer(inputCol='Embarked',outputCol='EmbarkIndex')
embark_encoder = OneHotEncoder(inputCol='EmbarkIndex',outputCol='EmbarkVec')
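
To make the two stages concrete, here is a plain-Python sketch of what they compute on a made-up column. It assumes the Spark defaults: StringIndexer assigns index 0 to the most frequent label, and OneHotEncoder drops the last category (dropLast=True), so a column with three categories becomes a vector of length two. The helper names are hypothetical and are not part of the Spark API.

```python
from collections import Counter

def string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically in this sketch)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Return a 0/1 list with a single 1 at `index`; a dropped last category is all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if i == index else 0.0 for i in range(size)]

embarked = ["S", "C", "S", "Q", "S", "C"]   # made-up sample values
mapping = string_index(embarked)             # {'S': 0, 'C': 1, 'Q': 2}
vectors = [one_hot(mapping[v], len(mapping)) for v in embarked]

for label, vec in zip(embarked, vectors):
    print(label, mapping[label], vec)
```

Dropping the last category loses no information, because an all-zero vector already identifies it. Under the same defaults, a two-category column such as Sex is encoded as a vector of length one.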

In [27]:
assembler = VectorAssembler(inputCols=['Pclass',
 'SexVec',
 'Age',
 'SibSp',
 'Parch',
 'Fare',
 'EmbarkVec'],outputCol='features')

## Task #4: Classification using Logistic Regression

In [28]:
from pyspark.ml.classification import LogisticRegression

## Pipelines 



In [29]:
from pyspark.ml import Pipeline

In [30]:
log_reg_titanic = LogisticRegression(featuresCol='features',labelCol='Survived')

In [31]:
pipeline = Pipeline(stages=[gender_indexer,embark_indexer,
                           gender_encoder,embark_encoder,
                           assembler,log_reg_titanic])

## Task #5: Split Data for Train and Holdout
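Split the data into a training set and a hold-out set using Spark's randomSplit method. The weights give roughly a 70/30 split, and because no seed is passed, the exact rows in each set differ between runs.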




In [32]:
train_titanic_data, test_titanic_data = my_final_data.randomSplit([0.7,.3])

## Task #6: Fit the model

In [33]:
fit_model = pipeline.fit(train_titanic_data)

In [34]:

results = fit_model.transform(test_titanic_data)

## Task #7: Evaluate

In [35]:
from pyspark.ml.evaluation import BinaryClassificationEvaluator

In [36]:
my_eval = BinaryClassificationEvaluator(rawPredictionCol='prediction',
                                       labelCol='Survived')

In [37]:
results.select('Survived','prediction').show()

+--------+----------+
|Survived|prediction|
+--------+----------+
|       0|       1.0|
|       0|       0.0|
|       0|       1.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       1.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
|       0|       0.0|
+--------+----------+
only showing top 20 rows



In [41]:
Result = my_eval.evaluate(results)

In [42]:
Result

0.8046195380461955

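* BinaryClassificationEvaluator returns the area under the ROC curve (areaUnderROC, its default metric). Here it is computed from the hard 0/1 prediction column, not from the model's continuous rawPrediction scores.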
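
Since rawPredictionCol='prediction' feeds thresholded labels to the evaluator, a plain-Python sketch with made-up labels and scores can show how that differs from scoring continuous outputs. It computes the area under the ROC curve with the rank-based (Mann-Whitney) formulation, which is an assumption of this sketch rather than Spark's internal implementation.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    total = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                total += 1.0
            elif p == n:
                total += 0.5
    return total / (len(positives) * len(negatives))

labels = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.3, 0.7, 0.8, 0.9]   # hypothetical probabilities
hard = [1.0 if s >= 0.5 else 0.0 for s in scores]   # thresholded predictions

auc_from_scores = roc_auc(labels, scores)
auc_from_hard = roc_auc(labels, hard)

# With hard 0/1 predictions the AUC reduces to balanced accuracy.
tpr = sum(1 for y, h in zip(labels, hard) if y == 1 and h == 1.0) / labels.count(1)
tnr = sum(1 for y, h in zip(labels, hard) if y == 0 and h == 0.0) / labels.count(0)
balanced_accuracy = (tpr + tnr) / 2

print(auc_from_scores, auc_from_hard, balanced_accuracy)
```

In PySpark, scoring the continuous output would mean leaving rawPredictionCol at its default, 'rawPrediction', a column that the fitted LogisticRegression stage adds to results alongside prediction.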