# MACHINE LEARNING: SUPERVISED LEARNING 🤖
# Predict Titanic Survival
The RMS Titanic set sail on its maiden voyage in 1912, crossing the Atlantic from Southampton, England to New York City. The ship never completed the voyage: it sank to the bottom of the Atlantic Ocean after hitting an iceberg, taking with it 1,502 of the 2,224 passengers and crew onboard.

In this project you will create a Logistic Regression model that predicts which passengers survived the sinking of the Titanic, based on features like age and class.

The data we will be using for training our model is provided by Kaggle. Feel free to make the model better on your own and submit it to the Kaggle Titanic competition!

If you get stuck during this project or would like to see an experienced developer complete it, check out the project walkthrough video, which can be found in the “get help” menu in the bottom-right of this window.

# ---------------------------------------------------------------------------------
# ---------------------------------------------------------------------------------
# ---------------------------------------------------------------------------------

# Before creating the model first import all relevant libraries.

In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 1. The file passengers.csv contains the data of 891 passengers onboard the Titanic when it sank that fateful day.
#   Let’s begin by loading the data into a pandas DataFrame named passengers. 
Print passengers and inspect the columns. What features could we use to predict survival?

In [4]:
# Load the data; the file name matches passengers.csv exactly (case matters on some systems)
passengers = pd.read_csv('passengers.csv')
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
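One way to approach the question above is to check column types and missing values before choosing features. The sketch below is only an illustration: `sample` is a hypothetical hand-made frame with made-up rows, used so the code runs without passengers.csv. On the real data the same calls would be made on passengers.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a few rows of passengers.csv
sample = pd.DataFrame({
    'Survived': [0, 1, 1, 0],
    'Pclass': [3, 1, 3, 2],
    'Sex': ['male', 'female', 'female', 'male'],
    'Age': [22.0, 38.0, np.nan, 35.0],
})

# dtypes shows which columns are already numeric and which are text
print(sample.dtypes)

# Number of missing values in each column
missing_counts = sample.isna().sum()
print(missing_counts)
```

Text columns such as Sex need to be converted to numbers, and columns with missing values such as Age need to be filled, before they can be passed to the model.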


# Clean the Data

# --------------------------------------------------------------------
# --------------------------------------------------------------------
# --------------------------------------------------------------------

# 2. Given the saying, “women and children first,” Sex and Age seem like good features to predict survival. 
# Let’s map the text values in the Sex column to a numerical value. 
# Update Sex such that all values female are replaced with 1 and all values male are replaced with 0.

We can update the Sex column in passengers with the following syntax:

passengers['Sex'] = expression_for_new_values
To map the values in Sex we can use pandas’ .map() method with the following syntax:
    
data_frame['column'].map({'initial_value_1': 'updated_value_1', 'initial_value_2': 'updated_value_2'})

In [6]:
# Update the Sex column to numerical values (male -> 0, female -> 1)
passengers['Sex'] = passengers['Sex'].map({'male': 0, 'female': 1})
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
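A detail worth checking after a .map() call: any value that is not a key in the dictionary becomes NaN instead of raising an error. The small sketch below uses a hypothetical Series with one unexpected label to show this; it is not taken from the project data.

```python
import pandas as pd

# Hypothetical column with one label that is not in the mapping
sex = pd.Series(['male', 'female', 'Female'])

# Values missing from the dictionary are mapped to NaN
mapped = sex.map({'male': 0, 'female': 1})
print(mapped)

# Count how many values were left unmapped
unmapped_count = mapped.isna().sum()
print(unmapped_count)
```

Running the same isna().sum() check on passengers['Sex'] after the mapping is a quick way to confirm that every row was converted.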


# 3. Let’s take a look at Age. Print passengers['Age'].values. You can see we have multiple missing values, or NaNs. 
# Fill all the empty Age values in passengers with the mean age.

The .fillna() method allows us to fill all the missing values in a column with the following syntax:

data_frame['column'].fillna(value=value_to_replace_nan, inplace=True)

"data_frame" - the name of the DataFrame
"column" - the name of the column in which we want to fill missing values
"value_to_replace_nan" - the value that will replace the missing values
"inplace=True" - fills the missing values in place rather than returning a new object

Recent pandas versions warn when .fillna() is called with inplace=True on a selected column, and the change may not reach the DataFrame. Assigning the result back, as the cell below does, avoids the problem.

In [8]:
# Fill the nan values in the age column
passengers = passengers.fillna(value={'Age':passengers.Age.mean()})
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S
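The mean-fill step can be checked on a tiny example. The frame `ages` below is hypothetical; the point is that .mean() skips NaN values by default, so the fill value is the mean of the known ages only.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing age
ages = pd.DataFrame({'Age': [20.0, np.nan, 40.0]})

# mean() ignores NaN by default, so this is (20 + 40) / 2
mean_age = ages['Age'].mean()

# Assign the filled column back to the DataFrame
ages['Age'] = ages['Age'].fillna(mean_age)
print(ages)
```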


# 5. Create a new column named SecondClass that stores 1 for all passengers in second class and 0 for all other passengers.

# Print passengers and inspect the DataFrame to ensure all the updates have been made.

We can create the SecondClass column in passengers with the following syntax:

passengers['SecondClass'] = expression_for_column_values

To set the value of SecondClass for each passenger to 1 or 0, we can use pandas’ .apply() method on Pclass as shown below:

passengers['Pclass'].apply(lambda x: 1 if x == 2 else 0)

The cell below also creates a FirstClass column in the same way, using x == 1.

In [12]:
passengers['SecondClass'] = passengers.Pclass.apply(lambda x: 1 if x == 2 else 0)
passengers['FirstClass'] = passengers.Pclass.apply(lambda x: 1 if x == 1 else 0)
passengers.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,SecondClass,FirstClass
0,1,0,3,"Braund, Mr. Owen Harris",0,22.0,1,0,A/5 21171,7.25,,S,0,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",1,38.0,1,0,PC 17599,71.2833,C85,C,0,1
2,3,1,3,"Heikkinen, Miss. Laina",1,26.0,0,0,STON/O2. 3101282,7.925,,S,0,0
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",1,35.0,1,0,113803,53.1,C123,S,0,1
4,5,0,3,"Allen, Mr. William Henry",0,35.0,0,0,373450,8.05,,S,0,0
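As an alternative to .apply() with a lambda, the same 0/1 column can be built with a vectorized comparison. This is a sketch on a hypothetical four-row frame, not a change to the project's approach; both versions give the same values.

```python
import pandas as pd

# Hypothetical passenger classes
pclass = pd.DataFrame({'Pclass': [3, 1, 2, 1]})

# Row-by-row version, as used above
with_apply = pclass['Pclass'].apply(lambda x: 1 if x == 2 else 0)

# Vectorized version: compare the whole column, then cast True/False to 1/0
vectorized = (pclass['Pclass'] == 2).astype(int)

print(with_apply.tolist() == vectorized.tolist())
```

Third class needs no column of its own: a passenger with 0 in both FirstClass and SecondClass is in third class.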


# Select and Split the Data

# --------------------------------------------------------------------
# --------------------------------------------------------------------
# --------------------------------------------------------------------

# 6. Now that we have cleaned our data, let’s select the columns we want to build our model on. 
# Select columns Sex, Age, FirstClass, and SecondClass and store them in a variable named features. 
# Select column Survived and store it in a variable named survival.

In [13]:
features = passengers[['Sex', 'Age', 'FirstClass', 'SecondClass']]
features.head()

Unnamed: 0,Sex,Age,FirstClass,SecondClass
0,0,22.0,0,0
1,1,38.0,1,0
2,1,26.0,0,0
3,1,35.0,1,0
4,0,35.0,0,0


In [14]:
survival = passengers.Survived
survival.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

# 7. Split the data into training and test sets using sklearn’s train_test_split() function. 
# We’ll use the training set to train the model and the test set to evaluate the model.

In [28]:
train_features, test_features, train_labels, test_labels = train_test_split(features, survival, test_size = 0.2)


print(train_features.shape)
print(test_features.shape)
print(train_labels.shape)
print(test_labels.shape)

(712, 4)
(179, 4)
(712,)
(179,)
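With test_size = 0.2 and 891 rows, the test set size is rounded up to 179, which leaves 712 rows for training. Because the split is random, the scores further down will vary from run to run. A minimal sketch of a reproducible split follows, assuming you want fixed results: the data here is hypothetical, and random_state and stratify are optional train_test_split parameters that the project itself does not use.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 rows, 2 features, balanced binary labels
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets
split_a = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
split_b = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)

X_train, X_test, y_train, y_test = split_a
print(X_train.shape, X_test.shape)

# The same random_state gives the same test rows on every call
print(np.array_equal(split_a[1], split_b[1]))
```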


# Normalize the Data

# 8. Since sklearn’s Logistic Regression implementation uses regularization, we need to scale our feature data. 
# Create a StandardScaler object, .fit_transform() it on the training features, and .transform() the test features.

We can create a StandardScaler object with the following syntax:

scaler = StandardScaler()

To determine the scaling factors and apply the scaling to the training features:

train_features = scaler.fit_transform(train_features)

To apply the same scaling to the test features:

test_features = scaler.transform(test_features)



In [29]:
scaler = StandardScaler()
train_features = scaler.fit_transform(train_features)
test_features = scaler.transform(test_features)
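To see what the scaler does, the sketch below fits a StandardScaler on a hypothetical training array and reproduces .transform() by hand. The names scaler_demo, train and test are made up for this example; mean_ and scale_ are the attributes the fitted scaler stores.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features: a 0/1 column and an age column
train = np.array([[0.0, 20.0], [1.0, 30.0], [1.0, 40.0]])
test = np.array([[0.0, 50.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train
test_scaled = scaler_demo.transform(test)        # reuses the training mean and std

# transform() is equivalent to (x - training mean) / training std
manual = (test - scaler_demo.mean_) / scaler_demo.scale_

print(train_scaled.mean(axis=0))          # approximately 0 for each column
print(np.allclose(test_scaled, manual))
```

The test features are scaled with statistics from the training set only, so no information from the test set leaks into the model.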

# Create and Evaluate the Model


# 9. Create a LogisticRegression model with sklearn and .fit() it on the training data.

Fitting the model runs an iterative optimizer to find the feature coefficients that minimize the regularized log-loss on the training data.

To create a LogisticRegression model in sklearn:

model = LogisticRegression()

To .fit() the model to training data:

model.fit(X_train, y_train)

In [39]:
regr = LogisticRegression()
regr.fit(train_features, train_labels)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='warn', n_jobs=None, penalty='l2',
                   random_state=None, solver='warn', tol=0.0001, verbose=0,
                   warm_start=False)

# 10. .score() the model on the training data and print the training score.

Scoring the model on the training data will run the data through the model and make final classifications on survival for each passenger in the training set.

The score returned is the fraction of correct classifications, or the accuracy.

In [40]:
print("Train score:")
regr.score(train_features, train_labels)

Train score:


0.7780898876404494

# 11. .score() the model on the test data and print the test score.

Similarly, scoring the model on the testing data will run the data through the model and make final classifications on survival for each passenger in the test set.

How well did your model perform?

In [41]:
print("Test score:")
regr.score(test_features, test_labels)

Test score:


0.7988826815642458
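The value returned by .score() can be reproduced by comparing predictions with the true labels. The sketch below does this on hypothetical one-feature data with a separate model named clf; it is an illustration of the accuracy calculation, not part of the project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical one-feature data with binary labels
X = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression()
clf.fit(X, y)

# Accuracy by hand: share of predictions that equal the true labels
manual_accuracy = np.mean(clf.predict(X) == y)
print(manual_accuracy, clf.score(X, y))
```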

# 12. Print the feature coefficients determined by the model. 
# Which feature is most important in predicting survival on the sinking of the Titanic?

To print the coefficients of the model, access the .coef_ attribute:

print(model.coef_)

To print each feature with its respective coefficient value, you can use the following expression:

print(list(zip(['Sex','Age','FirstClass','SecondClass'], model.coef_[0])))

In [43]:
print(regr.coef_)

[[ 1.156773   -0.44982048  1.01863451  0.45464245]]


In [44]:
print(list(zip(['Sex','Age','FirstClass','SecondClass'],regr.coef_[0])))

[('Sex', 1.1567729962104312), ('Age', -0.44982047502188044), ('FirstClass', 1.018634510725555), ('SecondClass', 0.45464244930534803)]
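Because the features were standardized, the coefficient magnitudes can be compared with each other. One simple way to rank them is to sort by absolute value, as sketched below with the rounded coefficients from the output above. The sign shows the direction of the effect and the size shows its strength.

```python
# Coefficients copied from the output above
coefficients = {
    'Sex': 1.156773,
    'Age': -0.44982048,
    'FirstClass': 1.01863451,
    'SecondClass': 0.45464245,
}

# Sort by magnitude, largest first
ranked = sorted(coefficients.items(), key=lambda item: abs(item[1]), reverse=True)
for name, value in ranked:
    print(f"{name}: {value:+.3f}")
```

For this particular run, Sex has the largest coefficient, followed closely by FirstClass. Age is the only negative coefficient, meaning a higher age lowers the predicted chance of survival.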


# Predict with the Model

# 13. Let’s use our model to make predictions on the survival of a few fateful passengers. 
# Provided in the code editor is information for 3rd class passenger Jack and 1st class passenger Rose, stored in NumPy arrays. 
# The arrays store 4 feature values, in the following order:

Sex, represented by a 0 for male and 1 for female

Age, represented as an integer in years

FirstClass, with a 1 indicating the passenger is in first class

SecondClass, with a 1 indicating the passenger is in second class

A third array, You, is also provided in the code editor with empty feature values. 
Uncomment the line containing You and update the array with your information, 
or the information for some fictitious passenger. Make sure to enter all values as floats, written with a decimal point (for example 27.0).


# Hint

If you or your fictitious passenger identify as non-binary, the Sex value can be 0.5.

If you or your fictitious passenger are sailing in 1st class, the FirstClass value should be 1 and the SecondClass value should be 0.

If you or your fictitious passenger are sailing in 2nd class, the FirstClass value should be 0 and the SecondClass value should be 1.

If you or your fictitious passenger are sailing in 3rd class, the FirstClass value should be 0 and the SecondClass value should be 0.

In [45]:
# Sample passenger features
Jack = np.array([0.0,20.0,0.0,0.0])
Rose = np.array([1.0,17.0,1.0,0.0])
You = np.array([0.0,27.0,0.0,1.0])

# 14. Combine Jack, Rose, and You into a single NumPy array named sample_passengers.

# Hint
To combine the arrays for each passenger into a single NumPy array, use the following syntax:
    
combined_arrays = np.array([ ___ , ___ , ___ ])

where the blanks are replaced with Jack, Rose, and You.

In [46]:
# Combine passenger arrays
sample_passengers = np.array([Jack, Rose, You])
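The combined array should be two-dimensional, with one row per passenger and one column per feature, since that is the layout scikit-learn expects for the scaler and the model. A quick shape check, using the same values as above under the hypothetical name sample:

```python
import numpy as np

Jack = np.array([0.0, 20.0, 0.0, 0.0])
Rose = np.array([1.0, 17.0, 1.0, 0.0])
You = np.array([0.0, 27.0, 0.0, 1.0])

# Stacking three length-4 arrays gives 3 rows (passengers) and 4 columns (features)
sample = np.array([Jack, Rose, You])
print(sample.shape)
```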

# 15. Since our Logistic Regression model was trained on scaled feature data, 
# we must also scale the feature data we are making predictions on. 
# Using the StandardScaler object created earlier, apply its .transform() method to sample_passengers and save the result to sample_passengers.

Print sample_passengers to view the scaled features.

# Hint
To scale the features of our sample passengers according to the same scaling as the training data, use the following syntax:
    
sample_passengers = scaler.transform(sample_passengers)

In [47]:
# Scale the sample passenger features
sample_passengers = scaler.transform(sample_passengers)

# 16. Who will survive, and who will sink? Use your model’s .predict() method on sample_passengers and print the result to find out.

Want to see the probabilities that led to these predictions? Call your model’s .predict_proba() method on sample_passengers and print the result. 
The 1st column is the probability of a passenger perishing on the Titanic, 
and the 2nd column is the probability of a passenger surviving the sinking (which was calculated by our model to make the final classification decision).

In [48]:
predict_survival = regr.predict(sample_passengers)
predict_survival

array([0, 1, 0], dtype=int64)

In [49]:
predict_proba_survival = regr.predict_proba(sample_passengers)
predict_proba_survival

array([[0.87705441, 0.12294559],
       [0.04948035, 0.95051965],
       [0.74403448, 0.25596552]])
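The second column of .predict_proba() comes from passing a weighted sum of the features through the sigmoid function, and .predict() returns 1 when that probability is above 0.5. The sketch below checks this on hypothetical data with a separate binary model named clf. It is meant as an illustration of the calculation, not as a description of scikit-learn's internals.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature data with binary labels
X = np.array([[-2.0, 0.5], [-1.0, -0.5], [-0.5, 1.0],
              [0.5, -1.0], [1.0, 0.0], [2.0, 1.5]])
y = np.array([0, 0, 1, 0, 1, 1])

clf = LogisticRegression().fit(X, y)

# Weighted sum of the features plus the intercept
z = X @ clf.coef_[0] + clf.intercept_[0]

# Sigmoid turns the weighted sum into a probability between 0 and 1
manual_prob = 1.0 / (1.0 + np.exp(-z))

print(np.allclose(manual_prob, clf.predict_proba(X)[:, 1]))
print(np.array_equal((manual_prob > 0.5).astype(int), clf.predict(X)))
```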

# From the model’s predictions, it appears Rose has a better fate than Jack.

My chances are a little better than Jack's (about a 26% predicted probability of survival versus 12%), but in the end I am not predicted to survive either.