### Case Study Introduction

*   The sinking of the Titanic is one of the most infamous shipwrecks in history.

*   The Titanic sank after colliding with an iceberg. A shortage of lifeboats contributed to the deaths of as many as 1502 of the 2224 people on board.

*   The Titanic data set is a classic example of a supervised machine learning task. In this notebook, our goal is to build a predictive model that answers the question "which passengers were more likely to survive than others?"

*   We will use the logistic regression model provided by the Scikit-learn library to predict who survived and who did not, using passenger data such as name, age, gender, socio-economic class, and fare.
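
As a rough illustration of what logistic regression computes, the sketch below passes a weighted sum of features through the logistic (sigmoid) function to obtain a probability. The weights, bias, and feature values are made up for illustration and are not taken from the model fitted later in this notebook.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one row of features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for two standardized features (illustration only)
weights = [-1.2, -0.4]
bias = -0.5

p = predict_proba([0.7, 0.1], weights, bias)
label = int(p >= 0.5)  # decision threshold of 0.5
print(round(p, 3), label)
```

A negative weight lowers the predicted probability as the feature grows, which is how the sign of each fitted coefficient can be read.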


In [7]:
#import relevant libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn as sk
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import LabelEncoder

In [8]:
# loading our data set
df = pd.read_csv('/content/train.csv')

In [9]:
# performing a .head function
# displaying the first few rows of a data set
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [10]:
# calling .describe()
# displaying the statistical summary of the numeric columns in the data set
df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


In [11]:
# performing a .info function
# understanding the structure and properties of my data set
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [12]:
# reading the .shape attribute
# this gives the number of rows and columns in our data set
df.shape

(891, 12)

In [13]:
# reading .shape[0]
# to get only the number of rows
df.shape[0]

891

In [14]:
# reading .shape[1]
# to get only the number of columns
df.shape[1]

12

In [15]:
# calling .isna()
# checking each cell in the data set for missing values
df.isna()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,False,False,False,False,False,False,False,False,False,False,True,False
1,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,True,False
3,False,False,False,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,True,False
...,...,...,...,...,...,...,...,...,...,...,...,...
886,False,False,False,False,False,False,False,False,False,False,True,False
887,False,False,False,False,False,False,False,False,False,False,False,False
888,False,False,False,False,False,True,False,False,False,False,True,False
889,False,False,False,False,False,False,False,False,False,False,False,False


In [16]:
# calling .isna().sum()
# counting the missing values in each column
df.isna().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Ticket,0
Fare,0
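
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both, using a small made-up frame (`toy`) in place of the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy.isna().sum()
missing_share = toy.isna().mean()  # fraction of rows missing per column

summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("share", ascending=False))
```

A column that is mostly empty is usually a candidate for dropping, while one with a modest share of gaps can be imputed.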


In [17]:
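# fill the missing values for age with the column mean using SimpleImputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
# fit_transform returns a 2-D array, so flatten it before assigning to the column
df['Age'] = imputer.fit_transform(df[['Age']]).ravel()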

In [18]:
# fill in the missing values for age using median
#df['Age'].fillna(df['Age'].median(), inplace=True)
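
The choice between mean and median matters when a column has outliers. The sketch below compares the two `SimpleImputer` strategies on a made-up age column containing one extreme value.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up ages with one outlier (80) and one missing value
ages = pd.DataFrame({"Age": [20.0, 22.0, 24.0, 80.0, np.nan]})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform returns a 2-D array; ravel() flattens it to one dimension
mean_filled = mean_imputer.fit_transform(ages).ravel()
median_filled = median_imputer.fit_transform(ages).ravel()

print(mean_filled[-1])    # mean of the observed ages
print(median_filled[-1])  # median of the observed ages
```

The mean is pulled toward the outlier, while the median stays close to the bulk of the values.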

In [19]:
# fill in the missing values for embarked using the mode S
# assigning the result back avoids the pandas FutureWarning raised by inplace=True on a column selection
df['Embarked'] = df['Embarked'].fillna('S')
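
Rather than hardcoding the fill value, the mode can be computed from the column. The sketch below does this on a made-up `Embarked` column and assigns the result back to the frame.

```python
import numpy as np
import pandas as pd

# Made-up port column with two missing entries
toy = pd.DataFrame({"Embarked": ["S", "C", "S", np.nan, "Q", "S", np.nan]})

# mode() ignores NaN and returns a Series; take the first (most frequent) value
most_common = toy["Embarked"].mode()[0]

# Assign the result back instead of using inplace=True on a column selection
toy["Embarked"] = toy["Embarked"].fillna(most_common)

print(most_common, toy["Embarked"].isna().sum())
```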


In [20]:
# calling .isna().sum() again
# to see if the missing values have been filled
df.isna().sum()

Unnamed: 0,0
PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0


In [21]:
# deleting the name column
df.drop('Name', axis=1, inplace=True)

In [22]:
# deleting the parch column
df.drop('Parch', axis=1, inplace=True)

In [23]:
# deleting the cabin column
df.drop('Cabin', axis=1, inplace=True)

In [24]:
# deleting the passenger id column
df.drop('PassengerId', axis=1, inplace=True)

In [25]:
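# encoding the age column using label encoder
# note: this replaces each distinct age with its integer rank (not a binary value); Age is already numeric, so this step is optional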
le = LabelEncoder()
df['Age'] = le.fit_transform(df['Age'])


In [26]:
# encoding the sex column to binary using label encoder
le = LabelEncoder()
df['Sex'] = le.fit_transform(df['Sex'])
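
To confirm which label received which integer, the fitted encoder's `classes_` attribute can be inspected. The sketch below uses a short made-up list rather than the notebook's `df`.

```python
from sklearn.preprocessing import LabelEncoder

sex = ["male", "female", "female", "male"]

encoder = LabelEncoder()
codes = encoder.fit_transform(sex)

# classes_ is sorted, and each label's position is its integer code
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```

Because the classes are sorted alphabetically, `female` maps to 0 and `male` to 1, which matters when reading the sign of the `Sex` coefficient later.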

In [27]:
# encoding embarked using one hot encoding
df = pd.get_dummies(df, columns=['Embarked'])
print(df.columns)

Index(['Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Ticket', 'Fare',
       'Embarked_C', 'Embarked_Q', 'Embarked_S'],
      dtype='object')
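
Recent pandas versions return boolean dummy columns by default. If 0/1 integers are wanted directly, `get_dummies` accepts a `dtype` argument; the sketch below shows this on a made-up frame.

```python
import pandas as pd

toy = pd.DataFrame({"Fare": [7.25, 71.28, 7.92], "Embarked": ["S", "C", "Q"]})

# dtype=int yields 0/1 integer columns rather than booleans
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)

print(encoded.columns.tolist())
print(encoded.dtypes)
```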



In [29]:
# checking whether age, sex and embarked have been encoded successfully
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Ticket,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,28,1,A/5 21171,7.25,False,False,True
1,1,1,0,52,1,PC 17599,71.2833,True,False,False
2,1,3,0,34,0,STON/O2. 3101282,7.925,False,False,True
3,1,1,0,48,1,113803,53.1,False,False,True
4,0,3,1,48,0,373450,8.05,False,False,True


In [30]:
# converting the boolean one-hot embarked columns into 0/1 integers
df['Embarked_C'] = le.fit_transform(df['Embarked_C'])
df['Embarked_Q'] = le.fit_transform(df['Embarked_Q'])
df['Embarked_S'] = le.fit_transform(df['Embarked_S'])


In [31]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Ticket,Fare,Embarked_C,Embarked_Q,Embarked_S
0,0,3,1,28,1,A/5 21171,7.25,0,0,1
1,1,1,0,52,1,PC 17599,71.2833,1,0,0
2,1,3,0,34,0,STON/O2. 3101282,7.925,0,0,1
3,1,1,0,48,1,113803,53.1,0,0,1
4,0,3,1,48,0,373450,8.05,0,0,1


In [32]:
#defining a function to map the one hot encoded columns to 0,1,2
def map_embarked(row):
    if row['Embarked_C'] == 1:
        return 0  # Representing 'C' as 0
    elif row['Embarked_Q'] == 1:
        return 1  # Representing 'Q' as 1
    elif row['Embarked_S'] == 1:
        return 2  # Representing 'S' as 2
    else:
        return None # Handle cases where none of the Embarked columns are 1

In [33]:
# putting the encoded embarked column into one
#after turning the column into binary
df['Embarked'] = df.apply(map_embarked, axis=1)
df.drop('Embarked_C', axis=1, inplace=True)
df.drop('Embarked_Q', axis=1, inplace=True)
df.drop('Embarked_S', axis=1, inplace=True)
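
If a single integer-coded column is the goal, the same codes can be produced directly from the raw port letters with `Series.map`, without one-hot encoding first. This is a sketch on a made-up column, not the path the notebook takes.

```python
import pandas as pd

toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Same codes as map_embarked: C -> 0, Q -> 1, S -> 2
port_codes = {"C": 0, "Q": 1, "S": 2}
toy["Embarked"] = toy["Embarked"].map(port_codes)

print(toy["Embarked"].tolist())
```

Integer codes imply an ordering among the ports that a linear model will treat as meaningful, so keeping the one-hot columns may be the safer choice.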

In [34]:
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Ticket,Fare,Embarked
0,0,3,1,28,1,A/5 21171,7.25,2
1,1,1,0,52,1,PC 17599,71.2833,0
2,1,3,0,34,0,STON/O2. 3101282,7.925,2
3,1,1,0,48,1,113803,53.1,2
4,0,3,1,48,0,373450,8.05,2


In [36]:
# choosing features for the train and test data
X = df[['Pclass', 'Sex', 'Age', 'SibSp', 'Fare', 'Embarked']]  # Select only numerical features
y = df['Survived']

In [35]:
# earlier two-feature experiment (run as cell 35, before the six-feature selection in cell 36, which is the one used below)
X = df[['Sex','Age']]
y = df['Survived']

In [37]:
# splitting the data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
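
The split sizes can be checked by hand: with 891 rows and `test_size=0.2`, the test share is rounded up. The sketch below reproduces this on placeholder arrays; the `stratify` argument is an optional addition (not used in this notebook) that keeps the class balance similar in both parts.

```python
import math
import numpy as np
from sklearn.model_selection import train_test_split

n_rows = 891
X_toy = np.arange(n_rows * 2).reshape(n_rows, 2)  # placeholder features
y_toy = np.arange(n_rows) % 2                     # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

expected_test = math.ceil(0.2 * n_rows)  # the test share is rounded up
print(len(X_tr), len(X_te), expected_test)
```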

In [38]:
# feature scaling for all selected features (the scaler is fit on the training set only)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
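
The reason for calling `fit_transform` on the training set and only `transform` on the test set is that the test data should be scaled with statistics learned from training data. A small sketch on made-up arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training statistics

print(scaler.mean_)
print(train_scaled.mean(axis=0), train_scaled.std(axis=0))
print(test_scaled)
```

After scaling, each training column has mean 0 and standard deviation 1; the test columns generally will not, which is expected.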

In [39]:
X_test

array([[ 0.81303367,  0.7243102 , -0.0034528 ,  0.37992316, -0.33390078,
        -2.02505292],
       [-0.40055118,  0.7243102 ,  0.17642952, -0.47072241, -0.42528387,
         0.5635246 ],
       [ 0.81303367,  0.7243102 , -0.84290362, -0.47072241, -0.47486697,
         0.5635246 ],
       ...,
       [ 0.81303367, -1.38062393,  0.77603725,  0.37992316, -0.02308312,
         0.5635246 ],
       [-0.40055118, -1.38062393, -1.02278594, -0.47072241, -0.42528387,
         0.5635246 ],
       [ 0.81303367, -1.38062393, -1.86223676,  0.37992316, -0.30589933,
         0.5635246 ]])

In [40]:
X_train

array([[-1.61413602,  0.7243102 ,  1.31568421, -0.47072241, -0.07868358,
         0.5635246 ],
       [-0.40055118,  0.7243102 , -0.60306053, -0.47072241, -0.37714494,
         0.5635246 ],
       [ 0.81303367,  0.7243102 ,  0.23639029, -0.47072241, -0.47486697,
         0.5635246 ],
       ...,
       [ 0.81303367,  0.7243102 ,  1.01588034,  1.23056874, -0.35580399,
         0.5635246 ],
       [-1.61413602, -1.38062393, -1.26262903,  0.37992316,  1.68320121,
         0.5635246 ],
       [-1.61413602,  0.7243102 , -0.72298207, -0.47072241,  0.86074761,
         0.5635246 ]])

In [41]:
y_test

Unnamed: 0,Survived
709,1
439,0
840,0
720,1
39,1
...,...
433,0
773,0
25,1
84,1


In [42]:
y_train

Unnamed: 0,Survived
331,0
733,0
382,0
704,0
813,0
...,...
106,1
270,0
860,0
435,1


In [43]:
#the number of rows present in the training dataset (X_train).
len(X_train)

712

In [44]:
#the number of rows present in the test dataset (X_test).
len(X_test)

179

In [45]:
#the number of rows present in the training dataset (y_train).
len(y_train)

712

In [46]:
#the number of rows present in the test labels (y_test).
len(y_test)

179

In [47]:
# fitting the logistic regression model
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)

In [48]:

# predicting on the test set and computing the accuracy as a percentage
y_pred = model.predict(X_test)
log = round(model.score(X_test, y_test) * 100, 2)
log

79.89

In [49]:
y_pred = model.predict(X_test)
y_pred

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1,
       0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 1, 1])
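
The 0/1 predictions come from thresholding a probability. The sketch below fits a logistic regression on a tiny synthetic problem (not the Titanic data) and shows how `predict_proba` relates to `predict`.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic problem: one feature, label is 1 when the feature is positive
X_toy = np.array([[-2.0], [-1.5], [-1.0], [1.0], [1.5], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

clf = LogisticRegression()
clf.fit(X_toy, y_toy)

proba = clf.predict_proba(X_toy)[:, 1]  # probability of class 1
labels = (proba >= 0.5).astype(int)     # matches predict() here, apart from exact ties

print(proba.round(2))
print(labels)
```

Inspecting the probabilities can be useful when a threshold other than 0.5 is worth considering.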

In [50]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Evaluate accuracy
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)


print(f'Accuracy: {accuracy}')
print('Confusion Matrix:\n', conf_matrix)


Accuracy: 0.7988826815642458
Confusion Matrix:
 [[88 17]
 [19 55]]
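
The accuracy can be recovered from the confusion matrix by hand, along with precision and recall for the survivor class. The sketch below uses the matrix printed above; scikit-learn puts true labels on rows and predicted labels on columns.

```python
import numpy as np

# Confusion matrix from the notebook output: rows are true labels, columns are predictions
conf = np.array([[88, 17],
                 [19, 55]])

tn, fp, fn, tp = conf.ravel()

accuracy = (tp + tn) / conf.sum()
precision = tp / (tp + fp)  # of predicted survivors, how many actually survived
recall = tp / (tp + fn)     # of actual survivors, how many were identified

print(round(accuracy, 4), round(precision, 4), round(recall, 4))
```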


In [57]:
X.head(1)

Unnamed: 0,Pclass,Sex,Age,SibSp,Fare,Embarked
0,3,1,28,1,7.25,2


In [52]:
# outputting the coefficients of our model for each feature
# note: the 'Correlation' column holds logistic regression coefficients, not correlations
corr_df = pd.DataFrame(X.columns)
corr_df.columns = ['Feature']
corr_df['Correlation'] = pd.Series(model.coef_[0])
corr_df.sort_values(by='Correlation', ascending=False)

Unnamed: 0,Feature,Correlation
4,Fare,0.098746
5,Embarked,-0.179872
3,SibSp,-0.377934
2,Age,-0.380999
0,Pclass,-0.788776
1,Sex,-1.25784
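
These values are logistic regression coefficients on standardized features, so one way to read them is as odds ratios. The sketch below exponentiates the coefficients copied from the table above; the interpretation assumes a one-standard-deviation increase in a feature with the others held fixed.

```python
import numpy as np
import pandas as pd

# Coefficients copied from the table above (fitted on standardized features)
coefs = pd.Series(
    [-0.788776, -1.25784, -0.380999, -0.377934, 0.098746, -0.179872],
    index=["Pclass", "Sex", "Age", "SibSp", "Fare", "Embarked"],
)

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-standard-deviation increase in that feature
odds_ratios = np.exp(coefs)
print(odds_ratios.sort_values())
```

An odds ratio below 1 means the feature lowers the odds of survival as it increases; above 1 means it raises them.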


In [53]:
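# pairing every feature with its coefficient (dropping the first column name would shift the pairing by one)
coeff_df = pd.DataFrame(X.columns, columns=['Features'])
coeff_df['Coefficient Estimate'] = pd.Series(model.coef_[0])
# sort_values returns a new DataFrame, so keep the result
coeff_df = coeff_df.sort_values(by='Coefficient Estimate', ascending=False)
coeff_df

Unnamed: 0,Features,Coefficient Estimate
4,Fare,0.098746
5,Embarked,-0.179872
3,SibSp,-0.377934
2,Age,-0.380999
0,Pclass,-0.788776
1,Sex,-1.25784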


In [54]:
# outputting the model coefficients again, sorted (despite the column name, these are coefficients, not correlations)
coeff_df = pd.DataFrame(X.columns)
coeff_df.columns = ['Features']
coeff_df['Correlation'] = pd.Series(model.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)

Unnamed: 0,Features,Correlation
4,Fare,0.098746
5,Embarked,-0.179872
3,SibSp,-0.377934
2,Age,-0.380999
0,Pclass,-0.788776
1,Sex,-1.25784


In [55]:
# Reuse y_train's index so each row of scaled features lines up with its own label
X_train_df = pd.DataFrame(X_train, columns=X.columns, index=y_train.index)

# Create a new DataFrame combining features and target
train_df = pd.concat([X_train_df, y_train], axis=1)

# Rename the target column to 'Survived' for clarity
train_df.rename(columns={train_df.columns[-1]: 'Survived'}, inplace=True)

# Display the first few rows of the combined DataFrame
print(train_df.head())

       Pclass       Sex       Age     SibSp      Fare  Embarked  Survived
331 -1.614136  0.724310  1.315684 -0.470722 -0.078684  0.563525         0
733 -0.400551  0.724310 -0.603061 -0.470722 -0.377145  0.563525         0
382  0.813034  0.724310  0.236390 -0.470722 -0.474867  0.563525         0
704  0.813034  0.724310 -0.303257  0.379923 -0.476230  0.563525         0
813  0.813034 -1.380624 -1.742315  2.931860 -0.025249  0.563525         0
