## Titanic Survival Prediction

### Algorithms used: Logistic Regression, Random Forest Classifier

### Importing necessary libraries

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score

### Loading the Titanic dataset

In [2]:
titanic_df = pd.read_csv('tested.csv')
titanic_df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,0,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,1,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


In [3]:
# Inspect the distinct labels of the target column
unique_class = titanic_df['Survived'].unique()
print(unique_class)

[0 1]


The target column `Survived` takes only two values (0 and 1), so this is a binary classification problem.
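
Beyond the number of classes, their relative frequency matters when reading accuracy later. The following is a minimal sketch of a class-balance check; the `survived` series is a hypothetical toy stand-in for `titanic_df['Survived']`, not the real data.

```python
import pandas as pd

# Hypothetical toy labels standing in for titanic_df['Survived']
survived = pd.Series([0, 1, 0, 0, 1, 0, 1, 0], name="Survived")

# Absolute count and relative share of each class
class_counts = survived.value_counts()
class_share = survived.value_counts(normalize=True)

print(class_counts)
print(class_share)
```

On the notebook's data, the same two calls can be applied directly to `titanic_df['Survived']`.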

### Data Exploration and Preprocessing

In [4]:
# Drop irrelevant columns
titanic_df = titanic_df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1)

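# Handle missing data: median for Age, mean for Fare.
# Assign the result back; chained fillna(..., inplace=True) on a column
# triggers a warning in recent pandas versions and may not modify the frame.
titanic_df['Age'] = titanic_df['Age'].fillna(titanic_df['Age'].median())
titanic_df['Fare'] = titanic_df['Fare'].fillna(titanic_df['Fare'].mean())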

# Convert categorical data to numerical format
titanic_df['Sex'] = titanic_df['Sex'].map({'male': 0, 'female': 1})
titanic_df['Embarked'] = titanic_df['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Drop any remaining rows with missing data
titanic_df.dropna(inplace=True)

print("Dataframe after preprocessing:")
print(titanic_df.head())

Dataframe after preprocessing:
   Survived  Pclass  Sex   Age  SibSp  Parch     Fare  Embarked
0         0       3    0  34.5      0      0   7.8292         1
1         1       3    1  47.0      1      0   7.0000         2
2         0       2    0  62.0      0      0   9.6875         1
3         0       3    0  27.0      0      0   8.6625         2
4         1       3    1  22.0      1      1  12.2875         2
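
Mapping `Embarked` to 0, 1, 2 imposes an order (C < Q < S) that a linear model will treat as meaningful. One possible alternative, sketched here on a hypothetical toy frame (`toy_df` and its values are illustrative only), is one-hot encoding with `pd.get_dummies`.

```python
import pandas as pd

# Hypothetical toy frame reusing column names from titanic_df
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Embarked": ["Q", "S", "C", "S"],
    "Fare": [7.8292, 7.0, 9.6875, 12.2875],
})

# One 0/1 indicator column per port instead of a single ordinal code
encoded_df = pd.get_dummies(toy_df, columns=["Embarked"], dtype=int)

# Sex is binary, so a simple 0/1 mapping is sufficient
encoded_df["Sex"] = encoded_df["Sex"].map({"male": 0, "female": 1})

print(encoded_df.columns.tolist())
```

Tree-based models such as the random forest are largely insensitive to this choice; it mainly affects the logistic regression.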


### Initialising predictor matrix and dependent variable.

Dependent variable: Survived

In [5]:
X = titanic_df.drop('Survived', axis=1)
y = titanic_df['Survived']
print(X.columns)

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked'], dtype='object')


### Split the data into training and testing sets

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
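
The split above is purely random. When the classes are imbalanced, passing `stratify` keeps the class ratio the same in both parts. This is a sketch on hypothetical toy data (`X_toy`, `y_toy`), not the notebook's frame.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 30% positive class
X_toy = pd.DataFrame({"feature": range(20)})
y_toy = pd.Series([1] * 6 + [0] * 14, name="Survived")

# stratify=y_toy preserves the 30/70 class ratio in both halves
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.5, random_state=123, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```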

### Logistic Regression: Model fitting and predictions

In [7]:
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
y_pred_logreg = logreg_model.predict(X_test)

# Calculate accuracy, precision, and recall
accuracy_logreg = accuracy_score(y_test, y_pred_logreg)
precision_logreg = precision_score(y_test, y_pred_logreg)
recall_logreg = recall_score(y_test, y_pred_logreg)

print("Logistic Regression - Evaluation Metrics:")
print("Accuracy:", accuracy_logreg)
print("Precision:", precision_logreg)
print("Recall:", recall_logreg)

Logistic Regression - Evaluation Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
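
The warning above suggests two remedies: scaling the features or raising `max_iter`. A minimal sketch of both combined in one pipeline follows. It runs on synthetic data from `make_classification` (an assumption made so the example is self-contained), and the value `max_iter=1000` is an illustrative choice, not a tuned setting.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set) with 7 features, like X above
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=123
)

# Standardize the features, then fit with a higher iteration limit
# than the default of 100
scaled_logreg = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
scaled_logreg.fit(X_tr, y_tr)

demo_accuracy = scaled_logreg.score(X_te, y_te)
print("Accuracy:", demo_accuracy)
```

Because the scaler is inside the pipeline, it is fitted on the training rows only and then applied to the test rows.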


### Random Forest Classifier: Model fitting and predictions

In [8]:
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)
y_pred_rf = rf_model.predict(X_test)

# Calculate accuracy, precision, and recall
accuracy_rf = accuracy_score(y_test, y_pred_rf)
precision_rf = precision_score(y_test, y_pred_rf)
recall_rf = recall_score(y_test, y_pred_rf)

print("Random Forest Classifier - Evaluation Metrics:")
print("Accuracy:", accuracy_rf)
print("Precision:", precision_rf)
print("Recall:", recall_rf)

Random Forest Classifier - Evaluation Metrics:
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
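
A single 80/20 split yields one estimate per metric. As a rough sketch of a more stable estimate, k-fold cross-validation can be run with `cross_val_score`; the data here is synthetic and `n_estimators=50` is an arbitrary choice to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=7, random_state=0)

rf_demo = RandomForestClassifier(n_estimators=50, random_state=42)

# Five folds: each row is used for testing exactly once
cv_scores = cross_val_score(rf_demo, X_demo, y_demo, cv=5, scoring="accuracy")

print("Mean accuracy:", cv_scores.mean())
print("Std deviation:", cv_scores.std())
```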


### Let's compile the performance metrics of the two algorithms in a single dataframe for comparison.

In [9]:
# Metrics for each model, keyed by metric name
log = {
    'Accuracy': accuracy_logreg,
    'Precision': precision_logreg,
    'Recall': recall_logreg,
}
rfc = {
    'Accuracy': accuracy_rf,
    'Precision': precision_rf,
    'Recall': recall_rf,
}

combined_metrics = {
    'Logistic Regression': log,
    'Random Forest': rfc
}


# Creating a DataFrame from the combined_metrics dictionary
df_metrics = pd.DataFrame(combined_metrics)

# Transpose the DataFrame for a more readable format
df_metrics = df_metrics.transpose()

# Display the DataFrame
print(df_metrics)

                     Accuracy  Precision  Recall
Logistic Regression       1.0        1.0     1.0
Random Forest             1.0        1.0     1.0


### Conclusion:

Logistic Regression and the Random Forest Classifier perform equally well on this test split: both reach an accuracy, precision, and recall of 1.0.
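
Scores of exactly 1.0 on every metric can arise when a single feature encodes the label, so it may be worth cross-tabulating individual features against `Survived`. The sketch below uses only the five rows displayed in the `head()` output earlier (`head_df` is a hand-built copy of those rows), so it illustrates the check rather than proving anything about the full file.

```python
import pandas as pd

# The five rows shown in the head() output of the notebook
head_df = pd.DataFrame({
    "Sex": ["male", "female", "male", "male", "female"],
    "Survived": [0, 1, 0, 0, 1],
})

# Rows: Sex, columns: Survived.
# If each row of the table has a single non-zero cell,
# Sex alone determines the label in these rows.
sex_vs_survived = pd.crosstab(head_df["Sex"], head_df["Survived"])
print(sex_vs_survived)
```

Running the same `pd.crosstab` call on the full frame would show whether the pattern holds beyond these five rows.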