# Titanic Dataset Analysis

This notebook applies eight classification algorithms to the Titanic dataset to predict passenger survival. Each section gives a brief explanation of the algorithm, followed by the code and its accuracy, precision, recall, and F1-score on a held-out 20% test split.

## 1. Data Preparation

First, we load and preprocess the Titanic dataset.

In [30]:
# Import necessary libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Titanic dataset
file_path = 'C:/Users/Me/Downloads/titanic.csv'  # Update this with the actual path
df = pd.read_csv(file_path)

# Create a new feature FamilySize
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Create a new binary feature IsAlone
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)

# Convert Fare to a categorical feature
df['FareCategory'] = pd.qcut(df['Fare'], 3, labels=['Low', 'Medium', 'High'])

# Extract Title from Name
df['Title'] = df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

# Handle missing values
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df['Fare'] = df['Fare'].fillna(df['Fare'].median())

# Check remaining missing values (Cabin stays unfilled because it is dropped before modeling)
print(df.isnull().sum())

# Prepare the data for modeling
df = pd.get_dummies(df, columns=['Sex', 'Embarked', 'FareCategory', 'Title'], drop_first=True)

# Define features and target variable
features = df.drop(columns=['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin'])
X = features
y = df['Survived']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define a function to evaluate models
def evaluate_model(model, X_train, X_test, y_train, y_test):
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    return accuracy, precision, recall, f1

PassengerId       0
Survived          0
Pclass            0
Name              0
Sex               0
Age               0
SibSp             0
Parch             0
Ticket            0
Fare              0
Cabin           687
Embarked          0
FamilySize        0
IsAlone           0
FareCategory      0
Title             0
dtype: int64
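
To make the engineered features concrete, the sketch below applies the same three transformations to a tiny hand-written frame. The three rows and the name `toy` are illustrative assumptions; only the column names and the expressions come from the cell above.

```python
import pandas as pd

# Three hand-written rows following the dataset's "Surname, Title. Given names" pattern
toy = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Cumings, Mrs. John Bradley'],
    'SibSp': [1, 0, 1],
    'Parch': [0, 0, 2],
})

# Family size counts siblings/spouses, parents/children, and the passenger
toy['FamilySize'] = toy['SibSp'] + toy['Parch'] + 1

# A passenger is alone when the family size is exactly 1
toy['IsAlone'] = (toy['FamilySize'] == 1).astype(int)

# The title is the word that directly precedes the first period
toy['Title'] = toy['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

print(toy[['FamilySize', 'IsAlone', 'Title']])
```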


## 2. Logistic Regression

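Logistic regression is a linear model for binary classification. It passes a weighted sum of the features through the sigmoid function to obtain a probability between 0 and 1, and learns the weights by minimizing log loss. Gradient descent is the textbook way to do this; scikit-learn's `LogisticRegression` uses the L-BFGS solver by default.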
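
As a rough illustration of the sigmoid and gradient descent ideas, here is a minimal NumPy sketch on toy one-dimensional data. The function names, learning rate, iteration count, and data are assumptions made for the example; this is not how scikit-learn fits the model internally.

```python
import numpy as np

def sigmoid(z):
    """Map any real value to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_gd(X, y, lr=0.1, n_iter=2000):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)             # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples  # gradient of mean log loss w.r.t. w
        grad_b = np.mean(p - y)             # gradient w.r.t. b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data (assumed for illustration): larger x means class 1
X_toy = np.array([[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

w_toy, b_toy = fit_logistic_gd(X_toy, y_toy)
probs_toy = sigmoid(X_toy @ w_toy + b_toy)
preds_toy = (probs_toy >= 0.5).astype(int)  # threshold probabilities at 0.5
print(preds_toy)
```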


In [31]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Scale the data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Logistic Regression with increased max_iter
logistic_model = LogisticRegression(max_iter=500)
accuracy, precision, recall, f1 = evaluate_model(logistic_model, X_train_scaled, X_test_scaled, y_train, y_test)

print(f'Logistic Regression - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Logistic Regression - Accuracy: 0.8156424581005587, Precision: 0.7808219178082192, Recall: 0.7702702702702703, F1-Score: 0.7755102040816326


## 3. Decision Tree

Decision trees are used for both regression and classification tasks. They split the data into subsets based on feature values, creating a tree-like model. The splits are chosen to maximize information gain or minimize impurity.
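
The two split criteria can be computed by hand. The sketch below, with helper names and label arrays chosen purely for illustration, evaluates Gini impurity and entropy-based information gain for one candidate split.

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    """Shannon entropy in bits."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """Parent entropy minus the size-weighted entropy of the two children."""
    n = len(parent)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(parent) - weighted

# A balanced parent node and one imperfect candidate split (assumed data)
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left = np.array([0, 0, 0, 1])
right = np.array([0, 1, 1, 1])

print(f'Gini(parent) = {gini(parent):.3f}, Gini(left) = {gini(left):.3f}')
print(f'Information gain = {information_gain(parent, left, right):.3f}')
```

A split that separated the classes perfectly would reduce both children to zero impurity, giving the maximum possible gain for this parent.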

In [32]:
from sklearn.tree import DecisionTreeClassifier

# Decision Tree
decision_tree_model = DecisionTreeClassifier()
accuracy, precision, recall, f1 = evaluate_model(decision_tree_model, X_train, X_test, y_train, y_test)

print(f'Decision Tree - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Decision Tree - Accuracy: 0.776536312849162, Precision: 0.717948717948718, Recall: 0.7567567567567568, F1-Score: 0.7368421052631579


## 4. Random Forest

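Random forest is an ensemble method that combines multiple decision trees. Each tree is trained on a bootstrap sample of the data and considers a random subset of features at each split. Averaging the predictions of the individual trees reduces variance, so the forest overfits less than a single deep tree.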
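
The averaging step can be checked directly in scikit-learn. The sketch below uses synthetic data from `make_classification` (an assumption; it is not the Titanic data) and compares the mean of the per-tree class probabilities with the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary classification data, used only for this illustration
X_rf, y_rf = make_classification(n_samples=200, n_features=6, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0)
forest.fit(X_rf, y_rf)

# Average the class probabilities predicted by each individual tree
tree_probs = np.mean(
    [tree.predict_proba(X_rf) for tree in forest.estimators_], axis=0
)

# Compare with the probabilities reported by the forest itself
matches = np.allclose(tree_probs, forest.predict_proba(X_rf))
print(f'Forest probabilities equal the per-tree average: {matches}')
```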

In [33]:
from sklearn.ensemble import RandomForestClassifier

# Random Forest
random_forest_model = RandomForestClassifier()
accuracy, precision, recall, f1 = evaluate_model(random_forest_model, X_train, X_test, y_train, y_test)

print(f'Random Forest - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Random Forest - Accuracy: 0.8379888268156425, Precision: 0.7922077922077922, Recall: 0.8243243243243243, F1-Score: 0.8079470198675497


## 5. Support Vector Machine (SVM)

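SVM is used for classification tasks. It finds the hyperplane that separates the classes with the largest margin in feature space, and it can handle non-linear boundaries through kernel functions (scikit-learn's `SVC` defaults to the RBF kernel). Because the RBF kernel depends on distances between points, SVMs are sensitive to feature scale; the cell below fits on the unscaled features, which likely contributes to its low recall.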
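
A minimal sketch of chaining standardization and an RBF-kernel SVM in a single estimator, so that the scaler is fit on the training rows only. The synthetic data, the inflated first feature, and all variable names are assumptions for illustration; no scores for the Titanic data are implied.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data; one feature is inflated to mimic mixed scales (e.g. fares vs. 0/1 flags)
X_svm, y_svm = make_classification(n_samples=300, n_features=5, random_state=0)
X_svm[:, 0] *= 1000.0

Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    X_svm, y_svm, test_size=0.2, random_state=0
)

# The pipeline standardizes the features, then fits the RBF-kernel SVM
svm_pipeline = make_pipeline(StandardScaler(), SVC(kernel='rbf'))
svm_pipeline.fit(Xs_train, ys_train)

scaled_acc = svm_pipeline.score(Xs_test, ys_test)
print(f'Test accuracy with scaling: {scaled_acc:.3f}')
```

Because the pipeline exposes `fit` and `predict`, it could be passed to `evaluate_model` in place of a bare `SVC()`.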

In [34]:
from sklearn.svm import SVC

# Support Vector Machine
svm_model = SVC()
accuracy, precision, recall, f1 = evaluate_model(svm_model, X_train, X_test, y_train, y_test)

print(f'Support Vector Machine - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Support Vector Machine - Accuracy: 0.664804469273743, Precision: 0.7692307692307693, Recall: 0.2702702702702703, F1-Score: 0.4


## 6. K-Nearest Neighbors (KNN)

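KNN is a simple, instance-based learning algorithm. It classifies a data point by the majority class among its k nearest neighbors (scikit-learn defaults to k = 5). It is easy to implement, but prediction can be expensive because each query is compared against the stored training points, and, being distance-based, it is sensitive to feature scale.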
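
The majority vote fits in a few lines of NumPy. The helper `knn_predict` and the two toy clusters below are illustrative assumptions, not the scikit-learn implementation.

```python
from collections import Counter

import numpy as np

def knn_predict(train_points, train_labels, query, k=3):
    """Return the majority label among the k training points closest to query."""
    distances = np.linalg.norm(train_points - query, axis=1)  # Euclidean distances
    nearest = np.argsort(distances)[:k]                       # indices of the k closest
    votes = Counter(train_labels[nearest].tolist())
    return votes.most_common(1)[0][0]

# Two well-separated toy clusters (assumed data)
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(points, labels, np.array([0.5, 0.5])))  # query near the first cluster
print(knn_predict(points, labels, np.array([5.5, 5.5])))  # query near the second cluster
```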

In [35]:
from sklearn.neighbors import KNeighborsClassifier

# K-Nearest Neighbors
knn_model = KNeighborsClassifier()
accuracy, precision, recall, f1 = evaluate_model(knn_model, X_train, X_test, y_train, y_test)

print(f'K-Nearest Neighbors - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

K-Nearest Neighbors - Accuracy: 0.7206703910614525, Precision: 0.7142857142857143, Recall: 0.5405405405405406, F1-Score: 0.6153846153846154


## 7. Naive Bayes

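Naive Bayes is a probabilistic classifier based on Bayes' theorem. It assumes the features are conditionally independent given the class, which makes it fast to train and workable on high-dimensional data. The `GaussianNB` variant used here additionally models each feature as normally distributed within a class, an assumption that fits binary dummy columns poorly.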
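
A minimal sketch of the Gaussian variant, assuming each feature is normally distributed within a class. The helper names, the toy data, and the small variance floor of `1e-9` are assumptions for the example and differ in detail from scikit-learn's `GaussianNB`.

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate a prior, per-feature means, and per-feature variances for each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # A small constant keeps the variance strictly positive
        params[int(c)] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return params

def predict_gaussian_nb(params, x):
    """Pick the class with the highest log prior plus summed log likelihoods."""
    scores = {}
    for c, (prior, mean, var) in params.items():
        # Independence assumption: per-feature log densities simply add up
        log_lik = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)
        scores[c] = np.log(prior) + log_lik
    return max(scores, key=scores.get)

# Toy data: class 0 clusters near (1, 1), class 1 near (5, 5)
X_nb = np.array([[1.0, 1.2], [1.2, 0.8], [0.8, 1.0],
                 [5.0, 5.2], [5.2, 4.8], [4.8, 5.0]])
y_nb = np.array([0, 0, 0, 1, 1, 1])

nb_params = fit_gaussian_nb(X_nb, y_nb)
print(predict_gaussian_nb(nb_params, np.array([1.0, 1.0])))
print(predict_gaussian_nb(nb_params, np.array([5.0, 5.0])))
```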

In [36]:
from sklearn.naive_bayes import GaussianNB

# Naive Bayes
naive_bayes_model = GaussianNB()
accuracy, precision, recall, f1 = evaluate_model(naive_bayes_model, X_train, X_test, y_train, y_test)

print(f'Naive Bayes - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Naive Bayes - Accuracy: 0.664804469273743, Precision: 1.0, Recall: 0.1891891891891892, F1-Score: 0.3181818181818182
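
The scores above show why accuracy alone can mislead. Assuming the test split holds 179 rows (20% of the 891 rows in the standard Kaggle training file, rounded up) of which 74 are survivors, the printed numbers are consistent with 14 true positives, 0 false positives, 60 false negatives, and 105 true negatives. These counts are inferred from the reported scores, not read from the notebook, and the `nb_` variable names are hypothetical.

```python
# Confusion-matrix counts inferred from the reported scores (assumption)
tp, fp, fn, tn = 14, 0, 60, 105

nb_accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
nb_precision = tp / (tp + fp)                  # share of predicted survivors who survived
nb_recall = tp / (tp + fn)                     # share of actual survivors that were found
nb_f1 = 2 * nb_precision * nb_recall / (nb_precision + nb_recall)

print(f'Accuracy: {nb_accuracy:.4f}, Precision: {nb_precision:.4f}, '
      f'Recall: {nb_recall:.4f}, F1-Score: {nb_f1:.4f}')
```

Under these counts the model never predicts a survivor wrongly, which gives perfect precision, but it misses 60 of the 74 survivors, which is what drags recall and F1 down.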


## 8. Gradient Boosting

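Gradient boosting is an ensemble method that builds models sequentially. Each new model, typically a shallow decision tree, is fit to the errors (more precisely, the negative gradient of the loss) of the ensemble built so far. It is powerful but can overfit if the trees are too numerous or too deep.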
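
The sequential idea is easiest to see with squared-error loss on a regression problem, where the quantity each new tree fits is simply the current residual. The sketch below makes that simplification; `GradientBoostingClassifier` applies the same principle with a classification loss. The synthetic data, learning rate, tree depth, and number of rounds are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a noisy sine curve
rng = np.random.default_rng(0)
X_gb = rng.uniform(-3, 3, size=(200, 1))
y_gb = np.sin(X_gb[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_gb, y_gb.mean())  # start from a constant model
initial_mse = np.mean((y_gb - prediction) ** 2)

trees = []
for _ in range(50):
    residuals = y_gb - prediction  # negative gradient of the squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_gb, residuals)      # the new tree models what is still unexplained
    prediction += learning_rate * tree.predict(X_gb)
    trees.append(tree)

final_mse = np.mean((y_gb - prediction) ** 2)
print(f'Training MSE before boosting: {initial_mse:.4f}, after 50 rounds: {final_mse:.4f}')
```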

In [37]:
from sklearn.ensemble import GradientBoostingClassifier

# Gradient Boosting
gradient_boosting_model = GradientBoostingClassifier()
accuracy, precision, recall, f1 = evaluate_model(gradient_boosting_model, X_train, X_test, y_train, y_test)

print(f'Gradient Boosting - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

Gradient Boosting - Accuracy: 0.8100558659217877, Precision: 0.7941176470588235, Recall: 0.7297297297297297, F1-Score: 0.7605633802816901


## 9. XGBoost

XGBoost is an optimized implementation of gradient boosting. It includes regularization to prevent overfitting. It is highly efficient and widely used in competitions.
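
The regularization enters through the training objective. As described in the XGBoost documentation, for an ensemble of $K$ trees the objective adds a complexity penalty for each tree to the loss over the $n$ training examples, where $T$ is the number of leaves of a tree, $w_j$ are its leaf weights, and $\gamma$ and $\lambda$ set the penalty strength:

$$
\mathrm{Obj} = \sum_{i=1}^{n} l(y_i, \hat{y}_i) + \sum_{k=1}^{K} \Omega(f_k),
\qquad
\Omega(f) = \gamma T + \frac{1}{2} \lambda \sum_{j=1}^{T} w_j^2
$$

In the scikit-learn style wrapper, $\gamma$ and $\lambda$ correspond to the `gamma` and `reg_lambda` parameters of `XGBClassifier`.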

In [16]:
from xgboost import XGBClassifier

# XGBoost, evaluated on the Titanic train/test split from Section 1
# (this cell's execution count predates cells 30-37, so re-run it to refresh the scores below)
xgboost_model = XGBClassifier()
accuracy, precision, recall, f1 = evaluate_model(xgboost_model, X_train, X_test, y_train, y_test)

print(f'XGBoost - Accuracy: {accuracy}, Precision: {precision}, Recall: {recall}, F1-Score: {f1}')

XGBoost - Accuracy: 1.0, Precision: 1.0, Recall: 1.0, F1-Score: 1.0
