# Titanic Extra Credit Analysis


## Introduction
This notebook analyzes the Titanic dataset. It cleans the data, visualizes the age distribution and survival counts by gender, and then compares three classifiers (logistic regression, k-nearest neighbors, and a decision tree) by their accuracy under 5-fold cross-validation.


In [5]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import KFold
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score



## Load the Titanic Dataset

In [None]:

# Load the dataset
file_path = "train_and_test2.csv"
titanic_data = pd.read_csv(file_path)

# Display first few rows
titanic_data.head()


## Data Cleaning

In [None]:

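def clean_data(frame):
    # Normalize common missing-value placeholders to NaN
    frame = frame.replace(['N/A', 'NULL', '?', 'None', 'n/a'], np.nan)
    # Assign the filled columns back; in-place fills on a column selection
    # are deprecated in recent pandas and may not modify the frame
    frame['Age'] = frame['Age'].fillna(frame['Age'].mean())
    frame['Embarked'] = frame['Embarked'].fillna(frame['Embarked'].mode()[0])
    # Drop the identifier column if present
    frame = frame.drop(columns=['Passengerid'], errors='ignore')

    return frame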

titanic_data = clean_data(titanic_data)
titanic_data.head()
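
One way to confirm that the cleaning step did what was intended is to count missing values per column before and after the fills. The following is a minimal sketch on a small made-up frame (the values are illustrative and not taken from the Titanic file; only the column names mirror the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": [2.0, 2.0, np.nan, 0.0],
})

# Missing values per column before cleaning.
missing_before = toy.isna().sum()

# Fill Age with the column mean and Embarked with the most frequent value.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Missing values per column after cleaning.
missing_after = toy.isna().sum()
print(missing_before.to_dict(), missing_after.to_dict())
```

The same `isna().sum()` check can be run on `titanic_data` directly after calling `clean_data`.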


## Data Visualization

In [None]:

plt.figure(figsize=(10, 6))
sns.histplot(titanic_data['Age'], bins=20, kde=True)
plt.title("Age Distribution of Titanic Passengers")
plt.xlabel("Age")
plt.ylabel("Count")
plt.show()


In [None]:

plt.figure(figsize=(8, 6))
sns.countplot(x='Sex', hue='2urvived', data=titanic_data)
plt.title("Survival Count by Gender")
plt.xlabel("Gender")
plt.ylabel("Count")
plt.legend(title="Survived", labels=["No", "Yes"])
plt.show()
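
The count plot shows absolute numbers; a per-group survival rate can complement it. A minimal sketch, assuming both `Sex` and `2urvived` are coded as 0/1 (the rows below are made up for illustration and are not from the dataset):

```python
import pandas as pd

# Made-up rows for illustration; 'Sex' and '2urvived' follow the notebook's column names.
toy = pd.DataFrame({
    "Sex": [0, 0, 0, 1, 1, 1, 1, 0],
    "2urvived": [0, 0, 1, 1, 1, 0, 1, 0],
})

# The mean of a 0/1 column within each group is that group's survival rate.
survival_rate = toy.groupby("Sex")["2urvived"].mean()
print(survival_rate)
```

Replacing `toy` with `titanic_data` would give the rates for the real data.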


## Modeling

In [None]:

def cross_fold_validation(classifier, frame, folds):
    kf = KFold(n_splits=folds, shuffle=True, random_state=42)
    X = frame.drop(columns=['2urvived']).copy()
    y = frame['2urvived'].copy()
    accuracy_scores = []

    for train_index, test_index in kf.split(X):
        X_train, X_test = X.iloc[train_index], X.iloc[test_index]
        y_train, y_test = y.iloc[train_index], y.iloc[test_index]
        # Fit the imputer and scaler on the training fold only, so that no
        # statistics from the held-out fold leak into preprocessing
        imputer = SimpleImputer(strategy='mean')
        scaler = StandardScaler()
        X_train = scaler.fit_transform(imputer.fit_transform(X_train))
        X_test = scaler.transform(imputer.transform(X_test))
        classifier.fit(X_train, y_train)
        y_pred = classifier.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_scores.append(accuracy)

    return accuracy_scores

classifiers = {
    "Logistic Regression": LogisticRegression(),
    "K-Nearest Neighbor": KNeighborsClassifier(n_neighbors=5),
    "Decision Tree": DecisionTreeClassifier(max_depth=5)
}

model_results = {}
for name, clf in classifiers.items():
    scores = cross_fold_validation(clf, titanic_data, 5)
    model_results[name] = (np.mean(scores), np.std(scores))

# Display results
pd.DataFrame(model_results, index=["Mean Accuracy", "Std Dev"]).T
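
The manual fold loop can also be expressed with scikit-learn's `Pipeline` and `cross_val_score`, which refit the preprocessing steps on each training fold automatically. This is a sketch of that alternative, not the notebook's own method; it uses synthetic data from `make_classification` as a stand-in for the Titanic features and the `2urvived` target:

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# The pipeline refits the imputer and scaler on each training fold.
model = make_pipeline(
    SimpleImputer(strategy="mean"),
    StandardScaler(),
    LogisticRegression(),
)

# Same fold settings as the notebook's KFold call.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
print(scores.mean(), scores.std())
```

Swapping `LogisticRegression()` for another estimator reuses the same preprocessing and fold setup.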


## Conclusion


Based on the analysis, K-Nearest Neighbor achieved the highest mean accuracy of the three classifiers. The Decision Tree had the highest standard deviation, meaning its accuracy varied the most from fold to fold.
Future improvements could involve feature engineering, hyperparameter tuning, and testing additional classifiers.
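
As one possible starting point for hyperparameter tuning, `GridSearchCV` can search over `n_neighbors` for the k-nearest neighbors model. The sketch below again uses synthetic stand-in data, and the candidate grid `[3, 5, 7, 9]` is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption for this sketch, not the Titanic data).
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# Scale inside the pipeline so each CV split is scaled on its training part only.
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# make_pipeline names each step after its lowercased class name.
param_grid = {"kneighborsclassifier__n_neighbors": [3, 5, 7, 9]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, search.best_score_)
```

The selected value depends on the data, so the search would need to be rerun on the cleaned Titanic features before drawing any conclusion.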
