# Titanic Survival Prediction (Iris Format)

This notebook demonstrates the process of predicting survival on the Titanic, following the educational format of the Iris flower classification project.

### Install Dependencies

Installs the necessary library `kagglehub` to download the Titanic dataset.

In [None]:
%pip install kagglehub

### Import Libraries and Load Data

Imports libraries for data analysis (pandas, numpy) and visualization (seaborn, matplotlib), and loads the Titanic dataset.

In [None]:
import kagglehub
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from kagglehub import KaggleDatasetAdapter

file_path = "Titanic-Dataset.csv"

df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "yasserh/titanic-dataset",
  file_path,
)

### Dataset Information

Displays the structure and summary of the DataFrame using `df.info()`.

In [None]:
df.info()

### Dataset Shape

Prints the dimensions (rows, columns) of the Titanic dataset.

In [None]:
df.shape

### Descriptive Statistics

Provides a statistical summary of the numerical columns in the dataset.

In [None]:
df.describe()

### Check for Missing Data

Counts the null values in each column to identify where data cleaning is required.

In [None]:
df.isnull().sum()
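Raw counts can be hard to compare across columns, so the share of missing values per column is sometimes easier to read. A minimal sketch on a made-up frame (the `toy` frame and its values are illustrative and are not taken from the Titanic data):

```python
import numpy as np
import pandas as pd

# Illustrative frame, not the Titanic data
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 38.0, np.nan],
    'Cabin': [np.nan, np.nan, np.nan, 'C85'],
    'Sex': ['male', 'female', 'female', 'male'],
})

missing_count = toy.isnull().sum()        # number of nulls per column
missing_pct = toy.isnull().mean() * 100   # share of nulls per column, in percent

print(pd.DataFrame({'count': missing_count, 'percent': missing_pct}))
```

The same two expressions could be applied to `df` to see which columns are mostly empty.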

### Drop Cabin Column

Removes the 'Cabin' column because it contains too many missing values to be useful.

In [None]:
df = df.drop(columns='Cabin')

### Initialize Age Groups

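Creates a categorical 'Age group' feature from passenger ages: 'Child' (under 18), 'Young Adult' (18 to 29), 'Adult' (30 to 59), and 'Elder' (60 and over). Missing ages are labeled 'Unknown', because every comparison against NaN is false and the value falls through to the `pd.isna` check.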

In [None]:
df['Age group'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Young Adult' if x < 30 else 'Adult' if x < 60 else 'Unknown' if pd.isna(x) else 'Elder')
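The chained conditional above can be hard to scan. One possible alternative, shown here only as a sketch, is `pd.cut` with half-open bins; the `ages` Series is made-up sample data, and the bin edges are copied from the thresholds in the lambda.

```python
import numpy as np
import pandas as pd

# Made-up ages, including boundary values and one missing entry
ages = pd.Series([5, 17.9, 18, 29, 30, 59, 60, 80, np.nan])

bins = [-np.inf, 18, 30, 60, np.inf]
labels = ['Child', 'Young Adult', 'Adult', 'Elder']

# right=False makes each bin include its left edge: [18, 30), [30, 60), ...
age_group = pd.cut(ages, bins=bins, labels=labels, right=False)

# Missing ages come back as NaN; label them 'Unknown' like the lambda does
age_group = age_group.cat.add_categories('Unknown').fillna('Unknown')

print(age_group.tolist())
```

Unlike the lambda, this returns a categorical column, which keeps the group order fixed when plotting.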

### Fill Missing Age Values

Uses the median age to fill in gaps in the 'Age' column.

In [None]:
median_age = df['Age'].median()
# Assign back instead of inplace=True on a column selection, which newer pandas warns about
df['Age'] = df['Age'].fillna(median_age)
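As a small illustration of median imputation, the sketch below fills gaps in a made-up `toy` frame and assigns the filled column back to the frame. The values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up ages with two gaps
toy = pd.DataFrame({'Age': [22.0, np.nan, 38.0, np.nan, 26.0]})

toy_median = toy['Age'].median()              # median() skips NaN by default
toy['Age'] = toy['Age'].fillna(toy_median)    # assign the filled column back

print(toy_median)
print(toy['Age'].tolist())
```

Because every gap receives the same value, the imputed rows all land in one age group once the groups are recalculated.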

### Verify Age Imputation

Confirms that there are no longer any null values in the 'Age' column.

In [None]:
df['Age'].isnull().sum()

### View Processed Data Samples

Displays the first 10 rows of the DataFrame after the age imputation.

In [None]:
df.head(10)

### Re-calculate Age Groups

Updates the 'Age group' values to reflect the newly imputed age data.

In [None]:
df['Age group'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Young Adult' if x < 30 else 'Adult' if x < 60 else 'Unknown' if pd.isna(x) else 'Elder')

### Confirm Categorization

Checks for any remaining 'Unknown' entries in the age groups.

In [None]:
(df['Age group'] == 'Unknown').sum()

### Check Embarked Nulls

Examines missing values in the 'Embarked' column.

In [None]:
df['Embarked'].isnull().sum()

### Port Distribution

Analyzes the frequency of each embarkation port.

In [None]:
df['Embarked'].value_counts()

### Fill Embarked with Mode

Imputes missing 'Embarked' data using the most common port (Southampton).

In [None]:
most_common_port = df['Embarked'].mode().values[0]
df['Embarked'] = df['Embarked'].fillna(most_common_port)

### Verify Cleaned Columns

Final check for any remaining nulls in 'Embarked'.

In [None]:
df['Embarked'].isnull().sum()

### Final Null Count Check

Confirms the entire dataset is now clean and ready for analysis.

In [None]:
df.isnull().sum()

### Dataset Preview

Views the head of the fully cleaned DataFrame.

In [None]:
df.head()

### Feature Selection Analysis

Reviews available columns to decide which features to include in the model.

In [None]:
df.columns

### Map Survival Labels

Adds a 'Survived-text' column for easier labeling in plots.

In [None]:
df['Survived-text'] = df['Survived'].apply(lambda x: 'Survived' if x == 1 else 'Not Survived')
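An explicit lookup table is another way to express the same relabeling. A minimal sketch using `Series.map` on a made-up `toy` frame (the `label_map` name is an assumption for illustration):

```python
import pandas as pd

# Made-up outcomes, not the Titanic data
toy = pd.DataFrame({'Survived': [0, 1, 1, 0]})

# Hypothetical lookup table from the numeric code to the plot label
label_map = {0: 'Not Survived', 1: 'Survived'}
toy['Survived-text'] = toy['Survived'].map(label_map)

print(toy)
```

One difference from the lambda: any code missing from `label_map` becomes NaN instead of silently turning into 'Not Survived'.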

### Survival by Class Plot

Generates a count plot to visualize the number of survivors and non-survivors in each passenger class.

In [None]:
sns.countplot(x='Pclass', hue='Survived-text', data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Survival by Gender Plot

Visualizes the survival count distribution between male and female passengers.

In [None]:
sns.countplot(x='Sex', hue='Survived-text', data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Survival by Age Group Plot

Displays survival counts for each defined age category.

In [None]:
sns.countplot(x='Age group', hue='Survived-text', data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Calculate Class Survival Probabilities

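Computes the percentage of survivors in each ticket class. These rates are calculated on the full dataset, before the train/test split, so they include rows that later end up in the test set.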

In [None]:
survivor_pclass_1 = ((df['Pclass'] == 1) & (df['Survived'] == 1)).sum()
not_survivor_pclass_1 = ((df['Pclass'] == 1) & (df['Survived'] == 0)).sum()

survivor_pclass_2 = ((df['Pclass'] == 2) & (df['Survived'] == 1)).sum()
not_survivor_pclass_2 = ((df['Pclass'] == 2) & (df['Survived'] == 0)).sum()

survivor_pclass_3 = ((df['Pclass'] == 3) & (df['Survived'] == 1)).sum()
not_survivor_pclass_3 = ((df['Pclass'] == 3) & (df['Survived'] == 0)).sum()

def calculate_survival_rate(survivor_count, not_survivor):
    return (survivor_count / (survivor_count + not_survivor)) * 100

Pclass_1_rate = calculate_survival_rate(survivor_pclass_1, not_survivor_pclass_1)
Pclass_2_rate = calculate_survival_rate(survivor_pclass_2, not_survivor_pclass_2)
Pclass_3_rate = calculate_survival_rate(survivor_pclass_3, not_survivor_pclass_3)
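Since 'Survived' is coded as 0 and 1, its mean within a group is that group's survival rate. The sketch below shows how the same percentages could be obtained with `groupby`; the `toy` frame is made-up data, chosen so that the rates come out to round numbers.

```python
import pandas as pd

# Made-up data: 3 of 4 survive in class 1, 1 of 2 in class 2, 1 of 4 in class 3
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 1, 0, 1, 0, 0, 0, 0, 1],
})

# Mean of a 0/1 column per group, scaled to percent
class_rates = toy.groupby('Pclass')['Survived'].mean() * 100

print(class_rates)
```

Swapping `'Pclass'` for another column name would give the rates for that feature without a separate pair of counters per category.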

### Calculate Gender Survival Probabilities

Calculates survival rates separately for men and women.

In [None]:
survivor_male = ((df['Sex'] == 'male') & (df['Survived'] == 1)).sum()
not_survivor_male = ((df['Sex'] == 'male') & (df['Survived'] == 0)).sum()
survivor_female = ((df['Sex'] == 'female') & (df['Survived'] == 1)).sum()
not_survivor_female = ((df['Sex'] == 'female') & (df['Survived'] == 0)).sum()

Male_rate = calculate_survival_rate(survivor_male, not_survivor_male)
Female_rate = calculate_survival_rate(survivor_female, not_survivor_female)

### Calculate Age Group Probabilities

Computes survival percentages for children, young adults, adults, and elders.

In [None]:
survivor_child = ((df['Age group'] == 'Child') & (df['Survived'] == 1)).sum()
not_survivor_child = ((df['Age group'] == 'Child') & (df['Survived'] == 0)).sum()

survivor_young = ((df['Age group'] == 'Young Adult') & (df['Survived'] == 1)).sum()
not_survivor_young = ((df['Age group'] == 'Young Adult') & (df['Survived'] == 0)).sum()

survivor_adult = ((df['Age group'] == 'Adult') & (df['Survived'] == 1)).sum()
not_survivor_adult = ((df['Age group'] == 'Adult') & (df['Survived'] == 0)).sum()

survivor_elder = ((df['Age group'] == 'Elder') & (df['Survived'] == 1)).sum()
not_survivor_elder = ((df['Age group'] == 'Elder') & (df['Survived'] == 0)).sum()

Child_rate = calculate_survival_rate(survivor_child, not_survivor_child)
Young_rate = calculate_survival_rate(survivor_young, not_survivor_young)
Adult_rate = calculate_survival_rate(survivor_adult, not_survivor_adult)
Elder_rate = calculate_survival_rate(survivor_elder, not_survivor_elder)

### Split Training and Test Data

Divides the dataset into a 75% training set and a 25% test set.

In [None]:
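from sklearn.model_selection import train_test_split

# Fix random_state so the split is reproducible between runs; 42 is an arbitrary seed
df_train, df_test = train_test_split(df, train_size=0.75, random_state=42)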
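The split is random, so row assignments vary between runs unless a seed is fixed. The sketch below assumes that reproducibility and a similar survivor share in both parts are wanted: `random_state` fixes the shuffle and `stratify` balances the classes. The `toy` frame and the seed value are arbitrary choices for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up frame with 20 rows and an even split of outcomes
toy = pd.DataFrame({
    'Pclass':   [1, 2, 3, 3] * 5,
    'Survived': [1, 0] * 10,
})

toy_train, toy_test = train_test_split(
    toy,
    train_size=0.75,
    random_state=0,             # arbitrary seed, makes the split repeatable
    stratify=toy['Survived'],   # keep the survivor share similar in both parts
)

print(toy_train.shape, toy_test.shape)
```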

### Check Training Size

Displays the number of records used for training the model.

In [None]:
df_train.shape

### Check Testing Size

Displays the number of records reserved for model evaluation.

In [None]:
df_test.shape

### Prepare Target Variables

Separates the features from the target variable 'Survived' for the training set.

In [None]:
x_train = df_train.drop(columns=['Survived', 'Survived-text'])
y_train = df_train['Survived']

### Manual Prediction Model

Defines a function that predicts survival by averaging the survival rates for the passenger's class, sex, and age group, and predicts survival (1) when the average is at least 50%.

In [None]:
def survival_rate_model(Pclass,Sex,AgeG):
  # Pclass survival rate
  pclass_rate = 0
  if Pclass == 1:
    pclass_rate = Pclass_1_rate
  elif Pclass == 2:
    pclass_rate = Pclass_2_rate
  elif Pclass == 3:
    pclass_rate = Pclass_3_rate

  # Sex survival rate
  sex_rate = 0
  if Sex == 'male':
    sex_rate = Male_rate
  elif Sex == 'female':
    sex_rate = Female_rate

  # Age group survival rate
  age_group_rate = 0
  if AgeG == 'Child':
    age_group_rate = Child_rate
  elif AgeG == 'Young Adult':
    age_group_rate = Young_rate
  elif AgeG == 'Adult':
    age_group_rate = Adult_rate
  elif AgeG == 'Elder':
    age_group_rate = Elder_rate

  combined_rate = (pclass_rate + sex_rate + age_group_rate) / 3
  return 1 if combined_rate >= 50 else 0
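To make the averaging rule concrete, here is a compact dictionary-based sketch of the same idea with worked numbers. The rates in `rates` are hypothetical round values chosen for illustration, not figures computed from the dataset, and `predict_from_rates` is an illustrative name.

```python
# Hypothetical survival rates in percent, for illustration only
rates = {
    'Pclass': {1: 60.0, 2: 45.0, 3: 25.0},
    'Sex': {'male': 20.0, 'female': 75.0},
    'Age group': {'Child': 55.0, 'Young Adult': 35.0, 'Adult': 40.0, 'Elder': 25.0},
}

def predict_from_rates(pclass, sex, age_group, rates=rates):
    """Average the three per-feature rates and threshold the result at 50%."""
    combined = (
        rates['Pclass'].get(pclass, 0)          # unknown values contribute 0,
        + rates['Sex'].get(sex, 0)              # as in the if/elif version
        + rates['Age group'].get(age_group, 0)
    ) / 3
    return 1 if combined >= 50 else 0

print(predict_from_rates(1, 'female', 'Adult'))  # (60 + 75 + 40) / 3 = 58.33 -> 1
print(predict_from_rates(3, 'male', 'Adult'))    # (25 + 20 + 40) / 3 = 28.33 -> 0
print(predict_from_rates(1, 'male', 'Child'))    # (60 + 20 + 55) / 3 = 45.00 -> 0
```

Because each feature carries equal weight, a single low rate can pull the average below the threshold even when the other two are above it, as the third call shows.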

### Run Model Predictions

Applies the manual model to the entire training set to generate predictions.

In [None]:
simple_model_prediction = np.array([survival_rate_model(Pclass, Sex, AgeG) for Pclass, Sex, AgeG in zip(df_train['Pclass'], df_train['Sex'], df_train['Age group'])])

### Accuracy Check

Compares model predictions with actual outcomes in the training set.

In [None]:
simple_model_prediction == y_train

### Print Model Accuracy

Calculates and prints the accuracy of the manual prediction model on the training set, as a percentage.

In [None]:
print(f"Accuracy: {(np.mean(simple_model_prediction == y_train))*100:.2f}%")
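The accuracy figure is the share of rows where the prediction equals the actual outcome. A minimal sketch of that calculation on made-up arrays (`accuracy_pct` is a hypothetical helper name, and the values are illustrative):

```python
import numpy as np

def accuracy_pct(predicted, actual):
    """Percentage of positions where predicted and actual agree."""
    return np.mean(np.asarray(predicted) == np.asarray(actual)) * 100

# Made-up predictions and outcomes: 6 of the 8 positions match
predicted = np.array([1, 0, 0, 1, 0, 1, 0, 0])
actual    = np.array([1, 0, 1, 1, 0, 0, 0, 0])

print(f"Accuracy: {accuracy_pct(predicted, actual):.2f}%")
```

The same helper could be applied to predictions for `df_test`, which is set aside earlier but not scored in this section.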