# Titanic Survival Prediction

### Install Dependencies

We begin by installing the core libraries for our analysis:
- `pandas`: For data manipulation and analysis.
- `numpy`: For numerical operations.
- `seaborn` & `matplotlib`: For creating data visualizations.
- `scikit-learn`: For machine learning algorithms.
- `kagglehub`: For downloading the dataset from Kaggle.

In [None]:
%pip install kagglehub numpy pandas seaborn kaggle scikit-learn matplotlib

### Import Libraries and Load Data

We import our libraries and use `kagglehub.load_dataset` with the `KaggleDatasetAdapter.PANDAS` adapter to fetch the Titanic survival data. The data is loaded into a pandas DataFrame (`df`), which is the standard structure for tabular data analysis in Python.

In [None]:
import kagglehub
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from kagglehub import KaggleDatasetAdapter

file_path = "Titanic-Dataset.csv"

df = kagglehub.load_dataset(
  KaggleDatasetAdapter.PANDAS,
  "yasserh/titanic-dataset",
  file_path,
)

### Dataset Schema Analysis

The `df.info()` method provides a high-level overview of our DataFrame. It shows the number of entries (rows), the column names, and the data type of each column (e.g., `int64`, `object`, `float64`). It also reveals missing values via the 'Non-Null Count'.

In [None]:
df.info()

### Dataset Dimensions

Using `df.shape`, we can see the exact dimensions of our dataset as a tuple: (number of rows, number of columns). This helps us understand the scale of the data we are working with.

In [None]:
df.shape

### Statistical Summary

The `df.describe()` method calculates descriptive statistics for all numerical columns. It includes the count, mean, standard deviation, minimum, maximum, and the 25th, 50th (median), and 75th percentiles.

In [None]:
df.describe()

### Missing Data Detection

We use `df.isnull()` to create a boolean mask where `True` represents a missing value. Calling `.sum()` then aggregates these values, giving us a count of missing entries for every column.

In [None]:
df.isnull().sum()
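
As a small illustration on made-up data (the `toy_missing` frame below is invented and is not the Titanic dataset), the same pattern can also report the share of missing values per column, because `.mean()` on a boolean mask gives the fraction of `True` entries:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy_missing = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Cabin': [None, None, None, 'C85'],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

missing_count = toy_missing.isnull().sum()         # missing entries per column
missing_share = toy_missing.isnull().mean() * 100  # percentage missing per column

print(missing_count)
print(missing_share)
```

A percentage is often easier to judge than a raw count when deciding whether a column is worth keeping.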

### Data Cleaning: Dropping Irrelevant Columns

The 'Cabin' column contains a high percentage of missing values (over 75%). Columns with such high sparsity often add noise rather than signal to a model, so we remove it using `.drop()`.

In [None]:
df = df.drop(columns='Cabin')

### Feature Transformation: Initial Age Grouping

We use a `lambda` function with chained conditional expressions to group passengers by age. This turns the continuous age data into discrete buckets: 'Child' (under 18), 'Young Adult' (18 to under 30), 'Adult' (30 to under 60), and 'Elder' (60 and over), with 'Unknown' for missing ages.

In [None]:
df['Age group'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Young Adult' if x < 30 else 'Adult' if x < 60 else 'Unknown' if pd.isna(x) else 'Elder')
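
A possible alternative to the chained conditional, sketched here on invented ages, is `pd.cut`, which assigns each value to a labeled interval. The names `toy_ages`, `age_bins`, and `age_labels` are hypothetical, and the bin edges are chosen to mirror the thresholds above:

```python
import numpy as np
import pandas as pd

# Invented ages, including boundary values and one missing entry
toy_ages = pd.Series([5, 17.9, 18, 29, 30, 59, 60, 80, np.nan])

age_bins = [0, 18, 30, 60, np.inf]
age_labels = ['Child', 'Young Adult', 'Adult', 'Elder']

# right=False makes each interval closed on the left: [0, 18), [18, 30), ...
toy_groups = pd.cut(toy_ages, bins=age_bins, labels=age_labels, right=False)

# Missing ages stay NaN after pd.cut, so label them explicitly
toy_groups = toy_groups.cat.add_categories('Unknown').fillna('Unknown')

print(toy_groups.tolist())
```

Keeping the edges in one list makes the bucket boundaries easier to read and change than a nested conditional.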

### Handling Missing Values: Median Imputation

We calculate the `median()` age to fill in missing values. The median is often preferred over the mean because it is less sensitive to extreme outliers. We then use `.fillna()` to replace all nulls with this value.

In [None]:
median_age = df['Age'].median()
df['Age'] = df['Age'].fillna(median_age)
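
To see why the median is less sensitive to outliers, consider this small sketch on invented values (the `toy_` names are hypothetical and do not touch `df`):

```python
import numpy as np
import pandas as pd

# Invented ages: four typical values, one outlier, one missing entry
toy_age_values = pd.Series([22.0, 25.0, 28.0, 30.0, 80.0, np.nan])

toy_mean = toy_age_values.mean()      # pulled upward by the 80-year-old
toy_median = toy_age_values.median()  # middle value of the non-missing ages

toy_filled = toy_age_values.fillna(toy_median)

print(f"mean={toy_mean}, median={toy_median}")
print(toy_filled.tolist())
```

Here the single outlier moves the mean to 37 while the median stays at 28, which is closer to a typical passenger in this toy sample.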

### Verification: Age Imputation

We run a final null check on the 'Age' column to confirm that our imputation was successful and no missing values remain.

In [None]:
df['Age'].isnull().sum()


### Data Inspection

We display the first 10 rows of the DataFrame using `.head(10)` to visually inspect the results of our cleaning and feature engineering so far.

In [None]:
df.head(10)

In [None]:
df['Age group'] = df['Age'].apply(lambda x: 'Child' if x < 18 else 'Young Adult' if x < 30 else 'Adult' if x < 60 else 'Unknown' if pd.isna(x) else 'Elder')

In [None]:
(df['Age group'] == 'Unknown').sum()

In [None]:
df['Embarked'].isnull().sum()

### Categorical Data Analysis

We examine the distribution of the 'Embarked' column using `.value_counts()`. This tells us which ports passengers departed from and identifies the most frequent starting point.

In [None]:
df['Embarked'].value_counts()

### Imputing Categorical Data

For categorical columns like 'Embarked', we fill missing values with the `mode()` (the most frequent value). Since `.mode()` returns a Series (it can hold several values when there is a tie), we use `.values[0]` to take the first one.

In [None]:
most_common_port = df['Embarked'].mode().values[0]
df['Embarked'] = df['Embarked'].fillna(most_common_port)
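
The following sketch, on invented port codes, shows why `.mode()` returns a Series rather than a single value. The `toy_ports` and `toy_tied` names are hypothetical:

```python
import pandas as pd

# Invented port codes with one missing entry
toy_ports = pd.Series(['S', 'C', 'S', None, 'Q', 'S'])

toy_port_mode = toy_ports.mode()  # missing values are ignored by default
toy_ports_filled = toy_ports.fillna(toy_port_mode.iloc[0])

# With a tie, mode() returns every value that shares the top count
toy_tied = pd.Series(['S', 'C', 'S', 'C'])
toy_tied_modes = toy_tied.mode().tolist()

print(toy_port_mode.tolist(), toy_ports_filled.tolist(), toy_tied_modes)
```

Taking the first element is a reasonable default, but with a tie it silently picks one of the tied values, so it can be worth checking the length of the result first.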

In [None]:
df['Embarked'].isnull().sum()

In [None]:
df.isnull().sum()

In [None]:
df.head()

In [None]:
df.columns

### Label Engineering for Visuals

We create a more descriptive version of the 'Survived' column. Mapping 0 to 'Not Survived' and 1 to 'Survived' gives our charts self-explanatory legend entries, so readers do not have to remember what 0 and 1 mean.

In [None]:
df['Survived-text'] = df['Survived'].apply(lambda x: 'Survived' if x == 1 else 'Not Survived')
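
The same relabeling could also be written with `.map()` and a dictionary, sketched below on invented values (`toy_survived` and `survival_labels` are hypothetical names). One difference worth knowing: `.map()` yields `NaN` for any value missing from the dictionary, whereas the `lambda` above labels everything other than 1 as 'Not Survived'.

```python
import pandas as pd

survival_labels = {0: 'Not Survived', 1: 'Survived'}

# Invented outcome codes
toy_survived = pd.Series([0, 1, 1, 0])
toy_survived_text = toy_survived.map(survival_labels)

# A code outside the dictionary becomes NaN instead of being mislabeled
toy_unexpected = pd.Series([0, 1, 2]).map(survival_labels)

print(toy_survived_text.tolist())
print(toy_unexpected.tolist())
```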

### Visualization: Survival by Passenger Class

We use Seaborn's `countplot` to analyze survival distributions across different classes. The 'hue' parameter allows us to split the bars by survival status, revealing that higher-class passengers had a significantly better chance of survival.

In [None]:
sns.countplot(x='Pclass', hue='Survived-text', data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Visualization: Survival by Gender

This count plot visualizes the 'women and children first' observation. The bars show counts rather than rates, but the contrast in outcomes between male and female passengers is stark.

In [None]:
sns.countplot(x='Sex', hue='Survived-text',data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Visualization: Survival by Age Group

We visualize survival across our engineered age buckets. This helps us see if certain age groups, such as children, were prioritized during the evacuation.

In [None]:
sns.countplot(x='Age group',hue='Survived-text',data=df, palette={'Survived': 'green', 'Not Survived': 'red'}, hue_order=['Not Survived', 'Survived'])
plt.show()

### Manual Feature Analysis

We manually calculate the survivor counts for specific groups using boolean indexing. Example: `(df['Pclass'] == 1) & (df['Survived'] == 1)` identifies survivors in the 1st class.

In [None]:
survivor_pclass_1 = ((df['Pclass'] == 1) & (df['Survived'] == 1)).sum()
not_survivor_pclass_1 = ((df['Pclass'] == 1) & (df['Survived'] == 0)).sum()

survivor_pclass_2 = ((df['Pclass'] == 2) & (df['Survived'] == 1)).sum()
not_survivor_pclass_2 = ((df['Pclass'] == 2) & (df['Survived'] == 0)).sum()

survivor_pclass_3 = ((df['Pclass'] == 3) & (df['Survived'] == 1)).sum()
not_survivor_pclass_3 = ((df['Pclass'] == 3) & (df['Survived'] == 0)).sum()

def calculate_survival_rate(survivor_count, not_survivor):
    return (survivor_count / (survivor_count + not_survivor)) * 100

Pclass_1_rate = calculate_survival_rate(survivor_pclass_1, not_survivor_pclass_1)
Pclass_2_rate = calculate_survival_rate(survivor_pclass_2, not_survivor_pclass_2)
Pclass_3_rate = calculate_survival_rate(survivor_pclass_3, not_survivor_pclass_3)
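
Because 'Survived' is coded as 0 and 1, its mean within a group equals that group's survival rate. The sketch below, on an invented `toy_passengers` frame, shows how `groupby` could produce the same percentages in a single step:

```python
import pandas as pd

# Invented passengers: class 1 has 2 of 3 survivors, class 2 has 1 of 2, class 3 has 1 of 4
toy_passengers = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 1, 0],
})

# Mean of a 0/1 column = share of ones; multiply by 100 for a percentage
toy_rate_by_class = toy_passengers.groupby('Pclass')['Survived'].mean() * 100

print(toy_rate_by_class.round(2))
```

The explicit counts used in the notebook make each step visible, while the grouped version scales more easily to columns with many categories.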


### Gender Performance Summary

Similarly, we calculate the survival rates for male and female passengers.

In [None]:
survivor_male = ((df['Sex'] == 'male') & (df['Survived'] == 1)).sum()
not_survivor_male = ((df['Sex'] == 'male') & (df['Survived'] == 0)).sum()
survivor_female = ((df['Sex'] == 'female') & (df['Survived'] == 1)).sum()
not_survivor_female = ((df['Sex'] == 'female') & (df['Survived'] == 0)).sum()

Male_rate = calculate_survival_rate(survivor_male, not_survivor_male)
Female_rate = calculate_survival_rate(survivor_female, not_survivor_female)


### Age Group Performance Summary

Finally, we calculate the survival rate for each age category we defined earlier.

In [None]:
survivor_child = ((df['Age group'] == 'Child') & (df['Survived'] == 1)).sum()
not_survivor_child = ((df['Age group'] == 'Child') & (df['Survived'] == 0)).sum()

survivor_young = ((df['Age group'] == 'Young Adult') & (df['Survived'] == 1)).sum()
not_survivor_young = ((df['Age group'] == 'Young Adult') & (df['Survived'] == 0)).sum()

survivor_adult = ((df['Age group'] == 'Adult') & (df['Survived'] == 1)).sum()
not_survivor_adult = ((df['Age group'] == 'Adult') & (df['Survived'] == 0)).sum()

survivor_elder = ((df['Age group'] == 'Elder') & (df['Survived'] == 1)).sum()
not_survivor_elder = ((df['Age group'] == 'Elder') & (df['Survived'] == 0)).sum()

Child_rate = calculate_survival_rate(survivor_child, not_survivor_child)
Young_rate = calculate_survival_rate(survivor_young, not_survivor_young )
Adult_rate = calculate_survival_rate(survivor_adult, not_survivor_adult )
Elder_rate = calculate_survival_rate(survivor_elder, not_survivor_elder)


### Predictive Modeling: Data Splitting

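To evaluate our model fairly, we split the data into a training set (75%) and a test set (25%). The model learns from the training set, and we check its performance on the unseen test set to detect overfitting. Because no `random_state` is passed, the split (and every accuracy reported later) changes from run to run.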

In [None]:
from sklearn.model_selection import train_test_split
df_train, df_test = train_test_split(df, train_size=0.75)

In [None]:
df_train.shape

In [None]:
df_test.shape
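
If reproducible results are wanted, one option is to pass `random_state` and, optionally, `stratify` to `train_test_split`. The sketch below uses an invented `toy_split_data` frame, and the seed value 42 is arbitrary:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented data: 12 non-survivors and 8 survivors
toy_split_data = pd.DataFrame({
    'feature': range(20),
    'Survived': [0] * 12 + [1] * 8,
})

# random_state fixes the shuffle; stratify keeps the class ratio in both parts
toy_train, toy_test = train_test_split(
    toy_split_data, train_size=0.75, random_state=42,
    stratify=toy_split_data['Survived'],
)

# Repeating the call with the same seed returns the same rows
toy_train_again, toy_test_again = train_test_split(
    toy_split_data, train_size=0.75, random_state=42,
    stratify=toy_split_data['Survived'],
)

print(toy_train.shape, toy_test.shape)
print(toy_test['Survived'].value_counts().to_dict())
```

Stratifying is a design choice rather than a requirement. It mainly helps when classes are imbalanced or the dataset is small.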

### Feature Selection

We build the feature matrix by dropping the target columns ('Survived' and 'Survived-text'), the identifier-like columns ('PassengerId', 'Name', 'Ticket'), and, in this notebook, 'Fare'. The remaining columns are the model's inputs, and 'Survived' is kept separately as the label.

In [None]:
x_train = df_train.drop(columns=['Survived','Survived-text','PassengerId','Name','Ticket','Fare'])
y_train = df_train['Survived']

x_test = df_test.drop(columns=['Survived','Survived-text','PassengerId','Name','Ticket','Fare'])
y_test = df_test['Survived']

In [None]:
x_train.head()

### Categorical Encoding

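Machine learning models require numerical input. We use `.map()` to convert each category to an integer code: 'Sex' becomes binary (0 and 1), while 'Embarked' and 'Age group' get the codes 0 to 2 and 0 to 3. Note that integer codes imply an ordering, which is natural for 'Age group' but arbitrary for 'Embarked'.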

In [None]:
x_train['Sex'] = x_train['Sex'].map({'female':0,'male':1})
x_train['Embarked'] = x_train['Embarked'].map({'C':0,'Q':1,'S':2})
x_train['Age group'] = x_train['Age group'].map({'Child':0,'Young Adult':1,'Adult':2,'Elder':3})

x_test['Sex'] = x_test['Sex'].map({'female':0,'male':1})
x_test['Embarked'] = x_test['Embarked'].map({'C':0,'Q':1,'S':2})
x_test['Age group'] = x_test['Age group'].map({'Child':0,'Young Adult':1,'Adult':2,'Elder':3})

In [None]:
x_test.head()
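
For a category without a natural order, such as the port of embarkation, one-hot encoding is a common alternative to integer codes. This is a sketch on invented data, not what the notebook does, and `toy_embarked` is a hypothetical name:

```python
import pandas as pd

# Invented port codes
toy_embarked = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# Each port becomes its own indicator column, so no ordering is implied
toy_one_hot = pd.get_dummies(toy_embarked, columns=['Embarked'])

print(toy_one_hot.columns.tolist())
print(toy_one_hot)
```

Each row has exactly one active indicator, so a linear model can learn a separate weight per port instead of treating 'S' as "twice" 'Q'.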


### Simulated Model Logic

We define a manual predictive model that calculates an average survival likelihood based on three features. If the combined probability is 50% or higher, the model predicts 'Survived'.

In [None]:
def survival_rate_model(Pclass,Sex,AgeG):
  pclass_rate = 0
  if Pclass == 1:
    pclass_rate = Pclass_1_rate
  elif Pclass == 2:
    pclass_rate = Pclass_2_rate
  elif Pclass == 3:
    pclass_rate = Pclass_3_rate

  sex_rate = 0
  if Sex == 'male':
    sex_rate = Male_rate
  elif Sex == 'female':
    sex_rate = Female_rate

  age_group_rate = 0
  if AgeG == 'Child':
    age_group_rate = Child_rate
  elif AgeG == 'Young Adult':
    age_group_rate = Young_rate
  elif AgeG == 'Adult':
    age_group_rate = Adult_rate
  elif AgeG == 'Elder':
    age_group_rate = Elder_rate

  combined_rate = (pclass_rate + sex_rate + age_group_rate) / 3
  return 1 if combined_rate >= 50 else 0
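
To make the averaging rule concrete, here is a worked example with made-up rates. The numbers in `example_rates` are invented for illustration and are not the values computed from the dataset, and `average_rate_prediction` is a hypothetical helper:

```python
# Hypothetical survival rates in percent, invented for this example
example_rates = {
    'pclass': {1: 63.0, 3: 24.0},
    'sex': {'female': 74.0, 'male': 19.0},
    'age_group': {'Child': 54.0, 'Adult': 40.0},
}

def average_rate_prediction(pclass, sex, age_group, rates=example_rates):
    """Average the three group rates and predict survival at 50% or above."""
    combined = (
        rates['pclass'][pclass]
        + rates['sex'][sex]
        + rates['age_group'][age_group]
    ) / 3
    return combined, int(combined >= 50)

# (63 + 74 + 54) / 3 is about 63.67, so the rule predicts survival
print(average_rate_prediction(1, 'female', 'Child'))

# (24 + 19 + 40) / 3 is about 27.67, so the rule predicts no survival
print(average_rate_prediction(3, 'male', 'Adult'))
```

Because the three rates are weighted equally, a single strong feature cannot push the prediction over the threshold on its own unless the other two are close to 50%.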

### Manual Model Execution

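We apply our manual model logic to the entire training set to see how well it performs as a basic rule-based engine. Keep in mind that the group rates were computed on the full dataset before the split, so they already include both these rows and the test rows.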

In [None]:
simple_model_prediction = np.array([survival_rate_model(Pclass, Sex, AgeG) for Pclass, Sex, AgeG in zip(df_train['Pclass'], df_train['Sex'], df_train['Age group'])])

In [None]:
simple_model_prediction == y_train

In [None]:
print(f"Accuracy: {(np.mean(simple_model_prediction == y_train))*100:.2f}%")
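
The accuracy expression works because the mean of a boolean array is the fraction of `True` values. The sketch below, on invented predictions, also computes a majority-class baseline, which is a useful reference point for any accuracy figure (all `toy_` names are hypothetical):

```python
import numpy as np

# Invented predictions and true labels
toy_pred = np.array([1, 0, 1, 1, 0, 0, 0, 0])
toy_true = np.array([1, 0, 0, 1, 0, 0, 1, 0])

# Element-wise comparison gives True where the prediction is correct
toy_matches = toy_pred == toy_true
toy_accuracy = toy_matches.mean()

# Baseline: always predict the most frequent class in the labels
toy_majority_class = np.bincount(toy_true).argmax()
toy_baseline_accuracy = (toy_true == toy_majority_class).mean()

print(f"Accuracy: {toy_accuracy * 100:.2f}%")
print(f"Majority-class baseline: {toy_baseline_accuracy * 100:.2f}%")
```

In this toy case 6 of 8 predictions are correct (75%), compared with 62.5% for always predicting the majority class.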

In [None]:
x_train.isnull().sum()

### Advanced Modeling: Logistic Regression

We initialize a Logistic Regression model from Scikit-Learn. The `max_iter=500` parameter ensures the optimization algorithm has enough iterations to converge on a solution.

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=500)

### Machine Learning Training

The `.fit()` method is where the learning happens. The algorithm adjusts its internal parameters to find the best mathematical relationship between the input features and the known survival outcomes.

In [None]:
model.fit(x_train, y_train)
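
The "internal parameters" are one weight per feature plus an intercept, stored in `coef_` and `intercept_` after fitting. As a sketch on synthetic data (the `toy_` names and the data-generating rule are assumptions made for this example), we can reproduce the model's predicted probabilities by hand with the sigmoid function:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends on a linear combination of two features
rng = np.random.default_rng(0)
toy_X = rng.normal(size=(100, 2))
toy_y = (toy_X[:, 0] + 0.5 * toy_X[:, 1] > 0).astype(int)

toy_model = LogisticRegression(max_iter=500).fit(toy_X, toy_y)

# Linear score: weighted sum of the features plus the intercept
toy_scores = toy_X @ toy_model.coef_.ravel() + toy_model.intercept_[0]

# The sigmoid maps the score to a probability of class 1
toy_manual_proba = 1 / (1 + np.exp(-toy_scores))
toy_sklearn_proba = toy_model.predict_proba(toy_X)[:, 1]

print(np.allclose(toy_manual_proba, toy_sklearn_proba))
```

A positive weight raises the predicted probability of survival as the feature grows, and a negative weight lowers it.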

### Evaluation: Training Accuracy

We check how accurately the trained model can predict outcomes on the very data it just learned from.

In [None]:
logistic_accurate_w_train_data = np.mean(model.predict(x_train) == y_train)
print(f"Accuracy: {logistic_accurate_w_train_data*100:.2f}%")

### Evaluation: Test Accuracy

This is the true test of the model. We check its accuracy on the test set, data it has never seen before.

In [None]:
logistic_accurate_w_test_data = np.mean(model.predict(x_test) == y_test)
print(f"Accuracy: {logistic_accurate_w_test_data*100:.2f}%")

In [None]:
from sklearn.model_selection import cross_val_score , cross_val_predict
model = LogisticRegression(max_iter=500)

### Advanced Verification: Cross-Validation

We use Cross-Validation to ensure our model's performance is consistent. The data is split into 5 'folds', and the model is trained and tested 5 times on different slices.

In [None]:
cross_val_score(model,x_train,y_train,cv=5,scoring='accuracy')

In [None]:
mean_accurate_cross_val = np.mean(cross_val_score(model,x_train,y_train,cv=5,scoring='accuracy'))
print(f"Accuracy: {mean_accurate_cross_val*100:.2f}%")
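
To show what `cross_val_score` does internally, the sketch below runs the five folds by hand on synthetic data and compares the result with the library call. The `toy_cv_` names are hypothetical, and the sketch relies on scikit-learn using stratified folds when a classifier is given an integer `cv`:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data
rng = np.random.default_rng(0)
toy_cv_X = rng.normal(size=(100, 3))
toy_cv_y = (toy_cv_X[:, 0] + rng.normal(scale=0.5, size=100) > 0).astype(int)

toy_cv_model = LogisticRegression(max_iter=500)
toy_folds = StratifiedKFold(n_splits=5)

manual_scores = []
for train_idx, val_idx in toy_folds.split(toy_cv_X, toy_cv_y):
    fold_model = clone(toy_cv_model)                       # fresh, unfitted copy
    fold_model.fit(toy_cv_X[train_idx], toy_cv_y[train_idx])
    manual_scores.append(fold_model.score(toy_cv_X[val_idx], toy_cv_y[val_idx]))

library_scores = cross_val_score(toy_cv_model, toy_cv_X, toy_cv_y, cv=5, scoring='accuracy')

print(np.round(manual_scores, 3))
print(np.round(library_scores, 3))
```

Each row is used for validation exactly once, so the mean of the five scores is less dependent on one lucky or unlucky split.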

### Hyperparameter Optimization: Parameter Tuning

We iterate through several values of `C`, which in scikit-learn is the inverse of the regularization strength: smaller values mean stronger regularization. Tuning this hyperparameter helps us find a balance between model complexity and generalizability.

In [None]:
for reg_param in [0.001,0.01,0.1,0.11,1]:
  print(f'Regularization parameter: {reg_param}')
  model = LogisticRegression(max_iter=500, C=reg_param)
  accuracies = cross_val_score(model,x_train,y_train,cv=5,scoring='accuracy')
  print(f'Accuracy: {np.mean(accuracies)*100:.2f}%')
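
scikit-learn's `GridSearchCV` can run the same kind of search and keep track of the best setting. The sketch below uses synthetic data and an illustrative grid, both assumptions made for this example, so its output says nothing about the Titanic results:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data
rng = np.random.default_rng(0)
toy_gs_X = rng.normal(size=(120, 3))
toy_gs_y = (toy_gs_X[:, 0] - toy_gs_X[:, 1] + rng.normal(scale=0.5, size=120) > 0).astype(int)

# Illustrative grid of inverse regularization strengths
toy_param_grid = {'C': [0.001, 0.01, 0.1, 1, 10]}

toy_search = GridSearchCV(
    LogisticRegression(max_iter=500),
    toy_param_grid,
    cv=5,
    scoring='accuracy',
)
toy_search.fit(toy_gs_X, toy_gs_y)

print(toy_search.best_params_)
print(f"Best mean CV accuracy: {toy_search.best_score_ * 100:.2f}%")
```

By default the search refits the best configuration on all the data passed to `fit`, so `toy_search` can be used for prediction directly afterwards.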

### Final Model Configuration

After tuning, we select `C=0.11`, which gave the best mean cross-validation accuracy in our run. Because the train/test split is not seeded, the best value may differ between runs.

In [None]:
model = LogisticRegression(max_iter=500, C=0.11)

In [None]:
model.fit(x_train,y_train)

In [None]:
y_train_pred = model.predict(x_train)

In [None]:
y_test_pred = model.predict(x_test)

In [None]:
test_set_correctly_classified = y_test_pred == y_test
test_set_accuracy = np.mean(test_set_correctly_classified)

In [None]:
train_set_correctly_classified = y_train_pred == y_train
train_set_accuracy = np.mean(train_set_correctly_classified)

### Conclusion: Final Performance Metrics

We display the final accuracy results for both training and test sets. A small gap between these two numbers indicates a robust model that isn't overfitting.

In [None]:
print(f'Final model accuracy with test data set: {test_set_accuracy*100:.2f}%')

In [None]:
print(f'Final model accuracy with train data set: {train_set_accuracy*100:.2f}%')

In [None]:
print(f'Accuracy from model without tuning and cross validation with train data:{logistic_accurate_w_train_data*100: .2f}%')
print(f'Accuracy from model without tuning and cross validation with test data:{logistic_accurate_w_test_data*100: .2f}%')
print(f'Mean accuracy after 5 fold(cv=5):{mean_accurate_cross_val*100: .2f}%')
print(f'Accuracy from tuning model with train data after 5 fold(cv=5):{train_set_accuracy*100: .2f}%')
print(f'Accuracy from tuning model with test data after 5 fold(cv=5):{test_set_accuracy*100: .2f}%')