# Titanic Survival Prediction

## 1. Import Libraries
**What is this?**
Importing the necessary Python libraries.
**Why is this used?**
- `numpy` and `pandas` are used for data manipulation and analysis.
- `matplotlib` and `seaborn` are used for data visualization.
- `warnings` is used to suppress warning messages for a cleaner output.
**What is happening?**
We are setting up the environment with the tools needed to process the data and build models.

In [116]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

## 2. Load Dataset
**What is this?**
Loading the Titanic dataset from a CSV file.
**Why is this used?**
To bring the data into the Python environment so we can work with it.
**What is happening?**
We use `pd.read_csv()` to read the 'titanic.csv' file and store it in a pandas DataFrame called `df`. We then display the first few rows using `head()` to check if it loaded correctly.

In [117]:
df=pd.read_csv('titanic.csv')
df.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


## 3. Exploratory Data Analysis (EDA)
**What is this?**
Analyzing the dataset's structure and statistics.
**Why is this used?**
To understand the data quality, distribution, and identify missing values.
**What is happening?**
- `describe()`: Shows summary statistics (count, mean, std, min, quartiles, max) for numerical columns.
- `info()`: Shows data types and non-null counts.
- `isnull().sum()`: Counts missing values in each column.
- `nunique()`: Counts unique values in each column.

In [118]:
df.info()  # prints directly and returns None, so it is called on its own
print(df.describe())
print(df.isnull().sum())
print(df.nunique())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mean    446.000000    0.383838    2.308642   29.699118    0.523008   
std     257.353842    0.486592    
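
The checks above can be combined into one table. The following is a minimal sketch on a small hypothetical DataFrame (`toy` and its values are made up for illustration and are not the Titanic data); it reports missing values as counts and percentages next to the number of unique values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the Titanic DataFrame.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, None],
    "Embarked": ["S", "C", "S", None],
})

# Count missing values per column and express them as a share of all rows.
missing = toy.isnull().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(toy) * 100).round(1),
    "unique": toy.nunique(),  # nunique() ignores missing values by default
})
print(summary.sort_values("percent", ascending=False))
```

Applied to the real data, the `info()` output above implies 891 - 714 = 177 missing `Age` values, 891 - 204 = 687 missing `Cabin` values, and 2 missing `Embarked` values.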

## 4. Data Preprocessing: Handling Missing Values & Dropping Columns
**What is this?**
Cleaning the data by filling missing values and removing unnecessary columns.
**Why is this used?**
Many scikit-learn estimators, including several used below, reject inputs that contain missing values, and some columns (such as Name and Ticket) are either not useful for prediction or would require extra feature engineering.
**What is happening?**
- Filling missing 'Age' values with the median age.
- Filling missing 'Embarked' values with the mode (most frequent value).
- Dropping 'Cabin' (too many missing values) and the 'PassengerId', 'Ticket', and 'Name' columns (identifiers).

In [119]:
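# Fill missing ages with the median and missing ports with the most frequent value.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Drop columns that are identifiers or mostly missing.
df = df.drop(columns=['PassengerId', 'Cabin', 'Ticket', 'Name'])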
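
To see what each step does in isolation, here is a small sketch on a hypothetical DataFrame (`toy` is made up; only the column names mirror the Titanic data). It shows that `median()` skips missing values and that `mode()[0]` picks the most frequent non-missing value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names mirror the Titanic dataset.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, "C123", None],
})

age_median = toy["Age"].median()           # median of 22, 26, 35 (NaN is skipped)
embarked_mode = toy["Embarked"].mode()[0]  # most frequent non-missing value

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
toy = toy.drop(columns=["Cabin"])

print(toy)
print(toy.isnull().sum().sum())  # total number of remaining missing values
```

In this toy example the two missing ages become 26.0 and the missing port becomes `S`, leaving no missing values.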

## 5. Data Preprocessing: Categorical Encoding
**What is this?**
Converting categorical text data into numerical format.
**Why is this used?**
Most machine learning algorithms require numerical input.
**What is happening?**
- `Sex`: Mapping 'male' to 1 and 'female' to 0.
- `Embarked`: Using One-Hot Encoding (creating dummy variables) to convert the 'Embarked' column into binary columns. `drop_first=True` drops the first category (`C`) to avoid multicollinearity, leaving `Embarked_Q` and `Embarked_S`.

In [120]:

df['Sex'] = df['Sex'].map({'male': 1, 'female': 0})

df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
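
The effect of `drop_first=True` is easier to see on a tiny example. The sketch below uses a hypothetical three-row DataFrame (`toy`), one row per port, and prints the resulting column names.

```python
import pandas as pd

# Hypothetical toy data with the same categories as the Titanic columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

toy["Sex"] = toy["Sex"].map({"male": 1, "female": 0})

# drop_first=True removes the first category in sorted order ("C"),
# so a row with both remaining columns False means Embarked == "C".
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)
print(encoded.columns.tolist())  # ['Sex', 'Embarked_Q', 'Embarked_S']
print(encoded)
```

No information is lost by dropping `Embarked_C`: it can always be recovered from the other two columns.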


## 6. Verify Processed Data
**What is this?**
Displaying the DataFrame after preprocessing.
**Why is this used?**
To ensure that all transformations (filling missing values, dropping columns, encoding) were applied correctly before moving to modeling.
**What is happening?**
Printing the dataframe to inspect the changes.

In [121]:
df

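|   | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | 1 | 22.0 | 1 | 0 | 7.2500 | False | True |
| 1 | 1 | 1 | 0 | 38.0 | 1 | 0 | 71.2833 | False | False |
| 2 | 1 | 3 | 0 | 26.0 | 0 | 0 | 7.9250 | False | True |
| 3 | 1 | 1 | 0 | 35.0 | 1 | 0 | 53.1000 | False | True |
| 4 | 0 | 3 | 1 | 35.0 | 0 | 0 | 8.0500 | False | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 886 | 0 | 2 | 1 | 27.0 | 0 | 0 | 13.0000 | False | True |
| 887 | 1 | 1 | 0 | 19.0 | 0 | 0 | 30.0000 | False | True |
| 888 | 0 | 3 | 0 | 28.0 | 1 | 2 | 23.4500 | False | True |
| 889 | 1 | 1 | 1 | 26.0 | 0 | 0 | 30.0000 | False | False |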


## 7. Train-Test Split
**What is this?**
Splitting the data into training and testing sets.
**Why is this used?**
To evaluate the model's performance on unseen data. We train on one part and test on another to check for overfitting.
**What is happening?**
- `x`: Features (all columns except 'Survived').
- `y`: Target variable ('Survived').
- `train_test_split`: Splits the data into 80% training and 20% testing. `random_state=42` ensures reproducibility.

In [122]:
from sklearn.model_selection import train_test_split

x=df.drop('Survived', axis=1)
y=df['Survived']

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
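
As a quick check of what an 80/20 split means for 891 rows, the sketch below runs `train_test_split` on hypothetical stand-in arrays (`X_demo` and `y_demo` are made up, only the row count matches) and confirms that a fixed `random_state` reproduces the same partition.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same number of rows as the Titanic set.
n_rows = 891
X_demo = np.arange(n_rows * 2).reshape(n_rows, 2)
y_demo = np.arange(n_rows) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (712, 2) (179, 2)

# The same seed yields the same partition on a second call.
X_tr2, X_te2, _, _ = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(np.array_equal(X_te, X_te2))  # True
```

The test set therefore holds 179 passengers. Passing `stratify=y` (not used in the cell above) is an option when the class ratio should be the same in both parts.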

## 8. Feature Scaling
**What is this?**
Standardizing the range of independent variables.
**Why is this used?**
Many machine learning algorithms (like KNN, SVM, Logistic Regression) perform better or converge faster when features are on a similar scale (e.g., mean=0, std=1).
**What is happening?**
- `StandardScaler`: Fits on the training data and transforms both training and testing data to have zero mean and unit variance.

In [123]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
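
Standardization replaces each value `x` with `(x - mean) / std`, where the mean and standard deviation come from the training data only. The sketch below illustrates this on hypothetical toy arrays (`train` and `test` are made up for the example).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature matrices (two numeric columns).
train = np.array([[20.0, 10.0], [30.0, 20.0], [40.0, 60.0]])
test = np.array([[50.0, 30.0]])

scaler_demo = StandardScaler()
train_scaled = scaler_demo.fit_transform(train)  # learns mean and std from train
test_scaled = scaler_demo.transform(test)        # reuses the training statistics

print(scaler_demo.mean_)          # per-column training means
print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)                # test rows scaled with the training mean and std
```

Calling `fit_transform` on the test set instead would let test statistics leak into preprocessing, which is why only `transform` is applied to it.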


## 9. Basic Models Comparison
**What is this?**
Training and evaluating basic machine learning models.
**Why is this used?**
To establish a baseline performance. These models are simpler and faster to train.
**What is happening?**
We are training three models:
1.  **Logistic Regression (LR)**: A linear model for classification.
2.  **Decision Tree (DT)**: A tree-based model that splits data based on feature values.
3.  **K-Nearest Neighbors (KNN)**: A distance-based algorithm that classifies based on neighbors.

We loop through them, fit them to the scaled training data, predict on the test data, and calculate accuracy.

In [124]:
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

model={"LR":LogisticRegression(),
       "DT":DecisionTreeClassifier(),
       "KNN":KNeighborsClassifier()
       }

results=[]
for name, mod in model.items():
    mod.fit(X_scaled, y_train)
    y_pred=mod.predict(X_test_scaled)
    acc=accuracy_score(y_test, y_pred)
    results.append((name, acc))

print(results)  

[('LR', 0.8100558659217877), ('DT', 0.7821229050279329), ('KNN', 0.8044692737430168)]
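
An accuracy from a single split depends on which rows land in the test set. One possible complement, sketched here under assumptions, is k-fold cross-validation with a pipeline so that the scaler is refit inside every fold. The data is synthetic (`make_classification` stands in for the Titanic features), and the names `candidates` and `cv_results` are illustrative, so the printed scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 891 rows and 8 features.
X_demo, y_demo = make_classification(n_samples=891, n_features=8, random_state=42)

candidates = {
    "LR": LogisticRegression(),
    "DT": DecisionTreeClassifier(random_state=42),  # fixed seed for repeatable trees
    "KNN": KNeighborsClassifier(),
}

cv_results = {}
for name, estimator in candidates.items():
    # The pipeline refits the scaler in every fold, on that fold's training part only.
    pipe = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())

for name, (mean_acc, std_acc) in cv_results.items():
    print(f"{name}: {mean_acc:.3f} +/- {std_acc:.3f}")
```

The spread across folds gives a rough sense of how much a single-split accuracy can move by chance.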


## 10. Intermediate Models Comparison
**What is this?**
Training and evaluating more complex, often ensemble, models.
**Why is this used?**
To see if we can improve accuracy over the basic models. These models can capture more complex patterns and non-linear relationships.
**What is happening?**
We are training three more complex models:
1.  **Random Forest (RF)**: An ensemble of decision trees (bagging).
2.  **Gradient Boosting (GB)**: An ensemble technique that builds trees sequentially to correct errors (boosting).
3.  **Support Vector Machine (SVM)**: Finds the optimal hyperplane to separate classes.

As before, we fit each model, predict on the test data, and calculate accuracy for comparison.

In [125]:
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC

model2={"RF":RandomForestClassifier(),
        "GB":GradientBoostingClassifier(),
        "SVM":SVC()
        }

results2=[]
for name, mod in model2.items():
    mod.fit(X_scaled, y_train)
    y_pred=mod.predict(X_test_scaled)
    acc=accuracy_score(y_test, y_pred)
    results2.append((name, acc))

print(results2)

[('RF', 0.8100558659217877), ('GB', 0.8044692737430168), ('SVM', 0.8212290502793296)]
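
The accuracies in the two result lists can be converted back into counts of correctly classified passengers, which makes the gaps between models easier to judge. This sketch assumes the test set has 179 rows, which follows from taking 20% of 891 rows and rounding up.

```python
# Accuracies copied from the two result lists printed above.
accuracies = {
    "LR": 0.8100558659217877,
    "DT": 0.7821229050279329,
    "KNN": 0.8044692737430168,
    "RF": 0.8100558659217877,
    "GB": 0.8044692737430168,
    "SVM": 0.8212290502793296,
}

n_test = 179  # 20% of 891 rows, rounded up by train_test_split

# accuracy = correct / n_test, so correct = accuracy * n_test
correct = {name: round(acc * n_test) for name, acc in accuracies.items()}
print(correct)
# {'LR': 145, 'DT': 140, 'KNN': 144, 'RF': 145, 'GB': 144, 'SVM': 147}
```

The best model (SVM) and the weakest (DT) differ by 7 passengers out of 179, and SVM beats LR and RF by only 2. Gaps this small on a single split may not hold on other splits, and the tree-based models were created without a fixed `random_state`, so their scores can change between runs.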


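## 11. Save Models and Scaler
**What is this?**
Saving the trained models and the fitted scaler to disk.
**Why is this used?**
So they can be reloaded later for prediction without retraining. The scaler is saved as well because new data must be scaled with the same training statistics.
**What is happening?**
The two model dictionaries are merged into `all_models` and written with `joblib.dump()`, followed by the scaler.

In [126]: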
import joblib
# Combine all models
all_models = {**model, **model2}
joblib.dump(all_models, "titanic_model.pkl")
joblib.dump(scaler, "titanic_scaler.pkl")

['titanic_scaler.pkl']
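
Reloading works the same way in reverse with `joblib.load()`. The sketch below is self-contained and therefore uses hypothetical stand-ins: random training data, a single model in the dictionary, and temporary file names (`demo_model.pkl`, `demo_scaler.pkl`) rather than the files written above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in training data; in the notebook these would be X_train and y_train.
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 8))
y_demo = (X_demo[:, 0] > 0).astype(int)

scaler_demo = StandardScaler().fit(X_demo)
models_demo = {"LR": LogisticRegression().fit(scaler_demo.transform(X_demo), y_demo)}

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "demo_model.pkl")    # hypothetical file names
    scaler_path = os.path.join(tmp, "demo_scaler.pkl")
    joblib.dump(models_demo, model_path)
    joblib.dump(scaler_demo, scaler_path)

    # Later, or in another process: restore both objects from disk.
    loaded_models = joblib.load(model_path)
    loaded_scaler = joblib.load(scaler_path)

# New rows must pass through the same scaler before prediction.
new_rows = rng.normal(size=(3, 8))
predictions = loaded_models["LR"].predict(loaded_scaler.transform(new_rows))
print(predictions)
```

With the notebook's files, the same pattern would load `titanic_model.pkl` and `titanic_scaler.pkl`, and new passenger rows would need the same eight columns in the same order as the training features.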