# 🚢 **Titanic - Machine Learning with PyCaret** 🚀  

Welcome to my **Titanic - Machine Learning from Disaster** notebook! 🎯  
In this notebook, I aim to predict passenger survival using **AutoML with PyCaret**.  
I have performed **Exploratory Data Analysis (EDA), Feature Engineering, and Model Training** to improve accuracy.  

## 📌 **Notebook Highlights**   
✔️ **Exploratory Data Analysis (EDA)**
   - 📊 Survival rates based on **Pclass, Age, Gender, Family Size, Fare, Embarkation Port**  
   - 📉 Interactive **Plotly visualizations with dark backgrounds**  
✔️ **Feature Engineering**
   - 🔹 Extracted **Cabin Letter** (First-class passengers had a better survival rate)  
   - 🔹 Extracted **Ticket Prefix** (Linked to passenger wealth)  
   - 🔹 Created **Family Size** (Passengers with small families had higher survival rates)  
   - 🔹 Applied **Log Transformation on Fare** to reduce skewness (used in the fare plots; the model is trained on the raw `Fare` column)  
✔️ **Machine Learning with PyCaret**
   - 🏆 **AutoML Setup:** Handled categorical & numerical features automatically  
   - 🔍 **Compare Models:** Selected the best model  
   - 🔧 **Hyperparameter Tuning:** Improved accuracy  

---

## 🗂 **Dataset Overview**  

| Feature | Description |
|---------|------------|
| `PassengerId` | Unique ID for each passenger |
| `Survived` | Survival status (0 = No, 1 = Yes) |
| `Pclass` | Passenger class (1st, 2nd, 3rd) |
| `Sex` | Gender (Male/Female) |
| `Age` | Age of the passenger |
| `SibSp` | Number of siblings/spouses aboard |
| `Parch` | Number of parents/children aboard |
| `Fare` | Ticket price paid |
| `Embarked` | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
| `Cabin` | Cabin number (if available) |
| `Ticket` | Ticket number |

### 🛠 **Engineered Features**  
| New Feature | Description |
|-------------|------------|
| `Cabin_Letter` | Extracted first letter from Cabin (A, B, C…) |
| `Ticket_Prefix` | Leading text token of Ticket (`None` when the ticket is purely numeric) |
| `Family_Size` | Calculated from SibSp + Parch + 1 (the passenger plus relatives aboard) |
| `Log_Fare` | Log-transformed Fare, log(Fare + 1), to reduce skewness (used in the EDA plots) |
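
The formulas in the table can be sketched on a tiny frame. This is a minimal illustration, assuming pandas and NumPy; the helper name `add_engineered_features` is hypothetical, and the two sample rows reuse cabin, ticket, and fare values from the first rows of the training data.

```python
import numpy as np
import pandas as pd


def add_engineered_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of `frame` with the four engineered columns added."""
    out = frame.copy()
    # First character of the cabin string, e.g. "C85" -> "C"
    out["Cabin_Letter"] = out["Cabin"].astype(str).str[0]
    # Leading token of multi-part tickets, e.g. "PC 17599" -> "PC";
    # purely numeric tickets have no prefix and get the string "None"
    out["Ticket_Prefix"] = out["Ticket"].apply(
        lambda t: t.split()[0] if len(t.split()) > 1 else "None"
    )
    # The passenger plus siblings/spouses plus parents/children
    out["Family_Size"] = out["SibSp"] + out["Parch"] + 1
    # log(1 + Fare) compresses the long right tail of ticket prices
    out["Log_Fare"] = np.log1p(out["Fare"])
    return out


sample = pd.DataFrame({
    "Cabin": ["C85", "C123"],
    "Ticket": ["PC 17599", "113803"],
    "SibSp": [1, 1],
    "Parch": [0, 0],
    "Fare": [71.2833, 53.1],
})
features = add_engineered_features(sample)
print(features[["Cabin_Letter", "Ticket_Prefix", "Family_Size", "Log_Fare"]])
```

Returning a copy keeps the input frame untouched, which makes it easy to apply the same helper to both the training and test files.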

---

## 🔥 **Let’s Begin!**  
I will now explore the data, visualize patterns, and apply **AutoML with PyCaret** to build the best model! 🚀  


In [1]:
!pip install pycaret

Collecting pycaret
  Downloading pycaret-3.3.2-py3-none-any.whl.metadata (17 kB)
Collecting pandas<2.2.0 (from pycaret)
  Downloading pandas-2.1.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting scipy<=1.11.4,>=1.6.1 (from pycaret)
  Downloading scipy-1.11.4-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Collecting joblib<1.4,>=1.2.0 (from pycaret)
  Downloading joblib-1.3.2-py3-none-any.whl.metadata (5.4 kB)
Collecting scikit-learn>1.4.0 (from pycaret)
  Downloading scikit_learn-1.6.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Collecting pyod>=1.1.3 (from pycaret)
  Downloading pyod-2.0.3.tar.gz (169 kB)

## 📚 **Importing Libraries**

In [2]:
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from pycaret.classification import setup, compare_models, tune_model, finalize_model
import warnings

## ⚙️ **Basic Settings**

In [3]:
pio.renderers.default = 'iframe'   # render Plotly figures inside iframes
warnings.filterwarnings("ignore")  # silence library warnings in cell outputs

## 📥 **Loading the Dataset**

In [4]:
df = pd.read_csv("/kaggle/input/titanic/train.csv")

## **📊 Exploring the Dataset**

In [5]:
df.head()

|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [7]:
df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
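
To put these counts in proportion, here is a small sketch that converts them into percentages of the 891 rows. The numbers are copied from the outputs above; the variable names are illustrative.

```python
# Missing-value counts copied from the df.isnull().sum() output above
missing_counts = {"Age": 177, "Cabin": 687, "Embarked": 2}
total_rows = 891  # from df.info(): RangeIndex: 891 entries

# Share of rows with a missing value, per column, as a percentage
missing_share = {
    column: round(100 * count / total_rows, 1)
    for column, count in missing_counts.items()
}

for column, share in missing_share.items():
    print(f"{column}: {share}% missing")
```

With roughly three quarters of `Cabin` missing, whatever value is imputed will dominate that column, which is worth keeping in mind when reading the `Cabin_Letter` feature later on.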

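## **🔍 Data Cleaning/Preprocessing**

In [8]:
# Age: fill missing values with the mean age, truncated to a whole number
df["Age"] = df["Age"].fillna(int(df['Age'].mean()))
# Embarked: fill the 2 missing ports with the most common port
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
# Cabin: fill missing cabins with the most frequent cabin value
df['Cabin'] = df['Cabin'].fillna(df['Cabin'].mode()[0])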
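
For comparison only, and not what this notebook does: a sketch of an alternative imputation that uses the median age and keeps missing cabins as their own category. The `"Unknown"` label and the toy frame are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with the same three columns that contain gaps in the real data
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", None, "S"],
})

filled = toy.copy()
# Median is less sensitive to a few very old or very young passengers than the mean
filled["Age"] = filled["Age"].fillna(filled["Age"].median())
# Most frequent port, as in the cell above
filled["Embarked"] = filled["Embarked"].fillna(filled["Embarked"].mode()[0])
# Keep "no cabin recorded" as its own category instead of borrowing the mode
filled["Cabin"] = filled["Cabin"].fillna("Unknown")

print(filled)
```

Whether this helps the model is an empirical question; the point is only that a sentinel category preserves the information that a cabin was not recorded.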

## **📊 Exploratory Data Analysis (EDA)**

# **1️⃣ Overall Survival Rate**

### **📊 1️⃣ Survived vs. Not Survived (Donut Chart)**

In [9]:
survival_counts = df['Survived'].value_counts().sort_index()  # index order 0, 1 matches the label order

fig = go.Figure(data=[go.Pie(
    labels=['Not Survived', 'Survived'],
    values=survival_counts,
    marker=dict(colors=['red', 'green']), 
    hole=0.4,  
    textinfo='percent+label'
)])

fig.update_layout(
    title="🚢 Titanic Survival Rate",
    template="plotly_dark",
    font=dict(size=14),
    paper_bgcolor="black",
    plot_bgcolor="black",
)

fig.show()

# **2️⃣ Gender & Survival Analysis**

### **📊 1️⃣ Gender-wise Survival Count (Grouped Bar Chart)**

In [10]:
# Aggregate first so each bar shows the number of survivors per gender
gender_counts = df.groupby("Sex", as_index=False)["Survived"].sum()

fig = px.bar(gender_counts, x="Sex", y="Survived", color="Sex", barmode="group",
             text_auto=True,  
             color_discrete_map={"male": "blue", "female": "pink"})

fig.update_layout(
    title="🚢 Survivor Count by Gender",
    xaxis_title="Gender",
    yaxis_title="Survival Count",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **📊 2️⃣ Gender-wise Survival Rate (Pie Chart)**

In [11]:
gender_survival = df.groupby("Sex")["Survived"].mean()

fig = go.Figure(data=[go.Pie(
    labels=["Female Survived", "Male Survived"],
    values=[gender_survival["female"], gender_survival["male"]],
    marker=dict(colors=["pink", "blue"]),
    hole=0.4,  
    textinfo="percent+label"
)])

fig.update_layout(
    title="🚢 Gender-Based Survival Rate",
    template="plotly_dark",
    font=dict(size=14),
    paper_bgcolor="black",
    plot_bgcolor="black",
)

fig.show()

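# **3️⃣ Passenger Class & Survival**

### **📊 1️⃣ Survival Count by Passenger Class (Grouped Bar Chart)**

In [12]:
# Aggregate survivors per class; cast Pclass to str so the discrete color map applies
class_counts = df.groupby("Pclass", as_index=False)["Survived"].sum()
class_counts["Pclass"] = class_counts["Pclass"].astype(str)

fig = px.bar(class_counts, x="Pclass", y="Survived", color="Pclass", barmode="group",
             text_auto=True, 
             color_discrete_map={"1": "gold", "2": "silver", "3": "brown"})

fig.update_layout(
    title="🚢 Survivor Count by Passenger Class",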
    xaxis_title="Passenger Class",
    yaxis_title="Survival Count",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **📊 2️⃣ Survival Rate by Passenger Class (Pie Chart)**

In [13]:
class_survival = df.groupby("Pclass")["Survived"].mean()

fig = go.Figure(data=[go.Pie(
    labels=["1st Class", "2nd Class", "3rd Class"],
    values=class_survival,
    marker=dict(colors=["gold", "silver", "brown"]),
    hole=0.4, 
    textinfo="percent+label"
)])

fig.update_layout(
    title="🚢 Passenger Class Survival Rate",
    template="plotly_dark",
    font=dict(size=14),
    paper_bgcolor="black",
    plot_bgcolor="black",
)

fig.show()

# **4️⃣ Age & Survival**

### **📊 1️⃣ Age Distribution by Survival (Histogram)**

In [14]:
fig = px.histogram(df, x="Age", color="Survived",
                   nbins=40,  
                   barmode="overlay",
                   opacity=0.7,
                   color_discrete_map={0: "red", 1: "green"},
                   histnorm='percent')

fig.update_layout(
    title="🚢 Age Distribution of Survivors vs. Non-Survivors",
    xaxis_title="Age",
    yaxis_title="Percentage of Passengers",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **📊 2️⃣ Survival Rate by Age Group (Bar Chart)**

In [15]:
bins = [0, 10, 19, 59, 100]  
labels = ["Child (0-10)", "Teen (11-19)", "Adult (20-59)", "Senior (60+)"]
df["Age Group"] = pd.cut(df["Age"], bins=bins, labels=labels)
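
How `pd.cut` assigns ages that sit exactly on a bin edge is easy to get wrong, so here is a short sketch using the same `bins` and `labels`. The sample ages are made up to probe the boundaries.

```python
import pandas as pd

bins = [0, 10, 19, 59, 100]
labels = ["Child (0-10)", "Teen (11-19)", "Adult (20-59)", "Senior (60+)"]

# Ages chosen to sit on and around the bin edges
ages = pd.Series([0.42, 10, 11, 19, 20, 59, 60, 80])
# By default each interval is open on the left and closed on the right: (0, 10], (10, 19], ...
groups = pd.cut(ages, bins=bins, labels=labels)

for age, group in zip(ages, groups):
    print(f"{age:>5} -> {group}")
```

Because the intervals are right-closed, an age of exactly 10 is a child and exactly 19 is a teen. An age of exactly 0 would fall outside the first interval and become `NaN`; passing `include_lowest=True` would cover that case.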

In [16]:
age_group_survival = df.groupby("Age Group")["Survived"].mean().reset_index()

fig = px.bar(age_group_survival, x="Age Group", y="Survived",
             color="Age Group",
             text=age_group_survival["Survived"].round(2),
             color_discrete_map={"Child (0-10)": "lightblue", 
                                 "Teen (11-19)": "orange", 
                                 "Adult (20-59)": "purple",
                                 "Senior (60+)": "red"})

fig.update_layout(
    title="🚢 Survival Rate by Age Group",
    xaxis_title="Age Group",
    yaxis_title="Survival Rate",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

# **5️⃣ Family Size & Survival**

### **1️⃣ Box Plot (Distribution of Survival by Family Size)**

In [17]:
df["Family Size"] = df["SibSp"] + df["Parch"]

fig = px.box(df, x="Family Size", y="Survived", 
             color="Family Size"
             )

fig.update_layout(
    title="🚢 Family Size vs. Survival (Box Plot)",
    xaxis_title="Family Size (SibSp + Parch)",
    yaxis_title="Survival (0 = No, 1 = Yes)",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **2️⃣ Violin Plot (Density of Survival by Family Size)**

In [18]:
fig = px.violin(df, x="Family Size", y="Survived", 
                box=True, 
                color="Family Size")

fig.update_layout(
    title="🚢 Family Size vs. Survival (Violin Plot)",
    xaxis_title="Family Size (SibSp + Parch)",
    yaxis_title="Survival (0 = No, 1 = Yes)",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **3️⃣ Heatmap (Survival by Family Size & Class)**

In [19]:
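# "Family Size" counts relatives aboard (SibSp + Parch), so 0 means travelling alone.
# The first bin (-1, 0] keeps those passengers; bins starting at 0 would turn them into NaN.
df["Family Size Category"] = pd.cut(df["Family Size"], bins=[-1, 0, 3, 6, 20], labels=["Single", "Small", "Medium", "Large"])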
df["Pclass"] = df["Pclass"].astype(str)

In [20]:
survival_matrix = df.groupby(["Family Size Category", "Pclass"])["Survived"].mean().reset_index()

fig = px.density_heatmap(survival_matrix, x="Family Size Category", y="Pclass", z="Survived",
                         color_continuous_scale="Viridis")

fig.update_layout(
    title="🚢 Survival Rate by Family Size & Class (Heatmap)",
    xaxis_title="Family Size Category",
    yaxis_title="Passenger Class",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **4️⃣ Bar Chart (Survival Rate by Family Size Categories)**

In [21]:
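# Bins are in relatives aboard (SibSp + Parch): 0, 1-3, 4-6, 7+; the labels count the passenger too
bins = [-1, 0, 3, 6, 20]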
labels = ["Single", "Small Family (2-4)", "Medium Family (5-7)", "Large Family (8+)"]
df["Family Size Category"] = pd.cut(df["Family Size"], bins=bins, labels=labels)

In [22]:
family_size_survival = df.groupby("Family Size Category")["Survived"].mean().reset_index()

fig = px.bar(family_size_survival, x="Family Size Category", y="Survived", 
             color="Family Size Category", 
             color_discrete_map={"Single": "gray", "Small Family (2-4)": "lightblue", 
                                 "Medium Family (5-7)": "orange", "Large Family (8+)": "red"},
             text=family_size_survival["Survived"].round(2))

fig.update_layout(
    title="🚢 Survival Rate by Family Size Category",
    xaxis_title="Family Size Category",
    yaxis_title="Survival Rate",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

# **6️⃣ Embarkation Port Impact**

### **1️⃣ Bar Plot (Survival Rate by Embarkation Port)**

In [23]:
survival_by_port = df.groupby("Embarked")["Survived"].mean().reset_index()

In [24]:
fig = px.bar(survival_by_port, x="Embarked", y="Survived", color="Embarked", 
             color_discrete_map={"C": "green", "Q": "red", "S": "blue"},
             text="Survived")

fig.update_layout(
    title="🚢 Survival Rate by Embarkation Port",
    xaxis_title="Embarkation Port",
    yaxis_title="Survival Rate",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **2️⃣ Pie Chart (Proportion of Survival by Embarkation Port)**

In [25]:
fig = px.pie(survival_by_port, names="Embarked", values="Survived", 
             color="Embarked", 
             color_discrete_map={"C": "green", "Q": "red", "S": "blue"})

fig.update_layout(
    title="🚢 Proportion of Survival by Embarkation Port",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **3️⃣ Box Plot (Survival Distribution by Embarkation Port)**

In [26]:
fig = px.box(df, x="Embarked", y="Survived", color="Embarked", 
             color_discrete_map={"C": "green", "Q": "red", "S": "blue"})

fig.update_layout(
    title="🚢 Survival Distribution by Embarkation Port",
    xaxis_title="Embarkation Port",
    yaxis_title="Survival (0 = No, 1 = Yes)",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

### **4️⃣ Violin Plot (Density of Survival by Embarkation Port)**

In [27]:
fig = px.violin(df, x="Embarked", y="Survived", color="Embarked", 
                color_discrete_map={"C": "green", "Q": "red", "S": "blue"},
                box=True)

fig.update_layout(
    title="🚢 Survival Density by Embarkation Port",
    xaxis_title="Embarkation Port",
    yaxis_title="Survival (0 = No, 1 = Yes)",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black"
)

fig.show()

# **7️⃣ Ticket Price & Survival**

### **1️⃣ Log Fare vs. Survival (Scatter Plot)**

In [28]:
df['Log_Fare'] = np.log1p(df['Fare'])  # log(1 + Fare); the +1 avoids log(0) for zero fares
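
To see why the log transform is useful, here is a sketch that compares skewness before and after on a handful of made-up fares (a zero fare, mostly cheap tickets, and a few expensive ones). These values are illustrative and are not taken from the dataset as a whole.

```python
import numpy as np
import pandas as pd

# Toy fares: mostly cheap tickets plus a few expensive ones (made-up values)
fares = pd.Series([0.0, 7.25, 7.9, 8.05, 13.0, 26.0, 53.1, 71.3, 263.0, 512.3])
log_fares = np.log1p(fares)  # log(1 + x), defined for a fare of 0

# Skewness near 0 means a roughly symmetric distribution
print(f"skew before: {fares.skew():.2f}")
print(f"skew after:  {log_fares.skew():.2f}")
```

The few very expensive tickets pull the raw distribution far to the right; after the transform the values are spread much more evenly, which makes the scatter plot easier to read.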

In [29]:
fig = px.scatter(df, x="Log_Fare", y="Survived", color="Survived",
                 color_continuous_scale="Viridis", opacity=0.7, 
                 labels={"Log_Fare": "Log(Price)", "Survived": "Survival (0 = No, 1 = Yes)"})
fig.update_layout(
    title="🚢 Ticket Price & Survival",
    xaxis_title="Log(Price)",
    yaxis_title="Survival (0 = No, 1 = Yes)",
    template="plotly_dark",
    paper_bgcolor="black",
    plot_bgcolor="black",
    coloraxis_showscale=False
)

fig.show()

# **🤖 Auto ML 🚀**

In [30]:
df_for_ml = pd.read_csv("/kaggle/input/titanic/train.csv")

In [31]:
# Same imputation as in the EDA section, applied to a fresh copy of the data
df_for_ml["Age"] = df_for_ml["Age"].fillna(int(df_for_ml['Age'].mean()))
df_for_ml['Embarked'] = df_for_ml['Embarked'].fillna(df_for_ml['Embarked'].mode()[0])
df_for_ml['Cabin'] = df_for_ml['Cabin'].fillna(df_for_ml['Cabin'].mode()[0])

# Feature engineering
df_for_ml['Cabin_Letter'] = df_for_ml['Cabin'].astype(str).str[0]  # e.g. "C85" -> "C"
# Leading token of the ticket, or 'None' when the ticket has a single part
df_for_ml['Ticket_Prefix'] = df_for_ml['Ticket'].apply(lambda x: x.split()[0] if len(x.split()) > 1 else 'None')
df_for_ml['Family_Size'] = df_for_ml['SibSp'] + df_for_ml['Parch'] + 1  # passenger plus relatives aboard

In [32]:
df_for_ml.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 15 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   PassengerId    891 non-null    int64  
 1   Survived       891 non-null    int64  
 2   Pclass         891 non-null    int64  
 3   Name           891 non-null    object 
 4   Sex            891 non-null    object 
 5   Age            891 non-null    float64
 6   SibSp          891 non-null    int64  
 7   Parch          891 non-null    int64  
 8   Ticket         891 non-null    object 
 9   Fare           891 non-null    float64
 10  Cabin          891 non-null    object 
 11  Embarked       891 non-null    object 
 12  Cabin_Letter   891 non-null    object 
 13  Ticket_Prefix  891 non-null    object 
 14  Family_Size    891 non-null    int64  
dtypes: float64(2), int64(6), object(7)
memory usage: 104.5+ KB


In [33]:
clf1 = setup(data=df_for_ml, target='Survived', 
             categorical_features=['Sex', 'Embarked', 'Cabin_Letter', 'Ticket_Prefix'],  
             numeric_features=['Age', 'SibSp', 'Parch', 'Fare', 'Pclass', 'Family_Size'],  
             verbose=True)

|  | Description | Value |
|---|-------------|-------|
| 0 | Session id | 8896 |
| 1 | Target | Survived |
| 2 | Target type | Binary |
| 3 | Original data shape | (891, 15) |
| 4 | Transformed data shape | (891, 24) |
| 5 | Transformed train set shape | (623, 24) |
| 6 | Transformed test set shape | (268, 24) |
| 7 | Numeric features | 6 |
| 8 | Categorical features | 4 |
| 9 | Preprocess | True |
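
The train and test row counts in this summary follow from the split ratio. A quick arithmetic check, assuming PyCaret's default `train_size` of 0.7 (the `setup` call above does not set it) and assuming the test share is rounded up:

```python
import math

n_rows = 891        # "Original data shape" reported by setup()
train_size = 0.7    # assumed default; not passed explicitly in the call above

# Assumption: the test share is rounded up and the remainder goes to training
n_test = math.ceil((1 - train_size) * n_rows)
n_train = n_rows - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

The result matches the 623 and 268 rows shown in the transformed train and test set shapes.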


In [34]:
# Rank the candidate models by cross-validated score, then tune the top one
best_model = tune_model(compare_models())

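|  | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|-------|----------|-----|--------|-------|----|-------|-----|----------|
| lightgbm | Light Gradient Boosting Machine | 0.8073 | 0.8523 | 0.7237 | 0.7709 | 0.7422 | 0.5892 | 0.5943 | 15.346 |
| dummy | Dummy Classifier | 0.6164 | 0.5 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.056 |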


| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|------|----------|-----|--------|-------|----|-------|-----|
| 0 | 0.7778 | 0.7906 | 0.5833 | 0.7778 | 0.6667 | 0.5051 | 0.5168 |
| 1 | 0.8095 | 0.8259 | 0.75 | 0.75 | 0.75 | 0.5962 | 0.5962 |
| 2 | 0.8413 | 0.9038 | 0.6667 | 0.8889 | 0.7619 | 0.6465 | 0.6615 |
| 3 | 0.7419 | 0.8454 | 0.75 | 0.6429 | 0.6923 | 0.4723 | 0.4765 |
| 4 | 0.8871 | 0.9189 | 0.75 | 0.9474 | 0.8372 | 0.7526 | 0.7646 |
| 5 | 0.9032 | 0.9342 | 0.875 | 0.875 | 0.875 | 0.7961 | 0.7961 |
| 6 | 0.8387 | 0.8531 | 0.625 | 0.9375 | 0.75 | 0.6379 | 0.6664 |
| 7 | 0.8065 | 0.7796 | 0.625 | 0.8333 | 0.7143 | 0.5724 | 0.586 |
| 8 | 0.871 | 0.9309 | 0.7917 | 0.8636 | 0.8261 | 0.7238 | 0.7256 |
| 9 | 0.7258 | 0.8629 | 0.6087 | 0.6364 | 0.6222 | 0.4072 | 0.4074 |


Fitting 10 folds for each of 10 candidates, totalling 100 fits
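
The per-fold table has no summary row here, so a short sketch can condense it. The accuracies below are copied from the fold table above, which appears to be the cross-validation output of `tune_model`; the dummy-classifier accuracy comes from the comparison table. Variable names are illustrative.

```python
from statistics import mean, stdev

# Per-fold accuracy, copied from the fold table above
fold_accuracy = [0.7778, 0.8095, 0.8413, 0.7419, 0.8871,
                 0.9032, 0.8387, 0.8065, 0.8710, 0.7258]
dummy_accuracy = 0.6164  # Dummy Classifier row in the comparison table

avg = mean(fold_accuracy)
spread = stdev(fold_accuracy)  # sample standard deviation across folds
lift = avg - dummy_accuracy    # improvement over always predicting the majority class

print(f"mean accuracy:            {avg:.4f}")
print(f"std across folds:         {spread:.4f}")
print(f"gain over dummy baseline: {lift:.4f}")
```

The folds range from about 0.73 to 0.90, so a single-fold score says little on its own; the mean across folds is the number to compare against the 0.8073 reported for LightGBM before tuning.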


In [35]:
# Refit the tuned model on the full dataset, including the hold-out split
final_model = finalize_model(best_model)

## **Thank You!** 🎉  

Thank you for exploring this **Titanic Survival Prediction** notebook! 🚢⚓  

I hope you found this analysis insightful and enjoyed the journey through **EDA, Visualization, and AutoML with PyCaret**. 🚀  

If you liked this notebook, don't forget to leave your **feedback** 💬.  
Happy coding! 🖥️✨  

**📌 Stay Connected & Keep Learning!** 📚💡 