### Project Title: Titanic Survival Prediction
#### Done By: Nozipho Sithembiso Ndebele
---

<div style="text-align: center;">
<img src="https://itsmeprasannak.wordpress.com/wp-content/uploads/2021/03/10-sn56-20201221-titanicsinking-hr.jpg?w=1440" alt="Titanic Image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**

### Purpose
This project aims to analyze the Titanic passenger dataset to uncover patterns and factors that contributed to survival during the tragic sinking of the RMS Titanic. By applying data preprocessing, exploratory data analysis, and classification techniques, the project seeks to predict which types of passengers were more likely to survive.

### Significance
Understanding survival patterns on the Titanic is not only a classic problem in data science but also valuable for:

* Identifying key demographic and socio-economic factors (e.g., gender, age, class) linked to survival

* Highlighting potential biases in rescue decisions

* Improving classification models by exploring real-world, historical datasets

* Demonstrating the power of machine learning in uncovering insights from structured data

This analysis helps build a strong foundation for applying data science skills to real-world problems, especially in binary classification and predictive modeling.

### Problem Domain
The Titanic dataset includes features such as age, sex, passenger class, number of siblings/spouses aboard, and fare paid. The goal is to predict the target variable, `Survived` (0 = No, 1 = Yes), from these features. Developing a robust classification model involves understanding the relationships between variables and identifying the most influential predictors of survival.

### Challenges
* Missing Data: Age and Cabin information contain missing values that must be handled.

* Imbalanced Features: Some classes (e.g., 1st class vs. 3rd class) may have disproportionate survival rates.

* Feature Engineering: Extracting meaningful insights from names, tickets, or titles (e.g., "Mr.", "Mrs.").

* Model Evaluation: Selecting and comparing appropriate metrics like accuracy, precision, recall, and F1-score.
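To make the four metrics named above concrete, here is a minimal sketch that computes them by hand from confusion-matrix counts. The labels are invented toy values, not predictions from any model in this project. scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` compute the same quantities.

```python
# Toy labels, invented for illustration only (1 = survived, 0 = did not survive).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 0, 0]

pairs = list(zip(y_true, y_pred))

# Count the four cells of the confusion matrix.
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

accuracy = (tp + tn) / len(pairs)                    # share of correct predictions
precision = tp / (tp + fp)                           # predicted survivors who did survive
recall = tp / (tp + fn)                              # actual survivors the model found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Accuracy alone can hide a weak recall on the survivor class, which is why the four metrics are usually reported together.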

### Key Questions
* Which passenger features (e.g., gender, class, age) were most associated with survival?

* How did socio-economic status affect the likelihood of survival?

* What role did age and family relationships (siblings/parents onboard) play in survival outcomes?

* Can we accurately predict a passenger’s survival using classification algorithms?

* How do different models (e.g., logistic regression, decision trees, random forest) perform on this task?

---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn: For data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* NLTK/spaCy: For text preprocessing and natural language processing tasks.

---

In [37]:
# Import necessary packages  

# Data manipulation and analysis  
import pandas as pd  # Pandas for data handling  
import numpy as np  # NumPy for numerical operations  

# Data visualization  
import matplotlib.pyplot as plt  # Matplotlib for static plots  
import seaborn as sns  # Seaborn for statistical visualization  
import plotly.express as px  # Plotly for interactive plots  

# Natural Language Processing  
import nltk  # Natural Language Toolkit  
from nltk.corpus import stopwords  # Stopword removal  
from nltk.tokenize import word_tokenize  # Tokenization  
import re  # Regular expressions for text cleaning  

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages


---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
This section provides an overview of the Titanic dataset and its structure. Understanding the dataset is essential to build accurate classification models and extract meaningful insights regarding survival patterns among passengers.

### Details
The dataset captures demographic and travel information of passengers aboard the RMS Titanic during its ill-fated maiden voyage. It includes attributes such as age, sex, passenger class, and fare paid, which can be used to predict survival outcomes.

* Source: The dataset is publicly available on Kaggle as part of the Titanic Machine Learning competition.
Link to dataset

* Method of Collection: The dataset compiles historical records from the Titanic's passenger manifest, integrating demographic, ticketing, and survival information.

* Size: The dataset contains 891 rows and 12 columns, where each row corresponds to an individual passenger.

* Scope: The dataset includes information relevant to survival prediction, such as:

  * Socio-economic status (ticket class and fare)

  * Family relations aboard (siblings/spouses, parents/children)

  * Passenger age and gender

  * Embarkation port and cabin information

* Types of Data:

| Feature | Description |
| --- | --- |
| `PassengerId` | Unique identifier for each passenger |
| `Survived` | Survival status (0 = No, 1 = Yes) |
| `Pclass` | Passenger class (1 = Upper, 2 = Middle, 3 = Lower) |
| `Name` | Full name, including title |
| `Sex` | Gender of the passenger |
| `Age` | Age in years (some missing values) |
| `SibSp` | Number of siblings/spouses aboard |
| `Parch` | Number of parents/children aboard |
| `Ticket` | Ticket number |
| `Fare` | Fare paid for the ticket |
| `Cabin` | Cabin number (many missing values) |
| `Embarked` | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton) |
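Because the `Name` field embeds a title, one possible feature-engineering step is to pull that title into its own column. The sketch below is an illustration rather than part of the original pipeline. It assumes every name follows the "Surname, Title. Given names" pattern seen in the sample rows, and the regular expression is a choice made for this example.

```python
import pandas as pd

# Three names taken from the first rows of the dataset.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
])

# Capture the text between the comma and the first period.
# Assumption: the title is always followed by a period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

print(titles.tolist())  # ['Mr', 'Miss', 'Mrs']
```

Names that do not match the pattern come back as `NaN`, so it is worth checking `titles.isna().sum()` before using the new column.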


---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.


---

In [38]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'Titanic-Dataset.csv'
df = pd.read_csv('Titanic-Dataset.csv')

In [39]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [40]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [41]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(891, 12)

In [42]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
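A quick schema check right after loading can catch a wrong or truncated file before any cleaning starts. The sketch below is one way to do it. The helper name `validate_columns` is hypothetical, and the in-memory CSV (two rows copied from the preview above) stands in for the real file so the example runs on its own.

```python
import io

import pandas as pd

# Column names as reported by df_copy.info() above.
EXPECTED_COLUMNS = [
    "PassengerId", "Survived", "Pclass", "Name", "Sex", "Age",
    "SibSp", "Parch", "Ticket", "Fare", "Cabin", "Embarked",
]


def validate_columns(df, expected=EXPECTED_COLUMNS):
    """Raise ValueError if any expected column is absent; otherwise return df."""
    missing = [col for col in expected if col not in df.columns]
    if missing:
        raise ValueError(f"Missing expected columns: {missing}")
    return df


# Stand-in for the real file: two rows from the preview above.
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
)

sample = validate_columns(pd.read_csv(sample_csv))
print(sample.shape)  # (2, 12)
```

With the real data, the same call would be `validate_columns(pd.read_csv('Titanic-Dataset.csv'))`.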


---
<a id="four"></a>
## **Data Cleaning and Filtering**
### Purpose
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing outliers, correcting errors, and possibly reducing the data by filtering out irrelevant features. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.
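The "Correct Errors" step can be approached with a few simple sanity rules. The sketch below flags rows that break them. The rules themselves are assumptions made for illustration (for example, the upper age bound of 120), the function name `find_invalid_rows` is hypothetical, and the toy rows are invented.

```python
import pandas as pd


def find_invalid_rows(df):
    """Return a boolean mask marking rows that violate basic sanity rules."""
    # Missing values are not treated as errors here; they are handled separately.
    bad_age = df["Age"].notna() & ~df["Age"].between(0, 120)
    bad_fare = df["Fare"] < 0
    bad_sex = ~df["Sex"].isin(["male", "female"])
    bad_port = df["Embarked"].notna() & ~df["Embarked"].isin(["C", "Q", "S"])
    return bad_age | bad_fare | bad_sex | bad_port


# Invented rows: the second has a negative fare, the third an implausible age.
toy = pd.DataFrame({
    "Age": [22.0, None, 150.0],
    "Fare": [7.25, -1.0, 8.05],
    "Sex": ["male", "female", "male"],
    "Embarked": ["S", "C", None],
})

mask = find_invalid_rows(toy)
print(mask.tolist())  # [False, True, True]
```

Flagged rows can then be inspected with `toy[mask]` before deciding whether to correct or drop them.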

In [43]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Check for missing values in the dataset and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [44]:
# Count the missing values in the working copy of the dataset
missing_values = check_missing_values(df_copy)


Missing values per column:
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64


The check shows missing values in three columns: `Age` (177 of 891 rows, about 20%), `Cabin` (687 rows, about 77%), and `Embarked` (2 rows). All other columns are complete. Because `Cabin` is mostly empty, dropping it is a reasonable choice, while `Age` and `Embarked` have few enough gaps to be filled by imputation.
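Raw counts are easier to judge next to the share of rows they represent. The sketch below adds a percentage column to the missing-value summary. The helper name `missing_summary` is hypothetical and the toy frame is invented for illustration.

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values, for affected columns only."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with gaps, worst first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Invented toy data with gaps in two of three columns.
toy = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.93, 8.05],
})

summary = missing_summary(toy)
print(summary)
```

On the real data the call would be `missing_summary(df_copy)`.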

In [45]:
def handling_missing_values(df):
    """
    Check for missing values in the dataset, display the number of missing values per column,
    handle them appropriately, and drop the 'Cabin' column.

    Parameters:
    df (pandas.DataFrame): The dataset to check and clean missing values.

    Returns:
    pandas.DataFrame: The cleaned dataset with missing values handled.
    """
    print("Missing values per column:\n")
    missing_values = df.isnull().sum()
    
    for column, count in missing_values.items():
        if count > 0:
            print(f"{column}: {count} missing")
            if column == 'Age':
                # Fill 'Age' with the median. Assign the result back: a chained
                # inplace fillna triggers a warning in recent pandas versions.
                median_age = df['Age'].median()
                df['Age'] = df['Age'].fillna(median_age)
                print(f"  ➤ Filled 'Age' with median: {median_age}")
            elif column == 'Cabin':
                # Drop the 'Cabin' column
                df.drop('Cabin', axis=1, inplace=True)
                print("  ➤ Dropped 'Cabin' column.")
            elif column == 'Embarked':
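                # Fill 'Embarked' with the mode. Assign the result back: a chained
                # inplace fillna triggers a warning in recent pandas versions.
                mode_embarked = df['Embarked'].mode()[0]
                df['Embarked'] = df['Embarked'].fillna(mode_embarked)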
                print(f"  ➤ Filled 'Embarked' with mode: {mode_embarked}")
            else:
                print("  ➤ No specific rule. Consider domain-specific handling.")
        else:
            print(f"{column}: No missing values")
    
    print("\nMissing values handled.")
    return df


In [46]:
# Call the function
df_copy = handling_missing_values(df_copy)

Missing values per column:

PassengerId: No missing values
Survived: No missing values
Pclass: No missing values
Name: No missing values
Sex: No missing values
Age: 177 missing
  ➤ Filled 'Age' with median: 28.0
SibSp: No missing values
Parch: No missing values
Ticket: No missing values
Fare: No missing values
Cabin: 687 missing
  ➤ Dropped 'Cabin' column.
Embarked: 2 missing
  ➤ Filled 'Embarked' with mode: S

Missing values handled.
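The function above modifies the DataFrame that is passed to it. Where the caller's data should stay untouched, a non-mutating variant is one option. The sketch below applies the same three rules (median for `Age`, mode for `Embarked`, drop `Cabin`) to a copy. The name `impute_copy` is hypothetical and the toy rows are invented.

```python
import pandas as pd


def impute_copy(df):
    """Return a cleaned copy of df; the input DataFrame is left unchanged."""
    cleaned = df.copy()
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    # errors="ignore" makes the call safe even if 'Cabin' was already dropped.
    return cleaned.drop(columns=["Cabin"], errors="ignore")


# Invented toy rows with one gap in 'Age' and one in 'Embarked'.
toy = pd.DataFrame({
    "Age": [22.0, None, 30.0],
    "Embarked": ["S", None, "S"],
    "Cabin": [None, "C85", None],
})

cleaned = impute_copy(toy)
print(cleaned)
print("Gaps left in the original:", int(toy["Age"].isna().sum()))
```

Because the input is never changed, the cell can be rerun safely and the raw data stays available for comparison.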


In [47]:
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        df.drop_duplicates(inplace=True)
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [48]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 0
No duplicate rows found.


Upon reviewing the dataset, no duplicate rows were found. This ensures that all records are unique, and no further action is required for data deduplication.
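`DataFrame.duplicated()` only flags rows that match in every column. A passenger entered twice with slightly different text would pass that check, so it can be worth testing the identifier column on its own as well. The sketch below uses invented rows to show the difference.

```python
import pandas as pd

# Invented rows: PassengerId 2 appears twice with a slightly different name.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 2],
    "Name": ["A", "B", "B (re-entered)"],
})

full_dupes = int(toy.duplicated().sum())                        # all columns must match
key_dupes = int(toy.duplicated(subset=["PassengerId"]).sum())   # identifier only

print(full_dupes, key_dupes)  # 0 1
```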


## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [49]:
# Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='Titanic_Survival_cleaned.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'Titanic_Survival_cleaned.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [50]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'Titanic_Survival_cleaned.csv'.
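A round-trip check confirms that the saved file reloads to the same data. The sketch below writes a small invented frame to a temporary directory and compares it with what `read_csv` returns. The file name `toy_cleaned.csv` is a placeholder.

```python
import os
import tempfile

import pandas as pd

# Invented stand-in for the cleaned dataset.
toy = pd.DataFrame({
    "PassengerId": [1, 2],
    "Age": [22.0, 28.0],
    "Embarked": ["S", "C"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv")
    # index=False keeps the row index out of the file, so no extra column
    # appears when the file is read back.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if values, column names, or dtypes differ.
pd.testing.assert_frame_equal(toy, reloaded)
print("Round trip OK:", reloaded.shape)
```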


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of analyzing a dataset to summarize its key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.
#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.
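A common first EDA step for this kind of problem is to compare survival rates across groups. The sketch below shows the mechanics with `groupby` and `pivot_table`, using only the five rows previewed earlier in the notebook. That is far too few rows to support any conclusion; it only illustrates the calls.

```python
import pandas as pd

# The first five rows of the dataset, reduced to three columns.
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "female", "male"],
    "Pclass": [3, 1, 3, 1, 3],
    "Survived": [0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the survival rate within each group.
rate_by_sex = sample.groupby("Sex")["Survived"].mean()

# Two-way breakdown: rows are Sex, columns are Pclass.
# Combinations with no passengers show up as NaN.
rate_table = sample.pivot_table(
    index="Sex", columns="Pclass", values="Survived", aggfunc="mean"
)

print(rate_by_sex)
print(rate_table)
```

On the full data the same calls would run against `df_copy`.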

`The following methods were employed to communicate our objective:`



---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**

## Additional Sections to Consider

**Contributors**: Nozipho Sithembiso Ndebele
