# Titanic Survival Prediction

## **Project Description**
The challenge is to predict which passengers survived the Titanic disaster based on their characteristics, such as name, age, ticket price, and class. By analyzing the provided data, the goal is to build a model that determines whether a passenger survived or not.

## **The Data**
This project uses three main files, accessible via the "Data" tab on the competition page:

### **1. train.csv**
- Contains detailed information about **891 passengers** of the Titanic.
- Each row represents a passenger, with columns describing their features (e.g., age, sex, class) and a binary `Survived` column (1 = survived, 0 = did not survive).
- Used to train the model and identify survival patterns.

### **2. test.csv**
- Contains similar information to `train.csv` for **418 passengers**, but without the `Survived` column.
- Your task is to predict the survival of these passengers using the patterns found in `train.csv`.

### **3. gender_submission.csv**
- An example submission file:
  - Includes two columns: `PassengerId` (IDs from `test.csv`) and `Survived` (predictions: 1 = survived, 0 = did not survive).
  - This file assumes that all female passengers survived and all male passengers did not. Your actual predictions will likely differ.

## **Objective**
- Build a machine learning model to predict passenger survival.
- Submit predictions in a file structured like `gender_submission.csv`.
- Evaluate your model's performance based on the accuracy of your predictions.
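
Accuracy is the fraction of passengers whose predicted outcome matches the true outcome. The following is a minimal sketch using made-up labels; the values are illustrative and are not taken from the dataset.

```python
# Hypothetical true outcomes and predictions for six passengers (illustrative only).
y_true = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 0, 0, 1, 1]

# Accuracy = number of correct predictions / total number of predictions.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(f"Accuracy: {accuracy:.3f}")
```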

## **Motivation**
This project serves as an excellent introduction to machine learning, providing hands-on experience with data exploration, pattern recognition, and predictive modeling.


### **1. Environment Setup**
This section initializes the Python environment and loads essential libraries for data analysis (`numpy`, `pandas`). It also scans the Kaggle input directory to list the available data files. This is useful for verifying that all necessary files are accessible.


In [1]:
# The Kaggle environment includes pre-installed analytics libraries 
# such as numpy and pandas, which are essential for this project.

import numpy as np
import pandas as pd

# List all files in the input directory to verify data availability.
# The files are stored in a read-only directory: "/kaggle/input/"

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Outputs are stored in the "/kaggle/working/" directory and preserved when saved.
# Temporary files can be written to "/kaggle/temp/" but won't persist after the session ends.

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


### Loading the Titanic Dataset

The dataset is loaded from the Kaggle Titanic competition's input directory using `pandas`. The `train.csv` file contains the training data, which includes information about the passengers, such as their age, sex, class, and whether they survived or not. 

Calling `train_data.head()` displays the first five rows of the dataset, providing a quick overview of its structure and the features available.


In [2]:
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.head()

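|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|-------------|----------|--------|------|-----|-----|-------|-------|--------|------|-------|----------|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |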
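
Beyond `head()`, it can help to check the size of the data and how many values are missing in each column. The sketch below uses a small stand-in frame (named `sample` here, an assumption for illustration) built from a subset of the columns in the five rows shown above; in the notebook the same calls would be applied to `train_data`.

```python
import numpy as np
import pandas as pd

# A small stand-in built from the five rows shown above; in the notebook,
# `train_data` from pd.read_csv would be used instead.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123", np.nan],
})

# Number of rows and columns.
print(sample.shape)

# Count of missing values per column; Cabin is empty for three of these rows.
missing = sample.isna().sum()
print(missing)
```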


### Loading the Titanic Test Dataset

The test dataset is loaded from the Kaggle Titanic competition's input directory using `pandas`. The `test.csv` file contains the data for passengers that will be used for making predictions. Unlike the training dataset, the test dataset does not contain the 'Survived' column, which is the target variable we aim to predict.

Calling `test_data.head()` displays the first five rows of the test dataset, providing an overview of its structure and the features available for making predictions.


In [3]:
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")
test_data.head()

### Calculating the Survival Rate for Women

This section filters the training data to female passengers and computes their survival rate. The `loc` indexer selects the rows where the `Sex` column equals `'female'`, and the `Survived` column is then extracted. Because `Survived` contains only 0s and 1s, dividing its sum by its length gives the fraction of women who survived.

In [4]:
women = train_data.loc[train_data.Sex == 'female']["Survived"]
rate_women = sum(women)/len(women)

print("% of women who survived:", rate_women)

% of women who survived: 0.7420382165605095


### Calculating the Survival Rate for Men

This section repeats the calculation for male passengers. The `loc` indexer selects the rows where the `Sex` column equals `'male'`, the `Survived` column is extracted, and its sum divided by its length gives the fraction of men who survived.

In [5]:
men = train_data.loc[train_data.Sex == 'male']["Survived"]
rate_men = sum(men)/len(men)

print("% of men who survived:", rate_men)

% of men who survived: 0.18890814558058924
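
Both rates can also be computed in a single step with `groupby`. The sketch below is an alternative to the two cells above, not the approach the notebook uses, and it runs on a toy frame (`toy`, with made-up values) rather than the real data.

```python
import pandas as pd

# Toy stand-in for `train_data` (values are illustrative, not the real dataset).
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# The mean of a 0/1 column is the fraction of ones, i.e. the survival rate.
rates = toy.groupby("Sex")["Survived"].mean()
print(rates)

# The sum/len approach from the cells above gives the same value for one group.
women = toy.loc[toy.Sex == "female"]["Survived"]
print(sum(women) / len(women) == rates["female"])
```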


### Gender-Based Survival Analysis

The analysis shows a large difference in survival rates between male and female passengers. About **74%** of the women in the training set survived, whereas only about **19%** of the men did. This stark contrast suggests that gender is a strong predictor of survival, making the submission file in `gender_submission.csv` a reasonable initial guess for predictions.
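
A gender-only baseline of this kind can be sketched as follows. The frame `toy_test` and its passenger IDs are hypothetical stand-ins for `test_data`; the rule itself (predict 1 for women, 0 for men) is the one described for `gender_submission.csv`.

```python
import pandas as pd

# Toy stand-in for `test_data`; the IDs and values are hypothetical.
toy_test = pd.DataFrame({
    "PassengerId": [1001, 1002, 1003, 1004],
    "Sex": ["male", "female", "male", "female"],
})

# Predict 1 (survived) for every female passenger and 0 for every male passenger.
baseline = pd.DataFrame({
    "PassengerId": toy_test["PassengerId"],
    "Survived": (toy_test["Sex"] == "female").astype(int),
})
print(baseline)
```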

However, basing predictions solely on gender (i.e., a single feature) is quite simplistic. It limits the ability to capture more complex patterns that could lead to more accurate predictions. By considering multiple features simultaneously, we can uncover more sophisticated relationships in the data that may improve our model's performance.

While manually exploring every possible pattern across various features would be time-consuming and inefficient, **machine learning** provides an efficient solution by automating the process of pattern discovery. By training a model on multiple features, we can make better-informed predictions for survival.


### Building a Machine Learning Model: Random Forest

In this section, we'll build a **random forest model**. A random forest consists of multiple decision trees. Each tree makes its own prediction based on passenger data and "votes" on whether a passenger survived. The final prediction is made based on the majority vote from all the trees in the forest.

We will use the features **Pclass**, **Sex**, **SibSp**, and **Parch** from the dataset to train the model. The training data comes from `train.csv`, and the model will generate predictions for the passengers in `test.csv`. These predictions will be saved in a new CSV file called `submission.csv`.
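
The voting idea can be illustrated with a few lines of plain Python. This is a conceptual sketch with hypothetical votes, not how the library is implemented: scikit-learn's `RandomForestClassifier` combines its trees by averaging their predicted class probabilities, which often, but not always, agrees with a simple majority vote.

```python
from collections import Counter

# Hypothetical votes from five trees for a single passenger
# (1 = survived, 0 = did not survive).
tree_votes = [1, 0, 1, 1, 0]

# The most common vote becomes the forest's prediction.
vote_counts = Counter(tree_votes)
prediction, n_votes = vote_counts.most_common(1)[0]

print(f"Votes: {dict(vote_counts)} -> prediction: {prediction}")
```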

### Model Training and Prediction Using Random Forest

In this section, we build a **Random Forest** model to predict passenger survival on the Titanic. Here's a breakdown of the process:

1. **Target Variable (y)**: 
   The target variable `y` represents whether each passenger survived, which is extracted from the `Survived` column of the training data (`train_data`).

2. **Features (X)**:
   The model uses the following features to make predictions:
   - `Pclass`: The class of the passenger (1st, 2nd, or 3rd class)
   - `Sex`: The gender of the passenger (male or female)
   - `SibSp`: The number of siblings or spouses aboard
   - `Parch`: The number of parents or children aboard
   
   Of these, only `Sex` is stored as text. `pd.get_dummies` one-hot encodes it into numeric indicator columns (`Sex_female`, `Sex_male`) and leaves the numeric columns `Pclass`, `SibSp`, and `Parch` unchanged.

3. **Random Forest Model**:
   A `RandomForestClassifier` is initialized with the following settings:
   - `n_estimators=100`: This means the model will create 100 individual decision trees.
   - `max_depth=5`: This limits the maximum depth of each tree to prevent overfitting.
   - `random_state=1`: Ensures reproducibility of results by fixing the random seed.

4. **Model Training**:
   The model is trained using the training data (`X` as the features and `y` as the target variable). The model learns patterns from the features to make predictions about survival.

5. **Predictions**:
   The trained model is then used to predict the survival outcomes for the passengers in `test_data`. The input `X_test` is built from the same four features and encoded in the same way as `X`.

6. **Output**:
   The predictions are stored in a DataFrame with `PassengerId` and the predicted `Survived` values. This DataFrame is saved as a CSV file called `submission.csv` for submission.

The process concludes by printing a confirmation message once `submission.csv` has been written.
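
The encoding step can be seen in isolation on toy data. The sketch below uses made-up frames (`toy_train`, `toy_test`) and adds a `reindex` call that the notebook does not use: it guards against the case where a category is missing from one of the two frames, which would otherwise leave them with different columns.

```python
import pandas as pd

features = ["Pclass", "Sex", "SibSp", "Parch"]

# Toy stand-ins for train_data and test_data (illustrative values only).
toy_train = pd.DataFrame({
    "Pclass": [3, 1, 2],
    "Sex": ["male", "female", "female"],
    "SibSp": [1, 1, 0],
    "Parch": [0, 0, 2],
})
toy_test = pd.DataFrame({
    "Pclass": [3, 2],
    "Sex": ["male", "male"],  # no female passengers in this toy test set
    "SibSp": [0, 1],
    "Parch": [0, 0],
})

# Only the string column `Sex` is expanded; numeric columns pass through unchanged.
X = pd.get_dummies(toy_train[features])
print(list(X.columns))

# Encoding each frame separately can yield different columns, so align the
# test frame to the training columns, filling absent indicators with 0.
X_test = pd.get_dummies(toy_test[features]).reindex(columns=X.columns, fill_value=0)
print(list(X_test.columns))
```

In the real data both sexes appear in `train.csv` and `test.csv`, so the two encoded frames already share the same columns and the alignment step is a precaution rather than a requirement.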

In [6]:
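from sklearn.ensemble import RandomForestClassifier

# Target: 1 = survived, 0 = did not survive.
y = train_data["Survived"]

# One-hot encode the string column `Sex`; numeric columns are left unchanged.
features = ["Pclass", "Sex", "SibSp", "Parch"]
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])

# 100 trees, each limited to depth 5; the fixed seed makes results reproducible.
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

# Write the predictions in the two-column format expected for submission.
output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")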

Your submission was successfully saved!
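
Because `test.csv` has no `Survived` column, the accuracy of these predictions cannot be measured locally. One common way to get an estimate is to hold out part of the labelled training data for validation. The sketch below shows the idea on synthetic data (random values generated for illustration, not the Titanic data); the 80/20 split and the variable names are assumptions, and this step is not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded training features and labels
# (random values for illustration only, not the Titanic data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=n),
    "SibSp": rng.integers(0, 3, size=n),
    "Parch": rng.integers(0, 3, size=n),
    "Sex_female": rng.integers(0, 2, size=n),
})
# Make the label follow Sex_female, flipped for roughly 10% of rows,
# so there is a pattern for the model to learn.
noise = rng.random(n) < 0.1
y = pd.Series(np.where(noise, 1 - X["Sex_female"], X["Sex_female"]), name="Survived")

# Hold out 20% of the labelled rows to estimate accuracy on unseen data.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X_train, y_train)

val_accuracy = accuracy_score(y_val, model.predict(X_val))
print(f"Validation accuracy: {val_accuracy:.3f}")
```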


### Conclusion

At the end of this process, the model has made predictions for the test data, which are stored in a CSV file called `submission.csv`. This file contains the predicted survival outcomes for each passenger in the test set, with two columns: `PassengerId` and `Survived`.

### Submission Output

| PassengerId | Survived |
|-------------|----------|
| 892         | 0        |
| 893         | 1        |
| 894         | 0        |
| 895         | 0        |
| 896         | 1        |
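
Before uploading, the file layout can be checked with a few assertions. The helper `check_submission` below is hypothetical (it is not part of pandas or Kaggle tooling) and is only a sketch of the checks implied by the format described above: two columns in the right order, the expected number of rows, and only 0 or 1 in `Survived`.

```python
import pandas as pd

def check_submission(submission: pd.DataFrame, expected_rows: int) -> None:
    """Raise ValueError if the frame does not match the expected submission layout."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"Unexpected columns: {list(submission.columns)}")
    if len(submission) != expected_rows:
        raise ValueError(f"Expected {expected_rows} rows, found {len(submission)}")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")

# The five rows shown above; for the full file, expected_rows would be 418.
preview = pd.DataFrame({
    "PassengerId": [892, 893, 894, 895, 896],
    "Survived": [0, 1, 0, 0, 1],
})
check_submission(preview, expected_rows=5)
print("Submission format looks valid.")
```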
