# 🚢 **TITANIC SURVIVAL PREDICTION - NEURAL NETWORK PROJECT**

**Francisco Teixeira Barbosa** | GitHub: [Tuminha](https://github.com/Tuminha) | Email: cisco@periospot.com

---

## 📋 **PROJECT OVERVIEW**

Following the success of my **Hotel Cancellation Predictor** (82.65% accuracy), I'm now tackling the classic Titanic survival prediction challenge using neural networks. This smaller dataset (891 passengers vs 119K hotel bookings) will help me reinforce deep learning concepts with faster iteration and clearer results.

### **🎯 Learning Objectives:**
- Master binary classification with PyTorch neural networks
- Practice feature engineering on historical data
- Implement proper train/test evaluation methodology
- Build a production-ready ML pipeline

### **📊 Dataset Context:**
- **Source**: Kaggle Titanic Competition
- **Size**: 891 training samples
- **Target**: Predict passenger survival (1 = survived, 0 = died)
- **Expected Performance**: 80%+ accuracy

---


## 📚 **PHASE 1: DATA EXPLORATION**

### **🎯 Phase Objectives:**
Understand the Titanic dataset structure, identify missing values, and discover survival patterns that will guide our feature engineering and model design.

---


### **Task 1: Download and Load Titanic Dataset**

**Objective**: Obtain the Kaggle Titanic dataset and load it into our analysis environment.

**TODO**: Download the Titanic dataset from Kaggle and load it using pandas
- Research: What are the different ways to download Kaggle datasets?
- Hint: You can use the Kaggle API or download manually from the website
- Consider: Where should you save the dataset files?

**Business Context**: The Titanic dataset is a classic in ML education, representing a real historical event with clear survival outcomes. Understanding this context helps us interpret our model's predictions meaningfully.
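
As a starting point for the loading step, here is a minimal sketch of a loader that fails early when the file is missing. The helper name `load_titanic_csv` is hypothetical, and the error message assumes the files were fetched with the Kaggle CLI (`kaggle competitions download -c titanic`) and unzipped into a `data/` folder. Adjust both to match your own setup.

```python
from pathlib import Path

import pandas as pd


def load_titanic_csv(path):
    """Read one Titanic CSV, failing early with a clear message if it is missing."""
    csv_path = Path(path)
    if not csv_path.exists():
        # Assumption: the dataset comes from the Kaggle "titanic" competition
        raise FileNotFoundError(
            f"{csv_path} not found. Download the competition files first, "
            "for example with: kaggle competitions download -c titanic"
        )
    return pd.read_csv(csv_path)
```

With the files in place, `train_data = load_titanic_csv("data/train.csv")` would behave like the plain `pd.read_csv()` call, but a wrong path produces an actionable error instead of a bare traceback.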


In [None]:
# Import necessary libraries
# TODO: Import pandas, numpy, and matplotlib/seaborn for data analysis
# Research: What other libraries might be useful for data exploration?
import pandas as pd
import numpy as np

# Load the Titanic dataset
# TODO: Read the train.csv file from your data directory
# Hint: Use pd.read_csv() and check the file path
# Research: What parameters can you use with read_csv()?
train_data = pd.read_csv("data/train.csv")


# Display basic information about the dataset
# TODO: Show the first few rows and basic dataset info
# Hint: Use .head() and .info() methods
# Research: What other methods help you understand dataset structure?
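# .info() prints its report directly; .head() goes last because a notebook
# only renders the final expression of a cell
train_data.info()
train_data.head()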


|  | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [13]:
# Load test data 
test_data = pd.read_csv("data/test.csv")
test_data.head()


|  | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |


### **Task 2: Exploratory Data Analysis**

**Objective**: Understand the dataset structure, data types, and identify missing values.

**TODO**: Perform comprehensive EDA to understand the data
- Research: What information does .describe() provide for numerical vs categorical columns?
- Hint: Check data types, missing values, and basic statistics
- Consider: Which columns might be most important for survival prediction?

**Business Context**: Each passenger had different characteristics that influenced their survival chances. Understanding these patterns helps us build a better predictive model.
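
One way to make the missing-value check easier to read is to report counts and percentages together. The sketch below uses a hypothetical helper, `missing_summary`, and runs it on a few columns from the five rows shown by `.head()` earlier, so the numbers are illustrative only. On the full training set, the `.info()` counts (714, 204 and 889 non-null out of 891) correspond to 177 missing `Age` values (19.9%), 687 missing `Cabin` values (77.1%) and 2 missing `Embarked` values (0.2%).

```python
import pandas as pd


def missing_summary(df):
    """Return count and percentage of missing values, for columns that have any."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Illustrative sample: selected columns from the first five training rows
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Cabin": [None, "C85", None, "C123", None],
    "Embarked": ["S", "C", "S", "S", "S"],
})

result = missing_summary(sample)
print(result)
```

Calling `missing_summary(train_data)` would give the same kind of table for the whole dataset, sorted so the most incomplete column comes first.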


In [34]:
# Dataset shape and basic information
# TODO: Check the shape of the dataset
# Research: What does the shape tell us about our data?
train_data.shape

# Data types and missing values
# TODO: Use .info() to see data types and missing value counts
# TODO: Use .isnull().sum() to get detailed missing value information
# Research: Which columns have the most missing data?
train_data.info()
null_sum = train_data.isnull().sum()
questionable_features = []
for column in null_sum.index:
    if null_sum[column] == 0:
        print(f"Column '{column}' has 0 nulls")
    else:
        # Record each column that has missing values together with its null count
        questionable_features.append((column, int(null_sum[column])))
print(f"Features with missing values (column, null count): {questionable_features}")


# Basic statistics
# TODO: Use .describe() to see numerical column statistics
# Research: What insights can you draw from the statistical summary?
train_data.describe()
# Target variable analysis
# TODO: Check the distribution of the Survived column
# Research: What does this tell us about the class balance?


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
Column 'PassengerId' has 0 nulls
Column 'Survived' has 0 nulls
Column 'Pclass' has 0 nulls
Column 'Name' has 0 nulls
Column 'Sex' has 0 nulls
Column 'SibSp' has 0 nulls
Column 'Parch' has 0 nulls
Column 'Ticket' has 0 nulls
Column 'Fare' has 0 nulls

|  | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
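
The `Survived` mean of 0.383838 in the statistics above already answers the class-balance question: about 38.4% of the 891 passengers (342) survived and 61.6% (549) did not. A model that always predicts "died" would therefore reach roughly 61.6% accuracy, which is the baseline the 80%+ target has to beat. The sketch below shows one possible way to compute this directly. `class_balance` is a hypothetical helper, and the five-value series is only a stand-in for `train_data["Survived"]`.

```python
import pandas as pd


def class_balance(target):
    """Return the count and share of each class in a target column."""
    counts = target.value_counts().sort_index()
    return pd.DataFrame({"count": counts, "share": counts / counts.sum()})


# Stand-in for train_data["Survived"]: the first five training rows
survived = pd.Series([0, 1, 1, 1, 0], name="Survived")

balance = class_balance(survived)
# Accuracy of a model that always predicts the most common class
majority_baseline = balance["share"].max()

print(balance)
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```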


### **Task 3: Visualize Survival Patterns**

**Objective**: Create visualizations to understand which factors correlate with survival.

**TODO**: Create visualizations showing survival patterns by key features
- Research: What types of plots work best for categorical vs numerical features?
- Hint: Use seaborn for statistical visualizations
- Consider: How can you show survival rates by different passenger characteristics?

**Business Context**: Historical accounts mention "women and children first" - our visualizations should reveal if this pattern appears in the data.
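
Before plotting, it can help to compute the numbers a plot would show. Because `Survived` is coded 0/1, its mean within a group is that group's survival rate. The sketch below assumes a hypothetical helper, `survival_rate_by`, and uses a few columns from the five rows shown by `.head()` as sample data, so the rates themselves are not meaningful.

```python
import pandas as pd


def survival_rate_by(df, column):
    """Return passenger count and survival rate for each value of `column`."""
    return (
        df.groupby(column)["Survived"]
        .agg(passengers="count", survival_rate="mean")
        .sort_values("survival_rate", ascending=False)
    )


# Illustrative sample: selected columns from the first five training rows
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
})

by_sex = survival_rate_by(sample, "Sex")
by_class = survival_rate_by(sample, "Pclass")
print(by_sex)
print(by_class)
```

The same table can be drawn with `survival_rate_by(train_data, "Sex")["survival_rate"].plot.bar()`, or with `sns.barplot(data=train_data, x="Sex", y="Survived")`, which plots the group mean by default.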


In [None]:
# Import visualization libraries
# TODO: Import matplotlib and seaborn for plotting
# Research: What other visualization libraries might be useful?

# Set up plotting style
# TODO: Configure matplotlib for better-looking plots
# Hint: Use plt.style.use() or seaborn.set_style()

# Survival rate by gender
# TODO: Create a visualization showing survival rates by Sex
# Research: What's the best plot type for comparing categorical groups?
# Hint: Consider using seaborn.countplot() or bar plots

# Survival rate by passenger class
# TODO: Visualize survival patterns by Pclass
# Research: How can you show both counts and percentages?

# Survival rate by age groups
# TODO: Create age groups and visualize survival patterns
# Research: How can you bin continuous data for analysis?
# Hint: Use pd.cut() or create custom age ranges

# Family size impact
# TODO: Create a family size feature and analyze its impact
# Research: How can you combine SibSp and Parch to create meaningful groups?
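
For the last two TODOs, a possible sketch is shown below. The bin edges and labels are assumptions chosen for illustration, not values taken from the dataset or from any standard. `FamilySize` is defined here as `SibSp + Parch + 1` so that it counts the passenger as well. The helper name `add_age_group_and_family_size` is hypothetical.

```python
import pandas as pd

# Assumed age bands; pd.cut uses right-closed intervals by default, e.g. (0, 12]
AGE_BINS = [0, 12, 18, 40, 60, 120]
AGE_LABELS = ["child", "teen", "adult", "middle-aged", "senior"]


def add_age_group_and_family_size(df):
    """Return a copy of df with AgeGroup and FamilySize columns added."""
    out = df.copy()
    # Missing ages stay missing in AgeGroup
    out["AgeGroup"] = pd.cut(out["Age"], bins=AGE_BINS, labels=AGE_LABELS)
    # Siblings/spouses + parents/children + the passenger themselves
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    return out


# Illustrative sample: selected columns from the first five training rows
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "SibSp": [1, 1, 0, 1, 0],
    "Parch": [0, 0, 0, 0, 0],
})

enriched = add_age_group_and_family_size(sample)
print(enriched[["Age", "AgeGroup", "FamilySize"]])
```

Grouping the enriched frame by `AgeGroup` or `FamilySize` and taking the mean of `Survived` then gives the survival rate per group.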


### **Task 4: Correlation Analysis and Feature Insights**

**Objective**: Analyze correlations between features and identify the most important predictors.

**TODO**: Perform correlation analysis and derive business insights
- Research: What's the difference between Pearson and Spearman correlation?
- Hint: Use correlation matrices and heatmaps
- Consider: Which features show the strongest relationship with survival?

**Business Context**: Understanding feature importance helps us focus our feature engineering efforts and build more interpretable models.
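
As a rough sketch of the correlation step: Pearson measures linear association, while Spearman measures monotonic association on ranks, which makes it less sensitive to outliers such as the very large `Fare` values. The example below encodes `Sex` as a 0/1 column named `IsFemale` (a hypothetical name) and ranks features by the absolute value of their correlation with `Survived`. It runs on a few columns from the five rows shown by `.head()`, so the coefficients are illustrative and far too extreme to interpret.

```python
import pandas as pd


def survival_correlations(df, method="pearson"):
    """Correlate every numeric column with Survived, strongest first."""
    # Encode Sex numerically so it can take part in the correlation
    numeric = df.assign(IsFemale=(df["Sex"] == "female").astype(int))
    numeric = numeric.select_dtypes("number")
    corr = numeric.corr(method=method)["Survived"].drop("Survived")
    return corr.sort_values(key=abs, ascending=False)


# Illustrative sample: selected columns from the first five training rows
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 1, 0],
    "Pclass": [3, 1, 3, 1, 3],
    "Sex": ["male", "female", "female", "female", "male"],
    "Age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05],
})

pearson = survival_correlations(sample, method="pearson")
spearman = survival_correlations(sample, method="spearman")
print(pearson)
print(spearman)
```

For the heatmap, the full matrix from `numeric.corr()` can be passed to `seaborn.heatmap()`, for example with `annot=True` to print the coefficients in each cell.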


In [None]:
# Correlation matrix
# TODO: Create a correlation matrix for numerical features
# Research: Which features should you include in the correlation analysis?
# Hint: Convert categorical variables to numerical first if needed

# Visualize correlations
# TODO: Create a heatmap of the correlation matrix
# Research: What parameters can you use to make heatmaps more readable?
# Hint: Use seaborn.heatmap() with appropriate parameters

# Feature importance analysis
# TODO: Calculate survival rates by different feature combinations
# Research: How can you quantify the predictive power of each feature?
# Hint: Consider using groupby() with multiple features

# Summary insights
# TODO: Document your key findings from Phase 1
# Research: What are the most important insights for feature engineering?
# Consider: Which features should we focus on in preprocessing?


## 📋 **PHASE 1 SUMMARY**

### **Key Findings:**
- [ ] **Dataset Structure**: Document shape, data types, and missing values
- [ ] **Survival Patterns**: Identify key factors that correlate with survival
- [ ] **Missing Data Strategy**: Plan how to handle missing values in Age, Cabin, Embarked
- [ ] **Feature Engineering Opportunities**: Identify new features to create

### **Next Steps for Phase 2:**
- [ ] Handle missing values intelligently
- [ ] Engineer new features (Title, Family Size, etc.)
- [ ] Encode categorical variables
- [ ] Scale features for neural network training

---

**🎯 Ready for Phase 2: Data Preprocessing!**
