# 🚢 **TITANIC SURVIVAL PREDICTION - NEURAL NETWORK PROJECT**

**Francisco Teixeira Barbosa** | GitHub: [Tuminha](https://github.com/Tuminha) | Email: cisco@periospot.com

---

## 📋 **PROJECT OVERVIEW**

Following the success of my **Hotel Cancellation Predictor** (82.65% accuracy), I'm now tackling the classic Titanic survival prediction challenge using neural networks. This smaller dataset (891 passengers vs 119K hotel bookings) will help me reinforce deep learning concepts with faster iteration and clearer results.

### **🎯 Learning Objectives:**
- Master binary classification with PyTorch neural networks
- Practice feature engineering on historical data
- Implement proper train/test evaluation methodology
- Build a production-ready ML pipeline

### **📊 Dataset Context:**
- **Source**: Kaggle Titanic Competition
- **Size**: 891 training samples
- **Target**: Predict passenger survival (1 = survived, 0 = died)
- **Expected Performance**: 80%+ accuracy

---


## 📚 **PHASE 1: DATA EXPLORATION**

### **🎯 Phase Objectives:**
Understand the Titanic dataset structure, identify missing values, and discover survival patterns that will guide our feature engineering and model design.

---


### **Task 1: Download and Load Titanic Dataset**

**Objective**: Obtain the Kaggle Titanic dataset and load it into our analysis environment.

**TODO**: Download the Titanic dataset from Kaggle and load it using pandas
- Research: What are the different ways to download Kaggle datasets?
- Hint: You can use the Kaggle API or download manually from the website
- Consider: Where should you save the dataset files?

**Business Context**: The Titanic dataset is a classic in ML education, representing a real historical event with clear survival outcomes. Understanding this context helps us interpret our model's predictions meaningfully.


In [None]:
# Import necessary libraries
# TODO: Import pandas, numpy, and matplotlib/seaborn for data analysis
# Research: What other libraries might be useful for data exploration?

# Load the Titanic dataset
# TODO: Read the train.csv file from your data directory
# Hint: Use pd.read_csv() and check the file path
# Research: What parameters can you use with read_csv()?

# Display basic information about the dataset
# TODO: Show the first few rows and basic dataset info
# Hint: Use .head() and .info() methods
# Research: What other methods help you understand dataset structure?
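
One possible shape for the loading step is sketched below. The `data/train.csv` path and the `load_titanic` helper are assumptions for illustration, not part of the project. The two sample rows are made up so the snippet runs without the Kaggle download; only the column names follow the Kaggle file.

```python
import io

import pandas as pd

# Hypothetical location: adjust to wherever you saved the Kaggle file.
DATA_PATH = "data/train.csv"


def load_titanic(source):
    """Read a Titanic-style CSV from a file path or a file-like object."""
    return pd.read_csv(source)


# Made-up stand-in rows using the Kaggle column names (not real passengers).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Doe, Mr. John",male,22,1,0,T-0001,7.25,,S\n'
    '2,1,1,"Roe, Mrs. Jane",female,38,1,0,T-0002,71.28,C85,C\n'
)

df = load_titanic(sample_csv)  # with the real file: load_titanic(DATA_PATH)

print(df.head())  # first rows of the table
df.info()         # column dtypes and non-null counts
```

With the real dataset in place, passing `DATA_PATH` instead of `sample_csv` should be the only change needed.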


### **Task 2: Exploratory Data Analysis**

**Objective**: Understand the dataset structure, data types, and identify missing values.

**TODO**: Perform comprehensive EDA to understand the data
- Research: What information does .describe() provide for numerical vs categorical columns?
- Hint: Check data types, missing values, and basic statistics
- Consider: Which columns might be most important for survival prediction?

**Business Context**: Each passenger had different characteristics that influenced their survival chances. Understanding these patterns helps us build a better predictive model.


In [None]:
# Dataset shape and basic information
# TODO: Check the shape of the dataset
# Research: What does the shape tell us about our data?

# Data types and missing values
# TODO: Use .info() to see data types and missing value counts
# TODO: Use .isnull().sum() to get detailed missing value information
# Research: Which columns have the most missing data?

# Basic statistics
# TODO: Use .describe() to see numerical column statistics
# Research: What insights can you draw from the statistical summary?

# Target variable analysis
# TODO: Check the distribution of the Survived column
# Research: What does this tell us about the class balance?
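
The checks listed above could look roughly like the following sketch. It uses a small made-up DataFrame (not real Titanic rows) so that it runs on its own; the variable names `missing` and `class_balance` are illustrative choices.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data with a few Titanic-style columns.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 1],
    "Pclass": [3, 1, 3, 1, 3, 2],
    "Sex": ["male", "female", "female", "male", "male", "female"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 27.0],
    "Cabin": [None, "C85", None, "C123", None, None],
})

# Shape: (number of rows, number of columns)
print(df.shape)

# Missing values per column, largest first
missing = df.isnull().sum().sort_values(ascending=False)
print(missing)

# Summary statistics; include="all" also covers non-numeric columns
print(df.describe(include="all"))

# Class balance of the target as proportions
class_balance = df["Survived"].value_counts(normalize=True)
print(class_balance)
```

A strongly skewed `class_balance` would suggest looking beyond plain accuracy when evaluating the model later.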


### **Task 3: Visualize Survival Patterns**

**Objective**: Create visualizations to understand which factors correlate with survival.

**TODO**: Create visualizations showing survival patterns by key features
- Research: What types of plots work best for categorical vs numerical features?
- Hint: Use seaborn for statistical visualizations
- Consider: How can you show survival rates by different passenger characteristics?

**Business Context**: Historical accounts mention "women and children first" - our visualizations should reveal if this pattern appears in the data.


In [None]:
# Import visualization libraries
# TODO: Import matplotlib and seaborn for plotting
# Research: What other visualization libraries might be useful?

# Set up plotting style
# TODO: Configure matplotlib for better-looking plots
# Hint: Use plt.style.use() or seaborn.set_style()

# Survival rate by gender
# TODO: Create a visualization showing survival rates by Sex
# Research: What's the best plot type for comparing categorical groups?
# Hint: Consider using seaborn.countplot() or bar plots

# Survival rate by passenger class
# TODO: Visualize survival patterns by Pclass
# Research: How can you show both counts and percentages?

# Survival rate by age groups
# TODO: Create age groups and visualize survival patterns
# Research: How can you bin continuous data for analysis?
# Hint: Use pd.cut() or create custom age ranges

# Family size impact
# TODO: Create a family size feature and analyze its impact
# Research: How can you combine SibSp and Parch to create meaningful groups?
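
Before plotting, it can help to compute the survival rates directly. The sketch below uses a made-up DataFrame, and the age bins and labels are arbitrary choices for illustration. Defining family size as `SibSp + Parch + 1` (counting the passenger) is one common convention, assumed here.

```python
import pandas as pd

# Made-up stand-in data (not real Titanic rows).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Age": [22, 38, 4, 35, 27, 61, 14, 45],
    "SibSp": [1, 1, 0, 0, 0, 0, 1, 4],
    "Parch": [0, 0, 1, 0, 2, 0, 1, 2],
})

# The mean of a 0/1 target is the survival rate within each group.
rate_by_sex = df.groupby("Sex")["Survived"].mean()

# Counts and rates side by side for passenger class.
rate_by_class = df.groupby("Pclass")["Survived"].agg(["count", "mean"])

# Bin continuous ages into labelled groups (bin edges are an assumption).
bins = [0, 12, 18, 60, 120]
labels = ["child", "teen", "adult", "senior"]
df["AgeGroup"] = pd.cut(df["Age"], bins=bins, labels=labels)
rate_by_age = df.groupby("AgeGroup", observed=True)["Survived"].mean()

# Family size: siblings/spouses + parents/children + the passenger.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
rate_by_family = df.groupby("FamilySize")["Survived"].mean()

print(rate_by_sex, rate_by_class, rate_by_age, rate_by_family, sep="\n\n")
```

Each of these Series can be drawn as a bar chart, for example with `rate_by_sex.plot.bar()` (requires matplotlib) or with seaborn's bar plots.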


### **Task 4: Correlation Analysis and Feature Insights**

**Objective**: Analyze correlations between features and identify the most important predictors.

**TODO**: Perform correlation analysis and derive business insights
- Research: What's the difference between Pearson and Spearman correlation?
- Hint: Use correlation matrices and heatmaps
- Consider: Which features show the strongest relationship with survival?

**Business Context**: Understanding feature importance helps us focus our feature engineering efforts and build more interpretable models.


In [None]:
# Correlation matrix
# TODO: Create a correlation matrix for numerical features
# Research: Which features should you include in the correlation analysis?
# Hint: Convert categorical variables to numerical first if needed

# Visualize correlations
# TODO: Create a heatmap of the correlation matrix
# Research: What parameters can you use to make heatmaps more readable?
# Hint: Use seaborn.heatmap() with appropriate parameters

# Feature importance analysis
# TODO: Calculate survival rates by different feature combinations
# Research: How can you quantify the predictive power of each feature?
# Hint: Consider using groupby() with multiple features

# Summary insights
# TODO: Document your key findings from Phase 1
# Research: What are the most important insights for feature engineering?
# Consider: Which features should we focus on in preprocessing?
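
A minimal sketch of the correlation step follows, again on made-up data. The `SexFemale` column is a hypothetical 0/1 encoding introduced here so that `Sex` can enter the correlation matrix.

```python
import pandas as pd

# Made-up stand-in data (not real Titanic rows).
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 0, 0, 1, 1],
    "Sex": ["male", "female", "female", "male",
            "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3, 1, 2],
    "Age": [22, 38, 4, 35, 27, 61, 14, 45],
    "Fare": [7.25, 71.28, 16.70, 53.10, 13.00, 8.05, 120.00, 26.00],
})

# Encode Sex as 0/1 so it can be correlated with numeric columns.
encoded = df.assign(SexFemale=(df["Sex"] == "female").astype(int))
numeric_cols = ["Survived", "Pclass", "Age", "Fare", "SexFemale"]

# Pearson measures linear association; Spearman works on ranks.
pearson = encoded[numeric_cols].corr(method="pearson")
spearman = encoded[numeric_cols].corr(method="spearman")

# Correlation of each feature with the target, strongest first.
target_corr = pearson["Survived"].drop("Survived")
target_corr = target_corr.reindex(
    target_corr.abs().sort_values(ascending=False).index
)
print(target_corr)

# Survival rate for combinations of two features.
combo_rates = encoded.groupby(["Sex", "Pclass"])["Survived"].mean()
print(combo_rates)
```

The `pearson` matrix can be passed to `seaborn.heatmap(pearson, annot=True, cmap="coolwarm")` for the heatmap. Correlation with a binary target is only a rough guide to predictive power, so the grouped survival rates are a useful cross-check.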


## 📋 **PHASE 1 SUMMARY**

### **Key Findings:**
- [ ] **Dataset Structure**: Document shape, data types, and missing values
- [ ] **Survival Patterns**: Identify key factors that correlate with survival
- [ ] **Missing Data Strategy**: Plan how to handle missing values in Age, Cabin, Embarked
- [ ] **Feature Engineering Opportunities**: Identify new features to create

### **Next Steps for Phase 2:**
- [ ] Handle missing values intelligently
- [ ] Engineer new features (Title, Family Size, etc.)
- [ ] Encode categorical variables
- [ ] Scale features for neural network training

---

**🎯 Ready for Phase 2: Data Preprocessing!**
