# Titanic Dataset - Social Inequality Analysis

## 1. Introduction
This notebook explores **social inequality** in the Titanic dataset by examining how passenger class, gender, and other factors influenced survival rates. We applied two machine learning models, a **Shallow Artificial Neural Network (ANN)** and **Multiclass Logistic Regression**, to predict survival based on key features.

## 2. Data Preparation and Feature Engineering
Before building machine learning models, we performed several data preparation steps:
- **Handling duplicates**: We identified and removed **95 duplicates** from the dataset.
- **Feature engineering `Pclass`**: We combined the one-hot encoded columns into a single categorical feature, preserving the ordinal nature of passenger class.

### 2.1 Feature Engineering for `Pclass`
- Initially, `Pclass` was one-hot encoded into `Pclass_1`, `Pclass_2`, `Pclass_3`, which resulted in an accuracy of **58%** with the logistic regression model.
- After feature engineering `Pclass` by combining it into a single column, the accuracy significantly improved to **95.6%** with the logistic regression model.
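
One possible way to collapse the one-hot columns back into a single ordinal column is shown below. This is a minimal sketch that assumes each row has exactly one of `Pclass_1`, `Pclass_2`, `Pclass_3` set to 1; it is not necessarily how the notebook implemented the step.

```python
import pandas as pd

# Made-up one-hot rows; each row has exactly one class flag set.
onehot = pd.DataFrame({
    "Pclass_1": [1, 0, 0, 0],
    "Pclass_2": [0, 1, 0, 0],
    "Pclass_3": [0, 0, 1, 1],
})
class_cols = ["Pclass_1", "Pclass_2", "Pclass_3"]

# idxmax(axis=1) returns the name of the column holding the 1 in each row;
# the suffix after the underscore is the class number.
pclass = onehot[class_cols].idxmax(axis=1).str.split("_").str[-1].astype(int)

# Replace the three indicator columns with the single ordinal column.
df = onehot.drop(columns=class_cols).assign(Pclass=pclass)
print(df["Pclass"].tolist())
```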

## 3. Exploratory Data Analysis (EDA)

### 3.1 Survival Rate Distribution Across Classes and Genders
- A bar plot of survival rates across different passenger classes and genders shows a clear pattern:
    - **First-class passengers** had the highest survival rate, with females surviving at a much higher rate than males.
    - **Third-class passengers** had the lowest survival rate, particularly among males, indicating a strong influence of both class and gender on survival.

#### Observations:
- **Wealthier passengers** (first class) had the highest chance of survival, consistent with better access to lifeboats, while passengers from lower classes (third class) had a much lower chance of survival.
- **Gender** also played a significant role, as females were given priority during the rescue, leading to significantly higher survival rates among women.
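
The survival rates behind such a bar plot can be computed with a grouped mean. The sketch below uses made-up rows and assumes the columns are named `Pclass`, `Sex`, and `Survived`; the numbers it prints are not results from the notebook.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Pclass":   [1, 1, 1, 3, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1],
})

# The mean of a 0/1 column is the survival rate within each group.
rates = df.groupby(["Pclass", "Sex"])["Survived"].mean().unstack("Sex")
print(rates)
```

Calling `rates.plot(kind="bar")` on the resulting table would give one group of bars per class, split by gender.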

### 3.2 Influence of Fare on Survival
- A boxplot of fare distribution across passenger classes and survival status shows that:
    - Passengers who paid higher fares were more likely to survive.
    - First-class passengers, who paid the highest fares, had the best survival rate, while third-class passengers with the lowest fares had the worst survival rate.

#### Observations:
- **Fare is a proxy for wealth**, and wealthier passengers, who had better accommodations, were more likely to have access to lifeboats, contributing to higher survival rates.

### 3.3 Correlation Matrix for Social Inequality Features
- A heatmap displaying the correlation between features related to social inequality (e.g., `Pclass`, `Fare`, `Age`, `Sex`, and `Title`) shows strong correlations:
    - `Pclass` is strongly correlated with `Fare` (negatively, since class 1 is the highest class), highlighting the fact that wealthier passengers traveled in higher classes.
    - `Sex` and `Title_Mr`, `Title_Mrs`, and `Title_Miss` showed significant correlations with survival, reflecting the gender-based priority given during evacuation.

#### Observations:
- **Passenger class and fare** are key indicators of wealth, and wealth had a substantial impact on survival.
- **Gender-based societal norms** during the time of the Titanic disaster are evident in the correlation between `Sex` and survival, as women had a higher survival rate.
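
A correlation matrix of this kind can be produced with `DataFrame.corr()` once categorical columns are encoded numerically. The sketch below uses made-up rows and a hypothetical 0/1 encoding of `Sex`; the notebook's actual encoding and values may differ.

```python
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Fare":     [80.0, 60.0, 26.0, 13.0, 8.0, 7.5],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Survived": [1, 1, 1, 0, 1, 0],
})

# Assumed encoding: female -> 1, male -> 0.
encoded = df.assign(Sex=(df["Sex"] == "female").astype(int))

# Pearson correlation between every pair of numeric columns.
corr = encoded.corr()
print(corr.round(2))
```

Because `Pclass` is coded 1 for first class, its correlation with `Fare` comes out negative in this toy example even though higher class means higher fare.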

### 3.4 Influence of Family Size on Survival
- A bar plot visualizing the survival rate by family size shows:
    - **Smaller family sizes (1-3 members)** had higher survival rates compared to larger families.
    - Passengers with very large families (more than 4 members) had significantly lower survival rates, possibly due to difficulties in evacuating larger groups.

#### Observations:
- Traveling alone or in small groups appears to have been an advantage during evacuation, while larger families faced more challenges, leading to lower survival rates.
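
The source does not state how family size was derived. A common convention, assumed here, is `FamilySize = SibSp + Parch + 1` (siblings/spouses plus parents/children plus the passenger). The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; SibSp and Parch are the usual Titanic column names (assumed).
df = pd.DataFrame({
    "SibSp":    [0, 1, 1, 4, 5],
    "Parch":    [0, 0, 2, 2, 2],
    "Survived": [0, 1, 1, 0, 0],
})

# Assumed definition: relatives aboard plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1

# Survival rate for each family size.
rate_by_size = df.groupby("FamilySize")["Survived"].mean()
print(rate_by_size)
```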

## 4. Model 1: Shallow Artificial Neural Network (ANN)

### 4.1 Model Description
We implemented a **Shallow ANN** with the following architecture:
- **Input layer**: Corresponding to the number of input features.
- **Hidden layer**: 6 neurons using the `tanh` activation function.
- **Output layer**: Softmax function to handle multiclass classification.
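
The forward pass of such a network can be sketched in NumPy as below. The feature count (5), the two output classes, and the weight initialization are assumptions made for the example; only the 6 `tanh` hidden units and the softmax output come from the description above.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def forward(X, W1, b1, W2, b2):
    # Hidden layer with tanh activation, then softmax over the output classes.
    hidden = np.tanh(X @ W1 + b1)
    return softmax(hidden @ W2 + b2)

rng = np.random.default_rng(0)
n_features, n_hidden, n_classes = 5, 6, 2   # n_features and n_classes are assumed

W1 = rng.normal(scale=0.1, size=(n_features, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

X = rng.normal(size=(4, n_features))        # four random stand-in samples
probs = forward(X, W1, b1, W2, b2)          # one probability row per sample
predictions = probs.argmax(axis=1)
print(probs.shape, predictions)
```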

### 4.2 Results
- After training the model for **30,000 epochs**, the **Shallow ANN** achieved an accuracy of **69.38%** (`0.69375`).

### 4.3 Observations
- The shallow ANN reached a lower accuracy than the logistic regression model with feature engineering (69.38% versus 95.6%). This suggests that logistic regression is more effective for this dataset when `Pclass` is feature-engineered, potentially because the Titanic dataset is small and simple.

## 5. Model 2: Multiclass Logistic Regression

### 5.1 Model Description
We implemented a **Multiclass Logistic Regression** using softmax activation for multiclass classification. We focused on optimizing the model by:
- Feature engineering the `Pclass` feature.
- Fine-tuning hyperparameters such as the learning rate (`eta`) and the number of epochs.
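
A minimal sketch of softmax (multiclass logistic) regression trained with full-batch gradient descent is shown below. The function name `train_softmax_regression`, the toy data, and the chosen `eta` and `epochs` values are illustrative assumptions, not the notebook's implementation.

```python
import numpy as np

def softmax(z):
    # Subtract the row-wise max for numerical stability before exponentiating.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax_regression(X, y, n_classes, eta=0.1, epochs=1000):
    n_samples, n_features = X.shape
    W = np.zeros((n_features, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]                 # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)               # predicted class probabilities
        grad = (P - Y) / n_samples           # gradient of mean cross-entropy w.r.t. logits
        W -= eta * (X.T @ grad)
        b -= eta * grad.sum(axis=0)
    return W, b

# Toy, linearly separable data with a single feature (made up for the example).
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

W, b = train_softmax_regression(X, y, n_classes=2, eta=0.1, epochs=5000)
pred = softmax(X @ W + b).argmax(axis=1)
accuracy = (pred == y).mean()
print(accuracy)
```

Here `eta` and `epochs` play the role of the two hyperparameters named in the list above.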

### 5.2 Results
- Without feature engineering `Pclass` (using one-hot encoding), the logistic regression model achieved an accuracy of **58%**.
- After feature engineering `Pclass`, the accuracy increased significantly to **95.6%**.

### 5.3 Observations
- The logistic regression model outperformed the shallow ANN when the `Pclass` feature was combined into a single column, highlighting the importance of proper feature engineering for this dataset.
- The accuracy of **95.6%** demonstrates that logistic regression effectively models survival on the Titanic, especially with the engineered features.

## 6. Hyperparameter Tuning and Performance Comparison

### 6.1 Shallow ANN
We performed hyperparameter tuning for the shallow ANN by experimenting with:
- **Learning rates**: `1e-4`, `1e-3`, `1e-2`.
- **Number of neurons** in the hidden layer: `3`, `6`, `10`.
- **Number of epochs**: `5,000`, `10,000`, `30,000`.

Despite extensive tuning, the shallow ANN achieved a maximum accuracy of **69.38%**, suggesting that more complex neural networks may be unnecessary for this dataset.
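
A search over these values can be written as a loop over the Cartesian product of the grid. In the sketch below, `grid_search` and `dummy_evaluate` are hypothetical names; `dummy_evaluate` is only a placeholder so the code runs and would be replaced by a function that trains the network and returns validation accuracy.

```python
from itertools import product

def grid_search(evaluate, grid):
    """Return (best_params, best_score) over all combinations of grid values."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:               # ties keep the first combination seen
            best_params, best_score = params, score
    return best_params, best_score

# The grid values are the ones listed above.
grid = {
    "eta": [1e-4, 1e-3, 1e-2],
    "n_hidden": [3, 6, 10],
    "epochs": [5_000, 10_000, 30_000],
}

def dummy_evaluate(eta, n_hidden, epochs):
    # Placeholder score, NOT a real accuracy: peaks at n_hidden=6, eta=1e-3.
    return -abs(n_hidden - 6) - abs(eta - 1e-3)

best_params, best_score = grid_search(dummy_evaluate, grid)
print(best_params)
```

With three values per hyperparameter the grid contains 27 combinations, so each evaluation at 30,000 epochs dominates the total run time.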

### 6.2 Logistic Regression
We tuned the logistic regression model and found that, after feature engineering, it achieved an accuracy of **95.6%**. This highlights the effectiveness of logistic regression for this classification task, especially with properly engineered features like `Pclass`.

## 7. Conclusion
- The **Multiclass Logistic Regression** model performed best with an accuracy of **95.6%** after feature engineering the `Pclass` feature. This demonstrates the importance of preserving the ordinal relationship between passenger classes in the dataset.
- The **Shallow ANN**, even with hyperparameter tuning, reached an accuracy of **69.38%**, which is lower than that of logistic regression.
- **Feature engineering** and **hyperparameter tuning** played a critical role in improving model performance, particularly in the logistic regression model.

## 8. Future Improvements
- Consider using **deeper neural networks** for capturing more complex feature interactions.
- Apply additional **feature engineering techniques**, such as binning continuous variables like `Age` and `Fare`, to further improve model performance.
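
Binning of this kind could be sketched with `pd.cut` (fixed edges) and `pd.qcut` (quantile-based edges). The cut points, labels, and sample values below are hypothetical choices for illustration.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "Age":  [2, 15, 30, 45, 70],
    "Fare": [7.25, 8.05, 13.0, 26.55, 71.28],
})

# Fixed-width style bins with hypothetical edges; intervals are right-closed.
age_edges = [0, 12, 18, 60, 120]
age_labels = ["child", "teen", "adult", "senior"]
df["AgeBin"] = pd.cut(df["Age"], bins=age_edges, labels=age_labels)

# Quantile bins: labels=False returns integer codes 0..3 for the four quartiles.
df["FareBin"] = pd.qcut(df["Fare"], q=4, labels=False)

print(df)
```

Quantile bins suit `Fare` because its distribution is heavily skewed, while fixed edges keep `Age` groups interpretable.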
