# 2. Data Understanding

## 2.1 Collecting Initial Data
### Initial Data Collection Report

This project uses a single existing data source provided with the course assignment: titanic1.csv. The dataset contains 1,309 passenger records. Each row represents one passenger and includes demographic and travel-related attributes such as Age, Sex, Fare, SibSp, Parch, Pclass, and Embarked, plus the binary target variable 2urvived (survived vs. not survived).

Initial inspection shows two missing values in Embarked (Passengerid 62 and 830). These will be handled during the data preparation phase using a documented strategy; the initial plan is to impute the most frequent category. The dataset also contains several columns named zero that are constant (all zeros) and therefore provide no predictive value. These will be removed during preprocessing.
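
The two preparation steps mentioned above (mode imputation for Embarked and removal of constant columns) could look roughly like the following. This is a minimal sketch on a toy frame, not the final preprocessing code: the toy values and the column name `zero.1` are assumptions made for illustration only.

```python
import pandas as pd

# Toy frame standing in for titanic1.csv (illustrative values, not real records).
# The name "zero.1" is an assumption about how repeated "zero" columns are labeled.
toy = pd.DataFrame({
    "Passengerid": [1, 2, 3, 4],
    "Embarked": [2.0, None, 2.0, 0.0],
    "zero": [0, 0, 0, 0],
    "zero.1": [0, 0, 0, 0],
})

# Impute missing Embarked values with the most frequent category
most_frequent = toy["Embarked"].mode().iloc[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent)

# Drop every column that holds a single constant value
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
toy = toy.drop(columns=constant_cols)

print(toy)
```

Detecting constant columns by their number of unique values avoids hard-coding the column names, which may differ depending on how the file was exported.
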
## 2.2 Describe Data
### Data Description Report

### Data Quantity
The dataset is loaded directly into the **Jupyter Notebook** from a local file using Python. It contains **1,309 rows** (passenger records) and **9 important columns** representing passenger attributes, plus the constant zero columns:

- **Passengerid** — *Integer (ID)*; unique passenger identifier
- **Age** — *Numeric (float)*; passenger age in years.
- **Fare** — *Numeric (float)*; ticket price.
- **Sex** — *Categorical (binary)*; passenger sex (encoding described in preprocessing).
- **SibSp** — *Numeric (integer)*; number of siblings/spouses aboard.
- **Parch** — *Numeric (integer)*; number of parents/children aboard.
- **Pclass** — *Categorical (integer)*; passenger class 1, 2, 3
- **Embarked** — *Categorical (integer)*; port of embarkation 1, 2, 3
- **2urvived** — *Target (binary)*; survival outcome (1 = survived, 0 = not survived).
- **zero\*** — *Constant (all zeros)*; no information content

### Variable Influence
* At this stage, attributes can be prioritized based on historical context and domain knowledge. **Sex and Age** are expected to be highly influential due to the evacuation policy often summarized as “women and children first.”
* Additionally, **Pclass and Fare** are expected to be relevant, as passenger class reflects socio-economic status, which likely affected access to lifeboats and crew assistance.
* Variables related to family structure (**SibSp, Parch**) may also be influential, since travelling with close family, especially children, could affect evacuation priority and decision-making during the disaster. Identifier and constant variables such as **Passengerid** and the **zero\*** columns are not considered relevant for prediction.

In summary, **Sex, Pclass, Age, and Fare** are expected to be the most influential, as historical evidence suggests strong survival differences across gender, socio-economic class, and age groups.
Variables related to family structure (**SibSp, Parch**) are considered moderately relevant, while identifiers such as Passengerid are not relevant for prediction.

### Planned Descriptive Statistics

- **Age**  
  - Calculate mean, median, standard deviation, minimum, and maximum age  
  - Compare average age between survivors and non-survivors  
  - Check the distribution of age and presence of missing values  

- **Fare**  
  - Calculate mean, median, standard deviation, minimum, and maximum fare  
  - Compare average fare between survivors and non-survivors  
  - Inspect fare distribution to identify skewness and outliers  

- **Sex**  
  - Calculate frequency counts for each category  
  - Compute survival rates by sex  

- **Pclass**  
  - Calculate frequency counts for each passenger class  
  - Compute survival rates per class  
  - Compare survival patterns across classes  

- **SibSp**  
  - Calculate distribution of values (counts)  
  - Compare average SibSp values between survivors and non-survivors  

- **Parch**  
  - Calculate distribution of values (counts)  
  - Compare average Parch values between survivors and non-survivors  

- **Embarked**  
  - Calculate frequency counts per embarkation port  
  - Compute survival rates per port  
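
The planned statistics above follow two repeating patterns: a numeric summary compared across outcomes, and category counts with survival rates. One possible way to express them is sketched below. The helper names `summarize_numeric` and `summarize_categorical` are hypothetical, and the toy frame only illustrates the call pattern.

```python
import pandas as pd

def summarize_numeric(frame, column, target):
    """Return overall statistics and per-outcome count/mean for a numeric column."""
    overall = frame[column].agg(["mean", "median", "std", "min", "max"])
    by_outcome = frame.groupby(target)[column].agg(["count", "mean"])
    return overall, by_outcome

def summarize_categorical(frame, column, target):
    """Return category counts and the survival rate within each category."""
    counts = frame[column].value_counts().sort_index()
    rates = frame.groupby(column)[target].mean()
    return counts, rates

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Pclass": [3, 1, 3, 1],
    "2urvived": [0, 1, 1, 1],
})

age_overall, age_by_outcome = summarize_numeric(toy, "Age", "2urvived")
class_counts, class_rates = summarize_categorical(toy, "Pclass", "2urvived")
print(age_overall, age_by_outcome, class_counts, class_rates, sep="\n\n")
```

Because the target is coded 0/1, the mean of `2urvived` within a group equals the survival rate of that group.
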



This prioritization is tentative and will be validated during exploratory data analysis and modeling. As this is an academic case study, no external business analysts are involved; analytical decisions are made based on data analysis and documented reasoning.


## 2.3 Explore Data
### Data Exploration Report

## 2.4 Verify Data
### Data Quality Report


In [None]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt


# Find project root (folder that contains "data")
def get_project_root():
    p = Path.cwd()
    while not (p / "data").exists() and p != p.parent:
        p = p.parent
    return p

PROJECT_ROOT = get_project_root()
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"

def load_data(filename="titanic1.csv"):
    input_path = RAW_DATA_DIR / filename
    print("Reading from:", input_path)  # optional but useful
    return pd.read_csv(input_path)

# Load dataset
df = load_data()

# Verify
print("Shape:", df.shape)
display(df)




In this step, the Titanic dataset was read from the raw data directory and displayed to verify that it was successfully loaded.
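
Beyond displaying the frame, the load can also be verified programmatically. The sketch below assumes the nine column names listed in section 2.2 and the reported 1,309 rows; the helper `check_loaded` is hypothetical and is shown here on a small toy frame.

```python
import pandas as pd

# Column names as listed in section 2.2 (assumed to match the CSV header exactly)
EXPECTED_COLUMNS = [
    "Passengerid", "Age", "Fare", "Sex", "SibSp",
    "Parch", "Pclass", "Embarked", "2urvived",
]

def check_loaded(frame, expected_columns, expected_rows=None):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    missing = [c for c in expected_columns if c not in frame.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    if expected_rows is not None and len(frame) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(frame)}")
    return problems

# Toy usage (illustrative values only)
toy = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28]})
problems_ok = check_loaded(toy, ["Age", "Fare"], expected_rows=2)
problems_bad = check_loaded(toy, ["Age", "Sex"], expected_rows=3)
print(problems_ok, problems_bad)
```

On the real data, the call would be `check_loaded(df, EXPECTED_COLUMNS, expected_rows=1309)`.
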

**Age**  
  - Calculate mean, median, standard deviation, minimum, and maximum age  
  - Compare average age between survivors and non-survivors  
  - Check the distribution of age and presence of missing values  


In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum age for observation
df['Age'].agg(['mean', 'median', 'std', 'min', 'max'])

To summarize the Age variable, basic descriptive statistics were calculated. 
* The average passenger age is 29.50 years
* The median age is 28 years
* The standard deviation is 12.91, indicating that passenger ages vary widely around the average
* The youngest recorded passenger is 0.17 years and the oldest is 80 years

In [None]:
# Compare average age between survivors and non-survivors to see whether survivors were generally younger or older
df.groupby('2urvived')['Age'].agg(['count', 'mean'])

The average age of survivors (28.29) is slightly lower than that of non-survivors (29.93). The difference of only 1.64 years suggests a weak association. However, averages can hide differences at the extremes (children vs. elderly), so survival should also be examined across age groups.

In [None]:
# Number of missing values in Age
na = df['Age'].isna().sum()
age_is_zero = (df['Age'] == 0.00).sum()
print(na, age_is_zero)

Age was checked for both missing (NaN) and zero values. Both counts are 0, so the column is complete and contains no zero placeholders.
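
The same check can be extended to all columns at once. The following is a minimal sketch on a toy frame with illustrative values. Note that a zero is only suspicious for some attributes: for Age it would indicate a placeholder, whereas for Fare or an integer-coded category it may be a legitimate value.

```python
import pandas as pd

# Toy frame (illustrative values only)
toy = pd.DataFrame({
    "Age": [22.0, 28.0, 0.0],
    "Embarked": [2.0, None, 0.0],
    "Fare": [7.25, 0.0, 71.28],
})

# One row per column: number of NaN entries and number of exact zeros
quality = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "n_zero": (toy == 0).sum(),
})
print(quality)
```
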

In [None]:
 
# Group ages into left-closed intervals, e.g. [0, 12), [12, 18), ...
# The last edge is 81 so that the oldest passenger (age 80) is not dropped
# when right=False.
age_bins = pd.cut(
    df['Age'],
    bins=[0, 12, 18, 30, 45, 60, 81],
    right=False
)

# Counts per age group and survival outcome
counts = pd.crosstab(age_bins, df['2urvived'])

# Side-by-side bars (not stacked)
counts.plot(kind='bar', stacked=False)
plt.xlabel("Age group")
plt.ylabel("Count")
plt.title("Survival vs Death Counts by Age Group")
plt.xticks(rotation=45)
plt.legend(["Not Survived - 0", "Survived - 1"], title="Outcome")
plt.show()

To further explore the relationship between age and survival, passengers were grouped into meaningful age intervals like children, teenagers, adults, and elderly.

The chart suggests that survival among children and teenagers (ages 0–18) is comparatively high, which is consistent with the “children first” rule. The 18–30 group is the largest, and most passengers in this age range did not survive. Beyond that, survivability appears to decrease with age. Adding sex to the chart later may reveal further patterns.
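
Because the age groups differ greatly in size, raw counts can be misleading, so it may help to look at the survival rate per group as well. The sketch below uses toy data with illustrative values. The upper edge of 81 is an assumption chosen so that an age of exactly 80 falls into the last left-closed interval.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Age": [4.0, 9.0, 15.0, 22.0, 25.0, 28.0, 40.0, 65.0],
    "2urvived": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Left-closed intervals: [0, 12), [12, 18), ..., [60, 81)
age_groups = pd.cut(toy["Age"], bins=[0, 12, 18, 30, 45, 60, 81], right=False)

# Group size and survival rate (mean of the 0/1 target) per age group
rate_by_age = toy.groupby(age_groups, observed=False)["2urvived"].agg(["count", "mean"])
print(rate_by_age)
```

Reporting the group size next to the rate makes it easier to judge how much weight a given rate deserves; empty groups show a count of 0 and a missing rate.
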

**Fare**  
  - Calculate mean, median, standard deviation, minimum, and maximum fare  
  - Compare average fare between survivors and non-survivors  
  - Examine the relationship between fare and survival

In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum fare 
df['Fare'].agg(['mean', 'median', 'std', 'min', 'max'])


* Ticket prices range widely, from 0 to 512.33
* The average ticket price is 33.28
* The median is 14.45, well below the mean, indicating that most tickets were cheap while a few expensive tickets pull the average up (a right-skewed distribution)
* The standard deviation is 51.74, confirming high variability and the presence of extreme prices

Fare is therefore considered a relevant variable for survival analysis, acting as a proxy for passenger class and socio-economic status rather than a direct cause of survival.
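
The gap between mean and median hints at right skew and outliers, which can be quantified. The sketch below uses a toy fare series with illustrative values and the common 1.5 × IQR rule. The choice of that rule is an assumption, not something prescribed by the assignment.

```python
import pandas as pd

# Toy fare values (illustrative only)
toy_fare = pd.Series([7.25, 7.90, 8.05, 13.0, 14.45, 26.0, 30.0, 512.33], name="Fare")

# Positive skewness indicates a long right tail
skewness = toy_fare.skew()

# Flag values above the upper fence Q3 + 1.5 * IQR
q1, q3 = toy_fare.quantile([0.25, 0.75])
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = toy_fare[toy_fare > upper_fence]

print("skewness:", round(skewness, 2))
print("upper fence:", round(upper_fence, 2))
print("outliers:", outliers.tolist())
```
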


In [None]:
# Compare average fare between survivors and non-survivors
df.groupby('2urvived')['Fare'].agg(['count', 'mean'])

The average fare of non-survivors is 27.94, while survivors paid an average fare of 48.40. This shows that survivors paid more for their tickets on average, which suggests a possible correlation between fare and survival.
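
To put a number on this possible correlation, one option is the Pearson correlation between Fare and the 0/1 target (equivalent to the point-biserial correlation), complemented by a rank-based version that is less sensitive to extreme fares. This is a sketch on toy data with illustrative values, not a result from the actual dataset.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Fare": [7.25, 8.05, 13.0, 26.0, 71.28, 151.55],
    "2urvived": [0, 0, 0, 1, 1, 1],
})

# Pearson correlation with a binary variable (point-biserial correlation)
pearson_r = toy["Fare"].corr(toy["2urvived"])

# Spearman correlation computed as the Pearson correlation of the ranks
rank_r = toy["Fare"].rank().corr(toy["2urvived"].rank())

print(round(pearson_r, 3), round(rank_r, 3))
```

Since Fare is heavily skewed, the rank-based coefficient is likely the more robust of the two for this variable.
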

In [None]:
# How many passengers fall into each fare range, and how many of them survived vs. died

# Create fare bins; include_lowest=True keeps passengers with a fare of 0
fare_bins = pd.cut(
    df['Fare'],
    bins=[0, 1, 5, 10, 25, 50, 100, 600],
    include_lowest=True
)

# Count survived vs died in each fare group
fare_survival = pd.crosstab(fare_bins, df['2urvived'])

fare_survival.plot(kind='bar')
plt.xlabel("Fare range")
plt.ylabel("Number of passengers")
plt.title("Survival vs Death by Fare Range")
plt.xticks(rotation=45)
plt.legend(["Died (0)", "Survived (1)"])
plt.show()

Visualizations of fare show the same underlying pattern: most passengers paid low fares, while higher fares are associated with higher survival, with no clear cutoff point.

In [None]:
# Survival rate within each fare range (normalized per row), independent of how many passengers are in the group
fare_survival_rate = pd.crosstab(
    fare_bins, 
    df['2urvived'], 
    normalize='index'
)

fare_survival_rate.plot(kind='bar')
plt.xlabel("Fare range")
plt.ylabel("Survival rate")
plt.title("Survival Rate by Fare Range")
plt.xticks(rotation=45) 
plt.legend(["Died (0)", "Survived (1)"])
plt.show()  

This figure shows survival rates by fare range, normalized within each group. Passengers who paid lower fares had a substantially lower probability of survival, while survival rates increase steadily with higher fare ranges. This indicates a strong association between fare and survival probability. However, fare likely reflects passenger class and access to evacuation resources rather than being a direct causal factor.
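
Hand-picked fare ranges can leave some groups very small. As an alternative, quantile-based bins put roughly the same number of passengers in every group. The sketch below uses `pd.qcut` on toy data with illustrative values. Using quartiles and the labels `Q1` to `Q4` is an assumption made for this example.

```python
import pandas as pd

# Toy data (illustrative values only)
toy = pd.DataFrame({
    "Fare": [0.0, 7.25, 7.9, 8.05, 13.0, 14.45, 26.0, 30.0, 71.28, 151.55, 263.0, 512.33],
    "2urvived": [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1],
})

# Four equally sized groups, from the cheapest (Q1) to the most expensive (Q4) fares
fare_quartiles = pd.qcut(toy["Fare"], q=4, labels=["Q1", "Q2", "Q3", "Q4"])

# Survival rate within each quartile
rate_by_quartile = toy.groupby(fare_quartiles, observed=False)["2urvived"].mean()
print(rate_by_quartile)
```

On real data with many identical fares, `pd.qcut` may need `duplicates="drop"` (with the `labels` argument omitted) if several quantile edges coincide.
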

**Sex**  
  - Calculate frequency counts for each category  
  - Compute survival rates by sex
  - Find correlation between Sex, Age and survivability

In [None]:
sex_counts = df['Sex'].value_counts().rename({0: 'Male', 1: 'Female'})
sex_counts.name = 'Count'
sex_counts

There are 843 male passengers and 466 female passengers.

In [None]:
sex_outcome = pd.crosstab(df['Sex'], df['2urvived']).rename({0: 'Male', 1: 'Female'})

sex_outcome

Among males, 734 did not survive and 109 survived, while among females 233 did not survive and 233 survived. This shows a strong association between sex and survival. Females have a much higher survival count relative to their group size, while male survival is much lower.
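
The counts reported above can be turned into survival rates directly. The short sketch below rebuilds the table from those four numbers instead of reading the dataset.

```python
import pandas as pd

# Counts as reported in the text above
sex_counts_reported = pd.DataFrame(
    {"Died": [734, 233], "Survived": [109, 233]},
    index=["Male", "Female"],
)

# Survival rate = survivors / group size
survival_rate = sex_counts_reported["Survived"] / sex_counts_reported.sum(axis=1)
print(survival_rate.round(3))
```

This gives a survival rate of about 12.9% for males (109 of 843) and exactly 50% for females (233 of 466).
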

In [None]:
sex_outcome.plot(kind='bar', stacked=False)
plt.xlabel("Sex")
plt.ylabel("Count")
plt.title("Survival Counts by Sex")
plt.xticks(rotation=0)
plt.legend(["Died (0)", "Survived (1)"], title="Outcome")
plt.show()

The chart makes the difference in survival by sex clearly visible.

In [None]:
# Build a crosstab of age group and sex vs. outcome, then split it into Male/Female tables
age_sex_outcome = pd.crosstab([age_bins, df['Sex']], df['2urvived'])
age_sex_outcome.columns = ['Died', 'Survived']
male_counts   = age_sex_outcome.xs(0, level=1)   # Sex=0 (Male)
female_counts = age_sex_outcome.xs(1, level=1)   # Sex=1 (Female)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

male_counts.plot(kind='bar', stacked=False, ax=axes[0], title='Male')
axes[0].set_xlabel("Age group")
axes[0].set_ylabel("Count")
axes[0].legend(["Died", "Survived"], title="Outcome")
axes[0].tick_params(axis='x', rotation=45)

female_counts.plot(kind='bar', stacked=False, ax=axes[1], title='Female')
axes[1].set_xlabel("Age group")
axes[1].legend(["Died", "Survived"], title="Outcome")
axes[1].tick_params(axis='x', rotation=45)

plt.suptitle("Survival Counts by Age Group and Sex")
plt.tight_layout()
plt.show()

The left chart shows survival outcomes for male passengers across age groups. In every age group, the number of male non-survivors is substantially higher than the number of survivors, with the largest concentration of deaths among young adult males (18–30 and 30–45). Survival among males remains low across all adult age groups and decreases further at older ages.

The right chart shows survival outcomes for female passengers across age groups. In contrast to males, females show much higher survival counts in most age groups, particularly between ages 18–45, where the number of survivors is comparable to or higher than the number of deaths. This pattern indicates a strong survival advantage for female passengers across most age ranges.

Taken together, the two graphs demonstrate a clear interaction between age and sex: survival outcomes differ strongly by sex within the same age groups, with females consistently experiencing higher survival counts than males. This supports the historical pattern often summarized as “women and children first,” while remaining a descriptive analysis based on observed counts rather than probabilities or causal inference.

In [None]:
# Rates by age group and survival, split by Sex (0=Male, 1=Female)
male_rate = pd.crosstab(
    age_bins[df['Sex'] == 0],
    df.loc[df['Sex'] == 0, '2urvived'],
    normalize='index'
)

female_rate = pd.crosstab(
    age_bins[df['Sex'] == 1],
    df.loc[df['Sex'] == 1, '2urvived'],
    normalize='index'
)

fig, axes = plt.subplots(1, 2, figsize=(12, 4), sharey=True)

male_rate.plot(kind='bar', stacked=False, ax=axes[0], title='Male')
axes[0].set_xlabel("Age group")
axes[0].set_ylabel("Rate")
axes[0].legend(["Died (0)", "Survived (1)"], title="Outcome")
axes[0].tick_params(axis='x', rotation=45)

female_rate.plot(kind='bar', stacked=False, ax=axes[1], title='Female')
axes[1].set_xlabel("Age group")
axes[1].legend(["Died (0)", "Survived (1)"], title="Outcome")
axes[1].tick_params(axis='x', rotation=45)

plt.suptitle("Death and Survival Rates by Age Group and Sex")
plt.tight_layout()
plt.show()

The rate plots show that male passengers have low survival rates across all age groups, with death rates dominating especially among adults. In contrast, female passengers exhibit substantially higher survival rates in most age groups, particularly between ages 12 and 45. This highlights a strong interaction between sex and age, where sex is a dominant factor influencing survival probability.
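
As a compact numeric companion to the rate plots, the same interaction can be shown in a single table of survival rates by age group and sex. The sketch below uses toy data with illustrative values and only two coarse age groups for brevity. The coding 0 = Male, 1 = Female follows the cells above.

```python
import pandas as pd

# Toy data (illustrative values only); Sex coded 0 = Male, 1 = Female
toy = pd.DataFrame({
    "Age": [5.0, 8.0, 25.0, 27.0, 40.0, 42.0, 24.0, 26.0],
    "Sex": [0, 1, 0, 0, 0, 1, 1, 1],
    "2urvived": [1, 1, 0, 0, 0, 1, 1, 0],
})

# Two coarse, left-closed age groups: [0, 18) and [18, 81)
toy["AgeGroup"] = pd.cut(toy["Age"], bins=[0, 18, 81], right=False)

# Rows: age group, columns: sex, cells: survival rate (mean of the 0/1 target)
rate_table = toy.pivot_table(
    index="AgeGroup",
    columns="Sex",
    values="2urvived",
    aggfunc="mean",
    observed=True,
).rename(columns={0: "Male", 1: "Female"})

print(rate_table)
```

Reading across a row compares the sexes within one age group, which is the comparison the interaction argument relies on.
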