# 2. Data Understanding

## 2.1 Collecting Initial Data
### Initial Data Collection Report

This project uses a single existing data source provided with the course assignment: titanic1.csv. The dataset contains 1,309 passenger records. Each row represents one passenger and includes demographic and travel-related attributes such as Age, Sex, Fare, SibSp, Parch, Pclass, and Embarked, together with the binary target variable 2urvived (survived vs. not survived).

Initial inspection shows two missing values in Embarked (Passengerid 62 and 830). These will be handled during the data preparation phase using a documented strategy; the initial plan is to impute the most frequent category. The dataset also contains several columns named zero that are constant (all zeros) and therefore provide no predictive value. These will be removed during preprocessing.
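
The two preparation steps described above can be sketched as follows. This is a minimal illustration on a made-up toy table, not the project's actual preprocessing code: the values are invented, and the column name `zero.1` is an assumption about how repeated `zero` columns appear after loading.

```python
import pandas as pd

# Toy stand-in for titanic1.csv (values are made up for illustration only)
toy = pd.DataFrame({
    "Passengerid": [1, 2, 3, 4, 5],
    "Embarked": [3.0, None, 3.0, 1.0, None],
    "zero": [0, 0, 0, 0, 0],
    "zero.1": [0, 0, 0, 0, 0],  # assumed name of a repeated "zero" column
    "Fare": [7.25, 80.0, 8.05, 53.1, 80.0],
})

# Impute missing Embarked values with the most frequent category (the mode)
most_frequent_port = toy["Embarked"].mode().iloc[0]
toy["Embarked"] = toy["Embarked"].fillna(most_frequent_port)

# Drop every column that holds a single constant value
constant_cols = [c for c in toy.columns if toy[c].nunique(dropna=False) == 1]
cleaned = toy.drop(columns=constant_cols)

print("Imputed with:", most_frequent_port)
print("Dropped:", constant_cols)
```

Detecting constant columns by their number of unique values, rather than by name, also catches any other column that carries no information.
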
## 2.2 Describe Data
### Data Description Report

### Data Quantity
The dataset is loaded directly into the **Jupyter Notebook** from a local file using Python. It contains **1,309 rows** (one per passenger record) and **9 important columns** representing passenger attributes, listed below together with the constant zero\* columns:

- **Passengerid**: *Integer (ID)*; unique passenger identifier.
- **Age**: *Numeric (float)*; passenger age in years.
- **Fare**: *Numeric (float)*; ticket price.
- **Sex**: *Categorical (binary)*; passenger sex (encoding described in preprocessing).
- **SibSp**: *Numeric (integer)*; number of siblings/spouses aboard.
- **Parch**: *Numeric (integer)*; number of parents/children aboard.
- **Pclass**: *Categorical (integer)*; passenger class (1, 2, 3).
- **Embarked**: *Categorical (integer)*; port of embarkation (1, 2, 3).
- **2urvived**: *Target (binary)*; survival outcome (1 = survived, 0 = not survived).
- **zero\***: *Constant (all zeros)*; no information content.
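
A column inventory like the one above can be cross-checked programmatically. The sketch below builds a small summary table (data type, number of distinct values, number of missing values) on a made-up toy frame; the values are invented and `summary` is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Toy stand-in for the passenger table (values are made up for illustration)
toy = pd.DataFrame({
    "Passengerid": [1, 2, 3, 4],
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Sex": [0, 1, 1, 0],
    "Embarked": [3.0, 1.0, None, 3.0],
    "zero": [0, 0, 0, 0],
    "2urvived": [0, 1, 1, 0],
})

# One row per column: dtype, distinct non-missing values, missing values
summary = pd.DataFrame({
    "dtype": toy.dtypes.astype(str),
    "n_unique": toy.nunique(),
    "n_missing": toy.isna().sum(),
})

# Flag columns that carry no information because they never vary
summary["is_constant"] = summary["n_unique"] <= 1
print(summary)
```

Running the same three calls on the real `df` would confirm the types listed above and show at a glance which columns are constant or incomplete.
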

### Variable Influence
* At this stage, attributes can be prioritized based on historical context and domain knowledge. **Sex and Age** are expected to be highly influential due to the evacuation policy often summarized as “women and children first.”
* **Pclass and Fare** are also expected to be relevant, as passenger class reflects socio-economic status, which likely affected access to lifeboats and crew assistance.
* Variables related to family structure (**SibSp, Parch**) may also be influential, since travelling with close family, especially children, could affect evacuation priority and decision-making during the disaster.
* Identifier and constant variables such as **Passengerid** and the **zero\*** columns are not considered relevant for prediction.

In summary, **Sex, Pclass, Age, and Fare** are expected to be the most influential attributes, as historical evidence suggests strong survival differences across gender, socio-economic class, and age groups.
Variables related to family structure (**SibSp, Parch**) are considered moderately relevant, while identifiers such as Passengerid are not relevant for prediction.

### Planned Descriptive Statistics

- **Age**  
  - Calculate mean, median, standard deviation, minimum, and maximum age  
  - Compare average age between survivors and non-survivors  
  - Check the distribution of age and presence of missing values  

- **Fare**  
  - Calculate mean, median, standard deviation, minimum, and maximum fare  
  - Compare average fare between survivors and non-survivors  
  - Inspect fare distribution to identify skewness and outliers  

- **Sex**  
  - Calculate frequency counts for each category  
  - Compute survival rates by sex  

- **Pclass**  
  - Calculate frequency counts for each passenger class  
  - Compute survival rates per class  
  - Compare survival patterns across classes  

- **SibSp**  
  - Calculate distribution of values (counts)  
  - Compare average SibSp values between survivors and non-survivors  

- **Parch**  
  - Calculate distribution of values (counts)  
  - Compare average Parch values between survivors and non-survivors  

- **Embarked**  
  - Calculate frequency counts per embarkation port  
  - Compute survival rates per port  
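
Most of the planned statistics follow two patterns: a survival rate per category, and a comparison of a numeric column between the two outcomes. The sketch below shows both on a made-up toy frame. The helper names `survival_rate_by` and `describe_by_outcome` are hypothetical and not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the passenger table (values are made up for illustration)
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Fare":     [80.0, 60.0, 20.0, 8.0, 7.0, 9.0],
    "2urvived": [1, 1, 1, 0, 0, 1],
})


def survival_rate_by(df, column, target="2urvived"):
    """Share of survivors (target == 1) within each category of `column`."""
    # The mean of a 0/1 column is the proportion of ones
    return df.groupby(column)[target].mean()


def describe_by_outcome(df, column, target="2urvived"):
    """Count, mean and median of a numeric column, split by survival outcome."""
    return df.groupby(target)[column].agg(["count", "mean", "median"])


class_rates = survival_rate_by(toy, "Pclass")
fare_by_outcome = describe_by_outcome(toy, "Fare")
print(class_rates)
print(fare_by_outcome)
```

The same two helpers could be applied to Sex, Embarked, SibSp, Parch, and Age on the real data, which keeps the per-variable cells short and consistent.
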



This prioritization is tentative and will be validated during exploratory data analysis and modeling. As this is an academic case study, no external business analysts are involved; analytical decisions are made based on data analysis and documented reasoning.
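
One rough way to check such a prioritization during exploration is to rank the columns by the absolute value of their Pearson correlation with the target. The sketch below uses a made-up toy frame and is only a first screen: Pearson correlation captures linear association, and applying it to integer-coded categories such as Pclass is a simplification.

```python
import pandas as pd

# Toy stand-in for the passenger table (values are made up for illustration)
toy = pd.DataFrame({
    "Passengerid": [1, 2, 3, 4, 5, 6],
    "Sex":         [0, 1, 1, 0, 1, 0],
    "Pclass":      [3, 1, 2, 3, 1, 3],
    "2urvived":    [0, 1, 1, 0, 1, 0],
})

target = "2urvived"

# Absolute Pearson correlation of every numeric column with the target,
# sorted from strongest to weakest linear association
ranking = (
    toy.corr()[target]
    .drop(target)
    .abs()
    .sort_values(ascending=False)
)
print(ranking)
```

In this toy example the identifier column ends up last, which is the pattern the prioritization above expects; the real data may of course rank the attributes differently.
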


## 2.3 Explore Data
### Data Exploration Report

## 2.4 Verify Data
### Data Quality Report


In [None]:
from pathlib import Path
import pandas as pd
import matplotlib.pyplot as plt


# Find project root (folder that contains "data")
def get_project_root():
    p = Path.cwd()
    while not (p / "data").exists() and p != p.parent:
        p = p.parent
    return p

PROJECT_ROOT = get_project_root()
RAW_DATA_DIR = PROJECT_ROOT / "data" / "raw"

def load_data(filename="titanic1.csv"):
    input_path = RAW_DATA_DIR / filename
    print("Reading from:", input_path)  # optional but useful
    return pd.read_csv(input_path)

# Load dataset
df = load_data()

# Verify
print("Shape:", df.shape)
display(df)




**Age**  
  - Calculate mean, median, standard deviation, minimum, and maximum age  
  - Compare average age between survivors and non-survivors  
  - Check the distribution of age and presence of missing values  


In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum age
df['Age'].agg(['mean', 'median', 'std', 'min', 'max'])

In [None]:
# Compare average age between survivors and non-survivors
df.groupby('2urvived')['Age'].agg(['count', 'mean'])

In [None]:
# Number of missing values in Age
df['Age'].isna().sum()

In [None]:
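# Age groups in 10-year bins; right=False makes each bin [lower, upper),
# so the edges run to 90 to keep an age of exactly 80 from being dropped
age_bins = pd.cut(df['Age'], bins=range(0, 91, 10), right=False)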

# Counts per age group and survival outcome
counts = pd.crosstab(age_bins, df['2urvived'])

# Side-by-side bars (NOT stacked)
counts.plot(kind='bar', stacked=False)
plt.xlabel("Age group")
plt.ylabel("Count")
plt.title("Survival vs Death Counts by Age Group")
plt.xticks(rotation=45)
plt.legend(["Not Survived - 0", "Survived - 1"], title="Outcome")
plt.show()

**Fare**  
  - Calculate mean, median, standard deviation, minimum, and maximum fare  
  - Compare average fare between survivors and non-survivors  
  - Examine the correlation between fare and survival  
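
The correlation between fare and survival can be quantified directly. The sketch below uses a made-up toy frame. Pearson correlation with a 0/1 variable is the point-biserial correlation, and the rank-based variant (Spearman, computed here as Pearson on ranks) is added on the assumption that Fare is skewed by a few very high values.

```python
import pandas as pd

# Toy stand-in for the passenger table (values are made up for illustration)
toy = pd.DataFrame({
    "Fare":     [7.0, 8.0, 9.0, 20.0, 60.0, 80.0],
    "2urvived": [0, 0, 1, 0, 1, 1],
})

# Pearson correlation with a binary variable (point-biserial correlation)
pearson_r = toy["Fare"].corr(toy["2urvived"])

# Spearman correlation: Pearson correlation computed on the ranks,
# which is less sensitive to a few extreme fares
spearman_r = toy["Fare"].rank().corr(toy["2urvived"].rank())

print("Pearson: ", round(pearson_r, 3))
print("Spearman:", round(spearman_r, 3))
```

A positive value means higher fares go together with survival more often; it does not by itself show that fare is the cause, since fare is closely tied to passenger class.
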

In [None]:
# Calculate mean, median, standard deviation, minimum, and maximum fare 
df['Fare'].agg(['mean', 'median', 'std', 'min', 'max'])

In [None]:
# Compare average fare between survivors and non-survivors 
df.groupby('2urvived')['Fare'].agg(['count', 'mean'])

In [None]:
# Histogram of fares to inspect skewness and outliers
df['Fare'].dropna().hist(bins=100)

In [None]:
# How Fare differs between survivors and non-survivors
bins = 50
# Shared range so both histograms use identical bin edges
fare_range = (0, df['Fare'].max())

df[df['2urvived'] == 0]['Fare'].dropna().hist(bins=bins, range=fare_range, alpha=0.6, label='Died (0)')
df[df['2urvived'] == 1]['Fare'].dropna().hist(bins=bins, range=fare_range, alpha=0.6, label='Survived (1)')

plt.xlabel("Fare")
plt.ylabel("Count")
plt.title("Fare Distribution by Survival Outcome")
plt.legend()
plt.show()

In [None]:
# Create fare bins; include_lowest=True keeps fares of exactly 0 in the first bin
fare_bins = pd.cut(df['Fare'], bins=[0, 1, 5, 10, 25, 50, 100, 600], include_lowest=True)

# Count survived vs died in each fare group
fare_survival = pd.crosstab(fare_bins, df['2urvived'])

fare_survival.plot(kind='bar')
plt.xlabel("Fare range")
plt.ylabel("Number of passengers")
plt.title("Survival vs Death by Fare Range")
plt.xticks(rotation=45)
plt.legend(["Died (0)", "Survived (1)"])
plt.show()

In [None]:
# Row-normalized crosstab: share of each outcome within every fare range
fare_survival_rate = pd.crosstab(
    fare_bins,
    df['2urvived'],
    normalize='index'
)

fare_survival_rate.plot(kind='bar')
plt.xlabel("Fare range")
plt.ylabel("Share of passengers")
plt.title("Survival Rate by Fare Range")
plt.xticks(rotation=45)
plt.legend(["Died (0)", "Survived (1)"])
plt.show()

**Sex**  
  - Calculate frequency counts for each category  
  - Compute survival rates by sex 

In [None]:
# Calculate frequency counts for each category
df['Sex'].value_counts()

In [None]:
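# Compute survival rates by sex (row-normalized so each row sums to 1)
# Row labels follow the encoding assumed here: 1 = Female, 0 = Male
pd.crosstab(df['Sex'], df['2urvived'], normalize='index').rename(index={1: 'Female', 0: 'Male'})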