# Case Study: Titanic Dataset - A Journey into Statistics and Data Analysis

The Titanic dataset is a popular resource in data science and machine learning. It provides information about passengers on the Titanic, such as survival status, demographics, and ticket fares, classified into various data types like nominal, ordinal, discrete, and continuous. In this case study, we aim to explore statistical concepts—ranging from Descriptive and Inferential Statistics to Measures of Relationship—using univariate and bivariate analysis of the Titanic dataset.

The structure of the case study emphasizes the questions and analysis techniques associated with the dataset while diving deeper into statistical concepts such as Mean, Standard Deviation, Skewness, Correlation Coefficients, and more.


## Data and its measures

The dataset mixes nominal, ordinal, discrete, and continuous variables, which makes it a convenient example for learning how each type is handled in exploratory analysis and classification. The table below classifies each variable and explains the choice:

| **Variable**     | **Data Type**  | **Explanation**                                                                 |
|-------------------|----------------|---------------------------------------------------------------------------------|
| **survived**      | Discrete       | Indicates whether the passenger survived (0 = No, 1 = Yes). Countable integers. |
| **pclass**        | Ordinal        | Passenger class (1 = First, 2 = Second, 3 = Third). Ordered categories.         |
| **sex**           | Nominal        | Gender of the passenger (Male or Female). No intrinsic order.                   |
| **age**           | Continuous     | Age of the passenger in years. Measurable and can include decimal values.       |
| **sibsp**         | Discrete       | Number of siblings/spouses aboard. Countable integer values.                    |
| **parch**         | Discrete       | Number of parents/children aboard. Countable integer values.                    |
| **fare**          | Continuous     | Ticket fare paid by the passenger. Measurable and includes decimal values.      |
| **embarked**      | Nominal        | Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).           |
| **class**         | Ordinal        | Passenger class described as First, Second, or Third. Ordered categories.       |
| **who**           | Nominal        | Classification of the individual (man, woman, child). No intrinsic order.       |
| **adult_male**    | Nominal        | Boolean flag for adult male passengers (True or False). No intrinsic order.     |
| **deck**          | Nominal        | The deck level where the passenger's cabin is located (e.g., A, B, C, etc.).    |
| **embark_town**   | Nominal        | The town where the passenger embarked (Cherbourg, Queenstown, Southampton).     |
| **alive**         | Nominal        | Survival status (yes = survived, no = did not survive). No intrinsic order.     |
| **alone**         | Nominal        | Whether the passenger traveled alone (True or False). Categories without order. |

This diversity in data types provides an excellent opportunity for exploring key data science concepts such as:
- **Handling different variable types** during data preprocessing.
- **Feature engineering** by converting categorical variables into numerical ones.
- **Building predictive models** to classify survival based on variables like `pclass`, `sex`, `age`, etc.
- **Creating visualizations** to understand relationships between variables.
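
As a minimal sketch of the preprocessing ideas listed above, the snippet below encodes one ordinal and two nominal variables. The `sample` DataFrame is a hypothetical four-row stand-in that only borrows the Titanic column names; it is not the real data, and other encodings are equally valid.

```python
import pandas as pd

# Hypothetical mini-sample that reuses the Titanic column names.
sample = pd.DataFrame({
    "pclass": [3, 1, 3, 2],
    "sex": ["male", "female", "female", "male"],
    "embarked": ["S", "C", "S", "Q"],
})

# Ordinal variable: an ordered categorical keeps the 1 < 2 < 3 ordering explicit.
sample["pclass_cat"] = pd.Categorical(
    sample["pclass"], categories=[1, 2, 3], ordered=True
)

# Nominal variables: one-hot encode them, because the categories have no order.
encoded = pd.get_dummies(sample, columns=["sex", "embarked"], dtype=int)
print(encoded.columns.tolist())
```

One-hot encoding avoids implying an order among ports or sexes, while the ordered categorical preserves the ranking of passenger classes.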


Import the libraries and load the Titanic dataset:

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
titanic = sns.load_dataset('titanic')
titanic.head()

,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## **Univariate Analysis**
Univariate analysis focuses on the distribution and characteristics of a single variable.

**1. What is the average age of passengers?**

**Answer:** The mean age of passengers is approximately **29.70 years**. Passengers with a missing age are excluded, since pandas skips missing values by default.

In [None]:
mean_age = titanic['age'].mean()
print(f"Mean Age: {mean_age:.2f}")

Mean Age: 29.70
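
The sketch below illustrates how pandas treats missing values when computing a mean. The `ages` series is a hypothetical example with one missing entry, not a slice of the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one missing value.
ages = pd.Series([22.0, 38.0, np.nan, 26.0])

mean_skipping_nan = ages.mean()           # NaN is ignored by default (skipna=True)
mean_manual = ages.sum() / ages.count()   # count() counts only non-missing values
mean_with_nan = ages.mean(skipna=False)   # NaN propagates into the result

print(mean_skipping_nan, mean_manual, mean_with_nan)
```

The first two results agree because both divide by the number of recorded ages rather than the number of rows.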


**2. What is the most common passenger class?**

**Answer:** The most common passenger class is **3rd Class**.

In [None]:
mode_pclass = titanic['pclass'].mode()[0]
print(f"Most Common Passenger Class: {mode_pclass}")

Most Common Passenger Class: 3


**3. What is the range of ticket fares?**

**Answer:** The range of ticket fares is approximately **512.33 units**.

In [None]:
fare_range = titanic['fare'].max() - titanic['fare'].min()
print(f"Range of Ticket Fares: {fare_range:.2f}")

Range of Ticket Fares: 512.33


**4. What is the median fare paid by passengers?**

**Answer:** The median fare paid by passengers is approximately **14.45 units**.

In [None]:
median_fare = titanic['fare'].median()
print(f"Median Fare: {median_fare:.2f}")

Median Fare: 14.45


**5. What is the mode of embarkation ports?**

**Answer:** The most common port of embarkation is **Southampton (S)**.

In [None]:
mode_embarked = titanic['embarked'].mode()[0]
print(f"Most Common Embarkation Port: {mode_embarked}")

Most Common Embarkation Port: S


**6. What is the mean absolute deviation (MAD) for passenger ages?**

**Answer:** The MAD for passenger ages is approximately **11.32 years**.

In [None]:
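def mad(data):
    # Mean absolute deviation: average distance of each value from the mean.
    # Uses the `data` argument, so the function works for any numeric Series.
    return np.abs(data - data.mean()).mean()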

In [None]:
mad(titanic.age)

np.float64(11.322944471906409)
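
For reference, the quantity computed here is the mean absolute deviation around the mean. The formula below uses generic symbols ($x_i$ for the observed ages, $\bar{x}$ for their mean, $n$ for the number of non-missing values) that are introduced for illustration, followed by a small worked example on three made-up ages.

```latex
\mathrm{MAD} = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - \bar{x} \right|

% Worked example with hypothetical ages 20, 30, 40 (mean 30):
\mathrm{MAD} = \frac{|20 - 30| + |30 - 30| + |40 - 30|}{3} = \frac{20}{3} \approx 6.67
```

Note that "MAD" is also used for the median absolute deviation, which is a different statistic.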

**7. What is the standard deviation for ticket fares?**

**Answer:** The standard deviation for ticket fares is approximately **49.69 units**.

In [None]:
std_fare = titanic['fare'].std()
print(f"Standard Deviation of Ticket Fares: {std_fare:.2f}")

Standard Deviation of Ticket Fares: 49.69


**8. What is the interquartile range (IQR) for ages?**

**Answer:** The IQR for passenger ages is approximately **17.88 years**.

In [None]:
iqr_age = titanic['age'].quantile(0.75) - titanic['age'].quantile(0.25)
print(f"Interquartile Range for Age: {iqr_age:.2f}")

Interquartile Range for Age: 17.88


**9. Is the distribution of ticket fares skewed?**

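**Answer:** Yes. The skewness of ticket fares is approximately **4.79**, indicating a strongly right-skewed (positively skewed) distribution: most fares are low, with a long tail of a few very expensive tickets.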

In [None]:
skew_fare = titanic['fare'].skew()
print(f"Skewness of Ticket Fare Distribution: {skew_fare:.2f}")

Skewness of Ticket Fare Distribution: 4.79
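
A right-skewed variable typically has a mean well above its median, and a log transform is one common way to reduce the skew. The sketch below shows both effects on a hypothetical set of ten fares; the values are made up for illustration and are not drawn from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical fares: mostly cheap tickets plus a few very expensive ones.
fares = pd.Series([7.25, 7.9, 8.05, 13.0, 14.5, 26.0, 30.0, 71.3, 263.0, 512.3])

skew_raw = fares.skew()
skew_log = np.log1p(fares).skew()  # log(1 + x) compresses the long right tail

print(f"mean={fares.mean():.2f}, median={fares.median():.2f}")
print(f"skew raw={skew_raw:.2f}, skew after log1p={skew_log:.2f}")
```

`np.log1p` is used instead of `np.log` so that a fare of zero would not produce negative infinity.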


**10. What is the kurtosis of passenger ages?**

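**Answer:** The kurtosis of passenger ages is approximately **0.18**. pandas reports excess kurtosis, for which a normal distribution scores 0, so the tails of the age distribution are only slightly heavier than those of a normal distribution. Kurtosis alone does not establish that the distribution is normal.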

In [None]:
kurt_age = titanic['age'].kurt()
print(f"Kurtosis of Age Distribution: {kurt_age:.2f}")

Kurtosis of Age Distribution: 0.18
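
To make the "excess kurtosis" convention concrete, the sketch below applies `Series.kurt()` to simulated data. The samples and the seed are assumptions chosen for illustration: a normal sample should score close to 0 and a uniform sample close to -1.2, its theoretical excess kurtosis.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Simulated samples, not Titanic data.
normal_sample = pd.Series(rng.normal(size=100_000))
uniform_sample = pd.Series(rng.uniform(size=100_000))

kurt_normal = normal_sample.kurt()    # expected to be near 0
kurt_uniform = uniform_sample.kurt()  # expected to be near -1.2 (lighter tails)

print(f"normal: {kurt_normal:.2f}, uniform: {kurt_uniform:.2f}")
```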


## **Bivariate Analysis**
Bivariate analysis examines the relationship between two variables.

**1. What is the covariance between age and fare?**

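**Answer:** The covariance between age and fare is approximately **73.85**. The positive sign means that older passengers tended to pay somewhat higher fares. The magnitude depends on the units of both variables, so it says little about strength; the corresponding correlation is only about 0.10, which indicates a weak relationship.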

In [None]:
covariance = titanic[['age', 'fare']].cov()
print("Covariance Matrix:")
print(covariance)

Covariance Matrix:
             age         fare
age   211.019125    73.849030
fare   73.849030  2469.436846


**2. What is the Pearson correlation between fare and survival?**

**Answer:** The Pearson correlation between fare and survival is approximately **0.26**, showing a slight positive relationship.

In [None]:
correlation = titanic['fare'].corr(titanic['survived'])
print(f"Pearson Correlation between Fare and Survival: {correlation:.2f}")

Pearson Correlation between Fare and Survival: 0.26
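
The Pearson coefficient is the covariance divided by the product of the two standard deviations. The sketch below checks that identity against `Series.corr` on a hypothetical six-row frame; the numbers are illustrative, not taken from the full dataset.

```python
import pandas as pd

# Hypothetical fares and survival flags for six passengers.
df = pd.DataFrame({
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05, 8.46],
    "survived": [0, 1, 1, 1, 0, 0],
})

cov_xy = df["fare"].cov(df["survived"])                         # sample covariance
r_manual = cov_xy / (df["fare"].std() * df["survived"].std())   # cov / (s_x * s_y)
r_pandas = df["fare"].corr(df["survived"])                      # built-in Pearson r

print(f"manual={r_manual:.4f}, pandas={r_pandas:.4f}")
```

Because `survived` is binary, this Pearson coefficient is also known as the point-biserial correlation.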


## **Multivariate Analysis**
Multivariate analysis examines relationships among three or more variables. It usually relies on visualization and modeling, but covariance and correlation matrices over several variables give a useful first summary. Here is one example involving three predictors (age, class, and fare) and the survival outcome:

**How does the combined impact of age, class, and fare affect survival probability?**

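**Answer:** A first step is a covariance or correlation matrix over all four variables. These matrices describe each pair of variables separately; they do not measure the combined effect of the three predictors on survival.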

In [None]:
cov_matrix = titanic[['age', 'pclass', 'fare', 'survived']].cov()
print("Covariance Matrix:")
print(cov_matrix)

Covariance Matrix:
                 age     pclass         fare  survived
age       211.019125  -4.496004    73.849030 -0.551296
pclass     -4.496004   0.699015   -22.830196 -0.137703
fare       73.849030 -22.830196  2469.436846  6.221787
survived   -0.551296  -0.137703     6.221787  0.236772


**Age vs. Fare (`73.85`)**  
   - The **positive covariance** suggests that as age increases, fare tends to increase as well.
   - This could mean older passengers generally paid higher fares, possibly indicating a preference for higher-class accommodations.

**Pclass vs. Fare (`-22.83`)**  
   - A **negative covariance** implies that as passenger class (`pclass`) increases (lower socio-economic classes), fare decreases.
   - This is expected, as third-class passengers typically paid lower fares compared to first-class passengers.

**Age vs. Survived (`-0.55`)**  
   - The **negative covariance** suggests that older passengers had a somewhat lower chance of survival.
   - This is consistent with historical accounts that women and children were prioritized for rescue.

**Fare vs. Survived (`6.22`)**  
   - A **positive covariance** indicates that higher fares are associated with higher survival rates.
   - This suggests that passengers in first-class (who paid more) had better chances of survival, possibly due to their proximity to lifeboats or preferential treatment.

**Age vs. Pclass (`-4.50`)**  
   - A **negative covariance** suggests that older passengers were more likely to be in higher classes.
   - This could mean that wealthier individuals, who could afford first-class tickets, were generally older.

**Diagonal Values (Variance)**  
   - The diagonal values represent **variance** in each variable.
   - `Fare` has the largest variance (`2469.44`, a standard deviation of about 49.69), reflecting a wide range of ticket prices.
   - `Age` also has substantial variance (`211.02`, a standard deviation of about 14.5 years), suggesting a diverse age distribution among passengers.
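
Covariance values depend on the units of the variables, which is why their magnitudes are hard to compare. The sketch below rescales a hypothetical fare column by 100 (for example, converting to a smaller currency unit) and shows that the covariance scales with it while the correlation does not. The data are made up for illustration.

```python
import pandas as pd

# Hypothetical ages and fares for five passengers.
df = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, 35.0, 54.0],
    "fare": [7.25, 71.28, 7.93, 53.10, 51.86],
})
df["fare_x100"] = df["fare"] * 100  # same fares expressed in a unit 100 times smaller

cov_orig = df["age"].cov(df["fare"])
cov_scaled = df["age"].cov(df["fare_x100"])    # 100 times larger than cov_orig
corr_orig = df["age"].corr(df["fare"])
corr_scaled = df["age"].corr(df["fare_x100"])  # unchanged by the rescaling

print(f"cov: {cov_orig:.2f} -> {cov_scaled:.2f}")
print(f"corr: {corr_orig:.4f} -> {corr_scaled:.4f}")
```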


**Overall Summary**
- Wealthier, first-class passengers had better survival rates.
- Older passengers tended to pay slightly more and had a slightly lower chance of survival; both relationships are weak (correlations of about 0.10 and -0.08).
- Lower-class passengers generally paid lower fares and had poorer survival odds.

In [None]:
corr_matrix = titanic[['age', 'pclass', 'fare', 'survived']].corr()
print("Correlation Matrix:")
print(corr_matrix)

Correlation Matrix:
               age    pclass      fare  survived
age       1.000000 -0.369226  0.096067 -0.077221
pclass   -0.369226  1.000000 -0.549500 -0.338481
fare      0.096067 -0.549500  1.000000  0.257307
survived -0.077221 -0.338481  0.257307  1.000000


**Age vs. Pclass (`-0.369`)**
   - **Negative correlation** suggests that older passengers were more likely to be in higher classes (lower `pclass` values).
   - This aligns with the idea that wealthier, first-class passengers tended to be older.

**Age vs. Fare (`0.096`)**
   - A **very weak positive correlation** implies a slight tendency for older passengers to pay higher fares.
   - However, since the correlation is small, age wasn't a strong determining factor for ticket price.

**Pclass vs. Fare (`-0.549`)**
   - **Moderate negative correlation** confirms that lower-class passengers (higher `pclass` values) paid significantly lower fares.
   - First-class passengers paid much higher fares, a clear distinction in socio-economic status.

**Pclass vs. Survived (`-0.338`)**
   - **Negative correlation** suggests that survival rates decreased as class number increased.
   - First-class passengers (lower `pclass` values) had a better chance of survival—likely due to better access to lifeboats.

**Fare vs. Survived (`0.257`)**
   - **Positive correlation** indicates that passengers who paid higher fares had a higher chance of survival.
   - This further supports the idea that wealthier individuals had better rescue opportunities.

**Age vs. Survived (`-0.077`)**
   - **Weak negative correlation** suggests that older passengers had slightly lower survival rates.
   - While the correlation isn’t strong, it aligns with historical accounts—children and younger passengers had higher chances of survival.

**Overall Summary**
- Wealthier, first-class passengers had **higher survival rates**.
- Older passengers tended to be in **higher social classes** but had **slightly lower chances of survival**.
- Lower-class passengers (higher `pclass` values) generally **paid less** and had **lower survival odds**.
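
Pairwise matrices do not show how variables act together. One simple way to look at a combined effect is to compute survival rates within groups defined by two variables at once. The sketch below does this for class and an age band on a hypothetical eight-row sample; the rows, the `age_band` column, and the cutoff of 18 years are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical passengers; not rows from the real dataset.
sample = pd.DataFrame({
    "pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "age":      [38.0, 54.0, 14.0, 35.0, 2.0, 22.0, 26.0, 40.0],
    "survived": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Bin age into two bands: (0, 18] is "child", (18, 100] is "adult".
sample["age_band"] = pd.cut(
    sample["age"], bins=[0, 18, 100], labels=["child", "adult"]
)

# Mean of a 0/1 column within each group is that group's survival rate.
rates = sample.groupby(["pclass", "age_band"], observed=True)["survived"].mean()
print(rates)
```

On the real data, the same pattern could be applied to `titanic` directly, optionally adding `sex` or a fare band as a further grouping key.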