# 🚢 Part 2: Titanic Dataset - Loading and First Look

**Goal:** Practice loading real-world data directly from a URL and performing an initial data assessment, the starting point of most data science projects.

---
### Key Learning Objectives
1.  Load data using `pd.read_csv()` from an external URL.
2.  Use the `head()` and `tail()` methods and the `.shape` attribute for quick data inspection.
3.  Utilize `info()` and `describe()` to understand data types and statistics.
4.  Perform a simple scan for **missing values** using `.isnull().sum()`.

In [1]:
import os
import pandas as pd

# Load Titanic dataset from a reliable URL
url = 'https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv'
titanic_df = pd.read_csv(url)

print("--- Data Frame Preview ---")
print(titanic_df.head())

--- Data Frame Preview ---
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450  

## 2. Loading Real-World Data

Unlike the simple dictionary-based DataFrame in Part 1, real-world data is typically loaded from a file using `pd.read_csv()`. Pandas accepts local file paths as well as web URLs, which gives direct access to external datasets such as the well-known Titanic passenger data.
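As a minimal offline sketch of the same call: besides paths and URLs, `pd.read_csv()` also accepts a file-like object, so a small in-memory CSV can stand in for the download. The three rows below are hypothetical sample values in a Titanic-like column layout, not taken from the real file, and the names `csv_text` and `sample_df` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical three-row sample (NOT the real Titanic file); the last Age is left blank.
csv_text = """PassengerId,Survived,Pclass,Age
1,0,3,22
2,1,1,38
3,1,3,
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # the blank Age cell is read as NaN, so Age becomes a float column
```

Swapping the `StringIO` object for a file path or a URL string leaves the rest of the code unchanged.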

In [2]:
print("=== FIRST 5 ROWS ===")
print(titanic_df.head())

print("\n=== LAST 5 ROWS ===")
print(titanic_df.tail())

print("\n=== SHAPE (rows, columns) ===")
print(titanic_df.shape)

print("\n=== COLUMNS ===")
print(list(titanic_df.columns))

=== FIRST 5 ROWS ===
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.050

## 3. Dimensions and Boundaries

Checking **`.head()`** (top rows) and **`.tail()`** (bottom rows) helps confirm that the data loaded correctly and gives a first impression of the values. The **`.shape`** attribute returns a `(rows, columns)` tuple: the number of passengers and the number of attributes.
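The sketch below illustrates the difference between the two methods and the attribute on a small frame. The data and the names `toy_df`, `first_three`, and `last_three` are made up for illustration; both `head()` and `tail()` take an optional row count and default to 5.

```python
import pandas as pd

# Hypothetical toy frame with ten rows (values are illustrative only).
toy_df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

# .shape is an attribute (no parentheses) and unpacks into rows and columns.
n_rows, n_cols = toy_df.shape

# head(n) / tail(n) are methods returning the first / last n rows.
first_three = toy_df.head(3)
last_three = toy_df.tail(3)

print(n_rows, n_cols)
print(first_three)
print(last_three)
```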

In [3]:
print("=== INFO: Data Types and Non-Null Counts ===")
titanic_df.info()

print("\n=== DESCRIBE: Basic Statistics for Numeric Columns ===")
print(titanic_df.describe())

=== INFO: Data Types and Non-Null Counts ===
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

=== DESCRIBE: Basic Statistics for Numeric Columns ===
       PassengerId    Survived      Pclass         Age       SibSp  \
count   891.000000  891.000000  891.000000  714.000000  891.000000   
mea

## 4. Understanding Data Structure

* **`df.info()`**: Shows the data type (`Dtype`) and the number of non-null entries for each column. Comparing those counts against the total number of entries (891 here) is a quick way to spot columns with missing data.
* **`df.describe()`**: Calculates descriptive statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns by default. This helps quickly assess data distribution and detect potential outliers.
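To make the link between the two summaries concrete, here is a small sketch on invented data (`toy_df`, `stats`, and `non_null` are illustrative names). It shows that `describe()` skips the text column by default and that its `count` row matches the non-null counts, which `DataFrame.count()` returns directly.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one missing Age, one non-numeric column.
toy_df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# describe() summarizes numeric columns only by default; NaN values are ignored.
stats = toy_df.describe()

# count() gives non-null entries per column, the same figure info() reports.
non_null = toy_df.count()

print(stats)
print(non_null)
```

The mean of `Age` here is computed over the three non-missing values only, which is worth keeping in mind when a column has many gaps.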

In [4]:
# Simple counts
print("=== Survival Counts (0=Died, 1=Survived) ===")
print(titanic_df['Survived'].value_counts())

print("\n=== Pclass Counts (Passenger Class) ===")
print(titanic_df['Pclass'].value_counts().sort_index())

# Missing-data scan
missing_data = titanic_df.isnull().sum()
print("\n=== Missing values per column (only columns with missing data) ===")
print(missing_data[missing_data > 0])

=== Survival Counts (0=Died, 1=Survived) ===
Survived
0    549
1    342
Name: count, dtype: int64

=== Pclass Counts (Passenger Class) ===
Pclass
1    216
2    184
3    491
Name: count, dtype: int64

=== Missing values per column (only columns with missing data) ===
Age         177
Cabin       687
Embarked      2
dtype: int64


## 5. Identifying Gaps (Missing Data)

`.isnull()` marks each missing cell as `True`, and `.sum()` counts those `True` values per column, which gives a quick summary of data quality. The **Age** (177), **Cabin** (687), and **Embarked** (2) columns have missing values, which will require attention (imputation or dropping) in later steps.
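A possible extension of the same scan, sketched on invented data: since `True` counts as 1, taking `.mean()` instead of `.sum()` yields the share of missing values per column. The frame and the names `missing_counts`, `missing_share`, and `report` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two of three columns.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_counts = toy_df.isnull().sum()   # number of missing cells per column
missing_share = toy_df.isnull().mean()   # fraction of missing cells per column

# Combine both views and keep only the columns that actually have gaps.
report = pd.DataFrame({"missing": missing_counts, "share": missing_share})
report = report[report["missing"] > 0]

print(report)
```

Seeing the share next to the raw count can help when deciding later whether a column is worth imputing or better dropped.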

In [5]:
# Summary
summary = f"""
TITANIC DATASET ANALYSIS SUMMARY
================================

Initial Status: {titanic_df.shape[0]} passengers, {titanic_df.shape[1]} columns loaded.
Data Quality Alert: Age, Cabin, and Embarked columns contain missing values.
Survival Rate Preview: {titanic_df['Survived'].mean():.2%} of passengers in this dataset survived.

Next Steps:
- Apply filtering and grouping techniques to analyze survival rates by categorical features (Sex, Pclass).
- Address the identified missing data points.
"""
print(summary)

# Save snapshot for next steps
os.makedirs('data-visualization/data', exist_ok=True)
titanic_df.to_csv('data-visualization/data/titanic_snapshot.csv', index=False)
print("\nSaved snapshot for next notebook: data-visualization/data/titanic_snapshot.csv")


TITANIC DATASET ANALYSIS SUMMARY
================================

Initial Status: 891 passengers, 12 columns loaded.
Data Quality Alert: Age, Cabin, and Embarked columns contain missing values.
Survival Rate Preview: 38.38% of passengers in this dataset survived.

Next Steps:
- Apply filtering and grouping techniques to analyze survival rates by categorical features (Sex, Pclass).
- Address the identified missing data points.


Saved snapshot for next notebook: data-visualization/data/titanic_snapshot.csv
