# 🚢 Titanic Survival Prediction: A Classification Baseline

## 1. Data Acquisition and Initial Inspection
*Goal: Load data and check its quality (missing values, types).*

### 1.1 Load DataFrames

In [2]:
import pandas as pd
import os

In [3]:
# Define the path to the raw data folder (relative to the notebook's working directory)
data_dir = os.path.join(os.getcwd(), '..', 'data', 'raw')
train_file = os.path.join(data_dir, 'train.csv')
test_file = os.path.join(data_dir, 'test.csv')

# Load the DataFrames
train_df = pd.read_csv(train_file)
test_df = pd.read_csv(test_file)

print("Train Data Shape:", train_df.shape)
print("Test Data Shape:", test_df.shape)

# Initial look at the data
print("\n--- Train DataFrame Info ---")
train_df.info()  # info() prints directly and returns None, so it is not wrapped in print()

print("\n--- Missing Values Summary (Train) ---")
print(train_df.isnull().sum())

Train Data Shape: (891, 12)
Test Data Shape: (418, 11)

--- Train DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

--- Missing Values Summary (Train) ---
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
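
Raw missing-value counts are easier to compare as a share of all rows. The following is a minimal sketch of that idea, not part of the original notebook: the helper `missing_summary` and the stand-in frame `toy_df` are hypothetical names introduced only so the example runs on its own.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(1)
    summary = pd.DataFrame({"missing": counts, "percent": percent})
    # Keep only the columns that actually contain gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical stand-in for train_df (assumption: same column names, tiny sample)
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

print(missing_summary(toy_df))
```

Applied to the real data as `missing_summary(train_df)`, the non-null counts in the `info()` output imply that `Cabin` is missing in about 77% of rows (687 of 891), `Age` in about 20% (177 of 891), and `Embarked` in only 2 rows.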


## 2. Data Cleaning and Preparation
*Goal: Address missing values and prepare features for modeling.*

### 2.1 Concatenating Data

In [4]:
# 1. Track the split point
n_train = train_df.shape[0]

# 2. Separate the target variable from the training features
y_train = train_df['Survived']
X_train = train_df.drop('Survived', axis=1)

# 3. Combine the DataFrames
combined_df = pd.concat([X_train, test_df], ignore_index=True)

print(f"Original Training Size (n_train): {n_train}")
print(f"Combined DataFrame Shape: {combined_df.shape}")
print("\n--- Combined DataFrame Info ---")
combined_df.info()

Original Training Size (n_train): 891
Combined DataFrame Shape: (1309, 11)

--- Combined DataFrame Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Pclass       1309 non-null   int64  
 2   Name         1309 non-null   object 
 3   Sex          1309 non-null   object 
 4   Age          1046 non-null   float64
 5   SibSp        1309 non-null   int64  
 6   Parch        1309 non-null   int64  
 7   Ticket       1309 non-null   object 
 8   Fare         1308 non-null   float64
 9   Cabin        295 non-null    object 
 10  Embarked     1307 non-null   object 
dtypes: float64(2), int64(4), object(5)
memory usage: 112.6+ KB
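
Because `n_train` records where the training rows end, the combined frame can later be split back into its two partitions by position. The sketch below illustrates this with hypothetical miniature frames; the names `X_train_clean` and `X_test_clean` are assumptions for illustration and do not appear in the original notebook.

```python
import pandas as pd

# Hypothetical miniature versions of the train/test frames
train_df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})
test_df = pd.DataFrame({"PassengerId": [4, 5], "Pclass": [2, 3]})

# Same steps as the cell above
n_train = len(train_df)
y_train = train_df["Survived"]
X_train = train_df.drop(columns="Survived")
combined_df = pd.concat([X_train, test_df], ignore_index=True)

# ... shared cleaning / feature engineering on combined_df would go here ...

# Recover the two partitions by row position
X_train_clean = combined_df.iloc[:n_train].copy()
X_test_clean = combined_df.iloc[n_train:].reset_index(drop=True)

# The training features must still line up with the target
assert len(X_train_clean) == len(y_train)
print(X_train_clean.shape, X_test_clean.shape)
```

This positional split works because `ignore_index=True` keeps the training rows first and in their original order. One caveat to keep in mind: any statistic computed over the whole of `combined_df` (a median used for imputation, for example) also sees the test rows, so it may be preferable to compute such values from the first `n_train` rows only.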
