### Data Cleaning

In [1]:
import pandas as pd

# Load data
train = pd.read_csv('../data/train.csv')
test = pd.read_csv('../data/test.csv')



**Initial Data Overview**

In [2]:
# Concatenate train and test for unified cleaning
train['is_train'] = True
test['is_train'] = False
combined = pd.concat([train, test], sort=False).reset_index(drop=True)

# Quick structure and missing value overview
print("Shape:", combined.shape)
print("\nMissing Values:")
print(combined.isnull().sum().sort_values(ascending=False))

print("\nColumn Types:")
print(combined.dtypes)


Shape: (12970, 15)

Missing Values:
Transported     4277
CryoSleep        310
ShoppingMall     306
Cabin            299
VIP              296
Name             294
FoodCourt        289
HomePlanet       288
Spa              284
Destination      274
Age              270
VRDeck           268
RoomService      263
PassengerId        0
is_train           0
dtype: int64

Column Types:
PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported      object
is_train           bool
dtype: object


In [3]:
# Add GroupID from PassengerId
combined['GroupID'] = combined['PassengerId'].str.split('_').str[0]

# Select the rows with a missing HomePlanet (taken after GroupID is added so the copy includes it)
missing_hp = combined[combined['HomePlanet'].isna()].copy()

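# Infer each group's HomePlanet as its most frequent known value (mode).
# Groups with no known value get None; ties resolve to the first mode in sorted order.
group_planet = (
    combined.groupby('GroupID')['HomePlanet']
    .agg(lambda x: x.mode().iloc[0] if x.notna().any() else None)
)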

# Map that to missing_hp rows
missing_hp['ImputedGroupPlanet'] = missing_hp['GroupID'].map(group_planet)

# Show a sample
missing_hp[['PassengerId', 'GroupID', 'Cabin', 'HomePlanet', 'ImputedGroupPlanet']].head(10)


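| | PassengerId | GroupID | Cabin | HomePlanet | ImputedGroupPlanet |
|---|---|---|---|---|---|
| 59 | 0064_02 | 0064 | E/3/S | NaN | Mars |
| 113 | 0119_01 | 0119 | A/0/P | NaN | Europa |
| 186 | 0210_01 | 0210 | D/6/P | NaN | None |
| 225 | 0242_01 | 0242 | F/46/S | NaN | None |
| 234 | 0251_01 | 0251 | C/11/S | NaN | None |
| 274 | 0303_01 | 0303 | G/41/S | NaN | None |
| 286 | 0315_01 | 0315 | G/42/S | NaN | None |
| 291 | 0321_01 | 0321 | F/61/S | NaN | None |
| 347 | 0382_01 | 0382 | G/64/P | NaN | None |
| 365 | 0402_01 | 0402 | D/15/S | NaN | None |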


Some missing HomePlanet values can be imputed from other passengers in the same GroupID (for example, group 0064 -> Mars and group 0119 -> Europa). This relies on the assumption that members of a group share a home planet.

The others remain None because no passenger in their group has a known HomePlanet to infer from.
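To make the group-based inference concrete, here is a minimal sketch on toy data. The passenger rows and the `toy_` names are hypothetical and are not taken from the competition files. It uses a slight variant of the lookup: rows with a missing HomePlanet are dropped before grouping, so groups with no known value are simply absent from the lookup and stay missing after the fill.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0001_02", "0002_01", "0003_01", "0003_02", "0003_03"],
    "HomePlanet": ["Mars", None, None, "Earth", "Earth", None],
})
toy["GroupID"] = toy["PassengerId"].str.split("_").str[0]

# Most frequent known HomePlanet per group; groups with no known value
# do not appear in this lookup at all
toy_group_planet = (
    toy.dropna(subset=["HomePlanet"])
    .groupby("GroupID")["HomePlanet"]
    .agg(lambda s: s.mode().iloc[0])
)

# Fill only the missing entries; unmatched groups map to NaN and stay missing
toy["HomePlanetFilled"] = toy["HomePlanet"].fillna(toy["GroupID"].map(toy_group_planet))
print(toy[["PassengerId", "HomePlanet", "HomePlanetFilled"]])
```

In this sketch, passenger 0001_02 inherits Mars and 0003_03 inherits Earth, while 0002_01 stays missing because its group has no known HomePlanet.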

In [4]:
# Fill missing HomePlanet from the group lookup; rows whose group has no known value stay missing
combined['HomePlanet'] = combined['HomePlanet'].fillna(combined['GroupID'].map(group_planet))

# Confirm how many are still missing
remaining_missing = combined['HomePlanet'].isna().sum()
print(f"Remaining missing HomePlanet values: {remaining_missing}")

# Fill remaining missing HomePlanet values with 'Unknown'
combined['HomePlanet'] = combined['HomePlanet'].fillna('Unknown')

# Sanity check
print("HomePlanet value counts after imputation:")
print(combined['HomePlanet'].value_counts(dropna=False))



Remaining missing HomePlanet values: 157
HomePlanet value counts after imputation:
HomePlanet
Earth      6914
Europa     3175
Mars       2724
Unknown     157
Name: count, dtype: int64
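The group-based fill above assumes that every passenger in a group shares one HomePlanet. One possible way to check that assumption is to count the distinct known values per group. The sketch below uses toy data with hypothetical values; on the real `combined` frame, such a check would need to run before the imputation step, since 'Unknown' would otherwise count as a planet.

```python
import pandas as pd

# Toy data (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "GroupID": ["0001", "0001", "0002", "0002", "0003"],
    "HomePlanet": ["Mars", "Mars", "Earth", "Europa", None],
})

# Number of distinct known HomePlanet values per group (nunique ignores NaN by default)
planets_per_group = toy.groupby("GroupID")["HomePlanet"].nunique()

# Groups with more than one distinct planet would make a mode-based fill ambiguous
mixed_groups = planets_per_group[planets_per_group > 1].index.tolist()
print("Groups with conflicting HomePlanet values:", mixed_groups)
```

If the list of mixed groups is empty on the real data, the mode is the only known value in each group and the fill is unambiguous.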
