<a href="https://colab.research.google.com/github/GarimaSharda/Kaggle-Knowledge-Project-Titanic/blob/main/Kaggle_Titanic_project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Load data and inspect first rows**
These cells upload the Titanic train.csv, gender_submission.csv, and test.csv files to Colab, read each into a pandas DataFrame, and display the first five rows. This verifies that the files loaded correctly and shows the available columns (PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked). test.csv has the same columns except Survived, which is the target to predict.

In [11]:
# Step 1: Upload local CSV file from your computer
from google.colab import files
uploaded = files.upload()

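# Step 2: Load it into pandas
# Note: if train.csv already exists in the session, Colab saves the new upload
# under a different name (e.g. "train (1).csv") and this call reads the older copy
import pandas as pd
train = pd.read_csv('train.csv')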

# Step 3: Check the data
train.head()


Saving train.csv to train (1).csv


| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# Upload gender_submission.csv from your computer
uploaded = files.upload()

# Read into pandas
gender_submission = pd.read_csv('gender_submission.csv')

# Display first few rows
gender_submission.head()


Saving gender_submission.csv to gender_submission.csv


| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 1 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |


In [4]:
# Upload file manually from your computer
uploaded = files.upload()

# Load CSV file into a DataFrame
test = pd.read_csv('test.csv')

# Display the first few rows
test.head()

Saving test.csv to test.csv


| | PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 1 | 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 2 | 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
| 3 | 895 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
| 4 | 896 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
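
The loading steps above rely on each file having the expected columns. A minimal sketch of an early check follows; `load_checked` is a hypothetical helper that is not part of the original notebook, and the in-memory CSV is a tiny stand-in for train.csv so the example runs without an upload.

```python
import io
import pandas as pd

# Hypothetical helper (an assumption, not from the notebook): read a CSV and
# fail early if any expected column is missing.
def load_checked(source, expected_columns):
    df = pd.read_csv(source)
    missing = [c for c in expected_columns if c not in df.columns]
    if missing:
        raise ValueError(f"Missing columns: {missing}")
    return df

# Tiny in-memory stand-in for train.csv (made-up subset of columns).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,22\n"
    "2,1,1,female,38\n"
)
sample = load_checked(sample_csv, ["PassengerId", "Survived", "Pclass"])
print(sample.shape)
```

In the notebook itself, the same helper could be pointed at 'train.csv' with the full column list, assuming the file is present in the Colab session.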


# **Check shapes and missing values**
These cells print the number of rows and columns in train and test, then count the missing values in each column. The counts show which features need handling before modeling: Age (177 missing in train, 86 in test), Cabin (687 of 891 in train, 327 of 418 in test), Embarked (2 in train), and Fare (1 in test).

In [5]:
# Get the number of entries in each file
train_len = len(train)
test_len = len(test)

print("Train shape:", train.shape)
print("Test shape:", test.shape)

print("Number of entries in train.csv:", train_len)
print("Number of entries in test.csv:", test_len)

Train shape: (891, 12)
Test shape: (418, 11)
Number of entries in train.csv: 891
Number of entries in test.csv: 418


In [6]:
# Count missing values per column, first in train, then in test
print(train.isnull().sum())
print(test.isnull().sum())

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
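
The raw counts above are easier to compare as percentages. The following is a minimal sketch of such a summary; `missing_summary` is a hypothetical helper and the toy frame is made-up data, not the Titanic files.

```python
import numpy as np
import pandas as pd

# Hypothetical helper (an assumption, not from the notebook): list only the
# columns that have missing values, with counts and percentages.
def missing_summary(df):
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Made-up toy data with gaps in Age and Cabin.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.1],
})
summary = missing_summary(toy)
print(summary)
```

Applied to the real frames, `missing_summary(train)` and `missing_summary(test)` would report the same columns as the printed counts.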


# **Engineer features**
This cell extracts the trailing run of digits from each Ticket string and stores it as a number in a new column, Ticket_Number. Tickets that do not end in digits get NaN. Reducing the mixed ticket codes to a numeric identifier can capture patterns such as groups of passengers sharing similar ticket numbers.
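
As a small illustration of the ticket extraction, the sketch below applies the same regex to a few ticket strings. The first four come from the rows shown earlier; "LINE" is an assumed example of a ticket with no trailing digits.

```python
import pandas as pd

# Sample tickets; "LINE" is an assumed value used to show the NaN case.
tickets = pd.Series(["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "LINE"])

# Capture the digits at the end of the string, then convert to numbers.
# Strings with no trailing digits yield NaN.
numbers = pd.to_numeric(tickets.str.extract(r"(\d+)$", expand=False), errors="coerce")
print(numbers.tolist())  # [21171.0, 17599.0, 3101282.0, 113803.0, nan]
```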

The cell then fills missing Age values with the median age of each (Sex, Pclass) group, computed separately in train and test, so that the estimates respect passenger demographics. Missing Embarked values in the training data are filled with the most common port, and the single missing Fare in the test data is filled with the median fare for that passenger class. Finally, a binary HasCabin flag records whether a cabin entry exists, and the original, mostly empty Cabin column is dropped.
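
The group-wise median fill can be checked on a toy frame. This sketch uses made-up ages and the same `groupby(...).transform(...)` pattern as the notebook cell.

```python
import numpy as np
import pandas as pd

# Made-up data: one missing age in each (Sex, Pclass) group.
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Age":    [20.0, 30.0, np.nan, 40.0, np.nan, 50.0],
})

# Each missing value is replaced by the median of its own group:
# 25.0 for (male, 3) and 45.0 for (female, 1).
toy["Age"] = toy.groupby(["Sex", "Pclass"])["Age"].transform(
    lambda s: s.fillna(s.median())
)
print(toy["Age"].tolist())  # [20.0, 30.0, 25.0, 40.0, 45.0, 50.0]
```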

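The cell also adds FamilySize, the number of family members traveling together including the passenger (SibSp + Parch + 1), and IsAlone, which is 1 when FamilySize is 1. It extracts Title (Mr, Mrs, Miss, Master, etc.) from the Name column, groups rare titles under "Rare", and maps Mlle and Ms to Miss and Mme to Mrs. The regex captures "the Countess" rather than "Countess", so that title does not match the rare list and keeps its own category (it appears as Title_the Countess in the encoded columns). These features can improve the model by capturing social status and family structure, both of which affected survival chances.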
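
The encoded column list further down includes Title_the Countess, which suggests the title regex keeps a leading "the". The sketch below compares the notebook's pattern with one possible variant that skips that prefix. The variant is an assumption rather than the author's method, and the third name is written in the style of the dataset for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    # Assumed example name in the dataset's "Surname, Title. Given" style.
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Pattern used in the notebook: everything between ", " and the next ".".
raw = names.str.extract(r",\s*([^\.]+)\.", expand=False)

# Possible variant: skip an optional leading "the " before capturing.
stripped = names.str.extract(r",\s*(?:the\s+)?([^\.]+)\.", expand=False)

print(raw.tolist())       # ['Mr', 'Miss', 'the Countess']
print(stripped.tolist())  # ['Mr', 'Miss', 'Countess']
```

With the variant, "Countess" would match the rare-title list as written. Adopting it would change the encoded columns and could shift the reported validation accuracy, so the outputs shown in this notebook correspond to the original pattern.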

In [7]:
# Ticket_Number for both train and test
for df in [train, test]:
    df["Ticket_Number"] = df["Ticket"].astype(str).str.extract(r"(\d+)$", expand=False)
    df["Ticket_Number"] = pd.to_numeric(df["Ticket_Number"], errors="coerce")

# Age: median by Sex and Pclass
for df in [train, test]:
    df["Age"] = df.groupby(["Sex", "Pclass"])["Age"].transform(
        lambda x: x.fillna(x.median())
    )

# Embarked (train): fill with mode
embarked_mode = train["Embarked"].mode()[0]
train["Embarked"] = train["Embarked"].fillna(embarked_mode)

# Fare (test): fill with median by Pclass
test["Fare"] = test.groupby("Pclass")["Fare"].transform(
    lambda x: x.fillna(x.median())
)

# Cabin → HasCabin, then drop Cabin
for df in [train, test]:
    df["HasCabin"] = df["Cabin"].notna().astype(int)
    df.drop(columns=["Cabin"], inplace=True)

# FamilySize, IsAlone, Title
for df in [train, test]:
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)

    df["Title"] = df["Name"].str.extract(r",\s*([^\.]+)\.", expand=False)
    df["Title"] = df["Title"].replace(
        ["Lady", "Countess", "Capt", "Col", "Don", "Dr", "Major", "Rev",
         "Sir", "Jonkheer", "Dona"],
        "Rare"
    )
    df["Title"] = df["Title"].replace({"Mlle": "Miss", "Ms": "Miss", "Mme": "Mrs"})


# **Encode categorical features for modeling**
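This cell converts text categories into numeric form. It maps Sex to 0/1 (male = 0, female = 1) and one-hot encodes Embarked and Title into indicator columns. Because drop_first=True, the first category of each (Embarked_C and Title_Master) is omitted and serves as the baseline. The test frame is then aligned to the training columns so both have the same layout. This transformation is required because most machine learning algorithms accept only numerical inputs.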

In [8]:
# Encode Sex
for df in [train, test]:
    df["Sex"] = df["Sex"].map({"male": 0, "female": 1}).astype(int)

# One‑hot encode Embarked and Title
train = pd.get_dummies(train, columns=["Embarked", "Title"], drop_first=True)
test  = pd.get_dummies(test,  columns=["Embarked", "Title"], drop_first=True)

# Align test to train's columns (left join): columns missing from test,
# including Survived, are added and filled with 0
train, test = train.align(test, join="left", axis=1, fill_value=0)

train.columns, test.columns


(Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
        'Parch', 'Ticket', 'Fare', 'Ticket_Number', 'HasCabin', 'FamilySize',
        'IsAlone', 'Embarked_Q', 'Embarked_S', 'Title_Miss', 'Title_Mr',
        'Title_Mrs', 'Title_Rare', 'Title_the Countess'],
       dtype='object'),
 Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
        'Parch', 'Ticket', 'Fare', 'Ticket_Number', 'HasCabin', 'FamilySize',
        'IsAlone', 'Embarked_Q', 'Embarked_S', 'Title_Miss', 'Title_Mr',
        'Title_Mrs', 'Title_Rare', 'Title_the Countess'],
       dtype='object'))
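
The effect of `drop_first=True` followed by `align` can be seen on a toy example. This is a minimal sketch with made-up frames in which the test data lacks one category (Q) that the training data has.

```python
import pandas as pd

# Made-up frames: test has no "Q" port and no Survived column.
train_toy = pd.DataFrame({"Survived": [0, 1, 1], "Embarked": ["S", "C", "Q"]})
test_toy = pd.DataFrame({"Embarked": ["S", "C", "S"]})

# drop_first=True omits the first category (C) in both frames.
train_enc = pd.get_dummies(train_toy, columns=["Embarked"], drop_first=True)
test_enc = pd.get_dummies(test_toy, columns=["Embarked"], drop_first=True)

# Left join on columns: test gains every train column it lacks, filled with 0.
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1, fill_value=0)

print(list(train_enc.columns))  # ['Survived', 'Embarked_Q', 'Embarked_S']
print(list(test_enc.columns))   # ['Survived', 'Embarked_Q', 'Embarked_S']
```

One caveat with this approach: `drop_first` drops whichever category sorts first within each frame, so if the two frames do not share the same first category, the dropped baseline can differ between them.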

# **Prepare feature matrix X and target y**
This cell drops the identifier and raw-text columns (PassengerId, Name, Ticket), which are not directly useful as numeric features. It then defines X as the remaining model features and y as the Survived target from the training data.

It also holds out 20% of the training rows as a validation set (stratified on Survived), fits a Random Forest classifier (200 trees, maximum depth 5) on the remaining rows, and reports accuracy on the held-out rows. This gives an estimate of how well the model generalizes before it is used for test-set predictions.

In [9]:
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier

target = "Survived"
drop_cols = ["PassengerId", "Name", "Ticket"]

train_model = train.drop(columns=drop_cols)
test_model = test.drop(columns=drop_cols + [target], errors="ignore")

X = train_model.drop(columns=[target])
y = train_model[target]

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

rf = RandomForestClassifier(
    n_estimators=200,
    max_depth=5,
    random_state=42,
    n_jobs=-1
)
rf.fit(X_train, y_train)
print("Validation accuracy:", accuracy_score(y_valid, rf.predict(X_valid)))


Validation accuracy: 0.8100558659217877
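
The `stratify=y` argument keeps the class balance the same in both subsets. The sketch below shows this on made-up labels (60 negatives, 40 positives) with the same `test_size` and `random_state` as the notebook; the feature column is a placeholder.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% positive class.
y_toy = pd.Series([0] * 60 + [1] * 40)
X_toy = pd.DataFrame({"feature": range(100)})

X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Both subsets keep the 40% positive rate.
print(len(y_tr), y_tr.mean())  # 80 0.4
print(len(y_va), y_va.mean())  # 20 0.4
```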


# **Fit on all data and generate Kaggle submission**
This cell retrains the chosen model on the full training dataset to use all available labeled examples. It then applies the model to the processed test features to predict Survived for each passenger and saves the results as submission.csv with the required PassengerId and Survived columns for uploading to Kaggle.

In [10]:
rf.fit(X, y)
test_pred = rf.predict(test_model)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": test_pred.astype(int)
})
submission.to_csv("submission.csv", index=False)
submission.head()


| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
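
Before uploading, it can help to confirm that the submission has the layout described above. The following is a minimal sketch; `check_submission` is a hypothetical helper and the toy frame stands in for the real submission.

```python
import pandas as pd

# Hypothetical helper (an assumption, not from the notebook): verify column
# names, passenger IDs, and that predictions are 0 or 1.
def check_submission(sub, expected_ids):
    if list(sub.columns) != ["PassengerId", "Survived"]:
        raise ValueError("Unexpected columns")
    if sub["PassengerId"].tolist() != list(expected_ids):
        raise ValueError("PassengerId values do not match the test set")
    if not sub["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")
    return True

# Made-up stand-in for the real submission frame.
toy_sub = pd.DataFrame({"PassengerId": [892, 893, 894], "Survived": [0, 1, 0]})
ok = check_submission(toy_sub, [892, 893, 894])
print(ok)
```

In the notebook, the call would be `check_submission(submission, test["PassengerId"])`, assuming the variables from the cell above are still in scope.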
