PROBLEM STATEMENT

The Titanic dataset records passenger features such as class, age, and gender, and the task is to predict whether each passenger survived. The target variable is Survived (0 = did not survive, 1 = survived). The goal is to build a model that predicts survival accurately on unseen data.

Step 1: Install Required Libraries


In [1]:
!pip install pandas numpy matplotlib seaborn scikit-learn




Step 2: Download Titanic Dataset
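
Later cells read `train.csv` and `test.csv` from the working directory, so it can help to confirm both files are present before continuing. The sketch below assumes the files were already downloaded manually (for example from the Kaggle Titanic competition page); `missing_files` is a hypothetical helper, not part of any library.

```python
from pathlib import Path


def missing_files(filenames, directory="."):
    """Return the names in `filenames` that are not files inside `directory`."""
    base = Path(directory)
    return [name for name in filenames if not (base / name).is_file()]


# Check for the two CSV files the later steps expect.
missing = missing_files(["train.csv", "test.csv"])
if missing:
    print(f"Missing dataset files: {missing}")
else:
    print("Both dataset files found.")
```

If either file is reported missing, place it next to the notebook before running Step 4.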

Step 3: Import Required Libraries


In [2]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split


Step 4: Load and Preprocess Training Data

In [3]:
train_data = pd.read_csv('train.csv')


In [4]:
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1

In [5]:
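# Extract the title (the word followed by a period) from each name.
train_data['Title'] = train_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
title_mapping = {'Mr': 1, 'Miss': 2, 'Mrs': 3, 'Master': 4, 'Rare': 5}
# Titles missing from the mapping (e.g. Dr, Rev) would become NaN; code them as 'Rare'.
train_data['Title'] = train_data['Title'].map(title_mapping).fillna(title_mapping['Rare']).astype(int)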
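
To see what the title pattern captures, here is a small sketch on three made-up names (they are illustrative, not rows from the dataset). It applies the same regular expression and dictionary as the cell above. Titles outside the dictionary (here `Dr`) come back as NaN from `.map` unless they are given a fallback code.

```python
import pandas as pd

# Made-up names in the "Surname, Title. Given names" layout (assumption).
names = pd.Series(["Smith, Mr. John", "Jones, Miss. Anna", "Brown, Dr. Alice"])

# Capture the run of letters that follows a space and precedes a period.
titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
codes = titles.map(title_mapping)

print(titles.tolist())        # extracted title strings
print(codes.isna().tolist())  # True where the title has no entry in the mapping
```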

In [6]:
train_data['Sex'] = train_data['Sex'].map({'male': 0, 'female': 1})
train_data['Embarked'] = train_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

In [7]:
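# Fill missing values: median for the numeric columns, most frequent code for
# Embarked (already numeric after the mapping in the previous cell).
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
train_data['Fare'] = train_data['Fare'].fillna(train_data['Fare'].median())
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])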
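
The effect of median and mode imputation is easiest to see on a tiny frame. The following sketch uses a four-row toy DataFrame (an assumption, not the real data) and then counts the missing values that remain; the same `isna().sum()` check can be run on `train_data`.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age and one missing (already encoded) Embarked.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0],
    "Embarked": [2.0, 0.0, np.nan, 2.0],
})

# Median for the numeric column, most frequent value for the categorical code.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Count missing values per column after imputation.
remaining = toy.isna().sum()
print(remaining)
```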

In [8]:
X = train_data.drop(['PassengerId', 'Name', 'Ticket', 'Cabin', 'Survived'], axis=1)
y = train_data['Survived']

Step 5: Train the Model

In [9]:
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


In [10]:
final_model = RandomForestClassifier(n_estimators=100, random_state=42)
final_model.fit(X_train, y_train)

Step 6: Preprocess Test Data

In [11]:
test_data = pd.read_csv('test.csv')

test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1


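test_data['Title'] = test_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
# Reuse the training mapping; titles missing from it are coded as 'Rare'.
test_data['Title'] = test_data['Title'].map(title_mapping).fillna(title_mapping['Rare']).astype(int)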

test_data['Sex'] = test_data['Sex'].map({'male': 0, 'female': 1})
test_data['Embarked'] = test_data['Embarked'].map({'C': 0, 'Q': 1, 'S': 2})

# Impute with the training medians so the test set's own statistics are not used.
test_data['Age'] = test_data['Age'].fillna(train_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(train_data['Fare'].median())

X_test = test_data.drop(['Name', 'Ticket', 'Cabin'], axis=1)
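
Because the test preprocessing repeats the training steps by hand, the two can drift apart. One possible way to keep them in sync is a shared function applied to both frames. This is a sketch only: `add_features` and `TITLE_MAPPING` are hypothetical names, and coding unmapped titles as `Rare` is an assumption about the intended behaviour.

```python
import pandas as pd

TITLE_MAPPING = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}


def add_features(df):
    """Return a copy of `df` with FamilySize and a numeric Title column added."""
    out = df.copy()
    out["FamilySize"] = out["SibSp"] + out["Parch"] + 1
    title = out["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)
    # Assumption: any title missing from the mapping is treated as 'Rare'.
    out["Title"] = title.map(TITLE_MAPPING).fillna(TITLE_MAPPING["Rare"]).astype(int)
    return out


# Toy rows with made-up names, only to show the call.
toy = pd.DataFrame({
    "Name": ["Smith, Mr. John", "Brown, Dr. Alice"],
    "SibSp": [1, 0],
    "Parch": [0, 2],
})
features = add_features(toy)
print(features[["FamilySize", "Title"]])
```

The same call could then be made once on `train_data` and once on `test_data`.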


Step 7: Generate Predictions


In [12]:
predictions = final_model.predict(X_test.drop('PassengerId', axis=1))


Step 8: Create Submission File

In [13]:
submission = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Survived': predictions
})

submission.to_csv('submission.csv', index=False)
print("Submission file created successfully!")


Submission file created successfully!


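Step 9: Evaluate on Validation Data

In [14]:
from sklearn.metrics import accuracy_score

# Score the model on the 20% validation split held out in Step 5.
y_pred = final_model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
print(f"Accuracy on validation data: {accuracy * 100:.2f}%")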


Accuracy on validation data: 83.24%
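
The accuracy above comes from a single 80/20 split, so it depends on which rows landed in the validation set. A common complement is k-fold cross-validation, sketched below with scikit-learn's `cross_val_score`. The data here is synthetic, generated with `make_classification` as a stand-in (an assumption), so the printed numbers say nothing about the Titanic model; with the real data, `X` and `y` from Step 4 would take the place of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the Titanic features (assumption: not the real data).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train and score the model on 5 different train/validation partitions.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(f"Fold accuracies: {scores.round(3)}")
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```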
