# Titanic Survival Prediction

## 1. Import Libraries and Load Data

First, let's import the necessary libraries and load our training and testing datasets.

In [7]:
import pandas as pd
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
import warnings
warnings.filterwarnings("ignore")

train_df = pd.read_csv('data/train.csv')
test_df = pd.read_csv('data/test.csv')

print("Training Data Head:")
print(train_df.head())

Training Data Head:
   PassengerId  Survived  Pclass  \
0            1         0       3   
1            2         1       1   
2            3         1       3   
3            4         1       1   
4            5         0       3   

                                                Name     Sex   Age  SibSp  \
0                            Braund, Mr. Owen Harris    male  22.0      1   
1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0      1   
2                             Heikkinen, Miss. Laina  female  26.0      0   
3       Futrelle, Mrs. Jacques Heath (Lily May Peel)  female  35.0      1   
4                           Allen, Mr. William Henry    male  35.0      0   

   Parch            Ticket     Fare Cabin Embarked  
0      0         A/5 21171   7.2500   NaN        S  
1      0          PC 17599  71.2833   C85        C  
2      0  STON/O2. 3101282   7.9250   NaN        S  
3      0            113803  53.1000  C123        S  
4      0            373450   8.0500

## 2. Exploratory Data Analysis (EDA)

Next, let's inspect the training data: its column types, non-null counts, and the number of missing values per column.

In [8]:
print("--- Initial Data Info ---")
train_df.info()

print("\n--- Missing values count ---")
print(train_df.isnull().sum())

--- Initial Data Info ---
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

--- Missing values count ---
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
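
Raw counts are easier to act on when read as a share of all rows. The following is a minimal sketch, not part of the original notebook: `missing_summary` is a hypothetical helper, and the toy frame only stands in for `train_df` with made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return the count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns that actually have gaps, sorted by severity.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Toy frame standing in for train_df (illustrative values only).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.93, 53.10],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(train_df)` on the real data would list only the columns with gaps, which makes it easier to decide which ones to impute and which to drop.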

## 3. Data Cleaning & Feature Engineering

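Based on our EDA, we fill missing Age values with the median age of each Pclass/Sex group, missing Embarked values with the most frequent port, and the missing test-set Fare with the median fare. We then encode Sex as 0/1 and one-hot encode Embarked. Cabin is mostly missing (only 204 of 891 values are present), so it is left out of the feature set, along with Name and Ticket.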

In [9]:
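# Fill missing ages with the median age of each (Pclass, Sex) group.
# Assigning the result back avoids chained in-place fillna, which newer pandas versions warn about.
train_df['Age'] = train_df['Age'].fillna(train_df.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
test_df['Age'] = test_df['Age'].fillna(test_df.groupby(['Pclass', 'Sex'])['Age'].transform('median'))
# Embarked gets the most frequent port; Fare is complete in the training data, so only the test set is filled.
train_df['Embarked'] = train_df['Embarked'].fillna(train_df['Embarked'].mode()[0])
test_df['Fare'] = test_df['Fare'].fillna(test_df['Fare'].median())

# Encode Sex as 0/1 and one-hot encode Embarked.
train_df['Sex'] = train_df['Sex'].map({'male': 0, 'female': 1})
test_df['Sex'] = test_df['Sex'].map({'male': 0, 'female': 1})
train_df = pd.get_dummies(train_df, columns=['Embarked'])
test_df = pd.get_dummies(test_df, columns=['Embarked'])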

features = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_C', 'Embarked_Q', 'Embarked_S']
X_train = train_df[features]
y_train = train_df['Survived']
X_test = test_df[features]

print("Feature preparation complete.")
print(X_train.head())

Feature preparation complete.
   Pclass  Sex   Age  SibSp  Parch     Fare  Embarked_C  Embarked_Q  \
0       3    0  22.0      1      0   7.2500       False       False   
1       1    1  38.0      1      0  71.2833        True       False   
2       3    1  26.0      0      0   7.9250       False       False   
3       1    1  35.0      1      0  53.1000       False       False   
4       3    0  35.0      0      0   8.0500       False       False   

   Embarked_S  
0        True  
1       False  
2        True  
3        True  
4        True  
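
To make the group-median imputation concrete, here is a small sketch on toy data (the frames and values are invented for illustration). It also shows one possible alternative to the cell above: computing the medians on the training frame only and reusing them for the test frame, so that test rows are filled from training statistics. The `AgeMedian` column name is an assumption introduced here.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train_df and test_df (illustrative values only).
train = pd.DataFrame({
    "Pclass": [1, 1, 1, 3, 3, 3],
    "Sex": ["female", "female", "female", "male", "male", "male"],
    "Age": [30.0, 40.0, np.nan, 20.0, 24.0, np.nan],
})
test = pd.DataFrame({
    "Pclass": [1, 3],
    "Sex": ["female", "male"],
    "Age": [np.nan, np.nan],
})

# Median age per (Pclass, Sex) group, computed on the training rows only.
medians = (
    train.groupby(["Pclass", "Sex"])["Age"]
    .median()
    .rename("AgeMedian")
    .reset_index()
)

# Within-frame imputation: transform('median') returns a Series aligned to
# train's index that holds each row's group median (NaNs are skipped).
train["Age"] = train["Age"].fillna(
    train.groupby(["Pclass", "Sex"])["Age"].transform("median")
)

# Alternative for the test frame: look up the training medians via a left merge.
test = test.merge(medians, on=["Pclass", "Sex"], how="left")
test["Age"] = test["Age"].fillna(test["AgeMedian"])
test = test.drop(columns="AgeMedian")

print(train)
print(test)
```

Whether to impute the test set from its own medians (as the notebook does) or from the training medians is a design choice; the second keeps all preprocessing statistics tied to the training data.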


## 4. Model Training and Evaluation

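Rather than relying on a single model, we stack three base learners (a random forest, gradient boosting, and an RBF-kernel SVC on standardized features) under a logistic regression meta-learner. We estimate accuracy with shuffled 5-fold cross-validation, then fit the stack on the full training set.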

In [10]:
rf = RandomForestClassifier(n_estimators=300, max_depth=6, random_state=42)
gb = GradientBoostingClassifier(n_estimators=200, learning_rate=0.05, max_depth=3, random_state=42)
svc = make_pipeline(StandardScaler(), SVC(probability=True, kernel='rbf', C=1.0))
lr = LogisticRegression(max_iter=1000)

estimators = [
    ('rf', rf),
    ('gb', gb),
    ('svc', svc)
]
stack = StackingClassifier(
    estimators=estimators,
    final_estimator=lr,
    passthrough=False,
    cv=5
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
cv_score = cross_val_score(stack, X_train, y_train, cv=kf, scoring='accuracy').mean()
print(f"5-fold CV accuracy: {cv_score:.4f}")

stack.fit(X_train, y_train)

print("Model training and evaluation complete.")

5-fold CV accuracy: 0.8294
Model training and evaluation complete.
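
A single mean accuracy does not show whether stacking actually helps. One way to check is to score each base learner on the same folds as the stack and report the spread across folds. The sketch below does this on synthetic data from `make_classification`, so the numbers it prints say nothing about the Titanic data. The smaller `n_estimators`, the 3-fold splits, and the default `SVC()` are simplifications assumed here to keep it fast, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, y_train).
X, y = make_classification(n_samples=200, n_features=6, n_informative=4, random_state=0)

base = {
    "rf": RandomForestClassifier(n_estimators=50, max_depth=6, random_state=42),
    "gb": GradientBoostingClassifier(n_estimators=50, random_state=42),
    "svc": make_pipeline(StandardScaler(), SVC()),
}
stack = StackingClassifier(
    estimators=list(base.items()),
    final_estimator=LogisticRegression(max_iter=1000),
    cv=3,
)

# Reuse one KFold object so every model is scored on identical splits.
kf = KFold(n_splits=3, shuffle=True, random_state=42)
scores = {}
for name, model in {**base, "stack": stack}.items():
    fold_scores = cross_val_score(model, X, y, cv=kf, scoring="accuracy")
    scores[name] = (fold_scores.mean(), fold_scores.std())
    print(f"{name:>5}: {scores[name][0]:.4f} +/- {scores[name][1]:.4f}")
```

If the stack's mean is within one standard deviation of the best base learner, the extra complexity may not be buying much.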


## 5. Create Submission File

Finally, we'll use our trained model to make predictions on the test set and generate the submission file in the required format.

In [13]:
pred = stack.predict(X_test)
pred = pred.astype(int)
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': pred})
submission.to_csv('submission4.csv', index=False)
print("Submission file saved.")

Submission file saved.
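
Before uploading, it can be worth checking that the file has the shape the notebook intends: a PassengerId column, a Survived column, one row per test passenger, and only 0/1 predictions. The sketch below is one way to do that; `check_submission` is a hypothetical helper introduced here, and the IDs are toy values.

```python
import pandas as pd


def check_submission(submission: pd.DataFrame, expected_ids: pd.Series) -> None:
    """Raise ValueError if the submission frame is malformed."""
    if list(submission.columns) != ["PassengerId", "Survived"]:
        raise ValueError(f"unexpected columns: {list(submission.columns)}")
    if submission["PassengerId"].tolist() != expected_ids.tolist():
        raise ValueError("PassengerId values do not match the test set")
    if submission["Survived"].isnull().any():
        raise ValueError("Survived contains missing values")
    if not submission["Survived"].isin([0, 1]).all():
        raise ValueError("Survived must contain only 0 or 1")


# Toy IDs standing in for test_df['PassengerId'].
ids = pd.Series([892, 893, 894], name="PassengerId")
good = pd.DataFrame({"PassengerId": ids, "Survived": [0, 1, 0]})
check_submission(good, ids)  # passes silently

bad = good.assign(Survived=[0, 2, 0])
try:
    check_submission(bad, ids)
except ValueError as err:
    print(f"Rejected: {err}")
```

In the notebook this would be called as `check_submission(submission, test_df['PassengerId'])` just before `to_csv`.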
