# Titanic Survival Prediction

## Overview
This notebook predicts Titanic passenger survival using a Random Forest Classifier. It includes custom feature engineering, such as extracting titles from names and encoding categorical variables, to enhance model performance.

## Dataset
The dataset (`titanic.csv`) contains one row per passenger, with details such as ticket class, age, sex, and survival status.

## Feature Engineering
- Extracted `title` from `Name` (e.g., Mr, Mrs) and encoded it as an integer from 1 to 5 (Mr=1, other titles=2, Master=3, Miss/Ms/Mlle=4, Mrs/Mme=5).
- Encoded `Sex` as `female`=0, `male`=1.
- Filled missing `Age` values with the median (28.0).

## Model
Trained a Random Forest Classifier with 100 trees using features: `Pclass`, `Age`, `SibSp`, `Sex`, `title`. Random Forest was chosen for its ability to capture non-linear relationships and feature interactions.

## Results
Achieved an accuracy of about 0.83 (0.8268) on the test set.

## Usage
Run the notebook cell by cell to reproduce the results. The load cell expects `titanic.csv` in the `C:/nlp_projects/` directory; adjust the path passed to `pd.read_csv` if the file lives elsewhere.


## Import Libraries
Import the required Python libraries for data manipulation (`pandas`), train/test splitting (`sklearn.model_selection`), the Random Forest classifier (`sklearn.ensemble`), and evaluation (`sklearn.metrics`).


In [37]:
# Import libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score



## Load Dataset
Load the Titanic dataset (`titanic.csv`) from the specified path into a pandas DataFrame for analysis and preprocessing.


In [40]:
data = pd.read_csv('C:/nlp_projects/titanic.csv')

## Check Dataset
Display the column names and count of missing values in the `Age` column to understand the dataset structure and identify data cleaning needs.

In [43]:
print("Columns in dataset:", data.columns.tolist())
print("Missing Age before filling:", data['Age'].isna().sum())

Columns in dataset: ['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']
Missing Age before filling: 177


## Extract Title from Name
Create a new feature `title` by extracting the title (e.g., Mr, Mrs) that sits between the comma and the first period in the `Name` column. Titles are then mapped to integer codes 1-5 to capture social status: Mr=1, Master=3, Miss/Ms/Mlle=4, Mrs/Mme=5, and any other title=2.

In [46]:
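def extract_title(name):
    """Return the title between the comma and the first period, e.g. 'Mr'."""
    if ',' in name and '.' in name:
        return name.split(',')[1].split('.')[0].strip()
    return 'Unknown'

In [48]:
def encode_title(title):
    """Map a title string to an integer code from 1 to 5."""
    if title == 'Mr':
        return 1
    elif title == 'Master':
        return 3
    elif title in ['Ms', 'Mlle', 'Miss']:
        return 4
    elif title in ['Mrs', 'Mme']:
        return 5
    else:
        return 2  # all other titles (Dr, Rev, ...) and 'Unknown'

In [50]:
data['title'] = data['Name'].apply(extract_title).apply(encode_title)
data.head(5)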

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 4 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 5 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |
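
As a cross-check on the string-splitting approach above, the same extraction can be sketched with a regular expression. The names `TITLE_CODES`, `OTHER_TITLE_CODE`, `TITLE_PATTERN`, `extract_title_regex`, and `encode_title_code` are illustrative and not part of the notebook; the integer codes mirror the rules in the cells above.

```python
import re

# Hypothetical lookup table; the codes mirror the notebook's encoding rules.
TITLE_CODES = {"Mr": 1, "Master": 3, "Ms": 4, "Mlle": 4, "Miss": 4, "Mrs": 5, "Mme": 5}
OTHER_TITLE_CODE = 2  # any title not listed above, including 'Unknown'

# Capture the text between the first comma and the next period.
TITLE_PATTERN = re.compile(r",\s*([^.]+?)\s*\.")


def extract_title_regex(name: str) -> str:
    """Return the title found in a 'Surname, Title. Given names' string."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else "Unknown"


def encode_title_code(title: str) -> int:
    """Map a title to its integer code, falling back to the 'other' code."""
    return TITLE_CODES.get(title, OTHER_TITLE_CODE)


names = ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "No title here"]
codes = [encode_title_code(extract_title_regex(n)) for n in names]
print(codes)  # [1, 4, 2]
```

A dictionary lookup keeps the mapping in one place, which may make it easier to add or regroup titles later.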


## Encode Sex
Convert the categorical `Sex` column (`female`, `male`) into numeric values (`female`=0, `male`=1) using a dictionary mapping for model compatibility.

In [53]:
# Map the sorted unique labels to consecutive integers: female -> 0, male -> 1
labels = sorted(data['Sex'].unique())
sex_codes = dict(zip(labels, range(len(labels))))
data['Sex'] = data['Sex'].map(sex_codes).astype(int)

data.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | 1 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | 5 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 4 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S | 5 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S | 1 |
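
The mapping works because the unique labels are sorted alphabetically before being paired with consecutive integers, so `female` receives 0 and `male` receives 1. A minimal sketch on a toy Series (the variable names `sex`, `mapping`, and `encoded` are illustrative, not from the notebook):

```python
import pandas as pd

# Toy stand-in for the notebook's Sex column.
sex = pd.Series(["male", "female", "female", "male"])

# Sort the unique labels, then number them 0, 1, ... in that order.
labels = sorted(sex.unique())
mapping = {label: code for code, label in enumerate(labels)}

encoded = sex.map(mapping).astype(int)
print(mapping)           # {'female': 0, 'male': 1}
print(encoded.tolist())  # [1, 0, 0, 1]
```

Any label missing from the mapping would become `NaN` after `map`, and the `astype(int)` call would then raise an error, which acts as a rough guard against unexpected categories.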


## Fill Age with Median
Handle the 177 missing values in the `Age` column by filling them with the median age (28.0), a simple imputation that keeps the filled values in the typical range and is robust to outliers. Display statistics to verify the result.

In [56]:
median_age = data['Age'].median()  # computed before filling; NaN values are skipped
data['Age'] = data['Age'].fillna(median_age)
print("Median Age used:", median_age)
print("Missing Age after filling:", data['Age'].isna().sum())
print("Sample Age values after filling:", data['Age'].head(10))

Median Age used: 28.0
Missing Age after filling: 0
Sample Age values after filling: 0    22.0
1    38.0
2    26.0
3    35.0
4    35.0
5    28.0
6    54.0
7     2.0
8    27.0
9    14.0
Name: Age, dtype: float64
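
To illustrate the imputation step on a small scale, here is a hedged sketch with a toy Series (the values and the names `ages`, `median_age`, and `filled` are made up for illustration). It also shows that filling with the median leaves the median itself unchanged.

```python
import pandas as pd

# Toy ages with two missing entries (illustrative values only).
ages = pd.Series([22.0, None, 38.0, 26.0, None, 35.0])

# Series.median skips NaN by default: median of 22, 26, 35, 38.
median_age = ages.median()
filled = ages.fillna(median_age)

print(median_age)                # 30.5
print(int(filled.isna().sum()))  # 0
print(filled.median())           # 30.5
```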


## Prepare Data
Select the columns used for modeling (`Pclass`, `Age`, `SibSp`, `Sex`, `title`) together with the target `Survived`. As a safeguard, fill any remaining missing values with 0 (`Age` is already filled). Then separate features (`X`) from the target (`y`) and hold out 20% of the rows as a test set.

In [59]:
# Keep only the modeling columns; fillna(0) is a safeguard for any leftover NaN
data = data[['Pclass', 'Age', 'SibSp', 'Sex', 'title', 'Survived']].fillna(0)
X = data[['Pclass', 'Age', 'SibSp', 'Sex', 'title']]
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
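
The sketch below shows what `test_size=0.2` and `random_state=42` imply for the split. It assumes the file has 891 rows, as in the standard Kaggle training file; the notebook itself never prints the row count, so treat `n_rows` as an assumption.

```python
from sklearn.model_selection import train_test_split

n_rows = 891  # assumption: row count of the standard training file
indices = list(range(n_rows))

# Two calls with the same random_state produce the same shuffle.
train_a, test_a = train_test_split(indices, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(indices, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))  # 712 179
print(test_a == test_b)           # True
```

Fixing `random_state` is what makes the reported accuracy reproducible from run to run.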


## Train Model
Train a Random Forest Classifier with 100 trees to predict survival based on the selected features. Random Forest is chosen for its ability to model non-linear relationships and feature interactions.

In [62]:
# Train model
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
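
Once a forest is fitted, its `feature_importances_` attribute gives a rough view of which inputs drive the splits. The sketch below uses synthetic data in which the target depends only on `Sex`, purely for illustration; the names `toy`, `target`, `forest`, and `importances` are not from the notebook, and the values say nothing about the real Titanic model.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic features (illustrative only, not the Titanic data).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "Pclass": rng.integers(1, 4, size=200),
    "Age": rng.uniform(1, 70, size=200),
    "Sex": rng.integers(0, 2, size=200),
})
# Synthetic target that depends on Sex alone.
target = (toy["Sex"] == 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(toy, target)

# Impurity-based importances, normalized to sum to 1.
importances = pd.Series(forest.feature_importances_, index=toy.columns)
print(importances.sort_values(ascending=False))
```

Applied to the notebook's `model`, the same pattern would use `X_train.columns` as the index.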

## Evaluate Model
Calculate and display the model's accuracy on the test set to assess its performance (about 82.7% in this run).

In [65]:
# Evaluate
predictions = model.predict(X_test)
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)

Accuracy: 0.8268156424581006
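
Accuracy is the fraction of test rows predicted correctly. As a sanity check, the reported value is consistent with 148 correct predictions out of 179, assuming a 179-row test set (20% of an assumed 891 rows; the notebook does not print either count).

```python
import math

reported_accuracy = 0.8268156424581006  # value printed by the notebook
n_test = 179  # assumption: 20% of 891 rows, rounded up

# Recover the implied number of correct predictions.
n_correct = round(reported_accuracy * n_test)
print(n_correct)  # 148

# The ratio reproduces the reported accuracy up to floating-point error.
print(math.isclose(n_correct / n_test, reported_accuracy, rel_tol=1e-12))  # True
```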


## Sample Predictions
Display the first five test samples with their true and predicted labels to illustrate the model's predictions. Each feature vector lists `Pclass`, `Age`, `SibSp`, `Sex`, and `title`, in that order.

In [68]:
# Sample predictions
for features, true_label, pred_label in zip(X_test[:5].values, y_test[:5], predictions[:5]):
    print(f"Features: {features}, True Label: {true_label}, Predicted: {pred_label}")

Features: [ 3. 28.  1.  1.  3.], True Label: 1, Predicted: 1
Features: [ 2. 31.  0.  1.  1.], True Label: 0, Predicted: 0
Features: [ 3. 20.  0.  1.  1.], True Label: 0, Predicted: 0
Features: [2. 6. 0. 0. 4.], True Label: 1, Predicted: 1
Features: [ 3. 14.  1.  0.  4.], True Label: 1, Predicted: 0
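
For easier reading, the same five samples can be arranged in a labeled table. This is a sketch that copies the values from the printed output above; the names `comparison`, `sample_features`, `true_labels`, and `predicted_labels` are illustrative.

```python
import pandas as pd

feature_names = ["Pclass", "Age", "SibSp", "Sex", "title"]

# Values copied from the printed output above.
sample_features = [
    [3, 28.0, 1, 1, 3],
    [2, 31.0, 0, 1, 1],
    [3, 20.0, 0, 1, 1],
    [2, 6.0, 0, 0, 4],
    [3, 14.0, 1, 0, 4],
]
true_labels = [1, 0, 0, 1, 1]
predicted_labels = [1, 0, 0, 1, 0]

comparison = pd.DataFrame(sample_features, columns=feature_names)
comparison["true"] = true_labels
comparison["predicted"] = predicted_labels
comparison["correct"] = comparison["true"] == comparison["predicted"]

print(comparison)
n_right = int(comparison["correct"].sum())
print("Correct:", n_right, "of", len(comparison))  # Correct: 4 of 5
```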
