# Titanic Survival Prediction with Random Forest Classifier
This project uses the Titanic dataset to predict passenger survival with a `RandomForestClassifier` from scikit-learn. The following steps cover data preprocessing, feature engineering, model training, and evaluation.
## Import Libraries
First, let's import the Python libraries needed for data manipulation, model building, and evaluation.


In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns


## Load the Dataset
Let's load the Titanic training and test sets into pandas DataFrames.

In [None]:
# Load the dataset
data_file_path: str = '/kaggle/input/titanic/train.csv'
train_data = pd.read_csv(data_file_path)
test_file_path: str = '/kaggle/input/titanic/test.csv'
test_data = pd.read_csv(test_file_path)
# Preview the first 20 rows of the training set
train_data.head(20)

## Data Preprocessing Pipeline and Feature Engineering
The following steps outline how we handle missing values, create new features, and prepare the data for model training.



We handle missing data in the `Age`, `Embarked`, and `Cabin` columns. 

- **Age:** Missing values are filled with the median value of the `Age` column.
- **Embarked:** Missing values are filled with the most frequent value (mode) of the `Embarked` column.
- **Cabin:** This column contains too many missing values, so it is dropped from the dataset.

A new feature `FamilySize` is created by summing the `SibSp` (siblings/spouses aboard) and `Parch` (parents/children aboard) columns. This feature is the number of relatives traveling with a passenger; the passenger is not counted.

Categorical columns such as `Sex` and `Embarked` are converted into numerical values using one-hot encoding, because scikit-learn estimators expect numeric input. With `drop_first=True`, the first category of each column is dropped, so `Sex` becomes a single `Sex_male` indicator.


In [None]:
# Handling missing values
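# Assign the result back instead of using inplace=True on a column selection,
# which recent pandas versions warn about and which may not modify the frame
train_data['Age'] = train_data['Age'].fillna(train_data['Age'].median())
train_data['Embarked'] = train_data['Embarked'].fillna(train_data['Embarked'].mode()[0])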
train_data.drop(columns=['Cabin'], inplace=True)

# Create a new feature 'FamilySize'
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch']

# Convert categorical columns to numerical using one-hot encoding
train_data = pd.get_dummies(train_data, columns=['Sex', 'Embarked'], drop_first=True)

# Drop unnecessary columns
train_data_cleaned = train_data.drop(columns=['Name', 'Ticket', 'PassengerId'])
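To see what these preprocessing steps produce, here is a minimal sketch on a made-up four-row frame. The values and the names `toy` and `toy_encoded` are invented for illustration and do not come from the Titanic files.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    'Age': [22.0, None, 30.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
    'Sex': ['male', 'female', 'female', 'male'],
    'SibSp': [1, 0, 0, 2],
    'Parch': [0, 0, 1, 1],
})

# Median imputation for Age, mode imputation for Embarked
toy['Age'] = toy['Age'].fillna(toy['Age'].median())
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

# Relatives aboard: siblings/spouses plus parents/children
toy['FamilySize'] = toy['SibSp'] + toy['Parch']

# One-hot encode; drop_first=True drops the first category of each column
toy_encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True)

print(toy_encoded.columns.tolist())
print(toy_encoded)
```

The missing `Age` is filled with the median of the three known ages, and the encoded frame keeps one indicator column per encoded feature (`Sex_male`, `Embarked_S`) because this toy frame has only two categories in each.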



After cleaning the data, we define the feature matrix (`X`) and the target variable (`y`). The target variable `y` is the `Survived` column, while all other columns form the feature matrix `X`.

The dataset is split into training and validation sets using an 80/20 ratio.


In [None]:
# Define feature matrix (X) and target vector (y)
X = train_data_cleaned.drop(columns=['Survived'])
y = train_data_cleaned['Survived']

# Split the data into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
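As an optional variation, the sketch below shows how an 80/20 split behaves on a toy frame, and how passing `stratify` keeps the class ratio the same in both parts. Stratification is an assumption added here for illustration; the notebook's own split does not use it. The names `X_toy`, `y_toy`, and the `_tr`/`_va` suffixes are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, 25% positive class (invented for illustration)
X_toy = pd.DataFrame({'feature': range(20)})
y_toy = pd.Series([0] * 15 + [1] * 5, name='Survived')

# 80/20 split; stratify=y_toy preserves the class ratio in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_va))      # sizes of the two parts
print(y_tr.mean(), y_va.mean())  # positive-class share in each part
```

Whether stratifying helps on the real data depends on how imbalanced `Survived` is; with a fixed `random_state` the split is reproducible either way.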

## Model Training and Evaluation: Random Forest Classifier

In this section, we train a **Random Forest Classifier** on the training dataset and evaluate its performance on the validation set. We also compute common evaluation metrics such as accuracy, confusion matrix, and classification report.

---

### 1. Training the Random Forest Classifier

We initialize a `RandomForestClassifier` with a fixed random state.

### 2. Prediction and Evaluation
Once the model is trained, we use it to predict the target values (`y_pred`) for the validation set (`X_val`).
We evaluate the model using the following metrics:

- **Accuracy:** Measures the proportion of correct predictions.
- **Confusion Matrix:** Provides a summary of prediction results and shows the counts of true positives, true negatives, false positives, and false negatives.
- **Classification Report:** Gives precision, recall, and F1-score for each class.

In [None]:
# Train a Random Forest Classifier
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Predict on validation set
y_pred = model.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, y_pred)
conf_matrix = confusion_matrix(y_val, y_pred)
class_report = classification_report(y_val, y_pred)

print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
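To make the metrics concrete, here is a small worked sketch on six hand-written labels. The lists `y_true` and `y_hat` are invented for illustration and are not model output.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-written labels for illustration (1 = survived, 0 = did not survive)
y_true = [0, 0, 1, 1, 1, 0]
y_hat = [0, 1, 1, 0, 1, 0]

acc = accuracy_score(y_true, y_hat)
cm = confusion_matrix(y_true, y_hat)

# For binary labels, rows are true classes and columns are predicted classes,
# so ravel() yields the counts in the order: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()

print(f"Accuracy: {acc:.3f}")
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

Four of the six predictions match, so accuracy is 4/6, and the confusion matrix splits the two mistakes into one false positive and one false negative.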

### Feature Importance
Visualize the importance of each feature in determining survival using the trained Random Forest model.

In [None]:


# Get feature importance
feature_importance = model.feature_importances_

# Create a DataFrame for visualization
features_df = pd.DataFrame({
    'Feature': X.columns,
    'Importance': feature_importance
}).sort_values(by='Importance', ascending=False)

# Plot feature importance
plt.figure(figsize=(10,6))
sns.barplot(x='Importance', y='Feature', data=features_df)
plt.title('Feature Importance')
plt.show()
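When reading the chart, it helps to know that scikit-learn's impurity-based importances are normalized, so each bar is a share of the total. The sketch below checks this on synthetic data; the dataset, the column names `f0` to `f3`, and the variable names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data for illustration only
X_arr, y_arr = make_classification(
    n_samples=200, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=['f0', 'f1', 'f2', 'f3'])

rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_demo, y_arr)

# feature_importances_ follows the column order of the training frame
importances = pd.Series(rf.feature_importances_, index=X_demo.columns)
importances = importances.sort_values(ascending=False)

print(importances)
print(importances.sum())  # shares of the total importance
```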


## Applying Preprocessing and Making Predictions on Test Data

In this section, we apply the same preprocessing steps used for the training data to the test dataset. After preparing the data, we make predictions using the trained Random Forest model and generate a submission file in CSV format.




In [None]:


# Apply preprocessing steps
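# Impute with medians computed on the training data; assign back rather than
# relying on inplace=True on a column selection
test_data['Age'] = test_data['Age'].fillna(train_data['Age'].median())
test_data['Fare'] = test_data['Fare'].fillna(train_data['Fare'].median())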
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch']
test_data = pd.get_dummies(test_data, columns=['Sex', 'Embarked'], drop_first=True)

# Ensure the test data has the same columns as the training set
missing_cols = set(X.columns) - set(test_data.columns)
for col in missing_cols:
    test_data[col] = 0

# Make predictions
test_predictions = model.predict(test_data[X.columns])

# Create a submission DataFrame
submission = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Survived': test_predictions
})

# Save the submission file
submission.to_csv('submission.csv', index=False)
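The column-alignment step can also be written with `DataFrame.reindex`, which adds any missing columns, drops extra ones, and fixes the column order in one call. This is a sketch of an alternative, not the notebook's method; `train_cols`, `test_toy`, and `aligned` are hypothetical names and the values are invented.

```python
import pandas as pd

# Columns the model was trained on (hypothetical example)
train_cols = ['Age', 'Sex_male', 'Embarked_Q', 'Embarked_S']

# Encoded test frame that lacks 'Embarked_Q' and carries an extra ID column
test_toy = pd.DataFrame({
    'PassengerId': [900, 901],
    'Age': [34.0, 47.0],
    'Sex_male': [1, 0],
    'Embarked_S': [1, 1],
})

# Missing columns are created and filled with 0; extra columns are dropped
aligned = test_toy.reindex(columns=train_cols, fill_value=0)

print(aligned)
```

Because `reindex` drops `PassengerId`, it should be read from the original test frame when building the submission.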
