## Titanic - Machine Learning from Disaster

**In this notebook, we will explore data on passengers on the Titanic and build a model to predict which passengers survived the disaster.**


<center>
<img src="https://www.kaggle.com/competitions/3136/images/header" alt="error" width="1000" height="600"></center>


## Import Libraries

In [None]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

**Read Data📚**

In [None]:
train = pd.read_csv('/kaggle/input/titanic/train.csv')
test = pd.read_csv('/kaggle/input/titanic/test.csv')

## Exploratory Data Analysis (EDA):

**Start by analyzing the data, which includes identifying existing columns, missing values, and describing the data.**

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.shape

In [None]:
test.shape

In [None]:
train.info()

In [None]:
test.info()

In [None]:
train.describe(include='all')

In [None]:
test.describe(include='all')

## Data preprocessing

**Data preprocessing** refers to preparing (cleaning and organizing) the raw data so that it is suitable for building and training machine learning models.

**1: Finding and cleaning null values**

In [None]:
train.isna().sum()

In [None]:
train.duplicated().sum()

In [None]:
test.isna().sum()

In [None]:
test.duplicated().sum()

**Fill missing values in the Age column (and the Fare column of the test set) with the median**

In [None]:
train = train.assign(Age = train['Age'].fillna(train['Age'].median()))
test = test.assign(Age = test['Age'].fillna(test['Age'].median()))
test = test.assign(Fare = test['Fare'].fillna(test['Fare'].median()))
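
The cell above fills each frame with its own median. A common alternative is to compute the median on the training data only and reuse it for the test set, so that preprocessing never depends on test-set statistics. Below is a minimal sketch on made-up toy frames; `toy_train`, `toy_test`, and `age_median` are illustrative names that do not appear in the notebook.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for train/test (values are made up for illustration)
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 30.0, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the statistic once, on the training data only
age_median = toy_train["Age"].median()

# Reuse the same value to fill the gaps in both frames
toy_train = toy_train.assign(Age=toy_train["Age"].fillna(age_median))
toy_test = toy_test.assign(Age=toy_test["Age"].fillna(age_median))
```

Whether this changes the final score on such a small dataset is not something the sketch demonstrates; it only shows the pattern.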

**Fill missing values in the Embarked column with the mode (the most frequent value)**

In [None]:
train = train.assign(Embarked=train['Embarked'].fillna(train['Embarked'].mode()[0]))
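
A categorical column such as Embarked has no median, which is why the most frequent value is used instead. Note that `Series.mode()` returns a Series, since several values can tie for most frequent, and `[0]` picks the first of them. A small sketch on a made-up column (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Toy column with one missing value (made-up data)
toy = pd.DataFrame({"Embarked": ["S", "C", "S", None, "Q"]})

# mode() ignores missing values by default and returns a Series
modes = toy["Embarked"].mode()
most_common = modes[0]

# Fill the gap with the most frequent port
toy = toy.assign(Embarked=toy["Embarked"].fillna(most_common))
```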

**Drop the 'Ticket', 'Cabin', and 'Name' columns**

In [None]:
def clean(data):
    # Drop columns that are not used as model features
    return data.drop(["Ticket", "Cabin", "Name"], axis=1)

train = clean(train)
test = clean(test)

In [None]:
train.head()

In [None]:
test.head()

In [None]:
train.isnull().sum()

## Graphical analysis:

In [None]:
plt.hist(train['Age'],bins=20)
plt.title('Age Distribution')
plt.show()

**Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton**

In [None]:

embarked_counts = train['Embarked'].value_counts()  # count passengers per port once
colors = ['blue', 'green', 'orange']
plt.bar(embarked_counts.index, embarked_counts, color=colors)
plt.show()

**Analysis of the Age column**

In [None]:
sns.histplot(data=train, x='Age', hue='Survived', kde=True)
plt.title('Distribution of Age with Survived')
plt.show()


**In the plot above, children under 10 appear to have a better chance of survival, perhaps because they were given priority in the lifeboats.**
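
One way to check an age-related observation like this numerically, rather than by eye, is to bin ages and take the mean of `Survived` per bin. The sketch below uses a hypothetical helper, `survival_rate_by_age_band`, and a made-up toy frame; the bin edges are an arbitrary choice, not something the notebook prescribes.

```python
import pandas as pd

def survival_rate_by_age_band(df, bins=(0, 10, 20, 40, 60, 100)):
    """Hypothetical helper: mean of 'Survived' per age band."""
    # right=False makes each band include its lower edge, e.g. [0, 10)
    bands = pd.cut(df["Age"], bins=list(bins), right=False)
    # observed=False keeps empty bands in the result (as NaN)
    return df.groupby(bands, observed=False)["Survived"].mean()

# Made-up data, only to show the shape of the result
toy = pd.DataFrame({
    "Age": [4, 8, 25, 30, 35, 70],
    "Survived": [1, 1, 0, 1, 0, 0],
})
rates = survival_rate_by_age_band(toy)
```

Calling `survival_rate_by_age_band(train)` on the real data would give one rate per band. Keep in mind that missing ages were filled with the median earlier, which inflates the band containing that value.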

In [None]:
sns.boxplot(data=train, x='Survived', y='Age')
plt.title('Age Distribution by Survival Status')
plt.show()

In [None]:
sexes = ['male', 'female']
survived = train['Sex'][train['Survived'] == 1].value_counts().reindex(sexes)
no_survived = train['Sex'][train['Survived'] == 0].value_counts().reindex(sexes)

# Use numeric x positions so the two bar groups sit side by side
x = np.arange(len(sexes))
plt.bar(x - 0.2, no_survived, width=0.4, label="Not Survived")
plt.bar(x + 0.2, survived, width=0.4, label="Survived")

plt.xticks(x, sexes)
plt.ylabel("Count")
plt.title('Sex Distribution by Survival Status')
plt.legend()
plt.show()

In [None]:
plt.pie(train['Survived'].value_counts().sort_index(), labels=['Not Survived', 'Survived'], autopct='%1.1f%%')  # sort_index() puts 0 before 1 so the labels match
plt.show()

In [None]:
# One-hot encode the categorical columns in a single call
train = pd.get_dummies(train, columns=['Sex', 'Pclass', 'Embarked'], dtype=int)
train

In [None]:
# Apply the same encoding to the test set
test = pd.get_dummies(test, columns=['Sex', 'Pclass', 'Embarked'], dtype=int)
test
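
Encoding the two frames separately works only as long as every category appears in both of them. If a category were missing from the test set, its dummy column would be absent and the model would see a different set of features. A hedged sketch of one way to guard against this, using made-up toy frames (`toy_train`, `toy_test`, `train_enc`, and `test_enc` are illustrative names):

```python
import pandas as pd

# Toy frames: the test set happens to contain only port 'S' (made-up data)
toy_train = pd.DataFrame({"Embarked": ["S", "C", "Q"], "Fare": [7.25, 71.28, 8.05]})
toy_test = pd.DataFrame({"Embarked": ["S", "S"], "Fare": [7.9, 8.0]})

train_enc = pd.get_dummies(toy_train, columns=["Embarked"], dtype=int)
test_enc = pd.get_dummies(toy_test, columns=["Embarked"], dtype=int)

# Give the test frame exactly the training columns, in the same order,
# filling dummy columns that were absent with 0
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
```

With the notebook's own frames, the reindex would be done against the feature columns only, because `train` also contains `Survived`.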

## Building the model:

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier


In [None]:
X = train.drop('Survived', axis=1)
y = train['Survived']
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
model = RandomForestClassifier(random_state=42)  # fixed seed so results are reproducible
model.fit(X_train, y_train)


## Model Evaluation:

In [None]:
from sklearn.metrics import accuracy_score

predictions = model.predict(X_val)
accuracy = accuracy_score(y_val, predictions)
print(f'Accuracy: {accuracy:.2f}')
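
Accuracy is a single number and does not show which class the model gets wrong. A confusion table can complement it. The sketch below builds one with `pd.crosstab` on made-up labels; `y_true`, `y_pred`, `confusion`, and `toy_accuracy` are illustrative names, not the notebook's variables.

```python
import pandas as pd

# Made-up labels standing in for the validation labels and predictions
y_true = pd.Series([0, 0, 1, 1, 1, 0], name="Actual")
y_pred = pd.Series([0, 1, 1, 0, 1, 0], name="Predicted")

# Rows are actual classes, columns are predicted classes
confusion = pd.crosstab(y_true, y_pred)

# Accuracy is the share of matching labels (the diagonal of the table)
toy_accuracy = (y_true == y_pred).mean()
```

With the notebook's variables, passing `y_val.to_numpy()` and `predictions` should work the same way; converting to an array avoids aligning on the shuffled index of `y_val`.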


In [None]:
test.isna().sum()

## Preparing the submission file:

In [None]:
test_predictions = model.predict(test)
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": test_predictions
})
submission.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")
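
Before uploading, it can help to sanity-check the file's structure. The following is a sketch of a hypothetical helper, `check_submission`, that assumes the expected format is exactly the two columns written above, one row per test passenger, and 0/1 predictions.

```python
import pandas as pd

def check_submission(submission, expected_rows):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    if list(submission.columns) != ["PassengerId", "Survived"]:
        problems.append("columns must be exactly PassengerId, Survived")
        return problems  # the remaining checks need both columns
    if len(submission) != expected_rows:
        problems.append(f"expected {expected_rows} rows, got {len(submission)}")
    if submission["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId values")
    if not submission["Survived"].isin([0, 1]).all():
        problems.append("Survived must contain only 0 or 1")
    return problems

# Made-up frames to exercise the helper
good = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})
bad = pd.DataFrame({"PassengerId": [892, 892], "Survived": [0, 2]})
```

In the notebook this would be called as `check_submission(submission, len(test))`, and an empty list means none of these checks failed.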