# Titanic - Machine learning 

In this notebook, we look for the best classifier to predict which passengers survived the Titanic disaster.

## Import Python libraries



In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# visualization
import seaborn as sns # seaborn for visualization 
import matplotlib.pyplot as plt # matplotlib for visualization
%matplotlib inline

from sklearn.preprocessing import StandardScaler

# model
from sklearn.linear_model import LogisticRegression # Logistic Regression
from sklearn.ensemble import RandomForestClassifier # RandomForest Classifier
from sklearn.svm import SVC, LinearSVC # SVC
from sklearn.neighbors import KNeighborsClassifier # KNN
from sklearn.naive_bayes import GaussianNB # GaussianNB
from sklearn.linear_model import Perceptron # Perceptron
from sklearn.linear_model import SGDClassifier # SGD
from sklearn.tree import DecisionTreeClassifier # Tree Decision

# metrics
from sklearn.metrics import f1_score, accuracy_score

## Import data

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

In [None]:
# train data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")

# test data
test_data = pd.read_csv("/kaggle/input/titanic/test.csv")

# gender submission
gender_submission = pd.read_csv("/kaggle/input/titanic/gender_submission.csv")

In [None]:
train_data.head()

In [None]:
train_data.info()

## Cleaning Data

### Train data

In [None]:
# Count missing values per column
missing_by_column = train_data.isnull().sum()

# Count missing values per row
missing_by_row = train_data.isnull().sum(axis=1)
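
As a minimal sketch of how such counts can guide the decision to drop a column, the example below computes the share of missing values per column on a toy frame. The toy values and the 50% threshold are assumptions made for illustration, not values taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_data (hypothetical values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Share of missing values per column, highest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Assumed rule of thumb: columns missing more than half of their values
# are candidates for removal.
to_drop = missing_share[missing_share > 0.5].index.tolist()
print(to_drop)
```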

In [None]:
# Cabin has too many missing values, so we drop this feature from our study
train_data = train_data.drop('Cabin', axis=1)

### Test data

In [None]:
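# Fill the missing fare with the mean fare (assignment avoids in-place edits on a column selection)
test_data['Fare'] = test_data['Fare'].fillna(test_data['Fare'].mean())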

## Explore a pattern

Remember that the sample submission file in gender_submission.csv assumes that all female passengers survived (and all male passengers died).

Is this a reasonable first guess? We'll check if this pattern holds true in the data (in train.csv).

The cell below computes the survival rate for each sex.

In [None]:
train_data[["Sex", "Survived"]].groupby(['Sex'], as_index=False).mean().sort_values(by='Survived', ascending=False)

The two classes are not evenly balanced (the next cell checks this), so the F1-score seems a useful metric for assessing the models.

In [None]:
# Count non-survivors (0) and survivors (1); sort_index keeps the bars in 0, 1 order
survival_counts = train_data['Survived'].value_counts().sort_index()

# Draw bar plot
plt.figure(figsize=(6, 4))
survival_counts.plot(kind='bar', color=['pink', 'skyblue'])
plt.title('Breakdown of survivors and non-survivors')
plt.xlabel('Survived')
plt.ylabel('Number of passengers')
plt.xticks([0, 1], ['No', 'Yes'], rotation=0)
plt.grid(axis='y')
plt.show()

There is a clear imbalance between the two populations: non-survivors outnumber survivors by roughly 62% to 38%.
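
To illustrate why accuracy alone can mislead on imbalanced classes, here is a small sketch with made-up labels (not the Titanic data). A classifier that always predicts the majority class still reaches 60% accuracy, while its F1-score on the survivor class is zero.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical labels: 6 non-survivors (0) and 4 survivors (1).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]

# A trivial classifier that always predicts the majority class.
y_majority = [0] * 10

acc = accuracy_score(y_true, y_majority)
# zero_division=0 silences the warning raised when no positive is predicted.
f1 = f1_score(y_true, y_majority, zero_division=0)
print(acc, f1)
```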

In [None]:
# Split the DataFrame into survivors and non-survivors
survived = train_data[train_data['Survived'] == 1]  # Survivors
not_survived = train_data[train_data['Survived'] == 0]  # Non-survivors

# Draw the histogram
plt.figure(figsize=(10, 6))

# Age distribution of survivors (in blue); missing ages are dropped
plt.hist(survived['Age'].dropna(), bins=20, color='skyblue', edgecolor='black', alpha=0.7, label='Survivors')

# Age distribution of non-survivors (in pink); missing ages are dropped
plt.hist(not_survived['Age'].dropna(), bins=20, color='pink', edgecolor='black', alpha=0.7, label='Non-survivors')

plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age distribution by survival')
plt.legend()
plt.grid(True)
plt.show()

Deaths are not spread evenly across age groups: the two histograms have different shapes, so age appears to carry some information about survival.

In [None]:
train_data.describe()

In [None]:
train_data[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
train_data[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)

In [None]:
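# Mean fare by sex, split by port of embarkation (rows) and survival (columns)
grid = sns.FacetGrid(train_data, row='Embarked', col='Survived', aspect=1.6)
# Fix the category order so every facet shows the bars in the same position;
# errorbar=None (seaborn >= 0.12) replaces the deprecated ci=None
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, errorbar=None, order=['male', 'female'])
grid.add_legend()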

## Relationship between features

We examine the correlation between the target and each numerical feature.

In [None]:
# Compute correlation matrix
correlation_matrix = train_data[["Parch", "SibSp", "Pclass", "Age", "Fare", "Survived"]].corr()

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Heatmap of correlation matrix')
plt.show()

In [None]:
y = train_data["Survived"] # the target

features = ["Parch", "SibSp", "Pclass", "Fare", "Embarked"] # features
X = pd.get_dummies(train_data[features])
X_test = pd.get_dummies(test_data[features])
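
Calling `pd.get_dummies` separately on the training and test frames only yields matching columns when both contain the same categories. The sketch below, on hypothetical miniature frames, shows one way to guard against a mismatch by reindexing the test columns on the training columns. The frame names and values are assumptions for illustration.

```python
import pandas as pd

# Hypothetical miniature train/test frames; the test frame lacks port "Q".
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Fare": [7.83, 9.69], "Embarked": ["S", "C"]})

X_train = pd.get_dummies(train)

# Reuse the training columns; categories unseen in the test frame
# become all-zero columns, and extra test-only columns are dropped.
X_new = pd.get_dummies(test).reindex(columns=X_train.columns, fill_value=0)

print(list(X_new.columns))
```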

None of the features appear to be highly correlated with one another, so it makes sense to keep them all in our study.

## Random Forest Model

We'll build what's known as a **random forest model**. This model is made up of several "trees" (we'll construct 100) that each consider a passenger's data and vote on whether the individual survived. The random forest then makes a democratic decision: the outcome with the most votes wins!
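
The voting idea can be sketched by asking each fitted tree for its own prediction and counting the votes. The data below is synthetic (an assumption, standing in for the Titanic features). Note that scikit-learn's `RandomForestClassifier` averages the trees' class probabilities rather than counting hard votes, so the two results usually agree but need not be identical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic binary data standing in for the Titanic features (an assumption).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
forest.fit(X_demo, y_demo)

# Each fitted tree predicts a class for every sample: shape (n_trees, n_samples).
tree_votes = np.array([tree.predict(X_demo) for tree in forest.estimators_])

# Hard majority vote: class 1 wins when more than half of the trees pick it.
majority = (tree_votes.mean(axis=0) > 0.5).astype(int)

# Share of samples where the hard vote matches the forest's own prediction.
agreement = (majority == forest.predict(X_demo)).mean()
print(agreement)
```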

In [None]:
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(X, y)
predictions = model.predict(X_test)

output = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': predictions})

In [None]:
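# gender_submission is a sample submission, not the true test labels, so this F1-score
# measures agreement with the gender-only baseline rather than real performance
acc_random_forest = f1_score(gender_submission["Survived"], output["Survived"])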

In [None]:
print(accuracy_score(gender_submission["Survived"], output["Survived"]))
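
Since `gender_submission.csv` is a sample submission rather than the true test labels, the scores above reflect agreement with the gender-only baseline. One common alternative, sketched here on synthetic data (an assumption), is to estimate the F1-score by cross-validation on labelled training data. In the notebook, `X` and `y` could play the role of `X_demo` and `y_demo`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic labelled data standing in for the training set (an assumption).
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=1)

forest = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)

# 5-fold cross-validation: each fold is held out once and scored with F1.
cv_f1 = cross_val_score(forest, X_demo, y_demo, cv=5, scoring="f1")
print(cv_f1.mean())
```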

## Logistic Regression Model

We'll build what's known as a **logistic regression model**. Logistic regression is a useful model to run early in the workflow. It measures the relationship between the categorical dependent variable (the target) and one or more independent variables (the features) by estimating probabilities using a logistic function, which is the cumulative distribution function of the logistic distribution.
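
As a minimal sketch of how the logistic function turns a weighted sum of features into a probability, consider the example below. The weights, bias, and feature values are hypothetical and not fitted to the Titanic data.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical coefficients and one hypothetical standardized passenger.
weights = np.array([0.8, -1.2])
bias = 0.1
passenger = np.array([1.0, 0.5])

# Linear score, then probability of the positive class.
probability = sigmoid(weights @ passenger + bias)

# Predict class 1 when the probability reaches 0.5.
prediction = int(probability >= 0.5)
print(probability, prediction)
```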


In [None]:
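scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_scaled = scaler.fit_transform(X)
# Reuse the training statistics on the test data (transform, not fit_transform)
X_test_scaled = scaler.transform(X_test)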
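
Scaling statistics are normally learned from the training data and then reused on the test data. The sketch below uses made-up fares to show how refitting the scaler on the test set produces different, inconsistent values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical fares (one feature column).
train_fares = np.array([[10.0], [20.0], [30.0]])
test_fares = np.array([[20.0], [40.0]])

# Fit on the training data, then apply the same mean/std to the test data.
demo_scaler = StandardScaler().fit(train_fares)
scaled_test = demo_scaler.transform(test_fares)

# Refitting on the test data uses the test set's own mean/std instead.
refit_test = StandardScaler().fit_transform(test_fares)

print(scaled_test.ravel())
print(refit_test.ravel())
```

A fare of 20 equals the training mean, so it maps to 0 with the shared scaler, but to -1 when the scaler is refitted on the test set.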

In [None]:
# Fit on the standardized features computed above
logreg = LogisticRegression()
logreg.fit(X_scaled, y)
Y_pred_logreg = logreg.predict(X_test_scaled)

output_logreg = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': Y_pred_logreg})

In [None]:
acc_log = f1_score(gender_submission["Survived"], output_logreg["Survived"])

In [None]:
print(accuracy_score(gender_submission["Survived"], output_logreg["Survived"]))

## Support Vector Machine

We'll build what's known as a **support vector machine** (SVM). SVMs are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Given a set of training samples, each marked as belonging to one of two categories, an SVM training algorithm builds a model that assigns new test samples to one category or the other, making it a non-probabilistic binary classifier. In its basic form the classifier is linear; scikit-learn's `SVC()` uses an RBF kernel by default, so the decision boundary here can be non-linear.

In [None]:
svc = SVC()
svc.fit(X_scaled, y)
Y_pred_svc = svc.predict(X_test_scaled)

output_svc = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': Y_pred_svc})

In [None]:
acc_svc = f1_score(gender_submission["Survived"], output_svc["Survived"])

In [None]:
print(accuracy_score(gender_submission["Survived"], output_svc["Survived"]))

## Gaussian Naive Bayes

We'll build what's known as a **Gaussian Naive Bayes** classifier. In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes' theorem with strong (naive) independence assumptions between the features. Naive Bayes classifiers are highly scalable, requiring a number of parameters linear in the number of variables (features) in a learning problem.

In [None]:
gaussian = GaussianNB()
gaussian.fit(X_scaled, y)
Y_pred_gaussian = gaussian.predict(X_test_scaled)

output_gaussian = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': Y_pred_gaussian})

In [None]:
acc_gaussian = f1_score(gender_submission["Survived"], output_gaussian["Survived"])

In [None]:
print(accuracy_score(gender_submission["Survived"], output_gaussian["Survived"]))

## Perceptron

We'll build what's known as a **Perceptron**. The perceptron is an algorithm for supervised learning of binary classifiers (functions that can decide whether an input, represented by a vector of numbers, belongs to some specific class or not).
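
The classic perceptron learning rule can be sketched in a few lines of NumPy. This is an illustration on a tiny hypothetical data set, not scikit-learn's internal implementation: the weights are nudged toward each misclassified sample until every sample is classified correctly.

```python
import numpy as np

# Tiny linearly separable data set (hypothetical): label 1 when x0 + x1 > 1.
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [2.0, 1.0]])
y_demo = np.array([0, 0, 0, 1, 1])

w = np.zeros(2)       # one weight per feature
b = 0.0               # bias term
learning_rate = 1.0   # assumed step size

for epoch in range(50):
    errors = 0
    for x_i, target in zip(X_demo, y_demo):
        predicted = int(w @ x_i + b > 0)
        # The update is zero when the prediction is already correct.
        update = learning_rate * (target - predicted)
        if update != 0:
            w += update * x_i
            b += update
            errors += 1
    if errors == 0:   # a full pass without mistakes: converged
        break

predictions = (X_demo @ w + b > 0).astype(int)
print(w, b, predictions)
```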


In [None]:
perceptron = Perceptron()
perceptron.fit(X_scaled, y)
Y_pred_perceptron = perceptron.predict(X_test_scaled)

output_perceptron = pd.DataFrame({'PassengerId': test_data.PassengerId, 'Survived': Y_pred_perceptron})

In [None]:
acc_perceptron = f1_score(gender_submission["Survived"], output_perceptron["Survived"])

In [None]:
print(accuracy_score(gender_submission["Survived"], output_perceptron["Survived"]))

## Conclusion

In [None]:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'Logistic Regression', 
              'Random Forest', 'Naive Bayes', 'Perceptron' 
              ],
    'Score': [acc_svc, acc_log, 
              acc_random_forest, acc_gaussian, acc_perceptron]})
models.sort_values(by='Score', ascending=False)

In [None]:
output_svc.to_csv('submission2.csv', index=False)