<a href="https://colab.research.google.com/github/Advanced-Data-Science-TU-Berlin/Data-Science-Training-Python-Part-2/blob/main/interactive_notebooks/2_2_2_Titanic_Survival_KNN.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic: Machine Learning from Disaster
![picture](https://creazilla-store.fra1.digitaloceanspaces.com/cliparts/1722941/titanic-clipart-md.png)

The Titanic dataset is a well-known dataset in the field of data science and machine learning. It contains information about passengers on the Titanic, including whether they survived or not. Here are the details about the columns in the dataset:

- PassengerId: A unique identifier for each passenger.
- Survived: Indicates whether the passenger survived (1) or not (0).
- Pclass: Ticket class (1st, 2nd, or 3rd).
- Name: Passenger's name.
- Sex: Passenger's gender (male or female).
- Age: Passenger's age.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Ticket: Ticket number.
- Fare: Passenger fare.
- Cabin: Cabin number.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

The primary goal of working with this dataset is often to predict whether a passenger survived based on other features. It's a binary classification problem, where 'Survived' is the target variable.

## Insights and Considerations:
- Missing Values: The dataset has missing values in the 'Age', 'Cabin', and 'Embarked' columns. Handling them is an important part of preprocessing.

- Categorical Features: Features like 'Sex' and 'Embarked' are categorical. These need to be converted to numerical values for machine learning algorithms.

- Feature Engineering: Creating new features or modifying existing ones can enhance the predictive power of the model. For example, creating a 'FamilySize' feature by combining 'SibSp' and 'Parch' might be useful.

- Exploratory Data Analysis (EDA): EDA helps in understanding the distribution of data, identifying patterns, and making informed decisions during preprocessing.
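
As a small illustration of the feature engineering idea above, the sketch below builds a 'FamilySize' column on a toy DataFrame. The toy values and the `+ 1` (counting the passenger themselves) are assumptions made for this example, not something the dataset prescribes.

```python
import pandas as pd

# Toy rows that reuse the Titanic column names (values are made up)
toy_df = pd.DataFrame({
    "SibSp": [1, 0, 3],
    "Parch": [0, 0, 2],
})

# Assumption: family size includes the passenger, hence the + 1
toy_df["FamilySize"] = toy_df["SibSp"] + toy_df["Parch"] + 1

print(toy_df)
```

The same line can be applied to titanic_df once it is loaded. Whether the new feature actually helps the model is something to check during evaluation.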


In [1]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

In [None]:
# Load the Titanic dataset
# Hint: use pd.read_csv and pass the url
url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
titanic_df = <your-code-here>

# Display the first few rows of the dataset
titanic_df.head()

## Exploratory Data Analysis (EDA)

In [None]:
# Explore the dataset

# Basic Information
# Hint: use titanic_df.info()
# Note: info() prints its summary directly and returns None, so no print() is needed
<your-code-here>

# Summary Statistics
# Hint: use titanic_df.describe()
print(<your-code-here>)

# Check for missing values
# Hint: use titanic_df.isnull().sum()
print(<your-code-here>)

## Visualizing Target Variable

In [None]:
# Visualize the distribution of the target variable 'Survived'
# Hint: use 'Survived' as the input of countplot
sns.countplot(x=<target-variable>, data=titanic_df)

plt.title('Distribution of Survived (1: Survived, 0: Not Survived)')
plt.show()

In [None]:
# Point-wise Correlation Pairplot

# Select numerical columns for correlation pairplot
numerical_cols = titanic_df.select_dtypes(include=[np.number]).columns

# Create a point-wise correlation pairplot
# Hint: use titanic_df[numerical_cols] as the input for pairplot and 'Survived' for hue
sns.pairplot(<df-with-numerical-columns>, hue=<target-variable>, height=1.8,
                  aspect=1.8, plot_kws=dict(edgecolor="k",
                  linewidth=0.5), diag_kind="kde")
plt.suptitle("Point-wise Correlation Pairplot", y=1.02)
plt.show()

In [7]:
# Handle missing values

# Drop 'Cabin' column due to high missing values
# Hint: use titanic_df.drop and pass 'Cabin' and axis=1 as inputs
titanic_df = titanic_df.drop('Cabin', axis=1)

# Fill missing values in 'Age' with the median
# Hint: use titanic_df['Age'].median() inside fillna
# Assign the result back: calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas
titanic_df['Age'] = titanic_df['Age'].fillna(<age-median>)

# Fill missing values in 'Embarked' with the mode
# Hint: use titanic_df['Embarked'].mode()[0] inside fillna
titanic_df['Embarked'] = titanic_df['Embarked'].fillna(<embarked-mode>)

In [8]:
# Handle categorical features

# Convert 'Sex' to numerical values (0 for Female, 1 for Male)
# Hint: use {'female': 0, 'male': 1} for the map
titanic_df['Sex'] = titanic_df['Sex'].map(<gender-mapping-dict>)

# Convert 'Embarked' to numerical values using one-hot encoding
titanic_df = pd.get_dummies(titanic_df, columns=['Embarked'], drop_first=True)
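
The one-hot encoding step above produces only two columns for three ports. The following sketch, on a made-up toy column, shows why: with drop_first=True the first category in sorted order ('C') is dropped, so a row that is false in both remaining columns stands for Cherbourg.

```python
import pandas as pd

# Toy column with the three embarkation codes (values are made up)
toy_df = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# drop_first=True removes the first category in sorted order ('C')
encoded = pd.get_dummies(toy_df, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

Dropping one column avoids storing redundant information, since the third port can always be inferred from the other two.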

## Feature Selection

In [10]:
# Select relevant features for classification
X = titanic_df[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = titanic_df['Survived']

## Train/Test Split
Let's split the data into training and test sets. The train_test_split function in sklearn.model_selection does this in one call, which makes our lives easier. Remember that you can also split the data manually.
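
For comparison, a manual split might look roughly like the sketch below. It uses toy arrays in place of X and y, and the variable names and seed are assumptions for illustration. It will not reproduce the exact rows that train_test_split selects, only the same idea of shuffling and holding out 20%.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed so the split is reproducible

# Toy data standing in for X and y (10 samples, 2 features)
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

# Shuffle the row indices, then hold out the first 20% as the test set
indices = rng.permutation(len(X_toy))
n_test = int(len(X_toy) * 0.2)
test_idx, train_idx = indices[:n_test], indices[n_test:]

X_train_toy, X_test_toy = X_toy[train_idx], X_toy[test_idx]
y_train_toy, y_test_toy = y_toy[train_idx], y_toy[test_idx]

print(X_train_toy.shape, X_test_toy.shape)
```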

In [11]:
# Split the dataset into training and testing sets
# Hint: use train_test_split and pass X, y, test_size=0.2, random_state=42
X_train, X_test, y_train, y_test = <your-code-here>

## Feature Scaling

Feature scaling is crucial in K-Nearest Neighbors (KNN) and many other machine learning algorithms because it helps ensure that all features contribute equally to the distance computations. In the case of KNN, the algorithm classifies a data point by considering the majority class among its k nearest neighbors. The "distance" between data points is typically measured using metrics like Euclidean distance.

Here's why feature scaling is important in the context of KNN:

- Distance Computation: KNN relies on the concept of proximity or distance between data points. Features with larger scales or ranges may dominate the distance computation compared to features with smaller scales. This can lead to biased predictions.

- Uniform Contribution: Scaling features to a similar range ensures that each feature contributes more uniformly to the distance metric. Without scaling, a feature with a larger scale could overshadow the influence of other features.

- Equal Importance: Scaling is essential when the features are measured in different units or have different magnitudes. For example, 'Age' is measured in years and rarely exceeds 80, while 'Fare' is a monetary amount that can reach into the hundreds. Scaling makes each feature contribute proportionally to the overall distance, so that no single feature dominates the calculation.

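- Faster Convergence: This benefit applies to models trained by iterative optimization, such as logistic regression or neural networks fitted with gradient descent, which often converge faster when the features are on a similar scale. KNN itself has no iterative training phase (fitting essentially stores the training data), so for KNN the benefit of scaling lies entirely in the distance computation.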
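
To make the distance argument concrete, here is a minimal sketch with three hypothetical passengers described only by age and fare. The values and the helper function `euclidean` are invented for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical passengers: columns are [Age, Fare]
X_toy = np.array([
    [22.0, 7.25],    # passenger 0
    [24.0, 512.0],   # passenger 1: similar age, very different fare
    [70.0, 8.0],     # passenger 2: similar fare, very different age
])

def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return np.sqrt(np.sum((a - b) ** 2))

# Distances from passenger 0 on the raw features
d_raw_01 = euclidean(X_toy[0], X_toy[1])
d_raw_02 = euclidean(X_toy[0], X_toy[2])

# Distances from passenger 0 after standardization
X_scaled = StandardScaler().fit_transform(X_toy)
d_scaled_01 = euclidean(X_scaled[0], X_scaled[1])
d_scaled_02 = euclidean(X_scaled[0], X_scaled[2])

print(f"raw:    d(0,1)={d_raw_01:.2f}  d(0,2)={d_raw_02:.2f}")
print(f"scaled: d(0,1)={d_scaled_01:.2f}  d(0,2)={d_scaled_02:.2f}")
```

On the raw features the fare difference makes passenger 1 look far more distant than passenger 2, even though the age gap to passenger 2 is large. After standardization the two distances are of similar size, because both differences are expressed in standard deviations.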

In [12]:
# Standardize features
scaler = StandardScaler()

# Fit and transform training data
# Hint: use scaler.fit_transform on X_train
X_train_scaled = <your-code-here>

# Transform test data
# Hint: use scaler.transform on X_test
X_test_scaled = <your-code-here>

## Building and Training a K-Nearest Neighbors (KNN) Classifier
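
Before using the scikit-learn implementation, it may help to see the majority-vote idea in a few lines. The function `knn_predict` below is a hypothetical, simplified sketch written for this notebook, with toy data. It is not how KNeighborsClassifier is implemented internally, which uses more efficient neighbor search.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    # Euclidean distance from x_new to every training point
    distances = np.sqrt(np.sum((X_train - x_new) ** 2, axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy training set: three points of class 0 near the origin, two of class 1 near (1, 1)
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([0, 0, 0, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.15, 0.15]))
pred_b = knn_predict(X_toy, y_toy, np.array([0.95, 1.0]))
print(pred_a, pred_b)
```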

In [None]:
# Create and train the KNN Classifier
# Hint: use KNeighborsClassifier and pass n_neighbors=5 as input
knn_classifier = <your-code-here>

# Fit the knn classifier on training data
# Hint: use knn_classifier.fit and pass X_train_scaled and y_train as input
knn_classifier.fit(X_train_scaled, y_train)

## Evaluating the Algorithm
The confusion matrix, precision, recall, and F1 score are the most commonly used metrics for evaluating a classifier. The confusion_matrix and classification_report functions in sklearn.metrics compute them, and accuracy_score gives the overall fraction of correct predictions. Take a look at the following script:

In [None]:
# Make predictions on the test set
# Hint: use knn_classifier.predict and pass X_test_scaled
y_pred = <your-code-here>

# Evaluate the model
# Calculate accuracy score
# Hint: use accuracy_score and pass y_test, y_pred
accuracy = <your-code-here>

# Calculate confusion matrix
# Hint: use confusion_matrix and pass y_test, y_pred
conf_matrix = <your-code-here>

# Create classification report
# Hint: use classification_report and apply y_test, y_pred
classification_rep = <your-code-here>

# Display evaluation metrics
print(f"Accuracy: {accuracy:.2f}")
print("\nConfusion Matrix:")
print(conf_matrix)
print("\nClassification Report:")
print(classification_rep)
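
To connect the numbers in the classification report to the confusion matrix, the sketch below derives precision, recall, and F1 for the positive class by hand. The label arrays are made up for illustration and have nothing to do with the model's actual predictions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (1 = survived, 0 = did not survive)
y_true_toy = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred_toy = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# For binary labels, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many survived
recall = tp / (tp + fn)     # of the actual survivors, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```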