# Beginner-Friendly Machine Learning Tutorial: Titanic Dataset

Welcome! This notebook will guide you step-by-step through a hands-on machine learning workflow using the Titanic dataset. You will learn how to:

- Load and explore data
- Preprocess and clean data
- Build and evaluate a machine learning model
- Visualize results

Let's get started!

In [1]:
# Add project root to sys.path so we can import custom modules from anywhere
import sys
import os

project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
if project_root not in sys.path:
    sys.path.insert(0, project_root)

## 1. Setup: Import Libraries and Configure Environment

Let's import the necessary Python libraries and set up our environment.

In [2]:
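import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Explicit imports instead of `import *`, so it is clear which names come from src/utility.py
from src.utility import (
    GraphAnalyzerEngine,
    MetaCleanPipeline,
    VisualizerFactory,
    relabel_targets_in_store,
)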

### About the Custom Machine Learning Classes

This notebook uses several custom classes from `src/utility.py` to simplify the machine learning workflow:

- **MetaCleanPipeline**: Automates data cleaning, feature engineering, and target encoding.
- **FeatureEngineeringSelector**: Flexible feature selection and engineering.
- **DataCleaner**: Handles missing values, scaling, encoding, and duplicate removal.
- **GraphAnalyzerEngine**: Analyzes feature importances and relationships.
- **VisualizerFactory**: Provides advanced visualizations.

These tools make the machine learning process more accessible and interpretable.
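
For readers who want to see what such cleaning steps look like without the custom classes, here is a minimal sketch in plain scikit-learn. It is not the implementation of `DataCleaner` or `MetaCleanPipeline`; the tiny DataFrame and the variable names (`numeric_steps`, `categorical_steps`, `cleaner`) are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Tiny hand-made frame with Titanic-like columns (illustrative values only).
df = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0],
    "fare": [7.25, 71.28, 7.93, 53.10],
    "sex": ["male", "female", "female", np.nan],
})

numeric_cols = ["age", "fare"]
categorical_cols = ["sex"]

# Numeric columns: fill gaps with the column mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns: fill gaps with the most frequent value, then one-hot encode.
categorical_steps = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

cleaner = ColumnTransformer([
    ("num", numeric_steps, numeric_cols),
    ("cat", categorical_steps, categorical_cols),
])

cleaned = cleaner.fit_transform(df)
print(cleaned.shape)  # two numeric columns plus one column per category of `sex`
```

Keeping imputation and encoding inside one transformer means the same steps can later be fitted on training data only and reused on test data.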

## 2. Load and Explore the Titanic Dataset

The Titanic dataset contains information about passengers and whether they survived. Let's load the data and take a first look.

In [3]:
titanic = sns.load_dataset('titanic')
titanic.head()

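|   | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 | S | Third | woman | False | NaN | Southampton | yes | True |
| 3 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 | S | First | woman | False | C | Southampton | yes | False |
| 4 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 | S | Third | man | True | NaN | Southampton | no | True |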
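
Beyond `head()`, a first look usually includes column types, missing-value counts, and a simple group summary. The sketch below does this on a small frame typed in by hand from the five rows shown above (a subset of the columns), so it runs without downloading anything; the names `sample`, `missing_per_column`, and `survival_by_sex` are illustrative.

```python
import numpy as np
import pandas as pd

# Hand-typed copy of a few columns from the five rows displayed above.
sample = pd.DataFrame({
    "survived": [0, 1, 1, 1, 0],
    "pclass": [3, 1, 3, 1, 3],
    "sex": ["male", "female", "female", "female", "male"],
    "age": [22.0, 38.0, 26.0, 35.0, 35.0],
    "deck": [np.nan, "C", np.nan, "C", np.nan],
})

# Column types: numeric vs. text columns need different preprocessing.
print(sample.dtypes)

# Missing values per column: shows which columns need imputation.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Share of survivors per group (the mean of a 0/1 column is a rate).
survival_by_sex = sample.groupby("sex")["survived"].mean()
print(survival_by_sex)
```

On the full dataset the same three calls work unchanged with `titanic` in place of `sample`.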


## 3. Data Preprocessing: Feature Selection and Cleaning

To prepare our data for machine learning, we need to:
- Separate features (inputs) from the target (output)
- Clean and preprocess the data (handle missing values, encode categories, etc.)

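**Note:** The 'sex', 'embarked', and 'class' features are categorical and must be encoded as numbers before most models can use them. Also be aware that 'alive' is a yes/no text copy of 'survived', so leaving it among the features leaks the answer to the model.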

In [4]:
# Drop rows with missing target (survived)
titanic = titanic.dropna(subset=['survived'])
# Separate features (X) and target (y)
X = titanic.drop(columns=['survived'])
y = titanic['survived']

# Convert all string columns to categorical
for col in X.select_dtypes(include=['object']).columns:
    X[col] = X[col].astype('category')

# Convert boolean columns to string/object dtype for compatibility with SimpleImputer
# (collect the column names first; after conversion no bool columns remain to report)
bool_cols = X.select_dtypes(include=['bool']).columns.tolist()
for col in bool_cols:
    X[col] = X[col].astype(str)
print('Boolean columns converted to string:', bool_cols)

Boolean columns converted to string: ['adult_male', 'alone']


In [5]:
# Create and configure the data cleaning and feature engineering pipeline
meta_pipe = MetaCleanPipeline(
    drop_duplicates=True,
    feature_engineering_strategies=[{
        'name': 'model_importance',
        'model_cls': RandomForestClassifier,
        'threshold': 'mean'
    }],
    numeric_strategy='mean',
    categorical_strategy='most_frequent',
    target_encoder_method='label',
    auto_encode_features=False,
    # You can add more cleaner_kwargs if needed
)

# Fit the pipeline and transform the data
X_clean = meta_pipe.fit_transform(X, y)
print(X_clean.head())
print('Final features:', meta_pipe.get_selected_features())

# Retrieve cleaned features and encoded target for training
X_cleaned, y_enc = meta_pipe.get_cleaned_dataset()
y_original = meta_pipe.inverse_transform_target(y_enc)
print('y_original unique:', np.unique(y_original))
print('y_enc unique:', np.unique(y_enc))
print('shape X_cleaned:', X_cleaned.shape)
print('shape y_enc:', y_enc.shape)
y_full_enc = meta_pipe.transform_target(y_original)

# If you want to decode y_enc back to original labels:
print(meta_pipe.inverse_transform_target(np.unique(y_enc)))

   who_man  adult_male_True  alive_no  alive_yes
0      1.0              1.0       1.0        0.0
1      0.0              0.0       0.0        1.0
2      0.0              0.0       0.0        1.0
3      0.0              0.0       0.0        1.0
4      1.0              1.0       1.0        0.0
Final features: ['who_man', 'adult_male_True', 'alive_no', 'alive_yes']
y_original unique: [0 1]
y_enc unique: [0 1]
shape X_cleaned: (784, 14)
shape y_enc: (784,)
[0 1]
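
The selected features above include `alive_no` and `alive_yes`. In this dataset `alive` is a yes/no copy of `survived`, so an importance-based selector will naturally rank it first. The sketch below illustrates the effect with scikit-learn's `SelectFromModel`, which is a comparable technique but not necessarily what `MetaCleanPipeline` does internally. The data is synthetic and the column `alive_code` is a hypothetical stand-in for the leaked column.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
n_rows = 200

# Synthetic stand-in data: two unrelated features plus one column that copies the target.
y_demo = rng.integers(0, 2, size=n_rows)
X_demo = pd.DataFrame({
    "age": rng.normal(30, 10, size=n_rows),
    "fare": rng.normal(30, 15, size=n_rows),
    "alive_code": y_demo,  # leaked copy of the target, like `alive` vs `survived`
})

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    threshold="mean",
)
selector.fit(X_demo, y_demo)

importances = pd.Series(selector.estimator_.feature_importances_, index=X_demo.columns)
selected = list(X_demo.columns[selector.get_support()])
print(importances.round(3))
print("Selected:", selected)
```

A column that dominates the importances this strongly is worth checking against the target before training; dropping `alive` from `X` alongside `survived` avoids the leak.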


## 4. Model Training and Evaluation

Now, let's train a machine learning model to predict survival. We'll:
- Split the data into training and test sets
- Train a Random Forest classifier
- Evaluate the model's performance using a classification report and confusion matrix
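
To make the evaluation step concrete, the following sketch shows how precision, recall, and accuracy are read off a binary confusion matrix. The labels in `y_true` and `y_hat` are made up for illustration and are not results from the Titanic model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up true labels and predictions for ten passengers.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_hat = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# Rows are actual classes, columns are predicted classes.
cm_demo = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm_demo.ravel()

precision = tp / (tp + fp)  # of the predicted survivors, how many really survived
recall = tp / (tp + fn)     # of the real survivors, how many were found
accuracy = (tp + tn) / cm_demo.sum()

print(cm_demo)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")

# The hand-computed values agree with scikit-learn's helpers.
assert np.isclose(precision, precision_score(y_true, y_hat))
assert np.isclose(recall, recall_score(y_true, y_hat))
```

`classification_report` prints these same quantities per class, together with the F1 score and the number of samples in each class.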

In [6]:
pipe = Pipeline([
    ('model', RandomForestClassifier(random_state=42))
])

# X_clean (returned by fit_transform above) already contains only the selected, encoded
# columns. X_cleaned holds the 14 pre-selection columns, so indexing it with the selected
# feature names raises a KeyError. This assumes X_clean and y_original share the same row order.
X_train, X_test, y_train, y_test = train_test_split(X_clean, y_original, random_state=42)
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print('Classification Report:\n', classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
class_labels = np.unique(y_original)
fig, ax = plt.subplots(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=class_labels, yticklabels=class_labels, ax=ax)
ax.set_xlabel('Predicted')
ax.set_ylabel('Actual')
plt.show()

## 5. Feature Importance and Visualization

Let's explore feature importances and relationships with three plots: a Sankey diagram, a bar plot, and a t-SNE map.

In [None]:
# Convert categorical features to numeric codes for visualization
# (note: .cat.codes encodes missing values as -1, not NaN)
for col in X_cleaned.select_dtypes(include=['category']).columns:
    X_cleaned[col] = X_cleaned[col].cat.codes

# Fill remaining missing values in numeric columns with the column mean
X_cleaned = X_cleaned.fillna(X_cleaned.mean(numeric_only=True))
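
One detail of the cell above is easy to miss: `.cat.codes` represents a missing category as `-1` rather than `NaN`, so a later `fillna` does not touch those entries. The sketch below demonstrates this on a small made-up series and shows one possible way to handle it; whether `X_cleaned` still contains missing categories at this point depends on the cleaning pipeline.

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing entries.
deck = pd.Series(["C", np.nan, "E", np.nan, "C"], dtype="category")

codes = deck.cat.codes
print(codes.tolist())      # missing entries show up as -1
print(codes.isna().sum())  # so fillna() would find nothing to fill

# One option: convert to float and mask the -1 codes back to NaN before filling.
codes_with_nan = codes.astype("float64").mask(codes < 0)
filled = codes_with_nan.fillna(codes_with_nan.mean())
print(filled.tolist())
```

Filling category codes with a mean is only a convenience for plotting; for modelling, one-hot encoding or an explicit "missing" category is usually easier to interpret.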

In [None]:
engine = GraphAnalyzerEngine()
engine.analyze(X_cleaned, pd.DataFrame(y_enc))
store = engine.get_store()
label_map = dict(enumerate(meta_pipe.get_target_encoder().classes_))
relabel_targets_in_store(store, label_map)
fig_sankey = VisualizerFactory.make_sankey(store, show_feature_feature_links=False)
fig_sankey.show()

### Bar Plot of Feature Importances

The bar plot below shows the aggregated importance of each feature. The dashed line represents the threshold for random (noise) importance. Features above this line are considered informative.

In [None]:
fig_bar = VisualizerFactory.make_bar(store, show_threshold=True, show_noise=True)
fig_bar.show()
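
The idea of a noise threshold can be sketched with a plain random forest: add a column of pure random numbers, fit the model, and treat the importance of that column as the baseline. This is an assumption about the general technique, not a description of how `GraphAnalyzerEngine` computes its threshold. The data is synthetic and the column names (`signal`, `unrelated`, `noise_probe`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n_rows = 300

# Synthetic data: the target depends on `signal` only.
signal = rng.normal(size=n_rows)
X_probe = pd.DataFrame({
    "signal": signal,
    "unrelated": rng.normal(size=n_rows),
    "noise_probe": rng.normal(size=n_rows),  # deliberately random reference column
})
y_probe = (signal + 0.3 * rng.normal(size=n_rows) > 0).astype(int)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_probe, y_probe)

importance = pd.Series(model.feature_importances_, index=X_probe.columns)
noise_level = importance["noise_probe"]

# Features that score above the random column are candidates for being informative.
informative = [
    col for col in importance.index
    if col != "noise_probe" and importance[col] > noise_level
]
print(importance.round(3))
print("Above noise level:", informative)
```

A feature that scores close to the random column, as `unrelated` is expected to here, gives little evidence of carrying information about the target.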

In [None]:
# Plot t-SNE map for feature visualization
fig_tsne = VisualizerFactory.make_tsne(X_cleaned, y_enc,
                                       perplexity=5,
                                       random_state=10)
fig_tsne.show()
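
For reference, a t-SNE embedding can also be computed directly with scikit-learn. The sketch below uses two synthetic clusters rather than the Titanic features, and it is not the code behind `VisualizerFactory.make_tsne`. It reuses the `perplexity` and `random_state` values from the cell above.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(10)

# Two synthetic clusters in 5 dimensions (stand-in for a cleaned feature matrix).
cluster_a = rng.normal(loc=0.0, scale=1.0, size=(30, 5))
cluster_b = rng.normal(loc=6.0, scale=1.0, size=(30, 5))
X_points = np.vstack([cluster_a, cluster_b])
labels = np.array([0] * 30 + [1] * 30)

# perplexity roughly sets how many neighbours each point considers;
# it has to be smaller than the number of samples.
tsne = TSNE(n_components=2, perplexity=5, random_state=10)
embedding = tsne.fit_transform(X_points)

print(embedding.shape)  # one 2-D point per input row
```

The resulting two columns can be passed to a scatter plot colored by `labels`. Distances between far-apart groups in a t-SNE map are not reliable, so the plot is best read for local grouping only.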

## 6. Conclusion and Next Steps

Congratulations! You have completed a full machine learning workflow on the Titanic dataset:
- Data loading and exploration
- Preprocessing and cleaning
- Model training and evaluation
- Feature importance analysis and visualization

Feel free to experiment with different models, parameters, or datasets!