The intent here is to use the Kaggle intro courses, beginning with the Titanic survival prediction data, to build a machine learning guide. The structure will be layered: it starts with a simple, broad overview, followed by sections that go into greater depth in one area after another, followed by sections that go deeper still.

# Ch. 1 First Pass
Here we will go through building a simple machine learning model for the Kaggle Titanic learning competition.

### Set up the environment

In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

### Load the data

In [None]:
train = pd.read_csv(r"C:\Users\marac\Downloads\train.csv")
test = pd.read_csv(r"C:\Users\marac\Downloads\test.csv")

### Explore the data
We will skip this in our first pass, but cover good habits in a later chapter. 

### Data preparation
Clean the data and prepare it for machine learning algorithms. Again, in this chapter we will take this more or less for granted but cover strategies in more detail later.

In [None]:
# Combine datasets for preprocessing to ensure consistency
full_data = pd.concat([train.drop('Survived',axis=1),test])

# Extract and group titles
## This turns a unique identifier column (Name) into something we can test for correlation with
## survival. Many titles appear too infrequently to warrant their own category, so group them into "Rare"
full_data['Title'] = full_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

title_mapping = {
    "Mr": "Mr",
    "Miss": "Miss",
    "Mrs": "Mrs",
    "Master": "Master",
    "Dr": "Rare",
    "Rev": "Rare",
    "Col": "Rare",
    "Major": "Rare",
    "Mlle": "Miss",
    "Mme": "Mrs",
    "Don": "Rare",
    "Lady": "Rare",
    "Countess": "Rare",
    "Jonkheer": "Rare",
    "Sir": "Rare",
    "Capt": "Rare",
    "Ms": "Miss",
    "Dona": "Rare"
}
full_data['Title'] = full_data['Title'].map(title_mapping)

# Create family size feature
## siblings/spouses on ship + parents/children on ship + self
full_data['FamilySize'] = full_data['SibSp'] + full_data['Parch'] + 1

# Create is_alone feature
full_data['IsAlone'] = (full_data['FamilySize'] == 1).astype(int)

# Create age bands
## Binning the continuous variable Age lets us capture non-linear effects and reduce noise
full_data['AgeBand'] = pd.cut(full_data['Age'], 5)

# Flag whether a cabin is recorded (proxy for wealth/status)
full_data['CabinBool'] = (full_data['Cabin'].notnull()).astype(int)

# Create fare per person
full_data['FarePerPerson'] = full_data['Fare'] / (full_data['FamilySize'])

# Create age x class interaction feature
full_data['age_class'] = full_data['Age'] * full_data['Pclass']

# Split back to train/test
train_processed = full_data.iloc[:len(train)].copy()
test_processed = full_data.iloc[len(train):].copy()

# Recombine with survival data for training
train_processed.loc[:, 'Survived'] = train['Survived']

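To see what the title step does in isolation, here is a minimal sketch on a handful of hypothetical names written in the dataset's "Surname, Title. Given names" format. The names and the reduced `small_mapping` dictionary are assumptions made for illustration; only the regular expression is taken from the cell above.

```python
import pandas as pd

# Hypothetical names (not rows from the Titanic data)
names = pd.Series([
    "Smith, Mr. John",
    "Doe, Mlle. Jane",
    "Roe, Capt. Richard",
    "Poe, Master. Tom",
    "Hoe, Dr. Ann",
])

# Capture the run of letters that follows a space and ends at a period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)

# Reduced, illustrative mapping; "Dr" is left out on purpose
small_mapping = {"Mr": "Mr", "Mlle": "Miss", "Capt": "Rare", "Master": "Master"}

# Titles missing from the dictionary come back as NaN after .map()
grouped = titles.map(small_mapping)

print(pd.DataFrame({"name": names, "title": titles, "grouped": grouped}))
```

The last row shows why the mapping should cover every title that occurs: anything left out becomes a missing value, which the categorical imputer defined later would silently fill with the most frequent title.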
### Define models to test

In [None]:
# Define features to use
numerical_features = ['Age', 'Fare', 'FamilySize', 'FarePerPerson','age_class']
categorical_features = ['Pclass', 'Sex', 'Embarked', 'Title', 'CabinBool', 'IsAlone','AgeBand']

# Define preprocessing for numerical features
numerical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Define preprocessing for categorical features
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combine preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_features),
        ('cat', categorical_transformer, categorical_features)
    ])

# Define models to try
models = {
    'RandomForest': RandomForestClassifier(n_estimators=100, random_state=42),
    'GradientBoosting': GradientBoostingClassifier(random_state=42),
    'LogisticRegression': LogisticRegression(random_state=42, max_iter=1000)
}

# Select the features and target used for model evaluation
X = train_processed[numerical_features + categorical_features]
y = train_processed['Survived']

# Split data for model evaluation
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)


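To make the preprocessing concrete, here is a minimal sketch of the same two transformers applied to a hypothetical four-row frame. The toy values are assumptions for illustration, and so is `sparse_threshold=0`, which is set only to force a dense array that prints readably; neither is part of the pipeline above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value in each column
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0, 30.0],
    "Sex": pd.Series(["male", "female", np.nan, "female"], dtype=object),
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        # Numerical: fill gaps with the median, then standardize
        ("num", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="median")),
            ("scaler", StandardScaler()),
        ]), ["Age"]),
        # Categorical: fill gaps with the most frequent value, then one-hot encode
        ("cat", Pipeline(steps=[
            ("imputer", SimpleImputer(strategy="most_frequent")),
            ("onehot", OneHotEncoder(handle_unknown="ignore")),
        ]), ["Sex"]),
    ],
    sparse_threshold=0,  # assumption: always return a dense array
)

toy_out = toy_preprocessor.fit_transform(toy)

# One scaled Age column plus one column per Sex category
print(toy_out.shape)  # (4, 3)
print(toy_out.round(2))
```

The missing age is replaced by the median (30) before scaling, and the missing sex is replaced by the most frequent value ("female") before encoding, so no row is dropped.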
### Train and evaluate models

In [None]:
# Train and evaluate models
results = {}
for name, model in models.items():
    # Create and fit the full pipeline
    full_pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', model)
    ])
    
    # Fit the pipeline on training data
    full_pipeline.fit(X_train, y_train)
    
    # Make predictions
    y_pred = full_pipeline.predict(X_val)
    
    # Evaluate
    accuracy = accuracy_score(y_val, y_pred)
    results[name] = accuracy
    
    print(f"\n{name} Accuracy: {accuracy:.4f}")
    print(f"\nClassification Report for {name}:")
    print(classification_report(y_val, y_pred))
    
    # Cross-validation score
    cv_scores = cross_val_score(full_pipeline, X, y, cv=5, scoring='accuracy')
    print(f"Cross-validation scores: {cv_scores}")
    print(f"Mean CV accuracy: {cv_scores.mean():.4f}")

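The loop above reports two separate estimates: accuracy on the single hold-out split, and the mean of five cross-validation folds computed over all of `X` and `y` (including the hold-out rows). With an integer `cv` and a classifier, scikit-learn uses stratified folds. The sketch below spells that out on synthetic stand-in data; the `make_classification` data, the shuffled `StratifiedKFold`, and the `demo_` names are assumptions for illustration, not part of the Titanic workflow.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the Titanic set)
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

demo_pipeline = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

# Five stratified folds: each fold keeps roughly the same class balance
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# The pipeline is cloned and refit on each training portion, so the scaler
# is never fitted on the rows it is scored on
demo_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=folds, scoring="accuracy")

print(f"Fold accuracies: {demo_scores}")
print(f"Mean: {demo_scores.mean():.4f}, std: {demo_scores.std():.4f}")
```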
And that's it. To put it very simply, perhaps overly so, we have successfully trained a model to predict survival.