<a href="https://colab.research.google.com/github/Ellen-Gu/Quantitative-Analysis/blob/main/Advancements_in_Machine_Learning_Pipelines_Integrating_TensorFlow%2C_Scikit_Learn%2C_and_Ensemble_Techniques.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
pip install mlflow   # for experimenting with Databricks' MLflow module (see Appendix 2)

In [None]:
pip install scikeras[tensorflow]   # wraps a TensorFlow/Keras network as an sklearn-compatible classifier, so it can be blended into the sklearn pipeline alongside the other 8 classifiers

Import the modules up front, much as an R script loads its libraries at the top.

In [3]:
import numpy as np, pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, AdaBoostClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score,f1_score
from sklearn.ensemble import BaggingClassifier
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
#from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  #this has been migrated to the following
from scikeras.wrappers import KerasClassifier, KerasRegressor  #pip install scikeras[tensorflow]

# Suppress sklearn's FutureWarnings; use only after carefully checking the warnings on a first run
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module='sklearn')  # suppresses FutureWarning, not UserWarning
# warnings.resetwarnings()  # use this to reactivate the sklearn warnings


Recently I read some articles about ML ensembling methods. I think the hands-on book covers everything in those articles well. The hands-on book is one of my favorite books, and its chapter on ensembling illustrates sklearn's model ensembling well. One aspect that most ML books fail to mention is the financial viewpoint of model evaluation. That is well explained in another book (illustrating data analysis using both Python and R), which is the one I like most. Almost none of the books contain a real-world implementation part. Maybe that is only available in the real world, gained from hands-on experience.

To consolidate my learning, some time has to be spent on coding. Nowadays OpenAI can provide "building blocks" to relieve some of the programming work, but people still need to learn programming to communicate better with computers. I have no concerns about the mathematics and computing aspects of ML/DL, but coding well is another matter; it needs lots of practice. I spend a good portion of my time purely on learning to code, and I also often "talk" with OpenAI, aiming to integrate it into my toolbox to speed up my work in the ML/DL world. With new modules and libraries emerging all the time, people have to keep learning and find ways to speed up and extend the fields they can reach.

Fortunately, the Titanic data is available from OpenML, so I can happily load it directly through sklearn's fetch_openml. After fetching the data, it is routine work to do EDA, data cleaning, imputation, feature selection, and so on. That is a ton of work before the data can be used for modeling. The data work also sheds light on, and shapes, what the modeling can deliver.

Over 80% of the time is spent on data work (90%, in my view); that is the reality, and it has to be gone through. Here, though, I mainly want to practice with sklearn's pipelines (data pipeline and model pipeline), and I also want to test whether I can bring a TensorFlow neural network into sklearn's pipelines.

MLflow from Databricks is a side goal to practice on. Some days ago I played with H2O (a Java-based AutoML library), calling its API from both the R and Python interfaces. The tracking web page is the same for R and Python. It is nice and "auto", but it is too slow, and the models in it are limited. Nowadays, people in the ML/DL field have to adapt quickly to cutting-edge algorithms.

Recently I saw an article saying that an algorithm (named "RoPE") from a paper late last year has already been implemented in major GPT products. The key matrix used in RoPE looks to me like a wavelet transform. The simplest wavelet, corresponding to the Haar functions, has $\theta=45°$, while RoPE uses a more customized $\theta$. Signals in the brain are transferred by certain electric pulses, and a single neuron/cell can only perform a simple task. So I think a transformer that uses 2, or $e$, as the exponential base is biologically reasonable. More neurons can then form groups, layers, etc. to process complex unstructured information. So generally, more neurons in the layers, more "groups" of neurons (like GPUs), a simple (linear feeding) task for each neuron but a complex (nonlinear) task for groups of neurons (layers): those settings can often produce better results.

On a related note, I once authored a Bash script that proved highly effective. Beyond considering the tree-based structure of the Linux file system, what truly inspired me was nature and life science. Almost every living entity on Earth, from complex organisms to individual cells and even viruses, has a shell or film to separate its inner environment from the outside. In complex life forms, cells have membranes, organs have films, and on the very outside, skin or bone/scale and the like form the "firewall" that stabilizes each cell group's functions and better tolerates changes outside. Doesn't this concept closely resemble that of "Docker"?

In [None]:
# Fetch Titanic dataset from OpenML
titanic = fetch_openml("titanic", version=1, as_frame=True)
data = titanic.frame

# Drop columns
data.drop(['name', 'boat', 'body', 'home.dest'], axis=1, inplace=True)

# Define features and target
X = data.drop('survived', axis=1)
y = data['survived'].astype('int64')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Lists of column names
numeric_features = ['age', 'fare', 'sibsp', 'parch']
categorical_features = ['sex', 'cabin', 'embarked', 'ticket', 'pclass']
# Update the list of categorical features to exclude 'embarked' since we're handling it separately
categorical_features.remove('embarked')

Once the data is loaded, the next step is to define the data processing functions and classes. I appreciate the design of sklearn's pipeline, as it offers a clear and structured approach to consolidating findings from EDA. Prior to establishing the data pipeline, a significant amount of effort has already been invested in determining the optimal way to prepare the data for modeling.

In [4]:
# Numeric Transformer
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())])

# Categorical Transformer
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

from sklearn.base import BaseEstimator, TransformerMixin

class EmbarkedTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
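        # scikit-learn >= 1.2 names this argument sparse_output (the older name, sparse, was removed in 1.4)
        self.onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')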

    def fit(self, X, y=None):
        # Fill missing values and then fit the one-hot encoder
        X_filled = np.where(pd.isnull(X), 'S', X)
        self.onehot_encoder.fit(X_filled.reshape(-1, 1))
        return self

    def transform(self, X):
        # Fill missing values
        X_filled = np.where(pd.isnull(X), 'S', X)
        # Transform using the one-hot encoder
        X_onehot_encoded = self.onehot_encoder.transform(X_filled.reshape(-1, 1))
        return X_onehot_encoded

    def get_feature_names_out(self, input_features=None):
        return self.onehot_encoder.get_feature_names_out(input_features)


# Categorical Transformer for OneHotEncoding without prefix
onehot_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'))
])

class CustomLabelEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = LabelEncoder()

    def fit(self, X, y=None):
        # Convert to 1D array or Series
        X_1d = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else X.ravel()

        # Add a placeholder for unknown categories
        X_extended = np.concatenate([X_1d, ['__unknown__']])
        self.encoder.fit(X_extended)
        return self

    def transform(self, X):
        # Convert to 1D array or Series
        X_1d = X.iloc[:, 0] if isinstance(X, pd.DataFrame) else X.ravel()

        # Replace unseen categories with the placeholder
        X_transformed = pd.Series(X_1d).map(lambda x: x if x in self.encoder.classes_ else '__unknown__')

        return self.encoder.transform(X_transformed).reshape(-1, 1)

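# Full preprocessor. Columns not listed here (e.g. 'pclass') are dropped,
# because ColumnTransformer defaults to remainder='drop'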
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('onehot', onehot_transformer, ['sex']),
        ('cabin', CustomLabelEncoder(), ['cabin']),
        ('ticket', CustomLabelEncoder(), ['ticket']),
        ('embarked', EmbarkedTransformer(), ['embarked'])
    ])


Once the data preprocessing is complete, I frequently desire to inspect the results to ensure the transformers operate as anticipated. The subsequent steps are primarily for validation and verification. This challenge is akin to what I've encountered with shiny in both R and Python. Although I managed to get the Python shiny working, it can sometimes crash with frustrating errors. This is a limitation I perceive in sklearn's data pipeline. Perhaps my familiarity with these tools is still growing; with more practice, I hope these methodologies will seamlessly integrate into my workflow.

In [5]:
# 1. Transform X_train using preprocessor
_ = preprocessor.fit(X_train)
X_train_transformed = preprocessor.transform(X_train)
features_in=X_train_transformed.shape[1]
features_in

10

In [6]:
# 2. Extract transformed feature names from each transformer
# Numeric columns remain unchanged
numeric_columns = numeric_features

# One-hot encoded columns
onehot_col = preprocessor.named_transformers_['onehot'].named_steps['onehot'].get_feature_names_out(['sex'])

# For cabin and ticket, since they are label encoded, their column names remain the same
label_encoded_col = ['cabin', 'ticket']

# Embarked columns
embark_col = preprocessor.named_transformers_['embarked'].get_feature_names_out(['embarked'])

# 3. Construct the dataframe
all_columns = numeric_columns + list(onehot_col) + label_encoded_col + list(embark_col)
X_train_transformed_df = pd.DataFrame(X_train_transformed, columns=all_columns)

# Display the dataframe
print(X_train_transformed_df.head())

        age      fare     sibsp     parch  sex_male  cabin  ticket  \
0 -0.981390 -0.495582 -0.495964 -0.442432       1.0  162.0   435.0   
1  0.506426 -0.445269 -0.495964 -0.442432       1.0  162.0   628.0   
2 -0.903084  0.890700 -0.495964  1.795376       0.0  148.0     1.0   
3  1.367793  3.747624  0.456833 -0.442432       1.0   81.0   699.0   
4  0.000000  0.171035 -0.495964 -0.442432       1.0  162.0    50.0   

   embarked_C  embarked_Q  embarked_S  
0         0.0         0.0         1.0  
1         0.0         0.0         1.0  
2         0.0         0.0         1.0  
3         1.0         0.0         0.0  
4         0.0         0.0         1.0  


In [7]:
set(X_train.embarked)
z=EmbarkedTransformer()
z.fit(X_train.embarked.to_frame())
X_train_processed = z.transform(X_train.embarked.to_frame())
X_train_processed

array([[0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       ...,
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.]])

With the data successfully pipelined, the next step is to define the classifiers. These can either be those available in sklearn or custom-built to integrate with sklearn. TensorFlow is frequently used as a standalone module, but I'm hopeful that, with help from the most recent scikeras, it can be combined with other ML methods for production use. The real excitement begins now.

In [14]:
# Define the individual classifiers
# Define the Random Forest classifier with the given parameters
rf_params = {
    'n_estimators': 400,
    'max_depth': 5,
    'min_samples_leaf': 3,
    'max_features' : 'sqrt',
}
rf = RandomForestClassifier(random_state=None,**rf_params)

# Extra Trees Parameters
et_params = {
    'n_jobs': -1,
    'n_estimators':400,
    'max_depth': 5,
    'min_samples_leaf': 2,
}
et = ExtraTreesClassifier(random_state=None,**et_params)

# AdaBoost parameters
ada_params = {
    'n_estimators': 400,
    'learning_rate' : 0.65
}
ab = AdaBoostClassifier(random_state=None,**ada_params)

# Gradient Boosting parameters
gb_params = {
    'n_estimators': 400,
    'max_depth': 6,
}
gb = GradientBoostingClassifier(random_state=None,**gb_params)

nb = GaussianNB()  # Naive Bayes

# Support Vector Classifier; probability=True enables predict_proba, which soft voting needs
svc = SVC(C=1.0, kernel='rbf', gamma='scale', probability=True, random_state=None)

# Logistic Regression; max_iter is raised to help convergence
lr = LogisticRegression(max_iter=1000)

nn = MLPClassifier(hidden_layer_sizes=(512, 256, 128, 64, 32), max_iter=100000, alpha=0.0001,
                     solver='adam', random_state=None)    # Define the neural network

# Define the neural network model using TensorFlow's Keras API
def create_nn_model():
    model = Sequential([
        Dense(512, activation='relu', input_shape=(features_in,)),  # input shape is the number of features
        Dense(256, activation='relu'),
        Dense(128, activation='sigmoid'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')  # binary classification
    ])
    #model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
tfnn = KerasClassifier(
    create_nn_model,
    loss="binary_crossentropy",
    metrics=['accuracy'],
    epochs=300,
    verbose=0
    #hidden_layer_dim=100,
)

# List of classifiers for the voting classifier
classifiers = [
    ('rf', rf),
    ('et', et),
    ('ab', ab),
    ('gb', gb),
    ('nb', nb),
    ('svc', svc),
    ('lr', lr),
    ('nn', nn),
    ('tfnn', tfnn)
]

Now I have all 9 classifiers, including the TensorFlow NN classifier, together in a list of model shells. This is much like having the functions $y=f_i(x)$ defined analytically. The next step is just to feed data ($x$) into these $f_i(x)$ to get output ($y$), by sending the classifiers ($f_i(x)$) as parameters to a loop. When X_train and y_train are fed in as input, each $f_i(x)$ finally becomes well defined, as the constants and parameters in $f_i(x)$ all get their most insightful values from the input data. When X_test and y_test are fed to $f_i(x)$, we are using the insights gained from the training data to shed light on data the current model has not seen before. There is an assumption here: we suppose the unseen data and the training data come from similar settings, so the insights learned from the training data can be applied to the new, unseen data. When this assumption no longer holds, or has changed somewhat after some period of time, we have to collect data to recalibrate the model or even rebuild it completely.
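The "model shell" idea can be made concrete with a small sketch. It assumes synthetic data and a plain LogisticRegression rather than any of the nine classifiers above; X_toy, y_toy, and shell are hypothetical names. Before fitting, the estimator holds only hyperparameters; after fitting, the learned parameters (here coef_) exist and can be applied to held-out rows.

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression

# Synthetic data: the label depends linearly on the first two features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)

# The unfitted "shell": hyperparameters only, no learned parameters yet
shell = LogisticRegression(max_iter=1000)
fitted_before = hasattr(shell, "coef_")

# Fitting a clone leaves the shell untouched and fills in the parameters
model = clone(shell).fit(X_toy[:150], y_toy[:150])
fitted_after = hasattr(model, "coef_")

# Held-out rows come from the same distribution, so the learned f(x) transfers
holdout_accuracy = model.score(X_toy[150:], y_toy[150:])
print(fitted_before, fitted_after, holdout_accuracy)
```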

In [12]:
# Define a function to evaluate a model
def evaluate_model(model, X_train, y_train, X_test, y_test, model_name, prt=True):
    # Create a pipeline with the preprocessor and the classifier
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', model)])

    # Fit the model
    pipeline.fit(X_train, y_train)

    # Predict
    preds = pipeline.predict(X_test)  #for hard voting
    proba_preds = pipeline.predict_proba(X_test)  #for soft voting

    # Calculate accuracy and F1 score
    accuracy = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds)

    if prt:
        print(f"{model_name} Accuracy:", accuracy)
        print(f"{model_name} F1 Score:", f1)
        print("------")
    return accuracy, f1, preds,proba_preds

With the bulk of the preparatory work completed, we're ready to move on to the next phase: loading data into our models and evaluating their performance. While accuracy is a commonly used metric for classifiers, precision and recall often offer deeper insights, especially in imbalanced datasets or specific application contexts. For this reason, I'll compute both accuracy and the F1 score, which harmonizes precision and recall, for each classifier. Additionally, I'll save the predictions made on the test data for potential ensemble modeling in the subsequent steps.
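To see why accuracy and F1 can tell different stories, here is a small worked example in plain Python. The labels are made up for illustration, and precision_recall_f1 is a hypothetical helper, not part of sklearn. The classifier in this toy case predicts the positive class only once: accuracy still looks decent, while recall and F1 expose the missed positives.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Made-up labels: 4 positives, 6 negatives; only one positive is found
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

toy_accuracy = sum(t == p for t, p in zip(y_true_toy, y_pred_toy)) / len(y_true_toy)
toy_precision, toy_recall, toy_f1 = precision_recall_f1(y_true_toy, y_pred_toy)

print(toy_accuracy, toy_precision, toy_recall, toy_f1)
```

Here accuracy is 0.7 while F1 is 0.4, because recall is only 0.25.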

In [13]:
# Evaluate each model
predictions = {}
proba_predictions = {}
for name, model in classifiers:
    accuracy, f1, preds, proba_preds= evaluate_model(model, X_train, y_train, X_test, y_test, name)
    predictions[name] = preds
    proba_predictions[name] = proba_preds

rf Accuracy: 0.7900763358778626
rf F1 Score: 0.7317073170731707
------
et Accuracy: 0.767175572519084
et F1 Score: 0.7109004739336493
------
ab Accuracy: 0.5229007633587787
ab F1 Score: 0.617737003058104
------
gb Accuracy: 0.7824427480916031
gb F1 Score: 0.7135678391959799
------
nb Accuracy: 0.7213740458015268
nb F1 Score: 0.6217616580310881
------
svc Accuracy: 0.5763358778625954
svc F1 Score: 0.17777777777777776
------
lr Accuracy: 0.767175572519084
lr F1 Score: 0.7081339712918661
------
nn Accuracy: 0.6450381679389313
nn F1 Score: 0.39215686274509803
------
tfnn Accuracy: 0.7595419847328244
tfnn F1 Score: 0.7069767441860464
------


Model ensembling offers various techniques such as voting, bagging, and stacking. While ensemble methods often deliver superior results, there's no absolute assurance of this outcome. Individual models can benefit from hyperparameter tuning. Commonly, grid search is employed for this purpose, but it can be time-intensive and cumbersome. Generally, performance improvements follow a growth curve—rapid gains initially, with diminishing returns as the model matures.

During my studies on simulation, bagging, boosting, and particularly methods like MCMC, I was deeply fascinated. However, my professor cautioned me that these aren't silver bullets for every problem. For instance, in bootstrapping, order statistics don't always respond well, especially for extreme values. Thus, when dealing with rare events or outliers, bootstrapping might not be the most effective strategy.
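The point about extreme values can be checked with a short simulation. This is a sketch on synthetic exponential data, with hypothetical variable names. A bootstrap resample draws only from the observed values, so its maximum can never exceed the sample maximum, and a resample of size n contains the observed maximum with probability 1 - (1 - 1/n)^n, which is roughly 0.63 for large n. The bootstrap distribution of the maximum is therefore a poor stand-in for the true sampling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_boot = 100, 2000

# One observed sample from a continuous distribution
sample = rng.exponential(scale=1.0, size=n)
sample_max = sample.max()

# Bootstrap: resample with replacement and record the maximum each time
boot_max = np.array([
    rng.choice(sample, size=n, replace=True).max()
    for _ in range(n_boot)
])

never_exceeds = bool(np.all(boot_max <= sample_max))
share_at_sample_max = float(np.mean(boot_max == sample_max))
theoretical_share = 1 - (1 - 1 / n) ** n

print(never_exceeds, share_at_sample_max, theoretical_share)
```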

In [15]:
# Hard voting
stacked_preds = np.column_stack([predictions[name] for name, _ in classifiers])
from scipy.stats import mode
voting_preds = mode(stacked_preds, axis=1).mode.ravel()
voting_accuracy = accuracy_score(y_test, voting_preds)
print("Voting Accuracy:", voting_accuracy)
voting_f1 = f1_score(y_test, voting_preds)
print("Voting F1 Score:", voting_f1)

# Soft voting
stacked_proba_preds = np.dstack([proba_predictions[name] for name, _ in classifiers])
avg_proba = np.mean(stacked_proba_preds, axis=2)
voting_proba_preds = np.argmax(avg_proba, axis=1)
voting_accuracy = accuracy_score(y_test, voting_proba_preds)
print("Voting (Soft) Accuracy:", voting_accuracy)
voting_f1 = f1_score(y_test, voting_proba_preds)
print("Voting (Soft) F1 Score:", voting_f1)

Voting Accuracy: 0.7709923664122137
Voting F1 Score: 0.7087378640776699
Voting (Soft) Accuracy: 0.7633587786259542
Voting (Soft) F1 Score: 0.6804123711340205
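Hard and soft voting can disagree, which is one reason their scores differ. The toy example below uses made-up class-1 probabilities from three hypothetical classifiers, not the nine models above. Hard voting counts thresholded labels, while soft voting averages the probabilities first. For a binary problem, thresholding the mean class-1 probability at 0.5 matches taking the argmax over the averaged two-column probabilities.

```python
import numpy as np

# Made-up P(class 1) for four samples (rows) from three classifiers (columns)
proba_class1 = np.array([
    [0.45, 0.45, 0.95],  # two weak "no" votes, one confident "yes"
    [0.60, 0.70, 0.10],  # two mild "yes" votes, one confident "no"
    [0.20, 0.30, 0.40],
    [0.90, 0.80, 0.70],
])

# Hard voting: threshold each classifier, then take the majority label
hard_labels = (proba_class1 > 0.5).astype(int)
hard_vote = (hard_labels.sum(axis=1) > proba_class1.shape[1] / 2).astype(int)

# Soft voting: average the probabilities, then threshold once
soft_vote = (proba_class1.mean(axis=1) > 0.5).astype(int)

print("hard:", hard_vote)
print("soft:", soft_vote)
```

The first two rows flip between the two schemes: a single confident classifier can outweigh two lukewarm ones under soft voting.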


In [None]:
# classifiers.remove(('tfnn', tfnn))  # Bagging comes next. As a compatible classifier, tfnn can theoretically be included, but it takes a long time to train; uncomment this line to leave it out of bagging.

In [16]:
from sklearn.ensemble import BaggingClassifier
from scipy.stats import mode

def evaluate_bagged_model(base_model, X_train, y_train, X_test, y_test, model_name):
    # Create a pipeline with the preprocessor and the bagging classifier
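    # scikit-learn >= 1.2 names this argument estimator (the older name, base_estimator, was removed in 1.4)
    bagging_clf = BaggingClassifier(estimator=base_model, n_estimators=100, random_state=None)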
    pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                               ('classifier', bagging_clf)])

    # Fit the model
    pipeline.fit(X_train, y_train)

    # Predict
    preds = pipeline.predict(X_test)

    # Calculate accuracy and F1 score
    accuracy = accuracy_score(y_test, preds)
    f1 = f1_score(y_test, preds)

    print(f"Bagged {model_name} Accuracy:", accuracy)
    print(f"Bagged {model_name} F1 Score:", f1)
    print("------")

    return preds

# Collect predictions from each bagged model
all_preds = []

for name, model in classifiers:
    preds = evaluate_bagged_model(model, X_train, y_train, X_test, y_test, name)
    all_preds.append(preds)

# Transpose the list of predictions
all_preds_transposed = np.array(all_preds).T

# Hard voting: Use mode to get the most common prediction for each instance
voting_preds = mode(all_preds_transposed, axis=1).mode.ravel()

# Calculate accuracy and F1 score for the voting ensemble
voting_accuracy = accuracy_score(y_test, voting_preds)
voting_f1 = f1_score(y_test, voting_preds)

print("Voting Ensemble Accuracy:", voting_accuracy)
print("Voting Ensemble F1 Score:", voting_f1)


Bagged rf Accuracy: 0.7595419847328244
Bagged rf F1 Score: 0.6865671641791045
------
Bagged et Accuracy: 0.7709923664122137
Bagged et F1 Score: 0.7169811320754716
------
Bagged ab Accuracy: 0.6564885496183206
Bagged ab F1 Score: 0.6785714285714286
------
Bagged gb Accuracy: 0.7786259541984732
Bagged gb F1 Score: 0.707070707070707
------
Bagged nb Accuracy: 0.7213740458015268
Bagged nb F1 Score: 0.6217616580310881
------
Bagged svc Accuracy: 0.5801526717557252
Bagged svc F1 Score: 0.19117647058823528
------
Bagged lr Accuracy: 0.7633587786259542
Bagged lr F1 Score: 0.7047619047619047
------
Bagged nn Accuracy: 0.6450381679389313
Bagged nn F1 Score: 0.41509433962264153
------
Bagged tfnn Accuracy: 0.7595419847328244
Bagged tfnn F1 Score: 0.7014218009478673
------
Voting Ensemble Accuracy: 0.7557251908396947
Voting Ensemble F1 Score: 0.6831683168316832
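For intuition about what the bagging step does, here is a simplified hand-rolled sketch, assuming synthetic data from make_classification and shallow decision trees as the base model. It is not how BaggingClassifier is implemented internally (for instance, the library averages predicted probabilities when the base estimator provides them); it only shows the core idea of fitting each model on a bootstrap resample and combining the predictions by majority vote. All names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (illustration only)
X_syn, y_syn = make_classification(n_samples=400, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.25, random_state=0)

rng = np.random.default_rng(0)
n_models = 25  # odd, so a majority vote cannot tie
votes = np.zeros((n_models, len(X_te)), dtype=int)

for i in range(n_models):
    # Bootstrap: draw training rows with replacement
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(max_depth=3, random_state=i)
    tree.fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Majority vote across the bootstrap models
bagged_preds = (votes.mean(axis=0) > 0.5).astype(int)
bagged_accuracy = float((bagged_preds == y_te).mean())
print("Hand-rolled bagging accuracy:", bagged_accuracy)
```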


In [17]:
# This cell tests the TensorFlow NN classifier on its own as an sklearn classifier.
# As with the data pipeline, once components are integrated there has to be a way to test each one separately.

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
#from tensorflow.keras.wrappers.scikit_learn import KerasClassifier  #this has been migrated to the following
from scikeras.wrappers import KerasClassifier, KerasRegressor  #pip install scikeras[tensorflow]

# Process the data using the preprocessor pipeline
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)
#y_train = y_train.astype(np.int64)
#y_test = y_test.astype(np.int64)

# Define the neural network model using TensorFlow's Keras API
def create_nn_model():
    model = Sequential([
        Dense(512, activation='relu', input_shape=(X_train_processed.shape[1],)),  # input shape is the number of features
        Dense(256, activation='relu'),
        Dense(128, activation='sigmoid'),
        Dense(64, activation='relu'),
        Dense(32, activation='relu'),
        Dense(1, activation='sigmoid')  # binary classification
    ])
    #model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
    return model
clf = KerasClassifier(
    create_nn_model,
    loss="binary_crossentropy",
    metrics=['accuracy'],
    epochs=100,
    verbose=0
    #hidden_layer_dim=100,
)


# Wrap the model using KerasClassifier
tf_nn = clf
#tf_nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

tf_nn.fit(X_train_processed, y_train)

# 4. Evaluate the model
#loss, accuracy = tf_nn.fit(X_test_processed, y_test, verbose=0)

nn_preds = (tf_nn.predict(X_test_processed, verbose=0) > 0.5).astype("int32")
accuracy = accuracy_score(y_test, nn_preds)
print(f"NN Accuracy: {accuracy}")
nn_f1 = f1_score(y_test, nn_preds)
print(f"NN F1 Score: {nn_f1}")



NN Accuracy: 0.6221374045801527
NN F1 Score: 0.3443708609271523


Appendix 2: Initial Version for Practice Code Development, Data Steps, Single Classifier Testing, and MLflow Experimentation.

The data processing steps in this code are in a "test and try" mode. The goal is to refine these steps to align more closely with the pipeline conventions and best practices of scikit-learn, so that the process is more structured and clean (as in the sections above). (Note: the data treatment in this version may differ, as it is primarily for testing purposes.)

When employing cross-validation, it's acceptable to perform model comparisons, voting, etc., based on the training data rather than the testing data. However, to achieve a more comprehensive model performance evaluation, it might be beneficial to also consider the X_test, y_test, and predictions on the test set. Additionally, incorporating metrics related to precision and recall can provide deeper insights into model performance.

In [26]:
import numpy as np, pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import accuracy_score, f1_score  # accuracy_score is used in the MLflow cell below
from sklearn.model_selection import train_test_split, KFold, cross_val_score,cross_val_predict
from sklearn.ensemble import RandomForestClassifier,ExtraTreesClassifier,AdaBoostClassifier,GradientBoostingClassifier
import xgboost as xgb
from xgboost import XGBClassifier

# Fetch Titanic dataset from OpenML
titanic = fetch_openml("titanic", version=1, as_frame=True)
data = titanic.frame

# Preprocess the data
data = data.drop(['name','boat', 'body', 'home.dest'], axis=1)
df=data.copy()
df.isnull().sum(0)
df.iloc[:,8].isnull().sum()
df.head()
df.info()

data['sex'] = LabelEncoder().fit_transform(data['sex'].astype(str))
data['embarked'] = data['embarked'].fillna('S')
data['embarked'] = LabelEncoder().fit_transform(data['embarked'].astype(str))
data['age'] = data['age'].fillna(data['age'].mean())
data['fare'] = data['fare'].fillna(data['fare'].mean())

# For the 'ticket' and 'cabin' columns, fill missing values with a placeholder ('Unknown' in this case)
for col in ['ticket','cabin']:
    data[col] = data[col].fillna('Unknown')  # assign back; in-place fillna on a column selection is unreliable in recent pandas

# Encode 'ticket' and 'cabin' columns
label_encoders = {}  # to store label encoders for each column (useful for inverse transform later if needed)
for col in ['ticket','cabin']:
    le = LabelEncoder()
    data[col] = le.fit_transform(data[col])
    label_encoders[col] = le

#data=data.drop(['age','fare','embarked','cabin'],axis=1)
# Define features and target
X = data.drop('survived', axis=1)
y = data['survived'].astype(int)  #.astype('int64')

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype   
---  ------    --------------  -----   
 0   pclass    1309 non-null   float64 
 1   survived  1309 non-null   category
 2   sex       1309 non-null   category
 3   age       1046 non-null   float64 
 4   sibsp     1309 non-null   float64 
 5   parch     1309 non-null   float64 
 6   ticket    1309 non-null   object  
 7   fare      1308 non-null   float64 
 8   cabin     295 non-null    object  
 9   embarked  1307 non-null   category
dtypes: category(3), float64(5), object(2)
memory usage: 75.9+ KB


In [27]:
X.head()
X.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   pclass    1309 non-null   float64
 1   sex       1309 non-null   int64  
 2   age       1309 non-null   float64
 3   sibsp     1309 non-null   float64
 4   parch     1309 non-null   float64
 5   ticket    1309 non-null   int64  
 6   fare      1309 non-null   float64
 7   cabin     1309 non-null   int64  
 8   embarked  1309 non-null   int64  
dtypes: float64(5), int64(4)
memory usage: 92.2 KB


In [28]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Define the Random Forest classifier with the given parameters
rf_params = {
    'n_estimators': 400,
    'max_depth': 5,
    'min_samples_leaf': 3,
    'max_features' : 'sqrt',
}
clf = RandomForestClassifier(**rf_params)

# Perform 5-fold cross-validation
cv_scores = cross_val_score(clf, X, y, cv=5)

# Print the CV scores
print("5-fold CV scores:", cv_scores)
print("Average CV score with refined parameters:", cv_scores.mean())


5-fold CV scores: [0.51526718 0.83969466 0.64885496 0.71374046 0.64750958]
Average CV score with refined parameters: 0.6730133660904917
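The fold scores above range from about 0.52 to 0.84. One possible explanation, stated here as an assumption rather than something verified on this dataset, is row ordering: with cv=5 and a classifier, cross_val_score uses StratifiedKFold without shuffling, so each fold is taken from a contiguous stretch of rows within each class. If the rows are sorted in a meaningful way, the folds can look quite different from one another. A minimal sketch of passing an explicitly shuffled splitter, shown on synthetic data with hypothetical names:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for X, y (illustration only)
X_syn, y_syn = make_classification(n_samples=300, n_features=6, random_state=0)

# Shuffle rows before splitting; random_state makes the folds reproducible
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

clf_syn = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
shuffled_scores = cross_val_score(clf_syn, X_syn, y_syn, cv=shuffled_cv)

print("Shuffled 5-fold CV scores:", shuffled_scores)
print("Mean:", shuffled_scores.mean())
```

Passing the same splitter as cv to the Titanic call would show whether the spread narrows.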


In [30]:
import mlflow
import mlflow.sklearn

mlflow.set_experiment('Titanic - classification_model')
with mlflow.start_run():
    # Log average CV score as a metric
    mlflow.log_metric("average_cv_score", cv_scores.mean())

    # Train the model on the training split (to log the model artifact)
    clf.fit(X_train, y_train)

    # Log the model
    mlflow.sklearn.log_model(clf, "model")

    # Set tags for clarity
    mlflow.set_tag("framework", "scikit-learn")
    mlflow.set_tag("dataset", "Titanic")

    print("Model and metrics saved in run %s" % mlflow.active_run().info.run_id)  # run_id replaces the deprecated run_uuid
    # Calculate and log accuracy to MLFlow

    # Predictions
    predictions = clf.predict(X_test)

    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_metric('accuracy', accuracy)
    print('Accuracy: %f' % accuracy)

    # Calculate and log F1 score to MLFlow
    f1 = f1_score(y_test, predictions)
    mlflow.log_metric('f1', f1)
    print('F1 Score: %f' % f1)

Model and metrics saved in run 4cb4d6c17c484ed3981313b9e7d7a9b4
Accuracy: 0.790076
F1 Score: 0.702703
