In [None]:
import warnings
warnings.simplefilter("ignore", FutureWarning)

# Final project: Scikit-learn Pipeline example (with titanic data)
(source: https://github.com/scikit-learn/blog/blob/main/assets/notebooks/sklearn-pandas-df-output.ipynb)

This is our final project. The original pipeline demo comes from scikit-learn and uses the titanic data set. Here, we want to test that pipeline with a data frame generated by Hypothesis, one that mimics the titanic data. The advantage of generated data is that a single fixed data set (in this case, the titanic data) rarely exposes edge cases, so a pipeline that works on it is not guaranteed to work when new data comes in (not a concern for the titanic data itself, but a common one in general).

First, we will look at the original scikit-learn demo to understand the pipeline.

---

# Example: titanic dataset (with a Pipeline)

In [None]:
from sklearn import set_config
set_config(transform_output="pandas")

In [None]:
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

X, y = fetch_openml("titanic", version=1, as_frame=True, return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)

In [None]:
X_train

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

# For continuous variables, we first impute missing data with `SimpleImputer`
#      (check the documentation for the imputation method), then apply `StandardScaler`
# We use `OneHotEncoder` for categorical variables
# NOTE: we are using a subset of the features (not all the columns)

ct = make_column_transformer((make_pipeline(SimpleImputer(), 
                                            StandardScaler()), ["age", "fare"]),
                             (OneHotEncoder(sparse_output=False), ["embarked", "sex", "pclass"]),
                             verbose_feature_names_out=False)

# Note: click on pipeline elements to see more details
clf = make_pipeline(ct, LogisticRegression())
clf

In [None]:
clf.fit(X_train, y_train)
clf.score(X_test, y_test)

In [None]:
# Let's remove the last step in the pipeline (which is LogisticRegression()) & transform the X_test data

clf[:-1].transform(X_test)

---

Ok, now we want to test the pipeline `clf` (without the `LogisticRegression()` step) with our generated data frame. It may seem like a lot, but here are some hints and steps you can consider when approaching it.

1. Consider what values make sense in each column. For categorical columns, you may want to inspect the titanic data frame first.
2. Use `st.builds` to help generate categorical columns.
3. Don't worry too much about the free text columns like `name` and `home.dest`. They are not being transformed anyway, so we do not care if they are just random text.
4. Think about how you can test that the transformed data frame (output) is what we expect the transformer to produce.
5. You may need to write multiple tests to make sure the output is what is expected.

Now the floor is yours. There will be no more hand-holding from this point on, but feel free to ask questions or work in groups if you find it easier. Enjoy!