# Combine ColumnTransformer and Pipeline

- The first example is deliberately simple, so it is easy to see what is going on.
- The second example is similar, but the input data is randomly generated.
- The third example is copied from AnupamKhare (GitHub) and shows how real data is transformed and how the transformers can be nested.

## First Example: ChatGPT (adjusted) simple

In [7]:
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [34]:
# Example dataset with both numerical and categorical features.
# Note: by default OneHotEncoder raises an error when the test split contains a
# category it did not see during fit, so each category should appear more than once.
X = [[25, 'Male', 'Engineer'],
     [30, 'Female', 'Teacher'],
     [35, 'Male', 'Doctor'],
     [40, 'Female', 'Engineer'],
     [40, 'Female', 'Teacher'],
     [40, 'Female', 'Engineer']]

df = pd.DataFrame(X, columns=['age', 'gender', 'profession'])

df['y'] = [0, 1, 1, 0, 1, 0]

df

Unnamed: 0,age,gender,profession,y
0,25,Male,Engineer,0
1,30,Female,Teacher,1
2,35,Male,Doctor,1
3,40,Female,Engineer,0
4,40,Female,Teacher,1
5,40,Female,Engineer,0


In [35]:
X = df[['age', 'gender', 'profession']]
y = df['y']
y

0    0
1    1
2    1
3    0
4    1
5    0
Name: y, dtype: int64

In [36]:
# Define the column transformer
preprocessor = ColumnTransformer([
    ('numeric', StandardScaler(), ['age']),
    ('categorical', OneHotEncoder(), ['gender', 'profession'])
])

The example above only sets `transformers`, the first parameter of `ColumnTransformer`.\
If the data has more columns, for example a "salary" column, and we don't want to \
process them but keep them in their original form, we need to set the parameter \
**remainder='passthrough'**. The default is 'drop', so unless told otherwise the \
ColumnTransformer drops every column that is not listed in the transformer list.
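The effect of `remainder` can be checked on a small frame. The following is a minimal sketch; the `df_demo` frame and its "salary" values are made up for illustration and are not part of the example above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Hypothetical toy data with an extra "salary" column that no transformer handles
df_demo = pd.DataFrame({
    'age': [25, 30, 35, 40],
    'gender': ['Male', 'Female', 'Male', 'Female'],
    'salary': [3000, 4000, 5000, 6000],
})

transformers = [
    ('numeric', StandardScaler(), ['age']),
    ('categorical', OneHotEncoder(), ['gender']),
]

# Default remainder='drop': "salary" is discarded
ct_drop = ColumnTransformer(transformers)
out_drop = ct_drop.fit_transform(df_demo)

# remainder='passthrough': "salary" is appended unchanged after the transformed columns
ct_pass = ColumnTransformer(transformers, remainder='passthrough')
out_pass = ct_pass.fit_transform(df_demo)

print(out_drop.shape)  # (4, 3): scaled age + 2 one-hot gender columns
print(out_pass.shape)  # (4, 4): the same columns plus salary
```

The passed-through columns always come last in the output, regardless of their position in the input frame.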

***What if we want to pass through some columns and drop the rest?***

Each entry of the transformer list is a (name, transformer, columns) tuple, where the name can be chosen freely.\
To keep only some columns, add an entry with the string 'passthrough' in place of a transformer and leave remainder at its default 'drop': \
ColumnTransformer([(tuple1), (tuple2), ('keep', 'passthrough', ['A', 'B', 'F'])], remainder='drop')

Instead of listing the columns to keep with a 'passthrough' entry, we can also \
explicitly list the columns to drop. For that, we replace 'passthrough' with 'drop' in the entry and set remainder='passthrough', so that every column not listed is kept: \
('discard', 'drop', ['E', 'G', 'H'])
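Both variants can be sketched as follows. This assumes that every entry is a (name, transformer, columns) tuple, with the strings 'passthrough' and 'drop' standing in for a transformer. The frame and the column names A to D are placeholders chosen for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical frame; the column names are placeholders
df_demo = pd.DataFrame({
    'A': [1.0, 2.0, 3.0],
    'B': [10, 20, 30],
    'C': ['x', 'y', 'z'],
    'D': [0.1, 0.2, 0.3],
})

# Variant 1: name the columns to keep, drop everything else (remainder='drop' is the default)
keep_some = ColumnTransformer([
    ('scale', StandardScaler(), ['A']),
    ('keep', 'passthrough', ['B']),
], remainder='drop')

# Variant 2: name the columns to drop, keep everything else
drop_some = ColumnTransformer([
    ('scale', StandardScaler(), ['A']),
    ('discard', 'drop', ['C', 'D']),
], remainder='passthrough')

out_keep = keep_some.fit_transform(df_demo)
out_drop = drop_some.fit_transform(df_demo)

print(out_keep.shape)  # (3, 2): scaled A and untouched B
print(out_drop.shape)  # (3, 2): scaled A and untouched B
```

Which variant reads better depends mostly on whether the list of columns to keep or the list of columns to drop is shorter.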

In [37]:
# Define the pipeline with the column transformer and a classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [38]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [39]:
# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

In [40]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

Accuracy: 0.5
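
The note in the dataset cell about categories that occur only once can be reproduced in isolation. The sketch below assumes the cause is a category that ends up only in the test split; the tiny `train` and `test` frames are made up for illustration. `handle_unknown='ignore'` is one way to avoid the error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical split: 'Doctor' appears only in the test data
train = pd.DataFrame({'profession': ['Engineer', 'Teacher', 'Engineer']})
test = pd.DataFrame({'profession': ['Doctor']})

# Default handle_unknown='error': transform fails on the unseen category
strict = OneHotEncoder().fit(train)
try:
    strict.transform(test)
    raised = False
except ValueError:
    raised = True

# handle_unknown='ignore': the unseen category is encoded as all zeros
lenient = OneHotEncoder(handle_unknown='ignore').fit(train)
encoded = lenient.transform(test).toarray()

print(raised)   # True
print(encoded)  # [[0. 0.]]
```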


## Second Example: ChatGPT (not-adjusted) with randomly generated data

In [41]:
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Set random seed for reproducibility
np.random.seed(42)

# Create randomized data using NumPy
n_instances = 100

age = np.random.randint(18, 65, size=n_instances)
gender = np.random.choice(['Male', 'Female'], size=n_instances)
occupation = np.random.choice(['Engineer', 'Doctor', 'Teacher'], size=n_instances)
target = np.random.randint(0, 2, size=n_instances)

# Create a DataFrame
data = {'age': age, 'gender': gender, 'occupation': occupation, 'target': target}
df = pd.DataFrame(data)

# Separate features and target
X = df.drop('target', axis=1)
y = df['target']

# Define the column transformer
preprocessor = ColumnTransformer([
    ('numeric', StandardScaler(), ['age']),
    ('categorical', OneHotEncoder(), ['gender', 'occupation'])
])

# Define the pipeline with the column transformer and a classifier
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Predict on the test data
y_pred = pipeline.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)


Accuracy: 0.3


Unnamed: 0,age,gender,occupation,target
0,56,Female,Engineer,0
1,46,Female,Teacher,1
2,32,Male,Doctor,0
3,60,Male,Engineer,0
4,25,Male,Engineer,0


## Third Example: Titanic with nested pipeline

In [43]:
import numpy as np
import pandas as pd

from sklearn.compose import ColumnTransformer
from sklearn.datasets import fetch_openml
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder

### Download the Titanic data and save it to a CSV file

In [45]:
# Load data from https://www.openml.org/d/40945
X = fetch_openml("titanic", version=1, as_frame=True)


In [48]:
X.keys()

dict_keys(['data', 'target', 'frame', 'categories', 'feature_names', 'target_names', 'DESCR', 'details', 'url'])

In [51]:
df_titanic = pd.DataFrame(X.data, columns=X.feature_names)
df_titanic['target'] = X.target
df_titanic.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,target
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO",1
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",1
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0
3,1.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",0
4,1.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0


In [52]:
df_titanic.to_csv('./data/titanic.csv', index=False)

### Load Data and Code

In [53]:
df = pd.read_csv('./data/titanic.csv')
df.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest,target
0,1.0,"Allen, Miss. Elisabeth Walton",female,29.0,0.0,0.0,24160,211.3375,B5,S,2.0,,"St Louis, MO",1
1,1.0,"Allison, Master. Hudson Trevor",male,0.9167,1.0,2.0,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON",1
2,1.0,"Allison, Miss. Helen Loraine",female,2.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0
3,1.0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1.0,2.0,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON",0
4,1.0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1.0,2.0,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON",0


In [59]:
X = df[df.columns[:-1]]
y = df[df.columns[-1:]]

Select categorical and continuous features

In [60]:
numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

In the previous examples we used a ColumnTransformer to preprocess the columns and then a Pipeline to fit a logistic regression on the result. \
Instead of writing one long ColumnTransformer, we put the numeric preprocessing steps into one pipeline and the categorical preprocessing steps into another.
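
A minimal sketch of this nesting is shown here. It assumes median imputation plus scaling for the numeric columns and most-frequent imputation plus one-hot encoding for the categorical ones; these step choices and the small `df_demo` stand-in frame are illustrative assumptions, not taken from the original code.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

numeric_features = ['age', 'fare']
categorical_features = ['embarked', 'sex', 'pclass']

# Small hypothetical stand-in for the Titanic frame, including missing values
df_demo = pd.DataFrame({
    'age': [29.0, np.nan, 2.0, 30.0],
    'fare': [211.3, 151.55, np.nan, 7.25],
    'embarked': ['S', 'C', np.nan, 'S'],
    'sex': ['female', 'male', 'female', 'male'],
    'pclass': [1.0, 1.0, 3.0, 3.0],
})

# Pipeline for numeric columns: fill gaps with the median, then standardize
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# Pipeline for categorical columns: fill gaps with the most frequent value, then one-hot encode
categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

# The ColumnTransformer routes each column group to its own pipeline
preprocessor = ColumnTransformer([
    ('numeric', numeric_pipeline, numeric_features),
    ('categorical', categorical_pipeline, categorical_features),
])

X_demo = preprocessor.fit_transform(df_demo)
print(X_demo.shape)  # (4, 8): 2 numeric columns + 2 + 2 + 2 one-hot columns
```

The resulting `preprocessor` can itself be used as the first step of another Pipeline, followed by a classifier, in the same way as in the first two examples.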