### Machine Learning Pipelines

A pipeline chains the steps of a machine learning workflow into a single sequence, where the output of each step becomes the input of the next.
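
As a minimal sketch of this chaining, the example below runs the same two steps (scaling, then logistic regression) once by hand and once as a pipeline. The synthetic data and the names `X_demo`, `y_demo`, and `demo_pipeline` are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny synthetic dataset (illustrative values only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

# Manual chaining: the scaler's output is passed to the classifier by hand
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_demo)
manual_model = LogisticRegression().fit(X_scaled, y_demo)
manual_pred = manual_model.predict(scaler.transform(X_demo))

# The same two steps expressed as a pipeline
demo_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
])
demo_pipeline.fit(X_demo, y_demo)
pipeline_pred = demo_pipeline.predict(X_demo)

# Compare the predictions of the two approaches
print(np.array_equal(manual_pred, pipeline_pred))
```

Calling `fit` on the pipeline fits and applies each transformer in order before fitting the final estimator, and `predict` reuses the fitted transformers, so the manual bookkeeping disappears.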

### Why Use a Machine Learning Pipeline?

* **Reproducibility:** By standardizing the entire process from data preprocessing to model training, pipelines run the same steps in the same order every time, so with fixed random seeds your results stay consistent across runs.

* **Efficiency:** You don’t need to reapply every transformation manually. Pipelines automate repetitive tasks, reducing errors and saving time.

* **Maintainability:** Pipelines allow you to structure your machine learning code in a more modular way. If you need to change or update one part (like changing a model or preprocessing step), you can do it in an isolated manner without affecting the whole workflow.

* **Model Evaluation:** Pipelines apply the transformations fitted on the training data to the test set, so the model is evaluated on data processed in exactly the same way as the training data, and no statistics from the test set leak into preprocessing.

* **Hyperparameter Tuning:** By including preprocessing steps in the pipeline, you can use tools like GridSearchCV to tune hyperparameters across both the model and the preprocessing steps.
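
The hyperparameter tuning point can be sketched as follows. Parameters of a pipeline step are addressed as `<step name>__<parameter>` in the grid. The synthetic data, the step names, and the grid values here are assumptions chosen for illustration, not recommended settings.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a learnable signal (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4))
y_demo = (X_demo[:, 0] - X_demo[:, 2] > 0).astype(int)

# Knock out roughly 5% of the values so the imputer has work to do
X_demo[rng.random(X_demo.shape) < 0.05] = np.nan

tuning_pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer()),
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000))
])

# One preprocessing parameter and one model parameter in the same grid
param_grid = {
    'imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10.0],
}

search = GridSearchCV(tuning_pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2f}")
```

Because the whole pipeline is refit inside each cross-validation fold, the imputer and scaler only ever see that fold's training portion.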

### Example of a Machine Learning Pipeline

* **Step 1: Create the Pipeline**

First, we need to build a pipeline that handles:

* Missing values (using SimpleImputer).
* Scaling of numerical features (using StandardScaler).
* Encoding of categorical features (using OneHotEncoder).
* Model training (using LogisticRegression).

In [24]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [25]:
np.random.seed(42)

In [26]:
data = {
    'Age': np.random.randint(20, 70, 1000), 
    'Salary': np.random.randint(30000, 150000, 1000),
    'Owner': np.random.choice(['First Owner', 'Second Owner', 'Third Owner'], 1000),
    'Gender': np.random.choice(['Male', 'Female'], 1000),
    'Purchased': np.random.choice([0, 1], 1000)
}

In [27]:
df = pd.DataFrame(data)

In [28]:
df.head()

| | Age | Salary | Owner | Gender | Purchased |
|---|---|---|---|---|---|
| 0 | 58 | 58024 | Third Owner | Female | 1 |
| 1 | 48 | 88656 | Third Owner | Male | 0 |
| 2 | 34 | 146148 | Third Owner | Female | 0 |
| 3 | 62 | 57285 | First Owner | Female | 0 |
| 4 | 27 | 122436 | Second Owner | Male | 1 |


In [29]:
df.shape

(1000, 5)

In [30]:
# Define X (features) and y (target)
X = df[['Age', 'Salary', 'Owner', 'Gender']]
y = df['Purchased']

In [31]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [32]:
X_train.shape

(800, 4)

In [33]:
# Define Preprocessing for Numeric and Categorical Features
numeric_features = ['Age', 'Salary']

numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),  # Impute missing values with mean
    ('scaler', StandardScaler())  # Scale numeric features
])

In [34]:
categorical_features = ['Owner', 'Gender']

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),  # Impute missing values with the most frequent value
    ('encoder', OneHotEncoder(drop='first'))  # One-hot encode categorical features, drop first to avoid multicollinearity
])

In [35]:
# Combine both numeric and categorical preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])
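
To see what a combined preprocessor of this kind produces, here is a minimal sketch on a small hypothetical frame with the same four columns and a few missing values. The frame `sample` and the name `demo_preprocessor` are assumptions for illustration; the transformers mirror the ones defined in the cells above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical six-row frame with some missing entries
sample = pd.DataFrame({
    'Age': [25, 40, np.nan, 58, 33, 47],
    'Salary': [40000, np.nan, 90000, 120000, 65000, 72000],
    'Owner': ['First Owner', 'Second Owner', 'Third Owner',
              'First Owner', np.nan, 'Second Owner'],
    'Gender': ['Male', 'Female', 'Female', np.nan, 'Male', 'Female'],
})

demo_preprocessor = ColumnTransformer(transformers=[
    # Numeric columns: fill gaps with the mean, then standardize
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ]), ['Age', 'Salary']),
    # Categorical columns: fill gaps with the mode, then one-hot encode
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(drop='first'))
    ]), ['Owner', 'Gender']),
])

transformed = demo_preprocessor.fit_transform(sample)
print(transformed.shape)
```

The result has six rows and five columns: two scaled numeric features, two indicator columns for the three `Owner` categories, and one for the two `Gender` categories, since `drop='first'` removes one indicator per feature.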

In [36]:
# Create a Full Pipeline: Preprocessing + Model
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression())
])

In [37]:
# Fit the Pipeline to the Training Data
pipeline.fit(X_train, y_train)

In [38]:
# Make Predictions on Test Data
y_pred = pipeline.predict(X_test)

In [39]:
# Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Test Accuracy: {accuracy:.2f}")

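Test Accuracy: 0.43

This is no better than chance, which is expected here: the Purchased labels were generated at random, independently of the features, so there is no pattern for the model to learn.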
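
An accuracy figure is easier to interpret next to a trivial baseline. The sketch below compares logistic regression with scikit-learn's `DummyClassifier`, which always predicts the most frequent training class. The synthetic data with randomly assigned labels and all variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Features and labels drawn independently, so there is nothing to learn
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(1000, 2))
y_demo = rng.integers(0, 2, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Baseline: always predict the most frequent class seen in training
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
baseline_accuracy = accuracy_score(y_te, baseline.predict(X_te))

# Actual model trained on the same split
model = LogisticRegression().fit(X_tr, y_tr)
model_accuracy = accuracy_score(y_te, model.predict(X_te))

print(f"Baseline accuracy: {baseline_accuracy:.2f}")
print(f"Model accuracy:    {model_accuracy:.2f}")
```

When the two numbers are about the same, the model has not extracted any usable signal from the features.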


* **Step 2: Save the Trained Model with joblib**

In [40]:
import joblib

# Save the Pipeline using joblib
joblib.dump(pipeline, 'car_purchase_pipeline_1000.pkl')
print("Model and pipeline saved!")

Model and pipeline saved!
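
A saved pipeline can later be restored with `joblib.load` and used for prediction without refitting. The following is a minimal, self-contained sketch of that round trip. It uses a small stand-in pipeline and a temporary file (`demo_pipeline.pkl`, a hypothetical name) rather than the file saved above.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline fitted on synthetic data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = (X_demo[:, 0] > 0).astype(int)

small_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression())
]).fit(X_demo, y_demo)

# Save to a temporary location, then load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'demo_pipeline.pkl')
    joblib.dump(small_pipeline, model_path)
    loaded_pipeline = joblib.load(model_path)

# The loaded pipeline carries its fitted preprocessing and model
same_predictions = np.array_equal(
    small_pipeline.predict(X_demo),
    loaded_pipeline.predict(X_demo)
)
print(same_predictions)
```

For the pipeline saved in this example, the equivalent call would be `joblib.load('car_purchase_pipeline_1000.pkl')`, followed by `predict` on a DataFrame with the same Age, Salary, Owner, and Gender columns. Only load pickle files from sources you trust, and preferably with the same scikit-learn version that was used to save them.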
